Marking quotes in texts

Quoted text is important to our analysis. When identifying authorial style in our corpus, we need to be able to distinguish between the writing of the reviewer and the reviewed. Hence, we need to exclude quoted text from our analysis easily and systematically.

Our corpus is marked up in TEI: Text Encoding Initiative XML. Using XML mark-up we can selectively extract text from the documents as required by our analysis. TEI offers several ways of marking up quoted text, as set out in the TEI Guidelines 3.3.3 Quotation.

Review of the guidelines immediately suggested two elements suitable for our purpose:

  • quote (quotation) contains a phrase or passage attributed by the narrator or author to some agency external to the text.
  • q (quoted) contains material which is distinguished from the surrounding text using quotation marks or a similar method, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used.

We could not choose between the two based on our needs.

One occasional need is for nested quotes. Each element can contain itself or the other element, so nested quotes can be marked up using either.

We need to distinguish between inline quotes, with the quoted text incorporated directly into the author’s own writing, and block quotes, with the quoted text existing as standalone blocks on the page. Neither element distinguishes between inline and block quotes. Yet we need to understand which is which and process accordingly, because quotes can affect stylistic metrics: sentence length, for example, is influenced by the presence or absence of inline quotes. Indeed, it might be that the use of inline quotes, say, is a stylistic quirk in its own right worth pursuing.

TEI does not have explicit inline and block quote elements, so we looked at two possibilities for distinguishing the quoted texts using the existing TEI elements:
– using both q and quote, arbitrarily assigning one to inline quotes and the other to block quotes
– or using quote with a rend attribute to indicate “inline” or “block” accordingly

We have adopted the latter approach because it enables us to mark up all quotes consistently using the same element. This simplifies the programmatic removal of all quotes, yet still provides us with a means of programmatically distinguishing between the two forms of quote.
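
As an illustration of that programmatic removal, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It is not our actual tooling: the function name strip_quotes and the sample fragment are hypothetical, and the sketch assumes every quote carries a rend value of "inline" or "block" as described above. It matches on the element's local name so that it works with or without the TEI namespace.

```python
import copy
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def strip_quotes(root, rend=None):
    """Return a copy of the tree with quote elements removed.

    With rend=None every quote is removed; with rend="inline" or
    rend="block" only quotes of that form are removed. The text that
    follows a removed quote (its tail) is kept, because it belongs to
    the reviewer rather than to the quoted work.
    """
    root = copy.deepcopy(root)  # leave the caller's tree untouched
    for parent in list(root.iter()):
        previous = None  # last child of this parent that was kept
        for child in list(parent):
            is_quote = local_name(child.tag) == "quote"
            if is_quote and (rend is None or child.get("rend") == rend):
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


# Hypothetical fragment modelled on the mark-up scheme described above.
sample = (
    '<div><p>alpha <quote rend="inline">INLINE</quote> beta</p>'
    '<quote rend="block">BLOCK</quote><p>gamma</p></div>'
)
tree = ET.fromstring(sample)
reviewer_only = "".join(strip_quotes(tree).itertext())
without_blocks = "".join(strip_quotes(tree, rend="block").itertext())
```

Because a nested quote sits inside its parent quote, removing the outer element removes the nested one with it, so no special handling of nesting is needed when all quotes are stripped.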

Further, the TEI definition for quote seems closer to our intended use than that given for q, and the TEI definition of the rend attribute (rendition – indicates how the element in question was rendered or presented in the source text) is a fair statement of our intended use of it too.

Here is an example of an inline quote penned by Hazlitt:

There cannot, in our opinion, be a greater mistake than to consider Don Quixote as a merely satirical work, or an attempt to explode, by coarse raillery, <quote rend="inline"> the long forgotten order of chivalry.</quote> There could be no need to…

Here is an example of a block quote used by Scott:

…[Cid's body] was supported in an upright state by a thin frame of wood; and the whole being made fast to a right noble saddle, this retinue prepared to leave Valencia.</p>
<quote rend="block">'When it was midnight they took the body of the Cid, fastened to the saddle as it was…
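
Given mark-up like the two examples above, the rend attribute also lets us measure how much of a review is quoted in each form. The following is a rough sketch only, again in Python with xml.etree.ElementTree; the function name quote_word_counts and the sample fragment are hypothetical, and splitting on whitespace is an assumed, deliberately crude definition of a word.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def quote_word_counts(root):
    """Count words of quoted text per rend value ("inline", "block").

    Only outermost quotes are tallied, so the words of a nested quote
    are counted once, as part of the quote that contains it.
    """
    counts = Counter()

    def walk(element):
        for child in element:
            if local_name(child.tag) == "quote":
                words = "".join(child.itertext()).split()
                counts[child.get("rend", "unspecified")] += len(words)
            else:
                walk(child)  # keep looking for quotes further down

    walk(root)
    return counts


# Hypothetical fragment: one inline quote, one block quote with a nested quote.
sample = (
    '<div><p>Reviewer text <quote rend="inline">three quoted words</quote>'
    ' more.</p><quote rend="block">a block of '
    '<quote rend="inline">nested</quote> text</quote></div>'
)
counts = quote_word_counts(ET.fromstring(sample))
```

A tally of this kind is one possible starting point for treating the use of inline quotes as a stylistic feature in its own right.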

Note that citations are not applicable in this context because our documents are reviews of another work and generally all quotes are from that one other work. This means that each quote does not carry an individual reference to that other work. Hence, we have not used the cit element (cited quotation – contains a quotation from some other document, together with a bibliographic reference to its source).

Nor are we interested in the punctuation marks used to highlight quoted text. Hence, we simply include the marks and text in full in our documents. We do not identify the punctuation marks with a hi element (highlighted – marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made), as suggested in the TEI Guidelines.