All posts by David King

RSVP corpus: Foscolo supplement, 2018

We are pleased to announce the publication of the Foscolo supplement to our corpus.

This supplementary corpus has been created to support a study into the contributions of Italian poet Ugo Foscolo to the Edinburgh Review and the Quarterly Review in 1818-1819. Foscolo broke with both reviews in 1820, allegedly because of the poor quality of the translations of his articles from Italian and French into English, executed by, among others, James Mackintosh, Francis Palgrave and Francis Jeffrey. Additionally, Foscolo objected to the cuts and interpolations made by the two journals, accusing them both of “tampering with articles”.

This follow-on study aimed to investigate whether Foscolo’s claims were a convenient fiction or a fact that can be tested empirically. The corpus is our raw data for assessing whether Digital Humanities methodologies, such as stylometry and corpus stylistics, can help confirm or refute Foscolo’s claims.

The Foscolo supplement contains all five articles written by Ugo Foscolo and published in either Review. There are three versions of each article: the source, dirty text; the corrected, clean text; and the fully curated and marked-up TEI XML version.

The corpus is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) Licence.

The corpus’ DOI is 10.21954/ou.rd.7472210, and the corpus may be downloaded from the A Question of Style project’s online data site.

Raw text corpus

To supplement our curated corpus of 85 articles drawn from the Edinburgh Review and Quarterly Review, we have published the raw texts from which the corpus was prepared.

Typically, the OCR process is imperfect, especially on older texts. This collection provides the uncorrected raw text to set against the project’s curated corpus; together, the two can be used to develop and evaluate new programmatic correction techniques.
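One way to evaluate a correction technique against the two corpora is to score text before and after correction against the curated version. The following is only a minimal sketch using Python's standard difflib module; the function names and the sample strings are hypothetical and are not taken from the corpus or from the project's own tooling.

```python
from difflib import SequenceMatcher


def similarity(candidate: str, reference: str) -> float:
    """Return a character-level similarity between 0 and 1."""
    return SequenceMatcher(None, candidate, reference, autojunk=False).ratio()


def score_correction(raw: str, corrected: str, curated: str) -> tuple[float, float]:
    """Score the raw text and a corrected version against the curated text."""
    return similarity(raw, curated), similarity(corrected, curated)


# Hypothetical sample strings, not taken from the corpus.
raw = "the qnick brown fox"
curated = "the quick brown fox"
corrected = raw.replace("qnick", "quick")  # stand-in for a real correction step

before, after = score_correction(raw, corrected, curated)
print(f"before: {before:.3f}, after: {after:.3f}")
```

A technique that helps should raise the score towards 1.0; one that introduces new errors will lower it.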

The raw text corpus’ DOI is 10.21954/ou.rd.7176377.
The curated corpus’ DOI is 10.21954/ou.rd.6850865.

They may be downloaded from the project’s online data site, and are freely available for reuse on a CC BY-SA 4.0 licence.

When m is m

A common OCR-induced error is to read in as m, as was the case in the original OCR text in Brougham_Carnot_Defense_ER_25_1815, line 347:

who would have preferred death to any place m Robespierre's Committee,—and, for

It would be nice to think that we could autocorrect all such examples of an isolated, lower-case m to in. However, there are real examples, such as this one taken from Barrow_Humboldt_American_Researches_QR_15_30_1816 describing the sounds and letters of Portuguese (or Portugueze as Barrow spells it), line 319:

alphabet, is, in ours, equivalent to sh, and that of m to ng ; so that,

Or this example, on a humming sound emanating from a waterfall as described in Southey_Lewis-Clarke_American_Travels_QR_12_24_1815, line 1728:

upon the letter m; for otherwise Timm looks as little like the

Full simplistic autocorrection is not the answer here…

Rather we need to consider context. However, to build up a library of suitable examples is beyond the scope of our current work, though this project does contribute to a library of curated contextual examples of the use of an isolated m in texts.

Even then, it is questionable how accurate such an automatic correction system could ever be, given the range of possible conditions in which an isolated m could be valid. Therefore, a semi-automated approach, in which the change is highlighted for manual review rather than made automatically, may well be preferable.

This approach is still beneficial, for the dumb computer will relentlessly identify all isolated ms in a text, whereas the intelligent human reviewer may well ‘autocorrect’ the m into in when reading the text, because that is what they expect to see, and so miss the need to correct the mistake.
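As a minimal sketch of such a semi-automated pass, the following Python flags every isolated lower-case m for manual review and changes nothing. The pattern, the function name and the choice to treat an apostrophe as part of a word are our assumptions for illustration, and the sample lines are shortened fragments of the three lines quoted above.

```python
import re

# A lower-case m with no letter or apostrophe on either side of it.
ISOLATED_M = re.compile(r"(?<![A-Za-z'’])m(?![A-Za-z'’])")


def flag_isolated_m(lines):
    """Yield (line_number, column, line) for each isolated m, for manual review."""
    for number, line in enumerate(lines, start=1):
        for match in ISOLATED_M.finditer(line):
            yield number, match.start(), line


# Shortened fragments of the corpus lines quoted above.
sample = [
    "to any place m Robespierre's Committee,",
    "equivalent to sh, and that of m to ng ; so that,",
    "upon the letter m; for otherwise Timm looks as little like the",
]

for number, column, line in flag_isolated_m(sample):
    print(f"line {number}, col {column}: {line}")
```

All three fragments are flagged, including the two where m is correct, which is exactly why the final decision is left to a human reviewer.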

On the shoulders of giants

At the heart of our project is a stylometric analysis of written texts. How appropriate, then, that one of the articles in our corpus is doing the same thing.

Brougham was trying to identify the author of the Junius letters, and conducted his own stylistic analysis.

From Brougham_Junius_Letters_ER_29_1817, lines 709-715:

<p>There are various peculiarities of spelling which occur uniformly in both writers; and
 neither of them has any such peculiarity that is not common to both. Thus, they both write
 'practise' with an s; 'compleatly' instead of 'completely;' 'ingross,' intire, intrust,
 and many other such words, which are usually begun with an e—endeavor without an u—skreen
 with a k, and several others. There may not be much in any of these instances taken singly;
 but when we find that all the peculiarities that belong to either writer are common to both,
 it is impossible not to receive them as ingredients in the mass of evidence.</p>

Real misspelling

One of the joys of working with the original text is its misspellings, such as this example in Brougham_Park_Journey_ER_24_1815, which refers to the “British Musuem”.

We have chosen to leave such mistakes uncorrected. They have the potential to be a signature, if not of an author, then of the typesetter for that publication.

Marking quotes in texts

Quoted text is important to our analysis. When identifying authorial style in our corpus, we need to be able to distinguish between the writing of the reviewer and the reviewed. Hence, we need to exclude quoted text from our analysis easily and systematically.

Our corpus is marked-up in TEI: Text Encoding Initiative XML. Using XML mark-up we can selectively extract text from the documents as required by our analysis. TEI offers several ways of marking-up quoted text, as set out in the TEI Guidelines 3.3.3 Quotation.

Review of the guidelines immediately suggested two elements suitable for our purpose:

  • quote (quotation) contains a phrase or passage attributed by the narrator or author to some agency external to the text.
  • q (quoted) contains material which is distinguished from the surrounding text using quotation marks or a similar method, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used.

We could not choose between the two based on our needs.

One occasional need is for nested quotes. Each element can contain itself or the other element so that it is possible to mark-up nested quotes using either.

We need to distinguish between inline quotes, with the quoted text incorporated directly into the author’s own writing, and block quotes, with the quoted text existing as stand-alone blocks on the page. Neither element distinguishes between inline and block quotes. Yet we need to understand which is which and process accordingly, for quotes can affect stylistic metrics, such as sentence length being influenced by the presence or absence of inline quotes. Indeed, it might be that the use of inline quotes, say, is a stylistic quirk in its own right worth pursuing.

TEI does not have explicit inline and block quote elements, so we looked at two possibilities for distinguishing the quoted texts using the existing TEI elements:
– using both q and quote, arbitrarily assigning one to inline quotes and the other to block quotes
– or using quote with a rend attribute to indicate “inline” or “block” accordingly

We have adopted the latter approach because it enables us to consistently mark-up all quotes using the same element. This can simplify the programmatic removal of all quotes, yet it still provides us with a means of programmatically distinguishing between the two forms of quote.

Further, the TEI definition for quote seems closer to our intended use than that given for q, and the TEI definition of the rend attribute (rendition – indicates how the element in question was rendered or presented in the source text) is a fair statement of our intended use of it too.
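To illustrate how a single element with a rend attribute keeps processing simple, here is a minimal sketch using Python's standard xml.etree.ElementTree. It is not the project's own code: the function name strip_quotes is hypothetical, and the sample is an abbreviated, un-namespaced fragment, whereas real TEI files normally declare the TEI namespace (the tag check below allows for either).

```python
import xml.etree.ElementTree as ET


def strip_quotes(root, rend=None):
    """Remove quote elements in place, keeping the text that follows each one.

    With rend=None every quote is removed; otherwise only quotes whose rend
    attribute equals the given value ("inline" or "block") are removed.
    """
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            # Compare the local name so namespaced TEI tags also match.
            is_quote = child.tag.rsplit("}", 1)[-1] == "quote"
            if is_quote and rend in (None, child.get("rend")):
                # ElementTree stores trailing text on the removed element,
                # so move it to the preceding sibling or to the parent.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return root


# Abbreviated, hypothetical fragment for illustration only.
sample = (
    "<body>"
    "<p>an attempt to explode, by coarse raillery, "
    '<quote rend="inline">the long forgotten order of chivalry.</quote>'
    " There could be no need to</p>"
    '<quote rend="block">When it was midnight they took the body of the Cid</quote>'
    "</body>"
)

reviewer_only = strip_quotes(ET.fromstring(sample))
print("".join(reviewer_only.itertext()))
```

Calling strip_quotes(root, rend="block") instead removes only the block quotes, leaving inline quotes in place for any analysis that wants to treat them as part of the reviewer's sentence.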

Here is an example of an inline quote penned by Hazlitt:

There cannot, in our opinion, be a greater mistake than to consider Don Quixote as a merely satirical work, or an attempt to explode, by coarse raillery, <quote rend="inline"> the long forgotten order of chivalry.</quote> There could be no need to…

Here is an example of a block quote used by Scott:

…[Cid's body] was supported in an upright state by a thin frame of wood; and the whole being made fast to a right noble saddle, this retinue prepared to leave Valencia.</p>
<quote rend="block">'When it was midnight they took the body of the Cid, fastened to the saddle as it was…

Note that citations are not applicable in this context because our documents are reviews of another work and generally all quotes are from that one other work. This means that individual quotes do not carry their own reference to that other work. Hence, we have not used the cit element (cited quotation – contains a quotation from some other document, together with a bibliographic reference to its source).

Nor are we interested in the punctuation marks used to highlight quoted text. Hence, we simply include the marks and text in full in our documents. We do not identify the punctuation marks with a hi element (highlighted – marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made) as suggested in the TEI guidelines.