
RSVP corpus: Foscolo supplement, 2018

We are pleased to announce the publication of the Foscolo supplement to our corpus.

This supplementary corpus has been created to support a study into the contributions of the Italian poet Ugo Foscolo to the Edinburgh Review and the Quarterly Review in 1818-1819. Foscolo broke with both reviews in 1820, allegedly because of the poor quality of the translations of his articles from Italian and French into English, executed by, among others, James Mackintosh, Francis Palgrave and Francis Jeffrey. Additionally, Foscolo objected to the cuts and interpolations the two journals made, accusing them both of “tampering with articles”.

This follow-on study aimed to investigate whether Foscolo’s claims were a convenient fiction or a fact that can be tested empirically. The corpus is our raw data for assessing whether Digital Humanities methodologies, such as stylometry and corpus stylistics, can help confirm or refute Foscolo’s claims.

The Foscolo supplement contains all five articles written by Ugo Foscolo and published in either Review. There are three versions of each article: the source, dirty text; the corrected, clean text; and the fully curated and marked-up TEI XML version.

The corpus is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) Licence.

The corpus’ DOI is 10.21954/ou.rd.7472210, and the corpus may be downloaded from the A Question of Style project’s online data site.

Raw text corpus

To supplement our curated corpus of 85 articles drawn from the Edinburgh Review and Quarterly Review, we have published the raw texts from which the corpus was prepared.

Typically, the OCR process is imperfect, especially on older texts. This collection provides the uncorrected raw text to set against the project’s curated corpus; together, the two can be used to develop and evaluate new programmatic correction techniques.
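As a minimal sketch of how the raw and curated texts might be set against each other, the Python below aligns one raw line with its corrected counterpart and lists the tokens that differ. It assumes the two versions have already been aligned line by line; the function name ocr_substitutions and the sample pair are illustrative, not part of the project's tooling.

```python
import difflib


def ocr_substitutions(raw_line, curated_line):
    """Return (raw, corrected) token pairs where the two lines differ."""
    raw_tokens = raw_line.split()
    curated_tokens = curated_line.split()
    matcher = difflib.SequenceMatcher(a=raw_tokens, b=curated_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep whatever the OCR produced alongside what the curator settled on.
            pairs.append((" ".join(raw_tokens[i1:i2]), " ".join(curated_tokens[j1:j2])))
    return pairs


# Illustrative pair: a raw OCR line and an assumed corrected reading of it.
raw = "who would have preferred death to any place m Robespierre's Committee"
curated = "who would have preferred death to any place in Robespierre's Committee"
print(ocr_substitutions(raw, curated))
```

Counting such pairs across the whole corpus would give a rough picture of which OCR confusions are most frequent, and a baseline against which a new correction technique could be scored.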

The raw text corpus’ DOI is 10.21954/ou.rd.7176377.
The curated corpus’ DOI is 10.21954/ou.rd.6850865.

They may be downloaded from the project’s online data site, and are freely available for reuse on a CC BY-SA 4.0 licence.

When m is m

A common OCR-induced error is to misread in as m, as was the case in the original OCR text of Brougham_Carnot_Defense_ER_25_1815, line 347:

who would have preferred death to any place m Robespierre's Committee,—and, for

It would be nice to think that we could autocorrect all such examples of an isolated, lower case m to in. However, there are real examples, such as this one taken from Barrow_Humboldt_American_Researches_QR_15_30_1816 describing the sounds and letters of Portuguese (or Portugueze as Barrow spells it), line 319:

alphabet, is, in ours, equivalent to sh, and that of m to ng ; so that,

Or this example, on a humming sound emanating from a waterfall as described in Southey_Lewis-Clarke_American_Travels_QR_12_24_1815, line 1728:

upon the letter m; for otherwise Timm looks as little like the

Full simplistic autocorrection is not the answer here…

Rather we need to consider context. However, to build up a library of suitable examples is beyond the scope of our current work, though this project does contribute to a library of curated contextual examples of the use of an isolated m in texts.

Even then, it is questionable how accurate such an automatic correction system could ever be, given the range of possible conditions in which an isolated m could be valid. Therefore, a semi-automated approach, in which the change is highlighted for manual review rather than made automatically, may well be preferable.

This approach is still beneficial, for the dumb computer will relentlessly identify every isolated m in a text, whereas the intelligent human reviewer may well ‘autocorrect’ the m into in, for example, when reading the text because that is what they expect to see, and so miss the need to correct the mistake.
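The semi-automated approach could be sketched along the following lines. This is a hypothetical illustration rather than the project's own tooling: the pattern ISOLATED_M and the function flag_isolated_m are names invented here, and the definition of "isolated" (no adjacent letter, digit or apostrophe) is an assumption. The script only reports candidates; it changes nothing.

```python
import re

# An isolated, lower-case m: no word character or apostrophe on either side.
ISOLATED_M = re.compile(r"(?<![\w'’])m(?![\w'’])")


def flag_isolated_m(lines):
    """Yield (line_number, column, line) for each isolated m, for manual review."""
    for number, line in enumerate(lines, start=1):
        for match in ISOLATED_M.finditer(line):
            yield number, match.start() + 1, line


# Sample lines taken from the three examples above (the first is shortened).
sample = [
    "who would have preferred death to any place m Robespierre's Committee",
    "alphabet, is, in ours, equivalent to sh, and that of m to ng ; so that,",
    "upon the letter m; for otherwise Timm looks as little like the",
]
hits = list(flag_isolated_m(sample))
for number, column, line in hits:
    print(f"line {number}, col {column}: {line}")
```

All three sample lines are flagged, including the two in which m is correct, which is exactly why the decision is left to the reviewer. The m characters inside Committee and Timm are not flagged.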

On the shoulders of giants

At the heart of our project is a stylometric analysis of written texts. How appropriate, then, that one of the articles in our corpus does the same thing.

Brougham was trying to identify the author of the Junius letters, and conducted his own stylistic analysis.

From Brougham_Junius_Letters_ER_29_1817, lines 709-715:

<p>There are various peculiarities of spelling which occur uniformly in both writers; and
 neither of them has any such peculiarity that is not common to both. Thus, they both write
 'practise' with an s; 'compleatly' instead of 'completely;' 'ingross,' intire, intrust,
 and many other such words, which are usually begun with an e—endeavor without an u—skreen
 with a k, and several others. There may not be much in any of these instances taken singly;
 but when we find that all the peculiarities that belong to either writer are common to both,
 it is impossible not to receive them as ingredients in the mass of evidence.</p>
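Brougham made his comparison by hand. A rough modern equivalent can be sketched as below, assuming we have two plain-text samples and a list of variant spellings to look for. The list is taken from the passage above; the function names and the two sample sentences are invented for illustration, and only exact word forms are matched.

```python
import re

# Variant spellings listed by Brougham in the passage above.
VARIANTS = ["practise", "compleatly", "ingross", "intire", "intrust", "endeavor", "skreen"]


def variants_used(text, variants=VARIANTS):
    """Return the set of listed variant spellings that occur in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {variant for variant in variants if variant in words}


def compare_writers(text_a, text_b):
    """Split the variants into those shared and those peculiar to one writer."""
    a, b = variants_used(text_a), variants_used(text_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Invented sample sentences, not drawn from the corpus.
writer_a = "He could not compleatly intrust them to practise it."
writer_b = "They practise it compleatly behind a skreen."
result = compare_writers(writer_a, writer_b)
print(result)
```

Brougham's argument corresponds to the case where both "only" sets are empty: every peculiarity found in either writer is common to both.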

Real misspelling

One of the joys of working with these texts is finding misspellings in the original, such as this example in Brougham_Park_Journey_ER_24_1815, which talks about the “British Musuem”.

We have chosen to leave such mistakes uncorrected. They have the potential to be a signature, if not of an author, then of the typesetter for that publication.

Marking quotes in texts

Quoted text is important to our analysis. When identifying authorial style in our corpus, we need to be able to distinguish between the writing of the reviewer and the reviewed. Hence, we need to exclude quoted text from our analysis easily and systematically.

Our corpus is marked up in TEI: Text Encoding Initiative XML. Using XML mark-up we can selectively extract text from the documents as required by our analysis. TEI offers several ways of marking up quoted text, as set out in the TEI Guidelines 3.3.3 Quotation.

Review of the guidelines immediately suggested two elements suitable for our purpose:

  • quote (quotation) contains a phrase or passage attributed by the narrator or author to some agency external to the text.
  • q (quoted) contains material which is distinguished from the surrounding text using quotation marks or a similar method, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used.

We could not choose between the two based on our needs.

One occasional need is for nested quotes. Each element can contain itself or the other element so that it is possible to mark-up nested quotes using either.

We need to distinguish between inline quotes, with the quoted text incorporated directly into the author’s own writing, and block quotes, with the quoted text existing as stand-alone blocks on the page. Neither element distinguishes between inline and block quotes. Yet we need to understand which is which and process accordingly, for quotes can affect stylistic metrics, such as sentence length being influenced by the presence or absence of inline quotes. Indeed, it might be that the use of inline quotes, say, is a stylistic quirk in its own right worth pursuing.

TEI does not have explicit inline and block quote elements, so we looked at two possibilities for distinguishing the quoted texts using the existing TEI elements:
– using both q and quote, and arbitrarily assigning one to inline quotes and the other to block quotes
– or using quote with a rend attribute to indicate “inline” or “block” accordingly

We have adopted the latter approach because it enables us to mark up all quotes consistently using the same element. This can simplify the programmatic removal of all quotes, yet it still provides us with a means of programmatically distinguishing between the two forms of quote.

Further, the TEI definition for quote seems closer to our intended use than that given for q, and the TEI definition of the rend attribute (rendition – indicates how the element in question was rendered or presented in the source text) is a fair statement of our intended use of it too.

Here is an example of an inline quote penned by Hazlitt:

There cannot, in our opinion, be a greater mistake than to consider Don Quixote as a merely satirical work, or an attempt to explode, by coarse raillery, <quote rend="inline"> the long forgotten order of chivalry.</quote> There could be no need to…

Here is an example of a block quote used by Scott:

…[Cid's body] was supported in an upright state by a thin frame of wood; and the whole being made fast to a right noble saddle, this retinue prepared to leave Valencia.</p>
<quote rend="block">'When it was midnight they took the body of the Cid, fastened to the saddle as it was…
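With mark-up of this kind, removing quoted text programmatically could look something like the sketch below, which uses Python's standard xml.etree.ElementTree module. It assumes the documents declare the standard TEI namespace; the function reviewer_text and the sample fragment are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
QUOTE = f"{{{TEI_NS}}}quote"


def reviewer_text(element, skip_rend=("inline", "block")):
    """Collect text from element, omitting quote elements whose rend is in skip_rend."""
    parts = [element.text or ""]
    for child in element:
        skipped = child.tag == QUOTE and child.get("rend") in skip_rend
        if not skipped:
            parts.append(reviewer_text(child, skip_rend))
        # The tail follows the child's closing tag, so it belongs to the
        # enclosing element and is kept even when the child itself is skipped.
        parts.append(child.tail or "")
    return "".join(parts)


# Invented fragment for illustration only.
sample = f"""<body xmlns="{TEI_NS}">
<p>The reviewer writes <quote rend="inline">'a quoted phrase'</quote> and goes on.</p>
<quote rend="block">'A quoted passage set apart from the text.'</quote>
<p>The reviewer resumes.</p>
</body>"""
root = ET.fromstring(sample)
no_quotes = " ".join(reviewer_text(root).split())
no_blocks = " ".join(reviewer_text(root, skip_rend=("block",)).split())
print(no_quotes)
print(no_blocks)
```

Because every quote uses the same element, one test on the tag removes them all, while the rend value allows, say, block quotes to be dropped and inline quotes kept when measuring sentence length.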

Note that citations are not applicable in this context because our documents are reviews of another work and generally all quotes are from that one other work. This means that each quote does not contain an individual reference to that other work. Hence, we have not used the cit element (cited quotation – contains a quotation from some other document, together with a bibliographic reference to its source).

Nor are we interested in the punctuation marks used to highlight quoted text. Hence, we simply include the marks and text in full in our documents. We do not identify the punctuation marks with a hi element (highlighted – marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made) as suggested in the TEI guidelines.

Stylometry and corpus stylistics: suggested readings

Style, by Dr Case from Flickr. CC BY-NC

Some members of the audience at yesterday’s seminar at The Open University asked us about the methods we are employing in our research. As an answer, we include here some of the key works that are inspiring our research.

Stylometry

Burrows, John. 2002. ‘“Delta”: a Measure of Stylistic Difference and a Guide to Likely Authorship’. Literary and Linguistic Computing 17 (3): 267–87. doi:10.1093/llc/17.3.267.

Burrows, John. 2007. ‘All the Way Through: Testing for Authorship in Different Frequency Strata’. Literary and Linguistic Computing 22 (1): 27–47. doi:10.1093/llc/fqi067.

Hoover, David L. 2002. ‘Frequent Word Sequences and Statistical Stylistics’. Literary and Linguistic Computing 17 (2): 157–80. doi:10.1093/llc/17.2.157.

Collaborative Authorship

Lang, Anouk. 2016. ‘Stylo and the Stevensons’. Anouk Lang. July 13. http://aelang.net/wordpress/2016/07/13/stylostevensons/.

Rybicki, Jan, David Hoover, and Mike Kestemont. 2014. ‘Collaborative Authorship: Conrad, Ford and Rolling Delta’. Literary and Linguistic Computing 29 (3): 422–31. doi:10.1093/llc/fqu016.

Stylochronometry

Van Hulle, Dirk, and Mike Kestemont. 2016. ‘Periodizing Samuel Beckett’s Works: A Stylochronometric Approach’. Style 50 (2): 172–202.

Corpus Stylistics

Mahlberg, Michaela. 2012. Corpus Stylistics and Dickens’s Fiction. Routledge.

Mahlberg, Michaela. 2007. ‘Clusters, Key Clusters and Local Textual Functions in Dickens’. Corpora 2 (1): 1–31. doi:10.3366/cor.2007.2.1.1.

Style and meaning

Pennebaker, James W. 2011. The Secret Life of Pronouns: What Our Words Say About Us. Bloomsbury Publishing USA.

Herrmann, J. Berenike, Karina van Dalen-Oskam, and Christof Schöch. 2015. ‘Revisiting Style, a Key Concept in Literary Studies’. Journal of Literary Theory 9 (1): 25–52. doi:10.1515/jlt-2015-0003.

Ongoing projects

The Riddle of Literary Quality, Huygens Institute
Stanford Literary Lab

Upcoming seminar, 9 February

Jennie Lee Building, Open University

We will be in the Jennie Lee Building presenting the results of our research so far at the Open University’s School of Computing and Communications Research Seminar series on Thursday 9th February.

Here is our description:

In a collaboration between FASS and STEM, we present our project, A Question of Style: individual voices and corporate identity in the Edinburgh Review, 1814-1820, which is funded by a Research Society for Victorian Periodicals Field Development Grant running from January 2017 to October 2017.

The Edinburgh Review was the main literary journal in early 19th-century Britain, including among its contributors some of the most prominent contemporary authors and politicians. Yet all its articles were published anonymously, their authority stemming exclusively from their presence in the Edinburgh and not from the name of their author. In 2016 we undertook a proof-of-concept project, employing methods from periodical studies, book history, computational linguistics and computational stylistics to assess the assumption that early nineteenth-century periodicals like the Edinburgh succeeded in creating, through a “transauthorial discourse”, a unified corporate voice that hid individual authors behind an impersonal public style (Klancher 1987).

We will discuss how we are now taking forward this work through “operationalising” our definition of style in order to select features that can be measured empirically (Moretti 2013) at the level of words and sentences, using methods such as term frequency-inverse document frequency (TF-IDF), Burrows’ Delta and Zeta methods, Moretti’s Most Distinctive Words Method, and Principal Component Analysis.

Finally, we will qualitatively describe the results of our preliminary stylistic analysis.

 

References:

Klancher, Jon P. The Making of English Reading Audiences, 1790-1832. University of Wisconsin Press, 1987.

Moretti, Franco. “‘Operationalizing’: or, the Function of Measurement in Modern Literary Theory.” Stanford Literary Lab, Pamphlet 6. Stanford Literary Lab, December 2013. http://litlab.stanford.edu/LiteraryLabPamphlet6.pdf