RSVP corpus: Foscolo supplement, 2018

We are pleased to announce the publication of the Foscolo supplement to our corpus.

This supplementary corpus has been created to support a study into the contributions of Italian poet Ugo Foscolo to the Edinburgh Review and the Quarterly Review in 1818-1819. Foscolo broke with both reviews in 1820, allegedly because of the poor quality of the translations of his articles from Italian and French into English, executed by, among others, James Mackintosh, Francis Palgrave and Francis Jeffrey. Additionally, Foscolo objected to the cuts and interpolations made by the two journals, accusing them both of “tampering with articles”.

This follow-on study aimed to investigate whether Foscolo’s claims were a convenient fiction or a fact that can be tested empirically. The corpus is our raw data for assessing whether Digital Humanities methodologies, such as stylometry and corpus stylistics, can help confirm or refute Foscolo’s claims.

The Foscolo supplement contains all five articles written by Ugo Foscolo and published in either Review. There are three versions of each article: the source, dirty text; the corrected, clean text; and the fully curated and marked-up TEI XML version.

The corpus is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) Licence.

The corpus’ DOI is 10.21954/ou.rd.7472210, and it may be downloaded from the A Question of Style project’s online data site.

Raw text corpus

To supplement our curated corpus of 85 articles drawn from the Edinburgh Review and Quarterly Review, we have published the raw texts from which the corpus was prepared.

Typically, the OCR process is imperfect, especially on older texts. This collection provides the uncorrected raw text to set against the project’s curated corpus. Together, the two can be used to develop and evaluate new programmatic correction techniques.
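As a rough illustration of how the raw and curated texts might be set against each other, here is a minimal Python sketch using the standard library's difflib. The function names are our own invention for this example, and the curated line is an assumed correction of a raw line quoted elsewhere on this blog (shortened here). It is not a description of the project's actual tooling.

```python
import difflib


def similarity(raw: str, curated: str) -> float:
    """Return a ratio between 0 and 1 describing how closely two strings match."""
    return difflib.SequenceMatcher(None, raw, curated, autojunk=False).ratio()


def differing_spans(raw: str, curated: str):
    """Yield (raw_fragment, curated_fragment) pairs wherever the texts differ."""
    matcher = difflib.SequenceMatcher(None, raw, curated, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield raw[i1:i2], curated[j1:j2]


# Raw OCR fragment, and an ASSUMED curated version of the same fragment.
raw_line = "who would have preferred death to any place m Robespierre's Committee"
curated_line = "who would have preferred death to any place in Robespierre's Committee"

corrections = list(differing_spans(raw_line, curated_line))
score = similarity(raw_line, curated_line)
```

Run over aligned pairs of raw and curated lines, a list of this kind could serve as test data for a correction technique: the technique succeeds where it proposes the same change as the human curator.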

The raw text corpus’ DOI is 10.21954/ou.rd.7176377.
The curated corpus’ DOI is 10.21954/ou.rd.6850865.

They may be downloaded from the project’s online data site, and are freely available for reuse on a CC BY-SA 4.0 licence.

When m is m

A common OCR-induced error is to misread in as m, as was the case in the original OCR text of Brougham_Carnot_Defense_ER_25_1815, line 347:

who would have preferred death to any place m Robespierre's Committee,—and, for

It would be nice to think that we could autocorrect all such examples of an isolated, lower-case m to in. However, there are genuine uses of an isolated m, such as this one taken from Barrow_Humboldt_American_Researches_QR_15_30_1816, describing the sounds and letters of Portuguese (or Portugueze as Barrow spells it), line 319:

alphabet, is, in ours, equivalent to sh, and that of m to ng ; so that,

Or this example, on a humming sound emanating from a waterfall as described in Southey_Lewis-Clarke_American_Travels_QR_12_24_1815, line 1728:

upon the letter m; for otherwise Timm looks as little like the

Simplistic, blanket autocorrection is not the answer here…

Rather, we need to consider context. Building up a library of suitable examples is beyond the scope of our current work, though this project does contribute curated contextual examples of the use of an isolated m in texts.

Even then, it is questionable how accurate such an automatic correction system could ever be, given the range of possible conditions in which an isolated m could be valid. Therefore, a semi-automated approach, in which the change is highlighted for manual review rather than made automatically, may well be preferable.

This approach is still beneficial, for the dumb computer will relentlessly identify every isolated m in a text, whereas the intelligent human reviewer may well ‘autocorrect’ the m into in when reading the text, because that is what they expect to see, and so miss the need to correct the mistake.
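A semi-automated check of this kind could look something like the following Python sketch. The regular expression and the function name are assumptions made for illustration, not the project's actual code. The script only reports each isolated m with its line and column, leaving the decision to a human reviewer. The first sample is shortened from the line quoted above, and the last is a made-up control containing no isolated m.

```python
import re

# A lower-case "m" with no letter, digit, apostrophe or hyphen on either side.
# Excluding apostrophes stops contractions such as "I'm" from being flagged.
ISOLATED_M = re.compile(r"(?<![\w'’-])m(?![\w'’-])")


def flag_isolated_m(lines, start=1):
    """Yield (line_number, column, line) for every isolated 'm' found."""
    for number, line in enumerate(lines, start=start):
        for match in ISOLATED_M.finditer(line):
            yield number, match.start(), line


samples = [
    "who would have preferred death to any place m Robespierre's Committee",
    "alphabet, is, in ours, equivalent to sh, and that of m to ng ; so that,",
    "upon the letter m; for otherwise Timm looks as little like the",
    "I'm sure the committee met them",  # invented control line
]

hits = list(flag_isolated_m(samples))
for number, column, line in hits:
    print(f"line {number}, column {column}: {line}")
```

All three examples from the corpus are flagged, the genuine uses as well as the OCR error, which is exactly why the final judgement is left to the reviewer.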

On the shoulders of giants

At the heart of our project is a stylometric analysis of written texts. How appropriate, then, that one of the articles in our corpus is doing the same thing.

Brougham was trying to identify the author of the Junius letters, and conducted his own stylistic analysis.

From Brougham_Junius_Letters_ER_29_1817, lines 709-715:

<p>There are various peculiarities of spelling which occur uniformly in both writers; and
 neither of them has any such peculiarity that is not common to both. Thus, they both write
 'practise' with an s; 'compleatly' instead of 'completely;' 'ingross,' intire, intrust,
 and many other such words, which are usually begun with an e—endeavor without an u—skreen
 with a k, and several others. There may not be much in any of these instances taken singly;
 but when we find that all the peculiarities that belong to either writer are common to both,
 it is impossible not to receive them as ingredients in the mass of evidence.</p>

Real misspelling

One of the joys of working with the original texts is their genuine misspellings, such as this example in Brougham_Park_Journey_ER_24_1815, which refers to the “British Musuem”.

We have chosen to leave such mistakes uncorrected. They have the potential to be a signature, if not of an author, then of the typesetter for that publication.

A Question of Style Corpus on ORDO

Today marks an important milestone in our A Question of Style project. Our corpus is available to download from ORDO, The Open University’s data repository. The graphs below give an overview of the composition of the corpus. Our presentation from the RSVP/VSAWC 2018 conference gives more details on the creation and processing of the Corpus. The Corpus is licensed under a Creative Commons Attribution Licence, so you are free to re-use it for your own research provided you acknowledge its provenance.

Number of Articles by Periodicals in A Question of Style Corpus by Francesca Benatti and David King.
Number of Words by Periodicals in A Question of Style Corpus by Francesca Benatti and David King. CC BY 4.0
Percentage of quotations in A Question of Style Corpus by Francesca Benatti and David King
Number of Non-Quotation Words by Periodicals in A Question of Style Corpus by Francesca Benatti and David King.

Conference roundup: RSVP/VSAWC 2018

We were honoured to be invited to present at the RSVP/VSAWC 2018 conference, which took place at the University of Victoria on 26-28 August 2018.

We presented a paper on the progress of our research, especially on the issues of OCR correction and treatment of quotations, as part of a panel on Voices on the Page on 26 August. You can view it on ORO, The Open University’s institutional repository.

The following day, we joined six other projects in the Digital Research Showcase, which displayed the breadth of digital research in the field of nineteenth-century periodicals. We hope to develop further links with these projects and thank the RSVP and VSAWC for their generous travel bursary.

Conference roundup: Galway and Dublin, February 2018

Francesca Benatti (right) with fellow panelists Derek Greene (UCD) and Karen Wade (UCD) at the Digital Cultures, Big Data and Society conference

We presented our research at two venues in Ireland last week.

On 14 February, we were at the Moore Institute in NUI Galway at the Digital Scholarship Seminar. We spoke to an audience of students and staff active in the area of Digital Humanities, drawn from the Moore Institute, the Insight Centre for Data Analytics and the School of Humanities.

On 16 February, we took part in the Digital Cultures, Big Data and Society conference, held at the Royal Irish Academy and the UCD Humanities Institute. Organised by the Irish Memory Studies Network, the conference focused on questions of close and distant reading and the critical functions of digital tools in the humanities. Our paper was well received, and we participated in a lively debate on how Humanities scholars can use digital tools to analyse data at a new scale but also subject digital data to critical scrutiny.

The conference culminated with the launch of the Industrial Memories project. Led by Prof Emilie Pine, the project represents a striking application of Digital Humanities methodologies such as text mining and data visualisation to enable analysis of the 2009 Ryan Report into child abuse at residential schools run by the Catholic Church between 1936 and 1999.

Conference round-up: DHSI Colloquium and SHARP 2017

We presented papers on our A Question of Style Project at two further events: the Digital Humanities Summer Institute (DHSI) Colloquium on 5 June 2017 and the conference of the Society for the History of Authorship, Reading and Publishing (SHARP) on 11 June 2017.

The first short paper was aimed at an audience of Digital Humanities specialists, gathered for the annual Digital Humanities Summer Institute at the University of Victoria (a subsequent blog post will examine our experience at DHSI and lessons learned from the excellent Out-of-the-Box Text Analysis course, taught by David Hoover). At the Colloquium, Francesca gave a succinct 5-minute presentation on our work-in-progress, focusing especially on our current work on post-OCR correction and TEI encoding.

The second paper was aimed at an audience of book historians and periodical studies specialists gathered for the annual SHARP conference, which this year focused on the theme of “Technologies of the Book” and was co-located with DHSI. This 20-minute presentation provided our reflections on the theoretical and methodological implications of the process we defined as “assisted close reading” (inspired by Anne Bandry-Scubbi’s article on the Chawton Novels Online corpus) for the study of authorship in the Edinburgh Review.

Both papers were well received and provoked numerous questions and suggestions, which we are gladly incorporating into our practice and reflection. In particular, the issue of untangling the influence of the editor, Francis Jeffrey, merits further reflection, which will be the subject of a future blog post.

In addition to presenting, we met several colleagues, old and new, and learned about exciting research being conducted in areas that are close to ours. We were particularly intrigued to discover from Julia Flanders’s plenary lecture that the Women Writers Project has noticed certain patterns in the use of quotations and pronouns that we are also observing in the course of our research.

Marking quotes in texts

Quoted text is important to our analysis. When identifying authorial style in our corpus, we need to be able to distinguish between the writing of the reviewer and the reviewed. Hence, we need to exclude quoted text from our analysis easily and systematically.

Our corpus is marked up in TEI (Text Encoding Initiative) XML. Using XML mark-up, we can selectively extract text from the documents as required by our analysis. TEI offers several ways of marking up quoted text, as set out in the TEI Guidelines, section 3.3.3 Quotation.

Review of the guidelines immediately suggested two elements suitable for our purpose:

  • quote (quotation) contains a phrase or passage attributed by the narrator or author to some agency external to the text.
  • q (quoted) contains material which is distinguished from the surrounding text using quotation marks or a similar method, for any one of a variety of reasons including, but not limited to: direct speech or thought, technical terms or jargon, authorial distance, quotations from elsewhere, and passages that are mentioned but not used.

We could not choose between the two based on our needs.

One occasional need is for nested quotes. Each element can contain itself or the other element so that it is possible to mark-up nested quotes using either.

We need to distinguish between inline quotes, with the quoted text incorporated directly into the author’s own writing, and block quotes, with the quoted text existing as stand-alone blocks on the page. Neither element distinguishes between inline and block quotes. Yet we need to understand which is which and process them accordingly, for quotes can affect stylistic metrics: sentence length, for example, is influenced by the presence or absence of inline quotes. Indeed, it might be that the use of inline quotes, say, is a stylistic quirk in its own right worth pursuing.

TEI does not have explicit inline and block quote elements, so we looked at two possibilities for distinguishing the quoted texts using the existing TEI elements:
– using both q and quote, arbitrarily assigning one to inline quotes and the other to block quotes
– or using quote with a rend attribute to indicate “inline” or “block” accordingly

We have adopted the latter approach because it enables us to mark up all quotes consistently using the same element. This can simplify the programmatic removal of all quotes, yet it still provides us with a means of programmatically distinguishing between the two forms of quote.

Further, the TEI definition for quote seems closer to our intended use than that given for q, and the TEI definition of the rend attribute (rendition – indicates how the element in question was rendered or presented in the source text) is a fair statement of our intended use of it too.

Here is an example of an inline quote penned by Hazlitt:

There cannot, in our opinion, be a greater mistake than to consider Don Quixote as a merely satirical work, or an attempt to explode, by coarse raillery, <quote rend="inline"> the long forgotten order of chivalry.</quote> There could be no need to…

Here is an example of a block quote used by Scott:

…[Cid's body] was supported in an upright state by a thin frame of wood; and the whole being made fast to a right noble saddle, this retinue prepared to leave Valencia.</p>
<quote rend="block">'When it was midnight they took the body of the Cid, fastened to the saddle as it was…

Note that citations are not applicable in this context because our documents are reviews of another work and generally all quotes are from that one other work. This means that each quote does not carry an individual reference to that other work. Hence, we have not used the cit element (cited quotation – contains a quotation from some other document, together with a bibliographic reference to its source).

Nor are we interested in the punctuation marks used to highlight quoted text. Hence, we simply include the marks and text in full in our documents. We do not identify the punctuation marks with a hi element (highlighted – marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made), as suggested in the TEI guidelines.

Individual voices and corporate identity in the Edinburgh Review, 1814-20
