When m is m

A common OCR-induced error is to misread in as m, as happened in the original OCR text of Brougham_Carnot_Defense_ER_25_1815, line 347:

who would have preferred death to any place m Robespierre's Committee,—and, for

It would be nice to think that we could autocorrect every isolated, lowercase m to in. However, there are genuine instances of an isolated m, such as this one from Barrow_Humboldt_American_Researches_QR_15_30_1816, describing the sounds and letters of Portuguese (or Portugueze, as Barrow spells it), line 319:

alphabet, is, in ours, equivalent to sh, and that of m to ng ; so that,

Or this example, on a humming sound emanating from a waterfall as described in Southey_Lewis-Clarke_American_Travels_QR_12_24_1815, line 1728:

upon the letter m; for otherwise Timm looks as little like the

Simplistic, blanket autocorrection is not the answer here…

Rather, we need to consider context. Building a complete library of suitable examples is beyond the scope of our current work, though this project does contribute curated contextual examples of the use of an isolated m in texts.

Even then, it is questionable how accurate such an automatic correction system could ever be, given the range of contexts in which an isolated m could be valid. A semi-automated approach, in which each candidate change is highlighted for manual review rather than made automatically, may therefore be preferable.

This approach is still beneficial: the dumb computer will relentlessly identify every isolated m in a text, whereas the intelligent human reviewer may well ‘autocorrect’ the m into in while reading, because that is what they expect to see, and so miss the need to correct the mistake.
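As a rough illustration of the semi-automated approach, the sketch below flags every isolated, lowercase m for review without changing the text. It is only a minimal sketch: the names `flag_isolated_m` and `highlight` are hypothetical, and the definition of "isolated" (no letter, digit, or apostrophe on either side) is our assumption rather than a rule used in the project. The sample lines are the three examples quoted above, the first shortened.

```python
import re

# Assumed definition of an "isolated m": a lowercase m with no letter,
# digit, or apostrophe (straight or curly) immediately before or after it.
ISOLATED_M = re.compile(r"(?<![\w'’])m(?![\w'’])")


def flag_isolated_m(lines):
    """Return (line_number, column, text) for each isolated m; the text is not changed."""
    hits = []
    for number, text in enumerate(lines, start=1):
        for match in ISOLATED_M.finditer(text):
            hits.append((number, match.start(), text))
    return hits


def highlight(text, column):
    """Bracket the candidate so a reviewer cannot read past it."""
    return text[:column] + "[m]" + text[column + 1:]


sample = [
    "who would have preferred death to any place m Robespierre's Committee",
    "alphabet, is, in ours, equivalent to sh, and that of m to ng ; so that,",
    "upon the letter m; for otherwise Timm looks as little like the",
]

for number, column, text in flag_isolated_m(sample):
    print(f"{number}: {highlight(text, column)}")
```

All three lines are flagged, including the two where m is correct, which is the point: the decision to change m to in stays with the reviewer.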

On the shoulders of giants

At the heart of our project is a stylometric analysis of written texts. How appropriate, then, that one of the articles in our corpus does the same thing.

Brougham was trying to identify the author of the Junius letters and conducted his own stylistic analysis.

From Brougham_Junius_Letters_ER_29_1817, lines 709-715:

<p>There are various peculiarities of spelling which occur uniformly in both writers; and
 neither of them has any such peculiarity that is not common to both. Thus, they both write
 'practise' with an s; 'compleatly' instead of 'completely;' 'ingross,' intire, intrust,
 and many other such words, which are usually begun with an e—endeavor without an u—skreen
 with a k, and several others. There may not be much in any of these instances taken singly;
 but when we find that all the peculiarities that belong to either writer are common to both,
 it is impossible not to receive them as ingredients in the mass of evidence.</p>
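Brougham's test can be loosely sketched as a set comparison: find which peculiar spellings each writer uses, then check whether any belongs to one writer alone. This is a minimal sketch, not our project's method. The word list is taken from the passage above, the function names `peculiarities` and `compare` are hypothetical, and the two sample sentences are invented placeholders rather than quotations from the corpus.

```python
import re

# Variant spellings listed by Brougham in the passage quoted above.
VARIANTS = {"practise", "compleatly", "ingross", "intire", "intrust", "endeavor", "skreen"}


def peculiarities(text, variants=VARIANTS):
    """Return the variant spellings that occur as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words & set(variants)


def compare(text_a, text_b):
    """Split the variants found into those shared and those used by one writer only."""
    a, b = peculiarities(text_a), peculiarities(text_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Invented placeholder sentences, NOT quotations from the corpus.
writer_a = "I intrust this to you intire, and shall endeavor to practise it."
writer_b = "They endeavor to practise what they intrust to us, intire."

result = compare(writer_a, writer_b)
print(sorted(result["shared"]), result["only_a"], result["only_b"])
```

Empty `only_a` and `only_b` sets correspond to Brougham's observation that neither writer has a peculiarity the other lacks. The sketch matches exact word forms only, so inflected forms such as 'intrusted' would be missed.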

Real misspelling

Part of the joy of working with these texts is finding misspellings in the original, such as this example in Brougham_Park_Journey_ER_24_1815, which refers to the “British Musuem”.

We have chosen to leave such mistakes uncorrected. They have the potential to be a signature, if not of an author, then of the typesetter for that publication.

A Question of Style Corpus on ORDO

Today marks an important milestone in our A Question of Style project: our corpus is available to download from ORDO, The Open University’s data repository. The graphs below give an overview of the composition of the corpus. Our presentation from the RSVP/VSAWC 2018 conference gives more details on the creation and processing of the Corpus. The Corpus is licensed under a Creative Commons Attribution Licence, so you are free to re-use it for your own research, provided you acknowledge its provenance.

Number of Articles by Periodicals in A Question of Style Corpus by Francesca Benatti and David King.
Number of Words by Periodicals in A Question of Style Corpus by Francesca Benatti and David King. CC BY 4.0
Percentage of quotations in A Question of Style Corpus by Francesca Benatti and David King
Number of Non-Quotation Words by Periodicals in A Question of Style Corpus by Francesca Benatti and David King.