OCR and spell checking

The quality of OCR in the texts we want to process with our scripts can be quite variable.

This example follows up on the definition of ‘quadroon’ in the Massachusetts Agricultural Journal. We cannot use the Google Books copy of the Journal because it has not been OCRed. However, several copies available in the Biodiversity Heritage Library have been. Hence, we used one of these copies[1] as a machine-readable version of volume 4.

The raw OCRed text is really quite good.

Of the truth of the principles which I am endeavouring to 
establish, there cannot be better or more irrefragable evidence 
than in the known effect of mixing different varieties of the 
human race. Thus, " a white man with a negro woman pro- 
duces a Mulatto, of a yellow blackish colour, with black short 
frizzled hair. A white man with a mulatto woman produces 
a Quadroon, of a lighter yellow than tlie former. A white man 
with a Quadroon woman produces a Mestizo. A white man with 
a Mestizo woman produces almost a perfect white, called a 
Quinteroon. This is the last gradation, their being no visible 
difference between the fa;r Quinteroons and the whites ; and 
the children of a white and Quinteroon consider themselves as 
free from all taint of the negro race,"*

The main OCR-induced errors are the fairly common misreadings of li for h, giving tlie for the, and of ; for i, giving fa;r for fair. These can be corrected automatically because neither tlie nor fa;r can be confused with a genuine English word.

There is one other minor OCR-induced error: a comma in place of a period at the end of the paragraph.
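
The two character-level corrections described above could be automated along the following lines. This is only a minimal sketch: the `OCR_FIXES` table and the `fix_ocr_errors` function are hypothetical names introduced for illustration, and the table holds just the two misreadings seen in this passage.

```python
import re

# Hypothetical correction table. Each key is an OCR misreading that is not
# itself an English word, so replacing it cannot damage a genuine word.
OCR_FIXES = {
    "tlie": "the",
    "fa;r": "fair",
}


def fix_ocr_errors(text: str, fixes: dict[str, str] = OCR_FIXES) -> str:
    """Replace known OCR misreadings wherever they appear as whole tokens."""
    for wrong, right in fixes.items():
        # The lookarounds ensure we match a whole token, not part of a longer word.
        pattern = r"(?<!\w)" + re.escape(wrong) + r"(?!\w)"
        text = re.sub(pattern, right, text)
    return text


sample = "of a lighter yellow than tlie former ... the fa;r Quinteroons"
print(fix_ocr_errors(sample))
```

The sketch is case-sensitive, so a sentence-initial Tlie would need its own entry in the table.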

Running the OCRed text through Hunspell, a widely used open-source spell checker incorporated into many other pieces of software such as LibreOffice, produces the following list of unrecognised terms:

  • duces
  • fa;r
  • irrefragable
  • negro
  • Quinteroon
  • Quinteroons
  • tlie

Of these, duces can be addressed automatically by rejoining words hyphenated across line breaks (pro- and duces become produces). Similarly, fa;r and tlie can be corrected automatically, as already noted. This leaves four unrecognised words. Irrefragable is an archaic word related to unquestionable; the other three (negro, Quinteroon and Quinteroons) are relevant to our work and, upon manual review, can be collated into a list of genuine search terms.
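
The rejoining and filtering steps above can be sketched as follows. Note the assumptions: this does not call Hunspell itself; the small `KNOWN` set is a hypothetical stand-in for a real dictionary, and `rejoin_hyphenated` and `unknown_words` are illustrative names of our own.

```python
import re


def rejoin_hyphenated(text: str) -> str:
    """Join words split across a line break, e.g. 'pro-' + 'duces' -> 'produces'."""
    # A hyphen, optional trailing spaces, a newline, optional leading spaces.
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


def unknown_words(text: str, known: set[str]) -> list[str]:
    """Return the distinct tokens whose lower-case form is not in `known`."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return sorted({t for t in tokens if t.lower() not in known})


# Stand-in for a real dictionary (an assumption, not Hunspell's word list).
KNOWN = {"a", "white", "man", "with", "woman", "produces", "mulatto"}

raw = "a white man with a negro woman pro- \nduces a Mulatto"
joined = rejoin_hyphenated(raw)
print(joined)
print(unknown_words(joined, KNOWN))
```

Two caveats: rejoining this way also merges genuine hyphenated compounds that happen to break at a line end, and the stand-in lookup ignores case, whereas a real spell checker may treat capitalisation as significant.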

Of note, mestizo, mulatto and quadroon are recognised by Hunspell’s default English dictionary.


1 @book{bhlitem82215,
    title = {Massachusetts agricultural journal. },
    volume = {v.4 1816-17},
    copyright = {Not provided. Contact Contributing Library to verify copyright status.},
    url = {https://www.biodiversitylibrary.org/item/82215},
    note = {https://www.biodiversitylibrary.org/bibliography/34787},
    publisher = {[Boston :Massachusetts Society for Promoting Agriculture,},
    author = {Massachusetts Society for Promoting Agriculture.},
    year = {1816},
    pages = {448},
    keywords = {Agriculture|Massachusetts|Periodicals|},
  }

Understanding our search terms and their context

To help us understand the contemporary use of our seed search terms, we used Google Ngram Viewer to explore historic, digitised texts[1], as can be seen below when searching on ‘quadroon’:

[Figure: Screenshot of results when searching for quadroon in Google Ngram Viewer]

Following up on the texts discovered by the search, we encountered a definition of ‘quadroon’ in the Massachusetts Agricultural Journal for 1816[2].

Interestingly, the topic arose in a letter on page 157, “On the crossing the breed of animals” by Dr Parry. Having discussed cross-breeding of horses and then of sheep, Dr Parry concludes his letter with:

Of the truth of the principles which I am endeavouring to
establish, there cannot be better or more irrefragable evidence
than in the known effect of mixing different varieties of the
human race. Thus, “ a white man with a negro woman pro-
duces a Mulatto, of a yellow blackish colour, with black short
frizzled hair. A white man with a mulatto woman produces
a Quadroon, of a lighter yellow than the former. A white man
with a Quadroon woman produces a Mestizo. A white man with
a Mestizo woman produces almost a perfect white, called a
Quinteroon. This is the last gradation, their being no visible
difference between the fair Quinteroons and the whites ; and
the children of a white and Quinteroon consider themselves as
free from all taint of the negro race.”

Precisely the converse of this fact takes place in the mixture
of white females with Negro males.

Quite an insight into the mindset of the time: not only is Dr Parry specific about the definitions of these ‘gradations’, but he uses them as terms he expects to be widely known, to illustrate the points he made earlier in his letter when discussing animals. In modern pedagogical parlance, he is scaffolding his readers, providing a known scaffold of human ‘cross-breeding’ on which they can construct the new knowledge of animal cross-breeding.

Returning to the immediate point of our research, this touches on a problem often found in history, especially social history: when something is common or well known, there is no need to document it, and so it is lost from the historic record. In this case, though, the very fact that the phenomenon was well known has led to its being recorded.


1 We appreciate the eclectic nature of Google Books, but argue the collection of texts can still form an effective starting point for getting a feel for the terms, if not providing a definitive, academically rigorous understanding of those terms.

2 @book{1816massachusetts,
    title={The Massachusetts Agricultural Repository and Journal},
    number={v. 4},
    url={https://books.google.co.uk/books?id=r45aAAAAYAAJ},
    year={1816}
  }

Spell checking for slang?

One approach to finding slang words in texts, or other potential clues to BAME presence, is to spell check the document and review the unknown words. This is arguably a benefit of working with modern tools, which are generally trained on modern, especially born-digital, texts. Older terms, which are relevant to us, may well be unknown to spell checkers trained on modern materials.

Using MS Word as an example, we see that our list of descriptive words produces four errors (bold words):

negro, negress, mulatto, quadroon, quarteroon, fustee, mustee, dusky, nabob, anglo-indian, swarthy, blackamoor, moor, african, torrid zone.

The first two errors are spelling issues, the latter two, capitalisation. This suggests that the approach might be worth further consideration, though it will depend on the quality of digitisation. Some historical source materials are simply scanned and OCRed, without further curation of the automatically generated text. Scanning and OCR are relatively quick, cheap and easy; curation is slow, expensive and hard. However, without this curation, the resulting text can be of poor quality, especially with historical texts, which not only suffer from the problems of ageing documents but also contain words that the OCR engine cannot recognise because they are not in its modern-trained spell checker!

 

Refining our list of descriptive search terms

To seed our search for hitherto unrecognised BAME presence, we produced a list of search terms in an early project meeting. Our first list was simply descriptive of the person or their origin:

negro, negress, mulatto, quadroon, quarteroon, fustee, mustee, dusky, nabob, anglo-indian, swarthy, blackamoor, moor, african, torrid zone.

A follow-up task was to see if we could refine or extend this list easily. However, we soon found, as was to be expected, that many tools and resources draw only on modern, especially born-digital, media. For example, neither The Online Slang Dictionary nor Green’s Dictionary of Slang includes archaic slang such as quadroon.

No doubt matters will improve over time, as more historical texts are digitised and harvested for their contents, with more tools trained on this older literature. But for our immediate needs… we’re not there yet.

Any refinement of our search terms will be through manual intervention, based on close reading of the results of previous searches.

Teasing out the unseen

How do we find the unseen?

Looking for hitherto unknown – or forgotten, hidden, obscured – BAME presence with a role in British politics in previous centuries relies on us being able to find some clue as to their identity.

The approach is the same whether done by a human or a computer: the search for indirect clues. As we are studying texts, those clues must be in the words themselves. What words?

We produced three lists of words to seed our search. The first list is simply descriptive of the person or their origin: negro, negress, mulatto, quadroon, quarteroon, fustee, mustee, dusky, nabob, anglo-indian, swarthy, blackamoor, moor, african, torrid zone.

The second list involved places, such as Antigua or Jamaica if focusing on the West Indies.

The third list involved ‘curious’ names such as Galgacus or Scipio, as were often given to illegitimate offspring.

Using these as our search terms, we can search large volumes of digitised text easily and retrieve the adjacent words or larger passages for close reading. We can refine and repeat the searches as we learn more about the source texts and the search terms.

Hence, the twin core tasks of this pilot project:

  • Can we access suitable digitised texts?
  • Can we identify suitable search terms?
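
To illustrate the search-and-retrieve step described above, here is a minimal keyword-in-context sketch. It is an assumption about how such a search might look, not a description of the scripts used in the project; the function name `keyword_in_context` and its `window` parameter are hypothetical.

```python
import re


def keyword_in_context(
    text: str, terms: list[str], window: int = 5
) -> list[tuple[str, str]]:
    """Return (term, snippet) pairs, each snippet holding the matched word
    plus up to `window` words on either side."""
    words = text.split()
    wanted = {t.lower() for t in terms}
    hits = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation so 'Quadroon,' matches 'quadroon'.
        bare = re.sub(r"^\W+|\W+$", "", word).lower()
        if bare in wanted:
            start = max(0, i - window)
            snippet = " ".join(words[start : i + window + 1])
            hits.append((bare, snippet))
    return hits


passage = (
    "A white man with a mulatto woman produces "
    "a Quadroon, of a lighter yellow than the former."
)
for term, snippet in keyword_in_context(passage, ["mulatto", "quadroon"], window=2):
    print(f"{term}: {snippet}")
```

Because the sketch matches one word at a time, a two-word term such as torrid zone would need separate handling, for example by searching the text for the whole phrase.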