OCR and spell checking

The quality of OCR in the texts we want to process with our scripts can be quite variable.

This example follows up on the definition of ‘quadroon’ in the Massachusetts Agricultural Journal. We cannot use the Google Books copy of the Journal because it has not been OCRed. However, several copies in the Biodiversity Heritage Library have been, so we used one of these[1] as a machine-readable version of volume 4.

The raw OCRed text is quite good:

Of the truth of the principles which I am endeavouring to 
establish, there cannot be better or more irrefragable evidence 
than in the known effect of mixing different varieties of the 
human race. Thus, " a white man with a negro woman pro- 
duces a Mulatto, of a yellow blackish colour, with black short 
frizzled hair. A white man with a mulatto woman produces 
a Quadroon, of a lighter yellow than tlie former. A white man 
with a Quadroon woman produces a Mestizo. A white man with 
a Mestizo woman produces almost a perfect white, called a 
Quinteroon. This is the last gradation, their being no visible 
difference between the fa;r Quinteroons and the whites ; and 
the children of a white and Quinteroon consider themselves as 
free from all taint of the negro race,"*

The only OCR-induced errors in the words themselves are the fairly common misreadings of li for h, hence tlie for the, and of ; for i, hence fa;r for fair. These can be corrected automatically because neither tlie nor fa;r is an English word, so the replacement cannot damage legitimate text.
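A minimal sketch of such an automatic correction in Python is given below. The table OCR_FIXES and the function fix_ocr_errors are hypothetical names introduced for illustration, and the table holds only the two misreadings seen in this passage; a real project would build its own list from the errors it actually encounters.

```python
import re

# Hypothetical correction table: each key is an OCR misreading that is not
# itself an English word, so replacing it cannot damage legitimate text.
OCR_FIXES = {
    "tlie": "the",
    "fa;r": "fair",
}


def fix_ocr_errors(text, fixes=OCR_FIXES):
    """Replace known OCR misreadings where they occur as whole tokens."""
    for wrong, right in fixes.items():
        # Lookarounds are used instead of \b because a key such as "fa;r"
        # contains punctuation.
        pattern = r"(?<!\w)" + re.escape(wrong) + r"(?!\w)"
        text = re.sub(pattern, right, text)
    return text


sample = "of a lighter yellow than tlie former ... the fa;r Quinteroons"
print(fix_ocr_errors(sample))
```

The match is case-sensitive and restricted to whole tokens, so a capitalised Tlie at the start of a sentence would need its own entry in the table.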

There is one other minor OCR-induced error: a comma in place of a period at the end of the paragraph.

Running the raw OCRed text through Hunspell, a widely used open-source spell checker incorporated into many other pieces of software such as LibreOffice, produces the following list of unrecognised terms:

  • duces
  • fa;r
  • irrefragable
  • negro
  • Quinteroon
  • Quinteroons
  • tlie

Of these, duces can be addressed automatically by rejoining words hyphenated across a line break (pro- duces becomes produces). Similarly, fa;r and tlie can be corrected automatically, as already noted. This leaves four unrecognised words. Irrefragable is an archaic word meaning roughly unquestionable; the other three are relevant to our work and, after manual review, can be collated into a list of genuine search terms.
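Rejoining hyphenated words could be sketched in Python as follows. The function name rejoin_hyphenated is hypothetical, and the sketch assumes that every hyphen at the end of a line marks a word broken by typesetting rather than a genuine compound.

```python
import re


def rejoin_hyphenated(text):
    """Join words that were split by a hyphen at the end of a line."""
    # A word fragment, a hyphen, optional trailing spaces, a line break,
    # then the continuation at the start of the next line.
    return re.sub(r"(\w+)-[ \t]*\n[ \t]*(\w+)", r"\1\2", text)


sample = "a white man with a negro woman pro- \nduces a Mulatto"
print(rejoin_hyphenated(sample))
```

Because of that assumption, a true compound that happens to break at a line end (well- followed by known on the next line, for instance) would lose its hyphen, so the output is still worth passing through the spell checker afterwards.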

Of note, mestizo, mulatto and quadroon are recognised by Hunspell’s default English dictionary.

