Automatic text recognition (OCR) for Solr or Elasticsearch

Automatic text recognition in images or scanned documents by Optical Character Recognition (OCR)

Text stored in image formats like JPG, PNG, TIFF or GIF (e.g. scans, photos or screenshots) cannot be found by standard full text search. This enhancer therefore enriches the metadata of images (such as filename, format and size) with the results of automatic text recognition, also called optical character recognition (OCR), using free open source software like Tesseract OCR.

Because much information is not searchable by full text search since it is embedded in PDF documents in graphical formats (e.g. scans or screenshots instead of text), the enhancer also extracts images from PDF files for automatic text recognition (OCR).

Enable OCR

(OCR is enabled by default in the virtual machine packages like Open Semantic Desktop Search or Open Semantic Search Appliance)

Install the package tesseract-ocr (included in your Linux distribution): apt-get install tesseract-ocr

If you enable OCR, you should also enable OCR for images inside PDF files, since many PDF files are scans and contain much of their text only as graphics:

OCR of images embedded within PDF documents

Add (uncomment) the PDF OCR plugin to enable OCR for images or scanned documents within PDF files:

`# Enable OCR for images inside PDF files
config['plugins'].append('enhance_pdf_ocr')`

How to optimize OCR settings to improve OCR results

You can improve OCR results, and so find more documents, in several ways, which can be combined:

Scanning resolution

If you scan documents yourself, scan or store the images at a higher resolution so that the OCR can analyse more detail of the characters.

Language of dictionary

Since OCR uses a language-specific dictionary, set the OCR language to your language, or to multiple languages if your documents use more than one.

Setting the OCR language to a language other than English:

1. Install the Tesseract language package (for German: tesseract-ocr-deu). See the list of available languages for Debian or Ubuntu.
2. Set the option ocr_lang to the language of your documents. The default is eng for English (Tesseract uses eng, not en). For German, set deu (not de):

`# language for automatic text recognition (OCR)

# English (default):
config['ocr_lang'] = "eng"

# or, for German:
config['ocr_lang'] = "deu"`

Or set the OCR language to multiple languages that are used in your documents:

`# language for automatic text recognition (OCR)
config['ocr_lang'] = "eng+deu"`
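As a minimal sketch (not part of Open Semantic Search itself), the value for config['ocr_lang'] can be built from a list of language codes and checked against the languages that Tesseract reports as installed. The helper names parse_installed_languages and build_ocr_lang are hypothetical, and the sample text assumes the usual output layout of tesseract --list-langs (one header line followed by one language code per line):

```python
def parse_installed_languages(list_langs_output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumption: the first line is a header starting with "List of",
    every other non-empty line is one language code.
    """
    codes = set()
    for line in list_langs_output.splitlines():
        line = line.strip()
        if line and not line.lower().startswith("list of"):
            codes.add(line)
    return codes


def build_ocr_lang(languages, installed):
    """Join language codes with '+' after checking that each one is installed."""
    missing = [code for code in languages if code not in installed]
    if missing:
        raise ValueError("Tesseract language packages not installed: " + ", ".join(missing))
    return "+".join(languages)


# Sample output (assumed layout) instead of calling Tesseract directly
sample_output = "List of available languages (3):\ndeu\neng\nosd\n"
installed = parse_installed_languages(sample_output)

config = {}
config['ocr_lang'] = build_ocr_lang(["eng", "deu"], installed)
print(config['ocr_lang'])  # prints: eng+deu
```

On a real system, the sample string could be replaced by the captured output of tesseract --list-langs, so that a missing language package is noticed before indexing starts.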

Additional custom OCR dictionary entries from Thesaurus and Ontologies

With the upcoming out-of-the-box integration with the Tesseract OCR user word list (custom dictionary), OCR of scanned documents also gets better at recognizing your concepts, words and the names of named entities such as organizations, places, locations or persons. This covers terms that are important enough to you that you added them to your thesaurus, and terms included in lists of names or ontologies (for example lists of names of relevant persons from internal metadata sources or from open data sources like Wikidata) that you defined for faceted search/interactive filters and/or analytics/aggregated overviews.

For this, your additional domain knowledge / vocabulary from thesaurus, lists and ontologies is used in addition to the OCR dictionary via the Tesseract option --user-words /etc/opensemanticsearch/ocr/dictionary.txt.
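To illustrate what the --user-words option looks like on a Tesseract command line, here is a minimal sketch that only assembles the arguments. How Open Semantic ETL invokes Tesseract internally is not shown here; the function name build_tesseract_command and the file name scan.png are assumptions for illustration:

```python
def build_tesseract_command(image_path, lang="eng", user_words=None):
    """Assemble a Tesseract command line that prints the recognized text to stdout."""
    command = ["tesseract", image_path, "stdout", "-l", lang]
    if user_words:
        # Additional word list that supplements the language dictionary
        command += ["--user-words", user_words]
    return command


command = build_tesseract_command(
    "scan.png",  # hypothetical input image
    lang="eng+deu",
    user_words="/etc/opensemanticsearch/ocr/dictionary.txt",
)
print(" ".join(command))
```

On a machine where Tesseract is installed, the resulting list could be passed to subprocess.run to perform the recognition.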

Since names in many scanned legacy paper files are written entirely in uppercase, this autogenerated custom OCR dictionary / OCR word list also includes the uppercase variant of each word.

You should therefore consider forcing a reindex of important files (so that they are analyzed and indexed again even if they are already in the index) after adding very important concepts or names to the thesaurus or ontologies.
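The idea of a word list with uppercase variants can be sketched as follows. This is not the actual generator used by Open Semantic Search; the function name build_ocr_wordlist, the splitting of labels into single words and the example labels are assumptions for illustration. A Tesseract user word list is a plain text file with one word per line:

```python
def build_ocr_wordlist(labels):
    """Return each word of the given labels plus its uppercase variant, without duplicates."""
    words = []
    seen = set()
    for label in labels:
        for word in label.split():
            for variant in (word, word.upper()):
                if variant not in seen:
                    seen.add(variant)
                    words.append(variant)
    return words


# Hypothetical thesaurus labels
labels = ["Angela Merkel", "NATO"]
wordlist = build_ocr_wordlist(labels)

# One word per line, as expected for a user word list file
dictionary_text = "\n".join(wordlist) + "\n"
print(dictionary_text)
```

Words that are already uppercase (like NATO here) are written only once, so the file does not grow with duplicate entries.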

Rotation and deskewing low quality scans before OCR

Many documents are scanned at an angle (skewed).

Additional deskewing of such low quality scans by Scantailor before OCR can improve the OCR results.

Install Scantailor: apt-get install scantailor

Enable the additional optimization with Scantailor before OCR by adding (uncommenting) the deskewing plugin in your ETL config /etc/opensemanticsearch/etl:

`config['plugins'].append('enhance_ocr_descew')`

In the default configuration, the deskewing plugin is disabled because it needs more time and CPU resources while indexing documents with images.

Combining OCR results of multiple OCR tools

No OCR engine is perfect.

So in some research projects we used, for example, ABBYY FineReader to OCR the images in PDFs in addition to the integrated open source OCR software Tesseract.

Each of them recognized words or names that the other missed. By combining and indexing both OCR results for the same document, we could find many more documents.

Therefore the Open Semantic ETL framework of Open Semantic Search can combine (unify) and index the results of multiple analysis or OCR tools, or of different OCR parameters, for the same document or image.
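Why combining results helps can be shown with a small sketch. It is not the implementation of the Open Semantic ETL framework; the function names, the engine names and the sample texts (with typical OCR confusions like "rn" for "m") are made up for illustration:

```python
import re


def tokenize(text):
    """Split a text into a set of lowercase words."""
    return set(re.findall(r"\w+", text.lower()))


def combine_ocr_results(results):
    """Union of all words recognized by any engine (results: engine name -> text)."""
    combined = set()
    for text in results.values():
        combined |= tokenize(text)
    return combined


def unique_to_engine(results, engine):
    """Words that only the given engine recognized."""
    others = set()
    for name, text in results.items():
        if name != engine:
            others |= tokenize(text)
    return tokenize(results[engine]) - others


# Hypothetical OCR outputs of two engines for the same scan
results = {
    "engine_a": "Invoice for Mr. Srnith",
    "engine_b": "lnvoice for Mr. Smith",
}

searchable_words = combine_ocr_results(results)
print("invoice" in searchable_words, "smith" in searchable_words)  # prints: True True
```

In this example a search for "invoice" only matches thanks to the first engine and a search for "smith" only thanks to the second, so indexing both texts for the same document makes it findable by either query.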

Train characters and fonts

You can train the OCR engine on the special fonts used in your documents to improve the machine learning model for recognizing the characters of these fonts.

How to manage OCR failures

Handle OCR errors by collaborative tagging and annotation

For single documents with OCR errors, you can add annotations or tags containing the words that the OCR engine recognized incorrectly, so the search engine can find the documents despite these OCR errors thanks to the correctly written tags or annotations.

Manage OCR errors in thesaurus (Hidden labels)

Manage common OCR errors for all existing and new documents with thesaurus entries for OCR errors (hidden labels).

The recommender can analyse the corpus for typos/OCR errors of a thesaurus entry and recommend such misspellings, which you can add to the thesaurus as hidden labels with one click.
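How such a recommendation could work in principle is sketched below. The method of the actual recommender is not described here; this example uses fuzzy string matching from the Python standard library (difflib) as a stand-in, and the function name, the similarity cutoff and the sample words are assumptions:

```python
import difflib


def recommend_hidden_labels(label, corpus_words, cutoff=0.8):
    """Suggest words from the corpus that look like misspellings of a thesaurus label."""
    # The correctly written label itself is not a misspelling
    candidates = sorted({word for word in corpus_words if word.lower() != label.lower()})
    return difflib.get_close_matches(label, candidates, n=10, cutoff=cutoff)


# Hypothetical words found in the indexed documents
corpus_words = ["Tesseract", "Tesseraot", "Tessaract", "house"]
suggestions = recommend_hidden_labels("Tesseract", corpus_words)
print(suggestions)
```

Each suggestion would still be reviewed by a person before it is added as a hidden label, since similar-looking words are not always OCR errors of the same term.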

More information about improving OCR quality

Improving the quality of the output of Tesseract OCR