Automatic text recognition in images or scanned documents by Optical Character Recognition (OCR)
Text stored in image formats such as JPG, PNG, TIFF or GIF (e.g. scans, photos or screenshots) cannot be found by standard full-text search. This enhancer therefore enriches the metadata of images (such as filename, format and size) with the results of automatic text recognition, also called optical character recognition (OCR), performed by free open source software such as Tesseract OCR.
Much information is not searchable by full-text search because it is stored in graphical formats embedded in PDF documents (e.g. scans or screenshots instead of text). The enhancer therefore also extracts images from PDF files for automatic text recognition (OCR).
(OCR is enabled by default in the virtual machine packages like Open Semantic Desktop Search or Open Semantic Search Appliance)
Install the package tesseract-ocr (included in your Linux distribution):
apt-get install tesseract-ocr
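To check from a script that the installation works, a minimal sketch like the following could be used. It assumes the tesseract binary is on the PATH. The function names are illustrative and not part of Open Semantic ETL.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list for a Tesseract run that prints text to stdout."""
    # Using "stdout" as the output base makes Tesseract print the recognized text
    # instead of writing it to a file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image file and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found; install the tesseract-ocr package")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling ocr_image with the path of a scanned image should then return its text, provided the language data for the chosen language is installed.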
If you enabled OCR, you should also enable OCR for images inside PDF files, since many PDF files are scans and contain much of their text only as graphics:
OCR of images embedded within PDF documents
Add (uncomment) the PDF OCR plugin to enable OCR for images or scanned documents within PDF files:
#Enable OCR for images inside PDF files
How to optimize OCR settings to improve OCR results
You can improve OCR results, and so find more documents, in several ways that can be combined:
If scanning documents yourself, scan or store the images with a higher resolution, so the OCR can analyse more details of the characters.
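As a rough check of whether a scan has enough resolution, the sketch below computes the effective resolution from the pixel width of the image and the width of the paper. The threshold of 300 DPI is an assumption (a commonly cited guideline for OCR), not a value taken from this documentation, and the function names are hypothetical.

```python
def effective_dpi(pixel_width, paper_width_inches):
    """Return the dots per inch of a scan with the given pixel width."""
    return pixel_width / paper_width_inches


def should_rescan(pixel_width, paper_width_inches, min_dpi=300):
    """Return True if the scan is below the assumed minimum resolution."""
    # min_dpi=300 is an assumed guideline, adjust it to your own experience.
    return effective_dpi(pixel_width, paper_width_inches) < min_dpi
```

For example, a page 8.5 inches wide that was stored 1275 pixels wide has only 150 DPI and would be flagged for rescanning.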
Language of dictionary
Since OCR uses a language-specific dictionary, set the OCR language to your language or to the multiple languages used in your documents.
Setting the OCR language to a language other than English:
- Install the Tesseract language package (for German: tesseract-ocr-deu). See the list of available languages for Debian or Ubuntu.
- Set the option ocr_lang to the language of your documents. The default is eng for English (in Tesseract it is eng, not en). For German, set deu (in Tesseract it is deu, not de):
# language for automatic text recognition (ocr)
#config['ocr_lang'] = "eng"
config['ocr_lang'] = "deu"
Or set the OCR language to multiple languages that are used in your documents:
# language for automatic text recognition (ocr)
config['ocr_lang'] = "eng+deu"
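Because the two-letter codes (en, de) are easy to confuse with the three-letter codes Tesseract expects, a small helper can build the value for ocr_lang. This is only a sketch: the helper name and the mapping table are hypothetical, and in the real config file the config dictionary already exists.

```python
# Hypothetical mapping from two-letter codes to Tesseract's three-letter codes.
TWO_LETTER_TO_TESSERACT = {"en": "eng", "de": "deu", "fr": "fra", "es": "spa"}


def ocr_lang_value(languages):
    """Join language codes with '+', translating known two-letter codes."""
    codes = []
    for language in languages:
        # Codes that are not in the table are passed through unchanged.
        code = TWO_LETTER_TO_TESSERACT.get(language, language)
        if code not in codes:
            codes.append(code)
    return "+".join(codes)


config = {}  # stands in for the config dictionary of the ETL config file
config['ocr_lang'] = ocr_lang_value(["en", "de"])
```

Each language in the resulting value still needs its Tesseract language package installed.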
Additional custom OCR dictionary entries from Thesaurus and Ontologies
Through the upcoming out-of-the-box integration with the Tesseract OCR user word list (custom dictionary), OCR of scanned documents also recognizes your own vocabulary better. This covers concepts, words and names of named entities such as organizations, places, locations or persons that are important enough to you that you added them to your thesaurus, as well as entries in lists of names or ontologies (for example lists of names of relevant persons from internal metadata sources or from open data sources like Wikidata) that you defined for faceted search / interactive filters or for analytics / aggregated overviews.
Your additional domain knowledge and vocabulary from thesaurus, lists and ontologies is therefore used in addition to the OCR dictionary via a Tesseract option.
Since names in many scanned legacy paper files are written entirely in uppercase, this autogenerated custom OCR dictionary / OCR word list also includes the uppercase variant of each word.
After adding very important concepts or names to the thesaurus or ontologies, you should therefore consider rebuilding your index or forcing a reindex of important files, so that they are analyzed and indexed again even if they are already in the index.
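The generation of such a word list with uppercase variants can be sketched as follows. This is an illustration, not the code used by Open Semantic ETL: the function names are hypothetical, and splitting multi-word labels into single words is an assumption made because a word list holds one word per line.

```python
def build_user_words(terms):
    """Return each word of each term plus its uppercase variant, without duplicates."""
    words = []
    seen = set()
    for term in terms:
        # Assumption: multi-word labels are split into single words.
        for word in term.split():
            for variant in (word, word.upper()):
                if variant not in seen:
                    seen.add(variant)
                    words.append(variant)
    return words


def write_user_words(terms, path):
    """Write the word list to a text file, one word per line."""
    with open(path, "w", encoding="utf-8") as wordlist:
        for word in build_user_words(terms):
            wordlist.write(word + "\n")
```

A label that is already uppercase, such as an acronym, is written only once.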
Rotation and deskewing low quality scans before OCR
Many documents are scanned askew.
Additional deskewing of such low quality scans by Scantailor before OCR can improve the OCR results.
apt-get install scantailor
Enable additional optimization with Scantailor before OCR by adding (uncommenting) the deskewing plugin in your ETL config
In the default configuration the deskewing plugin is disabled because it needs more time and CPU resources while indexing documents with images.
Combining OCR results of multiple OCR tools
No OCR engine is perfect.
For this reason, in some research projects we used, for example, ABBYY FineReader to OCR the images in PDFs in addition to the integrated open source OCR software Tesseract.
Each of them recognized words or names that the other missed. By combining and indexing both OCR results for the same document, we could find many more documents.
The Open Semantic ETL framework of Open Semantic Search can therefore combine or unify and index the results of multiple analysis or OCR tools, or of multiple OCR parameter sets, for the same document or image.
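Why combining results helps can be shown with a small sketch that unifies the words found by several OCR runs of the same document. The function names and the sample strings are made up for illustration; how the framework actually stores the combined results is not shown here.

```python
import re


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def combine_ocr_results(*ocr_texts):
    """Return all distinct words found by any OCR run, in first-seen order."""
    combined = []
    seen = set()
    for text in ocr_texts:
        for token in tokens(text):
            if token not in seen:
                seen.add(token)
                combined.append(token)
    return combined


# Made-up outputs of two engines for the same scan, each with a different error.
engine_a = "Invoice for Mu11er"
engine_b = "lnvoice for Müller"
searchable_words = combine_ocr_results(engine_a, engine_b)
```

In this example a search for "invoice" matches only because of the first engine and a search for "Müller" only because of the second, so indexing the union finds the document in both cases.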
Train characters and fonts
You can train the OCR engine on the special fonts used in your documents to improve the machine learning model's recognition of characters in these fonts.
How to manage OCR failures
Handle OCR errors by collaborative tagging and annotation
For single documents with OCR errors, you can add annotations or tags containing the words that the OCR engine recognized incorrectly. Because the tags or annotations are spelled correctly, the search engine can find the documents despite these OCR errors.
Manage OCR errors in thesaurus (Hidden labels)
Manage common OCR errors for all existing and new documents with thesaurus entries for OCR errors (hidden labels).
The recommender can analyse the corpus for typos and OCR errors of a thesaurus entry and recommend such misspellings, which you can add to the thesaurus as hidden labels with one click.
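How such a recommendation could work in principle is sketched below with fuzzy string matching from the Python standard library. This is not the recommender's actual algorithm, which is not described here. The function name, the similarity cutoff of 0.8 and the sample words are assumptions.

```python
import difflib


def recommend_hidden_labels(label, corpus_words, cutoff=0.8):
    """Return corpus words that look like misspellings of a thesaurus label."""
    # Exclude the correctly spelled label itself and remove duplicates.
    candidates = sorted({word for word in corpus_words if word != label})
    # get_close_matches keeps candidates whose similarity ratio is at least cutoff.
    return difflib.get_close_matches(label, candidates, n=10, cutoff=cutoff)


# Made-up corpus words containing two typical OCR confusions (t/7 and e/c).
corpus_words = ["Tesseract", "Tesserac7", "Tcsseract", "banana"]
suggestions = recommend_hidden_labels("Tesseract", corpus_words)
```

Each suggestion would then be offered for adding to the thesaurus entry as a hidden label. A lower cutoff finds more candidates but also more unrelated words.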