
OCR stands for optical character recognition. It is a technology for reading the text in scanned documents, such as image-only PDFs, and converting it into machine-readable text. Many PDF files, especially those created by a scanner, are really just images: the file contains no textual information, only pictures of the pages.

When you use an OCR converter from an online OCR service, you may get lower-quality results unless you keep a few tips in mind.

1) OCR technology is highly sensitive to the orientation of the scanned document. When you use an online OCR service, make sure the service can rotate the file into the right orientation, usually portrait. If the service does not rotate the file, consider doing this with another tool first.

Orienting the scanned PDF correctly can result in dramatically better performance.
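As a rough illustration of the rotation step, here is a minimal Python sketch. It assumes the page has already been decoded into a list of pixel rows, and the function names are hypothetical, not part of any OCR service. It only turns a landscape page into a portrait one; detecting an upside-down page would require looking at the text itself.

```python
def rotate_clockwise(page):
    """Rotate a page (a list of pixel rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*page[::-1])]


def to_portrait(page):
    """Return the page rotated, if needed, so it is taller than it is wide."""
    height = len(page)
    width = len(page[0]) if page else 0
    if width > height:
        # Landscape page: turn it by a quarter turn.
        return rotate_clockwise(page)
    return page
```

In practice an image library would do the rotation on the real scan; the point is simply to fix the orientation before the file reaches the OCR service.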

2) Another area of concern when using online OCR services is the type of document. Not all OCR readers and OCR converters are optimized for the same tasks. Generally, OCR software is optimized for one of the following:

a) Forms extraction

b) Text extraction

c) Data extraction

d) Handwriting extraction

You can generally achieve more than one goal, but each online OCR service will have a specialty. For example, Tabex is focused on data extraction from portrait documents.

3) Finally, an important consideration is the language. Each language may have different characters and punctuation. This is particularly evident when you try to convert Arabic, Chinese and other languages whose scripts differ greatly from English. Spend a moment testing a few simple paragraphs in your language and in English to compare how the OCR actually performs.
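One way to make that comparison concrete is to measure a character error rate: the number of single-character edits needed to turn the OCR output into the text you expected, divided by the length of the expected text. The sketch below is only an illustration with hypothetical function names; it assumes you have typed out the correct text of each test paragraph yourself.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(expected, ocr_output):
    """Edits per expected character; 0.0 means a perfect match."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return edit_distance(expected, ocr_output) / len(expected)
```

Running this on the same service for a paragraph in your language and one in English gives two numbers you can compare directly; the lower rate is the better result.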

OCR accuracy can be increased if the output is constrained by a lexicon – a list of words that are allowed to occur in a document. This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if the document contains words not in the lexicon, like proper nouns. 
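The idea can be sketched in a few lines of Python using the standard library's difflib module. This is a simplified illustration, not how any particular OCR engine works: the function name and the 0.8 similarity cutoff are assumptions. Each recognized word is replaced by its closest lexicon entry, and a word with no close match is left alone, which is one way to avoid mangling proper nouns.

```python
import difflib


def constrain_to_lexicon(words, lexicon, cutoff=0.8):
    """Snap each OCR word to its closest lexicon entry, if one is close enough."""
    corrected = []
    for word in words:
        if word in lexicon:
            corrected.append(word)
            continue
        # Returns at most one candidate with similarity >= cutoff.
        matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
        # No close match (e.g. a proper noun): keep the word unchanged.
        corrected.append(matches[0] if matches else word)
    return corrected


lexicon = ["invoice", "total", "amount"]
result = constrain_to_lexicon(["invo1ce", "total", "Tabex"], lexicon)
```

Here the misread "invo1ce" is close enough to "invoice" to be corrected, while "Tabex" has no close match in the lexicon and passes through untouched.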

The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of the page and a searchable textual representation.

Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy.

