Summary: The order in which an OCR engine extracts text determines how useful that text will be for later uses such as metadata extraction.

In certain instances, a company may need to extract metadata from a document. This is often the case for documents with a known structure, such as an invoice from a certain vendor that always follows a particular layout. The metadata may be needed to tag the document to facilitate later retrieval, especially when the document is committed to a content management system, or it may need to be written into the filename so that the document can be located with a simple search in a file system. Finally, the metadata may be needed for use in business systems. For example, a company may want to extract the invoice number from an invoice and then file the document into its accounting system using the invoice number as an index value.

Where only simple index values need to be extracted, a regular expression (REGEX) or text search string may be used to extract the values after the text is obtained from the image file. When a field name is located and the index value is expected to immediately follow it, the ordering of the OCR text is crucial to the successful extraction of the value. An example will help to illustrate the nature of the problem.

In our introduction, we mentioned that a REGEX could be used to locate a field label, with the expectation that the text string following it is the desired metadata. For example, a script could parse the OCR text file with a REGEX that searches for the pattern “invoice #:” and then extracts the digits that follow it, up to a certain number of digits. In such a case, we might locate the text “invoice #: 123456”, and the resulting value of “123456” could then be used for indexing the document. This assumes that the value of the invoice number field immediately follows the label used in the REGEX search string. However, problems may arise when the value associated with the field label (“invoice #:” in this example) is printed on the line below the label. In such a case, the OCR tool must use a more logical “reading order” method when it extracts the OCR text. Otherwise, the text from the line below the label is output as a single string, and the field value is placed at the end of that string instead of directly after its label.
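The label search described above can be sketched in a few lines of Python. This is only an illustration: the pattern, the function name extract_invoice_number, and the choice to accept close variants of the label such as “invoice no:” are assumptions, and the six-digit length is borrowed from the document format discussed in this article.

```python
import re

# Hypothetical pattern: the label "invoice #:" (or a variant such as
# "invoice no:" or "invoice #.:"), optional whitespace, then exactly six digits.
INVOICE_PATTERN = re.compile(
    r"invoice\s*(?:#|no)\.?\s*:\s*(\d{6})(?!\d)",
    re.IGNORECASE,
)


def extract_invoice_number(ocr_text):
    """Return the six digits that directly follow the label, or None."""
    match = INVOICE_PATTERN.search(ocr_text)
    return match.group(1) if match else None


print(extract_invoice_number("invoice #: 123456"))  # prints 123456
```

A rule like this only succeeds when the OCR text places the value directly after its label, which is why the ordering of the OCR output matters.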

Here is a visual explanation of where things can go wrong with the data extraction tool. One of the invoices tested was laid out this way on the image version of the invoice document:

                                                                                                   Invoice #:

Date: 06-21-2014                                                                       924568

 

During the first attempt, the raw OCR text result was:

     Invoice #:Date:06-21-2014 924568

And the invoice number extracted by the system was:

     062120

This result reflects a rule that the invoice number consists of exactly six digits for this particular document format. The first six digits found after the label came from the date, not from the invoice number.

During the second attempt to extract the invoice number from the document, the OCR engine was changed to use a reading order method. Frequently, this means changing a setting within the engine from high speed to high quality. With the new setting, the OCR text came out in a different order, which produced a very different result in the metadata extraction tool:

Image Layout

                                                                                                  Invoice #:

Date: 06-21-2014                                                                       924568

OCR Result

     Invoice #: 924568 Date: 06-21-2014

And the invoice number extracted by the system was:

     924568
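Both results can be reproduced with a short Python sketch. The extraction tool's actual logic is not shown in this article, so the rule below is an assumption: it finds the label and then takes the first six digits that appear anywhere after it, ignoring punctuation. The function name first_six_digits_after_label is hypothetical.

```python
import re

# Hypothetical label pattern for this vendor's invoice layout.
LABEL = re.compile(r"invoice\s*#\s*:", re.IGNORECASE)


def first_six_digits_after_label(ocr_text):
    """Assumed rule: the first six digits found after the label."""
    match = LABEL.search(ocr_text)
    if match is None:
        return None
    # Drop every non-digit character that follows the label.
    digits = re.sub(r"\D", "", ocr_text[match.end():])
    return digits[:6] if len(digits) >= 6 else None


# First attempt: the date text sits between the label and the value.
print(first_six_digits_after_label("Invoice #:Date:06-21-2014 924568"))    # prints 062120
# Second attempt: reading order keeps the value next to its label.
print(first_six_digits_after_label("Invoice #: 924568 Date: 06-21-2014"))  # prints 924568
```

A stricter rule that requires the six digits to directly follow the label would return no match for the first ordering, which at least makes the failure visible instead of silently indexing the document under digits taken from the date.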

Difficulty in Predicting Future Uses of Data

Perhaps the need for a particular setting inside the OCR engine seems obvious in an example like this one, yet it can be hard for a company to envision the future uses of its OCR text results in advance. Recently at UFC, Inc. we encountered a client who had spent a substantial sum of money to OCR a large backlog of documents in the hope of extracting critical metadata values from them into a business intelligence system. When we examined the OCR text, we found frequent cases of incorrectly ordered text, which made it impossible to locate the values in the files using label-based search rules. We had to reprocess the documents with the OCR engine set to use reading order to obtain acceptable results in the extraction tool.

Information about the Author
Jim Hill works to align the customer's needs with software and consulting solutions in the areas of forms processing, document capture software, and content migration. His background is in the following: 1. Enterprise content management systems including FileNet and SharePoint, including migration of documents to and from FileNet. 2. Document capture systems including Quillix Capture, ABBYY Flexicapture and IRISXtract. 3. OCR systems including ABBYY FineReader, ABBYY Recognition Server and ABBYY FineReader Engine. 4. Forms processing systems including ABBYY Flexicapture and IRISXtract by Canon. Jim began his career as a mechanical engineer at Ford Motor Company. He joined UFC, Inc. in 1998.