Tags Used in Articles on User Friendly Consulting

Data Extraction

  • The Importance of Understanding OCR Processing Sequence Order

    Summary: The order in which an OCR engine extracts text is crucial to how useful that text will be in later processing.

    A company may sometimes need to extract metadata from a document. This is common for documents with a known structure, such as an invoice from a particular vendor that always follows the same layout. The metadata may be needed to tag the document for later retrieval, especially when the document is committed to a content management system, or it may be written into the filename so that the document can be located with a simple file system search. Finally, the metadata may be needed for use in business systems. For example, a company may want to extract the invoice number from an invoice and then file the document in its accounting system using the invoice number as an index value.

    When only simple index values need to be extracted, a regular expression (regex) or text search string may be used to pull the values out of the text once it has been obtained from the image file. When the search locates a field name and expects the index value to immediately follow it, the ordering of the OCR text is crucial to the successful extraction of the value. An example will help to illustrate the nature of the problem.
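
    The following is a minimal Python sketch of the kind of field-name lookup described above. The pattern, the function name extract_invoice_number, and both sample strings are hypothetical and were made up for illustration; they are not the output of any particular OCR engine. The two strings hold the same words, once in a row-by-row reading order and once with the labels read before the values.

```python
import re

# Hypothetical pattern: the label "Invoice Number", an optional colon,
# optional whitespace, and then the digits of the value itself.
INVOICE_RE = re.compile(r"Invoice\s+Number:?\s*(\d+)")


def extract_invoice_number(ocr_text):
    """Return the digits that directly follow the label, or None."""
    match = INVOICE_RE.search(ocr_text)
    return match.group(1) if match else None


# Assumed OCR output, read row by row: each label is followed by its value.
row_order = "Invoice Number: 10422\nDate: 2024-03-01\n"

# Assumed OCR output for the same page, read column by column:
# both labels come first, then both values.
column_order = "Invoice Number:\nDate:\n10422\n2024-03-01\n"

print(extract_invoice_number(row_order))     # value sits right after the label
print(extract_invoice_number(column_order))  # label and value are separated
```

    With the row-by-row text the value directly follows its label and the pattern captures it. With the column-by-column text the word "Date:" sits between the label and the value, so the same pattern finds nothing even though every character on the page was recognized correctly.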