Glossary of Document Capture and Computer Programming Terms

Document Capture and Content Management Terms

Terms about document capture, enterprise content management and forms processing software.


Image

An image is a logical container for files. It is a parent to image files and universal files. An image consists of one or more files, such as image 1, image 2, and so on. Properties of an image include Filename, Image Type, Has Image, Has Universal, and Original Filename. A universal file is any file that is not in an image format. Image format files are files whose names end with identifiers such as TIFF, JPG, or GIF.
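The container described above can be sketched as a small data structure. This is a minimal illustration, not the schema of any particular capture product: the class name, the field names, and the exact set of extensions are assumptions, and Has Image and Has Universal are derived here from the child files rather than stored.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List

# Assumed set of image-format extensions; the glossary names TIFF, JPG, and GIF.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".gif"}


@dataclass
class Image:
    """Logical container that is parent to image files and universal files."""
    filename: str
    image_type: str
    original_filename: str
    files: List[str] = field(default_factory=list)  # child file names

    @property
    def has_image(self) -> bool:
        # True if at least one child file is in an image format.
        return any(Path(f).suffix.lower() in IMAGE_EXTENSIONS for f in self.files)

    @property
    def has_universal(self) -> bool:
        # True if at least one child file is something other than an image format.
        return any(Path(f).suffix.lower() not in IMAGE_EXTENSIONS for f in self.files)
```

For example, an Image whose files are "scan_1.tif" and "scan.pdf" would report both has_image and has_universal as true.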

Image over Text

Image over text is a technical term for a specific format of electronic document, generally associated with the PDF specification. An image over text PDF embeds searchable text behind the scanned image of a document. This type of PDF is created by first scanning the document and then running it through an OCR engine. Next, a mapping is created from each word in the OCR text to the zone in which that word was located on the scanned image. As a result, when the PDF document is displayed it can be searched for words and phrases. When a search term is found in a PDF viewer such as Adobe Reader, the location of the search term within the document can be displayed. Perhaps one of the most useful attributes of the image over text PDF is that the textual data from within the document can be added to the index of an enterprise document system or content search engine. This makes all of the text within the scanned image available for searching by users trying to locate a document.
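The word-to-zone mapping can be illustrated with a short sketch. It assumes the OCR engine has already returned each word with a page number and a bounding box; the names WordZone and find_phrase are hypothetical, and no real PDF library is involved. A viewer would use the returned zones to highlight the match on the scanned image.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordZone:
    """One OCR word and the zone it occupies on the scanned image (assumed layout)."""
    text: str
    page: int
    x: int
    y: int
    width: int
    height: int


def find_phrase(words: List[WordZone], phrase: str) -> List[List[WordZone]]:
    """Return the zones of every case-insensitive match of phrase, in reading order."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    # Slide a window of len(terms) words across the OCR output.
    for i in range(len(words) - len(terms) + 1):
        window = words[i:i + len(terms)]
        if [w.text.lower() for w in window] == terms:
            hits.append(window)
    return hits
```

Each hit is the list of zones covering the matched words, so a multi-word phrase yields one rectangle per word.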



Index

An index is metadata, meaning “data which describes data.” An index can exist at multiple levels, such as a document index or a batch index. The properties of an index can include attributes such as: Name, Value, Source (OCR, Typed, Barcode, etc.), Location (Page, X, Y, Length, Width), and Barcode Type.
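The properties listed above map naturally onto a small record type. The following is a sketch only: the class names, the "level" field, and the choice to make Location and Barcode Type optional are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    """How the index value was obtained."""
    OCR = "OCR"
    TYPED = "Typed"
    BARCODE = "Barcode"


@dataclass
class Location:
    """Where on the page the value was read."""
    page: int
    x: int
    y: int
    length: int
    width: int


@dataclass
class IndexField:
    name: str
    value: str
    source: Source
    level: str = "document"              # assumed values: "document" or "batch"
    location: Optional[Location] = None  # typed values may have no location
    barcode_type: Optional[str] = None   # only meaningful when source is BARCODE
```

A barcode-sourced field might then be written as IndexField("InvoiceNumber", "12345", Source.BARCODE, location=Location(1, 275, 50, 60, 12), barcode_type="Code 128").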

Intelligent Character Recognition (ICR)

ICR stands for intelligent character recognition. It is a handwriting recognition system that allows fonts and different styles of handwriting to be learned by a computer in order to generate a textual value from a scanned section of handwritten text. ICR is most frequently used to decode predefined areas on fixed forms. It is not frequently applied to decode an entire page of handwritten text, and is almost never applied to analyze a page of mixed machine-printed and handwritten text.
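Decoding predefined areas on a fixed form can be sketched as cropping each named zone from the page and handing it to a recognizer. This sketch does not implement ICR itself: the recognizer is passed in as a callable, the page is assumed to be a plain grid of pixel values, and the names crop and read_form are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Pixels = List[List[int]]          # rows of pixel values
Zone = Tuple[int, int, int, int]  # x, y, width, height in pixels


def crop(page: Pixels, zone: Zone) -> Pixels:
    """Cut the rectangular zone out of the page."""
    x, y, w, h = zone
    return [row[x:x + w] for row in page[y:y + h]]


def read_form(page: Pixels,
              zones: Dict[str, Zone],
              recognize: Callable[[Pixels], str]) -> Dict[str, str]:
    """Run the supplied recognizer on each predefined zone of a fixed form."""
    return {name: recognize(crop(page, zone)) for name, zone in zones.items()}
```

Because the zones are fixed per form template, the recognizer only ever sees small handwritten regions rather than a whole page.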


