ABBYY FlexiCapture - Image Snippet/Annotated PDF Add-On

Watch how to create image snippets and annotated searchable PDFs for captured fields in ABBYY FlexiCapture with our FlexiCapture add-on.


Hello. My name is Joe Hill from User Friendly Consulting. Today, I'm going to be showing you an application that was built on top of the ABBYY FlexiCapture version 11 software. This is the distributed version of the product. The product has a lot of great integration opportunities, including a web service and also a full workflow engine. The workflow engine is what I'm going to be showing you today. I'm going to be using the workflow features that are part of the software to perform an operation that might be useful for a lot of customers.

I'm going to start by showing you some documents that I've processed through the ABBYY FlexiCapture system. I have some batches that are sitting in the verification area right now. Let's just look at one of these documents. This document is a publicly available FAA document, and I'm extracting some information using a FlexiLayout, which is a series of rules that allow you to extract data from documents. This FlexiLayout was created to extract repeating groups of information. You can see how the information within this document flows down in columns.

What I'm going to be demonstrating today is a feature that allows you to look at a field that's been recognized, and then access the part of the image on the page where that field was found and capture an area of that image, so that you could use that in another application. FlexiCapture keeps track of where each individual field has been extracted from each of the documents, and stores that in XML that is available through the web service.

One of the things that we're seeing within our business right now is a lot of customers are using ABBYY FlexiCapture in a web-based cloud environment, so that the people who are using the ABBYY FlexiCapture software don't actually see the tools, like the Verification Station I'm showing you now. All they see is a web-based application, and all they really care about is the fact that they got their data out of their documents.

This application that I'm showing today provides a way to give additional feedback to those users regarding where that information was extracted on a document, by showing them a little image snippet of the document that is centered on the area where the information was extracted. We'll also draw a little box around that. That's the first part.

The second part of this system that I'm going to show you today allows you to create a searchable PDF, and then to draw boxes around each of the fields that have been extracted. Those are going to be drawn as annotations on top of the searchable PDF document. You can see for this example here that we've extracted this field right here. I've set this up so that the workflow is going to look at each one of the fields that has been extracted, and then we are going to save a little image snippet around that area that gives the user enough information, so that they can understand the context of that value in a larger portion of the document.

For my example, I'm just saving these image snippets to a folder on disk. I'm naming them using the batch ID then the document ID, the page ID and then the coordinates of that index. That's the left and top of that index value that was extracted from the document. This naming works out well, because when you retrieve the processing results for a document using the FlexiCapture web service API, you have all this information provided to you. It's very easy to correlate these files back to the original documents that have been processed.
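As a rough illustration of the naming scheme just described, here is a minimal Python sketch. It is not the code from the demo. The underscore separator, the .png extension, the integer IDs and the names SnippetKey, snippet_filename and parse_snippet_filename are all assumptions; the transcript only states which values go into the name and in what order.

```python
from typing import NamedTuple


class SnippetKey(NamedTuple):
    """Values that identify one snippet, in the order given in the talk."""
    batch_id: int
    document_id: int
    page_id: int
    left: int  # left edge of the extracted field
    top: int   # top edge of the extracted field


def snippet_filename(key: SnippetKey, ext: str = "png") -> str:
    """Join the key parts with underscores (separator is an assumption)."""
    return "_".join(str(part) for part in key) + "." + ext


def parse_snippet_filename(name: str) -> SnippetKey:
    """Recover the key from a file name so it can be matched to results."""
    stem = name.rsplit(".", 1)[0]
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected snippet file name: {name!r}")
    return SnippetKey(*(int(p) for p in parts))


# Hypothetical IDs and coordinates, for illustration only.
key = SnippetKey(batch_id=12, document_id=3, page_id=1, left=450, top=1210)
name = snippet_filename(key)
```

A client that already has the batch, document and page IDs plus the field position from the processing results can rebuild the same name and look the file up directly, or parse names found on disk to go the other way.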

There is a second option available within this little demonstration that I set up: you could also attach these to the document as attachments right here. Then, you could retrieve those using the web service API. For my example right now, I just saved them on disk. Let me just show you. I think maybe then you'll get an understanding of the usefulness of some of these little, what I'm calling, snippets right here.

Let's just look at the first one here. This one, you can see that the field that was extracted is highlighted. The script that we wrote, and we'll be showing the technical details of that, allows you to determine how thick the line is around the index value that's been extracted and whether it's a dotted, solid or dot-dash pattern. There are some options for that. Also, you can determine the color of the box that's being drawn around the field, and then the size of the image snippet. You can determine how far you want to go out on either side, and how far you want to go out above and below.

This gives an app that uses the FlexiCapture web service a way to retrieve a slightly larger portion of an image. Let's say the user clicks on this field in the web app. You really don't want them to see other pages of the document. Maybe you don't even want them to see an entire page of the document. You're okay with them seeing just a limited area. Showing that to them helps them understand the context of that field in the document. That's an example of an individual image snippet here. You can see that we've created the same thing for some of these other ones, and I'll zoom in just a little bit so you can see. We're using a dashed pattern around here.

This is the original image, so this is the same DPI as the original image that came in. Very useful. We've got one for each field. The last part is the other option that I spoke about, and that is looking at the entire document and understanding the information that was extracted from the document and displaying that in a searchable PDF. I'll show you that result too.

This is a searchable PDF that represents the document. It's an exact image of the document. It is text searchable, so text has been laid on top of it. Then, what we've done within our script here in the workflow, that I'm going to be showing you, is we've created square annotations over the top of each of the field areas that have been extracted. Now, you can look at some of these and you see that, like the speed, that column wasn't populated. Yet, you can see where the information would have been extracted based on the definition from the FlexiLayout.

The area that's lassoed is going to be dependent upon how the rules are set up within the FlexiCapture software for grabbing those fields. These are highlights, so they're annotation objects. The original document is not changed, yet you have a way to see something like a list view of all the fields that have been extracted from the document.
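To make the annotation idea above a little more concrete, here is a minimal Python sketch. It is not the script from the demo, which is not shown in this transcript. It only illustrates two facts from the PDF format: default PDF user space is measured in points (1/72 inch) with the origin at the bottom-left of the page, and a square annotation is a dictionary with /Subtype /Square, a /Rect, a color /C and a border style /BS. The function names, the dash array and the softened red color are assumptions made for illustration.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]


def pixel_box_to_pdf_rect(left: int, top: int, right: int, bottom: int,
                          dpi: int, page_height_px: int) -> Rect:
    """Convert a top-left-origin pixel box to a bottom-left-origin PDF rect in points."""
    llx = left * 72.0 / dpi
    urx = right * 72.0 / dpi
    # Flip the vertical axis: image rows grow downward, PDF y grows upward.
    lly = (page_height_px - bottom) * 72.0 / dpi
    ury = (page_height_px - top) * 72.0 / dpi
    return (llx, lly, urx, ury)


def square_annotation(rect: Rect, rgb: Tuple[int, int, int],
                      width_pt: float, dashed: bool = True) -> str:
    """Build the text of a PDF square annotation dictionary (not a whole PDF file)."""
    rect_txt = " ".join(f"{v:.2f}" for v in rect)
    # PDF colors are components in the range 0..1, not 0..255.
    color_txt = " ".join(f"{c / 255:.2f}" for c in rgb)
    style = "/S /D /D [3 3]" if dashed else "/S /S"  # dash array is an assumption
    return (f"<< /Type /Annot /Subtype /Square /Rect [{rect_txt}] "
            f"/C [{color_txt}] /BS << /W {width_pt:.2f} {style} >> >>")


# Hypothetical field box on a 300 DPI letter-size scan (2550 x 3300 pixels).
rect = pixel_box_to_pdf_rect(300, 600, 900, 660, dpi=300, page_height_px=3300)
annotation = square_annotation(rect, rgb=(255, 128, 128), width_pt=1.0)
```

Because the annotation is a separate object that refers to the page, the page image and the searchable text layer stay untouched. Writing the dictionary into a real PDF would need a PDF library, which this sketch leaves out.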

Now, let me just go into a few technical details. To do that, I'll use the Project Setup Station. Within my project for these FAA documents, I've set up a couple of attached .NET references here. This is the code that we wrote, so we're using the FlexiCapture API and obtaining access to documents that pass through workflows, and then performing operations on those. The step in the workflow that we're using is right here. I just named it Extract Field Snippets. If you look at that, it's a pretty simple script behind that. It's a document-level script, and I'll just show that to you real fast here. Get it on that monitor.

It's one line to save the individual image snippets for each of the fields. In this example, we are extracting an area that is half an inch above and below the index value, and then half an inch to the right and half an inch to the left. We're using a dashed line. We're using a line width of 0.016 inch. We found that specifying the width in inches gave a more consistent line size across the varying documents that would come in. We're specifying the color of the box that's being shown on the image snippets using RGB color codes. That would be all red.
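The sizing described above can be sketched in a few lines of Python. This is not the script from the demo; it only shows why sizes given in inches scale with the resolution of each scan. The half-inch margins and the 0.016 inch line width come from the talk, while the names Box, inches_to_pixels and snippet_box, the rounding and the clamping to the page edges are assumptions.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Pixel rectangle with the origin at the top-left of the page image."""
    left: int
    top: int
    right: int
    bottom: int


def inches_to_pixels(inches: float, dpi: int) -> int:
    """Convert a physical size to pixels, never returning less than 1."""
    return max(1, round(inches * dpi))


def snippet_box(field: Box, page_width: int, page_height: int, dpi: int,
                margin_x_in: float = 0.5, margin_y_in: float = 0.5) -> Box:
    """Grow the field box by the margins and clamp it to the page."""
    mx = inches_to_pixels(margin_x_in, dpi)
    my = inches_to_pixels(margin_y_in, dpi)
    return Box(
        left=max(0, field.left - mx),
        top=max(0, field.top - my),
        right=min(page_width, field.right + mx),
        bottom=min(page_height, field.bottom + my),
    )


# Hypothetical field on a 300 DPI letter-size scan (2550 x 3300 pixels).
field = Box(left=600, top=900, right=900, bottom=960)
crop = snippet_box(field, page_width=2550, page_height=3300, dpi=300)
line_px = inches_to_pixels(0.016, dpi=300)
border_rgb = (255, 0, 0)  # "all red", as in the talk
```

With these numbers the half-inch margin is 150 pixels at 300 DPI but would be 100 pixels at 200 DPI, so the snippet covers the same physical area of the page whatever the scan resolution.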

For the PDF, we're using slightly different values. We found that red was a little shocking, so in the PDFs we've softened that color. That's configurable, so you can do anything you want there. Then, the other parameter would be just where you want to save the image snippets. If you leave that blank, then they would get attached to each individual document under the attachments here. That can be accessed through the web service API, so your web app would have access to each of those. That includes the PDF.

Those are the colors for the PDF border, and again it's one line of code. As documents pass through this workflow, the field image snippets are extracted after recognition. Now, you might do that after verification instead if you are using verification of documents. That's an example of an application that has been built using the FlexiCapture version 11 workflow system. It allows you, after you've extracted data, to grab a little image snippet of each field that's been extracted from the document and save those images.

Again, that's really intended to be used as part of another application where the users are going through and reviewing the information that was extracted from a document, but they're not using the FlexiCapture 11 tool set, meaning the Verification Station and the other stations. Instead, they're talking to FlexiCapture through the web service API, and they just get the results back. It's a handy way of manipulating documents that go through the system and accessing the information that has been extracted from them using this workflow system. This is a really nice demonstration of the workflow system.

Thank you for joining me today. We'll be doing a lot more demos showing what can be done using the FlexiCapture 11 workflow. Thank you.

Information about the Author
Joe Hill
About Me
Joe is the chief technologist for UFC, Inc. He guides the decisions on which products UFC offers as well as research on new software applications under the Jovation and MuWave trademarks. Joe earned a bachelor of science in computer science engineering at Western Michigan University.