ABBYY Vantage offers a machine learning mode for processing structured documents, that is, documents in which each field is in the same location on every document instance. Examples of such documents include questionnaires, application forms, and tax return forms. Some structured documents may have multiple variants, with slight differences in fields and their locations.

Sample Images

[Images: IRS Form 1040 - 2020, IRS Form 1040 - 2019] Two variants of the IRS Form 1040, for the years 2020 and 2019.

Creating Skills for Structured Documents

You can create skills for processing structured documents in both Vantage and Advanced Designer. To edit such skills, however, you will need to use Advanced Designer. In Vantage, you can create a skill for processing structured documents by turning on the Fixed-form documents toggle for that skill. You will also need to upload and label some blank forms.
Note: For detailed instructions on creating a skill for processing structured documents that have multiple variants, see Setting up a Document skill for processing structured documents.
The skill you create in Vantage will appear in Advanced Designer. Its document processing flow will include a Forms activity designed specifically for processing structured documents.
Note: If you do not turn on the Fixed-form documents toggle, the document processing flow of your skill will consist of only a Fast Learning activity.
In Advanced Designer, you can create and edit skills for structured documents when you need to combine the processing of structured documents with other Vantage technologies. In this case, a Forms activity needs to be accompanied by other activities created and set up in Advanced Designer.
Note: If your document processing flow includes a Forms activity accompanied by other activities, or if it contains multiple Forms activities, your editing options in Vantage will be limited to changing the skill’s properties, and training will not be available. For more advanced edits, use Advanced Designer.

Extracting Data from Forms Containing Unstructured Elements or Mixed Structures

A structured document may sometimes contain an unstructured element, such as a barcode or stamp placed anywhere on the document, which also needs to be detected. Another example is a mixed document: part of it is structured, while another part is a table of variable length (for example, a table with a varying number of rows). To process such documents, use a Forms activity followed by an activity which will handle the unstructured elements. In the steps below, we use a Forms activity to process structured fields, and an Extraction Rules activity to detect barcodes.

Steps for Creating a Document Skill

  1. Open Advanced Designer. On the start page, create a new skill by clicking Create Document Skill.
  2. Navigate to the Activities tab and add a Forms activity to the document processing flow.
  3. Click Activity Editor. On the Blank Form tab, upload one sample blank form for each variant of your document (we do not recommend uploading more than 10 different variants). Label the fields from which data must be extracted. For guidelines on labeling, see Labeling documents.
  4. Click Train Activity.
  5. Click the Test Set tab and upload completed test documents. Make sure that all the fields are labeled correctly on each document. Click Test Activity. When the operation completes, review the results.
  6. Return to the Activities tab and add an Extraction Rules activity to the document processing flow.
  7. Click Activity Editor and configure the Extraction Rules activity.
  8. Click Test Skill Using Selected Documents. When the operation completes, review the results. If you are satisfied with the results, publish your skill. Otherwise, adjust the labeling, then train and test the activity again.
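
The flow built in the steps above runs two activities in sequence, each adding its own fields to the same document. The Python sketch below only illustrates that idea and is not the Vantage API: the functions `forms_step`, `barcode_step`, and `run_flow`, and the dictionary keys `fixed_fields`, `barcodes`, and `extracted`, are hypothetical names chosen for this example.

```python
from typing import Callable

# Hypothetical model: an activity takes a document (a dict) and returns
# an updated copy. This is an illustration, not how Vantage is implemented.
Activity = Callable[[dict], dict]


def forms_step(doc: dict) -> dict:
    """Stand-in for a Forms activity: collects fields found at fixed positions."""
    extracted = dict(doc.get("extracted", {}))
    extracted.update(doc.get("fixed_fields", {}))
    return {**doc, "extracted": extracted}


def barcode_step(doc: dict) -> dict:
    """Stand-in for an Extraction Rules activity that detects barcodes anywhere."""
    extracted = dict(doc.get("extracted", {}))
    extracted["Barcode"] = list(doc.get("barcodes", []))
    return {**doc, "extracted": extracted}


def run_flow(doc: dict, activities: list[Activity]) -> dict:
    """Apply the activities in order, as in a linear document processing flow."""
    for activity in activities:
        doc = activity(doc)
    return doc


sample = {"fixed_fields": {"Name": "A. Smith"}, "barcodes": ["0123456789"]}
result = run_flow(sample, [forms_step, barcode_step])
print(result["extracted"])
```

The order of the list passed to `run_flow` mirrors the order of activities in the flow: the structured fields are extracted first, and the barcode is added to the same result afterwards.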

Working with Tables and Repeating Groups

When processing structured documents, Vantage can handle tables and repeating groups if the maximum number of table rows or group instances is known in advance and the boundaries of the table or group are fixed. You will need to label every row that can occur on any variant of the form.
Note: Only rows with data will be displayed in the processing results. Any empty rows will be ignored.
If the number of rows or instances in a group is not known in advance, you must use another Vantage technology.
Note: Currently, only tables with text values can be handled. If your table has columns with checkboxes or barcodes, use a repeating group instead.
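
The conditions in this section can be summarized as a small decision rule. The Python sketch below is only a restatement of the text above, not part of any Vantage API: `choose_element` and `drop_empty_rows` are hypothetical helper names, and the column type labels (`"text"`, `"checkbox"`, `"barcode"`) are assumptions made for this example.

```python
def choose_element(max_count_known: bool, fixed_boundaries: bool,
                   column_types: set) -> str:
    """Pick how to model a table-like area on a structured document."""
    # Both conditions must hold for a structured-document table or group.
    if not (max_count_known and fixed_boundaries):
        return "another Vantage technology"
    # Tables currently support text values only.
    if column_types <= {"text"}:
        return "table"
    # Checkbox or barcode columns call for a repeating group instead.
    return "repeating group"


def drop_empty_rows(rows: list) -> list:
    """Keep only rows that contain data, as in the processing results."""
    return [
        row for row in rows
        if any(value is not None and str(value).strip() for value in row.values())
    ]


print(choose_element(True, True, {"text"}))
print(choose_element(True, True, {"text", "checkbox"}))
print(choose_element(False, True, {"text"}))

labeled_rows = [
    {"Name": "Ann", "Amount": "10"},
    {"Name": "", "Amount": None},   # labeled on the blank form, left empty
]
print(drop_empty_rows(labeled_rows))
```

Labeling the maximum number of rows is therefore safe: rows that stay empty on a completed form simply do not appear in the output.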

Extracting Data from Forms and Unstructured Documents in One Flow

Sometimes information may be collected using both forms and unstructured documents. For example, answers to a questionnaire may be received either on printed forms or as unstructured documents written in a freeform manner. To process a mixture of such documents, use a combination of a Forms activity, which will process forms, and a Fast Learning or Extraction Rules activity, which will process unstructured documents. A Classify activity placed at the beginning of the flow separates forms from unstructured documents, and an IF activity then routes each document to the matching activity.

Steps for Creating a Document Skill

  1. Open Advanced Designer. On the start page, create a new skill by clicking Create Document Skill.
  2. Navigate to the Activities tab and add a Forms activity to the document processing flow.
  3. Click Activity Editor. On the Blank Form tab, upload a sample blank form and label the fields from which data must be extracted. For guidelines on labeling, see Labeling documents.
  4. Click Train Activity.
  5. Click the Test Set tab and upload completed test documents. Make sure that all the fields are labeled correctly on each document. Click Test Activity. When the operation completes, review the results.
  6. Navigate to the Activities tab and add a Fast Learning activity to the document processing flow.
  7. Open the Activity Editor to configure and train the activity.
  8. Navigate to the Activities tab and add a Classify activity at the beginning of the document processing flow.
  9. Click Activity Editor and set up the Classify activity. You will need to create a class for each document variant, assign classes to your documents, and train the activity.
  10. Return to the Activities tab and add an IF activity to set up conditional branching for the document processing flow. Connect this activity to the Forms and Fast Learning activities.
  11. Click Test Skill Using Selected Documents. When the operation completes, review the results. If you are satisfied with the results, publish your skill. Otherwise, adjust the labeling, then train and test the affected activity again.
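
The branching set up in the steps above can be pictured as: classify first, then send each document down one of two branches. The Python sketch below is a conceptual illustration only and does not use any Vantage API. All function names, the `layout` flag, and the class names `"form"` and `"unstructured"` are assumptions made for this example. In Vantage, the Classify activity is trained on sample documents rather than reading a flag.

```python
def classify(doc: dict) -> str:
    """Stand-in for a trained Classify activity (here it just reads a flag)."""
    return "form" if doc.get("layout") == "fixed" else "unstructured"


def forms_extract(doc: dict) -> dict:
    """Stand-in for the Forms activity branch."""
    return {"processed_by": "Forms", **doc.get("fields", {})}


def fast_learning_extract(doc: dict) -> dict:
    """Stand-in for the Fast Learning activity branch."""
    return {"processed_by": "Fast Learning", **doc.get("fields", {})}


def process(doc: dict) -> dict:
    """Mimic the IF activity: route by the class assigned during classification."""
    if classify(doc) == "form":
        return forms_extract(doc)
    return fast_learning_extract(doc)


printed_form = {"layout": "fixed", "fields": {"Q1": "Yes"}}
free_text = {"layout": "freeform", "fields": {"Q1": "Yes, mostly"}}

print(process(printed_form))
print(process(free_text))
```

Keeping classification separate from extraction means each branch is trained and tested only on the kind of document it is meant to handle.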