Labeling guidelines - ABBYY Documentation

To label a document, mark the regions that contain field values and tell the skill what data type each field holds. Before you start, pick the right selection method for the field shape, then follow the per-type guidelines for structured, semi-structured, or unstructured documents.

Selection methods

Method	Best for
Hover and click a word	Single-word fields
Drag a rectangle around words	Semi-structured documents
Click the first word, then drag (left mouse button held) to the last word	Unstructured documents

Structured documents

Structured documents (such as pre-formatted forms) always present the same fields in the same locations. Because the layout doesn’t vary, you only need to label a few sample documents.

Specify each field’s region accurately — field values alone aren’t enough for training.
Mark the entire placeholder, not just the value inside it.
If a field contains no value, mark the empty placeholder anyway.
For multi-part fields, hold Shift to add additional parts. All parts must be on the same page.
For tables on a fixed form, label every row, including empty rows.
If you add a new field after labeling, go back and label that field on every document in the training set.
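The last guideline is easy to miss in a large training set. Because every field on a structured form gets a region even when it is empty, a simple completeness check is possible. The following is a minimal sketch, assuming you can obtain a mapping from document name to the set of labeled field names. That mapping, the function name, and the sample data are hypothetical and are not part of the product.

```python
def find_unlabeled(labels_by_doc, required_fields):
    """Return {document: sorted missing field names} for incomplete documents."""
    required = set(required_fields)
    missing = {}
    for doc, labeled in labels_by_doc.items():
        gap = required - set(labeled)
        if gap:
            missing[doc] = sorted(gap)
    return missing


# Hypothetical training set: "Due Date" was added after form_002 was labeled.
labels = {
    "form_001.pdf": {"Name", "Total", "Due Date"},
    "form_002.pdf": {"Name", "Total"},
}
print(find_unlabeled(labels, ["Name", "Total", "Due Date"]))
# prints {'form_002.pdf': ['Due Date']}
```

The same check does not carry over unchanged to semi-structured or unstructured documents, where a field with no value legitimately has no region.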

Semi-structured documents

Semi-structured documents — bills, payment orders, invoices — contain similar fields, but field locations, sizes, and counts vary across documents.

Specify each field’s region accurately — field values alone aren’t enough for training.
Click the field’s value (the word or words it contains); the region is created automatically.
If a field contains no value, don’t create a region for it.
Don’t mark partial words — the trainer learns on whole words only.
For multi-part fields, hold Shift to add additional parts. All parts must be on the same page.
Don’t label a field inside another field’s region, whether the outer field is an individual field (such as an address) or a table cell (such as Description). To extract values from a large region, chain activities: use a semi-structured extraction activity to find the region, then an NLP Extraction Rules activity or a script rule to pull specific fields from it.
If you add a new field after labeling, go back and label that field on every document in the training set.
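The rule against labeling one field inside another can be checked geometrically. The sketch below assumes each region is available as a rectangle with page number and coordinates. The `Region` type, its coordinate convention (top-left origin), and the sample values are assumptions made for illustration, not a product export format.

```python
from typing import NamedTuple


class Region(NamedTuple):
    field: str
    page: int
    left: float
    top: float
    right: float
    bottom: float


def contains(outer, inner):
    """True if inner lies entirely within outer on the same page."""
    return (
        outer.page == inner.page
        and outer.left <= inner.left
        and outer.top <= inner.top
        and outer.right >= inner.right
        and outer.bottom >= inner.bottom
    )


def nested_pairs(regions):
    """List (outer field, inner field) pairs that break the no-nesting rule."""
    pairs = []
    for outer in regions:
        for inner in regions:
            # Parts of the same multi-part field are allowed to overlap.
            if outer is not inner and outer.field != inner.field and contains(outer, inner):
                pairs.append((outer.field, inner.field))
    return pairs


# Hypothetical labels: PostalCode was marked inside the Address region.
regions = [
    Region("Address", 1, 50, 100, 300, 160),
    Region("PostalCode", 1, 220, 140, 290, 158),
    Region("Total", 1, 400, 500, 480, 520),
]
print(nested_pairs(regions))
# prints [('Address', 'PostalCode')]
```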

Tables and repeating groups

For repeating data, decide between a table and a repeating group:

Use this	When
Table	Tabular data with a common header and values that have no keywords next to them
Repeating group with the Allow multiple items option	Less-structured data where keywords sit next to the values

If different documents are organized differently, pick the option that fits the majority. To label a table, mark the first row’s cells one at a time (each click creates a column), then click Continue table from this row and verify the rest of the table is labeled correctly.
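The choice between a table and a repeating group, including the majority rule, can be expressed as a small decision function. This is a sketch under assumptions: the two boolean observations per document are a simplification of the criteria in the table above, and treating every case that is not clearly tabular as a repeating group is a choice made here, not one stated in the guidelines.

```python
from collections import Counter


def choose_for_document(has_common_header, keywords_next_to_values):
    """Pick an option for one document from two yes/no observations."""
    if has_common_header and not keywords_next_to_values:
        return "table"
    # Assumption: anything not clearly tabular is treated as a repeating group.
    return "repeating group"


def choose_for_set(observations):
    """Pick the option that fits the majority of documents."""
    votes = Counter(choose_for_document(h, k) for h, k in observations)
    # On a tie, Counter.most_common returns the option seen first.
    return votes.most_common(1)[0][0]


# Hypothetical sample: two tabular invoices, one with keyword/value pairs.
sample = [(True, False), (True, False), (False, True)]
print(choose_for_set(sample))
# prints table
```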

For large tables on visually similar pages, you can delete the similar middle pages and label only the first page, the last page, and a few pages in between.
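One way to pick which pages to keep is to spread the in-between pages evenly across the table. The guideline does not say how many pages "a few" is, so the default of two in-between pages below is an assumption, as is the function name.

```python
def pages_to_label(page_count, in_between=2):
    """Return 1-based page numbers: first, last, and evenly spaced middle pages."""
    if page_count <= 0:
        return []
    if page_count <= in_between + 2:
        # Short table: nothing worth skipping.
        return list(range(1, page_count + 1))
    step = (page_count - 1) / (in_between + 1)
    middle = [1 + round(step * i) for i in range(1, in_between + 1)]
    return sorted({1, page_count, *middle})


print(pages_to_label(10))
# prints [1, 4, 7, 10]
```

For a 10-page table this keeps pages 1, 4, 7, and 10, and the remaining pages are the candidates for removal.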

Unstructured documents

Unstructured documents — contracts, scientific articles, email messages — have no consistent structure.

Specify each field’s region accurately — field values alone aren’t enough for training.
For segments (fields trained by the Segmentation activity), include one or more whole paragraphs. A segment cannot include only part of a paragraph.
Click the field’s value (the word or words it contains); the region is created automatically.
If a field contains no value, don’t create a region for it.
Don’t mark partial words — the trainer learns on whole words only.
If a word is followed by punctuation, adjust the region so the punctuation isn’t enclosed.
A field region may span pages (for example, a contract clause). Label the first part on the first page, then hold Shift while continuing on the next page.
To label a field inside another field’s region (for example, a field inside a segment), select the inner field and start labeling — the action creates a new region rather than selecting the outer one.

This is the opposite of the semi-structured guideline above: segments in unstructured documents are designed to contain inner fields, so labeling within them is intended. In semi-structured documents, the equivalent nesting creates training conflicts.
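The whole-word and punctuation guidelines above amount to snapping a selection to word boundaries and then dropping trailing punctuation. The sketch below illustrates that on plain text with character offsets. Regions in the labeling interface are drawn on the page image, so the offsets, the function name, and the sample sentence are all illustrative assumptions.

```python
import string


def snap_to_words(text, start, end):
    """Expand [start, end) to whole words, then drop trailing punctuation."""
    # Grow the selection outward until it reaches whitespace on both sides.
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    # Exclude punctuation that follows the last word.
    while end > start and text[end - 1] in string.punctuation:
        end -= 1
    return start, end


clause = "The term ends on 31 December 2025, unless renewed."
rough_start = clause.index("ecember")           # selection starts mid-word
rough_end = rough_start + len("ecember 202")    # and ends mid-word
s, e = snap_to_words(clause, rough_start, rough_end)
print(clause[s:e])
# prints December 2025
```

A value that legitimately ends in punctuation, such as an abbreviation, would need to be handled as an exception to the trimming step.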

Related topics

Labeling documents

Reuse labeled documents from training sets, manual review, or FlexiCapture.

Importing from FlexiCapture

Format and procedure for reusing FlexiCapture-labeled documents.

Document categories

Background on structured, semi-structured, unstructured, and mixed documents.

Segmentation activity

Used for segment fields in unstructured documents.
