To label a document, mark the regions that contain field values and tell the skill what data type each field holds. Before you start, pick the right selection method for the field shape, then follow the per-type guidelines for structured, semi-structured, or unstructured documents.
Selection methods
| Method | Best for |
|---|---|
| Hover and click a word | Single-word fields |
| Drag a rectangle around words | Semi-structured documents |
| Click the first word, then drag (left mouse button held) to the last word | Unstructured documents |
Structured documents
Structured documents (such as pre-formatted forms) always contain the same information in the same locations. You only need to label a few sample documents because there’s no layout variation.

- Specify each field’s region accurately; field values alone aren’t enough for training.
- Mark the entire placeholder, not the value inside it.
- If a field contains no value, mark the empty placeholder anyway.
- For multi-part fields, hold Shift to add additional parts. All parts must be on the same page.
- For tables on a fixed form, label every row, including empty rows.
- If you add a new field after labeling, go back and label that field on every document in the training set.
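
The structured-document guidelines above can be checked mechanically if the labels are exported to some in-memory form. The following is a minimal sketch under that assumption: the `Region` class, the `Labels` mapping, and `check_structured_labels` are hypothetical names invented for illustration and are not part of the product. It flags documents where a field has no region (empty placeholders should still be marked) and multi-part fields whose parts are on different pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical labeled rectangle on a page (not a product type)."""
    page: int
    left: float
    top: float
    right: float
    bottom: float


# Hypothetical layout: {document name: {field name: [Region, ...]}}
Labels = dict[str, dict[str, list[Region]]]


def check_structured_labels(labels: Labels, fields: list[str]) -> list[str]:
    """Return human-readable problems for a structured training set."""
    problems = []
    for doc, doc_fields in sorted(labels.items()):
        for name in fields:
            parts = doc_fields.get(name, [])
            if not parts:
                # Empty placeholders must still be marked on structured forms.
                problems.append(f"{doc}: field '{name}' has no region")
            elif len({part.page for part in parts}) > 1:
                # All parts of a multi-part field must share one page.
                problems.append(f"{doc}: parts of '{name}' are on different pages")
    return problems
```

Running such a check after adding a new field to the list would show every document that still needs that field labeled.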
Semi-structured documents
Semi-structured documents (bills, payment orders, invoices) contain similar fields, but field locations, sizes, and counts vary across documents.

- Specify each field’s region accurately; field values alone aren’t enough for training.
- Click the field’s value (the word or words it contains); the region is created automatically.
- If a field contains no value, don’t create a region for it.
- Don’t mark partial words — the trainer learns on whole words only.
- For multi-part fields, hold Shift to add additional parts. All parts must be on the same page.
- Do not instruct the program to find fields inside another field’s region (whether an individual field like an address or a table cell like Description). To extract from a large region, chain activities: a semi-structured extraction activity to find the region, then an NLP Extraction Rules activity or a script rule to pull specific fields from it.
- If you add a new field after labeling, go back and label that field on every document in the training set.
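
The rule against fields inside another field’s region amounts to a rectangle-containment test. The sketch below assumes labels are available as page-plus-rectangle tuples; `Box`, `contains`, and `find_nested` are hypothetical helpers written for illustration, not product APIs. It reports each pair where one field’s region lies entirely inside another’s on the same page.

```python
from itertools import permutations
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical labeled rectangle; coordinates grow right and down."""
    page: int
    left: float
    top: float
    right: float
    bottom: float


def contains(outer: Box, inner: Box) -> bool:
    """True if inner lies entirely within outer on the same page."""
    return (
        outer.page == inner.page
        and outer.left <= inner.left
        and outer.top <= inner.top
        and outer.right >= inner.right
        and outer.bottom >= inner.bottom
    )


def find_nested(regions: dict[str, Box]) -> list[tuple[str, str]]:
    """Return (inner field, outer field) pairs that violate the guideline."""
    return [
        (inner, outer)
        for outer, inner in permutations(regions, 2)
        if contains(regions[outer], regions[inner])
    ]
```

A pair reported here is a candidate for the chained approach described above: extract the outer region first, then pull the inner value from it in a later activity.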
Tables and repeating groups
For repeating data, decide between a table and a repeating group:

| Use this | When |
|---|---|
| Table | Tabular data with a common header and values that have no keywords next to them |
| Repeating group with the Allow multiple items option | Less-structured data where keywords sit next to the values |
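
The choice in the table above can be written as a small decision function. This is only an illustrative sketch: `choose_container` and its two boolean inputs are hypothetical, and the `"undecided"` result for data that matches neither row is an assumption, since the guidelines do not cover that case.

```python
def choose_container(has_common_header: bool, keywords_next_to_values: bool) -> str:
    """Suggest a field container for repeating data (illustrative only)."""
    if keywords_next_to_values:
        # Less-structured data: keywords sit next to the values.
        return "repeating group (Allow multiple items)"
    if has_common_header:
        # Tabular data: one common header, no keywords beside the values.
        return "table"
    # Assumption: neither row of the guidance applies, so leave it to the user.
    return "undecided"
```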
Unstructured documents
Unstructured documents (contracts, scientific articles, email messages) have no consistent structure.

- Specify each field’s region accurately; field values alone aren’t enough for training.
- For segments (fields trained by the Segmentation activity), include one or more whole paragraphs. A segment cannot include only part of a paragraph.
- Click the field’s value (the word or words it contains); the region is created automatically.
- If a field contains no value, don’t create a region for it.
- Don’t mark partial words — the trainer learns on whole words only.
- If a word is followed by punctuation, adjust the region so the punctuation isn’t enclosed.
- A field region may span pages (for example, a contract clause). Label the first part on the first page, then hold Shift while continuing on the next page.
- To label a field inside another field’s region (for example, a field inside a segment), select the inner field and start labeling — the action creates a new region rather than selecting the outer one.
This is the opposite of the semi-structured guideline above: segments in unstructured documents are designed to contain inner fields, so labeling within them is intended. In semi-structured documents, the equivalent nesting creates training conflicts.
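
The whole-word and punctuation guidelines in the list above can be pictured as a "snap" applied to a rough selection. The sketch below works on plain text with character offsets, which is a simplification: real regions are drawn on the page image, and `snap_selection` is a hypothetical helper, not something the product exposes. It widens a selection to whole words and then drops trailing punctuation.

```python
import string


def snap_selection(text: str, start: int, end: int) -> tuple[int, int]:
    """Adjust the half-open range [start, end) to whole words without trailing punctuation."""
    # Ignore whitespace at either edge of the rough selection.
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    # Grow outward to the nearest whitespace so no word is cut in half.
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    # Exclude punctuation that follows the last word.
    while end > start and text[end - 1] in string.punctuation:
        end -= 1
    return start, end
```

For example, a selection that starts in the middle of "31" and stops in the middle of "2025," in the sentence "The term ends on 31 December 2025, unless renewed." would be adjusted to cover "31 December 2025".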
Related topics
- Labeling documents: Reuse labeled documents from training sets, manual review, or FlexiCapture.
- Importing from FlexiCapture: Format and procedure for reusing FlexiCapture-labeled documents.
- Document categories: Background on structured, semi-structured, unstructured, and mixed documents.
- Segmentation activity: Used for segment fields in unstructured documents.
