Document skills are used to extract field values from different kinds of documents: structured documents (such as tax forms or application forms), semi-structured documents (for example, invoices, order bills, or air waybills), and unstructured documents (such as contracts, lease agreements, or email messages).
Document skills can be created either in ABBYY Vantage or in Advanced Designer. The latter should be your tool of choice if you need to create complex Document skills for non-standard documents with varying layouts and field structures. Advanced Designer also allows you to combine different technologies in your Document skills, add NLP for processing unstructured documents, or impose conditions on processing different types of documents (see Use cases for an overview of typical scenarios).
Document Type Variants
Documents of the same type almost always have identical sets of fields, validation rules, and structure. The variants of a single document type can differ slightly, depending, for example, on the year the document was issued.
Documents of the same type can be processed by one Document skill trained on different variants of this document type. Vantage and Advanced Designer can handle any number of variants within a document type:
- For hundreds of variants, skills trained using Online Learning in Vantage will be able to extract data almost flawlessly.
- For thousands of variants, skills trained using the Deep Learning activity will be able to extract data with an accuracy of approximately 80% to 90%, depending on the complexity of the document types.
- For the most essential variants of a document type, skills trained using the Fast Learning and/or the Extraction Rules activities will ensure accurate extraction of data from complex documents.
- For structured documents, which always have the same type of information in the exact same locations, we recommend using up to 10 variants. If a fixed form has many variants, we recommend you treat them all as different document types. For more information, see Processing structured documents.
Training and Testing a Document Skill
For best extraction results, we recommend training and testing a Document skill using three different document sets:
- Training set
- Test set
- Blind set (an additional test set that contains sample documents not included in either of the two sets above)
Training Set Requirements
For a training set, use a representative document set containing at least 2-3 sample documents for each variant. If there are a lot of variants and the set does not contain at least one sample document of each, consider using the Deep Learning activity. This activity understands image patterns, the structure of documents, field contents, and surrounding labels and can process variants which were not used in training.
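The per-variant coverage described above can be checked before training. The following is a minimal Python sketch, not part of Vantage or Advanced Designer: the function name `underrepresented_variants`, the variant labels, and the idea of keeping one label per sample document are all assumptions made for illustration.

```python
from collections import Counter


def underrepresented_variants(variant_labels, known_variants=(), min_samples=2):
    """Return {variant: count} for variants with fewer than min_samples documents.

    variant_labels: one (hypothetical) variant label per sample document.
    known_variants: variants expected in production, including any that
                    have no sample documents at all.
    """
    counts = Counter(variant_labels)
    variants = set(counts) | set(known_variants)
    # Counter returns 0 for variants that never occur in variant_labels.
    return {v: counts[v] for v in sorted(variants) if counts[v] < min_samples}


# Hypothetical labels for six sample documents.
labels = ["vendor_a", "vendor_a", "vendor_b", "vendor_c", "vendor_c", "vendor_c"]
gaps = underrepresented_variants(labels, known_variants=["vendor_d"])
print(gaps)
```

If the result lists many variants, and some of them have no samples at all, that is the situation in which the Deep Learning activity is the suggested option.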
The number of sample documents for the activities depends on the technologies you use in your Document skill:
- Deep Learning activity for semi-structured documents:
  - For high-variability documents, at least 200-300 sample documents (2-3 sample documents per variant) are required. Generally, we recommend having about 1,000 documents in the set.
  - For low-variability documents, 100 sample documents are usually sufficient.
- Segmentation activity:
  - For high-variability documents, we recommend having at least 100 sample documents.
  - For low-variability documents, we recommend having at least 20 sample documents.
- Deep Learning for NLP activity:
  - For high-variability documents, we recommend having at least 300 sample documents (2-3 samples per variant).
  - For low-variability documents, we recommend having at least 50 sample documents.
Note: Even if you don’t have the recommended number of sample documents, having one sample document per variant is better than none at all.
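The recommended minimums above can be kept in a small lookup table when planning a training set. This is an illustrative Python sketch only: the keys such as `"deep_learning"` and the function `training_set_shortfall` are hypothetical names, not Vantage identifiers, and the figure of 200 is the lower bound of the 200-300 range given above.

```python
# Recommended minimum training set sizes, keyed by (activity, variability).
# The key names are hypothetical labels chosen for this sketch.
RECOMMENDED_MINIMUMS = {
    ("deep_learning", "high"): 200,      # lower bound; about 1,000 is preferred
    ("deep_learning", "low"): 100,
    ("segmentation", "high"): 100,
    ("segmentation", "low"): 20,
    ("deep_learning_nlp", "high"): 300,
    ("deep_learning_nlp", "low"): 50,
}


def training_set_shortfall(activity, variability, available):
    """Return how many more sample documents are needed (0 if enough)."""
    minimum = RECOMMENDED_MINIMUMS[(activity, variability)]
    return max(0, minimum - available)


print(training_set_shortfall("segmentation", "high", 60))
print(training_set_shortfall("deep_learning", "low", 150))
```

A non-zero shortfall does not rule out training; as noted above, a smaller set with one sample document per variant is still better than none.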
Test Set Requirements
For a test set, the distribution of sample documents must be similar to that of the actual document flow in production. This will ensure that the accuracy estimate is valid.
For example, if invoices from a particular vendor account for 30% of the production document flow, about 30% of the sample documents in the test set should be from that vendor. You can also achieve the required ratio by testing your skill on random samples of documents from the production document flow.
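One way to obtain such a ratio is to draw a fixed quota of documents per source. The Python sketch below assumes each document is represented as a `(doc_id, source)` pair and that the production shares are already known; the function `sample_test_set` and all file and vendor names are hypothetical.

```python
import random
from collections import defaultdict


def sample_test_set(documents, production_shares, size, seed=0):
    """Draw a test set whose per-source proportions follow production_shares.

    documents: list of (doc_id, source) pairs.
    production_shares: {source: fraction of the production flow}.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for doc_id, source in documents:
        by_source[source].append(doc_id)

    test_set = []
    for source, share in production_shares.items():
        quota = round(size * share)  # number of documents to take from this source
        pool = by_source.get(source, [])
        if quota > len(pool):
            raise ValueError(
                f"Need {quota} documents from {source!r}, only {len(pool)} available"
            )
        test_set.extend(rng.sample(pool, quota))
    return test_set


# Hypothetical pool: 10 documents from one vendor, 20 from all others.
documents = [(f"a{i}.pdf", "vendor_a") for i in range(10)]
documents += [(f"o{i}.pdf", "other") for i in range(20)]
test_set = sample_test_set(documents, {"vendor_a": 0.3, "other": 0.7}, size=10)
print(len(test_set))
```

Because each quota is rounded separately, the total can differ from `size` by a document or two when the shares do not divide evenly.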
Blind Set Requirements
For a blind set, be sure to use documents that have not already been used for training or testing your skill. The extraction results obtained on a blind set will help you evaluate the quality of your skill.
Note: Be sure to use different documents for training and testing your skill.
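Whether the three sets really are distinct can be verified with a simple overlap check. The Python sketch below assumes documents are identified by file name, which is a simplification: identical files saved under different names would not be detected. The function `find_overlaps` and the file names are hypothetical.

```python
def find_overlaps(training, test, blind):
    """Return {(set_a, set_b): shared_documents} for every overlapping pair."""
    sets = {"training": set(training), "test": set(test), "blind": set(blind)}
    names = list(sets)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = sets[first] & sets[second]
            if shared:
                overlaps[(first, second)] = shared
    return overlaps


# Hypothetical file names; b.pdf appears in both the training and blind sets.
overlaps = find_overlaps(["a.pdf", "b.pdf"], ["c.pdf"], ["b.pdf", "d.pdf"])
print(overlaps)
```

An empty result means that no document is shared between any two of the sets.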
Configuring a Document Skill
After you create a Document skill on the start page, follow these steps to configure your skill:
- Click the settings button next to the skill name to view and adjust the skill settings.
- On the Documents tab, upload some documents.
- On the Fields tab, label the data fields from which values will be extracted, specifying their locations.
- On the Activities tab, configure the document processing flow.
- On the Results tab, test your skill to see how well it performs on sample documents.
- On the Publish tab, publish your skill.
After you configure and publish your Document skill, it becomes available in the Skill Catalog in ABBYY Vantage.
In the Skill Catalog, you can view and manage your skills, including built-in skills, read-only skills, and derived skills.