Document Type Variants
Documents of the same type almost always have identical sets of fields, validation rules, and structure. The variants of a single document type can differ slightly, depending, for example, on the year the document was issued. Documents of the same type can be processed by one Document skill trained on different variants of this document type.
Vantage and Advanced Designer can handle any number of variants within a document type:
- For hundreds of variants, skills trained using Online Learning in Vantage will be able to extract data almost flawlessly.
- For thousands of variants, skills trained using the Deep Learning activity will be able to extract data with an accuracy of approximately 80% to 90%, depending on the complexity of the document types.
- For the most essential variants of a document type, skills trained using the Fast Learning and/or the Extraction Rules activities will ensure accurate extraction of data from complex documents.
- For structured documents, which always have the same type of information in the exact same locations, we recommend using up to 10 variants. If a fixed form has many variants, we recommend you treat them all as different document types. For more information, see Processing structured documents.
Training and Testing a Document Skill
For best extraction results, we recommend training and testing a Document skill using three different document sets:
- Training set
- Test set
- Blind set (an additional test set that contains sample documents not included in any of the two sets above)
Training Set Requirements
For a training set, use a representative document set containing at least 2-3 sample documents for each variant. If there are many variants and the set does not contain at least one sample document of each, consider using the Deep Learning activity. This activity understands image patterns, the structure of documents, field contents, and surrounding labels, and can process variants that were not used in training.
The number of sample documents required depends on the activities you use in your Document skill:
- Deep Learning activity for semi-structured documents:
  - For high-variability documents: At least 200-300 sample documents (2-3 sample documents per variant) are required.
  - For low-variability documents: A minimum of 10 sample documents (2-3 sample documents per variant) is required.
- Segmentation activity:
  - For high-variability documents, we recommend having at least 100 sample documents.
  - For low-variability documents, we recommend having at least 20 sample documents.
- Deep Learning for NLP activity:
  - For high-variability documents: At least 150 sample documents (2-3 sample documents per variant) are required.
  - For low-variability documents: You can start training with one sample document, but at least 2-3 sample documents per variant are required.
Even if you don’t have the recommended number of sample documents, having one sample document per variant is better than none at all.
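The minimums above can be turned into a quick pre-training check. The following is a minimal sketch in Python, not part of Vantage or Advanced Designer: the function name, the activity and variability keys, and the idea of labeling each sample document with its variant are all assumptions made for illustration. The numbers are the lower bounds quoted in this section.

```python
from collections import Counter

# Lower bounds quoted in this section, keyed by (activity, variability).
# The key names are hypothetical labels chosen for this sketch.
MIN_TOTAL = {
    ("deep_learning", "high"): 200,
    ("deep_learning", "low"): 10,
    ("segmentation", "high"): 100,
    ("segmentation", "low"): 20,
    ("deep_learning_nlp", "high"): 150,
    ("deep_learning_nlp", "low"): 1,
}
MIN_PER_VARIANT = 2  # lower bound of the "2-3 per variant" guidance


def check_training_set(variant_labels, activity, variability):
    """Return a list of warnings; an empty list means the minimums are met.

    variant_labels holds one variant name per sample document.
    """
    warnings = []
    required = MIN_TOTAL[(activity, variability)]
    if len(variant_labels) < required:
        warnings.append(
            f"{len(variant_labels)} documents in set, at least {required} expected"
        )
    # The Segmentation guidance gives no per-variant figure, so skip it there.
    if activity != "segmentation":
        for variant, count in sorted(Counter(variant_labels).items()):
            if count < MIN_PER_VARIANT:
                warnings.append(
                    f"variant {variant!r} has {count} document(s), "
                    f"at least {MIN_PER_VARIANT} expected"
                )
    return warnings
```

For example, `check_training_set(["2022", "2022", "2023"], "deep_learning", "low")` would report both a short set and an under-represented `"2023"` variant. A warning here is only a prompt to collect more samples; as noted above, one sample per variant is still better than none.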
Test Set Requirements
For a test set, sample document distribution must be similar to that in the actual flow of documents in production. This will ensure that the accuracy estimate is valid. For example, if invoices from a particular vendor account for 30% of the production document flow, about 30% of the sample documents in the test set should be from that vendor. You can also achieve the required ratio by testing your skill on random samples of documents from the production document flow.
Blind Set Requirements
For a blind set, be sure to use documents that have not already been used for training or testing your skill. The extraction results obtained on a blind set will help you evaluate the quality of your skill.
Be sure to use different documents for training and testing your skill.
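Both set requirements, matching the production distribution and keeping the three sets disjoint, can be checked before any testing starts. The sketch below is a hypothetical helper in Python and assumes you can list a group label (for example, the vendor) and a unique identifier for each document. The function names and the 5-percentage-point tolerance are illustrative choices, not values from the product.

```python
from collections import Counter


def share_by_group(groups):
    """Map each group label to its fraction of the given list."""
    total = len(groups)
    return {group: count / total for group, count in Counter(groups).items()}


def distribution_gaps(production_groups, test_groups, tolerance=0.05):
    """Return groups whose test-set share differs from production by more
    than `tolerance`, as {group: test_share - production_share}."""
    production = share_by_group(production_groups)
    test = share_by_group(test_groups)
    gaps = {}
    for group in sorted(set(production) | set(test)):
        diff = test.get(group, 0.0) - production.get(group, 0.0)
        if abs(diff) > tolerance:
            gaps[group] = round(diff, 3)
    return gaps


def overlapping_ids(training_ids, test_ids, blind_ids):
    """Return document IDs that appear in more than one of the three sets."""
    training, test, blind = set(training_ids), set(test_ids), set(blind_ids)
    return (training & test) | (training & blind) | (test & blind)
```

With a production flow that is 30% vendor A, a test set that is 60% vendor A would be flagged by `distribution_gaps`, and any identifier returned by `overlapping_ids` marks a document to remove from all but one set.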
Configuring a Document Skill
After you create a Document skill on the start page, follow these steps to configure your skill:
- Click the settings button next to the skill name to view and adjust the skill settings.
- On the Documents tab, upload some documents.
- On the Fields tab, label the data fields from which values will be extracted, specifying their locations.
- On the Activities tab, configure the document processing flow.
- On the Results tab, test your skill to see how well it performs on sample documents.
- On the Publish tab, publish your skill.
