A Document skill lets you extract field values from structured and semi-structured documents of a single type. Documents of the same type share the same set of fields, validation rules, and structure. For example, invoices, agreements, and shipping lists are three different document types. Structured documents are forms where the location of the fields is the same on each document instance. Examples of structured documents include questionnaires, application forms, and tax return forms.
Tip: You can also create and edit skills for structured documents in Advanced Designer when you need to combine the processing of structured documents with other Vantage technologies.
Semi-structured documents have a specific set of fields whose labeling, number, and placement vary from document to document of the same type. A typical example of semi-structured documents is invoices issued by different companies, which vary in the number and formatting of line items. Each invoice will have an invoice number and total amount printed on it, but the exact location of this information will vary from invoice to invoice. To start training your Document skill, label the fields on one document. As you train your skill, the program will begin to automatically suggest field locations to facilitate the field labeling process.
Note: Currently, only one file can be processed by a Document skill as part of a single transaction. If you need to process several files, use the Extract activity of the Process skill.

Document type variants

Documents of a single type almost always have identical sets of fields, validation rules, and structure. The variants of a single document type can differ slightly, depending, for example, on the year the document was issued. Documents of a single type can be processed by one Document skill trained using different variants of this document type. Vantage and Advanced Designer can handle any number of variants within a single document type:
  • For hundreds of variants, skills trained using Online Learning in Vantage will be able to extract data almost flawlessly.
  • For thousands of variants, skills trained using the Deep Learning activity will be able to extract data with an accuracy of approximately 80% to 90%, depending on the complexity of the document types.
  • For the most essential variants of a document type, skills trained using the Fast Learning and/or the Extraction Rules activities will ensure accurate extraction of data from complex documents.
  • For structured documents, which always have the same type of information in the exact same locations, we recommend using up to 10 variants. If a fixed form has many variants, we recommend you treat them all as different document types.
When training and testing a skill, we recommend the following:
  • When training a skill, use a representative document set containing at least 2-3 documents of each variant. If there are a lot of variants and the set does not contain at least one document of every variant, then you can use the Deep Learning activity. It understands image patterns, the spatial structure of documents, field contents, and surrounding labels and can process variants which weren’t used for training.
  • When testing a skill, use a document distribution similar to that of the actual document flow in production: the percentage of documents of a specific variant in the test set should be representative of how often the variant appears in your document flow. This will ensure that the accuracy estimate is valid. To do so, test the skill using a random sample of documents from the actual document flow in production.
  • Even a single sample of a variant is better than no sample of that variant.
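
The training and testing recommendations above can be checked with a small script before you upload documents. The following is a minimal sketch in Python, not part of Vantage or Advanced Designer: it assumes you keep your own list of variant labels for your documents, and the function names, the variant labels, and the minimum of 2 documents per variant (the lower bound of the "2-3 documents" guideline) are all hypothetical choices made for illustration.

```python
import random
from collections import Counter

# Assumption: lower bound of the "at least 2-3 documents of each variant" guideline.
MIN_DOCS_PER_VARIANT = 2


def underrepresented_variants(training_labels, known_variants,
                              minimum=MIN_DOCS_PER_VARIANT):
    """Return {variant: count} for variants with fewer than `minimum` training documents."""
    counts = Counter(training_labels)  # missing variants count as 0
    return {v: counts[v] for v in known_variants if counts[v] < minimum}


def sample_test_set(production_docs, size, seed=None):
    """Draw a simple random sample from the production flow.

    A uniform random sample keeps each variant at roughly the share
    it has in production, which is what the testing guideline asks for.
    """
    rng = random.Random(seed)
    return rng.sample(production_docs, min(size, len(production_docs)))


def variant_shares(labels):
    """Return the fraction of documents that belongs to each variant."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {v: n / total for v, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical variant labels: the year a form was issued.
    training = ["2021", "2021", "2022", "2023", "2023", "2023"]
    print(underrepresented_variants(training, ["2021", "2022", "2023", "2024"]))
    # {'2022': 1, '2024': 0}
```

Variants reported by underrepresented_variants are candidates for collecting more samples. Comparing variant_shares of the sampled test set with variant_shares of the full production flow gives a quick check that the test set is representative.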