Processing Semi-structured Documents

Advanced Designer is intended for extracting data from complex sets of semi-structured documents (for example, sets that contain many vastly different document variants). In such cases, the document processing flow includes activities designed specifically for extracting data from semi-structured documents.

New Document Variants Can Appear After Development

Suppose that you have to process documents of the same type with varying layouts, and you can’t provide all document variants during skill development. This may be the case when you are creating a skill to process invoices from various suppliers. Typically, each supplier has its own invoice template, and you can be certain that new templates will appear in the future. If you have a sufficient quantity of document samples, you can use a Deep Learning activity followed by a Fast Learning activity. The Deep Learning activity handles unforeseen document variants, while the Fast Learning activity learns the specific document variants the customer has provided, which further improves extraction quality for those documents. The Fast Learning activity can also be trained via the Online Learning feedback loop from manual review.
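The division of labor between the two activities can be illustrated with a small conceptual sketch. This is not Vantage code and does not reflect how the activities are implemented: `generic_extract`, `VariantLearner`, the label names, and the toy document format (a dictionary of printed label to value) are all hypothetical. It only shows the idea of a general extractor that copes with unseen layouts, followed by a per-variant learner that is updated from reviewer feedback.

```python
from typing import Dict, Tuple

Fields = Dict[str, str]
Document = Dict[str, str]  # toy format: printed label -> value

# Hypothetical labels a general-purpose model might recognize on any layout.
GENERIC_LABELS: Dict[str, Tuple[str, ...]] = {
    "invoice_number": ("Invoice No", "Invoice #"),
    "total": ("Total", "Amount Due"),
}


def generic_extract(doc: Document) -> Fields:
    """Stand-in for the general stage: works on unseen variants, may miss fields."""
    result: Fields = {}
    for field_name, labels in GENERIC_LABELS.items():
        for label in labels:
            if label in doc:
                result[field_name] = doc[label]
                break
    return result


class VariantLearner:
    """Stand-in for the per-variant stage: remembers reviewer corrections."""

    def __init__(self) -> None:
        self._labels: Dict[str, Dict[str, str]] = {}

    def learn(self, variant: str, field_name: str, label: str) -> None:
        # Feedback from manual review: this variant prints the field under `label`.
        self._labels.setdefault(variant, {})[field_name] = label

    def refine(self, variant: str, doc: Document, baseline: Fields) -> Fields:
        # Known variants get learned fields layered over the baseline;
        # unknown variants pass through unchanged.
        refined = dict(baseline)
        for field_name, label in self._labels.get(variant, {}).items():
            if label in doc:
                refined[field_name] = doc[label]
        return refined


learner = VariantLearner()
doc = {"Rechnung Nr": "A-17", "Total": "99.00"}
baseline = generic_extract(doc)          # finds only the total
learner.learn("acme", "invoice_number", "Rechnung Nr")
refined = learner.refine("acme", doc, baseline)
```

In this toy run the general stage misses the invoice number because the label is unfamiliar, and the per-variant stage recovers it once a reviewer has corrected a single document from that supplier.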

Steps for Creating a Document Skill

Open Advanced Designer. Create a new skill by clicking Create Document Skill on the start page.
Use the Documents tab that will open to upload documents that will be used to set up your skill.
Once you have uploaded your documents, navigate to the Fields tab and set up the field structure for the skill by creating and configuring the fields that the skill will extract. Label documents in the Reference section.
Navigate to the Activities tab and add a Deep Learning activity for semi-structured documents to the document processing flow.
Open the Activity Editor to configure and train the Deep Learning activity. Keep in mind that the document set used for training this activity should contain at least 100 labeled documents.
Return to the Activities tab and add a Fast Learning activity to the document processing flow.
Open the Activity Editor to configure and train the activity.
Test your skill by clicking Test Skill Using Selected Documents and analyze the results you obtain.
Once testing results are good enough, publish your skill.

Some Documents Contain Structures That Can’t Be Extracted Using Machine Learning

Suppose that the majority of document variants in your document set can be handled with Deep Learning and Fast Learning activities. Still, a few documents may have nested tables or differ completely in some other way from all other documents used for training. To handle such documents, you need to separate them from the main document set using a classification activity:

Use the Classify By Company activity if the document variants are issued by different companies, and the company name and/or address is printed on the document. For example, when processing bank statements from different banks, you can easily provide a database list of those banks, which takes care of all variants that should be handled separately.
Use the Classify By Text and Image activity in all other cases. This multimodal classification technology uses text, spatial structure, and image patterns to distinguish different document variants from one another, so it will easily recognize deviating document variants.

Use an IF activity to branch the document processing flow and separate document variants with poor processing quality (for instance, as mentioned earlier, documents with nested tables), and then use an Extraction Rules activity to extract targeted fields and tables from such documents.

IF with Deep Learning and Extraction Rules
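As a conceptual illustration of the branching described above, the sketch below routes a document either to a machine-learning path or to a rule-based path depending on its class. It is not Vantage code: `classify`, `ml_extract`, `rules_extract`, and the `layout` key are hypothetical placeholders for the Classify, Deep Learning, and Extraction Rules activities.

```python
from typing import Dict

Document = Dict[str, str]
Fields = Dict[str, str]


def classify(doc: Document) -> str:
    """Toy classifier: a real one would inspect text, layout, and image patterns."""
    return "nested_tables" if doc.get("layout") == "nested" else "standard"


def ml_extract(doc: Document) -> Fields:
    """Placeholder for the machine-learning branch."""
    return {"method": "machine_learning"}


def rules_extract(doc: Document) -> Fields:
    """Placeholder for the branch driven by hand-written extraction rules."""
    return {"method": "extraction_rules"}


def process(doc: Document) -> Fields:
    # The IF step: variants that machine learning handles poorly
    # are sent to the rule-based branch, everything else stays on the main path.
    if classify(doc) == "nested_tables":
        return rules_extract(doc)
    return ml_extract(doc)
```

Keeping the classifier separate from both extractors means a new deviating variant only needs a new class and a new branch, while the main path stays untouched.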

Steps for Creating a Document Skill

Open Advanced Designer. Create a new skill by clicking Create Document Skill on the start page.
Use the Documents tab that will open to upload documents that will be used to set up your skill. To make sure that your document set is sufficient for setting up a classifier, add a roughly equal number of documents for each variant.
Once you have uploaded your documents, navigate to the Fields tab and set up the field structure for the skill by creating and configuring the fields that the skill will extract. Label documents in the Reference section.
Navigate to the Activities tab and add a Classify activity to the document processing flow.
Open the Activity Editor and set up the Classify activity. To do so, create a corresponding class for each variant, assign these classes to your documents, and train the activity.
Return to the Activities tab and set up conditional branching for the processing flow by adding an IF activity, as well as separate activities to process each document variant.
Set up the activities you created.
Test your skill by clicking Test Skill Using Selected Documents and analyze the results you obtain.
Once testing results are good enough, publish your skill.
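Before uploading, it can help to check that the document set really does contain a roughly equal number of documents for each variant, as recommended in the steps above. The following is a minimal sketch of such a check run outside Vantage. The function name and the `max_ratio` threshold of 2.0 are assumptions chosen for illustration, not values taken from the product.

```python
from collections import Counter
from typing import Iterable


def is_roughly_balanced(labels: Iterable[str], max_ratio: float = 2.0) -> bool:
    """Return True if the largest class is at most `max_ratio` times the smallest.

    `labels` holds one variant name per document. The default ratio is an
    arbitrary illustrative threshold.
    """
    counts = Counter(labels)
    if not counts:
        return False  # an empty set cannot train a classifier
    return max(counts.values()) <= max_ratio * min(counts.values())


sample = ["standard"] * 10 + ["nested_tables"] * 8
balanced = is_roughly_balanced(sample)
```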

You Don’t Have Enough Documents to Use Machine Learning

Suppose that you have to extract data from a small number of document variants, but you don’t have enough documents to train a Deep Learning activity. However, you have expert knowledge that allows you to describe the main principles of data extraction for each document variant. For example, if you are creating a skill to process tax forms for different years, you can split up all your documents into different variants using a Classify activity. It should be followed by a set of Extraction Rules activities, where each activity is tailored to a certain document variant. Add a Fast Learning activity if you want Vantage to further train your skill.
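The pattern of one classifier followed by a rule set per variant can be sketched as a lookup table from class name to extraction function. This is a conceptual illustration only, not Vantage code: the class names, the form labels ("Box 1", "Line 1a"), and the regular expressions are invented for the example.

```python
import re
from typing import Callable, Dict

Fields = Dict[str, str]


def extract_2022(text: str) -> Fields:
    # Hypothetical rule: the 2022 form prints income after "Box 1:".
    match = re.search(r"Box 1:\s*(\d+)", text)
    return {"income": match.group(1)} if match else {}


def extract_2023(text: str) -> Fields:
    # Hypothetical rule: the 2023 form prints income after "Line 1a:".
    match = re.search(r"Line 1a:\s*(\d+)", text)
    return {"income": match.group(1)} if match else {}


# Maps the class assigned by the classifier to the rule set for that variant.
RULES_BY_CLASS: Dict[str, Callable[[str], Fields]] = {
    "form_2022": extract_2022,
    "form_2023": extract_2023,
}


def extract(doc_class: str, text: str) -> Fields:
    # Classes without a dedicated rule set are skipped and yield no fields.
    rule = RULES_BY_CLASS.get(doc_class)
    return rule(text) if rule else {}
```

Each rule set encodes expert knowledge about one variant, so no training data is needed beyond what the classifier requires.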

Steps for Creating a Document Skill

Open Advanced Designer. Create a new skill by clicking Create Document Skill on the start page.
Use the Documents tab that will open to upload documents that will be used to set up your skill. To make sure that your document set is sufficient for setting up a classifier, add a roughly equal number of documents for each variant.
Once you have uploaded your documents, navigate to the Fields tab and set up the field structure for the skill by creating and configuring the fields that the skill will extract. Label documents in the Reference section.
Navigate to the Activities tab and add a Classify activity to the document processing flow.
Open the Activity Editor and set up the Classify activity. To do so, create a corresponding class for each variant, assign these classes to your documents, and train the activity.
Return to the Activities tab and create an Extraction Rules activity. Add other Extraction Rules activities to this workflow item. Set up branching conditions by selecting the field filled by the Classify activity and mapping its values to Extraction Rules activities. You can also skip this step for documents of certain classes that don’t require special extraction rules.
Set up the extraction activities you created.
Test your skill by clicking Test Skill Using Selected Documents and analyze the results you obtain.
Once testing results are good enough, publish your skill.
