Processing Unstructured Documents

Skills for processing unstructured documents can only be created in Advanced Designer. The document processing flow of such a skill includes activities that extract data using natural language processing (NLP). Each of the activities listed below supports a limited number of languages. The supported languages are listed on the page for each activity:

Segmentation activity
Deep Learning for NLP activity
Named Entities (NER) activity
Address Parsing activity

Extracting Pre-trained Named Entities from the Whole Document

Suppose that you need to create a Document skill that extracts company names and addresses from unstructured documents, such as letters. To extract these entities, set up a Named Entities (NER) activity. If an address also needs to be split into components (street, city, state, country, and postal code) that are stored in separate fields, set up an Address Parsing activity as well.

Steps for Creating a Document Skill

Open Advanced Designer. Create a new skill by clicking Create Document Skill on the start page.
On the Documents tab that opens, upload the documents that will be used to set up your skill.
Once you have uploaded your documents, navigate to the Fields tab and set up the field structure for the skill by creating and configuring the fields that the skill will extract. Label documents in the Reference section.
Navigate to the Activities tab. Create a Named Entities (NER) activity and specify fields that will be used to store extracted named entities. Map the named entities to the selected fields.
If you have a field that contains an address and you want to split the address into components, create an Address Parsing activity and specify fields that will be used to store extracted address components. Map the address components to the selected fields.
Test your skill by clicking Test Skill Using Selected Documents and analyze the results.
Once the test results are satisfactory, publish your skill.
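
The mapping step above, where each extracted named entity is assigned to a skill field, can be pictured with a small Python sketch. This is only an illustration of the idea and not the Advanced Designer API: the function map_entities_to_fields, the entity type names, and the field names are all hypothetical, and keeping only the first entity per field is an assumption made to keep the example short.

```python
from typing import Dict, List, Tuple

# Address components named in this article (field names are hypothetical).
ADDRESS_COMPONENTS = ("street", "city", "state", "country", "postal_code")


def map_entities_to_fields(
    entities: List[Tuple[str, str]],
    mapping: Dict[str, str],
) -> Dict[str, str]:
    """Assign extracted entities to skill fields.

    entities: (entity_type, text) pairs, in document order.
    mapping:  entity type -> name of the field that stores it.
    Unmapped entity types are ignored. Assumption: only the first
    entity found for each field is kept.
    """
    fields: Dict[str, str] = {}
    for entity_type, text in entities:
        target = mapping.get(entity_type)
        if target is not None and target not in fields:
            fields[target] = text
    return fields


# Hypothetical entities found in a letter.
found = [
    ("Organization", "Acme Corp"),
    ("Address", "1 Main St, Springfield"),
    ("Person", "J. Doe"),
]
letter_fields = map_entities_to_fields(
    found, {"Organization": "CompanyName", "Address": "CompanyAddress"}
)
```

In the product itself, this mapping is configured in the activity settings rather than written as code.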

Extracting Pre-trained Named Entities from Certain Paragraphs

Suppose that the named entity you want to extract is always located in the same paragraph. For example, to extract an amount of money from the purchase price paragraph of a sales and purchase agreement, first use a Segmentation activity to extract the target paragraph, and then use a Named Entities (NER) activity to extract the target field. The target data must be a named entity supported by the Named Entities (NER) or Address Parsing activity, for example, a name, an address, or a date.

You can also extract the target paragraph using a Fast Learning or an Extraction Rules activity. To do so, first make sure that the activity extracts the text chunk correctly, and then create and set up a Named Entities (NER) or an Address Parsing activity.

If the target paragraph also contains other named entities of the same type that shouldn’t be extracted, see the Extracting Custom Named Entities scenario below.

Pre-trained activities are a good starting point because they are easy to configure and don’t require training. However, a neural network trained on your own documents may provide higher extraction accuracy. If you have an extensive document set, you may also want to try the Extracting Custom Named Entities scenario and choose the approach that performs better on your documents.

Segmentation with NER and Address Parsing

Steps for Creating a Document Skill

Open Advanced Designer. Create a new skill by clicking Create Document Skill on the start page.
On the Documents tab that opens, upload the documents that will be used to set up your skill.
Once you have uploaded your documents, navigate to the Fields tab and set up the field structure for the skill by creating and configuring the fields that the skill will extract. Label documents in the Reference section.
Navigate to the Activities tab, create a Segmentation activity, and specify the fields that will be used to store target paragraphs.
Open the Activity Editor, then set up and train the Segmentation activity.
Return to the Activities tab, create a Named Entities (NER) activity and specify a source field, as well as fields that will be used to store extracted named entities. Map named entities to the selected fields.
If you have a field that contains an address and you want to split the address into components, create an Address Parsing activity and specify a source field, as well as fields that will be used to store extracted address components. Map the address components to the selected fields.
Test your skill by clicking Test Skill Using Selected Documents and analyze the results.
Once the test results are satisfactory, publish your skill.
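
The reason for running segmentation first can be shown with a toy Python sketch. It is not how the Segmentation or Named Entities (NER) activities work internally: here a keyword search stands in for the trained segmentation model and a regular expression stands in for the pre-trained entity extractor. All names in it (find_target_paragraph, extract_amounts, AMOUNT_PATTERN) and the sample agreement text are hypothetical.

```python
import re
from typing import List, Optional

# Stand-in for a pre-trained "money" entity: a currency marker followed by a number.
AMOUNT_PATTERN = re.compile(r"(?:USD|EUR|\$|€)\s?\d[\d,]*(?:\.\d+)?")


def find_target_paragraph(document_text: str, keyword: str) -> Optional[str]:
    """Stand-in for segmentation: return the first paragraph that mentions
    the keyword (case-insensitive). Paragraphs are separated by blank lines."""
    for paragraph in document_text.split("\n\n"):
        paragraph = paragraph.strip()
        if paragraph and keyword.lower() in paragraph.lower():
            return paragraph
    return None


def extract_amounts(text: str) -> List[str]:
    """Stand-in for entity extraction: return every money amount in the text."""
    return AMOUNT_PATTERN.findall(text)


agreement = (
    "1. Parties. The Seller agrees to sell the property to the Buyer.\n\n"
    "2. Purchase Price. The Buyer shall pay USD 250,000.00 at closing.\n\n"
    "3. Fees. A late fee of $500 applies to overdue payments."
)

# Extraction over the whole document also picks up the late fee.
all_amounts = extract_amounts(agreement)

# Extraction restricted to the target paragraph returns only the purchase price.
target = find_target_paragraph(agreement, "purchase price")
price_amounts = extract_amounts(target) if target else []
```

Restricting extraction to the target paragraph removes the unrelated amount without changing the extractor, which is the role the source field plays in the steps above.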

Extracting Custom Named Entities

Suppose that you need to extract the name of one organization from a paragraph that contains information about both parties to an agreement, and that you also need to extract an e-mail address. As in the previous scenario, first use a Segmentation activity to extract the target paragraph. However, a Named Entities (NER) activity isn’t suitable here: it would extract the names of both organizations from the target paragraph, and it isn’t trained to extract e-mail addresses. Use a Deep Learning for NLP activity instead.

You can also use this scenario to improve extraction accuracy for pre-trained named entities. Test both the pre-trained Named Entities (NER) activity and the Deep Learning for NLP activity, and then choose the one that performs better on your documents.

Keep in mind that you need a large number of documents to use the Deep Learning for NLP activity: the minimum is 50 documents, but at least 150 documents are recommended.
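
The document-count guidance and the comparison of the two activities can be sketched in Python as follows. The thresholds of 50 and 150 documents come from the text above. Everything else is an assumption made for illustration: the function names, the exact-match accuracy measure, and the sample values are hypothetical and are not part of Advanced Designer.

```python
from typing import Dict, List

MIN_DOCUMENTS = 50           # minimum stated in this article
RECOMMENDED_DOCUMENTS = 150  # recommended amount stated in this article


def training_set_status(document_count: int) -> str:
    """Classify a document set against the thresholds above."""
    if document_count < MIN_DOCUMENTS:
        return "too few"
    if document_count < RECOMMENDED_DOCUMENTS:
        return "minimum met"
    return "recommended"


def accuracy(predicted: List[str], expected: List[str]) -> float:
    """Share of documents where the extracted value exactly matches the label.
    Exact matching is an assumption; other measures may suit your data better."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def better_activity(scores: Dict[str, float]) -> str:
    """Return the name of the activity with the highest score."""
    return max(scores, key=lambda name: scores[name])


# Hypothetical labels and extraction results for four test documents.
labels = ["Acme Corp", "Globex", "Initech", "Umbrella"]
scores = {
    "Named Entities (NER)": accuracy(["Acme Corp", "Hooli", "Initech", "Soylent"], labels),
    "Deep Learning for NLP": accuracy(["Acme Corp", "Globex", "Initech", "Soylent"], labels),
}
choice = better_activity(scores)
```

In practice, the comparison would be based on the results shown after testing the skill on your own labeled documents.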

Steps for Creating a Document Skill

Open Advanced Designer. Create a new skill by clicking Create Document Skill on the start page.
On the Documents tab that opens, upload the documents that will be used to set up your skill.
Once you have uploaded your documents, navigate to the Fields tab and set up the field structure for the skill by creating and configuring the fields that the skill will extract. Label documents in the Reference section.
Navigate to the Activities tab, create a Segmentation activity, and specify the fields that will be used to store target paragraphs.
Open the Activity Editor, then set up and train the Segmentation activity.
Return to the Activities tab, create a Deep Learning for NLP activity, and specify the fields that should be extracted by this activity.
Open the Activity Editor, then set up and train the Deep Learning for NLP activity.
Test your skill by clicking Test Skill Using Selected Documents and analyze the results.
Once the test results are satisfactory, publish your skill.
