
Extracting Data from a Mixed Document Set (Semi-structured and Unstructured)

Suppose that a single Document skill has to process both semi-structured and unstructured documents. In this case, first classify the documents by type using the Classify By Text and Image activity. This activity combines textual and geometric features, so it can classify even low-quality images and documents of different classes that can only be told apart by graphic objects, such as signatures or seals. Then use an IF activity to branch the document processing flow and separate the unstructured documents from the semi-structured ones. Each branch can be processed using one of the scenarios from the Processing semi-structured documents and Processing unstructured documents sections. For example, semi-structured documents can be processed by a Fast Learning activity, while unstructured documents can be processed by a combination of a Segmentation activity and a Deep Learning activity for NLP. Because all of these documents belong to the same type, they have the same set of output fields.

Figure: Mixed Document Processing Flow

Steps for Creating a Document Skill

  1. Open Advanced Designer. Create a new skill by clicking Create Document Skill on the start page.
  2. Use the Documents tab that will open to upload documents that will be used to set up your skill. To make sure that your document set is sufficient for setting up a classifier, add a roughly equal number of documents for each variant.
  3. Once you have uploaded your documents, navigate to the Fields tab and set up the field structure for the skill by creating and configuring the fields that the skill will extract. Label documents in the Reference section.
  4. Navigate to the Activities tab and add a Classify activity to the document processing flow.
  5. Open the Activity Editor and set up the Classify activity. To do so, create a corresponding class for each variant, assign these classes to your documents, and train the activity.
  6. Return to the Activities tab and set up conditional branching for the processing flow by adding an IF activity, as well as separate activities to process each document variant.
  7. Set up and train the activities you created.
  8. Test your skill by clicking Test Skill Using Selected Documents and analyze the results you obtain.
  9. Once the test results are satisfactory, publish your skill.
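
The classify-then-branch flow described in this section is configured in the Advanced Designer interface, not in code. Purely as a conceptual illustration, the same routing logic can be sketched in Python. Everything here is hypothetical: the Document class, the keyword-based classify function, and the two branch functions are simple stand-ins for trained activities and are not part of any product API.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Document:
    text: str
    fields: Dict[str, str] = field(default_factory=dict)


def classify(doc: Document) -> str:
    # Hypothetical stand-in for a trained classifier: a "Total:" label at the
    # start of a line is treated as a sign of a semi-structured layout.
    if re.search(r"^Total:", doc.text, re.MULTILINE):
        return "semi_structured"
    return "unstructured"


def process_semi_structured(doc: Document) -> None:
    # Stand-in for the semi-structured branch: read the value next to a label.
    match = re.search(r"^Total:\s*(\S+)", doc.text, re.MULTILINE)
    doc.fields["total"] = match.group(1) if match else ""


def process_unstructured(doc: Document) -> None:
    # Stand-in for the unstructured branch: find the amount in running text.
    match = re.search(r"total of (\S+)", doc.text)
    doc.fields["total"] = match.group(1) if match else ""


# The IF step: one handler per class. Both handlers fill the same output field.
BRANCHES: Dict[str, Callable[[Document], None]] = {
    "semi_structured": process_semi_structured,
    "unstructured": process_unstructured,
}


def run_skill(doc: Document) -> Dict[str, str]:
    BRANCHES[classify(doc)](doc)
    return doc.fields
```

Both branches write to the same "total" field, which mirrors the point that all documents handled by the skill share one set of output fields regardless of which branch processed them.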

Extracting Text from Table Cells in Semi-structured Documents

Suppose that you are extracting data from semi-structured documents with tables, and you need to extract not just the text of each cell, but also specific numerical values embedded within a cell’s text. For example, if you need to extract information about a borrower from a Closing Disclosure document, you can use a Fast Learning activity, which is intended for semi-structured documents, to extract the entire text of the targeted table cell. You can then use activities for unstructured documents (Named Entities (NER) and Address Parsing in this case) to extract the name of the borrower and a portion of their address from within that cell.

Figure: Fast Learning with NER and Address Parsing

Steps for Creating a Document Skill

  1. Open Advanced Designer. Create a new skill by clicking Create Document Skill on the start page.
  2. Use the Documents tab that will open to upload documents that will be used to set up your skill.
  3. Once you have uploaded your documents, navigate to the Fields tab and set up the field structure for the skill by creating and configuring the fields that the skill will extract. Label documents in the Reference section.
  4. Navigate to the Activities tab, create a Fast Learning activity, and specify the fields that will be extracted by this activity.
  5. Open the Activity Editor, set up and train the Fast Learning activity.
  6. Return to the Activities tab, create a Named Entities (NER) activity and specify a source field, as well as fields that will be used to store extracted named entities. Map the named entities to the selected fields.
  7. If you have a field that contains an address and you want to split the address into components, create an Address Parsing activity and specify a source field, as well as fields that will be used to store extracted address components. Map the address components to the selected fields.
  8. Test your skill by clicking Test Skill Using Selected Documents and analyze the results you obtain.
  9. Once the test results are satisfactory, publish your skill.
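
The two-stage idea above (first take the whole cell, then pull entities out of its text) can be illustrated with a small Python sketch. This is not how the NER or Address Parsing activities work internally. The functions, the regular expression, the assumed US-style address format, and the sample borrower data are all invented for illustration.

```python
import re
from typing import Dict


def extract_cell_text(table: Dict[str, str], row_label: str) -> str:
    # Stage 1 stand-in: return the entire text of the targeted table cell.
    return table.get(row_label, "")


def parse_borrower_cell(cell_text: str) -> Dict[str, str]:
    # Stage 2 stand-in: treat the first line as the name and look for a
    # "..., City, ST 12345" pattern in the remaining lines (an assumed format).
    lines = [line.strip() for line in cell_text.splitlines() if line.strip()]
    result = {"name": lines[0] if lines else "", "city": "", "state": "", "zip": ""}
    address = " ".join(lines[1:])
    match = re.search(r",\s*([^,]+),\s*([A-Z]{2})\s+(\d{5})(?:-\d{4})?$", address)
    if match:
        result.update(city=match.group(1).strip(), state=match.group(2), zip=match.group(3))
    return result


# Fictitious sample data, not taken from a real document.
table = {"Borrower": "John Smith\n123 Main Street, Springfield, IL 62704"}
borrower = parse_borrower_cell(extract_cell_text(table, "Borrower"))
```

Here the source field of the second stage is the output of the first, which is the same relationship you set up when you specify a source field for the Named Entities (NER) and Address Parsing activities.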

Extracting Data from Unstructured Documents with Tables, Titles, Headers, and Footers

Suppose that you have to extract data from unstructured documents (e.g. contracts) that contain tables, titles, headers, or footers. In this case, set up a Segmentation activity to detect continuous paragraphs of text and an Extraction Rules activity to detect semi-structured inserts. Once the required document fragments have been detected, use the appropriate activities to extract fields from those fragments.

Figure: Sample Mixed Document

Steps for Creating a Document Skill

  1. Open Advanced Designer. Create a new skill by clicking Create Document Skill on the start page.
  2. Use the Documents tab that will open to upload documents that will be used to set up your skill.
  3. Once you have uploaded your documents, navigate to the Fields tab and set up the field structure for the skill by creating and configuring the fields that the skill will extract. Label documents in the Reference section.
  4. Navigate to the Activities tab, create a Segmentation activity, and specify the fields that will be used to store paragraphs of plain text.
  5. Open the Activity Editor, set up and train the Segmentation activity.
  6. Return to the Activities tab, create an Extraction Rules activity and specify the fields that will be used to store data from semi-structured fragments of the document.
  7. Open the Activity Editor, set up and test the Extraction Rules activity.
  8. Test your skill by clicking Test Skill Using Selected Documents and analyze the results you obtain.
  9. Once the test results are satisfactory, publish your skill.
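
As a rough conceptual illustration of separating continuous text from semi-structured inserts, the Python sketch below splits plain text into paragraphs and table-like rows. It is a simple heuristic written for this example, not the behavior of the Segmentation or Extraction Rules activities: it assumes that a table row is a line with at least three columns separated by tabs or runs of two or more spaces.

```python
import re
from typing import Dict, List


def split_fragments(text: str) -> Dict[str, List[str]]:
    fragments: Dict[str, List[str]] = {"paragraphs": [], "table_rows": []}
    buffer: List[str] = []

    def flush() -> None:
        # Close the paragraph collected so far, if any.
        if buffer:
            fragments["paragraphs"].append(" ".join(buffer))
            buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            flush()  # a blank line ends the current paragraph
        elif len(re.split(r"\s{2,}|\t", stripped)) >= 3:
            flush()  # a table-like row also ends the current paragraph
            fragments["table_rows"].append(stripped)
        else:
            buffer.append(stripped)
    flush()
    return fragments


sample = (
    "This Agreement is made between\n"
    "the parties named below.\n"
    "\n"
    "Item    Qty    Price\n"
    "Widget    2    10.00\n"
    "\n"
    "Payment is due in 30 days."
)
fragments = split_fragments(sample)
```

Once the two kinds of fragments are separated, each can be passed to whichever extraction step suits it, which is the purpose of pairing the two activities in this scenario.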