It may be difficult to specify field extraction properties when a single Document skill needs to process documents of the same type whose field placement varies significantly. For example, the same skill can be used to process invoices from different vendors, where the same fields are placed in locations that differ from vendor to vendor. To improve the extraction quality of such a skill, you can sort its documents into classes, which are subgroups of a single document type that share common properties, and set up a separate extraction activity for each class.
Classifying documents into classes may also be required when you need to improve extraction quality for just one of the classes. For example, a single skill may be used to process bank statements compiled by different banks, and one statement type may have a lower extraction quality than the rest. To improve the extraction quality of that skill, you can sort the statements into classes and set up an Extraction Rules activity for the class with unsatisfactory extraction quality.
The Classify By Text and Image activity is designed to sort a skill’s documents into classes that require their own extraction activities to be created and set up.

Setup Overview

To create and set up a Classify By Text and Image activity, follow these steps:
  1. Create a Classify By Text and Image activity in the document processing flow.
  2. Upload images, create classes, assign expected classes to documents.
  3. Train the activity and analyze training results.
  4. Modify properties if classification results need to be improved.

Creating and Setting Up Using the Activities Tab

Create a Classify By Text and Image activity in the workflow. When it is created, a field that records the classification results is added to the skill structure. The value of this field is used to classify documents. The field is displayed in the skill field structure, but it is marked as hidden and cannot be edited.
Note: A Classify By Text and Image activity does not return a confidence value for a class; it returns only the class name.
To navigate to the Activity Editor, click Activity Editor or double-click the activity block.

Setting Up Using the Activity Editor

Step 1: Upload Documents

Upload documents that will be used to set the activity up by clicking Upload in the toolbar and selecting an upload method:
  a. Upload Documents… Use the dialog box that will open to select the appropriate documents. The selected documents will be displayed in the No Class list.
  b. Upload Folder Like Classes… Use the dialog box that will open to select a folder that contains subfolders with images. Each subfolder should contain images of a single class. Uploading documents this way will automatically create classes that correspond to subfolders, with documents in those respective subfolders classified to be of that class. As such, you will not need to manually create classes in the Activity Editor.
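The folder layout expected by Upload Folder Like Classes can be checked locally before uploading. The following is a minimal sketch and not part of the product: the helper name classes_from_folder and the list of file extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image/document extensions; adjust to the formats you use.
IMAGE_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def classes_from_folder(root):
    """Map each subfolder name (the future class) to the file names in it.

    Hypothetical helper: it only mirrors the layout described above, where
    every subfolder of `root` holds images of a single class.
    """
    result = {}
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue  # files placed directly in the root belong to no class
        result[sub.name] = sorted(
            p.name for p in sub.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
        )
    return result
```

Reviewing the returned mapping makes it easy to spot an empty subfolder or a file that sits outside any subfolder before the upload creates classes from it.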

Step 2: Create Classes

Create classes that correspond to the different types of documents being processed by clicking either Create Class in the toolbar or Create in the Assign class pane. If your documents were uploaded using Upload Folder Like Classes, make sure that all required classes have been created.

Step 3: Classify Documents

Classify your documents using one of the following methods:
  • Select all documents of a single class in the list and click an appropriate class name in the Assign class pane.
  • If an appropriate class has not been created yet, select all appropriate documents in the list and create a class by clicking either Create Class in the toolbar or Create in the Assign class pane.
  • Select all documents of a single class and drag them to the list that corresponds to that class.

Additional Options

If required, you can change the orientation of document pages using the Rotate drop-down menu on the toolbar. You can select one of the following options: Rotate All Pages Left, Rotate All Pages Right, or Rotate All Pages 180°.
To switch view modes, use the following buttons in the toolbar:
  • List view. Displays documents as a list.
  • Thumbnail view. Displays documents as thumbnails.
To view the full image for a document displayed in thumbnail view, use the preview button.

Training a Classifier and Viewing Classification Results

Once documents have been classified, train your activity using the Train Activity button. After training has finished, statistics regarding the classification results will be displayed on the Results tab. Analyzing these statistics helps identify problem classes and evaluate the general quality of the classifier.

General Statistics

The top pane displays general statistics for all documents and classes of the activity. These statistics help evaluate the general quality of your classifier:
  • Accuracy. The percentage of documents whose expected class matches the class assigned by the program.
  • F-Measure. A single score that combines precision and recall (completeness), used to evaluate both at once.
  • Recall. The ratio of documents correctly classified as a specific class to all documents of that class.
  • Precision. The ratio of documents correctly classified as a specific class to all documents classified as that class (both correctly and incorrectly).
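The definitions above can be sketched in a few lines of Python. This is an illustration only. It assumes that every document has an expected and an assigned class, and that the F-measure is the harmonic mean of precision and recall (F1). The source does not state which formula the product uses or how per-class values are combined into the general figures. The function names are hypothetical.

```python
def accuracy(expected, assigned):
    """Share of documents whose assigned class equals the expected class."""
    if not expected:
        return 0.0
    matches = sum(1 for e, a in zip(expected, assigned) if e == a)
    return matches / len(expected)


def precision_recall_f(expected, assigned, cls):
    """Precision, recall and F-measure for one class.

    The F-measure here is the harmonic mean of precision and recall (F1);
    this is an assumption, not a statement about the product's formula.
    """
    true_pos = sum(1 for e, a in zip(expected, assigned) if e == cls and a == cls)
    assigned_cls = sum(1 for a in assigned if a == cls)  # correct + incorrect
    expected_cls = sum(1 for e in expected if e == cls)  # all docs of the class
    precision = true_pos / assigned_cls if assigned_cls else 0.0
    recall = true_pos / expected_cls if expected_cls else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, with expected classes A, A, A, B, B and assigned classes A, A, B, B, B, four of five documents match, so accuracy is 0.8. Class A has a precision of 1.0 (both documents assigned to A are correct) and a recall of 2/3 (two of the three A documents were found).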

Class-Specific Statistics

On the Classes pane, you can view statistics for each class. For each class, the percentage of documents with the expected class matching the class assigned by the program is displayed, as well as the number of documents with correctly and incorrectly assigned classes. To view documents with incorrectly assigned classes, select an appropriate class in the Classes pane and expand the incorrectly assigned document list (displayed in red). Analyzing these documents should help you understand why the program assigned a document a class that differs from the expected class. This often happens if the expected class was assigned incorrectly to begin with, or if documents of different classes are overly similar.
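The per-class breakdown described above can be reproduced outside the product for your own review. The sketch below assumes a hypothetical record shape, a list of (name, expected class, assigned class) tuples, and groups documents by their expected class. Whether the Classes pane groups by expected or by assigned class is not stated in the source, so treat that choice as an assumption.

```python
from collections import defaultdict


def per_class_report(documents):
    """Count correct documents and list incorrect ones per expected class.

    `documents` is a list of (name, expected_class, assigned_class) tuples;
    this record shape is hypothetical and used only for illustration.
    """
    report = defaultdict(lambda: {"correct": 0, "incorrect": []})
    for name, expected, assigned in documents:
        if expected == assigned:
            report[expected]["correct"] += 1
        else:
            # Keep the name and the class the classifier chose, for review.
            report[expected]["incorrect"].append((name, assigned))
    return dict(report)
```

The "incorrect" list plays the role of the list displayed in red: each entry names a document and the class it received instead of the expected one.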

Fixing Classification Errors

Incorrect Expected Classes

One possible cause of incorrect classification is incorrectly assigned expected classes. To fix this type of error, simply assign the correct expected class to a document. On the Results tab, select a class that was incorrectly assigned to a document. Expand the list of documents with incorrectly assigned classes, select all documents of that class, and assign the correct expected class to them from the list in the Assign class pane.

Similar Documents in Different Classes

Another possible reason for classification errors is having very similar documents divided into different classes. If the classifier confuses the classes of two similar document variants, these variants most likely belong in a single class with a single extraction activity. In this case, review the number of classes and merge the confused classes into one. Their differences should then be described using rules in an Extraction Rules activity.
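One way to spot classes that may need to be merged is to count how often each pair of classes is mixed up. The following is a rough sketch under the assumption that expected and assigned classes are available as two parallel lists. The function name confused_pairs is hypothetical.

```python
from collections import Counter


def confused_pairs(expected, assigned):
    """Count misclassifications per unordered pair of classes.

    A pair with a high count is a candidate for merging into one class.
    """
    pairs = Counter()
    for e, a in zip(expected, assigned):
        if e != a:
            # Sort the pair so that A->B and B->A errors are counted together.
            pairs[tuple(sorted((e, a)))] += 1
    return pairs.most_common()
```

Pairs at the top of the returned list are the ones to review first. A high count alone does not prove that the classes should be merged, since incorrect expected classes produce the same symptom.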

Insufficient Training Data

Yet another possible reason for classification errors is a lack of documents in the set for a class. In this case, you can improve the classifier quality by adding more documents to the set. After adding new documents or changing classes, you will need to retrain your classifier.
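Classes with few documents can be listed with a short check like the one below. It is only a sketch: the function name small_classes is hypothetical, and the threshold of 10 documents is an arbitrary illustrative value, not a figure taken from the product.

```python
from collections import Counter


def small_classes(expected, min_documents=10):
    """Return classes that have fewer than `min_documents` documents.

    `min_documents` is an arbitrary illustrative threshold, not a product value.
    """
    counts = Counter(expected)
    return {cls: n for cls, n in counts.items() if n < min_documents}
```

Classes returned by this check are the first candidates for additional documents before retraining.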