Set custom rules for detecting and extracting fields from semi-structured documents with varying layouts.

The Extraction Rules activity lets you set rules for detecting fields on semi-structured documents and verify how those rules work on real-life documents. It is usually applied when a field's location may differ from document to document, which complicates data extraction, and when you can provide additional information for detecting such fields, e.g. the location of fields relative to other objects on the document, or regular expressions specifying search conditions for an object. For example, you can specify that the Invoice Number field may be located either on the right of the image or directly under the words “Order number”, “Order #”, or other similar keywords.

We also recommend adding a Fast Learning activity to the processing flow and enabling Online Learning, so that runtime documents are collected and later used to rebuild the skill automatically via machine learning.
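As a rough illustration of the keyword example above, a regular expression can cover several spellings of the same label. This is only a sketch in plain Python, not the syntax used inside the Extraction Rules activity; the name ORDER_KEYWORD and the helper find_keyword are hypothetical.

```python
import re

# Hypothetical pattern covering "Order number" and "Order #" (case-insensitive).
# Extend the alternation with whatever other keywords occur in your documents.
ORDER_KEYWORD = re.compile(r"\border\s*(?:number|#)", re.IGNORECASE)


def find_keyword(line):
    """Return the (start, end) character span of the keyword in a text line, or None."""
    match = ORDER_KEYWORD.search(line)
    return match.span() if match else None
```

A rule could then look for the field value relative to the returned span, for example directly under it or to its right.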

Use Cases

Add the Extraction Rules activity to your document processing flow in the following cases:
  • When your document set isn’t streamlined enough to use a Fast Learning activity to extract data, you don’t have enough documents to train a Deep Learning activity, and the documents have a known structure which you can formalize.
  • When you want greater control over the AI and need to analyze the prediction results of the Deep Learning and Fast Learning activities before transferring those values into document fields. For example, if you expect to extract a number located close to some keyword, you can filter out hypotheses that don’t appear to be a number and hypotheses that are not located near the keyword. Note that if post-processing with rules is required, this usually indicates that the training set for the Deep Learning and Fast Learning activities should be expanded, because machine learning technologies can “feel out” and learn a field’s data type, typical location, and surroundings.
  • When you have a FlexiLayout file from ABBYY FlexiLayout Studio which you want to reuse. For more information, see Importing FlexiLayouts from ABBYY FlexiLayout Studio.
  • When your documents contain complex structures (e.g. nested tables, which are repeating structures inside other tables) which can’t be extracted by other activities targeted at semi-structured documents.
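The hypothesis filtering mentioned in the second case above can be sketched as follows. This is a minimal illustration in plain Python, not the product's rule language: the Hypothesis class, its fields, the NUMBER_RE pattern, and the max_distance threshold are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A candidate field value from a learning activity (hypothetical model)."""
    text: str
    x: float           # horizontal position of the candidate on the page
    y: float           # vertical position of the candidate on the page
    confidence: float


# Assumed definition of "looks like a number": digits, optionally with - or /.
NUMBER_RE = re.compile(r"\d[\d\-/]*")


def filter_hypotheses(hypotheses, keyword_xy, max_distance):
    """Keep number-like hypotheses located near the keyword, best confidence first."""
    kx, ky = keyword_xy
    kept = []
    for h in hypotheses:
        if not NUMBER_RE.fullmatch(h.text):
            continue  # does not appear to be a number
        distance = ((h.x - kx) ** 2 + (h.y - ky) ** 2) ** 0.5
        if distance > max_distance:
            continue  # too far from the keyword
        kept.append(h)
    return sorted(kept, key=lambda h: h.confidence, reverse=True)


candidates = [
    Hypothesis("INV-2024", 110, 50, 0.90),  # near the keyword, but not a number
    Hypothesis("48213", 120, 52, 0.80),     # a number near the keyword
    Hypothesis("77", 900, 700, 0.95),       # a number, but far from the keyword
]
result = filter_hypotheses(candidates, keyword_xy=(100, 50), max_distance=50)
```

Here only the hypothesis "48213" survives both checks, even though another candidate has a higher confidence.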

How It Works

An Extraction Rules activity is a formalized description of a set of documents that enables data capture workers to use custom rules to locate data fields on documents and extract information from these fields. In other words, an Extraction Rules activity lets you specify field search algorithms for document images. You can either specify the location of fields relative to other objects or use absolute coordinates to specify their location.

Various objects on the document image are detected using search elements. For every object that needs to be detected on the image, you need to create a corresponding element that fully describes the required type of object (such as text, image, or barcode), its characteristics, and the presumed search area for the object. The elements compose a Search Elements tree, which is a logically connected structure (of any nesting level) where elements are searched for relative to each other. The order of the elements in the tree directly corresponds to the order in which the activity searches for them, i.e. when matching a description to the image, the activity looks for elements from the top of the tree to the bottom. Grouping elements helps optimize the search and allows the creation of independent sub-hierarchies.

To extract data to a field, you should map the field to a search element. If the element is found on the image, its region becomes the region of the mapped field. For more information, see Setting up an Extraction Rules activity.
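The idea of a Search Elements tree can be sketched in plain Python as follows. This is a simplified model written for illustration only and does not reflect the product's actual rule syntax or internals: the Word and Element classes, the "right_of" and "below" relations, and the search function are all hypothetical, and each child element is assumed to be searched relative to its parent.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    """A recognized word with its bounding box (hypothetical model)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class Element:
    """A search element: what to look for and where, relative to its parent."""
    name: str
    pattern: str                    # regular expression the word text must match
    relation: Optional[str] = None  # "right_of" or "below"; None means the whole page
    children: List["Element"] = field(default_factory=list)


def in_area(word, anchor, relation):
    """Check whether a word lies in the search area defined relative to the anchor."""
    if relation == "right_of":
        return (word.left >= anchor.right
                and word.top < anchor.bottom and word.bottom > anchor.top)
    if relation == "below":
        return (word.top >= anchor.bottom
                and word.left < anchor.right and word.right > anchor.left)
    raise ValueError(f"unknown relation: {relation}")


def search(element, words, anchor=None, found=None):
    """Search the tree top-down; children are only searched if their parent is found."""
    found = {} if found is None else found
    if anchor is None or element.relation is None:
        candidates = words
    else:
        candidates = [w for w in words if in_area(w, anchor, element.relation)]
    match = next((w for w in candidates if re.fullmatch(element.pattern, w.text)), None)
    if match is None:
        return found
    found[element.name] = match
    for child in element.children:
        search(child, words, match, found)
    return found


words = [
    Word("Invoice", 10, 10, 80, 30), Word("2024", 100, 10, 150, 30),
    Word("Order#", 10, 40, 80, 60), Word("48213", 100, 40, 160, 60),
]
tree = Element("OrderKeyword", r"Order\s?#",
               children=[Element("OrderValue", r"\d+", relation="right_of")])
found = search(tree, words)

# Map a document field to a search element; the element's region becomes the field's region.
field_mapping = {"InvoiceNumber": "OrderValue"}
regions = {}
for field_name, element_name in field_mapping.items():
    word = found.get(element_name)
    if word is not None:
        regions[field_name] = (word.left, word.top, word.right, word.bottom)
```

In this sketch the keyword is located first, the number is then searched for only to the right of it, and the InvoiceNumber field takes the region of the element it is mapped to. The value "2024" is ignored because it lies outside the search area.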

Combining Several Extraction Rules Activities

You can create a workflow item that contains several Extraction Rules activities. The activity applied to a given document is selected based on the value of a specified field. This field may contain classification results or other data that helps distinguish among the document variants. The values you specify serve as conditions for choosing the corresponding activity. For more information, see Several sets of Extraction Rules within a single activity.
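Conceptually, this selection works like a lookup keyed on the field value. The sketch below is plain Python for illustration only; the field name DocumentClass, the class values, and the two rule functions are hypothetical stand-ins, not part of the product.

```python
def invoice_rules(document):
    """Placeholder for one set of extraction rules (hypothetical)."""
    return {"applied": "invoice_rules"}


def receipt_rules(document):
    """Placeholder for another set of extraction rules (hypothetical)."""
    return {"applied": "receipt_rules"}


# Each condition value is paired with the activity to run when the field has that value.
ACTIVITIES = {
    "Invoice": invoice_rules,
    "Receipt": receipt_rules,
}


def select_activity(document_fields, field_name, activities):
    """Return the activity whose condition matches the field value, or None."""
    return activities.get(document_fields.get(field_name))


chosen = select_activity({"DocumentClass": "Invoice"}, "DocumentClass", ACTIVITIES)
```

If no condition matches the field value, this sketch returns None, so the caller decides what happens to such documents.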