For most search elements of the Extraction Rules activity, the Properties pane has two sections: What to search for and Where to search. The exceptions are:
- Group and Repeating Group elements, which have no properties of their own.
- Input field elements, which come from activities that precede the Extraction Rules activity and provide only the Get region from option, used to switch from one input field to another.
What to search for
The What to search for section contains properties specific to each element.
Person, Organization, Address, Location, Date, Duration, Money
For all the search elements that look for named entities, you can specify the following properties:
- Entities: the entity type. If you change the type, the icon next to the search element is updated automatically.
- Instances: the number of instances. Either the first one or all instances found can be extracted.
Value from Dictionary
For a dictionary phrase, specify:
- Text source: a TXT file with a list of words or phrases to find, one variant per line.
- Use morphology: turn on this option to look for all word forms.
- Instances: the number of instances. Either the first one or all detected instances can be extracted.
Value from Regular Expression
For a regular expression, specify:
- Regular expression: a regular expression that defines the search. The program uses the PCRE2 regular expression syntax.
- Search for parts of words: turn on this option to find the matches even if they are not separated by spaces from the rest of the text.
- Instances: the number of instances. Either the first one or all detected instances can be extracted.
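As a rough illustration of how Search for parts of words and Instances interact, the sketch below uses Python's built-in re module rather than PCRE2, so syntax details may differ. The find_matches helper, its parameters, and the sample text are hypothetical. Treating a whole-word match as one bounded by whitespace is an assumption based on the description above, not the program's documented behavior.

```python
import re


def find_matches(pattern, text, parts_of_words=False, first_only=False):
    """Hypothetical helper that mimics the two options described above."""
    if not parts_of_words:
        # Assumption: a match must be delimited by whitespace
        # (or the start/end of the text) on both sides.
        pattern = r"(?<!\S)(?:" + pattern + r")(?!\S)"
    matches = [m.group(0) for m in re.finditer(pattern, text)]
    # "Instances": either the first match only or every match found.
    return matches[:1] if first_only else matches


sample = "Invoice INV-1042 replaces draft XINV-1041b."

# Option off: only the standalone token is found.
print(find_matches(r"INV-\d{4}", sample))
# Option on: the match embedded in "XINV-1041b." is found as well.
print(find_matches(r"INV-\d{4}", sample, parts_of_words=True))
```

With the option off, only INV-1042 is returned. With it on, the INV-1041 fragment inside the longer token is returned too.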
Text
For a text search element, click the edit icon and enter a list of words or phrases to find, or click on the document image to add recognized words from the document.
Unlike the Value from Dictionary search element, keywords are listed directly instead of in a TXT file, and you also have the option to allow for some recognition errors.
- Text source: a list of words or phrases to find, one variant per line.
- Use morphology: turn on this option to look for all word forms.
- Allowed errors: the percentage or the number of differing characters that will still allow the text to be found. This can help when the document text contains recognition errors.
Note: This option will not be available if you turn on the Use morphology option.
- Instances: the number of instances. Either the first one or all detected instances can be extracted.
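To make the Allowed errors idea concrete, here is a minimal sketch that counts differing characters with the Levenshtein edit distance and compares the result with either an absolute limit or a percentage of the keyword length. The choice of edit distance as the metric, the function names, and the parameters are all assumptions made for illustration. The program's actual comparison method is not described here.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def matches_keyword(word, keyword, max_errors=None, max_percent=None):
    """Hypothetical check: does `word` match `keyword` within the limit?"""
    distance = edit_distance(word.lower(), keyword.lower())
    if max_errors is not None:
        return distance <= max_errors
    if max_percent is not None:
        # Assumption: the percentage is taken relative to the keyword length.
        return distance * 100 <= max_percent * len(keyword)
    return distance == 0


# "lnvoice" is a typical OCR misreading of "invoice" (one differing character).
print(matches_keyword("lnvoice", "invoice", max_errors=1))    # True
print(matches_keyword("lnvo1ce", "invoice", max_errors=1))    # False
print(matches_keyword("lnvo1ce", "invoice", max_percent=30))  # True
```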
Where to search
The Where to search section is identical for all elements. In this section, you can narrow down the area where the program will look for the search element. In the following settings, you can use the search elements located above the current element in the list:
- Search in: the search element is located either within the Whole Document or inside another search element.
Example: Look for the organization name in the preamble of the document.
- After: the search element is located after another search element in the recognized text.
  - Search in the same sentence: turn on this option to find the element within the same sentence.
Example: Look for the role of the organization after its name within the same sentence.
- Before: the search element is located before another search element in the recognized text.
  - Search in the same sentence: turn on this option to find the element within the same sentence.
For example, if you are looking for somebody’s date of birth, you can first create an auxiliary search element with the “born” keyword, then specify that the Date entity is located somewhere after this keyword within the same sentence.
You can add multiple After and Before elements to refine your search further.
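The date-of-birth example can be sketched in Python as follows. This is only an approximation of the logic, under several assumptions: a simple regular expression stands in for the Date entity, sentences are split naively on end punctuation, and the dates_after_keyword function is a hypothetical name, not part of the product.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Stand-in for the Date entity: matches dates such as "12 May 1980".
DATE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")


def dates_after_keyword(text, keyword="born"):
    """Find dates located after `keyword` within the same sentence."""
    results = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        anchor = re.search(r"\b" + re.escape(keyword) + r"\b", sentence)
        if anchor is None:
            continue  # the auxiliary keyword is not in this sentence
        # Search only the part of the sentence after the keyword.
        results.extend(m.group(0) for m in DATE.finditer(sentence, anchor.end()))
    return results


sample = ("The contract was signed on 3 March 2021. "
          "John Smith was born on 12 May 1980 in Leeds. "
          "He moved on 1 June 2005.")
print(dates_after_keyword(sample))  # ['12 May 1980']
```

Only the date in the sentence that contains the keyword is returned. Without the same-sentence restriction, the later date "1 June 2005" would also qualify as located after the keyword.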