Sample extraction rules

This section provides some sample extraction rules created using the code editor, which offers more flexibility than the graphical user interface (see the Syntax section for a detailed description of the syntax used).

Finding the last instance of a named entity

In this example, an extraction rule is used to find the name of the tenant’s organization in a lease agreement. This name occurs after the names of all the other organizations and cannot be extracted using the settings in the GUI, where you can only select Instances > First. However, you can use an extraction rule to look for the name of the organization that closely precedes the word “Tenant” enclosed in brackets. First, we find the keyword “Tenant” as an auxiliary search element. Next, we look for the name of the organization that occurs before that keyword. In this example, there are two intervening tokens—an opening bracket and a quotation mark—so we limit the distance to three tokens, allowing a margin for safety. If your documents have more intervening tokens separating the keyword from the name of the organization, increase this number accordingly.

Rule for extracting the last organization instance

// We are looking only for one instance
~TenantName
// Find the organization name that occurs before the keyword "Tenant"
// and is separated from it by no more than three tokens
// The intervening tokens must not contain another organization's name
[ t: @NEROrganization ]+ [ ~@NEROrganization ~@kw_Tenant ]{0,3} [ @kw_Tenant ]
=>
TenantName( t );
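
The matching logic of this rule can be approximated outside Vantage. The following Python sketch is an illustration only, not the extraction rule engine: tokens are modeled as (text, label) pairs, and the labels "ORG" and "KW_TENANT" as well as the function name find_tenant_name are hypothetical stand-ins for @NEROrganization, @kw_Tenant, and the rule itself.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (text, label); the labels are hypothetical stand-ins


def find_tenant_name(tokens: List[Token], max_gap: int = 3) -> Optional[str]:
    """Return the organization name that most closely precedes the Tenant keyword."""
    for i, (_, label) in enumerate(tokens):
        if label != "KW_TENANT":
            continue
        # Walk back over at most max_gap tokens that are neither ORG nor keyword
        j = i - 1
        gap = 0
        while j >= 0 and tokens[j][1] not in ("ORG", "KW_TENANT") and gap < max_gap:
            j -= 1
            gap += 1
        if j < 0 or tokens[j][1] != "ORG":
            continue
        # Collect the contiguous run of ORG tokens that ends at position j
        end = j
        while j >= 0 and tokens[j][1] == "ORG":
            j -= 1
        return " ".join(text for text, _ in tokens[j + 1 : end + 1])
    return None
```

With two organizations in the text, only the one within three tokens of the keyword is returned, which mirrors the intent of limiting the distance in the rule.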


Extracting an amount of money stated both in words and as a numeral

In this example, an extraction rule with regular expressions is used to find an amount of money that is written out first in words and then in digits in round brackets, for example:

Two Thousand One Hundred Forty-Seven Dollars and Sixty Cents ($2,147.60)
Two Thousand One Hundred Forty-Seven Dollars and Sixty Cents (in numbers: $2,147.60)

Rule for extracting an amount of money

// The variable a receives the Money named entity that is not in numbers
// and has not yet been assigned to any other instance of the Money search element
[ a: @NERMoney( same ) ~@Money ~/\d+/ ]+
[ br: '(' ]
// The question mark means that the words "in numbers" are optional
// The ^ and $ symbols mean that the entire string must match the regular expression
// The i option means that matching is case-insensitive
( [ e1: /^in$/i ] [ e2: /^numbers$/i ] [ e3: ':' ]? )?
// The variable b receives the Money named entity that is placed inside brackets
[ b: @NERMoney( same ) ~@Money ]+
=>
Money( a + br + e1 + e2 + e3 + b );
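
The effect of the anchored, case-insensitive patterns and of the optional group can be tried out in Python. This is a sketch under stated assumptions rather than the rule engine itself: the text is assumed to be already split into tokens, and the helper name skip_in_numbers is hypothetical.

```python
import re

# Token-level patterns modeled on the rule: anchored and case-insensitive
IN_RE = re.compile(r"^in$", re.IGNORECASE)
NUMBERS_RE = re.compile(r"^numbers$", re.IGNORECASE)


def skip_in_numbers(tokens, pos):
    """Return the position after an optional 'in numbers' prefix and optional colon."""
    if (pos + 1 < len(tokens)
            and IN_RE.match(tokens[pos])
            and NUMBERS_RE.match(tokens[pos + 1])):
        pos += 2
        # The colon is optional, like [ e3: ':' ]? in the rule
        if pos < len(tokens) and tokens[pos] == ":":
            pos += 1
    return pos
```

Because of the anchors, a token such as "inside" does not count as "in", and when the prefix is absent the position is left unchanged, which corresponds to the outer question mark in the rule.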


Finding segments by means of keywords

In this example, extraction rules are used to find segments which cannot be reliably detected by the Segmentation activity. The rules look for keywords that start or end a segment and extract the text in between.

We assume that our documents have numbered second-level headings written in all capitals, for example: 1.1 PREMISES, 2.3 LIABILITY AND INDEMNITY, and so on. We also assume that we have already extracted the first-level headings into a search element named “kw_Heading1” (the respective extraction rule is omitted for the sake of brevity).

First, we look for keywords that start each paragraph of the document and extract them into a search element named “kw_Heading2”. Next, we put the text between two consecutive keywords into a search element named “Segment”.

In the current version of Advanced Designer, the code editor is only available for search elements that are used to find named entities. As a workaround, to extract text like headings or segments by means of code, create a search element for any of the supported named entities (for example, Organization) and enter the code of the rule into the code editor of that element.

Rule for extracting second-level headings into the kw_Heading2 search element

// Look for a numbered second-level heading that has up to five words in all caps
// The heading number can be found as one token: 1.1
[ t1: /\d{1,2}\.\d{1,2}/ ] [ t2: <all_letters_capitalized> ]{1,5}
=>
kw_Heading2( t1 + t2 );
// The heading number can be found as three separate tokens: 1, dot, 1
[ t1: /\d{1,2}/ ] [ t2: '.' ] [ t3: /\d{1,2}/ ] [ t4: <all_letters_capitalized> ]{1,5}
=>
kw_Heading2( t1 + t2 + t3 + t4 );
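
To check which lines the first variant of the rule would treat as headings, the same pattern can be tested in Python. This is only a sketch: the function name match_heading2 is hypothetical, the line is assumed to be split on whitespace, and the number token is matched in full, which is a simplification compared with the unanchored expression in the rule.

```python
import re

# Heading number pattern taken from the rule above
HEADING_NUMBER = re.compile(r"\d{1,2}\.\d{1,2}")


def match_heading2(tokens):
    """Return the heading text if tokens start with N.N followed by 1 to 5 all-caps words."""
    if not tokens or not HEADING_NUMBER.fullmatch(tokens[0]):
        return None
    words = []
    # Look at no more than five tokens after the heading number
    for tok in tokens[1:6]:
        if tok.isalpha() and tok.isupper():
            words.append(tok)
        else:
            break
    if not words:
        return None
    return " ".join([tokens[0]] + words)
```

For example, match_heading2("2.3 LIABILITY AND INDEMNITY".split()) yields the whole heading, while a line whose first word is not in all capitals yields None.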

Rule for extracting the segment into the Segment search element

// Find the text segment between two consecutive section headings
// Also exclude any first-level headings
[ @kw_Heading2 ] [ interval: ~@kw_Heading2 ~@kw_Heading1 ]+
=>
Root.Segment( interval );

Example

In this example, the headings “3.1 FIRSTLY” and “3.2 SECONDLY” are extracted into the kw_Heading2 search element, and then the text between any two consecutive instances of the kw_Heading2 search element is extracted into an instance of the Segment search element.
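
The idea of collecting the text between consecutive headings can be sketched in Python as follows. This is an approximation for illustration only: tokens are (text, label) pairs, and the labels "H2" and "H1" and the function name split_segments are hypothetical stand-ins for kw_Heading2, kw_Heading1, and the rule.

```python
def split_segments(tokens):
    """Return the text runs that follow a second-level heading.

    A run ends at the next second-level or first-level heading.
    """
    segments = []
    current = None  # None means we are not inside a segment
    for text, label in tokens:
        if label == "H2":
            # A new second-level heading closes the previous segment, if any
            if current:
                segments.append(" ".join(current))
            current = []
        elif label == "H1":
            # A first-level heading closes the segment and starts none
            if current:
                segments.append(" ".join(current))
            current = None
        elif current is not None:
            current.append(text)
    if current:
        segments.append(" ".join(current))
    return segments
```

Text that follows a first-level heading is ignored until the next second-level heading appears, which corresponds to excluding kw_Heading1 from the interval in the rule.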

Grouping information about each entity

In this example, extraction rules are used to make sure that the details about each party to an agreement are grouped correctly (that is, the name and address of each party belong to one group instance and are not split into several instances or mixed with the details of the other party). The idea is to find some identifying information about a party that always comes first. There are multiple ways to do this, depending on how agreements are drafted. In this example, we assume that the name of each organization always comes first, followed by its address and role in the agreement. Therefore, we will:

1. Look for organization names, create a new instance of the Party_Group group search element for each name found, and fill in its child search element named “Organization_Name”.
2. Look for the address and role that are separated by no more than, say, 20 tokens from each instance of the organization name, access the instance of Party_Group that is the parent of the organization name, and fill in the child search elements named “Address” and “Role” in that instance.

Data will only be searched within the segment found by the Segmentation activity and passed to the Extraction Rules activity as an Input field named “Parties_Segment.”

Rule for extracting the Organization_Name search element

// Find each organization name and create a new group instance for it
[ org: @NEROrganization( same ) ~@Party_Group.Organization_Name @Parties_Segment ]+
=>
// Create a new instance of Party_Group and fill in Organization_Name
Party_Group.Organization_Name( org );

Rule for extracting the Address search element

// Now look for the address that is separated by no more than 20 tokens
// from the organization name
// and that has not yet been assigned to any of the organizations in Party_Group
[ org: @Party_Group.Organization_Name( same ) ]+
[ ~@NERAddress ]{0,20}
[ t: @NERAddress( same ) ~@Party_Group.Address @Parties_Segment ]+
=>
// Access the instance of Party_Group that is the parent of the organization name
// and fill in its Address child
parent( obj( org )).Address( t );

Rule for extracting the Role search element

// Repeat the same for the role
// To find the role, use keywords
[ org: @Party_Group.Organization_Name( same ) ]+
[ ~("Tenant" | "Landlord" | "Broker") ]{0,20}
[ t: "Tenant" | "Landlord" | "Broker" ~@Party_Group.Role @Parties_Segment ]+
=>
parent( obj( org )).Role( t );
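
Taken together, the three rules above amount to the grouping logic sketched below in Python. This is a simplified model and not how Vantage evaluates rules: every entity is assumed to be a single (text, label) token, the labels "ORG" and "ADDR", the PartyGroup class, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]  # (text, label); the labels are hypothetical stand-ins
ROLES = {"Tenant", "Landlord", "Broker"}


@dataclass
class PartyGroup:
    organization_name: str
    address: Optional[str] = None
    role: Optional[str] = None


def first_in_window(tokens: List[Token], start: int, max_gap: int,
                    predicate: Callable[[Token], bool]) -> Optional[int]:
    """Index of the first matching token with at most max_gap tokens before it."""
    for j in range(start, min(len(tokens), start + max_gap + 1)):
        if predicate(tokens[j]):
            return j
    return None


def group_parties(tokens: List[Token], max_gap: int = 20) -> List[PartyGroup]:
    groups = []
    used = set()  # token indices already assigned to some group
    for i, (text, label) in enumerate(tokens):
        if label != "ORG":
            continue
        group = PartyGroup(organization_name=text)
        addr = first_in_window(tokens, i + 1, max_gap, lambda t: t[1] == "ADDR")
        if addr is not None and addr not in used:
            group.address = tokens[addr][0]
            used.add(addr)
        role = first_in_window(tokens, i + 1, max_gap, lambda t: t[0] in ROLES)
        if role is not None and role not in used:
            group.role = tokens[role][0]
            used.add(role)
        groups.append(group)
    return groups
```

Each organization name opens its own group, and the address and role are attached to the group of the organization that precedes them, so the details of different parties are not mixed.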

Example

On the data form, the name, address, and role of each company are grouped under a separate group instance. The details of each organization are grouped together, as shown by the instance numbers in brackets.

Finding the date and time together

In this example, an extraction rule is used to find a combination of time and date. First, we use a search element named “Time” of type Value from Regular Expression (the regular expression used is [1]?\d:\d{2}\s+(([ap]\.m\.)|([AP]M))?). Next, we look for a Date named entity located close to it. Finally, we concatenate the token sequences found and assign the result to a search element named “TimeAndDate.”

Rule for extracting date and time combined

// Use a Value from Regular Expression search element to find the time
// Use @NERDate to find a Date named entity close to the time
[ time: @Time ~@TimeAndDate ]+ [ t: ~@NERDate ]{0,3} [ date: @NERDate( same ) ]+
=>
// Combine the values to write them in one field
// Only consecutive token sequences can be combined, so the auxiliary token is also added
TimeAndDate( time + t + date );
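
The regular expression used by the Time search element can be tried out on sample text before it is used in a skill. The Python sketch below only exercises the expression quoted above with Python's re module; the function name find_time is hypothetical, and the regular expression engine in Vantage may differ from Python's in details.

```python
import re

# Regular expression quoted from the description of the Time search element
TIME_RE = re.compile(r"[1]?\d:\d{2}\s+(([ap]\.m\.)|([AP]M))?")


def find_time(text):
    """Return the first time-like substring in text, or None if there is none."""
    match = TIME_RE.search(text)
    return match.group(0) if match else None
```

In Python, the expression requires at least one whitespace character after the minutes, so a time at the very end of a string is not matched by this sketch.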
