We have finished configuring the “Sick Note DE” activity and are ready to create the second set of extraction rules, this time for the other class of sick notes. Dutch and Belgian sick notes are structured quite differently from the German documents, and the class contains many variants, so this time we can’t use the Fast Learning activity to extract any fields. These documents also contain some information that is not available in the German sick notes. We’ll start by extracting the data available on all documents and then add some new fields to the data form.

You can switch to another activity without closing the Activity Editor. Click the current activity name next to the skill name and select “Sick Note BE-NL” in the drop-down list. Then select the first document in the set.

Extracting the issue date

Dates in these documents can be easily extracted using the Date element, so this time we will use the search element that was created automatically for this field.
  1. Open the Manage Fields dialog on the Fields tab and select a “Date” field to be used in this activity. Click Save.
  2. Go to the Search Elements tab. You will see a search element of type Date created for the “Date” field. It is mapped to the field automatically.
  3. Create a Group search element called “IssueDateGroup”. Make the element optional.
  4. Add a Static Text element called “kwDate” to find the label which will help us locate the actual date.
  5. This document class contains documents in Dutch or French, so there are several options for the label text. You can enter each option on a new line in the Text to find dialog. Enter the text “Date” on the first line and “Datum” on the second line.
  6. Disable the Search for parts of words option.
  7. Drag and drop the “Date” search element into the group and place it under the “kwDate” element.
  8. Specify the search area for the “Date” element.
    a. Delete the Nearest to relation that was automatically added when the element was created.
    b. Select the “kwDate” element as the one nearest to the element we’re searching for.
    c. The date can be located to the right of the keyword or below it. Specify the search area below the “kwDate” element.
    d. The search area should also include the line on which the keyword is located. Click the bottom boundary icon to the right of the element name and select Top Boundary of Region. The lines may be uneven, so set the Below value to -10 to extend the search area a little bit above the line.
  9. Click Match to make sure the date is located correctly.
This is what the search element structure should look like: AD_Tutorial_BE_IssueDate_Structure
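The search area configured for the “Date” element can be pictured as simple rectangle arithmetic. The sketch below is not Vantage code: the `Rect` type, the coordinate convention (y grows downward), the units, and all function names are assumptions made for illustration. It shows how anchoring the area to the keyword's top boundary with an offset of -10 pulls the upper edge slightly above the keyword line, and how the candidate nearest to the keyword is then picked.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Rect:
    """Hypothetical rectangle; y grows downward, as on a page image."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def center(self):
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def area_below_top_boundary(keyword, page, offset=-10):
    """Search area that starts at the keyword's top boundary plus an offset.

    A negative offset moves the upper edge up, so the keyword's own line
    stays inside the area even when the text lines are slightly uneven.
    """
    return Rect(page.left, keyword.top + offset, page.right, page.bottom)


def contains(area, r):
    """True if rectangle r lies completely inside the area."""
    return (area.left <= r.left and r.right <= area.right
            and area.top <= r.top and r.bottom <= area.bottom)


def nearest_candidate(keyword, area, candidates):
    """Return the candidate inside the area whose center is closest to the keyword."""
    kx, ky = keyword.center
    inside = [c for c in candidates if contains(area, c)]
    return min(
        inside,
        key=lambda c: hypot(c.center[0] - kx, c.center[1] - ky),
        default=None,
    )


# Illustrative coordinates only.
page = Rect(0, 0, 1000, 1400)
keyword = Rect(100, 200, 180, 220)      # the "Date" / "Datum" label
same_line = Rect(200, 195, 300, 218)    # date on the keyword line, slightly higher
far_below = Rect(100, 900, 200, 920)    # some other date further down the page
above = Rect(100, 100, 200, 120)        # a date above the keyword

area = area_below_top_boundary(keyword, page)
best = nearest_candidate(keyword, area, [same_line, far_below, above])
```

With an offset of 0 the slightly raised date on the keyword line would fall outside the area in this model, which is the situation the negative Below value is meant to avoid.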

Extracting the sickness dates

A Key value element searches for both a static text label and its value, but it doesn’t allow much variation in the location and properties of the value. In these documents, the sickness dates are formatted so that each date component is in a separate cell of a table. The table cells can be located in non-standard places in each document, but the relative position of the cells is always the same. We can’t count on the table cell boundaries being very clear, but we will still use the Table Cell element because it allows for fuzzy borders and will be convenient if we decide to train the activity on more documents. We’ll use the Group element to organize the hierarchy of search elements.
Note: You can use the Table Cell element not only for fields located inside document tables. It can also be useful if you need to extract data from a form where the content is located in similar boxes or table-like structures. If these boxes have clear dividing lines, the Table Cell element will prove very effective.
  1. Open the Manage Fields dialog and add the following fields to the current activity:
    • Start Date
    • End Date
    Click Save.
  2. Go to the Search Elements tab and create the Group element for the start date extraction. Set the following parameters for the elements included in the group:
| Parameter | Value |
| --- | --- |
| Group search element: | |
| Name | StartDateGroup |
| Static Text search element: | |
| Name | kwStartDate |
| Text to find | Vanaf / From, A partir du, Van |
| Search for parts of words | Disabled |
| Table Cell search elements: | |
| Name | StartDateDay |
| Search pattern | Number |
| Character count | {1, 1, 3, 3} |
| Search for parts of words | Disabled |
| Search area | Below the “kwStartDate” element, nearest to “kwStartDate” |
| Table Cell search element: | |
| Name | StartDateMonth |
| Search pattern | Number |
| Character count | {1, 1, 3, 3} |
| Search for parts of words | Disabled |
| Search area | Below the “kwStartDate” element, right of “StartDateDay”, nearest to “StartDateDay” |
| Table Cell search element: | |
| Name | StartDateYear |
| Search pattern | Number |
| Character count | {2, 2, 4, 4} |
| Search for parts of words | Disabled |
| Search area | Below the “kwStartDate” element, right of “StartDateMonth”, nearest to “StartDateMonth” |
Note: The Table Cell element returns the text from the cell as it is. In this case the search pattern contains a Number which recognizes only the digits, so the text returned by the element will be a number.
  3. Create a copy of the “StartDateGroup” element and rename it to “EndDateGroup”.
  4. Rename the group’s sub-elements: “kwStartDate” to “kwEndDate”, “StartDateDay” to “EndDateDay”, “StartDateMonth” to “EndDateMonth”, “StartDateYear” to “EndDateYear”.
  5. Change the text to find of the “kwEndDate” element to “Tot en met / Till and incl., Jusqu’ au, Tot en met”.
  6. Specify the search area for the “EndDateDay” element. It should be located below the “kwEndDate” element and nearest to it. Delete the other relations.
  7. Open the Manage Fields dialog and add a Data Composition Field called “Start Date Composed”. Map the following elements to the fields:
    • “StartDateDay” to Day
    • “StartDateMonth” to Month
    • “StartDateYear” to Year
    Click Save.
  8. Create a Data Composition Field called “End Date Composed”. Map the following elements to the fields:
    • “EndDateDay” to Day
    • “EndDateMonth” to Month
    • “EndDateYear” to Year
    Click Save.
  9. Map the “Start Date Composed” and “End Date Composed” data composition fields to the “Start Date” and “End Date” fields.
This is what the search element structure should look like: AD_Tutorial_BE_Dates_Structure
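The Data Composition Fields above combine three separately extracted values (day, month, year) into one date. As a rough illustration of what such a composition involves, here is a minimal Python sketch. It is not how Vantage implements the feature: the function name `compose_date` and the rule for expanding two-digit years (adding a `pivot` of 2000) are assumptions made for this example.

```python
from datetime import date


def compose_date(day, month, year, pivot=2000):
    """Combine day, month and year strings into a date, or return None.

    Assumption: a year written with one or two digits is expanded by
    adding `pivot` (for example "23" becomes 2023).
    """
    day, month, year = (str(v).strip() for v in (day, month, year))
    try:
        d, m, y = int(day), int(month), int(year)
    except ValueError:
        return None  # a component is empty or is not a number
    if len(year) <= 2:
        y += pivot
    try:
        return date(y, m, d)
    except ValueError:
        return None  # components do not form a valid calendar date


# Example values as they might come from the three Table Cell elements.
start = compose_date("07", "03", "2023")
short_year = compose_date("7", "3", "23")
impossible = compose_date("31", "02", "2023")
```

Returning `None` for an impossible combination such as 31 February makes it easy to flag the field for review instead of passing on a wrong date.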

Extracting the type of sick note

We’ll extract the type of sick note using a checkmark in just the same way as we did for the German documents.
  1. Open the Manage Fields dialog on the Fields tab and enable the “Type of Sick Note” checkmark group. Enable the “Primary” and “Secondary” checkmarks in the group to be used in the current activity. Click Save.
  2. Build a structure similar to what was built for the German documents, but keep in mind that in Dutch and Belgian documents the label (the text near the checkmark) goes first. The order of child elements for such groups does matter.
    a. Create a Group element called “TypeOfSickNoteGroup”.
    b. Create a copy of this group and rename it to “PrimaryGroup”. Place it inside “TypeOfSickNoteGroup”.
    c. Add a Static Text element called “kwCheckmark” to the “PrimaryGroup” group.
    d. Set the text to find to “eerste / Primary, première, primair”.
Note: In these documents, the text near the checkmark is located to the left of the checkmark, so we set the search area to the left of it, not to the right.
Configure the rest of the elements according to the table below:
| Parameter | Value |
| --- | --- |
| Static Text search element: | |
| Name | Checkmark |
| Text to find | X |
| Character count | {1, 1, 3, 3} |
| Search for parts of words | Disabled |
| Search area | Right of “kwCheckmark”, nearest to “kwCheckmark” |
| Static Text search element: | |
| Name | XMark |
| Text to find | X |
| Character count | {1, 1, 3, 3} |
| Search for parts of words | Disabled |
| Search area | Below the “kwCheckmark” top boundary, Below value = -15, Left of “kwCheckmark”, Above the “kwCheckmark” bottom boundary, Above value = -15, Nearest to “kwCheckmark” |
| Under what conditions | Do not find element if “Checkmark” is found |
| Region search element: | |
| Name | CheckmarkRegion |
| Search Conditions section of the Code Editor | if Checkmark.IsFound then RSA: Checkmark.Rect; else if XMark.IsFound then RSA: XMark.Rect; else DontFind; |
    e. Create a copy of “PrimaryGroup” and rename it to “SecondaryGroup”. Change the text to find of its “kwCheckmark” element to “prolongation, verlenging”.
    f. German sick notes were divided into two types, whereas Dutch and Belgian sick notes are divided into three (“relapse” is the additional type). Create another copy of the “PrimaryGroup” group and rename it to “RelapseGroup”.
    g. Change the text to find of its “kwCheckmark” element to “Herval” and enable the Match case option to exclude occurrences of the word in the middle of a sentence.
This is what the search element structure should look like: AD_Tutorial_BE_TypeOfSickNote_Structure
  3. Open the Manage Fields dialog and add a “Relapse” checkmark to the “Type of Sick Note” checkmark group. Enable all checkmarks in the group to be used in the current activity and click Save.
  4. Map the checkmarks to the corresponding Region elements and delete the elements that were automatically created when enabling the fields.
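The Search Conditions code of the “CheckmarkRegion” element is a fallback chain: use the rectangle of “Checkmark” if it was found, otherwise the rectangle of “XMark”, otherwise do not find the region at all. The Python sketch below mirrors only that decision logic so it can be read and tested outside the editor. The `Element` type, its field names, and both function names are hypothetical and are not part of the product; `RSA` is assumed here to restrict the region's search area to the given rectangle.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

RectTuple = Tuple[int, int, int, int]  # (left, top, right, bottom), illustrative


class Element(NamedTuple):
    """Hypothetical stand-in for a matched search element."""
    is_found: bool
    rect: Optional[RectTuple] = None


def checkmark_region(checkmark: Element, xmark: Element) -> Optional[RectTuple]:
    """Mirror of the fallback chain: Checkmark first, then XMark, else nothing."""
    if checkmark.is_found:
        return checkmark.rect
    if xmark.is_found:
        return xmark.rect
    return None  # corresponds to DontFind


def checked_types(groups: Dict[str, Tuple[Element, Element]]) -> List[str]:
    """Names of the groups for which a checkmark region was produced."""
    return [
        name
        for name, (checkmark, xmark) in groups.items()
        if checkmark_region(checkmark, xmark) is not None
    ]


# Illustrative match results for the three groups.
groups = {
    "Primary": (Element(False), Element(False)),
    "Secondary": (Element(True, (410, 300, 430, 320)), Element(False)),
    "Relapse": (Element(False), Element(True, (410, 360, 430, 380))),
}
found = checked_types(groups)
```

Because each group is a copy of “PrimaryGroup”, the same chain applies unchanged to “SecondaryGroup” and “RelapseGroup”; only the keyword text differs.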

Testing the activity

We have configured all the necessary search elements and fields. Select all documents, click Match, and switch to the Fields tab to review the field regions on the document images. Keep in mind that a region will be passed to a field only if it belongs to the hypothesis from the best path. Once you’re satisfied with the results, click the copy icon above the document image to copy predicted labeling to reference labeling.