We have already started extracting data from the German documents, so we’ll first configure the Extraction Rules activity for these documents.

Preparatory steps

  1. Open the “Sick Note DE” activity in the Activity Editor.
  2. Select one of the documents from the document set.
  3. Make sure that the advanced mode for the element properties is enabled. To toggle this mode on or off, click the icon on the Properties pane.
  4. All uploaded documents have undergone pre-recognition, and it’s useful to see what objects were found on the image. Click the icon. If you don’t see this icon due to the size of your screen, click the icon and select Recognized Words. The corresponding objects will be highlighted on the document image. You can switch between various highlighted object types at any time. For example, switching to Recognized Lines can be helpful when looking for paragraphs, and switching to Separators will facilitate the configuration of a Separator search element.
  5. If a search element lies outside the search area, it will not be found. Enable the Show search area option in the document image context menu. The search area for each element will be highlighted in green when you evaluate the matching results.
Keep in mind that it may be helpful to experiment with advanced properties of the search elements to improve extraction accuracy. We also encourage you to click Match often to check how your extraction rules work and to compare extraction results on different documents in the set. You can test a single element without relations to other elements by clicking Match Element in its context menu. In this case, the hypothesis quality of previous elements won’t affect the matching results.

Extracting the patient’s data

Let’s start by extracting the missing data for the patient. To do so, we need to create several search elements. We advise grouping all elements related to one entity. Elements are matched one after another, and not finding the top element will decrease the hypothesis quality for subsequent elements. Groups of search elements, by contrast, are processed independently of one another during matching, and an individual hypothesis is formulated for each group. Thus, you can control how elements influence one another. You can also evaluate matching results at a glance by checking whether the group elements have been found successfully. Lastly, grouping may help reduce matching time.
  1. Click Create Element and select the Group element from the drop-down list. Change its name to “PatientDataArea”.
  2. A new group search element is set to be required by default. If a required element is not found, the Activity Editor runs into an error and matching is aborted. This scenario lets activities be skipped if they are not suitable for a certain document. However, in this tutorial we are creating an activity to extract data from all incoming documents, so we want the group to be optional. In the Under what conditions section, change the Element is value to Optional.
  3. We want to find the paragraph that contains the patient’s name and address. In German documents the paragraph we are looking for is always located in the field with the label “Name, Vorname … ”. We need to find this text on the document and use it as a reference to search for the data we want to extract.
    a. Keywords can be found using the Static Text search element. Click Create Element and select the Static Text element from the drop-down list. Change its name to “kwPatientTitle”.
    b. Enter the text “Name, Vorname” in the Text to find field on the Properties pane.
    c. Click Match. When processing is finished, you will see the Tree of Hypotheses below the document. Make sure that Advanced Designer has successfully found the desired static text. A green dot next to the element name indicates that a corresponding element was successfully found on the document. If you click the element name in the Tree of Hypotheses, you will see a violet frame around the corresponding region on the document.
Note: If an element wasn’t found, you will see an orange dot next to its name and an orange frame around the document image. Keep in mind that the hypothesis quality of an element affects the state of subsequent elements in the chain and the overall quality of a chain. You can find detailed information about hypothesis quality in the documentation.
  4. Now let’s find the lower boundary of the cell which contains the patient’s name and address. We will do so using a Separator element.
    a. Add a Separator element to the group and call it “SeparatorBottom”. Set its minimum length to 200.
    b. Right-click the element and select Match Element in the context menu. You will see that the Tree of Hypotheses contains many green dots. They correspond to different separators that fit the search criteria. You can click on each dot to see the corresponding object on the image.
    c. To narrow down the search criteria, specify the search area for the separator. Click Match to find the “kwPatientTitle” element that will be used as an anchor element. In the Where to search section of the Properties pane, click Draw on Image. Select the “kwPatientTitle” element on the document and click the down arrow icon to specify the search area below the keyword and the nearest icon to look for the separator nearest to the keyword. You can find a detailed description of the anchor elements in the documentation.
    d. Click Match and check that Advanced Designer has found the separator below the “kwPatientTitle” element. You can check the hypothesis for each element by clicking its name in the Tree of Hypotheses section.
  5. A label and a separator are reliable reference elements for the patient’s data. However, if the print quality is too low, there is a chance the label text won’t be recognized or the separator won’t be found. To ensure good extraction results, we will search for a paragraph that lies between the label and the separator. A paragraph is a uniform block of text, meaning that it can successfully be found even if some of the boundary elements were not found.
    a. Create a Paragraph search element and call it “NameAddressParagraph”.
    b. Change Text alignment to Left.
    c. The patient’s data occupies from two to five lines, so specify the Line count from 2 to 5.
    d. Specify the search area for the paragraph. This time you should use the Add menu in the Where to search section. The element should be located below the “kwPatientTitle” element and above the “SeparatorBottom” element.
    e. Click Match.
  6. Now we want to extract the patient’s data. Create a new group element called “PatientGroup”.
  7. The patient’s name can occupy one or two lines. To capture several instances of an element, we will use a repeating group.
    a. Create a Repeating Group search element and call it “NameGroup”. Specify 2 as the maximum number of repetitions. Make the element optional.
    b. We want to search for the lines that are part of the “NameAddressParagraph” paragraph. To specify the element’s region as the search area, click the code editor icon below the document image and paste the following script in the Search Conditions section of the Code Editor:
RSA:PatientDataArea.NameAddressParagraph.Rect;
    c. Inside the repeating group, create a Character String element that is designed to capture a line of characters. Call it “NameLine”.
    d. The text we are looking for may contain upper- and lower-case letters, as well as a set of punctuation marks that may occur in names. Configure two separate character sets. The first set should contain all Latin upper- and lower-case letters. To add characters with diacritical marks, change the Unicode subrange or paste the characters directly into the Selected characters field.
    e. The other set should contain the following punctuation marks: ,-.()’. We don’t want the string to contain only punctuation marks, so set the Portion in text, % for the second set to 40%. This property defines the maximum allowed percentage of characters from a certain set.
Note: The default settings allow the string to contain up to 30% of characters not included in any set. This helps find strings even when some characters are recognized incorrectly or are not included in the set (such as characters with diacritical marks). You can adjust this setting by changing the Allowed errors value on the Properties pane.
    f. Disable the Search for parts of words option.
    g. Specify the search area for the “NameLine” element: below the “kwPatientTitle” element and nearest to it.
    h. Click Match and review the Tree of Hypotheses. You will see that two character strings are found. However, the second string contains the patient’s address.
    i. To exclude the address from the search results, we will check if the first string contains both the first and the last name. This can be done by adding a simple script search condition. Select the “NameLine” search element and open the Search Conditions code editor.
    j. We assume that the first line contains a full name if it contains a comma and a whitespace. If it contains a full name, we don’t want to search for a second instance of the repeating group. Paste the following script in the editor:
if (NameGroup.HasInstances and LastFound.NameLine.Value.Find(", ") > 0) then DontFind;
    k. Click Match and make sure that the name is found correctly.
  8. The patient’s name extracted in step 7 will be mapped to the “Name” field. We will also extract and map the patient’s address.
    a. Inside the “PatientGroup”, create a Character String search element called “Address” with the same character set configuration as the “NameLine” element.
    b. Specify the search area for the element using code: the address must be located below the “NameLine” or, in case this element was not found, below the first line of the “NameAddressParagraph” element.
RSA: PatientDataArea.NameAddressParagraph.Rect;
if NameGroup.HasInstances then
  RSA.Top: Max(RSA.Top, LastFound.NameLine.Rect.Bottom);
else
  RSA.Top: PatientDataArea.NameAddressParagraph.Lines[0].Rect.Bottom;
    c. Disable the Search for parts of words option.
    d. Click Match.
This is how the search element structure should look: [Image: AD_Tutorial_DE_Patient_Structure]
  9. Open the Manage Fields dialog, create the corresponding fields, and map them to search elements as follows:
Name | Type | Search element
Name | Text field in the “Patient” group | NameLine
Address | Text field in the “Patient” group | Address
  10. Delete the search elements that were automatically created for the new fields.
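
The name and address logic from steps 7 and 8 can be modeled outside Advanced Designer to make the stopping rule easier to follow. The sketch below is plain Python, not Advanced Designer script: the function name, the sample lines, and the simplification that every line below the name belongs to the address are assumptions made for illustration only.

```python
from typing import List, Tuple


def split_name_and_address(
    paragraph_lines: List[str], max_name_lines: int = 2
) -> Tuple[List[str], List[str]]:
    """Split the lines of a paragraph into name lines and address lines.

    Illustrative model only; the function and its parameters are hypothetical.
    """
    name_lines: List[str] = []
    for line in paragraph_lines:
        # Mirrors the maximum number of repetitions set on "NameGroup".
        if len(name_lines) >= max_name_lines:
            break
        # Mirrors the script condition: once the last found line contains
        # ", " at an index greater than 0, no further instance is searched.
        if name_lines and name_lines[-1].find(", ") > 0:
            break
        name_lines.append(line)
    # Simplification: everything below the last name line is the address.
    address_lines = paragraph_lines[len(name_lines):]
    return name_lines, address_lines


# Hypothetical sample data, not taken from the tutorial documents.
one_line_name = ["Mustermann, Max", "Musterstr. 1", "12345 Berlin"]
two_line_name = ["Mustermann", "Max", "Musterstr. 1"]

print(split_name_and_address(one_line_name))
print(split_name_and_address(two_line_name))
```

In the first sample the first line already contains a comma followed by a space, so the second line is left for the address. In the second sample the name spans two lines, and the repetition limit stops the search.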

Extracting the type of sick note

The type of sick note field has two checkboxes. They are labeled as “Erstbescheinigung” and “Folgebescheinigung”. The task is to find the labels and then to check whether there are filled checkmarks next to them.
  1. Create a Group element called “TypeOfSickNoteGroup”. Make the element optional.
  2. To store the information about both checkmarks, create a Repeating Group search element and call it “PrimaryGroup”.
    a. It is a good idea to restrict the search area for the element group. Specify the search area using code: to the right of the “PatientGroup” element and above the “DoctorAreaGroup” element (which will be created later on).
Note: Always specify the “Exists” condition when referring to elements that are created later.
if PatientGroup.Exists then RSA.Left: PatientGroup.NameGroup.NameLine.Rect.Right;
if DoctorAreaGroup.Exists then RSA.Bottom: DoctorAreaGroup.DataArea.SeparatorTop.Rect.Top;
    b. Create a Static Text search element called “kwPrimary” (text to find: “Erstbescheinigung”) and make it required.
    c. Create an Object Collection search element called “Checkmark” with the following settings: Type: Checkmark, Checkmark state: Checked, Minimum height: 10, Maximum width: 20, Maximum height: 20. Specify that the element is located to the left of the “kwPrimary” element and nearest to it.
    d. Click Match.
  3. Copy and paste the “PrimaryGroup” group. Rename the copied group to “SecondaryGroup”. This group will be required.
  4. Edit the “SecondaryGroup”.
    a. Rename the “kwPrimary” element to “kwSecondary” and set the text to find to “Folgebescheinigung”. Specify the search area: below the “kwPrimary” element from the “PrimaryGroup”.
    b. Specify the search area for the “Checkmark” element: to the left of “kwSecondary” and nearest to it.
    c. The Object Collection search element finds a collection of all suitable objects within the search area. If the checkmarks are located on the same line, the “Checkmark” element of the “SecondaryGroup” may also find the Primary checkmark. To avoid this, exclude the primary checkmark (“Checkmark” element of the “PrimaryGroup”) from the search area for the “Checkmark” element from the “SecondaryGroup”.
    d. Click Match.
This is how the search element structure should look: [Image: AD_Tutorial_DE_TypeOfSickNote_Structure]
  5. Open the Manage Fields window, create the corresponding fields, and map them to search elements as follows:
Name | Type | Search element
Type of Sick Note | Checkmark group |
Primary | Checkmark in the “Type of Sick Note” checkmark group | PrimaryGroup -> Checkmark
Secondary | Checkmark in the “Type of Sick Note” checkmark group | SecondaryGroup -> Checkmark
  6. Delete the search elements that were automatically created for the new fields.
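
The reason for excluding the primary checkmark from the “SecondaryGroup” search area can be illustrated with a small geometric model. This is a plain Python sketch rather than Advanced Designer script; the `Box` class, the `objects_left_of` helper, and all coordinates are hypothetical and only model the case where both checkboxes sit on the same line.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle on the page (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def objects_left_of(
    label: Box, candidates: Sequence[Box], exclude: Sequence[Box] = ()
) -> List[Box]:
    """Collect every candidate lying to the left of the label, minus exclusions.

    Models an element that returns a collection of all suitable objects
    in its search area, as described for Object Collection above.
    """
    return [c for c in candidates if c.right <= label.left and c not in exclude]


# Hypothetical layout: both checkboxes and labels share one text line.
primary_check = Box(10, 100, 25, 115)
secondary_check = Box(160, 100, 175, 115)
secondary_label = Box(180, 100, 320, 115)
checkmarks = [primary_check, secondary_check]

# Without the exclusion, the primary checkmark is collected as well.
print(objects_left_of(secondary_label, checkmarks))
# With the exclusion, only the secondary checkmark remains.
print(objects_left_of(secondary_label, checkmarks, exclude=[primary_check]))
```

The model shows why a "left of the label" constraint alone is not enough when the labels share a line: everything further left still qualifies until it is explicitly excluded.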

Extracting the doctor’s data

We now have to process the last block of data on these documents. It contains the doctor’s data and signature. We’ll first find the box which holds the data and then extract a paragraph with the doctor’s information and an image region containing the signature.
  1. Create a Group element called “DoctorAreaGroup”. Make the element optional.
  2. The box we’ll be looking for contains a label. To find it, create a Static Text element called “kwDoctorTitle” (text to find: “Unterschrift des Arztes”).
  3. Inside the “DoctorAreaGroup” group, create another group called “DataArea”.
  4. The box that contains the doctor’s information and signature is a combination of four separators. They are located around the “kwDoctorTitle” element. However, we should configure the elements in a way that allows the program to find them even if the “kwDoctorTitle” element wasn’t found. In the “DataArea” group, create four Separator search elements with the following properties:
Name | Orientation | Minimum length | Search area
SeparatorRight | Vertical | 180 | Right of “kwDoctorTitle”, Nearest to the right page edge
SeparatorLeft | Vertical | 180 | Left of “kwDoctorTitle”, Left of “SeparatorRight” (in case “kwDoctorTitle” wasn’t found), Nearest to “SeparatorRight”, Below “SeparatorRight” (click the icon to the right of the separator name and select Top Boundary of Region), Exclude “SeparatorRight”
SeparatorBottom | Horizontal | 200 | Below “kwDoctorTitle” (with adjustment of -10 points), Right of “SeparatorLeft”, Left of “SeparatorRight”, Nearest to the bottom page edge (this setting will be useful in case “kwDoctorTitle” wasn’t found)
SeparatorTop | Horizontal | 200 | Above “kwDoctorTitle”, Right of “SeparatorLeft”, Nearest to “TypeOfSickNoteGroup”, Exclude “SeparatorBottom”
You should also disable the Fits entirely within search area option for all these elements.
  5. We could specify the search area for the doctor’s signature and doctor information manually with respect to the found separators. Instead of doing so, we will create a Region element that corresponds to the area bounded by the separators. Create a Region search element called “BoxRegion” and specify the search area: left of “SeparatorRight”, right of “SeparatorLeft”, above “SeparatorBottom”, and below “SeparatorTop”.
  6. Create a new group called “DoctorGroup”.
  7. To locate the doctor’s signature, create an Object Collection element with the following settings inside the “DoctorGroup”:
Property | Value
Name | Signature
Type | Picture
Minimum width | 15
Minimum height | 15
Maximum width | 600
Maximum height | 350
Search Conditions section of the Code Editor | The signature may be partly located outside the box. To find the whole image, we will expand the search area by 100 dots in each direction: RSA: DoctorAreaGroup.DataArea.BoxRegion.Rect.GetInflated(100dot,100dot);
  8. To extract the text information in the box, create a Paragraph element with the following settings:
Property | Value
Name | DoctorInformation
Maximum line count | 6
Search area | Above “kwDoctorTitle”, Exclude “Signature”
Search Conditions section of the Code Editor | RSA: DoctorAreaGroup.DataArea.BoxRegion.Rect;
  9. Click Match and make sure the elements are found correctly.
This is how the search element structure should look: [Image: AD_Tutorial_DE_Doctor_Structure]
  10. Open the Manage Fields dialog, create the corresponding fields, and map them to search elements as follows:
Name | Type | Search element
Doctor Information | Text field in the “Doctor” group | DoctorInformation
Signature | Image field in the “Doctor” group | Signature
  11. Delete the search elements that were automatically created for the new fields.
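
The geometry behind “BoxRegion” and the expanded signature search area can be sketched as follows. This is an illustrative Python model, not Advanced Designer script: the `Rect` class, its `inflated` method (standing in for what the tutorial describes as expanding the area by 100 dots in each direction), and all coordinates are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in dots (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int

    def inflated(self, dx: int, dy: int) -> "Rect":
        # Assumed behavior: grow by dx on the left and right sides
        # and by dy on the top and bottom sides.
        return Rect(self.left - dx, self.top - dy, self.right + dx, self.bottom + dy)


def box_region(sep_left: Rect, sep_top: Rect, sep_right: Rect, sep_bottom: Rect) -> Rect:
    """Area right of the left separator, below the top one,
    left of the right one, and above the bottom one."""
    return Rect(sep_left.right, sep_top.bottom, sep_right.left, sep_bottom.top)


# Hypothetical separator rectangles forming a box.
sep_left = Rect(100, 500, 102, 700)
sep_right = Rect(400, 500, 402, 700)
sep_top = Rect(100, 500, 402, 502)
sep_bottom = Rect(100, 698, 402, 700)

region = box_region(sep_left, sep_top, sep_right, sep_bottom)
print(region)                     # inner area of the box
print(region.inflated(100, 100))  # widened area for the signature search
```

The inner region is used as is for the paragraph with the doctor’s information, while the widened rectangle gives room for a signature that crosses the box borders.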

Testing the activity

We have configured all the necessary search elements and fields. Select all documents, click Match, and switch to the Fields tab to review the field regions on the document images. Keep in mind that a region will be passed to a field only if it belongs to the hypothesis from the best path. Once you’re satisfied with the results, click the copy icon above the document image to copy predicted labeling to reference labeling.