You need to label a certain number of documents in order to train and test a skill. To do this, you should select regions on the document that contain field values. To select a region, do one of the following:
  • Hover over a word and click on it. This will create a region and copy the word to the field. Use this method to label fields that contain only one word.
  • Draw a rectangle around some words. All the words inside this rectangle will be copied to the field. We recommend using this method to label semi-structured documents.
  • Select a region by clicking on the first word of the sequence and, while holding down the left mouse button, dragging the cursor to the last word of the sequence. We recommend using this method to label unstructured documents.
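The rectangle method can be pictured as a simple containment test. The sketch below is an illustration only and does not describe how the program is implemented: the `Word` class, its coordinate fields, and both function names are hypothetical, and it assumes that a word counts as selected only when its whole box lies inside the rectangle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical recognized word with its bounding box in page coordinates."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def words_in_rectangle(words, left, top, right, bottom):
    """Return the words whose boxes lie entirely inside the drawn rectangle."""
    inside = [
        w for w in words
        if w.left >= left and w.right <= right
        and w.top >= top and w.bottom <= bottom
    ]
    # Assumed reading order: top to bottom, then left to right.
    return sorted(inside, key=lambda w: (w.top, w.left))


def field_value(words, left, top, right, bottom):
    """Join the selected words into the value copied to the field."""
    selected = words_in_rectangle(words, left, top, right, bottom)
    return " ".join(w.text for w in selected)
```

Under these assumptions, a rectangle that cuts through a word leaves that word out, which matches the advice to label whole words only.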
The guidelines below will help you label your documents properly depending on their type.

Structured Documents

Structured documents always include the exact same type of information in the exact same locations. Pre-formatted forms are one example of structured documents. You will need to label just a few sample documents for training, as there is no variation in their layout. Please follow the guidelines below when labeling structured documents.
  • Be sure to accurately specify the region of each field, as field values alone are not enough for training.
  • To mark out the region of a field, don’t click on its value, but mark out the entire placeholder instead.
  • If a field contains no value, mark out the empty placeholder.
  • If a field consists of multiple parts, hold down the Shift key to add the parts. Please note that all parts should be on the same page.
  • If a fixed form contains a table, mark out all the rows, including those that are empty.
  • If a field is added after some labeling has already been done, this new field must be labeled on all the documents in the training set. Please review all of your documents and label the new field on all the documents where it occurs.
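If you keep your own record of which fields have been labeled on which documents, the review described in the last guideline can be reduced to a simple lookup. The sketch below is a minimal illustration: the dictionary layout, the document names, and the function name are all hypothetical and are not part of the program.

```python
def documents_missing_field(labels, field_name):
    """List the documents in the training set that have no region for a field.

    `labels` is a hypothetical record that maps each document name to the
    set of field names already labeled on that document.
    """
    return sorted(
        doc for doc, fields in labels.items() if field_name not in fields
    )


# Hypothetical training set: "Due Date" was added after labeling had started.
labels = {
    "form_001.pdf": {"Name", "Total", "Due Date"},
    "form_002.pdf": {"Name", "Total"},
    "form_003.pdf": {"Name", "Total"},
}
to_review = documents_missing_field(labels, "Due Date")
```

Each document returned by the function still needs to be opened and checked by hand; the lookup only tells you where to look.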

Semi-Structured Documents

Semi-structured documents generally contain the same or similar types of information, but the location, size, and number of fields may vary from document to document. Examples of semi-structured documents include bills, payment orders, and invoices. Please follow the guidelines below when labeling semi-structured documents.
  • Be sure to accurately specify the region of each field, as field values alone are not enough for training.
  • To mark out the region of a field, click on its value (such as the word or words it contains), and the region will be created automatically.
  • If a field contains no value, do not create a region for such a field.
  • Do not mark out parts of words, as the program can only learn on whole words.
  • If a field consists of multiple parts, hold down the Shift key to add the parts. Please note that all parts should be on the same page.
  • If you have a repeating structure, analyze your documents first and create either a table or a repeating group. If your documents contain tables with a common header and values that do not have any keywords next to them, create a table. If your data is less structured and has keywords located next to the values, create a group with the Allow multiple items option. If data is organized differently on different documents, select the option that best fits the majority of the documents.
  • When labeling a table, mark out the first row, then click Continue table from this row, making sure that the entire table has been labeled correctly. To mark out the cells in the first row, click on its cells one by one, and the corresponding columns will be created automatically. Proceed until the entire table has been marked out.
If tables are large and the document pages look alike, you can delete the similar pages and label only the first page, the last page, and a few pages in between.
  • Do not instruct the program to find fields inside the region of another field, regardless of whether it is an individual field (e.g., an address) or a table cell (e.g., “Description”). If you need to extract data from a large text fragment, use a sequence of activities. First, use an activity designed to extract data from semi-structured documents and train it to find the desired region. Next, to extract specific fields from this region, use an activity designed to extract data from text (NLP) or implement your own script rules.
  • If a field is added after some labeling has already been done, this new field must be labeled on all the documents in the training set. Please review all of your documents and label the new field on all the documents where it occurs.
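The choice between a table and a repeating group described above can be written down as a small decision rule. The sketch below is only one reading of that guideline: the function names and the two yes/no observations per document are hypothetical. It also assumes that keywords next to the values point to a group even when a header is present, a case the guideline does not address.

```python
from collections import Counter


def suggest_for_document(has_common_header, keywords_next_to_values):
    """Suggest a structure for one document, or None if neither rule applies."""
    if keywords_next_to_values:
        return "repeating group"  # a group with the Allow multiple items option
    if has_common_header:
        return "table"
    return None


def suggest_for_set(observations):
    """Pick the option that fits the majority of the documents.

    `observations` is a list of (has_common_header, keywords_next_to_values)
    pairs, one pair per analyzed document.
    """
    votes = Counter(
        suggestion
        for suggestion in (suggest_for_document(h, k) for h, k in observations)
        if suggestion is not None
    )
    if not votes:
        return None
    # On a tie, Counter.most_common returns the option that was seen first.
    return votes.most_common(1)[0][0]
```

For example, if two of three analyzed documents have a common header and no keywords next to the values, `suggest_for_set` returns "table".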

Unstructured Documents

Unstructured documents contain information that is not structured in any way. Examples of unstructured documents include contracts, scientific articles, and e-mail messages. Please follow the guidelines below when labeling unstructured documents.
  • Be sure to accurately specify the region of each field, as field values alone are not enough for training.
  • When labeling segments (such as fields trained in the Segmentation activity), regions should include one or more whole paragraphs. A segment cannot include only a part of a paragraph.
  • To mark out the region of a field, click on its value (such as the word or words it contains) and the region will be created automatically.
  • If a field contains no value, do not create a region for such a field.
  • Do not mark out parts of words, as the program can only learn on whole words.
If a word is followed by a punctuation mark (for example, “… and Mary Jones, (“Borrower…”)”), adjust the region so that it does not enclose the punctuation mark.
  • Sometimes, a field region may spill over to the next page (for example, a clause in a contract). In this case, label the part of the field on the first page, then continue labeling on the next page while holding down the Shift key.
  • When creating a region for a field inside the region of another field (for example, to mark out a field inside a segment), select the desired field and just start labeling it inside the region of the other field. Doing so will not select the existing region but will create a new region for the selected field.
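The advice about punctuation can be illustrated with character offsets. The sketch below is a minimal illustration and not the program's behavior: it assumes that a region is described by hypothetical start and end offsets into the page text, and it shrinks the region until it no longer begins or ends with punctuation or whitespace. Values that legitimately end in punctuation (such as an abbreviation with a trailing period) would need to be handled separately.

```python
import string

# ASCII punctuation plus the curly quotes and ellipsis common in contracts.
PUNCTUATION = set(string.punctuation) | set("“”‘’…")


def trim_region(text, start, end):
    """Shrink [start, end) so it does not begin or end with punctuation."""
    while start < end and (text[start].isspace() or text[start] in PUNCTUATION):
        start += 1
    while end > start and (text[end - 1].isspace() or text[end - 1] in PUNCTUATION):
        end -= 1
    return start, end


text = "and Mary Jones, (“Borrower”)"
# A region that accidentally includes the comma after the name.
start, end = trim_region(text, text.index("Mary"), text.index(",") + 1)
name = text[start:end]
```

Punctuation inside the region, such as a hyphen in a double-barreled name, is left untouched because only the edges are trimmed.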