The Segmentation activity segments the text of unstructured documents into paragraphs. This lets the program narrow down the search regions for fields that are extracted by other activities. The activity can also be used to extract entire paragraphs into text fields (for example, to extract legal clauses and conditions from a contract).
Sample Paragraph

Use Cases

Add this activity to your document processing flow in the following cases:
  • When you know that the named entities you want to extract are always located in the same paragraph. For example, if the organization names and addresses you need are located in the first paragraph of each contract, you can extract the first paragraph using a Segmentation activity, and then extract company names and addresses from that paragraph using a Named Entities (NER) activity. This approach is more reliable than extracting named entities from the entire document, because you control the specific area from which the entities are extracted.
  • When a paragraph needs to be extracted in its entirety because all of its contents are valuable, for example, a paragraph that contains the payment terms of a contract.
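
The first use case can be illustrated with a minimal sketch. It is not the product's implementation or API: `split_paragraphs` is a naive stand-in for segmentation that assumes paragraphs are separated by blank lines, and `find_orgs` is a toy regular-expression stand-in for a named-entity extractor. All names and the sample contract text are hypothetical.

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # Naive stand-in for segmentation: assume blank lines separate paragraphs.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Toy stand-in for a named-entity extractor: one or more capitalized
# words followed by a company suffix.
ORG_PATTERN = re.compile(r"\b(?:[A-Z][A-Za-z]+ )+(?:Inc\.|LLC|Ltd\.)")

def find_orgs(text: str) -> list[str]:
    return ORG_PATTERN.findall(text)

# Hypothetical two-paragraph contract.
contract = (
    "This agreement is made between Acme Tools Inc. and Birch Supply LLC.\n"
    "\n"
    "Payment is due within 30 days. Late fees are handled by "
    "Northwind Collections Ltd."
)

paragraphs = split_paragraphs(contract)

# Searching only the first paragraph returns the contracting parties.
orgs_in_first = find_orgs(paragraphs[0])

# Searching the whole document also picks up an unrelated organization.
orgs_in_whole = find_orgs(contract)
```

Restricting the search to the first paragraph keeps the organization mentioned in the payment terms out of the result, which is the effect described above for running a Named Entities (NER) activity on a segmented paragraph.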

How It Works

Segmentation activities are trained using reference labeling, so it is essential to correctly label as many documents as possible. If the training set contains enough documents, the activity is trained using cross-validation: the document set is divided into several subsets, and the activity is trained several times. Each time, one subset is excluded from training and used for internal testing, which allows the training results to be validated. This technique improves extraction accuracy and also helps detect labeling errors and suggest corrections for them. The recommended number of sample documents is as follows:
  • For high-variability documents, at least 100 sample documents are required.
  • For low-variability documents, at least 20 sample documents are required.
For more information, see Setting up a Segmentation activity.
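
The cross-validation scheme described above can be sketched generically as follows. This is an illustration of k-fold splitting in general, not the product's internal procedure: the function name `k_fold_splits`, the choice of 5 subsets, and the set of 20 documents are assumptions made for the example.

```python
def k_fold_splits(n_docs: int, k: int):
    """Yield (train_indices, test_indices) once for each of k subsets."""
    if not 2 <= k <= n_docs:
        raise ValueError("k must be between 2 and the number of documents")
    indices = list(range(n_docs))
    # Spread documents as evenly as possible: the first n_docs % k
    # subsets receive one extra document.
    base, extra = divmod(n_docs, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        # This subset is held out for internal testing...
        test = indices[start:start + size]
        # ...and all remaining documents are used for training.
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Assumed example: 20 labeled documents divided into 5 subsets.
splits = list(k_fold_splits(n_docs=20, k=5))
for train, test in splits:
    print(f"train on {len(train)} documents, test on {len(test)}")
```

Every document is held out exactly once, so each labeled document is checked against a model that was not trained on it. That is what makes it possible to flag labels that disagree with what the rest of the set predicts.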