- You need to extract entities from a table.
- You do not have enough sample documents to train your NLP model.
- You are not satisfied with the quality of extraction on some of the fields.
- Identify text spans that match:
  - certain regular expressions
  - certain words or phrases from user dictionaries occurring in any inflected form in the text
  - any of the built-in NER objects:
    - People (NerPerson)
    - Organizations (NerOrg)
    - Locations (NerGeo)
    - Addresses (NerAddress)
    - Amounts of money (NerMoney)
    - Dates (NerDate)
    - Duration (NerDuration, available only for Russian and English texts)
    - Account numbers (NERAccountNumber, available only for Russian texts)
    Note: The NerMoney, NerDate, NerDuration and NERAccountNumber objects are used only in extraction scripts.
- Run queries on text and text spans where search words and phrases may occur in any inflected form.
- Save any identified text spans into document fields.
- Extract addresses and the following address components from documents:
  - ZIP code (NerZipCode)
  - Country (NerCountry)
  - State (NerState)
  - City (NerCity)
  - Street (NerStreet)
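As a rough illustration of what "text spans that match certain regular expressions" means, the following Python sketch finds matches in a string and reports each one as a start offset, an end offset, and the matched text. It does not use the product's extraction script API; the function name `find_spans`, the sample text, and the pattern are all hypothetical.

```python
import re
from typing import List, Tuple


def find_spans(text: str, pattern: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every non-overlapping match."""
    return [(m.start(), m.end(), m.group(0)) for m in re.finditer(pattern, text)]


# Hypothetical example: invoice-like identifiers such as "INV-00123".
sample = "Please pay INV-00123 and INV-00456 by Friday."
spans = find_spans(sample, r"INV-\d{5}")

for start, end, matched in spans:
    print(start, end, matched)
```

Each reported span is a half-open character range, so `sample[start:end]` reproduces the matched text.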
- Open the Document Definition editor.
- Select a document section, right-click it, and click Properties… on the shortcut menu.
- Click the NLP tab.
- Under Extraction Scripts, click Create…
- In the Extraction Script dialog box, do one of the following:
  - Click the Load… button to load a user dictionary.
  - Click the Edit… button to open the script editor.
The user dictionaries should be encoded in UTF-8 with BOM or ANSI.
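One way to produce a file in UTF-8 with BOM is sketched below in Python, whose `utf-8-sig` codec prepends the byte order mark when writing. The one-entry-per-line layout, the file name, and the sample entries are assumptions made for illustration; the dictionary file format itself is not described here.

```python
import codecs
import os
import tempfile


def write_dictionary(path, entries):
    # "utf-8-sig" makes Python prepend the UTF-8 byte order mark (EF BB BF).
    # Assumption: one dictionary entry per line.
    with open(path, "w", encoding="utf-8-sig", newline="\n") as f:
        for entry in entries:
            f.write(entry + "\n")


def has_utf8_bom(path):
    # Check the first three bytes of the file against the UTF-8 BOM.
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


# Hypothetical file name and entries.
path = os.path.join(tempfile.mkdtemp(), "vendors.txt")
write_dictionary(path, ["Acme Corporation", "Müller GmbH"])
print(has_utf8_bom(path))
```

Running `has_utf8_bom` on an existing dictionary before loading it is a quick way to confirm that a text editor did not silently save the file as UTF-8 without BOM.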
Extracting address components from a document
To extract address components, do the following:
- Specify the area of the document that contains the address.
You can search for address components in the entire field or in a part of the field. When you parse an address with the ParseAddressInPosition( resultCollectionNamePrefix : string, startPos : int, endPos : int ) or ParseAddressInSpan( resultCollectionNamePrefix : string, span : IInterval ) method, each word in the detected components receives the following attributes during indexing. These attributes can then be used in XML queries:
- The name of the collection in the format [resultCollectionNamePrefix]_[NerTypeOfComponent].
- The resultCollectionNamePrefix prefix.
- The type of the NER object.
Currently, you can only extract components of German and US addresses.
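The collection naming format described above can be sketched in Python as follows. This is only an illustration of the [resultCollectionNamePrefix]_[NerTypeOfComponent] pattern, not a call into the product. It assumes that the square brackets are placeholders and not literal characters, and that the type names are the address component objects listed earlier. The helper `collection_names` and the prefix "Shipping" are hypothetical.

```python
# Address component types as listed in this article.
ADDRESS_COMPONENT_TYPES = [
    "NerZipCode",
    "NerCountry",
    "NerState",
    "NerCity",
    "NerStreet",
]


def collection_names(prefix):
    """Map each component type to its assumed collection name: prefix_type."""
    return {t: f"{prefix}_{t}" for t in ADDRESS_COMPONENT_TYPES}


# Hypothetical prefix, as it might be passed in resultCollectionNamePrefix.
names = collection_names("Shipping")
for component_type, name in names.items():
    print(component_type, "->", name)
```

Choosing a distinct prefix for each parsed address (for example, one for a shipping address and one for a billing address) keeps the resulting collections apart when they are referenced in queries.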
