Detecting the main fields - ABBYY Documentation

This article describes how the main fields of an invoice are detected and captured. The program starts processing an invoice by recognizing its text in accordance with the Document Definition settings:

Recognition mode (Fast, Balanced, Normal, or Accurate) determines the speed of recognition and the quality of the resulting text layer. To specify a recognition mode, in the Document Definition Editor, click Document Definition → Document Definition Properties… → Recognition.
Recognition languages are the languages used for recognition. To specify them, in the Document Definition Editor, click Document Definition → Document Definition Properties… → Document Definition Settings, and then click Edit in the Countries and Languages group to select the required languages.

Recognition languages in FlexiCapture for Invoices are tied to the country settings. When you add an invoice country to the Countries and Languages group, the corresponding languages automatically appear in the Document Definition settings. Invoice fields are extracted upon recognition.

To detect and capture fields on an invoice, the program can use:

A FlexiLayout
Neural networks

Both methods are described below, together with the algorithm that combines the results obtained by both methods or selects the best result.

Using a FlexiLayout

Business unit and vendor

The following may be used to determine the Vendor and Business Unit:

Document Definition settings: IBAN, VATID, and NationalVATID formats, as well as the corresponding keywords.
Data set record fields: IBAN, VATID, NationalVATID, Name, Street, City, ZIP.

For more information about the BusinessUnits and Vendors columns in the Data sets and how to use them, see BusinessUnits data set and Vendors data set.

Automatic company detection algorithm

The detail and quality of the information entered in the Data set columns has a significant impact on detection quality. To ensure that the search results are as accurate as possible, make sure that:

The unique company identifiers are filled in. Filling in unique-value columns (VATID, NationalVATID, IBAN) significantly improves the probability of correct detection, since each of these values identifies exactly one company.
There are no repeating company records. The absence of repeating records improves the probability of correctly detecting the company. For more information, see Eliminate duplicate records in the external database.
There are no unrelated records. Outdated or invalid records in the Data set may cause the company to be detected incorrectly because of coincidental similarities between various field values.
All fields are filled in for every company record. Specify as much information about companies as possible. The more fields are filled in, the higher the probability of correctly detecting the company.
Multiple-value columns are used to store the same information written in different ways, not different information altogether. For example, if a single company has several addresses, there must be a separate record for each of them, even if all other fields contain the same information. For more information, see Preparing vendor and business unit databases.

The automatic vendor and business unit detection algorithm consists of the following steps:

Unique identifier search

The following fields are considered to be unique company identifiers:

VATID
NationalVATID
IBAN

FlexiCapture for Invoices searches the document image for the values listed above. In the Document Definition properties (Document Definition Settings tab, Countries and Languages group), the VATID, NationalVATID, and IBAN formats (Formats tab) and keywords (Keywords tab) are set for each country using regular expressions.

Correctly filled-in keywords and identifier formats significantly improve detection quality.

The program looks for exact matches on the image for such fields. Regular expressions can also take possible recognition errors into account, by means of extended regular expressions. For more information, see Extended regular expressions.

ABBYY FlexiCapture for Invoices offers preset regular expressions, but you can create your own if required. To do so, navigate to the Countries and Languages group on the Document Definition Settings tab, select the appropriate country, and click Edit….

Detected values are normalized as follows:

Letters are converted to uppercase.
Spaces and the following characters are removed: ., ,, —, /, \.

If the letter prefix of a field is specified using a regular expression in the country properties on the Formats tab, the recognized prefix is replaced by the primary prefix (also set on the Formats tab). For example, the identifier DE12345 may be recognized as OE12345; the detected prefix OE is then replaced with the correct prefix DE.

The VATID, NationalVATID, and IBAN fields detected on a document image are used to query the Data set. The VATID, NationalVATID, and IBAN column values received from the Data set are normalized the same way as the values detected on the image, and then matched (using exact matching) to the normalized values detected on the image.
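The normalization and exact-matching steps described above can be sketched in Python as follows. This is only an illustration of the described behavior, not the product's implementation: the function names, the prefix_pattern and primary_prefix parameters, and the record layout (a list of dictionaries) are assumptions made for the example, and the prefix pattern uses Python regular expression syntax rather than the syntax used on the Formats tab.

```python
import re

# Characters removed during normalization, as listed above:
# spaces and the characters . , — / \
_STRIP_CHARS = re.compile(r"[ .,—/\\]")


def normalize_identifier(value, prefix_pattern=None, primary_prefix=None):
    """Uppercase the value, strip separators, and optionally fix the prefix.

    prefix_pattern and primary_prefix are hypothetical stand-ins for the
    prefix regular expression and the primary prefix set on the Formats tab.
    """
    normalized = _STRIP_CHARS.sub("", (value or "").upper())
    if prefix_pattern and primary_prefix:
        # Replace a recognized (possibly misread) prefix with the primary one.
        normalized = re.sub(
            r"^(?:%s)" % prefix_pattern,
            lambda _match: primary_prefix,
            normalized,
            count=1,
        )
    return normalized


def find_matching_records(detected_value, records, column, **prefix_options):
    """Return records whose normalized column value equals the detected value."""
    target = normalize_identifier(detected_value, **prefix_options)
    return [
        record
        for record in records
        if normalize_identifier(record.get(column), **prefix_options) == target
    ]
```

With a prefix pattern such as [DO]E and the primary prefix DE, a value recognized as "oe 123.45" is normalized to DE12345 and can then be matched exactly against a Data set value stored as "DE 123/45".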

Company name and address search

A query that uses all document text to look for the records that match it most accurately is sent to the Data set.

The Name, Street, ZIP, and City values detected on the image are matched to the corresponding Data set record values.

To get the best possible name and company search results, make sure that the corresponding Data set columns are filled in. Company name and address information is especially important when the company cannot be identified using VATID, NationalVATID, or IBAN.

Hypothesis formation

The companies found during the previous steps are used to form a set of hypotheses. ABBYY FlexiCapture for Invoices evaluates these hypotheses and selects the 5 vendor records and 5 business unit records that most reliably match the field values on the document image. These records form 25 vendor–business-unit pairs, with each pair treated as a separate hypothesis. A neural network algorithm then rates the hypotheses by reliability, and the best-fit vendor–BU pair becomes the final hypothesis and the result of vendor and business unit detection.
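The pairing step can be illustrated with a short Python sketch. It assumes that each candidate is a (record_id, match_score) tuple and that score_pair is a caller-supplied function standing in for the neural network rating; both are hypothetical and only show how 5 vendor and 5 business unit records yield 25 pair hypotheses.

```python
from itertools import product


def form_pair_hypotheses(vendor_candidates, bu_candidates, score_pair, top_n=5):
    """Pair the best vendor and business unit candidates and rank the pairs.

    Each candidate is a (record_id, match_score) tuple. score_pair is a
    hypothetical callable that stands in for the neural network rating.
    """
    by_score = lambda candidate: candidate[1]
    top_vendors = sorted(vendor_candidates, key=by_score, reverse=True)[:top_n]
    top_bus = sorted(bu_candidates, key=by_score, reverse=True)[:top_n]
    # Every vendor is combined with every business unit: up to top_n * top_n pairs.
    pairs = list(product(top_vendors, top_bus))
    # The first pair in the returned list plays the role of the final hypothesis.
    return sorted(pairs, key=lambda pair: score_pair(*pair), reverse=True)
```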

If only the vendor database is connected, the quality of the vendor–BU pair evaluation may be negatively affected. We recommend connecting a business unit database even if business unit detection is not required. For more information, see Using vendor and business unit databases.

If there is a very small number of business units (for example, one), connecting such a database does not significantly affect the evaluation. However, it may improve detection quality when a business unit is being incorrectly detected as a vendor.

Hypothesis filtering

Hypotheses are split into two groups based on how reliably the Data set record matches the field values on the document image:

Reliably matching the document image
Unreliably matching the document image

Depending on the verification scenario, you can decide whether to take hypothesis reliability into account when detecting the vendor and business unit. To make ABBYY FlexiCapture for Invoices select the final hypothesis exclusively from reliable hypotheses, use the InvoiceReader/ShouldFilterUnsureCompanyHypotheses registry flag, which can be set to:

true — filtering is enabled, and the final hypothesis is selected exclusively from the reliable hypotheses (default).
false — filtering is disabled, and the final hypothesis is selected from all hypotheses regardless of their reliability.

Hypothesis filtering works differently for vendors and business units:

When detecting vendors, no unreliable hypotheses are considered. If there are no reliable hypotheses, a vendor is not detected.
When detecting business units:
- If at least one reliable hypothesis has been found, no unreliable hypotheses are considered.
- If the set of hypotheses does not contain at least one reliable hypothesis, the flag value is ignored, and the final hypothesis is selected from the unreliable hypotheses.
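The filtering rules for vendors and business units can be summarized in a small Python sketch. It is a simplification under stated assumptions: hypotheses are modeled as (record_id, score, is_reliable) tuples, the best hypothesis is simply the one with the highest score (the product rates vendor–BU pairs instead), and the filter_unsure argument mirrors the ShouldFilterUnsureCompanyHypotheses flag.

```python
def select_final_hypothesis(hypotheses, company_type, filter_unsure=True):
    """Pick the final hypothesis from (record_id, score, is_reliable) tuples.

    company_type is "vendor" or "business_unit". Returns None when no
    hypothesis may be selected.
    """
    reliable = [h for h in hypotheses if h[2]]
    if not filter_unsure:
        pool = hypotheses          # filtering disabled: consider everything
    elif reliable:
        pool = reliable            # reliable hypotheses exclude unreliable ones
    elif company_type == "business_unit":
        pool = hypotheses          # flag is ignored for business units
    else:
        pool = []                  # vendor: nothing reliable, nothing detected
    return max(pool, key=lambda h: h[1], default=None)
```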

This is due to the differences between vendor and business unit Data sets:

There are usually far fewer business unit records than vendor records. They also change far less frequently, so they are easier to keep up-to-date. Therefore, detecting a reliable hypothesis increases the probability of the final hypothesis being correct. However, detecting a business unit is important even if no reliable hypotheses have been found, because the most important factor in the reliability of the detection result is the reliability evaluation of the vendor–BU pairs.
There are usually far more vendor records, and the Data set contains more columns, because vendors specify more information about their own company on their invoices than about the business unit. Records can also contain outdated information, so unreliable hypothesis filtering depends on both the quality of the Data set and the verification scenario type.

To improve the probability of detecting reliable hypotheses, keep Data sets up-to-date and include as much information about vendors and business units as possible.

Results of detecting the vendor and business unit

The main results of detecting the vendor and business unit on the invoice are:

The identifier of the vendor record in the Vendors data set
The identifier of the business unit record in the BusinessUnits data set

If the Vendors data set specifies that Id depends on BusinessUnitId (see Vendors data set), the result of vendor detection contains the Id that corresponds to the BusinessUnitId.

A vendor (business unit) may be detected unreliably. In this case, the document’s registration parameter fc_Predefined:InvoiceIsVendorSuspicious (fc_Predefined:InvoiceIsBusinessUnitSuspicious) is set to true.

The regions of the following fields may be found as a result of vendor and business unit detection:

For the vendor: Name, VatID, NationalVatID, IBAN, Street, Zip, City.
For the business unit: Name, VatID, Street, Zip, City.

By examining the locations of these regions on the image, you can see exactly where the program found the fields of the Vendor and Business Unit field groups, which enabled it to detect the vendor and the business unit.

If the field values for IBAN and VATID are absent from the Vendors data set, keywords and format can be used to detect the appropriate values the same way that bank details are detected (if the corresponding vendor has been found).

Search for any field region can be modified through training or by applying an additional FlexiLayout (see Capturing additional invoice fields). This has no effect on vendor and business unit detection, but may affect the location of the field regions in these field groups after matching the Document Definition with the invoices.

An important result of detecting the vendor and business unit is that information about their respective countries is retrieved from the CountryCode field of the records found in the data set. This information is then used to select keywords and tax rates, to capture other invoice fields, and as a condition for launching validation rules for the invoice.

How to change the way the program detects the vendor or business unit

The better a vendor or business unit record in the data set matches the text extracted from an invoice image, the more accurately the program detects the vendor or business unit.

First, identify the data in the external database that corresponds to the data set columns used for finding the company on an invoice. The external database and the data set have to be properly connected. For more information, see Using vendor and business unit databases.

If the same company occurs in both the list of vendors and the list of business units, specify the same VATID for the respective records in both data sets (even if there is no VATID on invoices). This prevents the program from detecting the vendor and business unit incorrectly.

To compensate for possible variations in field values on images, use:

Normalization of data set columns (see Normalization of values in data sets)
Multiple-value data set columns (see Multiple-value columns in a data set)

Using pre-determined vendor and business unit values

The vendor or business unit of an invoice can be determined in advance based on the invoice’s source (the name of the Scanning Operator or the email address of the message’s sender). You can specify the vendor or business unit explicitly prior to automatic detection. To do so, set the value of the document’s registration parameter fc_Predefined:InvoicePredefinedVendorId (fc_Predefined:InvoicePredefinedBusinessUnitId) to the identifier (Id) of an entry in the Vendors or BusinessUnits data set.

Setting this parameter does not prevent automatic detection of the vendor or business unit. As a result, in addition to the pre-determined vendor or business unit, you get a confidence value (indicating how well the pre-determined values match the values extracted from the image), as well as the regions of fields from the Vendor and Business Unit field groups.

Invoice Header field group

InvoiceNumber and InvoiceDate

An invoice’s header includes, among others, the InvoiceNumber and InvoiceDate fields. These fields are detected using keywords specified in the language properties of the Document Definition. The vendor and business unit are detected first, providing information about their countries. The countries determine the languages (languages that correspond to a country are specified in the Document Definition). The set of keywords for finding fields is taken from the countries of the vendor and the business unit. You can change the way the program looks for field regions by editing keywords (see Keywords) and by using training (see Training ABBYY FlexiCapture for Invoices).

How the program determines that a document is an invoice

FlexiCapture determines whether a document is an invoice when applying the FlexiLayout. The conditions listed below indicate that a document is an invoice. Not all of these conditions have to be met, but each one carries a certain weight.

InvoiceNumber and InvoiceDate fields were detected.
Keywords from the InvoiceIdentifiers located element were detected (see Keywords).
A vendor or a business unit was detected on the document.
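The idea of weighted conditions can be illustrated with the following Python sketch. The weights, the threshold, and all names are invented for the example; the actual values used by the program are not given in this article. The credit note check follows the rule stated in the paragraph after this example (a credit note keyword or a negative Total).

```python
# Hypothetical weights and threshold, chosen only for illustration.
EVIDENCE_WEIGHTS = {
    "number_and_date_found": 4,      # InvoiceNumber and InvoiceDate detected
    "identifier_keywords_found": 3,  # InvoiceIdentifiers keywords detected
    "company_found": 3,              # vendor or business unit detected
}
INVOICE_THRESHOLD = 5


def invoice_score(evidence):
    """Sum the weights of the conditions that are met."""
    return sum(
        weight for name, weight in EVIDENCE_WEIGHTS.items() if evidence.get(name)
    )


def looks_like_invoice(evidence, threshold=INVOICE_THRESHOLD):
    """Not every condition has to be met; the combined weight decides."""
    return invoice_score(evidence) >= threshold


def looks_like_credit_note(credit_keyword_found, total):
    """A credit note keyword or a negative Total indicates a credit note."""
    return bool(credit_keyword_found) or (total is not None and total < 0)
```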

A document can be identified as a credit note if keywords from the CreditNoteKeyword element were detected on the image or if the document has a negative Total.

Amounts field group

FlexiCapture for Invoices captures the following fields from an invoice:

Field	Invoice Processing (Au-NZ, US, CA, EU, JP)	Invoice Processing (ES)
The total sum of the invoice (Total) and the currency of the invoice (Currency)	Yes	Yes
Taxes: the total without taxes (NetAmount0), the sum of the invoice prior to taxation (TotalNetAmount), the amount of payable tax (TotalTaxAmount)	Yes	Yes
Tax groups: sum prior to taxation (NetAmount), amount of payable tax (TaxAmount), tax rate (TaxRate)	No	Yes
Additional tax (AdditionalCosts)	Yes	Yes

Information from the Document Definition is used to find sums and tax rates:

Rates of taxes payable in the vendor’s country (you can specify these on the Tax Rates tab of the country’s properties — see Country and language settings).
Keywords for tax rates (you can specify these on the Keywords tab of the language’s properties — see Keywords).

The program tries to find up to two tax rates on the image. If there are more than two tax rates in the invoice, additional fields can be created and filled in manually on the data form.

The program uses keywords to detect the TotalTax and TotalNetto fields. You can specify these keywords in the properties of a country or language, depending on how the keyword should be used (for more information, see Country and language settings). For more information about keywords, see Keywords.

There are two types of keywords for the Total field, located in different categories (for more information about Located element categories, see Keywords):

AmountTotalHighConfidenceLabels: keywords that only occur near the Total field, such as “Pay this amount.”
AmountTotalLowConfidenceLabels: keywords that can occur near the Total field but can also occur near other fields. For example, the keyword “Total” can appear near the Total field but may also occur near a field that contains the total weight of all items on an invoice.

If you are not sure which of these two categories to add a keyword to, add it to AmountTotalHighConfidenceLabels. If you encounter invoices where the keyword causes the program to identify another field as the Total field, you can move it to AmountTotalLowConfidenceLabels.

In addition to keywords, the program looks for the following items when attempting to detect the Total field:

Numbers that occur two or three times in the same line or column on the image. Such numbers may be the Total on invoices where no taxes are specified.
Numbers that are sums of the numbers located above them in the same column.
The largest (by absolute value) numbers located at the end of the document.
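The three heuristics above can be sketched in Python as follows. This is a simplified illustration, assuming the numbers of a line or column have already been extracted into lists of floats in reading order; the function names, the tolerance, and the tail size are hypothetical, and the column check only compares each number with the sum of all numbers above it.

```python
from collections import Counter


def repeated_numbers(values, min_count=2, max_count=3):
    """Numbers that occur two or three times in the same line or column."""
    counts = Counter(values)
    return [v for v, count in counts.items() if min_count <= count <= max_count]


def column_sum_candidates(column, tolerance=0.005):
    """Numbers that equal the sum of all numbers above them in the column."""
    candidates = []
    running_sum = 0.0
    for index, value in enumerate(column):
        # Require at least two numbers above the candidate.
        if index >= 2 and abs(running_sum - value) <= tolerance:
            candidates.append(value)
        running_sum += value
    return candidates


def largest_trailing_number(values, tail_size=5):
    """The largest number by absolute value among the last few numbers."""
    tail = values[-tail_size:]
    return max(tail, key=abs, default=None)
```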

The program searches for the Currency field only if a Total field has been detected. Keywords from the properties of the country in the Document Definition are used. Any fields in the Amounts field group that could not be detected on the image are calculated automatically, except for the Total field, which must be detected on the image. If the program fails to correctly extract information from the fields in the Amounts field group, the Total field is marked as requiring verification. If the program fails to detect the Total and Currency fields with a high degree of confidence, or fails to detect them altogether, you can use training to improve the quality of extraction.

Purchase Order field group

FlexiCapture for Invoices can extract all purchase order numbers and their corresponding sums from the invoice. This feature is disabled by default (see Purchase order matching). To extract Purchase Order numbers, you need a data set with a list of possible Purchase Order numbers and their sums (see PurchaseOrders data set). The Purchase Order field can be extracted using:

A regular expression
A data set containing possible purchase order numbers (see PurchaseOrders data set)

If a data set with possible purchase order numbers is used, FlexiCapture for Invoices searches images for numbers from this data set. It is best to have as few purchase order numbers in the database as possible. To decrease their number, you can:

Use the VendorId column of the data set. In this case, the program only uses Purchase Order numbers from the invoice’s vendor.
Filter out purchase orders for which an invoice has already been received, and add only the numbers of purchase orders for which no invoice has been received yet.

The program searches the database for sums that correspond to detected Purchase Order numbers. It also searches the image for all Purchase Order numbers, including those in the invoice’s line items. Purchase orders are usually generated by the buyer’s ERP system, so invoices billed to a specific Business Unit tend to be similar, and it is usually possible to describe them using a regular expression. If there is a regular expression for purchase order numbers, the program detects all numbers on images that satisfy the expression. The regular expression can be specified in an XML configuration file using the following tags:

<InvoiceSettings>
...
<OrderNumber>
   <Value>
      <RegularExpression></RegularExpression>
   </Value>
</OrderNumber>
</InvoiceSettings>

For more information about XML configuration files, see Editing invoice processing settings in XML files.
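As an illustration of regular-expression-based detection combined with a data set lookup, consider the Python sketch below. The pattern (PO, an optional hyphen, and eight digits) is a made-up example, the OrderNumber and Total keys are hypothetical column names (only VendorId is mentioned above), and Python regular expression syntax may differ from the syntax expected in the XML configuration file.

```python
import re

# Hypothetical order number format: "PO", an optional hyphen, eight digits.
ORDER_NUMBER_PATTERN = re.compile(r"\bPO-?\d{8}\b")


def find_order_numbers(text):
    """Return all distinct order numbers in the text, in order of appearance."""
    found = []
    for number in ORDER_NUMBER_PATTERN.findall(text):
        if number not in found:
            found.append(number)
    return found


def lookup_order_totals(order_numbers, purchase_orders, vendor_id=None):
    """Map detected order numbers to their sums from data set rows.

    purchase_orders is a list of dictionaries. If vendor_id is given,
    only rows that belong to that vendor are considered.
    """
    totals = {}
    for row in purchase_orders:
        if vendor_id is not None and row["VendorId"] != vendor_id:
            continue
        if row["OrderNumber"] in order_numbers:
            totals[row["OrderNumber"]] = row["Total"]
    return totals
```

Restricting the lookup to the detected vendor corresponds to using the VendorId column to reduce the number of purchase orders that are considered.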

The Line Items field group

FlexiCapture for Invoices can extract invoice line items from images. Extraction of invoice line items is disabled by default (see Additional fields). For a list of fields the program extracts automatically, see Captured fields. FlexiCapture for Invoices first searches the image for a table. During this search, it uses the keywords for column titles specified for every language in the Document Definition’s properties. Keywords for columns of invoice line items are also used for classifying items, that is, for determining the type of each invoice line item column. The program then uses information about detected columns and mathematical expressions to find invoice line items in the invoice’s table. Finally, it searches invoice line items for fields from columns. Training can be used to improve the quality of automatic line item extraction.

Using neural networks

One of the main advantages of neural networks is their ability to self-learn: they can detect complex dependencies among input data and make useful generalizations. The program includes two neural networks that can be used to capture the following fields:

InvoiceNumber
InvoiceDate
Total
Vendor\Name
Vendor\Address
Business Unit\Name
Business Unit\Address
Purchase Orders\Order Number
LineItems:
- OrderNumber
- OrderDate
- Position
- ArticleNumber
- Description
- Quantity
- Unit of measurement
- Unit Price
- Total Price Netto
- VATPercentage

For maximum precision, the program uses both a FlexiLayout and its neural networks to capture invoice fields. Fields that the program fails to extract using its neural networks are extracted using the FlexiLayout. If a field can be extracted by both the neural networks and the FlexiLayout, the program intelligently combines the results. How the results are combined depends on the field. For more information, see Combining the field detection results.

Disabling the neural networks

By default, the neural networks are used as the second method of capturing document fields. If you need to process documents other than invoices within your invoice project, you may want to disable the neural network, as it was specifically trained to capture invoice fields and may not perform well on other types of documents. To disable the neural network for the Line Items group:

Open the Document Definition Editor.
Click Document Definition Properties… → Document Definition Settings → Additional Fields and Features.
Disable the Thorough extraction of invoice line items option.

To disable the neural network for the Invoice Header, Vendor, Business Unit, and Purchase Order groups:

Open the Document Definition Editor.
Click Document Definition Properties… → Document Definition Settings → Additional Fields and Features.
Disable the Thorough extraction of invoice header fields option.

Combining the field detection results

How the program combines the field detection results or selects the best result depends on the field. As a general rule, precedence is given to the results obtained by the respective neural network. The exceptions are searches based on data sets and searches using regular expressions created for specific customer documents.

Invoice Header field group

The results obtained by the neural network always have precedence for the following fields:

Invoice Number
Invoice Date
Total

Business unit and vendor

By default, the business unit and vendor are detected based on a data set, provided a data set is selected. Additionally, the following fields may be detected using the neural network if there is no corresponding record in the data set:

Name
VATID (ABN)
Address

If no data set is selected, only the neural network is used.

Purchase Order field group

The neural network is only used if the value is not detected by means of a data set or a regular expression.

Line items

For line item fields, precedence is given to the results obtained by the neural network. If the neural network detects the entire table of line items, this table is used for further processing. Otherwise, the program uses the line items detected by means of the FlexiLayout. If the neural network detects only the Description and TotalPriceNetto fields for each line item, they are complemented with the fields detected by means of the FlexiLayout.
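The per-field precedence rules in this section can be condensed into a small Python sketch. It covers single-value fields only and leaves out the table-level handling of line items; the function name, the use of None for "not detected", and the grouping of fields into a set are assumptions made for the example.

```python
# Fields for which a data set or regular expression search takes precedence
# over the neural network (names follow the field list in this article).
DATA_DRIVEN_FIELDS = {
    "Vendor\\Name",
    "Vendor\\Address",
    "Business Unit\\Name",
    "Business Unit\\Address",
    "Purchase Orders\\Order Number",
}


def combine_results(field, nn_value, other_value):
    """Choose between the neural network value and the other method's value.

    other_value is the value found by the FlexiLayout, a data set, or a
    regular expression. None means that the method detected nothing.
    """
    if field in DATA_DRIVEN_FIELDS:
        first, second = other_value, nn_value
    else:
        # General rule: the neural network result has precedence.
        first, second = nn_value, other_value
    return first if first is not None else second
```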
