How company detection works - ABBYY Documentation

The detail and quality of data catalog records significantly affect the accuracy of company detection. The more closely the document issuer and receiver records match the text extracted from a document image, the more accurately the issuer and receiver companies are detected.

Best practices for accurate detection

To ensure that the detection results are as accurate as possible, make sure that:

Unique company identifiers are filled in. Filling in the unique value columns (Tax ID, National Tax ID, IBAN) significantly improves the probability of correct detection, since each of these values identifies exactly one company.
There are no duplicate company records. Removing duplicates increases the probability that the correct company record is detected.
There are no unrelated records. Outdated or invalid records in the data catalog may cause the company to be detected incorrectly because of coincidental similarities between various field values.
All fields are filled in for each company record. Specify as much accurate information about companies as possible. The more accurate the information, the higher the probability of correctly detecting the companies.
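
The checks above can be sketched as a small audit script. This is a minimal illustration only, not part of the product: the column names (`name`, `tax_id`, and so on) and the representation of catalog records as dictionaries are assumptions made for the example.

```python
from collections import Counter

# Hypothetical column names; a real data catalog may name them differently.
ID_COLUMNS = ("tax_id", "national_tax_id", "iban")
ALL_COLUMNS = ("name", "street", "postal_code", "city") + ID_COLUMNS


def audit_catalog(records):
    """Report records without identifiers, shared identifier values, and incomplete records."""
    # Records in which none of the unique identifier columns is filled in.
    missing_ids = [r.get("name") for r in records
                   if not any(r.get(c) for c in ID_COLUMNS)]
    # Identifier values that occur in more than one record (likely duplicates).
    counts = Counter((c, r[c]) for r in records for c in ID_COLUMNS if r.get(c))
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    # Records with at least one empty column.
    incomplete = [r.get("name") for r in records
                  if any(not r.get(c) for c in ALL_COLUMNS)]
    return missing_ids, duplicates, incomplete
```

Running such a check before loading records into the data catalog is one way to spot duplicates and gaps early. Which records count as unrelated or outdated still has to be decided manually.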

Company detection process

Company detection includes the following steps:

Step 1: Unique identifier search

The values of the following fields are considered to be unique company identifiers:

Tax ID
National Tax ID
IBAN

A Classify By Company activity searches the document image for the values of the fields listed above using keywords and regular expressions. If none are specified, this step is skipped. The Tax ID, National Tax ID, and IBAN values detected on a document image are used to query the data catalog. Next, the Tax ID, National Tax ID, and IBAN values received from the data catalog are matched against the values detected on the image (exact matching is used). For matching purposes, values are normalized as follows:

letters are changed to upper case
spaces and the following characters are removed: ".", ",", "—", "/", "\"
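
The normalization and exact matching described in this step can be sketched as follows. This is an illustrative approximation, not the product's implementation: the exact set of removed characters is an assumption (whitespace, period, comma, hyphen, dash, slash, and backslash), and the function names and dictionary-based catalog records are hypothetical.

```python
import re

# Assumed set of removed characters: whitespace, ".", ",", "-",
# the dash U+2014, "/", and "\".
_REMOVED = re.compile(r"[\s.,\-\u2014/\\]")

ID_COLUMNS = ("tax_id", "national_tax_id", "iban")  # hypothetical column names


def normalize_identifier(value: str) -> str:
    """Strip the separator characters and convert letters to upper case."""
    return _REMOVED.sub("", value).upper()


def find_by_identifier(detected_values, catalog):
    """Return catalog records with an identifier equal to a detected value.

    Both sides are normalized first, then compared exactly.
    """
    detected = {normalize_identifier(v) for v in detected_values}
    matches = []
    for record in catalog:
        ids = {normalize_identifier(record[c]) for c in ID_COLUMNS if record.get(c)}
        if ids & detected:
            matches.append(record)
    return matches
```

Under these assumptions, a value printed on the image as `de 89 3704-0044.0532/0130 00` and a catalog value stored as `DE89370400440532013000` compare as equal.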

Step 2: Company name and address search

The entire text detected on the document image is used to query the data catalog. Next, the Name, Street, Postal code, and City values received from the data catalog are matched against the values detected on the image (exact matching is used).

To get the best possible search results, make sure that the corresponding columns in the data catalog are filled in. Company name and address information is especially important in cases where the company cannot be identified using a Tax ID, National Tax ID, or IBAN.
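
To illustrate why filled-in name and address columns matter, the sketch below ranks catalog records by how many of their Name, Street, Postal code, and City values occur in the document text. It is a simplified stand-in, not the product's search: the column names, the substring comparison, and the ranking by match count are all assumptions made for the example.

```python
# Hypothetical column names for the name and address fields.
ADDRESS_COLUMNS = ("name", "street", "postal_code", "city")


def matched_fields(document_text, record):
    """List the filled-in columns whose value occurs verbatim in the text."""
    return [c for c in ADDRESS_COLUMNS
            if record.get(c) and record[c] in document_text]


def rank_by_address(document_text, catalog):
    """Return records with at least one matching field, best match first."""
    scored = [(len(matched_fields(document_text, r)), r) for r in catalog]
    scored = [(n, r) for n, r in scored if n > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]
```

A record with an empty column can never score on that column, so a sparsely filled record is easily outranked by a coincidental match on a common value such as a city name.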

Step 3: Generate hypotheses

Based on the companies found in steps 1 and 2, a set of hypotheses is generated. A Classify By Company activity evaluates these hypotheses and selects five document issuer and five document receiver company records that most reliably match the field values detected on the document image. These records are then used to form 25 pairs, with each pair treated as a separate hypothesis. A trained model then rates the hypotheses by reliability, selecting the best matching issuer–receiver pair.
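
The pairing described above (five issuer candidates combined with five receiver candidates, giving 25 hypotheses) can be sketched as a Cartesian product. In this sketch the `rate` callable is a hypothetical placeholder for the trained model, and the candidate lists are assumed to be already ordered from most to least reliable.

```python
from itertools import product


def best_pair(issuers, receivers, rate, top_n=5):
    """Pair the top candidates and return the highest-rated pair.

    issuers, receivers: candidate records, most reliable first.
    rate: callable(issuer, receiver) -> score; stands in for the trained model.
    Returns (best_pair_or_None, number_of_hypotheses).
    """
    # Every combination of a top issuer with a top receiver is one hypothesis.
    hypotheses = list(product(issuers[:top_n], receivers[:top_n]))
    if not hypotheses:
        return None, 0
    best = max(hypotheses, key=lambda pair: rate(*pair))
    return best, len(hypotheses)
```

With at least five candidates on each side, the function evaluates 5 × 5 = 25 hypotheses. With fewer candidates, the number of pairs shrinks accordingly.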

Even if the number of document receiver companies is very small (for example, if there is only one document receiver company), using a Document Receiver Companies data catalog is still recommended, as it will prevent a document receiver company from being incorrectly detected as a document issuer company.

If the Document Issuer Companies data catalog specifies that the Issuer Company ID depends on the Receiver Company ID, hypotheses are generated based on this correlation (see Looking for a pair of companies).

Results of detecting document issuer and receiver companies

Detecting the issuer and receiver companies on a document yields the following identifiers:

The issuer company identifier in the Document Issuer Companies data catalog
The receiver company identifier in the Document Receiver Companies data catalog

If the Document Issuer Companies data catalog specifies that the Issuer Company ID depends on the Receiver Company ID, the result of document issuer detection will contain the Issuer Company ID that corresponds to the Receiver Company ID. For more information, see Looking for a pair of companies.
