The detail and quality of data catalog records significantly affect the accuracy of company detection. The more closely the document issuer and receiver records match the text extracted from a document image, the more accurately the document issuer and receiver companies are detected.

Best Practices for Accurate Detection

To ensure that the detection results are as accurate as possible, make sure that:
  • Unique company identifiers are filled in. Filling in the unique value columns (Tax ID, National Tax ID, IBAN) significantly improves the probability of correct detection, since these values are unique to each company.
  • There are no duplicate company records. Duplicate records reduce the probability of correctly detecting the company.
  • There are no unrelated records. Outdated or invalid records in the data catalog may cause the company to be detected incorrectly because of coincidental similarities between various field values.
  • All fields are filled in for each company record. Specify as much accurate information about companies as possible. The more accurate the information, the higher the probability of correctly detecting the companies.
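The checks above can be partly automated before a data catalog is uploaded. The following is a minimal sketch, assuming the catalog has been exported as a list of dictionaries. The column names and the `audit_catalog` helper are hypothetical and are not part of the product.

```python
from collections import Counter

# Hypothetical column names; adjust them to match your data catalog.
ID_COLUMNS = ("tax_id", "national_tax_id", "iban")
ALL_COLUMNS = ("name", "street", "postal_code", "city") + ID_COLUMNS


def audit_catalog(records):
    """Report records with empty fields and identifier values used more than once."""
    # Map each record index to the list of columns that are empty or missing.
    incomplete = {}
    for index, record in enumerate(records):
        missing = [col for col in ALL_COLUMNS if not (record.get(col) or "").strip()]
        if missing:
            incomplete[index] = missing

    # Map each identifier column to the values that appear in several records.
    duplicates = {}
    for col in ID_COLUMNS:
        counts = Counter(
            record[col].strip().upper()
            for record in records
            if (record.get(col) or "").strip()
        )
        repeated = sorted(value for value, n in counts.items() if n > 1)
        if repeated:
            duplicates[col] = repeated
    return incomplete, duplicates
```

Records reported in `incomplete` are candidates for completion, and values reported in `duplicates` point to records that may need to be merged or removed.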

Company Detection Process

Company detection includes the following steps.

Step 1: Searching by Unique Identifiers

The values of the following fields are considered to be unique company identifiers:
  • Tax ID
  • National Tax ID
  • IBAN
A Classify By Company activity searches the document image for the values of the fields listed above using keywords and regular expressions. If none are specified, this step is skipped. The Tax ID, National Tax ID, and IBAN values detected on a document image are used to query the data catalog. Next, the Tax ID, National Tax ID, and IBAN values received from the data catalog are matched against the values detected on the image (exact matching is used).
For matching purposes, values are normalized as follows:
  • letters are changed to upper case
  • spaces and the following characters are removed: ".", ",", "", "/", "****"
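The normalization and exact matching of identifier values can be illustrated with a short sketch. The function names are hypothetical, and the set of removed characters used here (space, ".", ",", "/") is an assumption that may not be complete.

```python
# Characters removed during normalization. Assumption: this set is
# illustrative and may not cover every character the product removes.
REMOVED_CHARACTERS = " .,/"

# Translation table that deletes every character in REMOVED_CHARACTERS.
_REMOVAL_TABLE = str.maketrans("", "", REMOVED_CHARACTERS)


def normalize_identifier(value: str) -> str:
    """Change letters to upper case and delete the removed characters."""
    return value.upper().translate(_REMOVAL_TABLE)


def identifiers_match(detected: str, catalog_value: str) -> bool:
    """Compare two identifier values exactly after normalizing both."""
    return normalize_identifier(detected) == normalize_identifier(catalog_value)
```

With this sketch, a value detected as "de 12.345/678" and a catalog value of "DE12345678" are treated as the same identifier, while values that differ in any remaining character are not.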

Step 2: Full-Text Search

The entire text detected on the document image is used to query the data catalog. Next, the Name, Street, Postal code, and City values received from the data catalog are matched against the values detected on the image (exact matching is used).
Note: To get the best possible search results, make sure that the corresponding columns in the data catalog are filled in. Company name and address information is especially important in cases where the company cannot be identified using a Tax ID, National Tax ID, or IBAN.
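The effect of empty columns on this kind of matching can be sketched as follows. This is an illustration only: the column names are hypothetical, and treating "exact matching" as a case-sensitive substring test against the document text is an assumption, not a description of the product's internal logic.

```python
# Hypothetical column names for the name and address fields.
TEXT_COLUMNS = ("name", "street", "postal_code", "city")


def matched_fields(document_text, record):
    """Return the columns whose non-empty value occurs verbatim in the text."""
    return [
        col
        for col in TEXT_COLUMNS
        if record.get(col) and record[col] in document_text
    ]
```

A record with all four columns filled in can confirm a company on up to four values, whereas a record with only a name can confirm it on one, which leaves more room for coincidental matches.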

Step 3: Generating Hypotheses

Based on the companies found in steps 1 and 2, a set of hypotheses is generated. A Classify By Company activity evaluates these hypotheses and selects five document issuer and five document receiver company records that most reliably match the field values detected on the document image. These records are then used to form 25 pairs, with each pair treated as a separate hypothesis. A trained model then rates the hypotheses by reliability, selecting the best matching issuer–receiver pair.
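The pairing described above can be sketched in a few lines. The function names are hypothetical, the candidate lists are assumed to be sorted with the most reliable matches first, and `score` is a placeholder for the trained model, whose interface is not described here.

```python
from itertools import product


def build_hypotheses(issuers, receivers, limit=5):
    """Pair the top issuer and receiver candidates (at most limit * limit pairs)."""
    return list(product(issuers[:limit], receivers[:limit]))


def best_hypothesis(hypotheses, score):
    """Return the issuer-receiver pair with the highest score.

    `score` is a hypothetical stand-in for the trained model: any callable
    that takes an issuer and a receiver and returns a comparable number.
    """
    return max(hypotheses, key=lambda pair: score(*pair))
```

With five issuer and five receiver candidates, `build_hypotheses` returns 5 × 5 = 25 pairs. If fewer candidates are available on either side, the number of pairs is correspondingly smaller.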
Note: Even if the number of document receiver companies is very small (for example, if there is only one document receiver company), using a Document Receiver Companies data catalog is still recommended, as it will prevent a document receiver company from being incorrectly detected as a document issuer company.
If the Document Issuer Companies data catalog specifies that the Issuer Company ID depends on the Receiver Company ID, hypotheses are generated based on this correlation (see Looking for a pair of companies).

Results of Detecting Document Issuer and Receiver Companies

As a result of detecting issuer and receiver companies on a document, the following identifiers will be found:
  • The issuer company identifier in the Document Issuer Companies data catalog
  • The receiver company identifier in the Document Receiver Companies data catalog
Note: If the Document Issuer Companies data catalog specifies that the Issuer Company ID depends on the Receiver Company ID (see Looking for a pair of companies), the result of document issuer detection will contain the Issuer Company ID that corresponds to the Receiver Company ID.