- Creating a classification database
- Classifying documents
Scenario implementation
The code samples provided in this topic are Windows-specific.
Step 1. Loading ABBYY FineReader Engine
To start your work with ABBYY FineReader Engine, you need to create the Engine object. The Engine object is the top object in the hierarchy of the ABBYY FineReader Engine objects and provides various global settings, some processing methods, and methods for creating the other objects.
To create the Engine object, you can use the InitializeEngine function. See also the other ways to load the Engine object (Windows).
Step 2. Creating ClassificationEngine
Create a ClassificationEngine object, which serves as a factory for other Classification API objects. Use the CreateClassificationEngine method of the Engine object.
Step 3. Preparing the classification objects
The training and classification methods work with a special kind of object created from a document or page: ClassificationObject, which contains all classification-relevant information.
To prepare a document for use in a classification scenario, do the following:
- Load the images for processing. There are several ways to do it: for example, you may create the FRDocument object with the help of the CreateFRDocument method of the Engine object, then add images to the created FRDocument object from file using the AddImageFile method.
- If you are going to train or use a classifier of the type which takes into account text features (CT_Combined, CT_Text), first recognize the document with the help of any convenient method. We will use the Analyze and Recognize methods of the FRDocument object. Document synthesis is not necessary for classification.
Although parallel processing is not supported for classification itself, you may need it for preparatory recognition of the documents in Windows and Linux. If the number of documents you are going to classify is large, we recommend using Batch Processor or other parallel processing methods described in Parallel Processing with ABBYY FineReader Engine.
- Use the CreateObjectFromDocument method of the ClassificationEngine object to create a ClassificationObject containing the information from the first page of the document. If you need to use another of the document’s pages, call the CreateObjectFromPage method.
- The Description property of the ClassificationObject is empty by default. Specify this property if you need a relevant description.
Sometimes a recognized document or page contains no recognized text (for example, if an empty page was used by mistake). In this case, the ClassificationObject cannot be used with classifiers that require text features. You can check its SuitableClassifiers property to make sure.
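The rule about text features can be summarized in a few lines. The following is a minimal Python sketch of the decision logic only; it does not call FineReader Engine, and the helper names are hypothetical. CT_Text and CT_Combined come from this topic, while "CT_Image" is an assumed name for a classifier type that uses image features only.

```python
# Classifier types named in this topic that rely on recognized text.
TEXT_BASED_TYPES = {"CT_Text", "CT_Combined"}


def needs_recognition(classifier_type):
    """Return True if documents must be recognized before objects are created."""
    return classifier_type in TEXT_BASED_TYPES


def is_usable(classifier_type, recognized_text):
    """Return True if an object built from this text suits the classifier type."""
    if not needs_recognition(classifier_type):
        return True  # image-only features do not depend on recognized text
    # Text-based classifiers need at least some non-blank recognized text.
    return bool(recognized_text and recognized_text.strip())


# "CT_Image" is an assumed name, used here only for illustration.
print(is_usable("CT_Combined", ""))             # False: empty page, text needed
print(is_usable("CT_Image", ""))                # True: text is not required
print(is_usable("CT_Text", "Invoice No. 17"))   # True
```

In a real application the same check would be made against the SuitableClassifiers property rather than against the raw text.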
Step 4. Creating a training data set
To train a classifier that would distinguish between several types of documents, you need a categorized data set that contains samples of each type. Use the TrainingData object to populate and manage this data set:
- Create an empty object with the help of the CreateTrainingData method of the ClassificationEngine object.
- Access the collection of categories via its Categories property.
- Use the AddNew method of the Categories object several times to add a category for each of the document types you intend to classify. The method requires a string with the category label as the input parameter. The label will be returned by the classification methods, so it must be unique in the categories set.
- For each newly-added Category object, open the collection of classification objects using the Objects property. With the help of the IClassificationObjects::Add method, add the classification objects which correspond to this category.
No category may be left empty. At least two categories are required for training.
- Now that you have configured the training data set, you may wish to save it into a file on disk for later use: for example, if the trained model accuracy proves unacceptable and you wish to add or correct some data for better quality. The TrainingData object provides the SaveToFile method.
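The constraints on the training set (unique labels, no empty categories, at least two categories) are easy to verify before training starts. Below is a minimal Python sketch of such a check. It models the data set as a plain list of (label, samples) pairs and does not use the TrainingData object; the function name and the data layout are assumptions made for illustration.

```python
from collections import Counter


def check_training_set(categories):
    """Validate a list of (label, samples) pairs.

    Returns a list of problem descriptions; an empty list means the
    set satisfies the three constraints described in this step.
    """
    problems = []
    labels = [label for label, _ in categories]

    # Labels are returned by classification, so each must be unique.
    for label, count in Counter(labels).items():
        if count > 1:
            problems.append(f"label '{label}' is used {count} times")

    # Training needs something to distinguish between.
    if len(set(labels)) < 2:
        problems.append("at least two categories are required")

    # No category may be left empty.
    for label, samples in categories:
        if not samples:
            problems.append(f"category '{label}' is empty")

    return problems


data_set = [("Invoice", ["inv_01", "inv_02"]), ("Letter", [])]
print(check_training_set(data_set))  # ["category 'Letter' is empty"]
```

Running a check like this before saving or training catches the most common data set mistakes early.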
Step 5. Training the classification model
The functionality for model training is provided by the Trainer object. Use the CreateTrainer method of the ClassificationEngine object to create it. It contains all settings for the classifier type and the training procedure in two subobjects, TrainingParams and ValidationParams. Decide which settings you need and change the corresponding properties:
- The type of classifier (ITrainingParams::ClassifierType). This setting determines which features of the document are taken into account when assigning a category: image characteristics, contents of the recognized text, or both. To select a type which uses the text contents, you need to make sure all the classification objects in the training data set have been created from previously recognized documents.
- The training mode (ITrainingParams::TrainingMode). This setting determines whether the training process should favor high precision (how many of the selected elements are correct), high recall (how many of the correct elements are selected), or a balance between the two.
- Whether k-fold cross-validation should be used (IValidationParams::ShouldPerformValidation). We recommend using cross-validation when your training sample is not large, as it allows you to train several models on different partitions of the same sample and select the best one. If you have a large supply of categorized data, it may be best to turn the validation off, train the model on the whole training sample, then use the classification methods (Step 6) to test the model on another sample, calculating the performance scores on your side.
- The parameters of k-fold cross-validation: the number of parts into which the training sample is divided (IValidationParams::FoldsCount) and the number of iterations (IValidationParams::RepeatCount). Note that the training set on each iteration must contain at least 4 objects for the text classifier and at least 8 objects for the combined classifier. Make sure that your training sample contains enough objects.
After training, the ITrainingResult::Model property provides access to the trained classification model. You may save it into a file with the help of the SaveToFile method or use it directly to classify some documents (proceed to Step 6).
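To judge whether a sample is large enough for a given number of folds, you can estimate the size of the training part on each iteration. The Python sketch below assumes a standard k-fold split, where each iteration validates on one fold and trains on the remaining ones; how FineReader Engine actually partitions the sample is not described in this topic, so treat the result as an estimate. The minimum sizes (4 and 8) are taken from the text above, and the function names are hypothetical.

```python
import math

# Minimum number of training objects per iteration, as stated in this topic.
MIN_TRAINING_OBJECTS = {"CT_Text": 4, "CT_Combined": 8}


def smallest_training_part(total_objects, folds_count):
    """Size of the smallest training part under a standard k-fold split.

    Assumes one fold is held out per iteration; the largest fold holds
    ceil(total / k) objects, which leaves the smallest training part.
    """
    if folds_count < 2:
        raise ValueError("k-fold cross-validation needs at least 2 folds")
    return total_objects - math.ceil(total_objects / folds_count)


def enough_objects(total_objects, folds_count, classifier_type):
    """Return True if every iteration gets the required number of objects."""
    required = MIN_TRAINING_OBJECTS.get(classifier_type, 0)
    return smallest_training_part(total_objects, folds_count) >= required


print(smallest_training_part(10, 5))          # 8
print(enough_objects(10, 5, "CT_Combined"))   # True
print(enough_objects(9, 5, "CT_Combined"))    # False: only 7 objects remain
```

If the check fails, either add samples or lower the number of folds.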
Model training and classification will be performed in sequential mode in Linux and Windows, regardless of the IMultiProcessingParams::MultiProcessingMode value.
Step 6. Classifying documents
To use the trained model for classification:
- If the model is not currently loaded, call the CreateModelFromFile method of the ClassificationEngine object to load it from a file on disk.
- Prepare the classification objects from the documents you need to classify, as described in Step 3.
- For each classification object, call the Classify method of the Model object with the ClassificationObject as the input parameter. The method returns a collection of ClassificationResult objects, each containing the category label and the probability for this category. The results are sorted by probability from best to worst. Retrieve the result and check that the probability level is acceptable to you.
If the classifier was unable to assign a category, null is returned instead of the results collection.
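The result handling described above (take the most probable category, reject it if the probability is too low, and cope with a missing result) can be sketched as follows. This Python example is illustrative only: it represents the results as a list of (label, probability) pairs instead of ClassificationResult objects, assumes probabilities between 0 and 1, and uses an arbitrary threshold of 0.5. The function name is hypothetical.

```python
def pick_category(results, min_probability=0.5):
    """Return the best category label, or None if nothing is acceptable.

    results: None, or a list of (label, probability) pairs sorted from
    best to worst, mirroring the order described for Classify.
    """
    if not results:  # None (no category assigned) or an empty collection
        return None
    label, probability = results[0]  # the first entry is the most probable
    return label if probability >= min_probability else None


print(pick_category([("Invoice", 0.92), ("Letter", 0.08)]))  # Invoice
print(pick_category([("Invoice", 0.41), ("Letter", 0.38)]))  # None: too uncertain
print(pick_category(None))                                   # None: nothing assigned
```

Documents for which no acceptable category is found can then be routed to manual review.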
Model training and classification will be performed in sequential mode in Linux and Windows, regardless of the IMultiProcessingParams::MultiProcessingMode value.
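If you test a model on a separate categorized sample, as suggested in Step 5, you need to calculate the performance scores yourself. The following Python sketch computes precision and recall for one category from two parallel lists of labels: the expected ones and the ones returned by classification. It is a generic calculation that does not depend on FineReader Engine; the function name and the input layout are assumptions.

```python
def precision_recall(expected, predicted, category):
    """Precision and recall for one category over paired label lists."""
    pairs = list(zip(expected, predicted))
    # Documents that belong to the category and were assigned to it.
    true_pos = sum(1 for e, p in pairs if e == category and p == category)
    # Everything the classifier assigned to the category.
    selected = sum(1 for _, p in pairs if p == category)
    # Everything that truly belongs to the category.
    relevant = sum(1 for e, _ in pairs if e == category)
    precision = true_pos / selected if selected else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


expected = ["Invoice", "Invoice", "Letter", "Letter"]
predicted = ["Invoice", "Letter", "Letter", "Letter"]
print(precision_recall(expected, predicted, "Invoice"))  # (1.0, 0.5)
```

Here every document labeled Invoice by the classifier really is an invoice (precision 1.0), but only one of the two invoices was found (recall 0.5).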
Step 7. Unloading ABBYY FineReader Engine
After finishing your work with ABBYY FineReader Engine, you need to unload the Engine object. To do this, use the DeinitializeEngine exported function.
Required resources
You can use the FREngineDistribution.csv file to automatically create a list of files required for your application to function. For processing with this scenario, select the following values in column 5 (RequiredByModule):
- Core
- Core.Resources
- Opening
- Opening, Processing
- Processing
- Processing.Classification
- Processing.Classification.NaturalLanguages
- Processing.OCR
- Processing.OCR, Processing.ICR
- Processing.OCR.NaturalLanguages
- Processing.OCR.NaturalLanguages, Processing.ICR.NaturalLanguages
If you modify the standard scenario, change the required modules accordingly. You also need to specify the interface languages, recognition languages, and any additional features which your application uses (for example, Opening.PDF if you need to open PDF files, or Processing.OCR.CJK if you need to recognize texts in CJK languages). See Working with the FREngineDistribution.csv File for further details.
Additional optimization
You can find more information about setting up the various processing stages in these articles:
- Loading Engine (Windows only)
  - Different Ways to Load the Engine Object
    Describes the ways of loading the Engine object in detail.
  - Using ABBYY FineReader Engine in Multi-Threaded Server Applications
    Discusses the specifics of using FineReader Engine in server applications.
- Recognition (for Linux and Windows)
  - Parallel Processing with ABBYY FineReader Engine
    To quickly prepare the recognized documents or pages for a classifier with text features, use parallel processing for recognition and then turn multiprocessing off for classification.
