Usage scenarios
We assume that you are processing a large number of documents. The best way to implement your task, however, also depends on the results you need to receive. The distinct scenarios to consider are:
- Converting multi-page documents with a large number of pages, such as books or long reports. In this case, you can recognize the pages of the document in parallel, then perform synthesis in the main process, and export in parallel again. When using a pool of Engines, you can also process several multi-page documents simultaneously, but memory consumption can be very high and may even lead to “out of memory” errors.
- Converting a large number of one-page documents. This is the case when you process invoices, contracts, letters, etc. Parallel processing is easiest for this situation, as one-page documents do not depend on each other and do not require large amounts of memory at once.
- Processing a large number of images and searching them for the necessary information, or working with the recognition results in some other way. You might not need to convert most of them into an editable format, so the speed of synthesis and export is not an issue. The operation performed in multiple processes is iterating through layout blocks and accessing the recognition results for text blocks.
If you want to use parallel processing for export, keep in mind that this feature is supported only for export to PDF (except TextOnly mode) and PPTX formats.
Recommendations and restrictions
- For parallel processing of multi-page documents, we recommend using FRDocument. It is the easiest multiprocessing approach to code because you do not have to implement any additional interfaces.
  Opening, preprocessing, analysis, and recognition are performed in parallel; document synthesis is performed sequentially in the main process; export to PDF (except TextOnly mode) and PPTX formats is then performed in parallel.
- To process many one-page documents which are received from some source (such as a scanner), we recommend BatchProcessor.
  The advantage of this method is that it can be used when you do not know the number of documents in advance: they can be of different types and must be processed as soon as they arrive. The disadvantage is that it requires more implementation effort: you have to implement interfaces for a file adapter and a custom source of images.
  All processing stages are performed in parallel because, in the case of one-page documents, page synthesis and document synthesis are performed for each page separately.
  Parallel export is not supported in scenarios with Batch Processor.
- To perform full processing of many one-page documents in parallel, you can use a pool of Engines loaded out-of-process by means of COM. This method is the most efficient in terms of speed and automatically eliminates all difficulties related to multi-threading: all operations with the ABBYY FineReader Engine objects are serialized by means of COM. But it has some limitations:
  - due to the use of COM, you need to register FREngine.dll;
  - if your code is written in C++, working with COM requires more routine coding than, for example, in C#;
  - processing takes place in another process, so you cannot open images from memory, and iterating through the recognition results takes more time because each request has to be passed into another process and back;
  - loading several instances of Engine means more memory consumption, especially because all processing stages are performed in parallel in this case, so several synthesis operations can run at the same time and use still more memory.
- To catch and handle the events that occur during parallel processing, you can use the IParallelProcessingCallback interface. This interface can be very useful for managing problematic situations. For example, when a timeout error occurs, the IParallelProcessingCallback interface provides several ways to resolve the problem, depending on the user preferences. For more information, see IParallelProcessingCallback::OnWaitIntervalExceeded.
  Events that occur during parallel processing of a page are converted to events of the whole document.
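
The recommendations above can be condensed into a small decision helper. The following Python sketch is only an illustration of the rules stated in this section: the function names and parameters are hypothetical and are not part of the ABBYY FineReader Engine API.

```python
def recommend_method(multi_page: bool, arrives_from_source: bool) -> str:
    """Map a workload to the approach recommended in this section.

    Both parameters are hypothetical descriptors of the workload,
    not Engine properties.
    """
    if multi_page:
        # Multi-page documents: FRDocument, no extra interfaces to implement.
        return "FRDocument"
    if arrives_from_source:
        # One-page documents arriving from a source such as a scanner.
        return "BatchProcessor"
    # Full processing of many one-page documents in parallel.
    return "pool of Engines"


def parallel_export_supported(export_format: str,
                              pdf_text_only: bool = False,
                              uses_batch_processor: bool = False) -> bool:
    """Parallel export: PDF (except TextOnly mode) and PPTX only,
    and never in scenarios with Batch Processor."""
    if uses_batch_processor:
        return False
    fmt = export_format.upper()
    if fmt == "PDF":
        return not pdf_text_only
    return fmt == "PPTX"
```

For example, `recommend_method(multi_page=False, arrives_from_source=True)` returns `"BatchProcessor"`, and `parallel_export_supported("pdf", pdf_text_only=True)` returns `False`.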
Speed testing results
The table below presents the results of performance testing.

| | One-page documents | One multi-page document | Searching through results without exporting |
|---|---|---|---|
| Sequential processing | 60 | 51 | 87 |
| Processing with FRDocument | 41 | 117 | 57 |
| Processing with FRDocument (with PageFlushingPolicy \= PFP\_KeepInMemory) | 55 | 141 | 82 |
| Processing using Batch Processor | 99 | 115 | 294 |
| Processing using a pool of Engines | 165 | 10 | 102 |
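
To compare the methods at a glance, each figure can be expressed relative to the sequential figure in the same column. The Python sketch below does only that arithmetic on the numbers copied from the table. The table does not state its units, so the ratios should be read as relative values only; the shortened row labels are ours.

```python
# Figures copied from the table above; one tuple per row, one entry per column.
RESULTS = {
    "Sequential processing": (60, 51, 87),
    "FRDocument": (41, 117, 57),
    "FRDocument (PFP_KeepInMemory)": (55, 141, 82),
    "Batch Processor": (99, 115, 294),
    "Pool of Engines": (165, 10, 102),
}


def ratios_to_sequential(results):
    """Divide every figure by the sequential figure of the same column."""
    baseline = results["Sequential processing"]
    return {
        method: tuple(round(value / base, 2)
                      for value, base in zip(values, baseline))
        for method, values in results.items()
    }


if __name__ == "__main__":
    for method, row in ratios_to_sequential(RESULTS).items():
        print(f"{method:32}", *(f"{ratio:>6.2f}" for ratio in row))
```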
Processing with FRDocument object
The number of processes to run is detected automatically depending on the number of available physical or logical CPU cores, the number of free CPU cores available in the license, and the number of pages in the document. To turn on the multiprocessing mode, do the following:
- Set the value of the MultiProcessingMode property of the MultiProcessingParams subobject of the Engine object. Parallel processing is used if this property is set to MPM_Parallel or MPM_Auto, and the number of pages in the document and the number of available CPU cores are both greater than one.
- Tune the number of processes to be run using the RecognitionProcessesCount property and specify the values of other properties, if necessary.
The following FRDocument methods can then be performed in parallel:
- AddImageFile, AddImageFileFromMemory, AddImageFileFromStream, AddImageFileWithPassword, AddImageFileWithPasswordCallback
- Preprocess, PreprocessPages
- Analyze, AnalyzePages
- Recognize, RecognizePages
- Process, ProcessPages
- Export, ExportPages, ExportToMemory — for export to PDF (except TextOnly mode) and PPTX formats only
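
The condition for parallel processing described above can be written down as a short predicate. The Python sketch below is a model, not Engine code: the `Mode` enum merely mirrors the MPM_* names from the text, and `estimate_process_count` rests on the assumption that the automatic choice is bounded by the smallest of the three inputs the text lists. The Engine's actual algorithm is not described here.

```python
from enum import Enum


class Mode(Enum):
    # Stand-in for the MultiProcessingMode values named in this document.
    SEQUENTIAL = "MPM_Sequential"
    PARALLEL = "MPM_Parallel"
    AUTO = "MPM_Auto"


def uses_parallel_processing(mode: Mode, page_count: int,
                             available_cores: int) -> bool:
    """Parallel processing is used for MPM_Parallel or MPM_Auto when both
    the page count and the number of available cores are greater than one."""
    return (mode in (Mode.PARALLEL, Mode.AUTO)
            and page_count > 1
            and available_cores > 1)


def estimate_process_count(page_count: int, cpu_cores: int,
                           license_cores: int) -> int:
    """ASSUMPTION: an upper bound on useful processes is the smallest of
    pages, CPU cores, and free license cores (never below one)."""
    return max(1, min(page_count, cpu_cores, license_cores))
```

Under this assumption, a 100-page document on an 8-core machine with 4 free license cores would be bounded at 4 processes, and a one-page document would not use parallel processing in any mode.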
Processing using Batch Processor
When Batch Processor is initialized, asynchronous recognition processes are invoked and configured. Then the processor takes image files from a custom image source. For each page of the image file, a new processing task is created, and this task is passed to one of the recognition processes. If all the tasks for one file have been passed for processing, but not all of the recognition processes are occupied, the next image file from the image queue of the source is taken and passed for processing. This is done until the first image page has been converted and passed to the user. Pages are returned to the user in the order they have been taken from the image source.
To organize multiprocessing with the Batch Processor, do the following:
- Implement the IImageSource and IFileAdapter interfaces, which provide access to the image source and files in it.
- [optional] Implement the IAsyncProcessingCallback interface to manage the processing. The methods of this interface allow you to handle errors and/or cancel the processing.
- [optional] Set up multiprocessing using the MultiProcessingParams subobject of the Engine object. Please note that there is no need to set the MultiProcessingMode property, because parallel processing is used by default if you work with Batch Processor. Tune the number of processes to be run using the RecognitionProcessesCount property and specify the values of other properties, if necessary.
- Call the CreateBatchProcessor method of the Engine object to receive the BatchProcessor object.
- Call the Start method of this object to initialize the processor and invoke asynchronous recognition processes. You can specify the source of images, pass the references to the IAsyncProcessingCallback interface and parameters objects in the call to this method.
- Call the GetNextProcessedPage method in a loop until the method returns 0, which means that there are no more images in the source and all the processed images have been returned to the user.
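
The behaviour described above, where pages are processed in parallel but returned in the order they were taken from the source, can be modelled in a few lines. The Python sketch below is a toy stand-in and does not call the Engine: `OrderedBatch`, `get_next_processed_page`, and `fake_recognize` are hypothetical names, all pages are submitted up front rather than pulled lazily from an image source, and `None` plays the role of the 0 that GetNextProcessedPage returns when nothing is left.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class OrderedBatch:
    """Toy model: work runs in parallel, results come back in source order."""

    def __init__(self, pages, process_page, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        # Submit pages in source order; the deque remembers that order.
        self._pending = deque(self._pool.submit(process_page, p) for p in pages)

    def get_next_processed_page(self):
        """Return the next result in source order, or None when done."""
        if not self._pending:
            self._pool.shutdown()
            return None
        # Blocks until this particular page is ready, even if later ones are.
        return self._pending.popleft().result()


def fake_recognize(page_number):
    # Later pages finish sooner, to show that output order is still preserved.
    time.sleep(0.01 * (5 - page_number))
    return f"page {page_number} recognized"


batch = OrderedBatch(range(1, 5), fake_recognize)
results = []
while (page := batch.get_next_processed_page()) is not None:
    results.append(page)
```

The loop at the end has the same shape as the final step above: keep asking for the next processed page until the processor signals that the source is exhausted.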
Processing using a pool of Engines
In this multiprocessing scenario, you use several instances of Engine loaded out-of-process. Inside each worker thread, the procedure can be almost the same as for single-threaded processing. However, we recommend that you implement a custom image source that distributes the images among threads, using some kind of synchronization object to ensure that each image is processed once and only once. To load the Engine object out-of-process, use the OutprocLoader object, which implements the IEngineLoader interface. If you run it under special accounts, those accounts may need permissions to run OutprocLoader.
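
The recommended image distribution, in which a shared source hands each image to exactly one worker thread, can be sketched as follows. This Python example is only a model of the synchronization pattern: `run_pool` and `process_image` are hypothetical names, and in a real application each worker would own one Engine instance loaded out-of-process instead of calling a plain function.

```python
import queue
import threading


def run_pool(image_paths, process_image, worker_count=3):
    """Distribute images among worker threads; each image is taken once."""
    images = queue.Queue()
    for path in image_paths:
        images.put(path)

    results = {}
    results_lock = threading.Lock()

    def worker():
        # Placeholder for "one worker thread per out-of-process Engine".
        while True:
            try:
                # Queue removal is atomic: no two threads get the same image.
                path = images.get_nowait()
            except queue.Empty:
                return  # the source is exhausted
            outcome = process_image(path)
            with results_lock:
                results[path] = outcome

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

A thread-safe queue is used here as the synchronization object; a lock around a shared index into a list of images would serve the same purpose.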
- Account permissions can be set up using the DCOM Config utility (either type DCOMCNFG in the command line, or select Control Panel > Administrative Tools > Component Services). In the console tree, locate the Component Services > Computers > My Computer > DCOM Config folder, right-click ABBYY FineReader Engine 12.5 Loader (Local Server), and click Properties. A dialog box will open. Click the Security tab. Under Launch Permissions, click Customize, and then click Edit to specify the accounts that can launch the application.
Note that on a 64-bit operating system the registered DCOM-application is available in the 32-bit MMC console, which can be run using the following command line:
- To register FREngine.dll when installing your application on an end-user computer, use the regsvr32 utility. If you are on a 64-bit operating system, the 64-bit version of regsvr32 will run by default. Use the following command line:
- When implementing Engine as an out-of-process server, specify the sequential mode of document processing by setting the MultiProcessingMode property of the MultiProcessingParams object to MPM_Sequential.
