Usage scenarios
We assume that you are processing a large number of documents. The best way to implement your task, however, also depends on the results you need to receive. The distinct scenarios to consider are:
- Converting multi-page documents with a large number of pages. This generally means processing books, long reports, etc. In this case, you can recognize the pages of the document in parallel, then perform synthesis in the main process, and then export in parallel again. When using a pool of Engines, you can also process several multi-page documents simultaneously, but memory consumption can be very high and may even lead to “out of memory” errors.
- Converting a large number of one-page documents. This is the case when you process invoices, contracts, letters, etc. Parallel processing is easiest in this situation, as one-page documents do not depend on each other and do not require large amounts of memory at once.
- Processing a large number of images and searching them for the necessary information, or working with the recognition results in some other way. You might not need to convert most of the images into an editable format, so the speed of synthesis and export is not an issue. The operation performed in multiple processes is iterating through layout blocks and accessing the recognition results for text blocks.
If you want to use parallel processing for export, keep in mind that this feature is supported only for export to PDF (except TextOnly mode) and PPTX formats.
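The export restriction above can be expressed as a small check. This is only an illustrative sketch: the ExportFormat enumeration, its member names, and the parallel_export_supported function are hypothetical and are not part of the Engine API.

```python
from enum import Enum


class ExportFormat(Enum):
    # Hypothetical names used for illustration only; these are not Engine constants.
    PDF = "pdf"
    PPTX = "pptx"
    DOCX = "docx"
    TXT = "txt"


def parallel_export_supported(fmt: ExportFormat, pdf_text_only: bool = False) -> bool:
    """Return True if the rule stated above allows parallel export."""
    if fmt is ExportFormat.PDF:
        # PDF is supported in every mode except TextOnly.
        return not pdf_text_only
    # Of the remaining formats, only PPTX is exported in parallel.
    return fmt is ExportFormat.PPTX
```

A check like this can help you decide early whether the export stage of your workflow will benefit from additional processes at all.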
Recommendations and restrictions
- For parallel processing of multi-page documents, we recommend using FRDocument. This is the easiest way to code multiprocessing because you do not have to implement any additional interfaces.
Opening, preprocessing, analysis, and recognition are performed in parallel; document synthesis is performed sequentially in the main process; then export to PDF (except TextOnly mode) and PPTX formats is performed in parallel.
- To process many one-page documents which are received from some source (such as a scanner), we recommend BatchProcessor.
The advantage of this method is that it can be used when you do not know the number of documents in advance, when the documents can be of different types, and when they must be processed as soon as they arrive. The disadvantage is that it requires more implementation effort: you have to implement interfaces for a file adapter and a custom source of images.
All processing stages are performed in parallel because in the case of one-page documents the page and document synthesis are performed for each page separately.
Parallel export is not supported in the scenarios with Batch Processor.
Events that happened during parallel processing of a page are converted to events of a whole document.
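The stage order described for FRDocument (parallel recognition, sequential synthesis, parallel export) can be illustrated with a generic sketch. It does not call the Engine: recognize_page, synthesize_document, and export_page are hypothetical stand-ins, and a thread pool takes the place of the separate recognition processes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def recognize_page(image_name: str) -> str:
    # Stand-in for opening, preprocessing, analysis, and recognition of one page.
    return f"text of {image_name}"


def synthesize_document(page_texts: List[str]) -> List[str]:
    # Stand-in for document synthesis. It needs all pages at once,
    # so it runs a single time, sequentially, in the caller.
    total = len(page_texts)
    return [f"[{i + 1}/{total}] {text}" for i, text in enumerate(page_texts)]


def export_page(page: str) -> bytes:
    # Stand-in for exporting one synthesized page.
    return page.encode("utf-8")


def process_document(images: List[str], workers: int = 4) -> List[bytes]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        recognized = list(pool.map(recognize_page, images))  # parallel, order kept
        synthesized = synthesize_document(recognized)        # sequential
        return list(pool.map(export_page, synthesized))      # parallel, order kept
```

The point of the sketch is the shape of the pipeline: only the middle stage is a bottleneck, so a long document still benefits from additional processes at both ends.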
Processing with FRDocument object
The number of processes to run is detected automatically, depending on the number of available physical or logical CPU cores, the number of free CPU cores available in the license, and the number of pages in the document. To turn on the multiprocessing mode, do the following:
- Set the value of the MultiProcessingMode property of the MultiProcessingParams subobject of the Engine object. Parallel processing is used if this property is set to MPM_Parallel or MPM_Auto, and the number of pages in the document and the number of available CPU cores are both greater than one.
- Tune the number of processes to be run using the RecognitionProcessesCount property and specify the values of other properties, if necessary.
Parallel processing then applies to the following methods of the FRDocument object:
- AddImageFile, AddImageFileFromMemory, AddImageFileFromStream, AddImageFileWithPassword, AddImageFileWithPasswordCallback
- Preprocess, PreprocessPages
- Analyze, AnalyzePages
- Recognize, RecognizePages
- Process, ProcessPages
- Export, ExportPages, ExportToMemory — for export to PDF (except TextOnly mode) and PPTX formats only
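The condition for parallel processing stated above can be written down as a small decision function. This is a sketch under assumptions: the Mode enumeration only mirrors the mode names from the text (MPM_SEQUENTIAL is an assumed name for the non-parallel setting), and taking the minimum of pages, CPU cores, and licensed cores is our guess at how "detected automatically" might work, not documented Engine behavior.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    MPM_AUTO = "auto"
    MPM_PARALLEL = "parallel"
    MPM_SEQUENTIAL = "sequential"  # assumed name for the non-parallel setting


def uses_parallel(mode: Mode, page_count: int, cpu_cores: int) -> bool:
    # Rule from the text: the mode must be Parallel or Auto, and both the
    # page count and the number of available CPU cores must exceed one.
    return mode in (Mode.MPM_PARALLEL, Mode.MPM_AUTO) and page_count > 1 and cpu_cores > 1


def planned_process_count(
    mode: Mode,
    page_count: int,
    cpu_cores: int,
    licensed_cores: int,
    requested: Optional[int] = None,
) -> int:
    if not uses_parallel(mode, page_count, cpu_cores):
        return 1
    # Assumption: no more processes than pages, CPU cores, or licensed cores.
    limit = min(page_count, cpu_cores, licensed_cores)
    if requested:
        # Assumption: an explicit request (a hypothetical stand-in for
        # RecognitionProcessesCount) can only lower the automatic value.
        limit = min(limit, requested)
    return max(1, limit)
```

For example, under these assumptions a 10-page document on an 8-core machine with 4 licensed cores would be planned for 4 processes, and a one-page document would always be processed sequentially.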
Processing using Batch Processor
When Batch Processor is initialized, asynchronous recognition processes are invoked and configured. Then the processor takes image files from a custom image source. For each page of the image file, a new processing task is created, and this task is passed to one of the recognition processes. If all the tasks for one file have been passed for processing, but not all of the recognition processes are occupied, the next image file from the image queue of the source is taken and passed for processing. This is done until the first image page has been converted and passed to the user. Pages are returned to the user in the order they have been taken from the image source.
To organize multiprocessing with the Batch Processor, do the following:
- Implement the IImageSource and IFileAdapter interfaces, which provide access to the image source and files in it.
- [optional] Implement the IAsyncProcessingCallback interface to manage the processing. The methods of this interface allow you to handle errors and/or cancel the processing.
- [optional] Set up multiprocessing using the MultiProcessingParams subobject of the Engine object. Please note that there is no need to set the MultiProcessingMode property, because parallel processing is used by default if you work with Batch Processor. Tune the number of processes to be run using the RecognitionProcessesCount property and specify the values of other properties, if necessary.
- Call the CreateBatchProcessor method of the Engine object to obtain the BatchProcessor object.
- Call the Start method of this object to initialize the processor and invoke asynchronous recognition processes. You can specify the source of images, pass the references to the IAsyncProcessingCallback interface and parameters objects in the call to this method.
- Call the GetNextProcessedPage method in a loop until the method returns 0, which means that there are no more images in the source and all the processed images have been returned to the user.
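The flow described in this section (tasks created per page, free workers kept busy with the next file, pages returned in source order, and a loop that ends on an empty result) can be modeled with a toy class. ToyBatchProcessor and its methods are hypothetical and do not reproduce the Engine interfaces; a thread pool stands in for the recognition processes, and None plays the role of the 0 returned by GetNextProcessedPage.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Deque, Iterable, Iterator, Optional, Tuple


class ToyBatchProcessor:
    def __init__(self, workers: int = 2) -> None:
        self._workers = workers
        self._pool: Optional[ThreadPoolExecutor] = None
        self._pages: Iterator[Tuple[str, int]] = iter(())
        self._pending: Deque[Future] = deque()

    def start(self, image_source: Iterable[Tuple[str, int]]) -> None:
        # image_source yields (file_name, page_count) pairs; one task per page.
        self._pool = ThreadPoolExecutor(max_workers=self._workers)
        self._pages = (
            (name, page)
            for name, count in image_source
            for page in range(1, count + 1)
        )
        self._fill()

    @staticmethod
    def _recognize(name: str, page: int) -> str:
        # Stand-in for the recognition of a single page.
        return f"{name}#page{page}: recognized"

    def _fill(self) -> None:
        # Keep every worker busy: take further pages (and files) from the
        # source while fewer tasks are pending than there are workers.
        while self._pool is not None and len(self._pending) < self._workers:
            item = next(self._pages, None)
            if item is None:
                return
            self._pending.append(self._pool.submit(self._recognize, *item))

    def get_next_processed_page(self) -> Optional[str]:
        if not self._pending:
            # Source exhausted and every processed page already returned.
            if self._pool is not None:
                self._pool.shutdown()
            return None
        # The queue is first-in, first-out, so pages come back in source order.
        result = self._pending.popleft().result()
        self._fill()
        return result


processor = ToyBatchProcessor(workers=2)
processor.start([("invoice.tif", 1), ("letter.pdf", 2)])
pages = []
while (page := processor.get_next_processed_page()) is not None:
    pages.append(page)
```

Returning results from a first-in, first-out queue is what preserves the source order here, even when a later page finishes before an earlier one.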
