Basic document analysis features
Document Analysis is a set of functions for automatic detection of the following objects on a page:
- Text blocks
- Pictures
- Tables and table cells
- Barcodes
- Separators
Document Analysis can also:
- detect page orientation (rotation by 90, 180, or 270 degrees)
- split double pages
- detect vertical text in table cells
- detect and mark blocks of garbage on the page
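As an illustration of what orientation detection makes possible, the sketch below maps a detected rotation to the rotation needed to bring the page upright. It is not the product's API: the function names are hypothetical, the page is modelled as a small grid of values, and treating the detected angle as a clockwise rotation is an assumption.

```python
# Hypothetical sketch: correcting a page once its orientation has been detected.
# Assumption: the detected angle says how far the page is rotated clockwise.

def corrective_rotation(detected_clockwise_degrees: int) -> int:
    """Return the clockwise rotation that brings the page back upright."""
    if detected_clockwise_degrees not in (0, 90, 180, 270):
        raise ValueError("orientation must be 0, 90, 180, or 270 degrees")
    return (360 - detected_clockwise_degrees) % 360


def rotate_clockwise(grid, degrees):
    """Rotate a 2-D grid (list of rows) clockwise in 90-degree steps."""
    turns = (degrees // 90) % 4
    for _ in range(turns):
        # Reversing the row order and transposing is one clockwise quarter turn.
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid


page = [[1, 2], [3, 4]]
scanned = rotate_clockwise(page, 90)  # simulate a page scanned sideways
upright = rotate_clockwise(scanned, corrective_rotation(90))
```

Applying the corrective rotation to the sideways page returns the original grid, because the two rotations add up to a full turn.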
The following types of document analysis are available:
- General document analysis
- Document analysis for invoices
- Document analysis for full-text indexing
- Manual blocks specification for field-level recognition
General document analysis
This is the default type of document analysis. It searches for all objects: text blocks, pictures, tables, barcodes, and separators. The results are used to retrieve document structure and layout in content reuse scenarios. All pictures and diagrams are preserved in their original form, and text on them is not recognized.
Document analysis for invoices
This is a preprocessing engine for converting semi-structured documents such as invoices, payment drafts, bills, waybills, business cards, agreements, health claim forms, and resumes. It is designed to accurately locate all text on these documents, including characters and numbers, even if this information is located within stamps, pictures, logos, or small-text areas. Unlike the standard full-page document analysis, it assumes that all printed information on a document is text. It also ensures that important text is not identified as graphic elements and that words and numerical values are not split into separate characters. As a result, the maximum amount of information about the text, including its coordinates, is available for analysis, field-by-field processing, and parsing by other systems at subsequent processing stages.
Document analysis for full-text indexing
This type of analysis automatically detects and recognizes all text on a document, including text embedded in pictures, charts, and diagrams. Developers can use this mode to extract the exhaustive full-text information needed to build a document index (as in DMS, CMS, and archiving systems).
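The differences between the three analysis types described above can be summarized in a small lookup table. This is only a sketch: `AnalysisMode`, `ModeProfile`, `PROFILES`, and `choose_mode` are hypothetical names invented for illustration and are not part of any real API, and the selection rule in `choose_mode` is a simplified reading of the descriptions, not a documented algorithm.

```python
from dataclasses import dataclass
from enum import Enum


class AnalysisMode(Enum):
    """Hypothetical labels for the three analysis types described in the text."""
    GENERAL = "general"
    INVOICES = "invoices"
    FULL_TEXT_INDEXING = "full_text_indexing"


@dataclass(frozen=True)
class ModeProfile:
    typical_use: str
    reads_text_inside_pictures: bool


# Values are taken from the descriptions above; the structure itself is illustrative.
PROFILES = {
    AnalysisMode.GENERAL: ModeProfile(
        "content reuse (document structure and layout retrieval)", False
    ),
    AnalysisMode.INVOICES: ModeProfile(
        "field-by-field processing of semi-structured documents", True
    ),
    AnalysisMode.FULL_TEXT_INDEXING: ModeProfile(
        "building a full-text index (DMS, CMS, archiving)", True
    ),
}


def choose_mode(need_layout: bool, semi_structured: bool) -> AnalysisMode:
    """Pick a mode from two coarse requirements (a simplified, assumed rule)."""
    if semi_structured:
        return AnalysisMode.INVOICES
    if need_layout:
        return AnalysisMode.GENERAL
    return AnalysisMode.FULL_TEXT_INDEXING


mode = choose_mode(need_layout=False, semi_structured=False)
profile = PROFILES[mode]
```

With these inputs the sketch selects full-text indexing, the mode that reads text embedded in pictures, charts, and diagrams.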
