An XML file contains the recognized text, with additional information about its structure, attributes, and recognition variants described by XML tags. See the table below for a description of the possible tags. Some tags may be absent, depending on the values of the recognition parameters. For example, word or character recognition variants are saved only if the corresponding properties of the XMLExportParams object are set to TRUE.

You can find the XML schema in the ABBYY_Scheme_XML.xsd file. This file is located in the Headers folder on macOS or the Inc folder on Linux and Windows (Start > Programs > ABBYY FineReader Engine 12 > Installation Folders > Include Files Folder).

The picture below shows an example of Picture, Text, and Table blocks in the output XML file.

XMLScheme

Description of document tags

Columns, in order: Name, Type, Multiplicity, Parent Tag, Description

document

Complex Type

Type elements

page

documentData

Type attributes

version — XML version

producer — the producer of the XML file

pagesCount — (optional) the number of pages in the document

mainLanguage — (optional) the main language of the document

languages — (optional) all languages of the document

1

no

Document.

page

Complex Type, a sequence of block tags

Type elements

block

Type attributes

width — the image width in pixels

height — the image height in pixels

resolution — the image resolution in pixels per inch

originalCoords — (optional) if true, all coordinates are relative to the original image before opening; otherwise, they are relative to the opened (deskewed) image

rotation — (optional) the type of rotation applied to original page image before processing. It can be one of the following values: Normal, RotatedClockwise, RotatedUpsideDown, RotatedCounterclockwise

0…unbounded

document

Recognized page.
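The document and page attributes above can be read with any XML parser. The following is a minimal sketch using Python's standard library; the sample XML is hand-written and trimmed for illustration (it is not real engine output), and the helper names `local_name` and `summarize_pages` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample for illustration; not real engine output.
SAMPLE = """<document version="1.0" producer="Example" pagesCount="1">
  <page width="2400" height="3300" resolution="300">
    <block blockType="Text"/>
    <block blockType="Picture"/>
  </page>
</document>"""


def local_name(tag):
    """Strip a '{namespace}' prefix so lookups work with or without a namespace."""
    return tag.rsplit("}", 1)[-1]


def summarize_pages(xml_text):
    """Return one dict per page with its size in inches and its block types."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root:
        if local_name(page.tag) != "page":
            continue  # skip documentData
        dpi = int(page.get("resolution"))  # pixels per inch
        pages.append({
            "width_in": int(page.get("width")) / dpi,
            "height_in": int(page.get("height")) / dpi,
            "rotation": page.get("rotation"),  # optional; None when absent
            "block_types": [b.get("blockType") for b in page
                            if local_name(b.tag) == "block"],
        })
    return pages


pages = summarize_pages(SAMPLE)
print(pages)
```

Matching on the local tag name is a deliberate choice here: it keeps the sketch working whether or not the exported file declares an XML namespace.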

block

BlockType

BlockType elements

The availability of each element depends on the type of the block (see the blockType attribute).

region — always available

text — available only if blockType attribute is “Text”

row — available only if blockType attribute is “Table”

separatorsBox — available only if blockType attribute is “SeparatorsBox”

separator — available only if blockType attribute is “Separator”

checkmark — available only if blockType attribute is “Checkmark”

groupCheckmark — available only if blockType attribute is “GroupCheckmark”

BlockType attributes

blockType — the type of the block. It can be one of the following values: Text, Table, Picture, Barcode, Separator, SeparatorsBox, Checkmark, GroupCheckmark

blockName — (optional) the name of the block

isHidden — (optional) specifies if the block is hidden (the default value is false)

l — (optional) the coordinate of the left border of the block

t — (optional) the coordinate of the top border of the block

r — (optional) the coordinate of the right border of the block

b — (optional) the coordinate of the bottom border of the block

0…unbounded

page

Recognized block.
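Because the child elements of a block depend on blockType, a consumer may want to check that a block contains only what its type allows. The sketch below encodes the mapping from this table (including barcodeInfo for Barcode blocks, described further down); the function name `unexpected_children` and the trimmed sample are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Child element tied to each block type, as listed in the table.
# "region" is always allowed; Picture blocks have no type-specific child.
EXPECTED_CHILD = {
    "Text": "text",
    "Table": "row",
    "Barcode": "barcodeInfo",
    "Separator": "separator",
    "SeparatorsBox": "separatorsBox",
    "Checkmark": "checkmark",
    "GroupCheckmark": "groupCheckmark",
}


def local_name(tag):
    """Strip a '{namespace}' prefix, if present."""
    return tag.rsplit("}", 1)[-1]


def unexpected_children(block):
    """List child tag names that do not match the block's blockType."""
    allowed = {"region", EXPECTED_CHILD.get(block.get("blockType"))}
    return [local_name(c.tag) for c in block if local_name(c.tag) not in allowed]


# Hand-written sample; regions are left empty to keep it short.
SAMPLE = """<page width="100" height="100" resolution="300">
  <block blockType="Text"><region/><text/></block>
  <block blockType="Picture"><region/><row/></block>
</page>"""

page = ET.fromstring(SAMPLE)
report = [(b.get("blockType"), unexpected_children(b)) for b in page]
print(report)
```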

region

Complex Type, a sequence of rect tags

Type elements

rect

Has no type attributes

1

block

Block region, a set of rectangles.

rect

Complex Type

Type attributes

l — the coordinate of the left border of the rectangle

t — the coordinate of the top border of the rectangle

r — the coordinate of the right border of the rectangle

b — the coordinate of the bottom border of the rectangle

1…unbounded

region

Rectangle of a block region.
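Since a region is a set of rectangles, the overall bounding box of a block can be derived from its rect tags. A minimal sketch, assuming integer coordinates and a hand-written sample; `region_bounds` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample: an L-shaped region made of two rectangles.
SAMPLE = """<region>
  <rect l="10" t="10" r="100" b="50"/>
  <rect l="10" t="50" r="60" b="90"/>
</region>"""


def region_bounds(region):
    """Return (left, top, right, bottom) enclosing every rect in the region."""
    rects = [tuple(int(rect.get(key)) for key in ("l", "t", "r", "b"))
             for rect in region
             if rect.tag.rsplit("}", 1)[-1] == "rect"]
    if not rects:
        raise ValueError("region contains no rect elements")
    lefts, tops, rights, bottoms = zip(*rects)
    return min(lefts), min(tops), max(rights), max(bottoms)


bounds = region_bounds(ET.fromstring(SAMPLE))
print(bounds)
```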

text

TextType

TextType elements

par

TextType attributes

orientation — (optional) the text orientation. It can be one of the following values: Normal, RotatedClockwise, RotatedUpsidedown, RotatedCounterclockwise (the default value is Normal)

backgroundColor — (optional) the background color of the text (the default value is -1, which means that the color is transparent)

mirrored — (optional) specifies if the text is mirrored (the default value is false)

inverted — (optional) specifies if the text is inverted (the default value is false)

0…unbounded

block

Text of a recognized text block (present as an element of the block tag if the blockType attribute is “Text”).

0…unbounded

cell

Text of a table cell.

par

ParagraphType

ParagraphType elements

line

ParagraphType attributes

dropCapCharsCount — (optional) the number of drop caps in the paragraph (the default value is 0)

dropCap-l — (optional) the left coordinate of the drop cap rectangle

dropCap-t — (optional) the top coordinate of the drop cap rectangle

dropCap-r — (optional) the right coordinate of the drop cap rectangle

dropCap-b — (optional) the bottom coordinate of the drop cap rectangle

align — (optional) the paragraph alignment. It can be one of the following values: Left, Center, Right, Justified (the default value is Left)

leftIndent — (optional) the left paragraph indent (the default value is 0)

rightIndent — (optional) the right paragraph indent (the default value is 0)

startIndent — (optional) the indent of the first line of the paragraph (the default value is 0)

lineSpacing — (optional) the spacing between lines (the default value is 0)

isListItem — (optional) indicates that the paragraph is part of a list (the default value is false)

lstLvl — (optional) the list level

lstNum — (optional) the number of the paragraph in the list

0…unbounded

text

Paragraph of a recognized text.

line

LineType

LineType elements

formatting

LineType attributes

baseline — the distance from the base line to the top edge of the page

l — the coordinate of the left border of the surrounding rectangle

t — the coordinate of the top border of the surrounding rectangle

r — the coordinate of the right border of the surrounding rectangle

b — the coordinate of the bottom border of the surrounding rectangle

0…unbounded

par

Line of a paragraph.

formatting

FormattingType

FormattingType group

charParams

or

wordRecVariants

FormattingType attributes

lang — the name of the language

ff — (optional) the name of the font

fs — (optional) the size of the font

bold — (optional) the bold font style (the default value is false)

italic — (optional) the italic font style (the default value is false)

subscript — (optional) the subscript font effect (the default value is false)

superscript — (optional) the superscript font effect (the default value is false)

smallcaps — (optional) the small caps font effect (the default value is false)

underline — (optional) the underline font effect (the default value is false)

strikeout — (optional) the strikeout font effect (the default value is false)

color — (optional) the color of the font (the default value is 0)

scaling — (optional) the scaling of the font (the default value is 1000)

spacing — (optional) the character spacing (the default value is 0)

0…unbounded

line

Group of characters with uniform formatting. Character attributes alternate with word recognition variants: the recognition variants of a word are written before the word itself.

charParams

CharParamsType

CharParamsType elements

charRecVariants

CharParamsType attributes

l — the coordinate of the left border of the character rectangle

t — the coordinate of the top border of the character rectangle

r — the coordinate of the right border of the character rectangle

b — the coordinate of the bottom border of the character rectangle

suspicious — (optional) this property set to TRUE means that the character was recognized uncertainly (the default value is false)

proofed — (optional) specifies whether spell-checking was performed upon this character (the default value is false)

wordStart — deprecated; (optional) this property set to TRUE marks the leftmost character in a word

wordFirst — (optional) this property set to TRUE marks the first character in a word

wordLeftmost — (optional) this property set to TRUE marks the leftmost character in a word

wordFromDictionary — (optional) specifies whether the word was found in the dictionary

wordNormal — (optional) specifies whether the word was recognized with either a standard or user-defined language, and that it is not a number or an identifier

wordNumeric — (optional) specifies whether the word is a number

wordIdentifier — (optional) specifies whether the word is an identifier (abbreviation, URL, etc.)

wordPenalty — (optional) penalty for discordance of characters in the word

meanStrokeWidth — (optional) the mean width of the stroke in the RLE representation of the word image, expressed in pixels multiplied by 10

charConfidence — (optional) stores the value of character confidence. It is a numerical estimate of the probability that the recognition was correct. However, this number is not guaranteed to be positive, and the only meaningful use of confidence is to compare different recognition variants of the same character

serifProbability — (optional) specifies the probability that the character is written with a Serif font

isTab — (optional) specifies if the character is a tab

tabLeaderCount — (optional) specifies the number of symbols in the tab leader. The number is calculated at the synthesis stage, taking the font and tab width into account. This attribute is used only if isTab is TRUE

0…unbounded

formatting

Attributes of a single character.
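A common use of charParams is to rebuild the text of a line and find characters flagged as suspicious. The sketch below assumes that each character is stored as the text content of its charParams tag, which this table does not spell out, and that boolean attributes are written as `true` or `1`. The sample and the names `read_line` and `is_true` are illustrative.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration; not real engine output.
SAMPLE = """<line baseline="140" l="100" t="100" r="160" b="145">
  <formatting lang="English">
    <charParams l="100" t="100" r="120" b="140">C</charParams>
    <charParams l="120" t="110" r="140" b="140" suspicious="true">a</charParams>
    <charParams l="140" t="105" r="160" b="140">t</charParams>
  </formatting>
</line>"""


def local_name(tag):
    """Strip a '{namespace}' prefix, if present."""
    return tag.rsplit("}", 1)[-1]


def is_true(value):
    """Interpret an optional boolean attribute; a missing attribute means false."""
    return value is not None and value.lower() in ("true", "1")


def read_line(line):
    """Return the line text and the indexes of suspicious characters."""
    # Only direct charParams children of formatting are read, so characters
    # nested inside wordRecVariants are not counted twice.
    chars = [cp
             for fmt in line if local_name(fmt.tag) == "formatting"
             for cp in fmt if local_name(cp.tag) == "charParams"]
    text = "".join(cp.text or "" for cp in chars)
    suspicious = [i for i, cp in enumerate(chars) if is_true(cp.get("suspicious"))]
    return text, suspicious


text, suspicious = read_line(ET.fromstring(SAMPLE))
print(text, suspicious)
```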

charRecVariants

Complex Type, a sequence of charRecVariant tags

Type elements

charRecVariant

Has no type attributes


charParams

Recognition variants of a character.

charRecVariant

CharRecognitionVariant

Type attributes

charConfidence — (optional) a numerical estimate of the probability that the recognition was correct

serifProbability — (optional) probability that a character is written with a Serif font

0…unbounded

charRecVariants

Recognition variant of a character.
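As noted for charConfidence, the value is meaningful only for comparing variants of the same character. The sketch below picks the variant with the highest confidence; it assumes that a higher value means a more likely variant and that the variant character is the text content of charRecVariant. Neither assumption is stated in this table, and `best_variant` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; confidence values may be negative.
SAMPLE = """<charRecVariants>
  <charRecVariant charConfidence="35">O</charRecVariant>
  <charRecVariant charConfidence="82">0</charRecVariant>
  <charRecVariant charConfidence="-4">Q</charRecVariant>
</charRecVariants>"""


def local_name(tag):
    """Strip a '{namespace}' prefix, if present."""
    return tag.rsplit("}", 1)[-1]


def best_variant(variants):
    """Return (character, confidence) for the highest-confidence variant, or None."""
    scored = [(float(v.get("charConfidence")), v.text or "")
              for v in variants
              if local_name(v.tag) == "charRecVariant"
              and v.get("charConfidence") is not None]  # the attribute is optional
    if not scored:
        return None
    confidence, char = max(scored, key=lambda item: item[0])
    return char, confidence


best = best_variant(ET.fromstring(SAMPLE))
print(best)
```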

wordRecVariants

Complex Type, a sequence of wordRecVariant tags

Type elements

wordRecVariant

Has no type attributes


formatting

Variants of recognition of the next word.

wordRecVariant

WordRecognitionVariant type

WordRecognitionVariant elements

variantText

WordRecognitionVariant attributes

wordFromDictionary — (optional) specifies whether the word was found in the dictionary

wordNormal — (optional) specifies whether the word was recognized with a standard or user-defined language, and that it is not a number or an identifier

wordNumeric — (optional) specifies whether the word is a number

wordIdentifier — (optional) specifies whether the word is an identifier (abbreviation, URL, etc.)

wordPenalty — (optional) penalty for discordance of characters in the word

meanStrokeWidth — (optional) the mean width of the stroke in the RLE representation of the word image, expressed in pixels multiplied by 10

0…unbounded

wordRecVariants

Variant of recognition of the next word.

variantText

Complex Type, a sequence of charParams tags

Type elements

charParams

Has no type attributes

1

wordRecVariant

Word.

row

TableRowType

TableRowType elements

cell

Has no type attributes

0…unbounded

block

Table row (present if the blockType attribute is “Table”).

cell

Complex Type, a sequence of TextType tags

Type elements

text

Type attributes

colSpan — (optional) the column span of the cell (the default value is 1)

rowSpan — (optional) the row span of the cell (the default value is 1)

align — (optional) the vertical alignment of the cell contents. It can be one of the following values: Top, Center, Bottom (the default value is Top)

picture — (optional) specifies if the cell contains only a picture (the default value is false)

leftBorder — (optional) the table cell left border type. It can be one of the following values: Absent, Unknown, White, Black (the default value is Black)

topBorder — (optional) the table cell top border type. It can be one of the following values: Absent, Unknown, White, Black (the default value is Black)

rightBorder — (optional) the table cell right border type. It can be one of the following values: Absent, Unknown, White, Black (the default value is Black)

bottomBorder — (optional) the table cell bottom border type. It can be one of the following values: Absent, Unknown, White, Black (the default value is Black)

width — the width of the cell

height — the height of the cell

0…unbounded

row

Table cell (present if the blockType attribute is “Table”).
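The colSpan and rowSpan attributes make it possible to lay cells out on a rectangular grid. The sketch below assumes HTML-like semantics, where a slot covered by a span from an earlier cell is not repeated in later rows; this table does not confirm that behavior. The sample is built in code with most attributes omitted, and `make_cell`, `cell_text`, and `table_grid` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix, if present."""
    return tag.rsplit("}", 1)[-1]


def make_cell(text, **attrs):
    """Build a trimmed sample cell: cell > text > par > line > formatting > charParams."""
    cell = ET.Element("cell", {"width": "100", "height": "40", **attrs})
    parent = cell
    for name in ("text", "par", "line", "formatting"):
        parent = ET.SubElement(parent, name)
    for ch in text:
        ET.SubElement(parent, "charParams").text = ch
    return cell


def cell_text(cell):
    """Join the characters of every charParams found inside the cell."""
    return "".join(cp.text or "" for cp in cell.iter()
                   if local_name(cp.tag) == "charParams")


def table_grid(block):
    """Expand rows and cells into a list of rows, repeating text across spans."""
    grid = {}
    rows = [r for r in block if local_name(r.tag) == "row"]
    for r, row in enumerate(rows):
        c = 0
        for cell in (x for x in row if local_name(x.tag) == "cell"):
            while (r, c) in grid:  # slot already taken by a span from above
                c += 1
            row_span = int(cell.get("rowSpan", "1"))
            col_span = int(cell.get("colSpan", "1"))
            for dr in range(row_span):
                for dc in range(col_span):
                    grid[(r + dr, c + dc)] = cell_text(cell)
            c += col_span
    n_rows = max((r for r, _ in grid), default=-1) + 1
    n_cols = max((c for _, c in grid), default=-1) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


block = ET.Element("block", {"blockType": "Table"})
first = ET.SubElement(block, "row")
first.append(make_cell("A", rowSpan="2"))
first.append(make_cell("B"))
second = ET.SubElement(block, "row")
second.append(make_cell("C"))

grid = table_grid(block)
print(grid)
```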

separatorsBox

Complex Type, a sequence of separator tags

Type elements

separator

Has no type attributes

0…1

block

Group of separators, present if the blockType attribute is “SeparatorsBox”.

separator

SeparatorBlockType type

SeparatorBlockType elements

start

end

SeparatorBlockType attributes

thickness — specifies the precise width of the separator in pixels

type — specifies the type of the separator. It can be one of the following values: Unknown, Black, Dotted

0…1

block

Single separator, present if the blockType attribute is “Separator”.

0…unbounded

separatorsBox

Separator in a group of separators.

groupCheckmark

Complex Type, a sequence of checkmark tags

Type elements

checkmark

Has no type attributes

0…1

block

Group of checkmarks, present if the blockType attribute is “GroupCheckmark”.

checkmark

CheckmarkBlockType type

CheckmarkBlockType attributes

confidence — specifies the confidence of checkmark recognition

value — specifies the checkmark state. It can be one of the following values: Unknown, Checked, Unchecked, Corrected

0…1

block

Single checkmark, present if blockType attribute is “Checkmark”.

0…unbounded

groupCheckmark

Checkmark in a group of checkmarks.

barcodeInfo

BarcodeInfoType type

BarcodeInfoType attributes

type — specifies the type of the barcode. It can be one of the following values:

  • CODE39
  • INTERLEAVED25
  • EAN13
  • CODE128
  • EAN8
  • PDF417
  • CODABAR
  • UPCE
  • INDUSTRIAL25
  • IATA25
  • MATRIX25
  • CODE93
  • POSTNET
  • UCC128
  • PATCH
  • AZTEC
  • DATAMATRIX
  • QRCODE
  • UPCA
  • MAXICODE
  • CODE32
  • FULLASCII
  • ROYAL
  • KIX
  • INTELLIGENT
  • AUSTRALIA_POST
  • Unknown

supplement — (optional) specifies the type of supplementary barcode. It can be one of the following values: void, 2dig, 5dig

0…1

block

Information about a barcode, present if the blockType attribute is “Barcode”.

start

Point type

Point attributes

x — the horizontal coordinate of the start point of the separator

y — the vertical coordinate of the start point of the separator

1

separator

Start point of a separator.

end

Point type

Point attributes

x — the horizontal coordinate of the end point of the separator

y — the vertical coordinate of the end point of the separator

1

separator

End point of a separator.
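The start and end points are enough to compute the length of a separator, expressed in the same units as the coordinates. A minimal sketch with a hand-written sample; `separator_length` is a hypothetical helper and integer coordinates are assumed.

```python
import math
import xml.etree.ElementTree as ET

# Hand-written sample for illustration.
SAMPLE = """<separator thickness="3" type="Black">
  <start x="100" y="200"/>
  <end x="400" y="600"/>
</separator>"""


def separator_length(separator):
    """Euclidean distance between the start and end points of a separator."""
    points = {child.tag.rsplit("}", 1)[-1]: (int(child.get("x")), int(child.get("y")))
              for child in separator}
    (x0, y0), (x1, y1) = points["start"], points["end"]
    return math.hypot(x1 - x0, y1 - y0)


length = separator_length(ET.fromstring(SAMPLE))
print(length)
```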

documentData

Complex Type

Type elements

paragraphStyles

sections

Has no type attributes

0…1

document

Parameters of paragraph and font styles of the document.

paragraphStyles

Complex Type, a sequence of paragraphStyle tags

Type elements

paragraphStyle

Has no type attributes

0…1

documentData

Collection of paragraph formatting styles.

paragraphStyle

ParagraphStyleType Type

ParagraphStyleType elements

fontStyle

ParagraphStyleType attributes

id — the identifier of the paragraph style

name — the name of the paragraph style

mainFontStyleId — the main font style of the paragraph

role — the paragraph role. It can be one of the following values:

  • text
  • tableText
  • heading
  • tableHeading
  • pictureCaption
  • tableCaption
  • contents — table of contents
  • footnote
  • endnote
  • rt — running title
  • garb — garbage
  • other
  • barcode
  • headingNumber

roleLevel — (optional) the level of the paragraph role (the default value is -1, which means that the level is not available for this role)

align — paragraph alignment. It can be one of the following values: Left, Center, Right, Justified, CjkJustified, ThaiJustified

before — (optional) space before the paragraph of this style (the default value is 0)

after — (optional) space after the paragraph of this style (the default value is 0)

startIndent — (optional) indent of the first line of the paragraph

leftIndent — (optional) left indent of the whole paragraph

rightIndent — (optional) right indent of the whole paragraph

lineSpacing — (optional) line spacing

lineSpacingRatio — (optional) line spacing (proportional to the letter height)

fixedLineSpacing — (optional) if true, the line spacing in the paragraph does not vary

0…unbounded

paragraphStyles

Formatting style of a paragraph.

fontStyle

FontStyleType Type

FontStyleType attributes

id — the identifier of the font style

baseFont — (optional)

italic — (optional) if true, the font is italic

bold — (optional) if true, the font is bold

underline — (optional) if true, the font is underlined

strikeout — (optional) if true, the font is strikeout

smallcaps — (optional) if true, the font is small caps

scaling — (optional) the scaling of the font (the default value is 1000)

spacing — (optional) the character spacing (the default value is 0)

color — (optional) the color of the font (the default value is 0)

backgroundColor — (optional) the background color (the default value is 0)

ff — the name of the font

fs — the size of the font

0…unbounded

paragraphStyle

The font style.
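To find the main font of each paragraph style, mainFontStyleId can be matched against the id of a fontStyle. The sketch below assumes that the referenced fontStyle is a child of the same paragraphStyle; the identifier format in the sample (`ps1`, `fs1`) is invented for illustration, and `main_fonts` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; identifier values are made up.
SAMPLE = """<paragraphStyles>
  <paragraphStyle id="ps1" name="Body" mainFontStyleId="fs2" role="text" align="Left">
    <fontStyle id="fs1" ff="Arial" fs="10" bold="true"/>
    <fontStyle id="fs2" ff="Times New Roman" fs="11"/>
  </paragraphStyle>
</paragraphStyles>"""


def local_name(tag):
    """Strip a '{namespace}' prefix, if present."""
    return tag.rsplit("}", 1)[-1]


def main_fonts(paragraph_styles):
    """Map each paragraph style name to (font name, font size) of its main font."""
    result = {}
    for style in paragraph_styles:
        if local_name(style.tag) != "paragraphStyle":
            continue
        fonts = {f.get("id"): f for f in style if local_name(f.tag) == "fontStyle"}
        main = fonts.get(style.get("mainFontStyleId"))
        # None signals that the referenced font style was not found.
        result[style.get("name")] = (
            (main.get("ff"), float(main.get("fs"))) if main is not None else None
        )
    return result


fonts = main_fonts(ET.fromstring(SAMPLE))
print(fonts)
```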

sections

Complex Type, a sequence of section tags

Type elements

section

Has no type attributes

0…1

documentData

The collection of document sections.

section

SectionType Type

SectionType elements

stream

Has no type attributes

0…unbounded

sections

A document section.

stream

TextStreamType Type

TextStreamType elements

mainText

elemId

TextStreamType attributes

role — (optional) the stream role. It can be one of the following values: garb, text, footnote, incut (the default value is text)

vertCjk — (optional) if true, the stream contains vertical CJK text

beginPage — the number of the page on which the stream begins

endPage — (optional) the number of the page on which the stream ends

0…unbounded

section

A sequence of paragraphs and blocks.

mainText

Complex Type

Type attributes

rtl — (optional) if true, the text has right-to-left writing direction

columnCount — the number of columns

0…1

stream


elemId

Complex Type

Type attributes

id — string ID of the element

0…unbounded

stream

The ID of a page element.
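The sections, section, stream, and elemId tags describe which page elements belong to each text stream. The sketch below lists the streams of a hand-written sample; the element IDs are invented, and how an ID maps back to a page element is not covered here. `list_streams` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; element IDs are made up.
SAMPLE = """<sections>
  <section>
    <stream beginPage="1" endPage="2">
      <mainText columnCount="1"/>
      <elemId id="b1"/>
      <elemId id="b2"/>
    </stream>
    <stream role="footnote" beginPage="2">
      <elemId id="b7"/>
    </stream>
  </section>
</sections>"""


def local_name(tag):
    """Strip a '{namespace}' prefix, if present."""
    return tag.rsplit("}", 1)[-1]


def list_streams(sections):
    """Return role, page range, and element IDs for every stream."""
    streams = []
    for section in sections:
        for stream in section:
            if local_name(stream.tag) != "stream":
                continue
            end_page = stream.get("endPage")  # optional; None when absent
            streams.append({
                "role": stream.get("role", "text"),  # documented default is text
                "begin": int(stream.get("beginPage")),
                "end": int(end_page) if end_page is not None else None,
                "ids": [e.get("id") for e in stream if local_name(e.tag) == "elemId"],
            })
    return streams


streams = list_streams(ET.fromstring(SAMPLE))
print(streams)
```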

Tag hierarchy diagram

XMLSchemeDiagram

See also

XMLExportParams