Recognizing Chinese, Japanese, and Korean Languages
Chinese, Japanese, and Korean languages are often grouped together under the abbreviation “CJK”. They have several features in common, such as the use of Chinese characters and of vertical as well as horizontal writing direction.
This section deals with certain peculiarities of recognizing and exporting texts in CJK languages with ABBYY FineReader Engine 12.
First, in order to recognize CJK languages, you must have an ABBYY FineReader Engine license that supports the Chinese, Japanese, and Korean language modules. For more information about licenses and modules, see the Licensing section.
The Japanese (Modern) recognition language is a compound language consisting of the Japanese and English languages and four letters of the Greek language. It is intended for recognizing contemporary Japanese texts (such as reports, research papers, etc.), which may include Kanji characters, Kana (Katakana or Hiragana) symbols, and some Latin and/or Greek letters. To get the best recognition results for documents written primarily in Japanese, we strongly recommend using the Japanese (Modern) recognition language on its own, without combining it with the English language.
ABBYY FineReader Engine supports recognition language combinations that consist of several CJK languages or that combine CJK languages with other languages.
To prevent garbling of Asian characters, you must specify for document synthesis a font that includes the necessary set of characters, e.g., Arial Unicode MS or SimSun. You can set the font with the help of the ISynthesisParamsForDocument::FontSet property. By default, the SystemFontSet property of the FontSet object selects those system fonts which correspond to the recognition languages of the document.
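As a rough illustration of what "the necessary set of characters" means in practice, the sketch below checks which CJK scripts occur in a piece of Unicode text, which in turn tells you what a chosen font has to cover. It is not part of the ABBYY FineReader Engine API: the names ScriptFlags, classifyCodePoint, and detectScripts are hypothetical, and only the most common Unicode blocks are checked, so treat it as a starting point rather than a complete classifier.

```cpp
#include <string>

// Bit flags for the scripts found in a text.
// Hypothetical helper type, not a FineReader Engine type.
enum ScriptFlags : unsigned {
    SF_None     = 0,
    SF_Han      = 1u << 0, // Chinese characters (Hanzi, Kanji, Hanja)
    SF_Hiragana = 1u << 1,
    SF_Katakana = 1u << 2,
    SF_Hangul   = 1u << 3
};

// Classify a single Unicode code point by its block.
// Only the main CJK blocks are covered; rarer extensions are ignored.
unsigned classifyCodePoint( char32_t c )
{
    if( ( c >= 0x4E00 && c <= 0x9FFF ) ||     // CJK Unified Ideographs
        ( c >= 0x3400 && c <= 0x4DBF ) ||     // CJK Unified Ideographs Extension A
        ( c >= 0x20000 && c <= 0x2A6DF ) ||   // CJK Unified Ideographs Extension B
        ( c >= 0xF900 && c <= 0xFAFF ) ) {    // CJK Compatibility Ideographs
        return SF_Han;
    }
    if( c >= 0x3040 && c <= 0x309F ) {
        return SF_Hiragana;
    }
    if( c >= 0x30A0 && c <= 0x30FF ) {
        return SF_Katakana;
    }
    if( ( c >= 0xAC00 && c <= 0xD7AF ) ||     // Hangul Syllables
        ( c >= 0x1100 && c <= 0x11FF ) ||     // Hangul Jamo
        ( c >= 0x3130 && c <= 0x318F ) ) {    // Hangul Compatibility Jamo
        return SF_Hangul;
    }
    return SF_None;
}

// Combine the flags of every code point in the text.
unsigned detectScripts( const std::u32string& text )
{
    unsigned flags = SF_None;
    for( char32_t c : text ) {
        flags |= classifyCodePoint( c );
    }
    return flags;
}
```

For example, a text that yields SF_Han together with SF_Hiragana or SF_Katakana is most likely Japanese and needs a font with Kanji and Kana coverage, while SF_Hangul points to a font with Korean coverage. Note that SF_Han alone does not distinguish Chinese from Japanese or Korean text.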
You can export texts in CJK languages to PDF/A in “text under the image” mode (IPDFExportParams::TextExportMode = PEM_ImageOnText), so that the exported document looks the same as the source image.
Pass the configured DocumentProcessingParams object to the Process method of the FRDocument object. If you use methods of the Engine object, you should call one of the synthesis methods of the Engine object with the configured SynthesisParamsForDocument object as a parameter before export.
Export the recognized text with the help of the Export method of the FRDocument object. If you export to PDF or PDF/A format, specify the required export mode.
Do not use the Word object and its properties, or the IsWordFirst and IsWordLeftmost properties of the CharParams object, for texts written in CJK languages. The processing technology divides the text lines into “words” only for internal purposes, and those groups of symbols do not coincide with the actual words.
C++ code
// We assume that Engine was already created
// and the document loaded
IEngine* Engine;
IFRDocument* frDocument;
HRESULT res; // use this variable to check if the call to the method was successful
...
// Create a DocumentProcessingParams object
IDocumentProcessingParams* params = 0;
IPageProcessingParams* pageParams = 0;
IRecognizerParams* recParams = 0;
res = Engine->CreateDocumentProcessingParams( &params );
res = params->get_PageProcessingParams( &pageParams );
res = pageParams->get_RecognizerParams( &recParams );
// Specify the recognition language
res = recParams->SetPredefinedTextLanguage( L"Japanese" );
ISynthesisParamsForDocument* synthesisParams = 0;
IFontSet* set = 0;
ISystemFontSet* systemSet = 0;
res = params->get_SynthesisParamsForDocument( &synthesisParams );
res = synthesisParams->get_FontSet( &set );
res = set->get_SystemFontSet( &systemSet );
// Select font set
res = systemSet->put_FontNamesFilter( FNF_Japanese );
// Recognize and export the document
frDocument->Process( params );
frDocument->Export( L"/opt/Demo.rtf", FEF_RTF, 0 );
...
C++ (COM) code
FREngine::IEnginePtr Engine;
FREngine::IFRDocumentPtr frDocument;
...
// Create a DocumentProcessingParams object
FREngine::IDocumentProcessingParamsPtr pDocumentProcessingParams = Engine->CreateDocumentProcessingParams();
// Specify the recognition language
pDocumentProcessingParams->PageProcessingParams->RecognizerParams->SetPredefinedTextLanguage( "Japanese" );
// Select font set
pDocumentProcessingParams->SynthesisParamsForDocument->FontSet->SystemFontSet->FontNamesFilter = FREngine::FNF_Japanese;
// Recognize and export the document
frDocument->Process( pDocumentProcessingParams );
frDocument->Export( L"D:\\Demo.rtf", FREngine::FEF_RTF, 0 );
...
C# code
FREngine.IEngine engine;
FREngine.IFRDocument frdoc;
...
// Create a DocumentProcessingParams object
FREngine.IDocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();
// Specify the recognition language
dpp.PageProcessingParams.RecognizerParams.SetPredefinedTextLanguage( "Japanese" );
// Select font set
dpp.SynthesisParamsForDocument.FontSet.SystemFontSet.FontNamesFilter = (int)FREngine.FontNamesFiltersEnum.FNF_Japanese;
// Recognize and export the document
frdoc.Process( dpp );
frdoc.Export( "D:\\Demo.rtf", FREngine.FileExportFormatEnum.FEF_RTF, null );
...