Overview

Prompt-based extraction allows you to use natural language instructions to extract structured data from documents using LLMs. Instead of training traditional machine learning models, you describe what data you want to extract and how it should be formatted, and the LLM handles the extraction based on your instructions.
What you’ll accomplish:
  • Create a prompt-based extraction activity
  • Configure an LLM connection
  • Write effective extraction prompts
  • Define output format and structure
  • Apply strictness and validation rules
  • Test and refine your extraction
Time to complete: 20-30 minutes
Use Cases:
  • Vendor information extraction from invoices
  • Header-level document data capture
  • Semi-structured document processing
  • Documents with variable layouts

Prerequisites

Before you begin, ensure you have:
  1. Access to ABBYY Vantage Advanced Designer
  2. An LLM connection configured (see How to Configure LLM Connections)
  3. A Document Skill with sample documents loaded
  4. Basic understanding of JSON structure
  5. Field definitions for the data you want to extract
Note: This guide focuses on header-level extraction. Table extraction capabilities may vary.

Understanding Prompt-Based Extraction

What is Prompt-Based Extraction?

Prompt-based extraction uses LLMs to understand and extract data from documents based on natural language instructions. You define:
  • Role: What the LLM should act as (e.g., “data extraction model”)
  • Instructions: How to extract and format data
  • Output Structure: The exact JSON format for results
  • Rules: Guidelines for handling ambiguous or missing data

Benefits

  • No training data required: Works with just prompt engineering
  • Flexible: Easy to add or modify fields
  • Handles variations: LLMs can understand different document formats
  • Quick setup: Faster than training traditional ML models
  • Natural language: Write instructions in plain English

Limitations

  • Cost: Each extraction uses LLM API calls
  • Speed: Slower than traditional extraction for simple documents
  • Consistency: Results may vary slightly between runs
  • Context limits: Very long documents may need special handling

Step 1: Add a Prompt-Based Activity

Create a new prompt-based extraction activity in your Document Skill.
  1. Open your Document Skill in ABBYY Vantage Advanced Designer
  2. In the left panel, locate EXTRACT FROM TEXT (NLP)
  3. Find and click on Prompt-based
Selecting Prompt-Based Activity
  4. The activity appears in your workflow canvas
  5. Connect it between your input and output activities
Note: Prompt-based activities are found under “EXTRACT FROM TEXT (NLP)” in the Activities panel, alongside other extraction methods like Named Entities (NER) and Deep Learning.

Step 2: Configure the LLM Connection

Select which LLM connection the activity should use.
  1. Select the prompt-based activity in your workflow
  2. In the Activity Properties panel on the right, locate LLM Connection
  3. Click the dropdown menu
Configuring LLM Connection
  4. Select your configured LLM connection from the list
    • Example: Nick-ChatGPT, Microsoft Foundry, Production GPT-4
  5. Verify the connection is selected
Note: If you don’t see any connections listed, you need to configure an LLM connection first through Configuration → Connections.

Step 3: Define Output Fields

Set up the fields you want to extract before writing your prompt.
  1. In the Activity Properties panel, locate the Output section
  2. You’ll see a hierarchical list of field groups and fields
  3. For this example, we’re extracting vendor information:
    • Vendor
      • Name
      • Address
      • TaxID
      • Account Number
      • Sort Code
      • IBAN
      • BIC_SWIFT
    • Business Unit
      • Name
      • Address
      • Invoice Date
      • Invoice Number
    • Totals
      • Net Amount
Field Output Structure
  4. Click the Activity Editor button to begin configuring the prompt
Note: Define all fields before writing your prompt. The field names will be referenced in your prompt structure.

Step 4: Write the Role Definition

Define what role the LLM should play when processing documents.
  1. In the Activity Editor, you’ll see the Prompt Text interface
  2. Start with the ROLE section:
ROLE

You are a data extraction model. Extract only the specified vendor-related 
fields from a document. Extract the value text verbatim (not the label). Do 
not infer or reformat any data. Omit any field that is not clearly present.
Prompt Text Editor
Key Role Instructions:
  • Be specific: “data extraction model” tells the LLM its purpose
  • Define scope: “vendor-related fields” limits what to extract
  • Set expectations: “value text verbatim” prevents reformatting
  • Handle missing data: “Omit any field that is not clearly present”
Best Practices:
  • Keep the role clear and concise
  • Use imperative statements (“Extract”, “Do not infer”)
  • Be explicit about what NOT to do
  • Define how to handle edge cases

Step 5: Define the Output Format

Specify the exact JSON structure for extraction results.
  1. Below the ROLE section, add the OUTPUT FORMAT heading
  2. Define the JSON structure:
OUTPUT FORMAT

Return one valid JSON object using this exact structure:

{
  "Fields": [
    { "FieldName": "Vendor.Name", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.Address", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.TaxID", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.Account Number", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.Sort Code", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.IBAN", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.BIC_SWIFT", "Text": "...", "Line": <FirstLineIndex> }
  ]
}
JSON Output Format
Structure Components:
  • FieldName: Must match your field definitions exactly (e.g., Vendor.Name)
  • Text: The extracted value as a string
  • Line: 0-based line index where the value appears in the document
Important Notes:
  • Use exact field names from your Output configuration
  • Include all fields even if some might be empty
  • The structure must be valid JSON
  • Line numbers help with verification and troubleshooting
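Because field mapping depends on the LLM honoring this structure exactly, it can be worth validating responses before they go downstream. A minimal Python sketch, assuming the sample field names from this guide; `validate_response` is an illustration for post-processing, not part of the Vantage API:

```python
import json

# Field names must match the Output configuration exactly.
# These are the sample fields from this guide.
EXPECTED_FIELDS = {
    "Vendor.Name", "Vendor.Address", "Vendor.TaxID",
    "Vendor.Account Number", "Vendor.Sort Code",
    "Vendor.IBAN", "Vendor.BIC_SWIFT",
}

def validate_response(raw):
    """Return a list of problems found in an LLM extraction response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    fields = data.get("Fields")
    if not isinstance(fields, list):
        return ['missing or non-list "Fields" key']
    problems = []
    for item in fields:
        name = item.get("FieldName")
        if name not in EXPECTED_FIELDS:
            problems.append(f"unexpected FieldName: {name!r}")
        if not isinstance(item.get("Text"), str):
            problems.append(f"{name}: Text must be a string")
        if not isinstance(item.get("Line"), int):
            problems.append(f"{name}: Line must be a 0-based integer")
    return problems

sample = '{"Fields": [{"FieldName": "Vendor.Name", "Text": "ABC Corporation Ltd", "Line": 3}]}'
print(validate_response(sample))  # []
```

An empty list means the response matches the expected structure; anything else can be routed to retry or review.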

Step 6: Add Field-Specific Extraction Rules

Provide detailed instructions for extracting each field. Below the OUTPUT FORMAT, add specific rules for each field type:
VENDOR NAME
1) Recognize names like "ABC Corporation", "XYZ Ltd", "Acme Inc.".
2) Extract the complete company name including legal suffixes (Ltd, Inc, GmbH, etc.).
3) Vendor name typically appears near the top of the document.

VENDOR ADDRESS
1) Extract the complete address including street, city, postal code.
2) For multiline addresses, represent each new line using "\n".
3) Vendor-side only; exclude customer/buyer addresses.

ACCOUNT NUMBER
1) Recognize "Account Number", "Account No", "Acct #".
2) Extract the numeric format exactly as printed (e.g., "12-34-56" or "500 105 17").
3) Vendor-owned accounts only (e.g., "Beneficiary" or "Vendor Payment" sections).
4) Ignore IBAN — it has its own field.

SORT CODE
1) Recognize "Sort Code", "Sort No.", "BLZ", "Bankleitzahl".
2) Extract the numeric format exactly as printed (e.g., "12-34-56" or "500 105 17").
3) Vendor-side data only; ignore payer/buyer codes.

IBAN
1) Recognize "IBAN", "International Bank Account Number".
2) Extract the full IBAN exactly as printed (include spaces).
3) Vendor-side only, typically under "Bankverbindung", "Coordonnées bancaires", "Payment Details", or "Beneficiary Bank".

BIC_SWIFT
1) Recognize "BIC", "SWIFT", or "BIC/SWIFT".
2) Extract the complete identifier (usually 8 or 11 uppercase letters/numbers).
3) Vendor-side only, near the IBAN or bank name.
4) Exclude customer/payer data.
Extraction Rules
Rule Structure:
  • Recognition patterns: List alternative labels for each field
  • Format specifications: Describe exact format to extract
  • Location hints: Where to typically find the data
  • Exclusions: What NOT to extract
Best Practices:
  • Number your rules for clarity
  • Provide multiple label variations
  • Specify data ownership (vendor-side vs. customer-side)
  • Include format examples in parentheses
  • Be explicit about related fields (e.g., “Ignore IBAN — it has its own field”)

Step 7: Apply Strictness Rules

Add validation rules to ensure data quality and consistency. At the end of your prompt, add a STRICTNESS section:
STRICTNESS
- Never generate or infer values.
- Omit ambiguous or missing fields.
- If none of the vendor fields are found, return:
  {
    "Fields": []
  }
Strictness Rules
Additional Strictness Rules (Optional):
GENERAL RULES
- Extract exactly one value per field.
- Skip any field that cannot be confidently located — omit it from the output.
- "FieldName" must match the names above exactly.
- "Text" must be copied verbatim from the document — no normalization or inference.
- For multiline values (e.g., addresses), represent each new line using the escape sequence "\n" (a backslash followed by the letter n).
- Do not insert HTML tags such as <br> in the output text.
- "Line" is the 0-based index of the first line containing the extracted value; include it only if verifiable.
Why Strictness Matters:
  • Prevents hallucination: LLMs may generate plausible but incorrect data
  • Ensures consistency: Clear rules reduce variation between runs
  • Handles missing data: Defines what to do when fields aren’t found
  • Maintains data integrity: Verbatim extraction preserves original formatting
Key Strictness Principles:
  • Never generate data that isn’t in the document
  • Omit uncertain extractions rather than guessing
  • Return empty structure if no fields are found
  • Match field names exactly
  • Preserve original text formatting
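These principles can also be enforced after extraction. A hedged sketch, assuming you have the document's text available as a list of lines (e.g. from its text representation); `check_strictness` is a hypothetical post-processing helper that flags non-verbatim text or a mismatched Line index:

```python
def check_strictness(fields, document_lines):
    """Flag extractions that violate the strictness rules: Text must
    appear verbatim in the document, and Line must be the 0-based index
    of a line actually containing it."""
    violations = []
    for f in fields:
        # Compare multiline values against their first line only.
        first_line = f["Text"].split("\n")[0]
        matches = [i for i, line in enumerate(document_lines)
                   if first_line in line]
        if not matches:
            violations.append((f["FieldName"], "Text not found verbatim"))
        elif f.get("Line") not in matches:
            violations.append((f["FieldName"], "Line index mismatch"))
    return violations

doc_lines = ["INVOICE", "", "From:", "ABC Corporation Ltd", "123 Business Street"]
fields = [{"FieldName": "Vendor.Name", "Text": "ABC Corporation Ltd", "Line": 3}]
print(check_strictness(fields, doc_lines))  # []
```

A hallucinated value, which by definition appears nowhere in the document, is caught by the verbatim check.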

Step 8: Select Document Format

Choose which document representation to send to the LLM.
  1. In the Activity Editor, locate the Prompt dropdown
  2. You’ll see options for how the document is provided to the LLM
Document Format Options
Available Formats:
  • PDF: Original PDF file
    • Use for: Documents where layout is critical
    • Considerations: Larger file size, some LLMs have limited PDF support
  • Plain Text: Unformatted text extraction
    • Use for: Simple text-only documents
    • Considerations: Loses all formatting and layout information
  • Annotated Text ⭐ (Recommended)
    • Use for: Most document types
    • Considerations: Preserves structure while being text-based
    • Benefits: Best balance of structure and performance
  • Formatted Text: Text with basic formatting preserved
    • Use for: Documents where some formatting matters
    • Considerations: Middle ground between Plain and Annotated
  3. Select Annotated Text for best results
Note: Through testing, Annotated Text has been found to provide the most consistent and reliable results for extraction tasks. It preserves document structure while being efficiently processed by LLMs.

Step 9: Test Your Extraction

Run the activity on sample documents to verify results.

Run the Activity

  1. Close the Activity Editor
  2. Navigate to All Documents tab
  3. Select a test document
  4. Click Test Activity or Run button
Testing Activity
  5. Wait for the LLM to process the document
    • Processing time: typically 5-30 seconds depending on document complexity
    • You’ll see a loading indicator while waiting for the API response

Review Results

Once processing completes:
  1. The interface switches to Predictive view
  2. Review the Output panel showing extracted fields
  3. Click on each field to see:
    • Extracted value
    • Confidence (if provided)
    • Highlighted region on the document image
Reviewing Results
What to Check:
  • ✅ All expected fields are populated
  • ✅ Values match the document exactly
  • ✅ No hallucinated or inferred data
  • ✅ Proper handling of multiline fields
  • ✅ Missing fields are omitted (not filled with incorrect data)

Common Result Patterns

Successful Extraction:
{
  "Fields": [
    { "FieldName": "Vendor.Name", "Text": "ABC Corporation Ltd", "Line": 3 },
    { "FieldName": "Vendor.Address", "Text": "123 Business Street\nLondon SW1A 1AA", "Line": 5 },
    { "FieldName": "Vendor.IBAN", "Text": "GB29 NWBK 6016 1331 9268 19", "Line": 15 }
  ]
}
Partial Extraction (some fields missing):
{
  "Fields": [
    { "FieldName": "Vendor.Name", "Text": "ABC Corporation Ltd", "Line": 3 }
  ]
}
No Fields Found:
{
  "Fields": []
}
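Downstream logic should handle all three patterns explicitly rather than assuming a full result. A small illustrative sketch; the expected count of 7 matches this guide's vendor fields, and `summarize_result` is a hypothetical helper, not a Vantage function:

```python
import json

EXPECTED_COUNT = 7  # the seven vendor fields defined in this guide

def summarize_result(raw):
    """Classify an extraction response as full, partial, or empty."""
    fields = json.loads(raw).get("Fields", [])
    if not fields:
        return "empty: route to manual review"
    if len(fields) < EXPECTED_COUNT:
        return f"partial: {len(fields)}/{EXPECTED_COUNT} fields"
    return "full extraction"

print(summarize_result('{"Fields": []}'))  # empty: route to manual review
```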

Step 10: Refine Your Prompt

Iterate on your prompt based on test results.

Common Issues and Solutions

Issue: LLM extracts wrong field
  • Solution: Add more specific location hints
  • Example: “Vendor-side only; exclude customer/buyer addresses”
Issue: Formatting is changed
  • Solution: Emphasize verbatim extraction
  • Example: “Extract the numeric format exactly as printed (e.g., ‘12-34-56’)”
Issue: LLM invents data
  • Solution: Strengthen strictness rules
  • Example: “Never generate or infer values. Omit if not present.”
Issue: Multiline fields are concatenated
  • Solution: Specify escape sequences
  • Example: “For multiline values, use \n for new lines”
Issue: Incorrect field names in output
  • Solution: Verify field names match exactly
  • Example: Use Vendor.Account Number not AccountNumber
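The \n escape in particular is easy to misread: in the raw JSON it is a two-character escape sequence, which only becomes a real newline once the response is parsed. A quick demonstration:

```python
import json

# The LLM returns "\n" as a JSON escape sequence inside the string;
# json.loads turns it into a real newline character.
raw = ('{"Fields": [{"FieldName": "Vendor.Address", '
       '"Text": "123 Business Street\\nLondon SW1A 1AA", "Line": 5}]}')
address = json.loads(raw)["Fields"][0]["Text"]
print(address.split("\n"))  # ['123 Business Street', 'London SW1A 1AA']
```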

Iterative Improvement Process

  1. Test on multiple documents: Don’t optimize for a single example
  2. Document patterns: Note which rules work and which need refinement
  3. Add specific examples: Include format examples in parentheses
  4. Refine strictness: Adjust based on over/under-extraction patterns
  5. Test edge cases: Try documents with missing fields, unusual layouts

Example Refinements

Before:
VENDOR NAME
1) Extract the vendor name from the document.
After:
VENDOR NAME
1) Recognize names like "ABC Corporation", "XYZ Ltd", "Acme Inc.".
2) Extract the complete company name including legal suffixes (Ltd, Inc, GmbH, etc.).
3) Vendor name typically appears near the top of the document.
4) Exclude customer/buyer names - focus on the entity issuing the invoice.

Understanding the Extraction Process

How Prompt-Based Extraction Works

  1. Document Conversion: Your document is converted to the selected format (Annotated Text recommended)
  2. Prompt Assembly: Your role, output format, field rules, and strictness rules are combined
  3. API Call: The prompt and document are sent to the LLM via your connection
  4. LLM Processing: The LLM reads the document and extracts data according to your instructions
  5. JSON Response: The LLM returns structured data in the specified JSON format
  6. Field Mapping: Vantage maps the JSON response to your defined output fields
  7. Verification: Line numbers and confidence scores (if provided) help verify accuracy
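Conceptually, the prompt-assembly step just concatenates your prompt sections with the document text. Vantage performs this internally; the sketch below, with abbreviated section contents and a hypothetical `assemble_prompt` helper, only illustrates the idea:

```python
def assemble_prompt(role, output_format, field_rules, strictness, document_text):
    """Combine the prompt sections and the document into one request body.
    The real payload format depends on your LLM connection; this shows
    only the conceptual concatenation."""
    return "\n\n".join([
        "ROLE\n" + role,
        "OUTPUT FORMAT\n" + output_format,
        field_rules,
        "STRICTNESS\n" + strictness,
        "DOCUMENT\n" + document_text,
    ])

prompt = assemble_prompt(
    role="You are a data extraction model. ...",
    output_format='Return one valid JSON object: {"Fields": [...]}',
    field_rules="VENDOR NAME\n1) Recognize names like ...",
    strictness="- Never generate or infer values.",
    document_text="INVOICE\nABC Corporation Ltd\n...",
)
```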

Token Usage and Costs

Factors Affecting Cost:
  • Document length: Longer documents use more tokens
  • Prompt complexity: Detailed prompts increase token count
  • Format choice: Annotated Text is typically more efficient than PDF
  • Number of fields: More fields = longer prompts
Optimization Tips:
  • Use concise but clear language in prompts
  • Don’t duplicate instructions
  • Remove unnecessary examples
  • Consider field grouping for related data
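For a rough pre-flight cost estimate, a common rule of thumb is about four characters per token for English text; use your provider's own tokenizer for exact counts and pricing. An illustrative sketch under that assumption:

```python
def estimate_tokens(text):
    """Very rough heuristic: ~4 characters per token for English text.
    This is an approximation only; actual counts vary by model and
    tokenizer."""
    return max(1, len(text) // 4)

prompt_text = "ROLE\nYou are a data extraction model..."
document_text = "INVOICE\nABC Corporation Ltd\n..."
print(estimate_tokens(prompt_text + document_text), "tokens (approx.)")
```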

Best Practices

Prompt Writing

Do:
  • ✅ Use clear, imperative statements (“Extract”, “Recognize”, “Omit”)
  • ✅ Provide multiple label variations for each field
  • ✅ Include format examples in parentheses
  • ✅ Specify what NOT to extract (exclusions)
  • ✅ Number your rules for easy reference
  • ✅ Use consistent terminology throughout
Don’t:
  • ❌ Use vague instructions (“get the name”)
  • ❌ Assume the LLM knows domain-specific conventions
  • ❌ Write overly long, complex sentences
  • ❌ Contradict yourself in different sections
  • ❌ Skip strictness rules

Field Definitions

Effective Field Instructions:
  • Start with recognition patterns (alternative labels)
  • Specify exact format to preserve
  • Provide location hints (typical placement)
  • Define data ownership (vendor vs. customer)
  • Include handling for multiline values
  • Reference related fields to avoid confusion
Example:
IBAN
1) Recognize "IBAN", "International Bank Account Number".
2) Extract the full IBAN exactly as printed (include spaces).
3) Vendor-side only, typically under "Bankverbindung", "Payment Details".
4) Do NOT confuse with Account Number — IBAN is longer and alphanumeric.

Testing Strategy

  1. Start with simple documents: Test basic extraction first
  2. Expand to variations: Try different layouts and formats
  3. Test edge cases: Missing fields, unusual positions, multiple matches
  4. Document failures: Keep examples of where extraction fails
  5. Iterate systematically: Change one thing at a time

Performance Optimization

For Speed:
  • Keep prompts concise
  • Use Annotated Text format
  • Minimize number of fields per activity
  • Consider splitting complex documents
For Accuracy:
  • Provide comprehensive field rules
  • Include format examples
  • Add strong strictness rules
  • Test with diverse document samples
For Cost:
  • Optimize prompt length
  • Use efficient document formats
  • Cache results when appropriate
  • Monitor token usage via LLM provider dashboard

Troubleshooting

Extraction Issues

Problem: Fields are empty despite data being present
Solutions:
  • Check field name spelling matches exactly
  • Verify the data is in the selected document format
  • Add more label variations to recognition patterns
  • Reduce strictness temporarily to see if the LLM finds it
  • Check if document quality affects OCR/text extraction
Problem: LLM extracts customer data instead of vendor data
Solutions:
  • Strengthen vendor-side specifications
  • Add explicit exclusions for customer/buyer data
  • Provide location hints (e.g., “top of document”, “issuer section”)
  • Include examples of correct vs. incorrect extraction
Problem: Multiline values are concatenated or malformed
Solutions:
  • Explicitly specify escape sequence format (\n)
  • Provide examples of correct multiline output
  • Verify document format preserves line breaks
  • Add instruction: “Preserve original line breaks using \n”
Problem: LLM reformats or normalizes data
Solutions:
  • Emphasize “verbatim” and “exactly as printed”
  • Add strictness rule: “No normalization or inference”
  • Provide specific examples showing preservation of formatting
  • Include negative examples: “Not ‘12-34-56’; keep as ‘12 34 56’”

Performance Issues

Problem: Extraction is too slow
Solutions:
  • Switch to Annotated Text format if using PDF
  • Simplify prompt without losing critical instructions
  • Reduce document resolution if images are very large
  • Check LLM provider status and rate limits
  • Consider using a faster model for simple documents
Problem: Inconsistent results between runs
Solutions:
  • Strengthen strictness rules
  • Make instructions more specific and unambiguous
  • Add more format examples
  • Reduce prompt complexity that might lead to interpretation
  • Use a lower temperature setting (if available in the connection)
Problem: High API costs
Solutions:
  • Optimize prompt length
  • Use Annotated Text instead of PDF
  • Process documents in batches during off-peak
  • Consider using smaller/cheaper models for simple documents
  • Monitor and set budget alerts in LLM provider dashboard

Advanced Techniques

Conditional Extraction

You can instruct the LLM to extract certain fields only if conditions are met:
ACCOUNT NUMBER (CONDITIONAL)
1) Only extract if the document contains bank payment details.
2) If "Payment Method: Check" or similar appears, omit this field.
3) Recognize "Account Number", "Account No", "Acct #".

Multi-Language Support

Prompt-based extraction works well with multilingual documents:
VENDOR NAME (MULTI-LANGUAGE)
1) Recognize in English: "Vendor Name", "Supplier", "Seller"
2) Recognize in German: "Verkäufer", "Lieferant", "Anbieter"
3) Recognize in French: "Fournisseur", "Vendeur"
4) Extract the complete company name regardless of language.

Validation Rules

Add validation logic to your prompts:
IBAN (WITH VALIDATION)
1) Extract the full IBAN exactly as printed.
2) Verify it starts with a 2-letter country code.
3) If format doesn't match IBAN pattern, omit the field.
4) Do not invent check digits or country codes.
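The same validation can run in post-processing rather than relying on the LLM alone. A sketch using the standard ISO 13616 mod-97 check, stripping spaces first since the prompt preserves them verbatim; `looks_like_iban` is a hypothetical helper:

```python
import re

def looks_like_iban(value):
    """Format plus ISO 13616 mod-97 check. Spaces are stripped first,
    since the prompt preserves them verbatim."""
    compact = value.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return False
    # Move the first four characters to the end, map letters to numbers
    # (A=10 ... Z=35), then apply the mod-97 test.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1

print(looks_like_iban("GB29 NWBK 6016 1331 9268 19"))  # True
print(looks_like_iban("12-34-56"))                     # False
```

A value that fails this check can be flagged for review instead of being silently accepted.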

Field Relationships

Specify how fields relate to each other:
ACCOUNT NUMBER vs IBAN
- Account Number: Usually shorter, numeric, domestic format
- IBAN: Alphanumeric, starts with country code (e.g., "GB29 NWBK...")
- If both are present, extract both to separate fields
- If only one is present, extract to the appropriate field
- Do not duplicate the same value in both fields
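These distinctions can double as a post-processing sanity check. An illustrative routing heuristic; the length thresholds and the "unknown" fallback are assumptions for the sketch, not rules from this guide:

```python
import re

def route_bank_value(value):
    """Decide which field a bank identifier most likely belongs to.
    IBANs start with a country code and two check digits; domestic
    account numbers are shorter and purely numeric. Thresholds here
    are illustrative assumptions."""
    compact = value.replace(" ", "").replace("-", "").upper()
    if re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return "Vendor.IBAN"
    if compact.isdigit() and 7 <= len(compact) <= 12:
        return "Vendor.Account Number"
    return "unknown"  # free text or ambiguous codes: send to review

print(route_bank_value("GB29 NWBK 6016 1331 9268 19"))  # Vendor.IBAN
print(route_bank_value("500 105 17"))                   # Vendor.Account Number
```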

Limitations and Considerations

Current Capabilities

Supported:
  • ✅ Header-level field extraction
  • ✅ Single and multiline values
  • ✅ Multiple fields per document
  • ✅ Conditional extraction logic
  • ✅ Multi-language documents
  • ✅ Variable document layouts
Limited or Not Supported:
  • ⚠️ Table extraction (varies by implementation)
  • ⚠️ Nested complex structures
  • ⚠️ Very large documents (token limits)
  • ⚠️ Real-time processing (API latency)
  • ⚠️ Guaranteed deterministic results

When to Use Prompt-Based Extraction

Best For:
  • Documents with variable layouts
  • Semi-structured documents
  • Quick prototyping and testing
  • Small to medium document volumes
  • When training data is unavailable
  • Multi-language document processing
Consider Alternatives For:
  • High-volume production (traditional ML may be faster)
  • Highly structured forms (template-based extraction)
  • Cost-sensitive applications (traditional methods may be cheaper)
  • Latency-critical applications (LLM APIs have network delay)
  • Offline processing requirements (no internet needed for traditional methods)

Integration with Document Skills

Using Extracted Data

Once extraction is complete, the field data is available throughout your Document Skill:
  1. Validation Activities: Apply business rules to extracted values
  2. Script Activities: Process or transform extracted data
  3. Export Activities: Send data to external systems
  4. Review Interface: Manual verification of extracted fields

Combining with Other Activities

Prompt-based extraction can work alongside other activities:
Workflow Example:
1. Classification (identify document type)
2. OCR (extract text)
3. Prompt-based extraction (extract structured data)
4. Validation rules (verify data quality)
5. Script (format for export)
6. Output (deliver results)

Field Mapping

The extracted JSON fields automatically map to your defined output fields:
  • "FieldName": "Vendor.Name" → Maps to Output field Vendor.Name
  • Field hierarchy is preserved in the output structure
  • Line numbers help with verification and troubleshooting
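The dotted FieldName values fold naturally into that hierarchy. A small sketch reconstructing the nested structure from the flat JSON list; `to_nested` is a hypothetical helper for downstream processing, not a Vantage function:

```python
def to_nested(fields):
    """Fold flat "Group.Field" names into a nested dict that mirrors
    the Output hierarchy."""
    result = {}
    for f in fields:
        group, _, name = f["FieldName"].partition(".")
        result.setdefault(group, {})[name] = f["Text"]
    return result

fields = [
    {"FieldName": "Vendor.Name", "Text": "ABC Corporation Ltd", "Line": 3},
    {"FieldName": "Vendor.IBAN", "Text": "GB29 NWBK 6016 1331 9268 19", "Line": 15},
]
print(to_nested(fields))
# {'Vendor': {'Name': 'ABC Corporation Ltd', 'IBAN': 'GB29 NWBK 6016 1331 9268 19'}}
```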

Summary

You’ve successfully:
  • ✅ Created a prompt-based extraction activity
  • ✅ Configured an LLM connection
  • ✅ Written a comprehensive extraction prompt with role, format, and rules
  • ✅ Selected the optimal document format (Annotated Text)
  • ✅ Applied strictness rules for data quality
  • ✅ Tested extraction and reviewed results
  • ✅ Learned best practices for prompt engineering
Key Takeaways:
  • Prompt-based extraction uses natural language instructions
  • Annotated Text format provides best results
  • Clear, specific prompts yield consistent extraction
  • Strictness rules prevent hallucination and maintain data quality
  • Iterative testing and refinement improve accuracy
Your prompt-based extraction activity is now ready for document processing!

Next Steps

  1. Test with diverse documents: Validate across different layouts and variations
  2. Refine your prompts: Continuously improve based on results
  3. Monitor costs: Track token usage in your LLM provider dashboard
  4. Optimize performance: Fine-tune prompts for speed and accuracy
  5. Explore table extraction: Experiment with extracting line items (if supported)
  6. Integrate with workflows: Combine with other activities for complete processing


Frequently Asked Questions

Q: What’s the difference between prompt-based and traditional extraction?
A: Prompt-based uses LLM natural language instructions without training data. Traditional methods require training examples but are faster and more cost-effective at scale.

Q: Can I extract tables with prompt-based activities?
A: Header-level extraction is well-supported. Table extraction capabilities may vary and require specific prompt structures.

Q: Why use Annotated Text over PDF?
A: Annotated Text provides the best balance of structure preservation and processing efficiency. It’s been proven most reliable through testing.

Q: How do I reduce API costs?
A: Optimize prompt length, use Annotated Text format, process efficiently, and monitor token usage via your LLM provider’s dashboard.

Q: What if my LLM connection fails?
A: Check your connection status in Configuration → Connections. Test the connection, verify credentials, and ensure your API quota isn’t exceeded.

Q: Can I use multiple LLM connections in one skill?
A: Yes, different activities can use different connections. This allows you to use different models for different extraction tasks.

Q: How do I handle documents in multiple languages?
A: Add multi-language label variations to your field rules. LLMs generally handle multilingual content well.

Q: What’s the maximum document size?
A: This depends on your LLM provider’s token limits. Very long documents may need to be split or processed in sections.