Overview

Prompt-based extraction allows you to use natural language instructions to extract structured data from documents using LLMs. Instead of training traditional machine learning models, you describe what data you want to extract and how it should be formatted, and the LLM handles the extraction based on your instructions.
What you’ll accomplish:
  • Create a prompt-based extraction activity
  • Configure an LLM connection
  • Write effective extraction prompts
  • Define output format and structure
  • Apply strictness and validation rules
  • Test and refine your extraction
Time to complete: 20-30 minutes
Use Cases:
  • Vendor information extraction from invoices
  • Header-level document data capture
  • Semi-structured document processing
  • Documents with variable layouts

Prerequisites

Before you begin, ensure you have:
  1. Access to ABBYY Vantage Advanced Designer
  2. An LLM connection configured (see How to Configure LLM Connections)
  3. A Document Skill with sample documents loaded
  4. Basic understanding of JSON structure
  5. Field definitions for the data you want to extract
Note: This guide focuses on header-level extraction. Table extraction capabilities may vary.

Understanding Prompt-Based Extraction

What is Prompt-Based Extraction?

Prompt-based extraction uses LLMs to understand and extract data from documents based on natural language instructions. You define:
  • Role: What the LLM should act as (e.g., “data extraction model”)
  • Instructions: How to extract and format data
  • Output Structure: The exact JSON format for results
  • Rules: Guidelines for handling ambiguous or missing data

Benefits

  • No training data required: Works with just prompt engineering
  • Flexible: Easy to add or modify fields
  • Handles variations: LLMs can understand different document formats
  • Quick setup: Faster than training traditional ML models
  • Natural language: Write instructions in plain English

Limitations

  • Cost: Each extraction uses LLM API calls
  • Speed: Slower than traditional extraction for simple documents
  • Consistency: Results may vary slightly between runs
  • Context limits: Very long documents may need special handling

Step 1: Add a Prompt-Based Activity

Create a new prompt-based extraction activity in your Document Skill.
  1. Open your Document Skill in ABBYY Vantage Advanced Designer
  2. In the left panel, locate EXTRACT FROM TEXT (NLP)
  3. Find and click on Prompt-based
Selecting Prompt-Based Activity
  4. The activity appears in your workflow canvas
  5. Connect it between your input and output activities
Note: Prompt-based activities are found under “EXTRACT FROM TEXT (NLP)” in the Activities panel, alongside other extraction methods like Named Entities (NER) and Deep Learning.

Step 2: Configure the LLM Connection

Select which LLM connection the activity should use.
  1. Select the prompt-based activity in your workflow
  2. In the Activity Properties panel on the right, locate LLM Connection
  3. Click the dropdown menu
Configuring LLM Connection
  4. Select your configured LLM connection from the list
    • Example: Nick-ChatGPT, Microsoft Foundry, Production GPT-4
  5. Verify the connection is selected
Note: If you don’t see any connections listed, you need to configure an LLM connection first through Configuration → Connections.

Step 3: Define Output Fields

Set up the fields you want to extract before writing your prompt.
  1. In the Activity Properties panel, locate the Output section
  2. You’ll see a hierarchical list of field groups and fields
  3. For this example, we’re extracting vendor information:
    • Vendor
      • Name
      • Address
      • TaxID
      • Account Number
      • Sort Code
      • IBAN
      • BIC_SWIFT
    • Business Unit
      • Name
      • Address
      • Invoice Date
      • Invoice Number
    • Totals
      • Net Amount
Field Output Structure
  4. Click the Activity Editor button to begin configuring the prompt
Note: Define all fields before writing your prompt. The field names will be referenced in your prompt structure.

Step 4: Write the Role Definition

Define what role the LLM should play when processing documents.
  1. In the Activity Editor, you’ll see the Prompt Text interface
  2. Start with the ROLE section:
ROLE

You are a data extraction model. Extract only the specified vendor-related 
fields from a document. Extract the value text verbatim (not the label). Do 
not infer or reformat any data. Omit any field that is not clearly present.
Prompt Text Editor
Key Role Instructions:
  • Be specific: “data extraction model” tells the LLM its purpose
  • Define scope: “vendor-related fields” limits what to extract
  • Set expectations: “value text verbatim” prevents reformatting
  • Handle missing data: “Omit any field that is not clearly present”
Best Practices:
  • Keep the role clear and concise
  • Use imperative statements (“Extract”, “Do not infer”)
  • Be explicit about what NOT to do
  • Define how to handle edge cases

Step 5: Define the Output Format

Specify the exact JSON structure for extraction results.
  1. Below the ROLE section, add the OUTPUT FORMAT heading
  2. Define the JSON structure:
OUTPUT FORMAT

Return one valid JSON object using this exact structure:

{
  "Fields": [
    { "FieldName": "Vendor.Name", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.Address", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.TaxID", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.Account Number", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.Sort Code", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.IBAN", "Text": "...", "Line": <FirstLineIndex> },
    { "FieldName": "Vendor.BIC_SWIFT", "Text": "...", "Line": <FirstLineIndex> }
  ]
}
JSON Output Format
Structure Components:
  • FieldName: Must match your field definitions exactly (e.g., Vendor.Name)
  • Text: The extracted value as a string
  • Line: 0-based line index where the value appears in the document
Important Notes:
  • Use exact field names from your Output configuration
  • Include all fields even if some might be empty
  • The structure must be valid JSON
  • Line numbers help with verification and troubleshooting
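Because field mapping depends on the LLM honoring this structure exactly, it can be worth validating responses before they go downstream. A minimal Python sketch, assuming the sample field names from this guide; `validate_response` is an illustration for post-processing, not part of the Vantage API:

```python
import json

# Field names must match the Output configuration exactly.
# These are the sample fields from this guide.
EXPECTED_FIELDS = {
    "Vendor.Name", "Vendor.Address", "Vendor.TaxID",
    "Vendor.Account Number", "Vendor.Sort Code",
    "Vendor.IBAN", "Vendor.BIC_SWIFT",
}

def validate_response(raw):
    """Return a list of problems found in an LLM extraction response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    fields = data.get("Fields")
    if not isinstance(fields, list):
        return ['missing or non-list "Fields" key']
    problems = []
    for item in fields:
        name = item.get("FieldName")
        if name not in EXPECTED_FIELDS:
            problems.append(f"unexpected FieldName: {name!r}")
        if not isinstance(item.get("Text"), str):
            problems.append(f"{name}: Text must be a string")
        if not isinstance(item.get("Line"), int):
            problems.append(f"{name}: Line must be a 0-based integer")
    return problems

sample = '{"Fields": [{"FieldName": "Vendor.Name", "Text": "ABC Corporation Ltd", "Line": 3}]}'
print(validate_response(sample))  # []
```

An empty list means the response matches the expected structure; anything else can be routed to retry or review.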

Step 6: Add Field-Specific Extraction Rules

Provide detailed instructions for extracting each field. Below the OUTPUT FORMAT, add specific rules for each field type:
VENDOR NAME
1) Recognize names like "ABC Corporation", "XYZ Ltd", "Acme Inc.".
2) Extract the complete company name including legal suffixes (Ltd, Inc, GmbH, etc.).
3) Vendor name typically appears near the top of the document.

VENDOR ADDRESS
1) Extract the complete address including street, city, postal code.
2) For multiline addresses, represent each new line using "\n".
3) Vendor-side only; exclude customer/buyer addresses.

ACCOUNT NUMBER
1) Recognize "Account Number", "Account No", "Acct #".
2) Extract the numeric format exactly as printed (e.g., "12-34-56" or "500 105 17").
3) Vendor-owned accounts only (e.g., "Beneficiary" or "Vendor Payment" sections).
4) Ignore IBAN — it has its own field.

SORT CODE
1) Recognize "Sort Code", "Sort No.", "BLZ", "Bankleitzahl".
2) Extract the numeric format exactly as printed (e.g., "12-34-56" or "500 105 17").
3) Vendor-side data only; ignore payer/buyer codes.

IBAN
1) Recognize "IBAN", "International Bank Account Number".
2) Extract the full IBAN exactly as printed (include spaces).
3) Vendor-side only, typically under "Bankverbindung", "Coordonnées bancaires", "Payment Details", or "Beneficiary Bank".

BIC_SWIFT
1) Recognize "BIC", "SWIFT", or "BIC/SWIFT".
2) Extract the complete identifier (usually 8 or 11 uppercase letters/numbers).
3) Vendor-side only, near the IBAN or bank name.
4) Exclude customer/payer data.
Extraction Rules
Rule Structure:
  • Recognition patterns: List alternative labels for each field
  • Format specifications: Describe exact format to extract
  • Location hints: Where to typically find the data
  • Exclusions: What NOT to extract
Best Practices:
  • Number your rules for clarity
  • Provide multiple label variations
  • Specify data ownership (vendor-side vs. customer-side)
  • Include format examples in parentheses
  • Be explicit about related fields (e.g., “Ignore IBAN — it has its own field”)

Step 7: Apply Strictness Rules

Add validation rules to ensure data quality and consistency. At the end of your prompt, add a STRICTNESS section:
STRICTNESS
- Never generate or infer values.
- Omit ambiguous or missing fields.
- If none of the vendor fields are found, return:
  {
    "Fields": []
  }
Strictness Rules
Additional Strictness Rules (Optional):
GENERAL RULES
- Extract exactly one value per field.
- Skip any field that cannot be confidently located — omit it from the output.
- "FieldName" must match the names above exactly.
- "Text" must be copied verbatim from the document — no normalization or inference.
- For multiline values (e.g., addresses), represent each new line using the escape sequence "\n" (a backslash followed by the letter n).
- Do not insert HTML tags such as <br> in the output text.
- "Line" is the 0-based index of the first line containing the extracted value; include it only if verifiable.
Why Strictness Matters:
  • Prevents hallucination: LLMs may generate plausible but incorrect data
  • Ensures consistency: Clear rules reduce variation between runs
  • Handles missing data: Defines what to do when fields aren’t found
  • Maintains data integrity: Verbatim extraction preserves original formatting
Key Strictness Principles:
  • Never generate data that isn’t in the document
  • Omit uncertain extractions rather than guessing
  • Return empty structure if no fields are found
  • Match field names exactly
  • Preserve original text formatting
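These principles can also be enforced after extraction. A hedged sketch, assuming you have the document's text available as a list of lines (e.g. from its text representation); `check_strictness` is a hypothetical post-processing helper that flags non-verbatim text or a mismatched Line index:

```python
def check_strictness(fields, document_lines):
    """Flag extractions that violate the strictness rules: Text must
    appear verbatim in the document, and Line must be the 0-based index
    of a line actually containing it."""
    violations = []
    for f in fields:
        # Compare multiline values against their first line only.
        first_line = f["Text"].split("\n")[0]
        matches = [i for i, line in enumerate(document_lines)
                   if first_line in line]
        if not matches:
            violations.append((f["FieldName"], "Text not found verbatim"))
        elif f.get("Line") not in matches:
            violations.append((f["FieldName"], "Line index mismatch"))
    return violations

doc_lines = ["INVOICE", "", "From:", "ABC Corporation Ltd", "123 Business Street"]
fields = [{"FieldName": "Vendor.Name", "Text": "ABC Corporation Ltd", "Line": 3}]
print(check_strictness(fields, doc_lines))  # []
```

A hallucinated value, which by definition appears nowhere in the document, is caught by the verbatim check.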

Step 8: Select Document Format

Choose which document representation to send to the LLM.
  1. In the Activity Editor, locate the Prompt dropdown
  2. You’ll see options for how the document is provided to the LLM
Document Format Options
Available Formats:
  • PDF: Original PDF file
    • Use for: Documents where layout is critical
    • Considerations: Larger file size, some LLMs have limited PDF support
  • Plain Text: Unformatted text extraction
    • Use for: Simple text-only documents
    • Considerations: Loses all formatting and layout information
  • Annotated Text ⭐ (Recommended)
    • Use for: Most document types
    • Considerations: Preserves structure while being text-based
    • Benefits: Best balance of structure and performance
  • Formatted Text: Text with basic formatting preserved
    • Use for: Documents where some formatting matters
    • Considerations: Middle ground between Plain and Annotated
  3. Select Annotated Text for best results
Note: Through testing, Annotated Text has been found to provide the most consistent and reliable results for extraction tasks. It preserves document structure while being efficiently processed by LLMs.

Step 9: Test Your Extraction

Run the activity on sample documents to verify results.

Run the Activity

  1. Close the Activity Editor
  2. Navigate to All Documents tab
  3. Select a test document
  4. Click Test Activity or Run button
Testing Activity
  5. Wait for the LLM to process the document
    • Processing time: typically 5-30 seconds depending on document complexity
    • You’ll see a loading indicator while waiting for the API response

Review Results

Once processing completes:
  1. The interface switches to Predictive view
  2. Review the Output panel showing extracted fields
  3. Click on each field to see:
    • Extracted value
    • Confidence (if provided)
    • Highlighted region on the document image
Reviewing Results
What to Check:
  • ✅ All expected fields are populated
  • ✅ Values match the document exactly
  • ✅ No hallucinated or inferred data
  • ✅ Proper handling of multiline fields
  • ✅ Missing fields are omitted (not filled with incorrect data)

Common Result Patterns

Successful Extraction:
{
  "Fields": [
    { "FieldName": "Vendor.Name", "Text": "ABC Corporation Ltd", "Line": 3 },
    { "FieldName": "Vendor.Address", "Text": "123 Business Street\nLondon SW1A 1AA", "Line": 5 },
    { "FieldName": "Vendor.IBAN", "Text": "GB29 NWBK 6016 1331 9268 19", "Line": 15 }
  ]
}
Partial Extraction (some fields missing):
{
  "Fields": [
    { "FieldName": "Vendor.Name", "Text": "ABC Corporation Ltd", "Line": 3 }
  ]
}
No Fields Found:
{
  "Fields": []
}
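Downstream logic should handle all three patterns explicitly rather than assuming a full result. A small illustrative sketch; the expected count of 7 matches this guide's vendor fields, and `summarize_result` is a hypothetical helper, not a Vantage function:

```python
import json

EXPECTED_COUNT = 7  # the seven vendor fields defined in this guide

def summarize_result(raw):
    """Classify an extraction response as full, partial, or empty."""
    fields = json.loads(raw).get("Fields", [])
    if not fields:
        return "empty: route to manual review"
    if len(fields) < EXPECTED_COUNT:
        return f"partial: {len(fields)}/{EXPECTED_COUNT} fields"
    return "full extraction"

print(summarize_result('{"Fields": []}'))  # empty: route to manual review
```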

Step 10: Refine Your Prompt

Iterate on your prompt based on test results.

Common Issues and Solutions

Issue: LLM extracts wrong field
  • Solution: Add more specific location hints
  • Example: “Vendor-side only; exclude customer/buyer addresses”
Issue: Formatting is changed
  • Solution: Emphasize verbatim extraction
  • Example: “Extract the numeric format exactly as printed (e.g., ‘12-34-56’)”
Issue: LLM invents data
  • Solution: Strengthen strictness rules
  • Example: “Never generate or infer values. Omit if not present.”
Issue: Multiline fields are concatenated
  • Solution: Specify escape sequences
  • Example: “For multiline values, use \n for new lines”
Issue: Incorrect field names in output
  • Solution: Verify field names match exactly
  • Example: Use Vendor.Account Number not AccountNumber
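The \n escape in particular is easy to misread: in the raw JSON it is a two-character escape sequence, which only becomes a real newline once the response is parsed. A quick demonstration:

```python
import json

# The LLM returns "\n" as a JSON escape sequence inside the string;
# json.loads turns it into a real newline character.
raw = ('{"Fields": [{"FieldName": "Vendor.Address", '
       '"Text": "123 Business Street\\nLondon SW1A 1AA", "Line": 5}]}')
address = json.loads(raw)["Fields"][0]["Text"]
print(address.split("\n"))  # ['123 Business Street', 'London SW1A 1AA']
```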

Iterative Improvement Process

  1. Test on multiple documents: Don’t optimize for a single example
  2. Document patterns: Note which rules work and which need refinement
  3. Add specific examples: Include format examples in parentheses
  4. Refine strictness: Adjust based on over/under-extraction patterns
  5. Test edge cases: Try documents with missing fields, unusual layouts

Example Refinements

Before:
VENDOR NAME
1) Extract the vendor name from the document.
After:
VENDOR NAME
1) Recognize names like "ABC Corporation", "XYZ Ltd", "Acme Inc.".
2) Extract the complete company name including legal suffixes (Ltd, Inc, GmbH, etc.).
3) Vendor name typically appears near the top of the document.
4) Exclude customer/buyer names - focus on the entity issuing the invoice.

Understanding the Extraction Process

How Prompt-Based Extraction Works

  1. Document Conversion: Your document is converted to the selected format (Annotated Text recommended)
  2. Prompt Assembly: Your role, output format, field rules, and strictness rules are combined
  3. API Call: The prompt and document are sent to the LLM via your connection
  4. LLM Processing: The LLM reads the document and extracts data according to your instructions
  5. JSON Response: The LLM returns structured data in the specified JSON format
  6. Field Mapping: Vantage maps the JSON response to your defined output fields
  7. Verification: Line numbers and confidence scores (if provided) help verify accuracy
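Conceptually, the prompt-assembly step just concatenates your prompt sections with the document text. Vantage performs this internally; the sketch below, with abbreviated section contents and a hypothetical `assemble_prompt` helper, only illustrates the idea:

```python
def assemble_prompt(role, output_format, field_rules, strictness, document_text):
    """Combine the prompt sections and the document into one request body.
    The real payload format depends on your LLM connection; this shows
    only the conceptual concatenation."""
    return "\n\n".join([
        "ROLE\n" + role,
        "OUTPUT FORMAT\n" + output_format,
        field_rules,
        "STRICTNESS\n" + strictness,
        "DOCUMENT\n" + document_text,
    ])

prompt = assemble_prompt(
    role="You are a data extraction model. ...",
    output_format='Return one valid JSON object: {"Fields": [...]}',
    field_rules="VENDOR NAME\n1) Recognize names like ...",
    strictness="- Never generate or infer values.",
    document_text="INVOICE\nABC Corporation Ltd\n...",
)
```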

Token Usage and Costs

Factors Affecting Cost:
  • Document length: Longer documents use more tokens
  • Prompt complexity: Detailed prompts increase token count
  • Format choice: Annotated Text is typically more efficient than PDF
  • Number of fields: More fields = longer prompts
Optimization Tips:
  • Use concise but clear language in prompts
  • Don’t duplicate instructions
  • Remove unnecessary examples
  • Consider field grouping for related data
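For a rough pre-flight cost estimate, a common rule of thumb is about four characters per token for English text; use your provider's own tokenizer for exact counts and pricing. An illustrative sketch under that assumption:

```python
def estimate_tokens(text):
    """Very rough heuristic: ~4 characters per token for English text.
    This is an approximation only; actual counts vary by model and
    tokenizer."""
    return max(1, len(text) // 4)

prompt_text = "ROLE\nYou are a data extraction model..."
document_text = "INVOICE\nABC Corporation Ltd\n..."
print(estimate_tokens(prompt_text + document_text), "tokens (approx.)")
```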

Best Practices

Prompt Writing

Do:
  • ✅ Use clear, imperative statements (“Extract”, “Recognize”, “Omit”)
  • ✅ Provide multiple label variations for each field
  • ✅ Include format examples in parentheses
  • ✅ Specify what NOT to extract (exclusions)
  • ✅ Number your rules for easy reference
  • ✅ Use consistent terminology throughout
Don’t:
  • ❌ Use vague instructions (“get the name”)
  • ❌ Assume the LLM knows domain-specific conventions
  • ❌ Write overly long, complex sentences
  • ❌ Contradict yourself in different sections
  • ❌ Skip strictness rules

Field Definitions

Effective Field Instructions:
  • Start with recognition patterns (alternative labels)
  • Specify exact format to preserve
  • Provide location hints (typical placement)
  • Define data ownership (vendor vs. customer)
  • Include handling for multiline values
  • Reference related fields to avoid confusion
Example:
IBAN
1) Recognize "IBAN", "International Bank Account Number".
2) Extract the full IBAN exactly as printed (include spaces).
3) Vendor-side only, typically under "Bankverbindung", "Payment Details".
4) Do NOT confuse with Account Number — IBAN is longer and alphanumeric.

Testing Strategy

  1. Start with simple documents: Test basic extraction first
  2. Expand to variations: Try different layouts and formats
  3. Test edge cases: Missing fields, unusual positions, multiple matches
  4. Document failures: Keep examples of where extraction fails
  5. Iterate systematically: Change one thing at a time

Performance Optimization

For Speed:
  • Keep prompts concise
  • Use Annotated Text format
  • Minimize number of fields per activity
  • Consider splitting complex documents
For Accuracy:
  • Provide comprehensive field rules
  • Include format examples
  • Add strong strictness rules
  • Test with diverse document samples
For Cost:
  • Optimize prompt length
  • Use efficient document formats
  • Cache results when appropriate
  • Monitor token usage via LLM provider dashboard

Troubleshooting

Extraction Issues

Problem: Fields are empty despite data being present
Solutions:
  • Check field name spelling matches exactly
  • Verify the data is in the selected document format
  • Add more label variations to recognition patterns
  • Reduce strictness temporarily to see if the LLM finds it
  • Check if document quality affects OCR/text extraction
Problem: LLM extracts customer data instead of vendor data
Solutions:
  • Strengthen vendor-side specifications
  • Add explicit exclusions for customer/buyer data
  • Provide location hints (e.g., “top of document”, “issuer section”)
  • Include examples of correct vs. incorrect extraction
Problem: Multiline values are concatenated or malformed
Solutions:
  • Explicitly specify escape sequence format (\n)
  • Provide examples of correct multiline output
  • Verify document format preserves line breaks
  • Add instruction: “Preserve original line breaks using \n”
Problem: LLM reformats or normalizes data
Solutions:
  • Emphasize “verbatim” and “exactly as printed”
  • Add strictness rule: “No normalization or inference”
  • Provide specific examples showing preservation of formatting
  • Include negative examples: “Not ‘12-34-56’; keep as ‘12 34 56’”

Performance Issues

Problem: Extraction is too slow
Solutions:
  • Switch to Annotated Text format if using PDF
  • Simplify prompt without losing critical instructions
  • Reduce document resolution if images are very large
  • Check LLM provider status and rate limits
  • Consider using a faster model for simple documents
Problem: Inconsistent results between runs
Solutions:
  • Strengthen strictness rules
  • Make instructions more specific and unambiguous
  • Add more format examples
  • Reduce prompt complexity that might lead to interpretation
  • Use a lower temperature setting (if available in the connection)
Problem: High API costs
Solutions:
  • Optimize prompt length
  • Use Annotated Text instead of PDF
  • Process documents in batches during off-peak
  • Consider using smaller/cheaper models for simple documents
  • Monitor and set budget alerts in LLM provider dashboard

Advanced Techniques

Conditional Extraction

You can instruct the LLM to extract certain fields only if conditions are met:
ACCOUNT NUMBER (CONDITIONAL)
1) Only extract if the document contains bank payment details.
2) If "Payment Method: Check" or similar appears, omit this field.
3) Recognize "Account Number", "Account No", "Acct #".

Multi-Language Support

Prompt-based extraction works well with multilingual documents:
VENDOR NAME (MULTI-LANGUAGE)
1) Recognize in English: "Vendor Name", "Supplier", "Seller"
2) Recognize in German: "Verkäufer", "Lieferant", "Anbieter"
3) Recognize in French: "Fournisseur", "Vendeur"
4) Extract the complete company name regardless of language.

Validation Rules

Add validation logic to your prompts:
IBAN (WITH VALIDATION)
1) Extract the full IBAN exactly as printed.
2) Verify it starts with a 2-letter country code.
3) If format doesn't match IBAN pattern, omit the field.
4) Do not invent check digits or country codes.
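The same validation can run in post-processing rather than relying on the LLM alone. A sketch using the standard ISO 13616 mod-97 check, stripping spaces first since the prompt preserves them verbatim; `looks_like_iban` is a hypothetical helper:

```python
import re

def looks_like_iban(value):
    """Format plus ISO 13616 mod-97 check. Spaces are stripped first,
    since the prompt preserves them verbatim."""
    compact = value.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return False
    # Move the first four characters to the end, map letters to numbers
    # (A=10 ... Z=35), then apply the mod-97 test.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1

print(looks_like_iban("GB29 NWBK 6016 1331 9268 19"))  # True
print(looks_like_iban("12-34-56"))                     # False
```

A value that fails this check can be flagged for review instead of being silently accepted.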

Field Relationships

Specify how fields relate to each other:
ACCOUNT NUMBER vs IBAN
- Account Number: Usually shorter, numeric, domestic format
- IBAN: Alphanumeric, starts with country code (e.g., "GB29 NWBK...")
- If both are present, extract both to separate fields
- If only one is present, extract to the appropriate field
- Do not duplicate the same value in both fields
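These distinctions can double as a post-processing sanity check. An illustrative routing heuristic; the length thresholds and the "unknown" fallback are assumptions for the sketch, not rules from this guide:

```python
import re

def route_bank_value(value):
    """Decide which field a bank identifier most likely belongs to.
    IBANs start with a country code and two check digits; domestic
    account numbers are shorter and purely numeric. Thresholds here
    are illustrative assumptions."""
    compact = value.replace(" ", "").replace("-", "").upper()
    if re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return "Vendor.IBAN"
    if compact.isdigit() and 7 <= len(compact) <= 12:
        return "Vendor.Account Number"
    return "unknown"  # free text or ambiguous codes: send to review

print(route_bank_value("GB29 NWBK 6016 1331 9268 19"))  # Vendor.IBAN
print(route_bank_value("500 105 17"))                   # Vendor.Account Number
```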

Limitations and Considerations

Current Capabilities

Supported:
  • ✅ Header-level field extraction
  • ✅ Single and multiline values
  • ✅ Multiple fields per document
  • ✅ Conditional extraction logic
  • ✅ Multi-language documents
  • ✅ Variable document layouts
Limited or Not Supported:
  • ⚠️ Table extraction (varies by implementation)
  • ⚠️ Nested complex structures
  • ⚠️ Very large documents (token limits)
  • ⚠️ Real-time processing (API latency)
  • ⚠️ Guaranteed deterministic results

When to Use Prompt-Based Extraction

Best For:
  • Documents with variable layouts
  • Semi-structured documents
  • Quick prototyping and testing
  • Small to medium document volumes
  • When training data is unavailable
  • Multi-language document processing
Consider Alternatives For:
  • High-volume production (traditional ML may be faster)
  • Highly structured forms (template-based extraction)
  • Cost-sensitive applications (traditional methods may be cheaper)
  • Latency-critical applications (LLM APIs have network delay)
  • Offline processing requirements (no internet needed for traditional methods)

Integration with Document Skills

Using Extracted Data

Once extraction is complete, the field data is available throughout your Document Skill:
  1. Validation Activities: Apply business rules to extracted values
  2. Script Activities: Process or transform extracted data
  3. Export Activities: Send data to external systems
  4. Review Interface: Manual verification of extracted fields

Combining with Other Activities

Prompt-based extraction can work alongside other activities:
Workflow Example:
1. Classification (identify document type)
2. OCR (extract text)
3. Prompt-based extraction (extract structured data)
4. Validation rules (verify data quality)
5. Script (format for export)
6. Output (deliver results)

Field Mapping

The extracted JSON fields automatically map to your defined output fields:
  • "FieldName": "Vendor.Name" → Maps to Output field Vendor.Name
  • Field hierarchy is preserved in the output structure
  • Line numbers help with verification and troubleshooting
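The dotted FieldName values fold naturally into that hierarchy. A small sketch reconstructing the nested structure from the flat JSON list; `to_nested` is a hypothetical helper for downstream processing, not a Vantage function:

```python
def to_nested(fields):
    """Fold flat "Group.Field" names into a nested dict that mirrors
    the Output hierarchy."""
    result = {}
    for f in fields:
        group, _, name = f["FieldName"].partition(".")
        result.setdefault(group, {})[name] = f["Text"]
    return result

fields = [
    {"FieldName": "Vendor.Name", "Text": "ABC Corporation Ltd", "Line": 3},
    {"FieldName": "Vendor.IBAN", "Text": "GB29 NWBK 6016 1331 9268 19", "Line": 15},
]
print(to_nested(fields))
# {'Vendor': {'Name': 'ABC Corporation Ltd', 'IBAN': 'GB29 NWBK 6016 1331 9268 19'}}
```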

Summary

You’ve successfully:
  • ✅ Created a prompt-based extraction activity
  • ✅ Configured an LLM connection
  • ✅ Written a comprehensive extraction prompt with role, format, and rules
  • ✅ Selected the optimal document format (Annotated Text)
  • ✅ Applied strictness rules for data quality
  • ✅ Tested extraction and reviewed results
  • ✅ Learned best practices for prompt engineering
Key Takeaways:
  • Prompt-based extraction uses natural language instructions
  • Annotated Text format provides best results
  • Clear, specific prompts yield consistent extraction
  • Strictness rules prevent hallucination and maintain data quality
  • Iterative testing and refinement improve accuracy
Your prompt-based extraction activity is now ready for document processing!

Next Steps

  1. Test with diverse documents: Validate across different layouts and variations
  2. Refine your prompts: Continuously improve based on results
  3. Monitor costs: Track token usage in your LLM provider dashboard
  4. Optimize performance: Fine-tune prompts for speed and accuracy
  5. Explore table extraction: Experiment with extracting line items (if supported)
  6. Integrate with workflows: Combine with other activities for complete processing


Frequently Asked Questions

Q: What’s the difference between prompt-based and traditional extraction?
A: Prompt-based uses LLM natural language instructions without training data. Traditional methods require training examples but are faster and more cost-effective at scale.

Q: Can I extract tables with prompt-based activities?
A: Header-level extraction is well-supported. Table extraction capabilities may vary and require specific prompt structures.

Q: Why use Annotated Text over PDF?
A: Annotated Text provides the best balance of structure preservation and processing efficiency. It’s been proven most reliable through testing.

Q: How do I reduce API costs?
A: Optimize prompt length, use Annotated Text format, process efficiently, and monitor token usage via your LLM provider’s dashboard.

Q: What if my LLM connection fails?
A: Check your connection status in Configuration → Connections. Test the connection, verify credentials, and ensure your API quota isn’t exceeded.

Q: Can I use multiple LLM connections in one skill?
A: Yes, different activities can use different connections. This allows you to use different models for different extraction tasks.

Q: How do I handle documents in multiple languages?
A: Add multi-language label variations to your field rules. LLMs generally handle multilingual content well.

Q: What’s the maximum document size?
A: This depends on your LLM provider’s token limits. Very long documents may need to be split or processed in sections.