Key-value pair extraction is the process of identifying labeled fields in a document and mapping each label to its corresponding value. For example, extracting “Invoice Number” → “INV-4521” or “Patient Name” → “John Smith” from a scanned form. Modern document AI uses spatial relationships and semantic understanding to perform this mapping automatically, even when labels are implicit or document layouts vary between sources.
Every structured document contains key-value pairs. An invoice has “Invoice Date: March 15, 2026.” A medical form has “Patient DOB: _____.” A shipping label has “Weight: 12.4 kg.” The data is there, clearly organized for human readers. The challenge is teaching machines to read it the same way.
Key-value pair extraction is the foundation of OCR data extraction. Without it, OCR just gives you a wall of text. With it, you get structured records ready for your database, spreadsheet, or ERP system. Lido performs key-value extraction automatically on any document type, with no templates, no training, and no field mapping configuration. This article explains how the technology works, where it fails, and what to look for in a solution.
Key-value pair extraction identifies two elements in a document and creates a relationship between them:
The key (also called the label or field name) is the identifier: “Invoice Number,” “Ship To,” “Total Amount,” “Policy Holder.” It tells you what kind of data follows.
The value is the actual data: “INV-4521,” “123 Main St, Suite 400,” “$14,287.50,” “Jane Martinez.” It’s the information you want to capture.
The extraction system must identify both elements, determine which key belongs to which value (the association problem), and output structured pairs that downstream systems can consume. This sounds simple when keys and values sit side-by-side with a colon separator. It gets harder fast when you consider real documents: keys above values, keys to the left of values, implicit keys (a field with no visible label), and values that span multiple lines.
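The association problem can be illustrated with a toy nearest-key heuristic. The sketch below is an assumption for illustration only (real systems learn key-value association jointly from layout and text): it pairs each OCR value with the closest key that sits above it or to its left, using hypothetical bounding-box coordinates.

```python
# Minimal nearest-key association sketch. Assumes OCR has already produced
# tokens with (x, y) positions and that keys and values are pre-classified.
# Production systems learn this association; this only shows the geometry.

def associate(keys, values):
    """Pair each value with the closest key above it or to its left."""
    pairs = {}
    for vtext, vx, vy in values:
        best, best_dist = None, float("inf")
        for ktext, kx, ky in keys:
            # Only consider keys above, or left of, the value.
            if ky > vy or (ky == vy and kx > vx):
                continue
            dist = ((vx - kx) ** 2 + (vy - ky) ** 2) ** 0.5
            if dist < best_dist:
                best, best_dist = ktext, dist
        pairs[best] = vtext
    return pairs

keys = [("Invoice Number", 40, 100), ("Invoice Date", 40, 140)]
values = [("INV-4521", 180, 100), ("2026-03-15", 180, 140)]
print(associate(keys, values))
# {'Invoice Number': 'INV-4521', 'Invoice Date': '2026-03-15'}
```

This heuristic is exactly what breaks down in the multi-column and dense-form cases described below: the geometrically nearest key is not always the semantically correct one.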
Key-value extraction is distinct from intelligent OCR (which handles the text recognition step) and from table extraction (which handles grid-structured data). In practice, a complete document processing pipeline uses all three: OCR to read text, key-value extraction for header fields, and table extraction for line items.
Key-value pairs appear in virtually every business document. The format varies, but the pattern is consistent:
Invoices: Invoice Number → INV-4521, Invoice Date → 2026-03-15, Due Date → 2026-04-14, Vendor Name → Acme Corp, Subtotal → $12,450.00, Tax → $1,121.50, Total → $13,571.50
Medical forms: Patient Name → John Smith, DOB → 04/22/1985, Insurance ID → BC-9927441, Provider → Dr. Rachel Kim, Diagnosis Code → J06.9
Shipping documents: BOL Number → 7829341, Carrier → FedEx Freight, Weight → 2,400 lbs, Pieces → 12, Ship Date → 2026-03-18, Consignee → Western Distributing LLC
Tax forms: Employer Name → TechStart Inc, EIN → 82-1234567, Wages → $87,500.00, Federal Tax Withheld → $18,375.00
Contracts: Effective Date → January 1, 2026, Term → 24 months, Monthly Fee → $4,500, Auto-Renewal → Yes, Governing Law → State of Delaware
The challenge isn’t that these fields are hard to read. It’s that every vendor, form designer, and organization formats them differently. One invoice puts “Invoice #” in bold above the number. Another puts “Inv. No.:” to the left. A third uses “Reference” instead. The extraction system needs to handle all variations without per-template configuration.
Humans read key-value pairs effortlessly because we understand spatial relationships, context, and document conventions intuitively. Machines don’t. Here’s what makes this problem hard:
Spatial relationships are ambiguous. When a label sits at the top-left of a form and a value sits to its right, the association is clear. But what about a label that sits above a value? Or a label with multiple values below it? Or two columns of key-value pairs where left-column values are adjacent to right-column labels? The spatial proximity heuristic breaks down in multi-column layouts.
Implicit labels. Some documents have values without visible labels. A check has the dollar amount in a specific location, but there’s no “Amount:” label. Humans know what it is from the position and formatting. Bank statements might have account numbers in headers without explicit labeling.
Multi-line values. An address spans 2-4 lines. A description field might wrap across multiple lines. The system must know where one value ends and the next key begins, without relying on simple newline detection.
Label synonyms. “Invoice Number,” “Invoice #,” “Invoice No.,” “Inv. No,” “Bill Number,” “Reference Number” all mean the same thing. The system must normalize variants to a canonical field name.
Overlapping regions. In dense forms, the spatial boundary between one key-value pair and the next is ambiguous. A value might be directly adjacent to another field’s label, with no whitespace or line separator between them.
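The label-synonym problem in particular lends itself to a simple sketch. The synonym table below is hand-built and purely illustrative (a production system would learn these mappings at scale); it shows the shape of the normalization step that maps raw label variants to a canonical field name.

```python
import re

# Hypothetical synonym table for illustration; real systems learn these
# mappings rather than enumerating them by hand.
CANONICAL = {
    "invoice number": {"invoice #", "invoice no", "inv no",
                       "bill number", "reference number"},
    "invoice date": {"inv date", "bill date", "date of invoice"},
}

def normalize_label(raw: str) -> str:
    """Map a raw label like 'Inv. No.:' to its canonical field name."""
    cleaned = re.sub(r"[^a-z0-9# ]", "", raw.lower()).strip()
    for canonical, variants in CANONICAL.items():
        if cleaned == canonical or cleaned in variants:
            return canonical
    return cleaned  # unknown labels pass through unchanged

print(normalize_label("Inv. No.:"))  # → 'invoice number'
print(normalize_label("Invoice #"))  # → 'invoice number'
```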
Before AI-powered solutions, organizations used these methods. Many still do for high-volume, low-variation documents:
Rule-based extraction. Define explicit rules: “The invoice number is always on line 3, characters 15-25” or “The date follows the text ‘Date:’ and is in MM/DD/YYYY format.” Works perfectly for known templates. Breaks completely when a vendor changes their invoice layout or you onboard a new supplier.
Regular expressions (regex). Pattern matching for specific data formats: dates (`\d{2}/\d{2}/\d{4}`), currency (`\$[\d,]+\.\d{2}`), invoice numbers (`[A-Z]{2,3}-\d{4,8}`). Regex finds values reliably but can’t associate them with keys without positional context. Two dollar amounts on the same page? Regex alone can’t tell you which is the subtotal and which is the total.
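The two-dollar-amounts problem is easy to demonstrate. In this sketch (using the sample amounts from the invoice example above), a bare currency pattern finds every amount but loses the key association; only anchoring the pattern to label text restores it:

```python
import re

# A bare currency regex finds every amount, but by itself cannot say
# which one is the subtotal, tax, or total.
text = """Subtotal: $12,450.00
Tax: $1,121.50
Total: $13,571.50"""

amounts = re.findall(r"\$[\d,]+\.\d{2}", text)
print(amounts)  # ['$12,450.00', '$1,121.50', '$13,571.50'] -- but which is which?

# Anchoring the pattern to label text restores the key-value association:
labeled = dict(re.findall(r"(Subtotal|Tax|Total):\s*(\$[\d,]+\.\d{2})", text))
print(labeled["Total"])  # $13,571.50
```

The catch, of course, is that the anchored version is back to encoding per-template knowledge: it only works when the labels are spelled exactly this way.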
Coordinate-based (zonal OCR). Define bounding boxes on a template: “The invoice number is in the rectangle from (420, 85) to (580, 105).” Extremely fast and accurate for fixed-format documents. Useless when documents have variable layouts, different page sizes, or shifted content positions.
| Approach | Accuracy (known templates) | Accuracy (new templates) | Setup effort | Maintenance |
|---|---|---|---|---|
| Rule-based | 95-99% | 0-20% | High (per template) | High |
| Regex | 85-95% | 40-60% | Medium | Medium |
| Coordinate/Zonal | 98-99% | 0% | High (per template) | Low (if stable) |
| AI/Layout-aware | 92-98% | 85-95% | Low | Low |
The fundamental limitation of traditional approaches is that they encode human knowledge about specific documents into rigid rules. Every new document format requires new rules. Scale the number of document types to 50, 100, or 500, and maintenance becomes a full-time job.
Modern document AI treats key-value extraction as a machine learning problem. These systems train models to understand document structure the way humans do, rather than following explicit rules:
Layout-aware language models (LayoutLM family). These models jointly encode text content, spatial position (x/y coordinates), and visual features (font size, bolding, lines). By learning the relationship between position and semantics across millions of documents, they can identify key-value pairs in documents they’ve never seen before. LayoutLMv3 and similar architectures are the current state of the art for structured extraction.
Large multimodal models (GPT-4V, Claude, Gemini). Vision-language models can look at a document image and extract key-value pairs through visual understanding. They perform well on diverse layouts because they’ve been trained on enormous corpora of document images. Tradeoffs: higher latency, higher cost per page, and occasional hallucination of values that don’t exist in the document.
Cloud Document AI services (Google Document AI, AWS Textract, Azure Form Recognizer). Pre-built APIs that combine OCR with key-value extraction in a single call. They work out-of-the-box for common document types (invoices, receipts, IDs) and can be fine-tuned on custom formats. Good baseline accuracy, but fine-tuning requires labeled training data.
Hybrid approaches. The best production systems combine multiple methods. Use AI for initial extraction, then apply rule-based validation on the output. If a date field doesn’t match a date format, flag it. If an invoice total doesn’t match the sum of line items, send for review. AI handles the variability; rules enforce data quality. Learn more about how AI data extraction works in practice.
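A minimal sketch of the validation half of a hybrid pipeline might look like the following. Field names, formats, and the 1-cent tolerance are illustrative assumptions, not a prescribed schema:

```python
from datetime import datetime

# Rule-based validation applied after AI extraction: the model handles
# layout variability, the rules enforce data quality. Field names and
# the expected date format are assumptions for this sketch.
def validate(extracted: dict) -> list[str]:
    problems = []
    try:
        datetime.strptime(extracted["invoice_date"], "%Y-%m-%d")
    except (KeyError, ValueError):
        problems.append("invoice_date missing or not YYYY-MM-DD")
    line_sum = sum(item["amount"] for item in extracted.get("line_items", []))
    if abs(line_sum - extracted.get("subtotal", 0)) > 0.01:
        problems.append("line items do not sum to subtotal")
    return problems  # non-empty -> route the document to human review

doc = {
    "invoice_date": "2026-03-15",
    "subtotal": 12450.00,
    "line_items": [{"amount": 8000.00}, {"amount": 4450.00}],
}
print(validate(doc))  # [] -- clean, flows through automatically
```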
Measuring extraction quality requires field-level evaluation, not just overall document accuracy. A system that correctly extracts 9 of 10 fields sounds like 90% accuracy. But if the missed field is always the invoice total, it’s functionally useless.
Field-level accuracy. For each key (invoice_number, date, total, vendor_name, etc.), what percentage of documents have the correct extracted value? This reveals which fields your system handles well and which need work.
Precision per key. Of all values the system extracted for a given field, what percentage are correct? Low precision means the system is confidently returning wrong values. That’s worse than returning nothing.
Recall per key. Of all documents that contain a given field, what percentage did the system successfully extract? Low recall means the system misses fields. Better than wrong values, but still problematic for automation.
Character-level accuracy. For partially correct extractions, how close was the output? Extracting “$14,287.5” instead of “$14,287.50” is a character-level error that might or might not matter downstream. This metric helps distinguish near-misses from complete failures.
Straight-through processing rate. The percentage of documents processed without human intervention. This is the metric that determines ROI: if 85% of documents flow through automatically and 15% need review, that’s the number that hits your bottom line.
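Per-key precision and recall fall out of a simple tally over (predicted, actual) pairs. This sketch uses invented results for a single field to show the arithmetic; `None` marks a field the system did not return (for precision) or that the document does not contain (for recall):

```python
# Field-level precision and recall from hypothetical per-document results
# for one key. Each entry is (what the system extracted, ground truth);
# None means absent on that side.

def field_metrics(results):
    extracted = [r for r in results if r[0] is not None]
    present = [r for r in results if r[1] is not None]
    correct = [r for r in results if r[0] is not None and r[0] == r[1]]
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(present) if present else 0.0
    return precision, recall

# invoice_number across 5 documents:
results = [
    ("INV-4521", "INV-4521"),  # correct
    ("INV-7702", "INV-7702"),  # correct
    ("INV-1O34", "INV-1034"),  # wrong (OCR confused O and 0)
    (None, "INV-8810"),        # missed
    ("INV-2218", "INV-2218"),  # correct
]
precision, recall = field_metrics(results)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.75 recall=0.60
```

Run per key, this is exactly the breakdown that exposes a system that nails vendor names but silently misses totals.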
Understanding where extraction systems fail helps you design better validation and exception handling:
Missing labels. Checks, some invoices, and government forms have fields in known positions without text labels. The model must infer field identity from position and value format alone. Solution: train on position-aware features, not just text proximity.
Merged fields. “Bill To / Ship To” as a combined header with two address blocks beneath it. Which address belongs to which key? Or “Name/Title” as a single field label for two distinct values. These require understanding of form structure beyond simple pair matching.
Tables vs. key-value pairs. A recurring source of confusion: when does a two-column layout become a table? Line items in an invoice are tabular data, not key-value pairs, even though they superficially look like “Item: Widget, Qty: 5, Price: $10.” Mislabeling tables as key-value pairs (or vice versa) produces incorrect output structure.
Handwritten values. Printed labels with handwritten fill-in values (common in medical forms, inspection checklists). The OCR step itself introduces errors, and the handwriting recognition confidence varies widely by individual penmanship.
Low-contrast or degraded scans. Faded text, coffee stains, partial page scans, low-DPI images. The OCR layer struggles with all of these, and extraction accuracy degrades proportionally. No amount of extraction intelligence compensates for unreadable source text.
Multi-page key-value pairs. A value that begins on one page and continues on the next (long descriptions, multi-line addresses that cross page boundaries). Most systems treat pages independently, losing continuity.
Lido takes a different approach from template-based tools. Instead of requiring you to define extraction rules for each document type, Lido uses semantic field identification to understand documents the way a human would:
Zero configuration. Upload a document or send it via API. Lido identifies key-value pairs automatically. No template creation, no field mapping, no training samples. The system understands that “Invoice #,” “Invoice Number,” and “Ref No.” all map to the same canonical field.
Semantic understanding. Rather than relying solely on spatial proximity, Lido’s models understand what fields mean. They know that a number following a date label should be a date, that a dollar amount near “Total” is likely the document total, and that an address block belongs to whichever entity label (“Bill To,” “Ship To,” “Sold To”) is nearest.
Confidence scoring. Every extracted value includes a confidence score. Low-confidence extractions route to human review queues rather than flowing through automatically. This gives you control over the accuracy-automation tradeoff: set confidence thresholds higher for sensitive fields (amounts, account numbers) and lower for informational fields (descriptions, notes).
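The routing logic this enables can be sketched in a few lines. The field names and threshold values below are assumptions chosen to illustrate the accuracy-automation tradeoff, not Lido's actual configuration:

```python
# Confidence-threshold routing sketch. Per-field thresholds are illustrative:
# sensitive fields (amounts, account numbers) get stricter cutoffs than
# informational ones.
THRESHOLDS = {"total": 0.98, "account_number": 0.98, "notes": 0.70}
DEFAULT_THRESHOLD = 0.90

def route(extractions: dict) -> dict:
    """Split extracted fields into auto-accepted vs. human-review buckets."""
    auto, review = {}, {}
    for field, (value, confidence) in extractions.items():
        threshold = THRESHOLDS.get(field, DEFAULT_THRESHOLD)
        (auto if confidence >= threshold else review)[field] = value
    return {"auto": auto, "review": review}

result = route({
    "total": ("$13,571.50", 0.99),           # above 0.98 -> auto
    "notes": ("Net 30", 0.75),               # above 0.70 -> auto
    "account_number": ("BC-9927441", 0.91),  # below 0.98 -> review
})
print(result["review"])  # {'account_number': 'BC-9927441'}
```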
Works across document types. The same system handles invoices, forms, receipts, shipping documents, and contracts without switching modes or loading different templates. Feed it a PDF and it figures out what the document is, what fields are present, and what values to extract. Read more about extracting data from any PDF format.
Continuous improvement. When users correct extraction errors, the system learns. Field accuracy improves over time for your specific document mix without manual retraining or rule updates.
Key-value pair extraction is the automated process of identifying labeled fields (keys) in a document and capturing their associated data (values). For example, on an invoice, “Invoice Number” is the key and “INV-4521” is the value. The system detects these relationships using spatial positioning, text proximity, and semantic understanding, then outputs structured data pairs that can be loaded into databases, spreadsheets, or business systems. It works on both digital and scanned documents, handling printed forms, typed documents, and mixed-format files.
AI-based extraction uses layout-aware models that simultaneously process text content, spatial coordinates, and visual features of a document. These models are trained on millions of documents to learn patterns: bold text followed by regular text often indicates a key-value pair, text in specific page positions corresponds to specific field types, and certain value formats (dates, currencies, IDs) provide context clues. The model predicts which text spans are keys, which are values, and which key-value associations are correct—all without explicit rules or per-template configuration.
Key-value extraction captures single-instance fields (one label maps to one value), like “Invoice Date: March 15” or “Customer ID: 44821.” Table extraction captures repeating row-column data—like invoice line items with columns for description, quantity, unit price, and total across multiple rows. A typical invoice requires both: key-value extraction for header fields (vendor, date, invoice number, totals) and table extraction for line items. Misapplying one method where the other belongs produces incorrect data structures.
Accuracy depends on document quality and system sophistication. AI-powered tools achieve 92-98% field-level accuracy on well-scanned documents with standard layouts. For previously unseen document formats, accuracy typically ranges from 85-95%—significantly higher than template-based systems that score near zero on unfamiliar templates. Factors that reduce accuracy include poor scan quality (under 200 DPI), handwritten values, unusual layouts, and documents with implicit or missing labels. Most production systems target 90%+ straight-through processing with human review for low-confidence extractions.
Documents with clearly labeled fields and consistent spatial relationships between labels and values yield the highest accuracy. Invoices, purchase orders, insurance forms, tax documents, and government applications are ideal candidates because they follow predictable conventions. Documents with explicit labels (text like “Name:” or “Date:” preceding values), clear visual separation between fields, and machine-printed text perform best. Documents that present challenges include handwritten notes, free-form letters without labeled fields, and heavily degraded scans where text is partially illegible.