How to Extract Line Items, Tax Details, and Custom Fields From Any Invoice

February 22, 2026

Most invoice extraction tools work fine until you need more than the header. Pulling vendor name, invoice date, and total amount is table stakes — every OCR tool on the market can handle that. The real challenge starts when you need line-item descriptions, per-item quantities, tax breakdowns applied to specific items, or custom fields that don't exist in the tool's default schema. That's where most extraction tools fall apart, and where finance teams end up back in spreadsheets, keying data by hand.

The gap between header-level extraction and true line-item extraction is enormous. Header fields sit in predictable locations at the top of a document. Line items live inside tables — tables that span pages, nest sub-items, merge cells, and follow formatting rules that vary by vendor. And the deeper you go into the data, the harder it gets.

Lido is the best option for teams that need line-item extraction with business logic built into the extraction pipeline. It uses computed columns, conditional logic, and plain-language instructions not just to extract what's on the page but to apply the business rules that make the data useful. But the range of difficulty across real-world invoices is wide, and understanding where your documents fall on that spectrum matters more than which tool you pick first.

Lido extracts line items, tax breakdowns, and custom fields from any invoice format using plain-language instructions, computed columns, and conditional logic. It handles nested tables, multi-page line items, and business rules like conditional tax calculations — without templates or custom code. Kei Concepts uses it to extract line items with conditional tax logic from handwritten Vietnamese invoices across 13 restaurant locations.

How can I extract line-level purchase descriptions and quantities from invoices?

Most extraction tools can pull line-item descriptions and quantities from clean, digital invoices with simple table structures. The problem is that real invoices rarely look like that. Tables span multiple pages. Rows nest under category headers. Cells merge across columns. And the moment the structure gets complex, most tools return incomplete data or map values to the wrong fields.

A gas distribution company processing over 20,000 invoices a month ran into exactly this. Their rent invoices from Linde contained nested tables where each category line (marked with RNT unit numbers) needed to be split into individual product lines with calculated pricing. "Those nested rent tables, that's the hardest thing," their operations lead told us. Their previous extraction tool couldn't parse them at all. Lido resolved the nested rent table extraction during a single demo session — the document type that had been a dead end with their prior tool.

The issue isn't limited to nesting. A construction company extracting bills of materials from multi-page engineering drawings needed the same item consolidated when it appeared across different pages, with quantities summed. A single fitting might show up on page 3, page 7, and page 14. Each instance needed to be identified, matched, and combined into one row with the total count. Their team put it plainly: "It's not necessarily uploading the document and having it do its thing. It's tailoring it from that point."
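To make the consolidation step concrete, here is a minimal Python sketch. It is not Lido's implementation. It assumes the rows have already been extracted into dictionaries, and the field names (`page`, `item`, `qty`), the sample items, and the matching rule (same text after ignoring case and extra spaces) are all hypothetical.

```python
from collections import defaultdict


def consolidate(rows):
    """Sum quantities for rows whose item text matches after normalization."""
    totals = defaultdict(int)
    pages = defaultdict(list)
    for row in rows:
        # Assumed matching rule: ignore case and repeated whitespace.
        key = " ".join(row["item"].upper().split())
        totals[key] += row["qty"]
        pages[key].append(row["page"])
    return [{"item": k, "qty": totals[k], "pages": pages[k]} for k in totals]


# Hypothetical rows: the same fitting appears on pages 3, 7, and 14.
rows = [
    {"page": 3, "item": '2" elbow fitting', "qty": 4},
    {"page": 7, "item": '2" Elbow  Fitting', "qty": 6},
    {"page": 14, "item": '2" ELBOW FITTING', "qty": 2},
    {"page": 7, "item": "gate valve", "qty": 1},
]
consolidated = consolidate(rows)
```

Real drawings usually need fuzzier matching than this (part numbers, abbreviations, typos), which is exactly the tailoring the team describes.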

This is the line-item extraction problem most tools don't advertise: it's not about whether they can read a table. It's about whether they can read your tables.

What tools support custom data fields when extracting invoice information?

You can extract line items from any invoice, but if you can't define your own fields, you're limited to whatever the tool decided matters. Most invoice extraction platforms ship with a fixed schema — invoice number, date, vendor name, total, maybe a basic line-item table. Custom fields, if supported at all, are typically limited to 5 or 10 predefined slots.

This matters because real invoice processing requires fields that no default schema anticipates. An IT services company in Australia needed to extract multiple serial numbers per line item from their supplier invoices, with each serial number generating its own output row. That's not a standard field. It's not even a standard structure — one input row becomes many output rows. Their team noted they were "quite impressed with the instructions" once they could define this behavior in plain language rather than rigid templates.
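The one-row-in, many-rows-out shape is easy to picture in code. The sketch below is only an illustration of the output structure, not how Lido does it. It assumes each extracted line item already carries a list of serial numbers, and the field names and sample values are made up.

```python
def explode_serials(line_items):
    """Emit one output row per serial number on each line item."""
    out = []
    for item in line_items:
        # Items with no serial numbers still produce a single row.
        serials = item["serials"] or [None]
        for serial in serials:
            out.append(
                {
                    "description": item["description"],
                    "unit_price": item["unit_price"],
                    "serial": serial,
                }
            )
    return out


# Hypothetical supplier invoice with two line items.
line_items = [
    {"description": "Laptop", "unit_price": 1500, "serials": ["SN-A1", "SN-A2", "SN-A3"]},
    {"description": "Setup fee", "unit_price": 200, "serials": []},
]
exploded = explode_serials(line_items)
```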

A fashion company processing 1,000 sales orders a month needed computed fields that don't exist on the source document at all. Their POs from retailers like Ross arrive with a total quantity — say 900 units — but no size breakdown. The team has to look up a separate reference table to split that into S, M, L, and XL quantities based on percentage ratios. The calculated size-level quantities need to appear in the extracted output even though they never appear on the invoice itself. Lido's computed columns and reference table integration handle this — the size split is calculated automatically during extraction rather than patched together in a spreadsheet afterward.
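As a rough sketch of what such a computed field does, the Python below splits a total quantity by percentage ratios. The ratios are invented for illustration (the source does not give the real ones), the function name is hypothetical, and the largest-remainder rounding is one reasonable choice so the sizes always add back up to the PO total.

```python
def split_by_ratio(total, percents):
    """Split an integer total by whole-number percentages.

    Uses largest-remainder rounding so the parts always sum to the total.
    """
    if sum(percents.values()) != 100:
        raise ValueError("percentages must sum to 100")
    parts = {size: total * pct // 100 for size, pct in percents.items()}
    remainders = {size: total * pct % 100 for size, pct in percents.items()}
    leftover = total - sum(parts.values())
    # Hand out the units lost to rounding, largest remainder first.
    for size in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        parts[size] += 1
    return parts


# Hypothetical reference-table row for one retailer.
ratios = {"S": 20, "M": 35, "L": 30, "XL": 15}
split_900 = split_by_ratio(900, ratios)
split_10 = split_by_ratio(10, ratios)
```

With these assumed ratios, a 900-unit PO becomes 180 S, 315 M, 270 L, and 135 XL.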

The hierarchy of invoice data extraction difficulty

Not all extraction is created equal, and understanding where your documents fall on the difficulty spectrum explains why your current tool might handle some invoices perfectly and others not at all.

Level 1: Header fields. Invoice number, date, vendor name, total amount. These sit in consistent locations and use predictable labels. Nearly every OCR tool handles these reliably.

Level 2: Simple line items. Description, quantity, unit price, line total. When the table is clean and single-page, most modern tools get this right.

Level 3: Complex tables. Nested structures, multi-page tables, merged cells, category headers mixed with data rows. This is where most tools start failing. The gas distribution company's nested rent tables fall here.

Level 4: Business logic. Tax calculations applied conditionally, size breakdowns computed from reference tables, unit conversions. Almost no extraction tool handles this natively.

Level 5: Cross-document logic. Matching extracted data against reference files, deduplicating items across pages, PO matching. This requires an entirely different approach than document-level extraction.

Most tools market themselves based on how well they handle the first two levels. But most AP teams live in levels three through five. Lido handles levels three through five using computed columns, conditional extraction, and reference table integration — the capabilities that separate reading a document from understanding the business logic behind it.

How do AI-based tools handle invoices with multiple taxes and fees?

Tax extraction sounds simple until you see how taxes actually work on real invoices. It's rarely a single tax rate applied to a subtotal. Many invoices apply different tax rates to different line items, include multiple tax jurisdictions, or calculate taxes based on item-level flags that aren't obvious to a machine reading the document.

A restaurant group processing around 4,000 pages per week across 13 companies encountered this with their local vendor invoices. Their suppliers, many of them small, local businesses writing invoices by hand in Vietnamese, mark individual items with a "T" to indicate they're taxable. The sales tax percentage applies only to those flagged items. Getting the tax calculation right means reading the "T" flag on each line, identifying the tax rate, and applying it selectively. Before the team found Lido, "Accounting is like 'This doesn't match'" was a constant refrain. Lido handles the conditional tax logic by reading the "T" flags and applying the tax calculation selectively during extraction.
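The rule itself is simple once the flags are read correctly. Here is a minimal Python sketch of it, assuming the lines are already extracted. The field names, the sample items, and the 7.75% rate are all hypothetical, and this is an illustration of the logic rather than Lido's code.

```python
from decimal import Decimal, ROUND_HALF_UP


def tax_on_flagged(lines, rate):
    """Apply the tax rate only to lines flagged "T"; return (taxable base, tax)."""
    taxable = sum(
        (line["amount"] for line in lines if line["flag"] == "T"),
        Decimal("0"),
    )
    tax = (taxable * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return taxable, tax


# Hypothetical handwritten invoice: two taxable items, one exempt.
lines = [
    {"description": "napkins", "amount": Decimal("40.00"), "flag": "T"},
    {"description": "rice", "amount": Decimal("120.00"), "flag": ""},
    {"description": "detergent", "amount": Decimal("25.50"), "flag": "T"},
]
taxable, tax = tax_on_flagged(lines, Decimal("0.0775"))  # assumed rate
```

Here the taxable base is 65.50 rather than the 185.50 subtotal, which is the kind of difference that produces "This doesn't match" when a tool taxes everything or nothing.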

This is the gap between extracting a tax amount printed on the page and understanding the tax logic behind it. The former is OCR. The latter is business logic, and it requires the extraction tool to interpret relationships between fields, not just read them.

Multi-tax invoices add another layer. When an invoice includes state tax, county tax, and a regulatory surcharge, each applied to different subsets of line items, the extraction tool needs to parse which tax applies to which line and output the breakdown correctly. Most tools flatten this into a single tax field. That might be acceptable for expense reporting, but it won't pass an AP audit.
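A per-line breakdown might look like the following Python sketch. The tax names, rates, and the `applies` field are assumptions made for illustration. The point is only that each line keeps its own tax detail and that totals per tax type are derived from it, instead of collapsing everything into one number.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical rates; real ones come from the invoice or the jurisdiction.
RATES = {
    "state": Decimal("0.06"),
    "county": Decimal("0.01"),
    "surcharge": Decimal("0.02"),
}
CENT = Decimal("0.01")


def tax_breakdown(lines):
    """Attach a per-tax amount to each line for the taxes that apply to it."""
    out = []
    for line in lines:
        taxes = {
            name: (line["amount"] * RATES[name]).quantize(CENT, rounding=ROUND_HALF_UP)
            for name in line["applies"]
        }
        out.append({**line, "taxes": taxes})
    return out


def totals_by_tax(rows):
    """Sum the per-line amounts into one total per tax type."""
    totals = {name: Decimal("0.00") for name in RATES}
    for row in rows:
        for name, amount in row["taxes"].items():
            totals[name] += amount
    return totals


lines = [
    {"description": "item A", "amount": Decimal("100.00"), "applies": ["state", "county"]},
    {"description": "item B", "amount": Decimal("50.00"), "applies": ["state", "surcharge"]},
    {"description": "item C", "amount": Decimal("20.00"), "applies": []},
]
breakdown = tax_breakdown(lines)
tax_totals = totals_by_tax(breakdown)
```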

How do invoice extraction platforms handle different date formats and number formats?

When you process invoices from vendors across states, countries, or just different accounting systems, formats diverge. Dates arrive as MM/DD/YYYY, DD/MM/YYYY, YYYY-MM-DD, or written out as "January 15, 2026." Number formats use periods or commas as decimal separators. Currency symbols change. Unit measurements toggle between imperial and metric.
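A small Python sketch shows why this is harder than it looks. A date like 01/02/2026 cannot be resolved from the text alone, so the sketch assumes you know per vendor whether the day comes first and whether the comma is the decimal separator. The function names and flags are hypothetical, and the written-out month format relies on an English locale.

```python
from datetime import datetime
from decimal import Decimal


def parse_date(text, day_first=False):
    """Return an ISO date string; day_first resolves the DD/MM vs MM/DD ambiguity."""
    slash_format = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    # "%B" matches full month names in the current locale (English assumed).
    for fmt in (slash_format, "%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def parse_amount(text, decimal_comma=False):
    """Strip currency symbols and normalize separators to a Decimal."""
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch in ",.-")
    if decimal_comma:
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)


us_date = parse_date("01/02/2026")
eu_date = parse_date("01/02/2026", day_first=True)
long_date = parse_date("January 15, 2026")
us_amount = parse_amount("$1,234.56")
eu_amount = parse_amount("1.234,56 €", decimal_comma=True)
```

The same string yields January 2 for one vendor and February 1 for another, so the format hint has to travel with the vendor, not with the document type.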

A construction company extracting materials from engineering drawings dealt with this at the measurement level. Quantities arrived in feet, inches, or a combined format like "10 foot 2 inches." Their downstream system needed everything in inches. The extraction tool had to not only read the measurement but convert it — recognizing the mixed format, parsing each component, and outputting a single value in the target unit.

This kind of format normalization happens silently in manual data entry — a human reads "10 foot 2 inches" and types "122" without thinking about it. But when you automate extraction, every format inconsistency becomes a potential data error unless the tool can interpret and normalize on the fly.
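What the human does without thinking can be written down as a small parser. The Python below is a hedged sketch of that conversion, not the construction company's actual setup. The accepted unit spellings are an assumption, and real drawings will contain formats this pattern does not cover.

```python
import re

# Assumed spellings: foot/feet/ft/' for feet and inch/inches/in/" for inches.
MEASUREMENT = re.compile(
    r"^\s*(?:(?P<feet>\d+(?:\.\d+)?)\s*(?:foot|feet|ft|'))?\s*"
    r'(?:(?P<inches>\d+(?:\.\d+)?)\s*(?:inches|inch|in|"))?\s*$',
    re.IGNORECASE,
)


def to_inches(text):
    """Convert a feet, inches, or combined measurement into total inches."""
    match = MEASUREMENT.match(text)
    if not match or (match.group("feet") is None and match.group("inches") is None):
        raise ValueError(f"unrecognized measurement: {text!r}")
    feet = float(match.group("feet") or 0)
    inches = float(match.group("inches") or 0)
    return feet * 12 + inches


combined = to_inches("10 foot 2 inches")
feet_only = to_inches("5 ft")
inches_only = to_inches("18 in")
```

Raising an error on anything unrecognized is deliberate: a silent guess here is how a wrong quantity ends up in the downstream system.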

How do invoice OCR tools handle credit notes and adjustments?

Credit notes, debit adjustments, and return annotations add complexity that goes beyond standard line-item extraction. These documents modify previous transactions, which means the extraction tool needs to capture not just what's on the page but the relationship to prior invoices.

The restaurant group's managers regularly annotate invoices by hand — crossing out items, writing "return" next to lines, changing quantities. These handwritten modifications need to be captured as adjustments, not ignored as noise. For their previous extraction tool, a crossed-out line was either invisible or an error. For their accounting team, it was critical data.

Handling credit notes also means understanding negative amounts, return quantities, and reference invoice numbers that tie back to the original transaction. If your extraction tool treats every document as a standalone invoice, it will mishandle anything that references or modifies a prior one.
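One way to picture the linkage is a running balance keyed by the original invoice number. This Python sketch is an illustration under stated assumptions: the document `type` values, the `references` field, and the sample numbers are all hypothetical, and it treats a credit note total as a reduction whether the vendor printed it as positive or negative.

```python
from decimal import Decimal


def net_by_invoice(documents):
    """Net each credit note against the invoice it references."""
    balances = {}
    for doc in documents:
        if doc["type"] == "invoice":
            key, sign = doc["number"], 1
        elif doc["type"] == "credit_note":
            # A credit note is keyed by the invoice it modifies, not its own number.
            key, sign = doc["references"], -1
        else:
            raise ValueError(f"unknown document type: {doc['type']!r}")
        # abs() normalizes vendors that print credits as negative totals.
        balances[key] = balances.get(key, Decimal("0")) + sign * abs(doc["total"])
    return balances


documents = [
    {"type": "invoice", "number": "INV-100", "total": Decimal("500.00")},
    {"type": "credit_note", "number": "CN-7", "references": "INV-100", "total": Decimal("-75.00")},
    {"type": "invoice", "number": "INV-101", "total": Decimal("80.00")},
]
balances = net_by_invoice(documents)
```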

How Lido extracts invoice line items and applies business logic

Lido approaches line-item extraction differently from traditional OCR or template-based tools. Rather than mapping fields to fixed templates, Lido lets you describe what you need in plain language, then applies computed columns, conditional logic, and reference table lookups as part of the extraction itself. Tax calculations, unit conversions, size-split lookups, and cross-page deduplication all happen during extraction — not as manual post-processing in a spreadsheet. This is why the gas distribution company's nested rent tables, the fashion company's size-split POs, and the restaurant group's conditional tax invoices all work inside the same platform without custom code or per-document configuration.

What actually works for complex invoice line-item extraction

Solving the line-item extraction problem at the levels where most tools fail — complex tables, business logic, and cross-document operations — requires a fundamentally different approach from traditional OCR or template-based extraction.

First, the tool needs to understand document structure, not just text. Reading characters on a page is not the same as understanding that rows 3 through 7 are nested under a category header on row 2, or that a table continues on the next page with the same columns but no repeated header.
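The difference between text and structure shows up in a sketch like the one below. It assumes rows arrive in reading order across pages, and that a category header is any row with text but no quantity or price. That rule, the field names, and the sample rental items are hypothetical, and this is not a description of how Lido parses tables.

```python
def flatten_nested(rows):
    """Attach the most recent category header to each data row beneath it."""
    out = []
    category = None
    for row in rows:
        # Assumed rule: a row with no quantity and no price is a category header.
        if row["qty"] is None and row["price"] is None:
            category = row["text"]
            continue
        out.append(
            {
                "category": category,
                "description": row["text"],
                "qty": row["qty"],
                "price": row["price"],
            }
        )
    return out


rows = [
    {"text": "RNT-001 Cylinder rental", "qty": None, "price": None},
    {"text": "Oxygen cylinder", "qty": 4, "price": 12.5},
    {"text": "Argon cylinder", "qty": 2, "price": 15.0},
    # Page break here: the table continues with no repeated header.
    {"text": "Nitrogen cylinder", "qty": 1, "price": 11.0},
    {"text": "RNT-002 Bulk tank rental", "qty": None, "price": None},
    {"text": "Bulk tank", "qty": 1, "price": 300.0},
]
flat = flatten_nested(rows)
```

Because the category is carried forward, the row after the page break still lands under RNT-001 instead of becoming an orphan.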

Second, it needs to support business logic as part of the extraction pipeline. Tax calculations, unit conversions, computed fields, and conditional rules shouldn't be a post-processing step in Excel. If they're part of what you need from the document, they should be part of the extraction.

Third, it needs to handle cross-document relationships. When a 900-unit PO needs to be split by size using a reference table, or when duplicate items across 14 pages of engineering drawings need to be consolidated with summed quantities, the tool needs access to more context than a single page provides.

What to test before choosing an invoice data extraction tool

If you're evaluating extraction tools for line-item level data, test with your hardest documents, not your cleanest ones.

Nested tables. Find an invoice with sub-items grouped under categories, or a multi-page table where data continues across page breaks. Run it through the tool and check whether the hierarchy is preserved or flattened.

Conditional tax logic. Use an invoice where tax applies to some items but not others. Check whether the tool calculates per-line tax correctly or just pulls the total tax amount from the bottom of the page.

Custom fields. Try to extract a field that doesn't exist in the tool's default schema. If you can't define arbitrary fields — or if you're limited to a handful — you'll hit a wall as soon as your requirements go beyond the basics.

Computed values. Test whether the tool can generate values that aren't on the document — calculated columns, lookups from reference tables, unit conversions. If all it can do is read what's printed, you'll still need manual post-processing.

Multi-page consolidation. Upload a document where the same item appears on multiple pages. Check whether the tool can identify duplicates and sum quantities, or whether it just gives you redundant rows.

How Lido handles complex invoice extraction differently

Lido uses a custom blend of AI vision models, OCR, and LLMs to extract structured data from any document — including line-level details, nested tables, and custom fields — without templates or model training. You describe what you need in plain language, and the system interprets the document structure, applies business logic, and outputs clean, structured data.

Unlimited custom fields defined in plain language
Computed columns for calculations, lookups, and unit conversions
Conditional extraction logic (e.g., "apply tax only to items marked T")
Cross-page deduplication and quantity consolidation
Reference table integration for splitting, matching, and enrichment

When your extraction needs go beyond headers and into the line-item details that actually drive your accounting, the tool you use matters more than the tool you're sold.

Frequently asked questions

What is the best way to extract line-item details from AP invoices automatically?

Lido is the best option for teams that need line-item-level extraction with business logic. It handles nested tables, multi-page line items, and category-grouped structures using plain-language instructions — no templates or custom code. Esprigas uses Lido to extract line items from nested rent tables with calculated pricing across 27,000 documents monthly. You define the fields and logic you need, and the same configuration works across all vendor formats.

What tools support unlimited custom fields when extracting invoice data?

Lido supports unlimited custom fields defined in plain language. Unlike most extraction platforms that ship with a fixed schema of 5-10 predefined fields, Lido lets you define any field — including computed columns that calculate values not on the document itself. A fashion company uses Lido's computed columns and reference table integration to split total quantities into size-level breakdowns (S, M, L, XL) that never appear on the source PO.

How do AI-based tools handle invoices with multiple taxes and fees?

Lido handles conditional tax logic by reading tax flags on individual line items and applying tax calculations selectively during extraction. Kei Concepts uses Lido for invoices where some items are marked taxable with a "T" flag and others are tax-exempt — the tax rate applies only to flagged items. Most extraction tools flatten multiple taxes into a single field, but Lido preserves per-line tax breakdowns including state, county, and regulatory surcharges.

How do invoice extraction platforms handle different date and number formats?

Lido normalizes date formats (MM/DD/YYYY, DD/MM/YYYY, written-out dates), number formats (period vs. comma decimals), and unit measurements automatically during extraction. A construction company uses Lido to convert mixed measurement formats like "10 foot 2 inches" into standardized values for their downstream systems. The normalization happens as part of extraction, not as manual post-processing.

What tools allow custom rules or mappings on top of extracted invoice data?

Lido's AI column instructions act as custom rules — you write plain-language extraction logic like "apply tax only to items marked T" or "convert all measurements to inches," and the system applies them during extraction. No programming required. For computed fields that don't exist on the source document, the @EVALUATE_FORMULA directive lets you define calculations, lookups, and conditional logic as part of the extraction pipeline rather than as post-processing in a spreadsheet.

Which tools provide confidence scores for extracted invoice fields?

Lido provides field-level confidence scores on every extracted value, not just a document-level pass/fail. High-confidence extractions can flow straight to approval, while low-confidence fields get flagged for human review. Esprigas is building an auto-approval workflow around this — invoices where all required fields extract at high confidence are approved automatically, and only uncertain extractions route to their team for manual verification.

What role do confidence thresholds play in reviewing extracted invoice fields?

Confidence thresholds determine which extractions need human review and which can be auto-approved. A higher threshold means more extractions get flagged for review — fewer errors reach your accounting system, but your team reviews more documents. A lower threshold means faster throughput with less manual intervention, but more risk of incorrect values passing through. The right threshold depends on your tolerance for errors versus the cost of manual review at your volume.
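The tradeoff is easy to see in a few lines of Python. This is a generic sketch of threshold routing, not Lido's API. The field names, the confidence values, and the `route` function are hypothetical.

```python
def route(fields, required, threshold=0.90):
    """Return ("auto_approve", []) or ("review", [fields below the threshold])."""
    flagged = [
        name
        for name in required
        # A missing required field is treated the same as a low-confidence one.
        if name not in fields or fields[name]["confidence"] < threshold
    ]
    return ("review", flagged) if flagged else ("auto_approve", [])


# Hypothetical extraction result with per-field confidence scores.
invoice = {
    "vendor_name": {"value": "Acme Gas", "confidence": 0.99},
    "invoice_number": {"value": "INV-100", "confidence": 0.97},
    "total": {"value": "425.00", "confidence": 0.82},
}
required = ["vendor_name", "invoice_number", "total"]
strict = route(invoice, required, threshold=0.90)
lenient = route(invoice, required, threshold=0.80)
```

The same invoice goes to review at a 0.90 threshold because of the total, and is auto-approved at 0.80. Which outcome you want depends on what a wrong total costs you.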

What invoice fields are most commonly extracted for AP automation?

The most commonly extracted fields are vendor name, invoice number, invoice date, due date, line item descriptions, quantities, unit prices, subtotals, tax amounts, total amount, PO number, payment terms, and remit-to address. Most extraction tools handle the header-level fields in that list reliably. The difficulty increases with line-item details, conditional tax breakdowns, and custom fields, which is where Lido's AI columns and computed fields handle what fixed-schema tools cannot.
