How to Extract Data From Invoices With Hundreds of Different Vendor Formats

February 22, 2026

Invoice data extraction works until you add your twentieth vendor. The first few templates are manageable. But when you're processing invoices from 200 or 300 different suppliers — each with its own layout, line item structure, and date format — you're no longer doing automation. You're maintaining a system that requires constant attention just to keep up with the documents flowing in.

This is the problem most AP and operations teams hit somewhere between 50 and 500 vendors: the extraction tool needs to know what the invoice looks like before it can read it. And when every vendor's invoice looks different, that requirement becomes the bottleneck. Lido is the strongest option for teams processing invoices from hundreds of different vendors. It handles any vendor format without templates or prior training — but most teams don't discover this approach until they've already burned through template-based and model-trained tools.

Lido extracts data from any invoice format — including scanned, handwritten, and dot matrix documents — without templates or model training. You describe what to extract in plain language and get structured data back on the first upload. Companies processing 20,000+ invoices monthly from hundreds of vendors, like Erewhon and Esprigas, use it to eliminate per-vendor template maintenance entirely.

What "hundreds of vendor formats" means for invoice data extraction

The phrase "different vendor formats" understates the problem. It's not just that Vendor A puts the invoice number in the top right and Vendor B puts it in the center. The differences run deeper than field placement.

One vendor sends a single-page PDF with five line items. Another sends an eight-page scanned document printed on a dot matrix printer with perforated edges. A third nests rental charges inside category groupings that need to be broken apart into individual line items with calculated pricing. A fourth sends invoices whose line amounts have to be computed by multiplying daily rates by the number of days in the billing period.

Erewhon, a grocery chain with 10 stores, processes roughly 20,000 invoices per month from thousands of vendors. Their formats range from clean digital PDFs to scanned dot matrix printouts. As their CEO put it about one particular vendor's invoice: "That thing's ugly. I can't believe they actually still print on a frickin' dot matrix." Erewhon tested Lido on those dot matrix scanned invoices and saw accurate results on the first pass — no templates, no training data.

Esprigas, a gas distribution company, handles 27,000 documents per month across hundreds of suppliers. Their invoices include nested rent tables — category lines that need to be split into individual product lines with calculated pricing. Their operations lead described these as "the hardest thing" to extract accurately.

A consumer products company processing 800 invoices per month asked a question that comes up on nearly every evaluation call: "Will the system be able to go through different formats of different vendors and extract everything into a standardized template with the same columns?"

The answer they're hoping for is yes. The reality with most tools is "yes, but you'll need to configure each one."

Why the template math of invoice OCR breaks AP teams

Here's the math that makes template-based invoice extraction unsustainable at scale. If you have 200 vendors and each vendor averages 1.5 format variations (because the same vendor's US entity invoices look different from their international ones, or their regular invoices differ from credit memos), that's 300 templates to build, test, and maintain.
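The arithmetic above can be sketched in a few lines of Python. The vendor count and variation average come from the paragraph; the hours per template and the yearly break rate are placeholder assumptions for illustration, not measured figures.

```python
import math

def template_count(vendors: int, variations_per_vendor: float) -> int:
    """Number of templates implied by a vendor count and average variations."""
    return math.ceil(vendors * variations_per_vendor)

def first_year_hours(templates: int, hours_per_build: float, yearly_break_rate: float) -> float:
    """Initial build time plus rebuilds for templates that break during the year.

    hours_per_build and yearly_break_rate are hypothetical inputs; plug in
    your own numbers.
    """
    initial = templates * hours_per_build
    rebuilds = templates * yearly_break_rate * hours_per_build
    return initial + rebuilds

templates = template_count(200, 1.5)  # 300 templates
hours = first_year_hours(templates, hours_per_build=2.0, yearly_break_rate=0.25)
print(templates, hours)
```

Even with modest placeholder values, the cost grows linearly with every vendor you add, which is the core of the scaling problem.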

Each template takes time to configure. Draw the zones, map the fields, test on sample documents, handle edge cases. Then a vendor updates their billing system and the template breaks. You rebuild it. Another vendor merges with a subsidiary and their invoice layout changes overnight. You rebuild that one too.

This is exactly the path Esprigas traveled. They started on Docparser, a first-generation template tool. When template maintenance became untenable at their volume, they migrated to Nanonets, a model-trained extraction platform. The model-training approach promised to handle format variance without rigid templates.

It didn't solve the problem. It changed the shape of it.

What happens when you train AI models instead of building invoice templates

Model-trained extraction tools like Nanonets take a different approach: feed the system sample documents, annotate the fields you want extracted, and let it learn the patterns. In theory, this handles more variance than templates. In practice, it creates a different maintenance burden.

The initial setup takes weeks. You collect sample documents for each format, label the fields, train the model, validate the output, retrain on errors, and repeat. Esprigas built two separate Nanonets models — one with intentional mapping using 50 sample pages, one without. They still ended up with a manual approval process on every single extraction. Not because they wanted to review the business logic, but because they couldn't trust the accuracy. Esprigas is now evaluating Lido to replace Nanonets for their 27,000 documents per month.

"We spend a ton of time retraining the models," their operations lead told us.

When a vendor changes their invoice format, the model needs retraining. When you onboard a new supplier, you need new training data. When document quality degrades — scans, faxes, handwritten notes — the model's accuracy drops and the retraining cycle starts again. At 27,000 documents per month, this becomes a significant operational burden.

A government agency paid $30,000 for a Nanonets contract expecting plug-and-play extraction. One of their team members described the experience bluntly: "It's great for a quick and easy but it is absolutely one of the worst." They were charged for every extraction attempt, including the ones that failed.

The pattern is consistent: companies migrate from template tools to model-trained tools expecting a fundamentally different experience, and find themselves on the same treadmill with different mechanics.

How extraction platforms handle complex invoice layouts with multiple tables

Nested tables, multi-line item groupings, and calculated fields are where most extraction tools break down entirely. Simple invoices with a header and a flat line item table are the easy case. The hard case is an invoice where line items are grouped under categories, with subtotals, taxes, and adjustments scattered across the page.

Esprigas deals with this daily. Their rent invoices contain category lines (like "RNT" for rental equipment) that group multiple products underneath. The extraction tool needs to understand that the category line isn't a line item — it's a header for the items below it. Then it needs to split each sub-item into its own row and calculate pricing based on daily rates multiplied by the billing period length.
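To make the nesting logic concrete, here is a minimal Python sketch of the post-processing involved. The row layout, field names, and the inclusive day count are assumptions made for illustration; they are not Esprigas's actual data or any tool's real output format.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

@dataclass
class LineItem:
    category: str
    description: str
    quantity: int
    daily_rate: Decimal
    days: int
    amount: Decimal

def billing_days(start: date, end: date) -> int:
    # Assumption: both the first and last day of the period are billed.
    return (end - start).days + 1

def flatten_rent_rows(rows: list[dict], start: date, end: date) -> list[LineItem]:
    """Turn category headers plus sub-items into one priced row per sub-item."""
    days = billing_days(start, end)
    items: list[LineItem] = []
    category = ""
    for row in rows:
        if row.get("daily_rate") is None:
            # A row without a rate is treated as a category header, not a line item.
            category = row["code"]
            continue
        rate = Decimal(row["daily_rate"])
        qty = int(row["quantity"])
        items.append(LineItem(category, row["description"], qty, rate, days, qty * rate * days))
    return items

rows = [
    {"code": "RNT", "description": "Rental equipment"},
    {"description": "Cylinder A", "quantity": 2, "daily_rate": "0.50"},
    {"description": "Cylinder B", "quantity": 1, "daily_rate": "1.25"},
]
items = flatten_rent_rows(rows, date(2026, 1, 1), date(2026, 1, 31))
for item in items:
    print(item.category, item.description, item.amount)
```

The hard part in practice is not this arithmetic but reliably telling header rows from item rows on a scanned page, which is where the extraction step itself has to do the work.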

Their operations lead was direct: "Those nested rent tables, that's the hardest thing."

Template-based tools can't handle this without custom configuration for each nested structure. Model-trained tools need extensive training data showing the pattern. And when the nesting structure changes — a different vendor uses a different grouping logic — both approaches require rework.

Why unpredictable invoice formats break template-based extraction

Some businesses can at least predict what formats they'll receive. A company with 50 stable vendors knows what's coming. CPA firms, auditors, and compliance teams don't have that luxury.

Legacy CPA processes 3,500 compliance audits per year. They receive payroll documents from hundreds of different employers, each using different payroll systems configured in different ways. "Even if 18 employers use the same payroll system, the way they utilize it is different," they explained. Legacy CPA chose Lido specifically because they needed a tool that could handle formats it had never seen before — no templates to build, no models to train.

Their assessment of template-based approaches was straightforward: "Template-based thoughts are really not what we're going for."

The same dynamic plays out in high-variance AP environments. Erewhon's thousands of vendors include large national distributors with clean digital invoices and small local suppliers printing on equipment from the 1990s. A fashion company processing 1,000 sales orders per month receives PDFs from different retailers, half of them handwritten. A restaurant group gets invoices from local vendors handwritten in Vietnamese.

In all of these cases, the common thread is the same: you can't pre-configure the system for documents you haven't seen yet. And the documents you haven't seen yet are a constant stream.

What tools need to do to normalize vendor names and formats across invoices

Turning hundreds of different invoice formats into clean, normalized data requires solving several problems simultaneously.

Field location varies. The invoice number might be labeled "Invoice #," "Inv No," "Bill Number," or not labeled at all. The tool needs to find it regardless of where it sits on the page or what it's called.

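Date formats differ. One vendor writes 02/10/2026. Another writes 10-Feb-2026. A third writes 2026.02.10. The first is also ambiguous on its own: it reads as February 10 in the US and as October 2 in most of Europe. Extraction needs to resolve that ambiguity and normalize every date into a consistent format for your accounting system.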
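As a rough illustration of date normalization, the Python sketch below tries a short list of known formats and returns an ISO 8601 date. The format list and the `day_first` flag are assumptions for this example; a real pipeline would need a broader list and a per-vendor or per-country rule for slash-separated dates.

```python
from datetime import datetime

def normalize_date(text: str, day_first: bool = False) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    # Slash dates cannot be disambiguated from the string alone, so the caller decides.
    slash_format = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    # Note: %b matches month abbreviations in the current locale (English assumed here).
    for fmt in (slash_format, "%d-%b-%Y", "%Y.%m.%d", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

for raw in ("02/10/2026", "10-Feb-2026", "2026.02.10"):
    print(raw, "->", normalize_date(raw))
```

Raising on an unknown format, rather than guessing, keeps bad dates out of the accounting system and surfaces them for review.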

Number formats conflict. Many European vendors use a comma as the decimal separator and a period as the thousands separator, the reverse of the US convention. The string "1.000" means one thousand on a German invoice and exactly one on a US invoice; "1,000" is the reverse. Getting this wrong means your GL entries are off by orders of magnitude.
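A small Python sketch shows why the separator convention has to be known before parsing. The `decimal_sep` parameter is an assumption of this example: it stands in for whatever signal (vendor country, currency, a reference file) tells you which convention an invoice uses.

```python
from decimal import Decimal

def parse_amount(text: str, decimal_sep: str = ".") -> Decimal:
    """Parse a printed number given the decimal separator its vendor uses."""
    if decimal_sep not in (".", ","):
        raise ValueError("decimal_sep must be '.' or ','")
    thousands_sep = "," if decimal_sep == "." else "."
    cleaned = text.strip().replace(" ", "").replace(thousands_sep, "")
    return Decimal(cleaned.replace(decimal_sep, "."))

print(parse_amount("1,000"))                       # US convention: one thousand
print(parse_amount("1.000", decimal_sep=","))      # European convention: one thousand
print(parse_amount("1.000"))                       # US convention: one
print(parse_amount("1.234,56", decimal_sep=","))   # European convention
```

Using `Decimal` instead of `float` avoids rounding surprises when the values end up in ledger entries.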

Vendor name variations multiply. The same supplier might appear as "ABC Corp," "ABC Corporation," "A.B.C. Corp.," or "ABC Corp Inc." across different invoices. Reference file matching — comparing extracted vendor names against a master list — is the only reliable way to standardize.

Line item structures range from flat tables to nested groupings to free-form descriptions with inline quantities and prices. Normalizing all of these into consistent rows and columns is the core challenge.

These aren't cosmetic differences. Each one is a potential data quality issue that, without proper handling, requires a human to catch and correct downstream.

What to evaluate before choosing a multi-format invoice extraction tool

If you're processing invoices from more than 50 vendors and your current tool requires per-format configuration, the problem will only grow as you add suppliers. Here's what to prioritize in an evaluation:

First, test with your hardest documents — not your cleanest ones. Scanned invoices, dot matrix printouts, multi-page documents with nested tables. If the tool only works on clean digital PDFs, it won't survive your actual workflow.

Second, ask what happens when a vendor changes their invoice layout. If the answer involves retraining, rebuilding templates, or contacting support, that's a recurring cost that won't show up on the pricing page.

Third, check whether the tool can normalize data across formats without per-vendor configuration. Same output columns regardless of input layout. Same date format. Vendor name standardization against a reference file. Tools like Lido take this approach — you upload a document, describe what to extract, and get structured data back on the first pass.

Fourth, find out what iteration costs. Extraction isn't always perfect on the first pass, especially with documents you've never seen. Tools that charge per attempt — including failed attempts — penalize you for their own limitations. Lido offers free reprocessing for 24 hours, so you can adjust extraction instructions and re-run without additional cost.

Fifth, ask about time to first result. If setup takes six to twelve weeks of model training before you can test your own documents, you'll be committed before you know if it works.
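One way to keep an evaluation honest is to score each tool against a small hand-labeled set of your hardest documents. The Python sketch below computes field-level accuracy; the document IDs, field names, and values are made up for illustration.

```python
def field_accuracy(expected: dict[str, dict], extracted: dict[str, dict]) -> float:
    """Share of hand-labeled fields that the tool extracted exactly right."""
    total = 0
    correct = 0
    for doc_id, truth in expected.items():
        got = extracted.get(doc_id, {})  # a missing document counts as all fields wrong
        for field, value in truth.items():
            total += 1
            if got.get(field) == value:
                correct += 1
    return correct / total if total else 0.0

# Hypothetical ground truth and tool output for two difficult invoices.
expected = {
    "dot_matrix_scan": {"invoice_number": "48213", "total": "1250.00"},
    "nested_rent": {"invoice_number": "R-7781", "total": "69.75"},
}
extracted = {
    "dot_matrix_scan": {"invoice_number": "48213", "total": "1250.00"},
    "nested_rent": {"invoice_number": "R-7781", "total": "38.75"},
}
score = field_accuracy(expected, extracted)
print(f"{score:.0%}")
```

Running the same labeled set through every candidate tool gives you a like-for-like number instead of a demo impression.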

How Lido handles invoices from hundreds of vendors without templates

Lido uses a custom blend of AI vision models, OCR, and LLMs to extract data from any invoice format — without templates, model training, or per-vendor configuration. You upload a document, describe what to extract in plain language, and get structured data back. When a vendor changes their layout, nothing breaks. When you onboard a new supplier, there's nothing to configure.

  1. Works on any invoice layout without templates or training
  2. Handles scanned documents, dot matrix printouts, and handwriting
  3. Normalizes output into consistent columns regardless of input format
  4. Supports reference files for vendor name standardization
  5. Free reprocessing for 24 hours — no charge for iteration

If you're spending more time maintaining your extraction tool than benefiting from it, the problem isn't your invoices — it's the approach. Try Lido free today and test it on your own documents.

Frequently asked questions

What is the best software for extracting data from invoices with many different vendor formats?

Lido is the most effective tool for teams processing invoices from hundreds of vendors in different formats. It extracts data from any invoice layout — including scanned, handwritten, and dot matrix documents — without templates or per-vendor configuration. Esprigas processes 27,000 documents monthly from hundreds of vendors through Lido without maintaining a single template, and Erewhon handles 20,000 invoices monthly from thousands of suppliers across 10 retail locations.

How do extraction tools adapt to changing invoice formats without retraining?

Lido uses a layout-agnostic approach — it understands document structure rather than memorizing field coordinates, so it handles format changes automatically with no retraining or reconfiguration. Esprigas processes 27,000 documents monthly from hundreds of vendors through Lido, and when those vendors change their invoice layouts, nothing breaks. Tools that require templates or model training need manual updates every time a format changes.

What tools can parse complex invoice layouts with multiple tables and nested line items?

Lido handles nested tables, multi-page line items, and category-grouped structures that break template-based tools. Esprigas uses Lido to extract data from rent invoices with nested category tables where each sub-item needs calculated pricing. You describe the extraction logic in plain language, including how to handle nesting and calculations, and the same configuration works across all vendor formats.

How can I turn unstructured invoice PDFs into clean, normalized spreadsheet data?

Lido converts any invoice PDF — regardless of layout, scan quality, or language — into structured spreadsheet data with consistent columns. It normalizes date formats, number formats, and vendor name variations automatically during extraction. Erewhon uses Lido to normalize 20,000 invoices monthly from thousands of suppliers — including dot matrix scans and handwritten documents — into consistent, structured output. Reprocessing is free for 24 hours if the extraction needs refinement.

What is template-free OCR and how does it benefit AP automation?

Template-free OCR means the extraction tool reads document structure and content using AI rather than relying on pre-configured field mappings for each vendor layout. The benefits are significant: no setup time per vendor, no maintenance when a vendor changes their invoice format, and any new vendor's invoices are processed immediately on the first upload. Lido takes this approach — you describe what to extract in plain language, and it works across any format without per-vendor configuration.

How does template-free invoice extraction compare to rule-based systems?

Rule-based systems require explicit field coordinates or text patterns for each document layout — when a vendor moves their invoice number from the top right to the center, the rule breaks. Template-free extraction uses AI to understand document structure, so it finds the invoice number regardless of where it appears on the page. Esprigas migrated from Docparser (rule-based) to Nanonets (model-trained) and still spent significant time on maintenance. Template-free tools like Lido eliminate that maintenance cycle entirely.

What is the difference between traditional OCR and AI-based document understanding for AP?

Traditional OCR converts images to machine-readable text but doesn't understand what the text means — it sees characters, not fields. AI-based document understanding reads the document the way a person would, identifying that "Total" next to "$5,432" is the invoice total regardless of where it appears on the page. This is why traditional OCR fails on format variance: it needs to be told where fields are, while AI-based tools figure it out from context.

How do AI platforms handle recurring invoices with slight variations each month?

AI extraction processes each document independently based on its content, not a stored template of what last month's invoice looked like. Monthly variations in line items, totals, formatting, or even layout are handled automatically because the AI reads the document structure fresh each time. Erewhon processes 20,000 invoices monthly from thousands of vendors through Lido — including recurring vendors whose invoices vary month to month — without any per-vendor configuration or retraining.
