Every extraction tool works on clean digital PDFs. The real test for scanned invoice data extraction is what happens when you feed it a faxed copy with dark edges, a phone photo taken under fluorescent lighting, or a dot matrix printout from a system that should have been retired in 1998. These are the documents that cause the most errors, the most manual rework, and the most frustration for finance and operations teams. And they're usually the majority of what actually lands on your desk.
Lido is the most effective option for teams processing scanned, faxed, and photographed invoices at volume. It uses a combination of AI vision models, OCR, and LLMs to read documents the way a person would — using visual context and language understanding rather than rigid character recognition. But most teams discover this approach only after failing with traditional OCR, template-based tools, or model-trained platforms that looked great on clean inputs and fell apart on real ones.
Lido handles handwritten invoices across languages, scanned documents at any quality level, and faxed or photographed inputs — without templates or model training. It provides field-level confidence scores, and reprocessing is free for 24 hours. Companies like Disney Trucking (360,000 handwritten pages/year) and Kei Concepts (handwritten Vietnamese invoices across 13 locations) use it for the documents their previous tools couldn't handle.
The gap between marketing demos and real-world document quality is where most extraction tools quietly fall apart. A tool that handles born-digital PDFs with perfect formatting tells you very little about how it will perform on the documents that actually cause problems in your workflow.
Scanned, faxed, and photographed invoices introduce a set of problems that clean digital files don't have. These problems compound each other, which is why error rates on degraded documents can be dramatically higher than on clean inputs.
Low resolution and compression artifacts. Scanned documents often have shadows, noise, and blurring that confuse character recognition. A "5" becomes an "S." A decimal point disappears. A faxed copy adds dark borders and smudging that further degrades legibility.
Skewed or rotated pages. If the document wasn't placed perfectly on the scanner — and it never is — field positions shift. Zone-based extraction tools that depend on exact coordinates miss entire sections or pull data from the wrong fields.
Handwriting. Most OCR tools have limited or no handwriting support. Yet handwritten invoices, delivery tickets, and annotations are common across industries from trucking to restaurants to construction.
Mixed content on a single page. A typed invoice with handwritten notes, crossed-out line items, or annotations like "return" next to a product creates ambiguity that traditional OCR can't resolve.
Dot matrix and thermal prints. Faded text, uneven spacing, and perforated edges produce characters that standard OCR struggles to distinguish. Leading zeros disappear. PO numbers become unreadable.
These aren't edge cases. For many businesses, degraded documents are the default.
A trucking company in the Midwest processes 360,000 pages of driver tickets per year through Lido. These tickets are handwritten — drivers filling in fields by hand after each delivery. Six full-time employees used to do nothing but manually enter this data. When they tested Lido on these handwritten tickets during a live demo, it "worked perfectly," according to their operations team. They'd seen demos go that well before, from tools that couldn't deliver at scale. This time, the results held in production.
A restaurant group managing 13 locations across Southern California deals with a different version of the same problem. Their local vendors send handwritten invoices, often in Vietnamese. Managers write notes directly on invoices — crossing out items and marking returns. Supermarket receipts are captured via phone camera, not clean scans. "The invoice format is very, very difficult," their accounting lead explained. Their previous extraction tools couldn't handle any of it. Lido extracted data from these documents successfully — handwritten Vietnamese text, crossed-out line items, phone photos and all.
A premium grocery chain with 10 stores and 20,000 invoices per month pulled up their worst document during a demo: an 8-page dot matrix scanned invoice with barely visible PO numbers and leading zeros. Their CEO picked it deliberately. "The harder the better. Let's do it." The PO numbers were so faint a human would struggle to read them.
A CPA firm processing tax documents for clients regularly receives scanned copies from small businesses, including handwritten records from Amish communities. The scan quality is so poor that their accountant had to re-scan documents on a better copier before their previous extraction tool could read them at all.
These are the documents that expose the difference between tools that work in a demo and tools that work in production.
Traditional OCR was designed for a specific use case: converting printed text on clean, well-lit, properly aligned pages into machine-readable characters. It does this reasonably well. The problem is that real-world documents rarely meet those conditions.
Zone-based extraction depends on precise positioning. If an invoice field is supposed to be at coordinates (x, y) on the page, a scan that's tilted 3 degrees or shifted half an inch puts those coordinates in the wrong place. The tool either extracts the wrong field or returns nothing.
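The size of that drift is easy to underestimate. Here's a back-of-the-envelope sketch — plain Python, no real OCR library, with page dimensions and field position chosen for illustration — of how far a fixed extraction zone lands from its target when a letter-size scan is tilted just 3 degrees:

```python
import math

# Letter-size page scanned at 300 DPI: 2550 x 3300 pixels.
# Assume an invoice field sits near the top-right corner (illustrative).
field_x, field_y = 2300, 300

def rotated_position(x, y, degrees, cx=2550 / 2, cy=3300 / 2):
    """Where a point lands after the page rotates around its center."""
    theta = math.radians(degrees)
    dx, dy = x - cx, y - cy
    rx = cx + dx * math.cos(theta) - dy * math.sin(theta)
    ry = cy + dx * math.sin(theta) + dy * math.cos(theta)
    return rx, ry

rx, ry = rotated_position(field_x, field_y, 3)
shift = math.hypot(rx - field_x, ry - field_y)
print(f"field drifted {shift:.0f} px")  # about 90 px — several characters wide
```

At 300 DPI that's roughly a third of an inch — more than enough to put a fixed zone over the wrong field, or over nothing at all.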
Template-trained models learn from samples of each document type. But a scan of a vendor's invoice looks different from the digital PDF version of that same invoice. The margins change. The resolution drops. The font rendering shifts. A model trained on the clean version may not recognize the scanned version as the same document.
Character-level OCR processes one character at a time without understanding context. When a faxed copy degrades the number "0" into something that could be "O" or "Q" or a smudge, traditional OCR guesses. On a field like invoice amount or PO number, a single wrong character cascades into downstream errors — mismatched payments, wrong GL codes, failed reconciliations.
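To make the contrast concrete, here's a toy sketch — not Lido's or any tool's actual pipeline — of what even minimal context buys: if a field is known to be numeric, an ambiguous character can be resolved toward a digit instead of guessed blindly.

```python
# Toy sketch: resolve OCR-confusable characters using field context.
# Knowing a field must be numeric turns a blind guess into a constrained one.
CONFUSABLE_TO_DIGIT = {"O": "0", "Q": "0", "S": "5", "I": "1", "l": "1", "B": "8"}

def normalize_numeric_field(raw: str) -> str:
    """Map letter look-alikes to digits when the field must be numeric."""
    return "".join(CONFUSABLE_TO_DIGIT.get(ch, ch) for ch in raw)

print(normalize_numeric_field("OO45S1"))  # → "004551"
```

Character-level OCR has no notion of "this field must be numeric," which is exactly why a single smudge becomes a wrong PO number downstream.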
This is why a government agency paid $30,000 for a Nanonets contract and watched it fail on their scanned documents during the demo itself. "They bombed the demo," as their project lead described it. The tool worked fine on clean inputs. But the agency's real documents — scanned images, handwritten notes from staff, degraded PDFs — were a different story entirely. The agency evaluated Lido as a replacement specifically because of its layout-agnostic approach to scanned and handwritten inputs.
Solving scanned invoice data extraction at scale requires more than better OCR. It requires a fundamentally different approach to understanding documents.
First, the tool needs to read documents the way a person does — using visual context, not just character shapes. A person looking at a faded dot matrix printout can figure out that the number at the top right is probably a PO number because of where it sits on the page and what label is next to it, even if the characters themselves are barely legible. An extraction tool that combines vision models with language understanding can do the same thing.
Second, it has to handle handwriting across languages. A Vietnamese handwritten invoice from a local vendor. A handwritten driver ticket from a delivery route. Handwritten annotations on a typed invoice marking returns or quantity changes. These aren't unusual inputs — for many businesses, they're the most common document type. If your extraction tool doesn't support handwriting, it doesn't support your actual workflow.
Third, it needs confidence scoring at the field level. When a character is ambiguous — a "5" that might be an "S," a leading zero that's barely visible — the tool should flag that specific field rather than silently returning a wrong value. This lets your team focus review time on the 5% of fields that are uncertain instead of manually checking every extraction.
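The review workflow that field-level confidence enables can be sketched in a few lines. The field names, values, and the 0.9 threshold below are all illustrative, not any specific tool's API:

```python
# Illustrative only: route individual low-confidence fields to human
# review instead of re-keying the whole document.
REVIEW_THRESHOLD = 0.90

extraction = {  # hypothetical output: (value, per-field confidence)
    "vendor":     ("Acme Supply Co.", 0.99),
    "invoice_no": ("INV-20417",       0.97),
    "po_number":  ("004551",          0.62),  # faint leading zeros
    "total":      ("1,284.50",        0.95),
}

needs_review = {f: v for f, (v, conf) in extraction.items() if conf < REVIEW_THRESHOLD}
auto_accepted = {f: v for f, (v, conf) in extraction.items() if conf >= REVIEW_THRESHOLD}

print(needs_review)  # only po_number goes to a person
```

Instead of a reviewer re-checking four fields on every invoice, only the one uncertain field gets a human look.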
Lido processes degraded documents by combining AI vision models with language understanding rather than relying on character recognition or fixed templates. When a faxed invoice arrives with dark edges and faded text, Lido reads the visual layout and surrounding context to identify fields — the same way a person would. This is what allowed Disney Trucking to move 360,000 handwritten driver tickets per year off a six-person manual entry team. It's why Kei Concepts processes handwritten Vietnamese invoices with crossed-out line items across 13 restaurant locations. And it's how a premium grocery chain extracted PO numbers with leading zeros from an 8-page dot matrix scan their CEO called "ugly" — on the first pass, during the demo.
The approach works without templates, without model training, and without retraining when vendors change their formats. When extraction is uncertain, Lido flags specific fields with confidence scores rather than silently returning wrong values. And if the initial extraction needs refinement, reprocessing is free for 24 hours — you only pay when the output is right.
If you're evaluating extraction tools for scanned, faxed, or photographed invoices, test with documents that actually represent your problem. Every tool looks good on clean files.
Bring your worst scan. The faded fax. The phone photo with shadows. The dot matrix printout with perforated edges. If the vendor resists testing with messy documents or asks you to send "better quality" files, that tells you everything you need to know.
Test handwriting if you have it. Driver tickets, vendor invoices, annotations on typed documents. Ask the vendor to show you handwriting extraction live, on your documents, not on a prepared sample.
Check what happens when extraction is wrong. Can you refine instructions and reprocess without being charged again? Lido, for example, reprocesses at no charge for 24 hours. The NASA team — the government agency from the Nanonets story above — flagged per-attempt billing as a dealbreaker with their previous tool: "You didn't do the job the first time correctly and yeah... why are you charging me again?" Tools that charge per attempt, including failed ones, are penalizing you for their own shortcomings.
Ask about field-level confidence. Not just a document-level "pass/fail" but granular confidence on each extracted value. This is the difference between "this invoice extracted successfully" and "this invoice extracted successfully but the PO number has low confidence — check it."
Lido uses a custom blend of AI vision models, OCR, and LLMs to read documents the way a person does — using visual layout, context, and language understanding rather than rigid character recognition or zone mapping. No templates, no model training, no retraining when formats change.
The teams described throughout this post — Disney Trucking, Kei Concepts, Erewhon, the NASA project — all tested Lido on the documents their previous tools failed on. Erewhon's CEO pulled up an 8-page dot matrix scan he called "ugly" and watched it extract PO numbers with leading zeros on the first pass.
The documents that cause the most errors are rarely the clean digital PDFs. They're the scans, the faxes, the phone photos, and the handwritten tickets that every other tool quietly fails on.