To convert a scanned table to Excel, use an AI-powered OCR tool that preserves row and column structure during extraction. Upload the image or scanned PDF, let the tool detect table boundaries and cell contents, then export directly to .xlsx or CSV. Free tools work for occasional single-page tables, while AI extraction handles complex layouts, merged cells, and multi-page tables at scale.
Tables are everywhere in business documents: financial statements, invoices with line items, lab reports, shipping manifests, rate sheets. When those documents exist as scans or images, the data is locked inside pixels. You can see the numbers, but you can’t sort, filter, or calculate with them until they’re in a spreadsheet.
Standard OCR tools are built for running text, not structured grids. They read left to right, top to bottom, and mangle the column alignment that makes a table useful. Converting a scanned table to Excel requires tools that understand spatial relationships between cells, not just the characters inside them. Lido and similar AI extraction platforms solve this by detecting table structure first, then reading cell contents within that structure.
This guide covers four methods for getting tabular data from scans into Excel, ranked from simplest to most powerful.
OCR (optical character recognition) converts images of text into machine-readable characters. The underlying OCR algorithms range from basic template matching to modern transformer models, and the choice of algorithm affects table detection quality. Table extraction goes further: it identifies rows, columns, and cells before reading the text inside each cell. The output is a grid of data that maps directly to spreadsheet cells.
Without table detection, OCR reads a three-column table as a single stream of text. “Product A 500 $12.50 Product B 200 $8.75” becomes one line with no column separation. Table-aware OCR recognizes three columns and two rows, placing each value in the correct cell.
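To make the difference concrete, here is a minimal sketch of how table-aware tools can turn positioned words into a grid. It assumes the OCR engine returns a bounding box for each word; the `Word` type, the coordinates, and the `min_gap` and `row_tol` thresholds are illustrative assumptions, not the output format or defaults of any particular tool.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x0: float  # left edge of the word's bounding box
    x1: float  # right edge of the word's bounding box
    y: float   # vertical position of the text baseline


def infer_columns(words: List[Word], min_gap: float = 20.0) -> List[List[float]]:
    """Merge horizontal word spans; a gap wider than min_gap starts a new column."""
    spans: List[List[float]] = []
    for w in sorted(words, key=lambda w: w.x0):
        if spans and w.x0 - spans[-1][1] <= min_gap:
            spans[-1][1] = max(spans[-1][1], w.x1)
        else:
            spans.append([w.x0, w.x1])
    return spans


def group_rows(words: List[Word], row_tol: float = 5.0) -> List[List[Word]]:
    """Words whose baselines lie within row_tol of each other share a row."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: w.y):
        if rows and w.y - rows[-1][-1].y <= row_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def words_to_grid(words: List[Word]) -> List[List[str]]:
    """Place every word into the cell defined by its row and inferred column."""
    columns = infer_columns(words)
    grid = []
    for row in group_rows(words):
        cells: List[List[str]] = [[] for _ in columns]
        for w in sorted(row, key=lambda w: w.x0):
            for i, (c0, c1) in enumerate(columns):
                if c0 <= w.x0 <= c1:
                    cells[i].append(w.text)
                    break
        grid.append([" ".join(cell) for cell in cells])
    return grid


# Hypothetical word boxes for the two-row example above
words = [
    Word("Product", 10, 60, 100), Word("A", 65, 72, 100),
    Word("500", 150, 175, 100), Word("$12.50", 250, 300, 100),
    Word("Product", 10, 60, 120), Word("B", 65, 72, 120),
    Word("200", 150, 175, 120), Word("$8.75", 255, 300, 120),
]

# Plain OCR: one undifferentiated stream in reading order
stream = " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x0)))

# Table-aware: the same words placed into a grid of rows and columns
grid = words_to_grid(words)
```

The column step is the fragile part: it depends on whitespace gaps that hold across every row, which is why a single long entry can cause two columns to merge.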
The input can be:
The output is typically Excel (.xlsx), CSV, or a structured format like JSON that can be imported into any spreadsheet or database.
Plain text flows in one direction. Tables require the OCR engine to solve a two-dimensional layout problem before it can even start reading characters. Here are the specific challenges:
Column alignment without borders. Many printed tables use whitespace rather than visible lines to separate columns. The OCR engine must infer column boundaries from the alignment of text across multiple rows. If one row has a long entry that bleeds into the next column’s space, the engine may merge columns.
Merged cells. Headers that span multiple columns (like “Q1 2025” spanning Jan, Feb, Mar sub-columns) break the regular grid structure. The engine needs to recognize that one cell occupies the space of three and assign it correctly.
Multi-line cell content. When a cell contains text that wraps to two or three lines, simple OCR treats each line as a new row. A 10-row table with some wrapped cells might be read as 15 or 20 rows.
Variable row heights. Financial tables often have subtotal rows with extra spacing above and below. The OCR engine must distinguish between “extra whitespace within the same table” and “end of this table, start of something else.”
Rotated or skewed scans. A page scanned at a slight angle throws off column detection. What looks like a straight column to a human is actually a diagonal line of text that the engine may split across two detected columns.
| Table Feature | Difficulty Level | Failure Mode |
|---|---|---|
| Simple bordered grid | Low | Rarely fails |
| Borderless with aligned columns | Medium | Columns merge or split |
| Merged header cells | Medium-High | Header misaligned with data |
| Multi-line cells | High | Extra phantom rows created |
| Nested tables (table within table) | Very High | Structure completely lost |
| Multi-page table (continuation) | Very High | Treated as separate tables |
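The multi-line cell problem is often the easiest of these to repair after extraction. The sketch below assumes a simple heuristic: a row whose key column (here the first column) is empty is treated as the wrapped continuation of the row above it. The function name and the sample rows are hypothetical, and the heuristic will not suit tables where the key column can legitimately be blank.

```python
from typing import List


def merge_wrapped_rows(rows: List[List[str]], key_col: int = 0) -> List[List[str]]:
    """Fold continuation rows (empty key column) into the preceding row."""
    merged: List[List[str]] = []
    for row in rows:
        is_continuation = bool(merged) and not row[key_col].strip()
        if is_continuation:
            prev = merged[-1]
            for i, cell in enumerate(row):
                if cell.strip():
                    # Append the wrapped text to the matching cell above
                    prev[i] = f"{prev[i]} {cell.strip()}".strip()
        else:
            merged.append(list(row))
    return merged


# Hypothetical OCR output: the description of item 1001 wrapped onto a second line
raw_rows = [
    ["1001", "Industrial widget,", "12"],
    ["", "stainless steel", ""],
    ["1002", "Gasket", "40"],
]
clean_rows = merge_wrapped_rows(raw_rows)
```

Here three extracted rows collapse back into the two rows that were actually printed.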
Best for: one-off extractions, simple tables, quick results without installing software.
Free online OCR tools like OnlineOCR.net, i2OCR, and NewOCR accept image uploads and return text or basic table output. The workflow is straightforward:
Advantages: Zero setup, free, works from any device with a browser.
Limitations: Most free tools have file size limits (5–10 MB), process one page at a time, don’t handle complex layouts well, and may not preserve column structure. Accuracy on borderless tables is typically 60–70%. You’ll spend time fixing misaligned cells manually. Privacy matters too: you’re uploading potentially sensitive documents to a third-party server.
When to use this method: You have a single table, it has visible borders, and you need the data in 5 minutes. Don’t use this for financial data that requires high accuracy or batch processing.
Best for: occasional table extraction if you already have an Acrobat Pro subscription.
Adobe Acrobat Pro’s “Export PDF” feature includes OCR and table detection. For scanned PDFs:
Advantages: Good table detection on bordered tables, handles multi-page PDFs, preserves some formatting. If you already pay for Acrobat ($22.99/month for Pro), there’s no additional cost.
Limitations: Struggles with borderless tables and complex layouts. Multi-page tables often split into separate tables per page rather than concatenating into one. Merged cells rarely export correctly. No batch processing, so each PDF must be opened and exported individually. No validation or post-processing rules.
When to use this method: You have Acrobat Pro, the table has borders, and you’re processing fewer than 10 documents. For higher volumes, the manual open-export-review cycle becomes a bottleneck.
Best for: production workflows, complex tables, batch processing, high accuracy requirements.
AI-powered tools like Lido approach table extraction differently. Instead of detecting lines and borders, they understand the semantic structure of the table: what’s a header, what’s a data row, what’s a subtotal. This works regardless of visual formatting.
The workflow:
Advantages: Handles borderless tables, merged cells, multi-page tables, and inconsistent formatting. Batch processes hundreds of documents. Accuracy of 90–98% on complex tables. You define the output structure, so the result always matches your spreadsheet format. Validation rules catch errors before export.
Limitations: Requires an account and subscription. Overkill for a single one-off table. Initial field definition takes a few minutes of setup.
When to use this method: You process tables regularly (weekly or monthly), need high accuracy, deal with inconsistent source formats, or have more than 10 documents to process. The setup time pays for itself on the second batch. For guidance on choosing between tools, see the best image-to-table converter comparison.
Best for: developers, custom pipelines, native PDFs (not scans), integration with existing data workflows.
Several Python libraries extract tables from PDFs. The choice depends on whether your PDF contains selectable text (native) or is a scanned image.
For native PDFs (selectable text):
tabula.read_pdf("file.pdf", pages="all")
For scanned PDFs (images):
A minimal Python workflow for native PDFs:
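import camelot
import pandas as pd  # each table.df below is a pandas DataFrame
# "lattice" expects ruled borders; flavor="stream" is the whitespace-based alternative
tables = camelot.read_pdf("report.pdf", pages="1-5", flavor="lattice")
# Write each detected table to its own workbook
for i, table in enumerate(tables):
    table.df.to_excel(f"table_{i}.xlsx", index=False)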
Advantages: Full control, free (for native PDF tools), integrates with existing data pipelines, reproducible and scriptable.
Limitations: Requires Python knowledge. Native PDF tools don’t work on scans. Scanned-document tools need additional OCR setup (Tesseract, cloud APIs). No built-in validation. Debugging column detection issues requires manual tuning of parameters. Each new document layout may need parameter adjustments.
When to use this method: You have engineering resources, process native PDFs (not scans), need the extraction integrated into an automated pipeline, or have very specific output format requirements that no commercial tool satisfies.
Even good tools fail on certain table types. Here’s what goes wrong and how to work around it.
A header like “Revenue” spanning three sub-columns gets duplicated into each sub-column, or the sub-columns lose their association with the parent header. Fix: Extract without headers first, then add headers manually. Or define your output columns to flatten the hierarchy (e.g., “Revenue – Q1”, “Revenue – Q2” instead of nested headers).
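A minimal sketch of the flattening approach, assuming the extraction returns the parent header row and the sub-header row as two separate lists, with blank strings where a merged cell spans further columns. The function name and the sample headers are hypothetical.

```python
from typing import List


def flatten_headers(parent_row: List[str], sub_row: List[str], sep: str = " – ") -> List[str]:
    """Combine a spanning parent header with its sub-headers into flat column names."""
    flat: List[str] = []
    current_parent = ""
    for parent, sub in zip(parent_row, sub_row):
        parent, sub = parent.strip(), sub.strip()
        if parent:
            # A non-blank parent cell starts a new span
            current_parent = parent
        if current_parent and sub:
            flat.append(f"{current_parent}{sep}{sub}")
        else:
            flat.append(sub or current_parent)
    return flat


# Hypothetical header rows: "Revenue" spans three quarterly sub-columns
parent_row = ["Region", "Revenue", "", "", "Costs"]
sub_row = ["", "Q1", "Q2", "Q3", ""]
columns = flatten_headers(parent_row, sub_row)
```

The sketch assumes that a blank parent cell always means "still under the previous parent". If a table mixes spanned and unspanned columns in other ways, the rule needs adjusting.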
When column separation relies on whitespace, OCR may merge adjacent columns if entries vary in width. A short value in column A and a long value in column B look like one merged cell. Fix: In Camelot, switch to “stream” mode and adjust the column_tol parameter. In AI tools, explicitly define the expected column count and names so the engine knows how many columns to look for.
A 50-row table spanning pages 3–5 gets extracted as three separate tables, each with repeated headers. Fix: In Lido, use multi-page table mode that concatenates rows across pages and strips repeated headers. In Python, extract all pages and use pandas to detect and remove duplicate header rows: df = df[df["Column1"] != "Column1"].
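The same idea can be sketched without pandas. The example assumes each page was extracted as a list of rows whose first row is the header; the function name and the sample pages are hypothetical. Headers are compared after trimming whitespace and lowercasing, since OCR output for the repeated header can differ slightly from page to page.

```python
from typing import List

Table = List[List[str]]


def concat_pages(page_tables: List[Table]) -> Table:
    """Join per-page tables into one table, keeping the header only once."""
    if not page_tables:
        return []
    header = page_tables[0][0]
    normalized_header = [cell.strip().lower() for cell in header]
    combined: Table = [header]
    for table in page_tables:
        for row in table:
            # Skip the first header and every repeated header row
            if [cell.strip().lower() for cell in row] == normalized_header:
                continue
            combined.append(row)
    return combined


# Hypothetical extraction result: one table split across three pages
pages = [
    [["Item", "Qty"], ["Bolt", "100"], ["Nut", "250"]],
    [["Item ", "qty"], ["Washer", "75"]],
    [["Item", "Qty"], ["Screw", "500"]],
]
full_table = concat_pages(pages)
```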
OCR confuses 0/O, 1/l/I, 5/S, 8/B in table cells. A quantity of “1,050” becomes “l,O5O”. Fix: Post-processing rules that enforce data types. If a column should contain numbers, strip non-numeric characters and flag entries that don’t parse as valid numbers. AI extraction tools handle this natively because they understand that a “Quantity” column should contain integers.
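A type-enforcing rule of this kind can be sketched in a few lines. The confusion map covers only the character pairs listed above, and the function name is hypothetical. Anything that still fails to parse is flagged for review instead of being guessed at.

```python
from typing import Optional

# Letters that OCR commonly substitutes for digits in numeric cells
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})


def clean_integer(cell: str) -> Optional[int]:
    """Return the integer a cell should contain, or None if it needs manual review."""
    candidate = cell.translate(CONFUSIONS).replace(",", "").strip()
    if candidate.isascii() and candidate.isdigit():
        return int(candidate)
    return None


quantity = clean_integer("l,O5O")   # OCR misread of "1,050"
flagged = clean_integer("N/A")      # not a number: left for a human to check
```

Apply a rule like this only to columns that are known to be numeric. Running it over text columns would corrupt values such as product codes.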
A dollar sign or percentage symbol gets assigned to an adjacent cell, shifting an entire column by one position. Fix: Define column data types in advance. Currency columns should include the symbol in the expected format. Tools with column-type awareness (like Lido’s PDF-to-Excel extraction) interpret “$1,250.00” as a single currency value rather than splitting on the dollar sign.
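When a tool has already split the symbol into its own cell, the row can be repaired afterwards. The sketch below assumes currency symbols precede the value and percent signs follow it; the symbol sets, function names, and sample rows are illustrative assumptions.

```python
from decimal import Decimal
from typing import List

PREFIX_SYMBOLS = {"$", "€", "£"}
SUFFIX_SYMBOLS = {"%"}


def rejoin_symbols(row: List[str]) -> List[str]:
    """Reattach stray currency and percent symbols to the value they belong to."""
    fixed: List[str] = []
    pending_prefix = ""
    for cell in row:
        text = cell.strip()
        if text in PREFIX_SYMBOLS:
            pending_prefix = text          # hold it for the next value
        elif text in SUFFIX_SYMBOLS and fixed:
            fixed[-1] += text              # attach to the previous value
        else:
            fixed.append(pending_prefix + text)
            pending_prefix = ""
    if pending_prefix:
        fixed.append(pending_prefix)       # symbol with no value after it
    return fixed


def parse_currency(value: str) -> Decimal:
    """Convert a string such as "$1,250.00" to an exact decimal amount."""
    digits = value.strip()
    for symbol in PREFIX_SYMBOLS:
        digits = digits.replace(symbol, "")
    return Decimal(digits.replace(",", ""))


shifted_row = ["Widget", "$", "1,250.00", "3"]
repaired_row = rejoin_symbols(shifted_row)
amount = parse_currency(repaired_row[1])
```

`Decimal` is used instead of `float` so that amounts keep their exact cents when they are summed later.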
Accuracy varies widely based on table structure, image quality, and the tool used. These benchmarks reflect real-world performance across common document types.
| Table Type | Free Online OCR | Adobe Acrobat | AI Extraction (Lido) | Python (Camelot) |
|---|---|---|---|---|
| Simple bordered (3–5 cols, 10 rows) | 85–90% | 92–95% | 97–99% | 95–98% (native PDF only) |
| Borderless aligned (5+ cols) | 60–70% | 75–85% | 92–96% | 80–90% (with tuning) |
| Merged headers + subtotals | 40–55% | 60–70% | 88–94% | 70–80% (with custom logic) |
| Multi-page continuation | N/A (single page only) | 50–65% | 90–95% | 85–92% (with concatenation) |
| Low-quality scan (fax, photocopy) | 30–50% | 55–70% | 80–90% | N/A (requires OCR preprocessing) |
| Handwritten table entries | 10–25% | 20–35% | 60–75% | N/A |
Accuracy percentages represent cell-level accuracy: the percentage of individual cells extracted correctly. A table with 50 cells at 90% accuracy means 5 cells need correction. At 98% accuracy, only 1 cell needs fixing. For financial data where every number matters, the difference between 90% and 98% is the difference between “needs extensive manual correction” and “usable after quick review.”
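To measure cell-level accuracy on your own documents, compare an extracted table against a hand-checked copy of the same table. The sketch below is one simple way to do that; it counts a cell as correct only on an exact string match, and the function name and sample data are hypothetical.

```python
from typing import List, Tuple

Table = List[List[str]]


def cell_accuracy(extracted: Table, expected: Table) -> Tuple[float, int]:
    """Return (share of cells that match exactly, number of cells to correct)."""
    total = sum(len(row) for row in expected)
    if total == 0:
        return 1.0, 0
    correct = 0
    for r, expected_row in enumerate(expected):
        for c, expected_cell in enumerate(expected_row):
            # A missing row or cell in the extraction counts as an error
            if r < len(extracted) and c < len(extracted[r]):
                if extracted[r][c] == expected_cell:
                    correct += 1
    return correct / total, total - correct


# A 50-cell reference table (10 rows x 5 columns) and an extraction with 5 wrong cells
expected = [[f"r{r}c{c}" for c in range(5)] for r in range(10)]
extracted = [list(row) for row in expected]
for r in range(5):
    extracted[r][0] = "???"

accuracy, cells_to_fix = cell_accuracy(extracted, expected)
```

Exact matching is deliberately strict. Depending on the use case, it may be reasonable to normalize whitespace or number formatting before comparing.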
For context on how these numbers compare across tools, see the full OCR accuracy benchmarks.
The right approach depends on volume, table complexity, and technical resources.
Low volume (1–5 tables/month), simple structure: Free online tools or Adobe Acrobat. The manual cleanup time is acceptable when you’re only doing it a few times per month.
Medium volume (10–50 tables/month), mixed complexity: AI extraction tools. The subscription cost is justified by time savings, and the accuracy on complex tables eliminates most manual correction. A good starting point is Lido’s PDF-to-Excel conversion.
High volume (100+ tables/month), consistent source format: Python pipeline or AI extraction with API access. At this volume, per-document manual work of any kind becomes unsustainable. You need either a fully automated script (if you have engineering resources) or an AI tool with batch processing and API export.
Mixed document types with tables embedded in larger documents: AI extraction is the only practical option. Python tools need you to specify which pages contain tables. Adobe exports the entire document. AI tools identify table regions within multi-page documents automatically and extract only the structured data you defined.
Upload the scanned document to an OCR tool with table detection capability. The tool first identifies table boundaries and cell structure (rows and columns), then reads the text within each cell using optical character recognition. The result exports as an Excel file with each value placed in the correct cell. For best results on scanned documents, use an AI-powered tool rather than basic OCR, as AI understands table structure contextually rather than relying solely on visible borders or line detection.
The best OCR for tables depends on your use case. For scanned documents with complex layouts (borderless tables, merged cells, multi-page tables), AI-powered extraction tools like Lido offer the highest accuracy at 90–98%. For native PDFs with selectable text, Camelot (Python library) is the best free option for bordered tables. For occasional use without installing software, Adobe Acrobat Pro works well on simple bordered tables. No single tool is best for every scenario—the key differentiator is whether your tables have visible borders and whether your PDFs are native or scanned.
Yes, but accuracy drops significantly with basic OCR tools. Borderless tables rely on whitespace alignment to separate columns, which traditional OCR engines struggle to interpret correctly. AI-powered extraction tools handle borderless tables by understanding the semantic content: they recognize that aligned text across rows forms columns even without visible dividers. Camelot’s “stream” mode also detects borderless tables using text positioning, though it requires parameter tuning. Expect 60–70% accuracy from free tools and 92–96% from AI-powered tools on borderless tables.
OCR table extraction accuracy ranges from 30% to 99% on printed tables, depending on the tool, table complexity, and image quality. Simple bordered tables on clean scans achieve 85–99% cell-level accuracy across most tools. Borderless tables drop to 60–96% depending on the tool. Complex tables with merged cells, subtotals, and multi-page continuation range from 40% to 95%. Low-quality scans (faxes, photocopies) reduce accuracy for every tool: by roughly 10–20 percentage points for AI extraction and by 40 or more for free online OCR. On complex table structures, AI-powered tools outperform traditional OCR by about 25 percentage points or more.
Several free options exist for basic table OCR. OnlineOCR.net and i2OCR handle simple bordered tables and export to Excel format without requiring an account. For native PDFs (not scans), Tabula is a free open-source tool with a browser interface that extracts tables to CSV or Excel. Python users can use Camelot or pdfplumber at no cost. Google Drive also offers basic OCR—upload an image and open with Google Docs to extract text, though table structure is often lost. Free tools work best for simple, bordered tables under 20 rows.