OCR Table to Excel: Best Ways to Extract Tabular Data

May 5, 2026

To convert a scanned table to Excel, use an AI-powered OCR tool that preserves row and column structure during extraction. Upload the image or scanned PDF, let the tool detect table boundaries and cell contents, then export directly to .xlsx or CSV. Free tools work for occasional single-page tables, while AI extraction handles complex layouts, merged cells, and multi-page tables at scale.

Tables are everywhere in business documents: financial statements, invoices with line items, lab reports, shipping manifests, rate sheets. When those documents exist as scans or images, the data is locked inside pixels. You can see the numbers, but you can’t sort, filter, or calculate with them until they’re in a spreadsheet.

Standard OCR tools are built for running text, not structured grids. They read left to right, top to bottom, and mangle the column alignment that makes a table useful. Converting a scanned table to Excel requires tools that understand spatial relationships between cells, not just the characters inside them. Lido and similar AI extraction platforms solve this by detecting table structure first, then reading cell contents within that structure.

This guide covers four methods for getting tabular data from scans into Excel, ranked from simplest to most powerful.

What OCR table extraction actually does

OCR (optical character recognition) converts images of text into machine-readable characters. The underlying OCR algorithms range from basic template matching to modern transformer models, and the choice of algorithm affects table detection quality. Table extraction goes further: it identifies rows, columns, and cells before reading the text inside each cell. The output is a grid of data that maps directly to spreadsheet cells.

Without table detection, OCR reads a three-column table as a single stream of text. “Product A 500 $12.50 Product B 200 $8.75” becomes one line with no column separation. Table-aware OCR recognizes three columns and two rows, placing each value in the correct cell.
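
To make that concrete, here is a toy sketch of what table-aware OCR adds on top of plain OCR: given word boxes with (x, y) positions (the boxes below are illustrative, not output from a real engine), it clusters them into rows and columns instead of reading one left-to-right stream.

```python
def boxes_to_grid(boxes, row_tol=5):
    """Cluster (text, x, y) boxes into a row/column grid."""
    # Group boxes into rows: boxes whose y values fall within row_tol
    rows = {}
    for text, x, y in sorted(boxes, key=lambda b: (b[2], b[1])):
        for row_y in rows:
            if abs(row_y - y) <= row_tol:
                rows[row_y].append((x, text))
                break
        else:
            rows[y] = [(x, text)]
    # Within each row, sort by x to recover column order
    return [[t for _, t in sorted(cells)] for _, cells in sorted(rows.items())]

boxes = [
    ("Product A", 10, 100), ("500", 120, 101), ("$12.50", 200, 100),
    ("Product B", 10, 130), ("200", 120, 130), ("$8.75", 200, 131),
]
grid = boxes_to_grid(boxes)
# grid == [["Product A", "500", "$12.50"], ["Product B", "200", "$8.75"]]
```

The `row_tol` parameter is the same kind of tolerance real engines tune: too tight and wrapped cells split into phantom rows, too loose and adjacent rows merge.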

The input can be:

  • A photo of a printed table (taken with a phone camera)
  • A scanned PDF (image-only, no selectable text)
  • A screenshot from a system that doesn’t offer data export
  • A faxed document converted to PDF

The output is typically Excel (.xlsx), CSV, or a structured format like JSON that can be imported into any spreadsheet or database.
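
As a sketch of what the structured-format option looks like, here is one plausible JSON shape for an extracted table (illustrative only, not any specific tool's schema), along with the trivial step down to CSV:

```python
import json

# Illustrative JSON shape for an extracted table: headers plus rows,
# ready to import into a spreadsheet or database.
table = {
    "headers": ["Product", "Quantity", "Unit Price"],
    "rows": [
        ["Product A", 500, "$12.50"],
        ["Product B", 200, "$8.75"],
    ],
}

as_json = json.dumps(table, indent=2)

# CSV is one line per row, header row first
csv_lines = [",".join(map(str, r)) for r in [table["headers"], *table["rows"]]]
```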

Why tables are harder than plain text for OCR

Plain text flows in one direction. Tables require the OCR engine to solve a two-dimensional layout problem before it can even start reading characters. Here are the specific challenges:

Column alignment without borders. Many printed tables use whitespace rather than visible lines to separate columns. The OCR engine must infer column boundaries from the alignment of text across multiple rows. If one row has a long entry that bleeds into the next column’s space, the engine may merge columns.

Merged cells. Headers that span multiple columns (like “Q1 2025” spanning Jan, Feb, Mar sub-columns) break the regular grid structure. The engine needs to recognize that one cell occupies the space of three and assign it correctly.

Multi-line cell content. When a cell contains text that wraps to two or three lines, simple OCR treats each line as a new row. A 10-row table with some wrapped cells might be read as 15 or 20 rows.

Variable row heights. Financial tables often have subtotal rows with extra spacing above and below. The OCR engine must distinguish between “extra whitespace within the same table” and “end of this table, start of something else.”

Rotated or skewed scans. A page scanned at a slight angle throws off column detection. What looks like a straight column to a human is actually a diagonal line of text that the engine may split across two detected columns.

| Table Feature | Difficulty Level | Failure Mode |
| --- | --- | --- |
| Simple bordered grid | Low | Rarely fails |
| Borderless with aligned columns | Medium | Columns merge or split |
| Merged header cells | Medium-High | Header misaligned with data |
| Multi-line cells | High | Extra phantom rows created |
| Nested tables (table within table) | Very High | Structure completely lost |
| Multi-page table (continuation) | Very High | Treated as separate tables |

Method 1: Screenshot or photo to online OCR tool

Best for: one-off extractions, simple tables, quick results without installing software.

Free online OCR tools like OnlineOCR.net, i2OCR, and NewOCR accept image uploads and return text or basic table output. The workflow is straightforward:

  1. Take a screenshot or photo of the table
  2. Upload to the web tool
  3. Select “Table” or “Excel” as the output format
  4. Download the result

Advantages: Zero setup, free, works from any device with a browser.

Limitations: Most free tools have file size limits (5–10 MB), process one page at a time, don’t handle complex layouts well, and may not preserve column structure. Accuracy on borderless tables is typically 60–75%. You’ll spend time fixing misaligned cells manually. Privacy matters too: you’re uploading potentially sensitive documents to a third-party server.

When to use this method: You have a single table, it has visible borders, and you need the data in 5 minutes. Don’t use this for financial data that requires high accuracy or batch processing.

Method 2: Scanned PDF to Adobe Acrobat to Excel

Best for: occasional table extraction if you already have an Acrobat Pro subscription.

Adobe Acrobat Pro’s “Export PDF” feature includes OCR and table detection. For scanned PDFs:

  1. Open the scanned PDF in Acrobat Pro
  2. Run “Recognize Text” (Edit PDF > Recognize Text) to apply OCR
  3. File > Export To > Spreadsheet > Microsoft Excel Workbook
  4. Review the exported .xlsx for accuracy

Advantages: Good table detection on bordered tables, handles multi-page PDFs, preserves some formatting. If you already pay for Acrobat ($22.99/month for Pro), there’s no additional cost.

Limitations: Struggles with borderless tables and complex layouts. Multi-page tables often split into separate tables per page rather than concatenating into one. Merged cells rarely export correctly. No batch processing, so each PDF must be opened and exported individually. No validation or post-processing rules.

When to use this method: You have Acrobat Pro, the table has borders, and you’re processing fewer than 10 documents. For higher volumes, the manual open-export-review cycle becomes a bottleneck.

Method 3: AI-powered table extraction

Best for: production workflows, complex tables, batch processing, high accuracy requirements.

AI-powered tools like Lido approach table extraction differently. Instead of detecting lines and borders, they understand the semantic structure of the table: what’s a header, what’s a data row, what’s a subtotal. This works regardless of visual formatting.

The workflow:

  1. Define the columns you expect (product name, quantity, unit price, total, or whatever your table contains)
  2. Upload one or more documents containing tables
  3. The AI identifies table regions, maps content to your defined columns, and extracts
  4. Export to Excel, Google Sheets, or CSV

Advantages: Handles borderless tables, merged cells, multi-page tables, and inconsistent formatting. Batch processes hundreds of documents. Accuracy of 90–98% on complex tables. You define the output structure, so the result always matches your spreadsheet format. Validation rules catch errors before export.
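
The validation rules mentioned above can be sketched generically in a few lines (the column names here are hypothetical, not tied to any particular tool's configuration):

```python
def validate_row(row):
    """Return a list of problems found in one extracted row."""
    errors = []
    if not row.get("product"):
        errors.append("missing product name")
    try:
        qty = int(row["quantity"])
        price = float(row["unit_price"])
        total = float(row["total"])
        # Flag rows where quantity * unit price disagrees with the total
        if abs(qty * price - total) > 0.01:
            errors.append("total does not equal quantity * unit price")
    except (KeyError, ValueError):
        errors.append("non-numeric value in a numeric column")
    return errors

ok = validate_row({"product": "Widget", "quantity": "500",
                   "unit_price": "12.50", "total": "6250.00"})
bad = validate_row({"product": "Widget", "quantity": "l,O5O",
                    "unit_price": "12.50", "total": "6250.00"})
# ok == [], while bad flags the OCR-garbled quantity
```

Running checks like these before export is what turns a 95%-accurate extraction into a reviewable one: the errors surface as a short list instead of hiding in a thousand cells.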

Limitations: Requires an account and subscription. Overkill for a single one-off table. Initial field definition takes a few minutes of setup.

When to use this method: You process tables regularly (weekly or monthly), need high accuracy, deal with inconsistent source formats, or have more than 10 documents to process. The setup time pays for itself on the second batch. For guidance on choosing between tools, see the best image-to-table converter comparison.

Method 4: Python programmatic extraction

Best for: developers, custom pipelines, native PDFs (not scans), integration with existing data workflows.

Several Python libraries extract tables from PDFs. The choice depends on whether your PDF contains selectable text (native) or is a scanned image.

For native PDFs (selectable text):

  • Tabula-py – Python wrapper around the Java-based Tabula tool; works well on bordered tables. Extracts tables as pandas DataFrames: tabula.read_pdf("file.pdf", pages="all")
  • Camelot – Two modes: “lattice” for bordered tables (detects lines) and “stream” for borderless tables (uses text alignment). More configurable than Tabula.
  • pdfplumber – Lower-level access to text positions. Lets you define custom table boundaries. Best when other tools fail on unusual layouts.

For scanned PDFs (images):

  • img2table – Detects table structure from images, requires Tesseract for OCR. Handles bordered and borderless tables.
  • PaddleOCR + custom logic – OCR engine with table structure recognition. More setup required but handles complex Asian-language tables well.
  • Azure Document Intelligence / Google Document AI – Cloud APIs with table extraction endpoints. High accuracy but per-page pricing adds up at volume.

A minimal Python workflow for native PDFs:

import camelot

# "lattice" mode detects ruled lines, so it suits bordered tables
tables = camelot.read_pdf("report.pdf", pages="1-5", flavor="lattice")

# Each detected table exposes a pandas DataFrame via .df
for i, table in enumerate(tables):
    table.df.to_excel(f"table_{i}.xlsx", index=False)

Advantages: Full control, free (for native PDF tools), integrates with existing data pipelines, reproducible and scriptable.

Limitations: Requires Python knowledge. Native PDF tools don’t work on scans. Scanned-document tools need additional OCR setup (Tesseract, cloud APIs). No built-in validation. Debugging column detection issues requires manual tuning of parameters. Each new document layout may need parameter adjustments.

When to use this method: You have engineering resources, process native PDFs (not scans), need the extraction integrated into an automated pipeline, or have very specific output format requirements that no commercial tool satisfies.

Common table OCR failures and how to fix them

Even good tools fail on certain table types. Here’s what goes wrong and how to work around it.

Merged cells splitting into multiple rows

A header like “Revenue” spanning three sub-columns gets duplicated into each sub-column, or the sub-columns lose their association with the parent header. Fix: Extract without headers first, then add headers manually. Or define your output columns to flatten the hierarchy (e.g., “Revenue – Q1”, “Revenue – Q2” instead of nested headers).
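
The flattening fix is mechanical once you have the parent and sub-header rows. A minimal sketch, assuming a parent row where unmerged columns have no spanning header:

```python
# Combine a merged parent header row with the sub-header row into
# flat, single-level column names ("" marks no spanning parent).
parents = ["", "Revenue", "Revenue", "Revenue"]
children = ["Region", "Q1", "Q2", "Q3"]

flat = [f"{p} - {c}" if p else c for p, c in zip(parents, children)]
# flat == ["Region", "Revenue - Q1", "Revenue - Q2", "Revenue - Q3"]
```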

Borderless tables with misaligned columns

When column separation relies on whitespace, OCR may merge adjacent columns if entries vary in width. A short value in column A and a long value in column B look like one merged cell. Fix: In Camelot, switch to “stream” mode and adjust the column_tol parameter. In AI tools, explicitly define the expected column count and names so the engine knows how many columns to look for.

Multi-page tables treated as separate tables

A 50-row table spanning pages 3–5 gets extracted as three separate tables, each with repeated headers. Fix: In Lido, use multi-page table mode that concatenates rows across pages and strips repeated headers. In Python, extract all pages and use pandas to detect and remove duplicate header rows: df = df[df["Column1"] != "Column1"].
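
The pandas side of that fix looks like this (a minimal sketch with two toy per-page extracts standing in for real Camelot output):

```python
import pandas as pd

# Stitch per-page extracts into one table, then drop the header row
# that repeats at the top of each continuation page.
page1 = pd.DataFrame({"Item": ["A", "B"], "Qty": ["1", "2"]})
page2 = pd.DataFrame({"Item": ["Item", "C"], "Qty": ["Qty", "3"]})  # repeated header

df = pd.concat([page1, page2], ignore_index=True)
df = df[df["Item"] != "Item"].reset_index(drop=True)
# df now contains rows A, B, C with no duplicated header row
```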

Numbers misread as similar characters

OCR confuses 0/O, 1/l/I, 5/S, 8/B in table cells. A quantity of “1,050” becomes “l,O5O”. Fix: Post-processing rules that enforce data types. If a column should contain numbers, strip non-numeric characters and flag entries that don’t parse as valid numbers. AI extraction tools handle this natively because they understand that a “Quantity” column should contain integers.
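
A post-processing rule of this kind fits in a few lines. The sketch below applies the confusion map only to values you already know should be numeric (applying it to free text would corrupt real words):

```python
# Map common OCR character confusions back to digits, strip thousands
# separators, and flag values that still fail to parse.
CONFUSIONS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1",
                            "S": "5", "B": "8"})

def clean_number(raw):
    """Return (value, ok) after repairing common OCR digit errors."""
    fixed = raw.translate(CONFUSIONS).replace(",", "").strip()
    try:
        return int(fixed), True
    except ValueError:
        return raw, False  # keep the original and flag it for manual review

value, ok = clean_number("l,O5O")
# value == 1050, ok is True
```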

Currency and unit symbols causing column shifts

A dollar sign or percentage symbol gets assigned to an adjacent cell, shifting an entire column by one position. Fix: Define column data types in advance. Currency columns should include the symbol in the expected format. Tools with column-type awareness (like Lido’s PDF-to-Excel extraction) interpret “$1,250.00” as a single currency value rather than splitting on the dollar sign.
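
For pipelines without built-in column-type awareness, a small parser can do the same job: treat the symbol and the amount as one value instead of letting the symbol drift into a neighboring cell. A minimal sketch:

```python
import re

def parse_currency(cell):
    """Extract a float amount from a currency-formatted cell."""
    match = re.search(r"-?[\d,]+(?:\.\d+)?", cell)
    if not match:
        raise ValueError(f"no amount found in {cell!r}")
    return float(match.group().replace(",", ""))

amount = parse_currency("$1,250.00")
# amount == 1250.0
```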

Accuracy benchmarks by table complexity

Accuracy varies widely based on table structure, image quality, and the tool used. These benchmarks reflect real-world performance across common document types.

| Table Type | Free Online OCR | Adobe Acrobat | AI Extraction (Lido) | Python (Camelot) |
| --- | --- | --- | --- | --- |
| Simple bordered (3–5 cols, 10 rows) | 85–90% | 92–95% | 97–99% | 95–98% (native PDF only) |
| Borderless aligned (5+ cols) | 60–70% | 75–85% | 92–96% | 80–90% (with tuning) |
| Merged headers + subtotals | 40–55% | 60–70% | 88–94% | 70–80% (with custom logic) |
| Multi-page continuation | N/A (single page only) | 50–65% | 90–95% | 85–92% (with concatenation) |
| Low-quality scan (fax, photocopy) | 30–50% | 55–70% | 80–90% | N/A (requires OCR preprocessing) |
| Handwritten table entries | 10–25% | 20–35% | 60–75% | N/A |

Accuracy percentages represent cell-level accuracy: the percentage of individual cells extracted correctly. A table with 50 cells at 90% accuracy means 5 cells need correction. At 98% accuracy, only 1 cell needs fixing. For financial data where every number matters, the difference between 90% and 98% is the difference between “usable after quick review” and “needs extensive manual correction.”

For context on how these numbers compare across tools, see the full OCR accuracy benchmarks.

Which method to use

The right approach depends on volume, table complexity, and technical resources.

Low volume (1–5 tables/month), simple structure: Free online tools or Adobe Acrobat. The manual cleanup time is acceptable when you’re only doing it a few times per month.

Medium volume (10–50 tables/month), mixed complexity: AI extraction tools. The subscription cost is justified by time savings, and the accuracy on complex tables cuts most manual correction. A good starting point is Lido’s PDF-to-Excel conversion.

High volume (100+ tables/month), consistent source format: Python pipeline or AI extraction with API access. At this volume, per-document manual work of any kind becomes unsustainable. You need either a fully automated script (if you have engineering resources) or an AI tool with batch processing and API export.

Mixed document types with tables embedded in larger documents: AI extraction is the only practical option. Python tools need you to specify which pages contain tables. Adobe exports the entire document. AI tools identify table regions within multi-page documents automatically and extract only the structured data you defined.

Frequently asked questions

How do I convert a scanned table to Excel?

Upload the scanned document to an OCR tool with table detection capability. The tool first identifies table boundaries and cell structure (rows and columns), then reads the text within each cell using optical character recognition. The result exports as an Excel file with each value placed in the correct cell. For best results on scanned documents, use an AI-powered tool rather than basic OCR, as AI understands table structure contextually rather than relying solely on visible borders or line detection.

What is the best OCR for tables?

The best OCR for tables depends on your use case. For scanned documents with complex layouts (borderless tables, merged cells, multi-page tables), AI-powered extraction tools like Lido offer the highest accuracy at 90–98%. For native PDFs with selectable text, Camelot (Python library) is the best free option for bordered tables. For occasional use without installing software, Adobe Acrobat Pro works well on simple bordered tables. No single tool is best for every scenario—the key differentiator is whether your tables have visible borders and whether your PDFs are native or scanned.

Can OCR handle tables without borders?

Yes, but accuracy drops significantly with basic OCR tools. Borderless tables rely on whitespace alignment to separate columns, which traditional OCR engines struggle to interpret correctly. AI-powered extraction tools handle borderless tables by understanding the semantic content—recognizing that aligned text across rows forms columns even without visible dividers. Camelot’s “stream” mode also detects borderless tables using text positioning, though it requires parameter tuning. Expect 60–70% accuracy from free tools and 88–96% from AI-powered tools on borderless tables.

How accurate is OCR table extraction?

OCR table extraction accuracy ranges from 30% to 99% depending on the tool, table complexity, and image quality. Simple bordered tables on clean scans achieve 85–99% cell-level accuracy across most tools. Borderless tables drop to 60–96% depending on the tool. Complex tables with merged cells, subtotals, and multi-page continuation range from 40–95%. Low-quality scans (faxes, photocopies) reduce accuracy by 10–20 percentage points regardless of tool. AI-powered tools consistently outperform traditional OCR by 15–25 percentage points on complex table structures.

Is there a free tool to OCR tables to Excel?

Several free options exist for basic table OCR. OnlineOCR.net and i2OCR handle simple bordered tables and export to Excel format without requiring an account. For native PDFs (not scans), Tabula is a free open-source tool with a browser interface that extracts tables to CSV or Excel. Python users can use Camelot or pdfplumber at no cost. Google Drive also offers basic OCR—upload an image and open with Google Docs to extract text, though table structure is often lost. Free tools work best for simple, bordered tables under 20 rows.

Ready to grow your business with document automation, not headcount?

Join hundreds of teams growing faster by automating the busywork with Lido.