How to Copy and Extract Tables from PDF to Excel

May 5, 2026

To copy a table from a PDF to Excel, select the table text in your PDF reader and paste it into Excel. This works on native (digitally created) PDFs with simple table layouts. For scanned PDFs, multi-page tables, or complex formatting, you need a tool that understands table structure: Adobe Acrobat exports PDFs to Excel with decent table preservation, Python libraries like tabula-py extract tables programmatically, and AI-powered tools like Lido extract table data from any PDF format, including scans, without templates or manual cleanup.

The right method depends on what kind of PDF you have and how often you need to do this. A one-time copy-paste from a native PDF takes 30 seconds. Extracting tables from 50 scanned invoices every week is a completely different problem. The five methods below cover the full range, from the simplest manual option to fully automated AI extraction, so you can pick the one that fits your situation.

The core problem is that PDFs do not store data in rows and columns. A PDF stores text as individually positioned characters at specific x-y coordinates on a page. What looks like a table to you is just characters that happen to be aligned in a grid. Every method for extracting a table from a PDF has to reconstruct the row-and-column structure from those character positions. That reconstruction breaks in predictable ways depending on table complexity.

Method 1: Copy and paste (free, instant, limited)

Open your PDF in any PDF reader (Adobe Reader, Preview on Mac, Chrome's built-in viewer, or Edge). Select the table text by clicking and dragging across the table area. Copy it (Ctrl+C or Cmd+C). Open Excel and paste (Ctrl+V or Cmd+V).

When this works: Native PDFs (created digitally by software, not scanned from paper) with simple tables that have clear column separation, no merged cells, no spanning headers, and fit on a single page. If you can select the text in your PDF reader and the selection highlights the text in reading order, copy-paste has a reasonable chance of producing usable output.

When this breaks:

  • Scanned PDFs. If the PDF is an image (from a scanner, fax, or camera), there is no text to select. Copy-paste produces nothing. You need OCR table extraction instead.
  • Multi-column layouts. If the table sits next to other content on the page, the selection captures everything in the region, mixing table data with adjacent text.
  • Multi-page tables. You have to copy-paste each page separately and manually align the rows, removing repeated headers from continuation pages.
  • Merged cells and spanning headers. Column alignment breaks because the PDF stores merged cells as a single text block positioned across multiple column boundaries.
  • Columns without borders. When columns are separated only by whitespace, the copied text carries no delimiter that Excel can split on. All values paste into a single column.

Fix for single-column paste: If everything pastes into one column, select the column in Excel, go to Data > Text to Columns, choose “Fixed width,” and manually set column breaks. This works for simple tables with consistent spacing but fails on tables where values have different widths across rows.
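For readers who prefer a script to the Text to Columns dialog, the same fixed-width split can be sketched in a few lines of Python. This is a minimal illustration, not a general parser: the helper name split_fixed_width, the sample lines, and the break positions (12 and 16) are all made up for the example.

```python
def split_fixed_width(line, breaks):
    """Split one line of pasted text at the given character offsets."""
    edges = [0, *breaks, len(line)]
    # Each pair of neighbouring edges marks one column; strip the padding.
    return [line[start:end].strip() for start, end in zip(edges, edges[1:])]


# Hypothetical lines as they might look after pasting into a single column.
pasted = [
    "Widget A    10   4.50",
    "Gadget B     2  19.99",
]

rows = [split_fixed_width(line, breaks=[12, 16]) for line in pasted]
for row in rows:
    print(row)
```

As with the Excel feature, this only holds up when every row uses the same character offsets. A row whose values are wider than the chosen breaks gets cut in the wrong place.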

Use copy-paste for a single, simple table from a native PDF that you need once. Beyond that, the cleanup time exceeds whatever you save by skipping a proper tool.

Method 2: Adobe Acrobat export (paid, better tables)

Adobe Acrobat Pro can export a PDF directly to an Excel (.xlsx) file. Open the PDF in Acrobat, click File > Export a PDF > Spreadsheet > Microsoft Excel Workbook, choose your save location, and open the resulting file in Excel. The entire process takes under a minute for most documents.

Adobe created the PDF format, and Acrobat's export engine reflects that familiarity with PDF internals. It identifies table boundaries, column structure, and cell alignment from the PDF's internal positioning data. For native PDFs with well-structured tables, Acrobat typically preserves the row-and-column layout accurately enough to work with immediately.

When this works well:

  • Native PDFs generated by accounting software, ERP systems, or spreadsheet-to-PDF exports
  • Tables with visible borders and gridlines
  • Single-page tables with standard formatting
  • Documents where you need the entire page converted, not just one specific table

Limitations:

  • Cost. Acrobat Pro is $22.99 per month. If PDF-to-Excel is your only need, that is expensive.
  • Scanned PDFs. Acrobat runs OCR on scans, but the table reconstruction quality drops noticeably compared to native PDFs. Expect misaligned columns and split cells.
  • Complex tables. Tables with nested sub-tables, merged cells spanning multiple rows, or irregular column widths often export with broken alignment.
  • Multi-page tables. Each page may export as a separate table. Headers from page 2 onward are sometimes treated as data rows.
  • Full-page export. Acrobat converts the entire page, not just the table you want. Surrounding text, headers, footers, and logos end up in the Excel file and need manual removal.

Acrobat is the strongest option among traditional file-conversion tools. If you already have an Acrobat subscription and process native PDFs with reasonably clean table formatting, it handles most single-page tables well. For scanned documents or high-volume processing, the cleanup time per document adds up. For a full comparison of conversion tools, see our PDF to Excel converter roundup.

Method 3: Free online converters (free, privacy trade-off)

Dozens of free websites convert PDF to Excel: Smallpdf, ILovePDF, PDF24, Zamzar, PDF2Go, and others. The workflow is identical across all of them: upload your PDF, wait for processing, download the Excel or CSV file. No software installation, no account required (on most), and results in under a minute for small files.

How they compare:

Tool | Free tier limits | Table quality | Scanned PDFs | Privacy
Smallpdf | 2 files/day | Basic | Limited OCR | Files deleted after 1 hour
ILovePDF | 1-2 files/day | Basic | Limited OCR | Files deleted after 2 hours
PDF24 | Unlimited | Basic | Basic OCR | Desktop app available (local processing)
Zamzar | 2 files/day, 50MB max | Basic | No | Files stored 24 hours
PDF2Go | 3 files/day | Basic | OCR available | Files deleted after 24 hours

When to use them: Occasional one-off conversions of simple, non-sensitive PDFs with basic table layouts. If you need to convert a single PDF once and the table is straightforward, a free converter saves you from installing software.

Why they fail on real-world tables:

  • Table detection is basic. Many free converters are built on open-source PDF parsing libraries (such as pdfplumber or PDFBox) with minimal post-processing. They work on clearly bordered tables and fail on borderless tables, nested tables, and irregular layouts.
  • Column misalignment is common. When column spacing is inconsistent or values have different widths, the converter guesses wrong about column boundaries. The output requires manual column adjustment.
  • Multi-page tables get split. Each page is converted independently. A table that spans pages becomes two or three separate tables in the Excel output, with headers repeated or missing.
  • Privacy. Your document is uploaded to a third-party server. For financial documents, client data, medical records, or anything confidential, this is a non-starter. PDF24 is the exception: they offer a desktop application that processes files locally.

Free online converters occupy the space between copy-paste and paid tools. They handle simple tables better than copy-paste but fall short of Acrobat or AI extraction on anything complex. If privacy matters or you process more than a few documents per week, they are not the right solution.

Method 4: Python libraries (free, technical, scalable)

For developers and technical users, Python libraries offer precise control over PDF table extraction. The two primary options are tabula-py (a Python wrapper around Tabula) and Camelot. Both are free, open-source, and run locally on your machine.

tabula-py detects tables in native PDFs and extracts them as pandas DataFrames. Basic usage:

import tabula

# read_pdf returns a list of pandas DataFrames, one per detected table
tables = tabula.read_pdf("invoice.pdf", pages="all")

# Export the first table to Excel (pandas needs openpyxl installed to write .xlsx)
tables[0].to_excel("output.xlsx", index=False)

tabula-py uses two extraction modes: “lattice” for tables with visible borders (gridlines), and “stream” for tables without borders (using whitespace alignment). You can also specify exact page areas to extract from, which avoids capturing non-table content.
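A hedged sketch of how those options might be passed: read_pdf accepts lattice, stream, pages, area, and guess keyword arguments, and area is given as [top, left, bottom, right] in PDF points. The helper names tabula_options and extract_region, and the coordinates in the example, are hypothetical. You would measure the real coordinates from your own document.

```python
def tabula_options(bordered, area=None, pages="all"):
    """Build keyword arguments for tabula.read_pdf (hypothetical helper)."""
    options = {
        "pages": pages,
        "lattice": bordered,      # ruled tables: follow the gridlines
        "stream": not bordered,   # borderless tables: follow whitespace
    }
    if area is not None:
        # area is [top, left, bottom, right], measured in PDF points.
        options["area"] = list(area)
        # Use the given region instead of letting Tabula guess one.
        options["guess"] = False
    return options


def extract_region(pdf_path, bordered, area=None, pages="all"):
    """Call tabula-py with the options above; returns a list of DataFrames."""
    import tabula  # imported lazily: tabula-py also needs a Java runtime

    return tabula.read_pdf(pdf_path, **tabula_options(bordered, area, pages))


# Example: a borderless table in a made-up region of page 1.
opts = tabula_options(bordered=False, area=(100, 30, 500, 580), pages=1)
print(opts)
```

With tabula-py installed, extract_region("invoice.pdf", bordered=False, area=(100, 30, 500, 580), pages=1) would run the actual extraction on that region.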

Camelot offers similar functionality with more control over table detection parameters:

import camelot

# Extract tables using stream mode (borderless tables).
# Camelot reads only page 1 unless you pass pages, e.g. pages="1-end".
tables = camelot.read_pdf("invoice.pdf", flavor="stream")

# Export the first table to Excel
tables[0].to_excel("output.xlsx")

Camelot reports a per-table accuracy score, letting you identify tables that may need manual review. It also handles tables with merged cells better than tabula-py in most cases.
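One way to act on that score is to filter on it. Each Camelot table exposes a parsing_report dictionary with accuracy, whitespace, order, and page keys. The sketch below works on plain dictionaries of that shape, so it runs without Camelot. The helper name, the 90.0 threshold, and the sample numbers are assumptions for illustration.

```python
def tables_needing_review(reports, min_accuracy=90.0):
    """Return (page, order) for every report below the accuracy threshold."""
    return [
        (report["page"], report["order"])
        for report in reports
        if report["accuracy"] < min_accuracy
    ]


# With Camelot installed, the reports would come from the extracted tables:
#   tables = camelot.read_pdf("invoice.pdf", flavor="stream")
#   reports = [table.parsing_report for table in tables]
# The values below are invented sample data in the same shape.
reports = [
    {"accuracy": 99.1, "whitespace": 12.2, "order": 1, "page": 1},
    {"accuracy": 71.4, "whitespace": 40.0, "order": 2, "page": 1},
]

flagged = tables_needing_review(reports)
print(flagged)
```

The right threshold depends on your documents. A reasonable starting point is to spot-check a few low-scoring tables by hand and adjust from there.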

Strengths of the Python approach:

  • Free and local. No subscription cost, no file uploads to external servers.
  • Scriptable. Process hundreds of PDFs in a batch with a single script.
  • Precise area targeting. Extract a specific table from a specific region on a specific page, ignoring everything else.
  • Output flexibility. Export to Excel, CSV, JSON, or keep as a DataFrame for further processing.
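As a rough sketch of the batch idea, the loop below walks a folder and hands each PDF to whatever extraction function you supply. The name batch_extract and the folder layout are hypothetical. The extractor is passed in as a parameter so the loop itself does not depend on any particular library.

```python
from pathlib import Path


def batch_extract(folder, extract):
    """Run `extract` on every PDF in `folder`; collect results and failures."""
    results, failures = {}, {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf.name] = extract(pdf)
        except Exception as exc:
            # Record the problem and keep going with the remaining files.
            failures[pdf.name] = str(exc)
    return results, failures


# Possible usage with tabula-py (assumes it is installed):
#   import tabula
#   results, failures = batch_extract(
#       "invoices", lambda path: tabula.read_pdf(str(path), pages="all")
#   )
```

Keeping the failures separate matters in practice: one malformed file should not stop a run of several hundred documents.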

Limitations:

  • Native PDFs only. Neither tabula-py nor Camelot includes OCR, so scanned PDFs produce no output. You would need to run a separate OCR step first (Tesseract or similar) to add a searchable text layer to the PDF, and then extract.
  • No semantic understanding. The libraries extract text in table regions but do not understand what the data means. They cannot tell you which column is “Quantity” and which is “Unit Price.” They just capture the text.
  • Tuning required. Each document layout may need different parameters: lattice vs. stream mode, area coordinates, column separators. What works on one PDF format often fails on another.
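  • No column header mapping. Column names come straight from the extraction: Camelot's DataFrames use numbered columns (0, 1, 2), and tabula-py by default takes the first extracted row as the header, whatever it contains. Neither maps columns to semantic labels, so you rename them manually for each format.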
  • Multi-page inconsistency. Tables spanning pages often need custom logic to merge correctly, handling repeated headers and pagination artifacts.
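The custom merge logic for multi-page tables can be fairly small in the simple case. The sketch below assumes each page has already been extracted into a list of rows and that continuation pages repeat the header row exactly. The function name merge_pages and the sample data are made up, and real documents often need fuzzier header matching than this.

```python
def merge_pages(pages):
    """Merge per-page tables (lists of rows) into one table with one header."""
    pages = [page for page in pages if page]  # ignore pages with no rows
    if not pages:
        return []
    header = pages[0][0]
    merged = [header]
    for page in pages:
        for row in page:
            if row == header:
                continue  # skip the header wherever it is repeated
            merged.append(row)
    return merged


# Invented sample: the same table split across two pages.
page_1 = [["Item", "Qty"], ["Widget", "10"]]
page_2 = [["Item", "Qty"], ["Gadget", "2"]]

table = merge_pages([page_1, page_2])
print(table)
```

If the libraries return DataFrames rather than lists, the same idea applies: concatenate the pages, then drop rows that equal the header.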

Python libraries are the best option for developers who process native PDFs from a small number of consistent sources and need programmatic control. They are not practical for non-technical users, for scanned documents, or for workflows with high format diversity. For a broader view of table extraction approaches, see our table extraction software guide.

Method 5: AI-powered extraction (handles everything)

AI-powered table extraction reads a PDF the way a person does: it looks at the page, identifies where tables are, understands what each column represents, and outputs structured data with correct column headers and row alignment. It works on native PDFs, scanned documents, photos of printed pages, and any layout the other methods struggle with.

With Lido, the process is: upload a PDF (or forward it by email, or connect a cloud storage folder), and the AI identifies every table in the document and returns structured data in Excel, Google Sheets, CSV, or JSON. No templates to define, no extraction zones to draw, no training data to provide, and no Python code to write.

What AI extraction handles that other methods cannot:

  • Scanned PDFs. AI runs OCR and table extraction in a single step. No separate OCR pre-processing required.
  • Multi-page tables. The AI recognizes that a table continues across page breaks, merging pages into a single continuous table with one header row.
  • Merged cells and spanning headers. The AI understands column hierarchy (a header that spans two sub-columns) and maps values to the correct columns.
  • Irregular layouts. Tables with inconsistent spacing, missing borders, mixed alignment, or nested sub-tables are handled through layout understanding rather than coordinate matching.
  • Semantic column labeling. The output has meaningful column names (“Description,” “Quantity,” “Unit Price,” “Amount”) rather than generic numbered columns.
  • Line-item extraction from invoices. This is the single hardest table extraction task. Invoice line items vary wildly in formatting, column order, and structure. AI extraction consistently handles them because it reasons about what each value means in context.

When to choose AI extraction over other methods:

  • You process PDFs from multiple sources with different layouts
  • Your PDFs are scanned, photographed, or image-based
  • Tables span multiple pages
  • You need the table data in labeled columns, not raw text grids
  • You process more than 10 documents per week and want automation
  • You cannot spend time on per-document cleanup

Lido offers 50 free pages per month, with paid plans starting at $29/month. For a detailed comparison with other extraction tools, see how to extract data from any PDF.

Why PDF tables break during extraction (the technical reason)

Understanding why table extraction is hard helps you choose the right tool and set realistic expectations for each method.

A typical PDF does not contain a “table” object. Unless the file is a tagged PDF (and many are not), there is no internal tag that says “this is a table with 5 columns and 12 rows.” Instead, a PDF stores a sequence of text-drawing instructions: “place the character ‘I’ at position (72.3, 401.2), place ‘n’ at (77.1, 401.2), place ‘v’ at (82.0, 401.2)...” and so on for every character on the page. Table borders are separate line-drawing instructions that happen to form a grid shape.
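To make that concrete, here is a toy reconstruction of text from positioned characters. It is a simplified sketch, not how any particular tool works. The tuple layout (char, x, y, width), the thresholds, and the sample coordinates are assumptions, with y increasing up the page as in PDF's default coordinate system.

```python
def chars_to_lines(chars, space_gap=2.0, y_tol=1.0):
    """Rebuild lines of text from (char, x, y, width) tuples."""
    lines = []  # each entry: {"y": baseline, "chars": [...]}
    # Top of the page first (largest y), then left to right.
    for ch in sorted(chars, key=lambda c: (-c[2], c[1])):
        if lines and abs(lines[-1]["y"] - ch[2]) <= y_tol:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"y": ch[2], "chars": [ch]})

    texts = []
    for line in lines:
        text, prev_end = "", None
        for char, x, _, width in sorted(line["chars"], key=lambda c: c[1]):
            # A horizontal gap is the only hint that a new word has started.
            if prev_end is not None and x - prev_end > space_gap:
                text += " "
            text += char
            prev_end = x + width
        texts.append(text)
    return texts


# Invented sample, deliberately out of order as it may be in a real file.
chars = [
    ("2", 125.0, 401.2, 5.0), ("I", 72.3, 401.2, 4.8), ("K", 77.3, 388.0, 5.0),
    ("n", 77.1, 401.2, 4.9), ("4", 120.0, 401.2, 5.0), ("v", 82.0, 401.2, 4.5),
    ("O", 72.3, 388.0, 5.0),
]

print(chars_to_lines(chars))
```

Note what is lost: the wide gap before "42" becomes an ordinary space, which is exactly why a pasted table row arrives in Excel as one run of text.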

Any tool that extracts tables from a PDF must reverse-engineer the table structure from those raw positioning instructions. This involves:

  1. Detecting table regions. Which part of the page is a table and which is regular text? This is ambiguous when tables have no borders, when paragraphs are formatted in columns, or when tables sit adjacent to other content.
  2. Identifying columns. Characters aligned vertically might be in the same column, or they might be in adjacent columns with small horizontal gaps. The tool must decide which gaps are column separators and which are normal word spacing.
  3. Identifying rows. Characters on the same horizontal line are usually in the same row, but multi-line cells break this assumption. A cell containing a long description that wraps to two lines looks like two rows at the character-position level.
  4. Handling merged cells. A header that spans three columns is stored as a single text block positioned above those columns. The tool must recognize the spanning relationship.
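Steps 2 and 3 can be sketched with a simple heuristic: treat any horizontal band that no word touches as a column separator, and group words with nearly equal y positions into rows. This is an illustrative toy rather than the algorithm used by any of the tools above. The tuple layout (text, x0, x1, y), the thresholds, and the sample words are assumptions.

```python
def column_spans(words, min_gap=5.0):
    """Merge the x-extents of all words; wide empty gaps separate columns."""
    spans = []
    for _, x0, x1, _ in sorted(words, key=lambda w: w[1]):
        if spans and x0 - spans[-1][1] < min_gap:
            spans[-1][1] = max(spans[-1][1], x1)  # same column, extend it
        else:
            spans.append([x0, x1])  # gap is wide enough: new column
    return spans


def words_to_grid(words, min_gap=5.0, y_tol=2.0):
    """Arrange (text, x0, x1, y) words into rows of cells."""
    spans = column_spans(words, min_gap)
    rows = []  # list of (y, cells)
    for text, x0, x1, y in sorted(words, key=lambda w: (-w[3], w[1])):
        if not rows or abs(rows[-1][0] - y) > y_tol:
            rows.append((y, [""] * len(spans)))
        col = next(i for i, (lo, hi) in enumerate(spans) if lo <= x0 <= hi)
        cells = rows[-1][1]
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]


# Invented sample: a three-column table with right-aligned numbers.
words = [
    ("Item", 72, 95, 700), ("Qty", 200, 218, 700), ("Price", 260, 290, 700),
    ("Widget", 72, 108, 685), ("A", 111, 118, 685), ("10", 205, 218, 685),
    ("4.50", 266, 290, 685),
    ("Gadget", 72, 110, 670), ("2", 211, 218, 670), ("19.99", 258, 290, 670),
]

grid = words_to_grid(words)
for row in grid:
    print(row)
```

The heuristic fails in the ways the list describes. A wrapped description shows up as an extra row, and a header that spans several columns bridges the gaps between them, so those columns collapse into one.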

Simple tables with visible borders, consistent spacing, and no merged cells are straightforward for any tool to parse. The difficulty escalates rapidly with: borderless tables (no gridlines to anchor column detection), multi-line cells (break row assumptions), merged cells (break column assumptions), multi-page spans (no structural continuity signal between pages), and irregular spacing (different column widths in different rows).

This is why each method has a different failure point. Copy-paste fails on anything beyond the simplest tables because the clipboard captures characters in reading order without column structure. Acrobat fails on complex layouts because its parser uses heuristics for column detection that break on irregular spacing. Python libraries fail on scanned PDFs because a scanned page is an image with no text positions to analyze. AI extraction handles the broadest range because it reasons about document structure visually, the same way a person would when reading a table they have never seen before.

Which method should you use?

Match your situation to the right approach:

Your situation | Best method | Why
Simple native PDF, one-time need | Copy-paste | Free, instant, no setup
Native PDF, decent tables, have Acrobat | Adobe Acrobat export | Best file-conversion quality
Occasional simple PDF, no paid tools | Free online converter | No install, handles basic tables
Developer, batch processing, native PDFs | Python (tabula-py/Camelot) | Free, scriptable, precise
Scanned PDFs | AI extraction (Lido) | Only method that handles scans reliably
Multi-page tables | AI extraction (Lido) | Merges pages into continuous table
Multiple formats, recurring volume | AI extraction (Lido) | No per-format setup, automated
Invoices and business documents | AI extraction (Lido) | Semantic field labeling, line-item support
Privacy-sensitive, must stay local | Python or PDF24 desktop | No file upload to external server

Most people start with copy-paste, watch it mangle their table, and escalate through the methods above. The pattern is predictable: simple methods work on simple PDFs, and anything more complex demands a tool that actually understands table structure.

If you process business documents (invoices, purchase orders, bank statements, receipts) at any recurring volume, start with Method 5. The other four methods all produce output that needs manual cleanup. AI extraction produces structured, labeled data that is ready to use immediately. The best PDF to Excel converters all use some form of AI extraction because it is the only approach that scales across document diversity without per-format configuration.

Frequently asked questions

Can I copy a table from a scanned PDF to Excel?

Not with copy-paste or basic converters. Scanned PDFs are images with no selectable text, so there is nothing to copy. You need a tool with OCR (optical character recognition) that converts the image to text and then reconstructs the table structure. Adobe Acrobat Pro includes OCR but produces inconsistent table quality on scans. Python libraries like tabula-py cannot process scanned PDFs at all. AI extraction tools like Lido handle scanned PDFs natively, running OCR and table extraction in a single step and outputting structured data to Excel.

Why does my PDF table paste into one column in Excel?

PDFs store text as positioned characters, not as table cells. When you copy text from a PDF, the clipboard captures a stream of characters in reading order without column separators. Excel receives this as a single text stream and places it in one column. You can try Data > Text to Columns with the “Fixed width” option to set column breaks manually. However, this only works reliably on tables with perfectly consistent spacing. For tables with variable-width values, use Adobe Acrobat export or AI extraction instead of copy-paste.

What is the best free way to extract a table from a PDF?

For native PDFs with visible table borders, tabula-py (Python library) produces the cleanest free results with precise area targeting. For non-technical users, Smallpdf and ILovePDF offer free tiers that handle basic tables adequately. PDF24 is completely free with no daily limits and has a desktop app for local processing. Lido offers 50 free pages per month with AI-powered extraction that handles complex tables, scanned documents, and multi-page tables that free converters cannot process. For simple text-only PDFs, copy-paste costs nothing and works instantly.

How do I extract a multi-page table from a PDF to Excel?

Multi-page tables are where most extraction methods fail. Copy-paste requires extracting each page separately and manually aligning rows. Free online converters typically split each page into a separate table. Adobe Acrobat sometimes handles page continuity but often repeats or drops headers on continuation pages. Python libraries require custom code to merge page results. AI extraction tools like Lido process multi-page PDFs as a single document and merge table content across page breaks automatically, producing one continuous table with a single header row.

Can I automate PDF table extraction to Excel?

Yes, through two approaches. Python scripts using tabula-py or Camelot can batch-process native PDFs and export tables to Excel programmatically, which is free but requires coding ability and only works on digitally created PDFs. AI extraction tools like Lido automate the full pipeline: connect an email inbox or cloud storage folder, and every PDF that arrives is automatically processed with table data exported to Excel, Google Sheets, or your ERP. No code required, and it handles scanned documents alongside native PDFs.
