How to Extract Text From a Scanned PDF (2026 Guide)

June 7, 2026

To extract text from a scanned PDF, you need a tool with OCR (optical character recognition) that reads the characters from the scanned image and converts them into selectable, editable text. AI tools like Lido combine OCR with structural analysis to deliver organized output. Free options like Google Docs and Tesseract work for simple scans but produce raw text without formatting.

A scanned PDF is a photograph of a paper document saved as a PDF file. The file contains an image, not text. You cannot select, copy, or search the content using standard PDF tools because there are no text characters in the file to work with.

This guide covers 5 ways to extract text from a scanned PDF, starting with free options and ending with the most accurate approach for complex documents.

Why You Cannot Copy Text From a Scanned PDF

When you scan a paper document, the scanner takes a picture of the page and saves it inside a PDF file. The result looks like a normal PDF on screen, but underneath it is just an image. There are no text characters stored in the file.

This is why clicking on a word in a scanned PDF does nothing. Your PDF reader cannot find any text to select because none exists. To get the text out, you need software that can look at the image, recognize the letter shapes, and convert them into actual text characters. That software is called OCR.
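One way to confirm that a PDF is a pure scan is to check whether any page yields extractable text. The sketch below assumes the third-party pypdf library is installed. The helper names and the 20-character threshold are illustrative assumptions, not part of any tool covered in this guide.

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True if no page has at least min_chars of extractable text."""
    return all(len(text.strip()) < min_chars for text in page_texts)


def pdf_looks_scanned(path, min_chars=20):
    """Check a PDF file on disk (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported lazily so looks_scanned has no dependencies

    reader = PdfReader(path)
    # Image-only pages typically yield an empty string here
    page_texts = [page.extract_text() or "" for page in reader.pages]
    return looks_scanned(page_texts, min_chars)
```

If pdf_looks_scanned("scanned_document.pdf") returns True, the file most likely needs OCR before any text can be copied out of it.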

Method 1: Google Docs (Free)

Upload the scanned PDF to Google Drive. Right-click the file and select "Open with Google Docs." Google runs OCR automatically during the conversion and produces an editable document with the extracted text.

Google Docs handles simple, single-page scans with clear printed text well. It is completely free and requires no software installation. It struggles with longer documents, where only the first few pages are processed reliably, as well as complex layouts, tables, low-resolution scans, and handwritten text. The output loses all original formatting, so tables and multi-column content come out as unstructured text.

Method 2: Adobe Acrobat Pro

Open the scanned PDF in Acrobat Pro and go to Tools > Scan & OCR > Recognize Text. Acrobat adds a text layer on top of the scanned image, making the text selectable and searchable. You can then copy text directly or export the document to Word or Excel.

Acrobat Pro produces good results on high-resolution scans with standard fonts. It supports dozens of languages and preserves basic formatting during export. It struggles with low-quality scans, heavily compressed images, and complex table structures, and it costs $22.99 per month. For occasional use, it is a solid option if you already have an Acrobat subscription.

Method 3: Free OCR Software

Several free tools can extract text from scanned PDFs without a subscription. Tesseract is the most popular open-source OCR engine and runs locally on your computer. OnlineOCR.net and OCR.space are browser-based alternatives that process files on their servers.

Free OCR tools produce plain text output. They read the characters accurately on clean scans but do not organize the results. A scanned invoice comes out as a block of text with no distinction between field labels, values, and line items. For simple pages where you just need the words, free OCR works. For structured data extraction, you need additional processing on top of the raw OCR output.
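As a rough illustration of that additional processing, the sketch below pulls "Label: value" pairs out of raw OCR text with a regular expression. The sample text, field names, and function name are hypothetical. Real invoices vary widely and usually need more robust rules than this.

```python
import re

# Matches lines such as "Invoice Number: INV-1042" (label, colon, value)
FIELD_PATTERN = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(.+?)\s*$")


def parse_fields(ocr_text):
    """Return a dict of label -> value for every 'Label: value' line."""
    fields = {}
    for line in ocr_text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            label, value = match.groups()
            fields[label] = value
    return fields


# Hypothetical raw OCR output from a scanned invoice
sample = """ACME SUPPLY CO
Invoice Number: INV-1042
Date: 2026-06-07
Total: $1,250.00
Thank you for your business"""

fields = parse_fields(sample)
```

Lines without a colon, such as the company name, are skipped, which is why line items and tables need separate handling.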

Method 4: Python With Tesseract

Developers can automate text extraction from scanned PDFs using Python. The pytesseract library wraps the Tesseract OCR engine and works with pdf2image to convert PDF pages into images for processing.

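from pdf2image import convert_from_path
import pytesseract

# Render each PDF page as a 300 DPI image.
# pdf2image needs Poppler installed; pytesseract needs the Tesseract binary.
pages = convert_from_path("scanned_document.pdf", dpi=300)

# Run OCR on each page image and collect the text.
page_texts = [pytesseract.image_to_string(page) for page in pages]

# Save all pages to one UTF-8 text file, separated by blank lines.
with open("extracted_text.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(page_texts))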

This script converts each page of a scanned PDF to a 300 DPI image, runs OCR on it, and saves all extracted text to a file. You can customize it with Tesseract's language options, page segmentation modes, and preprocessing steps like deskewing or thresholding. As with all basic OCR approaches, the output is unstructured text.
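The customization options mentioned above might look like the following sketch. pytesseract's image_to_string accepts lang and config arguments, and --psm and --oem are Tesseract's page segmentation and engine mode flags. The helper names, the default values, and the threshold of 160 are illustrative assumptions. The sketch also assumes Pillow is installed, which is already the case when pdf2image is in use because it returns Pillow images.

```python
def build_config(psm=6, oem=3):
    """Build a Tesseract config string.

    psm 6 tells Tesseract to assume a single uniform block of text;
    oem 3 selects the default OCR engine mode.
    """
    return f"--psm {psm} --oem {oem}"


def binarize(image, threshold=160):
    """Convert a Pillow image to grayscale, then to pure black and white."""
    gray = image.convert("L")
    # Pixels brighter than the threshold become white, all others black
    return gray.point(lambda p: 255 if p > threshold else 0)


def ocr_page(image, lang="eng", psm=6):
    """OCR one page image after thresholding (assumes pytesseract is installed)."""
    import pytesseract  # imported lazily so build_config works without it

    return pytesseract.image_to_string(
        binarize(image), lang=lang, config=build_config(psm)
    )
```

Calling ocr_page(page, lang="deu") inside the loop would switch to German, provided that language data is installed for Tesseract.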

Method 5: AI-Powered Extraction

AI tools like Lido combine OCR with machine learning to read scanned PDFs and deliver structured, organized output. The AI does not just recognize characters. It understands the document layout, identifies tables, matches field labels to their values, and organizes everything into clean columns.

This is the difference between getting a wall of raw text and getting a properly labeled spreadsheet. Every other method on this list produces text that you still need to organize manually. AI extraction produces data that is ready to use.

How to Extract Text From a Scanned PDF With Lido

Lido is the fastest way to extract text from a scanned PDF with the structure intact. Here is how it works.

1. Upload the Scanned PDF

Drag and drop your scanned PDF into Lido. It handles scans of any quality, resolution, or page count. Multi-page documents, faded text, low-resolution scans, and photographed pages all work.

2. AI Reads and Organizes the Text

Lido runs OCR to read every character on the page, then applies machine learning to understand the document structure. Tables are extracted as tables with correct rows and columns. Form fields are matched to their values. Headers, line items, and totals are identified and labeled automatically.

3. Export the Results

Export the structured data to Excel, Google Sheets, CSV, or QuickBooks. The output is clean, labeled, and ready to use without manual reformatting or cleanup.

Lido delivers 99%+ field-level accuracy on scanned PDFs and includes a 24-hour refinement window for corrections at no extra cost. It is SOC 2 Type II and HIPAA compliant. Start with 50 free pages to test it on your own scanned documents.

Tips for Better Text Extraction From PDF Scans

The quality of your scan has a direct impact on how accurate the extracted text will be. These tips help you get the best results regardless of which tool you use.

Scan at 300 DPI or higher. Resolution has the biggest impact on OCR accuracy. 300 DPI is the standard for text documents. Lower resolutions force the OCR engine to guess at characters, especially on smaller text.

Use grayscale or black-and-white mode for text documents. Color scans produce larger files and can introduce noise that confuses OCR engines. For documents that are primarily text, grayscale or black-and-white scanning produces better results.

Keep pages straight. Skewed or rotated pages reduce OCR accuracy significantly. Use your scanner's auto-straighten feature or straighten pages before processing. Most OCR tools can handle slight skew, but anything beyond a few degrees causes errors.

Avoid scanning through plastic sleeves. Sheet protectors and laminated covers create glare and reduce contrast, which makes character recognition harder. Remove documents from sleeves before scanning when possible.
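The 300 DPI guideline from the first tip can be checked with simple arithmetic: the effective resolution is the pixel width of the scan divided by the physical page width in inches. The function names and the US Letter default of 8.5 inches are assumptions for illustration.

```python
def effective_dpi(pixel_width, page_width_inches=8.5):
    """Estimate scan resolution from pixel width and physical page width."""
    return pixel_width / page_width_inches


def meets_ocr_guideline(pixel_width, page_width_inches=8.5, min_dpi=300):
    """Return True if the scan reaches the recommended minimum DPI."""
    return effective_dpi(pixel_width, page_width_inches) >= min_dpi


# A US Letter page scanned 2550 pixels wide works out to 300 DPI
letter_dpi = effective_dpi(2550)
# The same page at 1275 pixels wide is only 150 DPI
low_dpi = effective_dpi(1275)
```

A page image that falls below the guideline is usually better rescanned than upscaled, since upscaling does not restore detail the scanner never captured.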

Now that you know how to extract text from a scanned PDF, you can process paper documents digitally without retyping them.

Frequently Asked Questions

How Do I Extract Text From a Scanned PDF?

Use a tool with OCR to read the text from the scanned image. Free options include Google Docs and Tesseract. For structured output with tables and labeled fields preserved, use an AI tool like Lido. Standard copy-paste does not work on scanned PDFs because they contain images, not text.

Can I Extract Text From a Scanned PDF for Free?

Yes. Upload the PDF to Google Drive and open it with Google Docs for free OCR. Tesseract is a free open-source OCR engine you can run locally. OnlineOCR.net is a free browser-based option. Free tools work for simple scans but produce raw text without structure.

What Is the Best Tool to Extract Text From a PDF Scan?

For raw text, Tesseract and Adobe Acrobat Pro are both effective on clean scans. For structured data where tables, fields, and formatting are preserved, AI tools like Lido produce the best results because they understand document layout in addition to reading characters.

Why Can I Not Select Text in My Scanned PDF?

Because the PDF contains an image of the page, not actual text characters. When you scan a paper document, the scanner takes a photograph and saves it as a PDF. There is no text in the file for your PDF reader to select. You need OCR to convert the image into selectable text.

How Accurate Is OCR on Scanned PDFs?

On clean, high-resolution scans with standard printed fonts, modern OCR achieves 95-99% character accuracy. Accuracy drops on low-resolution scans, faded text, unusual fonts, and handwriting. AI tools like Lido achieve 99%+ field-level accuracy by combining OCR with contextual understanding.
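Character accuracy figures like these are commonly computed from the edit distance between the OCR output and a hand-checked reference transcription. The sketch below is one such calculation under that assumption. The function names and sample strings are illustrative, and individual vendors may measure accuracy differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters recovered correctly, floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


# One misread character ("i" read as "1") in a 10-character reference
accuracy = character_accuracy("Invoice 42", "Invo1ce 42")
```

Here a single misread character in ten gives roughly 90% character accuracy, which shows how quickly small error rates add up across a full page.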

Can I Extract Text From a Handwritten Scanned PDF?

Basic OCR tools struggle with handwriting. AI-powered tools like Lido use models trained on handwritten text and can read most legible handwriting. Accuracy depends on the clarity and consistency of the handwriting. Printed text always produces better results than handwritten text.
