How to Convert PDF to XML: Tools and Methods

May 5, 2026

To convert a PDF to XML, use Adobe Acrobat’s Export function for simple text PDFs, Python libraries (pdfplumber or PyMuPDF) for programmatic extraction, or an AI extraction tool like Lido that outputs structured data you can serialize to any XML schema. The right method depends on whether you need raw text preservation or structured field extraction mapped to a specific XML format like EDI, XBRL, or a custom integration schema.

PDF and XML serve opposite purposes. PDF preserves visual layout. It’s a print-ready format designed to look identical on every screen and printer. XML preserves data structure. It’s a machine-readable format designed to move information between systems. Converting from one to the other requires deciding what you actually want: a text dump wrapped in XML tags, or meaningfully structured data that follows a defined schema.

Most people searching for “PDF to XML” want the second thing. They have invoices, purchase orders, shipping documents, or regulatory filings in PDF format, and they need that data in a structured XML format for import into an ERP, EDI system, government portal, or data pipeline. Lido handles this by extracting structured fields from any PDF using AI, returning JSON that you can transform to any target XML schema without manual data entry.

This guide covers four methods ranked by complexity and capability, from point-and-click tools to fully programmable solutions.

What PDF to XML conversion means

There are two fundamentally different operations people call “PDF to XML conversion,” and confusing them leads to wasted time and wrong tool choices.

Structure preservation takes the visual elements of a PDF (paragraphs, headings, tables, images) and wraps them in XML tags that describe their layout role. Adobe Acrobat’s XML export does this. You get tags like <Para>, <Table>, <Figure>, and <Span> that mirror the PDF’s visual structure. The output is valid XML, but it doesn’t extract meaning from the content.

Data extraction reads the PDF, understands what the content represents (invoice fields, table rows, form entries), and outputs XML with semantic tags that match your target schema. An invoice becomes <Invoice><VendorName>Acme Corp</VendorName><Total>4,250.00</Total></Invoice>. This requires OCR or AI to interpret the document, not just parse its layout.

If you need to feed data into another system (an ERP, a customs filing platform, a financial reporting tool), you almost certainly need data extraction, not structure preservation.

When you need XML output

XML remains the standard interchange format for several industries and use cases, even as JSON has taken over in web APIs.

EDI (Electronic Data Interchange): Trading partners in supply chain, logistics, and retail exchange purchase orders, invoices, and advance ship notices in EDI formats, whether traditional syntaxes (EDIFACT, X12) or XML-based schemas (UBL and proprietary buyer formats). Converting a PDF invoice to the buyer’s required XML format is a daily task for thousands of vendors without EDI-capable accounting systems.

XBRL (eXtensible Business Reporting Language): Publicly traded companies file financial statements with the SEC in XBRL format. Accounting teams working from PDF financial statements need to convert values into XBRL-tagged XML with precise taxonomy references.

Regulatory and government filing: Tax authorities, customs agencies, and compliance bodies accept or require XML submissions. Converting PDF source documents to the required XML schema is a common workflow for customs brokers, tax preparers, and compliance officers.

System integrations: Legacy enterprise systems (SAP, Oracle, older healthcare platforms) often accept XML batch imports where newer systems accept JSON. If your target system’s import expects XML, you need XML output regardless of what format you extract data in initially.

Archival and search: Converting PDF documents to XML makes their content searchable and indexable by enterprise search systems. Libraries, government archives, and legal discovery platforms use XML for document corpus management.

Method 1: Adobe Acrobat Export to XML

Adobe Acrobat Pro includes an Export function that converts PDF to XML. This is the simplest method and requires zero technical knowledge.

How it works: Open the PDF in Acrobat Pro. Go to File → Export To → XML 1.0. Choose a save location. Acrobat generates an XML file with tags reflecting the document’s structural elements.

Output format: Acrobat’s XML export produces a tagged structure based on the PDF’s internal tag tree (if the PDF is tagged) or its best guess at structure (if it’s not). Tables become <Table> elements with <TR> and <TD> children. Paragraphs become <P> elements. The output preserves document structure but does not extract semantic meaning.
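
For illustration, a structure-preserving export of a simple invoice page might look like the sketch below; the tag names follow Acrobat’s conventions, but the exact output depends on the PDF’s tag tree:

<TaggedPDF-doc>
  <P>Invoice INV-2026-0847</P>
  <Table>
    <TR><TD>Widget A</TD><TD>100</TD><TD>12.50</TD></TR>
    <TR><TD>Widget B</TD><TD>50</TD><TD>24.00</TD></TR>
  </Table>
</TaggedPDF-doc>

Note that nothing in this output identifies the cells as line items or amounts; that interpretation is left entirely to you.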

Limitations: The XML schema is Adobe’s proprietary format, not your target schema. You still need to transform the output to match whatever system you’re feeding. Scanned PDFs produce poor results because Acrobat’s export relies on the PDF having embedded text or a tag tree. It doesn’t run OCR automatically. Tables in non-tagged PDFs often export with incorrect cell boundaries.

Best for: Quick one-off conversions of digitally created PDFs (not scans) when you need raw text in XML tags and plan to transform it manually afterward. Not suitable for batch processing or recurring workflows.

Method 2: Python libraries (pdfplumber + xml.etree, PyMuPDF)

For developers who need programmatic PDF-to-XML conversion, Python offers several libraries that extract text, tables, and metadata from PDFs and let you build custom XML output.

Using pdfplumber + xml.etree.ElementTree

pdfplumber excels at table extraction. It identifies table structures in PDFs and returns them as lists of rows and cells. Combined with Python’s built-in xml.etree.ElementTree module, you can build XML output in any schema you define.

import pdfplumber
import xml.etree.ElementTree as ET

# Extract table data from PDF
with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()  # returns None if no table is detected

# Build XML
root = ET.Element("Invoice")
for row in table[1:]:  # skip header row
    item = ET.SubElement(root, "LineItem")
    ET.SubElement(item, "Description").text = row[0]
    ET.SubElement(item, "Quantity").text = row[1]
    ET.SubElement(item, "UnitPrice").text = row[2]
    ET.SubElement(item, "Total").text = row[3]

tree = ET.ElementTree(root)
tree.write("output.xml", encoding="utf-8", xml_declaration=True)

Using PyMuPDF (fitz)

PyMuPDF provides fast text extraction with position information. It’s better than pdfplumber for full-page text extraction but less specialized for tables.

import fitz  # PyMuPDF is imported under the name "fitz"
import xml.etree.ElementTree as ET

doc = fitz.open("document.pdf")
root = ET.Element("Document")

for page_num, page in enumerate(doc):
    page_elem = ET.SubElement(root, "Page", number=str(page_num + 1))
    blocks = page.get_text("dict")["blocks"]
    for block in blocks:
        if block["type"] == 0:  # text block
            para = ET.SubElement(page_elem, "Paragraph")
            para.text = " ".join(
                span["text"] for line in block["lines"]
                for span in line["spans"]
            )

tree = ET.ElementTree(root)
tree.write("output.xml", encoding="utf-8", xml_declaration=True)

Limitations: Both libraries work only on digitally created PDFs with embedded text. Scanned documents need OCR preprocessing (Tesseract or similar) before these libraries can extract text. Table detection in pdfplumber relies on visible cell borders, so borderless tables often extract incorrectly. For scanned tables specifically, see our guide on OCR table to Excel extraction, which covers methods that handle image-based documents. Neither library understands what the content means; you must write code to assign semantic meaning (mapping “the number in the bottom-right cell” to <InvoiceTotal>).
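
For scanned files, a minimal OCR preprocessing sketch, assuming Tesseract and Poppler are installed locally ("scanned.pdf" and the <Document>/<Page> tags are placeholders):

import pytesseract
import xml.etree.ElementTree as ET
from pdf2image import convert_from_path

# Render each PDF page to an image, then OCR it with Tesseract
pages = convert_from_path("scanned.pdf", dpi=300)

root = ET.Element("Document")
for page_num, image in enumerate(pages, start=1):
    page_elem = ET.SubElement(root, "Page", number=str(page_num))
    page_elem.text = pytesseract.image_to_string(image)

ET.ElementTree(root).write("output.xml", encoding="utf-8", xml_declaration=True)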

Best for: Developers building batch processing pipelines for documents with consistent, known layouts. Requires coding skill and layout-specific logic for each document type.

Library | Strengths | Weaknesses | Best use case
pdfplumber | Table extraction, cell-level precision | Slow on large PDFs, no OCR | Invoices and forms with bordered tables
PyMuPDF (fitz) | Fast, position-aware text blocks | Weak table detection | Full-page text extraction, metadata
pdf2xml (CLI tool) | Preserves layout coordinates | Output format is layout-centric, not semantic | Research, layout analysis
Camelot | Handles borderless tables | Requires Ghostscript, slow | Complex table-heavy documents

Method 3: Online converters (limitations and privacy concerns)

Dozens of free online tools offer PDF-to-XML conversion: Zamzar, CloudConvert, PDF2Go, Convertio, and others. They work by uploading your PDF to a remote server, processing it, and returning an XML file.

How they work: Upload a PDF through a web form. Wait 10–60 seconds. Download the resulting XML file. Most use Adobe’s structure-preservation approach or simple text extraction wrapped in generic XML tags.

Output quality: Variable. Most online converters produce XML with generic structural tags (paragraphs, spans) rather than semantic extraction. Tables frequently break. Multi-page documents sometimes lose page boundaries. None of them produce output matching a custom XML schema, so you still need a transformation step.

Privacy and security concerns: Your documents travel to and are processed on third-party servers. For sensitive business documents (invoices with bank details, contracts with pricing, medical records, financial statements), this creates compliance and confidentiality risks. Most free tools don’t specify data retention policies. Some explicitly state they keep uploaded files for 24 hours. For any document containing PII, financial data, or trade secrets, online converters are unsuitable.

Volume limits: Free tiers typically cap at 2–5 conversions per day or 10–25MB file size. Paid tiers ($5–$15/month) raise these limits but still don’t solve the schema mapping problem.

Best for: One-off conversions of non-sensitive documents where approximate structure is acceptable. Not suitable for production workflows, recurring document processing, or any document containing confidential information.

Method 4: AI extraction to structured XML (Lido API → JSON → XML)

The most capable approach combines AI-powered data extraction with schema mapping. Instead of parsing PDF structure, an AI model reads the document the way a human would, understanding that “Total Due: $4,250.00” is an invoice total regardless of where it appears on the page.

How it works with Lido:

Step 1: Define the fields you need. For an invoice destined for EDI import, that might be: buyer name, buyer address, seller name, seller address, invoice number, invoice date, payment terms, line item descriptions, quantities, unit prices, and totals.

Step 2: Send the PDF to Lido’s extraction API (or upload through the web interface). Lido returns structured JSON with each field populated from the document content.

Step 3: Transform the JSON to your target XML schema. This is a straightforward key mapping. No parsing, no position detection, no template building. A simple Python script or XSLT stylesheet converts Lido’s JSON output to whatever XML structure your target system expects.

import xml.etree.ElementTree as ET

# Lido extraction output (a parsed JSON response)
data = {
    "vendor_name": "Acme Manufacturing Ltd",
    "invoice_number": "INV-2026-0847",
    "invoice_date": "2026-04-28",
    "line_items": [
        {"description": "Widget A", "qty": "100", "unit_price": "12.50", "total": "1250.00"},
        {"description": "Widget B", "qty": "50", "unit_price": "24.00", "total": "1200.00"}
    ],
    "subtotal": "2450.00",
    "tax": "196.00",
    "total": "2646.00"
}

# Build target XML schema
root = ET.Element("InvoiceDocument", xmlns="urn:your-edi-schema")
header = ET.SubElement(root, "Header")
ET.SubElement(header, "SupplierName").text = data["vendor_name"]
ET.SubElement(header, "DocumentNumber").text = data["invoice_number"]
ET.SubElement(header, "IssueDate").text = data["invoice_date"]

items = ET.SubElement(root, "LineItems")
for item in data["line_items"]:
    li = ET.SubElement(items, "Item")
    ET.SubElement(li, "Description").text = item["description"]
    ET.SubElement(li, "Quantity").text = item["qty"]
    ET.SubElement(li, "Price").text = item["unit_price"]
    ET.SubElement(li, "Amount").text = item["total"]

summary = ET.SubElement(root, "Summary")
ET.SubElement(summary, "Subtotal").text = data["subtotal"]
ET.SubElement(summary, "TaxAmount").text = data["tax"]
ET.SubElement(summary, "GrandTotal").text = data["total"]

tree = ET.ElementTree(root)
tree.write("output.xml", encoding="utf-8", xml_declaration=True)

Why this works best for production use: The AI handles document variability (different vendor layouts, scanned vs. digital, multi-page documents) without per-format configuration. The schema mapping step is simple code that rarely changes. You separate the hard problem (understanding document content) from the easy problem (serializing data to XML).

See how to extract data from any PDF for the full API documentation and workflow setup.

XML schema considerations

Before converting any PDF to XML, define your target schema. The schema determines what tags you use, what attributes are required, and how elements nest. Without a defined schema, you produce XML that’s technically valid but useless for system integration.

Using an existing standard schema: If you’re targeting a known system, the schema already exists. UBL (Universal Business Language) defines schemas for invoices, purchase orders, and dispatch advices. XBRL defines financial reporting taxonomies. HL7 CDA defines clinical document formats. Your job is mapping extracted fields to the correct elements in the standard schema.

Defining a custom schema: If your target system uses a proprietary XML format, get a sample file from that system’s documentation or export function. Reverse-engineer the structure. Build your mapping to match exactly. Pay attention to required vs. optional elements, data types (dates must match the expected format), and namespace declarations.

Validation: Always validate your generated XML against the target schema (XSD) before submission. Python’s lxml library handles XSD validation. Catching schema violations programmatically prevents rejected submissions and manual rework.
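
A short validation sketch using lxml ("schema.xsd" and "output.xml" are placeholders for your target XSD and generated file):

from lxml import etree

schema = etree.XMLSchema(etree.parse("schema.xsd"))
doc = etree.parse("output.xml")

if schema.validate(doc):
    print("Valid against the target schema")
else:
    # error_log lists each violation with its line number
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")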

Target system | XML schema | Common source PDFs | Fields required
EDI (supply chain) | UBL 2.1 Invoice | Vendor invoices, POs | Party IDs, line items, tax categories
SEC filing | XBRL US-GAAP | Financial statements | Taxonomy concepts, periods, units
Customs (CBP) | ACE XML | Commercial invoices, packing lists | HTS codes, quantities, country of origin
Healthcare | HL7 CDA R2 | Lab reports, discharge summaries | Patient ID, observations, codes
Banking (payments) | ISO 20022 (pain.001) | Payment instructions | IBAN, BIC, amounts, references

Handling complex PDFs (tables, multi-page, nested structures)

Real-world PDFs are messy. Single-page invoices with clearly bordered tables are the easy case. Production workflows deal with multi-page documents, borderless tables, nested line items, headers that repeat across pages, and merged cells. Here’s how each method handles complexity.

Multi-page documents: Adobe Acrobat handles pagination automatically. Python libraries require iterating over pages and stitching content together. You need logic to detect when a table spans pages and merge the rows. AI extraction (Lido) processes the full document as a single unit and returns consolidated data regardless of page breaks. A 12-page invoice with line items spanning pages 2–8 returns one complete line items array.
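
For the Python route, a stitching sketch with pdfplumber, assuming each page repeats the same header row ("invoice.pdf" is a placeholder):

import pdfplumber

all_rows = []
with pdfplumber.open("invoice.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        table = page.extract_table()
        if table is None:
            continue  # page has no detectable table
        # keep the header row from the first page only
        all_rows.extend(table if i == 0 else table[1:])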

Borderless tables: pdfplumber struggles with tables that use whitespace instead of lines for cell separation. Camelot handles these better using its “stream” mode. AI extraction doesn’t rely on visual table detection. It recognizes that columns of aligned text represent tabular data based on context.
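
A minimal Camelot sketch for a borderless table, assuming Ghostscript is installed ("statement.pdf" is a placeholder):

import camelot

# "stream" mode infers columns from whitespace instead of ruled lines
tables = camelot.read_pdf("statement.pdf", flavor="stream", pages="1")
print(tables[0].df)  # each detected table is exposed as a pandas DataFrame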

Nested structures: A purchase order with multiple ship-to addresses, each containing its own line items, or an invoice with line items grouped by project code. Python code handles this only if you build explicit grouping logic. AI extraction can return nested JSON structures (line items grouped by section) that map directly to nested XML elements.

Headers and footers: Repeating page elements (company logo area, page numbers, “continued on next page” markers) create noise in extraction output. Python libraries include these unless you explicitly filter by position coordinates. AI extraction ignores them automatically because it understands they’re not document data.
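
With pdfplumber, filtering by position means cropping the page before extraction. A sketch, assuming the header occupies the top 60 points and the footer the bottom 40 (both values are layout-specific; "report.pdf" is a placeholder):

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        # crop box is (x0, top, x1, bottom) in PDF points
        body = page.crop((0, 60, page.width, page.height - 40))
        text = body.extract_text()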

Mixed content: Some PDFs contain both text and scanned images (common in contracts with signed pages, or invoices with a scanned header on a digital body). Python libraries extract only the text layer. AI extraction with OCR handles both layers in a single pass.

For detailed strategies on handling complex documents, the PDF parsing guide covers the underlying techniques. For extraction tool comparisons, see best PDF data extraction tools.

The core decision when selecting a method comes down to document variability. If every PDF you process has the same layout (one vendor, one format), Python scripts with hardcoded position logic work reliably and cost nothing beyond development time. If you process documents from dozens of sources with different layouts, AI extraction eliminates the per-format development work. The best PDF data extractors handle this variability without per-document configuration.

For ongoing workflows where PDFs arrive regularly and need XML output for system integration, the AI extraction approach (Method 4) provides the best balance of accuracy, maintainability, and setup time. For one-off research tasks or simple text-heavy documents, Methods 1–3 work adequately. If you’re working with tabular PDFs destined for spreadsheets before XML, see copying tables from PDF to Excel as an intermediate step.

Frequently asked questions

How do I convert a PDF to XML?

The simplest method is Adobe Acrobat Pro: open the PDF, go to File → Export To → XML 1.0. This produces structure-preserving XML with layout tags. For semantic data extraction (getting specific field values into a defined XML schema), use an AI extraction tool like Lido to pull structured data from the PDF, then serialize that data to your target XML format. For developers, Python libraries like pdfplumber extract table data that you can build into XML using xml.etree.ElementTree. The right method depends on whether you need layout preservation or meaningful data extraction.

Can you extract structured data from PDF to XML?

Yes. AI extraction tools read PDFs and return structured data (vendor names, invoice numbers, line items, amounts) that you can map to any XML schema. The process works in two steps: first extract the data into structured fields (JSON), then transform those fields to your target XML format using a simple mapping script or XSLT stylesheet. This approach handles scanned documents, varied layouts, and multi-page files without per-format configuration. It produces XML with semantic tags that match your target system’s import requirements.

What is the best free PDF to XML converter?

For simple text-based PDFs, pdf2xml (open-source command line tool) converts PDF content to XML with position coordinates. For Python developers, pdfplumber (free, open-source) extracts text and tables that you can serialize to XML programmatically. Online converters like PDF2Go offer free tiers but upload your documents to external servers, creating privacy risks for sensitive files. None of these produce semantic XML mapped to a custom schema—they output generic structural tags. For production workflows needing schema-specific XML, Lido offers a free tier (50 pages/month) with AI extraction that handles any document layout.

Is there a Python library for PDF to XML?

Several Python libraries handle parts of the PDF-to-XML pipeline. pdfplumber extracts text and tables from digital PDFs. PyMuPDF (fitz) provides fast text extraction with position data. Camelot specializes in table extraction from complex layouts. Python’s built-in xml.etree.ElementTree or the lxml library builds XML output from extracted data. None of these perform OCR—for scanned PDFs, add pytesseract or pdf2image as a preprocessing step. The typical workflow chains pdfplumber (extraction) with ElementTree (XML generation) and lxml (schema validation).

What is the difference between PDF to XML and PDF to JSON?

XML and JSON are two serializations of the same extracted data. XML uses angle-bracket tags with attributes and is required by EDI systems, XBRL filing, government portals, and legacy enterprise integrations. JSON uses curly braces and key-value pairs and is preferred by modern web APIs, NoSQL databases, and JavaScript-based applications. The extraction step is identical—an AI reads the PDF and returns structured data. The output format is a serialization choice. Most extraction tools (including Lido) return JSON natively, and converting JSON to XML requires a simple mapping script of 10–20 lines of code.

Ready to grow your business with document automation, not headcount?

Join hundreds of teams growing faster by automating the busywork with Lido.