Python PDF to CSV: Convert PDF Tables with Python
Learn practical methods to convert PDF data to CSV in Python using tabula-py and Camelot, with code samples, data cleaning tips, and best practices for reliable table extraction.

To convert a PDF to CSV in Python, start by identifying the PDF's table structure. Use libraries such as tabula-py for table extraction or Camelot for structured tables, then convert the results to CSV with pandas. This guide offers practical workflows for simple and multi-page documents, including common edge cases.
Why PDF to CSV matters in data workflows
PDFs are ubiquitous as distribution formats, and turning them into CSVs unlocks data manipulation with pandas, SQL, and BI tools. Data teams increasingly automate PDF-to-CSV pipelines to save time and reduce manual data-entry errors. Common scenarios include vendor reports with quarterly tables, financial statements, and research papers whose tables hold key results. This article compares two mainstream approaches: tabula-py, which wraps the Java-based Tabula tool, and Camelot, a pure-Python option with different table-detection modes. No tool fits every PDF, so understanding layout cues (borders, whitespace, and header rows) matters. Before you code, plan how to handle multi-row headers, merged cells, and rotated text. The goal is a clean CSV suitable for downstream analysis and reproducible workflows. The sections that follow show how to start a conversion with each library and how to validate the extracted data with pandas.
```python
import tabula

# read all tables from a sample PDF
dfs = tabula.read_pdf("sample.pdf", pages="all", multiple_tables=True)
print([df.shape for df in dfs])
```

```python
import pandas as pd

# merge into one DataFrame (if you have multiple tables)
combined = pd.concat(dfs, ignore_index=True)
combined.head()
```

In practice, decide early whether you expect well-delimited grid tables (lattice) or text-like tables (stream). This choice affects detection parameters, table-area tuning, and how you handle headers. The following steps illustrate a pragmatic path: install dependencies, try both libraries on a small PDF, and compare the extracted data with the source. Finally, create a clean CSV with consistent encoding and column names.
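When it is unclear which mode a PDF needs, one pragmatic approach is to extract with both and keep the result that fills more cells. Below is a minimal sketch of such a comparison; the `score_table` helper and its non-empty-cell heuristic are illustrative assumptions, not part of either library.

```python
import pandas as pd

def score_table(df: pd.DataFrame) -> float:
    """Fraction of non-empty cells; a rough proxy for extraction quality."""
    if df.empty:
        return 0.0
    filled = df.replace("", pd.NA).notna().sum().sum()
    return float(filled) / df.size

# compare two candidate extractions (e.g. lattice vs. stream output)
lattice_df = pd.DataFrame({"a": ["1", "2"], "b": ["3", "4"]})
stream_df = pd.DataFrame({"a": ["1", ""], "b": ["", ""]})

# keep whichever extraction scored higher
best = max([lattice_df, stream_df], key=score_table)
print(score_table(lattice_df), score_table(stream_df))  # 1.0 0.25
```

The same helper works on the DataFrames returned by tabula-py or by Camelot's `Table.df` attribute.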
Steps
Estimated time: 60-120 minutes
1. Install dependencies
Set up Python 3.8+ and pip in a virtual environment. Install Java if you plan to use tabula-py. Verify installations with `python --version`, `pip --version`, and `java -version`.
Tip: Use a virtualenv or conda environment to keep project dependencies isolated.
2. Choose a library and test on a sample PDF
Decide between tabula-py and Camelot by testing a small PDF. Run a quick extraction to inspect the DataFrame shapes and header detection.
Tip: Start with a simple, clearly tabular page to confirm the extraction basics.
3. Extract tables with tabula-py
Run `read_pdf` with `pages` and `multiple_tables`, then concatenate results into a single DataFrame for inspection.
Tip: If headers are misdetected, try `lattice=True` to enforce grid boundaries.
4. Handle edge cases (rotations, merged cells)
If you encounter rotated text or merged cells, experiment with Camelot flavor options and page-area cropping to improve extraction.
Tip: Rotated pages often require the stream flavor and careful area specs.
5. Clean and normalize data
Use pandas to drop empty rows/columns, normalize headers, and standardize data types before saving to CSV.
Tip: Normalize headers to lowercase and replace spaces with underscores for consistency.
6. Save to CSV and validate
Write the final DataFrame to CSV with UTF-8 encoding and perform quick validation checks.
Tip: Include a small test that asserts the output is non-empty and has reasonable column counts.
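Steps 5 and 6 can be sketched as a small pandas pipeline. The `clean_table` helper, the sample data, and the output file name are illustrative assumptions:

```python
import pandas as pd

def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Drop fully empty rows/columns and normalize headers for CSV export."""
    out = df.dropna(how="all").dropna(axis=1, how="all")
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    return out

# hypothetical extracted table: one blank row and one blank column
raw = pd.DataFrame({
    "Unit Price": [1.5, None, 2.0],
    "Qty Sold": [3, None, 4],
    "Notes": [None, None, None],
})

clean = clean_table(raw)
clean.to_csv("output.csv", index=False, encoding="utf-8")

# quick validation: non-empty, expected column names
assert not clean.empty
assert list(clean.columns) == ["unit_price", "qty_sold"]
```

The final assertions are the kind of lightweight check worth keeping in an automated pipeline.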
Prerequisites
Required
- Python 3.8+
- pip package manager
- Java runtime (if using tabula-py)
- Basic Python scripting knowledge
Optional
- PDFs to practice with
Commands
| Action | Command | Notes |
|---|---|---|
| Install tabula-py | `pip install tabula-py` | Requires a Java runtime; ensure Java is installed |
| Install Camelot | `pip install "camelot-py[cv]"` | May need additional dependencies for image processing (opencv, ghostscript) depending on OS |
| Verify Java installation | `java -version` | tabula-py relies on tabula-java |
| Extract with tabula-py (example) | `python -c 'import tabula, pandas as pd; dfs = tabula.read_pdf("input.pdf", pages="all", multiple_tables=True); pd.concat(dfs, ignore_index=True).to_csv("output.csv", index=False)'` | Handles multiple tables; concatenate as needed |
| Extract with Camelot (example) | `python -c 'import camelot; [t.to_csv(f"table_{i+1}.csv") for i, t in enumerate(camelot.read_pdf("input.pdf", pages="1-end", flavor="stream"))]'` | Use `flavor="lattice"` for grid-based tables |
People Also Ask
What is the difference between tabula-py and Camelot for PDF table extraction?
Tabula-py is a Python wrapper around the Java-based Tabula tool and often excels with grid-like tables. Camelot is a pure-Python option that can be more approachable and flexible for text-based tables. Depending on the PDF layout, one may outperform the other.
Do I need Java to use tabula-py?
Yes. Tabula-py relies on the tabula-java library under the hood, so a Java Runtime Environment is required for table extraction.
Can Camelot read rotated text or merged cells?
Camelot can handle some rotated or merged cells, especially with the stream flavor, but complex layouts may require manual tuning or alternative approaches.
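For stubborn layouts, Camelot's stream flavor accepts explicit `table_areas` strings in PDF points, formatted `"x1,y1,x2,y2"` from top-left to bottom-right. A sketch follows; the file name and the `area_str` convenience helper are illustrative assumptions, not part of Camelot:

```python
def area_str(x1: float, y1: float, x2: float, y2: float) -> str:
    """Format coordinates as Camelot's 'x1,y1,x2,y2' table_areas string."""
    return f"{x1},{y1},{x2},{y2}"

def extract_area(path: str, area: str):
    # import inside the function so the sketch loads without Camelot installed
    import camelot
    return camelot.read_pdf(path, pages="1", flavor="stream", table_areas=[area])

# crop to the upper portion of a hypothetical page before extracting
print(area_str(50, 750, 550, 50))
# tables = extract_area("rotated.pdf", area_str(50, 750, 550, 50))
```

Tools like a PDF viewer's measurement mode (or Camelot's `plot` debugging) help find usable coordinates.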
Is there a straightforward way to merge multiple tables into a single CSV?
Yes. Read all tables into separate DataFrames and concatenate them with pandas.concat, then save as a single CSV.
How should I handle PDFs with no clear table borders?
Prefer Camelot's stream flavor and experiment with crop areas or manual table extraction to identify headers and rows.
What encoding should I use when exporting to CSV?
UTF-8 is the safest default; specify encoding='utf-8' when saving and test with non-ASCII data.
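A quick round-trip with non-ASCII data confirms the encoding survives export; the sample values and file name are illustrative:

```python
import pandas as pd

# sample data containing accented characters
df = pd.DataFrame({"city": ["Zürich", "São Paulo"], "total": [1200, 850]})
df.to_csv("export.csv", index=False, encoding="utf-8")

# read it back and verify nothing was mangled
back = pd.read_csv("export.csv", encoding="utf-8")
assert back["city"].tolist() == ["Zürich", "São Paulo"]
```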
Main Points
- Choose between tabula-py and Camelot based on table structure
- Validate and clean data before downstream use
- Handle multi-page PDFs with careful page selection and merging
- Preserve UTF-8 encoding to avoid character corruption
- Automate extraction with small, repeatable scripts