Python PDF to CSV: Convert PDF Tables with Python
Learn practical methods to convert PDF data to CSV in Python using tabula-py and Camelot, with code samples, data cleaning tips, and best practices for reliable table extraction.

To convert a PDF to CSV in Python, start by identifying the PDF's table structure. Use libraries such as tabula-py for table extraction or Camelot for structured tables, then convert the results to CSV with pandas. This guide offers practical workflows for simple and multi-page documents, including common edge cases.
Why PDF to CSV matters in data workflows
PDFs are ubiquitous as distribution formats, and turning them into CSVs unlocks data manipulation with pandas, SQL, and BI tools. Data teams increasingly automate PDF-to-CSV pipelines to save time and reduce manual data-entry errors. Common scenarios include vendor reports with quarterly tables, financial statements, and research papers whose tables hold key results. This article compares two mainstream approaches: tabula-py, which wraps the Java-based Tabula tool, and Camelot, a pure-Python option with different table-detection modes. No tool fits every PDF, so understanding layout cues (borders, whitespace, and header rows) matters. Before you code, plan how to handle multi-row headers, merged cells, and rotated text. The goal is a clean CSV suitable for downstream analysis and reproducible workflows. The sections that follow show how to start a conversion with each library and how to validate the extracted data with pandas.
```python
import tabula

# read all tables from a sample PDF
dfs = tabula.read_pdf("sample.pdf", pages="all", multiple_tables=True)
print([df.shape for df in dfs])
```

```python
import pandas as pd

# merge into one DataFrame (if you have multiple tables)
combined = pd.concat(dfs, ignore_index=True)
combined.head()
```

In practice, decide early whether you expect well-delimited grid tables (lattice) or text-like tables (stream). This choice affects detection parameters, table-area tuning, and how you handle headers. The following steps illustrate a pragmatic path: install dependencies, try both libraries on a small PDF, and compare the extracted data with the source. Finally, create a clean CSV with consistent encoding and column names.
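When it is unclear which mode a PDF needs, one pragmatic approach is to extract with both and keep the result that fills more cells. Below is a minimal sketch of such a comparison; the `score_table` helper and its non-empty-cell heuristic are illustrative assumptions, not part of either library.

```python
import pandas as pd

def score_table(df: pd.DataFrame) -> float:
    """Fraction of non-empty cells; a rough proxy for extraction quality."""
    if df.empty:
        return 0.0
    filled = df.replace("", pd.NA).notna().sum().sum()
    return float(filled) / df.size

# compare two candidate extractions (e.g. lattice vs. stream output)
lattice_df = pd.DataFrame({"a": ["1", "2"], "b": ["3", "4"]})
stream_df = pd.DataFrame({"a": ["1", ""], "b": ["", ""]})

# keep whichever extraction scored higher
best = max([lattice_df, stream_df], key=score_table)
print(score_table(lattice_df), score_table(stream_df))  # 1.0 0.25
```

The same helper works on the DataFrames returned by tabula-py or by Camelot's `Table.df` attribute.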
Steps
Estimated time: 60-120 minutes
1. Install dependencies
Set up Python 3.8+ and pip in a virtual environment. Install Java if you plan to use tabula-py. Verify installations with `python --version`, `pip --version`, and `java -version`.
Tip: Use a virtualenv or conda environment to keep project dependencies isolated.
2. Choose a library and test on a sample PDF
Decide between tabula-py and Camelot by testing a small PDF. Run a quick extraction to inspect the DataFrame shapes and header detection.
Tip: Start with a simple, clearly tabular page to confirm the extraction basics.
3. Extract tables with tabula-py
Run `read_pdf` with `pages` and `multiple_tables`, then concatenate results into a single DataFrame for inspection.
Tip: If headers are misdetected, try `lattice=True` to enforce grid boundaries.
4. Handle edge cases (rotations, merged cells)
If you encounter rotated text or merged cells, experiment with Camelot flavor options and page-area cropping to improve extraction.
Tip: Rotated pages often require the stream flavor and careful area specs.
5. Clean and normalize data
Use pandas to drop empty rows/columns, normalize headers, and standardize data types before saving to CSV.
Tip: Normalize headers to lowercase and replace spaces with underscores for consistency.
6. Save to CSV and validate
Write the final DataFrame to CSV with UTF-8 encoding and perform quick validation checks.
Tip: Include a small test that asserts the output is non-empty and has reasonable column counts.
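Steps 5 and 6 can be sketched as a small pandas pipeline. The `clean_table` helper, the sample data, and the output file name are illustrative assumptions:

```python
import pandas as pd

def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Drop fully empty rows/columns and normalize headers for CSV export."""
    out = df.dropna(how="all").dropna(axis=1, how="all")
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    return out

# hypothetical extracted table: one blank row and one blank column
raw = pd.DataFrame({
    "Unit Price": [1.5, None, 2.0],
    "Qty Sold": [3, None, 4],
    "Notes": [None, None, None],
})

clean = clean_table(raw)
clean.to_csv("output.csv", index=False, encoding="utf-8")

# quick validation: non-empty, expected column names
assert not clean.empty
assert list(clean.columns) == ["unit_price", "qty_sold"]
```

The final assertions are the kind of lightweight check worth keeping in an automated pipeline.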
Prerequisites
Required
- Python 3.8+
- pip package manager
- Java runtime (if using tabula-py)
- Basic Python scripting knowledge
Optional
- PDFs to practice with
Commands
| Action | Command | Notes |
|---|---|---|
| Install tabula-py | `pip install tabula-py` | Requires a Java runtime; ensure Java is installed |
| Install Camelot | `pip install "camelot-py[cv]"` | May need additional dependencies for image processing (opencv, ghostscript) depending on OS |
| Verify Java installation | `java -version` | tabula-py relies on tabula-java |
| Extract with tabula-py (example) | `python -c 'import tabula, pandas as pd; dfs = tabula.read_pdf("input.pdf", pages="all", multiple_tables=True); pd.concat(dfs, ignore_index=True).to_csv("output.csv", index=False)'` | Handles multiple tables; concatenate as needed |
| Extract with Camelot (example) | `python -c 'import camelot; [t.to_csv(f"table_{i+1}.csv") for i, t in enumerate(camelot.read_pdf("input.pdf", pages="1-end", flavor="stream"))]'` | Use `flavor="lattice"` for grid-based tables |
People Also Ask
What is the difference between tabula-py and Camelot for PDF table extraction?
Tabula-py is a Python wrapper around the Java-based Tabula tool and often excels with grid-like tables. Camelot is a pure-Python option that can be more approachable and flexible for text-based tables. Depending on the PDF layout, one may outperform the other.
Do I need Java to use tabula-py?
Yes. Tabula-py relies on the tabula-java library under the hood, so a Java Runtime Environment is required for table extraction.
Can Camelot read rotated text or merged cells?
Camelot can handle some rotated or merged cells, especially with the stream flavor, but complex layouts may require manual tuning or alternative approaches.
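For stubborn layouts, Camelot's stream flavor accepts explicit `table_areas` strings in PDF points, formatted `"x1,y1,x2,y2"` from top-left to bottom-right. A sketch follows; the file name and the `area_str` convenience helper are illustrative assumptions, not part of Camelot:

```python
def area_str(x1: float, y1: float, x2: float, y2: float) -> str:
    """Format coordinates as Camelot's 'x1,y1,x2,y2' table_areas string."""
    return f"{x1},{y1},{x2},{y2}"

def extract_area(path: str, area: str):
    # import inside the function so the sketch loads without Camelot installed
    import camelot
    return camelot.read_pdf(path, pages="1", flavor="stream", table_areas=[area])

# crop to the upper portion of a hypothetical page before extracting
print(area_str(50, 750, 550, 50))
# tables = extract_area("rotated.pdf", area_str(50, 750, 550, 50))
```

Tools like a PDF viewer's measurement mode (or Camelot's `plot` debugging) help find usable coordinates.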
Is there a straightforward way to merge multiple tables into a single CSV?
Yes. Read all tables into separate DataFrames and concatenate them with pandas.concat, then save as a single CSV.
How should I handle PDFs with no clear table borders?
Prefer Camelot's stream flavor and experiment with crop areas or manual table extraction to identify headers and rows.
What encoding should I use when exporting to CSV?
UTF-8 is the safest default; specify encoding='utf-8' when saving and test with non-ASCII data.
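A quick round-trip with non-ASCII data confirms the encoding survives export; the sample values and file name are illustrative:

```python
import pandas as pd

# sample data containing accented characters
df = pd.DataFrame({"city": ["Zürich", "São Paulo"], "total": [1200, 850]})
df.to_csv("export.csv", index=False, encoding="utf-8")

# read it back and verify nothing was mangled
back = pd.read_csv("export.csv", encoding="utf-8")
assert back["city"].tolist() == ["Zürich", "São Paulo"]
```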
Main Points
- Choose between tabula-py and Camelot based on table structure
- Validate and clean data before downstream use
- Handle multi-page PDFs with careful page selection and merging
- Preserve UTF-8 encoding to avoid character corruption
- Automate extraction with small, repeatable scripts