pdftables-api

Python library to interact with the PDFTables.com API.

Supported versions of Python are listed in ci-build.yml.

Installation

With pip (requires git to be installed):

pip install git+https://github.com/pdftables/python-pdftables-api.git

With pip (without git):

pip install https://github.com/pdftables/python-pdftables-api/archive/master.tar.gz

For local development:

uv sync

Upgrading

To upgrade with pip, pass the --upgrade flag, e.g.

pip install --upgrade git+https://github.com/pdftables/python-pdftables-api.git

Usage

Sign up for an account at PDFTables.com and then visit the API page to see your API key.

Replace my-api-key below with your API key.

import pdftables_api

c = pdftables_api.Client('my-api-key')
c.xlsx('input.pdf', 'output.xlsx')
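
To convert several files, the same call can be wrapped in a loop. The sketch below is only an illustration: convert_folder is a hypothetical helper, not part of the library, and it assumes nothing beyond the c.xlsx(input, output) call shown above.

```python
from pathlib import Path


def convert_folder(client, folder):
    """Convert every .pdf directly inside `folder` to an .xlsx file beside it.

    `client` is any object with an xlsx(input_path, output_path) method,
    such as the pdftables_api.Client shown above.
    """
    converted = []
    # Sort so the files are processed in a predictable order.
    for pdf in sorted(Path(folder).glob("*.pdf")):
        out = pdf.with_suffix(".xlsx")
        client.xlsx(str(pdf), str(out))
        converted.append(out)
    return converted
```

With a real client this would be called as convert_folder(pdftables_api.Client('my-api-key'), 'reports'), where reports is a placeholder directory name.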

Formats

To convert to CSV, XML or HTML, change c.xlsx to c.csv, c.xml or c.html respectively.

To choose between Excel (single sheet) and Excel (multiple sheets), use c.xlsx_single or c.xlsx_multiple respectively.
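
As a rough sketch of switching formats, the helper below calls the format methods by name. convert_to_formats is a hypothetical helper, and it assumes that c.csv, c.xml and c.html take the same (input, output) arguments as c.xlsx.

```python
from pathlib import Path


def convert_to_formats(client, pdf_path, formats=("csv", "xml", "html")):
    """Convert one PDF to each format in `formats`; return {format: output path}."""
    outputs = {}
    stem = Path(pdf_path).with_suffix("")  # 'input.pdf' -> 'input'
    for fmt in formats:
        out_path = f"{stem}.{fmt}"
        # Assumption: each format method has the same signature as c.xlsx.
        getattr(client, fmt)(pdf_path, out_path)
        outputs[fmt] = out_path
    return outputs
```

For example, convert_to_formats(pdftables_api.Client('my-api-key'), 'input.pdf') would write input.csv, input.xml and input.html next to the source file.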

Extractor

You can specify which extraction engine to use when creating a Client. The available extractors are standard (default), ai-1, and ai-2.

For AI extractors (ai-1 and ai-2), you can also specify an extract option to control what content is extracted: tables (default) or tables-paragraphs.

from pdftables_api import (Client, EXTRACTOR_AI_1, EXTRACTOR_AI_2,
    EXTRACT_TABLES, EXTRACT_TABLES_PARAGRAPHS)

# Standard extractor (default)
c_standard = Client('my-api-key')

# AI extractors for complex documents
c_ai_1 = Client('my-api-key', extractor=EXTRACTOR_AI_1, extract=EXTRACT_TABLES)
c_ai_2 = Client('my-api-key', extractor=EXTRACTOR_AI_2, extract=EXTRACT_TABLES_PARAGRAPHS)

See PDFTables API documentation for details.

Test

The tests use pytest. Run them with make test.

Linting and formatting

  • Format with make format
  • Apply Ruff fixes with make fix

Configuring a timeout

If you are converting a large document (hundreds or thousands of pages), you may want to increase the timeout.

A conversion that exceeds the timeout fails with an error like this:

ReadTimeout: HTTPSConnectionPool(host='pdftables.com', port=443): Read timed out. (read timeout=300)

The example below allows 60 seconds to connect to our server and 1 hour to convert the document:

import pdftables_api

c = pdftables_api.Client('my-api-key', timeout=(60, 3600))
c.xlsx('input.pdf', 'output.xlsx')
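
If a suitable timeout is not known in advance, one option is to retry with progressively longer read timeouts. This is a minimal sketch, not library behaviour: convert_with_escalating_timeouts and make_client are hypothetical names, and the caller decides which exception types count as a timeout. Note that each attempt uploads the document again.

```python
def convert_with_escalating_timeouts(make_client, pdf_path, out_path,
                                     read_timeouts, timeout_errors,
                                     connect_timeout=60):
    """Try the conversion with each read timeout in turn.

    Returns the read timeout that succeeded, or re-raises the last
    timeout error if every attempt failed.
    """
    if not read_timeouts:
        raise ValueError("read_timeouts must contain at least one value")
    last_error = None
    for read_timeout in read_timeouts:
        # make_client is a caller-supplied factory that receives the
        # (connect, read) tuple and returns a client configured with it.
        client = make_client((connect_timeout, read_timeout))
        try:
            client.xlsx(pdf_path, out_path)
            return read_timeout
        except timeout_errors as exc:
            # Remember the failure and move on to the next, longer timeout.
            last_error = exc
    raise last_error
```

A call might look like convert_with_escalating_timeouts(lambda t: pdftables_api.Client('my-api-key', timeout=t), 'input.pdf', 'output.xlsx', [300, 1800, 3600], (requests.exceptions.ReadTimeout,)), assuming the ReadTimeout shown above is the one raised by the requests library.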