zpdf

High-performance PDF text extraction powered by Zig. Between 1.8x and 6.3x faster than MuPDF on the large documents benchmarked below.

Install

pip install zpdf

Usage

from zpdf import Document

with Document("paper.pdf") as doc:
    print(doc.page_count)

    # Extract all text (reading order)
    text = doc.extract_all()

    # Extract single page
    page_text = doc.extract_page(0)

    # Extract as markdown
    md = doc.extract_all_markdown()

    # Get text with bounding boxes
    spans = doc.extract_bounds(0)
    for span in spans:
        print(f"{span.text} at ({span.x0}, {span.y0})")
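The spans returned by `extract_bounds` can be post-processed with plain Python. The sketch below groups spans into visual lines by their vertical position. It assumes only the `text`, `x0`, and `y0` attributes used in the snippet above. The `Span` stand-in class, the `group_into_lines` helper, and the `y_tolerance` default are hypothetical names and values chosen for illustration; they are not part of zpdf.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for the objects returned by doc.extract_bounds().

    Assumption: only the text, x0 and y0 attributes shown in the
    usage snippet are relied on here.
    """
    text: str
    x0: float
    y0: float


def group_into_lines(spans, y_tolerance=2.0):
    """Group spans whose y0 lies within y_tolerance of a line's first span.

    Lines are returned in the order their first span was seen, so no
    assumption is made about which direction the y axis grows.
    """
    lines = []  # list of (anchor_y0, [spans]) in first-seen order
    for span in spans:
        for anchor_y, members in lines:
            if abs(span.y0 - anchor_y) <= y_tolerance:
                members.append(span)
                break
        else:
            # No existing line is close enough: start a new one.
            lines.append((span.y0, [span]))
    # Within each line, order spans left to right and join their text.
    return [
        " ".join(s.text for s in sorted(members, key=lambda s: s.x0))
        for _, members in lines
    ]


demo = [
    Span("world", 80.0, 700.2),
    Span("Hello", 20.0, 700.0),
    Span("Next", 20.0, 680.0),
]
print(group_into_lines(demo))
```

With a real document, the same helper could be called as `group_into_lines(doc.extract_bounds(0))`, provided the returned spans expose those three attributes. A suitable tolerance depends on the document's font sizes.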

From bytes

from zpdf import Document

# Read the whole PDF into memory, then open it from the bytes object
with open("doc.pdf", "rb") as f:
    data = f.read()

with Document(data) as doc:
    text = doc.extract_all()

Benchmark

Text extraction on Apple M4 Pro:

| Document | Pages | zpdf | MuPDF | Speedup |
|---|---|---|---|---|
| Intel SDM | 5,252 | 582ms | 2,152ms | 3.7x |
| Pandas Docs | 3,743 | 640ms | 1,130ms | 1.8x |
| C++ Standard | 2,134 | 438ms | 1,007ms | 2.3x |
| PDF Reference | 1,310 | 236ms | 1,481ms | 6.3x |
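The Speedup column is consistent with dividing the MuPDF time by the zpdf time. The sketch below recomputes it from the table and also derives zpdf throughput in pages per second. The numbers are copied from the table above. The `speedup` and `pages_per_second` helpers are illustrative names only and are not part of zpdf.

```python
# (document, pages, zpdf_ms, mupdf_ms), copied from the benchmark table
RESULTS = [
    ("Intel SDM", 5252, 582, 2152),
    ("Pandas Docs", 3743, 640, 1130),
    ("C++ Standard", 2134, 438, 1007),
    ("PDF Reference", 1310, 236, 1481),
]


def speedup(zpdf_ms, mupdf_ms):
    """How many times faster zpdf is than MuPDF for one document."""
    return mupdf_ms / zpdf_ms


def pages_per_second(pages, elapsed_ms):
    """Throughput for a run that processed `pages` in `elapsed_ms`."""
    return pages / (elapsed_ms / 1000)


for name, pages, zpdf_ms, mupdf_ms in RESULTS:
    print(
        f"{name}: {speedup(zpdf_ms, mupdf_ms):.1f}x, "
        f"{pages_per_second(pages, zpdf_ms):,.0f} pages/s with zpdf"
    )
```

The rounded ratios reproduce the Speedup column (3.7x, 1.8x, 2.3x, 6.3x).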

License

CC0-1.0
