blobboxes

Note: This codebase was generated entirely by Claude (Anthropic) under close human supervision. It is intended as a development aid for experimentation — useful for getting something up and running quickly — but should not be used directly in production. There may be flaws in both the design and implementation.

A normalized relational view of document content — PDF, Excel, Word, and plain text — exposed through Python, DuckDB, SQLite, and a C API.

Every source produces the same five tables: doc, pages, fonts, styles, and bboxes. This uniform schema lets you query and join across formats as if they were database tables.

Why

Extracting structured data from documents usually means reaching for a different library per format, each with its own API and output shape. blobboxes normalizes all of them into one relational schema so you can:

Query a PDF invoice and an Excel spreadsheet with the same SQL
Join a DOCX table with an XLSX sheet on shared column values
Pipe any document through the same downstream pipeline

The core is a C library with a cursor-based API. Each consumer (Python, DuckDB extension, SQLite extension) compiles it in directly — no intermediate shared library, no IPC.

Supported formats

Format	Backend	Notes
PDF	PDFium	Text runs with position, font, style, color
XLSX	xlnt	One bbox per cell; includes formulas
DOCX	pugixml + miniz	Table cells extracted from Word documents
Text	Built-in	One bbox per line

Format is auto-detected from magic bytes, or you can specify it explicitly.

Schema

All formats produce the same five tables:

doc — one row per document

Field	Type	Description
document_id	int	Unique ID
source_type	string	`"pdf"`, `"xlsx"`, `"text"`, `"docx"`
filename	string	NULL for buffer-based opens
checksum	string	MD5 hex of source bytes
page_count	int	Number of pages (sheets for XLSX)

pages — one row per page/sheet

Field	Type	Description
page_id	int	Unique ID
document_id	int	FK to doc
page_number	int	1-based
width, height	float	Dimensions in points (PDF) or 0 (other formats)

fonts — distinct fonts used

Field	Type	Description
font_id	int	Unique ID
name	string	Font family name

styles — distinct (font, size, color, weight) combinations

Field	Type	Description
style_id	int	Unique ID
font_id	int	FK to fonts
font_size	float	Size in points
color	string	`"rgba(r,g,b,a)"`
weight	string	`"normal"` or `"bold"`
italic	int	0 or 1
underline	int	0 or 1

bboxes — one row per text run / cell / line

Field	Type	Description
page_id	int	FK to pages
style_id	int	FK to styles
x, y, w, h	float	Bounding box (points for PDF, 0 for others)
text	string	Text content
formula	string	Raw formula (XLSX only, NULL otherwise)

Usage

Python

pip install blobboxes

import blobboxes

data = open("invoice.pdf", "rb").read()

# Auto-detect format, open a cursor
cur = blobboxes.open(data)
doc = cur.doc()          # dict
pages = cur.pages()      # list of dicts
fonts = cur.fonts()
styles = cur.styles()
bboxes = cur.bboxes()
cur.close()

# Or use format-specific openers
cur = blobboxes.open_pdf(data)
cur = blobboxes.open_xlsx(data)
cur = blobboxes.open_docx(data)
cur = blobboxes.open_text(data)

# Quick metadata without a cursor
info = blobboxes.info(data)   # {"source_type": "pdf", "page_count": 3, ...}
fmt  = blobboxes.detect(data) # "pdf"

# JSON interface (returns JSON strings, no cursor needed)
import json
doc_obj = json.loads(blobboxes.doc_json(data))
all_bboxes = json.loads(blobboxes.bboxes_json(data))

Pydantic schemas are available for validation:

from blobboxes.schema import Doc, Page, Font, Style, BBox, XlsxBBox

DuckDB

LOAD 'bboxes.duckdb_extension';

-- Auto-detecting (inspects magic bytes)
SELECT * FROM bboxes('invoice.pdf');
SELECT * FROM bboxes_doc('report.xlsx');
SELECT * FROM bboxes_pages('document.docx');

-- Format-specific
SELECT * FROM bboxes_xlsx('report.xlsx');
SELECT * FROM bboxes_xlsx_fonts('report.xlsx');

-- JSON scalars
SELECT bboxes_json('invoice.pdf');
SELECT bboxes_xlsx_doc_json('report.xlsx');

-- Quick info
SELECT bboxes_info('invoice.pdf');

Each format registers five table functions and five JSON scalars:

Auto-detect	PDF	XLSX	Text	DOCX
`bboxes_doc`	`bboxes_pdf_doc`	`bboxes_xlsx_doc`	`bboxes_text_doc`	`bboxes_docx_doc`
`bboxes_pages`	`bboxes_pdf_pages`	`bboxes_xlsx_pages`	`bboxes_text_pages`	`bboxes_docx_pages`
`bboxes_fonts`	`bboxes_pdf_fonts`	`bboxes_xlsx_fonts`	`bboxes_text_fonts`	`bboxes_docx_fonts`
`bboxes_styles`	`bboxes_pdf_styles`	`bboxes_xlsx_styles`	`bboxes_text_styles`	`bboxes_docx_styles`
`bboxes`	`bboxes_pdf`	`bboxes_xlsx`	`bboxes_text`	`bboxes_docx`

The same names with _json appended are JSON scalar variants (e.g. bboxes_pdf_json).

SQLite

.load bboxes

-- Virtual tables (file_path is a hidden required column)
SELECT * FROM bboxes WHERE file_path = 'invoice.pdf';
SELECT * FROM bboxes_pages WHERE file_path = 'report.xlsx';

-- Format-specific
SELECT * FROM bboxes_xlsx WHERE file_path = 'report.xlsx';

-- JSON scalars
SELECT bboxes_json('invoice.pdf');
SELECT bboxes_xlsx_doc_json('report.xlsx');
SELECT bboxes_info('invoice.pdf');

The function naming follows the same pattern as DuckDB.

C/C++

#include "bboxes.h"

bboxes_pdf_init();

bboxes_cursor* cur = bboxes_open(buf, len);    // auto-detect
// or: bboxes_open_pdf(buf, len, password, start_page, end_page);
// or: bboxes_open_xlsx(buf, len, password, start_page, end_page);
// or: bboxes_open_text(buf, len);
// or: bboxes_open_docx(buf, len);

const bboxes_doc*   doc = bboxes_get_doc(cur);
const bboxes_page*  pg;  while ((pg = bboxes_next_page(cur)))  { /* ... */ }
const bboxes_font*  f;   while ((f  = bboxes_next_font(cur)))  { /* ... */ }
const bboxes_style* s;   while ((s  = bboxes_next_style(cur))) { /* ... */ }
const bboxes_bbox*  b;   while ((b  = bboxes_next_bbox(cur)))  { /* ... */ }

bboxes_close(cur);
bboxes_pdf_destroy();

Every struct iterator has a JSON variant (bboxes_next_page_json, etc.) that returns a const char* JSON string.

Building

Requires CMake 3.20+ and a C++17 compiler. Dependencies (PDFium, nlohmann/json, hash-library, xlnt, pugixml, miniz) are fetched automatically via CMake FetchContent.

# Quick: build everything and run tests
./build_all.sh

# Manual: configure with desired backends
cmake -B build \
  -DBUILD_PYTHON_BINDINGS=ON \
  -DBUILD_DUCKDB_EXTENSION=ON \
  -DBUILD_SQLITE_EXTENSION=ON \
  -DBUILD_XLSX_BACKEND=ON \
  -DBUILD_TEXT_BACKEND=ON \
  -DBUILD_DOCX_BACKEND=ON
cmake --build build

For Python bindings, the build needs a virtual environment with nanobind:

uv venv .venv
uv pip install nanobind scikit-build-core
cmake -B build \
  -DBUILD_PYTHON_BINDINGS=ON \
  -DPython_FIND_VIRTUALENV=ONLY \
  -DPython_ROOT_DIR=.venv
cmake --build build

Testing

./test_cross_check.sh

Runs 30 tests verifying that struct and JSON outputs match across all consumers (Python, DuckDB, SQLite) for all formats (PDF, XLSX, DOCX, text). Uses EXCEPT queries to ensure exact equivalence.

License

MIT

Name	Name	Last commit message	Last commit date
Latest commit History 26 Commits 26 Commits
.github/workflows	.github/workflows
duckdb_ext	duckdb_ext
example	example
include	include
python	python
sqlite_ext/src	sqlite_ext/src
src	src
test_data	test_data
.gitignore	.gitignore
CLAUDE.md	CLAUDE.md
CMakeLists.txt	CMakeLists.txt
LICENSE	LICENSE
README.md	README.md
build_all.sh	build_all.sh
pyproject.toml	pyproject.toml
test.py	test.py
test_cross_check.sh	test_cross_check.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

blobboxes

Why

Supported formats

Schema

Usage

Python

DuckDB

SQLite

C/C++

Building

Testing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Search code, repositories, users, issues, pull requests...

Folders and files

Latest commit

History

Repository files navigation

blobboxes

Why

Supported formats

Schema

Usage

Python

DuckDB

SQLite

C/C++

Building

Testing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages