Feature/pdf ingestion jpdfium by EthanHealy01 · Pull Request #6525 · Stirling-Tools/Stirling-PDF

EthanHealy01 · Jun 3, 2026

PDF ingestion / convert-to-Markdown agent. This also replaces the current convert-to-Markdown API in Java.

TextLine-driven converter (tables: bordered/borderless, multi-table, uneven rows, cross-page stitching, wrapped cells; multi-signal heading detection; image metadata; two-column handling). Wires the orchestrator convert_markdown path to the deterministic Java endpoint. Synthetic/owned test fixtures only.

ConnorYoh · Jun 4, 2026

One thing on the table column detection in PdfMarkdownConverter (findColumnRanges). It sizes an array straight from the PDF's word coordinates:

int span = (int) Math.ceil(maxX) - lo + 1;
int[] coverage = new int[span];

There's no upper bound on span, and those coordinates come straight from jpdfium with no clamping (I checked the lib, it passes PDFium's raw values through untouched). A PDF can position text anywhere via a text matrix, so a crafted or just genuinely weird file can report a massive maxX. That gives you either a multi-GB int[] (OutOfMemoryError), or if the numbers overflow, a negative span and a NegativeArraySizeException. Either way the request dies, and because the endpoint runs the converter synchronously on the request thread, one bad upload can take it down.
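To make the two failure modes concrete, here is a minimal sketch of the arithmetic. `SpanOverflowDemo` and `rawSpan` are hypothetical names used only for illustration; the expression mirrors the one quoted above, with `lo` passed in as a plain int because its definition is not shown in this thread.

```java
final class SpanOverflowDemo {
    private SpanOverflowDemo() {}

    /**
     * Mirrors the unclamped expression from the review comment.
     * The cast binds to Math.ceil(maxX) only, so the subtraction and
     * addition happen in int arithmetic, which wraps silently on overflow.
     */
    static int rawSpan(double maxX, int lo) {
        return (int) Math.ceil(maxX) - lo + 1;
    }

    /** True when allocating new int[span] would throw NegativeArraySizeException. */
    static boolean wouldThrowNegativeSize(double maxX, int lo) {
        return rawSpan(maxX, lo) < 0;
    }
}
```

For an ordinary 612pt page, `rawSpan(612.0, 0)` is 613. For a maxX of 1e12, the double-to-int cast saturates at `Integer.MAX_VALUE`: with `lo = 1` the span stays at 2147483647 (an allocation of roughly 8 GB), and with `lo = 0` the `+ 1` wraps it to -2147483648.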

The fix is small: clamp the span to a sane page width and skip table detection when the geometry is implausible, e.g.

if (!(maxX > minX) || (maxX - minX) > MAX_PAGE_SPAN_PT) return List.of();
int span = Math.min((int) Math.ceil(maxX) - lo + 1, MAX_PAGE_SPAN_PT);

Real pages are under ~2000pt wide, so anything past that is junk. Ideally we clamp coordinates once at extraction so the rest of the pipeline is protected too.
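One way the guard could be packaged is sketched below, under stated assumptions: `SpanGuard` and `safeSpan` are hypothetical names, the value 2000 for `MAX_PAGE_SPAN_PT` is taken from the "~2000pt" estimate above, and `lo` is assumed to be the floor of minX. The width is computed in double arithmetic so it cannot wrap before the bound is checked.

```java
import java.util.OptionalInt;

final class SpanGuard {
    // Assumed value, based on the "~2000pt" estimate in the review comment.
    static final int MAX_PAGE_SPAN_PT = 2000;

    private SpanGuard() {}

    /**
     * Returns the coverage-array length for [minX, maxX], or empty when the
     * geometry is implausible and table detection should be skipped.
     */
    static OptionalInt safeSpan(double minX, double maxX) {
        // !(maxX > minX) rejects reversed and empty ranges, and NaN on either side.
        if (!(maxX > minX)) {
            return OptionalInt.empty();
        }
        // Double arithmetic: a huge or infinite coordinate yields a huge or
        // infinite width instead of a wrapped int.
        double width = Math.ceil(maxX) - Math.floor(minX) + 1;
        if (width > MAX_PAGE_SPAN_PT) {
            return OptionalInt.empty();
        }
        return OptionalInt.of((int) width);
    }
}
```

A caller would return `List.of()` when the result is empty and size the array from `getAsInt()` otherwise. This only protects the allocation; clamping coordinates at extraction, as suggested above, is still what protects the rest of the pipeline.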

ConnorYoh · Jun 4, 2026

The converter doesn't escape any of the body text it emits. Only | gets escaped, and only inside table cells. Headings, paragraphs and bold all output the raw text from the PDF.

Why it matters: if a PDF's text happens to contain markdown characters (#, *, backticks, [label](url)) or HTML, it gets reinterpreted as structure instead of staying as literal text. So a line that literally reads # Heading in the PDF becomes a real H1 in the output. It's mostly a fidelity issue. I checked our in-app renderer and it's react-markdown with no raw HTML, so it's not an XSS today, but the .md we return is a download and anything that later renders it as HTML would inherit the risk.

What we can do: escape markdown special characters in the body text before emitting (same idea as escapeCell, just applied more broadly to paragraph/heading/bold text). At a minimum we should treat the output as untrusted content.
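A rough sketch of what broader escaping could look like follows. `MarkdownEscaper` is a hypothetical helper, not the converter's existing `escapeCell`, and the character sets are an assumption rather than a complete treatment of the CommonMark grammar. It backslash-escapes inline-significant characters everywhere and block markers only at the start of a line.

```java
final class MarkdownEscaper {
    // Characters that can open inline markdown or HTML anywhere in a line.
    private static final String INLINE_SPECIALS = "\\`*_[]<>&|~";
    // Characters that only act as block markers at the start of a line.
    private static final String BLOCK_STARTERS = "#-+=";

    private MarkdownEscaper() {}

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        String[] lines = text.split("\n", -1);
        for (int n = 0; n < lines.length; n++) {
            if (n > 0) out.append('\n');
            escapeLine(lines[n], out);
        }
        return out.toString();
    }

    private static void escapeLine(String line, StringBuilder out) {
        // Copy leading spaces unchanged; block markers are checked after them.
        int start = 0;
        while (start < line.length() && line.charAt(start) == ' ') start++;
        out.append(line, 0, start);

        // Detect an ordered-list marker such as "1." or "2)" at the line start.
        int afterDigits = start;
        while (afterDigits < line.length() && isAsciiDigit(line.charAt(afterDigits))) afterDigits++;
        boolean orderedMarker = afterDigits > start
                && afterDigits < line.length()
                && (line.charAt(afterDigits) == '.' || line.charAt(afterDigits) == ')');

        for (int k = start; k < line.length(); k++) {
            char c = line.charAt(k);
            boolean blockMarker = k == start && BLOCK_STARTERS.indexOf(c) >= 0;
            boolean listDelimiter = orderedMarker && k == afterDigits;
            if (blockMarker || listDelimiter || INLINE_SPECIALS.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }
}
```

With this sketch, a PDF line that reads `# Heading` is emitted as `\# Heading`, and `[label](url)` becomes `\[label\](url)`, so both stay literal text. It deliberately over-escapes in a few harmless cases (a line starting with `3.5` becomes `3\.5`) and does not handle lines indented by four or more spaces, which markdown treats as code blocks.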

ConnorYoh · Jun 4, 2026

The converter's accuracy test (PdfMarkdownConverterTest) is @Disabled, so it never runs in CI. The header notes it's a work in progress and that some fixtures are expected to exceed the 5% tolerance. The only test that does run mocks the converter out and just checks the controller returns 200/500.

Why it matters: this PR replaces two working implementations (the old Java parser and the Python agent) with a new 940-line converter on a public endpoint, and right now nothing actually verifies its output. So there's no safety net against regressions, or against the crash case in the other comment.

What we can do: get a couple of the golden fixtures passing and enable just those in CI (a small enforced set beats a disabled one), and add a fixture with degenerate/extreme geometry to cover the crash path. It doesn't need all four green to start, just something real gating it.
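As an illustration of what an enforced golden check might gate on, here is a dependency-free sketch. How PdfMarkdownConverterTest actually measures its 5% tolerance is not shown in this thread, so the metric below (character-level edit distance divided by the longer length) is an assumption, and `GoldenTolerance` is a hypothetical name.

```java
final class GoldenTolerance {
    private GoldenTolerance() {}

    /** Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Fraction of characters that differ, relative to the longer string. */
    static double mismatchRatio(String expected, String actual) {
        int longest = Math.max(expected.length(), actual.length());
        return longest == 0 ? 0.0 : (double) editDistance(expected, actual) / longest;
    }

    /** True when the converter output is close enough to the golden file. */
    static boolean withinTolerance(String expected, String actual, double tolerance) {
        return mismatchRatio(expected, actual) <= tolerance;
    }
}
```

An enabled test would then assert `withinTolerance(golden, converted, 0.05)` for only the fixtures that already pass, and the degenerate-geometry fixture would simply assert that conversion returns without throwing.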

EthanHealy01 · Jun 4, 2026

Never thought about the fact that this replaces the older implementation, which makes it a total regression when it comes to testing ahahahaha. Fixing now, along with the other 2.

stirlingbot · Jun 5, 2026

🚀 V2 Auto-Deployment Complete!

Your V2 PR with embedded architecture has been deployed!

🔗 Direct Test URL (non-SSL) http://54.175.155.236:6525

🔐 Secure HTTPS URL: https://6525.ssl.stirlingpdf.cloud

This deployment will be automatically cleaned up when the PR is closed.

🔄 Auto-deployed for approved V2 contributors.

EthanHealy01 added 4 commits June 2, 2026 16:49

make deterministic pdf ingestion pipeline

5116c8e

stop pdftomarkdown test from failing CI, it was just for creating the…

a0ca330

… inital converter, and future improvements

remove unneeded gitignore

5c02589

EthanHealy01 requested review from ConnorYoh, Frooodle, Ludy87 and jbrunton96 as code owners June 3, 2026 13:07

dosubot Bot added the size:XXL and enhancement labels Jun 3, 2026

stirlingbot Bot added the Documentation, Java, API, Test, and engine labels and removed the enhancement label Jun 3, 2026

aikido-pr-checks Bot reviewed Jun 3, 2026

Comment thread app/common/src/main/java/stirling/software/common/pdf/PdfMarkdownConverter.java Outdated

EthanHealy01 added 4 commits June 3, 2026 16:12

aikido

7ad4878

merge main

2b4539a

remove dead code

874bc4b

remove real email from example pdf (purely a hygiene thing)

2385584

EthanHealy01 added 2 commits June 4, 2026 13:23

merge main

8b9d7d8

backend fix

610687f

EthanHealy01 and others added 3 commits June 4, 2026 18:08

handle change requests

8d6b0e2

Merge branch 'main' into feature/pdf-ingestion-jpdfium

6a6a90d

Merge branch 'main' of https://github.com/Stirling-Tools/Stirling-PDF …

fa5a4c3

…into feature/pdf-ingestion-jpdfium

EthanHealy01 and others added 3 commits June 4, 2026 19:24

Merge branch 'feature/pdf-ingestion-jpdfium' of https://github.com/St…

59c6f34

…irling-Tools/Stirling-PDF into feature/pdf-ingestion-jpdfium

Merge branch 'main' into feature/pdf-ingestion-jpdfium

e9b769b

Merge branch 'main' into feature/pdf-ingestion-jpdfium

bb25e3c

EthanHealy01 wants to merge 16 commits into Stirling-Tools/Stirling-PDF:main from Stirling-Tools/Stirling-PDF:feature/pdf-ingestion-jpdfium

2 participants
