Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Feature/pdf ingestion jpdfium#6525

Open
EthanHealy01 wants to merge 16 commits into
mainStirling-Tools/Stirling-PDF:mainfrom
feature/pdf-ingestion-jpdfiumStirling-Tools/Stirling-PDF:feature/pdf-ingestion-jpdfiumCopy head branch name to clipboard
Open

Feature/pdf ingestion jpdfium#6525
EthanHealy01 wants to merge 16 commits into
mainStirling-Tools/Stirling-PDF:mainfrom
feature/pdf-ingestion-jpdfiumStirling-Tools/Stirling-PDF:feature/pdf-ingestion-jpdfiumCopy head branch name to clipboard

Conversation

@EthanHealy01
Copy link
Copy Markdown
Contributor

PDF Ingestion / Convert to markdown agent, also replaced the current convert to markdown API in java

TextLine-driven converter (tables: bordered/borderless, multi-table, uneven
rows, cross-page stitching, wrapped cells; multi-signal heading detection;
image metadata; two-column handling). Wires the orchestrator convert_markdown
path to the deterministic Java endpoint. Synthetic/owned test fixtures only.
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines ignoring generated files. enhancement New feature or request labels Jun 3, 2026
@stirlingbot stirlingbot Bot added Documentation Improvements or additions to documentation Java Pull requests that update Java code API API-related issues or pull requests Test Testing-related issues or pull requests engine and removed enhancement New feature or request labels Jun 3, 2026
Comment thread app/common/src/main/java/stirling/software/common/pdf/PdfMarkdownConverter.java Outdated
@ConnorYoh
Copy link
Copy Markdown
Member

One thing on the table column detection in PdfMarkdownConverter (findColumnRanges). It sizes an array straight from the PDF's word coordinates:

int span = (int) Math.ceil(maxX) - lo + 1;
int[] coverage = new int[span];

There's no upper bound on span, and those coordinates come straight from jpdfium with no clamping (I checked the lib, it passes PDFium's raw values through untouched). A PDF can position text anywhere via a text matrix, so a crafted or just genuinely weird file can report a massive maxX. That gives you either a multi-GB int[] (OutOfMemoryError), or if the numbers overflow, a negative span and a NegativeArraySizeException. Either way the request dies, and because the endpoint runs the converter synchronously on the request thread, one bad upload can take it down.

The fix is small: clamp the span to a sane page width and skip table detection when the geometry is implausible, e.g.

if (!(maxX > minX) || (maxX - minX) > MAX_PAGE_SPAN_PT) return List.of();
int span = Math.min((int) Math.ceil(maxX) - lo + 1, MAX_PAGE_SPAN_PT);

Real pages are under ~2000pt wide, so anything past that is junk. Ideally we clamp coordinates once at extraction so the rest of the pipeline is protected too.

@ConnorYoh
Copy link
Copy Markdown
Member

The converter doesn't escape any of the body text it emits. Only | gets escaped, and only inside table cells. Headings, paragraphs and bold all output the raw text from the PDF.

Why it matters: if a PDF's text happens to contain markdown characters (#, *, backticks, [label](url)) or HTML, it gets reinterpreted as structure instead of staying as literal text. So a line that literally reads # Heading in the PDF becomes a real H1 in the output. It's mostly a fidelity issue. I checked our in-app renderer and it's react-markdown with no raw HTML, so it's not an XSS today, but the .md we return is a download and anything that later renders it as HTML would inherit the risk.

What we can do: escape markdown special characters in the body text before emitting (same idea as escapeCell, just applied more broadly to paragraph/heading/bold text). At a minimum we should treat the output as untrusted content.

@ConnorYoh
Copy link
Copy Markdown
Member

The converter's accuracy test (PdfMarkdownConverterTest) is @Disabled, so it never runs in CI. The header notes it's a work in progress and that some fixtures are expected to exceed the 5% tolerance. The only test that does run mocks the converter out and just checks the controller returns 200/500.

Why it matters: this PR replaces two working implementations (the old Java parser and the Python agent) with a new 940-line converter on a public endpoint, and right now nothing actually verifies its output. So there's no safety net against regressions, or against the crash case in the other comment.

What we can do: get a couple of the golden fixtures passing and enable just those in CI (a small enforced set beats a disabled one), and add a fixture with degenerate/extreme geometry to cover the crash path. It doesn't need all four green to start, just something real gating it.

@EthanHealy01
Copy link
Copy Markdown
Contributor Author

The converter's accuracy test (PdfMarkdownConverterTest) is @Disabled, so it never runs in CI. The header notes it's a work in progress and that some fixtures are expected to exceed the 5% tolerance. The only test that does run mocks the converter out and just checks the controller returns 200/500.

Why it matters: this PR replaces two working implementations (the old Java parser and the Python agent) with a new 940-line converter on a public endpoint, and right now nothing actually verifies its output. So there's no safety net against regressions, or against the crash case in the other comment.

What we can do: get a couple of the golden fixtures passing and enable just those in CI (a small enforced set beats a disabled one), and add a fixture with degenerate/extreme geometry to cover the crash path. It doesn't need all four green to start, just something real gating it.

Never thought about the fact this replaces the older implementation and is a total regression when it comes to testing ahahahaha, fixing now, along with the other 2

@stirlingbot
Copy link
Copy Markdown
Contributor

stirlingbot Bot commented Jun 5, 2026

🚀 V2 Auto-Deployment Complete!

Your V2 PR with embedded architecture has been deployed!

🔗 Direct Test URL (non-SSL) http://54.175.155.236:6525

🔐 Secure HTTPS URL: https://6525.ssl.stirlingpdf.cloud

This deployment will be automatically cleaned up when the PR is closed.

🔄 Auto-deployed for approved V2 contributors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API API-related issues or pull requests Documentation Improvements or additions to documentation engine Java Pull requests that update Java code size:XXL This PR changes 1000+ lines ignoring generated files. Test Testing-related issues or pull requests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.