Feature/pdf ingestion jpdfium#6525
EthanHealy01 wants to merge 16 commits into Stirling-Tools/Stirling-PDF:main from Stirling-Tools/Stirling-PDF:feature/pdf-ingestion-jpdfium
TextLine-driven converter (tables: bordered/borderless, multi-table, uneven rows, cross-page stitching, wrapped cells; multi-signal heading detection; image metadata; two-column handling). Wires the orchestrator convert_markdown path to the deterministic Java endpoint. Synthetic/owned test fixtures only.
… initial converter, and future improvements
One thing on the table column detection:

    int span = (int) Math.ceil(maxX) - lo + 1;
    int[] coverage = new int[span];

There's no upper bound on span. The fix is small: clamp the span to a sane page width and skip table detection when the geometry is implausible, e.g.

    if (!(maxX > minX) || (maxX - minX) > MAX_PAGE_SPAN_PT) return List.of();
    int span = Math.min((int) Math.ceil(maxX) - lo + 1, MAX_PAGE_SPAN_PT);

Real pages are under ~2000pt wide, so anything past that is junk. Ideally we clamp coordinates once at extraction so the rest of the pipeline is protected too.
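For reference, here is a minimal, self-contained sketch of the guard suggested above. It is not the converter's actual code: the class name `ColumnSpanGuard`, the `coverage` method, the `double[]{minX, maxX}` interval representation, and the value of `MAX_PAGE_SPAN_PT` (taken from the "~2000pt" figure) are all assumptions for illustration. It returns an empty array where the real method would return `List.of()`.

```java
import java.util.List;

final class ColumnSpanGuard {
    // Hypothetical limit in PDF points, based on the "~2000pt" estimate in the review.
    static final int MAX_PAGE_SPAN_PT = 2000;

    /**
     * Builds a per-point coverage histogram over the x-extents of text runs.
     * Each interval is {minX, maxX}. Returns an empty array when the geometry
     * is implausible, so the caller can skip table detection.
     */
    static int[] coverage(List<double[]> intervals) {
        double minX = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY;
        for (double[] iv : intervals) {
            minX = Math.min(minX, iv[0]);
            maxX = Math.max(maxX, iv[1]);
        }
        // !(maxX > minX) also rejects empty input and NaN coordinates.
        if (!(maxX > minX) || (maxX - minX) > MAX_PAGE_SPAN_PT) {
            return new int[0];
        }
        int lo = (int) Math.floor(minX);
        // Clamp the allocation size so it can never exceed the page-width limit.
        int span = Math.min((int) Math.ceil(maxX) - lo + 1, MAX_PAGE_SPAN_PT);
        int[] coverage = new int[span];
        for (double[] iv : intervals) {
            int from = Math.max(0, (int) Math.floor(iv[0]) - lo);
            int to = Math.min(span - 1, (int) Math.ceil(iv[1]) - lo);
            for (int x = from; x <= to; x++) {
                coverage[x]++;
            }
        }
        return coverage;
    }
}
```

With this shape, a single run with `maxX = 1e12` yields an empty result instead of a huge allocation, which is also the kind of input a degenerate-geometry fixture could exercise.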
The converter doesn't escape any of the body text it emits. Only

Why it matters: if a PDF's text happens to contain markdown characters (

What we can do: escape markdown special characters in the body text before emitting (same idea as
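As a rough illustration of the escaping idea, here is a minimal sketch. The class `MarkdownEscaper`, the method `escape`, and the exact character set are assumptions for illustration, not the converter's existing helpers. It relies only on the CommonMark rule that any ASCII punctuation character may be backslash-escaped.

```java
final class MarkdownEscaper {
    // Hypothetical set of characters to escape: ASCII punctuation that can
    // trigger Markdown formatting (emphasis, code, links, headings, lists, tables).
    private static final String SPECIAL = "\\`*_{}[]()#+-.!|<>~";

    /** Returns the text with each special character prefixed by a backslash. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

This version is deliberately conservative: it also escapes characters such as `.` and `-` that are only special in some positions, so the raw output is noisier than strictly needed but renders as the original text. A position-aware escaper would be a possible refinement.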
The converter's accuracy test (

Why it matters: this PR replaces two working implementations (the old Java parser and the Python agent) with a new 940-line converter on a public endpoint, and right now nothing actually verifies its output. So there's no safety net against regressions, or against the crash case in the other comment.

What we can do: get a couple of the golden fixtures passing and enable just those in CI (a small enforced set beats a disabled one), and add a fixture with degenerate/extreme geometry to cover the crash path. It doesn't need all four green to start, just something real gating it.
Never thought about the fact this replaces the older implementation and is a total regression when it comes to testing ahahahaha, fixing now, along with the other 2 |
…into feature/pdf-ingestion-jpdfium
🚀 V2 Auto-Deployment Complete! Your V2 PR with embedded architecture has been deployed!
🔗 Direct Test URL (non-SSL): http://54.175.155.236:6525
🔐 Secure HTTPS URL: https://6525.ssl.stirlingpdf.cloud
This deployment will be automatically cleaned up when the PR is closed.
🔄 Auto-deployed for approved V2 contributors.
PDF Ingestion / Convert to markdown agent; also replaces the current convert-to-markdown API in Java