@nickscamara (Member) commented Sep 17, 2025

Summary by cubic

Adds Reducto as a PDF parsing fallback and introduces a delayed racing strategy with RunPod MU to improve reliability and speed for PDFs under 19MB. If RunPod is slow (>120s) or fails, Reducto runs; pdf-parse remains the final fallback.

  • New Features

    • Reducto integration with async job polling and result fetching.
    • Racing mode: start Reducto after 120s if RunPod is still running; use the first to finish.
    • Fallback order: race (when both RunPod and Reducto are available), RunPod solo, Reducto solo, then pdf-parse.
    • Applies to PDFs <19MB; adds clearer logging and error reporting.
  • Migration

    • Set REDUCTO_API_KEY to enable Reducto racing/fallback.
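
The strategy described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `parseWithDelayedRace`, the `PdfParser` type, and the parameter names are hypothetical, and the three parsers simply stand in for RunPod MU, Reducto, and pdf-parse.

```typescript
// Hypothetical parser shape: resolves to Markdown, rejects on failure.
type PdfParser = () => Promise<string>;

async function parseWithDelayedRace(
  primary: PdfParser, // stands in for RunPod MU
  secondary: PdfParser, // stands in for Reducto
  lastResort: PdfParser, // stands in for pdf-parse
  delayMs: number, // 120_000 according to the PR summary
): Promise<string> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const primaryRun = primary();

  // The secondary starts when the delay elapses or when the primary fails,
  // whichever happens first, and never more than once.
  const secondaryRun = new Promise<string>((resolve, reject) => {
    let started = false;
    const start = () => {
      if (started) return;
      started = true;
      clearTimeout(timer);
      Promise.resolve().then(secondary).then(resolve, reject);
    };
    timer = setTimeout(start, delayMs);
    primaryRun.catch(start);
  });

  try {
    // First successful result wins; rejects only if both engines fail.
    return await Promise.any([primaryRun, secondaryRun]);
  } catch {
    return await lastResort();
  } finally {
    // Prevents the secondary from starting after a result is already in.
    clearTimeout(timer);
  }
}
```

If the primary finishes before `delayMs`, the timer is cleared and the secondary never starts. If both engines fail, the last-resort parser runs.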

@nickscamara requested a review from mogery on September 17, 2025, 21:52
@nickscamara
Member Author

@cubic-dev-ai review

Contributor

cubic-dev-ai bot commented Sep 17, 2025

@cubic-dev-ai review

@nickscamara I've started the AI code review. It'll take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

3 issues found across 1 file

Prompt for AI agents (all 3 issues)

Understand the root cause of the following 3 issues and fix them.


<file name="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts">

<violation number="1" location="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts:303">
XSS risk: untrusted Markdown rendered to HTML with marked.parse without sanitization.</violation>

<violation number="2" location="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts:343">
Timer not cleared; Reducto may start after completion, causing background work and potential unhandled rejection.</violation>

<violation number="3" location="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts:376">
Racing logic rejects on first failure; should use first successful result (e.g., Promise.any).</violation>
</file>
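
Violation 3 turns on the difference between `Promise.race` and `Promise.any`. The sketch below illustrates it with two stand-in engines; `failFast`, `slowSuccess`, and `compare` are hypothetical names used only for this example and do not come from the PR.

```typescript
// Stand-in for an engine that fails quickly.
const failFast = (): Promise<string> =>
  new Promise((_, reject) =>
    setTimeout(() => reject(new Error("engine A failed")), 10),
  );

// Stand-in for an engine that succeeds, but later.
const slowSuccess = (): Promise<string> =>
  new Promise((resolve) =>
    setTimeout(() => resolve("markdown from engine B"), 50),
  );

async function compare(): Promise<{ race: string; any: string }> {
  // Promise.race settles with the first promise to settle, even a rejection.
  const race = await Promise.race([failFast(), slowSuccess()]).catch(
    (e: Error) => `rejected: ${e.message}`,
  );
  // Promise.any resolves with the first fulfilment and rejects only
  // (with an AggregateError) when every input rejects.
  const any = await Promise.any([failFast(), slowSuccess()]);
  return { race, any };
}
```

Here `race` ends up as `"rejected: engine A failed"`, while `any` is `"markdown from engine B"`, which is the "first successful result" behaviour the review asks for.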

React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

3 issues found across 1 file

Prompt for AI agents (all 3 issues)

Understand the root cause of the following 3 issues and fix them.


<file name="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts">

<violation number="1" location="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts:301">
XSS risk: Rendering untrusted Markdown to HTML with marked without sanitization</violation>

<violation number="2" location="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts:374">
Promise.race will reject on the first failure, breaking the intended "first successful result" behavior; use Promise.any instead.</violation>

<violation number="3" location="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts:519">
Use decoded byte length instead of base64 string length when enforcing the 19MB limit.</violation>
</file>
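
To illustrate violation 3: Base64 encodes every 3 bytes as 4 characters, so a limit applied to the encoded string length triggers at roughly three quarters of the intended size. Below is a minimal sketch of computing the decoded length without decoding. `decodedBase64Length`, `isWithinPdfLimit`, and `MAX_PDF_BYTES` are hypothetical names, the input is assumed to be standard Base64 without whitespace, and the exact definition of "19MB" in the PR is not known.

```typescript
// Assumption: 19MB is interpreted here as 19 * 1024 * 1024 bytes.
const MAX_PDF_BYTES = 19 * 1024 * 1024;

// Byte length of the data a Base64 string decodes to, computed from its
// character count and trailing "=" padding (works with or without padding).
function decodedBase64Length(b64: string): number {
  const padding = b64.endsWith("==") ? 2 : b64.endsWith("=") ? 1 : 0;
  return Math.floor((b64.length * 3) / 4) - padding;
}

function isWithinPdfLimit(b64: string, maxBytes: number = MAX_PDF_BYTES): boolean {
  return decodedBase64Length(b64) <= maxBytes;
}
```

For example, `"aGVsbG8="` is 8 characters long but decodes to the 5 bytes of `hello`, so `isWithinPdfLimit("aGVsbG8=", 5)` is `true` even though a check against the string length would fail.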


@nickscamara
Member Author

Don't merge, not working as expected

@nickscamara
Member Author

@cubic-dev-ai re-run

@nickscamara
Member Author

Fixed!

Contributor

cubic-dev-ai bot commented Sep 17, 2025

@cubic-dev-ai re-run

@nickscamara I've started the AI code review. It'll take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 1 file

Prompt for AI agents (all 2 issues)

Understand the root cause of the following 2 issues and fix them.


<file name="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts">

<violation number="1" location="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts:394">
Reducto race task isn’t cancelled if RunPod later succeeds, causing unnecessary background processing and cost; consider aborting Reducto when RunPod resolves.</violation>

<violation number="2" location="apps/api/src/scraper/scrapeURL/engines/pdf/index.ts:584">
Size check uses Base64 string length instead of decoded byte length, misclassifying PDFs near the 19MB limit.</violation>
</file>
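
One way to address violation 1 is to give each engine an `AbortSignal` and abort the losers once a winner resolves. The sketch below assumes each parser honours the signal (for instance by stopping its polling loop). `AbortableParser`, `raceAndCancelLosers`, and `fakeEngine` are hypothetical, and whether the Reducto or RunPod clients in the PR accept a signal is not known from this conversation.

```typescript
// Hypothetical parser shape that accepts a cancellation signal.
type AbortableParser = (signal: AbortSignal) => Promise<string>;

async function raceAndCancelLosers(parsers: AbortableParser[]): Promise<string> {
  const tasks = parsers.map((parse) => {
    const controller = new AbortController();
    return { controller, run: parse(controller.signal) };
  });

  // Tag each result with its index so the winner can be identified.
  const winner = await Promise.any(
    tasks.map((t, index) => t.run.then((value) => ({ value, index }))),
  );

  // Abort every task except the one that produced the result.
  tasks.forEach((t, index) => {
    if (index !== winner.index) t.controller.abort();
  });
  return winner.value;
}

// Illustrative engine: resolves after `ms`, or stops early when aborted.
function fakeEngine(name: string, ms: number, log: string[]): AbortableParser {
  return (signal) =>
    new Promise<string>((resolve, reject) => {
      const timer = setTimeout(() => resolve(name), ms);
      signal.addEventListener(
        "abort",
        () => {
          clearTimeout(timer);
          log.push(`${name} aborted`);
          reject(new Error(`${name} aborted`));
        },
        { once: true },
      );
    });
}
```

Because `Promise.any` attaches handlers to every input, the rejection produced by an aborted loser does not surface as an unhandled rejection.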


@gustavovalverde

@nickscamara I'd highly recommend giving pdf.js (https://github.com/mozilla/pdf.js) a try. I've previously used pdf-parse, and pdf.js performs much better and has nicer APIs too.
