[BUG: Output] HTML tags in Surya OCR output? #467

@ivoras

Description


📝 Describe the Output Issue

I'm running Surya on a PDF with essentially this code:

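    import pdf2image
    from surya.detection import DetectionPredictor
    from surya.foundation import FoundationPredictor
    from surya.recognition import RecognitionPredictor

    # Render each PDF page to a PIL image before running OCR
    pages = pdf2image.convert_from_path(pdf_path, thread_count=4, dpi=200)

    foundation_predictor = FoundationPredictor()
    recognition_predictor = RecognitionPredictor(foundation_predictor)
    detection_predictor = DetectionPredictor()

    predictions = recognition_predictor(pages, det_predictor=detection_predictor, math_mode=False)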

With this I'm getting HTML tags in the output text. Is there a way to get Surya to not generate HTML tags?
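Failing a built-in switch, one possible workaround is to strip the tags after recognition. The sketch below is an assumption, not a Surya feature: it uses only Python's standard `html.parser` module, and the helper name `strip_html_tags` and the sample strings are hypothetical stand-ins for recognized text lines.

```python
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only text content and discards every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html_tags(line: str) -> str:
    """Return `line` with HTML tags removed and entities decoded."""
    stripper = _TagStripper()
    stripper.feed(line)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.text()


# Hypothetical sample strings standing in for recognized text lines.
samples = ["<b>Total</b> amount", "plain text", "x &lt; y"]
cleaned = [strip_html_tags(s) for s in samples]
```

Assuming each page result exposes its lines as `text_lines` with a `text` attribute, as the Surya README describes, the helper could be applied to every `line.text` in `predictions`. This only removes the markup after the fact; it does not change what the model generates.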

⚙️ Environment

  • Surya version: 0.17.0
  • Python version: 3.12.3
  • PyTorch version: 2.8.0
  • Transformers version: 4.57.0
  • Operating System: Ubuntu 24.04.3


Labels: bug: output (Poor markdown/HTML output quality)
