[BUG: Output] HTML tags in Surya OCR output? #467

📝 Describe the Output Issue

I'm running Surya on a PDF with essentially this code:

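    import pdf2image
    from surya.detection import DetectionPredictor
    from surya.foundation import FoundationPredictor
    from surya.recognition import RecognitionPredictor

    # Rasterize each PDF page to a PIL image
    pages = pdf2image.convert_from_path(pdf_path, thread_count=4, dpi=200)

    foundation_predictor = FoundationPredictor()
    recognition_predictor = RecognitionPredictor(foundation_predictor)
    detection_predictor = DetectionPredictor()

    predictions = recognition_predictor(pages, det_predictor=detection_predictor, math_mode=False)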

The recognized text in the output contains HTML tags. Is there a way to get Surya to not generate HTML tags?
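One possible workaround, independent of any Surya setting, is to strip the tags after recognition. Below is a minimal sketch using only the Python standard library. The `strip_tags` helper is hypothetical and not part of Surya, and it assumes the recognized text of each line is available as a plain string (for example via a `text` attribute on each entry of a prediction's `text_lines`).

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text content and discards every tag it encounters."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(text: str) -> str:
    """Return `text` with HTML tags removed (hypothetical helper, not a Surya API)."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Assumed usage; attribute names depend on the Surya version in use:
# clean = [strip_tags(line.text) for page in predictions for line in page.text_lines]
```

This only post-processes the output and discards any formatting the tags carried (bold, superscript, and so on), so it does not answer whether the tags can be disabled at the source.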

⚙️ Environment

Surya version: 0.17.0

Python version: 3.12.3

PyTorch version: 2.8.0

Transformers version: 4.57.0

Operating System: Ubuntu 24.04.3
