Integrate Docling for Advanced Document Processing in RAG Pipeline #756

@coleam00


Overview

Integrate Docling to enhance Archon's document processing capabilities with multi-format support and intelligent chunking for RAG operations.

Why Docling?

  • Multi-Format Support: PDF, DOCX, PPTX, XLSX, HTML, Audio (MP3, WAV), Images
  • Built-in OCR: No custom OCR implementation required (EasyOCR support)
  • Structure Preservation: Maintains tables, sections, hierarchies automatically
  • RAG-Optimized: Hybrid chunking strategy respects semantic boundaries
  • Unified Output: All formats export to clean Markdown

Key Features to Implement

1. Document Conversion

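from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("path/to/file.pdf")  # returns a ConversionResult
doc = result.document  # the parsed DoclingDocument
markdown = doc.export_to_markdown()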

2. Hybrid Chunking for RAG

from docling.chunking import HybridChunker

chunker = HybridChunker()
# chunk() expects a DoclingDocument and yields chunks lazily
chunks = list(chunker.chunk(dl_doc=doc))  # Semantic + token-aware chunking

3. Audio Transcription

from docling.pipeline.asr_pipeline import AsrPipeline

# Whisper Turbo for local audio transcription
# Supports 90+ languages with timestamps

Implementation Tasks

Core Integration

  • Add Docling dependency to project
  • Create document processing module using DocumentConverter
  • Implement file format detection and routing
  • Add error handling for unsupported formats
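The format detection, routing, and error-handling tasks above could start from something like the following. This is only a sketch: the `ROUTES` table, the route names, `detect_route`, and `UnsupportedFormatError` are all hypothetical names invented for illustration, and detection by file extension is an assumption (Docling performs its own format detection internally).

```python
from pathlib import Path

# Hypothetical routing table: file extension -> processing route.
# The extensions mirror the formats listed in this issue.
ROUTES = {
    ".pdf": "document",
    ".docx": "document",
    ".pptx": "document",
    ".xlsx": "document",
    ".html": "document",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".mp3": "audio",
    ".wav": "audio",
}


class UnsupportedFormatError(ValueError):
    """Raised when a file extension has no configured route."""


def detect_route(path: str) -> str:
    """Return the processing route for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        label = suffix or "(no extension)"
        raise UnsupportedFormatError(
            f"Unsupported format {label}: {path}"
        ) from None
```

Callers can catch `UnsupportedFormatError` at the upload boundary and report the rejected file instead of failing deep inside the pipeline.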

RAG Pipeline Enhancement

  • Integrate HybridChunker for intelligent document splitting
  • Configure token limits (typical: 512 tokens for embeddings)
  • Preserve metadata (sections, headings, timestamps)
  • Update vector database insertion to handle Docling chunks
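One possible shape for the metadata-preserving insertion step is sketched below. It assumes each chunk exposes a `text` attribute and an optional `meta.headings` list, which is how Docling chunks are understood to look; verify this against the installed version. `ChunkRecord` and `to_records` are hypothetical names, and the actual write to the vector database is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable


@dataclass
class ChunkRecord:
    """Hypothetical row shape for a vector store insert (embedding omitted)."""

    source: str
    chunk_index: int
    text: str
    headings: list[str] = field(default_factory=list)


def to_records(chunks: Iterable[Any], source: str) -> list[ChunkRecord]:
    """Convert chunk objects into flat records, keeping the heading trail.

    Assumption: each chunk has `.text` and, optionally, `.meta.headings`.
    Chunks without headings get an empty list rather than None.
    """
    records = []
    for index, chunk in enumerate(chunks):
        meta = getattr(chunk, "meta", None)
        headings = getattr(meta, "headings", None) or []
        records.append(
            ChunkRecord(
                source=source,
                chunk_index=index,
                text=chunk.text,
                headings=list(headings),
            )
        )
    return records
```

Storing the heading trail next to each chunk lets retrieval results point back to the section they came from.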

Audio Processing (Optional)

  • Add FFmpeg dependency for audio support
  • Configure Whisper Turbo ASR pipeline
  • Implement timestamp extraction for temporal referencing
  • Support MP3, WAV, M4A, FLAC formats
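Because audio support is optional and depends on FFmpeg, a small preflight check may help keep the rest of the pipeline usable when FFmpeg is missing. The sketch below uses only the standard library; `AUDIO_EXTENSIONS`, `is_supported_audio`, `ffmpeg_available`, and `can_transcribe` are hypothetical names, and checking for an `ffmpeg` executable on `PATH` is an assumption about how the dependency is installed.

```python
import shutil
from pathlib import Path
from typing import Optional

# Audio formats named in the task list above.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}


def is_supported_audio(path: str) -> bool:
    """True if the file extension is one of the planned audio formats."""
    return Path(path).suffix.lower() in AUDIO_EXTENSIONS


def ffmpeg_available() -> bool:
    """True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def can_transcribe(path: str, have_ffmpeg: Optional[bool] = None) -> bool:
    """Decide whether a file should be sent to the ASR pipeline.

    `have_ffmpeg` can be passed explicitly (for example in tests);
    otherwise the PATH lookup is performed.
    """
    if have_ffmpeg is None:
        have_ffmpeg = ffmpeg_available()
    return have_ffmpeg and is_supported_audio(path)
```

When `can_transcribe` returns False, the caller can skip the file with a clear message rather than raising from inside the transcription step.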

Advanced Features (Future)

  • Picture classification & description (IBM Granite Vision)
  • Code syntax understanding for technical docs
  • Advanced table structure recognition (TableFormer)
  • Formula and diagram extraction

Benefits

  • No Manual Parsers: Eliminate custom PDF/Word/Excel parsing logic
  • Better RAG Performance: Semantic chunking improves retrieval accuracy
  • Local Processing: Everything runs locally with Hugging Face models
  • Fast: 30-second audio transcribed in ~10 seconds, complex PDFs in <30s

Technical Notes

  • Replaces current document processing with unified API
  • Compatible with existing pgvector/Pinecone/Qdrant implementations
  • Token-aware chunking respects paragraphs, sections, tables
  • Markdown output is ideal for LLM consumption
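To illustrate what token-aware chunking that respects paragraph boundaries means, here is a deliberately simplified sketch. It is not Docling's HybridChunker algorithm: it counts whitespace-separated words as a rough stand-in for tokens, and `pack_paragraphs` is a hypothetical helper written only for this illustration.

```python
def pack_paragraphs(text: str, max_tokens: int = 512) -> list[str]:
    """Greedily pack whole paragraphs into chunks under a token budget.

    Simplifications: a "token" is a whitespace-separated word, and a
    paragraph is never split, so a single paragraph longer than
    `max_tokens` becomes its own oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in paragraphs:
        n_tokens = len(paragraph.split())
        # Flush the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A real implementation would count tokens with the embedding model's tokenizer and would also treat headings and tables as boundaries, which is the part Docling is expected to handle.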

Metadata


Labels: enhancement (New feature or request)
