Open
Labels: enhancement (New feature or request)
Description
Overview
Integrate Docling to enhance Archon's document processing capabilities with multi-format support and intelligent chunking for RAG operations.
Why Docling?
- Multi-Format Support: PDF, DOCX, PPTX, XLSX, HTML, Audio (MP3, WAV), Images
- Built-in OCR: No custom OCR implementation required (EasyOCR support)
- Structure Preservation: Maintains tables, sections, hierarchies automatically
- RAG-Optimized: Hybrid chunking strategy respects semantic boundaries
- Unified Output: All formats export to clean Markdown
Key Features to Implement
1. Document Conversion
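from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("path/to/file.pdf")  # returns a ConversionResult
doc = result.document  # the parsed DoclingDocument
markdown = doc.export_to_markdown()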
2. Hybrid Chunking for RAG
from docling.chunking import HybridChunker
chunker = HybridChunker()
chunks = chunker.chunk(doc) # Semantic + token-aware chunking
3. Audio Transcription
from docling.pipeline.asr_pipeline import AsrPipeline
# Whisper Turbo for local audio transcription
# Supports 90+ languages with timestamps
Implementation Tasks
Core Integration
- Add Docling dependency to project
- Create document processing module using DocumentConverter
- Implement file format detection and routing
- Add error handling for unsupported formats
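The routing and error-handling tasks above could start from something like the following sketch. The extension table, `UnsupportedFormatError`, `detect_format`, and `convert_to_markdown` are hypothetical names for illustration, not part of Docling or Archon. The Docling import path assumes a recent release of the package, and audio is left unimplemented here because it would need an ASR-configured converter.

```python
from pathlib import Path

# Hypothetical routing table: file extension -> processing route.
# The extensions mirror the formats listed in this issue.
ROUTES = {
    ".pdf": "document", ".docx": "document", ".pptx": "document",
    ".xlsx": "document", ".html": "document",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".mp3": "audio", ".wav": "audio", ".m4a": "audio", ".flac": "audio",
}


class UnsupportedFormatError(ValueError):
    """Raised when a file extension has no configured route."""


def detect_format(path: str) -> str:
    """Return the route name for a path, based only on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in ROUTES:
        raise UnsupportedFormatError(f"Unsupported file type: {path}")
    return ROUTES[suffix]


def convert_to_markdown(path: str) -> str:
    """Validate the format, then convert with Docling (imported lazily)."""
    route = detect_format(path)  # fail fast before loading any models
    if route == "audio":
        # Assumption: audio needs a converter configured with an ASR pipeline.
        raise NotImplementedError("audio route is not wired up in this sketch")
    # Assumption: import path as in recent Docling releases.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(path)
    return result.document.export_to_markdown()
```

Detecting by extension alone is the simplest option; content sniffing (MIME type or magic bytes) could replace `detect_format` later without changing callers.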
RAG Pipeline Enhancement
- Integrate HybridChunker for intelligent document splitting
- Configure token limits (typical: 512 tokens for embeddings)
- Preserve metadata (sections, headings, timestamps)
- Update vector database insertion to handle Docling chunks
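One way to carry chunk metadata into the vector store is to flatten each chunk into a plain record before embedding. The sketch below is only an illustration: `chunk_to_record`, `chunks_to_records`, and the record keys are hypothetical, and it assumes each Docling chunk exposes `text` and a `meta` object with `headings`, which should be checked against the installed Docling version.

```python
from typing import Any, Iterable


def chunk_to_record(chunk: Any, source: str, index: int) -> dict:
    """Flatten one chunk into a dict ready for embedding and insertion.

    Assumes `chunk.text` and `chunk.meta.headings` exist (an assumption;
    verify against your Docling version). Missing metadata falls back
    to an empty heading list instead of raising.
    """
    meta = getattr(chunk, "meta", None)
    headings = getattr(meta, "headings", None) or []
    return {
        "id": f"{source}#{index}",
        "source": source,
        "text": chunk.text,
        "headings": list(headings),
    }


def chunks_to_records(chunks: Iterable[Any], source: str) -> list[dict]:
    """Convert an iterable of chunks, skipping whitespace-only ones."""
    records = []
    for index, chunk in enumerate(chunks):
        if chunk.text.strip():
            records.append(chunk_to_record(chunk, source, index))
    return records
```

Keeping the records as plain dicts means the same output could feed pgvector, Pinecone, or Qdrant insertion code without those layers depending on Docling types.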
Audio Processing (Optional)
- Add FFmpeg dependency for audio support
- Configure Whisper Turbo ASR pipeline
- Implement timestamp extraction for temporal referencing
- Support MP3, WAV, M4A, FLAC formats
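For the FFmpeg and timestamp tasks, a sketch along these lines may help. It assumes transcript lines look like `[time: 0.0-4.0] some text`; that format is an assumption about the ASR output and should be confirmed against real transcripts. `ffmpeg_available`, `Segment`, and `parse_segments` are hypothetical helpers.

```python
import re
import shutil
from dataclasses import dataclass


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is on PATH."""
    return shutil.which("ffmpeg") is not None


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


# Assumed line format: "[time: <start>-<end>] <text>"
_SEGMENT_RE = re.compile(r"^\[time:\s*(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)\]\s*(.*)$")


def parse_segments(transcript: str) -> list[Segment]:
    """Extract timestamped segments; lines that do not match are skipped."""
    segments = []
    for line in transcript.splitlines():
        match = _SEGMENT_RE.match(line.strip())
        if match:
            start, end, text = match.groups()
            segments.append(Segment(float(start), float(end), text.strip()))
    return segments
```

Calling `ffmpeg_available()` at startup lets the audio route be disabled cleanly when FFmpeg is missing, which fits the optional status of this task group.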
Advanced Features (Future)
- Picture classification & description (IBM Granite Vision)
- Code syntax understanding for technical docs
- Advanced table structure recognition (TableFormer)
- Formula and diagram extraction
Benefits
- No Manual Parsers: Eliminate custom PDF/Word/Excel parsing logic
- Better RAG Performance: Semantic chunking improves retrieval accuracy
- Local Processing: Everything runs locally with Hugging Face models
- Fast: a 30-second audio clip transcribes in roughly 10 seconds and complex PDFs convert in under 30 seconds (figures depend on hardware)
References
- Docling Documentation
- Docling GitHub
- Tutorial: Turn ANY File into LLM Knowledge
- Research notes in Slack #research channel
Technical Notes
- Replaces current document processing with unified API
- Compatible with existing pgvector/Pinecone/Qdrant implementations
- Token-aware chunking respects paragraphs, sections, tables
- Markdown output is ideal for LLM consumption
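To make the note about token-aware chunking respecting paragraphs concrete, here is a deliberately simplified illustration. It is not Docling's HybridChunker algorithm: it counts whitespace-separated words as a stand-in for tokenizer tokens and only packs whole paragraphs under a budget. `pack_paragraphs` is a hypothetical name.

```python
def pack_paragraphs(markdown: str, max_tokens: int = 512) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    Tokens are approximated by whitespace-separated words. A single
    paragraph longer than the budget is emitted as its own oversized
    chunk rather than being split mid-paragraph.
    """
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    chunks, current, used = [], [], 0
    for para in paragraphs:
        size = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and used + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A real pipeline would count tokens with the embedding model's tokenizer and would also need a policy for oversized paragraphs and tables, which this sketch leaves out.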