A layered subtitle retrieval system in Go that handles noisy user input—typos, phonetic errors, and paraphrasing—by combining SimHash, phonetic hashing, and semantic embeddings.
- SimHash for quick fuzzy matching on subtitle chunks.
- Phonetic hashing (Double Metaphone + Levenshtein) for sound-alike queries.
- Semantic search via OpenAI's Ada embeddings stored in PostgreSQL with pgvector.
- Sliding window chunking to preserve full context across multi-line quotes.
- Built with Go, Redis, and PostgreSQL.
- Tuned for performance at the current scale; can be extended with LSH, BK-trees, or FAISS as the dataset grows.
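
To make the SimHash layer above concrete, here is a minimal sketch of a 64-bit fingerprint and a Hamming-distance comparison. It is not taken from this repository: the whitespace tokenization, the FNV-1a token hash, and the names `simHash` and `hammingDistance` are assumptions chosen for illustration, and the real system may weight tokens or use shingles instead.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// simHash returns a 64-bit SimHash fingerprint of text.
// Assumption: tokens are lowercased, whitespace-separated words with equal weight.
func simHash(text string) uint64 {
	var votes [64]int
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		// Each token votes +1 on bit positions set in its hash, -1 otherwise.
		for i := 0; i < 64; i++ {
			if sum&(uint64(1)<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}
	// A fingerprint bit is set when the votes for that position are positive.
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= uint64(1) << uint(i)
		}
	}
	return fp
}

// hammingDistance counts the bit positions in which two fingerprints differ.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	stored := simHash("I have been waiting for this moment my whole life")
	query := simHash("I have been wating for this moment my whole life") // typo in the query
	other := simHash("please pass the salt and the pepper")

	fmt.Println("typo query distance:", hammingDistance(stored, query))
	fmt.Println("unrelated distance: ", hammingDistance(stored, other))
}
```

A query would be treated as a fuzzy match when its distance to a stored chunk falls below some threshold; the threshold itself is a tuning choice this sketch does not fix.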
Read the full technical write-up, with visuals, trade-offs, and real examples: <https://medium.com/@sarvesh20123/building-a-robust-subtitle-search-system-with-simhash-phonetic-hashing-and-embeddings-67437e5864b1>
- Language: Go
- In-memory index: Redis
- Semantic storage: PostgreSQL + pgvector
- Embeddings: OpenAI text-embedding-ada-002
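
The sliding window chunking mentioned earlier can be sketched as follows. This is an illustrative version, not the project's code: the function name `chunkLines`, the window size of 3 lines, and the step of 2 lines are all assumptions, and the real system may window by time or by characters rather than by line count.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkLines joins consecutive subtitle lines into overlapping windows.
// Each window holds up to size lines and starts step lines after the previous one,
// so a quote split across lines appears whole in at least one chunk when step < size.
func chunkLines(lines []string, size, step int) []string {
	if size <= 0 || step <= 0 || len(lines) == 0 {
		return nil
	}
	var chunks []string
	for start := 0; start < len(lines); start += step {
		end := start + size
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, strings.Join(lines[start:end], " "))
		if end == len(lines) {
			break // the last window already reaches the final line
		}
	}
	return chunks
}

func main() {
	// Hypothetical subtitle lines, one per cue.
	lines := []string{
		"I have been waiting",
		"for this moment",
		"my whole life,",
		"and now it is here.",
		"Let's go.",
	}
	for i, c := range chunkLines(lines, 3, 2) {
		fmt.Printf("chunk %d: %s\n", i, c)
	}
}
```

Each chunk produced this way would then be indexed by the fuzzy, phonetic, and semantic layers; a smaller step gives more overlap at the cost of more chunks to store.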