embgrep

Korean documentation · llms.txt

Local semantic search — embedding-powered grep for files, zero external services.

Search your codebase and documentation by meaning, not just keywords. embgrep indexes files into local embeddings and lets you run semantic queries — no API keys, no cloud services, no vector database servers.

Features

  • Local embeddings — Uses fastembed (ONNX Runtime), no API keys needed
  • SQLite storage — Single-file index, no external vector DB
  • Incremental indexing — Only re-indexes changed files (SHA-256 hash comparison)
  • Smart chunking — Function-level splitting for code, heading-level for docs
  • MCP native — 4-tool FastMCP server for LLM agent integration
  • 15+ file types: .py, .js, .ts, .java, .go, .rs, .md, .txt, .yaml, .json, .toml, and more
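
The incremental indexing idea (skip files whose SHA-256 digest has not changed) can be sketched with the standard library. This is an illustration only, not embgrep's internal code: `file_digest`, `changed_files`, and the `known_hashes` dictionary are hypothetical names, and the real index presumably keeps its hashes in the SQLite database.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_files(paths, known_hashes):
    """Return the paths whose current digest differs from the stored one.

    `known_hashes` maps path string -> digest from the previous run
    (a hypothetical structure used only for this sketch).
    """
    stale = []
    for path in paths:
        digest = file_digest(path)
        if known_hashes.get(str(path)) != digest:
            stale.append(path)
            known_hashes[str(path)] = digest  # remember for the next run
    return stale
```

Only the paths returned by `changed_files` would need to be re-chunked and re-embedded; everything else keeps its existing vectors.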

Install

pip install embgrep              # core (fastembed + numpy)
pip install embgrep[cli]         # + click/rich CLI
pip install embgrep[mcp]         # + FastMCP server
pip install embgrep[all]         # everything

Quick Start

Python API

from embgrep import EmbGrep

eg = EmbGrep()

# Index a directory
eg.index("./my-project", patterns=["*.py", "*.md"])

# Semantic search
results = eg.search("database connection pooling", top_k=5)
for r in results:
    print(f"{r.file_path}:{r.line_start}-{r.line_end} (score: {r.score:.4f})")
    print(f"  {r.chunk_text[:80]}...")

# Incremental update (only changed files)
eg.update()

# Index statistics
status = eg.status()
print(f"{status.total_files} files, {status.total_chunks} chunks, {status.index_size_mb} MB")

eg.close()

CLI

# Index a project
embgrep index ./my-project --patterns "*.py,*.md"

# Search
embgrep search "error handling patterns"

# Filter by file type
embgrep search "async database query" --path-filter "%.py"

# Check status
embgrep status

# Update changed files
embgrep update
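
The `%` in `--path-filter "%.py"` looks like a SQL `LIKE` wildcard rather than a shell glob. The README does not say this explicitly, so treat it as an assumption. Under that assumption, the filter would behave like the query below; the `chunks` table and `file_path` column are hypothetical names chosen for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (file_path TEXT)")  # hypothetical schema
conn.executemany(
    "INSERT INTO chunks VALUES (?)",
    [("src/db.py",), ("docs/guide.md",), ("src/app.py",)],
)

# In SQL LIKE, % matches any run of characters, so "%.py" means "ends with .py".
rows = conn.execute(
    "SELECT file_path FROM chunks WHERE file_path LIKE ? ORDER BY file_path",
    ("%.py",),
).fetchall()
matches = [row[0] for row in rows]
print(matches)
```

If the assumption holds, a pattern such as `src/%` would restrict results to one directory in the same way.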

Convenience functions

import embgrep

embgrep.index("./src")
results = embgrep.search("authentication middleware")
status = embgrep.status()
embgrep.update()

MCP Server

Add to your Claude Desktop / MCP client configuration:

{
  "mcpServers": {
    "embgrep": {
      "command": "embgrep-mcp"
    }
  }
}

Or with uvx:

{
  "mcpServers": {
    "embgrep": {
      "command": "uvx",
      "args": ["--from", "embgrep[mcp]", "embgrep-mcp"]
    }
  }
}

MCP Tools

| Tool | Description |
| --- | --- |
| index_directory | Index files in a directory for semantic search |
| semantic_search | Search indexed files using natural language |
| index_status | Get current index statistics |
| update_index | Incremental update: re-index changed files only |

How It Works

flowchart TD
    A["📁 Files"] --> B["Smart Chunking\ncode: function-level\ndocs: heading-level"]
    B --> C["fastembed\nlocal embeddings"]
    C --> D["SQLite\nvector index"]
    D --> E["🔍 Query"]
    E --> F["Cosine Similarity\nranked results"]
    F --> G["✅ Matches\nwith context"]
  1. Chunking — Files are split into semantically meaningful chunks:

    • Code files (.py, .js, .ts, etc.): split by function/class boundaries
    • Documents (.md, .txt): split by headings or paragraph breaks
    • Config files: fixed-size chunking
  2. Embedding — Each chunk is converted to a 384-dimensional vector using BGE-small-en-v1.5 via ONNX Runtime (no PyTorch needed)

  3. Storage — Embeddings are stored as BLOBs in a local SQLite database

  4. Search — Query text is embedded and compared against all chunks using cosine similarity
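
The four steps above can be sketched end to end with the standard library alone. This is a minimal illustration, not embgrep's implementation: `toy_embed` is a hypothetical word-hashing stand-in for the BGE model (a real setup would call fastembed instead), and the table layout and function names are assumptions made for the example.

```python
import hashlib
import math
import re
import sqlite3
import struct

DIM = 384  # same vector size as quoted above; the embedder below is only a toy


def chunk_by_heading(text):
    """Step 1: start a new chunk at every Markdown-style heading."""
    chunks, current = [], []
    for line in text.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def toy_embed(text):
    """Step 2 (stand-in): count words into DIM buckets chosen by SHA-256."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(word.encode()).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(conn, text):
    """Step 3: store each chunk with its vector packed as a float32 BLOB."""
    conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, body TEXT, vec BLOB)")
    for chunk in chunk_by_heading(text):
        blob = struct.pack(f"{DIM}f", *toy_embed(chunk))
        conn.execute("INSERT INTO chunks (body, vec) VALUES (?, ?)", (chunk, blob))


def search(conn, query, top_k=5):
    """Step 4: embed the query and rank every stored chunk by cosine similarity."""
    q = toy_embed(query)
    scored = []
    for body, blob in conn.execute("SELECT body, vec FROM chunks"):
        scored.append((cosine(q, struct.unpack(f"{DIM}f", blob)), body))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


DOC = """# Database
Connection pooling keeps sockets open for reuse.
# Logging
Structured logs are written as JSON lines.
"""

conn = sqlite3.connect(":memory:")
build_index(conn, DOC)
results = search(conn, "connection pooling", top_k=1)
print(results[0][1].splitlines()[0])
```

The toy embedder only rewards shared words, so it cannot match by meaning the way a trained model does; it is here to show the data flow. Comparing the query against every row is a linear scan, which is consistent with the "compared against all chunks" description.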

Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| db_path | ~/.local/share/embgrep/embgrep.db | SQLite database location |
| model | BAAI/bge-small-en-v1.5 | fastembed model name |
| max_chunk_size | 1000 chars | Maximum chunk size for fixed-size splitting |
| top_k | 5 | Number of search results |
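
To make the `max_chunk_size` setting concrete, here is a minimal sketch of fixed-size splitting. The function name `fixed_size_chunks` is hypothetical, and the real splitter may handle boundaries differently (for example by breaking on line ends or overlapping chunks).

```python
def fixed_size_chunks(text, max_chunk_size=1000):
    """Split text into consecutive pieces of at most max_chunk_size characters."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    return [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]


config_text = "key = value\n" * 250  # 3000 characters of sample config
pieces = fixed_size_chunks(config_text)
print(len(pieces), [len(p) for p in pieces])
```

A smaller `max_chunk_size` yields more, shorter chunks and therefore more vectors to store and compare at query time.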

QuartzUnit Ecosystem

| Package | Description |
| --- | --- |
| markgrab | HTML/YouTube/PDF/DOCX to LLM-ready markdown |
| snapgrab | URL to screenshot + metadata |
| docpick | OCR + LLM document structure extraction |
| browsegrab | Local LLM browser agent |
| feedkit | RSS feed collection + MCP |
| embgrep | Local semantic search for files |

Used in

  • newswatch — RSS news monitoring pipeline (feedkit → markgrab → embgrep → diffgrab)

License

MIT


Part of the QuartzUnit ecosystem — composable Python libraries for data collection, extraction, search, and AI agent safety.
