CCVec - Common Crawl to Vector Stores

Search, analyze, and index Common Crawl data into vector stores for RAG applications. Three interfaces are available:

  • CLI
  • Python library
  • MCP server
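
The examples below assume a local checkout of the repo run through uv. A minimal sketch (the clone URL is inferred from the repository name, and --help is assumed to work as in most CLIs):

# Clone the repo and confirm the CLI runs (uv resolves dependencies on first run)
git clone https://github.com/commoncrawl/cc-vec
cd cc-vec
uv run cc-vec --help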

Quick Start

Environment variables:

  • ATHENA_OUTPUT_BUCKET - Required; S3 bucket where Athena writes query results (needed to query the Common Crawl index)
  • AWS_ACCESS_KEY_ID - Required for Athena/S3 access
  • AWS_SECRET_ACCESS_KEY - Required for Athena/S3 access
  • AWS_SESSION_TOKEN - Optional; required only when using temporary credentials
  • OPENAI_API_KEY - Required for vector operations (index, query, list)
  • OPENAI_BASE_URL - Optional; custom OpenAI-compatible endpoint (e.g., http://localhost:8321/v1 for Llama Stack)
  • OPENAI_EMBEDDING_MODEL - Embedding model to use (e.g., text-embedding-3-small, ollama/nomic-embed-text:latest)
  • OPENAI_EMBEDDING_DIMENSIONS - Optional; embedding dimensions (model-specific)
  • AWS_DEFAULT_REGION - AWS region (defaults to us-west-2)
  • LOG_LEVEL - Logging level (defaults to INFO)
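
A minimal shell setup for cloud usage might look like this (bucket name and keys are placeholders):

# Required for Athena queries against the Common Crawl index
export ATHENA_OUTPUT_BUCKET=s3://your-athena-results-bucket/
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret

# Required only for vector operations (index, query, list)
export OPENAI_API_KEY=your-openai-api-key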

Note: URL matching uses SQL LIKE wildcards (% matches any sequence of characters), not glob patterns (*). For example, %.github.io matches any subdomain of github.io.

1. ⌨️ Command Line

# Search Common Crawl index
uv run cc-vec search --url-patterns "%.github.io" --limit 10

# Get statistics
uv run cc-vec stats --url-patterns "%.edu"

# Fetch and process content (returns clean text)
uv run cc-vec fetch --url-patterns "%.example.com" --limit 5

# Advanced filtering - multiple filters can be combined
uv run cc-vec fetch --url-patterns "%.github.io" --status-codes "200,201" --mime-types "text/html" --limit 10

# Filter by hostname instead of pattern
uv run cc-vec search --url-host-names "github.io,github.com" --limit 10

# Filter by TLD for better performance (uses indexed column)
uv run cc-vec search --url-host-tlds "edu,gov" --limit 20

# Filter by registered domain (uses indexed column)
uv run cc-vec search --url-host-registered-domains "github.com,example.com" --limit 15

# Filter by URL path (for specific site sections)
uv run cc-vec search --url-host-names "github.io" --url-paths "/blog/%,/docs/%" --limit 10

# Query across multiple Common Crawl datasets
uv run cc-vec search --url-patterns "%.edu" --crawl-ids "CC-MAIN-2024-33,CC-MAIN-2024-30" --limit 20

# List available Common Crawl datasets
uv run cc-vec list-crawls

# List all available filter columns (no API keys needed)
uv run cc-vec list-filter-columns
uv run cc-vec list-filter-columns --output json

# Vector operations (require OPENAI_API_KEY)
# Create vector store with processed content (OpenAI handles chunking with token limits)
uv run cc-vec index --url-patterns "%.github.io" --vector-store-name "ml-research" --limit 50 --chunk-size 800 --overlap 400

# Vector store name is optional - will auto-generate if not provided
uv run cc-vec index --url-patterns "%.github.io" --limit 50

# List cc-vec vector stores (default - only shows stores created by cc-vec)
uv run cc-vec list --output json

# List ALL vector stores (including non-cc-vec stores)
uv run cc-vec list --all

# Query vector store by ID for RAG
uv run cc-vec query "What is machine learning?" --vector-store-id "vs-123abc" --limit 5

# Query vector store by name
uv run cc-vec query "Explain deep learning" --vector-store-name "ml-research" --limit 3

# Fetch → Index Pipeline (for large datasets or multi-step workflows)
# Step 1: Fetch and save content to files
uv run cc-vec fetch --url-patterns "%.github.io" --output-dir ./fetched_data/ --limit 100

# Step 2: Index from saved files (can be run later or on different machine)
uv run cc-vec index --input-dir ./fetched_data/ --vector-store-name "my-research"

# Use --batch-size to reduce load on embedding service (helps with local models)
uv run cc-vec index --input-dir ./fetched_data/ --batch-size 5 --limit 50

# Use --provider-id to specify vector store backend (Llama Stack only)
uv run cc-vec index --input-dir ./fetched_data/ --provider-id chromadb --vector-store-name "persistent-store"

Pipeline Benefits

The fetch → index pipeline allows you to:

  • Fetch data once and index multiple times with different configurations (see the sketch after this list)
  • Process data on a different machine than where it was fetched
  • Resume indexing if the process was interrupted
  • Share fetched data with others
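
For example, a single fetch can feed multiple stores with different chunking (store names and chunk sizes here are illustrative):

# Fetch once
uv run cc-vec fetch --url-patterns "%.edu" --output-dir ./edu_data/ --limit 200

# Index the same files twice with different chunking configurations
uv run cc-vec index --input-dir ./edu_data/ --vector-store-name "edu-small-chunks" --chunk-size 400 --overlap 100
uv run cc-vec index --input-dir ./edu_data/ --vector-store-name "edu-large-chunks" --chunk-size 800 --overlap 400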

Advanced Options

  • --batch-size N: Upload files in batches of N to reduce load on embedding services (recommended for local models like Ollama)
  • --provider-id ID: Specify vector store backend for Llama Stack (chromadb for persistent storage, faiss for in-memory)

1.5. Local Llama Stack Setup (Optional)

Run cc-vec with local models using Ollama + Llama Stack for a fully local setup.

Step 1: Install and Start Ollama

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama server
ollama serve &

# Pull required models
ollama pull llama3.2:3b        # Inference model
ollama pull nomic-embed-text   # Embedding model (768 dimensions)
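
# Optional: confirm both models are available locally
ollama list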

Step 2: Start ChromaDB (Optional - for persistent vector storage)

The starter distribution uses in-memory FAISS by default. For persistent storage, run ChromaDB:

# Install and run ChromaDB
uv run --with chromadb chroma run --host localhost --port 8000 --path ./chroma_data
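
To confirm ChromaDB is reachable before continuing, a heartbeat check works; the path below assumes a recent Chroma release (older versions use /api/v1/heartbeat):

curl http://localhost:8000/api/v2/heartbeat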

Step 3: Start Llama Stack

# With explicit Ollama URL (and the default in-memory FAISS vector store)
uv run --with 'llama-stack>=0.4.0' llama stack run starter --port 8321 \
  --env OLLAMA_URL=http://localhost:11434/v1

# With ChromaDB for persistent vector storage (if running ChromaDB from Step 2)
uv run --with 'llama-stack>=0.4.0' llama stack run starter --port 8321 \
  --env OLLAMA_URL=http://localhost:11434/v1 \
  --env CHROMADB_URL=http://localhost:8000

This starts the Llama Stack server at http://localhost:8321.
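
To verify the server is responding, you can hit the OpenAI-compatible models endpoint (assuming the standard /v1 surface Llama Stack exposes):

curl http://localhost:8321/v1/models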

Step 4: Use with cc-vec

# Set environment variables
export OPENAI_BASE_URL=http://localhost:8321/v1
export OPENAI_API_KEY=none # Llama Stack doesn't require a real key
export OPENAI_EMBEDDING_MODEL=ollama/nomic-embed-text:latest
export OPENAI_EMBEDDING_DIMENSIONS=768

# Set your Athena credentials
export ATHENA_OUTPUT_BUCKET=s3://your-bucket/
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret

# Use cc-vec with local models
uv run cc-vec index --url-patterns "%.edu" --limit 10

# Use ChromaDB for persistent storage (requires ChromaDB running)
uv run cc-vec index --url-patterns "%.edu" --limit 10 --provider-id chromadb

# Use --batch-size to prevent overwhelming Ollama with concurrent requests
uv run cc-vec index --input-dir ./data/ --batch-size 5 --provider-id chromadb
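
Once indexing finishes, the same query commands work against the local endpoint; the store ID below is a placeholder, so take the real one from the list output:

# Find the store ID, then query it locally
uv run cc-vec list
uv run cc-vec query "What is machine learning?" --vector-store-id "vs-123abc" --limit 3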

2. 📦 Python Library

import os
from cc_vec import (
    search,
    stats,
    fetch,
    index,
    index_from_files,
    list_vector_stores,
    query_vector_store,
    list_crawls,
    FilterConfig,
    VectorStoreConfig,
)

# For alternative endpoints, set environment variables before importing
# Example: Using Ollama
# os.environ["OPENAI_BASE_URL"] = "http://localhost:11434/v1"
# os.environ["OPENAI_API_KEY"] = "ollama"
# os.environ["OPENAI_EMBEDDING_MODEL"] = "ollama/nomic-embed-text:latest"
# os.environ["OPENAI_EMBEDDING_DIMENSIONS"] = "768"

# Example: Using Llama Stack
# os.environ["OPENAI_BASE_URL"] = "http://localhost:8321/v1"
# os.environ["OPENAI_API_KEY"] = "your-llama-stack-key"

# Basic search and stats (no OpenAI key needed)
filter_config = FilterConfig(url_patterns=["%.github.io"])

stats_response = stats(filter_config)
print(f"Estimated records: {stats_response.estimated_records:,}")
print(f"Estimated size: {stats_response.estimated_size_mb:.2f} MB")
print(f"Athena cost: ${stats_response.estimated_cost_usd:.4f}")

results = search(filter_config, limit=10)
print(f"Found {len(results)} URLs")
for result in results[:3]:
    print(f"  {result.url} (Status: {result.status})")

# Advanced filtering - multiple criteria
filter_config = FilterConfig(
    url_patterns=["%.github.io", "%.github.com"],
    url_host_names=["github.io"],
    url_host_tlds=["io", "com"],  # Filter by TLD (uses indexed column)
    url_host_registered_domains=["github.com"],  # Filter by domain (uses indexed column)
    url_paths=["/blog/%", "/docs/%"],  # Filter by URL path
    crawl_ids=["CC-MAIN-2024-33", "CC-MAIN-2024-30"],  # Query multiple crawls
    status_codes=[200, 201],
    mime_types=["text/html"],
    charsets=["utf-8"],
    languages=["en"],
)

results = search(filter_config, limit=20)
print(f"Found {len(results)} URLs matching filters")

# Using indexed columns for better performance
filter_config = FilterConfig(
    url_host_tlds=["edu", "gov"],  # Much faster than url_patterns=["%.edu", "%.gov"]
    status_codes=[200],
)
results = search(filter_config, limit=50)
print(f"Found {len(results)} .edu and .gov sites")

# Fetch and process content (returns clean text)
filter_config = FilterConfig(url_patterns=["%.example.com"])
content_results = fetch(filter_config, limit=2)
print(f"Processed {len(content_results)} content records")
for record, processed in content_results:
    if processed:
        print(f"  {record.url}: {processed['word_count']} words")
        print(f"    Title: {processed.get('title', 'N/A')}")

# List available Common Crawl datasets
crawls = list_crawls()
print(f"Available crawls: {len(crawls)}")
print(f"Latest: {crawls[0]}")

# Index data in a vector store
filter_config = FilterConfig(url_patterns=["%.github.io"])
vector_config = VectorStoreConfig(
    name="ml-research",
    chunk_size=800,
    overlap=400,
    embedding_model="text-embedding-3-small",
    embedding_dimensions=1536,
)

result = index(filter_config, vector_config, limit=50)
print(f"Created vector store: {result['vector_store_name']}")
print(f"Vector Store ID: {result['vector_store_id']}")
print(f"Processed records: {result['total_fetched']}")

# List cc-vec vector stores (default - only shows stores created by cc-vec)
stores = list_vector_stores()
print(f"Available stores: {len(stores)}")
for store in stores[:3]:
    print(f"  {store['name']} (ID: {store['id']}, Status: {store['status']})")

# List ALL vector stores (including non-cc-vec stores)
all_stores = list_vector_stores(cc_vec_only=False)
print(f"All stores: {len(all_stores)}")

# Query vector store for RAG
query_results = query_vector_store("vs-123abc", "What is machine learning?", limit=5)
print(f"Query found {len(query_results.get('results', []))} relevant results")
for i, result in enumerate(query_results.get("results", []), 1):
    print(f"  {i}. Score: {result.get('score', 0):.3f}")
    print(f"     Content: {result.get('content', '')[:100]}...")
    print(f"     File: {result.get('file_id', 'N/A')}")

# Index from pre-fetched files (two-step workflow)
# Step 1: Save fetch results to disk (use fetch --output-dir in CLI or save manually)
# Step 2: Index from the saved files
vector_config = VectorStoreConfig(
    name="from-saved-files",
    chunk_size=800,
    overlap=400,
)
result = index_from_files("./fetched_data/", vector_config, limit=100)
print(f"Indexed {result['total_fetched']} files into {result['vector_store_name']}")

3. 🔌 MCP Server (Claude Desktop)

Setup:

  1. Copy and edit the config: cp claude_desktop_config.json ~/Library/Application\ Support/Claude/claude_desktop_config.json (path shown is for macOS)
  2. Update the directory path and API key in the config file
  3. Restart Claude Desktop

The config uses stdio mode (required by Claude Desktop). JSON does not support comments, so remove the // lines below before use:

{
  "mcpServers": {
    "cc-vec": {
      "command": "uv",
      "args": ["run", "--directory", "your-path-to-the-repo", "cc-vec", "mcp-serve", "--mode", "stdio"],
      "env": {
        "ATHENA_OUTPUT_BUCKET": "your-athena-output-bucket",
        "OPENAI_API_KEY": "your-openai-api-key-here"
        // "OPENAI_BASE_URL": "http://localhost:11434/v1"   // Optional: Use for Ollama, Llama Stack, or other endpoints
        // "OPENAI_EMBEDDING_MODEL": "ollama/nomic-embed-text:latest"     // Optional: Specify custom embedding model
        // "OPENAI_EMBEDDING_DIMENSIONS": "768"              // Optional: Specify embedding dimensions
      }
    }
  }
}
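
To sanity-check the server outside Claude Desktop, you can run the same command from the config by hand (press Ctrl-C to stop; the path is a placeholder):

uv run --directory your-path-to-the-repo cc-vec mcp-serve --mode stdio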

Available MCP tools:

# Search and analysis (no OpenAI key needed)
cc_search - Search Common Crawl for URLs matching patterns with advanced filtering
cc_stats - Get statistics and cost estimates for patterns with advanced filtering
cc_fetch - Download actual content from matched URLs with advanced filtering
cc_list_crawls - List available Common Crawl dataset IDs

# Vector operations (require OPENAI_API_KEY)
cc_index - Create and populate vector stores from Common Crawl content with chunking config
cc_list_vector_stores - List OpenAI vector stores (defaults to showing only stores created by cc-vec)
cc_query - Query vector stores for relevant content

Example usage in Claude Desktop:

  • "Use cc_search to find GitHub Pages sites: url_pattern=%.github.io, limit=10"
  • "Use cc_stats to analyze education sites: url_pattern=%.edu"
  • "Use cc_search with indexed columns for better performance: url_host_tlds=['edu', 'gov'], limit=20"
  • "Use cc_search with registered domains: url_host_registered_domains=['github.com'], limit=15"
  • "Use cc_search for specific paths: url_host_names=['github.io'], url_paths=['/blog/%'], limit=10"
  • "Use cc_search across multiple crawls: url_pattern=%.edu, crawl_ids=['CC-MAIN-2024-33', 'CC-MAIN-2024-30']"
  • "Use cc_fetch to get content: url_host_names=['github.io'], limit=5"
  • "Use cc_list_crawls to show available Common Crawl datasets"
  • "Use cc_index to create vector store: vector_store_name=research, url_pattern=%.arxiv.org, limit=100, chunk_size=800"
  • "Use cc_list_vector_stores to show cc-vec stores (default)"
  • "Use cc_list_vector_stores with cc_vec_only=false to show all vector stores"
  • "Use cc_query to search: vector_store_id=vs-123, query=machine learning"

Note: All filter options available in CLI (shown via cc-vec list-filter-columns) are also available in MCP tools.

License

MIT
