CCVec - Common Crawl to Vector Stores

Search, analyze, and index Common Crawl data into vector stores for RAG applications. Three interfaces are available:

  • CLI
  • Python library
  • MCP server
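
The examples below assume a local checkout of the repo run through uv. A minimal sketch (the clone URL is inferred from the repository name, and --help is assumed to work as in most CLIs):

# Clone the repo and confirm the CLI runs (uv resolves dependencies on first run)
git clone https://github.com/commoncrawl/cc-vec
cd cc-vec
uv run cc-vec --help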

Quick Start

Environment variables:

  • ATHENA_OUTPUT_BUCKET - Required; S3 bucket where Athena writes query results (needed to query the Common Crawl index)
  • AWS_ACCESS_KEY_ID - Required for Athena/S3 access
  • AWS_SECRET_ACCESS_KEY - Required for Athena/S3 access
  • AWS_SESSION_TOKEN - Optional; required only when using temporary credentials
  • OPENAI_API_KEY - Required for vector operations (index, query, list)
  • OPENAI_BASE_URL - Optional; custom OpenAI-compatible endpoint (e.g., http://localhost:8321/v1 for Llama Stack)
  • OPENAI_EMBEDDING_MODEL - Embedding model to use (e.g., text-embedding-3-small, ollama/nomic-embed-text:latest)
  • OPENAI_EMBEDDING_DIMENSIONS - Optional; embedding dimensions (model-specific)
  • AWS_DEFAULT_REGION - AWS region (defaults to us-west-2)
  • LOG_LEVEL - Logging level (defaults to INFO)
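
A minimal shell setup for cloud usage might look like this (bucket name and keys are placeholders):

# Required for Athena queries against the Common Crawl index
export ATHENA_OUTPUT_BUCKET=s3://your-athena-results-bucket/
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret

# Required only for vector operations (index, query, list)
export OPENAI_API_KEY=your-openai-api-key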

Note: URL matching uses SQL LIKE wildcards (% matches any sequence of characters), not glob patterns (*). For example, %.github.io matches any subdomain of github.io.

1. ⌨️ Command Line

# Search Common Crawl index
uv run cc-vec search --url-patterns "%.github.io" --limit 10

# Get statistics
uv run cc-vec stats --url-patterns "%.edu"

# Fetch and process content (returns clean text)
uv run cc-vec fetch --url-patterns "%.example.com" --limit 5

# Advanced filtering - multiple filters can be combined
uv run cc-vec fetch --url-patterns "%.github.io" --status-codes "200,201" --mime-types "text/html" --limit 10

# Filter by hostname instead of pattern
uv run cc-vec search --url-host-names "github.io,github.com" --limit 10

# Filter by TLD for better performance (uses indexed column)
uv run cc-vec search --url-host-tlds "edu,gov" --limit 20

# Filter by registered domain (uses indexed column)
uv run cc-vec search --url-host-registered-domains "github.com,example.com" --limit 15

# Filter by URL path (for specific site sections)
uv run cc-vec search --url-host-names "github.io" --url-paths "/blog/%,/docs/%" --limit 10

# Query across multiple Common Crawl datasets
uv run cc-vec search --url-patterns "%.edu" --crawl-ids "CC-MAIN-2024-33,CC-MAIN-2024-30" --limit 20

# List available Common Crawl datasets
uv run cc-vec list-crawls

# List all available filter columns (no API keys needed)
uv run cc-vec list-filter-columns
uv run cc-vec list-filter-columns --output json

# Vector operations (require OPENAI_API_KEY)
# Create vector store with processed content (OpenAI handles chunking with token limits)
uv run cc-vec index --url-patterns "%.github.io" --vector-store-name "ml-research" --limit 50 --chunk-size 800 --overlap 400

# Vector store name is optional - will auto-generate if not provided
uv run cc-vec index --url-patterns "%.github.io" --limit 50

# List cc-vec vector stores (default - only shows stores created by cc-vec)
uv run cc-vec list --output json

# List ALL vector stores (including non-cc-vec stores)
uv run cc-vec list --all

# Query vector store by ID for RAG
uv run cc-vec query "What is machine learning?" --vector-store-id "vs-123abc" --limit 5

# Query vector store by name
uv run cc-vec query "Explain deep learning" --vector-store-name "ml-research" --limit 3

# Fetch → Index Pipeline (for large datasets or multi-step workflows)
# Step 1: Fetch and save content to files
uv run cc-vec fetch --url-patterns "%.github.io" --output-dir ./fetched_data/ --limit 100

# Step 2: Index from saved files (can be run later or on different machine)
uv run cc-vec index --input-dir ./fetched_data/ --vector-store-name "my-research"

# Use --batch-size to reduce load on embedding service (helps with local models)
uv run cc-vec index --input-dir ./fetched_data/ --batch-size 5 --limit 50

# Use --provider-id to specify vector store backend (Llama Stack only)
uv run cc-vec index --input-dir ./fetched_data/ --provider-id chromadb --vector-store-name "persistent-store"

Pipeline Benefits

The fetch → index pipeline allows you to:

  • Fetch data once and index multiple times with different configurations (see the sketch after this list)
  • Process data on a different machine than where it was fetched
  • Resume indexing if the process was interrupted
  • Share fetched data with others
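
For example, a single fetch can feed multiple stores with different chunking (store names and chunk sizes here are illustrative):

# Fetch once
uv run cc-vec fetch --url-patterns "%.edu" --output-dir ./edu_data/ --limit 200

# Index the same files twice with different chunking configurations
uv run cc-vec index --input-dir ./edu_data/ --vector-store-name "edu-small-chunks" --chunk-size 400 --overlap 100
uv run cc-vec index --input-dir ./edu_data/ --vector-store-name "edu-large-chunks" --chunk-size 800 --overlap 400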

Advanced Options

  • --batch-size N: Upload files in batches of N to reduce load on embedding services (recommended for local models like Ollama)
  • --provider-id ID: Specify vector store backend for Llama Stack (chromadb for persistent storage, faiss for in-memory)

1.5. Local Llama Stack Setup (Optional)

Run cc-vec with local models using Ollama + Llama Stack for a fully local setup.

Step 1: Install and Start Ollama

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama server
ollama serve &

# Pull required models
ollama pull llama3.2:3b        # Inference model
ollama pull nomic-embed-text   # Embedding model (768 dimensions)
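
# Optional: confirm both models are available locally
ollama list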

Step 2: Start ChromaDB (Optional - for persistent vector storage)

The starter distribution uses in-memory FAISS by default. For persistent storage, run ChromaDB:

# Install and run ChromaDB
uv run --with chromadb chroma run --host localhost --port 8000 --path ./chroma_data
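
To confirm ChromaDB is reachable before continuing, a heartbeat check works; the path below assumes a recent Chroma release (older versions use /api/v1/heartbeat):

curl http://localhost:8000/api/v2/heartbeat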

Step 3: Start Llama Stack

# With explicit Ollama URL (and the default in-memory FAISS vector store)
uv run --with 'llama-stack>=0.4.0' llama stack run starter --port 8321 \
  --env OLLAMA_URL=http://localhost:11434/v1

# With ChromaDB for persistent vector storage (if running ChromaDB from Step 2)
uv run --with 'llama-stack>=0.4.0' llama stack run starter --port 8321 \
  --env OLLAMA_URL=http://localhost:11434/v1 \
  --env CHROMADB_URL=http://localhost:8000

This starts the Llama Stack server at http://localhost:8321.
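
To verify the server is responding, you can hit the OpenAI-compatible models endpoint (assuming the standard /v1 surface Llama Stack exposes):

curl http://localhost:8321/v1/models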

Step 4: Use with cc-vec

# Set environment variables
export OPENAI_BASE_URL=http://localhost:8321/v1
export OPENAI_API_KEY=none # Llama Stack doesn't require a real key
export OPENAI_EMBEDDING_MODEL=ollama/nomic-embed-text:latest
export OPENAI_EMBEDDING_DIMENSIONS=768

# Set your Athena credentials
export ATHENA_OUTPUT_BUCKET=s3://your-bucket/
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret

# Use cc-vec with local models
uv run cc-vec index --url-patterns "%.edu" --limit 10

# Use ChromaDB for persistent storage (requires ChromaDB running)
uv run cc-vec index --url-patterns "%.edu" --limit 10 --provider-id chromadb

# Use --batch-size to prevent overwhelming Ollama with concurrent requests
uv run cc-vec index --input-dir ./data/ --batch-size 5 --provider-id chromadb
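
Once indexing finishes, the same query commands work against the local endpoint; the store ID below is a placeholder, so take the real one from the list output:

# Find the store ID, then query it locally
uv run cc-vec list
uv run cc-vec query "What is machine learning?" --vector-store-id "vs-123abc" --limit 3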

2. 📦 Python Library

import os
from cc_vec import (
    search,
    stats,
    fetch,
    index,
    index_from_files,
    list_vector_stores,
    query_vector_store,
    list_crawls,
    FilterConfig,
    VectorStoreConfig,
)

# For alternative endpoints, set environment variables before importing
# Example: Using Ollama
# os.environ["OPENAI_BASE_URL"] = "http://localhost:11434/v1"
# os.environ["OPENAI_API_KEY"] = "ollama"
# os.environ["OPENAI_EMBEDDING_MODEL"] = "ollama/nomic-embed-text:latest"
# os.environ["OPENAI_EMBEDDING_DIMENSIONS"] = "768"

# Example: Using Llama Stack
# os.environ["OPENAI_BASE_URL"] = "http://localhost:8321/v1"
# os.environ["OPENAI_API_KEY"] = "your-llama-stack-key"

# Basic search and stats (no OpenAI key needed)
filter_config = FilterConfig(url_patterns=["%.github.io"])

stats_response = stats(filter_config)
print(f"Estimated records: {stats_response.estimated_records:,}")
print(f"Estimated size: {stats_response.estimated_size_mb:.2f} MB")
print(f"Athena cost: ${stats_response.estimated_cost_usd:.4f}")

results = search(filter_config, limit=10)
print(f"Found {len(results)} URLs")
for result in results[:3]:
    print(f"  {result.url} (Status: {result.status})")

# Advanced filtering - multiple criteria
filter_config = FilterConfig(
    url_patterns=["%.github.io", "%.github.com"],
    url_host_names=["github.io"],
    url_host_tlds=["io", "com"],  # Filter by TLD (uses indexed column)
    url_host_registered_domains=["github.com"],  # Filter by domain (uses indexed column)
    url_paths=["/blog/%", "/docs/%"],  # Filter by URL path
    crawl_ids=["CC-MAIN-2024-33", "CC-MAIN-2024-30"],  # Query multiple crawls
    status_codes=[200, 201],
    mime_types=["text/html"],
    charsets=["utf-8"],
    languages=["en"],
)

results = search(filter_config, limit=20)
print(f"Found {len(results)} URLs matching filters")

# Using indexed columns for better performance
filter_config = FilterConfig(
    url_host_tlds=["edu", "gov"],  # Much faster than url_patterns=["%.edu", "%.gov"]
    status_codes=[200],
)
results = search(filter_config, limit=50)
print(f"Found {len(results)} .edu and .gov sites")

# Fetch and process content (returns clean text)
filter_config = FilterConfig(url_patterns=["%.example.com"])
content_results = fetch(filter_config, limit=2)
print(f"Processed {len(content_results)} content records")
for record, processed in content_results:
    if processed:
        print(f"  {record.url}: {processed['word_count']} words")
        print(f"    Title: {processed.get('title', 'N/A')}")

# List available Common Crawl datasets
crawls = list_crawls()
print(f"Available crawls: {len(crawls)}")
print(f"Latest: {crawls[0]}")

# Index data in a vector store
filter_config = FilterConfig(url_patterns=["%.github.io"])
vector_config = VectorStoreConfig(
    name="ml-research",
    chunk_size=800,
    overlap=400,
    embedding_model="text-embedding-3-small",
    embedding_dimensions=1536,
)

result = index(filter_config, vector_config, limit=50)
print(f"Created vector store: {result['vector_store_name']}")
print(f"Vector Store ID: {result['vector_store_id']}")
print(f"Processed records: {result['total_fetched']}")

# List cc-vec vector stores (default - only shows stores created by cc-vec)
stores = list_vector_stores()
print(f"Available stores: {len(stores)}")
for store in stores[:3]:
    print(f"  {store['name']} (ID: {store['id']}, Status: {store['status']})")

# List ALL vector stores (including non-cc-vec stores)
all_stores = list_vector_stores(cc_vec_only=False)
print(f"All stores: {len(all_stores)}")

# Query vector store for RAG
query_results = query_vector_store("vs-123abc", "What is machine learning?", limit=5)
print(f"Query found {len(query_results.get('results', []))} relevant results")
for i, result in enumerate(query_results.get("results", []), 1):
    print(f"  {i}. Score: {result.get('score', 0):.3f}")
    print(f"     Content: {result.get('content', '')[:100]}...")
    print(f"     File: {result.get('file_id', 'N/A')}")

# Index from pre-fetched files (two-step workflow)
# Step 1: Save fetch results to disk (use fetch --output-dir in CLI or save manually)
# Step 2: Index from the saved files
vector_config = VectorStoreConfig(
    name="from-saved-files",
    chunk_size=800,
    overlap=400,
)
result = index_from_files("./fetched_data/", vector_config, limit=100)
print(f"Indexed {result['total_fetched']} files into {result['vector_store_name']}")

3. 🔌 MCP Server (Claude Desktop)

Setup:

  1. Copy and edit the config: cp claude_desktop_config.json ~/Library/Application\ Support/Claude/claude_desktop_config.json (path shown is for macOS)
  2. Update the directory path and API key in the config file
  3. Restart Claude Desktop

The config uses stdio mode (required by Claude Desktop). JSON does not support comments, so remove the // lines below before use:

{
  "mcpServers": {
    "cc-vec": {
      "command": "uv",
      "args": ["run", "--directory", "your-path-to-the-repo", "cc-vec", "mcp-serve", "--mode", "stdio"],
      "env": {
        "ATHENA_OUTPUT_BUCKET": "your-athena-output-bucket",
        "OPENAI_API_KEY": "your-openai-api-key-here"
        // "OPENAI_BASE_URL": "http://localhost:11434/v1"   // Optional: Use for Ollama, Llama Stack, or other endpoints
        // "OPENAI_EMBEDDING_MODEL": "ollama/nomic-embed-text:latest"     // Optional: Specify custom embedding model
        // "OPENAI_EMBEDDING_DIMENSIONS": "768"              // Optional: Specify embedding dimensions
      }
    }
  }
}
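
To sanity-check the server outside Claude Desktop, you can run the same command from the config by hand (press Ctrl-C to stop; the path is a placeholder):

uv run --directory your-path-to-the-repo cc-vec mcp-serve --mode stdio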

Available MCP tools:

# Search and analysis (no OpenAI key needed)
cc_search - Search Common Crawl for URLs matching patterns with advanced filtering
cc_stats - Get statistics and cost estimates for patterns with advanced filtering
cc_fetch - Download actual content from matched URLs with advanced filtering
cc_list_crawls - List available Common Crawl dataset IDs

# Vector operations (require OPENAI_API_KEY)
cc_index - Create and populate vector stores from Common Crawl content with chunking config
cc_list_vector_stores - List OpenAI vector stores (defaults to showing only stores created by cc-vec)
cc_query - Query vector stores for relevant content

Example usage in Claude Desktop:

  • "Use cc_search to find GitHub Pages sites: url_pattern=%.github.io, limit=10"
  • "Use cc_stats to analyze education sites: url_pattern=%.edu"
  • "Use cc_search with indexed columns for better performance: url_host_tlds=['edu', 'gov'], limit=20"
  • "Use cc_search with registered domains: url_host_registered_domains=['github.com'], limit=15"
  • "Use cc_search for specific paths: url_host_names=['github.io'], url_paths=['/blog/%'], limit=10"
  • "Use cc_search across multiple crawls: url_pattern=%.edu, crawl_ids=['CC-MAIN-2024-33', 'CC-MAIN-2024-30']"
  • "Use cc_fetch to get content: url_host_names=['github.io'], limit=5"
  • "Use cc_list_crawls to show available Common Crawl datasets"
  • "Use cc_index to create vector store: vector_store_name=research, url_pattern=%.arxiv.org, limit=100, chunk_size=800"
  • "Use cc_list_vector_stores to show cc-vec stores (default)"
  • "Use cc_list_vector_stores with cc_vec_only=false to show all vector stores"
  • "Use cc_query to search: vector_store_id=vs-123, query=machine learning"

Note: All filter options available in CLI (shown via cc-vec list-filter-columns) are also available in MCP tools.

License

MIT
