Production RAG Service — Starter Kit

BM25 retrieval service with a prebuilt JSON index, FastAPI endpoints, API-key auth, per-key rate limiting, and basic latency metrics. No vector search or learned reranker (yet).

Acceptance Criteria (edit targets as needed)

Recall@10 ≥ 0.80; Answer F1 ≥ 0.70 (or EM ≥ 0.60)
p95 latency ≤ 800 ms (≥100 queries); p50 ≤ 300 ms
Cost/1k queries within budget; cache hit-rate ≥ 30%
API-key auth + rate limiting
Docker + one-click deploy (Render/Fly/Cloud Run)
README benchmarks table + Loom demo

Quickstart

Option A — conda

conda create -n rag_env python=3.11 -y
conda activate rag_env
pip install -r requirements.txt

# build the BM25 index
python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json

# run (auth + rate limit)
API_KEY=dev-key RATE_LIMIT_PER_MIN=30 \
python -m uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

Option B — venv

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# build the BM25 index
python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json

# run (auth + rate limit)
API_KEY=dev-key RATE_LIMIT_PER_MIN=30 \
uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

One-liner — build index & run

# env vars must prefix the uvicorn command; the index build does not read them
python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json && \
API_KEY=dev-key RATE_LIMIT_PER_MIN=30 \
python -m uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

Endpoints

GET /health → {"ok": true, "version": "..." }
GET /version → {"version": "..." }
GET /metrics → {"requests": n, "latency_ms_p50": ..., "latency_ms_p95": ..., "window": n}
POST /ask → { "answer": "...", "latency_ms": 0.0, "docs": [ { "doc_id": "...", "text": "...", "score": ... } ] }

Usage (auth required)

Set your base URL and API key (local example shown):

export BASE_URL=http://localhost:8010
export API_KEY=dev-key

Authorized request (200):

curl -s -X POST "$BASE_URL/ask" \
  -H "x-api-key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}' | python -m json.tool

Unauthorized example (should be 401):

curl -i -s -X POST "$BASE_URL/ask" \
  -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}'

Metrics:

curl -s "$BASE_URL/metrics" | python -m json.tool
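
The same authorized call can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_ask_request` and `ask` are illustrative and not part of this repo.

```python
import json
import urllib.request


def build_ask_request(base_url: str, api_key: str, question: str, k: int = 5) -> urllib.request.Request:
    """Build (but do not send) a POST /ask request mirroring the curl example."""
    body = json.dumps({"question": question, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ask",
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


def ask(base_url: str, api_key: str, question: str, k: int = 5, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body.

    urllib raises urllib.error.HTTPError for non-2xx responses such as 401 or 429.
    """
    req = build_ask_request(base_url, api_key, question, k)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running locally, `ask("http://localhost:8010", "dev-key", "What is coinsurance?")` should return a dict with the `answer`, `latency_ms`, and `docs` keys listed under Endpoints.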

Evaluate

Place 50 Q/A pairs in eval/gold.jsonl:

{"question":"...", "answer":"..."}

Run the evaluator:

python -m eval.evaluate --gold ./eval/gold.jsonl --api http://localhost:8010/ask --k 5
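
The scoring code in eval/evaluate.py is not reproduced in this README. As a rough guide to what the reported numbers mean, here is a sketch of SQuAD-style token F1, exact match, and a string-containment Recall@k. The normalization rules and function names are assumptions, not the evaluator's actual implementation.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 when prediction and gold answer are identical after normalization."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (multiset overlap)."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(doc_texts: list[str], gold: str, k: int) -> float:
    """1.0 if the normalized gold answer appears in any of the top-k snippets."""
    needle = " ".join(normalize(gold))
    return float(any(needle in " ".join(normalize(t)) for t in doc_texts[:k]))
```

Averaging these per-question scores over the gold file gives the aggregate figures compared against the acceptance criteria.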

Benchmarks (local demo)

| Metric | Value |
| --- | --- |
| Answer F1 | 1.00 (toy) |
| Recall@10 | 1.00 (toy) |
| p50 latency | 0.199 ms (local) |
| p95 latency | 0.355 ms (local) |

Architecture

flowchart TD
  A[Client] -->|HTTP| S[FastAPI Service]

  subgraph Routes
    S --> R1["POST /ask"]
    S --> R2["GET /health"]
    S --> R3["GET /version"]
    S --> R4["GET /metrics"]
  end

  R1 --> G{API key valid?}
  G -- No --> E401[401 Unauthorized]
  G -- Yes --> L{Within rate limit?}
  L -- No --> E429[429 Rate Limited]
  L -- Yes --> Q[BM25 query k]

  Q --> IDX[Index JSON]
  Q --> B[Stopword-aware boost]
  B --> SSEL[Best sentence]
  SSEL --> CAN[Canonical phrasing]
  CAN --> RESP[Response: answer/docs/latency_ms]

  R1 -. on success .-> MREC[Record latency]
  MREC --> R4
 
  subgraph Build
    C1[corpus txt files] --> IDX
    C2[python -m rag_app.index] --> IDX
  end

Components

API layer: rag_app/main.py (FastAPI app, routes, request/response models).
Auth: Simple API-key via x-api-key header. Disabled if API_KEY env is unset/empty.
Rate limiting: In-memory token bucket per key (RATE_LIMIT_PER_MIN), thread-safe.
Retrieval: rag_app/retrieval.py with BM25Retriever over a JSON index.
Index build: rag_app/index.py splits corpus/*.txt into snippets → writes rag_app/index.json.
Answering: Stopword-aware boost, choose best sentence from top snippet, then optional canonical phrasing for known intents.
Metrics: In-memory deque of recent latencies (p50/p95) + request count, exposed at /metrics.
(Optional) Cache: Small in-memory LRU for repeated (question,k) lookups.
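
The optional cache is not specified further here. One way it could look is a small LRU with per-entry expiry, keyed by `(question, k)`. The class name `TTLCache`, its defaults, and the injectable `clock` are assumptions made for this sketch, and it is not thread-safe as written.

```python
import time
from collections import OrderedDict


class TTLCache:
    """LRU cache with per-entry expiry (hypothetical helper, single-threaded)."""

    def __init__(self, max_items: int = 256, ttl_s: float = 300.0, clock=time.monotonic):
        self.max_items, self.ttl_s, self._clock = max_items, ttl_s, clock
        self._data: OrderedDict = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]          # expired: drop and report a miss
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key, value) -> None:
        self._data[key] = (self._clock() + self.ttl_s, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict least recently used
```

In the `/ask` handler this would wrap retrieval: look up `(question, k)` first, and only query BM25 on a miss. Wrapping `get` and `put` in a `threading.Lock` would be needed if handlers run in a thread pool.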

Request lifecycle (`POST /ask`)

Guard: Check x-api-key (if API_KEY is set) and rate limit the caller.
Retrieve: Query BM25 over rag_app/index.json (top-k).
Re-rank: Apply stopword-aware term-match boost to prioritize relevant snippets.
Answer pick: Choose the best sentence from the top snippet; if the question matches a known intent, apply canonical phrasing.
Metrics: Record latency (ms) into a rolling window (default 5k requests).
Respond: Return {answer, latency_ms, docs}.
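
The re-rank and answer-pick steps above can be sketched as follows. The stopword list, the additive `boost` value, and the function names are illustrative assumptions; the actual logic lives in rag_app/main.py and may differ.

```python
import re

# Small illustrative stopword list; a real one would be larger.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "of", "to", "in", "and", "or", "for", "how", "does", "do"}


def content_terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def rerank(question: str, docs: list[dict], boost: float = 0.5) -> list[dict]:
    """Add `boost` to each doc's score per matched non-stopword question term."""
    q = content_terms(question)
    boosted = [{**d, "score": d["score"] + boost * len(q & content_terms(d["text"]))} for d in docs]
    return sorted(boosted, key=lambda d: d["score"], reverse=True)


def best_sentence(question: str, snippet: str) -> str:
    """Pick the sentence sharing the most non-stopword terms with the question."""
    q = content_terms(question)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", snippet) if s.strip()]
    if not sentences:
        return snippet.strip()
    return max(sentences, key=lambda s: len(q & content_terms(s)))
```

Ignoring stopwords keeps words like "what" and "is" from boosting every snippet equally, so the boost reflects only the terms that carry the question's meaning.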

Data & storage

Corpus: Plain text files under corpus/. Edit or replace for your domain.
Index artifact: rag_app/index.json (generated). Treat as a build artifact; ignore in git.
- Build at image build time (Docker) or at container start if missing.
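
For orientation, an index build along these lines might look like the sketch below. The paragraph-packing rule, the `max_chars` limit, and the record fields other than `doc_id` and `text` (which appear in the `/ask` response) are assumptions; rag_app/index.py defines the real format.

```python
import json
from pathlib import Path


def split_snippets(text: str, max_chars: int = 500) -> list[str]:
    """Split on blank lines, then pack paragraphs into snippets up to max_chars.

    A single paragraph longer than max_chars is kept whole as one snippet.
    """
    snippets, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + 1 + len(para) > max_chars:
            snippets.append(current)
            current = para
        else:
            current = f"{current} {para}".strip()
    if current:
        snippets.append(current)
    return snippets


def build_index(corpus_dir: str, out_path: str) -> int:
    """Write one JSON record per snippet and return the number of records."""
    records = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        for i, snippet in enumerate(split_snippets(path.read_text(encoding="utf-8"))):
            records.append({"doc_id": f"{path.stem}-{i}", "text": snippet})
    Path(out_path).write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(records)
```

Sorting the input files keeps `doc_id` values stable between builds, which helps when comparing evaluation runs.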

Configuration (env)

API_KEY – enables auth when set (e.g., dev-key for local).
RATE_LIMIT_PER_MIN – integer per-key budget (default 60).
(If you add caching) CACHE_TTL_S, CACHE_MAX.
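
A sketch of reading these variables in one place is shown below. The `Settings` dataclass and `load_settings` helper are hypothetical, and the cache defaults (300 s, 256 entries) are placeholders chosen for the example; only the `RATE_LIMIT_PER_MIN` default of 60 comes from this README.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str             # empty string means auth is disabled
    rate_limit_per_min: int  # per-key request budget
    cache_ttl_s: float       # assumed default, only relevant if caching is added
    cache_max: int           # assumed default, only relevant if caching is added


def load_settings(env=None) -> Settings:
    """Read configuration from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    rate = int(env.get("RATE_LIMIT_PER_MIN", "60"))
    if rate <= 0:
        raise ValueError("RATE_LIMIT_PER_MIN must be a positive integer")
    return Settings(
        api_key=env.get("API_KEY", "").strip(),
        rate_limit_per_min=rate,
        cache_ttl_s=float(env.get("CACHE_TTL_S", "300")),
        cache_max=int(env.get("CACHE_MAX", "256")),
    )
```

Accepting a plain mapping makes the loader easy to test without touching the process environment.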

Module layout (key files)

rag_app/
├─ main.py         # FastAPI app, routes, auth, limiter, metrics, answering
├─ retrieval.py    # BM25Retriever (loads/snaps index)
├─ index.py        # builds JSON index from corpus/*.txt
└─ index.json      # generated artifact (ignored in VCS)
eval/
└─ evaluate.py     # computes F1/Recall@k via API calls
corpus/
└─ *.txt           # domain text

Deployment (Docker & one-click)

One-click (Render)

Uses Dockerfile. Set API_KEY in Render env vars after deploy.

Docker (local)

# build
docker build -t rag-service .

# run (maps host port 8000 to container port 8000)
docker run --rm -p 8000:8000 \
  -e API_KEY=dev-key \
  -e RATE_LIMIT_PER_MIN=60 \
  rag-service

Set BASE_URL=http://localhost:8000 when testing the container.

Monitoring & Metrics

What’s exposed

GET /metrics → JSON:

{
  "requests": 42,
  "latency_ms_p50": 1.23,
  "latency_ms_p95": 3.45,
  "window": 42,
  "version": "..."
}

window = number of recent requests kept in memory (rolling window).
Values reset on process restart (in-memory).
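
A rolling window like this can be kept with a bounded deque. The sketch below uses the nearest-rank percentile definition; the class name `LatencyWindow` and the choice of percentile method are assumptions, so the service's exact numbers may differ slightly.

```python
import math
from collections import deque


class LatencyWindow:
    """Request counter plus a bounded window of recent latencies (hypothetical helper)."""

    def __init__(self, maxlen: int = 5000):
        self.requests = 0
        self._samples = deque(maxlen=maxlen)  # oldest samples fall off automatically

    def record(self, latency_ms: float) -> None:
        self.requests += 1
        self._samples.append(latency_ms)

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile over the current window (0.0 when empty)."""
        if not self._samples:
            return 0.0
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    def snapshot(self) -> dict:
        return {
            "requests": self.requests,
            "latency_ms_p50": self.percentile(50),
            "latency_ms_p95": self.percentile(95),
            "window": len(self._samples),
        }
```

Sorting on every read is acceptable for a window of a few thousand samples; `requests` keeps growing after the window is full, which is why the two numbers diverge on a long-running process.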

Quick checks

# Pretty print
curl -s "$BASE_URL/metrics" | python -m json.tool

# Print just key numbers (quote-safe)
curl -s "$BASE_URL/metrics" \
| python -c 'import sys,json; d=json.load(sys.stdin); print("requests={}  p50={} ms  p95={} ms".format(d["requests"], d["latency_ms_p50"], d["latency_ms_p95"]))'

Optional: Prometheus endpoint

pip install prometheus-fastapi-instrumentator

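# rag_app/main.py
from prometheus_fastapi_instrumentator import Instrumentator

# Call this at import time, right after `app = FastAPI(...)`.
# instrument() adds middleware, and recent Starlette versions refuse to add
# middleware once the application has started (e.g. inside a startup handler).
Instrumentator().instrument(app).expose(app, endpoint="/metrics/prom")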

Security (API key & rate limiting)

API key

Header: x-api-key: <YOUR_KEY>
Enabled when API_KEY env var is set (any non-empty string).
Disabled when API_KEY is unset or empty (intended for local dev only).
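
The check itself can be very small. This sketch assumes a helper named `check_api_key` (not from the repo) and uses a constant-time comparison so response timing does not leak how much of a key matched.

```python
import hmac


def check_api_key(provided, expected: str) -> bool:
    """Return True when the request may proceed."""
    if not expected:      # API_KEY unset/empty: auth is disabled
        return True
    if not provided:      # header missing or empty
        return False
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

A route handler would call this with the `x-api-key` header value and respond with 401 when it returns False.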

Examples

# Authorized (200)
curl -s -X POST "$BASE_URL/ask" \
  -H "x-api-key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}'

# Unauthorized (401)
curl -i -s -X POST "$BASE_URL/ask" \
  -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}'

Rate limiting

In-memory token bucket per API key.
Budget per minute: RATE_LIMIT_PER_MIN (default 60).
Exceeds budget → 429 Too Many Requests.
For multi-replica deployments, move buckets to Redis (shared state).
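
A per-key token bucket along these lines could be written as below. The class name, the injectable `clock`, and the choice to start each bucket full are assumptions for the sketch; the limiter in rag_app/main.py may differ in detail.

```python
import threading
import time


class TokenBucketLimiter:
    """In-memory token bucket per API key (hypothetical helper)."""

    def __init__(self, per_minute: int = 60, clock=time.monotonic):
        self.capacity = float(per_minute)
        self.refill_per_s = per_minute / 60.0
        self._clock = clock
        self._lock = threading.Lock()
        self._buckets = {}  # api_key -> (tokens, last_refill_time)

    def allow(self, api_key: str) -> bool:
        """Consume one token for api_key; False means the caller should get a 429."""
        now = self._clock()
        with self._lock:
            tokens, last = self._buckets.get(api_key, (self.capacity, now))
            # Refill in proportion to elapsed time, capped at capacity.
            tokens = min(self.capacity, tokens + (now - last) * self.refill_per_s)
            if tokens < 1.0:
                self._buckets[api_key] = (tokens, now)
                return False
            self._buckets[api_key] = (tokens - 1.0, now)
            return True
```

Because the state lives in a process-local dict, each replica would enforce its own budget, which is the reason for the Redis suggestion above.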

Set limits

API_KEY=<strong-secret> RATE_LIMIT_PER_MIN=60 \
uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

Best practices

Use different keys per environment (dev/stage/prod).
Rotate keys; never commit them.
Front with a gateway/WAF if exposed publicly.
Add CORS policy if you’ll call from a browser app.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `401 Unauthorized` on `/ask` | Missing `x-api-key` header, or its value does not match the server's `API_KEY` | Send an `x-api-key` header matching the server's `API_KEY`. Confirm the server is reachable with `curl -s $BASE_URL/health`. |
| `429 Too Many Requests` | Rate limit exceeded | Lower request rate, increase `RATE_LIMIT_PER_MIN`, or use separate keys for tests. |
| `404 Not Found` on `/version` or `/ask` | Wrong app path or port | Ensure you run `rag_app.main:app` and target the right port. List paths via `/openapi.json`. |
| Port already in use | Old server still running | `ss -lptn 'sport = :8010'` then kill the PID, or change `--port`. |
| `uvicorn: command not found` | Not installed in current env | `pip install "uvicorn[standard]"`; confirm with `which python` / `which uvicorn`. |
| `ModuleNotFoundError: rag_app` | Wrong cwd / PYTHONPATH | Run from repo root or set `PYTHONPATH=.`; `uvicorn rag_app.main:app ...`. |
| Index missing at startup | `rag_app/index.json` not built | Run `python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json`. |
| `/metrics` shows zeros | Fresh process or no traffic | Send a few `/ask` requests, then recheck. |
| JSON errors in CLI snippets | F-string quoting | Use the `.format()` example in the Monitoring section. |
| Docker healthcheck failing | Wrong port or env | Container listens on `$PORT` (default 8000). Map that port (`-p 8000:8000`) and set `API_KEY`. |

Diagnostics

# List routes
curl -s "$BASE_URL/openapi.json" | python -m json.tool

# Health/version
curl -s "$BASE_URL/health"; curl -s "$BASE_URL/version"

# Minimal POST
curl -s -X POST "$BASE_URL/ask" \
  -H "x-api-key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":3}' | python -m json.tool

Notes

Start with BM25 baseline (rank_bm25), then add vectors + reranker as needed.
Consider a small LRU cache for repeated queries and structured logging for observability.
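
For readers who want to see what the BM25 baseline computes, here is a self-contained sketch of the scoring function. It uses the Lucene-style IDF (always non-negative) and is not the rank_bm25 implementation, whose IDF handling differs; the tokenizer and the `k1`/`b` defaults are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, docs: list[str], k: int = 5, k1: float = 1.5, b: float = 0.75) -> list[tuple[int, float]]:
    """Return (doc_index, score) pairs for the top-k documents."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / max(n_docs, 1)
    # Document frequency: number of docs containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average docs are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]
```

A vector retriever or reranker added later can reuse the same `(doc_index, score)` shape, which keeps the answering step unchanged.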

License

MIT — see LICENSE.

Contact

Questions? Open an issue or ping me on LinkedIn.
