MCP as a Judge ⚖️

mcp-name: io.github.OtherVibes/mcp-as-a-judge

MCP as a Judge Logo

MCP as a Judge acts as a validation layer between AI coding assistants and LLMs, helping ensure safer and higher-quality code.

License: MIT · Python 3.13+ · MCP Compatible

MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations for:

  • Research, system design, and planning
  • Code changes, testing, and task-completion verification

It enforces evidence-based research, reuse over reinvention, and human-in-the-loop decisions.

If your IDE has rules/agents (Copilot, Cursor, Claude Code), keep using them—this Judge adds enforceable approval gates on plan, code diffs, and tests.

Key problems with AI coding assistants and LLMs

  • Treat LLM output as ground truth; skip research and use outdated information
  • Reinvent the wheel instead of reusing libraries and existing code
  • Cut corners: code below engineering standards and weak tests
  • Make unilateral decisions when requirements are ambiguous or plans change
  • Security blind spots: missing input validation, injection risks/attack vectors, least‑privilege violations, and weak defensive programming

Vibe coding doesn’t have to be frustrating

What it enforces

  • Evidence‑based research and reuse (best practices, libraries, existing code)
  • Plan‑first delivery aligned to user requirements
  • Human‑in‑the‑loop decisions for ambiguity and blockers
  • Quality gates on code and tests (security, performance, maintainability)

Key capabilities

  • Intelligent code evaluation via MCP sampling; enforces software‑engineering standards and flags security/performance/maintainability risks
  • Comprehensive plan/design review: validates architecture, research depth, requirements fit, and implementation approach
  • User‑driven decisions via MCP elicitation: clarifies requirements, resolves obstacles, and keeps choices transparent
  • Security validation in system design and code changes

Tools and how they help

| Tool | What it solves |
| --- | --- |
| `set_coding_task` | Creates/updates task metadata; classifies `task_size`; returns next-step workflow guidance |
| `get_current_coding_task` | Recovers the latest `task_id` and metadata to resume work safely |
| `judge_coding_plan` | Validates plan/design; requires library selection and internal reuse maps; flags risks |
| `judge_code_change` | Reviews unified Git diffs for correctness, reuse, security, and code quality |
| `judge_testing_implementation` | Validates tests using real runner output and optional coverage |
| `judge_coding_task_completion` | Final gate ensuring plan, code, and test approvals before completion |
| `raise_missing_requirements` | Elicits missing details and decisions to unblock progress |
| `raise_obstacle` | Engages the user on trade-offs, constraints, and enforced changes |
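These are ordinary MCP tools, so any MCP client can invoke them directly. As a minimal sketch (not official usage), here is how a custom client could start a task over stdio using the Python `mcp` SDK; the `task_description` argument name is an illustrative assumption, not the tool's documented schema:

```python
"""Minimal sketch: drive the judge tools from a custom MCP client.

Assumes the official `mcp` Python SDK; the `task_description` argument
is illustrative -- check the tool's input schema for the real fields.
"""
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the server the same way the client configs below do.
    server = StdioServerParameters(command="uv", args=["tool", "run", "mcp-as-a-judge"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # set_coding_task registers the task and returns next-step guidance.
            result = await session.call_tool(
                "set_coding_task",
                arguments={"task_description": "Add retry logic to the HTTP client"},
            )
            print(result.content)


asyncio.run(main())
```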

🚀 Quick Start

Requirements & Recommendations

MCP Client Prerequisites

MCP as a Judge depends heavily on the MCP Sampling and MCP Elicitation features for its core functionality.

System Prerequisites

  • Docker Desktop or Python 3.13+: required for running the MCP server

Supported AI Assistants

| AI Assistant | Platform | MCP Support Status | Notes |
| --- | --- | --- | --- |
| GitHub Copilot | Visual Studio Code | ✅ Full (recommended) | Complete MCP integration with sampling and elicitation |
| Claude Code | - | ⚠️ Partial (requires LLM API key) | Sampling and elicitation support are open feature requests |
| Cursor | - | ⚠️ Partial (requires LLM API key) | MCP support available, but sampling/elicitation limited |
| Augment | - | ⚠️ Partial (requires LLM API key) | MCP support available, but sampling/elicitation limited |
| Qodo | - | ⚠️ Partial (requires LLM API key) | MCP support available, but sampling/elicitation limited |

✅ Recommended setup: GitHub Copilot + VS Code — full MCP sampling; no API key needed.

⚠️ Critical: For assistants without full MCP sampling (Cursor, Claude Code, Augment, Qodo), you MUST set LLM_API_KEY. Without it, the server cannot evaluate plans or code. See LLM API Configuration.

💡 Tip: Prefer large context models (≥ 1M tokens) for better analysis and judgments.

If the MCP server isn’t auto‑used

For troubleshooting, visit the FAQs section.

🔧 MCP Configuration

Configure MCP as a Judge in your MCP-enabled client:

Method 1: Using Docker (Recommended)

One‑click install for VS Code (MCP)

Install for MCP as a Judge

Notes:

  • VS Code controls the sampling model; select it via “MCP: List Servers → mcp-as-a-judge → Configure Model Access”.
  1. Configure MCP Settings:

    Add this to your MCP client configuration file:

    {
      "command": "docker",
      "args": ["run", "--rm", "-i", "--pull=always", "ghcr.io/othervibes/mcp-as-a-judge:latest"],
      "env": {
        "LLM_API_KEY": "your-openai-api-key-here",
        "LLM_MODEL_NAME": "gpt-4o-mini"
      }
    }

    📝 Configuration Options (All Optional):

    • LLM_API_KEY: Optional for GitHub Copilot + VS Code (has built-in MCP sampling)
    • LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
    • The --pull=always flag ensures you always get the latest version automatically

    Then manually update when needed:

    # Pull the latest version
    docker pull ghcr.io/othervibes/mcp-as-a-judge:latest
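For orientation, VS Code keeps MCP server entries under a top-level `servers` key in `.vscode/mcp.json`. A sketch of a complete file wrapping the snippet above might look like this (the server name and key value are placeholders):

```json
{
  "servers": {
    "mcp-as-a-judge": {
      "command": "docker",
      "args": ["run", "--rm", "-i", "--pull=always", "ghcr.io/othervibes/mcp-as-a-judge:latest"],
      "env": {
        "LLM_API_KEY": "your-openai-api-key-here",
        "LLM_MODEL_NAME": "gpt-4o-mini"
      }
    }
  }
}
```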

Method 2: Using uv

  1. Install the package:

    uv tool install mcp-as-a-judge
  2. Configure MCP Settings:

    The MCP server may be automatically detected by your MCP‑enabled client.

    📝 Notes:

    • No additional configuration needed for GitHub Copilot + VS Code (has built-in MCP sampling)
    • LLM_API_KEY is optional and can be set via environment variable if needed
  3. To update to the latest version:

    # Update MCP as a Judge to the latest version
    uv tool upgrade mcp-as-a-judge

Select a sampling model in VS Code

  • Open Command Palette (Cmd/Ctrl+Shift+P) → “MCP: List Servers”
  • Select the configured server “mcp-as-a-judge”
  • Choose “Configure Model Access”
  • Check your preferred model(s) to enable sampling

🔑 LLM API Configuration (Optional)

For AI assistants without full MCP sampling support, you can configure an LLM API key as a fallback. This ensures MCP as a Judge works even when the client doesn't support MCP sampling.

  • Set LLM_API_KEY (unified key). The vendor is auto-detected from the key format (see the sketch after the table below); optionally set LLM_MODEL_NAME to override the default.

Supported LLM Providers

| Rank | Provider | API Key Format | Default Model | Notes |
| --- | --- | --- | --- | --- |
| 1 | OpenAI | `sk-...` | gpt-4.1 | Fast and reliable model optimized for speed |
| 2 | Anthropic | `sk-ant-...` | claude-sonnet-4-20250514 | High performance with exceptional reasoning |
| 3 | Google | `AIza...` | gemini-2.5-pro | Most advanced model with built-in thinking |
| 4 | Azure OpenAI | `[a-f0-9]{32}` | gpt-4.1 | Same as OpenAI, but via Azure |
| 5 | AWS Bedrock | AWS credentials | anthropic.claude-sonnet-4-20250514-v1:0 | Aligned with Anthropic |
| 6 | Vertex AI | Service Account JSON | gemini-2.5-pro | Enterprise Gemini via Google Cloud |
| 7 | Groq | `gsk_...` | deepseek-r1 | Best reasoning model with speed advantage |
| 8 | OpenRouter | `sk-or-...` | deepseek/deepseek-r1 | Best reasoning model available |
| 9 | xAI | `xai-...` | grok-code-fast-1 | Latest coding-focused model (Aug 2025) |
| 10 | Mistral | `[a-f0-9]{64}` | pixtral-large | Most advanced model (124B params) |
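To illustrate the auto-detection mentioned above: most of the key formats in this table are distinctive enough to map a key to its vendor by prefix alone. Below is a hypothetical sketch of that idea, not the server's actual implementation; the hex-format keys (Azure OpenAI, Mistral) are ambiguous by prefix and would need other signals.

```python
# Hypothetical vendor detection by API key prefix, based on the table above.
# The real server logic may differ; hex-format keys (Azure OpenAI, Mistral)
# cannot be distinguished by prefix alone.
KEY_PREFIXES: dict[str, str] = {
    "sk-ant-": "anthropic",  # must be checked before the generic "sk-"
    "sk-or-": "openrouter",  # likewise more specific than "sk-"
    "sk-": "openai",
    "AIza": "google",
    "gsk_": "groq",
    "xai-": "xai",
}


def detect_vendor(api_key: str) -> str | None:
    """Return the vendor implied by LLM_API_KEY, or None if unknown."""
    for prefix, vendor in KEY_PREFIXES.items():  # insertion order preserved
        if api_key.startswith(prefix):
            return vendor
    return None


assert detect_vendor("sk-ant-example") == "anthropic"
assert detect_vendor("sk-example") == "openai"
```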

Client-Specific Setup

Cursor

  1. Open Cursor Settings:

    • Go to File → Preferences → Cursor Settings
    • Navigate to the MCP tab
    • Click + Add to add a new MCP server
  2. Add MCP Server Configuration:

    {
      "command": "uv",
      "args": ["tool", "run", "mcp-as-a-judge"],
      "env": {
        "LLM_API_KEY": "your-openai-api-key-here",
        "LLM_MODEL_NAME": "gpt-4.1"
      }
    }

    📝 Configuration Options:

    • LLM_API_KEY: Required for Cursor (limited MCP sampling)
    • LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)

Claude Code

  1. Add MCP Server via CLI:

    # Set environment variables first (optional model override)
    export LLM_API_KEY="your_api_key_here"
    export LLM_MODEL_NAME="claude-3-5-haiku"  # Optional: faster/cheaper model
    
    # Add MCP server
    claude mcp add mcp-as-a-judge -- uv tool run mcp-as-a-judge
  2. Alternative: Manual Configuration:

    • Create or edit ~/.config/claude-code/mcp_servers.json
    {
      "command": "uv",
      "args": ["tool", "run", "mcp-as-a-judge"],
      "env": {
        "LLM_API_KEY": "your-anthropic-api-key-here",
        "LLM_MODEL_NAME": "claude-3-5-haiku"
      }
    }

    📝 Configuration Options:

    • LLM_API_KEY: Required for Claude Code (limited MCP sampling)
    • LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)

Other MCP Clients

For other MCP-compatible clients, use the standard MCP server configuration:

{
  "command": "uv",
  "args": ["tool", "run", "mcp-as-a-judge"],
  "env": {
    "LLM_API_KEY": "your-openai-api-key-here",
    "LLM_MODEL_NAME": "gpt-5"
  }
}

📝 Configuration Options:

  • LLM_API_KEY: Required for most MCP clients (except GitHub Copilot + VS Code)
  • LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)

🔒 Privacy & Flexible AI Integration

🔑 MCP Sampling (Preferred) + LLM API Key Fallback

Primary Mode: MCP Sampling

  • All judgments are performed using MCP Sampling capability
  • No need to configure or pay for external LLM API services
  • Works directly with your MCP-compatible client's existing AI model
  • Currently supported by: GitHub Copilot + VS Code

Fallback Mode: LLM API Key

  • When MCP sampling is not available, the server can use LLM API keys
  • Supports multiple providers via LiteLLM: OpenAI, Anthropic, Google, Azure, Groq, Mistral, xAI
  • Automatic vendor detection from API key patterns
  • Default model selection per vendor when no model is specified

🛡️ Your Privacy Matters

  • The server runs locally on your machine
  • No data collection - your code and conversations stay private
  • No external API calls when using MCP Sampling. If you set LLM_API_KEY for fallback, the server will call your chosen LLM provider only to perform judgments (plan/code/test) with the evaluation content you provide.
  • Complete control over your development workflow and sensitive information

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Development Setup

# Clone the repository
git clone https://github.com/OtherVibes/mcp-as-a-judge.git
cd mcp-as-a-judge

# Install dependencies with uv
uv sync --all-extras --dev

# Install pre-commit hooks
uv run pre-commit install

# Run tests
uv run pytest

# Run all checks
uv run pytest && uv run ruff check && uv run ruff format --check && uv run mypy src

© Concepts and Methodology

© 2025 OtherVibes and Zvi Fried. The "MCP as a Judge" concept, the "behavioral MCP" approach, the staged workflow (plan → code → test → completion), tool taxonomy/descriptions, and prompt templates are original work developed in this repository.

Prior Art and Attribution

While “LLM‑as‑a‑judge” is a broadly known idea, this repository defines the original “MCP as a Judge” behavioral MCP pattern by OtherVibes and Zvi Fried. It combines task‑centric workflow enforcement (plan → code → test → completion), explicit LLM‑based validations, and human‑in‑the‑loop elicitation, along with the prompt templates and tool taxonomy provided here. Please attribute as: “OtherVibes – MCP as a Judge (Zvi Fried)”.

❓ FAQ

How is “MCP as a Judge” different from rules/subagents in IDE assistants (GitHub Copilot, Cursor, Claude Code)?

  • IDE rules and subagents provide static behavior guidance, custom system prompts, project context integration, and specialized task handling.
  • MCP as a Judge adds what they lack: active quality gates, evidence-based validation, approve/reject decisions with feedback, workflow enforcement, and cross-assistant compatibility.

How does the Judge workflow relate to the tasklist? Why do we need both?

  • Tasklist = planning/organization: tracks tasks, priorities, and status. It doesn’t guarantee engineering quality or readiness.
  • Judge workflow = quality gates: enforces approvals for plan/design, code diffs, tests, and final completion. It demands real evidence (e.g., unified Git diffs and raw test output) and returns structured approvals and required improvements.
  • Together: Use the tasklist to organize work; use the Judge to decide when each stage is actually ready to proceed. The server also emits next_tool guidance to keep progress moving through the gates (see the sketch below).
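For illustration only, a rejection carrying such guidance might be shaped like the following; the field names are assumptions drawn from the descriptions above ("structured approvals and required improvements"), not the server's documented schema:

```json
{
  "approved": false,
  "required_improvements": [
    "Add input validation for the new upload endpoint",
    "Include raw pytest output for the failing test run"
  ],
  "next_tool": "judge_code_change"
}
```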

If the Judge isn’t used automatically, how do I force it?

  • In your prompt: "use mcp-as-a-judge" or "Evaluate plan/code/test using the MCP server mcp-as-a-judge".
  • VS Code: Command Palette → "MCP: List Servers" → ensure "mcp-as-a-judge" is listed and enabled.
  • Ensure the MCP server is running and, in your client, the judge tools are enabled/approved.

How do I select models for sampling in VS Code?

  • Open Command Palette (Cmd/Ctrl+Shift+P) → "MCP: List Servers"
  • Select "mcp-as-a-judge" → "Configure Model Access"
  • Check your preferred model(s) to enable sampling

📄 License

This project is licensed under the MIT License (see LICENSE).

🙏 Acknowledgments

