feat: MCP tool calling evaluations in CI/CD #313
Merged
47 commits
0af403f fix: add evals (jirispilka)
a7306e0 fix: add evals (jirispilka)
43e8b5a fix: add evals (jirispilka)
fd232da fix: add docs (jirispilka)
21444a3 fix: update evaluations (jirispilka)
bc32f04 fix: lint (jirispilka)
1a0da38 fix: env variables (jirispilka)
14d4cc5 fix: fix uv (jirispilka)
20eda85 fix: fix uv (jirispilka)
c688ecb fix: fix uv (jirispilka)
64bce1e fix: Update results (jirispilka)
cf1054b feat: Add typescript code (jirispilka)
54244b9 feat: Add run-evaluation.ts (jirispilka)
c9d9aeb fix: lint (jirispilka)
5194a4a fix: update create-dataset.ts with logs (jirispilka)
f2619f2 fix: update run-evaluation.ts (jirispilka)
1353e8b fix: update run-evaluation.ts (jirispilka)
e766cab fix: update run-evaluation.ts (jirispilka)
8764c63 fix: update run-evaluation.ts (jirispilka)
fbbe4c8 fix: update run-evaluation.ts (jirispilka)
45db30d fix: update run-evaluation.ts (jirispilka)
483ffd5 fix: update documentation (jirispilka)
120fc18 fix: update tsconfig.json (jirispilka)
dcb5e61 fix: update evaluations.yaml (jirispilka)
5a16f53 fix: add PHOENIX_BASE_URL (jirispilka)
40bdeb4 fix: run-again (jirispilka)
c608779 fix: add debug log (jirispilka)
a3f04aa fix: add function to sanitize headers (jirispilka)
abe73f5 fix: evaluation and lint (jirispilka)
e40feb1 fix: update tools_exact_match (jirispilka)
5ed9eb8 fix: update run-evaluation.ts with llm as judge (jirispilka)
86c45df fix: update logs and rename for clarity (jirispilka)
4b3db00 fix: update prompt (jirispilka)
e27f07b fix: improve evaluation results (jirispilka)
a763934 fix: Use openrouter as llm judge (jirispilka)
529c334 fix: Organize packages (jirispilka)
1fedb52 Clean up Python cache files and update .gitignore (jirispilka)
0684a2a Remove unnecessary __init__.py from evals directory (jirispilka)
0cd6642 fix: evals ci (jirispilka)
cb26e7e fix: create dataset (jirispilka)
00ca260 fix: decrease threshold to get green light (jirispilka)
f1e9256 fix: fix eslint config (jirispilka)
9db9c24 fix: update README.md (jirispilka)
d118273 fix: run on push to master or evals tag (jirispilka)
4794ca4 fix: run on push to master or validated tag (jirispilka)
82df1ef fix: value interpolation in the template! It was not working and fail… (jirispilka)
ab61d1b fix: minor changes and a couple of more test cases (jirispilka)
@@ -1,3 +1,8 @@
APIFY_TOKEN=
# ANTHROPIC_API_KEY is only required when you want to run examples/clientStdioChat.js
ANTHROPIC_API_KEY=

# EVALS
PHOENIX_API_KEY=
PHOENIX_HOST=

OPENROUTER_API_KEY=
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
@@ -0,0 +1,46 @@
# This workflow runs MCP tool calling evaluations on master branch merges
# It evaluates AI models' ability to correctly identify and call MCP tools.

name: MCP tool calling evaluations

on:
  # Run evaluations on master branch merges
  push:
    branches:
      - 'master'
  # Also run on PRs with 'validated' label for testing
  pull_request:
    types: [labeled, synchronize, reopened]

jobs:
  evaluations:
    name: MCP tool calling evaluations
    runs-on: ubuntu-latest
    # Run on master pushes or PRs with 'validated' label
    if: github.event_name == 'push' || contains(github.event.pull_request.labels.*.name, 'validated')

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Use Node.js 22
        uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: 'npm'
          cache-dependency-path: 'package-lock.json'

      - name: Install Node dependencies
        run: npm ci --include=dev

      - name: Build project
        run: npm run build

      - name: Run evaluations
        run: npm run evals:run
        env:
          GITHUB_PR_NUMBER: ${{ github.event_name == 'pull_request' && github.event.number || 'master' }}
          PHOENIX_API_KEY: ${{ secrets.PHOENIX_API_KEY }}
          PHOENIX_BASE_URL: ${{ secrets.PHOENIX_BASE_URL }}
          OPENROUTER_BASE_URL: ${{ secrets.OPENROUTER_BASE_URL }}
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
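The job-level `if` condition and the `GITHUB_PR_NUMBER` expression in the workflow above can be read as two small predicates. The following is only an illustrative TypeScript sketch of that logic; `shouldRunEvals`, `prNumberOrMaster`, and the `WorkflowEvent` shape are hypothetical names and are not part of this PR.

```typescript
// Hypothetical mirror of the workflow's gating logic, for illustration only.
interface WorkflowEvent {
    eventName: string; // e.g. 'push' or 'pull_request'
    prLabels: string[]; // label names on the pull request (empty for pushes)
}

// Mirrors: github.event_name == 'push' || contains(labels.*.name, 'validated')
// GitHub's contains() compares case-insensitively, so labels are lower-cased here.
function shouldRunEvals(event: WorkflowEvent): boolean {
    const hasLabel = event.prLabels.some((label) => label.toLowerCase() === 'validated');
    return event.eventName === 'push' || hasLabel;
}

// Mirrors: github.event_name == 'pull_request' && github.event.number || 'master'
function prNumberOrMaster(eventName: string, prNumber?: number): string {
    return eventName === 'pull_request' && prNumber ? String(prNumber) : 'master';
}

console.log(shouldRunEvals({ eventName: 'pull_request', prLabels: ['validated'] }));
console.log(prNumberOrMaster('push'));
```

Note that the `push` trigger is already restricted to `master` by `on.push.branches`, so the predicate does not need to check the branch again.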
@@ -0,0 +1,122 @@
# MCP tool selection evaluation

Evaluates how well models select MCP server tools. Phoenix is used only for storing and visualizing results.

## CI Workflow

The evaluation workflow runs automatically on:
- **Master branch pushes** - for production evaluations (running only on master saves CI cycles)
- **PRs with `validated` label** - for testing evaluation changes before merging

To trigger evaluations on a PR, add the `validated` label to your pull request.

## Two evaluation methods

1. **Exact match** (`tool-exact-match`) - binary tool name validation
2. **LLM judge** (`tool-selection-llm`) - Phoenix classifier with structured prompt

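A minimal sketch of what a binary exact-match check could look like. This is an assumption for illustration, not the evaluator shipped in this PR: the function name `toolExactMatch` is hypothetical, and it assumes that order and duplicates are ignored when comparing tool names.

```typescript
// Hypothetical exact-match scorer: 1.0 when the called tool names equal the
// expected ones, otherwise 0.0.
// Assumption: order and duplicates are ignored; the real evaluator may differ.
function toolExactMatch(expectedTools: string[], calledTools: string[]): number {
    const expected = new Set(expectedTools);
    const called = new Set(calledTools);
    if (expected.size !== called.size) return 0.0;
    return expectedTools.every((name) => called.has(name)) ? 1.0 : 0.0;
}

console.log(toolExactMatch(['call-actor'], ['call-actor']));
console.log(toolExactMatch(['call-actor'], ['search-actors']));
```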
## Why OpenRouter?

OpenRouter provides a unified API for Gemini, Claude, and GPT, so no separate integrations are needed.

## Judge model

- model: `openai/gpt-4o-mini`
- prompt: structured eval with context + tool definitions
- output: "correct"/"incorrect" → 1.0/0.0 score (and explanation)

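The label-to-score mapping above can be sketched as follows. This is an assumption about one way to do it, not the Phoenix classifier used in this PR: `parseJudgeOutput` and `JudgeResult` are hypothetical names, and the `DECISION:` / `EXPLANATION:` line format is assumed for the judge's reply. Anything that is not clearly "correct" is scored 0.0.

```typescript
// Hypothetical parser for a judge reply; not the Phoenix classifier itself.
interface JudgeResult {
    label: 'correct' | 'incorrect';
    score: number; // 1.0 for "correct", 0.0 for "incorrect"
    explanation: string;
}

function parseJudgeOutput(raw: string): JudgeResult {
    // Accept either "DECISION: correct" or a bare "correct"/"incorrect" word.
    const decision = /DECISION:\s*(correct|incorrect)/i.exec(raw);
    const explanation = /EXPLANATION:\s*(.*)/i.exec(raw);
    const word = (decision ? decision[1] : raw.trim()).toLowerCase();
    // Conservative default: unrecognized output counts as "incorrect".
    const label = word === 'correct' ? 'correct' : 'incorrect';
    return {
        label,
        score: label === 'correct' ? 1.0 : 0.0,
        explanation: explanation ? explanation[1].trim() : '',
    };
}

console.log(parseJudgeOutput('DECISION: correct\nEXPLANATION: The right tool was called.'));
```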
## Config (`config.ts`)

```typescript
MODELS_TO_EVALUATE = ['openai/gpt-4o-mini', 'anthropic/claude-3.5-haiku', 'google/gemini-2.5-flash']
PASS_THRESHOLD = 0.7
TOOL_SELECTION_EVAL_MODEL = 'openai/gpt-4o-mini'
```

## Setup

```bash
export PHOENIX_BASE_URL="your_url"
export PHOENIX_API_KEY="your_key"
export OPENROUTER_API_KEY="your_key"
export OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

npm ci
npm run evals:create-dataset # one-time
npm run evals:run
```

## Test cases

40+ cases across 7 tool categories: `fetch-actor-details`, `search-actors`, `apify-slash-rag-web-browser`, `search-apify-docs`, `call-actor`, `get-actor-output`, `fetch-apify-docs`

## Output

- Phoenix dashboard with detailed results
- console: pass/fail per model + evaluator
- exit code: 0 = success, 1 = failure

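A rough sketch of how per-model, per-evaluator results could be turned into console lines and an exit code. The names `EvalSummary` and `exitCodeFor` are hypothetical, and two assumptions are made that this README does not confirm: a pair passes when its mean score reaches the threshold, and the run fails if any pair fails.

```typescript
// Hypothetical summary of one model/evaluator pair.
interface EvalSummary {
    model: string;
    evaluator: string;
    scores: number[]; // one 0.0/1.0 score per test case
}

function meanScore(scores: number[]): number {
    if (scores.length === 0) return 0;
    return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Returns 0 when every pair reaches the threshold, 1 otherwise.
// A runner script would pass this value to process.exit().
function exitCodeFor(summaries: EvalSummary[], threshold: number): number {
    let allPassed = true;
    for (const { model, evaluator, scores } of summaries) {
        const mean = meanScore(scores);
        const passed = mean >= threshold;
        console.log(`${model} / ${evaluator}: ${mean.toFixed(2)} ${passed ? 'PASS' : 'FAIL'}`);
        if (!passed) allPassed = false;
    }
    return allPassed ? 0 : 1;
}

const code = exitCodeFor(
    [{ model: 'openai/gpt-4o-mini', evaluator: 'tool-exact-match', scores: [1, 1, 1, 0] }],
    0.7,
);
console.log(`exit code: ${code}`);
```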
## Adding new test cases

### How to contribute?

1. **Create an issue or PR** with your new test cases
2. **Explain why it should pass** - add a `reference` field with clear reasoning
3. **Test locally** before submitting
4. **Publish** - we'll review and merge

### Test case structure

Each test case in `test-cases.json` has this structure:

```json
{
  "id": "unique-test-id",
  "category": "tool-category",
  "query": "user query text",
  "expectedTools": ["tool-name"],
  "reference": "explanation of why this should pass (optional)",
  "context": [/* conversation history (optional) */]
}
```

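The structure above can also be written as a TypeScript type with a small shape check, which may help when testing new cases locally. This is a sketch under assumptions: `TestCase`, `ContextMessage`, and `validateTestCase` are hypothetical names, and the context message fields are inferred from the examples in this README rather than from a schema.

```typescript
// Hypothetical types inferred from the README examples; not a schema from this PR.
interface ContextMessage {
    role: string; // e.g. 'user', 'assistant', 'tool_use', 'tool_result'
    content?: string;
    tool?: string;
    input?: unknown;
    tool_use_id?: number | string;
}

interface TestCase {
    id: string;
    category: string;
    query: string;
    expectedTools: string[];
    reference?: string;
    context?: ContextMessage[];
}

// Returns a list of problems; an empty list means the shape looks valid.
function validateTestCase(value: unknown): string[] {
    if (typeof value !== 'object' || value === null) return ['test case must be an object'];
    const tc = value as Record<string, unknown>;
    const errors: string[] = [];
    for (const field of ['id', 'category', 'query']) {
        if (typeof tc[field] !== 'string' || tc[field] === '') {
            errors.push(`"${field}" must be a non-empty string`);
        }
    }
    if (!Array.isArray(tc.expectedTools) || !tc.expectedTools.every((t) => typeof t === 'string')) {
        errors.push('"expectedTools" must be an array of strings');
    }
    if (tc.reference !== undefined && typeof tc.reference !== 'string') {
        errors.push('"reference" must be a string when present');
    }
    if (tc.context !== undefined && !Array.isArray(tc.context)) {
        errors.push('"context" must be an array when present');
    }
    return errors;
}

const example: TestCase = {
    id: 'fetch-actor-details-1',
    category: 'fetch-actor-details',
    query: 'What are the details of apify/instagram-scraper?',
    expectedTools: ['fetch-actor-details'],
};
console.log(validateTestCase(example));
```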
### Simple examples

**Basic tool selection:**
```json
{
  "id": "fetch-actor-details-1",
  "category": "fetch-actor-details",
  "query": "What are the details of apify/instagram-scraper?",
  "expectedTools": ["fetch-actor-details"]
}
```

**With reference explanation:**
```json
{
  "id": "fetch-actor-details-3",
  "category": "fetch-actor-details",
  "query": "Scrape details of apify/google-search-scraper",
  "expectedTools": ["fetch-actor-details"],
  "reference": "It should call the fetch-actor-details with the actor ID 'apify/google-search-scraper' and return the actor's documentation."
}
```

### Advanced examples with context

**Multi-step conversation flow:**
```json
{
  "id": "weather-mcp-search-then-call-1",
  "category": "flow",
  "query": "Now, use the mcp to check the weather in Prague, Czechia?",
  "expectedTools": ["call-actor"],
  "context": [
    { "role": "user", "content": "Search for weather MCP server" },
    { "role": "assistant", "content": "I'll help you to do that" },
    { "role": "tool_use", "tool": "search-actors", "input": {"search": "weather mcp", "limit": 5} },
    { "role": "tool_result", "tool_use_id": 12, "content": "Tool 'search-actors' successful, Actor found: jiri.spilka/weather-mcp-server" }
  ]
}
```
@@ -0,0 +1,114 @@
/**
 * Configuration for Apify MCP Server evaluations.
 */

import { readFileSync } from 'node:fs';
import { dirname, join } from 'node:path';
import { fileURLToPath } from 'node:url';

// Read version from test-cases.json
function getTestCasesVersion(): string {
    const currentFilename = fileURLToPath(import.meta.url);
    const currentDirname = dirname(currentFilename);
    const testCasesPath = join(currentDirname, 'test-cases.json');
    const testCasesContent = readFileSync(testCasesPath, 'utf-8');
    const testCases = JSON.parse(testCasesContent);
    return testCases.version;
}

// Evaluator names
export const EVALUATOR_NAMES = {
    TOOLS_EXACT_MATCH: 'tool-exact-match',
    TOOL_SELECTION_LLM: 'tool-selection-llm',
} as const;

export type EvaluatorName = typeof EVALUATOR_NAMES[keyof typeof EVALUATOR_NAMES];

// Models to evaluate
export const MODELS_TO_EVALUATE = [
    'openai/gpt-4o-mini',
    'anthropic/claude-3.5-haiku',
    'google/gemini-2.5-flash',
];

export const TOOL_SELECTION_EVAL_MODEL = 'openai/gpt-4o-mini';

export const PASS_THRESHOLD = 0.7;

export const DATASET_NAME = `mcp_server_dataset_v${getTestCasesVersion()}`;

// System prompt
export const SYSTEM_PROMPT = 'You are a helpful assistant';

export const TOOL_CALLING_BASE_TEMPLATE = `
You are an evaluation assistant evaluating user queries and tool calls to
determine whether a tool was chosen and if it was a right tool.

The tool calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

[BEGIN DATA]
************
[User's previous interaction with the assistant]: {{context}}
[User query]: {{query}}
************
[LLM decided to call these tools]: {{tool_calls}}
[LLM response]: {{llm_response}}
************
[END DATA]

DECISION: [correct or incorrect]
EXPLANATION: [Super short explanation of why the tool choice was correct or incorrect]

Your response must be single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the query, the tool call generated is runnable and correct,
and that no outside information not present in the query was used
in the generated query.

"incorrect" means that the chosen tool was not correct
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

You must not use any outside information or make assumptions.
Base your decision solely on the information provided in [BEGIN DATA] ... [END DATA],
the [Tool Definitions], and the [Reference instructions] (if provided).
Reference instructions are optional and are intended to help you understand the use case and make your decision.

[Reference instructions]: {{reference}}

[Tool definitions]: {{tool_definitions}}
`;

export function getRequiredEnvVars(): Record<string, string | undefined> {
    return {
        PHOENIX_BASE_URL: process.env.PHOENIX_BASE_URL,
        PHOENIX_API_KEY: process.env.PHOENIX_API_KEY,
        OPENROUTER_API_KEY: process.env.OPENROUTER_API_KEY,
        OPENROUTER_BASE_URL: process.env.OPENROUTER_BASE_URL,
    };
}

// Removes newlines and trims whitespace. Useful for Authorization header values
// because CI secrets sometimes include trailing newlines or quotes.
export function sanitizeHeaderValue(value?: string): string | undefined {
    if (value == null) return value;
    return value.replace(/[\r\n]/g, '').trim().replace(/^"|"$/g, '');
}

export function validateEnvVars(): boolean {
    const envVars = getRequiredEnvVars();
    const missing = Object.entries(envVars)
        .filter(([, value]) => !value)
        .map(([key]) => key);

    if (missing.length > 0) {
        // eslint-disable-next-line no-console
        console.error(`Missing required environment variables: ${missing.join(', ')}`);
        return false;
    }

    return true;
}
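The judge template in `config.ts` uses `{{name}}` placeholders such as `{{query}}` and `{{tool_definitions}}`. The sketch below shows one way such placeholders could be filled in. It is not the interpolation code from this PR (that lives in files not shown here), and `renderTemplate` is a hypothetical helper. It uses a replacer function so that characters like `$&` in user queries are inserted literally rather than being read as replacement patterns.

```typescript
// Hypothetical helper: fills {{name}} placeholders from a values map.
// Unknown placeholders are left untouched so missing values are easy to spot.
function renderTemplate(template: string, values: Record<string, string>): string {
    return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (placeholder, key: string) => (
        Object.prototype.hasOwnProperty.call(values, key) ? values[key] : placeholder
    ));
}

const rendered = renderTemplate(
    '[User query]: {{query}}\n[Reference instructions]: {{reference}}',
    { query: 'What are the details of apify/instagram-scraper?' },
);
console.log(rendered);
```

Leaving unknown placeholders in place is a design choice made for this sketch: a leftover `{{reference}}` in the rendered prompt makes a missing value visible in logs instead of silently becoming an empty string.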