feat: MCP tool calling evaluations in CI/CD #313

jirispilka · Oct 14, 2025

[x] - LLM as a judge
[x] - Add more test cases

Btw, the CI/CD run is intentionally verbose. I wanted to see what is happening.

CI/CD run: https://github.com/apify/apify-mcp-server/actions/runs/18567210467/job/52931463632
Results: https://app.phoenix.arize.com/s/apify/datasets/RGF0YXNldDozOQ==/experiments

MQ37

Actually, I really like this solution, except the ipynb, but I may be biased :D I tested it with run-evaluation.ts and it works flawlessly. I think this is a great starting point, good job! I only skimmed the code and I see it is still WIP; I will do a thorough review once it is ready.

jirispilka · Oct 16, 2025

@MQ37 I'm sorry for so many commits 🤦🏻 Somehow the Python and TypeScript versions are quite different. One thing that surprised me a bit was string interpolation: in Python it is OK to write {query}, while in TypeScript I had to change it to {{query}}.
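To illustrate the double-brace convention mentioned above, here is a minimal sketch of mustache-style interpolation. `renderTemplate` and `TemplateVars` are hypothetical names invented for this example; this is not the Phoenix client API and not the code from run-evaluation.ts. It assumes placeholders are plain identifiers, and it throws on a missing variable so that a bad placeholder cannot fail silently.

```typescript
// Hypothetical helper for illustration only; not taken from this PR or from Phoenix.
type TemplateVars = Record<string, string>;

function renderTemplate(template: string, vars: TemplateVars): string {
    // Match {{name}} with optional whitespace inside the braces.
    return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, name: string) => {
        const value = vars[name];
        if (value === undefined) {
            // Throw instead of leaving the placeholder in the output,
            // so a typo in the template is caught early.
            throw new Error(`Missing template variable: ${name}`);
        }
        return value;
    });
}

// Double braces are substituted; single braces such as {query} are left untouched.
const prompt = renderTemplate('User query: {{query}}', { query: 'find instagram scrapers' });
console.log(prompt);
```

With this sketch, a template that still uses the single-brace form is returned unchanged, which is one way an interpolation mismatch between the two languages could go unnoticed.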

MQ37

Let's merge this as we need some kind of evaluation framework and this should do the job for simple use cases. Thank you for finishing this initial version!

One major point that I do not like is that there are two separate versions in two separate languages (TypeScript and the Python Jupyter notebook). I would keep only one version, as maintaining both will be a painful experience and does not really make sense. I know that Jupyter notebooks are fun and great for discovery and experimentation, but we should make the TypeScript experience pleasant enough that we do not need the Python version.

The code needs more love and a bit of refactoring, but we can do this later, so LGTM 👍

jirispilka · Oct 16, 2025

moved to notebooks

jirispilka added 22 commits October 13, 2025 16:44

fix: add evals

0af403f

fix: add evals

a7306e0

fix: add evals

43e8b5a

fix: add docs

fd232da

fix: update evaluations

21444a3

fix: lint

bc32f04

fix: env variables

1a0da38

fix: fix uv

14d4cc5

fix: fix uv

20eda85

fix: fix uv

c688ecb

fix: Update results

64bce1e

feat: Add typescript code

cf1054b

feat: Add run-evaluation.ts

54244b9

fix: lint

c9d9aeb

fix: update create-dataset.ts with logs

5194a4a

fix: update run-evaluation.ts

f2619f2

fix: update run-evaluation.ts

1353e8b

fix: update run-evaluation.ts

e766cab

fix: update run-evaluation.ts

8764c63

fix: update run-evaluation.ts

fbbe4c8

fix: update run-evaluation.ts

45db30d

fix: update documentation

483ffd5

github-actions bot assigned jirispilka Oct 14, 2025

github-actions bot added the t-ai label (Issues owned by the AI team) Oct 14, 2025

jirispilka added 6 commits October 14, 2025 15:17

fix: update tsconfig.json

120fc18

fix: update evaluations.yaml

dcb5e61

fix: add PHOENIX_BASE_URL

5a16f53

fix: run-again

40bdeb4

fix: add debug log

c608779

fix: add function to sanitize headers

a3f04aa

jirispilka requested a review from MQ37 October 14, 2025 15:59

jirispilka marked this pull request as ready for review October 14, 2025 15:59

jirispilka changed the title ~~feat: MCP tool calling evaluations in CI/CD~~ WIP: feat: MCP tool calling evaluations in CI/CD Oct 14, 2025

MQ37 reviewed Oct 14, 2025

evals/__pycache__/config.cpython-312.pyc (outdated, resolved)

evals/run-evaluation.ts (outdated, resolved)

jirispilka added 13 commits October 16, 2025 10:16

fix: update tools_exact_match

e40feb1

fix: update run-evaluation.ts with llm as judge

5ed9eb8

fix: update logs and rename for clarity

86c45df

fix: update prompt

4b3db00

fix: improve evaluation results

e27f07b

fix: Use openrouter as llm judge

a763934

fix: Organize packages

529c334

Clean up Python cache files and update .gitignore

1fedb52

- Remove accidentally committed __pycache__ files
- Add Python cache patterns to .gitignore to prevent future commits
- Update evaluation notebook

Remove unnecessary __init__.py from evals directory

0684a2a

The evals directory doesn't need to be a Python package since it contains TypeScript evaluation scripts.

fix: evals ci

0cd6642

fix: create dataset

cb26e7e

fix: decrease threshold to get green light

00ca260

fix: fix eslint config

f1e9256

jirispilka requested a review from MQ37 October 16, 2025 10:53

jirispilka added 3 commits October 16, 2025 13:00

fix: update README.md

9db9c24

fix: run on push to master or evals tag

d118273

fix: run on push to master or validated tag

4794ca4

jirispilka changed the title ~~WIP: feat: MCP tool calling evaluations in CI/CD~~ feat: MCP tool calling evaluations in CI/CD Oct 16, 2025

jirispilka added the validated label (Issues that are resolved and their solutions fulfill the acceptance criteria) Oct 16, 2025

fix: value interpolation in the template! It was not working and failing silently

82df1ef

MQ37 approved these changes Oct 16, 2025

evals/run-evaluation.ts (outdated, resolved)

.github/workflows/evaluations.yaml (resolved)

fix: minor changes and a couple of more test cases

ab61d1b

jirispilka merged commit a971322 into master Oct 16, 2025
5 checks passed

jirispilka deleted the feat/evaluations branch October 16, 2025 20:43
