Conversation

jirispilka
Collaborator

@jirispilka jirispilka commented Oct 14, 2025

[x] - LLM as a judge
[x] - Add more test cases

Btw, the CI/CD run is intentionally verbose. I wanted to see what is happening.

CI/CD run: https://github.com/apify/apify-mcp-server/actions/runs/18567210467/job/52931463632
Results: https://app.phoenix.arize.com/s/apify/datasets/RGF0YXNldDozOQ==/experiments

@github-actions github-actions bot added the t-ai Issues owned by the AI team. label Oct 14, 2025
@jirispilka jirispilka requested a review from MQ37 October 14, 2025 15:59
@jirispilka jirispilka marked this pull request as ready for review October 14, 2025 15:59
@jirispilka jirispilka changed the title feat: MCP tool calling evaluations in CI/CD WIP: feat: MCP tool calling evaluations in CI/CD Oct 14, 2025
Contributor

@MQ37 MQ37 left a comment

Actually, I really like this solution, except for the ipynb, but I may be biased :D I tested it with run-evaluation.ts and it works flawlessly. I think this is a great starting point, good job! I only skimmed the code and I see it is still WIP, so I will do a thorough review once it is ready.

evals/__pycache__/config.cpython-312.pyc (outdated, resolved)
evals/run-evaluation.ts (outdated, resolved)
@jirispilka jirispilka requested a review from MQ37 October 16, 2025 10:53
@jirispilka jirispilka changed the title WIP: feat: MCP tool calling evaluations in CI/CD feat: MCP tool calling evaluations in CI/CD Oct 16, 2025
@jirispilka jirispilka added the validated Issues that are resolved and their solutions fulfill the acceptance criteria. label Oct 16, 2025
@jirispilka
Collaborator Author

@MQ37 I'm sorry for so many commits 🤦🏻 Somehow the Python and TypeScript versions are quite different. One thing that surprised me a bit was string interpolation: in Python it is OK to write {query}, while in TS I had to change it to {{query}}.
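To illustrate the placeholder difference mentioned above, here is a minimal sketch, assuming the TypeScript side uses a mustache-style renderer that only recognizes double braces. `renderMustache` and `exampleVars` are hypothetical names made up for this example; they are not taken from the PR or from any library API.

```typescript
// Hypothetical helper (not from the PR): fills mustache-style {{name}} placeholders.
// Single-brace {name} placeholders are deliberately not recognized.
function renderMustache(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) =>
    // Leave unknown placeholders untouched instead of inserting "undefined".
    Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : match,
  );
}

// Assumed example values, for illustration only.
const exampleVars: Record<string, string> = { query: "find a web scraper" };

const doubleBraces = renderMustache("User query: {{query}}", exampleVars);
const singleBraces = renderMustache("User query: {query}", exampleVars);

console.log(doubleBraces); // User query: find a web scraper
console.log(singleBraces); // User query: {query}
```

Under this assumption, a template carried over from Python with {query} would pass through unfilled, which matches the need to switch to {{query}}.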

Contributor

@MQ37 MQ37 left a comment

Let's merge this as we need some kind of evaluation framework and this should do the job for simple use cases. Thank you for finishing this initial version!

One major point that I do not like is that there are two separate versions in two separate languages (TypeScript and the Python Jupyter notebook). I would keep only one version, as maintaining the two will be a painful experience and does not really make sense. I know that Jupyter notebooks are fun and great for discovery and experimentation, but we should make the TypeScript experience pleasant enough that we do not need this Python stuff.

The code needs more love and a bit of refactoring, but we can do this later, so LGTM 👍

evals/run-evaluation.ts (outdated, resolved)
.github/workflows/evaluations.yaml (resolved)
@jirispilka
Collaborator Author

> Let's merge this as we need some kind of evaluation framework and this should do the job for simple use cases. Thank you for finishing this initial version!
>
> One major point that I do not like is that there are two separate versions in two separate languages (TypeScript and the Python Jupyter notebook). I would keep only one version, as maintaining the two will be a painful experience and does not really make sense. I know that Jupyter notebooks are fun and great for discovery and experimentation, but we should make the TypeScript experience pleasant enough that we do not need this Python stuff.
>
> The code needs more love and a bit of refactoring, but we can do this later, so LGTM 👍

moved to notebooks

@jirispilka jirispilka merged commit a971322 into master Oct 16, 2025
5 checks passed
@jirispilka jirispilka deleted the feat/evaluations branch October 16, 2025 20:43

Labels

t-ai Issues owned by the AI team. validated Issues that are resolved and their solutions fulfill the acceptance criteria.

2 participants
