feat: MCP tool calling evaluations in CI/CD #313
Actually, I really like this solution, except for the ipynb, but I may be biased :D I tested it with run-evaluation.ts and it works flawlessly. I think this is a great starting point, good job! I only skimmed the code and I see it is still WIP, so I will do a thorough review once it is ready.
- Remove accidentally committed __pycache__ files
- Add Python cache patterns to .gitignore to prevent future commits
- Update evaluation notebook
The evals directory doesn't need to be a Python package since it contains TypeScript evaluation scripts.
@MQ37 I'm sorry for so many commits 🤦🏻 Somehow the Python and TypeScript versions are quite different. One thing that surprised me a bit was string interpolation: in Python it is OK to write |
Let's merge this as we need some kind of evaluation framework and this should do the job for simple use cases. Thank you for finishing this initial version!
One major point that I do not like is that there are two separate versions in two separate languages (TypeScript and the Python Jupyter notebook). I would keep only one version, as maintaining both will be a painful experience and does not really make sense. I know that Jupyter notebooks are fun and great for discovery and experimentation, but we should make the TypeScript experience pleasant enough that we do not need this Python stuff.
The code needs more love and a bit of refactoring, but we can do this later, so LGTM 👍
Moved to notebooks.
[x] - LLM as a judge
[x] - Add more test cases
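For readers unfamiliar with tool-calling evaluations, here is a minimal sketch of what a test case and an exact-match check could look like. It is not the code from this PR: the `ToolCallCase` and `ToolCallResult` shapes, the `toolSelectionAccuracy` function, and the tool names in the usage note are all hypothetical, and the LLM-as-a-judge step is not covered.

```typescript
// Hypothetical shapes for illustration only; they are not taken from the PR.
interface ToolCallCase {
  query: string; // user prompt sent to the model
  expectedTool: string; // tool the model is expected to call
}

interface ToolCallResult {
  query: string; // prompt this result belongs to
  calledTool: string | null; // tool the model actually picked, or null if none
}

// Exact-match scoring: the fraction of test cases for which the model
// called the expected tool. Returns 0 for an empty test set.
function toolSelectionAccuracy(
  cases: ToolCallCase[],
  results: ToolCallResult[],
): number {
  if (cases.length === 0) return 0;

  // Index results by query so each case can be looked up directly.
  const calledByQuery = new Map<string, string | null>();
  for (const result of results) {
    calledByQuery.set(result.query, result.calledTool);
  }

  let hits = 0;
  for (const testCase of cases) {
    // A missing result (undefined) or a null call never matches.
    if (calledByQuery.get(testCase.query) === testCase.expectedTool) {
      hits += 1;
    }
  }
  return hits / cases.length;
}
```

A CI step could then compare the returned accuracy against a threshold and fail the job when it drops below it, for example after scoring cases such as `{ query: "find a scraper", expectedTool: "search-tool" }` (a made-up tool name).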
Btw. the CI/CD run is pretty verbose (intentionally). I wanted to see what is happening.
CI/CD run: https://github.com/apify/apify-mcp-server/actions/runs/18567210467/job/52931463632
Results: https://app.phoenix.arize.com/s/apify/datasets/RGF0YXNldDozOQ==/experiments