ilab model evaluate command and eval library usage #1369
alinaryan merged 35 commits into instructlab:main (instructlab/instructlab:main)
Conversation
this is a placeholder for now, lmk when the library is somewhere I can access and import (with the actual code)
@cdoern you can install directly from test.pypi.org for testing if you wish: https://test.pypi.org/project/instructlab-eval once we have a
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Signed-off-by: Dan McPherson <dmcphers@redhat.com>
danmcp left a comment:
I am approving with a few notes:
- The error handling still needs a decent bit of work, both here and in the eval library: replace stack traces with nice messages (see the sketch after this list).
- This version is very chatty in the CLI. This will largely be addressed by https://github.com/instructlab/instructlab/compare/main...danmcp:instructlab:evalandvllm?expand=1. Those commits also address the vllm serving hack still in place in this PR.
- More test cases are needed.
- Config defaulting probably still needs some work with the models selected.
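A minimal sketch of the error-handling pattern suggested in the first note, assuming the eval library raises its own exception type (`EvalError` and its import path are hypothetical, not the library's actual API):

```python
import click

# Hypothetical exception type -- the real instructlab-eval error class may differ.
from instructlab.eval.exceptions import EvalError


def run_benchmark(benchmark, **kwargs):
    """Run a benchmark and convert library errors into short CLI messages."""
    try:
        return benchmark.run(**kwargs)
    except EvalError as exc:
        # Show a readable message instead of dumping a stack trace on the user.
        raise click.ClickException(f"evaluation failed: {exc}") from exc
```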
alinaryan left a comment:
Thanks for the strong work on this!! Just have a few comments/q's:
made follow up issues!
Given that tomorrow morning (7/1) is a deadline, this PR needs to be merged before then to add some form of evaluation support. That being said, there is a stale change request on this PR, some pending reviews that haven't come in yet, etc. We will be dismissing those in favor of deferring to follow-up issues, but if any of the reviewers have immediate follow-up concerns, please feel free to reach out!
I filed #1540 as a follow-up to get this tested in
This PR adds `ilab model evaluate`, which allows users to run MMLU Bench, MT Bench, MMLU Branch, and MT Branch benchmarks. It also adds an `_evaluate` class to `config.yaml` so that users get sane evaluation defaults that they can see and modify. These funnel directly into the evaluation flags, as training now does. A sample evaluation class looks like:
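(The sample itself did not survive extraction here. As an illustrative sketch only, with key names and defaults assumed rather than taken from this PR, such a section might look like:)

```yaml
# Illustrative only -- keys and defaults are assumptions, not the exact schema from this PR.
evaluate:
  base_model: instructlab/granite-7b-lab   # model to compare against
  model: null                              # trained model to evaluate
  mmlu:
    few_shots: 5
    batch_size: auto
  mt_bench:
    judge_model: prometheus-eval/prometheus-8x7b-v2.0
    max_workers: 16
    output_dir: eval_output
```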