feat: DK-Bench ilab implementation #2940
mergify[bot] merged 4 commits into instructlab:main from alimaredia:dk-bench-cli-only
Conversation
Force-pushed from fd3906b to 2b41058
| help="Comma-separated list of file formats for results of the DK-Bench evaluation. Valid options are csv, jsonl, and xlsx. If this option is not provided the results are written as a .jsonl file", | ||
| ) | ||
| @click.option( | ||
| "--model-prompt", |
Would it make more sense for this to be --system-prompt?

I thought about that, but I figured a user new to AI might understand that --model-prompt applies specifically to the model being tested, even if they didn't know that these prompts are generally called system prompts.
If you really think it should be --system-prompt, I don't have any objection to changing it.

--model-prompt would imply it's what we're prompting the model with, i.e., the user prompt. We should stick with --system-prompt in order to be consistent.
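For reference, a minimal sketch of what the renamed flag could look like; the help text and default here are illustrative, not the PR's actual wording:
import click

@click.command()
@click.option(
    "--system-prompt",
    default=None,
    help="System prompt for the model being evaluated in DK-Bench. If not set, the model's default system prompt is used.",
)
def evaluate(system_prompt):
    # Hypothetical stub showing only the renamed option.
    click.echo(f"System prompt override: {system_prompt!r}")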
| help="Temperature for the model when getting responses in the DK-Bench evaluation", | ||
| ) | ||
| @click.option( | ||
| "--dk-bench-model", |
How is this different from the --model flag that already exists in this CLI command?

The way this PR is currently written, the command does two things:
- If a user doesn't pass in a --dk-bench-model and the .jsonl file has response, reference, and user input, then the command just runs the evaluation on the file.
- If you pass in the dk_bench_model, then responses are generated from the model for the questions.
I had started by using the --model flag, but ran into some issues with Click always passing a value into model because it's set in every profile, even though the default in Pydantic is None. This made the first mode impossible to execute.
I'm thinking about requiring a student model in every evaluation run, since you'd expect a model to be evaluated in the ilab model evaluate command, thus getting rid of the first behavior above. One of the many benefits of this is we could go back to using the --model flag.
If in the future users want to just be able to do evaluation on a file with responses, then we can tweak or add to the command then.
@RobotSail what do you think?

@alimaredia Okay, I see your challenge.
What you could probably do is check if the JSONL file contains response fields already, and if so then you can just pass None in for the student model. Otherwise, you can just proceed with what's set as the student model.
This way you would still get the desired functionality without needing to create a separate field.
Otherwise, the user will be burdened with having a different model set depending on which eval they run, which will be very confusing for them.
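A rough sketch of the check being suggested here, assuming the input is line-delimited JSON whose records may carry a response field (the helper name and exact behavior are hypothetical, not the PR's code):
import json
from pathlib import Path
from typing import Optional

def resolve_student_model(input_file: str, configured_model: Optional[str]) -> Optional[str]:
    # If every record already has a non-empty 'response', skip generation
    # and evaluate the file as-is; otherwise fall back to the --model value.
    with Path(input_file).open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    if records and all(rec.get("response") for rec in records):
        return None
    return configured_model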
| except OpenAIError as exc: | ||
| raise ClientException(f"Connection Error {exc}") from exc | ||
|
|
||
| if len(models.data) != 1: |
Why would we error if the server has more than one model actively running? What if I want to be evaluating multiple models concurrently?

My thinking here was that most OpenAI-compatible endpoints, or at least the ones served with vLLM in my research, only had one model served per endpoint.
I'd like it for the user to not have to think about passing in a model name at all and just get it from the server, since getting the model name right is required to get the responses from the student model in the Eval library. I even had a flag for a model name and then removed it for that reason.
That being said, in my next iteration of the PR I'm thinking of re-adding the --model-name flag and checking if that model_name is in the list of model names returned by the server. If it's not, I'll throw an error. If no model_name is passed in and there is more than one model on the server, I'll throw an error telling the user to pass in the model name from the ones returned by the server.
@RobotSail what do you think?

@alimaredia since you're already performing logic to check if the student model exists in the list of served models, you can either delete this check entirely or have it only check that the server has at least one model being served.
In either of the proposed cases, you will have the same outcome of knowing whether or not a model exists. But it might help with UX if you can raise an error when the active server isn't hosting any models.
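Sketching what the combined check could look like with the OpenAI client the surrounding code already uses; RuntimeError stands in for the PR's own exception type, and the function name is made up for illustration:
from typing import Optional
from openai import OpenAI, OpenAIError

def pick_served_model(base_url: str, api_key: str, model_name: Optional[str]) -> str:
    try:
        client = OpenAI(base_url=base_url, api_key=api_key)
        served = [m.id for m in client.models.list().data]
    except OpenAIError as exc:
        raise RuntimeError(f"Connection Error {exc}") from exc
    if not served:
        raise RuntimeError("The server is not hosting any models")
    if model_name is None:
        if len(served) > 1:
            raise RuntimeError(f"Multiple models are served, pass --model-name from: {served}")
        return served[0]
    if model_name not in served:
        raise RuntimeError(f"'{model_name}' is not served by this endpoint; available: {served}")
    return model_name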
| backend=backend, | ||
| enable_serving_output=enable_serving_output, | ||
| ) | ||
| else: |
Are we sure that we have enough information here to know that a model is an endpoint?

I could add a check that curls the endpoint or something like that. I figured get_endpoint_model_name would throw an error if the endpoint wasn't valid anyway.

In this case you have only ruled out that the model either doesn't exist locally or is in an invalid format - not necessarily that it's an endpoint.
For the most part I think this logic is fine, but you should check, per the above, that it's not the case that they're trying to pass in a local model that simply doesn't exist.
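One way to make that distinction explicit, as a sketch (the function name and URL heuristic are assumptions, not what the PR ships):
from pathlib import Path

def classify_model_arg(model: str) -> str:
    # Treat the argument as an endpoint only when it clearly isn't a local path.
    if Path(model).expanduser().exists():
        return "local"
    if model.startswith(("http://", "https://")):
        return "endpoint"
    # Looks like a path but nothing exists there - surface that instead of
    # silently falling through to the endpoint branch.
    raise FileNotFoundError(f"Model '{model}' does not exist locally and is not a URL")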
| enable_serving_output=enable_serving_output, | ||
| ) | ||
| else: | ||
| logger.debug("DK-Bench model is an endpoint") |
Why are we calling it a dk-bench model? How is this different from a student model?

It's the same thing, but I'll rephrase this to "Model being evaluated in DK-Bench is an endpoint".
|
|
||
| api_key = os.environ.get("STUDENT_ENDPOINT_API_KEY", None) | ||
| if api_key is None: | ||
| logger.debug( |
If no API key was set then you should set it to be an empty string, not None.

Good point. Per this function: https://github.com/instructlab/eval/blob/main/src/instructlab/eval/mt_bench_common.py#L369 it looks like "NO_API_KEY" should be the value I'm passing in and checking for in this case.

You can set the API key to whatever string you want, provided that the student model is running in vLLM and not on a hosted OpenAI-compatible server which expects an API key. I'd say it would be worthwhile to simply set it to the default you suggested for consistency, and then add a debug log indicating that you've done so because none was provided.
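Putting the two comments together, the fallback could look roughly like this sketch:
import logging
import os

logger = logging.getLogger(__name__)

# Assumed default: the "NO_API_KEY" sentinel the eval library checks for.
api_key = os.environ.get("STUDENT_ENDPOINT_API_KEY")
if api_key is None:
    api_key = "NO_API_KEY"
    logger.debug(
        "STUDENT_ENDPOINT_API_KEY not set; defaulting to 'NO_API_KEY' "
        "(fine for vLLM, but a hosted server that validates keys will reject it)"
    )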
| gpus: int | None, | ||
| backend: str | None, | ||
| enable_serving_output: bool, | ||
| ) -> tuple: |
When returning something like a list, tuple, dict, etc., you must use the typing library for this:
from typing import Tuple
# ...
def foo() -> Tuple[str, int, bool]:
    return ('hello', 42, False)
| ) -> tuple: | |
| ) -> Tuple: |
| QNA_YAML = "qna.yaml" | ||
|
|
||
|
|
||
| def model_is_localand_valid(model: str) -> bool: |
typo
| def model_is_localand_valid(model: str) -> bool: | |
| def model_is_local_and_valid(model: str) -> bool: |
|
|
||
| def make_run_dir(output_dir: str) -> str: | ||
| now = datetime.now() | ||
| timestamp = now.strftime("%m-%d-%Y_%H-%M-%S-%f") |
It seems like we are using some magical string formatting here. If you don't want to use the built-in string formatting then could you please leave a comment explaining why we're doing it this way?

What's the default format for a timestamp that you mentioned below?
If the user runs this benchmark over and over I didn't want results files piling up in DEFAULTS.EVAL_DIR, so I'm creating a directory with the timestamp of the run in the name. I wanted a timestamp that goes down to the microsecond level just so that two runs couldn't possibly have the same directory.

@alimaredia You can use the datetime.isoformat() method if you're looking for a string. Otherwise you can use datetime.timestamp() to get a floating-point representation of the current UNIX timestamp.
| timestamp = now.strftime("%m-%d-%Y_%H-%M-%S-%f") | |
| timestamp = now.isoformat() |
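To make the suggestion concrete: isoformat() keeps microsecond precision by default, so the uniqueness the author wants is preserved (the job- prefix below is just an illustration):
from datetime import datetime

timestamp = datetime.now().isoformat()
# e.g. '2025-01-18T00:49:57.386910' - two back-to-back runs still differ.
run_dir_name = f"job-{timestamp}"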
| now = datetime.now() | ||
| timestamp = now.strftime("%m-%d-%Y_%H-%M-%S-%f") | ||
| if output_dir.endswith("/"): | ||
| output_dir = output_dir[:-1] |
Does this give you a single element or a list?

It gives you back a single element. That being said, there are better ways to strip a trailing slash from a path in Python, so I'll change this in my next iteration.

@alimaredia Ah I see. I recommend using pathlib.Path for any path-related logic here. It will prevent you from needing to write your own logic for these cases.
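A sketch of make_run_dir leaning on pathlib.Path as suggested; Path normalizes the trailing slash on its own, so the manual slicing disappears (the directory naming is assumed, not the PR's final scheme):
from datetime import datetime
from pathlib import Path

def make_run_dir(output_dir: str) -> str:
    run_dir = Path(output_dir) / f"job-{datetime.now().isoformat()}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return str(run_dir)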
| return run_dir | ||
|
|
||
|
|
||
| def print_header(): |
Were you planning on expanding this function?

This is just the beginning of the output. I'll get rid of the function for the time being since it's so small.
| question_num = 0 | ||
| for score in result.scores: | ||
| question_num += 1 | ||
| print(f"Question #{question_num}: {score['domain_specific_rubrics']}/5") |
| question_num = 0 | |
| for score in result.scores: | |
| question_num += 1 | |
| print(f"Question #{question_num}: {score['domain_specific_rubrics']}/5") | |
| for i, score in enumerate(result.scores): | |
| print(f"Question #{i+1}: {score['domain_specific_rubrics']}/5") |
| average = round(average, 2) | ||
| print("----------------------------") | ||
| print(f"Average Score: {average}/5") |
Instead of rounding, you can simply use formatting syntax:
| average = round(average, 2) | |
| print("----------------------------") | |
| print(f"Average Score: {average}/5") | |
| print("----------------------------") | |
| print(f"Average Score: {average:.2f}/5") |
| response_df["scores"] = scores | ||
|
|
||
| now = datetime.now() | ||
| timestamp = now.strftime("%m-%d-%Y_%H-%M-%S-%f") |
Not against using timestamps but if you plan on using non-standard ones then please add a comment explaining why you picked it and what the format is intended to be.

What's a standard timestamp format? Here I'm trying to have a unique identifier for each run in the data.
| timestamp = now.strftime("%m-%d-%Y_%H-%M-%S-%f") | ||
| response_df["evaluation_run"] = f"run-{timestamp}" | ||
|
|
||
| print("Responses and scores written to:") |
The print() on line 246 outputs the full path of the file to the user. An example of the entire output right now is:
# DK-BENCH REPORT
## MODEL: finetuned
Question #1: 5/5
Question #2: 5/5
----------------------------
Average Score: 5.0/5
Total Score: 10/10
Responses and scores written to:
/home/ali/.local/share/instructlab/internal/eval_data/job-01-18-2025_00-49-57-386910/responses-finetuned-01-18-2025_00-49-57-388553.jsonl
INFO 2025-01-18 00:49:57,389 instructlab.model.evaluate:125: ᕦ(òᴗóˇ)ᕤ Model evaluate with DK-Bench completed! ᕦ(òᴗóˇ)ᕤ
Let's make sure the prints all happen in a single place. And also let's make sure we are using logger.info instead of printing. This way, if someone runs ilab on a server and redirects the output elsewhere, it will go to its intended destination rather than STDOUT.

We discussed this elsewhere - the results will continue to use print while log messages will use logger.
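A small sketch of the agreed split, with the report staying on stdout and operational messages going through the logger (function name and messages are illustrative):
import logging

logger = logging.getLogger(__name__)

def report_results(average: float, results_file: str) -> None:
    # User-facing report stays on stdout.
    print(f"Average Score: {average:.2f}/5")
    print(f"Responses and scores written to:\n{results_file}")
    # Operational messages go through the logger so they follow any redirection.
    logger.info("Model evaluate with DK-Bench completed!")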
| ) | ||
| create_excel_results_file(results_file, result) | ||
| else: | ||
| logger.debug("Output format %s is not valid", fmt) |
You're supporting QNA.yaml as a file type but it seems like you silently fail here.

This function is just about output types, and this initial PR is going to have only .jsonl as the input. I'll have another one open soon that supports qna.yaml as an input format, and I'll remove it from the list above in the meantime.

I see. In this case I recommend having a sane default if we didn't validate their output format and they got this far. But I think we discussed above that we should just validate as early as possible.
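A sketch of the validate-as-early-as-possible idea, written as a small helper that could back a click callback (the names and the fallback default are assumptions):
import click

VALID_FORMATS = {"csv", "jsonl", "xlsx"}

def parse_output_formats(value: str) -> list:
    formats = [fmt.strip() for fmt in value.split(",") if fmt.strip()]
    unknown = [fmt for fmt in formats if fmt not in VALID_FORMATS]
    if unknown:
        raise click.BadParameter(
            f"Unsupported output format(s): {', '.join(unknown)}. "
            f"Valid options are: {', '.join(sorted(VALID_FORMATS))}"
        )
    # Sane default when nothing was supplied.
    return formats or ["jsonl"]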
| server = None | ||
| try: | ||
| evaluator = RagasEvaluator() | ||
| if model is not None: |
In your type-hinting above, model is hinted as only ever being set to str, so this check will always be True. Is there something I'm missing here?

This check would go away per the changes I'm proposing in https://github.com/instructlab/instructlab/pull/2940/files#r1921098046
| ) | ||
| model_name = "no-model-provided" | ||
|
|
||
| finally: |
Why have we wrapped these statements in a broad try statement where we have no except statement and are only using finally? It seems like we should either handle the exceptions you expect to be raised, or we should just get rid of the try/finally blocks.

This try -> finally pattern is used for every other benchmark in evaluate.py. Take a look at this for an example: https://github.com/instructlab/instructlab/blob/main/src/instructlab/model/evaluate.py#L98
It seems like it has to do with shutting down local servers that are started up. @danmcp is that right?
That being said, this code is confusing to read, and I'm going to try to split this logic out into an evaluate_local_model function so that it's in the same place as the call to launch_server and looks like the rest of the code for the other benchmarks.

@alimaredia I completely understand why you'd wrap it this way. But my point here is that we need to catch the exception here if we're wrapping it. Otherwise we are silently failing and ignoring the failure, which is not a good pattern.

"Otherwise we are silently failing and ignoring the failure, which is not a good pattern."
There are multiple ways to manage the exceptions here. But as written, this wouldn't ignore any failures. They would simply bubble up to the caller.

@danmcp Yes, that's a good point, thank you for pointing that out.
What will actually happen here, then, is that everything inside of the try will execute; if an exception is raised at any point we automatically jump to the finally statement, and then the exception continues to bubble up.
Given that the finally statement here exists solely to shut down a server if we had created one in the first branch of the if statement, would it make sense to narrow the scope of this try/finally statement + server initialization to be under the if model is not None branch? It seems like the server bits only really apply to the first branch, but the exceptions will propagate up no matter what.

For additional context, although this is syntactically fine, the problem this introduces is that if someone unfamiliar with this codebase inspects this try/finally statement, they might attribute meaning to the entire statement being wrapped where there is none.
In this sense, limiting error-handling statements to their effective scope promotes interpretability in the codebase.

"Would it make sense to narrow the scope of this try/finally statement + server initialization to be under the if model is not None branch?"
+1 from me
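A self-contained sketch of the narrowed scope being agreed on here: the try/finally only wraps the branch that actually launches a server, exceptions still bubble up, and the helpers are stand-ins since the real launch_server signature isn't shown in this thread:
from types import SimpleNamespace
from typing import Optional

def start_student_server(model: str) -> SimpleNamespace:
    # Stand-in for the real server launcher.
    return SimpleNamespace(shutdown=lambda: print(f"shutting down server for {model}"))

def run_dk_bench(model: Optional[str]) -> None:
    if model is not None:
        server = None
        try:
            server = start_student_server(model)
            print("generating responses from the student model")
        finally:
            # Cleanup is the only reason for the finally; exceptions propagate.
            if server is not None:
                server.shutdown()
    else:
        print("evaluating pre-generated responses from the input file")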
RobotSail left a comment

Great work on this PR so far, we're definitely getting there. I've provided feedback on the stuff that still needs to be worked on. Let me know once you've addressed this and I'll take another look. Thanks!
Force-pushed from 2b41058 to 8628942
e2e workflow succeeded on this PR: View run, congrats!
| # Directory where DK-Bench evaluation results are stored. | ||
| output_dir: ~/.local/share/instructlab/internal/eval_data/dk_bench | ||
| # Comma-separated list of file formats for results of the DK-Bench evaluation. | ||
| output_file_formats: jsonl |
Couldn't we just set this to all of them in the profiles, rather than do the flag override in the E2E script?
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 258bf1f to d55d4c8
| system_prompt: You are an advanced AI assistant designed to provide precise and | ||
| accurate information. Your primary goal is to answer queries with the most up-to-date | ||
| and factual information available. Focus on delivering clear, concise, and correct | ||
| responses. If you're uncertain about any aspect of the query, state your level | ||
| of confidence and provide the most accurate information you can. Your responses | ||
| should prioritize accuracy over all other considerations. | ||
| # Temperature for model getting responses during DK-Bench. | ||
| temperature: 0.0 |
These aren't under the dk_bench section, so is this setting system_prompt and temperature for benchmark models other than dk-bench?

Same question applies to the other test-data files too.

@alinaryan Great catch! Both temperature and system_prompt are settings which are common to all benchmarks, since they control how the student model generates information. That said, @alimaredia I would avoid setting system_prompt explicitly in these files and would instead allow ilab to figure it out internally. This should be used as more of an override setting when users are trying out different ideas.

"Same question applies to the other test-data files too."
We'll be bringing in more Ragas-based metrics that will also be reading in from the test-data files.
src/instructlab/configuration.py
Outdated
| base_branch: Optional[str] = Field(default=None, description="Base taxonomy branch") | ||
| dk_bench: _dkbench = Field( | ||
| default_factory=_dkbench, | ||
| description="Settings to run DK-Bench against a file of user created questions, reference answsers, and responses. If responses are not provided they are generated from a model", |
non blocking nit: s/answsers/answers/g
| MISTRAL_GGUF_MODEL_NAME = "mistral-7b-instruct-v0.2.Q4_K_M.gguf" | ||
| MODEL_REPO = "instructlab/granite-7b-lab" | ||
| JUDGE_MODEL_MT = "prometheus-eval/prometheus-8x7b-v2.0" | ||
| JUDGE_MODEL_DK = "gpt-4o" |
Why didn't we use Prometheus as the DK judge model? I recall the MT judge model was also GPT-4o previously, and users had concerns back then. Won't those concerns arise again now with the DK judge model being GPT-4o?

@alinaryan Prometheus was fine-tuned specifically for MT-Bench, based on research that the team had done to create a local model for that particular benchmark. In order to get accurate results on the domain knowledge use-case, we need to rely on SotA models to get scores which correlate to human preference. More work must be done to figure out what we can use as a local judge model that's not GPT-4.
Force-pushed from d55d4c8 to 715fc45
@alimaredia please make issues for the following follow-up items to be worked on soon after merging:
@alinaryan I think this answers many of your review concerns; we have been discussing these offline and on the PR.
@nathan-weinberg I've addressed all of the comments you made in your last round. I'm going to remove your request for changes.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
| response_df = result.dataset.to_pandas() | ||
| response_df["scores"] = scores | ||
|
|
||
| with pd.ExcelWriter(excel_file) as writer: |
We need to explicitly select openpyxl here in order to prevent the file from being corrupted.
| with pd.ExcelWriter(excel_file) as writer: | |
| with pd.ExcelWriter(excel_file, engine="openpyxl") as writer: |
This commit adds Domain Knowledge bench (DK-Bench) as an evaluation option as part of `ilab model evaluate`. This commit adds the CLI flags to enable running the benchmark. Signed-off-by: Ali Maredia <amaredia@redhat.com>
This commit does the following: - Passes in ctx.obj.config.serve and uses existing tls config vars passed into evaluate_data instead of relying on the click context to fill out the values. - Moves per-benchmark judge and output dir selection to cli/model/evaluate.py so that click context is separated from evaluate_data(). Signed-off-by: Ali Maredia <amaredia@redhat.com>
This commit does the following: 1. The output dir for mt_bench used to be ~/.local/share/instructlab/internal/eval_data; it is now ~/.local/share/instructlab/internal/eval_data/mt_bench. This commit explicitly changes that in all of the profiles, though this default should apply on its own. 2. Overrides the judge model in the config section for mt-bench-branch to match the judge model used in mt-bench for smaller hardware profiles. 3. Updates the evaluation section of the config for the custom profiles for CI. Though many of the values added are not changed from defaults, they are added as a reference in case further changes want to be made in the future. Signed-off-by: Ali Maredia <amaredia@redhat.com>
This PR does the following: - Adds running of dk_bench into e2e-ci.sh - Adds test questions for dk-bench - Adds the OPENAI_API_KEY to the x-large job. The addition of DK-Bench requires OPENAI_API_KEY to be set before e2e-ci.sh is run. If this key is not set, the program will exit with a warning. If a user of the script does not have an OPENAI_API_KEY and wishes to run the script without running DK-Bench, they may set OPENAI_API_KEY='NO_API_KEY'. Signed-off-by: Ali Maredia <amaredia@redhat.com>
Force-pushed from 715fc45 to 0513a87
RobotSail left a comment

Thank you so much for all of your hard work to push this PR forward @alimaredia. And thanks to everyone who has spent their valuable time reviewing this on such short notice. LGTM.
e2e workflow succeeded on this PR: View run, congrats!
This commit adds Domain Knowledge bench (DK-Bench) as an evaluation option as part of ilab model evaluate. This commit adds the CLI flags to enable running the benchmark.