feat: DK-Bench ilab implementation #2940

Merged

mergify[bot] merged 4 commits into instructlab:main from alimaredia:dk-bench-cli-only on Jan 24, 2025
Conversation

@alimaredia (Contributor):

This commit adds Domain Knowledge Bench (DK-Bench) as an evaluation option in `ilab model evaluate`.

This commit also adds the CLI flags needed to run the benchmark.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow the Conventional Commits guidelines.
  • Changelog updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Functional tests have been added, if necessary.
  • E2E Workflow tests have been added, if necessary.

@mergify mergify bot added the "ci-failure" (PR has at least one CI failure) label Jan 18, 2025
@alimaredia alimaredia marked this pull request as ready for review January 18, 2025 04:29
help="Comma-separated list of file formats for results of the DK-Bench evaluation. Valid options are csv, jsonl, and xlsx. If this option is not provided the results are written as a .jsonl file",
)
@click.option(
"--model-prompt",
Member:

Would it make more sense for this to be --system-prompt?

Contributor Author:

I thought about that, but I figured a user new to AI would understand that --model-prompt applies to the model specifically being tested, even if they didn't know that these prompts are generally known as system prompts.

If you really think it should be --system-prompt, I don't have any objection to changing it.

Member:

--model-prompt would imply it's what we're prompting the model with, i.e., the user prompt. We should stick with --system-prompt in order to be consistent.

help="Temperature for the model when getting responses in the DK-Bench evaluation",
)
@click.option(
"--dk-bench-model",
Member:

How is this different from the --model flag that already exists in this CLI command?

Contributor Author:

The way this PR is currently written there are two things the command does.

  1. If a user doesn't pass in a --dk-bench-model and the .jsonl file has response, reference, and user input then the command just does the evaluation on the file.
  2. If you pass in the dk_bench_model then responses are generated from the model for the questions.

I had started by using the --model flag, but ran into some issues with Click always passing a value into model because it's set in every profile even though the default in Pydantic is None. This made mode number 1 impossible to execute.

I'm thinking about requiring a student model in every evaluation run, since you'd expect a model to be evaluated in the ilab model evaluate command, thus getting rid of the first mode above. One of the many benefits of this is we could go back to using the --model flag.

If in the future users want to just be able to do evaluation on a file with responses then we can tweak or add to the command then.

@RobotSail what do you think?

Member:

@alimaredia Okay, I see your challenge.

What you could probably do is check if the JSONL file already contains response fields, and if so, just pass None in for the student model. Otherwise, proceed with whatever is set as the student model.

This way you would still get the desired functionality but without needing to create a separate field.

Otherwise, the user will be burdened to have a different model set depending on which eval they run which will be very confusing for them.
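A minimal sketch of that check, assuming the responses live in a top-level "response" field of each JSONL record (the field name and the commented call site are illustrative, not the PR's actual implementation):

import json
from pathlib import Path


def responses_already_present(input_file: str) -> bool:
    # True only if every record in the .jsonl file carries a non-empty "response".
    with Path(input_file).open(encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return bool(records) and all(rec.get("response") for rec in records)


# Hypothetical call site: only generate responses from the student model when they are missing.
# student_model = None if responses_already_present(questions_file) else student_model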

src/instructlab/cli/model/evaluate.py (outdated thread, resolved)
except OpenAIError as exc:
raise ClientException(f"Connection Error {exc}") from exc

if len(models.data) != 1:
Member:

Why would we error if the server has more than one model actively running? What if I want to evaluate multiple models concurrently?

Contributor Author:

My thinking here was that most OpenAI-compatible endpoints, at least those served with vLLM in my research, only had one model served per endpoint.

I'd like the user to not have to think about passing in a model name at all and just get it from the server, since getting the model name right is required to get responses from the student model in the Eval library. I even had a flag for the model name and then removed it for that reason.

That being said, in my next iteration of the PR I'm thinking of re-adding the --model-name flag and checking whether that model_name is in the list of model names returned by the server. If it's not, I'll throw an error. If no model_name is passed in and there is more than one model on the server, I'll throw an error telling the user to pass in one of the model names returned by the server.

@RobotSail what do you think?

Member:

@alimaredia since you're already performing logic to check if the student model exists in the list of served models, you can either delete this check entirely or have it only check that the server has at least one model being served.

In either of the proposed cases, you will have the same outcome of knowing whether or not a model exists. But it might help with UX if you can raise the error that the active server isn't hosting any models.
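A sketch of the flow being agreed on here, using the openai client that the surrounding code already relies on; the function name, error type, and messages are illustrative only:

from openai import OpenAI, OpenAIError


def resolve_served_model(base_url: str, api_key: str, requested: str | None = None) -> str:
    # Pick the model to evaluate from whatever the server is actually hosting.
    try:
        served = [m.id for m in OpenAI(base_url=base_url, api_key=api_key).models.list().data]
    except OpenAIError as exc:
        raise ValueError(f"Could not list models from {base_url}: {exc}") from exc

    if not served:
        raise ValueError(f"The server at {base_url} is not hosting any models")
    if requested is None:
        if len(served) > 1:
            raise ValueError(f"Multiple models are served ({', '.join(served)}); pass --model-name")
        return served[0]
    if requested not in served:
        raise ValueError(f"Model '{requested}' is not served; available: {', '.join(served)}")
    return requested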

backend=backend,
enable_serving_output=enable_serving_output,
)
else:
Member:

Are we sure that we have enough information here to know that a model is an endpoint?

Contributor Author:

I could add a check that curls the endpoint or something like that. I figured get_endpoint_model_name would throw an error if the endpoint wasn't valid anyway.

Member:

In this case you have only ruled out that the model either doesn't exist locally or is an invalid format - not necessarily that it's an endpoint.

For the most part I think this logic is fine, but combined with the check above you should make sure it's not the case that they're trying to pass in a local model that simply doesn't exist.

enable_serving_output=enable_serving_output,
)
else:
logger.debug("DK-Bench model is an endpoint")
Member:

Why are we calling it a dk-bench model? How is this different from a student model?

Contributor Author:

It's the same thing, but I'll rephrase this to "Model being evaluated in DK-Bench is an endpoint".

Member:

Yes I like that.


api_key = os.environ.get("STUDENT_ENDPOINT_API_KEY", None)
if api_key is None:
logger.debug(
Member:

If no API key was set then you should set it to be an empty string, not None.

Contributor Author:

Good point. Per this function: https://github.com/instructlab/eval/blob/main/src/instructlab/eval/mt_bench_common.py#L369 it looks like "NO_API_KEY" should be the value I'm passing in and checking for in this case.

Member:

You can set the API key to whatever string you want if one is set, provided that the student model is running in vLLM and not on a hosted OpenAI-compatible server which expects an API key. I'd say it would be worthwhile to simply set it to the default you suggested for consistency, and then add a debug log indicating that you've done so because none was provided.
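A minimal sketch of that fallback, reusing the STUDENT_ENDPOINT_API_KEY variable and the "NO_API_KEY" sentinel mentioned above; the logger setup is illustrative:

import logging
import os

logger = logging.getLogger(__name__)

api_key = os.environ.get("STUDENT_ENDPOINT_API_KEY")
if not api_key:
    # Default to the sentinel the eval library expects when no key is exported.
    api_key = "NO_API_KEY"
    logger.debug(
        "STUDENT_ENDPOINT_API_KEY is not set; defaulting to %s. "
        "Set the variable if the endpoint requires authentication.",
        api_key,
    )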

gpus: int | None,
backend: str | None,
enable_serving_output: bool,
) -> tuple:
Member:

When returning something like a list, tuple, dict, etc., you must use the typing library for this:

from typing import Tuple

# ...
def foo() -> Tuple[str, int, bool]:
  return ('hello', 42, False)
Suggested change
) -> tuple:
) -> Tuple:

QNA_YAML = "qna.yaml"


def model_is_localand_valid(model: str) -> bool:
Member:

typo

Suggested change
def model_is_localand_valid(model: str) -> bool:
def model_is_local_and_valid(model: str) -> bool:


def make_run_dir(output_dir: str) -> str:
now = datetime.now()
timestamp = now.strftime("%m-%d-%Y_%H-%M-%S-%f")
Member:

It seems like we are using some magical string formatting here. If you don't want to use the built-in string formatting then could you please leave a comment explaining why we're doing it this way?

Contributor Author:

What's the default format for a timestamp that you mentioned below?

If the user runs this benchmark over and over I didn't want results files piling up in DEFAULTS.EVAL_DIR, so I'm creating a directory with the timestamp of the run in the name. I wanted a timestamp that goes down to the microsecond so that two runs couldn't possibly have the same directory.

Member:

@alimaredia You can use the datetime.isoformat() method if you're looking for a string. Otherwise you can use datetime.timestamp to get a floating point representation of the current UNIX timestamp.

Suggested change
timestamp = now.strftime("%m-%d-%Y_%H-%M-%S-%f")
timestamp = now.isoformat()

now = datetime.now()
timestamp = now.strftime("%m-%d-%Y_%H-%M-%S-%f")
if output_dir.endswith("/"):
output_dir = output_dir[:-1]
Member:

Does this give you a single element or a list?

Contributor Author:

It gives you back a single element. That being said, there are better ways to strip a trailing slash from a path in Python, so I'll change this in my next iteration.

Member:

@alimaredia Ah I see. I recommend using pathlib.Path for any path-related logic here. It will prevent you from needing to write your own logic for these cases.
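A sketch combining the two suggestions (ISO timestamp plus pathlib), keeping the make_run_dir signature shown above; the "run-" prefix is illustrative:

from datetime import datetime
from pathlib import Path


def make_run_dir(output_dir: str) -> str:
    # isoformat() keeps microseconds, so two runs cannot collide on the name,
    # and Path absorbs any trailing slash in output_dir.
    timestamp = datetime.now().isoformat(timespec="microseconds")
    run_dir = Path(output_dir) / f"run-{timestamp}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return str(run_dir)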

return run_dir


def print_header():
Member:

Were you planning on expanding this function?

Contributor Author:

This is just the beginning of the output. I'll get rid of the function for the time being since it's so small.

Comment on lines 170 to 139
question_num = 0
for score in result.scores:
question_num += 1
print(f"Question #{question_num}: {score['domain_specific_rubrics']}/5")
Member:

Suggested change
question_num = 0
for score in result.scores:
question_num += 1
print(f"Question #{question_num}: {score['domain_specific_rubrics']}/5")
for i, score in enumerate(result.scores):
print(f"Question #{i+1}: {score['domain_specific_rubrics']}/5")

Comment on lines 177 to 179
average = round(average, 2)
print("----------------------------")
print(f"Average Score: {average}/5")
Member:

Instead of rounding you can simply use formatting syntax

Suggested change
average = round(average, 2)
print("----------------------------")
print(f"Average Score: {average}/5")
print("----------------------------")
print(f"Average Score: {average:.2f}/5")

src/instructlab/model/dk_bench_utils.py (outdated thread, resolved)
response_df["scores"] = scores

now = datetime.now()
timestamp = now.strftime("%m-%d-%Y_%H-%M-%S-%f")
Member:

Not against using timestamps but if you plan on using non-standard ones then please add a comment explaining why you picked it and what the format is intended to be.

Contributor Author:

What's a standard timestamp format? Here I'm trying to have a unique identifier for each run in the data.

Member:

timestamp = now.strftime("%m-%d-%Y_%H-%M-%S-%f")
response_df["evaluation_run"] = f"run-{timestamp}"

print("Responses and scores written to:")
Member:

Written to where?

Contributor Author:

The print() on line 246 outputs the full path of the file to the user. An example of the entire output right now is:

# DK-BENCH REPORT

## MODEL: finetuned

Question #1:     5/5
Question #2:     5/5
----------------------------
Average Score:   5.0/5
Total Score:     10/10

Responses and scores written to:
/home/ali/.local/share/instructlab/internal/eval_data/job-01-18-2025_00-49-57-386910/responses-finetuned-01-18-2025_00-49-57-388553.jsonl


INFO 2025-01-18 00:49:57,389 instructlab.model.evaluate:125: ᕦ(òᴗóˇ)ᕤ Model evaluate with DK-Bench completed! ᕦ(òᴗóˇ)ᕤ

Member:

Let's make sure the prints all happen in a single place. And also let's make sure we are using logger.info instead of printing. This way if someone runs ilab in a server and redirects the output elsewhere, it will go to its intended destination instead of STDOUT.

Member:

We discussed this elsewhere - the results will continue to use print while log messages will use logger
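A small sketch of that split, with the user-facing results on stdout via print and diagnostics routed through the logger; the function name and message text are illustrative:

import logging

logger = logging.getLogger(__name__)


def report_results(results_file: str, average: float, total: int, max_total: int) -> None:
    # User-facing report stays on stdout.
    print("----------------------------")
    print(f"Average Score: {average:.2f}/5")
    print(f"Total Score:   {total}/{max_total}")
    print(f"Responses and scores written to:\n{results_file}")
    # Diagnostics go through logging so a redirected/server run captures them.
    logger.info("DK-Bench evaluation completed; results stored at %s", results_file)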

)
create_excel_results_file(results_file, result)
else:
logger.debug("Output format %s is not valid", fmt)
Member:

You're supporting QNA.yaml as a file type but it seems like you silently fail here.

Contributor Author:

This function is just about output types, and this initial PR will only have .jsonl as the input. I'll have another PR open soon that supports qna.yaml as an input format and will remove it from the list above in the meantime.

Member:

I see. In this case I recommend having a sane default if we didn't validate their output format and they got this far. But I think we talked above that we should just validate as early as possible.

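One way to validate as early as possible is a Click callback on the option itself, sketched below; the commented wiring is hypothetical, and only the csv/jsonl/xlsx set comes from the help text earlier in this PR:

import click

VALID_OUTPUT_FORMATS = {"csv", "jsonl", "xlsx"}


def validate_output_formats(ctx, param, value):
    # Reject unknown formats at parse time instead of failing late (or silently).
    formats = [fmt.strip() for fmt in value.split(",") if fmt.strip()]
    invalid = [fmt for fmt in formats if fmt not in VALID_OUTPUT_FORMATS]
    if invalid:
        raise click.BadParameter(
            f"invalid format(s): {', '.join(invalid)}; valid options are csv, jsonl, and xlsx"
        )
    return formats


# Hypothetical wiring on the existing option:
# @click.option("--output-file-formats", default="jsonl", callback=validate_output_formats, ...)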
src/instructlab/model/dk_bench_utils.py (thread resolved)
src/instructlab/model/dk_bench_utils.py (outdated thread, resolved)
server = None
try:
evaluator = RagasEvaluator()
if model is not None:
Member:

In your type-hinting above, model is hinted as only ever being set to str, so this check will always be True. Is there something I'm missing here?

Contributor Author:

This check would go away per the changes I'm proposing in https://github.com/instructlab/instructlab/pull/2940/files#r1921098046

Member:

Yeah makes sense.

)
model_name = "no-model-provided"

finally:
Member:

Why have we wrapped these statements in a broad try statement where we have no except statement and are only using finally? It seems like we should either handle the exception case for the exceptions you plan on being raised, or we should just get rid of the try/finally blocks.

Contributor Author (@alimaredia, Jan 18, 2025):

This try -> finally pattern is used for every other benchmark in evaluate.py. Take a look at this for an example: https://github.com/instructlab/instructlab/blob/main/src/instructlab/model/evaluate.py#L98

It seems like it has to do with shutting down local servers that are started up. @danmcp is that right?

That being said this code is confusing to read and I'm going to try to split this logic out into an evaluate_local_model function so that it's in the same place as the call to launch_server and looks like the rest of the code for the other benchmarks.

Member:

@alimaredia I completely understand why you'd wrap it this way. But my point here is that we need to catch the exception here if we're wrapping it. Otherwise we are silently failing and ignoring the failure, which is not a good pattern.

Contributor:

Otherwise we are silently failing and ignoring the failure, which is not a good pattern.

There are multiple ways to manage the exceptions here. But as written, this wouldn't ignore any failures. They would simply bubble up to the caller.

Member:

@danmcp Yes, that's a good point, thank you for pointing that out.

What will actually happen here then is that everything inside of the try will execute; if an exception is raised at any point we automatically jump to the finally statement, and then the exception continues to bubble up.

Given that the finally statement here exists solely to shut down a server if we had created one in the first branch of the if statement, would it make sense to narrow the scope of this try catch statement + server initialization to be under the if model is not None branch? It seems like the server bits only really apply to the first branch, but the exceptions will propagate up no matter what.

Member:

For additional context, although this is syntactically fine, the problem this introduces is that if someone unfamiliar with this codebase inspects this try/finally statement, they might attribute meaning to the entire statement being wrapped where there is none.

In this sense, limiting error-handling statements to their effective scope promotes interpretability in the codebase.

Contributor:

would it make sense to narrow the scope of this try catch statement + server initialization to be under the if model is not None branch?

+1 from me
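A sketch of the narrowed scope being +1'd here; the helpers are injected as callables so the snippet stands alone, whereas the real code would call launch_server and the evaluator directly:

from typing import Callable, Optional


def run_dk_bench_eval(
    model: Optional[str],
    questions_file: str,
    launch_server: Callable,       # returns a server object exposing .shutdown()
    generate_responses: Callable,  # (server, questions_file) -> responses
    load_responses: Callable,      # (questions_file) -> responses
    evaluate_responses: Callable,  # (responses) -> scores
) -> None:
    if model is not None:
        # Only this branch starts a server, so only it needs the try/finally;
        # any exception still propagates to the caller after shutdown() runs.
        server = launch_server(model)
        try:
            responses = generate_responses(server, questions_file)
        finally:
            server.shutdown()
    else:
        responses = load_responses(questions_file)
    evaluate_responses(responses)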

src/instructlab/model/evaluate.py (thread resolved)
@RobotSail (Member) left a comment:

Great work on this PR so far, we're definitely getting there. I've provided feedback on the stuff that still needs to be worked on. Let me know once you've addressed this and I'll take another look. Thanks!

@cdoern cdoern added the "hold" (In-progress PR. Tag should be removed before merge.) label Jan 24, 2025
@mergify mergify bot added the "one-approval" (PR has one approval from a maintainer) label Jan 24, 2025
@github-actions

e2e workflow succeeded on this PR: View run, congrats!

scripts/test-data/profile-l4-x1.yaml (thread resolved)
# Directory where DK-Bench evaluation results are stored.
output_dir: ~/.local/share/instructlab/internal/eval_data/dk_bench
# Comma-separated list of file formats for results of the DK-Bench evaluation.
output_file_formats: jsonl
Member:

Couldn't we just set this to all of them in the profiles, rather than do the flag override in the E2E script?

src/instructlab/configuration.py (outdated thread, resolved)
src/instructlab/model/dk_bench_utils.py (three outdated threads, resolved)
mergify bot (Contributor) commented Jan 24, 2025:

This pull request has merge conflicts that must be resolved before it can be
merged. @alimaredia please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the "needs-rebase" (This Pull Request needs to be rebased) label Jan 24, 2025
@mergify mergify bot added the "ci-failure" (PR has at least one CI failure) label and removed the "needs-rebase" (This Pull Request needs to be rebased) label Jan 24, 2025
Comment on lines +62 to +69
system_prompt: You are an advanced AI assistant designed to provide precise and
accurate information. Your primary goal is to answer queries with the most up-to-date
and factual information available. Focus on delivering clear, concise, and correct
responses. If you're uncertain about any aspect of the query, state your level
of confidence and provide the most accurate information you can. Your responses
should prioritize accuracy over all other considerations.
# Temperature for model getting responses during DK-Bench.
temperature: 0.0
Contributor:

these aren't under the dk_bench section, so is this setting system_prompt and temperature for benchmark models other than dk-bench?

Contributor:

same q applies to the other test-data files too

Member:

@alinaryan Great catch! Both temperature and system_prompt are settings which are common to all benchmarks, since they control how the student model generates information. That said, @alimaredia I would avoid setting system_prompt explicitly in these files and would instead allow ilab to figure it out internally. This should be used as more of an override setting when users are trying out different ideas.

Member:

same q applies to the other test-data files too

We'll be bringing in more Ragas-based metrics that will also be reading in from the test-data files.

src/instructlab/cli/model/evaluate.py (thread resolved)
base_branch: Optional[str] = Field(default=None, description="Base taxonomy branch")
dk_bench: _dkbench = Field(
default_factory=_dkbench,
description="Settings to run DK-Bench against a file of user created questions, reference answsers, and responses. If responses are not provided they are generated from a model",
Contributor:

non blocking nit: s/answsers/answers/g

MISTRAL_GGUF_MODEL_NAME = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
MODEL_REPO = "instructlab/granite-7b-lab"
JUDGE_MODEL_MT = "prometheus-eval/prometheus-8x7b-v2.0"
JUDGE_MODEL_DK = "gpt-4o"
Contributor:

Why didn't we use Prometheus as the DK-Bench judge model? I recall the MT-Bench judge model was also GPT-4o previously, and users had concerns back then. Won't those concerns arise again now with the DK-Bench judge model being GPT-4o?

@RobotSail (Member) commented Jan 24, 2025:

@alinaryan Prometheus was fine-tuned specifically for MT-Bench, based on research that the team had done to create a local model for that particular benchmark. In order to get accurate results on the domain knowledge use-case, we need to rely on SotA models to get scores which correlate to human preference. More work must be done to figure out what we can use as a local judge model that's not GPT-4.

@mergify mergify bot removed the "ci-failure" (PR has at least one CI failure) label Jan 24, 2025
@cdoern (Contributor) commented Jan 24, 2025:

@alimaredia please make issues for the following follow up items to be worked on soon after merging:

  1. Make issues for all "generic" evaluation flags meant to apply to multiple or all benchmarks, and scope the work so that every benchmark class we support has that option, not just DK-Bench. An example of this is --temperature. Currently it only exists in the dk_bench class, so I assumed the flag should be called --dk-bench-temperature; it turns out it is meant to apply to more benchmarks, but currently doesn't, which is why we went with the vague flag name.
  2. Make issues to refactor the switch statement reading from the ctx object. We should not depend on code inside of evaluate itself to swap in default values different from what was specified in the help text of the flag. Currently a user can run ilab model evaluate --help and see None as the default for judge-model. However, new logic added in the DK-Bench PR makes each benchmark use a different value for judge_model, swapping that value in from the config to the --judge-model flag even if the user does not specify it. Generally, we are trying to move away from this kind of "magic swapping in of default values" per command unless it is handled by the clickext class.

@alinaryan I think this answers many of your review concerns, we have been discussing these offline and on the PR

@alimaredia (Contributor Author):

@nathan-weinberg I've addressed all of the comments you made in your last round. I'm going to remove your request for changes.

@github-actions

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

response_df = result.dataset.to_pandas()
response_df["scores"] = scores

with pd.ExcelWriter(excel_file) as writer:
Member:

We need to explicitly select openpyxl here in order to prevent the file from being corrupted.

Suggested change
with pd.ExcelWriter(excel_file) as writer:
with pd.ExcelWriter(excel_file, engine="openpyxl") as writer:

This commit adds Domain Knowledge Bench (DK-Bench)
as an evaluation option as part of
`ilab model evaluate`.

This commit also adds the CLI flags needed to run the
benchmark.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
This commit does the following:

- Passes in ctx.obj.config.serve and uses the existing tls
config vars passed into evaluate_data instead of
relying on the click context to fill out the values.

- Moves per benchmark judge and output dir selection to
cli/model/evaluate.py so that click context is
separated from evaluate_data().

Signed-off-by: Ali Maredia <amaredia@redhat.com>
This commit does the following:

1. The output dir for mt_bench used to be:
~/.local/share/instructlab/internal/eval_data

It is now:
~/.local/share/instructlab/internal/eval_data/mt_bench

This commit explicitly changes that in all of the
profiles, though this default should apply on its own.

2. Overrides the judge model in the
config section for mt-bench-branch to match
the judge model used in mt-bench for smaller hardware
profiles.

3. Updates the evaluation section of the config for the
custom profiles for CI. Though many of the values added
are unchanged from the defaults, they are added as a reference
in case further changes need to be made in the future.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
This PR does the following:
- Adds running of dk_bench into e2e-ci.sh
- Adds test questions for dk-bench
- Adds the OPENAI_API_KEY to the x-large job

The addition of DK-Bench requires OPENAI_API_KEY
to be set before e2e-ci.sh is run.

If this key is not set, the program will exit with
a warning.

If a user of the script does not have an OPENAI_API_KEY
and wishes to run the script without running DK-Bench,
they may set OPENAI_API_KEY='NO_API_KEY'.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
@mergify mergify bot removed the "one-approval" (PR has one approval from a maintainer) label Jan 24, 2025
@cdoern cdoern requested a review from RobotSail January 24, 2025 20:50
@RobotSail (Member) left a comment:

Thank you so much for all of your hard work to push this PR forward @alimaredia. And thanks all to those who have spent their valuable time reviewing this on such short notice. LGTM.

@RobotSail RobotSail removed the "hold" (In-progress PR. Tag should be removed before merge.) label Jan 24, 2025
@mergify mergify bot merged commit 0c0036f into instructlab:main Jan 24, 2025
32 checks passed
@github-actions

e2e workflow succeeded on this PR: View run, congrats!


Labels

CI/CD (Affects CI/CD configuration) · dependencies (Relates to dependencies) · documentation (Improvements or additions to documentation) · testing (Relates to testing)
