adds basic ragas eval #193
mergify[bot] merged 6 commits into instructlab:main
Conversation
requirements.txt
Outdated
lm-eval>=0.4.4
httpx
ragas
def __init__(self):
    pass

def run(
So for this, a user is expected to bring a list of Sample objects, which hold the input, prediction, and ground truth? Are we going to provide a way to build this list of Samples from given files or lists of each category, or is this more so just for use with self-built scripts that import the Sample object and build the list themselves?
Updated it such that dataset is now either a pathlib.Path object or a list of samples, and we read what we need from it accordingly.
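For illustration, here is a minimal sketch of what accepting either input type might look like. The `Sample` fields follow the PR description; `load_dataset` is a hypothetical helper name, not the method in the diff.

```python
from pathlib import Path
from typing import List, TypedDict, Union

from pandas import DataFrame, read_json


class Sample(TypedDict, total=False):
    # Fields used by the with-reference rubric evaluation.
    user_input: str  # the question posed to the model
    reference: str   # the golden answer
    response: str    # the model's answer (may be generated later)


def load_dataset(dataset: Union[List[Sample], Path]) -> DataFrame:
    # Accept either an in-memory list of samples or a path to a .jsonl file.
    if isinstance(dataset, list):
        return DataFrame(dataset)
    if isinstance(dataset, Path):
        return read_json(dataset, orient="records", lines=True)
    raise TypeError(f"unsupported dataset type: {type(dataset)}")
```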
force-pushed from f12fbd7 to f0db4a4
@RobotSail let me know once you are done testing the code. Other than that, LGTM.
src/instructlab/eval/ragas.py
Outdated
max_tokens: int = 768

# Random seed for reproducibility. This is not supported everywhere and therefore is unreliable.
We had discussed this earlier; I think you're going to want to remove this comment.
@alimaredia I'll update this comment because I believe you are confusing its meaning with our earlier conversation. This comment relates to the seed not being supported by every model-serving framework.
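For context, a rough sketch of the config fields under discussion. Only `max_tokens` and the seed comment are grounded in the diff above; every other field name and default here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Hypothetical sketch; only `max_tokens` and `seed` appear in the diff above.
    model_name: str = "granite-7b-lab"  # assumed placeholder name
    temperature: float = 0.0            # assumed default
    max_tokens: int = 768
    # Random seed for reproducibility. Not supported by every model-serving
    # framework, so treat it as best-effort rather than a guarantee.
    seed: int = 42                      # assumed default
```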
@abhi1092 I've updated the code to include unit tests; please let me know if there's anything I missed.
tests/test_ragas.py
Outdated
    api_key="test-api-key",
)
evaluator = RagasEvaluator()
result_df = evaluator._generate_answers_from_model(questions, student_model)
@abhi1092 We do test it with a mock client; GPT-3.5 was just the first model that came to mind as a fill-in.
Updated this to use fake values to eliminate the possibility of it actually calling out to the real OpenAI API.
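A minimal sketch of the kind of mock that keeps the test offline. `make_mock_openai_client` is a hypothetical helper; the canned response shape mirrors the OpenAI chat-completions client.

```python
from unittest.mock import MagicMock


def make_mock_openai_client(canned_answer: str = "fake answer") -> MagicMock:
    """Build a mock OpenAI-style client so the test never reaches a real API."""
    mock_client = MagicMock()
    mock_completion = MagicMock()
    mock_completion.choices = [MagicMock(message=MagicMock(content=canned_answer))]
    mock_client.chat.completions.create.return_value = mock_completion
    return mock_client


# Usage: the code under test calls the mock exactly like the real client.
client = make_mock_openai_client()
response = client.chat.completions.create(
    model="not-a-real-model",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)
assert response.choices[0].message.content == "fake answer"
```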
src/instructlab/eval/ragas.py
Outdated
Given a DataFrame containing `user_input` columns, generates responses from the given model
and returns a new DataFrame containing its answers in the `response` column.
"""
client = get_openai_client(
@RobotSail maybe this method and also the class can instead take the client as input?
@abhi1092 That's a good idea, although the way I've done it here is how it's done in MT-Bench as well. Would it make sense to make a follow-up issue for this?
@abhi1092 Actually I will just make this change in this PR since it's small.
Updated the PR to include this.
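A sketch of what that injection could look like, assuming the `openai` v1 client; the constructor signature here is an illustration, not the final API.

```python
from typing import Optional

from openai import OpenAI


class RagasEvaluator:
    # Hypothetical constructor: callers may inject any OpenAI-compatible client
    # (e.g. one pointed at a local endpoint) instead of having the evaluator
    # build its own via get_openai_client().
    def __init__(self, client: Optional[OpenAI] = None):
        self._client = client

    def _get_client(self) -> OpenAI:
        # Fall back to a default client only when none was injected; the
        # default reads OPENAI_API_KEY from the environment.
        return self._client if self._client is not None else OpenAI()
```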
########################################################################
# Test case: directly passing a dataset
########################################################################
result = evaluator.run(
Here too, maybe we can mock gpt-4o with a pre-defined evaluation result. That way we only test the evaluation given the student model response, the reference, and the judge model output.
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
We want ragas to read from both a file path as well as a list of samples.
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
When a dataset is provided and is missing the `response` field, we will need to generate these responses. This commit ensures that when this happens, we error out if a student model is not configured. Otherwise, we always generate these responses when a student model exists, regardless of whether `response` is in the dataframe or not.
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
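A minimal sketch of that control flow, assuming a hypothetical `resolve_responses` helper and a caller-supplied generation function:

```python
from typing import Callable, Optional

from pandas import DataFrame


def resolve_responses(
    input_df: DataFrame,
    student_model: Optional[object] = None,
    generate: Callable[[DataFrame, object], DataFrame] = lambda df, model: df,
) -> DataFrame:
    # If a student model is configured, always (re)generate responses,
    # regardless of whether a `response` column is already present.
    if student_model is not None:
        return generate(input_df, student_model)

    # No student model: the dataset must already carry the responses.
    if "response" not in input_df.columns:
        raise ValueError(
            "dataset is missing the `response` field and no student model "
            "is configured to generate it"
        )
    return input_df
```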
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
…ng that gets passed in to __init__
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
if isinstance(dataset, list):
    input_df = DataFrame(dataset)
elif isinstance(dataset, Path):
    input_df = read_json(dataset, orient="records", lines=True)
I think there's an implicit requirement here that the dataset referred to by the path is well-formed (shaped like list[Sample]). We could consider doing a quick check to make sure the required columns are present in the df, failing here if they aren't.
Sure, I don't see a reason not to.
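Something along these lines would fail fast on a malformed file; the required-column set and helper name are assumptions.

```python
from pathlib import Path

from pandas import DataFrame, read_json

# Columns a well-formed dataset (shaped like list[Sample]) is expected to carry;
# `response` is optional because it can be generated by the student model.
REQUIRED_COLUMNS = {"user_input", "reference"}


def load_and_validate(path: Path) -> DataFrame:
    input_df = read_json(path, orient="records", lines=True)
    missing = REQUIRED_COLUMNS - set(input_df.columns)
    if missing:
        raise ValueError(
            f"dataset at {path} is missing required columns: {sorted(missing)}"
        )
    return input_df
```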
JamesKunstle left a comment
This seems solid. I have a couple of function naming suggestions, and I think we could type-check the incoming data from a .jsonl file to fail as early as possible.
src/instructlab/eval/ragas.py
Outdated
# we will be using gpt-4o for the foreseeable future, we hardcode this
# for consistency of answers
critic_lm = ChatOpenAI(model=DEFAULT_JUDGE_MODEL)
How is the API key supposed to be passed into the judge model? If we're assuming the environment variable is set for that, we need to make that explicit. When I ran the short script you sent me yesterday, I hit this as an issue.
Previous iterations of this PR had a base_url and a key in the ModelConfig; I think those are needed, since the student and the judge each need at least a base_url, if not both.
I'm realizing now that maybe your intention is that the user's OpenAI key for the judge model is already set before calling RagasEvaluator.run(). Could we have a check for that, or make it clearer somehow?
I think since we want to lock it down to OpenAI for now, just to have a consistent evaluation base, we can just expose the API key. We can make a follow-up issue where this becomes configurable.
This comment is outdated; we've since updated this evaluation to accept the judge model name as well.
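A simple guard along these lines would make the requirement explicit; the function name is hypothetical, and it only assumes the standard OPENAI_API_KEY environment variable.

```python
import os


def require_judge_api_key() -> str:
    # Fail early, before any evaluation work, if the judge's key is absent.
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise ValueError(
            "OPENAI_API_KEY must be set before calling RagasEvaluator.run(); "
            "the judge model is served through the OpenAI API"
        )
    return api_key
```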
src/instructlab/eval/ragas.py
Outdated
updated_df.at[i, "response"] = response.choices[0].message.content
return updated_df

def _get_metrics(self) -> List[Metric]:
What's the use of this function? It's only being used in one place, and we can't configure what's being passed into the rubrics.
The point is that since we select the metrics carefully, having them isolated allows us to treat them with more care. Also, when you add to this list it becomes very messy, so it's slightly more readable to have it in a separate method.
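Roughly, the isolated helper looks like this; the `RubricsScore` import path and keyword argument are assumptions about the installed ragas version.

```python
from typing import List

# Assumed imports: recent ragas releases expose RubricsScore and a default
# with-reference rubric; the exact module paths may differ by version.
from ragas.metrics import Metric
from ragas.metrics._domain_specific_rubrics import (
    DEFAULT_WITH_REFERENCE_RUBRICS,
    RubricsScore,
)


def _get_metrics() -> List[Metric]:
    # Keeping the metric list in one place makes it easy to audit and extend.
    return [
        RubricsScore(rubrics=DEFAULT_WITH_REFERENCE_RUBRICS),
    ]
```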
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
This PR introduces Rubric-based evaluation through Ragas using the default with-reference rubric that they provide.
The current evaluation supports the following two modes:
- `dataset` contains records which hold `user_input` (question), `reference` (golden answer), and `response` (model answer)
- `dataset` with the `user_input` and `reference`, and additionally a ModelConfiguration of a model which will be used to generate the `response` for each question. This can be any model running on an OpenAI-compatible endpoint.

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
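A hedged end-to-end usage sketch of the two modes; class and argument names mirror the snippets above, but the exact import path, signatures, and field names are assumptions.

```python
from pathlib import Path

from instructlab.eval.ragas import ModelConfig, RagasEvaluator  # assumed import path

evaluator = RagasEvaluator()

# Mode 1: the dataset already carries user_input, reference, and response.
result = evaluator.run(dataset=Path("eval_data.jsonl"))

# Mode 2: the dataset holds only user_input and reference; a student model
# served on an OpenAI-compatible endpoint generates each response.
student = ModelConfig(model_name="my-student-model")  # assumed field name
result = evaluator.run(
    dataset=[
        {"user_input": "What is the capital of France?", "reference": "Paris"},
    ],
    student_model=student,
)
print(result)
```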