🤗 Hugging Face • ⏬ Data • 📖 Tutorial
中文 | English
DevOps-Eval is a comprehensive evaluation suite specifically designed for foundation models in the DevOps field. We hope DevOps-Eval could help developers, especially in the DevOps field, track the progress and analyze the important strengths/shortcomings of their models.
📚 This repo contains questions and exercises related to DevOps, including the AIOps.
🔥️ There are currently 4850 multiple-choice questions spanning 8 diverse general categories, as shown below.
💥 There are a total of 2200 samples in the AIOps subcategory, covering scenarios such as log parsing, time series anomaly detection, time series classification, and root cause analysis.
- [2023.10.25] Add the AIOps samples, including log parsing, time series anomaly detection, time series classification and root cause analysis.
- [2023.10.18] Update the initial Leaderboard...
Below are zero-shot and five-shot accuracies from the models that we evaluate in the initial release. We note that five-shot performance is better than zero-shot for many instruction-tuned models.
ModelName | plan | code | build | test | release | deploy | operate | monitor | AVG |
---|---|---|---|---|---|---|---|---|---|
DevOpsPal-14B-Chat | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 81.34 | 79.17 | 80.34 |
DevOpsPal-14B-Base | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 85.82 | 82.41 | 80.26 |
Qwen-14B-Chat | 60.61 | 75.4 | 85.32 | 84.21 | 89.62 | 82.75 | 83.58 | 80.56 | 79.28 |
Qwen-14B-Base | 57.58 | 73.81 | 84.4 | 85.53 | 86.32 | 81.18 | 82.09 | 80.09 | 77.92 |
Baichuan2-13B-Base | 60.61 | 69.42 | 79.82 | 79.82 | 82.55 | 81.18 | 85.07 | 83.8 | 75.10 |
Baichuan2-13B-Chat | 60.61 | 68.43 | 77.98 | 80.7 | 81.6 | 83.53 | 82.09 | 84.72 | 74.60 |
DevOpsPal-7B-Chat | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 79.85 | 77.78 | 74.00 |
DevOpsPal-7B-Base | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 79.85 | 78.7 | 73.55 |
Qwen-7B-Base | 53.03 | 68.13 | 78.9 | 75.44 | 80.19 | 80 | 83.58 | 80.09 | 73.13 |
Qwen-7B-Chat | 57.58 | 66.01 | 80.28 | 79.82 | 76.89 | 77.65 | 80.6 | 79.17 | 71.96 |
Baichuan2-7B-Chat | 54.55 | 63.66 | 77.98 | 76.32 | 71.7 | 73.33 | 75.37 | 79.63 | 68.17 |
Internlm-7B-Chat | 60.61 | 62.15 | 77.06 | 76.32 | 66.98 | 74.51 | 74.63 | 78.24 | 68.08 |
Baichuan2-7B-Base | 56.06 | 62.45 | 75.69 | 70.61 | 74.06 | 69.8 | 76.12 | 75.93 | 67.51 |
Internlm-7B-Base | 54.55 | 58.29 | 79.36 | 78.95 | 77.83 | 70.59 | 78.36 | 75.93 | 66.91 |
ModelName | plan | code | build | test | release | deploy | operate | monitor | AVG |
---|---|---|---|---|---|---|---|---|---|
DevOpsPal-14B-Chat | 63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 89.55 | 81.48 | 81.77 |
DevOpsPal-14B-Base | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 85.07 | 80.09 | 81.70 |
Qwen-14B-Chat | 65.15 | 76 | 82.57 | 85.53 | 84.91 | 84.31 | 85.82 | 81.48 | 79.55 |
Qwen-14B-Base | 66.67 | 76.15 | 84.4 | 85.53 | 86.32 | 80.39 | 86.57 | 80.56 | 79.51 |
Baichuan2-13B-Base | 63.64 | 71.39 | 80.73 | 82.46 | 81.13 | 84.31 | 91.79 | 85.19 | 77.09 |
Qwen-7B-Base | 75.76 | 72.52 | 78.9 | 81.14 | 83.96 | 81.18 | 85.07 | 81.94 | 77.02 |
Baichuan2-13B-Chat | 62.12 | 69.95 | 76.61 | 84.21 | 83.49 | 79.61 | 88.06 | 80.56 | 75.32 |
DevOpsPal-7B-Chat | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 82.84 | 76.85 | 75.25 |
DevOpsPal-7B-Base | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 80.6 | 79.17 | 75.17 |
Qwen-7B-Chat | 65.15 | 66.54 | 82.57 | 81.58 | 81.6 | 81.18 | 80.6 | 81.02 | 73.62 |
Baichuan2-7B-Base | 60.61 | 67.22 | 76.61 | 75 | 77.83 | 78.43 | 80.6 | 79.63 | 72.11 |
Internlm-7B-Chat | 60.61 | 63.06 | 79.82 | 80.26 | 67.92 | 75.69 | 73.88 | 77.31 | 71.09 |
Baichuan2-7B-Chat | 60.61 | 64.95 | 81.19 | 75.88 | 71.23 | 75.69 | 78.36 | 79.17 | 70.49 |
Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 79.85 | 75.46 | 69.17 |
- Method 1: Download the zip file (you can also simply open the following link with the browser):
then unzip it and you may load the data with pandas:
wget https://huggingface.co/datasets/codefuse-admin/devopseval-exam/resolve/main/devopseval-exam.zip
import os import pandas as pd File_Dir="devopseval-exam" test_df=pd.read_csv(os.path.join(File_Dir,"test","UnitTesting.csv"))
- Method 2: Directly load the dataset using Hugging Face datasets:
from datasets import load_dataset dataset=load_dataset(r"DevOps-Eval/devopseval-exam",name="UnitTesting") print(dataset['val'][0]) # {"id": 1, "question": "单元测试应该覆盖以下哪些方面?", "A": "正常路径", "B": "异常路径", "C": "边界值条件","D": 所有以上,"answer": "D", "explanation": ""} ```
To facilitate usage, we have organized the category name handlers and English/Chinese names corresponding to 53 subcategories. Please refer to category_mapping.json for details. The format is:
{
"UnitTesting.csv": [
"unit testing",
"单元测试",
{"dev": 5, "test": 32}
"TEST"
],
...
"file_name":[
"English Name",
"Chinese Name",
"Sample Number",
"Supercatagory Label(PLAN,CODE,BUILD,TEST,RELEASE,DEPOLY,OPERATE,MONITOR choose 1 out of 8)"
]
}
Each subcategory consists of two splits: dev and test. The dev set per subcategory consists of five exemplars with explanations for few-shot evaluation. And the test set is for model evaluation. Labels on the test split are also released.
Below is a dev example from 'version control':
id: 4
question: 如何找到Git特定提交中已更改的文件列表?
A: 使用命令 `git diff --name-only SHA`
B: 使用命令 `git log --name-only SHA`
C: 使用命令 `git commit --name-only SHA`
D: 使用命令 `git clone --name-only SHA`
answer: A
explanation:
分析原因:
git diff --name-only SHA命令会显示与SHA参数对应的提交中已修改的文件列表。参数--name-only让命令只输出文件名,而忽略其他信息。其它选项中的命令并不能实现此功能。
If you need to test your own huggingface-formatted model, the overall steps are as follows:
- Write the loader function for the model.
- Write the context_builder function for the model.
- Register the model in the configuration file.
- Run the testing script. If the model does not require any special processing after loading, and the input does not need to be converted to a specific format (e.g. chatml format or other human-bot formats), you can directly proceed to step 4 to initiate the testing.
If the model requires additional processing after loading (e.g. adjusting the tokenizer), you need to inherit the ModelAndTokenizerLoader
class in src.context_builder.context_builder_family.py
and override the corresponding load_model
and load_tokenizer
functions. You can refer to the following example:
class QwenModelAndTokenizerLoader(ModelAndTokenizerLoader):
def __init__(self):
super().__init__()
pass
@override
def load_model(self, model_path: str):
# Implementation of the method
pass
@override
def load_tokenizer(self, model_path: str):
# Implementation of the method
pass
If the input needs to be converted to a specific format (e.g. chatml format or other human-bot formats), you need to inherit the ContextBuilder class in src.context_builder.context_builder_family
and override the make_context function. This function is used to convert the input to the corresponding required format. An example is shown below:
class QwenChatContextBuilder(ContextBuilder):
def __init__(self):
super().__init__()
@override
def make_context(self, model, tokenizer, query: str, system: str = "hello!"):
# Implementation of the method
pass
Go to the model_conf.json
file in the conf directory and register the corresponding model name and the loader and context_builder that will be used for this model. Simply write the class names defined in the first and second steps for the loader and context_builder. Here is an example:
{
"Qwen-Chat": {
"loader": "QwenModelAndTokenizerLoader",
"context_builder": "QwenChatContextBuilder"
}
}
Run the following code to initiate the test:
python src/run_eval.py \
--model_path path_to_model \
--model_name model_name_in_conf \
--model_conf_path path_to_model_conf \
--eval_dataset_list all \
--eval_dataset_fp_conf_path path_to_dataset_conf \
--eval_dataset_type test \
--data_path path_to_downloaded_devops_eval_data \
--k_shot 0
👀 👀 The specific evaluation process is as follows 📖 Evaluate Tutorial
- add AIOps samples.
- add AIOps scenario time series forecasting.
- increase in sample size.
- add samples with the difficulty level set to hard.
- add the English version of the samples.
Coming Soon...
Please cite our paper if you use our dataset.
Coming Soon...