@@ -11,14 +11,15 @@ DevOps-Eval is a comprehensive evaluation suite specifically designed for founda

📚 This repo contains questions and exercises related to DevOps, including the AIOps.

- 💥️ There are currently **4850** multiple-choice questions spanning 8 diverse general categories, as shown [below](images/data_info.png).
+ 💥️ There are currently **5977** multiple-choice questions spanning 8 diverse general categories, as shown [below](images/data_info.png).

- 🔥 There are a total of **2200** samples in the AIOps subcategory, covering scenarios such as **log parsing**, **time series anomaly detection**, **time series classification**, **and root cause analysis**.
+ 🔥 There are a total of **2840** samples in the AIOps subcategory, covering scenarios such as **log parsing**, **time series anomaly detection**, **time series classification**, **time series forecasting**, and **root cause analysis**.

<p align="center"> <a href="resources/devops_diagram_zh.jpg"> <img src="images/data_info.png" style="width: 100%;" id="data_info"></a></p>

## 🔔 News
+ * **[2023.11.27]** Add 487 operation sense samples and 640 time series forecasting samples; Update the Leaderboard;
* **[2023.10.30]** Add the AIOps Leaderboard.
* **[2023.10.25]** Add the AIOps samples, including log parsing, time series anomaly detection, time series classification and root cause analysis.
* **[2023.10.18]** Update the initial Leaderboard...
@@ -44,77 +45,77 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in

| **ModelName** | plan | code | build | test | release | deploy | operate | monitor | **AVG** |
|:------------------------:|:-----:|:-----:|:-----:|:------:|:--------:|:------:|:-------:|:--------:|:-----------:|
- | **DevOpsPal-14B-Chat** | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 81.34 | 79.17 | **80.34** |
- | **DevOpsPal-14B-Base** | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 85.82 | 82.41 | **80.26** |
- | Qwen-14B-Chat | 60.61 | 75.4 | 85.32 | 84.21 | 89.62 | 82.75 | 83.58 | 80.56 | 79.28 |
- | Qwen-14B-Base | 57.58 | 73.81 | 84.4 | 85.53 | 86.32 | 81.18 | 82.09 | 80.09 | 77.92 |
- | Baichuan2-13B-Base | 60.61 | 69.42 | 79.82 | 79.82 | 82.55 | 81.18 | 85.07 | 83.8 | 75.10 |
- | Baichuan2-13B-Chat | 60.61 | 68.43 | 77.98 | 80.7 | 81.6 | 83.53 | 82.09 | 84.72 | 74.60 |
- | **DevOpsPal-7B-Chat** | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 79.85 | 77.78 | **74.00** |
- | **DevOpsPal-7B-Base** | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 79.85 | 78.7 | **73.55** |
- | Qwen-7B-Base | 53.03 | 68.13 | 78.9 | 75.44 | 80.19 | 80 | 83.58 | 80.09 | 73.13 |
- | Qwen-7B-Chat | 57.58 | 66.01 | 80.28 | 79.82 | 76.89 | 77.65 | 80.6 | 79.17 | 71.96 |
- | Baichuan2-7B-Chat | 54.55 | 63.66 | 77.98 | 76.32 | 71.7 | 73.33 | 75.37 | 79.63 | 68.17 |
- | Internlm-7B-Chat | 60.61 | 62.15 | 77.06 | 76.32 | 66.98 | 74.51 | 74.63 | 78.24 | 68.08 |
- | Baichuan2-7B-Base | 56.06 | 62.45 | 75.69 | 70.61 | 74.06 | 69.8 | 76.12 | 75.93 | 67.51 |
- | Internlm-7B-Base | 54.55 | 58.29 | 79.36 | 78.95 | 77.83 | 70.59 | 78.36 | 75.93 | 66.91 |
+ | DevOpsPal-14B-Chat | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 69.89 | 79.17 | 78.23 |
+ | DevOpsPal-14B-Base | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 71.18 | 82.41 | 78.23 |
+ | Qwen-14B-Chat | 60.61 | 75.4 | 85.32 | 84.21 | 89.62 | 82.75 | 69.57 | 80.56 | 77.18 |
+ | Qwen-14B-Base | 57.58 | 73.81 | 84.4 | 85.53 | 86.32 | 81.18 | 70.05 | 80.09 | 76.19 |
+ | Baichuan2-13B-Base | 60.61 | 69.42 | 79.82 | 79.82 | 82.55 | 81.18 | 70.37 | 83.8 | 73.73 |
+ | Baichuan2-13B-Chat | 60.61 | 68.43 | 77.98 | 80.7 | 81.6 | 83.53 | 67.63 | 84.72 | 72.9 |
+ | DevOpsPal-7B-Chat | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 64.73 | 77.78 | 71.92 |
+ | DevOpsPal-7B-Base | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 65.54 | 78.7 | 71.69 |
+ | Qwen-7B-Base | 53.03 | 68.13 | 78.9 | 75.44 | 80.19 | 80 | 65.06 | 80.09 | 71.09 |
+ | Qwen-7B-Chat | 57.58 | 66.01 | 80.28 | 79.82 | 76.89 | 77.65 | 62.64 | 79.17 | 69.75 |
+ | Baichuan2-7B-Chat | 54.55 | 63.66 | 77.98 | 76.32 | 71.7 | 73.33 | 59.42 | 79.63 | 66.97 |
+ | Internlm-7B-Chat | 60.61 | 62.15 | 77.06 | 76.32 | 66.98 | 74.51 | 60.39 | 78.24 | 66.27 |
+ | Baichuan2-7B-Base | 56.06 | 62.45 | 75.69 | 70.61 | 74.06 | 69.8 | 61.67 | 75.93 | 66.21 |
+ | Internlm-7B-Base | 54.55 | 58.29 | 79.36 | 78.95 | 77.83 | 70.59 | 65.86 | 75.93 | 65.99 |


#### Five Shot

| **ModelName** | plan | code | build | test | release | deploy | operate | monitor | **AVG** |
|:------------------------:|:-----:|:-----:|:-----:|:------:|:--------:|:------:|:-------:|:--------:|:---------:|
- | **DevOpsPal-14B-Chat** | 63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 89.55 | 81.48 | **81.77** |
- | **DevOpsPal-14B-Base** | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 85.07 | 80.09 | **81.70** |
- | Qwen-14B-Chat | 65.15 | 76 | 82.57 | 85.53 | 84.91 | 84.31 | 85.82 | 81.48 | 79.55 |
- | Qwen-14B-Base | 66.67 | 76.15 | 84.4 | 85.53 | 86.32 | 80.39 | 86.57 | 80.56 | 79.51 |
- | Baichuan2-13B-Base | 63.64 | 71.39 | 80.73 | 82.46 | 81.13 | 84.31 | 91.79 | 85.19 | 77.09 |
- | Qwen-7B-Base | 75.76 | 72.52 | 78.9 | 81.14 | 83.96 | 81.18 | 85.07 | 81.94 | 77.02 |
- | Baichuan2-13B-Chat | 62.12 | 69.95 | 76.61 | 84.21 | 83.49 | 79.61 | 88.06 | 80.56 | 75.32 |
- | **DevOpsPal-7B-Chat** | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 82.84 | 76.85 | **75.25** |
- | **DevOpsPal-7B-Base** | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 80.6 | 79.17 | **75.17** |
- | Qwen-7B-Chat | 65.15 | 66.54 | 82.57 | 81.58 | 81.6 | 81.18 | 80.6 | 81.02 | 73.62 |
- | Baichuan2-7B-Base | 60.61 | 67.22 | 76.61 | 75 | 77.83 | 78.43 | 80.6 | 79.63 | 72.11 |
- | Internlm-7B-Chat | 60.61 | 63.06 | 79.82 | 80.26 | 67.92 | 75.69 | 73.88 | 77.31 | 71.09 |
- | Baichuan2-7B-Chat | 60.61 | 64.95 | 81.19 | 75.88 | 71.23 | 75.69 | 78.36 | 79.17 | 70.49 |
- | Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 79.85 | 75.46 | 69.17 |
+ | DevOpsPal-14B-Chat | 63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 72.95 | 81.48 | 79.69 |
+ | DevOpsPal-14B-Base | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 71.98 | 80.09 | 79.63 |
+ | Qwen-14B-Chat | 65.15 | 76 | 82.57 | 85.53 | 84.91 | 84.31 | 70.85 | 81.48 | 77.81 |
+ | Qwen-14B-Base | 66.67 | 76.15 | 84.4 | 85.53 | 86.32 | 80.39 | 72.46 | 80.56 | 77.56 |
+ | Baichuan2-13B-Base | 63.64 | 71.39 | 80.73 | 82.46 | 81.13 | 84.31 | 73.75 | 85.19 | 75.8 |
+ | Qwen-7B-Base | 75.76 | 72.52 | 78.9 | 81.14 | 83.96 | 81.18 | 70.37 | 81.94 | 75.36 |
+ | Baichuan2-13B-Chat | 62.12 | 69.95 | 76.61 | 84.21 | 83.49 | 79.61 | 71.98 | 80.56 | 74.12 |
+ | DevOpsPal-7B-Chat | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 68.6 | 76.85 | 73.61 |
+ | DevOpsPal-7B-Base | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 67.15 | 79.17 | 73.35 |
+ | Qwen-7B-Chat | 65.15 | 66.54 | 82.57 | 81.58 | 81.6 | 81.18 | 65.38 | 81.02 | 71.69 |
+ | Baichuan2-7B-Base | 60.61 | 67.22 | 76.61 | 75 | 77.83 | 78.43 | 67.31 | 79.63 | 70.8 |
+ | Internlm-7B-Chat | 60.61 | 63.06 | 79.82 | 80.26 | 67.92 | 75.69 | 60.06 | 77.31 | 69.21 |
+ | Baichuan2-7B-Chat | 60.61 | 64.95 | 81.19 | 75.88 | 71.23 | 75.69 | 64.9 | 79.17 | 69.05 |
+ | Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 63.45 | 75.46 | 67.17 |

### 🔥 AIOps
#### Zero Shot
- | **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | **AVG** |
- |:-------------------:|:------------:|:------------------:|:---------------------------:|:-------------------------:|:-------:|
- | Qwen-14B-Base | 66.29 | 58.8 | 25.33 | 43.5 | 49.27 |
- | DevOpsPal-14B-Base | 63.14 | 53.6 | 23.33 | 43.5 | 46.55 |
- | DevOpsPal-14B-Chat | 60 | 56 | 24 | 43 | 46.18 |
- | Qwen-14B-Chat | 64.57 | 51.6 | 22.67 | 36 | 45 |
- | Qwen-7B-Base | 50 | 39.2 | 22.67 | 54 | 40.82 |
- | Qwen-7B-Chat | 57.43 | 38.8 | 22.33 | 39.5 | 40.36 |
- | DevOpsPal-7B-Chat | 56.57 | 30.4 | 25.33 | 45 | 40 |
- | Baichuan2-13B-Chat | 64 | 18 | 21.33 | 37.5 | 37.09 |
- | Baichuan2-7B-Chat | 60.86 | 10 | 28 | 34.5 | 35.55 |
- | Baichuan2-7B-Base | 53.43 | 12.8 | 27.67 | 36.5 | 34.09 |
- | Internlm-7B-Base | 48.57 | 18.8 | 23.33 | 37.5 | 32.91 |
- | Baichuan2-13B-Base | 54 | 12.4 | 23 | 34.5 | 32.55 |
- | DevOpsPal-7B-Base | 46.57 | 20.8 | 25 | 34 | 32.55 |
- | Internlm-7B-Chat | 58.86 | 8.8 | 22.33 | 28.5 | 32 |
+ | **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | TimeSeriesForecasting | **AVG** |
+ |:-------------------:|:------------:|:------------------:|:---------------------------:|:-------------------------:|:---------------------:|:-------:|
+ | Qwen-14B-Base | 66.29 | 58.8 | 25.33 | 43.5 | 62.5 | 52.25 |
+ | DevOpsPal-14B-Base | 63.14 | 53.6 | 23.33 | 43.5 | 64.06 | 50.49 |
+ | Qwen-14B-Chat | 64.57 | 51.6 | 22.67 | 36 | 62.5 | 48.94 |
+ | DevOpsPal-14B-Chat | 60 | 56 | 24 | 43 | 57.81 | 48.8 |
+ | Qwen-7B-Base | 50 | 39.2 | 22.67 | 54 | 43.75 | 41.48 |
+ | DevOpsPal-7B-Chat | 56.57 | 30.4 | 25.33 | 45 | 44.06 | 40.92 |
+ | Baichuan2-13B-Chat | 64 | 18 | 21.33 | 37.5 | 46.88 | 39.3 |
+ | Qwen-7B-Chat | 57.43 | 38.8 | 22.33 | 39.5 | 25.31 | 36.97 |
+ | Internlm-7B-Chat | 58.86 | 8.8 | 22.33 | 28.5 | 51.25 | 36.34 |
+ | Baichuan2-7B-Chat | 60.86 | 10 | 28 | 34.5 | 39.06 | 36.34 |
+ | Baichuan2-7B-Base | 53.43 | 12.8 | 27.67 | 36.5 | 40.31 | 35.49 |
+ | Baichuan2-13B-Base | 54 | 12.4 | 23 | 34.5 | 42.81 | 34.86 |
+ | DevOpsPal-7B-Base | 46.57 | 20.8 | 25 | 34 | 38.75 | 33.94 |
+ | Internlm-7B-Base | 48.57 | 18.8 | 23.33 | 37.5 | 33.75 | 33.1 |
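
The README does not say how the AVG column is computed. The figures in this diff are consistent with a sample-weighted average rather than a plain mean of the task columns: combining the previous AVG (over the 2200 earlier AIOps samples) with the new TimeSeriesForecasting accuracy (over the 640 added samples) reproduces the updated AVG to within rounding. The sketch below checks this for three zero-shot rows. The weighting scheme is an inference from the numbers, and `combine_weighted` is a hypothetical helper, not part of the repository.

```python
def combine_weighted(old_avg: float, old_n: int, new_acc: float, new_n: int) -> float:
    """Sample-weighted combination of an existing average with one new task's accuracy."""
    return (old_avg * old_n + new_acc * new_n) / (old_n + new_n)


# Sample counts taken from the README text in this diff.
OLD_SAMPLES = 2200  # AIOps samples before this change
NEW_SAMPLES = 640   # time series forecasting samples added

# (old AVG, TimeSeriesForecasting accuracy, reported new AVG), zero-shot rows.
rows = {
    "Qwen-14B-Base": (49.27, 62.5, 52.25),
    "DevOpsPal-14B-Base": (46.55, 64.06, 50.49),
    "Qwen-14B-Chat": (45.0, 62.5, 48.94),
}

for name, (old_avg, forecast_acc, reported) in rows.items():
    estimate = combine_weighted(old_avg, OLD_SAMPLES, forecast_acc, NEW_SAMPLES)
    print(f"{name}: estimated {estimate:.2f}, reported {reported}")
```

If the estimate and the reported value diverged by more than rounding error for some row, that would suggest either a different weighting or a transcription error in that row.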

#### One Shot
- | **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | **AVG** |
- |:-------------------:|:------------:|:------------------:|:---------------------------:|:-------------------------:|:-------:|
- | DevOpsPal-14B-Chat | 66.29 | 80.8 | 23.33 | 44.5 | 53.91 |
- | Qwen-14B-Base | 64.29 | 74.4 | 28 | 48.5 | 53.82 |
- | DevOpsPal-14B-Base | 60 | 74 | 25.33 | 43.5 | 50.73 |
- | Qwen-14B-Chat | 49.71 | 65.6 | 28.67 | 48 | 47.27 |
- | Qwen-7B-Base | 56 | 60.8 | 27.67 | 44 | 47.18 |
- | DevOpsPal-7B-Base | 52.86 | 44.4 | 28 | 44.5 | 42.64 |
- | Qwen-7B-Chat | 54.57 | 52 | 29.67 | 26.5 | 42.09 |
- | Baichuan2-13B-Base | 56 | 43.2 | 24.33 | 41 | 41.73 |
- | Baichuan2-13B-Chat | 57.43 | 44.4 | 25 | 25.5 | 39.82 |
- | Baichuan2-7B-Base | 48.29 | 40.4 | 27 | 42 | 39.55 |
- | Baichuan2-7B-Chat | 58.57 | 31.6 | 27 | 31.5 | 38.91 |
- | DevOpsPal-7B-Chat | 56.57 | 27.2 | 25.33 | 41.5 | 38.64 |
- | Internlm-7B-Base | 48 | 33.2 | 29 | 35 | 37.09 |
- | Internlm-7B-Chat | 62.57 | 12.8 | 22.33 | 21 | 32.73 |
+ | **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | TimeSeriesForecasting | **AVG** |
+ |:-------------------:|:------------:|:------------------:|:---------------------------:|:-------------------------:|:---------------------:|:-------:|
+ | DevOpsPal-14B-Chat | 66.29 | 80.8 | 23.33 | 44.5 | 56.25 | 54.44 |
+ | DevOpsPal-14B-Base | 60 | 74 | 25.33 | 43.5 | 52.5 | 51.13 |
+ | Qwen-14B-Base | 64.29 | 74.4 | 28 | 48.5 | 40.31 | 50.77 |
+ | Qwen-7B-Base | 56 | 60.8 | 27.67 | 44 | 57.19 | 49.44 |
+ | Qwen-14B-Chat | 49.71 | 65.6 | 28.67 | 48 | 42.19 | 46.13 |
+ | Baichuan2-13B-Base | 56 | 43.2 | 24.33 | 41 | 46.88 | 42.89 |
+ | Baichuan2-7B-Chat | 58.57 | 31.6 | 27 | 31.5 | 51.88 | 41.83 |
+ | DevOpsPal-7B-Base | 52.86 | 44.4 | 28 | 44.5 | 36.25 | 41.2 |
+ | Baichuan2-7B-Base | 48.29 | 40.4 | 27 | 42 | 40.94 | 39.86 |
+ | Qwen-7B-Chat | 54.57 | 52 | 29.67 | 26.5 | 27.19 | 38.73 |
+ | Baichuan2-13B-Chat | 57.43 | 44.4 | 25 | 25.5 | 30.63 | 37.75 |
+ | DevOpsPal-7B-Chat | 56.57 | 27.2 | 25.33 | 41.5 | 33.44 | 37.46 |
+ | Internlm-7B-Chat | 62.57 | 12.8 | 22.33 | 21 | 50.31 | 36.69 |
+ | Internlm-7B-Base | 48 | 33.2 | 29 | 35 | 31.56 | 35.85 |
## ⏬ Data
@@ -140,7 +141,7 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in
# {"id": 1, "question": "单元测试应该覆盖以下哪些方面?", "A": "正常路径", "B": "异常路径", "C": "边界值条件", "D": "所有以上", "answer": "D", "explanation": ""}
```
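
A minimal sketch of working with one record in the format shown above, assuming each sample is a JSON object with the keys `id`, `question`, `A` to `D`, `answer`, and `explanation`. The example record is the one from the README (the question asks which aspects unit tests should cover; option D means "all of the above"). `format_question` and `is_correct` are hypothetical helpers for illustration and are not taken from the repository's evaluation code.

```python
import json

CHOICES = ("A", "B", "C", "D")


def format_question(sample: dict) -> str:
    """Render the question followed by its four lettered options, one per line."""
    lines = [sample["question"]]
    lines += [f"{key}. {sample[key]}" for key in CHOICES]
    return "\n".join(lines)


def is_correct(sample: dict, prediction: str) -> bool:
    """Compare a predicted letter with the gold answer, ignoring case and whitespace."""
    return prediction.strip().upper() == sample["answer"]


raw = (
    '{"id": 1, "question": "单元测试应该覆盖以下哪些方面?", '
    '"A": "正常路径", "B": "异常路径", "C": "边界值条件", '
    '"D": "所有以上", "answer": "D", "explanation": ""}'
)
sample = json.loads(raw)

print(format_question(sample))
print(is_correct(sample, "d"))
```

Normalizing the prediction before comparison is a design choice for this sketch; an actual evaluation harness may extract the answer letter from model output differently.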

#### 👀 Notes
- To facilitate usage, we have organized the category name handlers and English/Chinese names corresponding to 53 subcategories. Please refer to [category_mapping.json](resources/categroy_mapping.json) for details. The format is:
+ To facilitate usage, we have organized the category name handlers and English/Chinese names corresponding to 55 subcategories. Please refer to [category_mapping.json](resources/categroy_mapping.json) for details. The format is:

```
{
@@ -285,7 +286,7 @@ python src/run_eval.py \

## 🧭 TODO
- [x] add AIOps samples.
- - [ ] add AIOps scenario **time series forecasting**.
+ - [x] add AIOps scenario **time series forecasting**.
- [ ] increase in sample size.
- [ ] add samples with the difficulty level set to hard.
- [ ] add the English version of the samples.