Commit b95cf6d

Update README.md

jimmy.xj committed
1 parent 04b5d2d commit b95cf6d

2 files changed: +130 -128 lines changed

‎README.md

Lines changed: 65 additions & 64 deletions
@@ -11,14 +11,15 @@ DevOps-Eval is a comprehensive evaluation suite specifically designed for founda

 📚 This repo contains questions and exercises related to DevOps, including the AIOps.

-💥️ There are currently **4850** multiple-choice questions spanning 8 diverse general categories, as shown [below](images/data_info.png).
+💥️ There are currently **5977** multiple-choice questions spanning 8 diverse general categories, as shown [below](images/data_info.png).

-🔥 There are a total of **2200** samples in the AIOps subcategory, covering scenarios such as **log parsing**, **time series anomaly detection**, **time series classification**, **and root cause analysis**.
+🔥 There are a total of **2840** samples in the AIOps subcategory, covering scenarios such as **log parsing**, **time series anomaly detection**, **time series classification**, **time series forecasting**, and **root cause analysis**.

 <p align="center"> <a href="resources/devops_diagram_zh.jpg"> <img src="images/data_info.png" style="width: 100%;" id="data_info"></a></p>


 ## 🔔 News
+* **[2023.11.27]** Add 487 operation sense samples and 640 time series forecasting samples; Update the Leaderboard;
 * **[2023.10.30]** Add the AIOps Leaderboard.
 * **[2023.10.25]** Add the AIOps samples, including log parsing, time series anomaly detection, time series classification and root cause analysis.
 * **[2023.10.18]** Update the initial Leaderboard...
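The updated totals in this hunk agree with the sample counts in the 2023.11.27 news entry. A minimal check, assuming the 487 operation sense samples and the 640 time series forecasting samples are the only additions, and that only the forecasting samples count toward the AIOps subtotal (the function name is illustrative, not part of the repository):

```python
def totals_after_update(old_total: int, old_aiops: int,
                        added_operation_sense: int, added_forecasting: int):
    """Return (new_total, new_aiops) under the assumptions stated above."""
    # Both additions enlarge the overall question count.
    new_total = old_total + added_operation_sense + added_forecasting
    # Assumption: only the forecasting samples belong to the AIOps subcategory.
    new_aiops = old_aiops + added_forecasting
    return new_total, new_aiops


# Counts taken from the removed and added README lines.
print(totals_after_update(4850, 2200, 487, 640))  # (5977, 2840)
```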
@@ -44,77 +45,77 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in

 | **ModelName** | plan | code | build | test | release | deploy | operate | monitor | **AVG** |
 |:------------------------:|:-----:|:-----:|:-----:|:------:|:--------:|:------:|:-------:|:--------:|:-----------:|
-| **DevOpsPal-14B-Chat** | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 81.34 | 79.17 | **80.34** |
-| **DevOpsPal-14B-Base** | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 85.82 | 82.41 | **80.26** |
-| Qwen-14B-Chat | 60.61 | 75.4 | 85.32 | 84.21 | 89.62 | 82.75 | 83.58 | 80.56 | 79.28 |
-| Qwen-14B-Base | 57.58 | 73.81 | 84.4 | 85.53 | 86.32 | 81.18 | 82.09 | 80.09 | 77.92 |
-| Baichuan2-13B-Base | 60.61 | 69.42 | 79.82 | 79.82 | 82.55 | 81.18 | 85.07 | 83.8 | 75.10 |
-| Baichuan2-13B-Chat | 60.61 | 68.43 | 77.98 | 80.7 | 81.6 | 83.53 | 82.09 | 84.72 | 74.60 |
-| **DevOpsPal-7B-Chat** | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 79.85 | 77.78 | **74.00** |
-| **DevOpsPal-7B-Base** | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 79.85 | 78.7 | **73.55** |
-| Qwen-7B-Base | 53.03 | 68.13 | 78.9 | 75.44 | 80.19 | 80 | 83.58 | 80.09 | 73.13 |
-| Qwen-7B-Chat | 57.58 | 66.01 | 80.28 | 79.82 | 76.89 | 77.65 | 80.6 | 79.17 | 71.96 |
-| Baichuan2-7B-Chat | 54.55 | 63.66 | 77.98 | 76.32 | 71.7 | 73.33 | 75.37 | 79.63 | 68.17 |
-| Internlm-7B-Chat | 60.61 | 62.15 | 77.06 | 76.32 | 66.98 | 74.51 | 74.63 | 78.24 | 68.08 |
-| Baichuan2-7B-Base | 56.06 | 62.45 | 75.69 | 70.61 | 74.06 | 69.8 | 76.12 | 75.93 | 67.51 |
-| Internlm-7B-Base | 54.55 | 58.29 | 79.36 | 78.95 | 77.83 | 70.59 | 78.36 | 75.93 | 66.91 |
+| DevOpsPal-14B-Chat | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 69.89 | 79.17 | 78.23 |
+| DevOpsPal-14B-Base | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 71.18 | 82.41 | 78.23 |
+| Qwen-14B-Chat | 60.61 | 75.4 | 85.32 | 84.21 | 89.62 | 82.75 | 69.57 | 80.56 | 77.18 |
+| Qwen-14B-Base | 57.58 | 73.81 | 84.4 | 85.53 | 86.32 | 81.18 | 70.05 | 80.09 | 76.19 |
+| Baichuan2-13B-Base | 60.61 | 69.42 | 79.82 | 79.82 | 82.55 | 81.18 | 70.37 | 83.8 | 73.73 |
+| Baichuan2-13B-Chat | 60.61 | 68.43 | 77.98 | 80.7 | 81.6 | 83.53 | 67.63 | 84.72 | 72.9 |
+| DevOpsPal-7B-Chat | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 64.73 | 77.78 | 71.92 |
+| DevOpsPal-7B-Base | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 65.54 | 78.7 | 71.69 |
+| Qwen-7B-Base | 53.03 | 68.13 | 78.9 | 75.44 | 80.19 | 80 | 65.06 | 80.09 | 71.09 |
+| Qwen-7B-Chat | 57.58 | 66.01 | 80.28 | 79.82 | 76.89 | 77.65 | 62.64 | 79.17 | 69.75 |
+| Baichuan2-7B-Chat | 54.55 | 63.66 | 77.98 | 76.32 | 71.7 | 73.33 | 59.42 | 79.63 | 66.97 |
+| Internlm-7B-Chat | 60.61 | 62.15 | 77.06 | 76.32 | 66.98 | 74.51 | 60.39 | 78.24 | 66.27 |
+| Baichuan2-7B-Base | 56.06 | 62.45 | 75.69 | 70.61 | 74.06 | 69.8 | 61.67 | 75.93 | 66.21 |
+| Internlm-7B-Base | 54.55 | 58.29 | 79.36 | 78.95 | 77.83 | 70.59 | 65.86 | 75.93 | 65.99 |


 #### Five Shot

 | **ModelName** | plan | code | build | test | release | deploy | operate | monitor | **AVG** |
 |:------------------------:|:-----:|:-----:|:-----:|:------:|:--------:|:------:|:-------:|:--------:|:---------:|
-| **DevOpsPal-14B-Chat** |63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 89.55 | 81.48 | **81.77** |
-| **DevOpsPal-14B-Base** | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 85.07 | 80.09 | **81.70** |
-| Qwen-14B-Chat | 65.15 | 76 | 82.57 | 85.53 | 84.91 | 84.31 | 85.82 | 81.48 | 79.55 |
-| Qwen-14B-Base | 66.67 | 76.15 | 84.4 | 85.53 | 86.32 | 80.39 | 86.57 | 80.56 | 79.51 |
-| Baichuan2-13B-Base | 63.64 | 71.39 | 80.73 | 82.46 | 81.13 | 84.31 | 91.79 | 85.19 | 77.09 |
-| Qwen-7B-Base | 75.76 | 72.52 | 78.9 | 81.14 | 83.96 | 81.18 | 85.07 | 81.94 | 77.02 |
-| Baichuan2-13B-Chat | 62.12 | 69.95 | 76.61 | 84.21 | 83.49 | 79.61 | 88.06 | 80.56 | 75.32 |
-| **DevOpsPal-7B-Chat** | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 82.84 | 76.85 | **75.25** |
-| **DevOpsPal-7B-Base** | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 80.6 | 79.17 | **75.17** |
-| Qwen-7B-Chat | 65.15 | 66.54 | 82.57 | 81.58 | 81.6 | 81.18 | 80.6 | 81.02 | 73.62 |
-| Baichuan2-7B-Base | 60.61 | 67.22 | 76.61 | 75 | 77.83 | 78.43 | 80.6 | 79.63 | 72.11 |
-| Internlm-7B-Chat | 60.61 | 63.06 | 79.82 | 80.26 | 67.92 | 75.69 | 73.88 | 77.31 | 71.09 |
-| Baichuan2-7B-Chat | 60.61 | 64.95 | 81.19 | 75.88 | 71.23 | 75.69 | 78.36 | 79.17 | 70.49 |
-| Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 79.85 | 75.46 | 69.17 |
+| DevOpsPal-14B-Chat | 63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 72.95 | 81.48 | 79.69 |
+| DevOpsPal-14B-Base | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 71.98 | 80.09 | 79.63 |
+| Qwen-14B-Chat | 65.15 | 76 | 82.57 | 85.53 | 84.91 | 84.31 | 70.85 | 81.48 | 77.81 |
+| Qwen-14B-Base | 66.67 | 76.15 | 84.4 | 85.53 | 86.32 | 80.39 | 72.46 | 80.56 | 77.56 |
+| Baichuan2-13B-Base | 63.64 | 71.39 | 80.73 | 82.46 | 81.13 | 84.31 | 73.75 | 85.19 | 75.8 |
+| Qwen-7B-Base | 75.76 | 72.52 | 78.9 | 81.14 | 83.96 | 81.18 | 70.37 | 81.94 | 75.36 |
+| Baichuan2-13B-Chat | 62.12 | 69.95 | 76.61 | 84.21 | 83.49 | 79.61 | 71.98 | 80.56 | 74.12 |
+| DevOpsPal-7B-Chat | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 68.6 | 76.85 | 73.61 |
+| DevOpsPal-7B-Base | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 67.15 | 79.17 | 73.35 |
+| Qwen-7B-Chat | 65.15 | 66.54 | 82.57 | 81.58 | 81.6 | 81.18 | 65.38 | 81.02 | 71.69 |
+| Baichuan2-7B-Base | 60.61 | 67.22 | 76.61 | 75 | 77.83 | 78.43 | 67.31 | 79.63 | 70.8 |
+| Internlm-7B-Chat | 60.61 | 63.06 | 79.82 | 80.26 | 67.92 | 75.69 | 60.06 | 77.31 | 69.21 |
+| Baichuan2-7B-Chat | 60.61 | 64.95 | 81.19 | 75.88 | 71.23 | 75.69 | 64.9 | 79.17 | 69.05 |
+| Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 63.45 | 75.46 | 67.17 |

 ### 🔥 AIOps
 #### Zero Shot
-| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | **AVG** |
-|:-------------------:|:------------:|:------------------:|:---------------------------:|:-------------------------:|:-------:|
-| Qwen-14B-Base | 66.29 | 58.8 | 25.33 | 43.5 | 49.27 |
-| DevOpsPal-14B—Base | 63.14 | 53.6 | 23.33 | 43.5 | 46.55 |
-| DevOpsPal-14BChat | 60 | 56 | 24 | 43 | 46.18 |
-| Qwen-14B-Chat | 64.57 | 51.6 | 22.67 | 36 | 45 |
-| Qwen-7B-Base | 50 | 39.2 | 22.67 | 54 | 40.82 |
-| Qwen-7B-Chat | 57.43 | 38.8 | 22.33 | 39.5 | 40.36 |
-| DevOpsPal-7B—Chat | 56.57 | 30.4 | 25.33 | 45 | 40 |
-| Baichuan2-13B-Chat | 64 | 18 | 21.33 | 37.5 | 37.09 |
-| Baichuan2-7B-Chat | 60.86 | 10 | 28 | 34.5 | 35.55 |
-| Baichuan2-7B-Base | 53.43 | 12.8 | 27.67 | 36.5 | 34.09 |
-| Internlm-7BBase | 48.57 | 18.8 | 23.33 | 37.5 | 32.91 |
-| Baichuan2-13B-Base | 54 | 12.4 | 23 | 34.5 | 32.55 |
-| DevOpsPal-7B—Base | 46.57 | 20.8 | 25 | 34 | 32.55 |
-| Internlm-7B—Chat | 58.86 | 8.8 | 22.33 | 28.5 | 32 |
+| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | TimeSeriesForecasting | **AVG** |
+|:-------------------:|:------------:|:------------------:|:---------------------------:|:-----------------------------------------:|:---------------------------:|:-------:|
+| Qwen-14B-Base | 66.29 | 58.8 | 25.33 | 43.5 | 62.5 | 52.25 |
+| DevOpsPal-14B—Base | 63.14 | 53.6 | 23.33 | 43.5 | 64.06 | 50.49 |
+| Qwen-14B-Chat | 64.57 | 51.6 | 22.67 | 36 | 62.5 | 48.94 |
+| DevOpsPal-14BChat | 60 | 56 | 24 | 43 | 57.81 | 48.8 |
+| Qwen-7B-Base | 50 | 39.2 | 22.67 | 54 | 43.75 | 41.48 |
+| DevOpsPal-7BChat | 56.57 | 30.4 | 25.33 | 45 | 44.06 | 40.92 |
+| Baichuan2-13B-Chat | 64 | 18 | 21.33 | 37.5 | 46.88 | 39.3 |
+| Qwen-7B-Chat | 57.43 | 38.8 | 22.33 | 39.5 | 25.31 | 36.97 |
+| Internlm-7BChat | 58.86 | 8.8 | 22.33 | 28.5 | 51.25 | 36.34 |
+| Baichuan2-7B-Chat | 60.86 | 10 | 28 | 34.5 | 39.06 | 36.34 |
+| Baichuan2-7B-Base | 53.43 | 12.8 | 27.67 | 36.5 | 40.31 | 35.49 |
+| Baichuan2-13B-Base | 54 | 12.4 | 23 | 34.5 | 42.81 | 34.86 |
+| DevOpsPal-7B—Base | 46.57 | 20.8 | 25 | 34 | 38.75 | 33.94 |
+| Internlm-7B—Base | 48.57 | 18.8 | 23.33 | 37.5 | 33.75 | 33.1 |
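
The new AVG values look like sample-weighted accuracies rather than a plain mean of the five task columns: for Qwen-14B-Base the plain mean would be about 51.28, not 52.25. A minimal sketch, assuming the previous AVG covered the 2200 earlier AIOps samples and the new column covers the 640 forecasting samples (both counts come from the README text; the function name is hypothetical):

```python
def updated_average(old_avg: float, old_n: int, new_acc: float, new_n: int) -> float:
    """Fold one new task into a sample-weighted average accuracy."""
    return (old_avg * old_n + new_acc * new_n) / (old_n + new_n)


# Qwen-14B-Base, zero shot: previous AVG 49.27, TimeSeriesForecasting 62.5.
print(round(updated_average(49.27, 2200, 62.5, 640), 2))  # 52.25
# Qwen-14B-Chat, zero shot: previous AVG 45, TimeSeriesForecasting 62.5.
print(round(updated_average(45, 2200, 62.5, 640), 2))  # 48.94
```

Because the published inputs are already rounded to two decimals, a recomputed value can differ from the table by 0.01 for some rows.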

 #### One Shot
-| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | **AVG** |
-|:-------------------:|:------------:|:------------------:|:---------------------------:|:-------------------------:|:-------:|
-| DevOpsPal-14B—Chat | 66.29 | 80.8 | 23.33 | 44.5 | 53.91 |
-| Qwen-14B-Base | 64.29 | 74.4 | 28 | 48.5 | 53.82 |
-| DevOpsPal-14BBase | 60 | 74 | 25.33 | 43.5 | 50.73 |
-| Qwen-14B-Chat | 49.71 | 65.6 | 28.67 | 48 | 47.27 |
-| Qwen-7B-Base | 56 | 60.8 | 27.67 | 44 | 47.18 |
-| DevOpsPal-7B—Base | 52.86 | 44.4 | 28 | 44.5 | 42.64 |
-| Qwen-7B-Chat | 54.57 | 52 | 29.67 | 26.5 | 42.09 |
-| Baichuan2-13B-Base | 56 | 43.2 | 24.33 | 41 | 41.73 |
-| Baichuan2-13B-Chat | 57.43 | 44.4 | 25 | 25.5 | 39.82 |
-| Baichuan2-7B-Base | 48.29 | 40.4 | 27 | 42 | 39.55 |
-| Baichuan2-7B-Chat | 58.57 | 31.6 | 27 | 31.5 | 38.91 |
-| DevOpsPal-7B—Chat | 56.57 | 27.2 | 25.33 | 41.5 | 38.64 |
-| Internlm-7B—Base | 48 | 33.2 | 29 | 35 | 37.09 |
-| Internlm-7B—Chat | 62.57 | 12.8 | 22.33 | 21 | 32.73 |
+| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | TimeSeriesForecasting | **AVG** |
+|:-------------------:|:------------:|:------------------:|:---------------------------:|:-----------------------------------------:|:---------------------------:|:-------:|
+| DevOpsPal-14B—Chat | 66.29 | 80.8 | 23.33 | 44.5 | 56.25 | 54.44 |
+| DevOpsPal-14BBase | 60 | 74 | 25.33 | 43.5 | 52.5 | 51.13 |
+| Qwen-14B-Base | 64.29 | 74.4 | 28 | 48.5 | 40.31 | 50.77 |
+| Qwen-7B-Base | 56 | 60.8 | 27.67 | 44 | 57.19 | 49.44 |
+| Qwen-14B-Chat | 49.71 | 65.6 | 28.67 | 48 | 42.19 | 46.13 |
+| Baichuan2-13B-Base | 56 | 43.2 | 24.33 | 41 | 46.88 | 42.89 |
+| Baichuan2-7B-Chat | 58.57 | 31.6 | 27 | 31.5 | 51.88 | 41.83 |
+| DevOpsPal-7B—Base | 52.86 | 44.4 | 28 | 44.5 | 36.25 | 41.2 |
+| Baichuan2-7B-Base | 48.29 | 40.4 | 27 | 42 | 40.94 | 39.86 |
+| Qwen-7B-Chat | 54.57 | 52 | 29.67 | 26.5 | 27.19 | 38.73 |
+| Baichuan2-13B-Chat | 57.43 | 44.4 | 25 | 25.5 | 30.63 | 37.75 |
+| DevOpsPal-7B—Chat | 56.57 | 27.2 | 25.33 | 41.5 | 33.44 | 37.46 |
+| Internlm-7B—Chat | 62.57 | 12.8 | 22.33 | 21 | 50.31 | 36.69 |
+| Internlm-7B—Base | 48 | 33.2 | 29 | 35 | 31.56 | 35.85 |


 ## ⏬ Data
@@ -140,7 +141,7 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in
 # {"id": 1, "question": "单元测试应该覆盖以下哪些方面?", "A": "正常路径", "B": "异常路径", "C": "边界值条件","D": 所有以上,"answer": "D", "explanation": ""} ```
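
Each sample is a multiple-choice record with `id`, `question`, options `A` to `D`, `answer`, and `explanation` fields, and the leaderboards report accuracy. A minimal sketch of scoring predictions against such records; the English text is a translation of the sample above, and `accuracy` and the `predictions` mapping are hypothetical helpers, not part of the repository's `src/run_eval.py`:

```python
import json

# English translation of the sample record shown above (illustrative only).
SAMPLE = (
    '{"id": 1, "question": "Which of the following should unit tests cover?", '
    '"A": "Normal paths", "B": "Exception paths", "C": "Boundary conditions", '
    '"D": "All of the above", "answer": "D", "explanation": ""}'
)


def accuracy(records, predictions):
    """Percentage of records whose predicted option letter equals `answer`."""
    if not records:
        return 0.0
    correct = sum(1 for r in records if predictions.get(r["id"]) == r["answer"])
    return 100.0 * correct / len(records)


records = [json.loads(SAMPLE)]
# `predictions` maps a record id to the option letter chosen by a model.
print(accuracy(records, {1: "D"}))  # 100.0
```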

 #### 👀 Notes
-To facilitate usage, we have organized the category name handlers and English/Chinese names corresponding to 53 subcategories. Please refer to [category_mapping.json](resources/categroy_mapping.json) for details. The format is:
+To facilitate usage, we have organized the category name handlers and English/Chinese names corresponding to 55 subcategories. Please refer to [category_mapping.json](resources/categroy_mapping.json) for details. The format is:

 ```
 {
@@ -285,7 +286,7 @@ python src/run_eval.py \

 ## 🧭 TODO
 - [x] add AIOps samples.
-- [ ] add AIOps scenario **time series forecasting**.
+- [x] add AIOps scenario **time series forecasting**.
 - [ ] increase in sample size.
 - [ ] add samples with the difficulty level set to hard.
 - [ ] add the English version of the samples.
