Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Commit e197e98

Browse filesBrowse files
author
jimmy.xj
committed
Update README.md
1 parent 102dd35 commit e197e98
Copy full SHA for e197e98

File tree

1 file changed

+37
-2
lines changed
Filter options

1 file changed

+37
-2
lines changed

‎README.md

Copy file name to clipboardExpand all lines: README.md
+37-2Lines changed: 37 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -9,16 +9,19 @@
99
DevOps-Eval is a comprehensive evaluation suite specifically designed for foundation models in the DevOps field. We hope DevOps-Eval could help developers, especially in the DevOps field, track the progress and analyze the important strengths/shortcomings of their models.
1010

1111

12-
📚 This repo contains questions and exercises related to DevOps, including the AIOps.
12+
📚 This repo contains questions and exercises related to DevOps, including the AIOps, ToolLearning;
1313

14-
💥️ There are currently **5977** multiple-choice questions spanning 8 diverse general categories, as shown [below](images/data_info.png).
14+
💥️ There are currently **7486** multiple-choice questions spanning 8 diverse general categories, as shown [below](images/data_info.png).
1515

1616
🔥 There are a total of **2840** samples in the AIOps subcategory, covering scenarios such as **log parsing**, **time series anomaly detection**, **time series classification**, **time series forecasting**, and **root cause analysis**.
1717

18+
🔧 There are a total of **1509** samples in the ToolLearning subcategory, covering 239 tool scenes across 59 fields.
19+
1820
<p align="center"> <a href="resources/devops_diagram_zh.jpg"> <img src="images/data_info.png" style="width: 100%;" id="data_info"></a></p>
1921

2022

2123
## 🔔 News
24+
* **[2023.12.27]** Add 1509 **ToolLearning** samples, covering 239 tool categories across 59 fields; Release the associated evaluation leaderboard;
2225
* **[2023.11.27]** Add 487 operation scene samples and 640 time series forecasting samples; Update the Leaderboard;
2326
* **[2023.10.30]** Add the AIOps Leaderboard.
2427
* **[2023.10.25]** Add the AIOps samples, including log parsing, time series anomaly detection, time series classification and root cause analysis.
@@ -30,9 +33,11 @@ DevOps-Eval is a comprehensive evaluation suite specifically designed for founda
3033
- [🏆 Leaderboard](#-leaderboard)
3134
- [👀 DevOps](#-devops)
3235
- [🔥 AIOps](#-aiops)
36+
- [🔧 ToolLearning](#-toollearning)
3337
- [⏬ Data](#-data)
3438
- [👀 Notes](#-notes)
3539
- [🔥 AIOps Sample Example](#-aiops-sample-example)
40+
- [🔧 ToolLearning Sample Example](#-toollearning-sample-example)
3641
- [🚀 How to Evaluate](#-how-to-evaluate)
3742
- [🧭 TODO](#-todo)
3843
- [🏁 Licenses](#-licenses)
@@ -83,6 +88,9 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in
8388
| Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 63.45 | 75.46 | 67.17 |
8489

8590
### 🔥 AIOps
91+
92+
<details>
93+
8694
#### Zero Shot
8795
| **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | TimeSeriesForecasting | **AVG** |
8896
|:-------------------:|:------------:|:------------------:|:---------------------------:|:-----------------------------------------:|:---------------------------:|:-------:|
@@ -119,6 +127,29 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in
119127
| Internlm-7B—Chat | 62.57 | 12.8 | 22.33 | 21 | 50.31 | 36.69 |
120128
| Internlm-7B—Base | 48 | 33.2 | 29 | 35 | 31.56 | 35.85 |
121129

130+
</details>
131+
132+
133+
### 🔧 ToolLearning
134+
<details>
135+
136+
| **FuncCall-Filler** | dataset_name | fccr | 1-fcffr | 1-fcfnr | 1-fcfpr | 1-fcfnir | aar |
137+
|:-------------------:| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
138+
| Qwen-14b-chat | luban | 98.37 | 99.73 | 99.86 | 98.78 | 100 | 81.58 |
139+
| Qwen-7b-chat | luban | 99.46 | 99.86 | 100 | 99.59 | 100 | 79.25 |
140+
| Baichuan-7b-chat | luban | 97.96 | 99.32 | 100 | 98.64 | 100 | 89.53 |
141+
| Internlm-chat-7b | luban | 94.29 | 95.78 | 100 | 98.5 | 100 | 88.19 |
142+
| Qwen-14b-chat | fc_data | 98.78 | 99.73 | 100 | 99.05 | 100 | 94.7 |
143+
| Qwen-7b-chat | fc_data | 98.1 | 99.87 | 99.73 | 98.5 | 100 | 93.14 |
144+
| Baichuan-7b-chat | fc_data | 98.91 | 99.87 | 99.87 | 99.18 | 100 | 89.5 |
145+
| Internlm-chat-7b | fc_data | 61 | 100 | 97.68 | 63.32 | 100 | 69.46 |
146+
| CodeLLaMa-7b | fc_data | 50.58 | 100 | 98.07 | 52.51 | 100 | 63.59 |
147+
| CodeFuse-7b-16k | fc_data | 60.23 | 100 | 97.3 | 62.93 | 99.61 | 61.12 |
148+
| CodeFuse-7b-4k | fc_data | 47.88 | 100 | 96.14 | 51.74 | 99.61 | 61.85 |
149+
150+
151+
</details>
152+
122153

123154
## ⏬ Data
124155
#### Download
@@ -216,6 +247,9 @@ D: 12
216247
answer: D
217248
explanation: According to the analysis, the value 265 in the given time series at 12 o'clock is significantly larger than the surrounding data, indicating a sudden increase phenomenon. Therefore, selecting option D is correct.
218249
```
250+
#### 🔧 ToolLearning Sample Example
251+
252+
👀 👀The data format is compatible with OpenAI's Function Calling. Please refer to [category_mapping.json](resources/categroy_mapping.json) for details.
219253

220254

221255
## 🚀 How to Evaluate
@@ -289,6 +323,7 @@ python src/run_eval.py \
289323
## 🧭 TODO
290324
- [x] add AIOps samples.
291325
- [x] add AIOps scenario **time series forecasting**.
326+
- [x] add **ToolLearning** samples.
292327
- [ ] increase in sample size.
293328
- [ ] add samples with the difficulty level set to hard.
294329
- [ ] add the English version of the samples.

0 commit comments

Comments
0 (0)
Morty Proxy This is a proxified and sanitized view of the page, visit original site.