@@ -26,6 +26,8 @@ DevOps-Eval is a comprehensive evaluation suite specifically designed for founda
## 📜 Table of Contents

- [🏆 Leaderboard](#-leaderboard)
+ - [👀 DevOps](#devops)
+ - [🔥 AIOps](#aiops)
- [⏬ Data](#-data)
- [👀 Notes](#-notes)
- [🔥 AIOps Sample Example](#-aiops-sample-example)
@@ -36,19 +38,19 @@ DevOps-Eval is a comprehensive evaluation suite specifically designed for founda

## 🏆 Leaderboard
Below are the zero-shot and five-shot accuracies on the DevOps tasks, and the zero-shot and one-shot accuracies on the AIOps tasks, for the models we evaluate in the initial release. We note that five-shot performance is better than zero-shot for many instruction-tuned models.
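The observation that five-shot prompting helps many instruction-tuned models can be checked against the AVG column of the two DevOps tables. The sketch below is a minimal illustration and is not part of DevOps-Eval: the dictionaries hold values copied by hand from the tables (assuming the second DevOps table is the five-shot one, as the text states), restricted to the -Chat models shown in both tables, and `few_shot_gains` is a hypothetical helper.

```python
# AVG accuracies copied by hand from the DevOps leaderboard tables.
# Assumption: the second DevOps table reports five-shot results.
# Only -Chat models that appear in both visible tables are included.
ZERO_SHOT_AVG = {
    "DevOpsPal-14B-Chat": 80.34,
    "Qwen-14B-Chat": 79.28,
    "Baichuan2-13B-Chat": 74.60,
    "DevOpsPal-7B-Chat": 74.00,
    "Qwen-7B-Chat": 71.96,
    "Baichuan2-7B-Chat": 68.17,
}
FIVE_SHOT_AVG = {
    "DevOpsPal-14B-Chat": 81.77,
    "Qwen-14B-Chat": 79.55,
    "Baichuan2-13B-Chat": 75.32,
    "DevOpsPal-7B-Chat": 75.25,
    "Qwen-7B-Chat": 73.62,
    "Baichuan2-7B-Chat": 70.49,
}


def few_shot_gains(zero: dict[str, float], few: dict[str, float]) -> dict[str, float]:
    """Hypothetical helper: few-shot AVG minus zero-shot AVG, per model.

    Models missing from either table are skipped; results are rounded to
    two decimals to match the precision used in the leaderboard.
    """
    return {model: round(few[model] - zero[model], 2) for model in zero if model in few}


if __name__ == "__main__":
    gains = few_shot_gains(ZERO_SHOT_AVG, FIVE_SHOT_AVG)
    # Print models from largest to smallest gain.
    for model, delta in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{model:<20} {delta:+.2f}")
```

All six models score a higher AVG with five shots. Restricting the comparison to models listed in both tables avoids pairing a score with a row that is not shown here.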
-
+ ### DevOps
#### Zero Shot

| **ModelName** | plan | code | build | test | release | deploy | operate | monitor | **AVG** |
| :------------------------:| :-----:| :-----:| :-----:| :------:| :--------:| :------:| :-------:| :--------:| :-----------:|
- | **DevOps-Model-14B-Chat** | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 81.34 | 79.17 | **80.34** |
- | **DevOps-Model-14B-Base** | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 85.82 | 82.41 | **80.26** |
+ | **DevOpsPal-14B-Chat** | 60.61 | 78.35 | 84.86 | 84.65 | 87.26 | 82.75 | 81.34 | 79.17 | **80.34** |
+ | **DevOpsPal-14B-Base** | 54.55 | 77.82 | 83.49 | 85.96 | 86.32 | 81.96 | 85.82 | 82.41 | **80.26** |
| Qwen-14B-Chat | 60.61 | 75.4 | 85.32 | 84.21 | 89.62 | 82.75 | 83.58 | 80.56 | 79.28 |
| Qwen-14B-Base | 57.58 | 73.81 | 84.4 | 85.53 | 86.32 | 81.18 | 82.09 | 80.09 | 77.92 |
| Baichuan2-13B-Base | 60.61 | 69.42 | 79.82 | 79.82 | 82.55 | 81.18 | 85.07 | 83.8 | 75.10 |
| Baichuan2-13B-Chat | 60.61 | 68.43 | 77.98 | 80.7 | 81.6 | 83.53 | 82.09 | 84.72 | 74.60 |
- | **DevOps-Model-7B-Chat** | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 79.85 | 77.78 | **74.00** |
- | **DevOps-Model-7B-Base** | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 79.85 | 78.7 | **73.55** |
+ | **DevOpsPal-7B-Chat** | 54.55 | 69.11 | 83.94 | 82.02 | 76.89 | 80 | 79.85 | 77.78 | **74.00** |
+ | **DevOpsPal-7B-Base** | 54.55 | 68.96 | 82.11 | 78.95 | 80.66 | 76.47 | 79.85 | 78.7 | **73.55** |
| Qwen-7B-Base | 53.03 | 68.13 | 78.9 | 75.44 | 80.19 | 80 | 83.58 | 80.09 | 73.13 |
| Qwen-7B-Chat | 57.58 | 66.01 | 80.28 | 79.82 | 76.89 | 77.65 | 80.6 | 79.17 | 71.96 |
| Baichuan2-7B-Chat | 54.55 | 63.66 | 77.98 | 76.32 | 71.7 | 73.33 | 75.37 | 79.63 | 68.17 |
@@ -61,21 +63,59 @@ Below are zero-shot and five-shot accuracies from the models that we evaluate in

| **ModelName** | plan | code | build | test | release | deploy | operate | monitor | **AVG** |
| :------------------------:| :-----:| :-----:| :-----:| :------:| :--------:| :------:| :-------:| :--------:| :---------:|
- | **DevOps-Model-14B-Chat** | 63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 89.55 | 81.48 | **81.77** |
- | **DevOps-Model-14B-Base** | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 85.07 | 80.09 | **81.70** |
+ | **DevOpsPal-14B-Chat** | 63.64 | 79.49 | 81.65 | 85.96 | 86.79 | 86.67 | 89.55 | 81.48 | **81.77** |
+ | **DevOpsPal-14B-Base** | 62.12 | 80.55 | 82.57 | 85.53 | 85.85 | 84.71 | 85.07 | 80.09 | **81.70** |
| Qwen-14B-Chat | 65.15 | 76 | 82.57 | 85.53 | 84.91 | 84.31 | 85.82 | 81.48 | 79.55 |
| Qwen-14B-Base | 66.67 | 76.15 | 84.4 | 85.53 | 86.32 | 80.39 | 86.57 | 80.56 | 79.51 |
| Baichuan2-13B-Base | 63.64 | 71.39 | 80.73 | 82.46 | 81.13 | 84.31 | 91.79 | 85.19 | 77.09 |
| Qwen-7B-Base | 75.76 | 72.52 | 78.9 | 81.14 | 83.96 | 81.18 | 85.07 | 81.94 | 77.02 |
| Baichuan2-13B-Chat | 62.12 | 69.95 | 76.61 | 84.21 | 83.49 | 79.61 | 88.06 | 80.56 | 75.32 |
- | **DevOps-Model-7B-Chat** | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 82.84 | 76.85 | **75.25** |
- | **DevOps-Model-7B-Base** | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 80.6 | 79.17 | **75.17** |
+ | **DevOpsPal-7B-Chat** | 66.67 | 69.95 | 83.94 | 81.14 | 80.19 | 82.75 | 82.84 | 76.85 | **75.25** |
+ | **DevOpsPal-7B-Base** | 69.7 | 69.49 | 82.11 | 81.14 | 82.55 | 82.35 | 80.6 | 79.17 | **75.17** |
| Qwen-7B-Chat | 65.15 | 66.54 | 82.57 | 81.58 | 81.6 | 81.18 | 80.6 | 81.02 | 73.62 |
| Baichuan2-7B-Base | 60.61 | 67.22 | 76.61 | 75 | 77.83 | 78.43 | 80.6 | 79.63 | 72.11 |
| Internlm-7B-Chat | 60.61 | 63.06 | 79.82 | 80.26 | 67.92 | 75.69 | 73.88 | 77.31 | 71.09 |
| Baichuan2-7B-Chat | 60.61 | 64.95 | 81.19 | 75.88 | 71.23 | 75.69 | 78.36 | 79.17 | 70.49 |
| Internlm-7B-Base | 62.12 | 65.25 | 77.52 | 80.7 | 74.06 | 78.82 | 79.85 | 75.46 | 69.17 |

+ ### AIOps
+ #### Zero Shot
+ | **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | **AVG** |
+ | :-------------------:| :------------:| :------------------:| :---------------------------:| :-------------------------:| :-------:|
+ | Qwen-14B-Base | 66.29 | 58.8 | 25.33 | 43.5 | 49.27 |
+ | DevOpsPal-14B-Base | 63.14 | 53.6 | 23.33 | 43.5 | 46.55 |
+ | DevOpsPal-14B-Chat | 60 | 56 | 24 | 43 | 46.18 |
+ | Qwen-14B-Chat | 64.57 | 51.6 | 22.67 | 36 | 45 |
+ | Qwen-7B-Base | 50 | 39.2 | 22.67 | 54 | 40.82 |
+ | Qwen-7B-Chat | 57.43 | 38.8 | 22.33 | 39.5 | 40.36 |
+ | DevOpsPal-7B-Chat | 56.57 | 30.4 | 25.33 | 45 | 40 |
+ | Baichuan2-13B-Chat | 64 | 18 | 21.33 | 37.5 | 37.09 |
+ | Baichuan2-7B-Chat | 60.86 | 10 | 28 | 34.5 | 35.55 |
+ | Baichuan2-7B-Base | 53.43 | 12.8 | 27.67 | 36.5 | 34.09 |
+ | Internlm-7B-Base | 48.57 | 18.8 | 23.33 | 37.5 | 32.91 |
+ | Baichuan2-13B-Base | 54 | 12.4 | 23 | 34.5 | 32.55 |
+ | DevOpsPal-7B-Base | 46.57 | 20.8 | 25 | 34 | 32.55 |
+ | Internlm-7B-Chat | 58.86 | 8.8 | 22.33 | 28.5 | 32 |
+
+ #### One Shot
+ | **ModelName** | LogParsing | RootCauseAnalysis | TimeSeriesAnomalyDetection | TimeSeriesClassification | **AVG** |
+ | :-------------------:| :------------:| :------------------:| :---------------------------:| :-------------------------:| :-------:|
+ | DevOpsPal-14B-Chat | 66.29 | 80.8 | 23.33 | 44.5 | 53.91 |
+ | Qwen-14B-Base | 64.29 | 74.4 | 28 | 48.5 | 53.82 |
+ | DevOpsPal-14B-Base | 60 | 74 | 25.33 | 43.5 | 50.73 |
+ | Qwen-14B-Chat | 49.71 | 65.6 | 28.67 | 48 | 47.27 |
+ | Qwen-7B-Base | 56 | 60.8 | 27.67 | 44 | 47.18 |
+ | DevOpsPal-7B-Base | 52.86 | 44.4 | 28 | 44.5 | 42.64 |
+ | Qwen-7B-Chat | 54.57 | 52 | 29.67 | 26.5 | 42.09 |
+ | Baichuan2-13B-Base | 56 | 43.2 | 24.33 | 41 | 41.73 |
+ | Baichuan2-13B-Chat | 57.43 | 44.4 | 25 | 25.5 | 39.82 |
+ | Baichuan2-7B-Base | 48.29 | 40.4 | 27 | 42 | 39.55 |
+ | Baichuan2-7B-Chat | 58.57 | 31.6 | 27 | 31.5 | 38.91 |
+ | DevOpsPal-7B-Chat | 56.57 | 27.2 | 25.33 | 41.5 | 38.64 |
+ | Internlm-7B-Base | 48 | 33.2 | 29 | 35 | 37.09 |
+ | Internlm-7B-Chat | 62.57 | 12.8 | 22.33 | 21 | 32.73 |
+
+
## ⏬ Data
#### Download
* Method 1: Download the zip file (you can also simply open the following link with the browser):