This repository was archived by the owner on Apr 11, 2023. It is now read-only.
BENCHMARK.md: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+> ## The Challenge has been concluded
+> No new submissions to the benchmark will be accepted. However, we would like
+> to encourage practitioners and researchers to continue using
+> the dataset and the human relevance annotations. Please see the
+> [main README](/README.md) for more information.
+
 ## Submitting runs to the benchmark

 The [Weights & Biases (W&B)](https://www.wandb.com) [benchmark](https://app.wandb.ai/github/CodeSearchNet/benchmark) tracks and compares models trained on the CodeSearchNet dataset by the global machine learning research community. Anyone is welcome to submit their results for review.
README.md: 15 additions & 2 deletions
@@ -4,6 +4,12 @@

 [paper]: https://arxiv.org/abs/1909.09436

+> # The CodeSearchNet challenge has been concluded
+> We would like to thank all participants for their submissions
+> and we hope that this challenge provided insights to practitioners and researchers about the challenges in semantic code search and motivated new research. We would like to encourage everyone to continue using the dataset and the human evaluations, which we now provide publicly. Please, see below for details.
+>
+> No new submissions to the challenge will be accepted.
@@ -83,11 +89,11 @@ More context regarding the motivation for this problem is in this [technical rep

 ## Evaluation

-The metric we use for evaluation is [Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG). Please reference [this paper][paper] for further details regarding model evaluation.
+The metric we use for evaluation is [Normalized Discounted Cumulative Gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG). Please reference [this paper][paper] for further details regarding model evaluation. The evaluation script can be found [here](/src/relevanceeval.py).

 ### Annotations

-We manually annotated retrieval results for the six languages from 99 general [queries](resources/queries.csv). This dataset is used as groundtruth data for evaluation _only_. Please refer to [this paper][paper] for further details on the annotation process.
+We manually annotated retrieval results for the six languages from 99 general [queries](resources/queries.csv). This dataset is used as groundtruth data for evaluation _only_. Please refer to [this paper][paper] for further details on the annotation process. These annotations were used to compute the scores in the leaderboard. Now that the competition has been concluded, you can find the annotations, along with the annotator comments [here](/resources/annotationStore.csv).


 ## Setup
@@ -242,6 +248,13 @@ For example, the link for the `java` is:

 The size of the dataset is approximately 20 GB. The various files and the directory structure are explained [here](resources/README.md).

+## Human Relevance Judgements
+To train neural models with a large dataset we use the documentation comments (e.g. docstrings) as a proxy. For evaluation (and the leaderboard), we collected human relevance judgements of pairs of realistic-looking natural language queries and code snippets. Now that the challenge has been concluded, we provide the data [here](/resources/annotationStore.csv) as a `.csv`, with the following fields:
+* Language: The programming language of the snippet.
+* Query: The natural language query
+* GitHubUrl: The URL of the target snippet. This matches the `URL` key in the data (see [here](#schema--format)).
+* Relevance: the 0-3 human relevance judgement, where "3" is the highest score (very relevant) and "0" is the lowest (irrelevant).
+* Notes: a free-text field with notes that annotators optionally provided.
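
The fields listed in this change can be read with Python's standard `csv` module and fed into an NDCG computation. The following is a minimal sketch, not the repository's `relevanceeval.py`. The sample rows, the `example.com` URLs, the helper names (`load_relevance`, `dcg`, `ndcg`), the averaging of repeated judgements, and the exponential-gain form of DCG are all assumptions made for illustration; only the five column names come from the README text.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical rows in the documented column layout; these are NOT real annotations.
SAMPLE_CSV = """Language,Query,GitHubUrl,Relevance,Notes
Python,read file,https://example.com/a.py#L1-L9,3,
Python,read file,https://example.com/a.py#L1-L9,2,close enough
Python,read file,https://example.com/b.py#L5-L20,0,unrelated
Python,read file,https://example.com/c.py#L2-L8,1,
"""


def load_relevance(handle):
    """Average the 0-3 judgements per (language, query, URL) triple.

    Averaging repeated judgements is an assumption of this sketch.
    """
    scores = defaultdict(list)
    for row in csv.DictReader(handle):
        key = (row["Language"].lower(), row["Query"], row["GitHubUrl"])
        scores[key].append(float(row["Relevance"]))
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}


def dcg(gains):
    """Discounted cumulative gain using the common 2^rel - 1 gain."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_urls, relevance_by_url):
    """NDCG of a ranking; URLs without a judgement count as relevance 0."""
    gains = [relevance_by_url.get(url, 0.0) for url in ranked_urls]
    # The ideal ordering is taken over every annotated URL for the query.
    ideal_dcg = dcg(sorted(relevance_by_url.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


relevance = load_relevance(io.StringIO(SAMPLE_CSV))
by_url = {
    url: rel
    for (lang, query, url), rel in relevance.items()
    if lang == "python" and query == "read file"
}

# A hypothetical model ranking that puts the irrelevant snippet first.
model_ranking = [
    "https://example.com/b.py#L5-L20",
    "https://example.com/a.py#L1-L9",
    "https://example.com/c.py#L2-L8",
]
ideal_ranking = sorted(by_url, key=by_url.get, reverse=True)

model_score = ndcg(model_ranking, by_url)
ideal_score = ndcg(ideal_ranking, by_url)
```

To run this against the published file, replace `io.StringIO(SAMPLE_CSV)` with an open handle on `resources/annotationStore.csv`. For scores comparable to the leaderboard, use the evaluation script referenced in the README rather than this sketch, since its gain function and handling of unannotated results may differ.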