# download data (~3.5GB) from S3; build and run the Docker container
script/setup
# this will drop you into the shell inside a Docker container
script/console
# optional: log in to W&B to see your training metrics,
# track your experiments, and submit your models to the benchmark
wandb login

# verify your setup by training a tiny model
python train.py --testrun
# see other command line options, try a full training run with default values,
# and explore other model variants by extending this baseline script
python train.py --help
python train.py

# generate predictions for model evaluation
python predict.py -r github/codesearchnet/0123456 # this is the org/project_name/run_id
```

Finally, you can submit your run to the [community benchmark](https://app.wandb.ai/github/codesearchnet/benchmark) by following these [instructions](BENCHMARK.md).

# Introduction

## Project Overview

[CodeSearchNet][paper] is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language. This research is a continuation of some ideas presented in this [blog post](https://githubengineering.com/towards-natural-language-semantic-code-search/) and is a joint collaboration between GitHub and the [Deep Program Understanding](https://www.microsoft.com/en-us/research/project/program/) group at [Microsoft Research - Cambridge](https://www.microsoft.com/en-us/research/lab/microsoft-research-cambridge/). We aim to provide a platform for community research on semantic code search via the following:

1. Instructions for obtaining large corpora of relevant data
2. Open source code for a range of baseline models, along with pre-trained weights
3. Baseline evaluation metrics and utilities
4. Mechanisms to track progress on a [shared community benchmark](https://app.wandb.ai/github/codesearchnet/benchmark) hosted by [Weights & Biases](https://www.wandb.com/)

We hope that CodeSearchNet is a step towards engaging with the broader machine learning and NLP community regarding the relationship between source code and natural language. We describe a specific task here, but we expect and welcome other uses of our dataset.

More context regarding the motivation for this problem is in this [technical report][paper].

## Data

The primary dataset consists of 2 million (`comment`, `code`) pairs from open source libraries. Concretely, a `comment` is a top-level function or method comment (e.g. [docstrings](https://en.wikipedia.org/wiki/Docstring) in Python), and `code` is an entire function or method. Currently, the dataset contains Python, JavaScript, Ruby, Go, Java, and PHP code. Throughout this repo, we refer to the terms docstring and query interchangeably. We partition the data into train, validation, and test splits such that code from the same repository can only exist in one partition. Currently this is the only dataset on which we train our model. Summary statistics about this dataset can be found in [this notebook](notebooks/ExploreData.ipynb).
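
Because the split is made at the repository level, a naive shuffle of individual functions would leak near-duplicate code across partitions. The snippet below is only an illustrative sketch of that constraint (it is not the preprocessing code used for this dataset): it assigns every example to a partition by hashing its repository name, so all functions from one repository land in the same split.

```python
import hashlib

def split_for_repo(repo_name: str) -> str:
    """Deterministically map a repository to a partition so that every
    function from that repo lands in the same split. The 80/10/10
    proportions here are illustrative, not the ratios used for the dataset."""
    bucket = int(hashlib.md5(repo_name.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "valid"
    return "test"

# Every example carrying the same repository name gets the same partition, e.g.:
print(split_for_repo("octocat/hello-world"))
```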

For more information about how to obtain the data, see [this section](#data-details).

### Annotations

We manually annotated retrieval results for the six languages from 99 general [queries](resources/queries.csv). This dataset is used as groundtruth data for evaluation _only_. Please refer to [this paper][paper] for further details on the annotation process.
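
The queries themselves are a small CSV that you can inspect directly; a minimal sketch with pandas (this assumes the file has a header row, so check `resources/queries.csv` if your copy differs):

```python
import pandas as pd

# Load the 99 natural-language evaluation queries used for annotation.
queries = pd.read_csv("resources/queries.csv")
print(len(queries))
print(queries.head())
```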
## Setup

This will build Docker containers and download the datasets. By default, the data is downloaded into the `resources/data/` folder inside this repository, with the directory structure described [here](resources/README.md).

**The datasets you will download (most of them compressed) have a combined size of only ~ 3.5 GB.**

3. To start the Docker container, run `script/console`:
   ```
   script/console
   ```
   This will land you inside the Docker container, starting in the `/src` directory. You can detach from/attach to this container to pause/continue your work.

For more about the data, see [Data Details](#data-details) below, as well as [this notebook](notebooks/ExploreData.ipynb).
# Data Details

Summary statistics such as row counts and token length histograms can be found in [this notebook](notebooks/ExploreData.ipynb).
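
If you prefer to inspect the raw files directly rather than through the notebook, each partition ships as gzipped JSON Lines, which loads straight into pandas. A minimal sketch follows; the path and the `resources/data/` layout are assumptions based on the default download location, so point it at whichever file `script/setup` fetched for you and verify the field names against your copy of the data:

```python
import pandas as pd

# Example path only: adjust language, partition, and file index to match
# whatever script/setup downloaded for you.
path = "resources/data/python/final/jsonl/train/python_train_0.jsonl.gz"

# One JSON object per line; pandas reads the gzip compression directly.
df = pd.read_json(path, lines=True, compression="gzip")
print(df.columns.tolist())
print(df.head())
```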
## Downloading Data from S3
The size of the dataset is approximately 20 GB. The various files and the directory structure are explained [here](resources/README.md).

# Running Our Baseline Model

We encourage you to reproduce and extend these models, though most variants take several hours to train (and some take more than 24 hours on an [AWS P3-V100](https://aws.amazon.com/ec2/instance-types/p3/) instance).

## Model Architecture

```
script/console
```
This will drop you into the shell of a Docker container with all necessary dependencies installed, including the code in this repository, along with data that you downloaded earlier. By default you will be placed in the `src/` folder of this GitHub repository. From here you can execute commands to run the model.

2. Set up [W&B](https://docs.wandb.com/docs/started.html) (free for open source projects) [per the instructions below](#wb-setup) if you would like to share your results on the community benchmark. This is optional but highly recommended.
3. The entry point to this model is `src/train.py`. You can see various options by executing the following command:
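```
python train.py --help
```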

```
python train.py --model neuralbow
```

The above command will assume default values for the location(s) of the training data and a destination where you would like to save the output model. The default location for training data is specified in `/src/data_dirs_{train,valid,test}.txt`. These files each contain a list of paths where data for the corresponding partition exists. If more than one path is specified (separated by a newline), the data from all the paths will be concatenated together. For example, this is the content of `src/data_dirs_train.txt`:

```
$ cat data_dirs_train.txt
```

Additional notes:
* Options for `--model` are currently listed in `src/model_restore_helper.get_model_class_from_name`.

* Hyperparameters are specific to the respective model/encoder classes. A simple trick to discover them is to kick off a run without specifying hyperparameter choices, as that will print a list of all used hyperparameters with their default values (in JSON format).
# References
## Benchmark

We are using a community benchmark for this project to encourage collaboration and improve reproducibility. It is hosted by [Weights & Biases](https://www.wandb.com/) (W&B), which is free for open source projects. Our entries in the benchmark link to detailed logs of our training and evaluation metrics, as well as model artifacts, and we encourage other participants to provide as much detail as possible.

We invite the community to submit their runs to this benchmark to facilitate transparency by following [these instructions](BENCHMARK.md).

## How to Contribute
1. Navigate to the `/src` directory in this repository.
2. If it's your first time using W&B on a machine, you will need to log in:
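   ```
   wandb login
   ```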