This repository was archived by the owner on Apr 11, 2023. It is now read-only.

Commit 25603bd

merging in README updates

1 parent: e36254a

1 file changed: README.md (+33 −24 lines)
@@ -41,42 +41,45 @@
 # clone this repository
 git clone https://github.com/ml-msr-github/CodeSearchNet.git
 cd CodeSearchNet/
-# download data (~3.5GB) from S3; build and run Docker container
+# download data (~3.5GB) from S3; build and run the Docker container
 script/setup
-# this will drop you into the shell inside a docker container.
+# this will drop you into the shell inside a Docker container
 script/console
-# optional: log in to W&B to track your experiments, and submit your results to the benchmark
+# optional: log in to W&B to see your training metrics,
+# track your experiments, and submit your models to the benchmark
 wandb login
+
 # verify your setup by training a tiny model
 python train.py --testrun
-# see other command line options and try a full training run with default values
+# see other command line options, try a full training run with default values,
+# and explore other model variants by extending this baseline script
 python train.py --help
 python train.py
 
 # generate predictions for model evaluation
 python predict.py -r github/codesearchnet/0123456 # this is the org/project_name/run_id
 ```
 
-Finally, you can submit your run to the [community benchmark](https://app.wandb.ai/github/codesearchnet/benchmark) by following these [instructions](src/docs/BENCHMARK.md).
+Finally, you can submit your run to the [community benchmark](https://app.wandb.ai/github/codesearchnet/benchmark) by following these [instructions](BENCHMARK.md).
 
 # Introduction
 
 ## Project Overview
 
-[CodeSearchNet][paper] is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language. This research is a continuation of some ideas presented in this [blog post](https://githubengineering.com/towards-natural-language-semantic-code-search/) and is a joint collaboration between GitHub and the [Deep Program Understanding](https://www.microsoft.com/en-us/research/project/program/) group at [Microsoft Research - Cambridge](https://www.microsoft.com/en-us/research/lab/microsoft-research-cambridge/). Our intent is to present and provide a platform for this research to the community by providing the following:
+[CodeSearchNet][paper] is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language. This research is a continuation of some ideas presented in this [blog post](https://githubengineering.com/towards-natural-language-semantic-code-search/) and is a joint collaboration between GitHub and the [Deep Program Understanding](https://www.microsoft.com/en-us/research/project/program/) group at [Microsoft Research - Cambridge](https://www.microsoft.com/en-us/research/lab/microsoft-research-cambridge/). We aim to provide a platform for community research on semantic code search via the following:
 
 1. Instructions for obtaining large corpora of relevant data
 2. Open source code for a range of baseline models, along with pre-trained weights
-3. Baseline evaluation metrics and utilities.
-4. Mechanisms to track progress on a [shared community benchmark](https://app.wandb.ai/github/codesearchnet/benchmark), hosted by [Weights & Biases](https://www.wandb.com/)
+3. Baseline evaluation metrics and utilities
+4. Mechanisms to track progress on a [shared community benchmark](https://app.wandb.ai/github/codesearchnet/benchmark) hosted by [Weights & Biases](https://www.wandb.com/)
 
 We hope that CodeSearchNet is a step towards engaging with the broader machine learning and NLP community regarding the relationship between source code and natural language. We describe a specific task here, but we expect and welcome other uses of our dataset.
 
 More context regarding the motivation for this problem is in this [technical report][paper].
 
 ## Data
 
-The primary dataset consists of 2 Million (`comment`, `code`) pairs from open source libraries. Concretely, a `comment` is a top-level function or method comment (e.g. [docstrings](https://en.wikipedia.org/wiki/Docstring) in Python), and `code` is an entire function or method. Currently, the dataset contains Python, Javascript, Ruby, Go, Java, and PHP code. Throughout this repo, we refer to the terms docstring and query interchangeably. We partition the data into train, validation, and test splits such that code from the same repository can only exist in one partition. Currently this is the only dataset on which we train our model. Summary statistics about this dataset can be found in [this notebook](notebooks/ExploreData.ipynb)
+The primary dataset consists of 2 million (`comment`, `code`) pairs from open source libraries. Concretely, a `comment` is a top-level function or method comment (e.g. [docstrings](https://en.wikipedia.org/wiki/Docstring) in Python), and `code` is an entire function or method. Currently, the dataset contains Python, Javascript, Ruby, Go, Java, and PHP code. Throughout this repo, we refer to the terms docstring and query interchangeably. We partition the data into train, validation, and test splits such that code from the same repository can only exist in one partition. Currently this is the only dataset on which we train our model. Summary statistics about this dataset can be found in [this notebook](notebooks/ExploreData.ipynb)
 
 For more information about how to obtain the data, see [this section](#data-details).
 
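The quickstart above ends with data sitting in compressed JSON-lines files. A minimal way to peek at one (`comment`, `code`) record once `script/setup` has finished might look like the following; the shard path is illustrative, and the actual directory layout is documented in `resources/README.md`:

```
# print the first record of a (hypothetical) Python training shard;
# each line is one JSON object pairing a function with its docstring
gunzip -c resources/data/python/final/jsonl/train/python_train_0.jsonl.gz | head -n 1
```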
@@ -86,7 +89,7 @@ More context regarding the motivation for this problem is in this [technical rep
 
 ### Annotations
 
-We manually annotated retrieval results for the six languages from 99 general [queries](resources/queries.csv). This dataset is used as groundtruth data for evaluation _only_. Please reference [this paper][paper] for further details on the annotation process.
+We manually annotated retrieval results for the six languages from 99 general [queries](resources/queries.csv). This dataset is used as groundtruth data for evaluation _only_. Please refer to [this paper][paper] for further details on the annotation process.
 
 
 ## Setup
@@ -102,8 +105,17 @@ More context regarding the motivation for this problem is in this [technical rep
 This will build Docker containers and download the datasets. By default, the data is downloaded into the `resources/data/` folder inside this repository, with the directory structure described [here](resources/README.md).
 
 **The datasets you will download (most of them compressed) have a combined size of only ~ 3.5 GB.**
+
+3. To start the Docker container, run `script/console`:
+```
+script/console
+```
+This will land you inside the Docker container, starting in the `/src` directory. You can detach from/attach to this container to pause/continue your work.
+
 
-For more about the data, see [Data Details](#data-details) below as well as [this notebook](notebooks/ExploreData.ipynb).
+**The datasets you will download (most of them compressed) have a combined size of only ~ 3.5 GB.**
+
+For more about the data, see [Data Details](#data-details) below, as well as [this notebook](notebooks/ExploreData.ipynb).
 
 
 # Data Details
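The new step 3 mentions detaching from and reattaching to the container. With plain Docker commands (nothing specific to this repository's scripts), that workflow looks roughly like this, assuming the container was started interactively:

```
# detach from the container's shell without stopping it: press Ctrl-p, then Ctrl-q
# later, list running containers and reattach by ID
docker ps
docker attach <container-id>
```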
@@ -219,7 +231,7 @@ Code, comments, and docstrings are extracted in a language-specific manner, remo
 }
 ```
 
-Furthermore, summary statistics such as row counts and token length histograms can be found in [this notebook](notebooks/ExploreData.ipynb)
+Summary statistics such as row counts and token length histograms can be found in [this notebook](notebooks/ExploreData.ipynb)
 
 ## Downloading Data from S3
 
@@ -236,9 +248,9 @@ For example, the link for the `java` is:
 The size of the dataset is approximately 20 GB. The various files and the directory structure are explained [here](resources/README.md).
 
 
-# Running our Baseline Model
+# Running Our Baseline Model
 
-Warning: the scripts provided to reproduce our baseline model take more than 24 hours on an [AWS P3-V100](https://aws.amazon.com/ec2/instance-types/p3/) instance.
+We encourage you to reproduce and extend these models, though most variants take several hours to train (and some take more than 24 hours on an [AWS P3-V100](https://aws.amazon.com/ec2/instance-types/p3/) instance).
 
 ## Model Architecture
 
@@ -258,9 +270,9 @@ This step assumes that you have a suitable Nvidia-GPU with [Cuda v9.0](https://d
 ```
 script/console
 ```
-This will drop you into the shell of a Docker container with all necessary dependencies installed, including the code in this repository, along with data that you downloaded in the previous step. By default you will be placed in the `src/` folder of this GitHub repository. From here you can execute commands to run the model.
+This will drop you into the shell of a Docker container with all necessary dependencies installed, including the code in this repository, along with data that you downloaded earlier. By default you will be placed in the `src/` folder of this GitHub repository. From here you can execute commands to run the model.
 
-2. Set up [W&B](https://docs.wandb.com/docs/started.html) (free for open source projects) per the instructions below if you would like to share your results on the community benchmark. This is optional but highly recommended.
+2. Set up [W&B](https://docs.wandb.com/docs/started.html) (free for open source projects) [per the instructions below](#W&B Setup) if you would like to share your results on the community benchmark. This is optional but highly recommended.
 
 3. The entry point to this model is `src/train.py`. You can see various options by executing the following command:
 ```
@@ -277,7 +289,7 @@ This step assumes that you have a suitable Nvidia-GPU with [Cuda v9.0](https://d
 python train.py --model neuralbow
 ```
 
-The above command will assume default values for the location(s) of the training data and a destination where would like to save the output model. The default location for training data is specified in `/src/data_dirs_{train,valid,test}.txt`. These files each contain a list of paths where data for the corresponding partition exists. If more than one path is specified (separated by a newline), the data from all the paths will be concatenated together. For example, this is the content of `src/data_dirs_train.txt`:
+The above command will assume default values for the location(s) of the training data and a destination where you would like to save the output model. The default location for training data is specified in `/src/data_dirs_{train,valid,test}.txt`. These files each contain a list of paths where data for the corresponding partition exists. If more than one path is specified (separated by a newline), the data from all the paths will be concatenated together. For example, this is the content of `src/data_dirs_train.txt`:
 
 ```
 $ cat data_dirs_train.txt
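Because `data_dirs_train.txt` is just a newline-separated list of paths that get concatenated, pointing training at additional data only requires appending a line to that file. A minimal sketch, with a hypothetical path:

```
# add another (hypothetical) training data path; one path per line
echo "/src/resources/data/ruby/final/jsonl/train" >> data_dirs_train.txt
cat data_dirs_train.txt
```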
@@ -301,18 +313,15 @@ This step assumes that you have a suitable Nvidia-GPU with [Cuda v9.0](https://d
 Additional notes:
 * Options for `--model` are currently listed in `src/model_restore_helper.get_model_class_from_name`.
 
-* Hyperparameters are specific to the respective model/encoder classes; a simple trick to discover them is to kick off a run without specifying hyperparameter choices, as that will print a list of all used hyperparameters with their default values (in JSON format).
-
-* By default, models are saved in the `/resources/saved_models` folder of this repository, but this can be overridden as shown above.
-
+* Hyperparameters are specific to the respective model/encoder classes. A simple trick to discover them is to kick off a run without specifying hyperparameter choices, as that will print a list of all used hyperparameters with their default values (in JSON format).
 
 # References
 
 ## Benchmark
 
-We are using a community benchmark for this project to encourage collaboration and improve reproducibility. It is hosted by [Weights & Biases](https://www.wandb.com/) (W&B), which is free for open source projects. Our entries in the benchmark link to detailed logs of our training and evaluation metrics, as well as model artifacts, and we encourage other participants to provide as much transparency as possible.
+We are using a community benchmark for this project to encourage collaboration and improve reproducibility. It is hosted by [Weights & Biases](https://www.wandb.com/) (W&B), which is free for open source projects. Our entries in the benchmark link to detailed logs of our training and evaluation metrics, as well as model artifacts, and we encourage other participants to provide as much detail as possible.
 
-We invite the community to submit their runs to this benchmark to facilitate transperency by following [these instructions](src/docs/BENCHMARK.md).
+We invite the community to submit their runs to this benchmark to facilitate transparency by following [these instructions](BENCHMARK.md).
 
 ## How to Contribute
 
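One way to apply the hyperparameter-discovery trick from the note above is to launch a run without overriding any hyperparameters; reusing the `--testrun` flag from the quickstart (assuming the two options compose) keeps such a run short:

```
# start a short run with default hyperparameters; per the note above,
# all used hyperparameters and their default values are printed in JSON format
python train.py --model neuralbow --testrun
```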
@@ -329,7 +338,7 @@ Additional notes:
 
 1. Navigate to the `/src` directory in this repository.
 
-2. If it's your first time using W&B on a machine, you will need to login:
+2. If it's your first time using W&B on a machine, you will need to log in:
 
 ```
 $ wandb login
