Bug In The Code Stack

A new benchmark for measuring LLMs' ability to detect bugs in large codebases.

About

Similar to the Needle In The Haystack benchmark, the Bug In The Code Stack benchmark uses randomly assembled Python source code as the background noise (the haystack) and a syntactic bug as the needle.
This measures an LLM's ability to retrieve code-related information across very large context windows, which is useful for SWE agent and co-pilot applications.
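
The construction described above can be sketched roughly as follows. This is a minimal illustration and not the benchmark's actual preprocessing code: the snippet pool, the function names (`build_haystack`, `inject_missing_parenthesis`), and the choice of a single bug type are all assumptions made for the example.

```python
import random

# Hypothetical snippet pool; the benchmark itself draws on real Python source code.
SNIPPETS = [
    "def add(a, b):\n    return a + b",
    "def is_even(n):\n    return n % 2 == 0",
    "def square(x):\n    return x * x",
    "def fahrenheit_to_celsius(f):\n    return (f - 32) * 5.0 / 9.0",
]


def build_haystack(snippets, n_functions, rng):
    """Randomly assemble snippets into one source listing (a list of lines)."""
    lines = []
    for _ in range(n_functions):
        lines.extend(rng.choice(snippets).split("\n"))
        lines.append("")  # blank line between functions
    return lines


def inject_missing_parenthesis(lines, rng):
    """Drop the last ')' from one randomly chosen line that contains one.

    Returns (buggy_lines, 1-based line number, bug type).
    """
    candidates = [i for i, line in enumerate(lines) if ")" in line]
    target = rng.choice(candidates)
    head, _, tail = lines[target].rpartition(")")
    buggy = list(lines)  # copy so the clean haystack is left untouched
    buggy[target] = head + tail
    return buggy, target + 1, "missing_parenthesis"


rng = random.Random(0)  # fixed seed so the example is reproducible
haystack = build_haystack(SNIPPETS, n_functions=5, rng=rng)
buggy, line_no, bug_type = inject_missing_parenthesis(haystack, rng)
print(f"Answer: {line_no}, {bug_type}")
```

Scaling `n_functions` up is what would grow the haystack toward a target context length; how the real benchmark sizes and samples its contexts is not specified here.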

Example

1 | def fahrenheit_to_celsius(fahrenheit):
2 |   return (fahrenheit - 32) * 5.0/9.0
3 |
4 | def is_prime(num:
5 |     if num <= 1:
6 |         return False
7 |     for i in range(2, int(num**0.5) + 1):
8 |         if num % i == 0:
9 |             return False
10|     return True
Answer: 4, missing_parenthesis
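
A rough sketch of how a listing like the one above could be numbered and how a model's reply could be checked against the expected answer. The helper names and the exact-match scoring rule are assumptions for illustration; the experiment notebooks may format prompts and score replies differently.

```python
def number_lines(source):
    """Prefix each line with its 1-based line number, in the style of the example."""
    return "\n".join(
        f"{i:<2}| {line}".rstrip()
        for i, line in enumerate(source.split("\n"), start=1)
    )


def parse_answer(text):
    """Parse 'Answer: <line>, <bug_type>' into (int, str); return None if malformed."""
    body = text.strip()
    if body.lower().startswith("answer:"):
        body = body[len("answer:"):]
    parts = [part.strip() for part in body.split(",")]
    if len(parts) != 2 or not parts[0].isdigit():
        return None
    return int(parts[0]), parts[1]


def is_correct(model_output, expected_line, expected_type):
    """Assumed scoring rule: both the line number and the bug type must match."""
    return parse_answer(model_output) == (expected_line, expected_type)


print(number_lines("def is_prime(num:\n    if num <= 1:\n        return False"))
print(is_correct("Answer: 4, missing_parenthesis", 4, "missing_parenthesis"))
```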

Results

*All models were evaluated on their latest versions.

Average Accuracy for Each Model (Excl. 16k Results)

Per-model results (figures not reproduced here): GPT-4o, GPT-4o Mini, GPT-4-Turbo, Claude-3.5 Sonnet, Claude-3 Opus, Gemini 1.5 Pro, Gemini 1.5 Flash, GPT-3.5-Turbo, Codestral, Llama3-70B, Command-R+, Gemini-1.0-Pro

Notebooks

notebooks/bug_in_the_code_stack_python_source_code_preprocessing.ipynb contains the Colab notebook for data processing.
notebooks/bug_in_the_code_stack_experiment_openai.ipynb contains the Colab notebook for running the experiment on OpenAI models.
notebooks/bug_in_the_code_stack_experiment_litellm_openai_gpt4_turbo.ipynb contains the Colab notebook for running the experiment on GPT-4-Turbo with LiteLLM.
notebooks/bug_in_the_code_stack_experiment_litellm_openai_gpt4o.ipynb contains the Colab notebook for running the experiment on GPT-4o with LiteLLM.
notebooks/bug_in_the_code_stack_experiment_litellm_openai_gpt4o_mini.ipynb contains the Colab notebook for running the experiment on GPT-4o Mini with LiteLLM.
notebooks/bug_in_the_code_stack_experiment_litellm_openai_gpt35.ipynb contains the Colab notebook for running the experiment on GPT-3.5 with LiteLLM.
notebooks/bug_in_the_code_stack_experiment_litellm_anthropic_claude_35_sonnet.ipynb contains the Colab notebook for running the experiment on Claude-3.5 Sonnet with LiteLLM.
notebooks/bug_in_the_code_stack_experiment_litellm_anthropic_claude_3_opus.ipynb contains the Colab notebook for running the experiment on Claude-3 Opus with LiteLLM.
notebooks/bug_in_the_code_stack_experiment_litellm_cohere_commandr.ipynb contains the Colab notebook for running the experiment on Command-R+ with LiteLLM.
notebooks/bug_in_the_code_stack_experiment_litellm_meta_llama3.ipynb contains the Colab notebook for running the experiment on Llama3 70B with LiteLLM.
notebooks/bug_in_the_code_stack_experiment_mistral_codestral.ipynb contains the Colab notebook for running the experiment on Mistral Codestral with LiteLLM.
notebooks/bug_in_the_code_stack_experiment_qwen_codeqwen_local.ipynb contains the Colab notebook for running the experiment on CodeQwen1.5 locally. Run this on Colab with an A100 GPU.
notebooks/bug_in_the_code_stack_experiment_genai_gemini10.py contains the Python script for running the experiment on Gemini-1.0-Pro with the Generative AI package. Run this locally (it does not work on Colab).
notebooks/bics_helper_result_graphs.ipynb contains helper functions for creating the result graphs.
notebooks/bics_result_analysis_graphs.ipynb contains code for analyzing the properties of codegen-focused models compared to larger, general-purpose models.

Dataset

datasets/bug_in_the_code_stack_alpaca_dataset.csv is the preprocessed dataset used for the experiment.
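
A minimal sketch for inspecting the dataset with the standard library. The column names are not documented here, so the example reads them from the CSV header rather than assuming them; the helper name `load_rows` and the raised field size limit are illustrative choices.

```python
import csv
from pathlib import Path

# Long code contexts can exceed the csv module's default per-field size limit,
# so raise it (the value here is an arbitrary generous bound).
csv.field_size_limit(100_000_000)


def load_rows(path):
    """Return (column_names, rows), where each row is a dict keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return columns, rows


dataset_path = Path("datasets/bug_in_the_code_stack_alpaca_dataset.csv")
if dataset_path.exists():  # only runs when executed from the repository root
    columns, rows = load_rows(dataset_path)
    print(columns)
    print(f"{len(rows)} rows")
```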

Google Drive

All notebooks and datasets can also be found in the Bug In The Code Stack Google Drive. If you don't have access, please request it by emailing techandy42@gmail.com.
