Status: Work in Progress 🚧
This repository is under active development. The current codebase is incomplete, and new modules are being progressively added.
This repository explores lossless data compression using Large Language Models (LLMs).
The primary objective is to investigate novel compression techniques that aim to improve the compression ratio while maintaining throughput comparable to that of classical compressors.
The research focuses exclusively on source code compression, as the project is conducted within the context of Software Heritage.
This work introduces novel lossless compression pipelines that integrate Large Language Models into the traditional compression workflow. The pipelines are designed to improve the compression ratio while keeping throughput close to that of classical compressors.
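One common way to realize such a pipeline is to let a causal LLM predict each next token of a file and drive an entropy coder with those probabilities. The sketch below only illustrates that idea and is not the repository's implementation: `gpt2` is a stand-in model, and instead of running a full arithmetic coder it reports the ideal code length (in bits) that such a coder would approach.

```python
# Illustrative sketch (not the project's pipeline): a causal LM predicts each
# next token, and the log-probability of the true token is the ideal number
# of bits an arithmetic coder would spend on it. "gpt2" is a stand-in model.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def ideal_compressed_bits(source: str, model_name: str = "gpt2") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tokenizer(source, return_tensors="pt").input_ids  # (1, seq_len)
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab_size)

    # Probability the model assigns to each actual next token
    # (the first token is assumed to be coded separately).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

    # Shannon code length in bits; an arithmetic coder approaches this bound.
    return float(-token_log_probs.sum() / math.log(2))


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    bits = ideal_compressed_bits(snippet)
    print(f"{bits:.1f} bits (~{bits / 8:.1f} bytes) vs {len(snippet.encode())} raw bytes")
```

The better the model predicts a file's tokens, the fewer bits per token are needed, which is why code-specialized models are of particular interest for source code.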
Below is a schematic representation of the two pipelines:
The source code datasets used in this study were collected from the Software Heritage archive via the boto3 library, covering six widely used programming languages:
- C
- C#
- C++
- Python
- Java
- JavaScript
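The raw files can be fetched from the archive's public S3 mirror with boto3. The snippet below is a hedged sketch of that access pattern only: the bucket name, the `content/<sha1>` key layout, and the gzip encoding are assumptions about the public dataset mirror, not a description of this repository's download scripts.

```python
# Hedged sketch: fetch one source file blob from the Software Heritage public
# dataset on S3 with boto3. Bucket name, key layout and gzip encoding are
# assumptions about the public mirror, not this repository's exact code.
import gzip

import boto3
from botocore import UNSIGNED
from botocore.config import Config


def fetch_blob(sha1_hex: str) -> bytes:
    # Anonymous (unsigned) access to the public bucket.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    obj = s3.get_object(Bucket="softwareheritage", Key=f"content/{sha1_hex}")
    return gzip.decompress(obj["Body"].read())
```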
Two phases of dataset preparation were performed.
First, a smaller exploratory dataset was curated to allow broad experimentation with a large number of models.
| Language | # Files | Total Size (B) | Total Size (MB) | Avg Size (B) |
|---|---|---|---|---|
| C# | 3,407 | 10,485,193 | 10.00 | 3,077.54 |
| Python | 3,893 | 10,484,445 | 10.00 | 2,693.15 |
| Java | 3,436 | 10,484,007 | 10.00 | 3,051.22 |
| C | 3,941 | 10,479,511 | 9.99 | 2,659.10 |
| C++ | 4,059 | 10,464,180 | 9.98 | 2,578.02 |
| JavaScript | 4,010 | 10,453,595 | 9.97 | 2,606.88 |
| Total | 23,746 | 62,850,931 | 59.94 | -- |
After selecting the most promising models, six larger datasets were built, one for each programming language.
| Language | # Files | Total Size (B) | Total Size (MB) | Avg Size (B) |
|---|---|---|---|---|
| Java | 34,401 | 104,856,598 | 100.00 | 3,048.07 |
| C# | 33,853 | 104,856,129 | 100.00 | 3,097.40 |
| Python | 38,312 | 104,853,756 | 100.00 | 2,736.84 |
| JavaScript | 38,988 | 104,848,652 | 99.99 | 2,689.25 |
| C++ | 35,565 | 104,848,494 | 99.99 | 2,948.08 |
| C | 38,716 | 104,844,579 | 99.99 | 2,708.04 |
| Total | 219,835 | 629,108,208 | 599.97 | -- |
A total of 30 LLMs were explored, including:
- Code-specialized models (e.g., StarCoder2, CodeGemma, Granite Code, CodeT5)
- Quantized models (4-bit and 8-bit versions for efficiency)
- General-purpose models (Llama 3, Mistral, GPT-2)
All models were retrieved from HuggingFace.
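The table mixes pre-quantized checkpoints (AWQ and bnb-4bit repositories) with base checkpoints loaded at reduced precision. As a hedged illustration of the latter (the exact quantization settings used in the experiments may differ), a base checkpoint can be loaded in 4-bit with `transformers` and `bitsandbytes`:

```python
# Hedged illustration: load a base checkpoint in 4-bit via bitsandbytes.
# The model ID and quantization settings are examples, not the exact
# configuration used in the experiments. Requires bitsandbytes + accelerate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigcode/starcoder2-3b"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```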
| Model name | Params | Memory | Quantized | Code-specialized | Released by |
|---|---|---|---|---|---|
| bigcode/starcoder2-3b (4bit) | 3B | 1.99 GB | Yes | Yes | BigCode |
| bigcode/starcoder2-3b (8bit) | 3B | 3.43 GB | Yes | Yes | BigCode |
| bigcode/starcoder2-3b (bfloat16) | 3B | 6.31 GB | No | Yes | BigCode |
| bigcode/starcoder2-3b (float32) | 3B | 12.62 GB | No | Yes | BigCode |
| deepseek-ai/deepseek-coder-1.3b-base | 1.3B | 2.6 GB | No | Yes | DeepSeek |
| TheBloke/deepseek-coder-1.3b-base-AWQ | 1.3B | 0.85 GB | Yes | Yes | TheBloke |
| google/codegemma-2b | 2.5B | 4.7 GB | No | Yes | Google |
| PrunaAI/codegemma-2b-AWQ-4bit | 2.5B | 3 GB | Yes | Yes | PrunaAI |
| google/gemma-2-2b | 2.6B | 5.2 GB | No | No | Google |
| unsloth/gemma-2-2b-bnb-4bit | 2.6B | 2.1 GB | Yes | No | Unsloth |
| HuggingFaceTB/SmolLM3-3B | 3B | 4.7 GB | No | No | HuggingFaceTB |
| ibm-granite/granite-3.3-2b-base | 2B | 4.7 GB | No | No | IBM |
| ibm-granite/granite-3b-code-base-2k | 3B | 6.6 GB | No | Yes | IBM |
| PrunaAI/ibm-granite-3b-code-base-4bit | 3B | 2 GB | Yes | Yes | PrunaAI |
| microsoft/phi-2 | 2.7B | 4.7 GB | No | No | Microsoft |
| microsoft/unixcoder-base | 0.13B | 0.48 GB | No | Yes | Microsoft |
| mistralai/Mistral-7B-v0.3 | 7B | 14 GB | No | No | Mistral |
| unsloth/mistral-7b-v0.3-bnb-4bit | 7B | 4 GB | Yes | No | Unsloth |
| meta-llama/Llama-3.1-8B | 8B | 15 GB | No | No | Meta |
| unsloth/Meta-Llama-3.1-8B-bnb-4bit | 8B | 5.4 GB | Yes | No | Unsloth |
| meta-llama/Llama-3.2-3B | 3.21B | 6.4 GB | No | No | Meta |
| meta-llama/Llama-3.2-1B | 1.23B | 2.5 GB | No | No | Meta |
| unsloth/Llama-3.2-3B-bnb-4bit | 3.21B | 2.2 GB | Yes | No | Unsloth |
| unsloth/Llama-3.2-1B-bnb-4bit | 1.23B | 0.97 GB | Yes | No | Unsloth |
| Qwen/Qwen3-1.7B | 1.7B | 3.3 GB | No | No | Qwen |
| openai-community/gpt2 | 0.124B | 0.48 GB | No | No | OpenAI |
| openai-community/gpt2-medium | 0.355B | 1.35 GB | No | No | OpenAI |
| openai-community/gpt2-large | 0.774B | 2.95 GB | No | No | OpenAI |
| openai-community/gpt2-xl | 1.5B | 5.6 GB | No | No | OpenAI |
| Salesforce/codet5-base | 0.22B | 0.85 GB | No | Yes | Salesforce |
For benchmarking purposes, standard lossless compression algorithms were used as baselines.
These compressors provide reference points in terms of both compression ratio and throughput.
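As a hedged sketch of how those two reference metrics can be computed, the function below scores a compressor on ratio and throughput; `zlib` is only a stand-in for whichever classical compressors the benchmark actually uses.

```python
# Hedged sketch: score a baseline compressor on compression ratio and
# throughput. zlib stands in for whichever classical compressors are used.
import time
import zlib


def benchmark(data: bytes, level: int = 9) -> tuple[float, float]:
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)      # original / compressed, higher is better
    throughput = len(data) / elapsed / 1e6   # MB of input processed per second
    return ratio, throughput
```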
If you use this repository or the associated datasets in your research, please cite the corresponding work (citation details will be provided soon).

