SemanticCodeCloneBERT is a fine-tuned CodeBERT model. While CodeBERT can be adapted to many downstream tasks, in this repository it is fine-tuned for semantic code clone detection.
Semantic code clone detection involves identifying functionally similar code fragments even when their syntax differs significantly. This project leverages a fine-tuned CodeBERT model to detect semantic equivalence in code, enabling applications in software maintenance, plagiarism detection, and code search optimization. The fine-tuning process adapts the pre-trained model to better capture functional similarities. Note, however, that this particular fine-tuned model targets Python source code only, since that is what the underlying dataset contains.
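As a rough illustration of how such a fine-tuned model could be queried, the sketch below loads a sequence-pair classifier with the Hugging Face `transformers` library and compares two Python snippets. The checkpoint path `./semantic-clone-bert`, the 512-token limit, and the assumption that label 1 means "clone" are illustrative placeholders, not confirmed details of this repository.

```python
# Hypothetical usage sketch — the checkpoint directory name and the
# label mapping are assumptions, not part of this repository.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


def predict_clone(model, tokenizer, code_a: str, code_b: str) -> bool:
    """Return True if the classifier labels the pair as a semantic clone."""
    # CodeBERT-style models accept a sentence pair; the tokenizer inserts
    # the separator tokens and truncates to the model's 512-token limit.
    inputs = tokenizer(code_a, code_b, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumption: label index 1 corresponds to "semantic clone".
    return logits.argmax(dim=-1).item() == 1


if __name__ == "__main__":
    # Placeholder path to the fine-tuned weights saved as a HF checkpoint.
    tok = AutoTokenizer.from_pretrained("./semantic-clone-bert")
    mdl = AutoModelForSequenceClassification.from_pretrained("./semantic-clone-bert")
    print(predict_clone(mdl, tok,
                        "def add(a, b): return a + b",
                        "def total(x, y): return x + y"))
```

Passing the model and tokenizer in as arguments keeps the helper reusable across checkpoints and easy to batch over a whole dataset of candidate pairs.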
The CodeBERT base model was fine-tuned using transfer learning. Here are the details:
- Pre-trained Model: CodeBERT (Microsoft's pre-trained transformer model).
- Results:

| Model | Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| CodeBERT-base | 0.058 | 0.987 | 0.980 | 0.994 | 0.987 |
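As a sanity check, the reported F1 is consistent with the reported precision and recall, since F1 is their harmonic mean:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall from the results table above.
print(round(f1_score(0.980, 0.994), 3))  # → 0.987, matching the table
```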
- Source: https://drive.google.com/open?id=1KicfslV02p6GDPPBjZHNlmiXk-9IoGWl.
- Description: For further details on this dataset, please refer to the original publications cited in the References section.
References:
- Farouq Al-Omari, Chanchal K. Roy, and Tonghao Chen. SemanticCloneBench: A Semantic Code Clone Benchmark using Crowd-Source Knowledge. In 2020 IEEE 14th International Workshop on Software Clones (IWSC), pages 57–63, 2020.
- Saad Arshad, Shamsa Abid, and Shafay Shamail. CodeBERT for Code Clone Detection: A Replication Study. In 2022 IEEE 16th International Workshop on Software Clones (IWSC), pages 39–45, 2022. Replication pack: https://doi.org/10.5281/zenodo.6361315.
- Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv preprint arXiv:2002.08155, 2020. Model repository: https://github.com/microsoft/CodeBERT.