SemRepo is an RDF knowledge graph with over 81 million triples on nearly 200,000 GitHub repositories linked to scientific research. SemRepo captures fine-grained repository-level metadata (e.g., contributors, issues, dependencies, programming languages) and interlinks this information with external scholarly knowledge graphs: repositories are connected to publications in LPWC, repository authors are linked to their profiles in SemOpenAlex, and research artifacts (e.g., datasets, experiments) are linked via MLSea KG. SemRepo provides an important infrastructure for large-scale analysis of software within the broader scientific research ecosystem.
SemRepo’s services and documentation are available at https://semrepo.org:
- Data Access: RDF data dumps are available via Zenodo and are updated periodically (approximately twice per year).
- Query Services: The dataset is hosted in a public triple store with a SPARQL endpoint at https://semrepo.org/sparql.
- Open-source Pipeline: We release the full source code for knowledge graph construction and automatic interlinking, enabling reproducibility and future extensions of SemRepo.
- URI Resolution: SemRepo supports persistent URI resolution for entities within the Linked Open Data cloud.
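As a minimal sketch of how the SPARQL endpoint listed above could be queried from Python using only the standard library: the request follows the generic SPARQL 1.1 Protocol (a `query` parameter on a GET request, JSON results requested via the `Accept` header). That the endpoint honors this `Accept` header is an assumption, and the function names `build_request` and `run_query` are illustrative, not part of SemRepo. The sample query is deliberately schema-agnostic so that it does not presuppose any ontology terms.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://semrepo.org/sparql"  # endpoint URL stated in this README

# Schema-agnostic sample query: fetch any five triples.
SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Assumption: the server supports the standard SPARQL JSON results media type.
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query: str, timeout: float = 30.0) -> list[dict]:
    """Send the query and return the result bindings (needs network access)."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]
```

Calling `run_query(SAMPLE_QUERY)` would return a list of binding dictionaries, each mapping a variable name (`s`, `p`, `o`) to an object with `type` and `value` keys, provided the endpoint is reachable.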
Following Linked Open Data and FAIR best practices, SemRepo provides machine-readable ontology and dataset metadata descriptions:
- SemRepo Ontology (OWL)
- VoID Description — dataset statistics, interlinks, and access points
- DCAT Metadata Description — dataset catalog and distribution metadata
- Repository: 197,566
- Issues: 2,609,510
- Organization: 12,879
- Package: 95,505
- Forked Repository: 2,468,660
- Person: 2,916,508
- Topic: 272,378
- Programming Language: 387,284
- Linkage to LPWC: 197,566
- Linkage to SemOpenAlex: 11,867
- Linkage to MLSea: 148,185
- ... (and more)
*core classes only
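Per-class counts like those above can in principle be recomputed with a generic aggregation query. The sketch below pairs such a query with a small parser for the standard SPARQL JSON results format. It uses no SemRepo-specific terms, and the names `CLASS_COUNT_QUERY` and `counts_from_results` are illustrative. Whether a full aggregation over the whole graph completes within the public endpoint's time limits is not known.

```python
# Generic query: count instances per rdf:type, most frequent class first.
CLASS_COUNT_QUERY = """
SELECT ?class (COUNT(?entity) AS ?n)
WHERE { ?entity a ?class }
GROUP BY ?class
ORDER BY DESC(?n)
"""


def counts_from_results(results: dict) -> dict[str, int]:
    """Map each class IRI to its instance count.

    `results` is a parsed SPARQL JSON results document, i.e. a dict with
    the shape {"results": {"bindings": [{"class": {...}, "n": {...}}, ...]}}.
    """
    return {
        row["class"]["value"]: int(row["n"]["value"])
        for row in results["results"]["bindings"]
    }
```

The returned dictionary is keyed by full class IRIs, so matching them to the labels in the list above requires consulting the SemRepo ontology.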
Clone the repository and install the required dependencies:
git clone https://github.com/faerber-lab/SemRepo.git
cd SemRepo
pip install -r requirements.txt

Ensure you are using Python 3.10 or higher. The pipeline consists of the following steps:
- Repository and Metadata Harvesting
  We collect repository data and metadata using dedicated scripts. An overview of the crawled dataset is available in JSON format.
  To extract the libraries and dependencies used in the code, we use scripts that clone each repository and parse its source files to identify imported packages.
- RDF Knowledge Graph Construction and Linking
  We construct the RDF knowledge graph and interlink it with external scholarly knowledge graphs using the provided scripts.
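To illustrate the dependency-extraction idea from the harvesting step (parsing source files to identify imported packages), here is a minimal sketch for Python files based on the standard `ast` module. It is not the project's actual script: the function name `imported_packages` is hypothetical, and the real pipeline may handle other languages and edge cases differently.

```python
import ast


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a Python source string."""
    packages: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import os.path" -> "os"
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import (e.g. "from . import utils"),
            # which refers to the repository's own code, not a dependency.
            if node.level == 0 and node.module:
                packages.add(node.module.split(".")[0])
    return packages
```

Applied to every `.py` file of a cloned repository, the union of the returned sets gives a rough list of imported packages. It does not distinguish standard-library modules from third-party dependencies, and files that fail to parse raise `SyntaxError` and would need separate handling.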
We demonstrate and evaluate the utility of SemRepo through the following:
- Competency Questions (CQs)
  We formulate competency questions to illustrate SemRepo’s analytical capabilities for non-trivial queries. All SPARQL queries used for the CQs are available in the CQs directory.
- Reproducibility and Sustainability Analysis Use Case
  We conduct an empirical reproducibility auditing study on a sample of 20,000 repositories from SemRepo that are linked to scientific publications. All resources for this use case are available in the usecase directory.
- Additional SPARQL Query Examples
  A collection of additional SPARQL query examples.
- /usecase: SemRepo use cases, i.e., reproducibility analysis
- /ontologies: OWL ontology and VoID files
- /CQs: competency questions
- /crawling-gitHub-metadata: GitHub crawling
- /extract-libraries-from-code: harvesting of code dependencies and libraries
- /making-repo-metadata-kg: KG construction and interlinking
- /assets: figures
- requirements.txt
- Dataset: CC0 1.0 Universal
- Ontology: CC0 1.0 Universal
- Source Code: MIT License
SemRepo is released under the CC0 license to maximize reuse and interoperability; users are nevertheless encouraged to cite the associated publication and dataset.
SemRepo follows the FAIR data principles to support long-term usability, reproducibility, and integration within the scholarly data ecosystem. The dataset is publicly available via Zenodo, GitHub, and the project website through RDF dumps and a public SPARQL endpoint. Interoperability is ensured through open standards (RDF, OWL, SPARQL, VoID) and interlinking with external scholarly knowledge graphs, while reusability is supported through open licensing and the release of the full construction pipeline.
SemRepo provides versioned releases with periodic updates (approximately twice per year), accompanied by dissemination of each release via mailing list. SemRepo is constructed exclusively from publicly available software and scholarly metadata. We acknowledge that inherited biases and coverage limitations from upstream sources (e.g., GitHub and linked scholarly knowledge graphs) may affect representation across research communities and regions.
See details: Open Science & Compliance Overview
If you use SemRepo, please cite:
@inproceedings{semrepo,
  title={SemRepo: A Knowledge Graph for Research Software and Its Scholarly Ecosystem},
  author={Rafay, A. and Lamprecht, D. and Susanti, Y. and Färber, M.},
  year={2026}
}

SemRepo follows a versioned release cycle with periodic updates (approximately twice per year), with each release announced via a mailing list.
SemRepo is maintained by the Faerber Lab Research Group at TU Dresden.
📧 Contact: michael.faerber@tu-dresden.de
