SemRepo

SemRepo is an RDF knowledge graph with over 81 million triples on nearly 200,000 GitHub repositories linked to scientific research. SemRepo captures fine-grained repository-level metadata (e.g., contributors, issues, dependencies, programming languages) and interlinks this information with external scholarly knowledge graphs: repositories are connected to publications in LPWC, repository authors are linked to their profiles in SemOpenAlex, and research artifacts (e.g., datasets, experiments) are linked via MLSea KG. SemRepo provides an important infrastructure for large-scale analysis of software within the broader scientific research ecosystem.

SemRepo’s services and documentation are available at https://semrepo.org:

Data Access: RDF data dumps are available via Zenodo and are periodically updated (approx. twice per year).
Query Services: The dataset is hosted in a public triple store with a SPARQL endpoint at https://semrepo.org/sparql.
Open-source Pipeline: We release the full source code for knowledge graph construction and automatic interlinking, enabling reproducibility and future extensions of SemRepo.
URI Resolution: SemRepo supports persistent URI resolution for entities within the Linked Open Data cloud.
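
As a minimal sketch of how the SPARQL endpoint listed above might be queried from Python, the snippet below uses only the standard library. It assumes the endpoint follows the standard SPARQL 1.1 Protocol (a GET request with a `query` parameter) and can return JSON results. The helper names `build_request`, `parse_bindings`, and `run_query` are illustrative and not part of SemRepo, and the query is deliberately generic so that it does not presuppose any SemRepo ontology terms.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://semrepo.org/sparql"  # endpoint URL as given above

# Generic query that makes no assumption about SemRepo's ontology terms.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_bindings(payload: str) -> list[dict[str, str]]:
    """Flatten a SPARQL JSON result document into {variable: value} rows."""
    doc = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def run_query(
    query: str, endpoint: str = ENDPOINT, timeout: float = 30.0
) -> list[dict[str, str]]:
    """Send the query over the network and return the parsed rows."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_bindings(response.read().decode("utf-8"))
```

Calling `run_query(QUERY)` would then return up to five rows, each a dictionary with the keys `s`, `p`, and `o`. Whether the public endpoint enforces rate limits or result-size caps is not stated here, so treat the timeout as a placeholder.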

Ontology and Metadata Descriptions

Following Linked Open Data and FAIR best practices, SemRepo provides machine-readable ontology and dataset metadata descriptions:

SemRepo Ontology (OWL)
VoID Description — dataset statistics, interlinks, and access points
DCAT Metadata Description — dataset catalog and distribution metadata

Key Statistics* (as of April 2026)

Repository: 197,566
Issues: 2,609,510
Organization: 12,879
Package: 95,505
Forked Repository: 2,468,660
Person: 2,916,508
Topic: 272,378
Programming Language: 387,284
Linkage to LPWC: 197,566
Linkage to SemOpenAlex: 11,867
Linkage to MLSea: 148,185
(and more)

*core classes only
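
Per-class counts like the ones above can, in principle, be recomputed with a generic SPARQL 1.1 aggregate over `rdf:type`. The sketch below pairs such a query with a small parser for the JSON result. It is an illustration only: it assumes the endpoint allows a full-graph aggregate without timing out, and `CLASS_COUNT_QUERY` and `class_counts` are hypothetical names, not part of the SemRepo pipeline.

```python
import json

# Generic SPARQL 1.1 aggregate; it relies only on rdf:type,
# not on any SemRepo-specific class or property names.
CLASS_COUNT_QUERY = """
SELECT ?class (COUNT(?s) AS ?n)
WHERE { ?s a ?class }
GROUP BY ?class
ORDER BY DESC(?n)
"""


def class_counts(payload: str) -> dict[str, int]:
    """Turn a SPARQL JSON result for CLASS_COUNT_QUERY into {class IRI: count}."""
    doc = json.loads(payload)
    return {
        row["class"]["value"]: int(row["n"]["value"])
        for row in doc["results"]["bindings"]
    }
```

The query string can be sent to the endpoint with any SPARQL client; `class_counts` then maps each class IRI to its number of instances.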

Construction Pipeline

Installation

Clone the repository and install the required dependencies:

git clone https://github.com/faerber-lab/SemRepo.git
cd SemRepo
pip install -r requirements.txt

Ensure you are using Python 3.10 or higher. The pipeline consists of the following steps:

Repository and Metadata Harvesting
We collect repository data and metadata using dedicated scripts. An overview of the crawled dataset is available in JSON format.
To extract libraries and dependencies used in the code, we utilize scripts that clone each repository and parse source files to identify imported packages.
RDF Knowledge Graph Construction and Linking
We construct the RDF knowledge graph and interlink it with external scholarly knowledge graphs using the provided scripts.
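
To illustrate the dependency-extraction idea from the harvesting step (parsing source files to identify imported packages), here is a minimal sketch for Python files using the standard `ast` module. It is not the pipeline's actual implementation, which may differ and may cover other languages. The function name `imported_packages` is hypothetical, and the sketch does not distinguish standard-library modules from third-party packages.

```python
import ast


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a piece of Python source."""
    packages: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b as c` contributes "a"
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # `from a.b import c` contributes "a";
            # relative imports (level > 0) are skipped as project-internal
            packages.add(node.module.split(".")[0])
    return packages
```

Applied to every `.py` file of a cloned repository, the union of the returned sets gives a rough list of the packages that repository imports. Files that fail to parse raise `SyntaxError`, which a real crawler would need to catch.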

Example Usage

We demonstrate and evaluate the utility of SemRepo through the following:

Competency Questions (CQs)
We formulate competency questions to illustrate SemRepo’s analytical capabilities for non-trivial queries. All SPARQL queries used for the CQs are available in the CQs directory.
Reproducibility and Sustainability Analysis Use Case
We conduct an empirical reproducibility auditing study on a sample of 20,000 repositories from SemRepo that are linked to scientific publications. All resources for this use case are available in the usecase directory.
Additional SPARQL Query Examples
A collection of additional SPARQL query examples.

Repository Structure

/usecase — SemRepo use cases (e.g., reproducibility analysis)
/ontologies — OWL ontology and VoID files
/CQs — competency questions
/crawling-gitHub-metadata — GitHub metadata crawling
/extract-libraries-from-code — harvesting of code dependencies and libraries
/making-repo-metadata-kg — KG construction and interlinking
/assets — figures
requirements.txt

License

Dataset: CC0 1.0 Universal
Ontology: CC0 1.0 Universal
Source Code: MIT License

SemRepo is released under the CC0 license to maximize reuse and interoperability; users are nevertheless encouraged to cite the associated publication and dataset.

FAIR, Sustainability, and Ethical Compliance

SemRepo follows the FAIR data principles to support long-term usability, reproducibility, and integration within the scholarly data ecosystem. The dataset is publicly available via Zenodo, GitHub, and the project website through RDF dumps and a public SPARQL endpoint. Interoperability is ensured through open standards (RDF, OWL, SPARQL, VoID) and interlinking with external scholarly knowledge graphs, while reusability is supported through open licensing and the release of the full construction pipeline.

SemRepo provides versioned releases with periodic updates (approximately twice per year), and each release is announced via a mailing list. SemRepo is constructed exclusively from publicly available software and scholarly metadata. We acknowledge that inherited biases and coverage limitations from upstream sources (e.g., GitHub and linked scholarly knowledge graphs) may affect representation across research communities and regions.

For details, see the Open Science & Compliance Overview.

Citation

If you use SemRepo, please cite:

@inproceedings{semrepo,
  title={SemRepo: A Knowledge Graph for Research Software and Its Scholarly Ecosystem},
  author={Rafay, A. and Lamprecht, D. and Susanti, Y. and Färber, M.},
  year={2026}
}

Maintenance

SemRepo follows a versioned release cycle with periodic updates (approximately twice per year), with each release announced via a mailing list.
SemRepo is maintained by the Faerber Lab Research Group at TU Dresden.
📧 Contact: michael.faerber@tu-dresden.de
