Datassert

By Skye Lane Goetz

Datassert is a high-performance CLI for building a DuckDB-backed assertion store from NCATS Translator BABEL export files, with a focus on fast local builds and simple command-driven workflows.

Quick Start

# Install CLI from GitHub
go install github.com/SkyeAv/datassert@latest

# Verify install
datassert --help

Build Command

# Build a Datassert database (downloads BABEL data automatically)
datassert build

The build command automatically downloads BABEL exports from RENCI (https://stars.renci.org/var/babel_outputs), processes them, and produces sharded DuckDB databases.

Flags

Flag	Required	Default	Description
`--skip-downloads` / `-s`	No	`false`	Skip the BABEL download phase (use previously downloaded files)
`--use-existing-parquets` / `-p`	No	`false`	Use existing Parquet files to rebuild DuckDB databases

Data Pipeline

Download -- BABEL class and synonym files are downloaded from RENCI and split into LZ4-compressed NDJSON chunks under ./datassert/downloads/classes/ and ./datassert/downloads/synonyms/.
Lookup -- Class files (*.ndjson.lz4) are read to build an in-memory equivalent-identifier lookup.
Parquet Staging -- Synonym files are processed with the lookup, quality-controlled, and written as sharded Parquet files to ./datassert/parquets/.
DuckDB Generation -- Parquet files are loaded into 10 sharded DuckDB databases under ./datassert/data/.

Output Artifacts

10 sharded DuckDB databases are written to ./datassert/data/{0..9}.duckdb.
Each shard contains SOURCES, CATEGORIES, CURIES, and SYNONYMS tables, deduplicated, sorted, and indexed for query performance.
Staging Parquet files are written to ./datassert/parquets/{0..9}/.

Examples

# Full build (download, process, and generate databases)
datassert build

# Skip downloads if BABEL files were already fetched
datassert build --skip-downloads

# Rebuild DuckDB databases from existing Parquet files
datassert build --use-existing-parquets

Runtime Behavior

Displays progress bars for download, class lookup, synonym processing, and DuckDB build phases.
Uses 90% of available CPUs for concurrent processing.
Downloads are retried up to 5 times on failure with a 5-second backoff.
All working files are stored under ./datassert/.

Maintainer

Skye Lane Goetz

Name	Name	Last commit message	Last commit date
Latest commit History 155 Commits 155 Commits
.github/workflows	.github/workflows
cmd	cmd
.gitignore	.gitignore
AGENTS.md	AGENTS.md
LICENSE	LICENSE
README.md	README.md
go.mod	go.mod
go.sum	go.sum
main.go	main.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Datassert

By Skye Lane Goetz

Quick Start

Build Command

Flags

Data Pipeline

Output Artifacts

Examples

Runtime Behavior

Maintainer

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Search code, repositories, users, issues, pull requests...

Folders and files

Latest commit

History

Repository files navigation

Datassert

By Skye Lane Goetz

Quick Start

Build Command

Flags

Data Pipeline

Output Artifacts

Examples

Runtime Behavior

Maintainer

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages