opensmi

Agentless, multi-node GPU allocation manager (SSH + nvidia-smi only)

Screenshots are taken from a real environment; sensitive details (node names, usernames, file paths) have been redacted with Nano Banana.

opensmi helps teams monitor and enforce GPU allocations across a self-managed cluster without installing anything on GPU nodes.
It runs from your terminal, connects over SSH, and reads nvidia-smi.
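
As a rough illustration of the SSH + nvidia-smi approach, the sketch below parses the CSV output of an nvidia-smi query. The `--query-gpu` and `--format` flags are standard nvidia-smi options; the choice of fields, the `GpuSample` type, and the helper names are assumptions for this example and are not taken from opensmi's source.

```python
import csv
import io
from dataclasses import dataclass
from typing import List

# Standard nvidia-smi query flags; the field selection is an assumption for this sketch.
QUERY = (
    "nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu "
    "--format=csv,noheader,nounits"
)


@dataclass
class GpuSample:  # hypothetical record type, not opensmi's own
    index: int
    memory_used_mib: int
    memory_total_mib: int
    utilization_pct: int


def build_ssh_command(user: str, host: str) -> List[str]:
    """Argument list that would run the query on a remote node via plain ssh."""
    return ["ssh", f"{user}@{host}", QUERY]


def parse_gpu_csv(text: str) -> List[GpuSample]:
    """Parse 'index, used, total, util' rows; assumes every field is numeric."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [GpuSample(*(int(field) for field in row)) for row in rows if row]


# Illustrative output for a two-GPU node (made-up numbers).
sample = "0, 1024, 24576, 37\n1, 0, 24576, 0\n"
gpus = parse_gpu_csv(sample)
```

In practice the text passed to `parse_gpu_csv` would be the stdout of the command returned by `build_ssh_command`, for example via `subprocess.run`.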

What you get

Interactive TUI — live dashboard, node detail, GPU runner, job tracker
Multi-cluster tab bar — switch between SSH clusters and Slurm clusters in one view
Slurm GPU monitoring — read-only per-node GPU usage via Slurm APIs (no SSH to compute nodes)
CLI — poll, allocate, detect violations, watch, kill, exec
Policy enforcement — unallocated GPU usage is a violation; * = open to all
No agents or daemons on GPU nodes
Python stdlib only — zero pip dependencies for the CLI

Install

Recommended — installs both CLI + TUI:

curl -fsSL https://raw.githubusercontent.com/seilk/opensmi/main/scripts/install.sh | bash

Binaries land in ~/.local/bin. The installer auto-detects your shell (zsh/bash/fish) and prints the exact PATH line to add — or offers to add it for you when run interactively.

Requirements: macOS or Linux · Python 3.9+ · SSH access to GPU nodes with nvidia-smi

Update

opensmi update

Replaces the CLI, TUI binary, and wrapper in one step. No uninstall needed.
If you hit GitHub API rate limits: export OPENSMI_GITHUB_TOKEN=<token>

Uninstall

opensmi uninstall             # remove CLI + TUI
opensmi uninstall --dry-run   # preview what would be removed

To also wipe state and config (irreversible):

opensmi uninstall --purge-state --yes

Quick start

# 1. Create config (interactive wizard)
opensmi onboard

# 2. Verify SSH connectivity + GPU visibility
opensmi poll

# 3. Launch the TUI
opensmi

Config is written to ~/.opensmi/opensmi.json by default.
Override with --config <path> or OPENSMI_CONFIG.
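
A minimal sketch of how the config path could be resolved. The default path and the `OPENSMI_CONFIG` variable come from this README; the precedence shown (flag first, then environment variable, then default) is an assumption, and `resolve_config_path` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_CONFIG = "~/.opensmi/opensmi.json"


def resolve_config_path(
    cli_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the config path. Assumed order: --config, OPENSMI_CONFIG, default."""
    env = os.environ if env is None else env
    chosen = cli_path or env.get("OPENSMI_CONFIG") or DEFAULT_CONFIG
    return Path(chosen).expanduser()
```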

TUI

Launch with:

opensmi

The top bar shows: cluster name · user@hostname · GPUs used/total · Violations · Poll time

Cluster Tab Bar

A tab bar at the very top of the TUI shows all configured clusters. Press Tab / Shift+Tab to cycle, or click a tab directly.

SSH clusters — defined in clusters[] in your config, polled via SSH + nvidia-smi
Slurm clusters — defined in slurm_clusters[], show per-node GPU usage via Slurm APIs (read-only, no SSH to compute nodes)

When a newer version is available, the version label on the right of the tab bar turns yellow: opensmi@0.2.5 → 0.2.6 ↑

Tab Navigation

Press Ctrl+X T to open the tab switcher, then press a tab's shortcut key or use the arrow keys to select it.

| Shortcut | Tab | Description |
| --- | --- | --- |
| `d` | Dashboard | Live GPU grid: who's using what, per node |
| `n` | Node Detail | Per-GPU memory, utilization, process list (enter from Dashboard via `Enter`) |
| `g` | My GPUs | Personal GPU view for the current operator |
| `j` | Jobs | Track queued, running, and finished jobs |
| `s` | Setup | Per-node env config (conda, venv, work dir) |
| `h` | Help | Keyboard shortcuts reference |

Note: Node Detail is a hidden tab — navigate to it by selecting a node in the Dashboard and pressing Enter.
Allocation management (a allocate, x clear, Shift+K kill) is done directly from the Dashboard, not from a separate tab.

The Command Runner is a persistent pane at the bottom of the screen (not a tab). Focus it with Ctrl+X ↓.

Global shortcuts (work from any tab):

| Key | Action |
| --- | --- |
| `Ctrl+X T` | Open tab switcher |
| `Ctrl+X ↓` | Focus command runner pane |
| `Ctrl+X F` | Fold / unfold runner pane |
| `Ctrl+X Q` | Quit |

Command Runner

The runner pane sits at the bottom of the TUI at all times. Focus it with Ctrl+X ↓, type a command, and execute with Ctrl+X Enter. Press Esc to unfocus.

Execution modes (Tab to toggle):

direct — background process, output captured
tmux — creates a tmux session you can attach to

Distribution modes (Shift+Tab to toggle):

single — one command across multiple GPUs (CUDA_VISIBLE_DEVICES=0,1,2)
one-to-one — different command per GPU (e.g., cross-validation folds)
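
The two distribution modes differ only in how `CUDA_VISIBLE_DEVICES` is set per launch. The sketch below shows one way to express that; `plan_launches` is a hypothetical helper and does not reflect opensmi's internals.

```python
from typing import Dict, List, Tuple

Launch = Tuple[Dict[str, str], str]  # (environment overrides, command)


def plan_launches(mode: str, commands: List[str], gpus: List[int]) -> List[Launch]:
    """Return one (env, command) pair per process to start."""
    if mode == "single":
        # One command sees every selected GPU.
        if len(commands) != 1:
            raise ValueError("single mode takes exactly one command")
        env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}
        return [(env, commands[0])]
    if mode == "one-to-one":
        # Each command is pinned to its own GPU.
        if len(commands) != len(gpus):
            raise ValueError("one-to-one mode needs one command per GPU")
        return [({"CUDA_VISIBLE_DEVICES": str(g)}, c) for g, c in zip(gpus, commands)]
    raise ValueError(f"unknown mode: {mode}")
```

For example, three cross-validation folds on GPUs 0, 1, and 2 would yield three launches, each with a single-device `CUDA_VISIBLE_DEVICES`.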

GPU assignment (g to toggle):

auto — ranks GPUs by idleness, last-used time, utilization
manual — click GPUs in the panel to select
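
Auto assignment can be pictured as a sort over the three criteria named above. This is only a sketch: the `GpuState` fields, the exact ordering of the criteria, and the tie-breaking are assumptions, not opensmi's actual ranking.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuState:  # hypothetical snapshot of one GPU
    index: int
    idle: bool            # no compute process currently attached
    last_used_ts: float   # Unix time of the last observed use
    utilization_pct: int


def rank_gpus(gpus: List[GpuState]) -> List[GpuState]:
    """Idle GPUs first, then least recently used, then lowest utilization."""
    return sorted(gpus, key=lambda g: (not g.idle, g.last_used_ts, g.utilization_pct))


def pick(gpus: List[GpuState], count: int) -> List[int]:
    """Indices of the `count` best-ranked GPUs."""
    return [g.index for g in rank_gpus(gpus)[:count]]
```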

Queue mode (q to toggle):

immediate — runs now
queued — saves to job queue for auto-dispatch when GPUs free up
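
Queued mode implies some dispatcher that starts waiting jobs once enough GPUs are free. A minimal sketch, assuming strict first-in-first-out order with no backfilling (the README does not state the actual policy), with `dispatch` as a hypothetical helper:

```python
from collections import deque
from typing import Deque, List, Set, Tuple

Job = Tuple[str, int]  # (job id, number of GPUs requested)


def dispatch(queue: Deque[Job], free_gpus: Set[int]) -> List[Tuple[str, List[int]]]:
    """Start jobs from the head of the queue while enough GPUs are free.

    Mutates both arguments: started jobs leave the queue and their GPUs
    leave the free set.
    """
    started = []
    while queue and queue[0][1] <= len(free_gpus):
        job_id, needed = queue.popleft()
        assigned = sorted(free_gpus)[:needed]
        free_gpus.difference_update(assigned)
        started.append((job_id, assigned))
    return started
```

Calling `dispatch` after every poll would approximate "auto-dispatch when GPUs free up".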

Preflight checks run before execution: tmux availability, command syntax, GPU availability.

Jobs Tab

Tracks all submitted jobs (immediate and queued). From the detail view you can:

View live output from tmux sessions
Retry the last command on a session
Cancel or delete a job record
Clean up finished tmux sessions

CLI Reference

opensmi poll                        # snapshot cluster GPU state
opensmi violations                  # list allocation violations (live)
opensmi alloc list                  # show all allocations
opensmi job list                    # list jobs
opensmi job list --status running   # filter by status
opensmi log                         # tail opensmi debug logs
opensmi log --follow                # live log stream
opensmi --help                      # full command list

Where applicable, commands accept --json for machine-readable output.

Admin Features

Admin actions require the operator to be listed in opensmi.json under admins.master or admins.members, and to have remote sudo-group membership on target nodes.

Allocations

Allocations define which user is allowed on which GPU. Without an allocation, any GPU usage is a violation.

opensmi alloc list                        # show all allocations
opensmi alloc set GPU-01 0 alice          # assign GPU 0 on GPU-01 to alice
opensmi alloc set GPU-01 1 '*'            # open GPU 1 to everyone
opensmi alloc clear GPU-01 0              # remove allocation
opensmi alloc seed                        # auto-seed from live usage
opensmi alloc seed --force                # overwrite existing allocations

Special target * means any user is allowed on that GPU.
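
The allocation rule can be summarized in a few lines. The sketch below treats a process as a violation when its GPU has no allocation, or when the GPU is allocated to a different user and is not open (`*`). The second case is implied rather than stated in this README, and the data shapes are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class GpuProcess(NamedTuple):  # hypothetical record for one GPU process
    node: str
    gpu: int
    user: str
    pid: int


def find_violations(
    allocations: Dict[Tuple[str, int], str],
    processes: List[GpuProcess],
) -> List[GpuProcess]:
    """Return processes running on GPUs their user is not allowed to use."""
    violations = []
    for proc in processes:
        owner = allocations.get((proc.node, proc.gpu))
        if owner is None or (owner != "*" and owner != proc.user):
            violations.append(proc)
    return violations
```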

Violations & Watch

opensmi violations                        # one-shot violation check (exit 1 if any)
opensmi watch                             # poll every 60s, print new violations
opensmi watch --interval 30               # custom poll interval (seconds)
opensmi watch --slack-webhook <url>       # send alerts to Slack

violations exits 0 (clean) or 1 (violations found) — suitable for CI/cron.
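
For cron or CI use, a wrapper only needs the exit code. The sketch below maps the documented codes 0 and 1 to a status string; treating any other code as an error is an assumption, and both function names are hypothetical.

```python
import subprocess
from typing import Sequence


def classify_exit(returncode: int) -> str:
    """0 and 1 are documented; anything else is assumed to be a failure."""
    if returncode == 0:
        return "clean"
    if returncode == 1:
        return "violations"
    return "error"


def check_cluster(argv: Sequence[str] = ("opensmi", "violations")) -> str:
    """Run the one-shot check and classify its exit code."""
    result = subprocess.run(list(argv), capture_output=True, text=True)
    return classify_exit(result.returncode)
```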

Kill

Send a signal to remote PIDs:

opensmi kill GPU-01 <pid> [<pid> ...]
opensmi kill GPU-01 1234 5678 --signal KILL
opensmi kill GPU-01 1234 --no-sudo        # skip sudo, only own processes

Supported signals: TERM (default), KILL, INT, HUP.
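
To make the options concrete, here is a sketch of the kind of remote command line these flags could translate to. How opensmi actually builds its remote command is not documented here; `build_kill_command` is hypothetical and uses only the standard `kill -s` and `sudo -n` forms.

```python
import shlex
from typing import Iterable

SUPPORTED_SIGNALS = ("TERM", "KILL", "INT", "HUP")


def build_kill_command(pids: Iterable[int], signal: str = "TERM", use_sudo: bool = True) -> str:
    """Build a shell-safe kill command for the given PIDs."""
    if signal not in SUPPORTED_SIGNALS:
        raise ValueError(f"unsupported signal: {signal}")
    pid_args = [str(int(pid)) for pid in pids]  # int() rejects non-numeric input
    if not pid_args:
        raise ValueError("at least one PID is required")
    parts = ["kill", "-s", signal] + pid_args
    if use_sudo:
        parts = ["sudo", "-n"] + parts  # -n: fail instead of prompting for a password
    return shlex.join(parts)
```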

Remote Execution

# Run a command on a node with specific GPUs
opensmi exec GPU-01 --gpus 0,1 --command "python train.py"

# Use tmux mode for long-running jobs
opensmi exec GPU-01 --gpus 0 --command "python train.py" --mode tmux

# Submit to the job queue (auto-dispatches when GPUs free up)
opensmi job submit --auto-gpus 2 --command "python train.py"

Node Env

Per-node environment configuration (conda/venv activation, working directory):

opensmi node-env GPU-01                                   # show current config
opensmi node-env GPU-01 --env-manager conda --env-name ml # set conda env
opensmi node-env GPU-01 --work-dir ~/projects             # set working dir
opensmi node-env GPU-01 --env-manager venv --env-name .venv

This config is used automatically when dispatching jobs to that node.

Sudo Check

Verify that your SSH user has the required sudo-group membership on a node:

opensmi sudo-check GPU-01
opensmi sudo-check GPU-01 --json

Admin Config

Admin identity and remote sudo-group requirements are set in opensmi.json:

{
  "admins": {
    "master": "alice",
    "members": ["alice", "bob"],
    "remote_sudo_groups": ["sudo", "wheel"]
  }
}

master / members: local usernames allowed to run admin commands
remote_sudo_groups: SSH user must be in one of these groups on the target node for alloc, kill, and exec actions
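
The two checks described above can be sketched as follows. The config keys match the example; the function names are hypothetical, and the list of remote groups is assumed to come from something like `id -nG` run over SSH.

```python
import json
from typing import Iterable


def is_admin(config: dict, local_user: str) -> bool:
    """True if the local user is the master or listed in members."""
    admins = config.get("admins", {})
    return local_user == admins.get("master") or local_user in admins.get("members", [])


def has_required_group(config: dict, remote_groups: Iterable[str]) -> bool:
    """True if the SSH user belongs to at least one required remote group."""
    required = set(config.get("admins", {}).get("remote_sudo_groups", []))
    return bool(required & set(remote_groups))


config = json.loads("""
{
  "admins": {
    "master": "alice",
    "members": ["alice", "bob"],
    "remote_sudo_groups": ["sudo", "wheel"]
  }
}
""")
```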

Configuration

Config is plain JSON. Start from the template:

opensmi onboard          # interactive wizard
opensmi init             # write default template

Reference template: opensmi.example.json
Keep your real opensmi.json private — it's gitignored by default.

Multi-cluster config

To monitor multiple SSH clusters as separate tabs, use the clusters array:

{
  "clusters": [
    {
      "cluster_name": "Lab-A",
      "nodes": [{ "alias": "GPU-01", "address": "10.0.0.1", "user": "ubuntu" }]
    },
    {
      "cluster_name": "Lab-B",
      "nodes": [{ "alias": "GPU-05", "address": "10.0.1.1", "user": "admin" }]
    }
  ]
}

Single-cluster configs (root-level cluster_name + nodes) continue to work unchanged.
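
One way to support both layouts is to normalize them into a single list before use. This is a sketch, not opensmi's loader; `normalize_clusters` and the `"default"` fallback name are assumptions.

```python
from typing import Any, Dict, List


def normalize_clusters(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return a list of clusters for either config layout."""
    if "clusters" in config:
        return list(config["clusters"])
    if "nodes" in config:
        # Legacy single-cluster layout: wrap the root-level fields.
        name = config.get("cluster_name", "default")  # fallback name is hypothetical
        return [{"cluster_name": name, "nodes": config["nodes"]}]
    return []
```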

Slurm monitoring config

To add a read-only Slurm cluster tab, add slurm_clusters:

{
  "slurm_clusters": [
    {
      "name": "HPC Cluster",
      "login_node": "hpc-login",
      "user": "myuser"
    }
  ]
}

opensmi SSHes into the login node and queries sinfo/squeue/scontrol — no access to compute nodes is required.
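
As an illustration of what can be read from the login node alone, the sketch below extracts per-node GPU counts from `scontrol show node` style text. It assumes the `CfgTRES` and `AllocTRES` fields carry a `gres/gpu=N` entry, which depends on the Slurm version and site configuration; the sample text is made up and opensmi's real parsing may differ.

```python
import re
from typing import Dict, Tuple

# Illustrative output only; real output has many more fields.
SAMPLE = """\
NodeName=gpu-a01 Arch=x86_64 CoresPerSocket=16
   Gres=gpu:4
   CfgTRES=cpu=64,mem=512G,billing=64,gres/gpu=4
   AllocTRES=cpu=16,mem=128G,gres/gpu=2

NodeName=gpu-a02 Arch=x86_64 CoresPerSocket=16
   Gres=gpu:4
   CfgTRES=cpu=64,mem=512G,billing=64,gres/gpu=4
   AllocTRES=
"""


def parse_gpu_usage(scontrol_output: str) -> Dict[str, Tuple[int, int]]:
    """Map node name to (GPUs allocated, GPUs configured)."""
    usage = {}
    # Node records are separated by blank lines.
    for block in re.split(r"\n\s*\n", scontrol_output.strip()):
        name = re.search(r"NodeName=(\S+)", block)
        if not name:
            continue
        total = re.search(r"CfgTRES=\S*?gres/gpu=(\d+)", block)
        used = re.search(r"AllocTRES=\S*?gres/gpu=(\d+)", block)
        usage[name.group(1)] = (
            int(used.group(1)) if used else 0,   # empty AllocTRES means nothing allocated
            int(total.group(1)) if total else 0,
        )
    return usage
```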

Key environment variables:

| Variable | Purpose |
| --- | --- |
| `OPENSMI_CONFIG` | Override config path |
| `OPENSMI_STATE_DIR` | Override state directory (useful for NFS/shared home) |
| `OPENSMI_PYTHON` | Override Python interpreter |
| `OPENSMI_GITHUB_TOKEN` | GitHub token to avoid API rate limits during update |
| `OPENSMI_BIN_DIR` | Override install directory (default: `~/.local/bin`) |
| `OPENSMI_LOG_DIR` | Override log directory |
| `OPENSMI_LOG_LEVEL` | Log verbosity: `DEBUG`, `INFO` (default), `WARNING`, `ERROR` |
| `OPENSMI_REPO` | Override GitHub repo for update (default: `seilk/opensmi`) |
| `OPENSMI_TUI_BIN` | Override TUI binary path |

Scope / Supported Environments

opensmi supports two distinct cluster setups:

1. Self-managed clusters (no scheduler)

Full feature set — allocation, enforcement, job dispatch, kill.
SSH directly into each GPU node; reads nvidia-smi for live GPU state.

Using opensmi's job dispatch (tmux/direct execution) on a cluster already running Slurm is not recommended:

CUDA_VISIBLE_DEVICES: Slurm renumbers the GPUs it grants to a job starting from 0, while opensmi addresses GPUs by their physical indices, so the two numbering schemes conflict.
Process lifecycle: opensmi tmux sessions run outside Slurm cgroups, bypassing Slurm's resource accounting.
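
The index conflict is easiest to see with a small mapping. The sketch assumes a Slurm setup that constrains devices per job, so the job sees its GPUs renumbered from 0; `job_local_indices` is a hypothetical helper for illustration.

```python
from typing import Dict, Iterable


def job_local_indices(physical_gpus: Iterable[int]) -> Dict[int, int]:
    """Map physical GPU index to the 0-based index seen inside the job."""
    return {phys: local for local, phys in enumerate(sorted(physical_gpus))}


# A job granted physical GPUs 2 and 3 sees them as devices 0 and 1.
mapping = job_local_indices([2, 3])
```

Under that assumption, a launcher that sets `CUDA_VISIBLE_DEVICES=2,3` inside the job would point at device numbers the job cannot see.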

2. Slurm-managed clusters (read-only monitoring)

opensmi can monitor a Slurm cluster as a read-only tab in the TUI — showing per-node GPU assignments, job owners, partition info, and GPU indices via scontrol.
Configure via slurm_clusters in opensmi.json. No access to compute nodes is required.

Local node: If opensmi runs on a GPU node itself, SSH is bypassed automatically — no loopback connection needed.

Security

opensmi can execute remote commands over SSH (including process signals).
Treat the machine you run it on as an admin workstation.
See SECURITY.md.

Docs

Architecture: docs/ARCHITECTURE.md
Releasing: docs/RELEASING.md
Changelog: CHANGELOG.md

License

MIT — see LICENSE.
