STREAM: Multi-Tier LLM Inference Middleware

Smart Tiered Routing Engine for AI Models

STREAM unifies local device inference, campus HPC inference, and commercial cloud inference behind a single OpenAI-compatible API with automatic complexity-based routing and real-time token streaming from all tiers.

Key Features

Three-tier routing: Automatically routes queries to the cheapest capable tier based on complexity
- Local (Ollama) — free, private, instant response
- Campus HPC (Globus Compute + vLLM) — free GPU inference on institutional clusters
- Cloud (500+ models via OpenRouter) — frontier models when needed
Dual-channel HPC streaming: Novel architecture that separates Globus Compute's control plane from a WebSocket relay data plane, enabling sub-second time-to-first-token from HPC
Tier-aware context summarization: Differential rolling summarization per tier prevents long conversations from forcing unnecessary tier upgrades
Two deployment modes: Docker Compose server (multi-user) and standalone desktop app (single-user), sharing 90%+ of the codebase
OpenAI-compatible API: Drop-in replacement for existing tools and workflows
Multimodal support: Vision-language routing across all three tiers

Architecture

STREAM classifies each query as LOW, MEDIUM, or HIGH complexity using a local LLM-as-judge, routes it to the cheapest capable tier, applies tier-aware context summarization if needed, and streams the response back via SSE.

The dual-channel streaming architecture is the key innovation for HPC inference: Globus Compute handles authentication and job dispatch (control plane), while a lightweight WebSocket relay delivers tokens in real-time (data plane). Both sides connect outbound to the relay, requiring no firewall changes on HPC nodes or user devices.

Three Inference Tiers

Tier	Model	Hardware	Context	Cost
Local	Llama 3.2 3B / Gemma 3 4B (VL)	CPU / Apple Silicon	32K	$0
HPC (Lakeshore)	Qwen 2.5-VL 72B-AWQ	H100 NVL 94 GB	64K	$0
Cloud	500+ models via OpenRouter	Provider-managed	64K-1M	$$$

Prerequisites

Python 3.11+
Ollama — for local model inference
Node.js 18+ — for building the React frontend
Docker (server mode only)

Quick Start

Desktop Mode (Single User)

# Clone the repo
git clone https://github.com/uicacer/STREAM.git
cd STREAM

# Install Ollama and pull required models
ollama pull llama3.2:3b
ollama pull gemma3:4b

# Install Python dependencies
pip install -e .

# Build the frontend
cd frontends/react && npm install && npm run build && cd ../..

# Run (opens a native window with the chat UI)
python -m stream.desktop.main

Server Mode (Docker Compose)

# Clone and configure
git clone https://github.com/uicacer/STREAM.git
cd STREAM
cp .env.example .env  # Edit with your API keys

# Start all services
docker compose up -d

The UI is available at http://localhost:5000.

Optional Setup

Cloud tier: Add your OpenRouter API key in the UI settings panel (get one free at openrouter.ai/keys)
Lakeshore HPC tier: Requires UIC Lakeshore cluster access and Globus Compute authentication (click "Authenticate with Globus" in the UI)

Demo

Tech Stack

Backend: Python 3.11+, FastAPI, LiteLLM, Globus Compute
Frontend: React 18 + TypeScript, Vite, Zustand, Tailwind CSS
HPC: vLLM, Apptainer, NVIDIA H100
Infrastructure: Docker Compose, PostgreSQL/SQLite, WebSocket relay

Publication

STREAM: Multi-Tier LLM Inference Middleware with Dual-Channel HPC Token Streaming Anas Nassar, Steve Mohr, Leonard Apanasevich, Himanshu Sharma PEARC '26: Practice and Experience in Advanced Research Computing, July 2026

Datasets and Models

The routing benchmark dataset and fine-tuned complexity classifier are published on Hugging Face:

Dataset: anasnassar/llm-query-complexity-benchmark — labeled queries across 6 domains (general knowledge, science, mathematics, humanities, computer science, research computing), LOW/MEDIUM/HIGH complexity classes
Classifier: anasnassar/llm-query-complexity-classifier — ModernBERT-base fine-tuned on LLM-generated labels for local, privacy-preserving query routing

License

Apache License 2.0. See LICENSE for details.

Author

Anas Nassar (nassar@uic.edu) — Advanced Cyberinfrastructure for Education and Research (ACER), University of Illinois Chicago

Name	Name	Last commit message	Last commit date
Latest commit History 105 Commits 105 Commits
.github	.github
assets	assets
docs	docs
frontends	frontends
scripts	scripts
stream	stream
tests	tests
.dockerignore	.dockerignore
.env.example	.env.example
.gitignore	.gitignore
.gitleaks.toml	.gitleaks.toml
.pre-commit-config.yaml	.pre-commit-config.yaml
CITATION.cff	CITATION.cff
Dockerfile.middleware	Dockerfile.middleware
Dockerfile.proxy	Dockerfile.proxy
LICENSE	LICENSE
README.md	README.md
docker-compose.gpu.yml	docker-compose.gpu.yml
docker-compose.yml	docker-compose.yml
pyproject.toml	pyproject.toml
stream.spec	stream.spec
uv.lock	uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

STREAM: Multi-Tier LLM Inference Middleware

Key Features

Architecture

Three Inference Tiers

Prerequisites

Quick Start

Desktop Mode (Single User)

Server Mode (Docker Compose)

Optional Setup

Demo

Tech Stack

Publication

Datasets and Models

License

Author

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Search code, repositories, users, issues, pull requests...

Folders and files

Latest commit

History

Repository files navigation

STREAM: Multi-Tier LLM Inference Middleware

Key Features

Architecture

Three Inference Tiers

Prerequisites

Quick Start

Desktop Mode (Single User)

Server Mode (Docker Compose)

Optional Setup

Demo

Tech Stack

Publication

Datasets and Models

License

Author

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages