vLLM Setup for NVIDIA DGX Spark (Blackwell GB10)

One-command installation of vLLM for NVIDIA DGX Spark systems with GB10 GPUs (Blackwell architecture, sm_121).

This repository provides a ready-to-run setup script, tested on DGX Spark, that handles all the complexities of building vLLM on the DGX Spark platform, including:

  • CUDA 13.0 support with Blackwell-specific optimizations
  • Critical fixes for SM100/SM120 MOE kernel compilation
  • Triton 3.5.0 from main branch (required for sm_121a support)
  • PyTorch 2.9.0 with CUDA 13.0 bindings
  • All necessary build fixes and workarounds

Quick Start

One-command installation - installs to ./vllm-install in your current directory:

curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash

Or specify a custom directory:

curl -fsSL https://raw.githubusercontent.com/eelbaz/dgx-spark-vllm-setup/main/install.sh | bash -s -- --install-dir ~/my/custom/path

Installation time: ~20-30 minutes (mostly compilation)

Alternative: Clone and Install

git clone https://github.com/eelbaz/dgx-spark-vllm-setup.git
cd dgx-spark-vllm-setup
./install.sh

Installation Options

./install.sh [OPTIONS]

Options:
  --install-dir DIR    Installation directory (default: ./vllm-install)
  --vllm-version TAG   vLLM git tag/branch (default: v0.11.1rc3)
  --python-version VER Python version (default: 3.12)
  --skip-tests         Skip post-installation tests
  --help               Show help message
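
For example, to install to a custom directory, pin the vLLM tag, and skip the post-install tests (the directory path here is just illustrative):

./install.sh --install-dir ~/vllm-blackwell --vllm-version v0.11.1rc3 --skip-tests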

System Requirements

  • Hardware: NVIDIA DGX Spark with GB10 GPU (Blackwell sm_121)
  • OS: Ubuntu 22.04+ (tested on Linux 6.11.0 ARM64)
  • CUDA: 13.0 or later (driver 580.95.05+)
  • Disk Space: ~50GB free
  • RAM: 8GB+ recommended during build
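
A quick pre-flight check before installing (an optional sketch, assuming nvidia-smi, nvcc, and df are on your PATH; expected values come from the list above):

nvidia-smi --query-gpu=name,driver_version --format=csv   # expect a GB10 GPU and driver 580.95.05+
nvcc --version                                            # expect CUDA 13.0 or later
df -h .                                                   # confirm ~50GB free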

What Gets Installed

Installed to ./vllm-install (or your custom directory):

  • Python 3.12 virtual environment at .vllm/
  • PyTorch 2.9.0+cu130 with full CUDA 13.0 support
  • Triton 3.5.0+git from main branch (pre-release with Blackwell support)
  • vLLM 0.11.1rc3+ with all Blackwell-specific patches
  • Helper scripts for managing vLLM server
  • Environment activation script (vllm_env.sh)

Usage

All examples assume you're in the installation directory (default: ./vllm-install).

Activate Environment

cd vllm-install
source vllm_env.sh

Start vLLM Server

./vllm-serve.sh                                    # Default: Qwen2.5-0.5B on port 8000
./vllm-serve.sh "facebook/opt-125m" 8001          # Custom model and port

Check Server Status

./vllm-status.sh

Stop Server

./vllm-stop.sh

Test API

# List models
curl http://localhost:8000/v1/models

# Generate completion
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'
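
The server also exposes the OpenAI-compatible chat endpoint, which is usually a better fit for instruct-tuned models (same model and port as above):

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 50
  }'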

Python API

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.9
)

prompts = ["Tell me about DGX Spark"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

print(outputs[0].outputs[0].text)

Critical Fixes Applied

This installer automatically applies the following critical fixes:

1. CMakeLists.txt SM100/SM120 MOE Kernel Fix

Issue: vLLM's MOE kernels for SM100/SM120 Blackwell architectures were incomplete.
Fix: Added 12.0f and 12.1a to SCALED_MM_ARCHS in CMakeLists.txt.

# CUDA 13.0+ path (line ~671)
# Before
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
# After
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")

# Older CUDA path (line ~673)
# Before
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
# After
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;12.1a" "${CUDA_ARCHS}")
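
A quick way to confirm the patch landed in your vLLM checkout (an optional check, not part of the installer):

grep -n "SCALED_MM_ARCHS" CMakeLists.txt | grep -E "12\.0f|12\.1a"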

2. pyproject.toml License Field Format

Issue: Newer setuptools requires a structured license format.
Fix: Convert the license string to dict format in both vLLM and flashinfer-python.

# Before
license = "Apache-2.0"
license-files = ["LICENSE"]

# After
license = {text = "Apache-2.0"}

Applied to:

  • vLLM's pyproject.toml
  • flashinfer-python's pyproject.toml (patched during build)
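
A minimal sketch of this patch as shell commands (the installer's real logic lives in scripts/apply-fixes.sh; this assumes the pyproject.toml lines match the "Before" block exactly). It swaps the string form for the dict form and drops the now-redundant license-files key:

sed -i 's/^license = "Apache-2.0"$/license = {text = "Apache-2.0"}/' pyproject.toml
sed -i '/^license-files = \["LICENSE"\]$/d' pyproject.toml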

3. GPT-OSS Triton MOE Kernels for Qwen3/gpt-oss Support

Issue: vLLM's GPT-OSS MOE kernel implementation uses a deprecated Triton routing API.
Fix: Update to the new Triton kernel API (topk and SparseMatrix).

Changes:

  • Replace deprecated routing() with triton_topk()
  • Replace deprecated routing_from_bitmatrix() with SparseMatrix()
  • Add support for GatherIndx, ScatterIndx, and new ragged tensor metadata

Enables support for:

  • Qwen3 models with MOE architecture
  • gpt-oss models using Triton kernels
  • Latest Triton kernel optimizations for Blackwell
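
As a quick smoke test of the patched MOE path, you can point the serve script at a Qwen3 MOE checkpoint (model ID and memory headroom assumed; any MOE model exercising these kernels would do):

./vllm-serve.sh "Qwen/Qwen3-30B-A3B" 8000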

4. Triton Main Branch Requirement

Issue: The official Triton 3.5.0 release has bugs with sm_121a.
Fix: Build Triton from the main branch with the latest Blackwell fixes.

Architecture-Specific Configuration

The installer sets these critical environment variables:

TORCH_CUDA_ARCH_LIST=12.1a                      # Blackwell sm_121
VLLM_USE_FLASHINFER_MXFP4_MOE=1                 # Enable FlashInfer MOE optimization
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas     # CUDA PTX assembler
TIKTOKEN_CACHE_DIR=$INSTALL_DIR/.tiktoken_cache # Cache tiktoken encodings locally
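
If you work outside the provided scripts, the same variables can be exported manually; a sketch of an equivalent shell setup (the installer normally handles this for you, and $PWD stands in for $INSTALL_DIR):

export TORCH_CUDA_ARCH_LIST=12.1a
export VLLM_USE_FLASHINFER_MXFP4_MOE=1
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export TIKTOKEN_CACHE_DIR="$PWD/.tiktoken_cache"   # assumes you are in the install directory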

Cluster Mode Setup

To set up a multi-node vLLM cluster:

  1. Run this installer on all nodes
  2. Follow CLUSTER.md for configuration

Troubleshooting

Build Fails with "TypeError: can only concatenate str (not 'NoneType') to str"

This is a known Triton editable-mode build issue. The installer works around this by:

  • Building Triton in non-editable mode
  • Or copying pre-built Triton from another node

Symbol Error: cutlass_moe_mm_sm100

Symptom: ImportError: undefined symbol: _Z20cutlass_moe_mm_sm100
Solution: Ensure the CMakeLists.txt fix is applied (done automatically by the installer).
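
To surface the error directly (vllm._C is vLLM's compiled extension module; this check is not part of the installer):

python -c "import vllm._C" && echo "native extension loads OK"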

PyTorch CUDA Capability Warning

Symptom: A warning that GPU capability 12.1 exceeds PyTorch's supported maximum of 12.0.
Status: Harmless warning; PyTorch 2.9.0+cu130 works correctly with GB10.
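
An optional sanity check that PyTorch sees the GPU despite the warning:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_capability())"
# Expected on GB10: True (12, 1)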

ImportError: No module named 'vllm'

Solution:

source vllm-install/vllm_env.sh
python -c "import vllm; print(vllm.__version__)"

File Structure

vllm-install/
├── .vllm/              # Python virtual environment
├── vllm/               # vLLM source (editable install)
├── triton/             # Triton source
├── vllm_env.sh         # Environment activation script
├── vllm-serve.sh       # Start server
├── vllm-stop.sh        # Stop server
├── vllm-status.sh      # Check status
└── vllm-server.log     # Server logs

Manual Installation

If you prefer to understand each step:

# 1. Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# 2. Create installation directory and Python virtual environment
mkdir -p vllm-install && cd vllm-install
uv venv .vllm --python 3.12
source .vllm/bin/activate

# 3. Install PyTorch with CUDA 13.0
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

# 4. Clone and build Triton from main
git clone https://github.com/triton-lang/triton.git
cd triton
uv pip install pip cmake ninja pybind11
TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas python -m pip install --no-build-isolation .

# 5. Install additional dependencies
uv pip install xgrammar setuptools-scm apache-tvm-ffi==0.1.0b15 --prerelease=allow

# 6. Clone vLLM
cd ..
git clone --recursive https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.11.1rc3

# 7. Apply fixes (see scripts/apply-fixes.sh)
# 8. Build vLLM (see install.sh for full process)

Version Information

  • vLLM: 0.11.1rc4.dev6+g66a168a19.d20251026
  • PyTorch: 2.9.0+cu130
  • Triton: 3.5.0+git4caa0328
  • CUDA: 13.0
  • Python: 3.12.3
  • Target Architecture: sm_121 (Blackwell GB10)
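
To confirm what is actually installed in your environment (after sourcing vllm_env.sh):

python -c "import vllm, torch, triton; print(vllm.__version__, torch.__version__, triton.__version__)"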

Contributing

Issues and pull requests welcome! This installer is maintained by the DGX Spark community.

License

MIT License - See LICENSE

Acknowledgments

Developed and tested on NVIDIA DGX Spark systems. Special thanks to the vLLM and Triton communities.
