diff --git a/.gitignore b/.gitignore
index 78e1837..d230f33 100644
--- a/.gitignore
+++ b/.gitignore
@@ -160,8 +160,4 @@ Thumbs.db
*.bak
*.backup
-# Claude specific files
-CLAUDE.md
-
-# Installation scripts
-scripts/install_cuda_nvidia.sh
+results/
diff --git a/README.md b/README.md
index 7269f2d..380bb7a 100644
--- a/README.md
+++ b/README.md
@@ -6,11 +6,10 @@
# Start massive AI/ML container images 10x faster with lazy-loading snapshotter
+[](https://join.slack.com/t/tensorfusecommunity/shared_invite/zt-30r6ik3dz-Rf7nS76vWKOu6DoKh5Cs5w)
+[](https://tensorfuse.io/docs/blogs/blog)
-

-

-
-[Installation](#install-fastpull-on-a-vm) • [Results](#understanding-test-results)
+[Installation](#install-fastpull-on-a-vm) • [Results](#understanding-test-results) • [Detailed Usage](docs/fastpull.md)
@@ -29,25 +28,26 @@ AI/ML container images like CUDA, vLLM, and sglang are large (10 GB+). Tradition
#### The Solution
-Fastpull uses lazy-loading to pull only the files needed to start the container, then fetches remaining layers on demand. This accelerates start times by 10x. See the results below:
+Fastpull uses lazy-loading to pull only the files needed to start the container, then fetches remaining layers on demand. This accelerates start times by 10x. See the results below:
+You can now:
+- [Install Fastpull on a VM](#install-fastpull-on-a-vm)
+- [Install Fastpull on Kubernetes](#install-fastpull-on-a-kubernetes-cluster)
+
For more information, check out the [fastpull blog release](https://tensorfuse.io/docs/blogs/reducing_gpu_cold_start).
---
## Install fastpull on a VM
-> **Note:** For Kubernetes installation, [contact us](mailto:agam@tensorfuse.io) for early access to our helm chart.
-
### Prerequisites
-- Debian or Ubuntu VM with GPU
-- Docker and CUDA driver installed
-- Registry authentication configured (GAR, ECR, etc.)
+- VM image: Works on Debian 12+, Ubuntu, and AL2023 VMs with a GPU; mileage on other AMIs may vary.
+- Python>=3.10, pip, python3-venv, [Docker](https://docs.docker.com/engine/install/), [CUDA drivers](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/), [Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed
### Installation Steps
@@ -56,81 +56,191 @@ For more information, check out the [fastpull blog release](https://tensorfuse.i
```bash
git clone https://github.com/tensorfuse/fastpull.git
cd fastpull/
-sudo python3 scripts/install_snapshotters.py
-
-# Verify installation
-sudo systemctl status nydus-snapshotter-fuse.service
+sudo python3 scripts/setup.py
```
You should see: **"✅ Fastpull installed successfully on your VM"**
**2. Run containers**
-Fastpull requires your images to be in a special format. You can either choose from our template of pre-built images like vLLM, TensorRT, and SGlang or build your own using a Dockerfile.
+Fastpull requires your images to be in a special format. You can either choose from our template of pre-built images like vLLM, TensorRT, and SGlang or build your own using a Dockerfile.
-Option A: Use pre-built images
+#### Use pre-built images
Test with vLLM, TensorRT, or Sglang:
```bash
-python3 scripts/benchmark/test-bench-vllm.py \
- --image public.ecr.aws/s6z9f6e5/tensorfuse/fastpull/vllm:latest-nydus \
- --snapshotter nydus
+fastpull quickstart tensorrt
+fastpull quickstart vllm
+fastpull quickstart sglang
```
-Option B: Build custom images
+Each of these runs twice: once with fastpull optimisations and once the way Docker normally runs it.
+After the quickstart runs are complete, we also run `fastpull clean --all`, which cleans up the downloaded images.
+
+#### Build custom images
+
+First, authenticate with your registry.
+For ECR:
+```
+aws configure;
+aws ecr get-login-password --region us-east-1 | sudo nerdctl login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
+
+```
+
+For GAR:
+```
+gcloud auth login;
+gcloud auth print-access-token | sudo nerdctl login -docker.pkg.dev --username oauth2accesstoken --password-stdin
+```
+For Dockerhub:
+```
+sudo docker login
+```
+
+Build and push from your Dockerfile:
+
+> [!NOTE]
+> - We support --registry gar, --registry ecr, --registry dockerhub
+> - For ``, you can use any name that's convenient, ex: `v1`, `latest`
+> - 2 images are created, one is the overlayfs with tag:`` and another is the fastpull image with tag: `-fastpull`
-Build from your Dockerfile:
```bash
-# Build image
-python3 scripts/build.py --dockerfile
+# Build and push image
+fastpull build --registry --dockerfile-path --repository-url :
+```
+
+### Benchmarking with Fastpull
+
+To get the run time for your container, you can use either:
-# Push to registry
-python3 scripts/push.py \
- --registry_type \
- --account_id
+Completion Time
-# Run with fastpull
-python3 scripts/fastpull.py --image
+Use if the workload has a defined end point
+```
+fastpull run --benchmark-mode completion [--FLAGS] :
+fastpull run --benchmark-mode completion --mode normal [--FLAGS] :
```
+Server Endpoint Readiness Time
----
+Use if you're starting a server that responds with HTTP 200 once it is up
+```
+fastpull run --benchmark-mode readiness --readiness-endpoint localhost:/ [--FLAGS] :
+fastpull run --benchmark-mode readiness --readiness-endpoint localhost:/ --mode normal [--FLAGS] :
+```
+
+> [!NOTE]
+> - When benchmarking readiness, you must publish the right port, e.g. `-p 8000:8000`, and use `--readiness-endpoint localhost:8000/health`
+> - Use `--mode normal` to run plain Docker; without this flag, the container runs with fastpull optimisations
+> - For `[--FLAGS]` you can use any docker compatible flags, ex. `--gpus all`, `-p PORT:PORT`, `-v `
+> - If using GPUs, make sure you add `--gpus all` as a fastpull run flag
-## Understanding Test Results
+#### Cleaning after a run
+
+To get the right cold start numbers, run the clean command after each run:
+```
+fastpull clean --all
+```
-Results show timing breakdown across startup phases:
+### Understanding Test Results
-- **Time to first log:** Container start to entrypoint execution
-- **First log to model download start:** Initialization time
-- **Model download time:** Downloading weights (e.g., Qwen-3-8b, 16GB)
-- **Model load time:** Loading weights into GPU
-- **CUDA compilation/graph capture:** Optimization phase
-- **Total end-to-end time:** Container start to server ready
+Results show the startup and completion/readiness times:
Example Output
```bash
-=== VLLM TIMING SUMMARY ===
-Container Startup Time: 2.145s
-Container to First Log: 15.234s
-Engine Initialization: 45.123s
-Weights Download Start: 67.890s
-Weights Download Complete: 156.789s
-Weights Loaded: 198.456s
-Graph Capture Complete: 245.678s
-Server Ready: 318.435s
-Total Test Time: 325.678s
-
-BREAKDOWN:
-Container to First Log: 15.234s
-First Log to Weight Download Start: 52.656s
-Weight Download Start to Complete: 88.899s
-Weight Download Complete to Weights Loaded: 41.667s
-Weights Loaded to Server Ready: 119.979s
+==================================================
+BENCHMARK SUMMARY
+==================================================
+Time to Container Start: 141.295s
+Time to Readiness: 329.367s
+Total Elapsed Time: 329.367s
+==================================================
```
+---
+
+## Install fastpull on a Kubernetes Cluster
+
+### Prerequisites
+- Tested on GKE
+- Tested with COS Operating System for the nodes
+
+### Installation
+1. In your K8s cluster, create a GPU Nodepool. For GKE, ensure Workload Identity is enabled on your cluster
+2. Install Nvidia GPU drivers. For COS:
+```bash
+kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
+```
+3. Install containerd config updater daemonset: `kubectl apply -f https://raw.githubusercontent.com/tensorfuse/fastpull-gke/main/containerd-daemonset.yaml`
+4. Install the [Helm Chart](https://hub.docker.com/repository/docker/tensorfuse/fastpull-snapshotter/general). For COS:
+```bash
+helm upgrade --install fastpull-snapshotter oci://registry-1.docker.io/tensorfuse/fastpull-snapshotter \
+--version 0.0.10-gke-helm \
+--create-namespace \
+--namespace fastpull-snapshotter \
+--set 'tolerations[0].key=nvidia.com/gpu' \
+--set 'tolerations[0].operator=Equal' \
+--set 'tolerations[0].value=present' \
+--set 'tolerations[0].effect=NoSchedule' \
+--set 'affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=cloud.google.com/gke-accelerator' \
+--set 'affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=Exists'
+```
+5. Build your images in one of two ways:
+
+ a. On a standalone VM, preferably running Ubuntu, [install fastpull](#installation-steps) and [build your image](#build-custom-images)
+
+ b. Build in a container:
+
+ First, authenticate to your registry and make sure `~/.docker/config.json` is updated:
+ ```bash
+ #for aws
+ aws configure
+ aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
+ #for gcp
+ gcloud auth login
+ gcloud auth print-access-token | sudo nerdctl login -docker.pkg.dev --username oauth2accesstoken --password-stdin
+ ```
+ Then build using our image:
+ ```bash
+ docker run --rm --privileged \
+ -v /path/to/dockerfile-dir:/workspace:ro \
+ -v ~/.docker/config.json:/root/.docker/config.json:ro \
+ tensorfuse/fastpull-builder:latest \
+ REGISTRY/REPO/IMAGE:TAG
+ ```
+ This creates `IMAGE:TAG` (normal) and `IMAGE:TAG-fastpull` (fastpull-optimized). Use the `-fastpull` tag in your pod spec. See [builder documentation](scripts/builder/README.md) for details.
+
+6. Create the pod spec for the image you built. For COS, use a pod spec like this:
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+ name: gpu-test-a100-fastpull
+spec:
+ tolerations:
+ - operator: Exists
+ nodeSelector:
+ cloud.google.com/gke-accelerator: nvidia-tesla-a100 # Use your GPU Type
+ runtimeClassName: runc-fastpull
+ containers:
+ - name: debug-container
+ image: IMAGE_PATH:-fastpull # USE FASTPULL IMAGE
+ resources:
+ limits:
+ nvidia.com/gpu: 1
+ env:
+ - name: LD_LIBRARY_PATH
+ value: /usr/local/cuda/lib64:/usr/local/nvidia/lib64 # NOTE: This path may vary depending on the base image
+```
+7. Run a pod with this spec:
+```bash
+kubectl apply -f .yaml
+```
+
+
---
@@ -145,4 +255,4 @@ We welcome contributions! Submit a Pull Request or join our [Slack community](ht
[](https://opensource.org/licenses/MIT)
-
\ No newline at end of file
+
diff --git a/docs/fastpull.md b/docs/fastpull.md
new file mode 100644
index 0000000..cf3e839
--- /dev/null
+++ b/docs/fastpull.md
@@ -0,0 +1,312 @@
+# FastPull CLI - Quick Reference
+
+`fastpull` is the new unified command-line interface for building and running containers with lazy-loading snapshotters.
+
+## Installation
+
+The setup script automatically detects your OS (Ubuntu/Debian/RHEL/CentOS/Fedora) and installs all dependencies including `python3-venv` and `wget`.
+
+```bash
+# Full installation (containerd + Nydus + CLI)
+sudo python3 scripts/setup.py
+
+# Install only CLI (if containerd/Nydus already installed)
+sudo python3 scripts/setup.py --cli-only
+
+# Verify installation
+fastpull --version
+```
+
+**Supported Package Managers:**
+- `apt` (Ubuntu/Debian)
+- `yum` (RHEL/CentOS 7)
+- `dnf` (RHEL/CentOS 8+/Fedora)
+
+## Commands
+
+### `fastpull quickstart` - Quick Benchmark Comparisons
+
+Run pre-configured benchmarks to quickly compare snapshotter performance.
+
+#### Available Workloads
+
+**TensorRT:**
+```bash
+sudo fastpull quickstart tensorrt
+sudo fastpull quickstart tensorrt --output-dir ./results
+```
+
+**vLLM:**
+```bash
+sudo fastpull quickstart vllm
+sudo fastpull quickstart vllm --output-dir ./results
+```
+
+**SGLang:**
+```bash
+sudo fastpull quickstart sglang
+sudo fastpull quickstart sglang --output-dir ./results
+```
+
+Each quickstart automatically:
+1. Runs with FastPull mode (Nydus snapshotter)
+2. Runs with Normal mode (OverlayFS snapshotter)
+3. Measures time to readiness as the startup benchmark
+4. **Auto-cleans containers and images after completion**
+
+---
+
+### `fastpull run` - Run Containers with Benchmarking
+
+Run containers with FastPull (Nydus) or Normal (OverlayFS) mode.
+
+#### Basic Usage
+
+```bash
+# Run with FastPull mode (default, auto-adds -nydus suffix to tag)
+fastpull run myapp:latest
+
+# Run with Normal mode (OverlayFS, no suffix)
+fastpull run --mode normal myapp:latest
+
+# Run with GPU support
+fastpull run myapp:latest --gpus all -p 8080:8080
+```
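
The comments above say FastPull mode appends a suffix to the image tag. As a minimal sketch of what such a rewrite could look like, the helper below appends a suffix to the tag while leaving a registry port untouched. `add_tag_suffix` is a hypothetical name, not part of the fastpull CLI, and the sketch assumes tag-based references only (no `@sha256:` digests).

```python
def add_tag_suffix(image, suffix):
    """Append `-suffix` to the tag of `image`, defaulting the tag to 'latest'."""
    # Only a colon after the last '/' separates the tag; an earlier colon
    # belongs to a registry port such as localhost:5000.
    head, slash, last = image.rpartition("/")
    name, colon, tag = last.partition(":")
    tag = tag if colon else "latest"
    return f"{head}{slash}{name}:{tag}-{suffix}"


print(add_tag_suffix("myapp:latest", "nydus"))
print(add_tag_suffix("localhost:5000/team/app", "nydus"))
```

Splitting on the last `/` first is what keeps `localhost:5000` from being mistaken for a name and tag pair.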
+
+#### Benchmarking Modes
+
+**Readiness Mode** - Poll HTTP endpoint until 200 response:
+```bash
+fastpull run \
+ myapp:latest \
+ --benchmark-mode readiness \
+ --readiness-endpoint http://localhost:8080/health \
+ -p 8080:8080
+```
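
To make the measurement concrete, here is a minimal sketch of readiness polling. It is not fastpull's implementation: `wait_until_ready` is a hypothetical helper, and the default timeout and poll interval are assumptions (they mirror the 20 minute timeout and 0.1 s interval used by the removed `scripts/benchmark/benchmark_base.py`).

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout_s=1200.0, interval_s=0.1):
    """Poll `url` until it answers HTTP 200; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            # Connection refused, timed out, or a non-2xx reply: not ready yet.
            pass
        time.sleep(interval_s)
    return None
```

A caller would start the container first and then call something like `wait_until_ready("http://localhost:8080/health")`. The returned value corresponds to the idea behind "Time to Readiness", measured from the moment polling begins.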
+
+**Completion Mode** - Wait for container to exit:
+```bash
+fastpull run \
+ myapp:latest \
+ --benchmark-mode completion
+```
+
+**Export Metrics** - Save results to JSON:
+```bash
+fastpull run \
+ myapp:latest \
+ --benchmark-mode readiness \
+ --readiness-endpoint http://localhost:8080/health \
+ --output-json results.json \
+ -p 8080:8080
+```
+
+#### Supported Flags
+
+- `--mode` - Run mode: nydus (default, adds -nydus suffix), normal (overlayfs, no suffix)
+- `IMAGE` - Container image to run (positional argument, required)
+- `--benchmark-mode` - Options: none, completion, readiness (default: none)
+- `--readiness-endpoint` - HTTP endpoint for health checks
+- `--output-json` - Export metrics to JSON file
+- `--name` - Container name
+- `-p, --publish` - Publish ports (repeatable)
+- `-e, --env` - Environment variables (repeatable)
+- `-v, --volume` - Bind mount volumes (repeatable)
+- `--gpus` - GPU devices (e.g., "all")
+- `--rm` - Auto-remove container on exit
+- `-d, --detach` - Run in background
+
+**Note:** Any additional arguments after the image are passed through to nerdctl.
+
+#### Pass-through Examples
+
+```bash
+# Custom entrypoint
+fastpull run myapp:latest --entrypoint /bin/bash
+
+# Command override
+fastpull run myapp:latest python script.py --arg1 value1
+
+# Additional nerdctl flags
+fastpull run myapp:latest --privileged --network host
+```
+
+---
+
+### `fastpull build` - Build and Push Images in Multiple Formats
+
+Build Docker and snapshotter-optimized images, then push to registry.
+
+#### Basic Usage
+
+```bash
+# Build Docker and Nydus (default) and push
+fastpull build --dockerfile-path ./app --repository-url myapp:latest
+
+# Build specific formats
+fastpull build \
+ --dockerfile-path ./app \
+ --repository-url myapp:v1 \
+ --format docker,nydus
+```
+
+#### Build Options
+
+```bash
+# No cache
+fastpull build --dockerfile-path ./app --repository-url myapp:latest --no-cache
+
+# With build arguments
+fastpull build \
+ --dockerfile-path ./app \
+ --repository-url myapp:latest \
+ --build-arg VERSION=1.0 \
+ --build-arg ENV=prod
+
+# Custom Dockerfile
+fastpull build \
+ --dockerfile-path ./app \
+ --repository-url myapp:latest \
+ --dockerfile Dockerfile.prod
+```
+
+#### Supported Flags
+
+- `--dockerfile-path` - Path to Dockerfile directory (required)
+- `--repository-url` - Full image reference including registry, repository, and tag (required)
+- `--format` - Comma-separated formats: docker, nydus (default: docker,nydus)
+- `--no-cache` - Build without cache
+- `--build-arg` - Build arguments (repeatable)
+- `--dockerfile` - Dockerfile name (default: Dockerfile)
+
+**Note:** Images are automatically pushed to the registry after building.
+
+---
+
+### `fastpull clean` - Remove Local Images and Artifacts
+
+Clean up local container images and stopped containers.
+
+#### Basic Usage
+
+```bash
+# Clean all images and containers (requires confirmation)
+fastpull clean --all
+
+# Clean only images
+fastpull clean --images
+
+# Clean only stopped containers
+fastpull clean --containers
+
+# Target specific snapshotter
+fastpull clean --all --snapshotter nydus
+fastpull clean --all --snapshotter overlayfs
+
+# Dry run to see what would be removed
+fastpull clean --all --dry-run
+
+# Force removal without confirmation
+fastpull clean --all --force
+```
+
+#### Supported Flags
+
+- `--images` - Remove all images
+- `--containers` - Remove stopped containers
+- `--all` - Remove both images and containers
+- `--snapshotter` - Target specific snapshotter: nydus, overlayfs, all (default: all)
+- `--dry-run` - Show what would be removed without removing
+- `--force` - Force removal without confirmation
+
+---
+
+## Complete Workflow Example
+
+```bash
+# 1. Build and push images in multiple formats
+fastpull build \
+ --dockerfile-path ./my-app \
+ --repository-url 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.0 \
+ --format docker,nydus
+
+# 2. Run with benchmarking (FastPull mode, auto-adds -nydus suffix)
+fastpull run \
+ 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.0 \
+ --benchmark-mode readiness \
+ --readiness-endpoint http://localhost:8000/health \
+ --output-json benchmark-results.json \
+ -p 8000:8000 \
+ --gpus all
+```
+
+---
+
+## Benchmarking Metrics
+
+When using `--benchmark-mode`, fastpull tracks:
+
+1. **Time to Container Start** - Using `ctr events` to monitor container lifecycle
+2. **Time to Readiness/Completion**:
+ - **Readiness mode**: Polls HTTP endpoint until 200 response
+ - **Completion mode**: Waits for container to exit
+
+Example output:
+
+**FastPull mode (Nydus):**
+```
+==================================================
+FASTPULL BENCHMARK SUMMARY
+==================================================
+Time to Container Start: 2.34s
+Time to Readiness: 45.67s
+Total Elapsed Time: 48.01s
+==================================================
+```
+
+**Normal mode (OverlayFS):**
+```
+==================================================
+NORMAL BENCHMARK SUMMARY
+==================================================
+Time to Container Start: 13.64s
+Time to Readiness: 387.77s
+Total Elapsed Time: 387.77s
+==================================================
+```
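
To compare two runs, the summary text can be parsed directly. The sketch below is an illustration only: `parse_summary` is a hypothetical helper, and it assumes the `Label: 1.23s` line shape shown in the example output above. The schema written by `--output-json` is not documented here, so the sketch works on the printed text instead.

```python
import re

# Matches lines such as "Time to Readiness: 45.67s"; separator lines are ignored.
SUMMARY_LINE = re.compile(r"^(?P<label>[A-Za-z /]+):\s+(?P<seconds>\d+(?:\.\d+)?)s$")


def parse_summary(text):
    """Return {label: seconds} for every 'Label: 1.23s' line in a summary block."""
    metrics = {}
    for raw in text.splitlines():
        match = SUMMARY_LINE.match(raw.strip())
        if match:
            metrics[match.group("label").strip()] = float(match.group("seconds"))
    return metrics


fastpull = parse_summary("""
Time to Container Start: 2.34s
Time to Readiness:       45.67s
Total Elapsed Time:      48.01s
""")
normal = parse_summary("""
Time to Container Start: 13.64s
Time to Readiness:       387.77s
Total Elapsed Time:      387.77s
""")

speedup = normal["Time to Readiness"] / fastpull["Time to Readiness"]
print(f"Readiness speedup: {speedup:.2f}x")
```

Comparing "Time to Readiness" rather than "Total Elapsed Time" keeps the ratio consistent when the two summaries total their phases differently, as the example outputs above do.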
+
+---
+
+## Uninstallation
+
+```bash
+# Remove fastpull CLI
+sudo python3 scripts/setup.py --uninstall
+```
+
+---
+
+## Backwards Compatibility
+
+The original scripts remain unchanged and continue to work:
+- `scripts/build_push.py`
+- `scripts/benchmark/test-bench-vllm.py`
+- `scripts/benchmark/test-bench-sglang.py`
+- `scripts/install_snapshotters.py`
+
+---
+
+## Service Management
+
+After installation, the Nydus snapshotter service is renamed to `fastpull.service`:
+
+```bash
+# Check status
+systemctl status fastpull.service
+
+# Restart service
+sudo systemctl restart fastpull.service
+
+# View logs
+journalctl -u fastpull.service -f
+```
diff --git a/images/alpine-loop/Dockerfile b/images/alpine-loop/Dockerfile
new file mode 100644
index 0000000..bfcce27
--- /dev/null
+++ b/images/alpine-loop/Dockerfile
@@ -0,0 +1,3 @@
+FROM alpine:latest
+
+CMD ["/bin/sh", "-c", "for i in $(seq 1 1000); do echo \"Iteration $i\"; done; echo \"Loop complete\""]
diff --git a/pyproject.toml b/pyproject.toml
new file mode 100644
index 0000000..17a98c1
--- /dev/null
+++ b/pyproject.toml
@@ -0,0 +1,41 @@
+[build-system]
+requires = ["setuptools>=61.0", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "fastpull"
+version = "0.1.0"
+description = "Accelerate AI/ML container startup with lazy-loading snapshotters"
+readme = "README.md"
+requires-python = ">=3.7"
+license = {text = "MIT"}
+authors = [
+ {name = "TensorFuse", email = "saurabh@tensorfuse.io"}
+]
+keywords = ["containers", "docker", "fastpull", "snapshotter", "ml", "ai"]
+classifiers = [
+ "Development Status :: 4 - Beta",
+ "Intended Audience :: Developers",
+ "Topic :: Software Development :: Build Tools",
+ "License :: OSI Approved :: MIT License",
+ "Programming Language :: Python :: 3",
+ "Programming Language :: Python :: 3.7",
+ "Programming Language :: Python :: 3.8",
+ "Programming Language :: Python :: 3.9",
+ "Programming Language :: Python :: 3.10",
+ "Programming Language :: Python :: 3.11",
+]
+
+[project.urls]
+Homepage = "https://github.com/tensorfuse/fastpull"
+Documentation = "https://github.com/tensorfuse/fastpull/blob/main/docs/fastpull.md"
+Repository = "https://github.com/tensorfuse/fastpull"
+Issues = "https://github.com/tensorfuse/fastpull/issues"
+
+[project.scripts]
+fastpull = "scripts.fastpull.cli:main"
+
+[tool.setuptools.packages.find]
+where = ["."]
+include = ["scripts.fastpull*"]
+exclude = ["docs*", "images*"]
diff --git a/scripts/benchmark/benchmark_base.py b/scripts/benchmark/benchmark_base.py
deleted file mode 100644
index c3f1594..0000000
--- a/scripts/benchmark/benchmark_base.py
+++ /dev/null
@@ -1,782 +0,0 @@
-#!/usr/bin/env python3
-"""
-Generic Benchmark Framework Base Class
-Provides common functionality for all ML application benchmarks.
-"""
-
-import argparse
-import json
-import os
-import queue
-import re
-import requests
-import signal
-import subprocess
-import sys
-import threading
-import time
-from abc import ABC, abstractmethod
-from datetime import datetime, timezone
-from typing import Dict, List, Optional, Tuple
-
-
-def run_command(cmd, check=True, capture_output=False):
- """Run a shell command and handle errors."""
- try:
- if capture_output:
- result = subprocess.run(cmd, shell=True, check=check, capture_output=True, text=True)
- return result.stdout.strip()
- else:
- subprocess.run(cmd, shell=True, check=check)
- except subprocess.CalledProcessError as e:
- print(f"Error running command: {cmd}")
- print(f"Error: {e}")
- if capture_output and e.stdout:
- print(f"Stdout: {e.stdout}")
- if capture_output and e.stderr:
- print(f"Stderr: {e.stderr}")
- raise
-
-
-def check_aws_credentials():
- """Check if AWS credentials are configured."""
- try:
- run_command("aws sts get-caller-identity", capture_output=True)
- print("✓ AWS credentials are configured")
- return True
- except:
- print("Warning: AWS credentials not configured. Please run 'aws configure' first.")
- return False
-
-
-def docker_login_ecr(account=None, region="us-east-1"):
- """Login to ECR using both docker and nerdctl."""
- print("Checking AWS credentials and logging into ECR...")
-
- if not check_aws_credentials():
- print("Skipping ECR login due to missing AWS credentials")
- return False
-
- if not account:
- # Try to get account from AWS STS
- try:
- account_info = run_command("aws sts get-caller-identity --query Account --output text", capture_output=True)
- account = account_info.strip()
- print(f"Auto-detected AWS account: {account}")
- except:
- print("Could not auto-detect AWS account ID")
- return False
-
- try:
- password = run_command(f"aws ecr get-login-password --region {region}", capture_output=True)
- registry = f"{account}.dkr.ecr.{region}.amazonaws.com"
-
- # Login with docker
- login_cmd = f"echo '{password}' | docker login -u AWS --password-stdin {registry}"
- run_command(login_cmd, check=False)
-
- # Login with nerdctl
- login_cmd = f"echo '{password}' | nerdctl login -u AWS --password-stdin {registry}"
- run_command(login_cmd, check=False)
-
- # Login with sudo nerdctl
- login_cmd = f"echo '{password}' | sudo nerdctl login -u AWS --password-stdin {registry}"
- run_command(login_cmd, check=False)
-
- print("✓ Successfully logged into ECR")
- return True
-
- except Exception as e:
- print(f"Warning: Could not login to ECR: {e}")
- return False
-
-
-def construct_ecr_image(repo: str, tag: str, snapshotter: str, region: str = "us-east-1") -> str:
- """Construct ECR image URL from repo, tag, and snapshotter."""
- try:
- # Get AWS account ID
- account_info = run_command("aws sts get-caller-identity --query Account --output text", capture_output=True)
- account = account_info.strip()
-
- # Add snapshotter suffix to tag (except for overlayfs/native which use base tag)
- if snapshotter in ["overlayfs", "native"]:
- final_tag = tag
- else:
- final_tag = f"{tag}-{snapshotter}"
-
- return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{final_tag}"
-
- except Exception as e:
- raise ValueError(f"Could not construct ECR image URL: {e}. Ensure AWS credentials are configured.")
-
-
-class BenchmarkBase(ABC):
- """Abstract base class for all benchmarks."""
-
- def __init__(self, image: str, container_name: str, snapshotter: str = "nydus", port: int = 8080, model_mount_path: str = None):
- self.image = image
- self.container_name = container_name
- self.snapshotter = snapshotter
- self.port = port
- self.model_mount_path = model_mount_path
- self.start_time = None
- self.phases = {}
- self.log_queue = queue.Queue()
- self.should_stop = threading.Event()
-
- # Container events monitoring
- self.ctr_events_queue = queue.Queue()
- self.ctr_events_thread = None
- self.container_create_time = None
- self.container_start_time = None
- self.container_startup_duration = None
-
- # Health endpoint polling
- self.health_thread = None
- self.health_ready_time = None
- self.health_ready_event = threading.Event()
- self.interrupted = False
-
- # Initialize phases from subclass
- self._init_phases()
-
- @abstractmethod
- def _init_phases(self) -> None:
- """Initialize the phases dictionary for the specific application."""
- pass
-
- @abstractmethod
- def analyze_log_line(self, line: str, timestamp: float) -> Optional[str]:
- """Analyze a log line and return detected phase. Must be implemented by subclass."""
- pass
-
-
- @abstractmethod
- def get_default_image(self, snapshotter: str) -> str:
- """Get default image for the snapshotter. Must be implemented by subclass."""
- pass
-
- def get_health_endpoint(self) -> Optional[str]:
- """Get health endpoint for the application. Override in subclasses."""
- return None
-
- def supports_health_polling(self) -> bool:
- """Check if this application supports health endpoint polling. Override in subclasses."""
- return False
-
- def get_elapsed_time(self) -> float:
- """Get elapsed time since start in seconds."""
- if self.start_time is None:
- return 0.0
- return time.time() - self.start_time
-
- def start_ctr_events_monitor(self):
- """Start monitoring containerd events in a separate thread."""
- def monitor_events():
- try:
- cmd = ["sudo", "ctr", "events"]
- process = subprocess.Popen(
- cmd,
- stdout=subprocess.PIPE,
- stderr=subprocess.PIPE,
- text=True,
- bufsize=1,
- universal_newlines=True
- )
-
- while not self.should_stop.is_set():
- line = process.stdout.readline()
- if not line:
- if process.poll() is not None:
- break
- time.sleep(0.1)
- continue
-
- self.ctr_events_queue.put((time.time(), line.strip()))
-
- process.terminate()
- process.wait()
-
- except Exception as e:
- print(f"Error monitoring ctr events: {e}")
-
- self.ctr_events_thread = threading.Thread(target=monitor_events, daemon=True)
- self.ctr_events_thread.start()
- return self.ctr_events_thread
-
- def process_ctr_events(self):
- """Process containerd events to track container lifecycle timing."""
- while not self.should_stop.is_set():
- try:
- timestamp, line = self.ctr_events_queue.get(timeout=1.0)
-
- # Parse containerd event line
- # Format: TIMESTAMP NAMESPACE EVENT_TYPE DATA
- parts = line.split(' ', 3)
- if len(parts) < 4:
- continue
-
- event_timestamp_str = f"{parts[0]} {parts[1]}"
- namespace = parts[2]
- event_type = parts[3]
-
- # Parse the event timestamp
- try:
- # Remove timezone info for parsing, then add it back
- ts_clean = event_timestamp_str.replace(" +0000 UTC", "")
- event_time = datetime.fromisoformat(ts_clean.replace(' ', 'T'))
- event_time = event_time.replace(tzinfo=timezone.utc)
- event_timestamp = event_time.timestamp()
- except:
- event_timestamp = timestamp # Fallback to capture time
-
- # Look for task start event (any task since only one container is running)
- if "/tasks/start" in event_type and self.container_start_time is None:
- self.container_start_time = event_timestamp
- if self.container_create_time:
- self.container_startup_duration = self.container_start_time - self.container_create_time
- elapsed = event_timestamp - self.start_time if self.start_time else 0
- print(f"[{elapsed:.3f}s] ✓ CONTAINER START (startup: {self.container_startup_duration:.3f}s)")
- break # We found what we needed - stop monitoring
-
- except queue.Empty:
- continue
- except KeyboardInterrupt:
- break
-
- def cleanup_container(self):
- """Remove any existing container with the same name."""
- try:
- nerdctl_snapshotter = self.get_nerdctl_snapshotter()
- cmd = ["sudo", "nerdctl", "--snapshotter", nerdctl_snapshotter, "rm", "-f", self.container_name]
- subprocess.run(cmd, capture_output=True, check=False)
- except Exception as e:
- print(f"Warning: Could not cleanup container: {e}")
-
- def start_container(self) -> bool:
- """Start the container and return success status."""
- try:
- # Start ctr events monitoring before container creation
- print("Starting containerd events monitoring...")
- self.start_ctr_events_monitor()
-
- # Start processing events in background
- events_thread = threading.Thread(target=self.process_ctr_events, daemon=True)
- events_thread.start()
-
- # Small delay to ensure events monitoring is ready
- time.sleep(0.5)
-
- nerdctl_snapshotter = self.get_nerdctl_snapshotter()
- cmd = [
- "sudo", "nerdctl", "--snapshotter", nerdctl_snapshotter, "run",
- "--name", self.container_name,
- "--gpus", "all",
- "--detach",
- "--publish", f"{self.port}:8000"
- ]
-
- # Add volume mounts if model mount path is provided
- if self.model_mount_path:
- cmd.extend([
- "--volume", f"{self.model_mount_path}/huggingface:/workspace/huggingface",
- "--volume", f"{self.model_mount_path}/hf-xet-cache:/workspace/hf-xet-cache"
- ])
-
- cmd.append(self.image)
-
- print(f"Running command: {' '.join(cmd)}")
- # Set container creation time just before running nerdctl command
- self.container_create_time = time.time()
- if self.start_time is not None:
- elapsed = self.container_create_time - self.start_time
- print(f"[{elapsed:.3f}s] ✓ CONTAINER CREATE (nerdctl run started)")
- else:
- print("No start time is set")
-
- result = subprocess.run(cmd, capture_output=True, text=True, check=True)
- return True
-
- except subprocess.CalledProcessError as e:
- print(f"Error starting container: {e}")
- print(f"STDERR: {e.stderr}")
- return False
-
- def monitor_logs(self):
- """Monitor container logs in a separate thread."""
- def log_reader():
- try:
- cmd = ["sudo", "nerdctl", "--snapshotter", self.snapshotter, "logs", "-f", self.container_name]
- process = subprocess.Popen(
- cmd,
- stdout=subprocess.PIPE,
- stderr=subprocess.STDOUT,
- text=True,
- bufsize=1,
- universal_newlines=True
- )
-
- while not self.should_stop.is_set():
- line = process.stdout.readline()
- if not line:
- if process.poll() is not None:
- break
- time.sleep(0.1)
- continue
-
- self.log_queue.put((time.time(), line.strip()))
-
- process.terminate()
-
- except Exception as e:
- print(f"Error monitoring logs: {e}")
-
- log_thread = threading.Thread(target=log_reader, daemon=True)
- log_thread.start()
- return log_thread
-
- def start_health_polling(self):
- """Start health endpoint polling in a separate thread."""
- if not self.supports_health_polling():
- return None
-
- def health_poller():
- endpoint = self.get_health_endpoint()
- if not endpoint:
- return
-
- url = f"http://localhost:{self.port}/{endpoint}"
- print(f"Starting health polling for endpoint: {url}")
-
- # Poll with 0.1 second intervals, timeout after 20 minutes
- start_time = time.time()
- timeout = 20 * 60 # 20 minutes
-
- while not self.should_stop.is_set() and not self.health_ready_event.is_set():
- if time.time() - start_time > timeout:
- print(f"Health polling timed out after {timeout}s")
- break
-
- # Check for interrupt
- if self.interrupted:
- print("Health polling interrupted by user")
- break
-
- try:
- response = requests.get(url, timeout=5)
- if response.status_code == 200:
- self.health_ready_time = time.time() - self.start_time
- elapsed = self.health_ready_time
- print(f"[{elapsed:.3f}s] ✓ SERVER READY (HTTP 200)")
-
- # Set server ready time from health check
- self.phases["server_ready"] = self.health_ready_time
- self.health_ready_event.set()
- break
-
- except requests.exceptions.RequestException:
- # Connection failed, server not ready yet
- pass
-
- time.sleep(0.1) # Wait 0.1 seconds before next poll
-
- self.health_thread = threading.Thread(target=health_poller, daemon=True)
- self.health_thread.start()
- return self.health_thread
-
- def process_logs(self, timeout: int = 1200):
- """Process logs and detect phases."""
- print("Monitoring container logs...")
- log_thread = self.monitor_logs()
-
- # Start health polling if supported
- health_thread = None
-
- start_monitoring = time.time()
-
- while time.time() - start_monitoring < timeout:
- try:
- timestamp, line = self.log_queue.get(timeout=1.0)
- elapsed = timestamp - self.start_time
-
- # Detect first log
- if "first_log" in self.phases and self.phases["first_log"] is None:
- self.phases["first_log"] = elapsed
- print(f"[{elapsed:.3f}s] ✓ FIRST LOG")
-
- # Start health polling after first log if supported
- if self.supports_health_polling() and not health_thread:
- health_thread = self.start_health_polling()
-
- phase = self.analyze_log_line(line, timestamp)
-
- if phase:
- print(f"[{elapsed:.3f}s] ✓ {phase.upper().replace('_', ' ')}")
-
- print(f"[{elapsed:.3f}s] {line}")
-
- # Check if we should stop monitoring
- if self._should_stop_monitoring(elapsed):
- break
-
- except queue.Empty:
- # Check if we should stop monitoring even when no new logs
- elapsed = time.time() - self.start_time
- if self._should_stop_monitoring(elapsed):
- break
- continue
- except KeyboardInterrupt:
- print("\nReceived interrupt signal...")
- break
-
- self.should_stop.set()
-
- def _should_stop_monitoring(self, elapsed: float) -> bool:
- """Determine if we should stop monitoring logs. Should be overridden by subclasses."""
- # For applications that support health polling, stop only after health check succeeds
- if self.supports_health_polling():
- return self.health_ready_event.is_set()
- return False
-
-
- def stop_container(self):
- """Stop and remove the container. Wait for health check or timeout first."""
- try:
- # For applications that support health polling, wait for health check or timeout
- # But skip waiting if interrupted by user
- if (self.supports_health_polling() and not self.health_ready_event.is_set()
- and not self.interrupted):
- print("Waiting for health check success or timeout before stopping container...")
- timeout = 20 * 60 # 20 minutes
- if self.health_ready_event.wait(timeout):
- if not self.interrupted:
- print("Health check succeeded, proceeding with container stop")
- else:
- print("Interrupted during health check, proceeding with container stop")
- else:
- print("Health check timed out, proceeding with container stop")
- elif self.interrupted:
- print("Skipping health check wait due to interrupt, proceeding with container stop")
-
- self.should_stop.set()
- # Stop the container
- cmd_stop = ["sudo", "nerdctl", "--snapshotter", self.snapshotter, "stop", self.container_name]
- subprocess.run(cmd_stop, capture_output=True, check=False, timeout=30)
-
- # Wait for container to fully stop
- time.sleep(2)
-
- # Remove the container
- cmd_rm = ["sudo", "nerdctl", "--snapshotter", self.snapshotter, "rm", self.container_name]
- subprocess.run(cmd_rm, capture_output=True, check=False, timeout=60)
-
- # Wait a moment for the container removal to be fully processed
- time.sleep(2)
-
- except Exception as e:
- print(f"Warning: Could not stop/remove container cleanly: {e}")
-
- def get_nerdctl_snapshotter(self) -> str:
- """Get the correct snapshotter name for nerdctl commands."""
- # Map estargz to stargz for nerdctl compatibility
- if self.snapshotter == "estargz":
- return "stargz"
- return self.snapshotter
-
- def cleanup_soci_snapshotter(self):
- """Perform SOCI-specific cleanup: remove state directory and restart service."""
- if self.snapshotter != "soci":
- return
-
- try:
- print("Performing SOCI-specific cleanup...")
-
- # Remove SOCI state directory
- print("Removing SOCI state directory...")
- cmd_rm = ["sudo", "rm", "-rf", "/var/lib/soci-snapshotter-grpc/"]
- result = subprocess.run(cmd_rm, capture_output=True, text=True, check=False, timeout=30)
-
- if result.returncode == 0:
- print("SOCI state directory removed successfully")
- else:
- print(f"Warning: Could not remove SOCI state directory: {result.stderr}")
-
- # Restart SOCI snapshotter service
- print("Restarting SOCI snapshotter service...")
- cmd_restart = ["sudo", "systemctl", "restart", "soci-snapshotter-grpc.service"]
- result = subprocess.run(cmd_restart, capture_output=True, text=True, check=False, timeout=30)
-
- if result.returncode == 0:
- print("SOCI snapshotter service restarted successfully")
- # Give the service a moment to start
- time.sleep(2)
- else:
- print(f"Warning: Could not restart SOCI snapshotter service: {result.stderr}")
-
- except Exception as e:
- print(f"Warning: Could not perform SOCI cleanup: {e}")
-
- def cleanup_images(self):
- """Remove the image to ensure fresh pulls for testing."""
- try:
- print(f"Removing image {self.image} for clean testing...")
-
- nerdctl_snapshotter = self.get_nerdctl_snapshotter()
-
- # First, try with image name/tag
- cmd_rmi = ["sudo", "nerdctl", "--snapshotter", nerdctl_snapshotter, "rmi", self.image]
- result = subprocess.run(cmd_rmi, capture_output=True, text=True, check=False, timeout=60)
-
- if result.returncode == 0:
- print("Image removed successfully")
- return
-
- # If that fails, get the image ID and try with that
- print("Trying to remove by image ID...")
- cmd_images = ["sudo", "nerdctl", "--snapshotter", nerdctl_snapshotter, "images", "--format", "{{.ID}}", self.image]
- images_result = subprocess.run(cmd_images, capture_output=True, text=True, check=False, timeout=30)
-
- if images_result.returncode == 0 and images_result.stdout.strip():
- image_id = images_result.stdout.strip().split('\n')[0]
- cmd_rmi_id = ["sudo", "nerdctl", "--snapshotter", nerdctl_snapshotter, "rmi", image_id]
- id_result = subprocess.run(cmd_rmi_id, capture_output=True, text=True, check=False, timeout=60)
-
- if id_result.returncode == 0:
- print(f"Image removed successfully using ID: {image_id}")
- else:
- print(f"Could not remove image by ID: {id_result.stderr}")
- else:
- print(f"Note: Could not find or remove image: {result.stderr}")
-
- except Exception as e:
- print(f"Warning: Could not remove image: {e}")
-
- def print_summary(self, total_time: float):
- """Print timing summary."""
- print("\n" + "="*50)
- print(f"{self.__class__.__name__.replace('Benchmark', '').upper()} TIMING SUMMARY")
- print("="*50)
-
- for label, value in self._get_summary_items(total_time):
- if label == "":
- print() # Empty line
- elif label.endswith(":") and value is None:
- print(label) # Section header
- elif value is not None:
- print(f"{label:<30} {value:.3f}s")
- else:
- print(f"{label:<30} N/A")
-
- print("="*50)
-
- def _get_summary_items(self, total_time: float) -> List[Tuple[str, Optional[float]]]:
- """Get summary items for printing. Must be overridden by subclasses."""
- items = []
-
- # Add container startup time at the beginning
- items.append(("Container Startup Time:", self.container_startup_duration))
-
- for phase_key, phase_value in self.phases.items():
- label = phase_key.replace('_', ' ').title() + ":"
- items.append((label, phase_value))
- items.append(("Total Test Time:", total_time))
- return items
-
- def run_benchmark(self) -> Dict[str, Optional[float]]:
- """Run the complete benchmark."""
- app_name = self.__class__.__name__.replace('Benchmark', '')
- print(f"=== {app_name} Startup Timing Test ===")
- print(f"Image: {self.image}")
- print(f"Snapshotter: {self.snapshotter}")
- print(f"Port: {self.port}")
- print()
-
- # Check AWS credentials and login to ECR if needed
- if ".ecr." in self.image:
- print("ECR image detected, attempting AWS login...")
- # Use the region supplied via --region (self._region) if set, otherwise the default
- region = "us-east-1" # Default region
- if hasattr(self, '_region'):
- region = self._region
- docker_login_ecr(region=region)
-
- # Cleanup
- print("Cleaning up existing containers...")
- self.cleanup_container()
- self.cleanup_soci_snapshotter()
-
- # Start timing
- self.start_time = time.time()
- start_datetime = datetime.fromtimestamp(self.start_time)
- print(f"Test started at: {start_datetime.strftime('%Y-%m-%d %H:%M:%S')}")
- print()
-
- try:
- # Start container
- print("Starting container...")
- if not self.start_container():
- print("Failed to start container")
- return self.phases
-
- # Wait a moment for container to initialize
- time.sleep(2)
-
- # Monitor logs
- self.process_logs()
-
- except KeyboardInterrupt:
- print("\nBenchmark interrupted by user")
- self.interrupted = True
- self.should_stop.set()
- self.health_ready_event.set() # Stop waiting for health check
- except Exception as e:
- print(f"Error during benchmark: {e}")
- finally:
- # Cleanup
- print("\nCleaning up...")
- self.stop_container()
- if hasattr(self, '_keep_image') and not self._keep_image:
- self.cleanup_images()
- self.cleanup_soci_snapshotter()
-
- # Calculate total time and print summary
- total_time = time.time() - self.start_time
- self.print_summary(total_time)
-
- return self.phases
-
- def create_arg_parser(self, description: str) -> argparse.ArgumentParser:
- """Create standard argument parser for benchmarks."""
- parser = argparse.ArgumentParser(description=description)
-
- # Image specification - either full image or repo + tag
- image_group = parser.add_mutually_exclusive_group()
- image_group.add_argument(
- "--image",
- help="Full container image to test (e.g., registry.com/repo:tag-snapshotter)"
- )
- image_group.add_argument(
- "--repo",
- help="ECR repository name (e.g., my-vllm-app). Will construct full ECR URL automatically"
- )
-
- parser.add_argument(
- "--tag",
- default="latest",
- help="Image tag base (default: latest). Snapshotter suffix will be appended (e.g., latest-nydus)"
- )
- parser.add_argument(
- "--region",
- default="us-east-1",
- help="AWS region for ECR (default: us-east-1)"
- )
- parser.add_argument(
- "--container-name",
- default=f"{self.__class__.__name__.lower().replace('benchmark', '')}-timing-test",
- help="Name for the test container"
- )
- parser.add_argument(
- "--snapshotter",
- default="nydus",
- choices=["nydus", "overlayfs", "native", "soci", "estargz"],
- help="Snapshotter to use"
- )
- parser.add_argument(
- "--port",
- type=int,
- default=self.port,
- help=f"Local port to bind (default: {self.port})"
- )
- parser.add_argument(
- "--model-mount-path",
- help="Path to local SSD directory to mount for model storage (e.g., /mnt/ssd/models)"
- )
- parser.add_argument(
- "--output-json",
- help="Output results to JSON file"
- )
- parser.add_argument(
- "--keep-image",
- action="store_true",
- help="Don't remove image after test (faster for repeated runs)"
- )
- return parser
-
- def save_results(self, results: Dict[str, Optional[float]], output_file: str,
- image: str, snapshotter: str):
- """Save results to JSON file."""
- output_data = {
- "application": self.__class__.__name__.replace('Benchmark', '').lower(),
- "snapshotter": snapshotter,
- "image": image,
- "timestamp": datetime.now().isoformat(),
- "phases": results,
- "container_startup_duration": self.container_startup_duration,
- "health_ready_time": self.health_ready_time,
- "supports_health_polling": self.supports_health_polling()
- }
-
- with open(output_file, 'w') as f:
- json.dump(output_data, f, indent=2)
-
- print(f"\nResults saved to: {output_file}")
-
- def setup_signal_handler(self):
- """Setup graceful interrupt handling."""
- def signal_handler(sig, frame):
- print("\nReceived interrupt signal, cleaning up...")
- self.interrupted = True
- self.should_stop.set()
- self.health_ready_event.set() # Stop waiting for health check
- # Don't exit immediately, let cleanup happen
-
- signal.signal(signal.SIGINT, signal_handler)
-
- def main(self, description: str) -> int:
- """Main execution method for benchmark scripts."""
- parser = self.create_arg_parser(description)
- args = parser.parse_args()
-
- # Determine image to use
- if args.image:
- # Full image provided
- final_image = args.image
- elif args.repo:
- # Construct ECR image from repo + tag + snapshotter
- final_image = construct_ecr_image(args.repo, args.tag, args.snapshotter, args.region)
- print(f"Constructed ECR image: {final_image}")
- else:
- # Fall back to default image from subclass
- final_image = self.get_default_image(args.snapshotter)
-
- # Update instance with parsed arguments
- self.image = final_image
- self.container_name = args.container_name
- self.snapshotter = args.snapshotter
- self.port = args.port
- self.model_mount_path = args.model_mount_path
- self._keep_image = args.keep_image
- self._region = args.region
-
- # Setup signal handling
- self.setup_signal_handler()
-
- # Override image cleanup if requested
- if args.keep_image:
- self.cleanup_images = lambda: print("Keeping image as requested")
-
- # Run benchmark
- results = self.run_benchmark()
-
- # Output JSON if requested
- if args.output_json:
- self.save_results(results, args.output_json, self.image, args.snapshotter)
-
-
- return 0 if self._is_successful(results) else 1
-
-
- def _is_successful(self, results: Dict[str, Optional[float]]) -> bool:
- """Determine if benchmark was successful. Can be overridden by subclasses."""
- # Default: successful if we have first_log timing
- return results.get("first_log") is not None
\ No newline at end of file
diff --git a/scripts/benchmark/test-bench-sglang.py b/scripts/benchmark/test-bench-sglang.py
deleted file mode 100644
index f63978e..0000000
--- a/scripts/benchmark/test-bench-sglang.py
+++ /dev/null
@@ -1,248 +0,0 @@
-#!/usr/bin/env python3
-"""
-SGLang Inference Server Benchmark
-Measures container startup and SGLang readiness times with different snapshotters.
-
-LOG PATTERN DETECTION & PHASES:
-===============================
-
-This benchmark monitors SGLang inference server logs and detects the following phases:
-
-1. SGLANG_INIT (SGLang Framework Initialization)
- - Patterns: "starting sglang", "sglang server", "initializing sglang", "launch_server"
- - Detects: SGLang framework startup and server initialization
-
-2. WEIGHTS_DOWNLOAD (Weight Download Start)
- - Patterns: "load weight begin"
- - Detects: Beginning of model weight loading process
-
-3. WEIGHTS_DOWNLOAD_COMPLETE (Weight Download Complete)
- - Patterns: "loading safetensors checkpoint shards: 0%"
- - Detects: First safetensors checkpoint loading starts
-
-4. WEIGHTS_LOADED (Weights Loaded)
- - Patterns: "load weight end"
- - Detects: Completion of weight loading phase
-
-5. KV_CACHE_ALLOCATED (KV Cache Setup)
- - Patterns: "kv cache is allocated", "kv cache allocated"
- - Detects: Key-value cache memory allocation for inference
-
-6. GRAPH_CAPTURE_BEGIN (CUDA Graph Start)
- - Patterns: "capture cuda graph begin", "capturing cuda graph"
- - Detects: Beginning of CUDA graph capture for optimization
-
-7. GRAPH_CAPTURE_END (CUDA Graph Complete)
- - Patterns: "capture cuda graph end", "cuda graph capture complete"
- - Detects: CUDA graph capture completion
-
-8. SERVER_LOG_READY (Server Log Ready)
- - Patterns: "starting server", "server starting", "uvicorn", "listening on"
- - Detects: HTTP/API server initialization (log-based)
-
-9. SERVER_READY (Server Ready)
- - Tested via HTTP requests to /health_generate endpoint with 0.1s polling
- - Detects: API actually responding with valid HTTP 200 responses
-
-MONITORING BEHAVIOR:
-===================
-- Timeout: 20 minutes (model loading and optimization can be slow)
-- Container Status: Monitors container health during startup
-- Health Polling: Polls /health_generate endpoint every 0.1 seconds after first log
-- Success Criteria: HTTP 200 response from health endpoint
-- Port: Maps container port 8000 to specified local port
-- Stop Condition: Immediately after health endpoint returns 200
-
-EXAMPLE LOG FLOW:
-================
-[20.145s] starting sglang → sglang_init
-[119.058s] load weight begin → weights_download
-[200.525s] loading safetensors checkpoint shards: 0% → weights_download_complete
-[233.778s] load weight end → weights_loaded
-[233.828s] kv cache is allocated → kv_cache_allocated
-[245.123s] capture cuda graph begin → graph_capture_begin
-[267.890s] capture cuda graph end → graph_capture_end
-[289.456s] starting server → server_log_ready
-[291.789s] HTTP 200 /health_generate → server_ready
-"""
-
-import requests
-import json
-import time
-from typing import Dict, Optional
-from benchmark_base import BenchmarkBase
-
-
-class SGLangBenchmark(BenchmarkBase):
- def __init__(self, image: str = "", container_name: str = "sglang-timing-test",
- snapshotter: str = "nydus", port: int = 8000):
- super().__init__(image, container_name, snapshotter, port)
-
- def get_health_endpoint(self) -> str:
- """Get health endpoint for SGLang application."""
- return "health_generate"
-
- def supports_health_polling(self) -> bool:
- """SGLang application supports health endpoint polling."""
- return True
-
- def _should_stop_monitoring(self, elapsed: float) -> bool:
- """Custom stop monitoring logic for SGLang."""
- # Use base class logic for health polling apps
- return super()._should_stop_monitoring(elapsed)
-
- def _init_phases(self) -> None:
- """Initialize the phases dictionary for SGLang."""
- self.phases = {
- "first_log": None,
- "sglang_init": None,
- "model_loading": None,
- "weights_download": None,
- "weights_download_complete": None,
- "weights_loaded": None,
- "kv_cache_allocated": None,
- "graph_capture_begin": None,
- "graph_capture_end": None,
- "model_loaded": None,
- "server_log_ready": None,
- "server_ready": None
- }
-
- def analyze_log_line(self, line: str, timestamp: float) -> Optional[str]:
- """Analyze a log line and return detected phase."""
- elapsed = timestamp - self.start_time
- line_lower = line.lower()
-
- # SGLang initialization
- if self.phases["sglang_init"] is None:
- if any(pattern in line_lower for pattern in [
- "starting sglang", "sglang server", "initializing sglang", "launch_server"
- ]):
- self.phases["sglang_init"] = elapsed
- return "sglang_init"
-
- # Weight download start (was "load weight begin")
- if self.phases["weights_download"] is None:
- if "load weight begin" in line_lower:
- self.phases["weights_download"] = elapsed
- return "weights_download"
-
- # Weight download complete (first loading safetensors)
- if self.phases["weights_download_complete"] is None:
- if "loading safetensors checkpoint shards:" in line_lower and "0%" in line_lower:
- self.phases["weights_download_complete"] = elapsed
- return "weights_download_complete"
-
- # Weights loaded (was "load weight end")
- if self.phases["weights_loaded"] is None:
- if "load weight end" in line_lower:
- self.phases["weights_loaded"] = elapsed
- return "weights_loaded"
-
- # KV cache allocation
- if self.phases["kv_cache_allocated"] is None:
- if any(pattern in line_lower for pattern in [
- "kv cache is allocated", "kv cache allocated"
- ]):
- self.phases["kv_cache_allocated"] = elapsed
- return "kv_cache_allocated"
-
- # CUDA graph capture begin
- if self.phases["graph_capture_begin"] is None:
- if any(pattern in line_lower for pattern in [
- "capture cuda graph begin", "capturing cuda graph"
- ]):
- self.phases["graph_capture_begin"] = elapsed
- return "graph_capture_begin"
-
- # CUDA graph capture end
- if self.phases["graph_capture_end"] is None:
- if any(pattern in line_lower for pattern in [
- "capture cuda graph end", "cuda graph capture complete"
- ]):
- self.phases["graph_capture_end"] = elapsed
- return "graph_capture_end"
-
- # Server log ready pattern
- if self.phases["server_log_ready"] is None:
- if any(pattern in line_lower for pattern in [
- "starting server", "server starting", "uvicorn", "listening on"
- ]):
- self.phases["server_log_ready"] = elapsed
- return "server_log_ready"
-
- return None
-
- def test_api_readiness(self, timeout: int = 120) -> bool:
- """SGLang benchmark doesn't test API readiness - stops after server ready."""
- print("Skipping API readiness test - stopping after server ready detection")
- return True
-
- def get_default_image(self, snapshotter: str) -> str:
- """Get default image for the snapshotter. Users should now use --repo parameter instead."""
- raise ValueError(
- "No default image configured. Please specify either:\n"
- " --repo (e.g., --repo my-sglang-app)\n"
- " --image (e.g., --image registry.com/repo:tag)\n"
- "\nExample: python test-bench-sglang.py --repo saurabh-sglang-test --tag latest --snapshotter nydus"
- )
-
-
-
- def _get_summary_items(self, total_time):
- """Get summary items for printing."""
- items = [
- ("Container Startup Time:", self.container_startup_duration),
- ("Container to First Log:", self.phases["first_log"]),
- ("SGLang Initialization:", self.phases["sglang_init"]),
- ("Weight Download Start:", self.phases["weights_download"]),
- ("Weight Download Complete:", self.phases["weights_download_complete"]),
- ("Weights Loaded:", self.phases["weights_loaded"]),
- ("KV Cache Allocated:", self.phases["kv_cache_allocated"]),
- ("Graph Capture Begin:", self.phases["graph_capture_begin"]),
- ("Graph Capture End:", self.phases["graph_capture_end"]),
- ("Server Log Ready:", self.phases["server_log_ready"]),
- ("Server Ready:", self.phases["server_ready"]),
- ("Total Test Time:", total_time)
- ]
-
- # Add breakdown section
- items.append(("", None)) # Empty line separator
- items.append(("BREAKDOWN:", None))
-
- # Calculate breakdowns
- if self.phases["first_log"] is not None:
- items.append(("Container to First Log:", self.phases["first_log"]))
-
- if self.phases["first_log"] is not None and self.phases["weights_download"] is not None:
- first_to_download = self.phases["weights_download"] - self.phases["first_log"]
- items.append(("First Log to Weight Download Start:", first_to_download))
-
- if self.phases["weights_download"] is not None and self.phases["weights_download_complete"] is not None:
- download_duration = self.phases["weights_download_complete"] - self.phases["weights_download"]
- items.append(("Weight Download Start to Complete:", download_duration))
-
- if self.phases["weights_download_complete"] is not None and self.phases["weights_loaded"] is not None:
- download_to_loaded = self.phases["weights_loaded"] - self.phases["weights_download_complete"]
- items.append(("Weight Download Complete to Weights Loaded:", download_to_loaded))
-
- if self.phases["weights_loaded"] is not None and self.phases["server_ready"] is not None:
- loaded_to_ready = self.phases["server_ready"] - self.phases["weights_loaded"]
- items.append(("Weights Loaded to Server Ready:", loaded_to_ready))
-
- return items
-
- def _is_successful(self, results: Dict[str, Optional[float]]) -> bool:
- """Determine if benchmark was successful."""
- return results.get("server_ready") is not None
-
-
-def main():
- benchmark = SGLangBenchmark()
- return benchmark.main("SGLang Container Startup Benchmark")
-
-
-if __name__ == "__main__":
- import sys
- import subprocess
- sys.exit(main())
\ No newline at end of file
diff --git a/scripts/benchmark/test-bench-tensorrt.py b/scripts/benchmark/test-bench-tensorrt.py
deleted file mode 100755
index 2f5586f..0000000
--- a/scripts/benchmark/test-bench-tensorrt.py
+++ /dev/null
@@ -1,218 +0,0 @@
-#!/usr/bin/env python3
-"""
-TensorRT-LLM Startup Timing Benchmark
-Measures container startup and TensorRT-LLM readiness times with different snapshotters.
-
-LOG PATTERN DETECTION & PHASES:
-===============================
-
-This benchmark monitors TensorRT-LLM server logs and detects the following phases:
-
-1. ENGINE_INIT (TensorRT-LLM Engine Initialization)
- - Patterns: "PyTorchConfig(", "TensorRT-LLM version", "KV cache quantization"
- - Detects: TensorRT-LLM engine initialization and configuration
-
-2. WEIGHT_DOWNLOAD_START (Weight Download Start)
- - Patterns: "Prefetching", "checkpoint files", "Use.*GB for model weights"
- - Detects: Beginning of model weight download/prefetching to memory
-
-3. WEIGHT_DOWNLOAD_COMPLETE (Weight Download Complete)
- - Patterns: "Loading /workspace/huggingface", first model loading line
- - Detects: All model weights downloaded and loading starts
-
-4. WEIGHTS_LOADED (Weight Loading Complete)
- - Patterns: "Loading weights: 100%", "Model init total"
- - Detects: Model weights fully loaded into memory
-
-5. MODEL_LOADED (Model Fully Loaded)
- - Patterns: "Autotuning process ends", "Autotuner Cache size", memory configuration
- - Detects: Complete model initialization with autotuning and optimization
-
-6. SERVER_LOG_READY (Server Log Ready)
- - Patterns: "Started server process", "Waiting for application startup"
- - Detects: Uvicorn/FastAPI server initialization (log-based)
-
-7. SERVER_READY (Server Ready)
- - Tested via HTTP requests to /health endpoint with 0.1s polling
- - Detects: API actually responding with valid HTTP 200 responses
-
-MONITORING BEHAVIOR:
-===================
- - Timeout: 20 minutes (model loading and autotuning can be very slow)
-- Container Status: Monitors container health during startup
-- Health Polling: Polls /health endpoint every 0.1 seconds after first log
-- Success Criteria: HTTP 200 response from health endpoint
-- Port: Maps container port 8000 to specified local port
-- Stop Condition: Immediately after health endpoint returns 200
-
-EXAMPLE LOG FLOW:
-================
-[10.230s] Starting TensorRT-LLM server → first_log
-[73.120s] PyTorchConfig( → engine_init
-[76.780s] Prefetching 15.26GB checkpoint → weight_download_start
-[130.450s] Loading /workspace/huggingface → weight_download_complete
-[156.670s] Loading weights: 100% → weights_loaded
-[324.456s] Autotuning process ends → model_loaded
-[325.789s] Started server process → server_log_ready
-[326.012s] HTTP 200 /health → server_ready
-"""
-
-import requests
-import json
-import time
-from typing import Dict, Optional
-from benchmark_base import BenchmarkBase
-
-
-class TensorRTBenchmark(BenchmarkBase):
- def __init__(self, image: str = "", container_name: str = "tensorrt-timing-test",
- snapshotter: str = "nydus", port: int = 8080):
- super().__init__(image, container_name, snapshotter, port)
-
- def get_health_endpoint(self) -> str:
- """Get health endpoint for TensorRT application."""
- return "health"
-
- def supports_health_polling(self) -> bool:
- """TensorRT application supports health endpoint polling."""
- return True
-
- def _should_stop_monitoring(self, elapsed: float) -> bool:
- """Custom stop monitoring logic for TensorRT-LLM."""
- # Use base class logic for health polling apps
- return super()._should_stop_monitoring(elapsed)
-
- def _is_successful(self, results: Dict[str, Optional[float]]) -> bool:
- """Determine if benchmark was successful."""
- return results.get("server_ready") is not None
-
- def _init_phases(self) -> None:
- """Initialize the phases dictionary for TensorRT-LLM."""
- self.phases = {
- "first_log": None,
- "engine_init": None,
- "weight_download_start": None,
- "weight_download_complete": None,
- "weights_loaded": None,
- "model_loaded": None,
- "server_log_ready": None,
- "server_ready": None
- }
-
- def analyze_log_line(self, line: str, timestamp: float) -> Optional[str]:
- """Analyze a log line and return detected phase."""
- elapsed = timestamp - self.start_time
- line_lower = line.lower()
-
- # TensorRT-LLM engine initialization
- if self.phases["engine_init"] is None:
- if any(pattern in line_lower for pattern in [
- "pytorchconfig(", "tensorrt-llm version", "kv cache quantization"
- ]):
- self.phases["engine_init"] = elapsed
- return "engine_init"
-
- # Weight download start
- if self.phases["weight_download_start"] is None:
- if any(pattern in line_lower for pattern in [
- "prefetching", "checkpoint files", "gb for model weights"
- ]):
- self.phases["weight_download_start"] = elapsed
- return "weight_download_start"
-
- # Weight download complete and loading starts
- if self.phases["weight_download_complete"] is None:
- if any(pattern in line_lower for pattern in [
- "loading /workspace/huggingface"
- ]):
- self.phases["weight_download_complete"] = elapsed
- return "weight_download_complete"
-
- # Weights loading complete
- if self.phases["weights_loaded"] is None:
- if any(pattern in line_lower for pattern in [
- "loading weights: 100%", "model init total"
- ]):
- self.phases["weights_loaded"] = elapsed
- return "weights_loaded"
-
- # Model fully loaded (autotuning complete, memory configured)
- if self.phases["model_loaded"] is None:
- if any(pattern in line_lower for pattern in [
- "autotuning process ends", "autotuner cache size",
- "max_seq_len=", "max_num_requests=", "gib for max tokens"
- ]):
- self.phases["model_loaded"] = elapsed
- return "model_loaded"
-
- # Server log ready pattern
- if self.phases["server_log_ready"] is None:
- if any(pattern in line_lower for pattern in [
- "started server process", "waiting for application startup"
- ]):
- self.phases["server_log_ready"] = elapsed
- return "server_log_ready"
-
- return None
-
- def test_api_readiness(self, timeout: int = 120) -> bool:
- """TensorRT benchmark doesn't test API readiness - stops after server ready."""
- print("Skipping API readiness test - stopping after server ready detection")
- return True
-
- def get_default_image(self, snapshotter: str) -> str:
- """Get default image for the snapshotter. Users should now use --repo parameter instead."""
- raise ValueError(
- "No default image configured. Please specify either:\n"
- " --repo (e.g., --repo my-tensorrt-app)\n"
- " --image (e.g., --image registry.com/repo:tag)\n"
- "\nExample: python test-bench-tensorrt.py --repo my-tensorrt-app --tag latest --snapshotter nydus"
- )
-
- def _get_summary_items(self, total_time):
- """Get summary items for the timing summary."""
- items = [
- ("Container Startup Time:", self.container_startup_duration),
- ("Container to First Log:", self.phases["first_log"]),
- ("Engine Initialization:", self.phases["engine_init"]),
- ("Weight Download Start:", self.phases["weight_download_start"]),
- ("Weight Download Complete:", self.phases["weight_download_complete"]),
- ("Weights Loaded:", self.phases["weights_loaded"]),
- ("Model Loaded:", self.phases["model_loaded"]),
- ("Server Log Ready:", self.phases["server_log_ready"]),
- ("Server Ready:", self.phases["server_ready"]),
- ("Total Test Time:", total_time)
- ]
-
- # Add breakdown section
- items.append(("", None)) # Empty line separator
- items.append(("BREAKDOWN:", None))
-
- # Calculate breakdowns
- if self.phases["first_log"] is not None:
- items.append(("Container to First Log:", self.phases["first_log"]))
-
- if self.phases["first_log"] is not None and self.phases["weight_download_start"] is not None:
- first_to_download = self.phases["weight_download_start"] - self.phases["first_log"]
- items.append(("First Log to Weight Download Start:", first_to_download))
-
- if self.phases["weight_download_start"] is not None and self.phases["weight_download_complete"] is not None:
- download_duration = self.phases["weight_download_complete"] - self.phases["weight_download_start"]
- items.append(("Weight Download Start to Complete:", download_duration))
-
- if self.phases["weight_download_complete"] is not None and self.phases["weights_loaded"] is not None:
- download_to_loaded = self.phases["weights_loaded"] - self.phases["weight_download_complete"]
- items.append(("Weight Download Complete to Weights Loaded:", download_to_loaded))
-
- if self.phases["weights_loaded"] is not None and self.phases["server_ready"] is not None:
- loaded_to_ready = self.phases["server_ready"] - self.phases["weights_loaded"]
- items.append(("Weights Loaded to Server Ready:", loaded_to_ready))
-
- return items
-
-
-if __name__ == "__main__":
- import sys
-
- benchmark = TensorRTBenchmark()
- sys.exit(benchmark.main("TensorRT-LLM Container Startup Benchmark"))
\ No newline at end of file
diff --git a/scripts/benchmark/test-bench-vllm.py b/scripts/benchmark/test-bench-vllm.py
deleted file mode 100755
index efa1fca..0000000
--- a/scripts/benchmark/test-bench-vllm.py
+++ /dev/null
@@ -1,219 +0,0 @@
-#!/usr/bin/env python3
-"""
-vLLM Startup Timing Benchmark
-Measures container startup and vLLM readiness times with different snapshotters.
-
-LOG PATTERN DETECTION & PHASES:
-===============================
-
-This benchmark monitors vLLM inference server logs and detects the following phases:
-
-1. ENGINE_INIT (vLLM Engine Initialization)
- - Patterns: "initializing a v1 llm engine", "waiting for init message", "v1 llm engine"
- - Detects: vLLM V1 engine initialization start
-
-2. MODEL_LOADING (Model Loading Start)
- - Patterns: "starting to load model", "loading model from scratch"
- - Detects: Beginning of model loading process
-
-3. WEIGHTS_DOWNLOAD (Weight Download)
- - Patterns: "time spent downloading weights", "downloading weights"
- - Detects: Model weight download completion (if needed)
-
-4. WEIGHTS_LOADED (Weight Loading Complete)
- - Patterns: "loading weights took", "loading safetensors checkpoint shards: 100%"
- - Detects: Model weights fully loaded into memory
-
-5. MODEL_LOADED (Model Fully Loaded)
- - Patterns: "model loading took", "init engine", "engine.*took.*seconds"
- - Detects: Complete model initialization and engine setup
-
-6. GRAPH_CAPTURE (CUDA Graph Optimization)
- - Patterns: "graph capturing finished", "capturing cuda graph shapes: 100%"
- - Detects: CUDA graph capture completion for optimization
-
-7. SERVER_LOG_READY (Server Log Ready)
- - Patterns: "started server process"
- - Detects: FastAPI/Uvicorn server process started (log-based)
-
-8. SERVER_READY (Server Ready)
- - Tested via HTTP requests to /health endpoint with 0.1s polling
- - Detects: API actually responding with valid HTTP 200 responses
-
-MONITORING BEHAVIOR:
-===================
-- Timeout: 20 minutes (model loading can be slow)
-- Container Status: Monitors container health during startup
-- Health Polling: Polls /health endpoint every 0.1 seconds after first log
-- Success Criteria: HTTP 200 response from health endpoint
-- Port: Maps container port 8000 to specified local port
-- Stop Condition: Immediately after health endpoint returns 200
-
-EXAMPLE LOG FLOW:
-================
-[15.230s] initializing a v1 llm engine → engine_init
-[45.120s] starting to load model → model_loading
-[67.340s] downloading weights → weights_download
-[156.780s] loading weights took 89.44s → weights_loaded
-[198.450s] model loading took 153.33s → model_loaded
-[245.670s] graph capturing finished → graph_capture
-[318.429s] started server process → server_log_ready
-[318.435s] HTTP 200 /health → server_ready
-"""
-
-import requests
-import json
-import time
-from typing import Dict, Optional
-from benchmark_base import BenchmarkBase
-
-
-class VLLMBenchmark(BenchmarkBase):
- def __init__(self, image: str = "", container_name: str = "vllm-timing-test",
- snapshotter: str = "nydus", port: int = 8080):
- super().__init__(image, container_name, snapshotter, port)
-
- def get_health_endpoint(self) -> str:
- """Get health endpoint for vLLM application."""
- return "health"
-
- def supports_health_polling(self) -> bool:
- """vLLM application supports health endpoint polling."""
- return True
-
- def _should_stop_monitoring(self, elapsed: float) -> bool:
- """Custom stop monitoring logic for vLLM."""
- # Use base class logic for health polling apps
- return super()._should_stop_monitoring(elapsed)
-
- def _is_successful(self, results: Dict[str, Optional[float]]) -> bool:
- """Determine if benchmark was successful."""
- return results.get("server_ready") is not None
-
- def _init_phases(self) -> None:
- """Initialize the phases dictionary for vLLM."""
- self.phases = {
- "first_log": None,
- "engine_init": None,
- "weights_download": None,
- "weights_download_complete": None,
- "weights_loaded": None,
- "graph_capture": None,
- "server_log_ready": None,
- "server_ready": None
- }
-
- def analyze_log_line(self, line: str, timestamp: float) -> Optional[str]:
- """Analyze a log line and return detected phase."""
- elapsed = timestamp - self.start_time
- line_lower = line.lower()
-
- # Engine initialization (vLLM V1 engine)
- if self.phases["engine_init"] is None:
- if any(pattern in line_lower for pattern in [
- "initializing a v1 llm engine", "waiting for init message", "v1 llm engine"
- ]):
- self.phases["engine_init"] = elapsed
- return "engine_init"
-
- # Weights download start (was model loading start)
- if self.phases["weights_download"] is None:
- if any(pattern in line_lower for pattern in [
- "starting to load model", "loading model from scratch"
- ]):
- self.phases["weights_download"] = elapsed
- return "weights_download"
-
- # Weights download complete
- if self.phases["weights_download_complete"] is None:
- if any(pattern in line_lower for pattern in [
- "time spent downloading weights", "downloading weights"
- ]):
- self.phases["weights_download_complete"] = elapsed
- return "weights_download_complete"
-
- # Weights loaded patterns
- if self.phases["weights_loaded"] is None:
- if any(pattern in line_lower for pattern in [
- "loading weights took", "loading safetensors checkpoint shards: 100%"
- ]):
- self.phases["weights_loaded"] = elapsed
- return "weights_loaded"
-
- # CUDA graph capture
- if self.phases["graph_capture"] is None:
- if any(pattern in line_lower for pattern in [
- "graph capturing finished", "capturing cuda graph shapes: 100%"
- ]):
- self.phases["graph_capture"] = elapsed
- return "graph_capture"
-
- # Server log ready pattern (vLLM/FastAPI specific)
- if self.phases["server_log_ready"] is None:
- if "started server process" in line_lower:
- self.phases["server_log_ready"] = elapsed
- return "server_log_ready"
-
- return None
-
- def test_api_readiness(self, timeout: int = 120) -> bool:
- """vLLM benchmark uses health polling instead of direct API test."""
- print("Using health polling instead of direct API test")
- return True
-
- def get_default_image(self, snapshotter: str) -> str:
- """Get default image for the snapshotter. Users should now use --repo parameter instead."""
- raise ValueError(
- "No default image configured. Please specify either:\n"
- " --repo (e.g., --repo my-vllm-app)\n"
- " --image (e.g., --image registry.com/repo:tag)\n"
- "\nExample: python test-bench-vllm.py --repo saurabh-vllm-test --tag latest --snapshotter nydus"
- )
-
- def _get_summary_items(self, total_time):
- """Get summary items for the timing summary."""
- items = [
- ("Container Startup Time:", self.container_startup_duration),
- ("Container to First Log:", self.phases["first_log"]),
- ("Engine Initialization:", self.phases["engine_init"]),
- ("Weights Download Start:", self.phases["weights_download"]),
- ("Weights Download Complete:", self.phases["weights_download_complete"]),
- ("Weights Loaded:", self.phases["weights_loaded"]),
- ("Graph Capture Complete:", self.phases["graph_capture"]),
- ("Server Log Ready:", self.phases["server_log_ready"]),
- ("Server Ready:", self.phases["server_ready"]),
- ("Total Test Time:", total_time)
- ]
-
- # Add breakdown section
- items.append(("", None)) # Empty line separator
- items.append(("BREAKDOWN:", None))
-
- # Calculate breakdowns
- if self.phases["first_log"] is not None:
- items.append(("Container to First Log:", self.phases["first_log"]))
-
- if self.phases["first_log"] is not None and self.phases["weights_download"] is not None:
- first_to_download = self.phases["weights_download"] - self.phases["first_log"]
- items.append(("First Log to Weight Download Start:", first_to_download))
-
- if self.phases["weights_download"] is not None and self.phases["weights_download_complete"] is not None:
- download_duration = self.phases["weights_download_complete"] - self.phases["weights_download"]
- items.append(("Weight Download Start to Complete:", download_duration))
-
- if self.phases["weights_download_complete"] is not None and self.phases["weights_loaded"] is not None:
- download_to_loaded = self.phases["weights_loaded"] - self.phases["weights_download_complete"]
- items.append(("Weight Download Complete to Weights Loaded:", download_to_loaded))
-
- if self.phases["weights_loaded"] is not None and self.phases["server_ready"] is not None:
- loaded_to_ready = self.phases["server_ready"] - self.phases["weights_loaded"]
- items.append(("Weights Loaded to Server Ready:", loaded_to_ready))
-
- return items
-
-
-if __name__ == "__main__":
- import sys
-
- benchmark = VLLMBenchmark()
- sys.exit(benchmark.main("vLLM Container Startup Benchmark"))
\ No newline at end of file
diff --git a/scripts/build_push.py b/scripts/build_push.py
deleted file mode 100755
index 65afabf..0000000
--- a/scripts/build_push.py
+++ /dev/null
@@ -1,547 +0,0 @@
-#!/usr/bin/env python3
-"""
-Build and push container images with different snapshotter formats.
-Supports ECR (AWS) and GAR (Google Artifact Registry).
-"""
-
-import argparse
-import os
-import subprocess
-import sys
-import json
-from pathlib import Path
-from abc import ABC, abstractmethod
-
-
-def run_command(cmd, check=True, capture_output=False):
- """Run a shell command and handle errors."""
- import time
-
- print(f"Running: {cmd}")
- start_time = time.time()
-
- try:
- if capture_output:
- result = subprocess.run(cmd, shell=True, check=check, capture_output=True, text=True)
- elapsed = time.time() - start_time
- print(f"✓ Completed in {elapsed:.2f}s")
- return result.stdout.strip()
- else:
- subprocess.run(cmd, shell=True, check=check)
- elapsed = time.time() - start_time
- print(f"✓ Completed in {elapsed:.2f}s")
- except subprocess.CalledProcessError as e:
- elapsed = time.time() - start_time
- print(f"❌ Failed after {elapsed:.2f}s")
- print(f"Error running command: {cmd}")
- print(f"Error: {e}")
- if capture_output and e.stdout:
- print(f"Stdout: {e.stdout}")
- if capture_output and e.stderr:
- print(f"Stderr: {e.stderr}")
- sys.exit(1)
-
-
-class Registry(ABC):
- """Abstract base class for container registries."""
-
- @abstractmethod
- def check_credentials(self):
- """Check if credentials are configured."""
- pass
-
- @abstractmethod
- def create_repository(self, image_name):
- """Create repository if it doesn't exist."""
- pass
-
- @abstractmethod
- def login(self):
- """Login to registry with docker and nerdctl."""
- pass
-
- @abstractmethod
- def get_registry_url(self):
- """Return the registry URL."""
- pass
-
- def get_full_image_name(self, image_name, tag="latest"):
- """Construct full image reference."""
- return f"{self.get_registry_url()}/{image_name}:{tag}"
-
-
-class ECRRegistry(Registry):
- """AWS Elastic Container Registry implementation."""
-
- def __init__(self, account, region):
- self.account = account
- self.region = region
- self.registry_url = f"{account}.dkr.ecr.{region}.amazonaws.com"
-
- def check_credentials(self):
- """Check if AWS credentials are configured."""
- try:
- run_command("aws sts get-caller-identity", capture_output=True)
- print("✓ AWS credentials are configured")
- except:
- print("Error: AWS credentials not configured. Please run 'aws configure' first.")
- sys.exit(1)
-
- def create_repository(self, image_name):
- """Create ECR repository if it doesn't exist."""
- print(f"Checking/creating ECR repository: {image_name}")
-
- # Check if repository exists
- check_cmd = f"aws ecr describe-repositories --repository-names {image_name} --region {self.region}"
- try:
- run_command(check_cmd, capture_output=True)
- print(f"✓ Repository {image_name} already exists")
- except:
- # Repository doesn't exist, create it
- create_cmd = f"aws ecr create-repository --repository-name {image_name} --region {self.region}"
- run_command(create_cmd)
- print(f"✓ Created repository {image_name}")
-
- def login(self):
- """Login to ECR using both docker and nerdctl."""
- print("Logging into ECR...")
-
- password = run_command(f"aws ecr get-login-password --region {self.region}", capture_output=True)
-
- # Login with docker
- login_cmd = f"echo '{password}' | docker login -u AWS --password-stdin {self.registry_url}"
- run_command(login_cmd)
-
- # Login with nerdctl
- login_cmd = f"echo '{password}' | nerdctl login -u AWS --password-stdin {self.registry_url}"
- run_command(login_cmd)
-
- # Login with sudo nerdctl
- login_cmd = f"echo '{password}' | sudo nerdctl login -u AWS --password-stdin {self.registry_url}"
- run_command(login_cmd)
-
- print("✓ Successfully logged into ECR")
-
- def get_registry_url(self):
- """Return the ECR registry URL."""
- return self.registry_url
-
-
-class GARRegistry(Registry):
- """Google Artifact Registry implementation."""
-
- def __init__(self, project_id, repository, location):
- self.project_id = project_id
- self.repository = repository
- self.location = location
- self.registry_url = f"{location}-docker.pkg.dev/{project_id}/{repository}"
-
- def check_credentials(self):
- """Check if GCP credentials are configured."""
- try:
- run_command("gcloud auth application-default print-access-token", capture_output=True)
- print("✓ GCP credentials are configured")
- except:
- print("Error: GCP credentials not configured.")
- print("Please run 'gcloud auth application-default login' or 'gcloud auth login'")
- sys.exit(1)
-
- def create_repository(self, image_name):
- """Create GAR repository if it doesn't exist."""
- print(f"Checking/creating GAR repository: {self.repository}")
-
- # Check if repository exists
- check_cmd = f"gcloud artifacts repositories describe {self.repository} --location={self.location} --project={self.project_id}"
- try:
- run_command(check_cmd, capture_output=True)
- print(f"✓ Repository {self.repository} already exists")
- except:
- # Repository doesn't exist, create it
- create_cmd = f"gcloud artifacts repositories create {self.repository} --repository-format=docker --location={self.location} --project={self.project_id}"
- run_command(create_cmd)
- print(f"✓ Created repository {self.repository}")
-
- def login(self):
- """Login to GAR using both docker and nerdctl."""
- print("Logging into Google Artifact Registry...")
-
- # Configure Docker authentication helper for GAR
- auth_cmd = f"gcloud auth configure-docker {self.location}-docker.pkg.dev"
- run_command(auth_cmd)
-
- # Get access token for nerdctl login
- token = run_command("gcloud auth print-access-token", capture_output=True)
-
- # Login with nerdctl
- login_cmd = f"echo '{token}' | nerdctl login -u oauth2accesstoken --password-stdin {self.location}-docker.pkg.dev"
- run_command(login_cmd)
-
- # Login with sudo nerdctl
- login_cmd = f"echo '{token}' | sudo nerdctl login -u oauth2accesstoken --password-stdin {self.location}-docker.pkg.dev"
- run_command(login_cmd)
-
- print("✓ Successfully logged into GAR")
-
- def get_registry_url(self):
- """Return the GAR registry URL."""
- return self.registry_url
-
-
-def build_and_push_image(image_dir, image_name, registry):
- """Build and push the base Docker image."""
- print(f"Building image from {image_dir}...")
-
- # Change to image directory for build context
- original_dir = os.getcwd()
- os.chdir(image_dir)
-
- try:
- # Build the image
- build_cmd = f"docker build -t {image_name} ."
- run_command(build_cmd)
-
- # Tag for registry
- full_image = registry.get_full_image_name(image_name, "latest")
- tag_cmd = f"docker tag {image_name} {full_image}"
- run_command(tag_cmd)
-
- # Push the image
- push_cmd = f"docker push {full_image}"
- run_command(push_cmd)
-
- print(f"✓ Successfully built and pushed {full_image}")
-
- finally:
- os.chdir(original_dir)
-
-
-def convert_to_nydus(image_name, registry):
- """Convert and push Nydus image."""
- print("Converting to Nydus format...")
-
- source_image = registry.get_full_image_name(image_name, "latest")
- target_image = registry.get_full_image_name(image_name, "latest-nydus")
-
- nydus_cmd = f"""nydusify convert \\
- --source {source_image} \\
- --source-backend-config ~/.docker/config.json \\
- --target {target_image}"""
-
- run_command(nydus_cmd)
- print(f"✓ Successfully converted and pushed {target_image}")
-
-
-def convert_to_soci(image_name, registry):
- """Convert and push SOCI image."""
- print("Converting to SOCI format...")
-
- source_image = registry.get_full_image_name(image_name, "latest")
- target_image = registry.get_full_image_name(image_name, "latest-soci")
-
- # Pull the image with nerdctl first
- pull_cmd = f"sudo nerdctl pull {source_image}"
- run_command(pull_cmd)
-
- # Convert to SOCI
- soci_cmd = f"sudo soci convert {source_image} {target_image}"
- run_command(soci_cmd)
-
- # Push SOCI image
- push_cmd = f"sudo nerdctl push {target_image}"
- run_command(push_cmd)
-
- print(f"✓ Successfully converted and pushed {target_image}")
-
-
-def convert_to_estargz(image_name, registry):
- """Convert and push eStargz image."""
- print("Converting to eStargz format...")
-
- source_image = registry.get_full_image_name(image_name, "latest")
- target_image = registry.get_full_image_name(image_name, "latest-estargz")
-
- # Pull the image with nerdctl first
- pull_cmd = f"sudo nerdctl pull {source_image}"
- run_command(pull_cmd)
-
- estargz_cmd = f"sudo nerdctl image convert --estargz --oci {source_image} {target_image}"
- run_command(estargz_cmd)
-
- # Push eStargz image
- push_cmd = f"sudo nerdctl push {target_image}"
- run_command(push_cmd)
-
- print(f"✓ Successfully converted and pushed {target_image}")
-
-
-def cleanup_built_images(image_name, registry, formats):
- """Remove only the images that were built in this run."""
- import time
-
- print("\n" + "="*60)
- print("🧹 CLEANUP: Removing built images...")
- print("="*60)
-
- cleanup_start = time.time()
- images_to_remove = []
-
- # Collect all image references that were built
- if "normal" in formats:
- images_to_remove.append(image_name) # Local tag
- images_to_remove.append(registry.get_full_image_name(image_name, "latest"))
- if "nydus" in formats:
- images_to_remove.append(registry.get_full_image_name(image_name, "latest-nydus"))
- if "soci" in formats:
- images_to_remove.append(registry.get_full_image_name(image_name, "latest-soci"))
- if "estargz" in formats:
- images_to_remove.append(registry.get_full_image_name(image_name, "latest-estargz"))
-
- # Cleanup Docker images
- print("\n📦 Docker Cleanup:")
- for image in images_to_remove:
- try:
- print(f" Removing: {image}")
- run_command(f"docker rmi -f {image}", check=False, capture_output=True)
- except Exception as e:
- print(f" ⚠️ Warning: Could not remove {image}: {e}")
-
- # Cleanup nerdctl images for relevant snapshotters
- snapshotter_map = {
- "normal": "overlayfs",
- "nydus": "nydus",
- "soci": "soci",
- "estargz": "stargz"
- }
-
- print(f"\n🔧 nerdctl Cleanup:")
- for format_type in formats:
- snapshotter = snapshotter_map.get(format_type)
- if not snapshotter:
- continue
-
- print(f" Processing {snapshotter} snapshotter...")
- try:
- # Determine the correct tag
- if format_type == "normal":
- tag = "latest"
- else:
- tag = f"latest-{format_type}"
-
- image_ref = registry.get_full_image_name(image_name, tag)
- print(f" Removing: {image_ref}")
- run_command(f"sudo nerdctl --snapshotter {snapshotter} rmi -f {image_ref}", check=False, capture_output=True)
-
- except Exception as e:
- print(f" ⚠️ Warning: Could not cleanup {snapshotter} images: {e}")
-
- total_cleanup_time = time.time() - cleanup_start
- print(f"\n✅ Cleanup completed in {total_cleanup_time:.2f}s")
- print("="*60)
-
-
-def list_available_images(base_path="snapshotters/images"):
- """List available image directories."""
- images_dir = Path(base_path)
- if not images_dir.exists():
- print(f"Error: {base_path} directory not found")
- return []
-
- image_dirs = []
- for item in images_dir.iterdir():
- if item.is_dir() and (item / "Dockerfile").exists():
- image_dirs.append(item.name)
-
- return sorted(image_dirs)
-
-
-def main():
- parser = argparse.ArgumentParser(
- description="Build and push container images with different snapshotter formats. Supports ECR (AWS) and GAR (Google Artifact Registry).",
- formatter_class=argparse.RawDescriptionHelpFormatter,
- epilog="""
-Examples:
- # ECR (AWS) - Build image from custom path
- python3 build_push.py --registry-type ecr --account 123456789 --image-path /path/to/my/image --image-name my-image --region us-east-1
-
- # ECR - Build with specific formats
- python3 build_push.py --registry-type ecr --account 123456789 --image-path ./images/cuda --image-name cuda-test --formats normal,nydus
-
- # GAR (Google) - Build and push all formats
- python3 build_push.py --registry-type gar --project-id my-gcp-project --repository my-repo --image-path ./images/vllm --image-name vllm-app --location us-central1
-
- # GAR - Build with specific formats
- python3 build_push.py --registry-type gar --project-id my-project --repository ai-models --image-path ./images/sglang --image-name sglang --location us-east1 --formats normal,nydus,soci
-
- # List available images in default directory
- python3 build_push.py --list-images
- """)
-
- # Registry selection
- parser.add_argument("--registry-type", choices=["ecr", "gar"], default="ecr",
- help="Registry type: ecr (AWS) or gar (Google Artifact Registry). Default: ecr")
-
- # Common arguments
- parser.add_argument("--image-path", required=False, help="Full path to image directory")
- parser.add_argument("--image-name", required=False, help="Image name for the container")
- parser.add_argument("--formats", default="normal,nydus,soci,estargz",
- help="Comma-separated list of formats to build (normal,nydus,soci,estargz)")
- parser.add_argument("--list-images", action="store_true", help="List available image directories")
- parser.add_argument("--no-cleanup", action="store_true", help="Skip cleanup of local images after build")
-
- # ECR-specific arguments
- parser.add_argument("--account", required=False, help="AWS account ID (required for ECR)")
- parser.add_argument("--region", required=False, default="us-east-1",
- help="AWS region for ECR (default: us-east-1)")
-
- # GAR-specific arguments
- parser.add_argument("--project-id", required=False, help="GCP project ID (optional for GAR, defaults to gcloud config)")
- parser.add_argument("--repository", required=False, help="GAR repository name (required for GAR)")
- parser.add_argument("--location", required=False, default="us-central1",
- help="GCP location for GAR (default: us-central1)")
-
- args = parser.parse_args()
-
- # List available images
- if args.list_images:
- available_images = list_available_images()
- if available_images:
- print("Available image directories:")
- for img in available_images:
- print(f" - {img}")
- else:
- print("No image directories found with Dockerfiles")
- return
-
- # Validate registry-specific arguments
- if args.registry_type == "ecr":
- if not args.account:
- parser.error("--account is required for ECR")
- elif args.registry_type == "gar":
- # Get project ID from gcloud config if not provided
- if not args.project_id:
- try:
- args.project_id = run_command("gcloud config get project", capture_output=True)
- if not args.project_id:
- parser.error("--project-id is required for GAR (or set default project with 'gcloud config set project PROJECT_ID')")
- print(f"Using project ID from gcloud config: {args.project_id}")
- except:
- parser.error("--project-id is required for GAR (or set default project with 'gcloud config set project PROJECT_ID')")
- if not args.repository:
- parser.error("--repository is required for GAR")
-
- # Validate common required arguments
- if not args.image_path:
- parser.error("--image-path is required")
- if not args.image_name:
- parser.error("--image-name is required")
-
- # Validate image directory exists
- image_dir = Path(args.image_path)
- if not image_dir.exists():
- print(f"Error: Image directory '{args.image_path}' not found")
- sys.exit(1)
-
- dockerfile_path = image_dir / "Dockerfile"
- if not dockerfile_path.exists():
- print(f"Error: No Dockerfile found in {image_dir}")
- sys.exit(1)
-
- # Parse formats
- formats = [f.strip() for f in args.formats.split(",")]
- valid_formats = {"normal", "nydus", "soci", "estargz"}
- invalid_formats = set(formats) - valid_formats
- if invalid_formats:
- print(f"Error: Invalid formats: {invalid_formats}")
- print(f"Valid formats: {valid_formats}")
- sys.exit(1)
-
- # Set image name
- image_name = args.image_name
-
- # Create registry instance based on type
- if args.registry_type == "ecr":
- registry = ECRRegistry(args.account, args.region)
- registry_info = f"Account: {args.account}, Region: {args.region}"
- else: # gar
- registry = GARRegistry(args.project_id, args.repository, args.location)
- registry_info = f"Project: {args.project_id}, Repository: {args.repository}, Location: {args.location}"
-
- print("="*70)
- print("🚀 STARTING CONTAINER IMAGE BUILD AND PUSH")
- print("="*70)
- print(f"Registry Type: {args.registry_type.upper()}")
- print(f"Building image: {image_name}")
- print(f"From directory: {image_dir}")
- print(f"{registry_info}")
- print(f"Formats: {formats}")
- print()
-
- import time
- total_start_time = time.time()
-
- # Check credentials
- print(f"🔐 Checking {args.registry_type.upper()} credentials...")
- registry.check_credentials()
-
- # Login to registry
- print(f"\n🔑 Logging into {args.registry_type.upper()}...")
- registry.login()
-
- # Create repository
- print(f"\n📦 Setting up repository...")
- registry.create_repository(image_name)
-
- # Build and push base image
- if "normal" in formats:
- print(f"\n🏗️ Building and pushing base image...")
- build_start = time.time()
- build_and_push_image(str(image_dir), image_name, registry)
- build_time = time.time() - build_start
- print(f"✅ Base image build completed in {build_time:.2f}s")
-
- # Convert to different formats
- if "nydus" in formats:
- print(f"\n🔄 Converting to Nydus format...")
- nydus_start = time.time()
- convert_to_nydus(image_name, registry)
- nydus_time = time.time() - nydus_start
- print(f"✅ Nydus conversion completed in {nydus_time:.2f}s")
-
- if "soci" in formats:
- print(f"\n🔄 Converting to SOCI format...")
- soci_start = time.time()
- convert_to_soci(image_name, registry)
- soci_time = time.time() - soci_start
- print(f"✅ SOCI conversion completed in {soci_time:.2f}s")
-
- if "estargz" in formats:
- print(f"\n🔄 Converting to eStargz format...")
- estargz_start = time.time()
- convert_to_estargz(image_name, registry)
- estargz_time = time.time() - estargz_start
- print(f"✅ eStargz conversion completed in {estargz_time:.2f}s")
-
- total_time = time.time() - total_start_time
-
- print("\n" + "="*70)
- print("🎉 ALL FORMATS BUILT AND PUSHED SUCCESSFULLY!")
- print("="*70)
- print(f"Registry: {registry.get_registry_url()}")
- print(f"Base image: {registry.get_full_image_name(image_name, 'latest')}")
- if "nydus" in formats:
- print(f"Nydus image: {registry.get_full_image_name(image_name, 'latest-nydus')}")
- if "soci" in formats:
- print(f"SOCI image: {registry.get_full_image_name(image_name, 'latest-soci')}")
- if "estargz" in formats:
- print(f"eStargz image: {registry.get_full_image_name(image_name, 'latest-estargz')}")
-
- print(f"\n⏱️ Total build and push time: {total_time:.2f}s ({total_time/60:.1f} minutes)")
- print("="*70)
-
- # Cleanup built images by default (unless --no-cleanup is specified)
- if not args.no_cleanup:
- cleanup_built_images(image_name, registry, formats)
-
-
-if __name__ == "__main__":
- main()
diff --git a/scripts/builder/Dockerfile b/scripts/builder/Dockerfile
new file mode 100644
index 0000000..f90e9d5
--- /dev/null
+++ b/scripts/builder/Dockerfile
@@ -0,0 +1,56 @@
+# Build stage: Compile buildkit with Nydus support
+FROM golang:1.21-alpine AS buildkit-builder
+
+# Install build dependencies
+RUN apk add --no-cache git make
+
+# Clone nydusaccelerator/buildkit fork
+ARG BUILDKIT_VERSION=nydus-compression-type-enhance
+RUN git clone --depth 1 --branch ${BUILDKIT_VERSION} \
+ https://github.com/nydusaccelerator/buildkit.git /buildkit
+
+WORKDIR /buildkit
+
+# Build buildkitd and buildctl with Nydus support
+RUN go build -tags=nydus -o ./bin/buildkitd ./cmd/buildkitd && \
+ go build -o ./bin/buildctl ./cmd/buildctl
+
+# Runtime stage
+FROM alpine:latest
+
+# Copy buildkit binaries with Nydus support
+COPY --from=buildkit-builder /buildkit/bin/buildctl /usr/bin/buildctl
+COPY --from=buildkit-builder /buildkit/bin/buildkitd /usr/bin/buildkitd
+
+# Download the buildctl-daemonless.sh wrapper from the moby/buildkit repo (master branch)
+ADD https://raw.githubusercontent.com/moby/buildkit/master/examples/buildctl-daemonless/buildctl-daemonless.sh /usr/bin/buildctl-daemonless.sh
+RUN chmod +x /usr/bin/buildctl-daemonless.sh
+
+# Install runtime dependencies
+RUN apk add --no-cache \
+ ca-certificates \
+ curl \
+ wget \
+ iptables \
+ fuse-overlayfs \
+ containerd
+
+# Install nydus-image binary (v2.3.6)
+ARG NYDUS_VERSION=v2.3.6
+RUN wget -O /tmp/nydus.tgz \
+ "https://github.com/dragonflyoss/nydus/releases/download/${NYDUS_VERSION}/nydus-static-${NYDUS_VERSION}-linux-amd64.tgz" \
+ && tar -xzf /tmp/nydus.tgz -C /tmp \
+ && mv /tmp/nydus-static/nydus-image /usr/bin/nydus-image \
+ && chmod +x /usr/bin/nydus-image \
+ && rm -rf /tmp/nydus.tgz /tmp/nydus-static
+
+# Set NYDUS_BUILDER environment variable (required for buildkit)
+ENV NYDUS_BUILDER=/usr/bin/nydus-image
+
+# Copy build script
+COPY build.sh /usr/local/bin/build.sh
+RUN chmod +x /usr/local/bin/build.sh
+
+WORKDIR /workspace
+
+ENTRYPOINT ["/usr/local/bin/build.sh"]
diff --git a/scripts/builder/README.md b/scripts/builder/README.md
new file mode 100644
index 0000000..450d4f0
--- /dev/null
+++ b/scripts/builder/README.md
@@ -0,0 +1,156 @@
+# Container-Based Image Builder
+
+Builds container images using `buildctl` in a containerized environment. Produces both normal OCI and Nydus-optimized images.
+
+## Features
+
+- **Registry-agnostic**: Works with AWS ECR, Google Artifact Registry, Docker Hub, or any OCI registry
+- **No local dependencies**: All build tools run inside a container
+- **Two image formats**: Builds both normal OCI and Nydus images in one go
+- **Direct push**: Images pushed directly to registry via buildctl
+
+## Architecture
+
+```
+Host (authenticated) → Builder Container (buildctl + nydus-image) → Registry
+```
+
+- **Host**: Authenticates to registry, mounts build context and docker config
+- **Builder Container**: Runs buildctl to build and push images
+- **No Docker daemon in the build path**: buildctl builds and pushes directly to registries; Docker is only used to run the builder container
+
+## Prerequisites
+
+1. **Docker** installed on host (no other dependencies needed!)
+2. **Authenticated to your registry** before running:
+
+```bash
+# AWS ECR
+aws ecr get-login-password --region us-east-1 | \
+ docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
+
+# Google Artifact Registry
+gcloud auth configure-docker us-central1-docker.pkg.dev
+
+# Docker Hub
+docker login
+```
+
+## Usage
+
+```bash
+docker run --rm --privileged \
+ -v /path/to/build-context:/workspace:ro \
+ -v ~/.docker/config.json:/root/.docker/config.json:ro \
+ tensorfuse/fastpull-builder:latest \
+ <image>:<tag>
+```
+
+### Examples
+
+**AWS ECR:**
+```bash
+docker run --rm --privileged \
+ -v ./my-app:/workspace:ro \
+ -v ~/.docker/config.json:/root/.docker/config.json:ro \
+ tensorfuse/fastpull-builder:latest \
+ 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
+```
+
+**Google Artifact Registry:**
+```bash
+docker run --rm --privileged \
+ -v ./my-app:/workspace:ro \
+ -v ~/.docker/config.json:/root/.docker/config.json:ro \
+ tensorfuse/fastpull-builder:latest \
+ us-central1-docker.pkg.dev/my-project/my-repo/my-app:v1.0
+```
+
+**Docker Hub:**
+```bash
+docker run --rm --privileged \
+ -v ./my-app:/workspace:ro \
+ -v ~/.docker/config.json:/root/.docker/config.json:ro \
+ tensorfuse/fastpull-builder:latest \
+ docker.io/username/my-app:latest
+```
+
+**No tag (defaults to :latest):**
+```bash
+docker run --rm --privileged \
+ -v ./my-app:/workspace:ro \
+ -v ~/.docker/config.json:/root/.docker/config.json:ro \
+ tensorfuse/fastpull-builder:latest \
+ my-registry.com/my-app
+```
+
+**Custom Dockerfile:**
+```bash
+docker run --rm --privileged \
+ -v ./my-app:/workspace:ro \
+ -v ~/.docker/config.json:/root/.docker/config.json:ro \
+ -e DOCKERFILE=Dockerfile.custom \
+ tensorfuse/fastpull-builder:latest \
+ my-registry.com/my-app:latest
+```
+
+## Output
+
+The script builds and pushes two images:
+- `<image>:<tag>` - Normal OCI image
+- `<image>:<tag>-fastpull` - Fastpull-optimized image
+
+## Files
+
+- `Dockerfile` - Builder container definition (builds from nydusaccelerator/buildkit fork)
+- `build.sh` - Build script that runs inside container (entrypoint)
+- `README.md` - This file
+
+## Technical Details
+
+### Buildkit with Nydus Support
+The Dockerfile builds `buildkitd` and `buildctl` from the [nydusaccelerator/buildkit](https://github.com/nydusaccelerator/buildkit) fork with the `-tags=nydus` flag, which enables Nydus compression support. The standard moby/buildkit does not include this functionality.
+
+### Components
+- **buildkitd/buildctl**: Compiled from nydusaccelerator/buildkit fork
+- **nydus-image**: v2.3.6 binary (set via `NYDUS_BUILDER` env var)
+- **buildctl-daemonless.sh**: Wrapper that starts a temporary buildkitd for each buildctl invocation (daemonless mode)
+
+## How It Works
+
+1. **Pull builder image**: Downloads `tensorfuse/fastpull-builder:latest` from Docker Hub
+2. **Mount context**: Your build context is mounted read-only into `/workspace`
+3. **Mount auth**: `~/.docker/config.json` is mounted for registry authentication
+4. **Run buildctl**: Builds normal OCI image with `buildctl-daemonless.sh`
+5. **Run buildctl again**: Builds Fastpull image with Nydus compression
+6. **Direct push**: Both images pushed directly to registry
+
+## Troubleshooting
+
+**"Error: Docker config not found"**
+- Run registry authentication command first (see Prerequisites)
+
+**"Error: Build context path does not exist"**
+- Check that the host path mounted at `/workspace` (`-v <path>:/workspace:ro`) is a valid directory
+
+**"Error: Dockerfile not found"**
+- Ensure Dockerfile exists in context directory
+- Or specify a custom name with `-e DOCKERFILE=<name>`
+
+**Build fails with authentication error:**
+- Re-authenticate to your registry
+- Check that `~/.docker/config.json` contains valid credentials
+
+**"permission denied" errors:**
+- Run the builder container with the `--privileged` flag (required for buildkit)
+- Ensure Docker is running with appropriate permissions
+
+## Comparison with Original build_push.py
+
+| Feature | Original | Container-Based |
+|---------|----------|-----------------|
+| Dependencies | Requires nerdctl, nydusify, soci, stargz locally | All tools in container |
+| Registry | AWS ECR or GAR | Any OCI registry |
+| Formats | normal, nydus, soci, estargz | normal, nydus |
+| Push method | nerdctl/docker | buildctl (direct) |
+| Portability | Requires snapshotter setup | Runs anywhere Docker runs |
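The naming rule from the README's Output section (`<image>:<tag>` plus `<image>:<tag>-fastpull`, with `:latest` as the default tag) can be sketched in Python. This is an illustrative sketch, not code from this change: `split_image_ref` and `output_images` are hypothetical names, and it assumes the tag is looked for only in the last path segment, so that a registry port such as `localhost:5000` is not mistaken for a tag.

```python
from typing import Tuple


def split_image_ref(ref: str) -> Tuple[str, str]:
    """Split 'name[:tag]' into (name, tag), defaulting the tag to 'latest'."""
    # Only the last path segment may carry a tag; earlier colons belong to host:port.
    last_segment = ref.rsplit("/", 1)[-1]
    if ":" in last_segment:
        name, tag = ref.rsplit(":", 1)
        return name, tag
    return ref, "latest"


def output_images(ref: str) -> Tuple[str, str]:
    """Return the (normal, fastpull) image references for one input reference."""
    name, tag = split_image_ref(ref)
    return f"{name}:{tag}", f"{name}:{tag}-fastpull"


if __name__ == "__main__":
    for example in ("my-registry.com/my-app", "localhost:5000/my-app", "docker.io/username/my-app:v1.0"):
        print(example, "->", output_images(example))
```

Checking only the last segment is a design choice: a plain "does the string contain a colon" test would split `localhost:5000/my-app` at the port.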
diff --git a/scripts/builder/build.sh b/scripts/builder/build.sh
new file mode 100644
index 0000000..8858ccf
--- /dev/null
+++ b/scripts/builder/build.sh
@@ -0,0 +1,72 @@
+#!/bin/sh
+set -e
+
+# Usage: build.sh <image>[:<tag>]
+# Example: build.sh my-registry.com/my-app:latest
+# Example: build.sh my-registry.com/my-app (defaults to :latest)
+
+if [ $# -lt 1 ]; then
+    echo "Usage: $0 <image>[:<tag>]"
+ echo "Example: $0 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.0"
+ echo "Example: $0 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app (defaults to :latest)"
+ exit 1
+fi
+
+IMAGE_WITH_TAG="$1"
+DOCKERFILE="${DOCKERFILE:-Dockerfile}"
+CONTEXT_PATH="${CONTEXT_PATH:-/workspace}"
+
+# Parse image and tag from the last path segment only, so a registry port (host:5000/app) is not read as a tag
+if echo "${IMAGE_WITH_TAG##*/}" | grep -q ":"; then
+    IMAGE_NAME="${IMAGE_WITH_TAG%:*}"
+    TAG="${IMAGE_WITH_TAG##*:}"
+else
+    IMAGE_NAME="$IMAGE_WITH_TAG"
+    TAG="latest"
+fi
+
+FULL_IMAGE="${IMAGE_NAME}:${TAG}"
+FULL_IMAGE_FASTPULL="${IMAGE_NAME}:${TAG}-fastpull"
+
+echo "=========================================="
+echo "Building images for: ${IMAGE_NAME}"
+echo "Tag: ${TAG}"
+echo "Context: ${CONTEXT_PATH}"
+echo "Dockerfile: ${DOCKERFILE}"
+echo "=========================================="
+
+# Build normal OCI image
+echo ""
+echo ">>> Building normal OCI image: ${FULL_IMAGE}"
+echo ""
+time buildctl-daemonless.sh build \
+ --frontend dockerfile.v0 \
+ --local context="${CONTEXT_PATH}" \
+ --local dockerfile="${CONTEXT_PATH}" \
+ --opt filename="${DOCKERFILE}" \
+ --output type=image,name="${FULL_IMAGE}",push=true
+
+echo ""
+echo "✓ Normal OCI image built and pushed: ${FULL_IMAGE}"
+echo ""
+
+# Build Fastpull image
+echo ""
+echo ">>> Building Fastpull image: ${FULL_IMAGE_FASTPULL}"
+echo ""
+time buildctl-daemonless.sh build \
+ --frontend dockerfile.v0 \
+ --local context="${CONTEXT_PATH}" \
+ --local dockerfile="${CONTEXT_PATH}" \
+ --opt filename="${DOCKERFILE}" \
+ --output type=image,name="${FULL_IMAGE_FASTPULL}",push=true,compression=nydus,force-compression=true,oci-mediatypes=true
+
+echo ""
+echo "✓ Fastpull image built and pushed: ${FULL_IMAGE_FASTPULL}"
+echo ""
+
+echo "=========================================="
+echo "✓ Build complete!"
+echo " Normal: ${FULL_IMAGE}"
+echo " Fastpull: ${FULL_IMAGE_FASTPULL}"
+echo "=========================================="
diff --git a/scripts/fastpull-cli.py b/scripts/fastpull-cli.py
new file mode 100755
index 0000000..064a501
--- /dev/null
+++ b/scripts/fastpull-cli.py
@@ -0,0 +1,81 @@
+#!/usr/bin/env python3
+"""
+FastPull - Accelerate AI/ML container startup with lazy-loading snapshotters.
+
+Main CLI entry point for the unified fastpull command.
+"""
+
+import argparse
+import sys
+import os
+
+# Add the library directory to the path to import fastpull module
+# When installed, fastpull module is at /usr/local/lib/fastpull
+# When running from source, it's in the same directory as this script
+script_dir = os.path.dirname(os.path.abspath(__file__))
+sys.path.insert(0, script_dir) # For running from source
+sys.path.insert(0, '/usr/local/lib') # For installed version
+
+from fastpull import __version__
+from fastpull import run, build, quickstart
+
+
+def main():
+ """Main CLI entry point."""
+ parser = argparse.ArgumentParser(
+ prog='fastpull',
+ description='FastPull - Accelerate AI/ML container startup with lazy-loading snapshotters',
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ epilog="""
+Examples:
+ # Run container with benchmarking
+ fastpull run --snapshotter nydus --image myapp:latest-nydus \\
+ --benchmark-mode readiness --readiness-endpoint http://localhost:8080/health -p 8080:8080
+
+ # Build and push Docker and Nydus images
+ fastpull build --dockerfile-path ./app --repository-url myapp:v1 --format docker,nydus
+
+For more information, visit: https://github.com/tensorfuse/fastpull
+ """
+ )
+
+ parser.add_argument(
+ '--version',
+ action='version',
+ version=f'%(prog)s {__version__}'
+ )
+
+ # Create subparsers for commands
+ subparsers = parser.add_subparsers(
+ dest='command',
+ title='commands',
+ description='Available fastpull commands',
+ help='Command to execute'
+ )
+
+ # Add subcommands
+ run.add_parser(subparsers)
+ build.add_parser(subparsers)
+ quickstart.add_parser(subparsers)
+
+ # Parse arguments
+ args = parser.parse_args()
+
+ # If no command specified, print help
+ if not args.command:
+ parser.print_help()
+ sys.exit(1)
+
+ # Execute the command
+ try:
+ args.func(args)
+ except KeyboardInterrupt:
+ print("\n\nInterrupted by user")
+ sys.exit(130)
+ except Exception as e:
+ print(f"Error: {e}")
+ sys.exit(1)
+
+
+if __name__ == '__main__':
+ main()
diff --git a/scripts/fastpull/__init__.py b/scripts/fastpull/__init__.py
new file mode 100644
index 0000000..23d4405
--- /dev/null
+++ b/scripts/fastpull/__init__.py
@@ -0,0 +1,8 @@
+"""
+FastPull - Accelerate AI/ML container startup with lazy-loading snapshotters.
+
+A unified CLI for building, pushing, and running containers with Nydus, SOCI,
+and eStarGZ snapshotters.
+"""
+
+__version__ = "0.1.0"
diff --git a/scripts/fastpull/benchmark.py b/scripts/fastpull/benchmark.py
new file mode 100644
index 0000000..f79d228
--- /dev/null
+++ b/scripts/fastpull/benchmark.py
@@ -0,0 +1,193 @@
+"""
+Benchmarking utilities for fastpull run command.
+
+Tracks container lifecycle events and readiness checks.
+"""
+
+import json
+import subprocess
+import threading
+import time
+from datetime import datetime
+from typing import Optional, Dict
+from urllib.request import urlopen
+from urllib.error import URLError, HTTPError
+
+
+class ContainerBenchmark:
+ """Track container startup and readiness metrics."""
+
+ def __init__(self, container_id: str, benchmark_mode: str = 'none',
+ readiness_endpoint: Optional[str] = None, mode: str = 'normal'):
+ """
+ Initialize benchmark tracker.
+
+ Args:
+ container_id: Container ID to track
+ benchmark_mode: 'none', 'completion', or 'readiness'
+ readiness_endpoint: HTTP endpoint for readiness checks
+ mode: 'nydus' or 'normal' (for display purposes)
+ """
+ self.container_id = container_id
+ self.benchmark_mode = benchmark_mode
+ self.readiness_endpoint = readiness_endpoint
+ self.mode = mode
+ self.metrics: Dict[str, float] = {}
+ self.start_time = time.time()
+ self._event_thread: Optional[threading.Thread] = None
+ self._container_started = False
+
+    def start_event_monitoring(self):
+        """Start monitoring containerd events in background thread."""
+        if self.benchmark_mode == 'none':
+            return
+
+        def monitor_events():
+            """Monitor ctr events for container lifecycle."""
+            try:
+                # Run sudo ctr events and parse for our container
+                proc = subprocess.Popen(
+                    ['sudo', 'ctr', 'events'],
+                    stdout=subprocess.PIPE,
+                    stderr=subprocess.DEVNULL,  # stderr is never read, so do not pipe it
+                    text=True,
+                    bufsize=1
+                )
+
+                for line in proc.stdout:
+                    # Look for /tasks/start event (check any task since we're the only one running)
+                    if '/tasks/start' in line and self.metrics.get('container_start_time') is None:
+                        elapsed = time.time() - self.start_time
+                        self.metrics['container_start_time'] = elapsed
+                        self._container_started = True
+                        print(f"[{elapsed:.3f}s] ✓ CONTAINER START")
+
+                    # Look for our specific container's exit event
+                    if self.container_id in line and '/tasks/exit' in line and self.benchmark_mode == 'completion':
+                        elapsed = time.time() - self.start_time
+                        self.metrics['completion_time'] = elapsed
+                        print(f"[{elapsed:.3f}s] ✓ CONTAINER EXIT")
+                        break
+                proc.terminate()  # stop `ctr events` once monitoring is done
+            except Exception as e:
+                print(f"Event monitoring error: {e}")
+
+        self._event_thread = threading.Thread(target=monitor_events, daemon=True)
+        self._event_thread.start()
+
+    def wait_for_readiness(self, timeout: int = 600, poll_interval: int = 2):
+        """
+        Poll readiness endpoint until HTTP 200 response.
+
+        Args:
+            timeout: Maximum time to wait in seconds
+            poll_interval: Time between polls in seconds
+
+        Returns:
+            True if endpoint became ready, False if timeout
+        """
+        if self.benchmark_mode != 'readiness' or not self.readiness_endpoint:
+            return True
+
+        # Ensure endpoint has protocol prefix
+        endpoint = self.readiness_endpoint
+        if not endpoint.startswith(('http://', 'https://')):
+            endpoint = f'http://{endpoint}'
+
+        print(f"Polling {endpoint} for readiness...")
+        end_time = time.time() + timeout
+
+        while time.time() < end_time:
+            try:
+                with urlopen(endpoint, timeout=5) as response:
+                    if response.getcode() == 200:
+                        elapsed = time.time() - self.start_time
+                        self.metrics['readiness_time'] = elapsed
+                        print(f"Container ready (HTTP 200): {elapsed:.2f}s")
+                        return True
+            except (URLError, HTTPError, OSError):  # OSError also covers resets and timeouts while the server boots
+                pass
+
+            time.sleep(poll_interval)
+
+        print(f"Readiness check timeout after {timeout}s")
+        return False
+
+ def wait_for_completion(self, timeout: int = 3600):
+ """
+ Wait for container to exit.
+
+ Args:
+ timeout: Maximum time to wait in seconds
+
+ Returns:
+ True if container exited, False if timeout
+ """
+ if self.benchmark_mode != 'completion':
+ return True
+
+ print(f"Waiting for container completion...")
+ end_time = time.time() + timeout
+
+ while time.time() < end_time:
+ # Check if container is still running
+ result = subprocess.run(
+ ['nerdctl', 'ps', '-q', '-f', f'id={self.container_id}'],
+ capture_output=True,
+ text=True
+ )
+
+ if not result.stdout.strip():
+ # Container has exited
+ if 'completion_time' not in self.metrics:
+ elapsed = time.time() - self.start_time
+ self.metrics['completion_time'] = elapsed
+ print(f"Container completed")
+ return True
+
+ time.sleep(1)
+
+ print(f"Completion timeout after {timeout}s")
+ return False
+
+ def print_summary(self):
+ """Print benchmark results summary."""
+ if self.benchmark_mode == 'none':
+ return
+
+ mode_label = "FASTPULL" if self.mode == 'nydus' else "NORMAL"
+ print("\n" + "="*50)
+ print(f"{mode_label} BENCHMARK SUMMARY")
+ print("="*50)
+
+ if 'container_start_time' in self.metrics:
+ print(f"Time to Container Start: {self.metrics['container_start_time']:.3f}s")
+
+ if 'readiness_time' in self.metrics:
+ print(f"Time to Readiness: {self.metrics['readiness_time']:.3f}s")
+
+ if 'completion_time' in self.metrics:
+ print(f"Time to Completion: {self.metrics['completion_time']:.3f}s")
+
+ total_time = time.time() - self.start_time
+ print(f"Total Elapsed Time: {total_time:.3f}s")
+ print("="*50 + "\n")
+
+ def export_json(self, filepath: str):
+ """
+ Export metrics to JSON file.
+
+ Args:
+ filepath: Path to output JSON file
+ """
+ output = {
+ 'container_id': self.container_id,
+ 'benchmark_mode': self.benchmark_mode,
+ 'metrics': self.metrics,
+ 'timestamp': datetime.now().isoformat()
+ }
+
+ with open(filepath, 'w') as f:
+ json.dump(output, f, indent=2)
+
+ print(f"Metrics exported to {filepath}")
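The readiness loop in `benchmark.py` follows a common poll-until-deadline pattern. A minimal standalone sketch of that pattern is shown here, assuming an injectable probe so it can be exercised without a running container. `poll_until_ready` and its `probe`, `clock`, and `sleep` parameters are hypothetical names for illustration, not part of this change.

```python
import time
from typing import Callable, Optional


def poll_until_ready(probe: Callable[[], bool],
                     timeout: float = 600.0,
                     poll_interval: float = 2.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Call probe() until it returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    deadline = start + timeout
    while clock() < deadline:
        try:
            if probe():
                return clock() - start
        except OSError:
            # Connection refused/reset and socket timeouts are expected while a server boots.
            pass
        sleep(poll_interval)
    return None


if __name__ == "__main__":
    attempts = {"count": 0}

    def fake_probe() -> bool:
        # Simulates a service that resets the first two connections, then answers.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionResetError("service still starting")
        return True

    elapsed = poll_until_ready(fake_probe, timeout=5.0, poll_interval=0.0)
    print(f"ready after {attempts['count']} attempts, {elapsed:.3f}s")
```

Using a monotonic clock is a deliberate choice in this sketch: it keeps the measured interval unaffected by wall-clock adjustments during a long benchmark.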
diff --git a/scripts/fastpull/build.py b/scripts/fastpull/build.py
new file mode 100644
index 0000000..7418d17
--- /dev/null
+++ b/scripts/fastpull/build.py
@@ -0,0 +1,428 @@
+"""
+FastPull build command - Build and convert container images.
+
+Supports two modes:
+1. Build from Dockerfile: docker build → push → convert
+2. Convert existing image: pull (if needed) → push → convert
+"""
+
+import argparse
+import os
+import subprocess
+import sys
+from typing import List
+
+from . import common
+
+
+def add_parser(subparsers):
+ """Add build subcommand parser."""
+ parser = subparsers.add_parser(
+ 'build',
+ help='Build and convert container images',
+ description='Build Docker images and convert to Nydus/SOCI/eStarGZ formats'
+ )
+
+ # Image specification
+ parser.add_argument(
+ '--repository-url',
+ required=True,
+ help='Full image reference (e.g., account.dkr.ecr.region.amazonaws.com/myapp:v1)'
+ )
+ parser.add_argument(
+ '--dockerfile-path',
+ help='Path to Dockerfile directory (optional - if not provided, assumes image exists)'
+ )
+
+ # Registry configuration
+ parser.add_argument(
+ '--registry',
+ choices=['ecr', 'gar', 'dockerhub', 'auto'],
+ default='auto',
+ help='Registry type (default: auto-detect from image URL)'
+ )
+
+ # Google GAR parameters
+ parser.add_argument(
+ '--project-id',
+ help='GCP project ID (for GAR)'
+ )
+ parser.add_argument(
+ '--location',
+ default='us-central1',
+ help='GCP location (default: us-central1)'
+ )
+ parser.add_argument(
+ '--repository',
+ help='GAR repository name (for GAR)'
+ )
+
+ # Build options
+ parser.add_argument(
+ '--format',
+ default='docker,nydus',
+ help='Comma-separated formats: docker, nydus, soci, estargz (default: docker,nydus)'
+ )
+ parser.add_argument(
+ '--no-cache',
+ action='store_true',
+ help='Build without cache'
+ )
+ parser.add_argument(
+ '--build-arg',
+ action='append',
+ help='Build arguments (can be used multiple times)'
+ )
+ parser.add_argument(
+ '--dockerfile',
+ default='Dockerfile',
+ help='Dockerfile name (default: Dockerfile)'
+ )
+
+ parser.set_defaults(func=build_command)
+ return parser
+
+
+def build_command(args):
+    """Execute the build command."""
+    if ':' not in args.repository_url.rsplit('/', 1)[-1]:  # no tag or digest given: default to :latest
+        args.repository_url += ':latest'
+
+    # Auto-detect registry
+    if args.registry == 'auto':
+        args.registry = common.detect_registry_type(args.repository_url)
+        if args.registry == 'unknown':
+            print(f"Error: Could not auto-detect registry from image: {args.repository_url}")
+            print("Please specify --registry explicitly")
+            sys.exit(1)
+        print(f"Auto-detected registry: {args.registry}")
+
+    # Validate registry-specific parameters
+    if args.registry == 'ecr':
+        # Get account and region from AWS CLI
+        args.account = common.get_aws_account_id()
+        args.region = common.get_aws_region() or 'us-east-1'  # Fallback to default
+
+        if not args.account:
+            print("Error: Could not detect AWS account ID. Please configure AWS CLI (aws configure)")
+            sys.exit(1)
+
+        print(f"Using AWS account: {args.account}, region: {args.region}")
+
+    if args.registry == 'gar' and not args.repository:
+        parsed = common.parse_gar_url(args.repository_url)
+        if parsed:
+            args.location, args.project_id, args.repository = parsed
+        else:
+            print("Error: --repository required for GAR")
+            sys.exit(1)
+
+    # Parse formats
+    formats = [f.strip().lower() for f in args.format.split(',')]
+    valid_formats = ['docker', 'nydus', 'soci', 'estargz']
+    for fmt in formats:
+        if fmt not in valid_formats:
+            print(f"Error: Invalid format '{fmt}'. Valid: {', '.join(valid_formats)}")
+            sys.exit(1)
+
+    # Determine build mode
+    if args.dockerfile_path:
+        # Mode 1: Build from Dockerfile
+        build_from_dockerfile(args, formats)
+    else:
+        # Mode 2: Convert existing image
+        if 'docker' in formats:
+            print("Warning: --dockerfile-path not provided, skipping docker build")
+            formats.remove('docker')
+
+        if not formats:
+            print("Error: No formats to build (docker requires --dockerfile-path)")
+            sys.exit(1)
+
+        convert_existing_image(args, formats)
+
+    print("\n" + "="*60)
+    print("BUILD COMPLETE")
+    print("="*60)
+
+
+def authenticate_registry(args) -> bool:
+ """Authenticate with the registry."""
+ if args.registry == 'ecr':
+ return authenticate_ecr(args)
+ elif args.registry == 'gar':
+ return authenticate_gar(args)
+ elif args.registry == 'dockerhub':
+ print("Assuming Docker Hub authentication already configured")
+ return True
+ return False
+
+
+def authenticate_ecr(args) -> bool:
+ """Authenticate with AWS ECR."""
+ try:
+ # Get login password
+ result = subprocess.run(
+ ['aws', 'ecr', 'get-login-password', '--region', args.region],
+ check=True,
+ capture_output=True,
+ text=True
+ )
+ password = result.stdout.strip()
+
+ # Login with docker
+ registry_url = f"{args.account}.dkr.ecr.{args.region}.amazonaws.com"
+ subprocess.run(
+ ['docker', 'login', '--username', 'AWS', '--password-stdin', registry_url],
+ input=password,
+ check=True,
+ capture_output=True,
+ text=True
+ )
+
+ # Login with nerdctl
+ subprocess.run(
+ ['sudo', 'nerdctl', 'login', '--username', 'AWS', '--password-stdin', registry_url],
+ input=password,
+ check=True,
+ capture_output=True,
+ text=True
+ )
+
+ print(f"✓ Authenticated with ECR")
+ return True
+ except subprocess.CalledProcessError as e:
+ print(f"✗ ECR authentication failed: {e}")
+ return False
+
+
+def authenticate_gar(args) -> bool:
+ """Authenticate with Google Artifact Registry."""
+ try:
+ if not args.project_id:
+ result = subprocess.run(
+ ['gcloud', 'config', 'get', 'project'],
+ check=True,
+ capture_output=True,
+ text=True
+ )
+ args.project_id = result.stdout.strip()
+
+ registry_url = f"{args.location}-docker.pkg.dev"
+ subprocess.run(
+ ['gcloud', 'auth', 'configure-docker', registry_url, '--quiet'],
+ check=True,
+ capture_output=True
+ )
+
+ print(f"✓ Authenticated with GAR")
+ return True
+ except subprocess.CalledProcessError as e:
+ print(f"✗ GAR authentication failed: {e}")
+ return False
+
+
+def build_from_dockerfile(args, formats: List[str]):
+ """Mode 1: Build from Dockerfile, push, and convert."""
+ print("\n" + "="*60)
+ print("MODE: Build from Dockerfile")
+ print("="*60)
+
+ # Auto-detect if dockerfile_path is a file or directory
+ if os.path.isfile(args.dockerfile_path):
+ # User provided a file path, extract directory and filename
+ dockerfile_dir = os.path.dirname(args.dockerfile_path)
+ dockerfile_name = os.path.basename(args.dockerfile_path)
+
+ # Use current directory if no directory in path
+ if not dockerfile_dir:
+ dockerfile_dir = '.'
+
+ # Override the dockerfile argument with detected filename
+ args.dockerfile = dockerfile_name
+ args.dockerfile_path = dockerfile_dir
+
+ print(f"Detected Dockerfile: {dockerfile_name} in {dockerfile_dir}")
+
+ # Validate directory exists
+ if not os.path.isdir(args.dockerfile_path):
+ print(f"Error: Directory not found: {args.dockerfile_path}")
+ sys.exit(1)
+
+ # Construct full Dockerfile path
+ dockerfile_path = os.path.join(args.dockerfile_path, args.dockerfile)
+ if not os.path.isfile(dockerfile_path):
+ print(f"Error: Dockerfile not found: {dockerfile_path}")
+ sys.exit(1)
+
+ built_images = []
+
+ # Build and push Docker image
+ if 'docker' in formats:
+ if build_and_push_docker(args):
+ built_images.append(args.repository_url)
+
+ # Convert to other formats
+ if 'nydus' in formats:
+ nydus_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-fastpull"
+ if convert_to_nydus(args.repository_url, nydus_image):
+ built_images.append(nydus_image)
+
+ if 'soci' in formats:
+ soci_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-soci"
+ if convert_to_soci(args.repository_url, soci_image):
+ built_images.append(soci_image)
+
+ if 'estargz' in formats:
+ estargz_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-estargz"
+ if convert_to_estargz(args.repository_url, estargz_image):
+ built_images.append(estargz_image)
+
+ # Summary
+ print_summary(built_images)
+
+
+def convert_existing_image(args, formats: List[str]):
+ """Mode 2: Convert existing image (no docker build)."""
+ print("\n" + "="*60)
+ print("MODE: Convert Existing Image")
+ print("="*60)
+
+ built_images = []
+
+ # Convert to requested formats
+ if 'nydus' in formats:
+ nydus_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-fastpull"
+ if convert_to_nydus(args.repository_url, nydus_image):
+ built_images.append(nydus_image)
+
+ if 'soci' in formats:
+ soci_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-soci"
+ if convert_to_soci(args.repository_url, soci_image):
+ built_images.append(soci_image)
+
+ if 'estargz' in formats:
+ estargz_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-estargz"
+ if convert_to_estargz(args.repository_url, estargz_image):
+ built_images.append(estargz_image)
+
+ # Summary
+ print_summary(built_images)
+
+
+def build_and_push_docker(args) -> bool:
+ """Build and push Docker image."""
+ print(f"\n[Docker] Building {args.repository_url}...")
+
+ # Build
+ cmd = [
+ 'sudo', 'docker', 'build',
+ '-t', args.repository_url,
+ '-f', os.path.join(args.dockerfile_path, args.dockerfile)
+ ]
+
+ if args.no_cache:
+ cmd.append('--no-cache')
+
+ if args.build_arg:
+ for build_arg in args.build_arg:
+ cmd.extend(['--build-arg', build_arg])
+
+ cmd.append(args.dockerfile_path)
+
+ try:
+ subprocess.run(cmd, check=True)
+ print(f"[Docker] ✓ Built {args.repository_url}")
+ except subprocess.CalledProcessError:
+ print(f"[Docker] ✗ Build failed")
+ return False
+
+ # Push
+ print(f"[Docker] Pushing {args.repository_url}...")
+ try:
+ subprocess.run(['sudo', 'docker', 'push', args.repository_url], check=True)
+ print(f"[Docker] ✓ Pushed {args.repository_url}")
+ return True
+ except subprocess.CalledProcessError:
+ print(f"[Docker] ✗ Push failed")
+ return False
+
+
+def convert_to_nydus(source_image: str, target_image: str) -> bool:
+ """Convert to Nydus format."""
+ print(f"\n[Nydus] Converting {source_image} → {target_image}...")
+
+ cmd = [
+ 'nydusify', 'convert',
+ '--source', source_image,
+ '--target', target_image
+ ]
+
+ try:
+ subprocess.run(cmd, check=True)
+ print(f"[Nydus] ✓ Converted and pushed {target_image}")
+ return True
+ except subprocess.CalledProcessError:
+ print(f"[Nydus] ✗ Conversion failed")
+ return False
+
+
+def convert_to_soci(source_image: str, target_image: str) -> bool:
+ """Convert to SOCI format."""
+ print(f"\n[SOCI] Converting {source_image} → {target_image}...")
+
+ # Pull with nerdctl
+ try:
+ subprocess.run(['sudo', 'nerdctl', 'pull', source_image], check=True, capture_output=True)
+ except subprocess.CalledProcessError:
+ print(f"[SOCI] ✗ Pull failed")
+ return False
+
+ # Convert
+ try:
+ subprocess.run(['sudo', 'soci', 'create', source_image], check=True)
+ except subprocess.CalledProcessError:
+ print(f"[SOCI] ✗ Conversion failed")
+ return False
+
+ # Tag and push
+ try:
+ subprocess.run(['sudo', 'nerdctl', 'tag', source_image, target_image], check=True)
+ subprocess.run(['sudo', 'nerdctl', 'push', target_image], check=True)
+ print(f"[SOCI] ✓ Converted and pushed {target_image}")
+ return True
+ except subprocess.CalledProcessError:
+ print(f"[SOCI] ✗ Push failed")
+ return False
+
+
+def convert_to_estargz(source_image: str, target_image: str) -> bool:
+    """Convert to eStarGZ format."""
+    print(f"\n[eStarGZ] Converting {source_image} → {target_image}...")
+
+    try:
+        subprocess.run(['sudo', 'nerdctl', '--snapshotter', 'stargz', 'pull', source_image],
+                       check=True, capture_output=True)
+        subprocess.run(['sudo', 'nerdctl', 'image', 'convert', '--estargz', '--oci', source_image, target_image],
+                       check=True)  # re-tagging alone would leave the layers in their original format
+        subprocess.run(['sudo', 'nerdctl', '--snapshotter', 'stargz', 'push', target_image],
+                       check=True)
+        print(f"[eStarGZ] ✓ Converted and pushed {target_image}")
+        return True
+    except subprocess.CalledProcessError:
+        print(f"[eStarGZ] ✗ Conversion failed")
+        return False
+
+
+def print_summary(images: List[str]):
+ """Print build summary."""
+ print("\n" + "="*60)
+ print("SUMMARY")
+ print("="*60)
+ if images:
+ print("Successfully built and pushed:")
+ for img in images:
+ print(f" ✓ {img}")
+ else:
+ print("No images were built successfully")
+ print("="*60)
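The `--format` handling in `build_command` can be sketched as a small pure function, which makes the two build modes easier to test in isolation. This is a sketch under assumptions: `parse_formats` and `VALID_FORMATS` are hypothetical names, it raises `ValueError` where `build.py` prints a message and exits, and it removes duplicate entries, which `build.py` does not do.

```python
from typing import List

VALID_FORMATS = ("docker", "nydus", "soci", "estargz")


def parse_formats(spec: str, have_dockerfile: bool) -> List[str]:
    """Validate a comma-separated format list and drop 'docker' in convert-only mode."""
    formats: List[str] = []
    for item in spec.split(","):
        fmt = item.strip().lower()
        if fmt not in VALID_FORMATS:
            raise ValueError(f"Invalid format '{fmt}'. Valid: {', '.join(VALID_FORMATS)}")
        if fmt not in formats:  # assumption: duplicates are collapsed, keeping first-seen order
            formats.append(fmt)

    # Without a Dockerfile there is nothing to build, so only conversions remain.
    if not have_dockerfile and "docker" in formats:
        formats.remove("docker")

    if not formats:
        raise ValueError("No formats to build (docker requires a Dockerfile path)")
    return formats


if __name__ == "__main__":
    print(parse_formats("docker, Nydus", have_dockerfile=True))
    print(parse_formats("docker,soci,estargz", have_dockerfile=False))
```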
diff --git a/scripts/fastpull/clean.py b/scripts/fastpull/clean.py
new file mode 100644
index 0000000..85a53f1
--- /dev/null
+++ b/scripts/fastpull/clean.py
@@ -0,0 +1,181 @@
+"""
+FastPull clean command - Remove local images and artifacts.
+"""
+
+import argparse
+import subprocess
+import sys
+from typing import List
+
+
+def add_parser(subparsers):
+ """Add clean subcommand parser."""
+ parser = subparsers.add_parser(
+ 'clean',
+ help='Remove local images and artifacts',
+ description='Clean up fastpull images and containers'
+ )
+
+ parser.add_argument(
+ '--images',
+ action='store_true',
+ help='Remove all fastpull images'
+ )
+ parser.add_argument(
+ '--containers',
+ action='store_true',
+        help='Remove containers (running ones are force-removed)'
+ )
+ parser.add_argument(
+ '--all',
+ action='store_true',
+ help='Remove all images and containers'
+ )
+ parser.add_argument(
+ '--snapshotter',
+ choices=['nydus', 'overlayfs', 'all'],
+ default='all',
+ help='Target specific snapshotter (default: all)'
+ )
+ parser.add_argument(
+ '--dry-run',
+ action='store_true',
+ help='Show what would be removed without removing'
+ )
+ parser.add_argument(
+ '--force',
+ action='store_true',
+ help='Force removal without confirmation'
+ )
+
+ parser.set_defaults(func=clean_command)
+ return parser
+
+
+def clean_command(args):
+ """Execute the clean command."""
+    # Require an explicit target; nothing is cleaned by default
+ if not args.images and not args.containers and not args.all:
+ print("Please specify what to clean: --images, --containers, or --all")
+ sys.exit(1)
+
+ if args.all:
+ args.images = True
+ args.containers = True
+
+ # Determine which snapshotters to clean
+ snapshotters = ['nydus', 'overlayfs'] if args.snapshotter == 'all' else [args.snapshotter]
+
+ # Clean containers first
+ if args.containers:
+ clean_containers(snapshotters, args.dry_run, args.force)
+
+ # Clean images
+ if args.images:
+ clean_images(snapshotters, args.dry_run, args.force)
+
+
+def clean_containers(snapshotters: List[str], dry_run: bool = False, force: bool = False):
+ """
+    Remove all containers, including running ones (uses `nerdctl rm -f`).
+
+ Args:
+ snapshotters: List of snapshotters to target
+ dry_run: If True, only show what would be removed
+ force: If True, skip confirmation
+ """
+ print("\n=== Cleaning Containers ===")
+
+ for snapshotter in snapshotters:
+ # Get all containers (including stopped ones)
+ result = subprocess.run(
+ ['sudo', 'nerdctl', '--snapshotter', snapshotter, 'ps', '-a', '-q'],
+ capture_output=True,
+ text=True
+ )
+
+ container_ids = result.stdout.strip().split('\n') if result.stdout.strip() else []
+
+ if not container_ids:
+ print(f"[{snapshotter}] No containers to clean")
+ continue
+
+ print(f"[{snapshotter}] Found {len(container_ids)} container(s)")
+
+ if dry_run:
+ print(f"[{snapshotter}] Would remove {len(container_ids)} container(s)")
+ for cid in container_ids:
+ print(f" - {cid}")
+ continue
+
+ # Confirm removal
+ if not force:
+ response = input(f"Remove {len(container_ids)} container(s) for {snapshotter}? [y/N]: ")
+ if response.lower() != 'y':
+ print(f"[{snapshotter}] Skipped")
+ continue
+
+ # Remove containers
+ for cid in container_ids:
+ subprocess.run(
+ ['sudo', 'nerdctl', '--snapshotter', snapshotter, 'rm', '-f', cid],
+ capture_output=True
+ )
+
+ print(f"[{snapshotter}] Removed {len(container_ids)} container(s)")
+
+
+def clean_images(snapshotters: List[str], dry_run: bool = False, force: bool = False):
+ """
+ Remove all images.
+
+ Args:
+ snapshotters: List of snapshotters to target
+ dry_run: If True, only show what would be removed
+ force: If True, skip confirmation
+ """
+ print("\n=== Cleaning Images ===")
+
+ for snapshotter in snapshotters:
+ # Get all images
+ result = subprocess.run(
+ ['sudo', 'nerdctl', '--snapshotter', snapshotter, 'images', '-q'],
+ capture_output=True,
+ text=True
+ )
+
+ image_ids = result.stdout.strip().split('\n') if result.stdout.strip() else []
+
+ if not image_ids:
+ print(f"[{snapshotter}] No images to clean")
+ continue
+
+ print(f"[{snapshotter}] Found {len(image_ids)} image(s)")
+
+ if dry_run:
+ print(f"[{snapshotter}] Would remove {len(image_ids)} image(s)")
+ # Show image details
+ result = subprocess.run(
+ ['sudo', 'nerdctl', '--snapshotter', snapshotter, 'images'],
+ capture_output=True,
+ text=True
+ )
+ print(result.stdout)
+ continue
+
+ # Confirm removal
+ if not force:
+ response = input(f"Remove {len(image_ids)} image(s) for {snapshotter}? [y/N]: ")
+ if response.lower() != 'y':
+ print(f"[{snapshotter}] Skipped")
+ continue
+
+ # Remove images
+ subprocess.run(
+ ['sudo', 'nerdctl', '--snapshotter', snapshotter, 'rmi', '-f'] + image_ids,
+ capture_output=True
+ )
+
+ print(f"[{snapshotter}] Removed {len(image_ids)} image(s)")
+
+ print("\n=== Cleanup Complete ===\n")
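Both cleanup functions turn the output of a `-q` listing into a list of IDs before deciding what to remove. That parsing step can be sketched on its own. `parse_ids` is a hypothetical helper, not part of this change, and the de-duplication is an assumption added for listings that repeat an ID (for example one image carrying several tags).

```python
from typing import List


def parse_ids(stdout: str) -> List[str]:
    """Turn quiet-mode listing output into a list of unique IDs, in first-seen order."""
    seen = set()
    ids: List[str] = []
    for line in stdout.splitlines():
        item = line.strip()
        if item and item not in seen:  # skip blank lines and repeated IDs
            seen.add(item)
            ids.append(item)
    return ids


if __name__ == "__main__":
    sample = "3f2a9c\n3f2a9c\n\n81bd07\n"
    print(parse_ids(sample))
```

With unique IDs, the "Found N image(s)" count reported to the user matches the number of distinct objects that would be removed.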
diff --git a/scripts/fastpull/cli.py b/scripts/fastpull/cli.py
new file mode 100644
index 0000000..1a2f4cb
--- /dev/null
+++ b/scripts/fastpull/cli.py
@@ -0,0 +1,73 @@
+#!/usr/bin/env python3
+"""
+FastPull - Accelerate AI/ML container startup with lazy-loading snapshotters.
+
+Main CLI entry point for the unified fastpull command.
+"""
+
+import argparse
+import sys
+
+from . import __version__, run, build, quickstart, clean
+
+
+def main():
+ """Main CLI entry point."""
+ parser = argparse.ArgumentParser(
+ prog='fastpull',
+ description='FastPull - Accelerate AI/ML container startup with lazy-loading snapshotters',
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ epilog="""
+Examples:
+ # Run container with benchmarking
+ fastpull run --snapshotter nydus --image myapp:latest-nydus \\
+ --benchmark-mode readiness --readiness-endpoint http://localhost:8080/health -p 8080:8080
+
+ # Build and push Docker and Nydus images
+ fastpull build --dockerfile-path ./app --repository-url myapp:v1 --format docker,nydus
+
+For more information, visit: https://github.com/tensorfuse/fastpull
+ """
+ )
+
+ parser.add_argument(
+ '--version',
+ action='version',
+ version=f'%(prog)s {__version__}'
+ )
+
+ # Create subparsers for commands
+ subparsers = parser.add_subparsers(
+ dest='command',
+ title='commands',
+ description='Available fastpull commands',
+ help='Command to execute'
+ )
+
+ # Add subcommands
+ run.add_parser(subparsers)
+ build.add_parser(subparsers)
+ quickstart.add_parser(subparsers)
+ clean.add_parser(subparsers)
+
+ # Parse arguments
+ args = parser.parse_args()
+
+ # If no command specified, print help
+ if not args.command:
+ parser.print_help()
+ sys.exit(1)
+
+ # Execute the command
+ try:
+ args.func(args)
+ except KeyboardInterrupt:
+ print("\n\nInterrupted by user")
+ sys.exit(130)
+ except Exception as e:
+ print(f"Error: {e}")
+ sys.exit(1)
+
+
+if __name__ == '__main__':
+ main()
diff --git a/scripts/fastpull/common.py b/scripts/fastpull/common.py
new file mode 100644
index 0000000..bb01a07
--- /dev/null
+++ b/scripts/fastpull/common.py
@@ -0,0 +1,139 @@
+"""
+Common utilities for fastpull commands.
+
+Includes registry detection, authentication helpers, and shared functions.
+"""
+
+import re
+import subprocess
+from typing import Optional, Tuple
+
+
+def detect_registry_type(image: str) -> str:
+    """
+    Auto-detect registry type from image URL.
+
+    Args:
+        image: Container image URL
+
+    Returns:
+        Registry type: 'ecr', 'gar', 'dockerhub', or 'unknown'
+    """
+    if 'dkr.ecr' in image or 'ecr.aws' in image:
+        return 'ecr'
+    elif 'pkg.dev' in image:
+        return 'gar'
+    elif 'docker.io' in image or '/' not in image or image.count('/') == 1:
+        return 'dockerhub'
+    return 'unknown'
+
+
+def parse_ecr_url(image: str) -> Optional[Tuple[str, str, str]]:
+    """
+    Parse ECR image URL to extract account, region, and repository.
+
+    Args:
+        image: ECR image URL
+
+    Returns:
+        Tuple of (account_id, region, repository including any tag) or None if invalid
+    """
+    pattern = r'(\d+)\.dkr\.ecr\.([^.]+)\.amazonaws\.com/(.+)'
+    match = re.match(pattern, image)
+    if match:
+        return match.group(1), match.group(2), match.group(3)
+    return None
+
+
+def parse_gar_url(image: str) -> Optional[Tuple[str, str, str]]:
+    """
+    Parse GAR image URL to extract location, project, and repository.
+
+    Args:
+        image: GAR image URL (e.g., us-central1-docker.pkg.dev/project/repo/image:tag)
+
+    Returns:
+        Tuple of (location, project_id, repository) or None if invalid
+    """
+    # Pattern: location-docker.pkg.dev/project/repository/image:tag
+    # Use .+? for location to handle hyphens (e.g., us-central1)
+    pattern = r'(.+?)-docker\.pkg\.dev/([^/]+)/([^/]+)'
+    match = re.match(pattern, image)
+    if match:
+        return match.group(1), match.group(2), match.group(3)
+    return None
+
+
+def run_command(cmd: list, check: bool = True, capture_output: bool = True) -> subprocess.CompletedProcess:
+    """
+    Run a shell command with consistent error handling.
+
+    Args:
+        cmd: Command to run as list of strings
+        check: Raise exception on non-zero exit code
+        capture_output: Capture stdout/stderr
+
+    Returns:
+        CompletedProcess instance
+    """
+    return subprocess.run(
+        cmd,
+        check=check,
+        capture_output=capture_output,
+        text=True
+    )
+
+
+def get_snapshotter_binary(snapshotter: str) -> str:
+    """
+    Get the appropriate binary for the snapshotter.
+
+    Args:
+        snapshotter: Snapshotter type
+
+    Returns:
+        Binary name ('nerdctl' or 'docker')
+    """
+    # Lazy-loading snapshotters use nerdctl; 'docker' and 'overlayfs' map to docker
+    if snapshotter in ['docker', 'overlayfs']:
+        return 'docker'
+    return 'nerdctl'
+
+
+def get_aws_account_id() -> Optional[str]:
+    """
+    Get AWS account ID from AWS CLI.
+
+    Returns:
+        Account ID or None if failed
+    """
+    try:
+        result = subprocess.run(
+            ['aws', 'sts', 'get-caller-identity', '--query', 'Account', '--output', 'text'],
+            check=True,
+            capture_output=True,
+            text=True
+        )
+        return result.stdout.strip()
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        return None
+
+
+def get_aws_region() -> Optional[str]:
+    """
+    Get AWS region from AWS CLI configuration.
+
+    Returns:
+        Region or None if failed
+    """
+    try:
+        result = subprocess.run(
+            ['aws', 'configure', 'get', 'region'],
+            check=True,
+            capture_output=True,
+            text=True
+        )
+        region = result.stdout.strip()
+        return region if region else None
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        return None
diff --git a/scripts/fastpull/quickstart.py b/scripts/fastpull/quickstart.py
new file mode 100644
index 0000000..1795b8e
--- /dev/null
+++ b/scripts/fastpull/quickstart.py
@@ -0,0 +1,81 @@
+"""
+FastPull quickstart command - Quick benchmarking comparisons.
+"""
+
+import argparse
+import subprocess
+import sys
+import os
+
+
+# Workload configurations: (name, base_image, endpoint)
+WORKLOADS = {
+    'tensorrt': ('TensorRT', 'tensorrt', '/health'),
+    'vllm': ('vLLM', 'vllm', '/health'),
+    'sglang': ('SGLang', 'sglang', '/health_generate'),
+}
+
+
+def add_parser(subparsers):
+    """Add quickstart subcommand parser."""
+    parser = subparsers.add_parser(
+        'quickstart',
+        help='Quick benchmark comparisons',
+        description='Run pre-configured benchmarks'
+    )
+
+    subparsers_qs = parser.add_subparsers(dest='workload', help='Workload to benchmark')
+
+    for workload in WORKLOADS:
+        wp = subparsers_qs.add_parser(workload, help=f'Benchmark {WORKLOADS[workload][0]} (nydus vs overlayfs)')
+        wp.add_argument('--output-dir', help='Directory to save results')
+        wp.set_defaults(func=run_quickstart)
+
+    parser.set_defaults(func=lambda args: parser.print_help() if not args.workload else None)
+    return parser
+
+
+def run_quickstart(args):
+    """Run benchmark comparison for a workload."""
+    name, image_name, endpoint = WORKLOADS[args.workload]
+
+    print(f"\n{'='*60}\n{name} Benchmark: FastPull vs Normal\n{'='*60}\n")
+
+    base = f"public.ecr.aws/s6z9f6e5/tensorfuse/fastpull/{image_name}:latest"
+
+    for mode in ['nydus', 'normal']:
+        print(f"\n[{mode.upper()}] Starting benchmark...")
+
+        # Use fastpull command directly (works when installed via pip)
+        cmd = [
+            'fastpull', 'run',
+            '--mode', mode,
+            '--benchmark-mode', 'readiness',
+            '--readiness-endpoint', f'http://localhost:8080{endpoint}',
+            '-p', '8080:8000',
+            '--gpus', 'all',
+            base  # Image as positional argument (tag suffix added automatically by run command)
+        ]
+
+        if args.output_dir:
+            os.makedirs(args.output_dir, exist_ok=True)
+            cmd.extend(['--output-json', f'{args.output_dir}/{image_name}-{mode}.json'])
+
+        try:
+            subprocess.run(cmd, check=True)
+        except (subprocess.CalledProcessError, KeyboardInterrupt):
+            sys.exit(1)
+
+    print(f"\n{'='*60}\nBenchmark complete!")
+    if args.output_dir:
+        print(f"Results: {args.output_dir}/")
+    print(f"{'='*60}\n")
+
+    # Auto cleanup after benchmarks complete
+    print("\nCleaning up containers and images...")
+    cleanup_cmd = ['fastpull', 'clean', '--all', '--force']
+    try:
+        subprocess.run(cleanup_cmd, check=False)  # Don't fail if cleanup has issues
+    except Exception as e:
+        print(f"Warning: Cleanup had issues: {e}")
+    print("Cleanup complete!\n")
diff --git a/scripts/fastpull/run.py b/scripts/fastpull/run.py
new file mode 100644
index 0000000..3cfb0cb
--- /dev/null
+++ b/scripts/fastpull/run.py
@@ -0,0 +1,325 @@
+"""
+FastPull run command - Run containers with specified snapshotters and benchmarking.
+"""
+
+import argparse
+import subprocess
+import sys
+import threading
+import time
+from typing import List, Optional
+
+from . import benchmark
+from . import common
+
+
+def add_parser(subparsers):
+    """Add run subcommand parser."""
+    parser = subparsers.add_parser(
+        'run',
+        help='Run container with specified snapshotter',
+        description='Run containers with Nydus or OverlayFS snapshotter'
+    )
+
+    # Mode selection (replaces --snapshotter)
+    parser.add_argument(
+        '--mode',
+        choices=['nydus', 'normal'],
+        default='nydus',
+        help='Run mode: nydus (default, adds -fastpull suffix) or normal (overlayfs, no suffix)'
+    )
+
+    # Benchmarking arguments
+    parser.add_argument(
+        '--benchmark-mode',
+        choices=['none', 'completion', 'readiness'],
+        default='none',
+        help='Benchmarking mode (default: none)'
+    )
+    parser.add_argument(
+        '--readiness-endpoint',
+        help='HTTP endpoint to poll for readiness (required if benchmark-mode=readiness)'
+    )
+    parser.add_argument(
+        '--output-json',
+        help='Export benchmark metrics to JSON file'
+    )
+
+    # Common container flags
+    parser.add_argument('--name', help='Container name')
+    parser.add_argument('-p', '--publish', action='append', help='Publish ports (can be used multiple times)')
+    parser.add_argument('-e', '--env', action='append', help='Set environment variables')
+    parser.add_argument('-v', '--volume', action='append', help='Bind mount volumes')
+    parser.add_argument('--gpus', help='GPU devices to use (e.g., "all")')
+    parser.add_argument('--rm', action='store_true', help='Automatically remove container when it exits')
+    parser.add_argument('-d', '--detach', action='store_true', help='Run container in background')
+
+    # Image as positional argument (like docker/nerdctl run)
+    parser.add_argument(
+        'image',
+        help='Container image to run'
+    )
+
+    # Pass-through for additional nerdctl flags (optional trailing args)
+    parser.add_argument(
+        'nerdctl_args',
+        nargs='*',
+        help='Additional arguments to pass to nerdctl/docker (e.g., command to run in container)'
+    )
+
+    parser.set_defaults(func=run_command)
+    return parser
+
+
+def run_command(args):
+    """Execute the run command."""
+    # Validate benchmark mode
+    if args.benchmark_mode == 'readiness' and not args.readiness_endpoint:
+        print("Error: --readiness-endpoint is required when --benchmark-mode=readiness")
+        sys.exit(1)
+
+    # Determine snapshotter and modify image tag based on mode
+    if args.mode == 'nydus':
+        args.snapshotter = 'nydus'
+        # Add -fastpull suffix to image tag if not already present
+        if ':' in args.image:
+            base, tag = args.image.rsplit(':', 1)
+            if not tag.endswith('-fastpull'):
+                args.image = f"{base}:{tag}-fastpull"
+        else:
+            args.image = f"{args.image}:latest-fastpull"
+    else:  # normal mode
+        args.snapshotter = 'overlayfs'
+        # Use image as-is for normal mode
+
+    # Build the nerdctl/docker command
+    cmd = build_run_command(args)
+
+    print(f"Running container with {args.snapshotter} snapshotter...")
+    print(f"Image: {args.image}")
+    print(f"Command: {' '.join(cmd)}\n")
+
+    # For benchmarking, we need to track the container
+    if args.benchmark_mode != 'none':
+        run_with_benchmark(cmd, args)
+    else:
+        run_without_benchmark(cmd)
+
+
+def build_run_command(args) -> List[str]:
+    """
+    Build the nerdctl/docker run command from arguments.
+
+    Args:
+        args: Parsed command-line arguments
+
+    Returns:
+        Command as list of strings
+    """
+    # Determine binary (use sudo)
+    if args.snapshotter == 'overlayfs':
+        cmd = ['sudo', 'nerdctl', '--snapshotter', 'overlayfs', 'run']
+    else:
+        cmd = ['sudo', 'nerdctl', '--snapshotter', args.snapshotter, 'run']
+
+    # Add common flags
+    if args.name:
+        cmd.extend(['--name', args.name])
+
+    if args.rm:
+        cmd.append('--rm')
+
+    if args.detach:
+        cmd.append('-d')
+
+    # Add ports
+    if args.publish:
+        for port in args.publish:
+            cmd.extend(['-p', port])
+
+    # Add environment variables
+    if args.env:
+        for env in args.env:
+            cmd.extend(['-e', env])
+
+    # Add volumes
+    if args.volume:
+        for vol in args.volume:
+            cmd.extend(['-v', vol])
+
+    # Add GPU support
+    if args.gpus:
+        cmd.extend(['--gpus', args.gpus])
+
+    # Add image after all flags
+    cmd.append(args.image)
+
+    # Trailing arguments (e.g., the command to run) must follow the image
+    if args.nerdctl_args:
+        cmd.extend(args.nerdctl_args)
+
+    return cmd
+
+
+def run_without_benchmark(cmd: List[str]):
+    """
+    Run container without benchmarking.
+
+    Args:
+        cmd: Command to execute
+    """
+    try:
+        subprocess.run(cmd, check=True)
+    except subprocess.CalledProcessError as e:
+        print(f"Error running container: {e}")
+        sys.exit(1)
+
+
+def run_with_benchmark(cmd: List[str], args):
+    """
+    Run container with benchmarking enabled.
+
+    Args:
+        cmd: Command to execute
+        args: Parsed arguments
+    """
+    # Force detached mode for benchmarking
+    if '-d' not in cmd and '--detach' not in cmd:
+        cmd.insert(cmd.index('run') + 1, '-d')
+
+    # Initialize benchmark tracker early (before starting container)
+    # We'll set container_id later, but we need to start event monitoring first
+    bench = benchmark.ContainerBenchmark(
+        container_id='',  # Will be set after container starts
+        benchmark_mode=args.benchmark_mode,
+        readiness_endpoint=args.readiness_endpoint,
+        mode=args.mode
+    )
+
+    # Start event monitoring BEFORE starting the container
+    print("Starting containerd events monitoring...")
+    bench.start_event_monitoring()
+
+    # Small delay to ensure event monitoring is ready
+    time.sleep(0.5)
+
+    # Start the container
+    try:
+        print("Running container...")
+        result = subprocess.run(
+            cmd,
+            check=True,
+            capture_output=True,
+            text=True
+        )
+        container_id = result.stdout.strip()
+
+        if not container_id:
+            print("Error: Failed to get container ID")
+            sys.exit(1)
+
+        print(f"Container started: {container_id[:12]}")
+
+        # Update benchmark tracker with container ID
+        bench.container_id = container_id
+
+    except subprocess.CalledProcessError as e:
+        print(f"Error starting container: {e}")
+        if e.stderr:
+            print(f"stderr: {e.stderr}")
+        sys.exit(1)
+
+    # Start monitoring logs in background
+    print("Monitoring container logs...")
+    stop_logs_event = threading.Event()
+    log_thread = start_log_monitoring(container_id, args.snapshotter, bench.start_time, stop_logs_event)
+
+    # Wait for completion or readiness
+    try:
+        if args.benchmark_mode == 'completion':
+            success = bench.wait_for_completion()
+        elif args.benchmark_mode == 'readiness':
+            success = bench.wait_for_readiness()
+        else:
+            success = True
+
+        # Stop log monitoring after benchmark completes
+        stop_logs_event.set()
+
+        if not success:
+            print("Benchmark failed (timeout)")
+            # Cleanup on failure
+            cleanup_container(container_id, args.snapshotter)
+            sys.exit(1)
+
+        # Print summary
+        bench.print_summary()
+
+        # Export JSON if requested
+        if args.output_json:
+            bench.export_json(args.output_json)
+
+        # Cleanup container after successful benchmark
+        print("\nBenchmark complete, cleaning up container...")
+        cleanup_container(container_id, args.snapshotter)
+
+    except KeyboardInterrupt:
+        print("\nInterrupted by user")
+        # Stop and remove container
+        cleanup_container(container_id, args.snapshotter)
+        sys.exit(1)
+
+
+def start_log_monitoring(container_id: str, snapshotter: str, start_time: float, stop_event: threading.Event) -> threading.Thread:
+    """
+    Start monitoring container logs in background thread.
+
+    Args:
+        container_id: Container ID
+        snapshotter: Snapshotter type
+        start_time: Benchmark start time
+        stop_event: Event to signal when to stop monitoring
+
+    Returns:
+        Log monitoring thread
+    """
+    def log_reader():
+        try:
+            cmd = ['sudo', 'nerdctl', 'logs', '-f', container_id]
+
+            process = subprocess.Popen(
+                cmd,
+                stdout=subprocess.PIPE,
+                stderr=subprocess.STDOUT,
+                text=True,
+                bufsize=1,
+                universal_newlines=True
+            )
+
+            for line in process.stdout:
+                if stop_event.is_set():
+                    process.terminate()
+                    break
+                if line:
+                    elapsed = time.time() - start_time
+                    print(f"[{elapsed:.3f}s] {line.rstrip()}")
+
+        except Exception:
+            pass  # Silently handle errors (container might be stopped)
+
+    thread = threading.Thread(target=log_reader, daemon=True)
+    thread.start()
+    return thread
+
+
+def cleanup_container(container_id: str, snapshotter: str):
+    """
+    Stop and remove container.
+
+    Args:
+        container_id: Container ID
+        snapshotter: Snapshotter type
+    """
+    print(f"Cleaning up container {container_id[:12]}...")
+    subprocess.run(['sudo', 'nerdctl', 'stop', container_id], capture_output=True)
+    subprocess.run(['sudo', 'nerdctl', 'rm', container_id], capture_output=True)
diff --git a/scripts/install_snapshotters.py b/scripts/install_snapshotters.py
deleted file mode 100755
index ec959b7..0000000
--- a/scripts/install_snapshotters.py
+++ /dev/null
@@ -1,523 +0,0 @@
-#!/usr/bin/env python3
-"""
-Container Snapshotter Installation Script
-
-This script installs and configures multiple container snapshotters:
-- Nydus: Efficient container image storage with lazy loading
-- SOCI (Seekable OCI): AWS-developed snapshotter for faster container startup
-- StarGZ: Google-developed snapshotter with eStargz format support
-
-The script also installs supporting tools like nerdctl and CNI plugins,
-configures systemd services, and sets up containerd integration.
-
-Requirements:
-- Must be run as root
-- Linux system with systemd
-- Internet access for downloading binaries
-"""
-
-import os
-import sys
-import subprocess
-import shutil
-import tempfile
-from pathlib import Path
-
-# Configuration constants for component versions
-NYDUS_VERSION = "2.3.6"
-NYDUS_SNAPSHOTTER_VERSION = "0.15.3"
-NERDCTL_VERSION = "2.1.4"
-CNI_VERSION = "v1.8.0"
-SOCI_VERSION = "0.11.1"
-STARGZ_VERSION = "0.17.0"
-
-def run_command(cmd, check=True, shell=False):
- """
- Execute a shell command with error handling.
-
- Args:
- cmd: Command to execute (list or string)
- check: Whether to raise exception on non-zero exit code
- shell: Whether to use shell execution
-
- Returns:
- subprocess.CompletedProcess: Command execution result
- """
- if shell:
- result = subprocess.run(cmd, shell=True, check=check, capture_output=True, text=True)
- else:
- result = subprocess.run(cmd, check=check, capture_output=True, text=True)
- return result
-
-def check_root():
- """
- Verify that the script is running with root privileges.
- Exits with error code 1 if not running as root.
- """
- if os.geteuid() != 0:
- print("This script must be run as root")
- sys.exit(1)
-
-def download_and_extract(url, extract_to=None):
- """
- Download and extract a tar.gz archive from a URL.
-
- Args:
- url: URL to download the archive from
- extract_to: Optional directory to extract to (current dir if None)
-
- Returns:
- str: Filename of the downloaded archive
- """
- filename = url.split('/')[-1]
-
- # Download the archive
- print(f" Downloading {filename}...")
- run_command(['wget', url])
-
- # Extract the archive
- print(f" Extracting {filename}...")
- if extract_to:
- run_command(['tar', '-xzf', filename, '-C', extract_to])
- else:
- run_command(['tar', '-xzf', filename])
-
- # Clean up the downloaded archive
- os.remove(filename)
- return filename
-
-def install_nydus():
- """
- Install Nydus container image acceleration toolkit.
-
- Nydus provides lazy loading capabilities for container images,
- reducing startup time and bandwidth usage.
- """
- print("------------------ Installing Nydus -------------------------------")
- print(f"Installing Nydus v{NYDUS_VERSION}...")
-
- # Download and extract Nydus static binaries
- url = f"https://github.com/dragonflyoss/nydus/releases/download/v{NYDUS_VERSION}/nydus-static-v{NYDUS_VERSION}-linux-amd64.tgz"
- download_and_extract(url)
-
- # Install binaries to system path
- print(" Installing Nydus binaries...")
- nydus_binaries = list(Path('nydus-static').glob('*'))
- run_command(['cp', '-r'] + [str(b) for b in nydus_binaries] + ['/usr/local/bin/'])
-
- # Make binaries executable
- nydus_installed = list(Path('/usr/local/bin').glob('nydus*'))
- run_command(['chmod', '+x'] + [str(p) for p in nydus_installed])
-
- # Clean up temporary files
- shutil.rmtree('nydus-static', ignore_errors=True)
-
-def install_nydus_snapshotter():
- """
- Install Nydus Snapshotter for containerd integration.
-
- This component bridges Nydus with containerd, enabling
- container runtime to use Nydus-optimized images.
- """
- print(f"Installing Nydus Snapshotter v{NYDUS_SNAPSHOTTER_VERSION}...")
-
- # Download Nydus Snapshotter
- url = f"https://github.com/containerd/nydus-snapshotter/releases/download/v{NYDUS_SNAPSHOTTER_VERSION}/nydus-snapshotter-v{NYDUS_SNAPSHOTTER_VERSION}-linux-amd64.tar.gz"
- download_and_extract(url)
-
- # Install the containerd-nydus-grpc binary
- print(" Installing Nydus Snapshotter binary...")
- run_command(['cp', 'bin/containerd-nydus-grpc', '/usr/local/bin/'])
- run_command(['chmod', '+x', '/usr/local/bin/containerd-nydus-grpc'])
-
- # Clean up temporary files
- shutil.rmtree('bin', ignore_errors=True)
-
-def install_nerdctl():
- """
- Install nerdctl - containerd-compatible Docker CLI.
-
- nerdctl provides a Docker-compatible command line interface
- for containerd, enabling easy container management.
- """
- print(f"Installing nerdctl v{NERDCTL_VERSION}...")
-
- # Download nerdctl
- url = f"https://github.com/containerd/nerdctl/releases/download/v{NERDCTL_VERSION}/nerdctl-{NERDCTL_VERSION}-linux-amd64.tar.gz"
- download_and_extract(url)
-
- # Install nerdctl binary
- print(" Installing nerdctl binary...")
- run_command(['cp', 'nerdctl', '/usr/local/bin/'])
-
- # Clean up temporary files
- os.remove('nerdctl')
-
-def install_cni_plugins():
- """
- Install Container Network Interface (CNI) plugins.
-
- CNI plugins provide networking capabilities for containers,
- enabling network isolation and communication.
- """
- print("Installing CNI plugins...")
-
- # Create CNI plugin directory
- print(" Creating CNI plugin directory...")
- os.makedirs('/opt/cni/bin', exist_ok=True)
-
- # Download and install CNI plugins
- url = f"https://github.com/containernetworking/plugins/releases/download/{CNI_VERSION}/cni-plugins-linux-amd64-{CNI_VERSION}.tgz"
- filename = url.split('/')[-1]
-
- print(f" Downloading CNI plugins {CNI_VERSION}...")
- run_command(['wget', url])
-
- print(" Installing CNI plugins...")
- run_command(['tar', '-xzf', filename, '-C', '/opt/cni/bin'])
- os.remove(filename)
-
-def test_nydus_installation():
- """
- Verify that Nydus components are properly installed.
-
- Tests the installation by checking version information
- for core Nydus tools.
- """
- print("Testing Nydus installation...")
-
- # List of Nydus tools to test
- commands = [
- ['nydus-image', '--version'], # Image conversion tool
- ['nydusd', '--version'], # Nydus daemon
- ['nydusify', '--version'] # Image format converter
- ]
-
- # Test each tool and report any failures
- for cmd in commands:
- try:
- result = run_command(cmd)
- print(f" ✓ {cmd[0]} is working")
- except subprocess.CalledProcessError as e:
- print(f" ✗ Warning: {' '.join(cmd)} failed: {e}")
-
-def configure_nydus_snapshotter():
- """
- Create configuration files for Nydus Snapshotter.
-
- Sets up the nydusd daemon configuration with optimized
- settings for registry backend and filesystem prefetching.
- """
- print("=== Nydus Snapshotter Configuration Deployment ===")
-
- # Create Nydus configuration directory
- print(" Creating Nydus configuration directory...")
- os.makedirs('/etc/nydus', exist_ok=True)
-
- # Nydus daemon configuration for FUSE mode
- config_content = """{
- "device": {
- "backend": {
- "type": "registry",
- "config": {
- "timeout": 5,
- "connect_timeout": 5,
- "retry_limit": 2
- }
- },
- "cache": {
- "type": "blobcache"
- }
- },
- "mode": "direct",
- "digest_validate": false,
- "iostats_files": false,
- "enable_xattr": true,
- "amplify_io": 1048576,
- "fs_prefetch": {
- "enable": true,
- "threads_count": 64,
- "merging_size": 1048576,
- "prefetch_all": true
- }
-}"""
-
- # Write configuration file
- print(" Writing Nydus daemon configuration...")
- with open('/etc/nydus/nydusd-config.fusedev.json', 'w') as f:
- f.write(config_content)
-
-def install_soci():
- """
- Install SOCI (Seekable OCI) snapshotter.
-
- SOCI is AWS's container image format that enables
- faster container startup through lazy loading.
- """
- print("------------------ Installing Soci -------------------------------")
- print(f"Installing SOCI v{SOCI_VERSION}...")
-
- # Download SOCI snapshotter
- url = f"https://github.com/awslabs/soci-snapshotter/releases/download/v{SOCI_VERSION}/soci-snapshotter-{SOCI_VERSION}-linux-amd64.tar.gz"
- filename = url.split('/')[-1]
-
- print(" Downloading SOCI snapshotter...")
- run_command(['wget', url])
-
- # Extract specific binaries directly to system path
- print(" Installing SOCI binaries...")
- run_command(['tar', '-C', '/usr/local/bin', '-xvf', filename, 'soci', 'soci-snapshotter-grpc'])
- os.remove(filename)
-
-def install_stargz():
- """
- Install StarGZ snapshotter.
-
- StarGZ (Stargz/eStargz) is Google's container image format
- that provides lazy loading capabilities similar to Nydus.
- """
- print("------------------ Installing (e)StarGZ -------------------------------")
- print(f"Installing StarGZ v{STARGZ_VERSION}...")
-
- # Download StarGZ snapshotter
- url = f"https://github.com/containerd/stargz-snapshotter/releases/download/v{STARGZ_VERSION}/stargz-snapshotter-v{STARGZ_VERSION}-linux-amd64.tar.gz"
- filename = url.split('/')[-1]
-
- print(" Downloading StarGZ snapshotter...")
- run_command(['wget', url])
-
- # Extract specific binaries directly to system path
- print(" Installing StarGZ binaries...")
- run_command(['tar', '-C', '/usr/local/bin', '-xvf', filename, 'containerd-stargz-grpc', 'ctr-remote'])
- os.remove(filename)
-
-def setup_systemd_services(snapshotters):
- """
- Create and start systemd services for specified snapshotters.
-
- Creates service files for each snapshotter daemon and starts them.
- This enables automatic startup and management via systemctl.
-
- Args:
- snapshotters: List of snapshotters to set up ('nydus', 'soci', 'stargz')
- """
- print("------------------ Setting up Snapshotter Services -------------------------------")
-
- services_to_start = []
-
- if 'nydus' in snapshotters:
- # Nydus Snapshotter service configuration
- print(" Creating Nydus Snapshotter service...")
- nydus_service = """[Unit]
-Description=nydus snapshotter (fuse mode)
-After=network.target
-
-[Service]
-Type=simple
-ExecStart=/usr/local/bin/containerd-nydus-grpc --nydusd-config /etc/nydus/nydusd-config.fusedev.json
-Restart=always
-StandardOutput=journal
-StandardError=journal
-
-[Install]
-WantedBy=multi-user.target
-"""
-
- with open('/etc/systemd/system/nydus-snapshotter-fuse.service', 'w') as f:
- f.write(nydus_service)
- services_to_start.append('nydus-snapshotter-fuse.service')
-
- if 'soci' in snapshotters:
- # SOCI Snapshotter service configuration
- print(" Creating SOCI Snapshotter service...")
- soci_service = """[Unit]
-Description=SOCI Snapshotter GRPC daemon
-After=network.target
-
-[Service]
-Type=simple
-ExecStart=/usr/local/bin/soci-snapshotter-grpc
-Restart=on-failure
-
-[Install]
-WantedBy=multi-user.target
-"""
-
- with open('/etc/systemd/system/soci-snapshotter-grpc.service', 'w') as f:
- f.write(soci_service)
- services_to_start.append('soci-snapshotter-grpc.service')
-
- if 'stargz' in snapshotters:
- # StarGZ Snapshotter service configuration
- print(" Creating StarGZ Snapshotter service...")
- stargz_service = """[Unit]
-Description=Stargz Snapshotter daemon
-After=network.target
-
-[Service]
-Type=simple
-ExecStart=/usr/local/bin/containerd-stargz-grpc
-Restart=on-failure
-
-[Install]
-WantedBy=multi-user.target
-"""
-
- with open('/etc/systemd/system/stargz-snapshotter.service', 'w') as f:
- f.write(stargz_service)
- services_to_start.append('stargz-snapshotter.service')
-
- # Start all snapshotter services
- if services_to_start:
- print(" Starting snapshotter services...")
- for service in services_to_start:
- print(f" Starting {service}...")
- run_command(['systemctl', 'start', service])
-
-def setup_containerd(snapshotters):
- """
- Configure containerd to use the installed snapshotters.
-
- Creates containerd configuration that registers specified
- snapshotters as proxy plugins, then restarts containerd.
-
- Args:
- snapshotters: List of snapshotters to configure ('nydus', 'soci', 'stargz')
- """
- print("------------------ Setting up Containerd -------------------------------")
-
- # Ensure containerd configuration directory exists
- print(" Creating containerd configuration directory...")
- os.makedirs('/etc/containerd', exist_ok=True)
-
- # Build containerd configuration with proxy plugins for specified snapshotters
- containerd_config = "version = 2\n\n[proxy_plugins]\n"
-
- if 'soci' in snapshotters:
- containerd_config += """ [proxy_plugins.soci]
- type = "snapshot"
- address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"
-"""
-
- if 'nydus' in snapshotters:
- containerd_config += """ [proxy_plugins.nydus]
- type = "snapshot"
- address = "/run/containerd-nydus/containerd-nydus-grpc.sock"
-"""
-
- if 'stargz' in snapshotters:
- containerd_config += """ [proxy_plugins.stargz]
- type = "snapshot"
- address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"
- [proxy_plugins.stargz.exports]
- root = "/var/lib/containerd-stargz-grpc/"
-"""
-
- # Write containerd configuration
- print(" Writing containerd configuration...")
- with open('/etc/containerd/config.toml', 'w') as f:
- f.write(containerd_config)
-
- # Restart containerd to apply new configuration
- print(" Restarting containerd service...")
- run_command(['systemctl', 'restart', 'containerd'])
-
-def main():
- """
- Main installation orchestrator.
-
- Performs the complete installation sequence:
- 1. Verify root privileges
- 2. Install specified snapshotter components and dependencies
- 3. Configure services and containerd integration
- 4. Start all services
-
- Uses a temporary directory for downloads to avoid cluttering
- the current working directory.
- """
- import argparse
-
- # Parse command line arguments
- parser = argparse.ArgumentParser(
- description="Install container snapshotters for lazy-loading container images.",
- formatter_class=argparse.RawDescriptionHelpFormatter,
- epilog="""
-Examples:
- # Install only Nydus (default)
- sudo python3 install_snapshotters.py
-
- # Install all snapshotters
- sudo python3 install_snapshotters.py --snapshotters nydus,soci,stargz
-
- # Install Nydus and SOCI
- sudo python3 install_snapshotters.py --snapshotters nydus,soci
- """)
-
- parser.add_argument(
- "--snapshotters",
- default="nydus",
- help="Comma-separated list of snapshotters to install (nydus,soci,stargz). Default: nydus"
- )
-
- args = parser.parse_args()
-
- # Parse and validate snapshotters
- requested_snapshotters = [s.strip() for s in args.snapshotters.split(",")]
- valid_snapshotters = {"nydus", "soci", "stargz"}
- invalid_snapshotters = set(requested_snapshotters) - valid_snapshotters
-
- if invalid_snapshotters:
- print(f"Error: Invalid snapshotters: {invalid_snapshotters}")
- print(f"Valid options: {valid_snapshotters}")
- sys.exit(1)
-
- # Ensure script is run with root privileges
- check_root()
-
- snapshotter_names = ", ".join(requested_snapshotters)
- print("Starting container snapshotter installation...")
- print(f"Installing: {snapshotter_names}, nerdctl, and CNI plugins")
- print()
-
- # Use temporary directory for all downloads and extraction
- with tempfile.TemporaryDirectory() as tmpdir:
- original_dir = os.getcwd()
- os.chdir(tmpdir)
-
- try:
- # Install core container runtime tools first
- install_nerdctl()
- install_cni_plugins()
-
- # Install Nydus components if requested
- if 'nydus' in requested_snapshotters:
- install_nydus()
- install_nydus_snapshotter()
- test_nydus_installation()
- configure_nydus_snapshotter()
-
- # Install SOCI if requested
- if 'soci' in requested_snapshotters:
- install_soci()
-
- # Install StarGZ if requested
- if 'stargz' in requested_snapshotters:
- install_stargz()
-
- # Set up system integration for installed snapshotters
- setup_systemd_services(requested_snapshotters)
- setup_containerd(requested_snapshotters)
-
- finally:
- # Return to original directory
- os.chdir(original_dir)
-
- print()
- print("------------------ INSTALLATION COMPLETE -------------------")
- print(f"Installed snapshotters: {snapshotter_names}")
- print("You can now use nerdctl with --snapshotter flag to specify:")
- for snapshotter in requested_snapshotters:
- print(f" --snapshotter={snapshotter}")
-
-if __name__ == "__main__":
- main()
diff --git a/scripts/setup.py b/scripts/setup.py
new file mode 100755
index 0000000..d1c31fd
--- /dev/null
+++ b/scripts/setup.py
@@ -0,0 +1,550 @@
+#!/usr/bin/env python3
+"""
+FastPull Setup Script
+
+Installs containerd, Nydus snapshotter, and FastPull CLI via pip.
+"""
+
+import argparse
+import os
+import subprocess
+import sys
+
+
+SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
+PROJECT_ROOT = os.path.dirname(SCRIPT_DIR)
+VENV_PATH = os.path.join(PROJECT_ROOT, '.venv')
+FASTPULL_BIN = '/usr/local/bin/fastpull'
+
+
+def run_command(cmd, check=True, capture_output=False, shell=False):
+    """Run a command and return result."""
+    try:
+        if shell:
+            result = subprocess.run(cmd, shell=True, check=check, capture_output=capture_output, text=True)
+        else:
+            result = subprocess.run(cmd, check=check, capture_output=capture_output, text=True)
+        return result
+    except subprocess.CalledProcessError as e:
+        if not check:
+            return e
+        raise
+
+
+def detect_package_manager():
+    """Detect the system package manager."""
+    # Check for apt (Debian/Ubuntu)
+    if os.path.exists('/usr/bin/apt-get') or os.path.exists('/usr/bin/apt'):
+        return 'apt'
+    # Check for yum (RHEL/CentOS 7)
+    elif os.path.exists('/usr/bin/yum'):
+        return 'yum'
+    # Check for dnf (RHEL/CentOS 8+/Fedora)
+    elif os.path.exists('/usr/bin/dnf'):
+        return 'dnf'
+    else:
+        return None
+
+
+def install_system_dependencies():
+    """Install required system packages (python3-venv, wget)."""
+    pkg_mgr = detect_package_manager()
+
+    if not pkg_mgr:
+        print("⚠ Warning: Could not detect package manager (apt/yum/dnf)")
+        print("Please manually install: python3-venv, wget")
+        return False
+
+    print(f"Detected package manager: {pkg_mgr}")
+    print("Installing system dependencies (python3-venv, wget)...")
+
+    try:
+        if pkg_mgr == 'apt':
+            # Update package list and install dependencies
+            run_command(['apt-get', 'update', '-qq'], check=True)
+            run_command(['apt-get', 'install', '-y', 'python3-venv', 'wget'], check=True)
+        elif pkg_mgr == 'yum':
+            run_command(['yum', 'install', '-y', 'wget'], check=True)  # python3-venv is a Debian package name
+        elif pkg_mgr == 'dnf':
+            run_command(['dnf', 'install', '-y', 'wget'], check=True)  # RPM distros ship venv with python3
+
+        print("✓ System dependencies installed")
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"✗ Failed to install system dependencies: {e}")
+        return False
+
+
+def check_root():
+    """Check if running as root."""
+    if os.geteuid() != 0:
+        print("Error: This script must be run as root (use sudo)")
+        sys.exit(1)
+
+
+def install_containerd_nerdctl():
+    """Install containerd and nerdctl."""
+    print("\n" + "="*60)
+    print("Installing Containerd & Nerdctl")
+    print("="*60)
+
+    # Check if already installed
+    nerdctl_path = "/usr/local/bin/nerdctl"
+    if os.path.exists(nerdctl_path):
+        print(f"✓ nerdctl already installed at {nerdctl_path}")
+        result = run_command([nerdctl_path, "--version"], capture_output=True)
+        print(f" {result.stdout.strip()}")
+        return True
+
+    print("\nInstalling containerd and nerdctl...")
+
+    install_script = """
+set -e
+
+cd /tmp
+
+# Remove old download if exists
+rm -f /tmp/nerdctl-full.tar.gz
+
+# Download nerdctl-full
+NERDCTL_VERSION="1.7.3"
+echo "Downloading nerdctl-full ${NERDCTL_VERSION}..."
+wget -O /tmp/nerdctl-full.tar.gz https://github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/nerdctl-full-${NERDCTL_VERSION}-linux-amd64.tar.gz
+
+# Extract to /usr/local
+echo "Extracting to /usr/local..."
+tar -C /usr/local -xzf /tmp/nerdctl-full.tar.gz
+
+# Enable and start containerd service
+echo "Enabling containerd service..."
+systemctl enable containerd
+systemctl start containerd
+
+# Clean up
+rm -f /tmp/nerdctl-full.tar.gz
+
+echo "✓ Containerd and nerdctl installed"
+"""
+
+    try:
+        result = run_command(install_script, shell=True, capture_output=True)
+        print("✓ Containerd and nerdctl installed successfully")
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"✗ Failed to install containerd: {e}")
+        if e.stdout:
+            print(f"stdout: {e.stdout}")
+        if e.stderr:
+            print(f"stderr: {e.stderr}")
+        return False
+
+
+def install_nydus():
+    """Install Nydus snapshotter."""
+    print("\n" + "="*60)
+    print("Installing Nydus Snapshotter")
+    print("="*60)
+
+    nydus_path = "/usr/local/bin/containerd-nydus-grpc"
+    service_path = "/etc/systemd/system/fastpull.service"
+
+    # Check if binary exists
+    if os.path.exists(nydus_path):
+        print(f"✓ Nydus binary found at {nydus_path}")
+        # Always recreate service and config (to ensure latest settings)
+        print("Updating service and configuration...")
+        create_nydus_service()
+        return True
+
+    install_script = """
+set -e
+
+NYDUS_SNAPSHOTTER_VERSION="0.15.3"
+echo "Downloading Nydus Snapshotter v${NYDUS_SNAPSHOTTER_VERSION}..."
+
+# Download Nydus Snapshotter
+cd /tmp
+wget https://github.com/containerd/nydus-snapshotter/releases/download/v${NYDUS_SNAPSHOTTER_VERSION}/nydus-snapshotter-v${NYDUS_SNAPSHOTTER_VERSION}-linux-amd64.tar.gz
+
+# Extract and install
+tar -xzf nydus-snapshotter-v${NYDUS_SNAPSHOTTER_VERSION}-linux-amd64.tar.gz
+cp bin/containerd-nydus-grpc /usr/local/bin/
+chmod +x /usr/local/bin/containerd-nydus-grpc
+
+# Also install nydusd (required by snapshotter)
+NYDUS_VERSION="v2.3.6"
+echo "Downloading Nydus tools ${NYDUS_VERSION}..."
+wget -O nydus.tgz https://github.com/dragonflyoss/nydus/releases/download/${NYDUS_VERSION}/nydus-static-${NYDUS_VERSION}-linux-amd64.tgz
+tar xzf nydus.tgz
+cp nydus-static/nydusd /usr/local/bin/
+cp nydus-static/nydus-image /usr/local/bin/
+cp nydus-static/nydusify /usr/local/bin/
+chmod +x /usr/local/bin/nydusd /usr/local/bin/nydus-image /usr/local/bin/nydusify
+
+# Clean up
+rm -rf bin nydus-snapshotter-v${NYDUS_SNAPSHOTTER_VERSION}-linux-amd64.tar.gz nydus-static nydus.tgz
+
+echo "✓ Nydus binaries installed"
+"""
+
+    try:
+        result = run_command(install_script, shell=True, capture_output=True)
+        print("✓ Nydus binaries installed successfully")
+
+        # Now create the service (shared code)
+        create_nydus_service()
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"✗ Failed to install Nydus: {e}")
+        if e.stderr:
+            print(f"stderr: {e.stderr}")
+        return False
+
+
+def create_nydus_service():
+    """Create systemd service for Nydus snapshotter."""
+    service_script = """
+set -e  # Fail fast; otherwise only the final echo's exit status is checked
+cat > /etc/systemd/system/fastpull.service <<'EOF'
+[Unit]
+Description=nydus snapshotter (fuse mode)
+After=network.target
+
+[Service]
+Type=simple
+ExecStart=/usr/local/bin/containerd-nydus-grpc --nydusd-config /etc/nydus/nydusd-config.fusedev.json
+Restart=always
+StandardOutput=journal
+StandardError=journal
+
+[Install]
+WantedBy=multi-user.target
+EOF
+
+# Create necessary directories
+mkdir -p /etc/nydus
+mkdir -p /var/lib/nydus/cache
+
+# Create Nydus config if it doesn't exist
+if [ ! -f /etc/nydus/nydusd-config.fusedev.json ]; then
+cat > /etc/nydus/nydusd-config.fusedev.json <<'EOF'
+{
+  "device": {
+    "backend": {
+      "type": "registry",
+      "config": {
+        "timeout": 5,
+        "connect_timeout": 5,
+        "retry_limit": 2
+      }
+    },
+    "cache": {
+      "type": "blobcache"
+    }
+  },
+  "mode": "direct",
+  "digest_validate": false,
+  "iostats_files": false,
+  "enable_xattr": true,
+  "amplify_io": 10485760,
+  "fs_prefetch": {
+    "enable": true,
+    "threads_count": 16,
+    "merging_size": 1048576,
+    "prefetch_all": true
+  }
+}
+EOF
+fi
+
+# Enable and start service
+systemctl daemon-reload
+systemctl enable fastpull.service
+systemctl start fastpull.service
+
+echo "✓ Nydus service created and started"
+"""
+
+    try:
+        run_command(service_script, shell=True, capture_output=True)
+        print("✓ Created and started fastpull.service")
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"✗ Failed to create service: {e}")
+        return False
+
+
+def configure_containerd_for_nydus():
+    """Configure containerd to use Nydus snapshotter."""
+    print("\nConfiguring containerd for Nydus...")
+
+    config_dir = "/etc/containerd"
+    config_file = os.path.join(config_dir, "config.toml")
+
+    os.makedirs(config_dir, exist_ok=True)
+
+    # Create containerd config with Nydus proxy plugin
+    config_content = """version = 2
+
+[proxy_plugins]
+  [proxy_plugins.nydus]
+    type = "snapshot"
+    address = "/run/containerd-nydus/containerd-nydus-grpc.sock"
+
+[plugins."io.containerd.grpc.v1.cri".containerd]
+  snapshotter = "nydus"
+  disable_snapshot_annotations = false
+"""
+
+    with open(config_file, 'w') as f:
+        f.write(config_content)
+
+    print(f"✓ Updated containerd config at {config_file}")
+
+    # Restart fastpull service first
+    print("Restarting fastpull service...")
+    run_command(["systemctl", "restart", "fastpull.service"], check=False)
+
+    # Then restart containerd service
+    print("Restarting containerd service...")
+    run_command(["systemctl", "restart", "containerd.service"], check=False)
+
+    print("✓ Services restarted")
+
+    return True
+
+
+def install_cli():
+    """Install fastpull CLI via pip in a venv."""
+    print("\n" + "="*60)
+    print("Installing FastPull CLI")
+    print("="*60)
+
+    try:
+        # Create venv if it doesn't exist
+        if not os.path.exists(VENV_PATH):
+            print(f"Creating virtual environment at {VENV_PATH}...")
+            result = run_command(['python3', '-m', 'venv', VENV_PATH], check=False, capture_output=True)
+            if result.returncode != 0:
+                print(f"✗ Failed to create venv: {result.stderr}")
+                return False
+            print("✓ Created virtual environment")
+
+        # Get pip and python paths in venv
+        venv_pip = os.path.join(VENV_PATH, 'bin', 'pip')
+        venv_python = os.path.join(VENV_PATH, 'bin', 'python3')
+
+        # Install fastpull in venv
+        print("Installing fastpull in virtual environment...")
+        result = run_command([venv_pip, 'install', '-e', PROJECT_ROOT], check=False, capture_output=True)
+        if result.returncode != 0:
+            print(f"✗ Failed to install in venv: {result.stderr}")
+            return False
+        print("✓ Installed fastpull in virtual environment")
+
+        # Create wrapper script in /usr/local/bin
+        wrapper_script = f"""#!/bin/bash
+# FastPull CLI wrapper script
+# Runs fastpull with the venv's Python interpreter (no activation needed)
+
+exec "{venv_python}" -m scripts.fastpull.cli "$@"
+"""
+
+        print(f"Creating wrapper script at {FASTPULL_BIN}...")
+        with open(FASTPULL_BIN, 'w') as f:
+            f.write(wrapper_script)
+        os.chmod(FASTPULL_BIN, 0o755)
+        print(f"✓ Created fastpull command at {FASTPULL_BIN}")
+
+        return True
+
+    except Exception as e:
+        print(f"✗ Failed to install fastpull: {e}")
+        return False
+
+
+def verify_installation():
+    """Verify fastpull installation."""
+    print("\n" + "="*60)
+    print("Verifying Installation")
+    print("="*60)
+
+    # Test CLI
+    try:
+        result = run_command(['fastpull', '--version'], capture_output=True, check=False)
+        if result.returncode == 0:
+            print(f"✓ fastpull CLI: {result.stdout.strip()}")
+        else:
+            print("✗ fastpull CLI not found in PATH")
+            print("Try running: hash -r (or restart your shell)")
+            return False
+    except Exception as e:
+        print(f"✗ fastpull CLI test failed: {e}")
+        return False
+
+    # Check nerdctl
+    nerdctl_path = "/usr/local/bin/nerdctl"
+    if os.path.exists(nerdctl_path):
+        try:
+            result = run_command([nerdctl_path, "--version"], capture_output=True)
+            print(f"✓ nerdctl: {result.stdout.strip().split()[2]}")
+        except Exception:
+            print(" nerdctl found but version check failed")
+
+    # Check containerd service (is-active exits non-zero when inactive, so don't raise)
+    try:
+        result = run_command(["systemctl", "is-active", "containerd.service"], capture_output=True, check=False)
+        if result.returncode == 0:
+            print("✓ containerd service: active")
+        else:
+            print(f" containerd service: {result.stdout.strip()}")
+    except Exception:
+        print(" Could not check containerd service")
+
+    # Check FastPull service
+    try:
+        result = run_command(["systemctl", "is-active", "fastpull.service"], capture_output=True, check=False)
+        if result.returncode == 0:
+            print("✓ fastpull service: active")
+        else:
+            print(f" fastpull service: {result.stdout.strip()}")
+    except Exception:
+        print(" Could not check fastpull service")
+
+    return True
+
+
+def main():
+    """Main setup function."""
+    parser = argparse.ArgumentParser(
+        description='Install FastPull with containerd and Nydus snapshotter',
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+ # Full installation (containerd + Nydus + CLI)
+ sudo python3 scripts/setup.py
+
+ # Install only CLI (skip containerd/Nydus setup)
+ sudo python3 scripts/setup.py --cli-only
+
+ # Uninstall fastpull CLI
+ sudo python3 scripts/setup.py --uninstall
+"""
+    )
+    parser.add_argument(
+        '--cli-only',
+        action='store_true',
+        help='Install only the fastpull CLI, skip containerd/Nydus setup'
+    )
+    parser.add_argument(
+        '--uninstall',
+        action='store_true',
+        help='Uninstall fastpull CLI'
+    )
+
+    args = parser.parse_args()
+
+    # Check root
+    check_root()
+
+    if args.uninstall:
+        print("Uninstalling fastpull...")
+        removed = False
+
+        # Remove wrapper script
+        if os.path.exists(FASTPULL_BIN):
+            os.remove(FASTPULL_BIN)
+            print(f"✓ Removed {FASTPULL_BIN}")
+            removed = True
+
+        # Remove venv
+        if os.path.exists(VENV_PATH):
+            import shutil
+            shutil.rmtree(VENV_PATH)
+            print(f"✓ Removed virtual environment at {VENV_PATH}")
+            removed = True
+
+        if removed:
+            print("✓ Uninstall complete")
+        else:
+            print("✗ fastpull not found or already uninstalled")
+        return
+
+    print("="*60)
+    print("FastPull Setup")
+    print("="*60)
+
+    if args.cli_only:
+        print("\nThis will install:")
+        print(" • FastPull CLI tool (via pip)")
+        print()
+    else:
+        print("\nThis will install:")
+        print(" • Containerd and nerdctl")
+        print(" • Nydus snapshotter")
+        print(" • FastPull CLI tool (via pip)")
+        print()
+
+    # Install system dependencies first
+    print("\n" + "="*60)
+    print("Installing System Dependencies")
+    print("="*60)
+    if not install_system_dependencies():
+        print("\n⚠ Warning: System dependencies installation had issues")
+        print("Continuing anyway, but you may encounter errors...")
+
+    # Track installation status
+    success = True
+    warnings = []
+
+    if not args.cli_only:
+        # Install containerd and nerdctl
+        if not install_containerd_nerdctl():
+            print("\n⚠ Warning: Containerd installation failed")
+            print("You can still install the CLI with --cli-only")
+            sys.exit(1)
+
+        # Install Nydus snapshotter
+        if not install_nydus():
+            print("\n⚠ Warning: Nydus installation failed")
+            success = False
+            warnings.append("Nydus snapshotter installation failed")
+        else:
+            # Only configure containerd if Nydus installed successfully
+            configure_containerd_for_nydus()
+
+    # Install CLI
+    if not install_cli():
+        print("\nSetup incomplete: CLI installation failed")
+        if not args.cli_only:
+            print("Note: Snapshotters may have been installed")
+        sys.exit(1)
+
+    # Verify
+    verify_installation()
+
+    print("\n" + "="*60)
+    if success:
+        print("✅ Fastpull installed successfully on your VM")
+    else:
+        print("⚠️ Fastpull installed with warnings")
+        print("\nWarnings:")
+        for warning in warnings:
+            print(f" • {warning}")
+    print("="*60)
+    print("\n📋 Usage:")
+    print(" fastpull --help")
+    print(" fastpull run --help")
+    print(" fastpull build --help")
+    print(" fastpull quickstart --help")
+    if not args.cli_only:
+        print("\n🔍 Check services:")
+        print(" systemctl status containerd")
+        print(" systemctl status fastpull")
+    print("\n📖 Example:")
+    print(" fastpull quickstart tensorrt")
+    print("="*60)
+
+
+if __name__ == '__main__':
+    main()