diff --git a/.gitignore b/.gitignore index 78e1837..d230f33 100644 --- a/.gitignore +++ b/.gitignore @@ -160,8 +160,4 @@ Thumbs.db *.bak *.backup -# Claude specific files -CLAUDE.md - -# Installation scripts -scripts/install_cuda_nvidia.sh +results/ diff --git a/README.md b/README.md index 7269f2d..380bb7a 100644 --- a/README.md +++ b/README.md @@ -6,11 +6,10 @@
# Start massive AI/ML container images 10x faster with lazy-loading snapshotter +[![Join Slack](https://img.shields.io/badge/Join_Slack-2EB67D?style=for-the-badge&logo=slack&logoColor=white)](https://join.slack.com/t/tensorfusecommunity/shared_invite/zt-30r6ik3dz-Rf7nS76vWKOu6DoKh5Cs5w) +[![Read our Blog](https://img.shields.io/badge/Read_our_Blog-ff9800?style=for-the-badge&logo=RSS&logoColor=white)](https://tensorfuse.io/docs/blogs/blog) - - - -[Installation](#install-fastpull-on-a-vm) • [Results](#understanding-test-results) +[Installation](#install-fastpull-on-a-vm) • [Results](#understanding-test-results) • [Detailed Usage](docs/fastpull.md)
@@ -29,25 +28,26 @@ AI/ML container images like CUDA, vLLM, and sglang are large (10 GB+). Tradition #### The Solution -Fastpull uses lazy-loading to pull only the files needed to start the container, then fetches remaining layers on demand. This accelerates start times by 10x. See the results below: +Fastpull uses lazy-loading to pull only the files needed to start the container, then fetches remaining layers on demand. This accelerates start times by 10x. See the results below:
*(benchmark comparison image)*
+You can now: +- [Install Fastpull on a VM](#install-fastpull-on-a-vm) +- [Install Fastpull on Kubernetes](#install-fastpull-on-a-kubernetes-cluster) + For more information, check out the [fastpull blog release](https://tensorfuse.io/docs/blogs/reducing_gpu_cold_start). --- ## Install fastpull on a VM -> **Note:** For Kubernetes installation, [contact us](mailto:agam@tensorfuse.io) for early access to our helm chart. - ### Prerequisites -- Debian or Ubuntu VM with GPU -- Docker and CUDA driver installed -- Registry authentication configured (GAR, ECR, etc.) +- VM Image: Works on Debian 12+, Ubuntu, AL2023 VMs with GPU, mileage on other AMIs may vary. +- Python>=3.10, pip, python3-venv, [Docker](https://docs.docker.com/engine/install/), [CUDA drivers](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/), [Nvidia Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed ### Installation Steps @@ -56,81 +56,191 @@ For more information, check out the [fastpull blog release](https://tensorfuse.i ```bash git clone https://github.com/tensorfuse/fastpull.git cd fastpull/ -sudo python3 scripts/install_snapshotters.py - -# Verify installation -sudo systemctl status nydus-snapshotter-fuse.service +sudo python3 scripts/setup.py ``` You should see: **"✅ Fastpull installed successfully on your VM"** **2. Run containers** -Fastpull requires your images to be in a special format. You can either choose from our template of pre-built images like vLLM, TensorRT, and SGlang or build your own using a Dockerfile. +Fastpull requires your images to be in a special format. You can either choose from our template of pre-built images like vLLM, TensorRT, and SGlang or build your own using a Dockerfile. 
-Option A: Use pre-built images
+#### Use pre-built images

Test with vLLM, TensorRT, or Sglang:

```bash
-python3 scripts/benchmark/test-bench-vllm.py \
-    --image public.ecr.aws/s6z9f6e5/tensorfuse/fastpull/vllm:latest-nydus \
-    --snapshotter nydus
+fastpull quickstart tensorrt
+fastpull quickstart vllm
+fastpull quickstart sglang
```

-Option B: Build custom images
+Each of these runs twice: once with fastpull optimisations, and once the way Docker normally runs it.
+After the quickstart runs are complete, we also run `fastpull clean --all`, which cleans up the downloaded images.
+
+#### Build custom images
+
+First, authenticate with your registry.
+For ECR:
+```
+aws configure;
+aws ecr get-login-password --region us-east-1 | sudo nerdctl login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
+
+```
+
+For GAR:
+```
+gcloud auth login;
+gcloud auth print-access-token | sudo nerdctl login -docker.pkg.dev --username oauth2accesstoken --password-stdin
+```
+For Dockerhub:
+```
+sudo docker login
+```
+
+Build and push from your Dockerfile:
+
+> [!NOTE]
+> - We support --registry gar, --registry ecr, --registry dockerhub
+> - For ``, you can use any name that's convenient, e.g. `v1`, `latest`
+> - Two images are created: one is the overlayfs image with tag `` and the other is the fastpull image with tag `-fastpull`

-Build from your Dockerfile:
```bash
-# Build image
-python3 scripts/build.py --dockerfile 
+# Build and push image
+fastpull build --registry --dockerfile-path --repository-url :
+```
+
+### Benchmarking with Fastpull
+
+To get the run time for your container, you can use either:
+
+Completion Time

-# Push to registry
-python3 scripts/push.py \
-    --registry_type \
-    --account_id 
+
+Use if the workload has a defined end point
+```
+fastpull run --benchmark-mode completion [--FLAGS] :
+fastpull run --benchmark-mode completion --mode normal [--FLAGS] :
```

-# Run with fastpull
-python3 scripts/fastpull.py --image 
+Server Endpoint Readiness Time

---
+Use if you're preparing a server and it responds with a 200 SUCCESS response once the server is up
+```
+fastpull run --benchmark-mode readiness --readiness-endpoint localhost:/ [--FLAGS] :
+fastpull run --benchmark-mode readiness --readiness-endpoint localhost:/ --mode normal [--FLAGS] :
+```
+
+> [!NOTE]
+> - When running for readiness, you must publish the right port, e.g. `-p 8000:8000`, and use `--readiness-endpoint localhost:8000/health`
+> - Use `--mode normal` to run with normal Docker; running without this flag runs with fastpull optimisations
+> - For `[--FLAGS]` you can use any Docker-compatible flags, e.g. `--gpus all`, `-p PORT:PORT`, `-v `
+> - If using GPUs, make sure you add `--gpus all` as a fastpull run flag

-## Understanding Test Results
+#### Cleaning after a run
+
+To get the right cold-start numbers, run the clean command after each run:
+```
+fastpull clean --all
+```

-Results show timing breakdown across startup phases:
+### Understanding Test Results

-- **Time to first log:** Container start to entrypoint execution
-- **First log to model download start:** Initialization time
-- **Model download time:** Downloading weights (e.g., Qwen-3-8b, 16GB)
-- **Model load time:** Loading weights into GPU
-- **CUDA compilation/graph capture:** Optimization phase
-- **Total end-to-end time:** Container start to server ready
+Results show the startup and completion/readiness times:

Example Output

```bash
-=== VLLM TIMING SUMMARY ===
-Container Startup Time: 2.145s
-Container to First Log: 15.234s
-Engine Initialization: 45.123s
-Weights Download Start: 67.890s
-Weights Download Complete: 156.789s
-Weights Loaded: 198.456s
-Graph Capture Complete: 245.678s
-Server Ready: 318.435s
-Total Test Time: 325.678s
-
-BREAKDOWN:
-Container to First Log: 15.234s
-First Log to Weight Download Start: 52.656s
-Weight Download Start to Complete: 88.899s
-Weight Download Complete to Weights Loaded: 41.667s
-Weights Loaded to Server Ready: 119.979s
+==================================================
+BENCHMARK SUMMARY
+==================================================
+Time to Container Start: 141.295s
+Time to Readiness: 329.367s
+Total Elapsed Time: 329.367s
+==================================================
```
+
+---
+
+## Install fastpull on a Kubernetes Cluster
+
+### Prerequisites
+- Tested on GKE
+- Tested with COS Operating System for the nodes
+
+### Installation
+1. In your K8s cluster, create a GPU Nodepool. For GKE, ensure Workload Identity is enabled on your cluster.
+2. Install Nvidia GPU drivers. For COS:
+```bash
+kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
+```
+3. Install the containerd config updater daemonset: `kubectl apply -f https://raw.githubusercontent.com/tensorfuse/fastpull-gke/main/containerd-daemonset.yaml`
+4. Install the [Helm Chart](https://hub.docker.com/repository/docker/tensorfuse/fastpull-snapshotter/general). For COS:
+```bash
+helm upgrade --install fastpull-snapshotter oci://registry-1.docker.io/tensorfuse/fastpull-snapshotter \
+--version 0.0.10-gke-helm \
+--create-namespace \
+--namespace fastpull-snapshotter \
+--set 'tolerations[0].key=nvidia.com/gpu' \
+--set 'tolerations[0].operator=Equal' \
+--set 'tolerations[0].value=present' \
+--set 'tolerations[0].effect=NoSchedule' \
+--set 'affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key=cloud.google.com/gke-accelerator' \
+--set 'affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator=Exists'
+```
+5. Build your images, which can be done in one of two ways:
+
+   a. On a standalone VM, preferably running Ubuntu, [install fastpull](#installation-steps) and [build your image](#build-custom-images)
+
+   b. 
Build in a container:
+
+   First, authenticate to your registry and ensure `~/.docker/config.json` is updated:
+   ```bash
+   # for AWS
+   aws configure
+   aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
+   # for GCP
+   gcloud auth login
+   gcloud auth print-access-token | sudo nerdctl login -docker.pkg.dev --username oauth2accesstoken --password-stdin
+   ```
+   Then build using our image:
+   ```bash
+   docker run --rm --privileged \
+     -v /path/to/dockerfile-dir:/workspace:ro \
+     -v ~/.docker/config.json:/root/.docker/config.json:ro \
+     tensorfuse/fastpull-builder:latest \
+     REGISTRY/REPO/IMAGE:TAG
+   ```
+   This creates `IMAGE:TAG` (normal) and `IMAGE:TAG-fastpull` (fastpull-optimized). Use the `-fastpull` tag in your pod spec. See the [builder documentation](scripts/builder/README.md) for details.
+
+6. Create the pod spec for the image we created. For COS, use a pod spec like this:
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-test-a100-fastpull
+spec:
+  tolerations:
+  - operator: Exists
+  nodeSelector:
+    cloud.google.com/gke-accelerator: nvidia-tesla-a100 # Use your GPU Type
+  runtimeClassName: runc-fastpull
+  containers:
+  - name: debug-container
+    image: IMAGE_PATH:-fastpull # USE FASTPULL IMAGE
+    resources:
+      limits:
+        nvidia.com/gpu: 1
+    env:
+    - name: LD_LIBRARY_PATH
+      value: /usr/local/cuda/lib64:/usr/local/nvidia/lib64 # NOTE: This path may vary depending on the base image
+```
+7. Run a pod with this spec:
+```bash
+kubectl apply -f .yaml
+```
+
+
---
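The readiness benchmark used throughout this README boils down to polling an HTTP endpoint until it returns 200 and recording the elapsed time. A minimal standalone Python sketch of that measurement loop (an illustration only, not fastpull's actual implementation; the local test server and the `time_to_readiness` helper are hypothetical stand-ins for a container's health endpoint):

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import URLError
from urllib.request import urlopen

READY_AFTER = 0.5  # simulate a server that needs 0.5s to warm up
start = time.monotonic()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Return 503 until the simulated warm-up period has passed
        if time.monotonic() - start < READY_AFTER:
            self.send_response(503)
        else:
            self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/health"

def time_to_readiness(url: str, timeout: float = 30.0, interval: float = 0.1) -> float:
    """Poll `url` until it returns HTTP 200; return elapsed seconds."""
    t0 = time.monotonic()
    while time.monotonic() - t0 < timeout:
        try:
            if urlopen(url, timeout=2).status == 200:
                return time.monotonic() - t0
        except URLError:
            pass  # not ready yet (connection refused or non-2xx status)
        time.sleep(interval)
    raise TimeoutError(f"{url} never returned 200 within {timeout}s")

elapsed = time_to_readiness(url)
print(f"Time to Readiness: {elapsed:.3f}s")
server.shutdown()
```

The real CLI additionally records container-start time from `ctr events`; this sketch only covers the HTTP-polling half of the measurement.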
@@ -145,4 +255,4 @@ We welcome contributions! Submit a Pull Request or join our [Slack community](ht [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) -
\ No newline at end of file + diff --git a/docs/fastpull.md b/docs/fastpull.md new file mode 100644 index 0000000..cf3e839 --- /dev/null +++ b/docs/fastpull.md @@ -0,0 +1,312 @@ +# FastPull CLI - Quick Reference + +The new unified `fastpull` command-line interface for building and running containers with lazy-loading snapshotters. + +## Installation + +The setup script automatically detects your OS (Ubuntu/Debian/RHEL/CentOS/Fedora) and installs all dependencies including `python3-venv` and `wget`. + +```bash +# Full installation (containerd + Nydus + CLI) +sudo python3 scripts/setup.py + +# Install only CLI (if containerd/Nydus already installed) +sudo python3 scripts/setup.py --cli-only + +# Verify installation +fastpull --version +``` + +**Supported Package Managers:** +- `apt` (Ubuntu/Debian) +- `yum` (RHEL/CentOS 7) +- `dnf` (RHEL/CentOS 8+/Fedora) + +## Commands + +### `fastpull quickstart` - Quick Benchmark Comparisons + +Run pre-configured benchmarks to quickly compare snapshotter performance. + +#### Available Workloads + +**TensorRT:** +```bash +sudo fastpull quickstart tensorrt +sudo fastpull quickstart tensorrt --output-dir ./results +``` + +**vLLM:** +```bash +sudo fastpull quickstart vllm +sudo fastpull quickstart vllm --output-dir ./results +``` + +**SGLang:** +```bash +sudo fastpull quickstart sglang +sudo fastpull quickstart sglang --output-dir ./results +``` + +Each quickstart automatically: +1. Runs with FastPull mode (Nydus snapshotter) +2. Runs with Normal mode (OverlayFS snapshotter) +3. Measures readiness benchmarking for startup performance +4. **Auto-cleans containers and images after completion** + +--- + +### `fastpull run` - Run Containers with Benchmarking + +Run containers with FastPull (Nydus) or Normal (OverlayFS) mode. 
+ +#### Basic Usage + +```bash +# Run with FastPull mode (default, auto-adds -nydus suffix to tag) +fastpull run myapp:latest + +# Run with Normal mode (OverlayFS, no suffix) +fastpull run --mode normal myapp:latest + +# Run with GPU support +fastpull run myapp:latest --gpus all -p 8080:8080 +``` + +#### Benchmarking Modes + +**Readiness Mode** - Poll HTTP endpoint until 200 response: +```bash +fastpull run \ + myapp:latest \ + --benchmark-mode readiness \ + --readiness-endpoint http://localhost:8080/health \ + -p 8080:8080 +``` + +**Completion Mode** - Wait for container to exit: +```bash +fastpull run \ + myapp:latest \ + --benchmark-mode completion +``` + +**Export Metrics** - Save results to JSON: +```bash +fastpull run \ + myapp:latest \ + --benchmark-mode readiness \ + --readiness-endpoint http://localhost:8080/health \ + --output-json results.json \ + -p 8080:8080 +``` + +#### Supported Flags + +- `--mode` - Run mode: nydus (default, adds -nydus suffix), normal (overlayfs, no suffix) +- `IMAGE` - Container image to run (positional argument, required) +- `--benchmark-mode` - Options: none, completion, readiness (default: none) +- `--readiness-endpoint` - HTTP endpoint for health checks +- `--output-json` - Export metrics to JSON file +- `--name` - Container name +- `-p, --publish` - Publish ports (repeatable) +- `-e, --env` - Environment variables (repeatable) +- `-v, --volume` - Bind mount volumes (repeatable) +- `--gpus` - GPU devices (e.g., "all") +- `--rm` - Auto-remove container on exit +- `-d, --detach` - Run in background + +**Note:** Any additional arguments after the image are passed through to nerdctl. 
+ +#### Pass-through Examples + +```bash +# Custom entrypoint +fastpull run myapp:latest --entrypoint /bin/bash + +# Command override +fastpull run myapp:latest python script.py --arg1 value1 + +# Additional nerdctl flags +fastpull run myapp:latest --privileged --network host +``` + +--- + +### `fastpull build` - Build and Push Images in Multiple Formats + +Build Docker and snapshotter-optimized images, then push to registry. + +#### Basic Usage + +```bash +# Build Docker and Nydus (default) and push +fastpull build --dockerfile-path ./app --repository-url myapp:latest + +# Build specific formats +fastpull build \ + --dockerfile-path ./app \ + --repository-url myapp:v1 \ + --format docker,nydus +``` + +#### Build Options + +```bash +# No cache +fastpull build --dockerfile-path ./app --repository-url myapp:latest --no-cache + +# With build arguments +fastpull build \ + --dockerfile-path ./app \ + --repository-url myapp:latest \ + --build-arg VERSION=1.0 \ + --build-arg ENV=prod + +# Custom Dockerfile +fastpull build \ + --dockerfile-path ./app \ + --repository-url myapp:latest \ + --dockerfile Dockerfile.prod +``` + +#### Supported Flags + +- `--dockerfile-path` - Path to Dockerfile directory (required) +- `--repository-url` - Full image reference including registry, repository, and tag (required) +- `--format` - Comma-separated formats: docker, nydus (default: docker,nydus) +- `--no-cache` - Build without cache +- `--build-arg` - Build arguments (repeatable) +- `--dockerfile` - Dockerfile name (default: Dockerfile) + +**Note:** Images are automatically pushed to the registry after building. + +--- + +### `fastpull clean` - Remove Local Images and Artifacts + +Clean up local container images and stopped containers. 
+ +#### Basic Usage + +```bash +# Clean all images and containers (requires confirmation) +fastpull clean --all + +# Clean only images +fastpull clean --images + +# Clean only stopped containers +fastpull clean --containers + +# Target specific snapshotter +fastpull clean --all --snapshotter nydus +fastpull clean --all --snapshotter overlayfs + +# Dry run to see what would be removed +fastpull clean --all --dry-run + +# Force removal without confirmation +fastpull clean --all --force +``` + +#### Supported Flags + +- `--images` - Remove all images +- `--containers` - Remove stopped containers +- `--all` - Remove both images and containers +- `--snapshotter` - Target specific snapshotter: nydus, overlayfs, all (default: all) +- `--dry-run` - Show what would be removed without removing +- `--force` - Force removal without confirmation + +--- + +## Complete Workflow Example + +```bash +# 1. Build and push images in multiple formats +fastpull build \ + --dockerfile-path ./my-app \ + --repository-url 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.0 \ + --format docker,nydus + +# 2. Run with benchmarking (FastPull mode, auto-adds -nydus suffix) +fastpull run \ + 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.0 \ + --benchmark-mode readiness \ + --readiness-endpoint http://localhost:8000/health \ + --output-json benchmark-results.json \ + -p 8000:8000 \ + --gpus all +``` + +--- + +## Benchmarking Metrics + +When using `--benchmark-mode`, fastpull tracks: + +1. **Time to Container Start** - Using `ctr events` to monitor container lifecycle +2. 
**Time to Readiness/Completion**: + - **Readiness mode**: Polls HTTP endpoint until 200 response + - **Completion mode**: Waits for container to exit + +Example output: + +**FastPull mode (Nydus):** +``` +================================================== +FASTPULL BENCHMARK SUMMARY +================================================== +Time to Container Start: 2.34s +Time to Readiness: 45.67s +Total Elapsed Time: 48.01s +================================================== +``` + +**Normal mode (OverlayFS):** +``` +================================================== +NORMAL BENCHMARK SUMMARY +================================================== +Time to Container Start: 13.64s +Time to Readiness: 387.77s +Total Elapsed Time: 387.77s +================================================== +``` + +--- + +## Uninstallation + +```bash +# Remove fastpull CLI +sudo python3 scripts/setup.py --uninstall +``` + +--- + +## Backwards Compatibility + +The original scripts remain unchanged and continue to work: +- `scripts/build_push.py` +- `scripts/benchmark/test-bench-vllm.py` +- `scripts/benchmark/test-bench-sglang.py` +- `scripts/install_snapshotters.py` + +--- + +## Service Management + +After installation, the Nydus snapshotter service is renamed to `fastpull.service`: + +```bash +# Check status +systemctl status fastpull.service + +# Restart service +sudo systemctl restart fastpull.service + +# View logs +journalctl -u fastpull.service -f +``` diff --git a/images/alpine-loop/Dockerfile b/images/alpine-loop/Dockerfile new file mode 100644 index 0000000..bfcce27 --- /dev/null +++ b/images/alpine-loop/Dockerfile @@ -0,0 +1,3 @@ +FROM alpine:latest + +CMD ["/bin/sh", "-c", "for i in $(seq 1 1000); do echo \"Iteration $i\"; done; echo \"Loop complete\""] diff --git a/pyproject.toml b/pyproject.toml new file mode 100644 index 0000000..17a98c1 --- /dev/null +++ b/pyproject.toml @@ -0,0 +1,41 @@ +[build-system] +requires = ["setuptools>=61.0", "wheel"] +build-backend = 
"setuptools.build_meta" + +[project] +name = "fastpull" +version = "0.1.0" +description = "Accelerate AI/ML container startup with lazy-loading snapshotters" +readme = "README.md" +requires-python = ">=3.7" +license = {text = "MIT"} +authors = [ + {name = "TensorFuse", email = "saurabh@tensorfuse.io"} +] +keywords = ["containers", "docker", "fastpull", "snapshotter", "ml", "ai"] +classifiers = [ + "Development Status :: 4 - Beta", + "Intended Audience :: Developers", + "Topic :: Software Development :: Build Tools", + "License :: OSI Approved :: MIT License", + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.7", + "Programming Language :: Python :: 3.8", + "Programming Language :: Python :: 3.9", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", +] + +[project.urls] +Homepage = "https://github.com/tensorfuse/fastpull" +Documentation = "https://github.com/tensorfuse/fastpull/blob/main/docs/fastpull.md" +Repository = "https://github.com/tensorfuse/fastpull" +Issues = "https://github.com/tensorfuse/fastpull/issues" + +[project.scripts] +fastpull = "scripts.fastpull.cli:main" + +[tool.setuptools.packages.find] +where = ["."] +include = ["scripts.fastpull*"] +exclude = ["docs*", "images*"] diff --git a/scripts/benchmark/benchmark_base.py b/scripts/benchmark/benchmark_base.py deleted file mode 100644 index c3f1594..0000000 --- a/scripts/benchmark/benchmark_base.py +++ /dev/null @@ -1,782 +0,0 @@ -#!/usr/bin/env python3 -""" -Generic Benchmark Framework Base Class -Provides common functionality for all ML application benchmarks. 
-""" - -import argparse -import json -import os -import queue -import re -import requests -import signal -import subprocess -import sys -import threading -import time -from abc import ABC, abstractmethod -from datetime import datetime, timezone -from typing import Dict, List, Optional, Tuple - - -def run_command(cmd, check=True, capture_output=False): - """Run a shell command and handle errors.""" - try: - if capture_output: - result = subprocess.run(cmd, shell=True, check=check, capture_output=True, text=True) - return result.stdout.strip() - else: - subprocess.run(cmd, shell=True, check=check) - except subprocess.CalledProcessError as e: - print(f"Error running command: {cmd}") - print(f"Error: {e}") - if capture_output and e.stdout: - print(f"Stdout: {e.stdout}") - if capture_output and e.stderr: - print(f"Stderr: {e.stderr}") - raise - - -def check_aws_credentials(): - """Check if AWS credentials are configured.""" - try: - run_command("aws sts get-caller-identity", capture_output=True) - print("✓ AWS credentials are configured") - return True - except: - print("Warning: AWS credentials not configured. 
Please run 'aws configure' first.") - return False - - -def docker_login_ecr(account=None, region="us-east-1"): - """Login to ECR using both docker and nerdctl.""" - print("Checking AWS credentials and logging into ECR...") - - if not check_aws_credentials(): - print("Skipping ECR login due to missing AWS credentials") - return False - - if not account: - # Try to get account from AWS STS - try: - account_info = run_command("aws sts get-caller-identity --query Account --output text", capture_output=True) - account = account_info.strip() - print(f"Auto-detected AWS account: {account}") - except: - print("Could not auto-detect AWS account ID") - return False - - try: - password = run_command(f"aws ecr get-login-password --region {region}", capture_output=True) - registry = f"{account}.dkr.ecr.{region}.amazonaws.com" - - # Login with docker - login_cmd = f"echo '{password}' | docker login -u AWS --password-stdin {registry}" - run_command(login_cmd, check=False) - - # Login with nerdctl - login_cmd = f"echo '{password}' | nerdctl login -u AWS --password-stdin {registry}" - run_command(login_cmd, check=False) - - # Login with sudo nerdctl - login_cmd = f"echo '{password}' | sudo nerdctl login -u AWS --password-stdin {registry}" - run_command(login_cmd, check=False) - - print("✓ Successfully logged into ECR") - return True - - except Exception as e: - print(f"Warning: Could not login to ECR: {e}") - return False - - -def construct_ecr_image(repo: str, tag: str, snapshotter: str, region: str = "us-east-1") -> str: - """Construct ECR image URL from repo, tag, and snapshotter.""" - try: - # Get AWS account ID - account_info = run_command("aws sts get-caller-identity --query Account --output text", capture_output=True) - account = account_info.strip() - - # Add snapshotter suffix to tag (except for overlayfs/native which use base tag) - if snapshotter in ["overlayfs", "native"]: - final_tag = tag - else: - final_tag = f"{tag}-{snapshotter}" - - return 
f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{final_tag}" - - except Exception as e: - raise ValueError(f"Could not construct ECR image URL: {e}. Ensure AWS credentials are configured.") - - -class BenchmarkBase(ABC): - """Abstract base class for all benchmarks.""" - - def __init__(self, image: str, container_name: str, snapshotter: str = "nydus", port: int = 8080, model_mount_path: str = None): - self.image = image - self.container_name = container_name - self.snapshotter = snapshotter - self.port = port - self.model_mount_path = model_mount_path - self.start_time = None - self.phases = {} - self.log_queue = queue.Queue() - self.should_stop = threading.Event() - - # Container events monitoring - self.ctr_events_queue = queue.Queue() - self.ctr_events_thread = None - self.container_create_time = None - self.container_start_time = None - self.container_startup_duration = None - - # Health endpoint polling - self.health_thread = None - self.health_ready_time = None - self.health_ready_event = threading.Event() - self.interrupted = False - - # Initialize phases from subclass - self._init_phases() - - @abstractmethod - def _init_phases(self) -> None: - """Initialize the phases dictionary for the specific application.""" - pass - - @abstractmethod - def analyze_log_line(self, line: str, timestamp: float) -> Optional[str]: - """Analyze a log line and return detected phase. Must be implemented by subclass.""" - pass - - - @abstractmethod - def get_default_image(self, snapshotter: str) -> str: - """Get default image for the snapshotter. Must be implemented by subclass.""" - pass - - def get_health_endpoint(self) -> Optional[str]: - """Get health endpoint for the application. Override in subclasses.""" - return None - - def supports_health_polling(self) -> bool: - """Check if this application supports health endpoint polling. 
Override in subclasses.""" - return False - - def get_elapsed_time(self) -> float: - """Get elapsed time since start in seconds.""" - if self.start_time is None: - return 0.0 - return time.time() - self.start_time - - def start_ctr_events_monitor(self): - """Start monitoring containerd events in a separate thread.""" - def monitor_events(): - try: - cmd = ["sudo", "ctr", "events"] - process = subprocess.Popen( - cmd, - stdout=subprocess.PIPE, - stderr=subprocess.PIPE, - text=True, - bufsize=1, - universal_newlines=True - ) - - while not self.should_stop.is_set(): - line = process.stdout.readline() - if not line: - if process.poll() is not None: - break - time.sleep(0.1) - continue - - self.ctr_events_queue.put((time.time(), line.strip())) - - process.terminate() - process.wait() - - except Exception as e: - print(f"Error monitoring ctr events: {e}") - - self.ctr_events_thread = threading.Thread(target=monitor_events, daemon=True) - self.ctr_events_thread.start() - return self.ctr_events_thread - - def process_ctr_events(self): - """Process containerd events to track container lifecycle timing.""" - while not self.should_stop.is_set(): - try: - timestamp, line = self.ctr_events_queue.get(timeout=1.0) - - # Parse containerd event line - # Format: TIMESTAMP NAMESPACE EVENT_TYPE DATA - parts = line.split(' ', 3) - if len(parts) < 4: - continue - - event_timestamp_str = f"{parts[0]} {parts[1]}" - namespace = parts[2] - event_type = parts[3] - - # Parse the event timestamp - try: - # Remove timezone info for parsing, then add it back - ts_clean = event_timestamp_str.replace(" +0000 UTC", "") - event_time = datetime.fromisoformat(ts_clean.replace(' ', 'T')) - event_time = event_time.replace(tzinfo=timezone.utc) - event_timestamp = event_time.timestamp() - except: - event_timestamp = timestamp # Fallback to capture time - - # Look for task start event (any task since only one container is running) - if "/tasks/start" in event_type and self.container_start_time is None: - 
self.container_start_time = event_timestamp - if self.container_create_time: - self.container_startup_duration = self.container_start_time - self.container_create_time - elapsed = event_timestamp - self.start_time if self.start_time else 0 - print(f"[{elapsed:.3f}s] ✓ CONTAINER START (startup: {self.container_startup_duration:.3f}s)") - break # We found what we needed - stop monitoring - - except queue.Empty: - continue - except KeyboardInterrupt: - break - - def cleanup_container(self): - """Remove any existing container with the same name.""" - try: - nerdctl_snapshotter = self.get_nerdctl_snapshotter() - cmd = ["sudo", "nerdctl", "--snapshotter", nerdctl_snapshotter, "rm", "-f", self.container_name] - subprocess.run(cmd, capture_output=True, check=False) - except Exception as e: - print(f"Warning: Could not cleanup container: {e}") - - def start_container(self) -> bool: - """Start the container and return success status.""" - try: - # Start ctr events monitoring before container creation - print("Starting containerd events monitoring...") - self.start_ctr_events_monitor() - - # Start processing events in background - events_thread = threading.Thread(target=self.process_ctr_events, daemon=True) - events_thread.start() - - # Small delay to ensure events monitoring is ready - time.sleep(0.5) - - nerdctl_snapshotter = self.get_nerdctl_snapshotter() - cmd = [ - "sudo", "nerdctl", "--snapshotter", nerdctl_snapshotter, "run", - "--name", self.container_name, - "--gpus", "all", - "--detach", - "--publish", f"{self.port}:8000" - ] - - # Add volume mounts if model mount path is provided - if self.model_mount_path: - cmd.extend([ - "--volume", f"{self.model_mount_path}/huggingface:/workspace/huggingface", - "--volume", f"{self.model_mount_path}/hf-xet-cache:/workspace/hf-xet-cache" - ]) - - cmd.append(self.image) - - print(f"Running command: {' '.join(cmd)}") - # Set container creation time just before running nerdctl command - self.container_create_time = time.time() - if 
self.start_time is not None:
-                elapsed = self.container_create_time - self.start_time
-                print(f"[{elapsed:.3f}s] ✓ CONTAINER CREATE (nerdctl run started)")
-            else:
-                print("No start time is set")
-
-            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
-            return True
-
-        except subprocess.CalledProcessError as e:
-            print(f"Error starting container: {e}")
-            print(f"STDERR: {e.stderr}")
-            return False
-
-    def monitor_logs(self):
-        """Monitor container logs in a separate thread."""
-        def log_reader():
-            try:
-                cmd = ["sudo", "nerdctl", "--snapshotter", self.snapshotter, "logs", "-f", self.container_name]
-                process = subprocess.Popen(
-                    cmd,
-                    stdout=subprocess.PIPE,
-                    stderr=subprocess.STDOUT,
-                    text=True,
-                    bufsize=1,
-                    universal_newlines=True
-                )
-
-                while not self.should_stop.is_set():
-                    line = process.stdout.readline()
-                    if not line:
-                        if process.poll() is not None:
-                            break
-                        time.sleep(0.1)
-                        continue
-
-                    self.log_queue.put((time.time(), line.strip()))
-
-                process.terminate()
-
-            except Exception as e:
-                print(f"Error monitoring logs: {e}")
-
-        log_thread = threading.Thread(target=log_reader, daemon=True)
-        log_thread.start()
-        return log_thread
-
-    def start_health_polling(self):
-        """Start health endpoint polling in a separate thread."""
-        if not self.supports_health_polling():
-            return None
-
-        def health_poller():
-            endpoint = self.get_health_endpoint()
-            if not endpoint:
-                return
-
-            url = f"http://localhost:{self.port}/{endpoint}"
-            print(f"Starting health polling for endpoint: {url}")
-
-            # Poll with 0.1 second intervals, timeout after 20 minutes
-            start_time = time.time()
-            timeout = 20 * 60  # 20 minutes
-
-            while not self.should_stop.is_set() and not self.health_ready_event.is_set():
-                if time.time() - start_time > timeout:
-                    print(f"Health polling timed out after {timeout}s")
-                    break
-
-                # Check for interrupt
-                if self.interrupted:
-                    print("Health polling interrupted by user")
-                    break
-
-                try:
-                    response = requests.get(url, timeout=5)
-                    if response.status_code == 200:
-                        self.health_ready_time = time.time() - self.start_time
-                        elapsed = self.health_ready_time
-                        print(f"[{elapsed:.3f}s] ✓ SERVER READY (HTTP 200)")
-
-                        # Set server ready time from health check
-                        self.phases["server_ready"] = self.health_ready_time
-                        self.health_ready_event.set()
-                        break
-
-                except requests.exceptions.RequestException:
-                    # Connection failed, server not ready yet
-                    pass
-
-                time.sleep(0.1)  # Wait 0.1 seconds before next poll
-
-        self.health_thread = threading.Thread(target=health_poller, daemon=True)
-        self.health_thread.start()
-        return self.health_thread
-
-    def process_logs(self, timeout: int = 1200):
-        """Process logs and detect phases."""
-        print("Monitoring container logs...")
-        log_thread = self.monitor_logs()
-
-        # Start health polling if supported
-        health_thread = None
-
-        start_monitoring = time.time()
-
-        while time.time() - start_monitoring < timeout:
-            try:
-                timestamp, line = self.log_queue.get(timeout=1.0)
-                elapsed = timestamp - self.start_time
-
-                # Detect first log
-                if "first_log" in self.phases and self.phases["first_log"] is None:
-                    self.phases["first_log"] = elapsed
-                    print(f"[{elapsed:.3f}s] ✓ FIRST LOG")
-
-                    # Start health polling after first log if supported
-                    if self.supports_health_polling() and not health_thread:
-                        health_thread = self.start_health_polling()
-
-                phase = self.analyze_log_line(line, timestamp)
-
-                if phase:
-                    print(f"[{elapsed:.3f}s] ✓ {phase.upper().replace('_', ' ')}")
-
-                print(f"[{elapsed:.3f}s] {line}")
-
-                # Check if we should stop monitoring
-                if self._should_stop_monitoring(elapsed):
-                    break
-
-            except queue.Empty:
-                # Check if we should stop monitoring even when no new logs
-                elapsed = time.time() - self.start_time
-                if self._should_stop_monitoring(elapsed):
-                    break
-                continue
-            except KeyboardInterrupt:
-                print("\nReceived interrupt signal...")
-                break
-
-        self.should_stop.set()
-
-    def _should_stop_monitoring(self, elapsed: float) -> bool:
-        """Determine if we should stop monitoring logs. Should be overridden by subclasses."""
-        # For applications that support health polling, stop only after health check succeeds
-        if self.supports_health_polling():
-            return self.health_ready_event.is_set()
-        return False
-
-
-    def stop_container(self):
-        """Stop and remove the container. Wait for health check or timeout first."""
-        try:
-            # For applications that support health polling, wait for health check or timeout
-            # But skip waiting if interrupted by user
-            if (self.supports_health_polling() and not self.health_ready_event.is_set()
-                    and not self.interrupted):
-                print("Waiting for health check success or timeout before stopping container...")
-                timeout = 20 * 60  # 20 minutes
-                if self.health_ready_event.wait(timeout):
-                    if not self.interrupted:
-                        print("Health check succeeded, proceeding with container stop")
-                    else:
-                        print("Interrupted during health check, proceeding with container stop")
-                else:
-                    print("Health check timed out, proceeding with container stop")
-            elif self.interrupted:
-                print("Skipping health check wait due to interrupt, proceeding with container stop")
-
-            self.should_stop.set()
-            # Stop the container
-            cmd_stop = ["sudo", "nerdctl", "--snapshotter", self.snapshotter, "stop", self.container_name]
-            subprocess.run(cmd_stop, capture_output=True, check=False, timeout=30)
-
-            # Wait for container to fully stop
-            time.sleep(2)
-
-            # Remove the container
-            cmd_rm = ["sudo", "nerdctl", "--snapshotter", self.snapshotter, "rm", self.container_name]
-            subprocess.run(cmd_rm, capture_output=True, check=False, timeout=60)
-
-            # Wait a moment for the container removal to be fully processed
-            time.sleep(2)
-
-        except Exception as e:
-            print(f"Warning: Could not stop/remove container cleanly: {e}")
-
-    def get_nerdctl_snapshotter(self) -> str:
-        """Get the correct snapshotter name for nerdctl commands."""
-        # Map estargz to stargz for nerdctl compatibility
-        if self.snapshotter == "estargz":
-            return "stargz"
-        return self.snapshotter
-
-    def cleanup_soci_snapshotter(self):
-        """Perform SOCI-specific cleanup: remove state directory and restart service."""
-        if self.snapshotter != "soci":
-            return
-
-        try:
-            print("Performing SOCI-specific cleanup...")
-
-            # Remove SOCI state directory
-            print("Removing SOCI state directory...")
-            cmd_rm = ["sudo", "rm", "-rf", "/var/lib/soci-snapshotter-grpc/"]
-            result = subprocess.run(cmd_rm, capture_output=True, text=True, check=False, timeout=30)
-
-            if result.returncode == 0:
-                print("SOCI state directory removed successfully")
-            else:
-                print(f"Warning: Could not remove SOCI state directory: {result.stderr}")
-
-            # Restart SOCI snapshotter service
-            print("Restarting SOCI snapshotter service...")
-            cmd_restart = ["sudo", "systemctl", "restart", "soci-snapshotter-grpc.service"]
-            result = subprocess.run(cmd_restart, capture_output=True, text=True, check=False, timeout=30)
-
-            if result.returncode == 0:
-                print("SOCI snapshotter service restarted successfully")
-                # Give the service a moment to start
-                time.sleep(2)
-            else:
-                print(f"Warning: Could not restart SOCI snapshotter service: {result.stderr}")
-
-        except Exception as e:
-            print(f"Warning: Could not perform SOCI cleanup: {e}")
-
-    def cleanup_images(self):
-        """Remove the image to ensure fresh pulls for testing."""
-        try:
-            print(f"Removing image {self.image} for clean testing...")
-
-            nerdctl_snapshotter = self.get_nerdctl_snapshotter()
-
-            # First, try with image name/tag
-            cmd_rmi = ["sudo", "nerdctl", "--snapshotter", nerdctl_snapshotter, "rmi", self.image]
-            result = subprocess.run(cmd_rmi, capture_output=True, text=True, check=False, timeout=60)
-
-            if result.returncode == 0:
-                print("Image removed successfully")
-                return
-
-            # If that fails, get the image ID and try with that
-            print("Trying to remove by image ID...")
-            cmd_images = ["sudo", "nerdctl", "--snapshotter", nerdctl_snapshotter, "images", "--format", "{{.ID}}", self.image]
-            images_result = subprocess.run(cmd_images, capture_output=True, text=True, check=False, timeout=30)
-
-            if images_result.returncode == 0 and images_result.stdout.strip():
-                image_id = images_result.stdout.strip().split('\n')[0]
-                cmd_rmi_id = ["sudo", "nerdctl", "--snapshotter", nerdctl_snapshotter, "rmi", image_id]
-                id_result = subprocess.run(cmd_rmi_id, capture_output=True, text=True, check=False, timeout=60)
-
-                if id_result.returncode == 0:
-                    print(f"Image removed successfully using ID: {image_id}")
-                else:
-                    print(f"Could not remove image by ID: {id_result.stderr}")
-            else:
-                print(f"Note: Could not find or remove image: {result.stderr}")
-
-        except Exception as e:
-            print(f"Warning: Could not remove image: {e}")
-
-    def print_summary(self, total_time: float):
-        """Print timing summary."""
-        print("\n" + "="*50)
-        print(f"{self.__class__.__name__.replace('Benchmark', '').upper()} TIMING SUMMARY")
-        print("="*50)
-
-        for label, value in self._get_summary_items(total_time):
-            if label == "":
-                print()  # Empty line
-            elif label.endswith(":") and value is None:
-                print(label)  # Section header
-            elif value is not None:
-                print(f"{label:<30} {value:.3f}s")
-            else:
-                print(f"{label:<30} N/A")
-
-        print("="*50)
-
-    def _get_summary_items(self, total_time: float) -> List[Tuple[str, Optional[float]]]:
-        """Get summary items for printing. Must be overridden by subclasses."""
-        items = []
-
-        # Add container startup time at the beginning
-        items.append(("Container Startup Time:", self.container_startup_duration))
-
-        for phase_key, phase_value in self.phases.items():
-            label = phase_key.replace('_', ' ').title() + ":"
-            items.append((label, phase_value))
-        items.append(("Total Test Time:", total_time))
-        return items
-
-    def run_benchmark(self) -> Dict[str, Optional[float]]:
-        """Run the complete benchmark."""
-        app_name = self.__class__.__name__.replace('Benchmark', '')
-        print(f"=== {app_name} Startup Timing Test ===")
-        print(f"Image: {self.image}")
-        print(f"Snapshotter: {self.snapshotter}")
-        print(f"Port: {self.port}")
-        print()
-
-        # Check AWS credentials and login to ECR if needed
-        if ".ecr." in self.image:
-            print("ECR image detected, attempting AWS login...")
-            # Extract region from image URL if possible, otherwise use default
-            region = "us-east-1"  # Default region
-            if hasattr(self, '_region'):
-                region = self._region
-            docker_login_ecr(region=region)
-
-        # Cleanup
-        print("Cleaning up existing containers...")
-        self.cleanup_container()
-        self.cleanup_soci_snapshotter()
-
-        # Start timing
-        self.start_time = time.time()
-        start_datetime = datetime.fromtimestamp(self.start_time)
-        print(f"Test started at: {start_datetime.strftime('%Y-%m-%d %H:%M:%S')}")
-        print()
-
-        try:
-            # Start container
-            print("Starting container...")
-            if not self.start_container():
-                print("Failed to start container")
-                return self.phases
-
-            # Wait a moment for container to initialize
-            time.sleep(2)
-
-            # Monitor logs
-            self.process_logs()
-
-        except KeyboardInterrupt:
-            print("\nBenchmark interrupted by user")
-            self.interrupted = True
-            self.should_stop.set()
-            self.health_ready_event.set()  # Stop waiting for health check
-        except Exception as e:
-            print(f"Error during benchmark: {e}")
-        finally:
-            # Cleanup
-            print("\nCleaning up...")
-            self.stop_container()
-            if hasattr(self, '_keep_image') and not self._keep_image:
-                self.cleanup_images()
-            self.cleanup_soci_snapshotter()
-
-        # Calculate total time and print summary
-        total_time = time.time() - self.start_time
-        self.print_summary(total_time)
-
-        return self.phases
-
-    def create_arg_parser(self, description: str) -> argparse.ArgumentParser:
-        """Create standard argument parser for benchmarks."""
-        parser = argparse.ArgumentParser(description=description)
-
-        # Image specification - either full image or repo + tag
-        image_group = parser.add_mutually_exclusive_group()
-        image_group.add_argument(
-            "--image",
-            help="Full container image to test (e.g., registry.com/repo:tag-snapshotter)"
-        )
-        image_group.add_argument(
-            "--repo",
-            help="ECR repository name (e.g., my-vllm-app). Will construct full ECR URL automatically"
-        )
-
-        parser.add_argument(
-            "--tag",
-            default="latest",
-            help="Image tag base (default: latest). Snapshotter suffix will be appended (e.g., latest-nydus)"
-        )
-        parser.add_argument(
-            "--region",
-            default="us-east-1",
-            help="AWS region for ECR (default: us-east-1)"
-        )
-        parser.add_argument(
-            "--container-name",
-            default=f"{self.__class__.__name__.lower().replace('benchmark', '')}-timing-test",
-            help="Name for the test container"
-        )
-        parser.add_argument(
-            "--snapshotter",
-            default="nydus",
-            choices=["nydus", "overlayfs", "native", "soci", "estargz"],
-            help="Snapshotter to use"
-        )
-        parser.add_argument(
-            "--port",
-            type=int,
-            default=self.port,
-            help=f"Local port to bind (default: {self.port})"
-        )
-        parser.add_argument(
-            "--model-mount-path",
-            help="Path to local SSD directory to mount for model storage (e.g., /mnt/ssd/models)"
-        )
-        parser.add_argument(
-            "--output-json",
-            help="Output results to JSON file"
-        )
-        parser.add_argument(
-            "--keep-image",
-            action="store_true",
-            help="Don't remove image after test (faster for repeated runs)"
-        )
-        return parser
-
-    def save_results(self, results: Dict[str, Optional[float]], output_file: str,
-                     image: str, snapshotter: str):
-        """Save results to JSON file."""
-        output_data = {
-            "application": self.__class__.__name__.replace('Benchmark', '').lower(),
-            "snapshotter": snapshotter,
-            "image": image,
-            "timestamp": datetime.now().isoformat(),
-            "phases": results,
-            "container_startup_duration": self.container_startup_duration,
-            "health_ready_time": self.health_ready_time,
-            "supports_health_polling": self.supports_health_polling()
-        }
-
-        with open(output_file, 'w') as f:
-            json.dump(output_data, f, indent=2)
-
-        print(f"\nResults saved to: {output_file}")
-
-    def setup_signal_handler(self):
-        """Setup graceful interrupt handling."""
-        def signal_handler(sig, frame):
-            print("\nReceived interrupt signal, cleaning up...")
-            self.interrupted = True
-            self.should_stop.set()
-            self.health_ready_event.set()  # Stop waiting for health check
-            # Don't exit immediately, let cleanup happen
-
-        signal.signal(signal.SIGINT, signal_handler)
-
-    def main(self, description: str) -> int:
-        """Main execution method for benchmark scripts."""
-        parser = self.create_arg_parser(description)
-        args = parser.parse_args()
-
-        # Determine image to use
-        if args.image:
-            # Full image provided
-            final_image = args.image
-        elif args.repo:
-            # Construct ECR image from repo + tag + snapshotter
-            final_image = construct_ecr_image(args.repo, args.tag, args.snapshotter, args.region)
-            print(f"Constructed ECR image: {final_image}")
-        else:
-            # Fall back to default image from subclass
-            final_image = self.get_default_image(args.snapshotter)
-
-        # Update instance with parsed arguments
-        self.image = final_image
-        self.container_name = args.container_name
-        self.snapshotter = args.snapshotter
-        self.port = args.port
-        self.model_mount_path = args.model_mount_path
-        self._keep_image = args.keep_image
-        self._region = args.region
-
-        # Setup signal handling
-        self.setup_signal_handler()
-
-        # Override image cleanup if requested
-        if args.keep_image:
-            self.cleanup_images = lambda: print("Keeping image as requested")
-
-        # Run benchmark
-        results = self.run_benchmark()
-
-        # Output JSON if requested
-        if args.output_json:
-            self.save_results(results, args.output_json, self.image, args.snapshotter)
-
-
-        return 0 if self._is_successful(results) else 1
-
-
-    def _is_successful(self, results: Dict[str, Optional[float]]) -> bool:
-        """Determine if benchmark was successful. Can be overridden by subclasses."""
-        # Default: successful if we have first_log timing
-        return results.get("first_log") is not None
\ No newline at end of file
diff --git a/scripts/benchmark/test-bench-sglang.py b/scripts/benchmark/test-bench-sglang.py
deleted file mode 100644
index f63978e..0000000
--- a/scripts/benchmark/test-bench-sglang.py
+++ /dev/null
@@ -1,248 +0,0 @@
-#!/usr/bin/env python3
-"""
-SGLang Inference Server Benchmark
-Measures container startup and SGLang readiness times with different snapshotters.
-
-LOG PATTERN DETECTION & PHASES:
-===============================
-
-This benchmark monitors SGLang inference server logs and detects the following phases:
-
-1. SGLANG_INIT (SGLang Framework Initialization)
-   - Patterns: "starting sglang", "sglang server", "initializing sglang", "launch_server"
-   - Detects: SGLang framework startup and server initialization
-
-2. WEIGHTS_DOWNLOAD (Weight Download Start)
-   - Patterns: "load weight begin"
-   - Detects: Beginning of model weight loading process
-
-3. WEIGHTS_DOWNLOAD_COMPLETE (Weight Download Complete)
-   - Patterns: "loading safetensors checkpoint shards: 0%"
-   - Detects: First safetensors checkpoint loading starts
-
-4. WEIGHTS_LOADED (Weights Loaded)
-   - Patterns: "load weight end"
-   - Detects: Completion of weight loading phase
-
-5. KV_CACHE_ALLOCATED (KV Cache Setup)
-   - Patterns: "kv cache is allocated", "kv cache allocated"
-   - Detects: Key-value cache memory allocation for inference
-
-6. GRAPH_CAPTURE_BEGIN (CUDA Graph Start)
-   - Patterns: "capture cuda graph begin", "capturing cuda graph"
-   - Detects: Beginning of CUDA graph capture for optimization
-
-7. GRAPH_CAPTURE_END (CUDA Graph Complete)
-   - Patterns: "capture cuda graph end", "cuda graph capture complete"
-   - Detects: CUDA graph capture completion
-
-8. SERVER_LOG_READY (Server Log Ready)
-   - Patterns: "starting server", "server starting", "uvicorn", "listening on"
-   - Detects: HTTP/API server initialization (log-based)
-
-9. SERVER_READY (Server Ready)
-   - Tested via HTTP requests to /health_generate endpoint with 0.1s polling
-   - Detects: API actually responding with valid HTTP 200 responses
-
-MONITORING BEHAVIOR:
-===================
-- Timeout: 20 minutes (model loading and optimization can be slow)
-- Container Status: Monitors container health during startup
-- Health Polling: Polls /health_generate endpoint every 0.1 seconds after first log
-- Success Criteria: HTTP 200 response from health endpoint
-- Port: Maps container port 8000 to specified local port
-- Stop Condition: Immediately after health endpoint returns 200
-
-EXAMPLE LOG FLOW:
-================
-[20.145s] starting sglang → sglang_init
-[119.058s] load weight begin → weights_download
-[200.525s] loading safetensors checkpoint shards: 0% → weights_download_complete
-[233.778s] load weight end → weights_loaded
-[233.828s] kv cache is allocated → kv_cache_allocated
-[245.123s] capture cuda graph begin → graph_capture_begin
-[267.890s] capture cuda graph end → graph_capture_end
-[289.456s] starting server → server_log_ready
-[291.789s] HTTP 200 /health_generate → server_ready
-"""
-
-import requests
-import json
-import time
-from typing import Dict, Optional
-from benchmark_base import BenchmarkBase
-
-
-class SGLangBenchmark(BenchmarkBase):
-    def __init__(self, image: str = "", container_name: str = "sglang-timing-test",
-                 snapshotter: str = "nydus", port: int = 8000):
-        super().__init__(image, container_name, snapshotter, port)
-
-    def get_health_endpoint(self) -> str:
-        """Get health endpoint for SGLang application."""
-        return "health_generate"
-
-    def supports_health_polling(self) -> bool:
-        """SGLang application supports health endpoint polling."""
-        return True
-
-    def _should_stop_monitoring(self, elapsed: float) -> bool:
-        """Custom stop monitoring logic for SGLang."""
-        # Use base class logic for health polling apps
-        return super()._should_stop_monitoring(elapsed)
-
-    def _init_phases(self) -> None:
-        """Initialize the phases dictionary for SGLang."""
-        self.phases = {
-            "first_log": None,
-            "sglang_init": None,
-            "model_loading": None,
-            "weights_download": None,
-            "weights_download_complete": None,
-            "weights_loaded": None,
-            "kv_cache_allocated": None,
-            "graph_capture_begin": None,
-            "graph_capture_end": None,
-            "model_loaded": None,
-            "server_log_ready": None,
-            "server_ready": None
-        }
-
-    def analyze_log_line(self, line: str, timestamp: float) -> Optional[str]:
-        """Analyze a log line and return detected phase."""
-        elapsed = timestamp - self.start_time
-        line_lower = line.lower()
-
-        # SGLang initialization
-        if self.phases["sglang_init"] is None:
-            if any(pattern in line_lower for pattern in [
-                "starting sglang", "sglang server", "initializing sglang", "launch_server"
-            ]):
-                self.phases["sglang_init"] = elapsed
-                return "sglang_init"
-
-        # Weight download start (was "load weight begin")
-        if self.phases["weights_download"] is None:
-            if "load weight begin" in line_lower:
-                self.phases["weights_download"] = elapsed
-                return "weights_download"
-
-        # Weight download complete (first loading safetensors)
-        if self.phases["weights_download_complete"] is None:
-            if "loading safetensors checkpoint shards:" in line_lower and "0%" in line_lower:
-                self.phases["weights_download_complete"] = elapsed
-                return "weights_download_complete"
-
-        # Weights loaded (was "load weight end")
-        if self.phases["weights_loaded"] is None:
-            if "load weight end" in line_lower:
-                self.phases["weights_loaded"] = elapsed
-                return "weights_loaded"
-
-        # KV cache allocation
-        if self.phases["kv_cache_allocated"] is None:
-            if any(pattern in line_lower for pattern in [
-                "kv cache is allocated", "kv cache allocated"
-            ]):
-                self.phases["kv_cache_allocated"] = elapsed
-                return "kv_cache_allocated"
-
-        # CUDA graph capture begin
-        if self.phases["graph_capture_begin"] is None:
-            if any(pattern in line_lower for pattern in [
-                "capture cuda graph begin", "capturing cuda graph"
-            ]):
-                self.phases["graph_capture_begin"] = elapsed
-                return "graph_capture_begin"
-
-        # CUDA graph capture end
-        if self.phases["graph_capture_end"] is None:
-            if any(pattern in line_lower for pattern in [
-                "capture cuda graph end", "cuda graph capture complete"
-            ]):
-                self.phases["graph_capture_end"] = elapsed
-                return "graph_capture_end"
-
-        # Server log ready pattern
-        if self.phases["server_log_ready"] is None:
-            if any(pattern in line_lower for pattern in [
-                "starting server", "server starting", "uvicorn", "listening on"
-            ]):
-                self.phases["server_log_ready"] = elapsed
-                return "server_log_ready"
-
-        return None
-
-    def test_api_readiness(self, timeout: int = 120) -> bool:
-        """SGLang benchmark doesn't test API readiness - stops after server ready."""
-        print("Skipping API readiness test - stopping after server ready detection")
-        return True
-
-    def get_default_image(self, snapshotter: str) -> str:
-        """Get default image for the snapshotter. Users should now use --repo parameter instead."""
-        raise ValueError(
-            "No default image configured. Please specify either:\n"
-            " --repo (e.g., --repo my-sglang-app)\n"
-            " --image (e.g., --image registry.com/repo:tag)\n"
-            "\nExample: python test-bench-sglang.py --repo saurabh-sglang-test --tag latest --snapshotter nydus"
-        )
-
-
-
-    def _get_summary_items(self, total_time):
-        """Get summary items for printing."""
-        items = [
-            ("Container Startup Time:", self.container_startup_duration),
-            ("Container to First Log:", self.phases["first_log"]),
-            ("SGLang Initialization:", self.phases["sglang_init"]),
-            ("Weight Download Start:", self.phases["weights_download"]),
-            ("Weight Download Complete:", self.phases["weights_download_complete"]),
-            ("Weights Loaded:", self.phases["weights_loaded"]),
-            ("KV Cache Allocated:", self.phases["kv_cache_allocated"]),
-            ("Graph Capture Begin:", self.phases["graph_capture_begin"]),
-            ("Graph Capture End:", self.phases["graph_capture_end"]),
-            ("Server Log Ready:", self.phases["server_log_ready"]),
-            ("Server Ready:", self.phases["server_ready"]),
-            ("Total Test Time:", total_time)
-        ]
-
-        # Add breakdown section
-        items.append(("", None))  # Empty line separator
-        items.append(("BREAKDOWN:", None))
-
-        # Calculate breakdowns
-        if self.phases["first_log"] is not None:
-            items.append(("Container to First Log:", self.phases["first_log"]))
-
-        if self.phases["first_log"] is not None and self.phases["weights_download"] is not None:
-            first_to_download = self.phases["weights_download"] - self.phases["first_log"]
-            items.append(("First Log to Weight Download Start:", first_to_download))
-
-        if self.phases["weights_download"] is not None and self.phases["weights_download_complete"] is not None:
-            download_duration = self.phases["weights_download_complete"] - self.phases["weights_download"]
-            items.append(("Weight Download Start to Complete:", download_duration))
-
-        if self.phases["weights_download_complete"] is not None and self.phases["weights_loaded"] is not None:
-            download_to_loaded = self.phases["weights_loaded"] - self.phases["weights_download_complete"]
-            items.append(("Weight Download Complete to Weights Loaded:", download_to_loaded))
-
-        if self.phases["weights_loaded"] is not None and self.phases["server_ready"] is not None:
-            loaded_to_ready = self.phases["server_ready"] - self.phases["weights_loaded"]
-            items.append(("Weights Loaded to Server Ready:", loaded_to_ready))
-
-        return items
-
-    def _is_successful(self, results: Dict[str, Optional[float]]) -> bool:
-        """Determine if benchmark was successful."""
-        return results.get("server_ready") is not None
-
-
-def main():
-    benchmark = SGLangBenchmark()
-    return benchmark.main("SGLang Container Startup Benchmark")
-
-
-if __name__ == "__main__":
-    import sys
-    import subprocess
-    sys.exit(main())
\ No newline at end of file
diff --git a/scripts/benchmark/test-bench-tensorrt.py b/scripts/benchmark/test-bench-tensorrt.py
deleted file mode 100755
index 2f5586f..0000000
--- a/scripts/benchmark/test-bench-tensorrt.py
+++ /dev/null
@@ -1,218 +0,0 @@
-#!/usr/bin/env python3
-"""
-TensorRT-LLM Startup Timing Benchmark
-Measures container startup and TensorRT-LLM readiness times with different snapshotters.
-
-LOG PATTERN DETECTION & PHASES:
-===============================
-
-This benchmark monitors TensorRT-LLM server logs and detects the following phases:
-
-1. ENGINE_INIT (TensorRT-LLM Engine Initialization)
-   - Patterns: "PyTorchConfig(", "TensorRT-LLM version", "KV cache quantization"
-   - Detects: TensorRT-LLM engine initialization and configuration
-
-2. WEIGHT_DOWNLOAD_START (Weight Download Start)
-   - Patterns: "Prefetching", "checkpoint files", "Use.*GB for model weights"
-   - Detects: Beginning of model weight download/prefetching to memory
-
-3. WEIGHT_DOWNLOAD_COMPLETE (Weight Download Complete)
-   - Patterns: "Loading /workspace/huggingface", first model loading line
-   - Detects: All model weights downloaded and loading starts
-
-4. WEIGHTS_LOADED (Weight Loading Complete)
-   - Patterns: "Loading weights: 100%", "Model init total"
-   - Detects: Model weights fully loaded into memory
-
-5. MODEL_LOADED (Model Fully Loaded)
-   - Patterns: "Autotuning process ends", "Autotuner Cache size", memory configuration
-   - Detects: Complete model initialization with autotuning and optimization
-
-6. SERVER_LOG_READY (Server Log Ready)
-   - Patterns: "Started server process", "Waiting for application startup"
-   - Detects: Uvicorn/FastAPI server initialization (log-based)
-
-7. SERVER_READY (Server Ready)
-   - Tested via HTTP requests to /health endpoint with 0.1s polling
-   - Detects: API actually responding with valid HTTP 200 responses
-
-MONITORING BEHAVIOR:
-===================
-- Timeout: 25 minutes (model loading and autotuning can be very slow)
-- Container Status: Monitors container health during startup
-- Health Polling: Polls /health endpoint every 0.1 seconds after first log
-- Success Criteria: HTTP 200 response from health endpoint
-- Port: Maps container port 8000 to specified local port
-- Stop Condition: Immediately after health endpoint returns 200
-
-EXAMPLE LOG FLOW:
-================
-[10.230s] Starting TensorRT-LLM server → first_log
-[73.120s] PyTorchConfig( → engine_init
-[76.780s] Prefetching 15.26GB checkpoint → weight_download_start
-[130.450s] Loading /workspace/huggingface → weight_download_complete
-[156.670s] Loading weights: 100% → weights_loaded
-[324.456s] Autotuning process ends → model_loaded
-[325.789s] Started server process → server_log_ready
-[326.012s] HTTP 200 /health → server_ready
-"""
-
-import requests
-import json
-import time
-from typing import Dict, Optional
-from benchmark_base import BenchmarkBase
-
-
-class TensorRTBenchmark(BenchmarkBase):
-    def __init__(self, image: str = "", container_name: str = "tensorrt-timing-test",
-                 snapshotter: str = "nydus", port: int = 8080):
-        super().__init__(image, container_name, snapshotter, port)
-
-    def get_health_endpoint(self) -> str:
-        """Get health endpoint for TensorRT application."""
-        return "health"
-
-    def supports_health_polling(self) -> bool:
-        """TensorRT application supports health endpoint polling."""
-        return True
-
-    def _should_stop_monitoring(self, elapsed: float) -> bool:
-        """Custom stop monitoring logic for TensorRT-LLM."""
-        # Use base class logic for health polling apps
-        return super()._should_stop_monitoring(elapsed)
-
-    def _is_successful(self, results: Dict[str, Optional[float]]) -> bool:
-        """Determine if benchmark was successful."""
-        return results.get("server_ready") is not None
-
-    def _init_phases(self) -> None:
-        """Initialize the phases dictionary for TensorRT-LLM."""
-        self.phases = {
-            "first_log": None,
-            "engine_init": None,
-            "weight_download_start": None,
-            "weight_download_complete": None,
-            "weights_loaded": None,
-            "model_loaded": None,
-            "server_log_ready": None,
-            "server_ready": None
-        }
-
-    def analyze_log_line(self, line: str, timestamp: float) -> Optional[str]:
-        """Analyze a log line and return detected phase."""
-        elapsed = timestamp - self.start_time
-        line_lower = line.lower()
-
-        # TensorRT-LLM engine initialization
-        if self.phases["engine_init"] is None:
-            if any(pattern in line_lower for pattern in [
-                "pytorchconfig(", "tensorrt-llm version", "kv cache quantization"
-            ]):
-                self.phases["engine_init"] = elapsed
-                return "engine_init"
-
-        # Weight download start
-        if self.phases["weight_download_start"] is None:
-            if any(pattern in line_lower for pattern in [
-                "prefetching", "checkpoint files", "gb for model weights"
-            ]):
-                self.phases["weight_download_start"] = elapsed
-                return "weight_download_start"
-
-        # Weight download complete and loading starts
-        if self.phases["weight_download_complete"] is None:
-            if any(pattern in line_lower for pattern in [
-                "loading /workspace/huggingface"
-            ]):
-                self.phases["weight_download_complete"] = elapsed
-                return "weight_download_complete"
-
-        # Weights loading complete
-        if self.phases["weights_loaded"] is None:
-            if any(pattern in line_lower for pattern in [
-                "loading weights: 100%", "model init total"
-            ]):
-                self.phases["weights_loaded"] = elapsed
-                return "weights_loaded"
-
-        # Model fully loaded (autotuning complete, memory configured)
-        if self.phases["model_loaded"] is None:
-            if any(pattern in line_lower for pattern in [
-                "autotuning process ends", "autotuner cache size",
-                "max_seq_len=", "max_num_requests=", "allocated.*gib for max tokens"
-            ]):
-                self.phases["model_loaded"] = elapsed
-                return "model_loaded"
-
-        # Server log ready pattern
-        if self.phases["server_log_ready"] is None:
-            if any(pattern in line_lower for pattern in [
-                "started server process", "waiting for application startup"
-            ]):
-                self.phases["server_log_ready"] = elapsed
-                return "server_log_ready"
-
-        return None
-
-    def test_api_readiness(self, timeout: int = 120) -> bool:
-        """TensorRT benchmark doesn't test API readiness - stops after server ready."""
-        print("Skipping API readiness test - stopping after server ready detection")
-        return True
-
-    def get_default_image(self, snapshotter: str) -> str:
-        """Get default image for the snapshotter. Users should now use --repo parameter instead."""
-        raise ValueError(
-            "No default image configured. Please specify either:\n"
-            " --repo (e.g., --repo my-tensorrt-app)\n"
-            " --image (e.g., --image registry.com/repo:tag)\n"
-            "\nExample: python test-bench-tensorrt.py --repo my-tensorrt-app --tag latest --snapshotter nydus"
-        )
-
-    def _get_summary_items(self, total_time):
-        """Get summary items for the timing summary."""
-        items = [
-            ("Container Startup Time:", self.container_startup_duration),
-            ("Container to First Log:", self.phases["first_log"]),
-            ("Engine Initialization:", self.phases["engine_init"]),
-            ("Weight Download Start:", self.phases["weight_download_start"]),
-            ("Weight Download Complete:", self.phases["weight_download_complete"]),
-            ("Weights Loaded:", self.phases["weights_loaded"]),
-            ("Model Loaded:", self.phases["model_loaded"]),
-            ("Server Log Ready:", self.phases["server_log_ready"]),
-            ("Server Ready:", self.phases["server_ready"]),
-            ("Total Test Time:", total_time)
-        ]
-
-        # Add breakdown section
-        items.append(("", None))  # Empty line separator
-        items.append(("BREAKDOWN:", None))
-
-        # Calculate breakdowns
-        if self.phases["first_log"] is not None:
-            items.append(("Container to First Log:", self.phases["first_log"]))
-
-        if self.phases["first_log"] is not None and self.phases["weight_download_start"] is not None:
-            first_to_download = self.phases["weight_download_start"] - self.phases["first_log"]
-            items.append(("First Log to Weight Download Start:", first_to_download))
-
-        if self.phases["weight_download_start"] is not None and self.phases["weight_download_complete"] is not None:
-            download_duration = self.phases["weight_download_complete"] - self.phases["weight_download_start"]
-            items.append(("Weight Download Start to Complete:", download_duration))
-
-        if self.phases["weight_download_complete"] is not None and self.phases["weights_loaded"] is not None:
-            download_to_loaded = self.phases["weights_loaded"] - self.phases["weight_download_complete"]
-            items.append(("Weight Download Complete to Weights Loaded:", download_to_loaded))
-
-        if self.phases["weights_loaded"] is not None and self.phases["server_ready"] is not None:
-            loaded_to_ready = self.phases["server_ready"] - self.phases["weights_loaded"]
-            items.append(("Weights Loaded to Server Ready:", loaded_to_ready))
-
-        return items
-
-
-if __name__ == "__main__":
-    import sys
-
-    benchmark = TensorRTBenchmark()
-    sys.exit(benchmark.main("TensorRT-LLM Container Startup Benchmark"))
\ No newline at end of file
diff --git a/scripts/benchmark/test-bench-vllm.py b/scripts/benchmark/test-bench-vllm.py
deleted file mode 100755
index efa1fca..0000000
--- a/scripts/benchmark/test-bench-vllm.py
+++ /dev/null
@@ -1,219 +0,0 @@
-#!/usr/bin/env python3
-"""
-vLLM Startup Timing Benchmark
-Measures container startup and vLLM readiness times with different snapshotters.
-
-LOG PATTERN DETECTION & PHASES:
-===============================
-
-This benchmark monitors vLLM inference server logs and detects the following phases:
-
-1. ENGINE_INIT (vLLM Engine Initialization)
-   - Patterns: "initializing a v1 llm engine", "waiting for init message", "v1 llm engine"
-   - Detects: vLLM V1 engine initialization start
-
-2. MODEL_LOADING (Model Loading Start)
-   - Patterns: "starting to load model", "loading model from scratch"
-   - Detects: Beginning of model loading process
-
-3. WEIGHTS_DOWNLOAD (Weight Download)
-   - Patterns: "time spent downloading weights", "downloading weights"
-   - Detects: Model weight download completion (if needed)
-
-4. WEIGHTS_LOADED (Weight Loading Complete)
-   - Patterns: "loading weights took", "loading safetensors checkpoint shards: 100%"
-   - Detects: Model weights fully loaded into memory
-
-5. MODEL_LOADED (Model Fully Loaded)
-   - Patterns: "model loading took", "init engine", "engine.*took.*seconds"
-   - Detects: Complete model initialization and engine setup
-
-6. GRAPH_CAPTURE (CUDA Graph Optimization)
-   - Patterns: "graph capturing finished", "capturing cuda graph shapes: 100%"
-   - Detects: CUDA graph capture completion for optimization
-
-7. SERVER_LOG_READY (Server Log Ready)
-   - Patterns: "started server process"
-   - Detects: FastAPI/Uvicorn server process started (log-based)
-
-8. SERVER_READY (Server Ready)
-   - Tested via HTTP requests to /health endpoint with 0.1s polling
-   - Detects: API actually responding with valid HTTP 200 responses
-
-MONITORING BEHAVIOR:
-===================
-- Timeout: 20 minutes (model loading can be slow)
-- Container Status: Monitors container health during startup
-- Health Polling: Polls /health endpoint every 0.1 seconds after first log
-- Success Criteria: HTTP 200 response from health endpoint
-- Port: Maps container port 8000 to specified local port
-- Stop Condition: Immediately after health endpoint returns 200
-
-EXAMPLE LOG FLOW:
-================
-[15.230s] initializing a v1 llm engine → engine_init
-[45.120s] starting to load model → model_loading
-[67.340s] downloading weights → weights_download
-[156.780s] loading weights took 89.44s → weights_loaded
-[198.450s] model loading took 153.33s → model_loaded
-[245.670s] graph capturing finished → graph_capture
-[318.429s] started server process → server_log_ready
-[318.435s] HTTP 200 /health → server_ready
-"""
-
-import requests
-import json
-import time
-from typing import Dict, Optional
-from benchmark_base import BenchmarkBase
-
-
-class VLLMBenchmark(BenchmarkBase):
-    def __init__(self, image: str = "", container_name: str = "vllm-timing-test",
-                 snapshotter: str = "nydus", port: int = 8080):
-        super().__init__(image, container_name, snapshotter, port)
-
-    def get_health_endpoint(self) -> str:
-        """Get health endpoint for vLLM application."""
-        return "health"
-
-    def supports_health_polling(self) -> bool:
-        """vLLM application supports health endpoint polling."""
-        return True
-
-    def _should_stop_monitoring(self, elapsed: float) -> bool:
-        """Custom stop monitoring logic for vLLM."""
-        # Use base class logic for health polling apps
-        return super()._should_stop_monitoring(elapsed)
-
-    def _is_successful(self, results: Dict[str, Optional[float]]) -> bool:
-        """Determine if benchmark was successful."""
-        return results.get("server_ready") is not None
-
-    def _init_phases(self) -> None:
-        """Initialize the phases dictionary for vLLM."""
-        self.phases = {
-            "first_log": None,
-            "engine_init": None,
-            "weights_download": None,
-            "weights_download_complete": None,
-            "weights_loaded": None,
-            "graph_capture": None,
-            "server_log_ready": None,
-            "server_ready": None
-        }
-
-    def analyze_log_line(self, line: str, timestamp: float) -> Optional[str]:
-        """Analyze a log line and return detected phase."""
-        elapsed = timestamp - self.start_time
-        line_lower = line.lower()
-
-        # Engine initialization (vLLM V1 engine)
-        if self.phases["engine_init"] is None:
-            if any(pattern in line_lower for pattern in [
-                "initializing a v1 llm engine", "waiting for init message", "v1 llm engine"
-            ]):
-                self.phases["engine_init"] = elapsed
-                return "engine_init"
-
-        # Weights download start (was model loading start)
-        if self.phases["weights_download"] is None:
-            if any(pattern in line_lower for pattern in [
-                "starting to load model", "loading model from scratch"
-            ]):
-                self.phases["weights_download"] = elapsed
-                return "weights_download"
-
-        # Weights download complete
-        if self.phases["weights_download_complete"] is None:
-            if any(pattern in line_lower for pattern in [
-                "time spent downloading weights", "downloading weights"
-            ]):
-                self.phases["weights_download_complete"] = elapsed
-                return "weights_download_complete"
-
-        # Weights loaded patterns
-        if self.phases["weights_loaded"] is None:
-            if any(pattern in line_lower for pattern in [
-                "loading weights took", "loading safetensors checkpoint shards: 100%"
-            ]):
-                self.phases["weights_loaded"] = elapsed
-                return "weights_loaded"
-
-        # CUDA graph capture
- if self.phases["graph_capture"] is None: - if any(pattern in line_lower for pattern in [ - "graph capturing finished", "capturing cuda graph shapes: 100%" - ]): - self.phases["graph_capture"] = elapsed - return "graph_capture" - - # Server log ready pattern (vLLM/FastAPI specific) - if self.phases["server_log_ready"] is None: - if "started server process" in line_lower: - self.phases["server_log_ready"] = elapsed - return "server_log_ready" - - return None - - def test_api_readiness(self, timeout: int = 120) -> bool: - """vLLM benchmark uses health polling instead of direct API test.""" - print("Using health polling instead of direct API test") - return True - - def get_default_image(self, snapshotter: str) -> str: - """Get default image for the snapshotter. Users should now use --repo parameter instead.""" - raise ValueError( - "No default image configured. Please specify either:\n" - " --repo (e.g., --repo my-vllm-app)\n" - " --image (e.g., --image registry.com/repo:tag)\n" - "\nExample: python test-bench-vllm.py --repo saurabh-vllm-test --tag latest --snapshotter nydus" - ) - - def _get_summary_items(self, total_time): - """Get summary items for the timing summary.""" - items = [ - ("Container Startup Time:", self.container_startup_duration), - ("Container to First Log:", self.phases["first_log"]), - ("Engine Initialization:", self.phases["engine_init"]), - ("Weights Download Start:", self.phases["weights_download"]), - ("Weights Download Complete:", self.phases["weights_download_complete"]), - ("Weights Loaded:", self.phases["weights_loaded"]), - ("Graph Capture Complete:", self.phases["graph_capture"]), - ("Server Log Ready:", self.phases["server_log_ready"]), - ("Server Ready:", self.phases["server_ready"]), - ("Total Test Time:", total_time) - ] - - # Add breakdown section - items.append(("", None)) # Empty line separator - items.append(("BREAKDOWN:", None)) - - # Calculate breakdowns - if self.phases["first_log"] is not None: - items.append(("Container to 
First Log:", self.phases["first_log"])) - - if self.phases["first_log"] is not None and self.phases["weights_download"] is not None: - first_to_download = self.phases["weights_download"] - self.phases["first_log"] - items.append(("First Log to Weight Download Start:", first_to_download)) - - if self.phases["weights_download"] is not None and self.phases["weights_download_complete"] is not None: - download_duration = self.phases["weights_download_complete"] - self.phases["weights_download"] - items.append(("Weight Download Start to Complete:", download_duration)) - - if self.phases["weights_download_complete"] is not None and self.phases["weights_loaded"] is not None: - download_to_loaded = self.phases["weights_loaded"] - self.phases["weights_download_complete"] - items.append(("Weight Download Complete to Weights Loaded:", download_to_loaded)) - - if self.phases["weights_loaded"] is not None and self.phases["server_ready"] is not None: - loaded_to_ready = self.phases["server_ready"] - self.phases["weights_loaded"] - items.append(("Weights Loaded to Server Ready:", loaded_to_ready)) - - return items - - -if __name__ == "__main__": - import sys - - benchmark = VLLMBenchmark() - sys.exit(benchmark.main("vLLM Container Startup Benchmark")) \ No newline at end of file diff --git a/scripts/build_push.py b/scripts/build_push.py deleted file mode 100755 index 65afabf..0000000 --- a/scripts/build_push.py +++ /dev/null @@ -1,547 +0,0 @@ -#!/usr/bin/env python3 -""" -Build and push container images with different snapshotter formats. -Supports ECR (AWS) and GAR (Google Artifact Registry). 
-""" - -import argparse -import os -import subprocess -import sys -import json -from pathlib import Path -from abc import ABC, abstractmethod - - -def run_command(cmd, check=True, capture_output=False): - """Run a shell command and handle errors.""" - import time - - print(f"Running: {cmd}") - start_time = time.time() - - try: - if capture_output: - result = subprocess.run(cmd, shell=True, check=check, capture_output=True, text=True) - elapsed = time.time() - start_time - print(f"✓ Completed in {elapsed:.2f}s") - return result.stdout.strip() - else: - subprocess.run(cmd, shell=True, check=check) - elapsed = time.time() - start_time - print(f"✓ Completed in {elapsed:.2f}s") - except subprocess.CalledProcessError as e: - elapsed = time.time() - start_time - print(f"❌ Failed after {elapsed:.2f}s") - print(f"Error running command: {cmd}") - print(f"Error: {e}") - if capture_output and e.stdout: - print(f"Stdout: {e.stdout}") - if capture_output and e.stderr: - print(f"Stderr: {e.stderr}") - sys.exit(1) - - -class Registry(ABC): - """Abstract base class for container registries.""" - - @abstractmethod - def check_credentials(self): - """Check if credentials are configured.""" - pass - - @abstractmethod - def create_repository(self, image_name): - """Create repository if it doesn't exist.""" - pass - - @abstractmethod - def login(self): - """Login to registry with docker and nerdctl.""" - pass - - @abstractmethod - def get_registry_url(self): - """Return the registry URL.""" - pass - - def get_full_image_name(self, image_name, tag="latest"): - """Construct full image reference.""" - return f"{self.get_registry_url()}/{image_name}:{tag}" - - -class ECRRegistry(Registry): - """AWS Elastic Container Registry implementation.""" - - def __init__(self, account, region): - self.account = account - self.region = region - self.registry_url = f"{account}.dkr.ecr.{region}.amazonaws.com" - - def check_credentials(self): - """Check if AWS credentials are configured.""" - try: - 
run_command("aws sts get-caller-identity", capture_output=True) - print("✓ AWS credentials are configured") - except: - print("Error: AWS credentials not configured. Please run 'aws configure' first.") - sys.exit(1) - - def create_repository(self, image_name): - """Create ECR repository if it doesn't exist.""" - print(f"Checking/creating ECR repository: {image_name}") - - # Check if repository exists - check_cmd = f"aws ecr describe-repositories --repository-names {image_name} --region {self.region}" - try: - run_command(check_cmd, capture_output=True) - print(f"✓ Repository {image_name} already exists") - except: - # Repository doesn't exist, create it - create_cmd = f"aws ecr create-repository --repository-name {image_name} --region {self.region}" - run_command(create_cmd) - print(f"✓ Created repository {image_name}") - - def login(self): - """Login to ECR using both docker and nerdctl.""" - print("Logging into ECR...") - - password = run_command(f"aws ecr get-login-password --region {self.region}", capture_output=True) - - # Login with docker - login_cmd = f"echo '{password}' | docker login -u AWS --password-stdin {self.registry_url}" - run_command(login_cmd) - - # Login with nerdctl - login_cmd = f"echo '{password}' | nerdctl login -u AWS --password-stdin {self.registry_url}" - run_command(login_cmd) - - # Login with sudo nerdctl - login_cmd = f"echo '{password}' | sudo nerdctl login -u AWS --password-stdin {self.registry_url}" - run_command(login_cmd) - - print("✓ Successfully logged into ECR") - - def get_registry_url(self): - """Return the ECR registry URL.""" - return self.registry_url - - -class GARRegistry(Registry): - """Google Artifact Registry implementation.""" - - def __init__(self, project_id, repository, location): - self.project_id = project_id - self.repository = repository - self.location = location - self.registry_url = f"{location}-docker.pkg.dev/{project_id}/{repository}" - - def check_credentials(self): - """Check if GCP credentials are 
configured.""" - try: - run_command("gcloud auth application-default print-access-token", capture_output=True) - print("✓ GCP credentials are configured") - except: - print("Error: GCP credentials not configured.") - print("Please run 'gcloud auth application-default login' or 'gcloud auth login'") - sys.exit(1) - - def create_repository(self, image_name): - """Create GAR repository if it doesn't exist.""" - print(f"Checking/creating GAR repository: {self.repository}") - - # Check if repository exists - check_cmd = f"gcloud artifacts repositories describe {self.repository} --location={self.location} --project={self.project_id}" - try: - run_command(check_cmd, capture_output=True) - print(f"✓ Repository {self.repository} already exists") - except: - # Repository doesn't exist, create it - create_cmd = f"gcloud artifacts repositories create {self.repository} --repository-format=docker --location={self.location} --project={self.project_id}" - run_command(create_cmd) - print(f"✓ Created repository {self.repository}") - - def login(self): - """Login to GAR using both docker and nerdctl.""" - print("Logging into Google Artifact Registry...") - - # Configure Docker authentication helper for GAR - auth_cmd = f"gcloud auth configure-docker {self.location}-docker.pkg.dev" - run_command(auth_cmd) - - # Get access token for nerdctl login - token = run_command("gcloud auth print-access-token", capture_output=True) - - # Login with nerdctl - login_cmd = f"echo '{token}' | nerdctl login -u oauth2accesstoken --password-stdin {self.location}-docker.pkg.dev" - run_command(login_cmd) - - # Login with sudo nerdctl - login_cmd = f"echo '{token}' | sudo nerdctl login -u oauth2accesstoken --password-stdin {self.location}-docker.pkg.dev" - run_command(login_cmd) - - print("✓ Successfully logged into GAR") - - def get_registry_url(self): - """Return the GAR registry URL.""" - return self.registry_url - - -def build_and_push_image(image_dir, image_name, registry): - """Build and push the 
base Docker image.""" - print(f"Building image from {image_dir}...") - - # Change to image directory for build context - original_dir = os.getcwd() - os.chdir(image_dir) - - try: - # Build the image - build_cmd = f"docker build -t {image_name} ." - run_command(build_cmd) - - # Tag for registry - full_image = registry.get_full_image_name(image_name, "latest") - tag_cmd = f"docker tag {image_name} {full_image}" - run_command(tag_cmd) - - # Push the image - push_cmd = f"docker push {full_image}" - run_command(push_cmd) - - print(f"✓ Successfully built and pushed {full_image}") - - finally: - os.chdir(original_dir) - - -def convert_to_nydus(image_name, registry): - """Convert and push Nydus image.""" - print("Converting to Nydus format...") - - source_image = registry.get_full_image_name(image_name, "latest") - target_image = registry.get_full_image_name(image_name, "latest-nydus") - - nydus_cmd = f"""nydusify convert \\ - --source {source_image} \\ - --source-backend-config ~/.docker/config.json \\ - --target {target_image}""" - - run_command(nydus_cmd) - print(f"✓ Successfully converted and pushed {target_image}") - - -def convert_to_soci(image_name, registry): - """Convert and push SOCI image.""" - print("Converting to SOCI format...") - - source_image = registry.get_full_image_name(image_name, "latest") - target_image = registry.get_full_image_name(image_name, "latest-soci") - - # Pull the image with nerdctl first - pull_cmd = f"sudo nerdctl pull {source_image}" - run_command(pull_cmd) - - # Convert to SOCI - soci_cmd = f"sudo soci convert {source_image} {target_image}" - run_command(soci_cmd) - - # Push SOCI image - push_cmd = f"sudo nerdctl push {target_image}" - run_command(push_cmd) - - print(f"✓ Successfully converted and pushed {target_image}") - - -def convert_to_estargz(image_name, registry): - """Convert and push eStargz image.""" - print("Converting to eStargz format...") - - source_image = registry.get_full_image_name(image_name, "latest") - target_image 
= registry.get_full_image_name(image_name, "latest-estargz") - - # Pull the image with nerdctl first - pull_cmd = f"sudo nerdctl pull {source_image}" - run_command(pull_cmd) - - estargz_cmd = f"sudo nerdctl image convert --estargz --oci {source_image} {target_image}" - run_command(estargz_cmd) - - # Push eStargz image - push_cmd = f"sudo nerdctl push {target_image}" - run_command(push_cmd) - - print(f"✓ Successfully converted and pushed {target_image}") - - -def cleanup_built_images(image_name, registry, formats): - """Remove only the images that were built in this run.""" - import time - - print("\n" + "="*60) - print("🧹 CLEANUP: Removing built images...") - print("="*60) - - cleanup_start = time.time() - images_to_remove = [] - - # Collect all image references that were built - if "normal" in formats: - images_to_remove.append(image_name) # Local tag - images_to_remove.append(registry.get_full_image_name(image_name, "latest")) - if "nydus" in formats: - images_to_remove.append(registry.get_full_image_name(image_name, "latest-nydus")) - if "soci" in formats: - images_to_remove.append(registry.get_full_image_name(image_name, "latest-soci")) - if "estargz" in formats: - images_to_remove.append(registry.get_full_image_name(image_name, "latest-estargz")) - - # Cleanup Docker images - print("\n📦 Docker Cleanup:") - for image in images_to_remove: - try: - print(f" Removing: {image}") - run_command(f"docker rmi -f {image}", check=False, capture_output=True) - except Exception as e: - print(f" ⚠️ Warning: Could not remove {image}: {e}") - - # Cleanup nerdctl images for relevant snapshotters - snapshotter_map = { - "normal": "overlayfs", - "nydus": "nydus", - "soci": "soci", - "estargz": "stargz" - } - - print(f"\n🔧 nerdctl Cleanup:") - for format_type in formats: - snapshotter = snapshotter_map.get(format_type) - if not snapshotter: - continue - - print(f" Processing {snapshotter} snapshotter...") - try: - # Determine the correct tag - if format_type == "normal": - tag = 
"latest" - else: - tag = f"latest-{format_type}" - - image_ref = registry.get_full_image_name(image_name, tag) - print(f" Removing: {image_ref}") - run_command(f"sudo nerdctl --snapshotter {snapshotter} rmi -f {image_ref}", check=False, capture_output=True) - - except Exception as e: - print(f" ⚠️ Warning: Could not cleanup {snapshotter} images: {e}") - - total_cleanup_time = time.time() - cleanup_start - print(f"\n✅ Cleanup completed in {total_cleanup_time:.2f}s") - print("="*60) - - -def list_available_images(base_path="snapshotters/images"): - """List available image directories.""" - images_dir = Path(base_path) - if not images_dir.exists(): - print(f"Error: {base_path} directory not found") - return [] - - image_dirs = [] - for item in images_dir.iterdir(): - if item.is_dir() and (item / "Dockerfile").exists(): - image_dirs.append(item.name) - - return sorted(image_dirs) - - -def main(): - parser = argparse.ArgumentParser( - description="Build and push container images with different snapshotter formats. 
Supports ECR (AWS) and GAR (Google Artifact Registry).", - formatter_class=argparse.RawDescriptionHelpFormatter, - epilog=""" -Examples: - # ECR (AWS) - Build image from custom path - python3 build_push.py --registry-type ecr --account 123456789 --image-path /path/to/my/image --image-name my-image --region us-east-1 - - # ECR - Build with specific formats - python3 build_push.py --registry-type ecr --account 123456789 --image-path ./images/cuda --image-name cuda-test --formats normal,nydus - - # GAR (Google) - Build and push all formats - python3 build_push.py --registry-type gar --project-id my-gcp-project --repository my-repo --image-path ./images/vllm --image-name vllm-app --location us-central1 - - # GAR - Build with specific formats - python3 build_push.py --registry-type gar --project-id my-project --repository ai-models --image-path ./images/sglang --image-name sglang --location us-east1 --formats normal,nydus,soci - - # List available images in default directory - python3 build_push.py --list-images - """) - - # Registry selection - parser.add_argument("--registry-type", choices=["ecr", "gar"], default="ecr", - help="Registry type: ecr (AWS) or gar (Google Artifact Registry). 
Default: ecr") - - # Common arguments - parser.add_argument("--image-path", required=False, help="Full path to image directory") - parser.add_argument("--image-name", required=False, help="Image name for the container") - parser.add_argument("--formats", default="normal,nydus,soci,estargz", - help="Comma-separated list of formats to build (normal,nydus,soci,estargz)") - parser.add_argument("--list-images", action="store_true", help="List available image directories") - parser.add_argument("--no-cleanup", action="store_true", help="Skip cleanup of local images after build") - - # ECR-specific arguments - parser.add_argument("--account", required=False, help="AWS account ID (required for ECR)") - parser.add_argument("--region", required=False, default="us-east-1", - help="AWS region for ECR (default: us-east-1)") - - # GAR-specific arguments - parser.add_argument("--project-id", required=False, help="GCP project ID (optional for GAR, defaults to gcloud config)") - parser.add_argument("--repository", required=False, help="GAR repository name (required for GAR)") - parser.add_argument("--location", required=False, default="us-central1", - help="GCP location for GAR (default: us-central1)") - - args = parser.parse_args() - - # List available images - if args.list_images: - available_images = list_available_images() - if available_images: - print("Available image directories:") - for img in available_images: - print(f" - {img}") - else: - print("No image directories found with Dockerfiles") - return - - # Validate registry-specific arguments - if args.registry_type == "ecr": - if not args.account: - parser.error("--account is required for ECR") - elif args.registry_type == "gar": - # Get project ID from gcloud config if not provided - if not args.project_id: - try: - args.project_id = run_command("gcloud config get project", capture_output=True) - if not args.project_id: - parser.error("--project-id is required for GAR (or set default project with 'gcloud config set 
project PROJECT_ID')") - print(f"Using project ID from gcloud config: {args.project_id}") - except: - parser.error("--project-id is required for GAR (or set default project with 'gcloud config set project PROJECT_ID')") - if not args.repository: - parser.error("--repository is required for GAR") - - # Validate common required arguments - if not args.image_path: - parser.error("--image-path is required") - if not args.image_name: - parser.error("--image-name is required") - - # Validate image directory exists - image_dir = Path(args.image_path) - if not image_dir.exists(): - print(f"Error: Image directory '{args.image_path}' not found") - sys.exit(1) - - dockerfile_path = image_dir / "Dockerfile" - if not dockerfile_path.exists(): - print(f"Error: No Dockerfile found in {image_dir}") - sys.exit(1) - - # Parse formats - formats = [f.strip() for f in args.formats.split(",")] - valid_formats = {"normal", "nydus", "soci", "estargz"} - invalid_formats = set(formats) - valid_formats - if invalid_formats: - print(f"Error: Invalid formats: {invalid_formats}") - print(f"Valid formats: {valid_formats}") - sys.exit(1) - - # Set image name - image_name = args.image_name - - # Create registry instance based on type - if args.registry_type == "ecr": - registry = ECRRegistry(args.account, args.region) - registry_info = f"Account: {args.account}, Region: {args.region}" - else: # gar - registry = GARRegistry(args.project_id, args.repository, args.location) - registry_info = f"Project: {args.project_id}, Repository: {args.repository}, Location: {args.location}" - - print("="*70) - print("🚀 STARTING CONTAINER IMAGE BUILD AND PUSH") - print("="*70) - print(f"Registry Type: {args.registry_type.upper()}") - print(f"Building image: {image_name}") - print(f"From directory: {image_dir}") - print(f"{registry_info}") - print(f"Formats: {formats}") - print() - - import time - total_start_time = time.time() - - # Check credentials - print(f"🔐 Checking {args.registry_type.upper()} 
credentials...") - registry.check_credentials() - - # Login to registry - print(f"\n🔑 Logging into {args.registry_type.upper()}...") - registry.login() - - # Create repository - print(f"\n📦 Setting up repository...") - registry.create_repository(image_name) - - # Build and push base image - if "normal" in formats: - print(f"\n🏗️ Building and pushing base image...") - build_start = time.time() - build_and_push_image(str(image_dir), image_name, registry) - build_time = time.time() - build_start - print(f"✅ Base image build completed in {build_time:.2f}s") - - # Convert to different formats - if "nydus" in formats: - print(f"\n🔄 Converting to Nydus format...") - nydus_start = time.time() - convert_to_nydus(image_name, registry) - nydus_time = time.time() - nydus_start - print(f"✅ Nydus conversion completed in {nydus_time:.2f}s") - - if "soci" in formats: - print(f"\n🔄 Converting to SOCI format...") - soci_start = time.time() - convert_to_soci(image_name, registry) - soci_time = time.time() - soci_start - print(f"✅ SOCI conversion completed in {soci_time:.2f}s") - - if "estargz" in formats: - print(f"\n🔄 Converting to eStargz format...") - estargz_start = time.time() - convert_to_estargz(image_name, registry) - estargz_time = time.time() - estargz_start - print(f"✅ eStargz conversion completed in {estargz_time:.2f}s") - - total_time = time.time() - total_start_time - - print("\n" + "="*70) - print("🎉 ALL FORMATS BUILT AND PUSHED SUCCESSFULLY!") - print("="*70) - print(f"Registry: {registry.get_registry_url()}") - print(f"Base image: {registry.get_full_image_name(image_name, 'latest')}") - if "nydus" in formats: - print(f"Nydus image: {registry.get_full_image_name(image_name, 'latest-nydus')}") - if "soci" in formats: - print(f"SOCI image: {registry.get_full_image_name(image_name, 'latest-soci')}") - if "estargz" in formats: - print(f"eStargz image: {registry.get_full_image_name(image_name, 'latest-estargz')}") - - print(f"\n⏱️ Total build and push time: 
{total_time:.2f}s ({total_time/60:.1f} minutes)") - print("="*70) - - # Cleanup built images by default (unless --no-cleanup is specified) - if not args.no_cleanup: - cleanup_built_images(image_name, registry, formats) - - -if __name__ == "__main__": - main() diff --git a/scripts/builder/Dockerfile b/scripts/builder/Dockerfile new file mode 100644 index 0000000..f90e9d5 --- /dev/null +++ b/scripts/builder/Dockerfile @@ -0,0 +1,56 @@ +# Build stage: Compile buildkit with Nydus support +FROM golang:1.21-alpine AS buildkit-builder + +# Install build dependencies +RUN apk add --no-cache git make + +# Clone nydusaccelerator/buildkit fork +ARG BUILDKIT_VERSION=nydus-compression-type-enhance +RUN git clone --depth 1 --branch ${BUILDKIT_VERSION} \ + https://github.com/nydusaccelerator/buildkit.git /buildkit + +WORKDIR /buildkit + +# Build buildkitd and buildctl with Nydus support +RUN go build -tags=nydus -o ./bin/buildkitd ./cmd/buildkitd && \ + go build -o ./bin/buildctl ./cmd/buildctl + +# Runtime stage +FROM alpine:latest + +# Copy buildkit binaries with Nydus support +COPY --from=buildkit-builder /buildkit/bin/buildctl /usr/bin/buildctl +COPY --from=buildkit-builder /buildkit/bin/buildkitd /usr/bin/buildkitd + +# Copy buildctl-daemonless.sh wrapper from moby/buildkit repo +ADD https://raw.githubusercontent.com/moby/buildkit/master/examples/buildctl-daemonless/buildctl-daemonless.sh /usr/bin/buildctl-daemonless.sh +RUN chmod +x /usr/bin/buildctl-daemonless.sh + +# Install runtime dependencies +RUN apk add --no-cache \ + ca-certificates \ + curl \ + wget \ + iptables \ + fuse-overlayfs \ + containerd + +# Install nydus-image binary (v2.3.6) +ARG NYDUS_VERSION=v2.3.6 +RUN wget -O /tmp/nydus.tgz \ + "https://github.com/dragonflyoss/nydus/releases/download/${NYDUS_VERSION}/nydus-static-${NYDUS_VERSION}-linux-amd64.tgz" \ + && tar -xzf /tmp/nydus.tgz -C /tmp \ + && mv /tmp/nydus-static/nydus-image /usr/bin/nydus-image \ + && chmod +x /usr/bin/nydus-image \ + && rm -rf 
/tmp/nydus.tgz /tmp/nydus-static + +# Set NYDUS_BUILDER environment variable (required for buildkit) +ENV NYDUS_BUILDER=/usr/bin/nydus-image + +# Copy build script +COPY build.sh /usr/local/bin/build.sh +RUN chmod +x /usr/local/bin/build.sh + +WORKDIR /workspace + +ENTRYPOINT ["/usr/local/bin/build.sh"] diff --git a/scripts/builder/README.md b/scripts/builder/README.md new file mode 100644 index 0000000..450d4f0 --- /dev/null +++ b/scripts/builder/README.md @@ -0,0 +1,156 @@ +# Container-Based Image Builder + +Builds container images using `buildctl` in a containerized environment. Produces both normal OCI and Nydus-optimized images. + +## Features + +- **Registry-agnostic**: Works with AWS ECR, Google Artifact Registry, Docker Hub, or any OCI registry +- **No local dependencies**: All build tools run inside a container +- **Two image formats**: Builds both normal OCI and Nydus images in one go +- **Direct push**: Images pushed directly to registry via buildctl + +## Architecture + +``` +Host (authenticated) → Builder Container (buildctl + nydus-image) → Registry +``` + +- **Host**: Authenticates to registry, mounts build context and docker config +- **Builder Container**: Runs buildctl to build and push images +- **No Docker daemon dependency**: buildctl pushes directly to registries + +## Prerequisites + +1. **Docker** installed on host (no other dependencies needed!) +2. 
**Authenticated to your registry** before running: + +```bash +# AWS ECR +aws ecr get-login-password --region us-east-1 | \ + docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com + +# Google Artifact Registry +gcloud auth configure-docker us-central1-docker.pkg.dev + +# Docker Hub +docker login +``` + +## Usage + +```bash +docker run --rm --privileged \ + -v /path/to/build-context:/workspace:ro \ + -v ~/.docker/config.json:/root/.docker/config.json:ro \ + tensorfuse/fastpull-builder:latest \ + <registry/image[:tag]> +``` + +### Examples + +**AWS ECR:** +```bash +docker run --rm --privileged \ + -v ./my-app:/workspace:ro \ + -v ~/.docker/config.json:/root/.docker/config.json:ro \ + tensorfuse/fastpull-builder:latest \ + 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest +``` + +**Google Artifact Registry:** +```bash +docker run --rm --privileged \ + -v ./my-app:/workspace:ro \ + -v ~/.docker/config.json:/root/.docker/config.json:ro \ + tensorfuse/fastpull-builder:latest \ + us-central1-docker.pkg.dev/my-project/my-repo/my-app:v1.0 +``` + +**Docker Hub:** +```bash +docker run --rm --privileged \ + -v ./my-app:/workspace:ro \ + -v ~/.docker/config.json:/root/.docker/config.json:ro \ + tensorfuse/fastpull-builder:latest \ + docker.io/username/my-app:latest +``` + +**No tag (defaults to :latest):** +```bash +docker run --rm --privileged \ + -v ./my-app:/workspace:ro \ + -v ~/.docker/config.json:/root/.docker/config.json:ro \ + tensorfuse/fastpull-builder:latest \ + my-registry.com/my-app +``` + +**Custom Dockerfile:** +```bash +docker run --rm --privileged \ + -v ./my-app:/workspace:ro \ + -v ~/.docker/config.json:/root/.docker/config.json:ro \ + -e DOCKERFILE=Dockerfile.custom \ + tensorfuse/fastpull-builder:latest \ + my-registry.com/my-app:latest +``` + +## Output + +The script builds and pushes two images: +- `<image>:<tag>` - Normal OCI image +- `<image>:<tag>-fastpull` - Fastpull-optimized image + +## Files + +- `Dockerfile` - Builder container definition (builds from
nydusaccelerator/buildkit fork) +- `build.sh` - Build script that runs inside container (entrypoint) +- `README.md` - This file + +## Technical Details + +### Buildkit with Nydus Support +The Dockerfile builds `buildkitd` and `buildctl` from the [nydusaccelerator/buildkit](https://github.com/nydusaccelerator/buildkit) fork with the `-tags=nydus` flag, which enables Nydus compression support. The standard moby/buildkit does not include this functionality. + +### Components +- **buildkitd/buildctl**: Compiled from nydusaccelerator/buildkit fork +- **nydus-image**: v2.3.6 binary (set via `NYDUS_BUILDER` env var) +- **buildctl-daemonless.sh**: Wrapper that runs buildkitd in rootless mode + +## How It Works + +1. **Pull builder image**: Downloads `tensorfuse/fastpull-builder:latest` from Docker Hub +2. **Mount context**: Your build context is mounted read-only into `/workspace` +3. **Mount auth**: `~/.docker/config.json` is mounted for registry authentication +4. **Run buildctl**: Builds normal OCI image with `buildctl-daemonless.sh` +5. **Run buildctl again**: Builds Fastpull image with Nydus compression +6. 
**Direct push**: Both images pushed directly to registry + +## Troubleshooting + +**"Error: Docker config not found"** +- Run the registry authentication command first (see Prerequisites) + +**"Error: Build context path does not exist"** +- Check that the directory mounted at `/workspace` exists and is readable + +**"Error: Dockerfile not found"** +- Ensure the Dockerfile exists in the context directory +- Or specify a custom name with the `DOCKERFILE` environment variable (`-e DOCKERFILE=Dockerfile.custom`) + +**Build fails with authentication error:** +- Re-authenticate to your registry +- Check that `~/.docker/config.json` contains valid credentials + +**"permission denied" errors:** +- Builder container runs with `--privileged` flag (required for buildkit) +- Ensure Docker is running with appropriate permissions + +## Comparison with Original build_push.py + +| Feature | Original | Container-Based | +|---------|----------|-----------------| +| Dependencies | Requires nerdctl, nydusify, soci, stargz locally | All tools in container | +| Registry | AWS ECR or GAR | Any OCI registry | +| Formats | normal, nydus, soci, estargz | normal, nydus | +| Push method | nerdctl/docker | buildctl (direct) | +| Portability | Requires snapshotter setup | Runs anywhere Docker runs | diff --git a/scripts/builder/build.sh b/scripts/builder/build.sh new file mode 100644 index 0000000..8858ccf --- /dev/null +++ b/scripts/builder/build.sh @@ -0,0 +1,72 @@ +#!/bin/sh +set -e + +# Usage: build.sh <registry/image[:tag]> +# Example: build.sh my-registry.com/my-app:latest +# Example: build.sh my-registry.com/my-app (defaults to :latest) + +if [ $# -lt 1 ]; then + echo "Usage: $0 <registry/image[:tag]>" + echo "Example: $0 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.0" + echo "Example: $0 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app (defaults to :latest)" + exit 1 +fi + +IMAGE_WITH_TAG="$1" +DOCKERFILE="${DOCKERFILE:-Dockerfile}" +CONTEXT_PATH="${CONTEXT_PATH:-/workspace}" + +# Parse image and tag (default to :latest if no tag provided) +if echo "$IMAGE_WITH_TAG" | grep -q ":"; then +
IMAGE_NAME="${IMAGE_WITH_TAG%:*}" + TAG="${IMAGE_WITH_TAG##*:}" +else + IMAGE_NAME="$IMAGE_WITH_TAG" + TAG="latest" +fi + +FULL_IMAGE="${IMAGE_NAME}:${TAG}" +FULL_IMAGE_FASTPULL="${IMAGE_NAME}:${TAG}-fastpull" + +echo "==========================================" +echo "Building images for: ${IMAGE_NAME}" +echo "Tag: ${TAG}" +echo "Context: ${CONTEXT_PATH}" +echo "Dockerfile: ${DOCKERFILE}" +echo "==========================================" + +# Build normal OCI image +echo "" +echo ">>> Building normal OCI image: ${FULL_IMAGE}" +echo "" +time buildctl-daemonless.sh build \ + --frontend dockerfile.v0 \ + --local context="${CONTEXT_PATH}" \ + --local dockerfile="${CONTEXT_PATH}" \ + --opt filename="${DOCKERFILE}" \ + --output type=image,name="${FULL_IMAGE}",push=true + +echo "" +echo "✓ Normal OCI image built and pushed: ${FULL_IMAGE}" +echo "" + +# Build Fastpull image +echo "" +echo ">>> Building Fastpull image: ${FULL_IMAGE_FASTPULL}" +echo "" +time buildctl-daemonless.sh build \ + --frontend dockerfile.v0 \ + --local context="${CONTEXT_PATH}" \ + --local dockerfile="${CONTEXT_PATH}" \ + --opt filename="${DOCKERFILE}" \ + --output type=image,name="${FULL_IMAGE_FASTPULL}",push=true,compression=nydus,force-compression=true,oci-mediatypes=true + +echo "" +echo "✓ Fastpull image built and pushed: ${FULL_IMAGE_FASTPULL}" +echo "" + +echo "==========================================" +echo "✓ Build complete!" +echo " Normal: ${FULL_IMAGE}" +echo " Fastpull: ${FULL_IMAGE_FASTPULL}" +echo "==========================================" diff --git a/scripts/fastpull-cli.py b/scripts/fastpull-cli.py new file mode 100755 index 0000000..064a501 --- /dev/null +++ b/scripts/fastpull-cli.py @@ -0,0 +1,81 @@ +#!/usr/bin/env python3 +""" +FastPull - Accelerate AI/ML container startup with lazy-loading snapshotters. + +Main CLI entry point for the unified fastpull command. 
+""" + +import argparse +import sys +import os + +# Add the library directory to the path to import fastpull module +# When installed, fastpull module is at /usr/local/lib/fastpull +# When running from source, it's in the same directory as this script +script_dir = os.path.dirname(os.path.abspath(__file__)) +sys.path.insert(0, script_dir) # For running from source +sys.path.insert(0, '/usr/local/lib') # For installed version + +from fastpull import __version__ +from fastpull import run, build, quickstart + + +def main(): + """Main CLI entry point.""" + parser = argparse.ArgumentParser( + prog='fastpull', + description='FastPull - Accelerate AI/ML container startup with lazy-loading snapshotters', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Run container with benchmarking + fastpull run --snapshotter nydus --image myapp:latest-nydus \\ + --benchmark-mode readiness --readiness-endpoint http://localhost:8080/health -p 8080:8080 + + # Build and push Docker and Nydus images + fastpull build --image-path ./app --image myapp:v1 --format docker,nydus + +For more information, visit: https://github.com/tensorfuse/fastpull + """ + ) + + parser.add_argument( + '--version', + action='version', + version=f'%(prog)s {__version__}' + ) + + # Create subparsers for commands + subparsers = parser.add_subparsers( + dest='command', + title='commands', + description='Available fastpull commands', + help='Command to execute' + ) + + # Add subcommands + run.add_parser(subparsers) + build.add_parser(subparsers) + quickstart.add_parser(subparsers) + + # Parse arguments + args = parser.parse_args() + + # If no command specified, print help + if not args.command: + parser.print_help() + sys.exit(1) + + # Execute the command + try: + args.func(args) + except KeyboardInterrupt: + print("\n\nInterrupted by user") + sys.exit(130) + except Exception as e: + print(f"Error: {e}") + sys.exit(1) + + +if __name__ == '__main__': + main() diff --git 
a/scripts/fastpull/__init__.py b/scripts/fastpull/__init__.py new file mode 100644 index 0000000..23d4405 --- /dev/null +++ b/scripts/fastpull/__init__.py @@ -0,0 +1,8 @@ +""" +FastPull - Accelerate AI/ML container startup with lazy-loading snapshotters. + +A unified CLI for building, pushing, and running containers with Nydus, SOCI, +and eStarGZ snapshotters. +""" + +__version__ = "0.1.0" diff --git a/scripts/fastpull/benchmark.py b/scripts/fastpull/benchmark.py new file mode 100644 index 0000000..f79d228 --- /dev/null +++ b/scripts/fastpull/benchmark.py @@ -0,0 +1,193 @@ +""" +Benchmarking utilities for fastpull run command. + +Tracks container lifecycle events and readiness checks. +""" + +import json +import subprocess +import threading +import time +from datetime import datetime +from typing import Optional, Dict +from urllib.request import urlopen +from urllib.error import URLError, HTTPError + + +class ContainerBenchmark: + """Track container startup and readiness metrics.""" + + def __init__(self, container_id: str, benchmark_mode: str = 'none', + readiness_endpoint: Optional[str] = None, mode: str = 'normal'): + """ + Initialize benchmark tracker. 
+ + Args: + container_id: Container ID to track + benchmark_mode: 'none', 'completion', or 'readiness' + readiness_endpoint: HTTP endpoint for readiness checks + mode: 'nydus' or 'normal' (for display purposes) + """ + self.container_id = container_id + self.benchmark_mode = benchmark_mode + self.readiness_endpoint = readiness_endpoint + self.mode = mode + self.metrics: Dict[str, float] = {} + self.start_time = time.time() + self._event_thread: Optional[threading.Thread] = None + self._container_started = False + + def start_event_monitoring(self): + """Start monitoring containerd events in background thread.""" + if self.benchmark_mode == 'none': + return + + def monitor_events(): + """Monitor ctr events for container lifecycle.""" + try: + # Run sudo ctr events and parse for our container + proc = subprocess.Popen( + ['sudo', 'ctr', 'events'], + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + text=True, + bufsize=1 + ) + + for line in proc.stdout: + # Look for /tasks/start event (check any task since we're the only one running) + if '/tasks/start' in line and self.metrics.get('container_start_time') is None: + elapsed = time.time() - self.start_time + self.metrics['container_start_time'] = elapsed + self._container_started = True + print(f"[{elapsed:.3f}s] ✓ CONTAINER START") + + # Look for our specific container's exit event + if self.container_id in line and '/tasks/exit' in line and self.benchmark_mode == 'completion': + elapsed = time.time() - self.start_time + self.metrics['completion_time'] = elapsed + print(f"[{elapsed:.3f}s] ✓ CONTAINER EXIT") + break + + except Exception as e: + print(f"Event monitoring error: {e}") + + self._event_thread = threading.Thread(target=monitor_events, daemon=True) + self._event_thread.start() + + def wait_for_readiness(self, timeout: int = 600, poll_interval: int = 2): + """ + Poll readiness endpoint until HTTP 200 response. 
+ + Args: + timeout: Maximum time to wait in seconds + poll_interval: Time between polls in seconds + + Returns: + True if endpoint became ready, False if timeout + """ + if self.benchmark_mode != 'readiness' or not self.readiness_endpoint: + return True + + # Ensure endpoint has protocol prefix + endpoint = self.readiness_endpoint + if not endpoint.startswith(('http://', 'https://')): + endpoint = f'http://{endpoint}' + + print(f"Polling {endpoint} for readiness...") + end_time = time.time() + timeout + + while time.time() < end_time: + try: + response = urlopen(endpoint, timeout=5) + if response.getcode() == 200: + elapsed = time.time() - self.start_time + self.metrics['readiness_time'] = elapsed + print(f"Container ready (HTTP 200): {elapsed:.2f}s") + return True + except (URLError, HTTPError): + pass + + time.sleep(poll_interval) + + print(f"Readiness check timeout after {timeout}s") + return False + + def wait_for_completion(self, timeout: int = 3600): + """ + Wait for container to exit. 
+ + Args: + timeout: Maximum time to wait in seconds + + Returns: + True if container exited, False if timeout + """ + if self.benchmark_mode != 'completion': + return True + + print(f"Waiting for container completion...") + end_time = time.time() + timeout + + while time.time() < end_time: + # Check if container is still running + result = subprocess.run( + ['nerdctl', 'ps', '-q', '-f', f'id={self.container_id}'], + capture_output=True, + text=True + ) + + if not result.stdout.strip(): + # Container has exited + if 'completion_time' not in self.metrics: + elapsed = time.time() - self.start_time + self.metrics['completion_time'] = elapsed + print(f"Container completed") + return True + + time.sleep(1) + + print(f"Completion timeout after {timeout}s") + return False + + def print_summary(self): + """Print benchmark results summary.""" + if self.benchmark_mode == 'none': + return + + mode_label = "FASTPULL" if self.mode == 'nydus' else "NORMAL" + print("\n" + "="*50) + print(f"{mode_label} BENCHMARK SUMMARY") + print("="*50) + + if 'container_start_time' in self.metrics: + print(f"Time to Container Start: {self.metrics['container_start_time']:.3f}s") + + if 'readiness_time' in self.metrics: + print(f"Time to Readiness: {self.metrics['readiness_time']:.3f}s") + + if 'completion_time' in self.metrics: + print(f"Time to Completion: {self.metrics['completion_time']:.3f}s") + + total_time = time.time() - self.start_time + print(f"Total Elapsed Time: {total_time:.3f}s") + print("="*50 + "\n") + + def export_json(self, filepath: str): + """ + Export metrics to JSON file. 
+ + Args: + filepath: Path to output JSON file + """ + output = { + 'container_id': self.container_id, + 'benchmark_mode': self.benchmark_mode, + 'metrics': self.metrics, + 'timestamp': datetime.now().isoformat() + } + + with open(filepath, 'w') as f: + json.dump(output, f, indent=2) + + print(f"Metrics exported to {filepath}") diff --git a/scripts/fastpull/build.py b/scripts/fastpull/build.py new file mode 100644 index 0000000..7418d17 --- /dev/null +++ b/scripts/fastpull/build.py @@ -0,0 +1,428 @@ +""" +FastPull build command - Build and convert container images. + +Supports two modes: +1. Build from Dockerfile: docker build → push → convert +2. Convert existing image: pull (if needed) → push → convert +""" + +import argparse +import os +import subprocess +import sys +from typing import List + +from . import common + + +def add_parser(subparsers): + """Add build subcommand parser.""" + parser = subparsers.add_parser( + 'build', + help='Build and convert container images', + description='Build Docker images and convert to Nydus/SOCI/eStarGZ formats' + ) + + # Image specification + parser.add_argument( + '--repository-url', + required=True, + help='Full image reference (e.g., account.dkr.ecr.region.amazonaws.com/myapp:v1)' + ) + parser.add_argument( + '--dockerfile-path', + help='Path to Dockerfile directory (optional - if not provided, assumes image exists)' + ) + + # Registry configuration + parser.add_argument( + '--registry', + choices=['ecr', 'gar', 'dockerhub', 'auto'], + default='auto', + help='Registry type (default: auto-detect from image URL)' + ) + + # Google GAR parameters + parser.add_argument( + '--project-id', + help='GCP project ID (for GAR)' + ) + parser.add_argument( + '--location', + default='us-central1', + help='GCP location (default: us-central1)' + ) + parser.add_argument( + '--repository', + help='GAR repository name (for GAR)' + ) + + # Build options + parser.add_argument( + '--format', + default='docker,nydus', + help='Comma-separated 
formats: docker, nydus, soci, estargz (default: docker,nydus)' + ) + parser.add_argument( + '--no-cache', + action='store_true', + help='Build without cache' + ) + parser.add_argument( + '--build-arg', + action='append', + help='Build arguments (can be used multiple times)' + ) + parser.add_argument( + '--dockerfile', + default='Dockerfile', + help='Dockerfile name (default: Dockerfile)' + ) + + parser.set_defaults(func=build_command) + return parser + + +def build_command(args): + """Execute the build command.""" + # Auto-detect registry + if args.registry == 'auto': + args.registry = common.detect_registry_type(args.repository_url) + if args.registry == 'unknown': + print(f"Error: Could not auto-detect registry from image: {args.repository_url}") + print("Please specify --registry explicitly") + sys.exit(1) + print(f"Auto-detected registry: {args.registry}") + + # Validate registry-specific parameters + if args.registry == 'ecr': + # Get account and region from AWS CLI + args.account = common.get_aws_account_id() + args.region = common.get_aws_region() + + if not args.account: + print("Error: Could not detect AWS account ID. Please configure AWS CLI (aws configure)") + sys.exit(1) + + if not args.region: + args.region = 'us-east-1' # Fallback to default + + print(f"Using AWS account: {args.account}, region: {args.region}") + + if args.registry == 'gar' and not args.repository: + parsed = common.parse_gar_url(args.repository_url) + if parsed: + args.location, args.project_id, args.repository = parsed + else: + print("Error: --repository required for GAR") + sys.exit(1) + + # Parse formats + formats = [f.strip().lower() for f in args.format.split(',')] + valid_formats = ['docker', 'nydus', 'soci', 'estargz'] + for fmt in formats: + if fmt not in valid_formats: + print(f"Error: Invalid format '{fmt}'. 
Valid: {', '.join(valid_formats)}") + sys.exit(1) + + # Determine build mode + if args.dockerfile_path: + # Mode 1: Build from Dockerfile + build_from_dockerfile(args, formats) + else: + # Mode 2: Convert existing image + if 'docker' in formats: + print("Warning: --dockerfile-path not provided, skipping docker build") + formats.remove('docker') + + if not formats: + print("Error: No formats to build (docker requires --dockerfile-path)") + sys.exit(1) + + convert_existing_image(args, formats) + + print("\n" + "="*60) + print("BUILD COMPLETE") + print("="*60) + + +def authenticate_registry(args) -> bool: + """Authenticate with the registry.""" + if args.registry == 'ecr': + return authenticate_ecr(args) + elif args.registry == 'gar': + return authenticate_gar(args) + elif args.registry == 'dockerhub': + print("Assuming Docker Hub authentication already configured") + return True + return False + + +def authenticate_ecr(args) -> bool: + """Authenticate with AWS ECR.""" + try: + # Get login password + result = subprocess.run( + ['aws', 'ecr', 'get-login-password', '--region', args.region], + check=True, + capture_output=True, + text=True + ) + password = result.stdout.strip() + + # Login with docker + registry_url = f"{args.account}.dkr.ecr.{args.region}.amazonaws.com" + subprocess.run( + ['docker', 'login', '--username', 'AWS', '--password-stdin', registry_url], + input=password, + check=True, + capture_output=True, + text=True + ) + + # Login with nerdctl + subprocess.run( + ['sudo', 'nerdctl', 'login', '--username', 'AWS', '--password-stdin', registry_url], + input=password, + check=True, + capture_output=True, + text=True + ) + + print(f"✓ Authenticated with ECR") + return True + except subprocess.CalledProcessError as e: + print(f"✗ ECR authentication failed: {e}") + return False + + +def authenticate_gar(args) -> bool: + """Authenticate with Google Artifact Registry.""" + try: + if not args.project_id: + result = subprocess.run( + ['gcloud', 'config', 'get', 'project'], +
check=True, + capture_output=True, + text=True + ) + args.project_id = result.stdout.strip() + + registry_url = f"{args.location}-docker.pkg.dev" + subprocess.run( + ['gcloud', 'auth', 'configure-docker', registry_url, '--quiet'], + check=True, + capture_output=True + ) + + print(f"✓ Authenticated with GAR") + return True + except subprocess.CalledProcessError as e: + print(f"✗ GAR authentication failed: {e}") + return False + + +def build_from_dockerfile(args, formats: List[str]): + """Mode 1: Build from Dockerfile, push, and convert.""" + print("\n" + "="*60) + print("MODE: Build from Dockerfile") + print("="*60) + + # Auto-detect if dockerfile_path is a file or directory + if os.path.isfile(args.dockerfile_path): + # User provided a file path, extract directory and filename + dockerfile_dir = os.path.dirname(args.dockerfile_path) + dockerfile_name = os.path.basename(args.dockerfile_path) + + # Use current directory if no directory in path + if not dockerfile_dir: + dockerfile_dir = '.' 
+ + # Override the dockerfile argument with detected filename + args.dockerfile = dockerfile_name + args.dockerfile_path = dockerfile_dir + + print(f"Detected Dockerfile: {dockerfile_name} in {dockerfile_dir}") + + # Validate directory exists + if not os.path.isdir(args.dockerfile_path): + print(f"Error: Directory not found: {args.dockerfile_path}") + sys.exit(1) + + # Construct full Dockerfile path + dockerfile_path = os.path.join(args.dockerfile_path, args.dockerfile) + if not os.path.isfile(dockerfile_path): + print(f"Error: Dockerfile not found: {dockerfile_path}") + sys.exit(1) + + built_images = [] + + # Build and push Docker image + if 'docker' in formats: + if build_and_push_docker(args): + built_images.append(args.repository_url) + + # Convert to other formats + if 'nydus' in formats: + nydus_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-fastpull" + if convert_to_nydus(args.repository_url, nydus_image): + built_images.append(nydus_image) + + if 'soci' in formats: + soci_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-soci" + if convert_to_soci(args.repository_url, soci_image): + built_images.append(soci_image) + + if 'estargz' in formats: + estargz_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-estargz" + if convert_to_estargz(args.repository_url, estargz_image): + built_images.append(estargz_image) + + # Summary + print_summary(built_images) + + +def convert_existing_image(args, formats: List[str]): + """Mode 2: Convert existing image (no docker build).""" + print("\n" + "="*60) + print("MODE: Convert Existing Image") + print("="*60) + + built_images = [] + + # Convert to requested formats + if 'nydus' in formats: + nydus_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-fastpull" + if convert_to_nydus(args.repository_url, nydus_image): + built_images.append(nydus_image) + + if 'soci' in 
formats: + soci_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-soci" + if convert_to_soci(args.repository_url, soci_image): + built_images.append(soci_image) + + if 'estargz' in formats: + estargz_image = f"{args.repository_url.rsplit(':', 1)[0]}:{args.repository_url.rsplit(':', 1)[1]}-estargz" + if convert_to_estargz(args.repository_url, estargz_image): + built_images.append(estargz_image) + + # Summary + print_summary(built_images) + + +def build_and_push_docker(args) -> bool: + """Build and push Docker image.""" + print(f"\n[Docker] Building {args.repository_url}...") + + # Build + cmd = [ + 'sudo', 'docker', 'build', + '-t', args.repository_url, + '-f', os.path.join(args.dockerfile_path, args.dockerfile) + ] + + if args.no_cache: + cmd.append('--no-cache') + + if args.build_arg: + for build_arg in args.build_arg: + cmd.extend(['--build-arg', build_arg]) + + cmd.append(args.dockerfile_path) + + try: + subprocess.run(cmd, check=True) + print(f"[Docker] ✓ Built {args.repository_url}") + except subprocess.CalledProcessError: + print(f"[Docker] ✗ Build failed") + return False + + # Push + print(f"[Docker] Pushing {args.repository_url}...") + try: + subprocess.run(['sudo', 'docker', 'push', args.repository_url], check=True) + print(f"[Docker] ✓ Pushed {args.repository_url}") + return True + except subprocess.CalledProcessError: + print(f"[Docker] ✗ Push failed") + return False + + +def convert_to_nydus(source_image: str, target_image: str) -> bool: + """Convert to Nydus format.""" + print(f"\n[Nydus] Converting {source_image} → {target_image}...") + + cmd = [ + 'nydusify', 'convert', + '--source', source_image, + '--target', target_image + ] + + try: + subprocess.run(cmd, check=True) + print(f"[Nydus] ✓ Converted and pushed {target_image}") + return True + except subprocess.CalledProcessError: + print(f"[Nydus] ✗ Conversion failed") + return False + + +def convert_to_soci(source_image: str, target_image: str) -> bool: + 
"""Convert to SOCI format.""" + print(f"\n[SOCI] Converting {source_image} → {target_image}...") + + # Pull with nerdctl + try: + subprocess.run(['sudo', 'nerdctl', 'pull', source_image], check=True, capture_output=True) + except subprocess.CalledProcessError: + print(f"[SOCI] ✗ Pull failed") + return False + + # Convert + try: + subprocess.run(['sudo', 'soci', 'create', source_image], check=True) + except subprocess.CalledProcessError: + print(f"[SOCI] ✗ Conversion failed") + return False + + # Tag and push + try: + subprocess.run(['sudo', 'nerdctl', 'tag', source_image, target_image], check=True) + subprocess.run(['sudo', 'nerdctl', 'push', target_image], check=True) + print(f"[SOCI] ✓ Converted and pushed {target_image}") + return True + except subprocess.CalledProcessError: + print(f"[SOCI] ✗ Push failed") + return False + + +def convert_to_estargz(source_image: str, target_image: str) -> bool: + """Convert to eStarGZ format.""" + print(f"\n[eStarGZ] Converting {source_image} → {target_image}...") + + try: + subprocess.run(['sudo', 'nerdctl', '--snapshotter', 'stargz', 'pull', source_image], + check=True, capture_output=True) + subprocess.run(['sudo', 'nerdctl', '--snapshotter', 'stargz', 'tag', source_image, target_image], + check=True) + subprocess.run(['sudo', 'nerdctl', '--snapshotter', 'stargz', 'push', target_image], + check=True) + print(f"[eStarGZ] ✓ Converted and pushed {target_image}") + return True + except subprocess.CalledProcessError: + print(f"[eStarGZ] ✗ Conversion failed") + return False + + +def print_summary(images: List[str]): + """Print build summary.""" + print("\n" + "="*60) + print("SUMMARY") + print("="*60) + if images: + print("Successfully built and pushed:") + for img in images: + print(f" ✓ {img}") + else: + print("No images were built successfully") + print("="*60) diff --git a/scripts/fastpull/clean.py b/scripts/fastpull/clean.py new file mode 100644 index 0000000..85a53f1 --- /dev/null +++ b/scripts/fastpull/clean.py @@ -0,0 
+1,181 @@ +""" +FastPull clean command - Remove local images and artifacts. +""" + +import argparse +import subprocess +import sys +from typing import List + + +def add_parser(subparsers): + """Add clean subcommand parser.""" + parser = subparsers.add_parser( + 'clean', + help='Remove local images and artifacts', + description='Clean up fastpull images and containers' + ) + + parser.add_argument( + '--images', + action='store_true', + help='Remove all fastpull images' + ) + parser.add_argument( + '--containers', + action='store_true', + help='Remove stopped containers' + ) + parser.add_argument( + '--all', + action='store_true', + help='Remove all images and containers' + ) + parser.add_argument( + '--snapshotter', + choices=['nydus', 'overlayfs', 'all'], + default='all', + help='Target specific snapshotter (default: all)' + ) + parser.add_argument( + '--dry-run', + action='store_true', + help='Show what would be removed without removing' + ) + parser.add_argument( + '--force', + action='store_true', + help='Force removal without confirmation' + ) + + parser.set_defaults(func=clean_command) + return parser + + +def clean_command(args): + """Execute the clean command.""" + # If no specific target, clean all + if not args.images and not args.containers and not args.all: + print("Please specify what to clean: --images, --containers, or --all") + sys.exit(1) + + if args.all: + args.images = True + args.containers = True + + # Determine which snapshotters to clean + snapshotters = ['nydus', 'overlayfs'] if args.snapshotter == 'all' else [args.snapshotter] + + # Clean containers first + if args.containers: + clean_containers(snapshotters, args.dry_run, args.force) + + # Clean images + if args.images: + clean_images(snapshotters, args.dry_run, args.force) + + +def clean_containers(snapshotters: List[str], dry_run: bool = False, force: bool = False): + """ + Remove stopped containers. 
+ + Args: + snapshotters: List of snapshotters to target + dry_run: If True, only show what would be removed + force: If True, skip confirmation + """ + print("\n=== Cleaning Containers ===") + + for snapshotter in snapshotters: + # Get all containers (including stopped ones) + result = subprocess.run( + ['sudo', 'nerdctl', '--snapshotter', snapshotter, 'ps', '-a', '-q'], + capture_output=True, + text=True + ) + + container_ids = result.stdout.strip().split('\n') if result.stdout.strip() else [] + + if not container_ids: + print(f"[{snapshotter}] No containers to clean") + continue + + print(f"[{snapshotter}] Found {len(container_ids)} container(s)") + + if dry_run: + print(f"[{snapshotter}] Would remove {len(container_ids)} container(s)") + for cid in container_ids: + print(f" - {cid}") + continue + + # Confirm removal + if not force: + response = input(f"Remove {len(container_ids)} container(s) for {snapshotter}? [y/N]: ") + if response.lower() != 'y': + print(f"[{snapshotter}] Skipped") + continue + + # Remove containers + for cid in container_ids: + subprocess.run( + ['sudo', 'nerdctl', '--snapshotter', snapshotter, 'rm', '-f', cid], + capture_output=True + ) + + print(f"[{snapshotter}] Removed {len(container_ids)} container(s)") + + +def clean_images(snapshotters: List[str], dry_run: bool = False, force: bool = False): + """ + Remove all images. 
+ + Args: + snapshotters: List of snapshotters to target + dry_run: If True, only show what would be removed + force: If True, skip confirmation + """ + print("\n=== Cleaning Images ===") + + for snapshotter in snapshotters: + # Get all images + result = subprocess.run( + ['sudo', 'nerdctl', '--snapshotter', snapshotter, 'images', '-q'], + capture_output=True, + text=True + ) + + image_ids = result.stdout.strip().split('\n') if result.stdout.strip() else [] + + if not image_ids: + print(f"[{snapshotter}] No images to clean") + continue + + print(f"[{snapshotter}] Found {len(image_ids)} image(s)") + + if dry_run: + print(f"[{snapshotter}] Would remove {len(image_ids)} image(s)") + # Show image details + result = subprocess.run( + ['sudo', 'nerdctl', '--snapshotter', snapshotter, 'images'], + capture_output=True, + text=True + ) + print(result.stdout) + continue + + # Confirm removal + if not force: + response = input(f"Remove {len(image_ids)} image(s) for {snapshotter}? [y/N]: ") + if response.lower() != 'y': + print(f"[{snapshotter}] Skipped") + continue + + # Remove images + subprocess.run( + ['sudo', 'nerdctl', '--snapshotter', snapshotter, 'rmi', '-f'] + image_ids, + capture_output=True + ) + + print(f"[{snapshotter}] Removed {len(image_ids)} image(s)") + + print("\n=== Cleanup Complete ===\n") diff --git a/scripts/fastpull/cli.py b/scripts/fastpull/cli.py new file mode 100644 index 0000000..1a2f4cb --- /dev/null +++ b/scripts/fastpull/cli.py @@ -0,0 +1,73 @@ +#!/usr/bin/env python3 +""" +FastPull - Accelerate AI/ML container startup with lazy-loading snapshotters. + +Main CLI entry point for the unified fastpull command. +""" + +import argparse +import sys + +from . 
import __version__, run, build, quickstart, clean + + +def main(): + """Main CLI entry point.""" + parser = argparse.ArgumentParser( + prog='fastpull', + description='FastPull - Accelerate AI/ML container startup with lazy-loading snapshotters', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Run container with benchmarking + fastpull run --snapshotter nydus --image myapp:latest-nydus \\ + --benchmark-mode readiness --readiness-endpoint http://localhost:8080/health -p 8080:8080 + + # Build and push Docker and Nydus images + fastpull build --dockerfile-path ./app --repository-url myapp:v1 --format docker,nydus + +For more information, visit: https://github.com/tensorfuse/fastpull + """ + ) + + parser.add_argument( + '--version', + action='version', + version=f'%(prog)s {__version__}' + ) + + # Create subparsers for commands + subparsers = parser.add_subparsers( + dest='command', + title='commands', + description='Available fastpull commands', + help='Command to execute' + ) + + # Add subcommands + run.add_parser(subparsers) + build.add_parser(subparsers) + quickstart.add_parser(subparsers) + clean.add_parser(subparsers) + + # Parse arguments + args = parser.parse_args() + + # If no command specified, print help + if not args.command: + parser.print_help() + sys.exit(1) + + # Execute the command + try: + args.func(args) + except KeyboardInterrupt: + print("\n\nInterrupted by user") + sys.exit(130) + except Exception as e: + print(f"Error: {e}") + sys.exit(1) + + +if __name__ == '__main__': + main() diff --git a/scripts/fastpull/common.py b/scripts/fastpull/common.py new file mode 100644 index 0000000..bb01a07 --- /dev/null +++ b/scripts/fastpull/common.py @@ -0,0 +1,139 @@ +""" +Common utilities for fastpull commands. + +Includes registry detection, authentication helpers, and shared functions.
+""" + +import re +import subprocess +from typing import Optional, Tuple + + +def detect_registry_type(image: str) -> str: + """ + Auto-detect registry type from image URL. + + Args: + image: Container image URL + + Returns: + Registry type: 'ecr', 'gar', 'dockerhub', or 'unknown' + """ + if 'dkr.ecr' in image or 'ecr.aws' in image: + return 'ecr' + elif 'pkg.dev' in image: + return 'gar' + elif 'docker.io' in image or '/' not in image or image.count('/') == 1: + return 'dockerhub' + return 'unknown' + + +def parse_ecr_url(image: str) -> Optional[Tuple[str, str, str]]: + """ + Parse ECR image URL to extract account, region, and repository. + + Args: + image: ECR image URL + + Returns: + Tuple of (account_id, region, repository) or None if invalid + """ + pattern = r'(\d+)\.dkr\.ecr\.([^.]+)\.amazonaws\.com/(.+)' + match = re.match(pattern, image) + if match: + return match.group(1), match.group(2), match.group(3) + return None + + +def parse_gar_url(image: str) -> Optional[Tuple[str, str, str]]: + """ + Parse GAR image URL to extract location, project, and repository. + + Args: + image: GAR image URL (e.g., us-central1-docker.pkg.dev/project/repo/image:tag) + + Returns: + Tuple of (location, project_id, repository) or None if invalid + """ + # Pattern: location-docker.pkg.dev/project/repository/image:tag + # Use .+? for location to handle hyphens (e.g., us-central1) + pattern = r'(.+?)-docker\.pkg\.dev/([^/]+)/([^/]+)' + match = re.match(pattern, image) + if match: + return match.group(1), match.group(2), match.group(3) + return None + + +def run_command(cmd: list, check: bool = True, capture_output: bool = True) -> subprocess.CompletedProcess: + """ + Run a shell command with consistent error handling. 
+
+    Args:
+        cmd: Command to run as list of strings
+        check: Raise exception on non-zero exit code
+        capture_output: Capture stdout/stderr
+
+    Returns:
+        CompletedProcess instance
+    """
+    return subprocess.run(
+        cmd,
+        check=check,
+        capture_output=capture_output,
+        text=True
+    )
+
+
+def get_snapshotter_binary(snapshotter: str) -> str:
+    """
+    Get the appropriate binary for the snapshotter.
+
+    Args:
+        snapshotter: Snapshotter type
+
+    Returns:
+        Binary name ('nerdctl' or 'docker')
+    """
+    # All snapshotters use nerdctl except for plain docker
+    if snapshotter in ['docker', 'overlayfs']:
+        return 'docker'
+    return 'nerdctl'
+
+
+def get_aws_account_id() -> Optional[str]:
+    """
+    Get AWS account ID from AWS CLI.
+
+    Returns:
+        Account ID or None if failed
+    """
+    try:
+        result = subprocess.run(
+            ['aws', 'sts', 'get-caller-identity', '--query', 'Account', '--output', 'text'],
+            check=True,
+            capture_output=True,
+            text=True
+        )
+        return result.stdout.strip()
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        return None
+
+
+def get_aws_region() -> Optional[str]:
+    """
+    Get AWS region from AWS CLI configuration.
+
+    Returns:
+        Region or None if failed
+    """
+    try:
+        result = subprocess.run(
+            ['aws', 'configure', 'get', 'region'],
+            check=True,
+            capture_output=True,
+            text=True
+        )
+        region = result.stdout.strip()
+        return region if region else None
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        return None
diff --git a/scripts/fastpull/quickstart.py b/scripts/fastpull/quickstart.py
new file mode 100644
index 0000000..1795b8e
--- /dev/null
+++ b/scripts/fastpull/quickstart.py
@@ -0,0 +1,81 @@
+"""
+FastPull quickstart command - Quick benchmarking comparisons.
+"""
+
+import argparse
+import subprocess
+import sys
+import os
+
+
+# Workload configurations: (name, base_image, endpoint)
+WORKLOADS = {
+    'tensorrt': ('TensorRT', 'tensorrt', '/health'),
+    'vllm': ('vLLM', 'vllm', '/health'),
+    'sglang': ('SGLang', 'sglang', '/health_generate'),
+}
+
+
+def add_parser(subparsers):
+    """Add quickstart subcommand parser."""
+    parser = subparsers.add_parser(
+        'quickstart',
+        help='Quick benchmark comparisons',
+        description='Run pre-configured benchmarks'
+    )
+
+    subparsers_qs = parser.add_subparsers(dest='workload', help='Workload to benchmark')
+
+    for workload in WORKLOADS:
+        wp = subparsers_qs.add_parser(workload, help=f'Benchmark {WORKLOADS[workload][0]} (nydus vs overlayfs)')
+        wp.add_argument('--output-dir', help='Directory to save results')
+        wp.set_defaults(func=run_quickstart)
+
+    parser.set_defaults(func=lambda args: parser.print_help() if not args.workload else None)
+    return parser
+
+
+def run_quickstart(args):
+    """Run benchmark comparison for a workload."""
+    name, image_name, endpoint = WORKLOADS[args.workload]
+
+    print(f"\n{'='*60}\n{name} Benchmark: FastPull vs Normal\n{'='*60}\n")
+
+    base = f"public.ecr.aws/s6z9f6e5/tensorfuse/fastpull/{image_name}:latest"
+
+    for mode in ['nydus', 'normal']:
+        print(f"\n[{mode.upper()}] Starting benchmark...")
+
+        # Use fastpull command directly (works when installed via pip)
+        cmd = [
+            'fastpull', 'run',
+            '--mode', mode,
+            '--benchmark-mode', 'readiness',
+            '--readiness-endpoint', f'http://localhost:8080{endpoint}',
+            '-p', '8080:8000',
+            '--gpus', 'all',
+            base  # Image as positional argument (tag suffix added automatically by run command)
+        ]
+
+        if args.output_dir:
+            os.makedirs(args.output_dir, exist_ok=True)
+            cmd.extend(['--output-json', f'{args.output_dir}/{image_name}-{mode}.json'])
+
+        try:
+            subprocess.run(cmd, check=True)
+        except (subprocess.CalledProcessError, KeyboardInterrupt):
+            sys.exit(1)
+
+    print(f"\n{'='*60}\nBenchmark complete!")
+    if args.output_dir:
+        print(f"Results: {args.output_dir}/")
+    print(f"{'='*60}\n")
+
+    # Auto cleanup after benchmarks complete
+    print("\nCleaning up containers and images...")
+    cleanup_cmd = ['fastpull', 'clean', '--all', '--force']
+    try:
+        subprocess.run(cleanup_cmd, check=False)  # Don't fail if cleanup has issues
+    except Exception as e:
+        print(f"Warning: Cleanup had issues: {e}")
+    print("Cleanup complete!\n")
diff --git a/scripts/fastpull/run.py b/scripts/fastpull/run.py
new file mode 100644
index 0000000..3cfb0cb
--- /dev/null
+++ b/scripts/fastpull/run.py
@@ -0,0 +1,325 @@
+"""
+FastPull run command - Run containers with specified snapshotters and benchmarking.
+"""
+
+import argparse
+import subprocess
+import sys
+import threading
+import time
+from typing import List, Optional
+
+from . import benchmark
+from . import common
+
+
+def add_parser(subparsers):
+    """Add run subcommand parser."""
+    parser = subparsers.add_parser(
+        'run',
+        help='Run container with specified snapshotter',
+        description='Run containers with Nydus or OverlayFS snapshotter'
+    )
+
+    # Mode selection (replaces --snapshotter)
+    parser.add_argument(
+        '--mode',
+        choices=['nydus', 'normal'],
+        default='nydus',
+        help='Run mode: nydus (default, adds -fastpull suffix) or normal (overlayfs, no suffix)'
+    )
+
+    # Benchmarking arguments
+    parser.add_argument(
+        '--benchmark-mode',
+        choices=['none', 'completion', 'readiness'],
+        default='none',
+        help='Benchmarking mode (default: none)'
+    )
+    parser.add_argument(
+        '--readiness-endpoint',
+        help='HTTP endpoint to poll for readiness (required if benchmark-mode=readiness)'
+    )
+    parser.add_argument(
+        '--output-json',
+        help='Export benchmark metrics to JSON file'
+    )
+
+    # Common container flags
+    parser.add_argument('--name', help='Container name')
+    parser.add_argument('-p', '--publish', action='append', help='Publish ports (can be used multiple times)')
+    parser.add_argument('-e', '--env', action='append', help='Set environment variables')
+    parser.add_argument('-v', '--volume', action='append', help='Bind mount volumes')
+    parser.add_argument('--gpus', help='GPU devices to use (e.g., "all")')
+    parser.add_argument('--rm', action='store_true', help='Automatically remove container when it exits')
+    parser.add_argument('-d', '--detach', action='store_true', help='Run container in background')
+
+    # Image as positional argument (like docker/nerdctl run)
+    parser.add_argument(
+        'image',
+        help='Container image to run'
+    )
+
+    # Pass-through for additional nerdctl flags (optional trailing args)
+    parser.add_argument(
+        'nerdctl_args',
+        nargs='*',
+        help='Additional arguments to pass to nerdctl/docker (e.g., command to run in container)'
+    )
+
+    parser.set_defaults(func=run_command)
+    return parser
+
+
+def run_command(args):
+    """Execute the run command."""
+    # Validate benchmark mode
+    if args.benchmark_mode == 'readiness' and not args.readiness_endpoint:
+        print("Error: --readiness-endpoint is required when --benchmark-mode=readiness")
+        sys.exit(1)
+
+    # Determine snapshotter and modify image tag based on mode
+    if args.mode == 'nydus':
+        args.snapshotter = 'nydus'
+        # Add -fastpull suffix to image tag if not already present
+        if ':' in args.image:
+            base, tag = args.image.rsplit(':', 1)
+            if not tag.endswith('-fastpull'):
+                args.image = f"{base}:{tag}-fastpull"
+        else:
+            args.image = f"{args.image}:latest-fastpull"
+    else:  # normal mode
+        # Use image as-is for normal mode
+        args.snapshotter = 'overlayfs'
+
+    # Build the nerdctl/docker command
+    cmd = build_run_command(args)
+
+    print(f"Running container with {args.snapshotter} snapshotter...")
+    print(f"Image: {args.image}")
+    print(f"Command: {' '.join(cmd)}\n")
+
+    # For benchmarking, we need to track the container
+    if args.benchmark_mode != 'none':
+        run_with_benchmark(cmd, args)
+    else:
+        run_without_benchmark(cmd)
+
+
+def build_run_command(args) -> List[str]:
+    """
+    Build the nerdctl/docker run command from arguments.
+
+    Args:
+        args: Parsed command-line arguments
+
+    Returns:
+        Command as list of strings
+    """
+    # Determine binary (use sudo)
+    if args.snapshotter == 'overlayfs':
+        cmd = ['sudo', 'nerdctl', '--snapshotter', 'overlayfs', 'run']
+    else:
+        cmd = ['sudo', 'nerdctl', '--snapshotter', args.snapshotter, 'run']
+
+    # Add common flags
+    if args.name:
+        cmd.extend(['--name', args.name])
+
+    if args.rm:
+        cmd.append('--rm')
+
+    if args.detach:
+        cmd.append('-d')
+
+    # Add ports
+    if args.publish:
+        for port in args.publish:
+            cmd.extend(['-p', port])
+
+    # Add environment variables
+    if args.env:
+        for env in args.env:
+            cmd.extend(['-e', env])
+
+    # Add volumes
+    if args.volume:
+        for vol in args.volume:
+            cmd.extend(['-v', vol])
+
+    # Add GPU support
+    if args.gpus:
+        cmd.extend(['--gpus', args.gpus])
+
+    # Add image (all flags must come before it)
+    cmd.append(args.image)
+
+    # Additional pass-through arguments (e.g., the command to run in the
+    # container) come after the image, matching docker/nerdctl run semantics
+    if args.nerdctl_args:
+        cmd.extend(args.nerdctl_args)
+
+    return cmd
+
+
+def run_without_benchmark(cmd: List[str]):
+    """
+    Run container without benchmarking.
+
+    Args:
+        cmd: Command to execute
+    """
+    try:
+        subprocess.run(cmd, check=True)
+    except subprocess.CalledProcessError as e:
+        print(f"Error running container: {e}")
+        sys.exit(1)
+
+
+def run_with_benchmark(cmd: List[str], args):
+    """
+    Run container with benchmarking enabled.
+
+    Args:
+        cmd: Command to execute
+        args: Parsed arguments
+    """
+    # Force detached mode for benchmarking
+    if '-d' not in cmd and '--detach' not in cmd:
+        cmd.insert(cmd.index('run') + 1, '-d')
+
+    # Initialize benchmark tracker early (before starting container)
+    # We'll set container_id later, but we need to start event monitoring first
+    bench = benchmark.ContainerBenchmark(
+        container_id='',  # Will be set after container starts
+        benchmark_mode=args.benchmark_mode,
+        readiness_endpoint=args.readiness_endpoint,
+        mode=args.mode
+    )
+
+    # Start event monitoring BEFORE starting the container
+    print("Starting containerd events monitoring...")
+    bench.start_event_monitoring()
+
+    # Small delay to ensure event monitoring is ready
+    time.sleep(0.5)
+
+    # Start the container
+    try:
+        print("Running container...")
+        result = subprocess.run(
+            cmd,
+            check=True,
+            capture_output=True,
+            text=True
+        )
+        container_id = result.stdout.strip()
+
+        if not container_id:
+            print("Error: Failed to get container ID")
+            sys.exit(1)
+
+        print(f"Container started: {container_id[:12]}")
+
+        # Update benchmark tracker with container ID
+        bench.container_id = container_id
+
+    except subprocess.CalledProcessError as e:
+        print(f"Error starting container: {e}")
+        if e.stderr:
+            print(f"stderr: {e.stderr}")
+        sys.exit(1)
+
+    # Start monitoring logs in background
+    print("Monitoring container logs...")
+    stop_logs_event = threading.Event()
+    log_thread = start_log_monitoring(container_id, args.snapshotter, bench.start_time, stop_logs_event)
+
+    # Wait for completion or readiness
+    try:
+        if args.benchmark_mode == 'completion':
+            success = bench.wait_for_completion()
+        elif args.benchmark_mode == 'readiness':
+            success = bench.wait_for_readiness()
+        else:
+            success = True
+
+        # Stop log monitoring after benchmark completes
+        stop_logs_event.set()
+
+        if not success:
+            print("Benchmark failed (timeout)")
+            # Cleanup on failure
+            cleanup_container(container_id, args.snapshotter)
+            sys.exit(1)
+
+        # Print summary
+        bench.print_summary()
+
+        # Export JSON if requested
+        if args.output_json:
+            bench.export_json(args.output_json)
+
+        # Cleanup container after successful benchmark
+        print("\nBenchmark complete, cleaning up container...")
+        cleanup_container(container_id, args.snapshotter)
+
+    except KeyboardInterrupt:
+        print("\nInterrupted by user")
+        # Stop and remove container
+        cleanup_container(container_id, args.snapshotter)
+        sys.exit(1)
+
+
+def start_log_monitoring(container_id: str, snapshotter: str, start_time: float, stop_event: threading.Event) -> threading.Thread:
+    """
+    Start monitoring container logs in background thread.
+
+    Args:
+        container_id: Container ID
+        snapshotter: Snapshotter type
+        start_time: Benchmark start time
+        stop_event: Event to signal when to stop monitoring
+
+    Returns:
+        Log monitoring thread
+    """
+    def log_reader():
+        try:
+            cmd = ['sudo', 'nerdctl', 'logs', '-f', container_id]
+
+            process = subprocess.Popen(
+                cmd,
+                stdout=subprocess.PIPE,
+                stderr=subprocess.STDOUT,
+                text=True,
+                bufsize=1,
+                universal_newlines=True
+            )
+
+            for line in process.stdout:
+                if stop_event.is_set():
+                    process.terminate()
+                    break
+                if line:
+                    elapsed = time.time() - start_time
+                    print(f"[{elapsed:.3f}s] {line.rstrip()}")
+
+        except Exception:
+            pass  # Silently handle errors (container might be stopped)
+
+    thread = threading.Thread(target=log_reader, daemon=True)
+    thread.start()
+    return thread
+
+
+def cleanup_container(container_id: str, snapshotter: str):
+    """
+    Stop and remove container.
+
+    Args:
+        container_id: Container ID
+        snapshotter: Snapshotter type
+    """
+    print(f"Cleaning up container {container_id[:12]}...")
+    subprocess.run(['sudo', 'nerdctl', 'stop', container_id], capture_output=True)
+    subprocess.run(['sudo', 'nerdctl', 'rm', container_id], capture_output=True)
diff --git a/scripts/install_snapshotters.py b/scripts/install_snapshotters.py
deleted file mode 100755
index ec959b7..0000000
--- a/scripts/install_snapshotters.py
+++ /dev/null
@@ -1,523 +0,0 @@
-#!/usr/bin/env python3
-"""
-Container Snapshotter Installation Script
-
-This script installs and configures multiple container snapshotters:
-- Nydus: Efficient container image storage with lazy loading
-- SOCI (Seekable OCI): AWS-developed snapshotter for faster container startup
-- StarGZ: Google-developed snapshotter with eStargz format support
-
-The script also installs supporting tools like nerdctl and CNI plugins,
-configures systemd services, and sets up containerd integration.
-
-Requirements:
-- Must be run as root
-- Linux system with systemd
-- Internet access for downloading binaries
-"""
-
-import os
-import sys
-import subprocess
-import shutil
-import tempfile
-from pathlib import Path
-
-# Configuration constants for component versions
-NYDUS_VERSION = "2.3.6"
-NYDUS_SNAPSHOTTER_VERSION = "0.15.3"
-NERDCTL_VERSION = "2.1.4"
-CNI_VERSION = "v1.8.0"
-SOCI_VERSION = "0.11.1"
-STARGZ_VERSION = "0.17.0"
-
-def run_command(cmd, check=True, shell=False):
-    """
-    Execute a shell command with error handling.
-
-    Args:
-        cmd: Command to execute (list or string)
-        check: Whether to raise exception on non-zero exit code
-        shell: Whether to use shell execution
-
-    Returns:
-        subprocess.CompletedProcess: Command execution result
-    """
-    if shell:
-        result = subprocess.run(cmd, shell=True, check=check, capture_output=True, text=True)
-    else:
-        result = subprocess.run(cmd, check=check, capture_output=True, text=True)
-    return result
-
-def check_root():
-    """
-    Verify that the script is running with root privileges.
-    Exits with error code 1 if not running as root.
-    """
-    if os.geteuid() != 0:
-        print("This script must be run as root")
-        sys.exit(1)
-
-def download_and_extract(url, extract_to=None):
-    """
-    Download and extract a tar.gz archive from a URL.
-
-    Args:
-        url: URL to download the archive from
-        extract_to: Optional directory to extract to (current dir if None)
-
-    Returns:
-        str: Filename of the downloaded archive
-    """
-    filename = url.split('/')[-1]
-
-    # Download the archive
-    print(f"  Downloading {filename}...")
-    run_command(['wget', url])
-
-    # Extract the archive
-    print(f"  Extracting {filename}...")
-    if extract_to:
-        run_command(['tar', '-xzf', filename, '-C', extract_to])
-    else:
-        run_command(['tar', '-xzf', filename])
-
-    # Clean up the downloaded archive
-    os.remove(filename)
-    return filename
-
-def install_nydus():
-    """
-    Install Nydus container image acceleration toolkit.
-
-    Nydus provides lazy loading capabilities for container images,
-    reducing startup time and bandwidth usage.
-    """
-    print("------------------ Installing Nydus -------------------------------")
-    print(f"Installing Nydus v{NYDUS_VERSION}...")
-
-    # Download and extract Nydus static binaries
-    url = f"https://github.com/dragonflyoss/nydus/releases/download/v{NYDUS_VERSION}/nydus-static-v{NYDUS_VERSION}-linux-amd64.tgz"
-    download_and_extract(url)
-
-    # Install binaries to system path
-    print("  Installing Nydus binaries...")
-    nydus_binaries = list(Path('nydus-static').glob('*'))
-    run_command(['cp', '-r'] + [str(b) for b in nydus_binaries] + ['/usr/local/bin/'])
-
-    # Make binaries executable
-    nydus_installed = list(Path('/usr/local/bin').glob('nydus*'))
-    run_command(['chmod', '+x'] + [str(p) for p in nydus_installed])
-
-    # Clean up temporary files
-    shutil.rmtree('nydus-static', ignore_errors=True)
-
-def install_nydus_snapshotter():
-    """
-    Install Nydus Snapshotter for containerd integration.
-
-    This component bridges Nydus with containerd, enabling
-    container runtime to use Nydus-optimized images.
-    """
-    print(f"Installing Nydus Snapshotter v{NYDUS_SNAPSHOTTER_VERSION}...")
-
-    # Download Nydus Snapshotter
-    url = f"https://github.com/containerd/nydus-snapshotter/releases/download/v{NYDUS_SNAPSHOTTER_VERSION}/nydus-snapshotter-v{NYDUS_SNAPSHOTTER_VERSION}-linux-amd64.tar.gz"
-    download_and_extract(url)
-
-    # Install the containerd-nydus-grpc binary
-    print("  Installing Nydus Snapshotter binary...")
-    run_command(['cp', 'bin/containerd-nydus-grpc', '/usr/local/bin/'])
-    run_command(['chmod', '+x', '/usr/local/bin/containerd-nydus-grpc'])
-
-    # Clean up temporary files
-    shutil.rmtree('bin', ignore_errors=True)
-
-def install_nerdctl():
-    """
-    Install nerdctl - containerd-compatible Docker CLI.
-
-    nerdctl provides a Docker-compatible command line interface
-    for containerd, enabling easy container management.
-    """
-    print(f"Installing nerdctl v{NERDCTL_VERSION}...")
-
-    # Download nerdctl
-    url = f"https://github.com/containerd/nerdctl/releases/download/v{NERDCTL_VERSION}/nerdctl-{NERDCTL_VERSION}-linux-amd64.tar.gz"
-    download_and_extract(url)
-
-    # Install nerdctl binary
-    print("  Installing nerdctl binary...")
-    run_command(['cp', 'nerdctl', '/usr/local/bin/'])
-
-    # Clean up temporary files
-    os.remove('nerdctl')
-
-def install_cni_plugins():
-    """
-    Install Container Network Interface (CNI) plugins.
-
-    CNI plugins provide networking capabilities for containers,
-    enabling network isolation and communication.
-    """
-    print("Installing CNI plugins...")
-
-    # Create CNI plugin directory
-    print("  Creating CNI plugin directory...")
-    os.makedirs('/opt/cni/bin', exist_ok=True)
-
-    # Download and install CNI plugins
-    url = f"https://github.com/containernetworking/plugins/releases/download/{CNI_VERSION}/cni-plugins-linux-amd64-{CNI_VERSION}.tgz"
-    filename = url.split('/')[-1]
-
-    print(f"  Downloading CNI plugins {CNI_VERSION}...")
-    run_command(['wget', url])
-
-    print("  Installing CNI plugins...")
-    run_command(['tar', '-xzf', filename, '-C', '/opt/cni/bin'])
-    os.remove(filename)
-
-def test_nydus_installation():
-    """
-    Verify that Nydus components are properly installed.
-
-    Tests the installation by checking version information
-    for core Nydus tools.
-    """
-    print("Testing Nydus installation...")
-
-    # List of Nydus tools to test
-    commands = [
-        ['nydus-image', '--version'],  # Image conversion tool
-        ['nydusd', '--version'],       # Nydus daemon
-        ['nydusify', '--version']      # Image format converter
-    ]
-
-    # Test each tool and report any failures
-    for cmd in commands:
-        try:
-            result = run_command(cmd)
-            print(f"  ✓ {cmd[0]} is working")
-        except subprocess.CalledProcessError as e:
-            print(f"  ✗ Warning: {' '.join(cmd)} failed: {e}")
-
-def configure_nydus_snapshotter():
-    """
-    Create configuration files for Nydus Snapshotter.
-
-    Sets up the nydusd daemon configuration with optimized
-    settings for registry backend and filesystem prefetching.
-    """
-    print("=== Nydus Snapshotter Configuration Deployment ===")
-
-    # Create Nydus configuration directory
-    print("  Creating Nydus configuration directory...")
-    os.makedirs('/etc/nydus', exist_ok=True)
-
-    # Nydus daemon configuration for FUSE mode
-    config_content = """{
-  "device": {
-    "backend": {
-      "type": "registry",
-      "config": {
-        "timeout": 5,
-        "connect_timeout": 5,
-        "retry_limit": 2
-      }
-    },
-    "cache": {
-      "type": "blobcache"
-    }
-  },
-  "mode": "direct",
-  "digest_validate": false,
-  "iostats_files": false,
-  "enable_xattr": true,
-  "amplify_io": 1048576,
-  "fs_prefetch": {
-    "enable": true,
-    "threads_count": 64,
-    "merging_size": 1048576,
-    "prefetch_all": true
-  }
-}"""
-
-    # Write configuration file
-    print("  Writing Nydus daemon configuration...")
-    with open('/etc/nydus/nydusd-config.fusedev.json', 'w') as f:
-        f.write(config_content)
-
-def install_soci():
-    """
-    Install SOCI (Seekable OCI) snapshotter.
-
-    SOCI is AWS's container image format that enables
-    faster container startup through lazy loading.
-    """
-    print("------------------ Installing Soci -------------------------------")
-    print(f"Installing SOCI v{SOCI_VERSION}...")
-
-    # Download SOCI snapshotter
-    url = f"https://github.com/awslabs/soci-snapshotter/releases/download/v{SOCI_VERSION}/soci-snapshotter-{SOCI_VERSION}-linux-amd64.tar.gz"
-    filename = url.split('/')[-1]
-
-    print("  Downloading SOCI snapshotter...")
-    run_command(['wget', url])
-
-    # Extract specific binaries directly to system path
-    print("  Installing SOCI binaries...")
-    run_command(['tar', '-C', '/usr/local/bin', '-xvf', filename, 'soci', 'soci-snapshotter-grpc'])
-    os.remove(filename)
-
-def install_stargz():
-    """
-    Install StarGZ snapshotter.
-
-    StarGZ (Stargz/eStargz) is Google's container image format
-    that provides lazy loading capabilities similar to Nydus.
-    """
-    print("------------------ Installing (e)StarGZ -------------------------------")
-    print(f"Installing StarGZ v{STARGZ_VERSION}...")
-
-    # Download StarGZ snapshotter
-    url = f"https://github.com/containerd/stargz-snapshotter/releases/download/v{STARGZ_VERSION}/stargz-snapshotter-v{STARGZ_VERSION}-linux-amd64.tar.gz"
-    filename = url.split('/')[-1]
-
-    print("  Downloading StarGZ snapshotter...")
-    run_command(['wget', url])
-
-    # Extract specific binaries directly to system path
-    print("  Installing StarGZ binaries...")
-    run_command(['tar', '-C', '/usr/local/bin', '-xvf', filename, 'containerd-stargz-grpc', 'ctr-remote'])
-    os.remove(filename)
-
-def setup_systemd_services(snapshotters):
-    """
-    Create and start systemd services for specified snapshotters.
-
-    Creates service files for each snapshotter daemon and starts them.
-    This enables automatic startup and management via systemctl.
-
-    Args:
-        snapshotters: List of snapshotters to set up ('nydus', 'soci', 'stargz')
-    """
-    print("------------------ Setting up Snapshotter Services -------------------------------")
-
-    services_to_start = []
-
-    if 'nydus' in snapshotters:
-        # Nydus Snapshotter service configuration
-        print("  Creating Nydus Snapshotter service...")
-        nydus_service = """[Unit]
-Description=nydus snapshotter (fuse mode)
-After=network.target
-
-[Service]
-Type=simple
-ExecStart=/usr/local/bin/containerd-nydus-grpc --nydusd-config /etc/nydus/nydusd-config.fusedev.json
-Restart=always
-StandardOutput=journal
-StandardError=journal
-
-[Install]
-WantedBy=multi-user.target
-"""
-
-        with open('/etc/systemd/system/nydus-snapshotter-fuse.service', 'w') as f:
-            f.write(nydus_service)
-        services_to_start.append('nydus-snapshotter-fuse.service')
-
-    if 'soci' in snapshotters:
-        # SOCI Snapshotter service configuration
-        print("  Creating SOCI Snapshotter service...")
-        soci_service = """[Unit]
-Description=SOCI Snapshotter GRPC daemon
-After=network.target
-
-[Service]
-Type=simple
-ExecStart=/usr/local/bin/soci-snapshotter-grpc
-Restart=on-failure
-
-[Install]
-WantedBy=multi-user.target
-"""
-
-        with open('/etc/systemd/system/soci-snapshotter-grpc.service', 'w') as f:
-            f.write(soci_service)
-        services_to_start.append('soci-snapshotter-grpc.service')
-
-    if 'stargz' in snapshotters:
-        # StarGZ Snapshotter service configuration
-        print("  Creating StarGZ Snapshotter service...")
-        stargz_service = """[Unit]
-Description=Stargz Snapshotter daemon
-After=network.target
-
-[Service]
-Type=simple
-ExecStart=/usr/local/bin/containerd-stargz-grpc
-Restart=on-failure
-
-[Install]
-WantedBy=multi-user.target
-"""
-
-        with open('/etc/systemd/system/stargz-snapshotter.service', 'w') as f:
-            f.write(stargz_service)
-        services_to_start.append('stargz-snapshotter.service')
-
-    # Start all snapshotter services
-    if services_to_start:
-        print("  Starting snapshotter services...")
-        for service in services_to_start:
-            print(f"    Starting {service}...")
-            run_command(['systemctl', 'start', service])
-
-def setup_containerd(snapshotters):
-    """
-    Configure containerd to use the installed snapshotters.
-
-    Creates containerd configuration that registers specified
-    snapshotters as proxy plugins, then restarts containerd.
-
-    Args:
-        snapshotters: List of snapshotters to configure ('nydus', 'soci', 'stargz')
-    """
-    print("------------------ Setting up Containerd -------------------------------")
-
-    # Ensure containerd configuration directory exists
-    print("  Creating containerd configuration directory...")
-    os.makedirs('/etc/containerd', exist_ok=True)
-
-    # Build containerd configuration with proxy plugins for specified snapshotters
-    containerd_config = "version = 2\n\n[proxy_plugins]\n"
-
-    if 'soci' in snapshotters:
-        containerd_config += """  [proxy_plugins.soci]
-    type = "snapshot"
-    address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"
-"""
-
-    if 'nydus' in snapshotters:
-        containerd_config += """  [proxy_plugins.nydus]
-    type = "snapshot"
-    address = "/run/containerd-nydus/containerd-nydus-grpc.sock"
-"""
-
-    if 'stargz' in snapshotters:
-        containerd_config += """  [proxy_plugins.stargz]
-    type = "snapshot"
-    address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"
-    [proxy_plugins.stargz.exports]
-      root = "/var/lib/containerd-stargz-grpc/"
-"""
-
-    # Write containerd configuration
-    print("  Writing containerd configuration...")
-    with open('/etc/containerd/config.toml', 'w') as f:
-        f.write(containerd_config)
-
-    # Restart containerd to apply new configuration
-    print("  Restarting containerd service...")
-    run_command(['systemctl', 'restart', 'containerd'])
-
-def main():
-    """
-    Main installation orchestrator.
-
-    Performs the complete installation sequence:
-    1. Verify root privileges
-    2. Install specified snapshotter components and dependencies
-    3. Configure services and containerd integration
-    4. Start all services
-
-    Uses a temporary directory for downloads to avoid cluttering
-    the current working directory.
-    """
-    import argparse
-
-    # Parse command line arguments
-    parser = argparse.ArgumentParser(
-        description="Install container snapshotters for lazy-loading container images.",
-        formatter_class=argparse.RawDescriptionHelpFormatter,
-        epilog="""
-Examples:
-  # Install only Nydus (default)
-  sudo python3 install_snapshotters.py
-
-  # Install all snapshotters
-  sudo python3 install_snapshotters.py --snapshotters nydus,soci,stargz
-
-  # Install Nydus and SOCI
-  sudo python3 install_snapshotters.py --snapshotters nydus,soci
-        """)
-
-    parser.add_argument(
-        "--snapshotters",
-        default="nydus",
-        help="Comma-separated list of snapshotters to install (nydus,soci,stargz). Default: nydus"
-    )
-
-    args = parser.parse_args()
-
-    # Parse and validate snapshotters
-    requested_snapshotters = [s.strip() for s in args.snapshotters.split(",")]
-    valid_snapshotters = {"nydus", "soci", "stargz"}
-    invalid_snapshotters = set(requested_snapshotters) - valid_snapshotters
-
-    if invalid_snapshotters:
-        print(f"Error: Invalid snapshotters: {invalid_snapshotters}")
-        print(f"Valid options: {valid_snapshotters}")
-        sys.exit(1)
-
-    # Ensure script is run with root privileges
-    check_root()
-
-    snapshotter_names = ", ".join(requested_snapshotters)
-    print("Starting container snapshotter installation...")
-    print(f"Installing: {snapshotter_names}, nerdctl, and CNI plugins")
-    print()
-
-    # Use temporary directory for all downloads and extraction
-    with tempfile.TemporaryDirectory() as tmpdir:
-        original_dir = os.getcwd()
-        os.chdir(tmpdir)
-
-        try:
-            # Install core container runtime tools first
-            install_nerdctl()
-            install_cni_plugins()
-
-            # Install Nydus components if requested
-            if 'nydus' in requested_snapshotters:
-                install_nydus()
-                install_nydus_snapshotter()
-                test_nydus_installation()
-                configure_nydus_snapshotter()
-
-            # Install SOCI if requested
-            if 'soci' in requested_snapshotters:
-                install_soci()
-
-            # Install StarGZ if requested
-            if 'stargz' in requested_snapshotters:
-                install_stargz()
-
-            # Set up system integration for installed snapshotters
-            setup_systemd_services(requested_snapshotters)
-            setup_containerd(requested_snapshotters)
-
-        finally:
-            # Return to original directory
-            os.chdir(original_dir)
-
-    print()
-    print("------------------ INSTALLATION COMPLETE -------------------")
-    print(f"Installed snapshotters: {snapshotter_names}")
-    print("You can now use nerdctl with --snapshotter flag to specify:")
-    for snapshotter in requested_snapshotters:
-        print(f"  --snapshotter={snapshotter}")
-
-if __name__ == "__main__":
-    main()
diff --git a/scripts/setup.py b/scripts/setup.py
new file mode 100755
index 0000000..d1c31fd
--- /dev/null
+++ b/scripts/setup.py
@@ -0,0 +1,550 @@
+#!/usr/bin/env python3
+"""
+FastPull Setup Script
+
+Installs containerd, Nydus snapshotter, and FastPull CLI via pip.
+"""
+
+import argparse
+import os
+import subprocess
+import sys
+
+
+SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
+PROJECT_ROOT = os.path.dirname(SCRIPT_DIR)
+VENV_PATH = os.path.join(PROJECT_ROOT, '.venv')
+FASTPULL_BIN = '/usr/local/bin/fastpull'
+
+
+def run_command(cmd, check=True, capture_output=False, shell=False):
+    """Run a command and return result."""
+    try:
+        if shell:
+            result = subprocess.run(cmd, shell=True, check=check, capture_output=capture_output, text=True)
+        else:
+            result = subprocess.run(cmd, check=check, capture_output=capture_output, text=True)
+        return result
+    except subprocess.CalledProcessError as e:
+        if not check:
+            return e
+        raise
+
+
+def detect_package_manager():
+    """Detect the system package manager."""
+    # Check for apt (Debian/Ubuntu)
+    if os.path.exists('/usr/bin/apt-get') or os.path.exists('/usr/bin/apt'):
+        return 'apt'
+    # Check for yum (RHEL/CentOS 7)
+    elif os.path.exists('/usr/bin/yum'):
+        return 'yum'
+    # Check for dnf (RHEL/CentOS 8+/Fedora)
+    elif os.path.exists('/usr/bin/dnf'):
+        return 'dnf'
+    else:
+        return None
+
+
+def install_system_dependencies():
+    """Install required system packages (python3-venv, wget)."""
+    pkg_mgr = detect_package_manager()
+
+    if not pkg_mgr:
+        print("⚠ Warning: Could not detect package manager (apt/yum/dnf)")
+        print("Please manually install: python3-venv, wget")
+        return False
+
+    print(f"Detected package manager: {pkg_mgr}")
+    print("Installing system dependencies (python3-venv, wget)...")
+
+    try:
+        if pkg_mgr == 'apt':
+            # Update package list and install dependencies
+            run_command(['apt-get', 'update', '-qq'], check=True)
+            run_command(['apt-get', 'install', '-y', 'python3-venv', 'wget'], check=True)
+        elif pkg_mgr == 'yum':
+            run_command(['yum', 'install', '-y', 'python3-venv', 'wget'], check=True)
+        elif pkg_mgr == 'dnf':
+            run_command(['dnf', 'install', '-y', 'python3-venv', 'wget'], check=True)
+
+        print("✓ System dependencies installed")
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"✗ Failed to install system dependencies: {e}")
+        return False
+
+
+def check_root():
+    """Check if running as root."""
+    if os.geteuid() != 0:
+        print("Error: This script must be run as root (use sudo)")
+        sys.exit(1)
+
+
+def install_containerd_nerdctl():
+    """Install containerd and nerdctl."""
+    print("\n" + "="*60)
+    print("Installing Containerd & Nerdctl")
+    print("="*60)
+
+    # Check if already installed
+    nerdctl_path = "/usr/local/bin/nerdctl"
+    if os.path.exists(nerdctl_path):
+        print(f"✓ nerdctl already installed at {nerdctl_path}")
+        result = run_command([nerdctl_path, "--version"], capture_output=True)
+        print(f"  {result.stdout.strip()}")
+        return True
+
+    print("\nInstalling containerd and nerdctl...")
+
+    install_script = """
+set -e
+
+cd /tmp
+
+# Remove old download if exists
+rm -f /tmp/nerdctl-full.tar.gz
+
+# Download nerdctl-full
+NERDCTL_VERSION="1.7.3"
+echo "Downloading nerdctl-full ${NERDCTL_VERSION}..."
+wget -O /tmp/nerdctl-full.tar.gz https://github.com/containerd/nerdctl/releases/download/v${NERDCTL_VERSION}/nerdctl-full-${NERDCTL_VERSION}-linux-amd64.tar.gz + +# Extract to /usr/local +echo "Extracting to /usr/local..." +tar -C /usr/local -xzf /tmp/nerdctl-full.tar.gz + +# Enable and start containerd service +echo "Enabling containerd service..." +systemctl enable containerd +systemctl start containerd + +# Clean up +rm -f /tmp/nerdctl-full.tar.gz + +echo "✓ Containerd and nerdctl installed" +""" + + try: + result = run_command(install_script, shell=True, capture_output=True) + print("✓ Containerd and nerdctl installed successfully") + return True + except subprocess.CalledProcessError as e: + print(f"✗ Failed to install containerd: {e}") + if e.stdout: + print(f"stdout: {e.stdout}") + if e.stderr: + print(f"stderr: {e.stderr}") + return False + + +def install_nydus(): + """Install Nydus snapshotter.""" + print("\n" + "="*60) + print("Installing Nydus Snapshotter") + print("="*60) + + nydus_path = "/usr/local/bin/containerd-nydus-grpc" + service_path = "/etc/systemd/system/fastpull.service" + + # Check if binary exists + if os.path.exists(nydus_path): + print(f"✓ Nydus binary found at {nydus_path}") + # Always recreate service and config (to ensure latest settings) + print("Updating service and configuration...") + create_nydus_service() + return True + + install_script = """ +set -e + +NYDUS_SNAPSHOTTER_VERSION="0.15.3" +echo "Downloading Nydus Snapshotter v${NYDUS_SNAPSHOTTER_VERSION}..." 
+ +# Download Nydus Snapshotter +cd /tmp +wget https://github.com/containerd/nydus-snapshotter/releases/download/v${NYDUS_SNAPSHOTTER_VERSION}/nydus-snapshotter-v${NYDUS_SNAPSHOTTER_VERSION}-linux-amd64.tar.gz + +# Extract and install +tar -xzf nydus-snapshotter-v${NYDUS_SNAPSHOTTER_VERSION}-linux-amd64.tar.gz +cp bin/containerd-nydus-grpc /usr/local/bin/ +chmod +x /usr/local/bin/containerd-nydus-grpc + +# Also install nydusd (required by snapshotter) +NYDUS_VERSION="v2.3.6" +echo "Downloading Nydus tools ${NYDUS_VERSION}..." +wget -O nydus.tgz https://github.com/dragonflyoss/nydus/releases/download/${NYDUS_VERSION}/nydus-static-${NYDUS_VERSION}-linux-amd64.tgz +tar xzf nydus.tgz +cp nydus-static/nydusd /usr/local/bin/ +cp nydus-static/nydus-image /usr/local/bin/ +cp nydus-static/nydusify /usr/local/bin/ +chmod +x /usr/local/bin/nydusd /usr/local/bin/nydus-image /usr/local/bin/nydusify + +# Clean up +rm -rf bin nydus-snapshotter-v${NYDUS_SNAPSHOTTER_VERSION}-linux-amd64.tar.gz nydus-static nydus.tgz + +echo "✓ Nydus binaries installed" +""" + + try: + result = run_command(install_script, shell=True, capture_output=True) + print("✓ Nydus binaries installed successfully") + + # Now create the service (shared code) + create_nydus_service() + return True + except subprocess.CalledProcessError as e: + print(f"✗ Failed to install Nydus: {e}") + if e.stderr: + print(f"stderr: {e.stderr}") + return False + + +def create_nydus_service(): + """Create systemd service for Nydus snapshotter.""" + service_script = """ +# Create systemd service +cat > /etc/systemd/system/fastpull.service <<'EOF' +[Unit] +Description=nydus snapshotter (fuse mode) +After=network.target + +[Service] +Type=simple +ExecStart=/usr/local/bin/containerd-nydus-grpc --nydusd-config /etc/nydus/nydusd-config.fusedev.json +Restart=always +StandardOutput=journal +StandardError=journal + +[Install] +WantedBy=multi-user.target +EOF + +# Create necessary directories +mkdir -p /etc/nydus +mkdir -p 
/var/lib/nydus/cache + +# Create Nydus config if it doesn't exist +if [ ! -f /etc/nydus/nydusd-config.fusedev.json ]; then +cat > /etc/nydus/nydusd-config.fusedev.json <<'EOF' +{ + "device": { + "backend": { + "type": "registry", + "config": { + "timeout": 5, + "connect_timeout": 5, + "retry_limit": 2 + } + }, + "cache": { + "type": "blobcache" + } + }, + "mode": "direct", + "digest_validate": false, + "iostats_files": false, + "enable_xattr": true, + "amplify_io": 10485760, + "fs_prefetch": { + "enable": true, + "threads_count": 16, + "merging_size": 1048576, + "prefetch_all": true + } +} +EOF +fi + +# Enable and start service +systemctl daemon-reload +systemctl enable fastpull.service +systemctl start fastpull.service + +echo "✓ Nydus service created and started" +""" + + try: + run_command(service_script, shell=True, capture_output=True) + print("✓ Created and started fastpull.service") + return True + except subprocess.CalledProcessError as e: + print(f"✗ Failed to create service: {e}") + return False + + +def configure_containerd_for_nydus(): + """Configure containerd to use Nydus snapshotter.""" + print("\nConfiguring containerd for Nydus...") + + config_dir = "/etc/containerd" + config_file = os.path.join(config_dir, "config.toml") + + os.makedirs(config_dir, exist_ok=True) + + # Create containerd config with Nydus proxy plugin + config_content = """version = 2 + +[proxy_plugins] + [proxy_plugins.nydus] + type = "snapshot" + address = "/run/containerd-nydus/containerd-nydus-grpc.sock" + +[plugins."io.containerd.grpc.v1.cri".containerd] + snapshotter = "nydus" + disable_snapshot_annotations = false +""" + + with open(config_file, 'w') as f: + f.write(config_content) + + print(f"✓ Updated containerd config at {config_file}") + + # Restart fastpull service first + print("Restarting fastpull service...") + run_command(["systemctl", "restart", "fastpull.service"], check=False) + + # Then restart containerd service + print("Restarting containerd service...") + 
run_command(["systemctl", "restart", "containerd.service"], check=False) + + print("✓ Services restarted") + + return True + + +def install_cli(): + """Install fastpull CLI via pip in a venv.""" + print("\n" + "="*60) + print("Installing FastPull CLI") + print("="*60) + + try: + # Create venv if it doesn't exist + if not os.path.exists(VENV_PATH): + print(f"Creating virtual environment at {VENV_PATH}...") + result = run_command(['python3', '-m', 'venv', VENV_PATH], check=False, capture_output=True) + if result.returncode != 0: + print(f"✗ Failed to create venv: {result.stderr}") + return False + print(f"✓ Created virtual environment") + + # Get pip path in venv + venv_pip = os.path.join(VENV_PATH, 'bin', 'pip') + venv_python = os.path.join(VENV_PATH, 'bin', 'python3') + + # Install fastpull in venv + print("Installing fastpull in virtual environment...") + result = run_command([venv_pip, 'install', '-e', PROJECT_ROOT], check=False, capture_output=True) + if result.returncode != 0: + print(f"✗ Failed to install in venv: {result.stderr}") + return False + print("✓ Installed fastpull in virtual environment") + + # Create wrapper script in /usr/local/bin + wrapper_script = f"""#!/bin/bash +# FastPull CLI wrapper script +# Activates venv and runs fastpull + +exec {venv_python} -m scripts.fastpull.cli "$@" +""" + + print(f"Creating wrapper script at {FASTPULL_BIN}...") + with open(FASTPULL_BIN, 'w') as f: + f.write(wrapper_script) + os.chmod(FASTPULL_BIN, 0o755) + print(f"✓ Created fastpull command at {FASTPULL_BIN}") + + return True + + except Exception as e: + print(f"✗ Failed to install fastpull: {e}") + return False + + +def verify_installation(): + """Verify fastpull installation.""" + print("\n" + "="*60) + print("Verifying Installation") + print("="*60) + + # Test CLI + try: + result = run_command(['fastpull', '--version'], capture_output=True, check=False) + if result.returncode == 0: + print(f"✓ fastpull CLI: {result.stdout.strip()}") + else: + print(f"✗ 
fastpull CLI not found in PATH") + print("Try running: hash -r (or restart your shell)") + return False + except Exception as e: + print(f"✗ fastpull CLI test failed: {e}") + return False + + # Check nerdctl + nerdctl_path = "/usr/local/bin/nerdctl" + if os.path.exists(nerdctl_path): + try: + result = run_command([nerdctl_path, "--version"], capture_output=True) + print(f"✓ nerdctl: {result.stdout.strip().split()[2]}") + except: + print(f" nerdctl found but version check failed") + + # Check containerd service + try: + result = run_command(["systemctl", "is-active", "containerd.service"], capture_output=True) + if result.returncode == 0: + print(f"✓ containerd service: active") + else: + print(f" containerd service: {result.stdout.strip()}") + except: + print(f" Could not check containerd service") + + # Check FastPull service + try: + result = run_command(["systemctl", "is-active", "fastpull.service"], capture_output=True) + if result.returncode == 0: + print(f"✓ fastpull service: active") + else: + print(f" fastpull service: {result.stdout.strip()}") + except: + print(f" Could not check fastpull service") + + return True + + +def main(): + """Main setup function.""" + parser = argparse.ArgumentParser( + description='Install FastPull with containerd and Nydus snapshotter', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Full installation (containerd + Nydus + CLI) + sudo python3 scripts/setup.py + + # Install only CLI (skip containerd/Nydus setup) + sudo python3 scripts/setup.py --cli-only + + # Uninstall fastpull CLI + sudo python3 scripts/setup.py --uninstall +""" + ) + parser.add_argument( + '--cli-only', + action='store_true', + help='Install only the fastpull CLI, skip containerd/Nydus setup' + ) + parser.add_argument( + '--uninstall', + action='store_true', + help='Uninstall fastpull CLI' + ) + + args = parser.parse_args() + + # Check root + check_root() + + if args.uninstall: + print("Uninstalling fastpull...") + removed 
= False + + # Remove wrapper script + if os.path.exists(FASTPULL_BIN): + os.remove(FASTPULL_BIN) + print(f"✓ Removed {FASTPULL_BIN}") + removed = True + + # Remove venv + if os.path.exists(VENV_PATH): + import shutil + shutil.rmtree(VENV_PATH) + print(f"✓ Removed virtual environment at {VENV_PATH}") + removed = True + + if removed: + print("✓ Uninstall complete") + else: + print("✗ fastpull not found or already uninstalled") + return + + print("="*60) + print("FastPull Setup") + print("="*60) + + if args.cli_only: + print("\nThis will install:") + print(" • FastPull CLI tool (via pip)") + print() + else: + print("\nThis will install:") + print(" • Containerd and nerdctl") + print(" • Nydus snapshotter") + print(" • FastPull CLI tool (via pip)") + print() + + # Install system dependencies first + print("\n" + "="*60) + print("Installing System Dependencies") + print("="*60) + if not install_system_dependencies(): + print("\n⚠ Warning: System dependencies installation had issues") + print("Continuing anyway, but you may encounter errors...") + + # Track installation status + success = True + warnings = [] + + if not args.cli_only: + # Install containerd and nerdctl + if not install_containerd_nerdctl(): + print("\n⚠ Warning: Containerd installation failed") + print("You can still install the CLI with --cli-only") + sys.exit(1) + + # Install Nydus snapshotter + if not install_nydus(): + print("\n⚠ Warning: Nydus installation failed") + success = False + warnings.append("Nydus snapshotter installation failed") + else: + # Only configure containerd if Nydus installed successfully + configure_containerd_for_nydus() + + # Install CLI + if not install_cli(): + print("\nSetup incomplete: CLI installation failed") + if not args.cli_only: + print("Note: Snapshotters may have been installed") + sys.exit(1) + + # Verify + verify_installation() + + print("\n" + "="*60) + if success: + print("✅ Fastpull installed successfully on your VM") + else: + print("⚠️ Fastpull installed 
with warnings") + print("\nWarnings:") + for warning in warnings: + print(f" • {warning}") + print("="*60) + print("\n📋 Usage:") + print(" fastpull --help") + print(" fastpull run --help") + print(" fastpull build --help") + print(" fastpull quickstart --help") + if not args.cli_only: + print("\n🔍 Check services:") + print(" systemctl status containerd") + print(" systemctl status fastpull") + print("\n📖 Example:") + print(" fastpull quickstart tensorrt") + print("="*60) + + +if __name__ == '__main__': + main()