2.2.11 Backend: AirLLM

Handle: airllm
URL: http://localhost:33981

Quickstart | Configurations | MacOS | Example notebooks | FAQ

AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. It can even run the 405B Llama 3.1 on 8GB of VRAM.

Note that the above is true, but don't expect performant inference. AirLLM loads LLM layers into memory in small groups, so the main benefit is a "transformers"-like workflow for models that are much, much larger than your VRAM.
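
You can observe this behaviour directly: while a request is in flight, VRAM usage stays at a few gigabytes no matter how large the model is. A quick way to watch it with plain nvidia-smi (nothing Harbor-specific):

# Refresh GPU memory readings every second (Ctrl+C to stop)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1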

Note

AirLLM requires a GPU with CUDA by default and can't be run on the CPU.
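
Before starting, it's worth confirming that the host sees a CUDA-capable GPU and that Docker can pass it through to containers. A quick sanity check, not Harbor-specific (the CUDA image tag below is just an example; any recent nvidia/cuda tag works):

# Confirm the host GPU is visible
nvidia-smi

# Confirm containers can access the GPU (image tag is an example)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi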

Starting

# [Optional] Pre-build the image
# Needs PyTorch and CUDA, so will be quite large
harbor build airllm

# Start the service
# Will download selected models if not present yet
harbor up airllm

# See service logs
harbor logs airllm

# Check it's running
curl $(harbor url airllm)/v1/models

Configuration

AirLLM only supports specific models; see the original README for details.

Note

The default context size is 128, following the official examples.

For funsies, Harbor also adds an OpenAI-compatible API to AirLLM, so you can enjoy a 40-minute wait per request when "chatting" with Llama 3.1 405B.
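
For example, a minimal chat request against the standard OpenAI-style route might look like this (a sketch: the "model" value must match whatever the service is configured to run, see below):

# Send a small chat completion request
curl $(harbor url airllm)/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 32
  }'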

# Set the model to run
harbor airllm model meta-llama/Meta-Llama-3.1-8B-Instruct

# Set the context size to use
harbor airllm ctx 1024

# Set the compression
harbor airllm compression 4bit
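
To double-check what the service will actually run, you can inspect Harbor's config store. This assumes the usual harbor config subcommands; the exact key names are not guaranteed, hence the broad filter:

# List all config values and filter for AirLLM-related keys
harbor config list | grep -i airllm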
