FastAPI implementation of Microsoft VibeVoice (text-to-speech) with REST and WebSocket APIs. The service loads a VibeVoice model on startup and exposes endpoints for TTS generation, health/readiness, and Prometheus metrics.
- Framework: FastAPI + Uvicorn
- Model: `microsoft/VibeVoice-1.5B` (default) or `microsoft/VibeVoice-Large`
- APIs: REST `/api/generate`, `/api/voices`; WebSocket `/ws/generate`
- Health/Monitoring: `/healthz`, `/readyz`, `/metrics`
- Static UI: Basic demo served from `/` if `static/index.html` exists

- Python 3.11+
- NVIDIA GPU with CUDA (configured for `cuda:0`)
- PyTorch with CUDA support; FlashAttention 2 optional (falls back to SDPA automatically)

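The FlashAttention 2 fallback mentioned above can be illustrated with a minimal sketch. The helper name `pick_attn_implementation` is hypothetical, and this is not necessarily how the service selects its backend; the returned strings are the `attn_implementation` values accepted by Hugging Face Transformers.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is installed, else "sdpa".

    Hypothetical helper: the service's real selection logic may differ.
    """
    # find_spec returns None when the package cannot be found on sys.path.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


if __name__ == "__main__":
    print(pick_attn_implementation())
```

A value chosen this way would typically be passed as `attn_implementation=...` when loading a Transformers model.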
Using uv (recommended):

    # Install dependencies (dev optional)
    uv pip install -e .[dev]
    # Or, with lockfile sync
    uv sync

Using pip:

    python -m venv .venv && source .venv/bin/activate
    pip install -e .[dev]

Environment variables (can be placed in `.env`):

- `MODEL_PATH`: `microsoft/VibeVoice-1.5B` (default) or `microsoft/VibeVoice-Large`
- `VOICES_DIR`: path to voice samples directory (default: `voices/`)
- `MAX_CONCURRENCY`: concurrent generations (default: `1`)
- `TIMEOUT_SEC`: per-request timeout seconds (default: `300`)
- `CORS_ALLOW_ORIGINS`: comma-separated origins (default: empty)
- `LOG_LEVEL`: `debug|info|warning|error|critical` (default: `info`)

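As a rough illustration of how these variables and their defaults fit together, the sketch below parses them from a mapping such as `os.environ`. The `Settings` class and `load_settings` function are hypothetical names, not the service's actual configuration code, and the sketch reads only the process environment (loading `.env` is left out).

```python
import os
from dataclasses import dataclass

LOG_LEVELS = {"debug", "info", "warning", "error", "critical"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    model_path: str
    voices_dir: str
    max_concurrency: int
    timeout_sec: int
    cors_allow_origins: list[str]
    log_level: str


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping, applying the documented defaults."""
    env = os.environ if env is None else env
    # Split the comma-separated origins and drop empty entries.
    origins = [o.strip() for o in env.get("CORS_ALLOW_ORIGINS", "").split(",") if o.strip()]
    log_level = env.get("LOG_LEVEL", "info").lower()
    if log_level not in LOG_LEVELS:
        raise ValueError(f"unsupported LOG_LEVEL: {log_level!r}")
    return Settings(
        model_path=env.get("MODEL_PATH", "microsoft/VibeVoice-1.5B"),
        voices_dir=env.get("VOICES_DIR", "voices/"),
        max_concurrency=int(env.get("MAX_CONCURRENCY", "1")),
        timeout_sec=int(env.get("TIMEOUT_SEC", "300")),
        cors_allow_origins=origins,
        log_level=log_level,
    )


if __name__ == "__main__":
    print(load_settings())
```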
Start the server:

    # With uv
    uv run uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
    # Or with python environment
    uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

Open interactive docs at http://localhost:8000/docs.

Model downloads and initialization happen on first start; a CUDA-enabled GPU is required.
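Because the first start can take a while, a client or deployment script may want to wait for the readiness probe before sending requests. The sketch below polls a URL such as `http://localhost:8000/readyz` until it answers with HTTP 200. It assumes the endpoint returns a non-200 status (or refuses connections) while the model is still loading, which the text above does not confirm; `http_ok` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all count as "not ready".
        return False


def wait_until_ready(url, probe=http_ok, attempts=60, delay=5.0, sleep=time.sleep) -> bool:
    """Call probe(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_ready("http://localhost:8000/readyz", attempts=3, delay=1.0)
    print("ready" if ready else "not ready")
```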
Running in Docker requires the NVIDIA Container Toolkit.

    docker compose up --build
    # or
    docker run --gpus all -p 8000:8000 --env-file .env \
      -v $(pwd)/voices:/app/voices \
      vibevoice-fastapi:latest

Endpoints:

- `GET /healthz`: basic liveness probe
- `GET /readyz`: readiness + model/device info
- `GET /api/voices`: list available voices (scanned from `VOICES_DIR`)
- `POST /api/generate`: generate TTS
  - Content negotiation via `Accept`:
    - `audio/wav` returns binary WAV
    - `application/json` returns JSON with `audio_base64`
- `GET /metrics`: Prometheus exposition format
- `WS /ws/generate`: streaming PCM16 frames with progress/final messages

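The content negotiation on `POST /api/generate` can also be exercised from Python. The following is a minimal standard-library sketch, not an official client: `build_request`, `extract_audio`, and `generate` are hypothetical helpers, and the code assumes that `audio_base64` holds the encoded audio bytes and that the server sets a matching `Content-Type` on the response.

```python
import base64
import json
import urllib.request

API_URL = "http://localhost:8000/api/generate"


def build_request(script, speakers, accept="audio/wav", url=API_URL, **options):
    """Build a POST request; extra options (e.g. cfg_scale) go into the JSON body."""
    payload = {"script": script, "speakers": speakers, **options}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": accept},
        method="POST",
    )


def extract_audio(body: bytes, content_type: str) -> bytes:
    """Return raw audio bytes from either a JSON or a binary response body."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        # Assumption: the JSON object carries the audio under "audio_base64".
        return base64.b64decode(json.loads(body)["audio_base64"])
    return body


def generate(script, speakers, accept="audio/wav", **options) -> bytes:
    """Send the request and return the audio bytes (requires a running server)."""
    req = build_request(script, speakers, accept=accept, **options)
    with urllib.request.urlopen(req, timeout=300) as resp:
        return extract_audio(resp.read(), resp.headers.get("Content-Type", ""))


if __name__ == "__main__":
    audio = generate("Alice: Hello, this is a test.", ["en-Alice_woman"],
                     cfg_scale=1.3, inference_steps=5)
    with open("out.wav", "wb") as fh:
        fh.write(audio)
```

Passing `accept="application/json"` to `generate` takes the base64 path instead of the binary one.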
Example: request a WAV directly and save to file

    curl -X POST http://localhost:8000/api/generate \
      -H 'Content-Type: application/json' \
      -H 'Accept: audio/wav' \
      -d '{
        "script": "Alice: Hello, this is a test.",
        "speakers": ["en-Alice_woman"],
        "cfg_scale": 1.3,
        "inference_steps": 5
      }' \
      --output out.wav

Example: request JSON output with base64-encoded audio

    curl -X POST http://localhost:8000/api/generate \
      -H 'Content-Type: application/json' \
      -H 'Accept: application/json' \
      -d '{
        "script": "Alice: Hello, this is a test.",
        "speakers": ["en-Alice_woman"],
        "cfg_scale": 1.3,
        "inference_steps": 5
      }'

Voices:

- Put reference audio files in `voices/` (default) and use the file name without its extension (the stem) as the speaker ID.
- Naming like `en-Alice_woman.wav` is parsed into language/name/gender for `/api/voices`.
- Discover available voices:

      curl http://localhost:8000/api/voices

Run the tests:

    pytest tests/
    # or
    uv run pytest -q

Notes:

- This service is a FastAPI implementation of the Microsoft VibeVoice model. It requires a CUDA-capable GPU and downloads models from Hugging Face on first use.
- If FlashAttention 2 is not available, the service automatically falls back to SDPA attention.
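
The voice file naming convention described earlier (`en-Alice_woman.wav` parsed into language, name, and gender) could be handled along the following lines. This is a sketch under assumptions: the exact parsing rules of the service are not documented here, `VoiceInfo` and `parse_voice_file` are hypothetical names, and the fallback for files that do not follow the pattern is a guess.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class VoiceInfo(NamedTuple):
    speaker_id: str          # file stem, used as the speaker ID
    language: Optional[str]  # text before the first "-", if any
    name: str                # middle part of the stem
    gender: Optional[str]    # text after the last "_", if any


def parse_voice_file(path) -> VoiceInfo:
    """Split a stem like "en-Alice_woman" into language, name, and gender."""
    stem = Path(path).stem
    language, sep, rest = stem.partition("-")
    if not sep:
        # No language prefix: treat the whole stem as the remainder.
        language, rest = None, stem
    name, sep, gender = rest.rpartition("_")
    if not sep:
        # No gender suffix: the remainder is the name.
        name, gender = rest, None
    return VoiceInfo(stem, language, name, gender)


if __name__ == "__main__":
    print(parse_voice_file("voices/en-Alice_woman.wav"))
```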