Kokoro TTS Banner

Kokoro TTS API


Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model

  • OpenAI-compatible Speech endpoint, with inline voice combination functionality
  • NVIDIA GPU-accelerated or CPU ONNX inference
  • Very fast generation times
    • 100x+ real-time speed via HF A100
    • 35-50x+ real-time speed via 4060Ti
    • 5x+ real-time speed via M3 Pro CPU
  • Streaming support with variable chunking to control latency & artifacts
  • Simple audio generation web UI utility
  • (new) Phoneme endpoints for conversion and generation

Quick Start

The service can be accessed through either the API endpoints or the Gradio web interface.

  1. Install prerequisites:

    • Install Docker Desktop + Git
    • Clone and start the service:
      git clone https://github.com/remsky/Kokoro-FastAPI.git
      cd Kokoro-FastAPI
      docker compose up --build # for GPU
      #docker compose -f docker-compose.cpu.yml up --build # for CPU
  2. Run locally as an OpenAI-Compatible Speech Endpoint

    from openai import OpenAI
    client = OpenAI(
        base_url="http://localhost:8880/v1",
        api_key="not-needed"
    )

    response = client.audio.speech.create(
        model="kokoro",
        voice="af_sky+af_bella",  # single or multiple voicepack combo
        input="Hello world!",
        response_format="mp3"
    )
    response.stream_to_file("output.mp3")

    or visit http://localhost:7860


Features

OpenAI-Compatible Speech Endpoint
# Using OpenAI's Python library
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")
response = client.audio.speech.create(
    model="kokoro",  # Not used but required for compatibility, also accepts library defaults
    voice="af_bella+af_sky",
    input="Hello world!",
    response_format="mp3"
)

response.stream_to_file("output.mp3")

Or Via Requests:

import requests

# Get the list of available voices
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Generate audio
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "model": "kokoro",  # Not used but required for compatibility
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3",  # Supported: mp3, wav, opus, flac
        "speed": 1.0
    }
)

# Save audio
with open("output.mp3", "wb") as f:
    f.write(response.content)

Quick tests (run from another terminal):

python examples/assorted_checks/test_openai/test_openai_tts.py # Test OpenAI Compatibility
python examples/assorted_checks/test_voices/test_all_voices.py # Test all available voices
Voice Combination
  • Averages model weights of any existing voicepacks
  • Saves generated voicepacks for future use
  • (new) Available through any endpoint, simply concatenate desired packs with "+"

Combine voices and generate audio:

import requests
response = requests.get("http://localhost:8880/v1/audio/voices")
voices = response.json()["voices"]

# Create combined voice (saves locally on server)
response = requests.post(
    "http://localhost:8880/v1/audio/voices/combine",
    json=[voices[0], voices[1]]
)
combined_voice = response.json()["voice"]

# Generate audio with combined voice (or, simply pass multiple directly with `+` )
response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": combined_voice, # or skip the above step with f"{voices[0]}+{voices[1]}"
        "response_format": "mp3"
    }
)

Voice Analysis Comparison

Multiple Output Audio Formats
  • mp3
  • wav
  • opus
  • flac
  • aac
  • pcm

Audio Format Comparison
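
As a quick illustration, the sketch below simply varies `response_format` in the speech request to produce each format (assuming a local server on port 8880, as in the examples above):

import requests

# Generate the same phrase in each supported container/codec
for fmt in ["mp3", "wav", "opus", "flac", "aac", "pcm"]:
    response = requests.post(
        "http://localhost:8880/v1/audio/speech",
        json={
            "model": "kokoro",
            "input": "Hello world!",
            "voice": "af_bella",
            "response_format": fmt,
        },
    )
    with open(f"output.{fmt}", "wb") as f:
        f.write(response.content)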

Gradio Web Utility

Access the interactive web UI at http://localhost:7860 after starting the service. Features include:

  • Voice/format/speed selection
  • Audio playback and download
  • Text file or direct input

If you only want the API, comment out the gradio-ui service (and everything under it) in docker-compose.yml.

Currently, voices created via the API are accessible here, but voice combination/creation has not yet been added to the web UI.

Note: Recent updates for streaming could lead to temporary glitches. If so, pull from the most recent stable release, v0.0.2, to restore.

Streaming Support
# OpenAI-compatible streaming
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8880/v1", api_key="not-needed")

# Stream to file
with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    input="Hello world!"
) as response:
    response.stream_to_file("output.mp3")

# Stream to speakers (requires PyAudio)
import pyaudio
player = pyaudio.PyAudio().open(
    format=pyaudio.paInt16, 
    channels=1, 
    rate=24000, 
    output=True
)

with client.audio.speech.with_streaming_response.create(
    model="kokoro",
    voice="af_bella",
    response_format="pcm",
    input="Hello world!"
) as response:
    for chunk in response.iter_bytes(chunk_size=1024):
        player.write(chunk)

Or via requests:

import requests

response = requests.post(
    "http://localhost:8880/v1/audio/speech",
    json={
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "pcm"
    },
    stream=True
)

for chunk in response.iter_content(chunk_size=1024):
    if chunk:
        # Process streaming chunks
        pass

GPU First Token Timeline | CPU First Token Timeline

Key Streaming Metrics:

  • First-token latency by chunk size
    • ~300ms (GPU) @ 400
    • ~3500ms (CPU) @ 200 (older i7)
    • <1s (CPU) @ 200 (M3 Pro)
  • Adjustable chunking settings for real-time playback

Note: Artifacts in intonation can increase with smaller chunks

Processing Details

Performance Benchmarks

Benchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours output), measuring processing time and realtime factor. Tests were run on:

  • Windows 11 Home w/ WSL2
  • NVIDIA 4060Ti 16GB GPU @ CUDA 12.1
  • 11th Gen i7-11700 @ 2.5GHz
  • 64GB RAM
  • WAV native output
  • H.G. Wells - The Time Machine (full text)

Processing Time | Realtime Factor

Key Performance Metrics:

  • Realtime Speed: Ranges between 25-50x (output audio length relative to generation time)
  • Average Processing Rate: 137.67 tokens/second (cl100k_base)
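
For intuition, the realtime factor is simply output audio length divided by generation time; the snippet below is illustrative arithmetic only, not output from the benchmark scripts:

# Illustrative arithmetic only (not from the benchmark scripts)
audio_seconds = 1.5 * 60 * 60        # ~1.5 hours of generated audio
generation_seconds = 150             # example wall-clock generation time
realtime_factor = audio_seconds / generation_seconds
print(f"~{realtime_factor:.0f}x realtime")  # -> ~36x realtime
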
GPU Vs. CPU
# GPU: Requires NVIDIA GPU with CUDA 12.1 support (~35x realtime speed)
docker compose up --build

# CPU: ONNX optimized inference (~2.4x realtime speed)
docker compose -f docker-compose.cpu.yml up --build

Note: Overall speed may have been reduced somewhat by the structural changes made to accommodate streaming. Looking into it.

Natural Boundary Detection
  • Automatically splits and stitches at sentence boundaries
  • Helps reduce artifacts and allows long-form processing, since the base model is currently configured for only ~30s of output per pass (see the sketch below)
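
The sketch below is a rough client-side illustration of the same idea (the API already splits and stitches server-side, so you normally never need to do this yourself): split on sentence boundaries, generate each piece as raw PCM, and concatenate.

import re
import wave

import requests

text = "First sentence. Second sentence! Third sentence?"
# Naive sentence split on ., !, or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

pcm = b""
for sentence in sentences:
    response = requests.post(
        "http://localhost:8880/v1/audio/speech",
        json={"input": sentence, "voice": "af_bella", "response_format": "pcm"},
    )
    pcm += response.content  # stitch the raw 16-bit mono PCM together

# Wrap the stitched PCM in a WAV container (24 kHz, 16-bit, mono)
with wave.open("stitched.wav", "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(24000)
    wav_file.writeframes(pcm)
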
Phoneme & Token Routes

Convert text to phonemes and/or generate audio directly from phonemes:

import requests

# Convert text to phonemes
response = requests.post(
    "http://localhost:8880/dev/phonemize",
    json={
        "text": "Hello world!",
        "language": "a"  # "a" for American English
    }
)
result = response.json()
phonemes = result["phonemes"]  # Phoneme string, e.g. "ðɪs ɪz ˈoʊnli ɐ tˈɛst"
tokens = result["tokens"]      # Token IDs including start/end tokens 

# Generate audio from phonemes
response = requests.post(
    "http://localhost:8880/dev/generate_from_phonemes",
    json={
        "phonemes": phonemes,
        "voice": "af_bella",
        "speed": 1.0
    }
)

# Save WAV audio
with open("speech.wav", "wb") as f:
    f.write(response.content)

See examples/phoneme_examples/generate_phonemes.py for a sample script.

Known Issues

Linux GPU Permissions

Some Linux users may encounter GPU permission issues when running as non-root. None of these are guaranteed fixes, but here are some common solutions; consider your security requirements carefully.

Option 1: Container Groups (Likely the best option)

services:
  kokoro-tts:
    # ... existing config ...
    group_add:
      - "video"
      - "render"

Option 2: Host System Groups

services:
  kokoro-tts:
    # ... existing config ...
    user: "${UID}:${GID}"
    group_add:
      - "video"

Note: This may require adding the host user to the docker and video groups (sudo usermod -aG docker,video $USER) and a system restart.

Option 3: Device Permissions (Use with caution)

services:
  kokoro-tts:
    # ... existing config ...
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-uvm:/dev/nvidia-uvm

⚠️ Warning: Reduces system security. Use only in development environments.

Prerequisites: NVIDIA GPU, drivers, and container toolkit must be properly configured.

Visit the NVIDIA Container Toolkit installation guide for more detailed information.

Model and License

Model

This API uses the Kokoro-82M model from HuggingFace.

Visit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.

License

This project is licensed under the Apache License 2.0 - see below for details:
  • The Kokoro model weights are licensed under Apache 2.0 (see model page)
  • The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match
  • The inference code adapted from StyleTTS2 is MIT licensed

The full Apache 2.0 license text can be found at: https://www.apache.org/licenses/LICENSE-2.0
