Quick Start | Screenshots | Features | Models | Integrations | API | FAQ | Report a bug
Turn your Android phone into an OpenAI-compatible API LLM server — fully local, private, and open source
Think of it as Ollama for Android. Pick a model, tap Start, and your phone becomes an LLM server — runs LLMs on your mobile GPU/CPU via Google's LiteRT-LM runtime and serves them as a standard OpenAI-compatible HTTP API on your local network.
No cloud. No API keys. No subscriptions. Just your phone.
- Multi-model Support — One-tap download from HuggingFace, import
.litertlmfiles or model lists from local storage, or add custom model sources via JSON file or URL - Multimodal & Reasoning — Vision, audio, thinking, streaming support, and tool calling (experimental) for capable models
- Benchmark Built-in — Test and compare models on your device to find the best fit for your hardware
- Activity Logs — Detailed request/response logs with search, filtering, and JSON highlighting
- Always On, Low Power — Configurable auto-start on boot, sips ~5-10W vs 300W+ for a GPU server — perfect for that old phone in your drawer*
- Highly Configurable — Per-model inference settings, GPU/CPU accelerator, idle model unload, bearer token auth, and more
- Model & Server Monitoring — Live stats dashboard, Prometheus metrics for Grafana, and Home Assistant REST API for remote server control
- Broad Compatibility — Home Assistant, Open WebUI, OpenClaw, Python, curl — if it talks to OpenAI, it works
Note
Home Assistant currently requires a custom integration such as Extended OpenAI Conversation or Local OpenAI LLM and OpenAI STT for voice commands — see the Home Assistant client setup
* I am not responsible for any swollen batteries, crispy phones, or spontaneous pocket warmers. Please don't run your LLM on your phone while it's under your pillow. You've been warned.
- Download & install the APK

- Download a model — Gemma 4 E2B is recommended for most devices (2.4 GB, runs on 8 GB RAM)
- Start the server — Tap the Start Server button on the downloaded model card
- Configure your client — Use the endpoint shown on the Status screen (e.g.
http://PHONE_IP:8000/v1) with any OpenAI-compatible client — Open WebUI, OpenClaw, Home Assistant, Python, etc. See Client Setup for detailed guides.
Important
Requires: Android 12+ · arm64-v8a device · 6 GB RAM minimum · 8 GB+ recommended for multimodal models (see model table)
| Model | Size | Min RAM | Context | Capabilities |
|---|---|---|---|---|
| Gemma 4 E2B ⭐ | 2.4 GB | 8 GB | 32K | Text · Vision · Audio · Thinking · Tools · MTP |
| Gemma 4 E4B ⭐ | 3.4 GB | 12 GB | 32K | Text · Vision · Audio · Thinking · Tools · MTP |
| Gemma 3n E2B | 3.4 GB | 8 GB | 4K | Text · Vision · Audio |
| Gemma 3n E4B | 4.6 GB | 12 GB | 4K | Text · Vision · Audio |
| Gemma 3 1B | 0.5 GB | 6 GB | 1K | Text |
| Qwen 2.5 1.5B | 1.5 GB | 6 GB | 4K | Text |
| DeepSeek-R1 1.5B | 1.7 GB | 6 GB | 4K | Text |
⭐ Recommended — E2B for most devices, E4B for high-end
Note
Tool calling is experimental and may not always be reliable due to model limitations.
See the Model Guide for recommendations, capability details, and import instructions.
- Prometheus metrics —
/metricsendpoint with 29 metrics for Grafana, Datadog, etc. - Home Assistant REST API — monitor server status, control model, update settings remotely
Available endpoints — click to expand
| Method | Endpoint | Description |
|---|---|---|
POST |
/v1/chat/completions |
OpenAI Chat Completions API (streaming + non-streaming) |
POST |
/v1/completions |
OpenAI Completions API |
POST |
/v1/responses |
OpenAI Responses API |
POST |
/v1/messages |
Anthropic Messages API (streaming + non-streaming) |
POST |
/v1/messages/count_tokens |
Anthropic input-token estimator |
POST |
/v1/audio/transcriptions |
Audio transcription |
GET |
/v1/models |
List available models |
GET |
/v1/models/{id} |
Get detail for a specific model |
GET |
/ or /v1 |
Server info (version, status, endpoints) |
GET |
/health |
Health check (with optional ?metrics=true) |
GET |
/metrics |
Prometheus metrics |
GET |
/ping |
Simple liveness check |
Full API docs and examples: docs/api/API.md
Known limitations — click to expand
- arm64-v8a only — other architectures (armeabi-v7a, x86, x86_64) are not supported. The LiteRT runtime ships native libraries for x86_64 but they crash on Android emulators due to unsupported CPU instructions. Nearly all Android devices from 2017+ are arm64-v8a.
- Single model, single request — one model loaded at a time, requests queue sequentially (LiteRT SDK limitation). On-demand model loading via client requests is planned for a future release.
- Tool calling is experimental — Full native tool calling in the LiteRT SDK is currently broken, so OlliteRT uses schema injection (tool schemas injected into the model's context via the SDK) for structured output. A prompt-based fallback is available if schema injection doesn't work with your model. Results may vary — works best with Gemma 4 models.
- Token counts are estimated — the LiteRT runtime doesn't expose a tokenizer API, so counts are approximated using character length ÷ 4. Reasonably accurate for English text, less so for code or multilingual content.
- Imported models are copied to app storage — when importing a model from your device, the file is copied rather than moved. You can delete the original after import to reclaim space.
- No GGUF support — only
.litertlmmodels are supported (LiteRT runtime limitation). Models are available from the LiteRT Community on HuggingFace. Advanced users can convert HuggingFace models to.litertlmusing Google'slitert-torchtooling (Linux, 32GB+ RAM required). - LiteRT runtime constraints — OlliteRT is built on Google's LiteRT-LM runtime, optimized for mobile. Features like logprobs, grammar-based output constraints, repetition penalties, and LoRA adapters are not available.
- FAQ — Model support, privacy, battery, architecture, tool calling
- Troubleshooting — Connection issues, performance, crashes, auto-start, storage
- Privacy Policy — no data is collected, no telemetry, no analytics
- Security Guide — bearer token auth, network exposure, HTTPS, securing your server
- Found a bug? Report it here
- Want to request a feature? Open an issue
- Building — build instructions, signing setup, and HuggingFace OAuth configuration
- Architecture — package structure, request flow, and dependency list
Product flavors — all installable side-by-side:
| Flavor | Icon | Purpose |
|---|---|---|
stable |
Stable release | |
beta |
Beta testing | |
dev |
Local development |
What happens on your phone stays on your phone. If that matters to you, consider supporting OlliteRT.
- Google AI Edge Gallery — Original project this was built upon
- LiteRT-LM — Google's on-device AI runtime
- Ktor — Coroutine-based HTTP server framework
Licensed under the Apache License 2.0.



