Making every device an AI-native device.

We research and build inference engines from the metal up: custom kernels, operator fusion, and unified memory optimization, all for the hardware you already own.

Read our research · About us

MetalRT benchmark

LLM decode · Qwen3-0.6B · M4 Max (tok/s)

Ollama · 85
Apple MLX · 220
llama.cpp · 290
MetalRT · 668
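The relative speedups implied by these figures can be worked out directly. This is a minimal sketch, assuming the four decode numbers are comparable measurements (same model, same machine); the dictionary and function names are illustrative only and are not part of any MetalRT API.

```python
# Decode throughput in tok/s, copied from the benchmark above
# (Qwen3-0.6B on an M4 Max).
decode_tok_s = {
    "Ollama": 85,
    "Apple MLX": 220,
    "llama.cpp": 290,
    "MetalRT": 668,
}


def speedup_over(baseline: str, engine: str = "MetalRT") -> float:
    """Return how many times faster `engine` decodes than `baseline`."""
    return decode_tok_s[engine] / decode_tok_s[baseline]


for name in ("Ollama", "Apple MLX", "llama.cpp"):
    print(f"MetalRT vs {name}: {speedup_over(name):.2f}x")
```

On these numbers, MetalRT decodes roughly 3.0x faster than Apple MLX and 2.3x faster than llama.cpp.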


The problem

Most AI runs in the cloud. That won't scale.

We study inference at the hardware level. Here's what we've found.

Cost

Cloud inference costs $0.08–0.35 per minute for voice alone. Serving AI to 8 billion people through centralized GPU clusters is economically impossible. The compute has to move to the edge.
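To make the per-minute range concrete, here is a rough back-of-the-envelope sketch. The usage level (10 minutes of voice per user per day over a 30-day month) is a hypothetical assumption chosen for illustration, not a figure from our research; only the $0.08 and $0.35 per-minute bounds come from the text above.

```python
# Per-minute cloud voice cost range quoted above, in US cents.
# Integer cents avoid floating-point rounding noise.
LOW_CENTS_PER_MIN = 8    # $0.08
HIGH_CENTS_PER_MIN = 35  # $0.35


def monthly_cost_usd(cents_per_min: int, minutes_per_day: int, days: int = 30) -> float:
    """Cloud cost in dollars for one user over `days` days."""
    return cents_per_min * minutes_per_day * days / 100


# Hypothetical usage level: 10 minutes of voice per user per day.
low = monthly_cost_usd(LOW_CENTS_PER_MIN, 10)
high = monthly_cost_usd(HIGH_CENTS_PER_MIN, 10)
print(f"${low:.2f} to ${high:.2f} per user per month")
```

Under that assumption the cloud bill lands between $24 and $105 per user per month, before any other AI workload is counted.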

marginal inference cost on-device

Latency

A round trip to the cloud takes at least 300-400 ms. For real-time voice, vision, and autonomous systems, that's too slow. Physics sets the floor for the network; running on-device removes the network from the path.

<7ms

time-to-first-token (Qwen3-0.6B, M4 Max)
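One way to read these numbers together: how much text could an on-device model produce in the time a cloud request spends on the network alone? The sketch below assumes the 668 tok/s decode rate quoted on this page is sustained, treats 7 ms as an upper bound on time-to-first-token, and ignores everything else in the pipeline. The function name is illustrative.

```python
ROUND_TRIP_MS = (300, 400)  # cloud round trip quoted above
DECODE_TOK_S = 668          # decode rate quoted on this page (Qwen3-0.6B, M4 Max)
TTFT_MS = 7                 # upper bound taken from "<7ms"


def tokens_decoded_within(budget_ms: float) -> float:
    """Tokens produced on-device inside `budget_ms`, after paying TTFT once."""
    decode_ms = max(0.0, budget_ms - TTFT_MS)
    return DECODE_TOK_S * decode_ms / 1000


for budget in ROUND_TRIP_MS:
    print(f"{budget} ms budget: about {tokens_decoded_within(budget):.0f} tokens")
```

Under these assumptions, an on-device model could emit a few hundred tokens before a cloud response has even completed its round trip.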

The models are ready

Small models now match the quality of models 250x their size. The bottleneck isn’t the model - it’s the runtime. That’s what we build.

668

tok/s on a single MacBook

View on GitHub

New Research

Read our latest publications

On-device intelligence - fast, private, hardware-native. Applied research for hardware-native AI inference.

View all publications

MetalRT · Vision

Mar 13, 2026

MetalRT Now Runs Vision Language Models. Fastest on Apple Silicon.

287 tok/s

Vision decode

MetalRT · Speech

Mar 9, 2026

The First Complete AI Inference Engine for Apple Silicon. Now with Speech.

101 ms

STT latency

MetalRT · LLM

Mar 3, 2026

We Built the Fastest LLM Decode Engine for Apple Silicon.

668 tok/s

LLM decode

Weekly Briefing · Inference Radar

The state of inference, weekly.

View all issues (4)

Latest · 2026-W16 · Apr 16 to Apr 22, 2026 · 20 min read

Inference Layers Collapse Into One

“This week’s code tells a clear story: cloud servers, laptop runtimes, mobile frameworks, and compiler backends are converging on the same problems — KV cache pressure, tool-calling correctness, multimodal support, and hardware-specific exec...”

What We Build

Engines. SDKs. Observability.

Three layers that take on-device AI from research to production.

Inference Engines

MetalRT

Custom kernel runtime for the hardware you already own. 668 tok/s LLM decode, 101 ms speech-to-text, 287 tok/s vision. Every kernel hand-written from scratch.

Read the benchmarks

Developer SDKs

Cross-Platform

Swift, Kotlin, React Native, Flutter. One API across iOS, Android, and edge. Ship on-device AI with a few lines of code - LLM, STT, TTS, vision, voice agents.

Explore documentation

Observability

Control Plane

Fleet dashboard, OTA model updates, policy-based routing, inference analytics. Manage thousands of devices without app store releases.

Learn more

Our approach

We build from the metal up.

Custom GPU kernels, operator fusion, unified memory optimization. On a single MacBook, MetalRT reaches 668 tok/s LLM decode and 287 tok/s vision inference.

We write GPU kernels from scratch: hand-designed memory layouts, fused operators, and custom Metal shaders that bypass generic abstraction layers. Every kernel targets the specific hardware it runs on.
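To illustrate what operator fusion means in general, here is a small CPU-side sketch in plain Python. It is not MetalRT's kernel code: it is a toy example, using RMSNorm followed by a learned per-channel scale, that contrasts an unfused version (separate passes, each writing an intermediate buffer) with a fused version (one reduction and one output pass). The function names and the `eps` default are assumptions made for this example.

```python
import math


def rms_norm_scale_unfused(x, weight, eps=1e-6):
    """Separate passes, each materializing an intermediate list."""
    squares = [v * v for v in x]                    # pass 1: temporary buffer
    inv_rms = 1.0 / math.sqrt(sum(squares) / len(x) + eps)
    normed = [v * inv_rms for v in x]               # pass 2: temporary buffer
    return [n * w for n, w in zip(normed, weight)]  # pass 3: output


def rms_norm_scale_fused(x, weight, eps=1e-6):
    """One reduction plus one output pass, with no intermediate buffers."""
    acc = 0.0
    for v in x:
        acc += v * v                                # reduction kept in a scalar
    inv_rms = 1.0 / math.sqrt(acc / len(x) + eps)
    # Normalization and scaling happen in the same pass over the data.
    return [v * inv_rms * w for v, w in zip(x, weight)]


x = [1.0, 2.0, 3.0, 4.0]
weight = [0.5, 1.0, 1.5, 2.0]
unfused = rms_norm_scale_unfused(x, weight)
fused = rms_norm_scale_fused(x, weight)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(unfused, fused)))
```

Both versions compute the same result; the fused one simply touches memory fewer times. On a GPU, where decode is often limited by memory traffic rather than arithmetic, removing those intermediate reads and writes is where the benefit of fusion comes from.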

The shift from cloud to edge will be defined by whoever builds the best runtime. We publish our research openly, ship production SDKs across Swift, Kotlin, React Native, and Flutter, and make the engine available on GitHub.

Backed by Y Combinator, we are building the infrastructure layer for on-device AI at scale, starting with Apple Silicon, then Qualcomm, then Intel. We use the hardware people already own.

View SDKs on GitHub · Read our publications

Inference Stack

Your App

iOS · macOS · Android

RunAnywhere.load("llama-3.2-1b")

SDK Layer

Swift · Kotlin · React Native · Flutter

Cross-platform bindings → C++ core

MetalRT Runtime

C++ Inference Engine · Quantized Weights · KV Cache

Orchestrates graph execution on unified memory

Custom .metal Kernels

Hand-written Metal Shading Language

We write every GPU kernel from scratch

qmv.metal · attention_decode.metal · rms_norm.metal · rope.metal · swiglu.metal · kv_cache.metal

Apple Silicon GPU

M1 · M2 · M3 · M4 · Unified Memory · 800 GB/s

simd_sum · threadgroup_barrier · [[buffer(0)]]

Output: 668 tok/s decode | 101 ms STT

Team

Y Combinator · AWS · Microsoft · Intuit

“We left AWS and Intuit to write custom kernels by hand. Because the future of AI isn't in the cloud - it's on every device you already own.”

Founders

Sanchit Monga

Co-Founder & CEO

Built SDKs used by 50M+ users at Intuit. Leads product, go-to-market, and the vision for making every device AI-native.

Ex-Intuit · YC W26 · 50M+ SDK Users

Shubham Malhotra

Co-Founder & CTO

Formerly at AWS EC2 Spot and Microsoft Azure Arc. Published ML researcher. Writes the custom kernels that power MetalRT.

Ex-AWS EC2 Spot · Ex-Microsoft Azure · Published ML Researcher

Choose your seat

GPU Kernel Engineer

Remote

Full Time · Engineering

ML Inference Researcher

Remote

Full Time · Research

Mobile SDK Engineer

Remote

Full Time · Engineering

If you're interested in joining, contact founders@runanywhere.ai.

In the press

Latest news.

What the industry is saying about RunAnywhere and on-device AI.

Research

Building the Infrastructure Layer for On-Device AI at Scale

NextTech Today

Regulated Industries Are Rewriting Their AI Architecture

NY Weekly

From Cloud Dependence to Instant AI: The Rise of On-Device Voice Agents

FoundersBrief

RunAnywhere - Building the Edge AI Stack

DNA India

Why India's AI Future Will Be Built on the Edge

Read the research.

Try the engine.

Explore our work · GitHub

Latest news.

MetalRT Brings the First Unified AI Inference Engine to Apple Silicon

Why the Next Wave of AI Will Run on Your Phone

RunAnywhere: The Infrastructure Powering the Edge AI Era
