Backed by Y Combinator

Making every device an AI-native device.

We research and build inference engines from the metal up - custom kernels, operator fusion, unified memory optimization. For the hardware you already own.

MetalRT benchmark
LLM decode (tok/s) · Qwen3-0.6B · M4 Max
Ollama · 85
Apple MLX · 220
llama.cpp · 290
MetalRT · 668
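
The relative speedups follow directly from the four decode figures in the benchmark above. A minimal sketch in Python that does the division; the dictionary and variable names are illustrative, and the only inputs are the numbers quoted on this page:

```python
# Decode throughput in tok/s, copied from the benchmark above
# (Qwen3-0.6B on an M4 Max).
decode_tok_s = {
    "Ollama": 85,
    "Apple MLX": 220,
    "llama.cpp": 290,
    "MetalRT": 668,
}

metalrt = decode_tok_s["MetalRT"]

# Speedup of MetalRT relative to each of the other runtimes.
speedups = {
    name: metalrt / rate
    for name, rate in decode_tok_s.items()
    if name != "MetalRT"
}

for name, factor in speedups.items():
    print(f"MetalRT vs {name}: {factor:.1f}x")
```

On these figures MetalRT comes out at roughly 3.0x Apple MLX, 2.3x llama.cpp and 7.9x Ollama for this model and machine.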


The problem

Most AI runs in the cloud. That won't scale.

We study inference at the hardware level. Here's what we've found.

Cost

Cloud inference costs $0.08–0.35 per minute for voice alone. Serving AI to 8 billion people through centralized GPU clusters is economically impossible. The compute has to move to the edge.

$0

marginal inference cost on-device
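
To make the cost argument concrete, here is a back-of-the-envelope sketch in Python. The per-minute price range is the one quoted above; the usage figures (10 minutes of voice per user per day, one million users, a 30-day month) are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
# Per-minute cloud price range for voice, from the text above (USD).
CLOUD_VOICE_USD_PER_MIN = (0.08, 0.35)


def monthly_cloud_cost(minutes_per_day, days=30, users=1):
    """Return the (low, high) monthly cloud bill in USD."""
    total_minutes = minutes_per_day * days * users
    low, high = CLOUD_VOICE_USD_PER_MIN
    return (low * total_minutes, high * total_minutes)


# Assumed usage: 10 minutes of voice per user per day, 1,000,000 users.
low, high = monthly_cloud_cost(minutes_per_day=10, users=1_000_000)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

Under those assumptions the cloud bill lands at roughly $24 million to $105 million per month, and it grows linearly with users and minutes. On-device inference has no equivalent per-minute charge.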

Latency

A round-trip to the cloud takes 300-400ms minimum. For real-time voice, vision, and autonomous systems, that’s too slow. Physics sets the floor - on-device removes it.

<7ms

time-to-first-token (Qwen3-0.6B, M4 Max)
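
A rough way to read these latency numbers together: how many tokens could a local model have produced by the time a single cloud round-trip completes? The sketch below uses the 300-400ms round-trip range, the <7ms time-to-first-token and the 668 tok/s decode rate quoted on this page. It assumes a steady decode rate and ignores prompt length, so treat it as an estimate only; the function name is illustrative.

```python
CLOUD_ROUND_TRIP_MS = (300, 400)  # minimum round-trip range from the text
ON_DEVICE_TTFT_MS = 7             # upper bound; the text says "<7ms"
DECODE_TOK_S = 668                # MetalRT decode rate quoted on this page


def tokens_before_cloud_replies(round_trip_ms):
    """Tokens decoded on-device by the time one cloud round-trip completes."""
    decode_window_ms = max(0, round_trip_ms - ON_DEVICE_TTFT_MS)
    return int(decode_window_ms * DECODE_TOK_S / 1000)


for rtt in CLOUD_ROUND_TRIP_MS:
    print(rtt, "ms round-trip ->", tokens_before_cloud_replies(rtt), "tokens")
```

With these inputs the on-device model is already about 195 to 262 tokens into its answer before the first cloud byte would arrive.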

The models are ready

Small models now match the quality of models 250x their size. The bottleneck isn’t the model - it’s the runtime. That’s what we build.

668

tok/s on a single MacBook

What We Build

Engines. SDKs. Observability.

Three layers that take on-device AI from research to production.

Inference Engines

01

MetalRT

Custom kernel runtime for the hardware you already own. 668 tok/s LLM decode, 101ms speech-to-text, 287 tok/s vision. Every kernel hand-written from scratch.

Developer SDKs

02

Cross-Platform

Swift, Kotlin, React Native, Flutter. One API across iOS, Android, and edge. Ship on-device AI with a few lines of code - LLM, STT, TTS, vision, voice agents.

Observability

03

Control Plane

Fleet dashboard, OTA model updates, policy-based routing, inference analytics. Manage thousands of devices without app store releases.
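
Policy-based routing can be pictured as a small decision function that picks where each request runs. The sketch below is purely hypothetical: `DeviceState`, `RoutingPolicy`, `choose_route` and the thresholds are invented for illustration and are not the RunAnywhere control-plane API.

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    free_memory_mb: int
    battery_percent: int
    online: bool


@dataclass
class RoutingPolicy:
    # Hypothetical thresholds that a fleet operator could push remotely.
    min_free_memory_mb: int = 1024
    min_battery_percent: int = 20


def choose_route(state, policy):
    """Return "on_device" or "cloud" for one inference request."""
    if not state.online:
        return "on_device"  # no network: local is the only option
    if state.free_memory_mb < policy.min_free_memory_mb:
        return "cloud"      # not enough memory to hold the model
    if state.battery_percent < policy.min_battery_percent:
        return "cloud"      # preserve battery on a nearly drained device
    return "on_device"


policy = RoutingPolicy()
print(choose_route(DeviceState(4096, 80, True), policy))
print(choose_route(DeviceState(512, 80, True), policy))
```

Because the policy is plain data, changing a threshold changes routing behaviour without shipping a new app build, which is the point of managing it from a control plane.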

Our approach

We build from the metal up.

Custom GPU kernels, operator fusion, unified memory optimization. Our benchmarks speak for themselves: 668 tok/s LLM decode, 287 tok/s vision inference on a single MacBook.

We write GPU kernels from scratch: hand-designed memory layouts, fused operators, and custom Metal shaders that bypass every generic abstraction layer. MetalRT achieves 668 tok/s LLM decode on Apple Silicon. Every kernel targets the specific hardware it runs on.
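
Operator fusion is easiest to see on a small example. The sketch below uses SwiGLU (SiLU of a gate vector multiplied elementwise by an "up" vector) written in plain Python. It is not MetalRT code; it only illustrates the general idea that a fused operator makes one pass over the data and never materializes the intermediate buffer. All function names are illustrative.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_unfused(gate, up):
    # Pass 1 writes a temporary buffer; pass 2 reads it back.
    activated = [silu(g) for g in gate]
    return [a * u for a, u in zip(activated, up)]


def swiglu_fused(gate, up):
    # One pass over the inputs, no temporary buffer.
    return [silu(g) * u for g, u in zip(gate, up)]


gate = [0.0, 1.0, -2.0, 3.5]
up = [1.0, 2.0, 0.5, -1.0]
print(swiglu_fused(gate, up))
```

Both versions compute the same values. On a GPU, the fused form typically saves a round-trip through memory for the intermediate result, which matters when decode is limited by memory bandwidth.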

The shift from cloud to edge will be defined by whoever builds the best runtime. We publish our research openly, ship production SDKs across Swift, Kotlin, React Native, and Flutter, and make the engine available on GitHub.

Backed by Y Combinator, we are building the infrastructure layer for on-device AI at scale, starting with Apple Silicon, then Qualcomm, then Intel. We use the hardware people already own.

Inference Stack

Your App

iOS · macOS · Android

RunAnywhere.load("llama-3.2-1b")

SDK Layer

Swift · Kotlin · React Native · Flutter

Cross-platform bindings → C++ core

MetalRT Runtime

C++ Inference Engine · Quantized Weights · KV Cache

Orchestrates graph execution on unified memory

Custom .metal Kernels

Hand-written Metal Shading Language

We write every GPU kernel from scratch

qmv.metal · attention_decode.metal · rms_norm.metal · rope.metal · swiglu.metal · kv_cache.metal

Apple Silicon GPU

M1 · M2 · M3 · M4 · Unified Memory · 800 GB/s

simd_sum · threadgroup_barrier · [[buffer(0)]]

Output: 668 tok/s decode | 101ms STT
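
The stack above names a KV cache and an attention decode kernel. MetalRT's real kernels are Metal shaders and are not shown on this page; the following is a plain-Python reference of what single-head attention over a KV cache computes during decode, the kind of reference a GPU kernel could be checked against. The class and function names are illustrative assumptions.

```python
import math


class KVCache:
    """Append-only store of past keys and values for one attention head."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def attention_decode(query, cache):
    """Attend one new query vector over every cached key/value pair."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        scale * sum(q * k for q, k in zip(query, key))
        for key in cache.keys
    ]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache.values[0])
    return [
        sum(w * v[i] for w, v in zip(weights, cache.values)) / total
        for i in range(dim)
    ]


cache = KVCache()
cache.append([1.0, 0.0], [1.0, 0.0])
cache.append([1.0, 0.0], [0.0, 1.0])
print(attention_decode([1.0, 0.0], cache))
```

Each decode step appends one key/value pair and reuses everything already cached, so per-token work grows with the sequence length rather than recomputing attention for the whole prefix.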

Team

Y Combinator · AWS · Microsoft · Intuit

“We left AWS and Intuit to write custom kernels by hand, because the future of AI isn't in the cloud; it's on every device you already own.”

Founders

Sanchit Monga

Co-Founder & CEO

Built SDKs used by 50M+ users at Intuit. Leads product, go-to-market, and the vision for making every device AI-native.

Ex-Intuit · YC W26 · 50M+ SDK Users

Shubham Malhotra

Co-Founder & CTO

Formerly at AWS EC2 Spot and Microsoft Azure Arc. Published ML researcher. Writes the custom kernels that power MetalRT.

Ex-AWS EC2 Spot · Ex-Microsoft Azure · Published ML Researcher

Read the research.

Try the engine.

RunAnywhere

On-device AI inference research and infrastructure. Building the fastest engines for the hardware you already own.

© 2026 RunAnywhere, Inc.
