tiny-llm - LLM Serving in a Week

Still WIP and at a very early stage. A tutorial on LLM serving using MLX, written for systems engineers. The codebase is based (almost!) solely on MLX array/matrix APIs, without any high-level neural network APIs, so that we can build the model serving infrastructure from scratch and dig into the optimizations.
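To give a feel for what "array/matrix APIs only" means, here is a minimal sketch of single-head scaled dot-product attention (the topic of chapter 1.1) built from nothing but matrix multiplication and softmax. It uses plain Python lists as a stand-in for MLX arrays so it runs anywhere; the function names (`softmax`, `matmul`, `transpose`, `attention`) are illustrative assumptions and are not taken from the tiny-llm codebase.

```python
import math


def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    # (n x k) @ (k x m) -> (n x m), on nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention(q, k, v):
    # softmax(Q K^T / sqrt(d)) V for a single head, without a mask.
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]
    return matmul(weights, v)


# Hypothetical toy inputs: 2 tokens, head dimension 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
```

Each output row is a weighted average of the rows of `v`, with weights given by how well the query matches each key. An MLX version would follow the same shape, with the list helpers swapped for array operations.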

The goal is to learn the techniques behind efficiently serving a large language model (e.g., the Qwen2 models).

Why MLX: nowadays it's easier to get a macOS-based local development environment than to set up an NVIDIA GPU.

Why Qwen2: this was the first LLM I interacted with; it's the go-to example in the vLLM documentation. I spent some time reading the vLLM source code and built up some knowledge around it.

Book

The tiny-llm book is available at https://skyzh.github.io/tiny-llm/. You can follow the guide and start building.

Community

You may join skyzh's Discord server and study with the tiny-llm community.

Roadmap

Week + Chapter	Topic	Code	Test	Doc
1.1	Attention	✅	✅	✅
1.2	RoPE	✅	✅	✅
1.3	Grouped Query Attention	✅	🚧	🚧
1.4	RMSNorm and MLP	✅	🚧	🚧
1.5	Transformer Block	✅	🚧	🚧
1.6	Load the Model	✅	🚧	🚧
1.7	Generate Responses (aka Decoding)	✅	✅	🚧
2.1	KV Cache	✅	🚧	🚧
2.2	Quantized Matmul and Linear - CPU	✅	🚧	🚧
2.3	Quantized Matmul and Linear - GPU	✅	🚧	🚧
2.4	Flash Attention and Other Kernels	🚧	🚧	🚧
2.5	Continuous Batching	🚧	🚧	🚧
2.6	Speculative Decoding	🚧	🚧	🚧
2.7	Prompt/Prefix Cache	🚧	🚧	🚧
3.1	Paged Attention - Part 1	🚧	🚧	🚧
3.2	Paged Attention - Part 2	🚧	🚧	🚧
3.3	Prefill-Decode Separation	🚧	🚧	🚧
3.4	Scheduler	🚧	🚧	🚧
3.5	Parallelism	🚧	🚧	🚧
3.6	AI Agent	🚧	🚧	🚧
3.7	Streaming API Server	🚧	🚧	🚧

Other topics not covered: quantized/compressed KV cache

