
[NO LONGER UPDATED]

Below is a summary of the functionality provided by the llama.cpp project.

  • The goal is to have a bird's-eye view of what works and what does not
  • Collaborators are encouraged to add things to the list and update the status of existing items as needed
  • The list should be simple, without too many details about specific problems - these belong in dedicated issues

Legend (feel free to update):

✅ - Working correctly
☁️ - Partially working
❌ - Failing
❓ - Status unknown (needs testing)
🔬 - Under investigation
🚧 - Currently in development

| Feature | Executable | Status | Issues |
| --- | --- | --- | --- |
| **Inference** | | | |
| Single-batch decoding | main, simple | | |
| Parallel / batched decoding | batched | | |
| Continuous batching | parallel | | |
| Speculative sampling | speculative | | |
| Tree-based speculative sampling | speculative | | |
| Self-speculative sampling | speculative | 🚧 | #3565 |
| Lookahead sampling | lookahead | | |
| Infill | infill | | |
| REST API | server | | |
| Embeddings | embedding | | |
| Grouped Query Attention CPU | main | | |
| Grouped Query Attention CUDA | main | | |
| Grouped Query Attention OpenCL | main | | |
| Grouped Query Attention Metal | main | | |
| Session load / save | main | | |
| K-quants (256) CUDA | main | | |
| K-quants (64) CUDA | main | | |
| K-quants (256) Metal | main | | |
| K-quants (64) Metal | main | ☁️ | #3276 |
| Special tokens | main | | |
| Grammar sampling | main, server | | |
| Beam search | beam-search | | #3471 (comment) |
| LoRA | main | ☁️ | #3333 #3519 |
| SPM tokenizer | test-tokenizer-0-llama | | |
| BPE tokenizer | test-tokenizer-0-falcon | | |
| **Models** | | | |
| LLaMA v1 | main | | |
| LLaMA v2 | main | | |
| Falcon | main | | |
| StarCoder | main | | |
| Baichuan | main | | |
| MPT | main | | |
| Persimmon | main | | |
| LLaVA | llava | | |
| Refact | main | | |
| Bloom | main | | |
| StableLM-3b-4e1t | main | | |
| **Training** | | | |
| Finetuning CPU | finetune | | |
| Finetuning Metal | finetune | 🔬 | |
| **Backends** | | | |
| CPU x64 | ggml | | |
| CPU Arm | ggml | | |
| GPU CUDA | ggml-cuda | | |
| GPU ROCm | ggml-cuda | | |
| GPU Metal | ggml-metal | | |
| GPU OpenCL | ggml-opencl | | |
| GPU Vulkan | ggml-vulkan | 🚧 | #2059 |

Replies: 6 comments · 10 replies


What does the "☁️" mean?

2 replies
@shibe2

I don't know what the icon means, but the current status of the OpenCL back-end is: it works with supported models, but it is buggy and perhaps slower than it could be.

@ggerganov (Maintainer, Author) · Oct 4, 2023

Yup, this was my impression from reading a few issues lately. If you think it's not the case, feel free to update it. I just haven't set up OpenCL in my environment and cannot do tests atm


So "Parallel decoding" is done by batched and "Continuous batching" is done by parallel? Are these reversed?

1 reply
@ggerganov (Maintainer, Author) · Oct 5, 2023

Parallel decoding is also called "batched decoding", hence the batched executable. The parallel example demonstrates a basic server that serves clients in parallel - it just happens to offer the continuous batching feature as an option.

Naming things is hard :) Sorry if these are confusing
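
To make the distinction concrete, below is a minimal sketch (an editor's illustration, not code taken from the llama.cpp examples) of what batched decoding looks like at the API level: tokens from several independent sequences are packed into one llama_batch and processed by a single llama_decode call. Continuous batching then means that finished sequences are evicted and new requests are inserted into the batch between decode calls, instead of waiting for the whole batch to drain. The batch_add helper is hypothetical, and the exact llama_batch_init signature has changed across llama.cpp versions, so treat the snippet as illustrative only.

```c
// Illustrative only: pack tokens from two independent sequences into one
// llama_batch so a single llama_decode() call processes both. This is the
// core idea behind the `batched` example; the `parallel` example adds a
// toy server loop and optional continuous batching on top of it.
#include <stdbool.h>
#include "llama.h"

// Hypothetical helper (not part of llama.h): append one token to a batch.
static void batch_add(struct llama_batch * batch, llama_token tok,
                      llama_pos pos, llama_seq_id seq_id, bool want_logits) {
    const int i = batch->n_tokens;
    batch->token   [i]    = tok;
    batch->pos     [i]    = pos;
    batch->n_seq_id[i]    = 1;
    batch->seq_id  [i][0] = seq_id;
    batch->logits  [i]    = want_logits;
    batch->n_tokens++;
}

// Usage sketch (ctx is an initialized llama_context *, tok_a/tok_b are tokens):
//
//   struct llama_batch batch = llama_batch_init(/*n_tokens*/ 512, /*embd*/ 0, /*n_seq_max*/ 2);
//   batch_add(&batch, tok_a, /*pos*/ 0, /*seq_id*/ 0, true);   // sequence 0
//   batch_add(&batch, tok_b, /*pos*/ 0, /*seq_id*/ 1, true);   // sequence 1
//   if (llama_decode(ctx, batch) != 0) {
//       // handle failure (e.g. no free slot in the KV cache)
//   }
//   llama_batch_free(batch);
```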


Should beam search be added here? I think it is broken atm, at least with CUDA.

4 replies
@ggerganov (Maintainer, Author) · Oct 8, 2023

Yes, it should be added. The list is far from complete

@Mihaiii

Fwiw, for me beam search is broken even without CUDA, in the sense that when I run the example, nothing happens (it just hangs for minutes at this line until I Ctrl+C it).

If it's an unknown problem, I'll open an issue (tbh, it's strange that nobody has mentioned it before, so maybe I'm doing something wrong).

Update: when it hangs on the above-mentioned line, I have 0 hard page faults/sec.

@slaren

With CUDA it works for a while, but then it starts generating gibberish. I think that the calls to llama_decode are failing and it is not catching it. It's probably missing some KV cache management after the batched decoding change.
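
As a hedged illustration of the "not catching it" point (this is not the actual beam-search code): llama_decode reports failure through its return value, so a caller that ignores it will keep sampling from whatever logits are left in the context and produce gibberish. A minimal defensive wrapper might look like this:

```c
// Sketch only: defensive wrapper around llama_decode() that surfaces
// failures instead of silently continuing with stale logits.
#include <stdio.h>
#include <stdbool.h>
#include "llama.h"

static bool decode_checked(struct llama_context * ctx, struct llama_batch batch) {
    const int ret = llama_decode(ctx, batch);
    if (ret != 0) {
        // A positive return value typically means the batch could not be
        // placed in the KV cache; a negative one is a hard error. Either
        // way the resulting logits must not be trusted.
        fprintf(stderr, "llama_decode failed with code %d\n", ret);
        return false;
    }
    return true;
}
```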

@ggerganov (Maintainer, Author) · Oct 18, 2023

The beam search functionality should be moved out from the library and implemented as a standalone example.


What would be the criteria for considering the OpenCL back-end to be working correctly? I've fixed all known bugs in ggml-opencl.cpp and am now working on refactoring, e.g. #3669.

3 replies
@ggerganov (Maintainer, Author) · Oct 18, 2023

The criterion is: if it runs correctly on your machine, then it is ✅ until someone reports a reproducible problem - then it becomes ☁️ or ❌, depending on how broken it is.

@shibe2

Alright, turning on the green light then!

@Yossef-Dawoad

Maybe you could ditch the icons for a grading scale (A+, A, A-, B, ...). That would make it obvious when something works fine but still needs improvement (it gets an A-), and so on. Maybe something like this:

[ A+ ] or [ A ]: working like a charm
[ A- ]: working correctly but needs improvement
[ B ]: partially working
[ B- ]: partially working, with big issues to be resolved
[ C ]: status unknown (needs testing)
[ D+ ]: under investigation
[ D ]: currently in development
[ F ]: failing

Maybe you could also add a column for support tier, e.g. whether a feature is tier 1, tier 2, ... What do you think?


Is there any further progress on finetuning with the Metal GPU backend?

0 replies

This page has been "[no longer updated]" since Jan. 11, 2024.
So why is there still a link to this page from the main README ❓

0 replies