Performance of llama.cpp on Apple Silicon A-series #4508

ggerganov started this conversation in Show and tell

Summary

🟥 - benchmark data missing
🟨 - benchmark data partial
✅ - benchmark data available

  • PP means "prompt processing" (bs = 512), TG means "text-generation" (bs = 1)
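These two rates combine into a rough end-to-end response-time estimate: prompt tokens divided by the PP rate, plus generated tokens divided by the TG rate. A minimal illustrative sketch (the helper below is not part of the benchmark app):

```python
# Estimate total response time from prompt-processing (PP) and
# text-generation (TG) speeds, both in tokens per second.
def response_time_s(prompt_tokens: int, gen_tokens: int,
                    pp_tps: float, tg_tps: float) -> float:
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# Using the A17 Mistral 7B Q4_0 figures from the tables below
# (80.55 t/s PP, 9.01 t/s TG): a 512-token prompt plus 128 generated
# tokens takes about 6.4 s + 14.2 s, i.e. roughly 20.6 s in total.
print(round(response_time_s(512, 128, 80.55, 9.01), 1))  # 20.6
```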

TinyLlama 1.1B

| | Chip | CPU Cores | GPU Cores | F16 PP [t/s] | F16 TG [t/s] | Q8_0 PP [t/s] | Q8_0 TG [t/s] | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
|---|---|---|---|---|---|---|---|---|---|
| ✅ | A14¹ | 2+4 | 4 | 251.98 | 10.26 | 250.54 | 24.11 | 242.37 | 39.21 |
| 🟥 | A15² | 2+3 | 5 | | | | | | |
| ✅ | A15² | 2+4 | 4 | X | X | 411.16 | 24.12 | 405.30 | 39.03 |
| ✅ | A15² | 2+4 | 5 | 531.03 | 13.66 | 494.18 | 23.84 | 496.49 | 39.09 |
| ✅ | A16³ | 2+4 | 5 | 565.68 | 20.06 | 511.30 | 34.30 | 505.52 | 54.24 |
| ✅ | A17⁴ | 2+4 | 6 | 683.95 | 20.23 | 637.14 | 35.60 | 646.06 | 56.86 |

Phi-2 2.7B

| | Chip | CPU Cores | GPU Cores | Q8_0 PP [t/s] | Q8_0 TG [t/s] | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
|---|---|---|---|---|---|---|---|
| ✅ | A14¹ | 2+4 | 4 | X | X | 51.39 | 8.52 |
| 🟥 | A15² | 2+3 | 5 | | | | |
| 🟥 | A15² | 2+4 | 4 | | | | |
| ✅ | A15² | 2+4 | 5 | X | X | 120.47 | 16.73 |
| ✅ | A16³ | 2+4 | 5 | 119.58 | 14.06 | 121.64 | 23.31 |
| ✅ | A17⁴ | 2+4 | 6 | 158.03 | 14.74 | 157.33 | 24.71 |

Mistral 7B

| | Chip | CPU Cores | GPU Cores | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
|---|---|---|---|---|---|
| ✅ | A14¹ | 2+4 | 4 | X | X |
| 🟥 | A15² | 2+3 | 5 | | |
| 🟥 | A15² | 2+4 | 4 | | |
| ✅ | A15² | 2+4 | 5 | X | X |
| 🟥 | A16³ | 2+4 | 5 | | |
| ✅ | A17⁴ | 2+4 | 6 | 80.55 | 9.01 |

Description

This is a collection of short llama.cpp benchmarks on various Apple Silicon A-series hardware. It is useful for comparing the performance that llama.cpp achieves across the A-series chips. A similar collection for the M-series is available here: #4167

| Chip | CPU Cores | GPU Cores | Memory [GB] | Devices |
|---|---|---|---|---|
| A14 | 2+4 | 4 | 4-6 | iPhone 12 (all variants), iPad Air (4th gen), iPad (10th gen) |
| A15 | 2+3 | 5 | 4 | Apple TV 4K (3rd gen) |
| A15 | 2+4 | 4 | 4 | iPhone SE (3rd gen), iPhone 13 & Mini |
| A15 | 2+4 | 5 | 4-6 | iPad Mini (6th gen), iPhone 13 Pro & Pro Max, iPhone 14 & Plus |
| A16 | 2+4 | 5 | 6 | iPhone 14 Pro & Pro Max, iPhone 15 & Plus |
| A17 Pro | 2+4 | 6 | 8 | iPhone 15 Pro & Pro Max |

Instructions

  • Clone the project and check out the pinned commit:
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    git checkout 0e18b2e
  • Open examples/llama.swiftui with Xcode
  • Enable the Release build configuration (screenshot omitted)
  • Deploy to your iPhone / iPad
  • Stop Xcode and run the app from the device itself. This is important because performance when running through Xcode is significantly slower
  • Download the models and run "Bench" for each one
  • Running "Bench" a second time can give more accurate results
  • Post the results in the comments below, adding information about the device

iPhone 13 mini ✅

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | pp 512 | 411.16 ± 6.22 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | tg 128 | 24.12 ± 0.04 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | pp 512 | 405.30 ± 7.26 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | tg 128 | 39.03 ± 0.08 |

Footnotes

  1. https://en.wikipedia.org/wiki/Apple_A14
  2. https://en.wikipedia.org/wiki/Apple_A15
  3. https://en.wikipedia.org/wiki/Apple_A16
  4. https://en.wikipedia.org/wiki/Apple_A17


iPhone 15 Pro (A17 Pro) ✅

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | pp 512 | 683.95 ± 8.24 |
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | tg 128 | 20.23 ± 0.08 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | pp 512 | 637.14 ± 18.73 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | tg 128 | 35.60 ± 0.25 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | pp 512 | 646.06 ± 17.17 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | tg 128 | 56.86 ± 0.17 |
| phi2 3B Q8_0 | 2.75 GiB | 2.78 B | Metal | pp 512 | 158.03 ± 14.03 |
| phi2 3B Q8_0 | 2.75 GiB | 2.78 B | Metal | tg 128 | 14.74 ± 0.07 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | Metal | pp 512 | 157.33 ± 14.25 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | Metal | tg 128 | 24.71 ± 0.04 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | pp 512 | 80.55 ± 21.88 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | tg 128 | 9.01 ± 0.50 |

ggerganov Dec 18, 2023
Maintainer Author

The models above are now available in the app

@jhen0409

Added phi-2 and F16 TinyLlama results. Also verified that the previous results are unchanged at commit 0e18b2e.

@Dampfinchen

Text generation speed using Mistral is more than usable on newer iPhones, it seems. Prompt processing is very slow, however, even when using Metal. I wonder whether this is a compute or a bandwidth limitation.

@shouryan01

What's the minimum t/s for it to be usable? For example, is 9 t/s usable for Mistral 7B?

@rhematt

It's a UX design principle. For a model to be usable, the benchmark I've been working towards is 400 ms for the first token and for each subsequent token. If a token isn't delivered to the user at least every 400 ms, the model won't be usable. Minimum usability should, therefore, be about 3 t/s...
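For what it's worth, that 400 ms budget converts to a minimum generation speed like so (an illustrative sketch, not from the original comment):

```python
# Convert a per-token latency budget into the minimum generation
# speed (tokens per second) needed to meet it.
def min_tokens_per_second(latency_budget_s: float) -> float:
    return 1.0 / latency_budget_s

# A 400 ms-per-token budget needs at least 2.5 t/s; rounding up
# with some headroom gives the ~3 t/s figure mentioned above.
print(min_tokens_per_second(0.4))  # 2.5
```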


iPhone 15 Pro Max (A17 Pro) ✅

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | pp 512 | 652.70 ± 18.14 |
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | tg 128 | 19.82 ± 0.30 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | pp 512 | 662.89 ± 11.28 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | tg 128 | 34.95 ± 0.08 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | pp 512 | 645.78 ± 9.16 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | tg 128 | 54.95 ± 0.14 |

Tested under iOS 17.3 Developer beta 1 (21D5026f)


ggerganov Dec 18, 2023
Maintainer Author

Thanks! The iPhone 15 Pro Max should be using the same A17 chip as the iPhone 15 Pro, correct? At least that is what I get from Wikipedia, and the numbers seem to mostly match the ones from @jhen0409 above.

@ymcui

Yes, both the iPhone 15 Pro and Pro Max use the A17. Concretely, Apple names it the A17 Pro. See here.

@ymcui

@ggerganov Just updated F16 results. The model is taken from https://huggingface.co/SergiusFlavius/TinyLlama-1.1B-1T-OpenOrca-GGUF/blob/main/tinyllama-1.1b-1t-openorca.F16.gguf


iPhone 12 mini (A14) ✅

tinyllama:

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | pp 512 | 251.98 ± 5.15 |
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | tg 128 | 10.26 ± 4.23 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | pp 512 | 250.54 ± 0.95 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | tg 128 | 24.11 ± 0.02 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | pp 512 | 242.37 ± 0.81 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | tg 128 | 39.21 ± 0.25 |

phi-2:

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| phi2 3B Q8_0 | 2.75 GiB | 2.78 B | Metal | pp 512 | CRASHED |
| phi2 3B Q8_0 | 2.75 GiB | 2.78 B | Metal | tg 128 | CRASHED |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | Metal | pp 512 | 51.39 ± 11.89 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | Metal | tg 128 | 8.52 ± 2.78 |
  1. phi-2 3B Q8_0 (2.75 GiB) cannot be loaded; the phone restarts when loading it.
  2. I won't test Mistral 7B Q4_0 (3.8 GiB) on the iPhone 12 mini either, because it's too large to fit in memory (4 GB). The iPhone 12 Pro & Pro Max may have a chance to run it, as they have 6 GB of RAM.

Tested under iOS 17.1.2 (21B101)
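The out-of-memory notes above boil down to a simple feasibility check. A sketch for illustration only; the 1.5 GiB overhead assumed for iOS and the app is a made-up figure, not something measured in this thread:

```python
# Rough check: can a model of a given size fit in a device's RAM,
# leaving room for the OS and the app? The overhead value is an
# assumption for illustration, not a measured number.
def fits_in_memory(model_gib: float, ram_gib: float,
                   overhead_gib: float = 1.5) -> bool:
    return model_gib + overhead_gib <= ram_gib

print(fits_in_memory(3.83, 4.0))  # Mistral 7B Q4_0 on a 4 GB device: False
print(fits_in_memory(3.83, 6.0))  # on a 6 GB device: True
```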

@ymcui

Added phi-2 3B Q4_0 results. The others can't be loaded.


Some additional info with memory and relevant devices.

| Chip | CPU Cores | GPU Cores | Memory [GB] | Devices |
|---|---|---|---|---|
| A14 | 2+4 | 4 | 4-6 | iPhone 12 (all variants), iPad Air (4th gen), iPad (10th gen) |
| A15 | 2+3 | 5 | 4 | Apple TV 4K (3rd gen) |
| A15 | 2+4 | 4 | 4 | iPhone SE (3rd gen), iPhone 13 & Mini |
| A15 | 2+4 | 5 | 4-6 | iPad Mini (6th gen), iPhone 13 Pro & Pro Max, iPhone 14 & Plus |
| A16 | 2+4 | 5 | 6 | iPhone 14 Pro & Pro Max, iPhone 15 & Plus |
| A17 Pro | 2+4 | 6 | 8 | iPhone 15 Pro & Pro Max |
@ymcui

The Apple TV 4K (3rd gen) seems to be the only device with that A15 variant (5 CPU + 5 GPU cores).
I checked my Apple TV 4K, and unfortunately it is the 2nd gen (A12) 😂


ggerganov Dec 21, 2023
Maintainer Author

Damn, first LLM on a TV 😄

@nikolay-kapustin

If you really need this, I can build and run these tests for the Apple TV 4K (3rd gen) :)


ggerganov Dec 21, 2023
Maintainer Author

Don't really need it, but it might be a cool achievement. An LLM on a watch has already been demonstrated: https://twitter.com/shxf0072/status/1736713832045982040, but I don't think that has been done for an LLM on a TV.


iPhone 13 Pro (A15) ✅

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | pp 512 | 496.49 ± 3.82 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | tg 128 | 39.09 ± 0.12 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | pp 512 | 494.18 ± 4.93 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | tg 128 | 23.84 ± 0.05 |
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | pp 512 | 531.03 ± 5.96 |
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | tg 128 | 13.66 ± 0.02 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | Metal | pp 512 | 120.47 ± 1.44 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | Metal | tg 128 | 16.73 ± 0.02 |

Also, phi-2 3B Q8_0 cannot be loaded, and Mistral 7B Q4_0 (3.8 GiB) does not fit in memory.


iPhone 14 Pro (A16) ✅

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | pp 512 | 505.52 ± 0.58 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | tg 128 | 54.24 ± 0.04 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | pp 512 | 511.30 ± 1.00 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | tg 128 | 34.30 ± 0.13 |
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | pp 512 | 565.68 ± 0.21 |
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | tg 128 | 20.06 ± 0.04 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | Metal | pp 512 | 121.64 ± 0.01 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | Metal | tg 128 | 23.31 ± 0.05 |
| phi2 3B Q8_0 | 2.75 GiB | 2.78 B | Metal | pp 512 | 119.58 ± 0.05 |
| phi2 3B Q8_0 | 2.75 GiB | 2.78 B | Metal | tg 128 | 14.06 ± 0.14 |

ggerganov Dec 20, 2023
Maintainer Author

Thanks! I guess Mistral 7B does not fit on this device?

@Krish120003

@ggerganov It does. I can load it and run it by sending a message, but the Bench button keeps aborting, indicating that the heat-up time is too long.


iPhone 12 (A14) 🟨

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | pp 512 | 227.46 ± 14.55 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | tg 128 | 37.89 ± 0.27 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | pp 512 | 224.22 ± 22.57 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | tg 128 | 23.23 ± 0.12 |

  • llama 1B F16 aborted due to a long heat-up time (28.16 seconds)
  • phi2 3B Q4_0 aborted due to a long heat-up time (9.44 seconds)
  • phi2 Q8_0 crashes the app
  • Not testing the Mistral model since it's too large for the device's RAM (3.8 GB model vs 4 GB device)

Can anyone tell me what the output metric (t/s) means? Tokens per second, or something else?

@XiongjieDai

It's tokens per second.

@anchorbob

thanks for confirmation


Can anyone share the download link for the llama 1B model? I can't find it on HF, or I'm not sure which one it is.


Hi, I was trying to load starcoderbase-3b-GGUF. It does not load in the iPhone 15 Pro simulator; it gets stuck at "Loading model...".
While investigating, I encountered one warning message: "Publishing changes from background threads is not allowed; make sure to publish values from the main thread (via operators like receive(on:)) on model updates." What could be the cause of this? Thank you.

@cosmo3769

The above model loaded on my device (iPhone 13, iOS 17.3). But when I try to send a message or benchmark it, I get a "Heat up time is too long" message. I am getting the same message on the iPhone 15 (simulator), iPhone 15 Pro (simulator), and iPhone 15 Pro Max (simulator) as well. The size of the model is 2.05 GB. @ggerganov


iPhone SE (3rd Generation), A15 2+4 CPU, 4 GPU, 4 GB of RAM

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | pp 512 | 428.48 ± 1.24 |
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | tg 128 | 13.63 ± 0.03 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | pp 512 | 141.71 ± 59.55 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | tg 128 | 15.03 ± 0.87 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | pp 512 | 152.40 ± 53.64 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | tg 128 | 18.37 ± 1.67 |
| phi2 3B Q8_0 | 2.75 GiB | 2.78 B | Metal | pp 512 | model loaded, but the benchmark failed because llama.swift was killed |
| phi2 3B Q8_0 | 2.75 GiB | 2.78 B | Metal | tg 128 | model loaded, but the benchmark failed because llama.swift was killed |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | Metal | pp 512 | 29.73 ± 5.09 |
| phi2 3B Q4_0 | 1.49 GiB | 2.78 B | Metal | tg 128 | 7.08 ± 2.08 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | pp 512 | model appeared to load, but the benchmark failed with "Heat up time is too long" |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | tg 128 | model appeared to load, but the benchmark failed with "Heat up time is too long" |

What data/prompts are used for this?


I have run llama.cpp on an iOS device (iPhone) as described here, but the models are giving garbage responses. What am I doing wrong?


Would it be possible to update these instructions for a recent version of Xcode? I get an error that I can't quite figure out:
"/Users/USER/inference/llama.cpp/air-lld:1:1 81 duplicated symbols for target 'air64_v23-apple-ios14.0.0-simulator'"


Getting the same error as kinchahoy.


Hi, I was trying to deploy llama.swiftui on iOS 15 and ran into this error:

    #import "llama.h"
    /Users/naiwenxie/Documents/androidBUld/llama.cpp/Sources/llama/llama.h:3:10: error: 'llama.h' file not found with <angled> include; use "quotes" instead
    #include <llama.h>
    /Users/naiwenxie/Documents/androidBUld/llama.cpp/examples/llama.swiftui/llama.cpp.swift/LibLlama.swift:2:8: error: could not build Objective-C module 'llama'
    import llama

Does someone have any idea about this?

@poom3d

Did you check out the code from 0e18b2e?
Did you enable the Release build? IIRC, I got some errors when not on a Release build.

@sienaiwun

Thanks. I'm on the latest master branch as of February 2, and the issue still occurs. I opened the project directly with Xcode, configured the signing settings, and the issue appeared; all settings are at their defaults. Are there any additional steps or guidance?

@sienaiwun

I was able to run it, thank you. The instructions for running it are here: #4508

@NegiAbhijeet

Why does the master branch's examples/llama.swiftui not support the simulator? I get: Building for 'iOS-simulator', but linking in dylib (/opt/homebrew/Cellar/llama.cpp/4733/lib/libggml.dylib) built for 'macOS'

I want to use Qwen2, which is not supported at 0e18b2e. Is there any solution?


Why is it that we need to use the "Release" build?

Thanks


Hi, I'm new to llama.cpp. llama.swiftui works well on my local machine using the default downloadable models, but it fails to start after I copied a quantized Meta Llama 3.2 1B model, created with https://huggingface.co/spaces/ggml-org/gguf-my-repo, to /examples/llama.swiftui/llama.swiftui/Resources/models.

It showed these errors in the console:

    error loading model: create_tensor: tensor 'output.weight' not found
    llama_load_model_from_file: failed to load model
    Could not load model at /Users/xxx/Library/Developer/CoreSimulator/Devices/5EEF15B1-B91D-422B-AC6D-27039A036350/data/Containers/Bundle/Application/ACA68B26-D133-49E7-9341-D8E5F32CB63E/llama.swiftui.app/models/llama-3.2-1b-q4_k_m.gguf
    Error while loading the model: The operation couldn’t be completed. (llama_swiftui.LlamaError error 0.)

Would you please point out what went wrong?

@tuan88it

@HaohongLin1: Please guide me through the steps to build the llama.cpp project. I can't run llama.swiftui because I keep getting the error "llama.h not found."

@HaohongLin1

@tuan88it

OS: macOS Sequoia 15.1.1

For llama.swiftui, I only followed the Instructions on this page; I didn't need any extra steps to build it in Xcode.

@guanyuhong

> @HaohongLin1: Please guide me through the steps to build the llama.cpp project. I can't run llama.swiftui because I keep getting the error "llama.h not found."

With git checkout 0e18b2e I can run the code, but I still face the error "error loading model: create_tensor: tensor 'output.weight' not found" 😭

@poom3d

After the checkout, the code runs fine for me both in the simulator and on device. Did you tap the download link below first? (It takes a while to download some models.)


iPhone 16 Pro, A18 Pro (2+4 CPU cores, 6 GPU cores), 8 GB of RAM

The first thing I did after getting my new phone was run the llama.cpp benchmark 👍

| model | size | params | backend | test | t/s |
|---|---|---|---|---|---|
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | pp 512 | 647.54 ± 2.09 |
| llama 1B Q4_0 | 0.59 GiB | 1.10 B | Metal | tg 128 | 69.77 ± 0.12 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | pp 512 | 655.27 ± 1.90 |
| llama 1B Q8_0 | 1.09 GiB | 1.10 B | Metal | tg 128 | 42.11 ± 0.03 |
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | pp 512 | 682.50 ± 1.19 |
| llama 1B F16 | 2.05 GiB | 1.10 B | Metal | tg 128 | 23.84 ± 0.04 |
| phi2 3B Q8_0 | 2.75 GiB | 2.78 B | Metal | pp 512 | 180.04 ± 4.55 |
| phi2 3B Q8_0 | 2.75 GiB | 2.78 B | Metal | tg 128 | 16.93 ± 0.02 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | pp 512 | 94.43 ± 13.98 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal | tg 128 | 11.64 ± 1.07 |
@donohara

Thanks for your work on this! I'm a definite noob, but I love the fact that I can study the code and read all the comments on this repo. It's exciting that the attitude is "let's see what we can do, and how to make it smaller/faster/better!"

@sienaiwun

I’d like to ask how you built the iOS target. Does the memory information show up when running the benchmark? When I tested on Android, I found that the memory usage (PSS value) was more than twice the size of the model.

@poom3d

I just followed the instructions in this thread. The memory figure is the spec of the phone I'm running on.
The results are copied from running the benchmarks.
