Performance of llama.cpp on Apple Silicon M-series #4167

ggerganov started this conversation in Show and tell

Summary

LLaMA 7B

| Chip | BW [GB/s] | GPU Cores | F16 PP [t/s] | F16 TG [t/s] | Q8_0 PP [t/s] | Q8_0 TG [t/s] | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✅ M1 [^1] | 68 | 7 | | | 108.21 | 7.92 | 107.81 | 14.19 |
| ✅ M1 [^1] | 68 | 8 | | | 117.25 | 7.91 | 117.96 | 14.15 |
| ✅ M1 Pro [^1] | 200 | 14 | 262.65 | 12.75 | 235.16 | 21.95 | 232.55 | 35.52 |
| ✅ M1 Pro [^1] | 200 | 16 | 302.14 | 12.75 | 270.37 | 22.34 | 266.25 | 36.41 |
| ✅ M1 Max [^1] | 400 | 24 | 453.03 | 22.55 | 405.87 | 37.81 | 400.26 | 54.61 |
| ✅ M1 Max [^1] | 400 | 32 | 599.53 | 23.03 | 537.37 | 40.2 | 530.06 | 61.19 |
| ✅ M1 Ultra [^1] | 800 | 48 | 875.81 | 33.92 | 783.45 | 55.69 | 772.24 | 74.93 |
| ✅ M1 Ultra [^1] | 800 | 64 | 1168.89 | 37.01 | 1042.95 | 59.87 | 1030.04 | 83.73 |
| ✅ M2 [^2] | 100 | 8 | | | 147.27 | 12.18 | 145.91 | 21.7 |
| ✅ M2 [^2] | 100 | 10 | 201.34 | 6.72 | 181.4 | 12.21 | 179.57 | 21.91 |
| ✅ M2 Pro [^2] | 200 | 16 | 312.65 | 12.47 | 288.46 | 22.7 | 294.24 | 37.87 |
| ✅ M2 Pro [^2] | 200 | 19 | 384.38 | 13.06 | 344.5 | 23.01 | 341.19 | 38.86 |
| ✅ M2 Max [^2] | 400 | 30 | 600.46 | 24.16 | 540.15 | 39.97 | 537.6 | 60.99 |
| ✅ M2 Max [^2] | 400 | 38 | 755.67 | 24.65 | 677.91 | 41.83 | 671.31 | 65.95 |
| ✅ M2 Ultra [^2] | 800 | 60 | 1128.59 | 39.86 | 1003.16 | 62.14 | 1013.81 | 88.64 |
| ✅ M2 Ultra [^2] | 800 | 76 | 1401.85 | 41.02 | 1248.59 | 66.64 | 1238.48 | 94.27 |
| 🟥 M3 [^3] | 100 | 8 | | | | | | |
| 🟨 M3 [^3] | 100 | 10 | | | 187.52 | 12.27 | 186.75 | 21.34 |
| 🟨 M3 Pro [^3] | 150 | 14 | | | 272.11 | 17.44 | 269.49 | 30.65 |
| ✅ M3 Pro [^3] | 150 | 18 | 357.45 | 9.89 | 344.66 | 17.53 | 341.67 | 30.74 |
| ✅ M3 Max [^3] | 300 | 30 | 589.41 | 19.54 | 566.4 | 34.3 | 567.59 | 56.58 |
| ✅ M3 Max [^3] | 400 | 40 | 779.17 | 25.09 | 757.64 | 42.75 | 759.7 | 66.31 |
| ✅ M3 Ultra [^3] | 800 | 60 | 1121.80 | 42.24 | 1085.76 | 63.55 | 1073.09 | 88.40 |
| ✅ M3 Ultra [^3] | 800 | 80 | 1538.34 | 39.78 | 1487.51 | 63.93 | 1471.24 | 92.14 |
| 🟥 M4 [^4] | 120 | 8 | | | | | | |
| ✅ M4 [^4] | 120 | 10 | 230.18 | 7.43 | 223.64 | 13.54 | 221.29 | 24.11 |
| ✅ M4 Pro [^4] | 273 | 16 | 381.14 | 17.19 | 367.13 | 30.54 | 364.06 | 49.64 |
| ✅ M4 Pro [^4] | 273 | 20 | 464.48 | 17.18 | 449.62 | 30.69 | 439.78 | 50.74 |
| 🟥 M4 Max [^4] | 410 | 32 | | | | | | |
| ✅ M4 Max [^4] | 546 | 40 | 922.83 | 31.64 | 891.94 | 54.05 | 885.68 | 83.06 |
| 🟥 M4 Ultra | 820 | 64 | | | | | | |
| 🟥 M4 Ultra | 1092 | 80 | | | | | | |

(Figure_2: scatter plots of PP performance vs. GPU cores and TG performance vs. memory bandwidth for F16, Q8_0 and Q4_0, generated with plot.py below)

plot.py
# GPT-4 Generated Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating DataFrame from the provided data
data = {
    "Chip": ["M1", "M1", "M1 Pro", "M1 Pro", "M1 Max", "M1 Max", "M1 Ultra", "M2", "M2 Pro", "M2 Pro", "M2 Max", "M2 Max", "M2 Ultra", "M2 Ultra", "M3", "M3 Pro", "M3 Pro", "M3 Max"],
    "BW (GB/s)":     [68, 68, 200, 200, 400, 400, 800, 100, 200, 200, 400, 400, 800, 800, 100, 150, 150, 400],
    "GPU Cores":     [7, 8, 14, 16, 24, 32, 48, 10, 16, 19, 30, 38, 60, 76, 10, 14, 18, 40],
    "F16 PP (t/s)":  [None, None, None, 302.14, 453.03, 599.53, 875.81, 201.34, 312.65, 384.38, 600.46, 755.67, 1128.59, 1401.85, None, None, 357.45, 779.17],
    "F16 TG (t/s)":  [None, None, None, 12.75, 22.55, 23.03, 33.92, 6.72, 12.47, 13.06, 24.16, 24.65, 39.86, 41.02, None, None, 9.89, 25.09],
    "Q8_0 PP (t/s)": [108.21, 117.25, 235.16, 270.37, 405.87, 537.37, 783.45, 181.4, 288.46, 344.5, 540.15, 677.91, 1003.16, 1248.59, 187.52, 272.11, 344.66, 757.64],
    "Q8_0 TG (t/s)": [7.92, 7.91, 21.95, 22.34, 37.81, 40.2, 55.69, 12.21, 22.7, 23.01, 39.97, 41.83, 62.14, 66.64, 12.27, 17.44, 17.53, 42.75],
    "Q4_0 PP (t/s)": [107.81, 117.96, 232.55, 266.25, 400.26, 530.06, 772.24, 179.57, 294.24, 341.19, 537.6, 671.31, 1013.81, 1238.48, 186.75, 269.49, 341.67, 759.7],
    "Q4_0 TG (t/s)": [14.19, 14.15, 35.52, 36.41, 54.61, 61.19, 74.93, 21.91, 37.87, 38.86, 60.99, 65.95, 88.64, 94.27, 21.34, 30.65, 30.74, 66.31]
}
df = pd.DataFrame(data)

# Helper function to plot and annotate multiple data series in the same plot
def plot_multi_series(ax, x, y_series, labels, xlabel, ylabel, title, poly_power=1):
    colors = ['r', 'g', 'b']  # Colors for different series
    for i, y in enumerate(y_series):
        # Sorting data for regression (convert to NumPy arrays for positional indexing)
        x_arr = np.asarray(x, dtype=float)
        y_arr = np.asarray(y, dtype=float)
        sorted_indices = np.argsort(x_arr)
        x_sorted = x_arr[sorted_indices]
        y_sorted = y_arr[sorted_indices]

        # Masking NaN values
        mask = ~np.isnan(y_sorted)
        x_sorted = x_sorted[mask]
        y_sorted = y_sorted[mask]

        # Fitting a polynomial regression model
        coefficients = np.polyfit(x_sorted, y_sorted, poly_power)
        polynomial = np.poly1d(coefficients)

        # Creating a range of x-values for a smoother trendline
        x_range = np.linspace(x_sorted.min(), x_sorted.max(), 500)
        trendline = polynomial(x_range)

        # Plotting
        ax.scatter(x, y, color=colors[i], label=labels[i], s=20)
        ax.plot(x_range, trendline, f"{colors[i]}-", linewidth=1)  # Trendline in the same color

    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.legend()

    # Annotating points with the number of GPU cores and bandwidth
    # (skip rows where the first series has no measurement)
    for i in range(len(df)):
        if not np.isnan(y_series[0][i]):
            ax.annotate(f"{df['GPU Cores'][i]} Cores, {df['BW (GB/s)'][i]} GB/s", (x[i], y_series[0][i]))


# Creating plots for PP vs Cores and TG vs Bandwidth
fig, axs = plt.subplots(1, 2, figsize=(15, 6))
fig.suptitle('PP vs GPU Cores and TG vs Bandwidth for F16, Q8_0, and Q4_0')

# PP vs GPU Cores
y_series_cores_pp = [df["F16 PP (t/s)"], df["Q8_0 PP (t/s)"], df["Q4_0 PP (t/s)"]]
plot_multi_series(axs[0], df["GPU Cores"], y_series_cores_pp,
                  ['F16 PP', 'Q8_0 PP', 'Q4_0 PP'], 'GPU Cores', 'Performance (t/s)',
                  'PP Performance vs GPU Cores', 1)

# TG vs Bandwidth
y_series_bw_tg = [df["F16 TG (t/s)"], df["Q8_0 TG (t/s)"], df["Q4_0 TG (t/s)"]]
plot_multi_series(axs[1], df["BW (GB/s)"], y_series_bw_tg,
                  ['F16 TG', 'Q8_0 TG', 'Q4_0 TG'], 'Bandwidth (GB/s)', 'Performance (t/s)',
                  'TG Performance vs Bandwidth', 2)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

Description

This is a collection of short llama.cpp benchmarks on various Apple Silicon hardware. It can be useful for comparing the performance llama.cpp achieves across the M-series chips, and it will hopefully help people decide whether an upgrade is worth it. For simplicity, this post collects results for Apple Silicon only; a similar collection for A-series chips is available here: #4508

If you are a collaborator on the project and have an Apple Silicon device, please add your device, the results of the following command, and optionally your username directly into this post (requires LLaMA 7B v2):

git checkout 8e672efe
make clean && make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf  \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null
  • Make sure to run the benchmark on commit 8e672ef
  • Please also include the F16 model as shown, not just the quantized models
  • Contributors can post the same results in the comments below
  • If a device is already benchmarked and your results are comparable, there is no need to add it again
  • PP means "prompt processing" (bs = 512), TG means "text-generation" (bs = 1), t/s means "tokens per second"
  • ✅ means the data has been added to the summary

Note that in this benchmark we are evaluating performance on the same build 8e672ef (2023 Nov 13) for every device, in order to keep all performance factors even. Since then, there have been multiple improvements resulting in better absolute performance. As an example, here is how the same test compares against the build 86ed72d (2024 Nov 21) on the M2 Ultra:

| Chip / build | BW [GB/s] | GPU Cores | F16 PP [t/s] | F16 TG [t/s] | Q8_0 PP [t/s] | Q8_0 TG [t/s] | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra 8e672ef | 800 | 76 | 1401.85 | 41.02 | 1248.59 | 66.64 | 1238.48 | 94.27 |
| M2 Ultra 86ed72d + FA | 800 | 76 | 1525.95 | 43.15 | 1368.18 | 73.11 | 1391.78 | 108.80 |

M1 Pro, 8+2 CPU, 16 GPU (@ggerganov) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 302.14 ± 0.07
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 12.75 ± 0.00
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 270.37 ± 0.02
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 22.34 ± 0.00
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 266.25 ± 0.07
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 36.41 ± 0.01

build: 8e672ef (1550)

M2 Ultra, 16+8 CPU, 76 GPU (@ggerganov) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 1401.85 ± 1.75
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 41.02 ± 0.02
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 1248.59 ± 0.73
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 66.64 ± 0.02
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 1238.48 ± 0.76
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 94.27 ± 0.05

build: 8e672ef (1550)

M3 Max (MBP 14), 12+4 CPU, 40 GPU (@slaren) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 794.26 ± 3.16
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 25.27 ± 0.07
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 749.37 ± 8.35
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 43.00 ± 0.12
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 690.99 ± 33.76
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 65.85 ± 0.22

build: d103d93 (1553)

Footnotes

[^1]: https://en.wikipedia.org/wiki/Apple_M1#Variants
[^2]: https://en.wikipedia.org/wiki/Apple_M2#Variants
[^3]: https://en.wikipedia.org/wiki/Apple_M3#Variants
[^4]: https://en.wikipedia.org/wiki/Apple_M4#Variants


M2 Mac Mini, 4+4 CPU, 10 GPU, 24 GB Memory (@QueryType) ✅


model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 201.34 ± 0.21
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 6.72 ± 0.01
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 181.40 ± 0.05
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 12.21 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 179.57 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 21.91 ± 0.02

build: 8e672ef (1550)


M2 Max Studio, 8+4 CPU, 38 GPU ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 755.67 ± 0.11
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 24.65 ± 0.02
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 677.91 ± 0.26
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 41.83 ± 0.03
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 671.31 ± 0.20
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 65.95 ± 0.08

build: 8e672ef (1550)

@maver1ck

Wow. I wasn't aware that the 4090 is so fast.

@vitali-fridman

This is from hardware that is one or two generations old, but it's for a 70B model, which might be of interest.

CPU: AMD 3995WX, GPU: 2x Nvidia 3090, Ubuntu 23.10, Kernel 6.5.0-14, NV Driver: 545.23.08, CUDA: 12.3.1

model size params backend ngl test t/s
llama 70B Q4_0 36.20 GiB 68.98 B CUDA 99 pp 512 179.29 ± 2.83
llama 70B Q4_0 36.20 GiB 68.98 B CUDA 99 tg 128 21.17 ± 0.04

For comparison, 7B model on the same hardware

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 pp 512 1178.60 ± 88.08
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 tg 128 87.34 ± 0.89
@zotona

Could you try a 7B model for a correct comparison? Thanks!

@pukhrajvansh

What the hell, this is lower than the M4 Max? I mean, 2x 3090... what??

@atlas5301

> What the hell, this is lower than the M4 Max? I mean, 2x 3090... what??

Probably because llama.cpp is not that well optimized for GPUs. You can expect significantly better throughput with SGLang and vLLM.


M2 Ultra, 16+8 CPU, 60 GPU (@crasm) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 1128.59 ± 0.82
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 39.86 ± 0.01
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 1003.16 ± 0.39
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 62.14 ± 0.03
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 1013.81 ± 0.92
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 88.64 ± 0.06

build: 8e672ef (1550)


M3 Max (MBP 16), 12+4 CPU, 40 GPU (@ymcui) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 779.17 ± 0.49
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 25.09 ± 0.01
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 757.64 ± 1.03
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 42.75 ± 0.06
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 759.70 ± 2.26
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 66.31 ± 0.12

build: 55978ce (1555)


Short note: mostly similar to the results reported by @slaren, but for Q4_0 pp 512 my result is 759.70 ± 2.26, while the one in the main post is 690.99 ± 33.76. Not sure about the source of the difference.

@slaren

I am not sure why, but the results that I get are not very consistent. I suspect that it may be due to the cooling limitations of the smaller laptop. I repeated the test now and the results are very similar to yours.

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 787.24 ± 0.84
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 25.15 ± 0.02
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 755.88 ± 1.56
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 42.64 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 760.65 ± 0.77
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 66.35 ± 0.24

In the graph, why is PP t/s plotted against bandwidth and TG t/s plotted against GPU cores? Seems like GPU cores have more effect on PP t/s.


How about also sharing the largest model sizes and context lengths people can run with their amount of RAM? It's important to get the amount of RAM right when buying Apple computers because you can't upgrade later.

@ggerganov (Maintainer), Nov 24, 2023

You can compute these. By default, you can use ~75% of the total RAM with the GPU. You can use more with some tricks.
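
For a rough sense of what fits, here is a minimal sketch of that calculation in Python, assuming the ~75% default mentioned above (the precise limit is whatever Metal reports as `recommendedMaxWorkingSetSize`); the 1 GiB context/compute overhead is an illustrative assumption, not a measured value:

```python
# Rough fit check based on the ~75% rule of thumb above.
# The overhead constant is an assumed placeholder, not a measurement.
def fits_on_gpu(total_ram_gib: float, model_gib: float, overhead_gib: float = 1.0) -> bool:
    usable_gib = 0.75 * total_ram_gib       # default fraction usable by the GPU
    return model_gib + overhead_gib <= usable_gib

print(fits_on_gpu(16, 3.56))    # 7B Q4_0 (3.56 GiB) on a 16 GiB machine -> True
print(fits_on_gpu(16, 12.55))   # 7B F16 (12.55 GiB) on a 16 GiB machine -> False
```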


M2 Pro, 6+4 CPU, 16 GPU (@minosvasilias) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 312.65 ± 15.75
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 12.47 ± 0.71
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 288.46 ± 0.06
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 22.70 ± 0.12
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 294.24 ± 0.10
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 37.87 ± 0.10

build: e9c13ff (1560)


Would love to see how M1 Max and M1 Ultra fare given their high memory bandwidth.


M2 MAX (MBP 16) 8+4 CPU, 38 GPU, 96 GB RAM (@MrSparc) ✅

model size params backend ngl test t/s
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 674.50 ± 0.58
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 41.79 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 669.51 ± 1.17
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 64.55 ± 1.36

build: e9c13ff (1560)

@rlippmann

I'm also using an MBP 16 M2 Max with the same CPU/GPU specs but only 32 GB RAM, and my results are roughly the same:

M2 MAX (MBP 16) 8+4 CPU, 38 GPU, 32 GB RAM ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 747.99 ± 0.28
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 24.54 ± 0.22
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 674.37 ± 0.63
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 40.67 ± 0.05
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 668.28 ± 0.24
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 62.98 ± 0.06

build: 22da055 (1566)

@MrSparc

Yes, it is expected that the same CPU/GPU spec gives similar performance values for the same models regardless of RAM, as long as the model fits in memory.
The amount of RAM only limits the size of the model that can be loaded, since by default only 75% of the unified memory can be used as VRAM by the GPU:
https://github.com/ggerganov/llama.cpp#memorydisk-requirements


M1 Max (MBP 16) 8+2 CPU, 32 GPU, 64GB RAM (@CedricYauLBD) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 599.53 ± 0.86
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 23.03 ± 0.09
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 537.37 ± 0.19
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 40.20 ± 0.03
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 530.06 ± 0.17
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 61.19 ± 0.15

build: e9c13ff (1560)

Note: M1 Max RAM Bandwidth is 400GB/s


Look at what I started

@yxzwayne

off topic, but your benchmark output is my desktop rn :D
(screenshot attached)


M3 Pro (MBP 14), 5+6 CPU, 14 GPU (@paramaggarwal) ✅

model size params backend ngl test t/s
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 272.11 ± 1.40
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 17.44 ± 0.42
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 269.49 ± 1.14
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 30.65 ± 0.20

build: e9c13ff (1560)

@ggerganov (Maintainer), Nov 25, 2023

This one has 150 GB/s memory bandwidth, correct?

@paramaggarwal

Yes, that's correct. (source)

@Kaszebe

Could it run a Q5 quant of Llama 3 70B Instruct at ~2 tokens per second?

@mladencucakSYN

I'm also interested to see whether it can run a somewhat bigger model with a reasonably usable outcome. I just don't want to spend MBP Max money.

@bagobones

The old models give excellent comparative numbers, but I wonder if the benchmark needs to be rebased around the currently most popular models at some point.

Not just bigger ones for finding the biggest, but popular families/distillations that go from small to very large.

It looks like 96-128 GB of shared memory will be practical on Apple / AMD / NVIDIA DIGITS going forward.


| Chip (vs. Predecessor) | F16 PP | F16 TG | Q8_0 PP | Q8_0 TG | Q4_0 PP | Q4_0 TG |
| --- | --- | --- | --- | --- | --- | --- |
| M2 Pro (16) vs. M1 Pro (16) | 312.65 vs. 302.14 (+3.48%) | 12.47 vs. 12.75 (-2.20%) | 288.46 vs. 270.37 (+6.69%) | 22.7 vs. 22.34 (+1.61%) | 294.24 vs. 266.25 (+10.51%) | 37.87 vs. 36.41 (+4.01%) |
| M2 Max (38) vs. M1 Max (32) | 755.67 vs. 599.53 (+26.04%) | 24.65 vs. 23.03 (+7.03%) | 677.91 vs. 537.37 (+26.15%) | 41.83 vs. 40.2 (+4.05%) | 671.31 vs. 530.06 (+26.65%) | 65.95 vs. 61.19 (+7.78%) |
| M2 Ultra (60) vs. M2 Max (38) | 1128.59 vs. 755.67 (+49.34%) | 39.86 vs. 24.65 (+61.90%) | 1003.16 vs. 677.91 (+48.04%) | 62.14 vs. 41.83 (+48.48%) | 1013.81 vs. 671.31 (+51.03%) | 88.64 vs. 65.95 (+34.41%) |
| M2 Ultra (76) vs. M2 Max (38) | 1401.85 vs. 755.67 (+85.67%) | 41.02 vs. 24.65 (+66.45%) | 1248.59 vs. 677.91 (+84.24%) | 66.64 vs. 41.83 (+59.47%) | 1238.48 vs. 671.31 (+84.53%) | 94.27 vs. 65.95 (+43.06%) |
| M2 Ultra (76) vs. M2 Ultra (60) | 1401.85 vs. 1128.59 (+24.25%) | 41.02 vs. 39.86 (+2.91%) | 1248.59 vs. 1003.16 (+24.43%) | 66.64 vs. 62.14 (+7.23%) | 1238.48 vs. 1013.81 (+22.19%) | 94.27 vs. 88.64 (+6.33%) |
| M3 Pro (14) vs. M2 Pro (16) | | | 272.11 vs. 288.46 (-5.67%) | 17.44 vs. 22.7 (-23.17%) | 269.49 vs. 294.24 (-8.41%) | 30.65 vs. 37.87 (-19.07%) |
| M3 Max (40) vs. M2 Max (38) | 779.17 vs. 755.67 (+3.11%) | 25.09 vs. 24.65 (+1.78%) | 757.64 vs. 677.91 (+11.76%) | 42.75 vs. 41.83 (+2.20%) | 759.7 vs. 671.31 (+13.17%) | 66.31 vs. 65.95 (+0.55%) |

### M2 MAX (MBP 16) 38 Core 32GB

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 754.39 ± 0.36
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 24.31 ± 0.38
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 671.33 ± 2.65
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 40.85 ± 0.32
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 664.07 ± 9.11
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 63.29 ± 0.15

build: 795cd5a (1493)


I'm looking at the summary plot "PP Performance vs GPU Cores", and it shows that the original unquantized F16 model always delivers higher PP performance than the quantized models.
Sorry if my question is silly, I'm new to this area, but can someone explain why the original model is faster here than the quantized ones? Thanks

@ggerganov (Maintainer), Nov 26, 2023

The question is not silly - the observation is expected. At large batch sizes (PP means a batch size of 512) the computation is compute bound, i.e. the speed depends on how many FLOPS you can utilize. For quantized models, the existing kernels require extra compute to dequantize the data, compared to F16 models where the data is already in the F16 format.
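
As a rough illustration of the compute-bound side, one can back out the effective FLOP rate implied by a PP result. This is a sketch with simplifying assumptions (about 2 FLOPs per weight per token, attention and dequantization overhead ignored), not an exact accounting:

```python
# Effective compute implied by a prompt-processing result.
params = 6.74e9        # LLaMA 7B v2 parameter count
pp_tok_s = 1401.85     # M2 Ultra (76 GPU cores), F16, pp 512, from the summary table

flops_per_token = 2 * params            # ~2 FLOPs per weight per token (rough)
effective_tflops = flops_per_token * pp_tok_s / 1e12
print(f"~{effective_tflops:.0f} TFLOPS sustained during pp 512")   # ~19 TFLOPS
```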


M4 Pro, 8+4 CPU, 16 GPU, 24 GB Memory (MBP 14) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 381.14 ± 0.06
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 17.19 ± 0.04
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 367.13 ± 0.06
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 30.54 ± 0.01
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 364.06 ± 0.11
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 49.64 ± 0.01

build: 8e672ef (1550)


M4 Max (Macbook Pro 14" 2024), 12+4 CPU, 40 GPU, 128 GB Memory

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 923.55 ± 0.12
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 31.61 ± 0.10
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 852.47 ± 48.37
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 53.06 ± 0.48
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 746.09 ± 29.30
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 82.52 ± 0.13

build: 8e672ef (1550)

@maciejjedrzejczyk

@eightpigs can you please confirm that this is indeed a 14'' MBP that you used for testing? The spec you provided is only available for the 16'' MBP M4 Max, which has a higher memory bandwidth than the 14'' MBP M4 Max model (546 GB/s vs 410 GB/s).

@eightpigs

@maciejjedrzejczyk This is the result from my testing on a 14’’ MBP. The specs I provided are correct, and here are the details:

> system_profiler SPHardwareDataType SPDisplaysDataType 
Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: Mac16,6
      Chip: Apple M4 Max
      Total Number of Cores: 16 (12 performance and 4 efficiency)
      Memory: 128 GB
      ...

Graphics/Displays:

    Apple M4 Max:

      Chipset Model: Apple M4 Max
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 40
      ...

The M4 Max memory bandwidth can go up to 546GB/s: https://support.apple.com/en-us/121553

@maciejjedrzejczyk

Thank you for the confirmation. My confusion came from using the Apple Store configurator for my country, which only showed a single available spec for the 14'' MBP M4 Max, with lower memory bandwidth (36 GB RAM). I like the 14'' form factor much more than the 16'' version, and that was the only missing part in my research :) Just a follow-up question: would you consider the thermals (temperature on your lap, fan noise, multitasking, etc.) on this machine acceptable while running LLM inference?

@eightpigs

I also prefer the 14-inch MBP.

As for noise, I mostly run 7B or 14B models, so noise hasn’t been an issue for me. Here’s some data I tracked with my Apple Watch for reference:

  • DeepSeek-R1-Distill-Llama-70B-8bit: Fan noise is around 56dB.
  • DeepSeek-R1-Distill-Qwen-32B-MLX-8bit: Fan noise is around 48dB.
  • DeepSeek-R1-Distill-Qwen-14B-8bit: Fan is almost silent.

On multitasking, I haven’t run into any scenarios where I felt the machine was under pressure. Performance has been more than enough for me.

@olegshulyakov

Can you please test Gemma 3 27B for me?


Which models can my M3 16GB MacBook Air support?

@gsgxnet

It depends on how much RAM you want to dedicate to AI inference. I think if you tweak your macOS settings you might have the option to use up to 12 GB for the model, so a model with 20B parameters quantized down to 4-bit might just work. If it is a plain M3 you have, inference speed might be too slow, so you would probably stay with a smaller model.
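
A minimal sketch of the sizing arithmetic behind that estimate; the bits-per-weight and overhead factors are assumptions for illustration, not measurements:

```python
# Rough in-memory size of a quantized model: params * bits-per-weight / 8,
# plus assumed overhead for embeddings, norms and a modest KV cache.
def approx_size_gib(params_billion: float, bits_per_weight: float, overhead: float = 1.15) -> float:
    total_bytes = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return total_bytes / 2**30

print(f"20B @ ~4.5 bpw: {approx_size_gib(20, 4.5):.1f} GiB")  # right at a ~12 GiB budget
print(f" 7B @ ~4.5 bpw: {approx_size_gib(7, 4.5):.1f} GiB")
```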

@DeconstructingAmbiguity

Thank you for the thoughtful response. I am keeping an eye out for any tests here that most closely resemble my system.

@Crear12

You can try DeepSeek-R1 14B Q4_K_M from Ollama, it's only 9.0 GB:
ollama run deepseek-r1:14b


why specifically is the M2 so cracked compared to the M3 and M4?

@byrongibson

I think it's primarily due to memory bandwidth (first column). Where they have the same bandwidth, the results are close. But in cases where the M2 Max or Ultra has substantially higher bandwidth, it outperforms the equivalent M3 or M4.


M4 Max (Macbook Pro 16" 2024), 16 CPU, 40 GPU, 128 GB Memory

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 920.48 ± 3.25
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 31.56 ± 0.07
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 891.32 ± 0.95
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 53.75 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 884.59 ± 0.78
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 82.36 ± 0.22

Used the command below:

git checkout 8e672ef
make clean && make -j llama-bench && ./llama-bench \
  -m ./models/Llama-2-7b-chat-f16.gguf \
  -m ./models/llama-2-7b-chat.Q8_0.gguf \
  -m ./models/llama-2-7b-chat.Q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null

HEAD is now at 8e672ef stablelm : simplify + speedup generation (#4153)


Okay Apple... no M4 Ultra in the Mac Studio, but an M4 Max or M3 Ultra... planning a new Mac Pro, huh?!
https://www.apple.com/shop/buy-mac/mac-studio

So let's add these guys to the table 😁

@gsgxnet

You know, "up to 16.9" is marketing - maybe Apple ran that speed comparison with a model which did not fit into unified memory before and now does. We can tweak memory settings to make a bigger-than-default part of the RAM available to the GPU, but never all of it. See #2182 (comment) and all the messages above.

Or Apple might offer a much more performant MLX option with the M3 Ultra. Who knows? Benchmarks will tell in a few days, I assume.

@bluemoehre

Will this end up like NVIDIA's RTX series? Today some M4s (mostly those with maxed-out RAM) are no longer listed on the website in several countries - if you can load the page at all. Apple seems to have had major backend issues for hours now. Hmm.

@fairydreaming

Found some numbers: https://creativestrategies.com/mac-studio-m3-ultra-ai-workstation-review/
It seems to be TG only, no PP.

@Thireus

Nobody seems to have posted any pp so far... and I wonder why.

Edit: https://www.reddit.com/r/LocalLLaMA/comments/1j9jfbt/comment/mhe1ku9/

@netrunnereve

> Nobody seems to have posted any pp so far... and I wonder why.

Considering the article's bias I'm not surprised. The 5090's going to destroy the Mac when the model fully fits in VRAM, so the author uses a 128k context and swapping on the 5090 (not even partial offloading) to make the Mac appear more effective. For prompt processing I think the 5090 might actually beat the Mac even with partial offloading. IMO he should have also done tests with smaller contexts to show the distinction between a model that fits in VRAM (5090 wins) and one that doesn't (Mac wins).

Our llama-bench should be the standard for testing llama.cpp but sadly a lot of people don't know about it.
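
To make that distinction concrete, here is a toy per-token model of why spilling out of VRAM hurts TG so badly. All numbers are illustrative assumptions (not taken from this thread), and PCIe transfers and scheduling overhead are ignored:

```python
# Toy text-generation estimate: each token streams every weight once, and the
# layers living in slow memory dominate the per-token time.
def tg_estimate(model_gb: float, gpu_fraction: float, gpu_bw: float, cpu_bw: float) -> float:
    t = model_gb * gpu_fraction / gpu_bw + model_gb * (1 - gpu_fraction) / cpu_bw
    return 1.0 / t   # tokens/s upper bound

# ~36 GB Q4_0 70B: fully resident in 800 GB/s unified memory vs.
# 75% on a ~1800 GB/s GPU with the rest in ~90 GB/s system RAM.
print(f"unified 800 GB/s     : {tg_estimate(36, 1.00,  800, 90):.1f} t/s")
print(f"75% offloaded to GPU : {tg_estimate(36, 0.75, 1800, 90):.1f} t/s")
```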


M3 Ultra 20+8 CPU, 60 GPU, 256GB RAM ✅


./llama-bench -m models/Llama-2-7b-chat-f16.gguf -m models/llama-2-7b-chat.Q8_0.gguf -m models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 1121.80 ± 2.33
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 42.24 ± 0.05
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 1085.76 ± 0.90
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 63.55 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 1073.09 ± 1.29
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 88.40 ± 0.44

build: 8e672ef (1550)

@arty-hlr

So it looks like identical speeds to the M2 Ultra, because of the same bandwidth... Not worth buying, IMO!

@marcingomulkiewicz

Well, the entry-level model with 256 GB of RAM costs exactly the same as the 192 GB model did previously, plus (if one is rich) there's a 512 GB version, so even though the speed seems similar, there's still an argument to be made in favour of those.

@bluemoehre

I've already seen several benchmarks saying that the M3 Ultra only makes sense in terms of memory capacity, not performance. It seems you get better value for money with an M4 Max.

Still, I wonder what the TG performance is with a ~500 GB model vs. a ~50 GB model on the same machine.

@marcingomulkiewicz

All else equal, probably about 10x slower, as there are 10x as many weights, no matter whether it's memory or compute bound. But it's not that simple: the >600B DeepSeek V3/R1 models are MoE models with, IIRC, ~37B active parameters per token, so I'd expect them to run much (2x?) faster than a 70B Llama.
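
A one-line sanity check of that ratio, under the assumption that TG cost tracks the weights actually read per token (routing overhead and KV-cache traffic ignored):

```python
# Dense 70B vs. an MoE with ~37B active parameters per token, at the same quantization.
dense_b, active_b = 70, 37
print(f"expected TG speedup of the MoE over a dense 70B: ~{dense_b / active_b:.1f}x")  # ~1.9x
```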


M3 Ultra 24+8 CPU, 80 GPU, 512GB RAM ✅

./llama-bench -m ./models/llama-7b-v2/ggml-model-f16.gguf -m ./models/llama-7b-v2/ggml-model-q8_0.gguf -m ./models/llama-7b-v2/ggml-model-q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 1538.34 ± 2.14
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 39.78 ± 0.06
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 1487.51 ± 1.57
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 63.93 ± 0.24
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 1471.24 ± 1.05
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 92.14 ± 0.66

build: 8e672ef (1550)


Can someone please test the base Mac Ultra M4 :)

@marcingomulkiewicz

Doubtful. The M4 Ultra does not exist, at least not yet.

@ilcommm

Of course, you're right. I obviously meant the base Mac Studio with M4 Max.

@ilcommm

I’m just choosing between a Mac Studio M2 Max with 32GB for 1,888 and a Mac Studio M4 Max with 36GB for 2,705 (in our shops). Trying to figure out if the performance boost is worth the extra cost.

@mladencucakSYN

which shops would that be? :D

@ilcommm

Sad shops in Russia :)). Those are USD prices.


Cost Per Token April 2025 ~ Bang for Buck

Seems like the M4 Mac Mini is the cheapest instant win for now, with the M1 Max Studio coming in a close second. (The cost column is the listed price divided by the Q4_0 TG speed; see the sketch after the table.)

| Product | Amazon Link | Price (USD) | Tokens/sec (t/s) | Token Cost |
| --- | --- | --- | --- | --- |
| M1 MacBook Air | https://amzn.to/42gTl9Z | 584 | 14.19 | $41.16 |
| M1 MacBook Pro | https://amzn.to/426VINP | 777 | 14.15 | $54.91 |
| M1 Pro MacBook Pro 14" | https://amzn.to/3FST7hK | 823 | 35.52 | $23.17 |
| M1 Pro MacBook Pro 16" | https://amzn.to/4leWSOC | 901 | 36.41 | $24.75 |
| M1 Max MacBook Pro 14" | https://amzn.to/3XImWaV | 1299 | 54.61 | $23.79 |
| M1 Max MacBook Pro 16" | https://amzn.to/3XImWaV | 1551 | 61.19 | $25.35 |
| M1 Max Mac Studio | https://amzn.to/41UMFiJ | 1385 | 61.19 | $22.63 |
| M1 Ultra Mac Studio | https://amzn.to/429gScF | 1980 | 74.93 | $26.42 |
| M2 MacBook Air | https://amzn.to/3YcCIe4 | 749 | 21.7 | $34.52 |
| M2 MacBook Pro | https://amzn.to/3XInrBP | 835 | 21.91 | $38.11 |
| M2 Pro MacBook Pro 14" | https://amzn.to/4cq9Y7O | 1180 | 37.87 | $31.16 |
| M2 Pro MacBook Pro 16" | https://amzn.to/4i0pIzt | 1502 | 38.86 | $38.65 |
| M2 Max MacBook Pro 14" | https://amzn.to/4lf04K0 | 1885 | 60.99 | $30.91 |
| M2 Max MacBook Pro 16" | https://amzn.to/4hVTm97 | 2014 | 65.95 | $30.54 |
| M2 Max Mac Studio | https://amzn.to/4hWJ7Bm | 1799 | 60.99 | $29.50 |
| M2 Ultra Mac Studio | https://amzn.to/4jiIVgN | 3889 | 88.64 | $43.87 |
| M2 Ultra Mac Studio | https://amzn.to/4jiIVgN | 3889 | 94.27 | $41.25 |
| M3 Pro MacBook Pro 14" | https://amzn.to/4jfqqda | 1286 | 30.74 | $41.83 |
| M3 Pro MacBook Pro 16" | https://amzn.to/4llV2M3 | 1976 | 30.74 | $64.28 |
| M3 Max MacBook Pro | https://amzn.to/3R3jWlD | 2959 | 56.58 | $52.30 |
| M3 Ultra Mac Studio | https://www.cornellstore.com/Mac-Studio-M3-Ultra | 3599 | 88.4 | $40.71 |
| M4 Mac Mini | https://amzn.to/43Eb1Pa | 549 | 24.11 | $22.77 |
| M4 MacBook Air | https://amzn.to/4cl0ISi | 949 | 24.11 | $39.36 |
| M4 Pro MacBook Pro 14" | https://amzn.to/3G2TVjW | 1786 | 49.64 | $35.98 |
| M4 Pro MacBook Pro 16" | https://amzn.to/4hYXogP | 1880 | 50.74 | $37.05 |
| M4 Max MacBook Pro | https://amzn.to/43xoCYr | 2849 | 83.06 | $34.30 |
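
For clarity, the "Token Cost" column above appears to be the listed price divided by the Q4_0 TG throughput, i.e. dollars per token/s of sustained generation. A quick check against two rows of the table:

```python
# "Token Cost" = price / Q4_0 TG throughput (dollars per token/s of generation speed).
def dollars_per_tok_per_s(price_usd: float, tg_tok_s: float) -> float:
    return price_usd / tg_tok_s

print(round(dollars_per_tok_per_s(584, 14.19), 2))  # M1 MacBook Air -> 41.16
print(round(dollars_per_tok_per_s(549, 24.11), 2))  # M4 Mac Mini   -> 22.77
```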
@arty-hlr

  1. Is this really the place to post a bunch of sponsored links??
  2. You are not computing a "token cost", which doesn't really make sense; you are computing the cost per token/s of throughput.
@shimza

Point taken, but I've looked at this chart for months and really wanted to know which Mac to buy; I wanted to optimize my spend for obvious reasons.

Yeah, fair enough - it is labelled "Token Cost", but it is self-explanatory as the price relative to the speed figure in the prior column.


... cross-posted to the Vulkan thread:

Mac Pro 2013 🗑️ 12-core Xeon E5-2697 v2, Dual FirePro D700, 64 GB RAM, MacOS Monterey

Note: I've updated this post -- I realized when I posted the first time I was so excited to see the GPUs doing stuff that I didn't check whether they were working right. Turns out they were not! So I recompiled MoltenVK and llama.cpp with some tweaks and checked that the models were working correctly before re-benchmarking. When the system was spitting garbage it was running about 30% higher t/s rates across the board.

Full HOWTO on getting the Mac Pro D700s to accept layers here: https://github.com/lukewp/TrashCanLLM/blob/main/README.md

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

model size params backend threads test t/s
llama 7B Q8_0 6.67 GiB 6.74 B Vulkan,BLAS 12 pp512 68.55 ± 0.25
llama 7B Q8_0 6.67 GiB 6.74 B Vulkan,BLAS 12 tg128 11.05 ± 0.03
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 12 pp512 68.86 ± 0.16
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 12 tg128 16.73 ± 0.05

build: d3bd719 (5092)

The F16 model was throwing garbage, so I did not include it here - it will require some unique flags to run correctly. Additionally, here are the 8-bit and 4-bit Llama 2 7B runs on the CPU alone (using the -ngl 0 flag):

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 0 2> /dev/null

model size params backend threads test t/s
llama 7B Q8_0 6.67 GiB 6.74 B Vulkan,BLAS 12 pp512 25.87 ± 0.56
llama 7B Q8_0 6.67 GiB 6.74 B Vulkan,BLAS 12 tg128 6.85 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 12 pp512 26.17 ± 0.06
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 12 tg128 10.85 ± 0.01

build: d3bd719 (5092)

(Proof-of-life screenshots of the GPU and CPU test runs attached.)


Just saying: shouldn't the OP be edited with the actually measured bandwidth numbers, rather than the marketing figures Apple gave to the press?

@AndreasKunar

I don't understand your post - llama.cpp token generation is able to pretty much saturate the RAM bus bandwidth on my Macs, on my Snapdragon X and on my NVIDIA Jetson, all with very comparable results relative to the "marketing" GB/s calculated from the RAM's transactions/s limit times its bus width. If someone is lying, they all seem to do it consistently.

Normal CPU operations might not fully max out the RAM bus bandwidth, but NEON and similar instructions, and the GPUs, apparently can.

@mirh

The bandwidth in the first post is clearly not measured; it seems obvious it's just a copy-paste of the official PR numbers.
I don't really know what the actual GPU limit is, since AnandTech (RIP) never measured that specifically, but at the very least the Ultra numbers are patently dubious.
It's literally two Maxes side by side. Even if a single one were 400 GB/s, adding another 400 GB/s die cannot physically make 800 GB/s.

@AndreasKunar

OK, if you want to be negative and say that "it cannot be" without any evidence applying to the measurements here, I don't care and don't want to waste my time feeding trolls.

I think it's perfect to compare llama.cpp's performance with the theoretical maximum memory bandwidth the system is designed for (max transactions/s of the RAM x data-bus width in bytes). It is a measure of how well llama.cpp's TG code can leverage the theoretical limit imposed by memory bandwidth on that hardware. And, as evidence, the TG data for the same model/build matches this quite well, as I mentioned above. E.g. M2-series TG F16 (quantization has an impact): M2 100 GB/s ~6.5 tok/s, M2 Pro 200 GB/s ~13 tok/s, M2 Max 400 GB/s ~25 tok/s, M2 Ultra 800 GB/s ~41 tok/s (some impact of the split-chip design; it got better from M1 to M2 as Apple learned). My Snapdragon X / Jetson Orin NX only have 16 GB of unified RAM and cannot really run F16, but Q4_0/Q8_0 match (with some differences based on their hardware support for the quantization algorithms). This discussion currently enables predicting a probable range for TG tokens/s based on a designed RAM bandwidth - e.g. let's see if my NVIDIA DGX Spark prediction based on its 256 GB/s holds up.

P.S. The M-series UltraFusion interconnect is many times faster than the RAM bandwidth. I found no hard evidence that the dies do RAM-address interleaving, but they probably do, according to some internet gossip - so there is no reason the combined chip cannot do double an individual chip's RAM transfer rate into its caches. It definitely has some interconnect overhead, but it's not a typical NUMA multi-processor architecture, which would require llama.cpp to use a special tensor memory and operations layout.
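
A small sketch of that kind of prediction, using F16 TG rows from the summary table: the ceiling is the marketing bandwidth divided by the bytes read per token (roughly the model size), and the ratio of measured TG to that ceiling gives the apparent bandwidth utilization. The "read every weight once per token" model and the GiB-to-GB conversion are simplifications:

```python
# TG ceiling ~= memory bandwidth / bytes-per-token (~model size);
# utilization = measured / ceiling.
rows = [
    # (chip, BW GB/s, model size GiB, measured F16 TG t/s) from the summary table
    ("M2",       100, 12.55,  6.72),
    ("M2 Pro",   200, 12.55, 13.06),
    ("M2 Max",   400, 12.55, 24.65),
    ("M2 Ultra", 800, 12.55, 41.02),
]
for chip, bw, size_gib, tg in rows:
    ceiling = bw / (size_gib * 1.024**3)   # GiB -> GB
    print(f"{chip:9s} ceiling ~{ceiling:5.1f} t/s, measured {tg:5.1f} t/s ({tg / ceiling:.0%})")
```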

@mirh

> I think it's perfect to compare llama.cpp's performance with the theoretical maximum memory bandwidth the system is designed for

Yes, and what I'm telling you is that those theoretical numbers are not substantiated anywhere (and regardless, it seems very odd for every other figure here to be empirical, except the bandwidth, which is taken for granted with unknown rounding).

> I found no hard evidence that the dies do RAM-address interleaving, but they probably do, according to some internet gossip

According to some other internet gossip, it may actually just have been the M1 Ultra that was a disaster.
And while, after much scavenging of the net, I found some benchmarks that somewhat reduced my contempt (truthfully, the GPU really is privileged), for the biggest, most ambitious chips there is still a 20-25% gap from the datasheet.


M3 Ultra (Mac Studio 2025) 24+8 CPU, 80 GPU, 512GB RAM

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 1527.74 ± 2.02
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 40.10 ± 0.10
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 1488.84 ± 2.52
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 64.16 ± 0.38
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 1473.76 ± 1.09
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 91.93 ± 0.48

build: 8e672ef (1550)


M1 (MacBook Air 2020) 8 CPU, 8GPU, 16GB RAM

model size params backend ngl test t/s
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 115.67 ± 0.88
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 14.13 ± 0.04
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 121.73 ± 1.43
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 7.69 ± 0.12

build: 8e672ef (1550)

model size params backend threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 131.46 ± 6.71
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 tg128 13.99 ± 0.14
llama 7B Q8_0 6.67 GiB 6.74 B Metal,BLAS 4 pp512 133.34 ± 1.17
llama 7B Q8_0 6.67 GiB 6.74 B Metal,BLAS 4 tg128 7.67 ± 0.02

build: 3e0be1c (5410)


Finally got the results I was asking about here recently 😊

Though I had to purchase a Mac Studio with the M4 Max chip myself to achieve this.

M4 MAX (Mac Studio 2024), 14 CPU, 32 GPU, 36 GB RAM

llama.cpp % ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 747.59 ± 0.92
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 25.58 ± 0.01
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 720.38 ± 0.04
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 43.80 ± 0.03
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 715.74 ± 0.52
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 69.24 ± 0.09

build: 8e672ef (1550)

On new build:
./build/bin/llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null

model size params backend threads test t/s
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 10 pp512 790.33 ± 0.49
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 10 tg128 26.05 ± 0.01
llama 7B Q8_0 6.67 GiB 6.74 B Metal,BLAS 10 pp512 702.39 ± 11.36
llama 7B Q8_0 6.67 GiB 6.74 B Metal,BLAS 10 tg128 44.89 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 10 pp512 762.81 ± 1.03
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 10 tg128 72.26 ± 0.07

build: b44890d (5440)

@arty-hlr

Keep in mind that performance is higher when using MLX instead of llama.cpp.

@olegshulyakov

@arty-hlr I was interested in it since the price is not far from that of a GPU, but the power consumption is much lower.

@arty-hlr

@olegshulyakov I got a refurbished M2 Ultra (76 cores, 64 GB) a few weeks ago and am not regretting it at all. Silent, very power efficient, good speeds; only prompt processing is slower than on NVIDIA GPUs, but that's a small compromise to make, IMO. At the moment I'm using LM Studio for inference since it has MLX as a backend, but I'll switch back to Ollama when they add it; LM Studio doesn't handle loading models as well as Ollama.

@ilcommm

> Keep in mind that performance is higher when using MLX instead of llama.cpp.

How do I get that for inference? Only via LM Studio currently?

@olegshulyakov

@ilcommm There is an mlx-lm server, similar to what llama.cpp provides.
