Performance of llama.cpp on Apple Silicon M-series #4167

ggerganov started this conversation in Show and tell

Summary

LLaMA 7B

| Chip | BW [GB/s] | GPU Cores | F16 PP [t/s] | F16 TG [t/s] | Q8_0 PP [t/s] | Q8_0 TG [t/s] | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✅ M1 [^1] | 68 | 7 | | | 108.21 | 7.92 | 107.81 | 14.19 |
| ✅ M1 [^1] | 68 | 8 | | | 117.25 | 7.91 | 117.96 | 14.15 |
| ✅ M1 Pro [^1] | 200 | 14 | 262.65 | 12.75 | 235.16 | 21.95 | 232.55 | 35.52 |
| ✅ M1 Pro [^1] | 200 | 16 | 302.14 | 12.75 | 270.37 | 22.34 | 266.25 | 36.41 |
| ✅ M1 Max [^1] | 400 | 24 | 453.03 | 22.55 | 405.87 | 37.81 | 400.26 | 54.61 |
| ✅ M1 Max [^1] | 400 | 32 | 599.53 | 23.03 | 537.37 | 40.2 | 530.06 | 61.19 |
| ✅ M1 Ultra [^1] | 800 | 48 | 875.81 | 33.92 | 783.45 | 55.69 | 772.24 | 74.93 |
| ✅ M1 Ultra [^1] | 800 | 64 | 1168.89 | 37.01 | 1042.95 | 59.87 | 1030.04 | 83.73 |
| ✅ M2 [^2] | 100 | 8 | | | 147.27 | 12.18 | 145.91 | 21.7 |
| ✅ M2 [^2] | 100 | 10 | 201.34 | 6.72 | 181.4 | 12.21 | 179.57 | 21.91 |
| ✅ M2 Pro [^2] | 200 | 16 | 312.65 | 12.47 | 288.46 | 22.7 | 294.24 | 37.87 |
| ✅ M2 Pro [^2] | 200 | 19 | 384.38 | 13.06 | 344.5 | 23.01 | 341.19 | 38.86 |
| ✅ M2 Max [^2] | 400 | 30 | 600.46 | 24.16 | 540.15 | 39.97 | 537.6 | 60.99 |
| ✅ M2 Max [^2] | 400 | 38 | 755.67 | 24.65 | 677.91 | 41.83 | 671.31 | 65.95 |
| ✅ M2 Ultra [^2] | 800 | 60 | 1128.59 | 39.86 | 1003.16 | 62.14 | 1013.81 | 88.64 |
| ✅ M2 Ultra [^2] | 800 | 76 | 1401.85 | 41.02 | 1248.59 | 66.64 | 1238.48 | 94.27 |
| 🟥 M3 [^3] | 100 | 8 | | | | | | |
| 🟨 M3 [^3] | 100 | 10 | | | 187.52 | 12.27 | 186.75 | 21.34 |
| 🟨 M3 Pro [^3] | 150 | 14 | | | 272.11 | 17.44 | 269.49 | 30.65 |
| ✅ M3 Pro [^3] | 150 | 18 | 357.45 | 9.89 | 344.66 | 17.53 | 341.67 | 30.74 |
| ✅ M3 Max [^3] | 300 | 30 | 589.41 | 19.54 | 566.4 | 34.3 | 567.59 | 56.58 |
| ✅ M3 Max [^3] | 400 | 40 | 779.17 | 25.09 | 757.64 | 42.75 | 759.7 | 66.31 |
| ✅ M3 Ultra [^3] | 800 | 60 | 1121.80 | 42.24 | 1085.76 | 63.55 | 1073.09 | 88.40 |
| ✅ M3 Ultra [^3] | 800 | 80 | 1538.34 | 39.78 | 1487.51 | 63.93 | 1471.24 | 92.14 |
| 🟥 M4 [^4] | 120 | 8 | | | | | | |
| ✅ M4 [^4] | 120 | 10 | 230.18 | 7.43 | 223.64 | 13.54 | 221.29 | 24.11 |
| ✅ M4 Pro [^4] | 273 | 16 | 381.14 | 17.19 | 367.13 | 30.54 | 364.06 | 49.64 |
| ✅ M4 Pro [^4] | 273 | 20 | 464.48 | 17.18 | 449.62 | 30.69 | 439.78 | 50.74 |
| 🟥 M4 Max [^4] | 410 | 32 | | | | | | |
| ✅ M4 Max [^4] | 546 | 40 | 922.83 | 31.64 | 891.94 | 54.05 | 885.68 | 83.06 |
| 🟥 M4 Ultra | 820 | 64 | | | | | | |
| 🟥 M4 Ultra | 1092 | 80 | | | | | | |

(Figure_2: scatter plots of PP performance vs. GPU cores and TG performance vs. memory bandwidth for F16, Q8_0 and Q4_0, generated with plot.py below)

plot.py
# GPT-4 Generated Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating DataFrame from the provided data
data = {
    "Chip": ["M1", "M1", "M1 Pro", "M1 Pro", "M1 Max", "M1 Max", "M1 Ultra", "M2", "M2 Pro", "M2 Pro", "M2 Max", "M2 Max", "M2 Ultra", "M2 Ultra", "M3", "M3 Pro", "M3 Pro", "M3 Max"],
    "BW (GB/s)":     [68, 68, 200, 200, 400, 400, 800, 100, 200, 200, 400, 400, 800, 800, 100, 150, 150, 400],
    "GPU Cores":     [7, 8, 14, 16, 24, 32, 48, 10, 16, 19, 30, 38, 60, 76, 10, 14, 18, 40],
    "F16 PP (t/s)":  [None, None, None, 302.14, 453.03, 599.53, 875.81, 201.34, 312.65, 384.38, 600.46, 755.67, 1128.59, 1401.85, None, None, 357.45, 779.17],
    "F16 TG (t/s)":  [None, None, None, 12.75, 22.55, 23.03, 33.92, 6.72, 12.47, 13.06, 24.16, 24.65, 39.86, 41.02, None, None, 9.89, 25.09],
    "Q8_0 PP (t/s)": [108.21, 117.25, 235.16, 270.37, 405.87, 537.37, 783.45, 181.4, 288.46, 344.5, 540.15, 677.91, 1003.16, 1248.59, 187.52, 272.11, 344.66, 757.64],
    "Q8_0 TG (t/s)": [7.92, 7.91, 21.95, 22.34, 37.81, 40.2, 55.69, 12.21, 22.7, 23.01, 39.97, 41.83, 62.14, 66.64, 12.27, 17.44, 17.53, 42.75],
    "Q4_0 PP (t/s)": [107.81, 117.96, 232.55, 266.25, 400.26, 530.06, 772.24, 179.57, 294.24, 341.19, 537.6, 671.31, 1013.81, 1238.48, 186.75, 269.49, 341.67, 759.7],
    "Q4_0 TG (t/s)": [14.19, 14.15, 35.52, 36.41, 54.61, 61.19, 74.93, 21.91, 37.87, 38.86, 60.99, 65.95, 88.64, 94.27, 21.34, 30.65, 30.74, 66.31]
}
df = pd.DataFrame(data)

# Helper function to plot and annotate multiple data series in the same plot
def plot_multi_series(ax, x, y_series, labels, xlabel, ylabel, title, poly_power=1):
    colors = ['r', 'g', 'b']  # Colors for different series
    for i, y in enumerate(y_series):
        # Sorting data for regression (convert to NumPy arrays for positional indexing)
        x_arr = np.asarray(x, dtype=float)
        y_arr = np.asarray(y, dtype=float)
        sorted_indices = np.argsort(x_arr)
        x_sorted = x_arr[sorted_indices]
        y_sorted = y_arr[sorted_indices]

        # Masking NaN values
        mask = ~np.isnan(y_sorted)
        x_sorted = x_sorted[mask]
        y_sorted = y_sorted[mask]

        # Fitting a polynomial regression model
        coefficients = np.polyfit(x_sorted, y_sorted, poly_power)
        polynomial = np.poly1d(coefficients)

        # Creating a range of x-values for a smoother trendline
        x_range = np.linspace(x_sorted.min(), x_sorted.max(), 500)
        trendline = polynomial(x_range)

        # Plotting
        ax.scatter(x, y, color=colors[i], label=labels[i], s=20)
        ax.plot(x_range, trendline, f"{colors[i]}-", linewidth=1)  # Trendline in the same color

    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.legend()

    # Annotating points with the number of GPU cores and bandwidth
    # (skip rows where the first series has no measurement)
    for i in range(len(df)):
        if not np.isnan(y_series[0][i]):
            ax.annotate(f"{df['GPU Cores'][i]} Cores, {df['BW (GB/s)'][i]} GB/s", (x[i], y_series[0][i]))


# Creating plots for PP vs Cores and TG vs Bandwidth
fig, axs = plt.subplots(1, 2, figsize=(15, 6))
fig.suptitle('PP vs GPU Cores and TG vs Bandwidth for F16, Q8_0, and Q4_0')

# PP vs GPU Cores
y_series_cores_pp = [df["F16 PP (t/s)"], df["Q8_0 PP (t/s)"], df["Q4_0 PP (t/s)"]]
plot_multi_series(axs[0], df["GPU Cores"], y_series_cores_pp,
                  ['F16 PP', 'Q8_0 PP', 'Q4_0 PP'], 'GPU Cores', 'Performance (t/s)',
                  'PP Performance vs GPU Cores', 1)

# TG vs Bandwidth
y_series_bw_tg = [df["F16 TG (t/s)"], df["Q8_0 TG (t/s)"], df["Q4_0 TG (t/s)"]]
plot_multi_series(axs[1], df["BW (GB/s)"], y_series_bw_tg,
                  ['F16 TG', 'Q8_0 TG', 'Q4_0 TG'], 'Bandwidth (GB/s)', 'Performance (t/s)',
                  'TG Performance vs Bandwidth', 2)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

Description

This is a collection of short llama.cpp benchmarks on various Apple Silicon hardware. It can be useful for comparing the performance llama.cpp achieves across the M-series chips, and it will hopefully help people decide whether an upgrade is worth it. For simplicity, this post collects results for Apple Silicon only; a similar collection for A-series chips is available here: #4508

If you are a collaborator on the project and have an Apple Silicon device, please add your device, the results of the following command, and optionally your username directly into this post (requires LLaMA 7B v2):

git checkout 8e672efe
make clean && make -j llama-bench && ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf  \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null
  • Make sure to run the benchmark on commit 8e672ef
  • Please also include the F16 model as shown, not just the quantized models
  • Contributors can post the same results in the comments below
  • If a device is already benchmarked and your results are comparable, there is no need to add it again
  • PP means "prompt processing" (bs = 512), TG means "text-generation" (bs = 1), t/s means "tokens per second"
  • ✅ means the data has been added to the summary

Note that in this benchmark we are evaluating performance on the same build 8e672ef (2023 Nov 13) for every device, in order to keep all performance factors even. Since then, there have been multiple improvements resulting in better absolute performance. As an example, here is how the same test compares against the build 86ed72d (2024 Nov 21) on the M2 Ultra:

| Chip / build | BW [GB/s] | GPU Cores | F16 PP [t/s] | F16 TG [t/s] | Q8_0 PP [t/s] | Q8_0 TG [t/s] | Q4_0 PP [t/s] | Q4_0 TG [t/s] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 Ultra 8e672ef | 800 | 76 | 1401.85 | 41.02 | 1248.59 | 66.64 | 1238.48 | 94.27 |
| M2 Ultra 86ed72d + FA | 800 | 76 | 1525.95 | 43.15 | 1368.18 | 73.11 | 1391.78 | 108.80 |

M1 Pro, 8+2 CPU, 16 GPU (@ggerganov) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 302.14 ± 0.07
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 12.75 ± 0.00
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 270.37 ± 0.02
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 22.34 ± 0.00
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 266.25 ± 0.07
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 36.41 ± 0.01

build: 8e672ef (1550)

M2 Ultra, 16+8 CPU, 76 GPU (@ggerganov) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 1401.85 ± 1.75
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 41.02 ± 0.02
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 1248.59 ± 0.73
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 66.64 ± 0.02
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 1238.48 ± 0.76
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 94.27 ± 0.05

build: 8e672ef (1550)

M3 Max (MBP 14), 12+4 CPU, 40 GPU (@slaren) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 794.26 ± 3.16
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 25.27 ± 0.07
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 749.37 ± 8.35
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 43.00 ± 0.12
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 690.99 ± 33.76
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 65.85 ± 0.22

build: d103d93 (1553)

Footnotes

[^1]: https://en.wikipedia.org/wiki/Apple_M1#Variants
[^2]: https://en.wikipedia.org/wiki/Apple_M2#Variants
[^3]: https://en.wikipedia.org/wiki/Apple_M3#Variants
[^4]: https://en.wikipedia.org/wiki/Apple_M4#Variants


M2 Mac Mini, 4+4 CPU, 10 GPU, 24 GB Memory (@QueryType) ✅


model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 201.34 ± 0.21
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 6.72 ± 0.01
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 181.40 ± 0.05
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 12.21 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 179.57 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 21.91 ± 0.02

build: 8e672ef (1550)


M2 Max Studio, 8+4 CPU, 38 GPU ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 755.67 ± 0.11
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 24.65 ± 0.02
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 677.91 ± 0.26
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 41.83 ± 0.03
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 671.31 ± 0.20
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 65.95 ± 0.08

build: 8e672ef (1550)

@maver1ck

Wow. I wasn't aware that the 4090 is so fast.

@vitali-fridman

This is from hardware that is one or two generations old, but it's for a 70B model, which might be of interest.

CPU: AMD 3995WX, GPU: 2x Nvidia 3090, Ubuntu 23.10, Kernel 6.5.0-14, NV Driver: 545.23.08, CUDA: 12.3.1

model size params backend ngl test t/s
llama 70B Q4_0 36.20 GiB 68.98 B CUDA 99 pp 512 179.29 ± 2.83
llama 70B Q4_0 36.20 GiB 68.98 B CUDA 99 tg 128 21.17 ± 0.04

For comparison, 7B model on the same hardware

model size params backend ngl test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 pp 512 1178.60 ± 88.08
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 tg 128 87.34 ± 0.89
@zotona

Could you try a 7B model for a correct comparison? Thanks!

@pukhrajvansh

What the hell, this is lower than the M4 Max? I mean, 2x 3090... what??

@atlas5301

> What the hell, this is lower than the M4 Max? I mean, 2x 3090... what??

Probably because llama.cpp is not that well optimized for GPUs. You can expect significantly better throughput with SGLang and vLLM.


M2 Ultra, 16+8 CPU, 60 GPU (@crasm) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 1128.59 ± 0.82
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 39.86 ± 0.01
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 1003.16 ± 0.39
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 62.14 ± 0.03
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 1013.81 ± 0.92
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 88.64 ± 0.06

build: 8e672ef (1550)


M3 Max (MBP 16), 12+4 CPU, 40 GPU (@ymcui) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 779.17 ± 0.49
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 25.09 ± 0.01
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 757.64 ± 1.03
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 42.75 ± 0.06
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 759.70 ± 2.26
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 66.31 ± 0.12

build: 55978ce (1555)


Short note: mostly similar to the results reported by @slaren, but for Q4_0 pp 512 my result is 759.70 ± 2.26, while the one in the main post is 690.99 ± 33.76. Not sure about the source of the difference.

@slaren

I am not sure why, but the results that I get are not very consistent. I suspect that it may be due to the cooling limitations of the smaller laptop. I repeated the test now and the results are very similar to yours.

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 787.24 ± 0.84
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 25.15 ± 0.02
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 755.88 ± 1.56
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 42.64 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 760.65 ± 0.77
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 66.35 ± 0.24

In the graph, why is PP t/s plotted against bandwidth and TG t/s plotted against GPU cores? Seems like GPU cores have more effect on PP t/s.


How about also sharing the largest model sizes and context lengths people can run with their amount of RAM? It's important to get the amount of RAM right when buying Apple computers because you can't upgrade later.

@ggerganov (Maintainer), Nov 24, 2023

You can compute these. By default, you can use ~75% of the total RAM with the GPU. You can use more with some tricks.
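
For a rough sense of what fits, here is a minimal sketch of that calculation in Python, assuming the ~75% default mentioned above (the precise limit is whatever Metal reports as `recommendedMaxWorkingSetSize`); the 1 GiB context/compute overhead is an illustrative assumption, not a measured value:

```python
# Rough fit check based on the ~75% rule of thumb above.
# The overhead constant is an assumed placeholder, not a measurement.
def fits_on_gpu(total_ram_gib: float, model_gib: float, overhead_gib: float = 1.0) -> bool:
    usable_gib = 0.75 * total_ram_gib       # default fraction usable by the GPU
    return model_gib + overhead_gib <= usable_gib

print(fits_on_gpu(16, 3.56))    # 7B Q4_0 (3.56 GiB) on a 16 GiB machine -> True
print(fits_on_gpu(16, 12.55))   # 7B F16 (12.55 GiB) on a 16 GiB machine -> False
```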


M2 Pro, 6+4 CPU, 16 GPU (@minosvasilias) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 312.65 ± 15.75
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 12.47 ± 0.71
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 288.46 ± 0.06
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 22.70 ± 0.12
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 294.24 ± 0.10
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 37.87 ± 0.10

build: e9c13ff (1560)


Would love to see how M1 Max and M1 Ultra fare given their high memory bandwidth.


M2 MAX (MBP 16) 8+4 CPU, 38 GPU, 96 GB RAM (@MrSparc) ✅

model size params backend ngl test t/s
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 674.50 ± 0.58
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 41.79 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 669.51 ± 1.17
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 64.55 ± 1.36

build: e9c13ff (1560)

@rlippmann

I'm also using an MBP 16 M2 Max with the same CPU/GPU specs but only 32 GB RAM, and my results are roughly the same:

M2 MAX (MBP 16) 8+4 CPU, 38 GPU, 32 GB RAM ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 747.99 ± 0.28
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 24.54 ± 0.22
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 674.37 ± 0.63
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 40.67 ± 0.05
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 668.28 ± 0.24
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 62.98 ± 0.06

build: 22da055 (1566)

@MrSparc

Yes, it is expected that the same CPU/GPU spec gives similar performance values for the same models regardless of RAM, as long as the model fits in memory.
The amount of RAM only limits the size of the model that can be loaded, since by default only 75% of the unified memory can be used as VRAM by the GPU:
https://github.com/ggerganov/llama.cpp#memorydisk-requirements


M1 Max (MBP 16) 8+2 CPU, 32 GPU, 64GB RAM (@CedricYauLBD) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 599.53 ± 0.86
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 23.03 ± 0.09
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 537.37 ± 0.19
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 40.20 ± 0.03
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 530.06 ± 0.17
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 61.19 ± 0.15

build: e9c13ff (1560)

Note: M1 Max RAM Bandwidth is 400GB/s


Look at what I started

@yxzwayne

off topic, but your benchmark output is my desktop rn :D
(screenshot attached)


M3 Pro (MBP 14), 5+6 CPU, 14 GPU (@paramaggarwal) ✅

model size params backend ngl test t/s
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 272.11 ± 1.40
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 17.44 ± 0.42
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 269.49 ± 1.14
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 30.65 ± 0.20

build: e9c13ff (1560)

@ggerganov (Maintainer), Nov 25, 2023

This one has 150 GB/s memory bandwidth, correct?

@paramaggarwal

Yes, that's correct. (source)

@Kaszebe

Could it run a Q5 quant of Llama 3 70B Instruct at ~2 tokens per second?

@mladencucakSYN

I'm also interested to see whether it can run a somewhat bigger model with a reasonably usable outcome. I just don't want to spend MBP Max money.

@bagobones

The old models give excellent comparative numbers, but I wonder if the benchmark needs to be rebased around the currently most popular models at some point.

Not just bigger ones for finding the biggest, but popular families/distillations that go from small to very large.

It looks like 96-128 GB of shared memory will be practical on Apple / AMD / NVIDIA DIGITS going forward.


| Chip (vs. Predecessor) | F16 PP | F16 TG | Q8_0 PP | Q8_0 TG | Q4_0 PP | Q4_0 TG |
| --- | --- | --- | --- | --- | --- | --- |
| M2 Pro (16) vs. M1 Pro (16) | 312.65 vs. 302.14 (+3.48%) | 12.47 vs. 12.75 (-2.20%) | 288.46 vs. 270.37 (+6.69%) | 22.7 vs. 22.34 (+1.61%) | 294.24 vs. 266.25 (+10.51%) | 37.87 vs. 36.41 (+4.01%) |
| M2 Max (38) vs. M1 Max (32) | 755.67 vs. 599.53 (+26.04%) | 24.65 vs. 23.03 (+7.03%) | 677.91 vs. 537.37 (+26.15%) | 41.83 vs. 40.2 (+4.05%) | 671.31 vs. 530.06 (+26.65%) | 65.95 vs. 61.19 (+7.78%) |
| M2 Ultra (60) vs. M2 Max (38) | 1128.59 vs. 755.67 (+49.34%) | 39.86 vs. 24.65 (+61.90%) | 1003.16 vs. 677.91 (+48.04%) | 62.14 vs. 41.83 (+48.48%) | 1013.81 vs. 671.31 (+51.03%) | 88.64 vs. 65.95 (+34.41%) |
| M2 Ultra (76) vs. M2 Max (38) | 1401.85 vs. 755.67 (+85.67%) | 41.02 vs. 24.65 (+66.45%) | 1248.59 vs. 677.91 (+84.24%) | 66.64 vs. 41.83 (+59.47%) | 1238.48 vs. 671.31 (+84.53%) | 94.27 vs. 65.95 (+43.06%) |
| M2 Ultra (76) vs. M2 Ultra (60) | 1401.85 vs. 1128.59 (+24.25%) | 41.02 vs. 39.86 (+2.91%) | 1248.59 vs. 1003.16 (+24.43%) | 66.64 vs. 62.14 (+7.23%) | 1238.48 vs. 1013.81 (+22.19%) | 94.27 vs. 88.64 (+6.33%) |
| M3 Pro (14) vs. M2 Pro (16) | | | 272.11 vs. 288.46 (-5.67%) | 17.44 vs. 22.7 (-23.17%) | 269.49 vs. 294.24 (-8.41%) | 30.65 vs. 37.87 (-19.07%) |
| M3 Max (40) vs. M2 Max (38) | 779.17 vs. 755.67 (+3.11%) | 25.09 vs. 24.65 (+1.78%) | 757.64 vs. 677.91 (+11.76%) | 42.75 vs. 41.83 (+2.20%) | 759.7 vs. 671.31 (+13.17%) | 66.31 vs. 65.95 (+0.55%) |

### M2 MAX (MBP 16) 38 Core 32GB

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 754.39 ± 0.36
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 24.31 ± 0.38
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 671.33 ± 2.65
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 40.85 ± 0.32
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 664.07 ± 9.11
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 63.29 ± 0.15

build: 795cd5a (1493)


I'm looking at the summary plot "PP Performance vs GPU Cores", and it shows that the original unquantized F16 model always delivers higher PP performance than the quantized models.
Sorry if my question is silly, I'm new to this area, but can someone explain why the original model is faster here than the quantized ones? Thanks

@ggerganov (Maintainer), Nov 26, 2023

The question is not silly - the observation is expected. At large batch sizes (PP means a batch size of 512) the computation is compute bound, i.e. the speed depends on how many FLOPS you can utilize. For quantized models, the existing kernels require extra compute to dequantize the data, compared to F16 models where the data is already in the F16 format.
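
As a rough illustration of the compute-bound side, one can back out the effective FLOP rate implied by a PP result. This is a sketch with simplifying assumptions (about 2 FLOPs per weight per token, attention and dequantization overhead ignored), not an exact accounting:

```python
# Effective compute implied by a prompt-processing result.
params = 6.74e9        # LLaMA 7B v2 parameter count
pp_tok_s = 1401.85     # M2 Ultra (76 GPU cores), F16, pp 512, from the summary table

flops_per_token = 2 * params            # ~2 FLOPs per weight per token (rough)
effective_tflops = flops_per_token * pp_tok_s / 1e12
print(f"~{effective_tflops:.0f} TFLOPS sustained during pp 512")   # ~19 TFLOPS
```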


M4 Pro, 8+4 CPU, 16 GPU, 24 GB Memory (MBP 14) ✅

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 381.14 ± 0.06
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 17.19 ± 0.04
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 367.13 ± 0.06
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 30.54 ± 0.01
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 364.06 ± 0.11
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 49.64 ± 0.01

build: 8e672ef (1550)


M4 Max (Macbook Pro 14" 2024), 12+4 CPU, 40 GPU, 128 GB Memory

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 923.55 ± 0.12
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 31.61 ± 0.10
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 852.47 ± 48.37
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 53.06 ± 0.48
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 746.09 ± 29.30
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 82.52 ± 0.13

build: 8e672ef (1550)

@maciejjedrzejczyk

@eightpigs can you please confirm that this is indeed a 14'' MBP that you used for testing? The spec you provided is only available for the 16'' MBP M4 Max, which has a higher memory bandwidth than the 14'' MBP M4 Max model (546 GB/s vs 410 GB/s).

@eightpigs

@maciejjedrzejczyk This is the result from my testing on a 14’’ MBP. The specs I provided are correct, and here are the details:

> system_profiler SPHardwareDataType SPDisplaysDataType 
Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: Mac16,6
      Chip: Apple M4 Max
      Total Number of Cores: 16 (12 performance and 4 efficiency)
      Memory: 128 GB
      ...

Graphics/Displays:

    Apple M4 Max:

      Chipset Model: Apple M4 Max
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 40
      ...

The M4 Max memory bandwidth can go up to 546GB/s: https://support.apple.com/en-us/121553

@maciejjedrzejczyk

Thank you for the confirmation. My confusion came from using the Apple Store configurator for my country, which only showed a single available spec for the 14'' MBP M4 Max, with lower memory bandwidth (36 GB RAM). I like the 14'' form factor much more than the 16'' version, and that was the only missing part in my research :) Just a follow-up question: would you consider the thermals (temperature on your lap, fan noise, multitasking, etc.) on this machine acceptable while running LLM inference?

@eightpigs

I also prefer the 14-inch MBP.

As for noise, I mostly run 7B or 14B models, so noise hasn’t been an issue for me. Here’s some data I tracked with my Apple Watch for reference:

  • DeepSeek-R1-Distill-Llama-70B-8bit: Fan noise is around 56dB.
  • DeepSeek-R1-Distill-Qwen-32B-MLX-8bit: Fan noise is around 48dB.
  • DeepSeek-R1-Distill-Qwen-14B-8bit: Fan is almost silent.

On multitasking, I haven’t run into any scenarios where I felt the machine was under pressure. Performance has been more than enough for me.

@olegshulyakov

Can you please test Gemma 3 27B for me?


Which models can my M3 16GB MacBook Air support?

@gsgxnet

It depends on how much RAM you want to dedicate to AI inference. I think if you tweak your macOS settings you might have the option to use up to 12 GB for the model, so a model with 20B parameters quantized down to 4-bit might just work. If it is a plain M3 you have, inference speed might be too slow, so you would probably stay with a smaller model.
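
A minimal sketch of the sizing arithmetic behind that estimate; the bits-per-weight and overhead factors are assumptions for illustration, not measurements:

```python
# Rough in-memory size of a quantized model: params * bits-per-weight / 8,
# plus assumed overhead for embeddings, norms and a modest KV cache.
def approx_size_gib(params_billion: float, bits_per_weight: float, overhead: float = 1.15) -> float:
    total_bytes = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return total_bytes / 2**30

print(f"20B @ ~4.5 bpw: {approx_size_gib(20, 4.5):.1f} GiB")  # right at a ~12 GiB budget
print(f" 7B @ ~4.5 bpw: {approx_size_gib(7, 4.5):.1f} GiB")
```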

@DeconstructingAmbiguity

Thank you for the thoughtful response. I am keeping an eye out for any tests here that most closely resemble my system.

@Crear12

You can try DeepSeek-R1 14B Q4_K_M from Ollama, it's only 9.0 GB:
ollama run deepseek-r1:14b


why specifically is the M2 so cracked compared to the M3 and M4?

@byrongibson

I think it's primarily due to memory bandwidth (first column). Where they have the same bandwidth, the results are close. But in cases where the M2 Max or Ultra has substantially higher bandwidth, it outperforms the equivalent M3 or M4.


M4 Max (Macbook Pro 16" 2024), 16 CPU, 40 GPU, 128 GB Memory

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 920.48 ± 3.25
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 31.56 ± 0.07
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 891.32 ± 0.95
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 53.75 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 884.59 ± 0.78
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 82.36 ± 0.22

Used the command below:

git checkout 8e672ef
make clean && make -j llama-bench && ./llama-bench \
  -m ./models/Llama-2-7b-chat-f16.gguf \
  -m ./models/llama-2-7b-chat.Q8_0.gguf \
  -m ./models/llama-2-7b-chat.Q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null

HEAD is now at 8e672ef stablelm : simplify + speedup generation (#4153)


Okay Apple... no M4 Ultra in the Mac Studio, but an M4 Max or M3 Ultra... planning a new Mac Pro, huh?!
https://www.apple.com/shop/buy-mac/mac-studio

So let's add these guys to the table 😁

@gsgxnet

You know, "up to 16.9" is marketing - maybe Apple ran that speed comparison with a model which did not fit into unified memory before and now does. We can tweak memory settings to make a bigger-than-default part of the RAM available to the GPU, but never all of it. See #2182 (comment) and all the messages above.

Or Apple might offer a much more performant MLX option with the M3 Ultra. Who knows? Benchmarks will tell in a few days, I assume.

@bluemoehre

Will this end up like NVIDIA's RTX series? Today some M4s (mostly those with maxed-out RAM) are no longer listed on the website in several countries - if you can load the page at all. Apple seems to have had major backend issues for hours now. Hmm.

@fairydreaming

Found some numbers: https://creativestrategies.com/mac-studio-m3-ultra-ai-workstation-review/
It seems to be TG only, no PP.

@Thireus

Nobody seems to have posted any pp so far... and I wonder why.

Edit: https://www.reddit.com/r/LocalLLaMA/comments/1j9jfbt/comment/mhe1ku9/

@netrunnereve

> Nobody seems to have posted any pp so far... and I wonder why.

Considering the article's bias I'm not surprised. The 5090's going to destroy the Mac when the model fully fits in VRAM, so the author uses a 128k context and swapping on the 5090 (not even partial offloading) to make the Mac appear more effective. For prompt processing I think the 5090 might actually beat the Mac even with partial offloading. IMO he should have also done tests with smaller contexts to show the distinction between a model that fits in VRAM (5090 wins) and one that doesn't (Mac wins).

Our llama-bench should be the standard for testing llama.cpp but sadly a lot of people don't know about it.
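
To make that distinction concrete, here is a toy per-token model of why spilling out of VRAM hurts TG so badly. All numbers are illustrative assumptions (not taken from this thread), and PCIe transfers and scheduling overhead are ignored:

```python
# Toy text-generation estimate: each token streams every weight once, and the
# layers living in slow memory dominate the per-token time.
def tg_estimate(model_gb: float, gpu_fraction: float, gpu_bw: float, cpu_bw: float) -> float:
    t = model_gb * gpu_fraction / gpu_bw + model_gb * (1 - gpu_fraction) / cpu_bw
    return 1.0 / t   # tokens/s upper bound

# ~36 GB Q4_0 70B: fully resident in 800 GB/s unified memory vs.
# 75% on a ~1800 GB/s GPU with the rest in ~90 GB/s system RAM.
print(f"unified 800 GB/s     : {tg_estimate(36, 1.00,  800, 90):.1f} t/s")
print(f"75% offloaded to GPU : {tg_estimate(36, 0.75, 1800, 90):.1f} t/s")
```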


M3 Ultra 20+8 CPU, 60 GPU, 256GB RAM ✅


./llama-bench -m models/Llama-2-7b-chat-f16.gguf -m models/llama-2-7b-chat.Q8_0.gguf -m models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 1121.80 ± 2.33
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 42.24 ± 0.05
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 1085.76 ± 0.90
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 63.55 ± 0.04
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 1073.09 ± 1.29
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 88.40 ± 0.44

build: 8e672ef (1550)

@arty-hlr

So it looks like identical speeds to the M2 Ultra, because of the same bandwidth... Not worth buying, IMO!

@marcingomulkiewicz

Well, the entry-level model with 256 GB of RAM costs exactly the same as the 192 GB model did previously, plus (if one is rich) there's a 512 GB version, so even though the speed seems similar, there's still an argument to be made in favour of those.

@bluemoehre

I've already seen several benchmarks saying that the M3 Ultra only makes sense in terms of memory capacity, not performance. It seems you get better value for money with an M4 Max.

Still, I wonder what the TG performance is with a ~500 GB model vs. a ~50 GB model on the same machine.

@marcingomulkiewicz

All else equal, probably about 10x slower, as there are 10x as many weights, no matter whether it's memory or compute bound. But it's not that simple: the >600B DeepSeek V3/R1 models are MoE models with, IIRC, ~37B active parameters per token, so I'd expect them to run much (2x?) faster than a 70B Llama.
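
A one-line sanity check of that ratio, under the assumption that TG cost tracks the weights actually read per token (routing overhead and KV-cache traffic ignored):

```python
# Dense 70B vs. an MoE with ~37B active parameters per token, at the same quantization.
dense_b, active_b = 70, 37
print(f"expected TG speedup of the MoE over a dense 70B: ~{dense_b / active_b:.1f}x")  # ~1.9x
```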


M3 Ultra 24+8 CPU, 80 GPU, 512GB RAM ✅

./llama-bench -m ./models/llama-7b-v2/ggml-model-f16.gguf -m ./models/llama-7b-v2/ggml-model-q8_0.gguf -m ./models/llama-7b-v2/ggml-model-q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 1538.34 ± 2.14
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 39.78 ± 0.06
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 1487.51 ± 1.57
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 63.93 ± 0.24
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 1471.24 ± 1.05
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 92.14 ± 0.66

build: 8e672ef (1550)


Can someone please test the base Mac Ultra M4 :)

@marcingomulkiewicz

Doubtful. The M4 Ultra does not exist, at least not yet.

@ilcommm

Of course, you're right. I obviously meant the base Mac Studio with M4 Max.

@ilcommm

I’m just choosing between a Mac Studio M2 Max with 32GB for 1,888 and a Mac Studio M4 Max with 36GB for 2,705 (in our shops). Trying to figure out if the performance boost is worth the extra cost.

@mladencucakSYN

which shops would that be? :D

@ilcommm

Sad shops in Russia :)). Those are USD prices.


Cost Per Token April 2025 ~ Bang for Buck

Seems like the M4 Mac Mini is the cheapest instant win for now, with the M1 Max Studio coming in a close second. (The cost column is the listed price divided by the Q4_0 TG speed; see the sketch after the table.)

| Product | Amazon Link | Price (USD) | Tokens/sec (t/s) | Token Cost |
| --- | --- | --- | --- | --- |
| M1 MacBook Air | https://amzn.to/42gTl9Z | 584 | 14.19 | $41.16 |
| M1 MacBook Pro | https://amzn.to/426VINP | 777 | 14.15 | $54.91 |
| M1 Pro MacBook Pro 14" | https://amzn.to/3FST7hK | 823 | 35.52 | $23.17 |
| M1 Pro MacBook Pro 16" | https://amzn.to/4leWSOC | 901 | 36.41 | $24.75 |
| M1 Max MacBook Pro 14" | https://amzn.to/3XImWaV | 1299 | 54.61 | $23.79 |
| M1 Max MacBook Pro 16" | https://amzn.to/3XImWaV | 1551 | 61.19 | $25.35 |
| M1 Max Mac Studio | https://amzn.to/41UMFiJ | 1385 | 61.19 | $22.63 |
| M1 Ultra Mac Studio | https://amzn.to/429gScF | 1980 | 74.93 | $26.42 |
| M2 MacBook Air | https://amzn.to/3YcCIe4 | 749 | 21.7 | $34.52 |
| M2 MacBook Pro | https://amzn.to/3XInrBP | 835 | 21.91 | $38.11 |
| M2 Pro MacBook Pro 14" | https://amzn.to/4cq9Y7O | 1180 | 37.87 | $31.16 |
| M2 Pro MacBook Pro 16" | https://amzn.to/4i0pIzt | 1502 | 38.86 | $38.65 |
| M2 Max MacBook Pro 14" | https://amzn.to/4lf04K0 | 1885 | 60.99 | $30.91 |
| M2 Max MacBook Pro 16" | https://amzn.to/4hVTm97 | 2014 | 65.95 | $30.54 |
| M2 Max Mac Studio | https://amzn.to/4hWJ7Bm | 1799 | 60.99 | $29.50 |
| M2 Ultra Mac Studio | https://amzn.to/4jiIVgN | 3889 | 88.64 | $43.87 |
| M2 Ultra Mac Studio | https://amzn.to/4jiIVgN | 3889 | 94.27 | $41.25 |
| M3 Pro MacBook Pro 14" | https://amzn.to/4jfqqda | 1286 | 30.74 | $41.83 |
| M3 Pro MacBook Pro 16" | https://amzn.to/4llV2M3 | 1976 | 30.74 | $64.28 |
| M3 Max MacBook Pro | https://amzn.to/3R3jWlD | 2959 | 56.58 | $52.30 |
| M3 Ultra Mac Studio | https://www.cornellstore.com/Mac-Studio-M3-Ultra | 3599 | 88.4 | $40.71 |
| M4 Mac Mini | https://amzn.to/43Eb1Pa | 549 | 24.11 | $22.77 |
| M4 MacBook Air | https://amzn.to/4cl0ISi | 949 | 24.11 | $39.36 |
| M4 Pro MacBook Pro 14" | https://amzn.to/3G2TVjW | 1786 | 49.64 | $35.98 |
| M4 Pro MacBook Pro 16" | https://amzn.to/4hYXogP | 1880 | 50.74 | $37.05 |
| M4 Max MacBook Pro | https://amzn.to/43xoCYr | 2849 | 83.06 | $34.30 |
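
For clarity, the "Token Cost" column above appears to be the listed price divided by the Q4_0 TG throughput, i.e. dollars per token/s of sustained generation. A quick check against two rows of the table:

```python
# "Token Cost" = price / Q4_0 TG throughput (dollars per token/s of generation speed).
def dollars_per_tok_per_s(price_usd: float, tg_tok_s: float) -> float:
    return price_usd / tg_tok_s

print(round(dollars_per_tok_per_s(584, 14.19), 2))  # M1 MacBook Air -> 41.16
print(round(dollars_per_tok_per_s(549, 24.11), 2))  # M4 Mac Mini   -> 22.77
```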
@arty-hlr

  1. Is this really the place to post a bunch of sponsored links??
  2. You are not computing a "token cost", which doesn't really make sense; you are computing the cost per token/s of throughput.
@shimza

Point taken, but I've looked at this chart for months and really wanted to know which Mac to buy; I wanted to optimize my spend for obvious reasons.

Yeah, fair enough - it is labelled "Token Cost", but it is self-explanatory as the price relative to the speed figure in the prior column.


... cross-posted to the Vulkan thread:

Mac Pro 2013 🗑️ 12-core Xeon E5-2697 v2, Dual FirePro D700, 64 GB RAM, MacOS Monterey

Note: I've updated this post -- I realized when I posted the first time I was so excited to see the GPUs doing stuff that I didn't check whether they were working right. Turns out they were not! So I recompiled MoltenVK and llama.cpp with some tweaks and checked that the models were working correctly before re-benchmarking. When the system was spitting garbage it was running about 30% higher t/s rates across the board.

Full HOWTO on getting the Mac Pro D700s to accept layers here: https://github.com/lukewp/TrashCanLLM/blob/main/README.md

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null

model size params backend threads test t/s
llama 7B Q8_0 6.67 GiB 6.74 B Vulkan,BLAS 12 pp512 68.55 ± 0.25
llama 7B Q8_0 6.67 GiB 6.74 B Vulkan,BLAS 12 tg128 11.05 ± 0.03
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 12 pp512 68.86 ± 0.16
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 12 tg128 16.73 ± 0.05

build: d3bd719 (5092)

The F16 model was throwing garbage, so I did not include it here - it will require some unique flags to run correctly. Additionally, here are the 8-bit and 4-bit Llama 2 7B runs on the CPU alone (using the -ngl 0 flag):

./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 0 2> /dev/null

model size params backend threads test t/s
llama 7B Q8_0 6.67 GiB 6.74 B Vulkan,BLAS 12 pp512 25.87 ± 0.56
llama 7B Q8_0 6.67 GiB 6.74 B Vulkan,BLAS 12 tg128 6.85 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 12 pp512 26.17 ± 0.06
llama 7B Q4_0 3.56 GiB 6.74 B Vulkan,BLAS 12 tg128 10.85 ± 0.01

build: d3bd719 (5092)

(Proof-of-life screenshots of the GPU and CPU test runs attached.)


Just saying: shouldn't the OP be edited with the actually measured bandwidth numbers, rather than the marketing figures Apple gave to the press?

@AndreasKunar

I don't understand your post - llama.cpp token generation is able to pretty much saturate the RAM bus bandwidth on my Macs, on my Snapdragon X and on my NVIDIA Jetson, all with very comparable results relative to the "marketing" GB/s calculated from the RAM's transactions/s limit times its bus width. If someone is lying, they all seem to do it consistently.

Normal CPU operations might not fully max out the RAM bus bandwidth, but NEON and similar instructions, and the GPUs, apparently can.

@mirh

The bandwidth in the first post is clearly not measured; it seems obvious it's just a copy-paste of the official PR numbers.
I don't really know what the actual GPU limit is, since AnandTech (RIP) never measured that specifically, but at the very least the Ultra numbers are patently dubious.
It's literally two Maxes side by side. Even if a single one were 400 GB/s, adding another 400 GB/s die cannot physically make 800 GB/s.

@AndreasKunar

OK, if you want to be negative and say that "it cannot be" without any evidence applying to the measurements here, I don't care and don't want to waste my time feeding trolls.

I think it's perfect to compare llama.cpp's performance with the theoretical maximum memory bandwidth the system is designed for (max transactions/s of the RAM x data-bus width in bytes). It is a measure of how well llama.cpp's TG code can leverage the theoretical limit imposed by memory bandwidth on that hardware. And, as evidence, the TG data for the same model/build matches this quite well, as I mentioned above. E.g. M2-series TG F16 (quantization has an impact): M2 100 GB/s ~6.5 tok/s, M2 Pro 200 GB/s ~13 tok/s, M2 Max 400 GB/s ~25 tok/s, M2 Ultra 800 GB/s ~41 tok/s (some impact of the split-chip design; it got better from M1 to M2 as Apple learned). My Snapdragon X / Jetson Orin NX only have 16 GB of unified RAM and cannot really run F16, but Q4_0/Q8_0 match (with some differences based on their hardware support for the quantization algorithms). This discussion currently enables predicting a probable range for TG tokens/s based on a designed RAM bandwidth - e.g. let's see if my NVIDIA DGX Spark prediction based on its 256 GB/s holds up.

P.S. The M-series UltraFusion interconnect is many times faster than the RAM bandwidth. I found no hard evidence that the dies do RAM-address interleaving, but they probably do, according to some internet gossip - so there is no reason the combined chip cannot do double an individual chip's RAM transfer rate into its caches. It definitely has some interconnect overhead, but it's not a typical NUMA multi-processor architecture, which would require llama.cpp to use a special tensor memory and operations layout.
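
A small sketch of that kind of prediction, using F16 TG rows from the summary table: the ceiling is the marketing bandwidth divided by the bytes read per token (roughly the model size), and the ratio of measured TG to that ceiling gives the apparent bandwidth utilization. The "read every weight once per token" model and the GiB-to-GB conversion are simplifications:

```python
# TG ceiling ~= memory bandwidth / bytes-per-token (~model size);
# utilization = measured / ceiling.
rows = [
    # (chip, BW GB/s, model size GiB, measured F16 TG t/s) from the summary table
    ("M2",       100, 12.55,  6.72),
    ("M2 Pro",   200, 12.55, 13.06),
    ("M2 Max",   400, 12.55, 24.65),
    ("M2 Ultra", 800, 12.55, 41.02),
]
for chip, bw, size_gib, tg in rows:
    ceiling = bw / (size_gib * 1.024**3)   # GiB -> GB
    print(f"{chip:9s} ceiling ~{ceiling:5.1f} t/s, measured {tg:5.1f} t/s ({tg / ceiling:.0%})")
```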

@mirh

> I think it's perfect to compare llama.cpp's performance with the theoretical maximum memory bandwidth the system is designed for

Yes, and what I'm telling you is that those theoretical numbers are not substantiated anywhere (and regardless, it seems very odd for every other figure here to be empirical, except the bandwidth, which is taken for granted with unknown rounding).

> I found no hard evidence that the dies do RAM-address interleaving, but they probably do, according to some internet gossip

According to some other internet gossip, it may actually just have been the M1 Ultra that was a disaster.
And while, after much scavenging of the net, I found some benchmarks that somewhat reduced my contempt (truthfully, the GPU really is privileged), for the biggest, most ambitious chips there is still a 20-25% gap from the datasheet.


M3 Ultra (Mac Studio 2025) 24+8 CPU, 80 GPU, 512GB RAM

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 1527.74 ± 2.02
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 40.10 ± 0.10
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 1488.84 ± 2.52
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 64.16 ± 0.38
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 1473.76 ± 1.09
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 91.93 ± 0.48

build: 8e672ef (1550)


M1 (MacBook Air 2020) 8 CPU, 8GPU, 16GB RAM

model size params backend ngl test t/s
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 115.67 ± 0.88
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 14.13 ± 0.04
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 121.73 ± 1.43
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 7.69 ± 0.12

build: 8e672ef (1550)

model size params backend threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 pp512 131.46 ± 6.71
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 4 tg128 13.99 ± 0.14
llama 7B Q8_0 6.67 GiB 6.74 B Metal,BLAS 4 pp512 133.34 ± 1.17
llama 7B Q8_0 6.67 GiB 6.74 B Metal,BLAS 4 tg128 7.67 ± 0.02

build: 3e0be1c (5410)


Finally got the results I was asking about here recently 😊

Though I had to purchase a Mac Studio with the M4 Max chip myself to achieve this.

M4 MAX (Mac Studio 2024), 14 CPU, 32 GPU, 36 GB RAM

llama.cpp % ./llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null

model size params backend ngl test t/s
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 pp 512 747.59 ± 0.92
llama 7B mostly F16 12.55 GiB 6.74 B Metal 99 tg 128 25.58 ± 0.01
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 pp 512 720.38 ± 0.04
llama 7B mostly Q8_0 6.67 GiB 6.74 B Metal 99 tg 128 43.80 ± 0.03
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 pp 512 715.74 ± 0.52
llama 7B mostly Q4_0 3.56 GiB 6.74 B Metal 99 tg 128 69.24 ± 0.09

build: 8e672ef (1550)

On new build:
./build/bin/llama-bench \
  -m ./models/llama-7b-v2/ggml-model-f16.gguf \
  -m ./models/llama-7b-v2/ggml-model-q8_0.gguf \
  -m ./models/llama-7b-v2/ggml-model-q4_0.gguf \
  -p 512 -n 128 -ngl 99 2> /dev/null

model size params backend threads test t/s
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 10 pp512 790.33 ± 0.49
llama 7B F16 12.55 GiB 6.74 B Metal,BLAS 10 tg128 26.05 ± 0.01
llama 7B Q8_0 6.67 GiB 6.74 B Metal,BLAS 10 pp512 702.39 ± 11.36
llama 7B Q8_0 6.67 GiB 6.74 B Metal,BLAS 10 tg128 44.89 ± 0.04
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 10 pp512 762.81 ± 1.03
llama 7B Q4_0 3.56 GiB 6.74 B Metal,BLAS 10 tg128 72.26 ± 0.07

build: b44890d (5440)

@arty-hlr

Keep in mind that performance is higher when using MLX instead of llama.cpp.

@olegshulyakov

@arty-hlr I was interested in it since the price is not far from that of a GPU, but the power consumption is much lower.

@arty-hlr

@olegshulyakov I got a refurbished M2 Ultra (76 cores, 64 GB) a few weeks ago and am not regretting it at all. Silent, very power efficient, good speeds; only prompt processing is slower than on NVIDIA GPUs, but that's a small compromise to make, IMO. At the moment I'm using LM Studio for inference since it has MLX as a backend, but I'll switch back to Ollama when they add it; LM Studio doesn't handle loading models as well as Ollama.

@ilcommm

> Keep in mind that performance is higher when using MLX instead of llama.cpp.

How do I get that for inference? Only via LM Studio currently?

@olegshulyakov

@ilcommm There is an mlx-lm server, similar to what llama.cpp provides.
