
Note

This guide is a live document. Feedback and benchmark numbers are welcome - the guide will be updated accordingly.


Overview

This is a detailed guide for running the new gpt-oss models locally with the best performance using llama.cpp. It covers a very wide range of hardware configurations. The gpt-oss models are lightweight, so you can run them efficiently even on surprisingly low-end hardware.

Obtaining `llama.cpp` binaries for your system

Make sure you are running the latest release of llama.cpp: https://github.com/ggml-org/llama.cpp/releases
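As a quick sanity check, you can print the version and build info of the binaries you have installed (if your build supports the flag; the exact output format may vary between builds):

llama-server --version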

Obtaining the `gpt-oss` model data (optional)

The commands used below in the guide will automatically download the model data and store it locally on your device, so this step is optional and provided only for completeness.

The original models provided by OpenAI are here:

To use them with llama.cpp, you first need to convert them to GGUF format manually. For convenience, we host pre-converted models in the ggml-org Hugging Face organization.
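If you prefer to do the conversion yourself, here is a minimal sketch using the convert_hf_to_gguf.py script from the llama.cpp repository (the local paths are placeholders for wherever you downloaded the original weights):

# assuming the original OpenAI weights were downloaded to ./gpt-oss-20b
python convert_hf_to_gguf.py ./gpt-oss-20b --outfile gpt-oss-20b.gguf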

Pre-converted GGUF models:

Tip

Running the commands below will automatically download the latest version of the model and store it locally on your device for later usage. A WebUI chat and an OAI-compatible API will become available on localhost.
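Once the server is up, a quick way to smoke-test the OAI-compatible API is a plain curl request (a sketch assuming the default llama-server port 8080):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'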

[Image: Sample output of using gpt-oss-120b with the built-in llama-server WebUI]
[Image: Using llama-server with the crush coding agent (gpt-oss-20b)]

Minimum requirements

Here are baseline memory requirements for the two models. These numbers can vary slightly depending on the CLI arguments, but they should give a good reference point.

| Model | Model data (GB) | Compute buffers (GB) | KV cache per 8192 tokens (GB) | Total @ 8192 tokens (GB) | Total @ 32768 tokens (GB) | Total @ 131072 tokens (GB) |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B | 12.0 | 2.7 | 0.2 | 14.9 | 15.5 | 17.9 |
| gpt-oss 120B | 61.0 | 2.7 | 0.3 | 64.0 | 64.9 | 68.5 |
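The totals follow approximately: total ≈ model data + compute buffers + (context size / 8192) × KV cache per 8192 tokens. For example, gpt-oss-20b at 32768 tokens: 12.0 + 2.7 + 4 × 0.2 ≈ 15.5 GB.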

Note

It is not necessary to fit the entire model in VRAM to get good performance. Offloading just the attention tensors and the KV cache to VRAM while keeping the rest of the model in CPU RAM can also provide decent performance. This is taken into account in the rest of the guide.

Relevant CLI arguments

Using the correct CLI arguments in your commands is crucial for getting the best performance for your hardware. Here is a summary of the important flags and their meaning:

| Argument | Purpose |
| --- | --- |
| `-hf` | Specify the Hugging Face model ID to use. The model will be downloaded using curl from the respective model repository |
| `-c` | Specify the context size to use. More context requires more memory. Both gpt-oss models have a maximum context of 128k tokens. Use `-c 0` to use the model's default |
| `-ub N -b N` | Specify the max batch size N during processing. A larger size increases the size of the compute buffers, but can improve performance in some cases |
| `-fa` | Enable Flash Attention kernels. This improves performance on backends that support the operator |
| `--n-cpu-moe N` | Number of MoE layers N to keep on the CPU. This is used in hardware configs that cannot fit the models fully on the GPU. The specific value depends on your memory resources and finding the optimal value requires some experimentation |
| `--jinja` | Tell llama.cpp to use the Jinja chat template embedded in the GGUF model file |
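For illustration, these flags are typically combined into a single command like the sketch below (the --n-cpu-moe value here is an arbitrary example; pick it based on your hardware as described later in the guide):

# gpt-oss-120b, 32k context, batch size 2048, 10 MoE layers kept on the CPU
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 32768 --jinja -ub 2048 -b 2048 --n-cpu-moe 10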

 
 
 

Apple Silicon

Apple Silicon devices have unified memory that is seamlessly shared between the CPU and GPU. For optimal performance, it is recommended not to exceed 70% of your device's total memory.

Tip

Install the latest llama.cpp package from Homebrew with:

brew install llama.cpp

Tip

To increase the amount of RAM available to the llama-server process, use the following command:

# on a 192GB machine, raise the limit from 154GB (default) to 180GB
sudo sysctl iogpu.wired_limit_mb=184320
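To inspect the current limit before changing it, read the value with a plain sysctl query (note that values set via sysctl typically do not persist across reboots unless you configure them to):

sysctl iogpu.wired_limit_mb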

  • ✅ Devices with more than 96GB RAM

    The M2 Max, M3 Max, M4 Max, M1 Ultra, M2 Ultra, M3 Ultra, etc. chips can run both models at full context:

    llama-server -hf ggml-org/gpt-oss-20b-GGUF  --ctx-size 0 --jinja -ub 2048 -b 2048
    🟢 Benchmarks on M3 Ultra (512GB, 80 GPU cores) for `gpt-oss-20b`
    llama-bench -m gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
    model size params backend n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp2048 2816.47 ± 2.74
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp8192 2308.17 ± 5.98
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp16384 1879.98 ± 1.99
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp32768 1351.67 ± 4.32
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 tg128 115.52 ± 0.29

    build: c8d0d14 (6310)

    🟢 Benchmarks on M2 Ultra (192GB, 76 GPU cores) for `gpt-oss-20b`
    llama-bench -m gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
    model size params backend n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp2048 2191.13 ± 2.65
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp8192 1889.83 ± 3.91
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp16384 1594.51 ± 2.42
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp32768 1218.99 ± 0.44
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 tg128 116.08 ± 0.18

    build: 79c1160 (6123)

    llama-batched-bench -m gpt-oss-20b-mxfp4.gguf -c 132096 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1,2,4
    PP TG B N_KV T_PP s S_PP t/s T_TG s S_TG t/s T s S t/s
    0 128 1 128 0.000 0.00 1.112 115.06 1.113 115.05
    0 128 2 256 0.000 0.00 1.601 159.92 1.601 159.91
    0 128 4 512 0.000 0.00 2.463 207.85 2.463 207.84
    2048 128 1 2176 0.990 2068.28 1.163 110.03 2.154 1010.44
    2048 128 2 4352 1.916 2137.49 1.710 149.72 3.626 1200.17
    2048 128 4 8704 3.775 2169.82 2.656 192.78 6.431 1353.37
    8192 128 1 8320 4.344 1885.93 1.279 100.11 5.622 1479.81
    8192 128 2 16640 8.689 1885.52 1.929 132.69 10.619 1567.04
    8192 128 4 33280 17.359 1887.62 3.053 167.69 20.413 1630.35
    16384 128 1 16512 10.202 1606.01 1.397 91.63 11.599 1423.63
    16384 128 2 33024 20.715 1581.82 2.186 117.08 22.902 1441.98
    16384 128 4 66048 41.721 1570.80 3.653 140.14 45.375 1455.61
    32768 128 1 32896 26.611 1231.39 1.665 76.88 28.276 1163.40
    32768 128 2 65792 54.977 1192.06 2.794 91.64 57.771 1138.85
    32768 128 4 131584 111.278 1177.88 4.883 104.85 116.161 1132.77
    llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048
    🟢 Benchmarks on M2 Ultra (192 GB, 76 GPU cores) for `gpt-oss-120b`
    llama-bench -m gpt-oss-120b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
    model size params backend n_ubatch fa test t/s
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B Metal 2048 1 pp2048 1244.57 ± 5.10
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B Metal 2048 1 pp8192 1101.31 ± 0.99
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B Metal 2048 1 pp16384 955.41 ± 0.64
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B Metal 2048 1 pp32768 752.31 ± 1.02
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B Metal 2048 1 tg128 79.68 ± 0.12

    build: 79c1160 (6123)

    llama-batched-bench -m gpt-oss-120b-mxfp4.gguf -c 132096 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1,2,4
    PP TG B N_KV T_PP s S_PP t/s T_TG s S_TG t/s T s S t/s
    0 128 1 128 0.000 0.00 1.610 79.48 1.611 79.48
    0 128 2 256 0.000 0.00 2.284 112.08 2.284 112.08
    0 128 4 512 0.000 0.00 3.477 147.27 3.477 147.27
    2048 128 1 2176 1.776 1152.89 1.711 74.82 3.487 623.99
    2048 128 2 4352 3.382 1211.16 2.458 104.14 5.840 745.18
    2048 128 4 8704 6.505 1259.34 3.747 136.65 10.252 849.02
    8192 128 1 8320 7.294 1123.16 1.857 68.94 9.150 909.25
    8192 128 2 16640 14.467 1132.48 2.767 92.53 17.234 965.53
    8192 128 4 33280 28.801 1137.74 4.358 117.50 33.159 1003.66
    16384 128 1 16512 16.580 988.15 2.058 62.18 18.639 885.89
    16384 128 2 33024 33.426 980.31 3.174 80.66 36.600 902.29
    16384 128 4 66048 67.190 975.39 5.245 97.61 72.435 911.83
    32768 128 1 32896 42.075 778.81 2.452 52.20 44.527 738.79
    32768 128 2 65792 86.615 756.64 4.029 63.54 90.644 725.83
    32768 128 4 131584 173.762 754.32 7.020 72.94 180.782 727.86

  • ✅ Devices with less than 96GB RAM

    The small gpt-oss-20b model can run efficiently on Macs with at least 16GB RAM:

    llama-server -hf ggml-org/gpt-oss-20b-GGUF  --ctx-size 0 --jinja -ub 2048 -b 2048
    🟢 Benchmarks on M4 Max (36GB) for `gpt-oss-20b`
    llama-bench -m gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
    model size params backend n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp2048 1277.42 ± 0.00
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp8192 1030.28 ± 0.00
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp16384 779.44 ± 0.00
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp32768 568.13 ± 0.00
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 tg128 92.36 ± 0.00

    build: 79c1160 (6123)

    llama-batched-bench -m gpt-oss-20b-mxfp4.gguf -c 132096 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1
    PP TG B N_KV T_PP s S_PP t/s T_TG s S_TG t/s T s S t/s
    0 128 1 128 0.000 0.00 1.359 94.17 1.359 94.15
    2048 128 1 2176 1.676 1222.17 1.450 88.30 3.125 696.26
    8192 128 1 8320 7.624 1074.47 1.552 82.47 9.176 906.67
    16384 128 1 16512 19.210 852.91 1.669 76.67 20.879 790.84
    32768 128 1 32896 55.684 588.46 1.976 64.76 57.661 570.51
    🟢 Benchmarks on M1 Max (64GB) for `gpt-oss-20b`
    llama-bench -m gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
    model size params backend n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp2048 994.75 ± 4.11
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp8192 843.01 ± 2.20
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp16384 698.82 ± 0.20
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp32768 497.65 ± 8.92
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 tg128 75.15 ± 0.98

    build: 2e2b22b (6180)

    🟢 Benchmarks on M1 Pro (32GB) for `gpt-oss-20b`
    llama-bench -m gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384
    model size params backend n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp2048 515.76 ± 0.00
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp8192 437.22 ± 0.00
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 pp16384 361.29 ± 0.00
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B Metal 2048 1 tg128 45.68 ± 0.00

    build: 79c1160 (6123)

    llama-batched-bench -m gpt-oss-20b-mxfp4.gguf -c 132096 -b 2048 -ub 2048 -npp 0,2048,8192,16384 -ntg 128 -npl 1
    PP TG B N_KV T_PP s S_PP t/s T_TG s S_TG t/s T s S t/s
    0 128 1 128 0.000 0.00 2.806 45.62 2.806 45.62
    2048 128 1 2176 4.054 505.14 3.076 41.61 7.130 305.18
    8192 128 1 8320 18.444 444.15 3.329 38.45 21.773 382.12
    16384 128 1 16512 44.683 366.67 3.780 33.86 48.464 340.71

  • ✅ Devices with 16GB RAM

    Macs don't allow the GPU to use the full 16GB of memory, so in this case you have to keep part of the layers on the CPU. Adjust --n-cpu-moe and -c as needed:

    llama-server -hf ggml-org/gpt-oss-20b-GGUF --n-cpu-moe 12 -c 32768 --jinja --no-mmap

  • 🟥 Devices with 8GB RAM

    Unfortunately, you are out of luck. The gpt-oss models cannot run on Macs with that little memory.


 
 
 

NVIDIA

  • ✅ Devices with more than 64GB VRAM

    With more than 64GB of VRAM, you can run both models by offloading everything (both the model and the KV cache) to the GPU(s).

    llama-server -hf ggml-org/gpt-oss-20b-GGUF  --ctx-size 0 --jinja -ub 2048 -b 2048
    🟢 Benchmarks on RTX Pro 6000 Max-Q (96GB) for `gpt-oss-20b`
    llama-bench -m ./gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes

    model size params backend n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 2048 1 pp2048 9480.55 ± 44.01
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 2048 1 pp8192 8921.62 ± 4.21
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 2048 1 pp16384 8196.12 ± 19.16
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 2048 1 pp32768 7050.35 ± 12.36
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 2048 1 tg128 249.96 ± 0.99

    build: f08c4c0 (6199)

    🟢 Benchmarks on RTX Pro 6000 (96GB) for `gpt-oss-20b`
    llama-bench -m ./gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

    model size params backend n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 2048 1 pp2048 11521.95 ± 26.03
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 2048 1 pp8192 10673.03 ± 22.35
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 2048 1 pp16384 9772.06 ± 19.59
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 2048 1 pp32768 8267.46 ± 15.58
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 2048 1 tg128 286.91 ± 0.22

    build: a6d3cfe (6205)

    llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048
    🟢 Benchmarks on RTX Pro 6000 Max-Q (96GB) for `gpt-oss-120b`
    llama-bench -m ./gpt-oss-120b-mxfp4/gpt-oss-120b-mxfp4-00001-of-00003.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes

    model size params backend n_ubatch fa test t/s
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B CUDA 2048 1 pp2048 4494.20 ± 20.87
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B CUDA 2048 1 pp8192 4327.73 ± 16.04
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B CUDA 2048 1 pp16384 4114.04 ± 12.84
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B CUDA 2048 1 pp32768 3718.01 ± 19.67
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B CUDA 2048 1 tg128 170.62 ± 0.47

    build: f08c4c0 (6199)

    🟢 Benchmarks on RTX Pro 6000 (96GB) for `gpt-oss-120b`
    llama-bench -m ./gpt-oss-120b-mxfp4/gpt-oss-120b-mxfp4-00001-of-00003.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

    model size params backend n_ubatch fa test t/s
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B CUDA 2048 1 pp2048 5518.07 ± 31.18
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B CUDA 2048 1 pp8192 5315.65 ± 21.91
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B CUDA 2048 1 pp16384 5012.78 ± 24.18
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B CUDA 2048 1 pp32768 4503.36 ± 31.57
    gpt-oss 120B MXFP4 MoE 59.02 GiB 116.83 B CUDA 2048 1 tg128 196.31 ± 0.14

    build: a6d3cfe (6205)


  • ✅ Devices with less than 64GB VRAM

    In this case, you can fit the small gpt-oss-20b model fully in VRAM for optimal performance.

    llama-server -hf ggml-org/gpt-oss-20b-GGUF  --ctx-size 0 --jinja -ub 2048 -b 2048
    🟢 Benchmarks on NVIDIA GeForce RTX 3090 (24GB) for `gpt-oss-20b`
    $ ${LLAMA_BUILD}/bin/llama-bench -m ${LLAMA_CACHE}/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 4096 -ub 2048,4096 -p 2048,8192,16384,32768
    

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, CUDA 12.4

    model size params backend n_batch n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp2048 5170.56 ± 14.10
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp8192 4771.74 ± 12.96
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp16384 4289.11 ± 3.22
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp32768 3577.10 ± 2.09
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 tg128 161.77 ± 0.56
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp2048 5142.90 ± 26.50
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp8192 4711.52 ± 4.47
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp16384 4245.67 ± 5.30
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp32768 3539.35 ± 2.47
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 tg128 161.95 ± 0.49

    build: a094f38 (6210)

    🟢 Benchmarks on NVIDIA GeForce RTX 4090 (24GB) for `gpt-oss-20b`
    $ ${LLAMA_BUILD}/bin/llama-bench -m ${LLAMA_CACHE}/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 4096 -ub 2048,4096 -p 2048,8192,16384,32768
    

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, CUDA 12.6

    model size params backend n_batch n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp2048 8022.33 ± 161.33
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp8192 7264.73 ± 69.07
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp16384 6298.35 ± 94.91
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp32768 5112.35 ± 34.90
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 tg128 221.95 ± 6.34
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp2048 8078.28 ± 39.37
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp8192 6715.17 ± 204.96
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp16384 6025.25 ± 66.75
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp32768 4924.71 ± 26.63
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 tg128 225.22 ± 0.10

    build: a094f38 (6210)

    🟢 Benchmarks on NVIDIA GeForce RTX 4080 SUPER (16GB) for `gpt-oss-20b`
    llama-bench -m 'gpt-oss-20b-mxfp4.gguf' -fa 1 -b 2048,4096 -ub 2048,4096 -p 2048,8192,16384,32768
    

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes

    model size params backend ngl n_batch n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 2048 2048 1 pp2048 8170.95 ± 10.83
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 2048 2048 1 pp8192 7989.22 ± 48.54
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 2048 2048 1 pp16384 7517.93 ± 11.39
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 2048 2048 1 pp32768 6739.51 ± 12.77
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 2048 2048 1 tg128 186.51 ± 0.33
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 2048 4096 1 pp2048 8145.36 ± 47.93
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 2048 4096 1 pp8192 7992.03 ± 22.22
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 2048 4096 1 pp16384 7560.80 ± 8.81
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 2048 4096 1 pp32768 6720.33 ± 21.73
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 2048 4096 1 tg128 185.68 ± 0.24
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 2048 1 pp2048 8120.09 ± 23.07
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 2048 1 pp8192 7942.44 ± 7.77
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 2048 1 pp16384 7532.66 ± 12.13
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 2048 1 pp32768 6735.01 ± 7.80
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 2048 1 tg128 186.17 ± 0.34
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 4096 1 pp2048 8110.85 ± 35.28
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 4096 1 pp8192 7510.58 ± 22.65
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 4096 1 pp16384 7222.12 ± 6.87
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 4096 1 pp32768 6478.02 ± 2.87
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 4096 1 tg128 186.37 ± 0.59

    build: 009b709 (6316)

    🟢 Benchmarks on NVIDIA GeForce RTX 5060 Ti (16GB) for `gpt-oss-20b`
    $ ${LLAMA_BUILD}/bin/llama-bench -m ${LLAMA_CACHE}/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 4096 -ub 2048,4096 -p 2048,8192,16384,32768
    

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes

    model size params backend n_batch n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp2048 3839.21 ± 6.79
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp8192 3695.85 ± 6.09
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp16384 3472.60 ± 1.82
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp32768 3078.06 ± 0.62
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 tg128 111.51 ± 0.05
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp2048 3821.18 ± 13.28
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp8192 3591.27 ± 1.45
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp16384 3385.30 ± 2.44
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp32768 3009.63 ± 2.82
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 tg128 111.56 ± 0.02

    build: 9ef6b0b (6208)

    🟢 Benchmarks on NVIDIA GeForce RTX 5070 Ti (16GB) for `gpt-oss-20b`
    $ ${LLAMA_BUILD}/bin/llama-bench -m ${LLAMA_CACHE}/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 4096 -ub 2048,4096 -p 2048,8192,16384,32768
    

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, CUDA 12.8

    model size params backend n_batch n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp2048 6339.76 ± 25.60
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp8192 5913.85 ± 9.12
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp16384 5375.41 ± 10.22
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp32768 4547.18 ± 3.70
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 tg128 189.45 ± 0.09
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp2048 6325.97 ± 37.98
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp8192 5669.50 ± 13.36
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp16384 5193.12 ± 5.20
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp32768 4411.35 ± 2.43
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 tg128 189.46 ± 0.03

    build: a094f38 (6210)

    🟢 Benchmarks on NVIDIA GeForce RTX 5080 (16GB) for `gpt-oss-20b`
    $ ${LLAMA_BUILD}/bin/llama-bench -m ${LLAMA_CACHE}/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 4096 -ub 2048,4096 -p 2048,8192,16384,32768
    

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes, CUDA 12.8

    model size params backend n_batch n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp2048 7476.55 ± 20.89
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp8192 7047.73 ± 19.47
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp16384 6465.65 ± 23.47
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp32768 5531.03 ± 29.67
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 tg128 204.85 ± 0.23
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp2048 7469.28 ± 43.22
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp8192 6725.38 ± 11.03
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp16384 6218.87 ± 25.68
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp32768 5376.58 ± 31.23
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 tg128 204.86 ± 0.04

    build: a094f38 (6210)

    🟢 Benchmarks on NVIDIA GeForce RTX 5090 (32GB) for `gpt-oss-20b`
    $ ${LLAMA_BUILD}/bin/llama-bench -m ${LLAMA_CACHE}/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 4096 -ub 2048,4096 -p 2048,8192,16384,32768
    

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, CUDA 12.8

    model size params backend n_batch n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp2048 9848.38 ± 28.98
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp8192 8834.14 ± 27.65
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp16384 7802.21 ± 35.06
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 pp32768 6290.76 ± 64.50
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 2048 1 tg128 282.51 ± 0.44
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp2048 9841.15 ± 29.56
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp8192 8482.25 ± 44.45
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp16384 7513.55 ± 34.19
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 pp32768 6089.55 ± 77.41
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 4096 4096 1 tg128 282.26 ± 0.10

    build: a094f38 (6210)

    [Image: llamacpp-bench-rtx]

    The large model has to be partially kept on the CPU.

    🟡 TODO: add commands for gpt-oss-120b


  • ✅ Devices with 16GB VRAM

    For example: NVIDIA V100

    This config is right at the edge of fitting the full context of gpt-oss-20b in VRAM, so we restrict the maximum context to 32k tokens.

    llama-server -hf ggml-org/gpt-oss-20b-GGUF  --ctx-size 32768 --jinja -ub 4096 -b 4096
    🟢 Benchmarks on NVIDIA V100 (16GB) for `gpt-oss-20b`
    llama-bench -m gpt-oss-20b-mxfp4.gguf -fa 1 -b 4096 -ub 4096 -p 2048,8192,16384,32768

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes

    model size params backend ngl n_batch n_ubatch fa test t/s
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 4096 1 pp2048 3526.65 ± 346.86
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 4096 1 pp8192 3320.62 ± 44.98
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 4096 1 pp16384 2768.99 ± 19.73
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 4096 1 pp32768 2096.44 ± 8.58
    gpt-oss 20B MXFP4 MoE 11.27 GiB 20.91 B CUDA 99 4096 4096 1 tg128 117.71 ± 0.30

    build: 228f724 (6129)

    llama-batched-bench -m gpt-oss-20b-mxfp4.gguf -c 33792 -b 4096 -ub 4096 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1
    PP TG B N_KV T_PP s S_PP t/s T_TG s S_TG t/s T s S t/s
    0 128 1 128 0.000 0.00 1.106 115.74 1.106 115.72
    2048 128 1 2176 0.481 4257.50 1.201 106.60 1.682 1293.82
    8192 128 1 8320 2.247 3646.05 1.417 90.31 3.664 2270.69
    16384 128 1 16512 5.421 3022.12 1.659 77.14 7.081 2331.96
    32768 128 1 32896 15.031 2180.10 2.121 60.35 17.151 1917.98

    Running the large gpt-oss-120b model with 16GB of VRAM requires keeping some of the layers on the CPU, since the model does not fit completely in VRAM:

    llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 32768 --jinja -ub 4096 -b 4096 --n-cpu-moe 32

  • ✅ Devices with less than 16GB VRAM

    For this config, it is recommended to offload the entire model to the GPU while keeping enough MoE layers on the CPU via --n-cpu-moe. Here is a specific example with an RTX 2060 8GB machine:

    # gpt-oss-20b, full context, 22 layers on the CPU
    llama-server -hf ggml-org/gpt-oss-20b-GGUF  --ctx-size 0     --jinja -ub 2048 -b 2048 --n-cpu-moe 22
    
    # gpt-oss-20b, 32k context, 16 layers on the CPU (faster, but has less total context)
    llama-server -hf ggml-org/gpt-oss-20b-GGUF  --ctx-size 32768 --jinja -ub 2048 -b 2048 --n-cpu-moe 16

    Note that even with just 8GB of VRAM, we can adjust the number of CPU layers so that the large 120B model runs too:

    # gpt-oss-120b, 32k context, 35 layers on the CPU
    llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 32768 --jinja -ub 2048 -b 2048 --n-cpu-moe 35

Tip

For more information about how to adjust the CPU layers, see the "Tips" section at the end of this guide.


 
 
 

AMD

Note

If you have AMD hardware, please provide feedback about running the gpt-oss models on it and the performance that you observe. See the sections above for the kinds of commands to try, and adjust them accordingly.

With AMD devices, you can use either the ROCm or the Vulkan backends. Depending on your specific hardware, the results can vary.


  • ✅ RX 7900 XT (20GB VRAM) using ROCm backend

    llama-server -hf ggml-org/gpt-oss-20b-GGUF  --ctx-size 0 --jinja -ub 2048 -b 2048
    🟢 Benchmarks for `gpt-oss-20b`
    llama-bench -m gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
    ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
    ggml_cuda_init: found 1 ROCm devices:
    Device 0: AMD Radeon RX 7900 XT, gfx1100 (0x1100), VMM: no, Wave Size: 32

    model size params backend ngl threads n_batch n_ubatch fa test t/s
    gpt-oss 20B BF16 12.83 GiB 20.91 B ROCm,RPC 99 1 4096 2048 1 pp2048 4251.56 ± 21.68
    gpt-oss 20B BF16 12.83 GiB 20.91 B ROCm,RPC 99 1 4096 2048 1 pp8192 3567.45 ± 11.84
    gpt-oss 20B BF16 12.83 GiB 20.91 B ROCm,RPC 99 1 4096 2048 1 pp16384 2948.39 ± 10.34
    gpt-oss 20B BF16 12.83 GiB 20.91 B ROCm,RPC 99 1 4096 2048 1 pp32768 2099.25 ± 13.17
    gpt-oss 20B BF16 12.83 GiB 20.91 B ROCm,RPC 99 1 4096 2048 1 tg128 101.92 ± 0.27

    build: 3007baf (6194)

More information: #15396 (comment)

 
 
 

Tips

Determining the optimal number of layers to keep on the CPU

Good general advice for most MoE models is to offload the entire model to the GPU and use --n-cpu-moe to keep as many MoE layers as necessary on the CPU. The minimum amount of VRAM to do this with the 120B model is about 8GB; below that you will need to start reducing the context size and the number of layers offloaded. For example, you can get about 30 t/s at zero context on a 5090 with --n-cpu-moe 21.

Caveat: on Windows it is possible to allocate more VRAM than available, and the result will be slow swapping to RAM and very bad performance. Just because the model loads without errors does not mean you have enough VRAM for the settings you are using. A good way to avoid this is to look at the "GPU Memory" in Task Manager and check that it does not exceed the GPU VRAM.
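On NVIDIA hardware you can also query the driver directly while llama-server is running with your chosen settings and verify that the used memory stays below the total (standard nvidia-smi usage):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv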

Example on a 5090 (32GB):

  • Good: --n-cpu-moe 21, GPU Memory < 32GB [image]
  • Bad: --n-cpu-moe 20, GPU Memory > 32GB [image]

Using `gpt-oss` + `llama.cpp` with coding agents (such as Claude Code, crush, etc.)
Configure the default sampling and reasoning settings

When starting a llama-server command, you can change the default sampling and reasoning settings like so:

# use recommended gpt-oss sampling params
llama-server ... --temp 1.0 --top-p 1.0

# set default reasoning effort
llama-server ... --chat-template-kwargs '{"reasoning_effort": "high"}'

Note that these are just the default settings and they could be overridden by the client connecting to the llama-server.
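For example, a client can override the default reasoning effort on a per-request basis by passing chat_template_kwargs in the request body (a sketch assuming the default port 8080):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "chat_template_kwargs": {"reasoning_effort": "low"}}'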

Frequently asked questions

Q: Which quants to use?

Always use the original MXFP4 model files. The gpt-oss models are natively "quantized", i.e. they are trained in the MXFP4 format, which is roughly equivalent to ggml's Q4_0. The main difference from Q4_0 is that the MXFP4 models keep their full quality, so no quantization in the usual sense is necessary.

Q: What sampling parameters to use?

OpenAI recommends: temperature=1.0 and top_p=1.0.

Do not use repetition penalties! Some clients tend to enable repetition penalties by default - make sure to disable those.
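If you want to be explicit on the server side, you can neutralize the repeat penalty at launch (a value of 1.0 disables it); keep in mind that clients can still override sampling settings per request:

llama-server ... --temp 1.0 --top-p 1.0 --repeat-penalty 1.0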

Q: Should I set a chat template file manually?

No. The ggml-org/gpt-oss models have a built-in chat template that is used by default. The only reasons to ever want to change the chat template manually are:

  • If there is a bug in the built-in chat template
  • If you have a very specific use case and you know very well what you are doing

Known issues

Some rough edges in the implementation are still being polished. Here is a list of issues to keep track of:


Replies: 54 comments · 114 replies


I can provide some numbers for AMD part of the guide.

My hardware is RX 7900 XT (20GB VRAM) + Ryzen 9 5900X + 32GB of RAM, running on latest Arch Linux with locally built llama.cpp version 6194 (3007baf), built with ROCm 6.4.1-1 (from official Arch repo)

Pulled the gpt-oss-20b repository and converted to GGUF using convert_hf_to_gguf.py, should probably result in the same GGUF file as on huggingface.

7900XT can load the full 20B model with full context without offloading MoE layers to CPU (although barely, because it will fill up the whole VRAM), by running

llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa

With that, i get generation speeds (as reported by llama.cpp webui) at ~94 tokens/second, slowly going down as the context fills up.

I've also tested whether setting K/V cache quantization would help with model size or performance, but the result was... bad, performance was halved and CPU got involved... is this because of mxfp4 format of gpt-oss?

I'd also like to note that my PC likes to hang when i fill up my VRAM to the brim, so i've also checked out how gpt-oss-20b behaves when i off-load MoE layers to CPU.

When running with all MoE layers on CPU, as below:

llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -cmoe

my GPU VRAM usage (as reported by btop) is around 10GB, RAM usage went up only ~2GB. However, the performance took a major 80% hit, as now my generation speed is in ~20tok/s - CPU takes most of the load. If you have better CPU and faster RAM (i'm still running dual-channel DDR4s @ 3200MHz CL16, mind you), you probably will get better results. I wonder how X3Ds behave in that case...

I assume that gpt-oss-20b has 24 MoE layers, so let's see how it behaves when i load only, let's say, 4 onto CPU:

llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 4

VRAM is at 18GB (previously it was at 19, as reported by btop, so there's a decrease), RAM usage went up by around 1.5GB, generation speed is ~60tok/s. Neat, this is usable.

How about 8 layers?

llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 8

In that case, i get 16GB VRAM usage, ~1.5GB RAM bump as previously, and generation speed went down to 38 tokens/s. Still pretty usable. How about 16 layers?

llama-server -m ./gpt-oss-20b.auto.gguf --ctx-size 0 --jinja -b 4096 -ub 4096 -ngl 99 -fa -ncmoe 16

VRAM: 13GB, RAM: as previously, not more than 2GB, generation speed: 27-25tok/s, this is getting pretty bad.

As mentioned before - your results may vary, i'm not running current-gen top-tier hardware and IIRC the largest performance bottleneck will be the RAM/PCIe link speed anyway - i'm pretty curious to see what the performance with this GPU is on more recent platform, especially with an X3D CPU.

11 replies
aldehir (Collaborator), Aug 18, 2025

I had issues with a higher batch/ubatch size than the default but I'm not seeing that problem anymore so that was probably user error on my end.

I believe you are likely hitting the case where the model needs the CoT from the past tool call but the client isn't sending it in or there is a mismatch in the reasoning field. That is an open issue across all client and inference servers/providers with GPT-OSS.

If you can collect any dumps of this happening, I'm happy to dig in further.

@SteelPh0enix

@SteelPh0enix I've been able to get crush to work with moderate success.

Can you share the llama-server command line arguments you pass in?

Sure @aldehir, here's my config for this model in crush.json:

 "providers": {
        "llamacpp": {
            "type": "openai",
            "base_url": "http://steelph0enix.pc:51536/v1",
            "name": "Llama.cpp",
            "id": "llamacpp",
            "models": [
                {
                    "id": "gpt-oss-20b.auto",
                    "name": "GPT-OSS 20B",
                    "context_window": 131072,
                    "default_max_tokens": 51200,
                    "has_reasoning_efforts": true,
                    "can_reason": true,
                    "supports_attachments": false,
                    "default_reasoning_effort": "high",
                    "cost_per_1m_in": 0,
                    "cost_per_1m_in_cached": 0,
                    "cost_per_1m_out": 0,
                    "cost_per_1m_out_cached": 0
                }
            ]
        }
    }
// ...

my llama-server invocation:

llama-server --ctx-size 0 --model gpt-oss-20b.auto.gguf --alias "gpt-oss-20b.auto" --jinja

i keep most of my llama-server settings in env vars, as following:

export LLAMA_ARG_HOST="0.0.0.0"
export LLAMA_ARG_PORT="51536"
export LLAMA_ARG_BATCH=2048
export LLAMA_ARG_UBATCH=2048
export LLAMA_ARG_SWA_FULL=false
export LLAMA_ARG_KV_SPLIT=false
export LLAMA_SET_ROWS=1 # for ARG_KV_SPLIT=false to work
export LLAMA_ARG_FLASH_ATTN=true
export LLAMA_ARG_MLOCK=true
export LLAMA_ARG_NO_MMAP=false
export LLAMA_ARG_N_GPU_LAYERS=999
export LLAMA_OFFLINE=true
export LLAMA_ARG_ENDPOINT_SLOTS=true
export LLAMA_ARG_ENDPOINT_PROPS=true

I've opened my test project (a Rust app) with gpt-oss-20b as chosen model in Crush and told it to initialize the project... and it seems to work just fine now!


I've tested Crush back on 0.6.0 (or 0.6.1) with gpt-oss, if not on 0.5.x, and i definitely had issues (for example, the chat description below CRUSH logo contained gpt-oss chat template tags...) so you must've fixed it already - i just haven't noticed :)

Thanks!

If you want, you may add my piece of crush.json to the post as a config example (change the IP to localhost though ;) ) @ggerganov, the invocation from the original post should work just fine

@SteelPh0enix

I had issues with a higher batch/ubatch size than the default but I'm not seeing that problem anymore so that was probably user error on my end.

I believe you are likely hitting the case where the model needs the CoT from the past tool call but the client isn't sending it in or there is a mismatch in the reasoning field. I believe that's an open issue across all client and inference servers/providers with GPT-OSS.

If you can collect any dumps of this happening, I'm happy to dig in further.

Yes, i think i did notice that on some other models, i've been mostly working with Qwen... If this happens again, how can i get some more logs/info?

Oh, and one potential "issue" i've just noticed - i've set my reasoning in model config to "high", but crush seem to force "minimal", is this "by design", or is it some issue?

aldehir (Collaborator), Aug 18, 2025

I can't say, but it likely has no effect since llama-server only respects the chat_template_kwargs.reasoning_effort field in the request. I doubt crush is setting it, so it defaults to "medium" unless you change it via command line.

I usually run mitmproxy in the background, but enabling verbose and searching for parse errors in the server log will likely show the root of the problem--if there is one.

@SteelPh0enix

btw @ggerganov i think you made a mistake labeling my test results, i don't have a mac, and they sure don't use AMD GPUs anymore 😄


Configure the default sampling and reasoning settings

When starting a llama-server command, you can change the default sampling and reasoning settings like so:

# use recommended gpt-oss sampling params
llama-server ... --temp 1.0 --top-p 1.0

Q: What sampling parameters to use?

OpenAI recommends: temperature=1.0 and top_p=1.0.

The problem I see is that the llama-server defaults are

--samplers SAMPLERS                     samplers that will be used for generation in the order, separated by
                                        ';'                                                                  
                                        (default:                                                            
                                        penalties;dry;top_n_sigma;top_k;typ_p;top_p;min_p;xtc;temperature)                                                                                            
--temp N                                temperature (default: 0.8)                                           
--top-k N                               top-k sampling (**default: 40**, 0 = disabled)                         
--top-p N                               top-p sampling (default: 0.9, 1.0 = disabled)
--min-p N                               min-p sampling (**default: 0.1**, 0.0 = disabled)   

So the above command is actually equivalent to:

llama-server ... --temp 1.0 --top-p 1.0 --top-k 40 --min-p 0.1

Which seems quite a bit different from the actual recommendation from OpenAI. Notably "min-p 0.1" will prune a lot of low-probability tokens, whereas the OpenAI recommendation is basically to follow the model output probabilities.

If you look at a lot of guides and settings for other SOTA LLM, they all recommend min-p 0.01 or 0.00.

Should the command line be changed to:

llama-server ... --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0

8 replies
ggerganov (Maintainer, Author), Aug 19, 2025

Regardless of the OpenAI recommendation, I still think it's a good idea to filter low-probability tokens (for example with Top-K or Min-P).

For now, I've updated the guide with the following paragraph:

Be careful when you disable the `Top K` sampler. Although recommended by OpenAI, this can lead to significant CPU overhead and small but non-zero probability of sampling low-probability tokens.

We can revisit if we determine that sampling from the full vocab is actually important.

@SmallAndSoft

then top_n_sigma with the default parameter of 1.

This isn't supported in llama-server. (A number of claims in top-n-sigma paper fall flat when temperature is applied last, as is the case for llama.cpp, so I'm not sure this is going to change any time soon)

Not sure what you mean. The support was merged in #13264
Temperature can be applied at any step you want if you define your own sampling chain.

@gcp

I still think it's a good idea to filter low-probability tokens (for example with Top-K or Min-P).

Yes, that seems quite sensible. Note that the default will have both top-k 40 and min-p 0.1.

The support was merged in #13264

Looks like a bug crept in there, I'll file an issue.

Temperature can be applied at any step you want if you define your own sampling chain.

Yes, and if you put it last like llama.cpp does by default, you don't have some of the key problems that sampler is supposed to fix 😀

@Spyro000

Using min-p 0.0 causes significant performance losses: from 57 tokens per second at min-p 0.1 down to 35 tokens per second at min-p 0.0.

@Tom94

That's to be expected. I'd recommend min-p 0.01 or even min-p 0.001 for behavior that's close enough to 0 with performance close to the default.


To fill better the low-end CUDA edge cases, here are some benchmarks for gpt-oss-20B (both MXFP4 and Unsloth UD quant) on 12GB VRAM:
Ryzen 7 5700X with 32GB RAM (PCIe 4), NVIDIA RTX3060, 12GB VRAM, with CUDA 13.0:
llama.cpp build: 6139

Optimal settings at 16K context window:
Comparing -ncmoe N vs. offloading just some of the later up-projection layers, e.g. -ot "\.([2-9][0-9])\.ffn_up_exps.=CPU". My reasoning was that front of NN is more 'expensive' to offload due to early layers seeing the full sequence - more work per token. In practice, not huge difference in my scenario:

❯llama-server -t 8 -m redacted/gpt-oss-20b-UD-Q4_K_XL.gguf -ngl 99 -fa -c 16384 --min-p 0.0 --temp 1.0 --top-p 1.0 --top-k 0.0 --jinja --reasoning-format none --no-mmap -ot "\.([2-9][0-9])\.ffn_up_exps.=CPU"

(Leaves about 600 MB VRAM budget, with 67 tok/sec initial generation rate)
or

❯llama-server -t 8 -m redacted/gpt-oss-20b-mxfp4.gguf -ngl 99 -fa -c 16384 --min-p 0.0 --temp 1.0 --top-p 1.0 --top-k 0.0 --jinja --reasoning-format none --no-mmap -ncmoe 2

(Leaves about 600 MB VRAM budget, with 64 tok/sec initial generation rate)

Optimal settings at 32K context window:

❯llama-server -t 8 -m redacted/gpt-oss-20b-UD-Q4_K_XL.gguf -ngl 99 -fa -c 32768 --min-p 0.0 --temp 1.0 --top-p 1.0 --top-k 0.0 --jinja --reasoning-format none --no-mmap -ot "\.([1-9][0-9])\.ffn_up_exps.=CPU"

(Leaves about 1.2 GB VRAM budget, with 53 tok/sec initial generation rate)
or

❯llama-server -t 8 -m redacted/gpt-oss-20b-mxfp4.gguf -ngl 99 -fa -c 32768 --min-p 0.0 --temp 1.0 --top-p 1.0 --top-k 0.0 --jinja --reasoning-format none --no-mmap -ncmoe 3

(Leaves about 600 MB VRAM budget, with 56 tok/sec initial generation rate - this is too aggressive and will likely OOM before reaching the context limit.)

#llama-bench for gpt-oss-20b-mxfp4.gguf:
bin❯master❯./llama-bench -m redacted/gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384 -ot "\.([1-9][0-9])\.ffn_up_exps.=CPU"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | fa | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------------- | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 | \.([1-9][0-9])\.ffn_up_exps.=CPU |          pp2048 |       2229.95 ± 1.93 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 | \.([1-9][0-9])\.ffn_up_exps.=CPU |          pp8192 |       2108.57 ± 6.36 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 | \.([1-9][0-9])\.ffn_up_exps.=CPU |         pp16384 |       1960.34 ± 2.08 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 | \.([1-9][0-9])\.ffn_up_exps.=CPU |           tg128 |         30.64 ± 0.08 |



#gpt-oss-20b-UD-Q4_K_XL.gguf (Unsloth)
bin❯master❯./llama-bench -m redacted/gpt-oss-20b-UD-Q4_K_XL.gguf -ngl 99 -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384 -ot "\.([1-9][0-9])\.ffn_up_exps.=CPU"  
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | n_ubatch | fa | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------------- | --------------: | -------------------: |
| gpt-oss ?B Q4_K - Medium       |  11.04 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 | \.([1-9][0-9])\.ffn_up_exps.=CPU |          pp2048 |       2212.30 ± 4.04 |
| gpt-oss ?B Q4_K - Medium       |  11.04 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 | \.([1-9][0-9])\.ffn_up_exps.=CPU |          pp8192 |       2092.57 ± 6.44 |
| gpt-oss ?B Q4_K - Medium       |  11.04 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 | \.([1-9][0-9])\.ffn_up_exps.=CPU |         pp16384 |       1948.17 ± 0.93 |
| gpt-oss ?B Q4_K - Medium       |  11.04 GiB |    20.91 B | CUDA       |  99 |       1 |     2048 |  1 | \.([1-9][0-9])\.ffn_up_exps.=CPU |           tg128 |         31.30 ± 0.07 |

build: 6139

Similar results for latest build: 6195.

For completeness, when running with a small context window of 2048 tokens ( when everything fits in the GPU, i.e. no offloading), the inference speed reaches 75 tok/sec. This is not irrelevant even for a reasoning model because there are scenarios where this window is sufficient for one-off tasks.
Excellent inference speed with a decent LLM, indeed.

3 replies
@ElliotK-IB

-ncmoe N vs. offloading just some of the later up-projection layers, e.g. -ot ".([2-9][0-9]).ffn_up_exps.=CPU"

The first section looks like it compares the former on MXFP4 and the latter on UD-Q4_K_XL -- is this intentionally not a "controlled" experiment? Or is it that you're showing the optimal settings in your testing for the MXFPR4 and the UD-Q4_K_XL respectively? Just seeking clarification on what the pairs of results per context window size are for.

Is this where you got the GGUF files from?

Lastly, how is it that the 32K context with UD-Q4_K_XL leaves 1.2 GB VRAM but 16K only leaves over 600 MB? I'm understanding this as the 32K context left more free VRAM than 16K context.

@QuantiusBenignus

Hi @ElliotK-IB,
It was meant to save comment space, after noticing that the additional partial quantization of the MXFP4 quants by Unsloth introduces no noticeable difference for essentially the same neural network when all run parameters are the same. So the intent was to convey that, with either model variant, offloading the chosen up-projection tensors (-ot) is equivalent to using -ncmoe 2 (memory-wise, barring a few t/s tg speed difference), leaving about the same amount of VRAM available on the specific 12GB GPU.

The MXFP4 was either from Bartowski or ggml-org. The Unsloth URL is correct, I think.

On the second question, the command with 16k context uses less aggressive regexp (20 to 99), while the 32k context would offload also tensors from 10 to 99 (if they existed ) of the up-projections of the feed forward network, thus leaving more VRAM available. (Which is needed for the larger context, about 370 to 400 MB per 16k if not mistaken). Looking at the layer structure of the LLM, actually blanketing everything up to 99 is not necessary. (up to 29 would have sufficed). On that note, a more aggressive regexp to leave a few more of the up-projections in VRAM would be -ot ".(1[7-9]|2[0-4]).ffn_up_exps.=CPU". This offloads from layers 17 to 24, leaving about 600 MB free VRAM (which would not be enough if the full 32k context window is to be used, so -ot ".(1[6-9]|2[0-4]).ffn_up_exps.=CPU" would be living on the absolute edge).

Bottom line, this somewhat convoluted text shows optimal (with enough VRAM left for the chosen context window, except maybe in the last case) settings for the hardware in question and suggests that in most cases it is better to offload tensors, not whole expert layers to the CPU/RAM.
Assuming no other, unrelated, GPU-intensive, VRAM-gobbling tasks on the system, of course.

@ElliotK-IB

Interesting, I learned about offloading tensors vs layers thanks to your post, glad I asked further. Appreciate the detailed follow-up as well! I'll revisit this and this post I came across on r/LocalLLaMA for my own experiments.


Several people are having issues with tool calling in Cline/Roo Code when using gpt-oss-20b. This is because those clients do not use native tool calling and the model insists on native tool calls. There is a workaround by using a custom grammar that inhibits native tool calling:

root ::= analysis? start final .+
analysis ::= "<|channel|>analysis<|message|>" ( [^<] | "<" [^|] | "<|" [^e] )* "<|end|>"
start ::= "<|start|>assistant"
final ::= "<|channel|>final<|message|>"

Passing this in a file with --grammar-file yields good results when coupled with this system prompt:

Valid channels: analysis, final. Channel must be included for every message.
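A usage sketch (the file name no-native-tools.gbnf is just a placeholder for wherever you save the grammar above):

llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja --grammar-file no-native-tools.gbnf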

Is this something useful to include in the docs?

5 replies
ggerganov (Maintainer, Author), Aug 19, 2025

Could you ELI5 the difference between native and non-native tool calls? Or point me to a reference document.

aldehir (Collaborator), Aug 19, 2025

Could you ELI5 the difference between native and non-native tool calls? Or point me to a reference document.

With native tool calls, the model invokes tools in its own syntax. The inference server is responsible for parsing it and exposing it via the API.

For gpt-oss, it generates tool calls in its harmony format through the commentary channel

<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"location": "New York"}

Other models may place them in tags such as <tool_call></tool_call>.

With non-native tool calls, the client prompts the model to respond a certain way to perform a tool call.

For example, Cline prompts the model to respond in an XML format. E.g.,

<get_weather>
  <location>New York</location>
</get_weather>

gpt-oss-20b is adamant about producing a native tool call when told it has tools. By constraining the grammar to only produce content and not a tool call, you force it to do a non-native call that Cline/Roo Code expect.

Hope that clears things up.

ggerganov (Maintainer, Author), Aug 19, 2025

Thanks. Expanded the "Tips" section with a link to this thread.

@maxiedaniels

@aldehir so if I want to use these models with RooCode + Openrouter via Cerebras or Grok, is it on the provider to fix this or is it on the RooCode developers?

aldehir (Collaborator), Aug 22, 2025

@maxiedaniels I doubt the providers will adopt this grammar, it really is more of a hack than a fix. I think the appropriate fix for Cline / Roo Code is to adopt native tool calling. Roo Code has an open PR that may be promising.

The 120B model should work (mostly) with Cline/Roo Code. It seems to follow instruction quite well, but may fail occasionally. The 20B seems to always fail, and this grammar helps with that.


Are we sure tool calling is currently implemented correctly? Openai has released a test script ( https://cookbook.openai.com/articles/gpt-oss/verifying-implementations ) to test backend implementations, but it's currently failing me with llama.cpp. Steps to run the test script:

git clone https://github.com/openai/openai-cookbook.git

cd gpt-oss/compatibility-test/

npm install
npm i -D tsx typescript @types/node

Then edit the providers.ts file (edit the correct details in):

export const PROVIDERS = {
  openai: {
    apiBaseUrl: "http://localhost:3001/v1",
    apiKey: "key",
    apiType: ["chat"], // choose from responses, chat, or both
    modelName: "GPT-OSS-120B",
    providerDetails: {
      // add any provider-specific details here. These will be passed as part of every request
      // for example to fix the provider for openrouter, you can do:
      // provider: {
      //   only: ["example"],
      // },
    },
  },
};

And then lastly start the test: npm start -- --provider openai

These are the results I obtained:

Summary:
  Provider: openai
  Total input cases: 30
  Tries: 1
  Total tasks: 29
  Total runs: 29
  Invalid Chat Completions API responses: 29 (out of 29)
  pass@k (k=1..1): 1=0.000
  pass^k (k=1..1): 1=0.000
  pass@k (k=1): 0.000
  pass^k (k=1): 0.000
  Wrong-input tool calls: 5
  Invalid cases.jsonl lines: 0

Expected outcome according to the guide: If your tests are successful, the output should show 0 invalid requests and over 90% on both pass@k and pass^k. This means the implementation should likely be correct.

Could anyone try replicating my findings? If they find the same, what should be done to fix this?

9 replies
@aldehir
Comment options

aldehir Aug 18, 2025
Collaborator

Do you happen to have the patch to enable reasoning_content compatibility? Typescript is definitely not my strong suit. I tried changing the if (item.type === "reasoning") { check to if (item.type === "reasoning_content") {, but that didn't work.

I can't produce one right this moment, but if you duplicate the validResponse line and change hasReasoningField to hasReasoningContentField, and message.reasoning to message.reasoning_content, it should work.
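A rough TypeScript sketch of what that check could look like (illustrative only: the names hasReasoningField / hasReasoningContentField come from the comment above, and the message type here is an assumption, not the test's actual code):

// Accept either `reasoning` or `reasoning_content` on the assistant message.
interface AssistantMessage {
  role: string;
  content: string | null;
  reasoning?: string;
  reasoning_content?: string;
}

function hasValidReasoning(message: AssistantMessage): boolean {
  const hasReasoningField =
    typeof message.reasoning === "string" && message.reasoning.length > 0;
  const hasReasoningContentField =
    typeof message.reasoning_content === "string" && message.reasoning_content.length > 0;
  // The response counts as valid if either field carries the reasoning text.
  return hasReasoningField || hasReasoningContentField;
}

// Example: a llama-server style message that uses `reasoning_content`.
console.log(hasValidReasoning({ role: "assistant", content: "Hi", reasoning_content: "..." })); // true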

@aldehir
Comment options

aldehir Aug 18, 2025
Collaborator

It does look like they may have intended to pass the test if reasoning_content is set but forgot to add another line. That said, I do think the divergence between projects is problematic for clients. I created the discussion to see if there is community support behind adding a way to change the field.

@0xshivamagarwal
Comment options

I believe there is some issue with the current implementation of tool calling.

I used openai/gpt-oss-20b model with both lmstudio (compatibility test: success) & llama-server (compatibility test: failed) [version: 6190 (ae532ea)] and logged the result variable in runCase.ts at line 105.
command used to run test and generate output : npm start -- --provider <provider_name> -n 1 -k 1

Screenshot 2025-08-19 at 8 05 53 PM

Attaching output of both for the reference: output-llama-server.log, output-lm-studio.log

If you look at output-lm-studio.log, you will find the actual tool call and its response, but the same is not present in the output-llama-server.log file.

Please let me know if I did something incorrectly or if I can provide any information that can be helpful in solving this.

Also, I don't see any tool calls in the output using lm-studio with ggml-org/gpt-oss-20b-GGUF model. So, I believe they have done some extra handling just for the openai model to support tool calling.

@aldehir
Comment options

aldehir Aug 19, 2025
Collaborator

@0xshivamagarwal use --reasoning-format auto. none is no longer the recommendation, so you can opt to remove the option entirely as well (it defaults to auto).
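For reference, a minimal server invocation along those lines might look like this (the model reference is just an example; --reasoning-format auto can also be omitted since it is the default):

llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja -fa -c 0 --reasoning-format auto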

@0xshivamagarwal
Comment options

@aldehir thanks for pointing it out. Tool calling is working perfectly.
It's just the API response that needs to be updated to work perfectly with the tests.

P.S. Let me know if I should delete the comments to avoid confusion for anyone seeing it in future.

Comment options

Maybe not relevant as the models are kinda large... but when I perf-tested CPU inference on an Ampere system before, --threads set to cores/2 was the sweet spot. Also, what are the considerations for using --cache-reuse?
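For context: as I understand it, --cache-reuse N sets the minimum chunk size (in tokens) that llama-server will try to reuse from the existing KV cache via cache shifting instead of reprocessing, which mainly helps chat/agent workloads that resend mostly identical prompts. A hedged sketch of combining it with the flags from this guide (the value 256 is illustrative):

llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja -fa -c 0 --cache-reuse 256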

0 replies
Comment options

@ggerganov I've tested one more Apple Silicon. Here are the results of my MBP M1 Max 64GB

🟢 Benchmarks on M1 Max (64GB) for gpt-oss-20b

time llama-bench -m gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768

| model | size | params | backend | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp2048 | 994.75 ± 4.11 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp8192 | 843.01 ± 2.20 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp16384 | 698.82 ± 0.20 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp32768 | 497.65 ± 8.92 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | tg128 | 75.15 ± 0.98 |

build: 2e2b22b (6180)

llama-bench -m gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1 -fa 1 -b 2048 -ub 2048 -p 10,31s user 2,38s system 2% cpu 10:15,13 total

4 replies
@gsgxnet
Comment options

@ggerganov MBP M3 Max 128GB

time llama-bench -m /Users/<user>/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768

| model | size | params | backend | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp2048 | 1347.72 ± 28.34 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp8192 | 1040.01 ± 19.65 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp16384 | 908.13 ± 7.98 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp32768 | 530.52 ± 74.68 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | tg128 | 64.26 ± 0.53 |

build: e92734d (6250)
llama-bench -m -ngl 99 -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768 8.02s user 5.50s system 2% cpu 8:58.38 total

It only ran when I gave the full path to the model file; with just the name, llama-bench could not find the model.

@ggerganov
Comment options

ggerganov Aug 26, 2025
Maintainer Author

The tg128 number looks quite low. I think it's possible that this measurement was heat throttled (new macbooks unfortunately do this for some reason). You can run it alone like this:

time llama-bench -m /Users/<user>/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf  -ngl 99 -t 1 -fa 1 -b 2048 -ub 2048 -p 0

For example on my M4 Max (36GB) I get this:

| model | size | params | backend | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 2048 | 1 | tg128 | 95.88 ± 0.12 |

build: 8b69686 (6293)

@gsgxnet
Comment options

Yes, the heat throttling seems to be a really big issue with MacBook Pros, especially with the M3 Max SoC and a lot of RAM.
I had found this report: #10444
At the moment I fear it is worse than just low inference speed. I will evaluate further with the gpt-oss eval scripts; so far I get really bad results.

@gsgxnet
Comment options

A good reference benchmark seems to be this one, from a Mac mini M4 Pro 64 GB:

time llama-bench -m /Users/<user>/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768

| model | size | params | backend | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp2048 | 700.78 ± 0.71 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp8192 | 618.59 ± 0.67 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp16384 | 534.95 ± 0.54 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | pp32768 | 419.74 ± 0.27 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 1 | 2048 | 1 | tg128 | 63.34 ± 0.05 |

build: 0fd90db (6280)
llama-bench -m -ngl 99 -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768 6,29s user 2,40s system 1% cpu 13:01,60 total
tg128 seems consistent, and all t/s variances are low. It seems the very high variances in the MacBook Pro results are caused by the throttling.

Comment options

Getting the following error when attempting to use the @playwright/mcp@latest MCP server. It's working well with other tools, and this MCP server works fine with other models such as Devstral, so it's an issue with the gpt-oss model implementation:

got exception: {"code":500,"message":"JSON schema conversion failed:\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}\nUnrecognized schema: {\"not\":{}}","type":"server_error"}
3 replies
@aldehir
Comment options

aldehir Aug 19, 2025
Collaborator

I'm not seeing this problem with Chatbox + @playwright/mcp@latest. Which client are you using?

@Art9681
Comment options

I'm using the official OpenAI Go SDK. I can enable other MCP servers and internal tool implementations according to spec and they work fine.

@aldehir
Comment options

aldehir Aug 19, 2025
Collaborator

I'm afraid I'm stumped, as I cannot reproduce this with Chatbox or Crush. Neither produces a JSON schema with "not": {} for the playwright MCP server.

From what I can tell, not is unsupported, but I'm not equipped to give you a good answer. I recommend creating an issue and maybe someone more knowledgeable can help.

Comment options

Adding benchmarks for NVIDIA > 64 GB VRAM.

🟢 Benchmarks on RTX Pro 6000 Max-Q (96GB) for `gpt-oss-20b`
llama-bench -m ./gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | 2048 | 1 | pp2048 | 9480.55 ± 44.01 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | 2048 | 1 | pp8192 | 8921.62 ± 4.21 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | 2048 | 1 | pp16384 | 8196.12 ± 19.16 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | 2048 | 1 | pp32768 | 7050.35 ± 12.36 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | 2048 | 1 | tg128 | 249.96 ± 0.99 |

build: f08c4c0 (6199)

🟢 Benchmarks on RTX Pro 6000 Max-Q (96GB) for `gpt-oss-120b`
llama-bench -m ./gpt-oss-120b-mxfp4/gpt-oss-120b-mxfp4-00001-of-00003.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp2048 | 4494.20 ± 20.87 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp8192 | 4327.73 ± 16.04 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp16384 | 4114.04 ± 12.84 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp32768 | 3718.01 ± 19.67 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | tg128 | 170.62 ± 0.47 |

build: f08c4c0 (6199)

2 replies
@ggerganov
Comment options

ggerganov Aug 19, 2025
Maintainer Author

Thanks for the data!

P.S. I accidentally edited your comment instead of the guide - sorry about that :)

@SoftwareRenderer
Comment options

It's amazing how the continued performance improvements have added up, even within the short period of the past 3 weeks: 40-50% PP and 15% TG improvement. Incredible work.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp2048 | 6403.85 ± 33.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp8192 | 6404.04 ± 10.92 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp16384 | 6188.01 ± 7.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp32768 | 5655.75 ± 28.18 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | tg128 | 197.21 ± 0.44 |

build: 00681df (6445)

Comment options

Since I spent more time than I probably should have, here is some info for the list

🧱 Hardware Specifications

🖥️ CPU

  • Model: AMD Ryzen Threadripper 1950X 16-Core Processor
  • Cores/Threads: 16 cores / 32 threads
  • Base Clock: 2.2 GHz
  • Boost Clock: 3.75 GHz
  • Sockets: 1
  • NUMA Nodes: 1 (CPUs 0–31)

🧠 Memory (RAM)

  • Total Capacity: 64 GB (4 × 16 GB)
  • Speed: 3200 MT/s
  • Channels: Quad-channel
  • Type: DDR4

🎮 GPUs

5 × AMD Radeon Pro VII (Vega 20 (gfx906: xnack-), 16 GB HBM2 each)

| GPU ID | Memory Vendor | VBIOS Version |
| --- | --- | --- |
| GPU 0 | Hynix | 113-D1640600-104 |
| GPU 1 | Hynix | 113-D1640600-104 |
| GPU 2 | Hynix | 113-D1640600-104 |
| GPU 3 | Samsung | 113-D1640600-104 |
| GPU 4 | Samsung | 113-D1640600-104 |
  • Total VRAM: 80 GB
  • ECC Support: Enabled
  • IOMMU + HMM/SVM: Enabled (shared virtual memory for ROCm)
  • Firmware & ROCm: Custom-built ROCm HIP stack with Flash attention support to enable functionality outside official compatibility list

🧩 PCIe Configuration

  • Risers: 5 total
  • Bifurcation Cards: 2 × 16x-to-8x dual-split
  • Layout:
    • 3 GPUs on straight x16 risers
    • 2 GPUs connected via bifurcation (x8/x8) splitters

🐧 OS

  • Distribution: Ubuntu 24.04 LTS

🧠 Inference Benchmark Summary

🏃 Run Command

Note: --no-mmap prevents a hang at 75% model loading (for me, anyway).

./llama-server \
  --model gpt-oss-120b-F16.gguf \
  --threads 16 \
  --no-mmap \
  --flash-attn \
  --prio 2 \
  --n-gpu-layers 99 \
  --temp 1.0 \
  --top-p 1.0 \
  --top-k 0 \
  --min-p 0 \
  --no-warmup \
  --ubatch-size 2048 \
  --jinja \
  --chat-template-kwargs '{"reasoning_effort": "medium"}' \
  --ctx-size 32768

🔍 Inference Performance: 9K Prompt + Generation (~11.4K tokens total)

📈 Performance Metrics

📤 Prompt Tokenization

  • Tokens: 9044
  • Time: 77,223.972 ms
  • Speed: 117.1 tokens/sec

🧠 Token Generation

  • Tokens: 2381
  • Time: 134,268.908 ms
  • Speed: 🐢 17.7 tokens/sec

⚙️ Total Workload: ~11,425 tokens

Let me know if there is anything else you'd like to see or know that may be helpful to others

6 replies
@ggerganov
Comment options

ggerganov Aug 19, 2025
Maintainer Author

The --top-k 0 option is likely slowing text generation a lot.
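For a quick comparison, one could re-run with --top-k 0 removed so the sampler falls back to its default top-k instead of considering the full vocabulary (a sketch; everything except the sampling flags is unchanged from the command above):

./llama-server --model gpt-oss-120b-F16.gguf --n-gpu-layers 99 --flash-attn --jinja \
  --temp 1.0 --top-p 1.0 --min-p 0 --ubatch-size 2048 --ctx-size 32768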

@kj-c0d3s
Comment options

TL;DR
Same performance with top-k at the default, and at least with my setup --split-mode row performs much worse, even though it looks like it's taxing the GPUs more.


I ran it with no --top-k specified, expecting default 40, running prompt from same entry point as original test:

Prompt

  • Tokens: 9044
  • Time: 77004.813 ms
  • Speed: 117.4 t/s

Generation

  • Tokens: 2888
  • Time: 164342.089 ms
  • Speed: 17.6 t/s

Switching back for a straight comparison and adding --split-mode row yields:

First: yes, it definitely burns all GPUs at the same time, with slightly different loading
image

Prompt

  • Tokens: 9044
  • Time: 95424.939 ms
  • Speed: 94.8 t/s

Generation

  • Tokens: 3041
  • Time: 212404.812 ms
  • Speed: 14.3 t/s
@kj-c0d3s
Comment options

For giggles, I built the latest Vulkan build:

Same original prompt entry point:

Prompt

  • Tokens: 9044
  • Time: 433164.602 ms
  • Speed: 20.9 t/s

Generation

  • Tokens: 3705
  • Time: 194148.47 ms
  • Speed: 19.1 t/s

Is -fa not working, or is there something else going on here? Wild that I get a speedup on tg but my pp is abysmal... sadge kj

@kj-c0d3s
Comment options

silly question @ggerganov - is it possible to use ROCm for prompt processing and Vulkan for token gen?

@nullnuller
Comment options

--split-mode row

Does it work seamlessly with --tensor-split ?

Comment options

More benchmarks for NVIDIA >64 GB. This one with the workstation edition of the RTX 6000 Pro Blackwell.

Crazy how the difference in performance to the 300W Max-Q version is only around 15%. I should start running my GPU at 300W as well to save some energy. 😅

gpt-oss-20b

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | 2048 | 1 | pp2048 | 11521.95 ± 26.03 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | 2048 | 1 | pp8192 | 10673.03 ± 22.35 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | 2048 | 1 | pp16384 | 9772.06 ± 19.59 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | 2048 | 1 | pp32768 | 8267.46 ± 15.58 |
| gpt-oss ?B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | 2048 | 1 | tg128 | 286.91 ± 0.22 |

build: a6d3cfe (6205)

gpt-oss-120b

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes

| model | size | params | backend | ngl | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp2048 | 5518.07 ± 31.18 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp8192 | 5315.65 ± 21.91 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp16384 | 5012.78 ± 24.18 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | pp32768 | 4503.36 ± 31.57 |
| gpt-oss ?B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 1 | 2048 | 1 | tg128 | 196.31 ± 0.14 |

build: a6d3cfe (6205)

Edit: maybe worth noting that my GPU only draws around 390W out of the maximum 600W while running the benchmark. Probably hints at optimization opportunities.

1 reply
@SoftwareRenderer
Comment options

Very interesting that it doesn't come anywhere near max power draw for this workload!

For reference, the Max-Q version draws around ~250W during pp, and ~280W during tg for this benchmark, measured in nvtop.

Comment options

AMD Ryzen 7 7700

| Id | Timestamp | Model | Input Tokens | Output Tokens | Prompt Processing | Generation Speed | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | 8/19/2025, 11:11:18 AM | gpt-oss-20b-mxfp4 | 10,897 | 134 | 32.71 t/s | 9.51 t/s | 30.29s |
/app/llama-server
      --model /models/gpt-oss-20b-mxfp4.gguf
      -c 0
      -fa
      --reasoning-format auto
      --no-warmup
      --chat-template-kwargs "{\"reasoning_effort\": \"high\"}"
      --seed 3407
      --repeat-penalty 1.05
      --jinja
      --chat-template-file /models/gpt-oss/chat_template.jinja
      --grammar-file /models/gpt-oss/cline.gbnf
      --temp 1.0
      --top-p 1.0
      --top-k 0.0
      --min-p 0.0
      -ngl 99
      --port 9999   
0 replies
Comment options

Thanks for the guide!
Here are my results with Nvidia RTX 5070 12 GB, Ryzen 5 9600X, and 64 GB DDR5‑6000:

llama-server -hf ggml-org/gpt-oss-20b-GGUF \
  --ctx-size 32768 --jinja -ub 2048 -b 2048 -ngl 99 -fa \
  --n-cpu-moe 2 --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0

Speed: 62.21 tokens per second

The 20‑B model itself is working surprisingly well. Has anyone managed to connect it to LangChain?

5 replies
@QuantiusBenignus
Comment options

This is actually quite interesting. Are you sure about the -ub batch size? I also have 12GB of VRAM on an RTX 3060 (12288 MB total, with 378 MB reserved for the Linux driver) and I cannot fit in VRAM with those settings at 32K context (including -ncmoe 2). I need to drop -ub to the default of 512 to have 300MB spare. When I do, I get 60 tokens/sec (Ryzen 7 5700X, 32 GB DDR4 RAM). I am surprised that the difference in hardware generations (extra bandwidth of the 5070 vs the 3060, DDR5 vs DDR4 memory bandwidth, etc.) results in such a small performance difference. Could it be the 6 cores of the Ryzen 5 9600X vs. the 8 cores of the 5700X? (Assuming that the --threads default of -1 automatically chooses the number of cores of the CPU.)

Edit: the CPU is a Ryzen 7 5700X, sorry. In any case, the LLM works well; it makes a good case to upgrade RAM to 64GB and load its 120B big brother.

@Spyro000
Comment options

Great results! I'd say the CPU and DDR5 are the bottleneck. When I run the model entirely on VRAM (with a small 2048 context window), I get 128 t/s.

@QuantiusBenignus
Comment options

You are right. The extra compute power and bandwidth of the 5070 over the 3060 shine when there are no GPU-to-CPU (RAM) data jumps. Inference is almost twice as fast (128 t/s vs. 75 t/s). You should try the 120B with that much RAM and, if you don't mind, post the results.

@Spyro000
Comment options

Not enough memory to run the 120B model reliably. I did manage to start it, though, and got 12t/s. I suppose it would run ok with more RAM.

@QuantiusBenignus
Comment options

Bought 32GB of extra RAM and ran the unsloth UD-Q4_K_XL quant of gpt-oss-120b with -ncmoe 32 (both mmapped and with --no-mmap). In either case I get 15 tok/sec generation and 85 tok/sec pp2048. The -ncmoe 32 setting leaves about 2.6 GB of VRAM available on the GPU, but RAM is under moderate pressure with --no-mmap (only 2 GB of RAM remains available). For lower RAM pressure, let llama-server mmap the file and you should be OK in most cases (lower your swappiness just in case, or try --mlock), assuming your total model size is 63GB (the aforementioned quant or the MXFP4 from ggml-org). I think this is a reliable way to run the model - a true mid-tier LLM generating at reading speed on a Linux machine with a low-end NVIDIA GPU.
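A sketch of that kind of launch, for anyone wanting to reproduce it (the model path is an example, and --n-cpu-moe should be tuned to your own VRAM as described in the guide):

llama-server -m gpt-oss-120b-UD-Q4_K_XL.gguf --jinja -fa -c 32768 -ngl 99 --n-cpu-moe 32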

Thank you llama.cpp developers!

Comment options

Some numbers for AMD Ryzen AI 9 HX 370 with Radeon 890M (64GB allocated to VRAM out of 128GB total RAM) using Vulkan:

Benchmarks for `gpt-oss-120b`
$ ./llama-cpp/build/bin/llama-bench -m models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 99 -t 1 -fa 1 -b 2048,4096 -ub 512,1024,2048,4096 -p 2048,8192
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |      512 |  1 |          pp2048 |         92.62 ± 2.87 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |      512 |  1 |          pp8192 |         84.11 ± 0.14 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |      512 |  1 |           tg128 |         18.98 ± 0.16 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |     1024 |  1 |          pp2048 |         84.88 ± 0.25 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |     1024 |  1 |          pp8192 |         81.49 ± 0.32 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |     1024 |  1 |           tg128 |         19.01 ± 0.10 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |     2048 |  1 |          pp2048 |         83.67 ± 0.34 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |     2048 |  1 |          pp8192 |         79.53 ± 0.11 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |     2048 |  1 |           tg128 |         19.12 ± 0.07 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |     4096 |  1 |          pp2048 |         83.73 ± 0.24 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |     4096 |  1 |          pp8192 |         79.40 ± 0.07 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    2048 |     4096 |  1 |           tg128 |         19.28 ± 0.06 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |      512 |  1 |          pp2048 |         88.26 ± 0.28 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |      512 |  1 |          pp8192 |         83.43 ± 0.13 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |      512 |  1 |           tg128 |         19.37 ± 0.09 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |     1024 |  1 |          pp2048 |         84.85 ± 0.24 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |     1024 |  1 |          pp8192 |         81.42 ± 0.09 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |     1024 |  1 |           tg128 |         19.46 ± 0.10 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |     2048 |  1 |          pp2048 |         83.54 ± 0.36 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |     2048 |  1 |          pp8192 |         79.29 ± 0.32 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |     2048 |  1 |           tg128 |         19.54 ± 0.08 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |     4096 |  1 |          pp2048 |         82.78 ± 0.23 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |     4096 |  1 |          pp8192 |         74.58 ± 0.07 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |    4096 |     4096 |  1 |           tg128 |         19.58 ± 0.02 |

build: f08c4c0d (6199)
$ ./llama-cpp/build/bin/llama-bench -m models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 99 -t 1 -fa 1 -p 0 -n 512,1024 --delay 120
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |  1 |           tg512 |         19.78 ± 0.10 |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |  1 |          tg1024 |         19.08 ± 1.25 |

build: f08c4c0d (6199)
$ ./llama-cpp/build/bin/llama-bench -m models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 99 -t 1 -fa 1 -p 0 -n 2048
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |  1 |          tg2048 |         18.21 ± 2.01 |

build: f08c4c0d (6199)
$ ./llama-cpp/build/bin/llama-bench -m models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 99 -t 1 -fa 1 -p 0 -n 4096
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |  1 |          tg4096 |         12.38 ± 4.12

build: f08c4c0d (6199)
Benchmarks for `gpt-oss-20b`
$ ./llama-cpp/build/bin/llama-bench -m models/gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1 -fa 0,1 -p 0 -n 512 --delay 180
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  0 |           tg512 |         27.06 ± 0.07 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |           tg512 |         27.48 ± 0.16 |

build: f08c4c0d (6199)
$ ./llama-cpp/build/bin/llama-bench -m models/gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1 -fa 1 -p 0 -n 128,256,1024,2048,4096 --delay 180
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |           tg128 |         27.67 ± 0.26 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |           tg256 |         27.38 ± 0.10 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |          tg1024 |         26.10 ± 2.62 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |          tg2048 |         26.83 ± 0.09 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |          tg4096 |         17.85 ± 6.01 |

build: f08c4c0d (6199)
$ ./llama-cpp/build/bin/llama-bench -m models/gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1 -fa 1 -n 0 -p 256,512,1024,2048,4096,8192 --delay 180
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |           pp256 |        244.15 ± 1.59 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |           pp512 |        285.28 ± 3.06 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |          pp1024 |        283.54 ± 0.79 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |          pp2048 |        274.22 ± 2.84 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |          pp4096 |       252.54 ± 11.57 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |          pp8192 |        230.29 ± 8.31 |


build: f08c4c0d (6199)
$ ./llama-cpp/build/bin/llama-bench -m models/gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1 -fa 1 -b 2048,4096 -ub 512,1024,2048,4096 -n 0 -p 4096 --delay 60
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | n_batch | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |    2048 |      512 |  1 |          pp4096 |       252.72 ± 11.14 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |    2048 |     1024 |  1 |          pp4096 |        244.87 ± 6.57 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |    2048 |     2048 |  1 |          pp4096 |        234.58 ± 6.17 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |    2048 |     4096 |  1 |          pp4096 |        234.88 ± 6.50 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |    4096 |      512 |  1 |          pp4096 |        245.97 ± 9.11 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |    4096 |     1024 |  1 |          pp4096 |        245.34 ± 6.65 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |    4096 |     2048 |  1 |          pp4096 |        235.12 ± 6.79 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |    4096 |     4096 |  1 |          pp4096 |        216.17 ± 6.05 |

build: f08c4c0d (6199)

Edit: re-ran benchmarks on latest build as of Sep 15 (tldr: almost 2x increase in t/s for prompt processing)

Benchmarks for `gpt-oss-120b`
$ ./llama-cpp/build/bin/llama-bench -m models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 99 -t 1 -fa 1 -p 1024 -n 512 --delay 180
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |  1 |          pp1024 |        165.48 ± 1.59 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | Vulkan     |  99 |       1 |  1 |           tg512 |         19.64 ± 0.06 |

build: 28c39da7c (6478)
Benchmarks for `gpt-oss-20b`
$ ./llama-cpp/build/bin/llama-bench -m models/gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1 -fa 1 -p 1024 -n 512 --delay 180
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 890M Graphics (RADV GFX1150) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | threads | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |          pp1024 |        430.67 ± 9.91 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |  1 |           tg512 |         27.85 ± 0.08 |

build: 28c39da7c (6478)
1 reply
@traysh
Comment options

In the Phoronix benchmarks the performance was much higher; what could be the difference?
https://www.phoronix.com/review/amd-rocm-7-strix-halo/3

Comment options

I got this running on an Intel AI PC (Intel Core Ultra 7 258V, with Vulkan!). The new GPU driver can now change the GPU memory allocation, so it can easily fit all 25 layers in the (shared) VRAM.

Screenshot 2025-08-19 145713
llama_perf_sampler_print:    sampling time =       5.94 ms /    44 runs   (    0.14 ms per token,  7406.16 tokens per second)
llama_perf_context_print:        load time =   17569.42 ms
llama_perf_context_print: prompt eval time =    2044.57 ms /    84 tokens (   24.34 ms per token,    41.08 tokens per second)
llama_perf_context_print:        eval time =    4781.04 ms /    85 runs   (   56.25 ms per token,    17.78 tokens per second)
llama_perf_context_print:       total time =  301066.37 ms /   169 tokens
llama_perf_context_print:    graphs reused =         84

Video example:
https://youtu.be/C_9W19j0t_A

log.txt

P.S.
One more thing: the documentation was kind of missing this (https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#vulkan), so it won't compile with CURL support by default: #9937

0 replies
Comment options

Hi,

Thank you very much for your work!

My setup

  • AMD Ryzen Threadripper 3990X
  • AsRock TRX40 Creator
  • Ubuntu 24.04 LTS (6.14.0-29-generic)
  • ROCm 7.0.1
  • 3x AsRock Taichi Radeon 7900 XTX 24GB
  • llama.cpp build: 4067f07 (6520)

gpt-oss:20b

❯ llama-bench -m gpt-oss-20b-GGUF -ngl 99 -fa 1 -b 2048,4096 -ub 2048,4096 -p 2048,8192,16384,32768 --split-mode none
| model | size | params | backend | ngl | n_batch | n_ubatch | sm | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 2048 | none | 1 | pp2048 | 4544.54 ± 16.61 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 2048 | none | 1 | pp8192 | 3559.03 ± 14.48 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 2048 | none | 1 | pp16384 | 2772.53 ± 9.84 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 2048 | none | 1 | pp32768 | 1829.37 ± 5.86 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 2048 | none | 1 | tg128 | 133.35 ± 0.14 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 4096 | none | 1 | pp2048 | 4486.20 ± 6.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 4096 | none | 1 | pp8192 | 3537.48 ± 6.57 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 4096 | none | 1 | pp16384 | 2772.35 ± 5.95 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 4096 | none | 1 | pp32768 | 1840.97 ± 4.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 4096 | none | 1 | tg128 | 133.47 ± 0.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 4096 | 2048 | none | 1 | pp2048 | 4490.88 ± 22.26 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 4096 | 2048 | none | 1 | pp8192 | 3543.82 ± 3.11 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 4096 | 2048 | none | 1 | pp16384 | 2770.36 ± 3.92 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 4096 | 2048 | none | 1 | pp32768 | 1836.09 ± 7.45 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 4096 | 2048 | none | 1 | tg128 | 133.47 ± 0.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 4096 | 4096 | none | 1 | pp2048 | 4488.11 ± 12.69 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 4096 | 4096 | none | 1 | pp8192 | 3480.95 ± 6.45 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 4096 | 4096 | none | 1 | pp16384 | 2670.57 ± 5.14 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 4096 | 4096 | none | 1 | pp32768 | 1773.49 ± 9.14 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 4096 | 4096 | none | 1 | tg128 | 133.25 ± 0.13 |
❯ llama-batched-bench -m gpt-oss-20b-GGUF -c 132096 -b 2048 -ub 2048 -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1,2,4 --split-mode none
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 128 | 1 | 128 | 0.000 | 0.00 | 0.996 | 128.51 | 0.996 | 128.51 |
| 0 | 128 | 2 | 256 | 0.000 | 0.00 | 1.383 | 185.05 | 1.383 | 185.05 |
| 0 | 128 | 4 | 512 | 0.000 | 0.00 | 1.684 | 304.10 | 1.684 | 304.10 |
| 2048 | 128 | 1 | 2176 | 0.522 | 3920.13 | 1.125 | 113.79 | 1.647 | 1320.90 |
| 2048 | 128 | 2 | 4352 | 0.893 | 4587.44 | 1.769 | 144.75 | 2.661 | 1635.21 |
| 2048 | 128 | 4 | 8704 | 1.777 | 4608.83 | 2.328 | 219.92 | 4.106 | 2120.05 |
| 8192 | 128 | 1 | 8320 | 2.303 | 3557.06 | 1.416 | 90.42 | 3.719 | 2237.40 |
| 8192 | 128 | 2 | 16640 | 4.578 | 3579.01 | 2.223 | 115.14 | 6.801 | 2446.64 |
| 8192 | 128 | 4 | 33280 | 9.121 | 3592.41 | 3.220 | 159.02 | 12.341 | 2696.68 |
| 16384 | 128 | 1 | 16512 | 5.910 | 2772.12 | 1.805 | 70.93 | 7.715 | 2140.26 |
| 16384 | 128 | 2 | 33024 | 11.779 | 2781.89 | 2.885 | 88.74 | 14.664 | 2252.08 |
| 16384 | 128 | 4 | 66048 | 23.473 | 2791.93 | 4.501 | 113.75 | 27.974 | 2361.02 |
| 32768 | 128 | 1 | 32896 | 17.686 | 1852.73 | 2.619 | 48.86 | 20.306 | 1620.03 |
| 32768 | 128 | 2 | 65792 | 35.455 | 1848.41 | 4.378 | 58.48 | 39.833 | 1651.70 |
| 32768 | 128 | 4 | 131584 | 71.422 | 1835.19 | 7.390 | 69.28 | 78.811 | 1669.61 |

gpt-oss:120b

❯ llama-bench -m gpt-oss-120b-GGUF -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
| model | size | params | backend | ngl | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 2048 | 1 | pp2048 | 2200.46 ± 7.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 2048 | 1 | pp8192 | 1856.09 ± 3.19 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 2048 | 1 | pp16384 | 1528.45 ± 6.29 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 2048 | 1 | pp32768 | 1077.15 ± 2.84 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 1 | 2048 | 1 | tg128 | 57.59 ± 0.06 |
0 replies
Comment options

Hi team, I'm mostly working with vLLM and TRT-LLM; trying out llama.cpp with 8x H200 and sharing my numbers:

gpt-oss:20b

./llama-bench -m ./mod/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -fa 1 -b 2048,4096 -ub 2048,4096 -p 2048,8192,16384,32768

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA H200, compute capability 9.0, VMM: yes
Device 1: NVIDIA H200, compute capability 9.0, VMM: yes
Device 2: NVIDIA H200, compute capability 9.0, VMM: yes
Device 3: NVIDIA H200, compute capability 9.0, VMM: yes
Device 4: NVIDIA H200, compute capability 9.0, VMM: yes
Device 5: NVIDIA H200, compute capability 9.0, VMM: yes
Device 6: NVIDIA H200, compute capability 9.0, VMM: yes
Device 7: NVIDIA H200, compute capability 9.0, VMM: yes

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 2048 | 1 | pp2048 | 9138.26 ± 67.94 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 2048 | 1 | pp8192 | 9854.43 ± 12.51 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 2048 | 1 | pp16384 | 9623.43 ± 6.93 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 2048 | 1 | pp32768 | 8313.67 ± 8.11 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 2048 | 1 | tg128 | 226.20 ± 0.14 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 4096 | 1 | pp2048 | 9177.56 ± 54.44 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 4096 | 1 | pp8192 | 9874.25 ± 27.49 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 4096 | 1 | pp16384 | 9616.80 ± 21.14 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 4096 | 1 | pp32768 | 8304.30 ± 2.61 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 4096 | 1 | tg128 | 226.29 ± 0.10 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 2048 | 1 | pp2048 | 9175.69 ± 35.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 2048 | 1 | pp8192 | 9802.65 ± 19.19 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 2048 | 1 | pp16384 | 9587.89 ± 15.55 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 2048 | 1 | pp32768 | 8288.19 ± 2.57 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 2048 | 1 | tg128 | 225.96 ± 0.26 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 4096 | 1 | pp2048 | 9145.80 ± 47.77 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 4096 | 1 | pp8192 | 9230.66 ± 28.47 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 4096 | 1 | pp16384 | 9026.51 ± 6.35 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 4096 | 1 | pp32768 | 6923.33 ± 4.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 4096 | 1 | tg128 | 226.23 ± 0.09 |

build: 72b24d9 (6602)


I didn't try out 120B as the 20B performance is already bad - I would expect a much higher tok/s on my system. Maybe I didn't use the correct configs for this benchmark (e.g. tensor parallelism is disabled)?

1 reply
@ggerganov
Comment options

ggerganov Sep 27, 2025
Maintainer Author

You can enable pipeline parallelism by lowering the ubatch size - probably -ub 256 or -ub 128 would be OK for this system.
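For example, the earlier benchmark command could be retried along these lines (a sketch: only the batch settings differ from the original invocation):

./llama-bench -m ./mod/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -fa 1 -b 2048 -ub 256 -p 2048,8192,16384,32768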

Comment options

I am looking for a config for a system with 96 GB RAM and an 8 GB VRAM GPU.
I compiled the CUDA backend together with the Vulkan backend.
I want to load as much of the model as possible onto the 8 GB of VRAM and run the rest on the 780M.
What should I do?
@ggerganov

0 replies
Comment options

./llama-bench -m gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -fa 1 -b 4096 -ub 4096 -p 2048,8192 --device cuda0
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | dev | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 4096 | 1 | CUDA0 | pp2048 | 1040.57 ± 2.37 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 4096 | 1 | CUDA0 | pp8192 | 955.07 ± 0.87 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 4096 | 4096 | 1 | CUDA0 | tg128 | 66.43 ± 0.07 |

build: e6d65fb (6611)

0 replies
Comment options

4070 @ 100W, 2x32GB DDR5-4800 ECC, 7800X3D, B650 TUF GAMING-PLUS and build: d8359f5f (6615):

llama-bench -m gpt-oss-20b-F16.gguf -t 1 -fa 1 -b 4096 -p 2048,8192 --n-cpu-moe 5

| model | size | params | backend | ngl | threads | n_batch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 1 | 4096 | 1 | pp2048 | 1042.36 ± 35.89 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 1 | 4096 | 1 | pp8192 | 1012.44 ± 17.82 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 1 | 4096 | 1 | tg128 | 33.68 ± 1.03 |

llama-bench -m gpt-oss-20b-F16.gguf -t 4 -fa 1 -b 4096 -p 2048,8192 --n-cpu-moe 5

| model | size | params | backend | ngl | threads | n_batch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 4 | 4096 | 1 | pp2048 | 1070.20 ± 53.75 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 4 | 4096 | 1 | pp8192 | 1028.48 ± 20.90 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 4 | 4096 | 1 | tg128 | 64.19 ± 0.52 |

llama-bench -m gpt-oss-20b-F16.gguf -t 4 -fa 1 -b 4096 -p 2048,8192,32768 --n-cpu-moe 7

| model | size | params | backend | ngl | threads | n_batch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 4 | 4096 | 1 | pp2048 | 903.72 ± 27.09 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 4 | 4096 | 1 | pp8192 | 888.57 ± 19.28 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 4 | 4096 | 1 | pp32768 | 795.59 ± 12.34 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 4 | 4096 | 1 | tg128 | 57.90 ± 0.20 |

4070 @ normal 200W:
llama-bench -m gpt-oss-20b-F16.gguf -t 4 -fa 1 -b 4096 -p 2048,8192 --n-cpu-moe 5

| model | size | params | backend | ngl | threads | n_batch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 4 | 4096 | 1 | pp2048 | 1215.83 ± 42.63 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 4 | 4096 | 1 | pp8192 | 1170.65 ± 16.06 |
| gpt-oss 20B F16 | 12.83 GiB | 20.91 B | RPC,Vulkan | 99 | 4 | 4096 | 1 | tg128 | 65.84 ± 0.24 |
3 replies
@pt13762104
Comment options

From my testing, it seems like the Vulkan backend suffers much more from offloading compared to e.g. CUDA. Could you try to test it again with the CUDA backend?

@gesrtewrtwr
Comment options

Unfortunately there's no Linux CUDA build. But if the difference is within 20% or so, I don't care and Vulkan is fine (not a fan of downloading the huge CUDA build each time; that's why I switched to the 25MB Vulkan build and llama.cpp's webUI directly, instead of using any of the wrappers - the new webUI is a nice bonus too, and so far I don't need any of their features). If the 4070 had 16 GB VRAM, I could fully offload the LLM and get much faster speeds. When I got my 4070, LLMs weren't a thing, so the real goal would be to get a 16 GB VRAM GPU for this 20B LLM. A CUDA Linux build would be nice I guess, but what really matters for inference is the memory bandwidth and of course that the LLM fits well into VRAM so that there's some space left for context.

@pt13762104
Comment options

I believe the difference is not 20%, but rather 2-3x (on prompt processing; tg looks fine). A 4060 Ti gets 3800 t/s pp. But if you don't want the bloat, maybe try decreasing the batch size and offloading more layers. That will probably work better on Vulkan.

Comment options

Quick tests on an H100: ~1K t/s with gpt-oss-20b at full context length (consider this a very rough estimate).

docker run -d --name=gptoss20 \
  --restart unless-stopped \
  --network=host \
  -v /apps/models:/models \
  --gpus all \
  gguf:server-cuda \
  --host 127.0.0.1 -ngl 99 \
  --port 8081 \
  -m /models/gpt-oss-20b-mxfp4.gguf \
  -c 0 -fa on --jinja --reasoning-format none

0 replies
Comment options

I've got my Framework Desktop and I've managed to build llama.cpp there, so I can provide some data.

Hardware specification

This is quite an unusual machine, as it's an APU with shared memory (architecture-wise, similar to Apple M-series APUs). My configuration rocks 128GB of DDR5 (V)RAM running @ 8000MT/s, with a theoretical throughput of 256GB/s (around 210-220GB/s in practice). Unfortunately it's soldered to the motherboard, but that's the price we pay for the performance we get from those modules. That memory can be fully used by both the CPU and the GPU (on Linux; on Windows you get up to 96GB of VRAM with 32GB of RAM, as you don't have GTT there).

ROCm (both the latest 6.x.x and 7.0.0rc1) is currently broken. It seems that it can't use the memory on this APU correctly: it reports completely wrong memory sizes and in effect cannot allocate more than 32GB of VRAM. While that's enough for stuff like embeddings and small models, it's not worth it at this point IMHO.

Vulkan works great and uses GTT correctly, so it can give the GPU access to the whole memory, even beyond what's allocated in the BIOS, up to the limits configured via modprobe. In my case that's ~120GB of memory, configured with the following modprobe config:

options amdgpu gttsize=122800
options amdgpu vm_fragment_size=8
options ttm pages_limit=31457280
options ttm page_pool_size=15728640

along with amd_iommu=off option in kernel parameters and tuned with accelerator-performance profile. I recommend reading https://strixhalo-homelab.d7.wtf/AI/AI-Capabilities-Overview for more details.

CPU: Ryzen AI MAX+ 395 (Strix Halo), 32 cores
GPU: AMD Radeon 8060S
RAM: 128GB DDR5, 8000MT/s (256GB/s theoretical throughput)
OS: NixOS 25.05 (Warbler) x86_64 w/ 6.16.9-zen1 kernel
llama.cpp commit hash: a23b9bd (b6697)

gpt-oss-20b

This is the same model that I've tested on my RX 7900XT here. Note that Vulkan reports no BF16 support, which may be the cause of the hindered performance on this version (IIRC this is MXFP4 mixed with BF16). I've checked other (u)batch sizes; 2048/512 is the optimal one.

> llama-bench -m ./gpt-oss-20b.auto.gguf -t 1 -fa 1 -mmp 1 -b 2048 -ub 512 -p 2048,8192,16384 -n 128,512,2048
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B BF16 | 12.83 GiB | 20.91 B | Vulkan | 99 | 1 | 1 | pp2048 | 982.38 ± 2.81 |
| gpt-oss 20B BF16 | 12.83 GiB | 20.91 B | Vulkan | 99 | 1 | 1 | pp8192 | 839.62 ± 2.14 |
| gpt-oss 20B BF16 | 12.83 GiB | 20.91 B | Vulkan | 99 | 1 | 1 | pp16384 | 668.83 ± 2.50 |
| gpt-oss 20B BF16 | 12.83 GiB | 20.91 B | Vulkan | 99 | 1 | 1 | tg128 | 48.33 ± 0.07 |
| gpt-oss 20B BF16 | 12.83 GiB | 20.91 B | Vulkan | 99 | 1 | 1 | tg512 | 48.36 ± 0.06 |
| gpt-oss 20B BF16 | 12.83 GiB | 20.91 B | Vulkan | 99 | 1 | 1 | tg2048 | 47.78 ± 0.02 |

gpt-oss-120b

My gpt-oss-120b GGUF is Unsloth's Q6_K_XL quant, which may be the cause of better token generation performance compared to the BF16 20B model.

> llama-bench -m ./gpt-oss-120b-UD-Q6_K_XL.gguf -t 1 -fa 1 -mmp 1 -b 2048 -ub 512 -p 2048,8192,16384 -n 128,512,2048
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | pp2048 | 503.97 ± 3.40 |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | pp8192 | 452.69 ± 1.33 |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | pp16384 | 379.42 ± 1.83 |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | tg128 | 53.46 ± 0.16 |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | tg512 | 53.50 ± 0.07 |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | tg2048 | 52.50 ± 0.05 |

I've re-run the tests with -mmp 0 to check whether mmap affects performance on Vulkan, and I got slightly better results, so I recommend disabling it.

> llama-bench -m ./gpt-oss-120b-UD-Q6_K_XL.gguf -t 1 -fa 1 -mmp 0 -b 2048 -ub 512 -p 2048,8192,16384 -n 128,512,2048
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | threads | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | 0 | pp2048 | 509.18 ± 2.78 |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | 0 | pp8192 | 455.13 ± 0.84 |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | 0 | pp16384 | 381.19 ± 1.66 |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | 0 | tg128 | 53.60 ± 0.01 |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | 0 | tg512 | 53.61 ± 0.02 |
| gpt-oss 120B Q6_K | 58.93 GiB | 116.83 B | Vulkan | 99 | 1 | 1 | 0 | tg2048 | 52.62 ± 0.01 |
0 replies
Comment options

Setup is 2x AMD Instinct MI50 with 32GB each, rocm 6.3.4:

heat:~/Projects/llama.cpp$ ./build-rocm/bin/llama-bench -m models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
  Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
| model | size | params | backend | ngl | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 2048 | 1 | pp2048 | 1307.05 ± 1.95 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 2048 | 1 | pp8192 | 1654.80 ± 1.99 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 2048 | 1 | pp16384 | 1464.90 ± 65.28 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 2048 | 1 | pp32768 | 1009.23 ± 41.57 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 2048 | 1 | tg128 | 78.35 ± 0.22 |

build: 3a002af (6698)

3 replies
@ggerganov
Comment options

ggerganov Oct 8, 2025
Maintainer Author

With multiple GPUs, you can reduce the -ub while keeping -b high in order to enable pipeline parallelism. This should improve the prompt processing speed.
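For example, the benchmark above could be retried with a smaller micro-batch while keeping the logical batch size (a sketch; only -b/-ub differ from the original command):

./build-rocm/bin/llama-bench -m models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf -t 1 -fa 1 -b 2048 -ub 512 -p 2048,8192,16384,32768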

@Mushoz
Comment options

Is there a good place where I can read up on -ub and -b, how they relate to each other, and how they differ? I have never really understood the concept of logical and physical batch sizes.

@ggerganov
Comment options

ggerganov Oct 8, 2025
Maintainer Author

The original PR introducing pipeline parallelism and logical/physical batches should have some info: #6017

Comment options

Hi everyone,

I am testing GPT-OSS 20B on a:

  • PC bought in 2020, AMD Ryzen 7, 128GB RAM
  • AMD Radeon RX 9060 XT (just added to the computer)
  • Fedora 42
  • Llama.cpp built with -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1200 -DGGML_HIP_ROCWMMA_FATTN=ON. For info, I got warning "rocwmma fattn is not suported on RDNA4 on rocwmma < v2.0.0, expect degraded performance".

Despite the warning, one sample from the llama-server logs:

eval time = 1016.55 ms / 79 tokens ( 12.87 ms per token, 77.71 tokens per second)

which is way more than I was expecting.

Here is the benchmark run on the Q8_0 model:

$ build/bin/llama-bench -m ../MODELS/gpt-oss-20b/gpt-oss-20b-Q8_0.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1200 (0x1200), VMM: no, Wave Size: 32

| model | size | params | backend | ngl | threads | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 2048 | 1 | pp2048 | 796.09 ± 1.26 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 2048 | 1 | pp8192 | 743.42 ± 1.04 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 2048 | 1 | pp16384 | 680.03 ± 0.78 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 2048 | 1 | pp32768 | 579.73 ± 0.33 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | ROCm | 99 | 1 | 2048 | 1 | tg128 | 78.19 ± 0.03 |

build: d2ee056 (6713)

0 replies
Comment options

This thread has been super helpful for getting a feel for performance across various configurations at the hardware and software level. I just wanted to cross-pollinate InferenceMAX (benchmark page, GitHub, blog post) - no affiliation - in case folks here would be interested in getting involved in this open-source benchmarking effort. The first rounds of benchmark results are focused on more expensive enterprise-grade systems, but the project could definitely do with some entry-level systems as well, for those of us who aren't there (yet).


I suggest changing the note in the overview to:

'It is not necessary to fit the entire model in VRAM to get good performance. Keeping just the attention tensors and the KV cache in VRAM and offloading the rest of the model to CPU RAM can provide decent performance as well. This is taken into account in the rest of the guide.'
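
For reference, a minimal sketch of that strategy with llama-server (the Hugging Face repo name and the --n-cpu-moe value are assumptions here; the right number of CPU-side MoE layers depends on available VRAM):

./build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja -ngl 99 --n-cpu-moe 30

With -ngl 99 all layers are offloaded, while --n-cpu-moe keeps the MoE expert tensors of the first 30 layers in CPU RAM, so the attention tensors and the KV cache stay on the GPU.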


Wow, things are just getting faster and faster! (compiled on Ubuntu 25.10 with cuda-13 and -DCMAKE_CUDA_ARCHITECTURES=120)
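
The configure step was presumably something like the following (the -DGGML_CUDA=ON flag is an assumption; only the CUDA architecture flag was stated above):

cmake -S . -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build --config Release -j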

zen4-ubuntu:~/src/llama.cpp$ ./build/bin/llama-bench -m lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf -t 1 -fa 1 -b 4096 -ub 2048,4096 -p 2048,8192,16384,32768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes

| model | size | params | backend | threads | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | -------: | -: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA,BLAS | 1 | 4096 | 2048 | 1 | pp2048 | 13001.61 ± 75.68 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA,BLAS | 1 | 4096 | 2048 | 1 | pp8192 | 13382.82 ± 33.31 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA,BLAS | 1 | 4096 | 2048 | 1 | pp16384 | 12922.16 ± 27.46 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA,BLAS | 1 | 4096 | 2048 | 1 | pp32768 | 11817.23 ± 24.73 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA,BLAS | 1 | 4096 | 2048 | 1 | tg128 | 298.58 ± 0.50 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA,BLAS | 1 | 4096 | 4096 | 1 | pp2048 | 12929.68 ± 37.76 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA,BLAS | 1 | 4096 | 4096 | 1 | pp8192 | 12377.66 ± 49.68 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA,BLAS | 1 | 4096 | 4096 | 1 | pp16384 | 12287.30 ± 29.32 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA,BLAS | 1 | 4096 | 4096 | 1 | pp32768 | 11352.96 ± 13.13 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA,BLAS | 1 | 4096 | 4096 | 1 | tg128 | 298.34 ± 0.46 |

build: 1bb4f43 (6782)


Initial results on dual RX 470 8G mining edition cards on Windows; I will update with Linux results later.
The setup is running on PCIe 3.0 x16/x4.

llama-bench -m ..\gpt-oss-20b-mxfp4.gguf -sm layer -mg 1 -r 2 -t 8 -ngl 100 -fa 1 -p 512,2048,8192,16384

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon (TM) RX 470 Graphics (AMD proprietary driver) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = Radeon (TM) RX 470 Graphics (AMD proprietary driver) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | threads | main_gpu | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | 1 | pp512 | 93.89 ± 0.09 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | 1 | pp2048 | 90.77 ± 0.36 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | 1 | pp8192 | 76.88 ± 0.10 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | 1 | pp16384 | 80.69 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | 1 | tg128 | 13.22 ± 0.01 |

build: 0cb7a06 (6773)

@pebaryan:

/llama-bench -m /media/peb/PORT/gpt-oss-20b-mxfp4.gguf

ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 2 = NVIDIA P106-100 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | pp512 | 175.22 ± 2.41 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | tg128 | 29.81 ± 0.53 |

build: 66b0dbc (6791)

@pebaryan:

export GGML_VK_VISIBLE_DEVICES=0,2
./llama-bench -m /media/peb/PORT/gpt-oss-20b-mxfp4.gguf 
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | pp512 | 237.04 ± 3.11 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | tg128 | 38.66 ± 0.53 |

build: 66b0dbc (6791)

@pebaryan:

export GGML_VK_VISIBLE_DEVICES=0,3
./llama-bench -m /media/peb/PORT/gpt-oss-20b-mxfp4.gguf 

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = NVIDIA P106-100 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | pp512 | 142.36 ± 2.35 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 99 | tg128 | 30.93 ± 0.10 |

build: 66b0dbc (6791)

@pebaryan:

./llama-bench -m /media/peb/PORT/gpt-oss-20b-mxfp4.gguf -sm layer -mg 0 -r 2 -t 8 -ngl 100 -fa 1 -p 512,2048,8192,16384

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | pp512 | 223.22 ± 0.21 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | pp2048 | 211.10 ± 0.13 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | pp8192 | 169.15 ± 0.37 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | pp16384 | 131.44 ± 0.11 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | tg128 | 36.50 ± 0.42 |

build: 66b0dbc (6791)

@pebaryan:

/llama-bench -m /media/peb/PORT/gpt-oss-20b-mxfp4.gguf -sm layer -mg 0 -r 2 -t 8 -ngl 100 -fa 1 -p 512,2048,8192,16384

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 470 Graphics (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = NVIDIA P106-100 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model | size | params | backend | ngl | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | pp512 | 133.94 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | pp2048 | 132.43 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | pp8192 | 117.49 ± 0.42 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | pp16384 | 101.06 ± 0.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 8 | 1 | tg128 | 29.41 ± 0.15 |

build: 66b0dbc (6791)


I will be posting results of multi-GPU setups using CUDA on low-end hardware (VRAM < 12 GB), starting with a GTX 1070 Ti and a GTX 1060:

./llama-bench -m /media/peb/PORT/gpt-oss-20b-mxfp4.gguf -sm layer -mg 0 -r 2 -t 8 -ngl 100 -fa 1 -p 512,2048,8192,16384

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes
| model | size | params | backend | ngl | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 8 | 1 | pp512 | 848.91 ± 1.64 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 8 | 1 | pp2048 | 1249.16 ± 1.74 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 8 | 1 | pp8192 | 1230.78 ± 0.72 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 8 | 1 | pp16384 | 1034.36 ± 3.60 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 8 | 1 | tg128 | 45.57 ± 0.04 |

build: b907255 (6479)
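
Since these are two CUDA GPUs, the earlier tip about pipeline parallelism may apply here as well. A variant worth trying (not measured here; the batch values are only an example) keeps a large logical batch with smaller micro-batches:

./llama-bench -m /media/peb/PORT/gpt-oss-20b-mxfp4.gguf -sm layer -mg 0 -ngl 100 -fa 1 -b 2048 -ub 512 -p 2048,8192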

@pebaryan:

Here is how performance degrades when I introduce a third card into the mix (on an x1 slot):

./llama-bench -m /media/peb/PORT/gpt-oss-20b-mxfp4.gguf -sm layer -mg 0 -r 2 -t 6 -ngl 100 -fa 1 -p 512,2048,8192,16384

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: Tesla P4, compute capability 6.1, VMM: yes
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes
  Device 2: NVIDIA P106-100, compute capability 6.1, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp512 | 612.67 ± 0.32 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp2048 | 704.35 ± 1.12 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp8192 | 621.43 ± 1.22 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp16384 | 512.65 ± 0.46 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | tg128 | 31.30 ± 0.65 |

build: b907255 (6479)

@pebaryan:

Next, I replaced the GTX 1060 with a GTX 1070 Ti (x16 slot):

./llama-bench -m /media/peb/PORT/gpt-oss-20b-mxfp4.gguf -sm layer -mg 0 -r 2 -t 6 -ngl 100 -fa 1 -p 512,2048,8192,16384

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: Tesla P4, compute capability 6.1, VMM: yes
  Device 1: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes
  Device 2: NVIDIA P106-100, compute capability 6.1, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp512 | 728.00 ± 6.54 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp2048 | 875.36 ± 0.97 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp8192 | 784.19 ± 1.71 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp16384 | 652.94 ± 0.70 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | tg128 | 34.72 ± 0.51 |

build: b907255 (6479)

@pebaryan:

Then I replaced the P106-100 on the x1 slot with a GTX 1060:

llama.cpp/build/bin/llama-bench -m /media/peb/PORT/gpt-oss-20b-mxfp4.gguf -sm layer -mg 0 -r 2 -t 6 -ngl 100 -fa 1 -p 512,2048,8192,16384

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: Tesla P4, compute capability 6.1, VMM: yes
  Device 1: NVIDIA GeForce GTX 1070 Ti, compute capability 6.1, VMM: yes
  Device 2: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------: | --------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp512 | 763.05 ± 3.60 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp2048 | 923.38 ± 1.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp8192 | 837.49 ± 1.00 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | pp16384 | 703.86 ± 0.82 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 100 | 1 | tg128 | 39.91 ± 0.57 |

build: b907255 (6479)

ggerganov (Maintainer, Author) replied on Oct 18, 2025:

@pebaryan It would be better if you merge all these datapoints into a single comment that you edit over time, instead of posting separate comments. Currently, your messages are generating a lot of notifications, which is not nice for people following the discussion.

@pebaryan:

Sure! I will do that. Sorry for the inconvenience.


AMD Ryzen 7 7700, Debian 13, Linux 6.12

./llama-bench -m /mnt/models/gpt-oss-20b-mxfp4.gguf -ngl 99 -t 1  -p 512,1024 -n 512,1024 --delay 180
load_backend: loaded RPC backend from /mnt/llamacpp/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RAPHAEL_MENDOCINO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /mnt/llamacpp/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /mnt/llamacpp/bin/libggml-cpu-icelake.so
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |           pp512 |         42.63 ± 0.42 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |          pp1024 |         42.56 ± 0.08 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |           tg512 |          7.42 ± 0.00 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |       1 |          tg1024 |          7.34 ± 0.00 |


build: 81387858 (6792)
