🗣️ #357 - Qwen3 - early performance comparisons

Author ikawrakow
Created 2025-04-29
Updated 2025-05-19

Description

The Qwen3 models were officially released, and support was added in ik_llama.cpp in PR #355, so I was curious to run some performance benchmarks. As much as I would like to try the flagship model, I don't have enough horsepower for that, so I experimented with Qwen3-30B-A3B, the MoE model with 30B total and 3B active parameters.

This time I'm using a custom quantization where all experts are quantized with IQ4_XS, all attention tensors with Q5_K, and the output tensor is Q6_K. PPL for this model is only 1.25% above the PPL of the bf16 model, so it is a pretty decent quality quantization. Benchmarks are run on a Ryzen-7950X system with an RTX-4080 GPU. Compared are the latest ik_llama.cpp and llama.cpp versions as of this morning (April 29, 2025).

CPU-only performance

The command line for ik_llama.cpp is

./bin/llama-sweep-bench -m $model -c 16384 -t 16 -fa -rtr -fmoe -ctk q8_0 -ctv q8_0

The llama.cpp command is similar, except that it has no -rtr -fmoe. I'm also including mainline results without Flash Attention (FA); in that case the K-cache is quantized with Q8_0 and the V-cache is fp16.

The following graph shows token generation (TG) performance as a function of N_KV, the number of tokens in the KV cache. Performance is pretty close for an empty KV cache, and the gap grows with N_KV. At 16k tokens ik_llama.cpp is 44% faster than mainline without FA, and 3.3 times faster than mainline with FA enabled.

qwen3_cpu_tg

The next graph shows prompt processing (PP) speed as a function of N_KV. As usual for CPU only inference, ik_llama.cpp is much faster than mainline for PP - 3.3X for small N_KV, increasing to 3.9X at 16k tokens. This is compared to mainline without FA. Compared to llama.cpp with FA enabled, ik_llama.cpp is 11.2X faster.

qwen3_cpu_pp
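
As a sanity check, the speedup figures quoted above can be recomputed from the S_TG and S_PP columns of the tables that follow. This is only a minimal sketch: the `ratio` helper and the variable names are made up for illustration, and the numbers are copied by hand from the rows at N_KV = 0 and N_KV = 15872.

```bash
#!/usr/bin/env bash
# ratio A B: print A/B rounded to two decimals (hypothetical helper, not part of any tool).
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'; }

# S_TG (t/s) at N_KV = 15872: ik_llama.cpp, mainline without FA, mainline with FA.
tg_ik=18.00; tg_main=12.51; tg_main_fa=5.48
# S_PP (t/s) at N_KV = 0 (ik_llama.cpp, mainline without FA).
pp0_ik=474.34; pp0_main=141.84
# S_PP (t/s) at N_KV = 15872: ik_llama.cpp, mainline without FA, mainline with FA.
pp_ik=148.68; pp_main=38.13; pp_main_fa=13.28

tg_vs_main=$(ratio "$tg_ik" "$tg_main")
tg_vs_fa=$(ratio "$tg_ik" "$tg_main_fa")
pp0_vs_main=$(ratio "$pp0_ik" "$pp0_main")
pp_vs_main=$(ratio "$pp_ik" "$pp_main")
pp_vs_fa=$(ratio "$pp_ik" "$pp_main_fa")

echo "TG @ 16k: ${tg_vs_main}x vs mainline without FA, ${tg_vs_fa}x vs mainline with FA"
echo "PP @ 0:   ${pp0_vs_main}x vs mainline without FA"
echo "PP @ 16k: ${pp_vs_main}x vs mainline without FA, ${pp_vs_fa}x vs mainline with FA"
```

A ratio of 1.44 is what the text calls "44% faster"; the other ratios round to the 3.3X, 3.9X and 11.2X quoted above.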

llama.cpp CPU-only performance data without FA
PP  TG  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
512 128 0 3.610 141.84 5.067 25.26
512 128 512 3.902 131.20 5.354 23.91
512 128 1024 4.228 121.09 5.344 23.95
512 128 1536 4.582 111.74 5.528 23.16
512 128 2048 4.837 105.84 5.713 22.40
512 128 2560 5.188 98.69 5.745 22.28
512 128 3072 5.484 93.37 5.917 21.63
512 128 3584 5.793 88.38 6.035 21.21
512 128 4096 6.039 84.78 6.256 20.46
512 128 4608 6.433 79.59 6.449 19.85
512 128 5120 6.685 76.59 6.630 19.31
512 128 5632 7.013 73.00 6.852 18.68
512 128 6144 7.278 70.35 7.075 18.09
512 128 6656 7.689 66.59 7.259 17.63
512 128 7168 7.869 65.07 7.428 17.23
512 128 7680 8.337 61.41 7.604 16.83
512 128 8192 8.488 60.32 7.788 16.44
512 128 8704 8.958 57.15 7.925 16.15
512 128 9216 9.084 56.36 8.080 15.84
512 128 9728 9.557 53.57 8.226 15.56
512 128 10240 9.725 52.65 8.466 15.12
512 128 10752 10.470 48.90 8.575 14.93
512 128 11264 10.334 49.55 8.774 14.59
512 128 11776 10.861 47.14 8.940 14.32
512 128 12288 10.974 46.65 9.121 14.03
512 128 12800 11.494 44.55 9.321 13.73
512 128 13312 11.575 44.23 9.494 13.48
512 128 13824 12.063 42.44 9.665 13.24
512 128 14336 12.267 41.74 9.854 12.99
512 128 14848 12.737 40.20 9.970 12.84
512 128 15360 13.034 39.28 10.103 12.67
512 128 15872 13.427 38.13 10.231 12.51
llama.cpp CPU-only performance data with FA enabled
PP  TG  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
512 128 0 3.677 139.25 5.061 25.29
512 128 512 4.714 108.62 5.427 23.59
512 128 1024 5.922 86.46 5.987 21.38
512 128 1536 6.963 73.53 6.495 19.71
512 128 2048 8.207 62.39 7.086 18.06
512 128 2560 9.405 54.44 7.753 16.51
512 128 3072 10.370 49.37 8.375 15.28
512 128 3584 11.482 44.59 8.908 14.37
512 128 4096 12.604 40.62 9.487 13.49
512 128 4608 13.798 37.11 9.951 12.86
512 128 5120 15.149 33.80 10.504 12.19
512 128 5632 16.055 31.89 11.201 11.43
512 128 6144 17.214 29.74 11.740 10.90
512 128 6656 18.347 27.91 12.409 10.31
512 128 7168 19.478 26.29 12.842 9.97
512 128 7680 20.593 24.86 13.410 9.55
512 128 8192 21.726 23.57 14.082 9.09
512 128 8704 22.886 22.37 14.582 8.78
512 128 9216 23.937 21.39 15.117 8.47
512 128 9728 25.038 20.45 15.800 8.10
512 128 10240 26.188 19.55 16.390 7.81
512 128 10752 27.328 18.74 16.962 7.55
512 128 11264 28.434 18.01 17.550 7.29
512 128 11776 29.491 17.36 18.265 7.01
512 128 12288 30.663 16.70 18.898 6.77
512 128 12800 31.799 16.10 19.649 6.51
512 128 13312 32.887 15.57 20.277 6.31
512 128 13824 34.042 15.04 20.914 6.12
512 128 14336 35.152 14.57 21.562 5.94
512 128 14848 36.281 14.11 22.194 5.77
512 128 15360 37.400 13.69 22.754 5.63
512 128 15872 38.559 13.28 23.348 5.48
ik_llama.cpp CPU-only performance data
PP  TG  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
512 128 0 1.079 474.34 4.858 26.35
512 128 512 1.118 458.04 5.140 24.90
512 128 1024 1.194 428.88 5.059 25.30
512 128 1536 1.273 402.21 5.138 24.91
512 128 2048 1.353 378.31 5.241 24.42
512 128 2560 1.421 360.38 5.318 24.07
512 128 3072 1.501 341.07 5.397 23.72
512 128 3584 1.580 324.10 5.443 23.52
512 128 4096 1.654 309.50 5.522 23.18
512 128 4608 1.731 295.70 5.557 23.03
512 128 5120 1.809 283.11 5.622 22.77
512 128 5632 1.879 272.50 5.688 22.51
512 128 6144 1.963 260.87 5.750 22.26
512 128 6656 2.040 250.94 5.820 21.99
512 128 7168 2.122 241.24 5.893 21.72
512 128 7680 2.193 233.47 5.966 21.45
512 128 8192 2.281 224.44 6.039 21.19
512 128 8704 2.353 217.56 6.109 20.95
512 128 9216 2.436 210.21 6.176 20.73
512 128 9728 2.504 204.46 6.245 20.50
512 128 10240 2.596 197.19 6.317 20.26
512 128 10752 2.670 191.76 6.386 20.04
512 128 11264 2.756 185.79 6.459 19.82
512 128 11776 2.822 181.46 6.528 19.61
512 128 12288 2.917 175.54 6.596 19.41
512 128 12800 2.987 171.41 6.671 19.19
512 128 13312 3.073 166.62 6.740 18.99
512 128 13824 3.121 164.03 6.819 18.77
512 128 14336 3.230 158.50 6.888 18.58
512 128 14848 3.288 155.73 6.961 18.39
512 128 15360 3.389 151.07 7.037 18.19
512 128 15872 3.444 148.68 7.109 18.00

Hybrid inference

The custom IQ4_XS model is 15.4 GiB, so it cannot be fully loaded on my 16 GB RTX-4080 GPU. This gives me the opportunity to try hybrid GPU+CPU inference via tensor overrides on both systems. The command line used in this case is

./bin/llama-sweep-bench -m $model -c 32768 -t 16 -ngl 100 -fa -ot "blk\.3[4-9]\.ffn=CPU,blk\.4[0-9]\.ffn=CPU"

I.e., everything is offloaded to the GPU except for the expert tensors of the last 14 layers. This leaves enough free VRAM to go up to a context of 32k tokens. In the case of ik_llama.cpp, run-time repacking (for the experts left on the CPU) and the fused MoE operation (ffn_up*X)*silu(ffn_gate*X) are enabled via -rtr -fmoe.
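
To see which blocks the two override patterns select, one can run them against a list of tensor names. The sketch below rests on two assumptions that are not stated in the post: that the model has 48 blocks (blk.0 to blk.47), and that the patterns are applied as unanchored regular expressions against tensor names. The tensor name `ffn_up_exps.weight` is used as one representative ffn tensor per block.

```bash
#!/usr/bin/env bash
# Assumption: 48 blocks, blk.0 .. blk.47 (not stated in the post).
n_layers=48
# The two patterns from the -ot argument, joined into one alternation.
pattern='blk\.3[4-9]\.ffn|blk\.4[0-9]\.ffn'

# One representative expert tensor name per block.
names=$(for i in $(seq 0 $((n_layers - 1))); do echo "blk.$i.ffn_up_exps.weight"; done)

# Blocks whose ffn tensors match the override and would stay on the CPU.
cpu_names=$(printf '%s\n' "$names" | grep -E "$pattern")
n_cpu=$(printf '%s\n' "$names" | grep -cE "$pattern")
first_cpu=$(printf '%s\n' "$cpu_names" | head -n 1)
last_cpu=$(printf '%s\n' "$cpu_names" | tail -n 1)

echo "$n_cpu blocks matched, from $first_cpu to $last_cpu"
```

Under these assumptions the patterns match blocks 34 through 47, which is consistent with the "last 14 layers" mentioned above; `blk\.4[0-9]` would also cover blocks 48 and 49 if they existed.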

The next graph shows TG performance as a function of N_KV. Compared to DeepSeek, here the performance advantage of ik_llama.cpp is smaller and decreases with increasing N_KV. As there is no MLA involved and we are dealing with a standard attention mechanism, the CUDA FA improvements in this mainline PR, which I have not (yet) ported over to ik_llama.cpp, counteract the performance gains from the fused MoE operations in ik_llama.cpp, so we end up with relatively close TG performance.

qwen3_hybrid_tg

The next graph shows PP performance as a function of N_KV. Here, too, the performance gap decreases with N_KV, from about 60% for small N_KV to about 18% at 32k tokens.

qwen3_hybrid_pp

llama.cpp hybrid GPU+CPU performance data
PP  TG  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
512 128 0 0.587 871.89 1.864 68.68
512 128 512 0.499 1025.61 1.893 67.63
512 128 1024 0.505 1013.85 1.924 66.53
512 128 1536 0.504 1015.33 1.936 66.11
512 128 2048 0.519 987.22 1.959 65.33
512 128 2560 0.508 1008.35 1.978 64.71
512 128 3072 0.512 999.60 1.991 64.30
512 128 3584 0.508 1008.64 2.020 63.37
512 128 4096 0.516 992.09 2.027 63.15
512 128 4608 0.517 989.86 2.055 62.28
512 128 5120 0.520 983.77 2.065 61.97
512 128 5632 0.518 987.91 2.085 61.40
512 128 6144 0.522 980.59 2.110 60.66
512 128 6656 0.525 975.45 2.117 60.45
512 128 7168 0.532 962.98 2.147 59.62
512 128 7680 0.530 966.27 2.157 59.34
512 128 8192 0.539 950.13 2.181 58.68
512 128 8704 0.534 958.91 2.191 58.43
512 128 9216 0.538 952.23 2.216 57.76
512 128 9728 0.541 946.25 2.239 57.17
512 128 10240 0.538 951.61 2.259 56.66
512 128 10752 0.550 930.85 2.258 56.70
512 128 11264 0.547 935.91 2.272 56.33
512 128 11776 0.550 930.19 2.291 55.87
512 128 12288 0.550 931.21 2.307 55.49
512 128 12800 0.555 923.16 2.330 54.95
512 128 13312 0.556 921.17 2.355 54.36
512 128 13824 0.558 917.56 2.366 54.10
512 128 14336 0.557 918.53 2.388 53.60
512 128 14848 0.563 908.69 2.400 53.33
512 128 15360 0.565 905.61 2.425 52.79
512 128 15872 0.570 897.66 2.435 52.57
512 128 16384 0.570 897.53 2.447 52.30
512 128 16896 0.573 893.67 2.472 51.77
512 128 17408 0.578 885.91 2.484 51.54
512 128 17920 0.579 884.78 2.508 51.04
512 128 18432 0.585 875.25 2.523 50.72
512 128 18944 0.582 879.31 2.556 50.07
512 128 19456 0.590 868.21 2.585 49.52
512 128 19968 0.592 865.23 2.612 49.01
512 128 20480 0.585 875.09 2.637 48.53
512 128 20992 0.590 867.98 2.655 48.21
512 128 21504 0.596 858.70 2.671 47.92
512 128 22016 0.597 858.04 2.692 47.55
512 128 22528 0.602 849.98 2.713 47.17
512 128 23040 0.604 847.68 2.733 46.83
512 128 23552 0.604 847.62 2.759 46.40
512 128 24064 0.607 844.15 2.785 45.96
512 128 24576 0.609 840.08 2.804 45.65
512 128 25088 0.610 839.13 2.830 45.23
512 128 25600 0.609 840.04 2.841 45.06
512 128 26112 0.613 835.24 2.866 44.66
512 128 26624 0.617 829.66 2.878 44.47
512 128 27136 0.620 825.17 2.907 44.03
512 128 27648 0.622 823.54 2.932 43.65
512 128 28160 0.628 815.24 2.957 43.28
512 128 28672 0.635 806.54 3.022 42.35
512 128 29184 0.635 806.74 3.029 42.26
512 128 29696 0.635 805.74 3.054 41.91
512 128 30208 0.635 806.01 3.066 41.74
512 128 30720 0.641 799.08 3.094 41.37
512 128 31232 0.641 798.16 3.119 41.04
512 128 31744 0.642 797.16 3.134 40.85
512 128 32256 0.647 791.04 3.155 40.57
ik_llama.cpp hybrid GPU+CPU performance data
PP  TG  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
512 128 0 0.354 1445.39 1.668 76.74
512 128 512 0.305 1676.45 1.678 76.27
512 128 1024 0.311 1644.31 1.708 74.95
512 128 1536 0.309 1656.71 1.724 74.23
512 128 2048 0.322 1588.27 1.759 72.77
512 128 2560 0.318 1609.63 1.771 72.29
512 128 3072 0.326 1568.33 1.798 71.19
512 128 3584 0.324 1578.28 1.817 70.43
512 128 4096 0.331 1545.52 1.830 69.93
512 128 4608 0.336 1524.39 1.864 68.66
512 128 5120 0.338 1512.69 1.876 68.24
512 128 5632 0.341 1503.24 1.915 66.84
512 128 6144 0.345 1483.42 1.920 66.65
512 128 6656 0.350 1464.58 1.933 66.22
512 128 7168 0.356 1439.26 1.969 65.02
512 128 7680 0.358 1432.11 1.983 64.54
512 128 8192 0.365 1401.85 2.008 63.75
512 128 8704 0.364 1406.00 2.030 63.05
512 128 9216 0.370 1384.70 2.048 62.49
512 128 9728 0.374 1370.08 2.074 61.72
512 128 10240 0.375 1366.56 2.085 61.39
512 128 10752 0.384 1334.85 2.118 60.44
512 128 11264 0.384 1333.89 2.134 59.98
512 128 11776 0.389 1316.69 2.146 59.63
512 128 12288 0.391 1309.81 2.177 58.80
512 128 12800 0.396 1293.36 2.190 58.45
512 128 13312 0.399 1282.92 2.223 57.57
512 128 13824 0.403 1271.01 2.240 57.15
512 128 14336 0.405 1263.29 2.254 56.78
512 128 14848 0.412 1242.83 2.285 56.01
512 128 15360 0.416 1231.56 2.302 55.60
512 128 15872 0.419 1221.90 2.332 54.90
512 128 16384 0.422 1212.98 2.326 55.04
512 128 16896 0.427 1200.46 2.347 54.54
512 128 17408 0.431 1186.63 2.381 53.77
512 128 17920 0.434 1178.56 2.393 53.50
512 128 18432 0.475 1078.71 2.432 52.63
512 128 18944 0.476 1074.59 2.435 52.56
512 128 19456 0.483 1059.64 2.466 51.91
512 128 19968 0.488 1049.40 2.485 51.51
512 128 20480 0.488 1049.01 2.502 51.15
512 128 20992 0.494 1036.95 2.542 50.35
512 128 21504 0.500 1024.56 2.535 50.49
512 128 22016 0.503 1017.51 2.560 50.00
512 128 22528 0.509 1006.46 2.570 49.81
512 128 23040 0.524 976.26 2.596 49.31
512 128 23552 0.517 990.80 2.617 48.91
512 128 24064 0.523 979.07 2.628 48.71
512 128 24576 0.486 1053.92 2.664 48.05
512 128 25088 0.489 1046.91 2.684 47.70
512 128 25600 0.520 984.47 2.704 47.34
512 128 26112 0.498 1027.80 2.747 46.59
512 128 26624 0.503 1017.92 2.762 46.34
512 128 27136 0.509 1006.38 2.794 45.81
512 128 27648 0.514 995.15 2.814 45.49
512 128 28160 0.518 987.73 2.837 45.12
512 128 28672 0.528 970.19 2.853 44.87
512 128 29184 0.531 965.04 2.871 44.58
512 128 29696 0.535 957.76 2.900 44.13
512 128 30208 0.533 961.28 2.910 43.99
512 128 30720 0.540 948.50 2.944 43.47
512 128 31232 0.541 946.85 2.956 43.30
512 128 31744 0.542 943.99 2.987 42.85
512 128 32256 0.550 930.73 3.007 42.56

🗣️ Discussion

👤 ikawrakow replied the 2025-04-29 at 13:57:33:

Anyone who has the horsepower to run Qwen3-235B-A22B, please feel free to add your results to this discussion.

👤 ubergarm replied the 2025-04-29 at 16:30:10:
I'm away from home but frantically trying to remote into a server I just got access to again and cook up a good Qwen3-235B-A22B mix for my home 3090TI 24GB VRAM + 96GB RAM system, which is about the limit of common AM5 gaming rigs (with the faster and better-supported 2x DIMM configuration).

Any particular reason you chose IQ4_XS for the experts over IQ4_K (possibly GPU inference speed?).

I haven't finished yet but my very rough WIP custom quantize script so far is:

Very rough ik_llama.cpp custom quantize script
#!/usr/bin/env bash

custom="
#token_embd.weight - [ 4096, 151936,     1,     1], type =   bf16, Using custom type q8_0 for tensor token_embd.weight
#blk.1.ffn_gate_inp.weight - [ 4096,   128,     1,     1], type =    f32, size =    2.000 MB
#blk.1.attn_k_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
#blk.1.attn_q_norm.weight - [  128,     1,     1,     1], type =    f32, size =    0.000 MB
#blk.1.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
#blk.1.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB

#blk.1.attn_k.weight - [ 4096,   512,     1,     1], type =   bf16, Using custom type q8_0 for tensor blk.1.attn_k.weight
#blk.1.attn_q.weight - [ 4096,  8192,     1,     1], type =   bf16, Using custom type q8_0 for tensor blk.1.attn_q.weight
#blk.1.attn_v.weight - [ 4096,   512,     1,     1], type =   bf16, Using custom type q8_0 for tensor blk.1.attn_v.weight
#blk.1.attn_output.weight - [ 8192,  4096,     1,     1], type =   bf16, Using custom type q8_0 for tensor blk.1.attn_output.weight

#blk.1.ffn_down_exps.weight - [ 1536,  4096,   128,     1], type =   bf16, Using custom type q8_0 for tensor blk.1.ffn_down_exps.weight
#blk.1.ffn_gate_exps.weight - [ 4096,  1536,   128,     1], type =   bf16, Using custom type q8_0 for tensor blk.1.ffn_gate_exps.weight
#blk.1.ffn_up_exps.weight - [ 4096,  1536,   128,     1], type =   bf16, Using custom type q8_0 for tensor blk.1.ffn_up_exps.weight

#output_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB

# Token embedding
token_embd\.weight=q8_0

# Attention
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_q.*=iq4_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq4_k

# Experts
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

    #--token-embedding-type q8_0 \
    #--output-tensor-type q8_0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-GGUF/imatrix-Qwen3-235B-A22B.dat \
    /mnt/raid/models/Qwen/Qwen3-235B-A22B/Qwen3-235B-A22B-BF16-00001-of-00011.gguf \
    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K.gguf \
    IQ3_K \
    24
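
The grep/sed step in the script above turns the commented, multi-line recipe into the single comma-separated string that is passed to --custom-q. Here is a small sketch of that step on a two-rule excerpt of the same recipe; it assumes GNU sed (for the -z option), and the `recipe` and `flat` variable names are only for illustration.

```bash
#!/usr/bin/env bash
# A shortened recipe in the same format: comment lines, blank lines, regex=type rules.
recipe="
# Attention
blk\..*\.attn_k.*=iq6_k

# Experts
blk\..*\.ffn_down_exps\.weight=iq4_k
"

# Drop comment lines, then (GNU sed -z treats the whole input as one record)
# collapse runs of newlines into commas and trim a leading/trailing comma.
flat=$(
  printf '%s\n' "$recipe" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$flat"
```

With the excerpt above, the result is the two rules joined by a single comma, with no leading or trailing separator.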

Did you bother to make an imatrix for your quant, and if so, were you able to activate enough experts with your imatrix corpus text? Thanks again, exciting times with Qwen3 MoE out and wondering if R2 is around the corner haha...

👤 ikawrakow replied the 2025-04-29 at 16:34:39:

> Any particular reason you chose IQ4_XS for the experts over IQ4_K (possibly GPU inference speed?).

I wanted to have a quantized model that I can run with ik_llama.cpp and with llama.cpp so we have a fair performance comparison.

I'm playing with some quantization recipes for Qwen3-30B-A3B. I'll post the results tomorrow, maybe that can be useful to you for "cooking" the Qwen3-235B-A22B quants.

👤 Gaolingx replied the 2025-05-06 at 13:15:17:
I ran Qwen3-235B-A22B on my PC (#385), but the performance is not better; maybe my RAM is too slow...


👤 ubergarm replied the 2025-04-30 at 04:45:24:

Just "cooked" my first ik_llama.cpp exclusive experimental quant and uploaded it to Hugging Face as ubergarm/Qwen3-235B-A22B-GGUF. I tried a benchmark on my local gaming rig as soon as it finished downloading: hybrid GPU+CPU inferencing with about 12 ffn layers on GPU and the rest repacked on CPU. It barely fits in VRAM+RAM (had to close my browser haha).

qwen3-moe-troll-rig

Looks pretty good! Only other somewhat comparable benchmark I've seen is from latest ktransformers v0.3 on a rig with better GPU and more RAM.

👈 Logs
model=/mnt/ai/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -fa \
  -ctk q8_0 -ctv q8_0 \
  -c 32768 \
  -fmoe \
  -amb 512 \
  -rtr \
  -ot blk\.1[2-9]\.ffn.*=CPU \
  -ot blk\.[2-8][0-9]\.ffn.*=CPU \
  -ot blk\.9[0-3]\.ffn.*=CPU \
  -ngl 99 \
  --threads 16

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 40 key-value pairs and 1131 tensors from /mnt/ai/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 235B A22B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 235B-A22B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   8:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv   9:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv  10:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  11:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  12:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  13:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  14:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  15:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  17:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  18:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 139
llama_model_loader: - kv  20:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  21:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:                      quantize.imatrix.file str              = /mnt/raid/models/ubergarm/Qwen3-235B-...
llama_model_loader: - kv  34:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  35:             quantize.imatrix.entries_count i32              = 753
llama_model_loader: - kv  36:              quantize.imatrix.chunks_count i32              = 225
llama_model_loader: - kv  37:                                   split.no u16              = 0
llama_model_loader: - kv  38:                                split.count u16              = 3
llama_model_loader: - kv  39:                        split.tensors.count i32              = 1131
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q8_0:    2 tensors
llama_model_loader: - type iq3_k:  188 tensors
llama_model_loader: - type iq4_k:   94 tensors
llama_model_loader: - type iq6_k:  376 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 94
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ3_K - 3.4325 bpw
llm_load_print_meta: model params     = 235.094 B
llm_load_print_meta: model size       = 106.830 GiB (3.903 BPW)
llm_load_print_meta: repeating layers = 105.598 GiB (3.879 BPW, 233.849 B parameters)
llm_load_print_meta: general.name     = Qwen3 235B A22B
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 1536
llm_load_tensors: ggml ctx size =    0.99 MiB
Tensor blk.12.ffn_norm.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_norm.weight buffer type overriden to CPU
.
.
.
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_norm.weight buffer type overriden to CPU
Tensor blk.92.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_norm.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors:        CPU buffer size = 89709.28 MiB
llm_load_tensors:  CUDA_Host buffer size =   630.59 MiB
llm_load_tensors:      CUDA0 buffer size = 19053.73 MiB
....................................................................................................
============ Repacked 246 tensors
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  3196.05 MiB
llama_new_context_with_model: KV self size  = 3196.00 MiB, K (q8_0): 1598.00 MiB, V (q8_0): 1598.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   312.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   128.01 MiB
llama_new_context_with_model: graph nodes  = 3672
llama_new_context_with_model: graph splits = 330

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 16, n_threads_batch = 16
PP  TG  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
512 128 0 4.165 122.94 11.925 10.73
512 128 512 3.848 133.06 12.031 10.64
512 128 1024 3.581 142.97 12.163 10.52
512 128 1536 3.631 140.99 12.343 10.37
512 128 2048 3.622 141.36 12.491 10.25
512 128 2560 3.631 140.99 12.677 10.10
512 128 3072 3.657 140.02 12.859 9.95
512 128 3584 3.667 139.63 13.039 9.82
512 128 4096 3.694 138.61 13.226 9.68
512 128 4608 3.710 138.00 13.399 9.55
512 128 5120 3.719 137.67 13.587 9.42
512 128 5632 3.773 135.69 13.767 9.30
512 128 6144 3.756 136.32 13.936 9.18
512 128 6656 3.776 135.59 14.103 9.08
512 128 7168 3.796 134.88 14.277 8.97
512 128 7680 3.804 134.60 14.473 8.84
512 128 8192 3.879 132.00 14.638 8.74
512 128 8704 3.849 133.02 14.847 8.62
512 128 9216 3.929 130.31 15.027 8.52
512 128 9728 3.943 129.84 15.216 8.41
512 128 10240 3.908 131.02 15.385 8.32
512 128 10752 3.923 130.51 15.560 8.23
512 128 11264 3.935 130.12 15.741 8.13
512 128 11776 3.982 128.59 15.695 8.16
512 128 12288 3.971 128.94 15.602 8.20
512 128 12800 3.982 128.58 15.740 8.13
512 128 13312 3.993 128.22 15.901 8.05
512 128 13824 4.019 127.40 16.079 7.96
512 128 14336 4.044 126.62 16.265 7.87
512 128 14848 4.056 126.23 16.399 7.81
512 128 15360 4.070 125.80 16.582 7.72
512 128 15872 4.114 124.46 16.754 7.64
512 128 16384 4.101 124.86 16.899 7.57
512 128 16896 4.120 124.26 17.061 7.50
512 128 17408 4.148 123.43 17.219 7.43
512 128 17920 4.170 122.79 17.386 7.36
512 128 18432 4.183 122.41 17.559 7.29
512 128 18944 4.212 121.55 17.744 7.21
512 128 19456 4.222 121.26 17.925 7.14
512 128 19968 4.250 120.48 18.072 7.08
512 128 20480 4.253 120.38 18.233 7.02
512 128 20992 4.318 118.57 18.365 6.97
512 128 21504 4.289 119.38 18.574 6.89
512 128 22016 4.310 118.79 18.722 6.84
512 128 22528 4.337 118.05 18.884 6.78
512 128 23040 4.349 117.72 19.071 6.71
512 128 23552 4.361 117.40 19.233 6.66
512 128 24064 4.459 114.83 19.375 6.61
512 128 24576 4.396 116.47 19.506 6.56
512 128 25088 4.418 115.90 19.668 6.51
512 128 25600 4.432 115.53 19.840 6.45
512 128 26112 4.450 115.06 20.016 6.39
512 128 26624 4.464 114.70 20.157 6.35
512 128 27136 4.484 114.17 20.332 6.30
512 128 27648 4.502 113.72 20.479 6.25
512 128 28160 4.532 112.96 20.657 6.20
512 128 28672 4.534 112.92 20.814 6.15
512 128 29184 4.561 112.26 20.982 6.10
512 128 29696 4.565 112.16 21.138 6.06
512 128 30208 4.579 111.82 21.284 6.01
512 128 30720 4.614 110.97 21.457 5.97
512 128 31232 4.628 110.64 21.709 5.90
512 128 31744 4.647 110.17 21.866 5.85
512 128 32256 4.669 109.66 21.961 5.83
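
The KV cache size reported in the log above (K (q8_0): 1598.00 MiB, 3196.00 MiB in total) can be reproduced from values printed earlier in the same log. This is a rough sketch, assuming each layer caches n_embd_k_gqa values per token for K and the same amount for V, and that q8_0 packs 32 values into 34 bytes (32 int8 quants plus one fp16 scale). The variable names mirror the log fields.

```bash
#!/usr/bin/env bash
n_ctx=32768        # llama_new_context_with_model: n_ctx
n_layer=94         # llm_load_print_meta: n_layer
n_embd_k_gqa=512   # llm_load_print_meta: n_embd_k_gqa (n_embd_v_gqa is also 512)

# q8_0 block layout: 32 values stored in 34 bytes.
block_bytes=34
block_elems=32

# Bytes for the K cache, then converted to MiB; V is assumed to be the same size.
k_bytes=$(( n_ctx * n_layer * n_embd_k_gqa * block_bytes / block_elems ))
k_mib=$(( k_bytes / 1024 / 1024 ))
kv_mib=$(( 2 * k_mib ))

echo "K cache: ${k_mib} MiB, K+V: ${kv_mib} MiB"
```

The same arithmetic with 2 bytes per value instead of 34/32 gives the size of an f16 cache for the same context.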

In the meantime, I ran a quick comparison of the Q8_0 on the remote 24-core Threadripper Pro, using a single RTX A6000 GPU with 48GB VRAM and offloading the rest to CPU, for a somewhat similar "hybrid inference" test.

Note that for some reason ik_llama.cpp could offload one more ffn layer than mainline llama.cpp in this test. I didn't go back and re-run the test with one fewer layer on ik, so it isn't technically the exact same configuration, but close enough for tonight!

qwen3-moe

👈 Logs

ik_llama.cpp

model=/mnt/raid/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-Q8_0.gguf

# Offload 48GB onto single RTX A6000 VRAM
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --model "$model" \
  -fa \
  -ctk f16 -ctv f16 \
  -c 32768 \
  -fmoe \
  -amb 512 \
  -rtr \
  -ot blk\.1[4-9]\.ffn.*=CPU \
  -ot blk\.[2-8][0-9]\.ffn.*=CPU \
  -ot blk\.9[0-3]\.ffn.*=CPU \
  -ngl 99 \
  --threads 24

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 33 key-value pairs and 1131 tensors from /mnt/raid/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 235B A22B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 235B-A22B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   8:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv   9:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv  10:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  11:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  12:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  13:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  14:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  15:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  17:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  18:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 7
llama_model_loader: - kv  20:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  21:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q8_0:  660 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 94
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 235.094 B
llm_load_print_meta: model size       = 232.769 GiB (8.505 BPW)
llm_load_print_meta: repeating layers = 231.538 GiB (8.505 BPW, 233.849 B parameters)
llm_load_print_meta: general.name     = Qwen3 235B A22B
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 1536
llm_load_tensors: ggml ctx size =    0.99 MiB
Tensor blk.14.ffn_norm.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_norm.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_norm.weight buffer type overriden to CPU
.
.
.
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_norm.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors:        CPU buffer size = 196001.25 MiB
llm_load_tensors:  CUDA_Host buffer size =   630.59 MiB
llm_load_tensors:      CUDA0 buffer size = 41723.89 MiB
....................................................................................................
============ Repacked 240 tensors
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  6016.00 MiB
llama_new_context_with_model: KV self size  = 6016.00 MiB, K (f16): 3008.00 MiB, V (f16): 3008.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   312.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   128.01 MiB
llama_new_context_with_model: graph nodes  = 3672
llama_new_context_with_model: graph splits = 322

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 24, n_threads_batch = 24

$ numastat -mp $(pidof llama-sweep-bench)
                          Node 0           Total
                 --------------- ---------------
MemTotal               257213.74       257213.74
MemFree                  1088.58         1088.58
MemUsed                256125.16       256125.16
SwapCached                 27.07           27.07
Active                  70427.99        70427.99
Inactive               181810.73       181810.73
Active(anon)            70360.04        70360.04
Inactive(anon)         126793.92       126793.92
Active(file)               67.95           67.95
Inactive(file)          55016.81        55016.81
Unevictable                 6.03            6.03
Mlocked                     0.02            0.02
Dirty                       0.18            0.18
Writeback                   0.00            0.00
FilePages               55889.19        55889.19
Mapped                   1024.76         1024.76
AnonPages              196380.88       196380.88
Shmem                     776.73          776.73
KernelStack                16.69           16.69
PageTables                407.07          407.07
SecPageTables             632.02          632.02
NFS_Unstable                0.00            0.00
Bounce                      0.00            0.00
WritebackTmp                0.00            0.00
Slab                     1134.24         1134.24
SReclaimable              633.10          633.10
SUnreclaim                501.14          501.14
AnonHugePages               0.00            0.00
ShmemHugePages              0.00            0.00
ShmemPmdMapped              0.00            0.00
FileHugePages               0.00            0.00
FilePmdMapped               0.00            0.00
HugePages_Total             0.00            0.00
HugePages_Free              0.00            0.00
HugePages_Surp              0.00            0.00
KReclaimable              633.10          633.10
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 3.115 164.36 11.715 10.93
512 128 512 3.086 165.89 11.837 10.81
512 128 1024 3.130 163.59 12.006 10.66
512 128 1536 3.106 164.85 12.066 10.61
512 128 2048 3.306 154.88 12.210 10.48
512 128 2560 3.346 153.03 12.307 10.40
512 128 3072 3.272 156.46 12.439 10.29
512 128 3584 3.170 161.52 12.523 10.22
512 128 4096 3.215 159.23 12.683 10.09
512 128 4608 3.222 158.91 12.732 10.05
512 128 5120 3.391 150.98 12.896 9.93
512 128 5632 3.343 153.18 12.943 9.89
512 128 6144 3.275 156.34 13.115 9.76
512 128 6656 3.280 156.09 13.241 9.67
512 128 7168 3.305 154.90 13.354 9.59
512 128 7680 3.328 153.83 13.450 9.52
512 128 8192 3.341 153.24 13.589 9.42
512 128 8704 3.365 152.16 13.692 9.35
512 128 9216 3.382 151.37 13.821 9.26
512 128 9728 3.395 150.80 13.924 9.19
512 128 10240 3.417 149.82 14.069 9.10
512 128 10752 3.491 146.64 14.153 9.04
512 128 11264 3.460 147.96 14.279 8.96
512 128 11776 3.478 147.21 14.367 8.91
512 128 12288 3.501 146.23 14.506 8.82
512 128 12800 3.729 137.29 14.588 8.77
512 128 13312 3.532 144.94 14.600 8.77
512 128 13824 3.555 144.03 14.732 8.69
512 128 14336 3.574 143.25 14.809 8.64
512 128 14848 3.596 142.39 14.981 8.54
512 128 15360 3.613 141.72 15.042 8.51
512 128 15872 3.634 140.91 15.220 8.41
512 128 16384 3.765 135.98 15.266 8.38
512 128 16896 3.671 139.47 15.390 8.32
512 128 17408 3.687 138.86 15.519 8.25
512 128 17920 3.703 138.25 15.617 8.20
512 128 18432 3.732 137.19 15.891 8.05
512 128 18944 3.810 134.40 15.866 8.07
512 128 19456 3.805 134.57 15.952 8.02
512 128 19968 3.812 134.33 16.093 7.95
512 128 20480 3.808 134.44 16.192 7.90
512 128 20992 3.824 133.89 16.340 7.83
512 128 21504 3.992 128.26 16.427 7.79
512 128 22016 3.870 132.29 16.546 7.74
512 128 22528 3.890 131.62 16.680 7.67
512 128 23040 4.018 127.41 16.809 7.62
512 128 23552 3.928 130.34 16.909 7.57
512 128 24064 3.955 129.47 17.031 7.52
512 128 24576 3.976 128.77 17.144 7.47
512 128 25088 3.993 128.23 17.331 7.39
512 128 25600 4.004 127.88 17.475 7.32
512 128 26112 4.026 127.17 17.515 7.31
512 128 26624 4.049 126.44 17.693 7.23
512 128 27136 4.074 125.68 17.808 7.19
512 128 27648 4.132 123.92 17.931 7.14
512 128 28160 4.098 124.94 18.083 7.08
512 128 28672 4.116 124.40 18.200 7.03
512 128 29184 4.137 123.75 18.314 6.99
512 128 29696 4.155 123.21 18.461 6.93
512 128 30208 4.304 118.95 18.597 6.88
512 128 30720 4.233 120.95 18.717 6.84
512 128 31232 4.306 118.91 18.847 6.79
512 128 31744 4.232 120.97 18.987 6.74
512 128 32256 4.288 119.39 19.105 6.70

llama.cpp

model=/mnt/raid/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-Q8_0.gguf

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
  --no-mmap \
  --model "$model" \
  -fa \
  -ctk f16 -ctv f16 \
  -c 32768 \
  -ot blk\.1[3-9]\.ffn.*=CPU \
  -ot blk\.[2-8][0-9]\.ffn.*=CPU \
  -ot blk\.9[0-3]\.ffn.*=CPU \
  -ngl 99 \
  --threads 24

ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
build: 5192 (e59a5f1e) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX A6000) - 48267 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 1131 tensors from /mnt/raid/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 235B A22B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 235B-A22B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   7:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   8:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv   9:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv  10:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  11:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  12:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  13:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  14:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  15:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  17:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  18:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 7
llama_model_loader: - kv  20:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  21:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q8_0:  660 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 232.77 GiB (8.51 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 4096
print_info: n_layer          = 94
print_info: n_head           = 64
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 16
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 235B.A22B
print_info: model params     = 235.09 B
print_info: general.name     = Qwen3 235B A22B
print_info: n_ff_exp         = 1536
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors:    CUDA_Host model buffer size =   630.59 MiB
load_tensors:        CUDA0 model buffer size = 39273.87 MiB
load_tensors:          CPU model buffer size = 198451.27 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 32768
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (32768) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
init: kv_size = 32768, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 94, can_shift = 1
init:      CUDA0 KV buffer size =  6016.00 MiB
llama_context: KV self size  = 6016.00 MiB, K (f16): 3008.00 MiB, V (f16): 3008.00 MiB
llama_context:      CUDA0 compute buffer size =  1024.00 MiB
llama_context:  CUDA_Host compute buffer size =    72.01 MiB
llama_context: graph nodes  = 5741
llama_context: graph splits = 407 (with bs=512), 164 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 32768
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 24, n_threads_batch = 24

$ numastat -mp $(pidof llama-sweep-bench)
                          Node 0           Total
                 --------------- ---------------
MemTotal               257213.74       257213.74
MemFree                  3319.93         3319.93
MemUsed                253893.81       253893.81
SwapCached                 27.97           27.97
Active                  73301.44        73301.44
Inactive               176693.16       176693.16
Active(anon)            73109.92        73109.92
Inactive(anon)         126449.67       126449.67
Active(file)              191.52          191.52
Inactive(file)          50243.50        50243.50
Unevictable                 6.03            6.03
Mlocked                     0.02            0.02
Dirty                       0.17            0.17
Writeback                   0.00            0.00
FilePages               51183.76        51183.76
Mapped                    972.10          972.10
AnonPages              198841.57       198841.57
Shmem                     720.74          720.74
KernelStack                16.81           16.81
PageTables                411.71          411.71
SecPageTables             632.02          632.02
NFS_Unstable                0.00            0.00
Bounce                      0.00            0.00
WritebackTmp                0.00            0.00
Slab                     1134.53         1134.53
SReclaimable              633.56          633.56
SUnreclaim                500.96          500.96
AnonHugePages               0.00            0.00
ShmemHugePages              0.00            0.00
ShmemPmdMapped              0.00            0.00
FileHugePages               0.00            0.00
FilePmdMapped               0.00            0.00
HugePages_Total             0.00            0.00
HugePages_Free              0.00            0.00
HugePages_Surp              0.00            0.00
KReclaimable              633.56          633.56
PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 9.319 54.94 12.265 10.44
512 128 512 9.242 55.40 12.236 10.46
512 128 1024 9.257 55.31 12.202 10.49
512 128 1536 9.277 55.19 12.141 10.54
512 128 2048 9.296 55.08 12.236 10.46
512 128 2560 9.284 55.15 12.161 10.53
512 128 3072 9.305 55.02 12.266 10.44
512 128 3584 9.303 55.04 12.309 10.40
512 128 4096 9.334 54.85 12.233 10.46
512 128 4608 9.324 54.91 12.263 10.44
512 128 5120 9.350 54.76 12.256 10.44
512 128 5632 9.339 54.83 12.357 10.36
512 128 6144 9.364 54.68 12.363 10.35
512 128 6656 9.364 54.68 12.471 10.26
512 128 7168 9.393 54.51 12.375 10.34
512 128 7680 9.390 54.53 12.451 10.28
512 128 8192 9.406 54.44 12.435 10.29
512 128 8704 9.413 54.39 12.409 10.31
512 128 9216 9.424 54.33 12.417 10.31
512 128 9728 9.430 54.29 12.528 10.22
512 128 10240 9.440 54.24 12.564 10.19
512 128 10752 9.461 54.12 12.872 9.94
512 128 11264 9.448 54.19 12.627 10.14
512 128 11776 9.474 54.04 12.575 10.18
512 128 12288 9.478 54.02 12.578 10.18
512 128 12800 9.484 53.99 12.630 10.13
512 128 13312 9.475 54.04 12.623 10.14
512 128 13824 9.498 53.91 12.609 10.15
512 128 14336 9.501 53.89 12.627 10.14
512 128 14848 9.513 53.82 12.640 10.13
512 128 15360 9.520 53.78 12.698 10.08
512 128 15872 9.534 53.70 12.695 10.08
512 128 16384 9.542 53.66 12.827 9.98
512 128 16896 9.544 53.64 12.812 9.99
512 128 17408 9.567 53.52 12.850 9.96
512 128 17920 9.570 53.50 12.933 9.90
512 128 18432 9.579 53.45 12.841 9.97
512 128 18944 9.579 53.45 12.829 9.98
512 128 19456 9.606 53.30 12.846 9.96
512 128 19968 9.620 53.22 12.846 9.96
512 128 20480 9.600 53.33 12.864 9.95
512 128 20992 9.605 53.30 12.878 9.94
512 128 21504 9.629 53.17 12.979 9.86
512 128 22016 9.644 53.09 13.079 9.79
512 128 22528 9.656 53.03 12.995 9.85
512 128 23040 9.653 53.04 13.008 9.84
512 128 23552 9.663 52.98 13.057 9.80
512 128 24064 9.685 52.87 13.084 9.78
512 128 24576 9.690 52.84 13.778 9.29
512 128 25088 9.702 52.77 13.490 9.49
512 128 25600 9.692 52.83 13.059 9.80
512 128 26112 9.717 52.69 13.050 9.81
512 128 26624 9.731 52.61 13.111 9.76
512 128 27136 9.737 52.58 13.187 9.71
512 128 27648 9.751 52.51 13.208 9.69
512 128 28160 9.751 52.51 13.233 9.67
512 128 28672 9.766 52.43 13.234 9.67
512 128 29184 9.785 52.32 13.183 9.71
512 128 29696 9.786 52.32 13.204 9.69
512 128 30208 9.787 52.32 13.274 9.64
512 128 30720 9.794 52.28 13.268 9.65
512 128 31232 9.811 52.19 13.290 9.63
512 128 31744 9.814 52.17 13.309 9.62
512 128 32256 9.841 52.03 13.433 9.53
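
The two tables can be compared row by row. Below is a small sketch that takes a handful of rows copied from the tables above and prints the ik_llama.cpp / llama.cpp speed ratios. The dictionary layout and the `ratios` helper are illustrative only, and the two runs differ by one offloaded ffn layer, so the numbers are indicative rather than exact.

```python
# N_KV -> (S_PP t/s, S_TG t/s), rows copied from the two sweep-bench tables.
IK = {
    0: (164.36, 10.93),
    2048: (154.88, 10.48),
    2560: (153.03, 10.40),
    16384: (135.98, 8.38),
    32256: (119.39, 6.70),
}
MAINLINE = {
    0: (54.94, 10.44),
    2048: (55.08, 10.46),
    2560: (55.15, 10.53),
    16384: (53.66, 9.98),
    32256: (52.03, 9.53),
}


def ratios(n_kv):
    """Return (PP ratio, TG ratio) of ik_llama.cpp over mainline at a given N_KV."""
    ik_pp, ik_tg = IK[n_kv]
    ml_pp, ml_tg = MAINLINE[n_kv]
    return ik_pp / ml_pp, ik_tg / ml_tg


for n_kv in sorted(IK):
    pp, tg = ratios(n_kv)
    print(f"N_KV={n_kv:6d}  PP ratio={pp:.2f}x  TG ratio={tg:.2f}x")
```

In these rows, prompt processing on ik_llama.cpp stays between roughly 2.3x and 3x faster, while token generation starts slightly ahead at an empty cache and falls behind mainline from N_KV = 2560 onwards.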

Interestingly I could hear my fans spin up and down periodically every 15 seconds or so as the CPU ramped up and the GPU dropped down a bit. I noticed this more on the Q8_0 test visually with btop as the CPU would drop to almost 0 and the GPU would ramp up and oscillate slowly back and forth.

👤 ikawrakow replied the 2025-04-30 at 06:07:34:

Note that for some reason ik_llama.cpp could offload one additional ffn layer than mainline llama.cpp in this test

This is because the ik_llama.cpp CUDA compute buffer is smaller. This is most likely due to the fused ffn_up+ffn_gate op that you get with -fmoe. In any case, having the MoE experts of 80 instead of 81 layers computed on the CPU will not make a significant difference in performance.
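
The buffer sizes printed in the two logs support this. The following is a rough sketch of the VRAM budget, using only numbers copied from the logs. It assumes that the "48267 MiB free" reported by mainline applies to both runs, and it ignores any other GPU allocations such as the CUDA context.

```python
# All sizes in MiB, copied from the two logs above.
FREE_VRAM = 48267.00  # "48267 MiB free" from the mainline log (assumed for both runs)

ik = {"model": 41723.89, "kv": 6016.00, "compute": 312.75}
mainline = {"model": 39273.87, "kv": 6016.00, "compute": 1024.00}

ik_total = sum(ik.values())
mainline_total = sum(mainline.values())

# The CUDA0 model buffers differ by the one extra ffn layer ik keeps on the GPU.
one_ffn_layer = ik["model"] - mainline["model"]

# What mainline would need if it kept that layer on the GPU as well.
mainline_plus_one = mainline_total + one_ffn_layer

print(f"ik_llama.cpp total:       {ik_total:.2f} MiB")
print(f"mainline total:           {mainline_total:.2f} MiB")
print(f"one ffn layer (Q8_0):     {one_ffn_layer:.2f} MiB")
print(f"mainline with one more:   {mainline_plus_one:.2f} MiB")
print("ik fits:                 ", ik_total <= FREE_VRAM)
print("mainline + 1 layer fits: ", mainline_plus_one <= FREE_VRAM)
```

With these numbers, one more ffn layer would push mainline past the reported free VRAM, while the smaller compute buffer (312.75 MiB against 1024.00 MiB) leaves ik_llama.cpp just enough room for it.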

👤 ubergarm replied the 2025-04-30 at 17:46:53:
I don't have access to enough RAM+VRAM currently to run the full bf16, so I'm using the Q8_0 as the baseline for my imatrix data and PPL/KLD.

👈 PPL and KLD comparisons on two test corpora
  • Qwen/Qwen3-235B-A22B/Qwen3-235B-A22B-BF16-00001-of-00011.gguf
    • 438GiB
    • TODO
  • ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-Q8_0
    • 233GiB
    • Final estimate: PPL = 5.3141 +/- 0.03321 wiki.test.raw
    • Final estimate: PPL = 11.7194 +/- 0.07212 ubergarm-kld-test-corpus.txt
  • ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K.gguf
    • 107GiB
    • Final estimate: PPL = 5.4403 +/- 0.03421 wiki.test.raw
    • Mean PPL(Q) : 11.788282 ± 0.072648 ubergarm-kld-test-corpus.txt
    • ====== KL divergence statistics ======
    • Mean KLD: 0.014594 ± 0.000064
    • Maximum KLD: 2.906263
    • 99.9% KLD: 0.296680
    • 99.0% KLD: 0.098368
    • ====== Token probability statistics ======
    • Mean Δp: -0.049 ± 0.006 %
    • Maximum Δp: 63.764%
    • 99.9% Δp: 17.122%
    • 99.0% Δp: 8.257%
    • 95.0% Δp: 4.175%
    • 90.0% Δp: 2.504%
  • unsloth/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-UD-Q3_K_XL
    • 97GiB
    • Final estimate: PPL = 5.5695 +/- 0.03524 wiki.test.raw
    • Mean PPL(Q): 11.855173 ± 0.073300 ubergarm-kld-test-corpus.txt
    • ====== KL divergence statistics ======
    • Mean KLD: 0.029122 ± 0.000123
    • Maximum KLD: 5.471307
    • 99.9% KLD: 0.543533
    • 99.0% KLD: 0.180988
    • ====== Token probability statistics ======
    • Mean Δp: -0.059 ± 0.009 %
    • Maximum Δp: 64.130%
    • 99.9% Δp: 22.421%
    • 99.0% Δp: 11.713%
    • 95.0% Δp: 5.976%
    • 90.0% Δp: 3.649%
  • lmstudio-community/Qwen_Qwen3-235B-A22B-GGUF
    • NOTE: bartowski releases these models quickly for LM Studio without imatrix, as per their preference
    • 104GiB
    • Final estimate: PPL = 5.6582 +/- 0.03584 wiki.test.raw
    • Mean PPL(Q) : 11.904309 ± 0.073302 ubergarm-kld-test-corpus.txt
    • ====== KL divergence statistics ======
    • Mean KLD: 0.036266 ± 0.000140
    • Maximum KLD: 8.358958
    • 99.9% KLD: 0.628216
    • 99.0% KLD: 0.219563
    • ====== Token probability statistics ======
    • Mean Δp: -0.284 ± 0.010 %
    • Maximum Δp: 77.349%
    • 99.9% Δp: 24.126%
    • 99.0% Δp: 12.470%
    • 95.0% Δp: 6.267%
    • 90.0% Δp: 3.742%
  • bartowski/Qwen_Qwen3-235B-A22B-GGUF
    • TODO, waiting for bartowski to finish the small imatrix quants before releasing them all
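
Since Q8_0 is the baseline here, the perplexities above are easiest to read as relative increases over it. A minimal sketch, using only the PPL values listed above and ignoring the reported uncertainties; the short labels and the `ppl_increase_pct` helper are made up for illustration.

```python
# Baseline (Q8_0) perplexities on the two corpora, copied from the list above.
BASELINE = {"wiki.test.raw": 5.3141, "ubergarm-kld-test-corpus.txt": 11.7194}

# Perplexities of the three measured ~3-bit quants on the same corpora.
QUANTS = {
    "ubergarm mix-IQ3_K": {"wiki.test.raw": 5.4403, "ubergarm-kld-test-corpus.txt": 11.788282},
    "unsloth UD-Q3_K_XL": {"wiki.test.raw": 5.5695, "ubergarm-kld-test-corpus.txt": 11.855173},
    "lmstudio-community": {"wiki.test.raw": 5.6582, "ubergarm-kld-test-corpus.txt": 11.904309},
}


def ppl_increase_pct(ppl, baseline):
    """Relative perplexity increase over the baseline, in percent."""
    return (ppl / baseline - 1.0) * 100.0


for name, results in QUANTS.items():
    for corpus, ppl in results.items():
        pct = ppl_increase_pct(ppl, BASELINE[corpus])
        print(f"{name:20s} {corpus:30s} +{pct:.2f}%")
```

On wiki.test.raw this puts the mix-IQ3_K at roughly 2.4% above Q8_0, with the other two quants further away. The ordering is the same on the second corpus and matches the ordering of the mean KLD values.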
👈 ubergarm-kld-test-corpus.txt

I created a ~1.6MiB plain text test corpus from whisper-large-v3 audio transcripts of newer episodes of Rick Archer's Buddha at the Gas Pump podcast and YouTube channel, for which (among other channels with similar content) I maintain a searchable full-text index.

The formatting is a little odd, as there are no paragraphs and I used fmt for line breaks. The thought was that this text probably hasn't yet been used in training or fine-tuning AI models, and that it is different from the corpus I use to generate imatrix data.

I'd rather not make the full text easily accessible in public for various reasons, but contact me if you are doing some research or want exact comparisons with my quants (or I could possibly run your quant if I have time). Here is a snippet so you can see what it looks like:

$ head ubergarm-kld-test-corpus.txt
## The Telepathy Tapes - Dr. Diane Hennacy Powell - Buddha at the Gas Pump Interview

Another thing that we have anecdotes about is the precognition. So
we have, for example, a girl on the podcast who had a dream that her
father slipped on ice.  And they live in Arizona where there's no
ice. And it happened three weeks later when he was on a business trip.
He slipped on the ice. And she... She also knew that he'd end up in the
hospital with a broken hip as a result, which was the case.  So that's
a really fascinating anecdote and I've heard many like that over the
years. But once again, to say that you have evidence for precognition

👤 ikawrakow replied the 2025-04-30 at 05:57:12:

@ubergarm Can you try the attached sweep_bench.cpp adaptation for llama.cpp instead of your adaptation? Thanks!

sweep-bench.cpp.gz

👤 ubergarm replied the 2025-04-30 at 17:21:50:
I compared your sweep-bench.cpp adaptation for mainline llama.cpp with my adaptation of @saood06's code. A couple of quick results suggest they are pretty similar for two benchmarks I had run:

bartowski/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf GQA FA

thud-sweep-mine-vs-iks-adaptation

Running the same and comparing against this previous data.

Qwen3-235B-A22B-Q8_0 GQA FA

qwen3-moe-ik-vs-ug-sweep-adaptation

^ The title is wrong: this big model was run on the Threadripper Pro with the RTX A6000, oops.

Running the same and comparing against the above chart.

Conclusion

The general trends seem to hold, but your implementation seems a bit more consistent without the occasional dips unless that was just some noise or me doing something else on the machine. I'll use your adaptation going forward just to keep it as similar as possible with your comparisons. Thanks!


👤 ikawrakow replied the 2025-04-30 at 14:04:50:

OK, after thinking more about this, I can see why mainline has a better large context TG performance on CUDA for Qwen3-235B-A22B (and previously noted for LLaMA-4): these models have a quite large GQA factor, and I'm still using the old CUDA FA implementation that did not take advantage of that. Improved GQA FA performance was added in this mainline PR.

ik_llama.cpp does take advantage of GQA in the CPU FA implementation. Given the above results, it is clear that it is time to do the same for CUDA. I have two options:

  • Pick up the mainline PR (but heavy adaptation will be required as things have diverged a lot, and mainline FA does not support different K and V head sizes as required for DeepSeek models)
  • Finally sit down and write my own CUDA FA implementation
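To put a number on the "quite large GQA factor", here is a minimal Python sketch. It uses the Qwen3-235B-A22B metadata printed by the loader in the logs later in this thread (`head_count`, `head_count_kv`, `key_length`, `block_count`) and the `-c 16384` context from the benchmark commands. The helper names are made up for illustration, and the size formula only assumes the standard q8_0 layout (32 int8 values plus one fp16 scale per block).

```python
# Values as printed by the loader for Qwen3-235B-A22B in this thread.
n_head = 64       # qwen3moe.attention.head_count
n_head_kv = 4     # qwen3moe.attention.head_count_kv
head_dim = 128    # qwen3moe.attention.key_length (value_length is the same)
n_layer = 94      # qwen3moe.block_count
n_ctx = 16384     # -c 16384 on the benchmark command line

# q8_0 packs 32 values into 32 int8 bytes plus one 2-byte fp16 scale.
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 34


def gqa_factor(n_head: int, n_head_kv: int) -> int:
    """Number of query heads that share one K/V head (hypothetical helper)."""
    assert n_head % n_head_kv == 0
    return n_head // n_head_kv


def cache_mib_q8_0(n_ctx: int, n_layer: int, n_head_kv: int, head_dim: int) -> float:
    """Size in MiB of the K cache (the V cache is the same size here)."""
    values = n_ctx * n_layer * n_head_kv * head_dim
    return values // Q8_0_BLOCK_VALUES * Q8_0_BLOCK_BYTES / 2**20


factor = gqa_factor(n_head, n_head_kv)
k_mib = cache_mib_q8_0(n_ctx, n_layer, n_head_kv, head_dim)

print(f"GQA factor: {factor}")
print(f"K cache (q8_0): {k_mib:.2f} MiB")
print(f"K cache if every query head had its own K head: {k_mib * factor:.2f} MiB")
```

The first two values agree with what the loader reports in the logs (`n_gqa = 16` and `K (q8_0): 799.00 MiB`). With 16 query heads reading the same K/V head, an FA kernel that handles those heads together can reuse each loaded K/V tile instead of fetching it once per query head.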

👤 ubergarm replied the 2025-04-30 at 15:02:22:
Interesting, yes, I first noticed this with GLM-4 (which uses GQA) in the CUDA + Flash Attention case benchmark.

I still have the dream of converting an existing GQA architecture model to MLA but the additional fine-tuning required even with a fraction of the original training data seems daunting:

The expressiveness of MLA is greater than that of GQA when both have the same size of KV cache. -TransMLA: Multi-head Latent Attention Is All You Need

But until MLA catches on more across other models, it might make sense to revisit the CUDA FA implementation for GQA, if that is something that interests you. Of course as soon as R2 comes around, this fickle world will jump on the next hype train lmao...

In the meantime I'll re-run a couple of llama-sweep-bench comparisons with your mainline sweep-bench.cpp adaptation to confirm or reject my prior benchmarks!

Thanks!

👤 ikawrakow replied the 2025-04-30 at 16:14:03:

> I still have the dream of converting an existing GQA architecture model to MLA but the additional fine-tuning required even with a fraction of the original training data seems daunting:

But MLA is not all roses either. It took quite a bit of experimentation to arrive at a meaningful compromise between TG and PP performance. Mainline has a long way to go there (see #354). And then we have this much smaller KV cache, but then we need giant compute buffers to get meaningful performance, so we need to compute self attention in chunks to keep compute memory usage at a reasonable level, so suddenly the compute graph building becomes this huge pile of complications instead of being just a few tens of lines of simple code as it is for the other models. And then seeing the massive drop in performance with large contexts in your DeepSeek-V3/R1 benchmarks, my guess is that it is still far from optimum.


👤 AesSedai replied the 2025-05-03 at 00:37:46:

Hello, @artus-dev and @ubergarm asked me to run some sweeps for Qwen3-235B-A22B. My homelab has a substantial server with a VM in it that has the following allocation:

56 threads of an AMD EPYC 9355 (64t total)
512GB of 12 channel DDR5 6000 ECC RAM (768GB total)
2x 24GB Nvidia 3090

I've run four sweeps as follows:

ik_llama.cpp CPU only
ik_llama.cpp one GPU
llama.cpp CPU only
llama.cpp one GPU

Both ik_llama.cpp and llama.cpp were compiled with CUDA and OpenBLAS support.

The sweeps were run with the following quants:

ik_llama.cpp: https://huggingface.co/ArtusDev/Qwen3-235B-A22B-GGUF (IQ6_K, ~212GB)
llama.cpp: https://huggingface.co/unsloth/Qwen3-235B-A22B-128K-GGUF (Q6_K, ~193GB)

The llama.cpp tests were conducted with the sweep-bench.cpp included in https://github.com/ikawrakow/ik_llama.cpp/discussions/357#discussioncomment-12988686

For the GPU tests, I kept the layer offloads identical between the two. This means that there was slightly less GPU VRAM utilization for the llama.cpp test because the model is smaller, but I felt that was the best way to keep the tests as comparable as I could manage:

-ot "blk\.(0|1|2|3|4)\.ffn.*=CUDA0"
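As a rough illustration of how these override patterns carve up the tensors, here is a small Python sketch. The four patterns are copied from the ik_llama.cpp GPU command line in the logs. `buffer_for` is a hypothetical helper, and it assumes patterns are tried in order with a search-style regex match; the patterns here do not overlap, so the order does not affect the result in this case.

```python
import re

# Patterns copied from the ik_llama.cpp GPU command line in the logs.
overrides = [
    (r"blk\.(0|1|2|3|4)\.ffn.*", "CUDA0"),
    (r"blk\.(5|6|7|8|9)\.ffn.*", "CPU"),
    (r"blk\.1[0-9]\.ffn.*", "CPU"),
    (r"blk\.[2-9][0-9]\.ffn.*", "CPU"),
]


def buffer_for(tensor_name, overrides, default="default"):
    """Return the buffer type of the first pattern found in tensor_name.

    Assumption: patterns are tried in order with a search-style match.
    Tensors that match no pattern keep their default placement.
    """
    for pattern, buft in overrides:
        if re.search(pattern, tensor_name):
            return buft
    return default


names = [
    "blk.4.ffn_up_exps.weight",
    "blk.5.ffn_norm.weight",
    "blk.33.ffn_gate_exps.weight",
    "blk.4.attn_q.weight",  # attention tensor, added here for illustration
]
for name in names:
    print(f"{name} -> {buffer_for(name, overrides)}")
```

The first three results agree with the `buffer type overriden to ...` lines in the GPU log. The attention tensor matches none of the patterns, so its placement is left to `-ngl 99`.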

Logs for the runs are as follows:

ik_llama.cpp CPU logs
./build/bin/llama-sweep-bench -m /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf -c 16384 -t 48 -fa -rtr -fmoe -ctk q8_0 -ctv q8_0
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 1131 tensors from /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 128x10B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   6:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   7:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv   8:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   9:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  10:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  11:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  12:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  16:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  17:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 142
llama_model_loader: - kv  19:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  20:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                      quantize.imatrix.file str              = /workspace/ubergarm/imatrix-Qwen3-235...
llama_model_loader: - kv  33:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  34:             quantize.imatrix.entries_count i32              = 753
llama_model_loader: - kv  35:              quantize.imatrix.chunks_count i32              = 225
llama_model_loader: - kv  36:                                   split.no u16              = 0
llama_model_loader: - kv  37:                                split.count u16              = 5
llama_model_loader: - kv  38:                        split.tensors.count i32              = 1131
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q8_0:   96 tensors
llama_model_loader: - type iq6_k:  564 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 94
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ6_K - 6.6 bpw
llm_load_print_meta: model params     = 235.094 B
llm_load_print_meta: model size       = 198.259 GiB (7.244 BPW) 
llm_load_print_meta: repeating layers = 197.028 GiB (7.237 BPW, 233.849 B parameters)
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 1536
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.50 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/95 layers to GPU
llm_load_tensors:  CUDA_Host buffer size = 203017.61 MiB
....................................................................................................
============ Repacked 95 tensors
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  1598.00 MiB
llama_new_context_with_model: KV self size  = 1598.00 MiB, K (q8_0):  799.00 MiB, V (q8_0):  799.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    88.00 MiB
llama_new_context_with_model:        CPU compute buffer size =   304.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   129.01 MiB
llama_new_context_with_model: graph nodes  = 3672
llama_new_context_with_model: graph splits = 1225

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 48, n_threads_batch = 48

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   32.900 |    15.56 |    8.978 |    14.26 |
|   512 |    128 |    512 |   33.034 |    15.50 |    8.744 |    14.64 |
|   512 |    128 |   1024 |   33.181 |    15.43 |    9.281 |    13.79 |
|   512 |    128 |   1536 |   33.299 |    15.38 |    9.199 |    13.91 |
|   512 |    128 |   2048 |   33.424 |    15.32 |    9.158 |    13.98 |
|   512 |    128 |   2560 |   33.579 |    15.25 |    9.624 |    13.30 |
|   512 |    128 |   3072 |   33.646 |    15.22 |    9.632 |    13.29 |
|   512 |    128 |   3584 |   33.863 |    15.12 |    9.266 |    13.81 |
|   512 |    128 |   4096 |   34.018 |    15.05 |    9.630 |    13.29 |
|   512 |    128 |   4608 |   34.192 |    14.97 |   10.042 |    12.75 |
|   512 |    128 |   5120 |   34.280 |    14.94 |    9.658 |    13.25 |
|   512 |    128 |   5632 |   34.481 |    14.85 |   11.059 |    11.57 |
|   512 |    128 |   6144 |   34.654 |    14.77 |   11.382 |    11.25 |
|   512 |    128 |   6656 |   34.813 |    14.71 |   10.431 |    12.27 |
|   512 |    128 |   7168 |   35.101 |    14.59 |   12.036 |    10.63 |
|   512 |    128 |   7680 |   35.158 |    14.56 |   13.169 |     9.72 |
|   512 |    128 |   8192 |   35.381 |    14.47 |   13.049 |     9.81 |
|   512 |    128 |   8704 |   35.544 |    14.40 |   14.775 |     8.66 |
|   512 |    128 |   9216 |   35.633 |    14.37 |   15.850 |     8.08 |
|   512 |    128 |   9728 |   35.774 |    14.31 |   15.061 |     8.50 |
|   512 |    128 |  10240 |   35.845 |    14.28 |   16.518 |     7.75 |
|   512 |    128 |  10752 |   36.028 |    14.21 |   16.483 |     7.77 |
|   512 |    128 |  11264 |   36.193 |    14.15 |   15.264 |     8.39 |
|   512 |    128 |  11776 |   36.357 |    14.08 |   16.721 |     7.66 |
|   512 |    128 |  12288 |   36.393 |    14.07 |   16.834 |     7.60 |
|   512 |    128 |  12800 |   36.579 |    14.00 |   15.609 |     8.20 |
|   512 |    128 |  13312 |   36.701 |    13.95 |   16.984 |     7.54 |
|   512 |    128 |  13824 |   36.927 |    13.87 |   17.220 |     7.43 |
|   512 |    128 |  14336 |   37.027 |    13.83 |   15.938 |     8.03 |
|   512 |    128 |  14848 |   37.247 |    13.75 |   17.507 |     7.31 |
|   512 |    128 |  15360 |   37.359 |    13.70 |   17.540 |     7.30 |
|   512 |    128 |  15872 |   37.496 |    13.65 |   16.480 |     7.77 |
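
For readers who want to sanity-check the table, the speed columns are just token counts divided by elapsed time (S_PP = PP / T_PP, S_TG = TG / T_TG). A minimal Python sketch, using only the first and last rows of the CPU-only table above; the variable names are made up for illustration:

```python
# Batch sizes from the PP and TG columns of the table.
PP, TG = 512, 128

# (N_KV, T_PP seconds, T_TG seconds) copied from the first and last rows.
rows = [
    (0, 32.900, 8.978),
    (15872, 37.496, 16.480),
]


def speeds(t_pp, t_tg, pp=PP, tg=TG):
    """Return (S_PP, S_TG) in tokens per second."""
    return pp / t_pp, tg / t_tg


results = {n_kv: speeds(t_pp, t_tg) for n_kv, t_pp, t_tg in rows}
for n_kv, (s_pp, s_tg) in results.items():
    print(f"N_KV={n_kv:>5}: S_PP={s_pp:.2f} t/s, S_TG={s_tg:.2f} t/s")

# Relative slowdown between an empty KV cache and N_KV = 15872.
pp_drop = 1 - results[15872][0] / results[0][0]
tg_drop = 1 - results[15872][1] / results[0][1]
print(f"PP slowdown: {pp_drop:.1%}, TG slowdown: {tg_drop:.1%}")
```

The recomputed speeds match the S_PP and S_TG columns for both rows. In this run TG loses far more speed than PP as the KV cache fills.
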
ik_llama.cpp GPU logs
CUDA_VISIBLE_DEVICES="0" ./build/bin/llama-sweep-bench -m /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf -c 16384 -t 48 -fa -rtr -fmoe -ctk q8_0 -ctv q8_0 -ngl 99 -ot "blk\.(0|1|2|3|4)\.ffn.*=CUDA0" -ot "blk\.(5|6|7|8|9)\.ffn.*=CPU" -ot "blk\.1[0-9]\.ffn.*=CPU" -ot "blk\.[2-9][0-9]\.ffn.*=CPU"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 1131 tensors from /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 128x10B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   6:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   7:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv   8:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   9:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  10:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  11:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  12:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  16:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  17:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 142
llama_model_loader: - kv  19:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  20:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                      quantize.imatrix.file str              = /workspace/ubergarm/imatrix-Qwen3-235...
llama_model_loader: - kv  33:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  34:             quantize.imatrix.entries_count i32              = 753
llama_model_loader: - kv  35:              quantize.imatrix.chunks_count i32              = 225
llama_model_loader: - kv  36:                                   split.no u16              = 0
llama_model_loader: - kv  37:                                split.count u16              = 5
llama_model_loader: - kv  38:                        split.tensors.count i32              = 1131
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q8_0:   96 tensors
llama_model_loader: - type iq6_k:  564 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 94
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ6_K - 6.6 bpw
llm_load_print_meta: model params     = 235.094 B
llm_load_print_meta: model size       = 198.259 GiB (7.244 BPW) 
llm_load_print_meta: repeating layers = 197.028 GiB (7.237 BPW, 233.849 B parameters)
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 1536
llm_load_tensors: ggml ctx size =    0.99 MiB
Tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_norm.weight buffer type overriden to CPU
Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_norm.weight buffer type overriden to CPU
Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_norm.weight buffer type overriden to CPU
Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_norm.weight buffer type overriden to CPU
Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_norm.weight buffer type overriden to CPU
Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_norm.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_norm.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_norm.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_norm.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_norm.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_norm.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_norm.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_norm.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_norm.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_norm.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_norm.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_norm.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_norm.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_norm.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_norm.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_norm.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_norm.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_norm.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_norm.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_norm.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_norm.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_norm.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_norm.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_norm.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_norm.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_norm.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_norm.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_norm.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_norm.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_norm.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_norm.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_norm.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_norm.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_norm.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_norm.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_norm.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_norm.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_norm.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_norm.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_norm.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_norm.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_norm.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_norm.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_norm.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_norm.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_norm.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_norm.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_norm.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_norm.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_norm.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_norm.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_norm.weight buffer type overriden to CPU
Tensor blk.61.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.61.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_norm.weight buffer type overriden to CPU
Tensor blk.62.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.62.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_norm.weight buffer type overriden to CPU
Tensor blk.63.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.63.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_norm.weight buffer type overriden to CPU
Tensor blk.64.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.64.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_norm.weight buffer type overriden to CPU
Tensor blk.65.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.65.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_norm.weight buffer type overriden to CPU
Tensor blk.66.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.66.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_norm.weight buffer type overriden to CPU
Tensor blk.67.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.67.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_norm.weight buffer type overriden to CPU
Tensor blk.68.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.68.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_norm.weight buffer type overriden to CPU
Tensor blk.69.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.69.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_norm.weight buffer type overriden to CPU
Tensor blk.70.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.70.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_norm.weight buffer type overriden to CPU
Tensor blk.71.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.71.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_norm.weight buffer type overriden to CPU
Tensor blk.72.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.72.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_norm.weight buffer type overriden to CPU
Tensor blk.73.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.73.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_norm.weight buffer type overriden to CPU
Tensor blk.74.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.74.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_norm.weight buffer type overriden to CPU
Tensor blk.75.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.75.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_norm.weight buffer type overriden to CPU
Tensor blk.76.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.76.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_norm.weight buffer type overriden to CPU
Tensor blk.77.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.77.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_norm.weight buffer type overriden to CPU
Tensor blk.78.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.78.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_norm.weight buffer type overriden to CPU
Tensor blk.79.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.79.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_norm.weight buffer type overriden to CPU
Tensor blk.80.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.80.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_norm.weight buffer type overriden to CPU
Tensor blk.81.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.81.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_norm.weight buffer type overriden to CPU
Tensor blk.82.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.82.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_norm.weight buffer type overriden to CPU
Tensor blk.83.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.83.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_norm.weight buffer type overriden to CPU
Tensor blk.84.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.84.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_norm.weight buffer type overriden to CPU
Tensor blk.85.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.85.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_norm.weight buffer type overriden to CPU
Tensor blk.86.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.86.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_norm.weight buffer type overriden to CPU
Tensor blk.87.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.87.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_norm.weight buffer type overriden to CPU
Tensor blk.88.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.88.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_norm.weight buffer type overriden to CPU
Tensor blk.89.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.89.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_norm.weight buffer type overriden to CPU
Tensor blk.90.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.90.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_norm.weight buffer type overriden to CPU
Tensor blk.91.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_norm.weight buffer type overriden to CPU
Tensor blk.92.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_norm.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors:        CPU buffer size = 186011.39 MiB
llm_load_tensors:  CUDA_Host buffer size =   630.59 MiB
llm_load_tensors:      CUDA0 buffer size = 16375.62 MiB
....................................................................................................
============ Repacked 89 tensors
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1598.05 MiB
llama_new_context_with_model: KV self size  = 1598.00 MiB, K (q8_0):  799.00 MiB, V (q8_0):  799.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   312.75 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.25 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   120.01 MiB
llama_new_context_with_model: graph nodes  = 3672
llama_new_context_with_model: graph splits = 358

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 48, n_threads_batch = 48

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    4.089 |   125.21 |    7.414 |    17.26 |
|   512 |    128 |    512 |    4.120 |   124.26 |    7.778 |    16.46 |
|   512 |    128 |   1024 |    4.077 |   125.58 |    8.250 |    15.51 |
|   512 |    128 |   1536 |    4.123 |   124.17 |   10.897 |    11.75 |
|   512 |    128 |   2048 |    4.181 |   122.45 |   11.487 |    11.14 |
|   512 |    128 |   2560 |    4.193 |   122.12 |   11.506 |    11.12 |
|   512 |    128 |   3072 |    4.197 |   122.01 |   11.770 |    10.88 |
|   512 |    128 |   3584 |    4.249 |   120.50 |   12.058 |    10.62 |
|   512 |    128 |   4096 |    4.316 |   118.64 |   12.234 |    10.46 |
|   512 |    128 |   4608 |    4.337 |   118.06 |   12.299 |    10.41 |
|   512 |    128 |   5120 |    4.331 |   118.23 |   12.540 |    10.21 |
|   512 |    128 |   5632 |    4.380 |   116.91 |   12.850 |     9.96 |
|   512 |    128 |   6144 |    4.413 |   116.03 |   13.086 |     9.78 |
|   512 |    128 |   6656 |    4.416 |   115.93 |   13.052 |     9.81 |
|   512 |    128 |   7168 |    4.462 |   114.75 |   13.409 |     9.55 |
|   512 |    128 |   7680 |    4.477 |   114.36 |   13.776 |     9.29 |
|   512 |    128 |   8192 |    4.505 |   113.66 |   13.847 |     9.24 |
|   512 |    128 |   8704 |    4.499 |   113.81 |   13.971 |     9.16 |
|   512 |    128 |   9216 |    4.494 |   113.93 |   14.251 |     8.98 |
|   512 |    128 |   9728 |    4.489 |   114.06 |   14.196 |     9.02 |
|   512 |    128 |  10240 |    4.470 |   114.53 |   14.242 |     8.99 |
|   512 |    128 |  10752 |    4.491 |   114.01 |   14.250 |     8.98 |
|   512 |    128 |  11264 |    4.521 |   113.25 |   14.597 |     8.77 |
|   512 |    128 |  11776 |    4.568 |   112.08 |   14.801 |     8.65 |
|   512 |    128 |  12288 |    4.562 |   112.23 |   14.969 |     8.55 |
|   512 |    128 |  12800 |    4.581 |   111.78 |   15.320 |     8.36 |
|   512 |    128 |  13312 |    4.582 |   111.73 |   15.368 |     8.33 |
|   512 |    128 |  13824 |    4.598 |   111.35 |   15.639 |     8.18 |
|   512 |    128 |  14336 |    4.619 |   110.84 |   15.904 |     8.05 |
|   512 |    128 |  14848 |    4.639 |   110.38 |   15.952 |     8.02 |
|   512 |    128 |  15360 |    4.649 |   110.14 |   16.225 |     7.89 |
|   512 |    128 |  15872 |    4.663 |   109.79 |   16.326 |     7.84 |
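
As a sanity check on the log and table above, here is a small sketch that recomputes two of the reported numbers. It assumes the throughput columns are simply tokens divided by elapsed seconds, and that a Q8_0 block stores 32 values in 34 bytes (32 int8 quants plus one fp16 scale). The value `n_embd_k_gqa = 512` is taken from the model metadata (4 KV heads × head size 128). The function names are illustrative only and are not part of `llama-sweep-bench`.

```python
# Rows copied from the sweep-bench table: (PP, TG, N_KV, T_PP s, T_TG s)
rows = [
    (512, 128, 0, 4.089, 7.414),
    (512, 128, 8192, 4.505, 13.847),
    (512, 128, 15872, 4.663, 16.326),
]


def throughput(n_tokens, seconds):
    """Tokens per second, assuming S = N / T as the table columns suggest."""
    return n_tokens / seconds


# Q8_0 layout assumption: 32 int8 quants + one fp16 scale = 34 bytes per block.
Q8_0_BLOCK_ELEMS = 32
Q8_0_BLOCK_BYTES = 34


def q8_0_cache_mib(n_ctx, n_layer, n_embd_gqa):
    """Size in MiB of one cache (K or V) stored as Q8_0."""
    n_elems = n_ctx * n_layer * n_embd_gqa
    n_bytes = n_elems // Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES
    return n_bytes / 2**20


for pp, tg, n_kv, t_pp, t_tg in rows:
    s_pp = throughput(pp, t_pp)
    s_tg = throughput(tg, t_tg)
    print(f"N_KV={n_kv:>5}  S_PP={s_pp:7.2f} t/s  S_TG={s_tg:6.2f} t/s")

# n_ctx = 16384, 94 layers, n_embd_k_gqa = 512 (4 KV heads x 128)
k_mib = q8_0_cache_mib(16384, 94, 512)
print(f"K cache: {k_mib:.2f} MiB, K + V: {2 * k_mib:.2f} MiB")
```

The recomputed throughputs agree with the table to within the rounding of the printed times, and the cache size matches the `K (q8_0):  799.00 MiB` line in the log. Over this sweep TG drops by a bit more than half (17.26 to 7.84 t/s), while PP loses only about 12% (125.21 to 109.79 t/s).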
llama.cpp CPU logs
./build/bin/llama-sweep-bench -m /mnt/srv/slush/gguf/Qwen3-235B-A22B-128K-GGUF/Q6_K/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf -c 16384 -t 48 -fa -ctk q8_0 -ctv q8_0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 5269 (1d36b367) with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 47 key-value pairs and 1131 tensors from /mnt/srv/slush/gguf/Qwen3-235B-A22B-128K-GGUF/Q6_K/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-235B-A22B-128K
llama_model_loader: - kv   3:                           general.finetune str              = 128k
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-235B-A22B-128K
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 235B-A22B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                   general.base_model.count u32              = 1
llama_model_loader: - kv  11:                  general.base_model.0.name str              = Qwen3 235B A22B
llama_model_loader: - kv  12:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv  14:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 131072
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  18:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  19:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  20:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  21:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  24:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  25:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  26:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  27:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  30:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  35:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 18
llama_model_loader: - kv  40:                      quantize.imatrix.file str              = Qwen3-235B-A22B-128K-GGUF/imatrix_uns...
llama_model_loader: - kv  41:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-235B-A22B-1...
llama_model_loader: - kv  42:             quantize.imatrix.entries_count i32              = 752
llama_model_loader: - kv  43:              quantize.imatrix.chunks_count i32              = 46
llama_model_loader: - kv  44:                                   split.no u16              = 0
llama_model_loader: - kv  45:                        split.tensors.count i32              = 1131
llama_model_loader: - kv  46:                                split.count u16              = 4
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q6_K:  660 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 179.75 GiB (6.57 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 94
print_info: n_head           = 64
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 16
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 235B.A22B
print_info: model params     = 235.09 B
print_info: general.name     = Qwen3-235B-A22B-128K
print_info: n_ff_exp         = 1536
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/95 layers to GPU
load_tensors:   CPU_Mapped model buffer size = 47091.25 MiB
load_tensors:   CPU_Mapped model buffer size = 47433.32 MiB
load_tensors:   CPU_Mapped model buffer size = 47377.52 MiB
load_tensors:   CPU_Mapped model buffer size = 42166.10 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'q8_0', type_v = 'q8_0', n_layer = 94, can_shift = 1, padding = 256
llama_kv_cache_unified:        CPU KV buffer size =  1598.00 MiB
llama_kv_cache_unified: KV self size  = 1598.00 MiB, K (q8_0):  799.00 MiB, V (q8_0):  799.00 MiB
llama_context:      CUDA0 compute buffer size =   742.00 MiB
llama_context:        CPU compute buffer size =   304.75 MiB
llama_context:  CUDA_Host compute buffer size =    65.01 MiB
llama_context: graph nodes  = 5741
llama_context: graph splits = 1602 (with bs=512), 189 (with bs=1)

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 48, n_threads_batch = 48

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   42.357 |    12.09 |    9.060 |    14.13 |
|   512 |    128 |    512 |   51.197 |    10.00 |   12.240 |    10.46 |
|   512 |    128 |   1024 |   60.284 |     8.49 |   14.398 |     8.89 |
|   512 |    128 |   1536 |   68.365 |     7.49 |   17.737 |     7.22 |
|   512 |    128 |   2048 |   76.989 |     6.65 |   24.649 |     5.19 |
|   512 |    128 |   2560 |   87.495 |     5.85 |   29.569 |     4.33 |
|   512 |    128 |   3072 |   99.493 |     5.15 |   33.176 |     3.86 |
|   512 |    128 |   3584 |  104.887 |     4.88 |   35.395 |     3.62 |
|   512 |    128 |   4096 |  110.847 |     4.62 |   37.481 |     3.42 |
|   512 |    128 |   4608 |  118.194 |     4.33 |   46.298 |     2.76 |
|   512 |    128 |   5120 |  126.544 |     4.05 |   43.575 |     2.94 |
|   512 |    128 |   5632 |  132.354 |     3.87 |   51.306 |     2.49 |
|   512 |    128 |   6144 |  141.580 |     3.62 |   53.846 |     2.38 |
|   512 |    128 |   6656 |  147.841 |     3.46 |   51.455 |     2.49 |
|   512 |    128 |   7168 |  155.069 |     3.30 |   52.843 |     2.42 |
|   512 |    128 |   7680 |  166.590 |     3.07 |   61.982 |     2.07 |
|   512 |    128 |   8192 |  174.021 |     2.94 |   62.082 |     2.06 |
|   512 |    128 |   8704 |  180.649 |     2.83 |   68.306 |     1.87 |
|   512 |    128 |   9216 |  191.221 |     2.68 |   71.603 |     1.79 |
|   512 |    128 |   9728 |  197.848 |     2.59 |   78.050 |     1.64 |
|   512 |    128 |  10240 |  205.342 |     2.49 |   85.140 |     1.50 |
|   512 |    128 |  10752 |  209.842 |     2.44 |   82.100 |     1.56 |
|   512 |    128 |  11264 |  218.246 |     2.35 |   78.315 |     1.63 |
|   512 |    128 |  11776 |  229.003 |     2.24 |   81.961 |     1.56 |
|   512 |    128 |  12288 |  241.294 |     2.12 |   81.073 |     1.58 |
|   512 |    128 |  12800 |  247.041 |     2.07 |   92.054 |     1.39 |
|   512 |    128 |  13312 |  246.231 |     2.08 |   90.119 |     1.42 |
|   512 |    128 |  13824 |  267.642 |     1.91 |   91.823 |     1.39 |
|   512 |    128 |  14336 |  262.708 |     1.95 |   92.070 |     1.39 |
|   512 |    128 |  14848 |  276.199 |     1.85 |   93.608 |     1.37 |
|   512 |    128 |  15360 |  286.268 |     1.79 |   97.714 |     1.31 |
|   512 |    128 |  15872 |  293.752 |     1.74 |   97.181 |     1.32 |
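
The KV buffer size reported in the log above (799.00 MiB each for K and V) can be reproduced from the printed hyperparameters. This is only a sketch: it assumes the standard ggml q8_0 layout of 32 int8 values plus one fp16 scale (34 bytes per block), and the helper name `q8_0_cache_mib` is made up for illustration.

```python
# Reproduce the q8_0 KV cache size from values printed in the log:
# kv_size = 16384, n_embd_k_gqa = n_embd_v_gqa = 512, n_layer = 94.
Q8_0_BLOCK_ELEMS = 32  # elements per q8_0 block (assumed ggml layout)
Q8_0_BLOCK_BYTES = 34  # 32 int8 quants + one fp16 scale


def q8_0_cache_mib(n_ctx: int, n_embd_gqa: int, n_layer: int) -> float:
    """MiB needed to store K (or V) for all layers in q8_0."""
    n_elems = n_ctx * n_embd_gqa * n_layer
    n_bytes = n_elems // Q8_0_BLOCK_ELEMS * Q8_0_BLOCK_BYTES
    return n_bytes / (1024 * 1024)


k_mib = q8_0_cache_mib(n_ctx=16384, n_embd_gqa=512, n_layer=94)
v_mib = q8_0_cache_mib(n_ctx=16384, n_embd_gqa=512, n_layer=94)
print(f"K: {k_mib:.2f} MiB, V: {v_mib:.2f} MiB, total: {k_mib + v_mib:.2f} MiB")
```

With n_head_kv = 4 and a head size of 128, each layer stores only 512 values per token for K and for V, which is why a 16k context fits in about 1.6 GiB even with 94 layers.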
llama.cpp GPU logs
CUDA_VISIBLE_DEVICES="0" ./build/bin/llama-sweep-bench -m /mnt/srv/slush/gguf/Qwen3-235B-A22B-128K-GGUF/Q6_K/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf -c 16384 -t 48 -fa -ctk q8_0 -ctv q8_0 -ngl 99 -ot "blk\.(0|1|2|3|4)\.ffn.*=CUDA0" -ot "blk\.(5|6|7|8|9)\.ffn.*=CPU" -ot "blk\.1[0-9]\.ffn.*=CPU" -ot "blk\.[2-9][0-9]\.ffn.*=CPU"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 5269 (1d36b367) with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 47 key-value pairs and 1131 tensors from /mnt/srv/slush/gguf/Qwen3-235B-A22B-128K-GGUF/Q6_K/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-235B-A22B-128K
llama_model_loader: - kv   3:                           general.finetune str              = 128k
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-235B-A22B-128K
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 235B-A22B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                   general.base_model.count u32              = 1
llama_model_loader: - kv  11:                  general.base_model.0.name str              = Qwen3 235B A22B
llama_model_loader: - kv  12:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv  14:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 131072
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  18:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  19:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  20:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  21:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  24:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  25:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  26:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  27:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  30:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  35:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 18
llama_model_loader: - kv  40:                      quantize.imatrix.file str              = Qwen3-235B-A22B-128K-GGUF/imatrix_uns...
llama_model_loader: - kv  41:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-235B-A22B-1...
llama_model_loader: - kv  42:             quantize.imatrix.entries_count i32              = 752
llama_model_loader: - kv  43:              quantize.imatrix.chunks_count i32              = 46
llama_model_loader: - kv  44:                                   split.no u16              = 0
llama_model_loader: - kv  45:                        split.tensors.count i32              = 1131
llama_model_loader: - kv  46:                                split.count u16              = 4
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q6_K:  660 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 179.75 GiB (6.57 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 94
print_info: n_head           = 64
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 16
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 235B.A22B
print_info: model params     = 235.09 B
print_info: general.name     = Qwen3-235B-A22B-128K
print_info: n_ff_exp         = 1536
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors:        CUDA0 model buffer size = 15191.95 MiB
load_tensors:   CPU_Mapped model buffer size = 46604.38 MiB
load_tensors:   CPU_Mapped model buffer size = 47377.52 MiB
load_tensors:   CPU_Mapped model buffer size = 47377.52 MiB
load_tensors:   CPU_Mapped model buffer size = 42166.10 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'q8_0', type_v = 'q8_0', n_layer = 94, can_shift = 1, padding = 256
llama_kv_cache_unified:      CUDA0 KV buffer size =  1598.00 MiB
llama_kv_cache_unified: KV self size  = 1598.00 MiB, K (q8_0):  799.00 MiB, V (q8_0):  799.00 MiB
llama_context:      CUDA0 compute buffer size =   774.00 MiB
llama_context:        CPU compute buffer size =     8.25 MiB
llama_context:  CUDA_Host compute buffer size =    40.01 MiB
llama_context: graph nodes  = 5741
llama_context: graph splits = 536 (with bs=512), 180 (with bs=1)

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 48, n_threads_batch = 48

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    8.145 |    62.86 |   10.488 |    12.20 |
|   512 |    128 |    512 |    8.082 |    63.35 |   10.974 |    11.66 |
|   512 |    128 |   1024 |    8.101 |    63.20 |   10.899 |    11.74 |
|   512 |    128 |   1536 |    8.120 |    63.05 |   10.961 |    11.68 |
|   512 |    128 |   2048 |    8.133 |    62.96 |   11.266 |    11.36 |
|   512 |    128 |   2560 |    8.137 |    62.92 |   11.590 |    11.04 |
|   512 |    128 |   3072 |    8.155 |    62.78 |   11.656 |    10.98 |
|   512 |    128 |   3584 |    8.150 |    62.82 |   11.651 |    10.99 |
|   512 |    128 |   4096 |    8.178 |    62.61 |   11.773 |    10.87 |
|   512 |    128 |   4608 |    8.174 |    62.64 |   11.889 |    10.77 |
|   512 |    128 |   5120 |    8.200 |    62.44 |   12.031 |    10.64 |
|   512 |    128 |   5632 |    8.204 |    62.41 |   12.040 |    10.63 |
|   512 |    128 |   6144 |    8.215 |    62.32 |   12.113 |    10.57 |
|   512 |    128 |   6656 |    8.224 |    62.26 |   12.227 |    10.47 |
|   512 |    128 |   7168 |    8.235 |    62.17 |   12.386 |    10.33 |
|   512 |    128 |   7680 |    8.246 |    62.09 |   12.543 |    10.20 |
|   512 |    128 |   8192 |    8.268 |    61.93 |   12.871 |     9.94 |
|   512 |    128 |   8704 |    8.264 |    61.95 |   12.922 |     9.91 |
|   512 |    128 |   9216 |    8.278 |    61.85 |   13.009 |     9.84 |
|   512 |    128 |   9728 |    8.312 |    61.60 |   13.256 |     9.66 |
|   512 |    128 |  10240 |    8.313 |    61.59 |   13.236 |     9.67 |
|   512 |    128 |  10752 |    8.316 |    61.57 |   13.518 |     9.47 |
|   512 |    128 |  11264 |    8.323 |    61.52 |   13.594 |     9.42 |
|   512 |    128 |  11776 |    8.337 |    61.41 |   13.412 |     9.54 |
|   512 |    128 |  12288 |    8.376 |    61.13 |   13.554 |     9.44 |
|   512 |    128 |  12800 |    8.379 |    61.10 |   13.561 |     9.44 |
|   512 |    128 |  13312 |    8.367 |    61.19 |   13.692 |     9.35 |
|   512 |    128 |  13824 |    8.386 |    61.05 |   13.817 |     9.26 |
|   512 |    128 |  14336 |    8.402 |    60.94 |   13.954 |     9.17 |
|   512 |    128 |  14848 |    8.408 |    60.89 |   14.156 |     9.04 |
|   512 |    128 |  15360 |    8.416 |    60.84 |   14.256 |     8.98 |
|   512 |    128 |  15872 |    8.439 |    60.67 |   14.597 |     8.77 |

I used sweep-bench-plot.py to generate the following charts. Each series is identified by its filename, which includes {llama | ik-llama}-{cpu | gpu}.

CPU performance PP comparison: performance_comparison_pp_cpu

GPU performance PP comparison: performance_comparison_pp_gpu

CPU performance TG comparison: performance_comparison_tg_cpu

GPU performance TG comparison: performance_comparison_tg_gpu
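
For a quick numeric comparison without the plots, the sweep-bench tables can be parsed directly. The sketch below is not part of sweep-bench-plot.py; `parse_sweep_table` is a hypothetical helper, and only the first and last rows of the two llama.cpp tables above (the run with 0/95 layers offloaded and the run with -ngl 99 plus -ot overrides) are copied in to keep it short.

```python
def parse_sweep_table(text: str) -> dict[int, tuple[float, float]]:
    """Map N_KV -> (S_PP t/s, S_TG t/s) from a llama-sweep-bench table."""
    rows = {}
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 7 or not cells[0].isdigit():
            continue  # skip the header and separator rows
        rows[int(cells[2])] = (float(cells[4]), float(cells[6]))
    return rows


# First and last rows of the llama.cpp run with 0/95 layers offloaded.
CPU_ROWS = """
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |   42.357 |    12.09 |    9.060 |    14.13 |
|   512 |    128 |  15872 |  293.752 |     1.74 |   97.181 |     1.32 |
"""

# First and last rows of the llama.cpp run with -ngl 99 and -ot overrides.
GPU_ROWS = """
|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    8.145 |    62.86 |   10.488 |    12.20 |
|   512 |    128 |  15872 |    8.439 |    60.67 |   14.597 |     8.77 |
"""

cpu = parse_sweep_table(CPU_ROWS)
gpu = parse_sweep_table(GPU_ROWS)

speedups = {}
for n_kv in sorted(cpu.keys() & gpu.keys()):
    pp_ratio = gpu[n_kv][0] / cpu[n_kv][0]
    tg_ratio = gpu[n_kv][1] / cpu[n_kv][1]
    speedups[n_kv] = (pp_ratio, tg_ratio)
    print(f"N_KV={n_kv:>5}: PP x{pp_ratio:.2f}, TG x{tg_ratio:.2f}")
```

On these rows the offloaded run is roughly 5.2x faster in PP at an empty cache but slightly slower in TG (ratio about 0.86), while at N_KV = 15872 it is roughly 35x faster in PP and 6.6x faster in TG.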

👤 AesSedai replied the 2025-05-03 at 05:29:15:
One more test: I disabled pipeline parallelism (by setting GGML_SCHED_MAX_COPIES to 1) and rebuilt ik_llama.cpp:

cmake -DBLAS_INCLUDE_DIRS=/usr/include/openblas -B build -DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_SCHED_MAX_COPIES=1

This let me use my second 3090 and offload a little more.
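
The -ot arguments in the 2x GPU command that follows are regular-expression overrides on tensor names. The sketch below shows which buffer each layer's FFN tensors end up in, assuming the patterns are tried in command-line order with search semantics and the first match wins (an assumption about the loader, though the "buffer type overriden" lines in the log agree with the result). `buffer_for` is a hypothetical helper, not a function from either code base.

```python
import re

# Patterns copied from the -ot arguments, in command-line order.
OVERRIDES = [
    (r"blk\.(0|1|2|3|4|5|6)\.ffn.*", "CUDA0"),
    (r"blk\.(7|8|9|10|11|12|13)\.ffn.*", "CUDA1"),
    (r"blk\.1[4-9]\.ffn.*", "CPU"),
    (r"blk\.[2-9][0-9]\.ffn.*", "CPU"),
]


def buffer_for(tensor_name: str):
    """Return the override buffer for a tensor, or None if no pattern matches."""
    for pattern, buft in OVERRIDES:
        if re.search(pattern, tensor_name):
            return buft
    return None  # no override: placement is left to -ngl


N_LAYER = 94  # qwen3moe.block_count from the log
placement = {i: buffer_for(f"blk.{i}.ffn_up_exps.weight") for i in range(N_LAYER)}

for buft in ("CUDA0", "CUDA1", "CPU"):
    layers = [i for i, b in placement.items() if b == buft]
    print(f"{buft}: layers {layers[0]}-{layers[-1]} ({len(layers)} layers)")

# Attention tensors match none of the patterns, so they follow -ngl 99.
print(buffer_for("blk.0.attn_q.weight"))
```

Under these assumptions the expert tensors of layers 0-6 go to CUDA0, 7-13 to CUDA1, and 14-93 stay on the CPU, while everything else (attention, norms outside the FFN, output) is placed by -ngl 99.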

ik_llama.cpp 2x GPU logs
./build/bin/llama-sweep-bench -m /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf -c 16384 -t 48 -fa -rtr -fmoe -ctk q8_0 -ctv q8_0 -ngl 99 -ot "blk\.(0|1|2|3|4|5|6)\.ffn.*=CUDA0" -ot "blk\.(7|8|9|10|11|12|13)\.ffn.*=CUDA1" -ot "blk\.1[4-9]\.ffn.*=CPU" -ot "blk\.[2-9][0-9]\.ffn.*=CPU"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 1131 tensors from /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 128x10B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   6:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   7:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv   8:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   9:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  10:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  11:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  12:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  16:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  17:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 142
llama_model_loader: - kv  19:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  20:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                      quantize.imatrix.file str              = /workspace/ubergarm/imatrix-Qwen3-235...
llama_model_loader: - kv  33:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  34:             quantize.imatrix.entries_count i32              = 753
llama_model_loader: - kv  35:              quantize.imatrix.chunks_count i32              = 225
llama_model_loader: - kv  36:                                   split.no u16              = 0
llama_model_loader: - kv  37:                                split.count u16              = 5
llama_model_loader: - kv  38:                        split.tensors.count i32              = 1131
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q8_0:   96 tensors
llama_model_loader: - type iq6_k:  564 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 94
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ6_K - 6.6 bpw
llm_load_print_meta: model params     = 235.094 B
llm_load_print_meta: model size       = 198.259 GiB (7.244 BPW) 
llm_load_print_meta: repeating layers = 197.028 GiB (7.237 BPW, 233.849 B parameters)
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 1536
llm_load_tensors: ggml ctx size =    1.49 MiB
Tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.8.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.9.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.14.ffn_norm.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_norm.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_norm.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_norm.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_norm.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_norm.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_norm.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_norm.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_norm.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_norm.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_norm.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_norm.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_norm.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_norm.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_norm.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_norm.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_norm.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_norm.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_norm.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_norm.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_norm.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_norm.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_norm.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_norm.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_norm.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_norm.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_norm.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_norm.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_norm.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_norm.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_norm.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_norm.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_norm.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_norm.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_norm.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_norm.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_norm.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_norm.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_norm.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_norm.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_norm.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_norm.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_norm.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_norm.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_norm.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_norm.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_norm.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_norm.weight buffer type overriden to CPU
Tensor blk.61.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.61.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_norm.weight buffer type overriden to CPU
Tensor blk.62.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.62.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_norm.weight buffer type overriden to CPU
Tensor blk.63.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.63.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_norm.weight buffer type overriden to CPU
Tensor blk.64.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.64.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_norm.weight buffer type overriden to CPU
Tensor blk.65.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.65.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_norm.weight buffer type overriden to CPU
Tensor blk.66.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.66.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_norm.weight buffer type overriden to CPU
Tensor blk.67.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.67.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_norm.weight buffer type overriden to CPU
Tensor blk.68.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.68.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_norm.weight buffer type overriden to CPU
Tensor blk.69.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.69.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_norm.weight buffer type overriden to CPU
Tensor blk.70.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.70.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_norm.weight buffer type overriden to CPU
Tensor blk.71.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.71.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_norm.weight buffer type overriden to CPU
Tensor blk.72.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.72.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_norm.weight buffer type overriden to CPU
Tensor blk.73.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.73.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_norm.weight buffer type overriden to CPU
Tensor blk.74.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.74.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_norm.weight buffer type overriden to CPU
Tensor blk.75.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.75.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_norm.weight buffer type overriden to CPU
Tensor blk.76.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.76.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_norm.weight buffer type overriden to CPU
Tensor blk.77.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.77.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_norm.weight buffer type overriden to CPU
Tensor blk.78.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.78.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_norm.weight buffer type overriden to CPU
Tensor blk.79.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.79.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_norm.weight buffer type overriden to CPU
Tensor blk.80.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.80.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_norm.weight buffer type overriden to CPU
Tensor blk.81.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.81.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_norm.weight buffer type overriden to CPU
Tensor blk.82.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.82.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_norm.weight buffer type overriden to CPU
Tensor blk.83.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.83.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_norm.weight buffer type overriden to CPU
Tensor blk.84.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.84.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_norm.weight buffer type overriden to CPU
Tensor blk.85.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.85.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_norm.weight buffer type overriden to CPU
Tensor blk.86.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.86.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_norm.weight buffer type overriden to CPU
Tensor blk.87.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.87.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_norm.weight buffer type overriden to CPU
Tensor blk.88.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.88.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_norm.weight buffer type overriden to CPU
Tensor blk.89.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.89.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_norm.weight buffer type overriden to CPU
Tensor blk.90.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.90.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_norm.weight buffer type overriden to CPU
Tensor blk.91.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_norm.weight buffer type overriden to CPU
Tensor blk.92.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_norm.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors:        CPU buffer size = 167201.25 MiB
llm_load_tensors:  CUDA_Host buffer size =   630.59 MiB
llm_load_tensors:      CUDA0 buffer size = 17333.91 MiB
llm_load_tensors:      CUDA1 buffer size = 17851.86 MiB
....................................................................................................
============ Repacked 80 tensors
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   816.02 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   782.02 MiB
llama_new_context_with_model: KV self size  = 1598.00 MiB, K (q8_0):  799.00 MiB, V (q8_0):  799.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =   144.00 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   312.75 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.25 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   120.01 MiB
llama_new_context_with_model: graph nodes  = 3672
llama_new_context_with_model: graph splits = 336

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 48, n_threads_batch = 48

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    3.957 |   129.39 |    7.209 |    17.75 |
|   512 |    128 |    512 |    3.910 |   130.94 |    7.676 |    16.67 |
|   512 |    128 |   1024 |    3.963 |   129.18 |    7.769 |    16.48 |
|   512 |    128 |   1536 |    3.961 |   129.27 |    7.837 |    16.33 |
|   512 |    128 |   2048 |    4.033 |   126.94 |    8.236 |    15.54 |
|   512 |    128 |   2560 |    4.054 |   126.28 |    8.429 |    15.19 |
|   512 |    128 |   3072 |    4.072 |   125.73 |   10.839 |    11.81 |
|   512 |    128 |   3584 |    4.098 |   124.94 |   11.515 |    11.12 |
|   512 |    128 |   4096 |    4.177 |   122.56 |   11.817 |    10.83 |
|   512 |    128 |   4608 |    4.182 |   122.44 |   12.003 |    10.66 |
|   512 |    128 |   5120 |    4.215 |   121.48 |   12.178 |    10.51 |
|   512 |    128 |   5632 |    4.213 |   121.54 |   12.464 |    10.27 |
|   512 |    128 |   6144 |    4.275 |   119.76 |   12.475 |    10.26 |
|   512 |    128 |   6656 |    4.200 |   121.89 |   12.690 |    10.09 |
|   512 |    128 |   7168 |    4.220 |   121.32 |   12.896 |     9.93 |
|   512 |    128 |   7680 |    4.251 |   120.45 |   13.109 |     9.76 |
|   512 |    128 |   8192 |    4.279 |   119.66 |   13.253 |     9.66 |
|   512 |    128 |   8704 |    4.293 |   119.26 |   13.550 |     9.45 |
|   512 |    128 |   9216 |    4.291 |   119.31 |   13.668 |     9.37 |
|   512 |    128 |   9728 |    4.301 |   119.04 |   13.804 |     9.27 |
|   512 |    128 |  10240 |    4.306 |   118.90 |   14.200 |     9.01 |
|   512 |    128 |  10752 |    4.338 |   118.02 |   14.255 |     8.98 |
|   512 |    128 |  11264 |    4.330 |   118.25 |   14.403 |     8.89 |
|   512 |    128 |  11776 |    4.375 |   117.03 |   14.506 |     8.82 |
|   512 |    128 |  12288 |    4.413 |   116.03 |   14.864 |     8.61 |
|   512 |    128 |  12800 |    4.414 |   116.00 |   14.960 |     8.56 |
|   512 |    128 |  13312 |    4.419 |   115.86 |   15.197 |     8.42 |
|   512 |    128 |  13824 |    4.440 |   115.32 |   15.448 |     8.29 |
|   512 |    128 |  14336 |    4.463 |   114.72 |   15.592 |     8.21 |
|   512 |    128 |  14848 |    4.473 |   114.46 |   15.740 |     8.13 |
|   512 |    128 |  15360 |    4.507 |   113.61 |   15.883 |     8.06 |
|   512 |    128 |  15872 |    4.514 |   113.43 |   16.207 |     7.90 |
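
As a sanity check on the tensor placement, here is a minimal sketch that replays the `-ot` patterns from the llama.cpp command line in this post against the FFN tensor names of the 94 repeating layers. It assumes the first pattern that matches anywhere in the tensor name wins, which may not be exactly how either implementation resolves overlapping overrides. `OVERRIDES` and `buffer_for` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Override patterns copied from the -ot arguments of the llama.cpp command in this post.
OVERRIDES = [
    (r"blk\.(0|1|2|3|4|5|6)\.ffn.*", "CUDA0"),
    (r"blk\.(7|8|9|10|11|12|13)\.ffn.*", "CUDA1"),
    (r"blk\.1[4-9]\.ffn.*", "CPU"),
    (r"blk\.[2-9][0-9]\.ffn.*", "CPU"),
]


def buffer_for(tensor_name):
    """Return the buffer type of the first matching pattern, or None.

    Assumption: patterns are tried in order with search semantics.
    """
    for pattern, buffer_type in OVERRIDES:
        if re.search(pattern, tensor_name):
            return buffer_type
    return None


# 94 repeating layers, as reported by llm_load_tensors in the log.
placement = {i: buffer_for(f"blk.{i}.ffn_gate_exps.weight") for i in range(94)}
counts = Counter(placement.values())
print(counts)
```

Under that assumption the FFN tensors of 7 layers go to CUDA0, 7 layers to CUDA1, and the remaining 80 layers (14 to 93) stay on the CPU, which agrees with the override lines in the log.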
llama.cpp 2x GPU logs
./build/bin/llama-sweep-bench -m /mnt/srv/slush/gguf/Qwen3-235B-A22B-128K-GGUF/Q6_K/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf -c 16384 -t 48 -fa -ctk q8_0 -ctv q8_0 -ngl 99 -ot "blk\.(0|1|2|3|4|5|6)\.ffn.*=CUDA0" -ot "blk\.(7|8|9|10|11|12|13)\.ffn.*=CUDA1" -ot "blk\.1[4-9]\.ffn.*=CPU" -ot "blk\.[2-9][0-9]\.ffn.*=CPU"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 5269 (1d36b367) with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23871 MiB free
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 47 key-value pairs and 1131 tensors from /mnt/srv/slush/gguf/Qwen3-235B-A22B-128K-GGUF/Q6_K/Qwen3-235B-A22B-128K-Q6_K-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-235B-A22B-128K
llama_model_loader: - kv   3:                           general.finetune str              = 128k
llama_model_loader: - kv   4:                           general.basename str              = Qwen3-235B-A22B-128K
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 235B-A22B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   9:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  10:                   general.base_model.count u32              = 1
llama_model_loader: - kv  11:                  general.base_model.0.name str              = Qwen3 235B A22B
llama_model_loader: - kv  12:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv  14:                               general.tags arr[str,2]       = ["unsloth", "text-generation"]
llama_model_loader: - kv  15:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv  16:                    qwen3moe.context_length u32              = 131072
llama_model_loader: - kv  17:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  18:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  19:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  20:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  21:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  24:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  25:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  26:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  27:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  28:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  29:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  30:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  32:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  35:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 18
llama_model_loader: - kv  40:                      quantize.imatrix.file str              = Qwen3-235B-A22B-128K-GGUF/imatrix_uns...
llama_model_loader: - kv  41:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-235B-A22B-1...
llama_model_loader: - kv  42:             quantize.imatrix.entries_count i32              = 752
llama_model_loader: - kv  43:              quantize.imatrix.chunks_count i32              = 46
llama_model_loader: - kv  44:                                   split.no u16              = 0
llama_model_loader: - kv  45:                        split.tensors.count i32              = 1131
llama_model_loader: - kv  46:                                split.count u16              = 4
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q6_K:  660 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 179.75 GiB (6.57 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 94
print_info: n_head           = 64
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 16
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 128
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 235B.A22B
print_info: model params     = 235.09 B
print_info: general.name     = Qwen3-235B-A22B-128K
print_info: n_ff_exp         = 1536
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors:        CUDA0 model buffer size = 15922.41 MiB
load_tensors:        CUDA1 model buffer size = 16297.68 MiB
load_tensors:   CPU_Mapped model buffer size = 46604.38 MiB
load_tensors:   CPU_Mapped model buffer size = 47377.52 MiB
load_tensors:   CPU_Mapped model buffer size = 47377.52 MiB
load_tensors:   CPU_Mapped model buffer size = 42166.10 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'q8_0', type_v = 'q8_0', n_layer = 94, can_shift = 1, padding = 256
llama_kv_cache_unified:      CUDA0 KV buffer size =   816.00 MiB
llama_kv_cache_unified:      CUDA1 KV buffer size =   782.00 MiB
llama_kv_cache_unified: KV self size  = 1598.00 MiB, K (q8_0):  799.00 MiB, V (q8_0):  799.00 MiB
llama_context:      CUDA0 compute buffer size =   774.00 MiB
llama_context:      CUDA1 compute buffer size =   304.75 MiB
llama_context:        CPU compute buffer size =     8.25 MiB
llama_context:  CUDA_Host compute buffer size =    40.01 MiB
llama_context: graph nodes  = 5741
llama_context: graph splits = 543 (with bs=512), 176 (with bs=1)

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 48, n_threads_batch = 48

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    7.567 |    67.67 |    7.486 |    17.10 |
|   512 |    128 |    512 |    7.474 |    68.51 |    8.910 |    14.37 |
|   512 |    128 |   1024 |    7.488 |    68.38 |   10.591 |    12.09 |
|   512 |    128 |   1536 |    7.515 |    68.13 |   10.581 |    12.10 |
|   512 |    128 |   2048 |    7.535 |    67.95 |   10.588 |    12.09 |
|   512 |    128 |   2560 |    7.537 |    67.93 |   11.006 |    11.63 |
|   512 |    128 |   3072 |    7.552 |    67.80 |   11.114 |    11.52 |
|   512 |    128 |   3584 |    7.563 |    67.70 |   11.234 |    11.39 |
|   512 |    128 |   4096 |    7.578 |    67.56 |   11.320 |    11.31 |
|   512 |    128 |   4608 |    7.600 |    67.37 |   11.389 |    11.24 |
|   512 |    128 |   5120 |    7.594 |    67.42 |   11.932 |    10.73 |
|   512 |    128 |   5632 |    7.595 |    67.41 |   11.827 |    10.82 |
|   512 |    128 |   6144 |    7.612 |    67.26 |   11.759 |    10.89 |
|   512 |    128 |   6656 |    7.626 |    67.14 |   11.961 |    10.70 |
|   512 |    128 |   7168 |    7.656 |    66.88 |   12.073 |    10.60 |
|   512 |    128 |   7680 |    7.660 |    66.84 |   12.190 |    10.50 |
|   512 |    128 |   8192 |    7.672 |    66.74 |   12.343 |    10.37 |
|   512 |    128 |   8704 |    7.682 |    66.65 |   12.790 |    10.01 |
|   512 |    128 |   9216 |    7.693 |    66.56 |   12.578 |    10.18 |
|   512 |    128 |   9728 |    7.712 |    66.39 |   12.825 |     9.98 |
|   512 |    128 |  10240 |    7.725 |    66.28 |   13.087 |     9.78 |
|   512 |    128 |  10752 |    7.736 |    66.18 |   13.017 |     9.83 |
|   512 |    128 |  11264 |    7.750 |    66.07 |   13.143 |     9.74 |
|   512 |    128 |  11776 |    7.750 |    66.07 |   13.249 |     9.66 |
|   512 |    128 |  12288 |    7.769 |    65.90 |   13.393 |     9.56 |
|   512 |    128 |  12800 |    7.773 |    65.87 |   13.537 |     9.46 |
|   512 |    128 |  13312 |    7.787 |    65.75 |   13.620 |     9.40 |
|   512 |    128 |  13824 |    7.805 |    65.60 |   13.697 |     9.34 |
|   512 |    128 |  14336 |    7.823 |    65.44 |   13.976 |     9.16 |
|   512 |    128 |  14848 |    7.825 |    65.43 |   14.067 |     9.10 |
|   512 |    128 |  15360 |    7.824 |    65.44 |   14.361 |     8.91 |
|   512 |    128 |  15872 |    7.838 |    65.32 |   14.393 |     8.89 |
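
As a sanity check on how to read the table above, here is a minimal Python sketch that parses sweep-bench rows and recomputes the speeds as tokens divided by seconds. The two rows are copied from the table (first and last); the helper name `parse_rows` and the column keys are illustrative, not part of either project.

```python
# First and last rows of the llama.cpp 2x GPU table above.
ROWS = """
|   512 |    128 |      0 |    7.567 |    67.67 |    7.486 |    17.10 |
|   512 |    128 |  15872 |    7.838 |    65.32 |   14.393 |     8.89 |
"""

# Hypothetical short names for the table columns.
KEYS = ("pp", "tg", "n_kv", "t_pp", "s_pp", "t_tg", "s_tg")


def parse_rows(text):
    """Turn markdown table rows into dicts of floats keyed by KEYS."""
    rows = []
    for line in text.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        rows.append(dict(zip(KEYS, map(float, cells))))
    return rows


rows = parse_rows(ROWS)

for r in rows:
    # Speed is simply tokens processed divided by elapsed seconds.
    pp_speed = r["pp"] / r["t_pp"]
    tg_speed = r["tg"] / r["t_tg"]
    print(f"N_KV={int(r['n_kv']):5d}  PP {pp_speed:6.2f} t/s  TG {tg_speed:5.2f} t/s")

first, last = rows[0], rows[-1]
# Fraction of the empty-cache speed that is retained at the deepest context.
pp_retained = last["s_pp"] / first["s_pp"]
tg_retained = last["s_tg"] / first["s_tg"]
print(f"PP retained: {pp_retained:.1%}, TG retained: {tg_retained:.1%}")
```

For this run, PP speed is nearly flat across the sweep (about 96.5% retained), while TG at N_KV = 15872 is roughly 52% of the empty-cache value.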

And these are the updated GPU graphs:

performance_comparison_pp_gpu

performance_comparison_tg_gpu


👤 ikawrakow replied the 2025-05-03 at 05:47:58:

Thank you for these results!

I think it would be better to disable BLAS for both. CPU prompt processing with ik_llama.cpp is likely faster without BLAS, and it is also more interesting to measure how well matrix multiplications are implemented in the two toolkits than to measure how well they call somebody else's GEMM implementation.

Prompt processing speed on CUDA will also benefit from larger u-batches (e.g., -ub 2048, if VRAM permits).

The CUDA TG results are somewhat surprising (sharp performance drop with context length for ik_llama.cpp, performance basically the same as CPU-only for long context, performance decreasing with more layers offloaded to a second GPU).

👤 AesSedai replied the 2025-05-03 at 06:07:58:
I just re-ran the above with 2x GPU for llama.cpp as well and edited the comment / graph. I was already re-running ik_llama.cpp w/o BLAS; I'll have those results shortly.

👤 AesSedai replied the 2025-05-03 at 06:19:29:
Posted!


👤 AesSedai replied the 2025-05-03 at 06:18:41:

Some more data, this time compiled w/ no BLAS:

cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1

Note: the ik_llama.cpp CPU NO BLAS run hit a CUDA error on the very last iteration.
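
The 2x GPU command in the logs below passes four `-ot` overrides to split the FFN tensors between CUDA0, CUDA1 and the CPU. The following Python sketch shows which block indices each pattern selects. It assumes the overrides are applied as a regex search on the tensor name with the first matching pattern winning; that matching rule is an assumption here, not something the logs state, and the `placement` helper is purely illustrative.

```python
import re

# Patterns copied verbatim from the -ot arguments of the 2x GPU command.
OVERRIDES = [
    (r"blk\.(0|1|2|3|4|5|6)\.ffn.*", "CUDA0"),
    (r"blk\.(7|8|9|10|11|12|13)\.ffn.*", "CUDA1"),
    (r"blk\.1[4-9]\.ffn.*", "CPU"),
    (r"blk\.[2-9][0-9]\.ffn.*", "CPU"),
]

N_LAYERS = 94  # qwen3moe.block_count from the model metadata


def placement(tensor_name):
    """Return the buffer of the first matching override, or None.

    Assumption: patterns are tried in order with regex *search* semantics.
    """
    for pattern, buffer in OVERRIDES:
        if re.search(pattern, tensor_name):
            return buffer
    return None  # no override: placement is left to -ngl


by_buffer = {}
for i in range(N_LAYERS):
    buffer = placement(f"blk.{i}.ffn_up_exps.weight")
    by_buffer.setdefault(buffer, []).append(i)

for buffer, blocks in by_buffer.items():
    print(f"{buffer}: blocks {blocks[0]}..{blocks[-1]} ({len(blocks)} blocks)")
```

Under that assumption, blocks 0 to 6 land on CUDA0 and blocks 7 to 13 on CUDA1, which agrees with the "buffer type overriden" lines in the log, and the FFN tensors of the remaining 80 blocks stay on the CPU. The `\.` after the block number is what keeps a pattern like `blk\.(0|1|...)` from also matching `blk.10`. Attention tensors match none of the patterns, so they follow `-ngl 99`.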

ik_llama.cpp CPU NO BLAS logs
./build/bin/llama-sweep-bench -m /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf -c 16384 -t 48 -fa -rtr -fmoe -ctk q8_0 -ctv q8_0
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 1131 tensors from /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 128x10B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   6:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   7:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv   8:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   9:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  10:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  11:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  12:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  16:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  17:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 142
llama_model_loader: - kv  19:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  20:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                      quantize.imatrix.file str              = /workspace/ubergarm/imatrix-Qwen3-235...
llama_model_loader: - kv  33:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  34:             quantize.imatrix.entries_count i32              = 753
llama_model_loader: - kv  35:              quantize.imatrix.chunks_count i32              = 225
llama_model_loader: - kv  36:                                   split.no u16              = 0
llama_model_loader: - kv  37:                                split.count u16              = 5
llama_model_loader: - kv  38:                        split.tensors.count i32              = 1131
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q8_0:   96 tensors
llama_model_loader: - type iq6_k:  564 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 94
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ6_K - 6.6 bpw
llm_load_print_meta: model params     = 235.094 B
llm_load_print_meta: model size       = 198.259 GiB (7.244 BPW) 
llm_load_print_meta: repeating layers = 197.028 GiB (7.237 BPW, 233.849 B parameters)
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 1536
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.50 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/95 layers to GPU
llm_load_tensors:  CUDA_Host buffer size = 203017.61 MiB
....................................................................................................
============ Repacked 95 tensors
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  1598.00 MiB
llama_new_context_with_model: KV self size  = 1598.00 MiB, K (q8_0):  799.00 MiB, V (q8_0):  799.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   161.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   304.75 MiB
llama_new_context_with_model: graph nodes  = 3672
llama_new_context_with_model: graph splits = 1319

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 48, n_threads_batch = 48

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    4.591 |   111.51 |   14.650 |     8.74 |
|   512 |    128 |    512 |    4.285 |   119.49 |   14.375 |     8.90 |
|   512 |    128 |   1024 |    4.340 |   117.99 |   15.437 |     8.29 |
|   512 |    128 |   1536 |    4.327 |   118.32 |   15.208 |     8.42 |
|   512 |    128 |   2048 |    4.380 |   116.90 |   14.995 |     8.54 |
|   512 |    128 |   2560 |    4.413 |   116.03 |   15.521 |     8.25 |
|   512 |    128 |   3072 |    4.424 |   115.72 |   16.063 |     7.97 |
|   512 |    128 |   3584 |    4.401 |   116.34 |   15.785 |     8.11 |
|   512 |    128 |   4096 |    4.481 |   114.25 |   16.065 |     7.97 |
|   512 |    128 |   4608 |    4.519 |   113.29 |   16.395 |     7.81 |
|   512 |    128 |   5120 |    4.475 |   114.41 |   15.560 |     8.23 |
|   512 |    128 |   5632 |    4.553 |   112.45 |   16.124 |     7.94 |
|   512 |    128 |   6144 |    4.552 |   112.47 |   16.174 |     7.91 |
|   512 |    128 |   6656 |    4.564 |   112.19 |   15.575 |     8.22 |
|   512 |    128 |   7168 |    4.611 |   111.05 |   16.761 |     7.64 |
|   512 |    128 |   7680 |    4.630 |   110.57 |   16.769 |     7.63 |
|   512 |    128 |   8192 |    4.610 |   111.06 |   16.260 |     7.87 |
|   512 |    128 |   8704 |    4.677 |   109.47 |   17.069 |     7.50 |
|   512 |    128 |   9216 |    4.675 |   109.52 |   17.445 |     7.34 |
|   512 |    128 |   9728 |    4.706 |   108.80 |   16.528 |     7.74 |
|   512 |    128 |  10240 |    4.735 |   108.12 |   17.745 |     7.21 |
|   512 |    128 |  10752 |    4.746 |   107.89 |   17.733 |     7.22 |
|   512 |    128 |  11264 |    4.790 |   106.90 |   16.780 |     7.63 |
|   512 |    128 |  11776 |    4.822 |   106.18 |   17.795 |     7.19 |
|   512 |    128 |  12288 |    4.883 |   104.85 |   18.182 |     7.04 |
|   512 |    128 |  12800 |    4.875 |   105.02 |   17.035 |     7.51 |
|   512 |    128 |  13312 |    4.886 |   104.80 |   18.353 |     6.97 |
|   512 |    128 |  13824 |    4.932 |   103.80 |   18.488 |     6.92 |
|   512 |    128 |  14336 |    4.960 |   103.23 |   17.388 |     7.36 |
|   512 |    128 |  14848 |    4.999 |   102.42 |   18.743 |     6.83 |
|   512 |    128 |  15360 |    4.931 |   103.83 |   18.383 |     6.96 |
CUDA error: invalid argument
  current device: 0, in function ggml_backend_cuda_buffer_set_tensor at /home/jarvis/development/ik_llama.cpp/ggml/src/ggml-cuda.cu:507
  cudaMemcpyAsync((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, ((cudaStream_t)0x2))
/home/jarvis/development/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
ik_llama.cpp 2x GPU NO BLAS logs
./build/bin/llama-sweep-bench -m /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf -c 16384 -t 48 -fa -rtr -fmoe -ctk q8_0 -ctv q8_0 -ngl 99 -ot "blk\.(0|1|2|3|4|5|6)\.ffn.*=CUDA0" -ot "blk\.(7|8|9|10|11|12|13)\.ffn.*=CUDA1" -ot "blk\.1[4-9]\.ffn.*=CPU" -ot "blk\.[2-9][0-9]\.ffn.*=CPU"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 1131 tensors from /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 128x10B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   6:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   7:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv   8:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   9:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  10:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  11:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  12:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  16:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  17:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 142
llama_model_loader: - kv  19:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  20:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                      quantize.imatrix.file str              = /workspace/ubergarm/imatrix-Qwen3-235...
llama_model_loader: - kv  33:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  34:             quantize.imatrix.entries_count i32              = 753
llama_model_loader: - kv  35:              quantize.imatrix.chunks_count i32              = 225
llama_model_loader: - kv  36:                                   split.no u16              = 0
llama_model_loader: - kv  37:                                split.count u16              = 5
llama_model_loader: - kv  38:                        split.tensors.count i32              = 1131
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q8_0:   96 tensors
llama_model_loader: - type iq6_k:  564 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 94
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ6_K - 6.6 bpw
llm_load_print_meta: model params     = 235.094 B
llm_load_print_meta: model size       = 198.259 GiB (7.244 BPW) 
llm_load_print_meta: repeating layers = 197.028 GiB (7.237 BPW, 233.849 B parameters)
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 1536
llm_load_tensors: ggml ctx size =    1.49 MiB
Tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.8.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.9.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_norm.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CUDA1
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CUDA1
Tensor blk.14.ffn_norm.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_norm.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_norm.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_norm.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_norm.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_norm.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_norm.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_norm.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_norm.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_norm.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_norm.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_norm.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_norm.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_norm.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_norm.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_norm.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_norm.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_norm.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_norm.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_norm.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_norm.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_norm.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_norm.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_norm.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_norm.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_norm.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_norm.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_norm.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_norm.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_norm.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_norm.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_norm.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_norm.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_norm.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_norm.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_norm.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_norm.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_norm.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_norm.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_norm.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_norm.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_norm.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_norm.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_norm.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_norm.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_norm.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_norm.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_norm.weight buffer type overriden to CPU
Tensor blk.61.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.61.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_norm.weight buffer type overriden to CPU
Tensor blk.62.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.62.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_norm.weight buffer type overriden to CPU
Tensor blk.63.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.63.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_norm.weight buffer type overriden to CPU
Tensor blk.64.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.64.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_norm.weight buffer type overriden to CPU
Tensor blk.65.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.65.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_norm.weight buffer type overriden to CPU
Tensor blk.66.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.66.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_norm.weight buffer type overriden to CPU
Tensor blk.67.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.67.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_norm.weight buffer type overriden to CPU
Tensor blk.68.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.68.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_norm.weight buffer type overriden to CPU
Tensor blk.69.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.69.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_norm.weight buffer type overriden to CPU
Tensor blk.70.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.70.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_norm.weight buffer type overriden to CPU
Tensor blk.71.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.71.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_norm.weight buffer type overriden to CPU
Tensor blk.72.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.72.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_norm.weight buffer type overriden to CPU
Tensor blk.73.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.73.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_norm.weight buffer type overriden to CPU
Tensor blk.74.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.74.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_norm.weight buffer type overriden to CPU
Tensor blk.75.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.75.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_norm.weight buffer type overriden to CPU
Tensor blk.76.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.76.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_norm.weight buffer type overriden to CPU
Tensor blk.77.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.77.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_norm.weight buffer type overriden to CPU
Tensor blk.78.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.78.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_norm.weight buffer type overriden to CPU
Tensor blk.79.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.79.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_norm.weight buffer type overriden to CPU
Tensor blk.80.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.80.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_norm.weight buffer type overriden to CPU
Tensor blk.81.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.81.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_norm.weight buffer type overriden to CPU
Tensor blk.82.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.82.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_norm.weight buffer type overriden to CPU
Tensor blk.83.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.83.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_norm.weight buffer type overriden to CPU
Tensor blk.84.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.84.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_norm.weight buffer type overriden to CPU
Tensor blk.85.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.85.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_norm.weight buffer type overriden to CPU
Tensor blk.86.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.86.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_norm.weight buffer type overriden to CPU
Tensor blk.87.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.87.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_norm.weight buffer type overriden to CPU
Tensor blk.88.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.88.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_norm.weight buffer type overriden to CPU
Tensor blk.89.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.89.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_norm.weight buffer type overriden to CPU
Tensor blk.90.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.90.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_norm.weight buffer type overriden to CPU
Tensor blk.91.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_norm.weight buffer type overriden to CPU
Tensor blk.92.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_norm.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors:        CPU buffer size = 167201.25 MiB
llm_load_tensors:  CUDA_Host buffer size =   630.59 MiB
llm_load_tensors:      CUDA0 buffer size = 17221.25 MiB
llm_load_tensors:      CUDA1 buffer size = 17964.52 MiB
....................................................................................................
============ Repacked 80 tensors
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   782.02 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   816.02 MiB
llama_new_context_with_model: KV self size  = 1598.00 MiB, K (q8_0):  799.00 MiB, V (q8_0):  799.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =   144.00 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   312.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   120.01 MiB
llama_new_context_with_model: graph nodes  = 3672
llama_new_context_with_model: graph splits = 336

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 48, n_threads_batch = 48

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    3.738 |   136.96 |   10.651 |    12.02 |
|   512 |    128 |    512 |    3.657 |   140.02 |   10.862 |    11.78 |
|   512 |    128 |   1024 |    3.710 |   138.00 |   11.029 |    11.61 |
|   512 |    128 |   1536 |    3.727 |   137.38 |   10.993 |    11.64 |
|   512 |    128 |   2048 |    3.727 |   137.39 |   11.148 |    11.48 |
|   512 |    128 |   2560 |    3.788 |   135.15 |   11.437 |    11.19 |
|   512 |    128 |   3072 |    3.798 |   134.81 |   11.629 |    11.01 |
|   512 |    128 |   3584 |    3.761 |   136.14 |   12.042 |    10.63 |
|   512 |    128 |   4096 |    3.822 |   133.96 |   11.777 |    10.87 |
|   512 |    128 |   4608 |    3.838 |   133.41 |   11.934 |    10.73 |
|   512 |    128 |   5120 |    3.878 |   132.04 |   12.227 |    10.47 |
|   512 |    128 |   5632 |    3.911 |   130.93 |   12.452 |    10.28 |
|   512 |    128 |   6144 |    3.902 |   131.23 |   12.550 |    10.20 |
|   512 |    128 |   6656 |    3.929 |   130.30 |   12.660 |    10.11 |
|   512 |    128 |   7168 |    3.968 |   129.02 |   12.961 |     9.88 |
|   512 |    128 |   7680 |    4.016 |   127.49 |   13.059 |     9.80 |
|   512 |    128 |   8192 |    3.993 |   128.21 |   13.391 |     9.56 |
|   512 |    128 |   8704 |    4.052 |   126.37 |   13.501 |     9.48 |
|   512 |    128 |   9216 |    4.061 |   126.06 |   13.644 |     9.38 |
|   512 |    128 |   9728 |    4.093 |   125.10 |   13.916 |     9.20 |
|   512 |    128 |  10240 |    4.140 |   123.69 |   14.154 |     9.04 |
|   512 |    128 |  10752 |    4.190 |   122.20 |   14.478 |     8.84 |
|   512 |    128 |  11264 |    4.198 |   121.97 |   14.657 |     8.73 |
|   512 |    128 |  11776 |    4.224 |   121.21 |   14.880 |     8.60 |
|   512 |    128 |  12288 |    4.242 |   120.70 |   15.020 |     8.52 |
|   512 |    128 |  12800 |    4.244 |   120.63 |   15.034 |     8.51 |
|   512 |    128 |  13312 |    4.258 |   120.25 |   15.194 |     8.42 |
|   512 |    128 |  13824 |    4.238 |   120.81 |   15.376 |     8.32 |
|   512 |    128 |  14336 |    4.268 |   119.96 |   15.554 |     8.23 |
|   512 |    128 |  14848 |    4.260 |   120.20 |   15.667 |     8.17 |
|   512 |    128 |  15360 |    4.285 |   119.49 |   15.961 |     8.02 |
|   512 |    128 |  15872 |    4.315 |   118.65 |   16.185 |     7.91 |

ik_llama.cpp BLAS vs NO BLAS PP comparison: performance_comparison_pp_gpu

ik_llama.cpp BLAS vs NO BLAS TG comparison: performance_comparison_tg_gpu

👤 ikawrakow replied the 2025-05-03 at 06:24:36:
Oh, for CPU-only inference you want to build without CUDA. The almighty ggml back-end scheduler, which is very difficult to work around, makes all sorts of funny decisions about where to run stuff when more than one back-end is enabled.

👤 AesSedai replied the 2025-05-03 at 06:25:03:
D'oh, okay. I can redo it :)


👤 AesSedai replied the 2025-05-03 at 07:04:12:

ik_llama.cpp, no cuda, no blas:

cmake -B build -DGGML_RPC=ON -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
ik_llama.cpp CPU NO CUDA NO BLAS logs
./build/bin/llama-sweep-bench -m /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf -c 16384 -t 48 -fa -rtr -fmoe -ctk q8_0 -ctv q8_0                        
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 1131 tensors from /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 128x10B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   6:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   7:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv   8:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   9:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  10:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  11:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  12:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  16:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  17:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 142
llama_model_loader: - kv  19:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  20:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                      quantize.imatrix.file str              = /workspace/ubergarm/imatrix-Qwen3-235...
llama_model_loader: - kv  33:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  34:             quantize.imatrix.entries_count i32              = 753
llama_model_loader: - kv  35:              quantize.imatrix.chunks_count i32              = 225
llama_model_loader: - kv  36:                                   split.no u16              = 0
llama_model_loader: - kv  37:                                split.count u16              = 5
llama_model_loader: - kv  38:                        split.tensors.count i32              = 1131
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q8_0:   96 tensors
llama_model_loader: - type iq6_k:  564 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 94
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ6_K - 6.6 bpw
llm_load_print_meta: model params     = 235.094 B
llm_load_print_meta: model size       = 198.259 GiB (7.244 BPW) 
llm_load_print_meta: repeating layers = 197.028 GiB (7.237 BPW, 233.849 B parameters)
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 1536
llm_load_tensors: ggml ctx size =    0.50 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/95 layers to GPU
llm_load_tensors:        CPU buffer size = 203017.61 MiB
....................................................................................................
============ Repacked 95 tensors
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1598.00 MiB
llama_new_context_with_model: KV self size  = 1598.00 MiB, K (q8_0):  799.00 MiB, V (q8_0):  799.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   304.75 MiB
llama_new_context_with_model: graph nodes  = 3672
llama_new_context_with_model: graph splits = 1

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 48, n_threads_batch = 48

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    5.810 |    88.13 |    8.608 |    14.87 |
|   512 |    128 |    512 |    5.915 |    86.57 |    8.561 |    14.95 |
|   512 |    128 |   1024 |    6.033 |    84.86 |    9.789 |    13.08 |
|   512 |    128 |   1536 |    6.295 |    81.34 |   13.626 |     9.39 |
|   512 |    128 |   2048 |    6.432 |    79.60 |   13.771 |     9.29 |
|   512 |    128 |   2560 |    6.527 |    78.44 |   14.327 |     8.93 |
|   512 |    128 |   3072 |    6.730 |    76.08 |   14.390 |     8.90 |
|   512 |    128 |   3584 |    6.829 |    74.98 |   13.901 |     9.21 |
|   512 |    128 |   4096 |    7.007 |    73.07 |   14.830 |     8.63 |
|   512 |    128 |   4608 |    7.170 |    71.41 |   14.868 |     8.61 |
|   512 |    128 |   5120 |    7.288 |    70.26 |   14.373 |     8.91 |
|   512 |    128 |   5632 |    7.450 |    68.73 |   15.489 |     8.26 |
|   512 |    128 |   6144 |    7.642 |    67.00 |   15.442 |     8.29 |
|   512 |    128 |   6656 |    7.808 |    65.57 |   14.509 |     8.82 |
|   512 |    128 |   7168 |    7.956 |    64.35 |   15.828 |     8.09 |
|   512 |    128 |   7680 |    8.136 |    62.93 |   15.789 |     8.11 |
|   512 |    128 |   8192 |    8.227 |    62.23 |   14.904 |     8.59 |
|   512 |    128 |   8704 |    8.396 |    60.98 |   15.952 |     8.02 |
|   512 |    128 |   9216 |    8.511 |    60.16 |   16.488 |     7.76 |
|   512 |    128 |   9728 |    8.664 |    59.10 |   15.098 |     8.48 |
|   512 |    128 |  10240 |    8.850 |    57.86 |   16.699 |     7.66 |
|   512 |    128 |  10752 |    8.979 |    57.02 |   16.578 |     7.72 |
|   512 |    128 |  11264 |    9.150 |    55.96 |   15.277 |     8.38 |
|   512 |    128 |  11776 |    9.339 |    54.82 |   16.711 |     7.66 |
|   512 |    128 |  12288 |    9.455 |    54.15 |   16.698 |     7.67 |
|   512 |    128 |  12800 |    9.627 |    53.18 |   15.516 |     8.25 |
|   512 |    128 |  13312 |    9.767 |    52.42 |   17.165 |     7.46 |
|   512 |    128 |  13824 |    9.944 |    51.49 |   17.486 |     7.32 |
|   512 |    128 |  14336 |   10.074 |    50.82 |   16.163 |     7.92 |
|   512 |    128 |  14848 |   10.183 |    50.28 |   17.452 |     7.33 |
|   512 |    128 |  15360 |   10.347 |    49.48 |   17.787 |     7.20 |
|   512 |    128 |  15872 |   10.553 |    48.52 |   16.355 |     7.83 |

performance_comparison_pp

performance_comparison_tg


👤 ikawrakow replied the 2025-05-03 at 07:21:24:

Thanks!

So, CPU PP is much better now and more in line with what I would have expected. Looking at the TG graph, it is clear that I still need to work on improving how the work is divided between the threads. The Qwen3 MoE models have a high GQA factor, so one should be able to achieve ~70-80% of zero-context performance at 16k tokens.

But I see that the Epyc 9355 has 32 cores, so we are using hyper-threading?
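To make the GQA point concrete, here is a minimal sketch that recomputes the K-cache size reported in the log above from the model metadata (94 layers, 4 KV heads, head dimension 128, 16384 context). It relies on the q8_0 layout of 32 int8 values plus one fp16 scale per block (34 bytes per 32 elements). The "without GQA" line is a hypothetical comparison that assumes one KV head per query head; the function name is illustrative only.

```python
# q8_0 stores 32 int8 values plus one fp16 scale per block: 34 bytes per 32 elements.
Q8_0_BLOCK_BYTES = 34
Q8_0_BLOCK_ELEMS = 32


def kv_tensor_mib(n_layer, n_kv_heads, head_dim, n_ctx):
    """Size in MiB of the K cache alone (V is the same size), stored as q8_0."""
    elems = n_layer * n_kv_heads * head_dim * n_ctx
    return elems * Q8_0_BLOCK_BYTES / Q8_0_BLOCK_ELEMS / 2**20


# Values taken from the log: n_layer=94, n_head_kv=4, n_head=64, head dim 128, n_ctx=16384.
k_gqa = kv_tensor_mib(94, 4, 128, 16384)
# Hypothetical: what the cache would be with one KV head per query head (no GQA).
k_mha = kv_tensor_mib(94, 64, 128, 16384)

print(f"K cache with GQA:    {k_gqa:.2f} MiB")  # matches "K (q8_0): 799.00 MiB" in the log
print(f"K cache without GQA: {k_mha:.2f} MiB ({k_mha / k_gqa:.0f}x larger)")
```

With n_gqa = 16, each token generated has to stream only a sixteenth of the cache that a non-GQA model of the same shape would need, which is why a smaller TG drop at long context looks achievable.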

👤 AesSedai replied the 2025-05-03 at 07:23:30:
That's good news!

Yes, this is with hyperthreading. Out of the 64 threads on the system, 56 are passed through to the virtual machine and I have it configured to use 48 of those during the sweep.

Is there a particular -t count (or thread passthrough count) you would like me to try?

👤 ikawrakow replied the 2025-05-03 at 07:27:14:
On bare metal one achieves the best performance by setting the number of threads to the physical core count. But I have no idea how a VM will behave. You can try `-t 32`, but that would only be better if you get 32 cores involved, and not e.g. 16 cores with 2 threads per core.
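One way to check whether `-t 32` would land on 32 distinct cores is to compare logical CPUs against physical cores as the guest sees them. The sketch below counts distinct `thread_siblings_list` entries from Linux sysfs; the function names are illustrative, and inside a VM the result only reflects the topology the hypervisor chooses to expose, which may differ from the host.

```python
import glob

SYSFS_PATTERN = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"


def count_physical_cores(sibling_lists):
    """Count distinct cores: logical CPUs on the same core share one sibling list."""
    return len({s.strip() for s in sibling_lists})


def read_sibling_lists(pattern=SYSFS_PATTERN):
    """Read the sibling list of every logical CPU that exposes topology info (Linux only)."""
    lists = []
    for path in glob.glob(pattern):
        with open(path) as f:
            lists.append(f.read())
    return lists


if __name__ == "__main__":
    lists = read_sibling_lists()
    print(f"logical CPUs: {len(lists)}, physical cores: {count_physical_cores(lists)}")
```

If the two numbers are equal, every logical CPU is its own core from the guest's point of view; if the core count is half the logical count, two threads per core are exposed.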

👤 AesSedai replied the 2025-05-03 at 07:58:15:
Yes, I think there's roughly a 10% performance loss because it's in a VM. The system is a hypervisor, though, and is used for other homelab things, so I'm fine taking that loss. For reference, I was able to run likwid-bench inside the VM before and achieve ~500GB/s memory bandwidth; the theoretical maximum is ~576GB/s.
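As a rough sanity check, the measured bandwidth can be turned into a ceiling for TG speed. This is only a sketch under stated assumptions: that token generation is limited by how fast the active weights can be read from memory, that about 22B parameters are active per token (taken from the "A22B" in the model name), and that the file-wide average of 7.244 BPW from the log applies to those weights. KV-cache reads and NUMA effects are ignored.

```python
def tg_upper_bound(bandwidth_gb_s, active_params_b, bits_per_weight):
    """Tokens/s ceiling if every active weight is read from RAM once per token."""
    gb_per_token = active_params_b * bits_per_weight / 8  # billions of params -> GB
    return bandwidth_gb_s / gb_per_token


measured_gb_s = 500.0     # likwid-bench result quoted above
theoretical_gb_s = 576.0  # theoretical maximum quoted above

efficiency = measured_gb_s / theoretical_gb_s
# ASSUMPTIONS: 22B active parameters per token, 7.244 bits per weight on average.
bound = tg_upper_bound(measured_gb_s, 22.0, 7.244)

print(f"bandwidth efficiency inside the VM: {efficiency:.1%}")
print(f"bandwidth-bound TG ceiling: {bound:.1f} t/s")
```

Under these assumptions the ceiling comes out at roughly 25 t/s, so the 14.87 t/s measured at zero context in the table above would be around 60% of it.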

For completeness' sake, I've disabled SMT on the host:

echo off > /sys/devices/system/cpu/smt/control

and verified that btop shows a core count of 32. I then re-launched the inference VM with all 32 cores assigned to it.

I also turned off the other three VMs on the system, so this is the "max performance" configuration that I can achieve without moving everything from the VM to the hypervisor host directly.

ik_llama.cpp CPU NO CUDA NO BLAS NO SMT logs
./build/bin/llama-sweep-bench -m /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf -c 16384 -t 32 -fa -rtr -fmoe -ctk q8_0 -ctv q8_0
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 39 key-value pairs and 1131 tensors from /mnt/srv/slush/gguf/Qwen3-235B-A22B-GGUF-ik-llama/Qwen3-235B-A22B-mix-IQ6_K-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models
llama_model_loader: - kv   3:                         general.size_label str              = 128x10B
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv   6:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv   7:                       qwen3moe.block_count u32              = 94
llama_model_loader: - kv   8:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   9:                  qwen3moe.embedding_length u32              = 4096
llama_model_loader: - kv  10:               qwen3moe.feed_forward_length u32              = 12288
llama_model_loader: - kv  11:              qwen3moe.attention.head_count u32              = 64
llama_model_loader: - kv  12:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  16:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  17:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  18:                          general.file_type u32              = 142
llama_model_loader: - kv  19:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  20:        qwen3moe.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                      quantize.imatrix.file str              = /workspace/ubergarm/imatrix-Qwen3-235...
llama_model_loader: - kv  33:                   quantize.imatrix.dataset str              = calibration_data_v5_rc.txt
llama_model_loader: - kv  34:             quantize.imatrix.entries_count i32              = 753
llama_model_loader: - kv  35:              quantize.imatrix.chunks_count i32              = 225
llama_model_loader: - kv  36:                                   split.no u16              = 0
llama_model_loader: - kv  37:                                split.count u16              = 5
llama_model_loader: - kv  38:                        split.tensors.count i32              = 1131
llama_model_loader: - type  f32:  471 tensors
llama_model_loader: - type q8_0:   96 tensors
llama_model_loader: - type iq6_k:  564 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 94
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = IQ6_K - 6.6 bpw
llm_load_print_meta: model params     = 235.094 B
llm_load_print_meta: model size       = 198.259 GiB (7.244 BPW) 
llm_load_print_meta: repeating layers = 197.028 GiB (7.237 BPW, 233.849 B parameters)
llm_load_print_meta: general.name     = Models
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 1536
llm_load_tensors: ggml ctx size =    0.50 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/95 layers to GPU
llm_load_tensors:        CPU buffer size = 203017.61 MiB
....................................................................................................
============ Repacked 95 tensors
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  1598.00 MiB
llama_new_context_with_model: KV self size  = 1598.00 MiB, K (q8_0):  799.00 MiB, V (q8_0):  799.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   304.75 MiB
llama_new_context_with_model: graph nodes  = 3672
llama_new_context_with_model: graph splits = 1

main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 32, n_threads_batch = 32

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|   512 |    128 |      0 |    4.359 |   117.46 |    7.881 |    16.24 |
|   512 |    128 |    512 |    4.373 |   117.09 |    9.647 |    13.27 |
|   512 |    128 |   1024 |    4.610 |   111.06 |    8.834 |    14.49 |
|   512 |    128 |   1536 |    4.641 |   110.33 |   10.037 |    12.75 |
|   512 |    128 |   2048 |    4.876 |   105.01 |   13.032 |     9.82 |
|   512 |    128 |   2560 |    4.976 |   102.89 |   14.779 |     8.66 |
|   512 |    128 |   3072 |    5.062 |   101.14 |   13.366 |     9.58 |
|   512 |    128 |   3584 |    5.151 |    99.40 |   13.447 |     9.52 |
|   512 |    128 |   4096 |    5.239 |    97.74 |   15.546 |     8.23 |
|   512 |    128 |   4608 |    5.356 |    95.60 |   13.975 |     9.16 |
|   512 |    128 |   5120 |    5.455 |    93.87 |   13.837 |     9.25 |
|   512 |    128 |   5632 |    5.543 |    92.37 |   15.700 |     8.15 |
|   512 |    128 |   6144 |    5.766 |    88.80 |   14.063 |     9.10 |
|   512 |    128 |   6656 |    5.768 |    88.77 |   14.064 |     9.10 |
|   512 |    128 |   7168 |    5.923 |    86.44 |   16.084 |     7.96 |
|   512 |    128 |   7680 |    6.006 |    85.25 |   14.581 |     8.78 |
|   512 |    128 |   8192 |    6.145 |    83.32 |   14.257 |     8.98 |
|   512 |    128 |   8704 |    6.153 |    83.22 |   16.262 |     7.87 |
|   512 |    128 |   9216 |    6.258 |    81.82 |   15.010 |     8.53 |
|   512 |    128 |   9728 |    6.395 |    80.06 |   14.768 |     8.67 |
|   512 |    128 |  10240 |    6.575 |    77.87 |   16.422 |     7.79 |
|   512 |    128 |  10752 |    6.695 |    76.48 |   14.911 |     8.58 |
|   512 |    128 |  11264 |    6.817 |    75.11 |   14.985 |     8.54 |
|   512 |    128 |  11776 |    6.784 |    75.48 |   16.958 |     7.55 |
|   512 |    128 |  12288 |    6.937 |    73.81 |   15.634 |     8.19 |
|   512 |    128 |  12800 |    7.936 |    64.52 |   15.454 |     8.28 |
|   512 |    128 |  13312 |    7.044 |    72.69 |   15.884 |     8.06 |
|   512 |    128 |  13824 |    7.244 |    70.68 |   15.820 |     8.09 |
|   512 |    128 |  14336 |    7.365 |    69.52 |   17.121 |     7.48 |
|   512 |    128 |  14848 |    7.635 |    67.06 |   16.042 |     7.98 |
|   512 |    128 |  15360 |    7.601 |    67.36 |   15.867 |     8.07 |
|   512 |    128 |  15872 |    7.696 |    66.53 |   17.578 |     7.28 |

performance_comparison_pp

performance_comparison_tg

👤 ikawrakow replied the 2025-05-03 at 08:08:26:
So, ~30% better for PP, but not much difference for TG. I need to understand the cause of the sharp drop in TG performance for the first ~2k tokens. I'll investigate.

Thanks a lot for these benchmarks!
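The figures quoted here can be reproduced from the first and last rows of the two CPU-only tables above (48 threads with SMT, 32 threads without). The snippet below is a small sketch that computes the PP and TG change between the two runs and how much of the zero-context TG speed each run keeps at N_KV = 15872; the dictionary layout and function names are illustrative.

```python
# First (N_KV = 0) and last (N_KV = 15872) rows of the two CPU-only tables, in t/s.
runs = {
    "48 threads, SMT on": {"pp": (88.13, 48.52), "tg": (14.87, 7.83)},
    "32 threads, SMT off": {"pp": (117.46, 66.53), "tg": (16.24, 7.28)},
}


def change_pct(new, old):
    """Relative change of new versus old, in percent."""
    return (new / old - 1.0) * 100.0


def tg_retention_pct(run):
    """Share of zero-context TG speed that is left at N_KV = 15872, in percent."""
    first, last = run["tg"]
    return last / first * 100.0


smt, no_smt = runs["48 threads, SMT on"], runs["32 threads, SMT off"]
for metric in ("pp", "tg"):
    for idx, n_kv in enumerate((0, 15872)):
        delta = change_pct(no_smt[metric][idx], smt[metric][idx])
        print(f"{metric.upper()} at N_KV={n_kv}: {delta:+.1f}% without SMT")

for name, run in runs.items():
    print(f"{name}: TG retains {tg_retention_pct(run):.1f}% of zero-context speed")
```

PP improves by roughly a third at both ends of the sweep, while TG gains a little at zero context and loses a little at the end. Both runs keep only about half of their zero-context TG speed at ~16k tokens, below the ~70-80% mentioned earlier in the thread.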

👤 AesSedai replied the 2025-05-03 at 08:10:21:
You're welcome, let me know if you want me to re-run any of these benchmarks at some point in the future and I can pull / rebuild / re-test. Excited to see what shakes out!

👤 VinnyG9 replied the 2025-05-19 at 14:08:09:

> That's good news!
>
> Yes, this is with hyperthreading. Out of the 64 threads on the system, 56 are passed through to the virtual machine and I have it configured to use 48 of those during the sweep.
>
> Is there a particular -t count (or thread passthrough count) you would like me to try?

Hey, no idea what hypervisor you're running, but unless it's ESXi, chances are it can't handle AMD multi-CCD chips, or NUMA systems in general.

Unless you pin ONE CCD to your VM, performance is likely taking a hit.
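For anyone wanting to try the pinning suggestion, a first step is finding out which logical CPUs belong together. The sketch below groups CPUs by the L3 cache they share, using the kernel's CPU-list format (e.g. `0-3,32-35`). Treating one L3 sharing domain as one CCD is an assumption that should be checked against the actual CPU, the function names are illustrative, and the result is only meaningful on the host, since a guest sees whatever topology the hypervisor presents.

```python
import glob

# Linux sysfs path; index3 is usually the L3 cache, but that is not guaranteed.
L3_PATTERN = "/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,32-35' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_l3(shared_lists):
    """Return the distinct L3 sharing domains, each as a sorted list of logical CPUs."""
    groups = {tuple(parse_cpu_list(text)) for text in shared_lists}
    return sorted(list(g) for g in groups if g)


if __name__ == "__main__":
    lists = []
    for path in glob.glob(L3_PATTERN):
        with open(path) as f:
            lists.append(f.read())
    for i, cpus in enumerate(group_by_l3(lists)):
        print(f"L3 domain {i}: CPUs {cpus}")
```

The resulting CPU sets can then be fed to whatever pinning mechanism the hypervisor offers, or to `os.sched_setaffinity(0, cpus)` for a quick bare-metal experiment on Linux.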