🗣️ #334 - iq4_ks performs great on gemma-3-27b-it-qat-q4_0-unquantized
| Author | ubergarm |
|---|---|
| Created | 2025-04-18 |
| Updated | 2025-07-07 |
Description
EDIT: Just uploaded the ik_llama.cpp-exclusive quants for best quality in minimum VRAM to Hugging Face: ubergarm/gemma-3-27b-it-qat-GGUF.
I saw Google released their google/gemma-3-27b-it-qat-q4_0-unquantized original unquantized .safetensors model. It is supposedly designed for q4_0 quantization, which was released earlier in GGUF format.
Thanks to Quantization Aware Training (QAT), the model is able to preserve similar quality as bfloat16 while significantly reducing the memory requirements to load the model.
I used mainline llama.cpp to convert the .safetensors to bf16 and then used ik_llama.cpp to cook some quants to compare size and perplexity. Interestingly, the results suggest iq4_ks has lower perplexity than the original bf16 (the q8_0 does too)!
Raw Data
Perplexity
## google/gemma-3-27b-it-BF16-00001-of-00002.gguf
50.311 GiB (16.001 BPW)
f32: 373 tensors
bf16: 435 tensors
Final estimate: PPL = 8.4276 +/- 0.06705
## google/gemma-3-27b-it-qat-q4_0-unquantized-BF16-00001-of-00002.gguf
50.311 GiB (16.001 BPW)
f32: 373 tensors
bf16: 435 tensors
Final estimate: PPL = 8.2021 +/- 0.06387
## google/gemma-3-27b-it-qat-q4_0-gguf/gemma-3-27b-it-q4_0.gguf
16.040 GiB (5.101 BPW)
f32: 373 tensors
f16: 1 tensors
q4_0: 434 tensors
Final estimate: PPL = 8.2500 +/- 0.06375
## ubergarm/gemma-3-27B-it-qat-q8_0.gguf
26.730 GiB (8.501 BPW)
f32: 373 tensors
q8_0: 435 tensors
Final estimate: PPL = 8.1890 +/- 0.06369
## ubergarm/gemma-3-27B-it-qat-q4_0.gguf
14.857 GiB (4.725 BPW)
f32: 373 tensors
q4_0: 427 tensors
q4_1: 7 tensors (blk.[0-6].ffn_down.weight; not sure why this happened?)
q8_0: 1 tensors (token_embd.weight)
Final estimate: PPL = 8.2264 +/- 0.06350
## ubergarm/gemma-3-27B-it-qat-pure-q4_0.gguf
14.810 GiB (4.710 BPW)
f32: 373 tensors
q4_0: 434 tensors
q8_0: 1 tensors
Final estimate: PPL = 8.2235 +/- 0.06345
## ubergarm/gemma-3-27B-it-qat-iq4_xs.gguf
14.085 GiB (4.480 BPW)
f32: 373 tensors
q4_0: 62 tensors (blk.*.attn_v.weight can't be iq4_xs due to tensor size)
q8_0: 1 tensors
iq4_xs: 372 tensors
Final estimate: PPL = 8.2290 +/- 0.06365
## ubergarm/gemma-3-27B-it-qat-iq4_ks.gguf (ik_llama.cpp exclusive quant)
14.099 GiB (4.484 BPW)
f32: 373 tensors
q4_0: 62 tensors (blk.*.attn_v.weight)
q8_0: 1 tensors
iq4_ks: 372 tensors
Final estimate: PPL = 8.1755 +/- 0.06296
## ubergarm/gemma-3-27B-it-qat-COSSIM-iq3_k.gguf (ik_llama.cpp exclusive quants)
12.875 GiB (4.095 BPW)
f32: 373 tensors
q4_0: 62 tensors (blk.*.attn_v.weight can't be iq3_k/iq4_ks due to tensor size)
q8_0: 1 tensors
iq3_k: 192 tensors
iq4_ks: 180 tensors (most important 30 layers by cosine similarity scores)
Final estimate: PPL = 8.2642 +/- 0.06359
## ubergarm/gemma-3-27B-it-qat-mix-iq3_k.gguf (ik_llama.cpp exclusive quant)
12.733 GiB (4.050 BPW)
f32: 373 tensors
q4_0: 62 tensors blk.*.attn_v.weight
q8_0: 1 tensors
iq3_k: 124 tensors ffn_(gate|up).weight
iq4_ks: 248 tensors ffn_down.weight
Final estimate: PPL = 8.2367 +/- 0.06329
## ubergarm/gemma-3-27B-it-qat-iq4_nl.gguf
14.810 GiB (4.710 BPW)
f32: 373 tensors
q4_0: 62 tensors
q8_0: 1 tensors
iq4_nl: 372 tensors
Final estimate: PPL = 8.2477 +/- 0.06390
## ubergarm/gemma-3-27B-it-qat-q4_k_m.gguf
14.810 GiB (4.710 BPW)
f32: 373 tensors
q4_0: 62 tensors
q8_0: 1 tensors
q4_K: 372 tensors
Final estimate: PPL = 8.2303 +/- 0.06364
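The reported PPL values come with standard errors, so the headline claim (iq4_ks beating the QAT bf16) can be eyeballed with a rough significance check. A minimal sketch, assuming the two errors are independent; note that PPL errors measured on the same test set are strongly correlated, so this naive z-score if anything overstates the uncertainty of the difference:

```python
import math

# (PPL, standard error) pairs copied from the runs above
bf16_qat = (8.2021, 0.06387)   # qat-q4_0-unquantized bf16
iq4_ks   = (8.1755, 0.06296)   # ik_llama.cpp-exclusive iq4_ks

def naive_z(a, b):
    """z-score of the PPL difference, treating the two errors as independent."""
    return (a[0] - b[0]) / math.hypot(a[1], b[1])

z = naive_z(bf16_qat, iq4_ks)
print(f"z = {z:.2f}")  # roughly 0.3: well inside the error bars
```

So by this naive measure the gap is not statistically significant on its own, though the consistent ordering across quants is still suggestive.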
Sweep Bench
## gemma-3-27B-it-qat-COSSIM-iq3_k.gguf
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.430 | 1191.53 | 3.764 | 34.01 |
| 512 | 128 | 512 | 0.406 | 1262.23 | 3.844 | 33.30 |
| 512 | 128 | 1024 | 0.415 | 1232.50 | 3.930 | 32.57 |
| 512 | 128 | 1536 | 0.426 | 1203.18 | 4.016 | 31.87 |
| 512 | 128 | 2048 | 0.436 | 1175.36 | 4.100 | 31.22 |
| 512 | 128 | 2560 | 0.445 | 1150.88 | 4.176 | 30.65 |
| 512 | 128 | 3072 | 0.455 | 1124.73 | 4.258 | 30.06 |
| 512 | 128 | 3584 | 0.465 | 1101.76 | 4.342 | 29.48 |
| 512 | 128 | 4096 | 0.474 | 1080.41 | 4.420 | 28.96 |
| 512 | 128 | 4608 | 0.483 | 1061.01 | 4.497 | 28.46 |
| 512 | 128 | 5120 | 0.493 | 1037.79 | 4.571 | 28.00 |
| 512 | 128 | 5632 | 0.502 | 1020.61 | 4.649 | 27.53 |
| 512 | 128 | 6144 | 0.511 | 1002.09 | 4.736 | 27.03 |
| 512 | 128 | 6656 | 0.521 | 983.37 | 4.810 | 26.61 |
| 512 | 128 | 7168 | 0.529 | 968.01 | 4.878 | 26.24 |
| 512 | 128 | 7680 | 0.537 | 952.66 | 4.948 | 25.87 |
| 512 | 128 | 8192 | 0.547 | 935.93 | 5.018 | 25.51 |
| 512 | 128 | 8704 | 0.555 | 922.51 | 5.088 | 25.15 |
| 512 | 128 | 9216 | 0.563 | 908.66 | 5.160 | 24.81 |
| 512 | 128 | 9728 | 0.573 | 894.03 | 5.236 | 24.45 |
| 512 | 128 | 10240 | 0.582 | 880.17 | 5.299 | 24.16 |
| 512 | 128 | 10752 | 0.591 | 866.73 | 5.372 | 23.83 |
| 512 | 128 | 11264 | 0.597 | 857.09 | 5.433 | 23.56 |
| 512 | 128 | 11776 | 0.608 | 842.77 | 5.502 | 23.26 |
| 512 | 128 | 12288 | 0.617 | 830.05 | 5.573 | 22.97 |
| 512 | 128 | 12800 | 0.625 | 819.12 | 5.637 | 22.71 |
| 512 | 128 | 13312 | 0.635 | 805.99 | 5.703 | 22.44 |
| 512 | 128 | 13824 | 0.642 | 796.94 | 5.768 | 22.19 |
| 512 | 128 | 14336 | 0.649 | 788.86 | 5.834 | 21.94 |
| 512 | 128 | 14848 | 0.657 | 778.90 | 5.896 | 21.71 |
| 512 | 128 | 15360 | 0.667 | 768.12 | 5.958 | 21.49 |
| 512 | 128 | 15872 | 0.673 | 760.21 | 6.019 | 21.27 |
| 512 | 128 | 16384 | 0.682 | 750.51 | 6.077 | 21.06 |
| 512 | 128 | 16896 | 0.690 | 742.39 | 6.139 | 20.85 |
| 512 | 128 | 17408 | 0.698 | 733.09 | 6.205 | 20.63 |
| 512 | 128 | 17920 | 0.707 | 724.07 | 6.274 | 20.40 |
| 512 | 128 | 18432 | 0.715 | 716.34 | 6.333 | 20.21 |
| 512 | 128 | 18944 | 0.723 | 708.48 | 6.391 | 20.03 |
| 512 | 128 | 19456 | 0.732 | 699.61 | 6.457 | 19.82 |
| 512 | 128 | 19968 | 0.738 | 693.42 | 6.524 | 19.62 |
| 512 | 128 | 20480 | 0.748 | 684.36 | 6.584 | 19.44 |
| 512 | 128 | 20992 | 0.756 | 677.00 | 6.650 | 19.25 |
| 512 | 128 | 21504 | 0.764 | 670.11 | 6.718 | 19.05 |
| 512 | 128 | 22016 | 0.773 | 662.68 | 6.787 | 18.86 |
| 512 | 128 | 22528 | 0.782 | 654.88 | 6.857 | 18.67 |
| 512 | 128 | 23040 | 0.789 | 648.73 | 6.922 | 18.49 |
| 512 | 128 | 23552 | 0.799 | 641.01 | 6.993 | 18.30 |
| 512 | 128 | 24064 | 0.809 | 632.98 | 7.063 | 18.12 |
| 512 | 128 | 24576 | 0.817 | 626.98 | 7.129 | 17.96 |
| 512 | 128 | 25088 | 0.828 | 618.46 | 7.204 | 17.77 |
| 512 | 128 | 25600 | 0.837 | 612.03 | 7.272 | 17.60 |
| 512 | 128 | 26112 | 0.845 | 605.89 | 7.345 | 17.43 |
| 512 | 128 | 26624 | 0.854 | 599.28 | 7.422 | 17.25 |
| 512 | 128 | 27136 | 0.863 | 593.26 | 7.490 | 17.09 |
| 512 | 128 | 27648 | 0.872 | 587.02 | 7.562 | 16.93 |
| 512 | 128 | 28160 | 0.881 | 581.17 | 7.640 | 16.75 |
| 512 | 128 | 28672 | 0.889 | 575.97 | 7.707 | 16.61 |
| 512 | 128 | 29184 | 0.899 | 569.59 | 7.783 | 16.45 |
| 512 | 128 | 29696 | 0.907 | 564.44 | 7.848 | 16.31 |
| 512 | 128 | 30208 | 0.916 | 558.91 | 7.925 | 16.15 |
| 512 | 128 | 30720 | 0.928 | 551.99 | 8.001 | 16.00 |
| 512 | 128 | 31232 | 0.938 | 545.95 | 8.070 | 15.86 |
| 512 | 128 | 31744 | 0.946 | 541.38 | 8.139 | 15.73 |
| 512 | 128 | 32256 | 0.955 | 536.31 | 8.215 | 15.58 |
## gemma-3-27B-it-qat-iq4_ks.gguf
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.442 | 1157.61 | 3.846 | 33.29 |
| 512 | 128 | 512 | 0.420 | 1219.61 | 3.918 | 32.67 |
| 512 | 128 | 1024 | 0.429 | 1192.64 | 3.988 | 32.09 |
| 512 | 128 | 1536 | 0.437 | 1171.22 | 4.058 | 31.54 |
| 512 | 128 | 2048 | 0.445 | 1150.45 | 4.115 | 31.10 |
| 512 | 128 | 2560 | 0.454 | 1126.95 | 4.185 | 30.58 |
| 512 | 128 | 3072 | 0.463 | 1106.45 | 4.251 | 30.11 |
| 512 | 128 | 3584 | 0.471 | 1087.13 | 4.316 | 29.66 |
| 512 | 128 | 4096 | 0.480 | 1067.46 | 4.383 | 29.20 |
| 512 | 128 | 4608 | 0.488 | 1048.69 | 4.452 | 28.75 |
| 512 | 128 | 5120 | 0.496 | 1031.55 | 4.517 | 28.34 |
| 512 | 128 | 5632 | 0.504 | 1016.37 | 4.587 | 27.90 |
| 512 | 128 | 6144 | 0.513 | 997.99 | 4.653 | 27.51 |
| 512 | 128 | 6656 | 0.520 | 983.90 | 4.718 | 27.13 |
| 512 | 128 | 7168 | 0.529 | 968.10 | 4.788 | 26.73 |
| 512 | 128 | 7680 | 0.537 | 954.18 | 4.856 | 26.36 |
| 512 | 128 | 8192 | 0.548 | 934.03 | 4.916 | 26.04 |
| 512 | 128 | 8704 | 0.554 | 923.73 | 4.982 | 25.69 |
| 512 | 128 | 9216 | 0.562 | 910.81 | 5.050 | 25.35 |
| 512 | 128 | 9728 | 0.571 | 896.48 | 5.118 | 25.01 |
| 512 | 128 | 10240 | 0.582 | 880.31 | 5.187 | 24.68 |
| 512 | 128 | 10752 | 0.589 | 869.67 | 5.252 | 24.37 |
| 512 | 128 | 11264 | 0.597 | 857.77 | 5.320 | 24.06 |
| 512 | 128 | 11776 | 0.606 | 844.62 | 5.386 | 23.77 |
| 512 | 128 | 12288 | 0.615 | 832.15 | 5.453 | 23.47 |
| 512 | 128 | 12800 | 0.622 | 823.56 | 5.519 | 23.19 |
| 512 | 128 | 13312 | 0.631 | 811.71 | 5.587 | 22.91 |
| 512 | 128 | 13824 | 0.639 | 801.76 | 5.656 | 22.63 |
| 512 | 128 | 14336 | 0.648 | 790.40 | 5.721 | 22.37 |
| 512 | 128 | 14848 | 0.656 | 779.95 | 5.788 | 22.11 |
| 512 | 128 | 15360 | 0.664 | 771.19 | 5.853 | 21.87 |
| 512 | 128 | 15872 | 0.674 | 759.82 | 5.927 | 21.60 |
| 512 | 128 | 16384 | 0.683 | 749.28 | 5.991 | 21.37 |
| 512 | 128 | 16896 | 0.691 | 740.99 | 6.056 | 21.14 |
| 512 | 128 | 17408 | 0.700 | 731.77 | 6.125 | 20.90 |
| 512 | 128 | 17920 | 0.708 | 722.66 | 6.196 | 20.66 |
| 512 | 128 | 18432 | 0.717 | 714.30 | 6.266 | 20.43 |
| 512 | 128 | 18944 | 0.726 | 705.03 | 6.337 | 20.20 |
| 512 | 128 | 19456 | 0.735 | 696.94 | 6.405 | 19.98 |
| 512 | 128 | 19968 | 0.743 | 688.78 | 6.478 | 19.76 |
| 512 | 128 | 20480 | 0.752 | 681.04 | 6.549 | 19.54 |
| 512 | 128 | 20992 | 0.762 | 672.17 | 6.619 | 19.34 |
| 512 | 128 | 21504 | 0.770 | 664.73 | 6.690 | 19.13 |
| 512 | 128 | 22016 | 0.778 | 658.00 | 6.760 | 18.93 |
| 512 | 128 | 22528 | 0.787 | 650.37 | 6.825 | 18.75 |
| 512 | 128 | 23040 | 0.796 | 643.02 | 6.893 | 18.57 |
| 512 | 128 | 23552 | 0.804 | 636.63 | 6.959 | 18.39 |
| 512 | 128 | 24064 | 0.813 | 629.68 | 7.033 | 18.20 |
| 512 | 128 | 24576 | 0.822 | 622.86 | 7.096 | 18.04 |
| 512 | 128 | 25088 | 0.830 | 616.54 | 7.164 | 17.87 |
| 512 | 128 | 25600 | 0.839 | 610.32 | 7.235 | 17.69 |
| 512 | 128 | 26112 | 0.847 | 604.35 | 7.300 | 17.53 |
| 512 | 128 | 26624 | 0.856 | 597.85 | 7.366 | 17.38 |
| 512 | 128 | 27136 | 0.865 | 591.85 | 7.434 | 17.22 |
| 512 | 128 | 27648 | 0.874 | 585.75 | 7.500 | 17.07 |
| 512 | 128 | 28160 | 0.881 | 581.29 | 7.566 | 16.92 |
| 512 | 128 | 28672 | 0.890 | 575.07 | 7.640 | 16.75 |
| 512 | 128 | 29184 | 0.899 | 569.51 | 7.695 | 16.63 |
| 512 | 128 | 29696 | 0.910 | 562.68 | 7.767 | 16.48 |
| 512 | 128 | 30208 | 0.917 | 558.17 | 7.834 | 16.34 |
| 512 | 128 | 30720 | 0.928 | 551.45 | 7.895 | 16.21 |
| 512 | 128 | 31232 | 0.935 | 547.36 | 7.963 | 16.07 |
| 512 | 128 | 31744 | 0.943 | 543.01 | 8.026 | 15.95 |
| 512 | 128 | 32256 | 0.951 | 538.16 | 8.096 | 15.81 |
## gemma-3-27B-it-qat-iq4_xs.gguf
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.354 | 1445.31 | 3.869 | 33.09 |
| 512 | 128 | 512 | 0.359 | 1426.20 | 3.938 | 32.50 |
| 512 | 128 | 1024 | 0.368 | 1392.49 | 4.009 | 31.93 |
| 512 | 128 | 1536 | 0.376 | 1360.20 | 4.079 | 31.38 |
| 512 | 128 | 2048 | 0.385 | 1330.26 | 4.145 | 30.88 |
| 512 | 128 | 2560 | 0.393 | 1302.43 | 4.215 | 30.37 |
| 512 | 128 | 3072 | 0.402 | 1273.64 | 4.280 | 29.91 |
| 512 | 128 | 3584 | 0.409 | 1251.28 | 4.350 | 29.42 |
| 512 | 128 | 4096 | 0.418 | 1226.26 | 4.419 | 28.97 |
| 512 | 128 | 4608 | 0.427 | 1199.54 | 4.487 | 28.52 |
| 512 | 128 | 5120 | 0.435 | 1175.95 | 4.552 | 28.12 |
| 512 | 128 | 5632 | 0.446 | 1146.84 | 4.625 | 27.67 |
| 512 | 128 | 6144 | 0.455 | 1126.16 | 4.693 | 27.27 |
| 512 | 128 | 6656 | 0.464 | 1103.10 | 4.761 | 26.88 |
| 512 | 128 | 7168 | 0.472 | 1085.43 | 4.829 | 26.51 |
| 512 | 128 | 7680 | 0.480 | 1066.39 | 4.896 | 26.15 |
| 512 | 128 | 8192 | 0.488 | 1048.19 | 4.966 | 25.77 |
| 512 | 128 | 8704 | 0.498 | 1027.72 | 5.037 | 25.41 |
| 512 | 128 | 9216 | 0.505 | 1013.91 | 5.107 | 25.06 |
| 512 | 128 | 9728 | 0.514 | 996.65 | 5.177 | 24.72 |
| 512 | 128 | 10240 | 0.522 | 981.43 | 5.241 | 24.42 |
| 512 | 128 | 10752 | 0.530 | 966.08 | 5.311 | 24.10 |
| 512 | 128 | 11264 | 0.540 | 948.94 | 5.380 | 23.79 |
| 512 | 128 | 11776 | 0.547 | 935.81 | 5.448 | 23.50 |
| 512 | 128 | 12288 | 0.556 | 921.26 | 5.515 | 23.21 |
| 512 | 128 | 12800 | 0.564 | 907.03 | 5.582 | 22.93 |
| 512 | 128 | 13312 | 0.572 | 894.35 | 5.648 | 22.66 |
| 512 | 128 | 13824 | 0.581 | 880.65 | 5.714 | 22.40 |
| 512 | 128 | 14336 | 0.590 | 868.50 | 5.781 | 22.14 |
| 512 | 128 | 14848 | 0.598 | 856.72 | 5.855 | 21.86 |
| 512 | 128 | 15360 | 0.605 | 846.01 | 5.920 | 21.62 |
| 512 | 128 | 15872 | 0.613 | 835.56 | 5.980 | 21.40 |
| 512 | 128 | 16384 | 0.623 | 821.71 | 6.050 | 21.16 |
| 512 | 128 | 16896 | 0.633 | 808.95 | 6.111 | 20.94 |
| 512 | 128 | 17408 | 0.640 | 799.92 | 6.176 | 20.73 |
| 512 | 128 | 17920 | 0.648 | 789.93 | 6.241 | 20.51 |
| 512 | 128 | 18432 | 0.658 | 777.96 | 6.308 | 20.29 |
| 512 | 128 | 18944 | 0.666 | 768.87 | 6.374 | 20.08 |
| 512 | 128 | 19456 | 0.677 | 756.54 | 6.447 | 19.85 |
| 512 | 128 | 19968 | 0.683 | 749.66 | 6.512 | 19.65 |
| 512 | 128 | 20480 | 0.693 | 738.62 | 6.581 | 19.45 |
| 512 | 128 | 20992 | 0.701 | 730.71 | 6.650 | 19.25 |
| 512 | 128 | 21504 | 0.708 | 723.54 | 6.715 | 19.06 |
| 512 | 128 | 22016 | 0.718 | 712.66 | 6.784 | 18.87 |
| 512 | 128 | 22528 | 0.727 | 703.84 | 6.848 | 18.69 |
| 512 | 128 | 23040 | 0.735 | 696.52 | 6.915 | 18.51 |
| 512 | 128 | 23552 | 0.744 | 688.40 | 6.984 | 18.33 |
| 512 | 128 | 24064 | 0.754 | 679.13 | 7.057 | 18.14 |
| 512 | 128 | 24576 | 0.760 | 673.38 | 7.122 | 17.97 |
| 512 | 128 | 25088 | 0.768 | 666.65 | 7.189 | 17.80 |
| 512 | 128 | 25600 | 0.777 | 658.95 | 7.258 | 17.64 |
| 512 | 128 | 26112 | 0.788 | 649.70 | 7.327 | 17.47 |
| 512 | 128 | 26624 | 0.796 | 643.27 | 7.398 | 17.30 |
| 512 | 128 | 27136 | 0.804 | 637.20 | 7.467 | 17.14 |
| 512 | 128 | 27648 | 0.813 | 629.55 | 7.538 | 16.98 |
| 512 | 128 | 28160 | 0.822 | 622.77 | 7.613 | 16.81 |
| 512 | 128 | 28672 | 0.834 | 614.26 | 7.685 | 16.66 |
| 512 | 128 | 29184 | 0.841 | 609.06 | 7.753 | 16.51 |
| 512 | 128 | 29696 | 0.847 | 604.25 | 7.823 | 16.36 |
| 512 | 128 | 30208 | 0.854 | 599.32 | 7.890 | 16.22 |
| 512 | 128 | 30720 | 0.867 | 590.81 | 7.960 | 16.08 |
| 512 | 128 | 31232 | 0.877 | 584.02 | 8.034 | 15.93 |
| 512 | 128 | 31744 | 0.883 | 579.81 | 8.099 | 15.80 |
| 512 | 128 | 32256 | 0.893 | 573.29 | 8.167 | 15.67 |
## gemma-3-27B-it-qat-mix-iq3_k.gguf
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.447 | 1145.52 | 3.945 | 32.44 |
| 512 | 128 | 512 | 0.417 | 1228.76 | 4.014 | 31.89 |
| 512 | 128 | 1024 | 0.426 | 1202.30 | 4.076 | 31.41 |
| 512 | 128 | 1536 | 0.434 | 1179.42 | 4.146 | 30.87 |
| 512 | 128 | 2048 | 0.443 | 1156.00 | 4.212 | 30.39 |
| 512 | 128 | 2560 | 0.451 | 1134.72 | 4.280 | 29.91 |
| 512 | 128 | 3072 | 0.460 | 1113.87 | 4.348 | 29.44 |
| 512 | 128 | 3584 | 0.468 | 1094.14 | 4.415 | 28.99 |
| 512 | 128 | 4096 | 0.477 | 1073.18 | 4.482 | 28.56 |
| 512 | 128 | 4608 | 0.485 | 1055.40 | 4.550 | 28.13 |
| 512 | 128 | 5120 | 0.495 | 1034.95 | 4.621 | 27.70 |
| 512 | 128 | 5632 | 0.503 | 1017.63 | 4.689 | 27.30 |
| 512 | 128 | 6144 | 0.511 | 1001.37 | 4.757 | 26.91 |
| 512 | 128 | 6656 | 0.519 | 986.65 | 4.825 | 26.53 |
| 512 | 128 | 7168 | 0.528 | 970.07 | 4.892 | 26.16 |
| 512 | 128 | 7680 | 0.537 | 954.21 | 4.959 | 25.81 |
| 512 | 128 | 8192 | 0.545 | 939.29 | 5.029 | 25.45 |
| 512 | 128 | 8704 | 0.554 | 923.71 | 5.097 | 25.11 |
| 512 | 128 | 9216 | 0.562 | 911.09 | 5.166 | 24.78 |
| 512 | 128 | 9728 | 0.569 | 900.24 | 5.235 | 24.45 |
| 512 | 128 | 10240 | 0.578 | 885.58 | 5.302 | 24.14 |
| 512 | 128 | 10752 | 0.586 | 873.69 | 5.371 | 23.83 |
| 512 | 128 | 11264 | 0.595 | 859.80 | 5.439 | 23.53 |
| 512 | 128 | 11776 | 0.604 | 847.90 | 5.506 | 23.25 |
| 512 | 128 | 12288 | 0.614 | 834.40 | 5.572 | 22.97 |
| 512 | 128 | 12800 | 0.621 | 824.19 | 5.635 | 22.71 |
| 512 | 128 | 13312 | 0.631 | 811.51 | 5.704 | 22.44 |
| 512 | 128 | 13824 | 0.638 | 803.08 | 5.768 | 22.19 |
| 512 | 128 | 14336 | 0.646 | 793.10 | 5.831 | 21.95 |
| 512 | 128 | 14848 | 0.656 | 780.23 | 5.905 | 21.68 |
| 512 | 128 | 15360 | 0.664 | 771.05 | 5.968 | 21.45 |
| 512 | 128 | 15872 | 0.673 | 761.30 | 6.033 | 21.22 |
| 512 | 128 | 16384 | 0.681 | 752.03 | 6.102 | 20.98 |
| 512 | 128 | 16896 | 0.690 | 741.95 | 6.169 | 20.75 |
| 512 | 128 | 17408 | 0.698 | 733.85 | 6.237 | 20.52 |
| 512 | 128 | 17920 | 0.707 | 724.39 | 6.304 | 20.30 |
| 512 | 128 | 18432 | 0.716 | 715.50 | 6.371 | 20.09 |
| 512 | 128 | 18944 | 0.724 | 707.19 | 6.440 | 19.88 |
| 512 | 128 | 19456 | 0.732 | 699.61 | 6.502 | 19.69 |
| 512 | 128 | 19968 | 0.740 | 692.05 | 6.573 | 19.47 |
| 512 | 128 | 20480 | 0.749 | 683.36 | 6.642 | 19.27 |
| 512 | 128 | 20992 | 0.758 | 675.25 | 6.713 | 19.07 |
| 512 | 128 | 21504 | 0.766 | 668.41 | 6.785 | 18.87 |
| 512 | 128 | 22016 | 0.776 | 660.00 | 6.853 | 18.68 |
| 512 | 128 | 22528 | 0.783 | 653.50 | 6.922 | 18.49 |
| 512 | 128 | 23040 | 0.793 | 645.39 | 6.994 | 18.30 |
| 512 | 128 | 23552 | 0.801 | 639.13 | 7.061 | 18.13 |
| 512 | 128 | 24064 | 0.811 | 631.43 | 7.133 | 17.94 |
| 512 | 128 | 24576 | 0.820 | 624.08 | 7.200 | 17.78 |
| 512 | 128 | 25088 | 0.828 | 618.62 | 7.270 | 17.61 |
| 512 | 128 | 25600 | 0.837 | 611.49 | 7.340 | 17.44 |
| 512 | 128 | 26112 | 0.844 | 606.28 | 7.408 | 17.28 |
| 512 | 128 | 26624 | 0.855 | 599.11 | 7.475 | 17.12 |
| 512 | 128 | 27136 | 0.862 | 594.18 | 7.541 | 16.97 |
| 512 | 128 | 27648 | 0.873 | 586.66 | 7.604 | 16.83 |
| 512 | 128 | 28160 | 0.880 | 581.61 | 7.674 | 16.68 |
| 512 | 128 | 28672 | 0.888 | 576.73 | 7.744 | 16.53 |
| 512 | 128 | 29184 | 0.897 | 571.07 | 7.810 | 16.39 |
| 512 | 128 | 29696 | 0.905 | 565.80 | 7.874 | 16.26 |
| 512 | 128 | 30208 | 0.916 | 558.82 | 7.944 | 16.11 |
| 512 | 128 | 30720 | 0.923 | 555.00 | 8.012 | 15.98 |
| 512 | 128 | 31232 | 0.934 | 548.01 | 8.078 | 15.84 |
| 512 | 128 | 31744 | 0.942 | 543.50 | 8.145 | 15.72 |
| 512 | 128 | 32256 | 0.951 | 538.44 | 8.212 | 15.59 |
## gemma-3-27B-it-qat-pure-q4_0.gguf
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.323 | 1585.65 | 3.516 | 36.40 |
| 512 | 128 | 512 | 0.328 | 1559.55 | 3.585 | 35.71 |
| 512 | 128 | 1024 | 0.338 | 1514.58 | 3.650 | 35.06 |
| 512 | 128 | 1536 | 0.346 | 1477.87 | 3.716 | 34.45 |
| 512 | 128 | 2048 | 0.356 | 1438.88 | 3.780 | 33.86 |
| 512 | 128 | 2560 | 0.364 | 1407.96 | 3.845 | 33.29 |
| 512 | 128 | 3072 | 0.372 | 1376.65 | 3.910 | 32.74 |
| 512 | 128 | 3584 | 0.380 | 1347.00 | 3.974 | 32.21 |
| 512 | 128 | 4096 | 0.389 | 1315.56 | 4.038 | 31.70 |
| 512 | 128 | 4608 | 0.398 | 1286.90 | 4.105 | 31.18 |
| 512 | 128 | 5120 | 0.405 | 1264.27 | 4.172 | 30.68 |
| 512 | 128 | 5632 | 0.413 | 1238.73 | 4.234 | 30.23 |
| 512 | 128 | 6144 | 0.422 | 1213.02 | 4.300 | 29.77 |
| 512 | 128 | 6656 | 0.430 | 1189.78 | 4.365 | 29.33 |
| 512 | 128 | 7168 | 0.440 | 1162.71 | 4.436 | 28.85 |
| 512 | 128 | 7680 | 0.448 | 1143.27 | 4.498 | 28.46 |
| 512 | 128 | 8192 | 0.458 | 1118.80 | 4.564 | 28.04 |
| 512 | 128 | 8704 | 0.466 | 1098.96 | 4.632 | 27.63 |
| 512 | 128 | 9216 | 0.474 | 1080.03 | 4.697 | 27.25 |
| 512 | 128 | 9728 | 0.483 | 1059.90 | 4.768 | 26.85 |
| 512 | 128 | 10240 | 0.492 | 1041.25 | 4.834 | 26.48 |
| 512 | 128 | 10752 | 0.500 | 1024.39 | 4.903 | 26.11 |
| 512 | 128 | 11264 | 0.509 | 1006.30 | 4.968 | 25.76 |
| 512 | 128 | 11776 | 0.517 | 989.51 | 5.035 | 25.42 |
| 512 | 128 | 12288 | 0.526 | 972.95 | 5.102 | 25.09 |
| 512 | 128 | 12800 | 0.534 | 958.30 | 5.171 | 24.75 |
| 512 | 128 | 13312 | 0.542 | 945.13 | 5.236 | 24.44 |
| 512 | 128 | 13824 | 0.551 | 928.86 | 5.302 | 24.14 |
| 512 | 128 | 14336 | 0.560 | 915.04 | 5.368 | 23.84 |
| 512 | 128 | 14848 | 0.568 | 900.75 | 5.437 | 23.54 |
| 512 | 128 | 15360 | 0.577 | 887.69 | 5.503 | 23.26 |
| 512 | 128 | 15872 | 0.586 | 874.12 | 5.570 | 22.98 |
| 512 | 128 | 16384 | 0.594 | 861.42 | 5.634 | 22.72 |
| 512 | 128 | 16896 | 0.603 | 849.78 | 5.702 | 22.45 |
| 512 | 128 | 17408 | 0.611 | 838.06 | 5.770 | 22.18 |
| 512 | 128 | 17920 | 0.619 | 826.92 | 5.837 | 21.93 |
| 512 | 128 | 18432 | 0.628 | 815.75 | 5.902 | 21.69 |
| 512 | 128 | 18944 | 0.635 | 805.86 | 5.970 | 21.44 |
| 512 | 128 | 19456 | 0.645 | 794.21 | 6.030 | 21.23 |
| 512 | 128 | 19968 | 0.652 | 784.86 | 6.097 | 20.99 |
| 512 | 128 | 20480 | 0.662 | 773.59 | 6.164 | 20.76 |
| 512 | 128 | 20992 | 0.669 | 765.08 | 6.229 | 20.55 |
| 512 | 128 | 21504 | 0.679 | 753.76 | 6.298 | 20.32 |
| 512 | 128 | 22016 | 0.686 | 746.01 | 6.365 | 20.11 |
| 512 | 128 | 22528 | 0.695 | 736.96 | 6.430 | 19.91 |
| 512 | 128 | 23040 | 0.703 | 728.05 | 6.497 | 19.70 |
| 512 | 128 | 23552 | 0.711 | 719.77 | 6.565 | 19.50 |
| 512 | 128 | 24064 | 0.722 | 709.53 | 6.629 | 19.31 |
| 512 | 128 | 24576 | 0.729 | 702.38 | 6.694 | 19.12 |
| 512 | 128 | 25088 | 0.738 | 693.75 | 6.760 | 18.93 |
| 512 | 128 | 25600 | 0.752 | 680.71 | 6.831 | 18.74 |
| 512 | 128 | 26112 | 0.755 | 677.81 | 6.896 | 18.56 |
| 512 | 128 | 26624 | 0.765 | 669.13 | 6.968 | 18.37 |
| 512 | 128 | 27136 | 0.772 | 663.37 | 7.033 | 18.20 |
| 512 | 128 | 27648 | 0.783 | 654.12 | 7.104 | 18.02 |
| 512 | 128 | 28160 | 0.790 | 647.97 | 7.174 | 17.84 |
| 512 | 128 | 28672 | 0.800 | 640.20 | 7.240 | 17.68 |
| 512 | 128 | 29184 | 0.808 | 633.63 | 7.309 | 17.51 |
| 512 | 128 | 29696 | 0.816 | 627.32 | 7.371 | 17.37 |
| 512 | 128 | 30208 | 0.825 | 620.68 | 7.445 | 17.19 |
| 512 | 128 | 30720 | 0.837 | 612.00 | 7.514 | 17.03 |
| 512 | 128 | 31232 | 0.844 | 606.59 | 7.586 | 16.87 |
| 512 | 128 | 31744 | 0.862 | 594.13 | 7.655 | 16.72 |
| 512 | 128 | 32256 | 0.860 | 595.13 | 7.729 | 16.56 |
## gemma-3-27B-it-qat-q4_k.gguf
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.350 | 1461.01 | 3.564 | 35.91 |
| 512 | 128 | 512 | 0.355 | 1442.43 | 3.635 | 35.22 |
| 512 | 128 | 1024 | 0.364 | 1406.79 | 3.702 | 34.58 |
| 512 | 128 | 1536 | 0.370 | 1382.30 | 3.770 | 33.95 |
| 512 | 128 | 2048 | 0.381 | 1343.95 | 3.837 | 33.36 |
| 512 | 128 | 2560 | 0.389 | 1315.80 | 3.903 | 32.80 |
| 512 | 128 | 3072 | 0.399 | 1283.92 | 3.975 | 32.20 |
| 512 | 128 | 3584 | 0.406 | 1262.15 | 4.040 | 31.68 |
| 512 | 128 | 4096 | 0.414 | 1235.98 | 4.108 | 31.16 |
| 512 | 128 | 4608 | 0.423 | 1210.07 | 4.174 | 30.67 |
| 512 | 128 | 5120 | 0.432 | 1186.43 | 4.240 | 30.19 |
| 512 | 128 | 5632 | 0.440 | 1162.38 | 4.307 | 29.72 |
| 512 | 128 | 6144 | 0.449 | 1139.33 | 4.372 | 29.28 |
| 512 | 128 | 6656 | 0.458 | 1118.86 | 4.440 | 28.83 |
| 512 | 128 | 7168 | 0.466 | 1098.62 | 4.505 | 28.41 |
| 512 | 128 | 7680 | 0.474 | 1079.84 | 4.572 | 27.99 |
| 512 | 128 | 8192 | 0.483 | 1060.70 | 4.639 | 27.59 |
| 512 | 128 | 8704 | 0.491 | 1042.31 | 4.708 | 27.19 |
| 512 | 128 | 9216 | 0.500 | 1024.70 | 4.776 | 26.80 |
| 512 | 128 | 9728 | 0.509 | 1006.70 | 4.845 | 26.42 |
| 512 | 128 | 10240 | 0.517 | 991.28 | 4.913 | 26.06 |
| 512 | 128 | 10752 | 0.524 | 976.78 | 4.979 | 25.71 |
| 512 | 128 | 11264 | 0.532 | 962.53 | 5.040 | 25.39 |
| 512 | 128 | 11776 | 0.542 | 944.54 | 5.117 | 25.02 |
| 512 | 128 | 12288 | 0.550 | 931.02 | 5.173 | 24.74 |
| 512 | 128 | 12800 | 0.559 | 916.72 | 5.240 | 24.43 |
| 512 | 128 | 13312 | 0.566 | 903.92 | 5.305 | 24.13 |
| 512 | 128 | 13824 | 0.576 | 888.40 | 5.373 | 23.82 |
| 512 | 128 | 14336 | 0.584 | 876.82 | 5.444 | 23.51 |
| 512 | 128 | 14848 | 0.592 | 864.61 | 5.506 | 23.25 |
| 512 | 128 | 15360 | 0.600 | 852.86 | 5.574 | 22.96 |
| 512 | 128 | 15872 | 0.609 | 840.81 | 5.641 | 22.69 |
| 512 | 128 | 16384 | 0.617 | 830.27 | 5.706 | 22.43 |
| 512 | 128 | 16896 | 0.626 | 817.66 | 5.773 | 22.17 |
| 512 | 128 | 17408 | 0.635 | 806.32 | 5.839 | 21.92 |
| 512 | 128 | 17920 | 0.644 | 795.42 | 5.908 | 21.67 |
| 512 | 128 | 18432 | 0.651 | 786.23 | 5.974 | 21.43 |
| 512 | 128 | 18944 | 0.660 | 775.45 | 6.041 | 21.19 |
| 512 | 128 | 19456 | 0.669 | 765.32 | 6.110 | 20.95 |
| 512 | 128 | 19968 | 0.677 | 756.12 | 6.176 | 20.72 |
| 512 | 128 | 20480 | 0.686 | 746.87 | 6.244 | 20.50 |
| 512 | 128 | 20992 | 0.695 | 736.54 | 6.313 | 20.28 |
| 512 | 128 | 21504 | 0.705 | 726.44 | 6.382 | 20.06 |
| 512 | 128 | 22016 | 0.713 | 718.06 | 6.451 | 19.84 |
| 512 | 128 | 22528 | 0.720 | 710.74 | 6.512 | 19.66 |
| 512 | 128 | 23040 | 0.728 | 702.83 | 6.579 | 19.46 |
| 512 | 128 | 23552 | 0.740 | 691.52 | 6.651 | 19.24 |
| 512 | 128 | 24064 | 0.749 | 683.96 | 6.723 | 19.04 |
| 512 | 128 | 24576 | 0.757 | 676.77 | 6.783 | 18.87 |
| 512 | 128 | 25088 | 0.764 | 669.98 | 6.852 | 18.68 |
| 512 | 128 | 25600 | 0.773 | 662.01 | 6.924 | 18.49 |
| 512 | 128 | 26112 | 0.783 | 653.89 | 6.990 | 18.31 |
| 512 | 128 | 26624 | 0.792 | 646.42 | 7.059 | 18.13 |
| 512 | 128 | 27136 | 0.802 | 638.16 | 7.139 | 17.93 |
| 512 | 128 | 27648 | 0.807 | 634.08 | 7.207 | 17.76 |
| 512 | 128 | 28160 | 0.817 | 626.62 | 7.268 | 17.61 |
| 512 | 128 | 28672 | 0.829 | 617.49 | 7.333 | 17.46 |
| 512 | 128 | 29184 | 0.836 | 612.25 | 7.411 | 17.27 |
| 512 | 128 | 29696 | 0.845 | 606.01 | 7.473 | 17.13 |
| 512 | 128 | 30208 | 0.855 | 598.76 | 7.540 | 16.98 |
| 512 | 128 | 30720 | 0.860 | 595.08 | 7.608 | 16.82 |
| 512 | 128 | 31232 | 0.871 | 588.11 | 7.681 | 16.66 |
| 512 | 128 | 31744 | 0.877 | 583.60 | 7.742 | 16.53 |
| 512 | 128 | 32256 | 0.885 | 578.74 | 7.812 | 16.39 |
## gemma-3-27B-it-qat-q8_0.gguf
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.332 | 1539.91 | 5.825 | 21.98 |
| 512 | 128 | 512 | 0.338 | 1512.99 | 5.894 | 21.72 |
| 512 | 128 | 1024 | 0.347 | 1475.13 | 5.957 | 21.49 |
| 512 | 128 | 1536 | 0.354 | 1444.45 | 6.024 | 21.25 |
| 512 | 128 | 2048 | 0.364 | 1407.89 | 6.086 | 21.03 |
| 512 | 128 | 2560 | 0.372 | 1374.75 | 6.151 | 20.81 |
| 512 | 128 | 3072 | 0.381 | 1342.80 | 6.216 | 20.59 |
| 512 | 128 | 3584 | 0.390 | 1311.71 | 6.282 | 20.37 |
| 512 | 128 | 4096 | 0.399 | 1282.81 | 6.346 | 20.17 |
| 512 | 128 | 4608 | 0.407 | 1258.61 | 6.410 | 19.97 |
| 512 | 128 | 5120 | 0.414 | 1237.17 | 6.474 | 19.77 |
| 512 | 128 | 5632 | 0.422 | 1212.05 | 6.538 | 19.58 |
| 512 | 128 | 6144 | 0.431 | 1187.26 | 6.604 | 19.38 |
| 512 | 128 | 6656 | 0.441 | 1161.80 | 6.666 | 19.20 |
| 512 | 128 | 7168 | 0.449 | 1140.72 | 6.730 | 19.02 |
| 512 | 128 | 7680 | 0.458 | 1118.44 | 6.793 | 18.84 |
| 512 | 128 | 8192 | 0.466 | 1098.54 | 6.857 | 18.67 |
| 512 | 128 | 8704 | 0.473 | 1083.10 | 6.923 | 18.49 |
| 512 | 128 | 9216 | 0.482 | 1062.20 | 6.991 | 18.31 |
| 512 | 128 | 9728 | 0.491 | 1042.60 | 7.054 | 18.15 |
| 512 | 128 | 10240 | 0.498 | 1027.18 | 7.123 | 17.97 |
| 512 | 128 | 10752 | 0.508 | 1007.38 | 7.185 | 17.81 |
| 512 | 128 | 11264 | 0.515 | 994.91 | 7.251 | 17.65 |
| 512 | 128 | 11776 | 0.523 | 978.05 | 7.317 | 17.49 |
| 512 | 128 | 12288 | 0.532 | 962.31 | 7.380 | 17.34 |
| 512 | 128 | 12800 | 0.541 | 947.09 | 7.445 | 17.19 |
| 512 | 128 | 13312 | 0.549 | 932.93 | 7.510 | 17.04 |
| 512 | 128 | 13824 | 0.557 | 919.44 | 7.577 | 16.89 |
| 512 | 128 | 14336 | 0.566 | 905.08 | 7.645 | 16.74 |
| 512 | 128 | 14848 | 0.575 | 890.47 | 7.707 | 16.61 |
| 512 | 128 | 15360 | 0.582 | 879.10 | 7.773 | 16.47 |
| 512 | 128 | 15872 | 0.592 | 865.03 | 7.834 | 16.34 |
| 512 | 128 | 16384 | 0.600 | 853.53 | 7.898 | 16.21 |
| 512 | 128 | 16896 | 0.607 | 843.01 | 7.966 | 16.07 |
| 512 | 128 | 17408 | 0.617 | 829.70 | 8.028 | 15.94 |
| 512 | 128 | 17920 | 0.625 | 818.59 | 8.092 | 15.82 |
| 512 | 128 | 18432 | 0.632 | 810.24 | 8.163 | 15.68 |
| 512 | 128 | 18944 | 0.640 | 800.31 | 8.229 | 15.56 |
| 512 | 128 | 19456 | 0.649 | 789.37 | 8.296 | 15.43 |
| 512 | 128 | 19968 | 0.658 | 778.33 | 8.361 | 15.31 |
| 512 | 128 | 20480 | 0.666 | 768.52 | 8.425 | 15.19 |
| 512 | 128 | 20992 | 0.676 | 757.55 | 8.490 | 15.08 |
| 512 | 128 | 21504 | 0.686 | 746.62 | 8.555 | 14.96 |
| 512 | 128 | 22016 | 0.694 | 737.58 | 8.620 | 14.85 |
| 512 | 128 | 22528 | 0.705 | 726.01 | 8.690 | 14.73 |
| 512 | 128 | 23040 | 0.713 | 718.42 | 8.762 | 14.61 |
| 512 | 128 | 23552 | 0.720 | 710.87 | 8.829 | 14.50 |
| 512 | 128 | 24064 | 0.729 | 701.96 | 8.893 | 14.39 |
| 512 | 128 | 24576 | 0.738 | 693.84 | 8.964 | 14.28 |
| 512 | 128 | 25088 | 0.747 | 685.68 | 9.029 | 14.18 |
| 512 | 128 | 25600 | 0.756 | 677.05 | 9.101 | 14.06 |
| 512 | 128 | 26112 | 0.764 | 670.53 | 9.165 | 13.97 |
| 512 | 128 | 26624 | 0.773 | 662.78 | 9.230 | 13.87 |
| 512 | 128 | 27136 | 0.782 | 654.47 | 9.296 | 13.77 |
| 512 | 128 | 27648 | 0.788 | 649.53 | 9.361 | 13.67 |
| 512 | 128 | 28160 | 0.798 | 641.93 | 9.430 | 13.57 |
| 512 | 128 | 28672 | 0.806 | 635.07 | 9.497 | 13.48 |
| 512 | 128 | 29184 | 0.815 | 628.26 | 9.561 | 13.39 |
| 512 | 128 | 29696 | 0.822 | 623.01 | 9.631 | 13.29 |
| 512 | 128 | 30208 | 0.832 | 615.43 | 9.687 | 13.21 |
| 512 | 128 | 30720 | 0.839 | 610.13 | 9.753 | 13.12 |
| 512 | 128 | 31232 | 0.849 | 603.39 | 9.824 | 13.03 |
| 512 | 128 | 31744 | 0.858 | 596.95 | 9.885 | 12.95 |
| 512 | 128 | 32256 | 0.865 | 591.61 | 9.953 | 12.86 |
Methodology
Perplexity
$ cd ik_llama.cpp
$ git rev-parse --short HEAD
3bb64d93
$ ./build/bin/llama-perplexity --version
version: 3639 (3bb64d93)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
$ wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ sha256sum wiki.test.raw
173c87a53759e0201f33e0ccf978e510c2042d7f2cb78229d9a50d79b9e7dd08 wiki.test.raw
$ ./build/bin/llama-perplexity \
--model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27B-it-qat-iq4_nl.gguf \
--ctx-size 512 \
--ubatch-size 512 \
-f wiki.test.raw \
--seed 1337 \
--n-gpu-layers 99 \
--threads 4
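For reference, llama-perplexity evaluates the text in --ctx-size chunks and reports perplexity as the exponential of the mean negative token log-likelihood. A minimal sketch of that formula over a list of per-token log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """PPL over a sequence: exp of the negative mean token log-probability."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# toy check: four tokens each predicted with probability 0.5 -> PPL of 2
print(perplexity([math.log(0.5)] * 4))  # ≈ 2.0
```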
Sweep Bench
Using a single RTX A6000 48GB VRAM GPU
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-fa \
-amb 512 \
-fmoe \
-c 32768 \
-ub 512 \
-ngl 99 \
--threads 4
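The sweep-bench markdown tables above are easy to post-process. A small sketch (a hypothetical helper, not part of ik_llama.cpp) that pulls N_KV and TG throughput out of the table text, e.g. to plot how generation speed decays as the KV cache fills:

```python
def parse_sweep(table_text):
    """Return [(n_kv, tg_tokens_per_sec), ...] from a llama-sweep-bench markdown table."""
    points = []
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 7 and cells[0].isdigit():
            # columns: PP, TG, N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s
            points.append((int(cells[2]), float(cells[6])))
    return points

sample = """
|   512 |    128 |      0 |    0.430 |  1191.53 |    3.764 |    34.01 |
|   512 |    128 |    512 |    0.406 |  1262.23 |    3.844 |    33.30 |
"""
print(parse_sweep(sample))  # [(0, 34.01), (512, 33.3)]
```

Header and separator rows are skipped automatically because their first cell is not numeric.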
I tried a number of other mixes based on --layer-similarity scores, optimizing by whole-layer score and also by attn and ffn scores, but in limited testing on this specific model none of them gave better perplexity. My impression is this QAT really was meant for q4_0, as using a mix of slightly higher quants for some layers sometimes showed slightly worse perplexity.
I didn't compare against quants of the non-QAT bf16 model, but wanted to share some early results with anyone else curious about this QAT business.
Cheers!
🗣️ Discussion
👤 saood06 replied the 2025-04-18 at 22:57:25:
I saw google released their google/gemma-3-27b-it-qat-q4_0-unquantized original .safetensors unquantized model. It is supposedly designed for q4_0 quantization which was released earlier in gguf format. Thanks to Quantization Aware Training (QAT), the model is able to preserve similar quality as bfloat16 while significantly reducing the memory requirements to load the model.
This is QAT, but unlike previous QAT models I have seen, this one was done with an additional stage of finetuning. That is why I think the raw PPL numbers are less directly comparable to the version without QAT (and why they are lower: it was trained longer), but they should still be useful, given that you can often compare PPL within an architecture family, like the example below (a bit dated, but I still find this graph made by ikawrakow interesting).
👤 bartowski1182 replied the 2025-04-19 at 00:55:33:
unlike previous QAT
Which ones have you seen, and what did they do if not additional fine tuning? 🤔
But yes I theorized it was possible that they did some fine tuning for the quant awareness with wiki text itself, maybe unlikely but certainly not impossible
I think it could be valuable to get more PPL numbers on an additional well-formatted English corpus; that might start giving a fuller picture
👤 saood06 replied the 2025-04-19 at 01:32:31:
Which ones have you seen, and what did they do if not additional fine tuning? 🤔
Not any that I remember being released, but just in papers/blogs/demos, one example being this, where for example they do "Llama3-8B fine-tuned on the C4 dataset (en subset) with and without QAT", which allows you to see the difference between QAT and just finetuning.
I also do remember hearing about models trained entirely using QAT but don't have any reference handy.
But yes I theorized it was possible that they did some fine tuning for the quant awareness with wiki text itself, maybe unlikely but certainly not impossible
My point isn't really specific to any data they did finetune with (my guess is they just did one or a partial epoch of the last dataset used for the instruction-tuned model, as people have reported modern LLMs can get very sensitive to the diversity of their training data [reported since Llama 3, and why people may have struggled fine-tuning that for a while]), just that the QAT model was trained more.
👤 bartowski1182 replied the 2025-04-19 at 02:50:22:
Oh hmm I suppose that's possible as well, would definitely be very interesting to see the full details
👤 saood06 replied the 2025-04-19 at 05:00:05:
@ubergarm
Have you seen these versions, 27B, and 12B.
👤 ikawrakow replied the 2025-04-19 at 08:38:23:
In my quick experiments with Gemma3-12B, the Q4_0 quantized version has a significantly lower Wiki2 perplexity than the bf16 model, or any other quantization. Which means that whatever they have done, they have massively overfit that specific dataset, specifically with Q4_0 quantization. Which means that one cannot use Wiki2 for evaluation (PPL, but also KLD or any other quantization quality measure). In my book (but you may differ from me in that regard), it also means that one cannot take this model seriously.
👤 saood06 replied the 2025-04-19 at 08:56:08:
In my book (but you may differ from me in that regard), it also means that one cannot take this model seriously.
I haven't touched Gemma 3 myself yet (I want to see if it beats QwQ for my GPU only use cases), but I've heard a lot of positive feedback on the QAT version of Gemma 3. I agree that it does make them hard to directly compare since they differ so much, but whatever they did people seem generally happy with it.
but also KLD or any other quantization quality measure
Do you think knowing the KLD between the two BF16 versions would be insightful (not that I could run it in a reasonable amount of time for the 27B; the 12B might be possible though)?
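(Editor's note: such a comparison boils down to measuring how far the two models' next-token distributions diverge, position by position. A minimal sketch of the quantity involved, with toy logits; this is not the actual llama.cpp implementation:)

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(logits_p, logits_q):
    """KL(P||Q) in nats between the softmax distributions of two logit vectors.

    For model comparison, P would be the reference model's next-token
    distribution and Q the candidate's, averaged over many positions.
    """
    p, q = softmax(logits_p), softmax(logits_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits give KLD = 0; the value grows as the models disagree.
print(kl_divergence([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(kl_divergence([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # > 0
```

A KLD near zero between the two BF16s would mean the QAT finetune barely moved the predictive distribution; a large one would confirm it is effectively a different model.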
👤 bartowski1182 replied the 2025-04-20 at 18:19:36:
they have massively overfit that specific dataset
that was my theory as well, that they may have used wikitext as a calibration dataset for the QAT portion
I don't know if it completely invalidates the model, but rather it just makes wikitext useless/misleading, similar to using wikitext for imatrix and then checking PPL against wikitext; it's totally fine to use it, but you need to use something different for PPL afterwards
👤 saood06 replied the 2025-04-20 at 18:29:03:
that was my theory as well, that they may have used wikitext as a calibration dataset for the QAT portion
You could test the QAT against the old one with different datasets if you want to test that hypothesis.
I don't think they used a low diversity dataset to train on, my theory is they may have updated the model they distilled from and that might be the extra bump on top of the bump from more training on more tokens.
👤 ubergarm replied the 2025-04-20 at 22:51:23:
I threw together a quick-n-dirty perplexity test corpus, mostly English language with a little Chinese and XML. Very possible the model was trained on this stuff already, given it is available online. Might be able to generate some "novel" synthetic text using output from a few various LLMs to mix it up, but at least here are a few more data points with something other than `wiki.test.raw`, shown below.

Observations

The absolute values here are lower (~6.0) overall than with `wiki.test.raw` (~8.2). The new QAT BF16 still outperforms the original BF16. For the QAT quants, my q4_k is "better" than google's q4_0, and this time mine isn't "better" than the BF16.

Google Original

google/gemma-3-27b-it/gemma-3-27B-it-BF16-00001-of-00002.gguf
Final estimate: PPL = 6.0568 +/- 0.03981

Google QAT

google/gemma-3-27b-it-qat-q4_0-unquantized/gemma-3-27B-it-qat-unquantized-BF16-00001-of-00002.gguf
Final estimate: PPL = 5.8897 +/- 0.03774

google/gemma-3-27b-it-qat-q4_0-gguf/gemma-3-27b-it-q4_0.gguf
Final estimate: PPL = 6.0588 +/- 0.03904

ubergarm QAT Quant

ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-q4_k.gguf
Final estimate: PPL = 5.9405 +/- 0.03799

Methodology
Logs to generate corpus and run llama-perplexity
# i ching
wget https://www.gutenberg.org/ebooks/25501.txt.utf-8
# Advaita Bodha Deepika, The Lamp of Non-Dual Knowledge
# One of the few books highly spoken of by Bhagavan Sri Ramana Maharshi
wget 'https://archive.org/stream/ramanamaharishiebooks/Ramana%20Maharishi%20eBooks/Advaita%20Bodha%20Deepika_djvu.txt'
# Anarchist Cookbook by William Powell
wget 'https://archive.org/stream/the-anarchist-cookbook-william-powell/The%20Anarchist%20Cookbook%20-%20William%20Powell%20-%20Barricade%20Books%20Inc%20-%201989_djvu.txt'
# Social Architecture by Pieter Hintjens
wget 'https://raw.githubusercontent.com/hintjens/socialarchitecture/refs/heads/master/ch04.txt'
# cat them together in order
$ sha256sum ubergarm-ppl-corpus.txt
456de7da9d5eec01d357f5ccb7fc7207884a706efe94535aca146f8646771bcc  ubergarm-ppl-corpus.txt
$ ./build/bin/llama-perplexity \
    --model "$MODEL" \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f ubergarm-ppl-corpus.txt \
    --seed 1337 \
    --n-gpu-layers 99 \
    --threads 4

👤 ubergarm replied the 2025-04-21 at 14:52:18:
A redditor mentioned their post measuring PPL and KLD with `wiki.test.raw` and a private corpus for some of the gemma-3-27b QAT models, with an interesting writeup. Also amusing that a redditor quoted ik on this thread hah... My impression is folks with <= 16GB VRAM are interested in the gemma-3-27b-it-qat ~4 bits as, while it isn't as good as R1/V3-0324/QwQ-32B imo, it is a newer model that just barely fits with enough context to play around with at decent speed.
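(Editor's note: for readers comparing the numbers in this thread, the PPL that llama-perplexity reports is the exponentiated mean negative log-likelihood over the evaluated tokens, so lower means the model found the text less surprising. A toy illustration of the formula, not the actual llama.cpp code:)

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-1/N * sum(log p(token_i | context))) over N evaluated tokens."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# A model that assigns probability 0.5 to every token scores PPL of 2,
# i.e. it is as "surprised" as a fair coin flip on each token.
print(perplexity([math.log(0.5)] * 100))
```

This is also why overfitting a test corpus inflates apparent quality: the model's log-probabilities on that specific text rise, and PPL drops, without the model necessarily being better anywhere else.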
👤 ikawrakow replied the 2025-04-19 at 09:08:49:
Do you think knowing the KLD between the two BF16 versions would be insightful
Good question. If the Q4_0 model outperforms the bf16 model (at least it does for a set of quality metrics), do we now compare the Q4_0 model against bf16, or do we compare the other way around? How do I know which of these two models is better? QAT was involved, fine tuning was involved, so maybe Q4_0 is the best model?
but I've heard a lot of positive feedback on the QAT version of Gemma 3.
Is it so because it really is good, or is it more because the sentiment towards Google has shifted lately (at least when it comes to "AI")? My impression is that the Internet believes that the latest Gemini models are currently the best (and so, by extension, Gemma3 must be among the best open weight). But for the few things that I asked Gemma3-12B where I have good knowledge of the subject matter, the answers were complete BS.
👤 saood06 replied the 2025-04-19 at 09:33:57:
Good question. If the `Q4_0` model outperforms the `bf16` model (at least it does for a set of quality metrics), do we now compare the `Q4_0` model against `bf16`, or do we compare the other way around?

There are two different BF16s; the original post shows the PPL of both (pasted below for convenience).

Original BF16:

## google/gemma-3-27b-it-BF16-00001-of-00002.gguf
Final estimate: PPL = 8.4276 +/- 0.06705

QAT BF16:

## google/gemma-3-27b-it-qat-q4_0-unquantized-BF16-00001-of-00002.gguf
Final estimate: PPL = 8.2021 +/- 0.06387

The extra finetuning that happened because of QAT is definitely showing in these numbers.
How do I know which of these two models is better? QAT was involved, fine tuning was involved, so maybe `Q4_0` is the best model? [...] Is it so because it really is good, or is it more because the sentiment towards Google has shifted lately (at least when it comes to "AI")? My impression is that the Internet believes that the latest Gemini models are currently the best (and so, by extension, Gemma3 must be among the best open weight).

My impression is that sentiment on Gemma3 is still mixed (if it were better I would have tried it by now), but for people who used Gemma 3 the QAT versions are a very good drop-in replacement, offering higher quality, faster inference, and smaller models; to me it doesn't seem like this has changed the perception of Gemma3 as a whole, with plenty of people still not liking it. There may be use cases in which it has degraded performance compared to the non-QAT version, but I have yet to come across any reports of that.
But the few things that I asked Gemma3-12B where I have good knowledge of the subject matter, the answers were complete BS.
Did both QAT and non QAT do that? I'd assume they'd both fail your test.
Gemma3 may not be a good fit for you, but I am curious what models you have used and liked.
👤 ikawrakow replied the 2025-04-20 at 07:02:01:
Gemma3 may not be a good fit for you, but I am curious what models you have used and liked.
From the models I can run locally, none really passes the smell test. I would be hard pressed to say which one I like the best.
👤 saood06 replied the 2025-04-20 at 09:51:32:
From the models I can run locally
Have you used any non-locally? You brought up Gemini; those models have had a huge advantage on long-context benchmarks for a long time. The Gemma models are nothing like them. Deepseek-R1 is the best local model, but even that pales in comparison to Gemini (from benchmarks and user testimonials; I've used it a bit over lmarena but not enough to remark on it).
I would be hard pressed to say which one I like the best.
I'm not surprised, as most small and mid size models aren't that smart if that is what you are evaluating. I mostly use my local models for entertainment.
👤 ubergarm replied the 2025-04-27 at 03:08:43:
EDIT My compile script was messed up and putting me into DEBUG mode...
I was doing some more testing benchmarking of various gemma-3-27b-it-qat quants between mainline and ik_llama.cpp llama-sweep-bench and noticed some warnings printing out on my ik_llama.cpp fork:
ggml_backend_cuda_graph_compute: CUDA graph update failed
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1]
I believe I'm compiling Release and not Debug but not completely sure how to tell. I'm not seeing that warning on mainline with roughly the same command (-c 33792 instead of -c 32768 oops), but not sure if verbosity is different by default etc.
I ran one of the bartowski quants on both mainline and ik_llama.cpp just to see the difference not due to different quant.
👈 Logs
model="/mnt/raid/models/bartowski/google_gemma-3-27b-it-qat-GGUF/google_gemma-3-27b-it-qat-Q4_K_M.gguf"
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-ctk f16 -ctv f16 \
-c 32768 \
-ngl 99 \
--threads 16
llama_model_loader: loaded meta data with 44 key-value pairs and 808 tensors from /mnt/raid/models/bartowski/google_gemma-3-27b-it-qat-GGUF/google_gemma-3-27b-it-qat-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 3 27b It Qat
llama_model_loader: - kv 3: general.finetune str = it-qat
llama_model_loader: - kv 4: general.basename str = gemma-3
llama_model_loader: - kv 5: general.size_label str = 27B
llama_model_loader: - kv 6: general.license str = gemma
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Gemma 3 27b It
llama_model_loader: - kv 9: general.base_model.0.organization str = Google
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv 11: general.tags arr[str,4] = ["gemma3", "gemma", "google", "image-...
llama_model_loader: - kv 12: gemma3.context_length u32 = 131072
llama_model_loader: - kv 13: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 14: gemma3.block_count u32 = 62
llama_model_loader: - kv 15: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 16: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 17: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 18: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 19: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 20: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 22: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 23: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 24: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 25: tokenizer.ggml.model str = llama
llama_model_loader: - kv 26: tokenizer.ggml.pre str = default
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 28: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 32: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 37: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 15
llama_model_loader: - kv 40: quantize.imatrix.file str = /models_out/gemma-3-27b-it-qat-GGUF/g...
llama_model_loader: - kv 41: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 42: quantize.imatrix.entries_count i32 = 434
llama_model_loader: - kv 43: quantize.imatrix.chunks_count i32 = 129
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type q4_K: 374 tensors
llama_model_loader: - type q6_K: 61 tensors
llm_load_vocab: special tokens cache size = 6415
llm_load_vocab: token to piece cache size = 1.9446 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 262208
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 5376
llm_load_print_meta: n_layer = 62
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 1024
llm_load_print_meta: n_swa_pattern = 6
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 2
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 21504
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 0.125
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 27B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 27.009 B
llm_load_print_meta: model size = 15.404 GiB (4.899 BPW)
llm_load_print_meta: general.name = Gemma 3 27b It Qat
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 248 '<0x0A>'
llm_load_print_meta: EOT token = 106 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.70 MiB
llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: CPU buffer size = 1102.77 MiB
llm_load_tensors: CUDA0 buffer size = 15773.97 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 0.125
llama_kv_cache_init: CUDA0 KV buffer size = 15872.00 MiB
llama_new_context_with_model: KV self size = 15872.00 MiB, K (f16): 7936.00 MiB, V (f16): 7936.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 522.62 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 138.51 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 522.62 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 138.51 MiB
llama_new_context_with_model: graph nodes = 1806
llama_new_context_with_model: graph splits = 2
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 16, n_threads_batch = 16
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
ggml_backend_cuda_graph_compute: CUDA graph update failed
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1]
| 512 | 128 | 0 | 0.356 | 1436.25 | 3.719 | 34.42 |
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1]
| 512 | 128 | 512 | 0.372 | 1378.12 | 3.782 | 33.85 |
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1]
.
.
.
EDIT Updated Graph with ik_llama.cpp in Release mode instead of Debug mode as shown above.... sorry for the confusion!
👤 ikawrakow replied the 2025-04-27 at 07:06:13:
CUDA graphs get disabled for MoE models in ik_llama.cpp, this is why you see the warning. It was the same in mainline until very recently, their PR 12970 enables CUDA graphs for TG (and apparently hides the warning when disabling graphs for PP). Also very recently, Johannes Gaessler completely independently discovered batched processing for TG with MoE models in PR 13014. He really discovered it by himself, without ever looking at ik_llama.cpp, not even once /s
ik_llama.cpp will benefit from -t 1 when running fully on the GPU.
I was clearly confused. This is Gemma3.
👤 ikawrakow replied the 2025-04-27 at 07:49:45:
@ubergarm
The PP performance difference between mainline and ik_llama.cpp did not look plausible to me, so I ran my own benchmarks. I only have 16 GB RTX-4080, hence cannot run Gemma3-27B fully offloaded, so used Gemma3-12B instead. Q4_0 quantized in both cases, FA on with fp16 KV cache (mainline cannot use quantized cache with Gemma3), which allows me to go up to 16k context with my paltry 16 GB of VRAM. Anyway, here is what I see:
Are you sure your sweep-bench adaptation for mainline is working correctly? Gemma3 KV cache size relative to model size is quite high, so just a ~40% drop in PP performance at 32k tokens seen for mainline seems relatively unlikely.
👤 saood06 replied the 2025-04-27 at 08:10:16:
@ubergarm
Are you sure your
sweep-benchadaptation for mainline is working correctly? Gemma3 KV cache size relative to model size is quite high, so just a ~40% drop in PP performance at 32k tokens seen for mainline seems relatively unlikely.
He is missing a `llama_synchronize` call, could that account for it?

Edit: Nevermind
👤 ubergarm replied the 2025-04-27 at 17:14:57:
EDIT My compile script was messed up and putting me into DEBUG mode...

Thanks for taking a look, I too am doubting my `sweep-bench` adaptation for mainline as I just quickly got it compiling without looking too closely. Next I'll try:

- Use `-t 1` with `ik_llama.cpp` when fully offloading to GPU
- Look more closely at the `sweep-bench` adaptation as it could be inflating numbers (though for non-FA and CPU cases with GLM-4 it looked more like I expected). Thanks @saood06 for the `llama_synchronize` call, I'll try to figure out if there is something I'm missing.
- Possibly repeat with Gemma3-12B `Q4_0` to reproduce graphs like ik just gave above.
- Try good old `llama-bench` for a sanity test across a smaller range of values.

👤 ubergarm replied the 2025-04-27 at 17:39:39:
He is missing a llama_synchronize call, could that account for it?
Hrmm, I only see one `llama_synchronize(ctx);` call in the ik_llama.cpp/examples/sweep-bench/sweep-bench.cpp code, which also appears in my adaptation?

It's possible somehow I'm using the wrong number for `n_batch` etc as I don't really understand what batches and n_batches are. Also maybe some function arguments changed beyond just the names, for stuff like `llama_model_params_from_gpt_params(params);` to `common_init_from_params(params);` etc...

I'll dig around some more as I'd be quite surprised if the mainline FA CUDA implementation for dense models like gemma-3 and glm-4 was suddenly this good.
👤 ubergarm replied the 2025-04-27 at 18:02:40:
EDIT My compile script was messed up and putting me into DEBUG mode...

Using `-t 1` does seem slightly but consistently faster than `-t 16` in this one comparison, with `ik_llama.cpp` for both runs using the same bartowski quant:

👤 saood06 replied the 2025-04-27 at 19:28:28:
Look more closely at the `sweep-bench` adaptation as it could be inflating numbers (though for non-FA and CPU cases with GLM-4 it looked more like I expected). Thanks @saood06 for the `llama_synchronize` call, I'll try to figure out if there is something I'm missing. [...] Hrmm, I only see one `llama_synchronize(ctx);` call in the ik_llama.cpp/examples/sweep-bench/sweep-bench.cpp code which also appears in my adaptation?
Sorry, I was looking at this and I didn't fully expand the file. Ignore what I said.
👤 ubergarm replied the 2025-04-27 at 19:48:37:
EDIT My compile script was messed up and putting me into DEBUG mode...

I ran a plain `llama-bench` for PP only to compare and sanity check whether my adaptation of `llama-sweep-bench` is accurate at least for PP. It looks like in general `llama-sweep-bench` shows lower scores than `llama-bench`, assuming the x-axis is describing the "same thing" for both, e.g. PP context length is similar enough to `N_KV`?

👈 Logs
Simple PP Benchmark
model="/mnt/raid/models/bartowski/google_gemma-3-27b-it-qat-GGUF/google_gemma-3-27b-it-qat-Q4_K_M.gguf"
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-bench \
    --model "$model" \
    -ngl 99 \
    --mmap 0 \
    -ctk f16 -ctv f16 \
    -fa 1 \
    -p 512,1024,2048,4096,8192,16384,32768 \
    -n 0 \
    -b 2048 \
    -ub 512 \
    -r 2 \
    --n-gpu-layers 99 \
    --threads 1

mainline llama.cpp

| model | size | params | backend | ngl | threads | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp512 | 1477.77 ± 6.79 |
| gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp1024 | 1475.54 ± 2.11 |
| gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp2048 | 1460.71 ± 1.78 |
| gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp4096 | 1420.40 ± 4.39 |
| gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp8192 | 1341.60 ± 3.84 |
| gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp16384 | 1215.80 ± 3.76 |
| gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp32768 | 1031.38 ± 1.40 |

ik_llama.cpp

NOTE Every test throws a warning like this:

ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 522.62 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 12.51 MiB

| model | size | params | backend | ngl | threads | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp512 | 1439.27 ± 7.59 |
| gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp1024 | 1461.46 ± 7.53 |
| gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp2048 | 1442.42 ± 2.51 |
| gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp4096 | 1386.80 ± 2.68 |
| gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp8192 | 1282.03 ± 2.67 |
| gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp16384 | 1078.81 ± 5.99 |
| gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp32768 | 749.25 ± 4.50 |

One odd thing I noticed is in the logs ☝️ mainline llama.cpp reports a different value for
`size` and `params` than `ik_llama.cpp`. So to explore a bit more I ran the same short prompt on both with `llama-eval-callback` to possibly show if they are doing anything fundamentally different in the calculations. It's quite long, and maybe not useful, but available if anyone is interested (too long to paste in here); it seems similar, but with some differences in FUSED_RMS_NORM and such.

Finally, I'm beginning to wonder if something changed on my remote server as it was down for maintenance and reboot recently. Ever since then I've noticed these warnings printing with `ik_llama.cpp` now:

ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid

I'll poke at the server a bit to see if anything changed and also maybe roll back my `ik_llama.cpp` git repo a couple weeks in case something odd changed.

👤 ubergarm replied the 2025-04-27 at 21:25:01:

Okay, yeah, the remote server compile script was in Debug mode... I recompiled for Release and performance improved for both TG and PP and is more in line with what I would expect. Sorry for the fire drill...
👤 ikawrakow replied the 2025-04-28 at 06:08:20:
I ran a plain llama-bench for PP only to compare and sanity check if my adaptation of llama-sweep-bench is accurate at least for PP. It looks like in general llama-sweep-bench shows lower scores than llama-bench assuming the x-axis is the describing the "same thing" for both e.g. PP context length is similar enough to N_KV?
It is related, but not really the same. With llama-sweep-bench you have N_KV tokens in the KV cache and you compute n_ubatch new tokens (n_ubatch=512 by default). With llama-bench you have 0 tokens in the KV cache, and you compute N new tokens. The llama-bench calculation is done in u-batches, and as it progresses, the number of tokens in the KV cache grows from 0 to N - n_ubatch. Hence, we can think of the llama-bench result as the llama-sweep-bench result averaged between 0 and N - n_ubatch tokens. If for simplicity we assume that performance decreases linearly with N_KV, then, to first order, the llama-bench PP performance for N tokens is about the same as llama-sweep-bench with N_KV = N/2.
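(Editor's note: ik's first-order argument is easy to sanity check numerically. The toy model below uses made-up `t0` and `slope` values, not measured data; it assumes PP throughput falls linearly with KV depth and shows that averaging over u-batches, as llama-bench effectively does, lands close to the sweep-bench reading at N/2:)

```python
def sweep_bench_tps(n_kv, t0=1500.0, slope=0.014):
    """Toy model: PP tokens/sec at KV depth n_kv, decreasing linearly."""
    return t0 - slope * n_kv

def llama_bench_tps(n_tokens, n_ubatch=512):
    """llama-bench style result: n_tokens processed in u-batches while the
    KV cache grows from 0 to n_tokens - n_ubatch; overall t/s is total
    tokens divided by total time (a harmonic-style average of the sweep)."""
    total_time = sum(n_ubatch / sweep_bench_tps(kv)
                     for kv in range(0, n_tokens, n_ubatch))
    return n_tokens / total_time

n = 32768
print(llama_bench_tps(n))       # averaged over the whole sweep
print(sweep_bench_tps(n // 2))  # single reading at N_KV = N/2, close to above
```

Under the linear assumption the two numbers agree to within about a percent; the averaged figure sits slightly below the midpoint reading because slow late u-batches weigh more in total time.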
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
If you see such messages, you are running in debug mode.
👤 saood06 replied the 2025-04-28 at 07:34:44:
I ran a plain llama-bench for PP only to compare and sanity check if my adaptation of llama-sweep-bench is accurate at least for PP. It looks like in general llama-sweep-bench shows lower scores than llama-bench assuming the x-axis is the describing the "same thing" for both e.g. PP context length is similar enough to N_KV?
It is related, but not really the same. [...] If for simplicity we assume that performance decreases linearly with `N_KV`, then, to first order, the `llama-bench` PP performance for `N` tokens is about the same as `llama-sweep-bench` with `N_KV = N/2`.

You can also convert by just summing up the time taken and tokens processed up to PP and then dividing, but I like sweep-bench because it accurately reflects single-slot server usage at a depth of `N_KV`.
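(Editor's note: the summing conversion saood06 describes can be sketched directly: given sweep-bench PP rows of (N_KV, time) for fixed-size u-batches, the llama-bench-equivalent figure is just total tokens over total time. The row values below are illustrative, not a real run:)

```python
def cumulative_pp_tps(rows, n_ubatch=512):
    """Fold sweep-bench PP rows into a llama-bench-style cumulative t/s.

    rows: (n_kv, t_pp_seconds) pairs, each covering one u-batch of
    n_ubatch tokens processed at that KV depth.
    """
    total_tokens = n_ubatch * len(rows)
    total_time = sum(t for _, t in rows)
    return total_tokens / total_time

rows = [(0, 0.356), (512, 0.372), (1024, 0.389)]
print(round(cumulative_pp_tps(rows), 1))  # 1375.1
```

This gives the single averaged number llama-bench would report, whereas the per-row sweep-bench figures keep the depth-dependent slowdown visible.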
👤 Nexesenex replied the 2025-05-31 at 10:58:14:
@ubergarm Thanks for this iq4_ks quant, it works super. By the way, I tested the perplexity of a q8_0 and your qat iq4_ks in Serbian on an extract of a dataset named Sveznanje.
PPL Gemma 27b it q8_0 : Final estimate: PPL = 12.8797 +/- 0.12932
PPL Gemma 27b qat iq4_ks : Final estimate: PPL = 12.7006 +/- 0.12469
👤 Nexesenex replied the 2025-06-04 at 15:56:03:
More.
On `llama-perplexity -m E:\text-generation-webui\models\google_gemma-3-4b-it-qat-q4_0-unquantized_CHOSENQUANT.gguf -f wiki.test.raw -fa -mg 0 -ngl 150 -ts 40,0,0 -b 512 --no-mmap -c 512`:

BF16: PPL = 15.1898 +/- 0.14353
For pure Q4_0 (imat): PPL = 15.4831 +/- 0.14487
For pure IQ4_XS (imat): PPL = 14.5142 +/- 0.13311 (!!!)
For IQ4_XS (imat) with embed/output q8_0: PPL = 14.6061 +/- 0.13463
For IQ4_XS (imat) with embed/output q6_0: PPL = 14.6017 +/- 0.13465
For pure IQ4_KS (imat): PPL = 15.4225 +/- 0.14537
For pure Q4_K (imat): PPL = 14.9259 +/- 0.13876
For pure IQ4_NL (imat): PPL = 14.6848 +/- 0.13511
For pure IQ4_K (imat): PPL = 15.5956 +/- 0.14764
No comment! ^^
Note : I quantized with my fork of Llama.cpp mainline b5588 including the IQ_K quants.
https://github.com/Nexesenex/croco.cpp/tree/NXS_Llama.cpp
Reminder :
## ubergarm/gemma-3-27B-it-qat-iq4_ks.gguf (ik_llama.cpp exclusive quant)
14.099 GiB (4.484 BPW)
f32: 373 tensors
type q4_0: 62 tensors blk.*.attn_v.weight
type q8_0: 1 tensors
iq4_ks: 372 tensors
Final estimate: PPL = 8.1755 +/- 0.06296
Edit: for Gemma 3 27b qat q4_0 unquantized bf16 to pure iq4_xs with ubergarm's imatrix:

Final estimate: PPL = 8.2903 +/- 0.06439
For pure IQ4_KS (imat): PPL = 8.2450 +/- 0.06403
For IQ4_KS (imat) with embed/output q8_0: PPL = 8.1996 +/- 0.06343
For IQ4_KS (imat) with embed/output q6_0: PPL = 8.2032 +/- 0.06350
For IQ4_KS_R4 (imat) with embed/output q6_0: PPL = 8.2032 +/- 0.06349 (identical)
For IQ4_KS (imat) with embed/output q6_0 and attn_v q4_0 (13.77 GiB (4.38 BPW)): PPL = 8.1783 +/- 0.06300
For IQ4_KS (imat) with embed/output q6_0 and attn_v / attn_k in q4_0 (13.79 GiB (4.39 BPW)): PPL = 8.1968 +/- 0.06324
For IQ4_KS (imat) with embed/output q6_0 and attn_v in q5_0 / attn_k in q4_0 (13.87 GiB (4.41 BPW)): PPL = 8.2156 +/- 0.06361
For IQ4_KS (imat) with embed/output q6_0 and attn_v in iq4_nl (13.77 GiB (4.38 BPW)): PPL = 8.2128 +/- 0.06354
For IQ4_KS (imat) with embed/output q6_0 and attn_q / attn_k / attn_o / ffn_gate / ffn_up in new_iq2_kt (9.361 GiB (2.977 BPW)): PPL = 9.0237 +/- 0.06934 (not bad at all!)
For IQ4_KS (imat) with embed/output q6_0 and attn_q / attn_o / ffn_gate / ffn_up in new_iq2_kt (9.529 GiB (3.031 BPW)): PPL = 9.0063 +/- 0.06917
For IQ4_KS (imat) with embed/output q6_0 and attn_q / ffn_gate / ffn_up in new_iq2_kt (9.867 GiB (3.138 BPW)): PPL = 9.0923 +/- 0.07124
👤 ubergarm replied the 2025-06-04 at 16:54:07:
Yeah, these QATs are wild where the 4bpw "beats" the bf16!?! And for some reason the `iq4_ks` 32-block quants seem to do very well. Pretty sure the `iq4_ks` is a strict upgrade of the `iq4_xs`: as I understand it, it is the same bpw with better PPL. Thanks again for sharing your results!

Definitely check out the new hotness `iqN_kt`, which are basically QTIP / exl3 style trellis quants. So far I'd say they are like a smaller version of `iqN_k` with similar perplexity, but I need to do more testing as the implementation isn't fully baked yet.

👤 Nexesenex replied the 2025-06-05 at 02:42:48:

Well, it seems your iq4_ks was optimal. I'm satisfied with a q6_0 embed/output instead of Q8_0, but that's it. Generally, ofc iq4_ks is better, but on small models I guess some tensors are so small that the rules can be a bit different, as seen on the 4b.

On my side, I had the first Trellis CUDA implementation made by IK working on Croco (6 months ago, maybe?), but I have yet to make the second one work; it gives me gibberish for now. Probably missed a part of the code somewhere.

👤 ubergarm replied the 2025-06-08 at 22:51:42:
> Well, it seems your iq4_ks was optimal.

I went back and looked at my recipe, and oddly enough I think the `attn_k_b` are at `q4_0` and not actually `iq4_ks` for some reason. Maybe a mistake on my part or confusion about tensor dimensions vs quant limitations.

I just tried one of the new iq4_kt quants on this same gemma-3-qat, which is smaller, but initial perplexity looks a little higher than the `iq4_ks`.

> On my side, I had the first Trellis Cuda implementation made by IK working on Croco (6 months ago, maybe?) but I have yet to make work the second, it gives me gibberish for now. Probably missed a part of code somewhere.

Oh wow, I see you've had them a while! I'm just catching up lol. Yeah, I believe the exact implementation is still in flux, so I haven't released any quants with it just yet.
👤 Thireus replied the 2025-07-04 at 22:15:18:

I'm trying to spot the difference between iq4_ks and iq4_xs. They seem to have the same bpw, the same perfs and the same PPL. Am I mistaken?

👤 ikawrakow replied the 2025-07-05 at 09:25:02:
Maybe this is not the case for the model you are looking at, but typically `IQ4_KS` will have a slightly better PPL than `IQ4_XS`. Otherwise, yes, they are very similar. The differences are:

- `IQ4_XS` uses an `fp16` scale per super-block of 256; `IQ4_KS` has a single scale per tensor row.
- The 16 bits per 256 weights saved that way give 2 extra bits per block of 32. One is spent on extra precision for the block scale, one is spent to select between 2 non-linear lookup tables. Being able to choose between two lookup tables results in a slightly lower difference to the model weights being quantized.
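The bit budget described here can be sanity-checked with a little arithmetic. A sketch, assuming the 6-bit per-block-of-32 scale of the llama.cpp `IQ4_XS` layout; the redistribution of the 16 freed bits follows the description above:

```python
SUPER = 256   # super-block size
BLOCK = 32    # block size
blocks_per_super = SUPER // BLOCK  # 8 blocks of 32 per super-block

# IQ4_XS: 4-bit weights, a 6-bit scale per block of 32,
# plus one fp16 (16-bit) scale per super-block of 256.
iq4_xs_bits = SUPER * 4 + blocks_per_super * 6 + 16

# IQ4_KS: drops the per-super-block fp16 scale (one float per tensor
# row instead) and spends the freed 16 bits as 2 extra bits per block:
# one for block-scale precision, one to pick between 2 lookup tables.
iq4_ks_bits = SUPER * 4 + blocks_per_super * (6 + 2)

print(iq4_xs_bits / SUPER, iq4_ks_bits / SUPER)  # 4.25 4.25
```

Both come out to 4.25 bpw (ignoring the negligible per-row scale of `IQ4_KS`), which matches Thireus's observation that the two types have the same size; the quality difference comes only from how those bits are spent.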
As `IQ4_KS` does not need super-blocks of 256, theoretically one could remove the requirement for tensor row size being a multiple of 256. I haven't done that yet, but will if models with row sizes that are not a multiple of 256 become more common.

👤 Thireus replied the 2025-07-05 at 13:48:21:

Thank you for the clarification! Indeed on DeepSeek-R1-0528 I haven't noticed a difference.

👤 saood06 replied the 2025-07-07 at 04:47:41:
> As `IQ4_KS` does not need super-blocks of 256, theoretically one could remove the requirement for tensor row size being a multiple of 256. I haven't done that yet, but will if models with row sizes that are not a multiple of 256 become more common.

Does that mean `IQ4_KS` and `IQ5_KS` could be used for the KV cache, or is there some other limitation?

👤 ikawrakow replied the 2025-07-07 at 05:07:40:

> Does that mean IQ4_KS and IQ5_KS could be used for the KV cache, or is there some other limitation?

Theoretically yes, but with caveats:
- The quantization needs to use a simpler and less accurate algorithm, else storing data in the cache will be too slow. This is already done for `IQ4_NL`, and `IQ4_KS`/`IQ5_KS` will be similar.
- One runs into the same issue as with `Q8_KV` for DeepSeek, where the V cache is just a view into the K cache, so it misses the row scale.

👤 saood06 replied the 2025-07-07 at 05:40:24:
> The quantization needs to use a simpler and less accurate algorithm, else storing data in the cache will be too slow. This is already done for IQ4_NL, and IQ4_KS/IQ5_KS will be similar

I didn't know that. Based on that, do you think they would offer quality/size benefits over the existing types?
👤 ikawrakow replied the 2025-07-07 at 07:59:21:
`IQ4_NL` halves the PPL difference between `Q4_0` KV cache and `fp16` KV cache, but is somewhat higher than `Q5_0` KV cache. My guess is that `IQ4_KS` will perform similarly to `IQ4_NL`. Not sure if/how much better `IQ5_KS` will be compared to `Q5_0` for KV cache quantization.
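For anyone wanting to reproduce these KV-cache comparisons: the cache type is selected at run time with the `-ctk`/`-ctv` (`--cache-type-k`/`--cache-type-v`) flags. A hedged sketch; the set of accepted cache types varies by build (`iq4_nl` and `q6_0` caches are ik_llama.cpp additions, so treat them as assumptions and check the help output):

```shell
# Sketch: run with a quantized KV cache. Supported type names depend
# on the build -- verify with --help before relying on iq4_nl here.
./build/bin/llama-server -m model.gguf \
    -ctk iq4_nl -ctv iq4_nl \
    -fa   # flash attention is typically required for a quantized V cache
```

Swapping `iq4_nl` for `q4_0`, `q5_0`, or `q8_0` and re-running `llama-perplexity` with the same flags gives the kind of comparison discussed above.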