### 🗣️ [#334](https://github.com/ikawrakow/ik_llama.cpp/discussions/334) - `iq4_ks` performs great on gemma-3-27b-it-qat-q4_0-unquantized

| **Author** | `ubergarm` |
| :--- | :--- |
| **Created** | 2025-04-18 |
| **Updated** | 2025-07-07 |

---

#### Description

*EDIT*: Just uploaded the `ik_llama.cpp` exclusive quants for best quality in minimum VRAM to Hugging Face: [ubergarm/gemma-3-27b-it-qat-GGUF](https://huggingface.co/ubergarm/gemma-3-27b-it-qat-GGUF).

I saw Google released their [google/gemma-3-27b-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-unquantized) original `.safetensors` unquantized model. It is supposedly designed for `q4_0` quantization, which was released earlier in GGUF format.

> Thanks to Quantization Aware Training (QAT), the model is able to preserve similar quality as bfloat16 while significantly reducing the memory requirements to load the model.

I used mainline llama.cpp to convert the `.safetensors` to `bf16` and then used `ik_llama.cpp` to cook some quants to compare size and perplexity. Here are the results, which interestingly suggest `iq4_ks` has lower perplexity than the original `bf16` (the `q8_0` does too)!

![gemma-data](https://github.com/user-attachments/assets/6a28b711-773e-4fb3-bb2c-742d5d8d530c)

![gemma-sweep](https://github.com/user-attachments/assets/3e7cbef4-4ff2-4ffe-9a6b-4f519c5147d9)
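As a rough sanity check on the comparison above, the reported perplexities can be set against their error bars. The sketch below uses two figures from the raw data in this post (the QAT `bf16` baseline and `iq4_ks`). The `ppl_delta` helper name is hypothetical, and treating the two uncertainties as independent is my own simplifying assumption: both runs use the same test text, so the errors are probably correlated and this likely overstates the noise.

```shell
#!/usr/bin/env bash
# Compare two perplexity results: prints the difference and the difference
# expressed in units of the combined uncertainty.
# ASSUMPTION: the two +/- values are treated as independent, which is a
# simplification because both runs share the same wiki.test.raw text.
ppl_delta() {
  # args: base_ppl base_err ppl err
  awk -v b="$1" -v be="$2" -v p="$3" -v pe="$4" 'BEGIN {
    d = p - b                      # negative means lower PPL than the baseline
    s = sqrt(be * be + pe * pe)    # naive combined uncertainty
    printf "%.4f %.2f\n", d, d / s
  }'
}

# QAT bf16 baseline vs iq4_ks, values copied from the perplexity results
ppl_delta 8.2021 0.06387 8.1755 0.06296   # prints: -0.0266 -0.30
```

Under that assumption the `iq4_ks` result sits about 0.3 combined error bars below the QAT `bf16` baseline, so the two are close enough that the ordering should be read with some caution.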
Raw Data

#### Perplexity

```
## google/gemma-3-27b-it-BF16-00001-of-00002.gguf
50.311 GiB (16.001 BPW)
f32: 373 tensors
bf16: 435 tensors
Final estimate: PPL = 8.4276 +/- 0.06705

## google/gemma-3-27b-it-qat-q4_0-unquantized-BF16-00001-of-00002.gguf
50.311 GiB (16.001 BPW)
f32: 373 tensors
bf16: 435 tensors
Final estimate: PPL = 8.2021 +/- 0.06387

## google/gemma-3-27b-it-qat-q4_0-gguf/gemma-3-27b-it-q4_0.gguf
16.040 GiB (5.101 BPW)
f32: 373 tensors
f16: 1 tensors
q4_0: 434 tensors
Final estimate: PPL = 8.2500 +/- 0.06375

## ubergarm/gemma-3-27B-it-qat-q8_0.gguf
26.730 GiB (8.501 BPW)
f32: 373 tensors
q8_0: 435 tensors
Final estimate: PPL = 8.1890 +/- 0.06369

## ubergarm/gemma-3-27B-it-qat-q4_0.gguf
14.857 GiB (4.725 BPW)
f32: 373 tensors
q4_0: 427 tensors
q4_1: 7 tensors (blk.[0-6].ffn_down.weight not sure why this happened?)
q8_0: 1 tensors (token_embd.weight)
Final estimate: PPL = 8.2264 +/- 0.06350

## ubergarm/gemma-3-27B-it-qat-pure-q4_0.gguf
14.810 GiB (4.710 BPW)
f32: 373 tensors
q4_0: 434 tensors
q8_0: 1 tensors
Final estimate: PPL = 8.2235 +/- 0.06345

## ubergarm/gemma-3-27B-it-qat-iq4_xs.gguf
14.085 GiB (4.480 BPW)
f32: 373 tensors
q4_0: 62 tensors (blk.*.attn_v.weight can't be iq4_xs due to tensor size)
q8_0: 1 tensors
iq4_xs: 372 tensors
Final estimate: PPL = 8.2290 +/- 0.06365

## ubergarm/gemma-3-27B-it-qat-iq4_ks.gguf (ik_llama.cpp exclusive quant)
14.099 GiB (4.484 BPW)
f32: 373 tensors
q4_0: 62 tensors (blk.*.attn_v.weight)
q8_0: 1 tensors
iq4_ks: 372 tensors
Final estimate: PPL = 8.1755 +/- 0.06296

## ubergarm/gemma-3-27B-it-qat-COSSIM-iq3_k.gguf (ik_llama.cpp exclusive quants)
12.875 GiB (4.095 BPW)
f32: 373 tensors
q4_0: 62 tensors (blk.*.attn_v.weight can't be iq3_k/iq4_ks due to tensor size)
q8_0: 1 tensors
iq3_k: 192 tensors
iq4_ks: 180 tensors (most important 30 layers by cosine similarity scores)
Final estimate: PPL = 8.2642 +/- 0.06359

## ubergarm/gemma-3-27B-it-qat-mix-iq3_k.gguf (ik_llama.cpp exclusive quant)
12.733 GiB (4.050 BPW)
f32: 373 tensors
q4_0: 62 tensors (blk.*.attn_v.weight)
q8_0: 1 tensors
iq3_k: 124 tensors (ffn_(gate|up).weight)
iq4_ks: 248 tensors (ffn_down.weight)
Final estimate: PPL = 8.2367 +/- 0.06329

## ubergarm/gemma-3-27B-it-qat-iq4_nl.gguf
14.810 GiB (4.710 BPW)
f32: 373 tensors
q4_0: 62 tensors
q8_0: 1 tensors
iq4_nl: 372 tensors
Final estimate: PPL = 8.2477 +/- 0.06390

## ubergarm/gemma-3-27B-it-qat-q4_k_m.gguf
14.810 GiB (4.710 BPW)
q4_0: 62 tensors
q8_0: 1 tensors
q4_K: 372 tensors
Final estimate: PPL = 8.2303 +/- 0.06364
```

#### Sweep Bench

```
## gemma-3-27B-it-qat-COSSIM-iq3_k.gguf
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.430 | 1191.53 | 3.764 | 34.01 |
| 512 | 128 | 512 | 0.406 | 1262.23 | 3.844 | 33.30 |
| 512 | 128 | 1024 | 0.415 | 1232.50 | 3.930 | 32.57 |
| 512 | 128 | 1536 | 0.426 | 1203.18 | 4.016 | 31.87 |
| 512 | 128 | 2048 | 0.436 | 1175.36 | 4.100 | 31.22 |
| 512 | 128 | 2560 | 0.445 | 1150.88 | 4.176 | 30.65 |
| 512 | 128 | 3072 | 0.455 | 1124.73 | 4.258 | 30.06 |
| 512 | 128 | 3584 | 0.465 | 1101.76 | 4.342 | 29.48 |
| 512 | 128 | 4096 | 0.474 | 1080.41 | 4.420 | 28.96 |
| 512 | 128 | 4608 | 0.483 | 1061.01 | 4.497 | 28.46 |
| 512 | 128 | 5120 | 0.493 | 1037.79 | 4.571 | 28.00 |
| 512 | 128 | 5632 | 0.502 | 1020.61 | 4.649 | 27.53 |
| 512 | 128 | 6144 | 0.511 | 1002.09 | 4.736 | 27.03 |
| 512 | 128 | 6656 | 0.521 | 983.37 | 4.810 | 26.61 |
| 512 | 128 | 7168 | 0.529 | 968.01 | 4.878 | 26.24 |
| 512 | 128 | 7680 | 0.537 | 952.66 | 4.948 | 25.87 |
| 512 | 128 | 8192 | 0.547 | 935.93 | 5.018 | 25.51 |
| 512 | 128 | 8704 | 0.555 | 922.51 | 5.088 | 25.15 |
| 512 | 128 | 9216 | 0.563 | 908.66 | 5.160 | 24.81 |
| 512 | 128 | 9728 | 0.573 | 894.03 | 5.236 | 24.45 |
| 512 | 128 | 10240 | 0.582 | 880.17 | 5.299 | 24.16 |
| 512 | 128 | 10752 | 0.591 | 866.73 | 5.372 | 23.83 |
| 512 | 128 | 11264 | 0.597 | 857.09 | 
5.433 | 23.56 | | 512 | 128 | 11776 | 0.608 | 842.77 | 5.502 | 23.26 | | 512 | 128 | 12288 | 0.617 | 830.05 | 5.573 | 22.97 | | 512 | 128 | 12800 | 0.625 | 819.12 | 5.637 | 22.71 | | 512 | 128 | 13312 | 0.635 | 805.99 | 5.703 | 22.44 | | 512 | 128 | 13824 | 0.642 | 796.94 | 5.768 | 22.19 | | 512 | 128 | 14336 | 0.649 | 788.86 | 5.834 | 21.94 | | 512 | 128 | 14848 | 0.657 | 778.90 | 5.896 | 21.71 | | 512 | 128 | 15360 | 0.667 | 768.12 | 5.958 | 21.49 | | 512 | 128 | 15872 | 0.673 | 760.21 | 6.019 | 21.27 | | 512 | 128 | 16384 | 0.682 | 750.51 | 6.077 | 21.06 | | 512 | 128 | 16896 | 0.690 | 742.39 | 6.139 | 20.85 | | 512 | 128 | 17408 | 0.698 | 733.09 | 6.205 | 20.63 | | 512 | 128 | 17920 | 0.707 | 724.07 | 6.274 | 20.40 | | 512 | 128 | 18432 | 0.715 | 716.34 | 6.333 | 20.21 | | 512 | 128 | 18944 | 0.723 | 708.48 | 6.391 | 20.03 | | 512 | 128 | 19456 | 0.732 | 699.61 | 6.457 | 19.82 | | 512 | 128 | 19968 | 0.738 | 693.42 | 6.524 | 19.62 | | 512 | 128 | 20480 | 0.748 | 684.36 | 6.584 | 19.44 | | 512 | 128 | 20992 | 0.756 | 677.00 | 6.650 | 19.25 | | 512 | 128 | 21504 | 0.764 | 670.11 | 6.718 | 19.05 | | 512 | 128 | 22016 | 0.773 | 662.68 | 6.787 | 18.86 | | 512 | 128 | 22528 | 0.782 | 654.88 | 6.857 | 18.67 | | 512 | 128 | 23040 | 0.789 | 648.73 | 6.922 | 18.49 | | 512 | 128 | 23552 | 0.799 | 641.01 | 6.993 | 18.30 | | 512 | 128 | 24064 | 0.809 | 632.98 | 7.063 | 18.12 | | 512 | 128 | 24576 | 0.817 | 626.98 | 7.129 | 17.96 | | 512 | 128 | 25088 | 0.828 | 618.46 | 7.204 | 17.77 | | 512 | 128 | 25600 | 0.837 | 612.03 | 7.272 | 17.60 | | 512 | 128 | 26112 | 0.845 | 605.89 | 7.345 | 17.43 | | 512 | 128 | 26624 | 0.854 | 599.28 | 7.422 | 17.25 | | 512 | 128 | 27136 | 0.863 | 593.26 | 7.490 | 17.09 | | 512 | 128 | 27648 | 0.872 | 587.02 | 7.562 | 16.93 | | 512 | 128 | 28160 | 0.881 | 581.17 | 7.640 | 16.75 | | 512 | 128 | 28672 | 0.889 | 575.97 | 7.707 | 16.61 | | 512 | 128 | 29184 | 0.899 | 569.59 | 7.783 | 16.45 | | 512 | 128 | 29696 | 0.907 | 564.44 | 7.848 | 16.31 | | 
512 | 128 | 30208 | 0.916 | 558.91 | 7.925 | 16.15 | | 512 | 128 | 30720 | 0.928 | 551.99 | 8.001 | 16.00 | | 512 | 128 | 31232 | 0.938 | 545.95 | 8.070 | 15.86 | | 512 | 128 | 31744 | 0.946 | 541.38 | 8.139 | 15.73 | | 512 | 128 | 32256 | 0.955 | 536.31 | 8.215 | 15.58 | ## gemma-3-27B-it-qat-iq4_ks.gguf | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | |-------|--------|--------|----------|----------|----------|----------| | 512 | 128 | 0 | 0.442 | 1157.61 | 3.846 | 33.29 | | 512 | 128 | 512 | 0.420 | 1219.61 | 3.918 | 32.67 | | 512 | 128 | 1024 | 0.429 | 1192.64 | 3.988 | 32.09 | | 512 | 128 | 1536 | 0.437 | 1171.22 | 4.058 | 31.54 | | 512 | 128 | 2048 | 0.445 | 1150.45 | 4.115 | 31.10 | | 512 | 128 | 2560 | 0.454 | 1126.95 | 4.185 | 30.58 | | 512 | 128 | 3072 | 0.463 | 1106.45 | 4.251 | 30.11 | | 512 | 128 | 3584 | 0.471 | 1087.13 | 4.316 | 29.66 | | 512 | 128 | 4096 | 0.480 | 1067.46 | 4.383 | 29.20 | | 512 | 128 | 4608 | 0.488 | 1048.69 | 4.452 | 28.75 | | 512 | 128 | 5120 | 0.496 | 1031.55 | 4.517 | 28.34 | | 512 | 128 | 5632 | 0.504 | 1016.37 | 4.587 | 27.90 | | 512 | 128 | 6144 | 0.513 | 997.99 | 4.653 | 27.51 | | 512 | 128 | 6656 | 0.520 | 983.90 | 4.718 | 27.13 | | 512 | 128 | 7168 | 0.529 | 968.10 | 4.788 | 26.73 | | 512 | 128 | 7680 | 0.537 | 954.18 | 4.856 | 26.36 | | 512 | 128 | 8192 | 0.548 | 934.03 | 4.916 | 26.04 | | 512 | 128 | 8704 | 0.554 | 923.73 | 4.982 | 25.69 | | 512 | 128 | 9216 | 0.562 | 910.81 | 5.050 | 25.35 | | 512 | 128 | 9728 | 0.571 | 896.48 | 5.118 | 25.01 | | 512 | 128 | 10240 | 0.582 | 880.31 | 5.187 | 24.68 | | 512 | 128 | 10752 | 0.589 | 869.67 | 5.252 | 24.37 | | 512 | 128 | 11264 | 0.597 | 857.77 | 5.320 | 24.06 | | 512 | 128 | 11776 | 0.606 | 844.62 | 5.386 | 23.77 | | 512 | 128 | 12288 | 0.615 | 832.15 | 5.453 | 23.47 | | 512 | 128 | 12800 | 0.622 | 823.56 | 5.519 | 23.19 | | 512 | 128 | 13312 | 0.631 | 811.71 | 5.587 | 22.91 | | 512 | 128 | 13824 | 0.639 | 801.76 | 5.656 | 22.63 | | 512 | 128 | 14336 | 0.648 | 
790.40 | 5.721 | 22.37 | | 512 | 128 | 14848 | 0.656 | 779.95 | 5.788 | 22.11 | | 512 | 128 | 15360 | 0.664 | 771.19 | 5.853 | 21.87 | | 512 | 128 | 15872 | 0.674 | 759.82 | 5.927 | 21.60 | | 512 | 128 | 16384 | 0.683 | 749.28 | 5.991 | 21.37 | | 512 | 128 | 16896 | 0.691 | 740.99 | 6.056 | 21.14 | | 512 | 128 | 17408 | 0.700 | 731.77 | 6.125 | 20.90 | | 512 | 128 | 17920 | 0.708 | 722.66 | 6.196 | 20.66 | | 512 | 128 | 18432 | 0.717 | 714.30 | 6.266 | 20.43 | | 512 | 128 | 18944 | 0.726 | 705.03 | 6.337 | 20.20 | | 512 | 128 | 19456 | 0.735 | 696.94 | 6.405 | 19.98 | | 512 | 128 | 19968 | 0.743 | 688.78 | 6.478 | 19.76 | | 512 | 128 | 20480 | 0.752 | 681.04 | 6.549 | 19.54 | | 512 | 128 | 20992 | 0.762 | 672.17 | 6.619 | 19.34 | | 512 | 128 | 21504 | 0.770 | 664.73 | 6.690 | 19.13 | | 512 | 128 | 22016 | 0.778 | 658.00 | 6.760 | 18.93 | | 512 | 128 | 22528 | 0.787 | 650.37 | 6.825 | 18.75 | | 512 | 128 | 23040 | 0.796 | 643.02 | 6.893 | 18.57 | | 512 | 128 | 23552 | 0.804 | 636.63 | 6.959 | 18.39 | | 512 | 128 | 24064 | 0.813 | 629.68 | 7.033 | 18.20 | | 512 | 128 | 24576 | 0.822 | 622.86 | 7.096 | 18.04 | | 512 | 128 | 25088 | 0.830 | 616.54 | 7.164 | 17.87 | | 512 | 128 | 25600 | 0.839 | 610.32 | 7.235 | 17.69 | | 512 | 128 | 26112 | 0.847 | 604.35 | 7.300 | 17.53 | | 512 | 128 | 26624 | 0.856 | 597.85 | 7.366 | 17.38 | | 512 | 128 | 27136 | 0.865 | 591.85 | 7.434 | 17.22 | | 512 | 128 | 27648 | 0.874 | 585.75 | 7.500 | 17.07 | | 512 | 128 | 28160 | 0.881 | 581.29 | 7.566 | 16.92 | | 512 | 128 | 28672 | 0.890 | 575.07 | 7.640 | 16.75 | | 512 | 128 | 29184 | 0.899 | 569.51 | 7.695 | 16.63 | | 512 | 128 | 29696 | 0.910 | 562.68 | 7.767 | 16.48 | | 512 | 128 | 30208 | 0.917 | 558.17 | 7.834 | 16.34 | | 512 | 128 | 30720 | 0.928 | 551.45 | 7.895 | 16.21 | | 512 | 128 | 31232 | 0.935 | 547.36 | 7.963 | 16.07 | | 512 | 128 | 31744 | 0.943 | 543.01 | 8.026 | 15.95 | | 512 | 128 | 32256 | 0.951 | 538.16 | 8.096 | 15.81 | ## gemma-3-27B-it-qat-iq4_xs.gguf | PP | TG | 
N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | |-------|--------|--------|----------|----------|----------|----------| | 512 | 128 | 0 | 0.354 | 1445.31 | 3.869 | 33.09 | | 512 | 128 | 512 | 0.359 | 1426.20 | 3.938 | 32.50 | | 512 | 128 | 1024 | 0.368 | 1392.49 | 4.009 | 31.93 | | 512 | 128 | 1536 | 0.376 | 1360.20 | 4.079 | 31.38 | | 512 | 128 | 2048 | 0.385 | 1330.26 | 4.145 | 30.88 | | 512 | 128 | 2560 | 0.393 | 1302.43 | 4.215 | 30.37 | | 512 | 128 | 3072 | 0.402 | 1273.64 | 4.280 | 29.91 | | 512 | 128 | 3584 | 0.409 | 1251.28 | 4.350 | 29.42 | | 512 | 128 | 4096 | 0.418 | 1226.26 | 4.419 | 28.97 | | 512 | 128 | 4608 | 0.427 | 1199.54 | 4.487 | 28.52 | | 512 | 128 | 5120 | 0.435 | 1175.95 | 4.552 | 28.12 | | 512 | 128 | 5632 | 0.446 | 1146.84 | 4.625 | 27.67 | | 512 | 128 | 6144 | 0.455 | 1126.16 | 4.693 | 27.27 | | 512 | 128 | 6656 | 0.464 | 1103.10 | 4.761 | 26.88 | | 512 | 128 | 7168 | 0.472 | 1085.43 | 4.829 | 26.51 | | 512 | 128 | 7680 | 0.480 | 1066.39 | 4.896 | 26.15 | | 512 | 128 | 8192 | 0.488 | 1048.19 | 4.966 | 25.77 | | 512 | 128 | 8704 | 0.498 | 1027.72 | 5.037 | 25.41 | | 512 | 128 | 9216 | 0.505 | 1013.91 | 5.107 | 25.06 | | 512 | 128 | 9728 | 0.514 | 996.65 | 5.177 | 24.72 | | 512 | 128 | 10240 | 0.522 | 981.43 | 5.241 | 24.42 | | 512 | 128 | 10752 | 0.530 | 966.08 | 5.311 | 24.10 | | 512 | 128 | 11264 | 0.540 | 948.94 | 5.380 | 23.79 | | 512 | 128 | 11776 | 0.547 | 935.81 | 5.448 | 23.50 | | 512 | 128 | 12288 | 0.556 | 921.26 | 5.515 | 23.21 | | 512 | 128 | 12800 | 0.564 | 907.03 | 5.582 | 22.93 | | 512 | 128 | 13312 | 0.572 | 894.35 | 5.648 | 22.66 | | 512 | 128 | 13824 | 0.581 | 880.65 | 5.714 | 22.40 | | 512 | 128 | 14336 | 0.590 | 868.50 | 5.781 | 22.14 | | 512 | 128 | 14848 | 0.598 | 856.72 | 5.855 | 21.86 | | 512 | 128 | 15360 | 0.605 | 846.01 | 5.920 | 21.62 | | 512 | 128 | 15872 | 0.613 | 835.56 | 5.980 | 21.40 | | 512 | 128 | 16384 | 0.623 | 821.71 | 6.050 | 21.16 | | 512 | 128 | 16896 | 0.633 | 808.95 | 6.111 | 20.94 | | 512 | 128 | 
17408 | 0.640 | 799.92 | 6.176 | 20.73 | | 512 | 128 | 17920 | 0.648 | 789.93 | 6.241 | 20.51 | | 512 | 128 | 18432 | 0.658 | 777.96 | 6.308 | 20.29 | | 512 | 128 | 18944 | 0.666 | 768.87 | 6.374 | 20.08 | | 512 | 128 | 19456 | 0.677 | 756.54 | 6.447 | 19.85 | | 512 | 128 | 19968 | 0.683 | 749.66 | 6.512 | 19.65 | | 512 | 128 | 20480 | 0.693 | 738.62 | 6.581 | 19.45 | | 512 | 128 | 20992 | 0.701 | 730.71 | 6.650 | 19.25 | | 512 | 128 | 21504 | 0.708 | 723.54 | 6.715 | 19.06 | | 512 | 128 | 22016 | 0.718 | 712.66 | 6.784 | 18.87 | | 512 | 128 | 22528 | 0.727 | 703.84 | 6.848 | 18.69 | | 512 | 128 | 23040 | 0.735 | 696.52 | 6.915 | 18.51 | | 512 | 128 | 23552 | 0.744 | 688.40 | 6.984 | 18.33 | | 512 | 128 | 24064 | 0.754 | 679.13 | 7.057 | 18.14 | | 512 | 128 | 24576 | 0.760 | 673.38 | 7.122 | 17.97 | | 512 | 128 | 25088 | 0.768 | 666.65 | 7.189 | 17.80 | | 512 | 128 | 25600 | 0.777 | 658.95 | 7.258 | 17.64 | | 512 | 128 | 26112 | 0.788 | 649.70 | 7.327 | 17.47 | | 512 | 128 | 26624 | 0.796 | 643.27 | 7.398 | 17.30 | | 512 | 128 | 27136 | 0.804 | 637.20 | 7.467 | 17.14 | | 512 | 128 | 27648 | 0.813 | 629.55 | 7.538 | 16.98 | | 512 | 128 | 28160 | 0.822 | 622.77 | 7.613 | 16.81 | | 512 | 128 | 28672 | 0.834 | 614.26 | 7.685 | 16.66 | | 512 | 128 | 29184 | 0.841 | 609.06 | 7.753 | 16.51 | | 512 | 128 | 29696 | 0.847 | 604.25 | 7.823 | 16.36 | | 512 | 128 | 30208 | 0.854 | 599.32 | 7.890 | 16.22 | | 512 | 128 | 30720 | 0.867 | 590.81 | 7.960 | 16.08 | | 512 | 128 | 31232 | 0.877 | 584.02 | 8.034 | 15.93 | | 512 | 128 | 31744 | 0.883 | 579.81 | 8.099 | 15.80 | | 512 | 128 | 32256 | 0.893 | 573.29 | 8.167 | 15.67 | ## gemma-3-27B-it-qat-mix-iq3_k.gguf | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | |-------|--------|--------|----------|----------|----------|----------| | 512 | 128 | 0 | 0.447 | 1145.52 | 3.945 | 32.44 | | 512 | 128 | 512 | 0.417 | 1228.76 | 4.014 | 31.89 | | 512 | 128 | 1024 | 0.426 | 1202.30 | 4.076 | 31.41 | | 512 | 128 | 1536 | 0.434 | 
1179.42 | 4.146 | 30.87 | | 512 | 128 | 2048 | 0.443 | 1156.00 | 4.212 | 30.39 | | 512 | 128 | 2560 | 0.451 | 1134.72 | 4.280 | 29.91 | | 512 | 128 | 3072 | 0.460 | 1113.87 | 4.348 | 29.44 | | 512 | 128 | 3584 | 0.468 | 1094.14 | 4.415 | 28.99 | | 512 | 128 | 4096 | 0.477 | 1073.18 | 4.482 | 28.56 | | 512 | 128 | 4608 | 0.485 | 1055.40 | 4.550 | 28.13 | | 512 | 128 | 5120 | 0.495 | 1034.95 | 4.621 | 27.70 | | 512 | 128 | 5632 | 0.503 | 1017.63 | 4.689 | 27.30 | | 512 | 128 | 6144 | 0.511 | 1001.37 | 4.757 | 26.91 | | 512 | 128 | 6656 | 0.519 | 986.65 | 4.825 | 26.53 | | 512 | 128 | 7168 | 0.528 | 970.07 | 4.892 | 26.16 | | 512 | 128 | 7680 | 0.537 | 954.21 | 4.959 | 25.81 | | 512 | 128 | 8192 | 0.545 | 939.29 | 5.029 | 25.45 | | 512 | 128 | 8704 | 0.554 | 923.71 | 5.097 | 25.11 | | 512 | 128 | 9216 | 0.562 | 911.09 | 5.166 | 24.78 | | 512 | 128 | 9728 | 0.569 | 900.24 | 5.235 | 24.45 | | 512 | 128 | 10240 | 0.578 | 885.58 | 5.302 | 24.14 | | 512 | 128 | 10752 | 0.586 | 873.69 | 5.371 | 23.83 | | 512 | 128 | 11264 | 0.595 | 859.80 | 5.439 | 23.53 | | 512 | 128 | 11776 | 0.604 | 847.90 | 5.506 | 23.25 | | 512 | 128 | 12288 | 0.614 | 834.40 | 5.572 | 22.97 | | 512 | 128 | 12800 | 0.621 | 824.19 | 5.635 | 22.71 | | 512 | 128 | 13312 | 0.631 | 811.51 | 5.704 | 22.44 | | 512 | 128 | 13824 | 0.638 | 803.08 | 5.768 | 22.19 | | 512 | 128 | 14336 | 0.646 | 793.10 | 5.831 | 21.95 | | 512 | 128 | 14848 | 0.656 | 780.23 | 5.905 | 21.68 | | 512 | 128 | 15360 | 0.664 | 771.05 | 5.968 | 21.45 | | 512 | 128 | 15872 | 0.673 | 761.30 | 6.033 | 21.22 | | 512 | 128 | 16384 | 0.681 | 752.03 | 6.102 | 20.98 | | 512 | 128 | 16896 | 0.690 | 741.95 | 6.169 | 20.75 | | 512 | 128 | 17408 | 0.698 | 733.85 | 6.237 | 20.52 | | 512 | 128 | 17920 | 0.707 | 724.39 | 6.304 | 20.30 | | 512 | 128 | 18432 | 0.716 | 715.50 | 6.371 | 20.09 | | 512 | 128 | 18944 | 0.724 | 707.19 | 6.440 | 19.88 | | 512 | 128 | 19456 | 0.732 | 699.61 | 6.502 | 19.69 | | 512 | 128 | 19968 | 0.740 | 692.05 | 6.573 | 19.47 | 
| 512 | 128 | 20480 | 0.749 | 683.36 | 6.642 | 19.27 | | 512 | 128 | 20992 | 0.758 | 675.25 | 6.713 | 19.07 | | 512 | 128 | 21504 | 0.766 | 668.41 | 6.785 | 18.87 | | 512 | 128 | 22016 | 0.776 | 660.00 | 6.853 | 18.68 | | 512 | 128 | 22528 | 0.783 | 653.50 | 6.922 | 18.49 | | 512 | 128 | 23040 | 0.793 | 645.39 | 6.994 | 18.30 | | 512 | 128 | 23552 | 0.801 | 639.13 | 7.061 | 18.13 | | 512 | 128 | 24064 | 0.811 | 631.43 | 7.133 | 17.94 | | 512 | 128 | 24576 | 0.820 | 624.08 | 7.200 | 17.78 | | 512 | 128 | 25088 | 0.828 | 618.62 | 7.270 | 17.61 | | 512 | 128 | 25600 | 0.837 | 611.49 | 7.340 | 17.44 | | 512 | 128 | 26112 | 0.844 | 606.28 | 7.408 | 17.28 | | 512 | 128 | 26624 | 0.855 | 599.11 | 7.475 | 17.12 | | 512 | 128 | 27136 | 0.862 | 594.18 | 7.541 | 16.97 | | 512 | 128 | 27648 | 0.873 | 586.66 | 7.604 | 16.83 | | 512 | 128 | 28160 | 0.880 | 581.61 | 7.674 | 16.68 | | 512 | 128 | 28672 | 0.888 | 576.73 | 7.744 | 16.53 | | 512 | 128 | 29184 | 0.897 | 571.07 | 7.810 | 16.39 | | 512 | 128 | 29696 | 0.905 | 565.80 | 7.874 | 16.26 | | 512 | 128 | 30208 | 0.916 | 558.82 | 7.944 | 16.11 | | 512 | 128 | 30720 | 0.923 | 555.00 | 8.012 | 15.98 | | 512 | 128 | 31232 | 0.934 | 548.01 | 8.078 | 15.84 | | 512 | 128 | 31744 | 0.942 | 543.50 | 8.145 | 15.72 | | 512 | 128 | 32256 | 0.951 | 538.44 | 8.212 | 15.59 | ## gemma-3-27B-it-qat-pure-q4_0.gguf | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | |-------|--------|--------|----------|----------|----------|----------| | 512 | 128 | 0 | 0.323 | 1585.65 | 3.516 | 36.40 | | 512 | 128 | 512 | 0.328 | 1559.55 | 3.585 | 35.71 | | 512 | 128 | 1024 | 0.338 | 1514.58 | 3.650 | 35.06 | | 512 | 128 | 1536 | 0.346 | 1477.87 | 3.716 | 34.45 | | 512 | 128 | 2048 | 0.356 | 1438.88 | 3.780 | 33.86 | | 512 | 128 | 2560 | 0.364 | 1407.96 | 3.845 | 33.29 | | 512 | 128 | 3072 | 0.372 | 1376.65 | 3.910 | 32.74 | | 512 | 128 | 3584 | 0.380 | 1347.00 | 3.974 | 32.21 | | 512 | 128 | 4096 | 0.389 | 1315.56 | 4.038 | 31.70 | | 512 | 128 | 4608 | 
0.398 | 1286.90 | 4.105 | 31.18 | | 512 | 128 | 5120 | 0.405 | 1264.27 | 4.172 | 30.68 | | 512 | 128 | 5632 | 0.413 | 1238.73 | 4.234 | 30.23 | | 512 | 128 | 6144 | 0.422 | 1213.02 | 4.300 | 29.77 | | 512 | 128 | 6656 | 0.430 | 1189.78 | 4.365 | 29.33 | | 512 | 128 | 7168 | 0.440 | 1162.71 | 4.436 | 28.85 | | 512 | 128 | 7680 | 0.448 | 1143.27 | 4.498 | 28.46 | | 512 | 128 | 8192 | 0.458 | 1118.80 | 4.564 | 28.04 | | 512 | 128 | 8704 | 0.466 | 1098.96 | 4.632 | 27.63 | | 512 | 128 | 9216 | 0.474 | 1080.03 | 4.697 | 27.25 | | 512 | 128 | 9728 | 0.483 | 1059.90 | 4.768 | 26.85 | | 512 | 128 | 10240 | 0.492 | 1041.25 | 4.834 | 26.48 | | 512 | 128 | 10752 | 0.500 | 1024.39 | 4.903 | 26.11 | | 512 | 128 | 11264 | 0.509 | 1006.30 | 4.968 | 25.76 | | 512 | 128 | 11776 | 0.517 | 989.51 | 5.035 | 25.42 | | 512 | 128 | 12288 | 0.526 | 972.95 | 5.102 | 25.09 | | 512 | 128 | 12800 | 0.534 | 958.30 | 5.171 | 24.75 | | 512 | 128 | 13312 | 0.542 | 945.13 | 5.236 | 24.44 | | 512 | 128 | 13824 | 0.551 | 928.86 | 5.302 | 24.14 | | 512 | 128 | 14336 | 0.560 | 915.04 | 5.368 | 23.84 | | 512 | 128 | 14848 | 0.568 | 900.75 | 5.437 | 23.54 | | 512 | 128 | 15360 | 0.577 | 887.69 | 5.503 | 23.26 | | 512 | 128 | 15872 | 0.586 | 874.12 | 5.570 | 22.98 | | 512 | 128 | 16384 | 0.594 | 861.42 | 5.634 | 22.72 | | 512 | 128 | 16896 | 0.603 | 849.78 | 5.702 | 22.45 | | 512 | 128 | 17408 | 0.611 | 838.06 | 5.770 | 22.18 | | 512 | 128 | 17920 | 0.619 | 826.92 | 5.837 | 21.93 | | 512 | 128 | 18432 | 0.628 | 815.75 | 5.902 | 21.69 | | 512 | 128 | 18944 | 0.635 | 805.86 | 5.970 | 21.44 | | 512 | 128 | 19456 | 0.645 | 794.21 | 6.030 | 21.23 | | 512 | 128 | 19968 | 0.652 | 784.86 | 6.097 | 20.99 | | 512 | 128 | 20480 | 0.662 | 773.59 | 6.164 | 20.76 | | 512 | 128 | 20992 | 0.669 | 765.08 | 6.229 | 20.55 | | 512 | 128 | 21504 | 0.679 | 753.76 | 6.298 | 20.32 | | 512 | 128 | 22016 | 0.686 | 746.01 | 6.365 | 20.11 | | 512 | 128 | 22528 | 0.695 | 736.96 | 6.430 | 19.91 | | 512 | 128 | 23040 | 0.703 | 728.05 
| 6.497 | 19.70 | | 512 | 128 | 23552 | 0.711 | 719.77 | 6.565 | 19.50 | | 512 | 128 | 24064 | 0.722 | 709.53 | 6.629 | 19.31 | | 512 | 128 | 24576 | 0.729 | 702.38 | 6.694 | 19.12 | | 512 | 128 | 25088 | 0.738 | 693.75 | 6.760 | 18.93 | | 512 | 128 | 25600 | 0.752 | 680.71 | 6.831 | 18.74 | | 512 | 128 | 26112 | 0.755 | 677.81 | 6.896 | 18.56 | | 512 | 128 | 26624 | 0.765 | 669.13 | 6.968 | 18.37 | | 512 | 128 | 27136 | 0.772 | 663.37 | 7.033 | 18.20 | | 512 | 128 | 27648 | 0.783 | 654.12 | 7.104 | 18.02 | | 512 | 128 | 28160 | 0.790 | 647.97 | 7.174 | 17.84 | | 512 | 128 | 28672 | 0.800 | 640.20 | 7.240 | 17.68 | | 512 | 128 | 29184 | 0.808 | 633.63 | 7.309 | 17.51 | | 512 | 128 | 29696 | 0.816 | 627.32 | 7.371 | 17.37 | | 512 | 128 | 30208 | 0.825 | 620.68 | 7.445 | 17.19 | | 512 | 128 | 30720 | 0.837 | 612.00 | 7.514 | 17.03 | | 512 | 128 | 31232 | 0.844 | 606.59 | 7.586 | 16.87 | | 512 | 128 | 31744 | 0.862 | 594.13 | 7.655 | 16.72 | | 512 | 128 | 32256 | 0.860 | 595.13 | 7.729 | 16.56 | ## gemma-3-27B-it-qat-q4_k.gguf | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | |-------|--------|--------|----------|----------|----------|----------| | 512 | 128 | 0 | 0.350 | 1461.01 | 3.564 | 35.91 | | 512 | 128 | 512 | 0.355 | 1442.43 | 3.635 | 35.22 | | 512 | 128 | 1024 | 0.364 | 1406.79 | 3.702 | 34.58 | | 512 | 128 | 1536 | 0.370 | 1382.30 | 3.770 | 33.95 | | 512 | 128 | 2048 | 0.381 | 1343.95 | 3.837 | 33.36 | | 512 | 128 | 2560 | 0.389 | 1315.80 | 3.903 | 32.80 | | 512 | 128 | 3072 | 0.399 | 1283.92 | 3.975 | 32.20 | | 512 | 128 | 3584 | 0.406 | 1262.15 | 4.040 | 31.68 | | 512 | 128 | 4096 | 0.414 | 1235.98 | 4.108 | 31.16 | | 512 | 128 | 4608 | 0.423 | 1210.07 | 4.174 | 30.67 | | 512 | 128 | 5120 | 0.432 | 1186.43 | 4.240 | 30.19 | | 512 | 128 | 5632 | 0.440 | 1162.38 | 4.307 | 29.72 | | 512 | 128 | 6144 | 0.449 | 1139.33 | 4.372 | 29.28 | | 512 | 128 | 6656 | 0.458 | 1118.86 | 4.440 | 28.83 | | 512 | 128 | 7168 | 0.466 | 1098.62 | 4.505 | 28.41 | | 512 | 
128 | 7680 | 0.474 | 1079.84 | 4.572 | 27.99 | | 512 | 128 | 8192 | 0.483 | 1060.70 | 4.639 | 27.59 | | 512 | 128 | 8704 | 0.491 | 1042.31 | 4.708 | 27.19 | | 512 | 128 | 9216 | 0.500 | 1024.70 | 4.776 | 26.80 | | 512 | 128 | 9728 | 0.509 | 1006.70 | 4.845 | 26.42 | | 512 | 128 | 10240 | 0.517 | 991.28 | 4.913 | 26.06 | | 512 | 128 | 10752 | 0.524 | 976.78 | 4.979 | 25.71 | | 512 | 128 | 11264 | 0.532 | 962.53 | 5.040 | 25.39 | | 512 | 128 | 11776 | 0.542 | 944.54 | 5.117 | 25.02 | | 512 | 128 | 12288 | 0.550 | 931.02 | 5.173 | 24.74 | | 512 | 128 | 12800 | 0.559 | 916.72 | 5.240 | 24.43 | | 512 | 128 | 13312 | 0.566 | 903.92 | 5.305 | 24.13 | | 512 | 128 | 13824 | 0.576 | 888.40 | 5.373 | 23.82 | | 512 | 128 | 14336 | 0.584 | 876.82 | 5.444 | 23.51 | | 512 | 128 | 14848 | 0.592 | 864.61 | 5.506 | 23.25 | | 512 | 128 | 15360 | 0.600 | 852.86 | 5.574 | 22.96 | | 512 | 128 | 15872 | 0.609 | 840.81 | 5.641 | 22.69 | | 512 | 128 | 16384 | 0.617 | 830.27 | 5.706 | 22.43 | | 512 | 128 | 16896 | 0.626 | 817.66 | 5.773 | 22.17 | | 512 | 128 | 17408 | 0.635 | 806.32 | 5.839 | 21.92 | | 512 | 128 | 17920 | 0.644 | 795.42 | 5.908 | 21.67 | | 512 | 128 | 18432 | 0.651 | 786.23 | 5.974 | 21.43 | | 512 | 128 | 18944 | 0.660 | 775.45 | 6.041 | 21.19 | | 512 | 128 | 19456 | 0.669 | 765.32 | 6.110 | 20.95 | | 512 | 128 | 19968 | 0.677 | 756.12 | 6.176 | 20.72 | | 512 | 128 | 20480 | 0.686 | 746.87 | 6.244 | 20.50 | | 512 | 128 | 20992 | 0.695 | 736.54 | 6.313 | 20.28 | | 512 | 128 | 21504 | 0.705 | 726.44 | 6.382 | 20.06 | | 512 | 128 | 22016 | 0.713 | 718.06 | 6.451 | 19.84 | | 512 | 128 | 22528 | 0.720 | 710.74 | 6.512 | 19.66 | | 512 | 128 | 23040 | 0.728 | 702.83 | 6.579 | 19.46 | | 512 | 128 | 23552 | 0.740 | 691.52 | 6.651 | 19.24 | | 512 | 128 | 24064 | 0.749 | 683.96 | 6.723 | 19.04 | | 512 | 128 | 24576 | 0.757 | 676.77 | 6.783 | 18.87 | | 512 | 128 | 25088 | 0.764 | 669.98 | 6.852 | 18.68 | | 512 | 128 | 25600 | 0.773 | 662.01 | 6.924 | 18.49 | | 512 | 128 | 26112 | 0.783 
| 653.89 | 6.990 | 18.31 | | 512 | 128 | 26624 | 0.792 | 646.42 | 7.059 | 18.13 | | 512 | 128 | 27136 | 0.802 | 638.16 | 7.139 | 17.93 | | 512 | 128 | 27648 | 0.807 | 634.08 | 7.207 | 17.76 | | 512 | 128 | 28160 | 0.817 | 626.62 | 7.268 | 17.61 | | 512 | 128 | 28672 | 0.829 | 617.49 | 7.333 | 17.46 | | 512 | 128 | 29184 | 0.836 | 612.25 | 7.411 | 17.27 | | 512 | 128 | 29696 | 0.845 | 606.01 | 7.473 | 17.13 | | 512 | 128 | 30208 | 0.855 | 598.76 | 7.540 | 16.98 | | 512 | 128 | 30720 | 0.860 | 595.08 | 7.608 | 16.82 | | 512 | 128 | 31232 | 0.871 | 588.11 | 7.681 | 16.66 | | 512 | 128 | 31744 | 0.877 | 583.60 | 7.742 | 16.53 | | 512 | 128 | 32256 | 0.885 | 578.74 | 7.812 | 16.39 | ## gemma-3-27B-it-qat-q8_0.gguf | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | |-------|--------|--------|----------|----------|----------|----------| | 512 | 128 | 0 | 0.332 | 1539.91 | 5.825 | 21.98 | | 512 | 128 | 512 | 0.338 | 1512.99 | 5.894 | 21.72 | | 512 | 128 | 1024 | 0.347 | 1475.13 | 5.957 | 21.49 | | 512 | 128 | 1536 | 0.354 | 1444.45 | 6.024 | 21.25 | | 512 | 128 | 2048 | 0.364 | 1407.89 | 6.086 | 21.03 | | 512 | 128 | 2560 | 0.372 | 1374.75 | 6.151 | 20.81 | | 512 | 128 | 3072 | 0.381 | 1342.80 | 6.216 | 20.59 | | 512 | 128 | 3584 | 0.390 | 1311.71 | 6.282 | 20.37 | | 512 | 128 | 4096 | 0.399 | 1282.81 | 6.346 | 20.17 | | 512 | 128 | 4608 | 0.407 | 1258.61 | 6.410 | 19.97 | | 512 | 128 | 5120 | 0.414 | 1237.17 | 6.474 | 19.77 | | 512 | 128 | 5632 | 0.422 | 1212.05 | 6.538 | 19.58 | | 512 | 128 | 6144 | 0.431 | 1187.26 | 6.604 | 19.38 | | 512 | 128 | 6656 | 0.441 | 1161.80 | 6.666 | 19.20 | | 512 | 128 | 7168 | 0.449 | 1140.72 | 6.730 | 19.02 | | 512 | 128 | 7680 | 0.458 | 1118.44 | 6.793 | 18.84 | | 512 | 128 | 8192 | 0.466 | 1098.54 | 6.857 | 18.67 | | 512 | 128 | 8704 | 0.473 | 1083.10 | 6.923 | 18.49 | | 512 | 128 | 9216 | 0.482 | 1062.20 | 6.991 | 18.31 | | 512 | 128 | 9728 | 0.491 | 1042.60 | 7.054 | 18.15 | | 512 | 128 | 10240 | 0.498 | 1027.18 | 7.123 | 17.97 
| | 512 | 128 | 10752 | 0.508 | 1007.38 | 7.185 | 17.81 | | 512 | 128 | 11264 | 0.515 | 994.91 | 7.251 | 17.65 | | 512 | 128 | 11776 | 0.523 | 978.05 | 7.317 | 17.49 | | 512 | 128 | 12288 | 0.532 | 962.31 | 7.380 | 17.34 | | 512 | 128 | 12800 | 0.541 | 947.09 | 7.445 | 17.19 | | 512 | 128 | 13312 | 0.549 | 932.93 | 7.510 | 17.04 | | 512 | 128 | 13824 | 0.557 | 919.44 | 7.577 | 16.89 | | 512 | 128 | 14336 | 0.566 | 905.08 | 7.645 | 16.74 | | 512 | 128 | 14848 | 0.575 | 890.47 | 7.707 | 16.61 | | 512 | 128 | 15360 | 0.582 | 879.10 | 7.773 | 16.47 | | 512 | 128 | 15872 | 0.592 | 865.03 | 7.834 | 16.34 | | 512 | 128 | 16384 | 0.600 | 853.53 | 7.898 | 16.21 | | 512 | 128 | 16896 | 0.607 | 843.01 | 7.966 | 16.07 | | 512 | 128 | 17408 | 0.617 | 829.70 | 8.028 | 15.94 | | 512 | 128 | 17920 | 0.625 | 818.59 | 8.092 | 15.82 | | 512 | 128 | 18432 | 0.632 | 810.24 | 8.163 | 15.68 | | 512 | 128 | 18944 | 0.640 | 800.31 | 8.229 | 15.56 | | 512 | 128 | 19456 | 0.649 | 789.37 | 8.296 | 15.43 | | 512 | 128 | 19968 | 0.658 | 778.33 | 8.361 | 15.31 | | 512 | 128 | 20480 | 0.666 | 768.52 | 8.425 | 15.19 | | 512 | 128 | 20992 | 0.676 | 757.55 | 8.490 | 15.08 | | 512 | 128 | 21504 | 0.686 | 746.62 | 8.555 | 14.96 | | 512 | 128 | 22016 | 0.694 | 737.58 | 8.620 | 14.85 | | 512 | 128 | 22528 | 0.705 | 726.01 | 8.690 | 14.73 | | 512 | 128 | 23040 | 0.713 | 718.42 | 8.762 | 14.61 | | 512 | 128 | 23552 | 0.720 | 710.87 | 8.829 | 14.50 | | 512 | 128 | 24064 | 0.729 | 701.96 | 8.893 | 14.39 | | 512 | 128 | 24576 | 0.738 | 693.84 | 8.964 | 14.28 | | 512 | 128 | 25088 | 0.747 | 685.68 | 9.029 | 14.18 | | 512 | 128 | 25600 | 0.756 | 677.05 | 9.101 | 14.06 | | 512 | 128 | 26112 | 0.764 | 670.53 | 9.165 | 13.97 | | 512 | 128 | 26624 | 0.773 | 662.78 | 9.230 | 13.87 | | 512 | 128 | 27136 | 0.782 | 654.47 | 9.296 | 13.77 | | 512 | 128 | 27648 | 0.788 | 649.53 | 9.361 | 13.67 | | 512 | 128 | 28160 | 0.798 | 641.93 | 9.430 | 13.57 | | 512 | 128 | 28672 | 0.806 | 635.07 | 9.497 | 13.48 | | 512 | 128 | 
29184 | 0.815 | 628.26 | 9.561 | 13.39 | | 512 | 128 | 29696 | 0.822 | 623.01 | 9.631 | 13.29 | | 512 | 128 | 30208 | 0.832 | 615.43 | 9.687 | 13.21 | | 512 | 128 | 30720 | 0.839 | 610.13 | 9.753 | 13.12 | | 512 | 128 | 31232 | 0.849 | 603.39 | 9.824 | 13.03 | | 512 | 128 | 31744 | 0.858 | 596.95 | 9.885 | 12.95 | | 512 | 128 | 32256 | 0.865 | 591.61 | 9.953 | 12.86 | ```
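The tensor mixes listed in the perplexity results above (for example the `mix-iq3_k` entry) can be written as one regex-to-type rule per line and joined into a single argument. The sketch below only builds and prints such a command, it does not run it. The `--custom-q` option and its `regex=type` syntax, the file names, the `IQ4_KS` fallback type, and the `join_rules` helper are all assumptions for illustration and are not taken from this post, so check `llama-quantize --help` in your own build before relying on any of it.

```shell
#!/usr/bin/env bash
# Join one-rule-per-line text into a comma-separated list,
# skipping comment lines and blank lines.
join_rules() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' | paste -sd, -
}

# Rules approximating the mix-iq3_k entry above (hypothetical recipe).
rules=$(join_rules <<'EOF'
# embeddings and attn_v as listed in the results
token_embd\.weight=q8_0
blk\..*\.attn_v\.weight=q4_0

# feed-forward tensors split between iq3_k and iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq3_k
blk\..*\.ffn_down\.weight=iq4_ks
EOF
)

# Print the command instead of running it. Paths are placeholders and
# --custom-q is an ASSUMED option of ik_llama.cpp's llama-quantize.
echo ./build/bin/llama-quantize --custom-q "\"$rules\"" \
  input-bf16.gguf output-mix-iq3_k.gguf IQ4_KS
```

Keeping the rules one per line makes it easier to diff two recipes when comparing mixes like the ones benchmarked here.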
Methodology

#### Perplexity

```bash
$ cd ik_llama.cpp
$ git rev-parse --short HEAD
3bb64d93
$ ./build/bin/llama-perplexity --version
version: 3639 (3bb64d93)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
$ wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ sha256sum wiki.test.raw
173c87a53759e0201f33e0ccf978e510c2042d7f2cb78229d9a50d79b9e7dd08  wiki.test.raw
$ ./build/bin/llama-perplexity \
    --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27B-it-qat-iq4_nl.gguf \
    --ctx-size 512 \
    --ubatch-size 512 \
    -f wiki.test.raw \
    --seed 1337 \
    --n-gpu-layers 99 \
    --threads 4
```

#### Sweep Bench

Using a single RTX A6000 48GB VRAM GPU:

```bash
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
    --model "$model" \
    -ctk q8_0 -ctv q8_0 \
    -fa \
    -amb 512 \
    -fmoe \
    -c 32768 \
    -ub 512 \
    -ngl 99 \
    --threads 4
```
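To repeat the perplexity run over several quants, the command above can be wrapped in a loop and the final line pulled out of each log. This is a minimal sketch: the `extract_ppl` and `run_all` helper names and the `.log` file naming are hypothetical, and only the `llama-perplexity` flags already shown above are used.

```shell
#!/usr/bin/env bash
# Pull "PPL ERR" out of a llama-perplexity log read from stdin.
extract_ppl() {
  sed -n 's/^Final estimate: PPL = \([0-9.]*\) +\/- \([0-9.]*\).*$/\1 \2/p'
}

# Run perplexity for each GGUF passed as an argument (hypothetical helper).
# Each run writes its full output to <model-name>.log in the current directory.
run_all() {
  local model log
  for model in "$@"; do
    log="$(basename "$model" .gguf).log"
    ./build/bin/llama-perplexity \
      --model "$model" \
      --ctx-size 512 \
      --ubatch-size 512 \
      -f wiki.test.raw \
      --seed 1337 \
      --n-gpu-layers 99 \
      --threads 4 > "$log" 2>&1
    # One summary line per model: file name, PPL, error
    printf '%s %s\n' "$(basename "$model")" "$(extract_ppl < "$log")"
  done
}

# Usage (not run here): run_all /path/to/quants/*.gguf
```

Capturing both stdout and stderr in the log avoids having to know which stream the final estimate is printed on.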
I tried a number of other mixes based on `--layer-similarity` scores, optimizing both for the whole-layer score and for the separate attn and ffn scores, but in limited testing on this specific model none of them gave better perplexity. My impression is that this QAT really was meant for `q4_0`, as using slightly higher quants for some layers sometimes showed slightly worse perplexity.

I didn't compare against quants of the non-QAT bf16 model, but wanted to share some early results with anyone else curious about this QAT business.

Cheers!

---

#### 🗣️ Discussion

👤 **saood06** replied the **2025-04-18** at **22:57:25**:
> I saw google released their [google/gemma-3-27b-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-unquantized) original `.safetensors` unquantized model. It is supposedly designed for `q4_0` quantization which was released earlier in gguf format.
>
> > Thanks to Quantization Aware Training (QAT), the model is able to preserve similar quality as bfloat16 while significantly reducing the memory requirements to load the model.

This is QAT, but unlike previous QAT models I have seen, this was done with an additional stage of finetuning, which is why I think the raw PPL values are less directly comparable to the version without QAT (and why they are lower, as it was trained longer). They should still be useful to compare, given that you can often compare PPL within an architecture family, as in the example below (a bit dated, but I still find this graph made by ikawrakow interesting).

![image](https://github.com/user-attachments/assets/2e3b2cd5-8ffe-4b6b-95db-694d99c1fbf5)

> 👤 **bartowski1182** replied the **2025-04-19** at **00:55:33**:
> > unlike previous QAT > > Which ones have you seen, and what did they do if not additional fine tuning? 🤔 > > But yes I theorized it was possible that they did some fine tuning for the quant awareness with wiki text itself, maybe unlikely but certainly not impossible > > I think it could be valuable to use a random additional well formatted English corpus for more PPL numbers, that might start giving a more full image > > 👤 **saood06** replied the **2025-04-19** at **01:32:31**:
> > Which ones have you seen, and what did they do if not additional fine tuning? 🤔 > > Not any that I remember being released, but just in papers/blogs/demos one example being [this](https://pytorch.org/blog/quantization-aware-training/): where for example they do "Llama3-8B fine-tuned on the C4 dataset (en subset) with and without QAT" which allows you to see the difference between QAT and just finetuning. > > I also do remember hearing about models trained entirely using QAT but don't have any reference handy. > > > But yes I theorized it was possible that they did some fine tuning for the quant awareness with wiki text itself, maybe unlikely but certainly not impossible > > My point isn't really specific to any data they did finetune with (my guess is they just did one or a partial epoch of the last dataset used for the Instruction tuned model, as people have reported modern LLM's can get very sensitive to the diversity of their training data [reported since Llama 3 and why people may have struggled fine tuning that for a while] ), just that the QAT model was trained more. > > 👤 **bartowski1182** replied the **2025-04-19** at **02:50:22**:
> Oh hmm I suppose that's possible as well, would definitely be very interesting to see the full details --- 👤 **saood06** replied the **2025-04-19** at **05:00:05**:
@ubergarm Have you seen these versions, [27B](https://huggingface.co/stduhpf/google-gemma-3-27b-it-qat-q4_0-gguf-small), and [12B](https://huggingface.co/Dampfinchen/google-gemma-3-12b-it-qat-q4_0-gguf-small-fix). --- 👤 **ikawrakow** replied the **2025-04-19** at **08:38:23**:
In my quick experiments with Gemma3-12B, the `Q4_0` quantized version has a significantly lower Wiki2 perplexity than the `bf16` model, or any other quantization. This means that, whatever they have done, they have massively overfit that specific dataset, specifically with `Q4_0` quantization. It also means that one cannot use Wiki2 for evaluation (PPL, but also KLD or any other quantization quality measure). In my book (but you may differ from me in that regard), it also means that one cannot take this model seriously.

> 👤 **saood06** replied the **2025-04-19** at **08:56:08**:
> > On my book (but you may differ from me in that regard), it also means that one cannot take this model seriously.
> 
> I haven't touched Gemma 3 myself yet (I want to see if it beats QwQ for my GPU-only use cases), but I've heard a lot of positive feedback on the QAT version of Gemma 3. I agree that it does make them hard to compare directly since they differ so much, but whatever they did, people seem generally happy with it.
> 
> > but also KLD or any other quantization quality measure
> 
> Do you think knowing the KLD between the two BF16 versions would be insightful (not that I could run it in a reasonable amount of time for the 27B; the 12B might be possible though)?
> 
> 👤 **bartowski1182** replied the **2025-04-20** at **18:19:36**:
> > they have massively overfit that specific dataset > > that was my theory as well, that they may have used wikitext as a calibration dataset for the QAT portion > > I don't know if it completely invalidates the model, but rather just makes wikitext useless/misleading, similar to using wikitext for imatrix and then checking PPL against wikitext, it's totally fine to use it, but need to use something different for PPL after > > 👤 **saood06** replied the **2025-04-20** at **18:29:03**:
> > that was my theory as well, that they may have used wikitext as a calibration dataset for the QAT portion > > You could test the QAT against the old one with different datasets if you want to test that hypothesis. > > I don't think they used a low diversity dataset to train on, my theory is they may have updated the model they distilled from and that might be the extra bump on top of the bump from more training on more tokens. > > 👤 **ubergarm** replied the **2025-04-20** at **22:51:23**:
> I threw together a quick-n-dirty perplexity test corpus, mostly English with a little Chinese and XML. It is very possible the model was already trained on this material, given that it is available online. I might be able to generate some "novel" synthetic text using output from a few different LLMs to mix it up, but at least here are a few more data points with something other than `wiki.test.raw`.
> 
> ## Observations
> The absolute values here are lower overall (~6.0) than with `wiki.test.raw` (~8.2). The new QAT BF16 still outperforms the original BF16. For the QAT quants, my q4_k is "better" than google's q4_0, and this time mine isn't "better" than the BF16.
> 
> ## Google Original
> * `google/gemma-3-27b-it/gemma-3-27B-it-BF16-00001-of-00002.gguf`
>     - `Final estimate: PPL = 6.0568 +/- 0.03981`
> 
> ## Google QAT
> * `google/gemma-3-27b-it-qat-q4_0-unquantized/gemma-3-27B-it-qat-unquantized-BF16-00001-of-00002.gguf`
>     - `Final estimate: PPL = 5.8897 +/- 0.03774`
> * `google/gemma-3-27b-it-qat-q4_0-gguf/gemma-3-27b-it-q4_0.gguf`
>     - `Final estimate: PPL = 6.0588 +/- 0.03904`
> 
> ## ubergarm QAT Quant
> * `ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27b-it-qat-q4_k.gguf`
>     - `Final estimate: PPL = 5.9405 +/- 0.03799`
> 
> ## Methodology
> 
> 
> Logs to generate corpus and run llama-perplexity
> 
> ```bash
> # i ching
> wget https://www.gutenberg.org/ebooks/25501.txt.utf-8
> # Advaita Bodha Deepika, The Lamp of Non-Dual Knowledge: one of the few books highly spoken of by Bhagavan Sri Ramana Maharshi
> wget 'https://archive.org/stream/ramanamaharishiebooks/Ramana%20Maharishi%20eBooks/Advaita%20Bodha%20Deepika_djvu.txt'
> # Anarchist Cookbook by William Powell
> wget 'https://archive.org/stream/the-anarchist-cookbook-william-powell/The%20Anarchist%20Cookbook%20-%20William%20Powell%20-%20Barricade%20Books%20Inc%20-%201989_djvu.txt'
> # Social Architecture Peter Hintjens
> wget 'https://raw.githubusercontent.com/hintjens/socialarchitecture/refs/heads/master/ch04.txt'
> # cat them together in order
> $ sha256sum ubergarm-ppl-corpus.txt
> 456de7da9d5eec01d357f5ccb7fc7207884a706efe94535aca146f8646771bcc  ubergarm-ppl-corpus.txt
> 
> $ ./build/bin/llama-perplexity \
>     --model "$MODEL" \
>     --ctx-size 512 \
>     --ubatch-size 512 \
>     -f ubergarm-ppl-corpus.txt \
>     --seed 1337 \
>     --n-gpu-layers 99 \
>     --threads 4
> ```
> 
> > 👤 **ubergarm** replied the **2025-04-21** at **14:52:18**:
> A redditor [mentioned their post](https://www.reddit.com/r/LocalLLaMA/comments/1jqnnfp/comment/ml8nuof/) measuring PPL and KLD with `wiki.test.raw` and a private corpus for some of the gemma-3-27b QAT models with an interesting writeup. > > Also amusing that [a redditor quoted ik](https://www.reddit.com/r/LocalLLaMA/comments/1k3jal4/comment/mo707ni/) on this thread hah... My impression is folks with <= 16GB VRAM are interested in the gemma-3-27b-it-qat ~4 bits as while it isn't as good as R1/V3-0324/QwQ-32B imo, it is a newer model that just barely fits with enough context to play around with decent speed. --- 👤 **ikawrakow** replied the **2025-04-19** at **09:08:49**:
> Do you think knowing the KLD between the two BF16 versions versions would be insightful Good question. If the `Q4_0` model outperforms the `bf16` model (at least it does for a set of quality metrics), do we now compare the `Q4_0` model against `bf16`, or do we compare the other way around? How do I know which of these two models is better? QAT was involved, fine tuning was involved, so maybe `Q4_0` is the best model? > but I've heard a lot of positive feedback on the QAT version of Gemma 3. Is it so because it really is good, or is it more because the sentiment towards Google has shifted lately (at least when it comes to "AI"). My impression is that the Internet believes that the latest Gemini models are currently the best (and so, by extension, Gemma3 must be among the best open weight). But the few things that I asked Gemma3-12B where I have good knowledge of the subject matter, the answers were complete BS. > 👤 **saood06** replied the **2025-04-19** at **09:33:57**:
> > Good question. If the `Q4_0` model outperforms the `bf16` model (at least it does for a set of quality metrics), do we now compare the `Q4_0` model against `bf16`, or do we compare the other way around? > > There are two different BF16's the original post shows PPL of both (pasted below for convenience). > > Original BF16: > `## google/gemma-3-27b-it-BF16-00001-of-00002.gguf` > `Final estimate: PPL = 8.4276 +/- 0.06705` > > QAT BF16: > `## google/gemma-3-27b-it-qat-q4_0-unquantized-BF16-00001-of-00002.gguf` > `Final estimate: PPL = 8.2021 +/- 0.06387` > > The extra finetuning that happened because of QAT is definitely showing from these numbers. > > >How do I know which of these two models is better? QAT was involved, fine tuning was involved, so maybe `Q4_0` is the best model? > >[...] > > Is it so because it really is good, or is it more because the sentiment towards Google has shifted lately (at least when it comes to "AI"). My impression is that the Internet believes that the latest Gemini models are currently the best (and so, by extension, Gemma3 must be among the best open weight). > > My impression is that sentiment on Gemma3 is still mixed (if it was better I would have tried it by now), but that for people that used Gemma 3 the QAT versions are a very good drop in replacement offering higher quality, faster inference, and smaller models, but to me it doesn't seem like it has changed the perception of Gemma3 as a whole with plenty of people still not liking it. There may be usecases in which it has degraded performance compared to the non QAT version, but I have yet to come across any reports of that. > > >But the few things that I asked Gemma3-12B where I have good knowledge of the subject matter, the answers were complete BS. > > Did both QAT and non QAT do that? I'd assume they'd both fail your test. > > Gemma3 may not be a good fit for you, but I am curious what models you have used and liked. 
> > 👤 **ikawrakow** replied the **2025-04-20** at **07:02:01**:
> > Gemma3 may not be a good fit for you, but I am curious what models you have used and liked. > > From the models I can run locally, none really passes the smell test. I would be hard pressed to say which one I like the best. > > 👤 **saood06** replied the **2025-04-20** at **09:51:32**:
> >From the models I can run locally > > Have you used any non locally? You brought up gemini, those models have had a huge advantage on long context benchmarks for a long time. The gemma models are nothing similar. Deepseek-r1 is the best local model but even that pales in comparison to gemini (from benchmarks and user testimonials, I've used it a bit over lmarena but not enough to remark on it). > > >I would be hard pressed to say which one I like the best. > > I'm not surprised, as most small and mid size models aren't that smart if that is what you are evaluating. I mostly use my local models for entertainment. --- 👤 **ubergarm** replied the **2025-04-27** at **03:08:43**:
*EDIT* My compile script was messed up and putting me into DEBUG mode...

I was doing some more benchmarking of various `gemma-3-27b-it-qat` quants with `llama-sweep-bench` on mainline and `ik_llama.cpp`, and noticed some warnings printing out on my `ik_llama.cpp` fork:

> ggml_backend_cuda_graph_compute: CUDA graph update failed
> ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1]

I believe I'm compiling Release and not Debug, but I'm not completely sure how to tell. I'm not seeing that warning on mainline with roughly the same command (`-c 33792` instead of `-c 32768`, oops), but I'm not sure if verbosity is different by default, etc.

I ran one of the bartowski quants on both mainline and `ik_llama.cpp` so that any difference would not be due to a different quant.

![qat-sweep-debugging](https://github.com/user-attachments/assets/62b5adba-c895-462c-ab8a-64035b33268d)
👈 Logs ```bash model="/mnt/raid/models/bartowski/google_gemma-3-27b-it-qat-GGUF/google_gemma-3-27b-it-qat-Q4_K_M.gguf" CUDA_VISIBLE_DEVICES="0" \ ./build/bin/llama-sweep-bench \ --model "$model" \ -fa \ -ctk f16 -ctv f16 \ -c 32768 \ -ngl 99 \ --threads 16 llama_model_loader: loaded meta data with 44 key-value pairs and 808 tensors from /mnt/raid/models/bartowski/google_gemma-3-27b-it-qat-GGUF/google_gemma-3-27b-it-qat-Q4_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma3 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = Gemma 3 27b It Qat llama_model_loader: - kv 3: general.finetune str = it-qat llama_model_loader: - kv 4: general.basename str = gemma-3 llama_model_loader: - kv 5: general.size_label str = 27B llama_model_loader: - kv 6: general.license str = gemma llama_model_loader: - kv 7: general.base_model.count u32 = 1 llama_model_loader: - kv 8: general.base_model.0.name str = Gemma 3 27b It llama_model_loader: - kv 9: general.base_model.0.organization str = Google llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/google/gemma-3... llama_model_loader: - kv 11: general.tags arr[str,4] = ["gemma3", "gemma", "google", "image-... 
llama_model_loader: - kv 12: gemma3.context_length u32 = 131072 llama_model_loader: - kv 13: gemma3.embedding_length u32 = 5376 llama_model_loader: - kv 14: gemma3.block_count u32 = 62 llama_model_loader: - kv 15: gemma3.feed_forward_length u32 = 21504 llama_model_loader: - kv 16: gemma3.attention.head_count u32 = 32 llama_model_loader: - kv 17: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 18: gemma3.attention.key_length u32 = 128 llama_model_loader: - kv 19: gemma3.attention.value_length u32 = 128 llama_model_loader: - kv 20: gemma3.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 21: gemma3.attention.sliding_window u32 = 1024 llama_model_loader: - kv 22: gemma3.attention.head_count_kv u32 = 16 llama_model_loader: - kv 23: gemma3.rope.scaling.type str = linear llama_model_loader: - kv 24: gemma3.rope.scaling.factor f32 = 8.000000 llama_model_loader: - kv 25: tokenizer.ggml.model str = llama llama_model_loader: - kv 26: tokenizer.ggml.pre str = default llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ... llama_model_loader: - kv 28: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ... llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 32: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 36: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r... 
llama_model_loader: - kv 37: tokenizer.ggml.add_space_prefix bool = false llama_model_loader: - kv 38: general.quantization_version u32 = 2 llama_model_loader: - kv 39: general.file_type u32 = 15 llama_model_loader: - kv 40: quantize.imatrix.file str = /models_out/gemma-3-27b-it-qat-GGUF/g... llama_model_loader: - kv 41: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt llama_model_loader: - kv 42: quantize.imatrix.entries_count i32 = 434 llama_model_loader: - kv 43: quantize.imatrix.chunks_count i32 = 129 llama_model_loader: - type f32: 373 tensors llama_model_loader: - type q4_K: 374 tensors llama_model_loader: - type q6_K: 61 tensors llm_load_vocab: special tokens cache size = 6415 llm_load_vocab: token to piece cache size = 1.9446 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma3 llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 262208 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 5376 llm_load_print_meta: n_layer = 62 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 1024 llm_load_print_meta: n_swa_pattern = 6 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 2 llm_load_print_meta: n_embd_k_gqa = 2048 llm_load_print_meta: n_embd_v_gqa = 2048 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 21504 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 
1000000.0 llm_load_print_meta: freq_scale_train = 0.125 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 27B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 27.009 B llm_load_print_meta: model size = 15.404 GiB (4.899 BPW) llm_load_print_meta: general.name = Gemma 3 27b It Qat llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 1 '' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 248 '<0x0A>' llm_load_print_meta: EOT token = 106 '' llm_load_print_meta: max token length = 48 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.70 MiB llm_load_tensors: offloading 62 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 63/63 layers to GPU llm_load_tensors: CPU buffer size = 1102.77 MiB llm_load_tensors: CUDA0 buffer size = 15773.97 MiB ......................................................................................... 
llama_new_context_with_model: n_ctx = 32768 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: mla_attn = 0 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 0.125 llama_kv_cache_init: CUDA0 KV buffer size = 15872.00 MiB llama_new_context_with_model: KV self size = 15872.00 MiB, K (f16): 7936.00 MiB, V (f16): 7936.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 1.00 MiB ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 522.62 MiB ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 138.51 MiB llama_new_context_with_model: CUDA0 compute buffer size = 522.62 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 138.51 MiB llama_new_context_with_model: graph nodes = 1806 llama_new_context_with_model: graph splits = 2 main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 16, n_threads_batch = 16 | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | |-------|--------|--------|----------|----------|----------|----------| ggml_backend_cuda_graph_compute: CUDA graph update failed ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1] | 512 | 128 | 0 | 0.356 | 1436.25 | 3.719 | 34.42 | ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1] | 512 | 128 | 512 | 0.372 | 1378.12 | 3.782 | 33.85 | ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1] . . . ```
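
On the "not completely sure how to tell" question above: for CMake builds, the configured build type is recorded in `CMakeCache.txt` inside the build directory as a `CMAKE_BUILD_TYPE:STRING=...` entry. The following is a minimal sketch of reading it, assuming a single-config generator (Makefiles or Ninja) and a build directory named `build`, as in the commands in this thread; the helper name `build_type_from_cache` is hypothetical.

```python
from pathlib import Path


def build_type_from_cache(cache_text):
    """Return the CMAKE_BUILD_TYPE value found in CMakeCache.txt content.

    Cache entries have the form NAME:TYPE=VALUE. Returns None when the
    entry is missing or empty.
    """
    for line in cache_text.splitlines():
        if line.startswith("CMAKE_BUILD_TYPE:"):
            value = line.split("=", 1)[1].strip()
            return value or None
    return None


if __name__ == "__main__":
    cache = Path("build/CMakeCache.txt")  # ASSUMPTION: build dir is ./build
    if cache.exists():
        print(build_type_from_cache(cache.read_text()) or "(not set)")
    else:
        print(f"{cache} not found")
```

If this prints `Debug`, reconfiguring with `-DCMAKE_BUILD_TYPE=Release` and rebuilding should be the fix. An empty value means no build type was requested at configure time, in which case the project's own default applies.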
*EDIT* Updated Graph with `ik_llama.cpp` in `Release` mode instead of `Debug` mode as shown above.... sorry for the confusion! ![qat-sweep-release-mode](https://github.com/user-attachments/assets/38fa6631-5e43-4bcf-b6c6-8384334275e9) --- 👤 **ikawrakow** replied the **2025-04-27** at **07:06:13**:
~CUDA graphs get disabled for MoE models in `ik_llama.cpp`, this is why you see the warning. It was the same in mainline until very recently, their PR 12970 enables CUDA graphs for TG (and apparently hides the warning when disabling graphs for PP). Also very recently, Johannes Gaessler completely independently discovered batched processing for TG with MoE models in PR 13014. He really discovered it by himself, [without ever looking at ik_llama.cpp, not even once](https://github.com/ikawrakow/ik_llama.cpp/pull/283/files/7f6980fa5166d029ad04cef395d2993ddc8da307#r2029830357) /s~ `ik_llama.cpp` will benefit from `-t 1` when running fully on the GPU. I was clearly confused. This is Gemma3. --- 👤 **ikawrakow** replied the **2025-04-27** at **07:49:45**:
@ubergarm The PP performance difference between mainline and `ik_llama.cpp` did not look plausible to me, so I ran my own benchmarks. I only have 16 GB RTX-4080, hence cannot run Gemma3-27B fully offloaded, so used Gemma3-12B instead. `Q4_0` quantized in both cases, FA on with `fp16` KV cache (mainline cannot use quantized cache with Gemma3), which allows me to go up to 16k context with my paltry 16 GB of VRAM. Anyway, here is what I see: ![g3_cuda](https://github.com/user-attachments/assets/fcc1e251-a7a8-4673-a292-a424c70ea6e5) ![g3_cuda_tg](https://github.com/user-attachments/assets/88ad3715-9cc6-4e43-8f89-1ba2871292f1) Are you sure your `sweep-bench` adaptation for mainline is working correctly? Gemma3 KV cache size relative to model size is quite high, so just a ~40% drop in PP performance at 32k tokens seen for mainline seems relatively unlikely. > 👤 **saood06** replied the **2025-04-27** at **08:10:16**:
> > @ubergarm > > > > Are you sure your `sweep-bench` adaptation for mainline is working correctly? Gemma3 KV cache size relative to model size is quite high, so just a ~40% drop in PP performance at 32k tokens seen for mainline seems relatively unlikely. > > ~~He is missing a llama_synchronize call, could that account for it?~~ > > Edit: Nevermind > > 👤 **ubergarm** replied the **2025-04-27** at **17:14:57**:
> *EDIT* My compile script was messed up and putting me into DEBUG mode...
> 
> Thanks for taking a look. I too am doubting my `sweep-bench` adaptation for mainline, as I just quickly got it compiling without looking too closely.
> 
> Next I'll try:
> - [x] Use `-t 1` with `ik_llama.cpp` when fully offloading to GPU
> - [x] Look more closely at the `sweep-bench` adaptation as it could be inflating numbers (though for non FA and CPU cases with GLM-4 it looked more like I expected). Thanks @saood06 for the `llama_synchronize` call, I'll try to figure out if there is something I'm missing.
> - [ ] Possibly repeat with `Gemma3-12B` `Q4_0` to reproduce graphs like ik just gave above.
> - [x] Try good old `llama-bench` for a sanity test across a smaller range of values.
> 
> 👤 **ubergarm** replied the **2025-04-27** at **17:39:39**:
> > He is missing a llama_synchronize call, could that account for it?
> 
> Hrmm, I only see one `llama_synchronize(ctx);` call in the [ik_llama.cpp/examples/sweep-bench/sweep-bench.cpp](https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples/sweep-bench/sweep-bench.cpp#L90) code, which also appears in [my adaptation](https://github.com/ubergarm/llama.cpp/blob/ug/port-sweep-bench/examples/sweep-bench/sweep-bench.cpp#L86)?
> 
> It's possible I'm somehow using the wrong number for `n_batch` etc., as I don't really understand what batches and n_batches are. Also maybe some function arguments changed beyond just the names, for stuff like `llama_model_params_from_gpt_params(params);` to `common_init_from_params(params);` etc...
> 
> I'll dig around some more, as I'd be quite surprised if the mainline FA CUDA implementation for dense models like gemma-3 and glm-4 was suddenly this good.
> 
> 👤 **ubergarm** replied the **2025-04-27** at **18:02:40**:
> *EDIT* My compile script was messed up and putting me into DEBUG mode... > > Using `-t 1` does seem slightly but consistently faster than `-t 16` in this one comparison. `ik_llama.cpp` for both runs using same bartowski quant: > > ![qat-sweep-1-v-16-threads](https://github.com/user-attachments/assets/077e4fcf-97cc-431d-9a78-0163c70df333) > > 👤 **saood06** replied the **2025-04-27** at **19:28:28**:
> > > > * [x] Look more closely at the `sweep-bench` adaptation as it could be inflating numbers (though for non FA and CPU cases with GLM-4 it looked more like I expected). Thanks @saood06 for the `llama_synchronize` call, I'll try to figure out if there is something I'm missing. > > > >[...] > >Hrmm, I only see one llama_synchronize(ctx); call in the [ik_llama.cpp/examples/sweep-bench/sweep-bench.cpp](https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples/sweep-bench/sweep-bench.cpp#L90) code which also appears in [my adaptation](https://github.com/ubergarm/llama.cpp/blob/ug/port-sweep-bench/examples/sweep-bench/sweep-bench.cpp#L86)? > > Sorry, I was looking at [this](https://github.com/ubergarm/llama.cpp/commit/e59a5f1eb92b5b99d6a6d386b4620f89f9dad5ec) and I didn't fully expand the file. Ignore what I said. > > 👤 **ubergarm** replied the **2025-04-27** at **19:48:37**:
> *EDIT* My compile script was messed up and putting me into DEBUG mode...
> 
> I ran a plain `llama-bench` for PP only, to compare and sanity check whether my adaptation of `llama-sweep-bench` is accurate, at least for PP. In general `llama-sweep-bench` shows lower scores than `llama-bench`, assuming the x-axis describes the "same thing" for both, e.g. that PP context length is similar enough to `N_KV`.
> 
> ![ikbench](https://github.com/user-attachments/assets/3f316f86-ca36-462e-8934-d33dd33fd1b0)
> 
> 
> 👈 Logs
> 
> ## Simple PP Benchmark
> ```bash
> model="/mnt/raid/models/bartowski/google_gemma-3-27b-it-qat-GGUF/google_gemma-3-27b-it-qat-Q4_K_M.gguf"
> 
> CUDA_VISIBLE_DEVICES="0," \
> ./build/bin/llama-bench \
>     --model "$model" \
>     -ngl 99 \
>     --mmap 0 \
>     -ctk f16 -ctv f16 \
>     -fa 1 \
>     -p 512,1024,2048,4096,8192,16384,32768 \
>     -n 0 \
>     -b 2048 \
>     -ub 512 \
>     -r 2 \
>     --n-gpu-layers 99 \
>     --threads 1
> ```
> 
> ## mainline llama.cpp
> | model | size | params | backend | ngl | threads | fa | mmap | test | t/s |
> | ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ---: | ------------: | -------------------: |
> | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp512 | 1477.77 ± 6.79 |
> | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp1024 | 1475.54 ± 2.11 |
> | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp2048 | 1460.71 ± 1.78 |
> | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp4096 | 1420.40 ± 4.39 |
> | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp8192 | 1341.60 ± 3.84 |
> | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp16384 | 1215.80 ± 3.76 |
> | gemma3 27B Q4_K - Medium | 15.40 GiB | 27.01 B | CUDA | 99 | 1 | 1 | 0 | pp32768 | 1031.38 ± 1.40 |
> 
> ## ik_llama.cpp
> *NOTE* Every test throws a warning like this:
> ```
> ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 522.62 MiB
> ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 12.51 MiB
> ```
> | model | size | params | backend | ngl | threads | fa | mmap | test | t/s |
> | ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ---: | ------------: | ---------------: |
> | gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp512 | 1439.27 ± 7.59 |
> | gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp1024 | 1461.46 ± 7.53 |
> | gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp2048 | 1442.42 ± 2.51 |
> | gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp4096 | 1386.80 ± 2.68 |
> | gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp8192 | 1282.03 ± 2.67 |
> | gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp16384 | 1078.81 ± 5.99 |
> | gemma3 27B Q4_K - Medium | 16.48 GiB | 28.42 B | CUDA | 99 | 1 | 1 | 0 | pp32768 | 749.25 ± 4.50 |
> 
> 
> One odd thing I noticed in the logs :point_up: is that mainline llama.cpp reports different values for `size` and `params` than `ik_llama.cpp`. To explore a bit more, I ran the same short prompt on both with `llama-eval-callback` to see whether they are doing anything fundamentally different in the calculations. The output is quite long (too long to paste in here) and maybe not useful, but it is available if anyone is interested; the two look similar, with some differences in FUSED_RMS_NORM and such.
> 
> Finally, I'm beginning to wonder if something changed on my remote server, as it was recently down for maintenance and a reboot. Ever since then I've noticed these warnings printing with `ik_llama.cpp`:
> 
> ```
> ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
> ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
> ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
> ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
> ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
> ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
> ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 0)
> ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)
> ggml_gallocr_needs_realloc: src 0 (KQ_mask) of node KQ_mask (view) is not valid
> ```
> 
> I'll poke at the server a bit to see if anything changed, and also maybe roll back my `ik_llama.cpp` git repo a couple of weeks in case something odd changed.
> 
> 👤 **ubergarm** replied the **2025-04-27** at **21:25:01**:
> Okay, yeah, the remote server compile script was in `Debug` mode... I recompiled for `Release` and performance improved for both TG and PP and is more in line with what I would expect. Sorry for the fire drill... --- 👤 **ikawrakow** replied the **2025-04-28** at **06:08:20**:
> I ran a plain llama-bench for PP only to compare and sanity check if my adaptation of llama-sweep-bench is accurate at least for PP. It looks like in general llama-sweep-bench shows lower scores than llama-bench assuming the x-axis is the describing the "same thing" for both e.g. PP context length is similar enough to N_KV?

It is related, but not really the same. With `llama-sweep-bench` you have `N_KV` tokens in the KV cache and you compute `n_ubatch` new tokens (`n_ubatch=512` by default). With `llama-bench` you have 0 tokens in the KV cache, and you compute `N` new tokens. The `llama-bench` calculation is done in u-batches, and as it progresses, the number of tokens in the KV cache grows from 0 to `N - n_ubatch`. Hence, we can think of the `llama-bench` result as the `llama-sweep-bench` result averaged between 0 and `N - n_ubatch` tokens. If for simplicity we assume that performance decreases linearly with `N_KV`, then, to first order, the `llama-bench` PP performance for `N` tokens is about the same as `llama-sweep-bench` with `N_KV = N/2`.

> ggml_backend_sched_alloc_splits: failed to allocate graph, reserving (backend_ids_changed = 1)

If you see such messages, you are running in debug mode.

> 👤 **saood06** replied the **2025-04-28** at **07:34:44**:
> > > I ran a plain llama-bench for PP only to compare and sanity check if my adaptation of llama-sweep-bench is accurate at least for PP. It looks like in general llama-sweep-bench shows lower scores than llama-bench assuming the x-axis is the describing the "same thing" for both e.g. PP context length is similar enough to N_KV?
> > 
> > It is related, but not really the same.
> >[...]
> >If for simplicity we assume that performance decreases linearly with `N_KV`, then, to first order, the `llama-bench` PP performance for `N` tokens is about the same as `llama-sweep-bench` with `N_KV = N/2`.
> 
> You can also convert by just summing up the time taken and tokens processed up to PP, and then divide, but I like sweep bench because it does accurately reflect single-slot server usage at a depth of N_KV.

---

👤 **Nexesenex** replied the **2025-05-31** at **10:58:14**:
@ubergarm Thanks for this iq4_ks quant, it works great.

By the way, I tested the perplexity of a q8_0 and your qat iq4_ks in Serbian on an extract of a dataset named Sveznanje.

* PPL Gemma 27b it q8_0 : Final estimate: PPL = 12.8797 +/- 0.12932
* PPL Gemma 27b qat iq4_ks : Final estimate: PPL = 12.7006 +/- 0.12469

---

👤 **Nexesenex** replied the **2025-06-04** at **15:56:03**:
More.

On `llama-perplexity -m E:\text-generation-webui\models\google_gemma-3-4b-it-qat-q4_0-unquantized_CHOSENQUANT.gguf -f wiki.test.raw -fa -mg 0 -ngl 150 -ts 40,0,0 -b 512 --no-mmap -c 512`

* BF16: PPL = 15.1898 +/- 0.14353
* For pure Q4_0 (imat): PPL = 15.4831 +/- 0.14487
* For pure IQ4_XS (imat): PPL = 14.5142 +/- 0.13311 (!!!)
* For IQ4_XS (imat) with embed/output q8_0: PPL = 14.6061 +/- 0.13463
* For IQ4_XS (imat) with embed/output q6_0: PPL = 14.6017 +/- 0.13465
* For pure IQ4_KS (imat): PPL = 15.4225 +/- 0.14537
* For pure Q4_K (imat): PPL = 14.9259 +/- 0.13876
* For pure IQ4_NL (imat): PPL = 14.6848 +/- 0.13511
* For pure IQ4_K (imat): PPL = 15.5956 +/- 0.14764

No comment! ^^

Note : I quantized with my fork of Llama.cpp mainline b5588 including the IQ_K quants.
https://github.com/Nexesenex/croco.cpp/tree/NXS_Llama.cpp

Reminder :

```
## ubergarm/gemma-3-27B-it-qat-iq4_ks.gguf (ik_llama.cpp exclusive quant)
14.099 GiB (4.484 BPW)
f32: 373 tensors
type q4_0: 62 tensors blk.*.attn_v.weight
type q8_0: 1 tensors
iq4_ks: 372 tensors

Final estimate: PPL = 8.1755 +/- 0.06296
```

Edit : for Gemma 3 27b qat q4_0 unquantized bf16 to pure iq4_xs with ubergarm's Imatrix : Final estimate: PPL = 8.2903 +/- 0.06439

* For pure IQ4_KS (imat): PPL = 8.2450 +/- 0.06403
* For IQ4_KS (imat) with embed/output q8_0: PPL = 8.1996 +/- 0.06343
* For IQ4_KS (imat) with embed/output q6_0: PPL = 8.2032 +/- 0.06350
* For IQ4_KS_R4 (imat) with embed/output q6_0: PPL = 8.2032 +/- 0.06349 (identical)
* For IQ4_KS (imat) with embed/output q6_0 and attn_v q4_0 (13.77 GiB (4.38 BPW)): PPL = 8.1783 +/- 0.06300
* For IQ4_KS (imat) with embed/output q6_0 and attn_v / attn_k in q4_0 (13.79 GiB (4.39 BPW)): PPL = 8.1968 +/- 0.06324
* For IQ4_KS (imat) with embed/output q6_0 and attn_v in q5_0 / attn_k in q4_0 (13.87 GiB (4.41 BPW)): PPL = 8.2156 +/- 0.06361
* For IQ4_KS (imat) with embed/output q6_0 and attn_v in iq4_nl (13.77 GiB (4.38 BPW)): PPL = 8.2128 +/- 0.06354
* For IQ4_KS (imat) with embed/output q6_0 and attn_q / attn_k / attn_o / ffn_gate / ffn_up in new_iq2_kt (9.361 GiB (2.977 BPW)): PPL = 9.0237 +/- 0.06934 (not bad at all!)
* For IQ4_KS (imat) with embed/output q6_0 and attn_q / attn_o / ffn_gate / ffn_up in new_iq2_kt (9.529 GiB (3.031 BPW)): PPL = 9.0063 +/- 0.06917
* For IQ4_KS (imat) with embed/output q6_0 and attn_q / ffn_gate / ffn_up in new_iq2_kt (9.867 GiB (3.138 BPW)): PPL = 9.0923 +/- 0.07124

> 👤 **ubergarm** replied the **2025-06-04** at **16:54:07**:
> Yeah, these QATs are wild: the 4bpw "beats" the bf16!?! And for some reason the `iq4_ks` quants (blocks of 32) seem to do very well. Pretty sure `iq4_ks` is a strict upgrade of `iq4_xs`; as I understand it, it has the same bpw with better PPL.
> 
> Thanks again for sharing your results! Definitely check out the new hotness `iqN_kt`, which are basically [QTIP](https://arxiv.org/html/2406.11235v3) / exl3 style trellis quants. So far I'd say they are like a smaller version of `iqN_k` with similar perplexity, but I need to do more testing as the implementation isn't fully baked yet.
> 
> 👤 **Nexesenex** replied the **2025-06-05** at **02:42:48**:
> Well, it seems your iq4_ks was optimal. I'm satisfied with a q6_0 embed/output instead of Q8_0, but that's it.
> Generally, ofc iq4_ks is better, but on small models, I guess some tensors are so small that the rules can be a bit different, as seen on the 4b.
> On my side, I had the first Trellis Cuda implementation made by IK working on Croco (6 months ago, maybe?) but I have yet to make work the second, it gives me gibberish for now. Probably missed a part of code somewhere.
> 
> 👤 **ubergarm** replied the **2025-06-08** at **22:51:42**:
> > Well, it seems your iq4_ks was optimal. > > I went back and looked at my recipe and oddly enough I think the `attn_k_b` are at `q4_0` and not actually `iq4_ks` for some reason. Maybe a mistake on my part or confusion about tensor dimensions vs quant limitations. > > I just tried one of the new [iq4_kt quants on this same gemma-3-qat](https://github.com/ikawrakow/ik_llama.cpp/pull/505#issuecomment-2954265218) which is smaller but initial perplexity looks a little higher than the `iq4_ks`. > > > On my side, I had the first Trellis Cuda implementation made by IK working on Croco (6 months ago, maybe?) but I have yet to make work the second, it gives me gibberish for now. Probably missed a part of code somewhere. > > Oh wow, I see you've had them a while! I'm just catching up lol. Yeah I believe the exact implementation is still in flux, so I haven't released any quants with it just yet. > > 👤 **Thireus** replied the **2025-07-04** at **22:15:18**:
> I'm trying to spot the difference between iq4_ks and iq4_xs. They seem to have the same bpw, the same performance and the same PPL. Am I mistaken?
> 
> 👤 **ikawrakow** replied the **2025-07-05** at **09:25:02**:
> Maybe this is not the case for the model you are looking at, but typically `IQ4_KS` will have a slightly better PPL than `IQ4_XS`. Otherwise, yes, they are very similar. The differences are:
> * `IQ4_XS` uses an `fp16` scale per super-block of 256. `IQ4_KS` has a single scale per tensor row.
> * The 16 bits per 256 weights saved that way give 2 extra bits per block of 32. One is spent on extra precision for the block scale, one is spent to select between 2 non-linear lookup tables. Being able to choose between two lookup tables results in a slightly lower difference to the model weights being quantized.
> 
> As `IQ4_KS` does not need super-blocks of 256, theoretically one could remove the requirement for tensor row size being a multiple of 256. I haven't done that yet, but will if models with row sizes that are not a multiple of 256 become more common.
> 
> 👤 **Thireus** replied the **2025-07-05** at **13:48:21**:
> Thank you for the clarification! Indeed on DeepSeek-R1-0528 I haven't noticed a difference. > > 👤 **saood06** replied the **2025-07-07** at **04:47:41**:
> > As `IQ4_KS` does not need super-blocks of 256, theoretically one could remove the requirement for tensor row size being a multiple of 256. I haven't done that yet, but will if models with row sizes that are not a multiple of 256 become more common. > > Does that mean `IQ4_KS` and `IQ5_KS` could be used for the KV cache, or is there some other limitation? > > 👤 **ikawrakow** replied the **2025-07-07** at **05:07:40**:
> > Does that mean IQ4_KS and IQ5_KS could be used for the KV cache, or is there some other limitation?
> 
> Theoretically yes, but with caveats:
> * The quantization needs to use a simpler and less accurate algorithm, else storing data in the cache will be too slow. This is already done for `IQ4_NL`, and `IQ4_KS/IQ5_KS` will be similar.
> * One runs into the same issue as with `Q8_KV` for DeepSeek, where the V cache is just a view into the K cache, so misses the row scale.
> 
> 👤 **saood06** replied the **2025-07-07** at **05:40:24**:
> >The quantization needs to use a simpler and less accurate algorithm, else storing data in the cache will be too slow. This is already done for IQ4_NL, and IQ4_KS/IQ5_KS will be similar > > I didn't know that, so based on that do you think they would offer quality/size benefits over the existing types? > > 👤 **ikawrakow** replied the **2025-07-07** at **07:59:21**:
> An `IQ4_NL` KV cache halves the PPL difference between a `Q4_0` KV cache and an `fp16` KV cache, but its PPL is still somewhat higher than with a `Q5_0` KV cache. My guess is that `IQ4_KS` will perform similarly to `IQ4_NL`. Not sure if/how much better `IQ5_KS` will be compared to `Q5_0` for KV cache quantization.
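
As a footnote to the `llama-bench` versus `llama-sweep-bench` exchange earlier in this thread, the "`N_KV = N/2`" rule of thumb can be checked with a few lines of Python. This is only a sketch under stated assumptions: `sweep_speed` is a made-up linear model of PP speed versus `N_KV` (the `s0` and `slope` values are hypothetical, not measurements), and `bench_speed` emulates a `llama-bench`-style run by summing the time of each u-batch and dividing total tokens by total time, as saood06 suggested.

```python
def sweep_speed(n_kv, s0=1000.0, slope=1e-5):
    """Hypothetical PP speed (tokens/s) with n_kv tokens already in the KV cache.

    Assumes speed falls linearly with n_kv; s0 and slope are invented values.
    """
    return s0 * (1.0 - slope * n_kv)


def bench_speed(n_tokens, n_ubatch=512):
    """Emulate a llama-bench-style PP run under the model above.

    Starts from an empty cache and processes n_tokens in u-batches, so the
    cache depth seen by each u-batch goes 0, n_ubatch, ..., n_tokens - n_ubatch.
    Assumes n_tokens is a multiple of n_ubatch.
    """
    total_time = 0.0
    for n_kv in range(0, n_tokens, n_ubatch):
        total_time += n_ubatch / sweep_speed(n_kv)
    return n_tokens / total_time


N = 8192
bench = bench_speed(N)
midpoint = sweep_speed(N // 2)
rel_diff = abs(bench - midpoint) / midpoint

print(f"llama-bench-like PP over N={N}: {bench:.1f} t/s")
print(f"sweep-bench-like PP at N_KV=N/2: {midpoint:.1f} t/s")
print(f"relative difference: {rel_diff:.2%}")
```

With a gentle linear slope like this one the two numbers land within a fraction of a percent of each other. A steeper or non-linear falloff would widen the gap, which is why the rule is only a first-order approximation.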