🔀 #566 - Adding IQ3_KS quants
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-07-01 |
| Updated | 2025-07-02 |
Description
This PR adds IQ3_KS: 3.1875 bpw quants with a block size of 32. This completes the IQX_KS quant series (a rough storage-cost sketch follows the table below):
| type | bpw |
|---|---|
| IQ2_KS | 2.1875 |
| IQ3_KS | 3.1875 |
| IQ4_KS | 4.25 |
| IQ5_KS | 5.25 |
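To put the bpw figures in perspective, here is a minimal sketch (not part of this PR, and not the actual packed layout) that estimates the raw weight-storage cost at each bpw for a hypothetical 8B-parameter model. As a side note on the arithmetic, 3.1875 bpw is exactly 3 bits per weight plus 6 extra bits per 32-weight block (6/32 = 0.1875); the real on-disk layout is defined by the quantization code in this PR.

```cpp
#include <cstdio>
#include <cstdint>

// Rough storage estimate for weights quantized at a given bits-per-weight.
// Illustrative arithmetic only - not the IQK block layout.
static double tensor_bytes(uint64_t n_weights, double bpw) {
    return n_weights * bpw / 8.0;
}

int main() {
    // Hypothetical example: an 8B-parameter model's weight tensors.
    const uint64_t n_weights = 8ull * 1000 * 1000 * 1000;
    struct { const char * name; double bpw; } quants[] = {
        { "IQ2_KS", 2.1875 }, { "IQ3_KS", 3.1875 },
        { "IQ4_KS", 4.25   }, { "IQ5_KS", 5.25   },
    };
    for (const auto & q : quants) {
        printf("%-7s %.4f bpw -> %.2f GiB of weights\n", q.name, q.bpw,
               tensor_bytes(n_weights, q.bpw) / (1024.0 * 1024.0 * 1024.0));
    }
    // 3.1875 = 3 bits per weight + 6 bits per 32-weight block (6/32 = 0.1875).
    printf("IQ3_KS bpw check: %.4f\n", 3.0 + 6.0 / 32.0);
}
```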
CUDA and CPU performance are very good; Metal performance is not as strong.
Here are a few sweep-bench results for LLaMA-3.1-8B-Instruct:
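For readers unfamiliar with the sweep-bench tables: my reading of the columns (inferred from the numbers, not taken from the tool's source) is that each row times a 512-token prompt-processing chunk (PP) and a 128-token generation run (TG) at a given KV-cache depth (N_KV), with the S_* columns being tokens divided by elapsed seconds. A tiny sanity-check sketch, using the first RTX-4080 row:

```cpp
#include <cstdio>

// Throughput = tokens / elapsed seconds. The times below are copied from the
// first RTX-4080 row; small differences vs. the table come from the printed
// times being rounded to three decimals.
int main() {
    const int    pp   = 512,   tg   = 128;   // tokens per PP chunk / TG run
    const double t_pp = 0.065, t_tg = 0.887; // seconds, from the table
    printf("S_PP ~ %.1f t/s (table: 7932.94)\n", pp / t_pp);
    printf("S_TG ~ %.1f t/s (table: 144.38)\n",  tg / t_tg);
}
```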
RTX-4080
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 512 | 0.065 | 7932.94 | 0.887 | 144.38 |
| 512 | 128 | 1024 | 0.066 | 7725.27 | 0.893 | 143.35 |
| 512 | 128 | 1536 | 0.068 | 7551.51 | 0.908 | 141.02 |
| 512 | 128 | 2048 | 0.069 | 7404.30 | 0.924 | 138.59 |
| 512 | 128 | 2560 | 0.072 | 7098.39 | 0.939 | 136.30 |
| 512 | 128 | 3072 | 0.074 | 6873.96 | 0.955 | 134.08 |
| 512 | 128 | 3584 | 0.074 | 6890.43 | 0.969 | 132.07 |
| 512 | 128 | 4096 | 0.077 | 6620.20 | 0.987 | 129.64 |
| 512 | 128 | 4608 | 0.079 | 6445.44 | 1.000 | 128.00 |
| 512 | 128 | 5120 | 0.081 | 6350.94 | 1.026 | 124.82 |
| 512 | 128 | 5632 | 0.083 | 6175.82 | 1.033 | 123.97 |
| 512 | 128 | 6144 | 0.084 | 6071.67 | 1.043 | 122.77 |
| 512 | 128 | 6656 | 0.086 | 5944.16 | 1.057 | 121.15 |
| 512 | 128 | 7168 | 0.088 | 5810.65 | 1.071 | 119.46 |
| 512 | 128 | 7680 | 0.090 | 5693.89 | 1.087 | 117.77 |
Ryzen-7950X (Zen4)
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.423 | 359.79 | 7.616 | 16.81 |
| 512 | 128 | 512 | 1.479 | 346.15 | 7.800 | 16.41 |
| 512 | 128 | 1024 | 1.537 | 333.06 | 7.979 | 16.04 |
| 512 | 128 | 1536 | 1.603 | 319.47 | 7.939 | 16.12 |
| 512 | 128 | 2048 | 1.661 | 308.29 | 7.984 | 16.03 |
| 512 | 128 | 2560 | 1.722 | 297.39 | 8.071 | 15.86 |
| 512 | 128 | 3072 | 1.778 | 287.90 | 8.154 | 15.70 |
| 512 | 128 | 3584 | 1.841 | 278.04 | 8.241 | 15.53 |
Ryzen-5975WX
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.697 | 301.64 | 6.933 | 18.46 |
| 512 | 128 | 512 | 1.760 | 290.91 | 7.062 | 18.13 |
| 512 | 128 | 1024 | 1.834 | 279.19 | 7.217 | 17.74 |
| 512 | 128 | 1536 | 1.910 | 268.03 | 7.414 | 17.26 |
| 512 | 128 | 2048 | 1.985 | 257.88 | 7.555 | 16.94 |
| 512 | 128 | 2560 | 2.062 | 248.26 | 7.666 | 16.70 |
| 512 | 128 | 3072 | 2.140 | 239.29 | 7.810 | 16.39 |
| 512 | 128 | 3584 | 2.217 | 230.98 | 7.987 | 16.03 |
M2-Max CPU
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 3.119 | 164.13 | 5.410 | 23.66 |
| 512 | 128 | 512 | 3.322 | 154.14 | 5.487 | 23.33 |
| 512 | 128 | 1024 | 3.614 | 141.66 | 5.658 | 22.62 |
| 512 | 128 | 1536 | 3.872 | 132.23 | 5.735 | 22.32 |
| 512 | 128 | 2048 | 4.089 | 125.21 | 5.911 | 21.65 |
M2-Max 30-core GPU
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.088 | 470.79 | 3.255 | 39.33 |
| 512 | 128 | 512 | 1.106 | 462.77 | 3.411 | 37.53 |
| 512 | 128 | 1024 | 1.126 | 454.85 | 3.579 | 35.77 |
| 512 | 128 | 1536 | 1.153 | 444.08 | 3.762 | 34.03 |
| 512 | 128 | 2048 | 1.178 | 434.48 | 3.965 | 32.28 |
| 512 | 128 | 2560 | 1.207 | 424.23 | 4.118 | 31.08 |
| 512 | 128 | 3072 | 1.235 | 414.51 | 4.290 | 29.84 |
| 512 | 128 | 3584 | 1.265 | 404.69 | 4.461 | 28.69 |
💬 Conversation
👤 ikawrakow commented on 2025-07-02 at 07:27:42:
Let's merge this so people don't get crashes when trying to run IQ3_KS models with the main branch.
👤 Nexesenex commented on 2025-07-02 at 15:01:59:
Thanks for the explanation; I understand that the alternatives you have at the moment are quite impractical.
In any case, thank you for the IQ3_KS (and the CUDA MMQ kernels you kindly provided for most quants). It completes the KS quant set, which is more practical to quantize than the Trellis quants, which are indeed very demanding. I'm very happy with all of this, compared to what mainline currently limits itself to.