🔀 #124 - iq2_bn_r4: fastest Bitnet CPU implementation on the planet
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-06 |
| Updated | 2024-12-06 |
Description
Following in the footsteps of #118, #119, #120, #121, #122, #123, this PR adds IQ2_BN_R4, a 4-row interleaved packing of the 2-bit Bitnet quantization type IQ2_BN.
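As a rough illustration of what a 4-row interleaved packing means, the sketch below packs ternary weights into 2 bits each and then interleaves four packed rows in fixed-size chunks. The helper names (`pack_row`, `interleave_rows`, `deinterleave_rows`), the `w + 1` encoding, and the chunk size are assumptions made for this example only; the actual IQ2_BN_R4 memory layout is not described in this PR text and may well differ.

```python
from typing import List


def pack_row(weights: List[int]) -> bytes:
    """Pack ternary weights (-1, 0, 1) into 2 bits each, 4 weights per byte.

    Hypothetical encoding: each weight is stored as w + 1, i.e. 0, 1 or 2.
    """
    assert len(weights) % 4 == 0
    out = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            assert w in (-1, 0, 1)
            byte |= (w + 1) << (2 * j)
        out.append(byte)
    return bytes(out)


def interleave_rows(rows: List[bytes], block: int = 4) -> bytes:
    """Interleave equal-length packed rows in chunks of `block` bytes.

    Output order: chunk 0 of row 0, chunk 0 of row 1, ..., chunk 1 of row 0, ...
    """
    n = len(rows[0])
    assert all(len(r) == n for r in rows) and n % block == 0
    out = bytearray()
    for start in range(0, n, block):
        for r in rows:
            out += r[start:start + block]
    return bytes(out)


def deinterleave_rows(data: bytes, n_rows: int = 4, block: int = 4) -> List[bytes]:
    """Inverse of interleave_rows: recover the individual packed rows."""
    rows = [bytearray() for _ in range(n_rows)]
    for k in range(0, len(data), block):
        rows[(k // block) % n_rows] += data[k:k + block]
    return [bytes(r) for r in rows]
```

The point of such a layout is that a kernel reading the buffer sequentially sees matching chunks of four rows next to each other, so one loaded chunk of the right matrix can be reused for all four rows.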
Here is PP-512 for Bitnet-1.58b-3B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max) and AVX2 (Ryzen-5975WX):
| Platform | Threads | IQ2_BN | IQ2_BN_R4 | Speedup |
|---|---|---|---|---|
| ARM_NEON | 8 | 246.57 ± 1.66 | 304.68 ± 0.77 | 1.236 |
| Zen4 | 16 | 631.27 ± 2.81 | 834.46 ± 2.77 | 1.322 |
| AVX2 | 32 | 694.17 ± 0.60 | 704.62 ± 0.60 | 1.015 |
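The Speedup column is the ratio of the IQ2_BN_R4 throughput to the IQ2_BN throughput. A minimal sketch that recomputes it from the mean values in the table (the error bars are ignored, and the `pp512` dictionary and `speedup` helper are names introduced here for illustration):

```python
# Mean PP-512 throughput in t/s, copied from the table: (IQ2_BN, IQ2_BN_R4)
pp512 = {
    "ARM_NEON": (246.57, 304.68),
    "Zen4": (631.27, 834.46),
    "AVX2": (694.17, 704.62),
}


def speedup(baseline: float, improved: float) -> float:
    """Ratio of improved to baseline throughput."""
    return improved / baseline


for platform, (iq2_bn, iq2_bn_r4) in pp512.items():
    print(f"{platform}: {speedup(iq2_bn, iq2_bn_r4):.3f}")
```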
There aren't enough vector registers on AVX2 to hold all the necessary accumulators when processing 8 right matrix columns at once. Hence, one needs two passes per group of interleaved left matrix rows, so the gain on AVX2 is minor. But on Zen4 we now achieve 834 t/s! In comparison, T-MAC, a repository with currently 607 stars that makes bold claims about being the fastest Bitnet CPU implementation, achieves 300 t/s on the same Ryzen-7950X system.
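To make the "two passes" remark concrete, here is a scalar toy model, not the SIMD kernel from this PR. It multiplies one group of left rows by a set of right columns, processing the columns in tiles of `cols_per_pass`. The function name `matmul_tiled` and the choice of 4 columns per pass for the AVX2 case are assumptions for illustration; the result is identical either way, but the left rows are streamed once per pass, which is where the extra cost comes from.

```python
from typing import List, Tuple


def matmul_tiled(
    left_rows: List[List[int]],
    right_cols: List[List[int]],
    cols_per_pass: int,
) -> Tuple[List[List[int]], int]:
    """Compute result[r][c] = dot(left_rows[r], right_cols[c]).

    Columns are handled in tiles of `cols_per_pass`. The returned pass count
    equals the number of times the left row group has to be read.
    """
    n_cols = len(right_cols)
    result = [[0] * n_cols for _ in left_rows]
    passes = 0
    for c0 in range(0, n_cols, cols_per_pass):
        passes += 1  # one full sweep over the left rows per column tile
        for r, row in enumerate(left_rows):
            for c in range(c0, min(c0 + cols_per_pass, n_cols)):
                result[r][c] = sum(a * b for a, b in zip(row, right_cols[c]))
    return result, passes
```

With 8 right columns, `cols_per_pass=8` needs one pass over the left rows, while `cols_per_pass=4` needs two and produces the same result.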
TG is of course memory bound, but for a small number of threads I also observe a speedup. The table shows measurements for TG-128 on the above 3 platforms (results are shown only up to the number of threads that achieves maximum performance):
| Platform | Threads | IQ2_BN | IQ2_BN_R4 | Speedup |
|---|---|---|---|---|
| ARM_NEON | 1 | 21.01 ± 0.08 | 24.75 ± 0.08 | 1.178 |
| | 2 | 39.15 ± 0.02 | 45.48 ± 0.08 | 1.162 |
| | 4 | 64.39 ± 0.17 | 71.82 ± 1.84 | 1.115 |
| | 8 | 99.60 ± 0.53 | 100.74 ± 1.13 | 1.011 |
| Zen4 | 1 | 25.91 ± 0.12 | 30.35 ± 0.15 | 1.171 |
| | 2 | 45.03 ± 0.22 | 50.93 ± 0.18 | 1.131 |
| | 4 | 57.42 ± 0.08 | 57.40 ± 0.06 | 1.000 |
| AVX2 | 1 | 16.39 ± 0.00 | 18.42 ± 0.11 | 1.124 |
| | 2 | 29.94 ± 0.03 | 31.56 ± 0.01 | 1.054 |
| | 4 | 44.09 ± 0.02 | 45.26 ± 0.03 | 1.027 |
| | 8 | 47.28 ± 0.04 | 49.25 ± 0.02 | 1.042 |