Commit Graph

317 Commits

Author SHA1 Message Date
Kawrakow
ef95b81733 R4 improvements on ARM_NEON (#125)
* q4_0_r4: 6% faster PP on NEON

* qx_0_r4_q8_0 template

Applied to q4_0_r4 and q5_0_r4. It makes q5_0_r4 PP
~7% faster.
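
A minimal sketch of the templating idea (names and the scalar math are illustrative assumptions, not the actual qx_0_r4_q8_0 code, which works on SIMD registers and packed nibbles):

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical scalar illustration: the dot-product loop is written once and
// parameterized on a small dequantizer type, so q4_0_r4 and q5_0_r4 differ
// only in how a quant is expanded (one quant per byte here for simplicity).
struct DequantQ4_0 {                    // 4-bit value, offset by 8
    static int expand(uint8_t q, uint8_t) { return int(q & 0xf) - 8; }
};
struct DequantQ5_0 {                    // one extra high bit per weight, offset by 16
    static int expand(uint8_t q, uint8_t h) { return int((q & 0xf) | (h << 4)) - 16; }
};

template <typename Dequantizer>
float dot_qx_0_q8_0(const uint8_t *qs, const uint8_t *qh, const int8_t *y,
                    float d, float dy, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += Dequantizer::expand(qs[i], (qh[i/8] >> (i%8)) & 1) * y[i];
    return d * dy * sum;                // product of the two block scales
}
// e.g. dot_qx_0_q8_0<DequantQ4_0>(...) vs dot_qx_0_q8_0<DequantQ5_0>(...)
```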

* Apply qx_0_r4_q8_0 template also to q6_0_r4 and iq4_nl_x4

* Simplify

* Minor iq4_xs_r4 improvement on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-08 09:13:10 +01:00
Kawrakow
3682e4700d iq2_bn_r4: fastest Bitnet CPU implementation on the planet (#124)
* Adding iq2_bn_r4

This Zen4-only implementation achieves PP-512 = 826 t/s (!!!)
for Bitnet-1.58b-3B, up from 620 t/s for iq2_bn.

* Make sure rows per thread are a multiple of the number of interleaved rows

With this I can run iq2_bn_r4 with 32 threads and this increases
PP-512 to 872 t/s.
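
A minimal sketch of the row-splitting constraint (hypothetical helper, not the actual ik_llama.cpp code):

```cpp
// Give each thread a chunk of rows that is a multiple of the number of
// interleaved rows (4 for the _r4 types), so no thread starts or ends in the
// middle of an interleaved group.
static void split_rows(int nrows, int nthreads, int n_interleaved,
                       int ith, int *first_row, int *nrows_for_thread) {
    int ngroups    = nrows / n_interleaved;                // interleaved row groups
    int per_thread = (ngroups + nthreads - 1) / nthreads;  // groups per thread, rounded up
    int g0 = ith * per_thread;
    int g1 = g0 + per_thread < ngroups ? g0 + per_thread : ngroups;
    *first_row        = g0 * n_interleaved;
    *nrows_for_thread = (g1 > g0 ? g1 - g0 : 0) * n_interleaved;
}
```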

* iq2_bn_r4: 1st shot at NEON

PP-512 is already faster than iq2_bn (284 t/s vs 246 t/s
for Bitnet-1.58b-3B). TG-128 is ~5% slower.

* iq2_bn_r4: NEON

PP-512 is now 296 t/s. TG-128 is ~20% faster than iq2_bn
for 1 thread, but saturates to about the same 93 t/s at
8 threads.

* iq2_bn_r4: Experimenting on NEON

The matrix x vector multiplication is erratic.
iq2_bn_r4 is faster at 1, 2, and 4 threads, but
saturates to a lower t/s at 8 threads compared to
iq2_bn. iq2_bn actually manages 99 t/s at 8 threads
and not 93 as I wrote in the last commit. iq2_bn_r4
performance has huge fluctuations at 4 and 8 threads.

* Some cleanup

* iq2_bn_r4: AVX2

As expected, PP is slightly slower as we just don't have
enough vector registers (690 vs 710 t/s). TG is slightly faster
(18.2 vs 16.7 t/s at 1 thread).

* iq2_bn_r4: use AVX2 implementation on Zen4 for matrix x vector

It is faster - we get 29.6 t/s at 1 thread vs 25.9 t/s for iq2_bn.

* iq2_bn_r4: simdify q8_K16 quantization (AVX2)

PP-512 becomes 834 t/s and TG-128 now saturates to the same
performance as iq2_bn for 4 threads.

* iq2_bn_r4: simdify q8_K16 quantization (NEON)

PP-512 is now 304.7 t/s, and TG-128 @ 8 threads
very slightly outperforms iq2_bn (100.7 t/s vs 99.6 t/s)

* iq2_bn_r4: fix AVX2 after breaking it two commits ago

* iq2_bn_r4: better AVX2

As we don't have enough vector registers on AVX2, it is better
to do two passes per row, which needs only half of the
accumulator registers.
With this, we now beat iq2_bn PP also on AVX2 by a small margin.
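
For illustration only, here is the two-pass idea on plain f32 data (the real iq2_bn_r4 kernel works on packed 2-bit quants, but the register-pressure argument is the same): with only 16 ymm registers, keeping 8 column accumulators live leaves too little room for unpacking the quants, so each pass handles 4 columns.

```cpp
#include <immintrin.h>

// Dot one row a (length k, multiple of 8) against 8 columns b[0..7] in two
// passes of 4 columns, so only 4 accumulator registers are live at a time.
void row_times_8cols(const float *a, const float *b[8], int k, float out[8]) {
    for (int pass = 0; pass < 2; ++pass) {
        __m256 acc[4] = { _mm256_setzero_ps(), _mm256_setzero_ps(),
                          _mm256_setzero_ps(), _mm256_setzero_ps() };
        for (int i = 0; i < k; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            for (int j = 0; j < 4; ++j)
                acc[j] = _mm256_fmadd_ps(va, _mm256_loadu_ps(b[4*pass + j] + i), acc[j]);
        }
        for (int j = 0; j < 4; ++j) {      // horizontal sum of each accumulator
            __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc[j]),
                                  _mm256_extractf128_ps(acc[j], 1));
            s = _mm_hadd_ps(s, s);
            s = _mm_hadd_ps(s, s);
            out[4*pass + j] = _mm_cvtss_f32(s);
        }
    }
}
```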

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-06 12:15:39 +01:00
Kawrakow
f64de08203 IQ4_XS_R4 (#123)
* Adding iq4_xs_r4

This is a 1st working version on Zen4.
We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower
than iq4_nl_x4.

* iq4_xs_r4: WIP

* iq4_xs_r4: Use AVX2 version for matrix x vector on Zen4

* iq4_xs_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max,
up from 68.2 t/s for iq4_xs!

* DRY

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-04 15:20:07 +01:00
Kawrakow
f1f4eb988f Q6_0_R4 (#122)
* Adding q6_0_r4

We get PP-512(LLaMA-3.1-8B) = 257 t/s on a Ryzen-7950X.

* q6_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 95 t/s on M2-Max.
In terms of ops, q6_0_r4 is identical to q5_0_r4,
except that loading the high bits uses
vld1q_u8_x2 instead of vld1q_u8. It is strange that
this can make a 5% difference in performance, especially
considering that this is amortized (re-used) over 8 columns
in the right matrix. Or am I running out of vector registers?
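
Purely as an illustration of that one difference (assuming 4 interleaved rows, i.e. 16 vs 32 bytes of high bits per block):

```cpp
#include <arm_neon.h>

// q5_0_r4: 1 high bit per weight -> 16 bytes of high bits per 4-row block
static inline uint8x16_t   load_hbits_q5_0_r4(const uint8_t *qh) { return vld1q_u8(qh);    }
// q6_0_r4: 2 high bits per weight -> 32 bytes, hence the paired load
static inline uint8x16x2_t load_hbits_q6_0_r4(const uint8_t *qh) { return vld1q_u8_x2(qh); }
```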

* Fix AVX2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 14:48:26 +01:00
Kawrakow
c5bf589367 Q5_0_R4 (#121)
* Adding q5_0_r4

We get PP-512(LLaMA-3.1-8B) = 256.7 t/s on a Ryzen-7950X.
We even get TG-128 improvement to 11.7 t/s from 11.1 t/s.

* q5_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 99.6 t/s on M2-Max,
up from 71.0 t/s for Q5_0. The difference to mainline llama.cpp
is no longer funny: they get 26.5 t/s for Q5_0.

For TG, we are not able to fully saturate memory bandwidth
and arrive at 22.1 t/s @ 8 threads. Mainline llama.cpp gets
20.6 t/s for Q5_0.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 12:59:22 +01:00
Kawrakow
ccec00939a Q8_0_R4 (#120)
* Adding q8_0_r4

We get PP-512(LLaMA-3.1-8B) = 268 t/s on a Ryzen-7950X compared
to 175.6 t/s for Q8_0.

* q8_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 112.6 t/s on M2-Max.

* q8_0_r4: Zen4 matrix-vector specialization

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 06:15:29 +01:00
Kawrakow
239a344f99 Q4_0_R4 (#119)
* Adding q4_0_r4 - q4_0 repacked

We get PP-512(LLaMA-3.1-8B) = 278 t/s on a Ryzen-7950X CPU,
so ~5-6% faster than iq4_nl_x4.

* q4_0_r4: NEON

Here we get 115.8 t/s, so also ~5% better than iq4_nl_x4.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02 17:01:48 +01:00
Kawrakow
6d0462d4a3 IQ4_NL_X4 (#118)
* Adding iq4_nl_x4

Looks very promising - I get PP-512(LLaMA-3.1-8B) = 230 t/s
on the Ryzen-7950X! This is faster than any other quant and
~40% faster than iq4_nl.

* iq4_nl_x4: getting amazing

This Zen4 variant gets us to PP-512(LLaMA-3.1-8B) = 263 t/s!

* iq4_nl_x4: AVX2

Here we gain only 25% compared to iq4_nl.

* iq4_nl_x4: NEON

On M2-Max we get PP-512(LLaMA-3.1-8B) = 109.7 t/s, up from
82.4 t/s for iq4_nl.

* iq4_nl_x4: minor NEON improvement and cleanup

This gets us to 110.3 t/s. In comparison,
IQ4_NL_4_4 in mainline llama.cpp achieves 92.3 t/s.

* iq4_nl_x4: NEON specialization for matrix x vector

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02 07:25:39 +01:00
Iwan Kawrakow
3a9926b932 Checkpoint
Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).
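
In bpw terms that is 16/8 = 2 bpw for the magnitudes plus 1 bpw for the signs, i.e. 3 bpw before block scales; the iq3_kt commits further down quote 3.125 bpw in total, so the scales presumably account for the remaining 0.125 bpw.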

I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
2be4cffe66 Minor tweaks 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
3ee5434601 DRY 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
81cd220f93 iq4_kt: CUDA dot product works 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
79565c92e0 DRY 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e338e0a0cd DRY 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4cf82e7e2f iq4_kt: failed attempt to adjust CUDA dot product
It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
ab1cef30e7 iq4_kt: very slightly better
at the expense of much longer quantization time.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
1be0a9e0d7 iq4_kt: go to 4.0 bpw
15 bits per group of 4, plus 8-bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.
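
The bit budget works out exactly: 15/4 + 8/32 = 3.75 + 0.25 = 4.0 bpw.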
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
21903f19b4 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
c20b22b9a0 iq3_kt: small progress 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4213ab1cb3 iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
215bea5c6a iq3_kt: small improvements and faster quantization 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
dbe085474a iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
200a19f18f iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
de7fe92833 iq4_kt: minor tweaks 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e9ced1bbe6 iq4_kt: CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
21ee589996 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
1d6ca83203 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
00b4bff286 Adding iq4_kt - not competitive at this point 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
47b28c1e92 iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4608f0cc6d iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
0ffc9b435c iq3_kt: CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e9e5879b94 iq3_kt speed up quantization
Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
c59830dafb iq3_kt WIP: speed up quantization
Nearly 60% improvement in quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
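
A minimal sketch of the idea, with hypothetical names (not the actual ik_llama.cpp code): gather each cluster's candidate points into one contiguous buffer up front, so the inner nearest-point search becomes a sequential, cache-friendly scan.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct Cluster { std::vector<float> points; int dim; };   // npoints*dim values, contiguous

// Copy the points belonging to one cluster out of the big scattered table.
static Cluster gather_cluster(const float *all_points, int dim,
                              const std::vector<int> &members) {
    Cluster c{std::vector<float>(members.size()*dim), dim};
    for (size_t i = 0; i < members.size(); ++i)
        std::copy_n(all_points + size_t(members[i])*dim, dim, c.points.data() + i*dim);
    return c;
}

// Sequential scan over the contiguous buffer for the point closest to x.
static int nearest_point(const Cluster &c, const float *x) {
    int best = -1; float best_d = std::numeric_limits<float>::max();
    int n = int(c.points.size()) / c.dim;
    for (int i = 0; i < n; ++i) {
        const float *p = c.points.data() + size_t(i)*c.dim;
        float d = 0;
        for (int j = 0; j < c.dim; ++j) { float t = x[j] - p[j]; d += t*t; }
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```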
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
8f0d075f5e iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
dfcc8a9cf3 iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
386d139e13 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
f1fb59b44b iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
435eb9bdd3 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
08503cec7d WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
977f94b3e0 Forgotten change 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4774788136 Adding iq3_kt
3.125 bpw. So far it does not look good on the PPL vs bpw plot.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
590f47278b Minor 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
7bf6e158a9 iq2_kt: faster f16 CUDA dot product
We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
7cafafc69e iq2_kt: faster f16 CUDA dot product
We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we would be at around ~180 t/s on their GPU, almost matching
their performance.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
b354392c77 iq2_kt: f16 CUDA dot product
We arrive at 112 t/s.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
aed3910dfa iq2_kt: very slightly faster CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
d2331b9287 iq2_kt: CUDA dot product
Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
b3dfe9984b iq2_kt - even better
Re-quantize after determining block scales
(at the expense of much longer quantization time).
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
36e9c922b8 iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights,
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
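
For reference, iq2_xxs is 2.0625 bpw. Being 0.0625 bpw larger puts this variant at 2.125 bpw, i.e. 16/8 = 2 bpw of trellis data plus what is presumably 0.125 bpw (4 bits per block of 32) of scales; with 15 bits per group of 8 it becomes 15/8 + 0.125 = 2.0 bpw, 0.0625 bpw below iq2_xxs.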
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
426a6e685f iq2_kt: CUDA dequantize
so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.
2024-11-21 08:16:41 +02:00