Commit Graph

109 Commits

Author SHA1 Message Date
Kawrakow
9be8812727 Add ability to hide imatrix details in llama-quantize (#329)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:41:31 +02:00
Kawrakow
8210ed4883 Add copyright notices (#317)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 10:43:26 +02:00
Kawrakow
4158743014 Specify tensor name regex for tensors to be repacked (#274)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 10:51:37 +01:00
Kawrakow
c5e554f941 Convert models to row-interleaved quants using the quantize tool (#272)
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 07:23:36 +01:00
Kawrakow
f8fb8ec9aa Custom quantization rules with regular expressions (#244)
* Custom quantization rules with regular expressions

* Add the --custom-q option to the help

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-07 08:54:09 +02:00
Kawrakow
1140b4568d Q8_KV: 8-bit quantization type targeting the KV cache (#208)
* Adding q8_KV - Basics + AVX2 gemm/gemv

* q8_KV: Better AVX2 gemm

* q8_KV: Better Zen4 gemm

We get 225.7 t/s for L3-8B. In comparison q8_0 without
run-tinme-repacking is at 169 t/s.

* q8_KV: AVX2 gemm/gemv

We get 254 t/s for L3-8B vs 194 t/s for q8_0 without rtr.

* q8_KV: be able to use it for K cache

This required quite a few fixes in ggml and llama.cpp:
* ggml: do not calculate row size as n/block_size*type_size. I had
  removed most of it when implementing the quants with per row scale,
  bit it was stull lurking in ggml_copy. Not sure if these were the last
  remnants of ggmil-style row sizes, or if there are still places left
* llama.cpp: get rid of the the 1d K cache assumption. Create and manage
  the K-cache as a 2D tensor so we can have per row meta data as needed
  by q8_KV.

Using q8_KV for K-cache results in non-negligible performance gains.
More details to follow, but for DeepSeek-Lite with MLA, we get
18% speedup for PP-8192 compared to q8_0 K-cache.

* q8_KV: be able to use it for K cache in FA

* q8_KV: repack it for K*Q in FA

* q8_KV: slightly faster gemv on Zen4

* q8_KV: slightly faster gemv on Zen4

* q8_KV: ARM_NEON

We get PP-512 = 167 t/s for L3-8B without interleaving!
We do the interleaving on the fly, so I wonder if this
could be done for other quants as well.

* q8_KV: use it in FA on NEON

* q8_KV_r8 - repacked q8_KV

On Zen4 it is slower than q8_k_r8 (292 vs 370 t/s)
This makes no sense whatsoever as the q8_KV_r8 GEMM is
basically the q8_k_r8 GEMM with the unnecessary block stuff
removed (so, one would think that it would be faster).

* q8_KV_r8: don't use nrc_y = 16 on Zen4

This is faster - 350 t/s. Why?
Much better than the 290 t/s we had before, but still slower
than the 370 t/s for q8_k_r8.

* q8_KV: nrc_y = 16 also doesn't pay off in FA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-19 11:47:07 +02:00
Kawrakow
8049ffcbc8 Rename q4_0_r4, q8_0_r4 and iq4_xs_r4 to _r8 (#189)
* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving

* Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving

* Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06 18:45:28 +02:00
Kawrakow
7c94c3da56 IQ1_M_R4: better 1.75 bpw quants (#187)
* iq1_m_r4: basics (quantize/dequantize)

* iq1_m_r4: Zen4 gemm

* iq1_m_r4: neon gemm

* iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4

With the deltas being per group of 8, we cannot make use
of the q8 sums stored in q8_1, so we get a tiny gain by
using q8_0_x4.

* iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-06 14:08:52 +02:00
Kawrakow
eb547bad1a IQ1_S_R4: better 1.5 bpw quants (#185)
* iq1_s_r4: basics - quantize/dequantize

* iq1_s_r4: gemm/gemv works on AVX2/Zen4

* Don't forget to make sure we have a multiple of 4 rows per thread

* iq1_s_r4: this is better

* iq1_s_r4: fix Zen4 after AVX2 changes

* iq1_s_r4: NEON gemm/gemv

* iq1_s_r4: more bits for shared experts

With this mix we arrive at PPL(512) = 9.4140
for Deepseek-Lite using 1.766 bpw for the repeating layers.

On the Ryzen-7950X we get PP-512 = 494 t/s and
TG-128 = 52 t/s @ 16 threads.

* Forgotten counter increment

* iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv

* Compiler warnings

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-05 13:49:39 +02:00
Kawrakow
da3bfd1009 IQ3_S_R4 (#162)
* iq3_s_r4: WIP

* iq3_s_r4: Zen4

* iq3_s_r4: slightly better Zen4

* iq3_s_r4: AVX2

* iq3_s_r4: NEON

* iq3_s_r4: rearrange quants

* iq3_s_r4: rearranged quants - AVX2

* iq3_s_r4: rearranged quants - NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-23 14:34:23 +01:00
Kawrakow
7598ec79a2 IQ2_S_R4 (#156)
* iq2_s_r4: Zen4

* Minor

* iq2_s_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21 11:26:35 +01:00
Kawrakow
6892554e43 IQ2_XS_R4 (#155)
* iq2_xs_r4: Zen4

* iq2_xs_r4: AVX2

* iq2_xs_r4: slightly better matrix x vector on AVX2

* iq2_xs_r4: NEON - not much better than iq2_xs

* iq2_xs_r4: slightly better NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-21 08:32:39 +01:00
Kawrakow
9dfd69bd93 IQ2_XXS_R4 (#154)
* iq2_xxs_r4: Zen4

Disapointing gain: 134.7 t/s -> 151.1 t/s for PP-512
TG-128 is better: 3.45 -> 4.61 t/s @ 1 thread

* Minor

* iq2_xxs_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20 12:02:42 +01:00
Kawrakow
310f8b1d22 IQ3_XXS_R4 (#153)
* iq3_xxs_r4: 1st shot on Zen4

PP-512: 107 t/s -> 137 t/s
TG-128(1 thread): 2.64 t/s -> 3.44 t/s

* iq4_xxs_r4: WIP

* iq4_xxs_r4: 1st shot at AVX2

Note: there is a bug in the AVX2 implementation for nrc_y = 1
for IQ quants with blocks of 32. I have fixed it for now by
using the nrc_y > 1 implementation (which works) also for nrc_y = 1.

* iq3_xxs_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-20 09:12:48 +01:00
Kawrakow
8352f12275 IQ4_KS_R4 (#150)
* iq4_ks_r4: Zen4

* iq4_ks_r4: AVX2

* iq4_ks_r4: WIP

* iq4_ks_r4: slightly better Zen4

* iq4_ks_r4: slightly better Zen4

* iq4_ks_r4: NEON

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18 19:58:21 +01:00
Kawrakow
c208fec2f2 IQ5_K_R4 (#149)
* iq5_k_r4: Zen4

Much slower than the others.

* iq5_k_r5: WIP

* Minor

* iq5_k_r4: fix AVX2 nrc_y = 1 case

* iq5_k_r4: better Zen4

But TG is still slower than iq5_k

* iq5_k_r4: slightly better AVX2

* iq5_k_r4: NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-18 13:29:25 +01:00
Kawrakow
c16d352915 IQ2_K_R4 (#146)
* iq2_k_r4: Zen4

* iq2_k_r4: NEON

* iq2_k_r4: better matrix x vector multiplication on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 10:18:33 +01:00
Kawrakow
b52e2e2934 IQ3_K_R4 (#145)
* iq3_k_r4 WIP

* iq3_k_r4: Zen4

* iq3_k_r4: AVX2

* iq3_k_r4: NEON

* iq3_k_r4: faster matrix x vector multiplication on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-17 07:51:11 +01:00
Kawrakow
e811de75e9 BF16_R16 - 16 interleaved bf16 rows (#142)
* Not working bf16_r4

* Adding bf16_r8

Small performance gain compared to bf16 - 258 t/s vs 234 t/s.
I guess, this is still sub-obtimal.

* bf16_rx: Very slightly faster by interleaving 16 rows

258 t/s -> 263 t/s

* Rename bf16_r4 to bf16_r16

We are interleaving 16 rows now.

* Cleanup unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-15 09:54:21 +01:00
Kawrakow
e885c1e59b Q8_K_R8: Fastest quantized matrix multiplications (#141)
* q8_k_r8: fastest matrix multiplication known to human kind

We get PP-512(LLaMA-3.1-8B) = 370 t/s on a Ryzen-7950X!

* q8_k_r8: AVX2

I was worried that we don't have enough vector registrers on
AVX2, but it looks like it handles it just fine. We get
PP-512(LLaMA-3.1-8B) = 354 t/s on a Ryzen-5975WX.
Slightly slower than the Zen4 version with double the threads,
but still a huge upgrade compared to Q8_0_R4.

* q8_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 159.2 t/s.
Compare this to the 128 t/s we have fr Q8_0_R4.

* q8_k_r4: go to signed ints

Why?
* On AVX2 _mm256_maddubs_epi16() may overflow, so we need to
  stay within the signed int range and use _mm256_sign_epi8.
  Not yet tested on the AVX2 comp, vut expect major slowdown.
* It is almost 10% faster on ARM_NEON. Somehow the veorrq_u8()
  needed tto convert from unsigned to signed seems to be extremely
  slow on the M2-Max
* We only lose ~0.5% in oerformance on Zen4 (there the exclusive
  or that we now use to convert fro signed to unsigned seems to be
  much faster than on M2-Max)

* Shutup useless compiler warnings

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-14 09:24:30 +01:00
Kawrakow
ce97b0325e IQ4_K_R4 (#138)
* iq4_k_r4: WIP

* iq4_k_r4: Zen4 and hopefully AVX2

On Zen4 we get PP-512(LLaMA-3.1-8B) = 232.6 t/s, up from 182.2 t/s
for iq4_k. Applying the extra shift costs a ~6 performance penalty.

* iq4_k_r4: AVX2

PP-512 = 227.60 t/s. The shifts are really costly.

* iq4_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 108 t/s, up from 58.2 t/s for iq4_k.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-12 16:04:20 +01:00
Kawrakow
0f6621d410 Q2_K_R4 (#136)
* q2_k_r4: Zen4

PP-512(LLaMA-3.1-8B) = 256 t/s

* q3_k_r4: AVX2

* q2_k_r4: AVX2

We get PP-512(LLaMA-3.1-8B) = 287 t/s.

Also cherry-picked the q3_k_r4 AVX2 adaptation that I somehow
forgot to push upstream.

* q2_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 106.2 t/s.
TG-128 is 36.02 t/s, which is ~10% higher than q2_K_S.

* Make sure rows per thread are a multiple of 4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11 18:16:49 +01:00
Kawrakow
4872f2f57e Q3_K_R4 (#134)
* q3_k_r4: Zen4 works, but not as good as it should be

238 t/s, so sloghtly slower than q6_k_r4.

* q3_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 106.9 t/s.
This is 1.93X faster than q3_K_S!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-11 11:19:00 +01:00
Kawrakow
e78e47b857 Q5_K_R4 (#132)
* q5_k_r4: WIP

* q5_k_r4: Zen4 and AVX2

We get PP-512(LLaMA-3.1-8B) = 248.3 t/s on Zen4.
Q5_K_S has PP-512 = 190 t/s.

* q5_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 96.1 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10 18:13:47 +01:00
Kawrakow
b7e2f656f5 Q6_K_R4 (#130)
* Adding q6_k_r4

* q6_k_r4: 1st functional AVX2 version

* q6_k_r4: AVX2 and simple Zen4

"Simple" as in processing 4 instead of 8 rows at once.
On Zen4 we get PP-512(LLaMA-3.1-8B) = 238.3 t/s vs
195.2 t/s for Q6_K. TG-128 @ 1 thread is 7.94 t/s
vs 5.38 t/s for Q6_K.

* q6_k_r4: 1st NEON version

PP-512(LLaMA-3.1-8B) = 78 t/s vs 57.6 t/s for q6_K.
TG-128 is slightly lower rthan q6_K for low number of threads,
becomes very slightly better at 8 threads.

* q6_k_r4: slightly faster NEON

PP-512(LLaMA-3.1-8B) = 83.25 t/s

* q6_k_r4: slightly faster Zen4

238.3 t/s -> 243.2 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-10 12:26:40 +01:00
Kawrakow
13126ce100 Q4_K_R4 (#129)
* Something is still wrong

* Simply don't see what is wrong

* q4_k_r4: finally works on Zen4

I had forgotten to prevent token_embd.weight being quantized
with q4_k_r4!

* q4_k_r4: AVX2

We get PP-512(LLaMA-3.1-8B) = 267 t/s on a Ryzen-5975WX.
This is ~30% better than Q4_K_S.

* q4_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 110 t/s.
Not quite as good as q4_0_r4, but still a massive
improvement compared to he 69 t/s for q4_K.

* q4_k_r4: slightly better AVX2

PP-512 goes from 267 t/s to 282 t/s on Ryzen-5975WX

* Minor

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-09 16:59:18 +01:00
Kawrakow
daf5f52022 Rename iq4_nl_x4 to iq4_nl_r4 (#126)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-08 09:34:42 +01:00
Kawrakow
612a207676 iq2_bn_r4: fastest Bitnet CPU implementation on the planet (#124)
* Adding iq2_bn_r4

This Zen4-only implementation achieves PP-512 = 826 t/s (!!!)
for Bitnet-1.58b-3B, up from 620 t/s for iq2_bn.

* Make sure rows per thread are a multiple of the number of interleaved rows

With this I can run iq2_bn_r4 with 32 threads and this increases
PP-512 to 872 t/s.

* iq2_bn_r4: 1st shot at NEON

PP-512 is already faster than iq2_bn (284 t/s vs 246 t/s
for Bitnet-1.58b-3B). TG-128 is ~5% slower.

* iq2_bn_r4: NEON

PP-512 is now 296 t/s. TG-128 is ~20% faster than iq2_bn
for 1 thread, but saturates to about the same 93 t/s at
8 threads.

* iq2_bn_r4: Experimenting on NEON

The matrix x vvector multiplication is erratic.
iq2_bn_r4 is faster at 1, 2, and 4 threads, but
saturates to a lower t/s at 8 threads compared to
iq2_bn. iq2_bn actually manages 99 t/s at 8 threads
and not 93 as I wrore in the last commit. iq2_bn_r4
performance has huge fluctuations at 4 and 8 threads.

* Some cleanup

* iq2_bn_r4: AVX2

As expected, PP is slightly slower as we just don;t have
enough vector registers (690 vs 710 t/s). TG is slightly faster
(18.2 vs 16.7 t/s at 1 thread).

* iq2_bn_r4: use AVX2 implementation on Zen4 for matrix x vector

It is faster - we get 29.6 t/s at 1 thread vs 25.9 t/s for iq2_bn.

* iq2_bn_r4: simdify q8_K16 quantization (AVX2)

PP-512 becomes 834 t/s and TG-128 now saturates to the same
performance as iq2_bn for 4 threads.

* iq2_bn_r4: simdify q8_K16 quantization (NEON)

PP-512 is now 304.7 t/s, and TG-128 @ 8 threads
very slightly outperforms iq2_bn (100.7 t/s vs 99.6 t/s)

* iq2_bn_r4: fix AVX2 after breaking it two commits ago

* iq2_bn_r4: better AVX2

As we don't have enough vector registers on AVX2, it is better
to do two passes per row needing only half of the accumulator
registers that way.
With this, we now beat iq2_bn PP also on AVX2 by a small margin.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-06 12:15:39 +01:00
Kawrakow
9119023a4b IQ4_XS_R4 (#123)
* Adding iq4_xs_r4

This is a 1st working version on Zen4.
We get PP-512(LLaMA-3.1-8B) = 226 t/s, so 16% slower
than iq4_nl_x4.

* iq4_xs_r4: WIP

* iq4_xs_r4: Use AVX2 version for matrix x vector on Zen4

* iq4_xs_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 115.6 t/s on M2-Max,
up from 68.2 t/s for iq4_xs!

* DRY

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-04 15:20:07 +01:00
Kawrakow
bb699e1e6b Q6_0_R4 (#122)
* Adding q6_0_r4

We get PP-512(LLaMA-3.1-8B) = 257 t/s on a Ryzen-7950X.

* q6_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 95 t/s on M2-Max.
In terms of ops, q6_0_r4 is identical to q5_0_r4
except for loading the high bits being
vld1q_u8_x2 instead of vld1q_u8. It is strange that
this can make a 5% difference in performance, especially
considering that this is amortized (re-used) over 8 columns
in the right matrix. Or am I running out of vector registers?

* Fix AVX2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 14:48:26 +01:00
Kawrakow
d9593f3689 Q5_0_R4 (#121)
* Adding q5_0_r4

We get PP-512(LLaMA-3.1-8B) = 256.7 t/s on a Ryzen-7950X.
We even get TG-128 improvement to 11.7 t/s from 11.1 t/s.

* q5_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 99.6 t/s on M2-Max,
up from 71.0 t/s for Q5_0. The difference to mainline llama.cpp
is no longer funny: they get 26.5 t/s for Q5_0.

For TG, we are nor able to fully saturate memory bandwidth
and arrive at 22.1 t/s @ 8 threads. Mainline llama.cpp gets
20.6 t/s for Q5_0.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 12:59:22 +01:00
Kawrakow
6b26cb05f5 Q8_0_R4 (#120)
* Adding q8_0_r4

We get PP-512(LLaMA-3.1-8B) = 268 t/s on a Ryzen-7950X compared
to 175.6 t/s for Q8_0.

* q8_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 112.6 t/s on M2-Max.

* q8_0_r4: Zen4 matrix-vector specialization

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-03 06:15:29 +01:00
Kawrakow
61304f5c04 Q4_0_R4 (#119)
* Adding iq4_0_r4 - q4_0 repacked

We get PP-512(LLaMA-3.1-8B) = 278 t/s on a Ryzen-7950X CPU,
so ~5-6% faster than iq4_nl_x4.

* q4_0_r4: NEON

Here we get 115.8 t/s, so also ~5% better than iq4_nl_x4.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02 17:01:48 +01:00
Kawrakow
72d94fbf22 IQ4_NL_X4 (#118)
* Adding iq4_nl_x4

Looks very promising - I get PP-512(LLaMA-3.1-8B) = 230 t/s
on the Ryzen-7950X! This is faster than any other quant and
~40% faster than iq4_nl.

* iq4_nl_x4: getting amazing

This Zen4 variant gets us to PP-512(LLaMA-3.1-8B) = 263 t/s!

* iq4_nl_x4: AVX2

Here we gain only 25% compared to iq4_nl

* iq4_nl_x4: NEON

On M2-Max we get PP-512(LLaMA-3.1-8B) = 109.7 t/s, up from
82.4 t/s for iq4_nl.

* iq4_nl_x4: minor NEON improvement and cleanup

This gets us to 110.3 t/s. In comparison,
IQ4_NL_4_4 in mainline llama.cpp achieves 92.3 t/s.

* iq4_nl_x4: NEON specialization for matrix x vector

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-12-02 07:25:39 +01:00
Kawrakow
4b35340f45 Bitnet changes (#106)
* Adapting iq2_bn to work without separate scale tensors

Why? It is becoming burdensome to maintain the special Bitnet
conversion in convert_hf_to_gguf.py, so I thnk it is better
to make iq1_bn and iq2_bn just work with the mainline
conversion script (which does not generate scales).

* Adapting iq1_bn to work without separate scale tensors

* Adapting iq2_bn: CUDA dequantize

* Adapting iq2_bn: CUDA works

* Adapting iq1_bn: CUDA works

* Adapting iq1_bn, iq2_bn: NEON

* Adapting iq1_bn, iq2_bn: Metal

Dequantize works, but there is still something wrong
with the dot products.

* WIP

Absoolutely don't see what is wrong with the iq1_bn and iq2_bn
vector dot product kernels.

* Remove iq1_tn and iq2_tn - Part 1

Now that iq1_bn and iq2_bn have per row scales, there is no
reason to also have iq1_tn and iq2_tn.

* Remove iq1_tn and iq2_tn - Part 2

* Bitnet: use the standard llm_build_kv to build self attention

My main motivation was to enable FA. But FA does not work anyway
because head size is 100 for the Botnet ternary models
(and I had forgotten this little detail).

* Revert "Avoid rebuild of GGML graph for each token (#98)"

This reverts commit f2d315b46f.
As far as I can tell, the commit breaks Metal TG.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-25 13:08:43 +02:00
Nexes the Elder
2b1af6bade CLI - Specify GGML_TYPE to quantize for the main tensors. (#91)
To complement the token_embd.weight and output.weight :

attn_v.weight
attn_k.weight.
attn_q_weight
attn_output.weight
attn_qkv.weight
ffn_gate
ffn_down
ffn_up
2024-10-18 09:48:15 +02:00
Kawrakow
f369c6f921 Adding IQ4_KSS: 4.0 bpw quants (#89)
* iq4_kss: WIP

* iq4_kss: CUDA dequantize works

So we can run perplexity. Sadly, the result does not look good
on the bpw vs quantization error plot.

* iq4_kss: slightly better quantization

* iq4_kss: another small quantization improvement

* iq4_kss: CUDA works

TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B.
In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks.
I.e., the reduced model size more than offsets the additional
bit fiddling required for iq4_kss.

* iq4_kss: new bit arrangement - CUDA and Zen4 work

Did not lose performance on CUDA. Zen4 is decent, but not great:
PP-512(LLaMA-3.1-8B) = 163 t/s.
TG-128 is of course better than other 4-bit quants due to smaller model size.
We get 14.5 t/s @ 8 threads.

* iq4_kss: ARM_NEON. Predictably very slow

* iq4_kss: Metal

PP is not too bad - just 10% slower than q4_0.
But TG is 30% slower, i.e., predictably bad.

* iq4_kss: somewhat faster Metal dot product

45.75 t/s -> 48.75 t/s.
Still 22% slower than q4_0

* iq4_kss: AVX2

Bad, but better than I expected.
PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X.
I.e., with 32 AVX2 threads we get the performance of
16 Zen4 threads.

* iq4_kss: very slightly faster Metal dot product

48.7 t/s -> 49.3 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-16 15:18:26 +03:00
Kawrakow
67817fb5b9 IQ2_KS: 2.1875 bpw non-linear quantization (#85)
* Experimenting

* iq2k: Try make_qx_quants for the scale

Slightly better for LLaMA-3.1, Gemma-2, slightly worse for
Qwen2.5

* iq2k with make_qx_quants: adjust scale

* iq2ks: basics

* iq2_ks: CUDA works

* iq2_ks: WIP

* iq2_ks: WIP

* iq2_ks: Zen4

* iq2_ks: AVX2

* iq2_ks: scalar dot product

* iq2_ks: ARM_NEON

* iq2_ks: Metal

* iq2_ks: faster Metal

LLaMA-3.1-8B:
PP-512 = 475.22 ± 0.37 t/s
TG-128 =  45.32 ± 0.03 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-13 13:34:30 +03:00
Kawrakow
a10ccd65f3 New SOTA quantization: 4.25 bpw IQ4_KS (#83)
* iq4_k_xxs: basics

* WIP + adding iq3_kl quantization mix

* iq4_xxs: this looks very viable compared to iq4_xs

At the same 4.25 bpw PPL is always better, for some models
significantly better. I'll rename to iq4_ks and keep it.

* iq4_xxs: CUDA dot product

We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.

* iq4_xxs: scalar CPU dot product

Also fix the breakage I caused with the dedicated work buffer
quantization portion when the multiplication is not done
via iqk_mul_mat.

* iq4_xxs: Zen4

I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2).
Again the same mistake of packing int32_t back to int16_t,
which overflows occasionally (just occasionally, that's why the
result doesn't look completely wrong, so I didn't notice).

* Fix iq4_xs (Zen4)

* iq4_xxs: AVX2

* iq4_xxs: ARM_NEON

* iq4_xxs: Metal

* iq4_xxs: slightly faster TG on Metal

* iq4_xxs: rename to iq4_ks

After all, tt is a smaller variant of iq4_k.

* iq3_kl: use iq4_ks instead of iq4_k/iq4_xs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-09 12:54:40 +03:00
Kawrakow
104e7e26c4 Adding Q6_0 (#77)
* Adding q6_0 - basics + AVX2/Zen4 working

* Adding q6_0: CUDA dequantize works, but not mmvq

* Adding q6_0: CUDA mmvq works

* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache

* Add q6_0 to CPU flash attention

Disappointing result: for LlaMA-3.2-1B, q6_0 K- and V-cache
gives about the same PPL as q8_0 K-cache and q4_0 V-cache,
while needing the exact same RAM.
I.e., what was the point?

* q6_0: slightly better kv-cache result

Better than q8_0+q4_0, but not as good as q8_0+iq4_nl

* q6_0: works on ARM_NEON

* q6_0: dequantize works on Metal, but not vector dot product

* q6_0: it now works on Metal

Outperforms q5_0 by a significant margin. E.g.
| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     44.02 ± 0.08 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     40.13 ± 0.12 |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    500.55 ± 0.32 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    448.02 ± 0.27 |

* q6_0: can now be used for kv-cache on Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 15:22:13 +03:00
Kawrakow
733660accd Adding ability to have meta data per tensor row (#61)
* POC: per row scale

This is a POC how to work around opinionated ggml to
have scales per row rather than per block.
Only implemened for Zen4 and only for iq2_tn.

* POC per row scale: iq2_tn on NEON

* POC per row scale: iq2_tn on Metal

* Per row scale Metal templates

* iq1_tn: shrink to 1.625 bpw (NEON and Metal)

* POC per row scale: CUDA

* POC per row scale: add CUDA TODOs

There are two places in ggml-cuda.cu left where it is assumed
that type_size * n_per_row / block_size is the way to compute
and handle row sizes. This does not affect simple usage,
but will lead to issues when tensors are split between GPUs.

* Per row scales - CUDA

The only place left where there are unnecessary assumptions being made
is in the Flash Attention code. As we are not using any quants that
use per row scales for quantized KV cache, it should be OK for now.

* Update IQ1_TN and IQ2_TN bpw shown to user

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-27 08:16:06 +03:00
Kawrakow
4a5d5e207d Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44)
* Adding iq1_tn - 1.6875 bpw for TriLM ternary models

* iq1_tn: NEON

* iq1_tn: faster NEON

* iq2_bn: improve performance on NEON

We now get TG-128 = 100 t/s for Bitnet-3B-1.58b!

* iq1_tn: improve AVX2

PP-512 goes to 533 t/s up from 455.
TG-128 @ 2 threads goes to 16.6 t/s up from 14.2.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.

* iq1_tn: improve Zen4

PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380.
TG-128 @ 1 thread goes to 12.4 t/s up from 10.4.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.

* iq2_bn: improve on Zen4

We now get PP-512 = 614 t/s up from 542 t/s

* iq2_bn: improve AVX2 implementation

We now get PP-512 = 753 t/s up from 680 t/s.

* Remove unnecessary barrier in ggml_compute_forward_mul_mat

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-09 14:56:34 +03:00
Kawrakow
1a4cfbcc53 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Kawrakow
4b2c94618f iq6_k: WIP (quantize/dequantize) 2024-08-09 16:00:31 +02:00
Kawrakow
58a323f585 Adding IQ2_TN for use with ternary models (#13)
* iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for the TriLM-3.9B with any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn ant 3.3B Bitnet.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bintnet-3B quantized with iq2_bn.
Bintnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-07 07:56:09 +02:00
Kawrakow
695c7eef49 q2_K: allow it to detect ternary nets and quantize accordingly 2024-08-05 11:39:10 +02:00
Kawrakow
fb4cff3458 iq3_k: Basics
Quantize/dequantize, CUDA dequantize.
PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01 09:38:06 +02:00
Kawrakow
e5cd93b4b7 iq5_k: Basics
Quantize/dequantize, CUDA dequantize
2024-08-01 09:38:06 +02:00
Kawrakow
3f7dad3000 iq2_k: Basics
Quantize/dequantize, CUDA deqantize, AVX512 iqk_mul_mat.
2024-08-01 09:38:06 +02:00
Kawrakow
007d2a56b3 IQ4_K: SOTA 4-bit quantization (#6)
* iq4_k: basics

* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
  PPL = 6.5258 for LlaMA-3.1-8B, which is 1.77% above fp16.
  In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
  products are done, so need to sort out what he had in mind
* iqk_mul_mat is not implemented.

* iq4_k: TG now works on CUDA

* iq4_k: AVX512 implementation

For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.

* iq4_k: AVX2 implementation

For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975X.

* iq4_k: NEON implementation

For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.

* iq4_k: Metal implementation

For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.

* iq4_k: scalar dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-28 12:11:59 +02:00