Commit Graph

60 Commits

Author SHA1 Message Date
Iwan Kawrakow
85d1011f52 Another iq3k improvement 2024-11-25 10:11:02 +02:00
Iwan Kawrakow
55db84400a Small iq3k improvement 2024-11-25 09:01:25 +02:00
Iwan Kawrakow
74e3b1fad7 Minor 2024-11-24 17:11:11 +02:00
Iwan Kawrakow
65ebc6f986 iq4_ks: minor PPL improvement 2024-11-24 12:01:18 +02:00
Iwan Kawrakow
70815ec5b2 iq2k: quantization improvement
I was not using the ciorrect scale sign to compute
mse when checking the solution with the sign flipped.
iq4_kss is now almost on par with the 4-bit Trellis.
2024-11-24 11:29:37 +02:00
Iwan Kawrakow
7447c55a8a iq2k: small PPL improvement
PPL(LLaMA-3.1-8B, 8192) is now 8.29 from previously 8.38.
LLaMA-v2-7B is about the same as before.
2024-11-23 19:18:45 +02:00
Iwan Kawrakow
3cac58e182 iq2ks: small PPL improvement
PPL(LLaMA-3.1-8B, 8192) is now 9.95 from previously 10.18.
LLaMA-v2-7B is about the same as before.
2024-11-23 12:27:14 +02:00
Iwan Kawrakow
3a9926b932 Checkpoint
Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It goves a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).

I also notices that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
2be4cffe66 Minor tweaks 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4cf82e7e2f iq4_kt: failed attemt to adjust CUDA dot product
It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
ab1cef30e7 iq4_kt: very slightly better
at the expense of much longer quantization time.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
1be0a9e0d7 iq4_kt: go to 4.0 bpw
15 bits per group of 4, plus 8 bit scales ifor blocks of 32.
This gives a slightly better PPL than iq4_kss.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
21903f19b4 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
c20b22b9a0 iq3_kt: small progress 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4213ab1cb3 iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
215bea5c6a iq3_kt: small improvements and faster quantization 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
dbe085474a iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
200a19f18f iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
21ee589996 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
1d6ca83203 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
00b4bff286 Adding iq4_kt - not competitive at this point 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
47b28c1e92 iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4608f0cc6d iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e9e5879b94 iq3_kt speed up quantization
Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
c59830dafb iq3_kt WIP: speed up quantization
Nearly 60% improvement of quantization speed by having the
points nelonging to a cluster copied to contiguous memory
during initialization, and then accessed sequantially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
8f0d075f5e iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
dfcc8a9cf3 iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
386d139e13 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
f1fb59b44b iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
435eb9bdd3 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
08503cec7d WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
977f94b3e0 Forgotten change 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4774788136 Adding iq3_kt
3.125 bpw. So far does not look good on the PPL vs bpw plot.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
b3dfe9984b iq2_kt - even better
Re-quantize after determining block scales
(at the epxense of much longer quantization time).
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
36e9c922b8 iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group od 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
a4f1ac8da4 iq2_kt: quantize / dequantize
I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
2024-11-21 08:16:40 +02:00
Kawrakow
6b968f3894 Bitnet changes (#106)
* Adapting iq2_bn to work without separate scale tensors

Why? It is becoming burdensome to maintain the special Bitnet
conversion in convert_hf_to_gguf.py, so I thnk it is better
to make iq1_bn and iq2_bn just work with the mainline
conversion script (which does not generate scales).

* Adapting iq1_bn to work without separate scale tensors

* Adapting iq2_bn: CUDA dequantize

* Adapting iq2_bn: CUDA works

* Adapting iq1_bn: CUDA works

* Adapting iq1_bn, iq2_bn: NEON

* Adapting iq1_bn, iq2_bn: Metal

Dequantize works, but there is still something wrong
with the dot products.

* WIP

Absoolutely don't see what is wrong with the iq1_bn and iq2_bn
vector dot product kernels.

* Remove iq1_tn and iq2_tn - Part 1

Now that iq1_bn and iq2_bn have per row scales, there is no
reason to also have iq1_tn and iq2_tn.

* Remove iq1_tn and iq2_tn - Part 2

* Bitnet: use the standard llm_build_kv to build self attention

My main motivation was to enable FA. But FA does not work anyway
because head size is 100 for the Botnet ternary models
(and I had forgotten this little detail).

* Revert "Avoid rebuild of GGML graph for each token (#98)"

This reverts commit f2d315b46f.
As far as I can tell, the commit breaks Metal TG.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-25 13:08:43 +02:00
Kawrakow
76b97c8064 Adding IQ4_KSS: 4.0 bpw quants (#89)
* iq4_kss: WIP

* iq4_kss: CUDA dequantize works

So we can run perplexity. Sadly, the result does not look good
on the bpw vs quantization error plot.

* iq4_kss: slightly better quantization

* iq4_kss: another small quantization improvement

* iq4_kss: CUDA works

TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B.
In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks.
I.e., the reduced model size more than offsets the additional
bit fiddling required for iq4_kss.

* iq4_kss: new bit arrangement - CUDA and Zen4 work

Did not lose performance on CUDA. Zen4 is decent, but not great:
PP-512(LLaMA-3.1-8B) = 163 t/s.
TG-128 is of course better than other 4-bit quants due to smaller model size.
We get 14.5 t/s @ 8 threads.

* iq4_kss: ARM_NEON. Predictably very slow

* iq4_kss: Metal

PP is not too bad - just 10% slower than q4_0.
But TG is 30% slower, i.e., predictably bad.

* iq4_kss: somewhat faster Metal dot product

45.75 t/s -> 48.75 t/s.
Still 22% slower than q4_0

* iq4_kss: AVX2

Bad, but better than I expected.
PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X.
I.e., with 32 AVX2 threads we get the performance of
16 Zen4 threads.

* iq4_kss: very slightly faster Metal dot product

48.7 t/s -> 49.3 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-16 15:18:26 +03:00
Iwan Kawrakow
ff23008ed4 Minor iq3_k tweak 2024-10-14 18:13:11 +03:00
Kawrakow
910a134094 IQ2_KS: 2.1875 bpw non-linear quantization (#85)
* Experimenting

* iq2k: Try make_qx_quants for the scale

Slightly better for LLaMA-3.1, Gemma-2, slightly worse for
Qwen2.5

* iq2k with make_qx_quants: adjust scale

* iq2ks: basics

* iq2_ks: CUDA works

* iq2_ks: WIP

* iq2_ks: WIP

* iq2_ks: Zen4

* iq2_ks: AVX2

* iq2_ks: scalar dot product

* iq2_ks: ARM_NEON

* iq2_ks: Metal

* iq2_ks: faster Metal

LLaMA-3.1-8B:
PP-512 = 475.22 ± 0.37 t/s
TG-128 =  45.32 ± 0.03 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-13 13:34:30 +03:00
Kawrakow
b30c9e10d8 New SOTA quantization: 4.25 bpw IQ4_KS (#83)
* iq4_k_xxs: basics

* WIP + adding iq3_kl quantization mix

* iq4_xxs: this looks very viable compared to iq4_xs

At the same 4.25 bpw PPL is always better, for some models
significantly better. I'll rename to iq4_ks and keep it.

* iq4_xxs: CUDA dot product

We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.

* iq4_xxs: scalar CPU dot product

Also fix the breakage I caused with the dedicated work buffer
quantization portion when the multiplication is not done
via iqk_mul_mat.

* iq4_xxs: Zen4

I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2).
Again the same mistake of packing int32_t back to int16_t,
which overflows occasionally (just occasionally, that's why the
result doesn't look completely wrong, so I didn't notice).

* Fix iq4_xs (Zen4)

* iq4_xxs: AVX2

* iq4_xxs: ARM_NEON

* iq4_xxs: Metal

* iq4_xxs: slightly faster TG on Metal

* iq4_xxs: rename to iq4_ks

After all, tt is a smaller variant of iq4_k.

* iq3_kl: use iq4_ks instead of iq4_k/iq4_xs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-09 12:54:40 +03:00
Kawrakow
fe36930c8b Move scale fudge factors to quantization (#81)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-04 16:16:01 +03:00
Kawrakow
6dec4af4b6 Adding ability to have meta data per tensor row (#61)
* POC: per row scale

This is a POC how to work around opinionated ggml to
have scales per row rather than per block.
Only implemened for Zen4 and only for iq2_tn.

* POC per row scale: iq2_tn on NEON

* POC per row scale: iq2_tn on Metal

* Per row scale Metal templates

* iq1_tn: shrink to 1.625 bpw (NEON and Metal)

* POC per row scale: CUDA

* POC per row scale: add CUDA TODOs

There are two places in ggml-cuda.cu left where it is assumed
that type_size * n_per_row / block_size is the way to compute
and handle row sizes. This does not affect simple usage,
but will lead to issues when tensors are split between GPUs.

* Per row scales - CUDA

The only place left where there are unnecessary assumptions being made
is in the Flash Attention code. As we are not using any quants that
use per row scales for quantized KV cache, it should be OK for now.

* Update IQ1_TN and IQ2_TN bpw shown to user

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-27 08:16:06 +03:00
Kawrakow
12bbdb8ce7 Fix compiler warnings (#58)
* Fix C++ compilation warnings caused by ggml-common.h

* Disable c99-extensions warning

I get tons of those on macOS due to the arm_neon.h header.

* Disable c99-extensions warning only for APPLE

* Fix warnings in iqk_quantize.cpp

Also add GGML_ABORT when implementation is missing.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-17 14:31:29 +03:00
Kawrakow
8c86231f93 Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44)
* Adding iq1_tn - 1.6875 bpw for TriLM ternary models

* iq1_tn: NEON

* iq1_tn: faster NEON

* iq2_bn: improve performance on NEON

We now get TG-128 = 100 t/s for Bitnet-3B-1.58b!

* iq1_tn: improve AVX2

PP-512 goes to 533 t/s up from 455.
TG-128 @ 2 threads goes to 16.6 t/s up from 14.2.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.

* iq1_tn: improve Zen4

PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380.
TG-128 @ 1 thread goes to 12.4 t/s up from 10.4.
However, we seem to have a bottleneck somewhere as
TG saturates at 8 threads.

* iq2_bn: improve on Zen4

We now get PP-512 = 614 t/s up from 542 t/s

* iq2_bn: improve AVX2 implementation

We now get PP-512 = 753 t/s up from 680 t/s.

* Remove unnecessary barrier in ggml_compute_forward_mul_mat

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-09 14:56:34 +03:00
Kawrakow
dbb1db9899 Fix build when iqk_mul_mat is disabled (#31)
Ref #29

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-31 09:11:42 +03:00
Kawrakow
a73702d93b AVX2 quantization for Q8_K (#22)
It has been there for a while, but forgot to add here.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-19 15:33:27 +03:00
Iwan Kawrakow
f5d1af61d7 Fix Makefile
I always use cmake, so had forgotten to pay attention to the
Makefile.
2024-08-09 16:31:04 +02:00
Iwan Kawrakow
f0d7a0d53b Fix Zen4 implementation of iq3_k, iq4_k, iq5_k
See comments in f3a823ce72
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
c3f5e4d9a7 iq6_k: CUDA dequantize
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
2024-08-09 16:00:31 +02:00