Commit Graph

3533 Commits

Author SHA1 Message Date
Iwan Kawrakow
2be4cffe66 Minor tweaks 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
5705dc7f2e Report actual bpw 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
3ee5434601 DRY 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
81cd220f93 iq4_kt: CUDA dot product works 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
79565c92e0 DRY 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e338e0a0cd DRY 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4cf82e7e2f iq4_kt: failed attempt to adjust CUDA dot product
It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
ab1cef30e7 iq4_kt: very slightly better
at the expense of much longer quantization time.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
1be0a9e0d7 iq4_kt: go to 4.0 bpw
15 bits per group of 4, plus 8-bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.
2024-11-21 08:16:41 +02:00
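
The 4.0 bpw figure in commit 1be0a9e0d7 can be checked with a little arithmetic. The sketch below is only illustrative: the helper name `bits_per_weight` is hypothetical, and it assumes the only storage costs are the per-group codes and the per-block scales (any per-row or per-tensor metadata is ignored).

```python
def bits_per_weight(bits_per_group, group_size, scale_bits, block_size):
    """Average bits per weight for group codes plus one scale per block.

    Assumption: no other metadata (per-row scales, padding) is counted.
    """
    return bits_per_group / group_size + scale_bits / block_size


# iq4_kt as described in the commit message:
# 15 bits per group of 4 weights, 8-bit scale per block of 32 weights.
iq4_kt_bpw = bits_per_weight(15, 4, 8, 32)
print(iq4_kt_bpw)  # 4.0
```

The group codes contribute 15/4 = 3.75 bpw and the block scales contribute 8/32 = 0.25 bpw.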
Iwan Kawrakow
21903f19b4 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
c20b22b9a0 iq3_kt: small progress 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4213ab1cb3 iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
215bea5c6a iq3_kt: small improvements and faster quantization 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
dbe085474a iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
200a19f18f iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
de7fe92833 iq4_kt: minor tweaks 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e9ced1bbe6 iq4_kt: CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
21ee589996 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
1d6ca83203 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
00b4bff286 Adding iq4_kt - not competitive at this point 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
47b28c1e92 iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4608f0cc6d iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
0ffc9b435c iq3_kt: CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e9e5879b94 iq3_kt speed up quantization
Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
c59830dafb iq3_kt WIP: speed up quantization
Nearly 60% improvement of quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
2024-11-21 08:16:41 +02:00
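
The layout change described in commit c59830dafb can be pictured as follows. This is a minimal sketch, not the actual implementation: `pack_clusters` and `scan_cluster` are hypothetical names, and Python will not show the cache benefit that motivates the change. It only illustrates the idea of copying each cluster's points into one flat buffer once, then walking that buffer sequentially during the search.

```python
from array import array


def pack_clusters(points, clusters):
    """Copy each cluster's points into one flat buffer, cluster after cluster.

    Returns the flat coordinates, the original point index for every slot,
    the slot range of each cluster (offsets), and the point dimension.
    """
    dim = len(points[0])
    flat = array("d")
    ids = array("I")
    offsets = [0]
    for members in clusters:
        for idx in members:
            flat.extend(points[idx])
            ids.append(idx)
        offsets.append(len(ids))
    return flat, ids, offsets, dim


def scan_cluster(flat, ids, offsets, dim, c, target):
    """Sequentially scan cluster c and return (original index, squared distance)."""
    best_id, best_d = -1, float("inf")
    for slot in range(offsets[c], offsets[c + 1]):
        base = slot * dim
        d = 0.0
        for k in range(dim):
            diff = flat[base + k] - target[k]
            d += diff * diff
        if d < best_d:
            best_id, best_d = ids[slot], d
    return best_id, best_d


# Toy data: four 2-D points split into two clusters.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
clusters = [[0, 1], [2, 3]]
flat, ids, offsets, dim = pack_clusters(points, clusters)
best_id, best_d = scan_cluster(flat, ids, offsets, dim, 1, (5.9, 5.0))
print(best_id)  # 3
```

Storing the original index next to each slot means the search can still report which codebook entry was chosen after the points have been reordered.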
Iwan Kawrakow
8f0d075f5e iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
dfcc8a9cf3 iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
386d139e13 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
f1fb59b44b iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
435eb9bdd3 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
08503cec7d WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
977f94b3e0 Forgotten change 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4774788136 Adding iq3_kt
3.125 bpw. So far does not look good on the PPL vs bpw plot.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
590f47278b Minor 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
7bf6e158a9 iq2_kt: faster f16 CUDA dot product
We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
7cafafc69e iq2_kt: faster f16 CUDA dot product
We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around 180 t/s on their GPU, so almost matching
their performance.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
b354392c77 iq2_kt: f16 CUDA dot product
We arrive at 112 t/s.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
aed3910dfa iq2_kt: very slightly faster CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
d2331b9287 iq2_kt: CUDA dot product
Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S, when forced to use DMMV, only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
b3dfe9984b iq2_kt - even better
Re-quantize after determining block scales
(at the expense of much longer quantization time).
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
36e9c922b8 iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
766fa600c8 WIP - try larger blocks
With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster.
2024-11-21 08:16:41 +02:00
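
The clustering trick from commit 766fa600c8 can be sketched in a few lines. Everything here is an assumption made for illustration: the function names are hypothetical, the codebook is random Gaussian data rather than real trellis points, and the centroids are simply every 32nd codebook point instead of whatever the commit actually uses. Note that searching only the nearest cluster is approximate, so it can miss the true nearest point.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(candidates, target):
    """Index into `candidates` of the entry closest to `target`."""
    return min(range(len(candidates)), key=lambda i: sq_dist(candidates[i], target))


def build_clusters(points, centroids):
    """Assign every point index to the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for idx, p in enumerate(points):
        clusters[nearest(centroids, p)].append(idx)
    return clusters


def clustered_search(points, centroids, clusters, target):
    """Find the nearest centroid, then search only that cluster's points."""
    members = clusters[nearest(centroids, target)]
    if not members:  # fall back to brute force if a cluster is empty
        members = range(len(points))
    return min(members, key=lambda i: sq_dist(points[i], target))


rng = random.Random(0)
points = [tuple(rng.gauss(0.0, 1.0) for _ in range(8)) for _ in range(1024)]
centroids = points[::32]  # assumption: a crude stand-in for real centroids
clusters = build_clusters(points, centroids)

target = tuple(rng.gauss(0.0, 1.0) for _ in range(8))
approx = clustered_search(points, centroids, clusters, target)
exact = nearest(points, target)
print(sq_dist(points[approx], target), sq_dist(points[exact], target))
```

With 32 centroids over 1024 points, each query compares against 32 centroids plus one cluster's members instead of all 1024 points, which is where the time saving comes from.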
Iwan Kawrakow
86948f9c5d WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
a961a48e88 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
426a6e685f iq2_kt: CUDA dequantize
so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
a4f1ac8da4 iq2_kt: quantize / dequantize
I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
2024-11-21 08:16:40 +02:00
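
The apples-to-oranges point in commit a4f1ac8da4 is that the two error figures used different per-weight importance. A minimal sketch of the difference, under stated assumptions: sigma^2 is taken to be the mean of x^2 over the block, the error is normalized by the sum of the weights, and the function names are hypothetical. None of these details are confirmed by the commit message.

```python
import math


def importance_weights(x):
    """Per-value weight sigma^2/4 + x^2 (assumption: sigma^2 = mean of x^2)."""
    sigma2 = sum(v * v for v in x) / len(x)
    return [sigma2 / 4 + v * v for v in x]


def weighted_rmse(x, q, w):
    """Root of the weighted mean squared error between x and its quantized form q."""
    num = sum(wi * (xi - qi) ** 2 for wi, xi, qi in zip(w, x, q))
    return math.sqrt(num / sum(w))


x = [0.0, 2.0]
q_small = [0.5, 2.0]  # error of 0.5 on the small-magnitude value
q_large = [0.0, 2.5]  # same error on the large-magnitude value

ones = [1.0] * len(x)
w = importance_weights(x)

# With weight = 1 both candidates look equally good ...
print(weighted_rmse(x, q_small, ones), weighted_rmse(x, q_large, ones))
# ... while sigma^2/4 + x^2 penalizes the error on the larger value more.
print(weighted_rmse(x, q_small, w), weighted_rmse(x, q_large, w))
```

Comparing two quantization types is only meaningful when both are scored with the same weighting, which is the correction the commit describes.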
Iwan Kawrakow
f1df1b7e15 Testing Trellis quantization: playing with scales and generators 2024-11-21 08:16:40 +02:00
Iwan Kawrakow
9ec145550d Testing Trellis quantization: 4-bit quantized block scales
rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.
2024-11-21 08:16:40 +02:00
Iwan Kawrakow
f21dd3fb15 Testing Trellis quantization
Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.
2024-11-21 08:16:40 +02:00
Iwan Kawrakow
afe9db7143 WIP 2024-11-21 08:16:40 +02:00