Commit Graph

3540 Commits

Author SHA1 Message Date
Iwan Kawrakow
55db84400a Small iq3k improvement 2024-11-25 09:01:25 +02:00
Iwan Kawrakow
74e3b1fad7 Minor 2024-11-24 17:11:11 +02:00
Iwan Kawrakow
65ebc6f986 iq4_ks: minor PPL improvement 2024-11-24 12:01:18 +02:00
Iwan Kawrakow
70815ec5b2 iq2k: quantization improvement
I was not using the correct scale sign to compute
mse when checking the solution with the sign flipped.
iq4_kss is now almost on par with the 4-bit Trellis.
2024-11-24 11:29:37 +02:00
Iwan Kawrakow
7447c55a8a iq2k: small PPL improvement
PPL(LLaMA-3.1-8B, 8192) is now 8.29 from previously 8.38.
LLaMA-v2-7B is about the same as before.
2024-11-23 19:18:45 +02:00
Iwan Kawrakow
3cac58e182 iq2ks: small PPL improvement
PPL(LLaMA-3.1-8B, 8192) is now 9.95 from previously 10.18.
LLaMA-v2-7B is about the same as before.
2024-11-23 12:27:14 +02:00
Iwan Kawrakow
3a9926b932 Checkpoint
Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).

I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.
2024-11-21 08:16:41 +02:00
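As a quick sanity check of the bit accounting in the commit above (a sketch; the exact iq3_kt block layout is not spelled out here, so the scale overhead is assumed to account for the remaining 0.125 bpw):

```python
# Bit budget per group of 8 weights, as described in the commit:
# 2 x 8 = 16 bits for the magnitudes, plus 1 bit per weight for the sign.
magnitude_bits = 2 * 8          # 16 bits per group of 8
sign_bits = 8                   # 1 bpw for the sign
weights_per_group = 8
bpw = (magnitude_bits + sign_bits) / weights_per_group
print(bpw)  # 3.0 bpw before block scales; iq3_kt is listed at 3.125 bpw
```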
Iwan Kawrakow
2be4cffe66 Minor tweaks 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
5705dc7f2e Report actual bpw 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
3ee5434601 DRY 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
81cd220f93 iq4_kt: CUDA dot product works 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
79565c92e0 DRY 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e338e0a0cd DRY 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4cf82e7e2f iq4_kt: failed attempt to adjust CUDA dot product
It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
ab1cef30e7 iq4_kt: very slightly better
at the expense of much longer quantization time.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
1be0a9e0d7 iq4_kt: go to 4.0 bpw
15 bits per group of 4, plus 8-bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.
2024-11-21 08:16:41 +02:00
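The 4.0 bpw figure in the commit above follows directly from the stated layout (a quick check, assuming one 8-bit scale per block of 32):

```python
# A block of 32 weights = 8 groups of 4, each group coded in 15 bits,
# plus one 8-bit scale per block.
groups_per_block = 32 // 4
bits_per_block = groups_per_block * 15 + 8   # 120 + 8 = 128 bits
bpw = bits_per_block / 32
print(bpw)  # 4.0
```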
Iwan Kawrakow
21903f19b4 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
c20b22b9a0 iq3_kt: small progress 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4213ab1cb3 iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
215bea5c6a iq3_kt: small improvements and faster quantization 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
dbe085474a iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
200a19f18f iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
de7fe92833 iq4_kt: minor tweaks 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e9ced1bbe6 iq4_kt: CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
21ee589996 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
1d6ca83203 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
00b4bff286 Adding iq4_kt - not competitive at this point 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
47b28c1e92 iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4608f0cc6d iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
0ffc9b435c iq3_kt: CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
e9e5879b94 iq3_kt: speed up quantization
Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
c59830dafb iq3_kt WIP: speed up quantization
Nearly 60% improvement of quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
8f0d075f5e iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
dfcc8a9cf3 iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
386d139e13 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
f1fb59b44b iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
435eb9bdd3 WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
08503cec7d WIP 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
977f94b3e0 Forgotten change 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
4774788136 Adding iq3_kt
3.125 bpw. So far does not look good on the PPL vs bpw plot.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
590f47278b Minor 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
7bf6e158a9 iq2_kt: faster f16 CUDA dot product
We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
7cafafc69e iq2_kt: faster f16 CUDA dot product
We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
b354392c77 iq2_kt: f16 CUDA dot product
We arrive at 112 t/s.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
aed3910dfa iq2_kt: very slightly faster CUDA dot product 2024-11-21 08:16:41 +02:00
Iwan Kawrakow
d2331b9287 iq2_kt: CUDA dot product
Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S with forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
b3dfe9984b iq2_kt - even better
Re-quantize after determining block scales
(at the expense of much longer quantization time).
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
36e9c922b8 iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
2024-11-21 08:16:41 +02:00
Iwan Kawrakow
766fa600c8 WIP - try larger blocks
With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster.
2024-11-21 08:16:41 +02:00
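The clustered-search trick described above (also the contiguous-memory speedup from the later iq3_kt commits) can be sketched roughly as follows. This is a hypothetical NumPy illustration, not the actual implementation; the function names, cluster count, and plain k-means clustering are all assumptions:

```python
import numpy as np

def build_clusters(points, n_clusters, iters=10):
    # Plain k-means over the candidate codebook points (illustrative choice;
    # the real code may cluster differently).
    rng = np.random.default_rng(0)
    centroids = points[rng.choice(len(points), n_clusters, replace=False)]
    for _ in range(iters):
        d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for k in range(n_clusters):
            members = points[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
    # Final assignment with the converged centroids, so lookup and storage
    # agree on cluster membership.
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d.argmin(1)
    # Copy each cluster's members into contiguous memory so the inner
    # search scans sequentially (the speedup noted in the iq3_kt commits).
    clusters = [np.ascontiguousarray(points[assign == k])
                for k in range(n_clusters)]
    return centroids, clusters

def nearest_point(x, centroids, clusters):
    # Step 1: rank clusters by centroid distance.
    # Step 2: brute-force search only inside the nearest non-empty cluster.
    order = ((centroids - x) ** 2).sum(1).argsort()
    for k in order:
        members = clusters[k]
        if len(members):
            j = ((members - x) ** 2).sum(1).argmin()
            return members[j]
```

Searching one cluster instead of the whole codebook is what turns the 30+ minute brute-force pass into minutes; the trade-off is that the result is only approximate when the true nearest point sits in a neighboring cluster.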
Iwan Kawrakow
86948f9c5d WIP 2024-11-21 08:16:41 +02:00