ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-25 15:44:10 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	55db84400a	Small iq3k improvement	2024-11-25 09:01:25 +02:00
Iwan Kawrakow	74e3b1fad7	Minor	2024-11-24 17:11:11 +02:00
Iwan Kawrakow	65ebc6f986	iq4_ks: minor PPL improvement	2024-11-24 12:01:18 +02:00
Iwan Kawrakow	70815ec5b2	iq2k: quantization improvement. I was not using the correct scale sign to compute the MSE when checking the solution with the sign flipped. iq4_kss is now almost on par with the 4-bit Trellis.	2024-11-24 11:29:37 +02:00
Iwan Kawrakow	7447c55a8a	iq2k: small PPL improvement. PPL(LLaMA-3.1-8B, 8192) is now 8.29, down from 8.38. LLaMA-v2-7B is about the same as before.	2024-11-23 19:18:45 +02:00
Iwan Kawrakow	3cac58e182	iq2ks: small PPL improvement. PPL(LLaMA-3.1-8B, 8192) is now 9.95, down from 10.18. LLaMA-v2-7B is about the same as before.	2024-11-23 12:27:14 +02:00
Iwan Kawrakow	3a9926b932	Checkpoint. Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude plus 1 bpw for the sign. It gives a visible improvement in the PPL vs bpw plot, but that comes at the expense of much longer quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX). I also noticed that the 3INST generator is not actually generating a Gaussian distribution. But going to a better generator means readjusting all the hyper-parameters, so leaving it for later.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	2be4cffe66	Minor tweaks	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	5705dc7f2e	Report actual bpw	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	3ee5434601	DRY	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	81cd220f93	iq4_kt: CUDA dot product works	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	79565c92e0	DRY	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	e338e0a0cd	DRY	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4cf82e7e2f	iq4_kt: failed attempt to adjust CUDA dot product. It was working for 4.125 bpw. But after changing to 4.0 bpw there is something wrong and I don't see the bug.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	ab1cef30e7	iq4_kt: very slightly better at the expense of much longer quantization time.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	1be0a9e0d7	iq4_kt: go to 4.0 bpw. 15 bits per group of 4, plus 8-bit scales for blocks of 32. This gives a slightly better PPL than iq4_kss.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	21903f19b4	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	c20b22b9a0	iq3_kt: small progress	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4213ab1cb3	iq2_kt: SOTA. We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627 and PPL(LLaMA-2-7B, 4096) = 6.3825. Quantization is faster too: ~200 seconds for LLaMA-3.1-8B on Ryzen-5975WX.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	215bea5c6a	iq3_kt: small improvements and faster quantization	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	dbe085474a	iq2_kt: SOTA. We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297 and PPL(LLaMA-2-7B, 4096) = 6.3913. Ah, quantization is faster too. About 20% faster.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	200a19f18f	iq2_kt: SOTA. We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 and PPL(LLaMA-2-7B, 4096) = 6.3920.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	de7fe92833	iq4_kt: minor tweaks	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	e9ced1bbe6	iq4_kt: CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	21ee589996	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	1d6ca83203	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	00b4bff286	Adding iq4_kt - not competitive at this point	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	47b28c1e92	iq2_kt: SOTA. We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 and PPL(LLaMA-2-7B, 4096) = 6.3920.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4608f0cc6d	iq2_kt: SOTA. We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406 and PPL(LLaMA-2-7B, 4096) = 6.4179.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	0ffc9b435c	iq3_kt: CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	e9e5879b94	iq3_kt: speed up quantization. Same trick as last commit applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	c59830dafb	iq3_kt WIP: speed up quantization. Nearly 60% improvement of quantization speed by having the points belonging to a cluster copied to contiguous memory during initialization, and then accessed sequentially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	8f0d075f5e	iq3_kt WIP: slowly improving. PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	dfcc8a9cf3	iq3_kt WIP: slowly improving. PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	386d139e13	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	f1fb59b44b	iq3_kt WIP: slowly improving. PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	435eb9bdd3	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	08503cec7d	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	977f94b3e0	Forgotten change	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4774788136	Adding iq3_kt. 3.125 bpw. So far it does not look good on the PPL vs bpw plot.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	590f47278b	Minor	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	7bf6e158a9	iq2_kt: faster f16 CUDA dot product. We arrive at 146 t/s (no FA), and 158 t/s (FA). This is measured for LLaMA-3.1-8B with output.weight left as f16.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	7cafafc69e	iq2_kt: faster f16 CUDA dot product. We arrive at 139 t/s (no FA), and 149 t/s (FA). My RTX-4080 is ~20% slower than the RTX-6000 quoted in the QTIP repository, so with FA (which I'm sure they also used) we would be at around 180 t/s on their GPU, almost matching their performance.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	b354392c77	iq2_kt: f16 CUDA dot product. We arrive at 112 t/s.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	aed3910dfa	iq2_kt: very slightly faster CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	d2331b9287	iq2_kt: CUDA dot product. Implemented as DMMV. Very slow - just 81 t/s for LLaMA-3.1-8B. Then again, Q2_K_S forced to use DMMV only gets 112 t/s vs 145 t/s via MMVQ. My memory is that when the DMMV kernels were properly maintained/used, DMMV was about on par with MMVQ for k-quants on my GPU.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	b3dfe9984b	iq2_kt - even better. Re-quantize after determining block scales (at the expense of much longer quantization time).	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	36e9c922b8	iq2_kt - this is better. Using blocks of 32 and 16 bits per group of 8 weights, it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	766fa600c8	WIP - try larger blocks. With blocks of 32 and 16 bits per group of 8, the brute-force search becomes prohibitive in terms of CPU time (30+ minutes for 8B LLaMA after SIMDifying with AVX2). The trick is to group the points in clusters, find the nearest cluster, and only search within the cluster.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	86948f9c5d	WIP	2024-11-21 08:16:41 +02:00
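
The cluster trick described in commit 766fa600c8 (group the points in clusters, find the nearest cluster, search only within it) can be sketched roughly as below. This is an illustrative sketch, not the repository's code: the function names are hypothetical, the clusters are built with a few plain k-means iterations (the commit message does not say how the clusters are formed), and everything is pure Python rather than SIMD. Note that the result is approximate: the true nearest point may sit in a neighbouring cluster.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(points, q):
    """Brute-force search: index of the point in `points` closest to q."""
    return min(range(len(points)), key=lambda i: sqdist(points[i], q))


def assign(codebook, centers):
    """For each center, list the codebook indices whose nearest center it is."""
    members = [[] for _ in centers]
    for idx, p in enumerate(codebook):
        members[nearest(centers, p)].append(idx)
    return members


def build_clusters(codebook, n_clusters, n_iter=10, seed=0):
    """Group codebook points into clusters (assumption: plain k-means)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(codebook, n_clusters)]
    dim = len(codebook[0])
    for _ in range(n_iter):
        members = assign(codebook, centers)
        for c, idxs in enumerate(members):
            if idxs:  # an empty cluster keeps its previous center
                centers[c] = [
                    sum(codebook[i][d] for i in idxs) / len(idxs)
                    for d in range(dim)
                ]
    # Re-assign once more so membership matches the final centers.
    return centers, assign(codebook, centers)


def clustered_search(codebook, centers, members, q):
    """Find the nearest cluster, then search only among its members."""
    idxs = members[nearest(centers, q)]
    if not idxs:  # empty cluster: fall back to the full brute-force search
        return nearest(codebook, q)
    return min(idxs, key=lambda i: sqdist(codebook[i], q))
```

With N codebook points split into K clusters, each query compares against K centers plus roughly N/K members instead of all N points, which is where the saving over brute force comes from.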


3540 Commits