ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-25 23:54:10 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	dbe085474a	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297 PPL(LLaMA-2-7B, 4096) = 6.3913 Ah, quantization is faster too. About 20% faster.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	200a19f18f	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 PPL(LLaMA-2-7B, 4096) = 6.3920	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	de7fe92833	iq4_kt: minor tweaks	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	e9ced1bbe6	iq4_kt: CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	21ee589996	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	1d6ca83203	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	00b4bff286	Adding iq4_kt - not competitive at this point	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	47b28c1e92	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 PPL(LLaMA-2-7B, 4096) = 6.3920	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4608f0cc6d	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406 PPL(LLaMA-2-7B, 4096) = 6.4179	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	0ffc9b435c	iq3_kt: CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	e9e5879b94	iq3_kt speed up quantization Same trick as last commit applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	c59830dafb	iq3_kt WIP: speed up quantization Nearly 60% improvement of quantization speed by having the points belonging to a cluster copied to contiguous memory during initialization, and then accessed sequentially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	8f0d075f5e	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	dfcc8a9cf3	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	386d139e13	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	f1fb59b44b	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	435eb9bdd3	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	08503cec7d	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	977f94b3e0	Forgotten change	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4774788136	Adding iq3_kt 3.125 bpw. So far does not look good on the PPL vs bpw plot.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	590f47278b	Minor	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	7bf6e158a9	iq2_kt: faster f16 CUDA dot product We arrive at 146 t/s (no FA), and 158 t/s (FA). This is measured for LLaMA-3.1-8B with output.weight left as f16.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	7cafafc69e	iq2_kt: faster f16 CUDA dot product We arrive at 139 t/s (no FA), and 149 t/s (FA). My RTX-4080 is ~20% slower than the RTX-6000 quoted in the QTIP repository, so with FA (which I'm sure they also used) we are at around ~180 t/s on their GPU, so almost matching their performance.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	b354392c77	iq2_kt: f16 CUDA dot product We arrive at 112 t/s.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	aed3910dfa	iq2_kt: very slightly faster CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	d2331b9287	iq2_kt: CUDA dot product Implemented as DMMV. Very slow - just 81 t/s for LLaMA-3.1-8B. Then again, Q2_K_S when forced to use DMMV only gets 112 t/s vs 145 t/s via MMVQ. My memory is that when the DMMV kernels were properly maintained/used, DMMV was about on par with MMVQ for k-quants on my GPU.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	b3dfe9984b	iq2_kt - even better Re-quantize after determining block scales (at the expense of much longer quantization time).	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	36e9c922b8	iq2_kt - this is better Using blocks of 32 and 16 bits per group of 8 weights it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	766fa600c8	WIP - try larger blocks With blocks of 32 and 16 bits per group of 8 the brute force search becomes prohibitive in terms of CPU time (30+ minutes for 8B LLaMA after SIMDifying with AVX2). The trick is to group the points in clusters, find the nearest cluster, and only search within the cluster.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	86948f9c5d	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	a961a48e88	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	426a6e685f	iq2_kt: CUDA dequantize so we can run perplexity calcs. As already indicated by rmse, the 2-bit trellis approach is quite a bit worse than iq2_xxs.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	a4f1ac8da4	iq2_kt: quantize / dequantize I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B).	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	f1df1b7e15	Testing Trellis quantization: playing with scales and generators	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	9ec145550d	Testing Trellis quantization: 4-bit quantized block scales rmse increases by just 3%, so this is beating iq2_xxs in terms of rmse at the same 2.0625 bpw.	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	f21dd3fb15	Testing Trellis quantization Using 12 bits per 8 weights I get a better rmse than iq2_xxs. I still need to see how quantizing the group-of-8 scales will affect accuracy. By AVX2 SIMDifying the search for the best code, LLaMA-3.1-8B gets quantized in 130 seconds on the Ryzen-7950X CPU - sluggish but still acceptable.	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	afe9db7143	WIP	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	c578478911	WIP	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	798f93ce40	WIP	2024-11-21 08:16:40 +02:00
Kawrakow	4d2fbde0cb	MMQ for Q6_0 (#115) * MMQ for Q6_0 * Add Q6_0 MMQ to template generator --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-11-21 07:12:11 +01:00
Kawrakow	52874c5d21	Faster MoE inference (#112) * multi_add: WIP * multi_add: CPU works * multi_add: CUDA * multi_add: simplify * multi_add: Metal * Metal: speed up mul_mat_id For the Granite-1B MoE model PP-512 goes from 156 t/s to 890 t/s, so nearly a 6X speedup! --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-31 12:05:27 +01:00
Kawrakow	5ad6439486	Use fused mul - unary op also for MoE models (#111) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 18:23:54 +02:00
Kawrakow	2e5f6db5de	Bitnet: use the fused mul-silu in the FFN network (#110) I had forgotten that build_bitnet() does not use the standard llm_build_ffn function, so the fused mul-silu didn't get used for Bitnet when I added it to llm_build_ffn. This gives us another ~1% speedup for TG-128. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 17:40:32 +02:00
Kawrakow	bd309cb782	Bitnet CUDA improvements (#109) * iq1_bn: improve CUDA TG On RTX-3080 TG-128(Bitnet-1.58b-3B) goes from 318 t/s to 340 t/s. I see I have on the front page 301 t/s, so pretty nice improvement since then. * iq2_bn(CUDA): quants are not 4-byte aligned --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 16:26:04 +02:00
Kawrakow	3805c84686	Improve Bitnet PP on Metal (#108) iq1_bn goes from 702 t/s to 716 t/s iq2_bn goes from 714 t/s to 743 t/s Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 15:13:45 +02:00
Kawrakow	f7b05a09dd	Faster IQ1_BN Metal implementation (#107) * iq1_bn: faster Metal dot product 82 t/s -> 87.9 t/s * iq1_bn(Metal): 87.9 -> 89.0 t/s for TG-128 * iq1_bn(Metal): 89.0 -> 94.7 t/s for TG-128 So, total improvement is ~15%. Not bad. * iq1_bn(Metal): 686 -> 702 t/s for PP-512 * iq2_bn(Metal): 710 -> 714 t/s for PP-512 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 10:59:59 +02:00
Iwan Kawrakow	19cc3329bf	Remove forgotten IQ1_TN, IQ2_TN enum values	2024-10-25 14:14:56 +03:00
Kawrakow	6b968f3894	Bitnet changes (#106) * Adapting iq2_bn to work without separate scale tensors Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I think it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales). * Adapting iq1_bn to work without separate scale tensors * Adapting iq2_bn: CUDA dequantize * Adapting iq2_bn: CUDA works * Adapting iq1_bn: CUDA works * Adapting iq1_bn, iq2_bn: NEON * Adapting iq1_bn, iq2_bn: Metal Dequantize works, but there is still something wrong with the dot products. * WIP Absolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels. * Remove iq1_tn and iq2_tn - Part 1 Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn. * Remove iq1_tn and iq2_tn - Part 2 * Bitnet: use the standard llm_build_kv to build self attention My main motivation was to enable FA. But FA does not work anyway because head size is 100 for the Bitnet ternary models (and I had forgotten this little detail). * Revert "Avoid rebuild of GGML graph for each token (#98)" This reverts commit `f2d315b46f`. As far as I can tell, the commit breaks Metal TG. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-25 13:08:43 +02:00
Kawrakow	9114078959	Fix quantized k-cache without FA (#105) * Added Johannes' changes, still getting NaNs with quantized k-cache. Also getting NaN's on Johannes's mainline branch. * This fixes it --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-24 12:20:30 +02:00
Kawrakow	b61cf7d0d7	Add support for Granite and GraniteMoE models (#102) * Add Granite and GraniteMoE models * Granite: avoid NaNs on CUDA by scaling Q before K*Q multiplication --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-22 17:28:14 +02:00
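
Note on the cluster search mentioned in commits 766fa600c8 and c59830dafb: the messages describe grouping the candidate points into clusters, finding the nearest cluster, and searching only within it, with each cluster's points stored together so they are read sequentially. The following is a minimal Python sketch of that general idea, not the repository's implementation. The function names, the random toy codebook, and the choice of the first 32 points as centroids are all assumptions made for illustration; the actual code is C++ with AVX2 and uses its own codebook.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(points, q):
    """Brute force: index of the entry in `points` closest to `q`."""
    return min(range(len(points)), key=lambda i: sq_dist(points[i], q))

def build_clusters(codebook, centroids):
    """Assign every codebook point to its nearest centroid.

    Each cluster keeps its own list of (index, point) pairs, so a search
    walks the members of one cluster sequentially.
    """
    clusters = [[] for _ in centroids]
    for idx, point in enumerate(codebook):
        clusters[nearest(centroids, point)].append((idx, point))
    return clusters

def clustered_search(codebook, centroids, clusters, q):
    """Approximate nearest codebook index: search only the nearest cluster."""
    members = clusters[nearest(centroids, q)]
    if not members:  # empty cluster (possible with duplicate centroids)
        return nearest(codebook, q)
    return min(members, key=lambda m: sq_dist(m[1], q))[0]

# Toy data (hypothetical): 1024 random 8-dimensional points,
# the first 32 of which are reused as cluster centroids.
rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1024)]
centroids = codebook[:32]
clusters = build_clusters(codebook, centroids)

query = [rng.gauss(0.0, 1.0) for _ in range(8)]
approx = clustered_search(codebook, centroids, clusters, query)
exact = nearest(codebook, query)
print(approx, exact)  # may differ: the clustered search is approximate
```

The clustered search compares the query against the centroids plus the members of one cluster instead of the whole codebook. The price is that the true nearest point can sit in a neighbouring cluster, so the result is not guaranteed to match the brute-force answer.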

3520 Commits