ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-25 15:44:10 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	4608f0cc6d	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406 PPL(LLaMA-2-7B, 4096) = 6.4179	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	0ffc9b435c	iq3_kt: CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	e9e5879b94	iq3_kt speed up quantization Same trick as last commit applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	c59830dafb	iq3_kt WIP: speed up quantization Nearly 60% improvement of quantization speed by having the points belonging to a cluster copied to contiguous memory during initialization, and then accessed sequentially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	8f0d075f5e	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	dfcc8a9cf3	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	386d139e13	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	f1fb59b44b	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	435eb9bdd3	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	08503cec7d	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	977f94b3e0	Forgotten change	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4774788136	Adding iq3_kt 3.125 bpw. So far does not look good on the PPL vs bpw plot.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	590f47278b	Minor	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	7bf6e158a9	iq2_kt: faster f16 CUDA dot product We arrive at 146 t/s (no FA), and 158 t/s (FA). This is measured for LLaMA-3.1-8B with output.weight left as f16.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	7cafafc69e	iq2_kt: faster f16 CUDA dot product We arrive at 139 t/s (no FA), and 149 t/s (FA). My RTX-4080 is ~20% slower than the RTX-6000 quoted in the QTIP repository, so with FA (which I'm sure they also used) we are at around ~180 t/s on their GPU, so almost matching their performance.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	b354392c77	iq2_kt: f16 CUDA dot product We arrive at 112 t/s.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	aed3910dfa	iq2_kt: very slightly faster CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	d2331b9287	iq2_kt: CUDA dot product Implemented as DMMV. Very slow - just 81 t/s for LLaMA-3.1-8B. Then again, Q2_K_S forced to use DMMV only gets 112 t/s vs 145 t/s via MMVQ. My memory is that when the DMMV kernels were properly maintained/used, DMMV was about on par with MMVQ for k-quants on my GPU.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	b3dfe9984b	iq2_kt - even better Re-quantize after determining block scales (at the expense of much longer quantization time).	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	36e9c922b8	iq2_kt - this is better Using blocks of 32 and 16 bits per group of 8 weights it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	766fa600c8	WIP - try larger blocks With blocks of 32 and 16 bits per group of 8 the brute force search becomes prohibitive in terms of CPU time (30+ minutes for 8B LLaMA after SIMDifying with AVX2). The trick is to group the points in clusters, find the nearest cluster, and only search within the cluster.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	86948f9c5d	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	a961a48e88	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	426a6e685f	iq2_kt: CUDA dequantize so we can run perplexity calcs. As already indicated by rmse, the 2-bit trellis approach is quite a bit worse than iq2_xxs.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	a4f1ac8da4	iq2_kt: quantize / dequantize I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B).	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	f1df1b7e15	Testing Trellis quantization: playing with scales and generators	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	9ec145550d	Testing Trellis quantization: 4-bit quantized block scales rmse increases by just 3%, so this is beating iq2_xxs in terms of rmse at the same 2.0625 bpw.	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	f21dd3fb15	Testing Trellis quantization Using 12 bits per 8 weights I get a better rmse than iq2_xxs. I still need to see how quantizing the group-of-8 scales will affect accuracy. By AVX2 SIMDifying the search for the best code, LLaMA-3.1-8B gets quantized in 130 seconds on the Ryzen-7950X CPU - sluggish but still acceptable.	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	afe9db7143	WIP	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	c578478911	WIP	2024-11-21 08:16:40 +02:00
Iwan Kawrakow	798f93ce40	WIP	2024-11-21 08:16:40 +02:00
Kawrakow	4d2fbde0cb	MMQ for Q6_0 (#115 ) * MMQ for Q6_0 * Add Q6_0 MMQ to template generator --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-11-21 07:12:11 +01:00
Kawrakow	52874c5d21	Faster MoE inference (#112 ) * multi_add: WIP * multi_add: CPU works * multi_add: CUDA * multi_add: simplify * multi_add: Metal * Metal: speed up mul_mat_id For the Granite-1B MoE model PP-512 goes from 156 t/s to 890 t/s, so nearly a 6X speedup! --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-31 12:05:27 +01:00
Kawrakow	5ad6439486	Use fused mul - unary op also for MoE models (#111 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 18:23:54 +02:00
Kawrakow	2e5f6db5de	Bitnet: use the fused mul-silu in the FFN network (#110 ) I had forgotten that build_bitnet() does not use the standard llm_build_ffn function, so the fused mul-silu didn't get used for Bitnet when I added it to llm_build_ffn. This gives us another ~1% speedup for TG-128. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 17:40:32 +02:00
Kawrakow	bd309cb782	Bitnet CUDA improvements (#109 ) * iq1_bn: improve CUDA TG On RTX-3080 TG-128(Bitnet-1.58b-3B) goes from 318 t/s to 340 t/s. I see I have on the front page 301 t/s, so pretty nice improvement since then. * iq2_bn(CUDA): quants are not 4-byte aligned --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 16:26:04 +02:00
Kawrakow	3805c84686	Improve Bitnet PP on Metal (#108 ) iq1_bn goes from 702 t/s to 716 t/s iq2_bn goes from 714 t/s to 743 t/s Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 15:13:45 +02:00
Kawrakow	f7b05a09dd	Faster IQ1_BN Metal implementation (#107 ) * iq1_bn: faster Metal dot product 82 t/s -> 87.9 t/s * iq1_bn(Metal): 87.9 -> 89.0 t/s for TG-128 * iq1_bn(Metal): 89.0 -> 94.7 t/s for TG-128 So, total improvement is ~15%. Not bad. * iq1_bn(Metal): 686 -> 702 t/s for PP-512 * iq2_bn(Metal): 710 -> 714 t/s for PP-512 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 10:59:59 +02:00
Iwan Kawrakow	19cc3329bf	Remove forgotten IQ1_TN, IQ2_TN enum values	2024-10-25 14:14:56 +03:00
Kawrakow	6b968f3894	Bitnet changes (#106 ) * Adapting iq2_bn to work without separate scale tensors Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I think it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales). * Adapting iq1_bn to work without separate scale tensors * Adapting iq2_bn: CUDA dequantize * Adapting iq2_bn: CUDA works * Adapting iq1_bn: CUDA works * Adapting iq1_bn, iq2_bn: NEON * Adapting iq1_bn, iq2_bn: Metal Dequantize works, but there is still something wrong with the dot products. * WIP Absolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels. * Remove iq1_tn and iq2_tn - Part 1 Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn. * Remove iq1_tn and iq2_tn - Part 2 * Bitnet: use the standard llm_build_kv to build self attention My main motivation was to enable FA. But FA does not work anyway because head size is 100 for the Bitnet ternary models (and I had forgotten this little detail). * Revert "Avoid rebuild of GGML graph for each token (#98)" This reverts commit `f2d315b46f`. As far as I can tell, the commit breaks Metal TG. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-25 13:08:43 +02:00
Kawrakow	9114078959	Fix quantized k-cache without FA (#105 ) * Added Johannes' changes, still getting NaNs with quantized k-cache. Also getting NaN's on Johannes's mainline branch. * This fixes it --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-24 12:20:30 +02:00
Kawrakow	b61cf7d0d7	Add support for Granite and GraniteMoE models (#102 ) * Add Granite and GraniteMoE models * Granite: avoid NaNs on CUDA by scaling Q before K*Q multiplication --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-22 17:28:14 +02:00
Kawrakow	462c6cd7b1	Enable q6_0 for flash attention (#101 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-22 11:34:49 +02:00
Kawrakow	dbf951df15	Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99 ) * Enable IQ4_NL for V-cache in token generation * We don't need these * Update printout of allowed quantized KV-cache combinations * Add IQ4_NL + IQ4_NL to FA This is a better alternative than Q4_0 + Q4_0 for the VRAM poor. * Remove file added by mistake * Fix typo, which is not really a bug --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-21 12:16:54 +02:00
agray3	f2d315b46f	Avoid rebuild of GGML graph for each token (#98 ) Introduces caching of GGML graph to avoid unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in cached GGML graph. Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.	2024-10-20 08:36:16 +02:00
Kawrakow	afbf2ef3e2	Bitnet: make the scale tensors optional (#97 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-19 18:52:58 +02:00
Nexes the Elder	a077f09bcb	Quant strategies: attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S (#96 ) * attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S Pattern worth to be tested on more quants and on L3 8B. PPL 512 = -0.024 for 70b ; - 0.005 for 8b Size = - 640MiB for 70b ; - 64MiB for 8b 70b Q5_K_S now beats Q5_K_M by -0.012 ppl I suspect that it goes for L3 as well, which was quite insensitive to attn_q quantization. * indent	2024-10-19 17:24:43 +02:00
Kawrakow	7b886ae3d8	Attempt to blindly fix Windows build failure (#93 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-19 11:43:04 +02:00
Nexes the Elder	03cabe1540	CLI - Specify GGML_TYPE to quantize for the main tensors. (#91 ) To complement the token_embd.weight and output.weight : attn_v.weight attn_k.weight. attn_q_weight attn_output.weight attn_qkv.weight ffn_gate ffn_down ffn_up	2024-10-18 09:48:15 +02:00
Kawrakow	76b97c8064	Adding IQ4_KSS: 4.0 bpw quants (#89 ) * iq4_kss: WIP * iq4_kss: CUDA dequantize works So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot. * iq4_kss: slightly better quantization * iq4_kss: another small quantization improvement * iq4_kss: CUDA works TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss. * iq4_kss: new bit arrangement - CUDA and Zen4 work Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads. * iq4_kss: ARM_NEON. Predictably very slow * iq4_kss: Metal PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad. * iq4_kss: somewhat faster Metal dot product 45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0 * iq4_kss: AVX2 Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads. * iq4_kss: very slightly faster Metal dot product 48.7 t/s -> 49.3 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-16 15:18:26 +03:00
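
Commits 766fa600c8 and c59830dafb describe speeding up the closest-point search by grouping the code points into clusters, finding the nearest cluster, and searching only inside it, with each cluster's points copied to contiguous memory. The following is a minimal Python sketch of that idea, not the repository's implementation: the function names, the random choice of centroids, and the toy codebook are all hypothetical, and the result is only an approximation of the brute-force answer.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(points, q):
    """Brute-force index of the point in `points` closest to q."""
    return min(range(len(points)), key=lambda i: sqdist(points[i], q))

def build_clusters(codebook, n_clusters, seed=0):
    """Assign every code point to its nearest centroid.

    Assumption: centroids are simply a random sample of the codebook
    (which is assumed to contain distinct points); a real quantizer
    would likely pick them more carefully.
    """
    rng = random.Random(seed)
    centroids = rng.sample(codebook, n_clusters)
    members = [[] for _ in range(n_clusters)]
    for idx, p in enumerate(codebook):
        members[nearest(centroids, p)].append(idx)
    # Copy each cluster's points into its own contiguous list so the
    # inner search walks them sequentially.
    packed = [[codebook[i] for i in m] for m in members]
    return centroids, members, packed

def clustered_search(centroids, members, packed, q):
    """Find the nearest cluster, then search only within that cluster."""
    c = nearest(centroids, q)
    j = nearest(packed[c], q)
    return members[c][j]  # index into the original codebook

# Toy data: 1024 random 8-dimensional code points, 32 clusters.
rng = random.Random(1)
codebook = [tuple(rng.gauss(0, 1) for _ in range(8)) for _ in range(1024)]
centroids, members, packed = build_clusters(codebook, 32)

query = tuple(rng.gauss(0, 1) for _ in range(8))
approx = clustered_search(centroids, members, packed, query)
exact = nearest(codebook, query)
print("clustered error:", sqdist(codebook[approx], query))
print("brute-force error:", sqdist(codebook[exact], query))
```

The clustered search evaluates roughly `n_clusters + len(cluster)` distances per query instead of `len(codebook)`. The price is that the true nearest point can sit in a neighbouring cluster, so the clustered error is never smaller than the brute-force error.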


3512 Commits
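
The bits-per-weight (bpw) figures quoted in commits 36e9c922b8 and 9ec145550d can be checked with a little arithmetic. The sketch below assumes one 4-bit scale per block of 32 weights and ignores any per-row or super-block overhead. That layout is a guess that happens to reproduce the quoted differences, not a confirmed description of the iq2_kt format, and the helper name is hypothetical.

```python
from fractions import Fraction

def bits_per_weight(code_bits, group_size, scale_bits, block_size):
    """bpw = code bits per weight + scale bits per weight (assumed layout)."""
    return Fraction(code_bits, group_size) + Fraction(scale_bits, block_size)

# iq2_xxs size as quoted in the log: 2.0625 bpw.
IQ2_XXS_BPW = Fraction(33, 16)

# 16 bits per group of 8 weights, assumed 4-bit scale per block of 32.
kt16 = bits_per_weight(16, 8, 4, 32)
# 15 bits per group of 8 weights, same assumed scale layout.
kt15 = bits_per_weight(15, 8, 4, 32)

print("16-bit groups:", float(kt16), "bpw, delta vs iq2_xxs:", float(kt16 - IQ2_XXS_BPW))
print("15-bit groups:", float(kt15), "bpw, delta vs iq2_xxs:", float(kt15 - IQ2_XXS_BPW))
```

Under these assumptions the 16-bit variant is 0.0625 bpw larger than iq2_xxs and the 15-bit variant 0.0625 bpw smaller, matching the commit message.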