ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-27 01:49:28 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	4213ab1cb3	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627 PPL(LLaMA-2-7B, 4096) = 6.3825 Quantization is faster too: ~200 seconds for LLaMA-3.1-8B on Ryzen-5975WX.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	215bea5c6a	iq3_kt: small improvements and faster quantization	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	dbe085474a	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297 PPL(LLaMA-2-7B, 4096) = 6.3913 Ah, quantization is faster too. About 20% faster.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	200a19f18f	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 PPL(LLaMA-2-7B, 4096) = 6.3920	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	de7fe92833	iq4_kt: minor tweaks	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	e9ced1bbe6	iq4_kt: CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	21ee589996	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	1d6ca83203	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	00b4bff286	Adding iq4_kt - not competitive at this point	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	47b28c1e92	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 PPL(LLaMA-2-7B, 4096) = 6.3920	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4608f0cc6d	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406 PPL(LLaMA-2-7B, 4096) = 6.4179	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	0ffc9b435c	iq3_kt: CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	e9e5879b94	iq3_kt speed up quantization Same trick as last commit applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	c59830dafb	iq3_kt WIP: speed up quantization Nearly 60% improvement of quantization speed by having the points belonging to a cluster copied to contiguous memory during initialization, and then accessed sequentially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	8f0d075f5e	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	dfcc8a9cf3	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	386d139e13	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	f1fb59b44b	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	435eb9bdd3	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	08503cec7d	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	977f94b3e0	Forgotten change	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4774788136	Adding iq3_kt 3.125 bpw. So far does not look good on the PPL vs bpw plot.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	590f47278b	Minor	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	7bf6e158a9	iq2_kt: faster f16 CUDA dot product We arrive at 146 t/s (no FA), and 158 t/s (FA). This is measured for LLaMA-3.1-8B with output.weight left as f16.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	7cafafc69e	iq2_kt: faster f16 CUDA dot product We arrive at 139 t/s (no FA), and 149 t/s (FA). My RTX-4080 is ~20% slower than the RTX-6000 quoted in the QTIP repository, so with FA (which I'm sure they also used) we are at around ~180 t/s on their GPU, so almost matching their performance.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	b354392c77	iq2_kt: f16 CUDA dot product We arrive at 112 t/s.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	aed3910dfa	iq2_kt: very slightly faster CUDA dot product	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	d2331b9287	iq2_kt: CUDA dot product Implemented as DMMV. Very slow - just 81 t/s for LLaMA-3.1-8B. Then again, Q2_K_S when forced to use DMMV only gets 112 t/s vs 145 t/s via MMVQ. My memory is that when the DMMV kernels were properly maintained/used, DMMV was about on par with MMVQ for k-quants on my GPU.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	b3dfe9984b	iq2_kt - even better Re-quantize after determining block scales (at the expense of much longer quantization time).	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	36e9c922b8	iq2_kt - this is better Using blocks of 32 and 16 bits per group of 8 weights it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	426a6e685f	iq2_kt: CUDA dequantize so we can run perplexity calcs. As already indicated by rmse, the 2-bit trellis approach is quite a bit worse than iq2_xxs.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	a4f1ac8da4	iq2_kt: quantize / dequantize I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B).	2024-11-21 08:16:40 +02:00
Kawrakow	4d2fbde0cb	MMQ for Q6_0 (#115 ) * MMQ for Q6_0 * Add Q6_0 MMQ to template generator --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-11-21 07:12:11 +01:00
Kawrakow	52874c5d21	Faster MoE inference (#112 ) * multi_sdd: WIP * multi_sdd: CPU works * multi_add: CUDA * multi_add: simplify * multi_add: Metal * Metal: speed up mul_mat_id For the Granite-1B MoE model PP-512 goes from 156 t/s to 890 t/s, so nearly a 6X speedup! --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-31 12:05:27 +01:00
Kawrakow	bd309cb782	Bitnet CUDA improvements (#109 ) * iq1_bn: improve CUDA TG On RTX-3080 TG-128(Bitnet-1.58b-3B) goes from 318 t/s to 340 t/s. I see I have on the front page 301 t/s, so pretty nice improvement since then. * iq2_bn(CUDA): quants are not 4-byte aligned --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 16:26:04 +02:00
Kawrakow	3805c84686	Improve Bitnet PP on Metal (#108 ) iq1_bn goes from 702 t/s to 716 t/s iq2_bn goes from 714 t/s to 743 t/s Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 15:13:45 +02:00
Kawrakow	f7b05a09dd	Faster IQ1_BN Metal implementation (#107 ) * iq1_bn: faster Metal dot product 82 t/s -> 87.9 t/s * iq1_bn(Metal): 87.9 -> 89.0 t/s for TG-128 * iq1_bn(Metal): 89.0 -> 94.7 t/s for TG-128 So, total improvement is ~15%. Not bad. * iq1_bn(Metal): 686 -> 702 t/s for PP-512 * iq2_bn(Metal): 710 -> 714 t/s for PP-512 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-26 10:59:59 +02:00
Iwan Kawrakow	19cc3329bf	Remove forgotten IQ1_TN, IQ2_TN enum values	2024-10-25 14:14:56 +03:00
Kawrakow	6b968f3894	Bitnet changes (#106 ) * Adapting iq2_bn to work without separate scale tensors Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I think it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales). * Adapting iq1_bn to work without separate scale tensors * Adapting iq2_bn: CUDA dequantize * Adapting iq2_bn: CUDA works * Adapting iq1_bn: CUDA works * Adapting iq1_bn, iq2_bn: NEON * Adapting iq1_bn, iq2_bn: Metal Dequantize works, but there is still something wrong with the dot products. * WIP Absolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels. * Remove iq1_tn and iq2_tn - Part 1 Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn. * Remove iq1_tn and iq2_tn - Part 2 * Bitnet: use the standard llm_build_kv to build self attention My main motivation was to enable FA. But FA does not work anyway because head size is 100 for the Bitnet ternary models (and I had forgotten this little detail). * Revert "Avoid rebuild of GGML graph for each token (#98)" This reverts commit `f2d315b46f`. As far as I can tell, the commit breaks Metal TG. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-25 13:08:43 +02:00
Kawrakow	9114078959	Fix quantized k-cache without FA (#105 ) * Added Johannes' changes, still getting NaNs with quantized k-cache. Also getting NaN's on Johannes's mainline branch. * This fixes it --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-24 12:20:30 +02:00
Kawrakow	462c6cd7b1	Enable q6_0 for flash attention (#101 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-22 11:34:49 +02:00
Kawrakow	dbf951df15	Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99 ) * Enable IQ4_NL for V-cache in token generation * We don't need these * Update printout of allowed quantized KV-cache combinations * Add IQ4_NL + IQ4_NL to FA This is a better alternative than Q4_0 + Q4_0 for the VRAM poor. * Remove file added by mistake * Fix typo, which is not really a bug --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-21 12:16:54 +02:00
agray3	f2d315b46f	Avoid rebuild of GGML graph for each token (#98 ) Introduces caching of GGML graph to avoid unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in cached GGML graph. Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.	2024-10-20 08:36:16 +02:00
Kawrakow	7b886ae3d8	Attempt to blindly fix Windows build failure (#93 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-19 11:43:04 +02:00
Kawrakow	76b97c8064	Adding IQ4_KSS: 4.0 bpw quants (#89 ) * iq4_kss: WIP * iq4_kss: CUDA dequantize works So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot. * iq4_kss: slightly better quantization * iq4_kss: another small quantization improvement * iq4_kss: CUDA works TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss. * iq4_kss: new bit arrangement - CUDA and Zen4 work Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads. * iq4_kss: ARM_NEON. Predictably very slow * iq4_kss: Metal PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad. * iq4_kss: somewhat faster Metal dot product 45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0 * iq4_kss: AVX2 Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads. * iq4_kss: very slightly faster Metal dot product 48.7 t/s -> 49.3 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-16 15:18:26 +03:00
Kawrakow	993ca95e9e	iq4_ks: faster dot product on Metal (#90 ) TG-128(LLaMA-3.1-8B) goes to 52.5 t/s up from 48.4 t/s. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-16 14:13:03 +03:00
Iwan Kawrakow	ff23008ed4	Minor iq3_k tweak	2024-10-14 18:13:11 +03:00
Kawrakow	302a6225a1	iq3_k: fix and optimize Metal dot product (#87 ) * iq3_k: fix Metal dot product I was accessing the scales as 4-byte aligned, but iq3_k is not 4-byte aligned. Instead of throwing an error (as it happens on CUDA when one makes this mistake), Metal silently accepts and we get garbage. * iq3_k: slightly faster Metal dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-14 10:46:41 +03:00
Kawrakow	baab1d9a1e	Fix and optimize iq2k Metal implementation (#86 ) * I somehow broke iq2_k on Metal? - fix dequantize * I somehow broke iq2_k on Metal? - fix dot product * iq2_k: optimize Metal dot product 42.6 t/s -> 46.2 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-13 14:30:30 +03:00
Kawrakow	910a134094	IQ2_KS: 2.1875 bpw non-linear quantization (#85 ) * Experimenting * iq2k: Try make_qx_quants for the scale Slightly better for LLaMA-3.1, Gemma-2, slightly worse for Qwen2.5 * iq2k with make_qx_quants: adjust scale * iq2ks: basics * iq2_ks: CUDA works * iq2_ks: WIP * iq2_ks: WIP * iq2_ks: Zen4 * iq2_ks: AVX2 * iq2_ks: scalar dot product * iq2_ks: ARM_NEON * iq2_ks: Metal * iq2_ks: faster Metal LLaMA-3.1-8B: PP-512 = 475.22 ± 0.37 t/s TG-128 = 45.32 ± 0.03 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-13 13:34:30 +03:00
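
The bits-per-weight (bpw) figures quoted in commit 36e9c922b8 above can be checked with a little arithmetic. The sketch below is not taken from the repository. It assumes a layout of one scale per block of 32 weights on top of the per-group trellis bits, and the 4-bit scale width is an inferred assumption, chosen because it reproduces the quoted +0.0625 and -0.0625 bpw differences relative to iq2_xxs (2.0625 bpw). Any per-row overhead is ignored. The function name and its parameters are hypothetical.

```python
from fractions import Fraction

# bpw of iq2_xxs, used as the reference point in the commit message.
IQ2_XXS_BPW = Fraction(33, 16)  # 2.0625


def bits_per_weight(bits_per_group, group_size=8, block_size=32, scale_bits=4):
    """Return bpw for `bits_per_group` bits per group of `group_size` weights,
    plus `scale_bits` bits of scale per block of `block_size` weights.

    scale_bits=4 is an ASSUMPTION inferred from the quoted bpw differences,
    not a value read from the ik_llama.cpp source.
    """
    return Fraction(bits_per_group, group_size) + Fraction(scale_bits, block_size)


for bits in (16, 15):
    bpw = bits_per_weight(bits)
    diff = bpw - IQ2_XXS_BPW
    print(f"{bits} bits per group of 8: {float(bpw)} bpw, "
          f"{float(diff):+} vs iq2_xxs")
```

Exact fractions are used instead of floats so that the comparison against 1/16 bpw steps is not affected by rounding.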


149 Commits