ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-11 22:40:01 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	85d1011f52	Another iq3k improvement	2024-11-25 10:11:02 +02:00
Iwan Kawrakow	55db84400a	Small iq3k improvement	2024-11-25 09:01:25 +02:00
Iwan Kawrakow	74e3b1fad7	Minor	2024-11-24 17:11:11 +02:00
Iwan Kawrakow	65ebc6f986	iq4_ks: minor PPL improvement	2024-11-24 12:01:18 +02:00
Iwan Kawrakow	70815ec5b2	iq2k: quantization improvement I was not using the correct scale sign to compute mse when checking the solution with the sign flipped. iq4_kss is now almost on par with the 4-bit Trellis.	2024-11-24 11:29:37 +02:00
Iwan Kawrakow	7447c55a8a	iq2k: small PPL improvement PPL(LLaMA-3.1-8B, 8192) is now 8.29 from previously 8.38. LLaMA-v2-7B is about the same as before.	2024-11-23 19:18:45 +02:00
Iwan Kawrakow	3cac58e182	iq2ks: small PPL improvement PPL(LLaMA-3.1-8B, 8192) is now 9.95 from previously 10.18. LLaMA-v2-7B is about the same as before.	2024-11-23 12:27:14 +02:00
Iwan Kawrakow	3a9926b932	Checkpoint Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude plus 1 bpw for the sign. It gives a visible improvement in the PPL vs bpw plot, but that comes at the expense of much longer quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX). I also noticed that the 3INST generator is not actually generating a Gaussian distribution. But going to a better generator means readjusting all the hyper-parameters, so leaving it for later.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	2be4cffe66	Minor tweaks	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4cf82e7e2f	iq4_kt: failed attempt to adjust CUDA dot product It was working for 4.125 bpw. But after changing to 4.0 bpw there is something wrong and I don't see the bug.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	ab1cef30e7	iq4_kt: very slightly better at the expense of much longer quantization time.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	1be0a9e0d7	iq4_kt: go to 4.0 bpw 15 bits per group of 4, plus 8-bit scales for blocks of 32. This gives a slightly better PPL than iq4_kss.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	21903f19b4	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	c20b22b9a0	iq3_kt: small progress	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4213ab1cb3	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627 PPL(LLaMA-2-7B, 4096) = 6.3825 Quantization is faster too: ~200 seconds for LLaMA-3.1-8B on Ryzen-5975WX.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	215bea5c6a	iq3_kt: small improvements and faster quantization	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	dbe085474a	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297 PPL(LLaMA-2-7B, 4096) = 6.3913 Ah, quantization is faster too. About 20% faster.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	200a19f18f	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 PPL(LLaMA-2-7B, 4096) = 6.3920	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	21ee589996	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	1d6ca83203	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	00b4bff286	Adding iq4_kt - not competitive at this point	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	47b28c1e92	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642 PPL(LLaMA-2-7B, 4096) = 6.3920	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4608f0cc6d	iq2_kt: SOTA We arrive at PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406 PPL(LLaMA-2-7B, 4096) = 6.4179	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	e9e5879b94	iq3_kt speed up quantization Same trick as last commit applied to iq2_kt. Here we get an even larger speedup: quantization time on the Ryzen-5975WX for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	c59830dafb	iq3_kt WIP: speed up quantization Nearly 60% improvement of quantization speed by having the points belonging to a cluster copied to contiguous memory during initialization, and then accessed sequentially while searching for the closest point. LLaMA-3.1-8B now gets quantized in ~150 seconds on the Ryzen-5975WX.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	8f0d075f5e	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking by 0.015 bpw by using iq4_k instead of q5_k for attn_v.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	dfcc8a9cf3	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	386d139e13	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	f1fb59b44b	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	435eb9bdd3	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	08503cec7d	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	977f94b3e0	Forgotten change	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4774788136	Adding iq3_kt 3.125 bpw. So far does not look good on the PPL vs bpw plot.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	b3dfe9984b	iq2_kt - even better Re-quantize after determining block scales (at the expense of much longer quantization time).	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	36e9c922b8	iq2_kt - this is better Using blocks of 32 and 16 bits per group of 8 weights it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	a4f1ac8da4	iq2_kt: quantize / dequantize I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B).	2024-11-21 08:16:40 +02:00
Kawrakow	6b968f3894	Bitnet changes (#106) * Adapting iq2_bn to work without separate scale tensors Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I think it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales). * Adapting iq1_bn to work without separate scale tensors * Adapting iq2_bn: CUDA dequantize * Adapting iq2_bn: CUDA works * Adapting iq1_bn: CUDA works * Adapting iq1_bn, iq2_bn: NEON * Adapting iq1_bn, iq2_bn: Metal Dequantize works, but there is still something wrong with the dot products. * WIP Absolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels. * Remove iq1_tn and iq2_tn - Part 1 Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn. * Remove iq1_tn and iq2_tn - Part 2 * Bitnet: use the standard llm_build_kv to build self attention My main motivation was to enable FA. But FA does not work anyway because head size is 100 for the Bitnet ternary models (and I had forgotten this little detail). * Revert "Avoid rebuild of GGML graph for each token (#98)" This reverts commit `f2d315b46f`. As far as I can tell, the commit breaks Metal TG. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-25 13:08:43 +02:00
Kawrakow	76b97c8064	Adding IQ4_KSS: 4.0 bpw quants (#89) * iq4_kss: WIP * iq4_kss: CUDA dequantize works So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot. * iq4_kss: slightly better quantization * iq4_kss: another small quantization improvement * iq4_kss: CUDA works TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss. * iq4_kss: new bit arrangement - CUDA and Zen4 work Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads. * iq4_kss: ARM_NEON. Predictably very slow * iq4_kss: Metal PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad. * iq4_kss: somewhat faster Metal dot product 45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0 * iq4_kss: AVX2 Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads. * iq4_kss: very slightly faster Metal dot product 48.7 t/s -> 49.3 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-16 15:18:26 +03:00
Iwan Kawrakow	ff23008ed4	Minor iq3_k tweak	2024-10-14 18:13:11 +03:00
Kawrakow	910a134094	IQ2_KS: 2.1875 bpw non-linear quantization (#85) * Experimenting * iq2k: Try make_qx_quants for the scale Slightly better for LLaMA-3.1, Gemma-2, slightly worse for Qwen2.5 * iq2k with make_qx_quants: adjust scale * iq2ks: basics * iq2_ks: CUDA works * iq2_ks: WIP * iq2_ks: WIP * iq2_ks: Zen4 * iq2_ks: AVX2 * iq2_ks: scalar dot product * iq2_ks: ARM_NEON * iq2_ks: Metal * iq2_ks: faster Metal LLaMA-3.1-8B: PP-512 = 475.22 ± 0.37 t/s TG-128 = 45.32 ± 0.03 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-13 13:34:30 +03:00
Kawrakow	b30c9e10d8	New SOTA quantization: 4.25 bpw IQ4_KS (#83) * iq4_k_xxs: basics * WIP + adding iq3_kl quantization mix * iq4_xxs: this looks very viable compared to iq4_xs At the same 4.25 bpw PPL is always better, for some models significantly better. I'll rename to iq4_ks and keep it. * iq4_xxs: CUDA dot product We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0. * iq4_xxs: scalar CPU dot product Also fix the breakage I caused with the dedicated work buffer quantization portion when the multiplication is not done via iqk_mul_mat. * iq4_xxs: Zen4 I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2). Again the same mistake of packing int32_t back to int16_t, which overflows occasionally (just occasionally, that's why the result doesn't look completely wrong, so I didn't notice). * Fix iq4_xs (Zen4) * iq4_xxs: AVX2 * iq4_xxs: ARM_NEON * iq4_xxs: Metal * iq4_xxs: slightly faster TG on Metal * iq4_xxs: rename to iq4_ks After all, it is a smaller variant of iq4_k. * iq3_kl: use iq4_ks instead of iq4_k/iq4_xs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-09 12:54:40 +03:00
Kawrakow	fe36930c8b	Move scale fudge factors to quantization (#81) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-04 16:16:01 +03:00
Kawrakow	6dec4af4b6	Adding ability to have meta data per tensor row (#61) * POC: per row scale This is a POC how to work around opinionated ggml to have scales per row rather than per block. Only implemented for Zen4 and only for iq2_tn. * POC per row scale: iq2_tn on NEON * POC per row scale: iq2_tn on Metal * Per row scale Metal templates * iq1_tn: shrink to 1.625 bpw (NEON and Metal) * POC per row scale: CUDA * POC per row scale: add CUDA TODOs There are two places in ggml-cuda.cu left where it is assumed that type_size * n_per_row / block_size is the way to compute and handle row sizes. This does not affect simple usage, but will lead to issues when tensors are split between GPUs. * Per row scales - CUDA The only place left where there are unnecessary assumptions being made is in the Flash Attention code. As we are not using any quants that use per row scales for quantized KV cache, it should be OK for now. * Update IQ1_TN and IQ2_TN bpw shown to user --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-27 08:16:06 +03:00
Kawrakow	12bbdb8ce7	Fix compiler warnings (#58) * Fix C++ compilation warnings caused by ggml-common.h * Disable c99-extensions warning I get tons of those on macOS due to the arm_neon.h header. * Disable c99-extensions warning only for APPLE * Fix warnings in iqk_quantize.cpp Also add GGML_ABORT when implementation is missing. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-17 14:31:29 +03:00
Kawrakow	8c86231f93	Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44) * Adding iq1_tn - 1.6875 bpw for TriLM ternary models * iq1_tn: NEON * iq1_tn: faster NEON * iq2_bn: improve performance on NEON We now get TG-128 = 100 t/s for Bitnet-3B-1.58b! * iq1_tn: improve AVX2 PP-512 goes to 533 t/s up from 455. TG-128 @ 2 threads goes to 16.6 t/s up from 14.2. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads. * iq1_tn: improve Zen4 PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380. TG-128 @ 1 thread goes to 12.4 t/s up from 10.4. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads. * iq2_bn: improve on Zen4 We now get PP-512 = 614 t/s up from 542 t/s * iq2_bn: improve AVX2 implementation We now get PP-512 = 753 t/s up from 680 t/s. * Remove unnecessary barrier in ggml_compute_forward_mul_mat --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-09 14:56:34 +03:00
Kawrakow	dbb1db9899	Fix build when iqk_mul_mat is disabled (#31) Ref #29 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-31 09:11:42 +03:00
Kawrakow	a73702d93b	AVX2 quantization for Q8_K (#22) It has been there for a while, but forgot to add here. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-19 15:33:27 +03:00
Iwan Kawrakow	f5d1af61d7	Fix Makefile I always use cmake, so had forgotten to pay attention to the Makefile.	2024-08-09 16:31:04 +02:00
Iwan Kawrakow	f0d7a0d53b	Fix Zen4 implementation of iq3_k, iq4_k, iq5_k See comments in `f3a823ce72`	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	c3f5e4d9a7	iq6_k: CUDA dequantize We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K (0.14% vs 0.26% quantization error).	2024-08-09 16:00:31 +02:00
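
Several commit messages above quote a bits-per-weight (bpw) figure together with its bit layout, for example iq4_kt at "15 bits per group of 4, plus 8 bit scales for blocks of 32". The following is a minimal sketch of that arithmetic, assuming bpw is simply the payload bits per weight plus the block-scale bits per weight. The helper name `bits_per_weight` is hypothetical, and the 4-bit block scale used for iq2_kt is an assumption inferred from the stated 0.0625 bpw gap to iq2_xxs (2.0625 bpw); neither is taken from the repository.

```python
from fractions import Fraction


def bits_per_weight(group_bits, group_size, scale_bits, block_size):
    """Hypothetical helper: payload bits per weight plus scale bits per weight."""
    return Fraction(group_bits, group_size) + Fraction(scale_bits, block_size)


# iq4_kt (commit 1be0a9e0d7): 15 bits per group of 4, 8-bit scales per block of 32.
iq4_kt = bits_per_weight(15, 4, 8, 32)

# iq2_kt (commit 36e9c922b8): 16 bits per group of 8, blocks of 32.
# ASSUMPTION: 4 scale bits per block, chosen so the total matches the
# "0.0625 bpw larger than iq2_xxs" statement in the commit message.
iq2_kt_16 = bits_per_weight(16, 8, 4, 32)
iq2_kt_15 = bits_per_weight(15, 8, 4, 32)  # the 15-bit variant mentioned there

IQ2_XXS_BPW = Fraction(33, 16)  # 2.0625 bpw

print("iq4_kt bpw:", float(iq4_kt))
print("iq2_kt (16 bits) vs iq2_xxs:", float(iq2_kt_16 - IQ2_XXS_BPW))
print("iq2_kt (15 bits) vs iq2_xxs:", float(iq2_kt_15 - IQ2_XXS_BPW))
```

Exact fractions are used so that sums such as 15/4 + 8/32 compare equal to 4 without floating-point rounding.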

60 Commits