ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-20 14:39:45 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	f1fb59b44b	iq3_kt WIP: slowly improving PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is starting to be competitive/slightly better than other quants.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	435eb9bdd3	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	08503cec7d	WIP	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	977f94b3e0	Forgotten change	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	4774788136	Adding iq3_kt 3.125 bpw. So far does not look good on the PPL vs bpw plot.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	b3dfe9984b	iq2_kt - even better Re-quantize after determining block scales (at the epxense of much longer quantization time).	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	36e9c922b8	iq2_kt - this is better Using blocks of 32 and 16 bits per group of 8 weights it beats iq2_xxs in terms of PPL by a significant margin. It is 0.0625 bpw larger, but even if we go to 15 bits per group od 8 (so 0.0625 bpw less than iq2_xxs), PPL is still lower.	2024-11-21 08:16:41 +02:00
Iwan Kawrakow	a4f1ac8da4	iq2_kt: quantize / dequantize I now see that I was comparing apples to oranges: iq2_xxs was using a weight of sigma^2/4 + x^2, while the Trellis approach wasn't (weight = 1). Once I use the same weight, iq2_kt is actually slightly worse than iq2_xxs in terms of rmse, so does not look promising at this point. Also, once each group of 8 Trellis values no longer has a constant sum(q^2) that we can precompute, quantization becomes significantly slower (476 seconds for LLaMA-3.1-8B).	2024-11-21 08:16:40 +02:00
Kawrakow	6b968f3894	Bitnet changes (#106 ) * Adapting iq2_bn to work without separate scale tensors Why? It is becoming burdensome to maintain the special Bitnet conversion in convert_hf_to_gguf.py, so I thnk it is better to make iq1_bn and iq2_bn just work with the mainline conversion script (which does not generate scales). * Adapting iq1_bn to work without separate scale tensors * Adapting iq2_bn: CUDA dequantize * Adapting iq2_bn: CUDA works * Adapting iq1_bn: CUDA works * Adapting iq1_bn, iq2_bn: NEON * Adapting iq1_bn, iq2_bn: Metal Dequantize works, but there is still something wrong with the dot products. * WIP Absoolutely don't see what is wrong with the iq1_bn and iq2_bn vector dot product kernels. * Remove iq1_tn and iq2_tn - Part 1 Now that iq1_bn and iq2_bn have per row scales, there is no reason to also have iq1_tn and iq2_tn. * Remove iq1_tn and iq2_tn - Part 2 * Bitnet: use the standard llm_build_kv to build self attention My main motivation was to enable FA. But FA does not work anyway because head size is 100 for the Botnet ternary models (and I had forgotten this little detail). * Revert "Avoid rebuild of GGML graph for each token (#98)" This reverts commit `f2d315b46f`. As far as I can tell, the commit breaks Metal TG. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-25 13:08:43 +02:00
Kawrakow	76b97c8064	Adding IQ4_KSS: 4.0 bpw quants (#89 ) * iq4_kss: WIP * iq4_kss: CUDA dequantize works So we can run perplexity. Sadly, the result does not look good on the bpw vs quantization error plot. * iq4_kss: slightly better quantization * iq4_kss: another small quantization improvement * iq4_kss: CUDA works TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B. In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks. I.e., the reduced model size more than offsets the additional bit fiddling required for iq4_kss. * iq4_kss: new bit arrangement - CUDA and Zen4 work Did not lose performance on CUDA. Zen4 is decent, but not great: PP-512(LLaMA-3.1-8B) = 163 t/s. TG-128 is of course better than other 4-bit quants due to smaller model size. We get 14.5 t/s @ 8 threads. * iq4_kss: ARM_NEON. Predictably very slow * iq4_kss: Metal PP is not too bad - just 10% slower than q4_0. But TG is 30% slower, i.e., predictably bad. * iq4_kss: somewhat faster Metal dot product 45.75 t/s -> 48.75 t/s. Still 22% slower than q4_0 * iq4_kss: AVX2 Bad, but better than I expected. PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X. I.e., with 32 AVX2 threads we get the performance of 16 Zen4 threads. * iq4_kss: very slightly faster Metal dot product 48.7 t/s -> 49.3 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-16 15:18:26 +03:00
Iwan Kawrakow	ff23008ed4	Minor iq3_k tweak	2024-10-14 18:13:11 +03:00
Kawrakow	910a134094	IQ2_KS: 2.1875 bpw non-linear quantization (#85 ) * Experimenting * iq2k: Try make_qx_quants for the scale Slightly better for LLaMA-3.1, Gemma-2, slightly worse for Qwen2.5 * iq2k with make_qx_quants: adjust scale * iq2ks: basics * iq2_ks: CUDA works * iq2_ks: WIP * iq2_ks: WIP * iq2_ks: Zen4 * iq2_ks: AVX2 * iq2_ks: scalar dot product * iq2_ks: ARM_NEON * iq2_ks: Metal * iq2_ks: faster Metal LLaMA-3.1-8B: PP-512 = 475.22 ± 0.37 t/s TG-128 = 45.32 ± 0.03 t/s --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-13 13:34:30 +03:00
Kawrakow	b30c9e10d8	New SOTA quantization: 4.25 bpw IQ4_KS (#83 ) * iq4_k_xxs: basics * WIP + adding iq3_kl quantization mix * iq4_xxs: this looks very viable compared to iq4_xs At the same 4.25 bpw PPL is always better, for some models significantly better. I'll rename to iq4_ks and keep it. * iq4_xxs: CUDA dot product We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0. * iq4_xxs: scalar CPU dot product Also fix the breakage I caused with the dedicated work buffer quantization portion when the multiplication is not done via iqk_mul_mat. * iq4_xxs: Zen4 I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2). Again the same mistake of packing int32_t back to int16_t, which overflows occasionally (just occasionally, that's why the result doesn't look completely wrong, so I didn't notice). * Fix iq4_xs (Zen4) * iq4_xxs: AVX2 * iq4_xxs: ARM_NEON * iq4_xxs: Metal * iq4_xxs: slightly faster TG on Metal * iq4_xxs: rename to iq4_ks After all, tt is a smaller variant of iq4_k. * iq3_kl: use iq4_ks instead of iq4_k/iq4_xs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-09 12:54:40 +03:00
Kawrakow	fe36930c8b	Move scale fudge factors to quantization (#81 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-10-04 16:16:01 +03:00
Kawrakow	6dec4af4b6	Adding ability to have meta data per tensor row (#61 ) * POC: per row scale This is a POC how to work around opinionated ggml to have scales per row rather than per block. Only implemened for Zen4 and only for iq2_tn. * POC per row scale: iq2_tn on NEON * POC per row scale: iq2_tn on Metal * Per row scale Metal templates * iq1_tn: shrink to 1.625 bpw (NEON and Metal) * POC per row scale: CUDA * POC per row scale: add CUDA TODOs There are two places in ggml-cuda.cu left where it is assumed that type_size * n_per_row / block_size is the way to compute and handle row sizes. This does not affect simple usage, but will lead to issues when tensors are split between GPUs. * Per row scales - CUDA The only place left where there are unnecessary assumptions being made is in the Flash Attention code. As we are not using any quants that use per row scales for quantized KV cache, it should be OK for now. * Update IQ1_TN and IQ2_TN bpw shown to user --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-27 08:16:06 +03:00
Kawrakow	12bbdb8ce7	Fix compiler warnings (#58 ) * Fix C++ compilation warnings caused by ggml-common.h * Disable c99-extensions warning I get tons of those on macOS due to the arm_neon.h header. * Disable c99-extensions warning only for APPLE * Fix warnings in iqk_quantize.cpp Also add GGML_ABORT when implementation is missing. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-17 14:31:29 +03:00
Kawrakow	8c86231f93	Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44 ) * Adding iq1_tn - 1.6875 bpw for TriLM ternary models * iq1_tn: NEON * iq1_tn: faster NEON * iq2_bn: improve performance on NEON We now get TG-128 = 100 t/s for Bitnet-3B-1.58b! * iq1_tn: improve AVX2 PP-512 goes to 533 t/s up from 455. TG-128 @ 2 threads goes to 16.6 t/s up from 14.2. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads. * iq1_tn: improve Zen4 PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380. TG-128 @ 1 thread goes to 12.4 t/s up from 10.4. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads. * iq2_bn: improve on Zen4 We now get PP-512 = 614 t/s up from 542 t/s * iq2_bn: improve AVX2 implementation We now get PP-512 = 753 t/s up from 680 t/s. * Remove unnecessary barrier in ggml_compute_forward_mul_mat --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-09 14:56:34 +03:00
Kawrakow	dbb1db9899	Fix build when iqk_mul_mat is disabled (#31 ) Ref #29 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-31 09:11:42 +03:00
Kawrakow	a73702d93b	AVX2 quantization for Q8_K (#22 ) It has been there for a while, but forgot to add here. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-19 15:33:27 +03:00
Iwan Kawrakow	f5d1af61d7	Fix Makefile I always use cmake, so had forgotten to pay attention to the Makefile.	2024-08-09 16:31:04 +02:00
Iwan Kawrakow	f0d7a0d53b	Fix Zen4 implementation of iq3_k, iq4_k, iq5_k See comments in `f3a823ce72`	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	c3f5e4d9a7	iq6_k: CUDA dequantize We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K (0.14% vs 0.26% quantization error).	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	a9b3f4a54b	iq6_k: WIP (quantize/dequantize)	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	cfb0410067	iq6_k: WIP (nothing works)	2024-08-09 16:00:31 +02:00
Kawrakow	a9f302ebe2	Adding IQ2_TN for use with ternary models (#13 ) * iq2_tn: TriLM specific 2.0625 bpw quantization Quantize/dequantize/scale dot product. I get 46 t/s for the TriLM-3.9B with any SIMD! Finally a compiler doing a decent job auto-vectorizing the scalar implementation. * iq2_tn: AVX512 Just reusing the k-quants template gets us to PP-512 = 376 t/s, TG-128 = 47.6 t/s for TriLM-3.9B. * iq2_tn: AVX512 With this tweak we get to PP-512 = 431 t/s. * iq2_tn: AVX512 With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads. At 4 threads we saturate at 48.41 t/s, and then performance slowly degrades with increasing number of threads. * iq2_tn: AVX2 PP512 = 440 t/s on the Ryzen-5975WX. We should be able to do better. * iq2_tn: initial NEON version * iq2_tn: NEON For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s, TG-128 = 75.5 t/s. This is in line with what we have for iq2_bn ant 3.3B Bitnet. * iq2_tn: Metal For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s, TG-128 = 98.5 t/s. * iq2_tn: CUDA For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s, TG-128 = 299.2 t/s. * iq2_tn: AVX2 PP improvement We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX. We have PP-512 = 636.61 t/s for Bintnet-3B quantized with iq2_bn. Bintnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something that is not quite optimal in iq2_tn. * iq2_tn: small NEON improvement For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-07 07:56:09 +02:00
Iwan Kawrakow	6901b3bf14	iq3_k, iq5_k: faster quantization Just use the same trick as iq4_k	2024-08-05 07:18:18 +02:00
Iwan Kawrakow	e830f4a5f7	iq4_k: speedup quantization by a factor of ~2	2024-08-03 18:38:39 +02:00
Iwan Kawrakow	4f237d44f6	iq3_k: Basics Quantize/dequantize, CUDA dequantize. PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.	2024-08-01 09:38:06 +02:00
Iwan Kawrakow	5d341757bc	iq5_k: Basics Quantize/dequantize, CUDA dequantize	2024-08-01 09:38:06 +02:00
Iwan Kawrakow	c85e139c68	iq2_k: Basics Quantize/dequantize, CUDA deqantize, AVX512 iqk_mul_mat.	2024-08-01 09:38:06 +02:00
Kawrakow	291066e6df	IQ4_K: SOTA 4-bit quantization (#6 ) * iq4_k: basics * quantize/dequantize works * CUDA dequantize works and one can run PPL calcs. I get PPL = 6.5258 for LlaMA-3.1-8B, which is 1.77% above fp16. In comparison, q4_K_S (same size) is 2.88% above fp16. * TG on CUDA does not work. Johannes has changed the way i-quant dot products are done, so need to sort out what he had in mind * iqk_mul_mat is not implemented. * iq4_k: TG now works on CUDA * iq4_k: AVX512 implementation For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s, so almost the same as q4_K_S. * iq4_k: AVX2 implementation For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s on the Ryzen-5975X. * iq4_k: NEON implementation For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower. * iq4_k: Metal implementation For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s on a 30-core M2-Max GPU. This is to be compared with (currently) PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S. * iq4_k: scalar dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-28 12:11:59 +02:00
Kawrakow	154e0d75fc	Merge mainline llama.cpp (#3 ) * Merging mainline - WIP * Merging mainline - WIP AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower as it is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-27 07:55:01 +02:00

32 Commits