Commit Graph

3399 Commits

Iwan Kawrakow
ff471dfd61 quantize_stats: print rmse and max error as fraction of <x>
This allows for a better comparison between different models
or different tensors of the same model where the magnitude of
the model weights may differ.
2024-08-19 13:47:19 +03:00
Kawrakow
c7b47fc67f iq2_k: slightly better bpw - accuracy compromise (#20)
For LLaMA-3.1 models:
* It is better to quantize all of attn_v with iq3_k instead of
  half of attn_v with iq4_k
* Quantizing attn_output with iq3_k results in a larger PPL decrease
  compared to what one expects from the added bpw.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-19 13:36:51 +03:00
Kawrakow
6c5384f20e Skip barriers of noops (#19)
GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE,
along with GGML_OP_NONE, are all noops. I.e., nothing happens.
But ggml still has a barrier after them, which wastes time.
The waste is not too bad for large models where computations are
long compared to the time taken for thread synchronization.
But for small models skipping those unnecessary waits makes
a significant difference. E.g., for the 99M TriLM model,
TG-500 goes up to 1426 t/s from 1240 t/s.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-14 10:40:09 +02:00
Kawrakow
bb5ff6fade Update README.md 2024-08-12 15:16:00 +02:00
Kawrakow
8f43e55103 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Iwan Kawrakow
f5d1af61d7 Fix Makefile
I always use cmake, so I had forgotten to pay attention to the
Makefile.
2024-08-09 16:31:04 +02:00
Iwan Kawrakow
f0d7a0d53b Fix Zen4 implementation of iq3_k, iq4_k, iq5_k
See comments in f3a823ce72
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
c77dba5273 iq6_k: AVX2 2024-08-09 16:00:31 +02:00
Iwan Kawrakow
a829cb7794 iq6_k: Metal
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128.
Someone has screwed up Q6_K TG performance on Metal? With the
continuous "improvements" in ggml I wouldn't be surprised.
Need to look into it later.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
48c4389e3d iq6_k: NEON
Respectable performance, only slightly slower than Q6_K.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
595d2ae32d iq6_k: slightly better Zen4 iqk_mul_mat
We now arrive at pp-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.

My last commit message is wrong: also iq3_k needs a fix
for overflow.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
849476acc7 iq6_k: Zen4 iqk_mul_mat
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.

And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot pack
the _mm512_dpbusds_epi32 result into 16 bits for the 4-, 5-, and 6-bit
iqX_k quants.

For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
050bdfa101 iq6_k: CUDA dot product
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
c3f5e4d9a7 iq6_k: CUDA dequantize
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
2024-08-09 16:00:31 +02:00
Iwan Kawrakow
a9b3f4a54b iq6_k: WIP (quantize/dequantize) 2024-08-09 16:00:31 +02:00
Iwan Kawrakow
cfb0410067 iq6_k: WIP (nothing works) 2024-08-09 16:00:31 +02:00
Kawrakow
a9f302ebe2 Adding IQ2_TN for use with ternary models (#13)
* iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for the TriLM-3.9B without any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn and the 3.3B Bitnet.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn.
Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-07 07:56:09 +02:00
Iwan Kawrakow
b409c15363 q2_K: allow it to detect ternary nets and quantize accordingly 2024-08-05 11:39:10 +02:00
Kawrakow
c11c7c8cae Update README.md
There have been a few minor improvements here and there, so updated the AVX2 Bitnet performance values to current main branch.
2024-08-05 07:35:30 +02:00
Iwan Kawrakow
6901b3bf14 iq3_k, iq5_k: faster quantization
Just use the same trick as iq4_k
2024-08-05 07:18:18 +02:00
Iwan Kawrakow
e830f4a5f7 iq4_k: speedup quantization by a factor of ~2 2024-08-03 18:38:39 +02:00
Iwan Kawrakow
3d1446b937 Add copyright notice 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
b572dd5347 iq2/3_k: tiny bit faster Metal dot products 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
394ed3913c iq3_k: slightly faster Metal dequantize kernel
PP-512 goes to 473 t/s up from 452 t/s.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
062313dab4 iq3_k: Metal dot product
Quite slow: 43 t/s for a 7B model
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
57df5ccdd7 iq2_k: Metal dot product finally works
It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs,
or 63.3 t/s for q2_K_S.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
30d2d1b1eb iq3_k: Metal dequantize 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
4c2c644dcc iq3_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
93d09d1935 iq3_k: AVX2 iqk_mul_mat
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
9d0cf7a399 iq3_k: AVX512 iqk_mul_mat
We get PP-512 = 180 t/s, TG-128(4 threads) = 16.35 on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
fd1ae85a32 iq3_k: faster CUDA dot product
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
0d19d19af8 iq3_k: CUDA dot product
Slightly slower than iq3_s - 132 t/s vs 138 t/s for
LLaMA-3.1-8B.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
4f237d44f6 iq3_k: Basics
Quantize/dequantize, CUDA dequantize.
PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
36204c4ec7 iq2_k: very slightly better CUDA dot product
169.2 t/s vs 167.8 t/s before.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
e950b17125 iq2_k: better CUDA dot product
Almost on par with iq2_xs (168 t/s vs 172 t/s).
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
ab4f9e1fdb iq2_k: CUDA dot product finally works
Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs
172 t/s for iq2_xs.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
69842c6ad8 iq5_k: CUDA dot product finally works 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
f6813cac0e Factor out iqk CUDA dot products
I cannot possibly wait for a 5-minute nvcc compilation
each time I touch vecdotq.cuh.

Also, cmake was adding --options-file X.rsp to the nvcc
compile commands, which confuses clangd, so I have turned
that off.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
22d1568c1c iq5_k: CUDA dot product still not working 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
d8d022a01b iq5_k: Metal
Performance is roughly on par with q5_0.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
bd36ade98d iq5_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c0d0607f19 iq5_k: AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c56ddee38c iq5_k: AVX2 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
5d341757bc iq5_k: Basics
Quantize/dequantize, CUDA dequantize
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
06e255ac9d iq2_k: Metal. Dot product is wrong 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
f476ea3b50 iq2_k: NEON 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c0fe03b5c8 iq2_k: slightly faster AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
7d08719975 iq2_k: simplify AVX512 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
13091d39e8 iq2_k: AVX2 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c85e139c68 iq2_k: Basics
Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
2024-08-01 09:38:06 +02:00