ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-02 18:10:02 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	d2b51308d3	iq5_k: Metal. Performance is roughly on par with q5_0.	2024-07-29 16:47:06 +02:00
Iwan Kawrakow	cc0493f1ff	iq5_k: NEON	2024-07-29 13:57:14 +02:00
Iwan Kawrakow	eb5bd49f10	iq5_k: AVX512	2024-07-29 14:26:14 +03:00
Iwan Kawrakow	1704a34e6c	iq5_k: AVX2	2024-07-29 13:39:50 +03:00
Iwan Kawrakow	f0836bdbbe	iq5_k: Basics Quantize/dequantize, CUDA dequantize	2024-07-29 12:38:46 +03:00
Iwan Kawrakow	1c2d026da6	iq2_k: Metal. Dot product is wrong	2024-07-29 09:49:32 +02:00
Iwan Kawrakow	89b410dfb7	iq2_k: NEON	2024-07-29 07:26:36 +02:00
Iwan Kawrakow	972f134e88	iq2_k: slightly faster AVX512	2024-07-29 06:51:44 +03:00
Iwan Kawrakow	d07c58b4b7	iq2_k: simplify AVX512	2024-07-28 21:05:56 +03:00
Iwan Kawrakow	3555a3d8ba	iq2_k: AVX2	2024-07-28 20:50:21 +03:00
Iwan Kawrakow	76449533f2	iq2_k: Basics Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.	2024-07-28 19:43:18 +03:00
Kawrakow	291066e6df	IQ4_K: SOTA 4-bit quantization (#6) * iq4_k: basics * quantize/dequantize works * CUDA dequantize works and one can run PPL calcs. I get PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16. In comparison, q4_K_S (same size) is 2.88% above fp16. * TG on CUDA does not work. Johannes has changed the way i-quant dot products are done, so need to sort out what he had in mind * iqk_mul_mat is not implemented. * iq4_k: TG now works on CUDA * iq4_k: AVX512 implementation For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s, so almost the same as q4_K_S. * iq4_k: AVX2 implementation For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s on the Ryzen-5975X. * iq4_k: NEON implementation For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower. * iq4_k: Metal implementation For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s on a 30-core M2-Max GPU. This is to be compared with (currently) PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S. * iq4_k: scalar dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-28 12:11:59 +02:00
Kawrakow	f62615b44f	Simdify and multi-thread tanh (#4) It seemed Gemma-2 performance is lower than expected for its size. Looking at the architecture, I noticed that tanh is used in each layer, and then at the end for soft-capping the final output. ggml had tanh set to be computed with a single thread. Combined with tanh(x) being a pretty expensive operation, this resulted in a significant fraction of the time being spent in the tanh operation. After multi-threading ggml_vec_soft_max_f32 and simd-ifying the tanh computation, I observe a 33% gain in prompt processing speed (!!!). TG is of course memory bound, but despite this, we still get a ~2% boost at 4 threads (which gives max TG performance on my Ryzen-7950X). Simd-ifying: We have tanh(x) = (exp(2x) - 1)/(exp(2x) + 1) so we can just use Justine Tunney's SIMD exp implementation. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-27 08:44:18 +02:00
Kawrakow	154e0d75fc	Merge mainline llama.cpp (#3) * Merging mainline - WIP * Merging mainline - WIP AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower, as is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-27 07:55:01 +02:00

14 Commits