ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-03 18:40:14 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	d89c88e8df	iq4_k: NEON implementation. For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.	2024-07-28 08:36:20 +02:00
Iwan Kawrakow	db87f766e8	iq4_k: AVX2 implementation. For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s on the Ryzen-5975X.	2024-07-27 21:10:22 +03:00
Iwan Kawrakow	be34f768db	iq4_k: AVX512 implementation. For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s, so almost the same as q4_K_S.	2024-07-27 20:13:30 +03:00
Kawrakow	154e0d75fc	Merge mainline llama.cpp (#3) * Merging mainline - WIP * Merging mainline - WIP: AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower, as is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-27 07:55:01 +02:00