ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 09:09:50 +00:00

🔀 #123 - IQ4_XS_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ Closed |
| **Created** | 2024-12-04 |
| **Updated** | 2024-12-04 |

Description

Follow-up to #118, #119, #120, #121, and #122, this time for IQ4_XS.

I was curious to see whether the interleaved rows strategy can be made to work for i- and k-quants, with their super-blocks, blocks, and two levels of scales. IQ4_XS seemed easiest, so I tackled that one first. We get a massive speedup on ARM_NEON (about 1.7x) and a more modest but still significant gain on AVX2/Zen4 (about 1.2x to 1.3x). I'm not 100% happy with the Zen4 implementation, but shuffling scale bits for 4 rows at once is tricky, so for now I have settled on a sub-optimal solution.
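
As a rough illustration of what "interleaved rows" means in general, the sketch below regroups the blocks of every 4 consecutive rows so that block `k` of all 4 rows sits together. This is a simplified model only: the function name `interleave_rows` and the list-of-blocks representation are hypothetical, and it does not reproduce the actual IQ4_XS_R4 memory layout or its scale packing.

```python
def interleave_rows(rows, group=4):
    """Regroup blocks so that block k of `group` consecutive rows is contiguous.

    `rows` is a list of rows, each row being a list of opaque blocks.
    Illustrative only: not the real IQ4_XS_R4 layout.
    """
    if len(rows) % group != 0:
        raise ValueError("number of rows must be a multiple of the group size")

    packed_groups = []
    for start in range(0, len(rows), group):
        chunk = rows[start:start + group]
        n_blocks = len(chunk[0])
        if any(len(row) != n_blocks for row in chunk):
            raise ValueError("all rows in a group must have the same number of blocks")

        packed = []
        for b in range(n_blocks):
            # Block b of row 0, then block b of row 1, and so on.
            for row in chunk:
                packed.append(row[b])
        packed_groups.append(packed)
    return packed_groups


if __name__ == "__main__":
    rows = [[f"r{r}b{b}" for b in range(2)] for r in range(4)]
    print(interleave_rows(rows))
```

With this ordering, a kernel that walks one packed group touches the same block index of 4 rows back to back, which is the property that lets a single pass over the activations feed 4 output rows.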

Anyway, here is PP-512 performance for LLaMA-3.1-8B on Zen4 (Ryzen-7950X), ARM_NEON (M2-Max), and AVX2 (Ryzen-5975WX):

| Platform | Threads | IQ4_XS | IQ4_XS_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 68.23 ± 1.06 | 115.43 ± 0.57 | 1.692 |
| Zen4 | 16 | 183.43 ± 0.60 | 223.98 ± 0.12 | 1.221 |
| AVX2 | 32 | 195.20 ± 0.40 | 248.25 ± 0.43 | 1.272 |
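
The Speedup column appears to be the ratio of the IQ4_XS_R4 mean to the IQ4_XS mean. A minimal check, assuming that definition and ignoring the ± uncertainties (the helper name `speedup` is just for illustration):

```python
# PP-512 means copied from the table: (platform, IQ4_XS, IQ4_XS_R4)
results = [
    ("ARM_NEON", 68.23, 115.43),
    ("Zen4", 183.43, 223.98),
    ("AVX2", 195.20, 248.25),
]


def speedup(baseline, interleaved):
    """Ratio of the interleaved-rows result to the baseline result."""
    return interleaved / baseline


if __name__ == "__main__":
    for platform, base, r4 in results:
        print(f"{platform}: {speedup(base, r4):.3f}")
```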
