🔀 #118 - IQ4_NL_X4
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-12-02 |
| Updated | 2024-12-02 |
Description
Mainline llama.cpp has added several types where Q4_0 or IQ4_NL is repacked by interleaving the quants of 4 or 8 consecutive rows. This gives a significant improvement in prompt processing speed on ARM, so I decided to see whether interleaved rows can further improve the iqk_mul_mat matrix-matrix multiplication speed.
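For illustration, here is a minimal sketch of what such a repacking could look like, assuming an IQ4_NL-style block of 32 4-bit quants with an fp16 scale. The names (`Block`, `BlockX4`, `repack_x4`) and the 4-byte interleaving granularity are illustrative assumptions, not the actual layout used in this PR.

```cpp
#include <cstdint>
#include <cstring>

constexpr int kBlockSize = 32;            // quants per block (QK4_NL in ggml)

struct Block {                            // IQ4_NL-style block, one row
    uint16_t d;                           // fp16 block scale (stored as bits)
    uint8_t  qs[kBlockSize/2];            // 32 x 4-bit quants, two per byte
};

struct BlockX4 {                          // repacked: same block of 4 rows
    uint16_t d[4];                        // one scale per row
    uint8_t  qs[4*kBlockSize/2];          // quants of all 4 rows, interleaved
};

// Repack the blocks of 4 consecutive rows into interleaved X4 blocks.
// Quant bytes are interleaved in 4-byte chunks so that a GEMM kernel can
// load the data of all 4 rows from contiguous memory and multiply them
// against the same activation values.
void repack_x4(const Block *rows[4], BlockX4 *dst, int blocks_per_row) {
    for (int ib = 0; ib < blocks_per_row; ++ib) {
        for (int r = 0; r < 4; ++r) dst[ib].d[r] = rows[r][ib].d;
        for (int chunk = 0; chunk < kBlockSize/2/4; ++chunk) {  // 4 chunks/row
            for (int r = 0; r < 4; ++r) {
                std::memcpy(dst[ib].qs + 16*chunk + 4*r,
                            rows[r][ib].qs + 4*chunk, 4);
            }
        }
    }
}
```

The point of such a layout is that the kernel's inner loop reads only contiguous memory while producing 4 output rows at once, which pays off in prompt processing, where many activation columns are multiplied against the same weights.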
This PR adds IQ4_NL_X4, a repacked variant of IQ4_NL. The table below shows a PP-512 comparison between IQ4_NL and IQ4_NL_X4 for LLaMA-3.1-8B-Instruct on ARM (M2-Max), Zen4 (Ryzen-7950X), and AVX2 (Ryzen-5975WX). Somewhat surprisingly, the speedup on Zen4 is larger than on M2-Max. On Zen4, IQ4_NL_X4 is now the fastest quantization type for prompt processing, beating even bf16 (237 t/s on the Ryzen-7950X, which has native bf16 support).
| Platform | Threads | IQ4_NL (t/s) | IQ4_NL_X4 (t/s) | Speedup |
|---|---|---|---|---|
| ARM_NEON | 8 | 85.11 ± 0.47 | 110.32 ± 0.53 | 1.296 |
| Zen4 | 16 | 168.21 ± 0.60 | 262.69 ± 0.65 | 1.562 |
| AVX2 | 32 | 186.81 ± 0.17 | 231.45 ± 0.61 | 1.239 |
For reference: on my M2-Max, mainline llama.cpp (build 3420909d) achieves 92.3 t/s for IQ4_NL_4_4, its corresponding 4-row interleaved type.
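To make the mechanism behind these speedups concrete, below is a simplified scalar model of the micro-kernel such a layout enables; it is not the iqk_mul_mat code. It assumes 8-bit quants interleaved element-wise for readability (the real type packs two 4-bit quants per byte and interleaves in larger chunks), but it shows the essential win: each activation value is loaded once and reused for 4 rows.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical scalar stand-in for a 4-row interleaved dot product:
// weight quants are stored as [r0 r1 r2 r3 r0 r1 r2 r3 ...], so a single
// pass over the activations updates 4 accumulators. A real kernel keeps
// the accumulators in SIMD registers and processes whole blocks per step.
void dot_4rows(const int8_t *wq,          // k*4 interleaved weight quants
               const int8_t *aq,          // k quantized activations
               const float   scale[4],    // combined per-row scales
               float         out[4],
               size_t        k) {
    int32_t acc[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < k; ++i) {
        const int8_t a = aq[i];           // one load, four uses
        for (int r = 0; r < 4; ++r) acc[r] += a * wq[4*i + r];
    }
    for (int r = 0; r < 4; ++r) out[r] = scale[r] * (float)acc[r];
}
```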