
🔀 #118 - IQ4_NL_X4

Author ikawrakow
State Closed
Created 2024-12-02
Updated 2024-12-02

Description

Mainline llama.cpp has added several types where Q4_0 or IQ4_NL blocks are repacked by interleaving the quants of 4 or 8 consecutive rows. This gives a significant improvement in prompt processing speed on ARM, so I decided to see whether interleaved rows can also improve the iqk_mul_mat matrix-matrix multiplication speed.
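To illustrate the idea, here is a minimal sketch of repacking 4 consecutive rows. The names `block_iq4_nl_x4` and `repack_4rows` follow the PR's naming convention, but the field layout and interleaving granularity are simplified for illustration and do not reproduce the actual implementation:

```cpp
#include <cstdint>
#include <cstring>

constexpr int QK4_NL = 32;           // weights per IQ4_NL block

struct block_iq4_nl {                // one block of a single row
    uint16_t d;                      // fp16 block scale (stored as raw bits here)
    uint8_t  qs[QK4_NL / 2];         // 32 x 4-bit quant indices
};

struct block_iq4_nl_x4 {             // one block spanning 4 rows
    uint16_t d[4];                   // scales of the 4 source blocks
    uint8_t  qs[4 * QK4_NL / 2];     // quants of the 4 source blocks
};

// Block j of the output gathers block j from each of the 4 rows, so a GEMM
// kernel can fetch 4 rows' worth of quants with contiguous loads.
void repack_4rows(const block_iq4_nl* rows[4], int nblocks, block_iq4_nl_x4* out) {
    for (int j = 0; j < nblocks; ++j) {
        for (int r = 0; r < 4; ++r) {
            out[j].d[r] = rows[r][j].d;
            // Simplification: quants stored row-after-row within the block.
            // The real kernel interleaves at a finer granularity to match
            // the SIMD load width of each platform.
            memcpy(out[j].qs + r * (QK4_NL / 2), rows[r][j].qs, QK4_NL / 2);
        }
    }
}
```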

This PR adds IQ4_NL_X4, a repacked variant of IQ4_NL. The table below shows a PP-512 comparison between IQ4_NL and IQ4_NL_X4 for LLaMA-3.1-8B-Instruct on ARM (M2-Max), Zen4 (Ryzen-7950X) and AVX2 (Ryzen-5975WX). Somewhat surprisingly, the speedup on Zen4 is larger than on the M2-Max. On Zen4, IQ4_NL_X4 is now the fastest quantization type for prompt processing, beating even bf16 (237 t/s on the Ryzen-7950X, which has native bf16 support).

| Platform | Threads | IQ4_NL (t/s) | IQ4_NL_X4 (t/s) | Speedup |
|----------|--------:|-------------:|----------------:|--------:|
| ARM_NEON | 8 | 85.11 ± 0.47 | 110.32 ± 0.53 | 1.296 |
| Zen4 | 16 | 168.21 ± 0.60 | 262.69 ± 0.65 | 1.562 |
| AVX2 | 32 | 186.81 ± 0.17 | 231.45 ± 0.61 | 1.240 |

For reference: on my M2-Max, mainline llama.cpp (build 3420909d) achieves 92.3 t/s for IQ4_NL_4_4.
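PP-512 numbers of this kind are typically produced with the `llama-bench` tool; a hypothetical invocation (the model filename is a placeholder, and `-p 512 -n 0` restricts the run to prompt processing):

```bash
./llama-bench -m Llama-3.1-8B-Instruct-IQ4_NL_X4.gguf -p 512 -n 0 -t 16
```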