
🔀 #552 - Much faster prompt processing for k-quants (ARM_NEON)

Author	`ikawrakow`
State	❌ Closed
Created	2025-06-24
Updated	2025-06-24

Description

It is time to give some attention to the ARM_NEON back-end, which has fallen behind quite a bit.

This PR corresponds to PRs #531, #533, #534, #546, #549, #550, and applies the on-the-fly repacking technique to k-quants (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K) and to IQ4_XS for the ARM_NEON implementation.
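
As a rough illustration of why on-the-fly repacking helps prompt processing, the sketch below unpacks each quantized weight row to 8-bit integers once and then reuses the unpacked values for every token in the batch. It is a minimal sketch under stated assumptions, not the PR's implementation: the 4-bit block format, the `BLOCK` size, and the function names `quantize_row`, `repack_row_to_int8`, and `matmul_repacked` are all hypothetical, and the real k-quant layouts and NEON kernels are considerably more involved.

```python
# Hypothetical, simplified 4-bit block format. This is NOT the real Q4_K layout.
BLOCK = 32  # values per block (an assumption for illustration)


def quantize_row(row):
    """Quantize floats to packed 4-bit blocks with one scale per block.

    Assumes len(row) is a multiple of BLOCK.
    """
    blocks = []
    for i in range(0, len(row), BLOCK):
        chunk = row[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 7.0 if amax > 0 else 1.0
        # Map each value to an unsigned nibble in 0..15.
        q = [max(-8, min(7, round(v / scale))) + 8 for v in chunk]
        packed = bytes(q[j] | (q[j + 1] << 4) for j in range(0, BLOCK, 2))
        blocks.append((scale, packed))
    return blocks


def repack_row_to_int8(blocks):
    """Unpack the nibbles once into signed 8-bit values, keeping block scales."""
    out = []
    for scale, packed in blocks:
        vals = []
        for b in packed:
            vals.append((b & 0x0F) - 8)
            vals.append((b >> 4) - 8)
        out.append((scale, vals))
    return out


def matmul_repacked(weight_rows, activations):
    """Compute W @ X, where X is a K x N list of lists (N = tokens in the batch).

    Each weight row is repacked once; the unpacked values are then reused
    for all N columns, so the unpacking cost is amortized over the batch.
    """
    n_cols = len(activations[0])
    result = []
    for blocks in weight_rows:
        repacked = repack_row_to_int8(blocks)  # paid once per weight row
        out_row = [0.0] * n_cols
        for bi, (scale, vals) in enumerate(repacked):
            base = bi * BLOCK
            for c in range(n_cols):  # reused for every token
                acc = 0.0
                for k, q in enumerate(vals):
                    acc += q * activations[base + k][c]
                out_row[c] += scale * acc
        result.append(out_row)
    return result
```

With a batch of 512 tokens, as in the PP-512 benchmark, the unpacking work per weight row is shared by 512 dot products, which is why this kind of approach targets prompt processing rather than single-token generation.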

Here is a PP-512 performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on an M2-Max:

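| type | t/s (main) | t/s (PR) | Speedup |
| ---: | ---: | ---: | ---: |
| Q2_K | 85.74 | 168.07 | 1.960 |
| Q3_K | 45.68 | 170.83 | 3.740 |
| Q4_K | 58.24 | 114.78 | 1.971 |
| Q5_K | 54.88 | 114.92 | 2.094 |
| Q6_K | 47.67 | 123.98 | 2.601 |
| IQ4_XS | 71.19 | 167.84 | 2.358 |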
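
The Speedup column is the ratio of the PR throughput to the main-branch throughput. The short sketch below recomputes it from the two t/s columns; the dictionary name `pp512` and the helper `speedups` are hypothetical names chosen for this example, and the numbers are copied from the table.

```python
# PP-512 throughput in tokens/second, copied from the table: (main, PR).
pp512 = {
    "Q2_K": (85.74, 168.07),
    "Q3_K": (45.68, 170.83),
    "Q4_K": (58.24, 114.78),
    "Q5_K": (54.88, 114.92),
    "Q6_K": (47.67, 123.98),
    "IQ4_XS": (71.19, 167.84),
}


def speedups(table):
    """Return {type: PR t/s divided by main t/s}."""
    return {name: pr / main for name, (main, pr) in table.items()}


if __name__ == "__main__":
    for name, ratio in speedups(pp512).items():
        print(f"{name:7s} {ratio:.3f}")
```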

Q2_K, Q3_K and IQ4_XS join the top-tier group in terms of prompt processing speed.

Q4_K and Q5_K get repacked to Q8_1, and this ends up being slower than Q4_K_R4/Q5_K_R4, so it may have been better to simply repack to the corresponding row-interleaved variant. This is left for a future PR.
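
To make the term "row-interleaved" concrete, the sketch below stores the blocks of 4 consecutive rows next to each other, so that a kernel can load the same block position of 4 rows in one contiguous read. This is only an illustration of the general idea: the function name `interleave_rows` and the list-of-blocks representation are assumptions, and the actual `Q4_K_R4`/`Q5_K_R4` memory layouts are not reproduced here.

```python
def interleave_rows(rows, group=4):
    """Interleave the blocks of `group` consecutive rows.

    `rows` is a list of rows, each a list of opaque blocks of equal count.
    For every group of rows, the output holds block 0 of each row in the
    group, then block 1 of each row, and so on.
    Assumes len(rows) is a multiple of `group`.
    """
    if len(rows) % group != 0:
        raise ValueError("number of rows must be a multiple of the group size")
    out = []
    for r in range(0, len(rows), group):
        chunk = rows[r:r + group]
        n_blocks = len(chunk[0])
        for b in range(n_blocks):
            for row in chunk:
                out.append(row[b])
    return out


if __name__ == "__main__":
    rows = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"], ["d0", "d1"]]
    print(interleave_rows(rows))
```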
