Files
ik_llama.cpp/github-data/pull_requests/553 - Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON.md
2025-07-23 13:31:53 +02:00

852 B

🔀 #553 - Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON

Author ikawrakow
State Closed
Created 2025-06-24
Updated 2025-06-24

Description

This PR corresponds to PRs #531, #533, #534, #546, #549, #550, #552, and applies the on-the-fly repacking technique to the 1-bit quants IQ1_S and IQ1_M on ARM_NEON.

Here is a PP-512 performance comparison between the main branch and this PR for LlaMA-3.1-8B-Instruct on M2-Max

type t/s (main) t/s (PR) Speedup
IQ1_S 66.3 168.8 2.546
IQ1_M 19.0 163.9 8.626

IQ1_M did not have a faster IQK implementation, so the 19 t/s is what one has within the standard ggml GEMM framework.