ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 09:09:50 +00:00

🔀 #553 - Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON

| Author | `ikawrakow` |
| :--- | :--- |
| State | ❌ Closed |
| Created | 2025-06-24 |
| Updated | 2025-06-24 |

Description

This PR corresponds to PRs #531, #533, #534, #546, #549, #550, and #552, and applies the on-the-fly repacking technique to the 1-bit quants IQ1_S and IQ1_M on ARM_NEON.
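The description does not spell out what the repacking does, so the following is only a generic illustration of row interleaving, not the code from this PR. It assumes that "repacking" means rearranging several weight rows so that a kernel can reuse each loaded chunk of activations across all of them. The names `repack_rows`, `matvec_repacked`, `matvec_naive`, `kRows`, and `kChunk` are hypothetical, and plain `int8_t` values stand in for the real quantized blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parameters, chosen for illustration only.
constexpr std::size_t kRows  = 4;  // number of rows interleaved together
constexpr std::size_t kChunk = 8;  // values taken from each row at a time

// Interleave groups of kRows rows of a row-major int8 matrix.
// Layout per group: [row0 chunk0][row1 chunk0]...[row3 chunk0][row0 chunk1]...
std::vector<int8_t> repack_rows(const std::vector<int8_t>& w,
                                std::size_t n_rows, std::size_t n_cols) {
    assert(w.size() == n_rows * n_cols);
    assert(n_rows % kRows == 0 && n_cols % kChunk == 0);
    std::vector<int8_t> out(w.size());
    std::size_t o = 0;
    for (std::size_t r0 = 0; r0 < n_rows; r0 += kRows)
        for (std::size_t c0 = 0; c0 < n_cols; c0 += kChunk)
            for (std::size_t r = 0; r < kRows; ++r)
                for (std::size_t c = 0; c < kChunk; ++c)
                    out[o++] = w[(r0 + r) * n_cols + c0 + c];
    return out;
}

// y = W * x on the repacked layout. The packed buffer is read strictly
// sequentially, and each chunk of x serves kRows rows before moving on.
std::vector<int32_t> matvec_repacked(const std::vector<int8_t>& packed,
                                     const std::vector<int8_t>& x,
                                     std::size_t n_rows, std::size_t n_cols) {
    assert(x.size() == n_cols && packed.size() == n_rows * n_cols);
    std::vector<int32_t> y(n_rows, 0);
    std::size_t o = 0;
    for (std::size_t r0 = 0; r0 < n_rows; r0 += kRows)
        for (std::size_t c0 = 0; c0 < n_cols; c0 += kChunk)
            for (std::size_t r = 0; r < kRows; ++r)
                for (std::size_t c = 0; c < kChunk; ++c)
                    y[r0 + r] += int32_t(packed[o++]) * int32_t(x[c0 + c]);
    return y;
}

// Reference result on the original row-major layout.
std::vector<int32_t> matvec_naive(const std::vector<int8_t>& w,
                                  const std::vector<int8_t>& x,
                                  std::size_t n_rows, std::size_t n_cols) {
    assert(w.size() == n_rows * n_cols && x.size() == n_cols);
    std::vector<int32_t> y(n_rows, 0);
    for (std::size_t r = 0; r < n_rows; ++r)
        for (std::size_t c = 0; c < n_cols; ++c)
            y[r] += int32_t(w[r * n_cols + c]) * int32_t(x[c]);
    return y;
}
```

This scalar version only demonstrates the layout and that both paths produce the same result. Any actual speed gain would come from a SIMD kernel written against the interleaved layout, which is outside the scope of this sketch.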

Here is a PP-512 performance comparison between the main branch and this PR for Llama-3.1-8B-Instruct on an M2-Max:

| type | t/s (main) | t/s (PR) | Speedup |
| :--- | ---: | ---: | ---: |
| IQ1_S | 66.3 | 168.8 | 2.546 |
| IQ1_M | 19.0 | 163.9 | 8.626 |
IQ1_M did not have a faster IQK implementation, so the 19 t/s is what one has within the standard ggml GEMM framework.
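The Speedup column in the table is the PR throughput divided by the main-branch throughput. A minimal sketch of that arithmetic follows; the helper names `speedup` and `round3` are hypothetical and serve only to reproduce the numbers reported above.

```cpp
#include <cassert>
#include <cmath>

// Speedup = tokens per second with the PR / tokens per second on main.
inline double speedup(double tps_main, double tps_pr) {
    assert(tps_main > 0.0);
    return tps_pr / tps_main;
}

// Round to three decimals, the precision used in the Speedup column.
inline double round3(double x) {
    return std::round(x * 1000.0) / 1000.0;
}
```

With the table values, `round3(speedup(66.3, 168.8))` gives 2.546 for IQ1_S and `round3(speedup(19.0, 163.9))` gives 8.626 for IQ1_M.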
