ik_llama.cpp/544 - New integer trellis on ARM_NEON.md at ik/refactor_llama.cpp - ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-25 07:34:10 +00:00

This PR adapts the ARM_NEON trellis implementation to the new integer trellis.

Tested on an M2-Max CPU using LLaMA-3.1-8B-Instruct.

Very respectable PP performance:

| model | size | test | t/s |
| --- | ---: | ---: | ---: |
| llama 8B IQ2_KT | 2.77 GiB | pp512 | 129.19 ± 0.22 |
| llama 8B IQ3_KT | 3.58 GiB | pp512 | 127.66 ± 0.38 |
| llama 8B IQ4_KT | 4.30 GiB | pp512 | 125.23 ± 0.44 |

Still very low TG performance:

| model | size | test | t/s |
| --- | ---: | ---: | ---: |
| llama 8B IQ2_KT | 2.77 GiB | tg128 | 12.59 ± 0.15 |
| llama 8B IQ3_KT | 3.58 GiB | tg128 | 9.92 ± 0.02 |
| llama 8B IQ4_KT | 4.30 GiB | tg128 | 9.73 ± 0.05 |

Don't ask Apple Silicon to do too much work with a piece of data fetched from memory.
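
One rough way to read the remark above is to estimate how much weight data per second the TG runs actually stream. The sketch below multiplies model size by tg128 tokens/s, assuming (a simplification not stated in the PR) that every generated token reads all model weights exactly once and that KV-cache and activation traffic can be ignored. The function names are illustrative only.

```python
# Figures copied from the TG table above: (model size in GiB, tg128 tokens/s).
TG_RESULTS = {
    "IQ2_KT": (2.77, 12.59),
    "IQ3_KT": (3.58, 9.92),
    "IQ4_KT": (4.30, 9.73),
}


def effective_bandwidth_gib_s(size_gib: float, tokens_per_s: float) -> float:
    """Weight traffic in GiB/s, assuming one full pass over the weights per token."""
    return size_gib * tokens_per_s


def effective_bandwidths(results: dict) -> dict:
    """Apply the estimate to every (size, tokens/s) entry."""
    return {
        name: effective_bandwidth_gib_s(size, tps)
        for name, (size, tps) in results.items()
    }


if __name__ == "__main__":
    for name, bw in effective_bandwidths(TG_RESULTS).items():
        print(f"{name}: ~{bw:.1f} GiB/s of weights read")
```

If these estimates are well below what the machine's memory system can sustain, that would suggest TG is limited by the decoding work done per fetched weight rather than by bandwidth, which is consistent with the observation above.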

Nevertheless, compared to PR #471 we observe ~13% speedup for IQ2_KT, ~30% speedup for IQ3_KT, and nearly 70% speedup for IQ4_KT.