ik_llama.cpp/github-data/pull_requests/55 - Improve Q5_0 performance on AVX2.md at dc23be32a2aa63e24061663d94daca9b667ed920 - ik_llama.cpp

ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-25 17:09:22 +00:00

🔀 #55 - Improve Q5_0 performance on AVX2

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-09-14 |
| **Updated** | 2024-09-14 |

Description

The main purpose of the previous PR was to try to improve K*Q matrix multiplications for flash attention (FA) with a Q8_0 quantized k-cache. Sadly, the performance improvement that we got for Q8_0 did not translate into better FA performance. It is a rainy Saturday, so I need something to brighten my day. The changes from the last PR are very easily applied to Q5_0, so here we are.

The table shows a performance comparison with mainline llama.cpp for LLaMA-3.1-8B on a Ryzen-7950X.

| model | backend | threads | test | t/s (llama.cpp) | t/s (PR) | Speedup |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: |
| llama 8B Q5_0 | CPU | 16 | pp512 | 55.72 ± 0.25 | 152.10 ± 0.74 | 2.793 |
| llama 8B Q5_0 | CPU | 2 | tg128 | 5.22 ± 0.01 | 8.88 ± 0.01 | 1.701 |
| llama 8B Q5_0 | CPU | 4 | tg128 | 9.24 ± 0.01 | 11.57 ± 0.00 | 1.252 |
