mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-25 17:09:22 +00:00
1.3 KiB
🔀 #55 - Improve Q5_0 performance on AVX2
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-09-14 |
| Updated | 2024-09-14 |
Description
The main purpose of the previous PR was to try to improve K*Q matrix multiplications for flash attention with a Q8_0-quantized K-cache. Sadly, the performance improvement we got for Q8_0 did not translate into better FA performance. It is a rainy Saturday and I needed something to brighten my day, and the last PR is very easily applied to Q5_0, so here we are.
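For context, a `Q5_0` block stores 32 weights as a shared scale, 16 bytes of low 4-bit nibbles, and 4 bytes holding the fifth (high) bit of each quant. Below is a minimal scalar sketch of how such a block decodes back to floats; it is illustrative only (the struct name `block_q5_0_ref` is hypothetical, the scale is a plain `float` here whereas ggml stores fp16, and the actual PR performs this unpacking with AVX2 intrinsics rather than a scalar loop):

```c
#include <stdint.h>
#include <string.h>

#define QK5_0 32  /* quants per block */

/* Hypothetical reference layout mirroring ggml's block_q5_0
   (which uses an fp16 scale instead of float). */
typedef struct {
    float   d;       /* block scale */
    uint8_t qh[4];   /* 5th (high) bit of each of the 32 quants */
    uint8_t qs[16];  /* low 4 bits, two quants per byte */
} block_q5_0_ref;

/* Scalar dequantization sketch: rebuild each 5-bit quant from its
   low nibble plus the matching high bit, recenter by 16, and scale. */
static void dequantize_q5_0_ref(const block_q5_0_ref *b, float *y) {
    uint32_t qh;
    memcpy(&qh, b->qh, sizeof(qh));
    for (int j = 0; j < QK5_0/2; ++j) {
        const uint8_t xh0 = ((qh >> j) << 4) & 0x10;        /* high bit, quant j    */
        const uint8_t xh1 =  (qh >> (j + 12))      & 0x10;  /* high bit, quant j+16 */
        const int x0 = (int)(b->qs[j] & 0x0F) | xh0;
        const int x1 = (int)(b->qs[j] >>   4) | xh1;
        y[j]           = (x0 - 16) * b->d;
        y[j + QK5_0/2] = (x1 - 16) * b->d;
    }
}
```

The AVX2 win in this PR comes from doing the same nibble/high-bit reassembly for a whole block at once with vector shifts and masks instead of per-quant scalar work.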
The table shows a performance comparison with mainline llama.cpp for LLaMA-3.1-8B on a Ryzen-7950X.
| model | backend | threads | test | t/s (llama.cpp) | t/s (PR) | Speedup |
|---|---|---|---|---|---|---|
| llama 8B Q5_0 | CPU | 16 | pp512 | 55.72 ± 0.25 | 152.10 ± 0.74 | 2.793 |
| llama 8B Q5_0 | CPU | 2 | tg128 | 5.22 ± 0.01 | 8.88 ± 0.01 | 1.701 |
| llama 8B Q5_0 | CPU | 4 | tg128 | 9.24 ± 0.01 | 11.57 ± 0.00 | 1.252 |