🔀 #173 - More Flash Attention improvements
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-01-19 |
| Updated | 2025-01-20 |
Description
This PR further improves the Flash Attention implementation as follows:
- Slightly faster `V * softmax(K * Q)` implementation. This benefits all V-cache types. A minimal sketch of the underlying computation follows this list.
- Faster implementation when the K-cache is quantized with `Q8_0`, via run-time repacking to `Q8_0_R4` (see the sketch at the end of the description).
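For readers unfamiliar with how a Flash Attention kernel organizes the `V * softmax(K * Q)` computation, here is a minimal scalar sketch of one query row using the streaming ("online") softmax, which lets K and V be traversed block by block without ever materializing the full attention matrix. This is an illustrative standalone example, not the actual ik_llama.cpp kernel (which works on tiles of quantized blocks with SIMD); the function name `fa_row` and all dimensions are hypothetical.

```cpp
// Minimal scalar sketch of one Flash-Attention row: out = V^T * softmax(scale * K q),
// using the online softmax so the K/V caches can be visited incrementally.
#include <cmath>
#include <cstdio>
#include <vector>

void fa_row(const float *q, const float *K, const float *V,
            int n_kv, int d, float scale, float *out) {
    float m = -INFINITY;             // running max of the logits
    float s = 0.0f;                  // running sum of exp(logit - m)
    std::vector<float> acc(d, 0.0f); // running weighted sum of V rows

    for (int j = 0; j < n_kv; ++j) {
        // logit_j = scale * dot(q, K[j])
        float x = 0.0f;
        for (int t = 0; t < d; ++t) x += q[t] * K[j*d + t];
        x *= scale;

        // online softmax update: rescale the previous state when the max grows
        float m_new = x > m ? x : m;
        float c = expf(m - m_new);   // rescale factor for the old accumulator
        float p = expf(x - m_new);   // weight of the current key
        s = s*c + p;
        for (int t = 0; t < d; ++t) acc[t] = acc[t]*c + p * V[j*d + t];
        m = m_new;
    }
    for (int t = 0; t < d; ++t) out[t] = acc[t] / s;
}

int main() {
    const int d = 4, n_kv = 3;
    float q[d] = {0.1f, 0.2f, 0.3f, 0.4f};
    std::vector<float> K(n_kv*d, 0.5f), V(n_kv*d, 1.0f), out(d);
    fa_row(q, K.data(), V.data(), n_kv, d, 1.0f/sqrtf((float)d), out.data());
    for (int t = 0; t < d; ++t) printf("%g ", out[t]); // identical keys, V = 1 => all 1
    printf("\n");
}
```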
The following graph shows prompt processing speed as a function of prompt length for LLaMA-3.1-8B quantized with IQ4_XS on a Ryzen-7950X CPU. The PR results are shown with black (BF16 KV-cache) and red (Q8_0 KV-cache) triangles; circles are used for the main branch. I have reused the graph from the last post in #25 by just adding the results for this PR, so mainline llama.cpp performance is shown as well. I'm particularly pleased that the Q8_0 KV-cache is now on par with, or even slightly better than, the natively supported 16-bit float type: a Q8_0-quantized KV-cache is basically lossless while reducing the required memory by about 2X.
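For concreteness on the ~2X claim: in the standard ggml `Q8_0` layout, a block of 32 values is stored as 32 int8 quants plus one f16 scale, i.e. 34 bytes per 32 values ≈ 8.5 bits per value versus 16 bits for f16/bf16. A back-of-the-envelope sketch (the LLaMA-3.1-8B shape parameters below are the usual published ones, but treat them as assumptions):

```cpp
// Back-of-the-envelope KV-cache size for a 32k context, assuming the
// standard ggml Q8_0 block: 32 int8 quants + one f16 scale = 34 bytes / 32 values.
#include <cstdio>

int main() {
    const double n_ctx = 32768, n_layer = 32, n_kv_head = 8, head_dim = 128;
    const double vals  = 2 /*K and V*/ * n_ctx * n_layer * n_kv_head * head_dim;
    const double GiB   = 1024.0*1024*1024;
    printf("f16  KV-cache: %.2f GiB\n", vals * 2.0         / GiB); // ~4.00 GiB
    printf("Q8_0 KV-cache: %.2f GiB\n", vals * (34.0/32.0) / GiB); // ~2.12 GiB, ~1.9X smaller
}
```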
For reference, with a Q8_K_R8-quantized model we achieve 380 t/s for 512 tokens, and 150 t/s for 32k tokens.
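Regarding the second bullet: run-time repacking is, at a high level, a layout transform that interleaves blocks from four consecutive rows so one SIMD pass can accumulate four dot products at once. Below is a hypothetical sketch of such a 4-row interleave; `block_q8_0` matches the standard ggml `Q8_0` block, but `block_q8_0_x4`, the interleaving order, and `repack_4rows` are illustrative assumptions, not the actual `Q8_0_R4` format.

```cpp
#include <cstdint>
#include <vector>

struct block_q8_0 {          // standard ggml Q8_0 block: 34 bytes per 32 values
    uint16_t d;              // f16 scale, kept as raw bits here
    int8_t   qs[32];         // 32 int8 quants
};

struct block_q8_0_x4 {       // hypothetical "R4" block: 4 source blocks interleaved
    uint16_t d[4];           // one scale per source row
    int8_t   qs[4*32];       // qs[4*i + r] = quant i of row r (one possible order)
};

// Repack n_blocks columns of blocks from 4 consecutive rows into interleaved form.
// row_stride is the distance, in blocks, between consecutive rows of src.
void repack_4rows(const block_q8_0 *src, int row_stride,
                  int n_blocks, block_q8_0_x4 *dst) {
    for (int b = 0; b < n_blocks; ++b) {
        for (int r = 0; r < 4; ++r) {
            const block_q8_0 &s = src[r*row_stride + b];
            dst[b].d[r] = s.d;
            for (int i = 0; i < 32; ++i) dst[b].qs[4*i + r] = s.qs[i];
        }
    }
}

int main() {
    const int n_blocks = 8;                      // 8 blocks = 256 values per row
    std::vector<block_q8_0>    src(4*n_blocks);  // 4 rows of zero-initialized blocks
    std::vector<block_q8_0_x4> dst(n_blocks);
    repack_4rows(src.data(), n_blocks, n_blocks, dst.data());
}
```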
💬 Conversation
👤 ikawrakow commented on 2025-01-20 at 06:57:27:
Here is the performance relative to a GPU (RTX-4080) for the above graph. We observe the ratio now decreasing with increasing prompt length ⇒ the utilization of available FLOPs in the FA implementation is now better on the CPU than on the GPU.