
🔀 #38 - Zen4 Flash Attention - bf16 support

Author ikawrakow
State Closed
Created 2024-09-04
Updated 2024-09-05

Description

This PR adds support for using bf16 for the kv-cache.

As Zen4 has native support for bf16 fused multiply-add, I was hoping that this might give better performance than fp16, but with this implementation it is basically the same. We get a tiny improvement for Gemma2-2b at 4k and 8k tokens, as shown in the graph below (llama.cpp has no bf16 support for the kv-cache, so there is no mainline comparison in the graph).

*(graph `fa_gemma2b`: bf16 vs. fp16 kv-cache performance for Gemma2-2b)*
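
For illustration, here is a minimal sketch (not the code from this PR) of the mechanism involved: storing fp32 values as bf16 and computing a dot product with the AVX-512 BF16 instructions (`vcvtne2ps2bf16` / `vdpbf16ps`) that Zen4 executes natively. All names are illustrative; it assumes GCC/Clang vector-type casts, a row length that is a multiple of 32, and compilation with `-mavx512bf16` (implied by `-march=znver4`).

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Convert fp32 values to bf16 (round-to-nearest-even) for storage, as one
// would when writing the kv-cache in bf16. Assumes n is a multiple of 32.
static void store_row_bf16(const float * x, uint16_t * y, size_t n) {
    for (size_t i = 0; i < n; i += 32) {
        // the low 16 bf16 lanes come from the second argument, so pass x+i
        // there to preserve memory order
        __m512bh v = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(x + i + 16),
                                         _mm512_loadu_ps(x + i));
        _mm512_storeu_si512(y + i, (__m512i)v);   // GCC/Clang vector cast
    }
}

// Dot product of two bf16 rows using the native bf16 fused multiply-add
// (vdpbf16ps); products are accumulated in fp32.
static float vec_dot_bf16(const uint16_t * a, const uint16_t * b, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        __m512bh va = (__m512bh)_mm512_loadu_si512(a + i);
        __m512bh vb = (__m512bh)_mm512_loadu_si512(b + i);
        acc = _mm512_dpbf16_ps(acc, va, vb);
    }
    return _mm512_reduce_add_ps(acc);
}
```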

Given this outcome, I have only enabled bf16 when it is used for both the K- and V-cache (i.e., one cannot mix bf16 with other types, as is possible with fp16, Q4_0, Q4_1 and Q8_0).
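
Assuming this is exposed through the usual llama.cpp `-ctk`/`-ctv` (`--cache-type-k`/`--cache-type-v`) options, that would mean passing something like `-ctk bf16 -ctv bf16`, i.e. both caches set to bf16 rather than only one of them.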