mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-05-01 11:51:53 +00:00
### 🔀 [#584](https://github.com/ikawrakow/ik_llama.cpp/pull/584) - Vulkan: flash attention for DeepSeek models
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-07-04 |
| **Updated** | 2025-07-05 |
---
#### Description
This PR is a cherry-pick of [PR 14509](https://github.com/ggml-org/llama.cpp/pull/14509) in mainline `llama.cpp` with minor adaptations, and adds FA for the DeepSeek models to the Vulkan back-end.
#### Caveats
* The batch size cannot exceed the maximum context length. Under normal usage this never happens, but if one runs `perplexity` with default parameters, where the context is set to 512 tokens while the batch size is 2048 tokens, one gets NaNs after the first context chunk. I have spent the better part of the day trying to understand the reason, and just don't see it. I'm almost prepared to offer a bounty to whoever finds the bug.
* For now the KV cache can only be `fp16`, as I have not implemented the various additions required to make a quantized cache work with DeepSeek models in the Vulkan back-end (a quantized KV cache can of course be used with models that do not use MLA).
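Until the bug is found, a simple workaround is to cap the batch size at the context length before invoking `perplexity`. A minimal sketch of such a wrapper (the actual invocation is commented out; the model and data paths are placeholders, not from this PR):

```shell
#!/bin/sh
# Workaround sketch: clamp batch size (-b) to context length (-c),
# since batch > context currently produces NaNs with this PR.
CTX=512
BATCH=2048
if [ "$BATCH" -gt "$CTX" ]; then
  BATCH=$CTX
fi
echo "using -c $CTX -b $BATCH"
# ./bin/llama-perplexity -m model.gguf -f wiki.test.raw -c "$CTX" -b "$BATCH"
```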
I have tested with DeepSeek-V2-Lite on an RTX-4080 GPU with coopmat2 enabled. We are starting to see more significant performance gains compared to mainline `llama.cpp`, as illustrated in the following two graphs. The first graph shows PP-2048 performance as a function of the number of tokens in the KV cache, `N_KV`. Surprisingly, we don't see significant performance gains from `mla = 3` compared to `mla = 1` as we do with CUDA (see below). Nevertheless, at 32k tokens `ik_llama.cpp` is about 40% faster than `llama.cpp`.

The next graph compares TG performance as a function of `N_KV`. Here the performance gains compared to mainline are even greater, with `ik_llama.cpp` nearly 2X faster than `llama.cpp` at a context of 32k tokens.

Before you get too excited about these results, a reminder that the Vulkan back-end does not yet implement the fused MoE `ffn_up+ffn_gate` op, so it is still far behind CUDA. The next two graphs compare PP and TG performance as a function of `N_KV` on **the same RTX-4080 GPU**.
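For context on what that fusion buys: the gate and up projections of a SwiGLU-style FFN can be computed by a single matrix multiply over the row-concatenated weights (one kernel launch) instead of two separate multiplies. A minimal pure-Python sketch of the idea, not the actual ggml kernel; all names here are illustrative:

```python
import math

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    # w: list of rows, x: vector -> w @ x
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn_separate(w_gate, w_up, x):
    # Two independent mat-vec products (conceptually, two kernel launches).
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]

def ffn_fused(w_gate, w_up, x):
    # One mat-vec over the row-concatenated weights (one launch),
    # then split the result into the gate and up halves.
    both = matvec(w_gate + w_up, x)
    n = len(w_gate)
    gate, up = both[:n], both[n:]
    return [silu(g) * u for g, u in zip(gate, up)]

# Tiny example: both paths produce identical activations.
w_gate = [[1.0, 2.0], [0.5, -1.0]]
w_up   = [[2.0, 0.0], [1.0, 1.0]]
x = [0.3, -0.7]
print(ffn_separate(w_gate, w_up, x))
print(ffn_fused(w_gate, w_up, x))
```

The fused form halves the number of launches and reads the activations `x` once instead of twice, which is where the CUDA back-end currently gains over Vulkan.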
