mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-05-01 11:51:53 +00:00
### 🔀 [#584](https://github.com/ikawrakow/ik_llama.cpp/pull/584) - Vulkan: flash attention for DeepSeek models
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-07-04 |
| **Updated** | 2025-07-05 |
---
#### Description
This PR is a cherry-pick of [PR 14509](https://github.com/ggml-org/llama.cpp/pull/14509) in mainline `llama.cpp` with minor adaptations, and adds FA for the DeepSeek models to the Vulkan back-end.
#### Caveats
* The batch size cannot exceed the maximum context length. Under normal usage this never happens, but if one runs `perplexity` with default parameters, where the context is set to 512 tokens while the batch size is 2048 tokens, one gets NaNs after the first context chunk. I have spent the better part of the day trying to understand the reason, and just don't see it. I'm almost prepared to offer a bounty to whoever finds the bug.
* For now the KV cache can only be `fp16`, as I have not implemented the various additions required to make a quantized cache work with DeepSeek models in the Vulkan back-end (a quantized KV cache can of course be used with models that do not use MLA).
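Until the bug is found, a simple workaround is to cap the batch size at the context length before invoking `perplexity`. A minimal sketch of such a wrapper (the actual invocation is commented out; the model and data paths are placeholders, not from this PR):

```shell
#!/bin/sh
# Workaround sketch: clamp batch size (-b) to context length (-c),
# since batch > context currently produces NaNs with this PR.
CTX=512
BATCH=2048
if [ "$BATCH" -gt "$CTX" ]; then
  BATCH=$CTX
fi
echo "using -c $CTX -b $BATCH"
# ./bin/llama-perplexity -m model.gguf -f wiki.test.raw -c "$CTX" -b "$BATCH"
```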
I have tested with DeepSeek-V2-Lite on an RTX-4080 GPU with coopmat2 enabled. We are starting to see more significant performance gains compared to mainline `llama.cpp`, as illustrated in the following two graphs. The first graph shows PP-2048 performance as a function of the number of tokens in the KV cache, `N_KV`. Surprisingly, we don't see significant performance gains from `mla = 3` compared to `mla = 1` as we do with CUDA (see below). Nevertheless, at 32k tokens `ik_llama.cpp` is about 40% faster than `llama.cpp`.

The next graph compares TG performance as a function of `N_KV`. Here the performance gains compared to mainline are even greater, with `ik_llama.cpp` nearly 2X faster than `llama.cpp` at a context of 32k tokens.

Before you get too excited about these results, a reminder that the Vulkan back-end does not yet implement the fused MoE `ffn_up+ffn_gate` op, so it is still far behind CUDA. The next two graphs compare PP and TG performance as a function of `N_KV` on **the same RTX-4080 GPU**.
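For context on what that fusion buys: the gate and up projections of a SwiGLU-style FFN can be computed by a single matrix multiply over the row-concatenated weights (one kernel launch) instead of two separate multiplies. A minimal pure-Python sketch of the idea, not the actual ggml kernel; all names here are illustrative:

```python
import math

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    # w: list of rows, x: vector -> w @ x
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn_separate(w_gate, w_up, x):
    # Two independent mat-vec products (conceptually, two kernel launches).
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]

def ffn_fused(w_gate, w_up, x):
    # One mat-vec over the row-concatenated weights (one launch),
    # then split the result into the gate and up halves.
    both = matvec(w_gate + w_up, x)
    n = len(w_gate)
    gate, up = both[:n], both[n:]
    return [silu(g) * u for g, u in zip(gate, up)]

# Tiny example: both paths produce identical activations.
w_gate = [[1.0, 2.0], [0.5, -1.0]]
w_up   = [[2.0, 0.0], [1.0, 1.0]]
x = [0.3, -0.7]
print(ffn_separate(w_gate, w_up, x))
print(ffn_fused(w_gate, w_up, x))
```

The fused form halves the number of launches and reads the activations `x` once instead of twice, which is where the CUDA back-end currently gains over Vulkan.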
