### 🔀 [#234](https://github.com/ikawrakow/ik_llama.cpp/pull/234) - Faster MLA on CUDA

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-02-26 |
| **Updated** | 2025-02-27 |

---

#### Description
The CUDA code absolutely does not like MLA. On the main branch, MLA attention is 15-20% slower than the standard attention implementation. The issue is with the `wk_b x q_nope` and `wv_b x qkv_compressed` operations. For TG (token generation) these require two tensor multiplications with shapes $(N_h \times N_t \times K)$ and $(N_h \times 1 \times K)$, where $N_h$ is the head size, $N_t$ is the number of tokens in the KV cache, and $K$ is the number of heads. These get computed as $K$ consecutive $(N_h \times N_t) \times (N_h \times 1)$ matrix-vector multiplications. To add insult to injury, for `wk_b x q_nope`, where `q_nope` is not contiguous, we get $K$ copies to contiguous memory (one for each `q_nope` row), followed by quantization of a single row (when `wk_b` is quantized), followed by the actual GEMV, i.e., $3K$ CUDA kernel launches. The associated overhead far exceeds the time needed for the actual matrix multiplications, so the computation becomes extremely slow compared to what it could be.
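
For a rough sense of the overhead, assume $K = 16$ heads for DeepSeek-Lite (an illustrative value, not stated in this PR). The main branch then issues

$$\underbrace{K}_{\text{copies}} + \underbrace{K}_{\text{quantize}} + \underbrace{K}_{\text{GEMV}} = 3K = 48$$

kernel launches per generated token for `wk_b x q_nope` alone, whereas the path added in this PR needs one quantization call plus one fused GEMV launch.
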
This PR fixes the inefficiency by adding a special-purpose kernel that performs the $K$ GEMVs in one go. It is a bit of a hack and I should try to consolidate it with the regular `ggml_cuda_op_mul_mat_vec_q` implementation, but it should do for now. In addition, the PR adds a new `quantize_tensor_q8_1_cuda` method that operates on non-contiguous tensors that have a single row. This allows the `q_nope` quantization for the `wk_b x q_nope` multiplication to be done with a single call.
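
To illustrate the fused-GEMV idea, here is a minimal, self-contained CUDA sketch: one launch computes all $K$ per-head matrix-vector products by mapping the head index to `blockIdx.y`. This is an illustration only, not the actual ik_llama.cpp kernel (which additionally handles quantized `wk_b`, `q8_1`-quantized activations, and ggml's tensor layout); the kernel name and toy sizes are made up.

```cpp
// Sketch: y_k = A_k^T * x_k for all K heads in a single kernel launch.
// Per head, A_k is an (n_h x n_t) matrix with the shared dimension n_h stored
// first (ggml-style), x_k is a length-n_h vector, and y_k is a length-n_t vector.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void batched_gemv(const float * A, const float * x, float * y,
                             int n_h, int n_t) {
    const int k = blockIdx.y;                              // head index
    const int t = blockIdx.x * blockDim.x + threadIdx.x;   // KV-cache position
    if (t >= n_t) return;
    const float * Ak = A + (size_t)k * n_h * n_t;          // this head's matrix
    const float * xk = x + (size_t)k * n_h;                // this head's vector
    float sum = 0.0f;
    for (int h = 0; h < n_h; ++h) {
        sum += Ak[(size_t)t * n_h + h] * xk[h];            // column t of A_k dot x_k
    }
    y[(size_t)k * n_t + t] = sum;
}

int main() {
    const int n_h = 128, n_t = 1024, K = 16;               // toy sizes
    std::vector<float> A((size_t)K * n_h * n_t, 0.01f);
    std::vector<float> x((size_t)K * n_h, 1.0f);
    std::vector<float> y((size_t)K * n_t, 0.0f);

    float * dA, * dx, * dy;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dx, x.size() * sizeof(float));
    cudaMalloc(&dy, y.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);

    // One launch covers all K GEMVs instead of K separate mul_mat_vec launches.
    const int block = 256;
    dim3 grid((n_t + block - 1) / block, K);
    batched_gemv<<<grid, block>>>(dA, dx, dy, n_h, n_t);

    cudaMemcpy(y.data(), dy, y.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %.3f (expected %.3f)\n", y[0], 0.01f * n_h);

    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}
```

The grid-dimension trick is what lets one launch replace $K$ separate matrix-vector calls; the real code path also relies on the new `quantize_tensor_q8_1_cuda`, which handles the non-contiguous single-row `q_nope` in one call.
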
These two changes result in a significant speedup of the MLA attention computation on CUDA. For `IQ4_NL`-quantized DeepSeek-Lite with all layers processed on the GPU we get a TG-128 increase of 31%. For the hybrid case where the experts are computed on the CPU we get a 15% speedup. MLA is now (nearly) on par with standard attention for short contexts and outperforms it with increasing context length. Here is a table comparing standard attention to MLA attention with this PR for hybrid CPU/GPU inference as a function of context length. The CPU is a Ryzen-7950X and the GPU is an RTX-4080.

| model | test | t/s (std) | t/s (MLA, this PR) | Speedup |
| -------------------- | ------------: | ---------------: | -----------------: | -------: |
| deepseek2 16B IQ4_NL | tg64@pp128 | 52.99 ± 0.03 | 52.43 ± 0.04 | 0.989 |
| deepseek2 16B IQ4_NL | tg64@pp256 | 52.77 ± 0.09 | 52.26 ± 0.07 | 0.990 |
| deepseek2 16B IQ4_NL | tg64@pp512 | 51.58 ± 1.19 | 51.93 ± 0.10 | 1.007 |
| deepseek2 16B IQ4_NL | tg64@pp1024 | 50.75 ± 0.56 | 51.73 ± 0.07 | 1.019 |
| deepseek2 16B IQ4_NL | tg64@pp2048 | 49.96 ± 0.28 | 51.29 ± 0.05 | 1.027 |
| deepseek2 16B IQ4_NL | tg64@pp4096 | 47.94 ± 0.58 | 50.23 ± 0.05 | 1.048 |
| deepseek2 16B IQ4_NL | tg64@pp8192 | 43.77 ± 0.34 | 48.04 ± 0.04 | 1.098 |
| deepseek2 16B IQ4_NL | tg64@pp16384 | 37.76 ± 0.15 | 44.62 ± 0.17 | 1.182 |
---

#### 💬 Conversation
👤 **davidsyoung** commented on **2025-02-27** at **16:17:26**:<br>

@ikawrakow Seeing a significant speed increase from this, together with the transposed KV cache: from 12 t/s to 17.25 t/s, and less of a drop-off in speed at longer PP token counts as well. Full CUDA, 15x3090, Q2_K, MLA.

Really nice!