
🔀 #233 - Slightly faster CUDA MLA

Author ikawrakow
State Closed
Created 2025-02-26
Updated 2025-02-27

Description

The CUDA code absolutely does not like MLA. The issue is with the `wk_b x q_nope` and `wv_b x qkv_compressed` operations. For TG they require two tensor multiplications of shapes $(N_h \times N_t \times K)$ and $(N_h \times 1 \times K)$, where $N_h$ is the head size, $N_t$ is the number of tokens in the KV cache, and $K$ is the number of heads. These get computed as $K$ consecutive $(N_h \times N_t) \times (N_h \times 1)$ matrix-vector multiplications. To add insult to injury, for `wk_b x q_nope`, where `q_nope` is not contiguous, we get $K$ copies (one for each `q_nope` row) to contiguous memory, followed by quantization of a single row (when `wk_b` is quantized), followed by the actual GEMV, i.e., $3K$ CUDA kernel launches. The associated overhead far exceeds the time needed for the actual matrix multiplications, so the computation becomes extremely slow compared to what it could be.
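
For reference, here is a minimal sketch (not code from this repository) of how those shapes map onto ggml's batched matrix multiplication, which repeats the 2D multiplication over the third dimension, i.e., performs $K$ independent GEMVs during TG. The helper name, F32 type, and placeholder shapes are illustrative only:

```c
#include "ggml.h"

// Hypothetical helper: builds the batched wk_b x q_nope multiplication with the
// shapes described above. In the real graph the tensors come from the model.
static struct ggml_tensor * mla_gemm_sketch(struct ggml_context * ctx,
                                            int64_t N_h, int64_t N_t, int64_t K) {
    // wk_b: one (N_h x N_t) matrix per head; q_nope: one N_h-vector per head.
    struct ggml_tensor * wk_b   = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, N_h, N_t, K);
    struct ggml_tensor * q_nope = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, N_h, 1,   K);
    // ggml_mul_mat batches over the third dimension: K matrix-vector products.
    return ggml_mul_mat(ctx, wk_b, q_nope);   // result shape: (N_t, 1, K)
}
```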

This PR improves the situation slightly by making `q_nope` contiguous before `ggml_mul_mat(ctx, wk_b, q_nope)`, as sketched below. For DeepSeek-Lite we gain ~7% in performance when running the entire model on the GPU, and about 4% when running experts on the CPU and everything else on the GPU.
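
A hedged sketch of the idea, assuming `ctx`, `wk_b`, and `q_nope` are the tensors from the attention graph (this is not the exact diff from the PR, and the helper name is made up for illustration):

```c
#include "ggml.h"

// Make q_nope contiguous once before the multiplication, so the CUDA backend no
// longer has to copy and re-quantize each of the K rows separately before its GEMV.
static struct ggml_tensor * mul_mat_contiguous(struct ggml_context * ctx,
                                               struct ggml_tensor  * wk_b,
                                               struct ggml_tensor  * q_nope) {
    struct ggml_tensor * q_nope_cont = ggml_cont(ctx, q_nope);   // one copy instead of K
    return ggml_mul_mat(ctx, wk_b, q_nope_cont);                 // same result as before
}
```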

I did attempt to implement the entire tensor multiplication with a single kernel launch, but have not succeeded so far. The TG speed improves and matches standard attention performance, but I get gibberish output (and I haven't yet found what is wrong). So, for now, I'm adding just this relatively minor improvement.


💬 Conversation

👤 ikawrakow commented on 2025-02-26 at 17:27:37:

Closing in favor of #234


👤 davidsyoung commented on 2025-02-27 at 16:16:55:

@ikawrakow Seeing a significant speed increase from this, combined with the transposed KV cache: from 12 t/s to 17.25 t/s, and less of a drop-off in speed at longer PP token counts. Full CUDA, 15x3090, Q2_K, MLA.

Really nice!