mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-03-11 14:30:02 +00:00
### 🗣️ [#599](https://github.com/ikawrakow/ik_llama.cpp/discussions/599) - MLA matrix absorption
| **Author** | `magikRUKKOLA` |
| :--- | :--- |
| **Created** | 2025-07-11 |
| **Updated** | 2025-07-15 |

---
#### Description

This is a prefill optimization for long contexts, as implemented in ktransformers. I found some cool docs and will leave them here.
https://github.com/madsys-dev/deepseekv2-profile/blob/main/workspace/blog/optimizing-mla.md

DeepSeek R1's **explanation**:
The **matrix absorption technique** in DeepSeek-V2's MLA (Multi-head Latent Attention) mechanism is a clever mathematical optimization that avoids explicitly decompressing the compressed KV cache, significantly reducing computation and memory overhead. Here's a step-by-step explanation:
### 1. **Core Problem**
Traditional MLA implementations:
- Store **compressed KV representations** (small memory footprint)
- But require **decompression** before attention calculation:

```math
k_t^C = W^{UK} \cdot c_t^{KV} \quad \text{(expensive operation)}
```
```math
v_t = W^{UV} \cdot c_t^{KV} \quad \text{(expensive operation)}
```

### 2. **Key Insight: Matrix Associativity**
Matrix multiplication is associative. Instead of decompressing KV, **absorb the decompression matrices** into adjacent operations:
- **K-absorption**: Fuse decompression into Q projection
- **V-absorption**: Fuse decompression into output projection

---
### 3. **K-Absorption (for Attention Scores)**
**Original computation** for non-RoPE attention scores:
```math
{q_t^C}^\top k_t^C = (W^{UQ} c_t^Q)^\top (W^{UK} c_t^{KV})
```

**Absorbed version** using associativity:
```math
{q_t^C}^\top k_t^C = \underbrace{(c_t^Q)^\top}_{\text{input}} \cdot \underbrace{(W^{UQ})^\top W^{UK}}_{\text{precomputed}} \cdot \underbrace{c_t^{KV}}_{\text{cached}}
```

**Why this helps**:
- Avoids explicit computation of full-dimensional `k_t^C`
- Replaces large matrix multiplication with smaller operations
- **FLOPs reduction**: From 33.64 MFLOP/token → 0.28 MFLOP/token

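To make the absorbed path concrete, here is a minimal PyTorch sketch (not from the linked doc; the tensor names and the toy sizes are illustrative assumptions) that checks the absorbed score computation against explicit K decompression for a single head:

```python
import torch

# Toy sizes standing in for the Q/KV compression dims and head dim
d_cq, d_ckv, d_head, kv_len = 64, 32, 16, 8

W_UQ = torch.randn(d_head, d_cq, dtype=torch.float64)   # decompresses c_t^Q
W_UK = torch.randn(d_head, d_ckv, dtype=torch.float64)  # decompresses c_t^KV
c_q = torch.randn(d_cq, dtype=torch.float64)            # compressed query input
c_kv = torch.randn(kv_len, d_ckv, dtype=torch.float64)  # compressed KV cache

# Naive path: decompress every cached k_t^C, then dot with q_t^C
naive = (W_UQ @ c_q) @ (c_kv @ W_UK.T).T

# Absorbed path: precompute (W^UQ)^T W^UK once; k_t^C is never materialized
W_absorbed = W_UQ.T @ W_UK              # (d_cq, d_ckv)
absorbed = (c_q @ W_absorbed) @ c_kv.T

print(torch.allclose(naive, absorbed))  # True, by associativity
```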
---
### 4. **V-Absorption (for Attention Output)**
**Original computation**:
```math
o = a \cdot v_t = a \cdot (W^{UV} \cdot c_t^{KV})
```
```math
u = W^O \cdot o
```

**Absorbed version** using Einstein summation:
```python
# Instead of decompressing V explicitly:
# attn_weights: (batch, heads, q_len, kv_len); compressed_kv: (batch, kv_len, c_dim)
attn_output = torch.einsum('bhql,blc->bhqc', attn_weights, compressed_kv)  # weighted sum in compressed space
attn_output = torch.einsum('bhqc,hdc->bhqd', attn_output, W_UV)  # late decompression
u = torch.einsum('hdD,bhqd->bqD', W_O, attn_output)  # output projection
```

**Why this helps**:
- Avoids materializing full `v_t` (128× larger than `c_t^{KV}`)
- Fuses decompression with weighted sum
- Minimizes intermediate memory

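The same identity can be sanity-checked end to end. A hedged, self-contained sketch (toy shapes; `W_UV` and `W_O` are per-head weights matching the einsum subscripts above):

```python
import torch

b, h, q_len, kv_len, d_c, d_v, d_model = 1, 2, 3, 5, 16, 8, 32  # toy sizes
attn_weights = torch.softmax(
    torch.randn(b, h, q_len, kv_len, dtype=torch.float64), dim=-1)
compressed_kv = torch.randn(b, kv_len, d_c, dtype=torch.float64)
W_UV = torch.randn(h, d_v, d_c, dtype=torch.float64)     # per-head V decompression
W_O = torch.randn(h, d_v, d_model, dtype=torch.float64)  # per-head output projection

# Naive: decompress V first (materializes the full v_t), then weight, then project
v = torch.einsum('hdc,blc->bhld', W_UV, compressed_kv)
u_naive = torch.einsum('hdD,bhqd->bqD', W_O,
                       torch.einsum('bhql,bhld->bhqd', attn_weights, v))

# Absorbed: weighted sum in compressed space, decompress late
o = torch.einsum('bhql,blc->bhqc', attn_weights, compressed_kv)
o = torch.einsum('bhqc,hdc->bhqd', o, W_UV)
u_absorbed = torch.einsum('hdD,bhqd->bqD', W_O, o)

print(torch.allclose(u_naive, u_absorbed))  # True, by linearity
```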
---
### 5. **Move Elision Optimization**
**Additional problem**: Original code concatenates RoPE/non-RoPE parts of Q/K, creating large temporary tensors.

**Solution**: Compute attention scores in two parts and sum:
```python
attn_weights = (
    torch.matmul(q_pe, k_pe.transpose(2, 3))                 # RoPE part
    + torch.einsum('bhqc,blc->bhql', q_nope, compressed_kv)  # Non-RoPE (absorbed)
)
```
- **Eliminates** memory-hungry concatenation ops
- **Avoids** storing full Q/K tensors (192-dimensional)

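The split is safe because a dot product of concatenated vectors equals the sum of the parts' dot products. A minimal sketch of just that identity (toy shapes; in the real code the non-RoPE term is additionally absorbed into `compressed_kv` as shown above):

```python
import torch

b, h, q_len, kv_len, d_nope, d_rope = 1, 2, 3, 5, 16, 4
q_nope = torch.randn(b, h, q_len, d_nope, dtype=torch.float64)
q_pe = torch.randn(b, h, q_len, d_rope, dtype=torch.float64)
k_nope = torch.randn(b, h, kv_len, d_nope, dtype=torch.float64)
k_pe = torch.randn(b, h, kv_len, d_rope, dtype=torch.float64)

# Concatenated path: build the full Q/K tensors, then one matmul
q_full = torch.cat([q_nope, q_pe], dim=-1)
k_full = torch.cat([k_nope, k_pe], dim=-1)
scores_concat = q_full @ k_full.transpose(2, 3)

# Split path: two smaller matmuls, no concatenated temporaries
scores_split = q_nope @ k_nope.transpose(2, 3) + q_pe @ k_pe.transpose(2, 3)

print(torch.allclose(scores_concat, scores_split))  # True
```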
---
### 6. **Why Not Pre-Absorb All Matrices?**
Theoretically, you could precompute:
```math
W_{\text{new}}^{UQ} = (W^{UQ})^\top W^{UK} \quad \text{and} \quad W_{\text{new}}^O = W^O W^{UV}
```

But this is **inefficient** because:
- `W_{\text{new}}^{UQ}` would be a large low-rank matrix (1536×512)
- `W_{\text{new}}^O` would be massive (5120×512 per head)
- The **actual implementation** (sequential small ops) is faster, as the sketch below illustrates

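A rough per-head, per-token multiply-add count (a back-of-the-envelope sketch using the dimensions above, not a benchmark) shows why the sequential path wins:

```python
# Per head: map c_t^Q (1536-dim) into the 512-dim compressed-KV space
d_cq, d_ckv, d_head = 1536, 512, 128

sequential = d_head * d_cq + d_head * d_ckv  # via 128x1536, then 128x512
preabsorbed = d_cq * d_ckv                   # via one fused 1536x512 matrix

print(sequential)   # 262144  (~0.26 M multiply-adds)
print(preabsorbed)  # 786432  (~0.79 M multiply-adds, 3x more)
```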
---
### 7. **Performance Impact**
| Optimization | KV Cache Size | Compute (FLOPs/token) | Speedup (vs baseline) |
|---|---|---|---|
| Baseline (CacheDecompressed) | 81.92 KB | 0.08 M | 1.0× |
| CacheCompressed (no absorption) | 1.15 KB | 33.64 M | 0.3× |
| **Absorption + Move Elision** | **1.15 KB** | **0.28 M** | **1.2×** |

**Key wins**:
- 71× smaller KV cache than decompressed version
- 120× less computation than non-absorbed compression
- Enables larger batch sizes (critical for GPU efficiency)

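The headline ratios follow directly from the table (a quick check):

```python
print(81.92 / 1.15)  # ≈ 71.2x smaller KV cache
print(33.64 / 0.28)  # ≈ 120.1x less compute than non-absorbed compression
```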
---
### Summary
Matrix absorption works by:
1. **Exploiting associativity** to "push" decompression matrices into adjacent operations
2. **Late decompression** of V during output projection
3. **Splitting attention scores** to avoid tensor concatenation
4. **Leveraging compressed representations** throughout computation

This transforms MLA from a memory-bound problem into a compute-bound one, better utilizing modern GPU tensor cores while maintaining 98.6% KV cache compression.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-07-11** at **12:16:34**:<br>
@magikRUKKOLA

You may want to check #246, #260, #273.

As far as I can tell, #246, which explains the basic idea of reducing the amount of multiply-adds when using MLA, precedes the linked doc by about a month, and is surprisingly similar to what they wrote.

#260 explains the `-amb` option, which limits the amount of intermediate compute buffer storage required.

#273 is the best MLA version in `ik_llama.cpp`. The MLA=2 variant (explained in #246) is used for prompt processing, the original MLA (MLA=1) is used for token generation. The main reason it took a while to arrive at #273 was the struggle to implement the MLA=1 case efficiently on CUDA (and the struggle was due to the much larger than usual attention head sizes of 576 and 512).

If you look at all merged PRs, you will see that it has been quite a journey to arrive at what we have today for doing fast DeepSeek inference.

---
👤 **ubergarm** replied the **2025-07-11** at **22:10:57**:<br>

A new model with MLA just dropped, only 1000B-A32B: https://huggingface.co/moonshotai/Kimi-K2-Instruct .... :sob: lol...

> 👤 **magikRUKKOLA** replied the **2025-07-11** at **23:35:48**:<br>
> @ubergarm
> > A new model with MLA just dropped, only 1000B-A32B: https://huggingface.co/moonshotai/Kimi-K2-Instruct .... 😭 lol...
>
> ```
> Paper Link (co**mm**ing soon)
> ```
>
> Yeah, I am so excited too! :D
> So the minimum requirements are 512GB RAM and 48GB VRAM to run some IQ2 quant lol. (?) I guess it's time to upgrade.
>
> quote:
> > Agentic Intelligence: Specifically designed for **tool use**, reasoning, and autonomous problem-solving.
>
> I suggest that the setup for using tool calling with ik_llama.cpp should be documented somewhere. Basically we need a MITM tool to translate JSON<->TOOL_CALL_TOKENS. And that's about it.
>
> 👤 **ewhacc** replied the **2025-07-12** at **09:59:24**:<br>
> @ubergarm
> Are you going to cook quants? ^^; It uses the deepseek architecture, so I'm hoping it runs in ik_llama.cpp flawlessly.
>
> I have 512G RAM and would like to test IQ2. I thought 256G is the best because using 512G (with higher bits) is too slow. I was wrong. Kimi-K2 keeps the active experts the same but almost doubles the weights. I guess tg speed is about the same, but pp will be slower.
>
> I'm downloading the original FP8 now. I don't know why I'm doing this... ^^
>
> 👤 **ubergarm** replied the **2025-07-12** at **15:43:55**:<br>
> @ewhacc
>
> I haven't looked to see if existing methods for going from fp8 safetensors to bf16 GGUFs would work on that model yet. I use the evshiron llama.cpp fork (from fairydreaming's original MLA fork) plus triton-cpu to convert deepseek 671B without a GPU on a big RAM box. That is the first challenge.
>
> Next you'll need over 1TB RAM to inference the Q8_0 to make an imatrix. I don't have access to the big RAM box right now, so I can't do this step at the moment. Plus it's a pain to free up like 4TB of disk space lol...
>
> Keep us posted, I'm sure people will want to run this monster eventually.
>
> 👤 **ubergarm** replied the **2025-07-12** at **15:45:52**:<br>
> @magikRUKKOLA
>
> > I suggest that the setup for using tool calling with ik_llama.cpp should be documented somewhere. Basically we need a MITM tool to translate JSON<->TOOL_CALL_TOKENS. And that's about it.
>
> One guy put together a function calling wrapper thing, not sure if it is applicable here: https://github.com/ikawrakow/ik_llama.cpp/issues/407#issuecomment-2953602943
>
> I haven't tried it personally.
>
> 👤 **magikRUKKOLA** replied the **2025-07-12** at **20:58:08**:<br>
> > @magikRUKKOLA
> >
> > One guy put together a function calling wrapper thing, not sure if it is applicable here: [#407 (comment)](https://github.com/ikawrakow/ik_llama.cpp/issues/407#issuecomment-2953602943)
>
> Yeah, I noticed. I suggest some docs should be created on how to provide a frontend for ik_llama.cpp to support tool calling. But first let me observe which solution would be the most elegant.
>
> 👤 **magikRUKKOLA** replied the **2025-07-12** at **21:05:55**:<br>
> @ewhacc
>
> > I have 512G RAM and would like to test IQ2.
>
> I just noticed that IQ4_KS_R4 of DeepSeek R1 is 368 GiB. So
>
> ```
> echo "scale=2;368*(1000/671)"|bc
> 548.32
> ```
>
> So a Kimi-K2 quant of similar quality might fit within 512 GB RAM. Or the IQ3 quant should fit.
>
> But... but... something should be done with the attention mechanism (for the prefill) to reduce the VRAM usage. I am currently looking at flashinfer. That is the exact reason for the instability in ktransformers. It's a hurdle. :)
>
> > I thought 256G is the best because using 512G (with higher bits) is too slow. I was wrong.
>
> Yeah, I made the same mistake.
> Small tip/note -- if you choose to use DDR4, don't buy 3200 MT/s (unless it's for Lenovo machines). The Samsung 2666 MT/s ECC overclocks great at 1.35V with crazy timings. But you would have to install additional fans and heatsinks on top of the RAM. Also, the Gigabyte MC62-G40-00 sucks -- it doesn't allow overclocking.
>
> 👤 **magikRUKKOLA** replied the **2025-07-13** at **14:09:14**:<br>
> 621GB Q4_K quant dropped!
>
> https://huggingface.co/KVCache-ai/Kimi-K2-Instruct-GGUF
>
> Can't wait for the Q3 quant to try out on 512GB RAM. :) Also setting up water cooling for the four RTX 3090s to be able to connect [four of them] without risers (to support as much context as possible).

---
👤 **ewhacc** replied the **2025-07-13** at **11:25:36**:<br>

@ubergarm

> I haven't looked to see if existing methods for going from fp8 safetensors to bf16 GGUFs would work on that model yet. I use the evshiron llama.cpp fork (from fairydreaming's original MLA fork) plus triton-cpu to convert deepseek 671B without a GPU on a big RAM box. That is the first challenge.

I just tried fp8_cast_bf16.py but got a VRAM OOM. I didn't think this would be a big challenge, but the first step is getting tough. I will try with more VRAM, and perhaps will try the evshiron llama.cpp too. Thanks a lot for the help; I'm just giving your recipes a try.

> Next you'll need over 1TB RAM to inference the Q8_0 to make an imatrix.

Hmm, this one is what I worried about and wanted to ask. Well, time to wake my xeon box (it's too loud). BTW, isn't it possible to make an imatrix directly from BF16? Is making Q8_0 a must? Ha ha, it's a long and big way to go: FP8 -> BF16 -> Q8_0 -> imatrix -> Q2

Edit: I'm trying evshiron llama.cpp, which seems to have a direct conversion from fp8 to q8_0.

Edit: Failed to get q8_0. I don't know if it needs 1TB RAM, but it seems not to be a RAM problem (tried on 512G):
```
python ev_llama.cpp/convert_hf_to_gguf.py Kimi-K2-Instruct --outfile Kimi-K2-Instruct-q8 --outtype q8_0
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
```

> 👤 **ubergarm** replied the **2025-07-13** at **16:29:49**:<br>
> @ewhacc
>
> > I just tried fp8_cast_bf16.py but got VRAM OOM.
>
> Right, for the `fp8_cast_bf16.py` script from the deepseek approach the pipeline is quite long: `fp8 safetensors -> bf16 safetensors -> bf16 GGUF -> Q8_0 -> imatrix -> Q2`. I believe this is the method used for mainline MLA quants of deepseek. Not sure if this works for the slightly different arch Kimi-K2 1000B-A32B or not.
>
> Regarding OOMing with this method, [I have some notes in a discussion with fairydreaming about using triton-cpu instead for using RAM without a GPU](https://github.com/ggml-org/llama.cpp/discussions/11989#discussioncomment-13555486) that I just dug up. Also found a patch that might prevent VRAM OOM on 4090-series cards [here on huggingface](https://huggingface.co/deepseek-ai/DeepSeek-V3/discussions/17).
>
> > BTW, isn't it possible to make imatrix directly from BF16?
>
> Yes, if you can run inferencing with the 2TB VRAM+RAM bf16 GGUF, then you could use it directly for imatrix. I haven't tested the quality difference in terms of perplexity, but I believe the Q8_0 is sufficient given it is quite similar to the native fp8.
>
> > I'm trying evshiron llama.cpp, which seems to have a direct conversion from fp8 to q8_0.
>
> Yes, this is my usual method. Not sure it would work with Kimi-K2 though without some modifications. I assume you got `triton-cpu` to build (this is one of the more difficult steps of the process). Notes on building triton-cpu [here where @saood06 helped fix a build bug for them](https://github.com/triton-lang/triton-cpu/issues/237#issuecomment-2878180022).
>
> My script to convert the fp8 safetensors directly to bf16 GGUF is:
> ```bash
> # evshiron/llama.cpp@63b7e8aa
> source venv/bin/activate
> python \
>     llama.cpp/convert_hf_to_gguf.py \
>     --outtype bf16 \
>     --split-max-size 50G \
>     --outfile /models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/ \
>     /models/tngtech/DeepSeek-TNG-R1T2-Chimera/
> ```
>
> If you're still getting that error, you might have to poke around in `convert_hf_to_gguf.py` and search where it says `triton` for the `deepseek-v3` part. Might need to look at the recent Kimi-K2 PR https://github.com/ggml-org/llama.cpp/pull/14654 and add that to the evshiron fork or something.
>
> I don't have access to enough RAM at the moment. Maybe will in the next few weeks :crossed_fingers:
>
> Thanks for blazing the trail! And feel free to open a new discussion/issue specific to Kimi-K2 etc...
>
> 👤 **magikRUKKOLA** replied the **2025-07-13** at **18:17:56**:<br>
> > I don't have access to enough RAM at the moment. Maybe will in the next few weeks 🤞
>
> Hey bro, are you in EU? I can drop you some 1TB DDR5 RAM with a huge discount.
>
> 👤 **ubergarm** replied the **2025-07-13** at **18:38:54**:<br>
> @magikRUKKOLA
>
> Oh man, thanks for the offer; no, I'm on the east coast of the USA currently. Wendell at level1techs.com is hooking me up with access to a new remote rig he's assembling that is a big dual-socket 1.5TB beast that should be online sooner than I expected!
>
> 👤 **ewhacc** replied the **2025-07-14** at **00:16:44**:<br>
> @ubergarm
> You have gone through this whole tough process. Thanks so much for sharing your experience.
>
> > Yes, if you can run inferencing with the 2TB VRAM+RAM bf16 GGUF, then you could use it directly for imatrix. I haven't tested the quality difference in terms of perplexity, but I believe the Q8_0 is sufficient given it is quite similar to the native fp8.
>
> Oops, 2TB. Sounds like going through Q8_0 is a must.
>
> 👤 **ubergarm** replied the **2025-07-14** at **02:53:30**:<br>
> @ewhacc
>
> So Wendell just hooked me up with remote access to a big dual-socket AMD CPU rig with 42TB of Kioxia flash storage I put into two RAID0 arrays, and with almost 1.5TB RAM (no GPUs). So I'm working through it now using the "mainline" method of casting the fp8 safetensors to bf16 safetensors first.
>
> If I can get that working, I'll try to see if it is possible to adapt the evshiron fork to do the same MLA treatment to Kimi-K2 as it does for deepseek models and do the direct fp8 safetensors -> bf16 GGUF.
>
> A few folks are also working on it here, feel free to join with your findings: https://huggingface.co/gabriellarson/Kimi-K2-Instruct-GGUF/discussions/1
>
> 👤 **ewhacc** replied the **2025-07-14** at **03:12:39**:<br>
> @ubergarm
> > A few folks are also working on it here, feel free to join with your findings: https://huggingface.co/gabriellarson/Kimi-K2-Instruct-GGUF/discussions/1
>
> Thanks for the invite. I see you already started there :)

---
👤 **ewhacc** replied the **2025-07-13** at **11:30:20**:<br>

@magikRUKKOLA
> Small tip/note -- if you choose to use DDR4, don't buy 3200 MT/s (unless it's for Lenovo machines). The Samsung 2666 MT/s ECC overclocks great at 1.35V with crazy timings. But you would have to install additional fans and heatsinks on top of the RAM. Also, the Gigabyte MC62-G40-00 sucks -- it doesn't allow overclocking.

Thank you for the tip. Yeah, I have been tempted to overclock DDR4, and even DDR5. But I have to check whether my board allows it. Yes, RAM also needs cooling; my DDR5 gets hot when I use R1.

---
👤 **magikRUKKOLA** replied the **2025-07-15** at **19:59:27**:<br>

ATTN! Below is not a joke. It's an actual recent commit to flashinfer. Please pay attention:

```diff
- return self.run_return_lse(q, paged_kv_cache, k_scale, v_scale)
+ return self.run_return_lse(q, paged_kv_cache, k_scale=k_scale, v_scale=v_scale)
```

Let's read the explanation:

```
fix: correctly pass k_scale and v_scale to run() in forward_return_lse
```

MORE!

```
Bug Fix: Corrected an issue in BatchPrefillWithPagedKVCacheWrapper.forward_return_lse where k_scale and v_scale were incorrectly passed as positional arguments instead of keyword arguments to run_return_lse(). This resolves a **silent misbehavior or potential runtime error** caused by functools.partialmethod expecting keyword-only arguments.
```

The comments from the **maintainer**!!

```
Great catch, left some comments for suggestions :)
```

I mean, this doesn't make sense. I am not really sure it's real.