🗣️ #599 - mla matrix absorption

Author magikRUKKOLA
Created 2025-07-11
Updated 2025-07-15

Description

This is about a prefill optimization for long contexts, as implemented in ktransformers. I found some good docs and will leave them here.

https://github.com/madsys-dev/deepseekv2-profile/blob/main/workspace/blog/optimizing-mla.md

DeepSeek R1's explanation:

The matrix absorption technique in DeepSeek-V2's MLA (Multi-head Latent Attention) mechanism is a clever mathematical optimization that avoids explicitly decompressing the compressed KV cache, significantly reducing computation and memory overhead. Here's a step-by-step explanation:

1. Core Problem

Traditional MLA implementations:

  • Store compressed KV representations (small memory footprint)
  • But require decompression before attention calculation (sketched below):
    k_t^C = W^{UK} \cdot c_t^{KV} \quad \text{(expensive operation)}
    v_t = W^{UV} \cdot c_t^{KV} \quad \text{(expensive operation)}
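For concreteness, a minimal sketch of this naive path, assuming DeepSeek-V2-like dimensions (128 heads, 128-dim heads, 512-dim compressed latent); the names and shapes are illustrative stand-ins, not the actual implementation:

import torch

n_heads, d_head, d_c = 128, 128, 512         # assumed DeepSeek-V2-like dims
kv_len = 1024                                # number of cached tokens (kept small for the sketch)

W_UK = torch.randn(n_heads, d_head, d_c)     # K decompression weights
W_UV = torch.randn(n_heads, d_head, d_c)     # V decompression weights
c_kv = torch.randn(kv_len, d_c)              # compressed KV cache (small)

# Naive path: every cached token is decompressed on every forward pass
k_c = torch.einsum('hdc,lc->lhd', W_UK, c_kv)   # (kv_len, 128, 128), 32x the latent size
v_t = torch.einsum('hdc,lc->lhd', W_UV, c_kv)   # same blow-up for V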

2. Key Insight: Matrix Associativity

Matrix multiplication is associative. Instead of decompressing KV, absorb the decompression matrices into adjacent operations (a small numerical check follows the list below):

  • K-absorption: Fuse decompression into Q projection
  • V-absorption: Fuse decompression into output projection
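As a quick numerical sanity check (toy sizes, a single query, illustrative names): folding W^{UK} into the query gives exactly the same scores as decompressing K first.

import torch

d_cq, d_head, d_c, kv_len = 16, 8, 12, 5

W_UQ = torch.randn(d_head, d_cq)    # up-projects c^Q to q^C
W_UK = torch.randn(d_head, d_c)     # up-projects c^{KV} to k^C
c_q  = torch.randn(d_cq)            # compressed query input
c_kv = torch.randn(kv_len, d_c)     # compressed KV cache

# (1) Decompress K, then take dot products with q^C
q_c     = W_UQ @ c_q                # (d_head,)
k_c     = c_kv @ W_UK.T             # (kv_len, d_head) -- explicit decompression
scores1 = k_c @ q_c                 # (kv_len,)

# (2) Absorb W^{UK} into the query, score against the compressed cache directly
q_abs   = W_UK.T @ q_c              # (d_c,)
scores2 = c_kv @ q_abs              # (kv_len,)

print(torch.allclose(scores1, scores2, atol=1e-5))   # True: same scores, no decompression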

3. K-Absorption (for Attention Scores)

Original computation for non-RoPE attention scores:

{q_t^C}^\top k_t^C = (W^{UQ} c_t^Q)^\top (W^{UK} c_t^{KV})

Absorbed version using associativity:

{q_t^C}^\top k_t^C = \underbrace{(c_t^Q)^\top}_{\text{input}} \cdot \underbrace{(W^{UQ})^\top W^{UK}}_{\text{precomputed}} \cdot \underbrace{c_t^{KV}}_{\text{cached}}

Why this helps:

  • Avoids explicit computation of full-dimensional k_t^C
  • Replaces large matrix multiplication with smaller operations
  • FLOPs reduction: from 33.64 MFLOP/token to 0.28 MFLOP/token (a shape-annotated sketch follows below)
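A minimal PyTorch sketch of the absorbed score computation, assuming DeepSeek-V2-like shapes (128 heads, 128-dim non-RoPE part, 512-dim latent); the weight tensors here are random stand-ins for the real projections:

import torch

n_heads, d_nope, d_c = 128, 128, 512
batch, q_len, kv_len = 1, 4, 4096

q_nope        = torch.randn(batch, n_heads, q_len, d_nope)  # non-RoPE part of Q, already up-projected
W_UK          = torch.randn(n_heads, d_nope, d_c)           # per-head K decompression matrix
compressed_kv = torch.randn(batch, kv_len, d_c)             # cached c_t^{KV}, never decompressed

# K-absorption: fold W^{UK} into the query instead of decompressing K
q_absorbed = torch.einsum('bhqd,hdc->bhqc', q_nope, W_UK)   # (batch, heads, q_len, 512)

# Non-RoPE attention scores computed directly against the compressed cache
scores_nope = torch.einsum('bhqc,blc->bhql', q_absorbed, compressed_kv)  # (batch, heads, q_len, kv_len)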

4. V-Absorption (for Attention Output)

Original computation:

o = a \cdot v_t = a \cdot (W^{UV} \cdot c_t^{KV})
u = W^O \cdot o

Absorbed version using Einstein summation:

# Instead of decompressing V explicitly:
attn_output = torch.einsum('bhql,blc->bhqc', attn_weights, compressed_kv)  # Weighted sum in compressed space
attn_output = torch.einsum('bhqc,hdc->bhqd', attn_output, W_UV)            # Late decompression
u = torch.einsum('hdD,bhqd->bqD', W_O, attn_output)                        # Output projection

Why this helps:

  • Avoids materializing the full multi-head v_t (128 heads × 128 dims = 16,384 values per token vs. the 512-dim c_t^{KV})
  • Fuses decompression with the weighted sum
  • Minimizes intermediate memory (a self-contained sketch follows below)
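To make the snippet above self-contained, a sketch with assumed DeepSeek-V2-like shapes (W_UV, W_O and the dimensions are random stand-ins for the real weights):

import torch

n_heads, d_v, d_c, d_model = 128, 128, 512, 5120    # assumed DeepSeek-V2-like dims
batch, q_len, kv_len = 1, 4, 4096

attn_weights  = torch.softmax(torch.randn(batch, n_heads, q_len, kv_len), dim=-1)
compressed_kv = torch.randn(batch, kv_len, d_c)      # cached latent, shared by all heads
W_UV          = torch.randn(n_heads, d_v, d_c)       # per-head V decompression
W_O           = torch.randn(n_heads, d_v, d_model)   # per-head slice of the output projection

attn_output = torch.einsum('bhql,blc->bhqc', attn_weights, compressed_kv)  # weighted sum in the 512-dim space
attn_output = torch.einsum('bhqc,hdc->bhqd', attn_output, W_UV)            # decompress only the q_len query rows
u = torch.einsum('hdD,bhqd->bqD', W_O, attn_output)                        # output projection, (batch, q_len, 5120)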

5. Move Elision Optimization

Additional problem: Original code concatenates RoPE/non-RoPE parts of Q/K, creating large temporary tensors.

Solution: Compute attention scores in two parts and sum:

attn_weights = (
    torch.matmul(q_pe, k_pe.transpose(2, 3))   # RoPE part
    + torch.einsum('bhqc,blc->bhql', q_nope, compressed_kv)  # Non-RoPE (absorbed)
)
  • Eliminates the memory-hungry concatenation ops
  • Avoids storing full 192-dimensional (128 non-RoPE + 64 RoPE) Q/K tensors (a shape-annotated sketch follows below)
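A shape-annotated version of the two-part score computation above, again with assumed DeepSeek-V2-like dimensions (the 64-dim RoPE key is shared across heads):

import torch

n_heads, d_rope, d_c = 128, 64, 512
batch, q_len, kv_len = 1, 4, 4096

q_pe          = torch.randn(batch, n_heads, q_len, d_rope)  # RoPE part of Q, per head
k_pe          = torch.randn(batch, 1, kv_len, d_rope)       # shared RoPE key, broadcast over heads
q_nope        = torch.randn(batch, n_heads, q_len, d_c)     # non-RoPE Q, already absorbed into the 512-dim space
compressed_kv = torch.randn(batch, kv_len, d_c)

# Sum two partial score tensors instead of concatenating Q/K into 192-dim tensors
attn_weights = (
    torch.matmul(q_pe, k_pe.transpose(2, 3))                 # RoPE part
    + torch.einsum('bhqc,blc->bhql', q_nope, compressed_kv)  # non-RoPE (absorbed) part
)  # (batch, n_heads, q_len, kv_len)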

6. Why Not Pre-Absorb All Matrices?

Theoretically, you could precompute:

W_{\text{new}}^{UQ} = (W^{UQ})^\top W^{UK} \quad \text{and} \quad W_{\text{new}}^O = W^O W^{UV}

But this is inefficient because:

  • W_{\text{new}}^{UQ} would be a large but low-rank matrix (1536×512 per head)
  • W_{\text{new}}^O would be massive (5120×512 per head)
  • The actual implementation (sequential small matrix multiplications) is faster and stores fewer weights, as the rough comparison below shows
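A rough back-of-the-envelope comparison of multiply-adds per head and per query token, using the dimensions above (1536-dim c^Q, 128-dim head, 512-dim latent):

d_cq, d_head, d_c = 1536, 128, 512

sequential   = d_cq * d_head + d_head * d_c   # apply W^{UQ}, then W^{UK}: 262,144 MACs
pre_absorbed = d_cq * d_c                     # one 1536x512 matmul:       786,432 MACs

print(pre_absorbed / sequential)              # 3.0: pre-absorbing triples the work (and the weight storage)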

7. Performance Impact

| Optimization | KV Cache Size (per token, per layer) | Compute (MFLOP/token) | Speedup (vs. baseline) |
|---|---|---|---|
| Baseline (CacheDecompressed) | 81.92 KB | 0.08 | 1.0× |
| CacheCompressed (no absorption) | 1.15 KB | 33.64 | 0.3× |
| Absorption + Move Elision | 1.15 KB | 0.28 | 1.2× |

Key wins:

  • 71× smaller KV cache than the decompressed version
  • 120× less computation than non-absorbed compression
  • Enables larger batch sizes (critical for GPU efficiency); a quick arithmetic check of these ratios follows below
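The headline ratios follow directly from the table above:

print(81.92 / 1.15)   # ≈ 71x smaller KV cache than CacheDecompressed
print(33.64 / 0.28)   # ≈ 120x fewer FLOPs than CacheCompressed without absorption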

Summary

Matrix absorption works by:

  1. Exploiting associativity to "push" decompression matrices into adjacent operations
  2. Late decompression of V during output projection
  3. Splitting attention scores to avoid tensor concatenation
  4. Leveraging compressed representations throughout computation

This transforms MLA from a memory-bound problem into a compute-bound one, better utilizing modern GPU tensor cores while maintaining 98.6% KV cache compression.


🗣️ Discussion

👤 ikawrakow replied the 2025-07-11 at 12:16:34:

@magikRUKKOLA

You may want to check #246, #260, #273.

As far as I can tell, #246, which explains the basic idea of reducing the amount of multiply-adds when using MLA, precedes the linked doc by about a month, and is surprisingly similar to what they wrote.

#260 explains the -amb option, which limits the amount of intermediate compute buffer storage required.

#273 is the best MLA version in ik_llama.cpp. The MLA=2 variant (explained in #246) is used for prompt processing, the original MLA (MLA=1) is used for token generation. The main reason it took a while to arrive at #273 was the struggle to implement the MLA=1 case efficiently on CUDA (and the struggle was due to the much larger than usual attention head sizes of 576 and 512).

If you look at all merged PRs, you will see that it has been quite a journey to arrive at what we have today for doing fast DeepSeek inference.


👤 ubergarm replied the 2025-07-11 at 22:10:57:

A new model with MLA just dropped only 1000B-A32B https://huggingface.co/moonshotai/Kimi-K2-Instruct .... 😭 lol...

👤 magikRUKKOLA replied the 2025-07-11 at 23:35:48:
@ubergarm

A new model with MLA just dropped only 1000B-A32B https://huggingface.co/moonshotai/Kimi-K2-Instruct .... 😭 lol...

Paper Link (co**mm**ing soon)

Yeah, I am so excited too! :D So the minimum requirements are 512GB RAM and 48GB VRAM to run some IQ2 quant lol. (?) I guess it's time to upgrade.

quote:

Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.

I suggest that how to set up tool usage with ik_llama.cpp should be documented somewhere. Basically we need a MITM tool to translate JSON <-> TOOL_CALL_TOKENS. And that's about it.

👤 ewhacc replied the 2025-07-12 at 09:59:24:
@ubergarm Are you going to cook quants? ^^; It uses the DeepSeek architecture, so I'm hoping it runs in ik_llama.cpp flawlessly.

I have 512G RAM and would like to test IQ2. I thought 256G is the best because using 512G (with higher bits) is too slow. I was wrong. Kimi-K2 keeps the number of active experts the same but almost doubles the total weights. I guess tg speed will be about the same, but pp will be slower.

I'm downloading original FP8 now. I don't know why I'm doing this... ^^

👤 ubergarm replied the 2025-07-12 at 15:43:55:
@ewhacc

I haven't looked to see if existing methods for going from fp8 safetensors to bf16 GGUFs would work on that model yet. I use the evshiron llama.cpp fork (from fairydreaming's original MLA fork) plus triton-cpu to convert DeepSeek 671B without a GPU on a big RAM box. That is the first challenge.

Next you'll need over 1TB RAM to inference the Q8_0 to make an imatrix. I don't have access to the big RAM box right now, so I can't do this step at the moment. Plus it's a pain to free up like 4TB of disk space lol...

Keep us posted, I'm sure people will want to run this monster eventually

👤 ubergarm replied the 2025-07-12 at 15:45:52:
@magikRUKKOLA

I suggest that how to set up tool usage with ik_llama.cpp should be documented somewhere. Basically we need a MITM tool to translate JSON <-> TOOL_CALL_TOKENS. And that's about it.

One guy put together a function calling wrapper thing, not sure if it is applicable here: https://github.com/ikawrakow/ik_llama.cpp/issues/407#issuecomment-2953602943

I haven't tried it personally.

👤 magikRUKKOLA replied the 2025-07-12 at 20:58:08:

@magikRUKKOLA

One guy put together a function calling wrapper thing, not sure if it is applicable here: #407 (comment)

Yeah, I noticed. I suggest some docs should be created on how to provide a frontend for ik_llama.cpp that supports tool calling. But first let me see which solution would be the most elegant.

👤 magikRUKKOLA replied the 2025-07-12 at 21:05:55:
@ewhacc

I have 512G RAM and would like to test IQ2.

I just noticed that IQ4_KS_R4 of Deepseek R1 is 368 GiB. So

echo "scale=2;368*(1000/671)"|bc
548.32

So the kimi k2 with a similar quant might fit within the 512 GB RAM. Or, the IQ3 quant should fit.

But... but... something should be done with the attention mechanism (for the prefill) to reduce the VRAM usage. I am currently looking at flashinfer. That is the exact reason for the instability in ktransformers. It's a hurdle. :)

I thought 256G is the best because using 512G (with higher bits) is too slow. I was wrong.

Yeah, I made the same mistake. Small tip/note -- if you choose to use DDR4, don't buy 3200 MT/s (unless it's for Lenovo machines). Samsung 2666 MT/s ECC overclocks great at 1.35 V with crazy timings. But you will have to install additional fans and heatsinks on top of the RAM. Also, the Gigabyte MC62-G40-00 sucks -- it doesn't allow overclocking.

👤 magikRUKKOLA replied the 2025-07-13 at 14:09:14:
621GB Q4_K quant dropped!

https://huggingface.co/KVCache-ai/Kimi-K2-Instruct-GGUF

Can't wait for a Q3 quant to try out on 512GB RAM. :) Also setting up water cooling for the four RTX 3090s so all four can be connected without risers (to support as much context as possible).


👤 ewhacc replied the 2025-07-13 at 11:25:36:

@ubergarm

I haven't looked to see if existing methods for going from fp8 safetensors to bf16 GGUFs would work on that model yet. I use the evshiron llama.cpp fork (from fairydreaming's original MLA fork) plus triton-cpu to convert DeepSeek 671B without a GPU on a big RAM box. That is the first challenge.

I just tried fp8_cast_bf16.py but got a VRAM OOM. I didn't think this would be a big challenge, but the first step is getting tough. I will try with more VRAM, and perhaps will try the evshiron llama.cpp fork too. Thanks a lot for the help. I'm just giving your recipes a try.

Next you'll need over 1TB RAM to inference the Q8_0 to make an imatrix.

Hmm, this is what I was worried about and wanted to ask. Well, time to wake up my Xeon box (it's too loud). BTW, isn't it possible to make the imatrix directly from BF16? Is making Q8_0 a must? Ha ha, it's a long way to go: FP8 -> BF16 -> Q8_0 -> imatrix -> Q2

Edit: I'm trying evshiron llama.cpp, which seems to have a direct conversion from fp8 to q8_0.

Edit: Failed to get Q8_0. I don't know if it needs 1T RAM, but it doesn't seem to be a RAM problem (tried on 512G):

python ev_llama.cpp/convert_hf_to_gguf.py Kimi-K2-Instruct --outfile Kimi-K2-Instruct-q8 --outtype q8_0
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)

👤 ubergarm replied the 2025-07-13 at 16:29:49:
@ewhacc

I just tried fp8_cast_bf16.py but got VRAM OOM.

Right, with the fp8_cast_bf16.py script from the DeepSeek approach the pipeline is quite long: fp8 safetensors -> bf16 safetensors -> bf16 GGUF -> Q8_0 -> imatrix -> Q2. I believe this is the method used for mainline MLA quants of DeepSeek. Not sure if it works for the slightly different Kimi-K2 1000B-A32B architecture or not.

Regarding OOMing with this method, I just dug up some notes from a discussion with fairydreaming about using triton-cpu instead, so the conversion runs in RAM without a GPU. Also found a patch that might prevent VRAM OOM on 4090 series cards here on Hugging Face.

BTW, isn't it possible to make imatrix directly from BF16?

Yes, if you can run inference with the 2TB VRAM+RAM bf16 GGUF, then you could use it directly for the imatrix. I haven't tested the quality difference in terms of perplexity, but I believe the Q8_0 is sufficient given it is quite similar to the native fp8.

I'm trying evshiron llama.cpp, which seems to have a direct conversion from fp8 to q8_0.

Yes, this is my usual method. Not sure it would work with Kimi-K2, though, without some modifications. I assume you got triton-cpu to build (this is one of the more difficult steps of the process). Notes on building triton-cpu are here, where @saood06 helped fix a build bug for them.

My script to convert the fp8 safetensors directly to bf16 GGUF is:

# evshiron/llama.cpp@63b7e8aa
source venv/bin/activate
python \
      llama.cpp/convert_hf_to_gguf.py \
      --outtype bf16 \
      --split-max-size 50G \
      --outfile /models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/ \
      /models/tngtech/DeepSeek-TNG-R1T2-Chimera/

If you're still getting that error, you might have to poke around in convert_hf_to_gguf.py and search for where it mentions triton in the deepseek-v3 part. You might also need to look at the recent Kimi-K2 PR https://github.com/ggml-org/llama.cpp/pull/14654 and add that to the evshiron fork or something.

I don't have access to enough RAM at the moment. Maybe will in the next few weeks 🤞

Thanks for blazing the trail! And feel free to open a new discussion/issue specific to Kimi-K2 etc...

👤 magikRUKKOLA replied the 2025-07-13 at 18:17:56:

I don't have access to enough RAM at the moment. Maybe will in the next few weeks 🤞

Hey bro, are you in EU? I can drop you some 1TB DDR5 RAM with a huge discount.

👤 ubergarm replied the 2025-07-13 at 18:38:54:
@magikRUKKOLA

Oh man, thanks for the offer, but no, I'm on the east coast of the USA currently. Wendell at level1techs.com is hooking me up with access to a new remote rig he's assembling, a big dual-socket 1.5TB beast that should be online sooner than I expected!

👤 ewhacc replied the 2025-07-14 at 00:16:44:
@ubergarm You have gone through this whole tough process. Thanks so much for sharing your experience.

Yes, if you can run inference with the 2TB VRAM+RAM bf16 GGUF, then you could use it directly for the imatrix. I haven't tested the quality difference in terms of perplexity, but I believe the Q8_0 is sufficient given it is quite similar to the native fp8.

Oops, 2TB. Sounds like going through Q8_0 is a must.

👤 ubergarm replied the 2025-07-14 at 02:53:30:
@ewhacc

So Wendell just hooked me up with remote access to a big dual-socket AMD CPU rig with 42TB of Kioxia flash storage that I put into two RAID0 arrays, and almost 1.5TB RAM (no GPUs). So I'm working through it now using the "mainline" method of casting the fp8 safetensors to bf16 safetensors first.

If I can get that working, I'll try to see if it is possible to adapt the evshiron fork to apply the same MLA treatment to Kimi-K2 as it does for the DeepSeek models and do the direct fp8 safetensors -> bf16 GGUF conversion.

A few folks are also working on it here, feel free to join with your findings: https://huggingface.co/gabriellarson/Kimi-K2-Instruct-GGUF/discussions/1

👤 ewhacc replied the 2025-07-14 at 03:12:39:
@ubergarm

A few folks are also working on it here, feel free to join with your findings: https://huggingface.co/gabriellarson/Kimi-K2-Instruct-GGUF/discussions/1

Thanks for inviting. I see you already started there :)


👤 ewhacc replied the 2025-07-13 at 11:30:20:

@magikRUKKOLA

Small tip/note -- if you choose to use DDR4, don't buy 3200 MT/s (unless it's for Lenovo machines). Samsung 2666 MT/s ECC overclocks great at 1.35 V with crazy timings. But you will have to install additional fans and heatsinks on top of the RAM. Also, the Gigabyte MC62-G40-00 sucks -- it doesn't allow overclocking.

Thank you for the tip. Yeah, I have been tempted to overclock DDR4, and even DDR5. But I have to check whether my board allows it. Yes, RAM also needs cooling; my DDR5 gets hot when I use R1.


👤 magikRUKKOLA replied the 2025-07-15 at 19:59:27:

ATTN! The below is not a joke. It's an actual recent commit to flashinfer. Please pay attention:

-        return self.run_return_lse(q, paged_kv_cache, k_scale, v_scale)
+        return self.run_return_lse(q, paged_kv_cache, k_scale=k_scale, v_scale=v_scale)

Let's read the explanation:

fix: correctly pass k_scale and v_scale to run() in forward_return_lse

MORE!

Bug Fix: Corrected an issue in BatchPrefillWithPagedKVCacheWrapper.forward_return_lse where k_scale and v_scale were incorrectly passed as positional arguments instead of keyword arguments to run_return_lse(). This resolves a **silent misbehavior or potential runtime error** caused by functools.partialmethod expecting keyword-only arguments.

the comments from the maintainer!!

Great catch, left some comments for suggestions :)

I mean, this doesn't make sense. I am not really sure it's real.