🔀 #205 - Faster MLA prompt processing
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-02-12 |
| Updated | 2025-02-13 |
Description
This PR speeds up prompt processing (PP) when MLA is enabled. It is still slower than no-MLA, so I'm making this a draft for now while I try some more ideas. Still, it would be great if somebody else tested it to confirm that a) I did not introduce bugs and b) it is indeed faster on their systems.
The PR also adds the changes suggested by @saood06 in the review of #188.
The speedup is achieved by concatenating the no-position-encoding ("nope") and rotational position encoding ("pe") parts of K and Q (this also eliminates the `k_r` cache), which allows us to combine the former `kq_nope` and `kq_pe` matrix multiplications into a single matrix multiplication. This also eliminates the fairly expensive addition of `kq_nope` and `kq_pe`.
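To see why a single multiplication over the concatenated tensors gives the same result: for a query row split into a "nope" and a "pe" part, its dot product with the correspondingly concatenated key row equals the sum of the two separate dot products. Below is a minimal, self-contained C++ sketch of that identity with toy dimensions and plain vectors; it is only an illustration of the math, not the actual ggml tensors or code used in the PR.

```cpp
// Sketch: q . k == q_nope . k_nope + q_pe . k_pe  when q = [q_nope | q_pe], k = [k_nope | k_pe].
// This is the identity that lets the concatenated K/Q replace the two separate
// matrix multiplications (kq_nope and kq_pe) plus their addition.
#include <cstdio>
#include <vector>

// dot product of two equally sized vectors
static float dot(const std::vector<float> & a, const std::vector<float> & b) {
    float s = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// concatenate two vectors (stands in for concatenating the nope and pe parts of a head)
static std::vector<float> concat(const std::vector<float> & a, const std::vector<float> & b) {
    std::vector<float> c(a);
    c.insert(c.end(), b.begin(), b.end());
    return c;
}

int main() {
    // toy values standing in for the nope and rotary parts of one query/key row
    std::vector<float> q_nope = {0.1f, -0.2f,  0.3f}, q_pe = { 0.5f, 0.4f};
    std::vector<float> k_nope = {0.7f,  0.1f, -0.6f}, k_pe = {-0.3f, 0.2f};

    const float two_matmuls = dot(q_nope, k_nope) + dot(q_pe, k_pe);             // old path: kq_nope + kq_pe
    const float one_matmul  = dot(concat(q_nope, q_pe), concat(k_nope, k_pe));   // new path: single product

    printf("kq_nope + kq_pe = %f, concatenated = %f\n", two_matmuls, one_matmul); // both print -0.200000
    return 0;
}
```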
Here is a comparison of PP performance between the main branch and this PR for DeepSeek-Lite quantized with IQ4_XS, running on a Ryzen-7950X with Q8_0 for the K-cache:
| model | test | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|---|
| deepseek2 16B IQ4_XS | pp512 | 478.58 ± 5.14 | 489.40 ± 1.08 | 1.023 |
| deepseek2 16B IQ4_XS | pp1024 | 438.56 ± 0.75 | 458.37 ± 1.51 | 1.045 |
| deepseek2 16B IQ4_XS | pp2048 | 378.95 ± 1.40 | 407.83 ± 2.07 | 1.076 |
| deepseek2 16B IQ4_XS | pp4096 | 294.71 ± 2.86 | 327.88 ± 0.18 | 1.113 |
| deepseek2 16B IQ4_XS | pp8192 | 204.52 ± 0.27 | 234.17 ± 0.37 | 1.145 |
| deepseek2 16B IQ4_XS | pp16384 | 126.31 ± 0.13 | 148.35 ± 0.38 | 1.174 |
TG performance (the whole point of MLA) is not sacrificed. Here are the results of `llama-bench -gp Np,64` for different prompt lengths Np:
| model | test | t/s (main) | t/s (PR) | Speedup |
|---|---|---|---|---|
| deepseek2 16B IQ4_XS | tg64@pp128 | 33.58 ± 0.06 | 33.80 ± 0.00 | 1.007 |
| deepseek2 16B IQ4_XS | tg64@pp256 | 32.67 ± 0.00 | 32.76 ± 0.01 | 1.003 |
| deepseek2 16B IQ4_XS | tg64@pp512 | 32.38 ± 0.08 | 32.68 ± 0.05 | 1.009 |
| deepseek2 16B IQ4_XS | tg64@pp1024 | 31.50 ± 0.02 | 32.02 ± 0.00 | 1.017 |
| deepseek2 16B IQ4_XS | tg64@pp2048 | 30.01 ± 0.01 | 30.31 ± 0.03 | 1.010 |
| deepseek2 16B IQ4_XS | tg64@pp4096 | 27.08 ± 0.03 | 27.54 ± 0.10 | 1.017 |
| deepseek2 16B IQ4_XS | tg64@pp8192 | 22.82 ± 0.00 | 23.12 ± 0.01 | 1.013 |
| deepseek2 16B IQ4_XS | tg64@pp16384 | 17.24 ± 0.00 | 18.74 ± 0.09 | 1.087 |
I'm not sure whether the ~9% improvement at 16k tokens is real. It may just be due to less thermal throttling because the prompt processing part finishes more quickly.
💬 Conversation
👤 saood06 submitted a review on 2025-02-12 at 20:10:21: 💬 COMMENTED
👤 ikawrakow submitted a review on 2025-02-13 at 08:57:48: 💬 COMMENTED
👤 ikawrakow commented during a code review on 2025-02-13 at 08:57:48 on src/llama.cpp:
Thanks. Added a hopefully visible warning.
👤 ikawrakow commented on 2025-02-13 at 09:04:18:
The PR also adds a compile-time option to disable the transposed KV cache when using MLA (simply look for `MLA_USE_TRANSPOSED_CACHE` and set it to 0; a sketch of the toggle pattern follows the table below). This cuts the KV cache size nearly in half at the expense of lower TG performance with long contexts. PP performance stays about the same. Here is a comparison between MLA with and without the transposed cache:
| model | test | t/s (with c^T) | t/s (without c^T) |
|---|---|---|---|
| deepseek2 16B IQ4_XS | tg64@pp128 | 33.58 ± 0.06 | 33.05 ± 0.05 |
| deepseek2 16B IQ4_XS | tg64@pp256 | 32.67 ± 0.00 | 31.54 ± 0.07 |
| deepseek2 16B IQ4_XS | tg64@pp512 | 32.38 ± 0.08 | 30.26 ± 0.33 |
| deepseek2 16B IQ4_XS | tg64@pp1024 | 31.50 ± 0.02 | 28.50 ± 0.01 |
| deepseek2 16B IQ4_XS | tg64@pp2048 | 30.01 ± 0.01 | 24.75 ± 0.01 |
| deepseek2 16B IQ4_XS | tg64@pp4096 | 27.08 ± 0.03 | 20.67 ± 0.09 |
| deepseek2 16B IQ4_XS | tg64@pp8192 | 22.82 ± 0.00 | 14.89 ± 0.01 |