
🔀 #218 - Better strategy for attention matrix multiplications when generating tokens

Author ikawrakow
State Closed
Created 2025-02-21
Updated 2025-02-22

Description

The K*Q and V*softmax(K*Q) matrix multiplications have the shape

$$\left(K \times N_t \times N_k\right) \times \left(K \times N_b \times N_h\right)$$

where K is the head size, N_t is the number of tokens in the cache, N_b is the number of tokens in the current batch, N_k is the number of K (or V) heads, and N_h is the total number of heads. In llama.cpp this tensor multiplication has traditionally been performed as N_h consecutive matrix multiplications, each of shape

$$\left(K \times N_t\right) \times \left(K \times N_b\right)$$

The issue with this is that for token generation (TG) we have N_b = 1, so we are dealing with N_h matrix-vector multiplications, which are notoriously memory-bound and hence limit performance for large cache sizes (long contexts). To add insult to injury, the stride between consecutive rows in the left matrix is not just the row size R but N_k R, so fetching data from memory involves big jumps and sub-optimal cache use, which is not exactly ideal in a memory-bound situation.
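In code, the access pattern looks roughly like this. This is a minimal sketch assuming an unquantized, token-major K cache laid out as N_t × N_k × K floats; the function name and layout are illustrative, not the actual ggml kernels:

```cpp
#include <cstddef>

// Illustrative per-head K*Q during token generation (N_b = 1):
// k_cache: N_t tokens x N_k KV heads x K floats, token-major
// q      : N_h query heads x K floats (the single new token)
// out    : N_h x N_t attention scores
void kq_per_head(const float * k_cache, const float * q, float * out,
                 int K, int N_t, int N_k, int N_h) {
    const int G = N_h / N_k;                                 // query heads per KV head
    for (int h = 0; h < N_h; ++h) {                          // N_h separate mat-vec products
        const float * kh = k_cache + (std::size_t)(h/G)*K;   // this head's slice of a token row
        const float * qh = q + (std::size_t)h*K;
        for (int t = 0; t < N_t; ++t) {
            // consecutive rows of one head are N_k*K floats apart -> strided, cache-unfriendly
            const float * krow = kh + (std::size_t)t*N_k*K;
            float s = 0.0f;
            for (int i = 0; i < K; ++i) s += krow[i]*qh[i];
            out[(std::size_t)h*N_t + t] = s;
        }
    }
}
```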

When N_h > N_k (GQA; in that case N_h is divisible by N_k), PR #207 changed the multiplication strategy to perform N_k matrix multiplications, each of shape $\left(K \times N_t\right) \times \left(K \times N_h/N_k\right)$, thus turning many matrix-vector multiplications into fewer matrix-matrix multiplications. This leads to non-negligible performance gains for long contexts.
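A sketch of that regrouping, under the same assumed layout and with the same caveat that this is illustrative rather than the actual implementation: every K row loaded from memory is now reused for all N_h/N_k query heads that share it.

```cpp
#include <cstddef>

// Illustrative GQA regrouping (PR #207): N_k mat-mat products of shape
// (K x N_t) x (K x G), with G = N_h/N_k, instead of N_h mat-vec products.
void kq_grouped(const float * k_cache, const float * q, float * out,
                int K, int N_t, int N_k, int N_h) {
    const int G = N_h / N_k;
    for (int kh = 0; kh < N_k; ++kh) {                        // one small mat-mat per KV head
        for (int t = 0; t < N_t; ++t) {
            const float * krow = k_cache + ((std::size_t)t*N_k + kh)*K; // load this row once...
            for (int g = 0; g < G; ++g) {                     // ...reuse it for all G query heads
                const int h = kh*G + g;
                const float * qh = q + (std::size_t)h*K;
                float s = 0.0f;
                for (int i = 0; i < K; ++i) s += krow[i]*qh[i];
                out[(std::size_t)h*N_t + t] = s;
            }
        }
    }
}
```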

But when N_h = N_k (e.g., the DeepSeek attention architecture), the above does not work. What we can do instead is perform N_t × N_h dot products, where the inner loop is over N_h and the outer loop is over N_t. When multi-threaded, each thread performs N_t/M × N_h dot products (where M is the number of threads). The advantage of doing this is that memory is accessed consecutively, resulting in better throughput and cache utilization. This is what this PR does.
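A sketch of that loop ordering, again on the same assumed layout and with contiguous token slices per thread as one possible partitioning; this is illustrative, not the actual ggml code:

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative dot-product ordering for N_h = N_k: outer loop over cached
// tokens (partitioned across threads), inner loop over heads, so each thread
// walks its slice of the K cache strictly sequentially.
void kq_token_major(const float * k_cache, const float * q, float * out,
                    int K, int N_t, int N_h, int ith, int nth) {
    const int per_thread = (N_t + nth - 1)/nth;               // this thread's token range
    const int t0 = ith*per_thread;
    const int t1 = std::min(N_t, t0 + per_thread);
    for (int t = t0; t < t1; ++t) {
        const float * krow = k_cache + (std::size_t)t*N_h*K;  // all heads of token t, contiguous
        for (int h = 0; h < N_h; ++h) {
            const float * kh = krow + (std::size_t)h*K;       // consecutive in memory
            const float * qh = q    + (std::size_t)h*K;
            float s = 0.0f;
            for (int i = 0; i < K; ++i) s += kh[i]*qh[i];
            out[(std::size_t)h*N_t + t] = s;
        }
    }
}
```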

To assess the performance impact, I use DeepSeek-Lite quantized with IQ1_S. This minimizes the model size, allowing higher tokens-per-second rates, so the size of the KV cache has a stronger impact on performance. Calculations are on a Ryzen-7950X (Zen4), a Ryzen-5975WX (AVX2), and an M2-Max CPU (NEON). Calculations are without FA, so the changed tensor multiplication strategy is invoked. As performance is also influenced by cache size and quantization type (if the cache is quantized), we examine fp16, Q8_0, Q8_KV and, on Zen4, bf16 for the K-cache (without FA the V cache cannot be quantized).

AVX2

| model | threads | type_k | test | t/s (main) | t/s (PR) | Speedup |
| --- | ---: | --- | --- | ---: | ---: | ---: |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp128 | 40.39 ± 0.03 | 42.76 ± 0.03 | 1.059 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp256 | 37.51 ± 0.00 | 41.37 ± 0.03 | 1.103 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp512 | 32.31 ± 0.01 | 38.63 ± 0.01 | 1.196 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp1024 | 26.64 ± 0.01 | 34.28 ± 0.02 | 1.289 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp2048 | 19.82 ± 0.00 | 27.81 ± 0.01 | 1.403 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp4096 | 13.60 ± 0.01 | 20.57 ± 0.00 | 1.512 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp8192 | 8.38 ± 0.00 | 13.71 ± 0.00 | 1.636 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp16384 | 4.77 ± 0.00 | 8.20 ± 0.00 | 1.719 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp128 | 42.11 ± 0.00 | 42.74 ± 0.02 | 1.015 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp256 | 40.26 ± 0.02 | 41.66 ± 0.02 | 1.035 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp512 | 37.32 ± 0.01 | 39.94 ± 0.01 | 1.070 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp1024 | 32.04 ± 0.00 | 36.32 ± 0.02 | 1.133 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp2048 | 26.42 ± 0.01 | 31.48 ± 0.01 | 1.192 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp4096 | 19.04 ± 0.01 | 24.04 ± 0.01 | 1.263 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp8192 | 12.44 ± 0.00 | 16.25 ± 0.01 | 1.306 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp16384 | 6.88 ± 0.00 | 10.23 ± 0.00 | 1.487 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp128 | 42.77 ± 0.01 | 43.70 ± 0.01 | 1.022 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp256 | 41.07 ± 0.00 | 42.23 ± 0.00 | 1.028 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp512 | 38.53 ± 0.01 | 40.34 ± 0.00 | 1.047 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp1024 | 33.90 ± 0.01 | 37.18 ± 0.02 | 1.097 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp2048 | 27.15 ± 0.02 | 31.71 ± 0.00 | 1.168 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp4096 | 19.88 ± 0.00 | 24.76 ± 0.00 | 1.245 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp8192 | 13.03 ± 0.01 | 16.89 ± 0.01 | 1.296 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp16384 | 8.03 ± 0.00 | 10.12 ± 0.00 | 1.260 |

NEON (M2-Max CPU)

| model | threads | type_k | test | t/s (main) | t/s (PR) | Speedup |
| --- | ---: | --- | --- | ---: | ---: | ---: |
| deepseek2 16B IQ1_S | 8 | fp16 | tg128@pp128 | 56.84 ± 0.05 | 58.21 ± 0.05 | 1.024 |
| deepseek2 16B IQ1_S | 8 | fp16 | tg128@pp256 | 54.55 ± 0.01 | 57.45 ± 0.07 | 1.053 |
| deepseek2 16B IQ1_S | 8 | fp16 | tg128@pp512 | 50.99 ± 0.04 | 55.47 ± 0.11 | 1.088 |
| deepseek2 16B IQ1_S | 8 | fp16 | tg128@pp1024 | 44.53 ± 0.48 | 51.93 ± 0.01 | 1.166 |
| deepseek2 16B IQ1_S | 8 | fp16 | tg128@pp2048 | 35.92 ± 0.02 | 45.80 ± 0.02 | 1.275 |
| deepseek2 16B IQ1_S | 8 | fp16 | tg128@pp4096 | 25.96 ± 0.01 | 37.36 ± 0.00 | 1.439 |
| deepseek2 16B IQ1_S | 8 | fp16 | tg128@pp8192 | 16.38 ± 0.11 | 27.21 ± 0.03 | 1.661 |
| deepseek2 16B IQ1_S | 8 | q8_KV | tg128@pp128 | 57.73 ± 0.28 | 58.10 ± 0.65 | 1.006 |
| deepseek2 16B IQ1_S | 8 | q8_KV | tg128@pp256 | 56.40 ± 0.22 | 57.27 ± 0.02 | 1.015 |
| deepseek2 16B IQ1_S | 8 | q8_KV | tg128@pp512 | 53.61 ± 0.41 | 55.95 ± 0.31 | 1.044 |
| deepseek2 16B IQ1_S | 8 | q8_KV | tg128@pp1024 | 49.15 ± 0.12 | 54.00 ± 0.03 | 1.099 |
| deepseek2 16B IQ1_S | 8 | q8_KV | tg128@pp2048 | 41.54 ± 0.12 | 48.59 ± 0.14 | 1.170 |
| deepseek2 16B IQ1_S | 8 | q8_KV | tg128@pp4096 | 31.24 ± 0.00 | 41.31 ± 0.03 | 1.322 |
| deepseek2 16B IQ1_S | 8 | q8_KV | tg128@pp8192 | 21.75 ± 0.01 | 31.66 ± 0.01 | 1.456 |

Zen4 (Ryzen-7950X)

| model | threads | type_k | test | t/s (main) | t/s (PR) | Speedup |
| --- | ---: | --- | --- | ---: | ---: | ---: |
| deepseek2 16B IQ1_S | 16 | bf16 | tg128@pp128 | 48.84 ± 0.08 | 49.32 ± 0.31 | 1.010 |
| deepseek2 16B IQ1_S | 16 | bf16 | tg128@pp256 | 46.17 ± 0.27 | 47.52 ± 0.60 | 1.029 |
| deepseek2 16B IQ1_S | 16 | bf16 | tg128@pp512 | 41.76 ± 0.17 | 44.86 ± 0.14 | 1.074 |
| deepseek2 16B IQ1_S | 16 | bf16 | tg128@pp1024 | 36.58 ± 0.38 | 38.99 ± 0.13 | 1.066 |
| deepseek2 16B IQ1_S | 16 | bf16 | tg128@pp2048 | 29.55 ± 0.03 | 33.11 ± 0.15 | 1.120 |
| deepseek2 16B IQ1_S | 16 | bf16 | tg128@pp4096 | 20.95 ± 0.17 | 24.87 ± 0.25 | 1.187 |
| deepseek2 16B IQ1_S | 16 | bf16 | tg128@pp8192 | 14.55 ± 0.48 | 16.72 ± 0.13 | 1.149 |
| deepseek2 16B IQ1_S | 16 | bf16 | tg128@pp16384 | 9.11 ± 0.00 | 10.14 ± 0.00 | 1.113 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp128 | 48.25 ± 0.42 | 49.61 ± 0.41 | 1.028 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp256 | 45.62 ± 0.04 | 47.76 ± 1.06 | 1.047 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp512 | 42.08 ± 0.22 | 45.34 ± 0.05 | 1.077 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp1024 | 37.14 ± 0.20 | 39.65 ± 0.00 | 1.068 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp2048 | 29.74 ± 0.23 | 33.98 ± 0.05 | 1.142 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp4096 | 21.98 ± 0.03 | 25.09 ± 0.05 | 1.141 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp8192 | 14.59 ± 0.07 | 16.92 ± 0.03 | 1.160 |
| deepseek2 16B IQ1_S | 16 | fp16 | tg128@pp16384 | 9.52 ± 0.00 | 10.10 ± 0.00 | 1.061 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp128 | 49.87 ± 0.10 | 50.47 ± 0.21 | 1.012 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp256 | 46.89 ± 0.53 | 49.02 ± 0.16 | 1.045 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp512 | 44.08 ± 0.41 | 46.57 ± 0.25 | 1.056 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp1024 | 40.59 ± 0.09 | 42.50 ± 0.02 | 1.047 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp2048 | 34.32 ± 0.04 | 37.55 ± 0.18 | 1.094 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp4096 | 26.09 ± 0.99 | 29.50 ± 0.06 | 1.131 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp8192 | 19.43 ± 0.35 | 20.64 ± 0.04 | 1.062 |
| deepseek2 16B IQ1_S | 16 | q8_KV | tg128@pp16384 | 11.48 ± 0.00 | 13.03 ± 0.00 | 1.135 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp128 | 50.69 ± 0.17 | 50.70 ± 0.02 | 1.000 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp256 | 48.54 ± 0.15 | 49.55 ± 0.12 | 1.021 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp512 | 45.99 ± 0.11 | 46.98 ± 0.03 | 1.022 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp1024 | 42.85 ± 0.06 | 42.35 ± 0.05 | 0.988 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp2048 | 37.02 ± 0.11 | 37.57 ± 0.03 | 1.015 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp4096 | 29.10 ± 0.07 | 29.63 ± 0.00 | 1.018 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp8192 | 20.55 ± 0.09 | 20.71 ± 0.12 | 1.008 |
| deepseek2 16B IQ1_S | 16 | q8_0 | tg128@pp16384 | 12.91 ± 0.00 | 13.06 ± 0.00 | 1.012 |