🔀 #65 - Adding SWIGLU unary op
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-09-28 |
| Updated | 2024-09-28 |
Description
Phi-3(.5) (and also ChatGLM) uses a "SWIGLU" operation in its FFN. There is nothing special about "SWIGLU": the ffn_up tensor is simply a combination of the usual ffn_up and ffn_gate tensors, where in each row the first half contains the ffn_up weights and the second half the ffn_gate weights. So, to implement
silu(ffn_up * A) * (ffn_gate * A)
(A being the activations passed into the FFN), which is common to many LLMs, one needs swiglu(ffn_up * A). In typical ggml style, instead of adding a dedicated op for this, ggml models it as 4 (!) operations:
```
x1 = ggml_cont(ffn_up, first row half)
x2 = ggml_cont(ffn_up, second row half)
x3 = ggml_silu(x1)
x4 = ggml_mul(x2, x3)
```
ggml_cont(x) is basically a copy operation. The result of this is that on my Ryzen-7950X CPU more than 5% (!) of PP time is spent in ggml_cont, i.e., in completely unnecessary copies¹.
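Written out against the stock ggml API, the four-step decomposition looks roughly like the sketch below (my own rendering, not a quote of llama.cpp; the variable names and the assumption that `cur` holds the combined matmul result with the up half first are mine):
```c
// Splitting the combined (ffn_up * A) result and applying the SiLU gate with
// stock ggml ops. cur holds the combined matmul result; its first dimension is
// 2*n_ff, with the up half followed by the gate half (names/layout assumed here).
const int64_t n_ff = cur->ne[0] / 2;

// ggml_cont copies each half into a fresh contiguous tensor
struct ggml_tensor * x1 = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_ff, n_tokens, cur->nb[1], 0));                           // up half
struct ggml_tensor * x2 = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_ff, n_tokens, cur->nb[1], n_ff*ggml_element_size(cur))); // gate half
struct ggml_tensor * x3 = ggml_silu(ctx0, x1);   // silu(up)
cur = ggml_mul(ctx0, x3, x2);                    // silu(up) * gate
```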
To remedy this unfortunate ggml implementation detail, this PR adds a dedicated ggml_swiglu operation, implemented for the CPU, CUDA, and Metal back-ends. We get
- ~4% PP speedup on the CPU (Ryzen-7950X, Ryzen-5975WX, M2-Max)
- ~3% PP speedup on Metal (M2-Max GPU)
- ~12% PP speedup on CUDA (RTX-4080)
- ~1-2% speedup for TG on all tested platforms
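For reference, the per-row computation a fused swiglu kernel has to perform is tiny. Below is a minimal CPU-style sketch of it (my own illustration under the layout described above, not the actual kernel added by this PR; the function name and signature are made up):
```c
#include <math.h>
#include <stdint.h>

// Fused swiglu over one row of the combined (ffn_up * A) result:
// the row is [up_0 .. up_{n-1}, gate_0 .. gate_{n-1}] and the output is
// silu(up_i) * gate_i, with no intermediate copies of the two halves.
static void swiglu_row_f32(const float * x, float * dst, int64_t n_half) {
    for (int64_t i = 0; i < n_half; ++i) {
        const float up   = x[i];
        const float gate = x[i + n_half];
        const float silu = up / (1.0f + expf(-up));   // silu(u) = u * sigmoid(u)
        dst[i] = silu * gate;
    }
}
```
In the compute graph, the four ops above then collapse into a single unary op applied to the combined matmul result.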
Of note: Phi-3.5 has been trained in bf16. To make sure that my ggml_swiglu implementation is correct, I ran a full Wikitext2 perplexity calculation on the CPU. The Ryzen-7950X CPU has native bf16 support, so I used a GGUF converted directly to bf16 from the safetensors on HF. As FA with a bf16 KV-cache is slightly faster when there is native bf16 support, I also used that. The final PPL for a context of 512 tokens is 6.5556. In comparison, the fp16 CUDA result is 6.5816. The difference is small, but definitely outside what one would expect from numerical roundoff errors alone. I guess there are a few model weights in Phi-3.5-mini, as well as some activations, that fall outside the fp16 range.
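To make that last point concrete, here is a small standalone illustration (mine, not part of the PR): bf16 keeps fp32's 8-bit exponent, while fp16 saturates near 65504, so any weight or activation beyond that magnitude survives a bf16 round trip but cannot be represented in fp16.
```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Truncating fp32 -> bf16 -> fp32 round trip (bf16 is just the top 16 bits of fp32).
static float bf16_round_trip(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u &= 0xffff0000u;              // keep sign, 8-bit exponent, top 7 mantissa bits
    float y;
    memcpy(&y, &u, sizeof y);
    return y;
}

int main(void) {
    const float fp16_max = 65504.0f;   // largest finite fp16 value
    const float v        = 1.0e5f;     // magnitude outside the fp16 range
    printf("bf16 round trip of %g -> %g (still finite)\n", (double) v, (double) bf16_round_trip(v));
    printf("representable in fp16: %s\n", v <= fp16_max ? "yes" : "no (would overflow to +inf)");
    return 0;
}
```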
===
¹ Phi-3(.5) also uses a combined QKV tensor, which triggers additional ggml_cont operations as implemented in llama.cpp:
```c
cur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wqkv, attn_norm_output); // this is the QKV * A matrix multiplication
Qcur = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_embd, n_tokens, cur->nb[1], 0 * sizeof(float) * (n_embd)));
Kcur = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_embd_gqa, n_tokens, cur->nb[1], 1 * sizeof(float) * (n_embd)));
Vcur = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_embd_gqa, n_tokens, cur->nb[1], 1 * sizeof(float) * (n_embd + n_embd_gqa)));
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
```
The ggml_reshape_3d op requires the tensor being reshaped to be contiguous, so Qcur and Kcur are created by copying the appropriate data out of QKV * A. The Vcur copy is completely unnecessary. The exact same result can be achieved, without using any copies, via
```c
Qcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head,    n_tokens, n_embd_head*sizeof(float), cur->nb[1], 0 * sizeof(float) * (n_embd));
Kcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head_kv, n_tokens, n_embd_head*sizeof(float), cur->nb[1], 1 * sizeof(float) * (n_embd));
Vcur = ggml_view_2d(ctx0, cur, n_embd_gqa, n_tokens, cur->nb[1], 1 * sizeof(float) * (n_embd + n_embd_gqa));
```
This results in an additional 2-3% speedup of PP-512(Phi-3.5-mini) when running on the CPU. Unfortunately CUDA becomes massively slower, so I need to investigate and hence have left this change for a future PR.
💬 Conversation
👤 ikawrakow commented on 2024-09-28 at 10:07:59:
OK, Phi-3.5 has a 128k context, so let's run a benchmark with a longer context, say, 8k tokens. Here is what I get after this PR on a Ryzen-7950X CPU for Phi-3.5-mini:
| model | size | backend | threads | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| phi3 3B BF16 | 7.12 GiB | CPU | 16 | - | - | 0 | pp8192 | 218.01 ± 0.37 |
| phi3 3B BF16 | 7.12 GiB | CPU | 16 | bf16 | bf16 | 1 | pp8192 | 307.62 ± 1.23 |
Mainline llama.cpp has no performant bf16 support, so we need to use fp16 (bf16 will run, but it is extremely slow). Here is what I get with the llama.cpp version from this morning (build: 44f59b43 (3829)):
| model | size | backend | threads | fa | test | t/s |
|---|---|---|---|---|---|---|
| phi3 3B F16 | 7.12 GiB | CPU | 16 | 1 | pp8192 | 32.28 ± 0.01 |
| phi3 3B F16 | 7.12 GiB | CPU | 16 | 0 | pp8192 | 81.05 ± 0.05 |
The best configuration here (FA with a bf16 K- and V-cache) is 3.8X faster than the best llama.cpp has to offer (no FA). Our FA speeds things up by 41%; llama.cpp's FA slows things down 2.5X. A user who has not taken the time to investigate FA performance in llama.cpp, and who is running on a Zen4 CPU, will observe a 9.5X difference in processing speed between here and mainline llama.cpp.