### 🔀 [#65](https://github.com/ikawrakow/ik_llama.cpp/pull/65) - Adding SWIGLU unary op
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-09-28 |
| **Updated** | 2024-09-28 |
---
#### Description
Phi-3(.5) (and also ChatGLM) uses a "SWIGLU" operation in its FFN. There is nothing special about "SWIGLU": the `ffn_up` tensor is simply a combination of the usual `ffn_up` and `ffn_gate` tensors, where in each row the first half contains the `ffn_up` weights and the second half contains the `ffn_gate` weights. Hence, to implement
```
silu(ffn_up * A) * (ffn_gate * A)
```
(where `A` are the activations passed into the FFN), which is common to many LLMs, one needs `swiglu(ffn_up * A)`. In typical `ggml` style, instead of adding a dedicated op for this, `ggml` models it as 4 (!) operations:
```
x1 = ggml_cont(ffn_up, first row half)
x2 = ggml_cont(ffn_up, second row half)
x3 = ggml_silu(x1)
x4 = ggml_mul(x2, x3)
```
`ggml_cont(x)` is basically a copy operation. As a result, on my Ryzen-7950X CPU more than 5% (!) of PP time is spent in `ggml_cont`, i.e., in completely unnecessary copies.<sup>1</sup>
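For reference, the fused operation itself is trivial. Here is a minimal CPU sketch (an illustration only, not the actual `ggml_swiglu` kernel), assuming `x` points to one row of the `ffn_up * A` result with the `n` "up" values first and the `n` "gate" values second:

```
#include <math.h>

// Per-row SwiGLU reference: y[i] = silu(up[i]) * gate[i], computed in one pass
// without materializing the two halves as separate tensors.
static void swiglu_row_ref(const float * x, float * y, int n) {
    for (int i = 0; i < n; ++i) {
        const float up   = x[i];                       // first half of the row
        const float gate = x[i + n];                   // second half of the row
        const float silu = up / (1.0f + expf(-up));    // silu(v) = v * sigmoid(v)
        y[i] = silu * gate;
    }
}
```

A single pass like this needs no intermediate tensors, which is where the savings over the `cont`/`silu`/`mul` chain come from.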
To remedy this unfortunate `ggml` implementation detail, this PR adds a dedicated `ggml_swiglu` operation, implemented for the CPU, CUDA, and Metal back-ends. We get:
* ~4% PP speedup on the CPU (Ryzen-7950X, Ryzen-5975WX, M2-Max)
* ~3% PP speedup on Metal (M2-Max GPU)
* ~12% PP speedup on CUDA (RTX-4080)
* ~1-2% speedup for TG on all tested platforms
**Of note**: Phi-3.5 has been trained in `bf16`. To make sure that my `ggml_swiglu` implementation is correct, I ran a full Wikitext2 perplexity calculation on the CPU. The Ryzen-7950X has native `bf16` support, so I used a GGUF converted directly to `bf16` from the safetensors on HF. As FA with a `bf16` KV-cache is slightly faster when there is native `bf16` support, I used that as well. The final PPL for a context of 512 tokens is `6.5556`. In comparison, the `fp16` CUDA result is `6.5816`. The difference is small, but definitely outside of what one would expect from numerical roundoff alone. I guess there are a few model weights in Phi-3.5-mini, as well as some activations, that fall outside the `fp16` range.
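As a quick illustration of the range argument (generic C, not `ggml` code): `bf16` keeps the 8 exponent bits of `fp32`, while `fp16` has only 5 and tops out at 65504, so a value around `1e5` survives a `bf16` round trip but cannot be represented in `fp16` at all:

```
#include <stdio.h>
#include <stdint.h>
#include <string.h>

// Truncate an fp32 value to bf16 precision (keep the top 16 bits) and back.
static float bf16_round_trip(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof(u));
    u &= 0xffff0000u;
    memcpy(&x, &u, sizeof(x));
    return x;
}

int main(void) {
    const float a = 1.0e5f;   // above the largest finite fp16 value (65504)
    printf("bf16 round trip of %g -> %g (still finite)\n", a, bf16_round_trip(a));
    printf("fp16 has no representation for %g; conversion overflows to +inf\n", a);
    return 0;
}
```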
===
<sup>1</sup> Phi-3(.5) also uses a combined `QKV` tensor, which triggers additional `ggml_cont` operations as implemented in `llama.cpp`:
```
cur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wqkv, attn_norm_output); // this is the QKV * A matrix multiplication

Qcur = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_embd,     n_tokens, cur->nb[1], 0 * sizeof(float) * (n_embd)));
Kcur = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_embd_gqa, n_tokens, cur->nb[1], 1 * sizeof(float) * (n_embd)));
Vcur = ggml_cont(ctx0, ggml_view_2d(ctx0, cur, n_embd_gqa, n_tokens, cur->nb[1], 1 * sizeof(float) * (n_embd + n_embd_gqa)));

Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head,    n_tokens);
Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
```
The `ggml_reshape_3d` op requires the tensor being reshaped to be contiguous, so `Qcur` and `Kcur` are created by copying the appropriate data out of `QKV * A`. The `Vcur` copy is completely unnecessary. The exact same result can be achieved, without any copies, via
```
Qcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head,    n_tokens, n_embd_head*sizeof(float), cur->nb[1], 0 * sizeof(float) * (n_embd));
Kcur = ggml_view_3d(ctx0, cur, n_embd_head, n_head_kv, n_tokens, n_embd_head*sizeof(float), cur->nb[1], 1 * sizeof(float) * (n_embd));
Vcur = ggml_view_2d(ctx0, cur, n_embd_gqa, n_tokens, cur->nb[1], 1 * sizeof(float) * (n_embd + n_embd_gqa));
```
This results in an additional 2-3% PP-512 speedup for Phi-3.5-mini when running on the CPU. Unfortunately, CUDA becomes massively slower with this change, so I need to investigate and have left it for a future PR.
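To make the stride argument concrete, here is a small standalone check (illustrative sizes, not Phi-3's actual dimensions) that the copy-free Q view addresses exactly the floats the `ggml_cont` + `ggml_reshape_3d` pair would have copied:

```
#include <assert.h>
#include <stdlib.h>

int main(void) {
    const int n_embd_head = 4, n_head = 3, n_tokens = 2;
    const int row = n_embd_head * n_head + 5;      // fused row also holds the K and V parts
    float * cur = malloc(sizeof(float) * row * n_tokens);
    for (int i = 0; i < row * n_tokens; ++i) cur[i] = (float) i;

    const size_t nb1_cur = row * sizeof(float);    // byte stride between token rows (cur->nb[1])
    const char * base    = (const char *) cur;

    for (int i2 = 0; i2 < n_tokens; ++i2)          // token
    for (int i1 = 0; i1 < n_head; ++i1)            // head
    for (int i0 = 0; i0 < n_embd_head; ++i0) {     // position within the head
        // the float that ggml_cont + ggml_reshape_3d would have copied:
        float ref = cur[i2 * row + i1 * n_embd_head + i0];
        // the float the view reads in place, with nb0 = sizeof(float),
        // nb1 = n_embd_head*sizeof(float), nb2 = cur->nb[1]:
        float via_view = *(const float *) (base + i0 * sizeof(float)
                                                + i1 * n_embd_head * sizeof(float)
                                                + i2 * nb1_cur);
        assert(ref == via_view);
    }
    free(cur);
    return 0;
}
```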
---
#### 💬 Conversation
👤 **ikawrakow** commented on **2024-09-28** at **10:07:59**:<br>
OK, Phi-3.5 has a 128k context, so let's run a benchmark with a longer context, say, 8k tokens. Here is what I get after this PR on a Ryzen-7950X CPU for Phi-3.5-mini:
| model | size | backend | threads | type_k | type_v | fa | test | t/s |
| ------------------------------ | ---------: | ---------- | ------: | -----: | -----: | -: | ------------: | ---------------: |
| phi3 3B BF16 | 7.12 GiB | CPU | 16 | - | - | 0 | pp8192 | 218.01 ± 0.37 |
| phi3 3B BF16 | 7.12 GiB | CPU | 16 | bf16 | bf16 | 1 | pp8192 | 307.62 ± 1.23 |
Mainline `llama.cpp` has no performant `bf16` support, so we need to use `fp16` (`bf16` runs, but is extremely slow). Here is what I get with the `llama.cpp` version from this morning (`build: 44f59b43 (3829)`):
| model | size | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------- | ------: | -: | ------------: | -------------------: |
| phi3 3B F16 | 7.12 GiB | CPU | 16 | 1 | pp8192 | 32.28 ± 0.01 |
| phi3 3B F16 | 7.12 GiB | CPU | 16 | 0 | pp8192 | 81.05 ± 0.05 |
The best result here (FA with `bf16` K- and V-cache) is 3.8X faster than the best `llama.cpp` has to offer (no FA). Our FA speeds things up by 41%, while `llama.cpp`'s FA slows things down 2.5X. A user who has not taken the time to investigate FA performance in `llama.cpp`, and is running on a Zen4 CPU, will therefore observe a 9.5X difference in processing speed between here and mainline `llama.cpp`.