### 🔀 [#4](https://github.com/ikawrakow/ik_llama.cpp/pull/4) - Simdify and multi-thread tanh
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-07-27 |
| **Updated** | 2024-07-27 |
---
#### Description
It seemed Gemma-2 performance was lower than expected for its size. Looking at the architecture, I noticed that `tanh` is used in each layer, and then once more at the end for soft-capping the final output. `ggml` had `tanh` set to be computed with a single thread. Combined with `tanh(x)` being a fairly expensive operation, this resulted in a significant fraction of the time being spent in the `tanh` operation.
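
For reference, a minimal sketch of the soft-capping pattern (not the actual `ggml` code; `soft_cap_f32` and the in-place loop are purely illustrative), showing why `tanh` ends up on the hot path: it is evaluated for every element being capped, once per layer and once for the final logits.

```
#include <math.h>

// Illustrative soft-capping helper: cap * tanh(x / cap), applied element-wise.
// The tanhf() call dominates the cost of this loop, which is why a
// single-threaded, scalar tanh shows up so prominently in the profile.
static void soft_cap_f32(float * x, int n, float cap) {
    for (int i = 0; i < n; ++i) {
        x[i] = cap * tanhf(x[i] / cap);
    }
}
```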
After multi-threading `ggml_vec_soft_max_f32` and simd-ifying the `tanh` computation, I observe a 33% gain in prompt processing speed for Gemma-2-9b (!!!). TG (token generation) is of course memory bound, but despite this we still get a ~2% boost at 4 threads (which gives max TG performance on my Ryzen-7950X).
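
As a rough illustration of the multi-threading side (not the actual patch; the `ith`/`nth` row-chunking signature below only mirrors the usual `ggml` convention for spreading an op across the thread pool):

```
// Thread `ith` of `nth` handles every nth row, so an op that used to be
// pinned to a single thread now scales with the number of compute threads.
static void apply_rows_mt(float * dst, const float * src,
                          int nrows, int ncols, int ith, int nth,
                          void (*op)(float * y, const float * x, int n)) {
    for (int r = ith; r < nrows; r += nth) {
        op(dst + (size_t)r*ncols, src + (size_t)r*ncols, ncols);
    }
}
```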
Simd-ifying:

We have
```
tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1)
```
so we can just use Justine Tunney's SIMD implementation for the exponential function.
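
A hedged AVX2 sketch of that identity (assuming a vectorized exponential `v_expf` in the spirit of Justine Tunney's SIMD `expf`; the actual helper name and the vector width used in the patch may differ):

```
#include <immintrin.h>

// Assumed to exist elsewhere: 8-wide single-precision exp().
extern __m256 v_expf(__m256 x);

// tanh(x) = (exp(2*x) - 1) / (exp(2*x) + 1), computed 8 floats at a time.
// Note: for large |x| the naive identity overflows exp(2*x); a production
// version would clamp the argument first.
static inline __m256 v_tanhf(__m256 x) {
    const __m256 one = _mm256_set1_ps(1.0f);
    const __m256 e2x = v_expf(_mm256_add_ps(x, x));
    return _mm256_div_ps(_mm256_sub_ps(e2x, one), _mm256_add_ps(e2x, one));
}
```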