mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-24 00:19:19 +00:00
Add GitHub data (#637)
github-data/pull_requests/4-Simdify and multi-thread tanh.md
### 🔀 [#4](https://github.com/ikawrakow/ik_llama.cpp/pull/4) - Simdify and multi-thread tanh

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-07-27 |
| **Updated** | 2024-07-27 |

---
#### Description
Gemma-2 performance seemed lower than expected for its size. Looking at the architecture, I noticed that `tanh` is used in each layer, and then at the end for soft-capping the final output. `ggml` computed `tanh` with a single thread. Combined with `tanh(x)` being a fairly expensive operation, this resulted in a significant fraction of the time being spent in the `tanh` operation.
After multi-threading `ggml_vec_soft_max_f32` and simd-ifying the `tanh` computation, I observe a 33% gain in prompt processing speed for Gemma-2-9b (!!!). TG is of course memory bound, but despite this we still get a ~2% boost at 4 threads (which gives max TG performance on my Ryzen-7950X).
Simd-ifying:

We have
```
tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1)
```
so we can just use Justine Tunney's SIMD implementation for the exponential function.