ik_llama.cpp/4 - Simdify and multi-thread tanh.md at main - ik_llama.cpp

ikawrakow/ik_llama.cpp

Fork 0

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

Files

Thomas eaa2510a28 Add GitHub data: filename sanitization (#640 )

2025-07-23 13:31:53 +02:00

1.1 KiB

Raw Permalink Blame History

🔀 #4 - Simdify and multi-thread tanh

Author	`ikawrakow`
State	❌ Closed
Created	2024-07-27
Updated	2024-07-27

Description

It seemed Gemma-2 performance is lower than expected for its size. Looking at the architecture, I noticed that tanh is used in each layer, and then at the end for soft-caping the final output. ggml had tanh set to be computed with a single thread. Combined with tanh(x) being a pretty expensive operation, this resulted in a significant fraction of the time being spent in the tanh operation.

After multi-threading ggml_vec_soft_max_f32 and simd-ifying the tanh computation, I observe a 33% gain in prompt processing speed for Gemma-2-9b (!!!) TG is of course memory bound, but despite this, we still get a ~2% boost at 4 threads (which gives max TG performance on my Ryzen-7950X).

Simd-ifying: We have

   tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1)

so we can just use Justine Tunney's SIMD implementation for the exponential function.

1.1 KiB Raw Permalink Blame History

🔀 #4 - Simdify and multi-thread tanh

Description

1.1 KiB

Raw Permalink Blame History