ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-04 11:00:00 +00:00

Files

Kawrakow 8963f383c0 Simdify and multi-thread tanh (#4 )

It seemed Gemma-2 performance is lower than expected for its size.
Looking at the architecture, I noticed that tanh is used in each layer,
and then at the end for softcaping the final output. ggml had tanh
set to be computed with a single thread. Combined with tanh(x) being a
pretty expensive operation, this resulted in a significant fraction
of the time being spent in the tanh operation.

After multi-threading ggml_vec_soft_max_f32 and simd-ifying the
tanh computation, I observe a 33% gain in prompt processing speed (!!!)
TG is of course memory bound, but despite this, we still get a
~2% boost at 4 threads (which gives max TG performance on my
Ryzen-7950X).

Simd-ifying:
We have
   tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1)
so we can just use Justine Tunney's SIMD exp implementation.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-07-27 08:44:18 +02:00

cmake

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

include

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

src

Simdify and multi-thread tanh (#4 )

2024-07-27 08:44:18 +02:00

.gitignore

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00

CMakeLists.txt

Merge mainline llama.cpp (#3 )

2024-07-27 07:55:01 +02:00