* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs (see the example
after this list). I get PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77%
above fp16. In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
products are done, so I need to sort out what he had in mind
* iqk_mul_mat is not implemented.
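
For reference, such a PPL calculation can be run with the perplexity
tool along these lines (model path and test file are illustrative;
older builds name the binary just perplexity):

    ./llama-perplexity -m llama-3.1-8b-iq.gguf -f wiki.test.raw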
It seemed Gemma-2 performance was lower than expected for its size.
Looking at the architecture, I noticed that tanh is used in each layer,
and then at the end for softcapping the final output. ggml was set up
to compute tanh with a single thread. Combined with tanh(x) being a
pretty expensive operation, this resulted in a significant fraction
of the time being spent in the tanh operation.
After multi-threading ggml_vec_soft_max_f32 and SIMD-ifying the
tanh computation, I observe a 33% gain in prompt processing speed (!!!).
TG is of course memory bound, but despite this we still get a
~2% boost at 4 threads (which gives max TG performance on my
Ryzen-7950X). A sketch of the threaded work split follows below.
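
To illustrate the multi-threading, here is a minimal sketch of the
contiguous row split ggml uses to divide an op across threads, applied
to a soft-max row loop. The function and its signature are hypothetical
(not the actual ggml_vec_soft_max_f32); ith/nth follow ggml's
thread-index/thread-count convention:

    #include <math.h>
    #include <stdint.h>

    // Hypothetical example: thread ith of nth handles a contiguous
    // chunk of the nr rows, so no synchronization is needed in the op.
    static void soft_max_rows(int ith, int nth, int64_t nr, int64_t nc,
                              float * dst, const float * src) {
        const int64_t dr  = (nr + nth - 1)/nth;  // rows per thread
        const int64_t ir0 = dr*ith;              // first row for this thread
        const int64_t ir1 = ir0 + dr < nr ? ir0 + dr : nr;
        for (int64_t ir = ir0; ir < ir1; ++ir) {
            const float * x = src + ir*nc;
            float       * y = dst + ir*nc;
            float max = -INFINITY;
            for (int64_t j = 0; j < nc; ++j) if (x[j] > max) max = x[j];
            float sum = 0.0f;
            for (int64_t j = 0; j < nc; ++j) { y[j] = expf(x[j] - max); sum += y[j]; }
            for (int64_t j = 0; j < nc; ++j) y[j] /= sum;
        }
    }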
SIMD-ifying:
We have

    tanh(x) = (exp(2*x) - 1)/(exp(2*x) + 1),

so we can just use Justine Tunney's SIMD exp implementation.
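
A minimal AVX2 sketch of that identity; the scalar v_expf stand-in
below is only there to keep the example self-contained, and in the
real code its place is taken by the vectorized exp:

    #include <immintrin.h>
    #include <math.h>

    // Stand-in for the SIMD exp; in practice this slot is filled by
    // Justine Tunney's vectorized expf.
    static inline __m256 v_expf(__m256 x) {
        float t[8];
        _mm256_storeu_ps(t, x);
        for (int i = 0; i < 8; ++i) t[i] = expf(t[i]);
        return _mm256_loadu_ps(t);
    }

    // tanh(x) = (exp(2x) - 1)/(exp(2x) + 1), 8 lanes at a time.
    static inline __m256 v_tanhf(__m256 x) {
        // tanh saturates to +/-1 long before |x| = 10; clamping keeps
        // exp(2x) from overflowing to inf (inf/inf would give NaN)
        x = _mm256_max_ps(_mm256_min_ps(x, _mm256_set1_ps(10.0f)),
                          _mm256_set1_ps(-10.0f));
        const __m256 one = _mm256_set1_ps(1.0f);
        const __m256 e   = v_expf(_mm256_add_ps(x, x)); // exp(2x)
        return _mm256_div_ps(_mm256_sub_ps(e, one), _mm256_add_ps(e, one));
    }

    static void vec_tanh_f32(int n, float * y, const float * x) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            _mm256_storeu_ps(y + i, v_tanhf(_mm256_loadu_ps(x + i)));
        }
        for (; i < n; ++i) y[i] = tanhf(x[i]); // scalar tail
    }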
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>