* Use mmq_id in mul_mat_id
* Better
* Also use it in the fused up+gate op
* Better -no-fmoe TG on CUDA
Still much slower than -fmoe, but about 20-25% faster than what
we had before.
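For context, a minimal sketch of what the fused up+gate op does, with illustrative names (not the actual ggml/fmoe kernels): the SiLU of the gate projection and the multiply by the up projection happen in a single kernel, so the activated gate tensor never makes a separate round trip through global memory.

```cuda
// Hedged sketch of a fused SiLU(gate) * up kernel: one launch and one
// global write instead of separate activation and multiply ops.
__global__ void fused_silu_gate_up(const float * __restrict__ gate,
                                   const float * __restrict__ up,
                                   float * __restrict__ dst, const int n) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float g = gate[i];
    dst[i] = (g / (1.0f + expf(-g))) * up[i]; // SiLU(g) * up in one pass
}
```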
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Trying to implement quantized fmoe - not working yet
* This works, but is slower than the non-working version
* quantize_mmq_q8_1_id
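The idea behind quantize_mmq_q8_1_id, sketched below with an assumed Q8_1-style layout (32 int8 quants plus a half2 holding the scale d and d*sum): quantization follows the expert routing, so only the rows selected by the ids tensor are quantized, directly into the MMQ input layout. The real kernel differs in layout and details.

```cuda
#include <cuda_fp16.h>

struct block_q8_1 {      // assumed layout: scale/sum pair + 32 quants
    half2  ds;           // x = d (scale), y = d * sum of the quants
    int8_t qs[32];
};

// Hypothetical sketch: quantize only the activation rows selected by
// the expert routing (row_ids); ncols assumed a multiple of 32.
__global__ void quantize_q8_1_id(const float * __restrict__ src,
                                 block_q8_1 * __restrict__ dst,
                                 const int  * __restrict__ row_ids,
                                 const int ncols) {
    const int iblock = blockIdx.x*blockDim.x + threadIdx.x;
    if (iblock >= ncols/32) return;
    const int row = row_ids[blockIdx.y];              // routed source row
    const float * x = src + (int64_t)row*ncols + 32*iblock;

    float amax = 0.0f;
    for (int j = 0; j < 32; ++j) amax = fmaxf(amax, fabsf(x[j]));
    const float d  = amax/127.0f;
    const float id = d > 0.0f ? 1.0f/d : 0.0f;

    block_q8_1 * y = dst + (int64_t)blockIdx.y*(ncols/32) + iblock;
    int isum = 0;
    for (int j = 0; j < 32; ++j) {
        const int8_t q = (int8_t)roundf(x[j]*id);
        y->qs[j] = q;
        isum    += q;
    }
    y->ds = __floats2half2_rn(d, d*isum);             // store d and d*sum
}
```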
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Slight MLA TG performance improvement on CUDA
The low MLA performance on CUDA is due to
the wk_b * q_nope operation.
It turns into n_head matrix multiplications with
n_head separate quantization and GEMV steps.
The associated overhead is just too much for TG
where each GEMV is very fast (512 x 128 = 131 KFLOP
for DeepSeek-Lite, 4X that for DeepSeekV3/R1).
On top of that, each q_nope row was being copied before
quantization; I have now eliminated that copy.
This results in a ~2.5% speedup.
What needs to happen instead is to launch a single
computation that quantizes all heads, and then have
a kernel that does the GEMV for all heads instead of
n_head sequential GEMVs.
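Roughly, the launch structure should look like the sketch below: a single grid covers all heads, with grid.y indexing the head, so one launch replaces n_head sequential ones. This is an fp32 illustration of the launch shape only (the actual kernels work on quantized data), and all names are made up.

```cuda
// One launch for all n_head GEMVs: y[h] = A[h] * x[h].
// Assumes blockDim.x == 32 (one warp per output row).
__global__ void gemv_all_heads(const float * __restrict__ A, // [n_head][nrows][ncols]
                               const float * __restrict__ x, // [n_head][ncols]
                               float       * __restrict__ y, // [n_head][nrows]
                               const int nrows, const int ncols) {
    const int head = blockIdx.y;                    // one grid.y slot per head
    const int row  = blockIdx.x;                    // one block per output row
    const float * a  = A + ((int64_t)head*nrows + row)*ncols;
    const float * xv = x +  (int64_t)head*ncols;

    float sum = 0.0f;
    for (int j = threadIdx.x; j < ncols; j += 32) sum += a[j]*xv[j];
    for (int off = 16; off > 0; off >>= 1)          // warp reduction
        sum += __shfl_down_sync(0xffffffff, sum, off, 32);
    if (threadIdx.x == 0) y[(int64_t)head*nrows + row] = sum;
}
// launch: dim3 grid(nrows, n_head); gemv_all_heads<<<grid, 32>>>(A, x, y, nrows, ncols);
```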
* Slightly better
* CUDA: Quantize non-contiguous tensors
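Schematically, quantizing non-contiguous tensors means the kernel takes explicit byte strides instead of assuming row-major contiguous input, so views such as the per-head q_nope slices can be quantized directly, without the copy mentioned above. A hypothetical fp32 -> int8 per-row variant (the real kernel quantizes to the MMQ/Q8 layouts):

```cuda
// Hedged sketch: per-row absmax quantization over a strided view.
// Assumes blockDim.x == 32 and float-aligned element addresses.
__global__ void quantize_rows_strided(const char * __restrict__ src,
                                      int8_t * __restrict__ dst,
                                      float  * __restrict__ scales,
                                      const int ncols,
                                      const size_t stride_row,   // bytes between rows
                                      const size_t stride_col) { // bytes between elements
    const int row = blockIdx.x;
    const char * base = src + (size_t)row*stride_row;

    float amax = 0.0f;
    for (int j = threadIdx.x; j < ncols; j += 32)
        amax = fmaxf(amax, fabsf(*(const float *)(base + j*stride_col)));
    for (int off = 16; off > 0; off >>= 1)          // warp max reduction
        amax = fmaxf(amax, __shfl_xor_sync(0xffffffff, amax, off, 32));

    const float d  = amax/127.0f;
    const float id = d > 0.0f ? 1.0f/d : 0.0f;
    for (int j = threadIdx.x; j < ncols; j += 32) {
        const float v = *(const float *)(base + j*stride_col);
        dst[(int64_t)row*ncols + j] = (int8_t)roundf(v*id); // contiguous output
    }
    if (threadIdx.x == 0) scales[row] = d;
}
```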
* Much better MLA
It is a total hack, but it works.
* Cleanup
Remove duplicated GEMVs.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Added Johannes' changes, still getting NaNs with quantized k-cache.
Also getting NaNs on Johannes' mainline branch.
* This fixes it
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>