* iq5_ks: basics
* iq5_ks: quantize
* iq5_ks: CUDA dequantize works
* iq5_ks: dot product works on CUDA
* iq5_ks: MMQ works
* iq5_ks: Zen4
* iq5_ks: AVX2
But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.
* iq5_ks: NEON
* iq5_ks: Metal dequantize
* iq5_ks: Metal dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* MMQ for iq4_k: WIP (not working)
* MMQ for iq4_k: working now
* MMQ for iq5_k
* Cleanup
* MMQ for iq5_k: slightly faster
* MMQ for iq6_k
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* New DeepSeek FlashMLA
Does not work because the RoPE portion is stored at the end
in our case, while in mainline it is stored at the beginning,
and the FA kernel assumes the mainline layout (see the sketch
after this list).
* Rearrange MLA K cache so it fits the new CUDA FA implementation
* constexpr and minor changes
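A minimal sketch of the cache-row rearrangement described in the first bullet, assuming the usual DeepSeek MLA split of 512 "NoPE" values plus 64 RoPE values per row; the function name and dimensions are illustrative only, not the repository's actual code.

```cpp
// Our layout per cache row:  [ nope[0..n_nope) | rope[0..n_rope) ]
// Mainline layout:           [ rope[0..n_rope) | nope[0..n_nope) ]
// The FA kernel expects the mainline layout, so rows must be rearranged.
#include <cstdio>
#include <vector>

static void rearrange_row(const float * src, float * dst, int n_nope, int n_rope) {
    // move the RoPE portion from the end of the source row to the front
    for (int i = 0; i < n_rope; ++i) dst[i] = src[n_nope + i];
    // copy the NoPE portion after it
    for (int i = 0; i < n_nope; ++i) dst[n_rope + i] = src[i];
}

int main() {
    const int n_nope = 512, n_rope = 64;          // illustrative DeepSeek sizes
    std::vector<float> src(n_nope + n_rope), dst(n_nope + n_rope);
    for (int i = 0; i < (int)src.size(); ++i) src[i] = (float)i;
    rearrange_row(src.data(), dst.data(), n_nope, n_rope);
    printf("dst[0] = %g (was src[%d])\n", dst[0], n_nope);  // RoPE part now leads
}
```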
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* cuda: Remove unnecessary device to host copy of row ids
We get 3-4% TG speed improvement for DeepSeek-Lite just from that.
* CPU: fix get_rows when SER is used
With smart experts reduction (SER), one potentially uses fewer
experts than specified by the model. This is accomplished by setting
the ID of the not-selected experts to -1. Most of the necessary
stuff was implemented when I added the SER option, but I forgot
to update get_rows() for non-quantized tensors. As a result, we
get random garbage for the weights of the not-selected experts,
which leads to garbage output. This commit fixes it on the CPU
(see the sketch after this list). I'm not quite sure yet why the
GPU is not working.
* CUDA: fix TG with SER
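A minimal sketch of the get_rows() issue described above, for plain f32 rows, assuming not-selected experts are marked with row id -1; zeroing the destination row is just one plausible way to handle it and not necessarily what the actual fix does.

```cpp
// Hedged illustration: with SER, a not-selected expert has row id -1.
// Indexing the source with -1 reads garbage, so skip such rows and leave
// the destination zeroed instead.
#include <cstring>
#include <vector>
#include <cstdio>

static void get_rows_f32(const float * src, int n_cols,
                         const int * row_ids, int n_ids, float * dst) {
    for (int i = 0; i < n_ids; ++i) {
        float * out = dst + (size_t)i * n_cols;
        const int id = row_ids[i];
        if (id < 0) {                      // SER: expert was not selected
            memset(out, 0, n_cols * sizeof(float));
            continue;
        }
        memcpy(out, src + (size_t)id * n_cols, n_cols * sizeof(float));
    }
}

int main() {
    const int n_cols = 4;
    std::vector<float> src = {1,1,1,1, 2,2,2,2, 3,3,3,3};
    const int ids[3] = {2, -1, 0};         // middle expert dropped by SER
    std::vector<float> dst(3 * n_cols);
    get_rows_f32(src.data(), n_cols, ids, 3, dst.data());
    printf("%g %g %g\n", dst[0], dst[4], dst[8]);  // 3 0 1
}
```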
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* CUDA WIP: support for FlashMLA-3
* Much better
The issue was that I did not change the number of warps
used for 3D matrix multiplications (wk_b * kv_cache, MoE),
so we ended up using 4 warps for TG. By going to 1 warp
in these cases, we get a significant boost in TG performance
(tested with DeepSeek-Lite). See the sketch after this list.
* Sadly, the previous commit was wrong
* Finalizing
* Also add these
* Minor
* Minor tweak
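A hedged sketch of the warp-count choice from the "Much better" bullet: during TG there is a single token, so the 3D mat-muls have too little work to keep 4 warps busy. The function name and the exact threshold are assumptions for illustration, not the repository's actual code.

```cpp
// Pick the launch configuration for the 3D mat-muls (wk_b * kv_cache, MoE):
// token generation processes one token at a time, so 1 warp is enough;
// prompt processing has many rows and keeps 4 warps busy.
#include <cstdio>

static int choose_n_warps(int n_tokens) {
    return n_tokens == 1 ? 1 : 4;
}

int main() {
    printf("TG : %d warp(s)\n", choose_n_warps(1));    // 1
    printf("PP : %d warp(s)\n", choose_n_warps(512));  // 4
}
```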
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP
* WIP: still getting illegal memory access
* CUDA: MMQ for iq4_ks now works
~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* FA: provide work buffer for K repacking
* Add header to avoid compiler warnings
* WIP
* WIP
* WIP
* WIP
* Slightly better
* WIP (Zen4)
* WIP
* Try to improve for unusual number of heads/number of threads
* Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA
* Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA
* Use Sum4q4 for q4_0
* WIP
* WIP
* Much better FA TG with q8_0 KV cache
Just repack it even for TG. But do the repacking for k_step rows,
not the whole K tensor.
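A rough sketch of the per-k_step repacking idea from the last bullet, assuming a hypothetical repack_q8_0_chunk() helper; the real repacking routine and FA loop are of course more involved.

```cpp
// Instead of repacking the whole q8_0 K cache up front, repack only the
// k_step rows the flash-attention loop is about to process into a small
// work buffer, so the repacking cost stays negligible even for TG.
#include <vector>
#include <cstdio>

struct RowQ8 { /* one q8_0 row: scales + int8 quants (details omitted) */ };

static void repack_q8_0_chunk(const RowQ8 * src, RowQ8 * dst, int n_rows) {
    // placeholder: the real routine interleaves rows for SIMD-friendly access
    for (int i = 0; i < n_rows; ++i) dst[i] = src[i];
}

static void flash_attn_over_k(const RowQ8 * K, int n_kv, int k_step, RowQ8 * work) {
    for (int k0 = 0; k0 < n_kv; k0 += k_step) {
        const int n = k_step < n_kv - k0 ? k_step : n_kv - k0;
        repack_q8_0_chunk(K + k0, work, n);  // repack just this chunk
        // ... compute Q*K^T, softmax, accumulate V for these n rows ...
    }
}

int main() {
    const int n_kv = 1024, k_step = 128;
    std::vector<RowQ8> K(n_kv), work(k_step);
    flash_attn_over_k(K.data(), n_kv, k_step, work.data());
    printf("processed %d KV rows in chunks of %d\n", n_kv, k_step);
}
```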
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Slightly better CPU TG performance for GQA
* Better CPU FA implementation for TG when GQA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Metal: WIP to update Metal FA implementation
Dk = 192, Dv = 128 works, but not Dk = 576, Dv = 512
* Metal FA: go to float
* WIP
* Metal FA: MLA options now all work
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* MoE improvements on Metal
This version beats mainline, but there are things I don't understand:
* Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the
same, but we are 30% slower. Why?
* Using actual GEMM, we beat mainline with u-batch size of 128. But then
performance degrades. Why?
* Some cleanup
* Much better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq3_k: slightly better quantization
Not much of a difference for most models, but this change
avoids what looks like a catastrophic failure for DeepSeek-Lite
(PPL is now 7.041 vs 7.314 on main).
* Small improvement for type-1 quants
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better make_qx_quants
Tested with q4_0 and q3_K (pure, imatrix), and the improvement is
quite significant.
* Same for iq4_nl, iq4_xs
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP - not working
* q8_0 without bells and whistles works
* It works for q8_0
* Use bf16 instead of f16, int16
* q4_0_r8
* q5_0_r4
* q6_0_r4
* Also q4_1 and q5_1
* q8_0_r8 on avx2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Make fused MoE reproducible
As a bonus, peak performance at pp2048 with u_batch = 2048 is
~8% better.
* Slightly better
* Also do it for non-fused mul_mat_id
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Improve DeepSeek batched processing speed
* Revert the commented-out section in iqk_mul_mat.cpp
It does have some benefit at long contexts.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>