* q6_K dequantizing GEMM
* Much easier: just use different vec_dot types!
* WIP
* Finally q6_K x q8_2_x4 dot product works
* Very slightly better
* We don't need the changes in ggml.c
* Fix AVX2
* iq2_xs
* Fix AVX2
* iq2_s
* q3_K
* Fix q8_k_r8 on Zen4
* q3_K: repack to q8_k_r8 instead of q8_0_r8
With that we hit 360 t/s for LlaMA-3.1-8B on a Ryzen-7950X.
q8_k_r8 is 386 t/s, so for a batch size of 512 repacking costs
~7% of the time taken by the actual GEMM.
* q3_K: don't scale when all quants in a block are <= 127 when repacking
* iq2_s: repack to q8_k_r8 instead of q8_0_r8
* iq2_xs: rapck to q8_k_r8
* WIP
* iq2_xs: repack to q8_k_r8
* iq3_xxs: repack to q8_k_r8
* iq3_s: use q8_k_r8
* iq1_s: repack to q8_k_r8
* iq1_m: repack to q8_k_r8
* iq1_m: slightly faster
* Slightly faster
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Refactor iqk: WIP
* Refactor iqk: Factor out float GEMM (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for 1-bit quants (ABX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4
* Refactor iqk: Factor out GEMM for repacked legacy quants
* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV
* Refactor iqk: Factor out GEMM for repacked i-quants
* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512
* Refactor iqk: factor out 1-bit quants (NEON)
* Refactor iqk: factor out k-quants (NEON)
* Refactor iqk: factor out floats (NEON)
* Also iq4_xs belongs to k-quants
* Refactor iqk: factor out iqk quants (NEON)
* Refactor iqk: factor out legacy quants (NEON)
* Refactor iqk: factor out repacked legacy quants (NEON)
* Refactor iqk: factor out repacked k-quants (NEON)
* Refactor iqk: factor out repacked iqk quants (NEON)
* Refactor iqk: GEMM kernels are refactored on NEON
* Refactor iqk: FA compiles
If it works is a different story.
Current compile time: 107.3 sesonds on the Ryzen-7950X
* Refactor iqk: FA refactored (Zen4)
Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.
* Adding forgotten file
* Most helpers don't need to be templates
Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.
Compilation time drops to 14 second on the Ryzen-5975WX
* Fix bf16
* Refactor iqk: FA refactored (NEON)
* Forgotten MMQ ref and typo (#431)
* Adding forgotten iq5_k_r4
* Fix iq4_k_r4 on NEON
* Fix iq4_ks on NEON
It was broken before the refactoring (the shifts were not correctly
applied).
* Fix q8_0 on NEON
* Fix q6_0 K cache
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>