Iwan Kawrakow
7090f171e1
Fix iq4_k_r4 on NEON
2025-05-19 19:46:44 +03:00
Iwan Kawrakow
06efa17fa9
Adding forgotten iq5_k_r4
2025-05-19 18:00:16 +03:00
Nexes the Elder
380ab3f33a
Forgotten MMQ ref and typo ( #431 )
2025-05-19 17:18:03 +03:00
Iwan Kawrakow
65c8e860bf
Refactor iqk: FA refactored (NEON)
2025-05-19 17:16:00 +03:00
Iwan Kawrakow
9ae8f75114
Fix bf16
2025-05-19 15:30:46 +03:00
Iwan Kawrakow
9541631a52
Most helpers don't need to be templates
...
Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.
Compilation time drops to 14 seconds on the Ryzen-5975WX
2025-05-19 15:20:43 +03:00
Iwan Kawrakow
fbfe79e2fe
Adding forgotten file
2025-05-19 13:42:20 +03:00
Iwan Kawrakow
630279cb54
Refactor iqk: FA refactored (Zen4)
...
Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.
2025-05-19 13:38:38 +03:00
Iwan Kawrakow
131e5ac6df
Refactor iqk: FA compiles
...
If it works is a different story.
Current compile time: 107.3 seconds on the Ryzen-7950X
2025-05-19 11:43:02 +03:00
Iwan Kawrakow
4b4b4fdcac
Refactor iqk: GEMM kernels are refactored on NEON
2025-05-19 08:36:16 +03:00
Iwan Kawrakow
7aa2de6d5a
Refactor iqk: factor out repacked iqk quants (NEON)
2025-05-19 08:25:56 +03:00
Iwan Kawrakow
7e59d2b974
Refactor iqk: factor out repacked k-quants (NEON)
2025-05-19 08:11:24 +03:00
Iwan Kawrakow
2b8a231d87
Refactor iqk: factor out repacked legacy quants (NEON)
2025-05-19 07:51:28 +03:00
Iwan Kawrakow
bd1e4d4909
Refactor iqk: factor out legacy quants (NEON)
2025-05-18 19:47:53 +03:00
Iwan Kawrakow
465d717bb9
Refactor iqk: factor out iqk quants (NEON)
2025-05-18 19:06:46 +03:00
Iwan Kawrakow
312413694f
Also iq4_xs belongs to k-quants
2025-05-18 18:14:45 +03:00
Iwan Kawrakow
f4ab917e9e
Refactor iqk: factor out floats (NEON)
2025-05-18 18:09:39 +03:00
Iwan Kawrakow
c805a19202
Refactor iqk: factor out k-quants (NEON)
2025-05-18 17:41:54 +03:00
Iwan Kawrakow
28b94800c1
Refactor iqk: factor out 1-bit quants (NEON)
2025-05-18 16:54:44 +03:00
Iwan Kawrakow
c63a0af5b7
Refactor iqk: GEMM kernels are refactored on AVX2/AVX512
2025-05-18 15:50:20 +03:00
Iwan Kawrakow
0d96f3bd37
Refactor iqk: Factor out GEMM for repacked i-quants
2025-05-18 14:51:59 +03:00
Iwan Kawrakow
f501200d42
Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV
2025-05-18 14:02:07 +03:00
Iwan Kawrakow
6cd3609a85
Refactor iqk: Factor out GEMM for repacked legacy quants
2025-05-18 10:20:54 +03:00
Iwan Kawrakow
7868545062
Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4
2025-05-17 19:53:48 +03:00
Iwan Kawrakow
d66ec60836
Refactor iqk: fix AVX2
2025-05-17 19:29:55 +03:00
Iwan Kawrakow
9b6e75cb79
Refactor iqk: Factor out GEMM for 1-bit quants (AVX2/AVX512)
2025-05-17 18:28:24 +03:00
Iwan Kawrakow
082a9bd632
Refactor iqk: fix AVX2
2025-05-17 17:45:32 +03:00
Iwan Kawrakow
de5660cee3
Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)
2025-05-17 17:34:34 +03:00
Iwan Kawrakow
8dae13cd84
Refactor iqk: fix AVX2
2025-05-17 16:43:53 +03:00
Iwan Kawrakow
2cbbc5581f
Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)
2025-05-17 16:34:25 +03:00
Iwan Kawrakow
d355ff997b
Refactor iqk: fix AVX2
2025-05-17 15:45:15 +03:00
Iwan Kawrakow
4ef94c26fb
Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)
2025-05-17 15:34:56 +03:00
Iwan Kawrakow
f83e64dcb6
Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)
2025-05-17 14:32:00 +03:00
Iwan Kawrakow
51a87cf20d
Refactor iqk: Factor out float GEMM (AVX2/AVX512)
2025-05-17 13:41:39 +03:00
Iwan Kawrakow
68b782e861
Refactor iqk: WIP
2025-05-17 12:31:39 +03:00
Kawrakow
b3036a872f
Option to enable/disable the IQK CPU FA kernels ( #429 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 11:21:58 +03:00
Kawrakow
c35a383bcd
Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS ( #428 )
...
* Zen4: faster PP for iq4_ks and iq5_ks
* Zen4: faster PP for iq2_ks
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 10:42:33 +03:00
Kawrakow
7abdf2b099
IQ5_KS_R4: row-interleaved IQ5_KS ( #426 )
...
* iq5_ks_r4: basics
* iq5_ks_r4: Zen4 works
* iq5_ks_r4: AVX2 works
* iq5_ks_r4: NEON
* Fix iq5_ks on NEON
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 08:57:26 +03:00
Kawrakow
134d548173
Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K ( #427 )
...
* Fix IQ4_K on AVX2
* Fix IQ4_KS on AVX2
* Fix IQ5_K on AVX2
* Fix IQ6_K on AVX2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-16 17:25:15 +03:00
Kawrakow
34ae71c4d7
Adding forgotten template instance for iq5_ks ( #424 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:50:15 +03:00
Kawrakow
3d92d7f802
Adding IQ5_KS - 5.25 bpw quants ( #422 )
...
* iq5_ks: basics
* iq5_ks: quantize
* iq5_ks: CUDA dequantize works
* iq5_ks: dot product works on CUDA
* iq5_ks: MMQ works
* iq5_ks: Zen4
* iq5_ks: AVX2
But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.
* iq5_ks: NEON
* iq5_ks: Metal dequantize
* iq5_ks: Metal dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:02:39 +03:00
Kawrakow
3f8c865b92
Fix standard attention on the CPU ( #421 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:43:39 +03:00
Kawrakow
14ed9fb44d
CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K ( #418 )
...
* MMQ for iq2_k
* This works
* MMQ for iq3_k
* MMQ for iq2_ks
* Fix iq2_ks
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:15:08 +03:00
Kawrakow
0435b68e6d
CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K ( #417 )
...
* MMQ for iq4_k: WIP (not working)
* MMQ for iq4_k: working now
* MMQ for iq5_k
* Cleanup
* MMQ for iq5_k: slightly faster
* MMQ for iq6_k
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 14:04:11 +03:00
Kawrakow
b90d6ede2e
Fix SER (CUDA) ( #416 )
...
* Fixing SER bugs
* Cleanup
* This seems to fix it.
* This seems to work
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 07:29:28 +03:00
Kawrakow
13740622e9
Fix SER (CPU) ( #415 )
...
* Fixing SER bugs
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:55:04 +03:00
Kawrakow
553c08b6b4
Better CPU FA performance for DeepSeek-Lite ( #410 )
...
* Better CPU FA performance for DeepSeek-Lite
* It must be like this
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:53:20 +03:00
Kawrakow
627f406437
Fix new CUDA FA on Turing ( #413 )
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 15:09:33 +03:00
Kawrakow
465569dff8
Faster DeepSeek FA on CUDA ( #408 )
...
* New DeepSeek FlashMLA
Does not work because the RoPE portion is stored at the end
in our case, while in mainline it is stored at the beginning,
and the FA kernel assumes that.
* Rearrange MLA K cache so it fits the new CUDA FA implementation
* constexpr and minor changes
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:49:00 +03:00
Kawrakow
8669c3db2b
GPU offload policy ( #405 )
...
* Adding GPU offload policy
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:47:46 +03:00