Commit Graph

3723 Commits

Author SHA1 Message Date
Iwan Kawrakow
bd1e4d4909 Refactor iqk: factor out legacy quants (NEON) 2025-05-18 19:47:53 +03:00
Iwan Kawrakow
465d717bb9 Refactor iqk: factor out iqk quants (NEON) 2025-05-18 19:06:46 +03:00
Iwan Kawrakow
312413694f Also iq4_xs belongs to k-quants 2025-05-18 18:14:45 +03:00
Iwan Kawrakow
f4ab917e9e Refactor iqk: factor out floats (NEON) 2025-05-18 18:09:39 +03:00
Iwan Kawrakow
c805a19202 Refactor iqk: factor out k-quants (NEON) 2025-05-18 17:41:54 +03:00
Iwan Kawrakow
28b94800c1 Refactor iqk: factor out 1-bit quants (NEON) 2025-05-18 16:54:44 +03:00
Iwan Kawrakow
c63a0af5b7 Refactor iqk: GEMM kernels are refactored on AVX2/AVX512 2025-05-18 15:50:20 +03:00
Iwan Kawrakow
0d96f3bd37 Refactor iqk: Factor out GEMM for repacked i-quants 2025-05-18 14:51:59 +03:00
Iwan Kawrakow
f501200d42 Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV 2025-05-18 14:02:07 +03:00
Iwan Kawrakow
6cd3609a85 Refactor iqk: Factor out GEMM for repacked legacy quants 2025-05-18 10:20:54 +03:00
Iwan Kawrakow
7868545062 Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4 2025-05-17 19:53:48 +03:00
Iwan Kawrakow
d66ec60836 Refactor iqk: fix AVX2 2025-05-17 19:29:55 +03:00
Iwan Kawrakow
9b6e75cb79 Refactor iqk: Factor out GEMM for 1-bit quants (AVX2/AVX512) 2025-05-17 18:28:24 +03:00
Iwan Kawrakow
082a9bd632 Refactor iqk: fix AVX2 2025-05-17 17:45:32 +03:00
Iwan Kawrakow
de5660cee3 Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512) 2025-05-17 17:34:34 +03:00
Iwan Kawrakow
8dae13cd84 Refactor iqk: fix AVX2 2025-05-17 16:43:53 +03:00
Iwan Kawrakow
2cbbc5581f Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512) 2025-05-17 16:34:25 +03:00
Iwan Kawrakow
d355ff997b Refactor iqk: fix AVX2 2025-05-17 15:45:15 +03:00
Iwan Kawrakow
4ef94c26fb Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512) 2025-05-17 15:34:56 +03:00
Iwan Kawrakow
f83e64dcb6 Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512) 2025-05-17 14:32:00 +03:00
Iwan Kawrakow
51a87cf20d Refactor iqk: Factor out float GEMM (AVX2/AVX512) 2025-05-17 13:41:39 +03:00
Iwan Kawrakow
68b782e861 Refactor iqk: WIP 2025-05-17 12:31:39 +03:00
Kawrakow
b3036a872f Option to enable/disable the IQK CPU FA kernels (#429)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 11:21:58 +03:00
Kawrakow
c35a383bcd Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS (#428)
* Zen4: faster PP for iq4_ks and iq5_ks

* Zen4: faster PP for iq2_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 10:42:33 +03:00
Kawrakow
7abdf2b099 IQ5_KS_R4: row-interleaved IQ5_KS (#426)
* iq5_ks_r4: basics

* iq5_ks_r4: Zen4 works

* iq5_ks_r4: AVX2 works

* iq5_ks_r4: NEON

* Fix iq5_ks on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 08:57:26 +03:00
Kawrakow
134d548173 Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K (#427)
* Fix IQ4_K on AVX2

* Fix IQ4_KS on AVX2

* Fix IQ5_K on AVX2

* Fix IQ6_K on AVX2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-16 17:25:15 +03:00
Kawrakow
34ae71c4d7 Adding forgotten template instance for iq5_ks (#424)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:50:15 +03:00
Kawrakow
3d92d7f802 Adding IQ5_KS - 5.25 bpw quants (#422)
* iq5_ks: basics

* iq5_ks: quantize

* iq5_ks: CUDA dequantize works

* iq5_ks: dot product works on CUDA

* iq5_ks: MMQ works

* iq5_ks: Zen4

* iq5_ks: AVX2

But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All of these need fixing on AVX2.

* iq5_ks: NEON

* iq5_ks: Metal dequantize

* iq5_ks: Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:02:39 +03:00
Kawrakow
3f8c865b92 Fix standard attention on the CPU (#421)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:43:39 +03:00
Kawrakow
14ed9fb44d CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K (#418)
* MMQ for iq2_k

* This works

* MMQ for iq3_k

* MMQ for iq2_ks

* Fix iq2_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:15:08 +03:00
Kawrakow
0435b68e6d CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K (#417)
* MMQ for iq4_k: WIP (not working)

* MMQ for iq4_k: working now

* MMQ for iq5_k

* Cleanup

* MMQ for iq5_k: slightly faster

* MMQ for iq6_k

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 14:04:11 +03:00
Kawrakow
b90d6ede2e Fix SER (CUDA) (#416)
* Fixing SER bugs

* Cleanup

* This seems to fix it.

* This seems to work

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 07:29:28 +03:00
Kawrakow
13740622e9 Fix SER (CPU) (#415)
* Fixing SER bugs

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:55:04 +03:00
Kawrakow
0c57f84dc4 Fix imatrix calculation for MLA models (#411)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:53:38 +03:00
Kawrakow
553c08b6b4 Better CPU FA performance for DeepSeek-Lite (#410)
* Better CPU FA performance for DeepSeek-Lite

* It must be like this

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:53:20 +03:00
Kawrakow
4ba6bbb44a Update README.md 2025-05-12 15:48:37 +03:00
Kawrakow
627f406437 Fix new CUDA FA on Turing (#413)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 15:09:33 +03:00
Kawrakow
1d2da7feae Add batch warmup to sweep-bench (#375)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:50:26 +03:00
Kawrakow
f27cd40542 Enable faster prompt processing with mainline llama.cpp GGUFs (#409)
* Enable MLA-3 in crippled GGUFs: WIP

* Enable MLA-3 in crippled GGUFs: seems to work

* Add newly created tensors to model.tensors_by_name

Else they don't get run-time repacked.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:49:51 +03:00
Kawrakow
465569dff8 Faster DeepSeek FA on CUDA (#408)
* New DeepSeek FlashMLA

Does not work because the RoPE portion is stored at the end
in our case, while in mainline it is stored at the beginning,
and the FA kernel assumes the mainline layout.

* Rearrange MLA K cache so it fits the new CUDA FA implementation

* constexpr and minor changes

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:49:00 +03:00
Kawrakow
8669c3db2b GPU offload policy (#405)
* Adding GPU offload policy

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:47:46 +03:00
Iwan Kawrakow
504fb890d9 Revert "Fix race in the CUDA DeepSeek FA kernel (#406)"
This reverts commit 36e6e888b7.
I should have tested. We get NaNs.
2025-05-11 12:22:19 +03:00
Kawrakow
36e6e888b7 Fix race in the CUDA DeepSeek FA kernel (#406)
Reference: https://github.com/ggml-org/llama.cpp/pull/13438

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-11 08:12:47 +03:00
Kawrakow
a2d24c97e5 TG improvements for MoE models (#404)
* cuda: Remove unnecessary device to host copy of row ids

We get 3-4% TG speed improvement for DeepSeek-Lite just from that.

* CPU: fix get_rows when SER is used

With smart experts reduction (SER), one potentially uses fewer
experts than specified by the model. This is accomplished by setting
the ID of the not-selected tensors to -1. Most of the necessary
stuff was implemented when I added the SER option, but I forgot
to update get_rows() for non-quantized tensors. As a result, we
get random garbage for the weights of the not-selected experts,
which leads to garbage output. This commit fixes it on the CPU.
I'm not quite sure yet why the GPU is not working.

* CUDA: fix TG with SER

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-10 18:52:54 +03:00
Kawrakow
43a154d8b8 Handle incompatible DeepSeek GGUFs (#394)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 22:00:40 +03:00
saood06
967a2e1860 Fix missing rope_freqs with convert_hf_to_gguf (#402)
* lora : fix llama conversion script with ROPE_FREQS

* convert : refactor rope_freqs generation

This should also fix vocab-only conversion for Phi-3.

* convert : adapt MiniCPM3 to separate rope_freqs insertion

MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code which mixes tokenization
in the same file as tool calls.

gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings.

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-05-09 09:17:41 -05:00
Kawrakow
e5a4a3ce78 Update README.md
@saood06 Thanks!
2025-05-09 11:16:36 +03:00
Kawrakow
8777fc4855 Fix CUDA FlashMLA-3 with quantized KV cache (#400)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:22:48 +03:00
Kawrakow
496451a1d4 Update README.md 2025-05-09 10:13:25 +03:00
saood06
bc6ae515ce Support for Llama-3-Nemotron models (#377)
* conflict resolution

* Changes to make work and add longrope support

* Changes to n_attention_wv rule

* Untested support of 253B

* DeciLMCausalModel now reads rope_theta from config.json properly

* Remove errant Granite mentions

* Better n_attention_wv rule

* Update vocab.py

---------

Co-authored-by: Yee Man Chan <ymchan@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:09:59 +03:00