Commit Graph

3703 Commits

Author SHA1 Message Date
Kawrakow
dcc5ab31f1 Bug fixes from mainline (#439)
* Add __syncthreads() to the new FA kernel

* Clearing padding

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-20 17:03:14 +03:00
Nexes the Elder
e19ecd296b Forgotten MMQ ref and typo (#431) 2025-05-18 17:36:41 +03:00
Kawrakow
7e0ac477b8 Option to enable/disable the IQK CPU FA kernels (#429)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 11:21:58 +03:00
Kawrakow
634da2f0c9 Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS (#428)
* Zen4: faster PP for iq4_ks and iq5_ks

* Zen4: faster PP for iq2_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 10:42:33 +03:00
Kawrakow
db111c91ee IQ5_KS_R4: row-interleaved IQ5_KS (#426)
* iq5_ks_r4: basics

* iq5_ks_r4: Zen4 works

* iq5_ks_r4: AVX2 works

* iq5_ks_r4: NEON

* Fix iq5_ks on NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-17 08:57:26 +03:00
Kawrakow
e31ba05fcd Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K (#427)
* Fix IQ4_K on AVX2

* Fix IQ4_KS on AVX2

* Fix IQ5_K on AVX2

* Fix IQ6_K on AVX2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-16 17:25:15 +03:00
Kawrakow
06532ebd0e Adding forgotten template instance for iq5_ks (#424)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:50:15 +03:00
Kawrakow
90e53a0b8b Adding IQ5_KS - 5.25 bpw quants (#422)
* iq5_ks: basics

* iq5_ks: quantize

* iq5_ks: CUDA dequantize works

* iq5_ks: dot product works on CUDA

* iq5_ks: MMQ works

* iq5_ks: Zen4

* iq5_ks: AVX2

But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.

* iq5_ks: NEON

* iq5_ks: Metal dequantize

* iq5_ks: Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:02:39 +03:00
Kawrakow
17d721820a Fix standard attention on the CPU (#421)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:43:39 +03:00
Kawrakow
5e31a7df43 CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K (#418)
* MMQ for iq2_k

* This works

* MMQ for iq3_k

* MMQ for iq2_ks

* Fix iq2_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:15:08 +03:00
Kawrakow
51db1bf2d2 CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K (#417)
* MMQ for iq4_k: WIP (not working)

* MMQ for iq4_k: working now

* MMQ for iq5_k

* Cleanup

* MMQ for iq5_k: slightly faster

* MMQ for iq6_k

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 14:04:11 +03:00
Kawrakow
fba62d61c0 Fix SER (CUDA) (#416)
* Fixing SER bugs

* Cleanup

* This seems to fix it.

* This seems to work

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 07:29:28 +03:00
Kawrakow
d002b9b4a0 Fix SER (CPU) (#415)
* Fixing SER bugs

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:55:04 +03:00
Kawrakow
4071472bdc Fix imatrix calculation for MLA models (#411)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:53:38 +03:00
Kawrakow
86dbdea6fc Better CPU FA performance for DeepSeek-Lite (#410)
* Better CPU FA performance for DeepSeek-Lite

* It must be like this

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:53:20 +03:00
Kawrakow
537f72f9cc Update README.md 2025-05-12 15:48:37 +03:00
Kawrakow
be1d5c4b7e Fix new CUDA FA on Turing (#413)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 15:09:33 +03:00
Kawrakow
ceb8f513e4 Add batch warmup to sweep-bench (#375)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:50:26 +03:00
Kawrakow
2e585d4508 Enable faster prompt processing with mainline llama.cpp GGUFs (#409)
* Enable MLA-3 in crippled GGUFs: WIP

* Enable MLA-3 in crippled GGUFs: seems to work

* Add newly created tensors to model.tensors_by_name

Else they don't get run-time repacked.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:49:51 +03:00
Kawrakow
0c02e16a39 Faster DeepSeek FA on CUDA (#408)
* New DeepSeek FlashMLA

Does not work because the RoPE portion is stored at the end
in our case, while in mainline it is stored at the beginning,
and the FA kernel assumes that.

* Rearrange MLA K cache so it fits the new CUDA FA implementation

* constexpr and minor changes
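
The layout issue described above (RoPE portion at the end of each K cache row versus at the beginning) can be illustrated with a minimal sketch. This is not the code from the commit: the function name `move_rope_to_front`, the flat `std::vector<float>` cache, and the parameters `head_dim` and `n_rope` are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the repository.
// Assumes every row of `cache` holds `head_dim` floats laid out as
// [non-RoPE part | RoPE part] and rotates each row in place so the
// last `n_rope` values come first: [RoPE part | non-RoPE part].
void move_rope_to_front(std::vector<float>& cache, std::size_t head_dim, std::size_t n_rope) {
    assert(head_dim > 0 && n_rope <= head_dim && cache.size() % head_dim == 0);
    for (std::size_t off = 0; off < cache.size(); off += head_dim) {
        auto row = cache.begin() + static_cast<std::ptrdiff_t>(off);
        auto mid = row + static_cast<std::ptrdiff_t>(head_dim - n_rope);
        auto end = row + static_cast<std::ptrdiff_t>(head_dim);
        std::rotate(row, mid, end); // `mid` becomes the first element of the row
    }
}
```

A real implementation would more likely write new entries in the desired order when the cache is filled, rather than rotating existing rows afterwards.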

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:49:00 +03:00
Kawrakow
aa8ec5dfa6 GPU offload policy (#405)
* Adding GPU offload policy

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:47:46 +03:00
Kawrakow
8f7bd74afb Revert "Fix race in the CUDA DeepSeek FA kernel (#406)"
This reverts commit 36e6e888b7.
I should have tested. We get NaNs.
2025-05-11 12:22:19 +03:00
Kawrakow
0abcf0749e Fix race in the CUDA DeepSeek FA kernel (#406)
Reference: https://github.com/ggml-org/llama.cpp/pull/13438

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-11 08:12:47 +03:00
Kawrakow
a961f41762 TG improvements for MoE models (#404)
* cuda: Remove unnecessary device to host copy of row ids

We get 3-4% TG speed improvement for DeepSeek-Lite just from that.

* CPU: fix get_rows when SER is used

With smart experts reduction (SER), one potentially uses fewer
experts than specified by the model. This is accomplished by setting
the ID of the not selected tensors to -1. Most of the necessary
stuff was implemented when I added the SER option, but I forgot
to update get_rows() for not quantized tensors. As a result, we
get random garbage for the weights of the not-selected experts,
which leads to garbage output. This commit fixes it on the CPU.
I'm not quite sure yet why the GPU is not working.

* CUDA: fix TG with SER
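
The get_rows() problem described above can be sketched as follows. This is an illustration only, not the ggml implementation: `get_rows_with_skips`, its flat row-major `std::vector<float>` source, and the choice to zero-fill skipped rows are assumptions; the only detail taken from the commit message is that an ID of -1 marks an expert that was not selected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper, not the ggml get_rows() implementation.
// Gathers rows of `src` (row-major, `row_size` floats per row) in the
// order given by `ids`. An id of -1 marks an expert dropped by SER, so
// that output row is left at zero instead of being read from memory.
std::vector<float> get_rows_with_skips(const std::vector<float>& src, std::size_t row_size,
                                       const std::vector<int32_t>& ids) {
    std::vector<float> dst(ids.size() * row_size, 0.0f);
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (ids[i] < 0) continue; // not selected: keep the zero-filled row
        const std::size_t row = static_cast<std::size_t>(ids[i]);
        assert((row + 1) * row_size <= src.size()); // id must point inside src
        std::memcpy(dst.data() + i * row_size, src.data() + row * row_size,
                    row_size * sizeof(float));
    }
    return dst;
}
```

Without the `ids[i] < 0` check, a -1 would be used as a row index and the output row would be filled from unrelated memory, which matches the garbage output the commit message describes.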

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-10 18:52:54 +03:00
Kawrakow
47fa8380c6 Handle incompatible DeepSeek GGUFs (#394)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 22:00:40 +03:00
saood06
a7e5b01540 Fix missing rope_freqs with convert_hf_to_gguf (#402)
* lora : fix llama conversion script with ROPE_FREQS

* convert : refactor rope_freqs generation

This should also fix vocab-only conversion for Phi-3.

* convert : adapt MiniCPM3 to separate rope_freqs insertion

MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code which mixes tokenization
in the same file as tool calls.

gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings.

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-05-09 09:17:41 -05:00
Kawrakow
b64cb29713 Update README.md
@saood06 Thanks!
2025-05-09 11:16:36 +03:00
Kawrakow
dd2014a853 Fix CUDA FlashMLA-3 with quantized KV cache (#400)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:22:48 +03:00
Kawrakow
957a6e7911 Update README.md 2025-05-09 10:13:25 +03:00
saood06
87bfad8437 Support for Llama-3-Nemotron models (#377)
* conflict resolution

* Changes to make it work and add longrope support

* Changes to n_attention_wv rule

* Untested support of 253B

* DeciLMCausalModel now reads rope_theta from config.json properly

* Remove errant Granite mentions

* Better n_attention_wv rule

* Update vocab.py

---------

Co-authored-by: Yee Man Chan <ymchan@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:09:59 +03:00
Kawrakow
828758ec0d Update README.md 2025-05-07 18:59:01 +03:00
Kawrakow
92ceda1d06 FlashMLA-3 for DeepSeek models on CUDA (#386)
* CUDA WIP: support for FlashMLA-3

* Much better

The issue was that I did not change the number of warps
used for 3D matrix multiplications (wk_b * kv_cache, MoE),
so we ended up using 4 warps for TG. By going to 1 warp
in these cases, we get a significant boost in TG performance
(tested with DeepSeek-Lite)

* Sadly, the previous commit was wrong

* Finalizing

* Also add these

* Minor

* Minor tweak

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 17:38:22 +03:00
Gaolingx
5436acdb6c Fix some MSVC build problems (#392)
* cmake: force MSVC compiler charset to utf-8

* build: apply MSVC /bigobj option to c/cpp files only

* Update CMakeLists.txt
2025-05-07 17:04:39 +03:00
Kawrakow
8a2d611083 Fix DeepSeek q8_0 cache (#391)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 12:06:49 +03:00
Kawrakow
6104bf5296 Fix build for Xeon Gold 6226R (#390)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 10:33:27 +03:00
Kawrakow
6e7b28f7b0 Update README.md 2025-05-06 08:48:11 +03:00
Kawrakow
b08471f717 Fix DeepSeek FA (#382)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-05 08:39:10 +03:00
Kawrakow
45cd1bcd59 CUDA: MMQ for IQ4_KS (#374)
* WIP

* WIP: still getting illegal memory access

* CUDA: MMQ for iq4_ks now works

~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 12:45:00 +03:00
Kawrakow
db0ed280f1 Update README.md 2025-05-04 12:06:47 +03:00
Kawrakow
7cb99f8078 Update README.md 2025-05-04 11:49:29 +03:00
Kawrakow
711ba7e8f4 CUDA: faster FA TG for GQA models (#370)
* cuda: WIP MMA FA

* Use MMA for TG also when quantized

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:17:44 +03:00
Kawrakow
fdbdb5310a Another attempt to fix #367 (#371)
* Another attempt to fix #367

* Yet another

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:02:12 +03:00
Gaolingx
8db70379ae cmake: force MSVC compiler charset to utf-8 (#369) 2025-05-03 15:56:29 +03:00
Kawrakow
758ca617cd Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-03 14:43:55 +03:00
Kawrakow
892e96be53 Fix FA bug on AVX2 (#364)
* Fix FA bug on AVX2

* Also this was wrong

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02 07:09:09 +02:00
saood06
aca68016d8 Fix model architecture name (#366)
Co-authored-by: junhuihe <junhui-he@outlook.com>
2025-05-02 07:07:24 +02:00
Kawrakow
9303df7450 Update README.md (#352)
* Update README.md

* Edits

* Updates

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 15:11:29 +02:00
Kawrakow
1ea49001f3 Fix IQK_FA_ALL_QUANTS on AVX2 (#360)
* Fix IQK_FA_ALL_QUANTS on AVX2

* Make it also work, not just compile

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 10:45:43 +02:00
Kawrakow
71bc74d738 Add missing enum values for qwen3 and qwen3moe (#356)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 10:05:38 +02:00
Ben Harris
8b62ee32ca Apply Qwen3 PR from llama.cpp (#355) 2025-04-29 10:02:08 +02:00