ik_llama.cpp/ggml
Kawrakow 0cc32ff0b1 CUDA: much faster prompt processing for MoE models and small u-batch sizes (#728)
* WIP: adding mainline mmq_id implementation

* This seems to work

* Now also -fmoe works

* WIP

* WIP

* WIP

* This works for mainline supported quants

* mmq_id: add iq2_k, iq2_k_r4

* mmq_id: don't assume row size is multiple of type size (per row scales)

* mmq_id: don't assume row size is multiple of type size

* mmq_id: add iq2_ks

So we are sure it works with per row scales

* mmq_id: add iq2_kl

* mmq_id: add iq3_ks

* mmq_id: adding iq3_k, iq3_k_r4

* mmq_id: add iq4_kss, iq4_ks, iq4_ks_r4

* mmq_id: adding iq4_k, iq4_k_r4

* mmq_id: adding iq5_ks, iq5_ks_r4

* mmq_id: adding iq5_k, iq5_k_r4, q6_0

* mmq_id: adding iq6_k

* mmq_id: add iq1_s_r4

* mmq_id: adding iq1_kt, iq2_kt

* mmq_id: add iq3_kt, iq4_kt

* Add CUDA fp8 header

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-26 13:30:35 +03:00