Commit Graph

260 Commits

Author SHA1 Message Date
Kawrakow
dd2014a853 Fix CUDA FlashMLA-3 with quantized KV cache (#400)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:22:48 +03:00
Kawrakow
92ceda1d06 FlashMLA-3 for DeepSeek models on CUDA (#386)
* CUDA WIP: support for FlashMLA-3

* Much better

The issue was that I did not change the number of warps
used for 3D matrix multiplications (wk_b * kv_cache, MoE),
so we ended up using 4 warps for TG. By going to 1 warp
in these cases, we get a significant boost in TG performance
(tested with DeepSeek-Lite)

* Sadly, the previous commit was wrong

* Finalizing

* Also add these

* Minor

* Minor tweak

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 17:38:22 +03:00
Kawrakow
8a2d611083 Fix DeepSeek q8_0 cache (#391)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 12:06:49 +03:00
Kawrakow
6104bf5296 Fix build for Xeon Gold 6226R (#390)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 10:33:27 +03:00
Kawrakow
b08471f717 Fix DeepSeek FA (#382)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-05 08:39:10 +03:00
Kawrakow
45cd1bcd59 CUDA: MMQ for IQ4_KS (#374)
* WIP

* WIP: still getting illegal memory access

* CUDA: MMQ for iq4_ks now works

~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 12:45:00 +03:00
Kawrakow
711ba7e8f4 CUDA: faster FA TG for GQA models (#370)
* cuda: WIP MMA FA

* Use MMA for TG also when quantized

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:17:44 +03:00
Kawrakow
fdbdb5310a Another attempt to fix #367 (#371)
* Another attempt to fix #367

* Yet another

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:02:12 +03:00
Kawrakow
758ca617cd Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-03 14:43:55 +03:00
Kawrakow
892e96be53 Fix FA bug on AVX2 (#364)
* Fix FA bug on AVX2

* Also this was wrong

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02 07:09:09 +02:00
Kawrakow
1ea49001f3 Fix IQK_FA_ALL_QUANTS on AVX2 (#360)
* Fix IQK_FA_ALL_QUANTS on AVX2

* Make it also work, not just compile

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 10:45:43 +02:00
Kawrakow
9d9f9f96b2 CPU FA improvements (#351)
* FA: provide work buffer for K repacking

* Add header to avoid comp0iler warnings

* WIP

* WIP

* WIP

* WIP

* Slightly better

* WIP (Zen4)

* WIP

* Try to improve for unusual number of heads/number of threads

* Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA

* Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA

* Use Sum4q4 for q4_0

* WIP

* WIP

* Much better FA TG with q8_0 KV cache

Just repack it even for TG. But do the repacking for k_step rows,
not the whole K tensor.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 07:19:43 +02:00
Kawrakow
815307d3bd Fix division by zero bug (#349)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 09:19:43 +02:00
Kawrakow
86be28d5bd Add support for Cohere2 (#341)
* Add support for Cohere2

* Fixe IQ4_NL on AVX2

* Command-A needs fp32 precision for K*Q

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 08:13:25 +02:00
Kawrakow
4413f17b58 Fix q4_1 and q5_1 on Arm (#348)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 19:48:08 +02:00
Kawrakow
fb98619852 Add ability to manually set arch flags (#347)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 13:24:18 +02:00
Kawrakow
542351d088 Fix FA on ARM (#346)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 11:01:08 +02:00
Kawrakow
2d2a03df24 cuda: use switch in constexpr funcs (#343)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24 17:37:12 +02:00
saood06
16f945d9bb Fix termux/android build (#336)
* Attempt fix

* Attempt fix 2

* Attempt fix 3

* Attempt fix 4

* Attempt fix 5

* Attempt fix 6

* Attempt fix 7

* Attempt fix 8

* Attempt fix 9

* Attempt fix 10

* Attempt fix 11

* Attempt fix 12

* Attempt fix 13
2025-04-21 09:13:46 +02:00
Kawrakow
4a70adae94 Better TG performance for GQA models (CPU) (#332)
* Slightly better CPU TG performance for GQA

* Better CPU FA implementation for TG when GQA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-17 08:08:40 +02:00
Kawrakow
5a98a66b5c Better gemm/gemv on AVX2 fr q4_0_r8 (#331)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:18:50 +02:00
Kawrakow
1a786850e6 Allow q8_0 KV cache for head size 256 (#330)
* Allow q8_0 KV cache for head size 256

* We need also these

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:05:31 +02:00
Kawrakow
0f7aa11b6d Improved IQ1_M quantization (#327)
* Much faster and it looks like better iq1_m quantiation

* Cleanup

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13 10:37:55 +02:00
Kawrakow
86c9b08846 Better iq2_xs quantization (#312)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 12:39:04 +02:00
Kawrakow
8210ed4883 Add copyright notices (#317)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 10:43:26 +02:00
Kawrakow
d3c0cc788b We need to synchronize before using device to host async memcpy (#313)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-05 14:31:27 +02:00
Kawrakow
c7fceae221 Add -flax-vector-conversions for GCC on ARM (#311)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-04 11:04:19 +02:00
Kawrakow
9ab6dc9f91 Metal: FA and FlashMLA (#310)
* Metal: WIP to update Metal FA implementation

Dk=192, Dv=128 works, but not Dk = 576, Dv = 512

* Metal FA: go to float

* WIP

* Metal FA: MLA options now all work

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 17:54:25 +02:00
Kawrakow
1f260865ef Fix GCC compilation errors on ARM (#309)
* Fix GCC compilation errors on ARM

* One more

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 15:50:53 +02:00
Kawrakow
3b5da96073 Metal: much faster MoE prompt processing (#307)
* MoE improvements on Metal

This version beats mainline, there are things I don't understand:
* Mianline has effectively gone to GEMV for MUL_MAT_ID. We can do the
  same, but we are 30% slower. Why?
* Using actual GEMM, we beat mainline with ubtach size of 128. But then
  performance degrades. Why?

* Some cleanup

* Much better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 07:15:49 +02:00
Kawrakow
df20261b6a Fix ARM_NEON build failure due to q8_2 (#303)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 13:48:20 +02:00
Kawrakow
1bc60d6cc9 Quantization improvements (2) (#302)
* iq3_k: slightly better quantization

Not much of a difference for most models, but this change
avoids what it looks like a catastrophic failure for DeepSeek-Lite
(PPL is now 7.041 vs 7.314 on main).

* Small improvement for type-1 quants

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 10:31:06 +02:00
Kawrakow
ba3030c9c3 Fix #300 (#301)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 08:29:25 +02:00
Kawrakow
3c3825d7f6 Quantization improvements (#295)
* Better make_qx_quants

Tested with q4_0 and q3_K (pure, imatrix), and the improvement is
quite significant.

* Sae for iq4_nl, iq4_xs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-29 08:09:52 +01:00
Kawrakow
d71e84bdc1 Use bf16 instead of fp16 block scales for q8_1 (#292)
* WIP - not working

* q8_0 without bells and wistles works

* It works for q8_0

* Use bf16 instead of f16,int16

* q4_0_r8

* q5_0_r4

* q6_0_r4

* Also q4_1 and q5_1

* q8_0_r8 on avx2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-27 05:49:16 +01:00
Kawrakow
6ef4954612 CUDA: better MoE implementation (#283)
* Make fused MoE reproducible

As a bonus, peak performance at pp2048 with u_batch = 2048 is
~8% better.

* Slightly better

* Also do it for non-fused mul_mat_id

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-25 07:47:10 +01:00
Kawrakow
a9a941b5b8 Improve DeepSeek batched processing speed (#282)
* Improve DeepSeek batched processing speed

* Revert the commented out section in iqk_mul_mat.cpp

It does have some benefit at long contexts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23 17:10:52 +01:00
Kawrakow
23ee1ac1b8 Attempt to improve FlashMLA on the CPU (#277)
* Fix it for nth > rk2

* Handle rk2%nth_k != 0

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23 07:28:21 +01:00
Kawrakow
37c48feb3e Native build ooption for CUDA when GGML_NATIVE is set (#280)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22 18:17:51 +01:00
Kawrakow
5a67c8322e Fighting with cmake (#279)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22 16:58:30 +01:00
Kawrakow
c5e554f941 Convert models to row-interleaved quants using the quantize tool (#272)
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 07:23:36 +01:00
Kawrakow
f2997472f4 Fix ggml_compute_forward_dup_q (#269)
I broke it with PR #265. I was testing with a model where
the wk_b and wk_v tensors were present, so didn't need to be computed,
so didn't notice that the change I made to ggml_compute_forward_dup_q
breaks that computation.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19 15:47:24 +01:00
Kawrakow
623b5b6cca Prevent FlashMLA-1 from running on CUDA (#268)
as it is not supported.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19 13:03:59 +01:00
Kawrakow
1bc07b5ccd Allow q8_0 cache on the CPU for FlashMLA-2 (#265)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18 15:41:05 +01:00
Kawrakow
7d8119d0ba Make Q8_0 KV cache work with mla=2,fa on CUDA (#264)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18 15:40:47 +01:00
Kawrakow
264071c351 Fix #261 (#262)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18 07:44:43 +01:00
Kawrakow
f8277ced45 Compile time option to use bf16 for qunts without MMQ kernels (#261)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18 07:37:10 +01:00
Kawrakow
9fe2b06f79 FlashMLA-2: reduce compute buffer size (CUDA and CPU) (#260)
* FlashMLA-2: eliminate intermediate f32 tensors

This works on the CPU. PP performance is ~13% better for 16k tokens
and compute buffer is quite a bit smaller.

* FlashMLA-2: enable fast path only on the CPU for now

I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.

* FlashMLA-2: slightly smaller computer buffer size

* Prepare wk_b when loading DeepSeek models (if wk_b is missing)

* Add some comments

* Fix case where wkv_b is quantized with k- or i-quants.

* Fix CUDA

There is an issue with quantized GEMV on CUDA when the left operand
(the matrix) is not contiguous. So, for now, we also create wv_b
during model loading and use that instead of the 3D view of wkv_b.

* FlashMLA-2: avoid conversions to f32 also on CUDA

* Be able to compute for more than 65535 tokens

On CUDA just a quick hack that allows us to cancatenate tensors
with more than 65535 rows along zroth dimension as needed by
FlashMLA-2. Also needed some care in the perplexity tool to
avoid int overflows when evaluating the computed logits.

* Reduce memory usage for FlashMLA-2

Oh, also fix int overflow in the CUDA concat implementation.

It is funny how the llama.cpp 64-bit police has gone (almost) everywhere
and replaced 32-bit ints with 64-bit ints, needed or not,
but hasn't done it where it is actually needed.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18 07:36:42 +01:00
Kawrakow
676f0e71b4 FlashMLA-2 (CPU): faster and smaller compute buffer size (#253)
* FlashMLA-2: eliminate intermediate f32 tensors

This works on the CPU. PP performance is ~13% better for 16k tokens
and compute buffer is quite a bit smaller.

* FlashMLA-2: enable fast path only on the CPU for now

I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.

* FlashMLA-2: slightly smaller computer buffer size

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-13 12:07:43 +02:00
Kawrakow
fc6a65dda4 MLA-2: Allow usage of q8_0 for KV cache on CUDA (#252)
* FlashMLA(CUDA): WIP to allow q8_0 quantized cache

* WIP

* FlashMLA(CUDA) - allow q8_0 for KV cache

This works, and PP is not bad, but TG is still quite a bit slower.

* FlashMLA(CUDA) - allow q8_0 for KV cache

This is better. ~9% slower than f16 cache for short contexts,
nearly on par at 16k tokens.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-12 07:21:46 +02:00