Commit Graph

3632 Commits

Author SHA1 Message Date
Kawrakow
3e7be3d28e Correct L4 rms_norm (#324)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-11 10:49:18 +02:00
Kawrakow
474435f58b Llama-4 support (text only) (#321)
* llama4: WIP

* llama4: this seems to be working

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10 09:05:21 +02:00
Kawrakow
5f44f4b3d0 Guard against attempts to use MLA for non-MLA models (#320)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-08 08:47:24 +02:00
Kawrakow
22d7440ba2 Update AUTHORS
Well, there was also the initial MLA PR, which was derived from @fairydreaming
2025-04-07 17:31:35 +02:00
Kawrakow
f03ae19aad Update AUTHORS
Forgot to add @Nexesenex
2025-04-07 17:28:19 +02:00
Kawrakow
b38759127a Use links for ggml/llama.cpp authors (#318)
* Use links for ggml/llama.cpp authors

* This file is not html

* More

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 17:25:06 +02:00
Kawrakow
2309ecda80 Better iq2_xs quantization (#312)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 12:39:04 +02:00
Kawrakow
a051f08b8f Add copyright notices (#317)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 10:43:26 +02:00
Kawrakow
abbabf7ca1 Update LICENSE
I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) is not the same thing as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS). 

This PR corrects my mistake.
2025-04-07 10:41:40 +02:00
Kawrakow
ec84855c6a We need to synchronize before using device to host async memcpy (#313)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-05 14:31:27 +02:00
Kawrakow
c616306a01 Add -flax-vector-conversions for GCC on ARM (#311)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-04 11:04:19 +02:00
Kawrakow
073eda985e Metal: FA and FlashMLA (#310)
* Metal: WIP to update Metal FA implementation

Dk=192, Dv=128 works, but not Dk = 576, Dv = 512

* Metal FA: go to float

* WIP

* Metal FA: MLA options now all work

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 17:54:25 +02:00
Kawrakow
2ee6263e24 Fix GCC compilation errors on ARM (#309)
* Fix GCC compilation errors on ARM

* One more

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 15:50:53 +02:00
Kawrakow
07dbc1aa06 Metal: much faster MoE prompt processing (#307)
* MoE improvements on Metal

This version beats mainline, but there are things I don't understand:
* Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the
  same, but we are 30% slower. Why?
* Using actual GEMM, we beat mainline with a u-batch size of 128. But then
  performance degrades. Why?

* Some cleanup

* Much better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 07:15:49 +02:00
Ikko Eltociear Ashimine
6d405d1fd1 docs: update README.md (#304)
2025-04-01 21:30:25 +02:00
Kawrakow
21a5b8bd28 Fix ARM_NEON build failure due to q8_2 (#303)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 13:48:20 +02:00
Kawrakow
190e7866db Quantization improvements (2) (#302)
* iq3_k: slightly better quantization

Not much of a difference for most models, but this change
avoids what looks like a catastrophic failure for DeepSeek-Lite
(PPL is now 7.041 vs 7.314 on main).

* Small improvement for type-1 quants

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 10:31:06 +02:00
Kawrakow
b07a337bfe Additional guards for interleaved quants (#299)
* Make sure no interleaved quants are being used for token embeddings

also with `--pure` and/or `--custom-q`.

* Simplify

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 08:29:47 +02:00
Kawrakow
6e5156cab5 Fix #300 (#301)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 08:29:25 +02:00
Kawrakow
4819257ce6 Quantization improvements (#295)
* Better make_qx_quants

Tested with q4_0 and q3_K (pure, imatrix), and the improvement is
quite significant.

* Same for iq4_nl, iq4_xs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-29 08:09:52 +01:00
Kawrakow
23b0addb34 Make sure tensor row size is multiple of block size also when quantizing with --pure (#294)
* WIP - not working

* q8_0 without bells and whistles works

* It works for q8_0

* Use bf16 instead of f16,int16

* q4_0_r8

* q5_0_r4

* q6_0_r4

* Also q4_1 and q5_1

* Add check if selected type is possible with --pure

I often want to quantize with --pure to see quantization performance
without quantization mixes. But for models where there are tensors
with row sizes that are not a multiple of 256, this results in a crash
for k- and i-quants. Hence, let's add a check if the quant selected
via --pure is applicable, and change it if not.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-27 10:48:52 +01:00
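The `--pure` check described in #294 can be illustrated with a minimal sketch. This is not the code from the PR: the function name `pick_pure_type`, the small type table, and the choice of `q8_0` as the fallback are all assumptions made for illustration. The only facts taken from the commit message are that k- and i-quants need row sizes that are a multiple of 256 and that the selected type is changed when it is not applicable.

```python
# Hypothetical table of block sizes (elements per quantization block).
# 256 for k- and i-quants follows the commit message; 32 is the block
# size of the legacy q4_0/q8_0 formats.
BLOCK_SIZE = {"q4_K": 256, "iq4_xs": 256, "q4_0": 32, "q8_0": 32}


def pick_pure_type(requested: str, row_size: int, fallback: str = "q8_0") -> str:
    """Return a quant type whose block size divides row_size.

    The fallback type is an assumption for this sketch, not the rule
    used by the actual quantize tool.
    """
    if row_size % BLOCK_SIZE[requested] == 0:
        return requested  # requested type is applicable as-is
    if row_size % BLOCK_SIZE[fallback] == 0:
        return fallback   # change the type instead of crashing
    raise ValueError(f"row size {row_size} fits neither {requested} nor {fallback}")
```

For example, `pick_pure_type("q4_K", 4096)` keeps `q4_K`, while `pick_pure_type("q4_K", 576)` switches to the fallback because 576 is not a multiple of 256. The row sizes here are arbitrary illustration values.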
Kawrakow
d0b52076da Use bf16 instead of fp16 block scales for q8_1 (#292)
* WIP - not working

* q8_0 without bells and whistles works

* It works for q8_0

* Use bf16 instead of f16,int16

* q4_0_r8

* q5_0_r4

* q6_0_r4

* Also q4_1 and q5_1

* q8_0_r8 on avx2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-27 05:49:16 +01:00
Kawrakow
a22250df93 llama-bench: enable having different number of threads for tg and pp (#284)
* llama-bench: enable having different number of threads for tg and pp

* Add -tgb to usage

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-25 16:31:17 +01:00
saood06
279b7d3395 Update sweep bench (deprecating .jsonl support) (#289)
* Update sweep bench (deprecating .jsonl support)

* Fix README.md
2025-03-25 10:14:44 -05:00
Kawrakow
98a264a2ea CUDA: better MoE implementation (#283)
* Make fused MoE reproducible

As a bonus, peak performance at pp2048 with u_batch = 2048 is
~8% better.

* Slightly better

* Also do it for non-fused mul_mat_id

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-25 07:47:10 +01:00
Kawrakow
f9307d7907 Improve DeepSeek batched processing speed (#282)
* Improve DeepSeek batched processing speed

* Revert the commented out section in iqk_mul_mat.cpp

It does have some benefit at long contexts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23 17:10:52 +01:00
Kawrakow
5a4855e61c Attempt to improve FlashMLA on the CPU (#277)
* Fix it for nth > rk2

* Handle rk2%nth_k != 0

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23 07:28:21 +01:00
Kawrakow
dd5ebd0e3d Test transparent huge pages on Linux (#278)
* Adding ability to use THP on Linux

* Use the actual page size used for mmap also in munmap

* Add -thp to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23 07:24:43 +01:00
Kawrakow
6028362ef6 Native build option for CUDA when GGML_NATIVE is set (#280)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22 18:17:51 +01:00
Kawrakow
13ecc5332e Fighting with cmake (#279)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22 16:58:30 +01:00
Kawrakow
d8584a1bbe Add Gemma3 support (text only) (#276)
* WIP Gemma3: not working

* gemma3: build_gemma3 seems to be working now

* Revert changes to convert_hf_to_gguf.py

It wasn't working, so I guess it is better to leave the
conversion up to upstream.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22 08:05:10 +01:00
Kawrakow
3d6e25c82d Fix bug: missing parentheses in logical expression (#275)
This results in GGGGGGGGGGGGG when generating with
mla = 3, fa = 0.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 13:23:01 +01:00
Kawrakow
022660f7ab Specify tensor name regex for tensors to be repacked (#274)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 10:51:37 +01:00
Kawrakow
ddc8eee10e FlashMLA-3: the best of both worlds (CPU only) (#273)
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

* FlashMLA-3: the best of both worlds - CPU only

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 07:24:22 +01:00
Kawrakow
b8d1fac97b Convert models to row-interleaved quants using the quantize tool (#272)
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 07:23:36 +01:00
Kawrakow
127c6ee649 Honor mmap setting when using tensor overrides (#270)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19 19:17:03 +01:00
Kawrakow
22c84a126f Fix ggml_compute_forward_dup_q (#269)
I broke it with PR #265. I was testing with a model where
the wk_b and wk_v tensors were present, so didn't need to be computed,
so didn't notice that the change I made to ggml_compute_forward_dup_q
breaks that computation.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19 15:47:24 +01:00
Kawrakow
c3b75c531c Prevent FlashMLA-1 from running on CUDA (#268)
as it is not supported.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19 13:03:59 +01:00
Kawrakow
8e549b4234 Allow q8_0 cache on the CPU for FlashMLA-2 (#265)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18 15:41:05 +01:00
Kawrakow
68a5b60408 Make Q8_0 KV cache work with mla=2,fa on CUDA (#264)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18 15:40:47 +01:00
Kawrakow
f4ebf13b6a Fix #261 (#262)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18 07:44:43 +01:00
Kawrakow
bdcae905c4 Compile time option to use bf16 for quants without MMQ kernels (#261)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18 07:37:10 +01:00
Kawrakow
dcdfad29f7 FlashMLA-2: reduce compute buffer size (CUDA and CPU) (#260)
* FlashMLA-2: eliminate intermediate f32 tensors

This works on the CPU. PP performance is ~13% better for 16k tokens
and compute buffer is quite a bit smaller.

* FlashMLA-2: enable fast path only on the CPU for now

I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.

* FlashMLA-2: slightly smaller compute buffer size

* Prepare wk_b when loading DeepSeek models (if wk_b is missing)

* Add some comments

* Fix case where wkv_b is quantized with k- or i-quants.

* Fix CUDA

There is an issue with quantized GEMV on CUDA when the left operand
(the matrix) is not contiguous. So, for now, we also create wv_b
during model loading and use that instead of the 3D view of wkv_b.

* FlashMLA-2: avoid conversions to f32 also on CUDA

* Be able to compute for more than 65535 tokens

On CUDA just a quick hack that allows us to concatenate tensors
with more than 65535 rows along the zeroth dimension as needed by
FlashMLA-2. Also needed some care in the perplexity tool to
avoid int overflows when evaluating the computed logits.

* Reduce memory usage for FlashMLA-2

Oh, also fix int overflow in the CUDA concat implementation.

It is funny how the llama.cpp 64-bit police has gone (almost) everywhere
and replaced 32-bit ints with 64-bit ints, needed or not,
but hasn't done it where it is actually needed.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-18 07:36:42 +01:00
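The int overflow mentioned in #260 (more than 65535 tokens, logits indexed with 32-bit ints) can be illustrated with a short sketch. It emulates the two's-complement wraparound that a 32-bit signed product typically shows in practice (in C and C++ signed overflow is formally undefined behavior). The helper names and the vocabulary size of 32768 are assumptions chosen for the example and are not taken from the repository.

```python
def wrap_int32(x: int) -> int:
    """Reduce a Python int to the value a 32-bit signed register would hold."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x


def logits_offset_32(token_index: int, n_vocab: int) -> int:
    """Offset of a token's logits row, computed as if with 32-bit ints."""
    return wrap_int32(token_index * n_vocab)


def logits_offset_64(token_index: int, n_vocab: int) -> int:
    """Same offset with enough bits; Python ints do not overflow."""
    return token_index * n_vocab
```

With the hypothetical `n_vocab = 32768`, token index 65536 gives an offset of exactly 2**31, one past the largest 32-bit signed value, so `logits_offset_32(65536, 32768)` wraps to a negative number while `logits_offset_64` returns the intended offset.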
Kawrakow
f91b2e38d0 Prepare wk_b tensors of DeepSeek models on the fly (#259)
* FlashMLA-2: eliminate intermediate f32 tensors

This works on the CPU. PP performance is ~13% better for 16k tokens
and compute buffer is quite a bit smaller.

* FlashMLA-2: enable fast path only on the CPU for now

I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.

* FlashMLA-2: slightly smaller compute buffer size

* Prepare wk_b when loading DeepSeek models (if wk_b is missing)

* Add some comments

* Fix case where wkv_b is quantized with k- or i-quants.

* Fix CUDA

There is an issue with quantized GEMV on CUDA when the left operand
(the matrix) is not contiguous. So, for now, we also create wv_b
during model loading and use that instead of the 3D view of wkv_b.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-17 09:31:56 +01:00
Kawrakow
305fabfc3b FlashMLA-2 (CPU): faster and smaller compute buffer size (#253)
* FlashMLA-2: eliminate intermediate f32 tensors

This works on the CPU. PP performance is ~13% better for 16k tokens
and compute buffer is quite a bit smaller.

* FlashMLA-2: enable fast path only on the CPU for now

I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.

* FlashMLA-2: slightly smaller compute buffer size

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-13 12:07:43 +02:00
Kawrakow
3f23ed68f1 MLA-2: Allow usage of q8_0 for KV cache on CUDA (#252)
* FlashMLA(CUDA): WIP to allow q8_0 quantized cache

* WIP

* FlashMLA(CUDA) - allow q8_0 for KV cache

This works, and PP is not bad, but TG is still quite a bit slower.

* FlashMLA(CUDA) - allow q8_0 for KV cache

This is better. ~9% slower than f16 cache for short contexts,
nearly on par at 16k tokens.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-12 07:21:46 +02:00
Kawrakow
a48e163247 DeepSeek imatrix stuff (#250)
* This gives us ~20% TG speedup for DeepSeek on CUDA

* Slightly better

* Also do it for plain (not fused) mul_mat_id

* Guard against numerical precision issues for MLA on CUDA

* imatrix: wv_b <-> wkv_b

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-10 16:19:09 +02:00
Kawrakow
699c9cb7f6 Faster MoE token generation on CUDA (#248)
* This gives us ~20% TG speedup for DeepSeek on CUDA

* Slightly better

* Also do it for plain (not fused) mul_mat_id

* Guard against numerical precision issues for MLA on CUDA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-10 16:16:51 +02:00
Kawrakow
b096a5de7a This works on CUDA, but (#247)
PP speed is great, almost on par with standard FA.
But TG speed is pathetic. The strangest thing is that
the slowdown is not due to FA, but due to the ffn_gate_exps
gemm, which somehow becomes very slow. WTF?

As I'm unable to resolve the slow ffn_gate_exps GEMM mystery,
for now TG goes via mla=2, PP is via FA.
Also discovered the ggml_cast op, so we don't need the aux
tensors that I had added to the KV cache.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-09 16:53:55 +02:00
Kawrakow
81748fb55e Faster FlashMLA prompt processing (#246)
* FlashMLA-2: faster prompt processing

The current MLA implementation computes

wv_b * (k_cache * softmax(k_cache * (wk_b*q)))

This leads to 3.4X more multiply-adds (madds)
compared to standard attention. Due to the resulting
tensor shapes, TG is still faster than standard attention
because the k_cache*(wk_b*q) and k_cache*(softmax(k_cache * (wk_b*q)))
multiplications become GEMMs, so the additional madds are
more than compensated for due to the much higher performance
of GEMMs compared to GEMVs. But for PP, where we are dealing
with GEMMs in both cases, the additional madds needed for MLA
lead to lower performance, with the performance gap increasing
with context length.

So, then, when we are dealing with PP, we can rearrange the
above to (wv_b * k_cache) * softmax( (wk_b^T*k_cache) * q),
thus transforming it into the standard attention mechanism.
We do need two additional matrix multiplications (which in practice
is done as a single wkv_b * k_cache GEMM) with the *entire*
K cache. But this is still cheaper than MLA, as we end up with
1.8X the madds required by standard attention. Oh, these figures
are for the DeepSeek-V3/R1/Lite attention architecture.
This leads to a significant PP performance increase compared
to standard MLA with FA.

There are many upsides to this:
* If we only apply the above trick when we are processing more than
  X tokens (with suitably chosen X), TG performance stays the same
  as MLA with FA
* We still need to store just the K-cache, so 576 entries per layer
  for DeepSeek-V3/R1/Lite
* We get significantly better PP performance
* We can use MLA+FA on CUDA. It works already with this commit
  for PP, something is not yet quite right for TG.

The downside is that it only works with fp16 cache (for now).
This is so because we need to convert the cache to fp32,
else we cannot do the wkv_b * k_cache matrix multiplication
(which in ggml requires the second operand to be fp32).
But converting (copying) to fp32 only works for f16, bf16 and
f32 tensors, so no luck with quantized cache. Another reason
that we need to convert to fp32 is that the cache contains the
RoPE'd portion, which we need to concatenate to the result of
the wkv_b * k_cache matrix multiplication. Also this op
works only when the tensors being concatenated are both fp32.

So much for ggml being a general purpose ML library.

* FlashMLA-2: on the CPU it now works for quantized cache

except for q8_KV (q8_KV has row meta data, and there is still
some confusion with row sizes because of that).

* FlashMLA-2: on the CPU it now works also with q8_KV

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-08 19:33:41 +02:00
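The rearrangement described in #246 relies on matrix multiplication being associative, which can be checked numerically with a small sketch. This is a pure-Python toy for a single head and a single query. It ignores the RoPE'd portion of the cache and any score scaling, and all function names are made up for illustration. The tensor names `wk_b`, `wv_b` and the head sizes 576/512 and 192/128 are the ones quoted in the commit messages above.

```python
import math
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(r, c)) for c in bt] for r in a]


def softmax(x):
    mx = max(x)
    e = [math.exp(v - mx) for v in x]
    s = sum(e)
    return [v / s for v in e]


def mla_order(wk_b, wv_b, cache, q):
    """wv_b * (cache^T * softmax(cache * (wk_b^T * q))): work in latent space."""
    q_lat = matvec(transpose(wk_b), q)   # absorb wk_b into the query
    p = softmax(matvec(cache, q_lat))    # one score per cached token
    ctx = matvec(transpose(cache), p)    # weighted sum of latent rows
    return matvec(wv_b, ctx)


def standard_order(wk_b, wv_b, cache, q):
    """Expand the whole cache to K and V first, then do standard attention."""
    k = matmul(cache, transpose(wk_b))   # n_tokens x d_k
    v = matmul(cache, transpose(wv_b))   # n_tokens x d_v
    p = softmax(matvec(k, q))
    return matvec(transpose(v), p)


def max_abs_diff(n_tokens=5, d_lat=6, d_k=4, d_v=3, seed=0):
    """Largest element-wise difference between the two orderings."""
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
    wk_b, wv_b, cache = rand(d_k, d_lat), rand(d_v, d_lat), rand(n_tokens, d_lat)
    q = [rng.uniform(-1, 1) for _ in range(d_k)]
    a = mla_order(wk_b, wv_b, cache, q)
    b = standard_order(wk_b, wv_b, cache, q)
    return max(abs(x - y) for x, y in zip(a, b))


def madds_ratio(dk_mla=576, dv_mla=512, dk_std=192, dv_std=128):
    """Per query/key pair and head: madds of MLA relative to standard attention."""
    return (dk_mla + dv_mla) / (dk_std + dv_std)
```

Both orderings agree up to floating-point rounding, so the choice between them is purely a matter of cost. Under the head sizes above, `madds_ratio()` evaluates to 1088 / 320 = 3.4, consistent with the 3.4X figure in the commit message. The cost of the extra `wkv_b * k_cache` multiplication depends on the batch size and is not modeled here.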