Commit Graph

3644 Commits

Author SHA1 Message Date
Kawrakow
2d2a03df24 cuda: use switch in constexpr funcs (#343)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24 17:37:12 +02:00
saood06
bf095b682f Update gguf-py constants (#298)
* Update GGMLQuantizationType

* Update LlamaFileType

* Update GGML_QUANT_SIZES
2025-04-24 00:34:10 -05:00
Kawrakow
614e59733e BitNet adjustments (#338)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-22 08:46:31 +02:00
saood06
e6c85a5b95 Add support for bitnet2b_2501 model (#337)
* add support for bitnet2b_2501 model

* Fixes

* Support both model names

---------

Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-04-22 08:34:13 +02:00
saood06
16f945d9bb Fix termux/android build (#336)
* Attempt fix

* Attempt fix 2

* Attempt fix 3

* Attempt fix 4

* Attempt fix 5

* Attempt fix 6

* Attempt fix 7

* Attempt fix 8

* Attempt fix 9

* Attempt fix 10

* Attempt fix 11

* Attempt fix 12

* Attempt fix 13
2025-04-21 09:13:46 +02:00
Kawrakow
4a70adae94 Better TG performance for GQA models (CPU) (#332)
* Slightly better CPU TG performance for GQA

* Better CPU FA implementation for TG when GQA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-17 08:08:40 +02:00
Kawrakow
5a98a66b5c Better gemm/gemv on AVX2 for q4_0_r8 (#331)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:18:50 +02:00
Kawrakow
1a786850e6 Allow q8_0 KV cache for head size 256 (#330)
* Allow q8_0 KV cache for head size 256

* We need also these

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:05:31 +02:00
Kawrakow
70a1d99fb8 imatrix: collect layer influence statistics (#328)
* imatrix: collect layer influence statistics

* imatrix: collect layer influence statistics also for the last layer

For the last layer we need to use the input for the output.weight
tensor. Last layer(s) tend(s) to be important, so it is useful to also
have its influence metric.

* imatrix: separate metric for attention and ffn importance

* Use stripped tensor name, not src0->name

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:43:19 +02:00
Kawrakow
9be8812727 Add ability to hide imatrix details in llama-quantize (#329)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:41:31 +02:00
Kawrakow
0f7aa11b6d Improved IQ1_M quantization (#327)
* Much faster and, it looks like, better iq1_m quantization

* Cleanup

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13 10:37:55 +02:00
Kawrakow
65fc77c285 Fix KLD precision (#325)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-12 16:17:50 +02:00
Kawrakow
a3b16affaf Correct L4 rms_norm (#324)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-11 10:49:18 +02:00
Kawrakow
5c127b279f LlaMA-4 support (text only) (#321)
* llama4: WIP

* llama4: this seems to be working

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10 09:05:21 +02:00
Kawrakow
c50d00d0dc Guard against attempts to use MLA for non-MLA models (#320)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-08 08:47:24 +02:00
Kawrakow
d223b26daf Update AUTHORS
Well, there was also the initial MLA PR, which was derived from @fairydreaming
2025-04-07 17:31:35 +02:00
Kawrakow
21725684a4 Update AUTHORS
Forgot to add @Nexesenex
2025-04-07 17:28:19 +02:00
Kawrakow
16b36613d5 Use links for ggml/llama.cpp authors (#318)
* Use links for ggml/llama.cpp authors

* This file is not html

* More

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 17:25:06 +02:00
Kawrakow
86c9b08846 Better iq2_xs quantization (#312)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 12:39:04 +02:00
Kawrakow
8210ed4883 Add copyright notices (#317)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 10:43:26 +02:00
Kawrakow
9bd4357cbc Update LICENSE
I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) is not the same thing as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS). 

This PR corrects my mistake.
2025-04-07 10:41:40 +02:00
Kawrakow
d3c0cc788b We need to synchronize before using device to host async memcpy (#313)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-05 14:31:27 +02:00
Kawrakow
c7fceae221 Add -flax-vector-conversions for GCC on ARM (#311)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-04 11:04:19 +02:00
Kawrakow
9ab6dc9f91 Metal: FA and FlashMLA (#310)
* Metal: WIP to update Metal FA implementation

Dk=192, Dv=128 works, but not Dk = 576, Dv = 512

* Metal FA: go to float

* WIP

* Metal FA: MLA options now all work

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 17:54:25 +02:00
Kawrakow
1f260865ef Fix GCC compilation errors on ARM (#309)
* Fix GCC compilation errors on ARM

* One more

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 15:50:53 +02:00
Kawrakow
3b5da96073 Metal: much faster MoE prompt processing (#307)
* MoE improvements on Metal

This version beats mainline, but there are things I don't understand:
* Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the
  same, but we are 30% slower. Why?
* Using actual GEMM, we beat mainline with ubatch size of 128. But then
  performance degrades. Why?

* Some cleanup

* Much better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 07:15:49 +02:00
Ikko Eltociear Ashimine
79db2e243f docs: update README.md (#304)
2025-04-01 21:30:25 +02:00
Kawrakow
df20261b6a Fix ARM_NEON build failure due to q8_2 (#303)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 13:48:20 +02:00
Kawrakow
1bc60d6cc9 Quantization improvements (2) (#302)
* iq3_k: slightly better quantization

Not much of a difference for most models, but this change
avoids what looks like a catastrophic failure for DeepSeek-Lite
(PPL is now 7.041 vs 7.314 on main).

* Small improvement for type-1 quants

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 10:31:06 +02:00
Kawrakow
a630958fb4 Additional guards for interleaved quants (#299)
* Make sure no interleaved quants are being used for token embeddings

also with `--pure` and/or `--custom-q`.

* Simplify

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 08:29:47 +02:00
Kawrakow
ba3030c9c3 Fix #300 (#301)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-01 08:29:25 +02:00
Kawrakow
3c3825d7f6 Quantization improvements (#295)
* Better make_qx_quants

Tested with q4_0 and q3_K (pure, imatrix), and the improvement is
quite significant.

* Same for iq4_nl, iq4_xs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-29 08:09:52 +01:00
Kawrakow
9898f480fe Make sure tensor row size is multiple of block size also when quantizing with --pure (#294)
* WIP - not working

* q8_0 without bells and whistles works

* It works for q8_0

* Use bf16 instead of f16,int16

* q4_0_r8

* q5_0_r4

* q6_0_r4

* Also q4_1 and q5_1

* Add check if selected type is possible with --pure

I often want to quantize with --pure to see quantization performance
without quantization mixes. But for models where there are tensors
with row sizes that are not a multiple of 256, this results in a crash
for k- and i-quants. Hence, let's add a check if the quant selected
via --pure is applicable, and change it if not.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-27 10:48:52 +01:00
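The check described in the commit above (#294) can be illustrated with a minimal Python sketch. It is not the code from the PR: the function name `resolve_pure_type` and the `FALLBACK` table are hypothetical and only show the idea of testing whether a tensor's row size is a multiple of the requested type's block size, and switching to a type with a smaller block size if it is not. The block sizes used (32 for the legacy quants, 256 for the k-quants) are the usual ggml values.

```python
# Elements per quantization block (32 for legacy quants, 256 for k-quants).
BLOCK_SIZE = {
    "q4_0": 32, "q5_0": 32, "q8_0": 32,
    "q3_K": 256, "q4_K": 256, "q5_K": 256, "q6_K": 256,
}

# Hypothetical fallback choices, for illustration only.
FALLBACK = {"q3_K": "q4_0", "q4_K": "q4_0", "q5_K": "q5_0", "q6_K": "q8_0"}


def resolve_pure_type(requested: str, row_size: int) -> str:
    """Return a type whose block size divides row_size.

    Keeps the requested type when it is applicable, otherwise switches
    to the (hypothetical) fallback instead of crashing later on.
    """
    if row_size % BLOCK_SIZE[requested] == 0:
        return requested
    fallback = FALLBACK.get(requested)
    if fallback is None or row_size % BLOCK_SIZE[fallback] != 0:
        raise ValueError(
            f"no applicable type for row size {row_size} (requested {requested})"
        )
    return fallback


if __name__ == "__main__":
    print(resolve_pure_type("q4_K", 4096))  # row size is a multiple of 256
    print(resolve_pure_type("q4_K", 2080))  # multiple of 32 but not of 256
```

Doing the check per tensor means only the tensors with awkward row sizes change type; all others keep the type requested via `--pure`.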
Kawrakow
d71e84bdc1 Use bf16 instead of fp16 block scales for q8_1 (#292)
* WIP - not working

* q8_0 without bells and whistles works

* It works for q8_0

* Use bf16 instead of f16,int16

* q4_0_r8

* q5_0_r4

* q6_0_r4

* Also q4_1 and q5_1

* q8_0_r8 on avx2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-27 05:49:16 +01:00
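The commit above (#292) does not state why bf16 was preferred over fp16 for the block scales, so the following Python sketch only illustrates the general difference between the two 16-bit formats: bf16 keeps the 8-bit exponent of float32 and therefore its range, while fp16 has more mantissa bits but overflows above roughly 65504. The helper names `to_fp16` and `to_bf16` are made up for this example, and `to_bf16` assumes a finite input.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision.

    struct raises OverflowError when x is too large for fp16.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round-trip a finite x through bfloat16.

    bfloat16 is the top 16 bits of the float32 encoding; the lower
    16 bits are dropped here with round-to-nearest-even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    big = 100000.0
    print("bf16:", to_bf16(big))  # representable, with reduced precision
    try:
        print("fp16:", to_fp16(big))
    except OverflowError as err:
        print("fp16 overflow:", err)

    fine = 1.0 + 2.0 ** -10       # exactly representable in fp16
    print(to_fp16(fine), to_bf16(fine))
```

The second pair of values shows the trade-off in the other direction: fp16 preserves `1 + 2**-10` exactly, whereas bf16 rounds it to `1.0`.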
Kawrakow
b307c1c375 llama-bench: enable having different number of threads for tg and pp (#284)
* llama-bench: enable having different number of threads for tg and pp

* Add -tgb to usage

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-25 16:31:17 +01:00
saood06
c12a6f8558 Update sweep bench (deprecating .jsonl support) (#289)
* Update sweep bench (deprecating .jsonl support)

* Fix README.md
2025-03-25 10:14:44 -05:00
Kawrakow
6ef4954612 CUDA: better MoE implementation (#283)
* Make fused MoE reproducible

As a bonus, peak performance at pp2048 with u_batch = 2048 is
~8% better.

* Slightly better

* Also do it for non-fused mul_mat_id

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-25 07:47:10 +01:00
Kawrakow
a9a941b5b8 Improve DeepSeek batched processing speed (#282)
* Improve DeepSeek batched processing speed

* Revert the commented out section in iqk_mul_mat.cpp

It does have some benefit at long contexts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23 17:10:52 +01:00
Kawrakow
23ee1ac1b8 Attempt to improve FlashMLA on the CPU (#277)
* Fix it for nth > rk2

* Handle rk2%nth_k != 0

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23 07:28:21 +01:00
Kawrakow
79a105d8ab Test transparent huge pages on Linux (#278)
* Adding ability to use THP on Linux

* Use the actual page size used for mmap also in munmap

* Add -thp to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-23 07:24:43 +01:00
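Why the page size used for `munmap` has to match the one used for `mmap` (commit #278 above) can be seen from simple arithmetic. The sketch below is an illustration only, not code from the PR; the helper `round_up` is hypothetical, and the 2 MiB huge page size is an assumption (a common value on x86-64 Linux).

```python
def round_up(nbytes: int, page_size: int) -> int:
    """Round nbytes up to the next multiple of page_size."""
    return (nbytes + page_size - 1) // page_size * page_size


REGULAR_PAGE = 4096           # typical base page size
HUGE_PAGE = 2 * 1024 * 1024   # assumed huge page size (2 MiB)

if __name__ == "__main__":
    file_size = 5 * 1024 * 1024 + 123

    mapped_len = round_up(file_size, HUGE_PAGE)    # length used when mapping
    wrong_len = round_up(file_size, REGULAR_PAGE)  # length if 4 KiB pages were assumed

    print(mapped_len, wrong_len, mapped_len - wrong_len)
```

If the unmap call were given `wrong_len`, the two lengths would disagree, which is why the same page size has to be used for both calls.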
Kawrakow
37c48feb3e Native build option for CUDA when GGML_NATIVE is set (#280)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22 18:17:51 +01:00
Kawrakow
5a67c8322e Fighting with cmake (#279)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22 16:58:30 +01:00
Kawrakow
42b0e3921b Add Gemma3 support (text only) (#276)
* WIP Gemma3: not working

* gemma3: build_gemma3 seems to be working now

* Revert changes to convert_hf_to_gguf.py

It wasn't working, so I guess it is better to leave the
conversion up to upstream.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-22 08:05:10 +01:00
Kawrakow
eff34cf265 Fix bug: missing parentheses in logical expression (#275)
This results in GGGGGGGGGGGGG when generating with
mla = 3, fa = 0.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 13:23:01 +01:00
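The commit above (#275) does not show the faulty expression, so the following Python sketch only demonstrates the general class of bug with made-up operands `a`, `b`, and `c`. In Python, `and` binds more tightly than `or`, just as `&&` binds more tightly than `||` in C++, so leaving out the parentheses silently changes which condition is evaluated.

```python
def without_parens(a: bool, b: bool, c: bool) -> bool:
    # Parsed as: a or (b and c)
    return a or b and c


def with_parens(a: bool, b: bool, c: bool) -> bool:
    # What was presumably intended in this illustration: (a or b) and c
    return (a or b) and c


if __name__ == "__main__":
    # List every combination where the two forms disagree.
    for a in (False, True):
        for b in (False, True):
            for c in (False, True):
                if without_parens(a, b, c) != with_parens(a, b, c):
                    print(a, b, c, without_parens(a, b, c), with_parens(a, b, c))
```

The two forms differ exactly when `a` is true and `c` is false, so a bug of this kind only shows up for specific combinations of settings.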
Kawrakow
4158743014 Specify tensor name regex for tensors to be repacked (#274)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 10:51:37 +01:00
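Selecting tensors by a name regex, as in commit #274 above, could look roughly like the Python sketch below. This is an assumption-laden illustration: the commit does not show the option name or the matching rules, so the function `select_tensors` is hypothetical, and the use of a partial match (`re.search`) rather than a full match is a guess. The tensor names follow the common GGUF naming pattern.

```python
import re


def select_tensors(names: list[str], pattern: str) -> list[str]:
    """Return the tensor names that match pattern anywhere in the name."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


if __name__ == "__main__":
    names = [
        "token_embd.weight",
        "blk.0.attn_q.weight",
        "blk.0.ffn_down_exps.weight",
        "blk.1.ffn_down_exps.weight",
        "output.weight",
    ]
    # Pick only the expert FFN tensors.
    for name in select_tensors(names, r"ffn_.*_exps"):
        print(name)
```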
Kawrakow
24e780ba74 FlashMLA-3: the best of both worlds (CPU only) (#273)
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

* FlashMLA-3: the best of both worlds - CPU only

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 07:24:22 +01:00
Kawrakow
c5e554f941 Convert models to row-interleaved quants using the quantize tool (#272)
* Repack a model with the quantize tool

* WIP

* Fixed various issues

As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.

* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved

* Fix GCC 13.3 compilation error

* Another one

* Add missing include

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-21 07:23:36 +01:00
Kawrakow
712de34b12 Honor mmap setting when using tensor overrides (#270)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19 19:17:03 +01:00
Kawrakow
f2997472f4 Fix ggml_compute_forward_dup_q (#269)
I broke it with PR #265. I was testing with a model where
the wk_b and wk_v tensors were present, so didn't need to be computed,
so didn't notice that the change I made to ggml_compute_forward_dup_q
breaks that computation.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19 15:47:24 +01:00
Kawrakow
623b5b6cca Prevent FlashMLA-1 from running on CUDA (#268)
as it is not supported.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-03-19 13:03:59 +01:00