Commit Graph

3674 Commits

Author SHA1 Message Date
Iwan Kawrakow
1982beb005 Minor tweak 2025-05-07 09:07:34 +03:00
Iwan Kawrakow
53e7e7790e Minor 2025-05-06 19:47:55 +03:00
Iwan Kawrakow
59a3e361a3 Also add these 2025-05-06 16:27:41 +03:00
Iwan Kawrakow
c36fa20d2a Finalizing 2025-05-06 15:54:52 +03:00
Iwan Kawrakow
4edfc6712a Sadly, the previous commit was wrong 2025-05-06 15:05:05 +03:00
Iwan Kawrakow
0fee6c54d9 Much better
The issue was that I did not change the number of warps
used for 3D matrix multiplications (wk_b * kv_cache, MoE),
so we ended up using 4 warps for TG. By going to 1 warp
in these cases, we get a significant boost in TG performance
(tested with DeepSeek-Lite)
2025-05-06 12:59:05 +03:00
Iwan Kawrakow
ed5990712d CUDA WIP: support for FlashMLA-3 2025-05-06 09:32:02 +03:00
Kawrakow
e3fec17347 Fix DeepSeek FA (#382)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-05 08:39:10 +03:00
Kawrakow
f7c9a0f036 CUDA: MMQ for IQ4_KS (#374)
* WIP

* WIP: still getting illegal memory access

* CUDA: MMQ for iq4_ks now works

~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 12:45:00 +03:00
Kawrakow
1328128298 Update README.md 2025-05-04 12:06:47 +03:00
Kawrakow
7cb6a76cd0 Update README.md 2025-05-04 11:49:29 +03:00
Kawrakow
ce2b0292e1 CUDA: faster FA TG for GQA models (#370)
* cuda: WIP MMA FA

* Use MMA for TG also when quantized

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:17:44 +03:00
Kawrakow
b890e01238 Another attempt to fix #367 (#371)
* Another attempt to fix #367

* Yet another

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:02:12 +03:00
Gaolingx
ab7f694b71 cmake: force MSVC compiler charset to utf-8 (#369) 2025-05-03 15:56:29 +03:00
Kawrakow
afcfa85756 Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-03 14:43:55 +03:00
Kawrakow
1ea1df4b2d Fix FA bug on AVX2 (#364)
* Fix FA bug on AVX2

* Also this was wrong

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02 07:09:09 +02:00
saood06
d37add8b39 Fix model architecture name (#366)
Co-authored-by: junhuihe <junhui-he@outlook.com>
2025-05-02 07:07:24 +02:00
Kawrakow
98d1626469 Update README.md (#352)
* Update README.md

* Edits

* Updates

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 15:11:29 +02:00
Kawrakow
4c2bee0bed Fix IQK_FA_ALL_QUANTS on AVX2 (#360)
* Fix IQK_FA_ALL_QUANTS on AVX2

* Make it also work, not just compile

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 10:45:43 +02:00
Kawrakow
9ba362706c Add missing enum values for qwen3 and qwen3moe (#356)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 10:05:38 +02:00
Ben Harris
1064f5bc31 Apply Qwen3 PR from llama.cpp (#355) 2025-04-29 10:02:08 +02:00
Kawrakow
99b87a375f Update AUTHORS
Add @ubergarm
2025-04-29 07:22:06 +02:00
Kawrakow
cda24b58cb CPU FA improvements (#351)
* FA: provide work buffer for K repacking

* Add header to avoid compiler warnings

* WIP

* WIP

* WIP

* WIP

* Slightly better

* WIP (Zen4)

* WIP

* Try to improve for unusual number of heads/number of threads

* Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA

* Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA

* Use Sum4q4 for q4_0

* WIP

* WIP

* Much better FA TG with q8_0 KV cache

Just repack it even for TG. But do the repacking for k_step rows,
not the whole K tensor.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 07:19:43 +02:00
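
Note on cda24b58cb: the last bullet of that commit describes repacking the K cache in blocks of k_step rows rather than repacking the whole K tensor. The idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the commit: it uses plain floats instead of q8_0 blocks, the interleaved layout is an assumed stand-in for the real repacked format, and the names `repack_interleaved` and `scores_blockwise` are hypothetical.

```python
def repack_interleaved(rows):
    """Interleave a block of rows so that element j of every row is contiguous.

    Assumed layout for illustration only: packed[j * n_rows + i] == rows[i][j].
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    packed = [0.0] * (n_rows * n_cols)
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            packed[j * n_rows + i] = value
    return packed


def scores_blockwise(q, k_rows, k_step):
    """Compute the dot product of q with every K row, k_step rows at a time.

    Only the current block is repacked, so the work buffer never holds
    more than k_step rows, regardless of how long the K cache is.
    """
    scores = []
    for start in range(0, len(k_rows), k_step):
        block = k_rows[start:start + k_step]
        packed = repack_interleaved(block)  # hypothetical per-block work buffer
        n = len(block)
        block_scores = [0.0] * n
        # Walk the packed buffer column by column: each q element is
        # multiplied against the same column of all rows in the block.
        for j, qj in enumerate(q):
            for i in range(n):
                block_scores[i] += qj * packed[j * n + i]
        scores.extend(block_scores)
    return scores
```

With this shape the repacking cost is paid per block as the blocks are consumed, and the final block may be shorter than k_step without special handling.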
ubergarm
baeefb4731 Add GLM-4-0414 Model Support (#344)
* Add GLM-4-0414 model support

Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp.

Still some issues where it doesn't work:
* offloading >=60 layers to GPU
* no flash attention

* Remove seemingly unused llm_tensor enums

Both of these seem unused and LLM_TENSOR_ATTN_POST_NORM already
existed which seems pretty similar? Don't think they were used in the
python code either...

So removed these as possibly just cruft:
* LLM_TENSOR_POST_ATTN_NORM
* LLM_TENSOR_POST_MLP_NORM

* Set flash attention precision to f32 on GLM4 arch

* Set non flash attention precision to f32 on GLM4

* Remove reshape_3d() for Vcur in build_glm4()

This fixes the non-flash-attention inferencing on both CPU and CUDA.
2025-04-26 17:34:04 +02:00
Kawrakow
9e846f0eb1 Fix division by zero bug (#349)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 09:19:43 +02:00
Kawrakow
715fc552ad Add support for Cohere2 (#341)
* Add support for Cohere2

* Fix IQ4_NL on AVX2

* Command-A needs fp32 precision for K*Q

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 08:13:25 +02:00
Kawrakow
770892086c Fix q4_1 and q5_1 on Arm (#348)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 19:48:08 +02:00
Kawrakow
c817160d03 Add ability to manually set arch flags (#347)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 13:24:18 +02:00
Kawrakow
25d1a0dca8 Fix FA on ARM (#346)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 11:01:08 +02:00
Kawrakow
f176122a3d Fix LLaMA-4 attention (#342)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 09:21:03 +02:00
Kawrakow
c9eec1729f cuda: use switch in constexpr funcs (#343)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24 17:37:12 +02:00
saood06
222a195743 Update gguf-py constants (#298)
* Update GGMLQuantizationType

* Update LlamaFileType

* Update GGML_QUANT_SIZES
2025-04-24 00:34:10 -05:00
Kawrakow
9dac3edf2f BitNet adjustments (#338)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-22 08:46:31 +02:00
saood06
cc39800723 Add support for bitnet2b_2501 model (#337)
* add support for bitnet2b_2501 model

* Fixes

* Support both model names

---------

Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-04-22 08:34:13 +02:00
saood06
93cd77b655 Fix termux/android build (#336)
* Attempt fix

* Attempt fix 2

* Attempt fix 3

* Attempt fix 4

* Attempt fix 5

* Attempt fix 6

* Attempt fix 7

* Attempt fix 8

* Attempt fix 9

* Attempt fix 10

* Attempt fix 11

* Attempt fix 12

* Attempt fix 13
2025-04-21 09:13:46 +02:00
Kawrakow
3bb64d9330 Better TG performance for GQA models (CPU) (#332)
* Slightly better CPU TG performance for GQA

* Better CPU FA implementation for TG when GQA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-17 08:08:40 +02:00
Kawrakow
f7c5a94e75 Better gemm/gemv on AVX2 for q4_0_r8 (#331)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:18:50 +02:00
Kawrakow
1bbb143eb3 Allow q8_0 KV cache for head size 256 (#330)
* Allow q8_0 KV cache for head size 256

* We need also these

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:05:31 +02:00
Kawrakow
05dbbeaf14 imatrix: collect layer influence statistics (#328)
* imatrix: collect layer influence statistics

* imatrix: collect layer influence statistics also for the last layer

For the last layer we need to use the input for the output.weight
tensor. Last layer(s) tend(s) to be important, so it is useful to also
have its influence metric.

* imatrix: separate metric for attention and ffn importance

* Use stripped tensor name, not src0->name

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:43:19 +02:00
Kawrakow
028e0cfa19 Add ability to hide imatrix details in llama-quantize (#329)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:41:31 +02:00
Kawrakow
d210661c91 Improved IQ1_M quantization (#327)
* Much faster and, it seems, better iq1_m quantization

* Cleanup

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13 10:37:55 +02:00
Kawrakow
c01449a478 Fix KLD precision (#325)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-12 16:17:50 +02:00
Kawrakow
3e7be3d28e Correct L4 rms_norm (#324)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-11 10:49:18 +02:00
Kawrakow
474435f58b LLaMA-4 support (text only) (#321)
* llama4: WIP

* llama4: this seems to be working

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10 09:05:21 +02:00
Kawrakow
5f44f4b3d0 Guard against attempts to use MLA for non-MLA models (#320)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-08 08:47:24 +02:00
Kawrakow
22d7440ba2 Update AUTHORS
Well, there was also the initial MLA PR, which was derived from @fairydreaming
2025-04-07 17:31:35 +02:00
Kawrakow
f03ae19aad Update AUTHORS
Forgot to add @Nexesenex
2025-04-07 17:28:19 +02:00
Kawrakow
b38759127a Use links for ggml/llama.cpp authors (#318)
* Use links for ggml/llama.cpp authors

* This file is not html

* More

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 17:25:06 +02:00
Kawrakow
2309ecda80 Better iq2_xs quantization (#312)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 12:39:04 +02:00
Kawrakow
a051f08b8f Add copyright notices (#317)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 10:43:26 +02:00