Iwan Kawrakow
20d50172d0
Much better FA TG with q8_0 KV cache
...
Just repack it even for TG. But do the repacking for k_step rows,
not the whole K tensor.
2025-04-28 11:26:28 +03:00
Iwan Kawrakow
802d4de1b5
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
9be8b490b1
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
cd44692bc0
Use Sum4q4 for q4_0
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
b19fd13141
Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
ddcdf25e54
Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
9f310ea663
Try to improve for unusual number of heads/number of threads
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
39714026fe
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
a7cd27f7e0
WIP (Zen4)
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
26eb64c4f9
Slightly better
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
bcacf33350
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
998c1b2117
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
fae18dd0bc
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
b498633203
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
74a21d48d6
Add header to avoid compiler warnings
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
6801a4368c
FA: provide work buffer for K repacking
2025-04-26 18:34:35 +03:00
ubergarm
baeefb4731
Add GLM-4-0414 Model Support (#344)
...
* Add GLM-4-0414 model support
Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp.
Still some issues where it doesn't work:
* offloading >=60 layers to GPU
* no flash attention
* Remove seemingly unused llm_tensor enums
Both of these seem unused, and the pre-existing LLM_TENSOR_ATTN_POST_NORM
appears to serve a similar purpose. They don't seem to be used in the
Python code either, so they were removed as likely cruft:
* LLM_TENSOR_POST_ATTN_NORM
* LLM_TENSOR_POST_MLP_NORM
* Set flash attention precision to f32 on GLM4 arch
* Set non flash attention precision to f32 on GLM4
* Remove reshape_3d() for Vcur in build_glm4()
This fixes the non-flash-attention inferencing on both CPU and CUDA.
2025-04-26 17:34:04 +02:00
Kawrakow
9e846f0eb1
Fix division by zero bug (#349)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 09:19:43 +02:00
Kawrakow
715fc552ad
Add support for Cohere2 (#341)
...
* Add support for Cohere2
* Fix IQ4_NL on AVX2
* Command-A needs fp32 precision for K*Q
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 08:13:25 +02:00
Kawrakow
770892086c
Fix q4_1 and q5_1 on Arm (#348)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 19:48:08 +02:00
Kawrakow
c817160d03
Add ability to manually set arch flags (#347)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 13:24:18 +02:00
Kawrakow
25d1a0dca8
Fix FA on ARM (#346)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 11:01:08 +02:00
Kawrakow
f176122a3d
Fix LLaMA-4 attention (#342)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 09:21:03 +02:00
Kawrakow
c9eec1729f
cuda: use switch in constexpr funcs (#343)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24 17:37:12 +02:00
saood06
222a195743
Update gguf-py constants (#298)
...
* Update GGMLQuantizationType
* Update LlamaFileType
* Update GGML_QUANT_SIZES
2025-04-24 00:34:10 -05:00
Kawrakow
9dac3edf2f
BitNet adjustments (#338)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-22 08:46:31 +02:00
saood06
cc39800723
Add support for bitnet2b_2501 model (#337)
...
* add support for bitnet2b_2501 model
* Fixes
* Support both model names
---------
Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-04-22 08:34:13 +02:00
saood06
93cd77b655
Fix termux/android build (#336)
...
* Attempt fix
* Attempt fix 2
* Attempt fix 3
* Attempt fix 4
* Attempt fix 5
* Attempt fix 6
* Attempt fix 7
* Attempt fix 8
* Attempt fix 9
* Attempt fix 10
* Attempt fix 11
* Attempt fix 12
* Attempt fix 13
2025-04-21 09:13:46 +02:00
Kawrakow
3bb64d9330
Better TG performance for GQA models (CPU) (#332)
...
* Slightly better CPU TG performance for GQA
* Better CPU FA implementation for TG when GQA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-17 08:08:40 +02:00
Kawrakow
f7c5a94e75
Better gemm/gemv on AVX2 for q4_0_r8 (#331)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:18:50 +02:00
Kawrakow
1bbb143eb3
Allow q8_0 KV cache for head size 256 (#330)
...
* Allow q8_0 KV cache for head size 256
* We need also these
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:05:31 +02:00
Kawrakow
05dbbeaf14
imatrix: collect layer influence statistics (#328)
...
* imatrix: collect layer influence statistics
* imatrix: collect layer influence statistics also for the last layer
For the last layer we need to use the input for the output.weight
tensor. Last layer(s) tend(s) to be important, so it is useful to also
have its influence metric.
* imatrix: separate metric for attention and ffn importance
* Use stripped tensor name, not src0->name
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:43:19 +02:00
Kawrakow
028e0cfa19
Add ability to hide imatrix details in llama-quantize (#329)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:41:31 +02:00
Kawrakow
d210661c91
Improved IQ1_M quantization (#327)
...
* Much faster, and it looks like better, iq1_m quantization
* Cleanup
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13 10:37:55 +02:00
Kawrakow
c01449a478
Fix KLD precision (#325)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-12 16:17:50 +02:00
Kawrakow
3e7be3d28e
Correct L4 rms_norm (#324)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-11 10:49:18 +02:00
Kawrakow
474435f58b
LLaMA-4 support (text only) (#321)
...
* llama4: WIP
* llama4: this seems to be working
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10 09:05:21 +02:00
Kawrakow
5f44f4b3d0
Guard against attempts to use MLA for non-MLA models (#320)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-08 08:47:24 +02:00
Kawrakow
22d7440ba2
Update AUTHORS
...
Well, there was also the initial MLA PR, which was derived from @fairydreaming
2025-04-07 17:31:35 +02:00
Kawrakow
f03ae19aad
Update AUTHORS
...
Forgot to add @Nexesenex
2025-04-07 17:28:19 +02:00
Kawrakow
b38759127a
Use links for ggml/llama.cpp authors (#318)
...
* Use links for ggml/llama.cpp authors
* This file is not html
* More
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 17:25:06 +02:00
Kawrakow
2309ecda80
Better iq2_xs quantization (#312)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 12:39:04 +02:00
Kawrakow
a051f08b8f
Add copyright notices (#317)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 10:43:26 +02:00
Kawrakow
abbabf7ca1
Update LICENSE
...
I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) is not the same thing as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS).
This PR corrects my mistake.
2025-04-07 10:41:40 +02:00
Kawrakow
ec84855c6a
We need to synchronize before using device to host async memcpy (#313)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-05 14:31:27 +02:00
Kawrakow
c616306a01
Add -flax-vector-conversions for GCC on ARM (#311)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-04 11:04:19 +02:00
Kawrakow
073eda985e
Metal: FA and FlashMLA (#310)
...
* Metal: WIP to update Metal FA implementation
Dk = 192, Dv = 128 works, but not Dk = 576, Dv = 512
* Metal FA: go to float
* WIP
* Metal FA: MLA options now all work
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 17:54:25 +02:00
Kawrakow
2ee6263e24
Fix GCC compilation errors on ARM (#309)
...
* Fix GCC compilation errors on ARM
* One more
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 15:50:53 +02:00
Kawrakow
07dbc1aa06
Metal: much faster MoE prompt processing (#307)
...
* MoE improvements on Metal
This version beats mainline, but there are things I don't understand:
* Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the
same, but we are 30% slower. Why?
* Using actual GEMM, we beat mainline with ubatch size of 128. But then
performance degrades. Why?
* Some cleanup
* Much better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 07:15:49 +02:00
Ikko Eltociear Ashimine
6d405d1fd1
docs: update README.md (#304)
2025-04-01 21:30:25 +02:00