Kawrakow
8a2d611083
Fix DeepSeek q8_0 cache (#391)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 12:06:49 +03:00
Kawrakow
6104bf5296
Fix build for Xeon Gold 6226R (#390)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 10:33:27 +03:00
Kawrakow
6e7b28f7b0
Update README.md
2025-05-06 08:48:11 +03:00
Kawrakow
b08471f717
Fix DeepSeek FA (#382)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-05 08:39:10 +03:00
Kawrakow
45cd1bcd59
CUDA: MMQ for IQ4_KS (#374)
...
* WIP
* WIP: still getting illegal memory access
* CUDA: MMQ for iq4_ks now works
~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 12:45:00 +03:00
Kawrakow
db0ed280f1
Update README.md
2025-05-04 12:06:47 +03:00
Kawrakow
7cb99f8078
Update README.md
2025-05-04 11:49:29 +03:00
Kawrakow
711ba7e8f4
CUDA: faster FA TG for GQA models (#370)
...
* cuda: WIP MMA FA
* Use MMA for TG also when quantized
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:17:44 +03:00
Kawrakow
fdbdb5310a
Another attempt to fix #367 (#371)
...
* Another attempt to fix #367
* Yet another
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:02:12 +03:00
Gaolingx
8db70379ae
cmake: force MSVC compiler charset to utf-8 (#369)
2025-05-03 15:56:29 +03:00
Kawrakow
758ca617cd
Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-03 14:43:55 +03:00
Kawrakow
892e96be53
Fix FA bug on AVX2 (#364)
...
* Fix FA bug on AVX2
* Also this was wrong
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02 07:09:09 +02:00
saood06
aca68016d8
Fix model architecture name (#366)
...
Co-authored-by: junhuihe <junhui-he@outlook.com>
2025-05-02 07:07:24 +02:00
Kawrakow
9303df7450
Update README.md (#352)
...
* Update README.md
* Edits
* Updates
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 15:11:29 +02:00
Kawrakow
1ea49001f3
Fix IQK_FA_ALL_QUANTS on AVX2 (#360)
...
* Fix IQK_FA_ALL_QUANTS on AVX2
* Make it also work, not just compile
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 10:45:43 +02:00
Kawrakow
71bc74d738
Add missing enum values for qwen3 and qwen3moe (#356)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 10:05:38 +02:00
Ben Harris
8b62ee32ca
Apply Qwen3 PR from llama.cpp (#355)
2025-04-29 10:02:08 +02:00
Kawrakow
2f2803a1d7
Update AUTHORS
...
Add @ubergarm
2025-04-29 07:22:06 +02:00
Kawrakow
9d9f9f96b2
CPU FA improvements (#351)
...
* FA: provide work buffer for K repacking
* Add header to avoid compiler warnings
* WIP
* WIP
* WIP
* WIP
* Slightly better
* WIP (Zen4)
* WIP
* Try to improve for unusual number of heads/number of threads
* Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA
* Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA
* Use Sum4q4 for q4_0
* WIP
* WIP
* Much better FA TG with q8_0 KV cache
Just repack it even for TG. But do the repacking for k_step rows,
not the whole K tensor.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 07:19:43 +02:00
ubergarm
42d7e58a96
Add GLM-4-0414 Model Support (#344)
...
* Add GLM-4-0414 model support
Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp.
Still some issues where it doesn't work:
* offloading >=60 layers to GPU
* no flash attention
* Remove seemingly unused llm_tensor enums
Both of these seem unused and LLM_TENSOR_ATTN_POST_NORM already
existed which seems pretty similar? Don't think they were used in the
python code either...
So removed these as possibly just cruft:
* LLM_TENSOR_POST_ATTN_NORM
* LLM_TENSOR_POST_MLP_NORM
* Set flash attention precision to f32 on GLM4 arch
* Set non flash attention precision to f32 on GLM4
* Remove reshape_3d() for Vcur in build_glm4()
This fixes the non-flash-attention inferencing on both CPU and CUDA.
2025-04-26 17:34:04 +02:00
Kawrakow
815307d3bd
Fix division by zero bug (#349)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 09:19:43 +02:00
Kawrakow
86be28d5bd
Add support for Cohere2 (#341)
...
* Add support for Cohere2
* Fix IQ4_NL on AVX2
* Command-A needs fp32 precision for K*Q
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 08:13:25 +02:00
Kawrakow
4413f17b58
Fix q4_1 and q5_1 on Arm (#348)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 19:48:08 +02:00
Kawrakow
fb98619852
Add ability to manually set arch flags (#347)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 13:24:18 +02:00
Kawrakow
542351d088
Fix FA on ARM (#346)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 11:01:08 +02:00
Kawrakow
c26f5b315d
Fix LLaMA-4 attention (#342)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 09:21:03 +02:00
Kawrakow
2d2a03df24
cuda: use switch in constexpr funcs (#343)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24 17:37:12 +02:00
saood06
bf095b682f
Update gguf-py constants (#298)
...
* Update GGMLQuantizationType
* Update LlamaFileType
* Update GGML_QUANT_SIZES
2025-04-24 00:34:10 -05:00
Kawrakow
614e59733e
BitNet adjustments (#338)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-22 08:46:31 +02:00
saood06
e6c85a5b95
Add support for bitnet2b_2501 model (#337)
...
* add support for bitnet2b_2501 model
* Fixes
* Support both model names
---------
Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-04-22 08:34:13 +02:00
saood06
16f945d9bb
Fix termux/android build (#336)
...
* Attempt fix
* Attempt fix 2
* Attempt fix 3
* Attempt fix 4
* Attempt fix 5
* Attempt fix 6
* Attempt fix 7
* Attempt fix 8
* Attempt fix 9
* Attempt fix 10
* Attempt fix 11
* Attempt fix 12
* Attempt fix 13
2025-04-21 09:13:46 +02:00
Kawrakow
4a70adae94
Better TG performance for GQA models (CPU) (#332)
...
* Slightly better CPU TG performance for GQA
* Better CPU FA implementation for TG when GQA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-17 08:08:40 +02:00
Kawrakow
5a98a66b5c
Better gemm/gemv on AVX2 for q4_0_r8 (#331)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:18:50 +02:00
Kawrakow
1a786850e6
Allow q8_0 KV cache for head size 256 (#330)
...
* Allow q8_0 KV cache for head size 256
* We need also these
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:05:31 +02:00
Kawrakow
70a1d99fb8
imatrix: collect layer influence statistics (#328)
...
* imatrix: collect layer influence statistics
* imatrix: collect layer influence statistics also for the last layer
For the last layer we need to use the input for the output.weight
tensor. Last layer(s) tend(s) to be important, so it is useful to also
have its influence metric.
* imatrix: separate metric for attention and ffn importance
* Use stripped tensor name, not src0->name
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:43:19 +02:00
Kawrakow
9be8812727
Add ability to hide imatrix details in llama-quantize (#329)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:41:31 +02:00
Kawrakow
0f7aa11b6d
Improved IQ1_M quantization (#327)
...
* Much faster and it looks like better iq1_m quantization
* Cleanup
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13 10:37:55 +02:00
Kawrakow
65fc77c285
Fix KLD precision (#325)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-12 16:17:50 +02:00
Kawrakow
a3b16affaf
Correct L4 rms_norm (#324)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-11 10:49:18 +02:00
Kawrakow
5c127b279f
LLaMA-4 support (text only) (#321)
...
* llama4: WIP
* llama4: this seems to be working
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10 09:05:21 +02:00
Kawrakow
c50d00d0dc
Guard against attempts to use MLA for non-MLA models (#320)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-08 08:47:24 +02:00
Kawrakow
d223b26daf
Update AUTHORS
...
Well, there was also the initial MLA PR, which was derived from @fairydreaming
2025-04-07 17:31:35 +02:00
Kawrakow
21725684a4
Update AUTHORS
...
Forgot to add @Nexesenex
2025-04-07 17:28:19 +02:00
Kawrakow
16b36613d5
Use links for ggml/llama.cpp authors (#318)
...
* Use links for ggml/llama.cpp authors
* This file is not html
* More
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 17:25:06 +02:00
Kawrakow
86c9b08846
Better iq2_xs quantization (#312)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 12:39:04 +02:00
Kawrakow
8210ed4883
Add copyright notices (#317)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 10:43:26 +02:00
Kawrakow
9bd4357cbc
Update LICENSE
...
I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) are not the same as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS).
This PR corrects my mistake.
2025-04-07 10:41:40 +02:00
Kawrakow
d3c0cc788b
We need to synchronize before using device to host async memcpy (#313)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-05 14:31:27 +02:00
Kawrakow
c7fceae221
Add -flax-vector-conversions for GCC on ARM (#311)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-04 11:04:19 +02:00
Kawrakow
9ab6dc9f91
Metal: FA and FlashMLA (#310)
...
* Metal: WIP to update Metal FA implementation
Dk=192, Dv=128 works, but not Dk = 576, Dv = 512
* Metal FA: go to float
* WIP
* Metal FA: MLA options now all work
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 17:54:25 +02:00