Iwan Kawrakow
20d50172d0
Much better FA TG with q8_0 KV cache
...
Just repack it even for TG. But do the repacking for k_step rows,
not the whole K tensor.
2025-04-28 11:26:28 +03:00
Iwan Kawrakow
802d4de1b5
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
9be8b490b1
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
cd44692bc0
Use Sum4q4 for q4_0
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
b19fd13141
Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
ddcdf25e54
Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
9f310ea663
Try to improve for unusual number of heads/number of threads
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
39714026fe
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
a7cd27f7e0
WIP (Zen4)
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
26eb64c4f9
Slightly better
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
bcacf33350
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
998c1b2117
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
fae18dd0bc
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
b498633203
WIP
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
74a21d48d6
Add header to avoid compiler warnings
2025-04-26 18:34:35 +03:00
Iwan Kawrakow
6801a4368c
FA: provide work buffer for K repacking
2025-04-26 18:34:35 +03:00
ubergarm
baeefb4731
Add GLM-4-0414 Model Support (#344)
...
* Add GLM-4-0414 model support
Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp.
Still some issues where it doesn't work:
* offloading >=60 layers to GPU
* no flash attention
* Remove seemingly unused llm_tensor enums
Both of these seem unused, and the pre-existing LLM_TENSOR_ATTN_POST_NORM
appears to serve a similar purpose. They don't seem to be used in the
Python code either, so they were removed as likely cruft:
* LLM_TENSOR_POST_ATTN_NORM
* LLM_TENSOR_POST_MLP_NORM
* Set flash attention precision to f32 on GLM4 arch
* Set non flash attention precision to f32 on GLM4
* Remove reshape_3d() for Vcur in build_glm4()
This fixes the non-flash-attention inferencing on both CPU and CUDA.
2025-04-26 17:34:04 +02:00
Kawrakow
9e846f0eb1
Fix division by zero bug (#349)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 09:19:43 +02:00
Kawrakow
715fc552ad
Add support for Cohere2 (#341)
...
* Add support for Cohere2
* Fix IQ4_NL on AVX2
* Command-A needs fp32 precision for K*Q
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 08:13:25 +02:00
Kawrakow
770892086c
Fix q4_1 and q5_1 on Arm (#348)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 19:48:08 +02:00
Kawrakow
c817160d03
Add ability to manually set arch flags (#347)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 13:24:18 +02:00
Kawrakow
25d1a0dca8
Fix FA on ARM (#346)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 11:01:08 +02:00
Kawrakow
f176122a3d
Fix LLaMA-4 attention (#342)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 09:21:03 +02:00
Kawrakow
c9eec1729f
cuda: use switch in constexpr funcs (#343)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24 17:37:12 +02:00
saood06
222a195743
Update gguf-py constants (#298)
...
* Update GGMLQuantizationType
* Update LlamaFileType
* Update GGML_QUANT_SIZES
2025-04-24 00:34:10 -05:00
Kawrakow
9dac3edf2f
BitNet adjustments (#338)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-22 08:46:31 +02:00
saood06
cc39800723
Add support for bitnet2b_2501 model (#337)
...
* add support for bitnet2b_2501 model
* Fixes
* Support both model names
---------
Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-04-22 08:34:13 +02:00
saood06
93cd77b655
Fix termux/android build (#336)
...
* Attempt fix
* Attempt fix 2
* Attempt fix 3
* Attempt fix 4
* Attempt fix 5
* Attempt fix 6
* Attempt fix 7
* Attempt fix 8
* Attempt fix 9
* Attempt fix 10
* Attempt fix 11
* Attempt fix 12
* Attempt fix 13
2025-04-21 09:13:46 +02:00
Kawrakow
3bb64d9330
Better TG performance for GQA models (CPU) (#332)
...
* Slightly better CPU TG performance for GQA
* Better CPU FA implementation for TG when GQA
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-17 08:08:40 +02:00
Kawrakow
f7c5a94e75
Better gemm/gemv on AVX2 for q4_0_r8 (#331)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:18:50 +02:00
Kawrakow
1bbb143eb3
Allow q8_0 KV cache for head size 256 (#330)
...
* Allow q8_0 KV cache for head size 256
* We need also these
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:05:31 +02:00
Kawrakow
05dbbeaf14
imatrix: collect layer influence statistics (#328)
...
* imatrix: collect layer influence statistics
* imatrix: collect layer influence statistics also for the last layer
For the last layer we need to use the input for the output.weight
tensor. Last layer(s) tend(s) to be important, so it is useful to also
have its influence metric.
* imatrix: separate metric for attention and ffn importance
* Use stripped tensor name, not src0->name
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:43:19 +02:00
Kawrakow
028e0cfa19
Add ability to hide imatrix details in llama-quantize (#329)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:41:31 +02:00
Kawrakow
d210661c91
Improved IQ1_M quantization (#327)
...
* Much faster, and it looks like better, iq1_m quantization
* Cleanup
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13 10:37:55 +02:00
Kawrakow
c01449a478
Fix KLD precision (#325)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-12 16:17:50 +02:00
Kawrakow
3e7be3d28e
Correct L4 rms_norm (#324)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-11 10:49:18 +02:00
Kawrakow
474435f58b
LLaMA-4 support (text only) (#321)
...
* llama4: WIP
* llama4: this seems to be working
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10 09:05:21 +02:00
Kawrakow
5f44f4b3d0
Guard against attempts to use MLA for non-MLA models (#320)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-08 08:47:24 +02:00
Kawrakow
22d7440ba2
Update AUTHORS
...
Well, there was also the initial MLA PR, which was derived from @fairydreaming
2025-04-07 17:31:35 +02:00
Kawrakow
f03ae19aad
Update AUTHORS
...
Forgot to add @Nexesenex
2025-04-07 17:28:19 +02:00
Kawrakow
b38759127a
Use links for ggml/llama.cpp authors (#318)
...
* Use links for ggml/llama.cpp authors
* This file is not html
* More
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 17:25:06 +02:00
Kawrakow
2309ecda80
Better iq2_xs quantization (#312)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 12:39:04 +02:00
Kawrakow
a051f08b8f
Add copyright notices (#317)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-07 10:43:26 +02:00
Kawrakow
abbabf7ca1
Update LICENSE
...
I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) is not the same thing as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS).
This PR corrects my mistake.
2025-04-07 10:41:40 +02:00
Kawrakow
ec84855c6a
We need to synchronize before using device to host async memcpy (#313)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-05 14:31:27 +02:00
Kawrakow
c616306a01
Add -flax-vector-conversions for GCC on ARM (#311)
...
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-04 11:04:19 +02:00
Kawrakow
073eda985e
Metal: FA and FlashMLA (#310)
...
* Metal: WIP to update Metal FA implementation
Dk = 192, Dv = 128 works, but not Dk = 576, Dv = 512
* Metal FA: go to float
* WIP
* Metal FA: MLA options now all work
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 17:54:25 +02:00
Kawrakow
2ee6263e24
Fix GCC compilation errors on ARM (#309)
...
* Fix GCC compilation errors on ARM
* One more
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 15:50:53 +02:00
Kawrakow
07dbc1aa06
Metal: much faster MoE prompt processing (#307)
...
* MoE improvements on Metal
This version beats mainline, but there are things I don't understand:
* Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the
same, but we are 30% slower. Why?
* Using actual GEMM, we beat mainline with ubatch size of 128. But then
performance degrades. Why?
* Some cleanup
* Much better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-03 07:15:49 +02:00
Ikko Eltociear Ashimine
6d405d1fd1
docs: update README.md (#304)
2025-04-01 21:30:25 +02:00