ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-01 17:40:25 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	8bff04c9d6	Use stripped tensor name, not src0->name	2025-04-14 19:00:06 +03:00
Iwan Kawrakow	a891d49d59	imatrix: separate metric for attention and ffn importance	2025-04-14 16:26:31 +03:00
Iwan Kawrakow	02629c9ab9	imatrix: collect layer influence statistics also for the last layer For the last layer we need to use the input for the output.weight tensor. Last layer(s) tend(s) to be important, so it is useful to also have its influence metric.	2025-04-14 11:48:28 +03:00
Iwan Kawrakow	34be9d8d57	imatrix: collect layer influence statistics	2025-04-14 10:03:39 +03:00
Kawrakow	d210661c91	Improved IQ1_M quantization (#327 ) * Much faster and it looks like better iq1_m quantization * Cleanup * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-13 10:37:55 +02:00
Kawrakow	c01449a478	Fix KLD precision (#325 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-12 16:17:50 +02:00
Kawrakow	3e7be3d28e	Correct L4 rms_norm (#324 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-11 10:49:18 +02:00
Kawrakow	474435f58b	LlaMA-4 support (text only) (#321 ) * llama4: WIP * llama4: this seems to be working --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-10 09:05:21 +02:00
Kawrakow	5f44f4b3d0	Guard against attempts to use MLA for non-MLA models (#320 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-08 08:47:24 +02:00
Kawrakow	22d7440ba2	Update AUTHORS Well, there was also the initial MLA PR, which was derived from @fairydreaming	2025-04-07 17:31:35 +02:00
Kawrakow	f03ae19aad	Update AUTHORS Forgot to add @Nexesenex	2025-04-07 17:28:19 +02:00
Kawrakow	b38759127a	Use links for ggml/llama.cpp authors (#318 ) * Use links for ggml/llama.cpp authors * This file is not html * More --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-07 17:25:06 +02:00
Kawrakow	2309ecda80	Better iq2_xs quantization (#312 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-07 12:39:04 +02:00
Kawrakow	a051f08b8f	Add copyright notices (#317 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-07 10:43:26 +02:00
Kawrakow	abbabf7ca1	Update LICENSE I did not realize until today that the [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS) is not the same thing as the [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS). This PR corrects my mistake.	2025-04-07 10:41:40 +02:00
Kawrakow	ec84855c6a	We need to synchronize before using device to host async memcpy (#313 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-05 14:31:27 +02:00
Kawrakow	c616306a01	Add -flax-vector-conversions for GCC on ARM (#311 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-04 11:04:19 +02:00
Kawrakow	073eda985e	Metal: FA and FlashMLA (#310 ) * Metal: WIP to update Metal FA implementation Dk=192, Dv=128 works, but not Dk = 576, Dv = 512 * Metal FA: go to float * WIP * Metal FA: MLA options now all work --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-03 17:54:25 +02:00
Kawrakow	2ee6263e24	Fix GCC compilation errors on ARM (#309 ) * Fix GCC compilation errors on ARM * One more --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-03 15:50:53 +02:00
Kawrakow	07dbc1aa06	Metal: much faster MoE prompt processing (#307 ) * MoE improvements on Metal This version beats mainline, there are things I don't understand: * Mainline has effectively gone to GEMV for MUL_MAT_ID. We can do the same, but we are 30% slower. Why? * Using actual GEMM, we beat mainline with ubatch size of 128. But then performance degrades. Why? * Some cleanup * Much better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-03 07:15:49 +02:00
Ikko Eltociear Ashimine	6d405d1fd1	docs: update README.md (#304 )	2025-04-01 21:30:25 +02:00
Kawrakow	21a5b8bd28	Fix ARM_NEON build failure due to q8_2 (#303 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-01 13:48:20 +02:00
Kawrakow	190e7866db	Quantization improvements (2) (#302 ) * iq3_k: slightly better quantization Not much of a difference for most models, but this change avoids what looks like a catastrophic failure for DeepSeek-Lite (PPL is now 7.041 vs 7.314 on main). * Small improvement for type-1 quants --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-01 10:31:06 +02:00
Kawrakow	b07a337bfe	Additional guards for interleaved quants (#299 ) * Make sure no interleaved quants are being used for token embeddings also with `--pure` and/or `--custom-q`. * Simplify --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-01 08:29:47 +02:00
Kawrakow	6e5156cab5	Fix #300 (#301 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-01 08:29:25 +02:00
Kawrakow	4819257ce6	Quantization improvements (#295 ) * Better make_qx_quants Tested with q4_0 and q3_K (pure, imatrix), and the improvement is quite significant. * Same for iq4_nl, iq4_xs --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-29 08:09:52 +01:00
Kawrakow	23b0addb34	Make sure tensor row size is multiple of block size also when quantizing with --pure (#294 ) * WIP - not working * q8_0 without bells and whistles works * It works for q8_0 * Use bf16 instead of f16,int16 * q4_0_r8 * q5_0_r4 * q6_0_r4 * Also q4_1 and q5_1 * Add check if selected type is possible with --pure I often want to quantize with --pure to see quantization performance without quantization mixes. But for models where there are tensors with row sizes that are not multiple of 256, this results in a crash for k- and i-quants. Hence, let's add a check if the quant selected via --pure is applicable, and change it if not. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-27 10:48:52 +01:00
Kawrakow	d0b52076da	Use bf16 instead of fp16 block scales for q8_1 (#292 ) * WIP - not working * q8_0 without bells and whistles works * It works for q8_0 * Use bf16 instead of f16,int16 * q4_0_r8 * q5_0_r4 * q6_0_r4 * Also q4_1 and q5_1 * q8_0_r8 on avx2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-27 05:49:16 +01:00
Kawrakow	a22250df93	llama-bench: enable having different number of threads for tg and pp (#284 ) * llama-bench: enable having different number of threads for tg and pp * Add -tgb to usage --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-25 16:31:17 +01:00
saood06	279b7d3395	Update sweep bench (deprecating .jsonl support) (#289 ) * Update sweep bench (deprecating .jsonl support) * Fix README.md	2025-03-25 10:14:44 -05:00
Kawrakow	98a264a2ea	CUDA: better MoE implementation (#283 ) * Make fused MoE reproducible As a bonus, peak performance at pp2048 with u_batch = 2048 is ~8% better. * Slightly better * Also do it for non-fused mul_mat_id --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-25 07:47:10 +01:00
Kawrakow	f9307d7907	Improve DeepSeek batched processing speed (#282 ) * Improve DeepSeek batched processing speed * Revert the commented out section in iqk_mul_mat.cpp It does have some benefit at long contexts. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-23 17:10:52 +01:00
Kawrakow	5a4855e61c	Attempt to improve FlashMLA on the CPU (#277 ) * Fix it for nth > rk2 * Handle rk2%nth_k != 0 * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-23 07:28:21 +01:00
Kawrakow	dd5ebd0e3d	Test transparent huge pages on Linux (#278 ) * Adding ability to use THP on Linux * Use the actual page size used for mmap also in munmap * Add -thp to llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-23 07:24:43 +01:00
Kawrakow	6028362ef6	Native build option for CUDA when GGML_NATIVE is set (#280 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-22 18:17:51 +01:00
Kawrakow	13ecc5332e	Fighting with cmake (#279 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-22 16:58:30 +01:00
Kawrakow	d8584a1bbe	Add Gemma3 support (text only) (#276 ) * WIP Gemma3: not working * gemma3: build_gemma3 seems to be working now * Revert changes to convert_hf_to_gguf.py It wasn't working, so I guess, it is better to leave the conversion up to upstream. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-22 08:05:10 +01:00
Kawrakow	3d6e25c82d	Fix bug: missing parentheses in logical expression (#275 ) This results in GGGGGGGGGGGGG when generating with mla = 3, fa = 0. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-21 13:23:01 +01:00
Kawrakow	022660f7ab	Specify tensor name regex for tensors to be repacked (#274 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-21 10:51:37 +01:00
Kawrakow	ddc8eee10e	FlashMLA-3: the best of both worlds (CPU only) (#273 ) * Repack a model with the quantize tool * WIP * Fixed various issues As we don't have a way to tell if a repacked quant has been modified, I had to remove the modification at the expense of a slight decrease in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and q4_0_r8 on ARM. * Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved * Fix GCC 13.3 compilation error * Another one * Add missing include * FlashMLA-3: the best of both worlds - CPU only --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-21 07:24:22 +01:00
Kawrakow	b8d1fac97b	Convert models to row-interleaved quants using the quantize tool (#272 ) * Repack a model with the quantize tool * WIP * Fixed various issues As we don't have a way to tell if a repacked quant has been modified, I had to remove the modification at the expense of a slight decrease in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and q4_0_r8 on ARM. * Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved * Fix GCC 13.3 compilation error * Another one * Add missing include --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-21 07:23:36 +01:00
Kawrakow	127c6ee649	Honor mmap setting when using tensor overrides (#270 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-19 19:17:03 +01:00
Kawrakow	22c84a126f	Fix ggml_compute_forward_dup_q (#269 ) I broke it with PR #265. I was testing with a model where the wk_b and wk_v tensors were present, so didn't need to be computed, so didn't notice that the change I made to ggml_compute_forward_dup_q breaks that computation. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-19 15:47:24 +01:00
Kawrakow	c3b75c531c	Prevent FlashMLA-1 from running on CUDA (#268 ) as it is not supported. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-19 13:03:59 +01:00
Kawrakow	8e549b4234	Allow q8_0 cache on the CPU for FlashMLA-2 (#265 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-18 15:41:05 +01:00
Kawrakow	68a5b60408	Make Q8_0 KV cache work with mla=2,fa on CUDA (#264 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-18 15:40:47 +01:00
Kawrakow	f4ebf13b6a	Fix #261 (#262 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-18 07:44:43 +01:00
Kawrakow	bdcae905c4	Compile time option to use bf16 for quants without MMQ kernels (#261 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-18 07:37:10 +01:00
Kawrakow	dcdfad29f7	FlashMLA-2: reduce compute buffer size (CUDA and CPU) (#260 ) * FlashMLA-2: eliminate intermediate f32 tensors This works on the CPU. PP performance is ~13% better for 16k tokens and compute buffer is quite a bit smaller. * FlashMLA-2: enable fast path only on the CPU for now I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only. * FlashMLA-2: slightly smaller compute buffer size * Prepare wk_b when loading DeepSeek models (if wk_b is missing) * Add some comments * Fix case where wkv_b is quantized with k- or i-quants. * Fix CUDA There is an issue with quantized GEMV on CUDA when the left operand (the matrix) is not contiguous. So, for now, we also create wv_b during model loading and use that instead of the 3D view of wkv_b. * FlashMLA-2: avoid conversions to f32 also on CUDA * Be able to compute for more than 65535 tokens On CUDA just a quick hack that allows us to concatenate tensors with more than 65535 rows along zeroth dimension as needed by FlashMLA-2. Also needed some care in the perplexity tool to avoid int overflows when evaluating the computed logits. * Reduce memory usage for FlashMLA-2 Oh, also fix int overflow in the CUDA concat implementation. It is funny how the llama.cpp 64-bit police has gone (almost) everywhere and replaced 32-bit ints with 64-bit ints, needed or not, but hasn't done it where it is actually needed. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-18 07:36:42 +01:00
Kawrakow	f91b2e38d0	Prepare wk_b tensors of DeepSeek models on the fly (#259 ) * FlashMLA-2: eliminate intermediate f32 tensors This works on the CPU. PP performance is ~13% better for 16k tokens and compute buffer is quite a bit smaller. * FlashMLA-2: enable fast path only on the CPU for now I did implement the necessary ops on CUDA, but something is still wrong there, so for now we only use it when running CPU-only. * FlashMLA-2: slightly smaller compute buffer size * Prepare wk_b when loading DeepSeek models (if wk_b is missing) * Add some comments * Fix case where wkv_b is quantized with k- or i-quants. * Fix CUDA There is an issue with quantized GEMV on CUDA when the left operand (the matrix) is not contiguous. So, for now, we also create wv_b during model loading and use that instead of the 3D view of wkv_b. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-03-17 09:31:56 +01:00


3638 Commits