* fused mul+multi_add: CPU implementation
* fused mul+multi_add: CUDA
* fused mul+multi_add: command line argument to disable it
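A minimal scalar sketch of what the fused op computes (the helper name is invented for illustration, not the actual ggml kernel): the elementwise multiply and the chain of adds are done in one pass instead of one mul kernel followed by several add kernels, saving intermediate tensor writes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: r[i] = a[i]*b[i] + sum_k extras[k][i], fused into one loop.
std::vector<float> fused_mul_multi_add(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       const std::vector<std::vector<float>>& extras) {
    std::vector<float> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) {
        float acc = a[i] * b[i];                   // the mul part
        for (const auto& e : extras) acc += e[i];  // the multi_add part
        r[i] = acc;
    }
    return r;
}
```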
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse add+add+fused_rms
* Try this
* Macro to easily enable/disable fusion
* Various:
* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection
The issue there was that the bias was per expert
group, not per row, so only the first n_per_group biases were used
for all experts.
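A hedged sketch of the indexing bug (function and variable names are invented for illustration, not the actual ik_llama.cpp code): with experts split into groups of n_per_group, the per-expert selection bias must be indexed by the global expert id; wrapping the index makes every group reuse the first group's biases.

```cpp
#include <cstddef>
#include <vector>

// Buggy form: wraps the index, so bias[0..n_per_group) is reused for every group.
std::vector<float> biased_scores_buggy(const std::vector<float>& scores,
                                       const std::vector<float>& bias,
                                       size_t n_per_group) {
    std::vector<float> out(scores.size());
    for (size_t e = 0; e < scores.size(); ++e)
        out[e] = scores[e] + bias[e % n_per_group];  // wrong
    return out;
}

// Fixed form: one bias per expert, indexed by the global expert id.
std::vector<float> biased_scores_fixed(const std::vector<float>& scores,
                                       const std::vector<float>& bias) {
    std::vector<float> out(scores.size());
    for (size_t e = 0; e < scores.size(); ++e)
        out[e] = scores[e] + bias[e];
    return out;
}
```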
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Combine all calls to llm_build_norm into a single line
so one can more easily check what kind of arguments are being passed
by simply using grep.
* Combine add + fused_rms_norm
For many models this happens at each layer: the result of the
layer is added to the layer input, which then becomes the input
to the next layer, which then is typically normalized via
fused_rms_norm.
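A scalar sketch of the pattern being fused (hypothetical helper, not the actual ggml kernel): the residual add and the RMS normalization share one pass, so the intermediate sum tensor is never written out separately.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fused residual add + RMS norm: sum = residual + x, then normalize the sum.
std::vector<float> fused_add_rms_norm(const std::vector<float>& residual,
                                      const std::vector<float>& x,
                                      const std::vector<float>& weight,
                                      float eps = 1e-6f) {
    const size_t n = x.size();
    std::vector<float> sum(n);
    double ss = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum[i] = residual[i] + x[i];          // the add
        ss += (double)sum[i] * sum[i];        // accumulate for the norm in the same pass
    }
    const float scale = 1.0f / std::sqrt((float)(ss / n) + eps);
    for (size_t i = 0; i < n; ++i)
        sum[i] = sum[i] * scale * weight[i];  // the rms_norm
    return sum;
}
```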
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)
* Fix CPU + CUDA
but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)
* Minor
* Fuse sigmoid+add+topk+get_rows (CUDA)
* Fuse sigmoid+add+topk+get_rows (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CUDA)
* cpu: turn off the OpenAI top-k fusion for now
Something is not right and I don't see the bug.
On the CPU one doesn't gain much if anything, so not a big loss.
* Also fuse sum_rows and div
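For reference, a scalar sketch of the op chain these commits fuse for expert routing (names are illustrative; the real code operates on ggml tensors): sigmoid the gating logits, add a per-expert selection bias, take the top-k, then normalize the selected probabilities so they sum to 1 (the sum_rows + div part).

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Returns the k selected expert ids with their normalized routing weights.
std::vector<std::pair<int,float>> route_experts(const std::vector<float>& logits,
                                                const std::vector<float>& bias,
                                                int k) {
    const int n = (int)logits.size();
    std::vector<float> p(n), sel(n);
    for (int i = 0; i < n; ++i) p[i] = 1.0f / (1.0f + std::exp(-logits[i]));  // sigmoid
    for (int i = 0; i < n; ++i) sel[i] = p[i] + bias[i];                      // add (bias is used for selection only)
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return sel[a] > sel[b]; });         // topk
    float sum = 0.0f;
    for (int i = 0; i < k; ++i) sum += p[idx[i]];                             // sum_rows
    std::vector<std::pair<int,float>> out;
    for (int i = 0; i < k; ++i) out.push_back({idx[i], p[idx[i]] / sum});     // div
    return out;
}
```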
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better argsort (CPU)
* Attempt at grouped topk
* This seems to do the trick for grouped experts routing
* Cleanup
* Trying to merge, something is not right
* Working merged grouped top_k (CPU)
* Add command line option to enable grouped expert routing
* Add grouped expert routing option to llama-bench
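A sketch of grouped top-k routing as I understand it (illustrative only; the group scoring rule here, max over the group, is an assumption): experts are partitioned into groups, the best-scoring groups are kept, and the final top-k is taken only from experts inside the kept groups.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Returns the k selected expert ids, restricted to the top groups_used groups.
std::vector<int> grouped_topk(const std::vector<float>& scores,
                              int n_groups, int groups_used, int k) {
    const int per_group = (int)scores.size() / n_groups;
    // score each group by its best expert (assumed rule)
    std::vector<int> gidx(n_groups);
    std::iota(gidx.begin(), gidx.end(), 0);
    std::vector<float> gscore(n_groups);
    for (int g = 0; g < n_groups; ++g)
        gscore[g] = *std::max_element(scores.begin() + g*per_group,
                                      scores.begin() + (g+1)*per_group);
    std::partial_sort(gidx.begin(), gidx.begin() + groups_used, gidx.end(),
                      [&](int a, int b){ return gscore[a] > gscore[b]; });
    // gather candidates from the kept groups, then plain top-k over them
    std::vector<int> cand;
    for (int i = 0; i < groups_used; ++i)
        for (int j = 0; j < per_group; ++j) cand.push_back(gidx[i]*per_group + j);
    std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
                      [&](int a, int b){ return scores[a] > scores[b]; });
    cand.resize(k);
    return cand;
}
```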
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mtmd: the beginning
* Add mtmd: mtmd.cpp compiles
* Add mtmd: clip initialization compiles
* Add mtmd: clip.cpp compiles
* Add mtmd: builds successfully
* Add CPU implementation for GGML_OP_GLU
* Add CUDA implementation for GGML_OP_GLU
* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add mtmd: refresh CPU rope
* Add mtmd: refresh CUDA rope
* Add mtmd: add Qwen2-VL
* Add mtmd: Qwen2.5-VL text seems to work with this change
* Add mtmd: fix swiglu
* Add mtmd: use LOG_TEE so generated tokens show up in the terminal
* Add mtmd: do not attempt to load a GPU backend if none are available
* GLU, not GPU
* Fix typo
* Fix new/free mismatch
* LOG stuff
* Add mtmd: this fixes gibberish on second image
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Avoid computing FA chunks where the mask is -infinity
* Avoid computing FA chunks where the mask is -infinity also for f16/bf16
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Quick attempt to fuse the Q, K, V GEMMs
Doesn't do much on the CPU
* Doesn't do much on the GPU either
* Use llm_build_mul_mat_qkv
* This is not needed
* Revert timing change committed by mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Offload only activated experts
* This seems to do the trick for -fmoe
* Do not recalculate activated experts for fused up/gate
* Log out of bounds access details
* Add a command line argument
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Bounds for flash attention
* Add n_swa to FA parameters
* Fix it
* This seems very slightly better
* Using vec kernel when we have SWA
* Need also this
* f32 vec kernel
* This is slightly better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fused up+gate+unary for regular (not MoE) FFN - CPU
* WIP CUDA
* Seems to be working on CUDA
For a dense model we get 2-3% speedup for PP and ~0.6% for TG.
* Add command line option
This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate
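A scalar sketch of what the fused path computes for a standard gated FFN, assuming SiLU as the unary op (the helper name is made up; the GEMMs producing `gate` and `up` are elided): instead of two GEMM results going through a separate unary op and multiply, the epilogue produces act(gate) * up directly.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fused epilogue: out[i] = silu(gate[i]) * up[i] in one pass.
std::vector<float> fused_gate_up_silu(const std::vector<float>& gate,
                                      const std::vector<float>& up) {
    std::vector<float> out(gate.size());
    for (size_t i = 0; i < gate.size(); ++i) {
        const float g = gate[i];
        out[i] = (g / (1.0f + std::exp(-g))) * up[i];  // silu(g) = g * sigmoid(g)
    }
    return out;
}
```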
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Skip the row id computation for the ffn_down op
Sadly, almost negligible performance gain.
* Also this doesn't do much
* Also this barely moves the needle
* This is slightly better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Check for NaNs while loading the model.
* Also tell which experts have NaNs.
* Add command line option to validate quants
* Add checks for more quantization types
* Add checks for more quantization types
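For a float tensor the validation is a plain NaN scan, as sketched below; for quantized types one would instead check the per-block fp16 scales, since one NaN scale poisons every weight in its block (the quantized variant is omitted here because the block layouts are type-specific).

```cpp
#include <cmath>
#include <vector>

// Returns true if any element of the tensor data is NaN.
bool tensor_has_nan(const std::vector<float>& data) {
    for (float v : data)
        if (std::isnan(v)) return true;
    return false;
}
```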
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This fixes confusion around Q8_0 on AVX2
* This does it for iq4_nl, including FA
* This does it for iq4_nl on Zen4, but FA does not work
* Slightly more clear
* Adding forgotten q8_0_r8 to num_rows()
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use bperm trick for iq2_ks gemm -> 7% gain
* Use bperm trick for iq2_k gemm -> ~5% gain
* Use bperm trick for iq2_k_r4 gemm -> ~3% gain
* Use bperm trick for iq2_ks gemv -> ~7% gain
* Use bperm trick for iq2_k gemv -> ~3% gain
* Use bperm trick for iq2_k_r4 gemv -> ~7% gain
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* q8_k_r16: basics
* q8_k_r16: iq4_xs now uses q8_k_r16 on Zen4+
PP performance is about the same as using q8_k_r8 on the Ryzen-7950X,
so we expect nice gains on Zen5, and we don't need to worry about
using 2 different q8_k_r8 implementations for fancy SIMD.
* q8_k_r16: iq2_xxs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq2_xs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq2_s now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_xxs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_s now uses q8_k_r16 on Zen4+
* q8_k_r16: iq1_s and iq1_m now use q8_k_r16 on Zen4+
* q8_k_r16: q2_K and q3_K now use q8_k_r16 on Zen4+
* q8_k_r16: iq2_ks and iq2_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq2_kl now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_ks and iq3_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq4_kss, iq4_ks, and iq4_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq5_ks, iq5_k, and iq6_k now use q8_k_r16 on Zen4+
* Fix AVX2
* Just always set num_rows to 16
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use bperm trick for iq3_ks -> 5% PP performance gain
* Use bperm trick for iq3_k -> 5% PP performance gain
* Use bperm trick for iq3_k -> 8% PP performance gain
* Use bperm trick for iq3_k_r4 gemv -> ~5% faster
* Use bperm trick for iq3_k gemv -> ~3% faster
* Use bperm trick for iq3_k gemv -> 4.5% gain
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use __byte_perm in get_int_from_table_16
* Use get_int_from_table_16 everywhere for 4-bit quants
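For readers unfamiliar with the trick, here is a CPU emulation of CUDA's `__byte_perm(a, b, s)` (a sketch: it handles only the plain byte-select mode, ignoring the MSB-replication selectors 8-15 of the underlying PRMT instruction). The 8 bytes of (a, b) form an in-register table, and each selector nibble of s picks one byte of the result; `get_int_from_table_16` uses this to look up four 4-bit quant indices per instruction without touching memory.

```cpp
#include <cstdint>

// Emulates __byte_perm(a, b, s) in default mode: b supplies bytes 4..7,
// a supplies bytes 0..3; selector nibble i of s picks result byte i.
uint32_t byte_perm(uint32_t a, uint32_t b, uint32_t s) {
    const uint64_t v = (uint64_t)b << 32 | a;
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t sel = (s >> (4*i)) & 0x7;              // one of 8 source bytes
        r |= ((uint32_t)((v >> (8*sel)) & 0xff)) << (8*i);
    }
    return r;
}
```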
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Q8_0 needs Q8_0_X4, but Q8_0_R8 needs Q8_2_X4.
So, if we decide to repack a Q8_0 MoE tensor to Q8_0_R8,
iqk_moe_fused_mul_unary fails because the activations were
prepared as Q8_0_X4, but we now need Q8_2_X4.
For now a simple fix: just take the slow path, do not repack.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This does the trick for PP
* Compute mask bounds when creating the mask
* Set mask bounds for all supported SWA models
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* gpt-oss: common
* gpt-oss: attention sinks, swiglu_oai
* gpt-oss: WIP llama
Model loads and runs (CPU only), but PPL is much too high
(~1500 for the 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
* gpt-oss: CPU seems to be working
It was the SWA that was missing in the previous commit.
There are issues with EOG tokens, so those still need to be addressed.
* CUDA: ADD_ID
Just a copy from mainline
* gpt-oss: Seems to be working on CUDA
* gpt-oss: add sinks to the attn-vec kernels
* CUDA: add head size of 64 to new mma
Haven't turned it on yet, but I observe slightly better PP and slightly
worse TG performance with it.
* gpt-oss: add ability to use -fmoe (only CUDA for now)
* Move row sums to the right place
* Add sinks to iqk flash attention
* gpt_oss: Implement -fmoe on the CPU
* Simdify swiglu_oai
Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal throttling more often
because of making the CPU work too hard.
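For reference, a scalar form of the swiglu_oai activation being SIMD-ified (the constants alpha = 1.702 and limit = 7.0 and the clamping order follow my reading of the gpt-oss release; treat them as assumptions): the gate branch is clamped from above and goes through a scaled sigmoid, the linear branch is clamped both ways and gets a +1 offset.

```cpp
#include <algorithm>
#include <cmath>

// Scalar reference for swiglu_oai: x is the gate branch, g the linear branch.
float swiglu_oai(float x, float g, float alpha = 1.702f, float limit = 7.0f) {
    x = std::min(x, limit);                             // clamp gate from above only
    g = std::clamp(g, -limit, limit);                   // clamp linear branch both ways
    const float glu = x / (1.0f + std::exp(-x * alpha)); // x * sigmoid(alpha * x)
    return glu * (1.0f + g);
}
```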
* llama: factor out model loader
* Builds successfully
* It runs, but mmap does not work
* Fix llama_mmap so mmap works
* Minor
* Fix CUDA after latest changes
* Attempt to use CUDA graphs with MoE models - not working
* CUDA graphs WIP - still not working
* CUDA graphs - seems to be working
Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.
* Make q8_0 cache work for DeepSeek models with CUDA graphs
* cuda: cpy for q6_0
* Fix llama_mmap on non-Linux platforms
* Adding forgotten file
* Iterating on Windows build failures
* cuda: re-add q8_0 -> q8_0 transpose
so mla = 2 can be used with CUDA graphs and q8_0 cache.
* Disable graphs without -fmoe
* Minor
* Turn graphs on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>