* Fix "changes meaning" compiler warnings
* A couple more warnings and formatting fixes
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mainline-compatible FA command line option
* Graph reuse: add command line argument to turn it on
* WIP
* This seems to work
* This is perhaps cleaner
* Change the command line option to -gr
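A minimal sketch of the parse-time effect, assuming a common_params-style struct; the params_sketch type and the graph_reuse field are illustrative names, not necessarily what the patch uses:

    #include <string>

    struct params_sketch { bool graph_reuse = false; }; // assumed field name

    // hypothetical parsing sketch, not the actual patch
    static void parse_args(int argc, char ** argv, params_sketch & params) {
        for (int i = 1; i < argc; ++i) {
            const std::string arg = argv[i];
            if (arg == "-gr") {
                params.graph_reuse = true; // reuse the compute graph across decode calls
            }
        }
    }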
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix q5_0_r4
The issue was in the tail part. Since almost all models have tensor
row sizes that are a multiple of 128, that path was never triggered in
testing. But the gpt-oss models have an embedding size of 2880, so we end
up there and hit the bug (a sketch of the loop split follows this list of fixes).
* Fix q6_0_r4
Same fix as q5_0_r4
* Fix q4_0_r8
* Fix q5_0_r4 and q6_0_r4 also on Zen4
* Fix q4_0_r8 also on Zen4
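For context, a schematic of the loop structure these fixes touched; process_block_128 and process_tail are hypothetical names used only to show where the rarely-exercised path sits:

    // hypothetical helpers, for illustration only
    static void process_block_128(const float * x);
    static void process_tail(const float * x, int n);

    // schematic of the main-loop/tail split: rows are consumed 128 values at
    // a time, and whatever is left over goes through the tail path. With the
    // gpt-oss embedding size of 2880 = 22*128 + 64, the tail path runs.
    static void process_row(const float * x, int n) {
        int i = 0;
        for (; i + 128 <= n; i += 128) {
            process_block_128(x + i); // fast path, hit by almost every model
        }
        if (i < n) {
            process_tail(x + i, n - i); // tail path, where the bug was
        }
    }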
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This way, more recent users who haven't followed the history of FlashMLA
evolution, and hence don't know about the MLA options, get the best setting
without having to add -mla 3 on the command line.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use new-new-mma also for MLA=3, and use mask bounds
This gives us ~25% better PP at 32k tokens compared to main
* This seems better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse concat and copy into K cache
* Avoid ggml_cont() when n_tokens == 1
Combined effect: about +2% in TG performance with full GPU offload
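A minimal sketch of the ggml_cont() part, assuming the standard ggml API; for n_tokens == 1 the view in question is typically already contiguous, so the copy can be skipped:

    #include "ggml.h"

    // sketch, not the actual patch: only schedule a copy when the tensor
    // really is non-contiguous
    static struct ggml_tensor * maybe_cont(struct ggml_context * ctx, struct ggml_tensor * t) {
        return ggml_is_contiguous(t) ? t : ggml_cont(ctx, t);
    }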
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add command line argument for draft model
* Remove second context of draft model
* Format print
* Print usage if parsing -draft fails
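A sketch of the failure handling; print_usage and model_draft are assumed names, and only the -draft flag itself is taken from the change:

    #include <cstdlib>
    #include <string>

    static void print_usage(int argc, char ** argv); // assumed helper

    // hypothetical sketch of the -draft handling
    static bool parse_draft_arg(int & i, int argc, char ** argv, std::string & model_draft) {
        if (std::string(argv[i]) != "-draft") {
            return false;
        }
        if (++i >= argc) {           // -draft given without a model path
            print_usage(argc, argv); // show usage, then bail
            std::exit(1);
        }
        model_draft = argv[i];       // path to the draft model
        return true;
    }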
---------
Co-authored-by: firecoperana <firecoperana>
This commit enables IQK quantization operations on ARM-based systems,
specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell.
Changes:
- Enable IQK_IMPLEMENT macro for ARM NEON operations
- Add arm_neon.h header include for ARM SIMD intrinsics
- Fix compilation errors related to missing NEON types and functions
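Roughly, the change gates the code along these lines (a sketch; the exact condition used in the tree may differ):

    // enable the IQK path and pull in NEON intrinsics on ARM builds
    #if defined(__aarch64__) && defined(__ARM_NEON)
    #define IQK_IMPLEMENT
    #include <arm_neon.h> // float32x4_t, vld1q_f32, FP16 intrinsics
    #endif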
Build requirements for ARM:
    cmake .. -DGGML_CUDA=ON \
        -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \
        -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16"
Tested on:
- Platform: NVIDIA DGX Spark (aarch64)
- CPU: GB10 Grace Blackwell Superchip
- Memory: 128GB unified memory
Fixes build errors:
- 'float32x4_t' does not name a type
- 'vld1q_f32' was not declared in this scope
- 'v_expf' was not declared in this scope
- Missing FP16 NEON intrinsics
* server: fix crash when prompt has image and is too long
* server: fix CORS
* server: fix empty result for embedding
* change error message to truncate prompt
* server: fix slot id for save and load state
* bug fix
* server: update slot similarity to handle mtmd
* server: quick hack to calculate the number of tokens processed with an image
* server: fix out-of-range error when detokenizing the prompt in verbose mode
* Add back Access-Control-Allow-Origin (see the CORS sketch after this list)
* server: add prompt tokens to embedding results
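For the CORS items, a minimal sketch assuming the cpp-httplib server the examples build on; the exact header values are illustrative, not the exact patch:

    #include "httplib.h" // cpp-httplib

    // sketch, not the exact patch: echo the Origin back so browser
    // requests from the webui are allowed
    static void set_cors_headers(const httplib::Request & req, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin",  req.get_header_value("Origin"));
        res.set_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
        res.set_header("Access-Control-Allow-Headers", "Content-Type, Authorization");
    }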
---------
Co-authored-by: firecoperana <firecoperana>
* Use mmq_id in mul_mat_id
* Better
* Also use it in the fused up+gate op
* Better -no-fmoe TG on CUDA
Still much slower than -fmoe, but about 20-25% faster than what
we had before.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Bug fixes for completions and prompt caching in server
* Fix compiler warning about redefinition
---------
Co-authored-by: firecoperana <firecoperana>
* Merge Q and K into a single tensor
* Make V mul mat follow QK mul mat
so they can be fused, which gives slightly better TG performance.
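A sketch of the idea, assuming standard ggml view semantics; names and shapes are illustrative, not the actual graph code:

    #include "ggml.h"

    // with the Q and K weights merged into one tensor, a single mul mat
    // produces both, and Q/K are then recovered as views of the result
    static void qk_single_mul_mat(struct ggml_context * ctx,
            struct ggml_tensor * wqk,  // merged Q+K weight (assumed layout)
            struct ggml_tensor * cur,  // activations [n_embd, n_tokens]
            int64_t n_embd_q,          // number of rows belonging to Q
            struct ggml_tensor ** q, struct ggml_tensor ** k) {
        struct ggml_tensor * qk = ggml_mul_mat(ctx, wqk, cur); // [n_embd_q + n_embd_k, n_tokens]
        const int64_t n_tokens = qk->ne[1];
        *q = ggml_view_2d(ctx, qk, n_embd_q, n_tokens, qk->nb[1], 0);
        *k = ggml_view_2d(ctx, qk, qk->ne[0] - n_embd_q, n_tokens, qk->nb[1],
                          n_embd_q*ggml_element_size(qk));
    }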
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server: add support for vision model
webui: add support for vision model
* server : remove hack for extra parallel slot (#10187)
* llama : fix KV shift for qwen2vl #13870
* add no-context-shift parameter
---------
Co-authored-by: firecoperana <firecoperana>