4447 Commits

Kawrakow
233225db8f Take into account layer sizes for setting GPU layers (cont 2) (#1498)
* Also take into account KV cache

* Take into account attn_wkv_b and mla = 3 compute buffers

* WIP

* Minor
2026-03-24 08:18:28 +01:00
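A minimal sketch of the idea behind this series of commits (#1466, #1476, #1498): rather than using a fixed layer count, accumulate per-layer weight, KV-cache, and compute-buffer sizes until the device budget is exhausted. All names and sizes below are illustrative, not the actual implementation.

```cpp
// Illustrative only: greedy layer placement under a VRAM budget,
// counting weights + KV cache per layer plus a one-time compute buffer.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const uint64_t vram_budget = 24ull << 30;               // 24 GiB, illustrative

    std::vector<uint64_t> layer_weights(48, 450ull << 20);  // per-layer weights
    const uint64_t kv_per_layer   = 64ull << 20;            // per-layer KV cache
    const uint64_t compute_buffer = 512ull << 20;           // reserved once

    uint64_t used = compute_buffer;
    int n_gpu_layers = 0;
    for (uint64_t w : layer_weights) {
        if (used + w + kv_per_layer > vram_budget) break;
        used += w + kv_per_layer;
        ++n_gpu_layers;
    }
    printf("n_gpu_layers = %d (%.1f GiB used)\n",
           n_gpu_layers, used / double(1ull << 30));
    return 0;
}
```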
Aliez Ren
f6c8b5f2cb Update nix flake: sync with upstream, fix for newer nixpkgs (#1371)
* Update nix flake: sync with upstream, fix for newer nixpkgs

- Sync package.nix with upstream: add rocmGpuTargets/useRpc params,
  fix env optionals→optionalAttrs, remove unused inputs
- Rewrite devshells.nix to use mkShell directly instead of passthru
- Fix CUDA license check in nixpkgs-instances.nix for list-type licenses
- Update flake inputs (nixpkgs, flake-parts) to latest
- Update references from ggerganov/llama.cpp to ikawrakow/ik_llama.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Apply nix review suggestions from baileylu121 (#1371)

- Use legacyPackages instead of import to avoid extra nixpkgs instances
- Use inherit (pkgs) stdenv idiom
- Remove unnecessary lib.pipe wrapper in devshells.nix
- Simplify license normalization with lib.toList in nixpkgs-instances.nix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 08:17:52 +01:00
Kawrakow
f4125e8b1f Slightly better CPU performance for SWA models (#1496) 2026-03-24 07:53:16 +01:00
mcm007
bbc07002f7 Update docker and its configs (#1497)
* Update ik_llama-cuda.Containerfile

- Build with -DGGML_NATIVE=ON, as most users are building images locally

- Latest llama-swap version

* Update ik_llama-cpu.Containerfile

- Build with -DGGML_NATIVE=ON, as most users are building images locally

- Latest llama-swap version

* Update README.md

- -DGGML_NATIVE=ON is now the default

* Update ik_llama-cuda-swap.config.yaml

- Adapt to the latest version

* Update ik_llama-cpu-swap.config.yaml

- Adapt to the latest version

* Update ik_llama-cuda-swap.config.yaml

- Add model with aliases

* Update ik_llama-cpu-swap.config.yaml

- Add model with aliases
2026-03-24 07:52:54 +01:00
Nexes the Elder
094f76ee86 Cleaner log for adjusted splits (#1494)
* sweep-bench: add more skipped patterns to --minilog

* cleaner log for adjusted splits

* Add totalization for adjusted splits

* Clean up semicolons

* Addition for totalizer ^^

* Changes according to review

* Forgotten leftover removed

* 'total' instead of 'totalized'
2026-03-24 07:49:40 +01:00
firecoperana
cdf9142aa5 fix grammar stack empty error for qwen3.5 (#1490)
* fix grammar stack empty error for qwen3.5

* Add to --help

---------

Co-authored-by: firecoperana <firecoperana>
2026-03-24 07:48:20 +01:00
Kawrakow
b7a2bde4cc Take into account layer sizes for setting GPU layers (cont) (#1476)
* Also take into account KV cache

* Take into account attn_wkv_b and mla = 3 compute buffers
2026-03-23 17:46:53 +01:00
Kawrakow
4eb08208f2 Fix misleading quantize error message (#1493) 2026-03-23 13:55:18 +01:00
Adam Caldwell
47e4015337 Enable AVX-VNNI 256-bit path for Q3_K R4 matmul (#1472)
Add a separate HAVE_VNNI256 code path using _mm256_dpwssd_epi32 and
_mm256_dpbusd_epi32 for the Q3_K R4 kernel. The existing HAVE_FANCY_SIMD
(AVX-512 VNNI) path is preserved unchanged.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-23 09:47:29 +01:00
Kawrakow
3633a7cfca Log HAVE_FANCY_SIMD via LLAMA_LOG_INFO (#1492) 2026-03-23 08:43:29 +01:00
gapeleon
716ecd6457 Added split mode graph for Command-R/R+ models. (#1491) 2026-03-23 08:10:41 +01:00
Adam Caldwell
87e4b9260b Enable AVX-VNNI 256-bit path for Q6_K R4 matmul (#1482)
Add a separate HAVE_VNNI256 code path using _mm256_dpwssd_epi32 and
_mm256_dpbusd_epi32 for the Q6_K R4 kernel. The existing HAVE_FANCY_SIMD
(AVX-512 VNNI) path is preserved unchanged.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-23 08:10:13 +01:00
Kawrakow
56e026f6ba Fix bug when splitting merged ffn_up/gate_exps tensors (#1479) 2026-03-20 18:54:45 +01:00
firecoperana
0c9bc3ed28 server: support --minilog to log request/response messages for the completions and anthropic endpoints (#1477)
Co-authored-by: firecoperana <firecoperana>
2026-03-20 16:13:43 +01:00
Adam Caldwell
ac4d6b94fb Enable AVX-VNNI 256-bit path for IQ3_XXS and IQ3_S R4 matmul (#1474)
Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-20 16:12:07 +01:00
firecoperana
10b44eca72 server: sync anthropic api code (#1469)
* server: sync anthropic api code

* fix cc header issue

---------

Co-authored-by: firecoperana <firecoperana>
2026-03-20 10:18:45 +01:00
Juk Armstrong
08f81b5afd Fix batch calculation for image processing (#1475) 2026-03-20 10:16:05 +01:00
Nexes the Elder
6c665f38fd sweep-bench: add --minilog argument to reduce verbose logging (#1468)
Purpose:
Add --minilog flag to llama-sweep-bench that filters log output to show only essential GPU/layer distribution information while suppressing verbose model metadata and per-layer device assignment messages.

Changes:
- Add llama_selective_log_callback with blacklist approach (sweep-bench.cpp); a minimal sketch follows this entry

Blacklisted patterns (hidden):
- Per-layer device assignments ('Setting default device in layer')
- KV metadata dump header and entries
- Tensor type counts
- Model validation messages
- EOG/special token cache info
- Metadata printout (llm_load_print_meta, print_info)
- Layer sizes table
- Tensor loading info (llm_load_tensors)
- Separator lines
- Most common cases of incomplete/continuation lines are also hidden

All other log output is shown, including:
- GPU VRAM info
- Split/buffer distribution per device
- Graph split estimates
- Final benchmark table and timings
2026-03-20 09:40:56 +01:00
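A minimal sketch of the blacklist-callback approach described above, assuming the standard llama_log_set() callback API; the pattern list is a small illustrative subset, not the full set sweep-bench filters.

```cpp
// Blacklist log filter: suppress lines matching known-noisy patterns,
// pass everything else through to stderr.
#include <cstdio>
#include <cstring>
#include "llama.h"

static void selective_log_callback(ggml_log_level level, const char * text, void * user_data) {
    static const char * const blacklist[] = {
        "Setting default device in layer", // per-layer device assignments
        "llm_load_print_meta",             // model metadata printout
        "llm_load_tensors",                // tensor loading info
    };
    for (const char * pat : blacklist) {
        if (strstr(text, pat) != nullptr) {
            return; // hide blacklisted lines
        }
    }
    fputs(text, stderr); // everything else is shown
    (void) level; (void) user_data;
}

// Installed once at startup when --minilog is set:
// llama_log_set(selective_log_callback, nullptr);
```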
Adam Caldwell
a56a7863c7 Enable AVX-VNNI 256-bit path for IQ4_NL R4 matmul (#1467)
Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-20 09:40:39 +01:00
Max Homilius
77f8060ca6 fix: FA vec kernels for D=256 quantized KV cache (e.g., Qwen3-Coder-Next) (#1452)
* fix: FA vec kernels for D=256 quantized KV cache

Three bugs prevented Flash Attention from working with quantized
(q8_0) KV cache on models with head dimension 256, such as
Qwen3-Coder-Next:

1. need_f16_K/V used dimension-based logic (Dk != 128) that forced
   q8_0→f16 conversion at D=256. The kernel then read f16 data as
   q8_0 blocks. Changed to type-based logic (type_K == GGML_TYPE_F16)
   matching mainline.

2. quantize_q8_1_to_shared output pointers were not advanced between
   loop iterations, so the second iteration overwrote the first half
   of Q data in shared memory. Added i0 offset to output pointers.

3. Q_i32 register array in vec_f32 kernel was sized to 1 instead of 2
   for D=256 due to an errant comparison (Dk >= expr evaluates to
   boolean 1, not expr). Removed the comparison; see the sketch after
   this entry.

Also adds extern template declarations for (256, Q8_0, Q8_0) so the
kernel is compiled and linked.

Tested on Pascal (GTX 1080, CC 6.1) with Qwen3-Coder-Next (D=256,
n_head_kv=2) using --cache-type-k q8_0 --cache-type-v q8_0.

* fix: use dimension-based need_f16 for D=256 vec kernels
2026-03-20 09:39:39 +01:00
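Bug 3 above is easy to reproduce in isolation. A self-contained illustration (constants hypothetical, not the kernel's actual expression): a leftover ">=" turns an element-count expression into a boolean, so the array holds one element regardless of D.

```cpp
#include <cstdio>

int main() {
    constexpr int Dk = 256;

    constexpr int n_bad  = Dk >= Dk/128;  // comparison yields bool -> size 1
    constexpr int n_good = Dk/128;        // intended count -> 2 for Dk = 256

    int Q_i32_bad [n_bad];   // one element: half of the D=256 Q data is lost
    int Q_i32_good[n_good];  // two elements: both halves fit

    printf("bad=%zu good=%zu\n",
           sizeof(Q_i32_bad)/sizeof(int), sizeof(Q_i32_good)/sizeof(int));
    return 0;
}
```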
Kawrakow
0871ab2964 Take into account layer sizes for setting GPU layers (#1466)
* Take into account layer sizes for setting GPU layers

* Fix bug
2026-03-20 09:38:14 +01:00
Kawrakow
b56a3c2dc9 Better barrier (#1456) 2026-03-19 07:33:00 +01:00
Kawrakow
9b7db9bc3f Better --n-cpu-moe (#1464) 2026-03-19 06:57:01 +01:00
Adam Caldwell
b8fa7936bf Enable AVX-VNNI 256-bit path for Q8_K R8 matmul (#1460)
Use the sign trick with dpbusd instead of maddubs+madd+add,
replacing 3 AVX2 instructions with 1 fused VNNI instruction
(sketched after this entry). Removes dead HAVE_FANCY_SIMD code
left over from the R16 split.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-19 06:56:11 +01:00
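A minimal sketch of the sign trick referenced throughout this series of VNNI commits, gated on the HAVE_VNNI256 macro introduced in #1446 (further down this log). Names are illustrative: dpbusd multiplies unsigned by signed bytes, so a signed s8 x s8 dot product is rewritten with abs(x) and sign(y,x).

```cpp
#include <immintrin.h>

// acc += dot(x, y) over groups of 4 signed int8 lanes.
static inline __m256i dot_s8s8(__m256i acc, __m256i x, __m256i y) {
    const __m256i ax = _mm256_sign_epi8(x, x); // |x|, the unsigned operand
    const __m256i sy = _mm256_sign_epi8(y, x); // y with x's signs applied
#ifdef HAVE_VNNI256
    // One fused instruction. (With plain AVX-VNNI, GCC/Clang spell this
    // _mm256_dpbusd_avx_epi32.)
    return _mm256_dpbusd_epi32(acc, ax, sy);
#else
    // The 3-instruction AVX2 sequence these commits replace.
    const __m256i p16 = _mm256_maddubs_epi16(ax, sy);                 // u8*s8 -> s16 pairs
    const __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1)); // s16 pairs -> s32
    return _mm256_add_epi32(acc, p32);
#endif
}
```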
Kawrakow
1a7aa3e7fa Fix potential integer overflow in the flash attention kernels (#1458) 2026-03-18 19:44:46 +01:00
firecoperana
f9b7fe9749 llama: add --dry-run option (#1462)
Co-authored-by: firecoperana <firecoperana>
2026-03-18 17:20:17 +01:00
Adam Caldwell
1f4f09419b Enable AVX-VNNI 256-bit path for Q8_1 R8 dot product (#1463)
Ungate the VNNI path in mul_mat_q8_1_r8_q8_2 by changing the
guard from HAVE_FANCY_SIMD to HAVE_VNNI256. This block only uses
256-bit intrinsics so it is safe for AVX-VNNI (non-512) CPUs.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-18 16:09:30 +01:00
Kawrakow
b08b620c9f Update README 2026-03-18 14:25:47 +01:00
Adam Caldwell
9015b6c51d Enable AVX-VNNI 256-bit path for Q8_0 R8 dot product (#1459)
Replace maddubs_epi16 + madd_epi16 with dpbusd_epi32 in the
mul_mat_q8_0_r8_q8_2 dot product lambda when HAVE_VNNI256 is
defined. Same sign trick operands (abs(x), sign(y,x)), just
fewer instructions per sub-block.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-18 11:14:23 +01:00
Adam Caldwell
8ccb4f856c Enable AVX-VNNI 256-bit path for Q8_1 R8 dot product (#1455)
Ungate the VNNI path in mul_mat_q8_1_r8_q8_2 by changing the
guard from HAVE_FANCY_SIMD to HAVE_VNNI256. This block only uses
256-bit intrinsics so it is safe for AVX-VNNI (non-512) CPUs.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-18 09:18:02 +01:00
Kawrakow
dea161f108 Update model support list in README 2026-03-18 07:34:37 +01:00
Kawrakow
56477c7a9e Mistral 4 support (#1450)
* WIP: mistral4

* CPU FA

* CUDA FA 320, 256
2026-03-18 07:32:39 +01:00
Kawrakow
f6ca2fa8c0 Qwen-3.5/Next tweaks (#1447)
* Allow using -rtr and -muge together

* Various Qwen-3.5 tweaks

* No need to make v, g, beta contiguous

* Adjust NEON delta-net to non-contiguous v, g, b

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-18 07:32:17 +01:00
Nexes the Elder
61fad8b094 Print timings in sweep-bench (#1454) 2026-03-18 06:57:00 +01:00
StrikeOner
a399456c12 fix: propagate CPPHTTPLIB_OPENSSL_SUPPORT to cpp-httplib target when LLAMA_SERVER_SSL=ON (#1451)
Without this, libcpp-httplib.a is compiled without SSL support, causing
an undefined reference to httplib::SSLServer at link time even though
the OpenSSL libraries are present on the link line.

Fixes #1449

Co-authored-by: kerem seyhan <kerem.seyhan@codecut.de>
2026-03-17 16:39:11 +01:00
Kawrakow
bd1e604d0c Update AUTHORS 2026-03-17 10:46:21 +01:00
Adam Caldwell
008125b5c1 Enable AVX-VNNI 256-bit path for Q4_K and Q5_K R4 matmul (#1446)
Add new CPU macro HAVE_VNNI256 for CPUs with 256-bit VNNI
(AVX-VNNI) support or better (AVX512-VNNI+VL), separate from
HAVE_FANCY_SIMD which requires the full AVX-512 set. Relax four
#ifdef guards in mul_mat_q4_k_r4_q8_k and mul_mat_q5_k_r4_q8_k
to use HAVE_VNNI256 instead of HAVE_FANCY_SIMD, enabling vpdpbusd
and cvtepi8_epi32 on Alder Lake, Raptor Lake, and similar CPUs.
A plausible sketch of the gate follows this entry.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-17 10:44:55 +01:00
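A plausible sketch of that gate (the exact detection in ik_llama.cpp may differ): 256-bit VNNI is available either natively via AVX-VNNI, or via AVX512-VNNI together with AVX512-VL for the 256-bit vector forms.

```cpp
// Compiler-defined feature macros; both branches expose the same
// 256-bit VNNI instructions.
#if defined(__AVXVNNI__) || (defined(__AVX512VNNI__) && defined(__AVX512VL__))
#  define HAVE_VNNI256 1
#endif
```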
Kawrakow
54bcafee16 Allow using -rtr and -muge together (#1444) 2026-03-16 18:26:26 +01:00
Kawrakow
e3f5f3d823 Minor delta-net tweaks (#1429) 2026-03-16 13:59:55 +01:00
hksdpc255
fe92e30d1e server : preserve anthropic thinking blocks in conversion (#1441) 2026-03-16 13:59:19 +01:00
Kawrakow
29e6d6b4c1 Check for overlap before fusing ssm_conv and silu (#1443) 2026-03-16 12:14:20 +01:00
Kawrakow
8075acb6cd Turn off CPU ssm_conv and silu fusion (#1440) 2026-03-16 10:44:53 +01:00
hksdpc255
18a9b4c125 fix chat parser not being used in anthropic api (#1437) 2026-03-16 08:59:01 +01:00
hksdpc255
a655a95378 Prevent adding content that starts with 'x-anthropic-' to system_content. (#1436) 2026-03-16 08:57:09 +01:00
Kawrakow
d83b0172b1 Attempt to fix #1438 (#1439) 2026-03-16 08:34:17 +01:00
Kawrakow
56f4e9e673 Fix the fix (#1433) 2026-03-15 17:33:10 +01:00
Kawrakow
edf54621d2 Fix long max. context bottleneck (#1430) 2026-03-15 17:18:27 +01:00
Kawrakow
49778a6ce0 Update Docker documentation with important notice
Clarify the status of Docker support in ik_llama.cpp.
2026-03-15 12:35:04 +01:00
Kawrakow
10a8f5f8f1 Fix hybrid graph parallel + muge (#1426) 2026-03-14 18:15:09 +01:00
Kawrakow
f8d95f1279 Fused SSM_CONV and SILU on ARM_NEON (#1425)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-14 18:14:56 +01:00