4447 Commits

Kawrakow
233225db8f Take into account layer sizes for setting GPU layers (cont 2) (#1498)
* Also take into account KV cache

* Take into account attn_wkv_b and mla = 3 compute buffers

* WIP

* Minor
2026-03-24 08:18:28 +01:00
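A minimal sketch of the idea behind this series of commits (#1466, #1476, #1498): rather than using a fixed layer count, accumulate per-layer weight, KV-cache, and compute-buffer sizes until the device budget is exhausted. All names and sizes below are illustrative, not the actual implementation.

```cpp
// Illustrative only: greedy layer placement under a VRAM budget,
// counting weights + KV cache per layer plus a one-time compute buffer.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const uint64_t vram_budget = 24ull << 30;               // 24 GiB, illustrative

    std::vector<uint64_t> layer_weights(48, 450ull << 20);  // per-layer weights
    const uint64_t kv_per_layer   = 64ull << 20;            // per-layer KV cache
    const uint64_t compute_buffer = 512ull << 20;           // reserved once

    uint64_t used = compute_buffer;
    int n_gpu_layers = 0;
    for (uint64_t w : layer_weights) {
        if (used + w + kv_per_layer > vram_budget) break;
        used += w + kv_per_layer;
        ++n_gpu_layers;
    }
    printf("n_gpu_layers = %d (%.1f GiB used)\n",
           n_gpu_layers, used / double(1ull << 30));
    return 0;
}
```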
Aliez Ren
f6c8b5f2cb Update nix flake: sync with upstream, fix for newer nixpkgs (#1371)
* Update nix flake: sync with upstream, fix for newer nixpkgs

- Sync package.nix with upstream: add rocmGpuTargets/useRpc params,
  fix env optionals→optionalAttrs, remove unused inputs
- Rewrite devshells.nix to use mkShell directly instead of passthru
- Fix CUDA license check in nixpkgs-instances.nix for list-type licenses
- Update flake inputs (nixpkgs, flake-parts) to latest
- Update references from ggerganov/llama.cpp to ikawrakow/ik_llama.cpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Apply nix review suggestions from baileylu121 (#1371)

- Use legacyPackages instead of import to avoid extra nixpkgs instances
- Use inherit (pkgs) stdenv idiom
- Remove unnecessary lib.pipe wrapper in devshells.nix
- Simplify license normalization with lib.toList in nixpkgs-instances.nix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 08:17:52 +01:00
Kawrakow
f4125e8b1f Slightly better CPU performance for SWA models (#1496) 2026-03-24 07:53:16 +01:00
mcm007
bbc07002f7 Update docker and its configs (#1497)
* Update ik_llama-cuda.Containerfile

- Build with -DGGML_NATIVE=ON, as most users are building images locally

- Latest llama-swap version

* Update ik_llama-cpu.Containerfile

- Build with -DGGML_NATIVE=ON, as most users are building images locally

- Latest llama-swap version

* Update README.md

- -DGGML_NATIVE=ON is now the default

* Update ik_llama-cuda-swap.config.yaml

- Adapt to the latest version

* Update ik_llama-cpu-swap.config.yaml

- Adapt to the latest version

* Update ik_llama-cuda-swap.config.yaml

- Add model with aliases

* Update ik_llama-cpu-swap.config.yaml

- Add model with aliases
2026-03-24 07:52:54 +01:00
Nexes the Elder
094f76ee86 Cleaner log for adjusted splits (#1494)
* sweep-bench: add more skipped patterns to --minilog

* cleaner log for adjusted splits

* Add totalization for adjusted splits

* Clean up semicolons

* Addition for totalizer ^^

* Changes according to review

* Forgotten leftover removed

* 'total' instead of 'totalized'
2026-03-24 07:49:40 +01:00
firecoperana
cdf9142aa5 fix grammar stack empty error for qwen3.5 (#1490)
* fix grammar stack empty error for qwen3.5

* Add to --help

---------

Co-authored-by: firecoperana <firecoperana>
2026-03-24 07:48:20 +01:00
Kawrakow
b7a2bde4cc Take into account layer sizes for setting GPU layers (cont) (#1476)
* Also take into account KV cache

* Take into account attn_wkv_b and mla = 3 compute buffers
2026-03-23 17:46:53 +01:00
Kawrakow
4eb08208f2 Fix misleading quantize error message (#1493) 2026-03-23 13:55:18 +01:00
Adam Caldwell
47e4015337 Enable AVX-VNNI 256-bit path for Q3_K R4 matmul (#1472)
Add a separate HAVE_VNNI256 code path using _mm256_dpwssd_epi32 and
_mm256_dpbusd_epi32 for the Q3_K R4 kernel. The existing HAVE_FANCY_SIMD
(AVX-512 VNNI) path is preserved unchanged.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-23 09:47:29 +01:00
Kawrakow
3633a7cfca Log HAVE_FANCY_SIMD via LLAMA_LOG_INFO (#1492) 2026-03-23 08:43:29 +01:00
gapeleon
716ecd6457 Added split mode graph for Command-R/R+ models. (#1491) 2026-03-23 08:10:41 +01:00
Adam Caldwell
87e4b9260b Enable AVX-VNNI 256-bit path for Q6_K R4 matmul (#1482)
Add a separate HAVE_VNNI256 code path using _mm256_dpwssd_epi32 and
_mm256_dpbusd_epi32 for the Q6_K R4 kernel. The existing HAVE_FANCY_SIMD
(AVX-512 VNNI) path is preserved unchanged.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-23 08:10:13 +01:00
Kawrakow
56e026f6ba Fix bug when splitting merged ffn_up/gate_exps tensors (#1479) 2026-03-20 18:54:45 +01:00
firecoperana
0c9bc3ed28 server: support --minilog to log request/response messages for the completions and anthropic endpoints (#1477)
Co-authored-by: firecoperana <firecoperana>
2026-03-20 16:13:43 +01:00
Adam Caldwell
ac4d6b94fb Enable AVX-VNNI 256-bit path for IQ3_XXS and IQ3_S R4 matmul (#1474)
Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-20 16:12:07 +01:00
firecoperana
10b44eca72 server: sync anthropic api code (#1469)
* server: sync anthropic api code

* fix cc header issue

---------

Co-authored-by: firecoperana <firecoperana>
2026-03-20 10:18:45 +01:00
Juk Armstrong
08f81b5afd Fix batch calculation for image processing (#1475) 2026-03-20 10:16:05 +01:00
Nexes the Elder
6c665f38fd sweep-bench: add --minilog argument to reduce verbose logging (#1468)
Purpose:
Add --minilog flag to llama-sweep-bench that filters log output to show only essential GPU/layer distribution information while suppressing verbose model metadata and per-layer device assignment messages.

Changes:
- Add llama_selective_log_callback with blacklist approach (sweep-bench.cpp); a minimal sketch follows this entry

Blacklisted patterns (hidden):
- Per-layer device assignments ('Setting default device in layer')
- KV metadata dump header and entries
- Tensor type counts
- Model validation messages
- EOG/special token cache info
- Metadata printout (llm_load_print_meta, print_info)
- Layer sizes table
- Tensor loading info (llm_load_tensors)
- Separator lines
- Most common cases of incomplete/continuation lines are also hidden

All other log output is shown, including:
- GPU VRAM info
- Split/buffer distribution per device
- Graph split estimates
- Final benchmark table and timings
2026-03-20 09:40:56 +01:00
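A minimal sketch of the blacklist-callback approach described above, assuming the standard llama_log_set() callback API; the pattern list is a small illustrative subset, not the full set sweep-bench filters.

```cpp
// Blacklist log filter: suppress lines matching known-noisy patterns,
// pass everything else through to stderr.
#include <cstdio>
#include <cstring>
#include "llama.h"

static void selective_log_callback(ggml_log_level level, const char * text, void * user_data) {
    static const char * const blacklist[] = {
        "Setting default device in layer", // per-layer device assignments
        "llm_load_print_meta",             // model metadata printout
        "llm_load_tensors",                // tensor loading info
    };
    for (const char * pat : blacklist) {
        if (strstr(text, pat) != nullptr) {
            return; // hide blacklisted lines
        }
    }
    fputs(text, stderr); // everything else is shown
    (void) level; (void) user_data;
}

// Installed once at startup when --minilog is set:
// llama_log_set(selective_log_callback, nullptr);
```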
Adam Caldwell
a56a7863c7 Enable AVX-VNNI 256-bit path for IQ4_NL R4 matmul (#1467)
Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-20 09:40:39 +01:00
Max Homilius
77f8060ca6 fix: FA vec kernels for D=256 quantized KV cache (e.g., Qwen3-Coder-Next) (#1452)
* fix: FA vec kernels for D=256 quantized KV cache

Three bugs prevented Flash Attention from working with quantized
(q8_0) KV cache on models with head dimension 256, such as
Qwen3-Coder-Next:

1. need_f16_K/V used dimension-based logic (Dk != 128) that forced
   q8_0→f16 conversion at D=256. The kernel then read f16 data as
   q8_0 blocks. Changed to type-based logic (type_K == GGML_TYPE_F16)
   matching mainline.

2. quantize_q8_1_to_shared output pointers were not advanced between
   loop iterations, so the second iteration overwrote the first half
   of Q data in shared memory. Added i0 offset to output pointers.

3. Q_i32 register array in vec_f32 kernel was sized to 1 instead of 2
   for D=256 due to an errant comparison (Dk >= expr evaluates to
   boolean 1, not expr). Removed the comparison; see the sketch after
   this entry.

Also adds extern template declarations for (256, Q8_0, Q8_0) so the
kernel is compiled and linked.

Tested on Pascal (GTX 1080, CC 6.1) with Qwen3-Coder-Next (D=256,
n_head_kv=2) using --cache-type-k q8_0 --cache-type-v q8_0.

* fix: use dimension-based need_f16 for D=256 vec kernels
2026-03-20 09:39:39 +01:00
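Bug 3 above is easy to reproduce in isolation. A self-contained illustration (constants hypothetical, not the kernel's actual expression): a leftover ">=" turns an element-count expression into a boolean, so the array holds one element regardless of D.

```cpp
#include <cstdio>

int main() {
    constexpr int Dk = 256;

    constexpr int n_bad  = Dk >= Dk/128;  // comparison yields bool -> size 1
    constexpr int n_good = Dk/128;        // intended count -> 2 for Dk = 256

    int Q_i32_bad [n_bad];   // one element: half of the D=256 Q data is lost
    int Q_i32_good[n_good];  // two elements: both halves fit

    printf("bad=%zu good=%zu\n",
           sizeof(Q_i32_bad)/sizeof(int), sizeof(Q_i32_good)/sizeof(int));
    return 0;
}
```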
Kawrakow
0871ab2964 Take into account layer sizes for setting GPU layers (#1466)
* Take into account layer sizes for setting GPU layers

* Fix bug
2026-03-20 09:38:14 +01:00
Kawrakow
b56a3c2dc9 Better barrier (#1456) 2026-03-19 07:33:00 +01:00
Kawrakow
9b7db9bc3f Better --n-cpu-moe (#1464) 2026-03-19 06:57:01 +01:00
Adam Caldwell
b8fa7936bf Enable AVX-VNNI 256-bit path for Q8_K R8 matmul (#1460)
Use the sign trick with dpbusd instead of maddubs+madd+add,
replacing 3 AVX2 instructions with 1 fused VNNI instruction
(sketched after this entry). Removes dead HAVE_FANCY_SIMD code
left over from the R16 split.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-19 06:56:11 +01:00
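A minimal sketch of the sign trick referenced throughout this series of VNNI commits, gated on the HAVE_VNNI256 macro introduced in #1446 (further down this log). Names are illustrative: dpbusd multiplies unsigned by signed bytes, so a signed s8 x s8 dot product is rewritten with abs(x) and sign(y,x).

```cpp
#include <immintrin.h>

// acc += dot(x, y) over groups of 4 signed int8 lanes.
static inline __m256i dot_s8s8(__m256i acc, __m256i x, __m256i y) {
    const __m256i ax = _mm256_sign_epi8(x, x); // |x|, the unsigned operand
    const __m256i sy = _mm256_sign_epi8(y, x); // y with x's signs applied
#ifdef HAVE_VNNI256
    // One fused instruction. (With plain AVX-VNNI, GCC/Clang spell this
    // _mm256_dpbusd_avx_epi32.)
    return _mm256_dpbusd_epi32(acc, ax, sy);
#else
    // The 3-instruction AVX2 sequence these commits replace.
    const __m256i p16 = _mm256_maddubs_epi16(ax, sy);                 // u8*s8 -> s16 pairs
    const __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1)); // s16 pairs -> s32
    return _mm256_add_epi32(acc, p32);
#endif
}
```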
Kawrakow
1a7aa3e7fa Fix potential integer overflow in the flash attention kernels (#1458) 2026-03-18 19:44:46 +01:00
firecoperana
f9b7fe9749 llama: add --dry-run option (#1462)
Co-authored-by: firecoperana <firecoperana>
2026-03-18 17:20:17 +01:00
Adam Caldwell
1f4f09419b Enable AVX-VNNI 256-bit path for Q8_1 R8 dot product (#1463)
Ungate the VNNI path in mul_mat_q8_1_r8_q8_2 by changing the
guard from HAVE_FANCY_SIMD to HAVE_VNNI256. This block only uses
256-bit intrinsics so it is safe for AVX-VNNI (non-512) CPUs.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-18 16:09:30 +01:00
Kawrakow
b08b620c9f Update README 2026-03-18 14:25:47 +01:00
Adam Caldwell
9015b6c51d Enable AVX-VNNI 256-bit path for Q8_0 R8 dot product (#1459)
Replace maddubs_epi16 + madd_epi16 with dpbusd_epi32 in the
mul_mat_q8_0_r8_q8_2 dot product lambda when HAVE_VNNI256 is
defined. Same sign trick operands (abs(x), sign(y,x)), just
fewer instructions per sub-block.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-18 11:14:23 +01:00
Adam Caldwell
8ccb4f856c Enable AVX-VNNI 256-bit path for Q8_1 R8 dot product (#1455)
Ungate the VNNI path in mul_mat_q8_1_r8_q8_2 by changing the
guard from HAVE_FANCY_SIMD to HAVE_VNNI256. This block only uses
256-bit intrinsics so it is safe for AVX-VNNI (non-512) CPUs.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-18 09:18:02 +01:00
Kawrakow
dea161f108 Update model support list in README 2026-03-18 07:34:37 +01:00
Kawrakow
56477c7a9e Mistral 4 support (#1450)
* WIP: mistral4

* CPU FA

* CUDA FA 320, 256
2026-03-18 07:32:39 +01:00
Kawrakow
f6ca2fa8c0 Qwen-3.5/Next tweaks (#1447)
* Allow using -rtr and -muge together

* Various Qwen-3.5 tweaks

* No need to make v, g, beta contiguous

* Adjust NEON delta-net to non-contiguous v, g, b

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-18 07:32:17 +01:00
Nexes the Elder
61fad8b094 Print timings in sweep-bench (#1454) 2026-03-18 06:57:00 +01:00
StrikeOner
a399456c12 fix: propagate CPPHTTPLIB_OPENSSL_SUPPORT to cpp-httplib target when LLAMA_SERVER_SSL=ON (#1451)
Without this, libcpp-httplib.a is compiled without SSL support, causing
an undefined reference to httplib::SSLServer at link time even though
the OpenSSL libraries are present on the link line.

Fixes #1449

Co-authored-by: kerem seyhan <kerem.seyhan@codecut.de>
2026-03-17 16:39:11 +01:00
Kawrakow
bd1e604d0c Update AUTHORS 2026-03-17 10:46:21 +01:00
Adam Caldwell
008125b5c1 Enable AVX-VNNI 256-bit path for Q4_K and Q5_K R4 matmul (#1446)
Add new CPU macro HAVE_VNNI256 for CPUs with 256-bit VNNI
(AVX-VNNI) support or better (AVX512-VNNI+VL), separate from
HAVE_FANCY_SIMD which requires the full AVX-512 set. Relax four
#ifdef guards in mul_mat_q4_k_r4_q8_k and mul_mat_q5_k_r4_q8_k
to use HAVE_VNNI256 instead of HAVE_FANCY_SIMD, enabling vpdpbusd
and cvtepi8_epi32 on Alder Lake, Raptor Lake, and similar CPUs.
A plausible sketch of the gate follows this entry.

Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
2026-03-17 10:44:55 +01:00
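A plausible sketch of that gate (the exact detection in ik_llama.cpp may differ): 256-bit VNNI is available either natively via AVX-VNNI, or via AVX512-VNNI together with AVX512-VL for the 256-bit vector forms.

```cpp
// Compiler-defined feature macros; both branches expose the same
// 256-bit VNNI instructions.
#if defined(__AVXVNNI__) || (defined(__AVX512VNNI__) && defined(__AVX512VL__))
#  define HAVE_VNNI256 1
#endif
```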
Kawrakow
54bcafee16 Allow using -rtr and -muge together (#1444) 2026-03-16 18:26:26 +01:00
Kawrakow
e3f5f3d823 Minor delta-net tweaks (#1429) 2026-03-16 13:59:55 +01:00
hksdpc255
fe92e30d1e server : preserve anthropic thinking blocks in conversion (#1441) 2026-03-16 13:59:19 +01:00
Kawrakow
29e6d6b4c1 Check for overlap before fusing ssm_conv and silu (#1443) 2026-03-16 12:14:20 +01:00
Kawrakow
8075acb6cd Turn off CPU ssm_conv and silu fusion (#1440) 2026-03-16 10:44:53 +01:00
hksdpc255
18a9b4c125 fix chat parser not being used in anthropic api (#1437) 2026-03-16 08:59:01 +01:00
hksdpc255
a655a95378 Prevent adding content that starts with 'x-anthropic-' to system_content. (#1436) 2026-03-16 08:57:09 +01:00
Kawrakow
d83b0172b1 Attempt to fix #1438 (#1439) 2026-03-16 08:34:17 +01:00
Kawrakow
56f4e9e673 Fix the fix (#1433) 2026-03-15 17:33:10 +01:00
Kawrakow
edf54621d2 Fix long max. context bottleneck (#1430) 2026-03-15 17:18:27 +01:00
Kawrakow
49778a6ce0 Update Docker documentation with important notice
Clarify the status of Docker support in ik_llama.cpp.
2026-03-15 12:35:04 +01:00
Kawrakow
10a8f5f8f1 Fix hybrid graph parallel + muge (#1426) 2026-03-14 18:15:09 +01:00
Kawrakow
f8d95f1279 Fused SSM_CONV and SILU on ARM_NEON (#1425)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-14 18:14:56 +01:00