* Update nix flake: sync with upstream, fix for newer nixpkgs
- Sync package.nix with upstream: add rocmGpuTargets/useRpc params,
fix env optionals→optionalAttrs, remove unused inputs
- Rewrite devshells.nix to use mkShell directly instead of passthru
- Fix CUDA license check in nixpkgs-instances.nix for list-type licenses
- Update flake inputs (nixpkgs, flake-parts) to latest
- Update references from ggerganov/llama.cpp to ikawrakow/ik_llama.cpp
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Apply nix review suggestions from baileylu121 (#1371)
- Use legacyPackages instead of import to avoid extra nixpkgs instances
- Use inherit (pkgs) stdenv idiom
- Remove unnecessary lib.pipe wrapper in devshells.nix
- Simplify license normalization with lib.toList in nixpkgs-instances.nix
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Update ik_llama-cuda.Containerfile
- Set -DGGML_NATIVE=ON, as most users build images locally
- Latest llama-swap version
* Update ik_llama-cpu.Containerfile
- Set -DGGML_NATIVE=ON, as most users build images locally
- Latest llama-swap version
* Update README.md
- -DGGML_NATIVE=ON is now the default
* Update ik_llama-cuda-swap.config.yaml
- Adapt to the latest version
* Update ik_llama-cpu-swap.config.yaml
- Adapt to the latest version
* Update ik_llama-cuda-swap.config.yaml
- Add model with aliases
* Update ik_llama-cpu-swap.config.yaml
- Add model with aliases
Add a separate HAVE_VNNI256 code path using _mm256_dpwssd_epi32 and
_mm256_dpbusd_epi32 for the Q3_K R4 kernel. The existing HAVE_FANCY_SIMD
(AVX-512 VNNI) path is preserved unchanged.
Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
Add a separate HAVE_VNNI256 code path using _mm256_dpwssd_epi32 and
_mm256_dpbusd_epi32 for the Q6_K R4 kernel. The existing HAVE_FANCY_SIMD
(AVX-512 VNNI) path is preserved unchanged.
Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
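Both commits apply the same fused-accumulation idea. As a point of reference, a minimal sketch of the i16 side of the substitution (function and variable names here are illustrative, not taken from the kernels):

    #include <immintrin.h>

    // One fused VNNI instruction replaces the AVX2 madd + add pair when
    // accumulating a pairwise i16 dot product into eight i32 lanes.
    static inline __m256i dot_i16_vnni(__m256i acc, __m256i a, __m256i b) {
        return _mm256_dpwssd_epi32(acc, a, b);  // i16*i16 pairs -> +i32
    }

    // Equivalent AVX2 fallback without VNNI:
    static inline __m256i dot_i16_avx2(__m256i acc, __m256i a, __m256i b) {
        return _mm256_add_epi32(acc, _mm256_madd_epi16(a, b));
    }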
Purpose:
Add a --minilog flag to llama-sweep-bench that filters log output down to the essential GPU/layer-distribution information, suppressing verbose model metadata and per-layer device-assignment messages.
Changes:
- Add llama_selective_log_callback with a blacklist approach (sweep-bench.cpp); a sketch of the callback follows the lists below
Blacklisted patterns (hidden):
- Per-layer device assignments ('Setting default device in layer')
- KV metadata dump header and entries
- Tensor type counts
- Model validation messages
- EOG/special token cache info
- Metadata printout (llm_load_print_meta, print_info)
- Layer sizes table
- Tensor loading info (llm_load_tensors)
- Separator lines
- The most common kinds of incomplete/continuation lines are also hidden
All other log output is shown, including:
- GPU VRAM info
- Split/buffer distribution per device
- Graph split estimates
- Final benchmark table and timings
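A minimal sketch of the blacklist approach (the actual pattern list in sweep-bench.cpp is longer; the prefixes below are illustrative):

    #include <cstdio>
    #include <cstring>
    #include "llama.h"

    static void llama_selective_log_callback(ggml_log_level level,
                                             const char * text, void * /*user*/) {
        (void) level;
        static const char * blacklist[] = {
            "Setting default device in layer",
            "llama_model_loader: - kv",
            "llm_load_print_meta:",
            "llm_load_tensors:",
        };
        for (const char * pat : blacklist) {
            if (strstr(text, pat) != nullptr) {
                return;                 // suppress blacklisted lines
            }
        }
        fputs(text, stderr);            // pass everything else through
    }

    // Installed once before model load:
    //   llama_log_set(llama_selective_log_callback, nullptr);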
* fix: FA vec kernels for D=256 quantized KV cache
Three bugs prevented Flash Attention from working with quantized
(q8_0) KV cache on models with head dimension 256, such as
Qwen3-Coder-Next:
1. need_f16_K/V used dimension-based logic (Dk != 128) that forced
q8_0→f16 conversion at D=256. The kernel then read f16 data as
q8_0 blocks. Changed to type-based logic (type_K == GGML_TYPE_F16)
matching mainline.
2. quantize_q8_1_to_shared output pointers were not advanced between
loop iterations, so the second iteration overwrote the first half
of Q data in shared memory. Added i0 offset to output pointers.
3. The Q_i32 register array in the vec_f32 kernel was sized to 1 instead
of 2 for D=256 due to an errant comparison: the size was written as
Dk >= expr, which evaluates to the boolean 1 rather than expr. Removed
the comparison.
Also adds extern template declarations for (256, Q8_0, Q8_0) so the
kernel is compiled and linked.
Tested on Pascal (GTX 1080, CC 6.1) with Qwen3-Coder-Next (D=256,
n_head_kv=2) using --cache-type-k q8_0 --cache-type-v q8_0.
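Bug 3 is worth spelling out, since the pattern survives review easily. A hypothetical reduction (names and the size expression are assumed for illustration, not copied from the kernel):

    // Intended: size the per-thread register array from the head dimension.
    constexpr int Dk   = 256;
    constexpr int n_ok = Dk / 128;          // 2 registers at D=256

    // Buggy: a stray ">=" turns the size expression into a comparison,
    // which evaluates to the boolean 1, so the second half of Q has
    // nowhere to go.
    constexpr int n_bug = (Dk >= Dk / 128); // true -> 1

    static int Q_i32_ok [n_ok];
    static int Q_i32_bug[n_bug];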
* fix: use dimension-based need_f16 for D=256 vec kernels
Use the sign trick with dpbusd instead of maddubs+madd+add,
replacing 3 AVX2 instructions with 1 fused VNNI instruction.
Removes dead HAVE_FANCY_SIMD code left over from the R16 split.
Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
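A minimal sketch of the substitution (operand names assumed; the kernel's real accumulation loop differs). dpbusd multiplies unsigned u8 by signed i8, so the signed product x*y is rewritten as abs(x) * sign(y, x):

    #include <immintrin.h>

    // 3-instruction AVX2 path: u8*i8 pairs -> i16, i16 pairs -> i32, add.
    static inline __m256i dot_avx2(__m256i acc, __m256i x, __m256i y) {
        const __m256i ax = _mm256_sign_epi8(x, x);        // abs(x)
        const __m256i sy = _mm256_sign_epi8(y, x);        // y with x's signs
        __m256i p = _mm256_maddubs_epi16(ax, sy);
        p = _mm256_madd_epi16(p, _mm256_set1_epi16(1));
        return _mm256_add_epi32(acc, p);
    }

    // Single fused VNNI instruction, same result:
    static inline __m256i dot_vnni(__m256i acc, __m256i x, __m256i y) {
        const __m256i ax = _mm256_sign_epi8(x, x);
        const __m256i sy = _mm256_sign_epi8(y, x);
        return _mm256_dpbusd_epi32(acc, ax, sy);
    }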
Ungate the VNNI path in mul_mat_q8_1_r8_q8_2 by changing the
guard from HAVE_FANCY_SIMD to HAVE_VNNI256. This block only uses
256-bit intrinsics so it is safe for AVX-VNNI (non-512) CPUs.
Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
Replace maddubs_epi16 + madd_epi16 with dpbusd_epi32 in the
mul_mat_q8_0_r8_q8_2 dot product lambda when HAVE_VNNI256 is
defined. Same sign trick operands (abs(x), sign(y,x)), just
fewer instructions per sub-block.
Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
* Allow using -rtr and -muge together
* Various Qwen-3.5 tweaks
* No need to make v, g, beta contiguous
* Adjust NEON delta-net to non-contiguous v, g, b
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Without this, libcpp-httplib.a is compiled without SSL support, causing
an undefined reference to httplib::SSLServer at link time even though
the OpenSSL libraries are present on the link line.
Fixes #1449
Co-authored-by: kerem seyhan <kerem.seyhan@codecut.de>
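For context: cpp-httplib compiles its SSL classes only when a feature macro is set, so the static library and everything linking against it must be built with the same define. Simplified from the upstream header (not a verbatim copy):

    // httplib::SSLServer only exists in the object code when
    // CPPHTTPLIB_OPENSSL_SUPPORT was defined while compiling httplib,
    // hence the undefined reference when the .a was built without it.
    #ifdef CPPHTTPLIB_OPENSSL_SUPPORT
    namespace httplib {
    class SSLServer : public Server {
        // ...
    };
    } // namespace httplib
    #endif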
Add new CPU macro HAVE_VNNI256 for CPUs with 256-bit VNNI
(AVX-VNNI) support or better (AVX512-VNNI+VL), separate from
HAVE_FANCY_SIMD which requires the full AVX-512 set. Relax four
#ifdef guards in mul_mat_q4_k_r4_q8_k and mul_mat_q5_k_r4_q8_k
to use HAVE_VNNI256 instead of HAVE_FANCY_SIMD, enabling vpdpbusd
and cvtepi8_epi32 on Alder Lake, Raptor Lake, and similar CPUs.
Co-authored-by: Adam Caldwell <accaldwell@users.noreply.github.com>
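A plausible shape for such a gate (a sketch; the actual definition in the source may differ):

    // 256-bit VNNI is reachable two ways: natively via AVX-VNNI on
    // hybrid parts such as Alder Lake and Raptor Lake, or via
    // AVX512-VNNI + AVX512-VL on full AVX-512 CPUs.
    #if defined(__AVXVNNI__) || (defined(__AVX512VNNI__) && defined(__AVX512VL__))
    #define HAVE_VNNI256 1
    #endif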
* Improve text formatting
* Update README.md with recent models and features
* Update parameters.md with recent additions
* Remove deprecated entries from parameters.md