* handle reasoning content in webui
server : include usage statistics only when the user requests them (#16052)
server : only attempt to enable thinking if using jinja (#15967)
* config reasoning_content in webui and change default to auto
---------
Co-authored-by: firecoperana <firecoperana>
* Offload only activated experts
* This seems to do the trick for -fmoe
* Do not recalculate activated experts for fused up/gate
* Log out of bounds access details
* Add a command line argument
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
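The offload-only-activated-experts change above hinges on the router's top-k selection: only the experts the router actually picks for the current batch need their weights moved. A minimal Python sketch of the idea (function names are illustrative, not from the codebase):

```python
import heapq

def activated_experts(router_logits, top_k):
    """Return, per token, the indices of the top_k experts the router
    selects; only these experts' weights need to be offloaded/computed."""
    per_token = []
    for logits in router_logits:
        # nlargest picks the top_k expert indices by router logit
        best = heapq.nlargest(top_k, range(len(logits)), key=lambda i: logits[i])
        per_token.append(sorted(best))
    return per_token

def union_of_experts(per_token):
    """The set of experts that must actually be resident for this batch."""
    used = set()
    for ids in per_token:
        used.update(ids)
    return sorted(used)
```

With few tokens in flight this union is typically much smaller than the full expert set, which is what makes skipping the inactive experts worthwhile.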
* Bounds for flash attention
* Add n_swa to FA parameters
* Fix it
* This seems very slightly better
* Using vec kernel when we have SWA
* Also need this
* f32 vec kernel
* This is slightly better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fused up+gate+unary for regular (not MoE) FFN - CPU
* WIP CUDA
* Seems to be working on CUDA
For a dense model we get 2-3% speedup for PP and ~0.6% for TG.
* Add command line option
This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
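For reference, the fused up+gate+unary op computes act(gate(x)) * up(x) in a single pass instead of three separate ops (two matmuls plus an element-wise multiply). A small Python sketch of the equivalence, assuming SiLU as the unary activation (the actual kernels are C++/CUDA):

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn_up_gate_separate(w_up, w_gate, x):
    # three separate ops: up matmul, gate matmul, then unary + multiply
    up = matvec(w_up, x)
    gate = matvec(w_gate, x)
    return [silu(g) * u for g, u in zip(gate, up)]

def ffn_up_gate_fused(w_up, w_gate, x):
    # one pass per output row: both dot products and the activation
    # together, so the intermediate up/gate vectors are never materialized
    out = []
    for row_u, row_g in zip(w_up, w_gate):
        u = sum(wi * xi for wi, xi in zip(row_u, x))
        g = sum(wi * xi for wi, xi in zip(row_g, x))
        out.append(silu(g) * u)
    return out
```

The fused form saves a pass over the intermediate activations, which is where the small PP/TG gains quoted above come from.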
* Skip the row id computation for the ffn_down op
Sadly, almost negligible performance gain.
* Also this doesn't do much
* Also this barely moves the needle
* This is slightly better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Check for NaNs while loading the model.
* Also tell which experts have NaNs.
* Add command line option to validate quants
* Add checks for more quantization types
* Add checks for more quantization types
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
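A rough Python model of the NaN validation described above (names are illustrative; the real check scans dequantized tensor data at load time and reports the offending experts):

```python
import math

def find_nan_experts(expert_tensors):
    """expert_tensors: {expert_id: list of floats (dequantized weights)}.
    Returns the ids of experts containing at least one NaN, so the log
    can say exactly which experts are corrupt."""
    bad = []
    for expert_id, weights in sorted(expert_tensors.items()):
        if any(math.isnan(w) for w in weights):
            bad.append(expert_id)
    return bad
```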
* This fixes confusion around Q8_0 on AVX2
* This does it for iq4_nl, including FA
* This does it for iq4_nl on Zen4, but FA does not work
* Slightly more clear
* Adding forgotten q8_0_r8 to num_rows()
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Getting this error when compiling on Mac with clang 17.
Simple fix: add the <string> header in src/llama-impl.h.
Co-authored-by: Mohan Krishnan <mohan.krishnan@grab.com>
* mikupad.html in ik_llama.cpp (functional but WIP)
* Remove hardcoded extension and add error handling to extension loading
* Update version number and add features array to version
* Make version endpoint always accessible
* Fix case with empty sql
* Add useful error message when launched without sql file
* Add sigma sampler
* Update sigma step and max based on docs
* Remove selectedSessionId and handle it with URL fragment
* Export All (code only, no UI)
* Add compression to server.cpp
* Major UI work (and also add/update backend endpoints to accommodate it)
* Finalize UI
* Fix visual bug
* fix merge conflict issue
* Pull in the full sqlite_modern_cpp repo for its license, as it is not attached to the source files
* Make compression not show in sidebar if extension is not loaded
* Finalize build: put support behind the LLAMA_SERVER_SQLITE3 build option, and update the error message to cover the case where the build option is not passed
* Fix compile without flag on systems without it installed
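The sigma sampler added above is not described in detail here; assuming it follows the top-nσ idea (keep only tokens whose logit lies within n standard deviations of the maximum logit), a sketch would look like:

```python
import statistics

def top_n_sigma_filter(logits, n_sigma):
    """Keep token ids whose logit is at least (max - n_sigma * stddev);
    everything else would be masked out before softmax/sampling.
    This is an assumed reading of the 'sigma sampler', not a port of it."""
    m = max(logits)
    sd = statistics.pstdev(logits)
    threshold = m - n_sigma * sd
    return [i for i, v in enumerate(logits) if v >= threshold]
```

Smaller n_sigma keeps fewer, higher-confidence tokens; a large value degenerates to keeping everything.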
* Use bperm trick for iq2_ks gemm -> 7% gain
* Use bperm trick for iq2_k gemm -> ~5% gain
* Use bperm trick for iq2_k_r4 gemm -> ~3% gain
* Use bperm trick for iq2_ks gemv -> ~7% gain
* Use bperm trick for iq2_k gemv -> ~3% gain
* Use bperm trick for iq2_k_r4 gemv -> ~7% gain
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* q8_k_r16: basics
* q8_k_r16: iq4_xs now uses q8_k_r16 on Zen4+
PP performance is about the same as using q8_k_r8 on the Ryzen-7950X,
so we expect nice gains on Zen5, and we don't need to worry about
using 2 different q8_k_r8 implementations for fancy SIMD.
* q8_k_r16: iq2_xxs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq2_xs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq2_s now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_xxs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_s now uses q8_k_r16 on Zen4+
* q8_k_r16: iq1_s and iq1_m now use q8_k_r16 on Zen4+
* q8_k_r16: q2_K and q3_K now use q8_k_r16 on Zen4+
* q8_k_r16: iq2_ks and iq2_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq2_kl now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_ks and iq3_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq4_kss, iq4_ks, and iq4_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq5_ks, iq5_k, and iq6_k now use q8_k_r16 on Zen4+
* Fix AVX2
* Just always set num_rows to 16
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
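The q8_k_r16 layout interleaves 16 rows so that one SIMD pass covers all of them. A simplified Python model of such block-interleaved repacking (the real format also carries block scales; the tiny group/block sizes in the test below are just for illustration, the actual layout uses 16 rows as the commits say):

```python
def repack_rows_interleaved(rows, n_interleave=16, block=4):
    """Repack groups of n_interleave consecutive rows so the b-th block of
    every row in the group is stored contiguously:
    r0.b0 r1.b0 ... r15.b0, r0.b1 r1.b1 ...
    All rows must have equal length divisible by `block`."""
    assert len(rows) % n_interleave == 0
    out = []
    for g in range(0, len(rows), n_interleave):
        group = rows[g:g + n_interleave]
        nblocks = len(group[0]) // block
        for b in range(nblocks):
            for row in group:
                out.extend(row[b * block:(b + 1) * block])
    return out
```

With this layout a kernel loads one block per row across the whole group with consecutive loads, which is what enables the wide-SIMD GEMM without maintaining two separate q8_k_r8 implementations.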
* Use bperm trick for iq3_ks - 5% PP performance gain
* Use bperm trick for iq3_k -> 5% PP performance gain
* Use bperm trick for iq3_k -> 8% PP performance gain
* Use bperm trick for iq3_k_r4 gemv -> ~5% faster
* Use bperm trick for iq3_k gemv -> ~3% faster
* Use bperm trick for iq3_k gemv -> 4.5% gain
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use __byte_perm in get_int_from_table_16
* Use get_int_from_table_16 everywhere for 4-bit quants
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
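For context, CUDA's __byte_perm(a, b, s) builds a 32-bit result by picking bytes from the 8-byte pool {a, b} according to the selector nibbles of s, which is what lets a 16-entry dequantization table held in registers be gathered four bytes at a time. A Python model of the intrinsic (selector values 0-7 only; the MSB-replication modes are omitted) next to the reference lookup it accelerates:

```python
def byte_perm(a, b, s):
    """Software model of CUDA __byte_perm(a, b, s) for selector nibbles 0-7:
    bytes 0-3 of the pool come from `a`, bytes 4-7 from `b`; each selector
    nibble of `s` picks one pool byte for the corresponding result byte."""
    pool = [(a >> (8 * i)) & 0xFF for i in range(4)] + \
           [(b >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i in range(4):
        sel = (s >> (4 * i)) & 0x7
        out |= pool[sel] << (8 * i)
    return out

def lookup4(q, table):
    """Reference for a get_int_from_table_16-style gather: the four 4-bit
    indices packed in the low 16 bits of q each select one byte from the
    16-entry table; the CUDA kernels compute this with __byte_perm."""
    out = 0
    for i in range(4):
        idx = (q >> (4 * i)) & 0xF
        out |= table[idx] << (8 * i)
    return out
```

Doing four table lookups in one instruction instead of shifts, masks, and scalar loads is where the 3-8% gains listed above come from.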
Q8_0 needs Q8_0_X4, but Q8_0_R8 needs Q8_2_X4.
So, if we decide to repack a Q8_0 MoE tensor to Q8_0_R8,
iqk_moe_fused_mul_unary fails because the activations were
prepared as Q8_0_X4, but we now need Q8_2_X4.
For now a simple fix: just take the slow path, do not repack.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
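The workaround above amounts to a guard before repacking: only repack when the activations were quantized in the layout the repacked kernel expects, otherwise keep the original type and take the slow path. A hypothetical sketch, with plain strings standing in for ggml's type enums:

```python
def maybe_repack(tensor_type, activation_type):
    """Decide whether a Q8_0 MoE tensor may be repacked to Q8_0_R8.
    `required` maps each weight layout to the activation layout its
    kernel expects (names follow the commit message above)."""
    required = {"Q8_0": "Q8_0_X4", "Q8_0_R8": "Q8_2_X4"}
    target = "Q8_0_R8" if tensor_type == "Q8_0" else tensor_type
    if required.get(target) != activation_type:
        return tensor_type          # slow path: keep the original layout
    return target                   # activations match: safe to repack
```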
* This does the trick for PP
* Compute mask bounds when creating the mask
* Set mask bounds for all supported SWA models
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
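The mask bounds in question can be modeled as a per-row [first, last] range of unmasked KV positions, computed once when the mask is created so kernels can skip everything outside the window. A sketch, assuming a causal window of n_swa positions ending at the query position (the exact window convention varies by model):

```python
def swa_mask_bounds(n_tokens, n_swa):
    """For causal attention with a sliding window of n_swa positions,
    return for each query position i the (first, last) KV positions that
    are not masked out."""
    bounds = []
    for i in range(n_tokens):
        first = max(0, i - n_swa + 1)   # window starts n_swa - 1 back
        bounds.append((first, i))       # causal: last allowed position is i
    return bounds
```

With these precomputed bounds, flash-attention kernels only iterate over at most n_swa KV entries per row instead of the full context.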