Commit Graph

3891 Commits

Author SHA1 Message Date
Iwan Kawrakow
08080356ab Fix dequantization when requantizing 2025-09-24 11:44:33 +03:00
Kawrakow
cde2eb5e95 cpu: fused softmax+topk (#794)
* cpu: fused softmax+topk

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 09:02:21 +02:00
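The idea behind a fused softmax+topk, as named in cde2eb5e95, can be sketched on the CPU as follows. This is a minimal illustration, not the code from that commit: the function name `softmax_topk` and its signature are hypothetical. It relies on softmax being monotonic, so the top-k selection can run on the raw logits and only the k selected probabilities need to be computed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper (not from the repository): returns the k largest
// softmax probabilities of `logits` with their indices, in descending order.
std::vector<std::pair<int, float>> softmax_topk(const std::vector<float>& logits, int k) {
    const int n = static_cast<int>(logits.size());
    assert(k > 0 && k <= n);

    // Normalization terms of a numerically stable softmax over the whole row.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float x : logits) sum += std::exp(x - max_logit);

    // Softmax preserves ordering, so select the top-k on the raw logits.
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    // Only the k selected probabilities are materialized.
    std::vector<std::pair<int, float>> out;
    out.reserve(k);
    for (int i = 0; i < k; ++i) {
        out.emplace_back(idx[i], std::exp(logits[idx[i]] - max_logit) / sum);
    }
    return out;
}
```

In an MoE router, `logits` would hold one score per expert and `k` the number of experts used per token.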
firecoperana
09db3a494f Update webui to handle reasoning content and include usage stats in server only when requested (#791)
* handle reasoning content in webui
server : include usage statistics only when the user requests them (#16052)
server : only attempt to enable thinking if using jinja (#15967)

* config reasoning_content in webui and change default to auto

---------

Co-authored-by: firecoperana <firecoperana>
2025-09-24 07:45:09 +02:00
Kawrakow
45afaf3391 Fix #772 (#790)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 16:43:02 +02:00
firecoperana
8cd2d7ccd7 model : add grok-2 support (#782)
Co-authored-by: firecoperana <firecoperana>
2025-09-23 16:31:01 +02:00
Kawrakow
18f04350e9 cuda: fused top_k+softmax as used in most MoE models (#789)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 13:45:57 +02:00
Kawrakow
64e357c327 Fix compiler warnings (#788)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 10:30:15 +02:00
firecoperana
6d2e7ca42d Deepseek V3.1 native tool calling support (OpenAI Style) (#771) 2025-09-13 07:51:40 +02:00
firecoperana
0ce2f93581 fix convert error for ernie 4.5 (#774) 2025-09-11 07:59:24 +02:00
firecoperana
d323871ba9 fix v1 completions streaming mode (#768) 2025-09-09 15:38:12 +02:00
Kawrakow
c519d4177b This is very slightly better (#762)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 21:31:02 +02:00
Kawrakow
c15f8ac508 Fix ggml_is_contiguously_allocated (#764)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 19:05:02 +02:00
firecoperana
33e071201f Add Ernie 4.5 MOE and 0.3B Support (#759)
* Add Ernie4_5MoeModel

* add ernie 4.5 0.3B model

---------

Co-authored-by: firecoperana <firecoperana>
2025-09-05 11:54:35 +02:00
firecoperana
cec8b70a7e llama: enable K-shift for quantized KV cache for cuda (#760)
cuda: add q8_0->f32 cpy operation (#9571)
It will fail on unsupported backends or quant types.

Co-authored-by: Ivan <nekotekina@gmail.com>
2025-09-05 11:54:18 +02:00
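A q8_0->f32 copy, as mentioned in cec8b70a7e, amounts to dequantizing blocks of 32 signed 8-bit values with one scale per block. The sketch below is a host-side illustration only: `BlockQ8` and `dequantize_q8` are hypothetical names, and the scale is kept as a `float` for simplicity, whereas ggml stores it as fp16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kBlockSize = 32;  // values per Q8_0 block

// Simplified stand-in for a Q8_0 block (assumption: float scale instead of fp16).
struct BlockQ8 {
    float  d;               // per-block scale
    int8_t qs[kBlockSize];  // quantized values
};

// Writes n_values floats to dst; n_values must be a multiple of the block size.
void dequantize_q8(const BlockQ8* blocks, std::size_t n_values, float* dst) {
    assert(n_values % kBlockSize == 0);
    const std::size_t n_blocks = n_values / kBlockSize;
    for (std::size_t b = 0; b < n_blocks; ++b) {
        for (int i = 0; i < kBlockSize; ++i) {
            // Each value is recovered as scale * quantized integer.
            dst[b * kBlockSize + i] = blocks[b].d * static_cast<float>(blocks[b].qs[i]);
        }
    }
}
```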
Kawrakow
0c15494c30 Offload only activated experts to the GPU (#698)
* Offload only activated experts

* This seems to do the trick for -fmoe

* Do not recalculate activated experts for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 12:22:30 +02:00
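One ingredient of offloading only activated experts, as in 0c15494c30, is finding which experts a batch actually routed to. The sketch below shows that step alone, under the assumption that the router output is a flat list of selected expert ids; `activated_experts` is a hypothetical name and the commit's actual data layout may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: given the expert ids selected for every token in a
// batch, returns the sorted list of distinct experts that were activated.
std::vector<int> activated_experts(const std::vector<int>& selected_ids, int n_expert) {
    assert(n_expert > 0);
    std::vector<char> used(n_expert, 0);
    for (int id : selected_ids) {
        assert(id >= 0 && id < n_expert);  // guard against out-of-bounds ids
        used[id] = 1;
    }
    std::vector<int> out;
    for (int e = 0; e < n_expert; ++e) {
        if (used[e]) out.push_back(e);
    }
    return out;
}
```

Only the weights of the experts in the returned list would then need to be copied to the GPU for that batch.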
Kawrakow
06cc7c6894 Better CPU SWA (#757)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 11:58:16 +02:00
Kawrakow
f5e68bf8b6 Alternative CUDA FA for SWA models (#754)
* Bounds for flash attention

* Add n_swa to FA parameters

* Fix it

* This seems very slightly better

* Using vec kernel when we have SWA

* Need also this

* f32 vec kernel

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 08:42:18 +02:00
Kawrakow
3433c7b56d Refactor CUDA flash attention (#745)
* Factor out mma

* Factor out wmma

* Factor out vec

* Remove unnecessary includes from fattn.cu

* Move mma launch to fattn-mma-f16.cuh

* Slightly better PP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 10:12:56 +02:00
Kawrakow
1f4346381f Set default value of GGML_SCHED_MAX_COPIES to 1 (#751)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 07:04:39 +02:00
Kawrakow
62f5382c2b Revert "CUDA: prompt processing optimizations for MoE models (#739)" (#748)
This reverts commit f22a9ef95a.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 06:55:48 +02:00
Iwan Kawrakow
d10d90ae27 Remove double definition of LLAMA_LOG_DEBUG 2025-09-01 08:42:04 +03:00
firecoperana
0f9ecaec04 Tool calls support from mainline (#723)
* Tool calls support from mainline

* update cmake

* revert api for /completions

* Fix broken thinking process for gpt-oss

* add missing args and fix webui bugs

* add missing args and fix webui bugs2

* Fix reasoning format error

* add usage

* change default post_sampling_probs to true

* add back generated_text

* Remove server endpoints tests

* add log

* Chat fixes

* Remove logs

* webui: revert extra handling of thinking process

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-01 08:38:49 +03:00
Kawrakow
b66cecca45 Fused FFN_UP+FFN_GATE op (#741)
* Fused up+gate+unary for regular (not MoE) FFN - CPU

* WIP CUDA

* Seems to be working on CUDA

For a dense model we get 2-3% speedup for PP and ~0.6% for TG.

* Add command line option

This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-31 18:16:36 +03:00
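The fused up+gate+unary op described in b66cecca45 can be pictured as computing both projections and the activation in one pass over the input. The sketch below is a scalar illustration under stated assumptions: SiLU is used as the unary op, weights are dense row-major floats, and `fused_up_gate` is a hypothetical name, not the repository's API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

inline float silu(float v) { return v / (1.0f + std::exp(-v)); }

// w_up and w_gate are row-major matrices of shape n_out x x.size().
// Computes y[i] = silu(dot(w_gate[i], x)) * dot(w_up[i], x) in a single loop,
// so the two intermediate activation vectors are never stored.
std::vector<float> fused_up_gate(const std::vector<float>& w_up,
                                 const std::vector<float>& w_gate,
                                 const std::vector<float>& x,
                                 std::size_t n_out) {
    const std::size_t n_in = x.size();
    assert(w_up.size() == n_out * n_in && w_gate.size() == n_out * n_in);
    std::vector<float> y(n_out);
    for (std::size_t i = 0; i < n_out; ++i) {
        float up = 0.0f, gate = 0.0f;
        for (std::size_t j = 0; j < n_in; ++j) {
            up   += w_up[i * n_in + j] * x[j];
            gate += w_gate[i * n_in + j] * x[j];
        }
        y[i] = silu(gate) * up;
    }
    return y;
}
```

The unfused path would run two matrix multiplications, an activation, and an element-wise multiply as separate graph nodes.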
Kawrakow
f22a9ef95a CUDA: prompt processing optimizations for MoE models (#739)
* Skip the row id computation for the ffn_down op

Sadly, almost negligible performance gain.

* Also this doesn't do much

* Also this barely moves the needle

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-30 12:09:41 +03:00
Kawrakow
46968d4ab1 Sanitize imatrix (#735)
* sanitize importance matrix: WIP

* sanitize importance matrix: iq4_k

* sanitize importance matrix: iq5_k, iq6_k

* sanitize imatrix: iq4_ks

* sanitize imatrix: iq4_kss

* sanitize imatrix: iq2_ks and iq2_kl

* sanitize imatrix: iq5_ks

* sanitize imatrix: iq4_nl_r4

* sanitize imatrix: q4_0_r8

* sanitize imatrix: q6_0_r4

* sanitize imatrix: iq4_xs_r8

* sanitize imatrix: iq4_xs_r8 and q3_k_r4 with a template

* sanitize imatrix: q2_k_r4, q4_k_r4, q5_k_r4, q6_k_r4

* sanitize imatrix: repacked i-quants

* Minor

* Add more checks for iq3_k, iq3_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-29 09:08:15 +03:00
Kawrakow
872ac10b02 Make yarn_log_multiplier optional (#738)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-28 14:09:59 +03:00
Kawrakow
dac5b48398 Check for NaNs while loading the model. (#727)
* Check for NaNs while loading the model.

* Also tell which experts have NaNs.

* Add command line option to validate quants

* Add checks for more quantization types

* Add checks for more quantization types

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 19:00:17 +03:00
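The per-expert NaN check described in dac5b48398 can be illustrated on already-dequantized data. This is a sketch under assumptions: the tensor is treated as `n_expert` equally sized contiguous float slices, and `experts_with_nans` is a hypothetical name. The checks in the commit operate on quantized blocks and are not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: splits `data` into n_expert contiguous slices and
// returns the indices of the slices that contain at least one NaN.
std::vector<int> experts_with_nans(const std::vector<float>& data, int n_expert) {
    assert(n_expert > 0 && data.size() % static_cast<std::size_t>(n_expert) == 0);
    const std::size_t per_expert = data.size() / static_cast<std::size_t>(n_expert);
    std::vector<int> bad;
    for (int e = 0; e < n_expert; ++e) {
        const float* begin = data.data() + static_cast<std::size_t>(e) * per_expert;
        const bool has_nan = std::any_of(begin, begin + per_expert,
                                         [](float v) { return std::isnan(v); });
        if (has_nan) bad.push_back(e);
    }
    return bad;
}
```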
Iwan Kawrakow
03ce6fdb0d Fix typo 2025-08-27 14:43:44 +03:00
Kawrakow
966a6ce93c Heuristics for mmq_id -> original threshold (#734)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:17:41 +03:00
Kawrakow
931f04af53 Sanitize importances for KT quantization (#720)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:04:15 +03:00
Kawrakow
683d6f1fc8 Fix avx2 GEMM mess (v2) (#724)
* This fixes confusion around Q8_0 on AVX2

* This does it for iq4_nl, including FA

* This does it for iq4_nl on Zen4, but FA does not work

* Slightly more clear

* Adding forgotten q8_0_r8 to num_rows()

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:03:47 +03:00
Kawrakow
0cc32ff0b1 CUDA: much faster prompt processing for MoE models and small u-batch sizes (#728)
* WIP: adding mainline mmq_id implementation

* This seems to work

* Now also -fmoe works

* WIP

* WIP

* WIP

* This works for mainline supported quants

* mmq_id: add iq2_k, iq2_k_r4

* mmq_id: don't assume row size is multiple of type size (per row scales)

* mmq_id: don't assume row size is multiple of type size

* mmq_id: add iq2_ks

So we are sure it works with per row scales

* mmq_id: add iq2_kl

* mmq_id: add iq3_ks

* mmq_id: adding iq3_k, iq3_k_r4

* mmq_id: add iq4_kss, iq4_ks, iq4_ks_r4

* mmq_id: adding iq4_k, iq4_k_r4

* mmq_id: adding iq5_ks, iq5_ks_r4

* mmq_id: adding iq5_k, iq5_k_r4, q6_0

* mmq_id: adding iq6_k

* mmq_id: add iq1_s_r4

* mmq_id: adding iq1_kt, iq2_kt

* mmq_id: add iq3_kt, iq4_kt

* Add CUDA fp8 header

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-26 13:30:35 +03:00
Mohan Krishnan
50f7119dfd Fix undefined template std::basic_string<char> (#726)
Getting this error when compiling on Mac with clang 17
Simple fix, add the string header in src/llama-impl.h

Co-authored-by: Mohan Krishnan <mohan.krishnan@grab.com>
2025-08-25 11:34:01 +03:00
saood06
af13c9a292 Add mikupad to ik_llama as an alternative WebUI (#558)
* mikupad.html in ik_llama.cpp (functional but WIP)

* Remove hardcoded extension and add error handling to extension loading

* Update version number and add features array to version

* Make version endpoint always accessible

* Fix case with empty sql

* Add useful error message when launched without sql file

* Add sigma sampler

* Update sigma step and max based on docs

* Remove selectedSessionId and handle it with URL fragment

* Export All (code only, no UI)

* Add compression to server.cpp

* Major UI work (and also add/update backend endpoints to accommodate)

* Finalize UI

* Fix visual bug

* fix merge conflict issue

* Pull in full sqlite_modern_cpp repo for the license as it is not attached to source files

* Make compression not show in sidebar if extension is not loaded

* Finalize build, put support behind the LLAMA_SERVER_SQLITE3 build option, and update the error message to cover the case where the build option is not passed

* Fix compile without flag on systems without it installed
2025-08-24 08:27:29 -05:00
Kawrakow
e008c0e192 Log for debugging #721 (#722)
* Log for debugging #721

* Remove the 16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-23 15:24:34 +03:00
Kawrakow
e919e89d5a Fix more Q8_0 repacking mess on AVX2 (#719)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-23 09:04:51 +03:00
Kawrakow
9351cc3416 Remove scary warning about incompatible model (#717)
* Remove scary warning about incompatible model

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-22 18:42:01 +03:00
Kawrakow
dfa6e2b5fa CUDA: faster IQ2_K, IQ2_KS, IQ2_K_R4 (#716)
* Use bperm trick for iq2_ks gemm -> 7% gain

* Use bperm trick for iq2_k gemm -> ~5% gain

* Use bperm trick for iq2_k_r4 gemm -> ~3% gain

* Use bperm trick for iq2_ks gemv -> ~7% gain

* Use bperm trick for iq2_k gemv -> ~3% gain

* Use bperm trick for iq2_k_r4 gemv -> ~7% gain

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-22 07:25:35 +03:00
Kawrakow
3b94f0a73e AVX512+AVXVNNI GEMM implementation for quants using Q8_K for activations (#710)
* q8_k_r16: basics

* q8_k_r16: iq4_xs now uses q8_k_r16 on Zen4+

PP performance is about the same as using q8_k_r8 on the Ryzen-7950X,
so we expect nice gains on Zen5, and we don't need to worry about
using 2 different q8_k_r8 implementations for fancy SIMD.

* q8_k_r16: iq2_xxs now uses q8_k_r16 on Zen4+

* q8_k_r16: iq2_xs now uses q8_k_r16 on Zen4+

* q8_k_r16: iq2_s now uses q8_k_r16 on Zen4+

* q8_k_r16: iq3_xxs now uses q8_k_r16 on Zen4+

* q8_k_r16: iq3_s now uses q8_k_r16 on Zen4+

* q8_k_r16: iq1_s and iq1_m now use q8_k_r16 on Zen4+

* q8_k_r16: q2_K and q3_K now use q8_k_r16 on Zen4+

* q8_k_r16: iq2_ks and iq2_k now use q8_k_r16 on Zen4+

* q8_k_r16: iq2_kl now uses q8_k_r16 on Zen4+

* q8_k_r16: iq3_ks and iq3_k now use q8_k_r16 on Zen4+

* q8_k_r16: iq4_kss, iq4_ks, and iq4_k now use q8_k_r16 on Zen4+

* q8_k_r16: iq5_ks, iq5_k, and iq6_k now use q8_k_r16 on Zen4+

* Fix AVX2

* Just always set num_rows to 16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-22 06:27:07 +03:00
Kawrakow
c8e4d6648c Does this fix #690? (#711)
* Does this fix #690?

* Another attempt

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-21 19:17:33 +03:00
Kawrakow
05cd6994c8 CUDA: faster IQ3_K, IQ3_KS, IQ3_K_R4 (#714)
* Use bperm trick for iq3_ks -> 5% PP performance gain

* Use bperm trick for iq3_k -> 5% PP performance gain

* Use bperm trick for iq3_k -> 8% PP performance gain

* Use bperm trick for iq3_k_r4 gemv -> ~5% faster

* Use bperm trick for iq3_k gemv -> ~3% faster

* Use bperm trick for iq3_k gemv -> 4.5% gain

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-21 19:08:57 +03:00
Kawrakow
78de7736e8 CUDA: faster prompt processing for 4-bit quants (#713)
* Use __byte_perm in get_int_from_table_16

* Use get_int_from_table_16 everywhere for 4-bit quants

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-21 15:57:35 +03:00
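A 16-entry table lookup via byte permutation, the technique named in 78de7736e8, can be emulated on the host. The sketch below is not the code of `get_int_from_table_16`: `byte_perm` is a plain C++ emulation of the default mode of CUDA's `__byte_perm`, and `lookup4_table16` is a hypothetical helper. Because one permute can only address 8 source bytes, the 16-byte table is split into a low and a high half and bit 3 of each index selects between the two results.

```cpp
#include <cassert>
#include <cstdint>

// Host emulation of __byte_perm(x, y, s), default mode: result byte i is
// byte ((s >> 4*i) & 7) of the 64-bit value formed by y (high) and x (low).
inline uint32_t byte_perm(uint32_t x, uint32_t y, uint32_t s) {
    const uint64_t src = (static_cast<uint64_t>(y) << 32) | x;
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t sel = (s >> (4 * i)) & 7u;
        r |= static_cast<uint32_t>((src >> (8 * sel)) & 0xffu) << (8 * i);
    }
    return r;
}

// Hypothetical helper: q holds four 4-bit indices, one in the low nibble of
// each byte. Returns values[index] for each byte, packed in the same order.
inline uint32_t lookup4_table16(uint32_t q, const uint8_t values[16]) {
    assert((q & 0xf0f0f0f0u) == 0);
    uint32_t t[4];  // the table packed into four 32-bit words
    for (int w = 0; w < 4; ++w) {
        t[w] = static_cast<uint32_t>(values[4 * w])
             | static_cast<uint32_t>(values[4 * w + 1]) << 8
             | static_cast<uint32_t>(values[4 * w + 2]) << 16
             | static_cast<uint32_t>(values[4 * w + 3]) << 24;
    }
    uint32_t sel = 0, high_mask = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t idx = (q >> (8 * i)) & 0xfu;
        sel |= (idx & 7u) << (4 * i);                  // low 3 bits drive the permute
        if (idx & 8u) high_mask |= 0xffu << (8 * i);   // bit 3 picks the table half
    }
    const uint32_t lo = byte_perm(t[0], t[1], sel);  // entries 0..7
    const uint32_t hi = byte_perm(t[2], t[3], sel);  // entries 8..15
    return (lo & ~high_mask) | (hi & high_mask);
}
```

On the GPU the selector construction and the final blend would themselves be done with a few bit operations rather than loops; the loops here only keep the emulation readable.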
Kawrakow
0cb6696943 Disable "...is not marked as EOG" messages (#712)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-20 16:47:14 +03:00
Kawrakow
2572d16399 Fix q8_0 repacking issues on AVX2 (#708)
Q8_0 needs Q8_0_X4, but Q8_0_R8 needs Q8_2_X4.
So, if we decide to repack a Q8_0 MoE tensor to Q8_0_R8,
iqk_moe_fused_mul_unary fails because the activations were
prepared as Q8_0_X4, but we now need Q8_2_X4.

For now a simple fix: just take the slow path, do not repack.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-19 19:49:58 +03:00
Kawrakow
ee85af6d26 Disable experimental code that causes issues with MSVC (#707)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-19 18:09:49 +03:00
usrlocalben
6aef3c9415 remove curious assertions (#705)
This assertion can hit during prefill as MLA/KV tensors grow,
e.g. Kimi K2 n_ctx >= 32768.
2025-08-19 14:41:29 +03:00
g2mt
23fe18ce83 Port universal assisted decoding to llama-server (#699)
* port universal assisted decoding to server

* fix calls

* fix LOG_INFO

* fix llama_detokenize call

* use emplace_back
2025-08-18 09:22:23 +03:00
Kawrakow
a3a523009e Revert "Better CPU prompt processing performance for SWA models (#696)" (#701)
This reverts commit 93a4f6089f.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-17 15:44:02 +03:00
Kawrakow
7d14f8ea79 Fix GLM-4.5 attention (#700)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-17 14:31:03 +03:00
Kawrakow
93a4f6089f Better CPU prompt processing performance for SWA models (#696)
* This does the trick for PP

* Compute mask bounds when creating the mask

* Set mask bounds for all supported SWA models

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-17 10:30:27 +03:00
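The mask bounds mentioned in 93a4f6089f can be illustrated for causal sliding-window attention. The sketch assumes a convention in which a query at position `q_pos` attends to the `n_swa` most recent key positions including its own; `swa_mask_bounds` is a hypothetical name, and the exact convention varies between SWA models.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Returns the inclusive range [first, last] of key positions that the query
// at q_pos may attend to (assumed convention: q_pos - n_swa < k_pos <= q_pos).
std::pair<int, int> swa_mask_bounds(int q_pos, int n_swa) {
    assert(q_pos >= 0 && n_swa > 0);
    return {std::max(0, q_pos - n_swa + 1), q_pos};
}
```

Knowing these bounds up front lets an attention kernel skip key blocks that are entirely masked instead of computing and discarding them.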