Commit Graph

4178 Commits

Author SHA1 Message Date
Kawrakow
a7befb3bed Change default FA offset to ln(2) (#1235)
* Change default FA offset to ln(2)

* Also here
2026-02-05 13:42:53 +02:00
Kawrakow
9c1c74acda Step-3.5-Flash support (#1231)
* WIP

* This works but is slow

* Turn off the up / gate clamps for now

* OK we need the clamping

* Fuse the clamp (CUDA)

* Fuse the clamp (CPU)

* WIP

* Be able to use merged q, k, v

* Be able to use merged up/gate experts

* Fuse the clamp (CUDA mmvq)
2026-02-05 08:13:22 +02:00
firecoperana
8d952ff183 Server: add string ban (#1185)
* server: add string ban

* increase rewind limit

* init n_buffer

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-05 08:12:34 +02:00
Michael Militzer
a335cff664 Fix llama-server-cuda Dockerfile to build ik_llama.cpp correctly (#1224)
Co-authored-by: Michael Militzer <michael@xvid.com>
2026-02-04 16:08:00 +02:00
Kawrakow
b41b8cf813 Graph parallel for SEED-OSS (#1222)
* Graph parallel for SEED-OSS

* Cleanup
2026-02-04 16:07:43 +02:00
gapeleon
17d101863d server: add dynamic control vector management endpoints (#1223)
This implements the ability to load, unload, and scale control vectors
(representation engineering) mid-inference, following the existing
task-queue pattern used by LoRA adapters.

New Endpoints:
- GET  /control-vectors
- POST /control-vectors/load
- POST /control-vectors/unload
- POST /control-vectors/apply (handles scaling)

Technical Notes:
- Centralizes vector aggregation logic to share implementation between
  load, unload, and apply tasks.
- Vectors are applied globally to the model context.
- Enforces dimension validation on load to safely reject incompatible
  vectors.

Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>
2026-02-04 16:07:18 +02:00
usrlocalben
e5622a2e91 Fix Phi-3, Phi-4 (#1226)
* fix phi3 tensor setup

* avoid SWA for Phi-4
2026-02-04 11:57:50 +02:00
Kawrakow
f8acfc2bf0 Better CUDA TG for GQA = 10 (#1221)
* Better CUDA TG for GQA = 10

* Cleanup
2026-02-03 09:18:46 +02:00
firecoperana
7e8d444033 llama : add token matching support to llama-grammar (#1220)
* llama : add token matching support to llama-grammar

llama : add token matching support to llama-grammar (#17816)

common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342)

* disable tests and fix warnings

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-03 07:57:17 +02:00
saood06
8ba7e2b40c Add support for Seed-OSS (#1218)
* it compiles

* Fix constants.py
2026-02-03 07:39:45 +02:00
dungquixote42
b86d8024a5 Adaptive p: history update fix + temp as flag (#1213)
* adaptive_p: fix history update + use current probability for high temp

* adaptive_p: fix history update bug, update with current probability if temp is high

* replace temp-as-signal with server argument

* adaptive_p: rename ema_w_cur_p to updt_w_cur

* delete test code
2026-02-03 07:36:12 +02:00
Kawrakow
589d80f677 Fix CPU FA work buffer size (#1216) 2026-02-02 12:39:41 +02:00
Kawrakow
49ba462f22 Merge pull request #1215 from ikawrakow/ik/cpu_fa_dont_repack_tg
Do not repack q8_0 for batch sizes less than 8
2026-02-02 12:12:34 +02:00
Kawrakow
d5498c4467 Do not repack q8_0 for batch sizes less than 8 2026-02-02 09:07:45 +00:00
Kawrakow
a527b5af25 Merge pull request #1212 from ikawrakow/ik/better_cpu_fa_thread_strategy
Better long-context CPU performance
2026-02-02 10:58:01 +02:00
Kawrakow
685df0e69d Work buffer size 2026-01-31 16:10:23 +00:00
Kawrakow
2bf2fa8ba4 Better CPU FA thread strategy 2026-01-31 15:46:16 +00:00
Kawrakow
33308908db Merge pull request #1211 from ikawrakow/ik/reduce_mla3_compute_buffer_size
Reduce CUDA compute buffer size for mla=3
2026-01-31 14:24:14 +02:00
Kawrakow
b85a2a50d5 Reduce compute buffer size for mla=3 2026-01-31 10:43:05 +00:00
Kawrakow
373f043d41 Merge pull request #1208 from ikawrakow/ik/try_fix_1201 2026-01-30 23:12:07 +02:00
Kawrakow
4d13ae03b5 Also these other two places 2026-01-30 15:36:29 +00:00
Kawrakow
098b1a2e04 Fix MiniMax-M2 KV-cache loading/saving 2026-01-30 13:38:07 +00:00
Kawrakow
811f8c3393 Fix bug in the CPU flash attention implementation (#1206) 2026-01-30 11:37:34 +02:00
Kawrakow
686fd1ebec Use standard output calculation for MiniMax-M2 graph parallel (#1199) 2026-01-29 09:06:40 +02:00
Kawrakow
f0c61adacc Be able to set FA offset via command line argument (#1198) 2026-01-29 08:56:47 +02:00
Kawrakow
02ae22388f Apply offfset to KQ_max in CUDA flash attention (#1196)
* Apply offfset to KQ_max in CUDA flash attention

* Forgot to add to fattn-common.h
2026-01-29 07:27:53 +02:00
Kawrakow
68ed62447c Split mode graph for Minimax-M2 (#1195)
* Split mode graph for Minimax-M2

* Cleanup

* Forgotten ffn_exp_probs_b
2026-01-29 07:27:06 +02:00
Kawrakow
68cd52e583 Much faster long context TG for Minimax-M2 (#1194) 2026-01-28 10:43:11 +02:00
Kawrakow
f9b5420e6a Much faster long-context TG for GLM-4.5/4.6/4.7/AIR (#1193)
* This seems much better for GQA = 12 TG

* Remove unused arguments
2026-01-28 10:27:14 +02:00
Kawrakow
69fdd041c1 Remove forgotten unused code 2026-01-26 12:54:21 +00:00
Kawrakow
65441c2385 Even better GLM-4.7-Flash long context TG performance (#1192)
* Better FA for GLM-4.7-Flash

* Adjust ncols for ADA_LOVELACE or better
2026-01-26 13:45:06 +02:00
Kawrakow
30381fc1fc Faster hybrid inference when shared experts (#1191) 2026-01-26 07:22:05 +02:00
Kawrakow
478b56871f Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190)
* This works

* Make quantized KV cache work

* Remove the glm45 graph building changes

* Add condition
2026-01-26 07:21:47 +02:00
Kawrakow
28f8320f3a Much faster rng sampling (#1187) 2026-01-25 09:11:27 +02:00
Kawrakow
04beeffa4e Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (#1183)
* Similar hack to #1182 for GLM-4.5/6/7

* Refinements

* Disable when the KV cache is not f16
2026-01-24 09:39:29 +02:00
Kawrakow
f0fb76da64 Better GLM-4.7-Flash long context TG performance (#1182)
* Better GLM-4.7-Flash long context TG performance

* Handle quantized cache
2026-01-24 07:05:48 +02:00
Kawrakow
2a7cc09149 Remove llamafile remnants (#1179) 2026-01-22 13:20:23 +02:00
Kawrakow
66caa42b53 Fix build with GGML_CUDA_GRAPHS=OFF 2026-01-22 10:46:57 +00:00
Kawrakow
851fda3509 Split mode graph: use CUDA graphs (#1177)
* Use GUDA graphs also when theretensor overrides

* Change graph key

* This seems to work
2026-01-22 12:38:36 +02:00
Kawrakow
573e23679d sweep_bench: set number of repetions (#1176) 2026-01-22 12:28:30 +02:00
Kawrakow
101fe54797 CUDA graphs with tensor overrides (#1172)
* Use GUDA graphs also when theretensor overrides

* Change graph key
2026-01-22 12:28:11 +02:00
Kawrakow
1cb8cd534f Fix build failure when OpenMP is not available (#1171) 2026-01-22 12:26:23 +02:00
Kawrakow
77c18acc90 Fix non-contiguous batched cuBLAS (#1178) 2026-01-22 12:25:05 +02:00
Kawrakow
987651e54c Make comments more precise when experts gating function is missing (#1175) 2026-01-21 09:12:40 +02:00
Kawrakow
9e07839ba3 Correct GLM-4.7-Flash gating function (#1174)
* Correct GLM-4.7-Flash gating function

* This is better
2026-01-21 07:53:18 +02:00
Kawrakow
6f1a69352f Fuse experts bias in top_k_moe kernel (#1170)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0

* Fuse bias in top_k_moe kernel if present
2026-01-20 15:38:51 +02:00
Kawrakow
996e77047a Avoid ggml_get_rows if not necessary (#1160)
* Copy reduce result to other GPUs if necessary

* Avoid ggml_get_rows for TG

* For the output ops use the result of the split that ran on the main GPU

* More models
2026-01-20 15:38:21 +02:00
Kawrakow
132a01d25d GLM-4.7-Flash support (#1168)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0
2026-01-20 12:46:52 +02:00
Kawrakow
ef5f17940c sampling: refactor sorting (#1166)
* sampling: refactor sorting

* Couldn't look at it without fixing it.
2026-01-19 16:48:54 +02:00
Kawrakow
98b30e5e81 Faster adaptive_p sampling (#1165)
* A hopefully more efficient adaptive_p sampling

* Once at it, lets fix the formatting too

* More formatting

* Hopefully better

* This should be better

* Correctly accumulate adaptive_p sampling time

* AVX2
2026-01-19 16:03:09 +02:00