usrlocalben
e5622a2e91
Fix Phi-3, Phi-4 ( #1226 )
...
* fix phi3 tensor setup
* avoid SWA for Phi-4
2026-02-04 11:57:50 +02:00
Kawrakow
f8acfc2bf0
Better CUDA TG for GQA = 10 ( #1221 )
...
* Better CUDA TG for GQA = 10
* Cleanup
2026-02-03 09:18:46 +02:00
firecoperana
7e8d444033
llama : add token matching support to llama-grammar ( #1220 )
...
* llama : add token matching support to llama-grammar
llama : add token matching support to llama-grammar (#17816 )
common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342 )
* disable tests and fix warnings
---------
Co-authored-by: firecoperana <firecoperana>
2026-02-03 07:57:17 +02:00
saood06
8ba7e2b40c
Add support for Seed-OSS ( #1218 )
...
* it compiles
* Fix constants.py
2026-02-03 07:39:45 +02:00
dungquixote42
b86d8024a5
Adaptive p: history update fix + temp as flag ( #1213 )
...
* adaptive_p: fix history update + use current probability for high temp
* adaptive_p: fix history update bug, update with current probability if temp is high
* replace temp-as-signal with server argument
* adaptive_p: rename ema_w_cur_p to updt_w_cur
* delete test code
2026-02-03 07:36:12 +02:00
Kawrakow
589d80f677
Fix CPU FA work buffer size ( #1216 )
2026-02-02 12:39:41 +02:00
Kawrakow
49ba462f22
Merge pull request #1215 from ikawrakow/ik/cpu_fa_dont_repack_tg
...
Do not repack q8_0 for batch sizes less than 8
2026-02-02 12:12:34 +02:00
Kawrakow
d5498c4467
Do not repack q8_0 for batch sizes less than 8
2026-02-02 09:07:45 +00:00
Kawrakow
a527b5af25
Merge pull request #1212 from ikawrakow/ik/better_cpu_fa_thread_strategy
...
Better long-context CPU performance
2026-02-02 10:58:01 +02:00
Kawrakow
685df0e69d
Work buffer size
2026-01-31 16:10:23 +00:00
Kawrakow
2bf2fa8ba4
Better CPU FA thread strategy
2026-01-31 15:46:16 +00:00
Kawrakow
33308908db
Merge pull request #1211 from ikawrakow/ik/reduce_mla3_compute_buffer_size
...
Reduce CUDA compute buffer size for mla=3
2026-01-31 14:24:14 +02:00
Kawrakow
b85a2a50d5
Reduce compute buffer size for mla=3
2026-01-31 10:43:05 +00:00
Kawrakow
373f043d41
Merge pull request #1208 from ikawrakow/ik/try_fix_1201
2026-01-30 23:12:07 +02:00
Kawrakow
4d13ae03b5
Also these other two places
2026-01-30 15:36:29 +00:00
Kawrakow
098b1a2e04
Fix MiniMax-M2 KV-cache loading/saving
2026-01-30 13:38:07 +00:00
Kawrakow
811f8c3393
Fix bug in the CPU flash attention implementation ( #1206 )
2026-01-30 11:37:34 +02:00
Kawrakow
686fd1ebec
Use standard output calculation for MiniMax-M2 graph parallel ( #1199 )
2026-01-29 09:06:40 +02:00
Kawrakow
f0c61adacc
Be able to set FA offset via command line argument ( #1198 )
2026-01-29 08:56:47 +02:00
Kawrakow
02ae22388f
Apply offset to KQ_max in CUDA flash attention ( #1196 )
...
* Apply offset to KQ_max in CUDA flash attention
* Forgot to add to fattn-common.h
2026-01-29 07:27:53 +02:00
Kawrakow
68ed62447c
Split mode graph for MiniMax-M2 ( #1195 )
...
* Split mode graph for MiniMax-M2
* Cleanup
* Forgotten ffn_exp_probs_b
2026-01-29 07:27:06 +02:00
Kawrakow
68cd52e583
Much faster long context TG for MiniMax-M2 ( #1194 )
2026-01-28 10:43:11 +02:00
Kawrakow
f9b5420e6a
Much faster long-context TG for GLM-4.5/4.6/4.7/AIR ( #1193 )
...
* This seems much better for GQA = 12 TG
* Remove unused arguments
2026-01-28 10:27:14 +02:00
Kawrakow
69fdd041c1
Remove forgotten unused code
2026-01-26 12:54:21 +00:00
Kawrakow
65441c2385
Even better GLM-4.7-Flash long context TG performance ( #1192 )
...
* Better FA for GLM-4.7-Flash
* Adjust ncols for ADA_LOVELACE or better
2026-01-26 13:45:06 +02:00
Kawrakow
30381fc1fc
Faster hybrid inference with shared experts ( #1191 )
2026-01-26 07:22:05 +02:00
Kawrakow
478b56871f
Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) ( #1190 )
...
* This works
* Make quantized KV cache work
* Remove the glm45 graph building changes
* Add condition
2026-01-26 07:21:47 +02:00
Kawrakow
28f8320f3a
Much faster rng sampling ( #1187 )
2026-01-25 09:11:27 +02:00
Kawrakow
04beeffa4e
Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR ( #1183 )
...
* Similar hack to #1182 for GLM-4.5/6/7
* Refinements
* Disable when the KV cache is not f16
2026-01-24 09:39:29 +02:00
Kawrakow
f0fb76da64
Better GLM-4.7-Flash long context TG performance ( #1182 )
...
* Better GLM-4.7-Flash long context TG performance
* Handle quantized cache
2026-01-24 07:05:48 +02:00
Kawrakow
2a7cc09149
Remove llamafile remnants ( #1179 )
2026-01-22 13:20:23 +02:00
Kawrakow
66caa42b53
Fix build with GGML_CUDA_GRAPHS=OFF
2026-01-22 10:46:57 +00:00
Kawrakow
851fda3509
Split mode graph: use CUDA graphs ( #1177 )
...
* Use CUDA graphs also when there are tensor overrides
* Change graph key
* This seems to work
2026-01-22 12:38:36 +02:00
Kawrakow
573e23679d
sweep_bench: set number of repetitions ( #1176 )
2026-01-22 12:28:30 +02:00
Kawrakow
101fe54797
CUDA graphs with tensor overrides ( #1172 )
...
* Use CUDA graphs also when there are tensor overrides
* Change graph key
2026-01-22 12:28:11 +02:00
Kawrakow
1cb8cd534f
Fix build failure when OpenMP is not available ( #1171 )
2026-01-22 12:26:23 +02:00
Kawrakow
77c18acc90
Fix non-contiguous batched cuBLAS ( #1178 )
2026-01-22 12:25:05 +02:00
Kawrakow
987651e54c
Make comments more precise when experts gating function is missing ( #1175 )
2026-01-21 09:12:40 +02:00
Kawrakow
9e07839ba3
Correct GLM-4.7-Flash gating function ( #1174 )
...
* Correct GLM-4.7-Flash gating function
* This is better
2026-01-21 07:53:18 +02:00
Kawrakow
6f1a69352f
Fuse experts bias in top_k_moe kernel ( #1170 )
...
* GLM-4.7-Flash support
* Model type
* Make FA work for mla != 0
* Fuse bias in top_k_moe kernel if present
2026-01-20 15:38:51 +02:00
Kawrakow
996e77047a
Avoid ggml_get_rows if not necessary ( #1160 )
...
* Copy reduce result to other GPUs if necessary
* Avoid ggml_get_rows for TG
* For the output ops use the result of the split that ran on the main GPU
* More models
2026-01-20 15:38:21 +02:00
Kawrakow
132a01d25d
GLM-4.7-Flash support ( #1168 )
...
* GLM-4.7-Flash support
* Model type
* Make FA work for mla != 0
2026-01-20 12:46:52 +02:00
Kawrakow
ef5f17940c
sampling: refactor sorting ( #1166 )
...
* sampling: refactor sorting
* Couldn't look at it without fixing it.
2026-01-19 16:48:54 +02:00
Kawrakow
98b30e5e81
Faster adaptive_p sampling ( #1165 )
...
* A hopefully more efficient adaptive_p sampling
* While at it, let's fix the formatting too
* More formatting
* Hopefully better
* This should be better
* Correctly accumulate adaptive_p sampling time
* AVX2
2026-01-19 16:03:09 +02:00
Kawrakow
fa58c20c42
A hopefully more efficient adaptive_p sampling ( #1161 )
...
* A hopefully more efficient adaptive_p sampling
* While at it, let's fix the formatting too
* More formatting
* Correctly accumulate sampling time for adaptive_p
2026-01-19 15:01:55 +02:00
Kawrakow
6a5c180be9
Fix bf16 additions on CUDA arch < Ampere ( #1164 )
...
* Fix bf16 additions on CUDA arch < Ampere
* Prevent using NCCL if graph reduce type is bf16 and arch < AMPERE
2026-01-19 12:27:52 +02:00
Kawrakow
0c0b6e4b8b
Copy reduce result to other GPUs if necessary ( #1156 )
2026-01-19 08:40:26 +02:00
dungquixote42
6dfbef27ec
Adaptive p: bugfix + optimization + refactor ( #1155 )
...
* adaptive-p sampler: fix zeroed orig_probs bug and refactor
- Fix bug where original probabilities were captured as zero; they are now calculated from logits in llama_prep_adaptive_p (new).
- Replace vector with unordered_map to track candidate probabilities, filtering for relevance via logit delta (16.6f).
- Standardize API naming: llama_<action/verb>_<focus/name/topic>_<extra/info>
- Update function signatures to match those of most other samplers.
* resolve merge bug
* adaptive-p: revert reordering function definitions
2026-01-18 08:26:06 +02:00
firecoperana
d71a3ec315
Server: refactor and rename functions ( #1151 )
...
* Server: rename functions and refactor code
rename functions
refactor update slots
rename params_base
rename timings
* change
* Revert kv cache name changes
* Revert 2
* fix test build error
---------
Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
Kawrakow
7024fdbc72
Additional graph reduce types for split mode graph ( #1154 )
...
* WIP: add Q8_0 and BF16 as possible reduce types
Does not work - there is a bug somewhere
* This finally works
2026-01-18 08:02:49 +02:00