Commit Graph

  • 686b6f63ea Try splitting PP MLA computation ik/try_split_mla Kawrakow 2026-01-26 16:37:12 +00:00
  • 69fdd041c1 Remove forgotten unused code main Kawrakow 2026-01-26 12:54:21 +00:00
  • 65441c2385 Even better GLM-4.7-Flash long context TG performance (#1192) Kawrakow 2026-01-26 13:45:06 +02:00
  • 04829ca412 Adjust ncols for ADA_LOVELACE or better ik/glm47_fa_2 Kawrakow 2026-01-26 11:00:42 +02:00
  • bd7e75192e Better FA for GLM-4.7-Flash Kawrakow 2026-01-26 07:31:13 +00:00
  • 30381fc1fc Faster hybrid inference when shared experts (#1191) Kawrakow 2026-01-26 07:22:05 +02:00
  • 478b56871f Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190) Kawrakow 2026-01-26 07:21:47 +02:00
  • c96ad27cd0 server: add string ban fcp/string_ban firecoperana 2026-01-19 21:24:47 -06:00
  • 109686af6f Faster hybrid inference when shared experts ik/shexps_better_hybrid Kawrakow 2026-01-25 14:38:54 +00:00
  • 28f8320f3a Much faster rng sampling (#1187) Kawrakow 2026-01-25 09:11:27 +02:00
  • aff7aa0cf6 Add condition ik/better_fa_glm45 Kawrakow 2026-01-25 06:52:04 +00:00
  • d08481d0f4 Remove the glm45 graph building changes Kawrakow 2026-01-25 06:28:53 +00:00
  • 4d5dcba7c9 Make quantized KV cache work Kawrakow 2026-01-25 05:51:44 +00:00
  • 6a5111c215 This works Kawrakow 2026-01-25 05:39:25 +00:00
  • 6e6d105d4e Much faster rng sampling ik/rng_sampling Kawrakow 2026-01-24 13:41:47 +00:00
  • 04beeffa4e Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (#1183) Kawrakow 2026-01-24 09:39:29 +02:00
  • f0fb76da64 Better GLM-4.7-Flash long context TG performance (#1182) Kawrakow 2026-01-24 07:05:48 +02:00
  • c663eeaca6 Disable when the KV cache is not f16 ik/glm45_tg_fa_hack Kawrakow 2026-01-24 05:03:52 +00:00
  • 485d23d91c Refinements Kawrakow 2026-01-23 16:35:50 +00:00
  • f5754f5f82 Similar hack to #1182 for GLM-4.5/6/7 Kawrakow 2026-01-23 14:51:42 +00:00
  • 7f5503244e Handle quantized cache ik/glm47_tg_fa_hack Kawrakow 2026-01-23 06:47:29 +00:00
  • d774dca828 Better GLM-4.7-Flash long context TG performance Kawrakow 2026-01-23 06:00:31 +00:00
  • 2a7cc09149 Remove llamafile remnants (#1179) Kawrakow 2026-01-22 13:20:23 +02:00
  • 3a3e1638d4 Remove llamafile remnants ik/remove_llamafile Kawrakow 2026-01-22 11:12:04 +00:00
  • 66caa42b53 Fix build with GGML_CUDA_GRAPHS=OFF Kawrakow 2026-01-22 10:46:57 +00:00
  • 851fda3509 Split mode graph: use CUDA graphs (#1177) Kawrakow 2026-01-22 12:38:36 +02:00
  • 32f8e6a565 Merge remote-tracking branch 'origin/main' into ik/sm_graph_cuda_graphs ik/sm_graph_cuda_graphs Kawrakow 2026-01-22 10:34:11 +00:00
  • 573e23679d sweep_bench: set number of repetitions (#1176) Kawrakow 2026-01-22 12:28:30 +02:00
  • 101fe54797 CUDA graphs with tensor overrides (#1172) Kawrakow 2026-01-22 12:28:11 +02:00
  • 1cb8cd534f Fix build failure when OpenMP is not available (#1171) Kawrakow 2026-01-22 12:26:23 +02:00
  • 77c18acc90 Fix non-contiguous batched cuBLAS (#1178) Kawrakow 2026-01-22 12:25:05 +02:00
  • c37783b361 Fix non-contiguous batched cuBLAS ik/fix_batched_cublas Kawrakow 2026-01-22 10:05:35 +00:00
  • 3b83473658 This seems to work Kawrakow 2026-01-21 11:53:58 +00:00
  • a2fb4cefda sweep_bench: set number of repetitions ik/sweep_bench_nrep Kawrakow 2026-01-21 08:33:42 +00:00
  • 987651e54c Make comments more precise when experts gating function is missing (#1175) Kawrakow 2026-01-21 09:12:40 +02:00
  • 3d5b854aee Make comments more precise when experts gating function is missing ik/correct_missing_gating_func_comments Kawrakow 2026-01-21 07:08:54 +00:00
  • 9e07839ba3 Correct GLM-4.7-Flash gating function (#1174) Kawrakow 2026-01-21 07:53:18 +02:00
  • 487411b676 This is better ik/correct_glm47_flash_gating_func Kawrakow 2026-01-21 05:52:10 +00:00
  • 06bfd8861b Correct GLM-4.7-Flash gating function Kawrakow 2026-01-21 05:38:36 +00:00
  • a6651d017a Change graph key ik/cuda_graphs_with_overrides Kawrakow 2026-01-20 15:35:53 +00:00
  • c307525cb2 Use CUDA graphs also when there are tensor overrides Kawrakow 2026-01-20 15:26:47 +00:00
  • 6f1a69352f Fuse experts bias in top_k_moe kernel (#1170) Kawrakow 2026-01-20 15:38:51 +02:00
  • 996e77047a Avoid ggml_get_rows if not necessary (#1160) Kawrakow 2026-01-20 15:38:21 +02:00
  • 8f98961b96 Fix build failure when OpenMP is not available ik/fix_windows_no_omp Kawrakow 2026-01-20 13:06:25 +02:00
  • bc16202fc7 Merge remote-tracking branch 'origin/main' into ik/topk_moe_fuse_bias ik/topk_moe_fuse_bias Kawrakow 2026-01-20 10:47:11 +00:00
  • 132a01d25d GLM-4.7-Flash support (#1168) Kawrakow 2026-01-20 12:46:52 +02:00
  • b7b2d4f847 Fuse bias in top_k_moe kernel if present Kawrakow 2026-01-20 10:00:49 +00:00
  • 03c0629b3c Make FA work for mla != 0 ik/glm_flash Kawrakow 2026-01-20 07:58:31 +00:00
  • e6115f7241 Model type Kawrakow 2026-01-20 06:33:40 +00:00
  • ca8425b456 GLM-4.7-Flash support Kawrakow 2026-01-20 05:26:06 +00:00
  • ef5f17940c sampling: refactor sorting (#1166) Kawrakow 2026-01-19 16:48:54 +02:00
  • 1e240db2a0 Couldn't look at it without fixing it. ik/sampling_refactor_sorting Kawrakow 2026-01-19 14:38:53 +00:00
  • 1fea38ce31 Merge remote-tracking branch 'origin/main' into ik/sampling_refactor_sorting Kawrakow 2026-01-19 14:06:05 +00:00
  • efd36d2863 sampling: refactor sorting Kawrakow 2026-01-19 14:04:16 +00:00
  • 98b30e5e81 Faster adaptive_p sampling (#1165) Kawrakow 2026-01-19 16:03:09 +02:00
  • f62e317dbe Merge remote-tracking branch 'origin/main' into ik/adaptive_p_2 ik/adaptive_p_2 Kawrakow 2026-01-19 13:11:04 +00:00
  • c9cd616f84 AVX2 Kawrakow 2026-01-19 13:03:12 +00:00
  • fa58c20c42 A hopefully more efficient adaptive_p sampling (#1161) Kawrakow 2026-01-19 15:01:55 +02:00
  • 6a5c180be9 Fix bf16 additions on CUDA arch < Ampere (#1164) Kawrakow 2026-01-19 12:27:52 +02:00
  • a96e5449cc Correctly accumulate sampling time for adaptive_p ik/adaptive_p Kawrakow 2026-01-19 10:17:07 +00:00
  • bd2434945d Correctly accumulate adaptive_p sampling time Kawrakow 2026-01-19 10:00:19 +00:00
  • 5c1c0e2bad Prevent using NCCL if graph reduce type is bf16 and arch < AMPERE ik/fix_add_bf16_turing Kawrakow 2026-01-19 09:25:20 +00:00
  • 889c553a34 Fix bf16 additions on CUDA arch < Ampere Kawrakow 2026-01-19 09:03:17 +00:00
  • 4df3251b12 This should be better Kawrakow 2026-01-19 08:40:07 +00:00
  • a9f37c2f80 Hopefully better Kawrakow 2026-01-19 07:42:57 +00:00
  • 61eccfcf0d More formatting Kawrakow 2026-01-18 15:49:29 +00:00
  • b2c9689762 Once at it, lets fix the formatting too Kawrakow 2026-01-18 15:01:46 +00:00
  • 6c65430257 A hopefully more efficient adaptive_p sampling Kawrakow 2026-01-18 14:56:30 +00:00
  • 0c0b6e4b8b Copy reduce result to other GPUs if necessary (#1156) Kawrakow 2026-01-19 08:40:26 +02:00
  • ae5c269371 More models ik/skip_get_rows Kawrakow 2026-01-18 13:37:40 +00:00
  • c7deb32142 For the output ops use the result of the split that ran on the main GPU Kawrakow 2026-01-18 12:53:34 +00:00
  • a26adbcf5d Avoid ggml_get_rows for TG Kawrakow 2026-01-18 11:31:35 +00:00
  • fb5c340e17 Copy reduce result to other GPUs if necessary ik/reduce_make_copies Kawrakow 2026-01-18 07:00:06 +00:00
  • 6dfbef27ec Adaptive p: bugfix + optimization + refactor (#1155) dungquixote42 2026-01-18 01:26:06 -05:00
  • d71a3ec315 Server: refactor and rename functions (#1151) firecoperana 2026-01-18 00:16:57 -06:00
  • 7024fdbc72 Additional graph reduce types for split mode graph (#1154) Kawrakow 2026-01-18 08:02:49 +02:00
  • 73b8fea90b This finally works ik/extra_reduce_types Kawrakow 2026-01-17 17:25:57 +00:00
  • 288a8cf842 WIP: add Q8_0 and BF16 as possible reduce types Kawrakow 2026-01-17 15:09:26 +00:00
  • ee463b079e Webui: add text completions and adaptive_p sampling (#1153) firecoperana 2026-01-17 00:37:07 -06:00
  • 709e1a5375 Fixing split mode graph with many GPUs (#1152) Kawrakow 2026-01-17 08:05:24 +02:00
  • c6c890e164 WIP - still deadlocking ik/try_fix_many_gpus_2 Kawrakow 2026-01-16 15:07:23 +00:00
  • 4730b3e1f0 printf cleanup ik/try_fix_many_gpus Kawrakow 2026-01-15 14:33:54 +00:00
  • 7878553ad6 Reenable OpenMP in scheduler async Kawrakow 2026-01-15 14:20:24 +00:00
  • d6e5fb00d6 WIP: this seems more stable Kawrakow 2026-01-15 13:29:27 +00:00
  • 99890edf7e Attempt to fix the many GPU issue in split mode graph Kawrakow 2026-01-15 08:45:52 +00:00
  • cb1063f6cd Fix experts/shared experts split (#1147) Kawrakow 2026-01-14 15:35:16 +02:00
  • e65782de67 Fix experts/shared experts split ik/fix_exp_shexp_split Kawrakow 2026-01-14 13:26:09 +00:00
  • 3a0b234669 Add context management to the MiroThinker template (simulate official agent behavior) (#1143) hksdpc255 2026-01-14 03:08:59 +11:00
  • 672df48ed1 server: keep logit bias unchanged when client does not set it (#1144) firecoperana 2026-01-13 10:08:09 -06:00
  • 0adff91363 Make adding tensor overrides to llama-bench table optional (#1141) Kawrakow 2026-01-13 11:08:13 +02:00
  • 4fd797c863 Make adding tensor overrides to llama-bench table optional ik/llama_bench_overrides Kawrakow 2026-01-13 08:55:38 +00:00
  • 9d9ed6a032 Add -sas, --scheduler-async to llama-bench (#1140) Kawrakow 2026-01-13 10:23:50 +02:00
  • 81c466835d Add -sas, --scheduler-async to llama-bench ik/llama_bench_sas Kawrakow 2026-01-13 08:21:44 +00:00
  • e1c4c4a495 Fix Anthropic Messages API (#1136) hksdpc255 2026-01-13 17:37:29 +11:00
  • 013831bba5 Fix compilation errors Kawrakow 2026-01-13 08:12:49 +02:00
  • 978202a754 Merge ffn_up and ffn_gate experts tensors (part 2) (#1139) Kawrakow 2026-01-13 08:07:52 +02:00
  • 54a1f68d32 Add chat parser for MiroThinker (#1138) hksdpc255 2026-01-13 17:07:12 +11:00
  • 1a461525d5 server: stop processing the prompt when client disconnects (#1134) firecoperana 2026-01-12 23:56:59 -06:00
  • d3e3ad40f9 Compiler warning and white space Kawrakow 2026-01-12 19:06:17 +02:00
  • a50bd821ec Also Qwen3VL-MoE ik/merge_up_gate_exps_3 Kawrakow 2026-01-12 18:52:15 +02:00