Default Branch

30381fc1fc · Faster hybrid inference when shared experts (#1191) · Updated 2026-01-26 05:22:05 +00:00

Branches

Each entry lists the branch's latest commit and message, its last-updated time, and its divergence from the default branch (commits behind / commits ahead).

a2f5614529 · Try to split offloaded MoE up/gate up · Updated 2025-12-09 10:09:04 +00:00 · 97 behind / 3 ahead
ccf72a0e46 · Also this · Updated 2025-12-09 06:36:31 +00:00 · 97 behind / 2 ahead
c83d2fd335 · WIP · Updated 2025-12-08 15:44:53 +00:00 · 99 behind / 3 ahead
be8e7057b3 · Handle split cache (read) · Updated 2025-12-08 08:55:35 +00:00 · 98 behind / 2 ahead
0e683f24ad · Fix annoying compiler warnings · Updated 2025-12-06 08:57:50 +00:00 · 100 behind / 1 ahead
a4da6e298a · Automatically disable CUDA graphs for split mode "graph" · Updated 2025-12-05 17:00:58 +00:00 · 101 behind / 1 ahead
b18f658a7d · CUDA: set current device in compute_forward · Updated 2025-12-05 15:40:48 +00:00 · 103 behind / 1 ahead
ed8a3d8e3d · Don't split the output tensor · Updated 2025-12-05 13:16:11 +00:00 · 104 behind / 1 ahead
9264abfbaf · Fix debug build (#1037) · Updated 2025-12-05 13:06:22 +00:00 · 104 behind / 0 ahead · Included
c374b221b6 · Mistral3-large · Updated 2025-12-04 16:05:40 +00:00 · 4147 behind / 4042 ahead
6387a5800a · Minor · Updated 2025-12-04 05:52:05 +00:00 · 106 behind / 2 ahead
9c17d5f176 · WIP: Hadamard transforms for K-cache · Updated 2025-12-03 14:26:46 +00:00 · 107 behind / 1 ahead
ab19054a79 · Use standard attention for Ministral3 · Updated 2025-12-03 10:51:32 +00:00 · 109 behind / 1 ahead
c5f9a5c29a · Fix bug in ggml_cuda_op_scale_tensor · Updated 2025-12-03 10:28:26 +00:00 · 110 behind / 1 ahead
84129f7eb6 · Adding ministral3: this seems to work · Updated 2025-12-03 09:41:44 +00:00 · 4147 behind / 4037 ahead
dde8028336 · WIP: allocate graph · Updated 2025-12-03 07:54:53 +00:00 · 111 behind / 4 ahead
b415e734e5 · Fix also output · Updated 2025-12-03 04:53:44 +00:00 · 111 behind / 3 ahead
49ec5726d7 · Is this better for multi-GPU and split mode "graph"? · Updated 2025-12-02 08:44:46 +00:00 · 112 behind / 1 ahead
c4c266847f · Slightly better graph split strategy · Updated 2025-12-02 08:18:55 +00:00 · 112 behind / 1 ahead
864b496831 · Try to better distribute the splits · Updated 2025-12-01 13:18:56 +00:00 · 113 behind / 32 ahead
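
The divergence counts above can be recomputed from a local clone with plain git. A minimal sketch, assuming both refs are fetched locally; 30381fc1fc is the default-branch head shown above, and the counts will drift as the default branch advances:

```
# Symmetric-difference commit counts between the default-branch head
# and a work branch: the left number is commits reachable only from
# the default branch (how far the branch is behind), the right number
# is commits reachable only from the branch (how far it is ahead).
git rev-list --left-right --count 30381fc1fc...a2f5614529
# Expected to match the listing above: 97 behind, 3 ahead.
```

A branch whose right-hand count is 0 is fully contained in the default branch, which is presumably what the Included marker on 9264abfbaf indicates.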