Default Branch

30381fc1fc · Faster hybrid inference when shared experts (#1191) · Updated 2026-01-26 05:22:05 +00:00

Branches

5c1c0e2bad · Prevent using NCCL if graph reduce type is bf16 and arch < AMPERE · Updated 2026-01-19 09:25:20 +00:00 · 21 behind, 2 ahead

ae5c269371 · More models · Updated 2026-01-18 13:37:40 +00:00 · 22 behind, 4 ahead

fb5c340e17 · Copy reduce result to other GPUs if necessary · Updated 2026-01-18 07:00:06 +00:00 · 22 behind, 1 ahead

73b8fea90b · This finally works · Updated 2026-01-17 17:25:57 +00:00 · 25 behind, 2 ahead

02aa65009b · fix test build error · Updated 2026-01-17 16:04:42 +00:00 · 27 behind, 5 ahead

c2eed98296 · update description · Updated 2026-01-17 00:52:52 +00:00 · 27 behind, 2 ahead

c6c890e164 · WIP - still deadlocking · Updated 2026-01-16 15:07:23 +00:00 · 27 behind, 5 ahead

4730b3e1f0 · printf cleanup · Updated 2026-01-15 14:33:54 +00:00 · 27 behind, 4 ahead

e65782de67 · Fix experts/shared experts split · Updated 2026-01-14 13:26:09 +00:00 · 28 behind, 1 ahead

4fd797c863 · Make adding tensor overrides to llama-bench table optional · Updated 2026-01-13 08:55:38 +00:00 · 31 behind, 1 ahead

81c466835d · Add -sas, --scheduler-async to llama-bench · Updated 2026-01-13 08:21:44 +00:00 · 32 behind, 1 ahead

a50bd821ec · Also Qwen3VL-MoE · Updated 2026-01-12 16:52:15 +00:00 · 38 behind, 4 ahead

5d0123313a · All the others · Updated 2026-01-12 16:22:53 +00:00 · 40 behind, 16 ahead

905bca2e1c · Cleanup · Updated 2026-01-12 13:28:06 +00:00 · 40 behind, 13 ahead

738dc60b78 · We don't need these · Updated 2026-01-10 15:32:21 +00:00 · 40 behind, 0 ahead · Included

1ee36144a8 · WIP - something is wrong · Updated 2026-01-10 13:17:22 +00:00 · 41 behind, 1 ahead

d329029dde · Fix mla = 0 · Updated 2026-01-10 08:27:57 +00:00 · 42 behind, 1 ahead

39e57c1b57 · Update AUTHORS · Updated 2026-01-10 06:09:34 +00:00 · 4147 behind, 4105 ahead

58f3784821 · Fix split mode graph for GPT-OSS with partial offload · Updated 2026-01-09 16:57:30 +00:00 · 4147 behind, 4102 ahead

ae547b8502 · Fix assert when --max-gpu is less than available GPUs · Updated 2026-01-09 11:15:05 +00:00 · 4147 behind, 4102 ahead