Slightly faster TG for split mode "graph" (#1057)

* Rearrange graph nodes

This lets graph portions that are the same on 2 or more GPUs
run at the same time.

* Separate graph compute implementation for split mode graph

* This is better
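
To illustrate the node rearrangement described above, here is a minimal sketch. It is not the code from this commit: `Node` and `group_into_waves` are hypothetical names, and the sketch assumes each GPU executes its own ordered list of nodes and that the k-th nodes of different GPUs do not depend on each other. Under those assumptions, grouping the k-th node of every GPU into one "wave" places the identical graph portions next to each other, so a scheduler could launch them together.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node; this is not a ggml type.
struct Node {
    std::string op;     // operation name, e.g. "ffn_up"
    int         device; // index of the GPU that executes this node
};

// Group per-device node lists into waves: wave k holds the k-th node of
// every device that still has one. Within a wave, every node belongs to a
// different device, so (assuming no cross-device dependencies inside a
// wave) all nodes of a wave can be launched together.
std::vector<std::vector<Node>> group_into_waves(
        const std::vector<std::vector<Node>> & per_device) {
    // The number of waves equals the length of the longest per-device list.
    std::size_t max_len = 0;
    for (const auto & nodes : per_device) {
        if (nodes.size() > max_len) max_len = nodes.size();
    }

    std::vector<std::vector<Node>> waves(max_len);
    for (std::size_t d = 0; d < per_device.size(); ++d) {
        for (std::size_t k = 0; k < per_device[d].size(); ++k) {
            // Precondition of this sketch: list d holds only device d's nodes.
            assert(per_device[d][k].device == static_cast<int>(d));
            waves[k].push_back(per_device[d][k]);
        }
    }
    return waves;
}
```

With two devices that each run `attn` followed by `ffn`, the first wave holds both `attn` nodes and the second holds both `ffn` nodes. A device with extra nodes simply contributes to later waves on its own.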

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Kawrakow
2025-12-12 07:54:37 +01:00
committed by GitHub
parent bf03f63c34
commit f65fefa36c
4 changed files with 183 additions and 91 deletions

@@ -1228,6 +1228,7 @@ llm_expert_gating_func_type gating_op,
cur = ggml_cast(ctx, cur, GGML_TYPE_F16);
cb(cur, "ffn_out_f16", il_cb);
}
ggml_build_forward_expand(graph, routed_out);
results.push_back(cur);
}
GGML_ASSERT(!results.empty());