Fuse add+add+fused_rms (#853)

* Fuse add+add+fused_rms
* Try this
* Macro to easily enable/disable fusion
* Various:
  * Check that all tensors involved are on the same device before applying fusion
  * Fuse sigmoid+scale+sum_rows+div
  * Fix the fused bailingmoe2 experts selection

    The issue there was that the bias was not per row, but per expert group, so only the first n_per_group biases were used for all experts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
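
For readers unfamiliar with this kind of fusion, here is a minimal CPU sketch of what an add+add+fused_rms pattern could compute. It is not the code from this commit. It assumes the pattern is two element-wise adds followed by an RMS norm multiplied by a weight vector, applied to a single row. The function names `add_add_rms_unfused` and `add_add_rms_fused` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: three separate steps, each writing an intermediate row.
// Assumes a, b, c and w all have the same length.
std::vector<float> add_add_rms_unfused(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       const std::vector<float>& c,
                                       const std::vector<float>& w,
                                       float eps) {
    const std::size_t n = a.size();
    std::vector<float> t1(n), t2(n), out(n);
    for (std::size_t i = 0; i < n; ++i) t1[i] = a[i] + b[i];   // first add
    for (std::size_t i = 0; i < n; ++i) t2[i] = t1[i] + c[i];  // second add
    float sumsq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sumsq += t2[i] * t2[i];
    const float scale = 1.0f / std::sqrt(sumsq / n + eps);     // RMS norm factor
    for (std::size_t i = 0; i < n; ++i) out[i] = t2[i] * scale * w[i];
    return out;
}

// Hypothetical fused variant: the two adds and the sum of squares share one pass,
// so no intermediate rows are materialized.
std::vector<float> add_add_rms_fused(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     const std::vector<float>& c,
                                     const std::vector<float>& w,
                                     float eps) {
    const std::size_t n = a.size();
    std::vector<float> out(n);
    float sumsq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float s = a[i] + b[i] + c[i];  // both adds at once
        out[i] = s;
        sumsq += s * s;
    }
    const float scale = 1.0f / std::sqrt(sumsq / n + eps);
    for (std::size_t i = 0; i < n; ++i) out[i] *= scale * w[i];
    return out;
}
```

Both functions should return the same values up to rounding. The fused form reads each input once and skips the temporary buffers, which is the usual motivation for fusing operations like these into a single kernel.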
2026-02-27 16:44:21 +00:00 · 2025-10-22 16:18:11 +03:00
parent af5bf60cc8
commit ed4e1a6588
8 changed files with 281 additions and 54 deletions
--- a/src/llama-build-context.cpp
+++ b/src/llama-build-context.cpp
@@ -7793,6 +7793,7 @@ ggml_cgraph * llm_build_context::build_openai_moe() {

        cur = ffn_inp;
        cur = llm_build_norm(ctx0, cur,  hparams, model.layers[il].attn_post_norm, nullptr, LLM_NORM_RMS, cb, il);
+        ggml_build_forward_expand(gf, cur);
        cb(cur, "attn_post_norm", il);

        bool use_dup_bias = cur->ne[1] < 32 && model.layers[il].ffn_up_exps_b_dup &&