Playing games with the scheduler

This change tricks it into doing the right thing™.
Still quite a bit slower than split mode layer for the 8B LLaMA model.
But for the 70B LLaMA it now beats split mode layer for TG (token generation):
28 t/s vs 24.4 t/s. PP (prompt processing) is 627 t/s vs 744 t/s.
In comparison, split mode "row" in mainline gets
484 t/s PP and 19.3 t/s TG.
This commit is contained in:
Kawrakow
2025-11-26 20:34:37 +00:00
parent 4bcfb40711
commit 52a7cbe482
2 changed files with 6 additions and 1 deletion


@@ -680,6 +680,7 @@ ggml_tensor * llm_build_context::llm_build_ffn(
cur = ggml_add(ctx, cur, ffn[id]);
cb(cur, "combine_ffn", il);
}
cur->op_params[0] = 0xff;
return cur;
}
@@ -9088,6 +9089,7 @@ ggml_tensor * llm_build_context::build_std_attention(ggml_cgraph * gf, ggml_tens
cur = ggml_add(ctx0, cur, attn[id]);
cb(cur, "combine_attn", il);
}
cur->op_params[0] = 0xff;
return cur;
}
}