Fused FFN_UP+FFN_GATE op (#741)

* Fused up+gate+unary for regular (not MoE) FFN - CPU

* WIP CUDA

* Seems to be working on CUDA

For a dense model we get a 2-3% speedup for PP (prompt processing) and ~0.6% for TG (token generation).

* Add command line option

This time the option is ON by default; one needs to turn it off via -no-fug
or --no-fused-up-gate (a sketch of the fused computation follows below).
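
For reference, the unfused FFN computes act(W_gate·x) ⊙ (W_up·x) as two separate matmuls followed by a unary op and an elementwise multiply; the fused op evaluates all of that in one pass over the input row. Below is a minimal CPU sketch of the computation, assuming a SiLU gate (the actual activation depends on the model); all names are illustrative, not the library's identifiers.

```cpp
#include <cmath>
#include <vector>

// y[i] = silu(W_gate[i,:] . x) * (W_up[i,:] . x)
// The fused op evaluates both dot products and the unary activation in a
// single pass over x, instead of two matmuls, a unary op, and a multiply.
std::vector<float> fused_up_gate(const std::vector<float>& x,       // [n_embd]
                                 const std::vector<float>& w_up,    // [n_ff x n_embd], row-major
                                 const std::vector<float>& w_gate,  // [n_ff x n_embd], row-major
                                 int n_embd, int n_ff) {
    std::vector<float> y(n_ff);
    for (int i = 0; i < n_ff; ++i) {
        float up = 0.f, gate = 0.f;
        for (int j = 0; j < n_embd; ++j) {
            up   += w_up  [i*n_embd + j] * x[j];
            gate += w_gate[i*n_embd + j] * x[j];
        }
        y[i] = (gate / (1.f + std::exp(-gate))) * up;  // silu(gate) * up
    }
    return y;
}
```

The saving comes from reading the input activations once and skipping the intermediate up/gate tensors plus the extra graph nodes, which is consistent with the modest 2-3% PP gain quoted above.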

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Kawrakow authored on 2025-08-31 18:16:36 +03:00, committed by GitHub
parent f22a9ef95a, commit b66cecca45
10 changed files with 276 additions and 12 deletions


@@ -419,7 +419,8 @@ extern "C" {
         bool flash_attn;         // whether to use flash attention [EXPERIMENTAL]
         int  mla_attn;           // whether to use MLA attention [EXPERIMENTAL]
         int  attn_max_batch;     // maximum batch size for attention computations [EXPERIMENTAL]
-        bool fused_moe_up_gate;  // whether to use fused MoE up/down op [EXPERIMENTAL]
+        bool fused_moe_up_gate;  // whether to use fused MoE up/gate op
+        bool fused_up_gate;      // whether to use fused up/gate op [EXPERIMENTAL]
         int  min_experts;
         float thresh_experts;
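
To illustrate how the new flag might be used from the API side, here is a hypothetical sketch. Only the two bool fields appear in the hunk above; the surrounding struct and helper names (llama_context_params, llama_context_default_params) are assumptions based on the usual llama.cpp API.

```cpp
#include "llama.h"

int main() {
    // Assumed struct/helper names; only the two bool fields below are
    // confirmed by the hunk above.
    llama_context_params cparams = llama_context_default_params();
    cparams.fused_moe_up_gate = true;   // existing fused MoE up/gate path
    cparams.fused_up_gate     = false;  // same effect as -no-fug / --no-fused-up-gate
    // ... pass cparams when creating the context as usual ...
    return 0;
}
```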