Fused MoE ffn_up and ffn_gate (#229)

* Fusing MoE up * unary(gate)

* Fusing MoE up * unary(gate): CUDA

We get a ~13% speedup for PP-512 and ~2% for TG-128
with DeepSeek-Lite.
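
For reference, the fused op replaces three separate graph ops (MUL_MAT_ID for up, MUL_MAT_ID for gate, and the unary+mul) with a single op that computes, per token and per selected expert, `up(x) * unary(gate(x))`. A minimal scalar sketch of the semantics follows; this is not the actual kernel, and `silu` stands in for whatever activation the model uses:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// silu as a stand-in for the model's activation, i.e. unary(gate)
static float silu(float v) { return v / (1.0f + std::exp(-v)); }

// w_up and w_gate are row-major [n_ff x n_in] matrices of one expert,
// x is one token of size n_in; returns y of size n_ff.
static std::vector<float> fused_up_gate(const std::vector<float> & w_up,
                                        const std::vector<float> & w_gate,
                                        const std::vector<float> & x,
                                        std::size_t n_ff, std::size_t n_in) {
    std::vector<float> y(n_ff);
    for (std::size_t i = 0; i < n_ff; ++i) {
        float up = 0.0f, gate = 0.0f;
        for (std::size_t j = 0; j < n_in; ++j) {
            up   += w_up  [i*n_in + j] * x[j];
            gate += w_gate[i*n_in + j] * x[j];
        }
        y[i] = up * silu(gate); // up * unary(gate) in a single pass
    }
    return y;
}
```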

* On CUDA also fuse MoE down * (up * unary(gate))

when the MUL_MAT_ID op for the down experts is the next op in the graph.
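
Continuing the scalar sketch above (again only the semantics, not the CUDA kernel): when the down MUL_MAT_ID follows immediately, the intermediate `h = up * unary(gate)` does not need to be materialized as a separate graph tensor and the whole expert FFN collapses into one op.

```cpp
// h is the result of fused_up_gate() above (size n_ff),
// w_down is row-major [n_in x n_ff]; returns the expert output of size n_in.
static std::vector<float> apply_down(const std::vector<float> & w_down,
                                     const std::vector<float> & h,
                                     std::size_t n_in, std::size_t n_ff) {
    std::vector<float> z(n_in, 0.0f);
    for (std::size_t i = 0; i < n_in; ++i) {
        for (std::size_t j = 0; j < n_ff; ++j) {
            z[i] += w_down[i*n_ff + j] * h[j];
        }
    }
    return z; // down * (up * unary(gate))
}
```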

* Command line option to enable fused MoE up*unary(gate)

* Add fmoe option to llama-bench
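
Typical usage would look roughly like the following; the exact flag spellings are assumptions, so check `common.cpp` and `llama-bench --help` for the authoritative names:

```sh
# enable the fused MoE up*unary(gate) path at inference time (flag name assumed)
./llama-cli -m deepseek-lite.gguf -fmoe -p "..."

# benchmark with and without the fusion (option name assumed)
./llama-bench -m deepseek-lite.gguf -p 512 -n 128 -fmoe 0,1
```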

* Adding forgotten gelu, relu, silu on ARM
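
These are the vectorized unary kernels the fused op needs on the CPU back end. As a flavor of what the ARM_NEON versions look like (not the repository's actual code), here is a minimal relu sketch; silu and gelu are structured the same way but additionally need a vectorized exp/tanh approximation:

```cpp
#include <arm_neon.h>

// relu over n floats with NEON; n is assumed to be a multiple of 4 for brevity.
static void relu_f32_neon(const float * x, float * y, int n) {
    const float32x4_t zero = vdupq_n_f32(0.0f);
    for (int i = 0; i < n; i += 4) {
        vst1q_f32(y + i, vmaxq_f32(vld1q_f32(x + i), zero));
    }
}
```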

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

@@ -175,6 +175,7 @@ struct gpt_params {
 bool cont_batching = true; // insert new sequences for decoding on-the-fly
 bool flash_attn = false; // flash attention
 bool mla_attn = false; // MLA
+bool fused_moe_up_gate = false; // fused up*unary(gate) op for MoE models
 bool input_prefix_bos = false; // prefix BOS to user inputs, preceding input_prefix
 bool ignore_eos = false; // ignore generated EOS tokens