Merge ffn_up and ffn_gate experts tensors (#1137)

* WIP - not working

* WIP - not working

* WIP - GPT-OSS working

However, the approach is extremely stupid: the only way I could
correctly repack the up/gate experts is to copy up and gate into host
buffers, repack them into another host buffer, and then copy the result
back into the ffn_up_gate_exps tensor (a rough sketch of this host
round-trip is included at the end of this item). This is going to be
very slow for giant 500 GB models.

My attempts to do this via a compute graph on the backend holding
the tensors were unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP (prompt processing) when using
the original ik_llama.cpp fused_up_gate CUDA implementation, and ~10%
better when using the small batch size implementation.

Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.
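
For reference, a minimal sketch of the naive host round-trip described
above, assuming the standard ggml backend copy APIs
(ggml_backend_tensor_get / ggml_backend_tensor_set) and a simple
per-expert concatenation. The helper name and the exact repacked layout
are illustrative only, not the actual implementation:

    // Sketch only: copy up/gate to the host, interleave per expert, write back.
    // Assumes up, gate and the merged tensor have the same type, with experts
    // stored along ne[2].
    #include "ggml-backend.h"
    #include <cstdint>
    #include <cstring>
    #include <vector>

    static void merge_up_gate_on_host(struct ggml_tensor * up,
                                      struct ggml_tensor * gate,
                                      struct ggml_tensor * up_gate) {
        const size_t up_bytes   = ggml_nbytes(up);
        const size_t gate_bytes = ggml_nbytes(gate);

        std::vector<uint8_t> up_host(up_bytes), gate_host(gate_bytes);
        std::vector<uint8_t> merged(up_bytes + gate_bytes);

        // device -> host
        ggml_backend_tensor_get(up,   up_host.data(),   0, up_bytes);
        ggml_backend_tensor_get(gate, gate_host.data(), 0, gate_bytes);

        // repack on the host: for each expert, up rows followed by gate rows
        const int64_t n_expert   = up->ne[2];
        const size_t  up_slice   = up_bytes   / n_expert;
        const size_t  gate_slice = gate_bytes / n_expert;
        for (int64_t e = 0; e < n_expert; ++e) {
            std::memcpy(merged.data() + e*(up_slice + gate_slice),
                        up_host.data() + e*up_slice, up_slice);
            std::memcpy(merged.data() + e*(up_slice + gate_slice) + up_slice,
                        gate_host.data() + e*gate_slice, gate_slice);
        }

        // host -> device, into the merged ffn_up_gate_exps tensor
        ggml_backend_tensor_set(up_gate, merged.data(), 0, merged.size());
    }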

* WIP

* WIP - Qwen3-MoE (and hopefully all others) working

But when I say "working" here and in the previous commit, I mean PP is
working. TG (token generation) is still broken.

* WIP: TG seems to be working

* Minor

* Add command line option to merge experts up/gate

* Add merge up/gate command line parameter to llama-bench

* Turn off merge_up_gate_exps when split mode is graph

It is not yet implemented for that split mode
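
A purely hypothetical illustration of such a guard (the enum value
LLAMA_SPLIT_MODE_GRAPH and the helper name are assumptions, not taken
from the patch):

    // Hypothetical sketch, not the actual code: disable the merge when the
    // (assumed) graph split mode is selected, since it is not supported there.
    #include "llama.h"

    static void disable_merge_for_graph_split(llama_model_params & mparams) {
        if (mparams.split_mode == LLAMA_SPLIT_MODE_GRAPH && mparams.merge_up_gate_exps) {
            mparams.merge_up_gate_exps = false;
        }
    }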

* When there is no bias, allow merging up/gate together with tensor overrides

* Arghh, we need to increase the context size again

* Cleanup

Author:    Kawrakow
Date:      2026-01-12 18:30:53 +02:00
Committer: GitHub
Parent:    bf0c6c57bb
Commit:    c03c2d7cc6
16 changed files with 505 additions and 134 deletions


@@ -392,6 +392,7 @@ extern "C" {
         bool use_thp;            // use transparent huge pages (linux only)
         bool validate_quants;    // if true, check for NaNs while loading the model
         bool merge_qkv;          // if true, merge separate Q, K, V tensors into a single, contiguous tensor
+        bool merge_up_gate_exps; // if true, merge ffn_up_exps and ffn_gate_exps tensors into a single, contiguous tensor
     };
     // NOTE: changing the default values of parameters marked as [EXPERIMENTAL] may cause crashes or incorrect results in certain configurations
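
Not part of the patch, just a usage sketch: assuming the struct shown
in the diff is llama_model_params and that the classic
llama_load_model_from_file() / llama_free_model() entry points are
still exposed, enabling the new option would look like this:

    // Usage sketch (not from the patch): enable merging of the up/gate expert
    // tensors when loading a model. The field names are taken from the diff
    // above; the loading calls are the standard llama.cpp ones and may differ.
    #include "llama.h"

    int main() {
        llama_model_params mparams = llama_model_default_params();
        mparams.merge_up_gate_exps = true; // merge ffn_up_exps/ffn_gate_exps at load time
        mparams.merge_qkv          = true; // existing analogous option for the attention tensors

        llama_model * model = llama_load_model_from_file("model.gguf", mparams);
        if (model == nullptr) {
            return 1;
        }
        // ... create a context and run inference as usual ...
        llama_free_model(model);
        return 0;
    }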