Offload only activated experts to the GPU (#698)

* Offload only activated experts (see the sketch after the commit metadata below)

* This seems to do the trick for -fmoe

* Do not recalculate activated experts for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Author: Kawrakow
Date: 2025-09-04 12:22:30 +02:00
Committed by: GitHub
Parent commit: 06cc7c6894
Commit: 0c15494c30
8 changed files with 155 additions and 45 deletions
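The core idea: in an MoE model only the router-selected (activated) experts contribute to a given batch, so when expert weights are kept in host RAM, only those experts need to be copied to the GPU. Below is a minimal C++ sketch of that technique; the names (expert_tensor, upload_active_experts, gpu_upload) are hypothetical illustrations, not the commit's actual code.

    #include <cstddef>
    #include <cstdint>
    #include <set>
    #include <vector>

    // Hypothetical per-expert weight tensor kept in host memory.
    struct expert_tensor { const void * host_data; size_t nbytes; };

    // Collect the union of experts the router selected for all tokens in the
    // batch, then upload only those tensors instead of the full expert stack.
    static void upload_active_experts(
            const std::vector<std::vector<int32_t>> & selected,  // per-token top-k ids
            const std::vector<expert_tensor> & experts,          // all experts, host side
            void (*gpu_upload)(const void * src, size_t nbytes)) {
        std::set<int32_t> active;
        for (const auto & ids : selected) {
            active.insert(ids.begin(), ids.end());
        }
        for (int32_t id : active) {
            // Inactive experts are never touched, saving PCIe bandwidth.
            gpu_upload(experts[id].host_data, experts[id].nbytes);
        }
    }

The win comes from the gap between the number of experts a MoE layer has and the number the router activates per token: uploading only the active subset can cut the host-to-GPU traffic substantially when most experts sit idle for a given batch.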


@@ -424,6 +424,7 @@ extern "C" {
         bool  fused_up_gate;  // whether to use fused up/gate op [EXPERIMENTAL]
         int   min_experts;
         float thresh_experts;
+        bool  only_active_experts;
         // Abort callback
         // if it returns true, execution of llama_decode() will be aborted
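A hedged usage sketch: assuming the new field sits in llama_context_params next to min_experts/thresh_experts (as the diff suggests) and defaults to false, enabling it might look like the snippet below; `model` is assumed to have been loaded elsewhere.

    #include "llama.h"

    // Assumes a model already loaded, e.g. via llama_load_model_from_file().
    struct llama_context_params cparams = llama_context_default_params();
    cparams.only_active_experts = true;  // copy only router-activated experts to the GPU
    struct llama_context * ctx = llama_new_context_with_model(model, cparams);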