Offload only activated experts to the GPU (#698)

* Offload only activated experts

* This seems to do the trick for -fmoe

* Do not recalculate activated experts for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Author: Kawrakow
Date: 2025-09-04 12:22:30 +02:00
Committed by: GitHub
Commit: 13c3b6412e (parent 144d456717)
8 changed files with 155 additions and 45 deletions
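The core idea of the commit is that, for MoE models in a hybrid CPU/GPU setup, only the experts selected by the router for the current token actually need their weights on the GPU. Below is a minimal, self-contained sketch of that selection-then-offload pattern; the names (`upload_to_gpu`, `top_k_experts`) and the simulated upload are illustrative assumptions, not the actual ik_llama.cpp code.

```cpp
// Minimal sketch: per token, only the top-k experts chosen by the router
// are needed, so it suffices to copy just those experts' weights to the GPU.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a device upload (e.g. a backend memcpy to VRAM).
static void upload_to_gpu(const std::vector<float> & weights, int expert_id) {
    std::printf("uploading expert %d (%zu weights)\n", expert_id, weights.size());
}

// Return the indices of the k largest router logits (the activated experts).
static std::vector<int> top_k_experts(const std::vector<float> & router_logits, int k) {
    std::vector<int> idx(router_logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](int a, int b) { return router_logits[a] > router_logits[b]; });
    idx.resize(k);
    return idx;
}

int main() {
    const int n_expert = 8, n_expert_used = 2, n_weights_per_expert = 16;

    // All expert weights stay in host (CPU) memory.
    std::vector<std::vector<float>> expert_weights(n_expert,
        std::vector<float>(n_weights_per_expert, 1.0f));

    // Router scores for one token; in the real model these come from the gating network.
    std::vector<float> router_logits = {0.1f, 2.3f, -0.5f, 0.9f, 1.7f, 0.0f, -1.2f, 0.4f};

    // Offload only the activated experts instead of all n_expert of them.
    for (int e : top_k_experts(router_logits, n_expert_used)) {
        upload_to_gpu(expert_weights[e], e);
    }
    return 0;
}
```

The same selection can be reused for the fused up/gate path, which is what the "Do not recalculate activated experts for fused up/gate" item refers to.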


@@ -223,6 +223,7 @@ struct gpt_params {
     bool repack_tensors = false; // repack tensors if interleaved variant is available
     bool use_thp = false; // use transparent huge pages (linux only)
     bool validate_quants = false; // if true, check for NaNs while loading the model
+    bool only_active_exps = false; // if true, offload only active experts (relevant only for hybrid CPU/GPU)
     std::string cache_type_k = "f16"; // KV cache data type for the K
     std::string cache_type_v = "f16"; // KV cache data type for the V
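The "Add a command line argument" item presumably exposes this `only_active_exps` field as a CLI switch. The option name is not shown in this excerpt, so the following sketch uses an assumed flag name purely for illustration of how it could be wired into argument parsing:

```cpp
// Hypothetical sketch only: the real option name and parsing code are not
// part of this excerpt.
#include <cstring>

struct gpt_params {
    bool only_active_exps = false; // offload only active experts (hybrid CPU/GPU)
    // ... other fields from the struct above ...
};

// Returns true if the argument was recognized.
static bool parse_arg(const char * arg, gpt_params & params) {
    // "--offload-only-active-experts" is an assumed name, not the actual flag.
    if (std::strcmp(arg, "--offload-only-active-experts") == 0) {
        params.only_active_exps = true;
        return true;
    }
    return false;
}
```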