Offload only activated experts to the GPU (#698)

* Offload only activated experts

* This seems to do the trick for -fmoe

* Do not recalculate activated experts for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Author: Kawrakow
Date: 2025-09-04 12:22:30 +02:00
Committed by: GitHub
Commit: 13c3b6412e (parent 144d456717)
8 changed files with 155 additions and 45 deletions
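The core idea of the commit is that, for MoE models in a hybrid CPU/GPU setup, only the experts selected by the router for the current token actually need their weights on the GPU. Below is a minimal, self-contained sketch of that selection-then-offload pattern; the names (`upload_to_gpu`, `top_k_experts`) and the simulated upload are illustrative assumptions, not the actual ik_llama.cpp code.

```cpp
// Minimal sketch: per token, only the top-k experts chosen by the router
// are needed, so it suffices to copy just those experts' weights to the GPU.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a device upload (e.g. a backend memcpy to VRAM).
static void upload_to_gpu(const std::vector<float> & weights, int expert_id) {
    std::printf("uploading expert %d (%zu weights)\n", expert_id, weights.size());
}

// Return the indices of the k largest router logits (the activated experts).
static std::vector<int> top_k_experts(const std::vector<float> & router_logits, int k) {
    std::vector<int> idx(router_logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](int a, int b) { return router_logits[a] > router_logits[b]; });
    idx.resize(k);
    return idx;
}

int main() {
    const int n_expert = 8, n_expert_used = 2, n_weights_per_expert = 16;

    // All expert weights stay in host (CPU) memory.
    std::vector<std::vector<float>> expert_weights(n_expert,
        std::vector<float>(n_weights_per_expert, 1.0f));

    // Router scores for one token; in the real model these come from the gating network.
    std::vector<float> router_logits = {0.1f, 2.3f, -0.5f, 0.9f, 1.7f, 0.0f, -1.2f, 0.4f};

    // Offload only the activated experts instead of all n_expert of them.
    for (int e : top_k_experts(router_logits, n_expert_used)) {
        upload_to_gpu(expert_weights[e], e);
    }
    return 0;
}
```

The same selection can be reused for the fused up/gate path, which is what the "Do not recalculate activated experts for fused up/gate" item refers to.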


@@ -223,6 +223,7 @@ struct gpt_params {
     bool repack_tensors = false; // repack tensors if interleaved variant is available
     bool use_thp = false; // use transparent huge pages (linux only)
     bool validate_quants = false; // if true, check for NaNs while loading the model
+    bool only_active_exps = false; // if true, offload only active experts (relevant only for hybrid CPU/GPU)
     std::string cache_type_k = "f16"; // KV cache data type for the K
     std::string cache_type_v = "f16"; // KV cache data type for the V
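The "Add a command line argument" item presumably exposes this `only_active_exps` field as a CLI switch. The option name is not shown in this excerpt, so the following sketch uses an assumed flag name purely for illustration of how it could be wired into argument parsing:

```cpp
// Hypothetical sketch only: the real option name and parsing code are not
// part of this excerpt.
#include <cstring>

struct gpt_params {
    bool only_active_exps = false; // offload only active experts (hybrid CPU/GPU)
    // ... other fields from the struct above ...
};

// Returns true if the argument was recognized.
static bool parse_arg(const char * arg, gpt_params & params) {
    // "--offload-only-active-experts" is an assumed name, not the actual flag.
    if (std::strcmp(arg, "--offload-only-active-experts") == 0) {
        params.only_active_exps = true;
        return true;
    }
    return false;
}
```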