SER - Smart Expert Reduction (#239)

* A better way to measure the cost of ggml_barrier

* Smart expert selection

* Add ser option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Kawrakow
2025-03-02 13:47:38 +02:00
committed by GitHub
parent 101c888724
commit 9424c80ab1
11 changed files with 330 additions and 27 deletions
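
The commit message does not spell out the selection rule, so the following is only a sketch of how the two new parameters, `min_experts` and `thresh_experts`, could interact. It assumes the routing weights of the already-selected top-k experts are sorted in descending order, that the first `min_experts` are always kept, and that any further expert is kept only while its weight is at least `thresh_experts` times the largest weight. The helper name `count_active_experts` is hypothetical and does not appear in the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not taken from the commit: returns how many of the
// selected experts would be evaluated under the assumed rule.
// `weights` holds the routing weights of the top-k experts, sorted in
// descending order (assumption).
std::size_t count_active_experts(const std::vector<float>& weights,
                                 int min_experts, float thresh_experts) {
    assert(min_experts >= 0);
    if (weights.empty()) return 0;

    // Assumed cutoff: a fraction of the largest routing weight.
    const float cutoff = thresh_experts * weights[0];

    std::size_t n = 0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        // The first `min_experts` entries are kept unconditionally.
        const bool forced = i < static_cast<std::size_t>(min_experts);
        // Weights are sorted, so once one falls below the cutoff the rest do too.
        if (!forced && weights[i] < cutoff) break;
        ++n;
    }
    return n;
}
```

Under these assumptions, a threshold of 0 keeps every selected expert, while a larger threshold trades accuracy for fewer expert evaluations per token, never going below `min_experts`.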

@@ -386,6 +386,8 @@ extern "C" {
int mla_attn; // whether to use MLA attention [EXPERIMENTAL]
int attn_max_batch; // maximum batch size for attention computations [EXPERIMENTAL]
bool fused_moe_up_gate; // whether to use fused MoE up/down op [EXPERIMENTAL]
int min_experts;
float thresh_experts;
// Abort callback
// if it returns true, execution of llama_decode() will be aborted