Faster MoE inference (#112)

* multi_add: WIP

* multi_add: CPU works

* multi_add: CUDA

* multi_add: simplify

* multi_add: Metal

* Metal: speed up mul_mat_id

For the Granite-1B MoE model, PP-512 goes from
156 t/s to 890 t/s, nearly a 6X speedup!

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Commit 52874c5d21 by Kawrakow, committed via GitHub, 2024-10-31 12:05:27 +01:00
parent 5ad6439486
8 changed files with 332 additions and 33 deletions

@@ -494,6 +494,7 @@ extern "C" {
GGML_OP_GROUP_NORM,
GGML_OP_FUSED_RMS_NORM,
GGML_OP_FUSED_MUL_UNARY,
GGML_OP_MULTI_ADD,
GGML_OP_MUL_MAT,
GGML_OP_MUL_MAT_ID,
@@ -930,6 +931,11 @@ extern "C" {
struct ggml_tensor * a,
struct ggml_tensor * b);
GGML_API struct ggml_tensor * ggml_multi_add(
struct ggml_context * ctx,
struct ggml_tensor * a,
int n_experts);
// dst = a
// view(dst, nb1, nb2, nb3, offset) += b
// return dst