Faster MoE token generation on CUDA (ik_llama.cpp pull request 248)

This PR adds special purpose matrix-vector multiplications for MoE models.

For DeepSeek-Lite this results in a ~25% speedup for token generation.

For now this is implemented only ~~with the -fmoe option and only~~ for quantized experts.
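
As a rough illustration of the operation being specialized: during token generation there is a single activation vector, so each selected expert's multiplication reduces to a matrix-vector product. The sketch below is a plain CPU reference on unquantized `float` weights, not the CUDA kernel from this PR. The function name `moe_mat_vec`, its parameters, and the memory layout are all assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not taken from the PR).
// weights : n_expert matrices of shape n_rows x n_cols, row-major, stored back to back
// selected: indices of the experts chosen by the router for the current token
// x       : the single activation vector of length n_cols
// Returns selected.size() output vectors of length n_rows, stored back to back.
std::vector<float> moe_mat_vec(const std::vector<float>& weights,
                               const std::vector<int>& selected,
                               const std::vector<float>& x,
                               std::size_t n_rows, std::size_t n_cols) {
    std::vector<float> y(selected.size() * n_rows, 0.0f);
    for (std::size_t s = 0; s < selected.size(); ++s) {
        // Locate the weight matrix of the s-th selected expert.
        const float* w = weights.data()
                       + static_cast<std::size_t>(selected[s]) * n_rows * n_cols;
        for (std::size_t r = 0; r < n_rows; ++r) {
            // Dot product of row r with the activation vector.
            float acc = 0.0f;
            for (std::size_t c = 0; c < n_cols; ++c) {
                acc += w[r * n_cols + c] * x[c];
            }
            y[s * n_rows + r] = acc;
        }
    }
    return y;
}
```

A GPU version would presumably process all selected experts in one launch and read quantized blocks instead of `float` rows, but those details are not described here and are left out of the sketch.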