ik_llama.cpp/github-data/pull_requests/488-Faster CPU prompt processing for Trellis quants and MoE models.md at b685f9b4aafca252dd99ea011ffab65dfb7ad143

ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-28 18:32:04 +00:00

Thomas ab7d193fe0 Add GitHub data (#637 )

2025-07-22 18:18:40 +02:00

🔀 #488 - Faster CPU prompt processing for Trellis quants and MoE models

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ Closed |
| **Created** | 2025-06-03 |
| **Updated** | 2025-06-05 |

Description

This PR is a follow-up to #482 and applies the same dequantizing GEMM to MoE matrix multiplications.

For a DeepSeek-Lite model where only the `ffn_up` and `ffn_gate` tensors are quantized with `IQ2_KT`, I observe a ~35% improvement in prompt processing (PP) performance compared to the main branch.
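
The description does not spell out how the dequantizing GEMM works, so the following is only a minimal sketch of the general idea, assuming it means "decode the quantized weights to `float` once, then reuse them across the whole batch of prompt tokens". The `ToyBlock` format (32 `int8` values sharing one scale) and the names `dequantize_row` and `dequant_gemm` are hypothetical stand-ins; this is not the `IQ2_KT` trellis decoder and not the code from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy format: 32 int8 weights sharing one float scale.
// It only stands in for "a quant that is costly to decode"; it is NOT IQ2_KT.
constexpr std::size_t kBlock = 32;

struct ToyBlock {
    float scale;
    std::int8_t q[kBlock];
};

// Decode one weight row of n_cols values (n_cols is a multiple of kBlock).
void dequantize_row(const ToyBlock* row, std::size_t n_cols, float* out) {
    for (std::size_t b = 0; b < n_cols / kBlock; ++b) {
        for (std::size_t i = 0; i < kBlock; ++i) {
            out[b * kBlock + i] = row[b].scale * static_cast<float>(row[b].q[i]);
        }
    }
}

// Computes Y[t][r] = sum_c W[r][c] * X[t][c].
// W: n_rows x n_cols, stored as quantized blocks, row-major.
// X: n_tokens x n_cols activations, row-major.
// Y: n_tokens x n_rows outputs, row-major.
void dequant_gemm(const std::vector<ToyBlock>& W, std::size_t n_rows, std::size_t n_cols,
                  const std::vector<float>& X, std::size_t n_tokens,
                  std::vector<float>& Y) {
    assert(n_cols % kBlock == 0);
    const std::size_t blocks_per_row = n_cols / kBlock;
    assert(W.size() == n_rows * blocks_per_row);
    assert(X.size() == n_tokens * n_cols);

    Y.assign(n_tokens * n_rows, 0.0f);
    std::vector<float> w_row(n_cols);

    for (std::size_t r = 0; r < n_rows; ++r) {
        // Pay the decoding cost once per weight row ...
        dequantize_row(&W[r * blocks_per_row], n_cols, w_row.data());
        // ... and amortize it over every token in the batch.
        for (std::size_t t = 0; t < n_tokens; ++t) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < n_cols; ++c) {
                acc += w_row[c] * X[t * n_cols + c];
            }
            Y[t * n_rows + r] = acc;
        }
    }
}
```

In an MoE layer, a routine like this would presumably be called once per expert, with `X` holding only the tokens routed to that expert. The decoding cost is then shared by however many tokens each expert receives, which is why the benefit would show up in prompt processing (many tokens per batch) rather than in single-token generation.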
