
🔀 #488 - Faster CPU prompt processing for Trellis quants and MoE models

Author ikawrakow
State Closed
Created 2025-06-03
Updated 2025-06-05

Description

This PR is a follow-up to #482 and applies the same dequantizing GEMM to MoE matrix multiplications.
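The PR text does not show the kernel. The sketch below is a rough illustration only. It assumes that "dequantizing GEMM" means expanding quantized weights to float once and then running an ordinary float GEMM, applied separately to each expert. Everything in it is hypothetical and is not the ik_llama.cpp implementation: the `QuantRow` toy format (int8 values with one scale per row, not the IQ2_KT trellis encoding), the function names `dequantize`, `gemm_f32` and `moe_matmul`, and the top-1 routing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy format: int8 weights plus one float scale per row.
// This is NOT the IQ2_KT trellis encoding; it only stands in for "a quantized row".
struct QuantRow {
    std::vector<int8_t> q;
    float scale;
};
using QuantMatrix = std::vector<QuantRow>;  // n_rows rows, each of length k

// Step 1: expand a quantized matrix into a row-major float buffer (n_rows x k).
std::vector<float> dequantize(const QuantMatrix& W, std::size_t k) {
    std::vector<float> out(W.size() * k);
    for (std::size_t r = 0; r < W.size(); ++r)
        for (std::size_t j = 0; j < k; ++j)
            out[r * k + j] = W[r].scale * static_cast<float>(W[r].q[j]);
    return out;
}

// Step 2: plain float GEMM, Y[t][r] = sum_j X[t][j] * Wf[r][j].
void gemm_f32(const std::vector<float>& Wf, std::size_t n_rows, std::size_t k,
              const std::vector<float>& X, std::size_t n_tok, std::vector<float>& Y) {
    Y.assign(n_tok * n_rows, 0.0f);
    for (std::size_t t = 0; t < n_tok; ++t)
        for (std::size_t r = 0; r < n_rows; ++r) {
            float acc = 0.0f;
            for (std::size_t j = 0; j < k; ++j) acc += X[t * k + j] * Wf[r * k + j];
            Y[t * n_rows + r] = acc;
        }
}

// MoE matmul with top-1 routing (a simplification): each expert's weights are
// dequantized once and reused for every token routed to that expert.
// Assumes all experts have the same number of rows.
std::vector<float> moe_matmul(const std::vector<QuantMatrix>& experts,
                              const std::vector<int>& route,  // expert id per token
                              const std::vector<float>& X,    // n_tok x k activations
                              std::size_t k) {
    const std::size_t n_tok = route.size();
    assert(X.size() == n_tok * k);
    const std::size_t n_rows = experts.at(0).size();
    std::vector<float> out(n_tok * n_rows, 0.0f);
    for (std::size_t e = 0; e < experts.size(); ++e) {
        // Gather the activations of the tokens routed to expert e.
        std::vector<std::size_t> ids;
        std::vector<float> Xe;
        for (std::size_t t = 0; t < n_tok; ++t) {
            if (route[t] != static_cast<int>(e)) continue;
            ids.push_back(t);
            for (std::size_t j = 0; j < k; ++j) Xe.push_back(X[t * k + j]);
        }
        if (ids.empty()) continue;  // unused expert: skip its dequantization entirely
        const std::vector<float> Wf = dequantize(experts[e], k);
        std::vector<float> Ye;
        gemm_f32(Wf, n_rows, k, Xe, ids.size(), Ye);
        // Scatter the results back to the original token order.
        for (std::size_t i = 0; i < ids.size(); ++i)
            for (std::size_t r = 0; r < n_rows; ++r)
                out[ids[i] * n_rows + r] = Ye[i * n_rows + r];
    }
    return out;
}
```

For example, with two 1x2 experts that dequantize to [0.5, 1.0] and [6.0, -2.0], routing {0, 1, 0} and token rows [1, 1], [1, 1], [2, 0], `moe_matmul` returns {1.5, 4, 1}. Because each expert's weights are expanded once and then reused for every token routed to it, the dequantization cost is amortized over the batch. This is presumably why such an approach pays off in prompt processing, where many tokens hit each expert, rather than in single-token generation. A real kernel would likely dequantize in cache-sized tiles into a scratch buffer instead of materializing the whole matrix.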

For a DeepSeek-Lite model where only the ffn_up and ffn_gate tensors are quantized with IQ2_KT, I observe a ~35% improvement in prompt processing (PP) performance compared to the main branch.