ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 17:20:01 +00:00

Thomas eaa2510a28 Add GitHub data: filename sanitization (#640)

2025-07-23 13:31:53 +02:00

🔀 #488 - Faster CPU prompt processing for Trellis quants and MoE models

Author	`ikawrakow`
State	❌ Closed
Created	2025-06-03
Updated	2025-06-05

Description

This PR is a follow-up to #482 and applies the same dequantizing GEMM to MoE matrix multiplications.

For a DeepSeek-Lite model where only the ffn_up and ffn_gate tensors are quantized with IQ2_KT, I observe a ~35% improvement in prompt processing (PP) performance compared to the main branch.
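
As a rough illustration of the general idea (not the code in this PR), the sketch below shows a dequantizing GEMM applied per expert: tokens are grouped by the expert they are routed to, each tile of that expert's quantized weight rows is dequantized to floats once, and the float tile is then reused for every token in the group. Everything here is an assumption made for illustration: `ToyBlock` is a made-up int8-plus-scale format and not the IQ2_KT layout, `ExpertWeights`, `dequantize_row`, `moe_dequant_gemm` and `kTileRows` are hypothetical names, and routing is simplified to one expert per token.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy format: 32 int8 values sharing one float scale.
// It only stands in for a real quant type; it is NOT the IQ2_KT layout.
constexpr int kBlock    = 32;
constexpr int kTileRows = 16;  // weight rows dequantized per tile (arbitrary choice)

struct ToyBlock {
    float  scale;
    int8_t q[kBlock];
};

// One expert's weight matrix: n_out rows of n_in values, stored row-major as blocks.
struct ExpertWeights {
    int n_out, n_in;
    std::vector<ToyBlock> blocks;  // n_out * (n_in / kBlock) entries
};

// Expand one quantized weight row into n_in floats.
static void dequantize_row(const ExpertWeights& w, int row, float* dst) {
    const int nb = w.n_in / kBlock;
    const ToyBlock* b = w.blocks.data() + (size_t)row * nb;
    for (int ib = 0; ib < nb; ++ib)
        for (int j = 0; j < kBlock; ++j)
            dst[ib * kBlock + j] = b[ib].scale * b[ib].q[j];
}

// x: n_tokens x n_in activations (row-major); expert_of[t]: expert used by token t.
// Returns y: n_tokens x n_out. Assumes all experts share the same shape.
std::vector<float> moe_dequant_gemm(const std::vector<ExpertWeights>& experts,
                                    const std::vector<float>& x,
                                    const std::vector<int>& expert_of) {
    const int n_in     = experts[0].n_in;
    const int n_out    = experts[0].n_out;
    const int n_tokens = (int)expert_of.size();
    assert(n_in % kBlock == 0 && x.size() == (size_t)n_tokens * n_in);

    std::vector<float> y((size_t)n_tokens * n_out, 0.0f);
    std::vector<float> tile((size_t)kTileRows * n_in);

    for (int e = 0; e < (int)experts.size(); ++e) {
        // Gather the tokens routed to this expert.
        std::vector<int> rows;
        for (int t = 0; t < n_tokens; ++t)
            if (expert_of[t] == e) rows.push_back(t);
        if (rows.empty()) continue;

        for (int r0 = 0; r0 < n_out; r0 += kTileRows) {
            const int nr = std::min(kTileRows, n_out - r0);
            // Dequantize this tile once; the cost is shared by all tokens in the group.
            for (int r = 0; r < nr; ++r)
                dequantize_row(experts[e], r0 + r, &tile[(size_t)r * n_in]);
            // Plain float dot products against the dequantized tile.
            for (int t : rows) {
                const float* xt = &x[(size_t)t * n_in];
                for (int r = 0; r < nr; ++r) {
                    float sum = 0.0f;
                    for (int k = 0; k < n_in; ++k)
                        sum += tile[(size_t)r * n_in + k] * xt[k];
                    y[(size_t)t * n_out + r0 + r] = sum;
                }
            }
        }
    }
    return y;
}
```

The point of the structure is that the dequantization cost per weight row is paid once per expert rather than once per token, which only helps when many tokens are processed together, as in prompt processing. A real implementation would replace the scalar inner loops with an optimized GEMM kernel.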
