
ikawrakow/ik_llama.cpp


mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 09:09:50 +00:00


🔀 #488 - Faster CPU prompt processing for Trellis quants and MoE models

Author	`ikawrakow`
State	❌ Closed
Created	2025-06-03
Updated	2025-06-05

Description

This PR is a follow-up to #482 and applies the same dequantizing GEMM to MoE matrix multiplications.

For a DeepSeek-Lite model where only the ffn_up and ffn_gate tensors are quantized with IQ2_KT, I observe a ~35% improvement in prompt processing (PP) performance compared to the main branch.
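For readers unfamiliar with the idea, here is a minimal sketch of what a dequantizing GEMM for MoE might look like. It is not the code from this PR or from #482. The block format (`QBlock`, 32 int8 values sharing one float scale) is a hypothetical stand-in rather than IQ2_KT, all function names are made up for illustration, and top-1 routing is assumed for simplicity. The point it tries to show is that each weight row is dequantized once and then reused for every token in the batch, which is presumably why the approach helps prompt processing, where many tokens hit the same expert.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy format (NOT IQ2_KT): 32 int8 values sharing one float scale.
constexpr int kBlock = 32;
struct QBlock { float scale; int8_t q[kBlock]; };

// Row-major quantized matrix; n_cols must be a multiple of kBlock.
struct QMatrix { int n_rows, n_cols; std::vector<QBlock> blocks; };

QMatrix quantize(const std::vector<float>& w, int n_rows, int n_cols) {
    assert(n_cols % kBlock == 0);
    assert(w.size() == size_t(n_rows) * n_cols);
    QMatrix m{n_rows, n_cols, {}};
    m.blocks.resize(size_t(n_rows) * (n_cols / kBlock));
    for (size_t b = 0; b < m.blocks.size(); ++b) {
        const float* src = w.data() + b * kBlock;
        float amax = 0.0f;
        for (int i = 0; i < kBlock; ++i) amax = std::max(amax, std::fabs(src[i]));
        const float scale = amax / 127.0f;
        m.blocks[b].scale = scale;
        for (int i = 0; i < kBlock; ++i)
            m.blocks[b].q[i] = scale > 0.0f ? int8_t(std::lround(src[i] / scale)) : int8_t(0);
    }
    return m;
}

// Expand one quantized row into floats.
void dequantize_row(const QMatrix& m, int r, float* out) {
    const int nb = m.n_cols / kBlock;
    for (int j = 0; j < nb; ++j) {
        const QBlock& blk = m.blocks[size_t(r) * nb + j];
        for (int i = 0; i < kBlock; ++i) out[j * kBlock + i] = blk.scale * float(blk.q[i]);
    }
}

// y[t * n_rows + r] = dot(W[r, :], x[t, :]).
// Each row is dequantized once and reused for all n_tokens activations.
std::vector<float> dequant_gemm(const QMatrix& w, const std::vector<float>& x, int n_tokens) {
    assert(x.size() == size_t(n_tokens) * w.n_cols);
    std::vector<float> y(size_t(n_tokens) * w.n_rows, 0.0f);
    std::vector<float> row(w.n_cols);
    for (int r = 0; r < w.n_rows; ++r) {
        dequantize_row(w, r, row.data());
        for (int t = 0; t < n_tokens; ++t) {
            const float* xt = x.data() + size_t(t) * w.n_cols;
            float acc = 0.0f;
            for (int c = 0; c < w.n_cols; ++c) acc += row[c] * xt[c];
            y[size_t(t) * w.n_rows + r] = acc;
        }
    }
    return y;
}

// MoE variant (assumed top-1 routing): gather the tokens routed to each expert,
// run one dequantizing GEMM per expert, and scatter the results back.
std::vector<float> moe_dequant_gemm(const std::vector<QMatrix>& experts,
                                    const std::vector<float>& x,
                                    const std::vector<int>& expert_of) {
    assert(!experts.empty());
    const int n_rows = experts[0].n_rows, n_cols = experts[0].n_cols;
    const int n_tokens = int(expert_of.size());
    assert(x.size() == size_t(n_tokens) * n_cols);
    std::vector<float> y(size_t(n_tokens) * n_rows, 0.0f);
    for (size_t e = 0; e < experts.size(); ++e) {
        std::vector<int> ids;
        std::vector<float> xe;
        for (int t = 0; t < n_tokens; ++t) {
            if (expert_of[t] != int(e)) continue;
            ids.push_back(t);
            const float* xt = x.data() + size_t(t) * n_cols;
            xe.insert(xe.end(), xt, xt + n_cols);
        }
        if (ids.empty()) continue;
        const std::vector<float> ye = dequant_gemm(experts[e], xe, int(ids.size()));
        for (size_t i = 0; i < ids.size(); ++i)
            std::copy(ye.begin() + i * n_rows, ye.begin() + (i + 1) * n_rows,
                      y.begin() + size_t(ids[i]) * n_rows);
    }
    return y;
}
```

A real implementation would dequantize tiles of rows into a cache-friendly buffer and hand them to a SIMD GEMM kernel; the scalar loops here only illustrate the data flow.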
