
🔀 #6 - IQ4_K: SOTA 4-bit quantization

Author ikawrakow
State Closed
Created 2024-07-28
Updated 2024-07-28

Description

  • Same 4.5 bpw as Q4_K.
  • Significantly reduces the quantization error of LLaMA-3.1 (and also 3.0): e.g., 1.77% for IQ4_K vs 2.9% for Q4_K_S on LLaMA-3.1-8B, with quantization error defined as PPL(Q)/PPL(fp16) - 1
  • Non-linear quantization similar to IQ4_XS and IQ4_NL, with the following differences:
    • Blocks of 16 instead of blocks of 32
    • The non-linear values in each block of 16 can lie on the original non-linear grid or on a shifted grid. The choice is indicated by one bit per block, so we need 16 extra bits per super-block of 256 (see the dequantization sketch after this list)
    • So, we need 256 * 4 bits for the quants, 16 * 6 bits for the 6-bit block scales, 16 bits for the super-block float scale, and 16 bits for the shift bits: (1024 + 96 + 16 + 16) / 256 = 1152 / 256 = exactly 4.5 bpw (see the layout sketch after this list)
  • Performance is on par with Q4_K on AVX2 and CUDA, and slightly lower on ARM_NEON and Metal
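
A layout sketch for one super-block of 256 weights that matches the bit accounting above. The field names and the split of the 6-bit scales into low/high parts are assumptions in the style of llama.cpp quant structs, not the actual ik_llama.cpp definition:

```c
#include <stdint.h>

#define QK_K 256

typedef struct {
    uint16_t d;                  // super-block scale (fp16)         ->   16 bits
    uint16_t extra;              // 1 grid-shift bit per block of 16 ->   16 bits
    uint8_t  scales_h[QK_K/64];  // high 2 bits of 16 block scales   ->   32 bits
    uint8_t  scales_l[QK_K/32];  // low 4 bits of 16 block scales    ->   64 bits
    uint8_t  qs[QK_K/2];         // 256 4-bit quants, 2 per byte     -> 1024 bits
} block_iq4_k;                   // 1152 bits / 256 weights = 4.5 bpw

// 2 + 2 + 4 + 8 + 128 = 144 bytes per 256 weights, i.e. exactly 4.5 bpw
_Static_assert(sizeof(block_iq4_k) == 144, "IQ4_K super-block must be 144 bytes");
```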
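
And a minimal dequantization sketch for a single block of 16, showing how the per-block bit selects between the base grid and a shifted grid. The base table is the IQ4_NL lookup grid from llama.cpp; the uniform +4 shift, the offset of 32 on the 6-bit scale, and the simple low/high-nibble packing are illustrative assumptions rather than the exact ik_llama.cpp kernel:

```c
#include <stdint.h>

static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113
};

// d         : super-block scale, already converted from fp16 to float
// scale6    : 6-bit block scale (assumed stored with an offset of 32)
// shift_bit : grid-selection bit for this block of 16
// q         : 8 bytes holding the block's 16 4-bit quant indices
static void dequant_block16(float d, int scale6, int shift_bit,
                            const uint8_t * q, float * out) {
    const float dl    = d * (float)(scale6 - 32);  // signed block scale
    const int   shift = shift_bit ? 4 : 0;         // hypothetical grid shift
    for (int j = 0; j < 8; ++j) {
        out[2*j + 0] = dl * (float)(kvalues_iq4nl[q[j] & 0xf] + shift);
        out[2*j + 1] = dl * (float)(kvalues_iq4nl[q[j] >>  4] + shift);
    }
}
```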