📝 #196 - Refactor: remove usage of Q8_1 for activation quantization
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2025-02-09 |
| Updated | 2025-03-27 |
Description
Background Description
Some models can produce activations that fall outside the representable range of fp16. In that scenario, using Q8_1, which stores its block scale and block sum as fp16, to quantize the activations can be futile; see the discussion in #194.
Hence, it would be prudent to switch every quantization type that relies on Q8_1 for matrix multiplications to a different activation format.
Alternatively, one may replace the fp16 block scale and block sum in Q8_1 with bf16, which has the same 8-bit exponent, and hence the same dynamic range, as fp32.
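For illustration, a minimal C sketch of the failure mode and of the second option. The `block_q8_1` layout mirrors ggml's definition (fp16 scale `d`, fp16 block sum `d * sum(qs)`, 32 int8 quants); `block_q8_1_bf16`, `fp32_to_bf16`, and the concrete numbers are assumptions made up for this example, not code from this repository.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define QK8_1 32
#define FP16_MAX 65504.0f   /* largest finite fp16 value */

/* Simplified layout of ggml's block_q8_1: an fp16 scale and an
 * fp16 block sum in front of 32 signed 8-bit quants. */
typedef struct {
    uint16_t d;          /* fp16 bits: amax / 127           */
    uint16_t s;          /* fp16 bits: d * sum(qs)          */
    int8_t   qs[QK8_1];  /* quantized values in [-127, 127] */
} block_q8_1;

/* Hypothetical bf16 variant: bf16 keeps fp32's 8-bit exponent,
 * so its dynamic range matches fp32 (up to ~3.4e38). */
typedef struct {
    uint16_t d;          /* bf16 bits of the scale     */
    uint16_t s;          /* bf16 bits of the block sum */
    int8_t   qs[QK8_1];
} block_q8_1_bf16;

/* Truncating fp32 -> bf16 conversion: keep the top 16 bits. */
static inline uint16_t fp32_to_bf16(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (uint16_t)(u >> 16);
}

int main(void) {
    /* A block of activations just beyond the fp16 range. */
    float x[QK8_1];
    for (int i = 0; i < QK8_1; ++i) x[i] = 1.0e5f;

    float amax = 1.0e5f;
    float d    = amax / 127.0f;  /* ~787: still fits in fp16 */
    int   sum  = 0;
    for (int i = 0; i < QK8_1; ++i) sum += (int)roundf(x[i] / d);
    float s = d * (float)sum;    /* ~3.2e6: becomes inf in fp16 */

    printf("d = %g (fits fp16: %s)\n", d, d <= FP16_MAX ? "yes" : "no");
    printf("s = %g (fits fp16: %s)\n", s, s <= FP16_MAX ? "yes" : "no");
    printf("bf16 bits of s: 0x%04x (finite)\n", fp32_to_bf16(s));
    return 0;
}
```

Note that in this example the per-block scale still fits in fp16 while the block sum does not; storing both in bf16 trades mantissa precision for fp32's exponent range and keeps the values finite.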
Possible Refactor Approaches
No response