
🔀 #158 - Faster R4 legacy quants

Author ikawrakow
State Closed
Created 2024-12-22
Updated 2024-12-22

Description

It seems converting fp16 to fp32 is extremely slow on the Ryzen-5975WX CPU (or ggml's GGML_FP16_TO_FP32 is inadequate there), so it is better to convert the fp16 Q8_1_x4 block scales to fp32 up front using AVX2 intrinsics, store the result, and then use the pre-converted fp32 scales when performing the dot products. This PR does that on AVX2 for Q4_0_R4, Q5_0_R4, Q6_0_R4 and Q8_0_R4. There was no benefit on the Ryzen-7950X (Zen4), so the change is not applied there.
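A minimal sketch of the idea (not the PR's actual code; the function name and the layout of eight consecutive fp16 values per Q8_1_x4 block are assumptions for illustration). It uses the F16C/AVX2 `_mm256_cvtph_ps` intrinsic, so it needs `-mavx2 -mf16c` or equivalent:

```c
#include <immintrin.h>
#include <stdint.h>

// Sketch only: instead of converting each fp16 scale with the scalar
// GGML_FP16_TO_FP32 inside the dot-product kernel, convert all fp16
// scales of the Q8_1_x4 blocks once, store them as fp32, and reuse.
//
// Hypothetical layout assumption: each Q8_1_x4 block carries 8
// consecutive fp16 values (4 block scales d and 4 block sums s).
static inline void convert_block_scales(const uint16_t * f16, float * f32, int nblocks) {
    for (int ib = 0; ib < nblocks; ++ib) {
        // Load 8 fp16 values (128 bits) and widen them to 8 fp32 values
        // (256 bits) with a single vcvtph2ps instruction.
        __m128i h = _mm_loadu_si128((const __m128i *)(f16 + 8*ib));
        _mm256_storeu_ps(f32 + 8*ib, _mm256_cvtph_ps(h));
    }
    // The dot-product kernel then loads the pre-converted fp32 scales
    // directly, avoiding the per-element fp16->fp32 conversion cost.
}
```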

The table below compares PP-512 (prompt processing, 512 tokens) performance between the main branch and this PR for LLaMA-3.1-8B on the Ryzen-5975WX:

| Quant | t/s (main) | t/s (PR) | Speedup |
|---|---:|---:|---:|
| Q4_0_R4 | 251.00 ± 0.51 | 283.61 ± 0.50 | 1.130 |
| Q5_0_R4 | 236.33 ± 0.56 | 271.57 ± 0.52 | 1.149 |
| Q6_0_R4 | 231.53 ± 0.60 | 260.22 ± 0.53 | 1.124 |
| Q8_0_R4 | 234.40 ± 0.60 | 246.11 ± 0.54 | 1.050 |