Faster Q4_K_R4 and Q5_K_R4 on AVX2/Zen4 (#182)

* Slightly faster AVX2 implementation for q4_k_r4

* Even better AVX2 implementation for q4_k_r4

We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a
Ryzen-5975WX CPU, up from 291 t/s when I last measured
on 3c5f8722.
With FA and Q8_0 K-cache we get to 339.5 t/s.

* Fix llama-bench labels that I broke with #181

* Faster AVX2 implementation for q5_k_q4

We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU,
up from 273 t/s.

* Use AVX2 implementation of q4_k_r4 and q5_k_r4 also on Zen4

After the changes I made to AVX2, it ends up being slightly faster
compared to what I had for Zen4.

* Minor tweak

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit is contained in:
Kawrakow
2025-01-30 09:28:53 +02:00
committed by GitHub
parent 5bbe93c0c4
commit f7a4a0fd42
2 changed files with 68 additions and 255 deletions

View File

@@ -756,7 +756,7 @@ static std::vector<cmd_params_instance> get_cmd_params_instances(const cmd_param
continue;
}
cmd_params_instance instance = {
/* .test_kind = */ TEST_KIND_PP,
/* .test_kind = */ TEST_KIND_TG,
/* .model = */ m,
/* .n_prompt = */ 0,
/* .n_gen = */ n_gen,
@@ -784,7 +784,7 @@ static std::vector<cmd_params_instance> get_cmd_params_instances(const cmd_param
continue;
}
cmd_params_instance instance = {
/* .test_kind = */ TEST_KIND_PP,
/* .test_kind = */ TEST_KIND_PG,
/* .model = */ m,
/* .n_prompt = */ n_pg.first,
/* .n_gen = */ n_pg.second,