Q8_K_R8: Fastest quantized matrix multiplications (#141)

* q8_k_r8: fastest matrix multiplication known to humankind

We get PP-512(LLaMA-3.1-8B) = 370 t/s on a Ryzen-7950X!

* q8_k_r8: AVX2

I was worried that we don't have enough vector registers on
AVX2, but it looks like it handles it just fine. We get
PP-512(LLaMA-3.1-8B) = 354 t/s on a Ryzen-5975WX.
Slightly slower than the Zen4 version with double the threads,
but still a huge upgrade compared to Q8_0_R4.

* q8_k_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 159.2 t/s.
Compare this to the 128 t/s we have for Q8_0_R4.

* q8_k_r4: go to signed ints

Why?
* On AVX2 _mm256_maddubs_epi16() may overflow, so we need to
  stay within the signed int range and use _mm256_sign_epi8.
  Not yet tested on the AVX2 machine, but expect a major slowdown.
* It is almost 10% faster on ARM_NEON. Somehow the veorq_u8()
  needed to convert from unsigned to signed seems to be extremely
  slow on the M2-Max.
* We only lose ~0.5% in performance on Zen4 (there the exclusive
  or that we now use to convert from signed to unsigned seems to be
  much faster than on M2-Max).
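
The reasoning in the list above can be illustrated with a small scalar sketch. It is not the kernel code from this commit: the helper names (maddubs_lane, sign_lane, signed_pair_dot, to_unsigned) are hypothetical and model a single lane of the documented behavior of _mm256_maddubs_epi16 and _mm256_sign_epi8 in plain C. It also assumes quant values in [-127, 127]; with -128 on both sides the sign trick is no longer exact.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of ONE 16-bit output lane of _mm256_maddubs_epi16(a, b):
 * a0, a1 are unsigned bytes, b0, b1 are signed bytes; the two products
 * are summed and saturated to the int16 range. */
static int16_t maddubs_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}

/* Scalar model of ONE byte lane of _mm256_sign_epi8(a, b):
 * a if b > 0, 0 if b == 0, -a if b < 0 (-128 stays -128). */
static int8_t sign_lane(int8_t a, int8_t b) {
    if (b == 0) return 0;
    if (b > 0) return a;
    return a == INT8_MIN ? INT8_MIN : (int8_t)-a;
}

/* Signed x signed pair product via the sign trick:
 * |x| goes into the unsigned operand, the sign of x is moved onto y. */
static int16_t signed_pair_dot(int8_t x0, int8_t x1, int8_t y0, int8_t y1) {
    uint8_t ax0 = (uint8_t)sign_lane(x0, x0);
    uint8_t ax1 = (uint8_t)sign_lane(x1, x1);
    return maddubs_lane(ax0, ax1, sign_lane(y0, x0), sign_lane(y1, x1));
}

/* Signed -> unsigned conversion by flipping the top bit, i.e. x + 128.
 * This is the exclusive-or conversion, modeled for a single byte. */
static uint8_t to_unsigned(int8_t x) {
    return (uint8_t)((uint8_t)x ^ 0x80u);
}

/* Exhaustive check under the assumption that values lie in [-127, 127]:
 * the largest magnitude is 2 * 127 * 127 = 32258, which fits in int16. */
static void check_sign_trick(void) {
    for (int x = -127; x <= 127; ++x) {
        for (int y = -127; y <= 127; ++y) {
            int16_t got = signed_pair_dot((int8_t)x, (int8_t)x,
                                          (int8_t)y, (int8_t)y);
            assert(got == 2 * x * y);
        }
    }
}
```

With unsigned operands, maddubs_lane(255, 255, 127, 127) saturates to 32767 although the true sum is 64770, which is the overflow the first bullet refers to. With signed values, signed_pair_dot(127, 127, 127, 127) returns the exact 32258.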

* Shut up useless compiler warnings

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Kawrakow
2024-12-14 09:24:30 +01:00
committed by GitHub
parent 12f962dd24
commit 20758edcae
10 changed files with 301 additions and 7 deletions


@@ -408,6 +408,7 @@ extern "C" {
GGML_TYPE_IQ4_KSS = 146,
GGML_TYPE_Q8_K16 = 147,
GGML_TYPE_Q8_K32 = 148,
GGML_TYPE_Q8_KR8 = 149,
GGML_TYPE_Q4_0_R4 = 202,
GGML_TYPE_Q5_0_R4 = 206,
@@ -422,6 +423,7 @@ extern "C" {
GGML_TYPE_Q6_0_R4 = 233,
GGML_TYPE_IQ2_BN_R4 = 335,
GGML_TYPE_IQ4_K_R4 = 339,
GGML_TYPE_Q8_K_R8 = 399,
GGML_TYPE_COUNT,
};
@@ -494,6 +496,7 @@ extern "C" {
GGML_FTYPE_MOSTLY_Q6_0_R4 = 227, // except 1d tensors
GGML_FTYPE_MOSTLY_IQ2_BN_R4 = 329, // except 1d tensors
GGML_FTYPE_MOSTLY_IQ4_K_R4 = 332, // except 1d tensors
GGML_FTYPE_MOSTLY_Q8_K_R8 = 399, // except 1d tensors
};
// available tensor operations: