BF16_R16 - 16 interleaved bf16 rows (#142)

* Not working bf16_r4

* Adding bf16_r8

A small performance gain compared to plain bf16: 258 t/s vs 234 t/s.
I guess this is still suboptimal.

* bf16_rx: Very slightly faster by interleaving 16 rows

258 t/s -> 263 t/s

* Rename bf16_r4 to bf16_r16

We are interleaving 16 rows now.
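To illustrate the idea, here is a minimal sketch (not the actual ik_llama.cpp implementation; the function name and layout details are assumptions) of repacking a row-major bf16 matrix into an "R16" layout, where each column of a block of 16 consecutive rows is stored contiguously so a matmul kernel can load values from 16 rows with one SIMD load:

```c
#include <stdint.h>
#include <assert.h>

// bf16 values treated as opaque 16-bit payloads; repacking only moves bits.
typedef uint16_t bf16_t;

// Hypothetical repack: src is nrows x ncols, row-major.
// dst layout: [row_block][column][16 rows], i.e. the 16 values of one
// column within a row block become contiguous.
void repack_bf16_r16(const bf16_t *src, bf16_t *dst, int nrows, int ncols) {
    assert(nrows % 16 == 0);
    for (int rb = 0; rb < nrows / 16; ++rb) {      // block of 16 rows
        for (int c = 0; c < ncols; ++c) {          // column index
            for (int r = 0; r < 16; ++r) {         // row within block
                dst[(rb * ncols + c) * 16 + r] =
                    src[(rb * 16 + r) * ncols + c];
            }
        }
    }
}
```

With this layout, a dot-product kernel walking the columns of one row block reads 16 bf16 values per step from a single contiguous chunk, which is what makes processing 16 rows at a time cheap.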

* Cleanup unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Author: Kawrakow
Date: 2024-12-15 09:54:21 +01:00 (committed by GitHub)
Parent: e885c1e59b
Commit: e811de75e9
9 changed files with 175 additions and 2 deletions


@@ -191,6 +191,7 @@ extern "C" {
     LLAMA_FTYPE_MOSTLY_IQ4_NL_R4 = 225, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_IQ4_XS_R4 = 230, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_Q6_0_R4 = 335, // except 1d tensors
+    LLAMA_FTYPE_MOSTLY_BF16_R16 = 232, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_IQ2_BN_R4 = 337, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_IQ4_K_R4 = 340, // except 1d tensors
     LLAMA_FTYPE_MOSTLY_Q8_K_R8 = 399, // except 1d tensors