Fix q8_0 KV cache when not using FA - WIP (AVX2)
1. We add new types GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4, and use them to quantize activations for quants that use Q8_0 or Q8_1 as their vec_dot type.
2. We revert the changes to quantize_row_q8_0 and quantize_row_q8_1.
3. We use GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4 as the vec_dot type.
4. We change the FA implementation to use GGML_TYPE_Q8_0 rather than GGML_TYPE_Q8_0_X4 as the K and V types.
5. We change the expected type to GGML_TYPE_Q8_0_X4/GGML_TYPE_Q8_1_X4 in iqk_mul_mat.

Also added an optimization in ggml_compute_forward_mul_mat for the case ne12*ne13 > 1 (K*Q and V*softmax(K*Q)): we process GCD(ne12*ne13, nthread) heads simultaneously, using nthread/GCD(ne12*ne13, nthread) threads per head (see the sketch below). This results in a non-negligible performance gain for large contexts.

Question: why is it not allowed to use a quantized V-cache when not using FA?
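The head/thread split can be illustrated with a few lines of C. This is a standalone sketch of the arithmetic only, not the actual scheduling code in ggml_compute_forward_mul_mat; the head count and thread count are made-up example values.

#include <stdio.h>

/* Greatest common divisor (Euclid). */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

int main(void) {
    /* Hypothetical example: ne12*ne13 = 8 heads, nthread = 12.
       gcd(8, 12) = 4, so 4 heads are processed simultaneously,
       each by 12/4 = 3 threads; every thread group then walks
       the remaining heads with a stride of 4. */
    const int nheads = 8, nthread = 12;
    const int g = gcd(nheads, nthread);
    const int threads_per_head = nthread / g;
    for (int ith = 0; ith < nthread; ++ith) {
        const int first_head = ith / threads_per_head; /* which of the g simultaneous heads */
        const int slot       = ith % threads_per_head; /* this thread's share of one head   */
        printf("thread %2d: heads %d, %d, ... (stride %d), slot %d of %d\n",
               ith, first_head, first_head + g, g, slot, threads_per_head);
    }
    return 0;
}

With this split every thread stays busy whenever nthread and ne12*ne13 share a common factor, instead of parallelizing only within a single head at a time.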
@@ -396,6 +396,8 @@ extern "C" {
     //
     GGML_TYPE_I2_S = 36,
     //
+    GGML_TYPE_Q8_0_X4 = 98,
+    GGML_TYPE_Q8_1_X4 = 99,
     GGML_TYPE_Q6_0 = 133,
     GGML_TYPE_IQ1_BN = 134,
     GGML_TYPE_IQ2_BN = 135,
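The hunk above only adds the enum values; it does not show the in-memory layout of the new types. As a hedged sketch (the struct below is an assumption for illustration, not the actual ik_llama.cpp definition), an "X4" variant of Q8_0 would typically pack four consecutive Q8_0 blocks with their scales grouped together, so an AVX2 kernel can load all four scales at once before streaming the 128 int8 quants:

#include <stdint.h>

#define QK8_0 32
typedef uint16_t ggml_half;   /* fp16 storage, as in ggml */

/* Standard Q8_0 block: one fp16 scale + 32 int8 quants. */
typedef struct {
    ggml_half d;
    int8_t    qs[QK8_0];
} block_q8_0;

/* ASSUMED layout for GGML_TYPE_Q8_0_X4: four consecutive Q8_0 blocks
   with their scales grouped up front. This is illustrative only; the
   actual definition in ik_llama.cpp may differ. */
typedef struct {
    ggml_half d[4];            /* scales of the four sub-blocks    */
    int8_t    qs[4 * QK8_0];   /* 128 quantized values, contiguous */
} block_q8_0_x4;

/* Same bytes per 128 values as four plain Q8_0 blocks. */
_Static_assert(sizeof(block_q8_0_x4) == 4 * sizeof(block_q8_0),
               "x4 repacking must not change the storage size");

Grouping the scales this way avoids gathering fp16 scales from strided positions inside the vec_dot inner loop, which matches the commit's motivation for giving activations their own type on the Q8_0/Q8_1 dot-product paths.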