Fix q8_0 KV cache when not using FA - WIP (AVX2)
1. We add new types GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4, and use them to quantize activations for quants that use Q8_0 or Q8_1 as their vec_dot type.
2. We revert the changes to quantize_row_q8_0 and quantize_row_q8_1.
3. We use GGML_TYPE_Q8_0_X4 and GGML_TYPE_Q8_1_X4 as the vec_dot type.
4. We change the FA implementation to use GGML_TYPE_Q8_0 rather than GGML_TYPE_Q8_0_X4 as the K and V types.
5. We change the expected type to GGML_TYPE_Q8_0_X4/GGML_TYPE_Q8_1_X4 in iqk_mul_mat.

Also added an optimization in ggml_compute_forward_mul_mat for the case ne12*ne13 > 1 (K*Q and V*softmax(K*Q)): we process GCD(ne12*ne13, nthread) heads simultaneously, using nthread/GCD(ne12*ne13, nthread) threads per head (see the sketch below). This results in a non-negligible performance gain for large contexts.

Question: why is it not allowed to use a quantized V-cache when not using FA?
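The head/thread split can be illustrated with a few lines of C. This is a standalone sketch of the arithmetic only, not the actual scheduling code in ggml_compute_forward_mul_mat; the head count and thread count are made-up example values.

#include <stdio.h>

/* Greatest common divisor (Euclid). */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

int main(void) {
    /* Hypothetical example: ne12*ne13 = 8 heads, nthread = 12.
       gcd(8, 12) = 4, so 4 heads are processed simultaneously,
       each by 12/4 = 3 threads; every thread group then walks
       the remaining heads with a stride of 4. */
    const int nheads = 8, nthread = 12;
    const int g = gcd(nheads, nthread);
    const int threads_per_head = nthread / g;
    for (int ith = 0; ith < nthread; ++ith) {
        const int first_head = ith / threads_per_head; /* which of the g simultaneous heads */
        const int slot       = ith % threads_per_head; /* this thread's share of one head   */
        printf("thread %2d: heads %d, %d, ... (stride %d), slot %d of %d\n",
               ith, first_head, first_head + g, g, slot, threads_per_head);
    }
    return 0;
}

With this split every thread stays busy whenever nthread and ne12*ne13 share a common factor, instead of parallelizing only within a single head at a time.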
@@ -396,6 +396,8 @@ extern "C" {
     //
     GGML_TYPE_I2_S = 36,
     //
+    GGML_TYPE_Q8_0_X4 = 98,
+    GGML_TYPE_Q8_1_X4 = 99,
     GGML_TYPE_Q6_0 = 133,
     GGML_TYPE_IQ1_BN = 134,
     GGML_TYPE_IQ2_BN = 135,
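The hunk above only adds the enum values; it does not show the in-memory layout of the new types. As a hedged sketch (the struct below is an assumption for illustration, not the actual ik_llama.cpp definition), an "X4" variant of Q8_0 would typically pack four consecutive Q8_0 blocks with their scales grouped together, so an AVX2 kernel can load all four scales at once before streaming the 128 int8 quants:

#include <stdint.h>

#define QK8_0 32
typedef uint16_t ggml_half;   /* fp16 storage, as in ggml */

/* Standard Q8_0 block: one fp16 scale + 32 int8 quants. */
typedef struct {
    ggml_half d;
    int8_t    qs[QK8_0];
} block_q8_0;

/* ASSUMED layout for GGML_TYPE_Q8_0_X4: four consecutive Q8_0 blocks
   with their scales grouped up front. This is illustrative only; the
   actual definition in ik_llama.cpp may differ. */
typedef struct {
    ggml_half d[4];            /* scales of the four sub-blocks    */
    int8_t    qs[4 * QK8_0];   /* 128 quantized values, contiguous */
} block_q8_0_x4;

/* Same bytes per 128 values as four plain Q8_0 blocks. */
_Static_assert(sizeof(block_q8_0_x4) == 4 * sizeof(block_q8_0),
               "x4 repacking must not change the storage size");

Grouping the scales this way avoids gathering fp16 scales from strided positions inside the vec_dot inner loop, which matches the commit's motivation for giving activations their own type on the Q8_0/Q8_1 dot-product paths.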