🔀 #101 - Enable q6_0 in flash attention
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-10-21 |
| Updated | 2024-10-22 |
Description
As with IQ4_NL, this is limited to a head size of 128 for now. Without GGML_CUDA_FA_ALL_QUANTS set, only the Q6_0 + Q5_0 and Q8_0 + Q6_0 combinations are included. With this, the VRAM-poor have better options for selecting the best possible quantized KV-cache (as allowed by VRAM, model size, and context length) from the table below; a configuration sketch follows the table.
| K-cache | V-cache | BPV (bits per value) |
|---|---|---|
| Q4_0 | Q4_0 | 4.5 |
| IQ4_NL | IQ4_NL | 4.5 |
| Q6_0 | Q5_0 | 6.0 |
| Q8_0 | IQ4_NL | 6.5 |
| Q8_0 | Q6_0 | 7.5 |
| Q8_0 | Q8_0 | 8.5 |
| F16 | F16 | 16.0 |
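The BPV column appears to be the average of the two cache types' bits per weight: for example, Q6_0 at 6.5 bpw and Q5_0 at 5.5 bpw average to 6.0. Below is a minimal sketch, not part of the PR, of how such a KV-cache combination could be requested through the llama.cpp C API that the fork inherits; `GGML_TYPE_Q6_0` is assumed to be the fork-specific enum value for the new type, and the model path is a placeholder.

```cpp
// Minimal sketch: select a quantized KV-cache combination from the table above.
// The API calls and the other ggml type enums exist in mainline llama.cpp of this
// period; GGML_TYPE_Q6_0 is assumed to be provided by this fork.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // offload all layers; adjust to available VRAM

    // "model.gguf" is a placeholder path
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 8192;
    cparams.flash_attn = true;             // a quantized V-cache requires flash attention
    cparams.type_k     = GGML_TYPE_Q6_0;   // K-cache at 6.5 bpw (fork-specific type, assumption)
    cparams.type_v     = GGML_TYPE_Q5_0;   // V-cache at 5.5 bpw -> ~6.0 BPV on average

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (!ctx) { llama_free_model(model); return 1; }

    // ... tokenize, llama_decode(), and sample as usual ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```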
💬 Conversation
👤 Nexesenex commented on 2024-10-21 at 18:14:38:
Merged in my fork of Kobold CPP. K q6_0 V q5_0 works like a charm. I also activated 16/6, 6/iq4_nl, as well as 8/6 and 6/6; I'll test them tonight or tomorrow.
Thank you (very, very much) and congratulations for this, IK. I'm delighted to have these options and thus the best inference quality I can get right now. I'm going to release an updated version of my fork soon, with the proper credits of course, so that everyone interested and not too scared of downloading my patchwork can enjoy the fruit of your labors on these KV quants, just as some already enjoyed a bit more speed on CPU thanks to some of your commits that I was able to merge a few months ago!