🔀 #330 - Allow q8_0 KV cache for head size 256

Author ikawrakow
State Closed
Created 2025-04-15
Updated 2025-04-15

Description

Gemma models have a head size of 256. For whatever reason, the inherited CUDA FA code only allows an fp16 KV cache for this head size. This PR adds support for a Q8_0 KV cache with FA as well.
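With this change, a quantized KV cache can be combined with flash attention when running a Gemma model. A minimal sketch of such an invocation, assuming the standard llama.cpp-style flags (`-fa` for flash attention, `-ctk`/`-ctv` for the K/V cache types) and a hypothetical local model path:

```shell
# Hypothetical example: run a Gemma model (head size 256) with
# flash attention enabled and both K and V caches quantized to Q8_0.
# The model path is a placeholder; flag names follow llama.cpp conventions.
./llama-cli -m gemma-2-9b-it-Q4_K_M.gguf \
    -fa \
    -ctk q8_0 \
    -ctv q8_0 \
    -p "Hello"
```

Before this PR, the CUDA FA path would reject this combination for head size 256 and only an fp16 cache (`-ctk f16 -ctv f16`, the default) would work.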