### 🐛 [#499](https://github.com/ikawrakow/ik_llama.cpp/issues/499) - Bug: cache quantization crash with IQK_FORCE_BF16

| **Author** | `randoentity` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-06-06 |
| **Updated** | 2025-06-07 |

---

#### Description

### What happened?

Building with `-DGGML_CUDA_IQK_FORCE_BF16=1` and running with `--cache-type-k q8_0` triggers the assertion failure shown below.
Disabling either one avoids the error.
`--cache-type-v` does not seem to have any effect for this model.

```sh
cmake -B ./${BUILD_DIR} -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_SCHED_MAX_COPIES=1 -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA_IQK_FORCE_BF16=1  -DGGML_BLAS=OFF
```

```sh
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=2,0,1 ./build_bf16/bin/llama-sweep-bench \
--attention-max-batch 64 \
--batch-size 4096 \
--ubatch-size 4096 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--ctx-size 32768 \
--flash-attn \
--fused-moe \
--mla-use 3 \
--model /mnt/x/models/ubergarm/dsr1-0528-iq1-s4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf \
--n-gpu-layers 99 \
--override-tensor "blk\.(16|17|18|19|20|21|22|23|24)\.ffn_.*=CUDA1" \
--override-tensor "blk\.(3|4|5|6)\.ffn_.*=CUDA0" \
--override-tensor "blk\.(7|8|9|10|11|12|13|14|15)\.ffn_.*=CUDA2" \
--override-tensor exps=CPU,attn_kv_b=CPU \
--tensor-split 100,1,1 \
--threads 6 \
--threads-batch 12 \
--min_p 0.01 \
--temp 0.6 \
--top_p 0.95 \
--warmup-batch
```

### Name and Version

build_bf16/bin/llama-sweep-bench --version
version: 3730 (ffd87f28)
built with cc (Gentoo 14.2.1_p20241221 p7) 14.2.1 20241221 for x86_64-pc-linux-gnu

### What operating system are you seeing the problem on?

Linux

### Relevant log output

```shell
main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 6, n_threads_batch = 12

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
/mnt/x/ik_llama.cpp/ggml/src/ggml-cuda.cu:1286: GGML_ASSERT(to_bf16_cuda != nullptr) failed
[New LWP 8409]
[New LWP 8408]
[New LWP 8407]
[New LWP 8406]
[New LWP 8332]
[New LWP 8331]
[New LWP 7938]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
0x00007fae703158a7 in wait4 () from /usr/lib64/libc.so.6
#0  0x00007fae703158a7 in wait4 () from /usr/lib64/libc.so.6
#1  0x0000564ac2e60592 in ggml_abort ()
#2  0x0000564ac2f166fa in ggml_cuda_op_mul_mat_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*) ()
#3  0x0000564ac2f09c8b in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.0] ()
#4  0x0000564ac2f1e58b in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#5  0x0000564ac2ebce93 in ggml_backend_sched_compute_splits ()
#6  0x0000564ac2d79e5a in llama_decode ()
#7  0x0000564ac2cd6920 in main::{lambda(llama_context*, llama_batch&, int)#1}::operator()(llama_context*, llama_batch&, int) const [clone .isra.0] ()
#8  0x0000564ac2c792ec in main ()
[Inferior 1 (process 7937) detached]
```
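
The failed `GGML_ASSERT(to_bf16_cuda != nullptr)` inside `ggml_cuda_op_mul_mat_cublas` suggests that the forced-BF16 path looked up a "convert to BF16" routine for one operand's type and got nothing back. The following is only a rough sketch of that failure pattern, not the actual ggml code: `SketchType`, `get_to_bf16`, `can_use_bf16_path`, and `bf16_matmul_sketch` are hypothetical names, and the assumption that a quantized cache type such as `q8_0` is the one lacking a converter is inferred from the reproduction steps, not confirmed by the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for tensor types; these are NOT the real ggml enums.
enum class SketchType { F16, BF16, Q8_0, Q4_0 };

// Signature of a "convert n elements to BF16" routine (illustrative only).
using to_bf16_fn = void (*)(const void * src, std::uint16_t * dst, std::size_t n);

// Placeholder converter body; a real one would launch a CUDA kernel.
static void convert_f16_to_bf16(const void *, std::uint16_t *, std::size_t) {}

// Returns a converter for types that have one and nullptr otherwise.
// ASSUMPTION: only F16 has a converter in this sketch; quantized types do not.
to_bf16_fn get_to_bf16(SketchType type) {
    switch (type) {
        case SketchType::F16: return convert_f16_to_bf16;
        default:              return nullptr;
    }
}

// A guard the caller could consult before committing to the BF16 path.
bool can_use_bf16_path(SketchType type) {
    return get_to_bf16(type) != nullptr;
}

// Mirrors the crashing pattern: look up the converter and assert on it.
// With a type that has no converter, the assert aborts the process.
void bf16_matmul_sketch(SketchType type, const void * src, std::uint16_t * dst, std::size_t n) {
    to_bf16_fn to_bf16 = get_to_bf16(type);
    assert(to_bf16 != nullptr);
    to_bf16(src, dst, n);
}
```

In this sketch, calling `bf16_matmul_sketch(SketchType::Q8_0, ...)` would abort in the same way as the log above, while checking `can_use_bf16_path` first would let a caller fall back to a different path instead.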

---

#### 💬 Conversation

👤 **Thireus** commented the **2025-06-06** at **15:04:29**:<br>

I can confirm the same issue occurs with `q4_0` as well.

---

👤 **ikawrakow** commented the **2025-06-06** at **16:32:03**:<br>

Does #501 fix it?