mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-23 16:09:18 +00:00
3.6 KiB
🐛 #499 - Bug: cache quantization crash with IQK_FORCE_BF16
| Author | randoentity |
|---|---|
| State | ❌ Closed |
| Created | 2025-06-06 |
| Updated | 2025-06-07 |
### Description

#### What happened?
Building with `-DGGML_CUDA_IQK_FORCE_BF16=1` in combination with `--cache-type-k q8_0` results in the error below.
Turning either one off avoids the error.
`--cache-type-v` doesn't seem to have any effect for this model.
```shell
cmake -B ./${BUILD_DIR} -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_SCHED_MAX_COPIES=1 -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_BLAS=OFF
```
```shell
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=2,0,1 ./build_bf16/bin/llama-sweep-bench \
  --attention-max-batch 64 \
  --batch-size 4096 \
  --ubatch-size 4096 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 32768 \
  --flash-attn \
  --fused-moe \
  --mla-use 3 \
  --model /mnt/x/models/ubergarm/dsr1-0528-iq1-s4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf \
  --n-gpu-layers 99 \
  --override-tensor "blk\.(16|17|18|19|20|21|22|23|24)\.ffn_.*=CUDA1" \
  --override-tensor "blk\.(3|4|5|6)\.ffn_.*=CUDA0" \
  --override-tensor "blk\.(7|8|9|10|11|12|13|14|15)\.ffn_.*=CUDA2" \
  --override-tensor exps=CPU,attn_kv_b=CPU \
  --tensor-split 100,1,1 \
  --threads 6 \
  --threads-batch 12 \
  --min_p 0.01 \
  --temp 0.6 \
  --top_p 0.95 \
  --warmup-batch
```
#### Name and Version

```shell
$ build_bf16/bin/llama-sweep-bench --version
version: 3730 (ffd87f28)
built with cc (Gentoo 14.2.1_p20241221 p7) 14.2.1 20241221 for x86_64-pc-linux-gnu
```
#### What operating system are you seeing the problem on?
Linux
#### Relevant log output
```
main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 6, n_threads_batch = 12

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|

/mnt/x/ik_llama.cpp/ggml/src/ggml-cuda.cu:1286: GGML_ASSERT(to_bf16_cuda != nullptr) failed
[New LWP 8409]
[New LWP 8408]
[New LWP 8407]
[New LWP 8406]
[New LWP 8332]
[New LWP 8331]
[New LWP 7938]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
0x00007fae703158a7 in wait4 () from /usr/lib64/libc.so.6
#0  0x00007fae703158a7 in wait4 () from /usr/lib64/libc.so.6
#1  0x0000564ac2e60592 in ggml_abort ()
#2  0x0000564ac2f166fa in ggml_cuda_op_mul_mat_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*) ()
#3  0x0000564ac2f09c8b in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.0] ()
#4  0x0000564ac2f1e58b in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#5  0x0000564ac2ebce93 in ggml_backend_sched_compute_splits ()
#6  0x0000564ac2d79e5a in llama_decode ()
#7  0x0000564ac2cd6920 in main::{lambda(llama_context*, llama_batch&, int)#1}::operator()(llama_context*, llama_batch&, int) const [clone .isra.0] ()
[Inferior 1 (process 7937) detached]
```
### 💬 Conversation
👤 **Thireus** commented on 2025-06-06 at 15:04:29:
I can confirm the same issue occurs on q4_0 as well.
👤 **ikawrakow** commented on 2025-06-06 at 16:32:03:
Does #501 fix it?