
🐛 #416 - Fix SER (CUDA)

Author ikawrakow
State Closed
Created 2025-05-13
Updated 2025-05-14

Description

Follow-up to #415. This should fix the SER issues on CUDA.


💬 Conversation

👤 ubergarm commented on 2025-05-13 at 15:30:55:

Interestingly, I recompiled main with CUDA (after you merged #415 into main) and can no longer reproduce the error.

FWIW, this command works both with and without this PR:

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
    --model /mnt/raid/hf/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    --alias ubergarm/DeepSeek-R1-IQ2_K_R4 \
    --ctx-size 131072 \
    -ctk f16 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    -ser 6,1 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080

I don't have enough VRAM to fully offload any R1/V3 models, so I'm not sure how best to test this other than fully offloading V2-Lite, which you have probably already done.
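
A full-offload test with V2-Lite could look roughly like the sketch below (untested; the model path, quant, and context size are placeholders, and the flags are simply borrowed from the commands in this thread):

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
    --model /path/to/DeepSeek-V2-Lite-IQ4_K.gguf \
    --ctx-size 32768 \
    -ctk f16 \
    -mla 3 -fa \
    -fmoe \
    -ser 6,1 \
    --n-gpu-layers 100 \
    --threads 24 \
    --host 127.0.0.1 \
    --port 8080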


👤 ikawrakow commented on 2025-05-13 at 15:43:01:

On CUDA it is more difficult to trigger the bug. I used Qwen3-30B-A3B quantized with IQ5_K. I only have a 16 GB GPU, so I had to leave the last 19 layers of experts on the CPU. I ran llama-cli like this:

./bin/llama-cli -m ../ncuda/junk.bin -t 16 -ngl 100 -c 20000 -cnv -p " " -rtr -fa -s 1234 -ot "blk\.29\.ffn=CPU,blk\.[3-4][0-9]\.ffn=CPU" -ser 6,1

and prompted with

Encoded text:\noyfjdnisdr rtqwainr acxz mynzbhhx\nDecoded text:\nThink step by step\n\nEncoded text:\nsudlcg jncgpxoydflx ky lraebdtvlxmy nzbnkyaibh ttemgsdfqu gkdx pvsunvaauyacairrlxyy\nDecoded text:\n<think>

(and I guess the same can be done with the server).

The thinking goes well for a while, but eventually it starts spitting out GGGGG. The PR fixes that.

Interestingly enough, after the fix it does solve the puzzle with -ser 6,1, but fails with -ser 7,1.

I don't think partial offload is required, and the bug will likely trigger more quickly if all layers are on the GPU. I found it easier to debug with a "thinking" model because little interaction is needed to make the model generate many tokens one by one.
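
A fully offloaded variant of the repro above (for a GPU with enough VRAM) would simply drop the -ot override, e.g. (untested sketch, everything else unchanged):

./bin/llama-cli -m ../ncuda/junk.bin -t 16 -ngl 100 -c 20000 -cnv -p " " -rtr -fa -s 1234 -ser 6,1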


👤 ikawrakow commented on 2025-05-13 at 15:57:54:

Oops, it is still failing with DeepSeek-Lite. Converting to draft.