### 🐛 [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398) - Bug: -fmoe causing illegal memory access
| **Author** | `pt13762104` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-05-08 |
| **Updated** | 2025-05-23 |
---
#### Description
### What happened?
It seems that when I use Qwen3-30B-A3B with `-fmoe`, an "illegal memory access" error always occurs after a short period of time. Without `-fmoe`, it works fine.
I'm not sure if this is GPU-related.
### Name and Version
version: 3673 (4084ca73)
built with gcc-14 (Homebrew GCC 14.2.0_1) 14.2.0 for x86_64-pc-linux-gnu
### What operating system are you seeing the problem on?
Linux
### Relevant log output
```shell
INFO [ main] build info | tid="133287468544000" timestamp=1746695902 build=3673 commit="4084ca73"
INFO [ main] system info | tid="133287468544000" timestamp=1746695902 n_threads=2 n_threads_batch=-1 total_threads=4 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from /root/Qwen3-30B-A3B-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B
llama_model_loader: - kv 3: general.basename str = Qwen3-30B-A3B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 30B-A3B
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: qwen3moe.block_count u32 = 48
llama_model_loader: - kv 8: qwen3moe.context_length u32 = 40960
llama_model_loader: - kv 9: qwen3moe.embedding_length u32 = 2048
llama_model_loader: - kv 10: qwen3moe.feed_forward_length u32 = 6144
llama_model_loader: - kv 11: qwen3moe.attention.head_count u32 = 32
llama_model_loader: - kv 12: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 16: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 17: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 18: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 19: qwen3moe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: general.file_type u32 = 15
llama_model_loader: - kv 31: quantize.imatrix.file str = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 32: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-30B-A3B.txt
llama_model_loader: - kv 33: quantize.imatrix.entries_count i32 = 384
llama_model_loader: - kv 34: quantize.imatrix.chunks_count i32 = 32
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 290 tensors
llama_model_loader: - type q5_K: 37 tensors
llama_model_loader: - type q6_K: 11 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen3moe
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 40960
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 6144
llm_load_print_meta: n_expert = 128
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 40960
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 30.532 B
llm_load_print_meta: model size = 16.493 GiB (4.640 BPW)
llm_load_print_meta: repeating layers = 16.093 GiB (4.622 BPW, 29.910 B parameters)
llm_load_print_meta: general.name = Qwen3-30B-A3B
llm_load_print_meta: BOS token = 11 ','
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151654 '<|vision_pad|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp = 768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
Device 1: Tesla T4, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.76 MiB
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: CPU buffer size = 166.92 MiB
llm_load_tensors: CUDA0 buffer size = 8509.23 MiB
llm_load_tensors: CUDA1 buffer size = 8213.14 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1600.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 1472.00 MiB
llama_new_context_with_model: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 368.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 444.77 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 260.02 MiB
llama_new_context_with_model: graph nodes = 1878
llama_new_context_with_model: graph splits = 3
INFO [ init] initializing slots | tid="133287468544000" timestamp=1746695910 n_slots=1
INFO [ init] new slot | tid="133287468544000" timestamp=1746695910 id_slot=0 n_ctx_slot=32768
INFO [ main] model loaded | tid="133287468544000" timestamp=1746695910
INFO [ main] chat template | tid="133287468544000" timestamp=1746695910 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [ main] HTTP server listening | tid="133287468544000" timestamp=1746695910 n_threads_http="3" port="8080" hostname="127.0.0.1"
INFO [ update_slots] all slots are idle | tid="133287468544000" timestamp=1746695910
INFO [ launch_slot_with_task] slot is processing task | tid="133287468544000" timestamp=1746695926 id_slot=0 id_task=0
INFO [ update_slots] kv cache rm [p0, end) | tid="133287468544000" timestamp=1746695926 id_slot=0 id_task=0 p0=0
INFO [ print_timings] prompt eval time = 1428.08 ms / 756 tokens ( 1.89 ms per token, 529.38 tokens per second) | tid="133287468544000" timestamp=1746695972 id_slot=0 id_task=0 t_prompt_processing=1428.075 n_prompt_tokens_processed=756 t_token=1.8889880952380953 n_tokens_second=529.383960926422
INFO [ print_timings] generation eval time = 44081.50 ms / 2038 runs ( 21.63 ms per token, 46.23 tokens per second) | tid="133287468544000" timestamp=1746695972 id_slot=0 id_task=0 t_token_generation=44081.501 n_decoded=2038 t_token=21.629784592737977 n_tokens_second=46.23254548432914
INFO [ print_timings] total time = 45509.58 ms | tid="133287468544000" timestamp=1746695972 id_slot=0 id_task=0 t_prompt_processing=1428.075 t_token_generation=44081.501 t_total=45509.575999999994
INFO [ update_slots] slot released | tid="133287468544000" timestamp=1746695972 id_slot=0 id_task=0 n_ctx=32768 n_past=2793 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [ update_slots] all slots are idle | tid="133287468544000" timestamp=1746695972
INFO [ log_server_request] request | tid="133286382788608" timestamp=1746695972 remote_addr="127.0.0.1" remote_port=51948 status=200 method="POST" path="/chat/completions" params={}
INFO [ update_slots] all slots are idle | tid="133287468544000" timestamp=1746695972
INFO [ launch_slot_with_task] slot is processing task | tid="133287468544000" timestamp=1746695989 id_slot=0 id_task=2040
INFO [ update_slots] kv cache rm [p0, end) | tid="133287468544000" timestamp=1746695989 id_slot=0 id_task=2040 p0=0
INFO [ print_timings] prompt eval time = 2259.97 ms / 1480 tokens ( 1.53 ms per token, 654.88 tokens per second) | tid="133287468544000" timestamp=1746696002 id_slot=0 id_task=2040 t_prompt_processing=2259.965 n_prompt_tokens_processed=1480 t_token=1.5270033783783785 n_tokens_second=654.8773985437828
INFO [ print_timings] generation eval time = 10276.92 ms / 407 runs ( 25.25 ms per token, 39.60 tokens per second) | tid="133287468544000" timestamp=1746696002 id_slot=0 id_task=2040 t_token_generation=10276.922 n_decoded=407 t_token=25.250422604422607 n_tokens_second=39.603297563219805
INFO [ print_timings] total time = 12536.89 ms | tid="133287468544000" timestamp=1746696002 id_slot=0 id_task=2040 t_prompt_processing=2259.965 t_token_generation=10276.922 t_total=12536.887
INFO [ update_slots] slot released | tid="133287468544000" timestamp=1746696002 id_slot=0 id_task=2040 n_ctx=32768 n_past=1886 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [ update_slots] all slots are idle | tid="133287468544000" timestamp=1746696002
INFO [ log_server_request] request | tid="133286374395904" timestamp=1746696002 remote_addr="127.0.0.1" remote_port=36728 status=200 method="POST" path="/chat/completions" params={}
INFO [ update_slots] all slots are idle | tid="133287468544000" timestamp=1746696002
INFO [ launch_slot_with_task] slot is processing task | tid="133287468544000" timestamp=1746696077 id_slot=0 id_task=2449
INFO [ update_slots] kv cache rm [p0, end) | tid="133287468544000" timestamp=1746696077 id_slot=0 id_task=2449 p0=0
CUDA error: an illegal memory access was encountered
current device: 1, in function ggml_cuda_up_gate_unary at /kaggle/working/ik_llama.cpp/ggml/src/ggml-cuda.cu:2555
cudaMemcpyAsync(ids_host.data(), ids_dev, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)
/kaggle/working/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
```
---
#### 💬 Conversation
👤 **ikawrakow** commented the **2025-05-08** at **11:11:23**:
Can you add the command line you used? Thanks.
---
👤 **pt13762104** commented the **2025-05-08** at **14:15:50**:
`ik_llama.cpp/build/bin/llama-server -m /root/Qwen3-30B-A3B-UD-Q4_K_XL.gguf -c 32768 -fmoe -fa -ngl 99`
It starts doing this within 2-3 prompts. Maybe it's related to the fact that the T4 doesn't have BF16 capability?
---
👤 **ikawrakow** commented the **2025-05-08** at **14:42:29**:
It is more likely due to a bug that shows up in a multi-GPU setup that I cannot debug because I only have a single GPU.
I have a single 16 GB GPU and run Qwen3-30B-A3B with pretty good performance, using tensor overrides to keep part of the layers on the CPU. For instance,
```
./bin/llama-server -m model -t 16 -ngl 100 -fa -fmoe -rtr -c 32768 -ot "blk\.[3-4][0-9]\.ffn=CPU"
```
With my Ryzen-7950X CPU the above gives me better performance (~60 t/s) than uploading 35 layers to the GPU (~40 t/s).
If you are up for experimenting, you could try something like the above to run on a single GPU. If that works, it would confirm an issue with `-fmoe` on multiple GPUs. You need to use
```
-ot "blk\.[3-4][0-9]\.ffn=CPU,.*=CUDA0"
```
to put the first 30 layers on the first GPU and everything else on the CPU.
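
To preview which tensors an override string like the one above would send where, here is a minimal Python sketch. It is not the project's implementation. It assumes the `-ot` value is a comma-separated list of `regex=BUFFER` rules, that the rules are tried in order with the first regex found anywhere in the tensor name winning, and that Python's `re` behaves like the regex engine the server uses for these simple patterns. The helper names `parse_overrides` and `place_tensor` are hypothetical. Under those assumptions, only the `ffn*` tensors of blocks 30-47 go to the CPU, and everything else falls through to `CUDA0`.

```python
import re

def parse_overrides(spec):
    """Split 'regex=BUFFER,regex=BUFFER' into ordered (compiled regex, buffer) pairs.

    Hypothetical helper: assumes rules are comma-separated and that the
    buffer name follows the last '=' of each rule.
    """
    rules = []
    for item in spec.split(","):
        pattern, _, buffer_name = item.rpartition("=")
        rules.append((re.compile(pattern), buffer_name))
    return rules

def place_tensor(name, rules, default="default"):
    """Return the buffer of the first rule whose regex is found in the tensor name."""
    for pattern, buffer_name in rules:
        if pattern.search(name):
            return buffer_name
    return default

rules = parse_overrides(r"blk\.[3-4][0-9]\.ffn=CPU,.*=CUDA0")

# Tensor names taken from the load logs in this thread, with other block numbers substituted.
for name in [
    "token_embd.weight",
    "blk.3.ffn_up_exps.weight",
    "blk.29.ffn_up_exps.weight",
    "blk.30.attn_q.weight",
    "blk.30.ffn_up_exps.weight",
    "blk.47.ffn_down_exps.weight",
]:
    print(f"{name} -> {place_tensor(name, rules)}")
```

Checking a pattern this way before launching the server can help catch a regex that matches more (or fewer) blocks than intended, such as `[3-4][0-9]` not matching the single-digit block `blk.3`.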
---
👤 **pt13762104** commented the **2025-05-09** at **01:35:39**:
I can't even try this:
```
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen3moe
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 40960
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 6144
llm_load_print_meta: n_expert = 128
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 40960
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 30.532 B
llm_load_print_meta: model size = 16.493 GiB (4.640 BPW)
llm_load_print_meta: repeating layers = 16.093 GiB (4.622 BPW, 29.910 B parameters)
llm_load_print_meta: general.name = Qwen3-30B-A3B
llm_load_print_meta: BOS token = 11 ','
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151654 '<|vision_pad|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp = 768
llm_load_tensors: ggml ctx size = 0.76 MiB
Tensor token_embd.weight buffer type overriden to CUDA0
Tensor output_norm.weight buffer type overriden to CUDA0
Tensor output.weight buffer type overriden to CUDA0
Tensor blk.0.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.0.attn_q.weight buffer type overriden to CUDA0
Tensor blk.0.attn_k.weight buffer type overriden to CUDA0
Tensor blk.0.attn_v.weight buffer type overriden to CUDA0
Tensor blk.0.attn_output.weight buffer type overriden to CUDA0
Tensor blk.0.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.0.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.1.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.1.attn_q.weight buffer type overriden to CUDA0
Tensor blk.1.attn_k.weight buffer type overriden to CUDA0
Tensor blk.1.attn_v.weight buffer type overriden to CUDA0
Tensor blk.1.attn_output.weight buffer type overriden to CUDA0
Tensor blk.1.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.1.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.2.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.2.attn_q.weight buffer type overriden to CUDA0
Tensor blk.2.attn_k.weight buffer type overriden to CUDA0
Tensor blk.2.attn_v.weight buffer type overriden to CUDA0
Tensor blk.2.attn_output.weight buffer type overriden to CUDA0
Tensor blk.2.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.2.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.3.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.attn_q.weight buffer type overriden to CUDA0
Tensor blk.3.attn_k.weight buffer type overriden to CUDA0
Tensor blk.3.attn_v.weight buffer type overriden to CUDA0
Tensor blk.3.attn_output.weight buffer type overriden to CUDA0
Tensor blk.3.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.3.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.4.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.4.attn_q.weight buffer type overriden to CUDA0
Tensor blk.4.attn_k.weight buffer type overriden to CUDA0
Tensor blk.4.attn_v.weight buffer type overriden to CUDA0
Tensor blk.4.attn_output.weight buffer type overriden to CUDA0
Tensor blk.4.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.4.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.5.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.5.attn_q.weight buffer type overriden to CUDA0
Tensor blk.5.attn_k.weight buffer type overriden to CUDA0
Tensor blk.5.attn_v.weight buffer type overriden to CUDA0
Tensor blk.5.attn_output.weight buffer type overriden to CUDA0
Tensor blk.5.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.5.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.6.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.6.attn_q.weight buffer type overriden to CUDA0
Tensor blk.6.attn_k.weight buffer type overriden to CUDA0
Tensor blk.6.attn_v.weight buffer type overriden to CUDA0
Tensor blk.6.attn_output.weight buffer type overriden to CUDA0
Tensor blk.6.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.6.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.7.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.7.attn_q.weight buffer type overriden to CUDA0
Tensor blk.7.attn_k.weight buffer type overriden to CUDA0
Tensor blk.7.attn_v.weight buffer type overriden to CUDA0
Tensor blk.7.attn_output.weight buffer type overriden to CUDA0
Tensor blk.7.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.7.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.8.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.8.attn_q.weight buffer type overriden to CUDA0
Tensor blk.8.attn_k.weight buffer type overriden to CUDA0
Tensor blk.8.attn_v.weight buffer type overriden to CUDA0
Tensor blk.8.attn_output.weight buffer type overriden to CUDA0
Tensor blk.8.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.8.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.9.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.9.attn_q.weight buffer type overriden to CUDA0
Tensor blk.9.attn_k.weight buffer type overriden to CUDA0
Tensor blk.9.attn_v.weight buffer type overriden to CUDA0
Tensor blk.9.attn_output.weight buffer type overriden to CUDA0
Tensor blk.9.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.9.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.10.attn_norm.weight buffer type overriden to CPU
Tensor blk.10.attn_q.weight buffer type overriden to CPU
Tensor blk.10.attn_k.weight buffer type overriden to CPU
Tensor blk.10.attn_v.weight buffer type overriden to CPU
Tensor blk.10.attn_output.weight buffer type overriden to CPU
Tensor blk.10.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.10.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.10.ffn_norm.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.attn_norm.weight buffer type overriden to CPU
Tensor blk.11.attn_q.weight buffer type overriden to CPU
Tensor blk.11.attn_k.weight buffer type overriden to CPU
Tensor blk.11.attn_v.weight buffer type overriden to CPU
Tensor blk.11.attn_output.weight buffer type overriden to CPU
Tensor blk.11.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.11.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.11.ffn_norm.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.attn_norm.weight buffer type overriden to CPU
Tensor blk.12.attn_q.weight buffer type overriden to CPU
Tensor blk.12.attn_k.weight buffer type overriden to CPU
Tensor blk.12.attn_v.weight buffer type overriden to CPU
Tensor blk.12.attn_output.weight buffer type overriden to CPU
Tensor blk.12.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.12.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.12.ffn_norm.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.attn_norm.weight buffer type overriden to CPU
Tensor blk.13.attn_q.weight buffer type overriden to CPU
Tensor blk.13.attn_k.weight buffer type overriden to CPU
Tensor blk.13.attn_v.weight buffer type overriden to CPU
Tensor blk.13.attn_output.weight buffer type overriden to CPU
Tensor blk.13.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.13.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.13.ffn_norm.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.attn_norm.weight buffer type overriden to CPU
Tensor blk.14.attn_q.weight buffer type overriden to CPU
Tensor blk.14.attn_k.weight buffer type overriden to CPU
Tensor blk.14.attn_v.weight buffer type overriden to CPU
Tensor blk.14.attn_output.weight buffer type overriden to CPU
Tensor blk.14.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.14.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.14.ffn_norm.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.attn_norm.weight buffer type overriden to CPU
Tensor blk.15.attn_q.weight buffer type overriden to CPU
Tensor blk.15.attn_k.weight buffer type overriden to CPU
Tensor blk.15.attn_v.weight buffer type overriden to CPU
Tensor blk.15.attn_output.weight buffer type overriden to CPU
Tensor blk.15.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.15.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.15.ffn_norm.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.attn_norm.weight buffer type overriden to CPU
Tensor blk.16.attn_q.weight buffer type overriden to CPU
Tensor blk.16.attn_k.weight buffer type overriden to CPU
Tensor blk.16.attn_v.weight buffer type overriden to CPU
Tensor blk.16.attn_output.weight buffer type overriden to CPU
Tensor blk.16.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.16.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.16.ffn_norm.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.attn_norm.weight buffer type overriden to CPU
Tensor blk.17.attn_q.weight buffer type overriden to CPU
Tensor blk.17.attn_k.weight buffer type overriden to CPU
Tensor blk.17.attn_v.weight buffer type overriden to CPU
Tensor blk.17.attn_output.weight buffer type overriden to CPU
Tensor blk.17.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.17.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.17.ffn_norm.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.attn_norm.weight buffer type overriden to CPU
Tensor blk.18.attn_q.weight buffer type overriden to CPU
Tensor blk.18.attn_k.weight buffer type overriden to CPU
Tensor blk.18.attn_v.weight buffer type overriden to CPU
Tensor blk.18.attn_output.weight buffer type overriden to CPU
Tensor blk.18.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.18.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.18.ffn_norm.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.attn_norm.weight buffer type overriden to CPU
Tensor blk.19.attn_q.weight buffer type overriden to CPU
Tensor blk.19.attn_k.weight buffer type overriden to CPU
Tensor blk.19.attn_v.weight buffer type overriden to CPU
Tensor blk.19.attn_output.weight buffer type overriden to CPU
Tensor blk.19.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.19.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.19.ffn_norm.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.attn_norm.weight buffer type overriden to CPU
Tensor blk.20.attn_q.weight buffer type overriden to CPU
Tensor blk.20.attn_k.weight buffer type overriden to CPU
Tensor blk.20.attn_v.weight buffer type overriden to CPU
Tensor blk.20.attn_output.weight buffer type overriden to CPU
Tensor blk.20.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.20.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.20.ffn_norm.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.attn_norm.weight buffer type overriden to CPU
Tensor blk.21.attn_q.weight buffer type overriden to CPU
Tensor blk.21.attn_k.weight buffer type overriden to CPU
Tensor blk.21.attn_v.weight buffer type overriden to CPU
Tensor blk.21.attn_output.weight buffer type overriden to CPU
Tensor blk.21.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.21.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.21.ffn_norm.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.attn_norm.weight buffer type overriden to CPU
Tensor blk.22.attn_q.weight buffer type overriden to CPU
Tensor blk.22.attn_k.weight buffer type overriden to CPU
Tensor blk.22.attn_v.weight buffer type overriden to CPU
Tensor blk.22.attn_output.weight buffer type overriden to CPU
Tensor blk.22.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.22.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.22.ffn_norm.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.attn_norm.weight buffer type overriden to CPU
Tensor blk.23.attn_q.weight buffer type overriden to CPU
Tensor blk.23.attn_k.weight buffer type overriden to CPU
Tensor blk.23.attn_v.weight buffer type overriden to CPU
Tensor blk.23.attn_output.weight buffer type overriden to CPU
Tensor blk.23.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.23.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.23.ffn_norm.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.attn_norm.weight buffer type overriden to CPU
Tensor blk.24.attn_q.weight buffer type overriden to CPU
Tensor blk.24.attn_k.weight buffer type overriden to CPU
Tensor blk.24.attn_v.weight buffer type overriden to CPU
Tensor blk.24.attn_output.weight buffer type overriden to CPU
Tensor blk.24.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.24.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.24.ffn_norm.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.attn_norm.weight buffer type overriden to CPU
Tensor blk.25.attn_q.weight buffer type overriden to CPU
Tensor blk.25.attn_k.weight buffer type overriden to CPU
Tensor blk.25.attn_v.weight buffer type overriden to CPU
Tensor blk.25.attn_output.weight buffer type overriden to CPU
Tensor blk.25.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.25.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.25.ffn_norm.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.attn_norm.weight buffer type overriden to CPU
Tensor blk.26.attn_q.weight buffer type overriden to CPU
Tensor blk.26.attn_k.weight buffer type overriden to CPU
Tensor blk.26.attn_v.weight buffer type overriden to CPU
Tensor blk.26.attn_output.weight buffer type overriden to CPU
Tensor blk.26.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.26.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.26.ffn_norm.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.attn_norm.weight buffer type overriden to CPU
Tensor blk.27.attn_q.weight buffer type overriden to CPU
Tensor blk.27.attn_k.weight buffer type overriden to CPU
Tensor blk.27.attn_v.weight buffer type overriden to CPU
Tensor blk.27.attn_output.weight buffer type overriden to CPU
Tensor blk.27.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.27.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.27.ffn_norm.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.attn_norm.weight buffer type overriden to CPU
Tensor blk.28.attn_q.weight buffer type overriden to CPU
Tensor blk.28.attn_k.weight buffer type overriden to CPU
Tensor blk.28.attn_v.weight buffer type overriden to CPU
Tensor blk.28.attn_output.weight buffer type overriden to CPU
Tensor blk.28.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.28.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.28.ffn_norm.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.attn_norm.weight buffer type overriden to CPU
Tensor blk.29.attn_q.weight buffer type overriden to CPU
Tensor blk.29.attn_k.weight buffer type overriden to CPU
Tensor blk.29.attn_v.weight buffer type overriden to CPU
Tensor blk.29.attn_output.weight buffer type overriden to CPU
Tensor blk.29.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.29.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.29.ffn_norm.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.attn_norm.weight buffer type overriden to CPU
Tensor blk.30.attn_q.weight buffer type overriden to CPU
Tensor blk.30.attn_k.weight buffer type overriden to CPU
Tensor blk.30.attn_v.weight buffer type overriden to CPU
Tensor blk.30.attn_output.weight buffer type overriden to CPU
Tensor blk.30.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.30.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.30.ffn_norm.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.attn_norm.weight buffer type overriden to CPU
Tensor blk.31.attn_q.weight buffer type overriden to CPU
Tensor blk.31.attn_k.weight buffer type overriden to CPU
Tensor blk.31.attn_v.weight buffer type overriden to CPU
Tensor blk.31.attn_output.weight buffer type overriden to CPU
Tensor blk.31.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.31.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.31.ffn_norm.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.attn_norm.weight buffer type overriden to CPU
Tensor blk.32.attn_q.weight buffer type overriden to CPU
Tensor blk.32.attn_k.weight buffer type overriden to CPU
Tensor blk.32.attn_v.weight buffer type overriden to CPU
Tensor blk.32.attn_output.weight buffer type overriden to CPU
Tensor blk.32.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.32.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.32.ffn_norm.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.attn_norm.weight buffer type overriden to CPU
Tensor blk.33.attn_q.weight buffer type overriden to CPU
Tensor blk.33.attn_k.weight buffer type overriden to CPU
Tensor blk.33.attn_v.weight buffer type overriden to CPU
Tensor blk.33.attn_output.weight buffer type overriden to CPU
Tensor blk.33.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.33.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.33.ffn_norm.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.attn_norm.weight buffer type overriden to CPU
Tensor blk.34.attn_q.weight buffer type overriden to CPU
Tensor blk.34.attn_k.weight buffer type overriden to CPU
Tensor blk.34.attn_v.weight buffer type overriden to CPU
Tensor blk.34.attn_output.weight buffer type overriden to CPU
Tensor blk.34.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.34.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.34.ffn_norm.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.attn_norm.weight buffer type overriden to CPU
Tensor blk.35.attn_q.weight buffer type overriden to CPU
Tensor blk.35.attn_k.weight buffer type overriden to CPU
Tensor blk.35.attn_v.weight buffer type overriden to CPU
Tensor blk.35.attn_output.weight buffer type overriden to CPU
Tensor blk.35.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.35.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.35.ffn_norm.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.attn_norm.weight buffer type overriden to CPU
Tensor blk.36.attn_q.weight buffer type overriden to CPU
Tensor blk.36.attn_k.weight buffer type overriden to CPU
Tensor blk.36.attn_v.weight buffer type overriden to CPU
Tensor blk.36.attn_output.weight buffer type overriden to CPU
Tensor blk.36.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.36.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.36.ffn_norm.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.attn_norm.weight buffer type overriden to CPU
Tensor blk.37.attn_q.weight buffer type overriden to CPU
Tensor blk.37.attn_k.weight buffer type overriden to CPU
Tensor blk.37.attn_v.weight buffer type overriden to CPU
Tensor blk.37.attn_output.weight buffer type overriden to CPU
Tensor blk.37.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.37.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.37.ffn_norm.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.attn_norm.weight buffer type overriden to CPU
Tensor blk.38.attn_q.weight buffer type overriden to CPU
Tensor blk.38.attn_k.weight buffer type overriden to CPU
Tensor blk.38.attn_v.weight buffer type overriden to CPU
Tensor blk.38.attn_output.weight buffer type overriden to CPU
Tensor blk.38.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.38.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.38.ffn_norm.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.attn_norm.weight buffer type overriden to CPU
Tensor blk.39.attn_q.weight buffer type overriden to CPU
Tensor blk.39.attn_k.weight buffer type overriden to CPU
Tensor blk.39.attn_v.weight buffer type overriden to CPU
Tensor blk.39.attn_output.weight buffer type overriden to CPU
Tensor blk.39.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.39.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.39.ffn_norm.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.attn_norm.weight buffer type overriden to CPU
Tensor blk.40.attn_q.weight buffer type overriden to CPU
Tensor blk.40.attn_k.weight buffer type overriden to CPU
Tensor blk.40.attn_v.weight buffer type overriden to CPU
Tensor blk.40.attn_output.weight buffer type overriden to CPU
Tensor blk.40.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.40.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.40.ffn_norm.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.attn_norm.weight buffer type overriden to CPU
Tensor blk.41.attn_q.weight buffer type overriden to CPU
Tensor blk.41.attn_k.weight buffer type overriden to CPU
Tensor blk.41.attn_v.weight buffer type overriden to CPU
Tensor blk.41.attn_output.weight buffer type overriden to CPU
Tensor blk.41.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.41.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.41.ffn_norm.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.attn_norm.weight buffer type overriden to CPU
Tensor blk.42.attn_q.weight buffer type overriden to CPU
Tensor blk.42.attn_k.weight buffer type overriden to CPU
Tensor blk.42.attn_v.weight buffer type overriden to CPU
Tensor blk.42.attn_output.weight buffer type overriden to CPU
Tensor blk.42.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.42.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.42.ffn_norm.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.attn_norm.weight buffer type overriden to CPU
Tensor blk.43.attn_q.weight buffer type overriden to CPU
Tensor blk.43.attn_k.weight buffer type overriden to CPU
Tensor blk.43.attn_v.weight buffer type overriden to CPU
Tensor blk.43.attn_output.weight buffer type overriden to CPU
Tensor blk.43.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.43.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.43.ffn_norm.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.attn_norm.weight buffer type overriden to CPU
Tensor blk.44.attn_q.weight buffer type overriden to CPU
Tensor blk.44.attn_k.weight buffer type overriden to CPU
Tensor blk.44.attn_v.weight buffer type overriden to CPU
Tensor blk.44.attn_output.weight buffer type overriden to CPU
Tensor blk.44.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.44.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.44.ffn_norm.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.attn_norm.weight buffer type overriden to CPU
Tensor blk.45.attn_q.weight buffer type overriden to CPU
Tensor blk.45.attn_k.weight buffer type overriden to CPU
Tensor blk.45.attn_v.weight buffer type overriden to CPU
Tensor blk.45.attn_output.weight buffer type overriden to CPU
Tensor blk.45.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.45.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.45.ffn_norm.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.attn_norm.weight buffer type overriden to CPU
Tensor blk.46.attn_q.weight buffer type overriden to CPU
Tensor blk.46.attn_k.weight buffer type overriden to CPU
Tensor blk.46.attn_v.weight buffer type overriden to CPU
Tensor blk.46.attn_output.weight buffer type overriden to CPU
Tensor blk.46.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.46.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.46.ffn_norm.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.attn_norm.weight buffer type overriden to CPU
Tensor blk.47.attn_q.weight buffer type overriden to CPU
Tensor blk.47.attn_k.weight buffer type overriden to CPU
Tensor blk.47.attn_v.weight buffer type overriden to CPU
Tensor blk.47.attn_output.weight buffer type overriden to CPU
Tensor blk.47.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.47.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.47.ffn_norm.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
llama_model_load: error loading model: failed to allocate buffer
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/root/Qwen3-30B-A3B-UD-Q4_K_XL.gguf'
ERR [ load_model] unable to load model | tid="135803250569216" timestamp=1746754485 model="/root/Qwen3-30B-A3B-UD-Q4_K_XL.gguf"
munmap_chunk(): invalid pointer # could be free() or it just disappears
```
---
👤 **pt13762104** commented the **2025-05-09** at **01:36:06**:
Removing `.*=CUDA0` fixed that
---
👤 **pt13762104** commented the **2025-05-09** at **01:36:06**:
Let me try IQ4_K model instead.
---
👤 **pt13762104** commented the **2025-05-09** at **01:59:34**:
@ikawrakow I haven't found any issues when using `-fmoe` on 1 GPU. It seems like a multi-GPU issue, given that the error always occurs on device 1. The IQ4_K model doesn't seem to run into this bug.
---
👤 **Ph0rk0z** commented the **2025-05-09** at **11:52:43**:
I'm not sure how it is done here but afaik, real cudaMemcpyAsync is not supported on SM75.
---
👤 **schynce** commented the **2025-05-12** at **18:47:03**:
Hey @ikawrakow and @pt13762104,
I've been running into the exact same "illegal memory access" crash with 3x3090, and it is not limited to one specific quant.
I compiled ik_llama.cpp (4ba6bbb) like this:
```
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
cmake --build ./build --config Release -j $(nproc)
```
I have tested different quantizations from HuggingFace:
- IQ4_XS (unsloth/Qwen3-235B-A22B-GGUF)
- i1-Q4_K_S (mradermacher/Qwen3-235B-A22B-i1-GGUF)
- "mix-IQ3_K" (ubergarm/Qwen3-235B-A22B-GGUF)
Only the mix-IQ3_K quant seems to work without crashing (and it is an ik_llama.cpp-specific quant). With the other two, the crash happens regardless of `-fmoe`. I can run the mix-IQ3_K quant with `-fmoe` without problems, like this:
```
./llama-server --model /mnt/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf --alias Qwen3-235B-A22B-mix-IQ3_K \
-fa -fmoe -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 --no-kv-offload \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20)\.=CUDA0" \
-ot "blk\.(21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41)\.=CUDA1" \
-ot "blk\.(42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57)\.=CUDA2"
```
On the other hand, this crashes (even if I remove -fmoe):
```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fa -fmoe -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 --no-kv-offload \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.=CUDA0" \
-ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35)\.=CUDA1" \
-ot "blk\.(36|37|38|39|40|41|42|43|44|45|46|47|48|49|50)\.=CUDA2"
```
This is the crash:
```
INFO [ log_server_request] request | tid="140045957632000" timestamp=1746960702 remote_addr="127.0.0.1" remote_port=60492 status=200 method="GET" path="/v1/models" params={}
INFO [ launch_slot_with_task] slot is processing task | tid="140048404189184" timestamp=1746960702 id_slot=0 id_task=373
INFO [ update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960702 id_slot=0 id_task=373 p0=3
INFO [ log_server_request] request | tid="140045940846592" timestamp=1746960722 remote_addr="127.0.0.1" remote_port=44428 status=200 method="GET" path="/v1/models" params={}
INFO [ update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960741 id_slot=0 id_task=373 p0=2051
INFO [ update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960774 id_slot=0 id_task=373 p0=4099
INFO [ update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960808 id_slot=0 id_task=373 p0=6147
CUDA error: an illegal memory access was encountered
current device: 2, in function ggml_backend_cuda_synchronize at /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:3049
cudaStreamSynchronize(cuda_ctx->stream())
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
```
For me, the crashing device is 2. It seems to be changing depending on the offloaded layers?
I would be happy to provide logs or test specific configurations to help debug this.
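
The `-ot` patterns in the commands above list block indices by hand, which is easy to get wrong when moving layers between GPUs. Below is a minimal sketch of a helper that generates the same strings. `block_override` is a hypothetical name, not part of ik_llama.cpp, and the output format simply mirrors the commands in this comment.

```python
import re


def block_override(first: int, last: int, device: str) -> str:
    """Build a 'regex=buffer' string covering blocks first..last (inclusive)."""
    if first > last:
        raise ValueError("first must not exceed last")
    alternation = "|".join(str(i) for i in range(first, last + 1))
    # Raw f-string: the backslashes stay literal so the dots are escaped in the regex.
    return rf"blk\.({alternation})\.={device}"


# Reproduce the three block ranges from the crashing IQ4_XS command.
overrides = [
    block_override(0, 17, "CUDA0"),
    block_override(18, 35, "CUDA1"),
    block_override(36, 50, "CUDA2"),
]

for override in overrides:
    print(f'-ot "{override}"')

# Sanity check: the CUDA1 pattern matches block 18 but not a hypothetical block 180.
cuda1_pattern = overrides[1].rsplit("=", 1)[0]
assert re.search(cuda1_pattern, "blk.18.ffn_up_exps.weight")
assert not re.search(cuda1_pattern, "blk.180.ffn_up_exps.weight")
```

Printing the strings instead of launching the server keeps the sketch side-effect free; the lines can be pasted into the `llama-server` invocation as-is.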
---
👤 **Ph0rk0z** commented the **2025-05-13** at **11:51:23**:
Oh snap.. that's the FA error?! Try without flash attention and see if it still crashes.
---
👤 **ikawrakow** commented the **2025-05-13** at **12:33:36**:
> Only the mix-IQ3_K quant seems to work without crashing (and it is an ik_llama.cpp-specific quant). With the other two, the crash happens regardless of `-fmoe`. I can run the mix-IQ3_K quant with `-fmoe` without problems, like this:
This is useful info. The `IQX_K` quants do not have a quantized matrix multiplication implementation, so matrix multiplications are computed via `dequantize -> cuBLAS`. If the illegal memory access does not occur in that case, it would indicate a problem in the quantized matrix multiplication implementation.
The problem is that I cannot trigger the bug on my single-GPU system. I need to get access to a multi-GPU system to be able to debug.
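
To make the distinction between the two code paths concrete, here is a toy, CPU-only sketch in pure Python. It is not the ik_llama.cpp kernels: the block size, the int8 format, and all function names are illustrative assumptions. Path 1 expands the quantized weights to floats and then does an ordinary float dot product (the role cuBLAS plays on the GPU). Path 2 also quantizes the activations and accumulates integer products per block, which is the general idea behind a quantized matrix multiplication.

```python
import random

BLOCK = 32  # values per quantization block (illustrative choice)


def quantize(values):
    """Quantize floats into int8-range blocks, each with its own float scale."""
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        # Fall back to a scale of 1.0 for an all-zero block to avoid dividing by zero.
        scale = max(abs(v) for v in chunk) / 127 or 1.0
        blocks.append((scale, [round(v / scale) for v in chunk]))
    return blocks


def dequantize(blocks):
    """Expand quantized blocks back into a flat list of floats."""
    return [scale * q for scale, qs in blocks for q in qs]


def dot_via_dequantize(w_blocks, x):
    """Path 1: dequantize the weights, then a plain float dot product."""
    w = dequantize(w_blocks)
    return sum(wi * xi for wi, xi in zip(w, x))


def dot_quantized(w_blocks, x):
    """Path 2: quantize the activations too and sum integer products per block."""
    x_blocks = quantize(x)
    total = 0.0
    for (w_scale, w_q), (x_scale, x_q) in zip(w_blocks, x_blocks):
        total += w_scale * x_scale * sum(a * b for a, b in zip(w_q, x_q))
    return total


random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(128)]
activations = [random.uniform(-1, 1) for _ in range(128)]
w_blocks = quantize(weights)

print("float reference :", sum(a * b for a, b in zip(weights, activations)))
print("dequantize path :", dot_via_dequantize(w_blocks, activations))
print("quantized path  :", dot_quantized(w_blocks, activations))
```

The two paths give nearly the same number but run entirely different code, which is why a crash that appears only with quants that take the second path points at that path rather than at the model or the rest of the graph.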
---
👤 **schynce** commented the **2025-05-13** at **22:33:11**:
> Oh snap.. that's the FA error?! Try without flash attention and see if it still crashes.
I tested without -fa with the crashing IQ4_XS quant, like this:
```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fmoe -rtr -c 40960 --threads 7 --no-kv-offload \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.=CUDA0" \
-ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35)\.=CUDA1" \
-ot "blk\.(36|37|38|39|40|41|42|43|44|45|46|47|48|49|50)\.=CUDA2"
```
The prompt processing speed is absolutely glacial, but it does not seem to be crashing.
Long prompts seemed to reliably crash it before with flash attention, so I ran the same 32K-token prompt I used in my earlier tests through this configuration. It took almost an hour to complete, but did so without incident. I also chatted with it a bit.
---
👤 **Panchovix** commented the **2025-05-14** at **16:32:23**:
Just chiming in, I get a CUDA illegal memory access when using -fmoe on DeepSeekV3 0324
```
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 468.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 360.00 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 360.00 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 360.00 MiB
llama_kv_cache_init: CUDA4 KV buffer size = 648.00 MiB
llama_new_context_with_model: KV self size = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model: CUDA0 compute buffer size = 3520.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 1540.01 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 1540.01 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 1540.01 MiB
llama_new_context_with_model: CUDA4 compute buffer size = 1540.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 312.02 MiB
llama_new_context_with_model: graph nodes = 3304
llama_new_context_with_model: graph splits = 393
INFO [ init] initializing slots | tid="140562497785856" timestamp=1747239254 n_slots=1
INFO [ init] new slot | tid="140562497785856" timestamp=1747239254 id_slot=0 n_ctx_slot=32768
INFO [ main] model loaded | tid="140562497785856" timestamp=1747239254
INFO [ main] chat template | tid="140562497785856" timestamp=1747239254 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
INFO [ main] HTTP server listening | tid="140562497785856" timestamp=1747239254 n_threads_http="15" port="8080" hostname="127.0.0.1"
INFO [ update_slots] all slots are idle | tid="140562497785856" timestamp=1747239254
INFO [ launch_slot_with_task] slot is processing task | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0
INFO [ update_slots] kv cache rm [p0, end) | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0 p0=0
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_cuda_op_mul_mat at /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:1743
cudaGetLastError()
/run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
[New LWP 25355]
[New LWP 25354]
[New LWP 25353]
[New LWP 25352]
[New LWP 25351]
[New LWP 25350]
[New LWP 25349]
[New LWP 25348]
[New LWP 25347]
[New LWP 25346]
[New LWP 25345]
[New LWP 25344]
[New LWP 25343]
[New LWP 25342]
[New LWP 25341]
[New LWP 25340]
[New LWP 24655]
[New LWP 24654]
[New LWP 24653]
[New LWP 24652]
[New LWP 24651]
[New LWP 24650]
[New LWP 24649]
[New LWP 23954]
[New LWP 23953]
[New LWP 23952]
[New LWP 23951]
[New LWP 23950]
[New LWP 23949]
[New LWP 23948]
[New LWP 23947]
[New LWP 23942]
[New LWP 23941]
[New LWP 23940]
This GDB supports auto-downloading debuginfo from the following URLs:
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#0 0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#1 0x00007fd73d07b9da in __internal_syscall_cancel () from /lib64/libc.so.6
#2 0x00007fd73d07ba24 in __syscall_cancel () from /lib64/libc.so.6
#3 0x00007fd73d0eb5af in wait4 () from /lib64/libc.so.6
#4 0x00007fd741c58908 in ggml_abort () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
#5 0x00007fd741dded43 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
#6 0x00007fd741decb09 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.1] () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
#7 0x00007fd741df42dd in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
#8 0x00007fd741caf9b3 in ggml_backend_sched_graph_compute_async () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
#9 0x00007fd79656af1a in llama_decode () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/src/libllama.so
#10 0x000000000049a2d4 in server_context::update_slots() ()
#11 0x000000000046cafc in server_queue::start_loop() ()
#12 0x0000000000416977 in main ()
[Inferior 1 (process 23939) detached]
```
Ran it with
```
./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-00001-of-00007.gguf' -c 32768 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" -ot "blk.(7|8|9|10).ffn.=CUDA1" -ot "blk.(11|12|13|14).ffn.=CUDA2" -ot "blk.(15|16|17).ffn.=CUDA3" -ot "blk.(18|19|20|21|22|23|24|25).ffn.=CUDA4" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048 -mla 1 -fmoe
```
Not using -fmoe makes it work without issues.
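
When comparing setups like the one above, it can help to check which buffer each tensor name ends up in. The sketch below resolves tensor names against the `-ot` patterns from this command. It assumes that overrides are tried in command-line order, that the first matching pattern wins, and that a pattern may match anywhere in the tensor name; these semantics are assumptions to verify against the ik_llama.cpp source, and `resolve` is a hypothetical helper.

```python
import re

# (regex, buffer) pairs in the order they appear on the command line above.
OVERRIDES = [
    ("blk.(0|1|2|3|4|5|6).ffn.", "CUDA0"),
    ("blk.(7|8|9|10).ffn.", "CUDA1"),
    ("blk.(11|12|13|14).ffn.", "CUDA2"),
    ("blk.(15|16|17).ffn.", "CUDA3"),
    ("blk.(18|19|20|21|22|23|24|25).ffn.", "CUDA4"),
    ("ffn.*", "CPU"),
]


def resolve(tensor_name, overrides=OVERRIDES):
    """Return the buffer of the first pattern found in tensor_name, else None.

    First-match-wins and substring search are assumptions, not confirmed behavior.
    """
    for pattern, buffer in overrides:
        if re.search(pattern, tensor_name):
            return buffer
    return None


for name in (
    "blk.3.ffn_up_exps.weight",
    "blk.10.ffn_down_exps.weight",
    "blk.30.ffn_gate_exps.weight",
    "blk.30.attn_q.weight",
):
    print(f"{name:32s} -> {resolve(name)}")
```

Under these assumptions, a `None` result means no override applies and the tensor follows the default placement implied by `-ngl`. The dots in these patterns are unescaped and therefore match any character, but for the tensor names checked here that does not change which rule wins.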
---
👤 **p4s2wd** commented the **2025-05-15** at **00:13:20**:
> Just chiming in, I get a CUDA illegal memory access when using -fmoe on DeepSeekV3 0324
>
> ```
> llama_new_context_with_model: freq_scale = 0.025
> llama_kv_cache_init: CUDA0 KV buffer size = 468.00 MiB
> llama_kv_cache_init: CUDA1 KV buffer size = 360.00 MiB
> llama_kv_cache_init: CUDA2 KV buffer size = 360.00 MiB
> llama_kv_cache_init: CUDA3 KV buffer size = 360.00 MiB
> llama_kv_cache_init: CUDA4 KV buffer size = 648.00 MiB
> llama_new_context_with_model: KV self size = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
> llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
> llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
> llama_new_context_with_model: CUDA0 compute buffer size = 3520.01 MiB
> llama_new_context_with_model: CUDA1 compute buffer size = 1540.01 MiB
> llama_new_context_with_model: CUDA2 compute buffer size = 1540.01 MiB
> llama_new_context_with_model: CUDA3 compute buffer size = 1540.01 MiB
> llama_new_context_with_model: CUDA4 compute buffer size = 1540.02 MiB
> llama_new_context_with_model: CUDA_Host compute buffer size = 312.02 MiB
> llama_new_context_with_model: graph nodes = 3304
> llama_new_context_with_model: graph splits = 393
> INFO [ init] initializing slots | tid="140562497785856" timestamp=1747239254 n_slots=1
> INFO [ init] new slot | tid="140562497785856" timestamp=1747239254 id_slot=0 n_ctx_slot=32768
> INFO [ main] model loaded | tid="140562497785856" timestamp=1747239254
> INFO [ main] chat template | tid="140562497785856" timestamp=1747239254 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
> INFO [ main] HTTP server listening | tid="140562497785856" timestamp=1747239254 n_threads_http="15" port="8080" hostname="127.0.0.1"
> INFO [ update_slots] all slots are idle | tid="140562497785856" timestamp=1747239254
> INFO [ launch_slot_with_task] slot is processing task | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0
> INFO [ update_slots] kv cache rm [p0, end) | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0 p0=0
> CUDA error: an illegal memory access was encountered
> current device: 0, in function ggml_cuda_op_mul_mat at /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:1743
> cudaGetLastError()
> /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
> [New LWP 25355]
> [New LWP 25354]
> [New LWP 25353]
> [New LWP 25352]
> [New LWP 25351]
> [New LWP 25350]
> [New LWP 25349]
> [New LWP 25348]
> [New LWP 25347]
> [New LWP 25346]
> [New LWP 25345]
> [New LWP 25344]
> [New LWP 25343]
> [New LWP 25342]
> [New LWP 25341]
> [New LWP 25340]
> [New LWP 24655]
> [New LWP 24654]
> [New LWP 24653]
> [New LWP 24652]
> [New LWP 24651]
> [New LWP 24650]
> [New LWP 24649]
> [New LWP 23954]
> [New LWP 23953]
> [New LWP 23952]
> [New LWP 23951]
> [New LWP 23950]
> [New LWP 23949]
> [New LWP 23948]
> [New LWP 23947]
> [New LWP 23942]
> [New LWP 23941]
> [New LWP 23940]
>
> This GDB supports auto-downloading debuginfo from the following URLs:
>
> Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
> Debuginfod has been disabled.
> To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
> Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
> Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
> Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
> Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
> #0 0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
> #1 0x00007fd73d07b9da in __internal_syscall_cancel () from /lib64/libc.so.6
> #2 0x00007fd73d07ba24 in __syscall_cancel () from /lib64/libc.so.6
> #3 0x00007fd73d0eb5af in wait4 () from /lib64/libc.so.6
> #4 0x00007fd741c58908 in ggml_abort () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #5 0x00007fd741dded43 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #6 0x00007fd741decb09 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.1] () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #7 0x00007fd741df42dd in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #8 0x00007fd741caf9b3 in ggml_backend_sched_graph_compute_async () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #9 0x00007fd79656af1a in llama_decode () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/src/libllama.so
> #10 0x000000000049a2d4 in server_context::update_slots() ()
> #11 0x000000000046cafc in server_queue::start_loop() ()
> #12 0x0000000000416977 in main ()
> [Inferior 1 (process 23939) detached]
> ```
>
> Ran it with
>
> ```
> ./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-00001-of-00007.gguf' -c 32768 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" -ot "blk.(7|8|9|10).ffn.=CUDA1" -ot "blk.(11|12|13|14).ffn.=CUDA2" -ot "blk.(15|16|17).ffn.=CUDA3" -ot "blk.(18|19|20|21|22|23|24|25).ffn.=CUDA4" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048 -mla 1 -fmoe
> ```
>
> Not using -fm
---
👤 **p4s2wd** commented the **2025-05-15** at **00:21:27**:
> Just chiming in, I get a CUDA illegal memory access when using -fmoe on DeepSeekV3 0324
>
> ```
> llama_new_context_with_model: freq_scale = 0.025
> llama_kv_cache_init: CUDA0 KV buffer size = 468.00 MiB
> llama_kv_cache_init: CUDA1 KV buffer size = 360.00 MiB
> llama_kv_cache_init: CUDA2 KV buffer size = 360.00 MiB
> llama_kv_cache_init: CUDA3 KV buffer size = 360.00 MiB
> llama_kv_cache_init: CUDA4 KV buffer size = 648.00 MiB
> llama_new_context_with_model: KV self size = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
> llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
> llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
> llama_new_context_with_model: CUDA0 compute buffer size = 3520.01 MiB
> llama_new_context_with_model: CUDA1 compute buffer size = 1540.01 MiB
> llama_new_context_with_model: CUDA2 compute buffer size = 1540.01 MiB
> llama_new_context_with_model: CUDA3 compute buffer size = 1540.01 MiB
> llama_new_context_with_model: CUDA4 compute buffer size = 1540.02 MiB
> llama_new_context_with_model: CUDA_Host compute buffer size = 312.02 MiB
> llama_new_context_with_model: graph nodes = 3304
> llama_new_context_with_model: graph splits = 393
> INFO [ init] initializing slots | tid="140562497785856" timestamp=1747239254 n_slots=1
> INFO [ init] new slot | tid="140562497785856" timestamp=1747239254 id_slot=0 n_ctx_slot=32768
> INFO [ main] model loaded | tid="140562497785856" timestamp=1747239254
> INFO [ main] chat template | tid="140562497785856" timestamp=1747239254 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
> INFO [ main] HTTP server listening | tid="140562497785856" timestamp=1747239254 n_threads_http="15" port="8080" hostname="127.0.0.1"
> INFO [ update_slots] all slots are idle | tid="140562497785856" timestamp=1747239254
> INFO [ launch_slot_with_task] slot is processing task | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0
> INFO [ update_slots] kv cache rm [p0, end) | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0 p0=0
> CUDA error: an illegal memory access was encountered
> current device: 0, in function ggml_cuda_op_mul_mat at /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:1743
> cudaGetLastError()
> /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
> [New LWP 25355]
> [New LWP 25354]
> [New LWP 25353]
> [New LWP 25352]
> [New LWP 25351]
> [New LWP 25350]
> [New LWP 25349]
> [New LWP 25348]
> [New LWP 25347]
> [New LWP 25346]
> [New LWP 25345]
> [New LWP 25344]
> [New LWP 25343]
> [New LWP 25342]
> [New LWP 25341]
> [New LWP 25340]
> [New LWP 24655]
> [New LWP 24654]
> [New LWP 24653]
> [New LWP 24652]
> [New LWP 24651]
> [New LWP 24650]
> [New LWP 24649]
> [New LWP 23954]
> [New LWP 23953]
> [New LWP 23952]
> [New LWP 23951]
> [New LWP 23950]
> [New LWP 23949]
> [New LWP 23948]
> [New LWP 23947]
> [New LWP 23942]
> [New LWP 23941]
> [New LWP 23940]
>
> This GDB supports auto-downloading debuginfo from the following URLs:
>
> Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
> Debuginfod has been disabled.
> To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
> Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
> Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
> Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
> Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
> #0 0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
> #1 0x00007fd73d07b9da in __internal_syscall_cancel () from /lib64/libc.so.6
> #2 0x00007fd73d07ba24 in __syscall_cancel () from /lib64/libc.so.6
> #3 0x00007fd73d0eb5af in wait4 () from /lib64/libc.so.6
> #4 0x00007fd741c58908 in ggml_abort () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #5 0x00007fd741dded43 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #6 0x00007fd741decb09 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.1] () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #7 0x00007fd741df42dd in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #8 0x00007fd741caf9b3 in ggml_backend_sched_graph_compute_async () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #9 0x00007fd79656af1a in llama_decode () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/src/libllama.so
> #10 0x000000000049a2d4 in server_context::update_slots() ()
> #11 0x000000000046cafc in server_queue::start_loop() ()
> #12 0x0000000000416977 in main ()
> [Inferior 1 (process 23939) detached]
> ```
>
> Ran it with
>
> ```
> ./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-00001-of-00007.gguf' -c 32768 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" -ot "blk.(7|8|9|10).ffn.=CUDA1" -ot "blk.(11|12|13|14).ffn.=CUDA2" -ot "blk.(15|16|17).ffn.=CUDA3" -ot "blk.(18|19|20|21|22|23|24|25).ffn.=CUDA4" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048 -mla 1 -fmoe
> ```
>
> Not using -fmoe makes it work without issues.
As you're using GPU+CPU, please try to replace "-mla 1" with "-mla 2".
---
👤 **ikawrakow** commented the **2025-05-15** at **04:35:23**:
> As you're using GPU+CPU, please try to replace "-mla 1" with "-mla 2".
`-mla 3` works now on CPU+GPU and is the best option.
Concerning the error, it is not triggered in a function related to `-fmoe`, so I wonder if it is a pre-existing bug (a bunch of those got fixed in mainline lately).
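CUDA kernel launches are asynchronous, so an illegal memory access is often reported by a later API call than the kernel that caused it, and the function named in the error may not be the culprit. A minimal sketch for narrowing this down, assuming a CUDA build and that `compute-sanitizer` from the CUDA toolkit is on `PATH`; the helper names `run_blocking` and `run_sanitized` are hypothetical and not part of ik_llama.cpp:

```shell
#!/bin/sh
# Hypothetical helpers, not part of ik_llama.cpp.

# Run a command with synchronous kernel launches, so a CUDA error is
# reported at the launch that caused it rather than at a later call.
run_blocking() {
    CUDA_LAUNCH_BLOCKING=1 "$@"
}

# Run a command under NVIDIA's compute-sanitizer with the memcheck tool.
# This is much slower, so a short prompt that still reproduces the crash helps.
run_sanitized() {
    compute-sanitizer --tool memcheck "$@"
}

# Example (reuse the arguments of the failing run):
#   run_blocking ./llama-server -m model.gguf -fmoe
```

Both approaches slow inference down noticeably, so they are best used with the shortest prompt that still triggers the crash.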
---
👤 **Panchovix** commented the **2025-05-15** at **22:22:06**:
Okay, tested again after updating and rebooting Fedora, and now -fmoe works fine with MLA 1 + FA on CUDA+CPU (I use it like that to save VRAM on compute buffers).
Not sure exactly what would have caused the issue.
---
👤 **schynce** commented the **2025-05-15** at **22:32:20**:
> Okay, tested again after updating and rebooting Fedora, and now -fmoe works fine with MLA 1 + FA on CUDA+CPU (I use it like that to save VRAM on compute buffers).
>
> Not sure exactly what would have caused the issue.
Are you sure that it is actually fixed? I am asking because I had some commands that I thought "worked" and started happily using them only for them to crash 15 messages and >30K tokens later. Some would crash instantly or with long prompts.
---
👤 **Panchovix** commented the **2025-05-15** at **22:45:52**:
@schynce you're correct, tried a few more and it got the illegal memory access again.
---
👤 **divine-taco** commented the **2025-05-19** at **23:10:44**:
Another data point. I'm not entirely sure `-fmoe` is the problem here. This is running multi gpu (3090) with cpu offload.
I can also report that it is rare for the crash to occur immediately. It's usually after a handful of turns.
Note that this seems to be a recently introduced bug:
`-fmoe -mla 2` does not crash on 6c23618ca5d680bd00f06a143dc4a1b386c827e3
`-fmoe -mla 3` does not crash on 6c23618ca5d680bd00f06a143dc4a1b386c827e3 (much slower than mla 2 on this commit)
It stopped working sometime after this.
`-fmoe -mla 2` crashes for 2ec2229f2e9847d4e96bd7f163201810c8f8299a
`-fmoe -mla 3` crashes for 2ec2229f2e9847d4e96bd7f163201810c8f8299a
`-mla 2` without fmoe is also crashing for 2ec2229f2e9847d4e96bd7f163201810c8f8299a
If I get some time this week I'll try to isolate when the bug was introduced.
Probably worth someone else trying `6c23618ca5d680bd00f06a143dc4a1b386c827e3` to confirm this is the same issue everyone seems to be running into with multi gpu.
Suspect https://github.com/ikawrakow/ik_llama.cpp/issues/425 may be the same issue.
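Isolating the offending commit between the two hashes above can be sketched with `git bisect`. This assumes it is run from a clone of ik_llama.cpp and that the usual CMake CUDA build applies; the build flags and the helper names `bisect_start` and `bisect_build` are assumptions, not something confirmed in this thread:

```shell
#!/bin/sh
# Sketch of a bisect between the two commits reported above.
# Assumption: the standard CMake CUDA build; adjust flags for your setup.
GOOD=6c23618ca5d680bd00f06a143dc4a1b386c827e3
BAD=2ec2229f2e9847d4e96bd7f163201810c8f8299a

# Start bisecting: bad commit first, then the known-good one.
bisect_start() {
    git bisect start "$BAD" "$GOOD"
}

# Rebuild whatever commit git has checked out for the current step.
bisect_build() {
    cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
}

# After testing each build by hand:
#   git bisect good    # no crash after enough turns
#   git bisect bad     # illegal memory access reproduced
#   git bisect reset   # when finished
```

Because the crash usually needs several turns to appear, a commit should only be marked good after a run long enough to have triggered it on the bad commit.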
---
👤 **ikawrakow** commented the **2025-05-20** at **04:34:00**:
@divine-taco It would be useful to share your command line when reporting a problem.
The most significant change between https://github.com/ikawrakow/ik_llama.cpp/commit/6c23618ca5d680bd00f06a143dc4a1b386c827e3 and https://github.com/ikawrakow/ik_llama.cpp/commit/2ec2229f2e9847d4e96bd7f163201810c8f8299a is PR #405. Prior to this PR the fused `ffn_up/ffn_gate` operation was not offloaded to the GPU if the tensors were on the CPU. After #405 the op is offloaded. You can disable that and restore the behavior prior to #405 using `-op 29,0`. Can you try that? Thanks.
---
👤 **divine-taco** commented the **2025-05-20** at **05:56:42**:
~~@ikawrakow `-op 29,0` seems to fix the issues running with the latest commit - 2ec2229f2e9847d4e96bd7f163201810c8f8299a~~
Full command:
```
llama-server \
--parallel 1 \
-ctk f16 -ctv f16 \
-ts 17,17,17,17,17,17,17,17,17 \
--model /home/mx01/DeepSeek-V3-0324-GGUF-Q8_0 --host 0.0.0.0 --port 8080 \
--ctx-size 44000 \
-fmoe -rtr -mla 3 -fa \
-b 2048 -ub 2048 -amb 512 \
-op 29,0 \
--no-mmap \
--threads 64 --threads-batch 64 \
-ngl 99 \
-ot exps=CPU
```
Update:
2ec2229f2e9847d4e96bd7f163201810c8f8299a did eventually crash with `-op 29,0` in the same manner as before. It took quite a few turns to observe the behavior (~15).
```
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at /app/ggml/src/ggml-cuda.cu:3067
cudaStreamSynchronize(cuda_ctx->stream())
/app/ggml/src/ggml-cuda.cu:110: CUDA error
```
---
👤 **schynce** commented the **2025-05-20** at **13:44:34**:
For me, the best way to trigger the bug quickly is to dump in a 30K token prompt. It seems to crash during the prompt processing or before generating a single token.
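A rough way to script such a long prompt, assuming the server exposes the `/completion` endpoint on the default `127.0.0.1:8080`. The helper names `make_payload` and `send_payload` are hypothetical, and the word count is only a stand-in for a token count, since one word may map to several tokens:

```shell
#!/bin/sh
# Hypothetical helper: build a JSON body for llama-server's /completion
# endpoint with a prompt of N repeated words.
make_payload() {
    n_words=$1
    # yes repeats the word, head keeps the first N lines, tr joins them.
    prompt=$(yes "lorem" | head -n "$n_words" | tr '\n' ' ')
    printf '{"prompt": "%s", "n_predict": 16}\n' "$prompt"
}

# Send the payload to a running server; host and port are assumptions
# and can be overridden through SERVER_URL.
send_payload() {
    make_payload "$1" | curl -s -H 'Content-Type: application/json' \
        --data-binary @- "${SERVER_URL:-http://127.0.0.1:8080}/completion"
}

# Example: send_payload 30000
```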
---
👤 **ikawrakow** commented the **2025-05-20** at **14:23:18**:
Does PR #438 help?
---
👤 **schynce** commented the **2025-05-20** at **15:58:47**:
> Does PR [#438](https://github.com/ikawrakow/ik_llama.cpp/pull/438) help?
I tested #438 (branch ik/desperate_bug_fix_attempt) but unfortunately, it crashed almost straight away:
```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fa -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 --no-kv-offload \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.=CUDA0" \
-ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35)\.=CUDA1" \
-ot "blk\.(36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51)\.=CUDA2"
```
```
INFO [ update_slots] kv cache rm [p0, end) | tid="139707044622336" timestamp=1747756441 id_slot=0 id_task=27 p0=4097
CUDA error: an illegal memory access was encountered
current device: 2, in function ggml_backend_cuda_synchronize at /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:3075
cudaStreamSynchronize(cuda_ctx->stream())
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
```
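The `ptrace: Operation not permitted` lines indicate that the automatic gdb backtrace could not attach to the process, which is likely why no stack was printed. A small sketch for checking the restriction, assuming a Linux system with the Yama security module, as the message itself suggests; `show_ptrace_scope` is a hypothetical helper:

```shell
#!/bin/sh
# Show the current Yama ptrace restriction
# (0 = any process with the same uid may attach, 1 = only a parent may
# attach to its children).
ptrace_scope_file=/proc/sys/kernel/yama/ptrace_scope

show_ptrace_scope() {
    if [ -r "$ptrace_scope_file" ]; then
        cat "$ptrace_scope_file"
    else
        echo "unavailable"
    fi
}

# Temporarily allow gdb to attach (needs root; resets on reboot):
#   sudo sysctl -w kernel.yama.ptrace_scope=0
```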
---
👤 **divine-taco** commented the **2025-05-20** at **21:36:55**:
~~PR #438 - 82871cc2a3366dfdeff758f04fdfcf5ae5859829 - looks to fix the issue for me. Tried 30 turn completions at long context and saw no issues.~~
Command used:
```
llama-server \
--parallel 1 \
-ctk f16 -ctv f16 \
-ts 17,17,17,17,17,17,17,17,17 \
--model /home/mx01/DeepSeek-V3-0324-GGUF-Q8_0 --host 0.0.0.0 --port 8080 \
--ctx-size 44000 \
-fmoe -rtr -mla 3 -fa \
-b 2048 -ub 2048 -amb 512 \
--no-mmap \
--threads 64 --threads-batch 64 \
-ngl 99 \
-ot exps=CPU
```
@schynce - Have a link to the Qwen3-235B-A22B quant you used? I can try that as well.
Update: Failed with illegal memory access again on PR #438 with DeepSeek 0324 after I ran some automated completion tests. I don't have enough data yet to be confident, but it does seem to fail less frequently. I'll try running `-mla 2` on PR #438 to see if this makes any difference.
---
👤 **schynce** commented the **2025-05-20** at **21:49:54**:
@divine-taco
I used this:
https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/tree/main/IQ4_XS
However, I notice that there have been some updates in the first split file since I downloaded it.
---
👤 **ikawrakow** commented the **2025-05-21** at **06:02:41**:
Please use the branch in PR #442 and post the CUDA call trace that will be printed when the application crashes.
---
👤 **schynce** commented the **2025-05-21** at **12:11:08**:
> Please use the branch in PR [#442](https://github.com/ikawrakow/ik_llama.cpp/pull/442) and post the CUDA call trace that will be printed when the application crashes.
```
llm_load_tensors: CUDA_Host buffer size = 52313.37 MiB
llm_load_tensors: CUDA0 buffer size = 22068.28 MiB
llm_load_tensors: CUDA1 buffer size = 22068.28 MiB
llm_load_tensors: CUDA2 buffer size = 23042.94 MiB
....................................................................................................
============ Repacked 127 tensors
llama_new_context_with_model: n_ctx = 40960
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 3995.00 MiB
llama_new_context_with_model: KV self size = 3995.00 MiB, K (q8_0): 1997.50 MiB, V (q8_0): 1997.50 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 104.50 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 104.50 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 189.25 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 304.75 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 432
INFO [ init] initializing slots | tid="140363884277760" timestamp=1747829175 n_slots=1
INFO [ init] new slot | tid="140363884277760" timestamp=1747829175 id_slot=0 n_ctx_slot=40960
INFO [ main] model loaded | tid="140363884277760" timestamp=1747829175
INFO [ main] chat template | tid="140363884277760" timestamp=1747829175 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [ main] HTTP server listening | tid="140363884277760" timestamp=1747829175 n_threads_http="15" port="5000" hostname="0.0.0.0"
INFO [ update_slots] all slots are idle | tid="140363884277760" timestamp=1747829175
INFO [ log_server_request] request | tid="140361486192640" timestamp=1747829175 remote_addr="127.0.0.1" remote_port=55754 status=200 method="GET" path="/v1/models" params={}
INFO [ log_server_request] request | tid="140361494585344" timestamp=1747829175 remote_addr="127.0.0.1" remote_port=57094 status=200 method="GET" path="/v1/models" params={}
INFO [ log_server_request] request | tid="140361477799936" timestamp=1747829182 remote_addr="127.0.0.1" remote_port=43408 status=200 method="GET" path="/v1/models" params={}
INFO [ log_server_request] request | tid="140361469407232" timestamp=1747829191 remote_addr="127.0.0.1" remote_port=49880 status=200 method="GET" path="/v1/models" params={}
INFO [ launch_slot_with_task] slot is processing task | tid="140363884277760" timestamp=1747829191 id_slot=0 id_task=0
INFO [ update_slots] kv cache rm [p0, end) | tid="140363884277760" timestamp=1747829191 id_slot=0 id_task=0 p0=0
CUDA error: an illegal memory access was encountered
current device: 2, in function ggml_backend_cuda_synchronize at /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:3085
cudaStreamSynchronize(cuda_ctx->stream())
========================== CUDA trace: 315944 previous calls
315943: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
315942: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315941: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315940: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315939: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
315938: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
315937: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
315936: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
315935: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315934: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315933: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315932: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
315931: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
315930: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
315929: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
315928: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315927: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315926: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315925: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
315924: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
315923: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
315922: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 135
315921: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
315920: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3074
315919: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3071
315918: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3061
315917: function ggml_backend_cuda_synchronize, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3085
315916: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2773
315915: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2764
315914: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
315913: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315912: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
315911: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```
---
👤 **ikawrakow** commented the **2025-05-21** at **12:37:17**:
Thank you!
So, it crashes in a matrix multiplication. I have pushed another commit on the branch that will help narrow it down further if you rerun with that.
---
👤 **schynce** commented the **2025-05-21** at **13:29:25**:
> Thank you!
>
> So, it crashes in a matrix multiplication. I have pushed another commit on the branch that will help narrow it down further if you rerun with that.

Thanks for looking into the issue! Here you go:
```
CUDA error: an illegal memory access was encountered
current device: 2, in function ggml_backend_cuda_synchronize at /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:3085
cudaStreamSynchronize(cuda_ctx->stream())
========================== CUDA trace: 335439 previous calls
335438: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
335437: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
335436: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
335435: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335434: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335433: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335432: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335431: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
335430: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335429: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
335428: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
335427: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
335426: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335425: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335424: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335423: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335422: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
335421: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335420: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
335419: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
335418: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
335417: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335416: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335415: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335414: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335413: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
335412: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335411: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 135
335410: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335409: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3074
335408: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3071
335407: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3061
335406: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
```
---
👤 **ikawrakow** commented the **2025-05-21** at **13:55:41**:
I was confused. If there was something wrong with the matrix multiplications, it would have aborted there. The computations succeed, but then something goes wrong in the back-end. I have now added 2 additional asserts in the back-end at the place where the back-trace was when we did the debugging session.
---
👤 **schynce** commented the **2025-05-21** at **14:10:05**:
> I was confused. If there was something wrong with the matrix multiplications, it would have aborted there. The computations succeed, but then something goes wrong in the back-end. I have now added 2 additional asserts in the back-end at the place where the back-trace was when we did the debugging session.

I tried the newest commit, but the backtrace is practically identical as far as I can tell:
```
CUDA error: an illegal memory access was encountered
current device: 2, in function ggml_backend_cuda_synchronize at /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:3089
cudaStreamSynchronize(stream)
========================== CUDA trace: 335439 previous calls
335438: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
335437: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
335436: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
335435: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335434: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335433: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335432: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335431: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
335430: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335429: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
335428: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
335427: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
335426: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335425: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335424: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335423: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335422: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
335421: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335420: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
335419: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
335418: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
335417: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335416: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335415: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335414: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335413: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
335412: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335411: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 135
335410: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335409: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3074
335408: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3071
335407: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3061
335406: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
```
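Because each run shifts the call indices and absolute paths, comparing two such traces by eye is error-prone. The following is a minimal sketch, not part of ik_llama.cpp, that strips the call index from each trace line and compares the remaining `(function, file, line)` sequence. The helper names `normalize`, `summarize`, and `same_call_sequence` are made up for illustration; the line format is assumed to match the traces pasted in this thread.

```python
import re
from collections import Counter

# Matches trace lines such as
# "335438: function ggml_cuda_op_mul_mat, file /path/ggml-cuda.cu, line 1755"
TRACE_LINE = re.compile(r"^\s*\d+: function (\w+), file (\S+), line (\d+)\s*$")


def normalize(trace_text):
    """Return (function, file basename, line) tuples, dropping the call index."""
    frames = []
    for line in trace_text.splitlines():
        m = TRACE_LINE.match(line)
        if m:
            func, path, lineno = m.groups()
            frames.append((func, path.rsplit("/", 1)[-1], int(lineno)))
    return frames


def summarize(trace_text):
    """Count how often each function appears in the trace."""
    return Counter(func for func, _, _ in normalize(trace_text))


def same_call_sequence(trace_a, trace_b):
    """True if both traces list the same calls in the same order."""
    return normalize(trace_a) == normalize(trace_b)


# Two excerpts in the format shown above; only the call indices differ.
run_a = """\
335438: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
335437: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
"""
run_b = """\
2460819: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
2460818: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
"""
print(same_call_sequence(run_a, run_b))
```

Note that the comparison covers only the trace body. The header line naming the function and source location of the failing `cudaStreamSynchronize` has to be compared separately.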
---
👤 **ikawrakow** commented the **2025-05-21** at **14:27:12**:
Thanks! I'll keep digging.
---
👤 **ikawrakow** commented the **2025-05-21** at **15:26:00**:
I have now added a trace to the back-end, so when the crash occurs it will print from where `ggml_backend_cuda_synchronize` was called. Can you try another time? Thanks!
---
👤 **schynce** commented the **2025-05-21** at **16:31:48**:
> I have now added a trace to the back-end, so when the crash occurs it will print from where `ggml_backend_cuda_synchronize` was called. Can you try another time? Thanks!
```
CUDA error: an illegal memory access was encountered
current device: 2, in function ggml_backend_sched_compute_splits at /home/user/ik_llama.cpp/ggml/src/ggml-backend.c:1835
cudaStreamSynchronize
========================== CUDA trace: 335439 previous calls
335438: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
335437: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
335436: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
335435: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335434: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335433: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335432: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335431: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
335430: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335429: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
335428: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
335427: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
335426: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335425: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335424: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335423: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335422: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
335421: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335420: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
335419: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
335418: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
335417: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335416: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335415: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
335414: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335413: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
335412: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335411: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 135
335410: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
335409: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3074
335408: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3071
335407: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3061
335406: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
```
---
👤 **ikawrakow** commented the **2025-05-21** at **16:43:24**:
@schynce You are running with `--no-kv-offload`, right? Your error is different. What happens if you don't use `--no-kv-offload`?
---
👤 **schynce** commented the **2025-05-21** at **16:55:42**:
> [@schynce](https://github.com/schynce) You are running with `--no-kv-offload`, right? Your error is different. What happens if you don't use `--no-kv-offload`?

Yes, those logs were with this launch command:
```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fa -fmoe -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 --no-kv-offload \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.=CUDA0" \
-ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35)\.=CUDA1" \
-ot "blk\.(36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51)\.=CUDA2"
```
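To see which device a given tensor ends up on with the `-ot` overrides above, the patterns can be replayed outside the server. The sketch below is not taken from ik_llama.cpp. It assumes that overrides are tried in command-line order, that the first match wins, and that matching is an unanchored regex search. The tensor names are illustrative GGUF-style names, and `block_pattern` and `resolve` are hypothetical helpers.

```python
import re


def block_pattern(first, last):
    """Build the same alternation the command spells out by hand."""
    numbers = "|".join(str(i) for i in range(first, last + 1))
    return r"blk\.(" + numbers + r")\."


# Overrides from the launch command above, as (regex, buffer) pairs.
OVERRIDES = [
    (r"blk\.\d+\.attn", "CUDA2"),
    (block_pattern(0, 17), "CUDA0"),
    (block_pattern(18, 35), "CUDA1"),
    (block_pattern(36, 51), "CUDA2"),
]


def resolve(tensor_name, overrides=OVERRIDES):
    """Return the buffer of the first matching override, or None.

    ASSUMPTION: first match wins; this mirrors the order on the command
    line and is not verified against the ik_llama.cpp source.
    """
    for pattern, buffer in overrides:
        if re.search(pattern, tensor_name):
            return buffer
    return None


for name in (
    "blk.3.attn_q.weight",          # illustrative tensor names
    "blk.3.ffn_up_exps.weight",
    "blk.20.ffn_down_exps.weight",
    "blk.60.ffn_up_exps.weight",
):
    print(name, "->", resolve(name))
```

Under these assumptions, every attention tensor goes to CUDA2 regardless of its block, so a block on CUDA0 or CUDA1 needs data moved between devices on each layer. Blocks above 51 match none of the patterns and fall through to whatever the default placement is.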
---
I ran without `--no-kv-offload` and modified the layers to fit the KV cache:
```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fa -fmoe -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16)\.=CUDA0" \
-ot "blk\.(17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33)\.=CUDA1" \
-ot "blk\.(34|35|36|37|38|39|40|41|42|43|44|45|46|47)\.=CUDA2"
```
It took considerably longer for the crash to appear this time:
```
INFO [ launch_slot_with_task] slot is processing task | tid="139770035781632" timestamp=1747846205 id_slot=0 id_task=0
INFO [ update_slots] kv cache rm [p0, end) | tid="139770035781632" timestamp=1747846205 id_slot=0 id_task=0 p0=0
INFO [ update_slots] kv cache rm [p0, end) | tid="139770035781632" timestamp=1747846249 id_slot=0 id_task=0 p0=2048
INFO [ update_slots] kv cache rm [p0, end) | tid="139770035781632" timestamp=1747846293 id_slot=0 id_task=0 p0=4096
INFO [ update_slots] kv cache rm [p0, end) | tid="139770035781632" timestamp=1747846338 id_slot=0 id_task=0 p0=6144
CUDA error: an illegal memory access was encountered
current device: 2, in function ggml_backend_sched_compute_splits at /home/user/ik_llama.cpp/ggml/src/ggml-backend.c:1835
cudaStreamSynchronize
========================== CUDA trace: 2460820 previous calls
2460819: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
2460818: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
2460817: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
2460816: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2460815: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2460814: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2460813: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2460812: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
2460811: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2460810: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
2460809: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
2460808: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
2460807: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2460806: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2460805: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2460804: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2460803: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
2460802: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2460801: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
2460800: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
2460799: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
2460798: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2460797: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2460796: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2460795: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2460794: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
2460793: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2460792: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 135
2460791: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2460790: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3074
2460789: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3071
2460788: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3061
2460787: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
```
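The `kv cache rm` lines in the log above give a rough idea of how far the run got before the crash. This is a minimal sketch that assumes each `p0` value marks the token position where the next batch starts, and that the one-second timestamp resolution is good enough for an estimate; `batch_rates` is a hypothetical helper.

```python
# (timestamp, p0) pairs read off the "kv cache rm" log lines above.
samples = [
    (1747846205, 0),
    (1747846249, 2048),
    (1747846293, 4096),
    (1747846338, 6144),
]


def batch_rates(samples):
    """Tokens per second between consecutive log lines."""
    rates = []
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        rates.append((p1 - p0) / (t1 - t0))
    return rates


# Seconds between the first and the last logged batch start.
elapsed = samples[-1][0] - samples[0][0]

print([round(r, 1) for r in batch_rates(samples)])
print(elapsed)
```

With these numbers, three batches of 2048 tokens completed over 133 seconds before the failing one.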
---
👤 **ikawrakow** commented the **2025-05-22** at **06:44:46**:
If you are not tired of testing, there are new changes on [#442](https://github.com/ikawrakow/ik_llama.cpp/pull/442).
---
👤 **schynce** commented the **2025-05-22** at **07:43:25**:
> If you are not tired of testing, there are new changes on [#442](https://github.com/ikawrakow/ik_llama.cpp/pull/442)

Not even close to being tired yet, thank you for taking the time to look into this :)
I ran this command:
```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fa -fmoe -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16)\.=CUDA0" \
-ot "blk\.(17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33)\.=CUDA1" \
-ot "blk\.(34|35|36|37|38|39|40|41|42|43|44|45|46|47)\.=CUDA2"
```
During context processing, the console was getting spammed with the `ggml_backend_cuda_synchronize` and `ggml_backend_cuda_cpy_tensor_async` lines. At the end of prompt processing (I assume), it crashed like before:
```
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_sched_compute_splits at /home/user/ik_llama.cpp/ggml/src/ggml-backend.c:1835
cudaStreamSynchronize
========================== CUDA trace: 2486495 previous calls
2486494: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3070
2486493: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3055
2486492: function ggml_backend_cuda_synchronize, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3120
2486491: function ggml_backend_sched_compute_splits, file /home/user/ik_llama.cpp/ggml/src/ggml-backend.c, line 1828
2486490: function ggml_backend_cuda_synchronize, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3107
2486489: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2774
2486488: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2765
2486487: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1756
2486486: function ggml_cuda_op_mul_mat_vec_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu, line 593
2486485: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2486484: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2486483: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2486482: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2486481: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
2486480: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2486479: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2744
2486478: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2740
2486477: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1756
2486476: function ggml_cuda_op_mul_mat_vec_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu, line 593
2486475: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2486474: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2486473: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2486472: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2486471: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
2486470: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2486469: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2736
2486468: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1756
2486467: function ggml_cuda_op_mul_mat_vec_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu, line 593
2486466: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2486465: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2486464: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
2486463: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
2486462: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3070
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
```
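When the console is flooded with these diagnostics, a tally per device pair is easier to read than the raw stream. The sketch below is only a log-parsing aid and assumes the exact message wording shown in the output above, including the `curent` spelling printed by the debug branch; `tally` is a hypothetical helper.

```python
import re
from collections import Counter

# Message formats copied from the log above ("curent" is how the log spells it).
COPY_WARNING = re.compile(
    r"ggml_backend_cuda_cpy_tensor_async: attempt to copy from device (\d+) "
    r"to device (\d+) without access enabled"
)
SYNC_MISMATCH = re.compile(
    r"ggml_backend_cuda_synchronize: curent device is (\d+), context device is (\d+)"
)


def tally(log_text):
    """Count copy warnings per (src, dst) and sync mismatches per (current, context)."""
    copies = Counter()
    mismatches = Counter()
    for line in log_text.splitlines():
        m = COPY_WARNING.search(line)
        if m:
            copies[(int(m.group(1)), int(m.group(2)))] += 1
            continue
        m = SYNC_MISMATCH.search(line)
        if m:
            mismatches[(int(m.group(1)), int(m.group(2)))] += 1
    return copies, mismatches


sample = """\
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
"""
copies, mismatches = tally(sample)
print(dict(copies))
print(dict(mismatches))
```

In the excerpt pasted above, only the pairs (0, 2) and (2, 0) appear. Running the tally over the full console output would show whether device 1 is ever involved.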