
### 🐛 [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398) - Bug: -fmoe causing illegal memory access

| **Author** | `pt13762104` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-05-08 |
| **Updated** | 2025-05-23 |

---

#### Description

### What happened?

When I run Qwen3-30B-A3B with `-fmoe`, an "illegal memory access" error always occurs after a short period of time. Without `-fmoe`, it works fine.
I'm not sure if this is GPU-related.

### Name and Version

version: 3673 (4084ca73)
built with gcc-14 (Homebrew GCC 14.2.0_1) 14.2.0 for x86_64-pc-linux-gnu


### What operating system are you seeing the problem on?

Linux

### Relevant log output

```shell
INFO [                    main] build info | tid="133287468544000" timestamp=1746695902 build=3673 commit="4084ca73"
INFO [                    main] system info | tid="133287468544000" timestamp=1746695902 n_threads=2 n_threads_batch=-1 total_threads=4 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 35 key-value pairs and 579 tensors from /root/Qwen3-30B-A3B-UD-Q4_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3-30B-A3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3-30B-A3B
llama_model_loader: - kv   4:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   5:                         general.size_label str              = 30B-A3B
llama_model_loader: - kv   6:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   7:                       qwen3moe.block_count u32              = 48
llama_model_loader: - kv   8:                    qwen3moe.context_length u32              = 40960
llama_model_loader: - kv   9:                  qwen3moe.embedding_length u32              = 2048
llama_model_loader: - kv  10:               qwen3moe.feed_forward_length u32              = 6144
llama_model_loader: - kv  11:              qwen3moe.attention.head_count u32              = 32
llama_model_loader: - kv  12:           qwen3moe.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:                    qwen3moe.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  14:  qwen3moe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                 qwen3moe.expert_used_count u32              = 8
llama_model_loader: - kv  16:              qwen3moe.attention.key_length u32              = 128
llama_model_loader: - kv  17:            qwen3moe.attention.value_length u32              = 128
llama_model_loader: - kv  18:                      qwen3moe.expert_count u32              = 128
llama_model_loader: - kv  19:        qwen3moe.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  26:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                          general.file_type u32              = 15
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = Qwen3-30B-A3B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3-30B-A3B.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 384
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 32
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q4_K:  290 tensors
llama_model_loader: - type q5_K:   37 tensors
llama_model_loader: - type q6_K:   11 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 6144
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 30.532 B
llm_load_print_meta: model size       = 16.493 GiB (4.640 BPW)
llm_load_print_meta: repeating layers = 16.093 GiB (4.622 BPW, 29.910 B parameters)
llm_load_print_meta: general.name     = Qwen3-30B-A3B
llm_load_print_meta: BOS token        = 11 ','
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151654 '<|vision_pad|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
  Device 1: Tesla T4, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.76 MiB
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors:        CPU buffer size =   166.92 MiB
llm_load_tensors:      CUDA0 buffer size =  8509.23 MiB
llm_load_tensors:      CUDA1 buffer size =  8213.14 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1600.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =  1472.00 MiB
llama_new_context_with_model: KV self size  = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   368.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   444.77 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   260.02 MiB
llama_new_context_with_model: graph nodes  = 1878
llama_new_context_with_model: graph splits = 3
INFO [                    init] initializing slots | tid="133287468544000" timestamp=1746695910 n_slots=1
INFO [                    init] new slot | tid="133287468544000" timestamp=1746695910 id_slot=0 n_ctx_slot=32768
INFO [                    main] model loaded | tid="133287468544000" timestamp=1746695910
INFO [                    main] chat template | tid="133287468544000" timestamp=1746695910 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [                    main] HTTP server listening | tid="133287468544000" timestamp=1746695910 n_threads_http="3" port="8080" hostname="127.0.0.1"
INFO [            update_slots] all slots are idle | tid="133287468544000" timestamp=1746695910
INFO [   launch_slot_with_task] slot is processing task | tid="133287468544000" timestamp=1746695926 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="133287468544000" timestamp=1746695926 id_slot=0 id_task=0 p0=0
INFO [           print_timings] prompt eval time     =    1428.08 ms /   756 tokens (    1.89 ms per token,   529.38 tokens per second) | tid="133287468544000" timestamp=1746695972 id_slot=0 id_task=0 t_prompt_processing=1428.075 n_prompt_tokens_processed=756 t_token=1.8889880952380953 n_tokens_second=529.383960926422
INFO [           print_timings] generation eval time =   44081.50 ms /  2038 runs   (   21.63 ms per token,    46.23 tokens per second) | tid="133287468544000" timestamp=1746695972 id_slot=0 id_task=0 t_token_generation=44081.501 n_decoded=2038 t_token=21.629784592737977 n_tokens_second=46.23254548432914
INFO [           print_timings]           total time =   45509.58 ms | tid="133287468544000" timestamp=1746695972 id_slot=0 id_task=0 t_prompt_processing=1428.075 t_token_generation=44081.501 t_total=45509.575999999994
INFO [            update_slots] slot released | tid="133287468544000" timestamp=1746695972 id_slot=0 id_task=0 n_ctx=32768 n_past=2793 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="133287468544000" timestamp=1746695972
INFO [      log_server_request] request | tid="133286382788608" timestamp=1746695972 remote_addr="127.0.0.1" remote_port=51948 status=200 method="POST" path="/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="133287468544000" timestamp=1746695972
INFO [   launch_slot_with_task] slot is processing task | tid="133287468544000" timestamp=1746695989 id_slot=0 id_task=2040
INFO [            update_slots] kv cache rm [p0, end) | tid="133287468544000" timestamp=1746695989 id_slot=0 id_task=2040 p0=0
INFO [           print_timings] prompt eval time     =    2259.97 ms /  1480 tokens (    1.53 ms per token,   654.88 tokens per second) | tid="133287468544000" timestamp=1746696002 id_slot=0 id_task=2040 t_prompt_processing=2259.965 n_prompt_tokens_processed=1480 t_token=1.5270033783783785 n_tokens_second=654.8773985437828
INFO [           print_timings] generation eval time =   10276.92 ms /   407 runs   (   25.25 ms per token,    39.60 tokens per second) | tid="133287468544000" timestamp=1746696002 id_slot=0 id_task=2040 t_token_generation=10276.922 n_decoded=407 t_token=25.250422604422607 n_tokens_second=39.603297563219805
INFO [           print_timings]           total time =   12536.89 ms | tid="133287468544000" timestamp=1746696002 id_slot=0 id_task=2040 t_prompt_processing=2259.965 t_token_generation=10276.922 t_total=12536.887
INFO [            update_slots] slot released | tid="133287468544000" timestamp=1746696002 id_slot=0 id_task=2040 n_ctx=32768 n_past=1886 n_system_tokens=0 n_cache_tokens=0 truncated=false
INFO [            update_slots] all slots are idle | tid="133287468544000" timestamp=1746696002
INFO [      log_server_request] request | tid="133286374395904" timestamp=1746696002 remote_addr="127.0.0.1" remote_port=36728 status=200 method="POST" path="/chat/completions" params={}
INFO [            update_slots] all slots are idle | tid="133287468544000" timestamp=1746696002
INFO [   launch_slot_with_task] slot is processing task | tid="133287468544000" timestamp=1746696077 id_slot=0 id_task=2449
INFO [            update_slots] kv cache rm [p0, end) | tid="133287468544000" timestamp=1746696077 id_slot=0 id_task=2449 p0=0
CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_cuda_up_gate_unary at /kaggle/working/ik_llama.cpp/ggml/src/ggml-cuda.cu:2555
  cudaMemcpyAsync(ids_host.data(), ids_dev, ggml_nbytes(ids), cudaMemcpyDeviceToHost, stream)
/kaggle/working/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
```

---

#### 💬 Conversation

👤 **ikawrakow** commented the **2025-05-08** at **11:11:23**:<br>

Can you add the command line you used? Thanks.

---

👤 **pt13762104** commented the **2025-05-08** at **14:15:50**:<br>

`ik_llama.cpp/build/bin/llama-server -m /root/Qwen3-30B-A3B-UD-Q4_K_XL.gguf -c 32768 -fmoe -fa -ngl 99`
The crash shows up within 2-3 prompts. Maybe it's related to the fact that the T4 doesn't have BF16 capability?

---

👤 **ikawrakow** commented the **2025-05-08** at **14:42:29**:<br>

It is more likely due to a bug that shows up in a multi-GPU setup that I cannot debug because I only have a single GPU.

I have a single 16 GB GPU and run Qwen3-30B-A3B with pretty good performance, using tensor overrides to keep part of the layers on the CPU. For instance,
```
./bin/llama-server -m model -t 16 -ngl 100 -fa -fmoe -rtr -c 32768 -ot "blk\.[3-4][0-9]\.ffn=CPU"
```
With my Ryzen-7950X CPU the above gives me better performance (~60 t/s) than uploading 35 layers to the GPU (~40 t/s).

If you are up for experimenting, you could try something like the above to run on a single GPU. If that works, it would confirm an issue with `-fmoe` with multiple GPUs. You need to use
```
 -ot "blk\.[3-4][0-9]\.ffn=CPU,.*=CUDA0"
```
to keep the FFN tensors of layers 30 and up on the CPU and put everything else (the first 30 layers included) on the first GPU.
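
To see which tensors an override string like the one above would catch, here is a minimal Python sketch. It is not ik_llama.cpp code: `parse_overrides` and `buffer_for` are hypothetical helpers, and the sketch assumes that overrides are tried in the order given, that the first pattern found anywhere in the tensor name wins, and that unmatched tensors keep their default placement. The tensor names are taken from the loader output in this thread.

```python
import re

# Hypothetical helpers for illustration only; they mimic an assumed
# "first matching pattern wins" rule for a comma-separated override string.
def parse_overrides(spec):
    """Split 'regex=BUFFER,regex=BUFFER' into ordered (compiled_regex, buffer) pairs."""
    pairs = []
    for item in spec.split(","):
        pattern, buffer = item.rsplit("=", 1)  # buffer name follows the last '='
        pairs.append((re.compile(pattern), buffer))
    return pairs

def buffer_for(tensor_name, overrides, default="default"):
    """Return the buffer of the first pattern that occurs in the tensor name."""
    for regex, buffer in overrides:
        if regex.search(tensor_name):
            return buffer
    return default

overrides = parse_overrides(r"blk\.[3-4][0-9]\.ffn=CPU,.*=CUDA0")

for name in [
    "blk.0.ffn_up_exps.weight",
    "blk.29.ffn_gate_exps.weight",
    "blk.30.attn_q.weight",
    "blk.30.ffn_down_exps.weight",
    "blk.47.ffn_norm.weight",
    "output.weight",
]:
    print(f"{name:32s} -> {buffer_for(name, overrides)}")
```

Under these assumptions only tensors whose names contain `ffn` in blocks 30-49 are routed to the CPU; attention tensors of those blocks and all other tensors fall through to the `.*=CUDA0` catch-all.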

---

👤 **pt13762104** commented the **2025-05-09** at **01:35:39**:<br>

I can't even try this:
```
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen3moe
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 40960
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 6144
llm_load_print_meta: n_expert         = 128
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 40960
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 30.532 B
llm_load_print_meta: model size       = 16.493 GiB (4.640 BPW)
llm_load_print_meta: repeating layers = 16.093 GiB (4.622 BPW, 29.910 B parameters)
llm_load_print_meta: general.name     = Qwen3-30B-A3B
llm_load_print_meta: BOS token        = 11 ','
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151654 '<|vision_pad|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp         = 768
llm_load_tensors: ggml ctx size =    0.76 MiB
Tensor token_embd.weight buffer type overriden to CUDA0
Tensor output_norm.weight buffer type overriden to CUDA0
Tensor output.weight buffer type overriden to CUDA0
Tensor blk.0.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.0.attn_q.weight buffer type overriden to CUDA0
Tensor blk.0.attn_k.weight buffer type overriden to CUDA0
Tensor blk.0.attn_v.weight buffer type overriden to CUDA0
Tensor blk.0.attn_output.weight buffer type overriden to CUDA0
Tensor blk.0.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.0.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.0.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.1.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.1.attn_q.weight buffer type overriden to CUDA0
Tensor blk.1.attn_k.weight buffer type overriden to CUDA0
Tensor blk.1.attn_v.weight buffer type overriden to CUDA0
Tensor blk.1.attn_output.weight buffer type overriden to CUDA0
Tensor blk.1.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.1.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.1.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.2.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.2.attn_q.weight buffer type overriden to CUDA0
Tensor blk.2.attn_k.weight buffer type overriden to CUDA0
Tensor blk.2.attn_v.weight buffer type overriden to CUDA0
Tensor blk.2.attn_output.weight buffer type overriden to CUDA0
Tensor blk.2.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.2.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.2.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.3.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.attn_q.weight buffer type overriden to CUDA0
Tensor blk.3.attn_k.weight buffer type overriden to CUDA0
Tensor blk.3.attn_v.weight buffer type overriden to CUDA0
Tensor blk.3.attn_output.weight buffer type overriden to CUDA0
Tensor blk.3.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.3.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.4.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.4.attn_q.weight buffer type overriden to CUDA0
Tensor blk.4.attn_k.weight buffer type overriden to CUDA0
Tensor blk.4.attn_v.weight buffer type overriden to CUDA0
Tensor blk.4.attn_output.weight buffer type overriden to CUDA0
Tensor blk.4.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.4.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.5.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.5.attn_q.weight buffer type overriden to CUDA0
Tensor blk.5.attn_k.weight buffer type overriden to CUDA0
Tensor blk.5.attn_v.weight buffer type overriden to CUDA0
Tensor blk.5.attn_output.weight buffer type overriden to CUDA0
Tensor blk.5.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.5.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.5.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.6.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.6.attn_q.weight buffer type overriden to CUDA0
Tensor blk.6.attn_k.weight buffer type overriden to CUDA0
Tensor blk.6.attn_v.weight buffer type overriden to CUDA0
Tensor blk.6.attn_output.weight buffer type overriden to CUDA0
Tensor blk.6.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.6.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.7.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.7.attn_q.weight buffer type overriden to CUDA0
Tensor blk.7.attn_k.weight buffer type overriden to CUDA0
Tensor blk.7.attn_v.weight buffer type overriden to CUDA0
Tensor blk.7.attn_output.weight buffer type overriden to CUDA0
Tensor blk.7.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.7.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.7.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.8.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.8.attn_q.weight buffer type overriden to CUDA0
Tensor blk.8.attn_k.weight buffer type overriden to CUDA0
Tensor blk.8.attn_v.weight buffer type overriden to CUDA0
Tensor blk.8.attn_output.weight buffer type overriden to CUDA0
Tensor blk.8.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.8.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.8.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.9.attn_norm.weight buffer type overriden to CUDA0
Tensor blk.9.attn_q.weight buffer type overriden to CUDA0
Tensor blk.9.attn_k.weight buffer type overriden to CUDA0
Tensor blk.9.attn_v.weight buffer type overriden to CUDA0
Tensor blk.9.attn_output.weight buffer type overriden to CUDA0
Tensor blk.9.attn_k_norm.weight buffer type overriden to CUDA0
Tensor blk.9.attn_q_norm.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_norm.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_down_exps.weight buffer type overriden to CUDA0
Tensor blk.9.ffn_up_exps.weight buffer type overriden to CUDA0
Tensor blk.10.attn_norm.weight buffer type overriden to CPU
Tensor blk.10.attn_q.weight buffer type overriden to CPU
Tensor blk.10.attn_k.weight buffer type overriden to CPU
Tensor blk.10.attn_v.weight buffer type overriden to CPU
Tensor blk.10.attn_output.weight buffer type overriden to CPU
Tensor blk.10.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.10.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.10.ffn_norm.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.11.attn_norm.weight buffer type overriden to CPU
Tensor blk.11.attn_q.weight buffer type overriden to CPU
Tensor blk.11.attn_k.weight buffer type overriden to CPU
Tensor blk.11.attn_v.weight buffer type overriden to CPU
Tensor blk.11.attn_output.weight buffer type overriden to CPU
Tensor blk.11.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.11.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.11.ffn_norm.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.attn_norm.weight buffer type overriden to CPU
Tensor blk.12.attn_q.weight buffer type overriden to CPU
Tensor blk.12.attn_k.weight buffer type overriden to CPU
Tensor blk.12.attn_v.weight buffer type overriden to CPU
Tensor blk.12.attn_output.weight buffer type overriden to CPU
Tensor blk.12.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.12.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.12.ffn_norm.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.13.attn_norm.weight buffer type overriden to CPU
Tensor blk.13.attn_q.weight buffer type overriden to CPU
Tensor blk.13.attn_k.weight buffer type overriden to CPU
Tensor blk.13.attn_v.weight buffer type overriden to CPU
Tensor blk.13.attn_output.weight buffer type overriden to CPU
Tensor blk.13.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.13.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.13.ffn_norm.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.14.attn_norm.weight buffer type overriden to CPU
Tensor blk.14.attn_q.weight buffer type overriden to CPU
Tensor blk.14.attn_k.weight buffer type overriden to CPU
Tensor blk.14.attn_v.weight buffer type overriden to CPU
Tensor blk.14.attn_output.weight buffer type overriden to CPU
Tensor blk.14.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.14.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.14.ffn_norm.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.15.attn_norm.weight buffer type overriden to CPU
Tensor blk.15.attn_q.weight buffer type overriden to CPU
Tensor blk.15.attn_k.weight buffer type overriden to CPU
Tensor blk.15.attn_v.weight buffer type overriden to CPU
Tensor blk.15.attn_output.weight buffer type overriden to CPU
Tensor blk.15.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.15.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.15.ffn_norm.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.16.attn_norm.weight buffer type overriden to CPU
Tensor blk.16.attn_q.weight buffer type overriden to CPU
Tensor blk.16.attn_k.weight buffer type overriden to CPU
Tensor blk.16.attn_v.weight buffer type overriden to CPU
Tensor blk.16.attn_output.weight buffer type overriden to CPU
Tensor blk.16.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.16.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.16.ffn_norm.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.17.attn_norm.weight buffer type overriden to CPU
Tensor blk.17.attn_q.weight buffer type overriden to CPU
Tensor blk.17.attn_k.weight buffer type overriden to CPU
Tensor blk.17.attn_v.weight buffer type overriden to CPU
Tensor blk.17.attn_output.weight buffer type overriden to CPU
Tensor blk.17.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.17.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.17.ffn_norm.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.18.attn_norm.weight buffer type overriden to CPU
Tensor blk.18.attn_q.weight buffer type overriden to CPU
Tensor blk.18.attn_k.weight buffer type overriden to CPU
Tensor blk.18.attn_v.weight buffer type overriden to CPU
Tensor blk.18.attn_output.weight buffer type overriden to CPU
Tensor blk.18.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.18.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.18.ffn_norm.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.19.attn_norm.weight buffer type overriden to CPU
Tensor blk.19.attn_q.weight buffer type overriden to CPU
Tensor blk.19.attn_k.weight buffer type overriden to CPU
Tensor blk.19.attn_v.weight buffer type overriden to CPU
Tensor blk.19.attn_output.weight buffer type overriden to CPU
Tensor blk.19.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.19.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.19.ffn_norm.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.20.attn_norm.weight buffer type overriden to CPU
Tensor blk.20.attn_q.weight buffer type overriden to CPU
Tensor blk.20.attn_k.weight buffer type overriden to CPU
Tensor blk.20.attn_v.weight buffer type overriden to CPU
Tensor blk.20.attn_output.weight buffer type overriden to CPU
Tensor blk.20.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.20.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.20.ffn_norm.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.21.attn_norm.weight buffer type overriden to CPU
Tensor blk.21.attn_q.weight buffer type overriden to CPU
Tensor blk.21.attn_k.weight buffer type overriden to CPU
Tensor blk.21.attn_v.weight buffer type overriden to CPU
Tensor blk.21.attn_output.weight buffer type overriden to CPU
Tensor blk.21.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.21.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.21.ffn_norm.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.22.attn_norm.weight buffer type overriden to CPU
Tensor blk.22.attn_q.weight buffer type overriden to CPU
Tensor blk.22.attn_k.weight buffer type overriden to CPU
Tensor blk.22.attn_v.weight buffer type overriden to CPU
Tensor blk.22.attn_output.weight buffer type overriden to CPU
Tensor blk.22.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.22.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.22.ffn_norm.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.23.attn_norm.weight buffer type overriden to CPU
Tensor blk.23.attn_q.weight buffer type overriden to CPU
Tensor blk.23.attn_k.weight buffer type overriden to CPU
Tensor blk.23.attn_v.weight buffer type overriden to CPU
Tensor blk.23.attn_output.weight buffer type overriden to CPU
Tensor blk.23.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.23.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.23.ffn_norm.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.24.attn_norm.weight buffer type overriden to CPU
Tensor blk.24.attn_q.weight buffer type overriden to CPU
Tensor blk.24.attn_k.weight buffer type overriden to CPU
Tensor blk.24.attn_v.weight buffer type overriden to CPU
Tensor blk.24.attn_output.weight buffer type overriden to CPU
Tensor blk.24.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.24.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.24.ffn_norm.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.25.attn_norm.weight buffer type overriden to CPU
Tensor blk.25.attn_q.weight buffer type overriden to CPU
Tensor blk.25.attn_k.weight buffer type overriden to CPU
Tensor blk.25.attn_v.weight buffer type overriden to CPU
Tensor blk.25.attn_output.weight buffer type overriden to CPU
Tensor blk.25.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.25.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.25.ffn_norm.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.26.attn_norm.weight buffer type overriden to CPU
Tensor blk.26.attn_q.weight buffer type overriden to CPU
Tensor blk.26.attn_k.weight buffer type overriden to CPU
Tensor blk.26.attn_v.weight buffer type overriden to CPU
Tensor blk.26.attn_output.weight buffer type overriden to CPU
Tensor blk.26.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.26.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.26.ffn_norm.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.27.attn_norm.weight buffer type overriden to CPU
Tensor blk.27.attn_q.weight buffer type overriden to CPU
Tensor blk.27.attn_k.weight buffer type overriden to CPU
Tensor blk.27.attn_v.weight buffer type overriden to CPU
Tensor blk.27.attn_output.weight buffer type overriden to CPU
Tensor blk.27.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.27.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.27.ffn_norm.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.28.attn_norm.weight buffer type overriden to CPU
Tensor blk.28.attn_q.weight buffer type overriden to CPU
Tensor blk.28.attn_k.weight buffer type overriden to CPU
Tensor blk.28.attn_v.weight buffer type overriden to CPU
Tensor blk.28.attn_output.weight buffer type overriden to CPU
Tensor blk.28.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.28.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.28.ffn_norm.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.29.attn_norm.weight buffer type overriden to CPU
Tensor blk.29.attn_q.weight buffer type overriden to CPU
Tensor blk.29.attn_k.weight buffer type overriden to CPU
Tensor blk.29.attn_v.weight buffer type overriden to CPU
Tensor blk.29.attn_output.weight buffer type overriden to CPU
Tensor blk.29.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.29.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.29.ffn_norm.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.30.attn_norm.weight buffer type overriden to CPU
Tensor blk.30.attn_q.weight buffer type overriden to CPU
Tensor blk.30.attn_k.weight buffer type overriden to CPU
Tensor blk.30.attn_v.weight buffer type overriden to CPU
Tensor blk.30.attn_output.weight buffer type overriden to CPU
Tensor blk.30.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.30.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.30.ffn_norm.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.31.attn_norm.weight buffer type overriden to CPU
Tensor blk.31.attn_q.weight buffer type overriden to CPU
Tensor blk.31.attn_k.weight buffer type overriden to CPU
Tensor blk.31.attn_v.weight buffer type overriden to CPU
Tensor blk.31.attn_output.weight buffer type overriden to CPU
Tensor blk.31.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.31.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.31.ffn_norm.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.32.attn_norm.weight buffer type overriden to CPU
Tensor blk.32.attn_q.weight buffer type overriden to CPU
Tensor blk.32.attn_k.weight buffer type overriden to CPU
Tensor blk.32.attn_v.weight buffer type overriden to CPU
Tensor blk.32.attn_output.weight buffer type overriden to CPU
Tensor blk.32.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.32.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.32.ffn_norm.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.33.attn_norm.weight buffer type overriden to CPU
Tensor blk.33.attn_q.weight buffer type overriden to CPU
Tensor blk.33.attn_k.weight buffer type overriden to CPU
Tensor blk.33.attn_v.weight buffer type overriden to CPU
Tensor blk.33.attn_output.weight buffer type overriden to CPU
Tensor blk.33.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.33.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.33.ffn_norm.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.34.attn_norm.weight buffer type overriden to CPU
Tensor blk.34.attn_q.weight buffer type overriden to CPU
Tensor blk.34.attn_k.weight buffer type overriden to CPU
Tensor blk.34.attn_v.weight buffer type overriden to CPU
Tensor blk.34.attn_output.weight buffer type overriden to CPU
Tensor blk.34.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.34.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.34.ffn_norm.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.35.attn_norm.weight buffer type overriden to CPU
Tensor blk.35.attn_q.weight buffer type overriden to CPU
Tensor blk.35.attn_k.weight buffer type overriden to CPU
Tensor blk.35.attn_v.weight buffer type overriden to CPU
Tensor blk.35.attn_output.weight buffer type overriden to CPU
Tensor blk.35.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.35.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.35.ffn_norm.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.36.attn_norm.weight buffer type overriden to CPU
Tensor blk.36.attn_q.weight buffer type overriden to CPU
Tensor blk.36.attn_k.weight buffer type overriden to CPU
Tensor blk.36.attn_v.weight buffer type overriden to CPU
Tensor blk.36.attn_output.weight buffer type overriden to CPU
Tensor blk.36.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.36.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.36.ffn_norm.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.37.attn_norm.weight buffer type overriden to CPU
Tensor blk.37.attn_q.weight buffer type overriden to CPU
Tensor blk.37.attn_k.weight buffer type overriden to CPU
Tensor blk.37.attn_v.weight buffer type overriden to CPU
Tensor blk.37.attn_output.weight buffer type overriden to CPU
Tensor blk.37.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.37.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.37.ffn_norm.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.38.attn_norm.weight buffer type overriden to CPU
Tensor blk.38.attn_q.weight buffer type overriden to CPU
Tensor blk.38.attn_k.weight buffer type overriden to CPU
Tensor blk.38.attn_v.weight buffer type overriden to CPU
Tensor blk.38.attn_output.weight buffer type overriden to CPU
Tensor blk.38.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.38.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.38.ffn_norm.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.39.attn_norm.weight buffer type overriden to CPU
Tensor blk.39.attn_q.weight buffer type overriden to CPU
Tensor blk.39.attn_k.weight buffer type overriden to CPU
Tensor blk.39.attn_v.weight buffer type overriden to CPU
Tensor blk.39.attn_output.weight buffer type overriden to CPU
Tensor blk.39.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.39.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.39.ffn_norm.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.40.attn_norm.weight buffer type overriden to CPU
Tensor blk.40.attn_q.weight buffer type overriden to CPU
Tensor blk.40.attn_k.weight buffer type overriden to CPU
Tensor blk.40.attn_v.weight buffer type overriden to CPU
Tensor blk.40.attn_output.weight buffer type overriden to CPU
Tensor blk.40.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.40.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.40.ffn_norm.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.41.attn_norm.weight buffer type overriden to CPU
Tensor blk.41.attn_q.weight buffer type overriden to CPU
Tensor blk.41.attn_k.weight buffer type overriden to CPU
Tensor blk.41.attn_v.weight buffer type overriden to CPU
Tensor blk.41.attn_output.weight buffer type overriden to CPU
Tensor blk.41.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.41.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.41.ffn_norm.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.42.attn_norm.weight buffer type overriden to CPU
Tensor blk.42.attn_q.weight buffer type overriden to CPU
Tensor blk.42.attn_k.weight buffer type overriden to CPU
Tensor blk.42.attn_v.weight buffer type overriden to CPU
Tensor blk.42.attn_output.weight buffer type overriden to CPU
Tensor blk.42.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.42.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.42.ffn_norm.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.43.attn_norm.weight buffer type overriden to CPU
Tensor blk.43.attn_q.weight buffer type overriden to CPU
Tensor blk.43.attn_k.weight buffer type overriden to CPU
Tensor blk.43.attn_v.weight buffer type overriden to CPU
Tensor blk.43.attn_output.weight buffer type overriden to CPU
Tensor blk.43.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.43.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.43.ffn_norm.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.44.attn_norm.weight buffer type overriden to CPU
Tensor blk.44.attn_q.weight buffer type overriden to CPU
Tensor blk.44.attn_k.weight buffer type overriden to CPU
Tensor blk.44.attn_v.weight buffer type overriden to CPU
Tensor blk.44.attn_output.weight buffer type overriden to CPU
Tensor blk.44.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.44.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.44.ffn_norm.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.45.attn_norm.weight buffer type overriden to CPU
Tensor blk.45.attn_q.weight buffer type overriden to CPU
Tensor blk.45.attn_k.weight buffer type overriden to CPU
Tensor blk.45.attn_v.weight buffer type overriden to CPU
Tensor blk.45.attn_output.weight buffer type overriden to CPU
Tensor blk.45.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.45.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.45.ffn_norm.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.46.attn_norm.weight buffer type overriden to CPU
Tensor blk.46.attn_q.weight buffer type overriden to CPU
Tensor blk.46.attn_k.weight buffer type overriden to CPU
Tensor blk.46.attn_v.weight buffer type overriden to CPU
Tensor blk.46.attn_output.weight buffer type overriden to CPU
Tensor blk.46.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.46.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.46.ffn_norm.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.47.attn_norm.weight buffer type overriden to CPU
Tensor blk.47.attn_q.weight buffer type overriden to CPU
Tensor blk.47.attn_k.weight buffer type overriden to CPU
Tensor blk.47.attn_v.weight buffer type overriden to CPU
Tensor blk.47.attn_output.weight buffer type overriden to CPU
Tensor blk.47.attn_k_norm.weight buffer type overriden to CPU
Tensor blk.47.attn_q_norm.weight buffer type overriden to CPU
Tensor blk.47.ffn_norm.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
llama_model_load: error loading model: failed to allocate buffer
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/root/Qwen3-30B-A3B-UD-Q4_K_XL.gguf'
 ERR [              load_model] unable to load model | tid="135803250569216" timestamp=1746754485 model="/root/Qwen3-30B-A3B-UD-Q4_K_XL.gguf"
munmap_chunk(): invalid pointer # could be free() or it just disappears
```

---

👤 **pt13762104** commented the **2025-05-09** at **01:36:06**:<br>

Removing `.*=CUDA0` fixed that

---

👤 **pt13762104** commented the **2025-05-09** at **01:36:06**:<br>

Let me try IQ4_K model instead.

---

👤 **pt13762104** commented the **2025-05-09** at **01:59:34**:<br>

@ikawrakow I haven't found issues while using -fmoe on 1 GPU. It seems like a multi-GPU issue, given that the error always occurs on device 1. The IQ4_K model doesn't seem to run into this bug.

---

👤 **Ph0rk0z** commented the **2025-05-09** at **11:52:43**:<br>

I'm not sure how it is done here but afaik, real cudaMemcpyAsync is not supported on SM75.

---

👤 **schynce** commented the **2025-05-12** at **18:47:03**:<br>

Hey @ikawrakow and @pt13762104,

I've been running into the exact same "illegal memory access" crash with 3x3090, but not with a specific quant.

I compiled ik_llama.cpp (4ba6bbb) like this:
```
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
cmake --build ./build --config Release -j $(nproc)
```

I have tested different quantizations from HuggingFace:

- IQ4_XS (unsloth/Qwen3-235B-A22B-GGUF)
- i1-Q4_K_S (mradermacher/Qwen3-235B-A22B-i1-GGUF)
- "mix-IQ3_K" (ubergarm/Qwen3-235B-A22B-GGUF)

Only the mix-IQ3_K seems to be working without crashing (and it is an ik_llama.cpp-specific quant). With the other two, the crash happens regardless of -fmoe. I can run the mix-IQ3_K quant with -fmoe without problems, like this:

```
./llama-server --model /mnt/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf --alias Qwen3-235B-A22B-mix-IQ3_K \
-fa -fmoe -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 --no-kv-offload \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20)\.=CUDA0" \
-ot "blk\.(21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41)\.=CUDA1" \
-ot "blk\.(42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57)\.=CUDA2"
```
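For readers unfamiliar with `-ot`, the following Python sketch shows how override patterns like the ones in the command above could map tensor names to devices. It is only an illustration: `block_range_pattern` and `assign` are hypothetical helpers, and the "first matching pattern wins" rule is an assumption about how the overrides are applied, not something confirmed in this thread.

```python
import re

def block_range_pattern(first, last):
    # Build the alternation that the command spells out by hand,
    # for example blocks 0..2 give: blk\.(0|1|2)\.
    ids = "|".join(str(i) for i in range(first, last + 1))
    return rf"blk\.({ids})\."

# Overrides from the mix-IQ3_K command, in command-line order.
overrides = [
    (r"blk\.\d+\.attn", "CUDA2"),
    (block_range_pattern(0, 20), "CUDA0"),
    (block_range_pattern(21, 41), "CUDA1"),
    (block_range_pattern(42, 57), "CUDA2"),
]

def assign(tensor_name, default="(no override)"):
    # ASSUMPTION: the first pattern that matches anywhere in the name wins.
    for pattern, device in overrides:
        if re.search(pattern, tensor_name):
            return device
    return default

for name in [
    "blk.3.attn_q.weight",
    "blk.3.ffn_up_exps.weight",
    "blk.30.ffn_down_exps.weight",
    "blk.57.ffn_gate_exps.weight",
    "blk.58.ffn_gate_exps.weight",
]:
    print(f"{name:32s} -> {assign(name)}")
```

Under that assumption, every attention tensor goes to CUDA2 regardless of its block number, and blocks above 57 match no override at all. The trailing `\.` in each range pattern matters: without it, the alternative `5` would also match the start of `blk.57.`.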

On the other hand, this crashes (even if I remove -fmoe):

```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fa -fmoe -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 --no-kv-offload \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.=CUDA0" \
-ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35)\.=CUDA1" \
-ot "blk\.(36|37|38|39|40|41|42|43|44|45|46|47|48|49|50)\.=CUDA2"
```

This is the crash:

```
INFO [      log_server_request] request | tid="140045957632000" timestamp=1746960702 remote_addr="127.0.0.1" remote_port=60492 status=200 method="GET" path="/v1/models" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="140048404189184" timestamp=1746960702 id_slot=0 id_task=373
INFO [            update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960702 id_slot=0 id_task=373 p0=3
INFO [      log_server_request] request | tid="140045940846592" timestamp=1746960722 remote_addr="127.0.0.1" remote_port=44428 status=200 method="GET" path="/v1/models" params={}
INFO [            update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960741 id_slot=0 id_task=373 p0=2051
INFO [            update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960774 id_slot=0 id_task=373 p0=4099
INFO [            update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960808 id_slot=0 id_task=373 p0=6147
CUDA error: an illegal memory access was encountered
  current device: 2, in function ggml_backend_cuda_synchronize at /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:3049
  cudaStreamSynchronize(cuda_ctx->stream())
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
```
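The `kv cache rm` lines in the log above also give a rough picture of how far prompt processing got before the crash. The sketch below parses the four lines copied from that log and estimates the per-batch speed. It assumes each line is written when a batch starts, so the gap between two lines approximates the time spent on the earlier batch; the timestamps only have one-second resolution, so the numbers are approximate.

```python
import re

LOG = """\
INFO [            update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960702 id_slot=0 id_task=373 p0=3
INFO [            update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960741 id_slot=0 id_task=373 p0=2051
INFO [            update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960774 id_slot=0 id_task=373 p0=4099
INFO [            update_slots] kv cache rm [p0, end) | tid="140048404189184" timestamp=1746960808 id_slot=0 id_task=373 p0=6147
"""

# Capture the Unix timestamp and the starting token position of each batch.
LINE = re.compile(r"kv cache rm.*?timestamp=(\d+).*?\bp0=(\d+)")

def parse(log):
    points = []
    for line in log.splitlines():
        m = LINE.search(line)
        if m:
            points.append((int(m.group(1)), int(m.group(2))))
    return points

def batch_rates(points):
    # Tokens advanced between consecutive log lines, divided by elapsed seconds.
    rates = []
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        if t1 > t0:
            rates.append((p1 - p0) / (t1 - t0))
    return rates

points = parse(LOG)
for (t0, p0), rate in zip(points, batch_rates(points)):
    print(f"batch starting at p0={p0}: about {rate:.1f} tokens/s")
print("last batch logged before the crash started at token", points[-1][1])
```

Read this way, the prompt was processed in steps of 2048 tokens at roughly 50 to 60 tokens/s, and the illegal memory access was reported while the batch starting at token 6147 was in flight.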

For me, the crashing device is 2. It seems to be changing depending on the offloaded layers?

I would be happy to provide logs or test specific configurations to help debug this.

---

👤 **Ph0rk0z** commented the **2025-05-13** at **11:51:23**:<br>

Oh snap.. that's the FA error?! Try without flash attention and see if it still crashes.

---

👤 **ikawrakow** commented the **2025-05-13** at **12:33:36**:<br>

> Only the mix-IQ3_K seems to be working without crashing (and it is an ik_llama.cpp-specific quant). With the other two, the crash happens regardless of -fmoe. I can run the mix-IQ3_K quant with -fmoe without problems, like this:

This is useful info. The `IQX_K` quants do not have a quantized matrix multiplication implementation, so matrix multiplications are computed via `dequantize -> cuBLAS`. If the illegal memory access does not occur in that case, it would indicate a problem in the quantized matrix multiplication implementation.
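To make the distinction between the two code paths concrete, here is a toy Python sketch. It is not the ik_llama.cpp kernels: the block size of 32, the symmetric int8 scheme, and all function names are assumptions chosen for illustration. The first path expands the quantized weights to floats and does an ordinary float dot product (the role cuBLAS plays above); the second quantizes the activations as well and accumulates integer products per block.

```python
import random

BLOCK = 32  # toy block size, assumed for illustration

def quantize_blocks(values):
    # Symmetric int8 block quantization: one float scale per block.
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        blocks.append((scale, [round(v / scale) for v in chunk]))
    return blocks

def dequantize_blocks(blocks):
    return [scale * q for scale, quants in blocks for q in quants]

def dot_via_dequantize(w_blocks, x):
    # Path 1: dequantize the weights, then a plain float dot product.
    w = dequantize_blocks(w_blocks)
    return sum(wi * xi for wi, xi in zip(w, x))

def dot_quantized(w_blocks, x):
    # Path 2: quantize the activations too, sum integer products per block,
    # and apply both scales once per block.
    total = 0.0
    for (ws, wq), (xs, xq) in zip(w_blocks, quantize_blocks(x)):
        total += ws * xs * sum(a * b for a, b in zip(wq, xq))
    return total

random.seed(0)
n = 256
weights = [random.uniform(-1, 1) for _ in range(n)]
acts = [random.uniform(-1, 1) for _ in range(n)]
w_blocks = quantize_blocks(weights)

exact = sum(w * a for w, a in zip(weights, acts))
print("float reference :", exact)
print("dequantize path :", dot_via_dequantize(w_blocks, acts))
print("quantized path  :", dot_quantized(w_blocks, acts))
```

Both paths approximate the same result, but they exercise entirely different code, which is why a crash that appears only with one family of quants points at one of the two implementations.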

The problem is that I cannot trigger the bug on my single-GPU system. I need to get access to a multi-GPU system to be able to debug.

---

👤 **schynce** commented the **2025-05-13** at **22:33:11**:<br>

> Oh snap.. that's the FA error?! Try without flash attention and see if it still crashes.

I tested without -fa with the crashing IQ4_XS quant, like this:

```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fmoe -rtr -c 40960 --threads 7 --no-kv-offload \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.=CUDA0" \
-ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35)\.=CUDA1" \
-ot "blk\.(36|37|38|39|40|41|42|43|44|45|46|47|48|49|50)\.=CUDA2"
```

The prompt processing speed is absolutely glacial, but it does not seem to be crashing.

With flash attention enabled, long prompts seemed to crash it reliably before. So I ran the same 32K-token prompt I had used for the earlier tests through this configuration. It took almost an hour to complete, but did so without incident. I also chatted with it a bit.

---

👤 **Panchovix** commented the **2025-05-14** at **16:32:23**:<br>

Just chiming in: I get a CUDA illegal memory access when using -fmoe on DeepSeekV3 0324.

```
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   468.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   360.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   360.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   360.00 MiB
llama_kv_cache_init:      CUDA4 KV buffer size =   648.00 MiB
llama_new_context_with_model: KV self size  = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =  3520.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  1540.01 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =  1540.01 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =  1540.01 MiB
llama_new_context_with_model:      CUDA4 compute buffer size =  1540.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   312.02 MiB
llama_new_context_with_model: graph nodes  = 3304
llama_new_context_with_model: graph splits = 393
INFO [                    init] initializing slots | tid="140562497785856" timestamp=1747239254 n_slots=1
INFO [                    init] new slot | tid="140562497785856" timestamp=1747239254 id_slot=0 n_ctx_slot=32768
INFO [                    main] model loaded | tid="140562497785856" timestamp=1747239254
INFO [                    main] chat template | tid="140562497785856" timestamp=1747239254 chat_example="You are a helpful assistant\n\n<｜User｜>Hello<｜Assistant｜>Hi there<｜end▁of▁sentence｜><｜User｜>How are you?<｜Assistant｜>" built_in=true
INFO [                    main] HTTP server listening | tid="140562497785856" timestamp=1747239254 n_threads_http="15" port="8080" hostname="127.0.0.1"
INFO [            update_slots] all slots are idle | tid="140562497785856" timestamp=1747239254
INFO [   launch_slot_with_task] slot is processing task | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0 p0=0
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_cuda_op_mul_mat at /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:1743
  cudaGetLastError()
/run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
[New LWP 25355]
[New LWP 25354]
[New LWP 25353]
[New LWP 25352]
[New LWP 25351]
[New LWP 25350]
[New LWP 25349]
[New LWP 25348]
[New LWP 25347]
[New LWP 25346]
[New LWP 25345]
[New LWP 25344]
[New LWP 25343]
[New LWP 25342]
[New LWP 25341]
[New LWP 25340]
[New LWP 24655]
[New LWP 24654]
[New LWP 24653]
[New LWP 24652]
[New LWP 24651]
[New LWP 24650]
[New LWP 24649]
[New LWP 23954]
[New LWP 23953]
[New LWP 23952]
[New LWP 23951]
[New LWP 23950]
[New LWP 23949]
[New LWP 23948]
[New LWP 23947]
[New LWP 23942]
[New LWP 23941]
[New LWP 23940]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#0  0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#1  0x00007fd73d07b9da in __internal_syscall_cancel () from /lib64/libc.so.6
#2  0x00007fd73d07ba24 in __syscall_cancel () from /lib64/libc.so.6
#3  0x00007fd73d0eb5af in wait4 () from /lib64/libc.so.6
#4  0x00007fd741c58908 in ggml_abort () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
#5  0x00007fd741dded43 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
#6  0x00007fd741decb09 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.1] () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
#7  0x00007fd741df42dd in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
#8  0x00007fd741caf9b3 in ggml_backend_sched_graph_compute_async () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
#9  0x00007fd79656af1a in llama_decode () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/src/libllama.so
#10 0x000000000049a2d4 in server_context::update_slots() ()
#11 0x000000000046cafc in server_queue::start_loop() ()
#12 0x0000000000416977 in main ()
[Inferior 1 (process 23939) detached]
```

Ran it with

```
./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-00001-of-00007.gguf' -c 32768 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" -ot "blk.(7|8|9|10).ffn.=CUDA1" -ot "blk.(11|12|13|14).ffn.=CUDA2" -ot "blk.(15|16|17).ffn.=CUDA3"  -ot "blk.(18|19|20|21|22|23|24|25).ffn.=CUDA4" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048 -mla 1 -fmoe
```

Not using -fmoe makes it work without issues.

---

👤 **p4s2wd** commented the **2025-05-15** at **00:13:20**:<br>

> Just chiming in, I get a CUDA illegal memory access when using -fmoe on DeepSeekV3 0324
>
> ```
> llama_new_context_with_model: freq_scale = 0.025
> llama_kv_cache_init:      CUDA0 KV buffer size =   468.00 MiB
> llama_kv_cache_init:      CUDA1 KV buffer size =   360.00 MiB
> llama_kv_cache_init:      CUDA2 KV buffer size =   360.00 MiB
> llama_kv_cache_init:      CUDA3 KV buffer size =   360.00 MiB
> llama_kv_cache_init:      CUDA4 KV buffer size =   648.00 MiB
> llama_new_context_with_model: KV self size  = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
> llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
> llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
> llama_new_context_with_model:      CUDA0 compute buffer size =  3520.01 MiB
> llama_new_context_with_model:      CUDA1 compute buffer size =  1540.01 MiB
> llama_new_context_with_model:      CUDA2 compute buffer size =  1540.01 MiB
> llama_new_context_with_model:      CUDA3 compute buffer size =  1540.01 MiB
> llama_new_context_with_model:      CUDA4 compute buffer size =  1540.02 MiB
> llama_new_context_with_model:  CUDA_Host compute buffer size =   312.02 MiB
> llama_new_context_with_model: graph nodes  = 3304
> llama_new_context_with_model: graph splits = 393
> INFO [                    init] initializing slots | tid="140562497785856" timestamp=1747239254 n_slots=1
> INFO [                    init] new slot | tid="140562497785856" timestamp=1747239254 id_slot=0 n_ctx_slot=32768
> INFO [                    main] model loaded | tid="140562497785856" timestamp=1747239254
> INFO [                    main] chat template | tid="140562497785856" timestamp=1747239254 chat_example="You are a helpful assistant\n\n<｜User｜>Hello<｜Assistant｜>Hi there<｜end▁of▁sentence｜><｜User｜>How are you?<｜Assistant｜>" built_in=true
> INFO [                    main] HTTP server listening | tid="140562497785856" timestamp=1747239254 n_threads_http="15" port="8080" hostname="127.0.0.1"
> INFO [            update_slots] all slots are idle | tid="140562497785856" timestamp=1747239254
> INFO [   launch_slot_with_task] slot is processing task | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0
> INFO [            update_slots] kv cache rm [p0, end) | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0 p0=0
> CUDA error: an illegal memory access was encountered
>   current device: 0, in function ggml_cuda_op_mul_mat at /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:1743
>   cudaGetLastError()
> /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
> [New LWP 25355]
> [New LWP 25354]
> [New LWP 25353]
> [New LWP 25352]
> [New LWP 25351]
> [New LWP 25350]
> [New LWP 25349]
> [New LWP 25348]
> [New LWP 25347]
> [New LWP 25346]
> [New LWP 25345]
> [New LWP 25344]
> [New LWP 25343]
> [New LWP 25342]
> [New LWP 25341]
> [New LWP 25340]
> [New LWP 24655]
> [New LWP 24654]
> [New LWP 24653]
> [New LWP 24652]
> [New LWP 24651]
> [New LWP 24650]
> [New LWP 24649]
> [New LWP 23954]
> [New LWP 23953]
> [New LWP 23952]
> [New LWP 23951]
> [New LWP 23950]
> [New LWP 23949]
> [New LWP 23948]
> [New LWP 23947]
> [New LWP 23942]
> [New LWP 23941]
> [New LWP 23940]
>
> This GDB supports auto-downloading debuginfo from the following URLs:
>   <https://debuginfod.fedoraproject.org/>
> Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
> Debuginfod has been disabled.
> To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
> Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
> Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
> Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
> Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
> #0  0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
> #1  0x00007fd73d07b9da in __internal_syscall_cancel () from /lib64/libc.so.6
> #2  0x00007fd73d07ba24 in __syscall_cancel () from /lib64/libc.so.6
> #3  0x00007fd73d0eb5af in wait4 () from /lib64/libc.so.6
> #4  0x00007fd741c58908 in ggml_abort () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #5  0x00007fd741dded43 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #6  0x00007fd741decb09 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.1] () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #7  0x00007fd741df42dd in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #8  0x00007fd741caf9b3 in ggml_backend_sched_graph_compute_async () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #9  0x00007fd79656af1a in llama_decode () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/src/libllama.so
> #10 0x000000000049a2d4 in server_context::update_slots() ()
> #11 0x000000000046cafc in server_queue::start_loop() ()
> #12 0x0000000000416977 in main ()
> [Inferior 1 (process 23939) detached]
> ```
>
> Ran it with
>
> ```
> ./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-00001-of-00007.gguf' -c 32768 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" -ot "blk.(7|8|9|10).ffn.=CUDA1" -ot "blk.(11|12|13|14).ffn.=CUDA2" -ot "blk.(15|16|17).ffn.=CUDA3"  -ot "blk.(18|19|20|21|22|23|24|25).ffn.=CUDA4" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048 -mla 1 -fmoe
> ```
>
> Not using -fm

---

👤 **p4s2wd** commented the **2025-05-15** at **00:21:27**:<br>

> Just chiming in, I get a CUDA illegal memory access when using -fmoe on DeepSeekV3 0324
>
> ```
> llama_new_context_with_model: freq_scale = 0.025
> llama_kv_cache_init:      CUDA0 KV buffer size =   468.00 MiB
> llama_kv_cache_init:      CUDA1 KV buffer size =   360.00 MiB
> llama_kv_cache_init:      CUDA2 KV buffer size =   360.00 MiB
> llama_kv_cache_init:      CUDA3 KV buffer size =   360.00 MiB
> llama_kv_cache_init:      CUDA4 KV buffer size =   648.00 MiB
> llama_new_context_with_model: KV self size  = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
> llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
> llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
> llama_new_context_with_model:      CUDA0 compute buffer size =  3520.01 MiB
> llama_new_context_with_model:      CUDA1 compute buffer size =  1540.01 MiB
> llama_new_context_with_model:      CUDA2 compute buffer size =  1540.01 MiB
> llama_new_context_with_model:      CUDA3 compute buffer size =  1540.01 MiB
> llama_new_context_with_model:      CUDA4 compute buffer size =  1540.02 MiB
> llama_new_context_with_model:  CUDA_Host compute buffer size =   312.02 MiB
> llama_new_context_with_model: graph nodes  = 3304
> llama_new_context_with_model: graph splits = 393
> INFO [                    init] initializing slots | tid="140562497785856" timestamp=1747239254 n_slots=1
> INFO [                    init] new slot | tid="140562497785856" timestamp=1747239254 id_slot=0 n_ctx_slot=32768
> INFO [                    main] model loaded | tid="140562497785856" timestamp=1747239254
> INFO [                    main] chat template | tid="140562497785856" timestamp=1747239254 chat_example="You are a helpful assistant\n\n<｜User｜>Hello<｜Assistant｜>Hi there<｜end▁of▁sentence｜><｜User｜>How are you?<｜Assistant｜>" built_in=true
> INFO [                    main] HTTP server listening | tid="140562497785856" timestamp=1747239254 n_threads_http="15" port="8080" hostname="127.0.0.1"
> INFO [            update_slots] all slots are idle | tid="140562497785856" timestamp=1747239254
> INFO [   launch_slot_with_task] slot is processing task | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0
> INFO [            update_slots] kv cache rm [p0, end) | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0 p0=0
> CUDA error: an illegal memory access was encountered
>   current device: 0, in function ggml_cuda_op_mul_mat at /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:1743
>   cudaGetLastError()
> /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
> [New LWP 25355]
> [New LWP 25354]
> [New LWP 25353]
> [New LWP 25352]
> [New LWP 25351]
> [New LWP 25350]
> [New LWP 25349]
> [New LWP 25348]
> [New LWP 25347]
> [New LWP 25346]
> [New LWP 25345]
> [New LWP 25344]
> [New LWP 25343]
> [New LWP 25342]
> [New LWP 25341]
> [New LWP 25340]
> [New LWP 24655]
> [New LWP 24654]
> [New LWP 24653]
> [New LWP 24652]
> [New LWP 24651]
> [New LWP 24650]
> [New LWP 24649]
> [New LWP 23954]
> [New LWP 23953]
> [New LWP 23952]
> [New LWP 23951]
> [New LWP 23950]
> [New LWP 23949]
> [New LWP 23948]
> [New LWP 23947]
> [New LWP 23942]
> [New LWP 23941]
> [New LWP 23940]
>
> This GDB supports auto-downloading debuginfo from the following URLs:
>   <https://debuginfod.fedoraproject.org/>
> Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
> Debuginfod has been disabled.
> To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
> Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
> Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
> Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
> Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> 0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
> #0  0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
> #1  0x00007fd73d07b9da in __internal_syscall_cancel () from /lib64/libc.so.6
> #2  0x00007fd73d07ba24 in __syscall_cancel () from /lib64/libc.so.6
> #3  0x00007fd73d0eb5af in wait4 () from /lib64/libc.so.6
> #4  0x00007fd741c58908 in ggml_abort () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #5  0x00007fd741dded43 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #6  0x00007fd741decb09 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.1] () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #7  0x00007fd741df42dd in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #8  0x00007fd741caf9b3 in ggml_backend_sched_graph_compute_async () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
> #9  0x00007fd79656af1a in llama_decode () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/src/libllama.so
> #10 0x000000000049a2d4 in server_context::update_slots() ()
> #11 0x000000000046cafc in server_queue::start_loop() ()
> #12 0x0000000000416977 in main ()
> [Inferior 1 (process 23939) detached]
> ```
>
> Ran it with
>
> ```
> ./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-00001-of-00007.gguf' -c 32768 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" -ot "blk.(7|8|9|10).ffn.=CUDA1" -ot "blk.(11|12|13|14).ffn.=CUDA2" -ot "blk.(15|16|17).ffn.=CUDA3"  -ot "blk.(18|19|20|21|22|23|24|25).ffn.=CUDA4" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048 -mla 1 -fmoe
> ```
>
> Not using -fmoe makes it work without issues.

As you're using GPU+CPU, please try to replace "-mla 1" with "-mla 2".

---

👤 **ikawrakow** commented the **2025-05-15** at **04:35:23**:<br>

> As you're using GPU+CPU, please try to replace "-mla 1" with "-mla 2".

`-mla 3` works now on CPU+GPU and is the best option.

Concerning the error, it is not triggered in a function related to `-fmoe`, so I wonder if it is a pre-existing bug (a bunch of those got fixed in mainline lately).
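
One reason the reported function may not be the culprit: CUDA kernel launches are asynchronous, so an illegal memory access is often reported by a later API call than the kernel that caused it. A minimal sketch of rerunning with synchronous launches follows. `CUDA_LAUNCH_BLOCKING` is a standard CUDA runtime environment variable; the helper name `run_blocking` is hypothetical, and the example flags are simply taken from the report above.

```shell
# Sketch: run any command with synchronous CUDA kernel launches, so the error
# is reported closer to the kernel that actually faulted.
# `run_blocking` is a hypothetical helper name, not part of ik_llama.cpp.
run_blocking() {
    # The variable is set only for the command being launched.
    CUDA_LAUNCH_BLOCKING=1 "$@"
}

# Example (model path is a placeholder; flags are from the report above):
#   run_blocking ./llama-server -m /path/to/model.gguf -fa -mla 1 -fmoe -ub 2048
```

Expect this to run slower than normal, so it is only useful while hunting the crash.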

---

👤 **Panchovix** commented the **2025-05-15** at **22:22:06**:<br>

Okay, I tested again after updating and rebooting Fedora, and now -fmoe works fine with MLA 1 + FA on CUDA+CPU (I use it to save VRAM on compute buffers).

Not sure exactly what would have caused the issue.

---

👤 **schynce** commented the **2025-05-15** at **22:32:20**:<br>

> Okay, I tested again after updating and rebooting Fedora, and now -fmoe works fine with MLA 1 + FA on CUDA+CPU (I use it to save VRAM on compute buffers).
>
> Not sure exactly what would have caused the issue.

Are you sure that it is actually fixed? I am asking because I had some commands that I thought "worked" and started happily using them only for them to crash 15 messages and >30K tokens later. Some would crash instantly or with long prompts.

---

👤 **Panchovix** commented the **2025-05-15** at **22:45:52**:<br>

@schynce you're correct, I tried a few more times and got the illegal memory access again.

---

👤 **divine-taco** commented the **2025-05-19** at **23:10:44**:<br>

Another data point. I'm not entirely sure `-fmoe` is the problem here. This is running multi gpu (3090) with cpu offload.

I can also report that it is rare for the crash to occur immediately. It's usually after a handful of turns.

Note that this seems to be a recently introduced bug:
`-fmoe -mla 2` does not crash on 6c23618ca5d680bd00f06a143dc4a1b386c827e3
`-fmoe -mla 3` does not crash on 6c23618ca5d680bd00f06a143dc4a1b386c827e3     (much slower than mla 2 on this commit)

It stopped working sometime after this.
`-fmoe -mla 2` crashes for 2ec2229f2e9847d4e96bd7f163201810c8f8299a
`-fmoe -mla 3` crashes for 2ec2229f2e9847d4e96bd7f163201810c8f8299a

`-mla 2` without fmoe is also crashing for 2ec2229f2e9847d4e96bd7f163201810c8f8299a

If I get some time this week I'll try to isolate when the bug was introduced.
Probably worth someone else trying `6c23618ca5d680bd00f06a143dc4a1b386c827e3` to confirm this is the same issue everyone seems to be running into with multi gpu.

Suspect https://github.com/ikawrakow/ik_llama.cpp/issues/425 may be the same issue.
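
For anyone who wants to help isolate the regression, the two commits above give a natural starting point for `git bisect`. This is only a sketch: the helper name `start_bisect` is hypothetical, and the build and test steps are left as comments because they depend on the local setup.

```shell
# Sketch of a manual bisect between the two commits named above.
GOOD=6c23618ca5d680bd00f06a143dc4a1b386c827e3   # reported as not crashing
BAD=2ec2229f2e9847d4e96bd7f163201810c8f8299a    # reported as crashing

# Hypothetical helper; run it from inside an ik_llama.cpp checkout.
start_bisect() {
    git bisect start "$BAD" "$GOOD"
}

# After each checkout made by git: rebuild, rerun the failing workload, then mark it:
#   git bisect good    # no crash
#   git bisect bad     # illegal memory access reproduced
# When the first bad commit has been printed:
#   git bisect reset
```

Because the crash is intermittent and usually takes a handful of turns, a commit should only be marked good after a long enough run; a premature `good` would send the bisect in the wrong direction.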

---

👤 **ikawrakow** commented the **2025-05-20** at **04:34:00**:<br>

@divine-taco It would be useful to share your command line when reporting a problem.

The most significant change between  https://github.com/ikawrakow/ik_llama.cpp/commit/6c23618ca5d680bd00f06a143dc4a1b386c827e3 and https://github.com/ikawrakow/ik_llama.cpp/commit/2ec2229f2e9847d4e96bd7f163201810c8f8299a is PR #405. Prior to this PR the fused `ffn_up/ffn_gate` operation was not offloaded to the GPU if the tensors were on the CPU. After #405 the op is offloaded. You can disable that and restore the behavior prior to #405 using `-op 29,0`. Can you try that? Thanks.

---

👤 **divine-taco** commented the **2025-05-20** at **05:56:42**:<br>

~~@ikawrakow `-op 29,0` seems to fix the issues running with the latest commit - 2ec2229f2e9847d4e96bd7f163201810c8f8299a~~

Full command:

```
llama-server \
  --parallel 1 \
  -ctk f16 -ctv f16 \
  -ts 17,17,17,17,17,17,17,17,17 \
  --model /home/mx01/DeepSeek-V3-0324-GGUF-Q8_0 --host 0.0.0.0 --port 8080 \
  --ctx-size 44000 \
  -fmoe -rtr -mla 3 -fa \
  -b 2048 -ub 2048 -amb 512 \
  -op 29,0 \
  --no-mmap \
  --threads 64 --threads-batch 64 \
  -ngl 99 \
  -ot exps=CPU
```

Update:

2ec2229f2e9847d4e96bd7f163201810c8f8299a did eventually crash with `-op 29,0` in the same manner as before. It took quite a few turns to observe the behavior (~15).

```
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /app/ggml/src/ggml-cuda.cu:3067
  cudaStreamSynchronize(cuda_ctx->stream())
/app/ggml/src/ggml-cuda.cu:110: CUDA error
```

---

👤 **schynce** commented the **2025-05-20** at **13:44:34**:<br>

For me, the best way to trigger the bug quickly is to dump in a 30K token prompt. It seems to crash during the prompt processing or before generating a single token.
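
A rough sketch of scripting that kind of long-prompt test is below. It assumes the server exposes the `/completion` endpoint with `prompt` and `n_predict` fields, as the upstream llama.cpp server does, and that one repeated word is roughly one token; both are assumptions, and the helper names are hypothetical.

```shell
# Print the word "hello " N times (roughly one token per word is an assumption).
make_prompt() {
    n="$1"
    while [ "$n" -gt 0 ]; do
        printf 'hello '
        n=$((n - 1))
    done
}

# Build a JSON request body; the prompt contains only plain words, so no escaping is needed.
build_payload() {
    printf '{"prompt": "%s", "n_predict": 8}' "$(make_prompt "$1")"
}

# $1 = server base URL, $2 = number of repeated words.
# Assumes a /completion endpoint as in the upstream llama.cpp server.
send_long_prompt() {
    build_payload "$2" |
        curl -s -X POST "$1/completion" \
            -H 'Content-Type: application/json' --data-binary @-
}

# Example (host and port are placeholders):
#   send_long_prompt http://127.0.0.1:8080 30000
```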

---

👤 **ikawrakow** commented the **2025-05-20** at **14:23:18**:<br>

Does PR #438 help?

---

👤 **schynce** commented the **2025-05-20** at **15:58:47**:<br>

> Does PR [#438](https://github.com/ikawrakow/ik_llama.cpp/pull/438) help?

I tested #438 (branch ik/desperate_bug_fix_attempt) but unfortunately, it crashed almost straight away:

```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fa -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 --no-kv-offload \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.=CUDA0" \
-ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35)\.=CUDA1" \
-ot "blk\.(36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51)\.=CUDA2"
```

```
INFO [            update_slots] kv cache rm [p0, end) | tid="139707044622336" timestamp=1747756441 id_slot=0 id_task=27 p0=4097
CUDA error: an illegal memory access was encountered
  current device: 2, in function ggml_backend_cuda_synchronize at /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:3075
  cudaStreamSynchronize(cuda_ctx->stream())
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
```
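
The "Could not attach to process" lines mean gdb was blocked from attaching, which is why no backtrace was printed. A small sketch for checking the setting named in that message is below; the helper name `show_ptrace_scope` is hypothetical, and changing the value requires root.

```shell
# Path taken from the gdb message above.
PTRACE_SCOPE_FILE=/proc/sys/kernel/yama/ptrace_scope

# Print the current Yama ptrace restriction level, or "unavailable" if the
# file does not exist (for example on kernels without Yama).
show_ptrace_scope() {
    if [ -r "$PTRACE_SCOPE_FILE" ]; then
        cat "$PTRACE_SCOPE_FILE"
    else
        echo "unavailable"
    fi
}

show_ptrace_scope

# To allow attaching to non-child processes of the same user until the next reboot:
#   sudo sysctl -w kernel.yama.ptrace_scope=0
```

A value of 0 lets a process attach to any other process with the same uid; 1 restricts attaching to descendants, which can block the automatic backtrace seen in earlier logs.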

---

👤 **divine-taco** commented the **2025-05-20** at **21:36:55**:<br>

~~PR #438 - 82871cc2a3366dfdeff758f04fdfcf5ae5859829 - looks to fix the issue for me. Tried 30 turn completions at long context and saw no issues.~~

Command used:
```
llama-server \
  --parallel 1 \
  -ctk f16 -ctv f16 \
  -ts 17,17,17,17,17,17,17,17,17 \
  --model /home/mx01/DeepSeek-V3-0324-GGUF-Q8_0 --host 0.0.0.0 --port 8080 \
  --ctx-size 44000 \
  -fmoe -rtr -mla 3 -fa \
  -b 2048 -ub 2048 -amb 512 \
  --no-mmap \
  --threads 64 --threads-batch 64 \
  -ngl 99 \
  -ot exps=CPU
```

@schynce - Have a link to the Qwen3-235B-A22B quant you used? I can try that as well.

Update: Failed with illegal memory access again on PR #438 with DeepSeek 0324 after I ran some automated completion tests. I don't have enough data yet to be confident, but it does seem to fail less frequently. I'll try running `-mla 2` on PR #438 to see if this makes any difference.

---

👤 **schynce** commented the **2025-05-20** at **21:49:54**:<br>

@divine-taco

I used this:

https://huggingface.co/unsloth/Qwen3-235B-A22B-GGUF/tree/main/IQ4_XS

However, I notice that there have been some updates in the first split file since I downloaded it.

---

👤 **ikawrakow** commented the **2025-05-21** at **06:02:41**:<br>

Please use the branch in PR #442 and post the CUDA call trace that will be printed when the application crashes.

---

👤 **schynce** commented the **2025-05-21** at **12:11:08**:<br>

> Please use the branch in PR [#442](https://github.com/ikawrakow/ik_llama.cpp/pull/442) and post the CUDA call trace that will be printed when the application crashes.

```
llm_load_tensors:  CUDA_Host buffer size = 52313.37 MiB
llm_load_tensors:      CUDA0 buffer size = 22068.28 MiB
llm_load_tensors:      CUDA1 buffer size = 22068.28 MiB
llm_load_tensors:      CUDA2 buffer size = 23042.94 MiB
....................................................................................................
============ Repacked 127 tensors
llama_new_context_with_model: n_ctx      = 40960
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  3995.00 MiB
llama_new_context_with_model: KV self size  = 3995.00 MiB, K (q8_0): 1997.50 MiB, V (q8_0): 1997.50 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.16 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   104.50 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   104.50 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   189.25 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   304.75 MiB
llama_new_context_with_model: graph nodes  = 3672
llama_new_context_with_model: graph splits = 432
INFO [                    init] initializing slots | tid="140363884277760" timestamp=1747829175 n_slots=1
INFO [                    init] new slot | tid="140363884277760" timestamp=1747829175 id_slot=0 n_ctx_slot=40960
INFO [                    main] model loaded | tid="140363884277760" timestamp=1747829175
INFO [                    main] chat template | tid="140363884277760" timestamp=1747829175 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [                    main] HTTP server listening | tid="140363884277760" timestamp=1747829175 n_threads_http="15" port="5000" hostname="0.0.0.0"
INFO [            update_slots] all slots are idle | tid="140363884277760" timestamp=1747829175
INFO [      log_server_request] request | tid="140361486192640" timestamp=1747829175 remote_addr="127.0.0.1" remote_port=55754 status=200 method="GET" path="/v1/models" params={}
INFO [      log_server_request] request | tid="140361494585344" timestamp=1747829175 remote_addr="127.0.0.1" remote_port=57094 status=200 method="GET" path="/v1/models" params={}
INFO [      log_server_request] request | tid="140361477799936" timestamp=1747829182 remote_addr="127.0.0.1" remote_port=43408 status=200 method="GET" path="/v1/models" params={}
INFO [      log_server_request] request | tid="140361469407232" timestamp=1747829191 remote_addr="127.0.0.1" remote_port=49880 status=200 method="GET" path="/v1/models" params={}
INFO [   launch_slot_with_task] slot is processing task | tid="140363884277760" timestamp=1747829191 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140363884277760" timestamp=1747829191 id_slot=0 id_task=0 p0=0
CUDA error: an illegal memory access was encountered
  current device: 2, in function ggml_backend_cuda_synchronize at /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:3085
  cudaStreamSynchronize(cuda_ctx->stream())
========================== CUDA trace: 315944 previous calls
      315943: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      315942: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315941: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315940: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315939: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      315938: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      315937: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      315936: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      315935: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315934: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315933: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315932: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      315931: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      315930: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      315929: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      315928: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315927: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315926: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315925: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      315924: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      315923: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      315922: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 135
      315921: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      315920: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3074
      315919: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3071
      315918: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3061
      315917: function ggml_backend_cuda_synchronize, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3085
      315916: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2773
      315915: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2764
      315914: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      315913: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315912: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      315911: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
```

---

👤 **ikawrakow** commented the **2025-05-21** at **12:37:17**:<br>

Thank you!

So, it crashes in a matrix multiplication. I have pushed another commit on the branch that will help narrow it down further if you rerun with that.

---

👤 **schynce** commented the **2025-05-21** at **13:29:25**:<br>

> Thank you!
>
> So, it crashes in a matrix multiplication. I have pushed another commit on the branch that will help narrow it down further if you rerun with that.

Thanks for looking into the issue! Here you go:

```
CUDA error: an illegal memory access was encountered
  current device: 2, in function ggml_backend_cuda_synchronize at /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:3085
  cudaStreamSynchronize(cuda_ctx->stream())
========================== CUDA trace: 335439 previous calls
      335438: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      335437: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
      335436: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
      335435: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335434: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335433: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335432: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335431: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      335430: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335429: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      335428: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
      335427: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
      335426: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335425: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335424: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335423: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335422: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      335421: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335420: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      335419: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
      335418: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
      335417: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335416: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335415: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335414: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335413: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      335412: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335411: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 135
      335410: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335409: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3074
      335408: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3071
      335407: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3061
      335406: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
```

---

👤 **ikawrakow** commented the **2025-05-21** at **13:55:41**:<br>

I was confused. If there was something wrong with the matrix multiplications, it would have aborted there. The computations succeed, but then something goes wrong in the back-end. I have now added 2 additional asserts in the back-end at the place where the back-trace was when we did the debugging session.

---

👤 **schynce** commented the **2025-05-21** at **14:10:05**:<br>

> I was confused. If there was something wrong with the matrix multiplications, it would have aborted there. The computations succeed, but then something goes wrong in the back-end. I have now added 2 additional asserts in the back-end at the place where the back-trace was when we did the debugging session.

I tried the newest commit, but the backtrace is practically identical as far as I can tell:

```
CUDA error: an illegal memory access was encountered
  current device: 2, in function ggml_backend_cuda_synchronize at /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:3089
  cudaStreamSynchronize(stream)
========================== CUDA trace: 335439 previous calls
      335438: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      335437: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
      335436: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
      335435: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335434: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335433: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335432: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335431: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      335430: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335429: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      335428: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
      335427: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
      335426: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335425: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335424: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335423: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335422: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      335421: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335420: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      335419: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
      335418: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
      335417: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335416: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335415: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335414: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335413: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      335412: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335411: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 135
      335410: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335409: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3074
      335408: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3071
      335407: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3061
      335406: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
```

---

👤 **ikawrakow** commented the **2025-05-21** at **14:27:12**:<br>

Thanks! I'll keep digging.

---

👤 **ikawrakow** commented the **2025-05-21** at **15:26:00**:<br>

I have now added a trace to the back-end, so when the crash occurs it will print from where `ggml_backend_cuda_synchronize` was called. Can you try another time? Thanks!

---

👤 **schynce** commented the **2025-05-21** at **16:31:48**:<br>

> I have now added a trace to the back-end, so when the crash occurs it will print from where `ggml_backend_cuda_synchronize` was called. Can you try another time? Thanks!

```
CUDA error: an illegal memory access was encountered
  current device: 2, in function ggml_backend_sched_compute_splits at /home/user/ik_llama.cpp/ggml/src/ggml-backend.c:1835
  cudaStreamSynchronize
========================== CUDA trace: 335439 previous calls
      335438: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      335437: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
      335436: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
      335435: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335434: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335433: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335432: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335431: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      335430: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335429: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      335428: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
      335427: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
      335426: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335425: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335424: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335423: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335422: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      335421: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335420: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
      335419: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
      335418: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
      335417: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335416: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335415: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
      335414: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335413: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
      335412: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335411: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 135
      335410: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
      335409: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3074
      335408: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3071
      335407: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3061
      335406: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
```

---

👤 **ikawrakow** commented the **2025-05-21** at **16:43:24**:<br>

@schynce You are running with `--no-kv-offload`, right? Your error is different. What happens if you don't use `--no-kv-offload`?

---

👤 **schynce** commented the **2025-05-21** at **16:55:42**:<br>

> [@schynce](https://github.com/schynce) You are running with `--no-kv-offload`, right? Your error is different. What happens if you don't use `--no-kv-offload`?

Yes, those logs were with this launch command:

```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fa -fmoe -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 --no-kv-offload \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17)\.=CUDA0" \
-ot "blk\.(18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35)\.=CUDA1" \
-ot "blk\.(36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51)\.=CUDA2"

```

---

I ran without `--no-kv-offload` and adjusted the layer assignments so that the KV cache fits:

```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fa -fmoe -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16)\.=CUDA0" \
-ot "blk\.(17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33)\.=CUDA1" \
-ot "blk\.(34|35|36|37|38|39|40|41|42|43|44|45|46|47)\.=CUDA2"
```

It took considerably longer for the crash to appear this time:

```
INFO [   launch_slot_with_task] slot is processing task | tid="139770035781632" timestamp=1747846205 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="139770035781632" timestamp=1747846205 id_slot=0 id_task=0 p0=0
INFO [            update_slots] kv cache rm [p0, end) | tid="139770035781632" timestamp=1747846249 id_slot=0 id_task=0 p0=2048
INFO [            update_slots] kv cache rm [p0, end) | tid="139770035781632" timestamp=1747846293 id_slot=0 id_task=0 p0=4096
INFO [            update_slots] kv cache rm [p0, end) | tid="139770035781632" timestamp=1747846338 id_slot=0 id_task=0 p0=6144
CUDA error: an illegal memory access was encountered
  current device: 2, in function ggml_backend_sched_compute_splits at /home/user/ik_llama.cpp/ggml/src/ggml-backend.c:1835
  cudaStreamSynchronize
========================== CUDA trace: 2460820 previous calls
     2460819: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
     2460818: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
     2460817: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
     2460816: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2460815: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2460814: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2460813: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2460812: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
     2460811: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2460810: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
     2460809: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
     2460808: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
     2460807: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2460806: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2460805: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2460804: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2460803: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
     2460802: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2460801: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
     2460800: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3529
     2460799: function launch_mul_mat_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/template-instances/../mmq.cuh, line 3525
     2460798: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2460797: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2460796: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2460795: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2460794: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
     2460793: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2460792: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 135
     2460791: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2460790: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3074
     2460789: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3071
     2460788: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3061
     2460787: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
```

---

👤 **ikawrakow** commented the **2025-05-22** at **06:44:46**:<br>

If you are not tired of testing, there are new changes on #442

---

👤 **schynce** commented the **2025-05-22** at **07:43:25**:<br>

> If you are not tired of testing, there are new changes on [#442](https://github.com/ikawrakow/ik_llama.cpp/pull/442)

Not even close to being tired yet, thank you for taking the time to look into this :)

I ran this command:
```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
-fa -fmoe -rtr -c 40960 -ctk q8_0 -ctv q8_0 --threads 7 \
-ot "blk\.\d+\.attn=CUDA2" \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16)\.=CUDA0" \
-ot "blk\.(17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33)\.=CUDA1" \
-ot "blk\.(34|35|36|37|38|39|40|41|42|43|44|45|46|47)\.=CUDA2"
```

During context processing, the console was getting spammed with the `ggml_backend_cuda_synchronize` and `ggml_backend_cuda_cpy_tensor_async` lines. At the end of prompt processing (I assume), it crashed like before:

```
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 2 to device 0 without access enabled
ggml_backend_cuda_synchronize: curent device is 2, context device is 0
ggml_backend_cuda_synchronize: reverting device to 2
ggml_backend_cuda_synchronize: curent device is 0, context device is 2
ggml_backend_cuda_synchronize: reverting device to 0
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 2 without access enabled
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_sched_compute_splits at /home/user/ik_llama.cpp/ggml/src/ggml-backend.c:1835
  cudaStreamSynchronize
========================== CUDA trace: 2486495 previous calls
     2486494: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3070
     2486493: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3055
     2486492: function ggml_backend_cuda_synchronize, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3120
     2486491: function ggml_backend_sched_compute_splits, file /home/user/ik_llama.cpp/ggml/src/ggml-backend.c, line 1828
     2486490: function ggml_backend_cuda_synchronize, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3107
     2486489: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2774
     2486488: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2765
     2486487: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1756
     2486486: function ggml_cuda_op_mul_mat_vec_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu, line 593
     2486485: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2486484: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2486483: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2486482: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2486481: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
     2486480: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2486479: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2744
     2486478: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2740
     2486477: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1756
     2486476: function ggml_cuda_op_mul_mat_vec_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu, line 593
     2486475: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2486474: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2486473: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2486472: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2486471: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
     2486470: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2486469: function ggml_cuda_up_gate_unary, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 2736
     2486468: function ggml_cuda_op_mul_mat, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1756
     2486467: function ggml_cuda_op_mul_mat_vec_q, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu, line 593
     2486466: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2486465: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2486464: function ggml_cuda_get_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
     2486463: function ggml_cuda_set_device, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
     2486462: function ggml_backend_cuda_cpy_tensor_async, file /home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 3070
/home/user/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
```