Files
ik_llama.cpp/github-data/discussions/562 - AMD GPU Vulkan _ ROCm_HIP Discussion.md
2025-07-23 13:31:53 +02:00

37 KiB

🗣️ #562 - AMD GPU Vulkan & ROCm/HIP Discussion

Author ubergarm
Created 2025-06-28
Updated 2025-07-06

Description

Background

I've been asked a few times now about AMD GPU support with ik's fork. I recently got access to an AMD RX 7900 XTX to try it out, and as discussed on Issue 503 the Vulkan and ROCm backends are not the focus of this fork hence limited support on AMD GPU hardware.

I'm starting this discussion to have a place to point folks who might be interested the current state AMD GPU backend support, and especially if they wanted to attempt updates and work on it at all.

Current State

ik_llama.cpp actually does compile with Vulkan and can do some limited inferencing. As it is unmaintained, it is slower than mainline at the moment. However I couldn't get it to compile with ROCm/HIP support. I only tried the AMD official open source AMDVLK backend and not the community open source RADV backend.

There is a good benchmarking discussion on mainline maintained by @netrunnereve which was very helpful for establishing baseline expectations and trying to understand the various AMD GPU driver development environments.

Benchmarks

I did a comparison between mainline llama.cpp and ik_llama.cpp at the given sha's for what I could get working.

sweep-bench-amd-gpu-mainline-vs-ik

Methodology

To keep things somewhat consistent with the establish methodologies I used TheBloke's now vintage Llama-2-7B at classic Q4_0 quantization. The following is how compilation was done as well as running llama-sweep-bench with and without flash attention -fa:

Compiling

# compile for Vulkan
cmake -B build -DGGML_HIP=OFF -DGGML_VULKAN=1 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)

# couldn't find a combination that worked below
# compile for ROCm/HIP
export HIPCXX="$(hipconfig -l)/clang"
export HIP_PATH="$(hipconfig -R)"
#cmake -B build -DGGML_VULKAN=0 -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release
cmake -B build -DGGML_VULKAN=0 -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
In file included from /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda/fattn.cu:15:
In file included from /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda/fattn-mma-f16.cuh:3:
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda/mma_new.cuh:49:27: error: use of undeclared identifier '__shfl_sync'
   49 |     const int ret_low  = (__shfl_sync(0xFFFFFFFF, x, src_laneid_low,  WARP_SIZE) >> shift_low)  & 0x0000FFFF;
      |                           ^
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda/mma_new.cuh:50:27: error: use of undeclared identifier '__shfl_sync'
   50 |     const int ret_high = (__shfl_sync(0xFFFFFFFF, x, src_laneid_high, WARP_SIZE) << shift_high) & 0xFFFF0000;
      |                           ^
4 errors generated when compiling for gfx1100.

sweep-bench

export model=/models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf
# try with and without -fa
./build/bin/llama-sweep-bench \
  --model "$model" \
  -fa \
  -c 18432 \
  -ngl 99 \
  --warmup-batch \
  --threads 1

Observations

  1. Surprisingly Vulkan without FA managed to complete the benchmark and even give similar performance as mainline for the no FA token generation at longer context lengths.
  2. However, Vulkan with FA enabled shows very poor performance and consistently crashes at N_KV=7680. iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
  3. I did not test any other quantizations especially the newer ik exclusive quants.
  4. I did do a quick vibe check and confirm the model was at least valid tokens, however the chat template seemed odd or could be due to my client settings for temp etc but the responses seemed wrong and had <|im_start|> and <|im_end|> type tokens which don't usually come back from the chat endpoint.

Conclusion

Well, sorry if you have AMD GPU hardware and were hoping to try out the latest greatest stuff on ik's fork. You can still make use of the CPU only optimizations fwiw. You can see the relative performance of native CUDA in the linked benchmark thread for one of my other tests, and ik's fork does run faster than mainline for CUDA.

Finally, I saw and interesting NVIDIA slide deck from the Vulkanised 2025 Developer Conference which discusses llama.cpp on pages 14 and 15 even showing what looks like some of ik's IQ4_NL code with implementation discussions. I was surprised that some models benchmark faster on NVIDIA GPUs using vulkan backend beating out the native CUDA implementation, but perhaps that is for another day...

Thanks and curious if anyone else has tried this or is interested in improving support here. Cheers!


🗣️ Discussion

👤 OneOfOne replied the 2025-06-29 at 01:50:14:

llama.cpp's vulkan backend is faster and uses less memory on my 7900xtx as well (I'm using latest rocm on Arch so it's not that).

👤 ubergarm replied the 2025-06-29 at 14:41:47:
Yup, this is to be expected given ik's fork prioritizes a couple CPU types and CUDA implementations and does not focus on maintaining Vulkan nor ROCm/HIP backends.


👤 firecoperana replied the 2025-06-29 at 14:50:07:

I'm working on bringing ik_llama.cpp up to date with llama.cpp's vulkan backend. It is actually easier than I expected.

👤 ubergarm replied the 2025-06-29 at 14:58:06:
@firecoperana very cool to hear 🔥 !

As suggested by @0cc4m and some discussion by the author of those Vulkanised Conference PDF slides linked above, @jeffbolznv ,over on the mainline vulkan benchmark discussion I might try to pacman -Sy extra/nvidia-utils and build the vulkan backend for my NVIDIA RTX 3090TI FE GPU and compare performance there as well.

Please update us here if you have a fork/branch/PR you'd like to test and if I still have access to the AMD RX 7900 XTX I can give it a go as I'd like to use ik's SOTA quants on that machine for a fun project...

👤 ikawrakow replied the 2025-06-29 at 16:25:52:
@firecoperana Great that you want to port the mainline Vulkan back-end to ik_llama.cpp, but are you also willing to maintain it?

👤 firecoperana replied the 2025-06-29 at 19:30:13:
PR is created. Welcome to test. I can maintain it if the vulkan code there hasn't been refactored too much. With this PR, the future update should be easier too. I don't use vulkan much so need someone to remind me if there is some major improvement in vulkan that is worth porting.

👤 ubergarm replied the 2025-06-29 at 19:41:02:
I'll give it a try, I just updated my home rig to latest greatest drivers (which I loathe to do but sometimes u gotta pay the piper...).

Interestingly on a Qwen3-14B-Q4_0 the Vulkan FA=1 backend beats native CUDA implementation in token generation at sufficiently deep n_kv

https://github.com/ggml-org/llama.cpp/discussions/10879#discussioncomment-13611122

I'll take a look at the PR now, thanks! https://github.com/ikawrakow/ik_llama.cpp/pull/563

👤 firecoperana replied the 2025-06-29 at 19:53:41:
https://github.com/ggml-org/llama.cpp/pull/14366 Vulkan also needs this one, but I couldn't port it in easily. The issue is vulkan does not have FUSED_RMS_NORM and FUSED_MUL_UNARY support, and when using RPC, it needs this. My current workaround is skip ggml_fused_rms_norm and ggml_fused_mul_unary when using vulkan. @ikawrakow


👤 ikawrakow replied the 2025-07-01 at 13:50:50:

So, what is the "approved" way of installing the necessary dependencies for Vulkan development on Ubuntu? I ended up installing LunarG VulkanSDK, but the thing almost bricked my system because I hadn't run sudo apt update && sudo apt upgrade before importing their repository and attempting to install. Is there a better way, preferably with just Ubuntu packages and no 3rd party stuff?

Anyhow, at the end I got the mainline Vulkan build working, but performance is very far from CUDA on my RTX-4080

Vulkan sweep-bench, LlaMA-3.1-8B

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 0.339 3024.00 2.808 91.16
1024 256 1024 0.337 3035.97 2.709 94.51
1024 256 2048 0.328 3121.27 2.657 96.36
1024 256 3072 0.336 3052.01 2.661 96.19
1024 256 4096 0.368 2781.06 2.704 94.67
1024 256 5120 0.405 2531.44 2.794 91.61
1024 256 6144 0.465 2202.62 2.917 87.75
1024 256 7168 0.542 1888.01 3.047 84.00
1024 256 8192 0.618 1656.82 3.196 80.10
1024 256 9216 0.657 1559.24 3.283 77.98
1024 256 10240 0.695 1473.46 3.365 76.08
1024 256 11264 0.720 1422.92 3.412 75.02
1024 256 12288 0.753 1359.30 3.464 73.89
1024 256 13312 0.792 1293.13 3.523 72.67
1024 256 14336 0.814 1257.77 3.588 71.35
1024 256 15360 0.858 1192.89 3.625 70.63

CUDA sweep-bench, LlaMA-3.1-8B

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
1024 256 0 0.134 7649.04 2.018 126.88
1024 256 1024 0.129 7921.34 2.105 121.63
1024 256 2048 0.135 7561.83 2.170 117.99
1024 256 3072 0.144 7121.15 2.226 114.99
1024 256 4096 0.151 6784.15 2.292 111.71
1024 256 5120 0.159 6460.57 2.354 108.75
1024 256 6144 0.164 6225.61 2.423 105.66
1024 256 7168 0.172 5961.15 2.484 103.05
1024 256 8192 0.183 5606.81 2.545 100.61
1024 256 9216 0.194 5289.56 2.604 98.31
1024 256 10240 0.195 5239.75 2.662 96.15
1024 256 11264 0.206 4962.13 2.731 93.72
1024 256 12288 0.214 4777.95 2.787 91.85
1024 256 13312 0.217 4725.71 2.845 89.97
1024 256 14336 0.230 4454.44 2.919 87.71
1024 256 15360 0.238 4311.56 2.966 86.30

So, PP is 3X lower, TG is 20-25% lower.

Given this, does it make sense to spend time on Vulkan? When I forked llama.cpp last year the Vulkan stuff was mostly a gimmick, with performance not much better than just running on a moderately fast CPU. They have done a lot of Vulkan development and performance improvements in mainline since then, but it still seems way too far behind.

👤 jeffbolznv replied the 2025-07-01 at 14:08:19:
Installing the Vulkan SDK is the "right" way to get the dependencies. The pp scores shouldn't be that low, it suggests cooperative matrix isn't getting used. What driver version are you using? Can you share the beginning of the log where ggml-vulkan prints device info?

👤 ubergarm replied the 2025-07-01 at 19:20:24:

Given this, does it make sense to spend time on Vulkan?

Personally, the two things I see Vulkan back-end support providing are:

  1. A path allowing AMD GPUs to be used e.g. RX 7900 XTX 24GB VRAM
  2. Potentially faster NVIDIA path for some situations/models (this was news to me).

This Qwen3-14B-Q4_0 dense sweep-bench I ran a couple days ago opened my eyes where the vulkan backend on mainline took the lead on TG after about ~8k depth. NV_coopmat2 is described in @jeffbolznv recent Vulkanised 2025 slides.

sweep-bench-llama-vs-ik-vulkan-qwen3-14b

Otherwise ik CUDA is generally the fastest. I haven't tested other models/configs but likely vulkan takes the lead in other situations reading the benchmarks in the slides.

However, I also don't want to distract ik whatever optimizations and experiments are most interesting and intrinsically motivating. So nice to see a few folks from the community possibly providing some support. Big thanks @firecoperana for taking a stab at it on https://github.com/ikawrakow/ik_llama.cpp/pull/563

Thanks!


👤 ikawrakow replied the 2025-07-01 at 14:11:22:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
build: 5781 (ba3ef86c5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device Vulkan0 (NVIDIA GeForce RTX 4080) - 16376 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from ../ncuda/junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  18:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                          general.file_type u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Meta Llama 3.1 8B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:      Vulkan0 model buffer size =  4155.99 MiB
load_tensors:   CPU_Mapped model buffer size =   281.81 MiB
......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 1024
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host  output buffer size =     0.49 MiB
llama_kv_cache_unified:    Vulkan0 KV buffer size =  2048.00 MiB
llama_kv_cache_unified: size = 2048.00 MiB ( 16384 cells,  32 layers,  1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context:    Vulkan0 compute buffer size =   613.01 MiB
llama_context: Vulkan_Host compute buffer size =    80.01 MiB
llama_context: graph nodes  = 999
llama_context: graph splits = 2

👤 ikawrakow replied the 2025-07-01 at 14:19:04:

@jeffbolznv Thank you for chiming in. Above is the log. Is there something additional I need to do to improve performance? I did

mkdir vulkan && cd vulkan
cmake .. -DGGML_VULKAN=ON -DGGML_CUDA=OFF
make -j

👤 jeffbolznv replied the 2025-07-01 at 14:43:13:
Is it a release build? I can't tell.

You'd probably get a boost from a newer driver (to enable coopmat2), but the pp numbers seem slow for coopmat1.

👤 ikawrakow replied the 2025-07-01 at 14:54:36:
Yes, this is a release build. @ubergarm is getting in the range of 3000 t/s for LlaMA-7B on his RX 7900 XTX, so same ball park.


👤 jeffbolznv replied the 2025-07-01 at 14:53:29:

What's the llama-bench equivalent of the N_KV column in that table? Is it -d? I see a big difference between coopmat1 and coopmat2 with large depth.

👤 ikawrakow replied the 2025-07-01 at 15:00:56:
I haven't looked into mainline llama.cpp, but the sweep-bench here adds N_KV tokens to the KV cache, and then runs a batch of a given size (1024 tokens in the above example), and generates a given number of new tokens (256 in the example). Time is measured for both, and resulting tokens/second is printed. The KV cache is increased gradually in a sweep, which corresponds to a typical experience of a user interacting with an LLM. I don't know what the -d option in mainline does (I think it is a relatively recent addition), that's why I have a port of sweep-bench to mainline llama.cpp to be able to run direct (and more meaningful) comparisons than -p 512 or -n 128).

👤 jeffbolznv replied the 2025-07-01 at 15:14:47:
OK. I think these are basically the same parameter.

I see much better (>2x) performance for large KV with coopmat2, and I think this is because it's doing more rows at a time (64 vs 16). It might be possible to improve this for the coopmat1 path, but it may start running into register limits, hard to say. For an NV GPU, you should just update to a recent driver (r575) and you'll get the improved performance automatically.

👤 ikawrakow replied the 2025-07-01 at 15:28:31:

you should just update to a recent driver (r575) and you'll get the improved performance automatically.

You mean the Nvidia driver? I'm on 560.35.03 and reluctant to update as the machine I'm working on is remote.

But IIRC, you have an RTX-4070. Can you post a comparison between CUDA and Vulkan on your GPU?

👤 jeffbolznv replied the 2025-07-01 at 16:12:50:
I recently got a 5090, so the 4070 is no longer in my system. Here's what I'm seeing for coopmat2, coopmat1, and CUDA.

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -fa 1 -n 0 -p 1024 --prio 1 -r 1 -d 1024-15360+1024
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d1024 |      10616.78 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d2048 |       9960.08 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d3072 |       9841.83 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d4096 |       9479.70 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d5120 |       9019.58 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d6144 |       8337.62 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d7168 |       8149.66 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d8192 |       7892.09 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d9216 |       7678.50 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d10240 |       7396.89 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d11264 |       7160.86 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d12288 |       6865.95 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d13312 |       6660.70 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d14336 |       6481.23 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d15360 |       6240.57 ± 0.00 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -fa 1 -n 0 -p 1024 --prio 1 -r 1 -d 1024-15360+1024
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d1024 |       6484.20 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d2048 |       5791.34 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d3072 |       5398.55 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d4096 |       4879.42 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d5120 |       4477.92 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d6144 |       4112.65 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d7168 |       3902.24 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d8192 |       3651.50 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |  pp1024 @ d9216 |       3420.07 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d10240 |       3236.93 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d11264 |       3061.68 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d12288 |       2896.88 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d13312 |       2734.89 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d14336 |       2624.02 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 | pp1024 @ d15360 |       2496.16 ± 0.00 |

Z:\github\jeffbolznv\llama.cpp\buildcuda\bin\RelWithDebInfo>llama-bench -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -fa 1 -n 0 -p 1024 --prio 1 -r 1 -d 1024-15360+1024
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 |  pp1024 @ d1024 |      12854.24 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 |  pp1024 @ d2048 |      12101.30 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 |  pp1024 @ d3072 |      11831.37 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 |  pp1024 @ d4096 |      11467.68 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 |  pp1024 @ d5120 |      11072.99 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 |  pp1024 @ d6144 |      10646.26 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 |  pp1024 @ d7168 |      10287.17 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 |  pp1024 @ d8192 |       9873.84 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 |  pp1024 @ d9216 |       9688.37 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 | pp1024 @ d10240 |       9373.99 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 | pp1024 @ d11264 |       9117.66 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 | pp1024 @ d12288 |       8706.74 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 | pp1024 @ d13312 |       8635.61 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 | pp1024 @ d14336 |       8351.58 ± 0.00 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | CUDA       |  99 |  1 | pp1024 @ d15360 |       8134.32 ± 0.00 |

You mean the Nvidia driver? I'm on 560.35.03 and reluctant to update as the machine I'm working on is remote.

Yeah, r575 has coopmat2 support.

👤 ikawrakow replied the 2025-07-01 at 16:50:49:
OK, thanks, this looks much better.

👤 ubergarm replied the 2025-07-01 at 19:41:16:
@jeffbolznv Thanks for the benchmarks. I'm curious how Vulkan coopmat2 is looking for TG. On the slightly larger Qwen3-14B-Q4_0 I mentioned above how it is actually faster than CUDA on my 3090TI FE for larger kv depths.

If you are interested, here is one way to use llama-sweep-bench on mainline llama.cpp for comparisons. I just updated my fork/branch to llama.cpp tip of master@de5694414

cd llama.cpp
git remote add ubergarm git@github.com:ubergarm/llama.cpp.git
git fetch ubergarm
git checkout ug/port-sweep-bench
# compile as usual for CUDA/Vulkan Release
# it runs basically like llama-server with similar argument style
# this might work on your windows box:
llama-sweep-bench -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -fa -c 8192 -ngl 99 -t 1

👤 jeffbolznv replied the 2025-07-01 at 19:53:41:
coopmat2 mostly isn't used for tg, but if there's grouped query attention then it may be used for the flash attention shader. It's nice/surprising to see vulkan pull ahead for larger KV. I suspect the Vulkan driver still has some small launch overhead relative to CUDA that hurts at smaller sizes, but I'm not really sure.

👤 ikawrakow replied the 2025-07-02 at 06:28:28:
@jeffbolznv

Once you are here, may I ask why flash attention for DeepSeek is not implemented in the llama.cpp Vulkan backend? Is it just that nobody has come around to do it, or are there principle issues? The most efficient FA implementation requires k-head = 192, v-head = 128 for prompt processing, and k-head = 576, v-head = 512 for token generation.

👤 jeffbolznv replied the 2025-07-02 at 12:51:50:
Just nobody has done it yet. I don't think I've seen a version of the model that would even come close to fitting on my GPU. I suppose I could implement it just from the backend tests, but it would be nice to be able to perf test it.

👤 ikawrakow replied the 2025-07-02 at 13:04:19:
Here is a 16B parameter MoE model that easily fits in your 5090 with VRAM to spare that uses the exact same attention mechanism as DeepSeek-V3/R1: https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite (except that it has 16 instead of 128 heads). I think this is what Johannes used for testing when he implemented the k-head-size != v-head-size FA in the llama.cpp CUDA backend. I did have a partial implementation here using this model quite a bit earlier than mainline (the part for k-head=192, v-head=128), but I was straggling to get a performant implementation for the k-head=576, v-head=512 case, so that's why I asked the question if there are principle issues with the Vulkan implementation.

👤 jeffbolznv replied the 2025-07-02 at 13:10:56:
I thought deepseek v2 was already accelerated and it was only deepseek R1 that uses the large/mixed head sizes?

👤 ikawrakow replied the 2025-07-02 at 13:13:09:
Well, get the model and see what happens.

👤 jeffbolznv replied the 2025-07-02 at 13:52:10:
OK, I do see FA falling back to CPU with it.

👤 jeffbolznv replied the 2025-07-02 at 20:14:58:
I added support for these head sizes in https://github.com/ggml-org/llama.cpp/pull/14509. Performance is tolerable with the coopmat2 shader but very slow for coopmat1/scalar. I'm sure there's some room for tuning.


👤 ikawrakow replied the 2025-07-02 at 06:16:07:

Personally, the two things I see Vulkan back-end support providing are:

A path allowing AMD GPUs to be used e.g. RX 7900 XTX 24GB VRAM

But a port of the mainline Vulkan back-end to ik_llama.cpp without the additions that make ik_llama.cpp faster for CUDA and CPU inference has zero benefits. People can simply use llama.cpp with their AMD GPUs.

👤 firecoperana replied the 2025-07-02 at 14:32:39:
Another benefit is to people who have both nvidia and amd or even intel GPUs. They can use RPC to load different backends or just use vulkan to use non CUDA GPU to offload more weights to vram.

👤 ikawrakow replied the 2025-07-02 at 14:43:52:

Another benefit is to people who have both nvidia and amd or even intel GPUs. They can use RPC to load different backends or just use vulkan to use non CUDA GPU to offload more weights to vram.

They already have this with llama.cpp. What does ik_llama.cpp without the additions implemented for Vulkan give them that they don't already have with llama.cpp?

👤 firecoperana replied the 2025-07-02 at 15:38:13:
One major thing I can think of is mla support for old quants of Deepseek V2.5 and V3 models. And if someone is already using ik_llama.cpp, adding AMD gpu that is not useable earlier can offer more speed boost.


👤 ikawrakow replied the 2025-07-06 at 13:41:44:

So, the Vulkan back-end is usable, and performance is better than llama.cpp (see, e.g., PR #584 that has a comparison for a MoE model). But compared to CUDA on the same GPU, performance is much lower, especially for MoE models (and most users appear to be using ik_llama.cpp exactly for one of the giant MoE models). I have mixed feelings how to proceed:

  • There is much more performance optimization potential in the Vulkan back-end compared to CUDA or CPU. So, from that point of view it seems worthwhile to put some effort into optimizing the Vulkan back-end
  • I know nothing about Vulkan programming in general or the llama.cpp Vulkan back-end in particular, hence, at least initially, it will be an uphill battle. Without a significant interest from the user base, I don't feel particularly motivated to do this to myself.

So, if you feel that Vulkan performance improvement in ik_llama.cpp is important, go to discussion #590 and vote!