### đ [#563](https://github.com/ikawrakow/ik_llama.cpp/pull/563) - Merge vulkan code from mainline up to commit of 6/28/2025
| **Author** | `firecoperana` |
| :--- | :--- |
| **State** | â **Closed** |
| **Created** | 2025-06-29 |
| **Updated** | 2025-07-02 |
---
#### Description
* Vulkan Optimizations and Fixes (#8959)
* Optimize Vulkan REPEAT performance
.....................................................................................
vulkan: lock accesses of pinned_memory vector (#14333)
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
Fix cuda build error
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
- [ ] Low
- [ ] Medium
- [x] High
---
#### đŹ Conversation
đ¤ **firecoperana** commented the **2025-06-29** at **19:21:51**:
Test Qwen 2.5 7B Q4_K_S and it runs fine, but for deepseek model, I was getting "GGGGGGG" output with -mla 1 -amb 512. Probably related to deepseek related optimization.
---
đ¤ **ubergarm** commented the **2025-06-29** at **19:51:08**:
For deepseek often one wants to compile with `-DGGML_CUDA_IQK_FORCE_BF16=1` to avoid overflowing fp16 accumulator which manifests as gibberish, nans, or `GGG` typically I believe.
I just tried to compile but got an error, might be because I just updated my rig and now seem to have `gcc version 15.1.1 20250425 (GCC)`... I'll fuss with it a bit but put it here in the meantime.
Details inside:
đ build command and logs
```bash
# attempt to build clean despite it seems to still be using cmake cache? hah...
$ rm -rf ./build
$ cmake -B build -DGGML_VULKAN=ON -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF GGML_CCACHE=OFF
$ cmake --build build --config Release -j $(nproc)
CMake Warning:
Ignoring extra path from command line:
"GGML_CCACHE=OFF"
-- The C compiler identification is GNU 15.1.1
-- The CXX compiler identification is GNU 15.1.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.50.0")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- OpenMP found
-- Using optimized iqk matrix multiplications
-- Enabling IQK Flash Attention kernels
-- Using llamafile
-- Found Vulkan: /lib/libvulkan.so (found version "1.4.313") found components: glslc glslangValidator
-- Vulkan found
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- ARCH_FLAGS = -march=native
-- Configuring done (0.5s)
-- Generating done (0.0s)
-- Build files have been written to: /mnt/astrodata/llm/ik_llama.cpp/build
[ 0%] Generating build details from Git
[ 0%] Building CXX object ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o
[ 1%] Building C object examples/gguf-hash/CMakeFiles/sha256.dir/deps/sha256/sha256.c.o
[ 3%] Building C object examples/gguf-hash/CMakeFiles/xxhash.dir/deps/xxhash/xxhash.c.o
[ 3%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
-- Found Git: /usr/bin/git (found version "2.50.0")
In function âSHA1Updateâ,
inlined from âSHA1Finalâ at /mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:265:5:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: âSHA1Transformâ reading 64 bytes from a region of size 0 [-Wstringop-overread]
219 | SHA1Transform(context->state, &data[i]);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type âconst unsigned char[64]â
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c: In function âSHA1Finalâ:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function âSHA1Transformâ
54 | void SHA1Transform(
| ^~~~~~~~~~~~~
In function âSHA1Updateâ,
inlined from âSHA1Finalâ at /mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:269:9:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: âSHA1Transformâ reading 64 bytes from a region of size 0 [-Wstringop-overread]
219 | SHA1Transform(context->state, &data[i]);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type âconst unsigned char[64]â
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c: In function âSHA1Finalâ:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function âSHA1Transformâ
54 | void SHA1Transform(
| ^~~~~~~~~~~~~
[ 3%] Built target sha256
[ 3%] Built target sha1
[ 3%] Built target xxhash
[ 3%] Generating build details from Git
-- Found Git: /usr/bin/git (found version "2.50.0")
[ 4%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 5%] Linking CXX executable ../../../bin/vulkan-shaders-gen
[ 5%] Built target build_info
[ 5%] Built target vulkan-shaders-gen
[ 6%] Generate vulkan shaders
ggml_vulkan: Generating and compiling shaders to SPIR-V
[ 6%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[ 7%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-vulkan.cpp.o
[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-vulkan-shaders.cpp.o
[ 9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/llamafile/sgemm.cpp.o
[ 9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_kquants.cpp.o
[ 10%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 10%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_mul_mat.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_flash_attn.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_576_512.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iquants.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_256_256.cpp.o
[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_192_128.cpp.o
[ 12%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
[ 14%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_ktquants.cpp.o
[ 14%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_128_128.cpp.o
[ 15%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_64_64.cpp.o
[ 16%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_legacy_quants.cpp.o
[ 16%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_96_96.cpp.o
[ 17%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_floats.cpp.o
[ 17%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_1bit.cpp.o
[ 18%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iqk_quants.cpp.o
[ 18%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_quantize.cpp.o
[ 19%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-aarch64.c.o
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c: In function âggml_compute_forwardâ:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value âGGML_OP_SINâ not handled in switch [-Wswitch]
19814 | switch (tensor->op) {
| ^~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value âGGML_OP_COSâ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value âGGML_OP_COUNT_EQUALâ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value âGGML_OP_CONV_2D_DWâ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value âGGML_OP_RWKV_WKV6â not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value âGGML_OP_OPT_STEP_ADAMWâ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c: In function âggml_compute_backwardâ:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value âGGML_OP_SINâ not handled in switch [-Wswitch]
20395 | switch (tensor->op) {
| ^~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value âGGML_OP_COSâ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value âGGML_OP_COUNT_EQUALâ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value âGGML_OP_CONV_2D_DWâ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value âGGML_OP_RWKV_WKV6â not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value âGGML_OP_OPT_STEP_ADAMWâ not handled in switch [-Wswitch]
In file included from /usr/include/vulkan/vulkan_hpp_macros.hpp:35,
from /usr/include/vulkan/vulkan.hpp:11,
from /mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:8:
/usr/include/c++/15.1.1/ciso646:46:4: warning: #warning " is deprecated in C++17, use to detect implementation-specific macros" [-Wcpp]
46 | # warning " is deprecated in C++17, use to detect implementation-specific macros"
| ^~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp: In function âvoid ggml_vk_print_gpu_info(size_t)â:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:3541:18: warning: unused variable âsubgroup_sizeâ [-Wunused-variable]
3541 | const size_t subgroup_size = (default_subgroup_size != 0) ? default_subgroup_size : subgroup_props.subgroupSize;
| ^~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:3542:16: warning: unused variable âumaâ [-Wunused-variable]
3542 | const bool uma = props2.properties.deviceType == vk::PhysicalDeviceType::eIntegratedGpu;
| ^~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp: In function âvoid ggml_vk_instance_init()â:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:3644:12: warning: unused variable ânum_available_devicesâ [-Wunused-variable]
3644 | size_t num_available_devices = vk_instance.instance.enumeratePhysicalDevices().size();
| ^~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:269:16: warning: no previous prototype for âggml_backend_tensor_memsetâ [-Wmissing-prototypes]
269 | GGML_CALL void ggml_backend_tensor_memset(struct ggml_tensor* tensor, uint8_t value, size_t offset, size_t size) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c: In function âggml_backend_multi_buffer_context_interfaceâ:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1022:34: error: initialization of â_Bool (*)(struct ggml_backend_buffer *, const struct ggml_tensor *, struct ggml_tensor *)â from incompatible pointer type âvoid (*)(struct ggml_backend_buffer *, uint8_t)â {aka âvoid (*)(struct ggml_backend_buffer *, unsigned char)â} [-Wincompatible-pointer-types]
1022 | /* .clear = */ ggml_backend_multi_buffer_clear,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1022:34: note: (near initialization for âmulti_backend_buffer_i.cpy_tensorâ)
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1006:23: note: âggml_backend_multi_buffer_clearâ declared here
1006 | GGML_CALL static void ggml_backend_multi_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1024:5: warning: missing initializer for field âresetâ of âstruct ggml_backend_buffer_iâ [-Wmissing-field-initializers]
1024 | };
| ^
In file included from /mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend-impl.h:50:34: note: âresetâ declared here
50 | void (*GGML_CALL reset) (ggml_backend_buffer_t buffer); // reset any internal state due to tensor initialization, such as tensor extras
| ^~~~~
make[2]: *** [ggml/src/CMakeFiles/ggml.dir/build.make:222: ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:2044: ggml/src/CMakeFiles/ggml.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
```
---
đ¤ **ikawrakow** submitted a review the **2025-06-30** at **07:12:08**: đ `CHANGES_REQUESTED`
Please no new ops, new enum values, and no refactoring of the CPU backend. I think the Vulkan back-end can be updated to the latest without using the new back-end formalism in mainline.
---
đ¤ **ubergarm** commented the **2025-07-01** at **02:59:51**:
@firecoperana
Heya thanks again for digging into this! I have two different rigs on which I'm testing. It does now build on the AMD RX 7900 XTX Ubuntu 24.04 box now!
So good news I was able to compile and run `firecoperana/Merge_mainline_vulkan@495103bd` with vulkan backend! However, only seemed to run without `-fa`. If I try to use `-fa` it segfaults after its mostly loaded and right before llama-server would start listening for inputs.
Seems like something is still off as the speeds are off from mainline. Could be I'm using the AMDVLK driver as installed from `apt-get install libvulkan-dev` `1.4.313.0~rc1-1lunarg24.04-1` or that I'm compiling it wrong? Details in the fold:
đ sweep-bench comparisons Qwen3-14B-Q4_0 dense no FA

```bash
# checkout Merge_mainline_vulkan
$ git rev-parse --short HEAD
495103bd
# build
cmake -B build -DGGML_HIP=OFF -DGGML_HIPBLAS=OFF -DGGML_VULKAN=ON -DGGML_RPC=OFF -DGGML_CCACHE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
# test
model=/home/w/projects/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q4_0.gguf
sudo ./build/bin/llama-sweep-bench \
--model "$model" \
-ctk f16 -ctv f16 \
-c 16896 \
-ngl 99 \
--warmup-batch \
--threads 1
```
## ik_llama.cpp firecoperana/Merge_mainline_vulkan@495103bd FA=0
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 1.363 | 375.67 | 3.786 | 33.81 |
| 512 | 128 | 512 | 1.365 | 375.16 | 3.817 | 33.53 |
| 512 | 128 | 1024 | 1.414 | 362.06 | 3.844 | 33.30 |
| 512 | 128 | 1536 | 1.444 | 354.69 | 3.971 | 32.23 |
| 512 | 128 | 2048 | 1.429 | 358.21 | 3.965 | 32.28 |
| 512 | 128 | 2560 | 1.447 | 353.93 | 4.036 | 31.71 |
| 512 | 128 | 3072 | 1.462 | 350.17 | 4.099 | 31.23 |
| 512 | 128 | 3584 | 1.492 | 343.12 | 4.137 | 30.94 |
| 512 | 128 | 4096 | 1.499 | 341.62 | 4.233 | 30.24 |
| 512 | 128 | 4608 | 1.518 | 337.27 | 4.311 | 29.69 |
| 512 | 128 | 5120 | 1.525 | 335.71 | 4.355 | 29.39 |
| 512 | 128 | 5632 | 1.567 | 326.74 | 4.440 | 28.83 |
| 512 | 128 | 6144 | 1.556 | 329.11 | 4.508 | 28.39 |
| 512 | 128 | 6656 | 1.579 | 324.18 | 4.534 | 28.23 |
| 512 | 128 | 7168 | 1.596 | 320.79 | 4.600 | 27.83 |
| 512 | 128 | 7680 | 1.623 | 315.45 | 4.685 | 27.32 |
| 512 | 128 | 8192 | 1.640 | 312.19 | 4.775 | 26.80 |
## llama.cpp@27208bf6 FA=0
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.323 | 1585.78 | 1.822 | 70.27 |
| 512 | 128 | 512 | 0.334 | 1533.43 | 1.859 | 68.86 |
| 512 | 128 | 1024 | 0.369 | 1386.13 | 1.907 | 67.11 |
| 512 | 128 | 1536 | 0.382 | 1338.94 | 1.956 | 65.43 |
| 512 | 128 | 2048 | 0.374 | 1369.21 | 1.995 | 64.15 |
| 512 | 128 | 2560 | 0.391 | 1308.08 | 2.081 | 61.50 |
| 512 | 128 | 3072 | 0.396 | 1293.44 | 2.148 | 59.58 |
| 512 | 128 | 3584 | 0.422 | 1214.46 | 2.202 | 58.12 |
| 512 | 128 | 4096 | 0.422 | 1214.09 | 2.278 | 56.20 |
| 512 | 128 | 4608 | 0.435 | 1176.88 | 2.344 | 54.61 |
| 512 | 128 | 5120 | 0.441 | 1159.87 | 2.407 | 53.17 |
| 512 | 128 | 5632 | 0.482 | 1061.18 | 2.472 | 51.77 |
| 512 | 128 | 6144 | 0.465 | 1100.88 | 2.549 | 50.21 |
| 512 | 128 | 6656 | 0.483 | 1060.17 | 2.602 | 49.20 |
| 512 | 128 | 7168 | 0.494 | 1037.17 | 2.661 | 48.10 |
| 512 | 128 | 7680 | 0.523 | 979.25 | 2.724 | 46.99 |
| 512 | 128 | 8192 | 0.538 | 951.01 | 2.820 | 45.39 |
On my local rig with a CUDA and ARCH linux installing `extra/vulkan-utility-libraries 1.4.313.0-1 (vulkan-devel)` was having a compiling issue still complaining about RPC during linking. It might be because that super new gcc 15.1.1 though given I just updated everything...
```bash
$ cmake -B build -DGGML_VULKAN=ON -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CCACHE=ON -DCMAKE_BUILD_TYPE=Debug
$ cmake --build build --config Debug -j $(nproc)
[ 24%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 24%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[ 25%] Linking CXX executable ../../bin/llama-gguf
/mnt/astrodata/llm/ik_llama.cpp/src/unicode.cpp: In function âstd::wstring unicode_wstring_from_utf8(const std::string&)â:
/mnt/astrodata/llm/ik_llama.cpp/src/unicode.cpp:232:10: warning: âtemplate class std::__cxx11::wstring_convertâ is deprecated [-Wdeprecated-declarations]
232 | std::wstring_convert> conv;
| ^~~~~~~~~~~~~~~
In file included from /usr/include/c++/15.1.1/locale:47,
from /usr/include/c++/15.1.1/regex:43,
from /mnt/astrodata/llm/ik_llama.cpp/src/unicode.cpp:12:
/usr/include/c++/15.1.1/bits/locale_conv.h:262:33: note: declared here
262 | class _GLIBCXX17_DEPRECATED wstring_convert
| ^~~~~~~~~~~~~~~
[ 25%] Linking CXX executable ../../bin/llama-gguf-hash
[ 26%] Linking CXX shared library libllama.so
/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `ggml_backend_rpc_init'
collect2: error: ld returned 1 exit status
make[2]: *** [examples/gguf/CMakeFiles/llama-gguf.dir/build.make:102: bin/llama-gguf] Error 1
make[1]: *** [CMakeFiles/Makefile2:3314: examples/gguf/CMakeFiles/llama-gguf.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `ggml_backend_rpc_init'
collect2: error: ld returned 1 exit status
make[2]: *** [examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/build.make:108: bin/llama-gguf-hash] Error 1
make[1]: *** [CMakeFiles/Makefile2:3151: examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/all] Error 2
[ 26%] Built target llama
make: *** [Makefile:146: all] Error 2
```
However, if I enable the RPC backend with `-DGGML_RPC=ON` it compiles now! Though starting up it throws some errors and isn't working yet
```bash
model=/mnt/astrodata/llm/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-c 16896 \
-ngl 99 \
--warmup-batch \
--threads 1
llm_load_tensors: ggml ctx size = 0.40 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Vulkan0 buffer size = 7697.69 MiB
llm_load_tensors: CPU buffer size = 417.30 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 16896
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 2640.00 MiB
llama_new_context_with_model: KV self size = 2640.00 MiB, K (f16): 1320.00 MiB, V (f16): 1320.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.58 MiB
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_q_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_k_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.ffn_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.1.attn_norm.weight, the weight will need to be copied
```
Lemme know if there is a certain version of the vulkan backend that might work better or happy to try more iterations! Thanks!
---
đ¤ **firecoperana** commented the **2025-07-01** at **15:00:17**:
I noticed something odd too and suspect it's related to vulkan shader. When I run llama server in visual studio, I can match the performance of the mainline, but if I run in command line, I was only getting 1/3 to 1/2 of the speed for token generation. If you have time, you can do some troubleshooting, as I'm not familiar with vulkan at all.
"warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_norm.weight" happens because vulkan does not support fused rms norm. It only shows in debug version.
---
đ¤ **ikawrakow** commented the **2025-07-01** at **16:38:42**:
Tested on my RTX-4080. If I remove the fused ops (`GGML_OP_FUSED_RMS_NORM` and `GGML_OP_FUSED_MUL_UNARY`) and don't use flash attention, I get this for LlaMA-3.1-8B
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 2.074 | 493.73 | 2.602 | 98.37 |
| 1024 | 256 | 1024 | 1.074 | 953.71 | 3.198 | 80.05 |
| 1024 | 256 | 2048 | 0.968 | 1058.33 | 3.069 | 83.41 |
| 1024 | 256 | 3072 | 0.907 | 1128.89 | 3.187 | 80.32 |
| 1024 | 256 | 4096 | 0.941 | 1088.54 | 3.368 | 76.00 |
| 1024 | 256 | 5120 | 0.962 | 1064.06 | 3.531 | 72.51 |
| 1024 | 256 | 6144 | 0.993 | 1030.96 | 3.742 | 68.42 |
| 1024 | 256 | 7168 | 1.037 | 987.64 | 3.963 | 64.60 |
| 1024 | 256 | 8192 | 1.098 | 932.90 | 4.223 | 60.62 |
| 1024 | 256 | 9216 | 1.156 | 885.58 | 4.474 | 57.22 |
| 1024 | 256 | 10240 | 1.216 | 842.27 | 4.711 | 54.34 |
| 1024 | 256 | 11264 | 1.271 | 805.53 | 4.949 | 51.73 |
| 1024 | 256 | 12288 | 1.323 | 774.28 | 5.201 | 49.22 |
| 1024 | 256 | 13312 | 1.381 | 741.70 | 5.457 | 46.92 |
| 1024 | 256 | 14336 | 1.440 | 711.14 | 5.709 | 44.84 |
| 1024 | 256 | 15360 | 1.469 | 696.92 | 5.962 | 42.94 |
Flash attention seems to be running on the CPU, so performance drops further with that. TG is on par with mainline for short context, but PP is ~3X lower.
---
đ¤ **ikawrakow** commented the **2025-07-01** at **16:48:33**:
If I change the `LOG_DEBUG` to `LOG_INFO` in `ggml_vk_print_gpu_info`, I see this line:
```
ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
```
On mainline I see this:
```
ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
```
So, for some reason int dot products and cooperative matrix are not enabled. I guess, this may explain the lower performance.
---
đ¤ **ikawrakow** submitted a review the **2025-07-01** at **18:07:18**: đŹ `COMMENTED`
---
đ¤ **firecoperana** submitted a review the **2025-07-02** at **01:10:01**: đŹ `COMMENTED`
---
đ¤ **firecoperana** commented during a code review the **2025-07-02** at **01:10:01** on `ggml/src/ggml-vulkan.cpp`:
Removed.
---
đ¤ **ubergarm** commented the **2025-07-02** at **04:42:36**:
> The new commit should remove the need to add these in cmake command. Also disable the fused ops for now.
Thanks I was having trouble getting it setup. First the amazing news, check this out on the AMD RX 7900 XTX it is up to snuff in early testing:

Very nice! I want to try some more models tomorrow but this is getting exciting!
I also got it to build and detect things properly on my local ARCH linux NVIDIA 3090TI FE rig, however when it starts up it throws an error:
```bash
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:2031: GGML_ASSERT((GGML_KQ_MASK_PAD % rows_cols[0]) == 0) failed
```
Amazing progress in a short time!
---
đ¤ **ikawrakow** submitted a review the **2025-07-02** at **06:49:33**: â
`APPROVED`