🔀 #563 - Merge vulkan code from mainline up to commit of 6/28/2025
| Author | firecoperana |
|---|---|
| State | ❌ Closed |
| Created | 2025-06-29 |
| Updated | 2025-07-02 |
Description

- Vulkan Optimizations and Fixes (#8959)
- Optimize Vulkan REPEAT performance
- …
- vulkan: lock accesses of pinned_memory vector (#14333)
- vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
- Fix cuda build error
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High
💬 Conversation
👤 firecoperana commented on 2025-06-29 at 19:21:51:
Tested Qwen 2.5 7B Q4_K_S and it runs fine, but for the deepseek model I was getting "GGGGGGG" output with -mla 1 -amb 512. Probably related to the deepseek-specific optimizations.
👤 ubergarm commented on 2025-06-29 at 19:51:08:
For deepseek one often wants to compile with -DGGML_CUDA_IQK_FORCE_BF16=1 to avoid overflowing the fp16 accumulator, which typically manifests as gibberish, NaNs, or GGG output, I believe.
I just tried to compile but got an error; might be because I just updated my rig and now seem to have gcc version 15.1.1 20250425 (GCC)... I'll fuss with it a bit, but I'll put it here in the meantime.
Details inside:
👈 build command and logs
# attempt to build clean, though it seems to still be using the cmake cache? hah...
# note: GGML_CCACHE=OFF below is missing its -D prefix, hence the CMake warning
$ rm -rf ./build
$ cmake -B build -DGGML_VULKAN=ON -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF GGML_CCACHE=OFF
$ cmake --build build --config Release -j $(nproc)
CMake Warning:
Ignoring extra path from command line:
"GGML_CCACHE=OFF"
-- The C compiler identification is GNU 15.1.1
-- The CXX compiler identification is GNU 15.1.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.50.0")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- OpenMP found
-- Using optimized iqk matrix multiplications
-- Enabling IQK Flash Attention kernels
-- Using llamafile
-- Found Vulkan: /lib/libvulkan.so (found version "1.4.313") found components: glslc glslangValidator
-- Vulkan found
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- ARCH_FLAGS = -march=native
-- Configuring done (0.5s)
-- Generating done (0.0s)
-- Build files have been written to: /mnt/astrodata/llm/ik_llama.cpp/build
[ 0%] Generating build details from Git
[ 0%] Building CXX object ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o
[ 1%] Building C object examples/gguf-hash/CMakeFiles/sha256.dir/deps/sha256/sha256.c.o
[ 3%] Building C object examples/gguf-hash/CMakeFiles/xxhash.dir/deps/xxhash/xxhash.c.o
[ 3%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
-- Found Git: /usr/bin/git (found version "2.50.0")
In function ‘SHA1Update’,
inlined from ‘SHA1Final’ at /mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:265:5:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: ‘SHA1Transform’ reading 64 bytes from a region of size 0 [-Wstringop-overread]
219 | SHA1Transform(context->state, &data[i]);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type ‘const unsigned char[64]’
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c: In function ‘SHA1Final’:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function ‘SHA1Transform’
54 | void SHA1Transform(
| ^~~~~~~~~~~~~
In function ‘SHA1Update’,
inlined from ‘SHA1Final’ at /mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:269:9:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: ‘SHA1Transform’ reading 64 bytes from a region of size 0 [-Wstringop-overread]
219 | SHA1Transform(context->state, &data[i]);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type ‘const unsigned char[64]’
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c: In function ‘SHA1Final’:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function ‘SHA1Transform’
54 | void SHA1Transform(
| ^~~~~~~~~~~~~
[ 3%] Built target sha256
[ 3%] Built target sha1
[ 3%] Built target xxhash
[ 3%] Generating build details from Git
-- Found Git: /usr/bin/git (found version "2.50.0")
[ 4%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 5%] Linking CXX executable ../../../bin/vulkan-shaders-gen
[ 5%] Built target build_info
[ 5%] Built target vulkan-shaders-gen
[ 6%] Generate vulkan shaders
ggml_vulkan: Generating and compiling shaders to SPIR-V
[ 6%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[ 7%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-vulkan.cpp.o
[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-vulkan-shaders.cpp.o
[ 9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/llamafile/sgemm.cpp.o
[ 9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_kquants.cpp.o
[ 10%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 10%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_mul_mat.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_flash_attn.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_576_512.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iquants.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_256_256.cpp.o
[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_192_128.cpp.o
[ 12%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
[ 14%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_ktquants.cpp.o
[ 14%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_128_128.cpp.o
[ 15%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_64_64.cpp.o
[ 16%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_legacy_quants.cpp.o
[ 16%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_96_96.cpp.o
[ 17%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_floats.cpp.o
[ 17%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_1bit.cpp.o
[ 18%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iqk_quants.cpp.o
[ 18%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_quantize.cpp.o
[ 19%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-aarch64.c.o
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c: In function ‘ggml_compute_forward’:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_SIN’ not handled in switch [-Wswitch]
19814 | switch (tensor->op) {
| ^~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_COS’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_COUNT_EQUAL’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_CONV_2D_DW’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_RWKV_WKV6’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_OPT_STEP_ADAMW’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c: In function ‘ggml_compute_backward’:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_SIN’ not handled in switch [-Wswitch]
20395 | switch (tensor->op) {
| ^~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_COS’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_COUNT_EQUAL’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_CONV_2D_DW’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_RWKV_WKV6’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_OPT_STEP_ADAMW’ not handled in switch [-Wswitch]
In file included from /usr/include/vulkan/vulkan_hpp_macros.hpp:35,
from /usr/include/vulkan/vulkan.hpp:11,
from /mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:8:
/usr/include/c++/15.1.1/ciso646:46:4: warning: #warning "<ciso646> is deprecated in C++17, use <version> to detect implementation-specific macros" [-Wcpp]
46 | # warning "<ciso646> is deprecated in C++17, use <version> to detect implementation-specific macros"
| ^~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp: In function ‘void ggml_vk_print_gpu_info(size_t)’:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:3541:18: warning: unused variable ‘subgroup_size’ [-Wunused-variable]
3541 | const size_t subgroup_size = (default_subgroup_size != 0) ? default_subgroup_size : subgroup_props.subgroupSize;
| ^~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:3542:16: warning: unused variable ‘uma’ [-Wunused-variable]
3542 | const bool uma = props2.properties.deviceType == vk::PhysicalDeviceType::eIntegratedGpu;
| ^~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp: In function ‘void ggml_vk_instance_init()’:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:3644:12: warning: unused variable ‘num_available_devices’ [-Wunused-variable]
3644 | size_t num_available_devices = vk_instance.instance.enumeratePhysicalDevices().size();
| ^~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:269:16: warning: no previous prototype for ‘ggml_backend_tensor_memset’ [-Wmissing-prototypes]
269 | GGML_CALL void ggml_backend_tensor_memset(struct ggml_tensor* tensor, uint8_t value, size_t offset, size_t size) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c: In function ‘ggml_backend_multi_buffer_context_interface’:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1022:34: error: initialization of ‘_Bool (*)(struct ggml_backend_buffer *, const struct ggml_tensor *, struct ggml_tensor *)’ from incompatible pointer type ‘void (*)(struct ggml_backend_buffer *, uint8_t)’ {aka ‘void (*)(struct ggml_backend_buffer *, unsigned char)’} [-Wincompatible-pointer-types]
1022 | /* .clear = */ ggml_backend_multi_buffer_clear,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1022:34: note: (near initialization for ‘multi_backend_buffer_i.cpy_tensor’)
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1006:23: note: ‘ggml_backend_multi_buffer_clear’ declared here
1006 | GGML_CALL static void ggml_backend_multi_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1024:5: warning: missing initializer for field ‘reset’ of ‘struct ggml_backend_buffer_i’ [-Wmissing-field-initializers]
1024 | };
| ^
In file included from /mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend-impl.h:50:34: note: ‘reset’ declared here
50 | void (*GGML_CALL reset) (ggml_backend_buffer_t buffer); // reset any internal state due to tensor initialization, such as tensor extras
| ^~~~~
make[2]: *** [ggml/src/CMakeFiles/ggml.dir/build.make:222: ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:2044: ggml/src/CMakeFiles/ggml.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
👤 ikawrakow submitted a review on 2025-06-30 at 07:12:08: 🔄 CHANGES_REQUESTED
Please no new ops, new enum values, and no refactoring of the CPU backend. I think the Vulkan back-end can be updated to the latest without using the new back-end formalism in mainline.
👤 ubergarm commented on 2025-07-01 at 02:59:51:
@firecoperana
Heya, thanks again for digging into this! I have two different rigs on which I'm testing. It does now build on the AMD RX 7900 XTX Ubuntu 24.04 box!
So good news: I was able to compile and run firecoperana/Merge_mainline_vulkan@495103bd with the vulkan backend! However, it only seemed to run without -fa. If I try to use -fa, it segfaults after the model is mostly loaded, right before llama-server would start listening for inputs.
Seems like something is still off, as the speeds are lower than mainline. Could be that I'm using the AMDVLK driver as installed from apt-get install libvulkan-dev 1.4.313.0~rc1-1lunarg24.04-1, or that I'm compiling it wrong? Details in the fold:
👈 sweep-bench comparisons Qwen3-14B-Q4_0 dense no FA
# checkout Merge_mainline_vulkan
$ git rev-parse --short HEAD
495103bd
# build
cmake -B build -DGGML_HIP=OFF -DGGML_HIPBLAS=OFF -DGGML_VULKAN=ON -DGGML_RPC=OFF -DGGML_CCACHE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
# test
model=/home/w/projects/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q4_0.gguf
sudo ./build/bin/llama-sweep-bench \
--model "$model" \
-ctk f16 -ctv f16 \
-c 16896 \
-ngl 99 \
--warmup-batch \
--threads 1
ik_llama.cpp firecoperana/Merge_mainline_vulkan@495103bd FA=0
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.363 | 375.67 | 3.786 | 33.81 |
| 512 | 128 | 512 | 1.365 | 375.16 | 3.817 | 33.53 |
| 512 | 128 | 1024 | 1.414 | 362.06 | 3.844 | 33.30 |
| 512 | 128 | 1536 | 1.444 | 354.69 | 3.971 | 32.23 |
| 512 | 128 | 2048 | 1.429 | 358.21 | 3.965 | 32.28 |
| 512 | 128 | 2560 | 1.447 | 353.93 | 4.036 | 31.71 |
| 512 | 128 | 3072 | 1.462 | 350.17 | 4.099 | 31.23 |
| 512 | 128 | 3584 | 1.492 | 343.12 | 4.137 | 30.94 |
| 512 | 128 | 4096 | 1.499 | 341.62 | 4.233 | 30.24 |
| 512 | 128 | 4608 | 1.518 | 337.27 | 4.311 | 29.69 |
| 512 | 128 | 5120 | 1.525 | 335.71 | 4.355 | 29.39 |
| 512 | 128 | 5632 | 1.567 | 326.74 | 4.440 | 28.83 |
| 512 | 128 | 6144 | 1.556 | 329.11 | 4.508 | 28.39 |
| 512 | 128 | 6656 | 1.579 | 324.18 | 4.534 | 28.23 |
| 512 | 128 | 7168 | 1.596 | 320.79 | 4.600 | 27.83 |
| 512 | 128 | 7680 | 1.623 | 315.45 | 4.685 | 27.32 |
| 512 | 128 | 8192 | 1.640 | 312.19 | 4.775 | 26.80 |
llama.cpp@27208bf6 FA=0
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 0.323 | 1585.78 | 1.822 | 70.27 |
| 512 | 128 | 512 | 0.334 | 1533.43 | 1.859 | 68.86 |
| 512 | 128 | 1024 | 0.369 | 1386.13 | 1.907 | 67.11 |
| 512 | 128 | 1536 | 0.382 | 1338.94 | 1.956 | 65.43 |
| 512 | 128 | 2048 | 0.374 | 1369.21 | 1.995 | 64.15 |
| 512 | 128 | 2560 | 0.391 | 1308.08 | 2.081 | 61.50 |
| 512 | 128 | 3072 | 0.396 | 1293.44 | 2.148 | 59.58 |
| 512 | 128 | 3584 | 0.422 | 1214.46 | 2.202 | 58.12 |
| 512 | 128 | 4096 | 0.422 | 1214.09 | 2.278 | 56.20 |
| 512 | 128 | 4608 | 0.435 | 1176.88 | 2.344 | 54.61 |
| 512 | 128 | 5120 | 0.441 | 1159.87 | 2.407 | 53.17 |
| 512 | 128 | 5632 | 0.482 | 1061.18 | 2.472 | 51.77 |
| 512 | 128 | 6144 | 0.465 | 1100.88 | 2.549 | 50.21 |
| 512 | 128 | 6656 | 0.483 | 1060.17 | 2.602 | 49.20 |
| 512 | 128 | 7168 | 0.494 | 1037.17 | 2.661 | 48.10 |
| 512 | 128 | 7680 | 0.523 | 979.25 | 2.724 | 46.99 |
| 512 | 128 | 8192 | 0.538 | 951.01 | 2.820 | 45.39 |
On my local Arch Linux rig with CUDA, after installing extra/vulkan-utility-libraries 1.4.313.0-1 (vulkan-devel), I was still having a compile issue: it complains about RPC during linking. It might be because of that super new gcc 15.1.1, though, given I just updated everything...
$ cmake -B build -DGGML_VULKAN=ON -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CCACHE=ON -DCMAKE_BUILD_TYPE=Debug
$ cmake --build build --config Debug -j $(nproc)
[ 24%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 24%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[ 25%] Linking CXX executable ../../bin/llama-gguf
/mnt/astrodata/llm/ik_llama.cpp/src/unicode.cpp: In function ‘std::wstring unicode_wstring_from_utf8(const std::string&)’:
/mnt/astrodata/llm/ik_llama.cpp/src/unicode.cpp:232:10: warning: ‘template<class _Codecvt, class _Elem, class _Wide_alloc, class _Byte_alloc> class std::__cxx11::wstring_convert’ is deprecated [-Wdeprecated-declarations]
232 | std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
| ^~~~~~~~~~~~~~~
In file included from /usr/include/c++/15.1.1/locale:47,
from /usr/include/c++/15.1.1/regex:43,
from /mnt/astrodata/llm/ik_llama.cpp/src/unicode.cpp:12:
/usr/include/c++/15.1.1/bits/locale_conv.h:262:33: note: declared here
262 | class _GLIBCXX17_DEPRECATED wstring_convert
| ^~~~~~~~~~~~~~~
[ 25%] Linking CXX executable ../../bin/llama-gguf-hash
[ 26%] Linking CXX shared library libllama.so
/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `ggml_backend_rpc_init'
collect2: error: ld returned 1 exit status
make[2]: *** [examples/gguf/CMakeFiles/llama-gguf.dir/build.make:102: bin/llama-gguf] Error 1
make[1]: *** [CMakeFiles/Makefile2:3314: examples/gguf/CMakeFiles/llama-gguf.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `ggml_backend_rpc_init'
collect2: error: ld returned 1 exit status
make[2]: *** [examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/build.make:108: bin/llama-gguf-hash] Error 1
make[1]: *** [CMakeFiles/Makefile2:3151: examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/all] Error 2
[ 26%] Built target llama
make: *** [Makefile:146: all] Error 2
However, if I enable the RPC backend with -DGGML_RPC=ON, it compiles now! Though on startup it throws some errors and isn't working yet:
model=/mnt/astrodata/llm/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-c 16896 \
-ngl 99 \
--warmup-batch \
--threads 1
llm_load_tensors: ggml ctx size = 0.40 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Vulkan0 buffer size = 7697.69 MiB
llm_load_tensors: CPU buffer size = 417.30 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 16896
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 2640.00 MiB
llama_new_context_with_model: KV self size = 2640.00 MiB, K (f16): 1320.00 MiB, V (f16): 1320.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.58 MiB
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_q_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_k_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.ffn_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.1.attn_norm.weight, the weight will need to be copied
Lemme know if there is a certain version of the vulkan backend that might work better or happy to try more iterations! Thanks!
👤 firecoperana commented on 2025-07-01 at 15:00:17:
I noticed something odd too and suspect it's related to the Vulkan shaders. When I run llama-server in Visual Studio, I can match the performance of mainline, but if I run from the command line, I was only getting 1/3 to 1/2 of the speed for token generation. If you have time, you can do some troubleshooting, as I'm not familiar with vulkan at all.
The "warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_norm.weight" happens because vulkan does not support fused rms norm. It only shows in the debug version.
👤 ikawrakow commented on 2025-07-01 at 16:38:42:
Tested on my RTX-4080. If I remove the fused ops (GGML_OP_FUSED_RMS_NORM and GGML_OP_FUSED_MUL_UNARY) and don't use flash attention, I get this for LLaMA-3.1-8B:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 256 | 0 | 2.074 | 493.73 | 2.602 | 98.37 |
| 1024 | 256 | 1024 | 1.074 | 953.71 | 3.198 | 80.05 |
| 1024 | 256 | 2048 | 0.968 | 1058.33 | 3.069 | 83.41 |
| 1024 | 256 | 3072 | 0.907 | 1128.89 | 3.187 | 80.32 |
| 1024 | 256 | 4096 | 0.941 | 1088.54 | 3.368 | 76.00 |
| 1024 | 256 | 5120 | 0.962 | 1064.06 | 3.531 | 72.51 |
| 1024 | 256 | 6144 | 0.993 | 1030.96 | 3.742 | 68.42 |
| 1024 | 256 | 7168 | 1.037 | 987.64 | 3.963 | 64.60 |
| 1024 | 256 | 8192 | 1.098 | 932.90 | 4.223 | 60.62 |
| 1024 | 256 | 9216 | 1.156 | 885.58 | 4.474 | 57.22 |
| 1024 | 256 | 10240 | 1.216 | 842.27 | 4.711 | 54.34 |
| 1024 | 256 | 11264 | 1.271 | 805.53 | 4.949 | 51.73 |
| 1024 | 256 | 12288 | 1.323 | 774.28 | 5.201 | 49.22 |
| 1024 | 256 | 13312 | 1.381 | 741.70 | 5.457 | 46.92 |
| 1024 | 256 | 14336 | 1.440 | 711.14 | 5.709 | 44.84 |
| 1024 | 256 | 15360 | 1.469 | 696.92 | 5.962 | 42.94 |
Flash attention seems to be running on the CPU, so performance drops further with that. TG is on par with mainline for short context, but PP is ~3X lower.
👤 ikawrakow commented on 2025-07-01 at 16:48:33:
If I change the LOG_DEBUG to LOG_INFO in ggml_vk_print_gpu_info, I see this line:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
On mainline I see this:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
So, for some reason, integer dot products and cooperative matrix support are not enabled. I guess this may explain the lower performance.
👤 ikawrakow submitted a review on 2025-07-01 at 18:07:18: 💬 COMMENTED
👤 firecoperana submitted a review on 2025-07-02 at 01:10:01: 💬 COMMENTED
👤 firecoperana commented during a code review on 2025-07-02 at 01:10:01 on ggml/src/ggml-vulkan.cpp:
Removed.
👤 ubergarm commented on 2025-07-02 at 04:42:36:
> The new commit should remove the need to add these in cmake command. Also disable the fused ops for now.

Thanks, I was having trouble getting it set up. First the amazing news: check this out on the AMD RX 7900 XTX, it is up to snuff in early testing:
Very nice! I want to try some more models tomorrow but this is getting exciting!
I also got it to build and detect things properly on my local Arch Linux NVIDIA 3090 Ti FE rig; however, when it starts up it throws an error:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:2031: GGML_ASSERT((GGML_KQ_MASK_PAD % rows_cols[0]) == 0) failed
Amazing progress in a short time!
👤 ikawrakow submitted a review on 2025-07-02 at 06:49:33: ✅ APPROVED