🔀 #563 - Merge vulkan code from mainline up to commit of 6/28/2025
| Author | firecoperana |
|---|---|
| State | ❌ Closed |
| Created | 2025-06-29 |
| Updated | 2025-07-02 |
Description

- Vulkan Optimizations and Fixes (#8959)
- Optimize Vulkan REPEAT performance
- …
- vulkan: lock accesses of pinned_memory vector (#14333)
- vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
- Fix cuda build error
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High
💬 Conversation
👤 firecoperana commented on 2025-06-29 at 19:21:51:
Tested Qwen 2.5 7B Q4_K_S and it runs fine, but for the deepseek model I was getting "GGGGGGG" output with -mla 1 -amb 512. Probably related to the deepseek-specific optimizations.
👤 ubergarm commented on 2025-06-29 at 19:51:08:
For deepseek one often wants to compile with -DGGML_CUDA_IQK_FORCE_BF16=1 to avoid overflowing the fp16 accumulator, which typically manifests as gibberish, NaNs, or GGG output, I believe.
I just tried to compile but got an error; might be because I just updated my rig and now seem to have gcc version 15.1.1 20250425 (GCC)... I'll fuss with it a bit, but I'll put it here in the meantime.
Details inside:
👈 build command and logs
# attempt to build clean, though it seems to still be using the cmake cache? hah...
# note: GGML_CCACHE=OFF below is missing its -D prefix, hence the CMake warning
$ rm -rf ./build
$ cmake -B build -DGGML_VULKAN=ON -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF GGML_CCACHE=OFF
$ cmake --build build --config Release -j $(nproc)
CMake Warning:
Ignoring extra path from command line:
"GGML_CCACHE=OFF"
-- The C compiler identification is GNU 15.1.1
-- The CXX compiler identification is GNU 15.1.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.50.0")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- OpenMP found
-- Using optimized iqk matrix multiplications
-- Enabling IQK Flash Attention kernels
-- Using llamafile
-- Found Vulkan: /lib/libvulkan.so (found version "1.4.313") found components: glslc glslangValidator
-- Vulkan found
-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- ARCH_FLAGS = -march=native
-- Configuring done (0.5s)
-- Generating done (0.0s)
-- Build files have been written to: /mnt/astrodata/llm/ik_llama.cpp/build
[ 0%] Generating build details from Git
[ 0%] Building CXX object ggml/src/vulkan-shaders/CMakeFiles/vulkan-shaders-gen.dir/vulkan-shaders-gen.cpp.o
[ 1%] Building C object examples/gguf-hash/CMakeFiles/sha256.dir/deps/sha256/sha256.c.o
[ 3%] Building C object examples/gguf-hash/CMakeFiles/xxhash.dir/deps/xxhash/xxhash.c.o
[ 3%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
-- Found Git: /usr/bin/git (found version "2.50.0")
In function ‘SHA1Update’,
inlined from ‘SHA1Final’ at /mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:265:5:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: ‘SHA1Transform’ reading 64 bytes from a region of size 0 [-Wstringop-overread]
219 | SHA1Transform(context->state, &data[i]);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type ‘const unsigned char[64]’
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c: In function ‘SHA1Final’:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function ‘SHA1Transform’
54 | void SHA1Transform(
| ^~~~~~~~~~~~~
In function ‘SHA1Update’,
inlined from ‘SHA1Final’ at /mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:269:9:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: warning: ‘SHA1Transform’ reading 64 bytes from a region of size 0 [-Wstringop-overread]
219 | SHA1Transform(context->state, &data[i]);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:219:13: note: referencing argument 2 of type ‘const unsigned char[64]’
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c: In function ‘SHA1Final’:
/mnt/astrodata/llm/ik_llama.cpp/examples/gguf-hash/deps/sha1/sha1.c:54:6: note: in a call to function ‘SHA1Transform’
54 | void SHA1Transform(
| ^~~~~~~~~~~~~
[ 3%] Built target sha256
[ 3%] Built target sha1
[ 3%] Built target xxhash
[ 3%] Generating build details from Git
-- Found Git: /usr/bin/git (found version "2.50.0")
[ 4%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
[ 5%] Linking CXX executable ../../../bin/vulkan-shaders-gen
[ 5%] Built target build_info
[ 5%] Built target vulkan-shaders-gen
[ 6%] Generate vulkan shaders
ggml_vulkan: Generating and compiling shaders to SPIR-V
[ 6%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
[ 7%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-vulkan.cpp.o
[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/ggml-vulkan-shaders.cpp.o
[ 9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/llamafile/sgemm.cpp.o
[ 9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_kquants.cpp.o
[ 10%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
[ 10%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_mul_mat.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_flash_attn.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_576_512.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iquants.cpp.o
[ 11%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_256_256.cpp.o
[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_192_128.cpp.o
[ 12%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
[ 14%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_ktquants.cpp.o
[ 14%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_128_128.cpp.o
[ 15%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_64_64.cpp.o
[ 16%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_legacy_quants.cpp.o
[ 16%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_96_96.cpp.o
[ 17%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_floats.cpp.o
[ 17%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_1bit.cpp.o
[ 18%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iqk_quants.cpp.o
[ 18%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_quantize.cpp.o
[ 19%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-aarch64.c.o
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c: In function ‘ggml_compute_forward’:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_SIN’ not handled in switch [-Wswitch]
19814 | switch (tensor->op) {
| ^~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_COS’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_COUNT_EQUAL’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_CONV_2D_DW’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_RWKV_WKV6’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:19814:5: warning: enumeration value ‘GGML_OP_OPT_STEP_ADAMW’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c: In function ‘ggml_compute_backward’:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_SIN’ not handled in switch [-Wswitch]
20395 | switch (tensor->op) {
| ^~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_COS’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_COUNT_EQUAL’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_CONV_2D_DW’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_RWKV_WKV6’ not handled in switch [-Wswitch]
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml.c:20395:5: warning: enumeration value ‘GGML_OP_OPT_STEP_ADAMW’ not handled in switch [-Wswitch]
In file included from /usr/include/vulkan/vulkan_hpp_macros.hpp:35,
from /usr/include/vulkan/vulkan.hpp:11,
from /mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:8:
/usr/include/c++/15.1.1/ciso646:46:4: warning: #warning "<ciso646> is deprecated in C++17, use <version> to detect implementation-specific macros" [-Wcpp]
46 | # warning "<ciso646> is deprecated in C++17, use <version> to detect implementation-specific macros"
| ^~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp: In function ‘void ggml_vk_print_gpu_info(size_t)’:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:3541:18: warning: unused variable ‘subgroup_size’ [-Wunused-variable]
3541 | const size_t subgroup_size = (default_subgroup_size != 0) ? default_subgroup_size : subgroup_props.subgroupSize;
| ^~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:3542:16: warning: unused variable ‘uma’ [-Wunused-variable]
3542 | const bool uma = props2.properties.deviceType == vk::PhysicalDeviceType::eIntegratedGpu;
| ^~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp: In function ‘void ggml_vk_instance_init()’:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:3644:12: warning: unused variable ‘num_available_devices’ [-Wunused-variable]
3644 | size_t num_available_devices = vk_instance.instance.enumeratePhysicalDevices().size();
| ^~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:269:16: warning: no previous prototype for ‘ggml_backend_tensor_memset’ [-Wmissing-prototypes]
269 | GGML_CALL void ggml_backend_tensor_memset(struct ggml_tensor* tensor, uint8_t value, size_t offset, size_t size) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c: In function ‘ggml_backend_multi_buffer_context_interface’:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1022:34: error: initialization of ‘_Bool (*)(struct ggml_backend_buffer *, const struct ggml_tensor *, struct ggml_tensor *)’ from incompatible pointer type ‘void (*)(struct ggml_backend_buffer *, uint8_t)’ {aka ‘void (*)(struct ggml_backend_buffer *, unsigned char)’} [-Wincompatible-pointer-types]
1022 | /* .clear = */ ggml_backend_multi_buffer_clear,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1022:34: note: (near initialization for ‘multi_backend_buffer_i.cpy_tensor’)
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1006:23: note: ‘ggml_backend_multi_buffer_clear’ declared here
1006 | GGML_CALL static void ggml_backend_multi_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1024:5: warning: missing initializer for field ‘reset’ of ‘struct ggml_backend_buffer_i’ [-Wmissing-field-initializers]
1024 | };
| ^
In file included from /mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend.c:1:
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-backend-impl.h:50:34: note: ‘reset’ declared here
50 | void (*GGML_CALL reset) (ggml_backend_buffer_t buffer); // reset any internal state due to tensor initialization, such as tensor extras
| ^~~~~
make[2]: *** [ggml/src/CMakeFiles/ggml.dir/build.make:222: ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:2044: ggml/src/CMakeFiles/ggml.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
👤 ikawrakow submitted a review on 2025-06-30 at 07:12:08: 🔄 CHANGES_REQUESTED
Please no new ops, new enum values, and no refactoring of the CPU backend. I think the Vulkan back-end can be updated to the latest without using the new back-end formalism in mainline.
👤 ubergarm commented on 2025-07-01 at 02:59:51:
@firecoperana
Heya, thanks again for digging into this! I have two different rigs on which I'm testing. It does now build on the AMD RX 7900 XTX Ubuntu 24.04 box!
So good news: I was able to compile and run firecoperana/Merge_mainline_vulkan@495103bd with the vulkan backend! However, it only seemed to run without -fa. If I try to use -fa, it segfaults after the model is mostly loaded, right before llama-server would start listening for inputs.
Seems like something is still off, as the speeds are lower than mainline. Could be that I'm using the AMDVLK driver as installed from apt-get install libvulkan-dev 1.4.313.0~rc1-1lunarg24.04-1, or that I'm compiling it wrong? Details in the fold:
👈 sweep-bench comparisons Qwen3-14B-Q4_0 dense no FA
# checkout Merge_mainline_vulkan
$ git rev-parse --short HEAD
495103bd
# build
cmake -B build -DGGML_HIP=OFF -DGGML_HIPBLAS=OFF -DGGML_VULKAN=ON -DGGML_RPC=OFF -DGGML_CCACHE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
# test
model=/home/w/projects/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q4_0.gguf
sudo ./build/bin/llama-sweep-bench \
--model "$model" \
-ctk f16 -ctv f16 \
-c 16896 \
-ngl 99 \
--warmup-batch \
--threads 1
ik_llama.cpp firecoperana/Merge_mainline_vulkan@495103bd FA=0
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 1.363 | 375.67 | 3.786 | 33.81 |
| 512 | 128 | 512 | 1.365 | 375.16 | 3.817 | 33.53 |
| 512 | 128 | 1024 | 1.414 | 362.06 | 3.844 | 33.30 |
| 512 | 128 | 1536 | 1.444 | 354.69 | 3.971 | 32.23 |
| 512 | 128 | 2048 | 1.429 | 358.21 | 3.965 | 32.28 |
| 512 | 128 | 2560 | 1.447 | 353.93 | 4.036 | 31.71 |
| 512 | 128 | 3072 | 1.462 | 350.17 | 4.099 | 31.23 |
| 512 | 128 | 3584 | 1.492 | 343.12 | 4.137 | 30.94 |
| 512 | 128 | 4096 | 1.499 | 341.62 | 4.233 | 30.24 |
| 512 | 128 | 4608 | 1.518 | 337.27 | 4.311 | 29.69 |
| 512 | 128 | 5120 | 1.525 | 335.71 | 4.355 | 29.39 |
| 512 | 128 | 5632 | 1.567 | 326.74 | 4.440 | 28.83 |
| 512 | 128 | 6144 | 1.556 | 329.11 | 4.508 | 28.39 |
| 512 | 128 | 6656 | 1.579 | 324.18 | 4.534 | 28.23 |
| 512 | 128 | 7168 | 1.596 | 320.79 | 4.600 | 27.83 |
| 512 | 128 | 7680 | 1.623 | 315.45 | 4.685 | 27.32 |
| 512 | 128 | 8192 | 1.640 | 312.19 | 4.775 | 26.80 |
llama.cpp@27208bf6 FA=0
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 512 | 128 | 0 | 0.323 | 1585.78 | 1.822 | 70.27 |
| 512 | 128 | 512 | 0.334 | 1533.43 | 1.859 | 68.86 |
| 512 | 128 | 1024 | 0.369 | 1386.13 | 1.907 | 67.11 |
| 512 | 128 | 1536 | 0.382 | 1338.94 | 1.956 | 65.43 |
| 512 | 128 | 2048 | 0.374 | 1369.21 | 1.995 | 64.15 |
| 512 | 128 | 2560 | 0.391 | 1308.08 | 2.081 | 61.50 |
| 512 | 128 | 3072 | 0.396 | 1293.44 | 2.148 | 59.58 |
| 512 | 128 | 3584 | 0.422 | 1214.46 | 2.202 | 58.12 |
| 512 | 128 | 4096 | 0.422 | 1214.09 | 2.278 | 56.20 |
| 512 | 128 | 4608 | 0.435 | 1176.88 | 2.344 | 54.61 |
| 512 | 128 | 5120 | 0.441 | 1159.87 | 2.407 | 53.17 |
| 512 | 128 | 5632 | 0.482 | 1061.18 | 2.472 | 51.77 |
| 512 | 128 | 6144 | 0.465 | 1100.88 | 2.549 | 50.21 |
| 512 | 128 | 6656 | 0.483 | 1060.17 | 2.602 | 49.20 |
| 512 | 128 | 7168 | 0.494 | 1037.17 | 2.661 | 48.10 |
| 512 | 128 | 7680 | 0.523 | 979.25 | 2.724 | 46.99 |
| 512 | 128 | 8192 | 0.538 | 951.01 | 2.820 | 45.39 |
On my local Arch Linux rig with CUDA, after installing extra/vulkan-utility-libraries 1.4.313.0-1 (vulkan-devel), I was still having a compile issue: it complains about RPC during linking. It might be because of that super new gcc 15.1.1, though, given I just updated everything...
$ cmake -B build -DGGML_VULKAN=ON -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CCACHE=ON -DCMAKE_BUILD_TYPE=Debug
$ cmake --build build --config Debug -j $(nproc)
[ 24%] Building CXX object src/CMakeFiles/llama.dir/llama.cpp.o
[ 24%] Building CXX object src/CMakeFiles/llama.dir/unicode.cpp.o
[ 25%] Linking CXX executable ../../bin/llama-gguf
/mnt/astrodata/llm/ik_llama.cpp/src/unicode.cpp: In function ‘std::wstring unicode_wstring_from_utf8(const std::string&)’:
/mnt/astrodata/llm/ik_llama.cpp/src/unicode.cpp:232:10: warning: ‘template<class _Codecvt, class _Elem, class _Wide_alloc, class _Byte_alloc> class std::__cxx11::wstring_convert’ is deprecated [-Wdeprecated-declarations]
232 | std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
| ^~~~~~~~~~~~~~~
In file included from /usr/include/c++/15.1.1/locale:47,
from /usr/include/c++/15.1.1/regex:43,
from /mnt/astrodata/llm/ik_llama.cpp/src/unicode.cpp:12:
/usr/include/c++/15.1.1/bits/locale_conv.h:262:33: note: declared here
262 | class _GLIBCXX17_DEPRECATED wstring_convert
| ^~~~~~~~~~~~~~~
[ 25%] Linking CXX executable ../../bin/llama-gguf-hash
[ 26%] Linking CXX shared library libllama.so
/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `ggml_backend_rpc_init'
collect2: error: ld returned 1 exit status
make[2]: *** [examples/gguf/CMakeFiles/llama-gguf.dir/build.make:102: bin/llama-gguf] Error 1
make[1]: *** [CMakeFiles/Makefile2:3314: examples/gguf/CMakeFiles/llama-gguf.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `ggml_backend_rpc_init'
collect2: error: ld returned 1 exit status
make[2]: *** [examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/build.make:108: bin/llama-gguf-hash] Error 1
make[1]: *** [CMakeFiles/Makefile2:3151: examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/all] Error 2
[ 26%] Built target llama
make: *** [Makefile:146: all] Error 2
However, if I enable the RPC backend with -DGGML_RPC=ON, it compiles now! Though on startup it throws some errors and isn't working yet:
model=/mnt/astrodata/llm/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-c 16896 \
-ngl 99 \
--warmup-batch \
--threads 1
llm_load_tensors: ggml ctx size = 0.40 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: Vulkan0 buffer size = 7697.69 MiB
llm_load_tensors: CPU buffer size = 417.30 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 16896
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 2640.00 MiB
llama_new_context_with_model: KV self size = 2640.00 MiB, K (f16): 1320.00 MiB, V (f16): 1320.00 MiB
llama_new_context_with_model: Vulkan_Host output buffer size = 0.58 MiB
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_q_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_k_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.ffn_norm.weight, the weight will need to be copied
ggml_backend_sched_backend_from_buffer: warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.1.attn_norm.weight, the weight will need to be copied
Lemme know if there is a certain version of the vulkan backend that might work better or happy to try more iterations! Thanks!
👤 firecoperana commented on 2025-07-01 at 15:00:17:
I noticed something odd too and suspect it's related to the Vulkan shaders. When I run llama-server in Visual Studio, I can match the performance of mainline, but if I run from the command line, I was only getting 1/3 to 1/2 of the speed for token generation. If you have time, you can do some troubleshooting, as I'm not familiar with vulkan at all.
The "warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_norm.weight" happens because vulkan does not support fused rms norm. It only shows in the debug version.
👤 ikawrakow commented on 2025-07-01 at 16:38:42:
Tested on my RTX-4080. If I remove the fused ops (GGML_OP_FUSED_RMS_NORM and GGML_OP_FUSED_MUL_UNARY) and don't use flash attention, I get this for LLaMA-3.1-8B:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 256 | 0 | 2.074 | 493.73 | 2.602 | 98.37 |
| 1024 | 256 | 1024 | 1.074 | 953.71 | 3.198 | 80.05 |
| 1024 | 256 | 2048 | 0.968 | 1058.33 | 3.069 | 83.41 |
| 1024 | 256 | 3072 | 0.907 | 1128.89 | 3.187 | 80.32 |
| 1024 | 256 | 4096 | 0.941 | 1088.54 | 3.368 | 76.00 |
| 1024 | 256 | 5120 | 0.962 | 1064.06 | 3.531 | 72.51 |
| 1024 | 256 | 6144 | 0.993 | 1030.96 | 3.742 | 68.42 |
| 1024 | 256 | 7168 | 1.037 | 987.64 | 3.963 | 64.60 |
| 1024 | 256 | 8192 | 1.098 | 932.90 | 4.223 | 60.62 |
| 1024 | 256 | 9216 | 1.156 | 885.58 | 4.474 | 57.22 |
| 1024 | 256 | 10240 | 1.216 | 842.27 | 4.711 | 54.34 |
| 1024 | 256 | 11264 | 1.271 | 805.53 | 4.949 | 51.73 |
| 1024 | 256 | 12288 | 1.323 | 774.28 | 5.201 | 49.22 |
| 1024 | 256 | 13312 | 1.381 | 741.70 | 5.457 | 46.92 |
| 1024 | 256 | 14336 | 1.440 | 711.14 | 5.709 | 44.84 |
| 1024 | 256 | 15360 | 1.469 | 696.92 | 5.962 | 42.94 |
Flash attention seems to be running on the CPU, so performance drops further with that. TG is on par with mainline for short context, but PP is ~3X lower.
👤 ikawrakow commented on 2025-07-01 at 16:48:33:
If I change the LOG_DEBUG to LOG_INFO in ggml_vk_print_gpu_info, I see this line:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: none
On mainline I see this:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
So, for some reason, integer dot products and cooperative matrix support are not enabled. I guess this may explain the lower performance.
👤 ikawrakow submitted a review on 2025-07-01 at 18:07:18: 💬 COMMENTED
👤 firecoperana submitted a review on 2025-07-02 at 01:10:01: 💬 COMMENTED
👤 firecoperana commented during a code review on 2025-07-02 at 01:10:01 on ggml/src/ggml-vulkan.cpp:
Removed.
👤 ubergarm commented on 2025-07-02 at 04:42:36:
> The new commit should remove the need to add these in cmake command. Also disable the fused ops for now.

Thanks, I was having trouble getting it set up. First the amazing news: check this out on the AMD RX 7900 XTX, it is up to snuff in early testing:
Very nice! I want to try some more models tomorrow but this is getting exciting!
I also got it to build and detect things properly on my local Arch Linux NVIDIA 3090 Ti FE rig; however, when it starts up it throws an error:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:2031: GGML_ASSERT((GGML_KQ_MASK_PAD % rows_cols[0]) == 0) failed
Amazing progress in a short time!
👤 ikawrakow submitted a review on 2025-07-02 at 06:49:33: ✅ APPROVED