
🔀 #598 - Vulkan: iquants and flash attention split_k_reduce improvement

Author firecoperana
State Closed
Created 2025-07-11
Updated 2025-07-16

Description

Vulkan small token gen improvement

Taken from https://github.com/ggml-org/llama.cpp/pull/14485 and https://github.com/ggml-org/llama.cpp/pull/14554


💬 Conversation

👤 ubergarm commented on 2025-07-11 at 19:14:27:

I had to refactor the mainline llama-sweep-bench for some llama_memory_ API business, but it seems to still be working. I added that result from mainline to the results above. So the ik fork seems faster with or without this PR, fwiw 🤷

[image: sweep-bench-pr598-mainline]

👤 firecoperana commented on 2025-07-11 at 21:28:51:

For the second commit, the performance gain is for KV < 512, if I understand it correctly.


👤 ikawrakow commented on 2025-07-12 at 09:48:22:

> Also I'm not sure how to make it say KHR_coopmat instead of NV_coopmat2 like Jeff Bolz's results show.

If your driver supports NV_coopmat2, that is the one you want, as performance is much better than with KHR_coopmat. But if you want to test both, you need to work with preprocessor defines at build time (look for GGML_VULKAN_COOPMAT_GLSLC_SUPPORT and GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT).
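
A minimal sketch of what testing both paths might look like, assuming these defines can be toggled from the CMake command line — note that ubergarm reports further down that `-DGGML_VULKAN_COOPMAT2_GLSLC_SUPPORT=0` did not take effect for him, so editing `ggml/src/CMakeLists.txt` may be needed instead:

```bash
# Sketch: build one binary per cooperative-matrix path and compare.
# Whether this define is accepted as a CMake option is an assumption;
# it may need to be commented out in ggml/src/CMakeLists.txt instead.
cmake -B build-khr -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON \
      -DGGML_VULKAN_COOPMAT2_GLSLC_SUPPORT=0
cmake --build build-khr --config Release -j "$(nproc)"

# The startup banner reports which path is active, e.g.
#   ... | matrix cores: KHR_coopmat   (or NV_coopmat2)
```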

Apart from performance, did someone test that it works correctly?


👤 ikawrakow commented on 2025-07-12 at 09:51:29:

Oh, btw, the not-yet-merged 14555 looks much more interesting, with quite significant performance gains for DeepSeek.


👤 firecoperana commented on 2025-07-12 at 12:06:14:

14555 was just merged.


👤 ubergarm commented on 2025-07-12 at 16:30:59:

> Apart from performance, did someone test that it works correctly?

It seems like -fa has numerical issues on the Vulkan backend (even on the main branch).

I ran perplexity on my test Qwen3-14B-IQ2_XS.gguf quant for some configurations, with mixed results.

| branch@sha | backend | FA | perplexity |
|---|---|---|---|
| main@c53cb652 | vulkan | off | 10.3251 +/- 0.08240 |
| main@c53cb652 | vulkan | enabled | nan |
| main@c53cb652 | cuda | off | 10.3244 +/- 0.08241 |
| main@c53cb652 | cuda | enabled | 10.3231 +/- 0.08240 |

I haven't tested this PR yet, as I want to get a DeepSeek-V2-Lite quant that would better exercise all the PRs involved now.

```bash
# Test with and without `-fa`
model=/mnt/astrodata/llm/models/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ2_XS.gguf
./build/bin/llama-perplexity \
  --model "$model" \
  -f wiki.test.raw \
  --seed 1337 \
  -fa \
  -ngl 99 \
  --threads 1
```

```
# Vulkan
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
...
[1]7.9532,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan

# CUDA
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
...
Final estimate: PPL = 10.3231 +/- 0.08240
```

👤 ubergarm commented on 2025-07-12 at 18:37:31:

> Do we get NaNs also in mainline with Vulkan and FA enabled? Or did something get broken with the port or my modifications?

Right, I just tried the latest mainline llama.cpp, and with Vulkan and FA enabled it runs clean for both the same Q4_0 and IQ2_XS quants mentioned above.

So yes, it seems like an issue with the port breaking numerical stability on the Vulkan FA-enabled path (prior to and unrelated to this PR).

```bash
$ cd llama.cpp
$ git rev-parse --short HEAD
c31e60647

$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_VULKAN=ON
$ cmake --build build --config Release -j $(nproc)

# model=Qwen3-14B-IQ2_XS.gguf
$ ./build/bin/llama-perplexity \
  --model "$model" \
  -f wiki.test.raw \
  --seed 1337 \
  -fa \
  -ngl 99 \
  --threads 1
```

```
# Vulkan -fa
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
...
Final estimate: PPL = 10.3268 +/- 0.08242

# Vulkan no fa
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
...
Final estimate: PPL = 10.3281 +/- 0.08243
```

I also spot-checked my new DeepSeek-V2-Lite-Q4_0.gguf test quant with the Vulkan backend: same thing, with -fa it throws nan on the second chunk. Removing -fa and keeping -fmoe -mla 3 -amb 512 -ngl 99, fully offloaded on the 3090 Ti, it is running clean so far after 50 chunks.
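
For reference, the clean no-FA run above would look roughly like this (a reconstruction from the flags quoted in the text, with an illustrative model path, not a verbatim command):

```bash
# Reconstruction of the clean run: -fa removed, -fmoe/-mla/-amb kept,
# fully offloaded. The model path is illustrative.
model=DeepSeek-V2-Lite-Q4_0.gguf
./build/bin/llama-perplexity \
  --model "$model" \
  -f wiki.test.raw \
  -fmoe -mla 3 -amb 512 \
  -ngl 99 \
  --threads 1
```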


👤 firecoperana commented on 2025-07-12 at 19:26:57:

https://github.com/ggml-org/llama.cpp/pull/12776 is a fix for NaNs in flash attention in mainline. It was included in the port, but it could be helpful for solving the current issue.


👤 firecoperana commented on 2025-07-13 at 00:46:36:

It was introduced in https://github.com/ikawrakow/ik_llama.cpp/pull/584. If I roll back to a build before that, I don't see the issue with FA.
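
Rolling back builds by hand is effectively a manual bisect; for anyone reproducing this, an automated sketch (the good commit hash is a placeholder):

```bash
# Sketch: bisect the FA regression between a known-good build and HEAD.
# <good-sha> is a placeholder for the last commit before #584 was merged.
git bisect start
git bisect bad HEAD
git bisect good <good-sha>
# At each step, rebuild and rerun the perplexity check from above, then mark:
#   git bisect good   # PPL stays finite
#   git bisect bad    # PPL goes nan
cmake --build build --config Release -j "$(nproc)"
```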


👤 ubergarm commented on 2025-07-13 at 04:34:49:

@firecoperana wait, I forget: are you using an NVIDIA GPU, and if so, are you testing with KHR_coopmat or NV_coopmat2?

I successfully tested some more cases with both fcp/vulkan_01@3ef6de2 and main@c53cb652. Both work just fine with -fa enabled for Qwen3-14B-Q4_0 and DeepSeek-V2-Lite-Q4_0.

So to get it to run without nan, I just had to re-compile with NV_coopmat2 disabled on my NVIDIA 3090 Ti, so that it starts up and says:

```
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
```

(I'm not sure how to pass the preprocessor defines at build time; -DGGML_VULKAN_COOPMAT2_GLSLC_SUPPORT=0 didn't disable it, so I just commented out the GL_NV_cooperative_matrix2 stuff in ggml/src/CMakeLists.txt.)
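
A quick way to confirm which path a given binary actually uses is to grep the startup banner for the `matrix cores:` field shown above, e.g.:

```bash
# Print just the cooperative-matrix path the Vulkan backend reports at startup.
./build/bin/llama-perplexity --model "$model" -ngl 99 2>&1 \
  | grep -o 'matrix cores: [A-Za-z0-9_]*'
```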

It also worked fine on an AMD RX 7900 XTX 24GB VRAM GPU test rig.

```
ggml_vulkan: 0 = Radeon RX 7900 XTX (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
```

So it seems like the issue lies with my very up-to-date Arch Linux rig with driver version 575.64 and NV_coopmat2. I'm guessing that path isn't as well tested if others aren't on the bleeding edge.


👤 ubergarm commented on 2025-07-13 at 06:10:23:

Okay, I ran 4x sweep benches to compare speed using KHR_coopmat on DeepSeek-V2-Lite-Q4_0 between this PR and the main branch on Vulkan. I also ran the main branch with the CUDA backend for comparison.

It seems like this PR really helps PP for DeepSeek-V2-Lite on the Vulkan backend, approaching CUDA (without -fmoe) speeds.

FWIW, it also runs pretty well on the AMD RX 7900 XTX GPU.

I couldn't compare against mainline, as I accidentally used iq6_k and such for token_embd/output instead of the older q6_K... oops, I will fix up a mainline-compatible test quant for those comparisons later...

[image: sweep-bench-pr598-cuda]
👈 command and raw data
```bash
#!/usr/bin/env bash
model=DeepSeek-V2-Lite-Q4_0.gguf

# seems vulkan can't use -fmoe yet, so only add it for the CUDA backend test
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 20480 \
  -fa \
  -mla 3 \
  -ngl 99 \
  --threads 1 \
  --warmup-batch
```

PR598 fcp/vulkan_01@3ef6de29 ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat (no -fmoe)

```
PP  TG  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
512 128 0 0.158 3237.86 2.047 62.54
512 128 512 0.167 3071.16 2.066 61.94
512 128 1024 0.171 2995.99 2.092 61.19
512 128 1536 0.181 2833.91 2.108 60.71
512 128 2048 0.199 2577.63 2.128 60.16
512 128 2560 0.200 2555.94 2.146 59.65
512 128 3072 0.212 2415.40 2.171 58.96
512 128 3584 0.222 2305.55 2.204 58.08
512 128 4096 0.230 2227.69 2.218 57.72
512 128 4608 0.238 2152.48 2.242 57.09
512 128 5120 0.249 2053.81 2.274 56.29
512 128 5632 0.261 1957.96 2.296 55.75
512 128 6144 0.267 1917.53 2.317 55.23
512 128 6656 0.275 1859.15 2.334 54.84
512 128 7168 0.284 1805.34 2.359 54.26
512 128 7680 0.294 1740.77 2.379 53.80
512 128 8192 0.312 1640.89 2.407 53.18
512 128 8704 0.313 1638.38 2.420 52.90
512 128 9216 0.323 1584.68 2.465 51.93
512 128 9728 0.334 1532.87 2.471 51.81
512 128 10240 0.342 1496.42 2.498 51.24
512 128 10752 0.349 1466.47 2.542 50.35
512 128 11264 0.363 1411.49 2.541 50.37
512 128 11776 0.370 1383.75 2.575 49.71
512 128 12288 0.381 1344.28 2.590 49.43
512 128 12800 0.392 1305.20 2.615 48.94
512 128 13312 0.397 1291.08 2.630 48.67
512 128 13824 0.412 1243.87 2.653 48.25
512 128 14336 0.419 1220.54 2.696 47.47
512 128 14848 0.429 1192.23 2.719 47.07
512 128 15360 0.438 1168.03 2.727 46.94
512 128 15872 0.449 1139.93 2.740 46.71
512 128 16384 0.458 1117.78 2.769 46.23
512 128 16896 0.469 1091.90 2.802 45.68
512 128 17408 0.480 1065.66 2.846 44.98
512 128 17920 0.489 1047.92 2.857 44.80
512 128 18432 0.500 1024.66 2.869 44.61
512 128 18944 0.508 1006.99 2.893 44.24
512 128 19456 0.520 983.92 2.930 43.68
512 128 19968 0.527 970.88 2.977 43.00
```

main@c53cb652 ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat (no -fmoe)

```
PP  TG  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
512 128 0 0.352 1453.63 2.060 62.13
512 128 512 0.363 1411.14 2.093 61.17
512 128 1024 0.371 1381.41 2.123 60.29
512 128 1536 0.382 1341.59 2.142 59.74
512 128 2048 0.390 1314.28 2.164 59.15
512 128 2560 0.399 1283.78 2.189 58.48
512 128 3072 0.409 1253.19 2.208 57.98
512 128 3584 0.417 1226.70 2.232 57.35
512 128 4096 0.429 1193.48 2.260 56.65
512 128 4608 0.444 1152.15 2.297 55.74
512 128 5120 0.448 1141.95 2.308 55.47
512 128 5632 0.458 1118.20 2.326 55.03
512 128 6144 0.466 1098.13 2.345 54.58
512 128 6656 0.477 1073.00 2.372 53.95
512 128 7168 0.485 1055.92 2.398 53.38
512 128 7680 0.495 1033.49 2.404 53.23
512 128 8192 0.501 1021.30 2.448 52.30
512 128 8704 0.513 998.78 2.434 52.58
512 128 9216 0.524 977.36 2.482 51.57
512 128 9728 0.532 961.59 2.517 50.85
512 128 10240 0.541 945.58 2.532 50.55
512 128 10752 0.550 931.63 2.544 50.32
512 128 11264 0.559 916.67 2.572 49.77
512 128 11776 0.566 904.18 2.594 49.35
512 128 12288 0.578 886.11 2.629 48.69
512 128 12800 0.588 871.11 2.633 48.62
512 128 13312 0.594 862.53 2.670 47.94
512 128 13824 0.607 843.09 2.683 47.70
512 128 14336 0.617 829.66 2.722 47.03
512 128 14848 0.632 810.67 2.757 46.42
512 128 15360 0.638 802.61 2.754 46.48
512 128 15872 0.656 780.56 2.782 46.00
512 128 16384 0.669 765.63 2.814 45.48
512 128 16896 0.667 767.13 2.813 45.51
512 128 17408 0.677 756.36 2.862 44.72
512 128 17920 0.699 732.60 2.871 44.59
512 128 18432 0.691 740.86 2.840 45.07
512 128 18944 0.704 727.26 2.912 43.96
512 128 19456 0.717 714.40 2.961 43.23
512 128 19968 0.728 703.28 2.979 42.97
```

main@c53cb652 CUDA Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes (no -fmoe)

```
PP  TG  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
512 128 0 0.150 3410.58 0.850 150.56
512 128 512 0.153 3347.65 0.883 144.95
512 128 1024 0.161 3170.67 0.889 143.93
512 128 1536 0.164 3131.27 0.897 142.76
512 128 2048 0.170 3014.62 0.902 141.88
512 128 2560 0.177 2898.93 0.909 140.77
512 128 3072 0.179 2854.08 0.915 139.84
512 128 3584 0.185 2772.59 0.921 138.91
512 128 4096 0.190 2695.74 0.921 139.05
512 128 4608 0.193 2647.73 0.924 138.60
512 128 5120 0.199 2577.73 0.930 137.66
512 128 5632 0.207 2470.39 0.939 136.32
512 128 6144 0.205 2496.83 0.950 134.72
512 128 6656 0.209 2450.44 0.948 134.96
512 128 7168 0.211 2420.98 0.953 134.32
512 128 7680 0.217 2356.83 0.958 133.63
512 128 8192 0.222 2301.66 0.962 133.10
512 128 8704 0.226 2268.36 0.970 131.99
512 128 9216 0.233 2201.90 0.974 131.40
512 128 9728 0.237 2162.63 0.981 130.43
512 128 10240 0.242 2115.01 0.987 129.74
512 128 10752 0.247 2076.34 0.995 128.66
512 128 11264 0.250 2048.60 0.999 128.18
512 128 11776 0.256 2002.21 1.004 127.46
512 128 12288 0.262 1956.47 1.013 126.36
512 128 12800 0.267 1920.49 1.019 125.57
512 128 13312 0.270 1893.36 1.022 125.21
512 128 13824 0.276 1854.78 1.025 124.85
512 128 14336 0.281 1824.00 1.030 124.31
512 128 14848 0.287 1786.71 1.038 123.28
512 128 15360 0.291 1760.18 1.042 122.89
512 128 15872 0.294 1739.60 1.046 122.41
512 128 16384 0.299 1710.85 1.053 121.52
512 128 16896 0.305 1676.11 1.059 120.83
512 128 17408 0.309 1654.43 1.067 119.98
512 128 17920 0.314 1628.70 1.073 119.34
512 128 18432 0.320 1598.91 1.076 119.01
512 128 18944 0.324 1582.60 1.081 118.42
512 128 19456 0.326 1570.21 1.086 117.90
512 128 19968 0.329 1554.16 1.091 117.28
```

main@c53cb652 CUDA Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes (-fmoe enabled)

```
PP  TG  N_KV  T_PP(s)  S_PP(t/s)  T_TG(s)  S_TG(t/s)
512 128 0 0.129 3967.12 0.731 175.15
512 128 512 0.132 3878.35 0.766 167.18
512 128 1024 0.140 3644.23 0.773 165.67
512 128 1536 0.143 3586.97 0.779 164.27
512 128 2048 0.148 3448.86 0.785 163.01
512 128 2560 0.153 3341.10 0.794 161.13
512 128 3072 0.159 3217.78 0.798 160.33
512 128 3584 0.163 3146.28 0.807 158.60
512 128 4096 0.171 2986.96 0.812 157.68
512 128 4608 0.173 2960.00 0.816 156.93
512 128 5120 0.179 2860.22 0.822 155.79
512 128 5632 0.185 2764.53 0.827 154.78
512 128 6144 0.186 2759.27 0.833 153.69
512 128 6656 0.190 2697.36 0.837 152.86
512 128 7168 0.193 2648.87 0.843 151.87
512 128 7680 0.199 2568.33 0.850 150.53
512 128 8192 0.203 2526.30 0.854 149.84
512 128 8704 0.207 2477.51 0.859 148.99
512 128 9216 0.213 2398.65 0.863 148.28
512 128 9728 0.217 2355.20 0.870 147.05
512 128 10240 0.223 2292.29 0.877 146.02
512 128 10752 0.227 2255.92 0.883 145.01
512 128 11264 0.231 2215.18 0.888 144.09
512 128 11776 0.235 2178.60 0.893 143.31
512 128 12288 0.243 2110.92 0.898 142.47
512 128 12800 0.249 2059.40 0.907 141.05
512 128 13312 0.252 2029.32 0.913 140.18
512 128 13824 0.258 1981.40 0.919 139.34
512 128 14336 0.261 1959.38 0.923 138.73
512 128 14848 0.268 1912.02 0.929 137.71
512 128 15360 0.272 1883.56 0.934 137.11
512 128 15872 0.276 1854.29 0.939 136.29
512 128 16384 0.282 1816.98 0.944 135.65
512 128 16896 0.286 1789.60 0.949 134.84
512 128 17408 0.290 1764.20 0.955 134.07
512 128 17920 0.296 1730.75 0.960 133.40
512 128 18432 0.302 1695.63 0.966 132.51
512 128 18944 0.306 1675.23 0.973 131.61
512 128 19456 0.308 1659.91 0.978 130.86
512 128 19968 0.313 1634.69 0.984 130.04
```

👤 firecoperana commented on 2025-07-13 at 13:29:51:

I tried KHR_coopmat and no matrix cores. The response looks like the below when I start the second round of conversation using Qwen2.5 14B Q4_0:

```
I can help with various tasks suchFlushKeyId their刻 index弈etur İsHub()

cession/***/_-_oidalglichsy propriéarya Gol鲜 <20>回 peelediran catalogsنق fı.translate_calc新闻中心咴LAG零帮助疹_hdlG Lair刚可以Aggregate Mor广泛的"struct因地ocos Hor bè Boroughapo<70>
```


👤 firecoperana commented on 2025-07-15 at 12:28:43:

> @firecoperana
>
> I think this is not necessary after #608, right?

Yes.