ROCm/composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 19:28:33 +00:00

Amir Ghamarian bbc748defe Add unified attention d64/GQA-8 kernel instances and fix BLOCK_SIZE for small head dims

The unified attention kernel previously only supported head_size=128
with MHA (NumQPerKV=1). This change adds support for head_size=64 with
GQA-8 (NumQPerKV=8), which is the configuration used by models like
DeepSeek-V3/R1 (64 query heads, 8 KV heads, head_dim=64).
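
As a rough illustration of the GQA-8 layout described above (not code from the repository), each group of NumQPerKV consecutive query heads shares one KV head. The helper name `kv_head_for` is hypothetical, and the contiguous grouping is an assumption based on the usual GQA convention.

```cpp
#include <cassert>

// Hypothetical helper (not part of composable_kernel): maps a query head
// index to the KV head it shares, assuming contiguous grouping.
constexpr int kv_head_for(int q_head, int num_q_per_kv)
{
    return q_head / num_q_per_kv;
}

// Numbers quoted in the commit message: 64 query heads, 8 KV heads.
constexpr int kNumQHeads  = 64;
constexpr int kNumKVHeads = 8;
constexpr int kNumQPerKV  = kNumQHeads / kNumKVHeads; // 8, i.e. GQA-8

static_assert(kNumQPerKV == 8);
static_assert(kv_head_for(0, kNumQPerKV) == 0);
static_assert(kv_head_for(7, kNumQPerKV) == 0);  // last head of group 0
static_assert(kv_head_for(8, kNumQPerKV) == 1);  // first head of group 1
static_assert(kv_head_for(63, kNumQPerKV) == 7); // last query head

// MHA is the special case NumQPerKV == 1: one KV head per query head.
static_assert(kv_head_for(5, 1) == 5);
```

The `static_assert` lines (C++17) double as a check that the 64/8 split gives exactly eight query heads per KV head.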
Changes:
- Add 4 new kernel instance files for d64/GQA-8:
  unified_attention_d64_{bf16,fp16}_{nmask,mask}_gqa8.cpp
- Add d64/GQA-8 dispatch path in unified_attention.cpp
- Fix BLOCK_SIZE (kPageBlockSize) in unified_attention_kernel_traits:
  compute from HEAD_SIZE instead of hardcoding 32. For HeadSize<=64,
  BLOCK_SIZE must be 64 to guarantee NumIssues>=1 on gfx950. With
  128-bit vector loads (KVector=8), LaneGroups*NumWarps=128 exceeds
  kPageBlockSize=32 when HeadSize=64, causing a division-by-zero in
  the LDS tile descriptor constexpr evaluation.
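
The BLOCK_SIZE fix can be sketched as below. This is a minimal illustration, not the code in unified_attention_kernel_traits. The names `block_size_for`, `issues`, and `CheckedIssues` are hypothetical. Returning 32 for head sizes above 64 is assumed from the previously hardcoded default. The commit message does not show the actual NumIssues formula, so only the integer-division truncation it describes is demonstrated.

```cpp
#include <cassert>

// Hypothetical stand-in for the trait: derive the page block size from the
// head size instead of hardcoding 32. The "<= 64 -> 64" rule comes from the
// commit message; returning 32 otherwise is an assumption (the old default).
constexpr int block_size_for(int head_size)
{
    return head_size <= 64 ? 64 : 32;
}

// Integer division truncates toward zero, so a page block with fewer rows
// than one load issue covers yields zero issues.
constexpr int issues(int page_block_size, int rows_per_issue)
{
    return page_block_size / rows_per_issue;
}

static_assert(block_size_for(64) == 64);
static_assert(block_size_for(128) == 32);

// Failing case quoted in the commit message: 32 rows against 128.
static_assert(issues(32, 128) == 0);

// A guard like this turns the zero into a readable compile error before
// it can be used as a divisor in later constexpr arithmetic.
template <int PageBlockSize, int RowsPerIssue>
struct CheckedIssues
{
    static_assert(PageBlockSize / RowsPerIssue >= 1,
                  "page block too small for one load issue");
    static constexpr int value = PageBlockSize / RowsPerIssue;
};

// Illustrative values only; instantiating CheckedIssues<32, 128> would
// fail to compile with the message above.
static_assert(CheckedIssues<64, 64>::value == 1);
```

Deriving the block size from the head size keeps the constraint in one place, and the guard makes any violation fail at compile time with a readable message.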

2026-03-27 09:41:10 -05:00

..

Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429)

2026-01-15 07:19:31 -08:00

02_gemm_bilinear

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

03_gemm_bias_relu

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

04_gemm_add_add_fastgelu

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

[CK] Integrate GPU reference into ckProfiler for convolutions (#3379)

2025-12-18 07:59:45 +01:00

10_convnd_fwd_multiple_d_multiple_reduce

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

11_convnd_fwd_bias

[DOCS] Documentation Addition (Readme updates) (#2495)

2025-10-16 03:10:57 -07:00

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

14_gemm_quantization

[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation

2026-01-15 16:43:02 +01:00

15_grouped_gemm

Implement grouped gemm tile loop for RDNA4 (#3304)

2026-01-13 07:14:23 +01:00

16_gemm_multi_d_multi_reduces

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

17_convnd_bwd_data

[CK] Integrate GPU reference into ckProfiler for convolutions (#3379)

2025-12-18 07:59:45 +01:00

18_batched_gemm_reduce

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

19_binary_elementwise

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

20_grouped_conv_bwd_weight

[CK] Integrate GPU reference into ckProfiler for convolutions (#3379)

2025-12-18 07:59:45 +01:00

21_gemm_layernorm

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

24_batched_gemm

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

25_gemm_bias_e_permute

Implement batched gemm bias permute for RDNA4 (#3534)

2026-01-17 08:30:27 +01:00

[ck][gfx12] support contraction on gfx12 (#3421)

2025-12-15 07:16:01 -08:00

27_layernorm2d_fwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

28_grouped_gemm_bias_e_permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

29_batched_gemm_bias_e_permute

Implement batched gemm bias permute for RDNA4 (#3534)

2026-01-17 08:30:27 +01:00

30_grouped_conv_fwd_multiple_d

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

31_batched_gemm_gemm

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

32_batched_gemm_scale_softmax_gemm

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

33_multiple_reduce

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

36_sparse_embedding

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

37_batched_gemm_add_add_relu_gemm_add

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

38_grouped_conv_bwd_data_multiple_d

Grouped convolution backward data WMMA v3 implementation (#3460)

2025-12-30 16:25:08 +01:00

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

40_conv2d_fwd_quantization

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

41_grouped_conv_conv_fwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

42_groupnorm_fwd

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

43_splitk_gemm_bias_e_permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

44_elementwise_permute

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

45_elementwise_normalization

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

46_gemm_add_multiply

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

47_gemm_bias_softmax_gemm_permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

49_maxpool2d_bwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

51_avgpool3d_bwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

52_im2col_col2im

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

53_layernorm2d_bwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

54_groupnorm_bwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

59_grouped_gemm_multi_ABD

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

60_gemm_multi_ABD

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

61_contraction_multi_ABD

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

62_convnd_activ

Grouped convolution forward device implementation and base flavors for RDNA3/4 (#2964)

2025-12-18 13:12:15 -07:00

63_layernorm4d_fwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

64_fpAintB_gemm

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

65_gemm_multiply_multiply

[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation

2026-01-15 16:43:02 +01:00

66_complex_contraction_bilinear

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

67_gemm_microscaling

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation

2026-01-15 16:43:02 +01:00

69_gemm_add_relu

[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation

2026-01-15 16:43:02 +01:00

Add unified attention d64/GQA-8 kernel instances and fix BLOCK_SIZE for small head dims

2026-03-27 09:41:10 -05:00

CMakeLists.txt

Build CK on Windows (#3458)

2026-01-14 07:31:45 -08:00

README.md

Add basic documentation structure (#1715)

2024-12-04 00:46:47 +01:00

Back to the main page

Composable Kernel examples