ROCm/composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-29 11:16:59 +00:00

juuso-oskari 5d1def74a6 CK-UA: remove legacy ping-pong pipeline; FA4 is the only 2-WG path

Pipeline cleanup (-fav4):
  * Delete the 8-wave compute/memory ping-pong baseline (the ~200-line
    monolithic `core_loop` lambda + its 2-warp-group dispatch). It was
    reachable only under -DUA_FA4_PIPELINE=0 and never beat FA4 on any
    measured prefill shape, so it was dead under the default build.
  * Drop the UA_FA4_PIPELINE toggle entirely. kFA4 is now derived purely
    from NumWarpGroups==2 + the 32x32x16 within-wave FP8 P-relayout
    invariant, with a static_assert pinning that every 2-WG instance is
    FA4-capable (fails the build loudly instead of running an empty loop).
  * Remove the now-orphaned ADD_SBARRIER_FOR_PHASE0/PHASE2 knobs (they
    only gated barriers inside the deleted core_loop). MOVE_FMHA_MASK_*
    stay (still consumed by the FA4 core-loop scheduler).
  * The non-FA4 pre-stage + fmha_post_process epilogue are retained: they
    are shared by the single-warp-group (NumWarpGroups==1) serial decode
    path, where kFA4 is false.
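
The derivation described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `UaPipelineTraits`, its template parameters, and `kRelayoutOk` are hypothetical names, and the warp-tile check is only an assumed stand-in for the 32x32x16 P-relayout invariant. Only `kFA4` and `NumWarpGroups` appear in the commit message.

```cpp
#include <cassert>

// Hypothetical trait bundle; the names and template parameters are
// illustrative and are not CK's actual types.
template <int NumWarpGroups_, int WarpM, int WarpN, int WarpK>
struct UaPipelineTraits
{
    static constexpr int NumWarpGroups = NumWarpGroups_;

    // Assumed stand-in for the 32x32x16 within-wave P-relayout invariant.
    static constexpr bool kRelayoutOk = (WarpM == 32 && WarpN == 32 && WarpK == 16);

    // No preprocessor toggle: the flag is derived from the configuration only.
    static constexpr bool kFA4 = (NumWarpGroups == 2) && kRelayoutOk;

    // A 2-warp-group instance that is not FA4-capable fails at compile time
    // instead of silently running an empty loop.
    static_assert(NumWarpGroups != 2 || kFA4,
                  "every 2-warp-group instance must be FA4-capable");
};

using PrefillTraits = UaPipelineTraits<2, 32, 32, 16>; // 2-WG prefill: kFA4 is true
using DecodeTraits  = UaPipelineTraits<1, 16, 16, 32>; // 1-WG decode: kFA4 is false
```

Under these assumptions, instantiating something like `UaPipelineTraits<2, 16, 16, 32>` would stop the build at the `static_assert`, which is the "fails loudly" behaviour the commit message describes.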

Behaviour-preserving for the default build: FA4 prefill performance is
unchanged (b16 sq=sk=10000 fp8 CK=5.76ms before/after) and the full
decode regression (d{64,128} x {bf16,fp8} x split-KV {2,64}) still passes.

Add opt-in prefill fallback knob (unified_attention.cpp):
  * AITER_UA_PREFILL_FALLBACK=1 routes prefill-sized shapes to the 4-warp
    single-warp-group *serial* decode_*_m128 instances instead of FA4.
    Reuses already-compiled instances (no extra binary). OFF by default:
    the serial path has no matrix/softmax overlap and measured ~0.66-0.70x
    Triton vs FA4's ~0.73-0.80x on gfx950 fp8 GQA-12/2 (i.e. SLOWER than
    FA4). Kept as a diagnostic / robustness A-B knob only.
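
An opt-in environment knob of this kind might be wired up along the following lines. This is a hedged sketch: the `Path` enum and both functions are hypothetical, and treating only the exact string "1" as "on" is an assumption; the commit message does not say how unified_attention.cpp parses the variable. Only the variable name `AITER_UA_PREFILL_FALLBACK` comes from the source.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical labels for the two dispatch targets named in the commit message.
enum class Path
{
    FA4TwoWarpGroup,  // default prefill path
    SerialDecodeM128  // 4-warp single-warp-group serial instances
};

// Pure helper so the decision can be tested without touching the environment.
// Assumption: only the exact value "1" enables the fallback.
inline Path select_prefill_path(const char* env_value)
{
    const bool fallback = env_value != nullptr && std::strcmp(env_value, "1") == 0;
    return fallback ? Path::SerialDecodeM128 : Path::FA4TwoWarpGroup;
}

// Reads the knob from the process environment; unset means "off".
inline Path select_prefill_path_from_env()
{
    return select_prefill_path(std::getenv("AITER_UA_PREFILL_FALLBACK"));
}
```

Keeping the parsing in a pure function makes the default-off behaviour easy to check, and because the fallback only reroutes to instances that are already compiled, a switch like this adds no binary size.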

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-06-03 09:15:30 +00:00

Padding support for wave transfer (#3537)

2026-01-26 12:57:09 -08:00

02_gemm_bilinear

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

03_gemm_bias_relu

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

04_gemm_add_add_fastgelu

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

[CK] Integrate GPU reference into ckProfiler for convolutions (#3379)

2025-12-18 07:59:45 +01:00

10_convnd_fwd_multiple_d_multiple_reduce

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

11_convnd_fwd_bias

[DOCS] Documentation Addition (Readme updates) (#2495)

2025-10-16 03:10:57 -07:00

[rocm-libraries] ROCm/rocm-libraries#5030 (commit 8e02a26)

2026-03-06 16:28:22 +00:00

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

14_gemm_quantization

[rocm-libraries] ROCm/rocm-libraries#5348 (commit 7b18234)

2026-03-12 08:48:36 +00:00

15_grouped_gemm

[rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f)

2026-02-26 00:28:58 +00:00

16_gemm_multi_d_multi_reduces

[rocm-libraries] ROCm/rocm-libraries#5348 (commit 7b18234)

2026-03-12 08:48:36 +00:00

17_convnd_bwd_data

[CK] Integrate GPU reference into ckProfiler for convolutions (#3379)

2025-12-18 07:59:45 +01:00

18_batched_gemm_reduce

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

19_binary_elementwise

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

20_grouped_conv_bwd_weight

[rocm-libraries] ROCm/rocm-libraries#4872 (commit ca623f7)

2026-02-25 20:11:01 +00:00

21_gemm_layernorm

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

24_batched_gemm

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

25_gemm_bias_e_permute

Implement batched gemm bias permute for RDNA4 (#3534)

2026-01-17 08:30:27 +01:00

Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions (#3598)

2026-01-20 09:39:57 -08:00

27_layernorm2d_fwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

28_grouped_gemm_bias_e_permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

29_batched_gemm_bias_e_permute

Implement batched gemm bias permute for RDNA4 (#3534)

2026-01-17 08:30:27 +01:00

30_grouped_conv_fwd_multiple_d

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

31_batched_gemm_gemm

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

32_batched_gemm_scale_softmax_gemm

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

33_multiple_reduce

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

[rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5)

2026-02-20 22:41:34 +00:00

36_sparse_embedding

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

37_batched_gemm_add_add_relu_gemm_add

Implement batched gemm add relu gemm add for rdna4 (#3391)

2026-01-20 13:06:59 -08:00

38_grouped_conv_bwd_data_multiple_d

Grouped convolution backward data WMMA v3 implementation (#3460)

2025-12-30 16:25:08 +01:00

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

40_conv2d_fwd_quantization

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

41_grouped_conv_conv_fwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

42_groupnorm_fwd

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

43_splitk_gemm_bias_e_permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

44_elementwise_permute

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

45_elementwise_normalization

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

46_gemm_add_multiply

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

47_gemm_bias_softmax_gemm_permute

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

49_maxpool2d_bwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

51_avgpool3d_bwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

52_im2col_col2im

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

53_layernorm2d_bwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

54_groupnorm_bwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

59_grouped_gemm_multi_ABD

[rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f)

2026-02-25 05:17:08 +00:00

60_gemm_multi_ABD

Multi AB support for wave transfer (#3578)

2026-01-29 10:29:40 -08:00

61_contraction_multi_ABD

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

62_convnd_activ

Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529)

2026-01-30 17:02:14 +01:00

63_layernorm4d_fwd

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

64_fpAintB_gemm

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

65_gemm_multiply_multiply

[rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5)

2026-02-20 22:41:34 +00:00

66_complex_contraction_bilinear

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

67_gemm_microscaling

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464)

2026-01-07 16:30:57 +01:00

[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation

2026-01-15 16:43:02 +01:00

69_gemm_add_relu

[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation

2026-01-15 16:43:02 +01:00

CK-UA: remove legacy ping-pong pipeline; FA4 is the only 2-WG path

2026-06-03 09:15:30 +00:00

CMakeLists.txt

Build CK on Windows (#3458)

2026-01-14 07:31:45 -08:00

README.md

Add basic documentation structure (#1715)

2024-12-04 00:46:47 +01:00

README.md

Composable Kernel examples