Commit Graph

266 Commits

Author SHA1 Message Date
Jobbins
ec787e6fa2 [composablekernel] fix failure status (#4351)
## Motivation

Pipelines were failing on Math CI status check.

## Technical Details

For the success case, we just changed the config in Jenkins to use a
proper app token and no code changes were required. However, the failure
case would not have worked as coded, so we needed to move that outside
of the `rocmnode()` block.

## Test Plan

I removed all of the CI in one of the commits to quickly test, and then
added it back.  Got a successful "success" message and "failure" message
produced
2026-02-05 08:56:42 -07:00
Eiden Yoshida
9e00e291dc [CK] MICI: Correct path for build trace script (#4349)
## Motivation

- Corrects path to script due to superrepo migration
- Forces all tests to run by default

## Technical Details

- now in /projects/composablekernel

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2026-02-05 10:55:44 -05:00
Eiden Yoshida
606d2aaf31 [CK] MICI: Use reference repo for checkout operations (#4336)
## Motivation

- Maintain a reference repo on slave nodes that speeds up any
clone/checkout operations

## Technical Details

- clone a ref repo if it does not exist
- update ref repo if it does exist
- checkout after ref repo is updated
- eliminates double clone

## Test Result

- Initial checkouts succeeded
2026-02-04 21:43:22 -05:00
assistant-librarian[bot]
4231c8d673 [CK] Add FP8 KV_BLOCKSCALE support for batch prefill (#4263)
Implement per-page K/V quantization for paged attention:
  - Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
  - Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical pages offset for KV cache, overlaps with
computations

## Proposed changes

Please describe the motivation behind the pull request, whether it
enables a new feature or fixes a bug. If there are associated pull
requests or issues, please link them to the pull request.

## Checklist

Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.

- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which enables the maintainers
with understanding the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [ ] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged

## Discussion

If this is a relatively large or complex change, feel free to start a
discussion by explaining why you chose the solution you did and what
alternatives you considered



---
🔁 Imported from
[ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696)
🧑‍💻 Originally authored by @Jeff-Huang

---------

Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: Illia Silin <Illia.Silin@amd.com>
2026-02-04 18:25:31 -05:00
Illia Silin
2df84787b6 CK CI migration. (#4310)
## Motivation

Enable the CK CI after migration from standalone repo.

## Technical Details

Modify the jenkinsfile in projects/composablekernel to update the CI
workflow.

## Test Plan

This is for CK internal testing only.

## Test Result

Set up new CK CI pipeline/dashboard.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Andrew Clark <andrew.clark@amd.com>
2026-02-04 12:34:38 -05:00
andrew clark
dc0dc337a6 Adding Additional Failure Patterns for Alerts (#3663)
* Added two new failure patterns to detect. Including test function to verify if the patterns are detected

* Modifying pattern match to detect docker login failure. Removed passing tests.

* Removing passing tests. Modifying docker pattern to detect failure

* Removed passing tests

* Removing test logging function

[ROCm/composable_kernel commit: 421b714f13]
2026-02-03 10:23:07 -08:00
Bartłomiej Kocot
117abb6af4 Fix path to ck tile conv fwd instance generator (#3699)
* Fix path to ck tile conv fwd instance generator

* fixes

[ROCm/composable_kernel commit: f2b9b3a3a6]
2026-02-02 18:07:33 -08:00
Bartłomiej Kocot
c8d112deb5 Enable Grouped Conv Tile Fwd Tests daily (#3680)
[ROCm/composable_kernel commit: 1ae83137eb]
2026-01-31 15:55:25 -07:00
Illia Silin
7fbe9af19d remove builds on legacy OSs from CI (#3693)
[ROCm/composable_kernel commit: 63df1c0af2]
2026-01-30 09:15:09 -08:00
Andrew Clark
2b90408685 Finished testing failure types. Removed testing code.
[ROCm/composable_kernel commit: 8654c0628f]
2026-01-26 15:09:49 -07:00
Andrew Clark
c2cfd318da Removed working tests. Validating remaining tests.
[ROCm/composable_kernel commit: 402f21d0a6]
2026-01-26 15:09:49 -07:00
Andrew Clark
ec4a6be1ed Removed working tests. Validating remaining tests.
[ROCm/composable_kernel commit: 1397924c21]
2026-01-26 15:09:49 -07:00
Andrew Clark
22abf1b0d9 Testing a pattern to support all text variations
[ROCm/composable_kernel commit: 6c596b9553]
2026-01-26 15:09:49 -07:00
Andrew Clark
c3c318c340 Removing working cases to test other failure examples
[ROCm/composable_kernel commit: 58e1d03244]
2026-01-26 15:09:49 -07:00
Andrew Clark
c490f137b3 Adding forcing failure to test notifications
[ROCm/composable_kernel commit: 95768d1b22]
2026-01-26 15:09:49 -07:00
Andrew Clark
9e7b7fe59a Fixing Jenkinsfile too large error
[ROCm/composable_kernel commit: 786965b95e]
2026-01-26 15:09:49 -07:00
Andrew Clark
76b261ef00 Updating failure patterns to be more reliable and adding tests to verify they are caught in the logs
[ROCm/composable_kernel commit: 42a731b791]
2026-01-26 15:09:49 -07:00
andrew clark
b66bbac9ea Sanitizing URL-encoded characters from the image file name (#3622)
[ROCm/composable_kernel commit: 0fbb3bb8c4]
2026-01-21 11:00:53 -07:00
Yi DING
f7d5c3a34c Add CMakePresets.json (#3284)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: f41f37da96]
2026-01-21 08:04:24 -08:00
Bartłomiej Kocot
d15cc593ea [CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518)
* [BULDER] Add grouped conv fwd ck tile profiler

* [CK TILE] Fix grouped conv kernels splitk and double lds

* Updates

* Fixes

* Move to ckProfiler

* Fixes

* fix

* fix

* Change instances to empty list by default

* fix

* fix

* Update grouped_convolution_signatures.hpp

* Update grouped_convolution_forward_tile_algs.hpp

* [CK TILE] Add grouped convolution forward tests (#3556)

* [CK TILE] Add grouped convolution forward tests

* fix jenkins

* fixes

* comments fixes

* unit test

* unit test fix

* Move instances outside builder

* fix includes

* clang format fix

* readme fix

* fix includes

* fixes

[ROCm/composable_kernel commit: 0727e85e52]
2026-01-19 22:29:01 -07:00
Max Podkorytov
44434d33d5 Optimize clang-format check in Jenkins CI (#3597)
This change improves the clang-format CI check to be faster and not
depend on git being available in the build environment.

Changes:
- Use `find` instead of `git ls-files` (no git dependency)
- Check all C++ files: *.h, *.hpp, *.cpp, *.h.in, *.hpp.in, *.cpp.in, *.cl
- Exclude build/ and include/rapidjson directories
- Use parallel processing with 8 cores (-P 8) for ~8x speedup
- Show only errors with unified diff format (-u)
- Clear error messages: "ERROR: <file> needs formatting"
- Preserve original logic: run clang-format only when RUN_CPPCHECK=false,
  or run both clang-format and cppcheck when RUN_CPPCHECK=true

Performance:
- Sequential processing: ~93 seconds for 5,899 files
- Parallel with 8 cores: ~12 seconds for 5,899 files
- Per-file processing time: ~15ms

This reduces CI time while maintaining code formatting standards.

[ROCm/composable_kernel commit: 98abfa4ade]
2026-01-19 12:23:06 -08:00
John Shumway
c4dce7cb69 Disable CK Builder for SLES15 in Jenkins CI (#3581)
1. Added `-DCK_EXPERIMENTAL_BUILDER=OFF` to the `setup_args` to explicitly disable the experimental builder

2. Added a detailed comment explaining why this is necessary:

   - SLES15 is a legacy platform with limited C++20 ecosystem support
   - While the ROCm compiler supports C++20, the older system libraries and standard library implementation on SLES15 does not reliably support all C++20 features required by the experimental CK Builder

[ROCm/composable_kernel commit: 2d233c838a]
2026-01-16 10:36:23 -08:00
Illia Silin
8b415db3d6 add aiter test_batch_prefill and simplify jenkins file a bit (#3570)
[ROCm/composable_kernel commit: 8705fdcb0c]
2026-01-14 14:07:47 -08:00
Thrupti Raj Lakshmana Gowda
183c01c8f1 [CK TILE ENGINE] CI fix for Basic Tile Engine (#3554)
* memory op changes

* memory op changes

* Fixing TILE_ENGINE_BASIC in Tile Engine

* Removing gfx90a from Tile Engine Run

* [CK TILE ENGINE] increasing ci configs for BASIC case

* Setting RUN_TILE_ENGINE_BASIC_TESTS to ON by default

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: 51027474af]
2026-01-13 16:20:30 -08:00
Thomas Ning
f444eab66c Shuffle fix for gfx950 (#3491)
* solve compiler issue

* solve the gfx950 mfma shuffle regression

* refactor jenkinsfile to handle arch name better

* [CK TILE] set divisor to count of thread along k dimension

* fix the compiler error

* solve degradation

* Finish the multiplies fix

* fix the scales

* solve compilation error

* solve the composes

* solve the error of tile sweeper

* fix the test and example

* fix for gfx950

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>

[ROCm/composable_kernel commit: 00c46785a8]
2026-01-13 09:21:29 -08:00
Illia Silin
acb2292b46 add tabulate package to aiter docker (#3519)
[ROCm/composable_kernel commit: 2ffbf7f476]
2026-01-06 09:36:54 -08:00
Bartłomiej Kocot
502914e556 Fix large tensor grouped conv bwd data test (#3513)
[ROCm/composable_kernel commit: bbf0b1a3b3]
2026-01-05 09:42:02 -08:00
Thrupti Raj Lakshmana Gowda
b17fa5656f [CK TILE ENGINE] CI configuration with basic cases (#3475)
* [CK TILE ENGINE] Adding GEMM BASIC TEST in Kenkins

* fix RUN_TILE_ENGINE_BASIC_TESTS name typo

* [CK Tile Engine] Updating basic CI

* Resolving merging issues

* Resolving merging issues

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 62a8ec155f]
2025-12-24 10:45:56 -08:00
Bartłomiej Kocot
2228960cc4 Fix jenkinsfile for large tensor conv test (#3478)
[ROCm/composable_kernel commit: 44f1b5c5de]
2025-12-21 17:39:30 -08:00
Thrupti Raj Lakshmana Gowda
2dacac9561 [TILE ENGINE] Restructure to Base class of GEMM (#3434)
[ROCm/composable_kernel commit: e22622f0ec]
2025-12-19 23:53:56 +08:00
andrew clark
cc8e250c35 Supporting Custom Build Trace File Names (#3443)
* Removing hard-coded trace filename

* Including stage name in notification

* Simplifying capture setup and tagging file names with arch

* Removed test property from notification message

* Fixing regex to get arch name

* Fixing error in notification and modified regex

[ROCm/composable_kernel commit: e77a7ca2bc]
2025-12-18 12:15:33 -08:00
andrew clark
23d11e9792 Adding sscache stats monitoring (#3428)
* Adding additional sccache and redis logging to each build

* Removing custom workspace

* Removing script reference

* Logging complete sccache stats

* Ensuring monitor is stopped if build fails

* Including additional sccache logging

* Removing build duration log

* Fixing groovy syntax error

* Fixing syntax

* Modifying logging statements

* Fixing syntax

* Modifying logging

* Modifying logging

* Including additional logging

* Fixing logging message

* Logging build path

* Testing

* Testing workspace path logs

* Adding additonal logging to monitor

* Modifying comments

* Adding copyright info

* Cleaning unnecessary logs

* Removing build time logs

* Merge branch 'develop' into aick-457

[ROCm/composable_kernel commit: e67cd7edeb]
2025-12-17 09:15:27 -07:00
Illia Silin
de71120c7f Add build trace diagnostics to CI. (#3432)
* generate and visualize build traces for all archs

* generate build traces in all cases

* fix jenkins logic

* fix typo

* use more threads for parsing dependency map

* add script to parse ninja traces and issue warnings

* fix python script syntax and header

* fix python syntax one more time

* fix python syntax

[ROCm/composable_kernel commit: 3dfa794fab]
2025-12-16 08:22:52 -08:00
Johannes Graner
9d6790dc2e Add grouped convnd dataset tests for bwd_data, bwd_weight and make them parallel (#3380)
* Parallelization in dataset generation

* Parallelizable tests for fwd, bwd data, bwd weight with datasets

* .gitignore generated datasets

* Test parallelization script with round-robin GPU scheduling

* Parallelization updates to test generation and running

* Dataset paths relative to executable

* Update output from test generation

* Default to one GPU in test generation

* Add small dataset tests to Jenkins

* Update copyright lines

* Update test_data/generate_test_dataset.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Move trap disable

* Common get path function

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: fe35ba5dac]
2025-12-15 13:38:25 +01:00
Illia Silin
2185fc59cb use hipTensor from monorepo for daily builds (#3386)
[ROCm/composable_kernel commit: 934ba1208a]
2025-12-09 14:39:08 -08:00
Illia Silin
25918f26a2 temporarily disable daily builds on gfx1010 and gfx908 (#3384)
[ROCm/composable_kernel commit: 0d8259affd]
2025-12-09 10:37:13 -08:00
Illia Silin
43b4ec3209 Upgrade to ROCm7.1.1 as default compiler. (#3370)
* upgrade to rocm7.1.1 as new default compiler

* fix jenkinsfile

[ROCm/composable_kernel commit: 7582c9e73f]
2025-12-09 07:35:32 -08:00
Illia Silin
646dc6133d build latest hipblaslt in ck_pytorch docker (#3347)
[ROCm/composable_kernel commit: cd21e20ae7]
2025-12-04 06:58:42 -08:00
Cong Ma
f622d546f3 Tile engine for streamk (#3157)
* [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM.

- This commit lays the groundwork for integrating the tile engine into streamk GEMM.
  It focuses on creating benchmark executables for streamk GEMM.
- Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once
  the streamk implementation reaches stability.

* [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM

* [CK TILE STREAMK] Refactor: Extract common utility functions.

* [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation

* Add pre-commit

* [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK

* [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK

* [CK TILE STREAMK] Update Jenkinsfile

* [CK TILE Engine] Update StreamK tile engine help message

Remove default value messages as they are automatically printed

* [CK TILE Engine] Update StreamK tile engine

- Remove namespace reboot

* [CK TILE Engine] Update StreamK tile engine

- Fix merge error

[ROCm/composable_kernel commit: 30727c48fc]
2025-11-27 15:49:57 -07:00
andrew clark
544525b61d Automated Perfetto UI Notifications (#3255)
* Testing visualization generation

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Adding dummy test data

* Update Jenkinsfile

* Update Jenkinsfile

* Adding notifications

* Testing

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Image compression

* Update Jenkinsfile

* Moving capture logic to main Jenkins file

* Testing generation

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Fixing curl request

* Update Jenkinsfile

* Clean up

* Fix

* Fixing notification

* Testing message creation

* Adjusting message payload

* Testing notification generation

* Updating main jenkinsfile

* Fixing cleanup call

* Removing test pipeline code

* Comment clean up

* Testing pipeline

* Update Jenkinsfile

* Update Jenkinsfile

* Update Jenkinsfile

* Moving archive

Moving trace archive to safe location before source checkout

* Removing test pipeline

* Testing pipeline with unique file names

* Update Jenkinsfile

* Removing test files

Updated main pipeline

[ROCm/composable_kernel commit: 40d7217ac7]
2025-11-26 16:27:27 -07:00
Illia Silin
250fd6be12 Enable ck_builder in CI. (#3296)
* build and run ck_builder tests

* add test_ckb_all to targets

* fix syntax

* fix test path

* Update CMake targets for builder testing in CI (#3290)

Our existing CMake only had build targets. Update CMakeLists.txt to have CTEST targets:
* smoke-builder
* regression-builder
* check-builder

Co-authored-by: John Shumway <jshumway@amd.com>

* use check-builder target

* get rid of test_ckb_all target

* call ninja check-builder separately

---------

Co-authored-by: John Shumway <jshumway@amd.com>

[ROCm/composable_kernel commit: a54f7b1138]
2025-11-25 17:45:59 -08:00
Illia Silin
585a6a2048 disable CI on gfx1010 by default (#3280)
[ROCm/composable_kernel commit: e857e26bf6]
2025-11-24 07:06:41 -08:00
Illia Silin
6d7d99f91b Enable daily builds on gfx1010 (#3258)
* add build/test on gfx1010

* only build and run on gfx1010 once daily

[ROCm/composable_kernel commit: 21ae743acd]
2025-11-21 07:22:01 -08:00
Illia Silin
c36a154c66 fix typo (#3244)
[ROCm/composable_kernel commit: 84540edff3]
2025-11-19 20:23:09 -08:00
Illia Silin
c2247fa9a5 Refactor Jenkinsfile (#3229)
* allow using alternative compiler in all CI stages

* get rid of some redundancies in jenkinsfile

* clean up jenkinsfile a bit more

* further clean up jenkinsfile

* do not force user jenkins in ci dockers

[ROCm/composable_kernel commit: 3e8e6f7e4f]
2025-11-19 07:20:25 -08:00
Illia Silin
8d454aa01d Initialize new variable to prevent c++17 compiler error (#3156)
* initialize new variable to prevent c++17 compiler error

* build for gfx90a using -std=c++17 flag

[ROCm/composable_kernel commit: 930423ab3b]
2025-11-04 18:54:14 -08:00
Thrupti Raj Lakshmana Gowda
10e844d93c [CK TILE ENGINE] GEMM Multi D Restructure (#3121)
* Renaming old code

* Adding GEMM code with new Architecture

* Partial Progress : Errors

* Partial Progress : Working code

* Changes to element wise function

* Removing Debugging statements

* Working GEMM Multi D code

* Removing Stale Code

* Address Copilot review comments

* Address Copilot review comments

* Changes to validation file

* Changes to common code snippets

* Creating common folder

* Removing duplicate files

* Pointing to right common file

* Pointing to right common file

* Pointing to right common file

* Changing to VERBOSE

* Changing CMAKE messages to verbose

* Updating Cmake with right layout datatype configs

* Working code for GEMM Multi D

[ROCm/composable_kernel commit: a33d98f8e2]
2025-10-31 12:02:46 -07:00
andrew clark
9950df2ae7 Adding new alert failure patterns (#3122)
* Adding GPU not found pattern

Also, failurePatterns does not need to be global. Moved variable to live in the failure notifications function scope.

* Testing new failure type

* Testing failure

* Removing the forced failure test

* Adding an additional failure pattern

[ROCm/composable_kernel commit: 1977e4b96a]
2025-10-31 07:38:31 -07:00
Anton Gorenko
9a012c3135 [CK_TILE] Support WMMA (gfx12) in FMHA (#2528)
* Pass hdim to tile_example_fmha_fwd in fp8 tests

* Add WMMA support to fwd FMHA pipelines

* Tune tile sizes a bit for less spilling

fp16 256 is still quite slow

* Fix Q grad tile distribution for warp size = 32 and hdim >= 256

With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile.

* Use code based on BlockDropout in BlockDropoutBwd

* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal

* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9
  because they both equal to 8 while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1],
  it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependecies allows to support:
    * any number of warps (power-of-2), not only 4;
    * kN1 = 16, not only 32;
    * any number of splits;

* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now

* Replace hard-coded kN1 in dispatch code with the requested tile size

* Add gfx12-specific tile sizes for split KV

* Pass GPU architecture to kernel generation scripts

This is still a temporary solution.

* Build and run FMHA CI tests for gfx12

* Fix issue after merging

* Fix bwd tile sizes

The current pipelines always read only one tile K and V tile, this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).

* Use hardware f32->f8 on gfx12, remove v_perm

__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of
 32-bit dword) is used to store results (two f8 values).

* Update changelog

* Add WMMA support to pagedkv

* Fix scripts after rebasing

* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Fix names after cherry-picking

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Do not use filters related to qr_async_trload

They disable tiles/pipelines which are valid for gfx12.

* Use different dstr encoding when C is transposed

* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code

Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().

* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution

Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.

* Generalize usages of WarpGemmDispatcher for MFMA and WMMA

WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly becaus of swizzle factor = 4.

* Mark has_load_tr as maybe_unused

There are no transpose loading for RDNA.

* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code

* Detect BlockSize on host based on warp size of the current device

If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.

* Fix calculation of grid size for combine kernel with warp size = 32

* Add missing includes and header

* Support multiple archs in one binary for fwd

* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill

* Support multiple archs in one binary for bwd

* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
  they can be used as fallbacks (similarly to fwd);

* Extract common code from register_traits

* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"

To simplify merging , the proper fix is in develop already.

* Support new numerical d paddings in trait ordering checks

* Build fp32 tests only on gfx9

* Do not use hardcoded M0 = 64 for dot bwd kernel

* Use textwrap.indent from standard library

* Make fp8 pipelines on gfx12 consistent with gfx9

* Update tests for current pipelines

* Make ninja check more responsive in CI

ninja buffers output so this job looks hanging.

* Support fp8fp32 by limiting O vector size

The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).

* Remove unused cmake options

* Unify including  amd_buffer_addressing.hpp/_builtins.hpp

* Temporarily use amd_buffer_addressing.hpp on >=gfx10

amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
 * 1 scalar for exec masks instead of 2,
 * gfx12 uses different instruction names etc.

* Update asm in bf16 conversions to work with warp 32

* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd

* Add arch tags to kernels/host funcs, compile for each arch separately

* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename

* Add workaround for miscompilation of bwd with padded hdim

SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.

* Fix add_gtest_executable for absolute paths

Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.

* Run only hdim 128 smoke tests for fp8fp32

There are no instances for hdim 64 and 256.

* Format py with ruff to simplify merging develop

* Fix incorrect var name

* Codegen for gfx9,gfx950 when --targets is not specified

Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).

* Combine arch-related values into ArchTrait

This more centralized approach removes duplication of various formatting templates.

* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"

Some code is extracted into a function.

[ROCm/composable_kernel commit: 1e77695fe8]
2025-10-29 13:31:08 -07:00
andrew clark
332a0e1696 Added failure pattern check (#3111)
[ROCm/composable_kernel commit: aa22da07be]
2025-10-29 08:19:56 -06:00