In group mode the dq_acc workspace layout uses physical (padded)
seqlen_q for the per-nhead stride (see FmhaBwdWorkspaceManager doc;
also matches FmhaBwdConvertQGradKernel reads). The unified-workspace
refactor inlined this stride as kargs.seqlen_q, which is the LOGICAL
length when seqlen_q_ptr is provided. The result: main kernel writes
batch i nhead>0 dq_acc at offsets that the convert kernel never reads,
so dQ ends up zero for those positions.
Hoist physical_seqlen_q to the outer scope and use it for both the
non-deterministic and deterministic stride computations in the
dq_dram_window lambda. Batch mode is unaffected since kargs.seqlen_q
already equals the physical length there.
Fixes 135 padding-related failures in test_ck_tile_fmha_bwd_fp16
(BasicQPadding / MultiBatchPadding / PaddingWithMask / QKVPadding /
VariedPaddingRatios / ZeroLengthPadding / Deterministic /
ElementwiseBias). Verified locally: full suite 672 PASSED / 0 FAILED.
SGPR usage drops by 1; VGPR/AGPR/spill/occupancy unchanged.
## Summary
- Add `num_groups_to_merge` field to `CKGroupedConvFwdOp` dataclass to
match the new (#4273) `NumGroupsToMerge` template parameter added to
`DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3`
- Enable inductor tests by default in Jenkins CI
## Test plan
- [x] Built wheel without patch: `test_gen_conv_instances` fails with
`TypeError: takes from 47 to 50 positional arguments but 51 were given`
- [x] Built wheel with patch: `test_gen_conv_instances` passes
## Motivation
New changes from upstream llvm-project cause an avalanche of warnings in
CK. Gonna disable them by ignoring the
lifetime-safety-intra-tu-suggestions flag until a better permanent
solution is found.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
`MakeAsyncLoadADramWindow` in
`GemmPipelineAgBgCrCompAsyncEightWavesPolicy` was rebuilding the 6D view
descriptor with `make_naive_tensor_descriptor_packed`, which synthesizes
strides from lengths and assumes a dense layout. When the input view's
leading-dim stride is larger than its inner length (non-packed memory
layout), the resulting tile window stepped through memory at the wrong
stride.
Compose the unmerge transforms on top of the input view's existing
descriptor instead, so the actual runtime strides are preserved and the
correct `element_space_size` is inherited for bounds checking.
## Test Plan
Added an unit test showing the problem.
## Test Result
The new test fails before fixes and passes after.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Mark the FmhaBwdWorkspaceManager size/offset accessors as CK_TILE_HOST
(they are only invoked from host-side workspace setup), and pad
GetWorkspaceHostSize up to a 4K boundary so the GPU dq_acc buffer always
starts on a page-aligned offset.
Wrap fmha_bwd_launcher constructor with std::chrono and prepare_workspace
with ck_tile::gpu_timer; append "init:Xms, prws:Yms" to the benchmark
header line. Also reorder launcher construction to occur after device
buffer allocation so its timing is isolated.
## Motivation
This PR adds tests for CK Tile's convolution backward data operation to
enable functionality regression tracking and error-detection.
## Technical Details
Currently only NHWGC/GKCYX/NHWGK and NDHWGC/GKCZYX/NDHWGK(2 dim and 3
dim channel-last) layouts are being tested, since only they are
implemented in CK Tile. Current tests support FP16, BF16 and FP32
datatypes and various different convolutions scenarios. The tested
instances are listed in
`experimental/grouped_convolution_tile_instances` directory.
## Test Result
All implemented tests are working properly and passing.
---------
Co-authored-by: Ville Pietilä <>
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
## Motivation
This PR fixes the hipblaslt client build failures that occur when
building the PyTorch Docker image, which are currently causing failures
in CI.
## Technical Details
- Correctly reset the working directory to tmp/
- Added --use-system-packages to the install.sh to use system installed
laplack packages, as hard-coded paths were not being built.
## Test Plan
Locally built the Docker image using the Dockerfile.
## Test Result
Image was successfully built.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
## Motivation
The existing paged-KV attention pipelines (pagedkv, splitkv) support
StreamLLM-style sink tokens — a fixed set of initial tokens kept in
attention alongside the sliding window. The `batch_prefill` pipeline
(chunked-prefill with VLLM-style block tables) previously hardcoded
`kHasSink = false`, making it incompatible with sink-based attention
patterns in LLM serving scenarios.
This PR extends `batch_prefill` to support `kHasSink` and wires it
into `fmha_fwd_runner` for validation against the existing CPU
reference.
## Technical Details
**Pipeline** (`block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp`):
- When `kHasSink`, the K/V loop splits into a sink phase [0,
sink_seq_end)
and a window phase [seqlen_k_start, seqlen_k_end), mirroring pagedkv.
- K advance at the sink→window transition jumps
`seqlen_k_start - sink_seq_end + kN0` to bridge the gap.
- V scatter-gather offsets are re-initialized at the transition to fix a
window mismatch bug: V was lagging kN0 behind K after the large jump,
loading from the wrong sequence position.
- Bias window, dropout seq_offset, and mask type (LogitsSinkMask)
updated
for sink-awareness.
**Traits / codegen** (`tile_fmha_traits.hpp`, `fmha_fwd.hpp`,
`fmha_batch_prefill.py`):
- `TileFmhaBatchPrefillTraits` gains `kHasSink_` (was hardcoded
`false`).
- Codegen adds `F_sink` field; skips batch-mode kernels (group mode
required).
- CMake test filter broadened from 9 → 33 instances covering
fp16/bf16 × mask/nmask × lse/nlse × sink/nsink.
**Runner** (`fmha_fwd_runner.hpp`, `CMakeLists.txt`):
- `fmha_batch_prefill()` dispatched from `run_fwd` when:
group mode + paged KV + num_splits == 1.
- K/V strides corrected for runner's [num_pages, nhead_k,
page_block_size, hdim] layout.
- `page_block_size % 128` check relaxed: batch_prefill supports ps=16.
- CPU reference paged-KV reordering guards extended with
`CK_TILE_FMHA_FWD_BATCH_PREFILL_API`.
## Test Plan
Build with `-DFMHA_FWD_ENABLE_APIS="fwd;batch_prefill"`, run
`tile_example_fmha_fwd` in group mode with page_block_size=16.
Test matrix:
- Mask: no-mask, causal, sliding window
- Sink: nsink, sink=1..128
- dtype: fp16, bf16
- LSE output: on/off
- seqlen ∈ {512,1024,2048,4096} × window ∈ {32,256,512,1024}
- GQA, chunked prefill, large batch×seqlen
- page_block_size: 16, 32
## Test Result
171 test cases, all valid:y:
- nmask + nsink: ✓
- causal + nsink: ✓
- causal + sink=8: ✓
- sliding window + sink=8 (d=128, d=256): ✓
- bf16, LSE output, GQA: ✓
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Summary
- add `swiglustep_and_mul` to the composablekernel MoE blockscale
activation enum
- implement the corresponding blockscale epilogue path for `SwigluStep`
- keep existing `silu` and `gelu` paths unchanged
## Scope
This PR covers the classic composablekernel blockscale MoE path under
`projects/composablekernel`.
This is separate from the `ck_tile` / FlatMM path being discussed in
ROCm/rocm-libraries#5992.
## Motivation
`Step-3.5-Flash-FP8` uses `SwigluStep` in its MoE MLP path. The
dependent AITER change needs native support for this activation in the
classic composablekernel MoE blockscale path.
## Validation
- patch is limited to two composablekernel files under
`projects/composablekernel`
- existing `silu` / `gelu` paths are unchanged
- dependent AITER runtime validation hit the classic CK 2-stage path
with AITER MoE enabled
## Motivation
- On gfx11/gfx12, the existing float -> bf16 conversion path in FMHA
forward adds noticeable overhead and causes a meaningful performance gap
versus fp16. The asm-based path (mode 3) does not improve this on RDNA
and can perform even worse.
- In particular, on gfx12, bf16 FMHA forward can be up to ~20% slower
than the corresponding fp16 path.
- This PR reduces that gap by switching FMHA forward to a different BF16
conversion strategy based on Triton’s canonical-NaN
round-to-nearest-even behavior.
## Technical Details
- Add a new `standard_cnan` BF16 conversion mode to CK Tile.
- Implement a canonical-NaN RTN `float -> bf16` conversion path based on
the Triton implementation.
- Enable this conversion mode by default for FMHA forward builds
targeting gfx11/gfx12.
- Retune gfx11/gfx12 FMHA forward kernel selection thresholds for some
`hdim=128` cases to keep kernel selection aligned with the updated
conversion behavior.
## Test Plan
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={hdim} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
- all tests passed when running `test_ck_tile_fmha`
- BF16 FMHA forward performance improves by up to ~5% on gfx11.
- BF16 FMHA forward performance improves by up to ~10% on gfx12.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
The existing unification work supports DENSE and SPARSE intrinsics. In
this PR, we enable support for SCALE intrinsics and add example SCALE
implementations.
## Technical Details
Adding MFMA SCALE intrinsics support, adding tests for MFMA SCALE
intrinsics, and adding WMMA SCALE policy trait.
Note: fp6 SCALE intrinsics support is not included in this PR, as its
handling in ck_tile is currently more specialized and does not follow
the same pattern as other datatypes.
## Test Plan
Added new tests for the relevant SCALE specialisations.
## Test Result
Test should pass.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
Different flavours of grouped gemm fixed nk implemenations share the
same block to tile mapping logic. Despite that the code responsible for
it is duplicated in each device struct implementation.
- Move `BlockToCTileMap_KBatch_M00_N0_M01Adapt_MLoops` and
`OffsettedBlockToCTileMapMLoops` from the device struct implementations
to a common header file.
- Use the generic Kernel Argument structures in xdl versions of the
fixed nk.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
CI in general. Relevant test and examples are all fixed_nk versions of
grouped gemm multiple D and ABD.
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
## Motivation
`qr_hpad` currently executes work for padded head-dim fragments even
when only a subset of the values are valid. This adds unnecessary
computation for head dimensions that require padding, such as `hdim=72`
and `hdim=80`, and hurts FMHA forward performance.
The goal of this PR is to make the padded-head-dim path skip invalid
work based on the actual valid fragment count, while preserving the
existing behavior for the non-padded path.
## Technical Details
This PR improves the `qr_hpad` FMHA forward path in three parts:
- Skip padded `k`/`n` fragments in the GEMM/pipeline path when only part
of the fragment is valid.
- Add partial GEMM0 tail handling for `qr_hpad` so the kernel uses the
valid fragment range instead of always computing over the padded extent.
- Retune the gfx11 `qr_hpad` kernel configuration after enabling the
partial-fragment path.
To keep the existing path stable, the implementation adds overloads for
the updated GEMM/pipeline interfaces. This allows existing full-tile
callers to keep using the previous form, while the `qr_hpad` path can
pass valid fragment counts when needed.
## Test Plan
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
- On gfx11 and gfx12, for head dimensions that require padding,
`tile_example_fmha_fwd` shows about 20-30% performance improvement at
`hdim=72/80`.
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
CPP tests in dispatcher were failing due to a mismatch in registry key
and string representation.
## Technical Details
Bug 1 - Registry key mismatch: The registry stored kernels using
get_name() but lookups used encode_identifier(), causing all registry
lookups to fail. Fixed by changing registry.cpp:58 to use
encode_identifier() for storage.
Bug 2 - String representation changes: Tests checked for
"persist"/"nopers" substrings, but the code emits "True"/"False". Fixed
by replacing brittle substring checks with comparison-based assertions
in test_kernel_key.cpp and test_kernel_key_extended.cpp.
## Test Plan
Tested with CPP tests in dispatcher
## Test Result
Validation: All three core cpp tests now pass:
- test_kernel_key - 6/6 tests passing
- test_kernel_key_extended - 25/25 tests passing
- test_registry - 8/8 tests passing
## Submission Checklist
- [ x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Motivation
Due to compiler version update, there are test failures in the test
target `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current develop branch, the test
failures are blocking any PR merges into develop. An example of a failed
CI runs is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as the tests are passing for clang compiler version 20.0 and
failing for clang compiler version 22.0.
First attempt to fix this problem had to be reverted in #6400 because it
broke MIOpen internal DB sync tests.
## Technical Details
The root cause for the test failures are the block-GEMM V5 instances of
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3` that have large tile size.
The V5 pipeline uses double register buffer that in combination with
large tile size causes high register pressure. The latest version of
compiler handles the register spillage incorrectly for `gfx90a`, which
cause the kernel to output incorrect results.
The BF16/FP16 instances of `DeviceGroupedConvBwdWeight_Xdl_CShuffleV3`
that do not use direct load for are divided into two groups
- Base instances
- Instances that result into high register usage (currently only one
instance - one that causes the test failures).
This division allows to disable only the V5 block-GEMM flavor of
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` for 3D convolutions on `gfx90a`. The
selective disabling leaves the set of instances for 1D and 2D
convolutions unaffected, and removes at runtime two V5 block-GEMM
instances (`ConvBwdWeightDefault` and
`ConvBwdWeightFilter1x1Stride1Pad0`) per data type (FP16/BF16) when the
device is `gfx90a`.
Because MIOpen uses CK's type string (provided by method
`GetTypeString`) to identify the instances, the DB sync tests are
expected to unaffected since there are still the V2 block-GEMM instances
that result in the same type string
(`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>`). This expectation needs to be verified by
running the MIOpen DB sync tests that are not part of the normal CK PR
build.
## Test Plan
Running all CI tests + the MIOpen internal DB sync tests is sufficient
to verify the correctness of the code changes.
## Test Result
Verified locally that the previously failing tests
`TestGroupedConvndBwdWeight3d/4.Test3D` and
`TestGroupedConvndBwdWeight3d/4.Test3D` have instance counts
- 231 on `gfx90a`
- 233 on `gfx942`
and are currently passing. This confirms the expectation that two
instances per data type should be disabled on `gfx90a`.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Co-authored-by: Ville Pietilä <>
## Motivation
setup_gemm_dispatcher() was rebuilding libraries on every call instead
of reusing cached libraries.
**Root Cause**:
1. Library names only included dtype+layout, causing different
tile/wave/warp configs to overwrite each other
2. No cache checking - always loaded default library, detected mismatch,
then rebuilt
## Technical Details
**Solution**:
1. Complete library naming with all distinguishing parameters:
libdispatcher_gemm_{dtype}_{layout}_{tile}_{wave}_{warp}_{pipeline}_{epilogue}_{scheduler}.so
2. Cache checking before rebuild:
- Check if library for exact config already exists
- Reuse if found (500x faster: 0.02s vs 10s)
- Only rebuild when no cached library exists
3. Better error handling for kernel generation failures
Files Changed:
- dispatcher/python/ctypes_utils.py
- dispatcher/tests/test_library_caching.py (new unit test)
## Test Plan
Use `dispatcher/tests/test_library_caching.py ` to ensure that libraries
are cached and only rebuilt if they are not present in build directory
1. **test_01_unique_library_naming** - Library names include all
parameters (dtype, layout, tile, wave, warp, pipeline, epilogue,
scheduler)
2. **test_02_library_build_and_cache** - Libraries are built once and
then cached for reuse
3. **test_03_different_configs_different_libraries** - Different configs
create different library files
4. **test_04_cache_message_verification** - Cache hit messages are
logged correctly
5. **test_05_code_fix_verification** - Code changes are present in
ctypes_utils.py
## Test Result
All the test above passed.
## Submission Checklist
- [ x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
The PermuteN epilogue was previously embedded within
cshuffle_epilogue.hpp, despite having fundamentally different behaviour.
Coupling these two independent strategies in one file introduced
unnecessary complexity, SFINAE guards, and a dual operator() overload
selected at compile time via TiledMMAPermuteN_ template parameter.
This PR separates PermuteN into its own standalone
file(pertmuten_epilogue.hpp), simplifying both implementations and
making the codebase easier to maintain and extend independently.
## Technical Details
**New file: permuten_epilogue.hpp:**
contains PermuteNEpilogueProblem and PermuteNEpilogue, extracted from
the permuteN code path in cshuffle_epilogue.hpp.
**Cleanup of cshuffle_epilogue.hpp:**
- Removed the TiledMMAPermuteN_ template parameter from
[CShuffleEpilogueProblem]
- Removed the SFINAE-guarded permuteN operator() overload
- Removed the EnablePermuateN_ SFINAE alias
- CShuffle now only contains CShuffle logic; EightWave support
(independent feature) is retained
**Consumer migration :**
All consumer files now use compile-time epilogue selection via
[std::conditional_t]
`using GemmEpilogue = std::conditional_t<
TiledMMAPermuteN,
PermuteNEpilogue<PermuteNEpilogueProblem<...>>,
CShuffleEpilogue<CShuffleEpilogueProblem<...>>>;`
**Files modified:**
- flatmm_basic.cpp, moe_flatmm.cpp, a16w4_moe_flatmm.cpp,
mixed_prec_flatmm.cpp, mx_flatmm_instance.hpp — flatmm examples
- run_gemm_quant_example.inc — block-scale GEMM example
- gemm_weight_preshuffle_invoker.hpp — weight preshuffle invoker
- test_gemm_quant_fixtures.hpp, test_gemm_persistent_async_input.cpp,
test_gemm_pipeline_util.hpp — test utilities
- universal_gemm_invoker.hpp — universal GEMM invoker
- epilogue.hpp — add header updated to include permuten_epilogue.hpp
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
## Motivation
This PR introduces a restructure for the benchmarking and profiling
aspects of CK Tile's Tile Engine, expanding on the groundwork from this
previous https://github.com/ROCm/composable_kernel/pull/3434 and
outlined in this [design
document](https://amdcloud-my.sharepoint.com/:w:/r/personal/astharai_amd_com/Documents/Restructuring%20Tile%20Engine.docx?d=w14ea28a30718416988ed5ebb759bd3b2&csf=1&web=1&e=l3VBuX).
In PR 3434, to reduce repeated code we implemented:
- Base class that centralizes common functionality and provides a
default implementation (Universal GEMM)
- Child classes for GEMM variants override virtual functions to handle
variant-specific behavior
This refactoring in this PR follows the same process and should greatly
reduce the duplicated code present in Tile Engine and make it simpler to
add in new operations, increasing scalability.
## Technical Details
The files have been refactored around new base structs for benchmarks,
profiling and problem descriptions. The new base structs are:
- GemmProblem
- GemmBenchmark
- GemmProfiler
Universal GEMM, Preshuffle GEMM, and Multi-D GEMM all have child classes
that will inherit from these base structs overriding only what differs
per variant.
All common functions across the benchmarking and profiling files have
been moved into newly added common utility files under the commons/
directory. The new utility files are:
- utils.hpp: common functions for the benchmarking and profiling process
- benchmark_utils.py: common utility functions for the benchmark
generation
## Test Plan
I tested using the existing tests for Tile Engine.
## Test Result
All tests passed.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
hipEventElapsedTime can return a small negative value on Windows when
timing a very fast kernel launch on the null stream. This caused
consumers of launch_and_time_kernel to receive a negative elapsed time,
which they reasonably treat as an error, breaking otherwise-correct
kernel executions.
## Technical Details
After calling hipEventElapsedTime, a clamp is applied in
launch_and_time_kernel before the result is returned, avoiding the
return of a physically impossible elapsed time.
The negative value from hipEventElapsedTime has been observed on
Windows. For kernels that complete in well under a millisecond, the HIP
event timestamps can alias such that the computed difference is a small
negative number (observed: ~-1.78 ms). No HIP error is reported by any
surrounding call (hipEventRecord, hipEventSynchronize, hipGetLastError),
confirming the kernel itself executed successfully.
## Test Plan
- Recompile CK and validate no kernel execution reports a negative
elapsed time during hipTensor tests.
- Pass the CI/CD pre-checking tests for CK.
## Test Result
- All tests passing
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Summary
- Skip fp16 FMHA forward dropout tests that use the d256 tile in batch
mode, gated on compiler version
- The AMDGPU compiler miscompiles these kernels due to VGPR aliasing of
Philox RNG parameters under high register pressure (383 VGPRs)
- bf16 dropout tests are unaffected and cover the same code paths
## Root Cause
The compiler aliases `ph_seed` and `ph_head_offset` (Philox RNG state
stored in VGPRs) with other live data during the softmax main loop. This
causes corrupted `buffer_store_byte` writes for dropout randval on wave
lanes 32-63, producing NaN in output and LSE tensors.
**Conditions:** fp16 + d256 tile + dropout + batch mode + `qr` pipeline
+ gfx90a
## Changes
- `include/ck_tile/core/config.hpp`: Add
`CK_TILE_WORKAROUND_ROCM_7_12_FP16_DROPOUT_MISCOMPILE` macro
- `test/ck_tile/fmha/test_fmha_fwd.cpp`: Version-gated `GTEST_SKIP` in
`TEST_P(Dropout, ...)`
## Test plan
- [x] ROCm 7.1.1 (clang 20): 168/168 fp16 dropout tests PASS (no skip
active)
- [x] ROCm 7.12 (clang 22): 132 PASS, 36 SKIPPED, 0 FAILED
- [x] bf16 dropout tests: 168/168 PASS (unaffected by this change)
## Motivation
modify elementwise kernel template signature to fix cshuffle epilogue
build error
## Technical Details
Encountered a build error while building conv fallback kernel with
dispatcher.
Error: Type mismatch in `ElementWiseKernel::operator()` where the
template required all three parameters (lens, input_strides,
output_strides) to be the same type, but the CShuffle epilogue was
passing them with different tuple element types.
Solution: Modified the template signature in elementwise_kernel.hpp to
accept three independent type parameters:
Changed from single typename `Dims` to typename `DimsLens`, typename
`DimsInStrides`, typename `DimsOutStrides`
Updated references to `Dims::size()` to use the appropriate specific
type
## Test Plan
- Test with dispatcher conv unit tests
- Relying on CI tests
## Test Result
- Dispatcher unit tests passed
- Relying on CI tests
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
In this PR we showcase how the amdgcn structs could be used in a pipeline that does some extra pre/post processing.
For the sparse intrinsics, so far we compressed the A vector "on the fly" right before the execution of the builtin. This might introduce performance issues down the line if, for example, the user decided to chain multiple sparse builtins. We tackle this problem by creating a specific SparseCompressTransform.
A MmaPipelineBase is also created to facilitate those kind of higher level compositions of the amdgcn structs and is integrated to the existing WaveWiseMma prototype. There is an effort to facilitate future operations, like swizzle A/B, C transpose or double/quad attr num access through the MmaPipelineOptionFlags, but those are not yet defined and should do so in a future PR.
The pipeline base class is basically at the RFC stage.
We also create a runtime test for the existing WaveWiseMma, as well as one for the SparseMma pipeline.
## Technical Details
The goal should be to have the pipeline easily expandable. May the CRTP of the base class or the interface in general be insufficient or unable to handle all of our needs, then a design modification should be discussed.
## Test Plan
New tests are added.
## Test Result
Tests should pass.
---------
Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
Reverts ROCm/rocm-libraries#6343
This is causing failures in miopen, namely Dbsync gfx942 even though it shouldn't be affected so this needs to be investigated. Please add miopen as a label to the new PR for addressing the compiler codegen bug so that this can be addressed simultaneously.
## Motivation
Due to compiler version update, there are test failures in the test
suite `test_grouped_convnd_bwd_weight` when running on `gfx90a`. There
are four failing tests for FP16/BF16 that arise from a single kernel
instance. As the problem is in the current `develop` branch, the test
failures are blocking any PR merges into `develop`. An example of a
failed CI runs is here:
[http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/](http://micimaster.amd.com/blue/organizations/jenkins/rocm-libraries-folder%2FComposable%20Kernel/detail/develop/558/pipeline/).
The underlying compiler problem is potentially the same as described in
#6342 as tests are passing for clang compiler version 20.0 and failing
for clang compiler version 22.0.
## Technical Details
This PR disables the compilation of the problematic bwd weight conv
instance for `gfx90a` by adding a new CMake flag `CK_USE_GFX90A` that
allows us to detect when we are compiling for `gfx90a`. Using the new
CMake flag, compilation of instance
`DeviceGroupedConvBwdWeight_Xdl_CShuffleV3<64, 128, 32, 32, Default, 8,
4, 1, 8, 8, 8, 8, 1, 1, 2>` is disabled for `gfx90a`.
Co-authored-by: Ville Pietilä <>
## Motivation
We want to be able to calculate TileDistributionEncodings describing
register mappings for any MmaOp. This is necessary for further
integration with CK Tile.
This MR adds a new struct TileDistrEncCalc, which takes an amdgcn_mma
type (MmaOp) and provides ABC warp distribution encodings for mapping
matrix fragment coordinates to register coordinates
(lane, vector item) and vice versa. It is able to take CTranpose,
Swizzle, and NumAccessA / NumAccessB template parameters for tweaking
the tile distributions. Swizzle modification will be implemented later.
The current implementation can deal with all intrinsic types and
block-hiding.
This MR also adds some additional static asserts and derived params
within amdgcn_mma_base, to enforce consistency and help calculate Tile
Distributions for block-hiding intrinsics.
An Example was added that uses the Tile Distr Enc Calc to calc and print
register layouts for Tile Distributions for some of our amdgcn_mma
structs. It also makes sure that the CTranspose modifier works as
intended.
Some additional gfx9 intrinsics were added to test block-hiding layouts
for the different types of C-block-hiding layouts.
The sparse intrinsic wrappers were updated according to Chris's recent
changes in another branch
(https://github.com/ROCm/rocm-libraries/pull/5508), which moved the
compression step outside of the intrinsic itself. This is necessary to
make sure that the Calculator can deal with this new interpretation of
the sparse intrinsics. I directly copied the new amdgcn structs from
Chris's branch and changed nothing else to avoid more complex merges in
the future. Note that this means I did not update a bunch of related
sparse code since that would be a lot, and therefore I disabled
test_amdgcn_sparse_mma for now.
The amdgcn_mma_layout test was refactored a bit:
- The old register mapping utility was removed and its use was replaced
by the new TileDistrEncCalc
- More tests were added to test layouts for different types of
block-hiding and sparse intrinsics
- The Selector method was removed and the tests were split up over
target architectures, with each target arch having a direct list of
amdgcn structs to be tested. This ensures that we force specific tests
on specific architectures and makes sure that the selector doesn't
quietly do some workarounds like creating compound intrinsics.
## Test Results
Layout tests based on calculated tile distribution encodings pass on all
architectures. Calculator works for all currently added amdgcn structs,
which includes different types of block-hiding and sparse intrinsics.
Printed layouts from new example verified by eye. CTranspose modifier
tested for large set of intrinsics.
Depends on #6303
## Summary
Extract shared test boilerplate (includes, type aliases, test fixture
macros) from 47 `test_gemm_quant_*` files into a single
`test_gemm_quant_common.hpp` header. Each test file is reduced from ~50
lines of boilerplate to ~5 lines.
| Metric | Value |
|--------|-------|
| Files changed | 48 |
| Insertions | +413 |
| Deletions | −1,106 |
| **Net lines removed** | **−693** |
### What changed
| Before | After |
|--------|-------|
| 47 test files, each with ~50 lines of identical includes, type
aliases, and fixture macros | 1 shared header
(`test_gemm_quant_common.hpp`) + 47 thin files (~5 lines each: include +
params) |
### Readability assessment
A code realist review confirmed this change **improves readability**:
the 47 test files had identical boilerplate obscuring the only
meaningful content — the `GemmConfig` type alias and test dimensions.
After the refactoring, each file's unique configuration is immediately
visible, and adding a new test variant requires specifying only the
varying parameters instead of copying 50 lines.
### Cumulative cleanup series stats
| PR | Description | Net lines |
|----|-------------|-----------|
| #6300 | Remove 61 dead `#if 0` blocks | −2,648 |
| #6302 | Remove 41 commented-out dead code blocks | −2,861 |
| #6303 | Remove 4 orphaned files | −3,886 |
| This PR | Extract gemm_quant test boilerplate | −693 |
| **Total** | | **−10,088** |
Depends on #6302
## Summary
Remove 4 orphaned files that have verified replacements already in the
build.
| File | Reason | Replacement |
|------|--------|-------------|
| `test_gemm_pipeline_compiler.cpp` | Refactored into 13 smaller tests |
`_compv3`, `_compv4`, `_mem`, `_persistent`, etc. |
| `test_grouped_gemm_quant.cpp` | Refactored into 5 smaller tests |
`_rowcol`, `_tensor`, `_aquant`, `_bquant`, etc. |
| `..._f8_f8_f16_..._comp_default_instance.cpp` | Superseded by split
files | `_part1.cpp` + `_part2.cpp` |
| `..._f8_f8_f16_..._comp_kpadding_instance.cpp` | Superseded by split
files | `_part1.cpp` + `_part2.cpp` |
Each deletion was verified:
- Original file is NOT in any CMakeLists.txt
- Replacement files ARE in CMakeLists.txt and actively compiled
- Content is fully covered by the replacement files
## Motivation
Add CI support for running
[flash-attention](https://github.com/ROCm/flash-attention) tests against
CK, similar to existing AITER and PyTorch downstream test pipelines.
## Technical Details
### New: `Dockerfile.fa`
A new Dockerfile that builds a flash-attention test image on top of a
ROCm PyTorch base image. It:
- Sparse-checkouts CK from `rocm-libraries` (or clones directly from
`ROCm/composable_kernel`)
- Clones and builds `flash-attention` with CK as the backend
- Supports configurable `FA_BRANCH`, `CK_FA_BRANCH`, and `GPU_ARCHS`
build args
### Updated: `Jenkinsfile`
**buildDocker refactor:**
- Extracted `buildAndPushDockerImage()` helper that handles both "check
if exists, skip" and "force build, push" logic, eliminating the
duplicated try/catch blocks
- Split monolithic `buildDocker()` into `buildDockerBase()`,
`buildDockerPytorch()`, `buildDockerAiter()`, and new `buildDockerFa()`
- Each downstream docker build now runs unconditionally within its
respective guard (`RUN_PYTORCH_TESTS`, `RUN_AITER_TESTS`,
`RUN_FA_TESTS`)
- Image digests are stored in env vars (`CK_BASE_IMAGE`,
`CK_PYTORCH_IMAGE`, `CK_AITER_IMAGE`, `CK_FA_IMAGE`) for use in
downstream stages
**run_downstream_tests refactor:**
- Merged `run_aiter_tests()` and `run_pytorch_tests()` into a single
generic `run_downstream_tests(conf)` that accepts `image`,
`timeoutHours`, and `execute_cmds`
- Test commands for each downstream target are declared as top-level
lists (`RUN_PYTORCH_TESTS_CMDS`, `RUN_AITER_TESTS_CMDS`,
`RUN_FA_TESTS_CMDS`)
**Pipeline stages:**
- Merged "Run Pytorch Tests" and "Run AITER Tests" into a single "Run
Downstream Tests" parallel stage
- Added two new FA test stages: "Run FA Tests on gfx942" and "Run FA
Tests on gfx950"
- Added new pipeline parameters: `RUN_FA_TESTS`, `fa_base_docker`,
`fa_branch`, `ck_fa_branch`
- `ck_pytorch_branch` and `ck_aiter_branch` now default to the current
branch instead of hardcoded `develop`
- CRON schedule at 13:00 now also triggers `RUN_FA_TESTS=true`
## Test Plan
- [x] Trigger pipeline manually with `RUN_FA_TESTS=true` on gfx942 and
gfx950 nodes
- [x] Verify existing AITER and PyTorch test stages are unaffected
- [x] Verify `buildAndPushDockerImage` correctly skips rebuild when
image already exists (with `BUILD_DOCKER=false`)
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
## Motivation
Read the Docs config files contains outdated file path from their legacy
repos. Update and correct all paths.
## Technical Details
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
This PR adds CK Tile group convolution (forward, backward-data,
backward-weight) support to the kernel dispatcher, matching and unifying
with the existing dispatcher GEMM infrastructure in architecture and
usability. The dispatcher provides a unified kernel dispatch system with
both C++ and Python frontends, and until now only supported GEMM
operations. This PR enables framework integrators to use the same
declarative kernel workflow for convolutions as they do for GEMM:
declare kernels, build a registry JIT, select kernels within the
registry at runtime, and dispatch to GPU. Future PRs will include
runtime kernel selection heuristics for autotuning of kernel parameters
based on (problem, hardware arch).
## Technical Details
Grouped convolution support has been added to the CK Tile Dispatcher
with generated_conv_backend.hpp enabling dispatcher.run(in, wei, out,
problem) for all 6 conv variants (fwd/bwdd/bwdw x 2D/3D), runtime
heuristic kernel selection, and GroupedConvKernelKey with full
ConvConfigBase fields. Python side adds parallel JIT via
registry.build(max_workers) and heuristic registry.select(). Includes 7
C++ and 6 Python examples covering all directions with CPU reference
validation, and shared infrastructure improvements (BaseRegistry CRTP,
structured exceptions). As a sanity check, JIT compile times for a
single kernel remains the same and for multiple kernels there is better
parallelism:
Kernels | 1 worker | 8 workers
1 | 7.7 s | 7.7 s
2 | 15.9 s | 8.2 s
4 | 33.4 s | 9.7 s
6 | 52.3 s | 10.2 s
## Test Plan
145 ephemeral unit tests have been added to test basic functionality.
All 30 examples/integration tests run end-to-end on gfx950 (MI350): 7
C++ conv, 7 C++ GEMM, 6 Python conv, 10 Python GEMM. CPU reference
validation for forward, backward-data, and backward-weight (2D) in both
C++ and Python examples pass.
## Test Result
30 examples pass. Peak performance: 132 TFLOPS (Batch-32 forward 56x56),
53 TFLOPS (pointwise 1x1). CPU reference accuracy: max_abs_diff < 0.002
for all directions (fp16 vs fp32 reference).
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Yaswanth Raparti <113389104+yraparti@users.noreply.github.com>
## Motivation
On gfx11/gfx12, FMHA forward kernels that require head-dim padding show
a large performance drop compared to the exact-head-dim path. In
practice, padded cases such as `HDIM=72` and `HDIM=80` were falling too
far off the fast path.
This PR improves padded-head-dim FMHA performance on gfx11/gfx12 while
keeping the behavior for other GPUs unchanged.
## Technical Details
- Add/scope a dedicated padded-head-dim (`qr_hpad`) FMHA forward path
for gfx11/gfx12.
- For `receipt=0`, keep support conservative and only enable the padded
fast path for vector-safe cases (`head_dim % 8 == 0`), matching the
existing assumption used on other GPUs.
- Move `v_prefetch` later only for the head-dim-padded path on
gfx11/gfx12. This reduces live ranges and removes the register-spill
behavior seen in the earlier scheduling.
- Enable the buffer-load OOB check offset trick for the padded path on
gfx11/gfx12.
## Test Plan
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=16
-d={72/80} -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}
## Test Result
Observed padded-head-dim performance improvements for HDIM=72/80:
- gfx11: about ~3.5x
- gfx1151: about ~2.0x
- gfx12: about ~1.3x
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation
The two-dot syntax (origin/develop..HEAD) is more conservative and
catches a broader set of changes when PRs merge develop branch. While
three-dot syntax shows only PR-specific changes, two-dot ensures we
don't miss any files that differ between develop and the PR branch,
including files modified in both the PR and merged develop commits.
This conservative approach prioritizes catching all potential issues
over CI efficiency, which is appropriate for build system change
detection.
# Technical Details:
- Switched to two-dot (..) syntax in ci_safety_check.sh
- Update comments to clarify the intentional use of two-dot syntax
- Maintain consistency across both CHANGE_ID branches
- Trigger full build when any of the following changes
- `Dockerfile|Jenkinsfile|CMakePresets\.json|script/dependency-parser/`
## Test Plan
Tested with PR 6200 which has multiple merge-commits.
## Test Result
It detects 43 new tests compared to 3-dot scheme.
## Submission Checklist
- [x ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: andrew clark <anclark@amd.com>
Updates the FMHA README to document fp8 precision support more
accurately, replacing the outdated "experimental" section and incomplete
CLI arg descriptions.
## Changes
- **`-prec` arg**: expanded supported values from `fp16/bf16/fp8/bf8` →
`fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4`
- **`-qscale` arg**: replaced single-line `1: per-tensor quantization`
with all four modes: `pt/1`, `bs/2`, `kvbs/3`, `mx/4`
- **FP8 support section**: replaced "FP8 experimental support" paragraph
with:
- Supported targets: gfx942/gfx950 + ROCm 6.0+
- Table distinguishing `fp8` / `fp8bf16` / `fp8fp32` by Q/K/V input type
and output type
- Table for all `-qscale` modes with descriptions
- Note that `-vlayout=r` (`seqlen*hdim` for V) is the only supported
layout for fp8 types
<!-- START COPILOT ORIGINAL PROMPT -->
<details>
<summary>Original prompt</summary>
Please open a PR against base branch `develop` in repository
`ROCm/rocm-libraries` applying the following documentation updates
within the composable kernel path.
## Scope
Update the file:
- `projects/composablekernel/example/ck_tile/01_fmha/README.md`
## Changes to apply
Apply the combined edits described in the diffs below (two consecutive
patches). Ensure the final file content includes **both** sets of
changes.
### Patch 1
- In the CLI args section:
- Update `-qscale` description lines to include:
- `pt or 1, per-tensor scale`
- `bs or 2, block scale`
- `kvbs or 3, Q per-tensor, K/V per-page block scale`
- `mx or 4, microscaling (exclusively for mxfp8/mxfp4)`
- Update `-prec` supported data types from `fp16/bf16/fp8/bf8` to
`fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4`.
- Replace the existing "FP8 experimental support" section with an "FP8
support" section stating:
- FP8 FMHA kernels supported on gfx942/gfx950 with ROCm 6.0+
- Precision selectable via `-prec=fp8` (or `fp8bf16`, `fp8fp32`) for
`tile_example_fmha_fwd`
- Add a table describing `-qscale` modes:
- `n` or `0`: No quantization scale (default)
- `pt` or `1`: Per-tensor quantization scale
- `bs` or `2`: Per-block quantization scale
- `kvbs` or `3`: Q per-tensor + K/V per-page block scale
- `mx` or `4`: Microscaling (MX format), exclusively for `mxfp8` and
`mxfp4`
- Add/keep note that currently only `-vlayout=r` (`seqlen*hdim` for V
matrix) is supported for fp8 data types.
### Patch 2
Further refine the "FP8 support" paragraph to explain the difference
between `fp8`, `fp8bf16`, and `fp8fp32` via a table:
| `-prec` value | Q/K/V input type | Output type | Description |
|---|---|---|---|
| `fp8` | fp8 | fp8 | Fully fp8: both inputs and output are in fp8 |
| `fp8bf16` | fp8 | bf16 | Mixed precision: fp8 inputs, bf16 output —
useful when the consumer expects a wider-range output format |
| `fp8fp32` | fp8 | fp32 | Mixed precision: fp8 inputs, fp32 output —
highest-precision output, suitable for debugging or further fp32
processing |
Keep the rest of the `-qscale` table and the `-vlayout=r` limitation
note.
## Notes
- PR title must be: `[CK_TILE] Add fp8 in FMHA readme`
- Ensure markdown formatting renders correctly (tables, code
formatting).
- Only modify the file listed above.
The following is the prior conversation context from the user's chat
exploration (may be truncated):
User: 能幫我上這個pr嗎 在composable kernel裡的路徑
diff --git a/example/ck_tile/01_fmha/README.md
b/example/ck_tile/01_fmha/README.md
index 0b526f4e9fc..1627435863b 100644
--- a/example/ck_tile/01_fmha/README.md
+++ b/example/ck_tile/01_fmha/README.md
@@ -62,14 +62,17 @@ args:
-d_v head dim for v, -1 means equal to d (default:-1)
-scale_s scale factor of S. 0 means equal to 1/sqrt(hdim). (default:0)
-qscale n or 0, no scaling (default:n)
- 1: per-tensor quantization.
+ pt or 1, per-tensor scale
+ bs or 2, block scale
+ kvbs or 3, Q per-tensor, K/V per-page block scale
+ mx or 4, microscaling (exclusively for mxfp8/mxfp4)
-iperm permute input (default:1)
if true, will be b*h*s*d, else b*s*h*d
-operm permute output (default:1)
-bias n or 0, no bias (default:n)
e(lementwise) or 1, elementwise bias with 1*1*s*s. e:1, 1*h*s*s. e:2,
b*h*s*s
a(libi) or 2, alibi with 1*h. a:1, b*h
- -prec data type. fp16/bf16/fp8/bf8 (default:fp16)
+ -prec data type. fp32/fp16/bf16/fp8/fp8bf16/fp8fp32/mxfp8/mxfp4
(default:fp16)
-mask 0: no mask, 1: top-left(same as 't'), 2:bottom-right(same as 'b')
(default:0)
't', top-left causal mask, 'b', bottom-r causal mask
't:l,r', top-left sliding window attn(swa) with FA style left right size
@@ -161,7 +164,17 @@ We support sequence padding and variable-length
processing in both batch and gro
Both approaches optimize memory access patterns while supporting
flexible sequence length requirements commonly found in transformer
inference scenarios.
-## FP8 experimental support
-As described in [this
blog](https://blog.hippoml.com/8bit-hippoattention-up-to-3x-faster-compared-to-flashattentionv2-8f9def90b482),
we have an experimental support for fp8 fmha kernels, you can evaluate
the performance by setting the arg `-prec=fp8` to the
`tile_example_fmha_fwd`, on a gfx942 machine and ROCm 6.0+.
+## FP8 support
+FP8 FMHA kernels are supported on gfx942/gfx950 machines with ROCm
6.0+. You can select fp8 precision by setting the arg `-prec=fp8` (or
`fp8bf16`, `fp8fp32`) to the `tile_example_fmha_fwd`.
-Currently we only support `-vlayout=r`( `seqlen*hdim` for V matrix) for
fp8 and fp8bf16 now. Full feature support will come later.
+The following quantization scale modes are available via `-qscale`:
+
+| `-qscale` value | Description |
+|---|---|
+| `n` or `0` | No quantization sca...
</details>
<!-- START COPILOT CODING AGENT SUFFIX -->
*This pull request was created from Copilot chat.*
>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: asleepzzz <4926646+asleepzzz@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
## Summary
- Temporarily remove the KV cache offset overflow assert checks in
`FmhaBatchPrefillWithPagedKVCacheKernel`
- The asserts are **correct**, but they block project progress in
certain configurations
- This is a **temporary workaround** to unblock progress; a proper fix
will follow
## Note
This is NOT a permanent solution. A follow-up PR will add proper
overflow handling that addresses the underlying issue without blocking
progress.