Commit Graph

2594 Commits

Author SHA1 Message Date
linqunAMD
13cf0bd17f [CK_TILE] Fix gemm_quant (#3186)
[ROCm/composable_kernel commit: 1b1c46e508]
2025-11-11 08:23:57 -08:00
Aviral Goel
c1b5372db3 chore(copyright): update copyright header for tile_engine directory (#3180)
[ROCm/composable_kernel commit: 88e3212fcc]
2025-11-11 08:17:24 -08:00
Scott Todd
2c7d1aba58 Bump commit ref for TheRock in workflows (#3189)
* Bump commit ref for TheRock in workflows

* Update to more recent commit (could also `rm` the patch)

* Revert "Update to more recent commit (could also `rm` the patch)"

This reverts commit 4b9f4952ea.

* Rm patch that no longer applies

* Fix post_build_upload flag name

* Fix artifact_group plumbing for setup test env

[ROCm/composable_kernel commit: aa1fb29aa1]
2025-11-11 07:44:38 -08:00
Khushbu Agarwal
ae4444dfba formatting (#3182)
[ROCm/composable_kernel commit: 06c651b100]
2025-11-11 07:42:26 -08:00
Enrico Degregori
8e23284922 Extend support for ak1 / bk1 WMMA (#3073)
* Extend AK1 / BK1 support:

 - Add support for AK1 != BK1
 - Add support for AK1, BK1 > 8
 - Introduce KInner template parameter for pipelines when loading multiple tiles with one instruction

* fix clang format

[ROCm/composable_kernel commit: 1c544abf57]
2025-11-11 07:38:15 -08:00
Thomas Ning
b40859d461 fix input range (#3188)
[ROCm/composable_kernel commit: 9f33b7cfd3]
2025-11-10 11:08:41 -08:00
linqunAMD
ddb0078fec [ck] Enable missing op for gfx11 and gfx12 (#3187)
[ROCm/composable_kernel commit: 7b6ba8d5c2]
2025-11-10 10:58:20 -08:00
linqunAMD
93b4c77e06 [ck] correct memory size in grouped_gemm_multi_abd_xdl_fixed_nk_bias_bf16_i8 (#3168)
b1 and b0 use same layout,  so, the size of b1_tensors_device should be same with b0_tensors_device's

[ROCm/composable_kernel commit: e593a14ae1]
2025-11-10 10:58:08 -08:00
Manish Kumar
5f9d5566e5 [CK-Tile] Add gtests for compiler CI for faster testing (#3123)
* Add gtests for compiler CI for faster testing

* Add changes to have a custom target

* Add a gtest suite for gemm kernel for running CI tests with compiler mode

* Fix Clang error (EOL)

* Removed compiler subfolder from CMake

* Add gtest suite for gemm kernel

* Disable failed tests

* Fix build errors

* Resolved PR comments

* Update shape for persistent gemm kernel test

* Seperated types by H/W archs

* Made changes to persistent types

* Fix persistent build failure issue

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: d5746dd120]
2025-11-10 10:42:23 -08:00
Gino Lu
0344170dac fix MX bpreshuffle gemm B grid descriptor dimension error. (#3170)
[ROCm/composable_kernel commit: e31a7a4f29]
2025-11-06 19:42:39 -08:00
Xudong Yuan
6e40562dff Ck moe mxfp4 blockm32 (#3098)
* block_m = 32

* ck block_m = 32

* aiter/3rdparty/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_b_preshuffle_mx_moe_v3.hpp format

* mxfp4_moe v1 pipe

* update format

---------

Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: felix <felix.li@amd.com>

[ROCm/composable_kernel commit: d04eba4ae3]
2025-11-07 08:45:41 +08:00
JH-Leon-KIM-AMD
e8afef1e8b [CK_BUILDER]ckb add remining fwd conv device ops (#3155)
* Add device operation to conv signature. Use unions to hold conv layouts and device operations.

* Add predicates for all device op instances.

* Use the device op signature for validation.

* Fix ckb CMakeLists.txt file for tests.

* Fix building CK Builder instance traits after the introduction of direct load template parameter in CK.

* Fix clang-formatting.

* add device_grouped_conv_fwd_dl_multiple_d_nhwc_kyxc_nhwk

* Add full DL configurability with Option A implementation

- Added 5 DL descriptor structs (39 configurable parameters)
- Added 10 C++20 concepts for type-safe validation
- Updated factory to read all parameters from descriptors
- Updated test helper to populate all descriptors
- All tests passing (13/13 including 3 new DL tests)

* Add factory and test support for DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor

- Add factory specialization for Large_Tensor device operation (conv_factory.hpp lines 1145-1265)
- Add macro collision workaround using pragma push/pop (conv_factory.hpp lines 43-51)
- Add test helper function run_test_DeviceGroupedConvFwdMultipleD_Xdl_CShuffle_Large_Tensor
- Add builder test file test_ckb_conv_fwd_2d_large_tensor_fp16.cpp with 2 test cases
- Update CMakeLists.txt to include new test file
- Reuse existing ConvAlgorithm_DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle descriptor
- Map all 42 template parameters identical to regular XDL CShuffle
- All 15 builder tests passing including 2 new Large_Tensor tests

Completes Task 350: All 4 forward convolution device operations now supported in CK Builder.

* Update copyright headers to new format

- Change copyright format to: Copyright (C) Advanced Micro Devices, Inc., or its affiliates.
- Reorder headers: Copyright first, then SPDX-License-Identifier
- Updated files:
  * experimental/builder/test/conv/test_ckb_conv_fwd_2d_dl_fp16.cpp
  * experimental/builder/test/conv/test_ckb_conv_fwd_2d_large_tensor_fp16.cpp
  * experimental/builder/include/ck_tile/builder/device_op_types.hpp

* fix c++ 18 format

* Fix clang-format-18 error in device_op_types.hpp

---------

Co-authored-by: Ville Pietilä <ville.pietila@amd.com>
Co-authored-by: Ville Pietilä <188998872+vpietila-amd@users.noreply.github.com>

[ROCm/composable_kernel commit: 5f3cae3e28]
2025-11-06 16:29:48 -08:00
Johannes Graner
085690955f Add .clangd and CMakeUserPresets.json to .gitignore (#3171)
[ROCm/composable_kernel commit: 76c4c12f59]
2025-11-06 15:07:39 -08:00
Adam Osewski
9fde8e559a [CK_BUILDER] Convolution description (#3163)
* Add DirectLoad tparam & clean up headers.

* Add convolution traits.

* Update inline documentation.

* Add more convolution specialization and gemm padding types.

* Add additional helper functions & more tests to conv traits.

* Fix tests cmake file.

* Add case insensitive string comparison

* Fix function name overlapping with variable name.

* Unify pipeline version and scheduler enums.

* Fix includes.

* Update test conv traits with unified enums.

* Update concepts etc with update unified enum

* Fix ckb conv fwd test - unified enum usage.

* Dump changes.

* Add ostream overloads for all enum classes.

* Update detailed() function in ConvDescription

* Fix handling union based conv direction.

* Add test & update conv description.

* Refine tree view.

* Update copyrights

* Fix merge artifacts

* Update detailed tree conv description

* Fix clang-format

[ROCm/composable_kernel commit: 18e083003f]
2025-11-06 15:46:26 +01:00
Bartłomiej Kocot
e89cb52306 [CK TILE] Convolution remove magic values (#3160)
* [CK TILE] Refactor Conv configs and Conv Elementwise

* fix

* [CK TILE] Convolution remove magix values

* fix partitioner

[ROCm/composable_kernel commit: 2234ff830b]
2025-11-06 11:26:30 +01:00
joyeamd
846b43f43b add gfx11's barrier following SPG's reference (#3159)
* add gfx11's barrier following SPG's reference

* re-format the code

* minor fix

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 12922120d2]
2025-11-05 22:29:03 -08:00
Illia Silin
b7d6555a88 Fix compilation errors with clang22. (#3164)
* resolve compilation issue with clang22

* add __extension__ for __COUNTER__ usage in ck_tile

[ROCm/composable_kernel commit: 4533aa6dba]
2025-11-05 15:42:22 -08:00
Adam Osewski
54409e7fb5 [CK_BUILDER] Convolution traits. (#3152)
Added:

1. Convolution traits & unit tests
2. Update builder enumerators to have representation of Convolution Kernels properties.
3. Unified builder pipeline version & scheduler enumerators

[ROCm/composable_kernel commit: b8527a9236]
2025-11-05 08:53:06 -08:00
andrew clark
a70d21d523 Collecting redis stats (#3149)
[ROCm/composable_kernel commit: 3b076b0b74]
2025-11-04 18:55:11 -08:00
Illia Silin
bb4b6e5961 Initialize new variable to prevent c++17 compiler error (#3156)
* initialize new variable to prevent c++17 compiler error

* build for gfx90a using -std=c++17 flag

[ROCm/composable_kernel commit: 930423ab3b]
2025-11-04 18:54:14 -08:00
Vidyasagar Ananthan
4d72320b51 Chunk Ctests so we dont run into large number of tests error (#3050)
* Chunk Ctests so we dont run into large number of tests error

* Addressing feedback from copilot

[ROCm/composable_kernel commit: 31c019f589]
2025-11-04 10:31:32 -08:00
Cong Ma
0343c4e1fe Introduces the new partitioner to implement the reduction StreamK kernel. (#3107)
* Introduces the new partitioner to implement the reduction StreamK kernel

* Add more doc text to functions

* Add persistent-dp option to streamk example

* Update example/ck_tile/40_streamk_gemm/README.md

[ROCm/composable_kernel commit: 5abe4109e0]
2025-11-04 10:32:17 -07:00
Thomas Ning
1a8f824938 fix the blockscale 2d case (#3148)
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: 13ba06f1e7]
2025-11-04 11:55:23 -05:00
John Shumway
a9d0980ad9 [CK_BUILDER] Update copyright messages. (#3150)
* Update copyright messages.

Copyright messages should no longer include a year. This PR updates all 38 source files to the new format.

* Switch to (C) from unicode copyright symbol.

The unicodein comments  was causing compilation errors.

[ROCm/composable_kernel commit: 0be0288f58]
2025-11-04 15:35:16 +01:00
John Shumway
52204ff4e5 [CK_BUILDER] Add backward weight instance traits for xdl cshuffle. (#3143)
* Add backward weight instance traits for xdl cshuffle.

To keep instance test file sizes reasonable, we start a new test_bwd_weight_instances_traits.cpp test file.

* Fix copyright notices.

* Remove (c) symbol, replace with (C).

Having UTF-8 in source caused an error with code generation.

[ROCm/composable_kernel commit: 6dbee64886]
2025-11-04 15:34:00 +01:00
Bartłomiej Kocot
052c043d99 [CK TILE] Refactor Conv configs and Conv Elementwise (#3151)
* [CK TILE] Refactor Conv configs and Conv Elementwise

* fix

[ROCm/composable_kernel commit: 8681ced962]
2025-11-04 15:04:53 +01:00
Bartłomiej Kocot
a3a55b00d7 [CK TILE] Refactor grouped conv fwd large tensor (#3144)
[ROCm/composable_kernel commit: 99f38e4d9b]
2025-11-04 00:34:48 +01:00
Vidyasagar Ananthan
c9e7b735c0 Adding note on CMake convenience script (#3139)
* Adding note on convenience script

* Addressing feedback

* Update README.md

reword

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: c7ded76cc7]
2025-11-03 12:21:57 -08:00
Enrico Degregori
9575bcd099 Fix splitk preshuffle (#3137)
* Fix splitK multiply_multiply_wp

* Add tests for gemm_multiply_multiply_wp

* Add tests for gemm_universal_preshuffle (KBatch = 1)

* Add tests gemm_blockscale_wp

* Fix splitk gemm universal preshuffle

* Run new tests on arch supporting fp8

* Restore example

* Fix strides profiler

* Fix tests

* Fix clang format

* Finalize profiler preshuffle with tolerances

* Minor improvements to splitk related changes

* Address review comments: clang format and ckProfiler typo

* Remove b_k_split_offset from SplitKBatchOffset struct

[ROCm/composable_kernel commit: 507d81c3af]
2025-11-03 11:59:01 -08:00
Thomas Ning
bf0dc8ce56 fix the compv4 and async pipeline when tile handler is 1 (#3141)
[ROCm/composable_kernel commit: 057b7d43b4]
2025-11-03 09:37:35 -08:00
Emily Martins
b00303a831 Replace CK_TILE_PIPELINE macros with a common enum
This change replaces pipeline macros like CK_TILE_PIPELINE_COMPUTE_V3,
CK_TILE_PIPELINE_MEMORY, etc in the CK Tile examples with a common enum
called GemmPipeline to reduce code duplication.


[ROCm/composable_kernel commit: 2ec57a8e70]
2025-11-03 09:35:05 -07:00
Michael Mcminn
699f7daae3 Ud fix moe sorting gfx908 (#2720)
* Adding a ds permute fallback for the gfx908 and older for row_newbcast:7 instruction

* Better macro for selecting ROW_NEWBCAST

* clang-format the update

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: afe1ff618d]
2025-11-03 07:31:31 -08:00
msaffari-amd
7c8d79af33 Ck tile engine gemm unit tests exapand test coverage (#3025)
* initial commit for testing datatypes, layouts and traits

* correct warp tile size for small datatype config to make a validate instance for fp16, bf16, fp8

* add tile size coverage test

* Cover more tests, parallel instance generation, documentation

* update cmakelist to run more tests

* initial codes to support add test params in json file

* add congurable  problem sizes for different tests

* modify README.md

* clean test_gemm_simple code

* correct padding coverage test

* Add comprehensive and quick tile size config files

* remove fp64 from datatypes

* update documents. manage selecting tile_size config (quick or Comprehensive)

* correct padding test problem sizes

* update comprehensive test and correct documents

* Skip GEMM tests with unsupported arguments instead of failing

* change gen_single instead of gen_indivisual because of an issue. add splitk tests to tile_size_quick_config

* clean CMakeList, remod py file

* Refactor test configs: Rename tile_size to coverage, remove separate traits config,  clean cmakefile, readme

* update fp32, fp8 to test all layouts, clean documents and comments

* limit fp32 test layouts to rcr because of compilation error on some gpus

* remove fp32 because of the removing from gemm_instance_builder, make quick test smaller, updating comments

* Fix fp8/bf8 test failures on gfx950 by adding OCP FP8 format support

* Reduce quick_coverage test count from ~250 to ~144 for faster CI

[ROCm/composable_kernel commit: d405641f06]
2025-11-03 10:29:16 +01:00
Ville Pietilä
aeeed60666 [CK_BUILDER] Add conv factories for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle and DeviceGroupedConvFwdMultipleD_Wmma_CShuffle (#3138)
* Add device operation to conv signature. Use unions to hold conv layouts and device operations.

* Add predicates for all device op instances.

* Use the device op signature for validation.

* Fix ckb CMakeLists.txt file for tests.

* Fix building CK Builder instance traits after the introduction of direct load template parameter in CK.

* Fix clang-formatting.

* Add factory for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle device op.

* Add conv factory for  DeviceGroupedConvFwdMultipleD_Wmma_CShuffle

* Rename elements per wave per shuffle member in the epilogue concept.

* clang-format

* Add concepts and types for optional device op template parameters.

* Add optional compute, direct load, and loop scheduler arguments to conv factory.

* Add number of groups to merge template parameter.

* clang-format.

[ROCm/composable_kernel commit: 3ae3992c18]
2025-11-03 09:03:25 +02:00
Sami Remes
9f069d6e35 [CK_TILE] B matrix 2D block scale gemm (#3074)
* Refactor quant group size to be configurable for M/N/K, not just K

* add some asserts for configurations not implemented

* start setting of group size for N dimension

* enable 2d for reference quant gemm

* WIP: trying to figure out tile dstr and/or indexing for scale matrix

* WIP

* Fix handling of n dim blocks in tile windows etc

* remove commented code and enable all tests again

* fix formatting

* Add more specialized tile distributions

* Enable NWarps replication for bquant tile dstr

* fix formatting

* fix format

* Fix some issues from the merge

* fix formatting

* one more fix to tile dstr, and revert debug initialization

* Remove commented code

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* simplify conditions that are needed for tile distributions

* only enable the working group sizes in tests

* fix formatting

* Update tile distribution for 2D bquant

* add some documentation and 2d block scale example

* fix formatting

* Add in Changlog and restructure the quant 2d example

* fix CMake

* support the change for blockscale 2d

* fix the test file

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 16e85cf179]
2025-11-02 16:49:20 -08:00
Aviral Goel
f4b880d058 refactor: remove gemm preshuffle pipeline v1 by removing all references from codebase (#3132)
* test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf

* refactor: deprecate gemm preshuffle pipeline v1 by removing all references from codebase

* Revert "test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf"

This reverts commit 573c08a085.

[ROCm/composable_kernel commit: 73f637894d]
2025-11-02 00:06:28 -04:00
Aviral Goel
5be796d8a5 fix: fix bug in print tile window when printing bf8/fp8 tiles (#3120)
* fix: fix bug in print tile window when printing bf8/fp8 tiles

* test(print_tile_window_range): add unit tests to maintain function integrity

* fix: fp8 numerical mismatch error on gfx950 by adding DCK_TILE_USE_OCP_FP8

[ROCm/composable_kernel commit: 45be741586]
2025-11-01 15:28:07 -04:00
Bartłomiej Kocot
b2aa37f3f5 Add 2GB limitation for grouped conv bwd weight (#3054)
[ROCm/composable_kernel commit: ab1a8356b6]
2025-11-01 14:16:45 +01:00
JH-Leon-KIM-AMD
5f45985732 [CK TILE] Grouped conv fwd split image (#2970)
* Refactor split-image implementation: simplify code and remove redundant variables

* Add padding debug output to split-image implementation

- Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp
- Verified padding works correctly with all tests passing

* Fix sign comparison warning after rebase with origin/develop

- Cast blockIdX from unsigned to signed index_t for comparisons
- Integrated with new GetOutputTileIndex logic from upstream
- Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane

* Fix Split-N with groups bug and clean up unused parameters

- Fixed batch stride calculation to include G dimension for grouped convolutions
- When moving between batches in NHWGC/NWGC/NDHWGC layouts, need to account for all groups
- Removed unused multi-split parameters (we only support 2-way split)
- All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N

* Implement recursive queue-based split-image detection and calculation

- Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp
- Implement recursive binary splitting algorithm (10GB→5GB+5GB→...)
- Correctly handle odd dimensions (61→30+31)
- Calculate proper offsets for each split piece
- Update invoker to use split-image helper

Note: Split detection and calculation work correctly but kernel launching
for individual pieces requires kernel modification to handle different
spatial dimensions (unlike Split-N which uses blockIdx.z).

* WIP: Split-Image investigation - found architecture mismatch

- Split-N modifies N_ directly in transformer constructor
- Split-Image needs different approach due to varying dimensions
- Added split calculation logic for 1D and 2D convolutions
- Still facing memory issues when creating piece transformers

Key finding: Split-N uses blockIdx.z for parallel execution,
while Split-Image needs sequential execution of non-uniform pieces.

* Add 1D split-image implementation for grouped convolution (N=1 working)

Implements split-image for 1D convolution to handle large tensors that
exceed memory thresholds. This is a critical milestone with N=1 fully
working and tested.

Key Changes:
- Invoker: Add split-image logic that splits W dimension in half
- Transformer: Add SplitConvProblem helper for recursive splitting
- Calculate offsets for LEFT and RIGHT pieces
- Launch two kernels sequentially (LEFT then RIGHT)

Implementation Details:
- Binary split: divides W dimension by 2
- LEFT piece: W=0 to W/2, keeps left padding, removes right padding
- RIGHT piece: W/2 to W, removes left padding, keeps right padding
- Offset calculation accounts for stride, dilation, and padding
- Physical memory offset (no padding in memory)

Test Results (N=1):
 94/94 tests passing
- Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups)
- Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries)
- Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops)

Known Limitations:
- Only works with N=1 (single batch)
- N>1 fails when split-image triggers (offset calculation issue with Split-N)
- Root cause: Split-N modifies N in transformer, but offset calculated in invoker
- Solution planned: Move offset calculation to transformer (next phase)

Files Modified:
- grouped_convolution_forward_invoker.hpp: Add split-image logic
- transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper

This commit represents a stable, tested 1D split-image implementation
for N=1 cases. It's an important milestone before extending to N>1
and multi-dimensional splits.

* Add basic split-image implementation for 1D/2D/3D grouped convolution

This is a working baseline implementation that splits large spatial
dimensions to handle memory constraints.

Implementation:
- 1D: W-split for NWGC layout (36/36 tests passing)
- 2D: H-split for NHWGC layout (20/20 tests passing)
- 3D: D-split for NDHWGC layout (verified working)

Features:
- Binary split of outermost spatial dimension
- Sequential LEFT/RIGHT kernel launches
- Proper padding adjustment at split boundaries
- Offset calculation for pointer arithmetic
- Debug output for verification

Threshold: 100KB (configurable in transformer)

Known limitations:
- No safety checks for edge cases (to be added)
- Offset calculated before Split-N (incompatible with N>1, to be fixed)
- No recursive splitting for very large tensors

Next steps:
- Add safety checks (is_possible_to_split_*)
- Move offset calculation to transformer (after Split-N)
- Test with N>1 + split-image combination

* Refactor split-image to unified structure for 1D/2D/3D

Unified the three separate dimension-specific blocks into a single
common implementation with dimension-specific stride calculations.

Benefits:
- Reduced code from 636 → 348 lines (45% reduction)
- Eliminated code duplication
- Easier to maintain and extend
- Single source of truth for split logic

Implementation:
- Common: Binary split, offset calc, padding adjustment, kernel launch
- Dimension-specific: Stride calculation only
  - 1D: stride = G * C
  - 2D: stride = W_in * G * C
  - 3D: stride = H_in * W_in * G * C

Test results (all passing):
- 1D: 36/36 tests 
- 2D: 20/20 tests 
- 3D: 28/28 tests 
- Total: 84/84 (100%)

All test scenarios verified:
- Varying channels, padding, stride, dilation
- Filter sizes (1x1 pointwise to 7x7)
- Multiple groups (G=1,2,4)
- Odd dimensions
- Complex combinations

* Add safety checks for split-image in all dimensions

Added is_possible_to_split safety checks to prevent crashes when
splitting is not feasible.

Safety checks verify:
1. Output dimension > 1 (can't split single element)
2. RIGHT piece starts after left padding
3. LEFT piece ends within input bounds

If checks fail, falls back to normal kernel launch.

Verified for all dimensions:
- 1D (W-split): Wo=1 case triggers fallback
- 2D (H-split): Ho=1 case triggers fallback
- 3D (D-split): Do=1 case triggers fallback

Original 84 tests still pass - they use normal configurations
that naturally satisfy safety conditions.

Safety checks protect against pathological edge cases with:
- Very small spatial dimensions
- Extreme stride/dilation combinations
- Invalid padding configurations

* Fix Split-N + Split-Image compatibility issue

Fixed critical bug where Split-N and Split-Image working together
caused ~50% incorrect results due to wrong batch stride calculation.

Problem:
- Batch stride was calculated using MODIFIED spatial dimensions
  (e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000)
- Spatial offset was applied globally in invoker, not per-batch in kernel
- Each batch (blockIdx.z) got wrong memory offset

Solution:
1. Store spatial offset in kargs (don't apply to pointer in invoker)
2. Copy correct batch_stride from temp_kargs to left/right kargs
3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset

Changes:
- grouped_convolution_forward_kernel.hpp:
  * Added spatial_offset_in/out fields to KernelArgs
  * Apply batch + spatial offset in operator()

- grouped_convolution_forward_invoker.hpp:
  * Keep base pointer, store spatial offset in kargs
  * Copy batch_stride from temp_kargs (has original dimensions)

- transform_conv_fwd_to_gemm.hpp:
  * Add debug output for split-image calculation

Results:
- N=1 tests: 84/84 passing (100%)
- N>1 tests: Now all passing (previously ~50% errors)
- Tested: 1D, 2D, 3D with N=1,2,4,8,16,20

* Implement unified threshold for Split-N and Split-Image

This commit consolidates threshold management for both Split-N and
Split-Image operations into a single source of truth, eliminating
code duplication and fixing offset calculation issues.

Key Changes:
============

1. Transformer (transform_conv_fwd_to_gemm.hpp):
   - Moved TwoGB constant to public section for unified access
   - CalculateSplitImage() now takes no parameters
   - Uses internal threshold: TwoGB / sizeof(CDataType)
   - Calculates offsets using N_ (after Split-N) for correctness

2. Kernel (grouped_convolution_forward_kernel.hpp):
   - GetSplitImageInfo() simplified to take no parameters
   - Forwards to transformer's CalculateSplitImage()
   - Clean interface with unified threshold internally

3. Invoker (grouped_convolution_forward_invoker.hpp):
   - Removed redundant threshold calculation
   - Simplified to call kargs.GetSplitImageInfo() with no params
   - Clean early-return pattern (no unnecessary else blocks)
   - Removed duplicate/dead code paths

Benefits:
=========
- Single source of truth: TwoGB defined once in transformer
- No parameter passing for threshold between components
- Correct offset calculation using N_ (post-Split-N)
- Cleaner code with no duplication
- All tests passing: 1D/2D/3D with various N values

Testing:
========
- Split-Image only (N=1, large spatial): PASS
- Split-N only (N>1, small spatial): PASS
- Both splits active (N>1, large spatial): PASS
- No splits (N=1, small spatial): PASS
- CPU verification correct for all scenarios

* Comment out outdated split-image code (SplitConvProblem/LaunchKernelWithSplitIfNeeded)

The old recursive queue-based implementation has been replaced by the
new CalculateSplitImage() method which is simpler and correctly handles
Split-N + Split-Image interaction.

Changes:
- Wrapped lines 381-1078 in #if 0...#endif
- Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded()
- Preserved for reference but disabled from compilation
- No functional changes - all tests still pass

The new implementation (CalculateSplitImage at line ~2163) provides:
- Correct offset calculation using N_ (after Split-N)
- Simpler binary split logic
- Better integration with unified threshold approach

* Implement recursive split-image with depth limit (MAX_DEPTH=10)

Changes:
- Add depth tracking to SplitPiece struct
- Implement two stopping conditions:
  1. Piece size below threshold (optimal case)
  2. Depth >= MAX_DEPTH (prevents infinite recursion)
- Remove MAX_PIECES limit in favor of depth-based control
- Support up to 2^10 = 1024 pieces with depth 10

This allows handling extreme tensor sizes while ensuring termination.
Pieces larger than threshold will still launch correctly if depth limit reached.

Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification.

* Summary of recursive split-image implementation:
- Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces)
- Two stopping conditions: size below threshold OR max depth reached
- Cumulative offset tracking through all recursion levels
- LEFT piece inherits parent offset, RIGHT accumulates (parent + local)
- Per-batch spatial offset application in kernel operator()
- Batch stride uses original dimensions (before split)
- Works with Split-N: split-N first, then recursive split-image
- Handles odd dimensions, padding, stride, dilation correctly
- All 1D/2D/3D tests pass with CPU verification

* Add comment explaining MAX_DEPTH capacity for 2GB threshold

* Refactor: move recursive split-image logic to transformer

- Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp
- Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic)
- Encapsulate SplitPiece struct and BFS splitting algorithm in transformer
- Remove unused includes (queue, vector) from invoker
- Add documentation comment for AreDescriptorsSmallerThan2GB()
- Improve code organization and reusability
- No performance overhead (static template function, compiler inlines)
- All tests passing with 2GB production threshold

* Apply clang-format-18 formatting

- Format invoker and transformer files with clang-format-18
- Fix brace placement and alignment
- No functional changes

* Fix clang-format-18 issues in forward kernel

- Remove extra blank lines
- Fix line wrapping for template calls
- Consolidate GetSplitImageInfo() to single line

* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Split-Image implementation with temporary fixed divider

- Implemented spatial dimension splitting (Split-Image) for large tensors
- Added piece-based coordinate transformation for 1D/2D/3D convolutions
- Integrated Split-N (batch splitting) with automatic threshold detection
- Fixed M dimension calculation to include batch: M = N × spatial_size
- Added spatial offset support in kernel arguments
- Verified 20/20 test cases passing for Split-Image alone
- Known issue: Split-N + Split-Image combination needs coordinate fix

Implementation Details:
- Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values
- Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts
- Piece descriptors track spatial boundaries and block ranges
- No performance overhead for N=1 cases

* Fix 1D split-image padding issue with per-piece dimensions

- Store actual size per piece to handle non-uniform splits
- Remove dead code from transform utils

* Fix 2D/3D split-image with independent split factors per dimension

Problem: Single split factor caused non-uniform pieces when dimensions
didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed.

Solution: Independent split factor selection for W, H, D dimensions.
Each dimension gets optimal factor based on its own size.

Test Results:
- 1D: 42/42 pass (100%)
- 2D: 25/25 pass (100%)
- Total: 67/67 combinations verified

* Remove unused split-image struct fields

Cleanup of split-image implementation:
- Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct
- These fields were declared but never used in the kernel
- Per-piece dimensions are already stored in pieces[] array
- Reduces struct size and improves code clarity

Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass

* Refactor split-image invoker code for improved readability

- Extract piece calculation logic into calculate_piece lambda helper
- Extract kernel args population into populate_split_image_kargs lambda
- Use aggregate initialization for cleaner struct population
- Reduce nesting depth and improve maintainability
- Fix outdated comment about split-image implementation status

* Refactor split-image code and remove debug prints

- Extract GPU kernel helper lambdas for better readability
- Remove all split-image debug print statements
- Set memory threshold to 2GB for production
- All tests pass with CPU verification

* Add split-image safety constraints and refactor to utils

- Add MAX_TOTAL_PIECES=64 limit to prevent segfault
- Move calculate_spatial_piece to library utils
- Add layout validation (NWGC, NHWGC, NDHWGC only)
- Fix hierarchical splitting to respect piece limits
- Add proper documentation and formatting

* Change split-image from runtime to compile-time branching

Response to @bartekxk review comment:
Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)'

Changes:
- Add EnableSplitImage template parameter to kernel
- Change runtime if to compile-time if constexpr
- Update invoker to instantiate kernel variants with true/false

Benefits:
- Eliminates runtime branching in GPU kernel
- Dead code elimination (each variant is smaller)
- Better compiler optimization

Files modified: 2
Lines changed: 20 total (6 in kernel, 14 in invoker)
Tests: 27/27 passed (100%)
Performance: No regression

* Add split-image example as separate binary

- Create grouped_convolution_forward_split_image example
- Add grouped_convolution_forward_split_image_invoker.hpp
- Update CMakeLists.txt to build split_image binary

* Replace linear search with binary search in find_piece_id

- Change O(n) to O(log n) for finding piece ownership
- Matches reference implementation in large_tensor_cshuffle

* Simplify split-image code and fix integer overflow

- Extract lambda functions to static helper methods
- Pre-calculate constants in invoker
- Fix integer overflow in tensor size calculation for large tensors

* Trigger CI rerun - fix merge conflicts

* Fix merge conflict markers

* Fix clang-format: remove space before {}

* Fix clang-format: comment wrapping and Swish constructor

* Rename split_image to large_tensor for clarity

- Renamed grouped_convolution_forward_split_image.cpp -> grouped_convolution_forward_large_tensor.cpp
- Renamed grouped_convolution_forward_split_image_invoker.hpp -> grouped_convolution_forward_large_tensor_invoker.hpp
- Updated CMakeLists.txt target name: tile_example_grouped_conv_fwd_split_image -> tile_example_grouped_conv_fwd_large_tensor
- Updated comments to refer to 'large tensor' instead of 'split-image'

* Update comments and include in large_tensor example

- Updated header comments to use 'large tensor' terminology
- Fixed include path to use large_tensor_invoker.hpp

* Remove test code, restore 2GB threshold

* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix build errors after develop merge and complete rename to large_tensor

This commit addresses compilation errors from the develop merge and
completes the rename from split_image to large_tensor.

Changes:
1. Fix CDEElementWise typo in grouped_convolution_forward_invoker.hpp
2. Fix template parameter order in large_tensor_invoker.hpp
   - TransformConvFwdToGemm signature changed in develop
   - NumGroupsToMerge and SplitN parameters swapped positions
3. Fix missing template parameter in GroupedConvFwdHostArgs
4. Fix EpiloguePipeline scope in kernel (merge conflict)
5. Update binary name references in test scripts

* Restore 2GB threshold for split-image

Changed threshold from 100MB (testing) back to 2GB for production use.

* Fix const-correctness in ds_ptr cast

* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply clang-format-18

* update c++ 18 format

* Apply clang-format-18 to transform_conv_fwd_to_gemm.hpp

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 1fbb47ad30]
2025-11-01 14:18:16 +02:00
Aviral Goel
658fb530ab test(grouped_gemm): add unit tests for grouped_gemm bquant with preshuffleB true (#3119)
* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* clang format

* use GTEST_FAIL

* add bquant to grouped_gemm

* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* clang format

* use GTEST_FAIL

* fix a bug in test_grouped_gemm_util

* skip test when use wmma on grouped_quant kernel

* change cmake

* fix a bug in test_grouped_gemm_util

* skip test when use wmma on grouped_quant kernel

* change cmake

* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm

* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* feat: add bf8 support

* chore: remove unnecessary decltype usage

* chore: add default quant_mode to function signature as fallback

* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel

Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.

* chore: set default quant mode in function signature

* test: add additional test cases to cover edge case of no hotloop

* change code based on comments

* WIP: bquant preshuffle b compiles but gives numerical error

* feat(grouped_gemm_quant): bquant with preshuffleB support added to grouped_gemm example & kernel

* refactor: refactor code after merge commit

* chore: remove print statements

* test(grouped_gemm): split test cases by quant mode to reduce compilation time and add bquant-preshuffleB mode test cases

---------

Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 8f1274d9b6]
2025-10-31 12:07:06 -07:00
Thrupti Raj Lakshmana Gowda
27dc4d9833 [CK TILE ENGINE] GEMM Multi D Restructure (#3121)
* Renaming old code

* Adding GEMM code with new Architecture

* Partial Progress : Errors

* Partial Progress : Working code

* Changes to element wise function

* Removing Debugging statements

* Working GEMM Multi D code

* Removing Stale Code

* Address Copilot review comments

* Address Copilot review comments

* Changes to validation file

* Changes to common code snippets

* Creating common folder

* Removing duplicate files

* Pointing to right common file

* Pointing to right common file

* Pointing to right common file

* Changing to VERBOSE

* Changing CMAKE messages to verbose

* Updating Cmake with right layout datatype configs

* Working code for GEMM Multi D

[ROCm/composable_kernel commit: a33d98f8e2]
2025-10-31 12:02:46 -07:00
Max Podkorytov
b7a073f769 [CK-tile] unhardcode the number of LDS banks from universal gemm policy (#3130)
Fixes LDS bank conflicts on gfx950 for universal gemm v3 pipeline

Replaces hardcoded LDS layer calculations with dynamic computation using the new architecture helpers

Adds architecture-specific helper function get_n_lds_banks()

Changes function attributes from CK_TILE_HOST_DEVICE to CK_TILE_DEVICE in universal gemm policy


[ROCm/composable_kernel commit: 04efd282cf]
2025-10-31 11:58:11 -07:00
Enrico Degregori
e6be7bcc2a WMMA gemm_add_relu_add_layernorm (#2989)
* Summary:

 - Refactor epilogue (with CShuffle) to support fused operations:
    - EpilogueCShuffleBase holds common parts
    - EpilogueCShuffle: runs CShuffle and write out
    - EpilogueWelfordCShuffle: holds Welford specific arguments, runs CShuffle, write out, Welford first part and Welford write out

 - Extend thread transfer v7r3:
    - Support for intermediate data type different from src and dst type
    - New functionality to write to dst buffer and keep data (to be able to use them for additional operations)

* Adress review comments

[ROCm/composable_kernel commit: 4ebc48a3cd]
2025-10-31 11:19:26 -07:00
Anton Gorenko
2136eddf8a Fix synchronization issue in fwd qr pipeline with dropout (#3135)
BlockFmhaPipelineQRKSVS reuses LDS for K and dropout so there must be
block_sync_lds between loading k_lds_window by gemm_0 and storing
dropout randval.

[ROCm/composable_kernel commit: e9596228ff]
2025-10-31 09:44:52 -07:00
John Shumway
a8a377ca53 Add the last two forward instance traits. (#3134)
* Add InstanceTraits for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle

* Add InstanceTraits for kernel_grouped_conv_fwd_dl_multiple_d

* A few small changes to fix broken instance traits.

[ROCm/composable_kernel commit: 5ed2046bee]
2025-10-31 07:52:42 -07:00
andrew clark
d2474f5396 Adding new alert failure patterns (#3122)
* Adding GPU not found pattern

Also, failurePatterns does not need to be global. Moved variable to live in the failure notifications function scope.

* Testing new failure type

* Testing failure

* Removing the forced failure test

* Adding an additional failure pattern

[ROCm/composable_kernel commit: 1977e4b96a]
2025-10-31 07:38:31 -07:00
John Afaganis
c6b0458d1d Add copyright notices to missing files (#3133)
[ROCm/composable_kernel commit: 3f996ee738]
2025-10-31 07:35:11 -07:00
kabrahamAMD
b7429e620c Kabraham/fix block gemm v1 b scale (#3129)
* fixed synchronization issue in block gemm pipeline v1 that caused b_scale to fail

* run clang-format

---------

Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>

[ROCm/composable_kernel commit: a7c52e8afa]
2025-10-31 07:19:01 -07:00
Bartłomiej Kocot
2e1831f8fd [CK TILE] Clear output buffers for grouped conv bwd (#3127)
[ROCm/composable_kernel commit: c2d7931446]
2025-10-31 14:11:54 +01:00
Yi DING
c28e65c6bf [CK_TILE] Add mxfp4 flatmm (#3080)
* Squashed commit of the following:

commit 3e1a851dad834776efbe4fe365ac82c4ed312010
Author: Ding, Yi <yi.ding@amd.com>
Date:   Thu Oct 23 06:10:54 2025 +0000

    Fix & clean after rebase

commit 1edf485092f44411da9a1796a4a6b72d5cdb67c6
Author: Ding, Yi <yi.ding@amd.com>
Date:   Wed Oct 22 10:46:13 2025 +0000

    Squashed commit of the following:

    commit 5276b28a51dac7b5d2106fbae8e78de190ee0de1
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 22 02:04:27 2025 -0500

        fix bandwidth calculation

    commit d645bb20c6d879154c30ecd82bbff4d2a9206750
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 22 00:58:59 2025 -0500

        updates

    commit 0fa7e6b88aaf81a36034aa7607746de295de4263
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Fri Sep 19 00:39:46 2025 -0500

        fix a bug, set the A DS_read preload size to 4 for MXFP4

    commit 50cafa824e2267f2b2f0dfeeb93e69a673630c61
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Thu Sep 18 01:19:03 2025 -0500

        fix a_wrap preload issue for large MPerBlock.

    commit e6333bbbc6ef540e24f92095040085f1ed59041e
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 17 21:34:03 2025 -0500

        optimized the VGPR repack issue for MXFP4

    commit e99e4932c401b9f6d1893dd5044c2827d6b3f145
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Wed Sep 17 04:19:44 2025 -0500

        fix time error

    commit 4586ce6da7fba0514f2e01a8124c76b7d494e124
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 17 03:58:00 2025 -0500

        updated, function passed.

    commit c4f25e7579573db5681b9160f6bdb1349f3566f1
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 16 22:21:39 2025 -0500

        fix, function partially passed

    commit a51b56eb6b00b99a4e8d2802dbf5b5b5277b54d8
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 16 03:01:12 2025 -0500

        fix, reference function passed, next check kernel function

    commit 5b02643ebab18960e8f9ba66c6bd2f91774f9cae
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Tue Sep 16 02:29:01 2025 -0500

        let pack/unpack return pk_fp4_t

    commit 76d37c5d4b17530e95c6fced31bff66a35d54b8f
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 20:50:26 2025 -0500

        fix

    commit e5be3e162b9a20e5355bd556d2b27afb6d8bf085
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 15 05:51:06 2025 -0500

        fix bug

    commit 39a024efe4aa773df589712b1290803bb5ab5d1d
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 04:02:05 2025 -0500

        fix core dump issue, function is not correct.

    commit 16c49d268cfe065b5112b960b2d852b26552686a
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 03:03:02 2025 -0500

        updates, build pass

    commit fe7a961852dee6eff3be3cf1e0d0fabec5cd42ee
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 15 00:05:18 2025 -0500

        updates

    commit aaf9fe8022a72df59e04e4d5886dca3ba9c23400
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Sun Sep 14 23:40:28 2025 -0500

        fix bug

    commit a3da89290e1553b85fbf1171c07e93ac0f5584db
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 12 03:28:50 2025 -0500

        fix interface

    commit c5ff747e72d877461ba61dc19a0fe15527d3161e
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 12 02:53:50 2025 -0500

        add interface in warp_gemm_impl

    commit 0a48d369e601cc798589fc59e0784bdbfc0a22f9
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 10 05:03:08 2025 -0500

        updates some fixes.

    commit aaa2beca30ff5546d171a2028d1894fd4e131d4e
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Tue Sep 9 04:37:42 2025 -0500

        fix after merge ginolu/add_wgmfma_dispatcher

    commit bf87449b09cba690922b2f3f78ba39bf1b1e472e
    Merge: 05ab58e3d 991d7fdbb
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 22:09:15 2025 -0500

        Merge remote-tracking branch 'origin/ginolu/add_wgmfma_dispatcher' into mtgu/cktile_mxfp4_flatmm_dev

    commit 05ab58e3de2b708aceda63d704089c0fa89437ae
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 21:42:47 2025 -0500

        update mx flatmm tail pipeline

    commit 991d7fdbb726d65091a91b5cc2800f798a6661fc
    Merge: ad046084a 41ee8fe31
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 8 19:10:23 2025 -0500

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit ad046084a2f6e4ebf0cd8b47d0d72b74815061fa
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 8 19:09:55 2025 -0500

        fix type error

    commit 42e16b43a035364a42789d7ce45a1e6a7d1d2609
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Mon Sep 8 04:01:40 2025 -0500

        update hotloop pipeline

    commit c2f69745346545087c8ce24acaba2961bb93ef0b
    Merge: adbeeb90b 91db4cec3
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 5 04:22:26 2025 -0500

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit adbeeb90be1533f8aeb8c1d5aea6470d45a455a0
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Fri Sep 5 04:21:26 2025 -0500

        fix clang format

    commit e2378ac393bb79ac80a8eef84677bffce86d9e0a
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Wed Sep 3 10:00:54 2025 -0500

        some updates

    commit bdc18a2269db49ff88e1ef1db30f83ea430d7544
    Merge: 6c5cea2b7 b3886a6d8
    Author: asleepzzz <hanwen.chang@amd.com>
    Date:   Wed Sep 3 13:22:03 2025 +0800

        Merge branch 'develop' into ginolu/add_wgmfma_dispatcher

    commit 6c5cea2b7a306f5d0ad346cb9baf6370ea2a73fe
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 1 02:11:02 2025 -0500

        fix vec size error

    commit 76d1dfa352087dfd5867c8909b73726d3a1e853e
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Mon Sep 1 01:23:39 2025 -0500

        fix format error

    commit a9061aaa1b4bfaa9db102c75b9d74863f39708a9
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Sat Aug 30 03:19:07 2025 -0500

        update codes

    commit 0caa184a271a8824ef40f87de456d0fa2500c8ad
    Author: mtgu0705 <mtgu@amd.com>
    Date:   Fri Aug 29 11:27:33 2025 -0500

        init ck_tile mxfp4 flatmm

    commit 5d46a6635f04bd69b76f7eda1438862e271b987a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 28 08:02:50 2025 +0000

        Add bias for f16xf4 moe_flatmm

    commit dd112dc302d17f541737671a3ac557d7c09ff969
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 27 13:39:47 2025 +0000

        update case construction

    commit b1aca68a073d82c7b3c7bb53286e5f415999edc1
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Aug 26 12:32:29 2025 +0000

        support swiglu activaion and use rcpf to accelerate silu

    commit 49235bd42349a84fc2ebd7ad0b100cc2545bb80a
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Tue Aug 26 02:33:55 2025 -0500

        first commit

    commit c169e39d6381b932cf7098cc118db29df91da1cb
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Fri Aug 22 04:01:59 2025 -0500

        add line to last

    commit 318f9bf317306454941bbf394c1940023edcf0ac
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Fri Aug 22 03:20:46 2025 -0500

        adjust A_LDS descriptor to avoid bankconflict

    commit 9d066120ed068d6d102da25d619e170a28a04d18
    Author: root <root@smci355-ccs-aus-m02-25.cs-aus.dcgpu>
    Date:   Thu Aug 21 09:46:52 2025 -0500

        enable hotloop

    commit 61a895e6b821798970afffd0e9432a21e2f04df8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 21 09:12:21 2025 +0000

        support atomic_pk_add_bf16 on gfx950

    commit 9f14864e45f21d8c1bc70a94988fb86c2c0017d8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 21 06:58:55 2025 +0000

        use int64_t as expert stride to avoid overflow

    commit e63af46b32e1139a1e59dee6f46b9971047c4026
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 13:53:32 2025 +0000

        use v4i32 as the storage type for B to avoid repack operation

    commit 6cf0224dd8a229bf2be726ca861c736c9b5f5415
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 06:40:03 2025 +0000

        add pk_fp4_t and e8m0_t support for amd_buffer_load_impl

    commit 67a591f2240b0b035029edad904627f98b3839fd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 20 04:39:14 2025 +0000

        optimize cvt_pkf4_to_f16 implementation

    commit 51c7126e77e9b17af694eaa57040e487f9d443e8
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Aug 19 14:56:46 2025 +0000

        optimize A_LDS descriptor to avoid bankconflict

    commit c113160f326353290a2878d7b8febf7daed91d71
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 18 18:43:37 2025 +0000

        fix gate-up when GU_NRepeat > 1

    commit a45ca0e9934ca4bb9114f65621d5c9582d937a45
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 18 17:28:11 2025 +0000

        add fp16xf4 moe

    commit dc8c8e484804f7bca10c8f0764540af3b5884e83
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Sun Aug 17 17:51:18 2025 +0000

        rename example

    commit b177c967141cfdc401d3f36bf17830fe99893600
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 15 06:20:46 2025 +0000

        remove additional check when e8m0->float

    commit d467f9688c3d35f391e15089135edb1ad1d38b05
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 14 09:34:12 2025 +0000

        eliminate repeat dequant

    commit 1b20674b26ab3ce6bd2f710dd729fd4cc0f79428
    Merge: faa3c0278 7d02625e7
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 16:51:49 2025 +0000

        Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm

    commit faa3c0278cf11b7105a4302dea3a4416520b2cc7
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 16:16:48 2025 +0000

        update f16xMXF4

    commit a2a2e1dab05501cc2136133236c01c08d51db4ea
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 13 10:48:53 2025 +0000

        update scale-preshuffle for MXF4

    commit eac9667feb899419dda1628164c092b969852660
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 11:24:34 2025 +0000

        update

    commit 7d02625e7678882af653f52c2a4ddaf64568a41c
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 08:38:23 2025 +0000

        optimize gemm2 atomic_add pattern

    commit d5f3c3e3ec72d0e6739467c4dc0b4e209f6d1192
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 07:59:47 2025 +0000

        update scale for mxfp4

    commit 15db198084614466bd4cfd4943fcb549cab2069a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 07:56:14 2025 +0000

        update case construction

    commit 5dff349d82a5f70b6eea821d2622df51f90ef200
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 06:03:06 2025 +0000

        update granularity control

    commit d32cdc52144f65ec473f4ec8e45ea23968811184
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 11 03:42:46 2025 +0000

        fix TileConfig

    commit 26f38c5716304ee5f84e5c4f6f88144d9f3dddaf
    Author: Gino Lu <gino.lu@amd.com>
    Date:   Thu Aug 7 21:37:28 2025 +0800

        Add e8m0 scaled convert into CK_TILE (#2617)

        * first commit

        * remove redundent code

        * modify according to comments.

        * fix type_convert error with scaled_type_convert

    commit 419041478745f65dfec18859e75a13d975089519
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 8 20:19:16 2025 +0000

        add mixed_prec fp16xfp4

    commit 92e2a8b0308b9b107df9d2fd63a961efce706402
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Thu Aug 7 09:22:04 2025 +0000

        debug mixed_prec flatmm

    commit dea3ce80496ebcb00512979f0c3bb897f25e11a5
    Merge: fde443bc3 b4f45fe14
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Aug 6 16:49:47 2025 +0800

        Merge pull request #2626 from ROCm/felix/flatmm_fix_splitk

        fix split k

    commit d480e8150358cc4ef8b05e25afe299141fad4fde
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Aug 6 08:33:33 2025 +0000

        add moe_flatmm

    commit b4f45fe14d11569f34de40c8a205cd6760b61357
    Author: coderfeli <coderfeli@163.com>
    Date:   Wed Aug 6 02:45:31 2025 +0000

        fix split k

    commit fde443bc38fe60e52195817ecb2c7b20d772eedb
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Aug 4 07:16:36 2025 +0000

        fix flatmm with scaling when WarpTileM == 32

    commit 5a0667afa889a5af8c6b8509232eabd50cf5efef
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 1 11:01:23 2025 +0000

        optimize scaling epilogue

    commit 5c3502bbf71833c6f6f7d4a1cc4f4fd93811f522
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Aug 1 07:28:38 2025 +0000

        fix wrong config for fp8 scaling

    commit eb2d0653cdb86603cb11539cbac466b6431b58b7
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 30 06:20:30 2025 +0000

        prune debug message

    commit 0c089cb56343a39e02a1ee38e9cabeb71ba35e92
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 30 04:52:08 2025 +0000

        fix compile error

    commit 61759ca30ce3787f70e228c3919b3e4d354016dd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Tue Jul 29 15:42:58 2025 +0000

        Add persistent option on flatmm for tuning

    commit b36dc5dd55f15fc1ce8eb21637bdec862e56a883
    Author: AMD-dteng <dteng@amd.com>
    Date:   Tue Jul 29 22:48:00 2025 +0800

        update pipeline v1: add atomic IGLP schedule

    commit f886f26994454fc2b4fc3433c86bf699767a2a7c
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 09:09:27 2025 +0000

        fix error log throwing

    commit 4b4686ab144daa9061fbda17f3df4c17600c8e9a
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Mon Jul 28 08:24:51 2025 +0000

        crz idea

    commit 7099af44a81be41431ba70ae60827b60116d02d2
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Sun Jul 27 11:57:38 2025 +0000

        Add permuteN optimzization when NRepeat % 2 == 0 on flatmm

    commit b147524c92e69a267337c8e48b6e64bcb1483551
    Author: sjfeng <j514681085@icloud.com>
    Date:   Sun Jul 27 17:24:08 2025 +0800

        try to remove c_shuffle_lds

    commit 2dd94f59d1a7740a5689e1713ed45588cd0d55dd
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Fri Jul 25 07:41:48 2025 +0000

        fix loop-dim mismatch and improve c_shuffle alu parallelism

    commit 4e93f0c5e27806adc070e4caa81661069295751c
    Merge: 3f12ef5aa 0eb7455f1
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 08:46:51 2025 +0000

        merge flatmm -scale

    commit 3f12ef5aa52ced1bff3bfb57b878358330e9e095
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 24 16:19:58 2025 +0800

        revert delete of inc file

    commit 08c3a0d184d7581dc5be364f5b36f16fb4a8d6fa
    Author: solin <bingzhou@amd.com>
    Date:   Thu Jul 24 04:38:16 2025 +0000

        reorg  flatmm code

    commit 0eb7455f106604d5254ed16b0daeda68e2a148e3
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 23 19:12:31 2025 +0000

        fix flatmm syntax error on gfx950

    commit 695ff87e68fdcbe28452c1805cd4dbb643c45495
    Author: Feng Shijie <Shijie.Feng@amd.com>
    Date:   Wed Jul 23 19:04:22 2025 +0000

        support flatmm scaling

    commit e3c29d9dea8758db96b998982ccc8bd1c4e8298d
    Author: valarLip <340077269@qq.com>
    Date:   Wed Jul 23 08:44:12 2025 +0000

        merge flatmm pipe v0 from dteng_flatmm_opt

    commit 425c366fa4c30426ff36cade89b39fd8cb7b9732
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 15:38:12 2025 +0800

        build pass

    commit 6b377a9481535696de40f175d7e2159263d21bdc
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 07:20:26 2025 +0000

         fix bug

    commit b6dc58d1ea676fe480c0243ae098c875498f6d6a
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 23 15:01:53 2025 +0800

        sync

    commit 904359f401866ee810484e6b8f5b46d79d9e25c8
    Author: valarLip <340077269@qq.com>
    Date:   Tue Jul 22 08:09:35 2025 +0000

        adaptive scheduler instead of Macro definition

    commit f29916c17228c17de9923aab62e7d72d7a30f4e9
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Thu Jul 17 08:40:35 2025 +0000

        fix tail handler bug

    commit e2c60a90929fec955d91db909d50db538d58363b
    Author: lalala-sh <Jiaxing.Wen@amd.com>
    Date:   Wed Jul 16 10:12:19 2025 +0000

        merge from dteng_flatmm_opt

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: AMD-dteng <dteng@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: sjfeng <j514681085@icloud.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: Gino Lu <gino.lu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>

* Fix crash on small M

* Apply suggestion from @Copilot

---------

Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: AMD-dteng <dteng@amd.com>
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: sjfeng <j514681085@icloud.com>
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: Gino Lu <gino.lu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: e135dd518d]
2025-10-31 11:29:05 +08:00