Commit Graph

1234 Commits

Author SHA1 Message Date
Emily Martins
02ab76c2cb Fix CK Tile DP + 2 Tile Stream-K Validation Errors (#3269)
When multiple workgroups contribute to a tile using atomics, round-off
error can occur when the accumulator type is not the same as the C type.
To compute an error tolerance for test validation, the Stream-K Tile
Partitioner has a function called estimate_num_wgs_per_tile that
estimates the number of workgroups per tile. However, this function only
provides an estimate: in some DP+2TSK cases it returns 1 rather than the
more accurate value of 2.

Thus, this change updates the estimate_num_wgs_per_tile function to
explicitly return 2 in the DP+2TSK case, giving a better error tolerance
and avoiding test failures due to round-off error.
2025-11-21 20:29:47 -07:00
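A minimal sketch of how a validation tolerance can scale with the estimated number of contributing workgroups, as described in the commit above (hypothetical helper and names, not the actual CK Tile API):

#include <algorithm>
#include <cstdint>

// Hypothetical helper mirroring the idea above: each extra workgroup that
// accumulates into a tile via atomics can add one rounding step, so the
// allowed error grows with the estimated number of workgroups per tile.
inline double tile_validation_tolerance(std::uint32_t estimated_wgs_per_tile,
                                        double per_accumulation_eps)
{
    // For DP + 2-tile Stream-K the estimate is pinned to at least 2,
    // matching the behavior the commit message describes.
    const std::uint32_t wgs = std::max<std::uint32_t>(estimated_wgs_per_tile, 2u);
    return static_cast<double>(wgs) * per_accumulation_eps;
}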
lalala-sh
f58bd56e6b fix static assert (#3178)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-11-20 17:27:05 -08:00
Yi DING
8b284a63a4 [CK_TILE] Refine FP32 => FP16/BF16 Conversion (#3215)
* [CK_TILE] Refine FP32 => FP16/BF16 Conversion

* Thank you Copilot

* Rename fix

* Fix example

* Fix accu checking

* Fix

* Fix
2025-11-20 10:50:26 -08:00
Gavin Zhao
07314ac543 Add support for RDNA1 GPUs (#3220)
* Allow compilation for RDNA1 (__gfx101__)

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* More RDNA1 changes

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* Even more RDNA1 changes

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* cmake: skip build quantization for unsupported arches

* add gfx10-1-generic support as well

* add gfx1013 and complete gfx10-1-generic

* fix clang format

* enable DL kernels on gfx101x

---------

Signed-off-by: Gavin Zhao <git@gzgz.dev>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-11-20 10:45:57 -08:00
Emily Martins
2e4b8a8fc4 [CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202)
* Remove old CK Tile Stream-K implementation

The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.

Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
  in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.

* Remove v2 from tile partitioner validation function names
2025-11-20 09:32:32 -07:00
asleepzzz
5adaa201ed Revert "Add attn sink (#2892)" (#3250)
This reverts commit 9fa4e8d5ab.
2025-11-20 07:55:15 -08:00
Linjun-AMD
9fa4e8d5ab Add attn sink (#2892)
* enable attn sink

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update attn_sink script

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix some error

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* clang-format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update fmha_bwd_kernel's mask

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* update block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* fix ci error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix format error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_bwd_pipeline_default_policy.hpp

* Update fmha_fwd_runner.hpp

* Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* Update fmha_fwd_runner.hpp

* update splitkv_pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update splitkv&pagedkv pipeline

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* add sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update attn_sink result log

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update smoke_test_fwd_sink.sh

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test file

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update test script

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp

* use constexpr kHasSink for sink in fmha pipeline

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* update by pre-commit

Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fmha_fwd.py

* Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove causal mask setting logic from mask.hpp

Removed the mask setting logic for causal masks.

* fix CI error where some lambda usage is not supported in C++17

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update remod.py

* add smoke sink test

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update fmha_pagedkv_prefill.py

* Update FmhaFwdPipeline parameters in fmha_fwd.py

* update block_fmha_pipeline_qr_ks_vs_async_trload.hpp

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* fix C++17 unsupported feature error

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

* Fix formatting of sink_seq_end assignment

* Fix indentation for sink_seq_end assignment

* Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-20 19:24:05 +08:00
Yi DING
47e2ed838e [CK_TILE] Add Flatmm MX FP8 (#3208)
* Use async for flatmm mxfp4

* Fix preshuffle

* Add flatmm mxfp8

* Thanks, Copilot

* Thanks Copilot again~
2025-11-20 10:35:15 +08:00
linqunAMD
d2e32b4305 [ck_tile] enable test grouped_gemm_quant and gemm_streamk on gfx12 (#3196)
1. Enable grouped_gemm_quant and gemm_streamk on gfx12
- test_ck_tile_streamk_smoke is kept on gfx9, since it looks like someone is still working on it.
2. Update warp tile size in grouped_gemm_quant and gemm_streamk unit test
3. Reduce gemm tile size to pass the build on gfx12 in test_gemm_streamk_reboot_types.hpp
2025-11-20 08:40:27 +08:00
Michal Kulikowski
cd8af997e6 [CK] s_prefetch unit test fixes.
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-11-19 21:54:50 +01:00
Michal Kulikowski
f3ef7acca0 [CK] Added s_prefetch unit test.
-added s_buffer_load_b32/64 assembly
-added amd_s_buffer_load_impl

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-11-19 21:54:50 +01:00
John Shumway
ad57f6ef0b [CK_BUILDER] Put global CK functions in the CK namespace (#3232)
* Wrap ck host utilities in CK namespace.

The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.

Move all functions in include/ck_/host_utility that were in the global namespace into the ck namespace.

There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our two code bases (CK and CK-Tile) so that they can both coexist. Moreover, establishing this compatibility is essential if we are going to allow the builder to instantiate kernels from either template library.

* Add using declarations to test code.

After moving some of the utils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compilation units for functions that were previously in the global namespace.

* Add using declarations to client examples.
2025-11-19 11:23:02 +01:00
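As a minimal illustration of the pattern described in this commit (a simplified sketch, not the exact CK source), a previously-global helper is wrapped in the ck namespace and a using declaration keeps older call sites compiling:

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

namespace ck {
// Previously a global function; now guarded by the ck namespace so it cannot
// clash with a CK-Tile symbol of the same name.
inline void hip_check_error(hipError_t err)
{
    if(err != hipSuccess)
    {
        std::fprintf(stderr, "HIP error: %s\n", hipGetErrorString(err));
        std::abort();
    }
}
} // namespace ck

// In an example or test translation unit that used the old global name:
using ck::hip_check_error;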
Anton Gorenko
d7b3197869 [CK_TILE] FMHA Reduce register spilling in fwd with dropout (workaround for CI failures with clang-22) (#3221)
* Use vectorized stores for dropout randvals

With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of
16 buffer_store_byte. This requires fewer registers and reduces spilling.

* Calculate dropout randvals for storing and applying only once

Even though it may add a small overhead when storing is not required,
it uses significantly fewer registers and hence avoids spilling.
2025-11-19 10:40:12 +05:00
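A hedged illustration of the vectorization idea in plain C++ (not the actual kernel code or GCN intrinsics): packing one-byte random values into a wider word lets a single store write 8 bytes at once, analogous to one buffer_store_dwordx2 replacing eight byte stores.

#include <cstdint>
#include <cstring>

// Pack 8 one-byte dropout random values into a single 64-bit value so they
// can be written with one 8-byte store instead of eight 1-byte stores.
inline void store_randvals_packed(void* dst, const std::uint8_t (&randvals)[8])
{
    std::uint64_t packed;
    std::memcpy(&packed, randvals, sizeof(packed));
    std::memcpy(dst, &packed, sizeof(packed));
}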
Aviral Goel
f5ac3ee359 chore(copyright): update copyright header for include directory (#3224)
* chore(copyright): update copyright header for tile_engine directory

* chore(copyright): update copyright header for script directory

* chore(copyright): update copyright header for test_data directory

* chore(copyright): update copyright header for python directory

* chore(copyright): update copyright header for profiler directory

* chore(copyright): update copyright header for library directory

* chore(copyright): update copyright header for include directory
2025-11-18 10:17:18 -08:00
Max Podkorytov
a3a4eb12bd [CK-Tile] Remove usage of tile partitioner's full gemm shape (#3204)
The GEMM shape should be taken from the pipeline instead (which gets it from a problem description struct).
2025-11-18 09:56:40 -08:00
Sami Remes
3ede8e2a6e [CK_TILE] Non-K Major from old CK to CK-Tile - fix reverted PR (#3199)
* Reapply "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017)

This reverts commit e4298e55c7.

* WIP

* take Y2 as the AK1/BK1 value, that is the 'vector size' after shuffle

* use get_n_lds_banks()

* clang-format

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-11-18 10:17:02 +02:00
Yi DING
b6720531de [CK_TILE] MX Flatmm Split kernel instances (#3207)
* [CK_TILE] MX Flatmm Split kernel instances

* Fix flatmm example compile
2025-11-18 13:46:30 +08:00
Illia Silin
b38bb492a1 Disable DL kernels on all architectures except gfx103x. (#3218)
* disable dl kernels on all archs except gfx103

* add gfx10-3-generic target to cmake
2025-11-14 17:39:50 -08:00
jefyang1
d30babbd00 Add new gemm multiply multiply instances on gfx950 (#3213) 2025-11-14 08:20:41 -08:00
BingYuan.Zhou
4d629cd2b0 fix build error (#3195)
Co-authored-by: root <root@hjbog-srdc-39.amd.com>
2025-11-14 09:46:13 +08:00
Yi DING
4a8b17d1a4 [CK_TILE] Improve device printing (#3198)
* [CK_TILE] Improve device printing

* fix host gtest build

* clean
2025-11-14 09:46:06 +08:00
yinglu
2a73eb3bc0 Simulate TF32 with BF16x3 (#3142)
* tf32:bf16x3: use bf16x3 to emulate tf32 gemm

* change blockwiseGemm to demo bf16x3

* temp push

* self review

* self review

* fix multi-device compile error

* bug fix

* code refactor

* limit to gfx950

* enhance gemm gfx942 threshold

* lower change from blockwise to warpwise

* refactor code

* refactor code

* error fix

* change threshold

* bug fix

* fix threshold error

* change host reference implementation to match the device

* bug fix

* bug fix

* code refactor

* fix clang-format fail

* code refine
2025-11-13 16:21:09 -08:00
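A minimal host-side sketch of the BF16x3 decomposition behind the commit above (illustrative only, not the kernel code): each FP32 operand is split into three BF16 pieces whose sum approximates the original value, and the product is rebuilt from the leading partial products.

#include <cstdint>
#include <cstring>

// Truncate a float to bfloat16 precision (keep the top 16 bits), returned as float.
static float to_bf16(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u; // drop the low 16 mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

// Emulate a higher-precision product a*b using three bf16 pieces per operand:
// a ~= a_hi + a_mid + a_lo (likewise for b). Summing the leading partial
// products recovers roughly TF32-level accuracy from bf16 multiplies.
static float bf16x3_mul(float a, float b)
{
    const float a_hi  = to_bf16(a);
    const float a_mid = to_bf16(a - a_hi);
    const float a_lo  = to_bf16(a - a_hi - a_mid);
    const float b_hi  = to_bf16(b);
    const float b_mid = to_bf16(b - b_hi);
    const float b_lo  = to_bf16(b - b_hi - b_mid);

    return a_hi * b_hi + (a_hi * b_mid + a_mid * b_hi) +
           (a_hi * b_lo + a_mid * b_mid + a_lo * b_hi);
}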
SamiAario-AMD
f2cfc6b94e Remove "basic" and universal GEMM tests, and incorporate their test cases into the GEMM pipeline tests (#3094)
* Add missing copyright statements

* Use ck_tile::host_tensor_descriptor instead of a custom lambda

* Refactor use of check_data_type in test classes

* Use TEST_SUITE_NAME with TYPED_TEST_SUITE

* Remove an unused namespace

* Make dim3 const

* Add BF8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add F8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add BF16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add BF16 x BF16 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add BF8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add F8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Add F16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp

* Skip failing tests of F16 x I4 for CompV3 with K == 2 * K_Tile

* Add missing precision type combinations to CompV4 from CompV3

* Move the INT8 tests around for consistency with KernelTypesCompV3Wmma

* Add missing precision type combinations to CompV3Wmma from CompV3

* Remove the basic and universal tests and their dependencies

* On __gfx950__, avoid using transposed loading of A with datatype pk_int4_t of B

* Use ADataType and BDataType instead of ComputeDataType for WarpGemm

* Explicitly set some return types to void

* Use more general typenames in InterleavedPKTypeLoader

* Add load_interleaved_pk_type.hpp to common.hpp

* Use std::is_same_v in load_int4_tile

* Add handling of LoadTranspose to load_int4_tile

* Factor out common code in several places using load_int4_tile

* Add support for pk_int4_t using load_int4_tile

* Fix formatting
2025-11-13 11:01:27 -08:00
Yi DING
8d50001b93 [CK_TILE] Improve F8F6F4 Scaled WarpGemm (#3197)
* [CK_TILE] Improve F8F6F4 Scaled WarpGemm

* Thanks, Copilot
2025-11-13 20:22:05 +08:00
Enrico Degregori
7414a0f4d4 Wmma support for gemm_reduce (#3145)
* Initial implementation GEMM+Reduce:

 - device struct
 - epilogue struct

* Fix tests, improve profiler and add initial instances

* Add instances

* Fix compilation error

* Address review comments

* Fix logging

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-11-12 11:23:54 -08:00
Po Yen Chen
40d2ed0f2a [CK_TILE] Share partition index across threads and specify offset in load_tile()/async_load_tile()/load_tile_transpose() (#2905)
* Allow sharing partition index across threads

* Fix typo PartitoinIndex -> PartitionIndex

* Remove C++20 'requires' usages

* Add missing template arguments

* Fix load_tile() overload ambiguity issue

* Use SFINAE to exclude invalid arguments

* Add additional offset parameter to the async_load_tile()

* Remove async_load_tile() default argument to avoid ambiguity

* Extract tile_window coordinate compute logic as method

* Use warp-shared LDS base address in tile_window::async_load()

* Add constraint to tile_window::load() templates

* Fix wrong type traits is_class_v<> usages

* Add missing constraint to async_load_tile()

* Add missing tile_window::load() overload

* Add more constraint to avoid load_tile() call ambiguity

* Rename PartitionIndex as ReplacementPartitionIndex

* Update pre_computed_warp_coords_ in move_extended()

* Fix inconsistency between template parameters and documentation

* Allow specifying pre-computed partition index

* Add type traits is_sequence<> & is_tile_distribution<>

* Add type traits is_tensor_view<>

* Add type constraints to make_tile_window() templates

* Allow passing partition_index to set_tile_if()

* Allow specifying partition_index to store_tile()

* Add missing template parameter of replace_bottom_tensor_view()

* Allow passing partition_index to Default2DEpilogue

* Make get_partition_index() public

* Add _with_offset() postfix to avoid resolution error

* Remove ReplacementPartitionIndex template param

* Add missing comments

* Add load_tile_transpose_with_offset() overload
2025-11-12 10:26:14 +08:00
Bartłomiej Kocot
92c1f4981a [CK_BUILDER] Add grouped conv fwd ck tile traits (#3183)
* [CK BUILDER] Add grouped conv fwd ck tile traits

* Update instance_traits_tile_grouped_convolution_forward.hpp

* Update grouped_convolution_forward_kernel.hpp
2025-11-11 13:55:33 -08:00
linqunAMD
1b1c46e508 [CK_TILE] Fix gemm_quant (#3186) 2025-11-11 08:23:57 -08:00
Khushbu Agarwal
06c651b100 formatting (#3182) 2025-11-11 07:42:26 -08:00
Enrico Degregori
1c544abf57 Extend support for ak1 / bk1 WMMA (#3073)
* Extend AK1 / BK1 support:

 - Add support for AK1 != BK1
 - Add support for AK1, BK1 > 8
 - Introduce KInner template parameter for pipelines when loading multiple tiles with one instruction

* fix clang format
2025-11-11 07:38:15 -08:00
Gino Lu
e31a7a4f29 fix MX bpreshuffle gemm B grid descriptor dimension error. (#3170) 2025-11-06 19:42:39 -08:00
Xudong Yuan
d04eba4ae3 Ck moe mxfp4 blockm32 (#3098)
* block_m = 32

* ck block_m = 32

* aiter/3rdparty/composable_kernel/include/ck/tensor_operation/gpu/block/blockwise_gemm_pipeline_xdlops_b_preshuffle_mx_moe_v3.hpp format

* mxfp4_moe v1 pipe

* update format

---------

Co-authored-by: zhimding <zhimding@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: felix <felix.li@amd.com>
2025-11-07 08:45:41 +08:00
Bartłomiej Kocot
2234ff830b [CK TILE] Convolution remove magic values (#3160)
* [CK TILE] Refactor Conv configs and Conv Elementwise

* fix

* [CK TILE] Convolution remove magic values

* fix partitioner
2025-11-06 11:26:30 +01:00
joyeamd
12922120d2 add gfx11's barrier following SPG's reference (#3159)
* add gfx11's barrier following SPG's reference

* re-format the code

* minor fix

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-11-05 22:29:03 -08:00
Illia Silin
4533aa6dba Fix compilation errors with clang22. (#3164)
* resolve compilation issue with clang22

* add __extension__ for __COUNTER__ usage in ck_tile
2025-11-05 15:42:22 -08:00
Adam Osewski
b8527a9236 [CK_BUILDER] Convolution traits. (#3152)
Added:

1. Convolution traits & unit tests
2. Update builder enumerators to have representation of Convolution Kernels properties.
3. Unified builder pipeline version & scheduler enumerators
2025-11-05 08:53:06 -08:00
Illia Silin
930423ab3b Initialize new variable to prevent c++17 compiler error (#3156)
* initialize new variable to prevent c++17 compiler error

* build for gfx90a using -std=c++17 flag
2025-11-04 18:54:14 -08:00
Cong Ma
5abe4109e0 Introduces the new partitioner to implement the reduction StreamK kernel. (#3107)
* Introduces the new partitioner to implement the reduction StreamK kernel

* Add more doc text to functions

* Add persistent-dp option to streamk example

* Update example/ck_tile/40_streamk_gemm/README.md
2025-11-04 10:32:17 -07:00
John Shumway
6dbee64886 [CK_BUILDER] Add backward weight instance traits for xdl cshuffle. (#3143)
* Add backward weight instance traits for xdl cshuffle.

To keep instance test file sizes reasonable, we start a new test_bwd_weight_instances_traits.cpp test file.

* Fix copyright notices.

* Remove (c) symbol, replace with (C).

Having UTF-8 in source caused an error with code generation.
2025-11-04 15:34:00 +01:00
Bartłomiej Kocot
8681ced962 [CK TILE] Refactor Conv configs and Conv Elementwise (#3151)
* [CK TILE] Refactor Conv configs and Conv Elementwise

* fix
2025-11-04 15:04:53 +01:00
Bartłomiej Kocot
99f38e4d9b [CK TILE] Refactor grouped conv fwd large tensor (#3144) 2025-11-04 00:34:48 +01:00
Enrico Degregori
507d81c3af Fix splitk preshuffle (#3137)
* Fix splitK multiply_multiply_wp

* Add tests for gemm_multiply_multiply_wp

* Add tests for gemm_universal_preshuffle (KBatch = 1)

* Add tests gemm_blockscale_wp

* Fix splitk gemm universal preshuffle

* Run new tests on arch supporting fp8

* Restore example

* Fix strides profiler

* Fix tests

* Fix clang format

* Finalize profiler preshuffle with tolerances

* Minor improvements to splitk related changes

* Address review comments: clang format and ckProfiler typo

* Remove b_k_split_offset from SplitKBatchOffset struct
2025-11-03 11:59:01 -08:00
Thomas Ning
057b7d43b4 fix the compv4 and async pipeline when tile handler is 1 (#3141) 2025-11-03 09:37:35 -08:00
Emily Martins
2ec57a8e70 Replace CK_TILE_PIPELINE macros with a common enum
This change replaces pipeline macros like CK_TILE_PIPELINE_COMPUTE_V3,
CK_TILE_PIPELINE_MEMORY, etc in the CK Tile examples with a common enum
called GemmPipeline to reduce code duplication.
2025-11-03 09:35:05 -07:00
Michael Mcminn
afe1ff618d Ud fix moe sorting gfx908 (#2720)
* Adding a ds permute fallback for the gfx908 and older for row_newbcast:7 instruction

* Better macro for selecting ROW_NEWBCAST

* clang-format the update

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-11-03 07:31:31 -08:00
Sami Remes
16e85cf179 [CK_TILE] B matrix 2D block scale gemm (#3074)
* Refactor quant group size to be configurable for M/N/K, not just K

* add some asserts for configurations not implemented

* start setting of group size for N dimension

* enable 2d for reference quant gemm

* WIP: trying to figure out tile dstr and/or indexing for scale matrix

* WIP

* Fix handling of n dim blocks in tile windows etc

* remove commented code and enable all tests again

* fix formatting

* Add more specialized tile distributions

* Enable NWarps replication for bquant tile dstr

* fix formatting

* fix format

* Fix some issues from the merge

* fix formatting

* one more fix to tile dstr, and revert debug initialization

* Remove commented code

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* simplify conditions that are needed for tile distributions

* only enable the working group sizes in tests

* fix formatting

* Update tile distribution for 2D bquant

* add some documentation and 2d block scale example

* fix formatting

* Add in Changlog and restructure the quant 2d example

* fix CMake

* support the change for blockscale 2d

* fix the test file

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-11-02 16:49:20 -08:00
Aviral Goel
73f637894d refactor: remove gemm preshuffle pipeline v1 by removing all references from codebase (#3132)
* test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf

* refactor: deprecate gemm preshuffle pipeline v1 by removing all references from codebase

* Revert "test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf"

This reverts commit 573c08a085.
2025-11-02 00:06:28 -04:00
Aviral Goel
45be741586 fix: fix bug in print tile window when printing bf8/fp8 tiles (#3120)
* fix: fix bug in print tile window when printing bf8/fp8 tiles

* test(print_tile_window_range): add unit tests to maintain function integrity

* fix: fp8 numerical mismatch error on gfx950 by adding DCK_TILE_USE_OCP_FP8
2025-11-01 15:28:07 -04:00
Bartłomiej Kocot
ab1a8356b6 Add 2GB limitation for grouped conv bwd weight (#3054) 2025-11-01 14:16:45 +01:00
JH-Leon-KIM-AMD
1fbb47ad30 [CK TILE] Grouped conv fwd split image (#2970)
* Refactor split-image implementation: simplify code and remove redundant variables

* Add padding debug output to split-image implementation

- Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp
- Verified padding works correctly with all tests passing

* Fix sign comparison warning after rebase with origin/develop

- Cast blockIdX from unsigned to signed index_t for comparisons
- Integrated with new GetOutputTileIndex logic from upstream
- Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane

* Fix Split-N with groups bug and clean up unused parameters

- Fixed batch stride calculation to include G dimension for grouped convolutions
- When moving between batches in NHWGC/NWGC/NDHWGC layouts, need to account for all groups
- Removed unused multi-split parameters (we only support 2-way split)
- All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N

* Implement recursive queue-based split-image detection and calculation

- Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp
- Implement recursive binary splitting algorithm (10GB→5GB+5GB→...)
- Correctly handle odd dimensions (61→30+31)
- Calculate proper offsets for each split piece
- Update invoker to use split-image helper

Note: Split detection and calculation work correctly but kernel launching
for individual pieces requires kernel modification to handle different
spatial dimensions (unlike Split-N which uses blockIdx.z).

* WIP: Split-Image investigation - found architecture mismatch

- Split-N modifies N_ directly in transformer constructor
- Split-Image needs different approach due to varying dimensions
- Added split calculation logic for 1D and 2D convolutions
- Still facing memory issues when creating piece transformers

Key finding: Split-N uses blockIdx.z for parallel execution,
while Split-Image needs sequential execution of non-uniform pieces.

* Add 1D split-image implementation for grouped convolution (N=1 working)

Implements split-image for 1D convolution to handle large tensors that
exceed memory thresholds. This is a critical milestone with N=1 fully
working and tested.

Key Changes:
- Invoker: Add split-image logic that splits W dimension in half
- Transformer: Add SplitConvProblem helper for recursive splitting
- Calculate offsets for LEFT and RIGHT pieces
- Launch two kernels sequentially (LEFT then RIGHT)

Implementation Details:
- Binary split: divides W dimension by 2
- LEFT piece: W=0 to W/2, keeps left padding, removes right padding
- RIGHT piece: W/2 to W, removes left padding, keeps right padding
- Offset calculation accounts for stride, dilation, and padding
- Physical memory offset (no padding in memory)

Test Results (N=1):
 94/94 tests passing
- Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups)
- Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries)
- Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops)

Known Limitations:
- Only works with N=1 (single batch)
- N>1 fails when split-image triggers (offset calculation issue with Split-N)
- Root cause: Split-N modifies N in transformer, but offset calculated in invoker
- Solution planned: Move offset calculation to transformer (next phase)

Files Modified:
- grouped_convolution_forward_invoker.hpp: Add split-image logic
- transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper

This commit represents a stable, tested 1D split-image implementation
for N=1 cases. It's an important milestone before extending to N>1
and multi-dimensional splits.

* Add basic split-image implementation for 1D/2D/3D grouped convolution

This is a working baseline implementation that splits large spatial
dimensions to handle memory constraints.

Implementation:
- 1D: W-split for NWGC layout (36/36 tests passing)
- 2D: H-split for NHWGC layout (20/20 tests passing)
- 3D: D-split for NDHWGC layout (verified working)

Features:
- Binary split of outermost spatial dimension
- Sequential LEFT/RIGHT kernel launches
- Proper padding adjustment at split boundaries
- Offset calculation for pointer arithmetic
- Debug output for verification

Threshold: 100KB (configurable in transformer)

Known limitations:
- No safety checks for edge cases (to be added)
- Offset calculated before Split-N (incompatible with N>1, to be fixed)
- No recursive splitting for very large tensors

Next steps:
- Add safety checks (is_possible_to_split_*)
- Move offset calculation to transformer (after Split-N)
- Test with N>1 + split-image combination

* Refactor split-image to unified structure for 1D/2D/3D

Unified the three separate dimension-specific blocks into a single
common implementation with dimension-specific stride calculations.

Benefits:
- Reduced code from 636 → 348 lines (45% reduction)
- Eliminated code duplication
- Easier to maintain and extend
- Single source of truth for split logic

Implementation:
- Common: Binary split, offset calc, padding adjustment, kernel launch
- Dimension-specific: Stride calculation only
  - 1D: stride = G * C
  - 2D: stride = W_in * G * C
  - 3D: stride = H_in * W_in * G * C

Test results (all passing):
- 1D: 36/36 tests 
- 2D: 20/20 tests 
- 3D: 28/28 tests 
- Total: 84/84 (100%)

All test scenarios verified:
- Varying channels, padding, stride, dilation
- Filter sizes (1x1 pointwise to 7x7)
- Multiple groups (G=1,2,4)
- Odd dimensions
- Complex combinations

* Add safety checks for split-image in all dimensions

Added is_possible_to_split safety checks to prevent crashes when
splitting is not feasible.

Safety checks verify:
1. Output dimension > 1 (can't split single element)
2. RIGHT piece starts after left padding
3. LEFT piece ends within input bounds

If checks fail, falls back to normal kernel launch.

Verified for all dimensions:
- 1D (W-split): Wo=1 case triggers fallback
- 2D (H-split): Ho=1 case triggers fallback
- 3D (D-split): Do=1 case triggers fallback

Original 84 tests still pass - they use normal configurations
that naturally satisfy safety conditions.

Safety checks protect against pathological edge cases with:
- Very small spatial dimensions
- Extreme stride/dilation combinations
- Invalid padding configurations

* Fix Split-N + Split-Image compatibility issue

Fixed critical bug where Split-N and Split-Image working together
caused ~50% incorrect results due to wrong batch stride calculation.

Problem:
- Batch stride was calculated using MODIFIED spatial dimensions
  (e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000)
- Spatial offset was applied globally in invoker, not per-batch in kernel
- Each batch (blockIdx.z) got wrong memory offset

Solution:
1. Store spatial offset in kargs (don't apply to pointer in invoker)
2. Copy correct batch_stride from temp_kargs to left/right kargs
3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset

Changes:
- grouped_convolution_forward_kernel.hpp:
  * Added spatial_offset_in/out fields to KernelArgs
  * Apply batch + spatial offset in operator()

- grouped_convolution_forward_invoker.hpp:
  * Keep base pointer, store spatial offset in kargs
  * Copy batch_stride from temp_kargs (has original dimensions)

- transform_conv_fwd_to_gemm.hpp:
  * Add debug output for split-image calculation

Results:
- N=1 tests: 84/84 passing (100%)
- N>1 tests: Now all passing (previously ~50% errors)
- Tested: 1D, 2D, 3D with N=1,2,4,8,16,20

* Implement unified threshold for Split-N and Split-Image

This commit consolidates threshold management for both Split-N and
Split-Image operations into a single source of truth, eliminating
code duplication and fixing offset calculation issues.

Key Changes:
============

1. Transformer (transform_conv_fwd_to_gemm.hpp):
   - Moved TwoGB constant to public section for unified access
   - CalculateSplitImage() now takes no parameters
   - Uses internal threshold: TwoGB / sizeof(CDataType)
   - Calculates offsets using N_ (after Split-N) for correctness

2. Kernel (grouped_convolution_forward_kernel.hpp):
   - GetSplitImageInfo() simplified to take no parameters
   - Forwards to transformer's CalculateSplitImage()
   - Clean interface with unified threshold internally

3. Invoker (grouped_convolution_forward_invoker.hpp):
   - Removed redundant threshold calculation
   - Simplified to call kargs.GetSplitImageInfo() with no params
   - Clean early-return pattern (no unnecessary else blocks)
   - Removed duplicate/dead code paths

Benefits:
=========
- Single source of truth: TwoGB defined once in transformer
- No parameter passing for threshold between components
- Correct offset calculation using N_ (post-Split-N)
- Cleaner code with no duplication
- All tests passing: 1D/2D/3D with various N values

Testing:
========
- Split-Image only (N=1, large spatial): PASS
- Split-N only (N>1, small spatial): PASS
- Both splits active (N>1, large spatial): PASS
- No splits (N=1, small spatial): PASS
- CPU verification correct for all scenarios

* Comment out outdated split-image code (SplitConvProblem/LaunchKernelWithSplitIfNeeded)

The old recursive queue-based implementation has been replaced by the
new CalculateSplitImage() method which is simpler and correctly handles
Split-N + Split-Image interaction.

Changes:
- Wrapped lines 381-1078 in #if 0...#endif
- Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded()
- Preserved for reference but disabled from compilation
- No functional changes - all tests still pass

The new implementation (CalculateSplitImage at line ~2163) provides:
- Correct offset calculation using N_ (after Split-N)
- Simpler binary split logic
- Better integration with unified threshold approach

* Implement recursive split-image with depth limit (MAX_DEPTH=10)

Changes:
- Add depth tracking to SplitPiece struct
- Implement two stopping conditions:
  1. Piece size below threshold (optimal case)
  2. Depth >= MAX_DEPTH (prevents infinite recursion)
- Remove MAX_PIECES limit in favor of depth-based control
- Support up to 2^10 = 1024 pieces with depth 10

This allows handling extreme tensor sizes while ensuring termination.
Pieces larger than threshold will still launch correctly if depth limit reached.

Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification.

* Summary of recursive split-image implementation:
- Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces)
- Two stopping conditions: size below threshold OR max depth reached
- Cumulative offset tracking through all recursion levels
- LEFT piece inherits parent offset, RIGHT accumulates (parent + local)
- Per-batch spatial offset application in kernel operator()
- Batch stride uses original dimensions (before split)
- Works with Split-N: split-N first, then recursive split-image
- Handles odd dimensions, padding, stride, dilation correctly
- All 1D/2D/3D tests pass with CPU verification

* Add comment explaining MAX_DEPTH capacity for 2GB threshold

* Refactor: move recursive split-image logic to transformer

- Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp
- Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic)
- Encapsulate SplitPiece struct and BFS splitting algorithm in transformer
- Remove unused includes (queue, vector) from invoker
- Add documentation comment for AreDescriptorsSmallerThan2GB()
- Improve code organization and reusability
- No performance overhead (static template function, compiler inlines)
- All tests passing with 2GB production threshold

* Apply clang-format-18 formatting

- Format invoker and transformer files with clang-format-18
- Fix brace placement and alignment
- No functional changes

* Fix clang-format-18 issues in forward kernel

- Remove extra blank lines
- Fix line wrapping for template calls
- Consolidate GetSplitImageInfo() to single line

* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Split-Image implementation with temporary fixed divider

- Implemented spatial dimension splitting (Split-Image) for large tensors
- Added piece-based coordinate transformation for 1D/2D/3D convolutions
- Integrated Split-N (batch splitting) with automatic threshold detection
- Fixed M dimension calculation to include batch: M = N × spatial_size
- Added spatial offset support in kernel arguments
- Verified 20/20 test cases passing for Split-Image alone
- Known issue: Split-N + Split-Image combination needs coordinate fix

Implementation Details:
- Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values
- Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts
- Piece descriptors track spatial boundaries and block ranges
- No performance overhead for N=1 cases

* Fix 1D split-image padding issue with per-piece dimensions

- Store actual size per piece to handle non-uniform splits
- Remove dead code from transform utils

* Fix 2D/3D split-image with independent split factors per dimension

Problem: Single split factor caused non-uniform pieces when dimensions
didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed.

Solution: Independent split factor selection for W, H, D dimensions.
Each dimension gets optimal factor based on its own size.

Test Results:
- 1D: 42/42 pass (100%)
- 2D: 25/25 pass (100%)
- Total: 67/67 combinations verified

* Remove unused split-image struct fields

Cleanup of split-image implementation:
- Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct
- These fields were declared but never used in the kernel
- Per-piece dimensions are already stored in pieces[] array
- Reduces struct size and improves code clarity

Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass

* Refactor split-image invoker code for improved readability

- Extract piece calculation logic into calculate_piece lambda helper
- Extract kernel args population into populate_split_image_kargs lambda
- Use aggregate initialization for cleaner struct population
- Reduce nesting depth and improve maintainability
- Fix outdated comment about split-image implementation status

* Refactor split-image code and remove debug prints

- Extract GPU kernel helper lambdas for better readability
- Remove all split-image debug print statements
- Set memory threshold to 2GB for production
- All tests pass with CPU verification

* Add split-image safety constraints and refactor to utils

- Add MAX_TOTAL_PIECES=64 limit to prevent segfault
- Move calculate_spatial_piece to library utils
- Add layout validation (NWGC, NHWGC, NDHWGC only)
- Fix hierarchical splitting to respect piece limits
- Add proper documentation and formatting

* Change split-image from runtime to compile-time branching

Response to @bartekxk review comment:
Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)'

Changes:
- Add EnableSplitImage template parameter to kernel
- Change runtime if to compile-time if constexpr
- Update invoker to instantiate kernel variants with true/false

Benefits:
- Eliminates runtime branching in GPU kernel
- Dead code elimination (each variant is smaller)
- Better compiler optimization

Files modified: 2
Lines changed: 20 total (6 in kernel, 14 in invoker)
Tests: 27/27 passed (100%)
Performance: No regression

* Add split-image example as separate binary

- Create grouped_convolution_forward_split_image example
- Add grouped_convolution_forward_split_image_invoker.hpp
- Update CMakeLists.txt to build split_image binary

* Replace linear search with binary search in find_piece_id

- Change O(n) to O(log n) for finding piece ownership
- Matches reference implementation in large_tensor_cshuffle

* Simplify split-image code and fix integer overflow

- Extract lambda functions to static helper methods
- Pre-calculate constants in invoker
- Fix integer overflow in tensor size calculation for large tensors

* Trigger CI rerun - fix merge conflicts

* Fix merge conflict markers

* Fix clang-format: remove space before {}

* Fix clang-format: comment wrapping and Swish constructor

* Rename split_image to large_tensor for clarity

- Renamed grouped_convolution_forward_split_image.cpp -> grouped_convolution_forward_large_tensor.cpp
- Renamed grouped_convolution_forward_split_image_invoker.hpp -> grouped_convolution_forward_large_tensor_invoker.hpp
- Updated CMakeLists.txt target name: tile_example_grouped_conv_fwd_split_image -> tile_example_grouped_conv_fwd_large_tensor
- Updated comments to refer to 'large tensor' instead of 'split-image'

* Update comments and include in large_tensor example

- Updated header comments to use 'large tensor' terminology
- Fixed include path to use large_tensor_invoker.hpp

* Remove test code, restore 2GB threshold

* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix build errors after develop merge and complete rename to large_tensor

This commit addresses compilation errors from the develop merge and
completes the rename from split_image to large_tensor.

Changes:
1. Fix CDEElementWise typo in grouped_convolution_forward_invoker.hpp
2. Fix template parameter order in large_tensor_invoker.hpp
   - TransformConvFwdToGemm signature changed in develop
   - NumGroupsToMerge and SplitN parameters swapped positions
3. Fix missing template parameter in GroupedConvFwdHostArgs
4. Fix EpiloguePipeline scope in kernel (merge conflict)
5. Update binary name references in test scripts

* Restore 2GB threshold for split-image

Changed threshold from 100MB (testing) back to 2GB for production use.

* Fix const-correctness in ds_ptr cast

* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply clang-format-18

* update c++ 18 format

* Apply clang-format-18 to transform_conv_fwd_to_gemm.hpp

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-01 14:18:16 +02:00
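A hedged sketch of two pieces of the scheme described in the commit above, using hypothetical names: the per-block address follows base + batch * batch_stride + spatial_offset (with the batch stride computed from the original, pre-split dimensions), and a block finds its piece by binary search over piece boundaries.

#include <algorithm>
#include <cstddef>
#include <vector>

// Batch stride for an NHWGC layout, computed from the ORIGINAL spatial extent
// (before any split), as the commit message above requires.
inline std::size_t batch_stride_nhwgc(std::size_t H, std::size_t W, std::size_t G, std::size_t C)
{
    return H * W * G * C;
}

// Per-block base offset (in elements): batch offset first, then the spatial
// offset of the piece this block works on.
inline std::size_t piece_base_offset(std::size_t batch_idx,
                                     std::size_t batch_stride,
                                     std::size_t spatial_offset)
{
    return batch_idx * batch_stride + spatial_offset;
}

// O(log n) lookup of the piece owning a block, where block_starts holds the
// first block index of each piece in ascending order (block_starts[0] == 0).
inline std::size_t find_piece_id(std::size_t block_idx,
                                 const std::vector<std::size_t>& block_starts)
{
    auto it = std::upper_bound(block_starts.begin(), block_starts.end(), block_idx);
    return static_cast<std::size_t>(it - block_starts.begin()) - 1;
}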