mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-05 14:11:29 +00:00

rahjain-amd 4d041837ad Add json dump support to output details from CK/CKTile Examples. (#2551 )

* Adding RapidJson Library

* Adding Json Dumps in all CK_Tile Examples

Not verified yet

* Adding json to cktile Batched Transpose

* adding json dumps to layernorm2d_fwd

* Adding  json dump to flatmm_basic

* Adding json in 03_gemm

* Add json dump to 16_batched_gemm

* Add json dump to gemm_multi_d_fp16

* Add json dump to grouped_gemm

* fix fmha_bwd/fwd

* Fix clang-format errors

exclude include/rapidjson in jenkins as it's a third-party library

* Separating function and definition.

* Update Documentation of 03_gemm

* Refactoring as per code review

* Disable fp8 instances on unsupported targets (#2592)

* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt

* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt

* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

---------

Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>

* fix clang format

* remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt

* Fixing Readme and unifying jsondumps

* adding moe_smoothquant

* adding fused_moe

* Fixing Readme for batched_gemm

* Fixing Readme for grouped_gemm

* adding flatmm

* adding gemm_multi_d_fp16

* adding elementwise

* adding File name when json is dumped

* Fixing Reduce after merge

* adding batched_transpose

* Adding Warptile in Gemm

* Fixing Clang Format

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

2025-09-02 23:31:29 -07:00

script - [CK-Tile] Merge transpose examples (#2450) - 2025-07-26 21:51:54 -07:00

batched_transpose_api.cpp - Support Wave32 in CK_TILE - Part 1 (#2594) - 2025-08-18 10:08:31 -07:00

batched_transpose_example.cpp - Add json dump support to output details from CK/CKTile Examples. (#2551) - 2025-09-02 23:31:29 -07:00

batched_transpose_example.hpp - [CK-Tile] Merge transpose examples (#2450) - 2025-07-26 21:51:54 -07:00

CMakeLists.txt - Revert "Add ck tile examples to package (#1880)" (#2150) - 2025-04-30 10:20:16 -07:00

README.md - invoke script directly (#2687) - 2025-08-19 00:23:07 -07:00

README.md

Batched Transpose

This folder contains an example of batched transpose using the ck_tile tile-programming implementation. It currently supports transposing NCHW to NHWC or NHWC to NCHW. Starting from NCHW, you can therefore transpose to either NHWC or NWCH (two transposes). At present the kernel reads one data point at a time; a vectorized transpose is planned.
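To make the layout change concrete, here is a minimal CPU-side sketch of the NCHW to NHWC permutation and its inverse. The function names `nchw_to_nhwc` and `nhwc_to_nchw` are hypothetical and are not part of ck_tile; this is only a plain reference loop under the assumption of densely packed tensors, not the kernel used by this example. It shows why the operation is called a batched transpose: for each batch index, the data is a C x (H*W) matrix that gets transposed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not a ck_tile API): NCHW -> NHWC.
// Assumes src is densely packed with N*C*H*W elements.
template <typename T>
std::vector<T> nchw_to_nhwc(const std::vector<T>& src,
                            std::size_t N, std::size_t C,
                            std::size_t H, std::size_t W)
{
    assert(src.size() == N * C * H * W);
    std::vector<T> dst(src.size());
    const std::size_t HW = H * W;
    for(std::size_t n = 0; n < N; ++n)
        for(std::size_t c = 0; c < C; ++c)
            for(std::size_t hw = 0; hw < HW; ++hw)
                // Per batch n: element (c, hw) of a C x HW matrix
                // moves to element (hw, c) of an HW x C matrix.
                dst[(n * HW + hw) * C + c] = src[(n * C + c) * HW + hw];
    return dst;
}

// Hypothetical reference helper (not a ck_tile API): NHWC -> NCHW.
template <typename T>
std::vector<T> nhwc_to_nchw(const std::vector<T>& src,
                            std::size_t N, std::size_t C,
                            std::size_t H, std::size_t W)
{
    assert(src.size() == N * C * H * W);
    std::vector<T> dst(src.size());
    const std::size_t HW = H * W;
    for(std::size_t n = 0; n < N; ++n)
        for(std::size_t hw = 0; hw < HW; ++hw)
            for(std::size_t c = 0; c < C; ++c)
                dst[(n * C + c) * HW + hw] = src[(n * HW + hw) * C + c];
    return dst;
}
```

Applying `nhwc_to_nchw` to the result of `nchw_to_nhwc` with the same dimensions returns the original buffer, which is a convenient sanity check when validating a device result on the CPU.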

build

# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh  ../ <arch>
# Make the transpose executable
make tile_example_batched_transpose -j

This produces the executable build/bin/tile_example_batched_transpose.

example

args:
          -N    input batch size (default:2)
          -C    input channel size. (default:16)
          -H    input height size. (default:1)
          -W    input width size. (default:16)
          -v    whether to do CPU validation or not (default: 1)
  -layout_in    input tensor data layout - NCHW by default
 -layout_out    output tensor data layout - NHWC by default
       -seed    seed to be used, -1 means random every time (default:-1)
     -k_name    set to 1 to print the kernel name (default:0)
     -warmup    warmup iterations to run this kernel (default:50)
     -repeat    number of iterations to run this kernel (default:100)