[CK-Tile] Merge transpose examples (#2450)

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 05:01:25 +00:00

* unify pipeline signature with existing example

* iwyu

* move stuff around in load-tile-transpose

* cleanups in batched transpose pipeline

* comments

* use same inputs size

* cleaner printf

* print host args

* use 64 block sides in the 37_transpose example

* roll back grid dimension size adjustment for 37_transpose example

* transpose grid for 37_transpose to unify with 35_batched_transpose

* unify grid computation logic

* make policy methods device only (since they are used only on device from the pipeline)

* more host/device attribute cleanups

* copy over problem

* move over pipeline and policy

* add switch to batched transpose api

* make the lds problem more similar to original problem

* factor out logic into traits

* factor out conditional compilation into trait parameter

* propagate pipeline to args

* unhardcode pipeline dispatch parameter

* refactor vector size

* put warp tile out of dispatch

* rename template parameter for trait

* rewrite vector size in terms of problem

* mark policy-internal struct variable as device

* factor out input distribution and thread access pattern from policies

* reword vector size

* use datatype across batched transpose pipelines, problems and kernel

* remove transpose traits from lds pipeline

* add padding to the lds pipeline *interface*

* add comment

* remove ck_tile example #37

* update cmakelists

* add test for new pipeline

* update batched transpose test

* roll back load_tile_transpose changes

* remove comments

* pack dispatch parameters into a config

* padM can be enabled

* adjust lds vector size to enable padding along N

* update test

* clean up logic

* swap m/n input vector size

* adjust perf test script

* sweep over C/W in perf test

* count both read and written bytes into bandwidth (x2 the number)

* clang-format

* widen size range for perf test

* remove 64k x 64k case; it's too large for index

* remove thread tile from dispatch

* Solve merge conflict

* fix compile

* modify the transpose

* solve the test error and clang format

* Add v3 support for Grouped fwd conv+bias+clamp & ckProfiler (#2463)

* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* Run just once for verification conv fwd.

* ckProfiler conv fwd clamp

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Fixing 0ms and inf GB/s issue in img2col (#2565)

issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

solution :
======
The problem occurred because `config.time_kernel` is false by default.
When it is false, there is no need to calculate perf; the example now just prints a proper message:

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

* merge with develop

* solve clang format
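
The img2col fix described above (skip the perf numbers when `config.time_kernel` is false) and the earlier note about counting both read and written bytes into bandwidth can be illustrated with a minimal sketch. The helper name `perf_message`, its parameters, and the message wording are assumptions made for illustration; they are not taken from the composable_kernel sources.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper, not part of CK.
// Builds the status line for one kernel run. When the kernel was not timed,
// the elapsed time is 0 ms, and dividing by it is what produced the
// "0 ms, inf GB/s" output, so the perf numbers are skipped in that case.
std::string perf_message(bool pass,
                         bool time_kernel,
                         float ms,
                         std::size_t bytes_read,
                         std::size_t bytes_written)
{
    std::ostringstream os;
    os << (pass ? "pass" : "fail");

    if(!time_kernel || ms <= 0.0f)
    {
        os << ", no perf generated (time_kernel=" << (time_kernel ? 1 : 0) << ")";
        return os.str();
    }

    // Count traffic in both directions: bytes read plus bytes written.
    const double gigabytes = static_cast<double>(bytes_read + bytes_written) / 1.0e9;
    const double seconds   = static_cast<double>(ms) / 1.0e3;

    os << ", " << ms << " ms, " << (gigabytes / seconds) << " GB/s";
    return os.str();
}
```

For a transpose, `bytes_read` and `bytes_written` are equal, which is where the "x2" in the perf-test note comes from. Calling `perf_message(true, false, 0.0f, n, n)` yields a line without any timing figures instead of `inf GB/s`.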

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

Author: Max Podkorytov (committed by GitHub)
Date: 2025-07-26 21:51:54 -07:00
Parent: d2459878cf
Commit: 821cd26c13

24 changed files with 431 additions and 869 deletions

									
										27

example/ck_tile/37_transpose/README.md
									
												View File
											
				@@ -1,27 +0,0 @@

# Batched Transpose

This folder contains example for transpose load for architecture gfx950. This transpose load has some constraints in input tile distribution.

## build

```
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh  ../ <arch>
# Make the transpose executable
make tile_example_transpose -j
```

This will result in an executable `build/bin/tile_example_transpose`

## example

```
args:
          -N    input batch size (default:2)
          -C    input channel size. (default:64)
          -H    input height size. (default:1)
          -W    input width size. (default:64)
          -v    whether do CPU validation or not (default: 1)
  -layout_in    input tensor data layout - NCHW by default
 -layout_out    output tensor data layout - NHWC by default
       -seed    seed to be used, -1 means random every time (default:-1)
     -k_name    t to 1 will print kernel name (default:0)
```
