composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-04-19 22:39:03 +00:00


Enrico Degregori 440358c168 Wave Tile Transfer supporting global load with transpose (#3027)

* Initial implementation:

 - add new thread group transfer supporting transpose instruction
 - refactor AB transfer to switch between thread and wave tiles methods

* Add some comments and remove explicit wave and lane calculations

* Remove compiler option for performance

* fp16 example: use tuned instance

* Missing cleanup

* Integrate wave transfer in existing gemm and batched gemm instances

* Add fast instances

* Extend implementation for 8-bit datatypes

Packed types are not supported

* Address review comments

* Optimize pipeline v1 and re-introduce compiler option

* Disable wave tile approach for b scale gemm

* Fix for clang20

* Avoid code duplication of amd_global_load_transpose_to_vgpr function

2025-10-16 11:33:56 -07:00

gpu

Wave Tile Transfer supporting global load with transpose (#3027)

2025-10-16 11:33:56 -07:00

operator_transform

Fix splitK for grouped conv bwd data (#2991)

2025-10-10 09:24:21 +02:00