* solve compiler issue
* solve the gfx950 mfma shuffle regression
* refactor jenkinsfile to handle arch name better
* [CK_TILE] set divisor to the count of threads along the k dimension
* fix the compiler error
* solve degradation
* Finish the multiplies fix
* fix the scales
* solve compilation error
* solve the composes
* solve the error of tile sweeper
* fix the test and example
* fix for gfx950
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
[ROCm/composable_kernel commit: 00c46785a8]
* [CK_TILE] Port hw independent changes from internal repo to develop branch
It includes PR#96, #114, #120, #121.
* correct rebase error
[ROCm/composable_kernel commit: fc7bf0ab1c]
* Add a ds permute fallback on gfx908 and older for the row_newbcast:7 instruction
* Better macro for selecting ROW_NEWBCAST
* clang-format the update
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit: afe1ff618d]
* Have a workable version for SGPR
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* substitute with the new sgpr read api
* update the CHANGELOG
* change to static for logic
[ROCm/composable_kernel commit: 2cbbf5dcb3]
* fix bug in loops that need to use local tokens to compute
* support extra chain local_token
* update
* update
* refine some main
* update
* support dispatch_policy
* fix example 15
[ROCm/composable_kernel commit: cfe211cc60]
* Do not use warpSize as a compile-time constant as it is removed
* Update tile_image_to_column_shape.hpp
update warpSize usage.
* clean up all uses of warpSize, make sure code builds
* fix
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
[ROCm/composable_kernel commit: 4c57157d50]
* moe sorting ex
* fix bug for race condition
* fix bug and optimize large expert
* fix
* optimize with sub_token_oneshot
* support skip empty tokens for expert sorting
* update moe_sorting
* tidy code
* support mp kernel
* hint mp
* remove useless code
* porting to example 15
---------
Co-authored-by: valarLip <340077269@qq.com>
[ROCm/composable_kernel commit: 353a612b44]