* solve compiler issue
* solve the gfx950 mfma shuffle regression
* refactor jenkinsfile to handle arch name better
* [CK TILE] set divisor to count of thread along k dimension
* fix the compiler error
* solve degradation
* Finish the multiplies fix
* fix the scales
* solve compilation error
* solve the composes
* solve the error of tile sweeper
* fix the test and example
* fix for gfx950
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
* chore(copyright): update copyright header for test directory
* chore(copyright): update copyright header for test directory
* chore(copyright): update copyright header for client_example directory
* chore(copyright): update copyright header for test directory
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* change bit width check
* refactor macros into constexpr functions
which still get inlined
* wrap s_waitcnt api
* parameterize test
* cleanup
* cleanup fp8 stub
* add fp8 test cases; todo which input parameters are valid?
* replace n for fp8 in test cases
* add large shapes; fp8 fails again
* change input init
* test sync/async
* time the test
* clang-format test
* use float instead of bfloat to cover a 4-byte type
* fix logic - arg sections should be 'or'd
* make block_sync_lds_direct_load interface similar to old ck
* fix a few comment typos
* name common shapes
* revert the example to original logic of not waiting lds
* clang-format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>