All our platforms support C++20 now, so update to C++20 standard
for language features such as concepts, designated initializers,
range-based for initializers, and consteval. This PR only switches
the compiler flags to C++20, no other changes.
[CK_TILE] Add new ck tile unit test
* Add new ck tile unit test smoke-gemm-universal
* Add new ck tile unit test smoke-gemm-basic
* Add new ck tile unit test topk_softmax
* Add new ck tile unit test add_rmsnorm2d_rdquant_fwd
* Convert ck-tile 06_permute smoke test to unit tests for fp16, fp8, and fp32
* Apply clang format and update copy right year
* Convert ck tile moe sorting example smoke tests to unit tests
* fix CMakelists to ensure that permute and moe_sorting are built for gfx9 only.
* Remove number prefix from permute and moe_sorting directory names
* code cleanup
* add missing test cases for fp16 permute
* remove unecessary parentheses
* Cleanup
* Remove uneccessary final nullptr
* update copyright and licensing statement in files
* Add custom target for permute tests
* Add missing new line at end of file for moe sorting CMakelist.
* Update MOE sorting tests to account for MOE sorting example updates
The ck_tile/13_moe_sorting example was updated to include different
cases dependending on whether MOE_SORTING_FMOE_2D_BUF is set. So,
the ck_tile tests for MOE sorting were updated to account for these
changes.
---------
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* CK tile tests for flatmm using example
* MOE smoothquant draft tests
* fix create_arg default index to zero for MOE smoothquant
* revert MOE smoothquant changes
* code clean up
* Add back MOE smoothquant changes
* Add MOE smoothquant cases for different precisions and update cmake
* clean up comments
* Update flamm cmake
* revert change made to moe_smoothquant smoke_test.sh EXE path
* remove unecessary comment in MOE smoothquant cmakelist
* comment out adding moe_smoothquant subdirectory for now due to bugs with GPU core dump issue on gfx942 and gfx90a
* Clean up run_test_case function in MOE smootquant tests
* update copyright and licensing on files
* Remove flatmm test dir since tests should be done as weighted preshuffle gemm
* Add flamm smoke test cases to weighted preshuffle gemm gtests
* remove blank line from CMakeLists
---------
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Updating runtime log message for CK TILE ENGINE
* CKTile layout from config
* CKTile custom config for CI
* Documentation for Layout Changes
* CKTile Layout changes to Jenkins
* Fixing Clang Format
* Changes to Jenkins file to fix error
* fix(cmake-ck-dev): no longer sets invalid values as gpu arch
* style(py files): ruff formatting
* fix(cmake-ck-release): no longer sets invalid values as gpu arch
* chore(cmake-tile_engine): add reminder to uncomment user config json
* Changes to jenkin file to address more cases
* Changes to Jenkins to fix Error
* Changes to Jenkins file for fixing an error
* Update Jenkinsfile (#2517)
* Update Jenkinsfile
---------
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Create tests for ck tile batched transpose using example
* Create ck tile tests for smoothquant using examples
* fix precision input strings and convert batched transpose to regression tests
* Code cleanup and fix asserts
* add missing licenses
* update copyright and licensing in files
* Update smoothquant tests to use example's smoothquant.cpp
* Add custom target for batched transpose tests
* Add missing new lines at end of files for CMakelists
* fix typo in batched transpose CMakeList target_compile_options
---------
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
* Use read_tr in universal gemm
* Enable all instances back
* Revert example37 changes
* Resolve comments
* resolve comments 2
* Fix assertion msg
* fix the gemm basic
* change index_t to bool for preshuffle variable
* Solve the comment
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
1. Port NCHW support from ConvFwd (#2375) to conv bwd data
2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw
Co-authored-by: azhuang <anzhong.huang@amd.com>
Without this DataType = unknown -
``` sh
Run Flatmm kernel with DataType = unknown M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.228837 ms, 187.687 TFlops, 341.374 GB/s,
```
after this change
```sh
Run Flatmm kernel with DataType = bf16 M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.227029 ms, 189.181 TFlops, 344.092 GB/s,
```
* fix bug in loops that need use local tokens to compute
* support extra chain local_token
* update
* update
* refine some main
* update
* support dispatch_policy
* fix 15 example
* [draft] Add pk_fp4 and test
* Add hw conversion for fp4
* Refine test code and pk_fp4 constructor.
* fix test indent
* modify according to comment.
* fix clang-format
* modify according comments.
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
* Initial commit
* Adding new tile partitioner to flatmm
* intermediate changes
* debugging kernels
* Updating flatmm example to universal gemm example
* updated flatmm kernel to run via gemmKernel
* update universal gemm to incorporate flatmm
* debug
* Fix flatmm call
* Fixing other kernels and tests for API changes
* clang formatted
* fixing gemm tests
* added test for flatmm and simplify kernel arguments
* adding flatmm test
* fix test for flatmm
* simplify gemm kernel with flatmm
* remove flatmm related files
* addressing review comments and code clean up
* resolving empty file
* resolving empty file
* clang formatted
* addressing review comments
* enable persistent kernel for flatmm
* reverted the removed files for flatmm
* reverted the removed files for flatmm
* changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example
* some more renames
* clang formatted
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline
* i_nhead_ conversion type to prevent overflow
---------
Co-authored-by: ltqin <letaoqin@amd.com>
* add template for fp16 atomic add
* add template for unsigned short atomic add
* use atomicCAS in atomic add for fp16 and unsigned short
* revrt back to atomic add using casting
* mask support ratio for y axis
* format code
* add notes for param y_ratio
* fix comments error
* support template and mdiv for ratio mask
* refactor y-ratio mask constructor
* optimize coordinate calculation
* add SimplifiedRatioAttentionMask
* Wrap tile size mapping as class method
* Warp pipeline generating as class method
* Add constraint as kernel dispatching criteria
* Support mutltiple tile size for a (hdim, hdim_v) combination
* Use smaller tile size if CU utilization is low
* Use integar as the key of the tile size map
* Fix type error
* Simply override parent class method return value
* Add attribute to eliminate warnging
* Allow using environment variables to turn on/off custom factory
* Unify param naming style
* Add missing HIP runtime include directive
* Fix os.environ.get() usage
* Adding ninja log json convertion utility
* Updating to match old ninjatracing
* Updating Jenkins to use new ninjatracing
* Ensuring v7 works
* Removing old ninjatracing from dockerfile
* add for async load builtin
* add async load api
* fix some compiling errors
* fix a compiling error
* fix some compiling errors
* add a pipeline which copies from v4
* add a new pipeline for async load
* fix some compiling errors
* add async load tests
* fix some issues in async load
* fix
* fix async inline assembly
* fix async inline assembly
* add ignore header file
* comment some not gfx950 codes
* comment some not gfx950 codes
* fix a error
* update async load apis
* fix lds descriptor
* fix a compiling error
* fix some compiling errors
* fix a descriptor issue
* update lds descriptor
* change async pipeline's tile distribution pattern from thread to warp
* fix clang format
* update async policy
* fix a CRTP issue
* fix a typo error
* change lds layout
* fix some sync issues
* improve codes
* delete the async test
* fix a commented format issue
* avoid compiling device functions when compile host
* make gemm run
* add the copy kernel support
* finish the feature
* Address comment
* add the support for buffer_builtin
* solved the merging problem
* Comment Addressed
---------
Co-authored-by: joye <joye@amd.com>
Co-authored-by: joyeamd <John.Ye@amd.com>