* [CK_TILE] Support multi-config in tile_example_gemm_universal
Add GemmConfig in run_gemm_example to support multiple tile config.
- It is useful when use you need compare gemm perf with different tile/pipeline config
- we also can use it simplify the code for wmma support in the furture.
* [CK_TILE] Support multi-config in tile_example_gemm_universal
Address review comments
* rebase code and fix clang format.
* fix clang format
* support pipeline v5.
* fix merge conflict
* address review comment
* add missing file
* address review comment v2
* fix build error
* Do not use warpSize as compile time constant as it is removed
* Update tile_image_to_column_shape.hpp
update warpSize usage.
* clean-up all use of warpSize, make sure code builds
* fix
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
* Multiple d, initial commit
* Check Ds Layout
* Readme and clang format
* Update branch & conflicts
* Multiple D - fix clang-formatter
* Rename elemetwise_op
* Fix CI
* Code review part1
* Remove printf
* Remove unnecessary comment
* Add new tests with Col layout
* Review part 2
* Added support for Multiple D GEMM
* Update comment
* Remove maybe_unused
* Clang-format
* Review part 3
* Add comment to function
* Add comment to function: another
* Take number of params for a refrence function
* Remove additional d param for 0 tensor
* Change name of function
* Fix CI fails
* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)
* - marked all low importance build messages as log_level=DEBUG
* Add TailHandler for V3, V4 and Mem pipelines
* Adapt examples and tests to use TailHandler
* move tail-handling logic to pipeline in persistent grouped gemm
* Fix Mem pipeline dispatching, add CompV4 dispatching
* Use a macro for handling the many tails of Mem pipeline
* Fix formatting again
* Use const-ref RunFunction, remove unnecessary try_run
* Add constraint on traits/tile/pipeline
* Use kM0=128 if max_seqlen_q == 8192
* Re-format codegen script
* Remove redundant attr name postix
* Fix import error: default field in dataclass
* Use kK0=64 & kK1=64 to hide latency
* Use CU utilization to decide tile size
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm
* Some helper functions for persistent kernel case
* Get max occupancy grid using device properties
* Implement tile loop in main entry point to grouped gemm
* Enable GridSize() on device
* Handle offset tile index using real current block index
* Add persistent kernel choice to grouped gemm example
* Use a for-loop for iterating over the group
* Reduce VGPR spills by early-exit
* Enable persistent kernel choice in grouped_gemm example
* Add persistent kernel option to grouped_gemm test
* Fix formatting with remod.py
* Remove GridUpdateBlocks as blocks are now iteratively computed
* Add comment about VGPR spilling
* Fix formatting
* Use CK_TILE_HOST instead of __host__
* Enable all Row/Col combinations in grouped gemm unit test
* Add some KBatch=2 cases to grouped gemm tests
* Fix SplitK for grouped gemm
* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm
* Add type traits
* Split examples to regular and tileloop
* Formatting
* Use hipExtStreamGetCUMask to get current active CUs for the given stream
* Align test and example kernel config, and disable validation for splitk repeats
* Remove debug options from CMakeLists.txt
* Separate the code paths for persistent/non-persistent in test
* Fix formatting
* Address review comments
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Shared Memory for single data point
* CKTile Transpose vectorize CP1
* CKTile Transpose vectorize CP2
* CKTile Transpose vectorize CP2.1
* fixed the compile error of the transpose tile 2d
* Have the correct result for the current test sample
* Changes to printing tensor
* fp8 support added
* Debugging for transpose
* solving the corner issue
* Changed padding flag
* Intermideate Debugging
* Intermidiate Debugging
* Intermediate Debugging
* Finished debugging of the transpose op
* Code Cleanup
* Adding edge case smoke tests
* Adding Transpose test to CI/CD
* Adding Transpose test to CI/CD
* Adding Transpose test to CI/CD
* Addressing Review Comment
* Addressing Comments
* Addressing Comments
* Measuring Perf Tests
* Code Cleanup
* Changlog
* Added the running iterations
* clang format
* Fix the changelog
* Fix the compilation error
* change the printing factor
---------
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
* sync with function interface of cshuffleepiloge,fix flatmm build fail
* move code from solin/flatmm which add mfma16*16*32fp8 and optimize flatmm
---------
Co-authored-by: solin <bingzhou@amd.com>
* Test Copy kernel code for testing tile distribution logic
* Fix the error
* Solved the problem
* Updated comments and document formatting
* Removed unused tile distribution and code cleanup
* Added README.md and formatting for CI/CD.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* add ck tile examples to package
* Update jenkinsfile
* fix for jenkinsfile
* fix for building ck tile code on non gfx9
* compile ck tile examples only for gfx94
* include ck tile examples in all target
* fix for basic gemm UseStructuredSparsity
* Update CMakeLists.txt
* Update gemm_pipeline_problem.hpp
* add targets to rocm install
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* draft v_mfma_f32_16x16x32_bf16
* fix error config and add debug code.
* Solve the CShuffle Problem
* draft v_mfma_f32_16x16x32_bf16
* fix error config and add debug code.
* Solve the CShuffle Problem
* fix error while testing new command
* Finished the feature of new mfma 16*16*32
* Addressed the comment
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>