* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)
* - marked all low importance build messages as log_level=DEBUG
* Add TailHandler for V3, V4 and Mem pipelines
* Adapt examples and tests to use TailHandler
* move tail-handling logic to pipeline in persistent grouped gemm
* Fix Mem pipeline dispatching, add CompV4 dispatching
* Use a macro for handling the many tails of Mem pipeline
* Fix formatting again
* Use const-ref RunFunction, remove unnecessary try_run
* Add constraint on traits/tile/pipeline
* Use kM0=128 if max_seqlen_q == 8192
* Re-format codegen script
* Remove redundant attr name postix
* Fix import error: default field in dataclass
* Use kK0=64 & kK1=64 to hide latency
* Use CU utilization to decide tile size
* Add trait to use a persistent kernel and split the entrypoints in grouped gemm
* Some helper functions for persistent kernel case
* Get max occupancy grid using device properties
* Implement tile loop in main entry point to grouped gemm
* Enable GridSize() on device
* Handle offset tile index using real current block index
* Add persistent kernel choice to grouped gemm example
* Use a for-loop for iterating over the group
* Reduce VGPR spills by early-exit
* Enable persistent kernel choice in grouped_gemm example
* Add persistent kernel option to grouped_gemm test
* Fix formatting with remod.py
* Remove GridUpdateBlocks as blocks are now iteratively computed
* Add comment about VGPR spilling
* Fix formatting
* Use CK_TILE_HOST instead of __host__
* Enable all Row/Col combinations in grouped gemm unit test
* Add some KBatch=2 cases to grouped gemm tests
* Fix SplitK for grouped gemm
* Enable pipeline hotloop/tailnumber selection in-kernel for grouped gemm
* Add type traits
* Split examples to regular and tileloop
* Formatting
* Use hipExtStreamGetCUMask to get current active CUs for the given stream
* Align test and example kernel config, and disable validation for splitk repeats
* Remove debug options from CMakeLists.txt
* Separate the code paths for persistent/non-persistent in test
* Fix formatting
* Address review comments
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Shared Memory for single data point
* CKTile Transpose vectorize CP1
* CKTile Transpose vectorize CP2
* CKTile Transpose vectorize CP2.1
* fixed the compile error of the transpose tile 2d
* Have the correct result for the current test sample
* Changes to printing tensor
* fp8 support added
* Debugging for transpose
* solving the corner issue
* Changed padding flag
* Intermideate Debugging
* Intermidiate Debugging
* Intermediate Debugging
* Finished debugging of the transpose op
* Code Cleanup
* Adding edge case smoke tests
* Adding Transpose test to CI/CD
* Adding Transpose test to CI/CD
* Adding Transpose test to CI/CD
* Addressing Review Comment
* Addressing Comments
* Addressing Comments
* Measuring Perf Tests
* Code Cleanup
* Changlog
* Added the running iterations
* clang format
* Fix the changelog
* Fix the compilation error
* change the printing factor
---------
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
* sync with function interface of cshuffleepiloge,fix flatmm build fail
* move code from solin/flatmm which add mfma16*16*32fp8 and optimize flatmm
---------
Co-authored-by: solin <bingzhou@amd.com>
* Test Copy kernel code for testing tile distribution logic
* Fix the error
* Solved the problem
* Updated comments and document formatting
* Removed unused tile distribution and code cleanup
* Added README.md and formatting for CI/CD.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* add ck tile examples to package
* Update jenkinsfile
* fix for jenkinsfile
* fix for building ck tile code on non gfx9
* compile ck tile examples only for gfx94
* include ck tile examples in all target
* fix for basic gemm UseStructuredSparsity
* Update CMakeLists.txt
* Update gemm_pipeline_problem.hpp
* add targets to rocm install
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* draft v_mfma_f32_16x16x32_bf16
* fix error config and add debug code.
* Solve the CShuffle Problem
* draft v_mfma_f32_16x16x32_bf16
* fix error config and add debug code.
* Solve the CShuffle Problem
* fix error while testing new command
* Finished the feature of new mfma 16*16*32
* Addressed the comment
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* 50ms -> 28ms
* Fix bug in non fuse_add_store cases
* Fine tuned setting for 2 pass pipeline
* adjust workload
* remove unnecessary change
* add layernorm
* Adding output quant and unquant results at the same time.
* fix test
* fix format
* tune for cases 128x640 and 128x1024
* bug ifx
* return value with macro and revert the return value
* [CK-TILE] no-macro launch api solution (#1992)
* no-macro solution
* address -Wcomma
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>