* reduce the docker image size and layers
* clean up docker file
* fix linker error for client example 24
* install CK into the default /opt/rocm/ path
* restore installing CK to alternative path in CI
* add linking for utility lib
[ROCm/composable_kernel commit: d31e8249c1]
* add bf16 gemms for gfx11/gfx12
* reduce the input values in test_gemm
* add int8 wmma gemm instances for gfx11/gfx12
* add example gemm_wmma_int8
* fix bug in gemm_wmma_int8 test
* increase bf16 gemm test tolerance
* update the dates and clean-up commented-out instances
[ROCm/composable_kernel commit: 8aba2724cc]
* Improve test verbosity.
* BUGFIX: Add missing initialization for reduction buffer
* Change default initialization method
Performance may be affected for fp32 and int8 examples.
* Improve test verbosity
* Cleanup
[ROCm/composable_kernel commit: d805a461aa]
* Finished the feature
* Modified the test file
* Test case update
* addresss comment
* Addressed the review comment
* Fixed the CI error
[ROCm/composable_kernel commit: 2b6458ddf2]
* [CK_TILE] add more stride for layernorm to support un-continuous Tensor
* align CK coding style
* extend strides to layernrom expample
* clang-format...
[ROCm/composable_kernel commit: 8ef8a994e7]
* optimze small N case using vec io and using rcp div
* [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass
* [Ck_tile] fix blockSize compute in Generic2dBlockShape
* [Ck_tile]fix kfastfdiv template style
* [Ck_tile] layernorm, fix stype in review
---------
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
[ROCm/composable_kernel commit: 686a58a912]
* make sure cmake can handle xnack targets
* dont build xdl instances for gfx906:xnack-
* dont build xdl tests for gfx906:xnack-
[ROCm/composable_kernel commit: b6e74be1aa]
Before, generate.py appended the list at the end of the output file.
When running the cmake configuration steps multiple times on the
examples, the blob list (such as fwd_blob_list.txt) would grow at every
configuration.
`library/src/tensor_operation_instance/gpu/mha/CMakeLists.txt` worked around
this issue by removing the output file if it exists.
Now, generate.py overrides the content of the output file.
There is no need for the workaround in the CMakeLists.txt;
and the issue is solved for the example projects too.
[ROCm/composable_kernel commit: 464abd235e]
* more accurate residual
* modify comment
* Fix literal case in README.md
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: cb6c5d39dc]
* add prenorm/postnorm support, refactor using generate.py
* update README
* update README
* fix format
* update some description and fix format
* update format
* format
* use non-raw for loading
* format and update n4096
* dynamic-quant ready
* update readme
* support fused dynamic-quant
* update fused-quant, with smooth
* update README
* update args
* update some based on comment
[ROCm/composable_kernel commit: c3a4800c5f]
* CK-Tile GEMM with memory bound pipeline.
* Memory bound gemm pipeline.
* Fix not closed namespace.
* Block gemm mem pipeline draft.
* Do not use ck_tile:: within ck_tile namespace.
* Refactoring & Move Layout info to pipeline problem.
* Get hot loop and TailNum information before lunching kernel.
* Fixes in pipeline.
* Add comment to load_tile_raw and change variable naming style.
* Few small changes & formatting.
* Do not use macro.
* Add gtests.
* Use AccDataType for Output of MFMA instruction.
* Formatting.
* Refactor gemm examples.
* Switch over to current block gemm.
* Use currently available pipeline policy.
* Refactoring and review comment.s
* Fixes after merge.
* Add missing include.
* Add load tile overload which accepts output tensor as parameter.
* This give 8% perf boost at the cost of using more registers.
* Rename example.
* Small changes.
* Fix compilation err and lower K.
* Support different layouts for A/B
* Fix vector size for different layouts.
* Rename Alignment into VectorSize
* Unblock tests.
[ROCm/composable_kernel commit: 24d996aae1]
* Add reduce2d new api
* Prevent user use cross warp reduction
* Fix bug of std caculation
* Add rmsnorm2d
* Add rmsnorm small example
* Remove static assert to prevent compile fail
* Add script to test performance and correctness
* Add missing cmake change
* refine naming
* refine example of rmsnorm
* Fix bug of rmsnorm
* Refine naming
* Fix cmake
* clang format
* Refine pipeline name
* Add add_rmsnorm2d_rdquant kernel
* Add reduce op
* host verification
* Fix bug of one pass pipeline
* Refine tile size
* Add two pass pipeline
* Rename two pass to three pass
* Fix bug of kSaveX == false
* Add instance library
* Add test script
* Fix bug of x verification
* Add save_x to trait
* Add README
* Move reduce2d into reduce folder
* Fix bug of welford when number of m warp > 1
* remove reduncant comment
* 1. move 06_rmsnorm2d to 10_rmsnorm2d
2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant
* clang format and add missing header
* Add host validation of add + layernorm2d + rsquant
* Revert "Add host validation of add + layernorm2d + rsquant"
This reverts commit 936cb45797.
* Remove deprecated flag
[ROCm/composable_kernel commit: 3d60953477]
* Add ceil_to_qualified_tile_length()
* Rename kK0BlockLength to kQKHeaddim
* Add kSubQKHeaddim concept to support headdim96
* Fix in math.hpp to avoid using __half interfaces
* Add LdsBufferSequence instance for headdim96
* Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing
* Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time
* Reformat one file
* Fix text alignment in fmha_fwd_splitkv.py
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: 8632221814]
* topk_softmax
* remove some file
* fix atomix linear_offset
* address various comment, and change sfc get_index api to static(tuple)
[ROCm/composable_kernel commit: b098b71b05]
* Use pre-defined constants for readability
* Use vector write for o_acc tensor
* Remove no-longer used policy method
* Deprecate no-longer used policy/pipeline
* Specify gemm0/gemm1 block warps separately in codegen
* Fix wrong ps_idx creation logic
* Add single-warp block gemm
* Supoprt single-warp gemm0
* Make MakeCBlockTile() as static method
* Use MakeCBlockTile() to get underlying tile distribution
* Use kNumGemm1Warps to compute # threads for gemm1
* Put normal case in the if clause
* Refine fmha splitkv block mapping
* Refine & fix the lse_acc/o_acc layout
* Fix wrong LDS size for K tile
* Use kK0=64 for hdim=128,256 fmha splitkv kernels
* Use kK1=64 for hdim=32,64,128 fmha splitkv kernels
* Undo kK0/kK1 changes
* Use more reasonable GetAlignmentV() computation
* Using store_tile() in fmha splitkv kernel epilogue
[ROCm/composable_kernel commit: 54f0e6f4bb]