mirror of https://github.com/ROCm/composable_kernel.git, synced 2026-05-14 02:02:46 +00:00
* add delayed cvt
* extend fp16 gemm_splitk instances for fp8_fp16 gemm
* add f8 example
* add 128 kperblk instances for fp8
* add kpb128 instance
* added more instances into kpb128
* clean code
* clean code
* fix
* fix
* fixed
* Update example/35_splitK_gemm/splitK_gemm_xdl_fp16_fp8.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update include/ck/tensor_operation/gpu/thread/threadwise_tensor_slice_transfer.hpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update library/src/tensor_operation_instance/gpu/gemm_splitk/device_gemm_xdl_splitk_f16_fp8_f16_mk_nk_mn_kpb128_instance.cpp
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
[ROCm/composable_kernel commit: 602c4cc0d9]