Bartłomiej Kocot
|
4d8fce33dd
|
Add SplitK support into Batched GEMM V3 (#1729)
* add bmm api
* add bf16 multi_d
* add ckProfiler for bf16
* add ckProfiler files
* add more instances; fixed 64-bit index issue
* fixed naming
* enabled batched Ds
* use long_index for ds offsets
* clean
* add bmm fp8 ckProfiler
* Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update profiler/src/profile_gemm_universal_batched.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* clean
* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp
* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* refactor batch offset func
* add splitk support into bmm_v3
* clean
* clean
* format
* fixed
* fix
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
|
2024-12-13 21:08:35 +01:00 |
|