* split up splitk-gemm instances
* clean up some unused variables
* split the mk_kn_mn interwave splitk-gemm instances
* split up f16_f16_f16 mk_nk_mn splitk gemm instances
* fix clang format
* fix function names
* fix typo
* split up the 2 largest fp16*fp8 splitk gemm instances
* get rid of unused variables
* split up the largest splitk-gemm fp8*fp16 instance file
* split up the instances for xdl fp8 gemms
* split the headers for f16 and i8 for wmmma convolution instances
[ROCm/composable_kernel commit: 1b0fbaebbb]
This PR optimizes fp16 instances of direct load GEMM kernel introduced in #999 and #1052.
Measured the performance of new instances on CDNA2 GPU and compared it against the performance of the best non-direct-load GEMM instances. Used 76 different GEMM problems.
On average, this change improves the performance of the tested problems by 47%. For cases known as latency-bound, the speedup is around 126%.
[ROCm/composable_kernel commit: ad0a8e4cd2]
* disabling some fp8 gemm instances to reduce build time
* disable fp8 gemm instances to reduce build time
* remove the unused variable
* build fp8 gemm default and padded instances separately
* fix include pathsc
[ROCm/composable_kernel commit: c004e0d990]
This PR introduces support for double buffering in LDS into GEMM kernels that use direct load instructions.
Direct loads now use inline asm instead of intrinsics. Usage of intrinsics results in compiler adding additional waitcnt instructions what breaks possible load/compute overlap in case of double buffering.
Usage of inline asm results in the need to use sched_barrier in order to make sure that compiler cannot incorrectly reschedule instructions since it does not know the data dependencies between global->LDS and LDS->registers.
[ROCm/composable_kernel commit: bc4bf9bd03]
* Add basic support for direct loads from global to LDS
* Clean the code and comments
* Add support for fp16
* Add comments
* Add check for thread cluster lengths
* Align non-direct-load fp16 example
* Small fixes
* Extend IsSupported to check for supported GPU gens
* Build examples only on the supported HW
* Do not throw when instance not supported in 04 example
* Review: Apply review suggestions
* Review: small fix
* Review: small fix
[ROCm/composable_kernel commit: 627054b941]
* Disable the SLP vectorizer to prevent unnecessary wait
* Add comment to the reason of adding flag
* Fix wording
[ROCm/composable_kernel commit: db4461c142]
* refactor cmake files for the tests
* refactor cmake files for examples
* fix cmake for gemm example
* fix the cmake file for all examples
* add splitting by data types in gemm_splitk instance header
* rename test to reflect only dl instances are used
* clean up CI workspace, update cmake for instances
* change the jenkinsfile syntax
* build all instances except DL on gfx11
* move workspace cleanup after stages
* clean up workspace after every stage
* isolate data types in grouped_conv_fwd header
* isolate dl instances for grouped_conv2d_fwd
* fix syntax
* fix cmake and batchnorm instances
* fix typo
* fix reduction instances
* fix grouped_conv headers
* fix syntax
* replace parsing logic for instances, replace bfp16 with bf16
* fix the client examples build
* clean up DTYPES from instances cmake files
* update the parsing logic in cmake files
* make an exception for reduction kernels
* update few remaining cmake files to handle DTYPES
* fix syntax
* fix cmake conflicts
* replace f8 with fp8 test name
* resolve conflicts for dpp instances
[ROCm/composable_kernel commit: bba085d2b5]
* Fix vector lengths of DL GEMM instances with padding
* Add checks for correctness of vector lenghts in DL GEMM
[ROCm/composable_kernel commit: 63cd459248]
* Redesign the DPP8 GEMM kernel to use warp-wise component
* Review: Improve error messages
* Review: Remove unnecessary empty lines
* Review: Fix M, N per thread names
* Review: Rename mfma_input_type to dpp_input_type
* Review: Fix tensor adaptor; remove unnecessary element
* Review: Remove calls to dpp_gemm's MakeCDescriptor
* Review: Add blockwise doc, change function names to include dimension names
* Review: Remove duplicated code; Move Block2CtileMap alias to the top of the file
* Review: Add __restrict__ keywords
* Review: Use MatrixPadder for padding A, B, C matrices
* Review: Remove hardcoded datatypes
* Review: Change names from FloatX to XDataType
* Review: Introduce AK0 and BK0 instead of a single K0
* Review: Remove construction of dpp_datatypes object
* Review: Rename DppInstrRunner to DppLanegroupGemm
[ROCm/composable_kernel commit: 37a8c1f756]
* experiment with config file
* experiment with version.h config
* add more info to version.h
* minor updates
* minor updates
* fix case where DTYPE is not used
* large amount of files but minor changes
* remove white space
* minor changes to add more MACROs
* fix cmakedefine01
* fix issue with CK internal conflict
* fix define and define value
* fix clang-format
* fix formatting issue
* experiment with cmake
* clang format v12 to be consistent with miopen
* avoid clang-format for config file
[ROCm/composable_kernel commit: c8a8385fdd]
* allow building CK for specific data types
* add CI build and test stage on Naiv3x without some int8 instances
* add missing gemm fp16 instances
* add the changes to the missed cmake file
* add empty lines at end of source files
* Do not build quantization client example on navi3 in CI
* disable batched_gemm_multi_d_int8 instances with DTYPES
* disable device_conv2d_bwd_data_instance with DTYPES
* fix ckprofiler for conv_bwd_data for int8
* properly isolate the conv_bwd_data int8 instances
* remove empty line
[ROCm/composable_kernel commit: 189ea3b9aa]
* Move source file into sub-directories
* Add missing include directive
* Split DeviceGemmXdl<> fp16 instances
* Fix format
* Remove unnecessary CMakeLists.txt
* Add macros to toggle new features
* Remove debug message
* Turn off GEMM v2 pipeline optimization by default
* Fix format
* Extract duplicated string as list
* Enlarge indent in CMakeLists.txt
[ROCm/composable_kernel commit: 850144a0d3]
* Add gridwise gemm pipeline v1/v2 selector
* Pipeline selector working, test-wise add pipeline options to one instance
* Add gemm instances
* Add debug info to DeviceGemmXdl
* Add debug info to DeviceGemmXdl_CShuffle
* Add debug info to DeviceGemmXdl_CShuffle and instances to gemm_add_add_fastgelu
* Minor fix
* Add debug info to DeviceBatchedGemmXdl and instances to batched_gemm
* set up inter-wave configuration
* use defualt loop scheduling for supported gemm ops
for blanket-applying interwave scheduling for all supported gemm ops, define macro CK_EXPERIMENTAL_DEFAULT_TO_INTER_WAVE_SCHEDULING=1. this should be discouraged though as it is not covered by CI
* Add enum PipelineVersion
* Update instances
* Format
* Fix the merge conflict
* Add flags to disable added instances
* Test disable flag check
* Disable flag check
* Enable the instances
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
[ROCm/composable_kernel commit: 1a0b0e7bec]
* Move kernel implementation files under impl directory.
* Update examples paths.
* Update device kernel impl include paths.
* Update tensor operation instances include paths.
* Update profiler and tests include paths.
* Clang-format
* Update include paths for batched gemm reduce
* Refactor UnitTest ConvNDBwdWeight.
* Refactor fwd and bwd data convND UT.
* Fix used test macro.
* Fix include path.
* Fix include paths.
* Fix include paths in profiler and tests.
* Fix include paths.
Co-authored-by: Adam Osewski <aosewski@amd.com>
[ROCm/composable_kernel commit: 3048028897]
* Change all device operations to use add_instance_library to avoid duplicated cmake configuration.
* update DeviceMem
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: fb1cbf025b]
* add intrin_mfma_f64_16x16x4f64
* add example
* gemm reference add double data type
* chang init data
* fix M N PerXdlops
* fix ifdef
* add comparsion config
* add conv fwd example
* format log out
* change rc matrix egister layout
* reorganize example
* reorganize example 2
* format,because merge develop
* fix call impl adding acc data type
* lost ;
* add compiler warning
* change example tunning parameters
* add test for fp64
* add instance
* add test/gemm/gemm_fp64.cpp
* fix get name issue
* remove some tunning parameter
* fix conflict
* format
* use integer value for GEMM test
* add acc data type
* remove typeid because fp16
* fix streamconfig etc bug from merging develop
* format
* remove test_gemm_xdl_fp64
* add AccDataType
* AccDataType problem
Co-authored-by: qinletao <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: 3e6c2610ae]
* start adding navi21 GEMM
* navi_gemm_km_kn_mn_fp32 compiles and passes one test.
* rename variables and functions in gridwise_gemm_dlops_v1r3
* add other 3 layouts; format instance
* adding more tuning parameters
add tuning parameters for other 3 layouts
* add gemm_dlops_f16
* tmp
* add dependence of DeviceGemm::IsSupportedArg() on arch
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* push gemm_dlops into profiler
* minor changes
* if using xdl or dlops is moved into profiler_gemm_impl
* minor changes
* minor changes
* remove is_xdl from profile_gemm_impl
* make IsSupportedArg dependent on arch for other device_gemm
* minor changes
* minor changes
* fix a bug in f_generate_tensor_value
* add 64x64x64 for gemm_dlops_int8
* add 64x64x64 for gemm_dlops_int8
* comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1
* fix
* start fixing tuning parameters
* monir
* minor changes
* minor changes
* minor changes
* fixing
* adding example
* adding example
* adding example
* add gemm fp32 example
* clean up
* use 128x128x16 as MNK tile in navi21 gemm example
* bug fix
* fix test
* use new block c tile
* clean
* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>
[ROCm/composable_kernel commit: 40b59a63cc]
* [What] Separate fixpoint gemm from gemm example
[Why] let example of gemm_int8 be pure gemm.
[What]
1. Add gemm_requant_relu_requant,
2. Let CDataType be int32 in pure gemm, because no one use int8 CDataType. It is also part of gemm_requant_relu_requant
* Fix path
* Revise cmakelist due to merge develop
* Add gemm fp16 test
* Extract PrepareGemmTensor
* Extract TestGemm
* Add test for different layout
* Add 4 layouts of shuffle version of fp32
* Add 4 layouts of shuffle version of int8
* Add 4 layouts of shuffle version of bf16
* replace all DeviceGemmPtr_ with DeviceGemmNoOpPtr to fit naming convension
* Add test for non-shuffle verstion of gemm
* Fix typo
* Print kernel information
* Add rest of the fp32 kernel to the test
* 1. Add rest of the fp16 device iop.
2. Mark the invalid device operation
Co-authored-by: rocking <chunylai@amd.com>
[ROCm/composable_kernel commit: 485ea46a40]