* add f8 gemm with multiD for both row/col wise
* change compute_type to fp8
* changed tuning parameters in the example
* add rcr example
* post-merge fix
* fix
* reduce init range
* Add bf16 instances
* Add bf16 gemm universal example
* tempsave
* Add guard to navi compilation
* workground on a specific mixed gemm instance ( bring back it when compiler fix upload)
* fix formatting condition statement issue
* solve conflict
---------
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
* Overload output stream operator for LoopScheduler and PiplineVersion
* Add Run overload accepting grid descriptors MK.
* Add __device__ keyword for CalculateGridSize
* Create device op GroupedGemmMultipleD
* Add GroupedGemm MultipleD Tile Loop implementation.
* Add an example for GroupedGemm MultipleD tile loop.
* Device Op GroupedGEMMTileLoop.
* Bunch of small changes in exmaple.
* CkProfiler
* Remove unused tparam.
* Fix include statement.
* Fix output stream overloads.
* Do not make descriptors and check validity untill we find group.
* Fix gemm desc initialization.
* Revert device op
* Fix compilation for DTYPES=FP16
* Validate tensor transfers paramters.
* Validate on host only NK dims if M is not known.
* Fix bug.
* A convenient debug func for selecting threads.
* Fix has main k block loop bug.
* Make sure that b2c has up to date tile offset.
* Output stream operator for Sequence type.
* Cmake file formatting.
* add flush cache to device op
* add flush cache parameter to ckProfiler
* change calculate size a and b method
* chang evaluation time method foro AVERAGE to MEDIAN
* format code
* adjust some code
* fix core dumped
* remove loop call flush icache in kernel
* remove loop(outer) call flush icache
---------
Co-authored-by: letaoqin <letaoqin@amd.com>
* Refactor elementwise kernels
* Instances fixes
* Fix cmake
* Fix max pool bwd test
* Update two stage gemm split k
* Restore elementwise scale for hiptensor backward compatiblity
* Fix Acc data type check in conv fwd multiple abd
* Disable conv fp64 fwd example
* Update grouped conv weight multi d
* Support A/B/C elementwise ops.
* First part of GGEMM multiD splitk two stage.
* WIP - changes for debuggin.
* tmp save
* working version
* added bf16@int8 version
* fixes
* add reviewers sugestions
* pre-commited missing files
* switched to ifs from elseifs
---------
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
* Update device op api to support BComputeType
* Add example
* Add instances
* Add profiler mode
* Add client example
* Update copyright year
* Add BComputeType check
* Fix compute types
* Format
* Format
* Format
* Remove const
* Use the right template
* Format
* Format
* add row/col instances
* Add missing file
* fixed
* Format
* Updates
* Format
* fixed rrr layout
* Format
* Update test and embed modules
* Restore older version
* Update year
* Set -fPIC
* Format
* Use double for isnan
* rename host folder to codegen + minor fix
* add codegen CI test
* add option to build components without building CK
* fix the groovy syntax
* fix typo
* use the correct function for the codegen stage
---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
* SWDEV-439954 - Use hard coded filename rather than using the macro __FILE__ for debug prints.
Hiptensor library is using the header files from CK. Hard coded ROCm path was getting embedded into the hiptensor library, since the header file was having the macro __FILE__. Replace the macro with filename.
* fix syntax
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
* rename folder
* Add type string
* Remove typo
* Add deviceOp to backward x
* Add comment to describe the behavior of backward normalization
* Add kernel function, prepare to implement
* implement generic kernel
* Check vector size
* Add sweep once pipeline for small reduce size
* Fix bug of KRaw_ error
* Fix bug of dx stride
* sanity check for mean and rstd
* backward x for groupnorm
* Add bwd x instance
* add layernorm 2d bwd gamma beta instances
* Change save mean var type from f32 to f16 in f16 mode
* Change the example to f16
* Add groupnorm bwd gamma beta instance
* Add groupnorm bwd x instance
* Fix naming
* Add layernorm bwd x ckprofiler
* Add groupnorm bwd x profiler
* clang format
* Rename bwd x to bwd data
* Fix bug of verification in profiler
* Add test of layernorm and groupnorm bwd data
* Add missing cmake
* Add layernorm2d bwd data
* rename fwd example
* Add groupnorm client example
* Fix typo. replace Invarient with Invariant
* Add checking before running the best instance
Current implementation of IsSupported method in contraction ops does not cover a lot of possible cases in which ScalarPerVector cannot really be used to read A, B or D, or write E.
This PR extends both the regular and multiABD contraction ops with improved checks and also adds new instances with smaller values of ScalarPerVector to support instances that are not supported by other instances.
This PR introduces support for double buffering in LDS into GEMM kernels that use direct load instructions.
Direct loads now use inline asm instead of intrinsics. Usage of intrinsics results in compiler adding additional waitcnt instructions what breaks possible load/compute overlap in case of double buffering.
Usage of inline asm results in the need to use sched_barrier in order to make sure that compiler cannot incorrectly reschedule instructions since it does not know the data dependencies between global->LDS and LDS->registers.
* added working example for 5D input using 1D kernel
* example with 5D input tensor and 2d kernel - not working: issues with arguments
* added updated version of 3d device op - changed descriptors/dims
* added example file to check kernel
* fixed descriptor and isSupportedArgument stride problem
* added and modified kernel for 3d - updated tids/loop
* adding some more 5d example files
* fixed some issues
* changes made for testing
* working version: fixed error in stride for A, still a bit inefficient
* cleaned up formatting/comments
* updating formatting
* more formatting fixes
* fixing cmake, adding back gpu targets in cmake script
* adding client example
* added instances for client example
* fixed errors in client example
* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp
* removed extra files
* minor formatting and naming fixes
* adding test files and profiler
* fixing minor error
* minor fix
* removed unneccesary comments, renamed files
* updated instance list for client example, added different layout example
* removing instances
* fixed error in instance generation
* remove comments
* update profiler and client example tensor layouts
* fixed errors in test/profiler
* updated vector dim access to enable vector load
* updated test/profiler files
* updated example with 1d kernel
* updating profiler
* renamed files
* disabled device op for MI300
* skip elementwise_permute_2d on gfx94x
* Update CMakeLists.txt
* fixing CMake - disabling some GPU targets
---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add basic support for direct loads from global to LDS
* Clean the code and comments
* Add support for fp16
* Add comments
* Add check for thread cluster lengths
* Align non-direct-load fp16 example
* Small fixes
* Extend IsSupported to check for supported GPU gens
* Build examples only on the supported HW
* Do not throw when instance not supported in 04 example
* Review: Apply review suggestions
* Review: small fix
* Review: small fix
* Introduce multiABD api and deprecate multiD
* Replace multiD with multiABD
* Mark structures as deprecated
* Change doxygen deprecated to note to avoid warnings
* adding files for F32 example
* adding functioning implementation with scalar multiplication and unary operator support
* added fp 16 type check in unary square
* updating scalar multiplication as an operator
* functioning version with scalar operator
* changing strides for col major
* updated column major implementation
* working column major implementation
* cleaned up comments, rearranged/renamed files
* Support multi AB for grouped conv fwd xdl
* Add instances
* Add client example
* Add example
* Add interface test
* Minor fixes
Minor fixes
Minor fixes
* Comment fixes
* Fixes
* Reference fix
* Test xdl fixes
* Improve multi_ab interface test
* added working example for 5D input using 1D kernel
* example with 5D input tensor and 2d kernel - not working: issues with arguments
* added updated version of 3d device op - changed descriptors/dims
* added example file to check kernel
* fixed descriptor and isSupportedArgument stride problem
* added and modified kernel for 3d - updated tids/loop
* adding some more 5d example files
* fixed some issues
* changes made for testing
* working version: fixed error in stride for A, still a bit inefficient
* cleaned up formatting/comments
* updating formatting
* more formatting fixes
* fixing cmake, adding back gpu targets in cmake script
* adding client example
* added instances for client example
* fixed errors in client example
* implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp
* removed extra files
* minor formatting and naming fixes
* adding test files and profiler
* fixing minor error
* minor fix
* removed unneccesary comments, renamed files
* updated instance list for client example, added different layout example
* removing instances
* fixed error in instance generation
* remove comments
* update profiler and client example tensor layouts
* fixed errors in test/profiler
* updated vector dim access to enable vector load
* updated test/profiler files
* updated example with 1d kernel
* updating profiler
* renamed files
---------
Co-authored-by: Jing Zhang <jizha@amd.com>
* Rename folder
* Add layernorm 4d fwd example
* Rename original layernorm example
* Add layernorm 4d f16 test
* Add layernorm4d_fwd client example
* Support layernorm4D in ckProfiler
* Rename groupnorm to groupnorm fwd in example
* Rename layernorm and group fwd in test
* Rename normalization to normalization_fwd (instances)
* Add fwd to DeviceNormalization
* Rename external api header
* Rename folder, because we can also add bwd in this folder
* Add fwd in layernorm and groupnorm (profiler
* Fix compile error
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>