* replace amd_buffer_atomic_add with hip_atomic_add
* fix grouped_gemm_splitk kernels on mi300
* fix syntax
* revert experimental atomic_add changes
* fix the group of kernels from ticket 723 on MI300
---------
Co-authored-by: Jing Zhang <jizhan@amd.com>
* simplify karg in device/grid split-k op
* fix mk_kn_mn instances
* add more instances
* use name from tensor layout
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Ractor instance generation, split into multiple cpp file
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
* Add type_convert implementations for bf16
* Add the fix for conv_fwd
* Add the fix for conv_bwd_data
* Add the fix for conv_bwd_weight
* Format
* Format
* Another format
* Add a macro to use workaround on MI200 only
* Format
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add conv perlayer quantization
* Add gemm_dlops quantization
* Support int8 for innerproduct
* Refine gemm dlops int8 kernel parameter
* Support gfx908(MI100) and gfx90a(MI200)
* clang-format
* Rename example number
* Support different layout for d tensor
* Add conv dlops perchannel quantization example
* Move to example 40
* Extract the common code for different platform (dlops and xdlops)
* Move ot subfolder. Prepare to add other op of quantization
* Refine the quantization instance library
* Add conv dl instances and client example
* Remove unnecessary type
* Add gemm quantization instance
* Add external api and client example
* Refine num_bytes
* Separete different layout to different cpp
* Add more xdl instances
* Revert "Remove unnecessary type"
This reverts commit 820869182f.
* Remove CShuffleDataType in dlops
Let acc and CShuffleDataType be the same in xdlops
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Pass shared mem pointer as pointer to void.
* Device Op GroupedGEMM Multiple D
* Example for grouped gemm multiple d.
* Add MI200 to supported archs.
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* make conv_fwd_bias_activation kernel id unique
* add more parameters to conv and gemm kernel names
* update GetTypeString for conv and gemm kernels
* fix two more kernel strings
* Grouped gemm + Gelu instances.
* Device Instance Factory for GroupedGemm+Gelu
* Client example
* Rangify fill helper functions.
* Fix name clash.
* Profiler for grouped_gemm+gelu
* No need to use full namespace name.
* Add check for MRaw divisible by vector load.
* Ugly fix for big errors.
* Add grouped_gemm+gelu to profiler CMakelists.
* Store in argument additional info.
* Information about Mraw, Nraw, Kraw values.
* Use FastGelu instead of Gelu.
* Change client ex to use FastGelu
* Remove relaxed error precision.
* Remove duplicate output elementwise-op
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Modify Doxygen config to pick up include directories recursively
* Add DeviceMem struct to API Reference guide
* Add classes that are used in Flash Attention kernel
* Add a reference and config for generating bibliography
Co-authored-by: Philip Maybank <Philip.Maybank@amd.com>
* fix a bug blocking wmma_gemm_multipleD
* Utilize matrix padder in device_wmma_op
* cosmetic change for gemmpadding format
* clang format
* Change gridwise gemm from FIFO to KMN loop fashion
* Add DeviceOp and examples
* Format DeviceOp template arguments
* Remove bf16 example
* Format
* Format
* Update MakeABCGridDescriptor_A_K0_M_K1_B_K0_N_K1_C_M_N
* Refactor argument preparation
* Update conv_bwd_weight_dl to grouped_conv_bwd_weight_dl
* Rename device op file
* Update include directive in the example file
* Update descriptor preparation for grouped op
* Update the argument
* Update batch handling
* Add gridwise gemm supporting batched input
* Update blockwise indexing, working version
* Update copyright year
* Update check if argument is supported
* Refactor and make consistent with xdl examples
* Update check if argument is supported
* Add changelog entry
* Added comments on Dl op split_k>1 support
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* clean up output from kernel_launch
* set RUN_WARMUP to 0 by default
* split the warm-up into a separate issue
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Sync the order of type string with template parameter
* Add more instances
* Check the vector size and remove redundant var
* Extract var to static, prepare to separate sweep once kernel
* Separate sweeponce flow and optimize the flow
* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization
* Remove useless code
* Update naive variance kernel
* Refine string
* Fix typo
* Support naive variance for device_normalization
* Check the blocksize
* Share the VGPR of x and y
* Share the VGPR of gamma and beta
* Add more instances
* Support fp16 sqrt for experiment
* Add CHANGELOG
* Fix typo
* clang-format
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* tempsave
* tempsave
* tempsave
* runtime bug, cannot find symbol
* workaround for incorrect HIP warpSize return value
* debugging
* tempsave
* Correctness OK, waiting for optimization
* Tidy up + format
* temp save
* temp save, reproduce the v_bfi_b32 issue
* add inline asm for wmmaop test
* tidy up
* clean some debug purpose code
* discard some codes
* clang format
* clang format
* compiler issue fixed + increase tile size
* navi3x_multipleD+example
* temp save
* workable
* batchedgemm[OK], groupconv[debug]
* groupconv: Sanity check[OK], Performance[Bad]
* navi3x_groupconv_need_optimization
* format
* Add arch limitation to all wmma examples
* fix bug: example30 input conv args
* Add gemm + layernorm instance
* Add ckProfiler
* Add test
* Add client example
* Detect if user forger to set the workrspace
* Use literal in the example
* [What] use builtin function for sqrt
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()
* check gemm vaildity in IsSupportedArgument
* Add more testcases
* Merge duplicated folder in client example
* Print more infomation
* Use better kernel parameter for MS problem size
* clang format
* Add constexpr for if condition and remove redundant include
* Remove cstdlib and add constexpr
* add instance for gemm bias softmax gemm
* add client example
* change CGridDesc_G_M_N to CGridDesc_G_M_O
* add gridwise
* change c grid name
* device add d0s data
* fix 08 client_example
* add example 47_fused_attention
* example output correct
* add d0 to example
* add d0 element op
* rechange instance code
* change Acc0ElementwiseOperation to C0DEElementwiseOperation
* change example name
* update instance for cdeelementwiseop
* add bhalf_t ScaleAdd
* add test
* not surport geem1 bias
* remove some ignore
* fix test bug
* File renaming and class renaming for device element-wise operation
* Add batchnorm-infer instances, external API and client example
* Add batchnorm-infer profiler module and gtests
* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp
* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer
* Rename class and file due to conflict from device_elementwise_2d.hpp
* Fix namespace in batcnnorm_infer_nhwc client example
* Use double as alpha/beta values type in reduce device op api
* Use double as alpha/beta values type in softmax device op api
* Use double as alpha/beta values type in multiple-reduce device op api
* Use double as epsilon value type in normalization/elementwise-normalization device op api