* Remove M/N/KPad local variables
* Use M/N/KPad to name padded lengths
* Replace duplicated local variable by parameters
* Rename variables M/N/KRaw to M/N/K
* Move AK0/BK0 compute logic into GridwiseGemm
* Use macro to shorten code
* Move CalculateGridSize() logic into GridwiseGemm
* Add comment to credit the implementation source
* Reuse the existing implementation
* Remove no-longer used data members
* Remove elementwise-op objects from interfaces
* Reserve kernel arg as whole object in interfaces
* Remove redundant data member
* Make 3rd type parameter optional
* Remove unnecessary type parameters
* Remove no-longer used descriptor-creation methods
* Move kernel arg type definition into GridwiseGemm
* Add macro to switch between code sections
* Move argument field computing logic into device op side
* Make utility method 'static'
* Declare special methods
* Unify MakeArgument() usage
* Adapt the new GridwiseGemm interface
* Push-down class 'GridwiseGemm::Argument' fields
* Remove no-longer used methods
* Add unused parameters
* Force copying parameters in 'Embed' ctor
* Remove no-longer used descriptors
* Fallback change on BaseArgument
* Remove macro 'INTEGER_DIVIDE_CEIL'
* Make variable naming more consistent
* Make sure methods are only invoked on right place
* Remove trailing underscore in public attribute name
* Remove unnecessary methods
* Hide computing logic of derived attributes
* Make new 'Embed' ctor only available for device code
* Make sure 'Embed' type args are not references
* Move check for karg.K into CheckValidity()
* Remove more integer division logic from device code
* Undo changes on Embed
* Separate 'Problem' concept out from 'Argument'
* Add overloaded version of __builtin_amdgcn_readfirstlane() (see the sketch after this list)
* Remove 'static' specifiers
* Remove more 'static' specifier
* Replace unsigned char by std::byte
* Add 'const' specifier to never changing variable
* Add 'inline' specifier to function definition
* Share same name for kernel interfaces
* Fix wrong boundary calculation logic
* Leave the third template arg for compatibility
* Remove unnecessary parameters
* Fix wrong error message (for type name)
* Create descriptor on device side
* Fix wrong debug message
* Remove no-longer used data members
* Rename type trait
* Remove std:: qualifier from standard types
* Replace 'size_t' by 'unsigned'
* Use type alias to hint usage
* Replace static_for<> by ordinary 'for' loop
* Reject unsupported argument
* Rename readfirstlane() to amd_wave_read_first_lane()
* Rename file readfirstlane.hpp as amd_wave_read_first_lane.hpp
* Update function calls
* Reorder statements
* Re-format files
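A minimal sketch of what the overloaded wrapper could look like, assuming the amd_wave_read_first_lane() name from the renames above; the builtin only accepts a 32-bit value, so wider trivially-copyable types are presumably broadcast dword by dword. This is an assumption, not the library's exact code:

```cpp
#include <cstdint>
#include <type_traits>
#include <hip/hip_runtime.h>

// Hypothetical sketch: broadcast lane 0's value to the whole wave for types
// wider than the 32 bits __builtin_amdgcn_readfirstlane() accepts.
template <typename T>
__device__ T amd_wave_read_first_lane(const T& value)
{
    static_assert(std::is_trivially_copyable_v<T>, "need bitwise copies");
    constexpr unsigned NumDwords = (sizeof(T) + 3) / 4;
    uint32_t dwords[NumDwords]   = {};
    __builtin_memcpy(dwords, &value, sizeof(T));
    for(unsigned i = 0; i < NumDwords; ++i)
        dwords[i] = __builtin_amdgcn_readfirstlane(dwords[i]);
    T result;
    __builtin_memcpy(&result, dwords, sizeof(T));
    return result;
}
```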
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add license header.
* Reduce amount of logged output. Add constant initialization.
* Add functional tests for grouped_gemm with different kbatch value.
* Add debug log information + remove unused code.
* Don't pass kbatch to CalculateKPadded.
* Turn on logging in grouped gemm and gemm splitk profiler
* Debug: limit number of test cases to run.
* Log more information and initialize with constant value.
* Turn on DEBUG_LOG
* Add more debug log information.
* Limit the number of instances to compile.
* Use GridwiseGemmPipeline
* Use KBatch to calculate K0
* Multiple DebugLog messages.
* Unit tests for multiple KBatch values.
* Refactoring
* Disable logging
* Extract KBatch update out of if statement.
* Uncomment instances.
* Disable DebugLog.
* Use KBatch when calculating KPadded (see the sketch after this list).
* Fix CGridDesc padding.
* Use available helper functions.
* Uncomment code commented for debugging.
* Remove unnecessary debug log messages.
* Uncomment previously commented code for debug purposes.
* Add KBatch info to profiler output summary log.
* Add gtests for gemm splitk using ckProfiler API.
* Add more test-cases for different data layout.
* Add more test cases for gemm splitk
* Remove old test.
* Unit tests for MKNK ggemm interface.
* Fix and add more unit-tests.
* Constexpr everything!
* Increase error threshold for fp16 and splitk.
Since we're using fp16 atomic add for splitk, there's a
known precision loss.
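A sketch of the split-K padding arithmetic these commits revolve around; the helper names are assumptions, not the library's exact API. Each of the KBatch slices is tiled by K1, so KPadded has to be derived from KBatch as well:

```cpp
#include <cstdint>

// Hypothetical helpers mirroring the described KBatch/K0/KPadded relation.
constexpr int64_t integer_divide_ceil(int64_t a, int64_t b) { return (a + b - 1) / b; }

constexpr int64_t CalculateK0(int64_t K, int64_t K1, int64_t KBatch)
{
    return integer_divide_ceil(K, K1 * KBatch); // K1-tiles per split-K slice
}

constexpr int64_t CalculateKPadded(int64_t K, int64_t K1, int64_t KBatch)
{
    return CalculateK0(K, K1, KBatch) * K1 * KBatch; // K rounded up to full tiles
}

static_assert(CalculateKPadded(1000, 8, 4) == 1024, "example: K=1000, K1=8, KBatch=4");
```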
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Expand the base class of pool2d, prepare to share base class with pool3d
* Add pool3d device op
* Add pool3d f16 example
* Refactor the base class; implement generic pooling in the future
* clang format
* get original index in max pooling (see the sketch after this list)
* Add outputindex to base class
* Fix dimension
* Add pooling instance
* Use indexType instead
* Remove useless header
* Extract IndexDataType to template
* Extract pooling reference code
* clang format
* clang format
* Fix typo
* Add tensor stride
* Add missing header
* Add index stride and output stride
* Refine naming
* Add type to base class
* Rename file
* Use proper size
* Fix typo
* Refine naming
* Change the argument to a vector.
* Add max pool profiler
* Refine naming
* Support f32 pool
* Fix typo
* Add avg pool2d fwd in profiler
* clang format
* Rename AccDatatype to ComputeDatatype
* Fix init
* test pool
* Extract variable
* Add client example
* Check the pooling dim
* clang format
* Connect argv and arg_parser
* Add found check
* Remove useless header
* Refine naming
* Adjust the order of device_pool_fwd
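A host-side sketch of the "original index" behavior, as a simplified 1-D stand-in for the real 2-D/3-D device op (all names here are assumptions): max pooling that also records the flat input index of each maximum.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical reference: 1-D max pooling returning, per output element, the
// flat index of the input element that produced the maximum.
template <typename T, typename IndexDataType = int32_t>
void max_pool1d_with_index(const std::vector<T>& x, int window, int stride,
                           std::vector<T>& y, std::vector<IndexDataType>& indices)
{
    const int out_len = (static_cast<int>(x.size()) - window) / stride + 1;
    y.assign(out_len, std::numeric_limits<T>::lowest());
    indices.assign(out_len, IndexDataType{0});
    for(int o = 0; o < out_len; ++o)
        for(int w = 0; w < window; ++w)
        {
            const int i = o * stride + w;
            if(x[i] > y[o])
            {
                y[o]       = x[i];
                indices[o] = static_cast<IndexDataType>(i);
            }
        }
}
```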
* enable dl kernels on navi3
* do not build xdl tests and examples on Navi
* run tests before building everything on jenkins
* disable gemm_bilinear on gfx1030
* add gpu targets to installer on Navi
* put tests in the same order as before
* reduce the number of navi targets in CI
* build CI installer for gfx940 as well
* only build for MI300 during QA runs
* replace amd_buffer_atomic_add with hip_atomic_add (see the sketch after this list)
* fix grouped_gemm_splitk kernels on mi300
* fix syntax
* revert experimental atomic_add changes
* fix the group of kernels from ticket 723 on MI300
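A sketch of the atomic-add replacement, under the assumption that it boils down to HIP's generic atomicAdd; the kernel and names here are hypothetical:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical sketch: accumulate split-K partial results into C with HIP's
// generic atomicAdd instead of a buffer-addressing intrinsic.
__global__ void splitk_accumulate(float* __restrict__ c,
                                  const float* __restrict__ partial, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
        atomicAdd(&c[i], partial[i]);
}
```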
---------
Co-authored-by: Jing Zhang <jizhan@amd.com>
* simplify karg in device/grid split-k op
* fix mk_kn_mn instances
* add more instances
* use name from tensor layout
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
* Add type_convert implementations for bf16 (see the sketch after this list)
* Add the fix for conv_fwd
* Add the fix for conv_bwd_data
* Add the fix for conv_bwd_weight
* Format
* Format
* Another format
* Add a macro to use workaround on MI200 only
* Format
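A sketch of the usual software bf16 conversion such a workaround tends to use (round-to-nearest-even via an integer bias); the actual fix's details, including NaN handling, may differ:

```cpp
#include <cstdint>
#include <cstring>

using bhalf_t = uint16_t; // assumed bf16 storage type

// Hypothetical float -> bf16 conversion with round-to-nearest-even.
inline bhalf_t type_convert_bf16(float x)
{
    uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    const uint32_t rounding_bias = 0x7FFF + ((u >> 16) & 1);
    return static_cast<bhalf_t>((u + rounding_bias) >> 16);
}

// Hypothetical bf16 -> float conversion: widen by shifting into the high bits.
inline float type_convert_float(bhalf_t x)
{
    const uint32_t u = static_cast<uint32_t>(x) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```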
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add conv per-layer quantization (see the sketch after this list)
* Add gemm_dlops quantization
* Support int8 for innerproduct
* Refine gemm dlops int8 kernel parameter
* Support gfx908(MI100) and gfx90a(MI200)
* clang-format
* Rename example number
* Support different layout for d tensor
* Add conv dlops per-channel quantization example
* Move to example 40
* Extract the common code for different platform (dlops and xdlops)
* Move to subfolder. Prepare to add other quantization ops
* Refine the quantization instance library
* Add conv dl instances and client example
* Remove unnecessary type
* Add gemm quantization instance
* Add external api and client example
* Refine num_bytes
* Separate different layouts into different cpp files
* Add more xdl instances
* Revert "Remove unnecessary type"
This reverts commit 820869182f.
* Remove CShuffleDataType in dlops
Let acc and CShuffleDataType be the same in xdlops
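A sketch of the per-layer quantization epilogue shape referenced by this list; the names and the exact rounding are assumptions. The int32 accumulator is scaled by one per-tensor multiplier and saturated back to int8:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical per-layer requantization: one scale for the whole tensor.
inline int8_t requantize_per_layer(int32_t acc, float scale)
{
    const float v = std::nearbyint(static_cast<float>(acc) * scale);
    return static_cast<int8_t>(std::clamp(v, -128.0f, 127.0f));
}
```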
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Pass shared mem pointer as pointer to void (see the sketch after this list).
* Device Op GroupedGEMM Multiple D
* Example for grouped gemm multiple d.
* Add MI200 to supported archs.
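A sketch of the void-pointer shared-memory convention; the kernel here is hypothetical, not the real grouped-GEMM entry point:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical worker: casts the untyped LDS block to the tile type it needs.
__device__ void run_gemm(void* p_shared)
{
    float* tile       = static_cast<float*>(p_shared);
    tile[threadIdx.x] = 0.0f;
}

// Hypothetical kernel: hands out the dynamic LDS block as void*, so one
// entry-point signature serves every tile data type.
__global__ void grouped_gemm_kernel()
{
    extern __shared__ char p_shared_block[]; // dynamic shared memory
    run_gemm(static_cast<void*>(p_shared_block));
}
```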
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* make conv_fwd_bias_activation kernel id unique
* add more parameters to conv and gemm kernel names
* update GetTypeString for conv and gemm kernels
* fix two more kernel strings
* Grouped gemm + Gelu instances.
* Device Instance Factory for GroupedGemm+Gelu
* Client example
* Rangify fill helper functions.
* Fix name clash.
* Profiler for grouped_gemm+gelu
* No need to use full namespace name.
* Add check for MRaw divisible by vector load.
* Ugly fix for big errors.
* Add grouped_gemm+gelu to profiler CMakeLists.
* Store additional info in the argument.
* Information about Mraw, Nraw, Kraw values.
* Use FastGelu instead of Gelu (see the sketch after this list).
* Change client ex to use FastGelu
* Remove relaxed error precision.
* Remove duplicate output elementwise-op
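For reference, the common tanh approximation behind a FastGelu-style op; this is a sketch, and the instance library's exact formula may differ slightly:

```cpp
#include <cmath>

// Hypothetical FastGelu: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
inline float fast_gelu(float x)
{
    const float c = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
}
```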
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* fix a bug blocking wmma_gemm_multipleD
* Utilize matrix padder in device_wmma_op
* cosmetic change for gemm padding format
* clang format
* Change gridwise gemm from FIFO to KMN loop fashion
* Add DeviceOp and examples
* Format DeviceOp template arguments
* Remove bf16 example
* Format
* Format
* Update MakeABCGridDescriptor_A_K0_M_K1_B_K0_N_K1_C_M_N
* Refactor argument preparation
* Update conv_bwd_weight_dl to grouped_conv_bwd_weight_dl
* Rename device op file
* Update include directive in the example file
* Update descriptor preparation for grouped op
* Update the argument
* Update batch handling
* Add gridwise gemm supporting batched input
* Update blockwise indexing, working version
* Update copyright year
* Update check if argument is supported
* Refactor and make consistent with xdl examples
* Update check if argument is supported
* Add changelog entry
* Added comments on Dl op split_k>1 support
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Sync the order of type string with template parameter
* Add more instances
* Check the vector size and remove redundant var
* Extract var to static, prepare to separate sweep once kernel
* Separate sweeponce flow and optimize the flow
* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization
* Remove useless code
* Update naive variance kernel
* Refine string
* Fix typo
* Support naive variance for device_normalization (see the sketch after this list)
* Check the blocksize
* Share the VGPR of x and y
* Share the VGPR of gamma and beta
* Add more instances
* Support fp16 sqrt for experiment
* Add CHANGELOG
* Fix typo
* clang-format
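A scalar sketch of the "naive variance" formulation mentioned above: one pass accumulating sum(x) and sum(x²), with var = E[x²] − E[x]², which is cheaper than Welford but less numerically stable. Names are assumptions:

```cpp
#include <cstddef>

// Hypothetical scalar reference of the naive single-pass mean/variance.
inline void naive_mean_var(const float* x, std::size_t n, float& mean, float& var)
{
    float sum = 0.0f, sum_sq = 0.0f;
    for(std::size_t i = 0; i < n; ++i)
    {
        sum += x[i];
        sum_sq += x[i] * x[i];
    }
    mean = sum / static_cast<float>(n);
    var  = sum_sq / static_cast<float>(n) - mean * mean; // E[x^2] - E[x]^2
}
```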
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* tempsave
* tempsave
* tempsave
* runtime bug, cannot find symbol
* workaround for incorrect HIP warpSize return value (see the sketch after this list)
* debugging
* tempsave
* Correctness OK, waiting for optimization
* Tidy up + format
* temp save
* temp save, reproduce the v_bfi_b32 issue
* add inline asm for wmmaop test
* tidy up
* clean some debug purpose code
* discard some codes
* clang format
* clang format
* compiler issue fixed + increase tile size
* navi3x_multipleD+example
* temp save
* workable
* batchedgemm[OK], groupconv[debug]
* groupconv: Sanity check[OK], Performance[Bad]
* navi3x_groupconv_need_optimization
* format
* Add arch limitation to all wmma examples
* fix bug: example30 input conv args
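One plausible shape for the warpSize workaround, stated as an assumption rather than the actual fix: take the wave size from compile-time target macros instead of the runtime warpSize value, which the commits above found unreliable.

```cpp
// Hypothetical sketch: pick the wave size per target at compile time.
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
constexpr int wave_size = 32; // assumption: gfx11 wmma targets run wave32
#else
constexpr int wave_size = 64; // cdna targets run wave64
#endif
```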
* Add gemm + layernorm instance
* Add ckProfiler
* Add test
* Add client example
* Detect if user forgot to set the workspace
* Use literal in the example
* [What] use builtin function for sqrt (see the sketch after this list)
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()
* check gemm validity in IsSupportedArgument
* Add more testcases
* Merge duplicated folder in client example
* Print more information
* Use better kernel parameter for MS problem size
* clang format
* Add constexpr for if condition and remove redundant include
* Remove cstdlib and add constexpr
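A sketch of the overload set implied by the [What]/[Why] note above, assuming the shape of ck::math::sqrt: routing each width to its amdgcn builtin lets the compiler emit the native v_sqrt instructions.

```cpp
#include <hip/hip_runtime.h>

// Hypothetical overloads (names assumed): native-sqrt dispatch per width.
__device__ inline float math_sqrt(float x) { return __builtin_amdgcn_sqrtf(x); }
__device__ inline double math_sqrt(double x) { return __builtin_amdgcn_sqrt(x); }
```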
* add instance for gemm bias softmax gemm
* add client example
* change CGridDesc_G_M_N to CGridDesc_G_M_O
* add gridwise
* change c grid name
* device add d0s data
* fix 08 client_example
* add example 47_fused_attention
* example output correct
* add d0 to example
* add d0 element op
* rework instance code
* change Acc0ElementwiseOperation to C0DEElementwiseOperation
* change example name
* update instance for cdeelementwiseop
* add bhalf_t ScaleAdd (see the sketch after this list)
* add test
* do not support gemm1 bias
* remove some ignore
* fix test bug
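A sketch of a ScaleAdd element-wise op, with semantics assumed as y = scale · x + d; the real operator's types, including the bhalf_t path, may differ:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical ScaleAdd element-wise op: compute in float, cast on store.
struct ScaleAdd
{
    float scale_;

    template <typename Y, typename X, typename D>
    __host__ __device__ void operator()(Y& y, const X& x, const D& d) const
    {
        y = static_cast<Y>(scale_ * static_cast<float>(x) + static_cast<float>(d));
    }
};
```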
* File renaming and class renaming for device element-wise operation
* Add batchnorm-infer instances, external API and client example
* Add batchnorm-infer profiler module and gtests
* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp
* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer
* Rename class and file due to conflict from device_elementwise_2d.hpp
* Fix namespace in batchnorm_infer_nhwc client example
* Use double as alpha/beta values type in reduce device op api
* Use double as alpha/beta values type in softmax device op api
* Use double as alpha/beta values type in multiple-reduce device op api
* Use double as epsilon value type in normalization/elementwise-normalization device op api
* add multi embeddings support
* fix format
* optimize sqrt
* add reduce operation
* change to elementwise op
* fix name
* rename
* run ci cd
* format example
* format code
* format code
* Change to the DeviceReduce base class template to include all problem description information
* Add external api for reduction
* Add client example to test the reduction external api
* Spelling correction
* Re-implement the host_reduction to follow the DeviceReduce base API format
* Change the reduce profiler to call the external API for collecting device instances
* Rename reduce client example directory from 08_reduce to 12_reduce
* Remove (void) before the functional call
* Tiny update in reduce client example
* Tiny update in profile_reduce_impl.hpp
* Rename the reduce client example directory
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Add device op of gemm layernorm
* [What] Rename F to H
[Why] F and G prepare for welford tensor
* Add gridwise gemm + welford (see the Welford sketch at the end of this list)
* Extract template parameter
* Rename kernel. Prepare to add second half kernel
* Extract var
* Add second kernel for gemm+layernorm
* Move to the gemm_layernorm folder
* Rename F and G to mean and var
* Do not use snakeCurved, it makes determination of padding for welford difficult
* Rewrite the device interface and rename some var
* Add welford count
* Update interface
* Sync code, prepare to test on MI200
* Clean the code
* Implement layernorm
* Add comment to mention hipFree
* Write out the e for debug.
This could be removed and h used instead
* 1. Allocate mean, var and count via SetWorkSpacePointer.
2. Add GetWorkSpaceSize to calculate the space size
* Add gemm layernorm host code
* use reference layernorm
* Fix bug of blockwise welford for first kernel
* Fix bug of mean var padding for layernorm
* Use sgpr for shuffleM_index
* padding for GemmMeanVarCountGridDescriptor_M_NBlock
* Add layout parameter
* Check argument for gemm
* calculate max count for tail block
* Share E and H memory in device op
* Hard code the vector dim
* Refine the MakeDescriptor
* 1. Remove E parameter, because E is inside of device op
2. Check vector size
* [What] Rename MakeMeanVarDescriptor_M_N
[Why] Prepare to add count version of make descriptor
* Use 1D global memory for count
* Prevent redundant IO
* Update parameter
* Add pipeline v1/v2 selector
* Rename the example name
* Add base class for gemm layernorm
* Refine naming to distinguish naive and welford
* Add comment to explain in detail
* We don't need to pad in N dimension in gemm for mean/var/count. Set NPerTile to 1
* Rewrite the 2nd kernel; use multiple blocks along the N dimension in the layernorm kernel
* Share the vector size
* Refine var name
* [What] Force LayernormThreadSliceSize_N = vector size.
[Why] Memory coalesce
* Add comment
* Extract divisor out of the loop in reference layernorm
* Pad different size for E and H in layernorm kernel according to different block tile
* Refine naming
* Refine naming
* Prevent implicit cast
* [What] use ck::math::sqrt instead of __builtin_amdgcn_sqrtf
[Why] __builtin_amdgcn_sqrtf only supports float; double will cause casting
* Cast only constant
* Change of post shuffle thread descriptor
* Add EMeanVarDataType parameter.
* Merge the mean and var threadwise copy
* Add missing index
* Fix Typo
* Sync the variable with previous if
* 1. Declare e inside the host_gemm_layernorm()
2. Prevent implicit cast in reference code
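For context, a scalar sketch of the Welford update and merge rules the blockwise kernels build on; the device versions vectorize this and add the padded-tail "max count" handling mentioned above:

```cpp
#include <cstdint>

// Hypothetical scalar Welford accumulator (names assumed).
struct Welford
{
    float mean    = 0.f;
    float m2      = 0.f; // running sum of squared deviations
    int32_t count = 0;

    void update(float x)
    {
        ++count;
        const float delta = x - mean;
        mean += delta / static_cast<float>(count);
        m2 += delta * (x - mean);
    }

    // merge two partial results, e.g. when combining per-thread accumulators
    void merge(const Welford& o)
    {
        const int32_t n = count + o.count;
        if(n == 0)
            return;
        const float delta = o.mean - mean;
        mean += delta * o.count / static_cast<float>(n);
        m2 += o.m2 +
              delta * delta * (static_cast<float>(count) * o.count / static_cast<float>(n));
        count = n;
    }

    float variance() const { return count > 0 ? m2 / static_cast<float>(count) : 0.f; }
};
```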
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* add DEBUG_LOG macro to enable/disable debug output (see the sketch after this list)
* fix syntax
* fix syntax again
* fix syntax one more time
* remove blank spaces
* use ifdefs
* add the Print argument
* move the definition of DEBUG_LOG to ck.hpp
* add the missing argument to Print()
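A sketch of what such a switch usually looks like; this is an assumption, and the actual definition lives in ck.hpp and may differ:

```cpp
#include <cstdio>

// Hypothetical compile-time debug switch guarding all debug prints.
#ifndef DEBUG_LOG
#define DEBUG_LOG 0 // set to 1 (or pass -DDEBUG_LOG=1) to enable debug output
#endif

#if DEBUG_LOG
#define DEBUG_PRINT(...) std::printf(__VA_ARGS__)
#else
#define DEBUG_PRINT(...) ((void)0)
#endif
```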
* Refine the device batchnorm-backward base API templates and data type assignments
* Remove duplicated kernel file
* Add batchnorm backward instances and external API
* Add batchnorm-backward profiler and tests
* Add client example which uses batchnorm backward external API
* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory
* Loosen the threshold for batchnorm-backward check_err()
* Implemented batchnorm-backward Blockwise and Multiblock kernels
* Add batchnorm-backward device op
* Add batchnorm-backward host-reference op
* Add batchnorm-backward example
* Parameters renaming in batchnorm backward kernels and device op
* Change in the example to loosen the threshold for ScaleDiff checking
* Add comments to explain the implementation of batchnorm-backward
* Parameters renaming again in batchnorm backward kernels
* Improve the expression calculation for performance
* Add batchnorm backward to README
* Add comments to explain inv-variance in batchnorm forward and backward (see the sketch after this list)
* Rename the batchnorm forward training and inference examples
* Add/update the comments for batchnorm-backward kernels
* Renaming again
* Add block_sync_lds between two consecutive blockwise reductions
* Move common expression 1/N out of the static_for loops
* Add dy_elementwise_op
* Renaming in backward example again
* Add checking for reduceDims in reference_batchnorm_backward
* Update to comments and codes format
* Rename in the comments
* Remove common expression out of the loop in reference_batchnorm_backward_nhwc_c
* Add block_sync_lds() between blockwise reduction again
* Fix comments again
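A scalar sketch of the inv-variance relation those comments explain (names assumed): forward saves invVar = 1/sqrt(var + eps), so backward can reuse it instead of recomputing the rsqrt.

```cpp
#include <cmath>

// Hypothetical helpers mirroring the described inv-variance usage.
inline float inv_variance(float var, float eps) { return 1.0f / std::sqrt(var + eps); }

inline float batchnorm_normalize(float x, float mean, float inv_var,
                                 float scale, float bias)
{
    return (x - mean) * inv_var * scale + bias; // normalize, then affine
}
```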
* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
* Update to device_batchnorm_forward base class to include all template parameters for problem description
* Add batchnorm forward instances and external api
* Add batchnorm forward profiler module which uses the external api
* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]
* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward
* Improvement to the batchnorm infer base API
* Add batchnorm forward client example which shows using the batchnorm forward external API
* Add test for batchnorm forward
* Tune the batchnorm profiler initialization values and error threshold
* Add support for bhalf_t in instances/external api/tests
* Add support for int8_t in instances/external api/tests
* Add support for double in instances/external api/tests
* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances
* Checking before running best instance in batchnorm_fwd_nhwc client example
* Add checking for YElementwiseOp in batchnorm_forward external API
* Add more types in batchnorm forward profiler
* Add more test lengths
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>