* comment on specialization for TensorSpecialization::Packed
* gemm_softmax_gemm with output permutation
* scaling
* refactor MatrixPadder; rename to GemmPadder
* remove old sanity check
* restore original gemm_softmax_gemm
* revise comment in gemm_softmax_gemm example
* use GetElementSpaceSize()
* remove extra header
* typo
* remove archaic DeviceOpPtr
* add examples into grouped/batched_gemm
* adding splitK examples
* fixed splitK
* add bfp16 int8 example into splitK
* formatting
* use static_cast
* added common for batched_gemm
* add commons for examples of splitK/batched/grouped_gemm
* return true
* adjust splitK check tol
* update example
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
* GemmPadder and GemmGemmPadder
* proper padding using GemmGemmPadder
* test gemm_gemm padding
* properly check size K in IsSupportedArgument()
* properly check size requirement given SrcScalarPerVector in IsSupportedArgument()
* comment
* format
* Implement multiple-reduction in one kernel (kernels, device ops, examples)
* Add generic elementwise kernel and device interface
* Add generator for normal-distributed data initialization
* Add host refer implementation of batchnorm-forward and batchnorm-infer
* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels
* Remove un-needed including in batchnorm example
* Renaming generic_elementwise to elementiwise in kernel and device classes/functions
* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise
* Change in exampe 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise
* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise
* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp
* Removing and renaming files
* Update to synchronize gemm_layernorm client example to the generic element-wise device op API
* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming
* Merge two static member functions in device_elementwise.hpp
* Remove unary_elementwise_1d kernel and device
* Add threadwise and blockwise welford
* Rename gridwise op, prepare to add welford version
* implement welford and integrate welford into layernorm
* Take care of tail loop
* Fix buf when ThreadSliceK > 1
* Fix bug of merging of two empty set
* Rename clip to clamp
* 1. Fix type of count
2. Remove useless static_assert
* Do not inherit Reduction::Argument
* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)
* Add y stride
* Rename.
DeviceLayernorm -> DeviceLayernormImpl
DeviceNormalization2 -> DeviceLayernorm
* Move literal ""_uz & ""_zu into namespace 'literals'
* Move namespace 'literals' as 'ck::literals'
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in lowest two-dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring an unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* add gemm_gemm instances and tests
* avoid LDS data hazard
* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in lowest two-dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring an unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* attention host validation
* add blockwsie softmax v1
* iteratively update softmax+gemm
* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
* add init method for easier debugging
* do away with manual thread cluster calculation
* generalize blockwise softmax interface
* row-wise softmax sum & max
* format
* rename to DeviceBatchedGemmSoftmaxGemm
* add gemm_softmax_gemm instances and tests
* comment
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Implement layernorm kernel and deviceOp
* verify gpu kernel with host code
* 1. Separate gamma aand beta from affine
2. Check if argument is valid
* clean
* Sync the naming
* Support sweep once mode if we can put k dimension data inside one block
* [What] Get length from upper length.
[Why] if we get length directly, we may get length after padding.
* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.
* Use 1d descriptor for gamma and beta
* Add accElementwiseOp
* Extract layernorm host code
* Support different YVectorDim in GridwiseLayernorm
* Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp
* Gamma and beta can share the VGPR.
* Add test for fp32 and fp16
* Fix bug of concurrency and add test case which may fail orignally
* Propagate NaN for layernorm
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* dump lds content in appropriate precision type
* add squared add reduction op; allows sq sum
* initial stub from regular gemm impl
* layernorm example code & host verification
* initial layernorm implementation
* tidy up
* make C0 precision type consistent with C
* clang-tidy and additional comments
* tighten up example code
* account for extra flops/bytes from normalization
* clang-format
* c0 bias/beta/gamma now have its own precision type
* AccElemOp for gemm outputs prior to feeding to layernorm
* update workgroup mapping
* rename kernel template param to reflect its dual use
* use LDS mem pool for reduction workspace
* change cshuffle precision type to f16; clean up
* clang-format
* correct naming
* explicit cast
* fully implemented gemm + bias + activation + add + norm
* activation in correct order
* reflect reduction API's recent change
* amend
* clean up; add comment
* keep up with recent changes in reduction API
* format
* resolve merge conflicts
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* use 'sweep once' softmax kernel where applicable
* threadwise copy's dst buffer can specify invalid element value
* add int8 in/out float compute softmax support
give a bit of leeway for int absolute tolerance as there's a single data point of all test cases showing off-by-1 error
* format
* softmax inherits DeviceNormalization
* softmax profiler stub
* tighten up reference softmax interface
* example prints tensor dimension
* add fp32 to softmax profiler
* rename header
* hook with ckProfiler
* format
* resolve merge conflict
* resolve merge conflicts
* update normalization profiler help string
* resolve conflict
* typo
* remove residual
* softmax profiler: address feedback
* test for mixed precision input/output
* fully qualify ck::math::isnan
* add comment for device normalization interface
* revise wording
* constness for alpha/beta scaler pointer
* Extract base class for elementwise
* Refactor interface of DeviceGemmReduce. Do not use tuple in interface
* [What] Rename d into reduce in gemm + reduction related code
[Why] Prepare to add d term for add
* Unify base class of gemm + reduce and gemm + bias + add + reduce
* 1. Rename gemm_bias_add_reduce for external api
2. Refine cmake
* Add normalize device operation
* [What] Reorder the argument
[Why] Because d0 is also the input of c.
* Add type string
* Add example of gemm_bias_add_layernorm via external api
* Refactor example code
* clang-format
* Fix compile error
* clang-format
* Add external api for gemm_add_add_layernorm and normalize
* Add client example
* clang-format
* initial stub for standalone softmax
* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m
* host softmax validates
* compiles; to implement beta scaling
* use NaN trick to efficiently ignore OOB values during sum of exponentials
* freeload device_reduce's utility functions
* clean up interface
* adding prior value (beta scaling)
* remove restriction related to perf considerations
* apply clang-format
* clean; disable diagnostics
* resolve conflicts
* add exp wrapper
* honor HostTensorDesc interface; allow implicit cast from different vector<T> type
* test softmax for fp16/fp32
* update readme
* amend commit NaN trick
* remove redundant param added during development
* format
* replace ScalarDataType with AccDataType
* separate out test programs by precision type
* move softmax sample code to its own folder
* format
* keep up with recent changes in reduction API
* remove extra header
* Remove template from Reducton operation classes and add template to their operator() and GetIdentityValue() interfaces
* Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers
* Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers
* Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation
* Use struct-scope operator template instantiation for binary and unary element-wise operations
* Change a few more elementwise operations to use template for operator()
* Tiny correction in Normalize operator
* Add static_assert to check the data type appliability for some reduction accumulator and element-wise operatons
* Correction in some examples with regard to using ReduceAccDataType
* Use static_assert for UnaryDivide
* Update to merged codes to use Element-wise operations and Reduction Accumulator operations correctly
* Tiny fix with regard to SetWorkSpacePointer()
* Copy "gemm reduce" to "gemm bias add reduce"
* Implement gemm bias add reduction
* Fix compiler error due to merge from develop
* Add tensor operation for gemm + bias + add + reduce
* Add gemm_bais_add_reduce to ckProfiler
* Add c1 functor
* Refine type
* Use reduceAccDataType instead of explicitly float
* Change to use check_err()
* Do relu in float32 instead of bhalf_t. Because bhalf_t is unsigned
* Refactor relu. using type_trait instead of overloading
* Rename DxsReduceAccElementwiseOperation to DxsReduceAccElementwiseOperation
* Fix denominator
* Refine nameing
* Fix denominator in host
* Remove useless include header
* Use AccDataType
* Fix static_cast order
* Refine type
* [What] Remove tuple type in the base class
[Why] External api depend on base class. if base class has relationship with type, we will need many class for different type
* add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight
* add bwd weight for bf16: init
* remove redundant compute
* use datatype and split k to check whether a workspace is used
* remove unused computation for work space size
* add some code for bfp16
* add device/grid unary op
* add unary type convert to bwd-weight example
* support bf16 splitk kernel for convnd bwd weight
* 1. remove comments. 2. add checkvalidity. 3. add gridsize computation
* add workspace size check
* fix format
* change function name
* Use the unified naming for math functions on host and HIP kernel
* Corresponding change/simplification in reduction host/profiler/examples due to unified math functions renaming
* Renaming GetReductionZeroVal() to GetIdentityValue()
* Tiny renaming in profile_reduce_impl.hpp
* More renaming in profile_reduce_impl.hpp
* Replace zeroVal by identiyVal
* Remove ck_ prefix in the naming of ck::math provided functions
* moved gemm_descs_args into const buff
* use CK_CONSTANT_ADDRESS_SPACE instead of global constant
* clean
* moved hipMemAlloc outside of deviceOp
* add SetWorkSpacePointer
* fix ignore
* Reference CGEMM + test stub
* Format.
* Incomplete simple implementation
* Library instances
* Sketch of tests
* Test fixes.
* Example added
* Cosmetics
* Add elementwise operation kernel and example
* Add comment
* Add template argument of dim . Prepare to support multiple dimension
* Rename example
* Support 1 dimension
* Add static assert
* Add comment
* Second auxiliary buffer added
* Extract pad
* Remove redundant argument
* Support any dimension for elementwise operation
* Remove line
* Let it be the multiple number of CU
* Move thread per block to the parameter of constructor
* Consuming binary ops to do A+B / A-B
* Fix + cosmetics + bf16 test commented out temporarily
* Format
* Enabling bf16 test
* Revert "Enabling bf16 test"
This reverts commit f497e2ba44.
* Fix + test reenabled
* fix build
* Revert "fix build"
This reverts commit d73102384b.
* post PR #235 merge fix
* amend
* Single workspace for cgemm + helper
* Perf calc fix
* Review remarks: static_cast
* Review remarks: binary ops templated
* Cleaning
* Removal of instances and their tests
* Review remarks from aosew addressed
* Review remark: unnecessary attribute
* Post-merge fixes
* Restrict 4gemm to PassThrough + bug fix
* Review remarks
* update licence
* change cgemm example to fp16
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
* Implement reduction meand and reduction square mean
* Refine file name
* Add reduce mean and square mean
* Fix parameter name
* Add normalize device op (not implement invoker::run())
* Remove epislon
* Refine deviceop
* Add 5ary elementwise for normalization
* Add layernorm example
* layerNorm verication
* Fix compiler error due to merge from develop
* Fix typo
* Fix compile error
* Refine naming
* [What] Suport non pointer for invoker and argument
[Why] Snyc coding style with gemm
* Refine folder name
* Refine class name
* Evaluate perf of the kernel
* Fix compile error
* [What] Refine perf evaluation in example of gemm + reduction
[Why] evaluation of gemm + reduction may cause verification fail. Because evaluation will not initial global memory
* clang-format
* debugging conv
* fix oversight where ctile map is constructed before initializing c desc
* example program should returns error code
* clean up
* changed Block2CTileMap in conv2d and convnd
* clean up
* clean up
* cleanup
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
* Support different length of ScalarPerVector
* Add example of broadcast on fastest axis
* Typo
* Refine fastest example
* Add dimension check
* Modify fastest broadcast example to 3d
* Enforce users give scalarPerVector explicitely
* 1. Add CscalarPerVedctor
2. Not only broadcast on fastest need to set scalarPerVector to 1
* Rename var
* Move IsScalarPerVectorValid() inside IsSupportedArgument()
* Separate GridDesc_M0 into A, B and C
* rename var
* Rename var of length
Co-authored-by: rocking <chunylai@amd.com>
* start adding navi21 GEMM
* navi_gemm_km_kn_mn_fp32 compiles and passes one test.
* rename variables and functions in gridwise_gemm_dlops_v1r3
* add other 3 layouts; format instance
* adding more tuning parameters
add tuning parameters for other 3 layouts
* add gemm_dlops_f16
* tmp
* add dependence of DeviceGemm::IsSupportedArg() on arch
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* push gemm_dlops into profiler
* minor changes
* if using xdl or dlops is moved into profiler_gemm_impl
* minor changes
* minor changes
* remove is_xdl from profile_gemm_impl
* make IsSupportedArg dependent on arch for other device_gemm
* minor changes
* minor changes
* fix a bug in f_generate_tensor_value
* add 64x64x64 for gemm_dlops_int8
* add 64x64x64 for gemm_dlops_int8
* comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1
* fix
* start fixing tuning parameters
* monir
* minor changes
* minor changes
* minor changes
* fixing
* adding example
* adding example
* adding example
* add gemm fp32 example
* clean up
* use 128x128x16 as MNK tile in navi21 gemm example
* bug fix
* fix test
* use new block c tile
* clean
* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>