* GEMM + Reduce max fp16+fp32
* GEMM + Max bf16 + int8
* Refactor common definitions.
* Refactor common functions of the mean_meansquare example.
* More examples for mean_meansquare.
* Update int8 examples and skip them because of random errors.
* Int4 examples.
* Fix examples for max int4/8
* Tensor conversion for int4 input data for the mean_meansquare example.
* Remove int4 mean_meansquare example
* Fix int8 mean_meansquare example.
- All ReductionAccData and R<N>DataType have to be F32; the INT32 data type gives wrong results.
* Guard int4 with ifdef
* Change int8 example to add_addsquare due to division rounding error.
* Clang format
* Change the return type of common function.
* Get back int8 example with division.
* Remove int8 mean_meansquare example.
* Use proper cast for BF16 data type.
* Use ck::literals.
* Use proper data type for host tensors & reference.
- Use ReduceAccDataType for reference gemm output data type.
- Cast host reference output tensor to EDataType
- Fix ifdefs for int4.
Co-authored-by: Adam Osewski <aosewski@amd.com>
* add padding algo for bmm+scale+softmax+bmm. Version for verification
* remove verification code
* remove comments
* add padded bmm scale softmax bmm example
* format
* refactor
* add comments for usages of padding bmm+scale+softmax+bmm
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
* Add int4 example for convnd_fwd_bias_relu_add
* Fix AddReluAdd for building without int4 support
* Update CMakeLists.txt
* Format
* Convert int4 tensors for int8 kernel
* Fix device memory allocation
* Format
* Format
* Add GEMM examples for int4
Currently the source files are just copied from int8 examples
* Re-use pre-defined alias in int4 examples
* Distinguish user-side type from kernel-side type
* Add int4_t support for check_err()
* Allow conversion between Tensor<> specializations
* Re-format source files
* Use different type for host tensors
* Re-use CopyAsType<>() to implement copy ctor
* Re-use element-wise operation type alias
* Fix typo in alias names
* Complete the int4 examples
* Add constraint to Tensor<> templated methods
* Add type traits 'is_signed_integral<>'
* Add type constraints for integer version check_err<>()
* Allow comparing different-sized integral types in check_err()
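The three entries above constrain and widen check_err() for integral data. A minimal standalone sketch of the idea, assuming simplified names and a plain-vector interface (the library's real utility takes tolerances and more argument kinds):

```cpp
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <type_traits>
#include <vector>

// Compare two integral sequences of possibly different widths by promoting
// both sides to std::int64_t before taking the difference. Safe for any
// integral input types up to 32 bits wide.
template <typename OutT, typename RefT,
          typename = std::enable_if_t<std::is_integral_v<OutT> && std::is_integral_v<RefT>>>
bool check_err(const std::vector<OutT>& out, const std::vector<RefT>& ref, std::int64_t atol = 0)
{
    if(out.size() != ref.size())
        return false;

    for(std::size_t i = 0; i < out.size(); ++i)
    {
        const auto diff =
            std::abs(static_cast<std::int64_t>(out[i]) - static_cast<std::int64_t>(ref[i]));
        if(diff > atol)
        {
            // unary '+' promotes char-like types so they print as numbers
            std::cerr << "mismatch at " << i << ": " << +out[i] << " vs " << +ref[i] << '\n';
            return false;
        }
    }
    return true;
}
```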
* Check converted Tensor<int4_t> with golden Tensor<int8_t>
* Remove constraint of Tensor<>::CopyAsType()
* Avoid compilation error while disabling ck::int4_t support
* Remove debug messages
* Add #error directive to prevent compile sources with wrong setting
* Simplify tensor usages in examples
* Add constraint to check_err() input reference type
* Align design with other PR
* Use ""_uz to simplify example code
* Avoid too much generalizing check_err()
* Re-format GEMM instance template arguments
* Extract int4 example common codes
* Sort include directives
* Move #include directives into new header
* Move common codes together
* Re-format template argument in example code
* Reuse same implementation code for most of GEMM examples
* Re-format common.hpp
* Unify structured comment in examples
* Use reinterpret_cast<>() for cross-type pointer conversion
* Revert "Add type traits 'is_signed_integral<>'"
This reverts commit f2c148efae.
* Allow unsigned integer arguments for check_err()
* Fix compilation error in check_err()
* Remove unnecessary copy ctor for Tensor<>
* Mark Tensor<> special member functions as 'default'
* Use more strict condition to add code in examples
* Fix wrong program return value of GEMM examples
* Handle the case where the user specifies all the strides
* Fix never-ran examples
* Exit successfully if GEMM instance does not support given problem
* Add missing 'else' keyword
* Re-format CMakeLists.txt
* Add wrapper function to hide value conversion while copying memory
* Add new DeviceMem API to copy memory
* Use new DeviceMem API to implement examples
* Revert "Add new DeviceMem API to copy memory"
This reverts commit 3f190b0779.
* Add conversion ctor for Tensor<>
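A minimal sketch of what such a conversion constructor can look like, with Tensor<> reduced to a bare vector wrapper (illustrative names, not the library's actual class):

```cpp
#include <algorithm>
#include <vector>

// Tensor<> reduced to a bare vector wrapper for illustration only.
template <typename T>
struct Tensor
{
    std::vector<T> mData;

    Tensor() = default;

    // Conversion ctor: element-wise cast from a Tensor of another data type,
    // e.g. building a Tensor holding int8 values from int4 host data.
    template <typename U>
    explicit Tensor(const Tensor<U>& other) : mData(other.mData.size())
    {
        std::transform(other.mData.begin(), other.mData.end(), mData.begin(),
                       [](U v) { return static_cast<T>(v); });
    }
};
```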
* Write Tensor<> conversion logics explicitly in example code
* Convert Tensor<> values after transferring data to host
* comment on specialization for TensorSpecialization::Packed
* gemm_softmax_gemm with output permutation
* scaling
* refactor MatrixPadder; rename to GemmPadder
* remove old sanity check
* restore original gemm_softmax_gemm
* revise comment in gemm_softmax_gemm example
* use GetElementSpaceSize()
* remove extra header
* typo
* remove archaic DeviceOpPtr
* add examples into grouped/batched_gemm
* adding splitK examples
* fixed splitK
* add bf16/int8 example into splitK
* formatting
* use static_cast
* added common code for batched_gemm
* add common code for splitK/batched/grouped_gemm examples
* return true
* adjust splitK check tol
* update example
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
* Add custom target to bundle examples together
* Add int4 example conditionally (just copy from int8 example)
* Extract common code into common.hpp
* Move ref gemm type alias into data-type-specific sources
* Add #error directive to prevent compile with wrong setting
* Let AddAddFastGelu support int4 parameter type
* Let check_err() support int4 parameter type
* Add wrapper function to hide value conversion while copying memory
* Finish int4 example for GEMM + AddAddFastGelu
* Add new DeviceMem API to copy memory
* Use new DeviceMem API to implement examples
* Fix wrongly use of macro 'CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4'
* Revert "Add new DeviceMem API to copy memory"
This reverts commit e26e7af71e.
* Add conversion ctor for Tensor<>
* Add 'const' specifier to Tensor<>::CopyAsType()
* Convert Tensor<> values before/after transfer between host & device
* Implement multiple-reduction in one kernel (kernels, device ops, examples)
* Add generic elementwise kernel and device interface
* Add generator for normal-distributed data initialization
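A small host-side sketch of normal-distributed initialization using the standard library (the function name and fixed seed are illustrative):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Fill a buffer with N(mean, stddev) samples; a fixed seed keeps test runs
// reproducible.
std::vector<float> make_normal_data(std::size_t n, float mean = 0.f, float stddev = 1.f)
{
    std::mt19937 gen{11939};
    std::normal_distribution<float> dist{mean, stddev};
    std::vector<float> data(n);
    for(auto& v : data)
        v = dist(gen);
    return data;
}
```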
* Add host refer implementation of batchnorm-forward and batchnorm-infer
* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels
* Remove unneeded include in batchnorm example
* Rename generic_elementwise to elementwise in kernel and device classes/functions
* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise
* Change in example 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise
* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise
* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp
* Removing and renaming files
* Update to synchronize gemm_layernorm client example to the generic element-wise device op API
* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming
* Merge two static member functions in device_elementwise.hpp
* Remove unary_elementwise_1d kernel and device
* Add threadwise and blockwise welford
* Rename gridwise op, prepare to add welford version
* implement welford and integrate welford into layernorm
* Take care of tail loop
* Fix bug when ThreadSliceK > 1
* Fix bug in merging of two empty sets
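For context, threadwise Welford updates a running mean/variance one element at a time, and the blockwise step merges partial results via Chan et al.'s parallel formula; the empty-set bug above is exactly the count == 0 corner of that merge. A hedged host-side sketch (illustrative names):

```cpp
#include <cstdint>

// Welford mean/variance accumulation.
struct Welford
{
    float mean        = 0.f;
    float m2          = 0.f; // sum of squared deviations from the running mean
    std::int32_t count = 0;

    // Threadwise update with one new value.
    void Update(float x)
    {
        ++count;
        const float delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    // Blockwise merge of two partial results (Chan et al.). Guarding against
    // a combined count of zero is the "merging of two empty sets" fix above.
    static Welford Merge(const Welford& a, const Welford& b)
    {
        Welford r;
        r.count = a.count + b.count;
        if(r.count == 0)
            return r; // both inputs empty: keep the identity state
        const float delta = b.mean - a.mean;
        r.mean = a.mean + delta * b.count / r.count;
        r.m2   = a.m2 + b.m2 + delta * delta * a.count * b.count / r.count;
        return r;
    }

    float Variance() const { return count > 0 ? m2 / count : 0.f; }
};
```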
* Rename clip to clamp
* 1. Fix type of count
2. Remove useless static_assert
* Do not inherit Reduction::Argument
* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait on both lgkmcnt(0) and vmcnt(0)
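On AMD GPUs __syncthreads() lowers to s_waitcnt on both counters plus s_barrier, so it also stalls on in-flight global loads; when only LDS ordering matters, waiting on lgkmcnt alone suffices. A sketch of what such a helper can look like under HIP (the library's actual implementation may differ):

```cpp
// Wait only for outstanding LDS (lgkm) operations, then barrier. Unlike
// __syncthreads(), this does not wait for in-flight global-memory loads
// (vmcnt), which the surrounding pipeline deliberately keeps in flight.
__device__ inline void block_sync_lds()
{
    asm volatile("s_waitcnt lgkmcnt(0)" ::: "memory");
    __builtin_amdgcn_s_barrier();
}
```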
* Add y stride
* Rename.
DeviceLayernorm -> DeviceLayernormImpl
DeviceNormalization2 -> DeviceLayernorm
* Move literal ""_uz & ""_zu into namespace 'literals'
* Move namespace 'literals' as 'ck::literals'
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in the lowest two dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring a unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* add gemm_gemm instances and tests
* avoid LDS data hazard
* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Update the reduce_blockwise example to support user specified data type and input+reducing dimensions
* Add examples for using reduce_multiblock_atomic_add
* Add more running examples to the default command-line
* Remove unnecessary header includes
* Update to the example README.md
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in the lowest two dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring a unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* attention host validation
* add blockwise softmax v1
* iteratively update softmax+gemm
* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
* add init method for easier debugging
* do away with manual thread cluster calculation
* generalize blockwise softmax interface
* row-wise softmax sum & max
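A host reference for the row-wise max/sum softmax being wired up here, as a minimal sketch (the kernel versions compute the same two reductions blockwise):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable row-wise softmax: subtract the row max before
// exponentiating, then normalize by the row sum of exponentials.
void softmax_rows(std::vector<float>& data, std::size_t rows, std::size_t cols)
{
    for(std::size_t r = 0; r < rows; ++r)
    {
        float* row          = data.data() + r * cols;
        const float row_max = *std::max_element(row, row + cols);
        float sum           = 0.f;
        for(std::size_t c = 0; c < cols; ++c)
        {
            row[c] = std::exp(row[c] - row_max);
            sum += row[c];
        }
        for(std::size_t c = 0; c < cols; ++c)
            row[c] /= sum;
    }
}
```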
* format
* rename to DeviceBatchedGemmSoftmaxGemm
* add gemm_softmax_gemm instances and tests
* comment
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* [LWPCK-359] Initial commit
* Working version for fp16, add results to readme
* Update according to PR #341
* Update results in readme
* Add fp32 example
* Add bf16 example
* Update fp16 and fp32 examples
* Add int8 example
* Add separate lengths and strides tensors for D tensors
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
* Add always_false<> util to delay symbol resolution
* Use always_false<> to prevent trying instantiate unwanted method
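The always_false<> idiom in the two entries above: because the value depends on the template parameter, the static_assert fires only when a specific specialization is actually instantiated, whereas a bare static_assert(false, ...) would fire at definition time. A minimal sketch:

```cpp
#include <type_traits>

// Dependent false: evaluated only on instantiation of a specialization.
template <typename... Ts>
inline constexpr bool always_false = false;

template <typename T>
void unsupported(T)
{
    static_assert(always_false<T>, "this type is not supported");
}
```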
* Add new specializations of AddAddFastGelu::operator() method
* Add GEMM + AddAddFastGelu examples for data types: int8, bf16, fp32
* Use floating point literal to simplify code
* Remove unnecessary capture in lambda expressions
* Extract fast GeLU calculation as standalone method
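For reference, the widely used tanh approximation of GeLU, which such a standalone method typically computes (the library's exact formulation may differ):

```cpp
#include <cmath>

// gelu(x) ~= 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
inline float fast_gelu(float x)
{
    const float k = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.f + std::tanh(k * (x + 0.044715f * x * x * x)));
}
```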
* Mark methods as 'constexpr'
* Add constraint for HostTensorDescriptor templated ctors
* Simplify HostTensorDescriptor ctor calls
* Add C++23 std::size_t literal suffix
* Use _uz suffix to shorten example code
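C++23 spells this suffix `uz`; pre-C++23 code can emulate it with a user-defined literal like the `""_uz` seen earlier in this log. A standalone sketch with an illustrative namespace:

```cpp
#include <cstddef>

namespace literals {
// Pre-C++23 stand-in for the std::size_t literal suffix (C++23: 42uz).
constexpr std::size_t operator""_uz(unsigned long long n) { return n; }
} // namespace literals

using namespace literals;

// Avoids C-style casts like (std::size_t)4 * 1024 in example code.
constexpr std::size_t offset = 4_uz * 1024_uz;
```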
* Remove unnecessary conversion to std::array<>
* Re-order include directives
* Remove C-style casting by literal suffix
* Remove unnecessary statements in main()
* Remove unused type parameter of always_false<>
* Remove unused include directive
* Exit main() by returning meaningful value
* Use 'if constexpr' to switch example flow
* Use std::is_same_v<> to shorten example code
* Add 'inline' specifier to literal functions
* Unify output methods in example
* Move common codes into .inc file
* Add type check in type_convert<>()
* Add type_convert<float>() before computation
* Merge AddAddFastGelu method specializations
* Remove always_false<>
* Add constraint to AddAddFastGelu::operator() parameter types
* Add int8 specialization for elementwise Add and Subtract.
* CGEMM examples bf16, fp32, int8
* Add convert reference output to CDataType.
* Skip BF16 data type during testing.
* Lower K value to get rid of accumulation error.
* Fix merge artifact.
* Fix changed function name: GetElementSpaceSize()
* Fix merge artifact.
Co-authored-by: Adam Osewski <aosewski@amd.com>
* Implement layernorm kernel and deviceOp
* verify gpu kernel with host code
* 1. Separate gamma and beta from affine
2. Check if argument is valid
* clean
* Sync the naming
* Support sweep-once mode when the whole K dimension fits inside one block
* [What] Get length from upper length.
[Why] If we get the length directly, we may get the length after padding.
* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.
* Use 1d descriptor for gamma and beta
* Add accElementwiseOp
* Extract layernorm host code
* Support different YVectorDim in GridwiseLayernorm
* Rename XSrcVectorDim to XYSrcVectorDim, because we use the same parameter in the deviceOp
* Gamma and beta can share the VGPR.
* Add test for fp32 and fp16
* Fix concurrency bug and add a test case that could originally fail
* Propagate NaN for layernorm
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* format
* improving pipeline
* fix typo
* format
* adding thread group
* adding thread group
* adding thread group
* adding gemm pipeline
* tweak
* refactor
* refactor
* add missing type convert
* refactor
* refactor
* refactor
* clean
* fix build
* refactor
* format
* clean up
* use remove_cvref_t
* clean
* use pipeline_v2 for gemm kernel
* Remove inconsistent indent
* Fix compilation errors due to incomplete merge process
* Add missing include directives
* Fix compilation errors in currently unused files
* Add license in newly added files
* Re-format touched files by clang-format-10
* Fix wrong template argument count of DeviceGemm<>
* Use language construct to choose between types
* Use language construct to choose GEMM example instance
* Fix compilation error due to interface change
* Re-use type alias to avoid duplication
* Unify type alias usage in source file
* Only use v2 pipeline in one gridwise GEMM type
* Remove no-longer used include directives
* Add static_assert() to check pipeline type requirements
* Revert "Add static_assert() to check pipeline type requirements"
This reverts commit f0985f0a13.
* clean
* clean
* clean
* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>
* dump lds content in appropriate precision type
* add squared-add reduction op; allows sum of squares
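A sketch of such a squared-add accumulator, so a single reduction pass yields the sum of squares that the normalization needs (the real operator also carries `__host__ __device__` qualifiers):

```cpp
struct SquaredAdd
{
    // acc += x * x: reducing with this op over a row produces sum(x^2),
    // which combined with sum(x) gives mean and variance in one sweep.
    template <typename T>
    constexpr void operator()(T& acc, T x) const
    {
        acc += x * x;
    }
};
```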
* initial stub from regular gemm impl
* layernorm example code & host verification
* initial layernorm implementation
* tidy up
* make C0 precision type consistent with C
* clang-tidy and additional comments
* tighten up example code
* account for extra flops/bytes from normalization
* clang-format
* c0 bias/beta/gamma now have their own precision type
* AccElemOp for gemm outputs prior to feeding to layernorm
* update workgroup mapping
* rename kernel template param to reflect its dual use
* use LDS mem pool for reduction workspace
* change cshuffle precision type to f16; clean up
* clang-format
* correct naming
* explicit cast
* fully implemented gemm + bias + activation + add + norm
* activation in correct order
* reflect reduction API's recent change
* amend
* clean up; add comment
* keep up with recent changes in reduction API
* format
* resolve merge conflicts
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* use 'sweep once' softmax kernel where applicable
* threadwise copy's dst buffer can specify invalid element value
* add int8 in/out float compute softmax support
give a bit of leeway for the int absolute tolerance, as a single data point across all test cases shows an off-by-one error
* format
* softmax inherits DeviceNormalization
* softmax profiler stub
* tighten up reference softmax interface
* example prints tensor dimension
* add fp32 to softmax profiler
* rename header
* hook with ckProfiler
* format
* resolve merge conflict
* resolve merge conflicts
* update normalization profiler help string
* resolve conflict
* typo
* remove residual
* softmax profiler: address feedback
* test for mixed precision input/output
* fully qualify ck::math::isnan
* add comment for device normalization interface
* revise wording
* constness for alpha/beta scaler pointer
* Extract base class for elementwise
* Refactor interface of DeviceGemmReduce. Do not use tuple in interface
* [What] Rename d to reduce in gemm + reduction related code
[Why] Prepare to add d term for add
* Unify base class of gemm + reduce and gemm + bias + add + reduce
* 1. Rename gemm_bias_add_reduce for the external API
2. Refine CMake
* Add normalize device operation
* [What] Reorder the argument
[Why] Because d0 is also the input of c.
* Add type string
* Add example of gemm_bias_add_layernorm via external api
* Refactor example code
* clang-format
* Fix compile error
* clang-format
* Add external api for gemm_add_add_layernorm and normalize
* Add client example
* clang-format
* FillUniform with integer values.
* Log tested instance type string.
* Add UT for all convolution specializations.
* debugging conv
* Fix dangling reference bug.
* Small refinements.
* Fix call to error checking function.
* Small refinements to tests.
* Configure error tolerance
* Change problem size.
* Remove OddC case from types that do not support it.
* Add helper traits for AccumulatorDataType.
* Print first 5 errors in check_err for integral types.
* Rename FillUniform to FillUniformDistribution
* Refactor
* Do not use typed tests.
* Instead use a plain fixture class with templatized member functions.
* Initialize tensors with integer values.
* Refine test instances.
* Properly set accumulator data type.
* Add another "big" instance.
* Refactor convolution tests.
* Revert "debugging conv"
This reverts commit b109516455.
* Add pragma once + format + small refinement.
* Fix some unwanted changes.
* Clang-format
* Fix profile_convnd to use renamed tensor initializer.
* Add instances for ConvFWDND kernel case 2D
* Helpers to get ConvNDFwd 2D instances.
* Refactoring.
* Remove "small block" instance as it was generating compiler errors.
* Remove default template parameters values.
* Refine and fix test.
* Fix problem with default template parameter types.
* Adjust error thresholds for floating point values test.
* Use integer values initialization for instances test.
* Add tests for ConvNDFwd 2D case.
* Remove AccumulatorDataType type trait.
* Update unit-tests.
* Remove operator<< overload.
* Unlock conv1d/3d nd fwd instances.
* Enable skipping calculating reference using flag.
* Fix number of channels for first ResNet50 layer.
* Clang-format.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* initial stub for standalone softmax
* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m
* host softmax validates
* compiles; to implement beta scaling
* use NaN trick to efficiently ignore OOB values during sum of exponentials
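The log does not spell out the trick itself; a common way to get the same effect, shown here instead as a hedged sketch, is to pad out-of-bounds slots with -infinity so they neither win the max reduction nor contribute to the sum of exponentials:

```cpp
#include <cmath>
#include <limits>

// Not necessarily the commit's exact technique: padded lanes hold -inf, so
// max(x, -inf) == x and exp(-inf - row_max) == 0, removing them from the sum.
float padded_exp_sum(const float* x, int n, int padded_n, float row_max)
{
    float sum = 0.f;
    for(int i = 0; i < padded_n; ++i)
    {
        const float v = (i < n) ? x[i] : -std::numeric_limits<float>::infinity();
        sum += std::exp(v - row_max); // contributes 0 for padded lanes
    }
    return sum;
}
```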
* freeload device_reduce's utility functions
* clean up interface
* adding prior value (beta scaling)
* remove restriction related to perf considerations
* apply clang-format
* clean; disable diagnostics
* resolve conflicts
* add exp wrapper
* honor HostTensorDesc interface; allow implicit cast from different vector<T> type
* test softmax for fp16/fp32
* update readme
* amend commit NaN trick
* remove redundant param added during development
* format
* replace ScalarDataType with AccDataType
* separate out test programs by precision type
* move softmax sample code to its own folder
* format
* keep up with recent changes in reduction API
* remove extra header
* Remove template from Reduction operation classes and add template to their operator() and GetIdentityValue() interfaces
* Change to unary elementwise operators and the reduce_unary_operator (class for mapping) and dependent variations in all host layers
* Remove the data type template parameter from reduce_binary_operator (class for mapping) and dependent variations in host layers
* Add InMemoryDataOperatonSupportedOnDataType to check the matching between data type and InMemoryDataOperation
* Use struct-scope operator template instantiation for binary and unary element-wise operations
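The shape this refactor leads to, as a hedged sketch: the operation struct itself is type-agnostic, and the data type is chosen per call site rather than per class instantiation:

```cpp
struct Add
{
    template <typename T>
    static constexpr T GetIdentityValue()
    {
        return static_cast<T>(0);
    }

    template <typename T>
    constexpr void operator()(T& a, T b) const
    {
        a = a + b;
    }
};

// Usage: float acc = Add::GetIdentityValue<float>(); Add{}(acc, 1.5f);
```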
* Change a few more elementwise operations to use template for operator()
* Tiny correction in Normalize operator
* Add static_assert to check the data type applicability for some reduction accumulator and element-wise operations
* Correction in some examples with regard to using ReduceAccDataType
* Use static_assert for UnaryDivide
* Update merged code to use element-wise operations and reduction accumulator operations correctly
* Tiny fix with regard to SetWorkSpacePointer()
* Copy "gemm reduce" to "gemm bias add reduce"
* Implement gemm bias add reduction
* Fix compiler error due to merge from develop
* Add tensor operation for gemm + bias + add + reduce
* Add gemm_bias_add_reduce to ckProfiler
* Add c1 functor
* Refine type
* Use reduceAccDataType instead of explicitly using float
* Change to use check_err()
* Do relu in float32 instead of bhalf_t, because bhalf_t is an unsigned integer type
* Refactor relu: use a type trait instead of overloading
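A self-contained sketch of the idea, using a stand-in bhalf_t (bf16 stored as a raw 16-bit pattern, so comparing the raw value with zero is meaningless) and an `if constexpr` type-trait dispatch instead of overloads:

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

using bhalf_t = std::uint16_t; // stand-in: bf16 stored as a raw bit pattern

inline float bf16_to_float(bhalf_t x)
{
    const std::uint32_t bits = std::uint32_t{x} << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

inline bhalf_t float_to_bf16(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<bhalf_t>(bits >> 16); // truncation, ignores rounding
}

// Type-trait dispatch: do the comparison in float whenever T is bf16.
template <typename T>
T relu(T x)
{
    if constexpr(std::is_same_v<T, bhalf_t>)
    {
        const float f = bf16_to_float(x);
        return float_to_bf16(f > 0.f ? f : 0.f);
    }
    else
    {
        return x > T{0} ? x : T{0};
    }
}
```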
* Rename DxsReduceAccElementwiseOperation to DxsReduceAccElementwiseOperation
* Fix denominator
* Refine naming
* Fix denominator in host
* Remove useless include header
* Use AccDataType
* Fix static_cast order
* Refine type
* [What] Remove tuple type in the base class
[Why] The external API depends on the base class; if the base class were tied to a type, we would need many classes for different types
* add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight
* add bwd weight for bf16: init
* remove redundant compute
* use the data type and split-K to check whether a workspace is used
* remove unused computation for work space size
* add some code for bf16
* add device/grid unary op
* add unary type convert to bwd-weight example
* support bf16 splitk kernel for convnd bwd weight
* 1. Remove comments. 2. Add validity check. 3. Add grid size computation
* add workspace size check
* fix format
* change function name