* Introduce LocalBlockToCTileMap.
* Change the signature of CalculateBottomIndex() function which now does
not accept any argument. The B2C map which is already passed as an
argument to the kernel Run function is calculating block's local id
already outside at kernel entry point __global__ function.
The LocalB2C map stores as members local block ID.
* Use LocalBlockToCTile map in device ops.
* First draft of tile loop work distribution.
* Fix typo.
* Simplify kernel arguments.
Calculate descriptors & B2C maps on the device.
* Use looping kernel.
* Fix B2C constructor.
* Fix Navi21 errors.
* Calculate tile start/end in device kernel.
* Change Run API to accept user provided workspace buffer.
* Add new line at EOF.
* Move Gemm KernelArguments to device op interface.
* Remove unused code.
* Update API.
* Launch grid size which is min of occupancy vs tile count
* Get back to use constant memory for gemm descriptors.
* Remove unused code.
* Add default virtual method implementation.
* Update comments to conform with doxygen style.
* Fix doc style and unused parameters.
* Add thread cluster lengths to kernel name.
* Remove old splitk impl and replace it with tile looping one.
* Modify instances.
* set KPerBlock to 64
* maximize wherever possible vector load size.
* Fix instances cluster lengths.
* Change comment style.
* Use 128b store where possible in instances.
* Update test cases, since KPerBlock has doubled.
* Update output stream operator for Sequence.
* Add pipeline version to GroupedGEMM device op type string.
* Fix pipeline version type logging.
* Fix input tensors type after merge.
* Fix compiler error.
* Fix output stream operator for Pipeline version.
* Store using 128b.
* Set of instances with kpb 32/64
* Limit number of instances
* Remove commented out instances.
* Fix function name.
* Limit the number of instances.
Add pipline version to the regular instances
* Change thr cluster layout for reading B tensor.
* disabled failed instances
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
[ROCm/composable_kernel commit: a4f72a314a]
* workaround nan problem by changing output to fp16
* enable f8/bf8 gemm tests on MI200
* workaround f16 to f8 conversion
---------
Co-authored-by: Jing Zhang <jizha@amd.com>
[ROCm/composable_kernel commit: ac9595a9f1]
* Handle type conversions to a const datatype
* Review: Handle X being const data type as well
* Review: Remove typo
[ROCm/composable_kernel commit: f4af5aed8b]
* Redesign the DPP8 GEMM kernel to use warp-wise component
* Review: Improve error messages
* Review: Remove unnecessary empty lines
* Review: Fix M, N per thread names
* Review: Rename mfma_input_type to dpp_input_type
* Review: Fix tensor adaptor; remove unnecessary element
* Review: Remove calls to dpp_gemm's MakeCDescriptor
* Review: Add blockwise doc, change function names to include dimension names
* Review: Remove duplicated code; Move Block2CtileMap alias to the top of the file
* Review: Add __restrict__ keywords
* Review: Use MatrixPadder for padding A, B, C matrices
* Review: Remove hardcoded datatypes
* Review: Change names from FloatX to XDataType
* Review: Introduce AK0 and BK0 instead of a single K0
* Review: Remove construction of dpp_datatypes object
* Review: Rename DppInstrRunner to DppLanegroupGemm
[ROCm/composable_kernel commit: 37a8c1f756]
* Add maxpool instances
* Rename index pool to max pool.
* Add maxpool bwd bf16 instances
* Add avg pool bwd instances
* Rename avgpool and maxpool to avg_pool3d and max_pool
* Add bf16 pool fwd instances
* Add max pool bwd to ckProfiler
* Add avg pool3d bwd to ckProfiler
* Add avg pool bwd test
* Fix bug of reference pool fwd (dilation)
* Fix bug of max pool bwd (dilation and initZero)
* Support bf16 compute data type
* Force compute type be f32. Because atomicAdd only support f32
* Add max pool bwd test
* Rename folder
* Rename pool
* Add max pool bwd client example
* Add avg pool bwd client example
* Add missing workspace
* clang format
* Rename macro
* remove useless header
* remove useless layout
[ROCm/composable_kernel commit: 866377de18]
* initial stream-k implementation with example
* fix unexpected change in err
* improve a little bit performance by reorganize pipeline.
* improve perf a little bit by swizzle block idx
* add profiler
* update example
* fix spelling
* shrink karg for streamk
* support dynamic buffer using memory coherence glc_slc bit from template
* control memory coherence while construct dynamic buffer
* update reduction for streamk(not ready yet)
* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
* fix build issue
* fix several bug
* now result is correct, everything works (but has scratch)
* remove scratch by manually reset coordinate
* update device code
* fix a bug in final reduce
* fix something in example
* update async memset
* fix enum as camel case
* modify coherence enum name
* clean code and use atomic streamk by default
* remove unused var
* throw exception if have empty pointer
* fix format
* fix CI warning
* fix type in init
* modify CI error
* filter out on gfx10+
* restore changed example code
---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
[ROCm/composable_kernel commit: e7dca79d27]
* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]
* Add CountDataType as template parameter in blockwise_welford
* Add utility/get_shift.hpp
* Add BatchNorm multiblock single-kernel implementation
* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a
* Renaming in device_batchnorm_forward_impl.hpp
* Tiny fix in the batchnorm_fwd profiler
* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"
This reverts commit d16d00919c.
* Use the old two-kernel batchnorm multiblock method for gfx1030
* Use the old two-kernel batchnorm multiblock method for gfx908
* use the single-kernel batchnorm multiblock method only for gfx90a
* Remove get_wave_id() from utility/get_id.hpp since it is not used
* Set true for testing running mean/variance and saving mean/invvariance in the examples
* Fix to copy-right words
* Remove un-needed including in utility/get_id.hpp
* Add comments to workgroup_synchronization.hpp
* Remove un-used codes in gridwise_multiblock_batchnorm_forward.hpp
* Renaming in the kernels
* Remove un-used kernel file
[ROCm/composable_kernel commit: 8f5cafaf04]
* Add basic fp8 definitions and prn-generator
* Format
* Add fp8<->fp32 type_convert
* Format
* Split type_convert and cast_to/from_f8
* Format
* Minor fix
* Minor fix
* Move fp8 utils to a separate header
* Add elementwise ops
* Add fp8_convert_sr
* Format
* Add element op
* Eliminate magic numbers
* Split f8_convert_sr in host and device
* Format
* Add some constexpr
* Add a datatype test
* Format
* Another format
* Add fp8<->fp16 tests
* Update type_converts
* Format
* Add fp16 casting functions
* Format
* Use seed as a runtime arg
* Use element location for PRNG
* Format
* Add fp8<->fp16 to PassThrough element op
* Clean up
* Merge host and device implementations
* Add comments on rounding modes
* Remove leftover code
* Put type_converts into a separate header
* Put random number gen to a separate header
* Rearrange f8_utils' namespaces
* Refactor type_convert.hpp
* Move f8_t definition
[ROCm/composable_kernel commit: f0c620c42e]
* enable gfx941/942 targets
* fix clang format
* fix the cmake logic for multiple targets
* fix cmake syntax for looping over targets
* add gfx941/942 support for gemm_xdl instances
[ROCm/composable_kernel commit: 027e46ee82]
* Add overloaded version of __builtin_amdgcn_readfirstlane()
* Remove 'static' specifiers
* Remove more 'static' specifier
* Replace unsigne char by std::byte
* Add 'const' specifier to never changing variable
* Add 'inline' specifier to funcion definition
* Fix wrong boundar calculation logic
* Rename type trait
* Remove std:: qualifier from standard types
* Replace 'size_t' by 'unsigned'
* Use type alias to hint usage
* Replace static_for<> by ordinary 'for' loop
* Rename readfirstlane() to amd_wave_read_first_lane()
* Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp
* Reorder statements
[ROCm/composable_kernel commit: 582e31e88d]
* Add TypeConvert class and start refactoring
* Refactor TypeConvert as a struct
* Get back to template functions type_convert
* Add a type_convert_bf16_rtn, set rtz as default
* Clean up
* Add UnaryConvertPrecision struct for high-precision workloads
* Format
* Update type_convert to UnaryConvert on threadwise level
* Update UnaryConvertPrecision
* Format
* Fix chmod
* Add a flag to pick converion method
* Format
* Remove the added flag
* Merge elementwise op with type conversion
* Move type_convert to elemwise op, update the op
* Update type_convert_precision -> bf16_convert_rtn
* Clean up
* Update comments
* Update the CK_WORKAROUND_DENORM_FIX flag handling
* Update the unneeded op to work but warn user
* Remove the message
* Use a PassThrough instead of ConvertBF16RTN to calcaulate reference
* Format
* Add missing include
[ROCm/composable_kernel commit: b076a02ad2]
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Ractor instance generation, split into multiple cpp file
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
[ROCm/composable_kernel commit: ed3a2e5226]
* Add type_convert implementations for bf16
* Add the fix for conv_fwd
* Add the fix for conv_bwd_data
* Add the fix for conv_bwd_weight
* Format
* Format
* Another format
* Add a macro to use workaround on MI200 only
* Format
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
[ROCm/composable_kernel commit: dbd8f94bef]
* Add conv perlayer quantization
* Add gemm_dlops quantization
* Support int8 for innerproduct
* Refine gemm dlops int8 kernel parameter
* Support gfx908(MI100) and gfx90a(MI200)
* clang-format
* Rename example number
* Support different layout for d tensor
* Add conv dlops perchannel quantization example
* Move to example 40
* Extract the common code for different platform (dlops and xdlops)
* Move ot subfolder. Prepare to add other op of quantization
* Refine the quantization instance library
* Add conv dl instances and client example
* Remove unnecessary type
* Add gemm quantization instance
* Add external api and client example
* Refine num_bytes
* Separete different layout to different cpp
* Add more xdl instances
* Revert "Remove unnecessary type"
This reverts commit 820869182f.
* Remove CShuffleDataType in dlops
Let acc and CShuffleDataType be the same in xdlops
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
[ROCm/composable_kernel commit: 16dc18e0f9]
* Sync the order of type string with template parameter
* Add more instances
* Check the vector size and remove redundant var
* Extract var to static, prepare to separate sweep once kernel
* Separate sweeponce flow and optimize the flow
* 1. Rename AccDatatype in normalization to computeData
2. Rename AccElementwiseOperation to YElementwiseOperation in normalization
* Remove useless code
* Update naive variance kernel
* Refine string
* Fix typo
* Support naive variance for device_normalization
* Check the blocksize
* Share the VGPR of x and y
* Share the VGPR of gamma and beta
* Add more instances
* Support fp16 sqrt for experiment
* Add CHANGELOG
* Fix typo
* clang-format
[ROCm/composable_kernel commit: 6a6163a3d1]
* Add gemm + layernorm instance
* Add ckProfiler
* Add test
* Add client example
* Detect if user forger to set the workrspace
* Use literal in the example
* [What] use builtin function for sqrt
[Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt()
* check gemm vaildity in IsSupportedArgument
* Add more testcases
* Merge duplicated folder in client example
* Print more infomation
* Use better kernel parameter for MS problem size
* clang format
* Add constexpr for if condition and remove redundant include
* Remove cstdlib and add constexpr
[ROCm/composable_kernel commit: f7d28f3e4b]
* Change to the DeviceReduce base class template to include all problem description information
* Add external api for reduction
* Add client example to test the reduction external api
* Spelling correction
* Re-implement the host_reduction to follow the DeviceReduce base API format
* Change the reduce profiler to call the external API for collecting device instances
* Rename reduce client example directory from 08_reduce to 12_reduce
* Remove (void) before the functional call
* Tiny update in reduce client example
* Tiny update in profile_reduce_impl.hpp
* Rename the reduce client example directory
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: 80e0526741]
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* tempsave
* tempsave
* tempsave
* runtime bug, cannot find symbol
* workaround for incorrect HIP warpSize return value
* debugging
* tempsave
* Correctness OK, waiting for optimization
* Tidy up + format
* temp save
* temp save, reproduce the v_bfi_b32 issue
* add inline asm for wmmaop test
* tidy up
* clean some debug purpose code
* discard some codes
* clang format
* clang format
* compiler issue fixed + increase tile size
[ROCm/composable_kernel commit: 919aeb1f52]
* Let cmath included when compiling host codes in math_v2.hpp
* Remove including of cmath in device_base.hpp and device_permute.hpp
[ROCm/composable_kernel commit: a17b041486]
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* Remote int4 related
* delete deprecated test
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: abf9cc6c5c]
* fixed bug in softmax reference & add bf16 examples for batched_gemm_scale_softmax_gemm
* added bf16 tests for batched_gemm_softmax_gemm_permute
* changed format of device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
* changed format device_batched_gemm_softmax_gemm_permute_xdl_cshuffle_bf16_bf16_bf16_bf16_gmk_gnk_gno_gmo_instance.cpp
* aligned annotations
* modified CMakeLists for examples
* add common example code of fp16/bf16 version for batched_gemm_scale_softmax_gemm_xdl
* use macro to control the instances
* added macro control into instances
* clang-format some files
* changed error tolerance for bf16
* changed index for 10_elementwise_normalization
* fixed xdlops code bug in amd_xdlops.hpp
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: 4c4c7328a6]
* Add example folder for 'DeviceElementwise'
* Re-structure example files
* Move common parts into common.hpp
* Use more strict input
* Add more helper methods in 'DeviceElementwise'
* Use more specific method to write example
* Allow specify problem through command line argument
* Allow specify problem 'axes' through command line argument
* Add check to template type argument
* Add transpose_shape() to generalize shape permute
* Generalize transpose utility functions
* Use better name for tensor indices
* Add checks in helper functions
* Remove debug messages
* Refine error message for check_err()
* Generalize variable naming in example code
* Add device op 'DevicePermute'
This device op is clone of 'DeviceElementwise'
* Use 'DevicePermute' device op in example
* Remove 'elementwise' from identifiers
* Remove 'elementwise' from file paths
* Remove base class of 'DevicePermute'
* Let 'DevicePermute' inherit from 'BaseOperator'
* Add simple type traits to validate device op type
* Add static_assert() to check type constraints
* Create 'DevicePermuteBase' to generate methods
* Use indirect base type to generate methods
* Remove 'is_device_op<>' type traits
* Only accept single-input-single-output for 'DervicePermute'
* Simplify 'DevicePermute' interface
* Re-format 'DeviceElementwise'
* Use CRTP to generate overridden virtual method
* Remove unnecessary include directives
* Distinguish input & output shape in 'DevicePermute'
* Passing 'axes' to 'DevicePermute'
* Use more reasonable return value for Invoker::Run()
* Add 'GridwisePermute' kernel
This kernel is a clone of 'GridwiseElementwise_1D'
* Remove no-longer used type argument
* Check if input/output shape meet the requirement
* Remove no-longer used method
* Remove never-entered-if-clause
* Change problem description for 'DevicePermute'
* Transform descriptor into 3 dimensions
* Add debug code the verify result
* Add comment to indicate template argument location
* Add N/H/WPerBlock template parameter to 'DevicePermute'
* Rename 'GridwisePermute' to 'GridwiseCopy'
* Check tensor descriptor dimensions in 'GridwiseElementwise_1D'
* Add missing include directive
* Add 'BlockSize' parameter to 'DevicePermute'
* Remove no-longer used method
* Add 'BlockToTileMap' for 'GridwiseCopy'
* Use the normal Block2TileMap convention
* Rename 'BlockToTileMap' as 'Block2TileMap'
* Fix most of compilation errors
* Let 'Block2TileMap' map block to 2d coordinate
* Allow data transfer in 'GridwiseCopy'
* Fix wrong output descriptor for 2nd blockwise copy
* Rename 'GridwiseCopy' as 'GridwisePermute'
* Remove '1d' in identifiers
* Remove commented-out codes
* Remove 'MPerThread' template parameter
* Seperate template parameters
* Unify variable namming convention
* Use more verbose way to create expressions
* Add template parameter 'InBlockLdsExtraW'
* Release the constraint on In/OutGridDesc
* Use date type directly as template argument
* Re-arrange template arguments for blockwise copy
* Remove no-longer used template parameters
* Embed layout in the variable names
* Add GridwisePermute::CheckValidity()
* Extract local types as template parameters
* Rename local type alias
* Add more template parameters (vector width related)
* Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions
* Fill tensor values start from 1
* Re-formate example code
* Avoid too-large block id
* Add comment
* Make sure 'SrcVectorDim' is not same as 'DstVectorDim'
* Add check for the 'VectorDim' & 'ScalarPerVector' template params
* Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc
* Remove no-longer used template parameter 'NPerBlock'
* Fix wrong descriptor creation logics
* Specify problem in each examples
* Use better example name
* Add new example 'example_permute_NxHxW_fp32'
* Add example for demonstrating bundle multiple elems in tensor
* Add support to permute multiple elements together
* Change the default problem size
* Add span<> class template
* Use span<> to generalize check_err() interface
* Fix ambiguous ctor call
* Avoid create necessary objects
* Use helper functions to simplify example code
* Add example for 4xfp16 permute
* Disable failed-to-compile example
* Add check for the NUM_ELEMS_IN_BUNDLE
* Remove redundant parameter in helper lambda function
* Add check for the input tensor type's byte-size
* Check scalar-per-vector with padded length
* Use more verbose name to avoid name collision
* Use fixed 'VectorDim' & 'ScalarPerVector' for LDS
* Embed shape info in name of descriptor constructor
* Rename example folder '36_permute' into '37_permute'
* Avoid using too-large LDS in kernel code
* Remove redundant example
* Usw switch() to group similar codes
* Add const to the span<> type arguement
* Simply initialize tensor with floating point values
* Use fp16 as data type in all examples
* Enlarge tensor size in example
* Enalrge N-dim in example
* Add check for the bundled type in example
* Use more stricter error threshold
* Remove global load/store loop in kernel code
* Measure execution time by default
* Use faster device op config for example 'NxHxW_fp16'
* Use faster device op config for example '1xHxW_fp16'
* Use faster device op config for example 'HxWx4_fp16'
* Remove cmd arg parsing logics
* Rename functions
* Extract bundle permutation logic out
* Simplify permute bundle example
* Add Tensor<>::GetElementSpaceSizeInBytes()
* Add Tensor<>::data()
* Use new methods to simplify code
* Use type alias to replace duplicated code
* Use existing method to shorten code
* Allow FillUniformDistribution accept range arugment
* Intialize random values in range
* Add Tensor<>::size()
* Use more meaningful names in permute bundle example
* Use more meaningful names in permute element examples
* Use rangified copy() to copy elements
* Use function return value directly to eliminate variables
* Add to_array() conversion tool to eliminate more variables
* Add Tensor<>::AsSpan<>() to create view of tensor values
* Use AsSpan() to shorten check_err() calls
* Remove no-longer-used 'using' directives
* Move 'using' directive to proper code position
* Remove redudant variables
* Remove useless static_assert()
* Add check for range types
* Declare variable right before first use
* Move long return type as tailing return type
* Add BaseInvokerCRTP<> class template to generate method
* Create new base type for 'DervicePermute' implementations
* Move 'NumDim' template param to the first
* Rename 'DevicePermute' to 'DevicePermuteImpl'
* Add 'noexcept' specifier to CRTP generated method
* Move 'Block2TileMap' definition into 'GridwisePermute'
* Use type alias to reduce code
* Unify naming style in 'DevicePermute'
* Add comments in 'GridwisePermute'
* Rename permute example folder
* Use std::cerr to report error
* Use larger shape in examples
* Rename '38_permute' to '39_permute'
* Make sure we use unsigned type for shape & indices
* Remove opt-ed out assertion
* Remove template BaseInvokerCRTP<>
[ROCm/composable_kernel commit: f584ab0c54]
* init commit of convnd bwd data
* begin compiling example
* have a first version that produce a right result
* refine device level launch kernel code
* add more instances in example and get right results
* clang-format
* format example file
* add more instances
* fix instances
* adding conv_bwd_data multile_d
* adding conv_bwd_data multile_d
* adding conv_bwd multiple d
* adding conv_bwd multiple d
* adding conv_bwd multiple d
* refactor
* refactor
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* refactor
* update conv fwd's bias impl
* refactor
* reorg file
* clean up cmake
* clean
* clean
* clean
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: 27858374ac]