* smoke and regression targets working with tests
* test filters work for both examples and test
* removed uneccesary comments
* added a missing comment
* added a missing comment
* fixed typo in the comments
* updated README
* Update PULL_REQUEST_TEMPLATE.md
updating the template for future addition of test cases
* Update PULL_REQUEST_TEMPLATE.md
[ROCm/composable_kernel commit: 54de3e55e1]
* Disable building DPP kernels by default
* Disable building dpp instances, examples, or tests if DPP_KERNELS is not set
* Add new DPP_KERNELS flag to readme
[ROCm/composable_kernel commit: 26b3829c02]
* (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds
* (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds
* (4/5) grouped conv pass
* (5/5) attention pass, todo: debug lds perf bug
* AIT Attention API refactor (#8)
* sanity pass
* sanity pass 2
* confirm significant performance regression.
* turn on all instances
* turn off instance format
* Fix bug & tunning & format
* DML meta, self_attn+cross_attn
* sanity pass
* remove useless flag
* update tile and problem size used in AIT attention
* bug fix in grouped conv supporting check
* deprecate inline asm wmma
* Bug fix: double lds skip
* clang-format
* Fix errors in
1. example, fmha
2. gridwise pipeline
3. deviceop, fmha, change some containers from vector to array
* part2 of previous commit
* clang format
* API fix of gridwisegemmpipeline
* separate array base and vector base attention tensor transformation
* fix gemm
* clang format
* add gemm fp16 instances
* Temp save
* fpAintB kernel compile pass
* Sanity pass.
* Temp save
* debug code enabled
* Fp16AInt8B_GEMM sanity
* MQA implementation
* GQA-4 example
* tempsave
* Compile pass
* New implementation of fp16Aint8B Gemm, Acheieve similar math throughput with native fp16 Gemm
* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* initial enablement of gfx950
* fix clang format
* disable examples 31 and 41 int8 on gfx950
* initial navi4x enablement
* remove extra endif
* enabled dl_gemm
* update s_barrier and s_waitcnt for gfx12
* fix the gfx12 assembly syntax
* fixed block_sync_lds
* add support for more dl kernels on navi4
* add wmma
* format
* Todo: fix gemm_bilinear_wmma instances compilation bug
* Solve a bug when K1=16
* remove unnecessary changes
* Remove tensor layout limitation to LDS usage in tesnor contraction
* fixed block_sync_lds
* merge navi3_ref
* update self-attention and cross-attention
* fix a typo of name
* fixed layout
* debugging
* Add arch limiter for fp8 gemm
* fixed wmma
* enable fp8 gemm_xdl for all gfx9 targets
* temporarily disable gemm_xdl_fp16_fp8 on MI100/200
* fix the cmake logic for gemm_xdl_fp16_fp8
* fixed c_output
* re-enable the gemm_xdl_fp16_fp8 on MI100/200
* fixed gfx12
* fixed
* fixed
* seperate gfx12 blockwise_gemm
* fixed
* enable fwd conv on navi4x
* enable gridwise
* enabled gemm
* fixed merge
* remove empty example fold
* fixed conflicts
* some small changes
* Update cmake-ck-dev.sh
* Update cmake-ck-dev.sh
* enabled other types
* fixed register loads
* test fa
* enable gfx12
* clean up
* enable some instances on gfx12
* add gfx1201 macro in amd_wmma header
* fix clang format
* enable batched_gemm_softmax_gemm_perm_wmma for gfx12
* disable instances with blocksize=256 in attention examples
* debuggging
* debug
* fixed lds_enabled
* debugging
* Fix and add limit to skiplds feature
* Enable skipLds feature and fix compilation bugs
* add ck_tile definitions for gfx12
* fix clang format and test/wmma_op
* updage instances cmake for gfx12
* disable the test_wmma_op on gfx12
* fix the builds for gfx950
* add gfx12 and gfx950 to default target list
* clean-up cmake file
* Initial introduction of OFP8 data types.
* Renamed FP8 and BF8 tests into FP8_FNUZ and BF8_FNUZ.
* Implementation of ConvertFP32Nearest in test_fp8_ocp.
* Remove dependence on possibly undeclared alias.
* Implement FP8OCP test for stochastic rounding mode.
* Implement FP8OCP tests for half_t type conversions.
* enable bf16 atomic add on gfx950
* Implement ConvertFP32Nearest test.
* Implement ConvertFP32Stochastic test.
* Implement ConvertFP16Nearest and ConvertFP16Stochastic tests.
* Refactoring. Move FP8 definitions into a separate header file.
* Enable easy switching between architectures.
* Fix compilation error for gfx942 architecture.
* only builf gfx950 branch for gfx950 target by default
* Enable OCP build of example_gemm_xdl_fp8.
* Fix formatting.
* fix the build logic for gfx950
* Improve GEMM example verbosity.
* Add constexpr where applicable.
* fix the logic of enabling XDL and WMMA instances
* Improve GEMM example verbosity.
* Enable build of example_gemm_xdl_fp8_bf8 test.
* Fix tests for gfx1101 architecture.
* Build DPP examples only on gfx103 and gfx11 architectures.
* Optionaly run either CPU or GPU verifications with GEMM examples.
* Extend GeneratorTensor_Sequential to produce values of prescribed data types.
* Add missing constructor.
* Improve infrastructure for OFP8 data type support.
* BUGFIX. Should not use FP8 as Compute/Accum data type.
* Add custom target for grouped_convnd_bwd_weight tests.
* Can build `tests` target on gfx950.
* Bugfixes on gfx1101 architecture.
* Fix dependencies.
* Provide single point of truth for FP8 INF and NAN checks
* Prevent instantiation of operators that are not supported by FP8 data types
* Add FP8 type selection into client_axample CMakeLists.txt
* Prevent sccache server from shutting down during build
* Fix test success reporting logic
* Change default verification method to CPU.
GPU verification takes too much time to complete on the emulator.
* Make sure all tests and examples are built for gfx950
* Facilitate testing of FP8 data types on the emulator
* Introduce two new tensor generators
* Enable instances built for gfx94 to be built on gfx950
* Verify 35_splitk_gemm on floating point numbers.
splitk gemm appears to be losing precision VS reference implementation when FP numbers are involved.
* Verify 04_gemm_add_add_fastgelu on floating point numbers
* Verify 20_grouped_conv_bwd_weight on floating point numbers
* Verify 38_grouped_conv_bwd_data_multiple_d on floating point numbers
* Verify more tests on floating point data
* Fix data types and improve testing verbocity.
* Upgrade to NPI 573 build docker.
* Skip on gemm_universal tests.
The tests take too long to complete on the emulator.
Need to see if it is possible to reduce the scope of the testing to just FP8 data types.
* Fix gfx1101 build
* Document test availability
* Re-enable fp8 gemms for gfx94/95
* Cherry-pick GEMM Universal tests for FP8 data types
* Cleanup
* CK_USE_GFX94 has already been set on this branch
* Address formatting issues and leftovers
* Make fail/pass logic consistent within 01_gemm folder
Removed multiple negations in fail/pass logic to propagate `true` as the success indicator.
* Fix GPU verification reporting logic.
* Update year in copyright notice.
* Cleanup
* Use `enum class` instead of `enum`
* Remove set_property for FP8 tests
* Narrowing the scope of PR to OCP FP8 enablement only
* Add tests for OCP FP8 vector_type storage
* Enable gemm kernel on all gfx9 architectures (#227)
* clean-up
* Implement `non_native_vector_base` with `ext_vector_type` array. (#232)
* Enable support of 1, 2, 4, and 8-byte custom types in CK.
* Fix pool tests for OCP FP8 data type
* fix jenkins file
* restore cron trigger
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
[ROCm/composable_kernel commit: 08d5c02c37]
* make sure cmake can handle xnack targets
* dont build xdl instances for gfx906:xnack-
* dont build xdl tests for gfx906:xnack-
[ROCm/composable_kernel commit: b6e74be1aa]
* update build logic with GPU_ARCHS
* fix the GPU_ARCHS build for codegen
* unset GPU_TARGETS when GPU_ARCHS are set
[ROCm/composable_kernel commit: 7d8ea5f08b]
* set individual gpu targets for instances, examples, tests
* fix path to hip compiler
* fix path to hip compiler once more
* aggregate device macros in ck_tile config header
* fix the cmake logic for instances
* fix clang format
* add gfx900 and gfx906 to default set of targets
[ROCm/composable_kernel commit: 7b027d5643]
* parse examples inside the add_example_executable function
* fix the example 64 cmake file
* add xdl flag to the gemm_bias_softmax_gemm_permute example
* add filtering of tests based on architecture type
* enable test_grouped_gemm for gfx9 only
* enable test_transpose only for gfx9
* only linnk test_transpose if it gets built
* split the gemm instances by architectures
* split gemm_bilinear,grouped_conv_bwd_weight instances by targets
* split instances by architecture
* split grouped_conv instances by architecture
* fix clang format
* fix the if-else logic in group_conv headers
* small fix for grouped convolution instances
* fix the grouped conv bwd weight dl instances
* fix client examples
* only enable client examples 3 and 4 on gfx9
* set the gfx9 macro
* make sure the architecture macros are set by cmake
* use separate set of xdl/wmma flags for host code
* sinmplify the main cmake file
* add conv_fwd_bf8 instance declaration
[ROCm/composable_kernel commit: ae57e5938e]
* refactor cmake files for the tests
* refactor cmake files for examples
* fix cmake for gemm example
* fix the cmake file for all examples
* add splitting by data types in gemm_splitk instance header
* rename test to reflect only dl instances are used
* clean up CI workspace, update cmake for instances
* change the jenkinsfile syntax
* build all instances except DL on gfx11
* move workspace cleanup after stages
* clean up workspace after every stage
* isolate data types in grouped_conv_fwd header
* isolate dl instances for grouped_conv2d_fwd
* fix syntax
* fix cmake and batchnorm instances
* fix typo
* fix reduction instances
* fix grouped_conv headers
* fix syntax
* replace parsing logic for instances, replace bfp16 with bf16
* fix the client examples build
* clean up DTYPES from instances cmake files
* update the parsing logic in cmake files
* make an exception for reduction kernels
* update few remaining cmake files to handle DTYPES
* fix syntax
* fix cmake conflicts
* replace f8 with fp8 test name
* resolve conflicts for dpp instances
[ROCm/composable_kernel commit: bba085d2b5]
* init commit of convnd bwd data
* begin compiling example
* have a first version that produce a right result
* refine device level launch kernel code
* add more instances in example and get right results
* clang-format
* format example file
* add more instances
* fix instances
* adding conv_bwd_data multile_d
* adding conv_bwd_data multile_d
* adding conv_bwd multiple d
* adding conv_bwd multiple d
* adding conv_bwd multiple d
* refactor
* refactor
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* adding conv bwd data multiple d
* refactor
* update conv fwd's bias impl
* refactor
* reorg file
* clean up cmake
* clean
* clean
* clean
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: 27858374ac]
* Refactor the design of DeviceGemmMultipleDMultipleR_Xdl_CShuffle
* Add 'DeviceGroupedConvFwdMultipleDMultipleR' interface
* Add DeviceGroupedConvFwdMultipleDMultipleR_Xdl_CShuffle
* Remove 'GridwiseConvFwdMultipleDMultipleR_xdl_cshuffle'
* Add 'TransformConvFwdToGemm<>' utility class (from Chao)
* Use 'TransformConvFwdToGemm<>' to shorten code
* Fix ill-formed method declaration
* Re-implement MakeRGridDescriptor_M() function
* Change problem description
* Use macro to define layout types
* Define K-reduced output tensor layout types
* Let user to decide R output tensor layout
* Rename variables
* Add padding to the reduced output tensor if necessary
* Extract common code as helper method
* Remove debug message
* Add missing include directive
* Add partial fp16 Conv + Reduction example
* Add example verification code for 2D Conv problem
* Use type alias to simplify code
* Share code across different-dimension Conv problems
* Rename file/functions from run_conv_fwd* to run_convnd_fwd*
* Make example code more verbose
* Add code to support 1D & 3D Conv + Reduction on host
* Add more examples for data type: bf16, fp32
* Add example for int8
* Add custom target to group examples
* Use more general custom target name
* Change the description in error message
* Disable testing for example other than fp32
* Add examplel for int4 (just copy from int8)
* Fix wrong data type
* Use larger data type for intermediate tensors
* Finish int4 example
* Undefine macro PP_DEFINE_LAYOUT_TYPE() after use
* Use named variables to replace magic numbers
* Remove debug messages
* Use same A/B data type for host Conv in int4 example
* Add check for the 'RLayout' type argument
* Group same-dim-layouts together in 'LayoutSetting<>'
* Add 'final' specifier to utility classes
* Use different initialization method for examples
* Remove macro PP_DEFINE_LAYOUT_TYPE()
* Fix code-comment mismatch
* Use more reasonable initialization value for all data types
* Default use init_method=1 for all examples
* Remove never-used code
* Remove confusing out-of-date comments
* clean
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
[ROCm/composable_kernel commit: 46a675aa6f]
* add examples into grouped/batched_gemm
* adding splitK examples
* fixed splitK
* add bfp16 int8 example into splitK
* formatting
* use static_cast
* added common for batched_gemm
* add commons for examples of splitK/batched/grouped_gemm
* return true
* adjust splitK check tol
* update example
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
[ROCm/composable_kernel commit: 6091458300]
* Implement multiple-reduction in one kernel (kernels, device ops, examples)
* Add generic elementwise kernel and device interface
* Add generator for normal-distributed data initialization
* Add host refer implementation of batchnorm-forward and batchnorm-infer
* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels
* Remove un-needed including in batchnorm example
* Renaming generic_elementwise to elementiwise in kernel and device classes/functions
* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise
* Change in exampe 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise
* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise
* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp
* Removing and renaming files
* Update to synchronize gemm_layernorm client example to the generic element-wise device op API
* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming
* Merge two static member functions in device_elementwise.hpp
* Remove unary_elementwise_1d kernel and device
[ROCm/composable_kernel commit: 53ea4713af]
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in lowest two-dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring an unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* add gemm_gemm instances and tests
* avoid LDS data hazard
* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: c20a75b07d]
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in lowest two-dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring an unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* attention host validation
* add blockwsie softmax v1
* iteratively update softmax+gemm
* transpose both gemm0 and gemm1 xdl output so as to avoid broadcasting softmax max/sum
* add init method for easier debugging
* do away with manual thread cluster calculation
* generalize blockwise softmax interface
* row-wise softmax sum & max
* format
* rename to DeviceBatchedGemmSoftmaxGemm
* add gemm_softmax_gemm instances and tests
* comment
Co-authored-by: ltqin <letao.qin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: cac014f173]
* [LWPCK-359] Initial commit
* Working version for fp16, add results to readme
* Update according to PR #341
* Update results in readme
* Add fp32 example
* Add bf16 example
* Update fp16 and fp32 examples
* Add int8 example
* Add separate lengths and strides tensors for D tensors
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
[ROCm/composable_kernel commit: 0c6ef7c14e]
* Implement layernorm kernel and deviceOp
* verify gpu kernel with host code
* 1. Separate gamma aand beta from affine
2. Check if argument is valid
* clean
* Sync the naming
* Support sweep once mode if we can put k dimension data inside one block
* [What] Get length from upper length.
[Why] if we get length directly, we may get length after padding.
* We only use one block in K dimension.
Hence, we can simplify the indexing of global R/W.
* Use 1d descriptor for gamma and beta
* Add accElementwiseOp
* Extract layernorm host code
* Support different YVectorDim in GridwiseLayernorm
* Rename XSrcVectorDim to XYSrcVectorDim. Because we use same parameter in deviceOp
* Gamma and beta can share the VGPR.
* Add test for fp32 and fp16
* Fix bug of concurrency and add test case which may fail orignally
* Propagate NaN for layernorm
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: 7f21662089]
* initial stub for standalone softmax
* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m
* host softmax validates
* compiles; to implement beta scaling
* use NaN trick to efficiently ignore OOB values during sum of exponentials
* freeload device_reduce's utility functions
* clean up interface
* adding prior value (beta scaling)
* remove restriction related to perf considerations
* apply clang-format
* clean; disable diagnostics
* resolve conflicts
* add exp wrapper
* honor HostTensorDesc interface; allow implicit cast from different vector<T> type
* test softmax for fp16/fp32
* update readme
* amend commit NaN trick
* remove redundant param added during development
* format
* replace ScalarDataType with AccDataType
* separate out test programs by precision type
* move softmax sample code to its own folder
* format
* keep up with recent changes in reduction API
* remove extra header
[ROCm/composable_kernel commit: 15c89e81f0]
* Reference CGEMM + test stub
* Format.
* Incomplete simple implementation
* Library instances
* Sketch of tests
* Test fixes.
* Example added
* Cosmetics
* Add elementwise operation kernel and example
* Add comment
* Add template argument of dim . Prepare to support multiple dimension
* Rename example
* Support 1 dimension
* Add static assert
* Add comment
* Second auxiliary buffer added
* Extract pad
* Remove redundant argument
* Support any dimension for elementwise operation
* Remove line
* Let it be the multiple number of CU
* Move thread per block to the parameter of constructor
* Consuming binary ops to do A+B / A-B
* Fix + cosmetics + bf16 test commented out temporarily
* Format
* Enabling bf16 test
* Revert "Enabling bf16 test"
This reverts commit f497e2ba44.
* Fix + test reenabled
* fix build
* Revert "fix build"
This reverts commit d73102384b.
* post PR #235 merge fix
* amend
* Single workspace for cgemm + helper
* Perf calc fix
* Review remarks: static_cast
* Review remarks: binary ops templated
* Cleaning
* Removal of instances and their tests
* Review remarks from aosew addressed
* Review remark: unnecessary attribute
* Post-merge fixes
* Restrict 4gemm to PassThrough + bug fix
* Review remarks
* update licence
* change cgemm example to fp16
Co-authored-by: rocking <chunylai@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Anthony Chang <ac.chang@outlook.com>
[ROCm/composable_kernel commit: 7b1e2c379e]
* Implement reduction meand and reduction square mean
* Refine file name
* Add reduce mean and square mean
* Fix parameter name
* Add normalize device op (not implement invoker::run())
* Remove epislon
* Refine deviceop
* Add 5ary elementwise for normalization
* Add layernorm example
* layerNorm verication
* Fix compiler error due to merge from develop
* Fix typo
* Fix compile error
* Refine naming
* [What] Suport non pointer for invoker and argument
[Why] Snyc coding style with gemm
* Refine folder name
* Refine class name
* Evaluate perf of the kernel
* Fix compile error
* [What] Refine perf evaluation in example of gemm + reduction
[Why] evaluation of gemm + reduction may cause verification fail. Because evaluation will not initial global memory
* clang-format
[ROCm/composable_kernel commit: d32a67a9b6]
* start adding navi21 GEMM
* navi_gemm_km_kn_mn_fp32 compiles and passes one test.
* rename variables and functions in gridwise_gemm_dlops_v1r3
* add other 3 layouts; format instance
* adding more tuning parameters
add tuning parameters for other 3 layouts
* add gemm_dlops_f16
* tmp
* add dependence of DeviceGemm::IsSupportedArg() on arch
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* minor changes
* push gemm_dlops into profiler
* minor changes
* if using xdl or dlops is moved into profiler_gemm_impl
* minor changes
* minor changes
* remove is_xdl from profile_gemm_impl
* make IsSupportedArg dependent on arch for other device_gemm
* minor changes
* minor changes
* fix a bug in f_generate_tensor_value
* add 64x64x64 for gemm_dlops_int8
* add 64x64x64 for gemm_dlops_int8
* comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1
* fix
* start fixing tuning parameters
* monir
* minor changes
* minor changes
* minor changes
* fixing
* adding example
* adding example
* adding example
* add gemm fp32 example
* clean up
* use 128x128x16 as MNK tile in navi21 gemm example
* bug fix
* fix test
* use new block c tile
* clean
* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: shaojiewang <wsjmessi@163.com>
[ROCm/composable_kernel commit: 40b59a63cc]
* enable example of conv 1d/3d for bwd weight
* make bf16 kernel do not use atomic add
* using new gridwise gemm for bwd weight on convnd bwd weight
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: ac543313bf]
* [What] Rename the example
[Why] Prepare to add unary reduction
* Add global oparation to the parameter
* Add atomicmax
* Fix compile error
* Support atomicMax (hip library)
* Rename the reduction example
* Fix target name
* use p_d1_grid as the indicator directly
* Prevent performance issue. Let passthrough handle it.
* Implement the function template the specialize the float2
* No need to separate into two lines
* Remove empty line
* add comment
* Fix compile error due to merge from develop
* make the implementation of atomic_max / atomic_add explicit for each datatype
* Refine typo
* For future CI test
* Fix compiler error in ckProfiler
* Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe'
* simply use remove_pointer
* Rename type and var
* Refine example
* Modify reducemax example
* Fix bug in reduction
* Change initialize range
* Implement F64 version of atomicMax
* Move reduction code together
* Add buffer atomic_max
* Fix coding style by clang-format
* Integrate new api of DeviceGemmReduce_Xdl_CShuffle
* Integrate Batch gemm reduction
* Fix example
* fix example
* clean up
* Fix batch gemm tensor operation
* Fix coding style
* Fix template augument
* Fix clang format
* Keep flexible of different stride for each D tensor
* Fix compile error for ckProfiler
* Fix typo
* [What] Fix naming
[Why] Prepare to add out elementop
* Add DoutElementOp
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: rocking <chunylai@amd.com>
[ROCm/composable_kernel commit: 0ffe956ab1]
* Add elementwise operation kernel and example
* Add comment
* Add template argument of dim . Prepare to support multiple dimension
* Rename example
* Support 1 dimension
* Add static assert
* Add comment
* Extract pad
* Remove redundant argument
* Support any dimension for elementwise operation
* Remove line
* Let it be the multiple number of CU
* Move thread per block to the parameter of constructor
* rename threadPerBlock with blockSize
* Support double
* rename kernel function name
* remove redundant include header
* Refine type
* Need to the final dimension
* Refine variable name
* Refine type
* Use index_t instead of int in API
Co-authored-by: rocking <chunylai@amd.com>
[ROCm/composable_kernel commit: aafc3ac27a]
* validate examples in ctest runs
* format
* fix usage of check_err
* amend
* add example codes to custom target 'check'
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: 9f71ff48e2]
* Convolution ND
* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances
* Move convnd f32 instance file to comply with repo structure.
* Conv 1D tensor layouts.
* Formatting and use ReferenceConv
* Reference ConvFwd supporting 1D and 2D convolution.
* Debug printing TensorLayout name.
* Conv fwd 1D instance f32
* Refactor conv ND example.
Needed to support various conv dimensio.
Needed to support various conv dimensions
* Rename conv nd example director to prevent conflicts.
* Refactor some common utility to single file.
Plus some tests.
* Refactor GetHostTensorDescriptor + UT.
* Add 1D test case.
* Test reference convolution 1d/2d
* Remove some leftovers.
* Fix convolution example error for 1D
* Refactor test check errors utility function.
* Test Conv2D Fwd XDL
* More UT for 1D case.
* Parameterize input & weight initializers.
* Rename example to prevent conflicts.
* Split convnd instance into separate files for 1d/2d
* Address review comments.
* Fix data type for flops/gbytes calculations.
* Assign example number 11.
* 3D cases for convolution utility functions.
* 3D reference convolution.
* Add support for 3D convolution.
* Check for inputs bigger than 2GB.
* Formatting
* Support for bf16/f16/f32/i8 - conv instances + UT.
* Use check_err from test_util.hpp.
* Split convnd test into separate files for each dim.
* Fix data generation and use proper instances.
* Formatting
* Skip tensor initialization if not necessary.
* Fix CMakefiles.
* Remove redundant conv2d_fwd test.
* Lower problem size for conv3D UT.
* 3D case for convnd example.
* Remove leftovers after merge.
* Add Conv Specialization string to GetTypeString
* Skip instance causing numerical errors.
* Small fixes.
* Remove redundant includes.
* Fix namespace name error.
* Script for automatic testing and logging convolution fwd UTs
* Comment out numactl cmd.
* Refine weights initalization and relax rtol for fp16
* Move test_util.hpp to check_err.hpp
* Refine weights initalization and relax rtol for fp16
* Refactor common part of test conv utils.
* Move utility function to single common place.
* Add additional common functions to utility.
* Refactor convnd_fwd_xdl examples.
* Remove redundant files.
* Unify structure.
* Add constructor to ConvParams.
* And add input parameters validation.
* Modify conv examples to use single utility file.
* Remove check_error from host_tensor.hpp
* Get rid of check_indices function.
* Remove bf16_to_f32 function overload for scalars.
* Fix namespace.
* Add half_float::half for check_err.
* Fix conv params size in UT.
* Fix weights initialization for int8.
* Fix weights initialization for int8.
* Add type_convert when store output in ref conv 1D.
* Get back old conv2d_fwd_xdl operation.
* Silence conv debug print.
* format
* clean
* clean
* Fix merge.
* Fix namespace for check_err
* Formatting.
* Fix merge artifacts.
* Remove deleted header.
* Fix some includes and use ck::utils::check_err.
* Remove unused check_indices restored by previous merge.
* Fix namespaces after merge.
* Fix compilation error.
* Small fixes.
* Use common functions.
* Fix filename
* Fix namespaces.
* Fix merge artifact - retrieve removed by accident fun.
* Fix ConvForwardSpecialization.
* Adhere to coding style rules.
* Fix merge artifacts.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: abf4bdb9a9]