* add space_filling_curve
* cleanup and move space_filling_curve into test
* WIP: start refactoring threadwise_transfer_v1r3
* threadwise_copy works but needs further refactoring
* add some comments
* add SpaceFillingCurve::GetIndices()
* minor changes
* removed GetIndices; refactored GetDstCoordinateResetStep
* add DynamicBuffer::Transfer, but Add is not tested
* rebased agaist develop
* threadwise_copy_v6r1/v6r2/v6r3 using space-filling curve start to work
* minor changes
* refactored threadcopy v3r1, v2; removed old implementations
* clang-format
* cleanup
* fix a typo in v6r3
* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: 0619ebf70b]
* add gitignore
* host tensor: allow generating sequentially increasing value in a given dimension
* gridwise gemm v3r1: allow distinct K0/K1 values for A/B block descriptor
- remove dangling header include
- modify example gemm_xdl accordingly
- infer KPack value from M/NPerXdl
- device conv2d fwd: update parameters accordingly for the underlying gridwise gemm v3r1
(API for conv2d fwd stays the same for now until we decide to expose individual K0s for activation and weight)
* add LDS data dump utility
* profiler: reflect API change for distinct K0/K1 for A/B matrices
* profiler: add conflict-free LDS write FP16 kernel instances
* fix accidental perf regression
* address feedback; cosmetic changes
* clang-format for new files
* format
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: 6d4450ef15]
* conv3d compiles but has memory error
* conv3d works
* fix performance issue by using __builtin_amdgc_readfirstlane
* change MakeBlock2CTileMap to MakeDefaultBlock2CTileMap; change c_blockid_to* to cblockid_to*
* clang-format
* remove CK_EXPERIMENTAL_PASS_TENSOR_DECRIPTOR_BY_*; moved wrapper into DeviceConv3d
* format
* remove useless marc
* add comment
Co-authored-by: Chao Liu <chao.liu2@amd.com>
[ROCm/composable_kernel commit: 6dfb92bbef]
* tweak conv for odd C
* update script
* clean up elementwise op
* fix build
* clean up
* added example for gemm+bias+relu+add
* added example for gemm+bias+relu
* add profiler for gemm_s_shuffle; re-org files
* add profiler
* fix build
* clean up
* clean up
* clean up
* fix build
[ROCm/composable_kernel commit: 823657ed12]
* [What]
1. Add DeviceGemmXdl_C_Shuffle
2. Revise example of gemm_xdl
[Why] Prepare to add shuffle version of D = alpha * (A * B) + beta * C
[How] Imitate DeviceGemmXdl and device_conv2d_fwd_xdl_c_shuffle_nhwc_kyxc_nhwk.hpp
[ROCm/composable_kernel commit: 4d40b1974e]
* fix relu
* clean up
* clean up
* adding 1x1 conv
* adding 1x1 conv
* added 1x1 conv
* refactor
* refactor
* refactor
* added profiler for conv+bias+relu+add
* clean up
* adding conv+bias+relu
* adding conv+bias+relu
* added conv+bias+relu
* Update README.md
* update cpu verification
* adding c shuffle
* update static_tensor for dealing with invalid element
* adding c shuffle
* debugging
* fix bug
* convert to fp16 before shuffle
* shuffle more than one M/NRepeat
* clean up
* remove coordinate step hack from GridwiseGemm_k0mk1_k0nk1_mn_xdlops_v3r1
* clean up
* remove coordinate step hack from all gridwise gemm xdl
* clean up coordinate step hack
* clean up coordinate step hack
* ThreadwiseTensorSliceTransfer_v3r2 support pointwise op on both src and dst
* adding output shuffle in conv+bias+relu+add
* update
* added conv+bias+relu+add with c shuffle
* added conv+bias+relu+add with c shuffle
* fix forward_sweep bugs in threadwise copy
* clean up
* refactor
* clean up
* clean up
* added conv_c_shuffle+bias_relu
* clean up
* added conv+bias+relu+atomic_add
* clean up
* clean up
* clean up
* clean up
* clean up
* clean up
* misc fixes; add 1x1 specialization
* clean up
* delete unused device op
* clean up
* add support for odd C value
[ROCm/composable_kernel commit: acbd7bd7c5]