* Do not hardcode stride
* devicePool2DFwd Inherit devicePool3DFwd
* Move instance declaration out of common
* Add dilation
* use the pool3d rank, because pool2d inherit pooo3d
* calculate Do Ho Wo for the dilation
* Fix header name
* Modify ckProfiler
* Remove pool2d instance
* Remove pool2d in profiler
* Remove pool2d and add dilation
* In to client example, this commit revise following:
1. Add dilation.
2. Use pool3d to implement pool2d
* Refine naming and IsSupportedArgument()
* Add dilation to maxpool bwd example
* clang format
* 1. Remove useless header
2. Fix copyright
3. Refine naming
* Add layout parameter to pool fwd
* clang format
* Fix merge error
* Fix compile error
* Remove layout parameter in derived class
* Refine changlog
* Fix compile error
* Fix compiler error
* Add layout to external api and profiler
[ROCm/composable_kernel commit: f60f0a5e03]
* Add avgpool bwd reference code
* Refine naming
* Fix invalid in_element op in ref_conv
* Add example (only reference now)
* Add the full example of avgpool bwd
* Fix copyright
* Imitate MakeDescriptor from transform_conv_bwd_data_to_gemm_v1.hpp
* rename channel to c from k
* Arrange the code
* Imitate the argument from conv bwd
* Implement invoker
* Fix order of parameter in example
* Refactor reference code for different dimension
* Support different stride
* Check if argument is valid
* Fix kernel parameter for NDHWC, fastest dimension C is not reduced
* Add more data type in example
* Fix bug in example
* calculate Do Ho Wo according to the dilation
* Remove useless header
* Add comment in reference code
* Add layout parameter
* Remove layout in derived class
* Refine reference comment
[ROCm/composable_kernel commit: 578142db3a]
* add fno-offload-uniform-block flag for rocm5.7 and up
* add a comment and compiler ticket number
* update the threshold rocm version
[ROCm/composable_kernel commit: cbbd172fd6]
* properly split conv_nd_bwd_data instances
* split conv2d_fwd instance data types
* split the gemm, conv2d_fwd and batched_gemm_softamx_gemm
* split the tests by data types where possible
* filter examples by DTYPES
* split few remaining examples by DTYPES
* filter most instances by DTYPES
* add new lines at end of headers, fix grouped_gemm profiler
* fix syntax
* split the ckprofiler instances by DTYPES
* split the conv2d and quantization DL and XDL instances
* fix the splitting of conv2d DL instances
* split softmax and pool_fwd tests for fp16 and fp32 types
* fix syntax
* fix the dl_int8 quantization instances isolation
[ROCm/composable_kernel commit: 08eb176929]
* Add wei_strides to grouped conv3d wei to keep consistency
* Fix strides in client examples
* Unify backward weight api with forward
* Fix for example
* Fixes for examples
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
[ROCm/composable_kernel commit: 22443f7aae]
* Improve formatting of docs; Add a note about the DL_KERNELS flag
* Change the recommended version of ROCm to 5.6
[ROCm/composable_kernel commit: 8c13df07bf]
* initial stream-k implementation with example
* fix unexpected change in err
* improve a little bit performance by reorganize pipeline.
* improve perf a little bit by swizzle block idx
* add profiler
* update example
* fix spelling
* shrink karg for streamk
* support dynamic buffer using memory coherence glc_slc bit from template
* control memory coherence while construct dynamic buffer
* update reduction for streamk(not ready yet)
* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
* fix build issue
* fix several bug
* now result is correct, everything works (but has scratch)
* remove scratch by manually reset coordinate
* update device code
* fix a bug in final reduce
* fix something in example
* update async memset
* fix enum as camel case
* modify coherence enum name
* clean code and use atomic streamk by default
* remove unused var
* throw exception if have empty pointer
* fix format
* fix CI warning
* fix type in init
* modify CI error
* filter out on gfx10+
* restore changed example code
---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
[ROCm/composable_kernel commit: e7dca79d27]
* Disable XDL kernels on unsupported HW; Add ck::is_xdl_supported function (#765)
* Do not throw an error when GEMM problem is not supported.
---------
Co-authored-by: Bartlomiej Wroblewski <bwroblewski10@gmail.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit: ac6d68b353]
* first change bias load
* add bias dim and scalervector parameter
* make CDE0BlockTransferSrcVectorDim not work
* changse toinstance
* add limit for CDE0BlockTransferSrcScalarPerVector
[ROCm/composable_kernel commit: 50643dd555]
* allow building CK for specific data types
* add CI build and test stage on Naiv3x without some int8 instances
* add missing gemm fp16 instances
* add the changes to the missed cmake file
* add empty lines at end of source files
* Do not build quantization client example on navi3 in CI
* disable batched_gemm_multi_d_int8 instances with DTYPES
* disable device_conv2d_bwd_data_instance with DTYPES
* fix ckprofiler for conv_bwd_data for int8
* properly isolate the conv_bwd_data int8 instances
* remove empty line
[ROCm/composable_kernel commit: 189ea3b9aa]
* Move source file into sub-directories
* Add missing include directive
* Split DeviceGemmXdl<> fp16 instances
* Fix format
* Remove unnecessary CMakeLists.txt
* Add macros to toggle new features
* Remove debug message
* Turn off GEMM v2 pipeline optimization by default
* Fix format
* Extract duplicated string as list
* Enlarge indent in CMakeLists.txt
[ROCm/composable_kernel commit: 850144a0d3]
* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]
* Add CountDataType as template parameter in blockwise_welford
* Add utility/get_shift.hpp
* Add BatchNorm multiblock single-kernel implementation
* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a
* Renaming in device_batchnorm_forward_impl.hpp
* Tiny fix in the batchnorm_fwd profiler
* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"
This reverts commit d16d00919c.
* Use the old two-kernel batchnorm multiblock method for gfx1030
* Use the old two-kernel batchnorm multiblock method for gfx908
* use the single-kernel batchnorm multiblock method only for gfx90a
* Remove get_wave_id() from utility/get_id.hpp since it is not used
* Set true for testing running mean/variance and saving mean/invvariance in the examples
* Fix to copy-right words
* Remove un-needed including in utility/get_id.hpp
* Add comments to workgroup_synchronization.hpp
* Remove un-used codes in gridwise_multiblock_batchnorm_forward.hpp
* Renaming in the kernels
* Remove un-used kernel file
[ROCm/composable_kernel commit: 8f5cafaf04]
* Support bf16/f32/f16 and NHWGC conv2d_bwd_data
* Add interface test
* clang format
* Comment fixes
* Add more friendly error message
[ROCm/composable_kernel commit: 63388e84ab]
* do not build gemm-gemm and conv-conv examples for gfx94*
* do not build gemm-gemm and conv-conv examples on navi
[ROCm/composable_kernel commit: 645eb2f2a0]
* Add basic fp8 definitions and prn-generator
* Format
* Add fp8<->fp32 type_convert
* Format
* Split type_convert and cast_to/from_f8
* Format
* Minor fix
* Minor fix
* Move fp8 utils to a separate header
* Add elementwise ops
* Add fp8_convert_sr
* Format
* Add element op
* Eliminate magic numbers
* Split f8_convert_sr in host and device
* Format
* Add some constexpr
* Add a datatype test
* Format
* Another format
* Add fp8<->fp16 tests
* Update type_converts
* Format
* Add fp16 casting functions
* Format
* Use seed as a runtime arg
* Use element location for PRNG
* Format
* Add fp8<->fp16 to PassThrough element op
* Clean up
* Merge host and device implementations
* Add comments on rounding modes
* Remove leftover code
* Put type_converts into a separate header
* Put random number gen to a separate header
* Rearrange f8_utils' namespaces
* Refactor type_convert.hpp
* Move f8_t definition
[ROCm/composable_kernel commit: f0c620c42e]