mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-18 03:49:41 +00:00
* Tiny fix in dynamic_buffer.hpp to support vectorized AtomicAdd for double type
* Update to host layer and host reduction
* Merge and remove reduction kernels
* Merge and remove reduction device interfaces and update pooling device interface
* Merge and remove useless reduction device instances
* Update to reduction profiler and reduction ctests
* Update to reduction and pooling examples and add one reduction example
* Change reduction examples to make them testable by ctest
* Add explicit pass checking for reduction and pooling examples
* Explicit assignment of tensor shapes in example reduce_blockwise_two_call
* Use atomic_add to replace atomicAdd and add atomic_add for double type
* Add reduce ctest support for double data type
* Replace to_int_vector() by using c++ std::vector::assign()
* Keep DeviceReduceThreadWise separated from DeviceReduceBlockWise
* Merge DeviceReduceBlockWise and DeviceReduceMultiBlockAtomicAdd into DeviceReduceMultiBlock
* Add GetAtomicOperationZeroValue() support for AtomicMax
* Tiny change to reduce example README.md
* Fix some tiny issues due to branch merging
* Revoke previous change in dynamic_buffer.hpp and add atomic_add for double2_t
* Add reduce multiblock_atomic_add instances for fp64 to verify vectorized atomic_add on fp64
* Renaming
* Clean up the header includes in device_reduce instances header files
[ROCm/composable_kernel commit: 63eee2d999]
43 lines
1.5 KiB
Markdown
# Instructions for ```example_reduce_blockwise```

## Run ```example_reduce_blockwise```
```bash
# -D <xxx> : input 4-d tensor lengths
# -v <x> : verification (0=no, 1=yes)
# arg1: initialization (0=no init, 1=single integer value, 2=scope integer value, 3=decimal value)
# arg2: time kernel (0=no, 1=yes)
./bin/example_reduce_blockwise -D 16,64,32,960 -v 1 1 1
```

Result
```
./bin/example_reduce_blockwise -D 16,64,32,960 -v 1 1 1
launch_and_time_kernel: grid_dim {240, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Perf: 0.282592 ms, 222.641 GB/s, DeviceReduceBlockWise<256,M_C4_S1,K_C64_S1,InSrcVectorDim_0_InSrcVectorSize_1_OutDstVectorSize_1>
```
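
The reported bandwidth can be sanity-checked by hand. The Python sketch below is only an illustration: it assumes (this README does not say so) that input and output elements are 2 bytes each and that the reduction produces 960 output values. Under those assumptions, bytes moved divided by the measured time lands on the `GB/s` figure in the `Perf:` line.

```python
# Sanity check of the "Perf:" line above.
# elem_bytes and out_elems are assumptions for illustration,
# not values stated in this README.
from math import prod

lengths = [16, 64, 32, 960]   # from -D 16,64,32,960
elem_bytes = 2                # assumption: 2-byte input/output elements
out_elems = 960               # assumption: number of reduced output values
time_ms = 0.282592            # average time from the Perf line

in_elems = prod(lengths)
total_bytes = (in_elems + out_elems) * elem_bytes

# bytes / 1e6 / ms is numerically equal to GB/s (1 GB = 1e9 bytes).
gb_per_s = total_bytes / 1.0e6 / time_ms

print(f"{in_elems} input elements, {total_bytes} bytes moved")
print(f"{gb_per_s:.3f} GB/s")
```

If the element type or the reduced dimensions differ on your build, adjust `elem_bytes` and `out_elems` accordingly; the computed value will then no longer match the sample output shown here.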

# Instructions for ```example_reduce_blockwise_two_call```

## Run ```example_reduce_blockwise_two_call```
```bash
# arg1: verification (0=no, 1=yes)
# arg2: initialization (0=no init, 1=single integer value, 2=scope integer value, 3=decimal value)
# arg3: time kernel (0=no, 1=yes)
./bin/example_reduce_blockwise_two_call 1 2 1
```

Result
```
./bin/example_reduce_blockwise_two_call 1 2 1
launch_and_time_kernel: grid_dim {204800, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
launch_and_time_kernel: grid_dim {6400, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Perf: 2.1791 ms, 771.42 GB/s, DeviceReduceBlockWise<256,M_C32_S1,K_C8_S1,InSrcVectorDim_1_InSrcVectorSize_1_OutDstVectorSize_1> => DeviceReduceBlockWise<256,M_C256_S1,K_C1_S1,InSrcVectorDim_1_InSrcVectorSize_1_OutDstVectorSize_1>
```
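
The two `launch_and_time_kernel` entries and the `=>` in the `Perf:` line suggest that this example chains two `DeviceReduceBlockWise` calls, with the second call reducing the output of the first. As a rough illustration of why that gives the same answer as a single reduction for an associative operation such as a sum, here is a toy Python sketch. The tensor shape, the choice of sum, and the helper `reduce_last_dim` are all hypothetical and unrelated to the actual kernel code.

```python
# Illustrative only: a sum over two dimensions done in two chained passes
# versus one pass. Shape and operation are made up for this sketch.

def reduce_last_dim(rows):
    """Sum over the last dimension of a list of lists."""
    return [sum(row) for row in rows]

# A small 2 x 3 x 4 tensor filled with the integers 0..23.
tensor = [[[i * 12 + j * 4 + k for k in range(4)] for j in range(3)]
          for i in range(2)]

# Two calls: reduce the innermost dimension, then the next one.
stage1 = [reduce_last_dim(mat) for mat in tensor]   # shape 2 x 3
two_call = reduce_last_dim(stage1)                  # shape 2

# One call: reduce both trailing dimensions at once.
one_call = [sum(v for row in mat for v in row) for mat in tensor]

print(two_call, one_call)
```

Both lists are equal for integer sums. With floating-point data the two orderings can differ in the last bits, which is one reason to run the example with verification enabled.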