mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 10:09:41 +00:00

Files

Anthony Chang 074cb887c0 Standalone softmax kernel (#284 )

* initial stub for standalone softmax

* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m

* host softmax validates

* compiles; to implement beta scaling

* use NaN trick to efficiently ignore OOB values during sum of exponentials

* freeload device_reduce's utility functions

* clean up interface

* adding prior value (beta scaling)

* remove restriction related to perf considerations

* apply clang-format

* clean; disable diagnostics

* resolve conflicts

* add exp wrapper

* honor HostTensorDesc interface; allow implicit cast from different vector<T> type

* test softmax for fp16/fp32

* update readme

* amend commit NaN trick

* remove redundant param added during development

* format

* replace ScalarDataType with AccDataType

* separate out test programs by precision type

* move softmax sample code to its own folder

* format

* keep up with recent changes in reduction API

* remove extra header

[ROCm/composable_kernel commit: 15c89e81f0]

2022-06-21 14:59:19 -05:00

615 B

Raw Blame History

Instructions for `example_softmax_blockwise`

Run `example_softmax_blockwise`

# -D <xxx> : input 3-d tensor lengths
# -v <x> :   verification (0=no, 1=yes)
#arg1: initialization (0=no init, 1=single integer value, 2=scope integer value, 3=decimal value)
#arg2: time kernel (0=no, 1=yes)
example_softmax_blockwise -D 4,128,2048 -v 1 1 1

Result

launch_and_time_kernel: grid_dim {64, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Perf: 0.0242877 ms, 259.039 GB/s, DeviceReduceSoftmax<256,M_C8_S1,K_C32_S8,InSrcVectorDim_1_InSrcVectorSize_8_OutDstVectorSize_8>

615 B Raw Blame History

Instructions for example_softmax_blockwise

Run example_softmax_blockwise

615 B

Raw Blame History

Instructions for `example_softmax_blockwise`

Run `example_softmax_blockwise`