Files
composable_kernel/include/ck/tensor_operation/gpu
rocking5566 0bd6b842b9 Layernorm welford (#346)
* Add threadwise and blockwise welford

* Rename gridwise op, prepare to add welford version

* implement welford and integrate welford into layernorm

* Take care of tail loop

* Fix buf when ThreadSliceK > 1

* Fix bug of merging of two empty set

* Rename clip to clamp

* 1. Fix type of count
2. Remove useless static_assert

* Do not inherit Reduction::Argument

* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)

* Add y stride

* Rename.
DeviceLayernorm -> DeviceLayernormImpl
DeviceNormalization2 -> DeviceLayernorm

* Move literal ""_uz & ""_zu into namespace 'literals'

* Move namespace 'literals' as 'ck::literals'

Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
2022-08-13 09:43:18 -05:00
..
2022-08-13 09:43:18 -05:00
2022-08-13 09:43:18 -05:00
2022-08-13 01:07:12 -05:00
2022-08-13 09:43:18 -05:00
2022-08-13 09:43:18 -05:00
2022-08-13 00:16:14 -05:00