* initial stream-k implementation with example
* fix unexpected change in err
* improve performance a little by reorganizing the pipeline
* improve perf a little by swizzling the block idx
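The block-idx swizzle above can be illustrated with a minimal host-side sketch (hypothetical names and layout, not CK's actual implementation): a linear block id is remapped so that blocks sharing tile columns are scheduled close together, which tends to improve L2 reuse in GEMM-like kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical sketch of block-index swizzling for a logical
// (m_blocks x n_blocks) grid: pack `group_size` columns per group so
// consecutively launched blocks touch the same B-tile columns.
std::pair<int, int>
swizzle_block_idx(int linear_id, int m_blocks, int n_blocks, int group_size)
{
    const int blocks_per_group = group_size * m_blocks;
    const int group_id         = linear_id / blocks_per_group;
    const int first_n          = group_id * group_size;
    // the last group may be narrower than group_size
    const int width = std::min(group_size, n_blocks - first_n);
    const int local = linear_id % blocks_per_group;
    const int m_idx = local / width;
    const int n_idx = first_n + local % width;
    return {m_idx, n_idx};
}
```

The remap is a bijection over the grid, so every (m, n) tile is still visited exactly once; only the launch order changes.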
* add profiler
* update example
* fix spelling
* shrink karg for streamk
* support dynamic buffer using the memory coherence glc_slc bit from a template parameter
* control memory coherence while constructing the dynamic buffer
* update reduction for streamk (not ready yet)
* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
* fix build issue
* fix several bugs
* now result is correct, everything works (but has scratch)
* remove scratch by manually reset coordinate
* update device code
* fix a bug in final reduce
* fix something in example
* update async memset
* rename enum values to camel case
* modify coherence enum name
* clean code and use atomic streamk by default
* remove unused var
* throw exception on empty pointer
* fix format
* fix CI warning
* fix type in init
* fix CI error
* filter out on gfx10+
* restore changed example code
---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Refactor instance generation, split into multiple cpp files
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
* Add threadwise and blockwise welford
* Rename gridwise op, prepare to add welford version
* implement welford and integrate welford into layernorm
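The threadwise/blockwise Welford mentioned above computes mean and variance in a single pass and lets partial results be merged across threads. A minimal scalar sketch of the standard algorithm (not CK's actual code; `push` corresponds to threadwise accumulation, `merge` to the blockwise combine):

```cpp
#include <cassert>
#include <cmath>

// Welford's online mean/variance with a parallel merge step
// (Chan et al. pairwise combination formula).
struct Welford
{
    double mean  = 0.0;
    double m2    = 0.0; // running sum of squared deviations
    long   count = 0;

    // accumulate one sample (threadwise)
    void push(double x)
    {
        ++count;
        const double delta = x - mean;
        mean += delta / count;
        m2   += delta * (x - mean);
    }

    // combine two partial results (blockwise); handles empty inputs,
    // which is the edge case behind "merging of two empty set" bugs
    void merge(const Welford& o)
    {
        if (o.count == 0)
            return;
        const long   n     = count + o.count;
        const double delta = o.mean - mean;
        mean += delta * o.count / n;
        m2   += o.m2 + delta * delta * (double(count) * o.count) / n;
        count = n;
    }

    double variance() const { return count > 0 ? m2 / count : 0.0; }
};
```

Note that `merge` is also correct when the left-hand side is empty: the formulas then reduce to copying the other accumulator.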
* Take care of tail loop
* Fix bug when ThreadSliceK > 1
* Fix bug of merging of two empty set
* Rename clip to clamp
* 1. Fix type of count
2. Remove useless static_assert
* Do not inherit Reduction::Argument
* [What] replace __syncthreads() with block_sync_lds()
[Why] __syncthreads might wait both lgkmcnt(0) and vmcnt(0)
* Add y stride
* Rename.
DeviceLayernorm -> DeviceLayernormImpl
DeviceNormalization2 -> DeviceLayernorm
* Move literal ""_uz & ""_zu into namespace 'literals'
* Move namespace 'literals' as 'ck::literals'
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
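The `""_uz` / `""_zu` literals moved into `ck::literals` above are user-defined literals yielding `std::size_t`. A minimal sketch of what such a literal looks like (hypothetical body, not necessarily CK's exact definition):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a size_t user-defined literal in a dedicated namespace,
// so callers opt in with `using namespace ck::literals;`.
namespace ck {
namespace literals {

constexpr std::size_t operator""_uz(unsigned long long x)
{
    return static_cast<std::size_t>(x);
}

} // namespace literals
} // namespace ck

using namespace ck::literals;
static_assert(sizeof(0_uz) == sizeof(std::size_t), "literal yields size_t");
```

Keeping the literal inside a `literals` namespace follows the standard library's own convention (e.g. `std::literals`), avoiding surprises from literals injected at global scope.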
* use 'sweep once' softmax kernel where applicable
* threadwise copy's dst buffer can specify invalid element value
* add int8 in/out float compute softmax support
give a bit of leeway for int absolute tolerance, as a single data point among all test cases shows an off-by-1 error
* format
* softmax inherits DeviceNormalization
* softmax profiler stub
* tighten up reference softmax interface
* example prints tensor dimension
* add fp32 to softmax profiler
* rename header
* hook with ckProfiler
* format
* resolve merge conflict
* resolve merge conflicts
* update normalization profiler help string
* resolve conflict
* typo
* remove residual
* softmax profiler: address feedback
* test for mixed precision input/output
* fully qualify ck::math::isnan
* add comment for device normalization interface
* revise wording
* constness for alpha/beta scaler pointer
* initial stub for standalone softmax
* start device_softmax_mk_to_mk as a wrapper to device_reduce_mk_to_m
* host softmax validates
* compiles; beta scaling still to be implemented
* use NaN trick to efficiently ignore OOB values during sum of exponentials
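The idea of ignoring out-of-bounds values during the sum of exponentials can be sketched as follows. This is a hedged illustration only: the commit describes a "NaN trick" combined with the invalid-element-value support from the threadwise copy, while the sketch below expresses the same effect with -infinity padding (since `exp(-inf - max) == 0`, padded lanes contribute nothing); function and variable names are hypothetical, not CK's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical sketch: lanes beyond valid_len are filled with -inf so
// the softmax sum of exponentials ignores them without branching.
std::vector<float> softmax_padded(std::vector<float> row, std::size_t valid_len)
{
    const float neg_inf = -std::numeric_limits<float>::infinity();
    for (std::size_t i = valid_len; i < row.size(); ++i)
        row[i] = neg_inf; // out-of-bounds lanes

    float max_v = neg_inf;
    for (std::size_t i = 0; i < valid_len; ++i)
        max_v = std::max(max_v, row[i]);

    float sum = 0.f;
    for (float& x : row)
    {
        x = std::exp(x - max_v); // exp(-inf) == 0 for padded lanes
        sum += x;
    }
    for (float& x : row)
        x /= sum;
    return row;
}
```

The padded lanes come out as exact zeros, so no mask or per-element branch is needed in the reduction loop.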
* freeload device_reduce's utility functions
* clean up interface
* adding prior value (beta scaling)
* remove restriction related to perf considerations
* apply clang-format
* clean; disable diagnostics
* resolve conflicts
* add exp wrapper
* honor HostTensorDesc interface; allow implicit cast from different vector<T> type
* test softmax for fp16/fp32
* update readme
* amend commit NaN trick
* remove redundant param added during development
* format
* replace ScalarDataType with AccDataType
* separate out test programs by precision type
* move softmax sample code to its own folder
* format
* keep up with recent changes in reduction API
* remove extra header