* add wrw reference
* start device
* raw non-split version
* run simple example
* start to use atomic add
* simple transform result correct
* first version that can run
* fix atomic and set operator choice
* add split-k check
* format
* change input parameter
* add pad for t total
* rename example index

Co-authored-by: ltqin <letaoqin@amd.com>