Files
composable_kernel/device_operation/include
Chao Liu b491ebf384 FP16 data in-register transpose (#41)
* start fixing 16bit data packing

* adding StaticTensor

* adding StaticTensor

* adding StaticTensor

* add missing constexpr

* adding static tensor

* adding static tensor

* adding transpose

* add inline asm for transpose 2x2 of half_t

* add general transpose_vectors(), but have unnecessary register initialization using v_mov

* fix unnecessary register initialization in transpose_vector by using more pass-by-reference

* add hardcoded logic for NHWC wrw

* improve asm for v_pack

* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor

* tweak

* reorganize file
2021-11-15 10:05:58 -06:00
..