Chao Liu
|
b491ebf384
|
FP16 data in-register transpose (#41)
* start fixing 16bit data packing
* adding StaticTensor
* adding StaticTensor
* adding StaticTensor
* add missing constexpr
* adding static tensor
* adding static tensor
* adding transpose
* add inline asm for transpose 2x2 of half_t
* add general transpose_vectors(), but have unnecessary register initialization using v_mov
* fix unnecessary register initialization in transpose_vector by using more pass-by-reference
* add hardcoded logic for NHWC wrw
* improve asm for v_pack
* make ThreadwiseTensorSliceTransfer_v3r2 support any tensor
* tweak
* reorganize file
|
2021-11-15 10:05:58 -06:00 |
|