fix transpose_vectors logic for 2x2 8-bit tiles
add a test that exercises this code path.
factor out constexpr'd cases into smaller functions.
add inline docs about the data movement
impact: GEMMs with 8-bit non-RCR inputs on gfx942
* add failing tests
* swap out and reference
* add constraint assert to transpose input distribution
* test both pipelines with rectangular block tile
* print mismatched indices
* add a smaller failing test for old pipeline
* print grid and block
* fill output before operating on it
* swap m/n tile sizes and make one test pass
* add device syncs
* add one more flipped test case
* flip block tile at host arg init
* fix tiles for lds pipeline
* clang-format
* rename tests
* roll back error check
* remove device syncs
* reduce large test case's size
* add a dummy test file
* add kernel launch logic to the test
* transfer all test cases into gtest params
* factor kernel out into test config
* add load transpose pipeline tests
* add padded tests and skip invalid kernels at runtime
* enum class for pipeline type
* add multiwarp test cases
* fix type
* try to solve the problem
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>