fix transpose_vectors logic for 2x2 8-bit tiles

add a test that exercises this code path.
factor out constexpr'd cases into smaller functions.
add inline docs about the data movement.

impact: gemms with 8-bit non-rcr inputs on gfx942