* merge the build and performance test CI stages
* add gemm performance test on gfx11/gfx12
* add suffixes to distinguish gemm performance logs from different archs
* use a smaller gemm set in CI for gfx10/gfx11/gfx12
* disable performance tests on gfx1030
* fix the stashing logic
* fix finding python3 for mha instances