Add support for double buffering in direct load GEMM kernel (#1052)

This PR introduces support for double buffering in LDS into GEMM kernels that use direct load instructions.

Direct loads now use inline asm instead of intrinsics. Usage of intrinsics results in compiler adding additional waitcnt instructions what breaks possible load/compute overlap in case of double buffering.

Usage of inline asm results in the need to use sched_barrier in order to make sure that compiler cannot incorrectly reschedule instructions since it does not know the data dependencies between global->LDS and LDS->registers.
This commit is contained in:
Bartlomiej Wroblewski
2023-12-03 23:08:47 +01:00
committed by GitHub
parent c7d5c7727b
commit bc4bf9bd03
10 changed files with 211 additions and 24 deletions

View File

@@ -380,7 +380,9 @@ struct DeviceGemm_Xdl_CShuffle_LdsDirectLoad : public DeviceGemm<ALayout,
<< " LoopScheduler: "
<< LoopSchedToString[LoopSched] << ", "
<< "PipelineVersion: "
<< PipelineVersionToString[PipelineVer];
<< PipelineVersionToString[PipelineVer] << ", "
<< "Prefetch: "
<< NumGemmKPrefetchStage;
// clang-format on
return str.str();