Add a new gemm pipeline based on ComputeV4 which utilizes async copy API (#2949)

* check in pipeline and policy

for async load on MI350, make sure TileAccessPattern is warp_raked or block_raked

solve merge conflicts

* fix cmakelists

* make it build

* fix? buffer async fence

* relax fences; they appear to be needed only between pairs of ping-pongs

* remove fences

* remove fences

* cleanup and reformat

* add steps annotations

* comment all pipeline steps / remove unexplainable syncs

* clang-format

* add comment

* cleanup kernel types for test

* fix comment

* fix hardcoded warp size

* faithfully copy block gemm from compute v4 policy to async policy

* make async test gfx950 only

* fix cmake logic

* set separate compile options for async

* refine comment in policy

* try update hotloop scheduler

* cleanup comments

* test more K block sizes

* unhardcode Ks, sort of

* add large odd test case

* fix build for quant

* add comment to hot loop scheduler and rename enum

* reformat

* reword the pipeline description

* reformat

* address review / add static asserts / typo fix

* update changelog
Max Podkorytov
2025-10-01 15:38:07 -07:00
committed by GitHub
parent f2d367262f
commit a7da3c68b9
13 changed files with 803 additions and 62 deletions

@@ -275,4 +275,20 @@ CK_TILE_DEVICE static constexpr auto get_device_arch()
    return gfx12_t{};
#endif
}
enum LLVMSchedGroupMask : int32_t
{
    NONE       = 0,
    ALU        = 1 << 0,
    VALU       = 1 << 1,
    SALU       = 1 << 2,
    MFMA       = 1 << 3,
    VMEM       = 1 << 4,
    VMEM_READ  = 1 << 5,
    VMEM_WRITE = 1 << 6,
    DS         = 1 << 7,
    DS_READ    = 1 << 8,
    DS_WRITE   = 1 << 9,
    ALL        = (DS_WRITE << 1) - 1,
};
} // namespace ck_tile
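
The enum above is a bit mask, so its values can be OR-ed together and tested with simple bit operations. The following host-side sketch shows how the values compose. The enum is copied locally so the snippet compiles on its own, and the helpers `combine`, `selects`, and `check_mask_layout` are hypothetical names introduced for illustration; they are not part of this commit. The values appear to mirror the mask bits accepted by the LLVM AMDGPU builtin `__builtin_amdgcn_sched_group_barrier(mask, size, sync_id)`, but that mapping is an assumption based on the enum name, not something the diff states.

```cpp
#include <cassert>
#include <cstdint>

// Local copy of the enum from the diff, so this sketch is self-contained.
enum LLVMSchedGroupMask : int32_t
{
    NONE       = 0,
    ALU        = 1 << 0,
    VALU       = 1 << 1,
    SALU       = 1 << 2,
    MFMA       = 1 << 3,
    VMEM       = 1 << 4,
    VMEM_READ  = 1 << 5,
    VMEM_WRITE = 1 << 6,
    DS         = 1 << 7,
    DS_READ    = 1 << 8,
    DS_WRITE   = 1 << 9,
    ALL        = (DS_WRITE << 1) - 1,
};

// Hypothetical helper: OR two instruction groups into a single mask.
constexpr int32_t combine(int32_t a, int32_t b) { return a | b; }

// Hypothetical helper: true if `mask` contains every bit set in `group`.
constexpr bool selects(int32_t mask, int32_t group) { return (mask & group) == group; }

// ALL sets every bit below (DS_WRITE << 1), i.e. the ten bits 0..9.
static_assert(ALL == 0x3FF, "ALL should cover bits 0 through 9");
static_assert(selects(ALL, MFMA | DS_READ | VMEM_READ), "ALL includes every group");
static_assert(!selects(MFMA, DS_READ), "distinct groups occupy distinct bits");

// Runtime counterpart of the compile-time checks above.
inline void check_mask_layout()
{
    const int32_t lds_traffic = combine(DS_READ, DS_WRITE);
    assert(selects(lds_traffic, DS_READ));
    assert(selects(lds_traffic, DS_WRITE));
    assert(!selects(lds_traffic, MFMA));
}
```

Because each group occupies its own bit, `ALL` can be derived from the last enumerator rather than spelled out, which keeps it correct as long as `DS_WRITE` remains the highest bit.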