Add a new gemm pipeline based on ComputeV4 which utilizes async copy API (#2949)

* check in pipeline and policy for async load in mi350, need to make sure TileAccessPattern is warp_raked or block_raked solve merge conflicts
* fix cmakelists
* make it build
* fix? buffer async fence
* relax fences; it appears it only is needed between pairs of ping-pongs
* remove fences
* remove fences
* cleanup and reformat
* add steps annotations
* comment all pipeline steps / remove unexplainable syncs
* clang-format
* add comment
* cleanup kernel types for test
* fix comment
* fix hardcoded warp size
* faithfully copy block gemm from compute v4 policy to async policy
* make async test gfx950 only
* fix cmake logic
* set separate compile options for async
* refine comment in policy
* try update hotloop scheduler
* cleanup comments
* test more K block sizes
* unhardcode Ks, sort of
* add large odd test case
* fix build for quant
* add comment to hot loop scheduler and rename enum
* reformat
* reword the pipeline description
* reformat
* address review / add static asserts / typo fix
* update changelog
2026-04-20 06:49:15 +00:00 · 2025-10-01 15:38:07 -07:00
parent f2d367262f
commit a7da3c68b9
13 changed files with 803 additions and 62 deletions
--- a/include/ck_tile/core/arch/arch.hpp
+++ b/include/ck_tile/core/arch/arch.hpp
@@ -275,4 +275,20 @@ CK_TILE_DEVICE static constexpr auto get_device_arch()
    return gfx12_t{};
 #endif
 }
+
+enum LLVMSchedGroupMask : int32_t
+{
+    NONE       = 0,
+    ALU        = 1 << 0,
+    VALU       = 1 << 1,
+    SALU       = 1 << 2,
+    MFMA       = 1 << 3,
+    VMEM       = 1 << 4,
+    VMEM_READ  = 1 << 5,
+    VMEM_WRITE = 1 << 6,
+    DS         = 1 << 7,
+    DS_READ    = 1 << 8,
+    DS_WRITE   = 1 << 9,
+    ALL        = (DS_WRITE << 1) - 1,
+};
 } // namespace ck_tile
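
The enum added above is a plain bitmask, so its values can be checked on the host without a GPU. The sketch below copies the enumerators into a hypothetical `sched_sketch` namespace and adds two illustrative helpers, `combine` and `contains`, which are not part of this commit. The bit layout appears to match the mask argument of the LLVM AMDGPU builtin `__builtin_amdgcn_sched_group_barrier(mask, size, sync_id)`, which is presumably how the hot loop scheduler consumes it; that usage is an assumption, since this hunk only shows the definition.

```cpp
#include <cstdint>

namespace sched_sketch { // hypothetical namespace, not ck_tile

// Copy of the enumerators from the diff above.
enum LLVMSchedGroupMask : int32_t
{
    NONE       = 0,
    ALU        = 1 << 0,
    VALU       = 1 << 1,
    SALU       = 1 << 2,
    MFMA       = 1 << 3,
    VMEM       = 1 << 4,
    VMEM_READ  = 1 << 5,
    VMEM_WRITE = 1 << 6,
    DS         = 1 << 7,
    DS_READ    = 1 << 8,
    DS_WRITE   = 1 << 9,
    ALL        = (DS_WRITE << 1) - 1,
};

// Illustrative helper (not in the commit): OR two groups into one raw mask.
constexpr int32_t combine(LLVMSchedGroupMask a, LLVMSchedGroupMask b)
{
    return static_cast<int32_t>(a) | static_cast<int32_t>(b);
}

// Illustrative helper (not in the commit): true if every bit of `group` is set in `mask`.
constexpr bool contains(int32_t mask, LLVMSchedGroupMask group)
{
    return (mask & static_cast<int32_t>(group)) == static_cast<int32_t>(group);
}

// ALL sets the ten bits 0..9, one per enumerator.
static_assert(ALL == 0x3FF, "ALL should cover bits 0 through 9");
static_assert(contains(ALL, MFMA) && contains(ALL, DS_WRITE), "ALL includes every group");

// Assumed device-side usage (requires an AMDGPU target, so it is left as a comment):
//   __builtin_amdgcn_sched_group_barrier(MFMA, 1, 0);      // schedule one MFMA
//   __builtin_amdgcn_sched_group_barrier(VMEM_READ, 1, 0); // then one VMEM read

} // namespace sched_sketch
```

Deriving `ALL` from `DS_WRITE` means it only stays correct while `DS_WRITE` is the highest bit; a group added above bit 9 would need `ALL` updated as well.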