[CK_TILE] Epilogue chaining (Lwpck 3373) (#2773)

* Epilogue chainer

* epilogue chainer with context to share state in between epilogues
* chain-able epilogues for cshuffle

* clang-format

* rebase related changes

- Added separate chainer test
-  clang format

* comment resolutions

* clang-format

* Policy based chaining

- basic Policy structure to control blanket looping and barrier
placement.

- to be extended for fine grianed control

- to  be modified to move possible auto-compute values and SFC  access
count to policy

* Refactoring as per spec

- Introduced epilogue schedule, graph
- modified chainer to function with graph and schedule

* minor_changes

- made functions to overload in the epilogue_graph file

* clang-format

* Documentation and Comments

- Added comments to files
- Noted changes in changelog
- Added README to explain the chainer and current status, exact use
steps to be added

* Comment resolutions

- README modified with the suggested changes
- Comment fixed accordingly

* major refactoring

- modified the chainer files to match the new design
- updated comments
- updated readme
- multi-d example shocases use of the chainer

* minor cleanup

* tensor and rowcol quant chainer epilogue

- added scalarepilogue for tensor quant
- added schedule for tensorquant
- modified quant example to use chainer and appropriate schedules

* Refactor epilogue chainer: generalize ops and standardize context interface

Address review comments.

Changes:
- Rename CastToLdsOp to CastAndStoreToLdsOp for clarity
- Standardize context member names (working_tile, out_tile, aux_windows)
- Update README documentation with correct operation names
- Clean up parameter naming in epilogue_chainer.hpp (OutWindow, AccTile,
AuxWindows)
- common_epilogue_ops.hpp: General-purpose ops (ScaleScalarOp,
CastAndStoreToLdsOp,
  LoadFromLdsOp, ElementwiseOp, StoreOp, MoveWindowsOp)
- cshuffle_epilogue_chainer_ops.hpp: CShuffle-specific context and slice
operations
- epilogue_chainer.hpp: Cleaned up parameter naming for generality
- Removed test files that are no longer needed. These were added for
intermediate use

* update cshuffle chainer ops file w.r.t cshuffle_epilogue.hpp updates & add chainer to quant gemm example

* fix compile errors

- CI uses c++17 while the code had c++20 features

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
This commit is contained in:
Yashvardhan Agarwal
2025-12-18 11:02:02 +02:00
committed by GitHub
parent bfac64953f
commit 15e81397a4
9 changed files with 1244 additions and 42 deletions

View File

@@ -16,6 +16,7 @@
#include "ck_tile/host/permute_pk_int4.hpp"
#include "ck_tile/host/tensor_shuffle_utils.hpp"
#include "ck_tile/ops/gemm_quant.hpp"
#include "ck_tile/ops/epilogue.hpp"
#include "gemm_utils.hpp"
template <typename GemmConfig,
@@ -172,31 +173,77 @@ float gemm_calc_quant(const ck_tile::QuantGemmHostArgs& args, const ck_tile::str
printf(
"TiledPermuteN: %d (QuantGroupSize::kN=%d)\n", TiledPermuteN, BQuantGroupSize::kN);
}
using GemmEpilogue = ck_tile::CShuffleEpilogue<ck_tile::CShuffleEpilogueProblem<
typename TypeConfig::ADataType,
std::conditional_t<
std::is_same_v<typename TypeConfig::BDataType, ck_tile::pk_fp4_raw_t>,
// Epilogue selection: use chainer for RowCol/Tensor quant, standard for others
// Toggle to switch between chainer-based and standard CShuffleEpilogue
constexpr bool UseChainerEpilogue = true;
// Define the schedule tag based on quant mode
using ScheduleTag =
std::conditional_t<QuantMode == ck_tile::QuantType::RowColQuant,
ck_tile::RowColQuantScheduleTag,
std::conditional_t<QuantMode == ck_tile::QuantType::TensorQuant,
ck_tile::TensorQuantScheduleTag,
ck_tile::DefaultScheduleTag>>;
using GemmEpilogue = std::conditional_t<
UseChainerEpilogue && (QuantMode == ck_tile::QuantType::RowColQuant ||
QuantMode == ck_tile::QuantType::TensorQuant),
// Chainer-based epilogue for RowCol/Tensor quant modes
ck_tile::EpilogueChainer<ck_tile::CshuffleEpilogueSchedule<
ck_tile::CShuffleEpilogueChainProblem<
typename TypeConfig::ADataType,
std::conditional_t<
std::is_same_v<typename TypeConfig::BDataType, ck_tile::pk_fp4_raw_t>,
typename TypeConfig::ADataType,
typename TypeConfig::BDataType>,
ck_tile::tuple<>,
typename TypeConfig::AccDataType,
typename TypeConfig::CDataType,
ck_tile::tuple<>,
CLayout,
CDEElementWise,
TilePartitioner::MPerBlock,
TilePartitioner::NPerBlock,
GemmConfig::M_Warp,
GemmConfig::N_Warp,
GemmConfig::M_Warp_Tile,
GemmConfig::N_Warp_Tile,
GemmConfig::K_Warp_Tile,
transpose_c,
ck_tile::memory_operation_enum::set,
1,
false,
1,
TiledPermuteN>,
ScheduleTag>>,
// Standard CShuffleEpilogue for other modes
ck_tile::CShuffleEpilogue<ck_tile::CShuffleEpilogueProblem<
typename TypeConfig::ADataType,
typename TypeConfig::BDataType>,
ck_tile::tuple<>,
typename TypeConfig::AccDataType,
typename TypeConfig::CDataType,
ck_tile::tuple<>,
CLayout,
CDEElementWise,
TilePartitioner::MPerBlock,
TilePartitioner::NPerBlock,
GemmConfig::M_Warp,
GemmConfig::N_Warp,
GemmConfig::M_Warp_Tile,
GemmConfig::N_Warp_Tile,
GemmConfig::K_Warp_Tile,
transpose_c,
ck_tile::memory_operation_enum::set,
1,
false,
1,
TiledPermuteN>>;
std::conditional_t<
std::is_same_v<typename TypeConfig::BDataType, ck_tile::pk_fp4_raw_t>,
typename TypeConfig::ADataType,
typename TypeConfig::BDataType>,
ck_tile::tuple<>,
typename TypeConfig::AccDataType,
typename TypeConfig::CDataType,
ck_tile::tuple<>,
CLayout,
CDEElementWise,
TilePartitioner::MPerBlock,
TilePartitioner::NPerBlock,
GemmConfig::M_Warp,
GemmConfig::N_Warp,
GemmConfig::M_Warp_Tile,
GemmConfig::N_Warp_Tile,
GemmConfig::K_Warp_Tile,
transpose_c,
ck_tile::memory_operation_enum::set,
1,
false,
1,
TiledPermuteN>>>;
using Kernel =
ck_tile::QuantGemmKernel<TilePartitioner, GemmPipeline, GemmEpilogue, QuantMode>;