composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-12 17:26:00 +00:00

Author	SHA1	Message	Date
Cong Ma	23fb253c4e	Make CK TILE GEMM Aquant support block tile 128x128x128 (#3325 ) * [CK TILE GEMM Quant] Rename GemmConfigBQuantPrefill to GemmConfigQuantPrefill in examples * [CK TILE GEMM Quant] update tile distribution of aquant * [CK TILE GEMM Quant] update aquant register offset calculation * [CK TILE GEMM Quant] Reimplement aquant register offset calculation * [CK TILE GEMM Quant] Add more unit tests of Aquant - Test M128xN128xK128 * [CK TILE GEMM Quant] Add more comments to Gemm Aquant	2025-12-01 15:04:37 -08:00
Yi DING	9ed9539ddf	[CK_TILE] Disable cast_tile_pk_fp16bf16_fp32 as It Causes Extra spills on Recent Compilers (#3327 )	2025-12-01 14:48:22 +08:00
Gino Lu	ba6af9fe7c	[CK_TILE] Add unit test for fp4 warp gemm (#2817 ) This update includes a unit test for warp GEMM	2025-12-01 13:56:48 +08:00
Sami Remes	f981554c39	[CK_TILE] Fix Quant GEMM build (#3320 ) * Fix build * Fix ck_tile example 38 & 40 --------- Co-authored-by: Yi DING <yi.ding@amd.com>	2025-11-28 20:33:53 +08:00
msaffari-amd	f875ab0bbc	Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic-add (#3236 ) * Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic * correct clang-format * removed unused rtol_atol variable from example code * clang format correction * remove unused variable max_accumulated_value from example	2025-11-28 09:43:01 +01:00
Cong Ma	30727c48fc	Tile engine for streamk (#3157 ) * [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM. - This commit lays the groundwork for integrating the tile engine into streamk GEMM. It focuses on creating benchmark executables for streamk GEMM. - Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once the streamk implementation reaches stability. * [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM * [CK TILE STREAMK] Refactor: Extract common utility functions. * [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation * Add pre-commit * [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK * [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK * [CK TILE STREAMK] Update Jenkinsfile * [CK TILE Engine] Update StreamK tile engine help message Remove default value messages as they are automatically printed * [CK TILE Engine] Update StreamK tile engine - Remove namespace reboot * [CK TILE Engine] Update StreamK tile engine - Fix merge error	2025-11-27 15:49:57 -07:00
arai713	24d88d2472	[CK_TILE] Move DataTypeTraits into a Common File (#3146 ) This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile. Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>	2025-11-27 09:09:54 -08:00
Matthias Gehre	678298d4c7	Add support for gfx1153 (#3306 )	2025-11-27 08:48:00 +01:00
Thomas Ning	a38aeceb21	Fix and improve the gemm quant pipeline infrastructure (#3245 )	2025-11-26 18:04:27 -08:00
Max Podkorytov	79aae7c7f7	[CK Tile] enable building examples by default (#3259 ) * remove EXCLUDE_FROM_ALL from ck-tile examples -> +15 min build time w/ 64 threads for a single arch * fix cpp17 compile error in the ck-tile examples --------- Co-authored-by: khuagarw <khuagarw@amd.com> Co-authored-by: Ding, Yi <yi.ding@amd.com>	2025-11-26 16:24:44 -08:00
Aviral Goel	de6466481f	chore(copyright): update copyright header for include directory (#3293 )	2025-11-26 11:00:05 -07:00
John Shumway	10a782d846	Fix template parameter macros (#3305 ) Some of the device implementation templates have macros like GridwiseGemmMultiABDTemplateParameters that can cause build errors if multiple files are included together. This error comes up with our builder code. To clean up the macros and make them safer, we follow these rules: * Use more specific names to avoid duplication. * Undefine the macro after it is used to avoid leaking out of the file scope. * Use a prefix CK_ on the macro to avoid conflicting with other libraries. * Use all caps with underscores for preprocessor macro names.	2025-11-26 09:48:17 -08:00
Aviral Goel	35a4b26af0	fix: add dynamic selection of pipelines for aquant mode (#3282 ) - Add conditional selection to use v3 pipeline when PreshuffleQuant is true - Add static assertion in memory pipeline to prevent PreshuffleQuant usage - Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases - Update BaseGemmPipeline selection to handle all quant modes properly	2025-11-26 10:58:09 +04:00
Yi DING	8fa90025d0	[CK_TILE] Refine warp_gemm_attribute_mfma (#3272 )	2025-11-26 10:57:15 +08:00
Yi DING	c7dce2ac29	[CK_TILE] Fix Compilation of Flatmm Examples (#3285 )	2025-11-26 10:11:43 +08:00
Bartłomiej Kocot	00dfa2f2ce	[CK TILE] Grouped Conv Explicit Gemm (#3289 ) * [CK TILE] Grouped Conv Explicit Gemm * fixes * apply builder fixes	2025-11-25 23:28:35 +01:00
Bartłomiej Kocot	9ac2666d5b	[CK_BUILDER] Add grouped conv bwd ck tile traits (#3281 ) * [CK_BUILDER] Add grouped conv bwd ck tile traits * copilot fixes	2025-11-25 14:57:43 +01:00
rocking	229d43ea0c	Fix batch prefill compile fail in aiter (#3279 ) * Fix batch prefill aiter compile fail * Fix compile error	2025-11-25 09:46:32 +08:00
Thomas Ning	de6a9590ab	Reorganize of KPack in GEMM (#3247 ) * add the reorganize of KPack * fix the compilation error * fix the compilation error	2025-11-24 12:38:59 -08:00
Christopher Millette	b9c6cb1452	First look at mfma / wmma unification (#2704 ) * First look at mfma / wmma unification * Refactor * Re-org file structure * Restructure transform selection and WaveWiseMma class * Update license files. Add missing gfx1151 support. Change wave size for HOST to 1. Update datatypes naming consistency * Fixes default MmaSelector implentation * Adds unit tests for amdgcn_mma and arch * Consolidate common arch id checks to constexpr functions. Strongly type ids as amdgcn_target_arch_id object. * Refactor is_any_value_of * Fixes mma_selector logic * Fix typo * Add mma selector test for tile decomposition * Fix compilation of mma.hpp * Revert back to c++17 compatibility * Fix compiler error by returning index_t from get_warp_size() * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fixes compiler error for missing is_wave32() function * Fixes compiler error for host wave_size() should be 64 * Fixes compiler errors where __cpp_concepts is not defined * Fixes compiler errors where __cpp_concepts is not defined * Fix test failure for host is wave64 by default --------- Co-authored-by: Chris Millette <you@example.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-11-24 09:39:59 -08:00
Khushbu Agarwal	8111572785	[CK_Tile] Support for preshuffle weight(B) quant tensor for block scale gemm (#3165 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and correcponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * addressing review comments * fixing CI issue * addressing reveiw comments * formatting * formatting * fixing aquant operator overlaoding * formatting --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-11-24 07:48:42 -08:00
Qianfeng	81042ea574	Fix a bug for qr_ks_vs_async_trload pipeline (#3271 )	2025-11-24 21:31:48 +08:00
rocking	5948dbffe4	Support fp8 dynamic quantization for fmha (#3206 ) * Support qscale for dynamic quant, remove static quant * Support hdim=256 * Remove bias test case for fp8 --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: asleepzzz <hanwen.chang@amd.com>	2025-11-24 16:28:25 +08:00
Johannes Graner	096f0a3b23	[CK Tile] Fix example for conv fwd + bias + clamp (#3235 ) * Fix clamp not being applied correctly * Apply group offsets to D tensors --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-11-24 07:36:26 +01:00
Emily Martins	02ab76c2cb	Fix CK Tile DP + 2 Tile Stream-K Validation Errors (#3269 ) When multiple workgroups contribute to a tile using atomics, there may be round-off error in cases where the accumulator type is not the same as the C type. To compute an error tolerance for test validation, the Stream-K Tile Partitioner has a function called estimate_num_wgs_per_tile to estimate the number of workgroups per tile. That said, this function only provides an estimate. In some cases for DP+2TSK, the function returns 1 rather than the more accurate value of 2. Thus, this change updates the estimate_num_wgs_per_tile function to explicitly return the value of 2 in cases for DP+2TSK to ensure that we have a better error tolerance to avoid test failures due to round-off error.	2025-11-21 20:29:47 -07:00
lalala-sh	f58bd56e6b	fix static assert (#3178 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-11-20 17:27:05 -08:00
Yi DING	8b284a63a4	[CK_TILE] Refine FP32 => FP16/BF16 Conversion (#3215 ) * [CK_TILE] Refine FP32 => FP16/BF16 Conversion * Thank you Copilot * Rename fix * Fix example * Fix accu checking * Fix * Fix	2025-11-20 10:50:26 -08:00
Gavin Zhao	07314ac543	Add support for RDNA1 GPUs (#3220 ) * Allow compilation for RDNA1 (__gfx101__) Signed-off-by: Gavin Zhao <git@gzgz.dev> * More RDNA1 changes Signed-off-by: Gavin Zhao <git@gzgz.dev> * Even more RDNA1 changes Signed-off-by: Gavin Zhao <git@gzgz.dev> * cmake: skip build quantization for unsupported arches * add gfx10-1-generic support as well * add gfx1013 and complete gfx10-1-generic * fix clang format * enable DL kernels on gfx101x --------- Signed-off-by: Gavin Zhao <git@gzgz.dev> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-11-20 10:45:57 -08:00
Emily Martins	2e4b8a8fc4	[CK_TILE] Remove Old CK Tile Stream-K Artifacts (#3202 ) * Remove old CK Tile Stream-K implementation The original CK Stream-K implementation was based on old CK's Stream-K block to C tile map. However, this implementation did not align with the original Stream-K paper. Thus, we implemented a new tile partitioner and associated Stream-K kernel, which was placed in the reboot namespace. Now that the new Stream-K implementation is ready, this change removes all artifacts of the old implementation. Specifically, the following changes were made: - Removes old Stream-K tile partitioner from CK Tile - Removes the reboot namespace such that the new implementation resides in the ck_tile namespace only. - Adds tests for bf8 and fp8 using the new implementation - Removes tests for the old implementation - Remove the v2 suffix from the new CK Tile Tile Partitioner derived classes. - Updates Stream-K Kernel ops file to use /** commenting style. * Remove v2 from tile partitioner validation function names	2025-11-20 09:32:32 -07:00
asleepzzz	5adaa201ed	Revert "Add attn sink (#2892 )" (#3250 ) This reverts commit `9fa4e8d5ab`.	2025-11-20 07:55:15 -08:00
Linjun-AMD	9fa4e8d5ab	Add attn sink (#2892 ) * enable attn sink Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update attn_sink script Signed-off-by: JL-underdog <Jun.Lin@amd.com> * fix some error Signed-off-by: JL-underdog <Jun.Lin@amd.com> * clang-format Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update fmha_bwd mask Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update fmha_bwd_kernel'mask Signed-off-by: JL-underdog <Jun.Lin@amd.com> * update block_fmha_pipeline_qr_ks_vs.hpp Signed-off-by: JL-underdog <Jun.Lin@amd.com> * fix ci error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * fix format error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_bwd_pipeline_default_policy.hpp * Update fmha_fwd_runner.hpp * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update fmha_fwd_runner.hpp * Update fmha_fwd_runner.hpp * Update fmha_fwd_runner.hpp * update splitkv_pipline Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update splitkv&pagedkv pipeline Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * add sink test Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update attn_sink result log Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update smoke_test_fwd_sink.sh Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update test file Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * update test script Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp * use constexpr kHasSink for sink in fmha pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update by pre-commit Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/kernel/fmha_fwd_pagedkv_kernel.hpp Co-authored-by: Copilot 
<175728472+Copilot@users.noreply.github.com> * Update fmha_fwd.py * Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd_splitkv.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Remove causal mask setting logic from mask.hpp Removed the mask setting logic for causal masks. * fix ci error that some usage of lamada not support in c++17 Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update remod.py * add smoke sink test Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update fmha_pagedkv_prefill.py * Update FmhaFwdPipeline parameters in fmha_fwd.py * update block_fmha_pipeline_qr_ks_vs_async_trload.hpp Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * fix c++17 unsupprot error Signed-off-by: LJ-underdog <Jun.Lin@amd.com> * Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp * Fix formatting of sink_seq_end assignment * Fix indentation for sink_seq_end assignment * Update block_fmha_fwd_pagedkv_pipeline_qr_ks_vs.hpp --------- Signed-off-by: JL-underdog <Jun.Lin@amd.com> Signed-off-by: LJ-underdog <Jun.Lin@amd.com> Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-11-20 19:24:05 +08:00
Yi DING	47e2ed838e	[CK_TILE] Add Flatmm MX FP8 (#3208 ) * Use async for flatmm mxfp4 * Fix preshuffle * Add flatmm mxfp8 * Thanks, Copilot * Thanks Copilot again~	2025-11-20 10:35:15 +08:00
linqunAMD	d2e32b4305	[ck_tile] enable test grouped_gemm_quant and gemm_streamk on gfx12 (#3196 ) 1. Enable grouped_gemm_quant and gemm_streamk on gfx12 - test_ck_tile_streamk_smoke is kept on gfx9, since it looks someone is still working on it. 2. Update warp tile size in grouped_gemm_quant and gemm_streamk unit test 3. Reduce gemm tile size to pass the build on gfx12 in test_gemm_streamk_reboot_types.hpp	2025-11-20 08:40:27 +08:00
Michal Kulikowski	cd8af997e6	[CK] s_prefetch unit test fixes. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>	2025-11-19 21:54:50 +01:00
Michal Kulikowski	f3ef7acca0	[CK] Added s_prefetch unit test. -added s_buffer_load_b32/64 assembly -added amd_s_buffer_load_impl Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>	2025-11-19 21:54:50 +01:00
John Shumway	ad57f6ef0b	[CK_BUILDER] Put global CK functions in the CK namespace (#3232 ) * Wrap ck host utilities in CK namespace. The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace. Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace. There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our two code bases (CK and CKTile) so that they can both coexist. Moreover, establishing this compatibility is essential if we are going to allow the builder to instantiate kernels from either template library. * Add using declarations to test code. After moving some of the utils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace. * Add using declarations to client examples.	2025-11-19 11:23:02 +01:00
Anton Gorenko	d7b3197869	[CK_TILE] FMHA Reduce register spilling in fwd with dropout (workaround for CI failures with clang-22) (#3221 ) * Use vectorized stores for dropout randvals With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of 16 buffer_store_byte. This requires less registers and reduces spilling. * Calculate dropout randvals for storing and applying only once Even though it may add a small overhead when storing is not required, it uses significantly less registers and hence no spilling.	2025-11-19 10:40:12 +05:00
Aviral Goel	f5ac3ee359	chore(copyright): update copyright header for include directory (#3224 ) * chore(copyright): update copyright header for tile_engine directory * chore(copyright): update copyright header for script directory * chore(copyright): update copyright header for test_data directory * chore(copyright): update copyright header for python directory * chore(copyright): update copyright header for profiler directory * chore(copyright): update copyright header for library directory * chore(copyright): update copyright header for include directory	2025-11-18 10:17:18 -08:00
Max Podkorytov	a3a4eb12bd	[CK-Tile] Remove usage of tile partitioner's full gemm shape (#3204 ) gemm shape should be used from the pipeline instead (where it gets from a problem description struct)	2025-11-18 09:56:40 -08:00
Sami Remes	3ede8e2a6e	[CK_TILE] Non-K Major from old CK to CK-Tile - fix reverted PR (#3199 ) * Reapply "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017) This reverts commit `e4298e55c7`. * WIP * take Y2 as the AK1/BK1 value, that is the 'vector size' after shuffle * use get_n_lds_banks() * clang-format --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2025-11-18 10:17:02 +02:00
Yi DING	b6720531de	[CK_TILE] MX Flatmm Split kernel instances (#3207 ) * [CK_TILE] MX Flatmm Split kernel instances * Fix flatmm example compile	2025-11-18 13:46:30 +08:00
Illia Silin	b38bb492a1	Disable DL kernels on all architectures except gfx103x. (#3218 ) * disable dl kernels on all archs except gfx103 * add gfx10-3-generic target to cmake	2025-11-14 17:39:50 -08:00
jefyang1	d30babbd00	Add new gemm multiply multiply instances on gfx950 (#3213 )	2025-11-14 08:20:41 -08:00
BingYuan.Zhou	4d629cd2b0	fix build error (#3195 ) Co-authored-by: root <root@hjbog-srdc-39.amd.com>	2025-11-14 09:46:13 +08:00
Yi DING	4a8b17d1a4	[CK_TILE] Improve device printing (#3198 ) * [CK_TILE] Improve device printing * fix host gtest build * clean	2025-11-14 09:46:06 +08:00
yinglu	2a73eb3bc0	Simulate TF32 with BF16x3 (#3142 ) * tf32:bf16x3:use bf16x3 emulate tf32 gemm * change blockwiseGemm to demo bf16x3 * temp push * self review * self review * fix multi-device compile error * bug fix * code refactor * limit to gfx950 * enhance gemm gfx942 threshold * lower change from blockwise to warpwise * refact codes * refact codes * error fix * change threshold * bug fix * fix threshold error * change host reference implement to same as device * bug fix * bug fix * code refact * fix clang-format fail * code refine	2025-11-13 16:21:09 -08:00
SamiAario-AMD	f2cfc6b94e	Remove "basic" and universal GEMM tests, and incorporate their test cases into the GEMM pipeline tests (#3094 ) * Add missing copyright statements * Use ck_tile::host_tensor_descriptor instead of a custom lambda * Refactor use of check_data_type in test classes * Use TEST_SUITE_NAME with TYPED_TEST_SUITE * Remove an unused namespace * Make dim3 const * Add BF8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add F8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add BF16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add BF16 x BF16 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add BF8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add F8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Add F16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp * Skip failing tests of F16 x I4 for CompV3 with K == 2 * K_Tile * Add missing precision type combinations to CompV4 from CompV3 * Move the INT8 tests around for consistency with KernelTypesCompV3Wmma * Add missing precision type combinations to CompV3Wmma from CompV3 * Remove the basic and universal tests and their dependencies * On __gfx950__, avoid using transposed loading of A with datatype pk_int4_t of B * Use ADataType and BDataType instead of ComputeDataType for WarpGemm * Explicitly set some return types to void * Use more general typenames in InterleavedPKTypeLoader * Add load_interleaved_pk_type.hpp to common.hpp * Use std::is_same_v in load_int4_tile * Add handling of LoadTranspose to load_int4_tile * Factor out common code in several places using load_int4_tile * Add support for pk_int4_t using load_int4_tile * Fix formatting	2025-11-13 11:01:27 -08:00
Yi DING	8d50001b93	[CK_TILE] Improve F8F6F4 Scaled WarpGemm (#3197 ) * [CK_TILE] Improve F8F6F4 Scaled WarpGemm * Thanks, Copilot	2025-11-13 20:22:05 +08:00
Enrico Degregori	7414a0f4d4	Wmma support for gemm_reduce (#3145 ) * Initial implementation GEMM+Reduce: - device struct - epilogue struct * Fix tests, improve profiler and add initial instances * Add instances * Fix compilation error * Address review comments * Fix logging --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-11-12 11:23:54 -08:00
Po Yen Chen	40d2ed0f2a	[CK_TILE] Share partition index across threads and specify offset in load_tile()/async_load_tile()/load_tile_transpose() (#2905 ) * Allow sharing partition index across threads * Fix typo PartitoinIndex -> PartitionIndex * Remove C++20 'requires' usages * Add missing template arguments * Fix load_tile() overload ambiguity issue * Use SFINAE to exclude invalid arguments * Add additional offset parameter to the async_load_tile() * Remove async_load_tile() default argument to avoid ambiguity * Extract tile_window coordinate compute logic as method * Use warp-shared LDS base address in tile_window::async_load() * Add constraint to tile_window::load() templates * Fix wrong type traits is_class_v<> usages * Add missing constraint to async_load_tile() * Add missing tile_window::load() overload * Add more constraint to avoid load_tile() call ambiguity * Rename ParitionIndex as ReplacementPartitionIndex * Update pre_computed_warp_coords_ in move_extended() * Fix inconsistency between template parameters and documentation * Allow specifying pre-computed parition index * Add type straits is_sequence<> & is_tile_distribution<> * Add type straits is_tensor_view<> * Add type constraints to make_tile_window() templates * Allow passing partition_index to set_tile_if() * Allow specifying partition_index to store_tile() * Add missing template parameter of replace_bottom_tensor_view() * Allow passing partition_index to Default2DEpilogue * Make get_partition_index() public * Add _with_offset() postfix to avoid resolution error * Remove ReplacementPartitionIndex template param * Add missing comments * Add load_tile_transpose_with_offset() overload	2025-11-12 10:26:14 +08:00
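
The four macro rules listed in commit 10a782d846 (specific name, undefine after use, CK_ prefix, all caps with underscores) might look roughly like the sketch below. This is only an illustration under assumptions: CK_EXAMPLE_TILE_TEMPLATE_PARAMETERS, ExampleTileShape, ExampleDeviceOp, and example_tile_elements are hypothetical names invented here and are not taken from the composable_kernel sources.

```cpp
#include <cstddef>

// Hypothetical macro (not from the CK sources). It follows the rules from the
// commit message: CK_ prefix, all caps with underscores, and a name specific
// enough that another header is unlikely to define the same one.
#define CK_EXAMPLE_TILE_TEMPLATE_PARAMETERS MPerBlock, NPerBlock, KPerBlock

// Hypothetical stand-in for a type that takes a long template parameter list.
template <std::size_t MPerBlock, std::size_t NPerBlock, std::size_t KPerBlock>
struct ExampleTileShape
{
    static constexpr std::size_t kElements = MPerBlock * NPerBlock * KPerBlock;
};

// Hypothetical stand-in for a device implementation template that forwards
// its parameters through the macro.
template <std::size_t MPerBlock, std::size_t NPerBlock, std::size_t KPerBlock>
struct ExampleDeviceOp
{
    using Shape = ExampleTileShape<CK_EXAMPLE_TILE_TEMPLATE_PARAMETERS>;
};

// Undefine the macro once it is no longer needed so it does not leak out of
// this file into translation units that include several such headers.
#undef CK_EXAMPLE_TILE_TEMPLATE_PARAMETERS

// Compile-time guard: fails the build if the macro is still visible here.
#ifdef CK_EXAMPLE_TILE_TEMPLATE_PARAMETERS
#error "CK_EXAMPLE_TILE_TEMPLATE_PARAMETERS leaked past its intended scope"
#endif

// Small helper so the sketch can be exercised from a test.
constexpr std::size_t example_tile_elements()
{
    return ExampleDeviceOp<128, 128, 128>::Shape::kElements;
}
```

Because the macro is undefined at the end of the file, a second header could define its own parameter-list macro without a redefinition conflict, which is the failure mode the commit message describes when multiple files are included together.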


1308 Commits