composable_kernel/include/ck/tensor_operation/gpu/block at 7106976a72897f44b05260bd1ae1f70b319a4e75 - composable_kernel - Public git mirror

ROCm/composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 05:01:25 +00:00

Andriy Roshchenko 7106976a72 MX GEMM - New GEMM pipeline for MX data types (#2059)

* Allow selection of mfma_scale instructions

* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order

* Add constexpr and synchronize return type for `get_exponent_value`

* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`

* Add support for microscaling instructions in `XdlopsGemm`

* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper

* Remove software implementation of MX GEMM

* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction

* Update README

* Updated CHANGELOG

* Remove unused static methods

2025-04-15 17:17:07 -06:00

blockwise_gemm_dl_v2r3.hpp

Implement DPP8 based GEMM for Navi21 (#826)

2023-08-14 15:46:27 -05:00

blockwise_gemm_dlops_v2r2.hpp

update copyright headers (#726)

2023-05-31 18:46:57 -05:00

blockwise_gemm_dlops_v3.hpp

update copyright headers (#726)

2023-05-31 18:46:57 -05:00

blockwise_gemm_dpp.hpp

Fix cmake warnings (#1342)

2024-06-21 09:47:58 +02:00

blockwise_gemm_mx_pipeline_xdlops_base.hpp

MX GEMM - New GEMM pipeline for MX data types (#2059)

2025-04-15 17:17:07 -06:00

blockwise_gemm_pipeline_xdlops_ab_scale_selector.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862)

2025-02-20 14:00:27 -08:00

blockwise_gemm_pipeline_xdlops_b_preshuffle_dequant_v1.hpp

Ck int4 moe develop (#1949)

2025-03-10 11:16:44 +08:00

blockwise_gemm_pipeline_xdlops_b_preshuffle_dequant_v3.hpp

Ck int4 moe develop (#1949)

2025-03-10 11:16:44 +08:00

blockwise_gemm_pipeline_xdlops_b_preshuffle_selector.hpp

Ck int4 moe develop (#1949)

2025-03-10 11:16:44 +08:00

blockwise_gemm_pipeline_xdlops_b_preshuffle_v1.hpp

[Block Scale GEMM] Optimized block scale gemm (#1950)

2025-03-11 10:11:21 -07:00

blockwise_gemm_pipeline_xdlops_b_preshuffle_v2.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862)

2025-02-20 14:00:27 -08:00

blockwise_gemm_pipeline_xdlops_b_preshuffle_v3.hpp

[Block Scale GEMM] Optimized block scale gemm (#1950)

2025-03-11 10:11:21 -07:00

blockwise_gemm_pipeline_xdlops_b_scale_selector.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862)

2025-02-20 14:00:27 -08:00

blockwise_gemm_pipeline_xdlops_base.hpp

Introduce MX GEMM for FP8 data type (#2000)

2025-03-24 15:41:07 -06:00

blockwise_gemm_pipeline_xdlops_mx_selector.hpp

MX GEMM - New GEMM pipeline for MX data types (#2059)

2025-04-15 17:17:07 -06:00

blockwise_gemm_pipeline_xdlops_selector.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862)

2025-02-20 14:00:27 -08:00

blockwise_gemm_pipeline_xdlops_v1_ab_scale.hpp

[Block Scale GEMM] Optimized block scale gemm (#1950)

2025-03-11 10:11:21 -07:00

blockwise_gemm_pipeline_xdlops_v1_b_scale.hpp

Implement the fp16xint4 scale weight only kernel for Ali (#1786)

2025-01-03 18:35:21 +08:00

blockwise_gemm_pipeline_xdlops_v1_mx.hpp

MX GEMM - New GEMM pipeline for MX data types (#2059)

2025-04-15 17:17:07 -06:00

blockwise_gemm_pipeline_xdlops_v1.hpp

Fix cmake warnings (#1342)

2024-06-21 09:47:58 +02:00

blockwise_gemm_pipeline_xdlops_v2_ab_scale.hpp

[Block Scale GEMM] Optimized block scale gemm (#1950)

2025-03-11 10:11:21 -07:00

blockwise_gemm_pipeline_xdlops_v2_b_scale.hpp

Implement the fp16xint4 scale weight only kernel for Ali (#1786)

2025-01-03 18:35:21 +08:00

blockwise_gemm_pipeline_xdlops_v2.hpp

Change block gemm pipeline local prefill loop order. (#1692)

2024-11-26 17:36:53 +01:00

blockwise_gemm_pipeline_xdlops_v3_ab_scale.hpp

[Block Scale GEMM] Optimized block scale gemm (#1950)

2025-03-11 10:11:21 -07:00

blockwise_gemm_pipeline_xdlops_v3_b_scale.hpp

Implement the fp16xint4 scale weight only kernel for Ali (#1786)

2025-01-03 18:35:21 +08:00

blockwise_gemm_pipeline_xdlops_v3.hpp

Add gemm universal bf16 instances (#1484)

2024-09-04 20:58:54 -07:00

blockwise_gemm_pipeline_xdlops_v4_b_scale.hpp

Implement the fp16xint4 scale weight only kernel for Ali (#1786)

2025-01-03 18:35:21 +08:00

blockwise_gemm_pipeline_xdlops_v4.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862)

2025-02-20 14:00:27 -08:00

blockwise_gemm_pipeline_xdlops_v5.hpp

Fix cmake warnings (#1342)

2024-06-21 09:47:58 +02:00

blockwise_gemm_pipeline_xdlops.hpp

Fix compilation errors with Clang20.0. (#1533)

2024-09-25 13:45:38 -07:00

blockwise_gemm_smfmac_xdlops.hpp

Added structural sparsity blockwise gemm (#1435)

2024-09-11 15:19:42 +02:00

blockwise_gemm_wmma.hpp

fix compilation errors for gfx12 with clang20 (#1606)

2024-10-28 19:02:48 -07:00

blockwise_gemm_xdlops_skip_b_lds.hpp

Fix cmake warnings (#1342)

2024-06-21 09:47:58 +02:00

blockwise_gemm_xdlops.hpp

Merging the gfx12 code into public repo. (#1362)

2024-06-27 00:33:34 -07:00

blockwise_softmax.hpp

[HotFix] add config and version files to pass on build info (#856)

2023-08-23 11:36:17 -07:00

blockwise_tensor_slice_transfer_v5r1.hpp

update copyright headers (#726)

2023-05-31 18:46:57 -05:00

blockwise_welford.hpp

Batchnorm splitk single kernel (#771)

2023-07-06 10:58:55 -05:00

reduction_functions_blockwise.hpp

Batchnorm splitk single kernel (#771)

2023-07-06 10:58:55 -05:00

thread_group_tensor_slice_transfer_direct_load.hpp

Add basic support for direct loads from global to LDS (#999)

2023-11-25 13:35:22 +01:00

thread_group_tensor_slice_transfer_v4r1_dequant.hpp

Navi3 rel (#1176)

2024-03-08 17:11:51 -08:00

thread_group_tensor_slice_transfer_v4r1_gather.hpp

ck moe gemm implement (#1936)

2025-03-05 15:56:55 +08:00

thread_group_tensor_slice_transfer_v4r1.hpp

[A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862)

2025-02-20 14:00:27 -08:00

thread_group_tensor_slice_transfer_v4r2.hpp

Add elementwise with dynamic vector dim (#1198)

2024-03-22 10:40:43 +01:00

thread_group_tensor_slice_transfer_v6r1.hpp

update copyright headers (#726)

2023-05-31 18:46:57 -05:00

thread_group_tensor_slice_transfer_v6r1r2.hpp

initial stream-k implementation with example (#699)

2023-07-26 14:18:15 -05:00

thread_group_tensor_slice_transfer_v6r2.hpp

update copyright headers (#726)

2023-05-31 18:46:57 -05:00

thread_group_tensor_slice_transfer_v6r3.hpp

update copyright headers (#726)

2023-05-31 18:46:57 -05:00

thread_group_tensor_slice_transfer_v7.hpp

update copyright headers (#726)

2023-05-31 18:46:57 -05:00

thread_group_tensor_slice_transfer_v7r2.hpp

Codegen hipRTC compilation (#1579)

2025-01-31 09:48:39 -08:00

thread_group_tensor_slice_transfer_v7r3_scatter.hpp

ck moe gemm implement (#1936)

2025-03-05 15:56:55 +08:00

thread_group_tensor_slice_transfer_v7r3.hpp

add f8 gemm multiD with both row/col wise scale (#1300)

2024-05-28 12:04:22 -05:00