mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-11 09:40:51 +00:00

Files

Vijay Krish bad7262507 ck_tile kernel for gemm with groupwise quantized B tensor. (#2663)

* This change introduces new pipelines with the Intrawave scheduler and block GEMM primitives that load the scale tensor into registers and perform dequantization on the C tensor in registers, after the MFMA.

The scale tensor data (BQ) is spliced across threads in registers and is not stored in LDS.

The following combinations are currently supported; extending support to more formats should be fairly straightforward.

fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32

The group size can be as small as the K length of the underlying WarpGemm primitive.
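
To make the "dequantize after the multiply-accumulate" idea concrete, here is a minimal host-side reference sketch. It is not the ck_tile kernel and uses no ck_tile API. The function name `gemm_groupwise_bq_reference`, the row-major layouts, and the assumption of one scale per (K-group, output column) are all illustrative choices, not taken from the commit. B is assumed to have been converted to float already, so only the scale handling is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference implementation (not a ck_tile API).
// a  : M x K, row-major
// b  : K x N, row-major, unscaled quantized values already converted to float
// bq : (K / group_size) x N, row-major, one scale per K-group and column (assumed layout)
std::vector<float> gemm_groupwise_bq_reference(const std::vector<float>& a,
                                               const std::vector<float>& b,
                                               const std::vector<float>& bq,
                                               std::size_t m,
                                               std::size_t n,
                                               std::size_t k,
                                               std::size_t group_size)
{
    assert(group_size > 0 && k % group_size == 0);
    const std::size_t num_groups = k / group_size;
    assert(a.size() == m * k && b.size() == k * n && bq.size() == num_groups * n);

    std::vector<float> c(m * n, 0.0f);
    for(std::size_t i = 0; i < m; ++i)
    {
        for(std::size_t j = 0; j < n; ++j)
        {
            float acc = 0.0f;
            for(std::size_t g = 0; g < num_groups; ++g)
            {
                // Unscaled partial product over one K-group; on the GPU this
                // role would be played by one or more MFMA instructions.
                float partial = 0.0f;
                for(std::size_t kk = g * group_size; kk < (g + 1) * group_size; ++kk)
                {
                    partial += a[i * k + kk] * b[kk * n + j];
                }
                // Dequantize after the multiply-accumulate by applying the group scale.
                acc += partial * bq[g * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}
```

Because the scale is constant within a group, scaling the partial sum once per group gives the same result as scaling every element of B before the multiply, which is why the scale can be applied to the accumulator rather than to B.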

* Solve merge conflict

* [CK TILE] Update CHANGELOG.md

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>

[ROCm/composable_kernel commit: 4208e28988]

2025-08-28 23:43:02 -07:00

algorithm

[CK_TILE] Enable printing more structures in CK-Tile (#2443)

2025-08-07 15:45:27 +03:00

arch

[CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715)

2025-08-25 20:55:12 +08:00

container

[CK_TILE] Enable printing more structures in CK-Tile (#2443)

2025-08-07 15:45:27 +03:00

numeric

ck_tile kernel for gemm with groupwise quantized B tensor. (#2663)

2025-08-28 23:43:02 -07:00

tensor

[CK_TILE] FMHA avoid unnecessary vmcnt0 (#2715)

2025-08-25 20:55:12 +08:00

utility

Fixes to "General 2D Reduction Kernel" (#2535) (#2656)

2025-08-11 15:01:33 +02:00

config.hpp

Support Wave32 in CK_TILE - Part 1 (#2594)

2025-08-18 10:08:31 -07:00

README.md

introducing ck_tile! (#1216)

2024-04-15 19:27:12 -05:00

README.md

ck_tile/core

ck_tile/core contains all the basic functions and structures needed to create a GPU kernel using ck_tile. Users should include only the single header ck_tile/core.hpp to use all of the functionality. Everything is under the ck_tile namespace. The coding style under this folder should be similar to std (snake_case for structures and functions, CamelCase for template types, and so on).

algorithm/
    coordinate transforms and other reusable algorithms
arch/
    basic device building blocks such as mma, buffer addressing, etc.
container/
    basic container data structures: array, sequence, tuple, ...
numeric/
    data types and data-type-related math
tensor/
    tensor descriptors and tile-level API
utility/
    other utility functions for both host and device
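
As an illustration of the std-like naming convention described above, here is a small sketch in plain C++. None of these names exist in ck_tile; the namespace `style_sketch`, the struct `value_range`, and the function `clamp_to_range` are hypothetical and only show snake_case for structures and functions alongside CamelCase for template type parameters.

```cpp
#include <cassert>

namespace style_sketch { // hypothetical namespace, not part of ck_tile

// snake_case structure name, CamelCase template type parameter
template <typename DataType>
struct value_range
{
    DataType lower;
    DataType upper;

    // snake_case member function
    bool contains(DataType value) const { return lower <= value && value <= upper; }
};

// snake_case free function, CamelCase template type parameter
template <typename DataType>
DataType clamp_to_range(const value_range<DataType>& range, DataType value)
{
    assert(range.lower <= range.upper);
    if(value < range.lower)
    {
        return range.lower;
    }
    if(range.upper < value)
    {
        return range.upper;
    }
    return value;
}

} // namespace style_sketch
```

In real ck_tile code the equivalent declarations would live in the ck_tile namespace and be reached through the single ck_tile/core.hpp header, as the README states.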