composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-02 04:31:25 +00:00

Files

Vijay Krish 4208e28988 ck_tile kernel for gemm with groupwise quantized B tensor. (#2663)

* This change introduces new pipelines with the Intrawave scheduler and block gemm primitives that load the scale tensor into registers and perform dequantization on the C tensor in registers after the MFMA.

The scale tensor data, BQ, is spliced across threads in registers and is not stored in LDS.

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
The group size can be as small as the K length of the underlying WarpGemm primitive.
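
The dequantize-after-accumulate idea described above can be illustrated with a small CPU reference. This is only a sketch under stated assumptions, not the ck_tile kernel or its API: the function name `gemm_groupwise_bquant`, the row-major layouts, the one-scale-per-(K group, column) arrangement, and the use of `float`/`int` as stand-ins for fp8/bf8/i4 are all hypothetical choices made for illustration. Each inner loop over one K group plays the role of the MFMA accumulation, and the scale is applied once per group to the partial sum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: C = A * dequant(Bq), with B quantized
// group-wise along K. Layouts (all row-major, assumed for illustration):
//   a     : M x K                 (float stands in for fp8/bf8)
//   bq    : K x N                 (int stands in for i4/fp8/bf8 codes)
//   scale : (K / group_size) x N  (one scale per K group and column)
std::vector<float> gemm_groupwise_bquant(const std::vector<float>& a,
                                         const std::vector<int>& bq,
                                         const std::vector<float>& scale,
                                         std::size_t M,
                                         std::size_t N,
                                         std::size_t K,
                                         std::size_t group_size)
{
    assert(group_size > 0 && K % group_size == 0);
    const std::size_t num_groups = K / group_size;
    assert(a.size() == M * K);
    assert(bq.size() == K * N);
    assert(scale.size() == num_groups * N);

    std::vector<float> c(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.0f;
            for(std::size_t g = 0; g < num_groups; ++g)
            {
                // Accumulate the unscaled products of one K group
                // (the part a warp-level MFMA would compute).
                float partial = 0.0f;
                for(std::size_t k = g * group_size; k < (g + 1) * group_size; ++k)
                {
                    partial += a[m * K + k] * static_cast<float>(bq[k * N + n]);
                }
                // Dequantize after accumulation: one multiply per group.
                acc += partial * scale[g * N + n];
            }
            c[m * N + n] = acc;
        }
    }
    return c;
}
```

Because every element of a K group shares one scale, scaling the partial sum is algebraically the same as scaling each B element before multiplying, up to floating-point rounding. This is also why the group size cannot usefully drop below the K extent of one accumulation step in this sketch.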

* Solve merge conflict

* [CK TILE] Update CHANGELOG.md

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>

2025-08-28 23:43:02 -07:00

bfloat16.hpp

Re-enable optimization for gfx950 fmha fwd (#2671)

2025-08-13 14:57:43 +08:00

e8m0.hpp

Fix wrong NaN production. (#2640)

2025-08-14 15:12:31 +08:00

float8.hpp

[CK_TILE] Fix UB and corner cases in f32/f16 to/from f8 conversion (#2571)

2025-07-31 09:54:17 +05:00

half.hpp

[CK TILE] GEMM with packed i4 (#1885)