mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-10 16:28:38 +00:00

Files

Illia Silin 717f2efef7 [rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

[CK] add composable kernel support on gfx1250 (#6978)

## Motivation

Add composable kernel support on gfx1250.

Co-authored-by: Qun Lin <qlin@amd.com>
Co-authored-by: jialuo12_amdeng <jia.luo@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: hsivasun_amdeng <haresh.sivasuntharampillai@amd.com>

2026-05-15 06:46:51 -07:00

batched_gemm_xdl_bf16_v3.cpp

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

batched_gemm_xdl_bf16.cpp

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

batched_gemm_xdl_fp8_rowwise_v3.cpp

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

batched_gemm_xdl_fp16.cpp

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

batched_gemm_xdl_fp16int4_b_scale_v3.cpp

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

batched_gemm_xdl_fp32.cpp

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

batched_gemm_xdl_int4.cpp

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

batched_gemm_xdl_int8.cpp

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

CMakeLists.txt

chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)

2025-11-28 13:49:54 -08:00

README.md

[DOCS] Documentation Addition (Readme updates) (#2495)

2025-10-16 03:10:57 -07:00

run_batched_gemm_example_fp16int4_b_scale.inc

[rocm-libraries] ROCm/rocm-libraries#6978 (commit e58096d)

2026-05-15 06:46:51 -07:00

run_batched_gemm_example_rowwise.inc

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

run_batched_gemm_example.inc

chore(copyright): update copyright header for example directory (#3273)

2025-11-24 18:02:41 -08:00

README.md

Batched GEMM

Theory

This example demonstrates batched GEMM: performing multiple independent matrix multiplications (all with the same shape) in a single kernel launch. Batched GEMM is used in multi-head attention, RNNs, and other models requiring parallel matrix multiplications.

Mathematical Formulation: For B batches:


$$C_b = A_b \times B_b \quad \text{for} \quad b = 1, 2, \ldots, B$$

A_b: [M, K] input matrix for batch b
B_b: [K, N] weight matrix for batch b
C_b: [M, N] output matrix for batch b
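
The formulation above can be sketched as a plain CPU reference loop, which is useful when checking a GPU result. This is a minimal sketch and not Composable Kernel code: the function name `reference_batched_gemm` is hypothetical, and it assumes row-major, densely packed matrices, so that consecutive batches sit `M*K`, `K*N`, and `M*N` elements apart in A, B, and C.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Composable Kernel): CPU reference for
// C_b = A_b * B_b over `batch` independent problems of identical shape.
// Assumption: row-major, densely packed storage, so the batch strides are
// M*K for A, K*N for B, and M*N for C.
std::vector<float> reference_batched_gemm(const std::vector<float>& a,
                                          const std::vector<float>& b,
                                          std::size_t batch,
                                          std::size_t M,
                                          std::size_t N,
                                          std::size_t K)
{
    assert(a.size() == batch * M * K);
    assert(b.size() == batch * K * N);

    std::vector<float> c(batch * M * N, 0.0f);
    for(std::size_t g = 0; g < batch; ++g)
    {
        // Base pointers of the g-th matrices: batch index times batch stride.
        const float* a_g = a.data() + g * M * K;
        const float* b_g = b.data() + g * K * N;
        float* c_g       = c.data() + g * M * N;

        for(std::size_t m = 0; m < M; ++m)
        {
            for(std::size_t n = 0; n < N; ++n)
            {
                float acc = 0.0f;
                for(std::size_t k = 0; k < K; ++k)
                {
                    acc += a_g[m * K + k] * b_g[k * N + n];
                }
                c_g[m * N + n] = acc;
            }
        }
    }
    return c;
}
```

Each batch reads and writes only its own slices, which is why the batches are independent and can be processed in parallel.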

Algorithmic Background:

All matrices in the batch have the same shape and strides.
The kernel launches a single grid that covers all batches; each workgroup (block) is assigned to exactly one batch and computes a tile of that batch's output.
Used for multi-head attention, parallel MLPs, and more.
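
One way to picture the grid described above is a flat block index that is split into a batch index and a tile position. The host-side sketch below illustrates that idea under stated assumptions: the names `BlockWork`, `grid_size`, and `decode_block` are hypothetical, the tile sizes are free parameters, and tiles are ordered row by row within a batch. The actual CK kernel may order tiles differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration; none of these names are CK identifiers.
struct BlockWork
{
    std::size_t batch;  // which of the independent GEMMs this block works on
    std::size_t tile_m; // tile row inside that batch's output matrix
    std::size_t tile_n; // tile column inside that batch's output matrix
};

inline std::size_t ceil_div(std::size_t x, std::size_t y) { return (x + y - 1) / y; }

// Total number of blocks: output tiles per batch times the number of batches.
inline std::size_t grid_size(std::size_t batch, std::size_t M, std::size_t N,
                             std::size_t m_per_block, std::size_t n_per_block)
{
    return batch * ceil_div(M, m_per_block) * ceil_div(N, n_per_block);
}

// Split a flat block index into (batch, tile_m, tile_n).
// Assumption: blocks of one batch are contiguous, tiles are row-major.
inline BlockWork decode_block(std::size_t block_id, std::size_t batch,
                              std::size_t M, std::size_t N,
                              std::size_t m_per_block, std::size_t n_per_block)
{
    const std::size_t tiles_m          = ceil_div(M, m_per_block);
    const std::size_t tiles_n          = ceil_div(N, n_per_block);
    const std::size_t blocks_per_batch = tiles_m * tiles_n;
    assert(block_id < batch * blocks_per_batch);

    const std::size_t local = block_id % blocks_per_batch;
    return BlockWork{block_id / blocks_per_batch, local / tiles_n, local % tiles_n};
}
```

Because every batch has the same shape, `blocks_per_batch` is a single constant, so the batch index falls out of one integer division. Multiplying that index by the batch stride then gives the base offset of the matrices the block should read and write.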

How to Run

Prerequisites

Please follow the instructions in the main Build Guide section as a prerequisite to building and running this example.

Build and run

cd composable_kernel/example/24_batched_gemm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./batched_gemm_xdl --verify=1 --time=1

Source Code Structure

Directory Layout

example/24_batched_gemm/
└── batched_gemm_xdl_*.cpp                # Main examples, one per data type: set up, run, and verify batched GEMM
include/ck/tensor_operation/gpu/device/
└── device_batched_gemm_xdl.hpp           # Device-level batched GEMM API
include/ck/tensor_operation/gpu/grid/
└── gridwise_batched_gemm_xdl.hpp         # Grid-level batched GEMM kernel

Key Classes and Functions

DeviceBatchedGemmXdl (in device_batched_gemm_xdl.hpp): Device API for batched GEMM.
gridwise_batched_gemm_xdl (in gridwise_batched_gemm_xdl.hpp): Implements the tiled/blocking batched GEMM kernel.

This example demonstrates how Composable Kernel supports efficient parallel matrix multiplication for batched and multi-head workloads.