mirror of https://github.com/ROCm/composable_kernel.git synced 2026-03-25 01:27:40 +00:00

Vidyasagar Ananthan 9e049a32a1 Adding dispatcher architecture (#3300)

2026-01-22 09:34:33 -08:00

Adding New GPU Architecture Support

This guide explains how to add support for a new AMD GPU architecture to the CK Tile Dispatcher.

See also: Main Dispatcher README | Codegen README

Overview

The dispatcher uses arch_specs.json as the single source of truth for GPU specifications. Edit only the JSON file; the Python module and C++ header are generated from it:

arch_specs.json → generate_arch_specs.py → arch_specs_generated.py (Python)
                                        → arch_specs_generated.hpp (C++)
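
To make the flow above concrete, here is a minimal sketch of how one JSON document can be rendered into both a Python module and a C++ header. It is not the actual generate_arch_specs.py: the function names (`render_python`, `render_cpp`) and the emitted symbols (`ARCH_SPECS`, `kWarpSize_*`) are hypothetical, and the JSON is inlined instead of read from disk.

```python
import json

# Inline stand-in for arch_specs.json (a real run would read the file).
SPECS_JSON = """
{
  "architectures": {
    "gfx942": {"family": "cdna3", "warp_size": 64, "lds_capacity_kb": 64}
  }
}
"""


def render_python(specs: dict) -> str:
    """Render the specs as Python source defining a hypothetical ARCH_SPECS dict."""
    return "ARCH_SPECS = " + repr(specs["architectures"]) + "\n"


def render_cpp(specs: dict) -> str:
    """Render one constexpr per architecture (hypothetical naming scheme)."""
    lines = ["#pragma once", ""]
    for name, arch in sorted(specs["architectures"].items()):
        lines.append(
            f"inline constexpr int kWarpSize_{name} = {arch['warp_size']};"
        )
    return "\n".join(lines) + "\n"


specs = json.loads(SPECS_JSON)
python_source = render_python(specs)
cpp_source = render_cpp(specs)
print(python_source)
print(cpp_source)
```

Because both outputs come from the same parsed dictionary, the Python and C++ sides cannot drift apart unless one of the generated files is edited by hand.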

Quick Start

# 1. Edit arch_specs.json (in dispatcher/codegen)
# 2. Run the generator from the same directory
python generate_arch_specs.py
# 3. Rebuild
cd ../build && cmake --build . -j8
# 4. Run the tests
ctest --output-on-failure

Step-by-Step Guide

Step 1: Edit arch_specs.json

Add the new architecture as an entry under the "architectures" key:

{
  "architectures": {
    "gfx1100": {
      "family": "rdna3",
      "description": "AMD Radeon RX 7000 series (RDNA3)",
      "warp_size": 32,
      "lds_capacity_kb": 64,
      "warp_configs": [
        [2, 4, 1],
        [4, 2, 1]
      ],
      "warp_tile_combos": {
        "fp16_fp16_fp16": [[16, 16, 16], [32, 32, 16]],
        "bf16_bf16_bf16": [[16, 16, 16], [32, 32, 16]]
      }
    }
  }
}

Step 2: Configuration Fields

| Field | Description | Example |
|-------|-------------|---------|
| `family` | GPU family | `"cdna3"`, `"rdna4"` |
| `description` | Human-readable name | `"AMD Instinct MI300"` |
| `warp_size` | Wave/warp size | `64` (CDNA), `32` (RDNA) |
| `lds_capacity_kb` | LDS memory in KB | `64` |
| `warp_configs` | Valid `[warp_m, warp_n, warp_k]` | `[[2,2,1], [4,4,1]]` |
| `warp_tile_combos` | Warp tiles per dtype | See below |
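
Before running the generator, it can help to check that a new entry has every field from the table with a plausible type. The checker below is a sketch only: `check_arch_entry` and `REQUIRED_FIELDS` are hypothetical names, and the real generator may enforce different or stricter rules.

```python
# Field names come from the table above; the expected types are assumptions
# inferred from the example values.
REQUIRED_FIELDS = {
    "family": str,
    "description": str,
    "warp_size": int,
    "lds_capacity_kb": int,
    "warp_configs": list,
    "warp_tile_combos": dict,
}


def check_arch_entry(entry: dict) -> list:
    """Return a list of problems found in one architecture entry."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")

    # Every warp config and warp tile is expected to be an [m, n, k] triple.
    warp_configs = entry.get("warp_configs")
    if isinstance(warp_configs, list):
        for triple in warp_configs:
            if len(triple) != 3:
                problems.append(f"warp_configs entry {triple} is not [m, n, k]")
    combos = entry.get("warp_tile_combos")
    if isinstance(combos, dict):
        for key, tiles in combos.items():
            for tile in tiles:
                if len(tile) != 3:
                    problems.append(f"{key}: warp tile {tile} is not [m, n, k]")
    return problems


gfx1100 = {
    "family": "rdna3",
    "description": "AMD Radeon RX 7000 series (RDNA3)",
    "warp_size": 32,
    "lds_capacity_kb": 64,
    "warp_configs": [[2, 4, 1], [4, 2, 1]],
    "warp_tile_combos": {"fp16_fp16_fp16": [[16, 16, 16], [32, 32, 16]]},
}
print(check_arch_entry(gfx1100))
```

An empty list means the entry has the expected shape; it says nothing about whether the values are correct for the hardware.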

Step 3: Warp Tile Combinations

Map each data type combination to its valid [warp_tile_m, warp_tile_n, warp_tile_k] sizes:

"warp_tile_combos": {
  "fp16_fp16_fp16": [[32, 32, 8], [16, 16, 16], [32, 32, 16]],
  "bf16_bf16_bf16": [[32, 32, 8], [16, 16, 16]],
  "fp8_fp8_fp16": [[32, 32, 16], [32, 32, 32]],
  "int8_int8_int32": [[16, 16, 32], [32, 32, 16]]
}

Key format: {A_dtype}_{B_dtype}_{C_dtype}. For example, fp8_fp8_fp16 means fp8 A and B inputs with an fp16 C output.
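
The key format can be built and split mechanically. The helpers below are a small sketch with hypothetical names (`make_combo_key`, `parse_combo_key`); they rely on the fact that none of the data type names used in this guide contain an underscore.

```python
def make_combo_key(a_dtype: str, b_dtype: str, c_dtype: str) -> str:
    """Join the A, B, and C data types into a warp_tile_combos key."""
    return f"{a_dtype}_{b_dtype}_{c_dtype}"


def parse_combo_key(key: str) -> tuple:
    """Split a key back into (a_dtype, b_dtype, c_dtype)."""
    parts = key.split("_")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected A_B_C data type key, got {key!r}")
    return tuple(parts)


print(make_combo_key("fp8", "fp8", "fp16"))
print(parse_combo_key("int8_int8_int32"))
```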

Step 4: Run Generator

cd dispatcher/codegen
python generate_arch_specs.py

This generates:

arch_specs_generated.py (Python module)
../include/ck_tile/dispatcher/arch_specs_generated.hpp (C++ header)
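
A quick way to confirm that the generator ran after the last JSON edit is to compare file modification times. The helper below is a sketch, not part of the dispatcher: `stale_outputs` is a hypothetical name, and the demonstration uses a temporary directory instead of the real codegen/ tree.

```python
import os
import tempfile
from pathlib import Path


def stale_outputs(source: Path, outputs: list) -> list:
    """Return the outputs that are missing or older than the source file."""
    source_mtime = source.stat().st_mtime
    return [
        out for out in outputs
        if not out.exists() or out.stat().st_mtime < source_mtime
    ]


# Demonstration in a temporary directory rather than the real codegen/ tree.
with tempfile.TemporaryDirectory() as tmp:
    spec = Path(tmp) / "arch_specs.json"
    generated_py = Path(tmp) / "arch_specs_generated.py"
    generated_hpp = Path(tmp) / "arch_specs_generated.hpp"  # never created
    spec.write_text("{}")
    generated_py.write_text("# generated\n")
    # Pretend the JSON was edited after the Python module was generated.
    os.utime(generated_py, (1_000, 1_000))
    os.utime(spec, (2_000, 2_000))
    stale = [p.name for p in stale_outputs(spec, [generated_py, generated_hpp])]
    print(stale)
```

Any file name reported here means the generator should be run again before rebuilding.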

Step 5: Rebuild and Test

cd ../build
cmake --build . -j8
ctest --output-on-failure

Step 6: Verify

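from arch_filter import ArchFilter

# Check one kernel configuration against the new architecture.
gpu_filter = ArchFilter("gfx1100")  # avoids shadowing the built-in filter()
is_valid = gpu_filter.is_kernel_valid(
    datatype_a="fp16", datatype_b="fp16", datatype_c="fp16",
    tile_m=128, tile_n=128, tile_k=32,
    # [2, 4, 1] is one of the warp_configs listed for gfx1100 in Step 1;
    # [2, 2, 1] does not appear in that example entry.
    warp_m=2, warp_n=4, warp_k=1,
    warp_tile_m=16, warp_tile_n=16, warp_tile_k=16
)
print(f"Valid: {is_valid}")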
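
To reason about why a configuration passes or fails without a built dispatcher, the checks can be approximated in plain Python. This is a sketch, not the ArchFilter implementation: `check_kernel` is a hypothetical function, and the three rules it applies (warp config listed, warp tile listed, block tile divisible by warps times warp tile) are assumptions about what the filter verifies. The data is the example gfx1100 entry from Step 1.

```python
GFX1100 = {
    "warp_configs": [[2, 4, 1], [4, 2, 1]],
    "warp_tile_combos": {
        "fp16_fp16_fp16": [[16, 16, 16], [32, 32, 16]],
        "bf16_bf16_bf16": [[16, 16, 16], [32, 32, 16]],
    },
}


def check_kernel(arch, dtype_key, tile, warps, warp_tile):
    """Return a list of human-readable reasons the config is rejected."""
    errors = []
    if list(warps) not in arch["warp_configs"]:
        errors.append(f"warp config {list(warps)} not listed")
    if list(warp_tile) not in arch["warp_tile_combos"].get(dtype_key, []):
        errors.append(f"warp tile {list(warp_tile)} not listed for {dtype_key}")
    # Assumed rule: each block-tile dimension is a multiple of
    # (warps along that dimension) * (warp tile size along that dimension).
    for dim, t, w, wt in zip("mnk", tile, warps, warp_tile):
        if t % (w * wt) != 0:
            errors.append(f"tile_{dim}={t} not divisible by {w}*{wt}")
    return errors


accepted = check_kernel(GFX1100, "fp16_fp16_fp16", (128, 128, 32), (2, 4, 1), (16, 16, 16))
rejected = check_kernel(GFX1100, "fp16_fp16_fp16", (128, 128, 32), (2, 2, 1), (16, 16, 16))
print(accepted)
print(rejected)
```

An empty list corresponds to a valid configuration; each string in a non-empty list names one failed rule.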

Reference

Supported Data Types

| Key | Description |
|-----|-------------|
| `fp16` | Half precision (16-bit) |
| `bf16` | Brain float 16 |
| `fp32` | Single precision (32-bit) |
| `fp64` | Double precision (64-bit) |
| `fp8` | 8-bit float (E4M3) |
| `bf8` | 8-bit brain float (E5M2) |
| `int8` | 8-bit integer |
| `int4` | 4-bit integer |

GPU Families

| Family | Description |
|--------|-------------|
| `cdna2` | MI200 series (gfx90a) |
| `cdna3` | MI300 series (gfx942) |
| `cdna4` | MI350 series (gfx950) |
| `rdna3` | RX 7000 series (gfx1100) |
| `rdna4` | RX 9000 series (gfx1201) |

Pipeline LDS Limits

| Pipeline | LDS Limit |
|----------|-----------|
| `compv4` | 32 KB |
| `preshufflev2` | 32 KB |
| `default` | 64 KB |
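
The limits above can be compared against a rough estimate of a tile's LDS footprint. The sketch below assumes that one A tile (tile_m x tile_k) and one B tile (tile_n x tile_k) are staged in LDS with a single buffer, and that any pipeline not in the table uses the `default` limit. Both are assumptions, and the real pipelines may use more (for example with double buffering or padding). `lds_bytes` and `fits_in_lds` are hypothetical names.

```python
# Limits from the table above, in KB.
PIPELINE_LDS_LIMIT_KB = {"compv4": 32, "preshufflev2": 32, "default": 64}

# Bit widths of the data types listed under "Supported Data Types".
DTYPE_BITS = {
    "fp16": 16, "bf16": 16, "fp32": 32, "fp64": 64,
    "fp8": 8, "bf8": 8, "int8": 8, "int4": 4,
}


def lds_bytes(tile_m, tile_n, tile_k, a_dtype, b_dtype):
    """Estimate LDS bytes for one A tile plus one B tile (assumed layout)."""
    a_bits = tile_m * tile_k * DTYPE_BITS[a_dtype]
    b_bits = tile_n * tile_k * DTYPE_BITS[b_dtype]
    return (a_bits + b_bits) // 8


def fits_in_lds(pipeline, tile_m, tile_n, tile_k, a_dtype, b_dtype):
    """Compare the estimate with the pipeline's limit (default if unknown)."""
    limit_kb = PIPELINE_LDS_LIMIT_KB.get(pipeline, PIPELINE_LDS_LIMIT_KB["default"])
    return lds_bytes(tile_m, tile_n, tile_k, a_dtype, b_dtype) <= limit_kb * 1024


print(lds_bytes(128, 128, 32, "fp16", "fp16"))
print(fits_in_lds("compv4", 256, 256, 64, "fp16", "fp16"))
```

Under these assumptions a 128x128x32 fp16 tile needs 16 KB and fits every pipeline, while a 256x256x64 fp16 tile needs 64 KB and fits only the `default` limit.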

Troubleshooting

"Unknown GPU architecture"

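Check that the architecture key matches exactly, including case (e.g., "gfx942", not "GFX942").
Verify that you ran generate_arch_specs.py after editing arch_specs.json.
Rebuild the C++ code so it picks up the regenerated header.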
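
A case mismatch is easy to miss by eye. The lookup below is a sketch (`find_arch` is a hypothetical helper, not a dispatcher API) that does an exact match first and, on failure, suggests a key that differs only in case.

```python
def find_arch(architectures: dict, name: str) -> str:
    """Return the matching key, or raise KeyError with a hint."""
    if name in architectures:
        return name
    # Exact match failed: look for a key that differs only in case.
    for known in architectures:
        if known.lower() == name.lower():
            raise KeyError(f"unknown architecture {name!r}; did you mean {known!r}?")
    raise KeyError(f"unknown architecture {name!r}; known: {sorted(architectures)}")


ARCHS = {"gfx942": {}, "gfx1100": {}}
print(find_arch(ARCHS, "gfx942"))
try:
    find_arch(ARCHS, "GFX942")
except KeyError as err:
    print(err.args[0])
```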

Kernels being rejected

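Use validate_kernel to list the reasons a configuration is rejected:

from arch_filter import ArchFilter, KernelConfig

# `config` must be a KernelConfig describing the rejected kernel (data types,
# tile sizes, warp settings); see arch_filter.py for its fields.
gpu_filter = ArchFilter("gfx942")
result = gpu_filter.validate_kernel(config)
print(f"Valid: {result.valid}")
for error in result.errors:
    print(f"  Error: {error}")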

Missing warp tile combination

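Check the warp_tile_combos entry for your architecture in arch_specs.json.
Ensure [warp_tile_m, warp_tile_n, warp_tile_k] appears in the list for your data type key.
Verify that the data type key follows the {A_dtype}_{B_dtype}_{C_dtype} format.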
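
The first two checks can be told apart programmatically: either the data type key is absent, or the key exists but the tile is not in its list. The function below is a sketch with a hypothetical name (`explain_missing_tile`), run against the example combinations from Step 3.

```python
def explain_missing_tile(combos: dict, dtype_key: str, warp_tile) -> str:
    """Say whether the dtype key or the warp tile is the missing piece."""
    if dtype_key not in combos:
        return f"no entry for {dtype_key!r}; known keys: {sorted(combos)}"
    if list(warp_tile) not in combos[dtype_key]:
        return (
            f"{list(warp_tile)} not listed for {dtype_key!r}; "
            f"listed: {combos[dtype_key]}"
        )
    return "ok"


COMBOS = {
    "fp16_fp16_fp16": [[32, 32, 8], [16, 16, 16], [32, 32, 16]],
    "bf16_bf16_bf16": [[32, 32, 8], [16, 16, 16]],
}
print(explain_missing_tile(COMBOS, "fp16_fp16_fp16", [16, 16, 16]))
print(explain_missing_tile(COMBOS, "bf16_bf16_bf16", [32, 32, 16]))
print(explain_missing_tile(COMBOS, "fp16_fp16_fp32", [16, 16, 16]))
```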

File Structure

codegen/
├── arch_specs.json              # Single source of truth (EDIT THIS)
├── generate_arch_specs.py       # Generator script
├── arch_specs_generated.py      # Generated Python module
└── ADDING_NEW_GPU.md            # This file

include/ck_tile/dispatcher/
├── arch_specs_generated.hpp     # Generated C++ header
└── arch_filter.hpp              # C++ filter

Best Practices

Test thoroughly: run all tests after adding a new GPU.
Start minimal: add only configurations you have validated.
Document sources: note where each warp tile combination came from.
Keep in sync: if you also use tile_engine, update its architecture data together with arch_specs.json.

More info: See ../README.md for full documentation.
