mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-13 17:55:48 +00:00

Files

Aviral Goel e0dfe58d66 [rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)

CK: Remove 41 commented-out dead code blocks (~200 lines) (#6302)

Depends on #6300

## Summary

Remove 41 commented-out code blocks across 33 files in Composable
Kernel, totaling ~200 lines.

Identified using an automated dead code scanning skill (`ck-dead-code`)
with a calibrated two-stage pipeline:
1. **Pre-filter**: Keyword-based scan found 1,338 `//`-commented blocks.
Calibrated heuristics (trained on 50-sample expert classification)
reduced to 89 high-confidence candidates — 93% noise reduction.
2. **Expert triage**: LLM expert classified each block in context as
CODE_REMOVE, CODE_KEEP, or NOT_CODE.

| Classification | Count |
|---------------|-------|
| Removed (this PR) | 41 |
| Kept (debug helpers, alt configs, reference impls) | 32 |
| Not code (false positives) | 16 |

Removed blocks include: superseded implementations, old test data,
abandoned stubs, unreachable code, and buggy dead code.

2026-04-10 15:18:02 +00:00

CMakeLists.txt

Dev/a8w4 and a8w8splitk (#3447 )

2025-12-19 09:26:52 +08:00

common.hpp

Wmma support for gemm_multiply_multiply_wp (#3278 )

2025-12-03 07:38:23 -08:00

gemm_add_add_wmma_fp16.cpp

[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation

2026-01-15 16:43:02 +01:00

gemm_add_add_xdl_fp16.cpp

chore(copyright): update copyright header for example directory (#3273 )

2025-11-24 18:02:41 -08:00

gemm_multiply_multiply_wmma_fp8_ab_scale.cpp

Wmma support for gemm_ab_scale (#3314 )

2025-12-11 09:06:20 +01:00

gemm_multiply_multiply_wmma_fp8_blockscale_bpreshuffle.cpp

Wmma support for gemm_ab_scale (#3314 )

2025-12-11 09:06:20 +01:00

gemm_multiply_multiply_wmma_fp8_bpreshuffle.cpp

Wmma support for gemm_multiply_multiply_wp (#3278 )

2025-12-03 07:38:23 -08:00

gemm_multiply_multiply_wmma_fp16_bpreshuffle.cpp

[rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5)

2026-02-20 22:41:34 +00:00

gemm_multiply_multiply_xdl_fp8_ab_scale.cpp

chore(copyright): update copyright header for example directory (#3273 )

2025-11-24 18:02:41 -08:00

gemm_multiply_multiply_xdl_fp8_blockscale_bpreshuffle.cpp

chore(copyright): update copyright header for example directory (#3273 )

2025-11-24 18:02:41 -08:00

gemm_multiply_multiply_xdl_fp8_bpreshuffle.cpp

chore(copyright): update copyright header for example directory (#3273 )

2025-11-24 18:02:41 -08:00

gemm_multiply_multiply_xdl_fp8.cpp

chore(copyright): update copyright header for example directory (#3273 )

2025-11-24 18:02:41 -08:00

gemm_multiply_multiply_xdl_fp16_bpreshuffle.cpp

chore(copyright): update copyright header for example directory (#3273 )

2025-11-24 18:02:41 -08:00

gemm_multiply_multiply_xdl_int8.cpp

chore(copyright): update copyright header for example directory (#3273 )

2025-11-24 18:02:41 -08:00

moe_gemm1_xdl_fp8_blockscale_splitk.cpp

moe fp8 blockscale use nt (#3524 )

2026-01-12 10:48:10 +08:00

moe_gemm1_xdl_fp8_blockscale.cpp

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464 )

2026-01-07 16:30:57 +01:00

moe_gemm1_xdl_fp8.cpp

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464 )

2026-01-07 16:30:57 +01:00

moe_gemm1_xdl_pk_i4.cpp

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464 )

2026-01-07 16:30:57 +01:00

moe_gemm2_xdl_fp8_blockscale.cpp

[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)

2026-04-10 15:18:02 +00:00

moe_gemm2_xdl_fp8.cpp

[rocm-libraries] ROCm/rocm-libraries#6302 (commit 8d419e8)

2026-04-10 15:18:02 +00:00

moe_gemm2_xdl_pk_i4.cpp

[CI, CK examples] Disable time_kernel for CI tests and examples (#3464 )

2026-01-07 16:30:57 +01:00

README.md

[DOCS] Documentation Addition (Readme updates) (#2495 )

2025-10-16 03:10:57 -07:00

run_gemm_multiply_multiply_wp_example.inc

[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation

2026-01-15 16:43:02 +01:00

README.md

GEMM with Double Multiply Operations

This example demonstrates a GEMM followed by two sequential elementwise multiplication operations. This fusion pattern is useful for implementing layers that require matrix multiplication followed by multiple scaling or masking operations, such as certain attention mechanisms or gated neural network architectures.

Mathematical Formulation

The operation performs a matrix multiplication followed by two sequential elementwise multiplications.

- GEMM Stage: A standard matrix multiplication. $C_{temp1} = A \times B$
- First Multiplication: Elementwise multiplication with tensor D. $C_{temp2} = C_{temp1} \odot D$
- Second Multiplication: Elementwise multiplication with tensor E. $F = C_{temp2} \odot E$

The key optimization is that the intermediate tensors $C_{temp1}$ and $C_{temp2}$ are never written to global memory. All operations are fused into the GEMM's epilogue, operating on data held in registers.
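As a point of comparison for the formulation above, here is a minimal host-side reference sketch. It is not taken from the example sources: the function name `reference_gemm_multiply_multiply`, the row-major layout, and the use of `float` for every tensor are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for F = (A x B) * D * E (elementwise), row-major storage.
// A is M x K, B is K x N; D, E and F are M x N.
// Hypothetical helper for illustration, not part of Composable Kernel.
std::vector<float> reference_gemm_multiply_multiply(const std::vector<float>& a,
                                                    const std::vector<float>& b,
                                                    const std::vector<float>& d,
                                                    const std::vector<float>& e,
                                                    std::size_t M,
                                                    std::size_t N,
                                                    std::size_t K)
{
    assert(a.size() == M * K && b.size() == K * N);
    assert(d.size() == M * N && e.size() == M * N);

    std::vector<float> f(M * N);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            // acc plays the role of one element of C_temp1.
            float acc = 0.0f;
            for(std::size_t k = 0; k < K; ++k)
            {
                acc += a[m * K + k] * b[k * N + n];
            }
            // Both multiplications are applied before the single store to F.
            f[m * N + n] = acc * d[m * N + n] * e[m * N + n];
        }
    }
    return f;
}
```

A reference like this is typically what a GPU result is checked against when verification is enabled; the intermediate values only ever live in the local variable `acc`.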

Algorithmic Strategy: GEMM with Dual-Multiply Epilogue

The implementation uses a tiled GEMM algorithm with a multi-stage fused epilogue that performs two sequential multiplications.

Tiled GEMM Core: The kernel begins with a standard tiled GEMM. A thread block computes a tile of the product $A \times B$, accumulating the result in registers.
Dual-Multiply Epilogue: Before any data is written to global memory, the following sequence occurs for the tile of data held in registers:
- Load First Multiplicand: Threads load the corresponding elements of tensor D.
- First Multiplication: The elementwise multiplication is performed in registers: result *= D.
- Load Second Multiplicand: Threads load the corresponding elements of tensor E.
- Second Multiplication: The second elementwise multiplication is performed in registers: result *= E.
- Store Final Result: The final result F is written to global memory.
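The epilogue sequence above can be emulated on the CPU to make the data flow concrete. The sketch below is an assumption-laden illustration, not the kernel itself: the tile sizes `kTileM` and `kTileN`, the function name `fused_tiled_gemm_multiply_multiply`, and the local `acc` array standing in for registers are all hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes; real kernels choose these per architecture and data type.
constexpr std::size_t kTileM = 2;
constexpr std::size_t kTileN = 2;

// CPU emulation of a tiled GEMM with a dual-multiply epilogue (row-major, float).
std::vector<float> fused_tiled_gemm_multiply_multiply(const std::vector<float>& a,
                                                      const std::vector<float>& b,
                                                      const std::vector<float>& d,
                                                      const std::vector<float>& e,
                                                      std::size_t M,
                                                      std::size_t N,
                                                      std::size_t K)
{
    assert(a.size() == M * K && b.size() == K * N);
    assert(d.size() == M * N && e.size() == M * N);

    std::vector<float> f(M * N);
    for(std::size_t m0 = 0; m0 < M; m0 += kTileM)
    {
        for(std::size_t n0 = 0; n0 < N; n0 += kTileN)
        {
            const std::size_t tm = std::min(kTileM, M - m0); // rows in this tile
            const std::size_t tn = std::min(kTileN, N - n0); // columns in this tile

            // Stand-in for the register accumulators of one thread block.
            std::array<float, kTileM * kTileN> acc{};

            // Tiled GEMM core: accumulate the A x B tile.
            for(std::size_t k = 0; k < K; ++k)
                for(std::size_t i = 0; i < tm; ++i)
                    for(std::size_t j = 0; j < tn; ++j)
                        acc[i * kTileN + j] += a[(m0 + i) * K + k] * b[k * N + (n0 + j)];

            // Dual-multiply epilogue: load D, multiply, load E, multiply, store F.
            for(std::size_t i = 0; i < tm; ++i)
            {
                for(std::size_t j = 0; j < tn; ++j)
                {
                    const std::size_t idx = (m0 + i) * N + (n0 + j);
                    float result = acc[i * kTileN + j];
                    result *= d[idx];
                    result *= e[idx];
                    f[idx] = result; // the only write to the output buffer
                }
            }
        }
    }
    return f;
}
```

The output buffer is written exactly once per element, which is the property the fused epilogue is designed to preserve on the GPU.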

This deep fusion eliminates multiple kernel launches and the memory bandwidth required to write and re-read intermediate tensors.

Source Code Organization

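gemm_multiply_multiply_xdl_fp8.cpp and its sibling variants in this directory (for example gemm_multiply_multiply_xdl_int8.cpp and gemm_multiply_multiply_xdl_fp16_bpreshuffle.cpp): The main example files. Each sets up the input matrices (A, B) and auxiliary tensors (D, E), and instantiates the DeviceGemmMultiplyMultiply operation.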
../../include/ck/tensor_operation/gpu/device/device_gemm_multiply_multiply.hpp: The high-level device interface for this fused operation.
The underlying kernel implements the dual-multiply epilogue that performs both multiplication operations on register data before storing.
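The per-element work of such an epilogue can be expressed as a small functor. The following is a simplified sketch under stated assumptions: `MultiplyMultiplySketch` and `apply_epilogue` are hypothetical names, and the library's own elementwise operator may differ in name, signature, and type handling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical elementwise functor: f = c * d * e, computed in float.
// c is the GEMM accumulator value; d and e are the two multiplicands.
struct MultiplyMultiplySketch
{
    template <typename FType, typename AccType, typename DType, typename EType>
    void operator()(FType& f, const AccType& c, const DType& d, const EType& e) const
    {
        const float result =
            static_cast<float>(c) * static_cast<float>(d) * static_cast<float>(e);
        f = static_cast<FType>(result);
    }
};

// Applies an epilogue functor to every accumulator element (host-side stand-in
// for what a kernel would do on register data before the final store).
template <typename Op>
void apply_epilogue(Op op,
                    const std::vector<float>& acc,
                    const std::vector<float>& d,
                    const std::vector<float>& e,
                    std::vector<float>& f)
{
    assert(acc.size() == d.size() && acc.size() == e.size() && acc.size() == f.size());
    for(std::size_t i = 0; i < acc.size(); ++i)
    {
        op(f[i], acc[i], d[i], e[i]);
    }
}
```

Keeping the arithmetic in a functor separates "what the epilogue computes" from "how the tile is traversed", so swapping in a different elementwise operation does not require touching the GEMM loop.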

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

cd /path/to/composable_kernel/build
make -j install

Build the Example

cd /path/to/composable_kernel/example/65_gemm_multiply_multiply
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j

Run the Example

# Run the example with default settings
./gemm_multiply_multiply_xdl

# Run with verification, data initialization, and timing
./gemm_multiply_multiply_xdl 1 2 1

Applications

This fusion pattern is useful for several types of neural network operations and advanced computational patterns.

Multi-Scale Attention: Some attention mechanisms apply multiple scaling factors sequentially, such as learned attention scales followed by positional scaling.
Gated Mechanisms: Advanced gating architectures that use multiple multiplicative gates in sequence, such as in some RNN variants or transformer modifications.
Feature Modulation: Computer vision models that apply multiple feature modulation operations, such as style-based generators or attention-based feature refinement.
Masking Operations: Applying multiple types of masks (e.g., attention mask followed by a dropout mask) in sequence.
Custom Activations: Implementing complex activation functions that involve multiple multiplicative terms.
Mixture of Experts: Some MoE architectures use multiple routing or gating multiplications in sequence.

Performance Considerations

The performance benefits of this fusion depend on several factors:

Memory Bandwidth Savings: Eliminates the global-memory write and re-read of both intermediate tensors (C_temp1 and C_temp2).
Data Locality: Keeps the accumulator tile in registers from the GEMM main loop through both multiplications.
Instruction Scheduling: Allows better interleaving of compute and memory operations.
Kernel Launch Overhead: Reduces three separate kernel launches to one.

Comparison with Sequential Operations

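| Approach | Kernel Launches | Intermediate Memory Traffic | Register Pressure | Implementation Complexity |
|----------|-----------------|-----------------------------|-------------------|---------------------------|
| Sequential | 3 kernels | 2 intermediate tensors written and re-read | Low | Simple |
| Fused | 1 kernel | No intermediate storage | Medium | Moderate |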
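A rough back-of-the-envelope model makes the bandwidth difference concrete. The sketch below is only an estimate under idealized assumptions that are not stated in the original: every tensor uses the same element size, and each operand is read from global memory exactly once per kernel. The names `Traffic` and `estimate_traffic` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Estimated global-memory traffic in bytes for both approaches.
struct Traffic
{
    std::uint64_t sequential_bytes;
    std::uint64_t fused_bytes;
};

// Idealized model: one read per input tensor and one write per output tensor
// for each kernel, with a single element size for all tensors.
Traffic estimate_traffic(std::uint64_t M, std::uint64_t N, std::uint64_t K, std::uint64_t elem_bytes)
{
    assert(elem_bytes > 0);
    const std::uint64_t ab = (M * K + K * N) * elem_bytes; // reading A and B
    const std::uint64_t mn = M * N * elem_bytes;           // one M x N tensor

    Traffic t{};
    // Fused: read A, B, D, E and write F.
    t.fused_bytes = ab + 3 * mn;
    // Sequential: GEMM writes C_temp1; the second kernel reads C_temp1 and D and
    // writes C_temp2; the third kernel reads C_temp2 and E and writes F.
    t.sequential_bytes = (ab + mn) + 3 * mn + 3 * mn;
    return t;
}
```

Under this model the two approaches differ by exactly four M x N tensor transfers (two writes and two reads of intermediates), independent of K, so the relative saving is largest when K is small and the GEMM itself moves little data.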

Extension Possibilities

This pattern can be extended in several ways:

More Multiplications: Additional sequential multiplications can be added to the epilogue
Mixed Operations: Combine multiplications with additions or other elementwise operations
Conditional Operations: Apply multiplications conditionally based on masks or thresholds
Broadcasting: Handle different broadcasting patterns for the multiplicand tensors
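The broadcasting extension can be illustrated with stride-based indexing, where a stride of zero repeats a multiplicand along that dimension. This is a host-side sketch under assumptions: `Strides2D` and `multiply_multiply_broadcast` are hypothetical names, and the example treats D as a per-row scale and E as a per-column scale purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element strides along rows and columns; a stride of 0 broadcasts that dimension.
struct Strides2D
{
    std::size_t row;
    std::size_t col;
};

// In-place epilogue on a row-major M x N buffer c (holding A x B on entry, F on exit).
void multiply_multiply_broadcast(std::vector<float>& c,
                                 const std::vector<float>& d,
                                 Strides2D d_strides,
                                 const std::vector<float>& e,
                                 Strides2D e_strides,
                                 std::size_t M,
                                 std::size_t N)
{
    assert(M > 0 && N > 0 && c.size() == M * N);
    // The largest index touched in each multiplicand must be in range.
    assert((M - 1) * d_strides.row + (N - 1) * d_strides.col < d.size());
    assert((M - 1) * e_strides.row + (N - 1) * e_strides.col < e.size());

    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            const float dv = d[m * d_strides.row + n * d_strides.col];
            const float ev = e[m * e_strides.row + n * e_strides.col];
            c[m * N + n] *= dv;
            c[m * N + n] *= ev;
        }
    }
}
```

With strides `{1, 0}` a length-M vector acts as a per-row scale, and with `{0, 1}` a length-N vector acts as a per-column scale; full M x N multiplicands use `{N, 1}`.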

This example demonstrates the flexibility of the epilogue fusion approach, showing how multiple sequential operations can be efficiently combined with matrix multiplication.
