composable_kernel/example/64_fpAintB_gemm
John Afaganis 96c39b331e [rocm-libraries] ROCm/rocm-libraries#7829 (commit 13af7da)
[ck] Enforce ASCII-only C/C++ sources for hipRTC
 compatibility (#7829)

## Summary

CK source files must be compilable via **hipRTC (HIP runtime
compilation)**, whose preprocessor does not accept non-ASCII bytes
anywhere in a translation unit — **including in comments**. Bytes that
are harmless under `hipcc` (em-dashes, smart quotes, multiplication
signs, Greek letters, box-drawing glyphs, etc.) cause hipRTC to fail at
preprocessing time. These regularly leak in via LLM-assisted authoring
or copy/paste from formatted documents and silently break hipRTC paths
that are not exercised by the default `hipcc`-based build matrix.

This PR (a) cleans every existing violation (53 files) and (b) adds a
pre-checkin gate so new violations are rejected before merge.

## File extensions covered

Both the cleanup scan and the new Jenkins enforcement stage use the same
predicate:

```
*.h  *.hpp  *.cpp  *.h.in  *.hpp.in  *.cpp.in  *.inc  *.cl
```

(excluding `*/build/*` and `*/include/rapidjson/*`). This is a strict
superset of the existing `Clang Format` stage's predicate — `*.inc` is
added so test-fixture include files are also gated. The local pre-commit
hook's `c++/inc` type filter covers the same set.

## Why no enforcement today

CK is opted out of the rocm-libraries root `.pre-commit-config.yaml`, so
the existing `pre-commit` workflow doesn't touch CK. The local CK
`.pre-commit-config.yaml` only runs for developers who installed hooks.
The **authoritative gate is therefore the new Jenkins stage** in this
PR; the local hook is convenience.

## Commit layout (bisect-friendly)

1. `79798aa6261` — **`[ck] Convert reflect/ rendering to ASCII for
hipRTC compatibility`**
Behavior change, isolated. `TreeFormatter` swaps `├─ / └─ / │ ` for `|-
/ +- / | ` (3-col width preserved so alignment is unchanged).
`conv_description.hpp` swaps `×` for `x` as the dimension separator.
`test_conv_description.cpp` expected strings updated in lockstep so the
snapshot test stays green. This is the only commit in the series with
observable runtime impact.

2. `738fdb0d81c` — **`[ck] Strip non-ASCII bytes from C++ sources for
hipRTC compatibility`**
Mechanical text cleanup across 53 files. Replacements happen in comments
or in `std::cout` strings that are not asserted on by any test. None of
the 174 `.inc` files in the tree required edits, but they were in the
scan's predicate so the enforcement stage's predicate is a superset of
what was scanned. Full replacement table in the commit message.

3. `1d7cd8ba235` — **`[ck] Enforce ASCII-only C/C++ sources for hipRTC
compatibility`**
- New `projects/composablekernel/script/check_ascii_only.sh` (modeled on
`check_copyright_year.sh`).
- New entry in `projects/composablekernel/.pre-commit-config.yaml` under
the local-hooks block (`types_or: [c++, inc]`).
- New `ASCII Only Check` parallel stage in
`projects/composablekernel/Jenkinsfile`'s `Static checks` block,
mirroring the existing `Clang Format` stage but with `*.inc` added to
the find predicate. Always-on, no `RUN_CPPCHECK` gate.

The tree is buildable at every commit boundary. Commit 1 leaves 50 known
violations; commit 2 leaves 0; commit 3 wires the gate.

## Demo

Script output on a synthesized violation:

```
$ printf '// em-dash test \xe2\x80\x94 here\n' > /tmp/bad.cpp
$ projects/composablekernel/script/check_ascii_only.sh /tmp/bad.cpp
ERROR: /tmp/bad.cpp contains non-ASCII bytes:
1:// em-dash test — here
  Fix: replace with ASCII (em-dash -> --, smart quotes -> ", arrows -> ->, etc.)
$ echo $?
1
```

Full repo scan after the cleanup commits (note the `-name '*.inc'`
clause):

```
$ cd projects/composablekernel && find . -type f \( -name '*.h' -o -name '*.hpp' -o -name '*.cpp' \
    -o -name '*.h.in' -o -name '*.hpp.in' -o -name '*.cpp.in' -o -name '*.inc' -o -name '*.cl' \) \
    -not -path '*/build/*' -not -path '*/include/rapidjson/*' -print0 \
  | xargs -0 -P 8 -n 64 script/check_ascii_only.sh
$ echo $?
0
```
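
The byte-level rule behind the check can be sketched in C++ (assuming C++17). This is only an illustration of "any byte above 0x7F is a violation"; it is not the contents of `check_ascii_only.sh`, and the function name `first_non_ascii_line` is hypothetical.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Returns the 1-based number of the first line that contains a byte above
// 0x7F, or std::nullopt when the whole buffer is plain ASCII.
// Hypothetical helper for illustration; not part of the CK scripts.
std::optional<std::size_t> first_non_ascii_line(std::string_view text)
{
    std::size_t line = 1;
    for(unsigned char ch : text)
    {
        if(ch > 0x7F)
        {
            return line; // comments count too: the check is purely byte-based
        }
        if(ch == '\n')
        {
            ++line;
        }
    }
    return std::nullopt;
}
```

Because the scan ignores C++ syntax entirely, a non-ASCII byte inside a comment is reported the same way as one inside a string literal, which matches the hipRTC constraint described in the summary.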

## Test plan

- [ ] Jenkins PR build: confirm new `Static checks -> ASCII Only Check`
stage runs green over the full predicate (incl. `*.inc`) and existing
`Clang Format` stage is unaffected.
- [ ] `test_conv_description` passes against the ASCII tree-formatter
output (touched in commit 1).
- [ ] Local: `pre-commit run ascii-only-checker --all-files` runs
cleanly after installing CK pre-commit hooks via
`script/install_precommit.sh`.
- [ ] Manually inject a non-ASCII byte in any `.cpp/.hpp/.inc` file,
push: confirm Jenkins fails the new stage with a clear error.
- [ ] Spot-check a representative subset of touched files under hipRTC
compilation to confirm no remaining hipRTC-blocking content (optional,
since the static byte check is a sufficient condition for hipRTC
preprocessor acceptance on this dimension).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-06-04 15:00:17 +00:00

Mixed-Precision GEMM: FP16 A × INT8 B

This example demonstrates a mixed-precision GEMM in which matrix A is stored in FP16 (half-precision floating point) and matrix B in INT8 (8-bit integer). Storing B in INT8 halves the memory traffic for B relative to FP16, which helps bandwidth-bound inference workloads while keeping numerical accuracy acceptable.

Mathematical Formulation

The operation performs matrix multiplication with mixed data types:

$$C = A_{\text{fp16}} \times B_{\text{int8}}$$

Where:

  • Matrix A has FP16 elements with shape [M, K]
  • Matrix B has INT8 elements with shape [K, N]
  • Matrix C typically has FP16 or FP32 elements with shape [M, N]

The computation involves:

  1. Type Conversion: INT8 elements of B are converted to FP16 during computation
  2. Scaling: Optional scaling factors can be applied to account for the quantization of B
  3. Accumulation: Products are accumulated in higher precision (typically FP32) to maintain numerical accuracy
  4. Output Conversion: Final results are converted to the desired output precision
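
The four steps above can be sketched as a host-side reference loop. This is a minimal illustration, not the CK kernel: `float` stands in for FP16 storage, `scale` is assumed to be a single per-tensor dequantization factor, and the function name `gemm_fp_a_int_b` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference sketch of C = A * (scale * B) on row-major data.
// Assumptions (not taken from the CK example): float stands in for FP16,
// and one per-tensor scale dequantizes B.
std::vector<float> gemm_fp_a_int_b(const std::vector<float>& a,       // [M, K]
                                   const std::vector<std::int8_t>& b, // [K, N]
                                   std::size_t M,
                                   std::size_t N,
                                   std::size_t K,
                                   float scale = 1.0f)
{
    assert(a.size() == M * K);
    assert(b.size() == K * N);

    std::vector<float> c(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
    {
        for(std::size_t n = 0; n < N; ++n)
        {
            float acc = 0.0f; // step 3: accumulate in FP32
            for(std::size_t k = 0; k < K; ++k)
            {
                // steps 1 and 2: convert the INT8 element, then apply the scale
                const float b_val = static_cast<float>(b[k * N + n]) * scale;
                acc += a[m * K + k] * b_val;
            }
            c[m * N + n] = acc; // step 4: a real kernel would narrow to FP16 here
        }
    }
    return c;
}
```

A loop like this is mainly useful as a correctness oracle: its output can be compared element by element against the device result within a tolerance.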

Algorithmic Strategy: Mixed-Precision Tiled GEMM

The implementation extends the standard tiled GEMM algorithm to handle mixed data types efficiently.

  1. Tiled Matrix Multiplication: Standard tiling approach with type-specific optimizations:

    • A Matrix Loading: FP16 elements are loaded directly from global memory
    • B Matrix Loading: INT8 elements are loaded and converted to FP16 in registers
    • Scaling Application: If quantization scales are provided, they are applied during the conversion
    • Mixed-Type Computation: FP16 × FP16 multiplication with FP32 accumulation
  2. Memory Access Optimization:

    • Bandwidth Efficiency: INT8 storage halves the bytes read for B compared to storing B in FP16
    • Coalescing: Both data types are accessed with coalesced memory patterns
    • Vectorization: Use vectorized loads where possible for both FP16 and INT8 data
  3. Computation Precision:

    • Multiply-Accumulate: Use FP32 accumulators to avoid FP16 overflow and limit rounding error over long K dimensions
    • Hardware Utilization: Leverage mixed-precision matrix instructions where available
    • Numerical Stability: Careful handling of type conversions to minimize precision loss
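
The loop structure described above can be sketched on the host as follows. This is an illustration of the tiling idea only: `float` stands in for FP16, the small `b_tile` buffer plays the role of registers, `kTile` is an arbitrary tile size, and `scale` is assumed to be a single per-tensor factor. None of these are the CK example's actual parameters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Arbitrary tile size chosen for illustration.
constexpr std::size_t kTile = 4;

// Tiled sketch: each INT8 tile of B is converted once and then reused
// for every row of A, with FP32 accumulation in c.
std::vector<float> tiled_gemm_fp_a_int_b(const std::vector<float>& a,       // [M, K]
                                         const std::vector<std::int8_t>& b, // [K, N]
                                         std::size_t M,
                                         std::size_t N,
                                         std::size_t K,
                                         float scale)
{
    assert(a.size() == M * K);
    assert(b.size() == K * N);

    std::vector<float> c(M * N, 0.0f);        // FP32 accumulators
    std::vector<float> b_tile(kTile * kTile); // converted tile of B

    for(std::size_t k0 = 0; k0 < K; k0 += kTile)
    {
        const std::size_t k_end = std::min(k0 + kTile, K);
        for(std::size_t n0 = 0; n0 < N; n0 += kTile)
        {
            const std::size_t n_end = std::min(n0 + kTile, N);

            // Load the INT8 tile and convert it, applying the scale on the way.
            for(std::size_t k = k0; k < k_end; ++k)
            {
                for(std::size_t n = n0; n < n_end; ++n)
                {
                    b_tile[(k - k0) * kTile + (n - n0)] =
                        static_cast<float>(b[k * N + n]) * scale;
                }
            }

            // Multiply every row of A against the converted tile.
            for(std::size_t m = 0; m < M; ++m)
            {
                for(std::size_t k = k0; k < k_end; ++k)
                {
                    const float a_val = a[m * K + k];
                    for(std::size_t n = n0; n < n_end; ++n)
                    {
                        c[m * N + n] += a_val * b_tile[(k - k0) * kTile + (n - n0)];
                    }
                }
            }
        }
    }
    return c;
}
```

Converting a tile once and reusing it across rows is the point of the sketch: the conversion cost is amortized over M rows, while only INT8 bytes are read from the large B array.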

Source Code Organization

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

cd /path/to/composable_kernel/build
make -j install

Build the Example

cd /path/to/composable_kernel/example/64_fpAintB_gemm
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j

Run the Example

# Run the example with default settings
./fpAintB_gemm_xdl

# Run with verification on (1), initialization method 2, and kernel timing on (1)
./fpAintB_gemm_xdl 1 2 1

Applications in Model Optimization

Mixed-precision GEMM is crucial for efficient neural network inference:

  • Quantized Inference: Deploy models with quantized weights (INT8) while keeping activations in higher precision (FP16)
  • Memory-Constrained Environments: Reduce memory footprint for weight storage while maintaining computational accuracy
  • Edge Deployment: Enable deployment on devices with limited memory bandwidth
  • Large Language Models: Reduce memory requirements for transformer models while preserving quality
  • Computer Vision Models: Optimize CNN inference with quantized convolution layers

Performance Benefits

This mixed-precision approach provides several advantages:

  • Memory Bandwidth: Matrix B needs half the bandwidth it would need in an FP16×FP16 GEMM
  • Storage Efficiency: INT8 matrices take 50% less storage than the same matrices in FP16
  • Cache Efficiency: More data fits in cache due to reduced memory footprint
  • Energy Efficiency: Lower memory traffic reduces energy consumption

Quantization Considerations

Effective use of INT8 quantization requires:

  • Calibration: Proper calibration to determine appropriate scaling factors
  • Range Analysis: Understanding the dynamic range of weights to maximize INT8 utilization
  • Accuracy Trade-offs: Balancing between compression ratio and numerical accuracy
  • Hardware Support: Leveraging hardware features for efficient mixed-precision computation
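
As an illustration of how a scaling factor relates to the weight range, the sketch below uses symmetric per-tensor quantization (assuming C++17). This is one common scheme chosen for the example, not necessarily what the CK example expects; `QuantizedTensor` and `quantize_symmetric` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical container: INT8 values plus the scale that maps them back,
// so that w[i] is approximately values[i] * scale.
struct QuantizedTensor
{
    std::vector<std::int8_t> values;
    float scale;
};

// Symmetric per-tensor quantization: the largest magnitude maps to +/-127.
QuantizedTensor quantize_symmetric(const std::vector<float>& w)
{
    float max_abs = 0.0f;
    for(float v : w)
    {
        max_abs = std::max(max_abs, std::fabs(v));
    }

    // Range analysis: pick the scale so the full INT8 range is used.
    // An all-zero tensor gets scale 1 to avoid dividing by zero.
    const float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    QuantizedTensor q{{}, scale};
    q.values.reserve(w.size());
    for(float v : w)
    {
        const float rounded = std::round(v / scale);
        q.values.push_back(static_cast<std::int8_t>(std::clamp(rounded, -127.0f, 127.0f)));
    }
    return q;
}
```

A single outlier weight inflates `max_abs` and therefore the scale, which coarsens every other value. That is the accuracy trade-off that calibration and range analysis are meant to manage.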

Comparison with Other Precision Formats

| Configuration | A Precision | B Precision | Memory Bandwidth (relative to FP32×FP32) | Accuracy | Hardware Support |
|---|---|---|---|---|---|
| FP32×FP32 | FP32 | FP32 | 1.0× (baseline) | Highest | Universal |
| FP16×FP16 | FP16 | FP16 | 0.5× | High | Modern GPUs |
| FP16×INT8 | FP16 | INT8 | 0.375× | Medium-High | Specialized |
| INT8×INT8 | INT8 | INT8 | 0.25× | Medium | Specialized |

For inference workloads where weights can be quantized but activations need FP16 range, FP16×INT8 is a middle ground: by the table above it moves 25% less data than FP16×FP16 while avoiding INT8 quantization of the activations.
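
The relative bandwidth figures in the table can be reproduced with a small compile-time calculation (assuming C++17). The sketch assumes A and B hold the same number of elements and ignores traffic for C and for any scale factors; `relative_traffic` is a hypothetical helper name.

```cpp
#include <cstddef>

// Bytes per (A element, B element) pair, relative to the FP32 x FP32 baseline
// of 4 + 4 bytes. Assumes equal element counts for A and B.
constexpr double relative_traffic(std::size_t a_bytes, std::size_t b_bytes)
{
    return static_cast<double>(a_bytes + b_bytes) / (4.0 + 4.0);
}

static_assert(relative_traffic(4, 4) == 1.0);   // FP32 x FP32
static_assert(relative_traffic(2, 2) == 0.5);   // FP16 x FP16
static_assert(relative_traffic(2, 1) == 0.375); // FP16 x INT8
static_assert(relative_traffic(1, 1) == 0.25);  // INT8 x INT8
```

When B is much larger than A, as with a large weight matrix and a small activation batch, the real saving from INT8 weights approaches the 2× figure for B alone rather than the 0.375× blend.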