composable_kernel/example/64_fpAintB_gemm/README.md

# Mixed-Precision GEMM: FP16 A × INT8 B

This example demonstrates a **mixed-precision GEMM operation** where matrix `A` is in FP16 (half-precision floating-point) format and matrix `B` is in INT8 (8-bit integer) format. This is an important optimization technique for inference workloads that enables significant memory bandwidth reduction while maintaining acceptable numerical accuracy.

## Mathematical Formulation

The operation performs matrix multiplication with mixed data types:
$C = A_{fp16} \times B_{int8}$

Where:
- Matrix `A` has FP16 elements with shape `[M, K]`
- Matrix `B` has INT8 elements with shape `[K, N]`
- Matrix `C` typically has FP16 or FP32 elements with shape `[M, N]`

The computation involves:
1.  **Type Conversion**: INT8 elements of `B` are converted to FP16 during computation
2.  **Scaling**: Optional scaling factors can be applied to account for the quantization of `B`
3.  **Accumulation**: Products are accumulated in higher precision (typically FP32) to maintain numerical accuracy
4.  **Output Conversion**: Final results are converted to the desired output precision

## Algorithmic Strategy: Mixed-Precision Tiled GEMM

The implementation extends the standard tiled GEMM algorithm to handle mixed data types efficiently.

1.  **Tiled Matrix Multiplication**: Standard tiling approach with type-specific optimizations:
    -   **A Matrix Loading**: FP16 elements are loaded directly from global memory
    -   **B Matrix Loading**: INT8 elements are loaded and converted to FP16 in registers
    -   **Scaling Application**: If quantization scales are provided, they are applied during the conversion
    -   **Mixed-Type Computation**: FP16 × FP16 multiplication with FP32 accumulation

2.  **Memory Access Optimization**:
    -   **Bandwidth Efficiency**: INT8 storage for `B` reduces memory bandwidth by 2× compared to FP16
    -   **Coalescing**: Both data types are accessed with coalesced memory patterns
    -   **Vectorization**: Use vectorized loads where possible for both FP16 and INT8 data

3.  **Computation Precision**:
    -   **Multiply-Accumulate**: Use FP32 accumulators to prevent overflow and maintain accuracy
    -   **Hardware Utilization**: Leverage mixed-precision matrix instructions where available
    -   **Numerical Stability**: Careful handling of type conversions to minimize precision loss

## Source Code Organization

-   [`fpAintB_gemm_xdl.cpp`](./fpAintB_gemm_xdl.cpp): The main example file. It sets up FP16 matrix A, INT8 matrix B with optional scaling factors, and instantiates the `DeviceFpAintBGemm` operation.
-   [`../../include/ck/tensor_operation/gpu/device/device_fpAintB_gemm.hpp`](../../include/ck/tensor_operation/gpu/device/device_fpAintB_gemm.hpp): The device interface for mixed-precision GEMM operations.
-   The underlying kernel implements optimized mixed-type arithmetic with efficient type conversion and scaling operations.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/64_fpAintB_gemm
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./fpAintB_gemm_xdl

# Run with verification, data initialization, and timing
./fpAintB_gemm_xdl 1 2 1
```

## Applications in Model Optimization

Mixed-precision GEMM is crucial for efficient neural network inference:

-   **Quantized Inference**: Deploy models with quantized weights (INT8) while keeping activations in higher precision (FP16)
-   **Memory-Constrained Environments**: Reduce memory footprint for weight storage while maintaining computational accuracy
-   **Edge Deployment**: Enable deployment on devices with limited memory bandwidth
-   **Large Language Models**: Reduce memory requirements for transformer models while preserving quality
-   **Computer Vision Models**: Optimize CNN inference with quantized convolution layers

## Performance Benefits

This mixed-precision approach provides several advantages:

-   **Memory Bandwidth**: 2× reduction in bandwidth for matrix B compared to FP16×FP16
-   **Storage Efficiency**: 50% reduction in storage requirements for quantized matrices
-   **Cache Efficiency**: More data fits in cache due to reduced memory footprint
-   **Energy Efficiency**: Lower memory traffic reduces energy consumption

## Quantization Considerations

Effective use of INT8 quantization requires:

-   **Calibration**: Proper calibration to determine appropriate scaling factors
-   **Range Analysis**: Understanding the dynamic range of weights to maximize INT8 utilization
-   **Accuracy Trade-offs**: Balancing between compression ratio and numerical accuracy
-   **Hardware Support**: Leveraging hardware features for efficient mixed-precision computation

## Comparison with Other Precision Formats

| Configuration | A Precision | B Precision | Memory Bandwidth | Accuracy | Hardware Support |
|---------------|-------------|-------------|------------------|----------|------------------|
| **FP32×FP32** | FP32 | FP32 | 1.0× (baseline) | Highest | Universal |
| **FP16×FP16** | FP16 | FP16 | 0.5× | High | Modern GPUs |
| **FP16×INT8** | FP16 | INT8 | 0.375× | Medium-High | Specialized |
| **INT8×INT8** | INT8 | INT8 | 0.25× | Medium | Specialized |

The FP16×INT8 configuration provides an excellent balance between memory efficiency and numerical accuracy for many inference workloads.