* Wrap CK host utilities in the ck namespace.
The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.
Move all functions in include/ck/host_utility that were in the global namespace into the ck namespace.
There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our two code bases (CK and CK-Tile) so that they can coexist. Moreover, establishing this compatibility is essential if we are going to allow the builder to instantiate kernels from either template library.
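For illustration, a minimal sketch of the pattern: the utility keeps its name and call sites, but its definition moves inside namespace ck. The body shown here is simplified and is not copied from the repository.

```cpp
// Before: the utility was declared in the global namespace, clashing with CK-Tile.
// void hip_check_error(hipError_t err);

// After: the same utility, wrapped in the ck namespace (simplified sketch).
#include <hip/hip_runtime.h>
#include <sstream>
#include <stdexcept>

namespace ck {

inline void hip_check_error(hipError_t err)
{
    if(err != hipSuccess)
    {
        std::ostringstream oss;
        oss << "HIP runtime error: " << hipGetErrorString(err);
        throw std::runtime_error(oss.str());
    }
}

} // namespace ck
```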
* Add using declarations to test code.
After moving these utilities into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compilation units for functions that were previously in the global namespace.
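A sketch of the kind of change applied to an individual example or test source file; the header path follows the directory named in the commit message, and the exact set of using declarations varies per file.

```cpp
// Top of an example/test translation unit that previously called the host
// utilities unqualified (they used to live in the global namespace).
#include <hip/hip_runtime.h>
#include "ck/host_utility/hip_check_error.hpp" // path per the commit message

// Pull the now-namespaced utility back into this file's scope so the existing,
// unqualified call sites keep compiling unchanged.
using ck::hip_check_error;

int main()
{
    // Existing code continues to call the utility without the ck:: prefix.
    hip_check_error(hipSetDevice(0));
    return 0;
}
```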
* Add using declarations to client examples.
[ROCm/composable_kernel commit: ad57f6ef0b]
Mixed-Precision GEMM: FP16 A × INT8 B
This example demonstrates a mixed-precision GEMM operation where matrix A is in FP16 (half-precision floating-point) format and matrix B is in INT8 (8-bit integer) format. This is an important optimization technique for inference workloads that enables significant memory bandwidth reduction while maintaining acceptable numerical accuracy.
Mathematical Formulation
The operation performs matrix multiplication with mixed data types:
C = A_{fp16} \times B_{int8}
Where:
- Matrix `A` has FP16 elements with shape `[M, K]`
- Matrix `B` has INT8 elements with shape `[K, N]`
- Matrix `C` typically has FP16 or FP32 elements with shape `[M, N]`

The computation involves:
- Type Conversion: INT8 elements of `B` are converted to FP16 during computation
- Scaling: Optional scaling factors can be applied to account for the quantization of `B`
- Accumulation: Products are accumulated in higher precision (typically FP32) to maintain numerical accuracy
- Output Conversion: Final results are converted to the desired output precision
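Putting those steps together, a minimal host-side reference of the per-element math might look like the following. A single per-tensor dequantization scale `scale_b` is assumed here (real kernels may use per-channel scales), and FP16 values are represented as `float` to keep the snippet portable.

```cpp
#include <cstdint>
#include <vector>

// C[m][n] = sum_k A_fp16[m][k] * (scale_b * B_int8[k][n]), accumulated in FP32.
void reference_fpAintB_gemm(const std::vector<float>& A_fp16,  // M x K, FP16 values stored as float
                            const std::vector<int8_t>& B_int8, // K x N, quantized weights
                            std::vector<float>& C,             // M x N, FP32 output
                            float scale_b, int M, int N, int K)
{
    for(int m = 0; m < M; ++m)
        for(int n = 0; n < N; ++n)
        {
            float acc = 0.f; // FP32 accumulator avoids FP16 overflow/rounding
            for(int k = 0; k < K; ++k)
            {
                // Type conversion + scaling: dequantize B on the fly
                float b = scale_b * static_cast<float>(B_int8[k * N + n]);
                acc += A_fp16[m * K + k] * b;
            }
            C[m * N + n] = acc; // output conversion to FP16 would happen here if desired
        }
}
```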
Algorithmic Strategy: Mixed-Precision Tiled GEMM
The implementation extends the standard tiled GEMM algorithm to handle mixed data types efficiently.
- Tiled Matrix Multiplication: Standard tiling approach with type-specific optimizations:
  - A Matrix Loading: FP16 elements are loaded directly from global memory
  - B Matrix Loading: INT8 elements are loaded and converted to FP16 in registers
  - Scaling Application: If quantization scales are provided, they are applied during the conversion
  - Mixed-Type Computation: FP16 × FP16 multiplication with FP32 accumulation
- Memory Access Optimization:
  - Bandwidth Efficiency: INT8 storage for `B` reduces memory bandwidth by 2× compared to FP16
  - Coalescing: Both data types are accessed with coalesced memory patterns
  - Vectorization: Use vectorized loads where possible for both FP16 and INT8 data
- Computation Precision:
  - Multiply-Accumulate: Use FP32 accumulators to prevent overflow and maintain accuracy
  - Hardware Utilization: Leverage mixed-precision matrix instructions where available
  - Numerical Stability: Careful handling of type conversions to minimize precision loss
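To make the strategy concrete, here is a deliberately simplified HIP kernel sketch, not the actual CK XDL kernel: it assumes row-major layouts, a square 16×16 tile, and a single per-tensor scale, but it shows where the INT8→FP16 conversion and the FP32 accumulation take place.

```cpp
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <cstdint>

constexpr int TILE = 16;

// Launch with block = (TILE, TILE), grid = (ceil(N/TILE), ceil(M/TILE)).
__global__ void fp16_int8_gemm_tile(const __half* A, const int8_t* B, float* C,
                                    float scale_b, int M, int N, int K)
{
    __shared__ __half a_tile[TILE][TILE];
    __shared__ int8_t b_tile[TILE][TILE]; // keep B in INT8 in LDS to save space

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.f; // FP32 accumulator

    for(int k0 = 0; k0 < K; k0 += TILE)
    {
        // Cooperative, coalesced loads of one tile of A (FP16) and B (INT8)
        a_tile[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : __float2half(0.f);
        b_tile[threadIdx.y][threadIdx.x] =
            (k0 + threadIdx.y < K && col < N) ? B[(k0 + threadIdx.y) * N + col] : int8_t(0);
        __syncthreads();

        for(int k = 0; k < TILE; ++k)
        {
            // INT8 -> float conversion and scaling happen in registers
            float b = scale_b * static_cast<float>(b_tile[k][threadIdx.x]);
            acc += __half2float(a_tile[threadIdx.y][k]) * b;
        }
        __syncthreads();
    }

    if(row < M && col < N)
        C[row * N + col] = acc; // convert to FP16 here if the output is half
}
```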
Source Code Organization
- `fpAintB_gemm_xdl.cpp`: The main example file. It sets up FP16 matrix A, INT8 matrix B with optional scaling factors, and instantiates the `DeviceFpAintBGemm` operation.
- `../../include/ck/tensor_operation/gpu/device/device_fpAintB_gemm.hpp`: The device interface for mixed-precision GEMM operations.
- The underlying kernel implements optimized mixed-type arithmetic with efficient type conversion and scaling operations.
Build and Run
Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```
Build the Example
```bash
cd /path/to/composable_kernel/example/64_fpAintB_gemm
mkdir build && cd build
cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..
make -j
```
Run the Example
```bash
# Run the example with default settings
./fpAintB_gemm_xdl

# Run with verification, data initialization, and timing
./fpAintB_gemm_xdl 1 2 1
```
Applications in Model Optimization
Mixed-precision GEMM is crucial for efficient neural network inference:
- Quantized Inference: Deploy models with quantized weights (INT8) while keeping activations in higher precision (FP16)
- Memory-Constrained Environments: Reduce memory footprint for weight storage while maintaining computational accuracy
- Edge Deployment: Enable deployment on devices with limited memory bandwidth
- Large Language Models: Reduce memory requirements for transformer models while preserving quality
- Computer Vision Models: Optimize CNN inference with quantized convolution layers
Performance Benefits
This mixed-precision approach provides several advantages:
- Memory Bandwidth: 2× reduction in bandwidth for matrix B compared to FP16×FP16
- Storage Efficiency: 50% reduction in storage requirements for quantized matrices
- Cache Efficiency: More data fits in cache due to reduced memory footprint
- Energy Efficiency: Lower memory traffic reduces energy consumption
Quantization Considerations
Effective use of INT8 quantization requires:
- Calibration: Proper calibration to determine appropriate scaling factors
- Range Analysis: Understanding the dynamic range of weights to maximize INT8 utilization
- Accuracy Trade-offs: Balancing between compression ratio and numerical accuracy
- Hardware Support: Leveraging hardware features for efficient mixed-precision computation
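In the simplest case, the calibration and range-analysis steps above reduce to symmetric per-tensor quantization, sketched below. Production flows typically use per-channel scales and more careful range estimation (e.g., percentile clipping), so treat this as an illustration only.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor quantization: choose the scale from the observed weight
// range so the full INT8 range [-127, 127] is used, then round-to-nearest.
struct QuantizedWeights
{
    std::vector<int8_t> data;
    float scale; // dequantize with w ≈ scale * q
};

QuantizedWeights quantize_int8(const std::vector<float>& w)
{
    float max_abs = 0.f;
    for(float v : w)
        max_abs = std::max(max_abs, std::fabs(v)); // range analysis

    QuantizedWeights q;
    q.scale = max_abs > 0.f ? max_abs / 127.f : 1.f; // calibration
    q.data.reserve(w.size());
    for(float v : w)
    {
        float r = std::round(v / q.scale);
        r = std::min(127.f, std::max(-127.f, r)); // clamp to the INT8 range
        q.data.push_back(static_cast<int8_t>(r));
    }
    return q;
}
```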
Comparison with Other Precision Formats
| Configuration | A Precision | B Precision | Memory Bandwidth | Accuracy | Hardware Support |
|---|---|---|---|---|---|
| FP32×FP32 | FP32 | FP32 | 1.0× (baseline) | Highest | Universal |
| FP16×FP16 | FP16 | FP16 | 0.5× | High | Modern GPUs |
| FP16×INT8 | FP16 | INT8 | 0.375× | Medium-High | Specialized |
| INT8×INT8 | INT8 | INT8 | 0.25× | Medium | Specialized |
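The bandwidth column can be read as the relative read traffic for A and B per element pair, ignoring the output matrix and cache effects: FP32×FP32 moves 4 + 4 = 8 bytes (baseline), FP16×FP16 moves 2 + 2 = 4 bytes (0.5×), FP16×INT8 moves 2 + 1 = 3 bytes (0.375×), and INT8×INT8 moves 1 + 1 = 2 bytes (0.25×).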
The FP16×INT8 configuration provides an excellent balance between memory efficiency and numerical accuracy for many inference workloads.