[DOCS] Documentation Addition (Readme updates) (#2495)

* GH-2368 Adding a basic glossary GH-2368 Minor edits GH-2368 Adding missing READMEs and standardization. resolving readme updates GH-2368 Minor improvements to documentation. Improving some readmes. Further improvement for readmes. Cleaned up the documentation in 'client_example' (#2468) Update for PR Update ACRONYMS.md to remove trivial terms Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine. revise 37_transpose readme revise 36_copy readme Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity. Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity. Remove references to the Tile Engine in README files across multiple examples * GH-2368 Adding a basic glossary GH-2368 Minor edits GH-2368 Adding missing READMEs and standardization. resolving readme updates GH-2368 Minor improvements to documentation. Improving some readmes. Further improvement for readmes. Cleaned up the documentation in 'client_example' (#2468) Update for PR Update ACRONYMS.md to remove trivial terms Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Apply suggestion from @spolifroni-amd Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com> Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine. revise 37_transpose readme revise 36_copy readme Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity. Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity. Remove references to the Tile Engine in README files across multiple examples Refine README files by removing outdated references to the Tile Engine * Updates based on PR feedback 1 * Updates based on PR feedback 2 * Updates based on PR feedback 3 * Updates based on PR feedback 4 * Updates based on PR feedback 5 * Updates based on PR feedback 6 * Updates based on PR feedback 7 * Updates based on PR feedback 8 * Content Modification of CK Tile Example * Modify the ck_tile gemm config --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: 92c67a824f]
2026-05-20 21:09:08 +00:00 · 2025-10-16 03:10:57 -07:00
parent cbae1418f7
commit 8654f7be8d
120 changed files with 8188 additions and 221 deletions
--- a/client_example/01_gemm/README.md
+++ b/client_example/01_gemm/README.md
@@ -1,5 +1,22 @@
-[Back to supported operations](../../../include/ck/README.md)
-# Composable Kernel GEMM
+# Client Example: Basic GEMM
+
+## Theory
+
+This client example demonstrates a basic **GEMM (General Matrix Multiplication)** operation using the Composable Kernel library. GEMM is a core operation in linear algebra and deep learning, computing the product of two matrices and optionally adding a bias or scaling.
+
+**Mathematical Formulation:**
+$$
+C = \alpha (A \times B) + \beta D
+$$
+- $A$: [M, K] input matrix
+- $B$: [K, N] weight matrix
+- $D$: [M, N] optional bias or residual
+- $C$: [M, N] output
+- $\alpha, \beta$: scalars (often 1.0, 0.0)
+
+**Algorithmic Background:**
+- The operation is implemented using a tiled/blocking strategy for memory efficiency.
+- GEMM is the computational backbone for transformer attention, MLPs, and CNNs (via im2col).

 ## GEMM
 General matrix multiplications operation. In CK GEMM operation is called as `DeviceGemm` and requires following types as template parameters:
@@ -124,3 +141,38 @@ Table of supported cases by instance factory with XDL instruction for Row/Row/Ro
 * **DeviceGemmReduce** - GEMM fused with reduction.
 * **DeviceGemm_Streamk_V2** - GEMM stream K implementation. Implementation allows to use reduction instead of AtomicAdd.
 * **DeviceGemmStreamK** - GEMM stream K implementation using AtomicAdd.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/01_gemm
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run
+./gemm
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/01_gemm/
+├── gemm.cpp         # Main client example: sets up, runs, and verifies GEMM
+├── CMakeLists.txt   # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in `gemm.cpp`):  
+  Sets up input matrices, configures GEMM parameters, launches the GEMM kernel, and verifies the result.
+- **GEMM kernel invocation**:  
+  Uses the Composable Kernel device API to launch the GEMM operation.
+
+This client example provides a minimal, end-to-end demonstration of using Composable Kernel for matrix multiplication in a user application.
--- a/client_example/02_gemm_add_add_fastgelu/README.md
+++ b/client_example/02_gemm_add_add_fastgelu/README.md
@@ -0,0 +1,65 @@
+# Client Example: GEMM with Add, Add, and FastGELU Fusion
+
+## Theory
+
+This client example demonstrates **GEMM fused with two addition operations and FastGELU activation**. This pattern is common in transformer feed-forward networks and other neural architectures where a linear transformation is followed by bias addition, residual addition, and a non-linear activation.
+
+**Mathematical Formulation:**
+$$
+E = \text{FastGELU}((A \times B) + D_0 + D_1)
+$$
+- $A$: [M, K] input matrix
+- $B$: [K, N] weight matrix
+- $D_0$: [N] bias vector (broadcasted)
+- $D_1$: [M, N] residual tensor
+- $E$: [M, N] output
+
+FastGELU is an efficient approximation of GELU:
+$$
+\text{FastGELU}(x) = x \cdot \sigma(1.702 \cdot x)
+$$
+where $\sigma$ is the sigmoid function.
+
+**Algorithmic Background:**
+- The GEMM result is kept in registers, bias and residual are added, and FastGELU is applied before writing to global memory.
+- No intermediate results are written to global memory.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/02_gemm_add_add_fastgelu
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run
+./gemm_add_add_fastgelu
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/02_gemm_add_add_fastgelu/
+├── gemm_add_add_fastgelu.cpp         # Main client example: GEMM+Add+Add+FastGELU
+├── gemm_add_add_fastgelu_generic.cpp # Generic variant
+├── gemm_add_fastgelu.cpp             # GEMM+Add+FastGELU
+├── gemm_add_fastgelu_generic.cpp     # Generic variant
+├── gemm_fastgelu.cpp                 # GEMM+FastGELU only
+├── gemm_fastgelu_generic.cpp         # Generic variant
+├── CMakeLists.txt                    # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input matrices, configures GEMM and epilogue parameters, launches the fused kernel, and verifies the result.
+- **Fused kernel invocation**:  
+  Uses the Composable Kernel device API to launch the GEMM with fused addition and FastGELU.
+
+This client example provides several variants to demonstrate different levels of fusion and genericity for transformer-style MLP layers.
--- a/client_example/03_gemm_layernorm/README.md
+++ b/client_example/03_gemm_layernorm/README.md
@@ -0,0 +1,57 @@
+# Client Example: GEMM with LayerNorm Fusion
+
+## Theory
+
+This client example demonstrates **GEMM fused with layer normalization** and additional elementwise operations. This pattern is common in transformer feed-forward networks and other architectures where a linear transformation is followed by normalization and activation.
+
+**Mathematical Formulation:**
+- GEMM: $Y = A \times B$
+- Additions: $Z = Y + D_0 + D_1$ (bias, residual, etc.)
+- Activation: $A = \text{ReLU}(Z)$ (or other activation)
+- LayerNorm: $\text{LayerNorm}(A) = \gamma \cdot \frac{A - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$
+
+$\mu$, $\sigma^2$ are mean and variance over the normalization axis; $\gamma$, $\beta$ are learnable scale and shift.
+
+**Algorithmic Background:**
+- The GEMM result is kept in registers, elementwise ops and layer normalization are fused in the epilogue.
+- LayerNorm is typically applied over the last dimension (features).
+- This fusion reduces memory traffic and is common in transformer MLP blocks.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/03_gemm_layernorm
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (naive)
+./gemm_add_add_layernorm_naive
+
+# Example run (with ReLU and Welford)
+./gemm_add_relu_add_layernorm_welford
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/03_gemm_layernorm/
+├── gemm_add_add_layernorm_naive.cpp         # GEMM + Add + Add + LayerNorm (naive)
+├── gemm_add_relu_add_layernorm_welford.cpp  # GEMM + Add + ReLU + Add + LayerNorm (Welford)
+├── CMakeLists.txt                           # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input matrices, configures GEMM and epilogue parameters, launches the fused kernel, and verifies the result.
+- **LayerNorm implementation**:  
+  Demonstrates both naive and numerically stable (Welford) algorithms for mean/variance.
+
+This client example provides variants to demonstrate different levels of fusion and normalization for transformer-style MLP layers.
--- a/client_example/04_contraction/README.md
+++ b/client_example/04_contraction/README.md
@@ -0,0 +1,56 @@
+# Client Example: General Tensor Contraction
+
+## Theory
+
+This client example demonstrates **general tensor contraction** operations, including bilinear and scaled contractions. Tensor contraction generalizes matrix multiplication to higher dimensions and is used in scientific computing, quantum chemistry, and advanced neural network layers.
+
+**Mathematical Formulation:**
+- General contraction: $C_{i,j} = \sum_k A_{i,k} \cdot B_{k,j}$
+- Bilinear contraction: $C = \alpha (A \cdot B) + \beta D$
+- Scale contraction: $C = \text{scale}(A, B)$ (elementwise or broadcasted scaling)
+
+**Algorithmic Background:**
+- Contraction can be performed over arbitrary axes and supports broadcasting.
+- Bilinear and scale contractions are used for feature fusion, gating, and scientific workloads.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/04_contraction
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (bilinear FP32)
+./contraction_bilinear_fp32
+
+# Example run (scale FP64)
+./contraction_scale_fp64
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/04_contraction/
+├── contraction_bilinear_fp32.cpp         # Bilinear contraction (FP32)
+├── contraction_bilinear_fp64.cpp         # Bilinear contraction (FP64)
+├── contraction_g1m2n3k1_add_xdl_fp16.cpp # Grouped contraction with addition (FP16)
+├── contraction_scale_fp32.cpp            # Scale contraction (FP32)
+├── contraction_scale_fp64.cpp            # Scale contraction (FP64)
+├── CMakeLists.txt                        # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input tensors, configures contraction parameters, launches the contraction kernel, and verifies the result.
+- **Contraction kernel invocation**:  
+  Uses the Composable Kernel device API to launch the contraction operation.
+
+This client example provides several variants to demonstrate different contraction types and data types for scientific and ML workloads.
--- a/client_example/05_layernorm/README.md
+++ b/client_example/05_layernorm/README.md
@@ -0,0 +1,66 @@
+# Client Example: Layer Normalization (Forward and Backward)
+
+## Theory
+
+This client example demonstrates **layer normalization** in both forward and backward modes, for 2D and 4D tensors. Layer normalization is used in transformers and other neural networks to normalize activations across the feature dimension, improving training stability.
+
+**Mathematical Formulation:**
+Given input $X$:
+- Mean: $\mu = \frac{1}{N} \sum_{i=1}^N X_i$
+- Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2$
+- Normalized: $\hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
+- Output: $Y_i = \gamma \hat{X}_i + \beta$
+
+$\gamma$, $\beta$ are learnable scale and shift parameters.
+
+**Algorithmic Background:**
+- Forward pass computes mean, variance, normalization, and affine transformation.
+- Backward pass computes gradients with respect to input, gamma, and beta.
+- Supports both 2D (batch, feature) and 4D (batch, channel, height, width) tensors.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/05_layernorm
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (2D forward)
+./layernorm2d_fwd
+
+# Example run (4D forward)
+./layernorm4d_fwd
+
+# Example run (2D backward, data)
+./layernorm2d_bwd_data
+
+# Example run (2D backward, gamma/beta)
+./layernorm2d_bwd_gamma_beta
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/05_layernorm/
+├── layernorm2d_fwd.cpp         # 2D layernorm forward
+├── layernorm4d_fwd.cpp         # 4D layernorm forward
+├── layernorm2d_bwd_data.cpp    # 2D layernorm backward (data)
+├── layernorm2d_bwd_gamma_beta.cpp # 2D layernorm backward (gamma/beta)
+├── CMakeLists.txt              # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input tensors, configures normalization parameters, launches the forward or backward kernel, and verifies the result.
+- **LayerNorm implementation**:  
+  Demonstrates both forward and backward passes for different tensor shapes.
+
+This client example provides a comprehensive demonstration of layer normalization for both inference and training in deep learning models.
--- a/client_example/06_softmax/README.md
+++ b/client_example/06_softmax/README.md
@@ -0,0 +1,54 @@
+# Client Example: 4D Softmax
+
+## Theory
+
+This client example demonstrates **Softmax computation over 4D tensors**. Softmax is a key operation in deep learning, especially in attention mechanisms and classification, converting logits into normalized probabilities.
+
+**Mathematical Formulation:**
+Given input $X$ and axis $a$:
+$$
+\text{softmax}(X)_i = \frac{\exp(X_i)}{\sum_j \exp(X_j)}
+$$
+
+**Algorithmic Background:**
+- Softmax is implemented using a numerically stable algorithm:
+  1. Subtract the maximum value for numerical stability.
+  2. Exponentiate and sum.
+  3. Normalize by the sum.
+- Efficient parallel Softmax requires careful reduction and memory access patterns.
+- This example demonstrates Softmax over a 4D tensor, as used in attention and vision models.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/06_softmax
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run
+./softmax4d
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/06_softmax/
+├── softmax4d.cpp         # Main client example: sets up, runs, and verifies 4D softmax
+├── CMakeLists.txt        # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in `softmax4d.cpp`):  
+  Sets up input tensors, configures Softmax parameters, launches the Softmax kernel, and verifies the result.
+- **Softmax kernel invocation**:  
+  Uses the Composable Kernel device API to launch the Softmax operation.
+
+This client example provides a demonstration of efficient, numerically stable Softmax for 4D tensors in deep learning models.
--- a/client_example/07_grouped_convnd_fwd/README.md
+++ b/client_example/07_grouped_convnd_fwd/README.md
@@ -1,5 +1,18 @@
-[Back to supported operations](../../../include/ck/README.md)
-# Composable Kernel Grouped Convolution
+# Client Example: Grouped N-Dimensional Convolution Forward
+
+## Theory
+
+This client example demonstrates **grouped N-dimensional convolution forward** for 1D, 2D, and 3D inputs, supporting multiple data types (including BF8 and FP8). Grouped convolution is used in modern CNNs and vision transformers to reduce computation and enable channel-wise or expert-wise processing.
+
+**Mathematical Formulation:**
+Given input $X$ and weights $W$ for $G$ groups:
+- For each group $g$:
+  $$
+  Y^g[n, c_{out}, ...] = \sum_{c_{in}} \sum_{k_1} ... \sum_{k_n} X^g[n, c_{in}, ...] \cdot W^g[c_{out}, c_{in}, ...]
+  $$
+- Each group operates on a subset of input/output channels.
+
+**Algorithmic Background:**

 ## Grouped Convolution Forward
 Grouped convolution operation for 1D, 2D or 3D spatial dimensions. Convolution utilizes GEMM kernel after tensor coordinate transform. In CK Grouped Convolution Forward operation is called as `DeviceGroupedConvFwdMultipleABD` and requires following types as template parameters:
@@ -66,3 +79,52 @@ Table of supported cases by instance factory with fused elementwise operation:
 * **Scale** - 3D, NHWGC, bf16/fp16/fp32/int8
 * **Scale + Add (for A and B)** - 3D, NHWGC, bf16/fp16/fp32/int8
 * **Scale + Add + Scale + Add + Relu** - 3D, NHWGC, bf16/fp16/fp32/int8
+
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/07_grouped_convnd_fwd
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (2D grouped convolution)
+./grouped_conv2d_fwd
+
+# Example run (3D grouped convolution, BF8)
+./grouped_conv3d_fwd_bf8
+
+# Example run (3D grouped convolution, FP8)
+./grouped_conv3d_fwd_fp8
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/07_grouped_convnd_fwd/
+├── grouped_conv1d_fwd.cpp         # 1D grouped convolution
+├── grouped_conv2d_fwd.cpp         # 2D grouped convolution (NCHW)
+├── grouped_conv2d_fwd_ngchw.cpp   # 2D grouped convolution (NGCHW)
+├── grouped_conv3d_fwd_bf8.cpp     # 3D grouped convolution (BF8)
+├── grouped_conv3d_fwd_fp8.cpp     # 3D grouped convolution (FP8)
+├── grouped_conv3d_fwd_bf8_fp8.cpp # 3D grouped convolution (BF8/FP8 mixed)
+├── grouped_conv3d_fwd_fp8_bf8.cpp # 3D grouped convolution (FP8/BF8 mixed)
+├── common.hpp                     # Common utilities for grouped convolution
+├── CMakeLists.txt                 # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input tensors, configures grouped convolution parameters, launches the kernel, and verifies the result.
+- **Grouped convolution kernel invocation**:  
+  Uses the Composable Kernel device API to launch grouped convolution for different dimensions and data types.
+
+This client example provides a comprehensive demonstration of grouped convolution for efficient CNN and vision transformer models.
--- a/client_example/08_fused_attention/README.md
+++ b/client_example/08_fused_attention/README.md
@@ -0,0 +1,89 @@
+# Fused Attention Examples
+
+This directory contains comprehensive examples demonstrating CK's high-performance fused attention implementations, which are critical for modern transformer architectures and large language models.
+
+---
+
+## Theory
+
+**Fused Multi-Head Attention Operation:**
+The fused attention mechanism performs the core transformer operation in a single, optimized kernel:
+
+$$
+\text{Attention}(Q, K, V) = \text{Softmax}(Q K^T / \sqrt{d_k}) V
+$$
+
+**Detailed Mathematical Steps:**
+1. **Query-Key Attention Scores**: $S = Q K^T$
+2. **Scale**: $S_{\text{scaled}} = S / \sqrt{d_k}$
+3. **Softmax**: $A = \text{Softmax}(S_{\text{scaled}})$
+4. **Weighted Value Sum**: $\text{Output} = A V$
+
+- Multi-head extension: Each head computes attention independently, then results are concatenated and projected.
+- Tensor shapes: Q, K, V, Output are typically [Batch, Seq_len, Num_heads, Head_dim].
+
+**Algorithmic Background:**
+- Fused attention combines two GEMMs and a softmax in a single kernel, minimizing memory traffic.
+- Supports bias, masking, and permutation for transformer and LLM workloads.
+
+---
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/08_fused_attention
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (basic fused attention)
+./fused_attention
+
+# Example run (fused attention with bias)
+./fused_attention_bias
+```
+
+---
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/08_fused_attention/
+├── fused_attention.cpp         # Main client example: fused attention (Q, K, V)
+├── fused_attention_bias.cpp    # Fused attention with bias
+├── CMakeLists.txt              # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up Q, K, V tensors, configures attention parameters, launches the fused kernel, and verifies the result.
+- **Fused attention kernel invocation**:  
+  Uses the Composable Kernel device API to launch the fused attention operation, optionally with bias.
+
+---
+
+## Additional Details
+
+- Supports FP16, BF16, FP32, and mixed precision.
+- Handles causal and generic masking for autoregressive and variable-length models.
+- Optimized for memory efficiency (no intermediate attention matrix in global memory).
+- Example parameters can be adjusted in the source for different transformer workloads.
+
+---
+
+## Related Examples
+
+- [01_gemm](../01_gemm/README.md): GEMM for Q×K^T and Attn×V
+- [06_softmax](../06_softmax/README.md): Softmax client API usage
+- [03_gemm_layernorm](../03_gemm_layernorm/README.md): Fused GEMM + layer normalization
+- [07_grouped_convnd_fwd](../07_grouped_convnd_fwd/README.md): Grouped convolution for vision transformers
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/09_quantization/README.md
+++ b/client_example/09_quantization/README.md
@@ -0,0 +1,85 @@
+# Client Example: Quantization for GEMM and Conv2D
+
+## Theory
+
+This client example demonstrates **quantized GEMM and 2D convolution** operations, including per-layer and per-channel quantization, and fusion with bias and activation functions. Quantization reduces memory and computation by representing values with lower-precision integer types (e.g., int8), enabling efficient inference in deep learning.
+
+**Mathematical Formulation:**
+- Quantized GEMM: $C = \text{dequant}(A_q) \times \text{dequant}(B_q)$
+- Quantized Conv2D: $Y = \text{dequant}(X_q) * \text{dequant}(W_q)$
+- $\text{dequant}(x_q) = (x_q - z) \cdot s$ (scale $s$, zero-point $z$)
+- Per-layer: one scale/zero-point per tensor
+- Per-channel: scale/zero-point per output channel
+
+**Algorithmic Background:**
+- Quantized values are dequantized on-the-fly during computation.
+- Accumulation is performed in higher precision for accuracy.
+- Supports bias addition and activation fusion (ReLU, Tanh).
+- Per-channel quantization improves accuracy for convolutional layers.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/09_quantization
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (GEMM quantization)
+./gemm_quantization
+
+# Example run (Conv2D per-layer quantization)
+./conv2d_fwd_perlayer_quantization
+
+# Example run (Conv2D per-channel quantization)
+./conv2d_fwd_perchannel_quantization
+
+# Example run (Conv2D + bias + ReLU + per-channel quantization)
+./conv2d_fwd_bias_relu_perchannel_quantization
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/09_quantization/
+├── gemm_quantization.cpp                         # Quantized GEMM
+├── conv2d_fwd_perlayer_quantization.cpp          # Conv2D per-layer quantization
+├── conv2d_fwd_perchannel_quantization.cpp        # Conv2D per-channel quantization
+├── conv2d_fwd_bias_relu_perlayer_quantization.cpp # Conv2D + bias + ReLU + per-layer quantization
+├── conv2d_fwd_bias_relu_perchannel_quantization.cpp # Conv2D + bias + ReLU + per-channel quantization
+├── conv2d_fwd_bias_tanh_perlayer_quantization.cpp # Conv2D + bias + Tanh + per-layer quantization
+├── conv2d_fwd_bias_tanh_perchannel_quantization.cpp # Conv2D + bias + Tanh + per-channel quantization
+├── CMakeLists.txt                                # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input tensors, configures quantization parameters, launches the quantized kernel, and verifies the result.
+- **Quantization kernel invocation**:  
+  Uses the Composable Kernel device API to launch quantized GEMM or Conv2D with optional bias and activation.
+
+---
+
+## Additional Details
+
+- Supports int8 quantization, per-layer and per-channel scaling.
+- Demonstrates fusion with bias and activation (ReLU, Tanh).
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [01_gemm](../01_gemm/README.md): GEMM for quantized matrix multiplication
+- [14_gemm_quantization](../../example/14_gemm_quantization/README.md): GEMM quantization in the main example directory
+- [40_conv2d_fwd_quantization](../../example/40_conv2d_fwd_quantization/README.md): Conv2D quantization in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/10_grouped_convnd_bwd_data/README.md
+++ b/client_example/10_grouped_convnd_bwd_data/README.md
@@ -1,4 +1,4 @@
-[Back to supported operations](../../../include/ck/README.md)
+[Back to supported operations](../../include/ck/README.md)
 # Composable Kernel Grouped Convolution

 ## Grouped Convolution Backward Data
@@ -46,3 +46,56 @@ Table of supported cases by instance factory with fused elementwise operation:

 * **Bilinear** - 3D, NHWGC, bf16/fp16/fp32
 * **Scale** - 3D, NHWGC, bf16/fp16/fp32
+
+---
+
+## Theory
+
+**Grouped convolution backward data** computes the gradient of the input tensor with respect to the loss, given the output gradient and the weights, for each group independently. This is essential for training CNNs and grouped/expert models.
+
+**Mathematical Formulation:**
+For each group $g$:
+$$
+\text{InputGrad}^g = \text{ConvBwdData}(\text{OutputGrad}^g, \text{Weights}^g)
+$$
+
+- Supports 1D, 2D, and 3D grouped convolutions.
+- Utilizes implicit GEMM for efficient computation.
+- Supports fused elementwise operations (e.g., bilinear, scale).
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/10_grouped_convnd_bwd_data
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (2D grouped convolution backward data)
+./grouped_conv2d_bwd_data
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/10_grouped_convnd_bwd_data/
+├── grouped_conv1d_bwd_data.cpp         # 1D grouped convolution backward data
+├── grouped_conv2d_bwd_data.cpp         # 2D grouped convolution backward data
+├── grouped_conv3d_bwd_data.cpp         # 3D grouped convolution backward data
+├── CMakeLists.txt                      # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input/output tensors, configures grouped convolution parameters, launches the backward data kernel, and verifies the result.
+- **Grouped convolution backward kernel invocation**:  
+  Uses the Composable Kernel device API to launch grouped convolution backward data for different dimensions and data types.
+
+This client example provides a comprehensive demonstration of grouped convolution backward data for efficient CNN and vision transformer training.
--- a/client_example/11_grouped_conv_bwd_weight/README.md
+++ b/client_example/11_grouped_conv_bwd_weight/README.md
@@ -1,4 +1,4 @@
-[Back to supported operations](../../../include/ck/README.md)
+[Back to supported operations](../../include/ck/README.md)
 # Composable Kernel Grouped Convolution

 ## Grouped Convolution Backward Weight
@@ -60,3 +60,63 @@ Table of supported cases by instance factory with fused elementwise operation:

 * **Bilinear** - 3D, NHWGC, bf16(fp32 for weight)/fp16/fp32
 * **Scale** - 3D, NHWGC, bf16(fp32 for weight)/fp16/fp32
+
+---
+
+## Theory
+
+**Grouped convolution backward weight** computes the gradient of the weights with respect to the loss, given the input and output gradients, for each group independently. This is essential for training CNNs and grouped/expert models.
+
+**Mathematical Formulation:**
+For each group $g$:
+$$
+\text{WeightGrad}^g = \text{ConvBwdWeight}(\text{Input}^g, \text{OutputGrad}^g)
+$$
+
+- Supports 1D, 2D, and 3D grouped convolutions.
+- Utilizes implicit GEMM for efficient computation.
+- Supports fused elementwise operations (e.g., bilinear, scale).
+- Uses splitK for large GEMM K dimensions.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/11_grouped_conv_bwd_weight
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (2D grouped convolution backward weight, FP16)
+./grouped_conv2d_bwd_weight_fp16
+
+# Example run (3D grouped convolution backward weight, FP32)
+./grouped_conv3d_bwd_weight_fp32
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/11_grouped_conv_bwd_weight/
+├── grouped_conv1d_bwd_weight_fp16.cpp         # 1D grouped convolution backward weight (FP16)
+├── grouped_conv2d_bwd_weight_fp16.cpp         # 2D grouped convolution backward weight (FP16)
+├── grouped_conv3d_bwd_weight_fp16.cpp         # 3D grouped convolution backward weight (FP16)
+├── grouped_conv3d_bwd_weight_fp32.cpp         # 3D grouped convolution backward weight (FP32)
+├── grouped_conv3d_bwd_weight_fp16_comp_bf8_fp8.cpp # 3D grouped convolution backward weight (FP16, BF8/FP8 mixed)
+├── common.hpp                                 # Common utilities for grouped convolution
+├── CMakeLists.txt                             # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input/output tensors, configures grouped convolution parameters, launches the backward weight kernel, and verifies the result.
+- **Grouped convolution backward weight kernel invocation**:  
+  Uses the Composable Kernel device API to launch grouped convolution backward weight for different dimensions and data types.
+
+This client example provides a comprehensive demonstration of grouped convolution backward weight for efficient CNN and vision transformer training.
--- a/client_example/12_elementwise_normalization/README.md
+++ b/client_example/12_elementwise_normalization/README.md
@@ -0,0 +1,69 @@
+# Client Example: Elementwise Layer Normalization
+
+## Theory
+
+This client example demonstrates **elementwise layer normalization** for 2D tensors. Layer normalization is used in transformers and other neural networks to normalize activations across the feature dimension, improving training stability. Elementwise normalization fuses normalization with other elementwise operations for efficiency.
+
+**Mathematical Formulation:**
+Given input $X$:
+- Mean: $\mu = \frac{1}{N} \sum_{i=1}^N X_i$
+- Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2$
+- Normalized: $\hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
+- Output: $Y_i = \gamma \hat{X}_i + \beta$
+
+$\gamma$, $\beta$ are learnable scale and shift parameters.
+
+**Algorithmic Background:**
+- Computes mean and variance per row (sample).
+- Applies normalization and affine transformation.
+- Can be fused with other elementwise operations for efficiency.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/12_elementwise_normalization
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run
+./elementwise_layernorm2d
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/12_elementwise_normalization/
+├── elementwise_layernorm2d.cpp         # Main client example: elementwise layernorm for 2D tensors
+├── CMakeLists.txt                      # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in `elementwise_layernorm2d.cpp`):  
+  Sets up input tensors, configures normalization parameters, launches the normalization kernel, and verifies the result.
+- **Elementwise normalization kernel invocation**:  
+  Uses the Composable Kernel device API to launch layer normalization, optionally fused with other elementwise ops.
+
+---
+
+## Additional Details
+
+- Supports fusion with other elementwise operations for efficiency.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [05_layernorm](../05_layernorm/README.md): Layer normalization client API
+- [27_layernorm2d_fwd](../../example/27_layernorm2d_fwd/README.md): Layer normalization in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/13_batchnorm/README.md
+++ b/client_example/13_batchnorm/README.md
@@ -0,0 +1,76 @@
+# Client Example: Batch Normalization (Forward, Backward, Inference)
+
+## Theory
+
+This client example demonstrates **batch normalization** in forward, backward, and inference modes for NHWC tensors. Batch normalization is used in deep neural networks to normalize activations across the batch and spatial dimensions, improving training stability and convergence.
+
+**Mathematical Formulation:**
+Given input $X[N, H, W, C]$:
+- Mean: $\mu_c = \frac{1}{NHW} \sum_{n,h,w} X_{n,h,w,c}$
+- Variance: $\sigma^2_c = \frac{1}{NHW} \sum_{n,h,w} (X_{n,h,w,c} - \mu_c)^2$
+- Normalized: $\hat{X}_{n,h,w,c} = \frac{X_{n,h,w,c} - \mu_c}{\sqrt{\sigma^2_c + \epsilon}}$
+- Output: $Y_{n,h,w,c} = \gamma_c \hat{X}_{n,h,w,c} + \beta_c$
+
+$\gamma_c$, $\beta_c$ are learnable scale and shift parameters per channel.
+
+**Algorithmic Background:**
+- Forward pass computes mean, variance, normalization, and affine transformation.
+- Backward pass computes gradients with respect to input, gamma, and beta.
+- Inference uses running mean and variance for normalization.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/13_batchnorm
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (forward)
+./batchnorm_fwd_nhwc
+
+# Example run (backward)
+./batchnorm_bwd_nhwc
+
+# Example run (inference)
+./batchnorm_infer_nhwc
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/13_batchnorm/
+├── batchnorm_fwd_nhwc.cpp         # Batchnorm forward (NHWC)
+├── batchnorm_bwd_nhwc.cpp         # Batchnorm backward (NHWC)
+├── batchnorm_infer_nhwc.cpp       # Batchnorm inference (NHWC)
+├── CMakeLists.txt                 # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input tensors, configures batchnorm parameters, launches the forward, backward, or inference kernel, and verifies the result.
+- **BatchNorm kernel invocation**:  
+  Uses the Composable Kernel device API to launch batch normalization for different modes.
+
+---
+
+## Additional Details
+
+- Supports NHWC layout for image and vision models.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [34_batchnorm](../../example/34_batchnorm/README.md): Batch normalization in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/14_instance_id/README.md
+++ b/client_example/14_instance_id/README.md
@@ -0,0 +1,63 @@
+# Client Example: BatchNorm with Instance ID Selection
+
+## Theory
+
+This client example demonstrates **batch normalization** using explicit instance ID selection. In Composable Kernel, "instance ID" refers to a specific kernel configuration (tile sizes, vectorization, etc.) chosen for a given workload. This allows users to benchmark or select the best-performing kernel for their data shape.
+
+**Mathematical Formulation:**
+See [BatchNorm Theory](../13_batchnorm/README.md) for the mathematical details of batch normalization.
+
+**Algorithmic Background:**
+- The example shows how to enumerate and select a specific kernel instance by its ID.
+- Useful for performance tuning, benchmarking, and debugging.
+- BatchNorm is performed in NHWC layout.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/14_instance_id
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (selects a specific kernel instance)
+./batchnorm_fwd_instance_id
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/14_instance_id/
+├── batchnorm_fwd_instance_id.cpp         # Batchnorm forward with instance ID selection
+├── CMakeLists.txt                        # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in `batchnorm_fwd_instance_id.cpp`):  
+  Sets up input tensors, enumerates available kernel instances, selects an instance by ID, launches the batchnorm kernel, and verifies the result.
+- **Instance selection**:  
+  Demonstrates how to use the Composable Kernel API to list and select kernel configurations.
+
+---
+
+## Additional Details
+
+- Useful for kernel benchmarking and performance tuning.
+- Example parameters and instance ID can be adjusted in the source.
+
+---
+
+## Related Examples
+
+- [13_batchnorm](../13_batchnorm/README.md): Batch normalization client API
+- [34_batchnorm](../../example/34_batchnorm/README.md): Batch normalization in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/15_convnd_bwd_data/README.md
+++ b/client_example/15_convnd_bwd_data/README.md
@@ -0,0 +1,73 @@
+# Client Example: N-Dimensional Convolution Backward Data
+
+## Theory
+
+This client example demonstrates **N-dimensional convolution backward data** for 3D inputs, supporting multiple data types (FP16, FP32). The backward data operation computes the gradient of the input tensor with respect to the loss, given the output gradient and the weights. This is essential for training CNNs and 3D vision models.
+
+**Mathematical Formulation:**
+For input $X$, weights $W$, and output gradient $dY$:
+$$
+dX = \text{ConvBwdData}(dY, W)
+$$
+
+- Supports 3D convolution (ND can be extended).
+- Utilizes implicit GEMM for efficient computation.
+
+**Algorithmic Background:**
+- The backward data operation is implemented as a convolution with transformed coordinates.
+- Used in training pipelines for 3D CNNs, medical imaging, and volumetric data.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/15_convnd_bwd_data
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (3D backward data, FP16)
+./conv3d_bwd_data_fp16
+
+# Example run (3D backward data, FP32)
+./conv3d_bwd_data_fp32
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/15_convnd_bwd_data/
+├── conv3d_bwd_data_fp16.cpp         # 3D convolution backward data (FP16)
+├── conv3d_bwd_data_fp32.cpp         # 3D convolution backward data (FP32)
+├── common.hpp                       # Common utilities for convolution
+├── CMakeLists.txt                   # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input/output tensors, configures convolution parameters, launches the backward data kernel, and verifies the result.
+- **Backward data kernel invocation**:  
+  Uses the Composable Kernel device API to launch convolution backward data for different data types.
+
+---
+
+## Additional Details
+
+- Supports FP16 and FP32 for 3D convolution.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [10_grouped_convnd_bwd_data](../10_grouped_convnd_bwd_data/README.md): Grouped convolution backward data
+- [17_convnd_bwd_data](../../example/17_convnd_bwd_data/README.md): Convolution backward data in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/16_convnd_fwd/README.md
+++ b/client_example/16_convnd_fwd/README.md
@@ -0,0 +1,85 @@
+# Client Example: N-Dimensional Convolution Forward
+
+## Theory
+
+This client example demonstrates **N-dimensional convolution forward** for 3D inputs, supporting multiple data types (FP16, FP32, FP8 composite). Convolution is a fundamental operation in deep learning, especially in convolutional neural networks (CNNs) for images, audio, and volumetric data.
+
+**Mathematical Formulation:**
+Given input $X$, weights $W$:
+$$
+Y = \text{Conv}(X, W)
+$$
+
+- Supports 3D convolution (ND can be extended).
+- Utilizes implicit GEMM for efficient computation.
+
+**Algorithmic Background:**
+- The forward convolution operation is implemented as a convolution with transformed coordinates.
+- Used in inference and training pipelines for 3D CNNs, medical imaging, and volumetric data.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/16_convnd_fwd
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (3D forward, FP16)
+./conv3d_fwd_fp16
+
+# Example run (3D forward, FP32)
+./conv3d_fwd_fp32
+
+# Example run (3D forward, FP16 compute with FP8)
+./conv3d_fwd_fp16_comp_fp8
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/16_convnd_fwd/
+├── conv3d_fwd_fp16.cpp         # 3D convolution forward (FP16)
+├── conv3d_fwd_fp32.cpp         # 3D convolution forward (FP32)
+├── conv3d_fwd_fp16_comp_fp8.cpp # 3D convolution forward (FP16 compute, FP8)
+├── common.hpp                  # Common utilities for convolution
+├── CMakeLists.txt              # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input/output tensors, configures convolution parameters, launches the forward kernel, and verifies the result.
+- **Forward convolution kernel invocation**:  
+  Uses the Composable Kernel device API to launch convolution forward for different data types.
+
+---
+
+## Additional Details
+
+- Supports FP16, FP32, and FP8 composite for 3D convolution.
+- Parameters can be adjusted in the source files for different workloads. The following parameters are configurable:
+  - `NumDimSpatial`: Number of spatial dimensions (default: 3 for 3D convolution)
+  - `G`: Number of groups (default: 1)
+  - `N`: Batch size (default: 64)
+  - `K`: Number of output channels (default: 128)
+  - `C`: Number of input channels (default: 64)
+  - `Z`, `Y`, `X`: Filter/kernel dimensions (default: 3x3x3)
+  - `Di`, `Hi`, `Wi`: Input dimensions - depth, height, width (default: 28x28x3)
+  - `Do`, `Ho`, `Wo`: Output dimensions - depth, height, width (default: 28x28x3)
+
+---
+
+## Related Examples
+
+- [09_convnd_fwd](../../example/09_convnd_fwd/README.md): N-dimensional convolution in the main example directory
+- [30_grouped_conv_fwd_multiple_d](../../example/30_grouped_conv_fwd_multiple_d/README.md): Grouped convolution forward with multiple D
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/17_grouped_gemm_fastgelu/README.md
+++ b/client_example/17_grouped_gemm_fastgelu/README.md
@@ -0,0 +1,71 @@
+# Client Example: Grouped GEMM with FastGELU Activation
+
+## Theory
+
+This client example demonstrates **grouped GEMM fused with FastGELU activation**. Grouped GEMM performs multiple independent GEMM operations (with potentially different shapes) in a single kernel launch, and FastGELU is a fast approximation of the GELU activation used in transformers and MLPs.
+
+**Mathematical Formulation:**
+For $G$ groups, each with its own $A_g$, $B_g$:
+- GEMM: $Y_g = A_g \times B_g$
+- FastGELU: $E_g = \text{FastGELU}(Y_g)$
+
+FastGELU is defined as:
+$$
+\text{FastGELU}(x) = x \cdot \sigma(1.702 \cdot x)
+$$
+where $\sigma$ is the sigmoid function.
+
+**Algorithmic Background:**
+- Each group can have different matrix sizes and strides.
+- The kernel launches a grid covering all groups, with each block assigned to a group.
+- FastGELU is applied in the epilogue for each group.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/17_grouped_gemm_fastgelu
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run
+./grouped_gemm_fastgelu
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/17_grouped_gemm_fastgelu/
+├── grouped_gemm_fastgelu.cpp         # Main client example: grouped GEMM + FastGELU
+├── CMakeLists.txt                    # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in `grouped_gemm_fastgelu.cpp`):  
+  Sets up input matrices for each group, configures GEMM and epilogue parameters, launches the grouped kernel, and verifies the result.
+- **Grouped GEMM kernel invocation**:  
+  Uses the Composable Kernel device API to launch grouped GEMM with FastGELU activation.
+
+---
+
+## Additional Details
+
+- Supports multiple groups with different matrix shapes.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [15_grouped_gemm](../../example/15_grouped_gemm/README.md): Grouped GEMM in the main example directory
+- [04_gemm_add_add_fastgelu](../../example/04_gemm_add_add_fastgelu/README.md): GEMM with FastGELU fusion
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/18_groupnorm/README.md
+++ b/client_example/18_groupnorm/README.md
@@ -0,0 +1,80 @@
+# Client Example: Group Normalization (Forward and Backward)
+
+## Theory
+
+This client example demonstrates **group normalization** in both forward and backward modes, including fusion with Swish activation. Group normalization normalizes activations across groups of channels, improving training stability for small batch sizes or non-i.i.d. data.
+
+**Mathematical Formulation:**
+Given input $X[N, C, ...]$ divided into $G$ groups:
+- For each group $g$:
+  - Mean: $\mu_g = \frac{1}{|g|} \sum_{i \in g} X_i$
+  - Variance: $\sigma^2_g = \frac{1}{|g|} \sum_{i \in g} (X_i - \mu_g)^2$
+  - Normalized: $\hat{X}_i = \frac{X_i - \mu_g}{\sqrt{\sigma^2_g + \epsilon}}$
+  - Output: $Y_i = \gamma \hat{X}_i + \beta$
+
+$\gamma$, $\beta$ are learnable scale and shift parameters.
+
+- Swish activation: $\text{Swish}(x) = x \cdot \sigma(x)$, where $\sigma$ is the sigmoid function.
+
+**Algorithmic Background:**
+- Forward pass computes mean, variance, normalization, and affine transformation per group.
+- Backward pass computes gradients with respect to input, gamma, and beta.
+- Swish activation can be fused with normalization for efficiency.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/18_groupnorm
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (forward with Swish)
+./groupnorm_swish_fwd
+
+# Example run (backward, data)
+./groupnorm_bwd_data
+
+# Example run (backward, gamma/beta)
+./groupnorm_bwd_gamma_beta
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/18_groupnorm/
+├── groupnorm_swish_fwd.cpp         # Groupnorm forward with Swish activation
+├── groupnorm_bwd_data.cpp          # Groupnorm backward (data)
+├── groupnorm_bwd_gamma_beta.cpp    # Groupnorm backward (gamma/beta)
+├── CMakeLists.txt                  # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input tensors, configures groupnorm parameters, launches the forward or backward kernel, and verifies the result.
+- **GroupNorm kernel invocation**:  
+  Uses the Composable Kernel device API to launch group normalization for different modes.
+
+---
+
+## Additional Details
+
+- Supports fusion with Swish activation for efficiency.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [42_groupnorm_fwd](../../example/42_groupnorm_fwd/README.md): Group normalization in the main example directory
+- [54_groupnorm_bwd](../../example/54_groupnorm_bwd/README.md): Group normalization backward in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/19_pool/README.md
+++ b/client_example/19_pool/README.md
@@ -0,0 +1,80 @@
+# Client Example: Pooling Operations (2D Max, 3D Avg)
+
+## Theory
+
+This client example demonstrates **pooling operations** for 2D max pooling and 3D average pooling, including both forward and backward passes. Pooling is used in convolutional neural networks (CNNs) for spatial downsampling, translation invariance, and reducing computation.
+
+**Mathematical Formulation:**
+- **Max Pooling (2D):** $Y_{n,c,h,w} = \max_{i,j} X_{n,c,h \cdot s_H + i, w \cdot s_W + j}$
+- **Average Pooling (3D):** $Y_{n,c,d,h,w} = \frac{1}{k_D k_H k_W} \sum_{i,j,k} X_{n,c,d \cdot s_D + i, h \cdot s_H + j, w \cdot s_W + k}$
+
+Where $s_H, s_W, s_D$ are strides, $k_H, k_W, k_D$ are kernel sizes.
+
+**Algorithmic Background:**
+- Forward pass computes the pooled output.
+- Backward pass computes the gradient with respect to the input.
+- Handles padding and boundary conditions.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/19_pool
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (2D max pool forward)
+./max_pool2d_fwd
+
+# Example run (2D max pool backward)
+./max_pool2d_bwd
+
+# Example run (3D avg pool forward)
+./avg_pool3d_fwd
+
+# Example run (3D avg pool backward)
+./avg_pool3d_bwd
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/19_pool/
+├── max_pool2d_fwd.cpp         # 2D max pooling forward
+├── max_pool2d_bwd.cpp         # 2D max pooling backward
+├── avg_pool3d_fwd.cpp         # 3D average pooling forward
+├── avg_pool3d_bwd.cpp         # 3D average pooling backward
+├── CMakeLists.txt             # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input tensors, configures pooling parameters, launches the forward or backward kernel, and verifies the result.
+- **Pooling kernel invocation**:  
+  Uses the Composable Kernel device API to launch pooling operations for different modes.
+
+---
+
+## Additional Details
+
+- Supports both max and average pooling, forward and backward.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [13_pool2d_fwd](../../example/13_pool2d_fwd/README.md): 2D pooling in the main example directory
+- [48_pool3d_fwd](../../example/48_pool3d_fwd/README.md): 3D pooling in the main example directory
+- [49_maxpool2d_bwd](../../example/49_maxpool2d_bwd/README.md): 2D max pool backward in the main example directory
+- [51_avgpool3d_bwd](../../example/51_avgpool3d_bwd/README.md): 3D avg pool backward in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/20_splitk_gemm/README.md
+++ b/client_example/20_splitk_gemm/README.md
@@ -0,0 +1,66 @@
+# Client Example: Split-K GEMM
+
+## Theory
+
+This client example demonstrates **Split-K GEMM**, a technique for parallelizing matrix multiplication along the K dimension. Split-K is used to improve parallelism and memory bandwidth utilization for large GEMM operations, especially when K is large.
+
+**Mathematical Formulation:**
+- Standard GEMM: $C = A \times B$
+- Split-K: Partition the K dimension into $K_s$ splits, compute partial results, then reduce:
+  $$
+  C = \sum_{s=1}^{K_s} (A_{[:, K_s]} \times B_{[K_s, :]})
+  $$
+
+**Algorithmic Background:**
+- Each split computes a partial GEMM over a chunk of K.
+- Partial results are reduced (summed) to produce the final output.
+- Useful for large K, limited workspace, or maximizing GPU occupancy.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/20_splitk_gemm
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (FP16 compute, FP8 output)
+./splitK_gemm_fp16_f8
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/20_splitk_gemm/
+├── splitK_gemm_fp16_f8.cpp         # Main client example: Split-K GEMM (FP16 compute, FP8 output)
+├── CMakeLists.txt                  # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in `splitK_gemm_fp16_f8.cpp`):  
+  Sets up input matrices, configures Split-K parameters, launches the Split-K GEMM kernel, and verifies the result.
+- **Split-K kernel invocation**:  
+  Uses the Composable Kernel device API to launch the Split-K GEMM operation.
+
+---
+
+## Additional Details
+
+- Supports FP16 compute with FP8 output for memory efficiency.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [35_splitK_gemm](../../example/35_splitK_gemm/README.md): Split-K GEMM in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/21_grouped_gemm_bias/README.md
+++ b/client_example/21_grouped_gemm_bias/README.md
@@ -0,0 +1,65 @@
+# Client Example: Grouped GEMM with Bias
+
+## Theory
+
+This client example demonstrates **grouped GEMM fused with bias addition**. Grouped GEMM performs multiple independent GEMM operations (with potentially different shapes) in a single kernel launch, and bias addition is a standard pattern in neural network layers.
+
+**Mathematical Formulation:**
+For $G$ groups, each with its own $A_g$, $B_g$, $b_g$:
+- GEMM: $Y_g = A_g \times B_g$
+- Bias: $E_g = Y_g + b_g$
+
+**Algorithmic Background:**
+- Each group can have different matrix sizes and strides.
+- The kernel launches a grid covering all groups, with each block assigned to a group.
+- Bias is added in the epilogue for each group.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/21_grouped_gemm_bias
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (grouped GEMM with bias, FP16)
+./grouped_gemm_fixed_nk_bias_fp16
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/21_grouped_gemm_bias/
+├── grouped_gemm_fixed_nk_bias_fp16.cpp         # Main client example: grouped GEMM + bias (FP16)
+├── CMakeLists.txt                              # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in `grouped_gemm_fixed_nk_bias_fp16.cpp`):  
+  Sets up input matrices for each group, configures GEMM and bias parameters, launches the grouped kernel, and verifies the result.
+- **Grouped GEMM kernel invocation**:  
+  Uses the Composable Kernel device API to launch grouped GEMM with bias addition.
+
+---
+
+## Additional Details
+
+- Supports multiple groups with different matrix shapes.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [15_grouped_gemm](../../example/15_grouped_gemm/README.md): Grouped GEMM in the main example directory
+- [11_convnd_fwd_bias](../../example/11_convnd_fwd_bias/README.md): Convolution with bias fusion
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/22_grouped_gemm/README.md
+++ b/client_example/22_grouped_gemm/README.md
@@ -0,0 +1,76 @@
+# Client Example: Grouped GEMM (Multiple Data Types)
+
+## Theory
+
+This client example demonstrates **grouped GEMM** for multiple data types (FP16, BF16, FP8, INT8). Grouped GEMM performs multiple independent GEMM operations (with potentially different shapes) in a single kernel launch, which is useful for transformer models, mixture-of-experts, and variable-length sequence processing.
+
+**Mathematical Formulation:**
+For $G$ groups, each with its own $A_g$, $B_g$:
+- GEMM: $Y_g = A_g \times B_g$
+
+**Algorithmic Background:**
+- Each group can have different matrix sizes and strides.
+- The kernel launches a grid covering all groups, with each block assigned to a group.
+- Supports multiple data types for flexibility and performance tuning.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/22_grouped_gemm
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (FP16)
+./grouped_gemm_fixed_nk_fp16
+
+# Example run (BF16)
+./grouped_gemm_fixed_nk_bf16
+
+# Example run (FP8)
+./grouped_gemm_fixed_nk_fp8
+
+# Example run (INT8)
+./grouped_gemm_fixed_nk_i8
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/22_grouped_gemm/
+├── grouped_gemm_fixed_nk_fp16.cpp         # Grouped GEMM (FP16)
+├── grouped_gemm_fixed_nk_bf16.cpp         # Grouped GEMM (BF16)
+├── grouped_gemm_fixed_nk_fp8.cpp          # Grouped GEMM (FP8)
+├── grouped_gemm_fixed_nk_i8.cpp           # Grouped GEMM (INT8)
+├── CMakeLists.txt                         # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input matrices for each group, configures GEMM parameters, launches the grouped kernel, and verifies the result.
+- **Grouped GEMM kernel invocation**:  
+  Uses the Composable Kernel device API to launch grouped GEMM for different data types.
+
+---
+
+## Additional Details
+
+- Supports multiple groups with different matrix shapes and data types.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [15_grouped_gemm](../../example/15_grouped_gemm/README.md): Grouped GEMM in the main example directory
+- [17_grouped_gemm_fastgelu](../17_grouped_gemm_fastgelu/README.md): Grouped GEMM with FastGELU activation
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/23_elementwise_transpose/README.md
+++ b/client_example/23_elementwise_transpose/README.md
@@ -0,0 +1,64 @@
+# Client Example: Elementwise Operation with 3D Transpose
+
+## Theory
+
+This client example demonstrates **elementwise operations fused with 3D tensor transpose**. This pattern is used in deep learning for applying activation functions or scaling while simultaneously reordering tensor dimensions (e.g., for layout conversion or attention head reshaping).
+
+**Mathematical Formulation:**
+- Elementwise: $Z = f(X)$ or $Z = f(X, Y)$
+- Transpose: $Y_{i_0, i_1, i_2} = Z_{i_{\pi(0)}, i_{\pi(1)}, i_{\pi(2)}}$
+  - $\pi$ is a permutation of the axes.
+
+**Algorithmic Background:**
+- The elementwise operation and transpose are fused in a single kernel.
+- Intermediate results are kept in registers, not written to global memory.
+- Used for layout conversion with activation, attention head reshaping, and more.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/23_elementwise_transpose
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (elementwise + 3D transpose)
+./elementwise_transpose_3d
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/23_elementwise_transpose/
+├── elementwise_transpose_3d.cpp         # Main client example: elementwise + 3D transpose
+├── CMakeLists.txt                       # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in `elementwise_transpose_3d.cpp`):  
+  Sets up input tensors, configures elementwise and transpose parameters, launches the fused kernel, and verifies the result.
+- **Fused kernel invocation**:  
+  Uses the Composable Kernel device API to launch the elementwise+transpose operation.
+
+---
+
+## Additional Details
+
+- Supports fusion of elementwise operations with 3D transpose.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [44_elementwise_permute](../../example/44_elementwise_permute/README.md): Elementwise operation with permutation in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/24_grouped_conv_activation/README.md
+++ b/client_example/24_grouped_conv_activation/README.md
@@ -0,0 +1,88 @@
+# Client Example: Grouped Convolution with Activation and Fusion
+
+## Theory
+
+This client example demonstrates **grouped convolution fused with various activation and elementwise operations**. Grouped convolution splits the input and weights into groups and applies convolution independently to each group, while fusion with activation and scaling improves efficiency.
+
+**Mathematical Formulation:**
+For each group $g$:
+- Convolution: $Y^g = \text{Conv}(X^g, W^g)$
+- Fused operations: $E^g = f(Y^g, D_0^g, D_1^g, ...)$
+  - $f$ can be bilinear, scale, add, relu, etc.
+
+**Algorithmic Background:**
+- Grouped convolution is used in efficient CNNs, depthwise separable convolutions, and expert models.
+- Fused epilogue operations (scale, add, relu, reduce) are performed in registers before writing to memory.
+- Supports 1D, 2D, and 3D grouped convolutions and a variety of fusion patterns.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/24_grouped_conv_activation
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (grouped conv + scale)
+./grouped_convnd_fwd_scale/grouped_convnd_fwd_scale
+
+# Example run (grouped conv + bilinear)
+./grouped_convnd_fwd_bilinear/grouped_convnd_fwd_bilinear
+
+# Example run (grouped conv + scale + relu)
+./grouped_convnd_fwd_convscale_relu/grouped_convnd_fwd_convscale_relu
+
+# Example run (grouped conv + scale + add + relu)
+./grouped_convnd_fwd_scaleadd_scaleadd_relu/grouped_convnd_fwd_scaleadd_scaleadd_relu
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/24_grouped_conv_activation/
+├── grouped_convnd_fwd_scale/                  # Grouped conv + scale
+├── grouped_convnd_fwd_bilinear/               # Grouped conv + bilinear
+├── grouped_convnd_fwd_convscale/              # Grouped conv + scale (convscale)
+├── grouped_convnd_fwd_convscale_add/          # Grouped conv + scale + add
+├── grouped_convnd_fwd_convscale_reduce/       # Grouped conv + scale + reduce
+├── grouped_convnd_fwd_convscale_relu/         # Grouped conv + scale + relu
+├── grouped_convnd_fwd_convinvscale/           # Grouped conv + inverse scale
+├── grouped_convnd_fwd_scaleadd_ab/            # Grouped conv + scale + add (A/B)
+├── grouped_convnd_fwd_scaleadd_scaleadd_relu/ # Grouped conv + scale + add + relu
+├── grouped_convnd_bwd_data_bilinear/          # Grouped conv bwd data + bilinear
+├── grouped_convnd_bwd_data_scale/             # Grouped conv bwd data + scale
+├── grouped_convnd_bwd_weight_bilinear/        # Grouped conv bwd weight + bilinear
+├── grouped_convnd_bwd_weight_scale/           # Grouped conv bwd weight + scale
+├── CMakeLists.txt                             # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each subdirectory's `.cpp`):  
+  Sets up input tensors, configures grouped convolution and fusion parameters, launches the kernel, and verifies the result.
+- **Grouped convolution kernel invocation**:  
+  Uses the Composable Kernel device API to launch grouped convolution with various fused epilogue operations.
+
+---
+
+## Additional Details
+
+- Supports a wide range of fusion patterns (bilinear, scale, add, relu, reduce, etc.).
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [10_grouped_convnd_bwd_data](../10_grouped_convnd_bwd_data/README.md): Grouped convolution backward data
+- [11_grouped_conv_bwd_weight](../11_grouped_conv_bwd_weight/README.md): Grouped convolution backward weight
+- [30_grouped_conv_fwd_multiple_d](../../example/30_grouped_conv_fwd_multiple_d/README.md): Grouped convolution forward with multiple D
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/25_wrapper/README.md
+++ b/client_example/25_wrapper/README.md
@@ -1,13 +1,70 @@
 [Back to the main page](../../README.md)
-# Composable Kernel wrapper GEMM tutorial

-This tutorial demonstrates how to implement matrix multiplication using Composable Kernel (CK) wrapper. We present the base version of GEMM without most of the available optimizations; however, it's worth noting that CK has kernels with different optimizations.
+# Composable Kernel Wrapper GEMM Tutorial

-To implement these optimizations, you can use the CK wrapper or directly use available instances in CK. You can also refer to the [optimized GEMM example](https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_optimized_gemm.cpp), that uses CK wrapper based on the [`gridwise_gemm_xdlops_v2r3`](https://github.com/ROCm/composable_kernel/blob/develop/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdlops_v2r3.hpp) implementation.
+This tutorial demonstrates how to implement matrix multiplication (GEMM) using the Composable Kernel wrapper. The three examples show both basic and optimized GEMM implementations, as well as how to use the wrapper for tensor transformations such as im2col.

-The kernel definition should look similar to:
+---
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/25_wrapper
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (basic GEMM)
+./wrapper_basic_gemm
+
+# Example run (optimized GEMM)
+./wrapper_optimized_gemm
+
+# Example run (im2col transformation)
+./wrapper_img2col
+
+# Example run (tensor transform using wrapper)
+./tensor_transform_using_wrapper
+```
+
+---
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/25_wrapper/
+├── wrapper_basic_gemm.cpp         # Basic GEMM using CK wrapper
+├── wrapper_optimized_gemm.cpp     # Optimized GEMM using CK wrapper
+├── wrapper_img2col.cpp            # im2col transformation using CK wrapper
+├── tensor_transform_using_wrapper.cpp # General tensor transform example
+├── CMakeLists.txt                 # Build configuration for the example
+├── README.md                      # This tutorial and reference
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input tensors, configures wrapper parameters, launches the kernel, and verifies the result.
+- **CK wrapper API usage**:  
+  Demonstrates how to create layouts, tensors, and launch GEMM or tensor transforms using the wrapper.
+
+---
+
+## Additional Details
+
+## Overview
+
+The CK wrapper provides a flexible interface for launching GEMM kernels and tensor operations. This tutorial presents:
+- A base GEMM implementation (minimal optimizations)
+- An optimized GEMM using `gridwise_gemm_xdlops_v2r3`
+- Examples of tensor transformations (e.g., im2col)

-```cpp
 template <typename DataType,
          typename GemmTraits,
          ck::index_t scalar_per_vector,
@@ -168,5 +225,13 @@ The end result from `c_vgpr_reg` is stored in the `C` local partition (tensor pe
    ck::wrapper::copy(c_vgpr_reg, c_global_local_partition);
 ```

-If you want to dive deep into the details, you can find the entire example
-[here](https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_basic_gemm.cpp).
+---
+
+## Related Examples
+
+- [01_gemm](../01_gemm/README.md): Basic GEMM client example
+- [27_im2col_col2im](../27_im2col_col2im/README.md): im2col/col2im transformations
+- [25_gemm_bias_e_permute](../../example/25_gemm_bias_e_permute/README.md): GEMM with bias and permutation in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/26_reduce/CMakeLists.txt
+++ b/client_example/26_reduce/CMakeLists.txt
--- a/client_example/26_reduce/README.md
+++ b/client_example/26_reduce/README.md
@@ -0,0 +1,64 @@
+# Client Example: Parallel Reduction (NHWC)
+
+## Theory
+
+This client example demonstrates **parallel reduction operations** over NHWC tensors. Reduction is a fundamental operation in deep learning for computing statistics (such as batch mean/variance), loss aggregation, and normalization.
+
+**Mathematical Formulation:**
+Given a tensor $X[N, H, W, C]$ and a reduction axis (e.g., channel $C$):
+- **Sum**: $Y_{n,h,w} = \sum_c X_{n,h,w,c}$
+- **Max**: $Y_{n,h,w} = \max_c X_{n,h,w,c}$
+- **Mean**: $Y_{n,h,w} = \frac{1}{C} \sum_c X_{n,h,w,c}$
+
+**Algorithmic Background:**
+- Reductions are implemented using parallel tree or segmented reduction algorithms.
+- Efficient reductions require careful memory access, synchronization, and sometimes numerically stable algorithms.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/26_reduce
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (reduce over channel dimension)
+./reduce_nhwc_c
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/26_reduce/
+├── reduce_nhwc_c.cpp         # Main client example: reduction over NHWC tensors (channel axis)
+├── CMakeLists.txt            # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in `reduce_nhwc_c.cpp`):  
+  Sets up input tensors, configures reduction parameters, launches the reduction kernel, and verifies the result.
+- **Reduction kernel invocation**:  
+  Uses the Composable Kernel device API to launch the reduction operation.
+
+---
+
+## Additional Details
+
+- Supports sum, max, mean, and other reductions over NHWC tensors.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [12_reduce](../../example/12_reduce/README.md): Parallel reduction in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/26_reduce/reduce_nhwc_c.cpp
+++ b/client_example/26_reduce/reduce_nhwc_c.cpp
--- a/client_example/27_im2col_col2im/CMakeLists.txt
+++ b/client_example/27_im2col_col2im/CMakeLists.txt
--- a/client_example/27_im2col_col2im/README.md
+++ b/client_example/27_im2col_col2im/README.md
@@ -0,0 +1,68 @@
+# Client Example: im2col and col2im Transformations
+
+## Theory
+
+This client example demonstrates **im2col (image-to-column) and col2im (column-to-image) transformations**. These operations are used to convert image data into a matrix form suitable for GEMM-based convolution and reconstruct images from column representations.
+
+**Mathematical Formulation:**
+- **im2col**: Rearranges image blocks into columns, mapping a 3D/4D tensor to a 2D matrix.
+- **col2im**: Reverses the process, mapping a 2D matrix back to an image tensor.
+
+**Algorithmic Background:**
+- im2col is used to lower convolution to matrix multiplication (GEMM).
+- col2im is used to reconstruct the original image or feature map from the column representation.
+- These transformations are essential for efficient convolution implementations on GPUs.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/27_im2col_col2im
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (image to column)
+./image_to_column
+
+# Example run (column to image)
+./column_to_image
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/27_im2col_col2im/
+├── image_to_column.cpp         # im2col: image to column transformation
+├── column_to_image.cpp         # col2im: column to image transformation
+├── CMakeLists.txt              # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input tensors, configures transformation parameters, launches the im2col or col2im kernel, and verifies the result.
+- **im2col/col2im kernel invocation**:  
+  Uses the Composable Kernel device API to launch the transformation.
+
+---
+
+## Additional Details
+
+- Supports various image and patch sizes.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [52_im2col_col2im](../../example/52_im2col_col2im/README.md): im2col/col2im in the main example directory
+- [09_convnd_fwd](../../example/09_convnd_fwd/README.md): N-dimensional convolution using im2col
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/27_im2col_col2im/column_to_image.cpp
+++ b/client_example/27_im2col_col2im/column_to_image.cpp
--- a/client_example/27_im2col_col2im/image_to_column.cpp
+++ b/client_example/27_im2col_col2im/image_to_column.cpp
--- a/client_example/28_gemm_mx/CMakeLists.txt
+++ b/client_example/28_gemm_mx/CMakeLists.txt
--- a/client_example/28_gemm_mx/README.md
+++ b/client_example/28_gemm_mx/README.md
@@ -0,0 +1,34 @@
+# Client Example: GEMM pipeline for microscaling (MX)
+
+## How to Run
+
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+```bash
+cd composable_kernel/build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -D DTYPES="fp8" ..
+make -j
+make install
+```
+
+### Build and run
+```bash
+/opt/rocm/bin/hipcc gemm_mx_fp8.cpp -o gemm_mx_fp8
+
+# Example run
+./gemm_mx_fp8
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/28_gemm_mx/
+├── gemm_mx_fp8.cpp       # GEMM MX (fp8)
+├── CMakeLists.txt        # Build configuration for the example
+```
+---
+[Back to Client Examples](../README.md)
--- a/client_example/28_gemm_mx/gemm_mx_fp8.cpp
+++ b/client_example/28_gemm_mx/gemm_mx_fp8.cpp
--- a/client_example/29_gemm_add_multiply/CMakeLists.txt
+++ b/client_example/29_gemm_add_multiply/CMakeLists.txt
--- a/client_example/29_gemm_add_multiply/README.md
+++ b/client_example/29_gemm_add_multiply/README.md
@@ -0,0 +1,66 @@
+# Client Example: GEMM with Add and Multiply Fusion
+
+## Theory
+
+This client example demonstrates **GEMM fused with addition and multiplication operations**. This pattern is used in neural networks for bias addition, scaling, gating, and other elementwise transformations after a linear layer.
+
+**Mathematical Formulation:**
+- GEMM: $Y = A \times B$
+- Add: $Z = Y + D_0$
+- Multiply: $E = Z \odot D_1$
+  - $D_0$, $D_1$: auxiliary tensors (e.g., bias, scale, gate)
+
+**Algorithmic Background:**
+- The GEMM result is kept in registers, addition and multiplication are fused in the epilogue.
+- No intermediate results are written to global memory.
+- Used for bias+scale, gating, and other fused epilogue patterns.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+### Build and run
+```bash
+cd composable_kernel/client_example/29_gemm_add_multiply
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run
+./gemm_add_multiply
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/29_gemm_add_multiply/
+├── gemm_add_multiply.cpp         # Main client example: GEMM+Add+Multiply
+├── CMakeLists.txt                # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in `gemm_add_multiply.cpp`):  
+  Sets up input matrices, configures GEMM and epilogue parameters, launches the fused kernel, and verifies the result.
+- **Fused kernel invocation**:  
+  Uses the Composable Kernel device API to launch the GEMM with fused addition and multiplication.
+
+---
+
+## Additional Details
+
+- Supports fusion of multiple elementwise operations with GEMM.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [02_gemm_bilinear](../../example/02_gemm_bilinear/README.md): Multi-tensor bilinear operations
+- [46_gemm_add_multiply](../../example/46_gemm_add_multiply/README.md): GEMM with add and multiply in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/29_gemm_add_multiply/gemm_add_multiply.cpp
+++ b/client_example/29_gemm_add_multiply/gemm_add_multiply.cpp
--- a/client_example/30_gemm_bf16Aint8B/README.md
+++ b/client_example/30_gemm_bf16Aint8B/README.md
@@ -0,0 +1,92 @@
+# Client Example: GEMM with bf16A/int8B and Fused Epilogues
+
+## Theory
+
+This client example demonstrates **GEMM with mixed-precision input types (bf16 for A, int8 for B)** and various fused epilogue operations (bias, GELU, FastGELU, multiply). Mixed-precision GEMM is used for efficient inference and training in deep learning, especially for transformer and MLP layers.
+
+**Mathematical Formulation:**
+- GEMM: $Y = A \times B$
+  - $A$: bf16 (brain floating point)
+  - $B$: int8 (8-bit integer)
+- Fused epilogues:
+  - Bias: $Z = Y + \text{bias}$
+  - GELU: $E = \text{GELU}(Z)$
+  - FastGELU: $E = \text{FastGELU}(Z)$
+  - Multiply: $E = Z \odot D_1$
+
+**Algorithmic Background:**
+- Mixed-precision computation reduces memory and compute requirements.
+- Fused epilogues improve efficiency by combining bias, activation, and scaling in a single kernel.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+```bash
+cd composable_kernel/build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -D DTYPES="bf16;int8" ..
+make -j
+make install
+```
+
+### Build and run
+```bash
+cd composable_kernel/client_example/30_gemm_bf16Aint8B
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (basic GEMM)
+./gemm_xdl_bf16_i8
+
+# Example run (GEMM + bias)
+./gemm_bias_xdl_bf16_i8
+
+# Example run (GEMM + bias + GELU)
+./gemm_xdl_gelu_bf16_i8
+
+# Example run (GEMM + bias + FastGELU)
+./gemm_bias_fastgelu_xdl_bf16_i8
+
+# Example run (GEMM + multiply)
+./gemm_xdl_multiply_bf16_i8
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/30_gemm_bf16Aint8B/
+├── gemm_xdl_bf16_i8.cpp                # GEMM (bf16A, int8B)
+├── gemm_bias_xdl_bf16_i8.cpp           # GEMM + bias
+├── gemm_xdl_gelu_bf16_i8.cpp           # GEMM + bias + GELU
+├── gemm_bias_fastgelu_xdl_bf16_i8.cpp  # GEMM + bias + FastGELU
+├── gemm_xdl_multiply_bf16_i8.cpp       # GEMM + multiply
+├── CMakeLists.txt                      # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input matrices, configures GEMM and epilogue parameters, launches the kernel, and verifies the result.
+- **Fused kernel invocation**:  
+  Uses the Composable Kernel device API to launch GEMM with various fused epilogues.
+
+---
+
+## Additional Details
+
+- Supports bf16 and int8 input types for efficient mixed-precision computation.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [14_gemm_quantization](../../example/14_gemm_quantization/README.md): GEMM quantization in the main example directory
+- [46_gemm_add_multiply](../../example/46_gemm_add_multiply/README.md): GEMM with add and multiply in the main example directory
+
+---
+[Back to Client Examples](../README.md)
--- a/client_example/31_grouped_gemm_bf16Aint8B/README.md
+++ b/client_example/31_grouped_gemm_bf16Aint8B/README.md
@@ -0,0 +1,93 @@
+# Client Example: Grouped GEMM with bf16A/int8B and Fused Epilogues
+
+## Theory
+
+This client example demonstrates **grouped GEMM with mixed-precision input types (bf16 for A, int8 for B)** and various fused epilogue operations (bias, FastGELU, multiply). Grouped GEMM performs multiple independent GEMM operations (with potentially different shapes) in a single kernel launch, and mixed-precision is used for efficient inference and training.
+
+**Mathematical Formulation:**
+For $G$ groups, each with its own $A_g$, $B_g$:
+- GEMM: $Y_g = A_g \times B_g$
+  - $A_g$: bf16 (brain floating point)
+  - $B_g$: int8 (8-bit integer)
+- Fused epilogues:
+  - Bias: $Z_g = Y_g + \text{bias}_g$
+  - FastGELU: $E_g = \text{FastGELU}(Z_g)$
+  - Multiply: $E_g = Z_g \odot D_{1,g}$
+
+**Algorithmic Background:**
+- Each group can have different matrix sizes and strides.
+- Mixed-precision computation reduces memory and compute requirements.
+- Fused epilogues improve efficiency by combining bias, activation, and scaling in a single kernel.
+
+## How to Run
+
+### Prerequisites
+
+Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.
+
+```bash
+cd composable_kernel/build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -D DTYPES="bf16;int8" ..
+make -j
+make install
+```
+
+### Build and run
+```bash
+cd composable_kernel/client_example/31_grouped_gemm_bf16Aint8B
+mkdir build && cd build
+cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
+make -j
+
+# Example run (basic grouped GEMM)
+./grouped_gemm_xdl_bf16_i8
+
+# Example run (grouped GEMM + bias + FastGELU)
+./grouped_gemm_bias_fastgelu_xdl_bf16_i8
+
+# Example run (grouped GEMM + FastGELU)
+./grouped_gemm_fastgelu_xdl_bf16_i8
+
+# Example run (grouped GEMM + multiply)
+./grouped_gemm_multiply_xdl_bf16_i8
+
+# Example run (grouped GEMM + multiply + bias + FastGELU)
+./grouped_gemm_multiply_bias_fastgelu_xdl_bf16_i8
+```
+
+## Source Code Structure
+
+### Directory Layout
+```
+client_example/31_grouped_gemm_bf16Aint8B/
+├── grouped_gemm_xdl_bf16_i8.cpp                # Grouped GEMM (bf16A, int8B)
+├── grouped_gemm_bias_fastgelu_xdl_bf16_i8.cpp  # Grouped GEMM + bias + FastGELU
+├── grouped_gemm_fastgelu_xdl_bf16_i8.cpp       # Grouped GEMM + FastGELU
+├── grouped_gemm_multiply_xdl_bf16_i8.cpp       # Grouped GEMM + multiply
+├── grouped_gemm_multiply_bias_fastgelu_xdl_bf16_i8.cpp # Grouped GEMM + multiply + bias + FastGELU
+├── CMakeLists.txt                              # Build configuration for the example
+```
+
+### Key Functions
+
+- **main()** (in each `.cpp`):  
+  Sets up input matrices for each group, configures GEMM and epilogue parameters, launches the grouped kernel, and verifies the result.
+- **Grouped GEMM kernel invocation**:  
+  Uses the Composable Kernel device API to launch grouped GEMM with various fused epilogues.
+
+---
+
+## Additional Details
+
+- Supports multiple groups with different matrix shapes and bf16/int8 input types.
+- Example parameters can be adjusted in the source for different workloads.
+
+---
+
+## Related Examples
+
+- [30_gemm_bf16Aint8B](../30_gemm_bf16Aint8B/README.md): GEMM with bf16A/int8B and fused epilogues
+- [15_grouped_gemm](../../example/15_grouped_gemm/README.md): Grouped GEMM in the main example directory
+
+---
+[Back to Client Examples](../README.md)