mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-03-14 20:27:42 +00:00
[DOCS] Documentation Addition (Readme updates) (#2495)
* GH-2368: Add a basic glossary; minor edits; add missing READMEs and standardization; resolve readme updates; minor improvements to documentation; improve some readmes; clean up the documentation in 'client_example' (#2468); update for PR; update ACRONYMS.md to remove trivial terms and to provide detailed explanations for the BF16 and BF8 formats; apply suggestions from @spolifroni-amd (Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>); update README.md to clarify the CK Tile API description and remove outdated references to the Tile Engine; revise the 37_transpose and 36_copy readmes; remove references to the Tile Engine in README files for 19_gemm_multi_d, 35_batched_transpose, and multiple other examples, and update distribution links for consistency and clarity; refine README files by removing outdated references to the Tile Engine
* Updates based on PR feedback (1-8)
* Content modification of the CK Tile example
* Modify the ck_tile gemm config
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
committed by GitHub · parent 013ba3c737 · commit 92c67a824f
ACRONYMS.md (new file, 67 lines)
@@ -0,0 +1,67 @@
# Acronyms in Composable Kernel

The following acronyms are used in the Composable Kernel codebase:

| Acronym | Expansion | Explanation |
|---------|-----------|-------------|
| BF16 | Brain Floating Point 16 | 1 Sign bit, 8 Exponent bits, 7 Significand bits |
| BF8 | 8-bit Brain Floating Point | 1 Sign bit, 5 Exponent bits, 2 Significand bits (E5M2) |
| DLA | Deep Learning Accelerator | Specialized hardware for deep learning workloads |
| DRAM | Dynamic Random-Access Memory | Main memory; global memory on a GPU |
| E2E | End-to-End | Complete pipeline or process from input to output |
| ELU | Exponential Linear Unit | Activation function: $x$ if $x>0$, else $\alpha(e^x-1)$ |
| FMHA | Fused Multi-Head Attention | Efficient transformer attention kernel, fusing softmax, masking, and matmul |
| FP16 | Half-Precision Floating Point | 16-bit IEEE floating-point format |
| FP32 | Single-Precision Floating Point | 32-bit IEEE floating-point format |
| FP64 | Double-Precision Floating Point | 64-bit IEEE floating-point format |
| FP8 | 8-bit Floating Point | 8-bit floating-point format (1 Sign bit, 4 Exponent bits, 3 Significand bits, E4M3), used for inference |
| GEMM | General Matrix Multiply | Matrix multiplication operation: $C = A \times B$ |
| GELU | Gaussian Error Linear Unit | Activation function: $x \cdot \Phi(x)$ |
| GQA | Grouped Query Attention | Variant of multi-head attention with grouped queries/keys/values |
| HBM | High Bandwidth Memory | Fast memory used in modern GPUs |
| HIP | Heterogeneous-Compute Interface for Portability | AMD's CUDA-like GPU programming API |
| INT8 | 8-bit Integer | Quantized integer format for inference |
| KVS | Key-Value Store | Data structure for storing key-value pairs (context: QKV in transformers) |
| L2/L1 | Level 2/Level 1 Cache | On-chip memory hierarchy in CPUs/GPUs |
| LDS | Local Data Share | Shared memory on AMD GPUs (equivalent to CUDA's shared memory) |
| LLM | Large Language Model | Transformer-based model for NLP tasks |
| LSE | Log-Sum-Exp | Building block of numerically stable softmax: $\log(\sum_i \exp(x_i))$ |
| MFMA | Matrix Fused Multiply-Add | AMD GPU hardware instruction for matrix-matrix multiplication |
| MHA | Multi-Head Attention | Attention mechanism with multiple heads in transformers |
| MoE | Mixture of Experts | Neural network architecture with multiple expert subnetworks |
| MQA | Multi-Query Attention | Variant of multi-head attention with keys/values shared across heads |
| NCHW | Batch, Channel, Height, Width | Tensor layout: batch-major, channels-first |
| NHWC | Batch, Height, Width, Channel | Tensor layout: batch-major, channels-last |
| OOM | Out Of Memory | Error raised when a memory allocation fails |
| QAT | Quantization-Aware Training | Training technique for quantized inference |
| QKV | Query, Key, Value | Components of the transformer attention mechanism |
| RCCL | ROCm Collective Communications Library | AMD library for multi-GPU communication |
| RDMA | Remote Direct Memory Access | High-speed network memory access |
| RDQuant | Rowwise Dynamic Quantization | Quantization technique with per-row scaling for int8 inference |
| ReLU | Rectified Linear Unit | Activation function: $\max(0, x)$ |
| ROCm | Radeon Open Compute | AMD's open GPU computing stack |
| SGD | Stochastic Gradient Descent | Optimization algorithm for training neural networks |
| SM | Streaming Multiprocessor | GPU compute unit (NVIDIA terminology) |
| SWA | Sliding Window Attention | Attention mechanism with a limited window for each token |
| TLB | Translation Lookaside Buffer | Memory management unit cache for virtual-to-physical address translation |
| VGPR | Vector General-Purpose Register | GPU register for vector operations |
| WARP | Group of Threads | Smallest scheduling unit on NVIDIA GPUs (32 threads) |
| WMMA | Warp Matrix Multiply-Accumulate | Warp-level matrix-multiply hardware primitive |
| XLA | Accelerated Linear Algebra | Google's compiler for optimizing ML computations |

### Common Variable Acronyms in Code

| Symbol | Meaning | Context |
|--------|---------|---------|
| M, N, K | Matrix dimensions | GEMM: $A[M,K] \times B[K,N] = C[M,N]$ |
| Q, K, V | Query, Key, Value | Transformer attention |
| S | Sequence length | NLP, transformers |
| D | Dimension | Hidden size, feature dim |
| B | Batch size | ML batch processing |
| H | Head count | Multi-head attention |
| C | Channel | CNNs, tensor layouts |
| T | Token | NLP, sequence models |

---

If you find an acronym that is not listed here, please submit a pull request or issue!
@@ -1,5 +1,22 @@
[Back to supported operations](../../../include/ck/README.md)

# Composable Kernel GEMM

# Client Example: Basic GEMM

## Theory

This client example demonstrates a basic **GEMM (General Matrix Multiplication)** operation using the Composable Kernel library. GEMM is a core operation in linear algebra and deep learning; it computes the product of two matrices and optionally adds a bias or applies scaling.

**Mathematical Formulation:**

$$
C = \alpha (A \times B) + \beta D
$$

- $A$: [M, K] input matrix
- $B$: [K, N] weight matrix
- $D$: [M, N] optional bias or residual
- $C$: [M, N] output
- $\alpha, \beta$: scalars (often 1.0 and 0.0)

**Algorithmic Background:**
- The operation is implemented using a tiled/blocking strategy for memory efficiency.
- GEMM is the computational backbone of transformer attention, MLPs, and CNNs (via im2col).
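The formulation above can be sketched in plain Python as an unoptimized reference (the math only; the CK kernel tiles this loop nest across thread blocks, LDS, and registers, and none of the names below are CK API):

```python
# Reference GEMM epilogue: C = alpha * (A x B) + beta * D.
# A math-only sketch; illustrative names, not the CK interface.
def gemm(A, B, D, alpha=1.0, beta=0.0):
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):          # dot product over the K dimension
                acc += A[m][k] * B[k][n]
            C[m][n] = alpha * acc + beta * D[m][n]
    return C
```

With `alpha=1.0, beta=0.0` this reduces to the plain matrix product $C = A \times B$.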
## GEMM

General matrix multiplication operation. In CK, the GEMM operation is called `DeviceGemm` and requires the following types as template parameters:
@@ -124,3 +141,38 @@ Table of supported cases by instance factory with XDL instruction for Row/Row/Ro
* **DeviceGemmReduce** - GEMM fused with a reduction.
* **DeviceGemm_Streamk_V2** - Stream-K GEMM implementation; allows using a reduction instead of AtomicAdd.
* **DeviceGemmStreamK** - Stream-K GEMM implementation using AtomicAdd.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/client_example/01_gemm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm
```

## Source Code Structure

### Directory Layout

```
client_example/01_gemm/
├── gemm.cpp          # Main client example: sets up, runs, and verifies GEMM
└── CMakeLists.txt    # Build configuration for the example
```

### Key Functions

- **main()** (in `gemm.cpp`):
  Sets up input matrices, configures GEMM parameters, launches the GEMM kernel, and verifies the result.
- **GEMM kernel invocation**:
  Uses the Composable Kernel device API to launch the GEMM operation.

This client example provides a minimal, end-to-end demonstration of using Composable Kernel for matrix multiplication in a user application.
client_example/02_gemm_add_add_fastgelu/README.md (new file, 65 lines)
@@ -0,0 +1,65 @@
# Client Example: GEMM with Add, Add, and FastGELU Fusion

## Theory

This client example demonstrates **GEMM fused with two addition operations and FastGELU activation**. This pattern is common in transformer feed-forward networks and other neural architectures where a linear transformation is followed by bias addition, residual addition, and a non-linear activation.

**Mathematical Formulation:**

$$
E = \text{FastGELU}((A \times B) + D_0 + D_1)
$$

- $A$: [M, K] input matrix
- $B$: [K, N] weight matrix
- $D_0$: [N] bias vector (broadcast over rows)
- $D_1$: [M, N] residual tensor
- $E$: [M, N] output

FastGELU is an efficient approximation of GELU:

$$
\text{FastGELU}(x) = x \cdot \sigma(1.702 \cdot x)
$$

where $\sigma$ is the sigmoid function.

**Algorithmic Background:**
- The GEMM result is kept in registers; the bias and residual are added and FastGELU is applied before writing to global memory.
- No intermediate results are written to global memory.
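The fused epilogue above can be sketched as a reference in plain Python (math only; the names are illustrative, not the CK API):

```python
import math

# Reference sketch of E = FastGELU((A x B) + D0 + D1),
# with FastGELU(x) = x * sigmoid(1.702 * x).
def fast_gelu(x):
    return x / (1.0 + math.exp(-1.702 * x))   # x * sigmoid(1.702 x)

def gemm_add_add_fastgelu(A, B, D0, D1):
    M, K, N = len(A), len(A[0]), len(B[0])
    E = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = sum(A[m][k] * B[k][n] for k in range(K))
            # D0 is a per-column bias broadcast over rows; D1 is a full residual
            E[m][n] = fast_gelu(acc + D0[n] + D1[m][n])
    return E
```

In the CK kernel the same per-element epilogue is applied while the GEMM accumulator is still in registers.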

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/client_example/02_gemm_add_add_fastgelu
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_add_add_fastgelu
```

## Source Code Structure

### Directory Layout

```
client_example/02_gemm_add_add_fastgelu/
├── gemm_add_add_fastgelu.cpp           # Main client example: GEMM+Add+Add+FastGELU
├── gemm_add_add_fastgelu_generic.cpp   # Generic variant
├── gemm_add_fastgelu.cpp               # GEMM+Add+FastGELU
├── gemm_add_fastgelu_generic.cpp       # Generic variant
├── gemm_fastgelu.cpp                   # GEMM+FastGELU only
├── gemm_fastgelu_generic.cpp           # Generic variant
└── CMakeLists.txt                      # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input matrices, configures GEMM and epilogue parameters, launches the fused kernel, and verifies the result.
- **Fused kernel invocation**:
  Uses the Composable Kernel device API to launch the GEMM with fused addition and FastGELU.

This client example provides several variants to demonstrate different levels of fusion and genericity for transformer-style MLP layers.
client_example/03_gemm_layernorm/README.md (new file, 57 lines)
@@ -0,0 +1,57 @@
# Client Example: GEMM with LayerNorm Fusion

## Theory

This client example demonstrates **GEMM fused with layer normalization** and additional elementwise operations. This pattern is common in transformer feed-forward networks and other architectures where a linear transformation is followed by normalization and activation.

**Mathematical Formulation:**
- GEMM: $Y = A \times B$
- Additions: $Z = Y + D_0 + D_1$ (bias, residual, etc.)
- Activation: $R = \text{ReLU}(Z)$ (or another activation)
- LayerNorm: $\text{LayerNorm}(R) = \gamma \cdot \frac{R - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$

$\mu$ and $\sigma^2$ are the mean and variance over the normalization axis; $\gamma$ and $\beta$ are learnable scale and shift parameters.

**Algorithmic Background:**
- The GEMM result is kept in registers; the elementwise ops and layer normalization are fused into the epilogue.
- LayerNorm is typically applied over the last dimension (features).
- This fusion reduces memory traffic and is common in transformer MLP blocks.
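The Welford variant of this example replaces the naive sum / sum-of-squares statistics with a one-pass, numerically stable update. A minimal sketch of that update (algorithm only, not CK code):

```python
# Welford's one-pass mean/variance update: each sample updates the running
# mean and the sum of squared deviations (m2) without a second pass.
def welford(xs):
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)     # note: uses the updated mean
    return mean, m2 / count          # population variance, as LayerNorm uses
```

Compared with accumulating `sum(x)` and `sum(x*x)` separately, this avoids catastrophic cancellation when the mean is large relative to the variance.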

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/client_example/03_gemm_layernorm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (naive)
./gemm_add_add_layernorm_naive

# Example run (with ReLU and Welford)
./gemm_add_relu_add_layernorm_welford
```

## Source Code Structure

### Directory Layout

```
client_example/03_gemm_layernorm/
├── gemm_add_add_layernorm_naive.cpp          # GEMM + Add + Add + LayerNorm (naive)
├── gemm_add_relu_add_layernorm_welford.cpp   # GEMM + Add + ReLU + Add + LayerNorm (Welford)
└── CMakeLists.txt                            # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input matrices, configures GEMM and epilogue parameters, launches the fused kernel, and verifies the result.
- **LayerNorm implementation**:
  Demonstrates both the naive and the numerically stable (Welford) algorithms for mean/variance.

This client example provides variants to demonstrate different levels of fusion and normalization for transformer-style MLP layers.
client_example/04_contraction/README.md (new file, 56 lines)
@@ -0,0 +1,56 @@
# Client Example: General Tensor Contraction

## Theory

This client example demonstrates **general tensor contraction** operations, including bilinear and scaled contractions. Tensor contraction generalizes matrix multiplication to higher dimensions and is used in scientific computing, quantum chemistry, and advanced neural network layers.

**Mathematical Formulation:**
- General contraction (matrix case): $C_{i,j} = \sum_k A_{i,k} \cdot B_{k,j}$; in general, each tensor may carry multiple M, N, and K indices.
- Bilinear contraction: $C = \alpha (A \cdot B) + \beta D$
- Scale contraction: $C = \alpha (A \cdot B)$

**Algorithmic Background:**
- Contraction can be performed over arbitrary axes and supports broadcasting.
- Bilinear and scale contractions are used for feature fusion, gating, and scientific workloads.
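For the matrix case, the bilinear contraction reduces to a scaled GEMM with an additive tensor. A minimal sketch (math only; CK generalizes the same pattern to multiple M/N/K dimensions per tensor, and these names are illustrative):

```python
# Bilinear contraction for the matrix case:
# C[i][j] = alpha * sum_k A[i][k] * B[k][j] + beta * D[i][j]
def bilinear_contraction(A, B, D, alpha=1.0, beta=1.0):
    M, K, N = len(A), len(A[0]), len(B[0])
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(K)) + beta * D[i][j]
             for j in range(N)] for i in range(M)]
```

Setting `beta=0.0` gives the scale-contraction form $C = \alpha (A \cdot B)$.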

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/client_example/04_contraction
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (bilinear FP32)
./contraction_bilinear_fp32

# Example run (scale FP64)
./contraction_scale_fp64
```

## Source Code Structure

### Directory Layout

```
client_example/04_contraction/
├── contraction_bilinear_fp32.cpp           # Bilinear contraction (FP32)
├── contraction_bilinear_fp64.cpp           # Bilinear contraction (FP64)
├── contraction_g1m2n3k1_add_xdl_fp16.cpp   # Grouped contraction with addition (FP16)
├── contraction_scale_fp32.cpp              # Scale contraction (FP32)
├── contraction_scale_fp64.cpp              # Scale contraction (FP64)
└── CMakeLists.txt                          # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input tensors, configures contraction parameters, launches the contraction kernel, and verifies the result.
- **Contraction kernel invocation**:
  Uses the Composable Kernel device API to launch the contraction operation.

This client example provides several variants to demonstrate different contraction types and data types for scientific and ML workloads.
client_example/05_layernorm/README.md (new file, 66 lines)
@@ -0,0 +1,66 @@
# Client Example: Layer Normalization (Forward and Backward)

## Theory

This client example demonstrates **layer normalization** in both forward and backward modes, for 2D and 4D tensors. Layer normalization is used in transformers and other neural networks to normalize activations across the feature dimension, improving training stability.

**Mathematical Formulation:**
Given input $X$:
- Mean: $\mu = \frac{1}{N} \sum_{i=1}^N X_i$
- Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2$
- Normalized: $\hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
- Output: $Y_i = \gamma \hat{X}_i + \beta$

$\gamma$ and $\beta$ are learnable scale and shift parameters.

**Algorithmic Background:**
- The forward pass computes the mean, variance, normalization, and affine transformation.
- The backward pass computes gradients with respect to the input, gamma, and beta.
- Supports both 2D (batch, feature) and 4D (batch, channel, height, width) tensors.
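The forward-pass formulas above can be sketched for the 2D (batch, feature) case as a plain-Python reference (math only; illustrative names, not CK code):

```python
import math

# 2D LayerNorm forward over the feature dimension: per row, normalize to
# zero mean / unit variance, then apply the gamma/beta affine transform.
def layernorm_fwd(X, gamma, beta, eps=1e-5):
    Y = []
    for row in X:
        n = len(row)
        mu = sum(row) / n
        var = sum((x - mu) ** 2 for x in row) / n
        Y.append([gamma[i] * (row[i] - mu) / math.sqrt(var + eps) + beta[i]
                  for i in range(n)])
    return Y
```

With `gamma` all ones and `beta` all zeros, each output row has (approximately) zero mean and unit variance.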

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/client_example/05_layernorm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (2D forward)
./layernorm2d_fwd

# Example run (4D forward)
./layernorm4d_fwd

# Example run (2D backward, data)
./layernorm2d_bwd_data

# Example run (2D backward, gamma/beta)
./layernorm2d_bwd_gamma_beta
```

## Source Code Structure

### Directory Layout

```
client_example/05_layernorm/
├── layernorm2d_fwd.cpp              # 2D layernorm forward
├── layernorm4d_fwd.cpp              # 4D layernorm forward
├── layernorm2d_bwd_data.cpp         # 2D layernorm backward (data)
├── layernorm2d_bwd_gamma_beta.cpp   # 2D layernorm backward (gamma/beta)
└── CMakeLists.txt                   # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input tensors, configures normalization parameters, launches the forward or backward kernel, and verifies the result.
- **LayerNorm implementation**:
  Demonstrates both forward and backward passes for different tensor shapes.

This client example provides a comprehensive demonstration of layer normalization for both inference and training in deep learning models.
client_example/06_softmax/README.md (new file, 54 lines)
@@ -0,0 +1,54 @@
# Client Example: 4D Softmax

## Theory

This client example demonstrates **Softmax computation over 4D tensors**. Softmax is a key operation in deep learning, especially in attention mechanisms and classification, converting logits into normalized probabilities.

**Mathematical Formulation:**
Given input $X$, along the chosen reduction axis:

$$
\text{softmax}(X)_i = \frac{\exp(X_i)}{\sum_j \exp(X_j)}
$$

**Algorithmic Background:**
- Softmax is implemented using a numerically stable algorithm:
  1. Subtract the maximum value for numerical stability.
  2. Exponentiate and sum.
  3. Normalize by the sum.
- Efficient parallel Softmax requires careful reduction and memory access patterns.
- This example demonstrates Softmax over a 4D tensor, as used in attention and vision models.
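The three stability steps above can be sketched for a single row (math only; CK applies the same reduction along the chosen axis of the 4D tensor):

```python
import math

# Numerically stable softmax for one row of values.
def stable_softmax(xs):
    m = max(xs)                            # 1. subtract the maximum
    exps = [math.exp(x - m) for x in xs]   # 2. exponentiate and sum
    total = sum(exps)
    return [e / total for e in exps]       # 3. normalize by the sum
```

Without the max subtraction, inputs like `[1000.0, 1000.0]` would overflow `exp`; with it, the result is exactly `[0.5, 0.5]`.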

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/client_example/06_softmax
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./softmax4d
```

## Source Code Structure

### Directory Layout

```
client_example/06_softmax/
├── softmax4d.cpp     # Main client example: sets up, runs, and verifies 4D softmax
└── CMakeLists.txt    # Build configuration for the example
```

### Key Functions

- **main()** (in `softmax4d.cpp`):
  Sets up input tensors, configures Softmax parameters, launches the Softmax kernel, and verifies the result.
- **Softmax kernel invocation**:
  Uses the Composable Kernel device API to launch the Softmax operation.

This client example provides a demonstration of efficient, numerically stable Softmax for 4D tensors in deep learning models.
@@ -1,5 +1,18 @@
[Back to supported operations](../../../include/ck/README.md)

# Composable Kernel Grouped Convolution

# Client Example: Grouped N-Dimensional Convolution Forward

## Theory

This client example demonstrates **grouped N-dimensional convolution forward** for 1D, 2D, and 3D inputs, supporting multiple data types (including BF8 and FP8). Grouped convolution is used in modern CNNs and vision transformers to reduce computation and enable channel-wise or expert-wise processing.

**Mathematical Formulation:**
Given input $X$ and weights $W$ for $G$ groups, for each group $g$:

$$
Y^g[n, c_{out}, ...] = \sum_{c_{in}} \sum_{k_1} ... \sum_{k_n} X^g[n, c_{in}, ...] \cdot W^g[c_{out}, c_{in}, ...]
$$

- Each group operates on a subset of the input/output channels.
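The per-group channel split can be sketched with a naive 1D convolution (valid padding, stride 1; a math sketch with illustrative names, not the CK layout or API):

```python
# Naive grouped 1D convolution: output channel co only reads the input
# channels belonging to its group.
def grouped_conv1d(X, W, groups):
    # X: [C_in][L] input, W: [C_out][C_in // groups][KW] weights
    c_in, c_out = len(X), len(W)
    cig, cog = c_in // groups, c_out // groups   # channels per group
    kw = len(W[0][0])
    L_out = len(X[0]) - kw + 1
    Y = [[0.0] * L_out for _ in range(c_out)]
    for co in range(c_out):
        g = co // cog                            # group of this output channel
        for l in range(L_out):
            Y[co][l] = sum(X[g * cig + ci][l + k] * W[co][ci][k]
                           for ci in range(cig) for k in range(kw))
    return Y
```

With `groups` equal to the channel count this becomes a depthwise convolution; with `groups=1` it is an ordinary convolution.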

## Grouped Convolution Forward

Grouped convolution operation for 1D, 2D, or 3D spatial dimensions. The convolution uses a GEMM kernel after a tensor coordinate transform. In CK, the Grouped Convolution Forward operation is called `DeviceGroupedConvFwdMultipleABD` and requires the following types as template parameters:
@@ -66,3 +79,52 @@ Table of supported cases by instance factory with fused elementwise operation:
* **Scale** - 3D, NHWGC, bf16/fp16/fp32/int8
* **Scale + Add (for A and B)** - 3D, NHWGC, bf16/fp16/fp32/int8
* **Scale + Add + Scale + Add + Relu** - 3D, NHWGC, bf16/fp16/fp32/int8

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/client_example/07_grouped_convnd_fwd
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (2D grouped convolution)
./grouped_conv2d_fwd

# Example run (3D grouped convolution, BF8)
./grouped_conv3d_fwd_bf8

# Example run (3D grouped convolution, FP8)
./grouped_conv3d_fwd_fp8
```

## Source Code Structure

### Directory Layout

```
client_example/07_grouped_convnd_fwd/
├── grouped_conv1d_fwd.cpp            # 1D grouped convolution
├── grouped_conv2d_fwd.cpp            # 2D grouped convolution (NCHW)
├── grouped_conv2d_fwd_ngchw.cpp      # 2D grouped convolution (NGCHW)
├── grouped_conv3d_fwd_bf8.cpp        # 3D grouped convolution (BF8)
├── grouped_conv3d_fwd_fp8.cpp        # 3D grouped convolution (FP8)
├── grouped_conv3d_fwd_bf8_fp8.cpp    # 3D grouped convolution (BF8/FP8 mixed)
├── grouped_conv3d_fwd_fp8_bf8.cpp    # 3D grouped convolution (FP8/BF8 mixed)
├── common.hpp                        # Common utilities for grouped convolution
└── CMakeLists.txt                    # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input tensors, configures grouped convolution parameters, launches the kernel, and verifies the result.
- **Grouped convolution kernel invocation**:
  Uses the Composable Kernel device API to launch grouped convolution for different dimensions and data types.

This client example provides a comprehensive demonstration of grouped convolution for efficient CNN and vision transformer models.
client_example/08_fused_attention/README.md (new file, 89 lines)
@@ -0,0 +1,89 @@
# Fused Attention Examples

This directory contains comprehensive examples demonstrating CK's high-performance fused attention implementations, which are critical for modern transformer architectures and large language models.

---

## Theory

**Fused Multi-Head Attention Operation:**
The fused attention mechanism performs the core transformer operation in a single, optimized kernel:

$$
\text{Attention}(Q, K, V) = \text{Softmax}(Q K^T / \sqrt{d_k}) V
$$

**Detailed Mathematical Steps:**
1. **Query-Key Attention Scores**: $S = Q K^T$
2. **Scale**: $S_{\text{scaled}} = S / \sqrt{d_k}$
3. **Softmax**: $A = \text{Softmax}(S_{\text{scaled}})$
4. **Weighted Value Sum**: $\text{Output} = A V$

- Multi-head extension: each head computes attention independently; the results are then concatenated and projected.
- Tensor shapes: Q, K, V, and the output are typically [Batch, Seq_len, Num_heads, Head_dim].

**Algorithmic Background:**
- Fused attention combines two GEMMs and a softmax in a single kernel, minimizing memory traffic.
- Supports bias, masking, and permutation for transformer and LLM workloads.
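The four steps above can be sketched for a single head in plain Python (math reference only; the fused CK kernel never materializes $S$ or $A$ in global memory, and these names are illustrative):

```python
import math

# Single-head attention reference: S = Q K^T / sqrt(d_k), row-wise
# numerically stable softmax, then A V.
def attention(Q, K, V):
    d_k = len(Q[0])
    out = []
    for q in Q:
        # attention scores of this query against every key, scaled
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        m = max(scores)                  # stable softmax: subtract the max
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        # weighted sum of the value rows
        out.append([sum(a * v[j] for a, v in zip(attn, V))
                    for j in range(len(V[0]))])
    return out
```

The bias variant adds a bias tensor to the scores before the softmax; masking sets disallowed score entries to $-\infty$.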

---

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/client_example/08_fused_attention
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (basic fused attention)
./fused_attention

# Example run (fused attention with bias)
./fused_attention_bias
```

---

## Source Code Structure

### Directory Layout

```
client_example/08_fused_attention/
├── fused_attention.cpp        # Main client example: fused attention (Q, K, V)
├── fused_attention_bias.cpp   # Fused attention with bias
└── CMakeLists.txt             # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up Q, K, V tensors, configures attention parameters, launches the fused kernel, and verifies the result.
- **Fused attention kernel invocation**:
  Uses the Composable Kernel device API to launch the fused attention operation, optionally with bias.

---

## Additional Details

- Supports FP16, BF16, FP32, and mixed precision.
- Handles causal and generic masking for autoregressive and variable-length models.
- Optimized for memory efficiency (no intermediate attention matrix in global memory).
- Example parameters can be adjusted in the source for different transformer workloads.

---

## Related Examples

- [01_gemm](../01_gemm/README.md): GEMM for Q×K^T and Attn×V
- [06_softmax](../06_softmax/README.md): Softmax client API usage
- [03_gemm_layernorm](../03_gemm_layernorm/README.md): Fused GEMM + layer normalization
- [07_grouped_convnd_fwd](../07_grouped_convnd_fwd/README.md): Grouped convolution for vision transformers

---

[Back to Client Examples](../README.md)
client_example/09_quantization/README.md (new file, 85 lines)
@@ -0,0 +1,85 @@
# Client Example: Quantization for GEMM and Conv2D

## Theory

This client example demonstrates **quantized GEMM and 2D convolution** operations, including per-layer and per-channel quantization and fusion with bias and activation functions. Quantization reduces memory and computation by representing values with lower-precision integer types (e.g., int8), enabling efficient inference in deep learning.

**Mathematical Formulation:**
- Quantized GEMM: $C = \text{dequant}(A_q) \times \text{dequant}(B_q)$
- Quantized Conv2D: $Y = \text{dequant}(X_q) * \text{dequant}(W_q)$
- $\text{dequant}(x_q) = (x_q - z) \cdot s$ (scale $s$, zero-point $z$)
- Per-layer: one scale/zero-point per tensor
- Per-channel: one scale/zero-point per output channel

**Algorithmic Background:**
- Quantized values are dequantized on the fly during computation.
- Accumulation is performed in higher precision for accuracy.
- Supports bias addition and activation fusion (ReLU, Tanh).
- Per-channel quantization improves accuracy for convolutional layers.
|
||||
|
||||
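As a concrete illustration of the arithmetic above (a plain-Python sketch, not the CK API; the matrix shapes and scale values are invented for the example), here is a symmetric int8 GEMM with a per-layer scale for $A$ and per-channel scales for the columns of $B$:

```python
def dequant(x_q, scale, zero_point=0):
    """The formula from the README: dequant(x_q) = (x_q - z) * s."""
    return (x_q - zero_point) * scale

def quantized_gemm(A_q, B_q, a_scale, b_scales):
    """C = dequant(A_q) x dequant(B_q), assuming zero_point = 0
    (symmetric quantization). Instead of dequantizing element-wise,
    the scales are folded in once per output element, which is
    equivalent and mirrors how real kernels accumulate in int32/float
    before applying scales in the epilogue."""
    M, K = len(A_q), len(A_q[0])
    N = len(B_q[0])
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0
            for k in range(K):
                acc += A_q[m][k] * B_q[k][n]  # int8 x int8 -> int accumulate
            # per-layer scale for A, per-channel (per-column) scale for B
            C[m][n] = acc * a_scale * b_scales[n]
    return C

# toy example: 2x2 int8 matrices
A_q = [[10, -3], [4, 7]]
B_q = [[2, 5], [-1, 6]]
C = quantized_gemm(A_q, B_q, a_scale=0.05, b_scales=[0.1, 0.2])
```

Per-layer quantization is the special case where `b_scales` holds the same value for every column.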
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/09_quantization
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (GEMM quantization)
./gemm_quantization

# Example run (Conv2D per-layer quantization)
./conv2d_fwd_perlayer_quantization

# Example run (Conv2D per-channel quantization)
./conv2d_fwd_perchannel_quantization

# Example run (Conv2D + bias + ReLU + per-channel quantization)
./conv2d_fwd_bias_relu_perchannel_quantization
```

## Source Code Structure

### Directory Layout
```
client_example/09_quantization/
├── gemm_quantization.cpp # Quantized GEMM
├── conv2d_fwd_perlayer_quantization.cpp # Conv2D per-layer quantization
├── conv2d_fwd_perchannel_quantization.cpp # Conv2D per-channel quantization
├── conv2d_fwd_bias_relu_perlayer_quantization.cpp # Conv2D + bias + ReLU + per-layer quantization
├── conv2d_fwd_bias_relu_perchannel_quantization.cpp # Conv2D + bias + ReLU + per-channel quantization
├── conv2d_fwd_bias_tanh_perlayer_quantization.cpp # Conv2D + bias + Tanh + per-layer quantization
├── conv2d_fwd_bias_tanh_perchannel_quantization.cpp # Conv2D + bias + Tanh + per-channel quantization
├── CMakeLists.txt # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input tensors, configures quantization parameters, launches the quantized kernel, and verifies the result.
- **Quantization kernel invocation**:
  Uses the Composable Kernel device API to launch quantized GEMM or Conv2D with optional bias and activation.

---

## Additional Details

- Supports int8 quantization with per-layer and per-channel scaling.
- Demonstrates fusion with bias and activation (ReLU, Tanh).
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [01_gemm](../01_gemm/README.md): GEMM for quantized matrix multiplication
- [14_gemm_quantization](../../example/14_gemm_quantization/README.md): GEMM quantization in the main example directory
- [40_conv2d_fwd_quantization](../../example/40_conv2d_fwd_quantization/README.md): Conv2D quantization in the main example directory

---

[Back to Client Examples](../README.md)
[Back to supported operations](../../include/ck/README.md)

# Composable Kernel Grouped Convolution

## Grouped Convolution Backward Data

Table of supported cases by instance factory with fused elementwise operation:

* **Bilinear** - 3D, NHWGC, bf16/fp16/fp32
* **Scale** - 3D, NHWGC, bf16/fp16/fp32

---

## Theory

**Grouped convolution backward data** computes the gradient of the input tensor with respect to the loss, given the output gradient and the weights, for each group independently. This is essential for training CNNs and grouped/expert models.

**Mathematical Formulation:**
For each group $g$:
$$
\text{InputGrad}^g = \text{ConvBwdData}(\text{OutputGrad}^g, \text{Weights}^g)
$$

- Supports 1D, 2D, and 3D grouped convolutions.
- Utilizes implicit GEMM for efficient computation.
- Supports fused elementwise operations (e.g., bilinear, scale).
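For intuition only (a plain-Python reference sketch, independent of the CK implicit-GEMM implementation; shapes and values are invented), backward data for a stride-1, unpadded 1D convolution scatters each output-gradient element back through the filter taps, with every group handled independently:

```python
def conv1d_bwd_data(dy, w, input_len):
    """dX for the stride-1, unpadded 1D convolution
    y[i] = sum_k x[i + k] * w[k].
    Each dx[j] gathers every (i, k) pair with i + k == j.
    For these shapes, input_len = len(dy) + len(w) - 1."""
    dx = [0.0] * input_len
    for i in range(len(dy)):
        for k in range(len(w)):
            dx[i + k] += dy[i] * w[k]
    return dx

def grouped_conv1d_bwd_data(dy_groups, w_groups, input_len):
    # each group is fully independent, as in the grouped kernels here
    return [conv1d_bwd_data(dy, w, input_len)
            for dy, w in zip(dy_groups, w_groups)]

dy = [[1.0, 2.0]]   # one group, output length 2
w  = [[0.5, -1.0]]  # one group, kernel width 2
dx = grouped_conv1d_bwd_data(dy, w, input_len=3)
```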
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/10_grouped_convnd_bwd_data
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (2D grouped convolution backward data)
./grouped_conv2d_bwd_data
```

## Source Code Structure

### Directory Layout
```
client_example/10_grouped_convnd_bwd_data/
├── grouped_conv1d_bwd_data.cpp # 1D grouped convolution backward data
├── grouped_conv2d_bwd_data.cpp # 2D grouped convolution backward data
├── grouped_conv3d_bwd_data.cpp # 3D grouped convolution backward data
├── CMakeLists.txt # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input/output tensors, configures grouped convolution parameters, launches the backward data kernel, and verifies the result.
- **Grouped convolution backward kernel invocation**:
  Uses the Composable Kernel device API to launch grouped convolution backward data for different dimensions and data types.

This client example provides a comprehensive demonstration of grouped convolution backward data for efficient CNN and vision transformer training.
[Back to supported operations](../../include/ck/README.md)

# Composable Kernel Grouped Convolution

## Grouped Convolution Backward Weight

Table of supported cases by instance factory with fused elementwise operation:

* **Bilinear** - 3D, NHWGC, bf16 (fp32 for weight)/fp16/fp32
* **Scale** - 3D, NHWGC, bf16 (fp32 for weight)/fp16/fp32

---

## Theory

**Grouped convolution backward weight** computes the gradient of the weights with respect to the loss, given the input and output gradients, for each group independently. This is essential for training CNNs and grouped/expert models.

**Mathematical Formulation:**
For each group $g$:
$$
\text{WeightGrad}^g = \text{ConvBwdWeight}(\text{Input}^g, \text{OutputGrad}^g)
$$

- Supports 1D, 2D, and 3D grouped convolutions.
- Utilizes implicit GEMM for efficient computation.
- Supports fused elementwise operations (e.g., bilinear, scale).
- Uses split-K for large GEMM K dimensions.
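As a companion to the backward-data sketch in `10_grouped_convnd_bwd_data`, a plain-Python illustration (not the CK API; values invented) of backward weight for a stride-1, unpadded 1D convolution. The weight gradient is a correlation of the input with the output gradient:

```python
def conv1d_bwd_weight(x, dy, kernel_len):
    """dW for the stride-1, unpadded 1D convolution
    y[i] = sum_k x[i + k] * w[k]:
    dw[k] = sum_i x[i + k] * dy[i]."""
    dw = [0.0] * kernel_len
    for k in range(kernel_len):
        for i in range(len(dy)):
            dw[k] += x[i + k] * dy[i]
    return dw

# one group: input length 3, output length 2, kernel width 2
x  = [1.0, 2.0, 3.0]
dy = [0.5, -1.0]
dw = conv1d_bwd_weight(x, dy, kernel_len=2)
```

The inner sum over `i` is the GEMM reduction dimension; when it is large, the real kernels split it across workgroups (split-K) and combine partial results.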
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/11_grouped_conv_bwd_weight
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (2D grouped convolution backward weight, FP16)
./grouped_conv2d_bwd_weight_fp16

# Example run (3D grouped convolution backward weight, FP32)
./grouped_conv3d_bwd_weight_fp32
```

## Source Code Structure

### Directory Layout
```
client_example/11_grouped_conv_bwd_weight/
├── grouped_conv1d_bwd_weight_fp16.cpp # 1D grouped convolution backward weight (FP16)
├── grouped_conv2d_bwd_weight_fp16.cpp # 2D grouped convolution backward weight (FP16)
├── grouped_conv3d_bwd_weight_fp16.cpp # 3D grouped convolution backward weight (FP16)
├── grouped_conv3d_bwd_weight_fp32.cpp # 3D grouped convolution backward weight (FP32)
├── grouped_conv3d_bwd_weight_fp16_comp_bf8_fp8.cpp # 3D grouped convolution backward weight (FP16, BF8/FP8 mixed)
├── common.hpp # Common utilities for grouped convolution
├── CMakeLists.txt # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input/output tensors, configures grouped convolution parameters, launches the backward weight kernel, and verifies the result.
- **Grouped convolution backward weight kernel invocation**:
  Uses the Composable Kernel device API to launch grouped convolution backward weight for different dimensions and data types.

This client example provides a comprehensive demonstration of grouped convolution backward weight for efficient CNN and vision transformer training.
client_example/12_elementwise_normalization/README.md

# Client Example: Elementwise Layer Normalization

## Theory

This client example demonstrates **elementwise layer normalization** for 2D tensors. Layer normalization is used in transformers and other neural networks to normalize activations across the feature dimension, improving training stability. Elementwise normalization fuses normalization with other elementwise operations for efficiency.

**Mathematical Formulation:**
Given input $X$:
- Mean: $\mu = \frac{1}{N} \sum_{i=1}^N X_i$
- Variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^N (X_i - \mu)^2$
- Normalized: $\hat{X}_i = \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
- Output: $Y_i = \gamma \hat{X}_i + \beta$

$\gamma$ and $\beta$ are learnable scale and shift parameters.

**Algorithmic Background:**
- Computes mean and variance per row (sample).
- Applies normalization and affine transformation.
- Can be fused with other elementwise operations for efficiency.
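The per-row computation above can be sketched in plain Python (for illustration only, not the CK API; the input values are invented):

```python
import math

def layernorm_row(x, gamma, beta, eps=1e-5):
    """Normalize one row across its features, then apply
    the affine transform y = gamma * x_hat + beta."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b
            for v, g, b in zip(x, gamma, beta)]

def layernorm2d(X, gamma, beta):
    # one mean/variance pair per row, as in the 2D kernel here
    return [layernorm_row(row, gamma, beta) for row in X]

Y = layernorm2d([[1.0, 2.0, 3.0]], gamma=[1.0] * 3, beta=[0.0] * 3)
```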
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/12_elementwise_normalization
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./elementwise_layernorm2d
```

## Source Code Structure

### Directory Layout
```
client_example/12_elementwise_normalization/
├── elementwise_layernorm2d.cpp # Main client example: elementwise layernorm for 2D tensors
├── CMakeLists.txt # Build configuration for the example
```

### Key Functions

- **main()** (in `elementwise_layernorm2d.cpp`):
  Sets up input tensors, configures normalization parameters, launches the normalization kernel, and verifies the result.
- **Elementwise normalization kernel invocation**:
  Uses the Composable Kernel device API to launch layer normalization, optionally fused with other elementwise ops.

---

## Additional Details

- Supports fusion with other elementwise operations for efficiency.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [05_layernorm](../05_layernorm/README.md): Layer normalization client API
- [27_layernorm2d_fwd](../../example/27_layernorm2d_fwd/README.md): Layer normalization in the main example directory

---

[Back to Client Examples](../README.md)
client_example/13_batchnorm/README.md

# Client Example: Batch Normalization (Forward, Backward, Inference)

## Theory

This client example demonstrates **batch normalization** in forward, backward, and inference modes for NHWC tensors. Batch normalization is used in deep neural networks to normalize activations across the batch and spatial dimensions, improving training stability and convergence.

**Mathematical Formulation:**
Given input $X[N, H, W, C]$:
- Mean: $\mu_c = \frac{1}{NHW} \sum_{n,h,w} X_{n,h,w,c}$
- Variance: $\sigma^2_c = \frac{1}{NHW} \sum_{n,h,w} (X_{n,h,w,c} - \mu_c)^2$
- Normalized: $\hat{X}_{n,h,w,c} = \frac{X_{n,h,w,c} - \mu_c}{\sqrt{\sigma^2_c + \epsilon}}$
- Output: $Y_{n,h,w,c} = \gamma_c \hat{X}_{n,h,w,c} + \beta_c$

$\gamma_c$ and $\beta_c$ are learnable per-channel scale and shift parameters.

**Algorithmic Background:**
- Forward pass computes mean, variance, normalization, and affine transformation.
- Backward pass computes gradients with respect to input, gamma, and beta.
- Inference uses running mean and variance for normalization.
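The forward pass above reduces over N, H, and W for each channel. A plain-Python sketch (illustration only, not the CK API; the toy tensor shape is invented):

```python
import math

def batchnorm_fwd_nhwc(X, gamma, beta, eps=1e-5):
    """X is a nested [N][H][W][C] list. Statistics are reduced over
    N, H, W separately for each channel c, matching NHWC batchnorm."""
    C = len(X[0][0][0])
    count = len(X) * len(X[0]) * len(X[0][0])
    mean = [0.0] * C
    for n in X:
        for h in n:
            for w in h:
                for c, v in enumerate(w):
                    mean[c] += v / count
    var = [0.0] * C
    for n in X:
        for h in n:
            for w in h:
                for c, v in enumerate(w):
                    var[c] += (v - mean[c]) ** 2 / count
    return [[[[gamma[c] * (v - mean[c]) / math.sqrt(var[c] + eps) + beta[c]
               for c, v in enumerate(w)] for w in h] for h in n] for n in X]

# toy tensor: N=1, H=1, W=2, C=1
Y = batchnorm_fwd_nhwc([[[[1.0], [3.0]]]], gamma=[1.0], beta=[0.0])
```

Inference mode would skip the two reduction passes and plug in the stored running mean and variance instead.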
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/13_batchnorm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (forward)
./batchnorm_fwd_nhwc

# Example run (backward)
./batchnorm_bwd_nhwc

# Example run (inference)
./batchnorm_infer_nhwc
```

## Source Code Structure

### Directory Layout
```
client_example/13_batchnorm/
├── batchnorm_fwd_nhwc.cpp # Batchnorm forward (NHWC)
├── batchnorm_bwd_nhwc.cpp # Batchnorm backward (NHWC)
├── batchnorm_infer_nhwc.cpp # Batchnorm inference (NHWC)
├── CMakeLists.txt # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input tensors, configures batchnorm parameters, launches the forward, backward, or inference kernel, and verifies the result.
- **BatchNorm kernel invocation**:
  Uses the Composable Kernel device API to launch batch normalization for different modes.

---

## Additional Details

- Supports NHWC layout for image and vision models.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [34_batchnorm](../../example/34_batchnorm/README.md): Batch normalization in the main example directory

---

[Back to Client Examples](../README.md)
client_example/14_instance_id/README.md

# Client Example: BatchNorm with Instance ID Selection

## Theory

This client example demonstrates **batch normalization** using explicit instance ID selection. In Composable Kernel, an "instance ID" refers to a specific kernel configuration (tile sizes, vectorization, etc.) chosen for a given workload. This allows users to benchmark or select the best-performing kernel for their data shape.

**Mathematical Formulation:**
See [BatchNorm Theory](../13_batchnorm/README.md) for the mathematical details of batch normalization.

**Algorithmic Background:**
- The example shows how to enumerate and select a specific kernel instance by its ID.
- Useful for performance tuning, benchmarking, and debugging.
- BatchNorm is performed in NHWC layout.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/14_instance_id
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (selects a specific kernel instance)
./batchnorm_fwd_instance_id
```

## Source Code Structure

### Directory Layout
```
client_example/14_instance_id/
├── batchnorm_fwd_instance_id.cpp # Batchnorm forward with instance ID selection
├── CMakeLists.txt # Build configuration for the example
```

### Key Functions

- **main()** (in `batchnorm_fwd_instance_id.cpp`):
  Sets up input tensors, enumerates available kernel instances, selects an instance by ID, launches the batchnorm kernel, and verifies the result.
- **Instance selection**:
  Demonstrates how to use the Composable Kernel API to list and select kernel configurations.

---

## Additional Details

- Useful for kernel benchmarking and performance tuning.
- Example parameters and the instance ID can be adjusted in the source.

---

## Related Examples

- [13_batchnorm](../13_batchnorm/README.md): Batch normalization client API
- [34_batchnorm](../../example/34_batchnorm/README.md): Batch normalization in the main example directory

---

[Back to Client Examples](../README.md)
client_example/15_convnd_bwd_data/README.md

# Client Example: N-Dimensional Convolution Backward Data

## Theory

This client example demonstrates **N-dimensional convolution backward data** for 3D inputs, supporting multiple data types (FP16, FP32). The backward data operation computes the gradient of the input tensor with respect to the loss, given the output gradient and the weights. This is essential for training CNNs and 3D vision models.

**Mathematical Formulation:**
For input $X$, weights $W$, and output gradient $dY$:
$$
dX = \text{ConvBwdData}(dY, W)
$$

- Supports 3D convolution (extensible to N dimensions).
- Utilizes implicit GEMM for efficient computation.

**Algorithmic Background:**
- The backward data operation is implemented as a convolution with transformed coordinates.
- Used in training pipelines for 3D CNNs, medical imaging, and volumetric data.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/15_convnd_bwd_data
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (3D backward data, FP16)
./conv3d_bwd_data_fp16

# Example run (3D backward data, FP32)
./conv3d_bwd_data_fp32
```

## Source Code Structure

### Directory Layout
```
client_example/15_convnd_bwd_data/
├── conv3d_bwd_data_fp16.cpp # 3D convolution backward data (FP16)
├── conv3d_bwd_data_fp32.cpp # 3D convolution backward data (FP32)
├── common.hpp # Common utilities for convolution
├── CMakeLists.txt # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input/output tensors, configures convolution parameters, launches the backward data kernel, and verifies the result.
- **Backward data kernel invocation**:
  Uses the Composable Kernel device API to launch convolution backward data for different data types.

---

## Additional Details

- Supports FP16 and FP32 for 3D convolution.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [10_grouped_convnd_bwd_data](../10_grouped_convnd_bwd_data/README.md): Grouped convolution backward data
- [17_convnd_bwd_data](../../example/17_convnd_bwd_data/README.md): Convolution backward data in the main example directory

---

[Back to Client Examples](../README.md)
client_example/16_convnd_fwd/README.md

# Client Example: N-Dimensional Convolution Forward

## Theory

This client example demonstrates **N-dimensional convolution forward** for 3D inputs, supporting multiple data types (FP16, FP32, FP8 composite). Convolution is a fundamental operation in deep learning, especially in convolutional neural networks (CNNs) for images, audio, and volumetric data.

**Mathematical Formulation:**
Given input $X$ and weights $W$:
$$
Y = \text{Conv}(X, W)
$$

- Supports 3D convolution (extensible to N dimensions).
- Utilizes implicit GEMM for efficient computation.

**Algorithmic Background:**
- The forward convolution is lowered to an implicit GEMM over the input and weight tensors.
- Used in inference and training pipelines for 3D CNNs, medical imaging, and volumetric data.
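To make the contraction concrete, a direct (naive) 2D convolution in plain Python (illustration only, not the CK API; the 3D kernels compute the same contraction with one more spatial loop, and the real implementation lowers it to an implicit GEMM rather than explicit loops):

```python
def conv2d_fwd(x, w):
    """Direct stride-1, unpadded 2D convolution:
    y[i][j] = sum_{p,q} x[i+p][j+q] * w[p][q]."""
    H, W = len(x), len(x[0])
    R, S = len(w), len(w[0])
    return [[sum(x[i + p][j + q] * w[p][q]
                 for p in range(R) for q in range(S))
             for j in range(W - S + 1)]
            for i in range(H - R + 1)]

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # picks out the main diagonal
y = conv2d_fwd(x, w)          # 1x1 output
```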
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/16_convnd_fwd
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (3D forward, FP16)
./conv3d_fwd_fp16

# Example run (3D forward, FP32)
./conv3d_fwd_fp32

# Example run (3D forward, FP16 compute with FP8)
./conv3d_fwd_fp16_comp_fp8
```

## Source Code Structure

### Directory Layout
```
client_example/16_convnd_fwd/
├── conv3d_fwd_fp16.cpp # 3D convolution forward (FP16)
├── conv3d_fwd_fp32.cpp # 3D convolution forward (FP32)
├── conv3d_fwd_fp16_comp_fp8.cpp # 3D convolution forward (FP16 compute, FP8)
├── common.hpp # Common utilities for convolution
├── CMakeLists.txt # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input/output tensors, configures convolution parameters, launches the forward kernel, and verifies the result.
- **Forward convolution kernel invocation**:
  Uses the Composable Kernel device API to launch convolution forward for different data types.

---

## Additional Details

- Supports FP16, FP32, and FP8 composite for 3D convolution.
- Parameters can be adjusted in the source files for different workloads. The following parameters are configurable:
  - `NumDimSpatial`: Number of spatial dimensions (default: 3 for 3D convolution)
  - `G`: Number of groups (default: 1)
  - `N`: Batch size (default: 64)
  - `K`: Number of output channels (default: 128)
  - `C`: Number of input channels (default: 64)
  - `Z`, `Y`, `X`: Filter/kernel dimensions (default: 3x3x3)
  - `Di`, `Hi`, `Wi`: Input dimensions - depth, height, width (default: 28x28x3)
  - `Do`, `Ho`, `Wo`: Output dimensions - depth, height, width (default: 28x28x3)

---

## Related Examples

- [09_convnd_fwd](../../example/09_convnd_fwd/README.md): N-dimensional convolution in the main example directory
- [30_grouped_conv_fwd_multiple_d](../../example/30_grouped_conv_fwd_multiple_d/README.md): Grouped convolution forward with multiple D

---

[Back to Client Examples](../README.md)
client_example/17_grouped_gemm_fastgelu/README.md

# Client Example: Grouped GEMM with FastGELU Activation

## Theory

This client example demonstrates **grouped GEMM fused with FastGELU activation**. Grouped GEMM performs multiple independent GEMM operations (with potentially different shapes) in a single kernel launch, and FastGELU is a fast approximation of the GELU activation used in transformers and MLPs.

**Mathematical Formulation:**
For $G$ groups, each with its own $A_g$, $B_g$:
- GEMM: $Y_g = A_g \times B_g$
- FastGELU: $E_g = \text{FastGELU}(Y_g)$

FastGELU is defined as:
$$
\text{FastGELU}(x) = x \cdot \sigma(1.702 \cdot x)
$$
where $\sigma$ is the sigmoid function.

**Algorithmic Background:**
- Each group can have different matrix sizes and strides.
- The kernel launches a grid covering all groups, with each block assigned to a group.
- FastGELU is applied in the epilogue for each group.
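The fused computation can be sketched in plain Python (illustration only, not the CK API; group shapes and values are invented for the example):

```python
import math

def fast_gelu(x):
    """FastGELU(x) = x * sigmoid(1.702 * x), a cheap GELU approximation."""
    return x * (1.0 / (1.0 + math.exp(-1.702 * x)))

def grouped_gemm_fastgelu(groups):
    """Each group carries its own (A, B) pair; FastGELU is applied
    as the epilogue of each group's GEMM, as in the fused kernel."""
    out = []
    for A, B in groups:
        M, K, N = len(A), len(A[0]), len(B[0])
        Y = [[fast_gelu(sum(A[m][k] * B[k][n] for k in range(K)))
              for n in range(N)] for m in range(M)]
        out.append(Y)
    return out

groups = [([[1.0, 2.0]], [[1.0], [1.0]])]  # one group: (1x2) x (2x1)
E = grouped_gemm_fastgelu(groups)
```

For large positive inputs FastGELU approaches the identity, so the pre-activation value 3.0 here maps to just under 3.0.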
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/17_grouped_gemm_fastgelu
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./grouped_gemm_fastgelu
```

## Source Code Structure

### Directory Layout
```
client_example/17_grouped_gemm_fastgelu/
├── grouped_gemm_fastgelu.cpp # Main client example: grouped GEMM + FastGELU
├── CMakeLists.txt # Build configuration for the example
```

### Key Functions

- **main()** (in `grouped_gemm_fastgelu.cpp`):
  Sets up input matrices for each group, configures GEMM and epilogue parameters, launches the grouped kernel, and verifies the result.
- **Grouped GEMM kernel invocation**:
  Uses the Composable Kernel device API to launch grouped GEMM with FastGELU activation.

---

## Additional Details

- Supports multiple groups with different matrix shapes.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [15_grouped_gemm](../../example/15_grouped_gemm/README.md): Grouped GEMM in the main example directory
- [04_gemm_add_add_fastgelu](../../example/04_gemm_add_add_fastgelu/README.md): GEMM with FastGELU fusion

---

[Back to Client Examples](../README.md)
client_example/18_groupnorm/README.md

# Client Example: Group Normalization (Forward and Backward)

## Theory

This client example demonstrates **group normalization** in both forward and backward modes, including fusion with Swish activation. Group normalization normalizes activations across groups of channels, improving training stability for small batch sizes or non-i.i.d. data.

**Mathematical Formulation:**
Given input $X[N, C, ...]$ divided into $G$ groups, for each group $g$:
- Mean: $\mu_g = \frac{1}{|g|} \sum_{i \in g} X_i$
- Variance: $\sigma^2_g = \frac{1}{|g|} \sum_{i \in g} (X_i - \mu_g)^2$
- Normalized: $\hat{X}_i = \frac{X_i - \mu_g}{\sqrt{\sigma^2_g + \epsilon}}$
- Output: $Y_i = \gamma \hat{X}_i + \beta$

$\gamma$ and $\beta$ are learnable scale and shift parameters.

- Swish activation: $\text{Swish}(x) = x \cdot \sigma(x)$, where $\sigma$ is the sigmoid function.

**Algorithmic Background:**
- Forward pass computes mean, variance, normalization, and affine transformation per group.
- Backward pass computes gradients with respect to input, gamma, and beta.
- Swish activation can be fused with normalization for efficiency.
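A plain-Python sketch of the fused forward pass for a single sample's channel vector (illustration only, not the CK API; the channel count, grouping, and values are invented):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def groupnorm_swish(x, num_groups, gamma, beta, eps=1e-5):
    """x is one sample's channel vector [C]; channels are split into
    num_groups contiguous groups, each normalized with its own
    mean/variance, then Swish(y) = y * sigmoid(y) is applied."""
    C = len(x)
    gs = C // num_groups
    out = [0.0] * C
    for g in range(num_groups):
        chunk = x[g * gs:(g + 1) * gs]
        mean = sum(chunk) / gs
        var = sum((v - mean) ** 2 for v in chunk) / gs
        inv_std = 1.0 / math.sqrt(var + eps)
        for i, v in enumerate(chunk):
            c = g * gs + i
            y = gamma[c] * (v - mean) * inv_std + beta[c]
            out[c] = y * sigmoid(y)  # fused Swish epilogue
    return out

out = groupnorm_swish([1.0, 3.0, 2.0, 6.0], num_groups=2,
                      gamma=[1.0] * 4, beta=[0.0] * 4)
```

Note that each group is normalized independently: both groups here reduce to roughly the same normalized pair even though their raw scales differ.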
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/18_groupnorm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (forward with Swish)
./groupnorm_swish_fwd

# Example run (backward, data)
./groupnorm_bwd_data

# Example run (backward, gamma/beta)
./groupnorm_bwd_gamma_beta
```

## Source Code Structure

### Directory Layout
```
client_example/18_groupnorm/
├── groupnorm_swish_fwd.cpp # Groupnorm forward with Swish activation
├── groupnorm_bwd_data.cpp # Groupnorm backward (data)
├── groupnorm_bwd_gamma_beta.cpp # Groupnorm backward (gamma/beta)
├── CMakeLists.txt # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input tensors, configures groupnorm parameters, launches the forward or backward kernel, and verifies the result.
- **GroupNorm kernel invocation**:
  Uses the Composable Kernel device API to launch group normalization for different modes.

---

## Additional Details

- Supports fusion with Swish activation for efficiency.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [42_groupnorm_fwd](../../example/42_groupnorm_fwd/README.md): Group normalization in the main example directory
- [54_groupnorm_bwd](../../example/54_groupnorm_bwd/README.md): Group normalization backward in the main example directory

---

[Back to Client Examples](../README.md)
80
client_example/19_pool/README.md
Normal file
80
client_example/19_pool/README.md
Normal file
@@ -0,0 +1,80 @@
|
||||
# Client Example: Pooling Operations (2D Max, 3D Avg)

## Theory

This client example demonstrates **pooling operations** for 2D max pooling and 3D average pooling, including both forward and backward passes. Pooling is used in convolutional neural networks (CNNs) for spatial downsampling, translation invariance, and reducing computation.

**Mathematical Formulation:**
- **Max Pooling (2D):** $Y_{n,c,h,w} = \max_{i,j} X_{n,c,h \cdot s_H + i, w \cdot s_W + j}$
- **Average Pooling (3D):** $Y_{n,c,d,h,w} = \frac{1}{k_D k_H k_W} \sum_{i,j,k} X_{n,c,d \cdot s_D + i, h \cdot s_H + j, w \cdot s_W + k}$

where $s_H, s_W, s_D$ are strides and $k_H, k_W, k_D$ are kernel sizes.

**Algorithmic Background:**
- Forward pass computes the pooled output.
- Backward pass computes the gradient with respect to the input.
- Handles padding and boundary conditions.
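As a reference for what the device kernels compute, here is a minimal pure-Python sketch of the 2D max pooling forward pass over a single `(n, c)` plane, following the formula above (illustrative only; the actual example uses the CK device API on the GPU, and function names here are not CK names):

```python
def max_pool2d_fwd(x, kH, kW, sH, sW):
    """Naive 2D max pooling over a plane x[h][w] (one n, c slice), no padding."""
    H, W = len(x), len(x[0])
    Ho = (H - kH) // sH + 1          # output height
    Wo = (W - kW) // sW + 1          # output width
    return [[max(x[h * sH + i][w * sW + j]
                 for i in range(kH) for j in range(kW))
             for w in range(Wo)]
            for h in range(Ho)]

# 4x4 input, 2x2 window, stride 2
x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
print(max_pool2d_fwd(x, 2, 2, 2, 2))  # [[6, 8], [14, 16]]
```

The backward pass routes each output gradient back to the argmax position of its window; the average-pooling variants replace `max` with a mean over the window.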

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/19_pool
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (2D max pool forward)
./max_pool2d_fwd

# Example run (2D max pool backward)
./max_pool2d_bwd

# Example run (3D avg pool forward)
./avg_pool3d_fwd

# Example run (3D avg pool backward)
./avg_pool3d_bwd
```

## Source Code Structure

### Directory Layout
```
client_example/19_pool/
├── max_pool2d_fwd.cpp # 2D max pooling forward
├── max_pool2d_bwd.cpp # 2D max pooling backward
├── avg_pool3d_fwd.cpp # 3D average pooling forward
├── avg_pool3d_bwd.cpp # 3D average pooling backward
├── CMakeLists.txt     # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input tensors, configures pooling parameters, launches the forward or backward kernel, and verifies the result.
- **Pooling kernel invocation**:
  Uses the Composable Kernel device API to launch pooling operations for different modes.

---

## Additional Details

- Supports both max and average pooling, forward and backward.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [13_pool2d_fwd](../../example/13_pool2d_fwd/README.md): 2D pooling in the main example directory
- [48_pool3d_fwd](../../example/48_pool3d_fwd/README.md): 3D pooling in the main example directory
- [49_maxpool2d_bwd](../../example/49_maxpool2d_bwd/README.md): 2D max pool backward in the main example directory
- [51_avgpool3d_bwd](../../example/51_avgpool3d_bwd/README.md): 3D avg pool backward in the main example directory

---

[Back to Client Examples](../README.md)
# Client Example: Split-K GEMM

## Theory

This client example demonstrates **Split-K GEMM**, a technique for parallelizing matrix multiplication along the K dimension. Split-K is used to improve parallelism and memory bandwidth utilization for large GEMM operations, especially when K is large.

**Mathematical Formulation:**
- Standard GEMM: $C = A \times B$
- Split-K: Partition the K dimension into $K_s$ contiguous chunks $\mathcal{K}_1, \dots, \mathcal{K}_{K_s}$, compute partial results, then reduce:

$$
C = \sum_{s=1}^{K_s} A_{[:, \mathcal{K}_s]} \times B_{[\mathcal{K}_s, :]}
$$

**Algorithmic Background:**
- Each split computes a partial GEMM over a chunk of K.
- Partial results are reduced (summed) to produce the final output.
- Useful for large K, limited workspace, or maximizing GPU occupancy.
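The partition-and-reduce scheme above can be sketched in plain Python (illustrative only; in the real kernel each split runs in parallel on the GPU and the reduction happens in a workspace, not in a host loop):

```python
def matmul(A, B):
    """Plain row-major reference matmul."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def splitk_gemm(A, B, num_splits):
    """Split the K dimension into chunks, compute partial GEMMs, then reduce."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    bounds = [K * s // num_splits for s in range(num_splits + 1)]
    for s in range(num_splits):
        lo, hi = bounds[s], bounds[s + 1]
        A_s = [row[lo:hi] for row in A]   # A[:, K_s]
        B_s = B[lo:hi]                    # B[K_s, :]
        P = matmul(A_s, B_s)              # partial result of this split
        for i in range(M):
            for j in range(N):
                C[i][j] += P[i][j]        # reduction across splits
    return C

A = [[1, 2, 3, 4]]
B = [[1], [1], [1], [1]]
print(splitk_gemm(A, B, 2))  # [[10.0]] — same as the unsplit GEMM
```

Any split count yields the same result as the unsplit GEMM, up to floating-point reduction order.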

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/20_splitk_gemm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (FP16 compute, FP8 output)
./splitK_gemm_fp16_f8
```

## Source Code Structure

### Directory Layout
```
client_example/20_splitk_gemm/
├── splitK_gemm_fp16_f8.cpp # Main client example: Split-K GEMM (FP16 compute, FP8 output)
├── CMakeLists.txt          # Build configuration for the example
```

### Key Functions

- **main()** (in `splitK_gemm_fp16_f8.cpp`):
  Sets up input matrices, configures Split-K parameters, launches the Split-K GEMM kernel, and verifies the result.
- **Split-K kernel invocation**:
  Uses the Composable Kernel device API to launch the Split-K GEMM operation.

---

## Additional Details

- Supports FP16 compute with FP8 output for memory efficiency.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [35_splitK_gemm](../../example/35_splitK_gemm/README.md): Split-K GEMM in the main example directory

---

[Back to Client Examples](../README.md)
# Client Example: Grouped GEMM with Bias

## Theory

This client example demonstrates **grouped GEMM fused with bias addition**. Grouped GEMM performs multiple independent GEMM operations (with potentially different shapes) in a single kernel launch, and bias addition is a standard pattern in neural network layers.

**Mathematical Formulation:**
For $G$ groups, each with its own $A_g$, $B_g$, $b_g$:
- GEMM: $Y_g = A_g \times B_g$
- Bias: $E_g = Y_g + b_g$

**Algorithmic Background:**
- Each group can have different matrix sizes and strides.
- The kernel launches a grid covering all groups, with each block assigned to a group.
- Bias is added in the epilogue for each group.
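The per-group computation can be sketched as follows (a host-side reference only; the CK kernel processes all groups in one launch, whereas this sketch loops over them):

```python
def matmul(A, B):
    """Plain row-major reference matmul."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def grouped_gemm_bias(groups):
    """Each group g carries its own (A_g, B_g, bias_g); the per-column bias
    is added in the epilogue, mirroring E_g = A_g x B_g + b_g."""
    results = []
    for A, B, bias in groups:
        Y = matmul(A, B)
        results.append([[y + bias[j] for j, y in enumerate(row)] for row in Y])
    return results

# Two groups with different shapes
g0 = ([[1, 2]], [[1, 0], [0, 1]], [10, 20])  # 1x2 @ 2x2 + bias
g1 = ([[1], [2], [3]], [[4]], [1])           # 3x1 @ 1x1 + bias
print(grouped_gemm_bias([g0, g1]))  # [[[11, 22]], [[5], [9], [13]]]
```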

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/21_grouped_gemm_bias
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (grouped GEMM with bias, FP16)
./grouped_gemm_fixed_nk_bias_fp16
```

## Source Code Structure

### Directory Layout
```
client_example/21_grouped_gemm_bias/
├── grouped_gemm_fixed_nk_bias_fp16.cpp # Main client example: grouped GEMM + bias (FP16)
├── CMakeLists.txt                      # Build configuration for the example
```

### Key Functions

- **main()** (in `grouped_gemm_fixed_nk_bias_fp16.cpp`):
  Sets up input matrices for each group, configures GEMM and bias parameters, launches the grouped kernel, and verifies the result.
- **Grouped GEMM kernel invocation**:
  Uses the Composable Kernel device API to launch grouped GEMM with bias addition.

---

## Additional Details

- Supports multiple groups with different matrix shapes.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [15_grouped_gemm](../../example/15_grouped_gemm/README.md): Grouped GEMM in the main example directory
- [11_convnd_fwd_bias](../../example/11_convnd_fwd_bias/README.md): Convolution with bias fusion

---

[Back to Client Examples](../README.md)
# Client Example: Grouped GEMM (Multiple Data Types)

## Theory

This client example demonstrates **grouped GEMM** for multiple data types (FP16, BF16, FP8, INT8). Grouped GEMM performs multiple independent GEMM operations (with potentially different shapes) in a single kernel launch, which is useful for transformer models, mixture-of-experts, and variable-length sequence processing.

**Mathematical Formulation:**
For $G$ groups, each with its own $A_g$, $B_g$:
- GEMM: $Y_g = A_g \times B_g$

**Algorithmic Background:**
- Each group can have different matrix sizes and strides.
- The kernel launches a grid covering all groups, with each block assigned to a group.
- Supports multiple data types for flexibility and performance tuning.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/22_grouped_gemm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (FP16)
./grouped_gemm_fixed_nk_fp16

# Example run (BF16)
./grouped_gemm_fixed_nk_bf16

# Example run (FP8)
./grouped_gemm_fixed_nk_fp8

# Example run (INT8)
./grouped_gemm_fixed_nk_i8
```

## Source Code Structure

### Directory Layout
```
client_example/22_grouped_gemm/
├── grouped_gemm_fixed_nk_fp16.cpp # Grouped GEMM (FP16)
├── grouped_gemm_fixed_nk_bf16.cpp # Grouped GEMM (BF16)
├── grouped_gemm_fixed_nk_fp8.cpp  # Grouped GEMM (FP8)
├── grouped_gemm_fixed_nk_i8.cpp   # Grouped GEMM (INT8)
├── CMakeLists.txt                 # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input matrices for each group, configures GEMM parameters, launches the grouped kernel, and verifies the result.
- **Grouped GEMM kernel invocation**:
  Uses the Composable Kernel device API to launch grouped GEMM for different data types.

---

## Additional Details

- Supports multiple groups with different matrix shapes and data types.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [15_grouped_gemm](../../example/15_grouped_gemm/README.md): Grouped GEMM in the main example directory
- [17_grouped_gemm_fastgelu](../17_grouped_gemm_fastgelu/README.md): Grouped GEMM with FastGELU activation

---

[Back to Client Examples](../README.md)
# Client Example: Elementwise Operation with 3D Transpose

## Theory

This client example demonstrates **elementwise operations fused with 3D tensor transpose**. This pattern is used in deep learning for applying activation functions or scaling while simultaneously reordering tensor dimensions (e.g., for layout conversion or attention head reshaping).

**Mathematical Formulation:**
- Elementwise: $Z = f(X)$ or $Z = f(X, Y)$
- Transpose: $E_{i_0, i_1, i_2} = Z_{i_{\pi(0)}, i_{\pi(1)}, i_{\pi(2)}}$
- $\pi$ is a permutation of the axes and $E$ is the output tensor.

**Algorithmic Background:**
- The elementwise operation and transpose are fused in a single kernel.
- Intermediate results are kept in registers, not written to global memory.
- Used for layout conversion with activation, attention head reshaping, and more.
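The fused pattern can be sketched in plain Python: each element is transformed and written directly to its permuted destination, so the intermediate tensor $Z$ is never materialized (a host-side illustration only, not the CK API):

```python
def elementwise_transpose_3d(x, f, perm):
    """out[permuted index] = f(x[index]); the elementwise result is written
    straight to the transposed position without an intermediate tensor."""
    dims = (len(x), len(x[0]), len(x[0][0]))
    out_dims = tuple(dims[p] for p in perm)
    out = [[[None] * out_dims[2] for _ in range(out_dims[1])]
           for _ in range(out_dims[0])]
    for i in range(dims[0]):
        for j in range(dims[1]):
            for k in range(dims[2]):
                src = (i, j, k)
                dst = tuple(src[p] for p in perm)     # apply the permutation pi
                out[dst[0]][dst[1]][dst[2]] = f(x[i][j][k])
    return out

x = [[[1, 2, 3], [4, 5, 6]]]                 # shape (1, 2, 3)
y = elementwise_transpose_3d(x, lambda v: 2 * v, (2, 0, 1))
print(y)  # [[[2, 8]], [[4, 10]], [[6, 12]]] — shape (3, 1, 2)
```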

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/23_elementwise_transpose
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (elementwise + 3D transpose)
./elementwise_transpose_3d
```

## Source Code Structure

### Directory Layout
```
client_example/23_elementwise_transpose/
├── elementwise_transpose_3d.cpp # Main client example: elementwise + 3D transpose
├── CMakeLists.txt               # Build configuration for the example
```

### Key Functions

- **main()** (in `elementwise_transpose_3d.cpp`):
  Sets up input tensors, configures elementwise and transpose parameters, launches the fused kernel, and verifies the result.
- **Fused kernel invocation**:
  Uses the Composable Kernel device API to launch the elementwise+transpose operation.

---

## Additional Details

- Supports fusion of elementwise operations with 3D transpose.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [44_elementwise_permute](../../example/44_elementwise_permute/README.md): Elementwise operation with permutation in the main example directory

---

[Back to Client Examples](../README.md)
# Client Example: Grouped Convolution with Activation and Fusion

## Theory

This client example demonstrates **grouped convolution fused with various activation and elementwise operations**. Grouped convolution splits the input and weights into groups and applies convolution independently to each group, while fusion with activation and scaling improves efficiency.

**Mathematical Formulation:**
For each group $g$:
- Convolution: $Y^g = \text{Conv}(X^g, W^g)$
- Fused operations: $E^g = f(Y^g, D_0^g, D_1^g, \dots)$
- $f$ can be bilinear, scale, add, relu, etc.

**Algorithmic Background:**
- Grouped convolution is used in efficient CNNs, depthwise separable convolutions, and expert models.
- Fused epilogue operations (scale, add, relu, reduce) are performed in registers before writing to memory.
- Supports 1D, 2D, and 3D grouped convolutions and a variety of fusion patterns.
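A minimal 1D sketch of grouped convolution with a fused epilogue, in pure Python (reference semantics only, under the assumption of stride 1 and no padding; the CK kernels implement this pattern for 1D/2D/3D on the GPU):

```python
def grouped_conv1d_fwd(x, w, groups, epilogue=lambda v: v):
    """x: [C_in][W] input, w: [C_out][C_in_per_group][Kw] filters.
    Channels are split into `groups`; each output channel only sees the
    input channels of its own group. `epilogue` is the fused op (e.g. relu)."""
    C_in, W = len(x), len(x[0])
    C_out, Cg, Kw = len(w), len(w[0]), len(w[0][0])
    assert C_in % groups == 0 and C_out % groups == 0
    out_per_g = C_out // groups
    Wo = W - Kw + 1
    y = []
    for co in range(C_out):
        g = co // out_per_g
        base = g * Cg                  # first input channel of this group
        row = []
        for p in range(Wo):
            acc = sum(x[base + c][p + k] * w[co][c][k]
                      for c in range(Cg) for k in range(Kw))
            row.append(epilogue(acc))  # fused activation in the epilogue
        y.append(row)
    return y

relu = lambda v: max(v, 0)
x = [[1, 2, 3], [4, 5, 6]]                 # 2 input channels
w = [[[1, 1]], [[1, -10]]]                 # 2 output channels, 1 channel/group
print(grouped_conv1d_fwd(x, w, groups=2, epilogue=relu))  # [[3, 5], [0, 0]]
```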

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/24_grouped_conv_activation
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (grouped conv + scale)
./grouped_convnd_fwd_scale/grouped_convnd_fwd_scale

# Example run (grouped conv + bilinear)
./grouped_convnd_fwd_bilinear/grouped_convnd_fwd_bilinear

# Example run (grouped conv + scale + relu)
./grouped_convnd_fwd_convscale_relu/grouped_convnd_fwd_convscale_relu

# Example run (grouped conv + scale + add + relu)
./grouped_convnd_fwd_scaleadd_scaleadd_relu/grouped_convnd_fwd_scaleadd_scaleadd_relu
```

## Source Code Structure

### Directory Layout
```
client_example/24_grouped_conv_activation/
├── grouped_convnd_fwd_scale/                  # Grouped conv + scale
├── grouped_convnd_fwd_bilinear/               # Grouped conv + bilinear
├── grouped_convnd_fwd_convscale/              # Grouped conv + scale (convscale)
├── grouped_convnd_fwd_convscale_add/          # Grouped conv + scale + add
├── grouped_convnd_fwd_convscale_reduce/       # Grouped conv + scale + reduce
├── grouped_convnd_fwd_convscale_relu/         # Grouped conv + scale + relu
├── grouped_convnd_fwd_convinvscale/           # Grouped conv + inverse scale
├── grouped_convnd_fwd_scaleadd_ab/            # Grouped conv + scale + add (A/B)
├── grouped_convnd_fwd_scaleadd_scaleadd_relu/ # Grouped conv + scale + add + relu
├── grouped_convnd_bwd_data_bilinear/          # Grouped conv bwd data + bilinear
├── grouped_convnd_bwd_data_scale/             # Grouped conv bwd data + scale
├── grouped_convnd_bwd_weight_bilinear/        # Grouped conv bwd weight + bilinear
├── grouped_convnd_bwd_weight_scale/           # Grouped conv bwd weight + scale
├── CMakeLists.txt                             # Build configuration for the example
```

### Key Functions

- **main()** (in each subdirectory's `.cpp`):
  Sets up input tensors, configures grouped convolution and fusion parameters, launches the kernel, and verifies the result.
- **Grouped convolution kernel invocation**:
  Uses the Composable Kernel device API to launch grouped convolution with various fused epilogue operations.

---

## Additional Details

- Supports a wide range of fusion patterns (bilinear, scale, add, relu, reduce, etc.).
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [10_grouped_convnd_bwd_data](../10_grouped_convnd_bwd_data/README.md): Grouped convolution backward data
- [11_grouped_conv_bwd_weight](../11_grouped_conv_bwd_weight/README.md): Grouped convolution backward weight
- [30_grouped_conv_fwd_multiple_d](../../example/30_grouped_conv_fwd_multiple_d/README.md): Grouped convolution forward with multiple D

---

[Back to Client Examples](../README.md)
# Composable Kernel Wrapper GEMM Tutorial

This tutorial demonstrates how to implement matrix multiplication (GEMM) using the Composable Kernel wrapper. The three examples show both basic and optimized GEMM implementations, as well as how to use the wrapper for tensor transformations such as im2col.

---

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/25_wrapper
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (basic GEMM)
./wrapper_basic_gemm

# Example run (optimized GEMM)
./wrapper_optimized_gemm

# Example run (im2col transformation)
./wrapper_img2col

# Example run (tensor transform using wrapper)
./tensor_transform_using_wrapper
```

---

## Source Code Structure

### Directory Layout
```
client_example/25_wrapper/
├── wrapper_basic_gemm.cpp             # Basic GEMM using CK wrapper
├── wrapper_optimized_gemm.cpp         # Optimized GEMM using CK wrapper
├── wrapper_img2col.cpp                # im2col transformation using CK wrapper
├── tensor_transform_using_wrapper.cpp # General tensor transform example
├── CMakeLists.txt                     # Build configuration for the example
├── README.md                          # This tutorial and reference
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input tensors, configures wrapper parameters, launches the kernel, and verifies the result.
- **CK wrapper API usage**:
  Demonstrates how to create layouts, tensors, and launch GEMM or tensor transforms using the wrapper.

---

## Additional Details

### Overview

The CK wrapper provides a flexible interface for launching GEMM kernels and tensor operations. This tutorial presents:
- A base GEMM implementation (minimal optimizations)
- An optimized GEMM based on the [`gridwise_gemm_xdlops_v2r3`](https://github.com/ROCm/composable_kernel/blob/develop/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdlops_v2r3.hpp) implementation
- Examples of tensor transformations (e.g., im2col)

The kernel definition should look similar to:

```cpp
template <typename DataType,
          typename GemmTraits,
          ck::index_t scalar_per_vector,
          /* ... */
```

The end result from `c_vgpr_reg` is stored in the `C` local partition:

```cpp
ck::wrapper::copy(c_vgpr_reg, c_global_local_partition);
```

If you want to dive deep into the details, you can find the entire basic example
[here](https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_basic_gemm.cpp).

---

## Related Examples

- [01_gemm](../01_gemm/README.md): Basic GEMM client example
- [27_im2col_col2im](../27_im2col_col2im/README.md): im2col/col2im transformations
- [25_gemm_bias_e_permute](../../example/25_gemm_bias_e_permute/README.md): GEMM with bias and permutation in the main example directory

---

[Back to Client Examples](../README.md)
# Client Example: Parallel Reduction (NHWC)

## Theory

This client example demonstrates **parallel reduction operations** over NHWC tensors. Reduction is a fundamental operation in deep learning for computing statistics (such as batch mean/variance), loss aggregation, and normalization.

**Mathematical Formulation:**
Given a tensor $X[N, H, W, C]$ and a reduction axis (e.g., channel $C$):
- **Sum**: $Y_{n,h,w} = \sum_c X_{n,h,w,c}$
- **Max**: $Y_{n,h,w} = \max_c X_{n,h,w,c}$
- **Mean**: $Y_{n,h,w} = \frac{1}{C} \sum_c X_{n,h,w,c}$

**Algorithmic Background:**
- Reductions are implemented using parallel tree or segmented reduction algorithms.
- Efficient reductions require careful memory access, synchronization, and sometimes numerically stable algorithms.
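The reductions above can be written as a small pure-Python reference (sequential, for semantics only; the GPU kernel performs the channel reduction in parallel per `(n, h, w)` position):

```python
def reduce_nhwc_c(x, op="sum"):
    """Reduce x[n][h][w][c] over the channel axis with sum, max, or mean."""
    def red(cs):
        if op == "sum":
            return sum(cs)
        if op == "max":
            return max(cs)
        if op == "mean":
            return sum(cs) / len(cs)
        raise ValueError(op)
    return [[[red(x[n][h][w])
              for w in range(len(x[n][h]))]
             for h in range(len(x[n]))]
            for n in range(len(x))]

# N=1, H=1, W=2, C=3
x = [[[[1, 2, 3], [4, 5, 6]]]]
print(reduce_nhwc_c(x, "sum"))   # [[[6, 15]]]
print(reduce_nhwc_c(x, "max"))   # [[[3, 6]]]
print(reduce_nhwc_c(x, "mean"))  # [[[2.0, 5.0]]]
```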

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/26_reduce
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (reduce over channel dimension)
./reduce_nhwc_c
```

## Source Code Structure

### Directory Layout
```
client_example/26_reduce/
├── reduce_nhwc_c.cpp # Main client example: reduction over NHWC tensors (channel axis)
├── CMakeLists.txt    # Build configuration for the example
```

### Key Functions

- **main()** (in `reduce_nhwc_c.cpp`):
  Sets up input tensors, configures reduction parameters, launches the reduction kernel, and verifies the result.
- **Reduction kernel invocation**:
  Uses the Composable Kernel device API to launch the reduction operation.

---

## Additional Details

- Supports sum, max, mean, and other reductions over NHWC tensors.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [12_reduce](../../example/12_reduce/README.md): Parallel reduction in the main example directory

---

[Back to Client Examples](../README.md)
# Client Example: im2col and col2im Transformations

## Theory

This client example demonstrates **im2col (image-to-column) and col2im (column-to-image) transformations**. These operations are used to convert image data into a matrix form suitable for GEMM-based convolution and to reconstruct images from column representations.

**Mathematical Formulation:**
- **im2col**: Rearranges image blocks into columns, mapping a 3D/4D tensor to a 2D matrix.
- **col2im**: Reverses the process, mapping a 2D matrix back to an image tensor.

**Algorithmic Background:**
- im2col is used to lower convolution to matrix multiplication (GEMM).
- col2im is used to reconstruct the original image or feature map from the column representation.
- These transformations are essential for efficient convolution implementations on GPUs.
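A minimal im2col sketch for a single-channel 2D image with stride 1 (illustrative layout only; the CK kernels support full NCHW/NHWC tensors, strides, and padding):

```python
def im2col(x, kH, kW):
    """Flatten every kH x kW patch of a 2D image (stride 1, no padding).
    The result has one row per filter tap and one column per output pixel,
    so a convolution becomes: flattened_filters @ im2col(x, kH, kW)."""
    H, W = len(x), len(x[0])
    Ho, Wo = H - kH + 1, W - kW + 1
    rows = []
    for i in range(kH):
        for j in range(kW):
            rows.append([x[h + i][w + j]
                         for h in range(Ho) for w in range(Wo)])
    return rows  # shape: (kH*kW) x (Ho*Wo)

x = [[1, 2, 3],
     [4, 5, 6]]
print(im2col(x, 2, 2))  # [[1, 2], [2, 3], [4, 5], [5, 6]]
```

col2im performs the inverse scatter, accumulating each matrix entry back into the image position it came from (overlapping patches sum their contributions).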

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/27_im2col_col2im
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (image to column)
./image_to_column

# Example run (column to image)
./column_to_image
```

## Source Code Structure

### Directory Layout
```
client_example/27_im2col_col2im/
├── image_to_column.cpp # im2col: image to column transformation
├── column_to_image.cpp # col2im: column to image transformation
├── CMakeLists.txt      # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input tensors, configures transformation parameters, launches the im2col or col2im kernel, and verifies the result.
- **im2col/col2im kernel invocation**:
  Uses the Composable Kernel device API to launch the transformation.

---

## Additional Details

- Supports various image and patch sizes.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [52_im2col_col2im](../../example/52_im2col_col2im/README.md): im2col/col2im in the main example directory
- [09_convnd_fwd](../../example/09_convnd_fwd/README.md): N-dimensional convolution using im2col

---

[Back to Client Examples](../README.md)
# Client Example: GEMM Pipeline for Microscaling (MX)

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

First, build and install CK with FP8 data types enabled:
```bash
cd composable_kernel/build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -D DTYPES="fp8" ..
make -j
make install
```

### Build and run
```bash
/opt/rocm/bin/hipcc gemm_mx_fp8.cpp -o gemm_mx_fp8

# Example run
./gemm_mx_fp8
```

## Source Code Structure

### Directory Layout
```
client_example/28_gemm_mx/
├── gemm_mx_fp8.cpp # GEMM MX (fp8)
├── CMakeLists.txt  # Build configuration for the example
```

---

[Back to Client Examples](../README.md)
# Client Example: GEMM with Add and Multiply Fusion

## Theory

This client example demonstrates **GEMM fused with addition and multiplication operations**. This pattern is used in neural networks for bias addition, scaling, gating, and other elementwise transformations after a linear layer.

**Mathematical Formulation:**
- GEMM: $Y = A \times B$
- Add: $Z = Y + D_0$
- Multiply: $E = Z \odot D_1$
- $D_0$, $D_1$: auxiliary tensors (e.g., bias, scale, gate)

**Algorithmic Background:**
- The GEMM result is kept in registers; addition and multiplication are fused in the epilogue.
- No intermediate results are written to global memory.
- Used for bias+scale, gating, and other fused epilogue patterns.
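The fused computation $E = (A \times B + D_0) \odot D_1$ can be sketched in plain Python; note the epilogue is applied to the accumulator before anything is stored, mirroring the register-level fusion (reference semantics only, not the CK API):

```python
def gemm_add_multiply(A, B, D0, D1):
    """E = (A @ B + D0) * D1, with add and multiply fused into the epilogue
    so the intermediate Y = A @ B is never materialized."""
    E = []
    for i, row in enumerate(A):
        out_row = []
        for j, col in enumerate(zip(*B)):
            y = sum(a * b for a, b in zip(row, col))   # GEMM accumulator
            out_row.append((y + D0[i][j]) * D1[i][j])  # fused epilogue
        E.append(out_row)
    return E

A = [[1, 2]]
B = [[3], [4]]
D0 = [[1]]   # e.g. bias
D1 = [[2]]   # e.g. scale/gate
print(gemm_add_multiply(A, B, D0, D1))  # [[24]]
```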

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/client_example/29_gemm_add_multiply
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_add_multiply
```

## Source Code Structure

### Directory Layout
```
client_example/29_gemm_add_multiply/
├── gemm_add_multiply.cpp # Main client example: GEMM+Add+Multiply
├── CMakeLists.txt        # Build configuration for the example
```

### Key Functions

- **main()** (in `gemm_add_multiply.cpp`):
  Sets up input matrices, configures GEMM and epilogue parameters, launches the fused kernel, and verifies the result.
- **Fused kernel invocation**:
  Uses the Composable Kernel device API to launch the GEMM with fused addition and multiplication.

---

## Additional Details

- Supports fusion of multiple elementwise operations with GEMM.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [02_gemm_bilinear](../../example/02_gemm_bilinear/README.md): Multi-tensor bilinear operations
- [46_gemm_add_multiply](../../example/46_gemm_add_multiply/README.md): GEMM with add and multiply in the main example directory

---

[Back to Client Examples](../README.md)
# Client Example: GEMM with bf16A/int8B and Fused Epilogues

## Theory

This client example demonstrates **GEMM with mixed-precision input types (bf16 for A, int8 for B)** and various fused epilogue operations (bias, GELU, FastGELU, multiply). Mixed-precision GEMM is used for efficient inference and training in deep learning, especially for transformer and MLP layers.

**Mathematical Formulation:**
- GEMM: $Y = A \times B$
  - $A$: bf16 (brain floating point)
  - $B$: int8 (8-bit integer)
- Fused epilogues:
  - Bias: $Z = Y + \text{bias}$
  - GELU: $E = \text{GELU}(Z)$
  - FastGELU: $E = \text{FastGELU}(Z)$
  - Multiply: $E = Z \odot D_1$

**Algorithmic Background:**
- Mixed-precision computation reduces memory and compute requirements.
- Fused epilogues improve efficiency by combining bias, activation, and scaling in a single kernel.
|
||||
|
||||
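The numerics of this pattern can be sketched in plain Python: B is held as int8 with a dequantization scale applied on the fly, and the bias is fused into the same pass. The per-tensor scale and the helper name below are illustrative assumptions for this sketch, not the CK API.

```python
def gemm_bf16a_int8b(A, B_int8, b_scale, bias):
    """Emulates bf16A x int8B: B is dequantized on the fly, bias fused in."""
    M, K, N = len(A), len(A[0]), len(B_int8[0])
    out = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            # dequantize each int8 element of B with the per-tensor scale
            acc = sum(A[m][k] * (B_int8[k][n] * b_scale) for k in range(K))
            out[m][n] = acc + bias[n]  # fused bias epilogue
    return out

print(gemm_bf16a_int8b([[1.0, 2.0]], [[2], [4]], 0.5, [1.0]))  # [[6.0]]
```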
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

```bash
cd composable_kernel/build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -D DTYPES="bf16;int8" ..
make -j
make install
```

### Build and run
```bash
cd composable_kernel/client_example/30_gemm_bf16Aint8B
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (basic GEMM)
./gemm_xdl_bf16_i8

# Example run (GEMM + bias)
./gemm_bias_xdl_bf16_i8

# Example run (GEMM + bias + GELU)
./gemm_xdl_gelu_bf16_i8

# Example run (GEMM + bias + FastGELU)
./gemm_bias_fastgelu_xdl_bf16_i8

# Example run (GEMM + multiply)
./gemm_xdl_multiply_bf16_i8
```

## Source Code Structure

### Directory Layout
```
client_example/30_gemm_bf16Aint8B/
├── gemm_xdl_bf16_i8.cpp                # GEMM (bf16A, int8B)
├── gemm_bias_xdl_bf16_i8.cpp           # GEMM + bias
├── gemm_xdl_gelu_bf16_i8.cpp           # GEMM + bias + GELU
├── gemm_bias_fastgelu_xdl_bf16_i8.cpp  # GEMM + bias + FastGELU
├── gemm_xdl_multiply_bf16_i8.cpp       # GEMM + multiply
├── CMakeLists.txt                      # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input matrices, configures GEMM and epilogue parameters, launches the kernel, and verifies the result.
- **Fused kernel invocation**:
  Uses the Composable Kernel device API to launch GEMM with various fused epilogues.

---

## Additional Details

- Supports bf16 and int8 input types for efficient mixed-precision computation.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [14_gemm_quantization](../../example/14_gemm_quantization/README.md): GEMM quantization in the main example directory
- [46_gemm_add_multiply](../../example/46_gemm_add_multiply/README.md): GEMM with add and multiply in the main example directory

---
[Back to Client Examples](../README.md)

---

*New file: `client_example/31_grouped_gemm_bf16Aint8B/README.md`*
# Client Example: Grouped GEMM with bf16A/int8B and Fused Epilogues

## Theory

This client example demonstrates **grouped GEMM with mixed-precision input types (bf16 for A, int8 for B)** and various fused epilogue operations (bias, FastGELU, multiply). Grouped GEMM performs multiple independent GEMM operations (with potentially different shapes) in a single kernel launch, and mixed precision is used for efficient inference and training.

**Mathematical Formulation:**
For $G$ groups, each with its own $A_g$, $B_g$:
- GEMM: $Y_g = A_g \times B_g$
  - $A_g$: bf16 (brain floating point)
  - $B_g$: int8 (8-bit integer)
- Fused epilogues:
  - Bias: $Z_g = Y_g + \text{bias}_g$
  - FastGELU: $E_g = \text{FastGELU}(Z_g)$
  - Multiply: $E_g = Z_g \odot D_{1,g}$

**Algorithmic Background:**
- Each group can have different matrix sizes and strides.
- Mixed-precision computation reduces memory and compute requirements.
- Fused epilogues improve efficiency by combining bias, activation, and scaling in a single kernel.
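The grouped pattern can be sketched in plain Python: one "launch" walks every group, each with its own shapes. This is illustrative only; on the GPU a single kernel covers all groups.

```python
def grouped_gemm(groups):
    """One 'launch' over all groups; each group has its own M, N, K."""
    results = []
    for A, B in groups:
        K, N = len(A[0]), len(B[0])
        C = [[sum(a_row[k] * B[k][n] for k in range(K)) for n in range(N)]
             for a_row in A]
        results.append(C)
    return results

groups = [([[1.0, 2.0]], [[3.0], [4.0]]),   # group 0: 1x2 @ 2x1
          ([[1.0]], [[2.0, 5.0]])]          # group 1: 1x1 @ 1x2
print(grouped_gemm(groups))  # [[[11.0]], [[2.0, 5.0]]]
```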
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

```bash
cd composable_kernel/build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc -D DTYPES="bf16;int8" ..
make -j
make install
```

### Build and run
```bash
cd composable_kernel/client_example/31_grouped_gemm_bf16Aint8B
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (basic grouped GEMM)
./grouped_gemm_xdl_bf16_i8

# Example run (grouped GEMM + bias + FastGELU)
./grouped_gemm_bias_fastgelu_xdl_bf16_i8

# Example run (grouped GEMM + FastGELU)
./grouped_gemm_fastgelu_xdl_bf16_i8

# Example run (grouped GEMM + multiply)
./grouped_gemm_multiply_xdl_bf16_i8

# Example run (grouped GEMM + multiply + bias + FastGELU)
./grouped_gemm_multiply_bias_fastgelu_xdl_bf16_i8
```

## Source Code Structure

### Directory Layout
```
client_example/31_grouped_gemm_bf16Aint8B/
├── grouped_gemm_xdl_bf16_i8.cpp                        # Grouped GEMM (bf16A, int8B)
├── grouped_gemm_bias_fastgelu_xdl_bf16_i8.cpp          # Grouped GEMM + bias + FastGELU
├── grouped_gemm_fastgelu_xdl_bf16_i8.cpp               # Grouped GEMM + FastGELU
├── grouped_gemm_multiply_xdl_bf16_i8.cpp               # Grouped GEMM + multiply
├── grouped_gemm_multiply_bias_fastgelu_xdl_bf16_i8.cpp # Grouped GEMM + multiply + bias + FastGELU
├── CMakeLists.txt                                      # Build configuration for the example
```

### Key Functions

- **main()** (in each `.cpp`):
  Sets up input matrices for each group, configures GEMM and epilogue parameters, launches the grouped kernel, and verifies the result.
- **Grouped GEMM kernel invocation**:
  Uses the Composable Kernel device API to launch grouped GEMM with various fused epilogues.

---

## Additional Details

- Supports multiple groups with different matrix shapes and bf16/int8 input types.
- Example parameters can be adjusted in the source for different workloads.

---

## Related Examples

- [30_gemm_bf16Aint8B](../30_gemm_bf16Aint8B/README.md): GEMM with bf16A/int8B and fused epilogues
- [15_grouped_gemm](../../example/15_grouped_gemm/README.md): Grouped GEMM in the main example directory

---
[Back to Client Examples](../README.md)

---

[Back to supported operations](../../../include/ck/README.md)
# Composable Kernel GEMM Example

## Introduction

GEMM (General Matrix Multiplication) is a fundamental operation in linear algebra and deep learning. It computes the product of two matrices, optionally adds a bias or residual, and is the core of many neural network layers (MLPs, attention, convolutions via im2col). This example demonstrates the flexible and high-performance GEMM API provided by Composable Kernel.

---

## Theory

**Mathematical Formulation:**
$$
C = \alpha (A \times B) + \beta D
$$
- $A$: [M, K] input matrix
- $B$: [K, N] weight matrix
- $D$: [M, N] optional bias/residual
- $C$: [M, N] output
- $\alpha, \beta$: scalars (often 1.0, 0.0)

GEMM is implemented using a tiled/blocking strategy to maximize data reuse and memory bandwidth. Modern GPU implementations use matrix core/XDL/MFMA instructions for high throughput. The operation is the computational backbone for transformer attention, MLPs, CNNs (via lowering), and more.
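To make the formulation concrete, here is a minimal pure-Python CPU reference. It is illustrative only (CK's own CPU reference lives in `reference_gemm.hpp`); the function name and tiny shapes below exist just for this sketch.

```python
def reference_gemm(A, B, D, alpha=1.0, beta=0.0):
    """CPU reference for C = alpha * (A x B) + beta * D (row-major lists)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):          # inner product over the K dimension
                acc += A[m][k] * B[k][n]
            C[m][n] = alpha * acc + beta * D[m][n]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
D = [[1.0, 1.0], [1.0, 1.0]]
print(reference_gemm(A, B, D, alpha=1.0, beta=1.0))  # [[20.0, 23.0], [44.0, 51.0]]
```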
---

## CK GEMM API Overview

CK provides a highly composable GEMM API via the `DeviceGemm` family of device operations. These are heavily templated to support a wide range of data types, layouts, and fused operations.

### Template Parameters

- **ALayout** - A matrix layout (RowMajor/ColumnMajor)
- **BLayout** - B matrix layout (RowMajor/ColumnMajor)
- **CLayout** - C matrix layout (RowMajor/ColumnMajor)
- **ADataType** - A matrix data type
- **BDataType** - B matrix data type
- **CDataType** - C matrix data type
- **AElementwiseOperation** - Fused operation on tensor A before GEMM
- **BElementwiseOperation** - Fused operation on tensor B before GEMM
- **CElementwiseOperation** - Fused operation on tensor C after GEMM

For a large K dimension, use `DeviceGemmSplitK` to split K across workgroups (the output buffer must be zeroed first because the kernel accumulates with AtomicAdd).
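The split-K idea behind `DeviceGemmSplitK` can be sketched in plain Python: each "workgroup" handles one slice of K and accumulates its partial product into a zero-initialized output, mimicking the AtomicAdd accumulation. The helper name and shapes are illustrative assumptions.

```python
def splitk_gemm(A, B, num_splits):
    """Split the K loop across num_splits partial products; sum them into C."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]        # output must be zero-initialized
    span = (K + num_splits - 1) // num_splits
    for s in range(num_splits):              # each split maps to a workgroup
        k0, k1 = s * span, min((s + 1) * span, K)
        for m in range(M):
            for n in range(N):
                # partial product over this K-slice, accumulated like AtomicAdd
                C[m][n] += sum(A[m][k] * B[k][n] for k in range(k0, k1))
    return C

print(splitk_gemm([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]], 2))
# matches the unsplit product: [[19.0, 22.0], [43.0, 50.0]]
```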
For fused operations with additional tensors, use `DeviceGemmMultipleABD` or `DeviceGemmMultipleD`:
- **DsLayout** - layouts for additional tensors
- **DsDataType** - data types for additional tensors

For `DeviceGemmMultipleABD`, pass **ALayout**, **BLayout**, **ADataType**, **BDataType** as tuples.

---

## Supported GEMM Variants

- **DeviceGemm**: Standard GEMM
- **DeviceGemmSplitK**: Split-K GEMM for large K
- **DeviceGemmMultipleABD**: Fused GEMM with multiple A/B/D tensors
- **DeviceGemmMultipleD**: Fused GEMM with multiple D tensors

---

## Supported Device Operations

- **DeviceGemmDl**: DL instructions
- **DeviceGemmDpp**: DL instructions with DPP during data load
- **DeviceGemmWmma_CShuffle**: WMMA instructions with CShuffle optimization
- **DeviceGemm_Xdl_CShuffle_LdsDirectLoad**: XDL instructions, CShuffle, direct global-to-shared load
- **DeviceGemm_Xdl_CShuffle**: XDL instructions with CShuffle
- **DeviceGemm_Xdl_CShuffleV2**: XDL instructions, optimized pipeline vs. V1
- **DeviceGemmXdlSkipBLds**: XDL, skips shared memory load for B
- **DeviceGemm_Xdl_WaveletModel_CShuffle**: XDL, CShuffle, wavelet producer/consumer
- **DeviceGemmXdl**: XDL instructions

---
## Supported Data Types and Layouts

### XDL Instruction

|Data type|Supported|
|---------|---------|
|bf16 |✔️|
|fp16 |✔️|
|fp32 |✔️|
|int8 |✔️|
|fp8  |✔️|

### WMMA Instruction

|Data type|Supported|
|---------|---------|
|bf16 |✔️|
|fp16 |✔️|
|fp32 |❌|
|int8 |✔️|
|fp8  |❌|

### DL Instruction

|Data type|Supported|
|---------|---------|
|bf16 |❌|
|fp16 |✔️|
|fp32 |✔️|
|int8 |✔️|
|fp8  |❌|

---

## Supported Fused Elementwise Operations

- **B Matrix Multiply + Add + Gelu** - bf16 (int8 for B matrix)
- **B Matrix Multiply + Add** - bf16 (int8 for B matrix)
- **B Matrix Multiply + Gelu** - bf16 (int8 for B matrix)
- **B Matrix Multiply** - bf16 (int8 for B matrix)
- **Add + Add + Gelu** - fp16
- **Add + Gelu** - fp16, bf16 (int8 for B matrix) for Row/Column/Row
- **Multiply** - fp16
- **Add + Multiply** - fp16
- **Add + Relu** - fp16 (int8 for B matrix) for Row/Column/Row, bf16 (int8 for B matrix) for Row/Column/Row
- **Add + Silu** - fp16 (int8 for B matrix) for Row/Column/Row, bf16 (int8 for B matrix) for Row/Column/Row
- **Add** - fp16 (int8 for B matrix) for Row/Column/Row, bf16 (int8 for B matrix) for Row/Column/Row
- **Bilinear** - fp16, int8
- **Gelu** - fp16
- **Multiply + Add** - fp16 for Row/Column/Row and Row/Row/Row, fp16 (int8 for B matrix, fp32 for Bias) for Row/Column/Row and Row/Row/Row
- **Quantization** - int8

---
## GEMM V2 (Universal GEMM)

Optimized for the MI300 series. The operation is called `DeviceGemmV2` and uses template parameters similar to those above:

- **ALayout**, **BLayout**, **CLayout**
- **ADataType**, **BDataType**, **CDataType**
- **AElementwiseOperation**, **BElementwiseOperation**, **CElementwiseOperation**

Split-K is supported (the output buffer must be zeroed if splitK > 1).

### Device Operations

- **DeviceGemm_Xdl_CShuffleV3**: XDL with CShuffle optimization
- **DeviceGemm_Xdl_CShuffleV3R1**: XDL with CShuffle, reduction on split-K after GEMM

### Supported Types

|Data type|Supported|
|---------|---------|
|bf16 |✔️|
|fp16 |✔️|
|fp32 |❌|
|int8 |❌|
|fp8 (C bf16)|✔️|
|fp16 (A fp8)|✔️|
|fp16 (B fp8)|✔️|

---
## Other GEMM Extensions

- **DeviceGemm_dequantB**: GEMM with dequantization (WMMA)
- **DeviceGemmMultipleD_ABScale**: GEMM with scale for A and B
- **DeviceGemmMultipleDLayernorm**: GEMM fused with layernorm
- **DeviceGemmMultipleDMultipleR**: GEMM fused with reductions and custom global reductions
- **DeviceGemmReduce**: GEMM fused with reduction
- **DeviceGemm_Streamk_V2**: Stream-K with reduction instead of AtomicAdd
- **DeviceGemmStreamK**: Stream-K using AtomicAdd

---
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/01_gemm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (FP16)
./gemm_xdl_fp16 -M 4096 -N 4096 -K 4096 -v 1 -t 1
```

### Run ```example_gemm_xdl```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: run kernel # of times (>1)
./bin/example_gemm_xdl 0 1 5
```

### Run ```example_gemm_xdl_fp16_streamk_v3```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: time kernel (0=no, 1=yes)
#arg4 to 9: M (256x), N (128x), K (32x), StrideA, StrideB, StrideC
#arg10: stream-k select (-1: default config, 0: all DP, 1: 1-tile SK, 2: 2-tile SK)
#arg11: grid size (-1 for max occupancy)
./bin/example_gemm_xdl_fp16_streamk_v3 1 2 1 3840 4096 4096 4096 4096 4096 1 -1
```

Sample output:

```
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
problem {M:3840, N:4096, K:4096, SA:4096, SB:4096, SC:4096, MP:4032, NP:4096, KRead:4096, KP:4096, AK0:512, BK0:2048, MBlock: 18, NBlock: 16, Stream-K Selection:1, Grid size:-1}
Perf: 0.292022 ms, 441.23 TFlops, 330.348 GB/s, DeviceGemmXdlUniversal<MNPadding, RRR> BlkSize: 256, BlkTile: 224x256x64, WaveTile: 16x16, WaveMap: 7x8, VmemReadVec: 8x8, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3, BlkGemmPipelinePrefetchStages: 2
```

---

## Source Code Structure

```
example/01_gemm/
├── gemm_xdl_fp16.cpp                  # Main example: sets up, runs, and verifies GEMM (FP16)
├── gemm_xdl_fp32.cpp                  # Main example: FP32 variant
include/ck/tensor_operation/gpu/device/
│   └── device_gemm.hpp                # Device-level GEMM API (templated)
include/ck/tensor_operation/gpu/device/impl/
│   └── device_gemm_xdl.hpp            # XDL-based GEMM implementation
include/ck/tensor_operation/gpu/grid/
│   └── gridwise_gemm_xdl.hpp          # Grid-level tiled GEMM kernel
include/ck/tensor_operation/gpu/block/
│   └── blockwise_gemm_xdl.hpp         # Block-level tiled GEMM
library/reference_tensor_operation/cpu/
    └── reference_gemm.hpp             # CPU reference GEMM for correctness checking
```
### Key Classes and Functions

- **DeviceGemmXdl** (in `device_gemm.hpp`):
  Main device API for launching GEMM kernels.
- **GridwiseGemmXdl** (in `gridwise_gemm_xdl.hpp`):
  Implements the tiled/blocking GEMM kernel for the GPU grid.
- **BlockwiseGemmXdl** (in `blockwise_gemm_xdl.hpp`):
  Handles block-level computation and shared memory tiling.
- **reference_gemm** (in `reference_gemm.hpp`):
  CPU implementation for result verification.

---

This example is the foundation for all matrix operations in Composable Kernel and is the basis for more advanced fused and batched operations.

---

# Composable Kernel GEMM Bilinear Example

## Introduction

This example demonstrates GEMM (General Matrix Multiplication) fused with bilinear operations on auxiliary tensors using Composable Kernel. Bilinear fusion patterns are widely used in neural networks for gating, attention, and multimodal feature fusion, where the output of a matrix multiplication is combined elementwise with one or more additional tensors.

---

## Theory

**Mathematical Formulation:**
$$
F = \text{BilinearOp}(A \times B, D, E)
$$
- $A$: [M, K] input matrix
- $B$: [K, N] weight matrix
- $D$, $E$: [M, N] auxiliary tensors (or broadcastable)
- $F$: [M, N] output

**Examples:**
- Elementwise: $F = (A \times B) \odot D \odot E$
- Gated: $F = (A \times B) \odot \sigma(D) + E$
- Weighted: $F = \alpha (A \times B) + \beta (D \odot E)$

The GEMM result is kept in registers and combined with auxiliary tensors in the epilogue, avoiding intermediate writes to global memory. This pattern is common in attention, gating, and feature interaction layers.
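The first (elementwise) variant above can be sketched as a plain-Python reference. This is illustrative only, not the CK API; the helper name and tiny shapes are assumptions for the sketch.

```python
def gemm_bilinear(A, B, D, E):
    """F = (A x B) * D * E elementwise, applied in the epilogue."""
    K, N = len(A[0]), len(B[0])
    return [[sum(A[m][k] * B[k][n] for k in range(K)) * D[m][n] * E[m][n]
             for n in range(N)] for m in range(len(A))]

# 1x2 @ 2x1 GEMM gives 11; the bilinear epilogue scales it by D and E
print(gemm_bilinear([[1.0, 2.0]], [[3.0], [4.0]], [[2.0]], [[0.5]]))  # [[11.0]]
```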
---

## CK GEMM Bilinear API Overview

CK provides a composable API for GEMM with multiple auxiliary tensors via the `DeviceGemmMultipleD` operation.

### Template Parameters

- **ALayout** - A matrix layout (RowMajor/ColumnMajor)
- **BLayout** - B matrix layout (RowMajor/ColumnMajor)
- **DsLayout** - Layouts for auxiliary tensors (tuple)
- **ELayout** - Output matrix layout (RowMajor/ColumnMajor)
- **ADataType** - A matrix data type
- **BDataType** - B matrix data type
- **DsDataType** - Data types for auxiliary tensors (tuple)
- **EDataType** - Output matrix data type
- **AElementwiseOperation** - Fused operation on tensor A before GEMM
- **BElementwiseOperation** - Fused operation on tensor B before GEMM
- **CDEElementwiseOperation** - Fused operation on C, D, E after GEMM

### Supported Data Types and Layouts

- Supports fp16, int8, and other types depending on the device operation.
- Supports RowMajor and ColumnMajor layouts for all tensors.

### Supported Device Operations

- **DeviceGemmMultipleD**: Standard multi-tensor GEMM
- **DeviceGemmMultipleD_Bilinear**: GEMM with bilinear fusion in the epilogue

---

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/02_gemm_bilinear
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
```

### Run ```example_gemm_bilinear_xdl_fp16```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
# (arguments 3 to 10 are documented in the example source)
#arg11 to 12: alpha, beta
./bin/example_gemm_bilinear_xdl_fp16 1 1 1 3840 4096 4096 4096 4096 4096 4096 0.5 0.5
```

---

## Source Code Structure

```
example/02_gemm_bilinear/
├── gemm_bilinear_xdl.cpp              # Main example: sets up, runs, and verifies GEMM with bilinear fusion
├── gemm_bilinear_wmma_fp16.cpp        # WMMA FP16 variant
├── gemm_bilinear_wmma_int8.cpp        # WMMA int8 variant
include/ck/tensor_operation/gpu/device/
│   └── device_gemm_multiple_d.hpp     # Device-level API for multi-tensor GEMM
include/ck/tensor_operation/gpu/device/impl/
│   └── device_gemm_bilinear_impl.hpp  # Bilinear operation implementation
include/ck/tensor_operation/gpu/grid/
│   └── gridwise_gemm_multiple_d.hpp   # Grid-level multi-tensor GEMM kernel
include/ck/tensor_operation/gpu/element/
    └── element_wise_operation.hpp     # Elementwise operation definitions
```

### Key Classes and Functions

- **DeviceGemmMultipleD** (in `device_gemm_multiple_d.hpp`):
  Device API for GEMM with multiple auxiliary tensors and fused epilogues.
- **gridwise_gemm_multiple_d** (in `gridwise_gemm_multiple_d.hpp`):
  Implements the tiled/blocking GEMM kernel with multi-tensor epilogue.
- **element_wise_operation** (in `element_wise_operation.hpp`):
  Defines bilinear and other elementwise operations.

---

This example demonstrates how Composable Kernel supports complex multi-tensor fusion patterns for advanced neural network architectures.

---

# GEMM with Bias and ReLU Activation Fusion

## Theory

This example demonstrates **GEMM fused with bias addition and ReLU activation**. This is the core pattern for fully connected (dense) neural network layers and the feed-forward blocks in transformers.

**Mathematical Formulation:**
$$
E = \text{ReLU}(A \times B + \text{bias})
$$
- $A$: [M, K] input matrix
- $B$: [K, N] weight matrix
- $\text{bias}$: [N] bias vector (broadcasted)
- $E$: [M, N] output

**Algorithmic Background:**
- The GEMM result is kept in registers, bias is added, and ReLU is applied before writing to global memory.
- This fusion eliminates intermediate memory traffic and is a standard optimization in deep learning frameworks.
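A plain-Python reference of this fused epilogue (illustrative only; the real kernel applies the same math per tile before the global-memory write):

```python
def gemm_bias_relu(A, B, bias):
    """E = ReLU(A x B + bias); bias is broadcast along the N dimension."""
    K, N = len(A[0]), len(B[0])
    return [[max(0.0, sum(A[m][k] * B[k][n] for k in range(K)) + bias[n])
             for n in range(N)] for m in range(len(A))]

print(gemm_bias_relu([[2.0, 1.0]], [[1.0, 2.0], [1.0, 2.0]], [0.0, 1.0]))  # [[3.0, 7.0]]
```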
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/03_gemm_bias_relu
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_bias_relu_xdl -M 2048 -N 8192 -K 2048 --verify=1 --time=1
```

### Run ```example_gemm_bias_relu_xdl_fp16```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: time kernel (0=no, 1=yes)
#arg4 to 9: M (256x), N (128x), K (32x), StrideA, StrideB, StrideE
./bin/example_gemm_bias_relu_xdl_fp16 1 1 1 3840 4096 4096 4096 4096 4096
```

## Source Code Structure

### Directory Layout
```
example/03_gemm_bias_relu/
├── gemm_bias_relu_xdl.cpp                 # Main example: sets up, runs, and verifies GEMM+Bias+ReLU
include/ck/tensor_operation/gpu/device/
│   └── device_gemm_multiple_d.hpp         # Device-level API for multi-tensor GEMM
include/ck/tensor_operation/gpu/device/impl/
│   ├── device_gemm_xdl_cshuffle_v3.hpp    # XDL with C-Shuffle epilogue
│   └── device_gemm_bias_relu_impl.hpp     # Specialized bias+ReLU implementation
include/ck/tensor_operation/gpu/grid/
│   └── gridwise_gemm_xdl_cshuffle.hpp     # Grid-level GEMM with epilogue
include/ck/tensor_operation/gpu/element/
    └── element_wise_operation.hpp         # Elementwise operation definitions
```

### Key Classes and Functions

- **DeviceGemmMultipleD** (in `device_gemm_multiple_d.hpp`):
  Device API for GEMM with auxiliary tensors and fused epilogues.
- **gridwise_gemm_xdl_cshuffle** (in `gridwise_gemm_xdl_cshuffle.hpp`):
  Implements the tiled/blocking GEMM kernel with fused epilogue.
- **element_wise_operation** (in `element_wise_operation.hpp`):
  Defines bias addition and ReLU activation.

This example demonstrates the standard epilogue fusion concept that enables efficient neural network layers in modern deep learning.

---

# GEMM with Add, Add, and FastGELU Activation

## Theory

This example demonstrates a **GEMM operation fused with two addition operations and FastGELU activation**. This pattern is used in transformer feed-forward networks and other neural architectures where a linear transformation is followed by bias addition, residual addition, and a non-linear activation.

**Mathematical Formulation:**
$$
E = \text{FastGELU}((A \times B) + D_0 + D_1)
$$
- $A$: [M, K] input matrix
- $B$: [K, N] weight matrix
- $D_0$: [N] bias vector (broadcasted)
- $D_1$: [M, N] residual tensor
- $E$: [M, N] output

FastGELU is an efficient approximation of GELU:
$$
\text{FastGELU}(x) = x \cdot \sigma(1.702 \cdot x)
$$
where $\sigma$ is the sigmoid function.

**Algorithmic Background:**
- The GEMM result is kept in registers, bias and residual are added, and FastGELU is applied before writing to global memory.
- No intermediate results are written to global memory.
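The approximation is easy to check numerically with a plain-Python sketch (not the CK implementation; the kernel applies this elementwise in the epilogue while the GEMM tile is still in registers):

```python
import math

def fast_gelu(x):
    """FastGELU(x) = x * sigmoid(1.702 * x)."""
    sigmoid = 1.0 / (1.0 + math.exp(-1.702 * x))
    return x * sigmoid

print(fast_gelu(0.0))  # 0.0 exactly, since sigmoid(0) = 0.5
print(fast_gelu(3.0))  # close to 3.0, since sigmoid(5.106) is near 1
```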
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/04_gemm_add_add_fastgelu
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_add_add_fastgelu_xdl -M 2048 -N 8192 -K 2048 --verify=1 --time=1
```

### Run ```example_gemm_add_add_fastgelu_xdl_fp16```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: time kernel (0=no, 1=yes)
#arg4 to 11: M (256x), N (128x), K (32x), StrideA, StrideB, StrideD0, StrideD1, StrideE
./bin/example_gemm_add_add_fastgelu_xdl_fp16 1 1 1
```

## Source Code Structure

### Directory Layout
```
example/04_gemm_add_add_fastgelu/
├── gemm_add_add_fastgelu_xdl.cpp          # Main example: sets up, runs, and verifies GEMM+Add+Add+FastGELU
include/ck/tensor_operation/gpu/device/
│   └── device_gemm_multiple_d.hpp         # Device-level API for multi-tensor GEMM
include/ck/tensor_operation/gpu/device/impl/
│   ├── device_gemm_xdl_cshuffle_v3.hpp    # XDL with C-Shuffle epilogue
│   └── device_gemm_fastgelu_impl.hpp      # FastGELU-specific implementation
include/ck/tensor_operation/gpu/grid/
│   └── gridwise_gemm_multiple_d_xdl.hpp   # Grid-level multi-stage GEMM
include/ck/tensor_operation/gpu/element/
    └── element_wise_operation.hpp         # Elementwise operation definitions
```

### Key Classes and Functions

- **DeviceGemmMultipleD** (in `device_gemm_multiple_d.hpp`):
  Device API for GEMM with multiple auxiliary tensors and fused epilogues.
- **gridwise_gemm_multiple_d_xdl** (in `gridwise_gemm_multiple_d_xdl.hpp`):
  Implements the tiled/blocking GEMM kernel with multi-stage epilogue.
- **element_wise_operation** (in `element_wise_operation.hpp`):
  Defines FastGELU and other elementwise operations.

This example demonstrates how Composable Kernel supports complex multi-stage epilogue fusion for advanced neural network architectures.

---

# N-Dimensional Convolution Forward

## Theory

This example demonstrates the **N-dimensional convolution forward pass** using Composable Kernel. Convolution is a fundamental operation in deep learning, especially in convolutional neural networks (CNNs) for images, audio, and volumetric data.

**Mathematical Formulation:**
Given:
- Input tensor: $X[N, C_{in}, D_1, D_2, ..., D_n]$
- Weight tensor: $W[C_{out}, C_{in}, K_1, K_2, ..., K_n]$
- Output tensor: $Y[N, C_{out}, O_1, O_2, ..., O_n]$

The convolution computes:
$$
Y[n, c_{out}, o_1, ..., o_n] = \sum_{c_{in}} \sum_{k_1} ... \sum_{k_n} X[n, c_{in}, o_1 + k_1, ..., o_n + k_n] \cdot W[c_{out}, c_{in}, k_1, ..., k_n]
$$

Stride, padding, and dilation parameters control the mapping between input and output indices.

**Algorithmic Background:**
- Composable Kernel implements convolution as an implicit GEMM (matrix multiplication) for efficiency.
- The input and weight tensors are transformed into matrices, and the convolution is performed as a GEMM.
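The implicit-GEMM idea can be sketched on a 1-D convolution: unfold the input into a patch matrix (im2col), and the convolution becomes a plain GEMM. This is an illustrative sketch; CK performs the transformation implicitly, without materializing the patch matrix in memory.

```python
def conv1d_as_gemm(x, w):
    """1-D valid convolution via im2col: unfold x into patches, then one GEMM."""
    K = len(w)
    O = len(x) - K + 1                       # output length (no padding, stride 1)
    patches = [[x[o + k] for k in range(K)] for o in range(O)]  # O x K matrix
    # the convolution is now an O x K by K x 1 matrix product
    return [sum(p[k] * w[k] for k in range(K)) for p in patches]

print(conv1d_as_gemm([1.0, 2.0, 3.0, 4.0], [1.0, 1.0]))  # [3.0, 5.0, 7.0]
```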
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/09_convnd_fwd
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
```

### Run ```example_convnd_fwd_xdl```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
# (arguments 3 onward are documented in the example source)
# <right padding>, (ie RightPy, RightPx for 2D)
./bin/example_convnd_fwd_xdl 0 1 100
```

## Source Code Structure

### Directory Layout
```
example/09_convnd_fwd/
├── convnd_fwd_xdl.cpp                 # Main example: sets up, runs, and verifies N-D convolution
include/ck/tensor_operation/gpu/device/
│   └── device_convnd_fwd.hpp          # Device-level convolution API
include/ck/tensor_operation/gpu/device/impl/
│   └── device_convnd_fwd_xdl.hpp      # XDL-based convolution implementation
include/ck/tensor_operation/gpu/grid/
│   └── gridwise_convnd_fwd_xdl.hpp    # Grid-level convolution kernel
include/ck/tensor_operation/gpu/block/
    └── blockwise_convnd_fwd_xdl.hpp   # Block-level convolution
```

### Key Classes and Functions

- **DeviceConvNdFwd** (in `device_convnd_fwd.hpp`):
  Device API for N-dimensional convolution.
- **gridwise_convnd_fwd_xdl** (in `gridwise_convnd_fwd_xdl.hpp`):
  Implements the tiled/blocking convolution kernel.
- **blockwise_convnd_fwd_xdl** (in `blockwise_convnd_fwd_xdl.hpp`):
  Handles block-level computation and shared memory tiling.

This example demonstrates how Composable Kernel implements efficient N-dimensional convolution using implicit GEMM, supporting a wide range of deep learning applications.
# N-Dimensional Convolution with Multiple D and Multiple Reduce

## Theory

This example demonstrates **N-dimensional convolution forward** with support for multiple auxiliary tensors (D) and multiple reduction operations. This is useful for advanced neural network layers that require additional outputs or statistics alongside the main convolution result.

**Mathematical Formulation:**

- Input tensor: $X[N, C_{in}, D_1, D_2, ..., D_n]$
- Weight tensor: $W[C_{out}, C_{in}, K_1, K_2, ..., K_n]$
- Auxiliary tensors: $D_0, D_1, ...$ (various shapes)
- Output tensor: $Y[N, C_{out}, O_1, O_2, ..., O_n]$
- Reduction operations: e.g., sum, mean, max over specified axes

The convolution computes the standard output as well as additional outputs or statistics by applying reduction operations to the convolution result or auxiliary tensors.

**Algorithmic Background:**

- Composable Kernel implements this as an implicit GEMM with support for multiple auxiliary tensors and reductions in the epilogue.
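The epilogue-fusion idea can be sketched in pure Python (illustrative only, not CK code; shown on a plain matrix product for brevity): the main result and its reductions are produced in the same pass, so the result never has to be re-read from memory.

```python
# Illustrative sketch (not CK code): a main result computed element by
# element, with reduction outputs (row sum and row max here) accumulated
# in the same pass -- the "fused reduction epilogue" pattern.

def matmul_with_reductions(a, b):
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    row_sum = [0.0] * m
    row_max = [float("-inf")] * m
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            c[i][j] = acc
            row_sum[i] += acc                  # reduction fused into the epilogue
            row_max[i] = max(row_max[i], acc)  # second fused reduction
    return c, row_sum, row_max

c, s, mx = matmul_with_reductions([[1, 2], [3, 4]], [[1, 0], [0, 1]])
print(c)    # [[1, 2], [3, 4]]
print(s)    # [3.0, 7.0]
print(mx)   # [2, 4]
```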

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/10_convnd_fwd_multiple_d_multiple_reduce
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./convnd_fwd_multiple_d_multiple_reduce_xdl --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/10_convnd_fwd_multiple_d_multiple_reduce/
├── convnd_fwd_multiple_d_multiple_reduce_xdl.cpp              # Main example: sets up, runs, and verifies N-D convolution with multiple D/reduce
include/ck/tensor_operation/gpu/device/
│   └── device_convnd_fwd_multiple_d_multiple_reduce.hpp       # Device-level API for multi-D/multi-reduce convolution
include/ck/tensor_operation/gpu/device/impl/
│   └── device_convnd_fwd_multiple_d_multiple_reduce_impl.hpp  # Implementation
include/ck/tensor_operation/gpu/grid/
    └── gridwise_convnd_fwd_multiple_d_multiple_reduce.hpp     # Grid-level kernel
```

### Key Classes and Functions

- **DeviceConvNdFwdMultipleDMultipleReduce** (in `device_convnd_fwd_multiple_d_multiple_reduce.hpp`):
  Device API for N-dimensional convolution with multiple outputs and reductions.
- **gridwise_convnd_fwd_multiple_d_multiple_reduce** (in `gridwise_convnd_fwd_multiple_d_multiple_reduce.hpp`):
  Implements the tiled/blocking convolution kernel with multi-output/reduce epilogue.

This example demonstrates how Composable Kernel supports advanced convolution patterns with multiple outputs and reductions in a single efficient kernel.
# N-Dimensional Convolution Forward with Bias

## Theory

This example demonstrates **N-dimensional convolution forward** with bias addition. This is a common pattern in convolutional neural networks (CNNs), where a bias term is added to each output channel after the convolution operation.

**Mathematical Formulation:**

$$
Y[n, c_{out}, o_1, ..., o_n] = \sum_{c_{in}} \sum_{k_1} ... \sum_{k_n} X[n, c_{in}, o_1 + k_1, ..., o_n + k_n] \cdot W[c_{out}, c_{in}, k_1, ..., k_n] + B[c_{out}]
$$

- $X$: [N, C_in, D1, D2, ..., Dn] input tensor
- $W$: [C_out, C_in, K1, K2, ..., Kn] weight tensor
- $B$: [C_out] bias tensor
- $Y$: [N, C_out, O1, O2, ..., On] output tensor

**Algorithmic Background:**

- Composable Kernel implements convolution as an implicit GEMM, with bias addition fused in the epilogue for efficiency.
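The fused bias epilogue can be illustrated with a minimal 1D sketch (illustrative only, not CK code): the bias is added while each output element is still in the accumulator, instead of in a separate elementwise pass.

```python
# Illustrative sketch (not CK code): bias addition fused into the
# epilogue of a 1D convolution, so Y = conv(X, W) + B[c_out] is produced
# in one pass rather than by a second elementwise kernel.

def conv1d_fwd_bias(x, w, bias):
    """x: [C_in][L], w: [C_out][C_in][K], bias: [C_out] -> y: [C_out][L-K+1]"""
    c_out, c_in, k = len(w), len(w[0]), len(w[0][0])
    l_out = len(x[0]) - k + 1
    y = []
    for co in range(c_out):
        row = []
        for o in range(l_out):
            acc = sum(x[ci][o + kk] * w[co][ci][kk]
                      for ci in range(c_in) for kk in range(k))
            row.append(acc + bias[co])   # bias fused in the epilogue
        y.append(row)
    return y

x = [[1, 2, 3, 4]]           # one input channel, length 4
w = [[[1, 1]], [[1, -1]]]    # two output channels, kernel size 2
print(conv1d_fwd_bias(x, w, [10, 0]))   # [[13, 15, 17], [-1, -1, -1]]
```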

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/11_convnd_fwd_bias
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./convnd_fwd_bias_xdl --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/11_convnd_fwd_bias/
├── convnd_fwd_bias_xdl.cpp              # Main example: sets up, runs, and verifies N-D convolution with bias
include/ck/tensor_operation/gpu/device/
│   └── device_convnd_fwd_bias.hpp       # Device-level convolution API with bias
include/ck/tensor_operation/gpu/device/impl/
│   └── device_convnd_fwd_bias_impl.hpp  # Implementation
include/ck/tensor_operation/gpu/grid/
    └── gridwise_convnd_fwd_bias.hpp     # Grid-level kernel
```

### Key Classes and Functions

- **DeviceConvNdFwdBias** (in `device_convnd_fwd_bias.hpp`):
  Device API for N-dimensional convolution with bias.
- **gridwise_convnd_fwd_bias** (in `gridwise_convnd_fwd_bias.hpp`):
  Implements the tiled/blocking convolution kernel with bias epilogue.

This example demonstrates how Composable Kernel fuses bias addition into the convolution forward pass for efficient CNN layer implementation.
# Parallel Reduction Operations

## Theory

This example demonstrates **parallel reduction operations** (e.g., sum, max, min, mean) over tensors. Reduction is a fundamental operation in deep learning for computing statistics (such as batch mean/variance), loss aggregation, and normalization.

**Mathematical Formulation:**

Given a tensor $X$ and a reduction axis $a$:

$$
Y = \text{reduce}_{a}(X)
$$

- For sum: $Y = \sum_{i \in a} X_i$
- For max: $Y = \max_{i \in a} X_i$
- For mean: $Y = \frac{1}{|a|} \sum_{i \in a} X_i$

**Algorithmic Background:**

- Reductions are implemented using parallel tree reduction or segmented reduction algorithms.
- Efficient reductions require careful memory access, synchronization, and sometimes numerically stable algorithms (e.g., Welford's for variance).
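The tree-reduction pattern can be sketched in pure Python (illustrative only, not CK code): each round halves the number of active partial results, which is how a thread block reduces a tile through shared memory in $\log_2 n$ barrier-separated steps.

```python
# Illustrative sketch (not CK code): a pairwise tree reduction.
# Each round halves the number of active partial results, mirroring
# a thread block reducing through shared memory with a barrier per round.

def tree_reduce(values, op=lambda a, b: a + b):
    vals = list(values)
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        # element i combines with its partner at i + half, as threads do
        # across shared memory; elements without a partner pass through
        vals = [op(vals[i], vals[i + half]) if i + half < len(vals) else vals[i]
                for i in range(half)]
    return vals[0]

print(tree_reduce(range(1, 9)))           # 36
print(tree_reduce([3, 1, 4, 1, 5], max))  # 5
```

The operator must be associative (sum, max, min) for the reordering to be valid; that is the same requirement the GPU kernels place on their reduction operators.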

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/12_reduce
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
```

## Run ```example_reduce_blockwise```

```bash
# -D <xxx> : input 3D/4D/5D tensor lengths
# -R <xxx> : reduce dimension ids
./bin/example_reduce_blockwise -D 16,64,32,960 -v 1 0 2 1
```

Expected Result:

```
./bin/example_reduce_blockwise -D 16,64,32,960 -v 1 0 2 1
launch_and_time_kernel: grid_dim {240, 1, 1}, block_dim {256, 1, 1}
Perf: 0.238063 ms, 264.285 GB/s, DeviceReduceBlockWise<256,M_C4_S1,K_C64_S1,InSr
```
## Run ```example_reduce_multiblock_atomic_add```

```bash
# -D <xxx> : input 3D/4D/5D tensor lengths
# -R <xxx> : reduce dimension ids
./bin/example_reduce_multiblock_atomic_add -D 16,64,32,960 -v 1 0 2 0
```

Expected Result:

```
./bin/example_reduce_multiblock_atomic_add -D 16,64,32,960 -v 1 0 2 0
Perf: 0 ms, inf GB/s, DeviceReduceMultiBlock<256,M_C4_S1,K_C64_S1,InSrcVectorDim_0_InSrcVectorSize_1_OutDstVectorSize_1>
```
echo $?

## Run ```example_reduce_blockwise_two_call```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=single integer value, 2=scope integer value, 3=decimal value)
./bin/example_reduce_blockwise_two_call 1 2 1
```

Expected Result:

```
./bin/example_reduce_blockwise_two_call 1 2 1
launch_and_time_kernel: grid_dim {204800, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Perf: 2.1791 ms, 771.42 GB/s, DeviceReduceBlockWise<256,M_C32_S1,K_C8_S1,InSrcVectorDim_1_InSrcVectorSize_1_OutDstVectorSize_1> => DeviceReduceBlockWise<256,M_C256_S1,K_C1_S1,InSrcVectorDim_1_InSrcVectorSize_1_OutDstVectorSize_1>
```

## Source Code Structure

### Directory Layout

```
example/12_reduce/
├── reduce_xdl.cpp               # Main example: sets up, runs, and verifies reduction
include/ck/tensor_operation/gpu/device/
│   └── device_reduce.hpp        # Device-level reduction API
include/ck/tensor_operation/gpu/device/impl/
│   └── device_reduce_impl.hpp   # Implementation
include/ck/tensor_operation/gpu/grid/
│   └── gridwise_reduce.hpp      # Grid-level reduction kernel
include/ck/tensor_operation/gpu/block/
    └── blockwise_reduce.hpp     # Block-level reduction
```

### Key Classes and Functions

- **DeviceReduce** (in `device_reduce.hpp`):
  Device API for reductions.
- **gridwise_reduce** (in `gridwise_reduce.hpp`):
  Implements the tiled/blocking reduction kernel.
- **blockwise_reduce** (in `blockwise_reduce.hpp`):
  Handles block-level reduction and shared memory.

This example demonstrates how Composable Kernel implements efficient parallel reductions for deep learning and scientific computing.
# 2D Pooling Forward

## Theory

This example demonstrates the **2D pooling forward pass**, a key operation in convolutional neural networks (CNNs) for spatial downsampling. Pooling reduces the spatial dimensions of feature maps, providing translation invariance and reducing computation.

**Mathematical Formulation:**

Given input $X[N, C, H_{in}, W_{in}]$, pooling window $(k_H, k_W)$, stride $(s_H, s_W)$, and padding $(p_H, p_W)$:

- Output $Y[N, C, H_{out}, W_{out}]$
- $H_{out} = \left\lfloor \frac{H_{in} + 2p_H - k_H}{s_H} \right\rfloor + 1$
- $W_{out} = \left\lfloor \frac{W_{in} + 2p_W - k_W}{s_W} \right\rfloor + 1$

For each output position:

- **Max Pooling:** $Y_{n,c,h,w} = \max_{i,j} X_{n,c,h \cdot s_H + i, w \cdot s_W + j}$
- **Average Pooling:** $Y_{n,c,h,w} = \frac{1}{k_H k_W} \sum_{i,j} X_{n,c,h \cdot s_H + i, w \cdot s_W + j}$

**Algorithmic Background:**

- Each thread computes one or more output elements.
- Handles padding and boundary conditions.
- Optimizes memory access for bandwidth.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/13_pool2d_fwd
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
```

### Run ```example_pool2d_fwd_fp16```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=single integer value, 2=scope integer value, 3=decimal value)
./bin/example_pool2d_fwd_fp16 1 1 1
```

Expected Result:

```
in_n_c_hi_wi: dim 4, lengths {128, 192, 71, 71}, strides {967872, 1, 13632, 192}
out_n_c_ho_wo: dim 4, lengths {128, 192, 36, 36}, strides {248832, 1, 6912, 192}
Start running 10 times...
Perf: 0.397436 ms, 1.44252 TFlops, 783.713 GB/s
```

### Run ```example_pool2d_fwd_fp32```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=single integer value, 2=scope integer value, 3=decimal value)
./bin/example_pool2d_fwd_fp32 1 1 1
```

Expected Result:

```
./bin/example_pool2d_fwd_fp32 1 1 1
in_n_c_hi_wi: dim 4, lengths {128, 192, 71, 71}, strides {967872, 1, 13632, 192}
out_n_c_ho_wo: dim 4, lengths {128, 192, 36, 36}, strides {248832, 1, 6912, 192}
Warm up 1 time
Start running 10 times...
Perf: 1.01823 ms, 0.563045 TFlops, 611.8 GB/s
```

## Source Code Structure

### Directory Layout

```
example/13_pool2d_fwd/
├── pool2d_fwd_xdl.cpp               # Main example: sets up, runs, and verifies 2D pooling
include/ck/tensor_operation/gpu/device/
│   └── device_pool_fwd.hpp          # Device-level pooling API
include/ck/tensor_operation/gpu/device/impl/
│   ├── device_pool2d_fwd_nhwc.hpp   # NHWC layout optimization
│   └── device_pool2d_fwd_nchw.hpp   # NCHW layout optimization
include/ck/tensor_operation/gpu/grid/
│   └── gridwise_pool_fwd.hpp        # Grid-level pooling kernel
include/ck/tensor_operation/gpu/block/
    └── blockwise_pool.hpp           # Block-level pooling
```

### Key Classes and Functions

- **DevicePoolFwd** (in `device_pool_fwd.hpp`):
  Device API for pooling.
- **gridwise_pool_fwd** (in `gridwise_pool_fwd.hpp`):
  Implements the tiled/blocking pooling kernel.
- **blockwise_pool** (in `blockwise_pool.hpp`):
  Handles block-level pooling and shared memory.

This example demonstrates how Composable Kernel implements efficient 2D pooling for CNNs and vision models.
60
example/14_gemm_quantization/README.md
Normal file
60
example/14_gemm_quantization/README.md
Normal file
@@ -0,0 +1,60 @@
|
||||
# GEMM with Quantization

## Theory

This example demonstrates **GEMM (General Matrix Multiplication) with quantized inputs or weights**. Quantization is a technique to reduce memory and computation by representing values with lower-precision integer types (e.g., int8), commonly used for efficient inference in deep learning.

**Mathematical Formulation:**

- Quantized GEMM: $C = \text{dequant}(A_q) \times \text{dequant}(B_q)$
- $A_q$, $B_q$: quantized matrices (e.g., int8)
- $\text{dequant}(x_q) = (x_q - z) \cdot s$ (scale $s$, zero-point $z$)
- $C$: output matrix (often in higher precision, e.g., float32 or float16)

**Algorithmic Background:**

- Quantized values are dequantized on-the-fly during GEMM computation.
- Accumulation is performed in higher precision for accuracy.
- Supports symmetric and asymmetric quantization.
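The dequantization algebra can be sketched in pure Python (illustrative only, not CK code): the dot product runs on the zero-point-corrected integers, and the scale factors are applied once per output element.

```python
# Illustrative sketch (not CK code): GEMM on quantized inputs with
# scale/zero-point dequantization. The accumulator works on integer
# values; the affine correction (x_q - z) * s is folded in per element.

def quantized_matmul(a_q, b_q, s_a, z_a, s_b, z_b):
    m, k, n = len(a_q), len(b_q), len(b_q[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum((a_q[i][p] - z_a) * (b_q[p][j] - z_b) for p in range(k))
            c[i][j] = acc * s_a * s_b   # dequantize the integer accumulator
    return c

# a = [[1.0, 2.0]]  quantized with s=0.5, z=0 -> a_q = [[2, 4]]
# b = [[1.0], [1.0]] quantized with s=1.0, z=0 -> b_q = [[1], [1]]
print(quantized_matmul([[2, 4]], [[1], [1]], 0.5, 0, 1.0, 0))   # [[3.0]]
```

With symmetric quantization ($z = 0$) the correction terms vanish and the kernel reduces to an integer GEMM followed by a single scale, which is why symmetric schemes are cheaper.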

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/14_gemm_quantization
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_quantization_xdl --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/14_gemm_quantization/
├── gemm_quantization_xdl.cpp            # Main example: sets up, runs, and verifies quantized GEMM
include/ck/tensor_operation/gpu/device/
│   └── device_gemm_quantized.hpp        # Device-level quantized GEMM API
include/ck/tensor_operation/gpu/device/impl/
│   └── device_gemm_quantized_impl.hpp   # Implementation
include/ck/tensor_operation/gpu/grid/
│   └── gridwise_gemm_quantized.hpp      # Grid-level quantized GEMM kernel
include/ck/tensor_operation/gpu/element/
    └── quantization_operations.hpp      # Quantization/dequantization utilities
```

### Key Classes and Functions

- **DeviceGemmQuantized** (in `device_gemm_quantized.hpp`):
  Device API for quantized GEMM.
- **gridwise_gemm_quantized** (in `gridwise_gemm_quantized.hpp`):
  Implements the tiled/blocking quantized GEMM kernel.
- **quantization_operations** (in `quantization_operations.hpp`):
  Defines quantization and dequantization functions.

This example demonstrates how Composable Kernel supports efficient quantized matrix multiplication for deep learning inference.
# Grouped GEMM

## Theory

This example demonstrates **grouped GEMM**: performing multiple independent GEMM operations (with potentially different shapes) in a single kernel launch. Grouped GEMM is used in transformer models (e.g., multi-head attention), mixture-of-experts, and other architectures requiring heterogeneous batched matrix multiplications.

**Mathematical Formulation:**

For $G$ groups, each with its own $A_g$, $B_g$, $C_g$:

$$
C_g = A_g \times B_g \quad \text{for} \quad g = 1, 2, ..., G
$$

- $A_g$: [M_g, K_g] input matrix for group $g$
- $B_g$: [K_g, N_g] weight matrix for group $g$
- $C_g$: [M_g, N_g] output matrix for group $g$

**Algorithmic Background:**

- Each group can have different matrix sizes and strides.
- The kernel launches a grid covering all groups, with each block assigned to a group.
- Useful for variable-length sequences, multi-head attention, and expert routing.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/15_grouped_gemm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
```

### Run ```example_grouped_gemm_xdl```

```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg3: run kernel # of times (>1)
./bin/example_grouped_gemm_xdl_fp16 0 1 5
```

## Source Code Structure

### Directory Layout

```
example/15_grouped_gemm/
├── grouped_gemm_xdl.cpp                # Main example: sets up, runs, and verifies grouped GEMM
include/ck/tensor_operation/gpu/device/
│   └── device_grouped_gemm_xdl.hpp     # Device-level grouped GEMM API
include/ck/tensor_operation/gpu/grid/
    └── gridwise_grouped_gemm_xdl.hpp   # Grid-level grouped GEMM kernel
```

### Key Classes and Functions

- **DeviceGroupedGemmXdl** (in `device_grouped_gemm_xdl.hpp`):
  Device API for grouped GEMM.
- **gridwise_grouped_gemm_xdl** (in `gridwise_grouped_gemm_xdl.hpp`):
  Implements the tiled/blocking grouped GEMM kernel.

This example demonstrates how Composable Kernel supports efficient heterogeneous batched matrix multiplication for advanced AI/ML workloads.
56
example/16_gemm_multi_d_multi_reduces/README.md
Normal file
56
example/16_gemm_multi_d_multi_reduces/README.md
Normal file
@@ -0,0 +1,56 @@
|
||||
# GEMM with Multiple D and Multiple Reductions

## Theory

This example demonstrates **GEMM with multiple auxiliary tensors (D) and multiple reduction operations**. This pattern is used in advanced neural network layers that require additional outputs or statistics (such as sums, means, or other reductions) alongside the main GEMM result.

**Mathematical Formulation:**

- For each GEMM: $C = A \times B$
- Auxiliary tensors: $D_0, D_1, ...$ (various shapes)
- Reductions: e.g., sum, mean, max over specified axes or outputs

The kernel computes the main GEMM output and additional reductions or statistics in a single pass.

**Algorithmic Background:**

- The GEMM result is kept in registers, auxiliary tensors are fused in the epilogue, and reductions are computed as part of the output.
- Useful for multi-task learning, attention statistics, and custom neural network layers.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/16_gemm_multi_d_multi_reduces
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_multi_d_multi_reduces_xdl --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/16_gemm_multi_d_multi_reduces/
├── gemm_multi_d_multi_reduces_xdl.cpp               # Main example: sets up, runs, and verifies GEMM with multi-D/multi-reduce
include/ck/tensor_operation/gpu/device/
│   └── device_gemm_multi_d_multi_reduces.hpp        # Device-level API for multi-D/multi-reduce GEMM
include/ck/tensor_operation/gpu/device/impl/
│   └── device_gemm_multi_d_multi_reduces_impl.hpp   # Implementation
include/ck/tensor_operation/gpu/grid/
    └── gridwise_gemm_multi_d_multi_reduces.hpp      # Grid-level kernel
```

### Key Classes and Functions

- **DeviceGemmMultiDMultiReduces** (in `device_gemm_multi_d_multi_reduces.hpp`):
  Device API for GEMM with multiple outputs and reductions.
- **gridwise_gemm_multi_d_multi_reduces** (in `gridwise_gemm_multi_d_multi_reduces.hpp`):
  Implements the tiled/blocking GEMM kernel with multi-output/reduce epilogue.

This example demonstrates how Composable Kernel supports advanced GEMM patterns with multiple outputs and reductions in a single efficient kernel.
# N-Dimensional Convolution Backward Pass for Data

This example demonstrates the backward data pass of an N-dimensional convolution, often denoted as `conv_bwd_data`. This operation is a crucial part of the backpropagation algorithm for training Convolutional Neural Networks (CNNs). Its purpose is to compute the gradient of the loss function with respect to the convolution's *input data*, which is then passed back to the preceding layer in the network.

## Mathematical Formulation

The backward data pass computes the gradient $\frac{\partial L}{\partial \text{In}}$, given the gradient from the subsequent layer, $\frac{\partial L}{\partial \text{Out}}$, and the filter weights `W` used in the forward pass.

Let the forward convolution be defined as:

$\text{Out} = \text{In} \star W$

The backward data pass is mathematically equivalent to a "full" convolution between the output gradient tensor `dL/dOut` and the 180-degree rotated (or transposed and flipped) weight tensor `W`:

$\frac{\partial L}{\partial \text{In}} = \frac{\partial L}{\partial \text{Out}} \star \text{rot180}(W)$

This operation propagates the error signal from the output back to the input, weighted by the same filters that were used in the forward pass.

## Algorithmic Strategy: Implicit GEMM

As with the forward pass, the most efficient way to implement the backward data pass on a GPU is to transform the convolution into a General Matrix-Matrix Multiplication (GEMM) problem.

1. **Output Gradient Reshaping**: The output gradient tensor `dL/dOut` is logically reshaped into a matrix `dL/dOut'` of shape `[K, (N*Ho*Wo)]`. This becomes the "A" matrix in the GEMM.

2. **Weight Reshaping**: The weight tensor `W` is logically reshaped into a matrix `W'` of shape `[K, (C*Y*X)]`. This becomes the "B" matrix in the GEMM.

3. **Implicit GEMM**: The core computation is then formulated as a GEMM operation. However, the output of this GEMM is not a simple matrix; it is the `dL/dIn` tensor:
   $(\text{dL/dIn})' = (W')^T \times (\text{dL/dOut})'$

The key insight is that this operation can be performed without explicitly forming the matrices. The GEMM kernel is designed to read from `dL/dOut` and `W` and write its results directly to the appropriate locations in the `dL/dIn` tensor. This process is sometimes referred to as an "implicit `col2im`" (column-to-image), as it is the inverse of the `im2col` transformation used in the forward pass.

This "implicit GEMM" approach is highly efficient. It avoids the massive memory and bandwidth overhead of materializing intermediate matrices, which is critical for performance.
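The "full convolution with a flipped filter" identity can be verified on a 1D toy case in pure Python (illustrative only, not CK code; the 1D filter flip is the analogue of `rot180(W)`):

```python
# Illustrative sketch (not CK code): the backward data pass of a 1D
# "valid" convolution, computed as a full convolution of dL/dOut with
# the flipped filter -- the 1D analogue of rot180(W).

def conv_bwd_data_1d(d_out, w):
    """d_out: output gradient of length L_out, w: filter of length K.
    Returns dL/dIn of length L_out + K - 1 (the forward input length)."""
    k = len(w)
    l_in = len(d_out) + k - 1
    w_flipped = w[::-1]
    d_in = [0.0] * l_in
    # full convolution: conceptually pad d_out by K-1 zeros on both sides
    for i in range(l_in):
        for kk in range(k):
            j = i - (k - 1) + kk
            if 0 <= j < len(d_out):
                d_in[i] += d_out[j] * w_flipped[kk]
    return d_in

# forward was out[j] = in[j]*w[0] + in[j+1]*w[1]; the gradient below
# matches what the chain rule gives element by element
print(conv_bwd_data_1d([1.0, 1.0, 1.0], [2.0, 3.0]))   # [2.0, 5.0, 5.0, 3.0]
```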

## Source Code Organization

- [`conv_bwd_data_xdl.cpp`](./conv_bwd_data_xdl.cpp): The main example file that defines the parameters for a 2D convolution and instantiates the generic `DeviceConvNdBwdData` kernel to compute the input gradients.
- [`../../include/ck/tensor_operation/gpu/device/device_conv_bwd_data.hpp`](../../include/ck/tensor_operation/gpu/device/device_conv_bwd_data.hpp): The high-level device interface for the backward data convolution. It is templated on the dimensionality, layouts, and data types of the problem.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_gemm_implicit_gemm_v1r2_xdlops_nchw_kcyx_nkhw.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_gemm_implicit_gemm_v1r2_xdlops_nchw_kcyx_nkhw.hpp): An example of a specific grid-wise kernel that implements the implicit GEMM algorithm for the backward data pass. The library contains multiple such kernels optimized for different layouts and problem types, and the `DeviceConvNdBwdData` interface selects the most appropriate one.
- [`../../library/include/ck/library/reference_tensor_operation/cpu/reference_conv_bwd_data.hpp`](../../library/include/ck/library/reference_tensor_operation/cpu/reference_conv_bwd_data.hpp): A CPU reference implementation used to verify the correctness of the GPU kernel's output.

## Build and Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build the Example
```bash
cd /path/to/composable_kernel/example/17_convnd_bwd_data
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
```

Expected Result:

```
Warm up
Start running 1 times...
Perf: 1.40031 ms, 69.8734 TFlops, 179.037 GB/s
```

## Relationship to Other Passes

The training of a single convolutional layer requires three distinct steps:

1. **Forward Pass (`conv_fwd`)**: Computes the output feature maps.
   - `Out = In * W`
2. **Backward Data Pass (`conv_bwd_data`)**: Computes the gradient with respect to the input, propagating the error to the previous layer. This is the focus of the current example.
   - `dL/dIn = dL/dOut * rot180(W)`
3. **Backward Weight Pass (`conv_bwd_weight`)**: Computes the gradient with respect to the weights, which is needed for the weight update.
   - `dL/dW = In * dL/dOut`

All three passes are critical for training a CNN, and all are typically implemented as high-performance implicit GEMM operations.
# Batched GEMM with Reduction

This example demonstrates a Batched General Matrix-Matrix Multiplication (Batched GEMM) where the result of each individual GEMM in the batch is then reduced along one of its dimensions. This is a specialized fusion pattern that combines a compute-intensive operation (GEMM) with a memory-intensive one (reduction), offering significant performance benefits for specific workloads.

## Mathematical Formulation

The operation performs a standard GEMM for each item in a batch, and then reduces the resulting matrix to a vector. For each batch item `b` from `0` to `BatchCount-1`:

1. **GEMM Stage**: A standard matrix multiplication is performed.
   $C_{[b]} = A_{[b]} \times B_{[b]}$

2. **Reduction Stage**: The resulting matrix $C_{[b]}$ is reduced along one of its dimensions (e.g., the M dimension) to produce an output vector $D_{[b]}$.
   $D_{[b], j} = \bigoplus_{i=0}^{M-1} C_{[b], i, j}$

Where:

- $A_{[b]}$ is an $M \times K$ matrix.
- $B_{[b]}$ is a $K \times N$ matrix.
- $C_{[b]}$ is the intermediate $M \times N$ result matrix for batch `b`.
- $D_{[b]}$ is the final $1 \times N$ output vector for batch `b`.
- $\bigoplus$ is a binary, associative reduction operator like sum, max, or min.

The key optimization is that the intermediate matrix $C_{[b]}$ is never written to global memory. The reduction is fused directly into the GEMM kernel.
|
||||
|
||||
## Algorithmic Strategy: Fused GEMM and Reduction
|
||||
|
||||
The implementation fuses the reduction into the epilogue of a batched GEMM kernel. The batch dimension provides a natural axis for parallelism.
|
||||
|
||||
1. **Batch Scheduling**: The `BatchCount` GEMM problems are distributed across the GPU's thread blocks. Each block is assigned one or more GEMMs from the batch to compute.
|
||||
|
||||
2. **Tiled GEMM Core**: For each assigned GEMM, the thread block runs a standard tiled GEMM algorithm to compute the product $A_{[b]} \times B_{[b]}$. The result for each tile of $C_{[b]}$ is accumulated in the private registers of the threads.
|
||||
|
||||
3. **Fused Reduction Epilogue**: This is where the fusion occurs. Instead of writing the computed tile of $C_{[b]}$ to global memory, the threads use it as input for a parallel reduction.
|
||||
- **Intra-Block Reduction**: The threads within a block, which collectively hold the values for a tile of $C_{[b]}$, perform a local reduction. For example, to reduce along the M dimension, threads responsible for different M-rows but the same N-column will cooperate, using fast shared memory to sum their partial results.
|
||||
- **Inter-Block Reduction**: Since multiple thread blocks may be working on different M-tiles for the same batch item, their partial reduction results must be combined. Each block writes its partial sum to a designated location in the output vector `D`, using atomic operations (like `atomicAdd`) to safely accumulate the final result.
|
||||
|
||||
This strategy completely eliminates the global memory traffic associated with the intermediate matrix `C`, which is often the largest tensor in the operation. This leads to substantial savings in memory bandwidth and improved performance.
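The fused computation above can be pictured with a small CPU reference. This is a sketch only: the function name and the flat row-major layouts are illustrative, not CK's API, and the per-element `D[b][j] +=` mirrors how the reduction is fused so that `C` is never materialized.

```cpp
#include <cassert>
#include <vector>

// CPU reference for the fused operation: D[b][j] = sum_i (A[b] x B[b])[i][j],
// computed without materializing the intermediate matrix C.
std::vector<std::vector<float>> batched_gemm_reduce_ref(
    const std::vector<std::vector<float>>& A, // BatchCount x (M*K), row-major
    const std::vector<std::vector<float>>& B, // BatchCount x (K*N), row-major
    int M, int N, int K)
{
    const int batch_count = static_cast<int>(A.size());
    std::vector<std::vector<float>> D(batch_count, std::vector<float>(N, 0.f));
    for(int b = 0; b < batch_count; ++b)
        for(int i = 0; i < M; ++i)
            for(int j = 0; j < N; ++j)
            {
                float c = 0.f;
                for(int k = 0; k < K; ++k)
                    c += A[b][i * K + k] * B[b][k * N + j];
                D[b][j] += c; // sum-reduction over the M dimension, fused per element
            }
    return D;
}
```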
## Source Code Organization

- [`batched_gemm_reduce_xdl.cpp`](./batched_gemm_reduce_xdl.cpp): The main example file. It sets up the batched GEMM problem and instantiates the `DeviceBatchedGemmReduce` operation, specifying the reduction dimension and operator.
- [`../../include/ck/tensor_operation/gpu/device/device_batched_gemm_reduce.hpp`](../../include/ck/tensor_operation/gpu/device/device_batched_gemm_reduce.hpp): The high-level device interface for this fused operation.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_batched_gemm_reduce_xdl_cshuffle.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_batched_gemm_reduce_xdl_cshuffle.hpp): The grid-wise kernel that implements the fused logic. It handles the batch scheduling, the tiled GEMM, and the fused reduction epilogue with atomic operations for inter-block communication.

## Build and Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build the Example

```bash
cd /path/to/composable_kernel/example/18_batched_gemm_reduce
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./batched_gemm_reduce_xdl

# Run with verification, data initialization, and timing
./batched_gemm_reduce_xdl 1 2 1
```

## Applications

This fused pattern is less common than a simple GEMM+Bias fusion but is highly effective for specific algorithms.

- **Gradient Computations**: In some complex neural network layers, the gradient calculation might involve a matrix product followed by a summation. For example, computing the gradient with respect to a bias term often involves summing the output gradients over the batch and spatial dimensions. If the output gradient itself is the result of a GEMM, this fused kernel could be applicable.
- **Custom Attention Mechanisms**: While standard attention involves a `softmax`, some research explores attention-like mechanisms that might use a simple sum or max reduction instead. If the query-key interaction is formulated as a batched GEMM, this kernel could compute the attention weights in a single, fused step.
- **Scientific Computing**: Certain numerical methods, particularly in physics or signal processing, may involve performing a linear transform (GEMM) on a set of signals (a batch) and then integrating the result (a reduction).
example/19_binary_elementwise/README.md (new file, 84 lines)
@@ -0,0 +1,84 @@
# Binary Elementwise Operations with Broadcasting

This example demonstrates a generic binary elementwise operation, a fundamental building block in numerical computing. It covers two important cases:

1. **Simple Elementwise**: Applying a binary function to two input tensors of the *same* shape.
2. **Elementwise with Broadcasting**: Applying a binary function to two input tensors of *different but compatible* shapes.

Broadcasting defines a set of rules for applying elementwise operations on tensors of different sizes, and it is a cornerstone of libraries like NumPy and TensorFlow.

## Mathematical Formulation

### Simple Elementwise

Given two input tensors, A and B, of the same rank and dimensions, and a binary operator $\odot$, the operation computes an output tensor C where each element is:

$C_{i,j,k,\dots} = A_{i,j,k,\dots} \odot B_{i,j,k,\dots}$

### Elementwise with Broadcasting

Broadcasting allows elementwise operations on tensors with different shapes, provided they are compatible. Two dimensions are compatible if they are equal, or if one of them is 1. The operation implicitly "stretches" or "duplicates" the tensor with the dimension of size 1 to match the other tensor's shape.

For example, adding a bias vector `B` of shape `(1, N)` to a matrix `A` of shape `(M, N)`:

$C_{i,j} = A_{i,j} + B_{0,j}$

Here, the single row of `B` is broadcast across all `M` rows of `A`. The output tensor `C` has the shape `(M, N)`.

Common binary elementwise operations include addition, subtraction, multiplication (Hadamard product), division, max, and min.

## Algorithmic Strategy: Grid-Stride Loop with Broadcasting

The implementation for both cases relies on an efficient **grid-stride loop**, adapted to handle broadcasting.

1. **Grid Partitioning**: The problem is mapped to a 1D grid of threads based on the number of elements in the **output** tensor.

2. **Grid-Stride Loop**: Each thread iterates through a subset of the output elements. For each output index, it must calculate the corresponding indices into the input tensors A and B.

3. **Broadcasting Logic**:
   - The core of the broadcasting logic lies in the `get_broadcast_coord` function. If an input tensor's dimension is 1, the coordinate for that dimension is always set to 0, effectively reusing the same element across the broadcast dimension. If the dimension matches the output, the coordinate is passed through.
   - This strategy ensures that memory accesses to the larger tensor remain coalesced, while accesses to the smaller, broadcasted tensor naturally involve re-reading the same values, which is handled efficiently by the GPU's cache hierarchy.

Like the simple case, broadcasted elementwise operations are almost always memory-bandwidth-bound.
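The coordinate mapping described above can be sketched in a few lines. The signature below is illustrative (the actual CK logic lives in the tensor descriptors), but the clamp-to-zero rule is exactly the one the text describes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Broadcast coordinate mapping sketch: for each dimension, if the input tensor's
// length is 1, the input coordinate is clamped to 0 (the single element is
// reused); otherwise the output coordinate passes through unchanged.
template <std::size_t Rank>
std::array<std::size_t, Rank>
get_broadcast_coord(const std::array<std::size_t, Rank>& out_coord,
                    const std::array<std::size_t, Rank>& in_lengths)
{
    std::array<std::size_t, Rank> in_coord{};
    for(std::size_t d = 0; d < Rank; ++d)
        in_coord[d] = (in_lengths[d] == 1) ? 0 : out_coord[d];
    return in_coord;
}
```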
## Source Code Organization

This example contains multiple files to demonstrate different scenarios:

- [`binary_elementwise_xdl.cpp`](./binary_elementwise_xdl.cpp): Demonstrates the simple case where both input tensors have the same shape.
- [`broadcast_add_2d_amn_bn.cpp`](./broadcast_add_2d_amn_bn.cpp): A specific example of broadcasting, adding a tensor of shape `(B, N)` to a tensor of shape `(A, M, N)`.
- [`../../include/ck/tensor_operation/gpu/device/device_elementwise.hpp`](../../include/ck/tensor_operation/gpu/device/device_elementwise.hpp): The high-level device interface. It is generic enough to handle both simple and broadcasted operations by correctly interpreting the tensor descriptors, which contain shape and stride information.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_elementwise.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_elementwise.hpp): The grid-wise kernel that implements the grid-stride loop. The tensor coordinate logic within this kernel correctly handles broadcasting based on the provided tensor descriptors.
- [`../../include/ck/tensor_operation/gpu/element/binary_element_wise_operation.hpp`](../../include/ck/tensor_operation/gpu/element/binary_element_wise_operation.hpp): Defines the various binary operator functors (like `Add`, `Multiply`, etc.).

## Build and Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build the Example

```bash
cd /path/to/composable_kernel/example/19_binary_elementwise
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the simple elementwise example
./binary_elementwise_xdl 1 2 1

# Run the broadcasting example
./broadcast_add_2d_amn_bn 1 2 1
```

## Applications

Broadcasting is a powerful feature that makes code more concise and memory-efficient.

- **Adding Bias**: The most common use case in deep learning is adding a bias vector (shape `[N]`) to a matrix of activations (shape `[Batch, N]`).
- **Feature Scaling**: Multiplying a feature map (shape `[N, C, H, W]`) by a per-channel scaling factor (shape `[1, C, 1, 1]`).
- **Standardization**: In data preprocessing, subtracting a mean vector from each row of a data matrix and dividing by a standard-deviation vector.
- **Coordinate Grids**: Creating coordinate grids by adding a row vector `[0, 1, 2, ...]` to a column vector `[0, 1, 2, ...]^T`.
example/20_grouped_conv_bwd_weight/README.md (new file, 77 lines)
@@ -0,0 +1,77 @@
# Grouped Convolution Backward Pass for Weights

This example demonstrates the backward weight pass for a **grouped convolution**, often denoted as `grouped_conv_bwd_weight`. This operation is essential for training neural networks that use grouped or depthwise convolutions, such as ResNeXt, MobileNets, and EfficientNets. Its purpose is to compute the gradient of the loss function with respect to the convolution's *filter weights*, which is then used by an optimizer (like SGD or Adam) to update the model's parameters.

## Mathematical Formulation

The backward weight pass computes the gradient $\frac{\partial L}{\partial W}$, given the input tensor from the forward pass, `In`, and the gradient from the subsequent layer, `dL/dOut`.

For a single group `g`, the operation is mathematically equivalent to a convolution between the input tensor for that group, `In_[g]`, and the output gradient tensor for that group, `dL/dOut_[g]`.

$\frac{\partial L}{\partial W_{[g]}} = \text{In}_{[g]} \star \frac{\partial L}{\partial \text{Out}_{[g]}}$

This operation correlates the input activations with the output error signals to determine how each weight should be adjusted to reduce the overall loss. The total gradient `dL/dW` is the collection of gradients for all `G` groups.

## Algorithmic Strategy: Implicit Grouped GEMM

This operation is a perfect candidate for the **Grouped GEMM** primitive. The convolution for each of the `G` groups is independently transformed into a GEMM problem, and all `G` GEMMs are executed in a single kernel launch.

For each group `g`:

1. **Input to Columns (`im2col`)**: The input tensor `In_[g]` is logically unrolled into a matrix `In'_[g]`. This is the same `im2col` transformation used in the forward pass. This matrix becomes the "A" matrix in the GEMM.

2. **Output Gradient Reshaping**: The output gradient tensor `dL/dOut_[g]` is logically reshaped into a matrix `(dL/dOut)'_[g]`. This matrix becomes the "B" matrix in the GEMM.

3. **Implicit Grouped GEMM**: The weight gradient `dL/dW_[g]` is computed by a single GEMM:

   $(\text{dL/dW})'_{[g]} = (\text{dL/dOut})'_{[g]} \times (\text{In}'_{[g]})^T$

The key to performance is that this is executed as a **Grouped GEMM**. The `DeviceGroupedConvBwdWeight` interface takes the `G` independent problems and maps them to a `DeviceGroupedGemm` kernel. This kernel schedules the `G` independent GEMMs across the GPU's compute units. The `im2col` transformation is performed implicitly; the GEMM kernel reads data directly from the original `In` and `dL/dOut` tensors in the correct pattern, avoiding the materialization of large intermediate matrices.

This approach is highly efficient as it leverages the task-parallel nature of the grouped convolution and the computational efficiency of highly optimized GEMM kernels.
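For intuition, here is a minimal CPU reference for the simplest case of the formula above: a depthwise (`G = C`) 1D convolution with stride 1 and no padding. The function name and layouts are illustrative, not the CK reference implementation.

```cpp
#include <cassert>
#include <vector>

// Depthwise 1D weight-gradient reference: for each group g,
//   dW[g][x] = sum_w In[g][x + w] * dOut[g][w]
// i.e. the input activations are correlated with the output error signal.
std::vector<std::vector<float>> grouped_conv1d_bwd_weight_ref(
    const std::vector<std::vector<float>>& in,   // G x W_in
    const std::vector<std::vector<float>>& dout, // G x W_out, W_in = W_out + filter_len - 1
    int filter_len)
{
    const int G = static_cast<int>(in.size());
    std::vector<std::vector<float>> dw(G, std::vector<float>(filter_len, 0.f));
    for(int g = 0; g < G; ++g)
        for(int x = 0; x < filter_len; ++x)
            for(std::size_t w = 0; w < dout[g].size(); ++w)
                dw[g][x] += in[g][x + w] * dout[g][w];
    return dw;
}
```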
## Source Code Organization

- [`grouped_conv_bwd_weight_xdl.cpp`](./grouped_conv_bwd_weight_xdl.cpp): The main example file. It sets up a grouped convolution problem and instantiates the `DeviceGroupedConvBwdWeight` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_grouped_conv_bwd_weight.hpp`](../../include/ck/tensor_operation/gpu/device/device_grouped_conv_bwd_weight.hpp): The high-level device interface. It internally translates the grouped convolution problem into a set of arguments for the `DeviceGroupedGemm` interface.
- [`../../include/ck/tensor_operation/gpu/device/device_grouped_gemm.hpp`](../../include/ck/tensor_operation/gpu/device/device_grouped_gemm.hpp): The underlying Grouped GEMM device interface that is called by the grouped convolution operator.
- [`../../library/include/ck/library/reference_tensor_operation/cpu/reference_grouped_conv_bwd_weight.hpp`](../../library/include/ck/library/reference_tensor_operation/cpu/reference_grouped_conv_bwd_weight.hpp): A CPU reference implementation for verifying the correctness of the GPU kernel.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/20_grouped_conv_bwd_weight
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./grouped_conv_bwd_weight_xdl

# Run with verification, data initialization, and timing
./grouped_conv_bwd_weight_xdl 1 2 1
```

## Importance in Modern CNNs

Grouped and depthwise convolutions are the cornerstone of many efficient, state-of-the-art CNN architectures.

- **Parameter Efficiency**: By not connecting every input channel to every output channel, grouped convolutions significantly reduce the number of weights in a layer, leading to smaller and faster models.
- **Depthwise Separable Convolutions**: Used in MobileNets, EfficientNets, and Xception, these layers factorize a standard convolution into a depthwise convolution (a grouped convolution with `G = C`) and a pointwise convolution (`1x1` conv). The backward pass for the depthwise part requires an efficient `grouped_conv_bwd_weight` implementation.
- **ResNeXt**: This architecture introduced the "cardinality" dimension, which is simply the number of groups in a grouped convolution, demonstrating that increasing the number of groups can be more effective than increasing layer depth or width.

An optimized `grouped_conv_bwd_weight` kernel is therefore not an exotic feature but a critical requirement for training a wide range of modern and efficient deep learning models.
example/21_gemm_layernorm/README.md (new file, 57 lines)
@@ -0,0 +1,57 @@
# GEMM with LayerNorm Fusion

## Theory

This example demonstrates **GEMM fused with layer normalization**. This pattern is used in transformer feed-forward networks and other architectures where a linear transformation is followed by normalization for improved training stability.

**Mathematical Formulation:**
- GEMM: $Y = A \times B$
- LayerNorm: $\text{LayerNorm}(Y) = \gamma \cdot \frac{Y - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$
  - $\mu$: mean of $Y$ over the normalization axis
  - $\sigma^2$: variance of $Y$ over the normalization axis
  - $\gamma$, $\beta$: learnable scale and shift parameters

**Algorithmic Background:**
- The GEMM result is kept in registers, and layer normalization is applied before writing to global memory.
- LayerNorm is typically applied over the last dimension (features).
- This fusion reduces memory traffic and is common in transformer MLP blocks.
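The normalization step applied in the epilogue can be sketched on the CPU as follows. This assumes normalization over the last dimension and omits the learnable parameters (i.e. $\gamma = 1$, $\beta = 0$); the function name is illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// LayerNorm over one row of the GEMM output Y (the feature dimension):
// out[i] = (y[i] - mean) / sqrt(var + eps)
std::vector<float> layernorm_row(const std::vector<float>& y, float eps = 1e-5f)
{
    const float n = static_cast<float>(y.size());
    float mean = 0.f;
    for(float v : y)
        mean += v;
    mean /= n;
    float var = 0.f;
    for(float v : y)
        var += (v - mean) * (v - mean);
    var /= n;
    std::vector<float> out(y.size());
    for(std::size_t i = 0; i < y.size(); ++i)
        out[i] = (y[i] - mean) / std::sqrt(var + eps);
    return out;
}
```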
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/21_gemm_layernorm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_layernorm_xdl --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/21_gemm_layernorm/
└── gemm_layernorm_xdl.cpp            # Main example: sets up, runs, and verifies GEMM+LayerNorm
include/ck/tensor_operation/gpu/device/
└── device_gemm_layernorm.hpp         # Device-level GEMM+LayerNorm API
include/ck/tensor_operation/gpu/device/impl/
└── device_gemm_layernorm_impl.hpp    # Implementation
include/ck/tensor_operation/gpu/grid/
└── gridwise_gemm_layernorm.hpp       # Grid-level kernel
```

### Key Classes and Functions

- **DeviceGemmLayerNorm** (in `device_gemm_layernorm.hpp`): Device API for GEMM fused with layer normalization.
- **gridwise_gemm_layernorm** (in `gridwise_gemm_layernorm.hpp`): Implements the tiled/blocking GEMM kernel with layer normalization epilogue.

This example demonstrates how Composable Kernel supports efficient fusion of linear and normalization layers for transformer and deep learning models.
example/22_cgemm/README.md (new file, 83 lines)
@@ -0,0 +1,83 @@
# Complex General Matrix-Matrix Multiplication (CGEMM)

This example demonstrates a General Matrix-Matrix Multiplication for complex-valued tensors (CGEMM). This operation is a fundamental building block in many scientific and engineering domains, including signal processing, quantum computing, and electromagnetics, where computations are naturally expressed using complex numbers.

## Mathematical Formulation

A complex number `z` can be represented as `z = a + bi`, where `a` is the real part and `b` is the imaginary part. The multiplication of two complex numbers `z1 = a + bi` and `z2 = c + di` is:

$z_1 \cdot z_2 = (a+bi)(c+di) = (ac - bd) + (ad + bc)i$

A CGEMM operation, $D = \alpha \cdot (A \times B) + \beta \cdot C$, involves matrices where each element is a complex number. The core matrix multiplication $A \times B$ is defined as:

$(A \times B)_{ik} = \sum_j A_{ij} \cdot B_{jk}$

Where each multiplication and addition is a complex operation. This can be broken down into four real-valued GEMM operations:

Let $A = A_r + iA_i$ and $B = B_r + iB_i$. Then the product $C = A \times B$ is:

$C = (A_r + iA_i) \times (B_r + iB_i) = (A_r B_r - A_i B_i) + i(A_r B_i + A_i B_r)$

This shows that one CGEMM can be decomposed into four real GEMMs and two real matrix additions/subtractions.
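The elementwise building block of that decomposition is the complex multiply-accumulate, expressed with real arithmetic only. The struct and function below are an illustrative sketch, not CK's `ck::complex` type; on the GPU each line maps to real `fma` instructions on separate real/imaginary accumulators.

```cpp
#include <cassert>

// Minimal complex value and the complex MAC from the formulation above.
struct Cplx
{
    float r, i;
};

// acc += a * b, using only real multiplies and adds:
//   real: ac - bd, imag: ad + bc
inline void complex_mac(Cplx& acc, Cplx a, Cplx b)
{
    acc.r += a.r * b.r - a.i * b.i;
    acc.i += a.r * b.i + a.i * b.r;
}
```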
## Algorithmic Strategy: Fused Complex Arithmetic

A naive implementation would launch six separate real-valued kernels (4 GEMMs, 2 additions). A much more efficient approach, and the one used by Composable Kernel, is to implement CGEMM in a single, fused kernel.

1. **Data Layout**: Complex numbers are typically stored in an interleaved format, where the real and imaginary parts of an element are adjacent in memory (e.g., `[r1, i1, r2, i2, ...]`). The kernel is designed to work efficiently with this layout.

2. **Tiled CGEMM**: The kernel uses a standard tiled GEMM algorithm, but the fundamental operations are adapted for complex numbers.
   - **Loading**: A thread block loads tiles of the complex-valued matrices A and B from global memory into shared memory.
   - **Complex Multiply-Accumulate**: The core of the algorithm is the multiply-accumulate (MAC) operation. Instead of a single `fma` instruction, each complex MAC involves multiple real-valued `fma` instructions to compute the real and imaginary parts of the product, as shown in the mathematical formulation.
     - `real_part = (a_r * b_r) - (a_i * b_i)`
     - `imag_part = (a_r * b_i) + (a_i * b_r)`
   - These operations are carefully scheduled to maximize instruction-level parallelism and hide latency. The accumulators for both the real and imaginary parts are held in private registers.

3. **Storing**: After the tile is fully computed, the complex-valued result is written from registers back to the output matrix D in global memory.

By fusing the complex arithmetic directly into the GEMM kernel, we avoid launching multiple kernels and storing large intermediate real-valued matrices, which dramatically reduces kernel launch overhead and memory bandwidth requirements.

## Source Code Organization

- [`cgemm_xdl.cpp`](./cgemm_xdl.cpp): The main example file. It defines complex-valued input matrices and instantiates the `DeviceGemm` operation, specialized for complex data types.
- The standard `DeviceGemm` interface from [`../../include/ck/tensor_operation/gpu/device/device_gemm.hpp`](../../include/ck/tensor_operation/gpu/device/device_gemm.hpp) is used. Composable Kernel overloads this interface for complex types (`ck::complex<T>`).
- The grid-wise GEMM kernel is specialized to handle complex types. When the template arguments for data types are `ck::complex`, the compiler instantiates a version of the kernel where the MAC operations are replaced with the sequence of real-valued operations required for complex multiplication.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/22_cgemm
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./cgemm_xdl

# Run with verification, data initialization, and timing
./cgemm_xdl 1 2 1
```

## Applications

CGEMM is a critical kernel in many high-performance computing applications:

- **Digital Signal Processing (DSP)**: The Fast Fourier Transform (FFT), a cornerstone of DSP, can be implemented using complex matrix multiplications. Filtering and convolution in the frequency domain also rely on complex arithmetic.
- **Quantum Computing Simulation**: The state of a quantum system is described by a vector of complex numbers, and quantum gates are represented by unitary matrices (a special type of complex matrix). Simulating a quantum circuit involves a sequence of CGEMM operations.
- **Electromagnetics and Wave Physics**: Simulating the propagation of electromagnetic or acoustic waves often involves solving systems of equations with complex numbers to represent the phase and amplitude of the waves.
- **Communications**: Modern communication systems (like 5G and Wi-Fi) use complex modulation schemes (like QAM) where signals are represented by complex numbers.
@@ -1,6 +1,36 @@
-# Instructions for ```example_softmax_blockwise```
+# Parallel Softmax
## Theory

This example demonstrates **parallel softmax computation** over tensors. Softmax is a key operation in deep learning, especially in attention mechanisms and classification, converting logits into normalized probabilities.

**Mathematical Formulation:**
Given input $X$ and a reduction axis $a$, the softmax along that axis is:
$$
\text{softmax}(X)_i = \frac{\exp(X_i)}{\sum_j \exp(X_j)}
$$
where the sum runs over axis $a$.

**Algorithmic Background:**
- Softmax is implemented using a numerically stable algorithm:
  1. Subtract the maximum value for numerical stability.
  2. Exponentiate and sum.
  3. Normalize by the sum.
- Efficient parallel softmax requires careful reduction and memory access patterns.
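The three-step stable algorithm can be sketched on the CPU over a flat vector (one reduction axis); the function name is illustrative, not CK's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Numerically stable softmax: subtract the max, exponentiate and sum, normalize.
std::vector<float> softmax_stable(const std::vector<float>& x)
{
    float m = x[0];
    for(float v : x)
        m = std::max(m, v); // 1. max for numerical stability
    std::vector<float> e(x.size());
    float sum = 0.f;
    for(std::size_t i = 0; i < x.size(); ++i)
    {
        e[i] = std::exp(x[i] - m); // 2. exponentiate (shifted)
        sum += e[i];
    }
    for(float& v : e)
        v /= sum; // 3. normalize by the sum
    return e;
}
```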
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/23_softmax
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
```

## Run ```example_softmax_blockwise```

```bash
# -D <xxx> : input 3-d tensor lengths
# -v <x>   : verification (0=no, 1=yes)
```

@@ -16,3 +46,30 @@ Warm up 1 time
```
Start running 10 times...
Perf: 0.0242877 ms, 259.039 GB/s, DeviceReduceSoftmax<256,M_C8_S1,K_C32_S8,InSrcVectorDim_1_InSrcVectorSize_8_OutDstVectorSize_8>
```

## Source Code Structure

### Directory Layout

```
example/23_softmax/
└── softmax_xdl.cpp            # Main example: sets up, runs, and verifies softmax
include/ck/tensor_operation/gpu/device/
└── device_softmax.hpp         # Device-level softmax API
include/ck/tensor_operation/gpu/device/impl/
└── device_softmax_impl.hpp    # Implementation
include/ck/tensor_operation/gpu/grid/
└── gridwise_softmax.hpp       # Grid-level softmax kernel
include/ck/tensor_operation/gpu/block/
└── blockwise_softmax.hpp      # Block-level softmax
```

### Key Classes and Functions

- **DeviceSoftmax** (in `device_softmax.hpp`): Device API for softmax.
- **gridwise_softmax** (in `gridwise_softmax.hpp`): Implements the tiled/blocking softmax kernel.
- **blockwise_softmax** (in `blockwise_softmax.hpp`): Handles block-level softmax and shared memory.

This example demonstrates how Composable Kernel implements efficient, numerically stable softmax for deep learning models.
example/24_batched_gemm/README.md (new file, 57 lines)
@@ -0,0 +1,57 @@
# Batched GEMM

## Theory

This example demonstrates **batched GEMM**: performing multiple independent matrix multiplications (all with the same shape) in a single kernel launch. Batched GEMM is used in multi-head attention, RNNs, and other models requiring parallel matrix multiplications.

**Mathematical Formulation:**
For $B$ batches:
$$
C_b = A_b \times B_b \quad \text{for} \quad b = 1, 2, \dots, B
$$
- $A_b$: [M, K] input matrix for batch $b$
- $B_b$: [K, N] weight matrix for batch $b$
- $C_b$: [M, N] output matrix for batch $b$

**Algorithmic Background:**
- All matrices in the batch have the same shape and strides.
- The kernel launches a grid covering all batches, with each block assigned to a batch.
- Used for multi-head attention, parallel MLPs, and more.
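Because every matrix in the batch shares a shape and stride, batch `b`'s operands sit at a fixed offset from the base pointers. A minimal sketch of this strided-batched addressing (names are illustrative, not CK's API):

```cpp
#include <cassert>
#include <cstddef>

// Strided-batched addressing: the b-th matrix starts batch_stride elements
// past the base pointer. Each thread block resolves its operands this way.
inline const float* batch_ptr(const float* base, int b, std::size_t batch_stride)
{
    return base + static_cast<std::size_t>(b) * batch_stride;
}
```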
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/24_batched_gemm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./batched_gemm_xdl --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/24_batched_gemm/
└── batched_gemm_xdl.cpp             # Main example: sets up, runs, and verifies batched GEMM
include/ck/tensor_operation/gpu/device/
└── device_batched_gemm_xdl.hpp      # Device-level batched GEMM API
include/ck/tensor_operation/gpu/grid/
└── gridwise_batched_gemm_xdl.hpp    # Grid-level batched GEMM kernel
```

### Key Classes and Functions

- **DeviceBatchedGemmXdl** (in `device_batched_gemm_xdl.hpp`): Device API for batched GEMM.
- **gridwise_batched_gemm_xdl** (in `gridwise_batched_gemm_xdl.hpp`): Implements the tiled/blocking batched GEMM kernel.

This example demonstrates how Composable Kernel supports efficient parallel matrix multiplication for batched and multi-head workloads.
example/25_gemm_bias_e_permute/README.md (new file, 56 lines)
@@ -0,0 +1,56 @@
# GEMM with Bias, Elementwise, and Permute Fusion

## Theory

This example demonstrates **GEMM fused with bias addition, elementwise operation, and permutation**. This pattern is used in transformer models and other neural architectures where a linear transformation is followed by bias, activation, and layout transformation.

**Mathematical Formulation:**
- GEMM: $Y = A \times B$
- Bias: $Z = Y + \text{bias}$
- Elementwise: $E = f(Z)$ (e.g., activation)
- Permute: $O = \text{permute}(E, \text{axes})$

**Algorithmic Background:**
- The GEMM result is kept in registers, bias and elementwise ops are fused in the epilogue, and permutation is applied before writing to global memory.
- Permutation changes the layout/order of tensor axes (e.g., NCHW to NHWC).
- This fusion reduces memory traffic and is common in transformer and CNN pipelines.
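Per output element, the fused epilogue amounts to: bias add, elementwise op, then a permuted store. The sketch below is illustrative (names and signature are assumptions); it uses ReLU as the elementwise op and a simple (M, N) → (N, M) transpose as the permutation.

```cpp
#include <cassert>
#include <vector>

// Fused epilogue for one GEMM output value y at logical position (i, j):
// add the per-column bias, apply the elementwise op, and store to the
// permuted (transposed) coordinate — all without a round trip to memory.
void epilogue_store(std::vector<std::vector<float>>& out, // N x M (permuted layout)
                    float y, float bias, int i, int j)
{
    float z = y + bias;          // bias add
    float e = z > 0.f ? z : 0.f; // elementwise op (ReLU assumed here)
    out[j][i] = e;               // permuted store: axes swapped
}
```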
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/25_gemm_bias_e_permute
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./gemm_bias_e_permute_xdl --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/25_gemm_bias_e_permute/
└── gemm_bias_e_permute_xdl.cpp            # Main example: sets up, runs, and verifies GEMM+Bias+Elementwise+Permute
include/ck/tensor_operation/gpu/device/
└── device_gemm_bias_e_permute.hpp         # Device-level API for fused GEMM
include/ck/tensor_operation/gpu/device/impl/
└── device_gemm_bias_e_permute_impl.hpp    # Implementation
include/ck/tensor_operation/gpu/grid/
└── gridwise_gemm_bias_e_permute.hpp       # Grid-level kernel
```

### Key Classes and Functions

- **DeviceGemmBiasEPermute** (in `device_gemm_bias_e_permute.hpp`): Device API for GEMM fused with bias, elementwise, and permutation.
- **gridwise_gemm_bias_e_permute** (in `gridwise_gemm_bias_e_permute.hpp`): Implements the tiled/blocking GEMM kernel with fused epilogue and permutation.

This example demonstrates how Composable Kernel supports efficient fusion of linear, bias, activation, and layout operations for deep learning models.
|
||||
# Tensor Contraction

This example demonstrates a tensor contraction operation, which generalizes matrix multiplication to tensors of arbitrary rank (number of dimensions). Tensor contractions are fundamental to many algorithms in physics, chemistry, and machine learning, particularly in the field of tensor networks.
## Mathematical Formulation

A tensor contraction sums the product of two tensors over a specified set of indices. It is most clearly expressed using Einstein summation (einsum) notation.

For example, a standard matrix multiplication $C_{ik} = \sum_j A_{ij} B_{jk}$ is written in einsum notation as:
`ik = ij,jk`

A tensor contraction can involve more dimensions and more contracted indices. For instance, contracting a 3D tensor `A` with a 4D tensor `B`:

$D_{imn} = \sum_{j,k} A_{ijk} B_{kjmn}$

In einsum notation, this is:
`imn = ijk,kjmn`

Here, the `j` and `k` indices are the "contracted" or "summation" indices, while `i`, `m`, and `n` are the "free" or "output" indices.

Composable Kernel's contraction operation can perform any such contraction, provided there is a clear distinction between contracted indices and free indices for each tensor.
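The contraction `imn = ijk,kjmn` can be checked by brute force. This pure-Python sketch uses small made-up dimensions and deterministic fill values (all assumptions for illustration):

```python
# Brute-force evaluation of D_imn = sum_{j,k} A_ijk * B_kjmn
# (einsum "ijk,kjmn->imn") on tiny made-up dimensions.
I, J, K, M, N = 2, 3, 2, 2, 2

# Small deterministic fill values.
A = [[[i + j + k for k in range(K)] for j in range(J)] for i in range(I)]
B = [[[[k + j + m + n for n in range(N)] for m in range(M)]
      for j in range(J)] for k in range(K)]

# Free indices i, m, n survive; j and k are summed away.
D = [[[sum(A[i][j][k] * B[k][j][m][n] for j in range(J) for k in range(K))
       for n in range(N)] for m in range(M)] for i in range(I)]

print(D[0][0][0])  # one fully contracted element
```

The triple-nested output comprehension mirrors the free indices, while the inner `sum` runs over every `(j, k)` pair, exactly as the formula states.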
## Algorithmic Strategy: Mapping Contraction to GEMM

The dominant strategy for performing tensor contractions efficiently on GPUs is to reshape or "flatten" the input tensors into 2D matrices, perform a standard, highly optimized GEMM, and then reshape the resulting matrix back into the desired output tensor shape.

1. **Tensor-to-Matrix Reshaping**:
   - The dimensions of each input tensor are partitioned into two sets: the contracted dimensions and the free (non-contracted) dimensions.
   - The tensor is then treated as a 2D matrix by flattening all the free dimensions into the "row" dimension (M for tensor A, N for tensor B) and all the contracted dimensions into the "column" dimension (K).
   - For example, in the contraction `imn = ijk,kjmn`:
     - Tensor A (`ijk`): the free index is `i`; the contracted indices are `jk`. It is reshaped into a matrix A' of shape `[i, (j*k)]`.
     - Tensor B (`kjmn`): the free indices are `mn`; the contracted indices are `kj`. It is reshaped into a matrix B' of shape `[(k*j), (m*n)]`.
   - The GEMM computes `D' = A' x B'`. The resulting matrix D' has shape `[i, (m*n)]`.

2. **High-Performance GEMM**: A standard, block-tiled GEMM kernel performs the matrix multiplication `A' x B'`. This is the computationally intensive part of the operation.

3. **Output Reshaping**: The resulting 2D matrix `D'` is logically reshaped back into the desired multi-dimensional output tensor `D` of shape `[i, m, n]`.

Crucially, the reshaping operations are often *logical*. The data is not physically moved or transposed in global memory. Instead, the GEMM kernel is provided with a "tensor descriptor" that understands the original N-dimensional layout and calculates the correct memory addresses for the flattened 2D view on the fly. This avoids costly data movement and is key to performance.
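The three steps above can be sketched in pure Python. This is a hedged illustration with made-up dimensions; in CK the "reshape" is done by tensor descriptors without materializing `A'` or `B'`, whereas here the flattened matrices are built explicitly for clarity:

```python
# Sketch of the "contraction as GEMM" mapping for imn = ijk,kjmn:
# flatten A to A'[i, j*k] and B to B'[j*k, m*n], run a plain GEMM,
# then reinterpret D'[i, m*n] as D[i, m, n].
I, J, K, M, N = 2, 3, 2, 2, 2
A = [[[i + j + k for k in range(K)] for j in range(J)] for i in range(I)]
B = [[[[k + j + m + n for n in range(N)] for m in range(M)]
      for j in range(J)] for k in range(K)]

# The (j, k) pair is enumerated identically on both sides, which is what
# CK's descriptors guarantee logically, without moving any data.
Ap = [[A[i][j][k] for j in range(J) for k in range(K)] for i in range(I)]
Bp = [[B[k][j][m][n] for m in range(M) for n in range(N)]
      for j in range(J) for k in range(K)]

# Plain GEMM over the flattened K dimension of size J*K.
Dp = [[sum(Ap[i][kk] * Bp[kk][mn] for kk in range(J * K))
       for mn in range(M * N)] for i in range(I)]

# D'[i][m*N + n] matches the direct contraction sum_{j,k} A_ijk * B_kjmn.
direct = sum(A[0][j][k] * B[k][j][0][0] for j in range(J) for k in range(K))
assert Dp[0][0] == direct
```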
## Source Code Organization

- [`contraction_xdl.cpp`](./contraction_xdl.cpp): The main example file. It defines the input tensors and their layouts, specifies the contraction indices, and instantiates the `DeviceContraction` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_contraction_multiple_d.hpp`](../../include/ck/tensor_operation/gpu/device/device_contraction_multiple_d.hpp): The high-level device interface for the contraction operation. It is highly generic and takes tensor descriptors that define the complex layouts and index mappings.
- The device interface internally creates a plan to map the contraction to a GEMM, then calls a standard `DeviceGemm` instance to execute it. The intelligence lies in how the tensor descriptors are configured to present a 2D matrix view of the higher-dimensional tensor data to the underlying GEMM kernel.
## Build and Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build the Example
```bash
cd /path/to/composable_kernel/example/26_contraction
mkdir build && cd build

cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..

make -j
```

### Run the Example
```bash
# arg1: verification (0=no, 1=yes)
# arg2: initialization (0=no init, 1=integer value, 2=decimal value)
# arg3: time kernel (0=no, 1=yes)

# Run the example with default settings
./contraction_xdl

# Run with verification, data initialization, and timing
./contraction_xdl 1 2 1
```
## Applications

Tensor contractions are the core computational primitive in a wide range of fields:

- **Tensor Network Methods**: In physics and chemistry, methods like DMRG (Density Matrix Renormalization Group) and PEPS (Projected Entangled Pair States) use networks of interconnected tensors to represent complex quantum states. Simulating these systems involves sequences of tensor contractions.
- **High-Order Statistics**: In data analysis, computing higher-order moments (such as skewness or kurtosis) can be expressed as tensor contractions.
- **Relativistic Physics**: Many equations in general relativity are expressed in the language of tensors and involve contractions.
- **Advanced Deep Learning Models**: Some research models, particularly in areas like quantum machine learning or geometric deep learning, use tensor contractions as a primary layer type, going beyond standard matrix multiplication.
example/27_layernorm2d_fwd/README.md (new file, 59 lines)
# 2D Layer Normalization Forward

## Theory

This example demonstrates the **2D layer normalization forward pass**. Layer normalization is used in transformers and other neural networks to normalize activations across the feature dimension, improving training stability.

**Mathematical Formulation:**

Given a 2D input $X[M, N]$ (M rows, each with N features), for each row $m$:
- Mean: $\mu_m = \frac{1}{N} \sum_{n} X_{m,n}$
- Variance: $\sigma_m^2 = \frac{1}{N} \sum_{n} (X_{m,n} - \mu_m)^2$
- Normalized: $\hat{X}_{m,n} = \frac{X_{m,n} - \mu_m}{\sqrt{\sigma_m^2 + \epsilon}}$
- Output: $Y_{m,n} = \gamma_n \hat{X}_{m,n} + \beta_n$

$\gamma$ and $\beta$ are learnable scale and shift parameters.

**Algorithmic Background:**

- Computes the mean and variance per row (across all features).
- Applies normalization and an affine transformation.
- Used in transformer blocks and normalization layers.
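The row-wise formulation above can be written directly as a reference implementation. This is a minimal sketch; the epsilon and sample values are made-up illustrations, not the example's defaults:

```python
import math

# Row-wise layernorm reference: per row, subtract the mean, divide by the
# standard deviation (with eps for stability), then apply gamma/beta.
def layernorm2d_fwd(X, gamma, beta, eps=1e-5):
    Y = []
    for row in X:
        n = len(row)
        mu = sum(row) / n                          # mean over features
        var = sum((x - mu) ** 2 for x in row) / n  # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        Y.append([g * (x - mu) * inv_std + b
                  for x, g, b in zip(row, gamma, beta)])
    return Y

X = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
Y = layernorm2d_fwd(X, gamma=[1.0, 1.0, 1.0], beta=[0.0, 0.0, 0.0])
print(Y[1])  # a constant row normalizes to all zeros
```

With identity `gamma` and zero `beta`, each output row has approximately zero mean and unit variance.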
## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/example/27_layernorm2d_fwd
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./layernorm2d_fwd_xdl --verify=1 --time=1
```

## Source Code Structure

### Directory Layout
```
example/27_layernorm2d_fwd/
├── layernorm2d_fwd_xdl.cpp           # Main example: sets up, runs, and verifies 2D layernorm
include/ck/tensor_operation/gpu/device/
│ └── device_layernorm_fwd.hpp        # Device-level layernorm API
include/ck/tensor_operation/gpu/device/impl/
│ └── device_layernorm_fwd_impl.hpp   # Implementation
include/ck/tensor_operation/gpu/grid/
└── gridwise_layernorm_fwd.hpp        # Grid-level kernel
```

### Key Classes and Functions

- **DeviceLayernormFwd** (in `device_layernorm_fwd.hpp`):
  Device API for layer normalization.
- **gridwise_layernorm_fwd** (in `gridwise_layernorm_fwd.hpp`):
  Implements the tiled/blocking layernorm kernel.

This example demonstrates how Composable Kernel implements efficient layer normalization for transformer and deep learning models.
example/28_grouped_gemm_bias_e_permute/README.md (new file, 84 lines)
# Grouped GEMM with Bias, Elementwise Operation, and Permutation

This example demonstrates a highly specialized fusion: a **Grouped GEMM** where each individual GEMM operation is fused with a bias addition, a second elementwise operation, and a final permutation of the output. This kernel is designed to accelerate layers with a group-parallel structure, such as depthwise separable convolutions or multi-head attention, when they are part of a larger fused computational graph.

## Mathematical Formulation

This operation performs `G` independent fused GEMM operations in parallel, where `G` is the group count. For each group `g` from `0` to `G-1`:

1. **GEMM Stage**: A standard matrix multiplication.
   $C_{temp1[g]} = A_{[g]} \times B_{[g]}$

2. **Bias Addition Stage**: A bias vector $D_{[g]}$ is broadcast and added.
   $C_{temp2[g]} = C_{temp1[g]} + D_{[g]}$

3. **Elementwise Stage**: A second elementwise operation is performed with tensor $E_{[g]}$.
   $C_{temp3[g]} = C_{temp2[g]} \odot E_{[g]}$

4. **Permutation Stage**: The final result for the group is permuted.
   $F_{[g]} = \text{permute}(C_{temp3[g]})$

All four stages for all `G` groups are executed within a single kernel launch. The intermediate results are kept in registers and never written to global memory.
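A defining property of the grouped variant is that each group may have its own problem size. The sketch below (pure Python, made-up shapes; Hadamard product as the elementwise op and a transpose as the permutation, both illustrative assumptions) runs the four stages for two groups with different `(M, N, K)`:

```python
# Per-group fused pipeline: GEMM -> bias -> elementwise -> permute.
def gemm(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def fused_group(A, B, D, E):
    C1 = gemm(A, B)                                                   # GEMM
    C2 = [[c + d for c, d in zip(r, D)] for r in C1]                  # bias add
    C3 = [[c * e for c, e in zip(r2, rE)] for r2, rE in zip(C2, E)]   # elementwise
    return [list(col) for col in zip(*C3)]                            # permute

# Two groups with different problem sizes, as grouped GEMM allows.
groups = [
    ([[1.0, 2.0]], [[1.0], [1.0]], [0.5], [[2.0]]),            # M=1, K=2, N=1
    ([[1.0]], [[3.0, 4.0]], [1.0, 1.0], [[1.0, 0.5]]),         # M=1, K=1, N=2
]
F = [fused_group(*g) for g in groups]
print(F)
```

On the GPU, each thread block runs this entire pipeline for its assigned group, with `C1`–`C3` held in registers.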
## Algorithmic Strategy: Group-Parallel GEMM with Fused Epilogue

The implementation combines the scheduling strategy of Grouped GEMM with the multi-stage fused epilogue seen in `25_gemm_bias_e_permute`.

1. **Group Scheduling**: The `G` independent problems are distributed across the GPU's thread blocks. The grid-wise kernel is designed so that each thread block is assigned to compute one of the `G` fused operations.

2. **Fused GEMM Execution**: Once a thread block is assigned a group `g`, it executes a complete fused GEMM for that group's data. This involves:
   - Calculating the base memory addresses for $A_{[g]}, B_{[g]}, D_{[g]}, E_{[g]}$, and $F_{[g]}$ using the group index and that group's problem description.
   - Executing a standard tiled GEMM for $A_{[g]} \times B_{[g]}$, accumulating the result in registers.
   - Executing the fused epilogue:
     - Load the bias $D_{[g]}$ and add it.
     - Load the elementwise tensor $E_{[g]}$ and apply the operation.
     - Calculate the permuted destination coordinates and write the final result to $F_{[g]}$.

This approach maximizes parallelism at two levels: coarse-grained parallelism across the `G` groups, and fine-grained data parallelism within each individual GEMM operation.
## Source Code Organization

- [`grouped_gemm_bias_e_permute_xdl.cpp`](./grouped_gemm_bias_e_permute_xdl.cpp): The main example file. It demonstrates the setup for a grouped problem, defining the `G` sets of input tensors and the permutation, then instantiates the `DeviceGroupedGemmBiasEPermute` operation.
- [`../../include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_bias_e_permute_impl.hpp`](../../include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_bias_e_permute_impl.hpp): The high-level device interface for this specific fused operation. It takes arrays of tensor descriptors, one for each group.
- The underlying grid-wise kernel contains the logic to map thread blocks to groups and then execute the full fused GEMM pipeline for the assigned group.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/28_grouped_gemm_bias_e_permute
mkdir build && cd build

cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./grouped_gemm_bias_e_permute_xdl

# Run with verification, data initialization, and timing
./grouped_gemm_bias_e_permute_xdl 1 2 1
```

## Applications

This highly specialized kernel is valuable for optimizing specific patterns in modern neural networks:

- **Multi-Head Attention (MHA)**: The computation for each head in MHA is independent. The entire MHA block can be viewed as a Grouped GEMM where the number of groups `G` is the number of attention heads. If the Q, K, or V projections involve fusions with bias, other elementwise ops, and permutations to prepare the data for the batched GEMM, this kernel could potentially fuse a large part of that logic.
- **Depthwise Separable Convolutions**: The depthwise part of this convolution is a Grouped GEMM with `G` equal to the number of channels. If it is followed by a fused activation function (e.g., a gated activation) and a permutation, this kernel could be a perfect match.
- **Mixture-of-Experts (MoE) Models**: In MoE layers, an input is routed to one of several "expert" sub-networks. If these experts have identical structure, their execution can be formulated as a Grouped GEMM, where `G` is the number of experts. Any fusions within the expert network could be captured by this kernel.

This example showcases the composability of the library, allowing the creation of highly tailored, high-performance kernels for complex, group-parallel computational graphs.
example/29_batched_gemm_bias_e_permute/README.md (new file, 91 lines)
# Batched GEMM with Bias, Elementwise Operation, and Permutation

This example demonstrates a **Batched GEMM** where each individual GEMM operation is fused with a bias addition, a second elementwise operation, and a final permutation of the output. This kernel is designed to accelerate layers with a batch-parallel structure, such as the dense layers in a Transformer's feed-forward network, when they are part of a larger fused computational graph.

## Mathematical Formulation

This operation performs `B` independent fused GEMM operations in parallel, where `B` is the batch count. For each batch item `b` from `0` to `B-1`:

1. **GEMM Stage**: A standard matrix multiplication.
   $C_{temp1[b]} = A_{[b]} \times B_{[b]}$

2. **Bias Addition Stage**: A bias vector $D_{[b]}$ is broadcast and added.
   $C_{temp2[b]} = C_{temp1[b]} + D_{[b]}$

3. **Elementwise Stage**: A second elementwise operation is performed with tensor $E_{[b]}$.
   $C_{temp3[b]} = C_{temp2[b]} \odot E_{[b]}$

4. **Permutation Stage**: The final result for the batch item is permuted.
   $F_{[b]} = \text{permute}(C_{temp3[b]})$

All four stages for all `B` batch items are executed within a single kernel launch. The intermediate results are kept in registers and never written to global memory.

**Distinction from the Grouped Version**:
- In this **Batched** version, all `B` problems are uniform. They share the same dimensions (M, N, K), layouts, and permutations. The input/output tensors are accessed with a constant batch stride.
- In the **Grouped** version (`28_grouped_gemm_bias_e_permute`), each of the `G` problems can have different dimensions, layouts, and strides, offering more flexibility.
## Algorithmic Strategy: Batch-Parallel GEMM with Fused Epilogue

The implementation combines the scheduling strategy of Batched GEMM with the multi-stage fused epilogue.

1. **Batch Scheduling**: The `B` independent problems are distributed across the GPU's thread blocks. The grid-wise kernel is designed so that each thread block is assigned to compute one of the `B` fused operations.

2. **Fused GEMM Execution**: Once a thread block is assigned a batch item `b`, it executes a complete fused GEMM for that item's data. This involves:
   - Calculating the base memory addresses for $A_{[b]}, B_{[b]}, D_{[b]}, E_{[b]}$, and $F_{[b]}$ using the batch index and the constant batch stride.
   - Executing a standard tiled GEMM for $A_{[b]} \times B_{[b]}$, accumulating the result in registers.
   - Executing the fused epilogue:
     - Load the bias $D_{[b]}$ and add it.
     - Load the elementwise tensor $E_{[b]}$ and apply the operation.
     - Calculate the permuted destination coordinates and write the final result to $F_{[b]}$.

This approach is extremely efficient when the batch size `B` is large enough to saturate the GPU's parallelism.
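Because every batch item shares one problem shape, base addresses reduce to a single multiply. A hedged sketch of constant-batch-stride addressing for a row-major `[B, M, K]` buffer (shapes and strides here are illustrative, not the example's actual layout):

```python
# Constant-batch-stride addressing: the offset of A_[b] element (m, k)
# in one contiguous row-major [B, M, K] buffer.
M, N, K, B = 2, 3, 4, 5

a_batch_stride = M * K   # elements between consecutive A_[b] matrices
e_batch_stride = M * N   # elements between consecutive output items

def a_offset(b, m, k):
    return b * a_batch_stride + m * K + k

# Batch item 3, element (1, 2): 3*8 + 1*4 + 2 = 30.
assert a_offset(3, 1, 2) == 30
```

The grouped variant cannot use this shortcut; it must look up each group's own strides from its problem descriptor.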
## Source Code Organization

- [`batched_gemm_bias_e_permute_xdl.cpp`](./batched_gemm_bias_e_permute_xdl.cpp): The main example file. It sets up the batched problem, defining the batch size, strides, and the single permutation rule that applies to all batch items, then instantiates the `DeviceBatchedGemmBiasEPermute` operation.
- [`../../include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_bias_e_permute_impl.hpp`](../../include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_bias_e_permute_impl.hpp): The high-level device interface for this specific fused operation.
- The underlying grid-wise kernel contains the logic to map thread blocks to batch items (`block_to_batch`) and then execute the full fused GEMM pipeline for the assigned item.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/29_batched_gemm_bias_e_permute
mkdir build && cd build

cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./batched_gemm_bias_e_permute_xdl

# Run with verification, data initialization, and timing
./batched_gemm_bias_e_permute_xdl 1 2 1
```

## Applications

This kernel is ideal for optimizing the feed-forward network (FFN) block in a Transformer, especially when layout transformations are needed between layers.

A typical Transformer FFN block is:
`FFN(X) = Linear_2(ReLU(Linear_1(X)))`

- `Linear_1` is a GEMM.
- `ReLU` is an elementwise activation.
- `Linear_2` is another GEMM.

Sometimes, for performance reasons (e.g., to align with a subsequent layer's expected input layout), the output of the FFN needs to be permuted. This kernel can fuse the `Linear_2` GEMM with its bias, a subsequent elementwise operation (if any), and the final permutation, all while operating on a batch of input sequences. This avoids multiple kernel launches and saves significant memory bandwidth, leading to faster model execution.
# Grouped Convolution Forward with Multiple Elementwise Inputs

This example demonstrates a **Grouped Convolution Forward Pass** fused with an elementwise operation that takes multiple auxiliary input tensors (`D` tensors). This powerful fusion combines the parallel structure of grouped convolutions with the ability to merge subsequent elementwise layers, such as custom activations or residual connections, into a single kernel.
## Mathematical Formulation

This operation performs `G` independent fused convolution operations in parallel, where `G` is the group count. For each group `g` from `0` to `G-1`:

1. **Convolution Stage**: A standard N-dimensional forward convolution is performed for the group.
   $C_{out[g]} = \text{Conv}(\text{In}_{[g]}, \text{W}_{[g]})$

2. **Elementwise Stage**: The result of the convolution is combined with one or more auxiliary tensors ($D_{0[g]}, D_{1[g]}, \dots$) using a user-defined elementwise function `f`.
   $E_{[g]} = f(C_{out[g]}, D_{0[g]}, D_{1[g]}, \dots)$

The key optimization is that the intermediate convolution result, $C_{out[g]}$, is never written to global memory. It is computed and held in registers, then immediately consumed by the elementwise part of the kernel's epilogue before the final result `E` is stored.

## Algorithmic Strategy: Implicit Grouped GEMM with Fused Multi-D Epilogue

The implementation combines three core concepts: the implicit GEMM transformation for convolutions, the group-parallel scheduling of Grouped GEMM, and a multi-input fused epilogue.

1. **Group Scheduling**: The `G` independent problems are distributed across the GPU's thread blocks. Each thread block is assigned to compute the fused convolution for one of the `G` groups.

2. **Implicit GEMM Core**: Once a thread block is assigned a group `g`, it executes the convolution for that group using the implicit GEMM algorithm. This involves:
   - Calculating the base memory addresses for the group's tensors: $\text{In}_{[g]}, \text{W}_{[g]}, D_{0[g]}, \dots, E_{[g]}$.
   - Performing a tiled GEMM, where tiles of the input `In` and weights `W` are read (with the `im2col` transformation happening on the fly) and the result is accumulated in registers.

3. **Fused Multi-D Epilogue**: Before writing the result to global memory, the epilogue performs the elementwise fusion:
   - Threads load the corresponding tiles from the auxiliary `D` tensors for the assigned group.
   - The user-defined elementwise function `f` is applied in registers to the convolution result and the `D` tensor values.
   - The final result `E` for the group is written to global memory.

This strategy minimizes memory bandwidth by avoiding materialization of the intermediate convolution output and maximizes parallelism by executing all groups concurrently.
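The conv-plus-elementwise fusion can be illustrated in one dimension. This is a minimal sketch with made-up shapes and values; `f = add` plays the role of a residual connection, and the accumulator standing in for registers:

```python
# 1D "valid" convolution whose result is combined with an auxiliary
# tensor D0 via a user-supplied elementwise function f, without ever
# storing the raw convolution output.
def conv1d_fused(inp, w, d0, f):
    out_len = len(inp) - len(w) + 1
    out = []
    for i in range(out_len):
        acc = sum(inp[i + j] * w[j] for j in range(len(w)))  # conv result "in registers"
        out.append(f(acc, d0[i]))                            # fused epilogue
    return out

inp = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0]
d0 = [10.0, 10.0, 10.0]
print(conv1d_fused(inp, w, d0, lambda c, d: c + d))
```

Note that `acc` is consumed immediately by `f`; the kernel's epilogue does the same thing at tile granularity.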
## Source Code Organization

- [`grouped_conv_fwd_multiple_d_xdl.cpp`](./grouped_conv_fwd_multiple_d_xdl.cpp): The main example file. It sets up the grouped convolution problem, including the multiple `D` tensors, and instantiates the `DeviceGroupedConvFwdMultipleD` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_grouped_conv_fwd_multiple_d.hpp`](../../include/ck/tensor_operation/gpu/device/device_grouped_conv_fwd_multiple_d.hpp): The high-level device interface for this operation. It takes arrays of tensor descriptors, one per group for each of the `D` tensors.
- The underlying grid-wise kernel contains the logic to map thread blocks to groups and then execute the full implicit GEMM pipeline with the fused multi-D epilogue for the assigned group.

## Build and Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build the Example
```bash
cd /path/to/composable_kernel/example/30_grouped_conv_fwd_multiple_d
mkdir build && cd build

cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..

make -j
```

### Run the Example
```bash
# arg1: verification (0=no, 1=yes)
# arg2: initialization (0=no init, 1=integer value, 2=decimal value)
# Following arguments (depending on number of spatial dims): ...
./bin/example_grouped_conv_fwd_bias_relu_add_xdl_fp16 1 1 1
```

## Applications

This kernel is ideal for optimizing layers in modern CNNs that use grouped convolutions followed by complex activations or residual connections.

- **Fused Residual Connections**: A common pattern is `Conv(x) + x`. This can be implemented by passing the input `x` as a `D` tensor and defining the elementwise function as `f(conv_out, d0) = conv_out + d0`. If this is a grouped convolution, this kernel is a perfect fit.
- **Custom Gated Activations**: Some architectures use gated activations, such as `Conv_A(x) * sigmoid(Conv_B(x))`. While this kernel does not compute two convolutions, it can fuse one convolution with an elementwise multiplication against another tensor. For example, it could compute `Conv_A(x) * D0`, where `D0` is the precomputed `sigmoid(Conv_B(x))`.
- **Depthwise Separable Convolutions**: These layers consist of a depthwise convolution (a grouped convolution with `G = C`) followed by a pointwise (`1x1`) convolution. If there is a residual connection or other elementwise operation after the depthwise stage, this kernel can fuse it directly, improving the performance of this widely used building block.
example/31_batched_gemm_gemm/README.md (new file, 73 lines)
# Fused Batched GEMM-GEMM

This example demonstrates a **Batched GEMM-GEMM** operation, where two sequential general matrix-matrix multiplications are fused into a single high-performance kernel. This pattern is common in multi-layer perceptrons (MLPs) and is a core component of the feed-forward network (FFN) block in Transformer models.

## Mathematical Formulation

The operation computes a chain of two matrix multiplications, batched `B` times. For each batch item `b` from `0` to `B-1`:

1. **First GEMM (GEMM0)**:
   $D_{temp[b]} = A_{[b]} \times B_{[b]}$
   where `A` has shape `[B, M, K0]` and `B` has shape `[B, K0, N]`. The intermediate result `D_temp` has shape `[B, M, N]`.

2. **Second GEMM (GEMM1)**:
   $E_{[b]} = D_{temp[b]} \times C_{[b]}$
   where `D_temp` (the output of GEMM0) has shape `[B, M, N]` and `C` has shape `[B, N, K1]`. The final output `E` has shape `[B, M, K1]`.

The critical optimization is that the intermediate tensor `D_temp` is **never written to global memory**. It is produced and consumed entirely within the GPU's on-chip memory (registers and LDS/shared memory), saving a large amount of memory bandwidth.

## Algorithmic Strategy: Fused GEMM-GEMM via Shared Memory

The implementation uses a batch-parallel approach where each thread block is assigned a single batch item. Within the block, the two GEMMs are fused using shared memory as a buffer.

1. **Batch Scheduling**: The `B` independent GEMM-GEMM problems are distributed across the GPU's thread blocks. Each thread block is assigned to compute the full chain for one batch item `b`.

2. **Fused Execution within a Thread Block**:
   - **Compute GEMM0 Tile**: The thread block first computes a tile of the intermediate tensor $D_{temp[b]}$ using a standard tiled GEMM algorithm. The result is stored directly into a designated region of **shared memory (LDS)**.
   - **Synchronization**: A block-wide synchronization (`__syncthreads()`) is performed. This critical step ensures the *entire* tile of $D_{temp[b]}$ is visible to all threads in the block before the second GEMM begins.
   - **Compute GEMM1 Tile**: The threads then immediately start computing the second GEMM, using the intermediate tile in shared memory as the "A" matrix and multiplying it with tiles of the `C` matrix. The result is accumulated in registers.
   - **Store Final Result**: Once a tile of the final output `E` is computed, it is written to global memory.

This "producer-consumer" pattern within a thread block is highly efficient. It treats shared memory as a fast, programmable cache for the intermediate tensor, completely avoiding the slow round trip to global HBM memory.
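The chained computation can be stated as a one-batch-item reference. A minimal sketch with tiny made-up shapes; here the "on-chip" intermediate is simply a local variable that is never stored elsewhere:

```python
# Reference for the fused GEMM-GEMM chain: E = (A x B) x C.
def gemm(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

A = [[1.0, 0.0], [0.0, 1.0]]   # [M, K0] = [2, 2]
B = [[1.0, 2.0], [3.0, 4.0]]   # [K0, N] = [2, 2]
C = [[1.0], [1.0]]             # [N, K1] = [2, 1]

D_temp = gemm(A, B)            # GEMM0 result, consumed immediately
E = gemm(D_temp, C)            # GEMM1
print(E)
```

Since `A` is the identity and `C` is all ones, `E` collects the row sums of `B`, which makes the chain easy to verify by hand.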
## Source Code Organization
|
||||
|
||||
- [`batched_gemm_gemm_xdl.cpp`](./batched_gemm_gemm_xdl.cpp): The main example file. It sets up the three input tensors (A, B, C) for the batched operation and instantiates the `DeviceBatchedGemmGemm` operation.
|
||||
- [`../../include/ck/tensor_operation/gpu/device/device_batched_gemm_gemm.hpp`](../../include/ck/tensor_operation/gpu/device/device_batched_gemm_gemm.hpp): The high-level device interface for the fused Batched GEMM-GEMM operation.
|
||||
- The underlying grid-wise kernel implements the complex fusion logic, managing the register usage for GEMM0, the write to shared memory, the synchronization, and the subsequent computation of GEMM1 using the data from shared memory.
|
||||
|
||||
## Build and Run
|
||||
|
||||
### Prerequisites
|
||||
|
||||
Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build the Example
```bash
cd /path/to/composable_kernel/example/31_batched_gemm_gemm
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./batched_gemm_gemm_xdl

# Run with verification, data initialization, and timing
./batched_gemm_gemm_xdl 1 2 1
```

## Application to Transformer FFN

This kernel is well suited to optimizing the feed-forward network (FFN) block found in every layer of a Transformer model. The FFN is typically defined as:

`FFN(X) = Linear_2(Activation(Linear_1(X)))`

where `Linear_1` and `Linear_2` are dense layers (GEMMs). If the activation function can also be fused (e.g., ReLU or GELU), an even more complex kernel can be used. However, this GEMM-GEMM kernel provides the core fusion for the two most computationally expensive parts of the FFN. By fusing `Linear_1` and `Linear_2`, the kernel significantly reduces the latency and memory traffic of the FFN block, leading to faster end-to-end model training and inference.

---

**`example/32_batched_gemm_scale_softmax_gemm/README.md`** (new file, 61 lines)
# Batched GEMM-Scale-Softmax-GEMM: Fused Attention

## Theory

This example demonstrates the **fused attention mechanism** used in transformer models, implementing the sequence batched Q×K^T → scaling → softmax → ×V in a single kernel. This pattern is critical for efficient transformer inference and training.

**Mathematical Formulation:**
- Attention: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
- $Q$: `[B, H, N, d_k]` queries
- $K$: `[B, H, N, d_k]` keys
- $V$: `[B, H, N, d_v]` values
- $O$: `[B, H, N, d_v]` output

**Algorithmic Background:**
- Computes Q×K^T, scales by $1/\sqrt{d_k}$, applies softmax, then multiplies by V.
- Uses a numerically stable softmax and memory-efficient tiling.
- Used in multi-head attention and transformer blocks.
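The formulation above can be written out for a single query row as a plain-Python reference (illustration only; the CK kernel tiles this and uses an online softmax, but the result is the same):

```python
import math

def stable_softmax(xs):
    # Subtracting the max before exp() prevents overflow: this is the
    # "numerically stable" softmax referenced above.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_row(q, K, V):
    """O_row = softmax(q . K^T / sqrt(d_k)) @ V for one query vector q."""
    d_k = len(q)
    scores = [sum(q[i] * k[i] for i in range(d_k)) / math.sqrt(d_k) for k in K]
    p = stable_softmax(scores)
    d_v = len(V[0])
    return [sum(p[n] * V[n][j] for n in range(len(V))) for j in range(d_v)]
```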

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/example/32_batched_gemm_scale_softmax_gemm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./batched_gemm_scale_softmax_gemm_xdl --batch=32 --heads=12 --seq_len=512 --head_dim=64 --verify=1 --time=1
```

## Source Code Structure

### Directory Layout
```
example/32_batched_gemm_scale_softmax_gemm/
├── batched_gemm_scale_softmax_gemm_xdl.cpp         # Main example: sets up, runs, and verifies fused attention
include/ck/tensor_operation/gpu/device/
└── device_batched_gemm_scale_softmax_gemm.hpp      # Device-level fused attention API
include/ck/tensor_operation/gpu/device/impl/
├── device_batched_attention_impl.hpp               # Attention-specific implementation
└── device_online_softmax_impl.hpp                  # Online softmax implementation
include/ck/tensor_operation/gpu/grid/
├── gridwise_batched_gemm_softmax.hpp               # Grid-level fused attention kernel
└── gridwise_online_softmax.hpp                     # Grid-level online softmax
```

### Key Classes and Functions

- **DeviceBatchedGemmScaleSoftmaxGemm** (in `device_batched_gemm_scale_softmax_gemm.hpp`):
  Device API for fused attention.
- **gridwise_batched_gemm_softmax** (in `gridwise_batched_gemm_softmax.hpp`):
  Implements the tiled/blocking fused attention kernel.
- **gridwise_online_softmax** (in `gridwise_online_softmax.hpp`):
  Implements the numerically stable, memory-efficient softmax.

This example demonstrates how Composable Kernel implements efficient, fused attention for transformer and large language models.

---

**`example/33_multiple_reduce/README.md`** (modified)

# Multiple Reductions

This example demonstrates a **multiple reduction** operation, where several different reduction computations (e.g., sum, average, max, min) are performed on the same input tensor in a single kernel launch. This is a highly efficient pattern when multiple statistics are needed for a tensor, as it requires only one read pass over the (potentially very large) input data.

## Mathematical Formulation

Given an input tensor `A`, this operation computes a set of output scalars or vectors, $\{R_0, R_1, \dots, R_N\}$, where each $R_i$ is the result of a different reduction operation applied to `A`:

$R_0 = \bigoplus_0 A$
$R_1 = \bigoplus_1 A$
...
$R_N = \bigoplus_N A$

Where $\bigoplus_i$ represents a distinct reduction operation, such as:
- `sum`: $\sum_j A_j$
- `avg`: $\frac{1}{N} \sum_j A_j$
- `max`: $\max_j(A_j)$
- `min`: $\min_j(A_j)$
- `sum of squares`: $\sum_j A_j^2$

The reductions can be performed over the entire tensor to produce a scalar, or along specific dimensions to produce a lower-rank tensor.

## Algorithmic Strategy: Fused Parallel Reduction

The implementation uses a classic parallel reduction algorithm but extends it to handle multiple reduction functions simultaneously.

1. **Grid Scheduling**: The input tensor is partitioned across the GPU's thread blocks. Each block is responsible for reducing a slice of the input data.

2. **Intra-Block Reduction**:
   - **Loading**: Threads within a block cooperatively load their assigned slice of the input tensor `A` into shared memory.
   - **Fused Accumulation**: Each thread maintains a separate set of accumulators in its private registers, one for each of the `N` reduction operations being performed. As threads iterate through the data in shared memory, they update all of their accumulators simultaneously. For example, for each element `a`, a thread might update `sum_accumulator += a`, `max_accumulator = max(max_accumulator, a)`, and `sum_sq_accumulator += a*a`.
   - **Tree-Based Reduction**: After processing all elements in the slice, the threads perform a parallel reduction in shared memory, once *for each of the N reduction types*: first all `sum_accumulator` values are combined into the block's partial sum, then all `max_accumulator` values into the block's partial max, and so on.

3. **Inter-Block Reduction**:
   - Each thread block writes its `N` partial results (the block's partial sum, partial max, etc.) to `N` separate temporary arrays in global memory.
   - A final, small reduction kernel is launched (or atomic operations are used) for each of the `N` temporary arrays to combine the partial results from all blocks into the final `N` output values.

The key to this kernel's efficiency is that the expensive part, reading the input tensor `A` from global memory, is only done once. All subsequent computation happens on-chip.
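The fused-accumulation idea can be shown in a few lines of plain Python (a sequential stand-in for the parallel kernel): a single traversal of the data updates every accumulator at once.

```python
def multi_reduce(data):
    """One read pass over `data`, updating all accumulators simultaneously."""
    total, maximum, minimum, sum_sq = 0.0, float("-inf"), float("inf"), 0.0
    for a in data:                     # the only pass over the input
        total += a                     # sum accumulator
        maximum = max(maximum, a)      # max accumulator
        minimum = min(minimum, a)      # min accumulator
        sum_sq += a * a                # sum-of-squares accumulator
    return {"sum": total, "max": maximum, "min": minimum,
            "sum_sq": sum_sq, "avg": total / len(data)}
```

On the GPU, each thread runs this loop over its slice, and the per-thread accumulators are then combined by the tree-based and inter-block reductions described above.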

## Source Code Organization

- [`multiple_reduce_xdl.cpp`](./multiple_reduce_xdl.cpp): The main example file. It sets up the input tensor and defines the multiple reduction operations to be performed. It then instantiates the `DeviceMultipleReduce` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_multiple_reduce.hpp`](../../include/ck/tensor_operation/gpu/device/device_multiple_reduce.hpp): The high-level device interface for the multiple reduction operation. It takes a tuple of structs, where each struct defines one of the reduction operations to be performed.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_multiple_reduce.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_multiple_reduce.hpp): The grid-wise kernel that implements the fused parallel reduction algorithm. It is heavily templated to generate the specific accumulation and reduction logic for the requested set of operations.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed:
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/33_multiple_reduce
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run ```example_dual_reduce_multiblock```
```bash
# -D <xxx> : input 4-d tensor lengths
# -v <x> : verification (0=no, 1=yes)
...
Start running 10 times...
Perf: 1.19529 ms, 201.499 GB/s, DeviceMultipleReduceBlockWise<256,M_C4_S1,K_C64_S1,InSrcVectorDim_1_InSrcVectorSize_1,OutDstVectorSize_1_1>
```

### Run ```example_dual_reduce_threadwise```
```bash
# -D <xxx> : input 4-d tensor lengths
# -v <x> : verification (0=no, 1=yes)
...
Warm up 1 time
Start running 10 times...
Perf: 0.01512 ms, 71.9577 GB/s, DeviceMultipleReduceThreadwise<256,M_C256_S1,K_C1_S4,InSrcVectorDim_1_InSrcVectorSize_2,OutDstVectorSize_1_1>
```

## Applications

This operation is extremely useful for computing statistics and implementing normalization layers.

- **Normalization Layers**: Both batch normalization and layer normalization require the mean and variance of a tensor. Variance can be written as $\sigma^2 = E[X^2] - (E[X])^2$, which requires two statistics: the sum of elements (for the mean, $E[X]$) and the sum of squares of elements (for $E[X^2]$). This kernel can compute both in a single pass, making it a highly efficient way to calculate the moments needed for normalization.
- **Data Analytics**: When analyzing a large dataset, one might want to compute its min, max, mean, and standard deviation all at once. This kernel can perform all the necessary underlying reductions in a single, efficient operation.
- **Loss Function Components**: Some complex loss functions involve multiple statistical properties of a model's output; this kernel can compute them efficiently.
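The normalization use case reduces to a small identity worth spelling out: both moments come from the two single-pass sums, using $\sigma^2 = E[X^2] - (E[X])^2$. A minimal sketch:

```python
def moments_from_single_pass(data):
    """Mean and variance from one pass: needs only sum and sum of squares."""
    n = len(data)
    s = sum(data)                      # reduction 1: sum of elements
    ss = sum(a * a for a in data)      # reduction 2: sum of squares
    mean = s / n
    var = ss / n - mean * mean         # E[X^2] - (E[X])^2
    return mean, var
```

Note that this formulation can lose precision when the mean is large relative to the spread; the kernel's per-operation accumulators use the same two sums, so the same caveat applies.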

---

**`example/34_batchnorm/README.md`** (modified)

# Batch Normalization Forward

## Theory

This example demonstrates the **batch normalization forward pass**. Batch normalization is used in deep neural networks to normalize activations across the batch dimension, improving training stability and convergence.

**Mathematical Formulation:**
Given input $X[N, C, ...]$:
- Mean: $\mu_c = \frac{1}{N \cdot ...} \sum_{n,...} X_{n,c,...}$
- Variance: $\sigma^2_c = \frac{1}{N \cdot ...} \sum_{n,...} (X_{n,c,...} - \mu_c)^2$
- Normalized: $\hat{X}_{n,c,...} = \frac{X_{n,c,...} - \mu_c}{\sqrt{\sigma^2_c + \epsilon}}$
- Output: $Y_{n,c,...} = \gamma_c \hat{X}_{n,c,...} + \beta_c$

$\gamma_c$ and $\beta_c$ are learnable per-channel scale and shift parameters.

**Algorithmic Background:**
- Computes mean and variance per channel (across batch and spatial dimensions).
- Applies normalization and an affine transformation.
- Used in CNNs, MLPs, and other deep learning models.
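The formulas above can be checked with a plain-Python reference for a single channel (illustration only, not the CK API): `xs` holds all values of one channel `c` across batch and spatial positions.

```python
import math

def batchnorm_channel(xs, gamma, beta, eps=1e-5):
    """Forward batchnorm for one channel: y = gamma * (x - mu) / sqrt(var + eps) + beta."""
    n = len(xs)
    mean = sum(xs) / n                            # mu_c
    var = sum((x - mean) ** 2 for x in xs) / n    # sigma^2_c (biased, as in the formula)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [gamma * (x - mean) * inv_std + beta for x in xs]
```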

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/example/34_batchnorm
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./batchnorm_fwd_xdl --verify=1 --time=1
```

## Run ```batchnorm forward nhwc```
```bash
...
Warm up 1 time
Start running 10 times...
Perf: 0.411026 ms, 91.8702 GB/s
```

## Source Code Structure

### Directory Layout
```
example/34_batchnorm/
├── batchnorm_fwd_xdl.cpp            # Main example: sets up, runs, and verifies batchnorm
include/ck/tensor_operation/gpu/device/
└── device_batchnorm_fwd.hpp         # Device-level batchnorm API
include/ck/tensor_operation/gpu/device/impl/
└── device_batchnorm_fwd_impl.hpp    # Implementation
include/ck/tensor_operation/gpu/grid/
└── gridwise_batchnorm_fwd.hpp       # Grid-level kernel
```

### Key Classes and Functions

- **DeviceBatchnormFwd** (in `device_batchnorm_fwd.hpp`):
  Device API for batch normalization.
- **gridwise_batchnorm_fwd** (in `gridwise_batchnorm_fwd.hpp`):
  Implements the tiled/blocking batchnorm kernel.

This example demonstrates how Composable Kernel implements efficient batch normalization for deep learning models.
82
example/35_splitK_gemm/README.md
Normal file
82
example/35_splitK_gemm/README.md
Normal file
@@ -0,0 +1,82 @@

# GEMM with K-Axis Splitting (Split-K GEMM)

This example demonstrates a general matrix-matrix multiplication (GEMM) implemented with a **Split-K** algorithm. This technique increases the available parallelism for a single large GEMM operation, which can lead to higher performance, especially on GPUs with a very large number of compute units.

## Mathematical Formulation

A standard GEMM computes the matrix product $C = A \times B$, where `A` has shape `[M, K]` and `B` has shape `[K, N]`:
$C_{ij} = \sum_{k=0}^{K-1} A_{ik} B_{kj}$

In a Split-K algorithm, the `K` dimension is split into `S` chunks of size `K_split = K / S`, breaking the GEMM into `S` smaller, partial GEMMs.

For each split `s` from `0` to `S-1`:
- Let $A_s$ be the s-th slice of `A` along the K-axis (shape `[M, K_split]`).
- Let $B_s$ be the s-th slice of `B` along the K-axis (shape `[K_split, N]`).
- A partial product is computed: $C_s = A_s \times B_s$.

The final result `C` is the sum of all the partial products:
$C = \sum_{s=0}^{S-1} C_s = C_0 + C_1 + \dots + C_{S-1}$

## Algorithmic Strategy: Parallel Reduction of Partial GEMMs

The Split-K algorithm turns a single large GEMM into multiple smaller GEMMs whose results must be reduced (summed). This introduces a new axis of parallelism.

1. **Splitting the K-Dimension**: The `K` dimension of the input matrices `A` and `B` is logically split into `S` parts. The `S` value is chosen by the kernel based on the problem size and hardware characteristics to expose a suitable amount of parallelism.

2. **Parallel Partial GEMMs**: The `S` partial GEMMs are executed in parallel. The GPU's grid of thread blocks maps not only to the M and N dimensions of the output matrix `C`, but also to the `S` splits of the K dimension. A thread block is assigned to compute a tile of a *partial* product $C_s$.

3. **Reduction of Partial Results**: The key challenge is how to sum the partial products $C_s$ efficiently.
   - **Atomic Add**: The simplest method is for each block to compute its tile of $C_s$ and then use atomic add operations to accumulate its result directly into the final output matrix `C` in global memory. This is easy to implement but can suffer from high contention when many splits update the same memory locations.
   - **Two-Stage Reduction**: A more robust approach involves two stages:
     - **Stage 1 (Partial Products)**: Each of the `S` parallel GEMMs writes its full partial product $C_s$ to a temporary workspace in global memory.
     - **Stage 2 (Final Reduction)**: A separate reduction kernel is launched to sum the `S` partial products from the workspace into the final output matrix `C`.

Composable Kernel's implementation abstracts this complexity. The `DeviceGemmSplitK` interface handles the selection of the split factor `S`, the launch of the parallel partial GEMMs, and the final reduction step.
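The split-and-reduce structure can be sketched in plain Python (a sequential stand-in for the parallel kernel, assuming `K` is divisible by the number of splits):

```python
def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def splitk_gemm(A, B, num_splits):
    """Two-stage Split-K: independent partial GEMMs, then a sum reduction."""
    K = len(B)
    k_split = K // num_splits
    M, N = len(A), len(B[0])
    C = [[0] * N for _ in range(M)]            # final output accumulator
    for s in range(num_splits):                # stage 1: partial GEMMs (parallel on GPU)
        lo, hi = s * k_split, (s + 1) * k_split
        A_s = [row[lo:hi] for row in A]        # [M, K_split] slice of A
        B_s = B[lo:hi]                         # [K_split, N] slice of B
        C_s = matmul(A_s, B_s)
        for i in range(M):                     # stage 2: reduce C_s into C
            for j in range(N):
                C[i][j] += C_s[i][j]
    return C
```

On the GPU, the stage-2 accumulation is what the atomic-add or two-stage-reduction variants implement; here it is just a nested loop.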

## Source Code Organization

- [`splitk_gemm_xdl.cpp`](./splitk_gemm_xdl.cpp): The main example file. It sets up a standard GEMM problem and instantiates the `DeviceGemmSplitK` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_gemm_splitk.hpp`](../../include/ck/tensor_operation/gpu/device/device_gemm_splitk.hpp): The high-level device interface for the Split-K GEMM. It takes an additional `k_batch` parameter which controls the number of splits.
- The underlying grid-wise kernel is modified to accept a `k_batch` index, so that each thread block knows which slice of the `A` and `B` matrices it is responsible for. It also includes the logic for the reduction (e.g., using atomic adds).

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed:
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/35_splitK_gemm
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./splitk_gemm_xdl

# Run with verification, data initialization, and timing
./splitk_gemm_xdl 1 2 1
```

## When is Split-K Useful?

Split-K is not always faster than a standard GEMM. It is most beneficial in specific scenarios:

- **"Skinny" GEMMs**: For GEMMs where `M` and `N` are small but `K` is very large (e.g., `M=64, N=64, K=65536`). A standard GEMM might not generate enough parallel work to fill a large GPU. By splitting the large `K` dimension, we create many more independent work items, improving hardware utilization.
- **Limited Shared Memory**: If a standard GEMM requires a very large tile size (and thus a large amount of shared memory) to be efficient, Split-K can be an alternative. It can use smaller tiles for the partial GEMMs, reducing the shared memory footprint per block.
- **Load Balancing**: It can help with load balancing on heterogeneous hardware or in complex fused scenarios.

The trade-off is the overhead of the reduction step. The performance gain from increased parallelism must outweigh the cost of either atomic operations or writing and re-reading intermediate results.

---

**`example/36_sparse_embedding/README.md`** (new file, 80 lines)

# Sparse Embedding Lookup

This example demonstrates a **sparse embedding lookup**, a fundamental operation in deep learning models that process sparse, high-cardinality categorical features, such as words in a vocabulary or user IDs in a recommendation system. The operation gathers feature vectors from a large embedding table based on a set of sparse input indices.

## Mathematical Formulation

The operation can be described as a lookup or gather operation.

Given:
- An **Embedding Table** `W`, a dense 2D tensor of shape `[VocabularySize, EmbeddingDim]`. Each row of `W` is a feature vector (an embedding) for a specific category.
- A set of **Indices** `I`, a tensor of integer IDs (e.g., shape `[BatchSize, SequenceLength]`) that specify which embeddings to look up.
- An optional **sparsity-aware optimizer** state, such as momentum vectors, which must also be looked up and updated.

The operation produces an **Output Tensor** `O` by gathering the rows from `W` corresponding to the indices in `I`:
$O_{bsj} = W_{I_{bs}, j}$

Where `b` is the batch index, `s` is the sequence index, and `j` is the embedding dimension index. The output tensor `O` has a shape like `[BatchSize, SequenceLength, EmbeddingDim]`.

## Algorithmic Strategy: Parallel Gather

Unlike compute-bound operations like GEMM, an embedding lookup is almost entirely **memory-bound**. The primary challenge is to perform the gather operation from the potentially very large embedding table `W` as efficiently as possible.

1. **Grid Scheduling**: The lookup problem is parallelized over the indices. The grid of threads is typically launched to match the shape of the index tensor `I`, with each thread assigned to handle the lookup for a single index.

2. **Gather Operation**:
   - Each thread reads its assigned index `id = I[b, s]` from the index tensor.
   - The thread then calculates the memory address of the start of the corresponding embedding vector in the table `W`, typically `address = base_address_W + id * EmbeddingDim * sizeof(DataType)`.
   - The thread then reads the entire embedding vector of size `EmbeddingDim` from that address in global memory and writes it to the corresponding position in the output tensor `O`.

3. **Memory Access Coalescing**: Performance is highly dependent on the memory access patterns.
   - If multiple threads in a warp access indices that are close to each other, their reads from the embedding table `W` might also be close, leading to some coalescing and better memory bandwidth utilization.
   - If the indices are random and scattered, the memory accesses will be random too, leading to poor cache utilization and low memory bandwidth. This is often the bottleneck.

4. **Fused Optimizer Update**: In training, the embedding lookup is part of a larger forward-backward-update cycle. For sparse features, only the embedding vectors that were actually used (the "hot" embeddings) need their gradients computed and their weights updated. High-performance implementations often fuse the backward pass (gradient accumulation) and the optimizer step (e.g., SGD or Adam update) for these hot embeddings into a specialized kernel to avoid multiple passes over the embedding table. This example focuses on the forward-pass lookup.
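The forward-pass gather itself is tiny when written out; a plain-Python reference of $O_{bsj} = W_{I_{bs}, j}$ (illustration only; on the GPU one thread handles each index):

```python
def embedding_lookup(W, I):
    """Gather rows of W selected by the integer ids in I.

    W: [VocabularySize][EmbeddingDim] embedding table.
    I: [BatchSize][SequenceLength] integer ids.
    Returns O: [BatchSize][SequenceLength][EmbeddingDim].
    """
    return [[list(W[idx]) for idx in row] for row in I]
```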

## Source Code Organization

- [`sparse_embedding_xdl.cpp`](./sparse_embedding_xdl.cpp): The main example file. It sets up the embedding table `W`, the index tensor `I`, and instantiates the `DeviceSparseEmbedding` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_sparse_embedding.hpp`](../../include/ck/tensor_operation/gpu/device/device_sparse_embedding.hpp): The high-level device interface for the sparse embedding lookup.
- The underlying grid-wise kernel is a straightforward gather kernel. Its performance is almost entirely dictated by the efficiency of its memory load and store operations.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed:
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/36_sparse_embedding
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./sparse_embedding_xdl

# Run with verification, data initialization, and timing
./sparse_embedding_xdl 1 2 1
```

## Applications

Embedding layers are the first step in a vast number of deep learning models:

- **Natural Language Processing (NLP)**: Models like BERT and GPT use embedding layers to convert integer token IDs from a vocabulary into dense vector representations.
- **Recommender Systems**: Models use embeddings to represent users and items. The input to the model is often a set of sparse IDs (e.g., user ID, watched movie IDs), which are converted to dense vectors via embedding lookups. Embedding tables in these systems can be enormous (terabytes in size).
- **Graph Neural Networks**: Nodes in a graph are often represented by feature vectors, which can be stored in an embedding table and looked up as needed.
- **Any model with categorical features**: Whenever a model needs to process non-numeric categorical data (e.g., "product category", "day of the week"), it is typically first converted to an integer ID and then to a dense vector via an embedding layer.

---

**`example/37_batched_gemm_add_add_relu_gemm_add/README.md`** (new file, 104 lines)

# Fused Batched GEMM-Add-Add-ReLU-GEMM-Add

This example demonstrates an exceptionally deep fusion, chaining two GEMMs with multiple elementwise additions and a ReLU activation. This pattern is designed to fuse a significant portion of a residual block, such as the feed-forward network (FFN) in a Transformer, into a single, highly optimized kernel.

## Mathematical Formulation

The operation computes a chain of operations, batched `B` times. For each batch item `b` from `0` to `B-1`:

1. **First GEMM (GEMM0)**:
   $C_{temp1[b]} = A_{[b]} \times B_{[b]}$
2. **First Add (Add0)**: An elementwise addition with tensor `D0`.
   $C_{temp2[b]} = C_{temp1[b]} + D0_{[b]}$
3. **Second Add (Add1)**: Another elementwise addition with tensor `D1`.
   $C_{temp3[b]} = C_{temp2[b]} + D1_{[b]}$
4. **Activation (ReLU)**: A rectified linear unit activation is applied.
   $C_{temp4[b]} = \text{ReLU}(C_{temp3[b]})$
5. **Second GEMM (GEMM1)**: The result is fed into a second GEMM.
   $E_{temp[b]} = C_{temp4[b]} \times C_{[b]}$
6. **Third Add (Add2)**: A final elementwise addition with tensor `D2`.
   $E_{[b]} = E_{temp[b]} + D2_{[b]}$

The key optimization is that all intermediate tensors ($C_{temp1}$ through $E_{temp}$) are **never written to global memory**. They are produced and consumed entirely within the GPU's on-chip memory (registers and LDS/shared memory).
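For one batch item, the six steps collapse to `E = relu(A@B + D0 + D1) @ C + D2`. A plain-Python reference (illustration only; here the intermediates are local variables where the kernel keeps them in registers/LDS):

```python
def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def relu(x):
    return [[max(a, 0) for a in row] for row in x]

def gemm_add_add_relu_gemm_add(A, B, C, D0, D1, D2):
    t = add(add(matmul(A, B), D0), D1)   # GEMM0 + Add0 + Add1 (fused epilogue)
    t = relu(t)                          # activation, still on-chip in the kernel
    return add(matmul(t, C), D2)         # GEMM1 + Add2 (final fused epilogue)
```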

## Algorithmic Strategy: Deeply Fused Producer-Consumer Chain

This kernel chains two "producer-consumer" GEMMs together, with a series of elementwise operations fused into the epilogue of the first GEMM.

1. **Batch Scheduling**: The `B` independent problems are distributed across the GPU's thread blocks. Each thread block is assigned to compute the full chain for one batch item `b`.

2. **Fused Execution within a Thread Block**:
   - **Compute GEMM0 Tile**: The thread block computes a tile of the first GEMM, $A_{[b]} \times B_{[b]}$. The result is held in registers.
   - **Fused Epilogue (Add-Add-ReLU)**: Before this intermediate result is stored anywhere, the epilogue operations are applied directly to the data in registers: the corresponding elements of `D0` and `D1` are loaded, the two additions are performed, and the ReLU activation is applied.
   - **Store to Shared Memory**: The result of this entire fused chain ($C_{temp4}$) is written to a designated region of **shared memory (LDS)**.
   - **Synchronization**: A block-wide synchronization (`__syncthreads()`) ensures the intermediate result in LDS is visible to all threads in the block.
   - **Compute GEMM1 Tile**: The threads immediately start the second GEMM, using the tile in shared memory as the input and multiplying it with tiles of `C`. The result is accumulated in registers.
   - **Final Fused Epilogue (Add)**: Before the final result is stored, the last addition is fused: the corresponding elements of `D2` are loaded and added in registers.
   - **Store Final Result**: The final result `E` is written to global memory.

This deep fusion avoids five separate kernel launches and the associated read/write traffic for four large intermediate tensors, resulting in a substantial performance improvement.

## Source Code Organization

- [`batched_gemm_add_add_relu_gemm_add_xdl.cpp`](./batched_gemm_add_add_relu_gemm_add_xdl.cpp): The main example file. It sets up the numerous input tensors (A, B, C, D0, D1, D2) and instantiates the highly specialized device-level operation.
- The device-level interface and underlying grid-wise kernel for this operation are extremely complex, templated on the multiple elementwise operations and managing the intricate data flow between registers, shared memory, and global memory.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed:
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/37_batched_gemm_add_add_relu_gemm_add
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./batched_gemm_add_add_relu_gemm_add_xdl

# Run with verification, data initialization, and timing
./batched_gemm_add_add_relu_gemm_add_xdl 1 2 1
```

## Application to Transformer FFN Block

This kernel can fuse almost the entire feed-forward network (FFN) block of a standard Transformer, including the residual connection.

A typical FFN block with pre-layer-normalization looks like this:
`Z = LayerNorm(X)`
`Y = Linear_2(ReLU(Linear_1(Z)))`
`Output = X + Y`

This kernel can compute `Y` and the final residual addition:
- `A`: The normalized input `Z`.
- `B`: The weight matrix for `Linear_1`.
- `D0`: The bias for `Linear_1`.
- `D1`: Not used in this specific mapping (can be zero).
- `C`: The weight matrix for `Linear_2`.
- `D2`: The bias for `Linear_2` plus the original input `X` for the residual connection.

By mapping the components of a Transformer FFN block onto this kernel, a developer can achieve performance well beyond a sequence of standard library calls. This demonstrates the power of Composable Kernel to create highly domain-specific, performance-leading fused operations.

---

**`example/38_grouped_conv_bwd_data_multiple_d/README.md`** (new file, 75 lines)
|
||||
# Grouped Convolution Backward Data with Multiple Elementwise Inputs

This example demonstrates a **Grouped Convolution Backward Data Pass** fused with an elementwise operation that takes multiple auxiliary input tensors (`D` tensors). The backward data pass (also known as a transposed convolution or deconvolution) computes the gradient of the loss with respect to the convolution's *input* tensor. Fusing it with other operations is a powerful way to optimize the backward pass of a neural network.

## Mathematical Formulation

The operation computes the gradient with respect to the input (`GradIn`) of a grouped convolution, and then fuses the result with other tensors. For each group `g` from `0` to `G-1`:

1. **Backward Data Convolution Stage**: A standard N-dimensional backward data convolution is performed for the group. This computes the gradient that should be propagated back to the input of the original forward-pass convolution.

   $GradIn_{temp[g]} = \text{ConvBwdData}(\text{GradOut}_{[g]}, \text{W}_{[g]})$

   Where `GradOut` is the gradient from the subsequent layer and `W` is the weight tensor from the forward pass.

2. **Elementwise Stage**: The result of the backward convolution is combined with one or more auxiliary tensors ($D_{0[g]}, D_{1[g]}, \dots$) using a user-defined elementwise function `f`.

   $GradIn_{[g]} = f(GradIn_{temp[g]}, D_{0[g]}, D_{1[g]}, \dots)$

This fusion is particularly useful for operations like adding the gradient from a residual "skip" connection, a common pattern in modern network architectures. By fusing the addition, we avoid a separate kernel launch and a full read/write pass over the `GradIn` tensor.
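A minimal 1-D, stride-1 reference of this fused pattern (plain Python, hypothetical shapes; the elementwise `f` here is a residual addition of a single `D` tensor):

```python
# Reference sketch: 1-D backward-data convolution fused with an elementwise add.
# Forward pass assumed: out[i] = sum_k in[i + k] * w[k] (valid, stride 1, no padding).

def conv1d_bwd_data_fused_add(grad_out, w, n_in, d_skip):
    # GradIn_temp[j] = sum_k grad_out[j - k] * w[k], over valid indices only
    grad_in = [0.0] * n_in
    for j in range(n_in):
        for k in range(len(w)):
            i = j - k
            if 0 <= i < len(grad_out):
                grad_in[j] += grad_out[i] * w[k]
        # fused elementwise stage: f(GradIn_temp, D0) = GradIn_temp + D0
        grad_in[j] += d_skip[j]
    return grad_in

# input length 4, w = [1, 1] gives forward output length 3;
# all-ones grad_out and a zero skip gradient make the result easy to check
g = conv1d_bwd_data_fused_add([1.0, 1.0, 1.0], [1.0, 1.0], 4, [0.0] * 4)
```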

## Algorithmic Strategy: Implicit Grouped GEMM with Fused Multi-D Epilogue

The implementation uses the implicit GEMM algorithm, configured for the backward data pass.

1. **Group Scheduling**: The `G` independent problems are distributed across the GPU's thread blocks. Each thread block is assigned to compute the fused backward convolution for one of the `G` groups.

2. **Implicit GEMM for Backward Data**: The backward data convolution can be mathematically re-arranged into an equivalent forward convolution with transformed inputs and weights, which can then be solved with an implicit GEMM algorithm. Composable Kernel handles this transformation. A thread block executes the implicit GEMM for its assigned group, accumulating the `GradIn_temp` result in registers.

3. **Fused Multi-D Epilogue**: Before writing the result to global memory, the epilogue performs the elementwise fusion:
   - Threads load the corresponding tiles from the auxiliary `D` tensors for the assigned group.
   - The user-defined elementwise function `f` is applied in registers to the computed gradient and the `D` tensor values.
   - The final result `GradIn` for the group is written to global memory.

This strategy minimizes memory bandwidth by avoiding materialization of the intermediate gradient tensor and maximizes parallelism by executing all groups concurrently.

## Source Code Organization

- [`grouped_conv_bwd_data_multiple_d_xdl.cpp`](./grouped_conv_bwd_data_multiple_d_xdl.cpp): The main example file. It sets up the grouped backward convolution problem, including the multiple `D` tensors, and instantiates the `DeviceGroupedConvBwdDataMultipleD` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_grouped_conv_bwd_data_multiple_d.hpp`](../../include/ck/tensor_operation/gpu/device/device_grouped_conv_bwd_data_multiple_d.hpp): The high-level device interface for this operation. It takes arrays of tensor descriptors, one per group for each of the `D` tensors.
- The underlying grid-wise kernel contains the logic to map thread blocks to groups and then execute the full implicit GEMM pipeline (formulated for backward data) with the fused multi-D epilogue for the assigned group.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/38_grouped_conv_bwd_data_multiple_d
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./grouped_conv_bwd_data_multiple_d_xdl

# Run with verification, data initialization, and timing
./grouped_conv_bwd_data_multiple_d_xdl 1 2 1
```

## Applications in Backpropagation

Fusing operations into the backward pass is a critical optimization for training deep neural networks.

- **Fused Residual Gradient**: In a residual block (`y = F(x) + x`), the gradient with respect to `x` is `dF/dx + dy/dx`. If `F` is a convolution, `dF/dx` is the output of the `ConvBwdData` operation. The `dy/dx` term (the gradient from the skip connection) can be passed as a `D` tensor and fused via an addition, computing the full gradient for `x` in a single kernel.
- **Fused Gradient Clipping/Scaling**: The `D` tensors and the elementwise function `f` can be used to apply gradient scaling or other custom gradient-processing steps directly to the output of the backward convolution, before the result is written back to memory.

example/39_permute/README.md (new file)
@@ -0,0 +1,56 @@
# Tensor Permutation (Dimension Reordering)

## Theory

This example demonstrates **tensor permutation operations**, which reorder the dimensions of a tensor according to a specified permutation pattern. Permutation is fundamental to many machine learning operations, including tensor layout transformations, data format conversions, and the implementation of complex tensor operations.

**Mathematical Formulation:**

Given an input tensor $X$ with shape $[D_0, D_1, ..., D_{n-1}]$ and a permutation pattern $P = [p_0, p_1, ..., p_{n-1}]$, the permutation operation produces an output tensor $Y$ with shape $[D_{p_0}, D_{p_1}, ..., D_{p_{n-1}}]$ such that:

$$
Y_{i_{p_0}, i_{p_1}, ..., i_{p_{n-1}}} = X_{i_0, i_1, ..., i_{n-1}}
$$

**Algorithmic Background:**
- Permutation is used for matrix transpose, NCHW/NHWC layout conversion, attention head reshaping, and more.
- Efficient permutation requires optimizing memory access patterns for coalescing and bandwidth.
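The formula above matches the semantics of `numpy.transpose(X, P)`; a dependency-free sketch on nested lists (illustrative only, not the `DevicePermute` API):

```python
# Shape permutation plus the 2-D case of Y[i_{p0}, i_{p1}] = X[i_0, i_1].

def permute_shape(shape, p):
    # output dim k takes its extent from input dim p[k]
    return [shape[p[k]] for k in range(len(p))]

def permute2d(x, p):
    if p == [0, 1]:                      # identity permutation
        return [row[:] for row in x]
    assert p == [1, 0]                   # transpose: Y[i1][i0] = X[i0][i1]
    return [[x[i][j] for i in range(len(x))] for j in range(len(x[0]))]

# the NCHW -> NHWC pattern from the run examples below
nhwc = permute_shape([32, 256, 56, 56], [0, 2, 3, 1])
y = permute2d([[1, 2, 3], [4, 5, 6]], [1, 0])
```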

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/39_permute
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (matrix transpose)
./permute_xdl --input_shape=4096,4096 --permutation=1,0 --verify=1 --time=1

# Example run (NCHW to NHWC)
./permute_xdl --input_shape=32,256,56,56 --permutation=0,2,3,1 --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/39_permute/
└── permute_xdl.cpp        # Main example: sets up, runs, and verifies tensor permutation

include/ck/tensor_operation/gpu/device/
└── device_permute.hpp     # Device-level permutation API

include/ck/tensor_operation/gpu/grid/
└── gridwise_permute.hpp   # Grid-level permutation kernel
```

### Key Classes and Functions

- **DevicePermute** (in `device_permute.hpp`): Device API for tensor permutation.
- **gridwise_permute** (in `gridwise_permute.hpp`): Implements the tiled/blocking permutation kernel.

This example demonstrates how Composable Kernel implements efficient tensor dimension reordering for layout transformations and deep learning operations.

example/40_conv2d_fwd_quantization/README.md (new file)
@@ -0,0 +1,61 @@
# 2D Convolution Forward with Quantization

## Theory

This example demonstrates **2D convolution forward with quantized weights or activations**. Quantization reduces memory use and computation by representing values with lower-precision integer types (e.g., int8), enabling efficient deep learning inference.

**Mathematical Formulation:**
- Quantized convolution: $Y = \text{dequant}(X_q) \ast \text{dequant}(W_q)$, where $\ast$ denotes convolution
- $X_q$, $W_q$: quantized input and weight tensors (e.g., int8)
- $\text{dequant}(x_q) = (x_q - z) \cdot s$, with scale $s$ and zero-point $z$
- $Y$: output tensor (often in higher precision, e.g., float32 or float16)

**Algorithmic Background:**
- Quantized values are dequantized on-the-fly during convolution.
- Accumulation is performed in higher precision for accuracy.
- Both symmetric and asymmetric quantization are supported.
- The convolution is implemented as an implicit GEMM for efficiency.
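A reference for the symmetric case (zero-point $z = 0$) on a single dot product, the inner loop of the implicit GEMM; scales and values are illustrative:

```python
# Symmetric int8-style quantization with integer accumulation.
# Because dequant is linear, the scales can be factored out of the sum and
# applied once at the end: y = s_x * s_w * sum(xq * wq).

def quantize(xs, scale):
    # q = round(x / s), clamped to the int8 range; zero-point is 0 (symmetric)
    return [max(-128, min(127, round(x / scale))) for x in xs]

def quantized_dot(xq, wq, sx, sw):
    acc = 0                      # int32-style integer accumulation for accuracy
    for a, b in zip(xq, wq):
        acc += a * b
    return acc * sx * sw         # dequantize once, at the end

sx, sw = 0.5, 0.25
x = [1.0, -2.0, 3.0]
w = [0.5, 0.25, -0.75]
xq, wq = quantize(x, sx), quantize(w, sw)
y = quantized_dot(xq, wq, sx, sw)
exact = sum(a * b for a, b in zip(x, w))
```

With these scales the values quantize exactly, so the quantized result matches the float dot product; in general there is a rounding error bounded by the scales.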

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/40_conv2d_fwd_quantization
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run
./conv2d_fwd_quantization_xdl --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/40_conv2d_fwd_quantization/
└── conv2d_fwd_quantization_xdl.cpp            # Main example: sets up, runs, and verifies quantized conv2d

include/ck/tensor_operation/gpu/device/
└── device_conv2d_fwd_quantization.hpp         # Device-level quantized conv2d API

include/ck/tensor_operation/gpu/device/impl/
└── device_conv2d_fwd_quantization_impl.hpp    # Implementation

include/ck/tensor_operation/gpu/grid/
└── gridwise_conv2d_fwd_quantization.hpp       # Grid-level quantized conv2d kernel

include/ck/tensor_operation/gpu/element/
└── quantization_operations.hpp                # Quantization/dequantization utilities
```

### Key Classes and Functions

- **DeviceConv2dFwdQuantization** (in `device_conv2d_fwd_quantization.hpp`): Device API for quantized 2D convolution.
- **gridwise_conv2d_fwd_quantization** (in `gridwise_conv2d_fwd_quantization.hpp`): Implements the tiled/blocking quantized conv2d kernel.
- **quantization_operations** (in `quantization_operations.hpp`): Defines quantization and dequantization functions.

This example demonstrates how Composable Kernel supports efficient quantized convolution for deep learning inference.

example/41_grouped_conv_conv_fwd/README.md (new file)
@@ -0,0 +1,83 @@
# Fused Grouped-Convolution -> Convolution Forward

This example demonstrates a **fused Grouped Convolution followed by a standard Convolution**. This pattern is designed specifically to optimize Depthwise Separable Convolutions, a key building block of many modern, efficient convolutional neural networks such as MobileNet.

A Depthwise Separable Convolution consists of two stages:
1. **Depthwise Convolution**: A grouped convolution in which the number of groups equals the number of input channels (`groups = in_channels`), so each filter is applied to exactly one input channel.
2. **Pointwise Convolution**: A standard `1x1` convolution that projects the channels from the depthwise stage into a new output channel space.

Fusing these two stages into a single kernel can provide significant performance benefits.

## Mathematical Formulation

The operation computes a chain of two convolutions, where the first is a grouped convolution.

1. **First Convolution (Conv0 - Grouped/Depthwise)**:

   $D_{temp} = \text{GroupedConv}(\text{In}, \text{W0})$

   Where `In` is the input tensor and `W0` are the weights for the grouped convolution.

2. **Second Convolution (Conv1 - Pointwise)**:

   $Out = \text{Conv}(D_{temp}, \text{W1})$

   Where `D_temp` is the output of the first stage and `W1` are the weights for the second convolution (typically `1x1` filters).

The critical optimization is that the intermediate tensor `D_temp` is **never written to global memory**. It is produced and consumed entirely within the GPU's on-chip memory (registers and LDS/shared memory), saving a large amount of memory bandwidth.
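A 1-D sketch of the two fused stages on plain Python lists (hypothetical shapes and names; the intermediate `d_temp` is just a local variable, mirroring how the kernel keeps it on chip):

```python
# Depthwise (grouped, groups == C) conv followed by a pointwise 1x1 conv.

def depthwise_separable_1d(x, w_dw, w_pw):
    # x:    [C][L]      input with one spatial dim
    # w_dw: [C][KW]     one depthwise filter per channel
    # w_pw: [C_out][C]  pointwise (1x1) filters mixing channels
    C, L, KW = len(x), len(x[0]), len(w_dw[0])
    Lo = L - KW + 1
    # Stage 1 (Conv0): depthwise; d_temp never leaves this function
    d_temp = [[sum(x[c][i + k] * w_dw[c][k] for k in range(KW)) for i in range(Lo)]
              for c in range(C)]
    # Stage 2 (Conv1): pointwise, consuming d_temp directly
    return [[sum(w_pw[o][c] * d_temp[c][i] for c in range(C)) for i in range(Lo)]
            for o in range(len(w_pw))]

x    = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # C=2, L=3
w_dw = [[1.0, 1.0], [1.0, -1.0]]            # KW=2
w_pw = [[1.0, 1.0]]                         # C_out=1
y = depthwise_separable_1d(x, w_dw, w_pw)
```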

## Algorithmic Strategy: Fused Implicit GEMM-GEMM via Shared Memory

The implementation maps the two-stage convolution onto a fused GEMM-GEMM problem, using shared memory as the communication buffer between the stages.

1. **Grid Scheduling**: The problem is parallelized across the thread blocks of the GPU. Each thread block is assigned a tile of the final output tensor `Out`.

2. **Fused Execution within a Thread Block**: To compute its output tile, a thread block performs both convolution stages for the corresponding input region.
   - **Compute Conv0 Tile**: The thread block first computes a tile of the intermediate tensor $D_{temp}$ using the implicit GEMM algorithm for the grouped convolution. The result is stored directly into a designated region of **shared memory (LDS)**.
   - **Synchronization**: A block-wide synchronization (`__syncthreads()`) is performed. This critical step ensures the *entire* tile of $D_{temp}$ is visible to all threads in the block before the second convolution begins.
   - **Compute Conv1 Tile**: The threads then immediately compute the second convolution, using the intermediate tile in shared memory as input and applying the implicit GEMM algorithm for the pointwise convolution. The result is accumulated in registers.
   - **Store Final Result**: Once a tile of the final output `Out` is computed, it is written to global memory.

This producer-consumer pattern within a thread block is highly efficient: it treats shared memory as a fast, programmable cache for the intermediate tensor, completely avoiding the slow round trip to global HBM memory that two separate kernel calls would require.

## Source Code Organization

- [`grouped_conv_conv_fwd_xdl.cpp`](./grouped_conv_conv_fwd_xdl.cpp): The main example file. It sets up the input tensor and the two weight tensors (`W0` for the grouped stage, `W1` for the standard stage) and instantiates the `DeviceGroupedConvConvFwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_grouped_conv_conv_fwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_grouped_conv_conv_fwd.hpp): The high-level device interface for the fused convolution operation.
- The underlying grid-wise kernel implements the fusion logic, managing the two implicit GEMM calculations and the data flow through shared memory.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/41_grouped_conv_conv_fwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./grouped_conv_conv_fwd_xdl

# Run with verification, data initialization, and timing
./grouped_conv_conv_fwd_xdl 1 2 1
```

## Application to Efficient CNNs

As mentioned, this kernel is a direct, high-performance implementation of a **Depthwise Separable Convolution**. This architectural primitive is the foundation of many efficient CNNs, including:

- **MobileNets (V1, V2, V3)**: Designed for high performance on mobile and edge devices.
- **EfficientNets**: A family of models that systematically scale model depth, width, and resolution to achieve high accuracy with fewer parameters and FLOPs.
- **Xception**: A model that takes the idea of separable convolutions to an extreme.

By providing a fused kernel for this common pattern, Composable Kernel allows developers to achieve significantly better performance for these models than calling a library separately for the depthwise and pointwise stages.

example/42_groupnorm_fwd/README.md (new file)
@@ -0,0 +1,92 @@
# Group Normalization Forward

This example demonstrates the forward pass of **Group Normalization (GroupNorm)**. GroupNorm is a normalization technique that acts as a bridge between Layer Normalization and Instance Normalization. It divides the channels into groups and computes the mean and variance for normalization within each group. This makes its performance stable across a wide range of batch sizes, unlike BatchNorm.

## Mathematical Formulation

Given an input tensor `X` with shape `[N, C, H, W]` and a specified number of groups `G`:

The `C` channels are divided into `G` groups, each containing `C/G` channels. Normalization is performed independently for each group within each batch item.

For each batch item `n` and each group `g`:

1. **Identify Channels**: Identify the set of channels belonging to group `g`. Let this set be $S_g$; its size is $C' = C/G$.

2. **Compute Mean**: The mean is calculated across the channels in the group and the spatial dimensions (`H`, `W`).

   $\mu_{ng} = \frac{1}{C' \cdot H \cdot W} \sum_{c \in S_g} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_{nchw}$

3. **Compute Variance**: The variance is calculated across the same dimensions.

   $\sigma_{ng}^2 = \frac{1}{C' \cdot H \cdot W} \sum_{c \in S_g} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} (X_{nchw} - \mu_{ng})^2$

4. **Normalize**: The input is normalized using the computed mean and variance of its group. For any channel `c` in group `g`:

   $\hat{X}_{nchw} = \frac{X_{nchw} - \mu_{ng}}{\sqrt{\sigma_{ng}^2 + \epsilon}}$

   Where $\epsilon$ is a small constant for numerical stability.

5. **Scale and Shift**: The normalized output is scaled by a learnable parameter `gamma` and shifted by a learnable parameter `beta`. Unlike the group-wise statistics, `gamma` and `beta` are applied per channel, not per group.

   $Y_{nchw} = \gamma_c \cdot \hat{X}_{nchw} + \beta_c$

   Both `gamma` and `beta` are vectors of shape `[C]`.
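The five steps above can be checked against a naive pure-Python reference (spatial dims flattened to `HW`; shapes and names are illustrative, not the CK API):

```python
import math

def groupnorm_ref(x, gamma, beta, G, eps=1e-5):
    # x: [N][C][HW]; gamma, beta: [C]
    N, C, HW = len(x), len(x[0]), len(x[0][0])
    Cg = C // G                                  # channels per group, C'
    y = [[[0.0] * HW for _ in range(C)] for _ in range(N)]
    for n in range(N):
        for g in range(G):
            chans = range(g * Cg, (g + 1) * Cg)
            vals = [x[n][c][i] for c in chans for i in range(HW)]
            mu = sum(vals) / len(vals)
            var = sum((v - mu) ** 2 for v in vals) / len(vals)
            inv = 1.0 / math.sqrt(var + eps)
            for c in chans:
                for i in range(HW):
                    y[n][c][i] = gamma[c] * (x[n][c][i] - mu) * inv + beta[c]
    return y

x = [[[1.0, 3.0], [2.0, 4.0]]]                   # N=1, C=2, HW=2
y = groupnorm_ref(x, gamma=[1.0, 1.0], beta=[0.0, 0.0], G=1)
```

With `gamma = 1` and `beta = 0`, each group's output has (near) zero mean and unit variance, up to the `eps` term.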

## Algorithmic Strategy: Two-Pass Parallel Reduction per Group

The implementation of GroupNorm is a parallel reduction problem, similar to LayerNorm and BatchNorm, but with a different reduction scope.

1. **Grid Scheduling**: The `N * G` independent normalization problems (one per batch item and group) are distributed among the GPU's thread blocks. Each block is assigned one or more `(n, g)` pairs to normalize.

2. **Pass 1: Compute Moments (Mean and Variance)**
   - For an assigned `(n, g)` pair, the threads within a block cooperatively read the data for the channels in that group and the spatial dimensions.
   - **Welford's Algorithm**: To compute mean and variance in a single pass with good numerical stability, Welford's online algorithm is used.
   - **Intra-Block Reduction**: The threads perform a parallel reduction through shared memory to compute the final mean and variance for the `(n, g)` pair.
   - The final mean and variance for each `(n, g)` pair are written to temporary arrays in global memory.

3. **Pass 2: Normalize, Scale, and Shift**
   - A second kernel (or a second stage in the same kernel after a grid-wide sync) is launched.
   - Threads read the input data `X` again.
   - For each element `X_nchw`, the thread identifies its group `g`, reads the corresponding mean `mu_ng` and variance `sigma_ng`, and applies the normalization formula.
   - It then reads the per-channel `gamma_c` and `beta_c` values and applies the scale and shift.
   - The final result `Y` is written to global memory.

Composable Kernel encapsulates this two-pass logic into a single, efficient `DeviceGroupnormFwd` operation.
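Welford's update and the merge step used in the intra-block reduction can be sketched as follows (a reference for the math only, not the GPU implementation):

```python
# Welford's online algorithm: maintain (count, mean, M2); variance = M2 / count.

def welford_update(count, mean, m2, v):
    count += 1
    d = v - mean
    mean += d / count
    m2 += d * (v - mean)
    return count, mean, m2

def welford_merge(ca, ma, m2a, cb, mb, m2b):
    # combine two partial results, as threads do in the shared-memory reduction
    c = ca + cb
    d = mb - ma
    return c, ma + d * cb / c, m2a + m2b + d * d * ca * cb / c

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
count, mean, m2 = 0, 0.0, 0.0
for v in data:
    count, mean, m2 = welford_update(count, mean, m2, v)
var = m2 / count                       # population variance in a single pass

# accumulating two halves independently and merging gives the same moments
a = (0, 0.0, 0.0)
for v in data[:4]:
    a = welford_update(*a, v)
b = (0, 0.0, 0.0)
for v in data[4:]:
    b = welford_update(*b, v)
cm, mm, m2m = welford_merge(*a, *b)
```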

## Source Code Organization

- [`groupnorm_fwd_xdl.cpp`](./groupnorm_fwd_xdl.cpp): The main example file. It sets up the input tensor, the `gamma` and `beta` vectors, and the number of groups, and instantiates the `DeviceGroupnormFwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_groupnorm_fwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_groupnorm_fwd.hpp): The high-level device interface for the GroupNorm forward pass.
- The implementation internally uses a reduction kernel based on Welford's algorithm to compute the statistics and an elementwise kernel to apply the normalization.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/42_groupnorm_fwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./groupnorm_fwd_xdl

# Run with verification, data initialization, and timing
./groupnorm_fwd_xdl 1 2 1
```

## Comparison of Normalization Layers

- **BatchNorm**: Normalizes over `(N, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size dependent.
- **LayerNorm**: Normalizes over `(C, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size independent.
- **InstanceNorm**: Normalizes over `(H, W)`. Learns `gamma` and `beta` per channel `C`. A special case of GroupNorm with `G = C`.
- **GroupNorm**: Normalizes over `(C/G, H, W)`. Learns `gamma` and `beta` per channel `C`. Batch-size independent.

GroupNorm's flexibility has made it popular in GANs and in Transformer-based vision models, where batch sizes can be small.

example/43_splitk_gemm_bias_e_permute/README.md (new file)
@@ -0,0 +1,82 @@
# Split-K GEMM with Bias, Elementwise Operation, and Permutation

This example demonstrates a highly complex fusion: a **Split-K GEMM** whose result is fused with a bias addition, a second elementwise operation, and a final permutation. It combines the parallelism-enhancing Split-K strategy with a multi-stage epilogue, making it suitable for accelerating very large or "skinny" GEMMs that are part of a larger computational graph.

## Mathematical Formulation

The operation first computes a GEMM using the Split-K algorithm and then applies a sequence of fused operations.

1. **Split-K GEMM Stage**: The matrix multiplication $C_{temp1} = A \times B$ is computed by splitting the `K` dimension into `S` chunks and summing the partial products.

   $C_{temp1} = \sum_{s=0}^{S-1} (A_s \times B_s)$

2. **Bias Addition Stage**: A bias vector `D` is broadcast and added.

   $C_{temp2} = C_{temp1} + D$

3. **Elementwise Stage**: A second elementwise operation is performed with tensor `E`.

   $C_{temp3} = C_{temp2} \odot E$

4. **Permutation Stage**: The final result is permuted.

   $F = \text{permute}(C_{temp3})$

The key is that the reduction (summation) of the partial GEMM products is fused with the entire epilogue chain (bias, elementwise, permute).
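The four stages can be sketched in pure Python (hypothetical tiny shapes; the `S` partial GEMMs below run sequentially, whereas on the GPU they run in parallel):

```python
# Split-K GEMM with a fused bias-add, elementwise-multiply, and transpose epilogue.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def splitk_gemm_fused(A, B, D, E, S):
    M, K, N = len(A), len(A[0]), len(B[0])
    kc = K // S
    # Stage 1: S partial GEMMs over K-chunks (parallel on the GPU)
    partials = []
    for s in range(S):
        As = [row[s * kc:(s + 1) * kc] for row in A]
        Bs = B[s * kc:(s + 1) * kc]
        partials.append(matmul(As, Bs))
    # Stage 2: reduce the partials, then the fused epilogue
    # (bias D over N, elementwise E, permute M x N -> N x M)
    out = [[0.0] * M for _ in range(N)]
    for i in range(M):
        for j in range(N):
            acc = sum(p[i][j] for p in partials)
            out[j][i] = (acc + D[j]) * E[i][j]
    return out

A = [[1.0, 2.0, 3.0, 4.0]]          # M=1, K=4
B = [[1.0], [1.0], [1.0], [1.0]]    # K=4, N=1
F = splitk_gemm_fused(A, B, D=[0.5], E=[[2.0]], S=2)
```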

## Algorithmic Strategy: Split-K with a Fused Reduction Epilogue

The implementation combines the Split-K algorithm with the multi-stage fused epilogue seen in previous examples.

1. **Splitting the K-Dimension**: The `K` dimension is logically split into `S` parts, creating `S` parallel partial GEMM problems.

2. **Parallel Partial GEMMs**: The `S` partial GEMMs are executed in parallel across the GPU's thread blocks. A thread block is assigned to compute a tile of a *partial* product $C_s$.

3. **Fused Reduction and Epilogue**: The method for reducing the partial sums and applying the epilogue is critical.
   - **Workspace Approach**: A common strategy is to use a temporary workspace in global memory.
     - **Stage 1 (Partial Products)**: Each of the `S` parallel GEMMs computes its partial product $C_s$ and writes it to a unique slice of a temporary workspace tensor.
     - **Stage 2 (Reduce + Epilogue)**: A second, specialized kernel reads the `S` partial products from the workspace, reduces (sums) them on the fly, and immediately applies the full bias-elementwise-permute epilogue before writing the final result `F` to memory.
   - **Atomic-based Approach**: For some data types and operations, the reduction can be performed with atomic operations. The first block to reach an output element writes its partial result; subsequent blocks read the intermediate value, add their contribution, and atomically write the new sum back. This is more complex and often less performant due to atomic contention.

Composable Kernel's implementation abstracts this complexity, providing a single device-level operation that manages the workspace, the two stages, and the complex epilogue.

## Source Code Organization

- [`splitk_gemm_bias_e_permute_xdl.cpp`](./splitk_gemm_bias_e_permute_xdl.cpp): The main example file. It sets up the GEMM problem, the bias and elementwise tensors, and the permutation, and instantiates the `DeviceSplitkGemmBiasEPermute` operation.
- The device-level interface and underlying kernels are highly specialized. They manage the Split-K parameter, workspace allocation (if needed), and the two-stage execution process, combining the logic of `DeviceGemmSplitK` and `DeviceGemmBiasEPermute`.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/43_splitk_gemm_bias_e_permute
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./splitk_gemm_bias_e_permute_xdl

# Run with verification, data initialization, and timing
./splitk_gemm_bias_e_permute_xdl 1 2 1
```

## Applications

This highly specialized kernel is useful when a very large GEMM (one that benefits from Split-K) is immediately followed by a series of fusable operations.

- **Large Feed-Forward Networks**: In a Transformer with a very large hidden dimension, the GEMMs in the FFN block may become "skinny" (large `K`, smaller `M`/`N`). If the FFN is also fused with residual connections (bias/add) and layout permutations, this kernel offers both the parallelism benefits of Split-K and the memory-bandwidth savings of the fused epilogue.
- **Final Classifier Layers**: The final layer of a large classification model is often a very large GEMM. If its output must be reshaped or post-processed, this kernel can fuse those operations directly into the Split-K GEMM.

This example showcases the composability of the library: different algorithmic strategies (such as Split-K) can be combined with deep fusion to create highly tailored, high-performance kernels.

example/44_elementwise_permute/README.md (new file)
@@ -0,0 +1,59 @@
# Elementwise Operation with Permutation Fusion

## Theory

This example demonstrates **elementwise operations fused with tensor permutation**. This pattern is used in deep learning to apply activation functions or scaling while simultaneously reordering tensor dimensions (e.g., NCHW to NHWC).

**Mathematical Formulation:**
- Elementwise: $Z = f(X)$ or $Z = f(X, Y)$
- Permute: $W_{i_{p_0}, i_{p_1}, ..., i_{p_{n-1}}} = Z_{i_0, i_1, ..., i_{n-1}}$, where $W$ is the output tensor
- $P = [p_0, p_1, ..., p_{n-1}]$ is the permutation pattern

**Algorithmic Background:**
- The elementwise operation and the permutation are fused into a single kernel.
- Intermediate results stay in registers and are never written to global memory.
- Used for layout conversion with activation, attention head reshaping, and more.
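A 2-D sketch of the fusion (ReLU applied while writing to the transposed position; illustrative only, not the CK API):

```python
def relu_transpose(x):
    # fused: apply f(v) = max(v, 0) while writing to the permuted position,
    # so there is one read and one write per element and no intermediate tensor
    rows, cols = len(x), len(x[0])
    y = [[0.0] * rows for _ in range(cols)]
    for i in range(rows):
        for j in range(cols):
            y[j][i] = max(x[i][j], 0.0)
    return y

y = relu_transpose([[1.0, -2.0], [-3.0, 4.0]])
```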

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run

```bash
cd composable_kernel/example/44_elementwise_permute
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j

# Example run (ReLU + NCHW to NHWC)
./elementwise_permute_xdl --input_shape=32,128,56,56 --permutation=0,2,3,1 --operation=relu --verify=1 --time=1
```

## Source Code Structure

### Directory Layout

```
example/44_elementwise_permute/
└── elementwise_permute_xdl.cpp            # Main example: sets up, runs, and verifies elementwise+permute

include/ck/tensor_operation/gpu/device/
└── device_elementwise_permute.hpp         # Device-level API for fused elementwise+permute

include/ck/tensor_operation/gpu/device/impl/
└── device_elementwise_permute_impl.hpp    # Implementation

include/ck/tensor_operation/gpu/grid/
└── gridwise_elementwise_permute.hpp       # Grid-level kernel

include/ck/tensor_operation/gpu/element/
└── element_wise_operation.hpp             # Elementwise operation definitions
```

### Key Classes and Functions

- **DeviceElementwisePermute** (in `device_elementwise_permute.hpp`): Device API for fused elementwise and permutation.
- **gridwise_elementwise_permute** (in `gridwise_elementwise_permute.hpp`): Implements the tiled/blocking elementwise+permute kernel.
- **element_wise_operation** (in `element_wise_operation.hpp`): Defines elementwise operations (e.g., relu, scale).

This example demonstrates how Composable Kernel supports efficient fusion of elementwise operations and tensor permutation for deep learning and data layout transformations.
example/45_elementwise_normalization/README.md (new file, 86 lines)

# Elementwise Normalization

This example demonstrates a fused **elementwise operation followed by normalization**. This pattern combines elementwise tensor arithmetic with a normalization operation in a single kernel, which is particularly useful for implementing custom normalization layers or fused activation-normalization blocks.

## Mathematical Formulation

The operation performs an elementwise computation followed by a normalization operation.

1. **Elementwise Stage**: An elementwise operation is applied to one or more input tensors.
   $C_{temp} = f(A, B, \dots)$
   Where `f` is a user-defined elementwise function that operates on corresponding elements of the input tensors.

2. **Normalization Stage**: The result is then normalized. The normalization can be performed along specified dimensions.
   - **Compute Statistics**: For each normalization group, compute the mean and variance.
     $\mu = \frac{1}{N} \sum C_{temp}$
     $\sigma^2 = \frac{1}{N} \sum (C_{temp} - \mu)^2$
   - **Normalize**: Apply the normalization formula.
     $\hat{C} = \frac{C_{temp} - \mu}{\sqrt{\sigma^2 + \epsilon}}$
   - **Scale and Shift**: Apply learnable parameters.
     $D = \gamma \cdot \hat{C} + \beta$

The key optimization is that the intermediate tensor `C_temp` is **never written to global memory**. The elementwise computation feeds directly into the normalization calculation.
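As a sketch of the math above (a reference, not the CK implementation), the following NumPy code uses addition for `f` and normalizes along the last dimension:

```python
import numpy as np

def elementwise_norm_ref(a, b, gamma, beta, eps=1e-5):
    # Elementwise stage: here f(A, B) = A + B; the real kernel accepts an
    # arbitrary elementwise functor.
    c = a + b
    # Normalization stage over the last dimension (one group per row).
    mu = c.mean(axis=-1, keepdims=True)
    var = c.var(axis=-1, keepdims=True)
    c_hat = (c - mu) / np.sqrt(var + eps)
    # Scale and shift with learnable parameters.
    return gamma * c_hat + beta
```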

## Algorithmic Strategy: Fused Elementwise with Online Normalization

The implementation combines elementwise computation with an online normalization algorithm.

1. **Grid Scheduling**: The normalization groups are distributed among thread blocks. Each block handles one or more normalization groups.

2. **Fused Two-Pass Algorithm**:
   - **Pass 1 - Compute Elementwise and Moments**:
     - Threads cooperatively load input tensors and apply the elementwise function `f`.
     - The elementwise results are kept in registers/shared memory.
     - **Welford's Algorithm**: Threads use Welford's online algorithm to compute the mean and variance of the elementwise results within their normalization group.
     - **Intra-Block Reduction**: A parallel reduction in shared memory computes the final statistics for the group.
   - **Pass 2 - Normalize and Store**:
     - Using the computed statistics, threads apply the normalization formula to their elementwise results.
     - The final normalized result is written to the output tensor `D`.

This approach ensures that the elementwise computation is performed only once, and the results are immediately consumed by the normalization process without requiring additional memory bandwidth.
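Welford's per-thread update and the merge rule used for the parallel reduction can be sketched in plain Python (a reference, not CK device code):

```python
def welford(values):
    # Single-pass, numerically stable accumulation of count, mean, and M2
    # (sum of squared deviations) -- the per-thread stage.
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return count, mean, m2

def welford_merge(a, b):
    # Combine two partial results -- the intra-block reduction stage.
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2
```

The variance is recovered as `m2 / n`, matching $\sigma^2$ in the formulation above.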

## Source Code Organization

- [`elementwise_normalization_xdl.cpp`](./elementwise_normalization_xdl.cpp): The main example file. It sets up the input tensors, defines the elementwise operation and normalization parameters, and instantiates the `DeviceElementwiseNormalization` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_elementwise_normalization.hpp`](../../include/ck/tensor_operation/gpu/device/device_elementwise_normalization.hpp): The high-level device interface for the fused elementwise normalization operation.
- The underlying grid-wise kernel implements the complex fusion of elementwise operations with the two-pass normalization algorithm.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/45_elementwise_normalization
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./elementwise_normalization_xdl

# Run with verification, data initialization, and timing
./elementwise_normalization_xdl 1 2 1
```

## Applications

This fused operation is valuable for implementing custom normalization layers and optimizing activation-normalization sequences.

- **Custom Activation-Normalization Blocks**: Some architectures use non-standard activation functions followed by normalization. For example, a Swish activation followed by layer normalization can be fused into a single kernel using this pattern.
- **Residual Connection with Normalization**: In some variants of residual networks, the residual addition is immediately followed by normalization. This can be expressed as an elementwise addition (residual) followed by normalization.
- **Preprocessing Pipelines**: In data preprocessing, tensors might need elementwise transformations (e.g., color space conversion) followed by normalization (e.g., standardization). This kernel can fuse these operations.
- **Research Architectures**: Novel normalization techniques often involve custom elementwise operations before the normalization step. This kernel provides a flexible foundation for implementing such research ideas efficiently.

# GEMM with Add and Multiply Fusion

## Theory

This example demonstrates **GEMM fused with addition and multiplication operations**. This pattern is used in neural networks for bias addition, scaling, gating, and other elementwise transformations after a linear layer.

**Mathematical Formulation:**
- GEMM: $Y = A \times B$
- Add: $Z = Y + D_0$
- Multiply: $E = Z \odot D_1$
- $D_0$, $D_1$: auxiliary tensors (e.g., bias, scale, gate)
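The formulation can be sketched as an unfused NumPy reference (for checking results only; the fused kernel performs all three steps in one pass):

```python
import numpy as np

def gemm_add_multiply_ref(a, b, d0, d1):
    # Unfused reference: the fused kernel applies the add and multiply in
    # the GEMM epilogue while Y is still in registers.
    y = a @ b          # GEMM:     Y = A x B
    z = y + d0         # Add:      Z = Y + D0 (d0 may broadcast, e.g. a bias row)
    return z * d1      # Multiply: E = Z (elementwise) D1
```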

**Algorithmic Background:**
- The GEMM result is kept in registers; the addition and multiplication are fused into the epilogue.
- No intermediate results are written to global memory.
- Used for bias+scale, gating, and other fused epilogue patterns.

## How to Run

### Prerequisites

Please follow the instructions in the main [Build Guide](../../README.md#building-ck) section as a prerequisite to building and running this example.

### Build and run
```bash
cd composable_kernel/example/46_gemm_add_multiply
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc ..
make -j
```

### Run `example_gemm_add_multiply_dl_fp16`
```bash
#arg1: verification (0=no, 1=yes)
#arg2: initialization (0=no init, 1=integer value, 2=decimal value)
#arg4 to 11: M (256x), N(128x), K(32x), StrideA, StrideB, StrideD0, StrideD1, StrideE
./bin/example_gemm_add_multiply_dl_fp16 1 1 1
```

## Source Code Structure

### Directory Layout
```
example/46_gemm_add_multiply/
├── gemm_add_multiply_xdl.cpp             # Main example: sets up, runs, and verifies GEMM+Add+Multiply
include/ck/tensor_operation/gpu/device/
└── device_gemm_multiple_d.hpp            # Device-level API for multi-tensor GEMM
include/ck/tensor_operation/gpu/device/impl/
└── device_gemm_add_multiply_impl.hpp     # Add+Multiply implementation
include/ck/tensor_operation/gpu/grid/
└── gridwise_gemm_multiple_d_xdl.hpp      # Grid-level multi-stage GEMM
include/ck/tensor_operation/gpu/element/
└── element_wise_operation.hpp            # Elementwise operation definitions
```

### Key Classes and Functions

- **DeviceGemmMultipleD** (in `device_gemm_multiple_d.hpp`):
  Device API for GEMM with multiple auxiliary tensors and fused epilogues.
- **gridwise_gemm_multiple_d_xdl** (in `gridwise_gemm_multiple_d_xdl.hpp`):
  Implements the tiled/blocking GEMM kernel with a multi-stage epilogue.
- **element_wise_operation** (in `element_wise_operation.hpp`):
  Defines addition, multiplication, and other elementwise operations.

This example demonstrates how Composable Kernel supports efficient fusion of addition and multiplication with GEMM for deep learning and scientific computing.
example/47_gemm_bias_softmax_gemm_permute/README.md (new file, 90 lines)

# GEMM-Bias-Softmax-GEMM-Permute Fusion

This example demonstrates a complex, highly specialized fusion: **GEMM → Bias → Softmax → GEMM → Permute**. This pattern represents a complete, optimized attention mechanism with an additional layout transformation, making it ideal for Transformer models that require specific output formats.

## Mathematical Formulation

The operation performs a complete attention calculation with an additional permutation at the end.

1. **First GEMM (QK^T)**: Compute attention scores.
   $S_{temp} = Q \times K^T$

2. **Bias Addition**: Add attention bias (e.g., positional bias or causal mask).
   $S'_{temp} = S_{temp} + \text{Bias}$

3. **Softmax**: Apply softmax to get attention weights.
   $P = \text{softmax}(S'_{temp})$

4. **Second GEMM (PV)**: Apply attention weights to values.
   $O_{temp} = P \times V$

5. **Permutation**: Reorder dimensions for subsequent processing.
   $O = \text{permute}(O_{temp})$

The key optimization is that all intermediate tensors (`S_temp`, `S'_temp`, `P`, `O_temp`) are **never written to global memory**. The entire attention calculation and permutation are performed in a single, monolithic kernel.
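Ignoring tiling, the five stages compute the following, shown here as an unfused NumPy reference (shapes `[batch, heads, seq, dim]` are assumed for illustration):

```python
import numpy as np

def attention_bias_permute_ref(q, k, v, bias, perm=(0, 2, 1, 3)):
    # Unfused reference: every intermediate below is materialized, whereas
    # the fused kernel keeps s, p, and o out of global memory.
    s = q @ np.swapaxes(k, -1, -2)          # 1. GEMM0: Q x K^T
    s = s + bias                            # 2. bias addition
    s = s - s.max(axis=-1, keepdims=True)   # 3. numerically stable softmax
    p = np.exp(s)
    p = p / p.sum(axis=-1, keepdims=True)
    o = p @ v                               # 4. GEMM1: P x V
    return np.transpose(o, perm)            # 5. output permutation
```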

## Algorithmic Strategy: Extended Tiled Attention with Permuted Output

This kernel extends the fused attention algorithm with bias addition and output permutation.

1. **Batch Scheduling**: The attention problems are distributed across thread blocks, with each block handling one attention head for one batch item.

2. **Extended Tiled Computation**: The tiled attention algorithm is enhanced to include bias and permutation.
   - **Load Q tile**: A tile of the Query matrix is loaded into registers.
   - **Inner loop over K/V tiles**:
     - Load tiles of the Key matrix `K` and Value matrix `V`.
     - **Compute Score Tile (GEMM0)**: Compute QK^T and keep it in registers.
     - **Bias Addition**: Load and add the corresponding bias tile.
     - **Online Softmax**: Apply the numerically stable online softmax algorithm.
     - **Compute Output Tile (GEMM1)**: Multiply the attention weights with the V tile.
   - **Permuted Store**: After the loop, calculate the permuted destination coordinates and write the final result to the correct permuted location, instead of writing directly to the output layout.

This approach maintains the memory efficiency of fused attention while adding the computational benefits of bias fusion and the layout flexibility of permutation.
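The online-softmax step of the inner loop can be sketched on a single row (a reference, not the device code): each new tile of scores updates a running maximum and running sum, and previously accumulated results are rescaled so that no full row is ever materialized.

```python
import numpy as np

def online_softmax(score_tiles):
    # Streaming softmax over one row of scores, processed tile by tile.
    m = -np.inf   # running maximum
    s = 0.0       # running sum of exponentials
    out = []
    for tile in score_tiles:
        m_new = max(m, float(tile.max()))
        # Rescale the running sum and earlier partial results to the new max.
        s = s * np.exp(m - m_new) + float(np.exp(tile - m_new).sum())
        out = [o * np.exp(m - m_new) for o in out]
        out.append(np.exp(tile - m_new))
        m = m_new
    return np.concatenate(out) / s
```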

## Source Code Organization

- [`gemm_bias_softmax_gemm_permute_xdl.cpp`](./gemm_bias_softmax_gemm_permute_xdl.cpp): The main example file. It sets up the Q, K, V matrices, the bias tensor, and the permutation specification, then instantiates the specialized operation.
- The device-level interface for this operation combines attention computation with bias handling and permutation logic.
- The underlying kernel is one of the most sophisticated fusion patterns in the library, managing multiple computational stages and complex memory access patterns.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/47_gemm_bias_softmax_gemm_permute
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./gemm_bias_softmax_gemm_permute_xdl

# Run with verification, data initialization, and timing
./gemm_bias_softmax_gemm_permute_xdl 1 2 1
```

## Applications in Advanced Transformer Architectures

This kernel is designed for advanced Transformer implementations that require specialized attention patterns and output formats.

- **Relative Position Encoding**: Many modern Transformers use relative positional encodings that require adding learned bias terms to attention scores. This kernel can fuse these bias additions directly into the attention computation.
- **Multi-Head Attention with Layout Optimization**: After computing attention for multiple heads, the output often needs to be permuted to optimize memory layout for subsequent layers. This kernel can perform the attention computation and layout transformation in a single pass.
- **Causal Attention with Masking**: In autoregressive models, causal masking is applied as a bias term (typically large negative values) to prevent attending to future positions. This kernel can efficiently apply such masking.
- **Custom Attention Variants**: Research architectures often require modified attention mechanisms with additional bias terms or specific output layouts. This kernel provides a high-performance foundation for such implementations.

This example demonstrates how a complex multi-stage algorithm can be optimized through deep kernel fusion.
example/48_pool3d_fwd/README.md (new file, 93 lines)

# 3D Pooling Forward

This example demonstrates a **3D pooling forward operation**. Pooling is a fundamental operation in convolutional neural networks that reduces the spatial dimensions of feature maps while retaining important information. 3D pooling extends this concept to three-dimensional data, commonly used in video analysis, medical imaging, and 3D computer vision applications.

## Mathematical Formulation

3D pooling operates on 5D tensors with shape `[N, C, D, H, W]` where:
- `N` is the batch size
- `C` is the number of channels
- `D`, `H`, `W` are the depth, height, and width dimensions

The operation applies a pooling function over 3D windows of the input tensor.

For each output position `(n, c, d_out, h_out, w_out)`:
$\text{Out}_{ncd_{out}h_{out}w_{out}} = \text{Pool}(\{X_{ncd'h'w'} : d' \in W_d, h' \in W_h, w' \in W_w\})$

Where:
- $W_d$, $W_h$, $W_w$ define the 3D pooling window
- `Pool` is the pooling function (e.g., max or average)

**Max Pooling**: $\text{Pool}(S) = \max(S)$
**Average Pooling**: $\text{Pool}(S) = \frac{1}{|S|} \sum_{x \in S} x$

The window positions are determined by:
- **Window size**: `(pool_d, pool_h, pool_w)`
- **Stride**: `(stride_d, stride_h, stride_w)`
- **Padding**: `(pad_d, pad_h, pad_w)`
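A minimal NumPy reference of the formulation (padding omitted for brevity; one window reduction per output element):

```python
import numpy as np

def pool3d_ref(x, window, stride, mode="max"):
    # Reference 3D pooling over an [N, C, D, H, W] tensor, no padding.
    n, c, d, h, w = x.shape
    wd, wh, ww = window
    sd, sh, sw = stride
    od, oh, ow = (d - wd) // sd + 1, (h - wh) // sh + 1, (w - ww) // sw + 1
    out = np.empty((n, c, od, oh, ow), x.dtype)
    reduce = np.max if mode == "max" else np.mean
    for zo in range(od):
        for yo in range(oh):
            for xo in range(ow):
                win = x[:, :, zo*sd:zo*sd+wd, yo*sh:yo*sh+wh, xo*sw:xo*sw+ww]
                out[:, :, zo, yo, xo] = reduce(win, axis=(2, 3, 4))
    return out
```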

## Algorithmic Strategy: Parallel Window-based Computation

3D pooling is implemented as a parallel algorithm where each thread computes one output element.

1. **Grid Scheduling**: The output tensor elements are distributed across GPU threads. Each thread is assigned to compute one element of the output tensor.

2. **Window Processing**: For each output position, a thread:
   - **Calculate Input Window**: Determines the 3D input window corresponding to the current output position based on stride, padding, and window size.
   - **Boundary Handling**: Checks for boundary conditions and padding, ensuring that only valid input positions are processed.
   - **Apply Pooling Function**:
     - **Max Pooling**: Iterates through the window and finds the maximum value.
     - **Average Pooling**: Iterates through the window, accumulates values, and computes the average.
   - **Store Result**: Writes the computed result to the output tensor.

3. **Memory Access Optimization**: The kernel is optimized for memory access patterns, using techniques like:
   - Coalesced memory access where possible
   - Shared memory for frequently accessed data
   - Efficient handling of boundary conditions

## Source Code Organization

- [`pool3d_fwd_xdl.cpp`](./pool3d_fwd_xdl.cpp): The main example file. It sets up a 3D input tensor, defines pooling parameters (window size, stride, padding), and instantiates the `DevicePool3dFwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_pool3d_fwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_pool3d_fwd.hpp): The high-level device interface for 3D pooling operations.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_pool3d_fwd.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_pool3d_fwd.hpp): The grid-wise kernel implementing the parallel 3D pooling algorithm.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/48_pool3d_fwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./pool3d_fwd_xdl

# Run with verification, data initialization, and timing
./pool3d_fwd_xdl 1 2 1
```

## Applications

3D pooling is essential in several domains that process volumetric or temporal data.

- **Video Analysis**: In video understanding tasks, 3D CNNs use 3D pooling to reduce temporal and spatial dimensions while preserving important motion and appearance features.
- **Medical Imaging**: 3D medical images (CT scans, MRI) require 3D pooling for feature extraction while maintaining spatial relationships in all three dimensions.
- **3D Computer Vision**: Object detection and segmentation in 3D point clouds or voxel grids use 3D pooling for hierarchical feature learning.
- **Action Recognition**: Video action recognition models use 3D pooling to aggregate features across temporal and spatial dimensions.
- **Volumetric Data Processing**: Scientific applications processing 3D volumetric data (weather modeling, fluid dynamics) use 3D pooling for multi-scale analysis.
example/49_maxpool2d_bwd/README.md (new file, 95 lines)

# 2D Max Pooling Backward

This example demonstrates the **backward pass of 2D max pooling**. This operation computes the gradient of the loss with respect to the input of a max pooling layer, which is essential for training convolutional neural networks that use max pooling for downsampling.

## Mathematical Formulation

The backward pass of max pooling propagates gradients only to the input positions that contributed to the maximum value in each pooling window.

Given:
- Input tensor `X` with shape `[N, C, H_in, W_in]`
- Output gradients `dL/dY` with shape `[N, C, H_out, W_out]`
- Pooling parameters: window size `(pool_h, pool_w)`, stride `(stride_h, stride_w)`, padding `(pad_h, pad_w)`

The backward pass computes input gradients `dL/dX` with the same shape as `X`.

For each pooling window, the gradient flows only to the position that had the maximum value:
$\frac{\partial L}{\partial X_{nchw}} = \sum_{\text{windows containing } (h,w)} \frac{\partial L}{\partial Y_{nch'w'}} \cdot \mathbf{1}[\text{argmax}_{(h'',w'')} X_{nch''w''} = (h,w)]$

Where the indicator function $\mathbf{1}[\cdot]$ is 1 if the position `(h,w)` was the argmax in its corresponding pooling window, and 0 otherwise.

## Algorithmic Strategy: Parallel Gradient Routing

The backward pass requires determining which input positions were selected during the forward pass and routing gradients accordingly.

1. **Grid Scheduling**: The computation can be parallelized over either the input or output tensor elements, depending on the implementation strategy.

2. **Argmax Information**: There are two main approaches to handling the argmax information:
   - **Recomputation**: Recompute the argmax during the backward pass by examining each pooling window.
   - **Stored Indices**: Use argmax indices saved during the forward pass (saves recomputation, at the cost of storing the indices; advantageous when the backward pass is run multiple times).

3. **Gradient Routing Algorithm** (using the recomputation approach):
   - **Initialize**: Set all input gradients to zero.
   - **For each output position**: Each thread processes one output gradient position `(n, c, h_out, w_out)`.
   - **Find Input Window**: Calculate the corresponding input window based on stride and padding.
   - **Recompute Argmax**: Find the position with the maximum value in the input window.
   - **Route Gradient**: Add the output gradient to the input position that had the maximum value (using atomic operations if necessary).

4. **Memory Access Optimization**: The kernel optimizes for:
   - Coalesced access to gradient tensors
   - Efficient atomic operations for gradient accumulation
   - Minimal redundant computation of argmax positions
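The recomputation approach can be sketched as a NumPy reference (padding omitted for brevity; not the CK kernel itself):

```python
import numpy as np

def maxpool2d_bwd_ref(x, grad_out, window, stride):
    # For each output position, re-find the argmax in its input window and
    # route the gradient there; += mirrors the atomic accumulation needed
    # when overlapping windows select the same input position.
    n, c, _, _ = x.shape
    wh, ww = window
    sh, sw = stride
    grad_in = np.zeros_like(x)
    oh, ow = grad_out.shape[2], grad_out.shape[3]
    for b in range(n):
        for ch in range(c):
            for yo in range(oh):
                for xo in range(ow):
                    win = x[b, ch, yo*sh:yo*sh+wh, xo*sw:xo*sw+ww]
                    iy, ix = np.unravel_index(np.argmax(win), win.shape)
                    grad_in[b, ch, yo*sh + iy, xo*sw + ix] += grad_out[b, ch, yo, xo]
    return grad_in
```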

## Source Code Organization

- [`maxpool2d_bwd_xdl.cpp`](./maxpool2d_bwd_xdl.cpp): The main example file. It sets up the input tensor, output gradients, pooling parameters, and instantiates the `DeviceMaxpool2dBwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_maxpool2d_bwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_maxpool2d_bwd.hpp): The high-level device interface for 2D max pooling backward operations.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_maxpool2d_bwd.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_maxpool2d_bwd.hpp): The grid-wise kernel implementing the gradient routing algorithm.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/49_maxpool2d_bwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./maxpool2d_bwd_xdl

# Run with verification, data initialization, and timing
./maxpool2d_bwd_xdl 1 2 1
```

## Computational Characteristics

Max pooling backward has unique characteristics compared to other CNN operations:

- **Sparse Gradient Flow**: Unlike convolution or dense layers where gradients flow to all inputs, max pooling creates sparse gradient patterns where only selected input positions receive gradients.
- **Memory-bound Operation**: The operation is typically memory-bound rather than compute-bound, as it involves reading gradients and writing results with minimal arithmetic.
- **Atomic Operations**: When multiple output positions map to the same input position, atomic operations may be needed to correctly accumulate gradients.

## Relationship to Forward Pass

The backward pass must be consistent with the forward pass implementation:
- The same tie-breaking rules for equal maximum values
- Identical handling of padding and boundary conditions
- Consistent stride and window size interpretation

This ensures that the computed gradients correctly reflect the actual forward pass computation, which is essential for proper gradient-based optimization.
example/50_put_element/README.md (new file, 103 lines)

# Put Element Operation

This example demonstrates a **put element operation**, which scatters or places elements from a source tensor into specific positions of a destination tensor based on index arrays. This is a fundamental operation for implementing sparse updates, scatter operations, and advanced indexing patterns in deep learning and scientific computing.

## Mathematical Formulation

The put element operation updates specific positions in a destination tensor using values from a source tensor and position information from index tensors.

Given:
- **Destination tensor** `D` with shape `[D0, D1, ..., Dn]`
- **Source tensor** `S` with shape `[M, ...]` containing values to be placed
- **Index tensors** `I0, I1, ..., In` with shape `[M]` specifying destination coordinates
- **Update mode**: how to handle multiple updates to the same position

The operation performs:
$D[I0[i], I1[i], ..., In[i]] \leftarrow \text{Update}(D[I0[i], I1[i], ..., In[i]], S[i])$

for each element `i` from `0` to `M-1`.

**Update modes**:
- **Overwrite**: `D[idx] = S[i]`
- **Add**: `D[idx] += S[i]`
- **Multiply**: `D[idx] *= S[i]`
- **Max**: `D[idx] = max(D[idx], S[i])`
- **Min**: `D[idx] = min(D[idx], S[i])`
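On a 1-D destination, the update modes can be sketched with NumPy's `ufunc.at` (a reference, not the CK API). `ufunc.at` performs unbuffered updates, mirroring the atomic accumulation a GPU needs when several source elements target the same position; plain fancy-index assignment (overwrite) is nondeterministic under such conflicts.

```python
import numpy as np

def put_element_ref(dst, idx, src, mode="add"):
    # Scatter src into dst at positions idx under the chosen update mode.
    ops = {"add": np.add.at, "multiply": np.multiply.at,
           "max": np.maximum.at, "min": np.minimum.at}
    if mode == "overwrite":
        dst[idx] = src   # last write wins; undefined order under conflicts
    else:
        ops[mode](dst, idx, src)
    return dst
```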

## Algorithmic Strategy: Parallel Scatter with Conflict Resolution

The implementation must handle parallel updates and potential conflicts when multiple source elements target the same destination position.

1. **Grid Scheduling**: The operation is parallelized over the source elements. Each thread is assigned to process one or more elements from the source tensor.

2. **Index Calculation**: For each source element, threads:
   - Read the corresponding indices from the index tensors
   - Validate that indices are within bounds
   - Calculate the linear memory address in the destination tensor

3. **Conflict Resolution**: When multiple threads attempt to update the same destination position:
   - **Atomic Operations**: Use atomic functions for commutative operations (add, max, min)
   - **Serialization**: For non-commutative operations, use locks or other synchronization
   - **Deterministic Ordering**: Ensure consistent results across runs

4. **Memory Access Optimization**:
   - Coalesced reading from source and index tensors
   - Efficient atomic operations on the destination tensor
   - Minimize memory bank conflicts

## Source Code Organization

- [`put_element_xdl.cpp`](./put_element_xdl.cpp): The main example file. It sets up the destination tensor, source tensor, index arrays, and instantiates the `DevicePutElement` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_put_element.hpp`](../../include/ck/tensor_operation/gpu/device/device_put_element.hpp): The high-level device interface for put element operations.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_put_element.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_put_element.hpp): The grid-wise kernel implementing the parallel scatter algorithm with conflict resolution.

## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/50_put_element
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./put_element_xdl

# Run with verification, data initialization, and timing
./put_element_xdl 1 2 1
```

## Applications

Put element operations are fundamental to many advanced algorithms and data structures.

- **Sparse Neural Networks**: Updating specific weights or activations in sparse neural network architectures where only a subset of parameters are active.
- **Graph Neural Networks**: Scatter operations for aggregating information from neighboring nodes to target nodes in graph structures.
- **Embedding Updates**: Updating specific rows in embedding tables based on sparse input indices, common in recommendation systems and NLP models.
- **Histogram Computation**: Accumulating counts or values into histogram bins based on computed indices.
- **Sparse Linear Algebra**: Implementing sparse matrix operations where values are placed at specific coordinate positions.
- **Advanced Indexing**: Supporting NumPy-style advanced indexing patterns for tensor manipulation.

## Performance Considerations

The performance of put element operations depends heavily on the access patterns:

- **Random Access**: Scattered indices lead to poor memory locality and cache performance
- **Atomic Contention**: High conflict rates (many updates to the same positions) can severely impact performance
- **Memory Bandwidth**: The operation is typically memory-bound, especially with good locality
- **Load Balancing**: Uneven distribution of conflicts can cause load imbalance across threads
example/51_avgpool3d_bwd/README.md (new file, 101 lines)

# 3D Average Pooling Backward

This example demonstrates the **backward pass of 3D average pooling**. This operation computes the gradient of the loss with respect to the input of a 3D average pooling layer, which is essential for training 3D convolutional neural networks used in video analysis, medical imaging, and volumetric data processing.

## Mathematical Formulation

The backward pass of 3D average pooling distributes the output gradients uniformly across all input positions that contributed to each pooling window.

Given:
- Input tensor `X` with shape `[N, C, D_in, H_in, W_in]`
- Output gradients `dL/dY` with shape `[N, C, D_out, H_out, W_out]`
- Pooling parameters: window size `(pool_d, pool_h, pool_w)`, stride `(stride_d, stride_h, stride_w)`, padding `(pad_d, pad_h, pad_w)`

The backward pass computes input gradients `dL/dX` with the same shape as `X`.

For 3D average pooling, the gradient is distributed uniformly across all positions in each pooling window:
$\frac{\partial L}{\partial X_{ncdhw}} = \sum_{\text{windows containing } (d,h,w)} \frac{1}{|W|} \cdot \frac{\partial L}{\partial Y_{ncd'h'w'}}$

Where `|W|` is the effective window size (accounting for padding and boundaries), and the sum is over all output positions whose pooling windows include the input position `(d,h,w)`.
|
||||
|
||||
## Algorithmic Strategy: Parallel Gradient Distribution

The backward pass distributes gradients from output positions to all input positions that contributed to each pooling window.

1. **Grid Scheduling**: The computation can be parallelized over either input or output tensor elements, depending on the implementation strategy.

2. **Gradient Distribution Algorithm** (output-centric approach):
   - **Initialize**: Set all input gradients to zero.
   - **For each output position**: Each thread processes one output gradient position `(n, c, d_out, h_out, w_out)`.
   - **Calculate Input Window**: Determine the 3D input window that contributed to this output position.
   - **Effective Window Size**: Calculate the actual number of input elements in the window (accounting for padding and boundaries).
   - **Distribute Gradient**: Add `grad_output / window_size` to each input position in the window (using atomic operations for thread safety).

3. **Boundary Handling**: Careful handling of:
   - **Padding**: Input positions outside the valid range should not receive gradients
   - **Partial Windows**: Windows at boundaries may have fewer than `pool_d × pool_h × pool_w` elements
   - **Edge Cases**: Zero-sized windows or invalid configurations

4. **Memory Access Optimization**:
   - Coalesced reading from output gradients
   - Efficient atomic operations for gradient accumulation
   - Minimized redundant boundary checks
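The output-centric distribution above can be sketched as a NumPy reference implementation. The function name and signature below are illustrative assumptions for this sketch, not the CK device API:

```python
import numpy as np

def avgpool3d_backward(grad_out, in_shape, pool, stride, pad,
                       count_include_pad=False):
    """Reference backward pass of 3D average pooling (output-centric).

    grad_out: [N, C, D_out, H_out, W_out]; returns grad_in with shape
    `in_shape` = [N, C, D_in, H_in, W_in]. Illustrative helper only.
    """
    N, C, Di, Hi, Wi = in_shape
    grad_in = np.zeros(in_shape, dtype=grad_out.dtype)
    _, _, Do, Ho, Wo = grad_out.shape
    for do in range(Do):
        for ho in range(Ho):
            for wo in range(Wo):
                # Window start/end in input coordinates, then clipped
                # to the valid (unpadded) input range.
                d0 = do * stride[0] - pad[0]
                h0 = ho * stride[1] - pad[1]
                w0 = wo * stride[2] - pad[2]
                dc0, hc0, wc0 = max(d0, 0), max(h0, 0), max(w0, 0)
                dc1 = min(d0 + pool[0], Di)
                hc1 = min(h0 + pool[1], Hi)
                wc1 = min(w0 + pool[2], Wi)
                # Effective window size: full window if padded positions
                # count, otherwise only the valid elements.
                size = (pool[0] * pool[1] * pool[2] if count_include_pad
                        else (dc1 - dc0) * (hc1 - hc0) * (wc1 - wc0))
                # The GPU kernel does this scatter with atomic adds.
                grad_in[:, :, dc0:dc1, hc0:hc1, wc0:wc1] += (
                    grad_out[:, :, do:do + 1, ho:ho + 1, wo:wo + 1] / size)
    return grad_in
```

With non-overlapping windows (stride equal to window size, no padding), every input position belongs to exactly one window and receives `grad_output / |W|`, which makes the sketch easy to check by hand.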
## Source Code Organization

- [`avgpool3d_bwd_xdl.cpp`](./avgpool3d_bwd_xdl.cpp): The main example file. It sets up the input tensor, output gradients, pooling parameters, and instantiates the `DeviceAvgpool3dBwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_avgpool3d_bwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_avgpool3d_bwd.hpp): The high-level device interface for 3D average pooling backward operations.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_avgpool3d_bwd.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_avgpool3d_bwd.hpp): The grid-wise kernel implementing the gradient distribution algorithm.
## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/51_avgpool3d_bwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./avgpool3d_bwd_xdl

# Run with verification, data initialization, and timing
./avgpool3d_bwd_xdl 1 2 1
```
## Comparison with Max Pooling Backward

3D average pooling backward differs significantly from max pooling backward:

| Aspect | Max Pooling | Average Pooling |
|--------|-------------|-----------------|
| **Gradient Flow** | Sparse (only to argmax positions) | Dense (to all window positions) |
| **Distribution** | Single position per window | Uniform across window |
| **Computation** | Requires argmax information | Simple arithmetic division |
| **Memory Pattern** | Irregular write pattern | Regular, predictable pattern |
| **Atomic Operations** | Needed for gradient routing | Needed for accumulation |
## Applications in 3D Deep Learning

3D average pooling backward is essential for training models that process volumetric data:

- **Video Understanding**: 3D CNNs for action recognition, video classification, and temporal modeling
- **Medical Imaging**: 3D segmentation and classification of CT scans, MRI, and other volumetric medical data
- **3D Object Recognition**: Processing 3D point clouds, voxel grids, and depth data
- **Scientific Computing**: Climate modeling, fluid dynamics, and other physics simulations
- **Augmented Reality**: 3D scene understanding and object tracking in real-time applications
---

**File:** `example/52_im2col_col2im/README.md` (new file)
# Im2Col and Col2Im Operations

This example demonstrates **Im2Col (image to column) and Col2Im (column to image)** operations. These are fundamental data layout transformations used in implementing convolution operations, particularly in frameworks that convert convolutions into matrix multiplications for efficient computation on GPUs.

## Mathematical Formulation

### Im2Col (Image to Column)
Im2Col transforms a 4D image tensor into a 2D matrix where each column represents the input values for one convolution window.

Given:
- Input tensor `X` with shape `[N, C, H, W]`
- Convolution parameters: kernel size `(KH, KW)`, stride `(SH, SW)`, padding `(PH, PW)`, dilation `(DH, DW)`

The output matrix has shape `[C × KH × KW, N × OH × OW]` where:
- `OH = (H + 2×PH - DH×(KH-1) - 1) / SH + 1`
- `OW = (W + 2×PW - DW×(KW-1) - 1) / SW + 1`

Each column `j` contains the flattened values from the convolution window at output position `j`:
$\text{Col}[:, j] = \text{flatten}(\text{Window}_j(X))$

### Col2Im (Column to Image)
Col2Im is the inverse operation that reconstructs an image tensor from the column representation.

Given:
- Column matrix `Col` with shape `[C × KH × KW, N × OH × OW]`
- Target image dimensions and convolution parameters

The operation accumulates values from overlapping windows:
$X[n, c, h, w] = \sum_{\text{windows covering } (h,w)} \text{Col}[\text{offset}, \text{window\_id}]$

Where multiple windows may contribute to the same image position, requiring accumulation.
## Algorithmic Strategy: Parallel Data Reshaping

Both operations involve complex memory access patterns that require careful optimization.

### Im2Col Implementation
1. **Grid Scheduling**: Parallelize over output columns (convolution windows).

2. **Window Extraction**: For each output column:
   - Calculate the corresponding input window position
   - Handle padding by inserting zeros for out-of-bounds positions
   - Apply dilation by skipping elements in the kernel
   - Copy window values to the appropriate column

3. **Memory Optimization**:
   - Coalesced reads from input image
   - Coalesced writes to output matrix
   - Efficient padding handling

### Col2Im Implementation
1. **Grid Scheduling**: Parallelize over input image positions or column elements.

2. **Accumulation**: For each column element:
   - Calculate which image position it corresponds to
   - Accumulate the value using atomic operations (for overlapping windows)
   - Handle boundary conditions and padding

3. **Conflict Resolution**: Use atomic operations for thread-safe accumulation when multiple columns contribute to the same image position.
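The two transforms can be sketched as NumPy reference implementations. For brevity this sketch omits dilation and uses the `[C × KH × KW, N × OH × OW]` layout from the formulation above; the helper names are illustrative assumptions, not the CK API:

```python
import numpy as np

def im2col(x, KH, KW, SH=1, SW=1, PH=0, PW=0):
    """[N, C, H, W] -> [C*KH*KW, N*OH*OW] with zero padding (no dilation)."""
    N, C, H, W = x.shape
    OH = (H + 2 * PH - KH) // SH + 1
    OW = (W + 2 * PW - KW) // SW + 1
    xp = np.pad(x, ((0, 0), (0, 0), (PH, PH), (PW, PW)))
    col = np.empty((N, C, KH, KW, OH, OW), dtype=x.dtype)
    for kh in range(KH):
        for kw in range(KW):
            # All windows' (kh, kw) elements, gathered in one strided slice.
            col[:, :, kh, kw] = xp[:, :, kh:kh + SH * OH:SH,
                                   kw:kw + SW * OW:SW]
    # Each column is one flattened convolution window.
    return col.transpose(1, 2, 3, 0, 4, 5).reshape(C * KH * KW, N * OH * OW)

def col2im(col, x_shape, KH, KW, SH=1, SW=1, PH=0, PW=0):
    """Inverse layout transform; overlapping windows ACCUMULATE."""
    N, C, H, W = x_shape
    OH = (H + 2 * PH - KH) // SH + 1
    OW = (W + 2 * PW - KW) // SW + 1
    col = col.reshape(C, KH, KW, N, OH, OW).transpose(3, 0, 1, 2, 4, 5)
    xp = np.zeros((N, C, H + 2 * PH, W + 2 * PW), dtype=col.dtype)
    for kh in range(KH):
        for kw in range(KW):
            # Scatter-add back; the GPU kernel uses atomics here.
            xp[:, :, kh:kh + SH * OH:SH, kw:kw + SW * OW:SW] += col[:, :, kh, kw]
    return xp[:, :, PH:PH + H, PW:PW + W]
```

Note that `col2im(im2col(x))` reproduces `x` exactly only when windows do not overlap (stride equal to kernel size); with overlap, each position comes back scaled by the number of windows covering it, which is exactly the accumulation behavior the formulation describes.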
## Source Code Organization

- [`im2col_col2im_xdl.cpp`](./im2col_col2im_xdl.cpp): The main example file. It demonstrates both Im2Col and Col2Im operations with verification that they are inverse operations.
- [`../../include/ck/tensor_operation/gpu/device/device_im2col.hpp`](../../include/ck/tensor_operation/gpu/device/device_im2col.hpp): The high-level device interface for Im2Col operations.
- [`../../include/ck/tensor_operation/gpu/device/device_col2im.hpp`](../../include/ck/tensor_operation/gpu/device/device_col2im.hpp): The high-level device interface for Col2Im operations.
- The underlying kernels implement the complex address calculations and memory access patterns required for these transformations.
## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/52_im2col_col2im
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./im2col_col2im_xdl

# Run with verification, data initialization, and timing
./im2col_col2im_xdl 1 2 1
```
## Applications in Deep Learning

Im2Col and Col2Im are fundamental operations in convolution implementations:

### Im2Col Applications
- **Convolution via GEMM**: Transform convolution into matrix multiplication, allowing use of highly optimized BLAS libraries
- **Explicit Convolution**: Some frameworks prefer explicit Im2Col for better control over memory layouts
- **Winograd Convolution**: Used in Winograd-based fast convolution algorithms
- **Debugging and Visualization**: Understanding the convolution process by examining the column representation

### Col2Im Applications
- **Transpose Convolution**: The backward pass of convolution (gradient w.r.t. input) uses Col2Im
- **Deconvolution**: Upsampling operations that are the inverse of convolution
- **Gradient Computation**: Computing gradients for convolution operations
- **Memory Layout Restoration**: Converting back from optimized layouts to standard image formats
## Performance Characteristics

- **Memory Bound**: Both operations are typically memory-bound rather than compute-bound
- **Access Patterns**: Performance heavily depends on memory access patterns and coalescing
- **Memory Overhead**: Im2Col can significantly increase memory usage due to data duplication
- **Cache Behavior**: Complex strided access patterns can lead to poor cache utilization
---

**File:** `example/53_layernorm2d_bwd/README.md` (new file)
# 2D Layer Normalization Backward

This example demonstrates the **backward pass of 2D Layer Normalization**. This operation computes the gradients of the loss with respect to the input, gamma, and beta parameters of a layer normalization layer, which is essential for training neural networks that use layer normalization, particularly Transformers.

## Mathematical Formulation

The backward pass of layer normalization involves computing gradients for three components: input `X`, scale parameter `gamma`, and shift parameter `beta`.

Given:
- Input tensor `X` with shape `[M, N]`
- Scale parameter `gamma` with shape `[N]`
- Shift parameter `beta` with shape `[N]`
- Output gradients `dL/dY` with shape `[M, N]`

From the forward pass, we have:
- Mean: $\mu_i = \frac{1}{N} \sum_{j=0}^{N-1} X_{ij}$ for each row `i`
- Variance: $\sigma_i^2 = \frac{1}{N} \sum_{j=0}^{N-1} (X_{ij} - \mu_i)^2$
- Normalized: $\hat{X}_{ij} = \frac{X_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$
- Output: $Y_{ij} = \gamma_j \cdot \hat{X}_{ij} + \beta_j$
### Gradient Computations

**Gradient w.r.t. beta**:
$\frac{\partial L}{\partial \beta_j} = \sum_{i=0}^{M-1} \frac{\partial L}{\partial Y_{ij}}$

**Gradient w.r.t. gamma**:
$\frac{\partial L}{\partial \gamma_j} = \sum_{i=0}^{M-1} \frac{\partial L}{\partial Y_{ij}} \cdot \hat{X}_{ij}$
**Gradient w.r.t. input** (most complex):
$\frac{\partial L}{\partial X_{ij}} = \frac{1}{\sqrt{\sigma_i^2 + \epsilon}} \left[ \gamma_j \frac{\partial L}{\partial Y_{ij}} - \frac{1}{N} \sum_{j'=0}^{N-1} \gamma_{j'} \frac{\partial L}{\partial Y_{ij'}} - \frac{\hat{X}_{ij}}{N} \sum_{j'=0}^{N-1} \gamma_{j'} \frac{\partial L}{\partial Y_{ij'}} \hat{X}_{ij'} \right]$

Note that, unlike the parameter gradients, the two sums here run over the feature dimension within each row `i` (not over the batch), because every row is normalized with its own mean and variance. The input gradient therefore requires per-row reductions in addition to the per-row statistics from the forward pass.
## Algorithmic Strategy: Multi-Pass Gradient Computation

The backward pass requires multiple reduction operations and careful coordination between gradient computations.

1. **Pass 1: Compute Gamma and Beta Gradients**
   - **Grid Scheduling**: Parallelize over features (`N` dimension).
   - **Reduction per Feature**: For each feature `j`, reduce across the batch dimension (`M`) to compute:
     - `grad_beta[j] = sum(grad_output[:, j])`
     - `grad_gamma[j] = sum(grad_output[:, j] * x_normalized[:, j])`
2. **Pass 2: Compute Input Gradients**
   - **Grid Scheduling**: Parallelize over rows (`M` dimension).
   - **Per-Row Computation**: For each row `i`:
     - Reduce `gamma[j] * grad_output[i, j]` and `gamma[j] * grad_output[i, j] * x_normalized[i, j]` over the feature dimension of the row
     - Combine these per-row sums with the row's inverse standard deviation
     - Apply the input gradient formula to each element in the row
3. **Memory Management**:
   - Store intermediate statistics (mean, variance, normalized values) from the forward pass or recompute them
   - Use shared memory for efficient intra-block reductions
   - Optimize memory access patterns for coalescing
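The passes above can be condensed into a NumPy reference implementation, useful for checking a kernel's output. The function name and signature are illustrative assumptions, not the CK API; the input gradient is the standard layer-norm backward expression, verifiable against numerical differentiation:

```python
import numpy as np

def layernorm2d_bwd(grad_y, x, gamma, eps=1e-5):
    """Reference gradients for Y = gamma * (X - mu) / sqrt(var + eps) + beta.

    x, grad_y: [M, N]; gamma: [N]. Returns (grad_x, grad_gamma, grad_beta).
    """
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)            # population variance
    rstd = 1.0 / np.sqrt(var + eps)
    xhat = (x - mu) * rstd

    # Pass 1: parameter gradients reduce over the batch dimension.
    grad_beta = grad_y.sum(axis=0)                 # [N]
    grad_gamma = (grad_y * xhat).sum(axis=0)       # [N]

    # Pass 2: input gradient uses per-row reductions over features.
    g = grad_y * gamma                             # dL/dxhat
    grad_x = rstd * (g - g.mean(axis=1, keepdims=True)
                     - xhat * (g * xhat).mean(axis=1, keepdims=True))
    return grad_x, grad_gamma, grad_beta
```

A quick sanity property: each row of `grad_x` sums to (approximately) zero, because the normalization makes the output invariant to a constant shift of the row.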
## Source Code Organization

- [`layernorm2d_bwd_xdl.cpp`](./layernorm2d_bwd_xdl.cpp): The main example file. It sets up the forward pass results, output gradients, and instantiates the `DeviceLayernormBwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_layernorm_bwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_layernorm_bwd.hpp): The high-level device interface for layer normalization backward operations.
- The underlying implementation coordinates multiple reduction kernels and gradient computation stages to efficiently compute all required gradients.
## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/53_layernorm2d_bwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./layernorm2d_bwd_xdl

# Run with verification, data initialization, and timing
./layernorm2d_bwd_xdl 1 2 1
```
## Computational Complexity

The backward pass of layer normalization has similar computational complexity to the forward pass but requires additional memory for storing gradients:

- **Time Complexity**: O(M × N) for each gradient computation
- **Memory Complexity**: O(M × N) for input gradients plus O(N) for parameter gradients
- **Numerical Stability**: Requires careful handling of the variance computation and division operations
## Role in Transformer Training

Layer normalization backward is crucial for training Transformer models:

- **Gradient Flow**: Provides stable gradient propagation through normalization layers
- **Parameter Updates**: Enables learning of the scale (`gamma`) and shift (`beta`) parameters
- **Training Stability**: The normalization helps maintain stable gradients throughout the network
- **Convergence**: Proper implementation is essential for achieving good convergence rates in Transformer training

The efficient implementation of this operation is critical for the overall training performance of large language models and other Transformer-based architectures.
---

**File:** `example/54_groupnorm_bwd/README.md` (new file)
# Group Normalization Backward

This example demonstrates the **backward pass of Group Normalization**. This operation computes the gradients of the loss with respect to the input, gamma, and beta parameters of a group normalization layer, which is essential for training neural networks that use group normalization, particularly in computer vision applications where batch size independence is important.

## Mathematical Formulation

The backward pass of group normalization involves computing gradients for three components: input `X`, scale parameter `gamma`, and shift parameter `beta`.

Given:
- Input tensor `X` with shape `[N, C, H, W]`
- Number of groups `G` (where `C` must be divisible by `G`)
- Scale parameter `gamma` with shape `[C]`
- Shift parameter `beta` with shape `[C]`
- Output gradients `dL/dY` with shape `[N, C, H, W]`

From the forward pass, for each batch item `n` and group `g`:
- Channels in group: $S_g = \{c : c \text{ belongs to group } g\}$ where $|S_g| = C/G$
- Mean: $\mu_{ng} = \frac{1}{(C/G) \cdot H \cdot W} \sum_{c \in S_g} \sum_{h,w} X_{nchw}$
- Variance: $\sigma_{ng}^2 = \frac{1}{(C/G) \cdot H \cdot W} \sum_{c \in S_g} \sum_{h,w} (X_{nchw} - \mu_{ng})^2$
- Normalized: $\hat{X}_{nchw} = \frac{X_{nchw} - \mu_{ng}}{\sqrt{\sigma_{ng}^2 + \epsilon}}$ for $c \in S_g$
- Output: $Y_{nchw} = \gamma_c \cdot \hat{X}_{nchw} + \beta_c$
### Gradient Computations

**Gradient w.r.t. beta**:
$\frac{\partial L}{\partial \beta_c} = \sum_{n,h,w} \frac{\partial L}{\partial Y_{nchw}}$

**Gradient w.r.t. gamma**:
$\frac{\partial L}{\partial \gamma_c} = \sum_{n,h,w} \frac{\partial L}{\partial Y_{nchw}} \cdot \hat{X}_{nchw}$
**Gradient w.r.t. input** (most complex):
For channel `c` in group `g`:
$\frac{\partial L}{\partial X_{nchw}} = \frac{1}{\sqrt{\sigma_{ng}^2 + \epsilon}} \left[ \gamma_c \frac{\partial L}{\partial Y_{nchw}} - \frac{1}{|S_g| \cdot H \cdot W} \sum_{c' \in S_g} \sum_{h',w'} \gamma_{c'} \frac{\partial L}{\partial Y_{nc'h'w'}} - \frac{\hat{X}_{nchw}}{|S_g| \cdot H \cdot W} \sum_{c' \in S_g} \sum_{h',w'} \gamma_{c'} \frac{\partial L}{\partial Y_{nc'h'w'}} \hat{X}_{nc'h'w'} \right]$

Note that both sums are taken per `(n, g)` pair, over the group's channels and all spatial positions, because each group is normalized with its own statistics.
## Algorithmic Strategy: Multi-Stage Group-wise Gradient Computation

The backward pass requires coordinated computation across groups with multiple reduction operations.

1. **Pass 1: Compute Gamma and Beta Gradients**
   - **Grid Scheduling**: Parallelize over channels (`C` dimension).
   - **Reduction per Channel**: For each channel `c`, reduce across `N`, `H`, `W` dimensions:
     - `grad_beta[c] = sum(grad_output[n, c, h, w])` over all `n, h, w`
     - `grad_gamma[c] = sum(grad_output[n, c, h, w] * x_normalized[n, c, h, w])` over all `n, h, w`
2. **Pass 2: Compute Group-wise Intermediate Values**
   - **Grid Scheduling**: Parallelize over `(N, G)` pairs.
   - **Group Reduction**: For each `(n, g)` pair, reduce over the group's channels and spatial positions:
     - Sum `gamma[c] * grad_output[n, c, h, w]`
     - Sum `gamma[c] * grad_output[n, c, h, w] * x_normalized[n, c, h, w]`
   - These per-group sums feed the input gradient computation in Pass 3
3. **Pass 3: Compute Input Gradients**
   - **Grid Scheduling**: Parallelize over input tensor elements.
   - **Per-Element Computation**: For each `(n, c, h, w)`:
     - Identify which group `g` channel `c` belongs to
     - Read the group-wise intermediate values from Pass 2
     - Apply the complex input gradient formula
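The three passes can be expressed compactly in NumPy by reshaping the channel dimension into `(G, C/G)` and reducing per group. As before, the function name and signature are illustrative assumptions for a reference check, not the CK API:

```python
import numpy as np

def groupnorm_bwd(grad_y, x, gamma, G, eps=1e-5):
    """Reference gradients for group norm over [N, C, H, W] with G groups."""
    N, C, H, W = x.shape
    xg = x.reshape(N, G, C // G, H, W)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)       # per (n, g)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    rstd = 1.0 / np.sqrt(var + eps)
    xhat = ((xg - mu) * rstd).reshape(N, C, H, W)

    # Pass 1: per-channel parameter gradients.
    grad_beta = grad_y.sum(axis=(0, 2, 3))            # [C]
    grad_gamma = (grad_y * xhat).sum(axis=(0, 2, 3))  # [C]

    # Pass 2: per-(n, g) reductions over the group's channels and space.
    g = (grad_y * gamma[None, :, None, None]).reshape(N, G, C // G, H, W)
    xh = xhat.reshape(N, G, C // G, H, W)
    gmean = g.mean(axis=(2, 3, 4), keepdims=True)
    gxmean = (g * xh).mean(axis=(2, 3, 4), keepdims=True)

    # Pass 3: per-element input gradient.
    grad_x = (rstd * (g - gmean - xh * gxmean)).reshape(N, C, H, W)
    return grad_x, grad_gamma, grad_beta
```

The structure mirrors the layer-norm backward pass, with the per-row reductions replaced by per-`(n, g)` reductions over `(C/G) × H × W` elements.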
## Source Code Organization

- [`groupnorm_bwd_xdl.cpp`](./groupnorm_bwd_xdl.cpp): The main example file. It sets up the forward pass results, output gradients, group configuration, and instantiates the `DeviceGroupnormBwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_groupnorm_bwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_groupnorm_bwd.hpp): The high-level device interface for group normalization backward operations.
- The underlying implementation coordinates multiple reduction and computation stages to efficiently handle the group-wise structure of the gradients.
## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/54_groupnorm_bwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./groupnorm_bwd_xdl

# Run with verification, data initialization, and timing
./groupnorm_bwd_xdl 1 2 1
```
## Comparison with Other Normalization Backward Passes

| Normalization Type | Gradient Scope | Complexity | Memory Pattern |
|-------------------|----------------|------------|----------------|
| **BatchNorm** | Across batch for each channel | Medium | Channel-wise reductions |
| **LayerNorm** | Across features for each item | Medium | Per-sample reductions |
| **GroupNorm** | Across group for each (batch, group) | High | Group-wise reductions |
| **InstanceNorm** | Per channel per sample | Low | Independent computations |
## Applications in Computer Vision

Group normalization backward is particularly important for:

- **Small Batch Training**: When batch sizes are too small for effective batch normalization
- **Transfer Learning**: Fine-tuning pre-trained models with different batch sizes
- **Object Detection**: Models like YOLO and R-CNN that benefit from batch-size independent normalization
- **Segmentation Networks**: Dense prediction tasks where normalization stability is crucial
- **Style Transfer**: Applications where group-wise feature normalization helps preserve style information

The group-wise structure provides a balance between the stability of batch normalization and the flexibility of layer normalization, making it valuable for many computer vision applications.
---

**File:** `example/59_grouped_gemm_multi_ABD/README.md` (new file)
# Grouped GEMM with Multiple A, B, and D Tensors

This example demonstrates a **Grouped GEMM operation with multiple A, B, and D tensors**. This is an advanced fusion pattern that extends the basic grouped GEMM to handle multiple input matrices and auxiliary tensors simultaneously, enabling complex multi-input computational graphs to be executed in a single kernel launch.

## Mathematical Formulation

This operation performs `G` independent GEMM operations in parallel, where each group can have multiple input matrices and auxiliary tensors.

For each group `g` from `0` to `G-1`:
1. **Multiple Input GEMMs**: Compute products from multiple A and B tensor pairs.

   $C_{temp0[g]} = A_{0[g]} \times B_{0[g]}$

   $C_{temp1[g]} = A_{1[g]} \times B_{1[g]}$

   $\vdots$

   $C_{tempK[g]} = A_{K[g]} \times B_{K[g]}$

2. **Combination with Auxiliary Tensors**: Apply a user-defined function that combines the GEMM results with multiple D tensors.

   $E_{[g]} = f(C_{temp0[g]}, C_{temp1[g]}, \ldots, C_{tempK[g]}, D_{0[g]}, D_{1[g]}, \ldots, D_{M[g]})$

The key optimization is that all intermediate tensors are **never written to global memory**. The multiple GEMMs and auxiliary tensor operations are fused into a single kernel.
## Algorithmic Strategy: Multi-Input Grouped GEMM with Complex Epilogue

This kernel represents one of the most complex fusion patterns, combining multiple matrix multiplications with auxiliary tensor operations.

1. **Group Scheduling**: The `G` independent problems are distributed across thread blocks, with each block assigned to one group.

2. **Multi-GEMM Computation**: Within each thread block:
   - **Parallel GEMM Execution**: Multiple GEMM operations are computed simultaneously, with each using different portions of the available registers and compute resources.
   - **Register Management**: Careful orchestration of register usage to accommodate multiple accumulation buffers for the different GEMM operations.
   - **Memory Interleaving**: Coordinated loading of multiple A and B matrix tiles to maximize memory bandwidth utilization.

3. **Complex Fused Epilogue**: After computing all GEMMs for a group:
   - **Load Auxiliary Tensors**: Read the corresponding D tensor values for the group.
   - **Apply Fusion Function**: Execute the user-defined function `f` that combines all GEMM results and auxiliary tensors.
   - **Store Result**: Write the final fused result to the output tensor.

This approach enables extremely complex computational patterns while maintaining the memory bandwidth efficiency of deep fusion.
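The functional semantics of the fused operation can be sketched in NumPy as an unfused reference: loop over groups, compute every partial GEMM, then apply the epilogue `f`. The shapes, list layout, and combination function below are illustrative assumptions; the fused kernel produces the same values without materializing the `C` intermediates:

```python
import numpy as np

def grouped_gemm_multi_abd(As, Bs, Ds, f):
    """Unfused reference: E[g] = f(A0[g]@B0[g], ..., D0[g], ...).

    As/Bs: lists over GEMM inputs, each a list of per-group matrices;
    Ds: list over auxiliary tensors, each a list of per-group tensors.
    """
    G = len(As[0])
    out = []
    for g in range(G):
        cs = [A[g] @ B[g] for A, B in zip(As, Bs)]   # per-group partial GEMMs
        out.append(f(*cs, *[D[g] for D in Ds]))      # "fused" epilogue
    return out

# Example: two GEMMs per group combined with one bias-like D tensor.
rng = np.random.default_rng(0)
G, M, K, N = 3, 4, 5, 6
A0 = [rng.standard_normal((M, K)) for _ in range(G)]
B0 = [rng.standard_normal((K, N)) for _ in range(G)]
A1 = [rng.standard_normal((M, K)) for _ in range(G)]
B1 = [rng.standard_normal((K, N)) for _ in range(G)]
D0 = [rng.standard_normal((M, N)) for _ in range(G)]
E = grouped_gemm_multi_abd([A0, A1], [B0, B1], [D0],
                           lambda c0, c1, d0: c0 + c1 + d0)
assert E[0].shape == (M, N)
```

In the reference, each `cs[k]` is an `M × N` intermediate that the CK kernel instead keeps in registers, applying `f` tile by tile before the single store of `E[g]`.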
## Source Code Organization

- [`grouped_gemm_multi_ABD_xdl.cpp`](./grouped_gemm_multi_ABD_xdl.cpp): The main example file. It sets up multiple sets of A and B matrices for each group, multiple D tensors, and instantiates the highly complex device operation.
- [`../../include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp`](../../include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp): The device interface for this advanced fusion pattern.
- The underlying kernel manages multiple simultaneous matrix multiplications with extremely complex register allocation and memory access patterns.
## Build and Run

### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example
```bash
cd /path/to/composable_kernel/example/59_grouped_gemm_multi_ABD
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example
```bash
# Run the example with default settings
./grouped_gemm_multi_ABD_xdl

# Run with verification, data initialization, and timing
./grouped_gemm_multi_ABD_xdl 1 2 1
```
## Applications

This highly specialized kernel is valuable for complex computational patterns found in advanced neural network architectures.

- **Multi-Branch Networks**: Architectures that compute multiple parallel paths that are later combined, such as Inception modules or complex residual blocks.
- **Multi-Head Attention Variants**: Advanced attention mechanisms that compute multiple different attention patterns simultaneously and combine them.
- **Ensemble Methods**: When multiple model predictions need to be computed and combined in a single operation.
- **Complex Gating Mechanisms**: Advanced neural network layers that use multiple matrix operations for different gating or routing decisions.
- **Multi-Modal Fusion**: Combining features from different modalities (e.g., vision and text) where each modality requires different linear transformations.
## Performance Considerations

This kernel pushes the boundaries of GPU computation complexity:

- **Register Pressure**: Managing multiple simultaneous GEMM operations requires careful register allocation
- **Memory Bandwidth**: Coordinating multiple data streams while maintaining coalesced access patterns
- **Instruction Scheduling**: Balancing multiple computational streams to maximize throughput
- **Complexity vs. Performance**: The benefits of fusion must outweigh the increased kernel complexity

This example showcases the extreme flexibility of the Composable Kernel framework, demonstrating how highly specialized computational patterns can be implemented efficiently on modern GPU architectures.
---

**File:** `example/60_gemm_multi_ABD/README.md` (new file)
# GEMM with Multiple A, B, and D Tensors

This example demonstrates a **GEMM operation with multiple A, B, and D tensors**. This is a non-grouped version of the previous example, focusing on fusing multiple matrix multiplications and auxiliary tensor operations into a single kernel for a single problem instance rather than multiple grouped problems.

## Mathematical Formulation

This operation performs multiple GEMM operations simultaneously and combines them with auxiliary tensors.

1. **Multiple Input GEMMs**: Compute products from multiple A and B tensor pairs.

   $C_{temp0} = A_0 \times B_0$

   $C_{temp1} = A_1 \times B_1$

   $\vdots$

   $C_{tempK} = A_K \times B_K$

2. **Combination with Auxiliary Tensors**: Apply a user-defined function that combines all GEMM results with multiple D tensors.

   $E = f(C_{temp0}, C_{temp1}, \ldots, C_{tempK}, D_0, D_1, \ldots, D_M)$

The key optimization is that all intermediate tensors are **never written to global memory**. All matrix multiplications and the final combination operation are fused into a single kernel.
## Algorithmic Strategy: Multi-Input GEMM with Complex Epilogue
|
||||
|
||||
This kernel extends the basic GEMM algorithm to handle multiple simultaneous matrix multiplications.
|
||||
|
||||
1. **Unified Grid Scheduling**: A single grid of thread blocks handles all matrix multiplications simultaneously. Each thread block computes corresponding tiles from all GEMM operations.
|
||||
|
||||
2. **Multi-GEMM Tile Computation**: Within each thread block:
|
||||
- **Parallel Accumulation**: Multiple accumulator arrays are maintained in registers, one for each GEMM operation.
|
||||
- **Coordinated Memory Access**: Tiles from all A and B matrices are loaded in a coordinated fashion to maximize memory bandwidth.
|
||||
- **Register Orchestration**: Careful management of register usage to accommodate multiple simultaneous accumulations.
|
||||
|
||||
3. **Unified Fused Epilogue**: After computing tiles for all GEMMs:
|
||||
- **Load All Auxiliary Tensors**: Read corresponding elements from all D tensors.
|
||||
- **Apply Complex Fusion Function**: Execute the user-defined function `f` that operates on all GEMM results and auxiliary tensors.
|
||||
- **Single Output Store**: Write the final combined result to the output tensor.
|
||||
|
||||
This approach maximizes computational density by performing multiple matrix operations simultaneously while maintaining the memory efficiency of fusion.
## Source Code Organization

- [`gemm_multi_ABD_xdl.cpp`](./gemm_multi_ABD_xdl.cpp): The main example file. It sets up multiple A and B matrices and multiple D tensors, and instantiates the `DeviceGemmMultiABD` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_gemm_multi_abd.hpp`](../../include/ck/tensor_operation/gpu/device/device_gemm_multi_abd.hpp): The device interface for this multi-input fusion pattern.
- The underlying kernel implements sophisticated register management and memory access coordination for multiple simultaneous GEMM operations.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/60_gemm_multi_ABD
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./gemm_multi_ABD_xdl

# Run with verification, data initialization, and timing
./gemm_multi_ABD_xdl 1 2 1
```
## Applications

This kernel is useful for complex computational patterns that require multiple simultaneous matrix operations.

- **Multi-Stream Processing**: Computing multiple different transformations of the same data simultaneously (e.g., different projections in attention mechanisms).
- **Ensemble Linear Layers**: When multiple linear transformations need to be computed and combined, as in ensemble methods or multi-expert systems.
- **Complex Gating Mechanisms**: Advanced neural network layers such as Mixture of Experts (MoE) that require multiple matrix operations for routing and computation.
- **Multi-Objective Optimization**: When multiple loss functions require different linear transformations of the same input.
- **Feature Fusion**: Combining multiple feature representations that each require a different linear projection.

## Comparison with Grouped Version

| Aspect | Grouped Multi-ABD | Non-Grouped Multi-ABD |
|--------|-------------------|-----------------------|
| **Problem Structure** | G independent problems | Single unified problem |
| **Memory Layout** | Separate tensors per group | Single tensors with multiple channels |
| **Scheduling** | Group-parallel | Unified parallel |
| **Use Cases** | Independent computations | Correlated computations |
| **Complexity** | Higher (group management) | Lower (unified computation) |

## Performance Characteristics

- **Computational Intensity**: Very high, as multiple matrix operations are performed simultaneously.
- **Memory Bandwidth**: Used efficiently through coordinated access patterns.
- **Register Usage**: High, due to multiple accumulator arrays.
- **Instruction Throughput**: Maximized through parallel execution of multiple GEMM streams.

This kernel demonstrates how to achieve very high computational density while retaining the benefits of operation fusion, making it valuable for applications that require multiple related matrix computations.
105	example/61_contraction_multi_ABD/README.md	Normal file
@@ -0,0 +1,105 @@
# Tensor Contraction with Multiple A, B, and D Tensors

This example demonstrates a **tensor contraction operation with multiple A, B, and D tensors**. It extends the basic tensor contraction to handle multiple input tensor pairs and auxiliary tensors simultaneously, enabling complex multi-input tensor network computations to be executed in a single kernel launch.

## Mathematical Formulation

This operation performs multiple tensor contractions simultaneously and combines their results with auxiliary tensors.

1. **Multiple Tensor Contractions**: Compute contractions from multiple A and B tensor pairs using Einstein summation notation.

   $C_{temp0} = \text{einsum}(\text{pattern}_0, A_0, B_0)$

   $C_{temp1} = \text{einsum}(\text{pattern}_1, A_1, B_1)$

   $\vdots$

   $C_{tempK} = \text{einsum}(\text{pattern}_K, A_K, B_K)$

2. **Combination with Auxiliary Tensors**: Apply a user-defined function that combines all contraction results with multiple D tensors.

   $E = f(C_{temp0}, C_{temp1}, \ldots, C_{tempK}, D_0, D_1, \ldots, D_M)$

Each contraction can have its own Einstein summation pattern, allowing complex tensor network computations. The key optimization is that all intermediate tensors are **never written to global memory**.
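As an illustrative sketch (plain Python, not the CK API), here is one concrete contraction pattern with two contracted dimensions, followed by a combination with a D tensor. The specific pattern and the combining function are assumptions for the example; the real kernel supports user-chosen patterns and functors.

```python
def contract(A, B, M, K1, K2, N):
    """C[m][n] = sum over k1, k2 of A[m][k1][k2] * B[k1][k2][n].

    Folding (k1, k2) into a single index k = k1*K2 + k2 turns this
    contraction into an ordinary M x (K1*K2) by (K1*K2) x N GEMM, which is
    the tensor-to-GEMM mapping the kernel performs logically."""
    return [[sum(A[m][k1][k2] * B[k1][k2][n]
                 for k1 in range(K1) for k2 in range(K2))
             for n in range(N)] for m in range(M)]

A0 = [[[1, 2], [3, 4]]]            # shape [M=1, K1=2, K2=2]
B0 = [[[1], [1]], [[1], [1]]]      # shape [K1=2, K2=2, N=1]
C0 = contract(A0, B0, 1, 2, 2, 1)  # [[10]]

# Illustrative combining function: elementwise add of one D tensor.
D0 = [[5]]
E = [[C0[m][n] + D0[m][n] for n in range(1)] for m in range(1)]  # [[15]]
```

In the fused kernel, `C0` would stay in registers and only `E` would reach global memory.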
## Algorithmic Strategy: Multi-Input Contraction with Tensor-to-GEMM Mapping

This kernel extends the tensor contraction algorithm to handle multiple simultaneous contractions.

1. **Unified Tensor-to-GEMM Mapping**: Each tensor contraction is mapped to a GEMM operation through tensor reshaping:
   - **Multiple Reshaping Operations**: For each contraction pair `(A_i, B_i)`, the tensors are logically reshaped into 2D matrices based on their Einstein summation pattern.
   - **Coordinated Memory Layout**: The reshaping operations are coordinated to enable efficient memory access patterns across all contractions.

2. **Multi-Contraction Tile Computation**: Within each thread block:
   - **Parallel GEMM Execution**: Multiple GEMM operations (one per contraction) are computed simultaneously.
   - **Complex Address Calculation**: Each contraction requires its own address calculation logic to interpret its tensor descriptors.
   - **Register Management**: Multiple accumulator arrays are maintained for the different contraction results.

3. **Tensor Fusion Epilogue**: After computing all contractions:
   - **Multi-Tensor Reshape**: The GEMM results are logically reshaped back to their target tensor shapes.
   - **Load Auxiliary Tensors**: Read the corresponding elements from all D tensors.
   - **Apply Fusion Function**: Execute the user-defined function `f` that combines all results.
   - **Store Final Tensor**: Write the combined result to the output tensor.
## Source Code Organization

- [`contraction_multi_ABD_xdl.cpp`](./contraction_multi_ABD_xdl.cpp): The main example file. It sets up multiple pairs of tensors for contraction, defines the Einstein summation patterns, sets up the auxiliary D tensors, and instantiates the `DeviceContractionMultiABD` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_contraction_multi_abd.hpp`](../../include/ck/tensor_operation/gpu/device/device_contraction_multi_abd.hpp): The device interface for this multi-contraction fusion pattern.
- The underlying kernel manages multiple simultaneous tensor contractions with complex tensor descriptor logic and memory access patterns.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/61_contraction_multi_ABD
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./contraction_multi_ABD_xdl

# Run with verification, data initialization, and timing
./contraction_multi_ABD_xdl 1 2 1
```
## Applications

This kernel is valuable for complex tensor network computations found in advanced scientific and machine learning applications.

- **Tensor Network Methods**: Computing multiple tensor contractions simultaneously in quantum physics simulations, such as DMRG (Density Matrix Renormalization Group) or PEPS (Projected Entangled Pair States).
- **Multi-Modal Tensor Analysis**: Processing multiple tensor contractions for different data modalities in machine learning applications.
- **Higher-Order Statistics**: Computing multiple statistical tensor operations simultaneously, such as different moments or correlation patterns.
- **Advanced Neural Network Layers**: Implementing complex layers that require multiple tensor operations, such as tensor decomposition layers or high-dimensional convolutions.
- **Scientific Computing**: Simulating physical systems that require multiple tensor contractions, as in quantum chemistry or condensed matter physics.

## Computational Complexity

The complexity depends on the specific contraction patterns used:

- **Multiple Contractions**: Each contraction has its own complexity, determined by its tensor dimensions and contraction indices.
- **Memory Access**: Patterns are complex due to multiple tensor descriptors and reshaping operations.
- **Register Pressure**: High, due to multiple accumulator arrays and intermediate results.
- **Instruction Diversity**: Different contractions may have different computational patterns.

## Comparison with Single Contraction

| Aspect | Single Contraction | Multi-Contraction |
|--------|--------------------|-------------------|
| **Input Complexity** | Single tensor pair | Multiple tensor pairs |
| **Memory Layout** | Single reshaping pattern | Multiple coordinated patterns |
| **Computation** | Single GEMM operation | Multiple parallel GEMMs |
| **Fusion Opportunity** | Simple epilogue | Complex multi-input epilogue |
| **Applications** | Basic tensor operations | Complex tensor networks |

This kernel showcases the ability to handle highly complex tensor network computations efficiently, making it valuable for advanced scientific computing and machine learning research.
105	example/62_convnd_activ/README.md	Normal file
@@ -0,0 +1,105 @@
# N-Dimensional Convolution with Activation

This example demonstrates an **N-dimensional forward convolution fused with an activation function**. This fusion pattern combines the convolution with an elementwise activation in a single kernel. It is extremely common in convolutional neural networks and provides significant performance benefits.

## Mathematical Formulation

The operation performs an N-dimensional convolution followed immediately by an activation function.

1. **N-Dimensional Convolution**: A standard N-dimensional forward convolution.

   $C_{temp} = \text{Conv}_{\text{ND}}(\text{In}, \text{W})$

   Here `In` is the input tensor, `W` is the weight tensor, and the convolution can be 1D, 2D, 3D, or higher-dimensional.

2. **Activation Function**: Apply an elementwise activation function to the convolution result.

   $\text{Out} = \text{Activation}(C_{temp})$

   Common activation functions include:
   - **ReLU**: $\text{ReLU}(x) = \max(0, x)$
   - **Sigmoid**: $\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$
   - **Tanh**: $\text{Tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
   - **GELU**: $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the standard Gaussian CDF
   - **Swish**: $\text{Swish}(x) = x \cdot \text{Sigmoid}(x)$

The key optimization is that the intermediate tensor `C_temp` is **never written to global memory**. The activation function is applied directly to the convolution result held in registers.
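A minimal plain-Python sketch of the fused pattern, using a 1D valid convolution for brevity (the example itself handles N dimensions and runs on the GPU; this is only a reference for the math):

```python
import math

def conv1d_act(x, w, act):
    """1-D valid convolution followed immediately by an activation.

    The activation is applied to each output element as it is produced,
    with no intermediate buffer between the two stages -- the same
    structure as the fused epilogue."""
    K = len(w)
    return [act(sum(x[i + j] * w[j] for j in range(K)))
            for i in range(len(x) - K + 1)]

relu = lambda v: max(0.0, v)
sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

x = [1.0, -2.0, 3.0, -4.0, 5.0]
w = [1.0, 0.0, -1.0]
y = conv1d_act(x, w, relu)   # raw conv is [-2.0, 2.0, -2.0] -> [0.0, 2.0, 0.0]
```

Swapping `relu` for `sigmoid` (or any other elementwise function) changes only the epilogue, which mirrors how the device operation is templated on the activation functor.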
## Algorithmic Strategy: Implicit GEMM with Fused Activation Epilogue

The implementation uses the implicit GEMM algorithm for convolution, with the activation function fused into the epilogue.

1. **Implicit GEMM Core**: The convolution is transformed into an equivalent GEMM operation:
   - **Input Transformation**: The input tensor is implicitly transformed using the im2col operation.
   - **Matrix Multiplication**: The core computation is performed as a tiled matrix multiplication.
   - **Output Accumulation**: Results are accumulated in registers as standard GEMM tiles.

2. **Fused Activation Epilogue**: Before results are stored to global memory:
   - **Elementwise Activation**: Apply the activation function to each element in the accumulated tile.
   - **Vectorized Operations**: Use vectorized instructions where possible for the activation computation.
   - **Store Activated Result**: Write the final activated output directly to global memory.

This approach eliminates the separate activation kernel and the memory bandwidth otherwise spent reading and writing the intermediate convolution result.
## Source Code Organization

- [`convnd_activ_xdl.cpp`](./convnd_activ_xdl.cpp): The main example file. It sets up the N-dimensional input tensor and the weight tensor, specifies the activation function, and instantiates the `DeviceConvNdActiv` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp`](../../include/ck/tensor_operation/gpu/device/device_convnd_activ.hpp): The device interface for N-dimensional convolution with activation fusion.
- The underlying kernel implements the implicit GEMM algorithm with templated activation functions in the epilogue.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/62_convnd_activ
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./convnd_activ_xdl

# Run with verification, data initialization, and timing
./convnd_activ_xdl 1 2 1
```
## Applications

Convolution-plus-activation fusion is fundamental to many neural network architectures.

- **Convolutional Neural Networks (CNNs)**: Nearly every convolutional layer in a CNN is followed by an activation function, making this fusion extremely valuable.
- **Computer Vision Models**: Image classification, object detection, and segmentation networks all benefit from this fusion.
- **3D CNNs**: Video analysis and medical imaging applications that use 3D convolutions with activations.
- **Mobile and Edge Deployment**: The reduced memory bandwidth makes this fusion especially valuable in resource-constrained environments.
- **Training Acceleration**: Reducing the number of kernel launches and memory operations accelerates both the forward and backward passes during training.

## Performance Benefits

This fusion provides several performance advantages:

- **Reduced Memory Bandwidth**: Eliminates one full read/write cycle of the intermediate tensor.
- **Improved Cache Locality**: Data stays in cache/registers between the convolution and the activation.
- **Fewer Kernel Launches**: Reduces GPU kernel launch overhead.
- **Better Instruction Scheduling**: Allows better interleaving of compute and memory operations.

## Activation Function Considerations

Different activation functions have different computational characteristics:

- **ReLU**: Very fast; just a comparison and a conditional assignment.
- **Sigmoid/Tanh**: Require expensive exponential calculations.
- **GELU**: Involves an error-function computation, typically approximated.
- **Swish**: Combines a multiplication with a sigmoid computation.

The choice of activation function can significantly affect the overall performance of the fused kernel; simpler functions like ReLU yield the largest improvements.
102	example/63_layernorm4d_fwd/README.md	Normal file
@@ -0,0 +1,102 @@
# 4D Layer Normalization Forward

This example demonstrates the forward pass of **4D layer normalization**. It extends layer normalization to 4-dimensional tensors, which is common in computer vision, where tensors have shape `[N, C, H, W]` and normalization is applied across the channel and spatial dimensions.

## Mathematical Formulation

Given a 4D input tensor `X` with shape `[N, C, H, W]`, 4D layer normalization computes an output tensor `Y` of the same shape. The normalization is performed independently for each batch item, across the channel and spatial dimensions.

For each batch item `n` from `0` to `N-1`:

1. **Compute Mean**: The mean is calculated across the channel (`C`) and spatial (`H`, `W`) dimensions.

   $\mu_n = \frac{1}{C \cdot H \cdot W} \sum_{c=0}^{C-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_{nchw}$

2. **Compute Variance**: The variance is calculated across the same dimensions.

   $\sigma_n^2 = \frac{1}{C \cdot H \cdot W} \sum_{c=0}^{C-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} (X_{nchw} - \mu_n)^2$

3. **Normalize**: The input is normalized using the computed mean and variance.

   $\hat{X}_{nchw} = \frac{X_{nchw} - \mu_n}{\sqrt{\sigma_n^2 + \epsilon}}$

   Here $\epsilon$ is a small constant for numerical stability.

4. **Scale and Shift**: The normalized output is scaled by the learnable parameter `gamma` and shifted by the learnable parameter `beta`.

   $Y_{nchw} = \gamma_{chw} \cdot \hat{X}_{nchw} + \beta_{chw}$

Note that the scale and shift parameters can have different granularities:

- **Per-element**: `gamma` and `beta` have shape `[C, H, W]`
- **Per-channel**: `gamma` and `beta` have shape `[C]` (broadcast over `H`, `W`)
- **Global**: `gamma` and `beta` are scalars (broadcast over `C`, `H`, `W`)
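The four steps above can be sketched as a plain-Python reference (per-element `gamma`/`beta` of shape `[C][H][W]`; this is a correctness reference for the math, not the GPU implementation):

```python
def layernorm4d(X, gamma, beta, eps=1e-5):
    """Per-sample layer norm over (C, H, W) for X of shape [N][C][H][W]."""
    N, C = len(X), len(X[0])
    H, W = len(X[0][0]), len(X[0][0][0])
    Y = []
    for n in range(N):
        # Steps 1-2: mean and (population) variance over all C*H*W elements.
        vals = [X[n][c][h][w]
                for c in range(C) for h in range(H) for w in range(W)]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        inv = (var + eps) ** -0.5
        # Steps 3-4: normalize, then scale and shift.
        Y.append([[[gamma[c][h][w] * (X[n][c][h][w] - mu) * inv + beta[c][h][w]
                    for w in range(W)] for h in range(H)] for c in range(C)])
    return Y

X = [[[[1.0, 3.0]]]]            # N=1, C=1, H=1, W=2 -> mu=2, var=1
Y = layernorm4d(X, gamma=[[[1.0, 1.0]]], beta=[[[0.0, 0.0]]])
```

With `mu = 2` and `var = 1`, the two normalized values come out very close to -1 and +1 (exactly so as `eps` goes to zero).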
## Algorithmic Strategy: Batch-Parallel Reduction with Spatial Aggregation

The implementation treats this as a parallel reduction problem with spatial aggregation for each batch item.

1. **Grid Scheduling**: The `N` batch items are distributed among the GPU's thread blocks. Each block is assigned one or more batch items to normalize.

2. **Spatial-Channel Reduction**: For each assigned batch item:
   - **Cooperative Loading**: Threads within a block cooperatively read the 3D slice `X[n, :, :, :]` corresponding to their batch item.
   - **Welford's Algorithm**: Welford's online algorithm computes the mean and variance across all `C × H × W` elements with good numerical stability.
   - **Intra-Block Reduction**: Threads perform a parallel reduction through shared memory to produce the final statistics for each batch item.

3. **Normalization and Scale/Shift**:
   - **Elementwise Processing**: Each thread processes one or more elements of the batch item.
   - **Apply Normalization**: Normalize each element using the computed mean and variance.
   - **Apply Scale/Shift**: Apply the appropriate `gamma` and `beta` values for the chosen parameterization.
   - **Store Result**: Write the final normalized result to the output tensor.
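The Welford update mentioned in step 2 can be illustrated in a few lines of plain Python (the single-pass scalar form; the kernel runs a parallel variant of this across threads):

```python
def welford(xs):
    """Single-pass, numerically stable mean and population variance."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count          # running mean
        m2 += delta * (x - mean)       # running sum of squared deviations
    return mean, m2 / count            # population variance, as layer norm uses

mean, var = welford([1.0, 2.0, 3.0, 4.0])   # mean = 2.5, var = 1.25
```

Unlike the naive `E[x^2] - E[x]^2` formula, this never subtracts two large nearly-equal quantities, which is why it is preferred for long reductions over `C × H × W` elements.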
## Source Code Organization

- [`layernorm4d_fwd_xdl.cpp`](./layernorm4d_fwd_xdl.cpp): The main example file. It sets up the 4D input tensor and the `gamma` and `beta` parameters, and instantiates the `DeviceLayernorm4dFwd` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_layernorm4d_fwd.hpp`](../../include/ck/tensor_operation/gpu/device/device_layernorm4d_fwd.hpp): The device interface for 4D layer normalization.
- The underlying implementation uses reduction kernels optimized for the 4D tensor structure, with efficient handling of the spatial dimensions.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/63_layernorm4d_fwd
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./layernorm4d_fwd_xdl

# Run with verification, data initialization, and timing
./layernorm4d_fwd_xdl 1 2 1
```
## Applications in Computer Vision

4D layer normalization has specific applications in computer vision tasks:

- **Vision Transformers**: Some vision transformer variants apply layer normalization to 4D feature maps instead of flattening them.
- **Style Transfer**: Normalizing feature maps across spatial and channel dimensions for style transfer applications.
- **Feature Normalization**: Normalizing intermediate feature maps in CNNs for improved training stability.
- **Attention Mechanisms**: Some spatial attention mechanisms benefit from normalized 4D feature representations.
- **Multi-Scale Processing**: When processing features at different spatial scales, 4D layer normalization provides consistent normalization.

## Comparison with Other Normalizations for 4D Tensors

| Normalization | Reduction Dimensions | Parameter Shape | Batch Dependence |
|---------------|----------------------|-----------------|------------------|
| **BatchNorm** | `[N, H, W]` per channel | `[C]` | Yes |
| **LayerNorm (2D)** | `[C, H, W]` per sample | `[C, H, W]` or `[C]` | No |
| **LayerNorm (4D)** | `[C, H, W]` per sample | `[C, H, W]` or variants | No |
| **InstanceNorm** | `[H, W]` per channel per sample | `[C]` | No |
| **GroupNorm** | Groups of channels per sample | `[C]` | No |

4D layer normalization provides batch-independent normalization while preserving the spatial structure of the data, making it valuable when spatial relationships matter.
115	example/64_fpAintB_gemm/README.md	Normal file
@@ -0,0 +1,115 @@
# Mixed-Precision GEMM: FP16 A × INT8 B

This example demonstrates a **mixed-precision GEMM** in which matrix `A` is stored in FP16 (half-precision floating point) and matrix `B` in INT8 (8-bit integer). This is an important optimization for inference workloads: it significantly reduces memory bandwidth while maintaining acceptable numerical accuracy.

## Mathematical Formulation

The operation performs matrix multiplication with mixed data types:

$C = A_{fp16} \times B_{int8}$

Where:

- Matrix `A` has FP16 elements and shape `[M, K]`
- Matrix `B` has INT8 elements and shape `[K, N]`
- Matrix `C` typically has FP16 or FP32 elements and shape `[M, N]`

The computation involves:

1. **Type Conversion**: INT8 elements of `B` are converted to FP16 during computation.
2. **Scaling**: Optional scaling factors can be applied to account for the quantization of `B`.
3. **Accumulation**: Products are accumulated in higher precision (typically FP32) to maintain numerical accuracy.
4. **Output Conversion**: Final results are converted to the desired output precision.
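The dequantize-and-accumulate scheme can be sketched in plain Python. The per-column scale granularity used here is an assumption for illustration (real quantization schemes may use per-tensor, per-row, or per-group scales), and Python floats stand in for the FP16/FP32 types:

```python
def fp_a_int_b_gemm(A, B_q, scales):
    """A: float MxK; B_q: int8 KxN (plain ints in [-128, 127]);
    scales: per-column dequantization factors of length N.

    Each int8 element is converted and scaled on the fly, and products are
    accumulated in full float precision, mirroring FP32 accumulation."""
    M, K, N = len(A), len(B_q), len(B_q[0])
    return [[sum(A[m][k] * (B_q[k][n] * scales[n]) for k in range(K))
             for n in range(N)] for m in range(M)]

A = [[0.5, 1.0]]
B_q = [[100], [-50]]     # int8 values
scales = [0.01]          # dequantized B column = [1.0, -0.5]
C = fp_a_int_b_gemm(A, B_q, scales)   # 0.5*1.0 + 1.0*(-0.5) = 0.0
```

On the GPU the conversion happens in registers right after the INT8 tile is loaded, so matrix `B` only ever crosses the memory bus in its 8-bit form.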
## Algorithmic Strategy: Mixed-Precision Tiled GEMM

The implementation extends the standard tiled GEMM algorithm to handle mixed data types efficiently.

1. **Tiled Matrix Multiplication**: Standard tiling with type-specific optimizations:
   - **A Matrix Loading**: FP16 elements are loaded directly from global memory.
   - **B Matrix Loading**: INT8 elements are loaded and converted to FP16 in registers.
   - **Scaling Application**: If quantization scales are provided, they are applied during the conversion.
   - **Mixed-Type Computation**: FP16 × FP16 multiplication with FP32 accumulation.

2. **Memory Access Optimization**:
   - **Bandwidth Efficiency**: INT8 storage for `B` halves its memory bandwidth compared to FP16.
   - **Coalescing**: Both data types are accessed with coalesced memory patterns.
   - **Vectorization**: Vectorized loads are used where possible for both FP16 and INT8 data.

3. **Computation Precision**:
   - **Multiply-Accumulate**: FP32 accumulators prevent overflow and maintain accuracy.
   - **Hardware Utilization**: Mixed-precision matrix instructions are used where available.
   - **Numerical Stability**: Type conversions are handled carefully to minimize precision loss.
## Source Code Organization

- [`fpAintB_gemm_xdl.cpp`](./fpAintB_gemm_xdl.cpp): The main example file. It sets up the FP16 matrix A and the INT8 matrix B with optional scaling factors, and instantiates the `DeviceFpAintBGemm` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_fpAintB_gemm.hpp`](../../include/ck/tensor_operation/gpu/device/device_fpAintB_gemm.hpp): The device interface for mixed-precision GEMM operations.
- The underlying kernel implements optimized mixed-type arithmetic with efficient type conversion and scaling.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/64_fpAintB_gemm
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./fpAintB_gemm_xdl

# Run with verification, data initialization, and timing
./fpAintB_gemm_xdl 1 2 1
```
## Applications in Model Optimization

Mixed-precision GEMM is crucial for efficient neural network inference:

- **Quantized Inference**: Deploy models with quantized weights (INT8) while keeping activations in higher precision (FP16).
- **Memory-Constrained Environments**: Reduce the memory footprint of weight storage while maintaining computational accuracy.
- **Edge Deployment**: Enable deployment on devices with limited memory bandwidth.
- **Large Language Models**: Reduce memory requirements for transformer models while preserving quality.
- **Computer Vision Models**: Optimize CNN inference with quantized convolution layers.

## Performance Benefits

This mixed-precision approach provides several advantages:

- **Memory Bandwidth**: 2× reduction in bandwidth for matrix `B` compared to FP16×FP16.
- **Storage Efficiency**: 50% reduction in storage requirements for quantized matrices.
- **Cache Efficiency**: More data fits in cache due to the reduced memory footprint.
- **Energy Efficiency**: Lower memory traffic reduces energy consumption.

## Quantization Considerations

Effective use of INT8 quantization requires:

- **Calibration**: Proper calibration to determine appropriate scaling factors.
- **Range Analysis**: Understanding the dynamic range of the weights to make full use of the INT8 range.
- **Accuracy Trade-offs**: Balancing compression ratio against numerical accuracy.
- **Hardware Support**: Leveraging hardware features for efficient mixed-precision computation.

## Comparison with Other Precision Formats

| Configuration | A Precision | B Precision | Memory Bandwidth | Accuracy | Hardware Support |
|---------------|-------------|-------------|------------------|----------|------------------|
| **FP32×FP32** | FP32 | FP32 | 1.0× (baseline) | Highest | Universal |
| **FP16×FP16** | FP16 | FP16 | 0.5× | High | Modern GPUs |
| **FP16×INT8** | FP16 | INT8 | 0.375× | Medium-High | Specialized |
| **INT8×INT8** | INT8 | INT8 | 0.25× | Medium | Specialized |

The FP16×INT8 configuration offers an excellent balance between memory efficiency and numerical accuracy for many inference workloads.
108	example/65_gemm_multiply_multiply/README.md	Normal file
@@ -0,0 +1,108 @@
# GEMM with Double Multiply Operations

This example demonstrates a **GEMM followed by two sequential elementwise multiplications**. This fusion pattern is useful for layers that require a matrix multiplication followed by multiple scaling or masking operations, such as certain attention mechanisms or gated neural network architectures.

## Mathematical Formulation

The operation performs a matrix multiplication followed by two sequential elementwise multiplications.

1. **GEMM Stage**: A standard matrix multiplication.

   $C_{temp1} = A \times B$

2. **First Multiplication**: Elementwise multiplication with tensor `D`.

   $C_{temp2} = C_{temp1} \odot D$

3. **Second Multiplication**: Elementwise multiplication with tensor `E`.

   $F = C_{temp2} \odot E$

The key optimization is that the intermediate tensors `C_temp1` and `C_temp2` are **never written to global memory**. All operations are fused into the GEMM's epilogue, operating on data held in registers.
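The three stages above reduce to $F = (A \times B) \odot D \odot E$, which a small plain-Python reference makes concrete (illustrative only; in the kernel both multiplies happen on register data before the single store):

```python
def gemm_multiply_multiply(A, B, D, E):
    """F[m][n] = (sum_k A[m][k]*B[k][n]) * D[m][n] * E[m][n].

    Both elementwise multiplies are applied to each output element before
    it is 'stored', mirroring the dual-multiply epilogue."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[m][k] * B[k][n] for k in range(K)) * D[m][n] * E[m][n]
             for n in range(N)] for m in range(M)]

A = [[1, 2], [3, 4]]
B = [[1, 0], [0, 1]]          # identity, so A x B = A
D = [[2, 2], [2, 2]]          # uniform scale
E = [[1, 0], [0, 1]]          # acts as a mask
F = gemm_multiply_multiply(A, B, D, E)   # [[2, 0], [0, 8]]
```

The scale-then-mask combination shown here is one typical use; `D` and `E` can be any tensors of the output shape.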
|
||||
|
||||
## Algorithmic Strategy: GEMM with Dual-Multiply Epilogue

The implementation uses a tiled GEMM algorithm with a multi-stage fused epilogue that performs two sequential multiplications.

1. **Tiled GEMM Core**: The kernel begins with a standard tiled GEMM. A thread block computes a tile of the product $A \times B$, accumulating the result in registers.

2. **Dual-Multiply Epilogue**: Before any data is written to global memory, the following sequence occurs for the tile of data held in registers:
   - **Load First Multiplicand**: Threads load the corresponding elements of tensor `D`.
   - **First Multiplication**: The elementwise multiplication is performed in registers: `result *= D`.
   - **Load Second Multiplicand**: Threads load the corresponding elements of tensor `E`.
   - **Second Multiplication**: The second elementwise multiplication is performed in registers: `result *= E`.
   - **Store Final Result**: The final result `F` is written to global memory.

This deep fusion eliminates multiple kernel launches and the memory bandwidth required to write and re-read intermediate tensors.
## Source Code Organization

- [`gemm_multiply_multiply_xdl.cpp`](./gemm_multiply_multiply_xdl.cpp): The main example file. It sets up the input matrices (A, B) and auxiliary tensors (D, E), and instantiates the `DeviceGemmMultiplyMultiply` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_gemm_multiply_multiply.hpp`](../../include/ck/tensor_operation/gpu/device/device_gemm_multiply_multiply.hpp): The high-level device interface for this fused operation.
- The underlying kernel implements the dual-multiply epilogue that performs both multiplication operations on register data before storing.
## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/65_gemm_multiply_multiply
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./gemm_multiply_multiply_xdl

# Run with verification, data initialization, and timing
./gemm_multiply_multiply_xdl 1 2 1
```
## Applications

This fusion pattern is useful for several types of neural network operations and advanced computational patterns.

- **Multi-Scale Attention**: Some attention mechanisms apply multiple scaling factors sequentially, such as learned attention scales followed by positional scaling.
- **Gated Mechanisms**: Advanced gating architectures that use multiple multiplicative gates in sequence, such as in some RNN variants or transformer modifications.
- **Feature Modulation**: Computer vision models that apply multiple feature modulation operations, such as style-based generators or attention-based feature refinement.
- **Masking Operations**: Applying multiple types of masks (e.g., an attention mask followed by a dropout mask) in sequence.
- **Custom Activations**: Implementing complex activation functions that involve multiple multiplicative terms.
- **Mixture of Experts**: Some MoE architectures use multiple routing or gating multiplications in sequence.
## Performance Considerations

The performance benefits of this fusion depend on several factors:

- **Memory Bandwidth Savings**: Eliminates two full tensor read/write cycles for intermediate results
- **Cache Locality**: Maintains data in registers throughout the computation pipeline
- **Instruction Scheduling**: Allows better interleaving of compute and memory operations
- **Kernel Launch Overhead**: Reduces from three separate kernel launches to one
## Comparison with Sequential Operations
|
||||
|
||||
| Approach | Kernel Launches | Memory Bandwidth | Register Pressure | Implementation Complexity |
|
||||
|----------|----------------|------------------|-------------------|---------------------------|
|
||||
| **Sequential** | 3 kernels | 3× intermediate storage | Low | Simple |
|
||||
| **Fused** | 1 kernel | No intermediate storage | Medium | Moderate |
|
||||
|
||||
## Extension Possibilities

This pattern can be extended in several ways:

- **More Multiplications**: Additional sequential multiplications can be added to the epilogue
- **Mixed Operations**: Combine multiplications with additions or other elementwise operations
- **Conditional Operations**: Apply multiplications conditionally based on masks or thresholds
- **Broadcasting**: Handle different broadcasting patterns for the multiplicand tensors

This example demonstrates the flexibility of the epilogue fusion approach, showing how multiple sequential operations can be efficiently combined with matrix multiplication.
@@ -1,6 +1,79 @@
# Complex Tensor Contraction with Bilinear Operations

This example (`example_complex_contraction_bilinear_xdl_fp32`) demonstrates a **complex tensor contraction combined with bilinear operations**. This operation handles complex-valued tensors (with real and imaginary components) and performs both tensor contractions and bilinear transformations, which is particularly important for applications in quantum computing, signal processing, and advanced scientific computing.
## Mathematical Formulation

The operation combines complex tensor contraction with bilinear operations on complex-valued data.

Given complex tensors with real and imaginary components:
- Complex tensor `A = A_real + i × A_imag`
- Complex tensor `B = B_real + i × B_imag`
- Auxiliary complex tensors `D, E, ...`

1. **Complex Tensor Contraction**: Perform tensor contraction using Einstein summation on complex tensors.

   $C_{temp} = \text{einsum}(\text{pattern}, A, B)$

   For complex multiplication: $(a + bi)(c + di) = (ac - bd) + (ad + bc)i$

2. **Bilinear Operations**: Apply bilinear transformations involving the contraction result and auxiliary tensors.

   $F = \text{BilinearOp}(C_{temp}, D, E, \ldots)$

The bilinear operations can include various combinations, such as:
- $F = C_{temp} \odot D + E$ (elementwise multiply and add)
- $F = \alpha \cdot C_{temp} + \beta \cdot (D \odot E)$ (scaled combinations)
- More complex multi-term bilinear expressions
## Algorithmic Strategy: Complex-Arithmetic GEMM with Bilinear Epilogue

The implementation handles complex arithmetic throughout the computation pipeline.

1. **Complex Tensor-to-GEMM Mapping**:
   - **Real/Imaginary Separation**: Complex tensors are logically separated into real and imaginary components
   - **Complex GEMM**: Four real GEMM operations represent one complex GEMM:
     - $C_{real} = A_{real} \times B_{real} - A_{imag} \times B_{imag}$
     - $C_{imag} = A_{real} \times B_{imag} + A_{imag} \times B_{real}$

2. **Multi-Component Computation**: Within each thread block:
   - **Parallel Real/Imaginary Processing**: Simultaneously compute real and imaginary components
   - **Complex Accumulation**: Maintain separate accumulators for real and imaginary parts
   - **Register Management**: Carefully orchestrate register usage for multiple complex components

3. **Complex Bilinear Epilogue**:
   - **Load Complex Auxiliary Tensors**: Read real and imaginary components of auxiliary tensors
   - **Complex Bilinear Operations**: Apply the specified bilinear transformations using complex arithmetic
   - **Complex Result Storage**: Store the final complex result with proper real/imaginary organization
## Source Code Organization

- [`complex_contraction_bilinear_xdl.cpp`](./complex_contraction_bilinear_xdl.cpp): The main example file. It sets up complex tensors (with real and imaginary components), defines contraction patterns and bilinear operations, and instantiates the `DeviceComplexContractionBilinear` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_complex_contraction_bilinear.hpp`](../../include/ck/tensor_operation/gpu/device/device_complex_contraction_bilinear.hpp): The device interface for complex tensor operations with bilinear fusion.
- The underlying kernel implements complex arithmetic with optimized memory layouts for the real/imaginary components.
## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/66_complex_contraction_bilinear
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j
```

### Run the Example

```bash
# arg1: verification (0=no, 1=yes)
# arg2: initialization (0=no init, 1=integer value, 2=decimal value)
@@ -8,4 +81,33 @@
./bin/example_contraction_bilinear_xdl_fp32 1 1 1
```
## Applications

Complex tensor operations with bilinear transformations are essential in several advanced domains:

- **Quantum Computing**: Quantum circuit simulations require complex tensor contractions for state evolution and gate operations
- **Signal Processing**: Digital signal processing with complex-valued signals, such as in communications and radar systems
- **Fourier Analysis**: FFT-related computations that naturally involve complex arithmetic and tensor operations
- **Quantum Chemistry**: Electronic structure calculations often involve complex-valued wavefunctions and operators
- **Machine Learning**: Some advanced neural network architectures use complex-valued weights and activations
- **Scientific Computing**: Simulations involving wave equations, electromagnetic fields, or quantum mechanical systems
## Complex Arithmetic Considerations

Working with complex numbers introduces several computational challenges:

- **Memory Layout**: Efficient storage of real and imaginary components (interleaved vs. separate arrays)
- **Arithmetic Complexity**: Complex multiplication requires 4 real multiplications and 2 real additions
- **Numerical Precision**: Maintaining accuracy across multiple complex operations
- **Performance Trade-offs**: Balancing computational complexity against memory bandwidth
## Performance Characteristics

Complex operations have unique performance profiles:

- **Computational Intensity**: ~2× the arithmetic operations compared to real-valued equivalents
- **Memory Bandwidth**: 2× the memory requirements for storing complex values
- **Register Pressure**: Higher register usage due to separate real/imaginary components
- **Instruction Complexity**: More complex instruction sequences for complex arithmetic

This kernel demonstrates the ability to handle sophisticated mathematical operations efficiently while maintaining the benefits of deep fusion for complex-valued computations.
@@ -1,6 +1,16 @@
# GEMM with Microscaling

## example_gemm_mx_fp8

This example demonstrates a **GEMM operation with microscaling**, an advanced quantization technique that applies fine-grained scaling to small blocks of data. Microscaling enables more precise quantization than traditional per-tensor methods by using a different scale factor for each small group of elements, leading to better accuracy preservation in quantized neural network inference.
## Source Code Organization

- [`gemm_microscaling_xdl.cpp`](./gemm_microscaling_xdl.cpp): The main example file. It sets up microscaled matrices with quantized data and scale factors, and instantiates the `DeviceGemmMicroscaling` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_gemm_microscaling.hpp`](../../include/ck/tensor_operation/gpu/device/device_gemm_microscaling.hpp): The device interface for GEMM with microscaling support.
- The underlying kernel integrates block-wise dequantization into the GEMM computation pipeline.

## Build and Run

### example_gemm_mx_fp8

Custom verification parameters:
```bash
@@ -20,8 +30,27 @@ Custom tensor shapes:
./bin/example_gemm_mx_fp8 1 2 1 0 256 256 512 -1 -1 -1 1 10 10
```

### Run the Example

Custom verification parameters:
```bash
# arg1: verification (0=no, 1=CPU)
# arg2: initialization (0=constant values, 1=integer values, 2=decimal values)
# arg3: time kernel (0=no, 1=yes)
# arg4: verbosity (0=no info, 1=verbose info)
# arg5 to 10: M(128x), N(128x), K(64x), StrideA, StrideB, StrideC
# arg11: KBatch
./bin/example_gemm_mx_fp8 1 1 0 1
```

Custom tensor shapes:
```bash
./bin/example_gemm_mx_fp8 1 2 1 0 128 128 256 -1 -1 -1 1
```

Default invocation:
```bash
# Implies: ./bin/example_gemm_mx_fp8 1 2 0 0
./bin/example_gemm_mx_fp8
```