[DOCS] Documentation Addition (Readme updates) (#2495)

* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

Refine README files by removing outdated references to the Tile Engine

* Updates based on PR feedback 1

* Updates based on PR feedback 2

* Updates based on PR feedback 3

* Updates based on PR feedback 4

* Updates based on PR feedback 5

* Updates based on PR feedback 6

* Updates based on PR feedback 7

* Updates based on PR feedback 8

* Content Modification of CK Tile Example

* Modify the ck_tile gemm config

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
This commit is contained in:
Vidyasagar Ananthan
2025-10-16 03:10:57 -07:00
committed by GitHub
parent 013ba3c737
commit 92c67a824f
120 changed files with 8188 additions and 221 deletions


# Multiple Reductions
This example demonstrates a **Multiple Reduction** operation, where several different reduction computations (e.g., sum, average, max, min) are performed on the same input tensor in a single kernel launch. This is a highly efficient pattern when multiple statistics are needed for a tensor, as it requires only one read pass over the (potentially very large) input data.
## Mathematical Formulation
Given an input tensor `A`, this operation computes a set of output scalars or vectors, $\{R_0, R_1, \dots, R_N\}$, where each $R_i$ is the result of a different reduction operation applied to `A`.
$R_0 = \bigoplus_0 A$
$R_1 = \bigoplus_1 A$
...
$R_N = \bigoplus_N A$
Where $\bigoplus_i$ represents a distinct reduction operation, such as:
- `sum`: $\sum_j A_j$
- `avg`: $\frac{1}{|A|} \sum_j A_j$, where $|A|$ is the number of elements being reduced
- `max`: $\max_j(A_j)$
- `min`: $\min_j(A_j)$
- `sum of squares`: $\sum_j A_j^2$
The reductions can be performed over the entire tensor to produce a scalar, or along specific dimensions to produce a lower-rank tensor.
## Algorithmic Strategy: Fused Parallel Reduction
The implementation uses a classic parallel reduction algorithm but extends it to handle multiple reduction functions simultaneously.
1. **Grid Scheduling**: The input tensor is partitioned across the GPU's thread blocks. Each block is responsible for reducing a slice of the input data.
2. **Intra-Block Reduction**:
- **Loading**: Threads within a block cooperatively load their assigned slice of the input tensor `A` into shared memory.
- **Fused Accumulation**: Each thread maintains a separate set of accumulators in its private registers, one for each of the `N` reduction operations being performed.
- As threads iterate through the data in shared memory, they update all of their accumulators simultaneously. For example, for each element `a`, a thread might update its `sum_accumulator += a`, `max_accumulator = max(max_accumulator, a)`, and `sum_sq_accumulator += a*a`.
- **Tree-Based Reduction**: After processing all elements in the slice, the threads perform a parallel reduction using shared memory. This is done *for each of the N reduction types*. For example, they first reduce all the `sum_accumulator` values to get the block's partial sum, then they reduce all the `max_accumulator` values to get the block's partial max, and so on.
3. **Inter-Block Reduction**:
- Each thread block writes its `N` partial results (the block's partial sum, partial max, etc.) to `N` separate temporary arrays in global memory.
- A final, small reduction kernel is launched (or atomic operations are used) for each of the `N` temporary arrays to combine the partial results from all blocks into the final `N` output values.
The key to this kernel's efficiency is that the expensive part—reading the input tensor `A` from global memory—is only done once. All subsequent computations happen on-chip.
## Source Code Organization
- [`multiple_reduce_xdl.cpp`](./multiple_reduce_xdl.cpp): The main example file. It sets up the input tensor and defines the multiple reduction operations to be performed. It then instantiates the `DeviceMultipleReduce` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_multiple_reduce.hpp`](../../include/ck/tensor_operation/gpu/device/device_multiple_reduce.hpp): The high-level device interface for the multiple reduction operation. It takes a tuple of structs, where each struct defines one of the reduction operations to be performed.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_multiple_reduce.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_multiple_reduce.hpp): The grid-wise kernel that implements the fused parallel reduction algorithm. It is heavily templated to generate the specific accumulation and reduction logic for the requested set of operations.
## Build and Run
### Prerequisites
Ensure the Composable Kernel library is built and installed.
```bash
cd /path/to/composable_kernel/build
make -j install
```
### Build the Example
```bash
cd /path/to/composable_kernel/example/33_multiple_reduce
mkdir build && cd build
cmake \
-DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
..
make -j
```
### Run ```example_dual_reduce_multiblock```
```bash
# -D <xxx> : input 4-d tensor lengths
# -v <x> : verification (0=no, 1=yes)
Start running 10 times...
Perf: 1.19529 ms, 201.499 GB/s, DeviceMultipleReduceBlockWise<256,M_C4_S1,K_C64_S1,InSrcVectorDim_1_InSrcVectorSize_1,OutDstVectorSize_1_1>
```
### Run ```example_dual_reduce_threadwise```
```bash
# -D <xxx> : input 4-d tensor lengths
# -v <x> : verification (0=no, 1=yes)
Warm up 1 time
Start running 10 times...
Perf: 0.01512 ms, 71.9577 GB/s, DeviceMultipleReduceThreadwise<256,M_C256_S1,K_C1_S4,InSrcVectorDim_1_InSrcVectorSize_2,OutDstVectorSize_1_1>
```
## Applications
This operation is extremely useful for computing statistics and implementing normalization layers.
- **Normalization Layers**: Both Batch Normalization and Layer Normalization require computing the mean and variance of a tensor. Variance is defined as $\sigma^2 = E[X^2] - (E[X])^2$. This requires two statistics: the sum of elements (for the mean, $E[X]$) and the sum of squares of elements (for $E[X^2]$). This kernel can compute both in a single pass, making it a highly efficient way to calculate the moments needed for normalization.
- **Data Analytics**: When analyzing a large dataset, one might want to compute its min, max, mean, and standard deviation all at once. This kernel can perform all the necessary underlying reductions in a single, efficient operation.
- **Loss Function Components**: Some complex loss functions might involve multiple statistical properties of a model's output. This kernel can compute them efficiently.