# Multiple Reductions

This example demonstrates a **Multiple Reduction** operation, where several different reduction computations (e.g., sum, average, max, min) are performed on the same input tensor in a single kernel launch. This is a highly efficient pattern when multiple statistics are needed for a tensor, as it requires only one read pass over the (potentially very large) input data.

## Mathematical Formulation

Given an input tensor `A`, this operation computes a set of output scalars or vectors, $\{R_0, R_1, \dots, R_N\}$, where each $R_i$ is the result of a different reduction operation applied to `A`.

$R_0 = \bigoplus_0 A$

$R_1 = \bigoplus_1 A$

...

$R_N = \bigoplus_N A$

where $\bigoplus_i$ represents a distinct reduction operation, such as:

- `sum`: $\sum_j A_j$
- `avg`: $\frac{1}{|A|} \sum_j A_j$, where $|A|$ is the number of elements being reduced
- `max`: $\max_j(A_j)$
- `min`: $\min_j(A_j)$
- `sum of squares`: $\sum_j A_j^2$

The reductions can be performed over the entire tensor to produce a scalar, or along specific dimensions to produce a lower-rank tensor.

## Algorithmic Strategy: Fused Parallel Reduction

The implementation uses a classic parallel reduction algorithm but extends it to handle multiple reduction functions simultaneously.

1. **Grid Scheduling**: The input tensor is partitioned across the GPU's thread blocks. Each block is responsible for reducing a slice of the input data.
2. **Intra-Block Reduction**:
   - **Loading**: Threads within a block cooperatively load their assigned slice of the input tensor `A` into shared memory.
   - **Fused Accumulation**: Each thread maintains a separate set of accumulators in its private registers, one for each of the `N` reduction operations being performed. As threads iterate through the data in shared memory, they update all of their accumulators simultaneously. For example, for each element `a`, a thread might update `sum_accumulator += a`, `max_accumulator = max(max_accumulator, a)`, and `sum_sq_accumulator += a*a` (a simplified sketch follows this section).
   - **Tree-Based Reduction**: After processing all elements in the slice, the threads perform a parallel reduction using shared memory. This is done *for each of the `N` reduction types*: they first reduce all the `sum_accumulator` values to get the block's partial sum, then reduce all the `max_accumulator` values to get the block's partial max, and so on.
3. **Inter-Block Reduction**:
   - Each thread block writes its `N` partial results (the block's partial sum, partial max, etc.) to `N` separate temporary arrays in global memory.
   - A final, small reduction kernel is launched (or atomic operations are used) for each of the `N` temporary arrays to combine the partial results from all blocks into the final `N` output values.

The key to this kernel's efficiency is that the expensive part, reading the input tensor `A` from global memory, is only done once. All subsequent computations happen on-chip.
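The fragment below is a minimal, serial C++ sketch of the fused-accumulation idea, not the actual templated CK kernel: each element is read once and all accumulators are updated together. The struct and function names (`FusedPartials`, `fused_accumulate`) are placeholders introduced for illustration.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical per-thread partial results, one accumulator per reduction type.
struct FusedPartials
{
    float sum    = 0.0f;
    float sum_sq = 0.0f;
    float maxval = std::numeric_limits<float>::lowest();
};

// One pass over a slice of the input updates all accumulators at once,
// mirroring the "fused accumulation" step described above.
FusedPartials fused_accumulate(const std::vector<float>& slice)
{
    FusedPartials acc;
    for(float a : slice)
    {
        acc.sum += a;                         // partial "sum"
        acc.sum_sq += a * a;                  // partial "sum of squares"
        acc.maxval = std::max(acc.maxval, a); // partial "max"
    }
    return acc; // in the real kernel, these partials feed the tree-based block reduction
}
```

In the device kernel, these per-thread partials are then combined with the shared-memory tree reduction and the inter-block step, but the single read of each input element is the part that determines memory traffic.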
## Source Code Organization

- [`multiple_reduce_xdl.cpp`](./multiple_reduce_xdl.cpp): The main example file. It sets up the input tensor, defines the multiple reduction operations to be performed, and instantiates the `DeviceMultipleReduce` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_multiple_reduce.hpp`](../../include/ck/tensor_operation/gpu/device/device_multiple_reduce.hpp): The high-level device interface for the multiple reduction operation. It takes a tuple of structs, where each struct defines one of the reduction operations to be performed.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_multiple_reduce.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_multiple_reduce.hpp): The grid-wise kernel that implements the fused parallel reduction algorithm. It is heavily templated to generate the specific accumulation and reduction logic for the requested set of operations.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed.

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/33_multiple_reduce
mkdir build && cd build
cmake \
    -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
    -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
    ..
make -j
```

### Run `example_dual_reduce_multiblock`

```bash
# -D   : input 4-d tensor lengths
# -v   : verification (0=no, 1=yes)
# arg1 : initialization (0=no init, 1=single integer value, 2=scope integer value, 3=decimal value)
# arg2 : time kernel (0=no, 1=yes)
./bin/example_dual_reduce_multiblock -D 600,28,28,256 -v 1 2 1
```

Result:

```
./bin/example_dual_reduce_multiblock -D 600,28,28,256 -v 1 2 1
launch_and_time_kernel: grid_dim {150, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Perf: 1.19529 ms, 201.499 GB/s, DeviceMultipleReduceBlockWise<256,M_C4_S1,K_C64_S1,InSrcVectorDim_1_InSrcVectorSize_1,OutDstVectorSize_1_1>
```

### Run `example_dual_reduce_threadwise`

```bash
# -D   : input 4-d tensor lengths
# -v   : verification (0=no, 1=yes)
# arg1 : initialization (0=no init, 1=single integer value, 2=scope integer value, 3=decimal value)
# arg2 : time kernel (0=no, 1=yes)
./bin/example_dual_reduce_threadwise -D 8000,4,4,4 -v 1 2 1
```

Result:

```
./bin/example_dual_reduce_threadwise -D 8000,4,4,4 -v 1 2 1
launch_and_time_kernel: grid_dim {32, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Perf: 0.01512 ms, 71.9577 GB/s, DeviceMultipleReduceThreadwise<256,M_C256_S1,K_C1_S4,InSrcVectorDim_1_InSrcVectorSize_2,OutDstVectorSize_1_1>
```

## Applications

This operation is extremely useful for computing statistics and implementing normalization layers.

- **Normalization Layers**: Both Batch Normalization and Layer Normalization require computing the mean and variance of a tensor. Variance is defined as $\sigma^2 = E[X^2] - (E[X])^2$. This requires two statistics: the sum of elements (for the mean, $E[X]$) and the sum of squares of elements (for $E[X^2]$). This kernel can compute both in a single pass, making it a highly efficient way to calculate the moments needed for normalization; a small worked sketch is given after this list.
- **Data Analytics**: When analyzing a large dataset, one might want to compute its min, max, mean, and standard deviation all at once. This kernel can perform all the necessary underlying reductions in a single, efficient operation.
- **Loss Function Components**: Some complex loss functions might involve multiple statistical properties of a model's output. This kernel can compute them efficiently.
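As an illustration of the normalization use case, the following C++ snippet shows how mean and variance can be derived from the two reduction results (sum and sum of squares) obtained in a single pass. It is a self-contained host-side sketch with placeholder data, not part of the example's source.

```cpp
#include <cstdio>
#include <vector>

int main()
{
    // Toy input standing in for one normalization group of a tensor.
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};

    // The two reductions a fused kernel would produce from a single pass over x.
    float sum = 0.0f, sum_sq = 0.0f;
    for(float v : x)
    {
        sum += v;
        sum_sq += v * v;
    }

    // Moments derived from the reduction results: E[X] and E[X^2],
    // then variance via sigma^2 = E[X^2] - (E[X])^2.
    const float n        = static_cast<float>(x.size());
    const float mean     = sum / n;
    const float variance = sum_sq / n - mean * mean;

    std::printf("mean = %f, variance = %f\n", mean, variance); // 2.5 and 1.25
    return 0;
}
```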