Mirror of https://github.com/ROCm/composable_kernel.git (synced 2026-05-16 02:54:21 +00:00)
# Put Element Operation

This example demonstrates a **put element operation**, which scatters elements from a source tensor into specific positions of a destination tensor based on index arrays. It is a fundamental building block for sparse updates, scatter operations, and advanced indexing patterns in deep learning and scientific computing.

## Mathematical Formulation

The put element operation updates specific positions in a destination tensor using values from a source tensor and position information from index tensors.

Given:

- **Destination tensor** `D` with shape `[D0, D1, ..., Dn]`
- **Source tensor** `S` with shape `[M, ...]` containing values to be placed
- **Index tensors** `I0, I1, ..., In` with shape `[M]` specifying destination coordinates
- **Update mode**: how to handle multiple updates to the same position

For each element `i` from `0` to `M-1`, the operation performs:

$D[I0[i], I1[i], ..., In[i]] \leftarrow \text{Update}(D[I0[i], I1[i], ..., In[i]], S[i])$

**Update modes**:

- **Overwrite**: `D[idx] = S[i]`
- **Add**: `D[idx] += S[i]`
- **Multiply**: `D[idx] *= S[i]`
- **Max**: `D[idx] = max(D[idx], S[i])`
- **Min**: `D[idx] = min(D[idx], S[i])`

## Algorithmic Strategy: Parallel Scatter with Conflict Resolution

The implementation must handle parallel updates and potential conflicts when multiple source elements target the same destination position.

1. **Grid Scheduling**: The operation is parallelized over the source elements. Each thread is assigned one or more elements of the source tensor.

2. **Index Calculation**: For each source element, threads:
   - Read the corresponding indices from the index tensors
   - Validate that the indices are within bounds
   - Calculate the linear memory address in the destination tensor

3. **Conflict Resolution**: When multiple threads attempt to update the same destination position:
   - **Atomic Operations**: Use atomic functions for commutative update modes (add, max, min)
   - **Serialization**: For non-commutative update modes, use locks or another synchronization mechanism
   - **Deterministic Ordering**: Ensure consistent results across runs

4. **Memory Access Optimization**:
   - Coalesced reads from the source and index tensors
   - Efficient atomic operations on the destination tensor
   - Minimized memory bank conflicts

## Source Code Organization

- [`put_element_xdl.cpp`](./put_element_xdl.cpp): The main example file. It sets up the destination tensor, source tensor, index arrays, and instantiates the `DevicePutElement` operation.
- [`../../include/ck/tensor_operation/gpu/device/device_put_element.hpp`](../../include/ck/tensor_operation/gpu/device/device_put_element.hpp): The high-level device interface for put element operations.
- [`../../include/ck/tensor_operation/gpu/grid/gridwise_put_element.hpp`](../../include/ck/tensor_operation/gpu/grid/gridwise_put_element.hpp): The grid-wise kernel implementing the parallel scatter algorithm with conflict resolution.

## Build and Run

### Prerequisites

Ensure the Composable Kernel library is built and installed:

```bash
cd /path/to/composable_kernel/build
make -j install
```

### Build the Example

```bash
cd /path/to/composable_kernel/example/50_put_element
mkdir build && cd build

cmake \
    -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
    -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
    ..

make -j
```

### Run the Example

```bash
# Run the example with default settings
./put_element_xdl

# Run with verification, data initialization, and timing
./put_element_xdl 1 2 1
```

## Applications

Put element operations are fundamental to many advanced algorithms and data structures.

- **Sparse Neural Networks**: Updating specific weights or activations in sparse architectures where only a subset of parameters is active.
- **Graph Neural Networks**: Scatter operations for aggregating information from neighboring nodes to target nodes in graph structures.
- **Embedding Updates**: Updating specific rows in embedding tables based on sparse input indices, common in recommendation systems and NLP models.
- **Histogram Computation**: Accumulating counts or values into histogram bins based on computed indices.
- **Sparse Linear Algebra**: Implementing sparse matrix operations where values are placed at specific coordinate positions.
- **Advanced Indexing**: Supporting NumPy-style advanced indexing patterns for tensor manipulation.

## Performance Considerations

The performance of put element operations depends heavily on the access patterns:

- **Random Access**: Scattered indices lead to poor memory locality and cache performance.
- **Atomic Contention**: High conflict rates (many updates to the same positions) can severely degrade throughput.
- **Memory Bandwidth**: The operation is typically memory-bound, especially when locality is good.
- **Load Balancing**: An uneven distribution of conflicts can cause load imbalance across threads.