Files
Aviral Goel 004784ef98 chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template

* Fix build

---------

Co-authored-by: Sami Remes <samremes@amd.com>
2025-11-28 13:49:54 -08:00
..

Put Element Operation

This example demonstrates a put element operation, which scatters or places elements from a source tensor into specific positions of a destination tensor based on index arrays. This is a fundamental operation for implementing sparse updates, scatter operations, and advanced indexing patterns in deep learning and scientific computing.

Mathematical Formulation

The put element operation updates specific positions in a destination tensor using values from a source tensor and position information from index tensors.

Given:

  • Destination tensor D with shape [D0, D1, ..., Dn]
  • Source tensor S with shape [M, ...] containing values to be placed
  • Index tensors I0, I1, ..., In with shape [M] specifying destination coordinates
  • Update mode: how to handle multiple updates to the same position

The operation performs: D[I0[i], I1[i], ..., In[i]] \leftarrow \text{Update}(D[I0[i], I1[i], ..., In[i]], S[i])

For each element i from 0 to M-1.

Update modes:

  • Overwrite: D[idx] = S[i]
  • Add: D[idx] += S[i]
  • Multiply: D[idx] *= S[i]
  • Max: D[idx] = max(D[idx], S[i])
  • Min: D[idx] = min(D[idx], S[i])

Algorithmic Strategy: Parallel Scatter with Conflict Resolution

The implementation must handle parallel updates and potential conflicts when multiple source elements target the same destination position.

  1. Grid Scheduling: The operation is parallelized over the source elements. Each thread is assigned to process one or more elements from the source tensor.

  2. Index Calculation: For each source element, threads:

    • Read the corresponding indices from the index tensors
    • Validate that indices are within bounds
    • Calculate the linear memory address in the destination tensor
  3. Conflict Resolution: When multiple threads attempt to update the same destination position:

    • Atomic Operations: Use atomic functions for commutative operations (add, max, min)
    • Serialization: For non-commutative operations, use locks or other synchronization
    • Deterministic Ordering: Ensure consistent results across runs
  4. Memory Access Optimization:

    • Coalesced reading from source and index tensors
    • Efficient atomic operations on destination tensor
    • Minimize memory bank conflicts

Source Code Organization

Build and Run

Prerequisites

Ensure the Composable Kernel library is built and installed.

cd /path/to/composable_kernel/build
make -j install

Build the Example

cd /path/to/composable_kernel/example/50_put_element
mkdir build && cd build

cmake \
  -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
  -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
  ..

make -j

Run the Example

# Run the example with default settings
./put_element_xdl

# Run with verification, data initialization, and timing
./put_element_xdl 1 2 1

Applications

Put element operations are fundamental to many advanced algorithms and data structures.

  • Sparse Neural Networks: Updating specific weights or activations in sparse neural network architectures where only a subset of parameters are active.
  • Graph Neural Networks: Scatter operations for aggregating information from neighboring nodes to target nodes in graph structures.
  • Embedding Updates: Updating specific rows in embedding tables based on sparse input indices, common in recommendation systems and NLP models.
  • Histogram Computation: Accumulating counts or values into histogram bins based on computed indices.
  • Sparse Linear Algebra: Implementing sparse matrix operations where values are placed at specific coordinate positions.
  • Advanced Indexing: Supporting NumPy-style advanced indexing patterns for tensor manipulation.

Performance Considerations

The performance of put element operations depends heavily on the access patterns:

  • Random Access: Scattered indices lead to poor memory locality and cache performance
  • Atomic Contention: High conflict rates (many updates to same positions) can severely impact performance
  • Memory Bandwidth: The operation is typically memory-bound, especially with good locality
  • Load Balancing: Uneven distribution of conflicts can cause load imbalance across threads