# Put Element Operation
This example demonstrates a put element operation, which scatters or places elements from a source tensor into specific positions of a destination tensor based on index arrays. This is a fundamental operation for implementing sparse updates, scatter operations, and advanced indexing patterns in deep learning and scientific computing.
## Mathematical Formulation
The put element operation updates specific positions in a destination tensor using values from a source tensor and position information from index tensors.
Given:
- Destination tensor `D` with shape `[D0, D1, ..., Dn]`
- Source tensor `S` with shape `[M, ...]` containing values to be placed
- Index tensors `I0, I1, ..., In` with shape `[M]` specifying destination coordinates
- Update mode: how to handle multiple updates to the same position
The operation performs, for each element $i$ from $0$ to $M-1$:

$$D[I_0[i], I_1[i], \ldots, I_n[i]] \leftarrow \text{Update}(D[I_0[i], I_1[i], \ldots, I_n[i]], S[i])$$
Update modes:
- Overwrite: `D[idx] = S[i]`
- Add: `D[idx] += S[i]`
- Multiply: `D[idx] *= S[i]`
- Max: `D[idx] = max(D[idx], S[i])`
- Min: `D[idx] = min(D[idx], S[i])`
## Algorithmic Strategy: Parallel Scatter with Conflict Resolution
The implementation must handle parallel updates and potential conflicts when multiple source elements target the same destination position.
1. Grid Scheduling: The operation is parallelized over the source elements. Each thread is assigned to process one or more elements from the source tensor.
2. Index Calculation: For each source element, threads:
   - Read the corresponding indices from the index tensors
   - Validate that the indices are within bounds
   - Calculate the linear memory address in the destination tensor
3. Conflict Resolution: When multiple threads attempt to update the same destination position:
   - Atomic Operations: Use atomic functions for commutative operations (add, max, min)
   - Serialization: For non-commutative operations, use locks or other synchronization
   - Deterministic Ordering: Ensure consistent results across runs
4. Memory Access Optimization:
   - Coalesced reads from the source and index tensors
   - Efficient atomic operations on the destination tensor
   - Minimized memory bank conflicts
## Source Code Organization

- `put_element_xdl.cpp`: The main example file. It sets up the destination tensor, source tensor, and index arrays, and instantiates the `DevicePutElement` operation.
- `../../include/ck/tensor_operation/gpu/device/device_put_element.hpp`: The high-level device interface for put element operations.
- `../../include/ck/tensor_operation/gpu/grid/gridwise_put_element.hpp`: The grid-wise kernel implementing the parallel scatter algorithm with conflict resolution.
## Build and Run
### Prerequisites

Ensure the Composable Kernel library is built and installed:

```bash
cd /path/to/composable_kernel/build
make -j install
```
### Build the Example

```bash
cd /path/to/composable_kernel/example/50_put_element
mkdir build && cd build
cmake \
    -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
    -DCMAKE_PREFIX_PATH="/opt/rocm;${CK_INSTALL_PATH}" \
    ..
make -j
```
### Run the Example

```bash
# Run the example with default settings
./put_element_xdl

# Run with verification, data initialization, and timing
./put_element_xdl 1 2 1
```
## Applications
Put element operations are fundamental to many advanced algorithms and data structures.
- Sparse Neural Networks: Updating specific weights or activations in sparse architectures where only a subset of the parameters is active.
- Graph Neural Networks: Scatter operations for aggregating information from neighboring nodes to target nodes in graph structures.
- Embedding Updates: Updating specific rows in embedding tables based on sparse input indices, common in recommendation systems and NLP models.
- Histogram Computation: Accumulating counts or values into histogram bins based on computed indices.
- Sparse Linear Algebra: Implementing sparse matrix operations where values are placed at specific coordinate positions.
- Advanced Indexing: Supporting NumPy-style advanced indexing patterns for tensor manipulation.
## Performance Considerations
The performance of put element operations depends heavily on the access patterns:
- Random Access: Scattered indices lead to poor memory locality and cache performance.
- Atomic Contention: High conflict rates (many updates targeting the same positions) can severely degrade throughput.
- Memory Bandwidth: The operation does little arithmetic per element, so even with good locality it is typically memory-bound.
- Load Balancing: An uneven distribution of conflicts can cause load imbalance across threads.