From 2b28edd417d7cd00d3d0dec48fa010d603a59be3 Mon Sep 17 00:00:00 2001 From: spolifroni-amd Date: Mon, 8 Sep 2025 13:55:32 -0400 Subject: [PATCH 1/3] first commit of the glossary (#2702) * first commit of the glossary * minor changes * Update docs/reference/Composable-Kernel-Glossary.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * Update docs/reference/Composable-Kernel-Glossary.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * Update Composable-Kernel-Glossary.rst --------- Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> Co-authored-by: Vidyasagar Ananthan (cherry picked from commit e11f694eda2c1c35e401fe025ad1a0a4cffe2c98) --- docs/index.rst | 1 + docs/reference/Composable-Kernel-Glossary.rst | 256 ++++++++++++++++++ docs/sphinx/_toc.yml.in | 8 +- 3 files changed, 264 insertions(+), 1 deletion(-) create mode 100644 docs/reference/Composable-Kernel-Glossary.rst diff --git a/docs/index.rst b/docs/index.rst index 89a5e3e836..c28eb646b5 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -39,6 +39,7 @@ The Composable Kernel repository is located at `https://github.com/ROCm/composab * :doc:`Composable Kernel API reference <./doxygen/html/namespace_c_k>` * :doc:`CK Tile API reference <./doxygen/html/namespaceck__tile>` * :doc:`Composable Kernel complete API class list <./doxygen/html/annotated>` + * :doc:`Composable Kernel glossary <./reference/Composable-Kernel-Glossary>` To contribute to the documentation refer to `Contributing to ROCm `_. diff --git a/docs/reference/Composable-Kernel-Glossary.rst b/docs/reference/Composable-Kernel-Glossary.rst new file mode 100644 index 0000000000..847802b903 --- /dev/null +++ b/docs/reference/Composable-Kernel-Glossary.rst @@ -0,0 +1,256 @@ +.. 
meta:: + :description: Composable Kernel glossary of terms + :keywords: composable kernel, glossary + +*************************************************** +Composable Kernel glossary + +*************************************************** + +.. glossary:: + :sorted: + + arithmetic logic unit + The arithmetic logic unit (ALU) is the GPU component responsible for arithmetic and logic operations. + + compute unit + The compute unit (CU) is the parallel vector processor in an AMD GPU with multiple :term:`ALUs`. Each compute unit will run all the :term:`wavefronts` in a :term:`work group`. A compute unit is equivalent to NVIDIA's streaming multiprocessor. + + matrix core + A matrix core is a specialized GPU unit that accelerates matrix operations for AI and deep learning tasks. A GPU contains multiple matrix cores. + + register + Registers are the fastest tier of memory. They're used for storing temporary values during computations and are private to the :term:`work-items` that use them. + + VGPR + See :term:`vector general purpose register`. + + vector general purpose register + A vector general purpose register (VGPR) is a :term:`register` that stores individual thread data. Each thread in a :term:`wave` has its own set of VGPRs for private variables and calculations. + + SGPR + See :term:`scalar general purpose register`. + + scalar general purpose register + A scalar general purpose register (SGPR) is a :term:`register` shared by all the :term:`work-items` in a :term:`wave`. SGPRs are used for constants, addresses, and control flow common across the entire wave. + + LDS + See :term:`local data share`. + + local data share + Local data share (LDS) is high-bandwidth, low-latency on-chip memory accessible to all the :term:`work-items` in a :term:`work group`. LDS is equivalent to NVIDIA's shared memory. + + LDS banks + LDS banks are a type of memory organization where consecutive addresses are distributed across multiple memory banks for parallel access. 
LDS banks are used to prevent memory access conflicts and improve bandwidth when LDS is used. + + global memory + The main device memory accessible by all threads, offering high capacity but higher latency than shared memory. + + pinned memory + Pinned memory is :term:`host` memory that is page-locked to accelerate transfers between the CPU and GPU. + + dense tensor + A dense tensor is a tensor where most of its elements are non-zero. Dense tensors are typically stored in a contiguous block of memory. + + sparse tensor + A sparse tensor is a tensor where most of its elements are zero. Typically only the non-zero elements of a sparse tensor and their indices are stored. + + host + Host refers to the CPU and the main memory system that manages GPU execution. The host is responsible for launching kernels, transferring data, and coordinating overall computation. + + device + Device refers to the GPU hardware that runs parallel kernels. The device contains the :term:`compute units`, memory hierarchy, and specialized accelerators. + + work-item + A work-item is the smallest unit of parallel execution. A work-item runs a single independent instruction stream on a single data element. A work-item is equivalent to an NVIDIA thread. + + wavefront + Also referred to as a wave, a wavefront is a group of :term:`work-items` that run the same instruction. A wavefront is equivalent to an NVIDIA warp. + + work group + A work group is a collection of :term:`work-items` that can synchronize and share memory. A work group is equivalent to NVIDIA's thread block. + + grid + A grid is a collection of :term:`work groups` that run a kernel. Each work group within the grid operates independently and can be scheduled on a different :term:`compute unit`. A grid can be organized into one, two, or three dimensions. A grid is equivalent to an NVIDIA grid. + + block size + The block size is the number of :term:`work-items` in a :term:`work group`. 
+ + SIMT + See :term:`single-instruction, multi-thread`. + + single-instruction, multi-thread + Single-instruction, multi-thread (SIMT) is a parallel computing model where all the :term:`work-items` within a :term:`wavefront` run the same instruction on different data. + + SIMD + See :term:`single-instruction, multi-data`. + + single-instruction, multi-data + Single-instruction, multi-data (SIMD) is a parallel computing model where the same instruction is run with different data simultaneously. + + occupancy + Occupancy is the ratio of active :term:`wavefronts` to the maximum possible number of wavefronts. + + kernel + A kernel is a function that runs an :term:`operation` or a collection of operations. A kernel will run in parallel on several :term:`work-items` across the GPU. In Composable Kernel, kernels require :term:`pipelines`. + + operation + An operation is a computation on input data. + + pipeline + A Composable Kernel pipeline schedules the sequence of operations for a :term:`kernel`, such as the data loading, computation, and storage phases. A pipeline consists of a :term:`problem` and a :term:`policy`. + + tile partitioner + The tile partitioner defines the mapping between the :term:`problem` dimensions and GPU hierarchy. It specifies :term:`work group`-level :term:`tile` sizes and determines :term:`grid` dimensions by dividing the problem size by the tile sizes. + + problem + The problem is the part of the :term:`pipeline` that defines input and output shapes, data types, and mathematical :term:`operations`. + + policy + The policy is the part of the :term:`pipeline` that defines memory access patterns and hardware-specific optimizations. + + user customized tile pipeline + A customized :term:`tile` :term:`pipeline` that combines custom :term:`problem` and :term:`policy` components for specialized computations. 
+ + user customized tile pipeline optimization + The process of tuning the :term:`tile` size, memory access pattern, and hardware utilization for specific workloads. + + tile programming API + The :term:`tile` programming API is Composable Kernel's high-level interface for defining tile-based computations with predefined hardware mappings for data loading and storing. + + coordinate transformation primitives + Coordinate transformation primitives are Composable Kernel utilities for converting between different coordinate systems. + + reference kernel + A reference :term:`kernel` is a baseline kernel implementation used to verify correctness and performance. Composable Kernel provides two reference kernels, one for CPU and one for GPU. + + launch parameters + Launch parameters are the configuration values, such as :term:`grid` and :term:`block size`, that determine how a :term:`kernel` is mapped to hardware resources. + + memory coalescing + Memory coalescing is an optimization strategy where consecutive :term:`work-items` access consecutive memory addresses in such a way that a single memory transaction serves multiple work-items. + + alignment + Alignment is a memory management strategy where data structures are stored at addresses that are multiples of a specific value. + + + bank conflict + A bank conflict occurs when multiple :term:`work-items` in a :term:`wavefront` access different addresses that map to the same LDS bank. + + padding + Padding is the addition of extra elements, often zeros, to tensor edges in order to control output size in convolution and pooling, or to align data for memory access. + + transpose + Transpose is an :term:`operation` that rearranges the order of tensor axes, often for the purposes of matching :term:`kernel` input formats or optimizing memory access patterns. 
+ + permute + Permute is an :term:`operation` that rearranges the order of tensor axes, often for the purposes of matching :term:`kernel` input formats or optimizing memory access patterns. + + host-device transfer + A host-device transfer is the process of moving data between :term:`host` and :term:`device` memory. + + stride + A stride is the step size to move from one element to the next in a specific dimension of a tensor or matrix. In convolution and pooling, the stride determines how far the filter window moves at each step. + + dilation + Dilation is the spacing between filter elements in convolution :term:`operations`, allowing the receptive field to grow without increasing the filter size. + + Im2Col + Im2Col is a data transformation technique that converts image data to column format. + + Col2Im + Col2Im is a data transformation technique that converts column data to image format. + + fast changing dimension + The fast changing dimension is the innermost dimension in memory layout. + + outer dimension + The outer dimension is the slower-changing dimension in memory layout. + + inner dimension + The inner dimension is the faster-changing dimension in memory layout. + + tile + A tile is a sub-region of a tensor or matrix that is processed by a :term:`work group` or :term:`work-item`. Rectangular data blocks are the unit of computation and memory transfer in Composable Kernel, and are the basis for tiled algorithms. + + block tile + A block tile is a memory :term:`tile` processed by a :term:`work group`. + + wave tile + A wave :term:`tile` is a sub-tile processed by a single :term:`wavefront` within a :term:`work group`. The wave tile is the base level granularity of a :term:`single-instruction, multi-thread` (SIMT) model. + + tile distribution + The tile distribution is the hierarchical data mapping from :term:`work-items` to data in memory. 
+ + tile window + A tile window is a viewport into a larger tensor that defines the current tile's position and boundaries for computation. + + load tile + Load tile is an operation that transfers data from :term:`global memory` or the :term:`local data share` to :term:`vector general purpose registers`. + + store tile + Store tile is an operation that transfers data from :term:`vector general purpose registers` to :term:`global memory` or the :term:`local data share`. + + descriptor + A descriptor is a metadata structure that defines :term:`tile` properties, memory layouts, and coordinate transformations for Composable Kernel :term:`operations`. + + input + See :term:`problem shape`. + + problem shape + The problem shape defines the dimensions and data types of input tensors that define the :term:`problem`. + + vector + The vector is the smallest data unit processed by an individual :term:`work-item`. A vector is typically four to sixteen elements, depending on data type and hardware. + + elementwise + An elementwise :term:`operation` is an operation applied to each tensor element independently. + + epilogue + The epilogue is the final stage of a kernel. Activation functions, bias, and other post-processing steps are applied in the epilogue. + + Add+Multiply + See :term:`fused add multiply`. + + fused add multiply + Fused add multiply is a common fused :term:`operation` in machine learning and linear algebra, where an :term:`elementwise` addition is immediately followed by a multiplication. Fused add multiply is often used for bias and scaling in neural network layers. + + MFMA + See :term:`matrix fused multiply-add`. + + matrix fused multiply-add + Matrix fused multiply-add (MFMA) is a :term:`matrix core` instruction for GEMM :term:`operations`. + + GEMM + See :term:`general matrix multiply`. + + general matrix multiply + A general matrix multiply (GEMM) is a core matrix :term:`operation` in linear algebra and deep learning. 
A GEMM is defined as :math:`C = {\alpha}AB + {\beta}C`, where :math:`A`, :math:`B`, and :math:`C` are matrices, and :math:`\alpha` and :math:`\beta` are scalars. + + VGEMM + See :term:`naive GEMM`. + + vanilla GEMM + See :term:`naive GEMM`. + + naive GEMM + The naive GEMM, sometimes referred to as a vanilla GEMM or VGEMM, is the simplest form of :term:`GEMM` in Composable Kernel. The naive GEMM is defined as :math:`C = AB`, where :math:`A`, :math:`B`, and :math:`C` are matrices. The naive GEMM is the baseline GEMM that all other GEMM :term:`operations` build on. + + GGEMM + See :term:`grouped GEMM`. + + grouped GEMM + A :term:`kernel` that calls multiple :term:`VGEMMs`. Each call can have a different :term:`problem shape`. + + batched GEMM + A :term:`kernel` that calls :term:`VGEMMs` with different batches of data. All the data batches have the same :term:`problem shape`. + + Split-K GEMM + Split-K GEMM is a parallelization strategy that partitions the reduction dimension (K) of a :term:`GEMM` across multiple :term:`compute units`, increasing parallelism for large matrix multiplications. + + GEMV + See :term:`general matrix vector multiplication` + + general matrix vector multiplication + General matrix vector multiplication (GEMV) is an :term:`operation` where a matrix is multiplied by a vector, producing another vector. 
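The GEMM definition in the glossary, :math:`C = {\alpha}AB + {\beta}C`, can be illustrated with a minimal host-side sketch. This is not Composable Kernel's reference kernel or any part of its API: the function name ``reference_gemm``, the row-major layout, and the ``float`` element type are all assumptions made for the illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Composable Kernel.
// Computes C = alpha * A * B + beta * C on the host.
// Assumed layout: row-major, A is M x K, B is K x N, C is M x N.
void reference_gemm(std::size_t M, std::size_t N, std::size_t K,
                    float alpha, const std::vector<float>& A,
                    const std::vector<float>& B,
                    float beta, std::vector<float>& C)
{
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            // Reduce over the K dimension for one output element.
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            // Scale the product and blend in the existing C value.
            C[m * N + n] = alpha * acc + beta * C[m * N + n];
        }
    }
}
```

Under these assumptions, calling the function with ``alpha = 1`` and ``beta = 0`` reduces it to the naive GEMM :math:`C = AB`, and calling it with ``N = 1`` gives a matrix-vector product of the kind described under GEMV.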
+ diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 2ef3383d84..33ad8d91f8 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -34,8 +34,14 @@ subtrees: title: Composable Kernel vector utilities - file: reference/Composable-Kernel-wrapper.rst title: Composable Kernel wrapper + - file: doxygen/html/namespace_c_k.rst + title: CK API reference + - file: doxygen/html/namespaceck__tile.rst + title: CK Tile API reference - file: doxygen/html/annotated.rst - title: Composable Kernel class list + title: Full API class list + - file: reference/Composable-Kernel-Glossary.rst + title: Glossary - caption: About entries: From d07400e78b0a72cc442758882f311200a19b46dd Mon Sep 17 00:00:00 2001 From: spolifroni-amd Date: Tue, 9 Sep 2025 15:24:44 -0400 Subject: [PATCH 2/3] Improving the contribution page (#2804) * edited the contribution page to remove a broken link * smoothed language; added a link * updated link to install * Adding contribution guide for PRs. * additional editing * Update docs/Contributors_Guide.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * Update docs/Contributors_Guide.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * Update docs/Contributors_Guide.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * Update docs/Contributors_Guide.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * Update docs/Contributors_Guide.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> * Update docs/Contributors_Guide.rst Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> --------- Co-authored-by: Vidyasagar Ananthan Co-authored-by: Leo Paoletti <164940351+lpaoletti@users.noreply.github.com> (cherry picked from commit 4759cb5e63d976d1908034f46d9d0fc8451d507d) --- docs/Contributors_Guide.rst | 99 +++++++++++-------------------------- 1 file changed, 28 insertions(+), 71 
deletions(-) diff --git a/docs/Contributors_Guide.rst b/docs/Contributors_Guide.rst index 3788ba609c..bd414c08d6 100644 --- a/docs/Contributors_Guide.rst +++ b/docs/Contributors_Guide.rst @@ -5,101 +5,58 @@ .. _contributing-to: ******************************************************************** -Contributor's guide +Contributing to Composable Kernel ******************************************************************** -This chapter explains the rules for contributing to the Composable Kernel project, and how to contribute. +Review the `Composable Kernel documentation `_ before contributing to the Composable Kernel project. This documentation provides information about core concepts and configurations, as well as providing :doc:`steps for building Composable Kernel `. Some of this information is also available in the `Composable Kernel README `_. -Getting started -=============== - -#. **Documentation:** Before contributing to the library, familiarize yourself with the - `Composable Kernel User Guide `_. - It provides insight into the core concepts, environment configuration, and steps to obtain or - build the library. You can also find some of this information in the - `README file `_ - on the project's GitHub page. -#. **Additional reading:** The blog post `AMD Composable Kernel library: efficient fused kernels for AI apps with just a few lines of code `_ provides a deeper understanding of the CK library and showcases its performance capabilities. - `_ - from the AMD Community portal. It offers a deeper understanding of the library's objectives and showcases its performance capabilities. -#. **General information:** For broader information about AMD products, consider exploring the - `AMD Developer Central portal `_. - -How to contribute -=================== - -You can make an impact by reporting issues or proposing code enhancements through pull requests. +Consult the `AMD Developer Central portal `_ for more information about AMD products. 
Reporting issues ----------------- +================= -Use `Github issues `_ -to track public bugs and enhancement requests. +Use `GitHub issues `_ to log and track issues and enhancement requests. -If you encounter an issue with the library, please check if the problem has already been -reported by searching existing issues on GitHub. If your issue seems unique, please submit a new -issue. All reported issues must include: +If you encounter an issue with the Composable Kernel library, search the existing GitHub issues to determine whether the problem has already been +reported. If it hasn't, submit a new issue that includes: -* A comprehensive description of the problem, including: +* A description of the problem, including what you observed, what you were expecting, and why this was an issue. + +* Your configuration details, including the GPU, OS, and ROCm version, and any Docker image you used. - * What did you observe? - * Why do you think it is a bug (if it seems like one)? - * What did you expect to happen? What would indicate the resolution of the problem? - * Are there any known workarounds? +* The steps to reproduce the issue, including any CMake command you used to build the library, as well as the frequency of the issue. -* Your configuration details, including: +* Any workarounds you've found and what you expect in a resolution. - * Which GPU are you using? - * Which OS version are you on? - * Which ROCm version are you using? - * Are you using a Docker image? If so, which one? -* Steps to reproduce the issue, including: +Contributing to the codebase +============================= - * What actions trigger the issue? What are the reproduction steps? +All external contributors to the Composable Kernel codebase must follow these guidelines: - * If you build the library from scratch, what CMake command did you use? +* Use the correct branch: Use your own branch for your changes. Create your branch from the develop branch. 
- * How frequently does this issue happen? Does it reproduce every time? Or is it a sporadic issue? +* Describe your changes: Provide the motivation for the changes and a general description of all code changes. -Before submitting any issue, ensure you have addressed all relevant questions from the checklist. +* Add design documents for major changes: Major architectural changes must be accompanied by comprehensive design documents uploaded with your pull request. -Creating Pull Requests ----------------------- +* Add inline documentation: Include relevant documentation and inline comments with your code changes. -You can submit `Pull Requests (PR) on GitHub -`_. +* Link your pull request to related issues: Add links to any issues resolved by your changes in your pull request description. -All contributors are required to develop their changes on a separate branch and then create a -pull request to merge their changes into the `develop` branch, which is the default -development branch in the Composable Kernel project. All external contributors must use their own -forks of the project to develop their changes. +* Verify and test the changes: Run all relevant existing tests and write new tests for any new functionality that isn't covered by existing tests. -When submitting a Pull Request you should: +* Provide performance numbers: Include documentation showing before and after performance numbers for any changes that potentially impact build times or run times. -* Describe the change providing information about the motivation for the change and a general - description of all code modifications. +* Keep your branch up to date: Regularly rebase or merge the develop branch back into your feature branch. This should be done both prior to creating your pull request and during the review process. -* Verify and test the change: +* Ensure a manageable pull request size: Pull requests should be limited to approximately one thousand lines. 
If your changes significantly exceed one thousand lines, break them into smaller pull requests that can be reviewed independently. - * Run any relevant existing tests. - * Write new tests if added functionality is not covered by current tests. +* Use pre-commit hooks to adhere to the coding style: Composable Kernel's coding style is defined in `.clang-format `_. Use the provided pre-commit hooks to run clang formatting and linting. Instructions on installing pre-commit hooks are available in the `README file `_. -* Ensure your changes align with the coding style defined in the ``.clang-format`` file located in - the project's root directory. We leverage `pre-commit` to run `clang-format` automatically. We - highly recommend contributors utilize this method to maintain consistent code formatting. - Instructions on setting up `pre-commit` can be found in the project's - `README file `_ +Forks require an approver from AMD to trigger continuous integration (CI) testing. This approval process is necessary for security and resource management. -* Link your PR to any related issues: +Depending on the complexity of your changes, an AMD developer might need to pull your changes and perform additional fixes or modifications before merging. This collaborative approach ensures compatibility with internal systems and standards. - * If there is an issue that is resolved by your change, please provide a link to the issue in - the description of your pull request. +You can see a complete list of pull requests on the `Composable Kernel GitHub page `_. -* For larger contributions, structure your change into a sequence of smaller, focused commits, each - addressing a particular aspect or fix. - -Following the above guidelines ensures a seamless review process and faster assistance from our -end. - -Thank you for your commitment to enhancing the Composable Kernel project! 
From 5e1667f082dac3256a86dd771d068bbd487b2d04 Mon Sep 17 00:00:00 2001 From: spolifroni-amd Date: Wed, 26 Nov 2025 13:28:41 -0500 Subject: [PATCH 3/3] removed an extra newline that caused an issue --- docs/reference/Composable-Kernel-Glossary.rst | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/reference/Composable-Kernel-Glossary.rst b/docs/reference/Composable-Kernel-Glossary.rst index 847802b903..ac4c966ebf 100644 --- a/docs/reference/Composable-Kernel-Glossary.rst +++ b/docs/reference/Composable-Kernel-Glossary.rst @@ -4,7 +4,6 @@ *************************************************** Composable Kernel glossary - *************************************************** .. glossary::