Commit Graph

1152 Commits

Author SHA1 Message Date
kyle-256
3c12a02827 [CK_TILE] add tensorwise quant in grouped gemm (#3007)
* add tensorwise quant in grouped gemm

* fix example issue

* update test cases

* format codes

* clang format

* use GTEST_FAIL

* fix a bug in test_grouped_gemm_util

* skip test when use wmma on grouped_quant kernel

* change cmake

* change code based on comments

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-24 07:41:54 -07:00
yinglu
6bbc05e1bd conv:tf32:add missed instances (#3081)
* conv:tf32:add missed instances
2025-10-24 16:28:36 +08:00
Gino Lu
bedade2572 [CK_TILE] Add fp4 warp gemm 16x16x128 (#2738)
* first commit

* fix format error

* fix vec size error

* fix clang format

* fix type error

* add interface in warp_gemm_impl

* fix interface

* fix bug

* fix bug

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-23 10:55:51 -07:00
Qianfeng
fbd101b1ac [CK_TILE] Fix in set_slice_tile (#2232)
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-23 10:34:02 -07:00
Haocong WANG
0d3860dfdb [CKTILE] FMHA fwd trload lse fix (#3046)
* enable storelse for fmha_fwd_trload kernel

* fix lse in trload

* fix the mask related bug
2025-10-23 09:33:33 +08:00
lalala-sh
211d64e18a [CK_TILE] Update flatmm related kernels (#3022)
---------

Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: felix <felix.li@amd.com>
2025-10-22 22:36:11 +08:00
Johannes Graner
cbd1279ae6 [CK_TILE] Conv bwd splitN support (#3047)
* Conv bwd splitN support

* Adjust splitting calculations to lengths format

* Prepare indexing for future splitK support
2025-10-22 13:34:06 +02:00
MHYangAMD
5a27a97391 Introduce tree reduction for BlockReduce2dCrossWarpSync (#2588)
* Introduce tree reduction for BlockReduce2dCrossWarpSync

* Rename original impl to BlockReduce2dLinearCrossWarpSync

* Replace warp_size with get_warp_size()

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-22 14:41:35 +08:00
John Shumway
37dff024c1 [CK_BUILDER] Add compile-time reflection for a convolution instance (#3065)
* [CK_BILDER] Add compile-time reflection for a convolution instance

Introduce InstanceTraits template metaprogramming framework to enable runtime introspection of device kernel template parameters without requiring implementation knowledge. This reflection system extracts configuration details (block sizes, data types, layouts, tuning parameters) directly from kernel specializations through template
pattern matching. In particular, the GetInstanceString method returns a string that uniquely idenitfies the kernel, by explicitly serializing all template paramter values.

This provides critical functionality for MIOpen integration, since the existing GetTypeString method is ambiguous, and only captures some of the template paramters.

The implementation uses a two-level design: a primary InstanceTraits template declaration in instance_traits.hpp serves as the interface, while kernel-specific specializations (e.g., for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3) provide the actual extraction logic. This separation allows the reflection system to scale to additional kernel types without modifying the core interface.

Key architectural decisions:

- Forward-declare device kernels in instance_traits.hpp to avoid  circular dependencies, since device implementation headers will  include the reflection headers

- Use compile-time constants and type aliases to expose kernel  parameters, enabling zero-overhead introspection

- Provide a templated instance_string() function that generates human-readable  kernel configuration strings by serializing all template parameters  in order, useful for debugging and kernel identification

- Guard reflection integration with preprocessor definition CK_EXPERIMENTAL_BUILDER to keep  it opt-in until the API stabilizes

- Add GetInstanceString() virtual method to BaseOperator, allowing  runtime polymorphic access to compile-time kernel information

This infrastructure also enables upcoming higher-level semantic reflection abstractions (like ConvTraits) to query kernel configurations programmatically.

Includes unit tests validating both the trait extraction accuracy and the string generation format.
2025-10-21 21:10:19 -07:00
Bartłomiej Kocot
3a28632b20 Gridwise gemm conv v3 force padded layout on gfx950 (#2961)
* Gridwise gemm conv v3 force padded layout on gfx950

* fix bug in other gridwise

* fix

* Update gridwise_gemm_wmma_cshuffle_v3_common.hpp
2025-10-21 15:41:02 +02:00
Yashvardhan Agarwal
35754d2ec8 fix identity value of AbsMax (#3058)
* fix identity value of AbsMax

- Identity value of AbsMax should be 0 not numeric<T>::lowest()

* Update include/ck_tile/core/utility/reduce_operator.hpp

resolved comment

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

---------

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
2025-10-21 14:42:08 +02:00
Johannes Graner
4043401db1 Fix race conditions in ck_tile remod (#3061) 2025-10-21 09:35:04 +02:00
Max Podkorytov
2570462ecf [CK_TILE] Fix transpose_vectors for 2x2 8-bit tiles (#3042)
fix transpose_vectors logic for 2x2 8-bit tiles

    add a test which goes through this code path.

    factor out constexpr'd cases into smaller functions.

    add inline docs about the data movement

    impact: gemms with 8-bit non-rcr inputs on gfx942
2025-10-20 13:40:44 -07:00
Gino Lu
fb1d090f3c [CK_TILE] Patch for pk_fp4 ref check and buffer load. (#3044)
* Patch for pk_fp4_raw_t buffer load and ref check
2025-10-20 14:47:04 +08:00
AviralGoelAMD
b03764ca5a docs: add inline comments about flush_cache and rotating buffer 2025-10-17 12:56:47 -04:00
Yashvardhan Agarwal
889ffc0b1d fix identity values in Max and AbsMax (#3048)
- The identity value method returned the minimum positive number while
we need the lowest number for Max and AbsMax operations
2025-10-17 09:49:21 -07:00
Emily Martins
352dee5225 Fix CK Tile Stream-K BF16 Validation Errors (#3039)
Prior to this change, the number of accumulations passed into
calculate_rtol_atol was 1. That said, in most cases, this is not correct
when there are multiple workgroups contributing to the same macro tile
in C.

This change ensures uses the function estimate_num_wgs_per_tile, which
was extracted into a common file and generalized, to estimate the number
of workgroups per macro tile. This estimate is passed into
calculate_rtol_atol to ensure we get a better relative and absolute
tolerance.
2025-10-17 09:33:38 -07:00
Johannes Graner
8a4cd32d86 Pre-commit in CI (#3029)
* Pre-commit in CI

* Specify python version, and install dos2unix for remod

* Refactor remod hook to correctly install dependencies

* Run pre-commit
2025-10-17 09:28:38 -07:00
Ville Pietilä
7e44b845b5 Fixed handling of split-K autodeduce argument for grouped convolution (#3024)
* Fix handling of split-K autodeduce argument.

* Fix clang formatting.

* Test fix.

* Fix clang formatting.
2025-10-17 15:36:39 +03:00
Johannes Graner
d40b50b9d5 Update pre-commit to fixed versions, run remod for ck_tile (#2895)
* Fix ruff linter errors

* Fix remod dos2unix command

* Clang format

* Ignore utility in remod

* Run remod

* Specify clang-format version in pre-commit

* Specify ruff version

* Include PoolKernelArgs in reference_pool

* Add calculate_total_elements to reference batched contraction

* Fix calculate_total_elements declaration

* Refactor remod pre-commit hook

* Fix Aquant tests

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-16 15:29:17 -07:00
Enrico Degregori
440358c168 Wave Tile Transfer supporting global load with transpose (#3027)
* Initial implementation:

 - add new thread group transfer supporting transpose instruction
 - refactor AB transfer to switch between thread and wave tiles methods

* Add some comments and remove explicit wave and lane calculations

* Remove compiler option for performance

* fp16 example: use tuned instance

* Missing cleanup

* Integrate wave transfer in existing gemm and batched gemm instances

* Add fast instances

* extend implementation for 8 bit datatypes

packed types not supported

* Address review comments

* Optimize pipeline v1 and re-introduce compiler option

* Disable wave tile approach for b scale gemm

* Fix for clang20

* Avoid code duplication of amd_global_load_transpose_to_vgpr function
2025-10-16 11:33:56 -07:00
kabrahamAMD
c4b2da9cbd implement device batched gemm b scale for wmma (#2825)
* rebased on top of develop

* fixed missing shuffeling and wrong indexing

* added tests for batched_b_scale

* added missing files

* fixed wrong stride computation and removed k batching (for now) due to precision issues

* reinstated k-batching with PRNG constrained to -1..1

* added specialization of GeneratorTensor_3 for int4 and fixed internal overflow

* added k-batching to reference and increased tolerances for test

* changed gemm_b_scale and gemm_universal tests to use correct parameters

* adressed review commentsd

* ported fixes back to non-batched version of b_scale

* adressed review comments

* run clang-format on older commits

* add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior

* added newline at end of file

* reflected changes from muitl-abd branch in batched b_scale

* fixed gfx11 issue

* changed range for pki4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and always should have caused compiler errors, but since there was no int4 specialization of GeneratorTensor3 until now, this passed

* run clang format

* set range of i4 generation to 0...1 for upstream tests to pass. This replicated previous behavior, which however means that it is NOT properly tested.

* reduced range for pk_i4 even further to 0..0

* removed failing xld instances. Failure now uncovered now that tests were fixed

* removed generation of int4 values entierly

* divide B buffer by BPackedSize

---------

Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
2025-10-16 11:00:42 -07:00
Emily Martins
cb83d52301 Style updates and cleanup
The following changes were made
- Renamed iter to iter_start
- Renamed tile_iter to tile_iter_start
- Moved documentation from member variables to getters
- Removed double underscore from extra_iters_before_me variable
- Defined parent header in impl file
- Removed unused inlcudes
2025-10-16 08:47:06 -06:00
Astha
8f75d7cea6 Addition of the derived structs for the new Stream-K TilePartitioner
There are 2 derived structs based on whether Stream-K is persistent or not.
If it's persistent that means that both the data parallel and Stream-K sections
are data parallel. If it's non-persistent that means that only the
Stream-K section is persistent, while the data parallel section will have
separate workgroups allocated for it. Both structs will have a template
argument for Persistent.

The 2 derived classes will inherit common variables and functions from the
Stream-K TilePartitioner base class. There are additional variables for the
differing data parallel sections that will be added to each derived class,
that are in charge of the indexing/bookkeeping for the data parallel sections.
The only additional function that will differ between the 2 structs is GridSize(),
as the non-persistent will allocate extra workgroups for data parallel.

Unit tests for the derived structs are included.
2025-10-16 08:47:06 -06:00
Emily Martins
f87f768d16 Stream-K Tile Partitioner Base Class with Tests
To better align with the original Stream-K paper, this change implements
a new Stream-K tile partitioner base class. This class will handle the
Stream-K setup that is common to both a persistent and non-persistent DP
section. A later change will implement derived classes to handle the
differences between persistent and non-persistent DP.

This change also includes unit tests for the base tile partitioner.
2025-10-16 08:47:06 -06:00
Illia Silin
3348f01e6f re-enable clang-format by default (#3030)
* re-enable clang-format by default

* fix clang format
2025-10-15 07:43:11 -07:00
felix
4c826abfff Felix/opt sorting (#2902)
* merge felix/sorting
* opt moe sorting  (#2822)
* opt moe storing for 2k
---------
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-10-15 09:24:03 +08:00
joyeamd
b9d74e7746 update s_barrier's logic in gfx12 architecture (#3003)
change s_waitcnt's logic in gfx1250

change s_waitcnt's logic in gfx1250

update comment
2025-10-14 08:49:34 -07:00
Illia Silin
e4298e55c7 Revert "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017)
This reverts commit d2bbca3eca.
2025-10-14 08:43:14 -07:00
jakpiase
6deaaa92cc [CK_TILE] Switch into universal gemms for conv bwds (#2981)
* switch into universal gemms for conv bwds

* some fixes and support universal gemm in conv fwd

* add reviewer comments
2025-10-14 16:09:16 +02:00
ClementLinCF
e1b0bdfbfa [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm (#2540)
* [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm

* Update rmsnorm host reference

* Update tree reduction of rmsnorm for reference host

* Fix cross warp for m > 1 cases

* Add RMSNorm model selectable option for host reference

* Fix save_unquant cases

* Update reference rmsnorm forward function to use enum for model sensitivity

* Update reference rmsnorm calculation for model sensitivity

* Fix m warp for layernorm

* Adjust parameter of reference for twoPass

* Fix clang format

* Run clang-format-overwrite.sh to fix formating issue

* fix clang format

---------

Co-authored-by: MHYang <mengyang@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-13 11:52:37 -07:00
Sami Remes
d2bbca3eca [CK_TILE] Non-K Major from old CK to CK-Tile (#2442)
* Enable the adapted LDS B layout for Row-Major

* fix formatting

* Implement specialized col-major A LDS block descriptor

* Fix formatting

* Use VecLoadSize for AK1/BK1

* Fix some thread access pattern values

* Use GetVectorSizeA for A

* Fix formatting

* Add extra condition to avoid division by zero

* disable layout for wave32

* remove extra else

* fix formatting

* Fix formatting

* Rename one remaining TileDistributionEncodingPattern2D

* Use integer ceil division

* revert remod.py changes

* also revert utility.hpp

* use getA/BTileAccessPattern everywhere

* use integer_divide_ceil for AK0 too

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2025-10-13 14:27:02 +02:00
aledudek
634634f5c0 [CK_TILE] Blockwise GEMM pipeline v6 - port of v5 from old CK (#2955)
* First checkpoint

* Second checkpoint - hot loop scheduler

* Third checkpoint - init main operator

* Fourth checkpoint - main loop ready

* Fifth checkpoint - main loop fix

* Sixth checkpoint - ReadWritecompFunc

* Seventh checkpoint - Tail finished

* [CK_TILE] Blockwise gemm pipeline v5 complete

* Working

* Working fixes 2

* Rename v5 to v77 temporarily

* Data type adjustment

* Data type adjustment 2

* [CK_TILE] Blockwise Gemm pipeline v5 add tests

* [CK_TILE] Fix calculation error

* TEMP: check pipeline

* Fix name to V6

* naming and documentation changes

* WIP dump

* Try fixing v1

* Failing tests v5

* Debugging

* Changes v2

* F16 tests working great

* Working BlockwiseGemmPipelineV5 as V6

* Cleanup and format

* Merging changes part1

* [CK_TILE] Blockwise Gemm Pipeline Comp V5/V6

* Remove commented code

* Fix gfx950 build issues

* Fix file formatting

* Review changes, more concat info, add bf16 bf8 tests

* Fix formatting

* Add bf16 and bf8 tests

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2025-10-13 13:57:37 +02:00
aledudek
3021604213 [CK_TILE] Batched Gemm Kernel IsSupported function checks (#2860)
* Add valid check batched gemm part1

* [CK_TILE] Add batched gemm kernel IsSupported func checks

* revert broken pre-commit hook changes

* revert broken pre-commit hook changes v2

* Clarify error messages
2025-10-13 13:55:23 +02:00
damien-lejeune
46c10c316d Update include path to break the remod's cyclic dep issue (#2978)
* Update include path to break the cyclic dep issue

* Use ck_tile::permute_vectors_i4x4_b in tile engine

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-13 13:24:47 +02:00
msaffari-amd
e9f0cc83a8 [CK Tile] contraction multi d - kernel & example (#2901)
* Initial commit. create batched_contraction_kernel file

* initial problem definition

* implement initial example to launch kernel

* add universal gemm to contraction. initial phase

* complete implementation for special case all Dims are 1 and no Ds

* clean code

* initial changes to support multi dimensional G

* more progress in implementing multiple G

* tmp commit

* manage dynamic NumDimG in kernel

* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit

* implement the example for general Multi dimension G M N K and test different reference calculation algorithms

* 2 functions for reference using multi dimensional and flat indexing

* clean the code for muti dimentional G, M, N, K contraction and add some logs

* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E

* some cleaning on kernel

* clean the code for  calculating the offsets from flatten batch number

* Start adding MultiD support to kernel and example

* more changes to manage multi D in kernel and example

* manage passing multi d to kernel and testing.

* complete multi D support in kernel. modify example code to support it

* Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning

* Minor fix

* Generalize example code for variable NumD tensors and apply cleanup based on review feedback

* Refactored code and addressed review feedback

* refactoring, cleaning, add documents, in kernel side and example codes

* Optimize batch offset calculation in kernel

* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-13 12:30:28 +02:00
Christopher Millette
f5708882a3 Streamk functional tests (#2974)
* Add initial fp16_mem_128x128x32_2x2x1_32x32x16_NonPersistent test suite

* Account for stride when computing K offsets for A and B tensor

This change ensures that the correct stride is used when computing the K
offsets into the A and B tensors in the Stream-K Kernel's operator()
function. This ensures that the kernel executes correct regardless of
whether A and B are row or column major.

* Move helper code to test_gemm_streamk_util.hpp

* Separate tests into smoke/regression/extended. Add bf16 datatype

* Run clang-format

* Refactor combinatorial macro expansion and naming

* Adjust the initialization values to account for better tolerance on bf16

* Correct BF16 datatypes in comments

* Move the extended tests under the REGRESSION_TESTS label

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-11 07:53:40 -05:00
Khushbu Agarwal
3c39d279ab supporting prefill shapes for preshuffle block scale gemm (#2975)
* debugging

* debugging for prefill shapes

* comment unused code

* fix for prefill shapes

* clearing up the code

* add int4 to universal gemm example

* clang formatted

* adding test for prefill shapes in block scale gemm

* lil improv on the block pipeline

* Address Review Comment

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-10 15:36:24 -07:00
Max Podkorytov
9d060d3e3c [CK-Tile] functional support for transposed inputs in compute-bound double-lds-buffer pipeline with async loads from global memory to LDS (#2984)
* reuse local prefetch logic from compute v4 pipeline

add single-tile test

explicit lambda capture

reuse lds block descriptors from base policy for the transposed case

match the test case kernel configuration with compute v4

* add comments
2025-10-10 12:57:50 -07:00
yinglu
fada1a3cae Conv:TF32: add more instances - 2 (#2879)
* add instances of device_grouped_conv_fwd_xdl_f32_comp_instances
* add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances
* add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances
* tf32:conv:add instances for base class DeviceConvFwd
* tf32:conv:add instances for base class DeviceGroupedConvBwdDataMultipleD
* tf32:conv:add instances for base class DeviceGroupedConvBwdWeight
* add tf32 in profiler
* remove gnhwc/ngchw/ngcdhw instances
* remove non-ndhwgc/nhwgc/nhwc instances
* add check in IsSupportedArgument()
2025-10-10 15:28:17 +08:00
Bartłomiej Kocot
ad7a215aba Fix splitK for grouped conv bwd data (#2991) 2025-10-10 09:24:21 +02:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
fb66b4f5e4 [CK_TILE] fix pk_fp4 compilation for non-gfx950 GPUs (#2983)
See build error log from
https://github.com/ROCm/composable_kernel/issues/2271#issuecomment-3150218542

This PR make vector element access constexpr-safe by avoiding operator[] on
ext_vector_type(2) and replace those sites in the pk_fp4 conversions so they
can be used in constant expressions, as The operator[] on ext_vector_type(2)
isn't allowed in constant expressions, which caused "constexpr function never
produces a constant expression" with a note at x[0]. Using `bit_cast` to a
trivial array representation keeps it constexpr-compatible.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-10-09 07:43:41 -07:00
Yashvardhan Agarwal
7b6451b68e [CK_TILE] Pooling FWD (Lwpck 3683) (#2956)
* Pooling 2D/3D with refernce

* Tests & cleanup

- added test for ppoling
- cleanup
- removed 2d example

* Comment resolution

- README added
- example target name rectified
- appropriate arg description and comments added

* clang-format

* appropriate blocksize calc

* modifications for future indexing addition

- instead of transforming views we now transform the descriptors, so
that the same descriptor can be re-used for index tensor in the future

* some basic fixes

* comment resolutions

* comment resolutions

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-10-09 16:13:26 +02:00
Sami Remes
9d4bfe3932 Add KBatch support for gemm_ab_scale (#2740)
* Add KBatch support for gemm_ab_scale

* Revert kernel parameters change

* Remove printing

* fix formatting

* fix check

* Use {} in if

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-10-09 08:33:16 +02:00
Aviral Goel
e99356dabc Add Memory pipeline for AQuant Block Scale GEMM (#2987)
* WIP: add memory pipeline boiler plate code that compiles and works for one block

* WIP: tail handling works for memory pipeline

* WIP: numerical errors appears to have gone by adding block_sync_lds()

* fix: numerical error with memory pipeline by adding block_sync_lds() and new tail handler

* refactror: remove debug print statements and lints

* fix: remove redundant sync barriars

* chore: remove lint

* fix: remove unused code from tile handler and remove redundant block_sync_lds()

* fix: correct parent struct name for memory pipeline

* fix: remove static assert check from parent struct and add it to child struct because not all child structs needs to static assert

* fix: defer block sync lds to just before prefill
2025-10-08 17:22:30 -07:00
Cong Ma
1d4db30af9 [CK TILE GEMM] Refactor the code of transposeC and quantpreshuffle of AQuant Gemm (#2965)
Refactor the code of transposeC and quantpreshuffle of AQuant Gemm to make it easier to maintain.

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-10-08 00:05:38 -07:00
Thomas Ning
ae9f29b7d5 add the sync barrier for persistent kernel (#2977) 2025-10-07 11:54:04 -07:00
Thomas Ning
b4a4aa2b64 [CK Tile] CShuffle Tile Permute N all warp compatible (#2966)
* solve the hard_code issue of kM2

* clang format
2025-10-03 09:46:13 -07:00
Illia Silin
4c98535456 fix compilation errors on RHEL8 and SLES15 (#2967) 2025-10-03 07:08:49 -07:00
Thomas Ning
cadafde722 add the check of granularity for atomic add (#2959) 2025-10-02 11:15:24 -07:00