Commit Graph

3846 Commits

Author SHA1 Message Date
assistant-librarian[bot]
a3bbd74c0c Merge commit '583fafc803a0ec9d0edc902fc6b9ecfdc42fb09b' into develop 2025-12-04 07:13:58 +00:00
arai713
beaa1aa47c [CK_TILE] Fix for Moving DataTypeTraits into a Common File (#3335)
This PR fixes a mismatch caused when PR #3146 was merged out of sync with develop, which made its intended changes ineffective. This PR reapplies those changes to move DataTypeTraits into a common file to mitigate code duplication.

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 583fafc803]
2025-12-03 22:46:22 -08:00
assistant-librarian[bot]
2171fa6588 Merge commit 'ffc3120f63135cc697e46031523e44c5cd5d43fa' into develop 2025-12-04 06:16:54 +00:00
kensclin
9e8836195c Ck tile/gemm blockscale opt (#3227)
* GEMM block scale optimization kernel

* GEMM block scale optimization kernel

* Fix: Apply clang-format for style consistency

* Fix: Apply clang-format for style consistency

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: ffc3120f63]
2025-12-03 22:07:23 -08:00
assistant-librarian[bot]
170089beb6 Merge commit 'eb7f6177136173c8a6af539bffd915fddff293c4' into develop 2025-12-04 04:24:28 +00:00
rocking
ea2e816aa5 fp8 fmha async pipeline (#3339)
* replace qr with async pipeline

* Add fp8fp32 to DTYPE_BITS

* Add kAlignmentRandVal to avoid compile fail

* format

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: eb7f617713]
2025-12-04 12:18:25 +08:00
assistant-librarian[bot]
d0d2528beb Merge commit '4baa4c9fae0e56f1105d73a5d2484611d40886e0' into develop 2025-12-03 20:13:41 +00:00
JH-Leon-KIM-AMD
8a6ce28f47 [CK, CK_TILE] Add GPU Reference Implementations for Grouped Convolution (#3216)
* LWPCK-4043: Add GPU reference implementations for CK Tile convolution

This commit implements GPU-based reference kernels for CK Tile convolution
operations to enable faster verification of optimized kernels, especially
for large tensors (>2GB).

Changes:
- Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution
- Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data
- Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight
- Integrate GPU references with test infrastructure (replace -v=2 error)
- Support for 1D, 2D, and 3D convolutions
- Generic data type support (FP16, BF16, FP32)
- Grid-stride loop pattern for scalability

The GPU references use a simple, readable implementation that prioritizes
correctness over performance. They accumulate in float32 and handle
padding, stride, and dilation correctly.
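
As an illustration of the approach described above, here is a minimal host-side sketch of a naive 1D forward convolution reference. It is not the code from naive_grouped_conv_fwd.hpp: the function name, argument order, and the [C][Wi] / [K][C][X] / [K][Wo] layouts are assumptions made for the example. It shows float accumulation and the padding, stride, and dilation handling; the flat loop over output elements stands in for what a grid-stride loop would distribute across GPU threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the CK implementation.
// Assumed layouts: in is [C][Wi], wei is [K][C][X], out is [K][Wo].
template <typename T>
std::vector<T> naive_conv1d_fwd(const std::vector<T>& in,
                                const std::vector<T>& wei,
                                int C, int K, int Wi, int X,
                                int stride, int dilation, int pad)
{
    assert(stride > 0 && dilation > 0 && pad >= 0);
    assert(in.size() == static_cast<std::size_t>(C) * Wi);
    assert(wei.size() == static_cast<std::size_t>(K) * C * X);

    // Standard output-length formula for a padded, strided, dilated convolution.
    const int Wo = (Wi + 2 * pad - dilation * (X - 1) - 1) / stride + 1;
    assert(Wo > 0);

    std::vector<T> out(static_cast<std::size_t>(K) * Wo);
    const long total = static_cast<long>(K) * Wo;

    // One flat index per output element. On a GPU, each thread would start at
    // its global id and advance by the total thread count (grid-stride loop).
    for (long idx = 0; idx < total; ++idx) {
        const int k  = static_cast<int>(idx / Wo);
        const int wo = static_cast<int>(idx % Wo);
        float acc = 0.0f; // accumulate in float regardless of T
        for (int c = 0; c < C; ++c) {
            for (int x = 0; x < X; ++x) {
                const int wi = wo * stride + x * dilation - pad;
                if (wi < 0 || wi >= Wi) continue; // padded region contributes zero
                acc += static_cast<float>(in[c * Wi + wi]) *
                       static_cast<float>(wei[(k * C + c) * X + x]);
            }
        }
        out[idx] = static_cast<T>(acc);
    }
    return out;
}
```

For example, an input of {1, 2, 3, 4} with a filter of three ones, stride 1, dilation 1, and padding 1 yields {3, 6, 9, 7}.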

* update gpu reference for ck tile grouped conv

* correct c++ 18 format

* Add GPU Reference Implementations for Old CK Convolution

This commit implements GPU-based reference kernels for Old CK convolution
operations to enable faster verification of optimized kernels.

Changes:
- Fixed old CK forward GPU reference (naive_conv_fwd.hpp)
  * Fixed BF16 NaN issue (use type_convert instead of static_cast)
  * Fixed FP8/BF8 arithmetic (accumulate in float)
  * Fixed uninitialized variables
  * All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed)

- Created backward data GPU reference (naive_conv_bwd_data.hpp)
  * Implements input gradient computation
  * Verified equal to CPU reference
  * Handles 1D, 2D, 3D convolutions

- Created backward weight GPU reference (naive_conv_bwd_weight.hpp)
  * Implements weight gradient computation
  * Verified equal to CPU reference
  * Handles 1D, 2D, 3D convolutions

- Integrated with old CK examples
  * Forward: 10 XDL examples now support do_verification=2
  * Backward data: Integrated with example/17_convnd_bwd_data/
  * Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only)
  * Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU)
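
The change from a boolean to an integer verification parameter (0=no, 1=CPU, 2=GPU) could be modeled roughly as below. This is a sketch only: the enum and function names are hypothetical and are not taken from the examples in the repository.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical names; only the 0/1/2 meanings come from the commit message.
enum class VerifyMode : int { None = 0, Cpu = 1, Gpu = 2 };

// Convert the integer command-line value into a typed mode, rejecting
// anything outside the documented range.
inline VerifyMode parse_verify_mode(int value)
{
    switch (value) {
    case 0: return VerifyMode::None;
    case 1: return VerifyMode::Cpu;
    case 2: return VerifyMode::Gpu;
    default:
        throw std::invalid_argument("do_verification must be 0, 1, or 2, got " +
                                    std::to_string(value));
    }
}

// Human-readable label for log output.
inline const char* verify_mode_name(VerifyMode mode)
{
    switch (mode) {
    case VerifyMode::None: return "no verification";
    case VerifyMode::Cpu:  return "CPU reference";
    case VerifyMode::Gpu:  return "GPU reference";
    }
    assert(false && "unhandled VerifyMode");
    return "unknown";
}
```

Parsing into an enum once, at the edge of the program, keeps the rest of the example code free of magic numbers.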

Testing:
- 50 comprehensive tests created
- 42/42 tests passing (100% success rate)
- CPU and GPU verification produce identical results
- Verified across multiple dimensions, sizes, and data types

Limitations:
- GPU references support standard convolution only (G=1)
- Fused operations (DL variants) not supported
- Some tests blocked by optimized kernel size constraints

Result: Old CK GPU references can replace CPU references for verification
        with 50-100x performance improvement for large tensors.

* Apply clang-format to old CK GPU reference files

* Fix C++17 compatibility: use brace initialization for aggregate types

* add get_rtol, get_atl and consistency cout message

* Use triple bracket syntax for kernel launch per review feedback

Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski.
This is more idiomatic HIP/CUDA style and equally correct.

All tests still passing after this change.

* Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode

- Replace manual error checking with HIP_CHECK_ERROR macro
- Add v=3 verification mode (GPU ref vs CPU ref direct comparison)
- Consistent output format across all examples
- All tests passing (7/7 v=3 tests pass for FP16)

* Use ConvDims structure to simplify GPU reference kernels

Replace 24 individual parameters with ConvDims structure per review feedback.

- Add conv_common.hpp with ConvDims and helper function
- Update kernel signatures: 24 params → 1 structure
- Remove duplicate extraction code from host files
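
A rough sketch of the idea behind bundling the kernel parameters into one structure follows. The field names and the helper are assumptions for illustration; the actual ConvDims in conv_common.hpp may be laid out differently.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical layout: unused spatial dimensions are set to 1 (lengths,
// filter, stride, dilation) or 0 (padding) so one struct covers 1D, 2D and 3D.
using Dim3 = std::array<std::int64_t, 3>;

struct ConvDims {
    std::int64_t G, N, C, K;  // groups, batch, input channels, output channels
    Dim3 in_spatial, filter, out_spatial, stride, dilation, pad;
};

// Illustrative helper: fills the struct and derives the output lengths.
inline ConvDims make_conv_dims(std::int64_t G, std::int64_t N, std::int64_t C,
                               std::int64_t K, const Dim3& in_spatial,
                               const Dim3& filter, const Dim3& stride,
                               const Dim3& dilation, const Dim3& pad)
{
    ConvDims d{G, N, C, K, in_spatial, filter, Dim3{}, stride, dilation, pad};
    for (std::size_t i = 0; i < 3; ++i) {
        assert(stride[i] > 0 && dilation[i] > 0 && pad[i] >= 0);
        d.out_spatial[i] =
            (in_spatial[i] + 2 * pad[i] - dilation[i] * (filter[i] - 1) - 1) / stride[i] + 1;
    }
    return d;
}

// A kernel would then take a single ConvDims argument instead of many scalars.
inline std::int64_t output_elements(const ConvDims& d)
{
    return d.G * d.N * d.K * d.out_spatial[0] * d.out_spatial[1] * d.out_spatial[2];
}
```

Passing one trivially copyable struct by value keeps kernel signatures short and removes the need to unpack the same values in every host file.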

* Use get_block_id() and get_thread_id() helpers in CK Tile

Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions.

Updated 3 CK Tile GPU reference kernels per review feedback.

* Use std::array for spatial parameters in CK Tile GPU references

Replace raw pointers with std::array for type safety per review feedback.

- Add conv_common.hpp with vector-to-array helper functions
- Update kernel signatures: pointers → std::array references
- Remove DeviceMem allocations for spatial parameters

* Use NDimSpatial+3 for stride array sizes

Replace hardcoded [10] with [NDimSpatial+3] per review feedback.

Array sizes now correctly reflect actual dimensions needed.
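
To illustrate why NDimSpatial+3 is the natural size, the sketch below assumes the three extra entries are the group, batch, and channel dimensions that accompany the spatial ones. The function name and the packed (contiguous) layout are assumptions for the example, not the code from this change.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: computes packed strides for a tensor whose lengths are
// ordered as three non-spatial dimensions followed by NDimSpatial spatial ones.
template <std::size_t NDimSpatial>
std::array<long, NDimSpatial + 3>
packed_strides(const std::array<long, NDimSpatial + 3>& lengths)
{
    static_assert(NDimSpatial >= 1 && NDimSpatial <= 3,
                  "only 1D, 2D and 3D convolutions are considered here");
    std::array<long, NDimSpatial + 3> strides{};
    long running = 1;
    // Walk from the innermost dimension outwards.
    for (std::size_t i = lengths.size(); i-- > 0;) {
        assert(lengths[i] > 0);
        strides[i] = running;
        running *= lengths[i];
    }
    return strides;
}
```

With the array size tied to the template parameter, a 1D convolution carries 4 strides, a 2D one 5, and a 3D one 6, with no unused slots.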

* Use #pragma once instead of include guards

Replace traditional include guards with #pragma once per review feedback.

Updated 3 Old CK GPU reference headers.

* Fix element-wise operation output in Old CK GPU references

Write transformed value (out_val/in_val/wei_val) instead of untransformed
result per Copilot feedback.

This ensures element-wise operations are correctly applied to output.

* Initialize element-wise operation variables

Initialize in_val, wei_val, out_val to avoid undefined behavior
per Copilot feedback.

Updated backward data and backward weight kernels.

* Use explicit zero initialization for element-wise variables

Change TIn{} to TIn{0} for consistency per Copilot feedback.

All 3 kernels now use consistent zero initialization.

* Fix copyright headers to match existing style

- Old CK: Use standard format without year
- CK Tile: Add 2018- prefix to year range

Addresses consistency feedback.

* Rename GPU reference files: add _gpu suffix

* Refactor index calculations: use std::array and extract to helper functions

* Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 4baa4c9fae]
2025-12-03 21:14:21 +02:00
assistant-librarian[bot]
834ee396bb Merge commit '161835533becff72c71d20eff1e907a702820252' into develop 2025-12-03 17:14:58 +00:00
Enrico Degregori
b71ce6f8ac Wmma support for gemm_multiply_multiply_wp (#3278)
* Initial implementation with splitK support

* Add gfx11 support

* Fix compilation error

* Add instances

* Add irregular instances

* Fix GetBuffer arguments

* Minor changes

* Address review comments

* Fix compilation errors

* Fix copyright header

[ROCm/composable_kernel commit: 161835533b]
2025-12-03 07:38:23 -08:00
assistant-librarian[bot]
0cb95a3e70 Merge commit 'f29b67cf9b20be44299b2dcdd1716393c9c7569c' into develop 2025-12-03 15:14:25 +00:00
John Shumway
8d959a1ec0 [CK_BUILDER] Add Description::instance_string() method and update tests (#3340)
* Create Description::instance_string() function

To expose more reflection capabilities in MIOpen, we add the instance_string functionality to the ckr::Description class. This PR introduces a base class, adds the instance_string method, and implements the method by injecting the Traits::instance_string method through the ConvDescription constructor.

This will enable us to replace the specialized get_instance_string() method on device operations with a describe() method in a subsequent PR.
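
The injection pattern described above might look roughly like the following sketch. The class shapes, FakeTraits, and describe_sketch are hypothetical stand-ins written for this example; they are not the ckr API.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Base class exposing the reflection entry point.
class Description {
public:
    virtual ~Description() = default;
    virtual std::string instance_string() const = 0;
};

// The description does not know the kernel type; it only stores a callable
// that was injected at construction time.
class ConvDescription : public Description {
public:
    explicit ConvDescription(std::function<std::string()> instance_string_fn)
        : instance_string_fn_(std::move(instance_string_fn))
    {
        assert(instance_string_fn_ && "an instance_string callable is required");
    }

    std::string instance_string() const override { return instance_string_fn_(); }

private:
    std::function<std::string()> instance_string_fn_;
};

// Hypothetical traits type standing in for a real kernel's traits.
struct FakeTraits {
    static std::string instance_string() { return "FakeConvInstance<fp16>"; }
};

// Builds a description by injecting the traits' static method.
template <typename Traits>
ConvDescription describe_sketch()
{
    return ConvDescription(&Traits::instance_string);
}
```

Here `describe_sketch<FakeTraits>().instance_string()` returns the string produced by `FakeTraits::instance_string()`, while callers only depend on the `Description` interface.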

* Test describe().instance_string()

Update the instance string tests to also call `ckr::describe<Instance>().instance_string()`. This documents that the XDL kernels are supported with describe(), but WMMA and DL kernels are not yet supported. Also update namespace and add a HasConvTraits concept.

[ROCm/composable_kernel commit: f29b67cf9b]
2025-12-03 06:36:09 -08:00
assistant-librarian[bot]
5f10aba31c Merge commit 'e6a583416b0dc534fcd023f90ed2ebf800fdd78b' into develop 2025-12-03 10:18:22 +00:00
jakpiase
d58a45bb17 [CK TILE] Add index optimizations for conv bwd weight (#3321)
[ROCm/composable_kernel commit: e6a583416b]
2025-12-03 10:53:46 +01:00
assistant-librarian[bot]
96143e6974 Merge commit '6cb0bc2d11a97a928dd156533d97f59f52f41d5f' into develop 2025-12-02 23:12:06 +00:00
Aviral Goel
c93bb1714d feat(block_scale_gemm): Support RRR-R, CRR-R and CCR-C layout for aquant quant mode (#3193)
* [CK TILE GEMM] Refactor block_scale_gemm examples

- Split cpp file to reduce building time
- Support multiple GemmConfig

* [CK TILE GEMM] Refactor block_scale_gemm examples

- Update Readme

* feat(gemm_quant): add RRR and CRR layout support for aquant gemm

* test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm

* fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests

* fix: test cases compilation failure due to PR# 2095

* fix: make condition to filter out tests for gfx950 more explicit

* need to support the gfx950

* fix: add layout support for gfx950

* Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277)

* WIP: add support for pk_int4_t for aquant mode layouts RR and CR

* test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant

* fix: compile time error for gfx950

* fix: minor bug where is_a_load_tr_v() was missing

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318)

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant

* test: add unit tests for new layout support CCRC for aquant block scale gemm

* docs: update changelog with new layout support info

* Update CHANGELOG.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* refactor: break test instances into multiple cpp files to reduce build time (#3319)

* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant

* test: add unit tests for new layout support CCRC for aquant block scale gemm

* refactor: break test instances into multiple cpp files to reduce build time

* chore: rename file for better code readability

* fix: merge conflict resolution

* fix: remove memory pipeline because new layout is not compatible

* build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr()

* refactor: address review comments

* solve the conflict

---------

Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 6cb0bc2d11]
2025-12-02 14:59:07 -08:00
assistant-librarian[bot]
43ff4cfd3f Merge commit '2c284a1780acb790f7c52fb94c99694fa4e3f1fe' into develop 2025-12-02 20:14:38 +00:00
Illia Silin
2929833ec7 Disable gemm_blockscale_f8 on gfx90a by default. (#3338)
* disable gemm_blockscale_f8 instances on gfx90a by default

* fix cmake logic, disable some cmake output

* fix cmake logic

[ROCm/composable_kernel commit: 2c284a1780]
2025-12-02 11:33:33 -08:00
assistant-librarian[bot]
ebfafa78fe Merge commit '280bc4219151c3f79fe8ca076a2d10df4ff88b34' into develop 2025-12-02 16:14:43 +00:00
John Shumway
750a87e7fc [CK_BUILDER] Refactor builder factory code. (#3276)
Refactor the builder factory code into multiple files and subdirectories and a ck_tile::builder::factory namespace.

The factory implements compile-time dispatch from high-level signature and algorithm descriptors to our existing specialized convolution kernel implementations.

Major changes in this PR:

- Dispatch logic is explicit in the function make_conv_instance instead of implicit in template specialization selection.
- Helper code is moved to a subdirectory builder/factory/helpers.
- Helpers now have unit tests.
- Factories are moved to their own files.
- Code moved to namespaces ck_tile::builder::factory and ck_tile::builder::factory::internal.

This does not yet fix the problem of bad error messages, but the make_conv_instance function makes the poor error messages clear. The choice of algorithm must be much more robust (perhaps with explicit enumeration in the algorithm descriptor), so that the dispatch doesn't fail.
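
The difference between explicit and implicit dispatch can be sketched as follows. Everything here is hypothetical (the Direction enum, Signature, the factory structs, and make_conv_instance_sketch); it only illustrates the general technique of a single function with `if constexpr` branches, assuming C++17, and is not the builder's actual code.

```cpp
#include <cassert>
#include <string>

// Hypothetical compile-time descriptor.
enum class Direction { Forward, BackwardData, BackwardWeight };

template <Direction D>
struct Signature {
    static constexpr Direction direction = D;
};

// Hypothetical factories, each standing in for a specialized kernel builder.
struct FwdFactory       { static std::string make() { return "fwd instance"; } };
struct BwdDataFactory   { static std::string make() { return "bwd_data instance"; } };
struct BwdWeightFactory { static std::string make() { return "bwd_weight instance"; } };

template <typename T>
inline constexpr bool always_false = false;

// All dispatch decisions are visible in one place, in order, instead of being
// spread across partial specializations chosen by the compiler.
template <typename Sig>
std::string make_conv_instance_sketch()
{
    if constexpr (Sig::direction == Direction::Forward) {
        return FwdFactory::make();
    } else if constexpr (Sig::direction == Direction::BackwardData) {
        return BwdDataFactory::make();
    } else if constexpr (Sig::direction == Direction::BackwardWeight) {
        return BwdWeightFactory::make();
    } else {
        // Reached only if a new Direction is added without a branch above.
        static_assert(always_false<Sig>, "no factory matches this signature");
    }
}
```

A failed match then points at one `static_assert` in one function, which is easier to diagnose than a missing template specialization.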

Quality changes:

- Making dispatch explicit rather than implicit will improve robustness, readability, maintainability, testability, and extensibility.
- Separating code into separate files and subdirectories helps readability and extensibility.
- Adding unit tests for helpers documents behavior and will enable more complex logic and functionality.
- Separating files (especially unit tests) helps clarify includes and dependencies and makes code easier to refactor.

[ROCm/composable_kernel commit: 280bc42191]
2025-12-02 07:40:14 -08:00
Thomas Ning
cdffa8cc83 disable the gfx90a (#3336)
[ROCm/composable_kernel commit: 8459d389ad]
2025-12-02 07:27:37 -08:00
assistant-librarian[bot]
aef67fef38 Merge commit '66832861ad78cc63584c32e5d231fd29a99c57b3' into develop 2025-12-02 14:14:02 +00:00
Ville Pietilä
8bf6f9cac8 [CK_TILE] Merge multiple fwd convolution groups into a single GEMM batch. (#3136)
* Merge fwd conv groups in CK Tile.

* Fix building CK fwd convs.

* Add number of merged groups to conv fwd kernel name.

* Get number of merged groups from conv config.

* Rename GemmConfig to ConvConfig.

* Clean-up TODOs.

* Check that number of conv groups must be divisible by the number of merged groups.

* Improve error handling in the conv fwd example.

* Fix clang-format.

* Fix group offsets.

* Fix merge problem.

* Address feedback from code review.

* Fix clang-formatting.
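
The divisibility check mentioned in the list above could be sketched like this. The function name is hypothetical, and the assumption that the resulting GEMM batch count equals the group count divided by the number of merged groups is an inference from the commit title, not something stated in the source.

```cpp
#include <cassert>
#include <optional>

// Hypothetical helper: returns the assumed number of GEMM batches after
// merging, or std::nullopt when the configuration is not supported.
inline std::optional<int> merged_batch_count(int conv_groups, int merged_groups)
{
    if (conv_groups <= 0 || merged_groups <= 0) {
        return std::nullopt; // nonsensical configuration
    }
    if (conv_groups % merged_groups != 0) {
        return std::nullopt; // groups cannot be split evenly into merged batches
    }
    return conv_groups / merged_groups;
}
```

Under that assumption, 32 groups merged 4 at a time give 8 batches, while 30 groups merged 4 at a time are rejected.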

[ROCm/composable_kernel commit: 66832861ad]
2025-12-02 15:23:32 +02:00
assistant-librarian[bot]
9c9a022007 Merge commit '2d3020e5b03109a56fc2498a721134e5c34ab10f' into develop 2025-12-02 13:23:30 +00:00
msaffari-amd
b06b6e684c [CK Tile] batched contraction kernel generalizing (#3126)
* Add help for example

* Refactor the compute reference for batched contraction to handle stride-aware calculation, plus some code cleanup

* Add stride-aware reference for batched contraction with independent D tensor layouts

* Add -num_d argument for runtime D tensor count selection in batched contraction

* Add stride vector arguments in example code for testing non-contiguous batched contraction inputs

* Add descriptor-based architecture for batched contraction multi-dimensional stride support

* Add multi-dimensional non-contiguous stride support to batched contraction, num_d = 0

* Add complete multi-dimensional stride support via descriptors

* Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm

* Clean up batched contraction: remove old UniversalGemmKernel path

* Clean up batched contraction: remove legacy paths and finalize docs

* Optimize batched contraction example: pass dimension sizes not vectors

* correct the reference calculation, unsigned int to int

* Fix batched_contraction C++17 build errors for gfx90a CI
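
As a small illustration of what "stride-aware" means for a reference calculation, the sketch below computes an element's linear offset from a multi-dimensional index and an arbitrary stride vector. The helper name is hypothetical and the code is not taken from the batched contraction reference.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: offset = sum over dimensions of index[i] * strides[i].
// With packed strides this matches the usual row-major offset; with larger
// strides it addresses a non-contiguous view of the same buffer.
template <std::size_t Rank>
long linear_offset(const std::array<long, Rank>& index,
                   const std::array<long, Rank>& strides)
{
    long offset = 0;
    for (std::size_t i = 0; i < Rank; ++i) {
        assert(index[i] >= 0);
        offset += index[i] * strides[i];
    }
    return offset;
}
```

For index (1, 2, 3), packed strides (100, 10, 1) give offset 123, while doubled strides (200, 20, 2) give 246, so the same logical element maps to a different memory location.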

[ROCm/composable_kernel commit: 2d3020e5b0]
2025-12-02 13:30:27 +01:00
DarylHawkinsAMD
3ba598e05d [CK_BUILDER] Install CK builder headers, added missing include (#3334)
[ROCm/composable_kernel commit: d3f37ebf6c]
2025-12-02 04:28:46 -08:00
assistant-librarian[bot]
be7c12a132 Merge commit '5d67d82a0bb6dbf5f82f3b4ba2e9188eb838b927' into develop 2025-12-02 11:12:35 +00:00
jakpiase
9632da4f80 [CK_TILE] Fix for comp pipeline v4 (#3307)
* Fix for gemm_pipeline_ag_bg_cr_comp_v4

* Update hotloop condition

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* fix formatting

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 5d67d82a0b]
2025-12-02 11:38:06 +01:00
jakpiase
a587054099 [CK_TILE] Add indexing optimizations for conv bwd data (#3309)
* add indexing optimizations for conv bwd data

* fix formatting

[ROCm/composable_kernel commit: 59265d5eb2]
2025-12-02 11:37:26 +01:00
assistant-librarian[bot]
ae8f3a3b19 Merge commit 'f211156ce6e9a8411c9ab8c3647147b6a9cf78d8' into develop 2025-12-02 07:14:17 +00:00
Yi DING
07158d16ad [CK_Tile] Flatmm MX Cleanup & Explicit Offset Calculation (#3286)
[ROCm/composable_kernel commit: f211156ce6]
2025-12-02 14:21:12 +08:00
assistant-librarian[bot]
94dda8df22 Merge commit '46f1d740f03d11bc2a78fce60a95cd0933b9dd4d' into develop 2025-12-02 00:36:50 +00:00
Erwin Terpstra
328a733e0e Add grouped gemm instances for RDNA4 (#3237)
* wip: grouped_gemm implementation based on wmma kernel + example for fp16

* chore: clean up grouped_gem_wmma_splitk_fp16 example

* chore: add cmake options to fully disable XDL or WMMA kernels

* feat: add tests for grouped gemm wmma instances for f16 and bf16 (all layouts)

* chore: add grouped gemm wmma bf16 example

* refactor: reuse more code between instance factory functions

* chore: turn test failure if not all batch sizes are supported into a warning

* chore: made failing of test on unsupported instances conditional to not break old tests

* chore: add log message to failure case where AK1/BK1/KBatch is too high for K value

* fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run()

* fix: stray comma after parameter list

* fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being run

* chore: update copyright in header comments

* nit: minor feedback

* refactor: unified XDL / WMMA tests

* fix: properly disable FP8 instances when ONLY targeting gfx11

* refactor: add v3 suffix to grouped_gemm device struct name

* fix: small typos in example code

* fix: fully exclude xdl/wmma instances when using the corresponding cmake flags

* chore: remove unused destructor and added pipeline support checks to remove unnecessary paths

* fix: make sure to not add instance library to group if library was skipped

* fix: make sure xdl grouped gemm doesnt fail the new test

* fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case

* fix: examples not working since dependent types and functions were moved to ck namespace in develop

* fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances

* chore: replace/add copyright headers with new format

[ROCm/composable_kernel commit: 46f1d740f0]
2025-12-01 15:32:10 -08:00
assistant-librarian[bot]
1b8a648333 Merge commit '23fb253c4e5ed6ef1a9b69feda5e037d08325bc6' into develop 2025-12-01 23:13:36 +00:00
Cong Ma
a6ec08a1d2 Make CK TILE GEMM Aquant support block tile 128x128x128 (#3325)
* [CK TILE GEMM Quant] Rename GemmConfigBQuantPrefill to GemmConfigQuantPrefill in examples

* [CK TILE GEMM Quant] update tile distribution of aquant

* [CK TILE GEMM Quant] update aquant register offset calculation

* [CK TILE GEMM Quant] Reimplement aquant register offset calculation

* [CK TILE GEMM Quant] Add more unit tests of Aquant

- Test M128xN128xK128

* [CK TILE GEMM Quant] Add more comments to Gemm Aquant

[ROCm/composable_kernel commit: 23fb253c4e]
2025-12-01 15:04:37 -08:00
assistant-librarian[bot]
45c3d34009 Merge commit '7873f8fa13ce42d7ef570f7ae99f76f68f463109' into develop 2025-12-01 21:12:49 +00:00
John Shumway
fc586d2de6 [CK_BUILDER] Update the testing documentation (#3312)
* [CKBuilder] Update the testing documentation

Now that we have clear sets for smoke tests and regression tests, we rearrange the CMakeLists.txt file so that it is organized and has descriptive and instructional comments.

Move all the test targets that compile quickly into the smoke test suite.

Update the builder README.md to reflect this new test organization and functionality.

* Update experimental/builder/README.md

Clarify integration tests description from review comment.

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Correct README.md

The regression tests here still run very fast like smoke tests, but can take minutes or even tens of minutes to compile.  We test most of the builder functionality without compiling heavily-templated kernel code, but these regression tests do an expensive full build of the CK kernels.

---------

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: 7873f8fa13]
2025-12-01 13:05:32 -08:00
John Shumway
6f2f67b0b6 [CK_BUILDER] Fix cosmetic problem with conv_description (#3333)
The ConvDescription::detailed command wasn't using TreeFormatter::writeLast correctly, which led to extra lines being drawn in the tree view. It's a simple fix, just a cosmetic improvement to the reflection output (ASCII art).
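
To show why the last child needs special handling in an ASCII tree, here is a minimal sketch of a tree renderer. It is not the TreeFormatter from the builder: the Node type, function names, and connector characters are assumptions for the example. The last child gets a closing connector and its descendants get a blank prefix, so no stray vertical line is drawn below it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tree node for the illustration.
struct Node {
    std::string label;
    std::vector<Node> children;
};

// Appends one node and its subtree to out.
inline void render_node(const Node& node, const std::string& prefix,
                        bool is_last, std::string& out)
{
    out += prefix + (is_last ? "`-- " : "|-- ") + node.label + "\n";
    // Below the last child nothing continues, so the prefix is blank;
    // otherwise a vertical bar connects to the next sibling.
    const std::string child_prefix = prefix + (is_last ? "    " : "|   ");
    for (std::size_t i = 0; i < node.children.size(); ++i) {
        render_node(node.children[i], child_prefix,
                    i + 1 == node.children.size(), out);
    }
}

inline std::string render_tree(const Node& root)
{
    std::string out = root.label + "\n";
    for (std::size_t i = 0; i < root.children.size(); ++i) {
        render_node(root.children[i], "", i + 1 == root.children.size(), out);
    }
    return out;
}
```

Treating a last child as a middle child is the kind of mistake that produces the extra lines mentioned above, since a vertical bar would continue past the final entry.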

[ROCm/composable_kernel commit: d17994f3df]
2025-12-01 12:45:04 -08:00
assistant-librarian[bot]
f38ca54019 Merge commit 'abd6a4b3fc535772ecff047b02f2af666987f859' into develop 2025-12-01 09:16:45 +00:00
John Shumway
f645d827c8 Cleanup convolution description (#3329)
Remove obsolete feature for extracting a description from a builder, since this should apply directly to the instance type. Also add some documentation, including a README.md for reflection.

[ROCm/composable_kernel commit: abd6a4b3fc]
2025-12-01 10:03:58 +01:00
assistant-librarian[bot]
572df7d4d1 Merge commit '9ed9539ddfcdd8de4180fb992b718b57e1cadfae' into develop 2025-12-01 07:15:08 +00:00
Yi DING
2688602697 [CK_TILE] Disable cast_tile_pk_fp16bf16_fp32 as It Causes Extra spills on Recent Compilers (#3327)
[ROCm/composable_kernel commit: 9ed9539ddf]
2025-12-01 14:48:22 +08:00
assistant-librarian[bot]
0dff04aa27 Merge commit 'ba6af9fe7c6689075b46052cc40b7f94d96f647f' into develop 2025-12-01 06:17:27 +00:00
Gino Lu
0551d4412e [CK_TILE] Add unit test for fp4 warp gemm (#2817)
This update includes a unit test for warp GEMM

[ROCm/composable_kernel commit: ba6af9fe7c]
2025-12-01 13:56:48 +08:00
assistant-librarian[bot]
4f8c179bfd Merge commit '004784ef98beffb24a03d106b143ee9f8e03e826' into develop 2025-11-28 22:12:10 +00:00
Aviral Goel
0861395425 chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template

* Fix build

---------

Co-authored-by: Sami Remes <samremes@amd.com>

[ROCm/composable_kernel commit: 004784ef98]
2025-11-28 13:49:54 -08:00
assistant-librarian[bot]
74d3173d15 Merge commit 'f981554c39eafbf993e05c832cb86b3aaf474571' into develop 2025-11-28 13:21:12 +00:00
Sami Remes
1c73a3d480 [CK_TILE] Fix Quant GEMM build (#3320)
* Fix build

* Fix ck_tile example 38 & 40

---------

Co-authored-by: Yi DING <yi.ding@amd.com>

[ROCm/composable_kernel commit: f981554c39]
2025-11-28 20:33:53 +08:00
assistant-librarian[bot]
296bf24afd Merge commit 'f875ab0bbc6ea68a689a688a58f9a53ad12fd536' into develop 2025-11-28 09:13:31 +00:00
msaffari-amd
1055485a38 Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic-add (#3236)
* Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic

* correct clang-format

* removed unused rtol_atol variable from example code

* clang format correction

* remove unused variable max_accumulated_value from example

[ROCm/composable_kernel commit: f875ab0bbc]
2025-11-28 09:43:01 +01:00