Commit Graph

317 Commits

Author SHA1 Message Date
Yi DING
ef32abc4cc [CK_TILE] FMHA BWD Optimization For GFX950 (#2628)
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window

* simply duplicate

* trload pipeline

* Try two-stage

* add prefetch

* optimize & iglp

[ROCm/composable_kernel commit: 4fde1646e5]
2025-08-12 11:11:55 +08:00
Yashvardhan Agarwal
78662c8ad5 Fixes to "General 2D Reduction Kernel" (#2535) (#2656)
* fix reduce2d

- revret the combine_partial_results() chnages
- remove auto from function def

* clang-format

[ROCm/composable_kernel commit: 191c62967b]
2025-08-11 15:01:33 +02:00
Max Podkorytov
c8ff442803 [CK-tile] add more tests for batched transpose testing the rectangular block tile sizes (#2634)
* add failing tests

* swap out and reference

* add constraint assert to transpose input distribution

* test both pipelines with rectangular block tile

* print mismatched indices

* add a smaller failing test for old pipeline

* print grid and block

* fill output before operating on it

* swap m/n tile sizes and make one test pass

* add device syncs

* add one more flipped test case

* flip block tile at host arg init

* fix tiles for lds pipeline

* clang-format

* rename tests

* roll back error check

* remove device syncs

* reduce large test case's size

[ROCm/composable_kernel commit: ab26026835]
2025-08-07 16:51:53 -07:00
Gino Lu
f21e432ea1 Add e8m0 scaled convert into CK_TILE (#2617)
* first commit

* remove redundent code

* modify according to comments.

* fix type_convert error with scaled_type_convert

[ROCm/composable_kernel commit: 5d6d236b25]
2025-08-07 21:37:28 +08:00
Yi DING
84a17e2f94 [CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550)
* Remove unnecessary pssk

* Add BlockFmhaBwdDQDKDVPipeline wrapper

* Resolve copilot comments & Remove kpad & fix

* Remove spad

[ROCm/composable_kernel commit: b0a97498b0]
2025-08-07 21:24:43 +08:00
Sami Remes
6474014ee7 [CK_TILE] Enable printing more structures in CK-Tile (#2443)
* Add more printing to core cktile

* Revert other changes in static encoding pattern

* Refactor to using a free print() function

* Remove loops and print just the containers

* Print tuple with better formatting, fix sequence compilation

* Add some tests for print utility

* Add print utility header

* Print for static_encoding_pattern

* add buffer_view printing

* Align vector_traits

* Fix formatting

* Lower-case enum strings

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

* Remove empty comment lines

* Fix test with lower-case too

* Reduce repeated code in print tests, move helper function closer to type definition, test X&Y

* Add test_print_common.hpp

* add print.hpp in core.hpp

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: ffdee5e774]
2025-08-07 15:45:27 +03:00
Yashvardhan Agarwal
b6f0e98da6 General 2D Reduction Kernel (#2535)
* General 2D Reduction Kernel

* Move the reduction kernel from the example
* Split the code and add the necessary policy, problem, shape files as
per ck_tile convention
* Add/modify the headers
* Modified the example to work with the 'new' kernel
* Added tests for the kernel
* N-D refernce reduce
* Added support for N-D input with transform to 2D
* Added padding to support various input sized tensors
* Bug fix in the thread buffer constructor
* Some comments to explain the reduce2d block kernel

* comments resolution

* clang-format

* comments resolution

* clang-format

* clang-format

* comments resolution

* clang-format

[ROCm/composable_kernel commit: 4750b293fe]
2025-08-06 15:36:59 +02:00
Adam Osewski
b7659e284a Remove unused lds direct load instruction. (#2573)
This functionality is replaced by amd_async_buffer_load

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: 2622ff06cb]
2025-08-06 15:16:12 +02:00
Thomas Ning
7c9d0d0435 Persistent grouped gemm CompV4 Enablement & Polish (#2605)
* enable the persistent kernel for CompV4

* polish the example and clang format

* fix the non-persistent kernel error

---------

Co-authored-by: ThomasNing <thomasning@amd.com>

[ROCm/composable_kernel commit: cbfecf8d7a]
2025-08-04 23:43:01 -07:00
Jinchao Xu
d2525548c6 Add -gsplit-dwarf flag to reduce debug section size and fix ckProfiler link errors (#2611)
Resolves R_X86_64_32 relocation out of range errors in grouped conv2d instances
by splitting debug information into separate .dwo files.

Add explicit cast to avoid signed/unsigned comparison warning.

[ROCm/composable_kernel commit: 15eb493152]
2025-08-04 11:26:08 -07:00
Illia Silin
6fc0a709dc update the switch condition for buffer built-ins (#2602)
[ROCm/composable_kernel commit: 788e8a878e]
2025-08-01 14:30:07 -07:00
Thomas Ning
9325ae50e8 Fix the GFX 950 Universal GEMM (#2597)
* solve the gfx950 error

* clang format

* fix a typo error

---------

Co-authored-by: ThomasNing <thomasning@amd.com>

[ROCm/composable_kernel commit: 7c44a763fa]
2025-08-01 09:32:24 -07:00
Aviral Goel
4f9a78fbf0 Integration of a new pipeline for weight preshuffle into gemm examples (#2516)
* something khushbu can help with

* v1 v2 works with flatmm develop

* v0 v1 v2 numerical error gone

* Fixing numerical error, and interchange preshuffle configs to match with flatmm

* Refactor GEMM pipeline configurations and integrate preshuffle support

- Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3).
- Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations.
- Removed obsolete code and comments

* clang format

* fix vectorloadsize bug

* add the Preshuffle3

* update kwarp calculation in gemm utils

* update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch

* fix: add CK_GFX950_SUPPORT macro for gfx950 detection

* default disable rotating buffer

* docs(CHANGELOG): update changelog for rocm 7.0

* Revert "docs(CHANGELOG): update changelog for rocm 7.0"

This reverts commit 2bc16fff84.

* Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files.

* revert example/ck_tile/flatmm to its original state

* remove comment added by second author

* switch to xor ALDSDescriptor

* modify the MakeALdsDescriptor()

* temporary profiling script

* getting rid of line marker compiler error

* UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy

* add a minor fix for the config

* typo fix

* Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2

* revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp

* revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp

* reenable the GemmSpatiallyLocalTilePartitioner

* make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline

* remove hardcoded true for preshuffle bool template argument

* rename script

* remove gemm_profilie.sh script

* merge conflict resolve

* clang formatted

* typo fix

* Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp

* Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy

* Fix missing newline at end of file in run_gemm_example.inc

* Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1

* addressing review comments

* removing debug code

* addressing review comments

* Revert "addressing review comments"

This reverts commit 29c45192ba.

* updating tile_engine code

* addressing review comments

---------

Co-authored-by: amd-khushbu <khuagarw@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 1441a0a7ee]
2025-08-01 00:04:54 -07:00
Khushbu Agarwal
464c6f459e [CK_Tile] Updating gpu timer when doing flush cache (#2593)
* Missed updating function names in example

* updating timer

* code cleanup

* addressing review comments

* updating tile_engine code

* addressing review comments

[ROCm/composable_kernel commit: 88d72178d6]
2025-07-31 16:43:33 -07:00
Anton Gorenko
2ef590ab43 [CK_TILE] Fix UB and corner cases in f32/f16 to/from f8 conversion (#2571)
* Add tests for host convesion f32/f16 to f8

* Add tests for host convesion from f8 to f32/f16

* Fix UB and corner cases in f32/f16 to/from f8 conversion

* There are UBs when very small values are converted to f8: bitshifts
  can be larger that type width. Using unsigned long long does not help
  because exponent_diff >= 64 in such cases. This causes that values
  like 2.117582368e-22 are converted to non-zero f8 in host validation
  of FMHA tests, test_f8 crashes with segfault in completely irrelevant
  code like GTest internals or produces non-deterministic results etc.
* Fix FNUZ conversion to return NaN for NaN inputs.
* Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16
  conversion.

* Replace some magic numbers with values from numeric_traits

* Build tests only on devices supporting the type

[ROCm/composable_kernel commit: 7b074249f4]
2025-07-31 09:54:17 +05:00
Gino Lu
80a9d720ca add constexpr to pk_fp4::pack/unpack() (#2586)
[ROCm/composable_kernel commit: b25d512e8a]
2025-07-30 10:29:04 -04:00
Khushbu Agarwal
7af5e043b1 Update to gpu_timer for rotating_buffer (#2524)
* update gpu_timer for rotating buffer as hipblasLt's implementation

* timing fix

* Updating gpu timer for old ck as well

* Revert "Updating gpu timer for old ck as well"

This reverts commit 958cd1bc99.

* code clean up with runtime argument; function rename

* code cleanup

* general timer fixes

* bug fix

* clang formatted

* addressing reveiew comments

* clang formatted

* Addressing review comments

* CI fix

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 61e21f5567]
2025-07-29 15:21:05 -07:00
Thomas Ning
69b117ae73 Expand the bandwidth of direct_global_to_lds for gfx950 (#2576)
* Expand the bandwidth of direct_global_to_lds for gfx950

* clang-format

* fix the remod.py and script for clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: 9d4b494f07]
2025-07-28 23:56:53 -07:00
Illia Silin
6704e5303f fix the clang-format (#2578)
[ROCm/composable_kernel commit: 49723e94bb]
2025-07-28 20:49:55 -07:00
Yi DING
a592107cb9 [CK_TILE] FMHA bwd Support hdim as a Multiple of 32 (#2130)
* Fix shuffle_tile

* Add fmha bwd d160

* CHANGELOG

* Use static_cast

* Update

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 1926cd0cb8]
2025-07-29 09:31:14 +08:00
Illia Silin
3345f5f417 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config

[ROCm/composable_kernel commit: 504b101da3]
2025-07-28 11:34:07 -07:00
shay-li77
d143f1466f fix mha bwd dbias random mismatch (#2570)
* fix mha bwd dbias random mismatch

* formatting code

[ROCm/composable_kernel commit: 8ae528a1b4]
2025-07-28 14:39:31 +08:00
Gheorghe-Teodor Bercea
457e3c8b08 Refactor async loads to work on all GPUs (#2545)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: cbfa62e4b6]
2025-07-26 22:04:59 -07:00
Max Podkorytov
98bad686a9 [CK-Tile] Merge transpose examples (#2450)
* unify pipeline signature with existing example

* iwyu

* move stuff around in load-tile-transpose

* cleanups in batched transpose pipeline

* comments

* use same inputs size

* cleaner printf

* print host args

* use 64 block sides in the 37_transpose example

* roll back grid dimension size adjustment for 37_transpose example

* transpose grid for 37_transpose to unify with 35_batched_transpose

* unify grid computation logic

* make policy methods device only (since they are used only on device from the pipeline)

* more host/device attribute cleanups

* copy over problem

* move over pipeline and policy

* add switch to batched transpose api

* make the lds problem more similar to original problem

* factor out logic into traits

* factor out conditional compilation into trait parameter

* propagate pipeline to args

* unhardcode pipeline dispatch parameter

* refactor vector size

* put warp tile out of dispatch

* rename template parameter for trait

* rewrite vector size in terms of problem

* mark policy-internal struct variable as device

* factor out input distribution and thread access pattern from policies

* reword vector size

* use datatype across batched transpose pipelines, problems and kernel

* remove transpose traits from lds pipeline

* add padding to the lds pipeline *interface*

* add comment

* remove ck_tile example #37

* update cmakelists

* add test for new pipeline

* update batched transpose test

* roll back load_tile_transpose changes

* remove comments

* pack dispatch parameters into a config

* padM can be enabled

* adjust lds vector size to enable padding along N

* update test

* clean up logic

* swap m/n input vector size

* adjust perf test script

* sweep over C/W in perf test

* count both read and written bytes into bandwidth (x2 the number)

* clang-format

* widen size range for perf test

* remove 64k x 64k case; it's too large for index

* remove thread tile from dispatch

* Solve merge conflict

* fix compile

* modify the transpose

* solve the test error and clang format

* Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)

* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* :Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Fixing 0ms and inf GB/s issue in img2col (#2565)

issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```

solution :
======
Problem occured because config.time_kernel is false by default.
if false, then no need to calculate perf, just print proper message

`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`

* merge with develop

* solve clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>

[ROCm/composable_kernel commit: 821cd26c13]
2025-07-26 21:51:54 -07:00
liang
d055927b83 reorder grid dim schedule (#2533)
Co-authored-by: smallmou <liangshenghao.lsh@alibaba-inc.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: d2459878cf]
2025-07-26 02:46:55 +08:00
Mateusz Ozga
c3568357ca [CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520)
* Init commit new API

* apply clang-format

* PreShuffle preapring

* Apply Preshuffle condition to universal_gemm

* Fix: convert size_t to index_t

* Review changes

* Mode 100755 -> 100644

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: b507d889c1]
2025-07-24 20:39:56 +02:00
Yashvardhan Agarwal
094e5bad50 [CK_TILE] Support for elementwise kernel (#2246)
* Elementwise kernel implementation

Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: yashagar <yashagar@amd.com>

* Elementwise with generalized nDims

* Adding the n-ary input tensor feature

* Generalize dimensions on top of inputs

* Add TFLOPS + remove std usage for tuples

* 1D basecase optimization

* Cleanup code + refactoring to a common interface

* Generalize to unary and add an example

* Cleanup, refactoring and commenting

* Suggestions for LWPCK-3170: elementwise kernel improvements

* Clang-format: remod.py

* Replace InputTensorType with XDataType as the type of input_tensors

* Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs

* Move examples to folder 19_elementwise

* Add missing copyright headers and fix some existing ones

* Replace an assert with throw std::runtime_error in elementwise example

* Avoid reading the output by using make_static_distributed_tensor for y_tile

* Removed two unused includes

* No need to move windows to the next block when each workgroup processes a single tile

* Only copy input tensors to the device

* Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example

* Adding output strides to the kernel, transposition example and update the other examples

* Changes made by remod.py

* Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp

* Reuse generic reference binary/unary operation in examples + refactoring the transpose reference

* Fix comments in elementwise_example.cpp

- Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses
- ElementWiseTraits was renamed to ElementWiseShape
- Adopt suggestions made by Copilot when prompted to check for factual or typographical errors

* Simplify CMakeLists.txt and remove the unused variables this uncovers

* Rename a file and fix some copyright statements

* Changes made by script/clang-format-overwrite.sh

* Add basic unit test for ElementWiseKernel

* Remove left-over uninformative comment in apply unit test

* Changes made by clang-format-overwrite.sh

* fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp

* Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array

* Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally

* Move examples to folder 20_elementwise

* Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able benchmark against old CK

* Fix CLang formating

* Bump up the elementwise example folder number

* Elementwise: add padding + minor cleanup

* Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view

* Add isSupportedArg to Elementwise kernel + addapt example and unit tests

* Fix clang-format on the unit test file

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: 606b0cc947]
2025-07-24 11:21:45 +02:00
jakpiase
bdb86fee78 [CK_TILE] Grouped Convolution Backward Weight Kernel (#2357)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* resolved conflicts

* rebase fixes

* fixes

* tmp

* add working support for splitk

* minor fix

* fixes

* fixes

* minor fix

* small fix

* Split K and preprocessing fixes

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 6681593864]
2025-07-24 10:41:35 +02:00
Haocong WANG
4ed2dda658 fix async copytest bug (#2509)
* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* change bit width check

* refactor macros into constexpr functions

which still get inlined

* wrap s_waitcnt api

* parameterize test

* cleanup

* cleanup fp8 stub

* add fp8 test cases; todo which input parameters are valid?

* replace n for fp8 in test cases

* add large shapes; fp8 fails again

* change input init

* test sync/async

* time the test

* clang-format test

* use float instead of bfloat to cover a 4-byte type

* fix logic - arg sections should be 'or'd

* make block_sync_lds_direct_load interface similar to old ck

* fix a few comment typos

* name common shapes

* revert the example to original logic of not waiting lds

* clang-format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: a5fdc663c8]
2025-07-23 00:14:02 -07:00
Cong Ma
baf244000e ck_tile kernel for gemm with groupwise quantized A tensor (#2473)
* ck_tile kernel for gemm with groupwise quantized A or B tensor.

This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.

Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS.

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

1. fp8, fp8 -> f32
2. bf8, bf8 -> f32
3. i4, fp8 -> f32
4. i4, bf8 -> f32

Group size can go down to as low as K length of underlying WarpGemm primitive.

For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers.

* [Block Scale Gemm] Only run gemm quant examples on __gfx94__

- Only run gemm quant examples on __gfx94__ for usage of
  `v_cvt_pk_fp8_f32`
- Format the code

* [Block Scale Gemm] Remove Bquant Gemm BlockScale

This cleanup is in preparation for future development of bquant. By
isolating Aquant-related code, we can streamline the codebase and make
it easier to add and maintain bquant functionality in subsequent
updates.

* [Block Scale Gemm] Format code with clang-format-12

The latest clang-format (v19) in ROCm 7.0 generate different result than
clang-format-12 which is used in CK CI.

Format code with clang-format-12 for consistency.

* [Block Scale Gemm] Split the k direction loop

- Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp
   to make the logic clearer.
- Disable C transposition.

* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm

* [Block Scale Gemm] Update copyright

* test

* Add TailHandler

* Move TileDistributionEncodingPatternAQ

* Refactor

* refactor

* fix bug

* fix bug

* help solve the PR comment

* Format the code

* [Block Scale Gemm] Add unit tests

* [Block Scale Gemm] Add support to 16x16x32 MFMA

- Add support to 16x16x32 MFMA
- Fix a bug when exchange data crossing lanes

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com>
Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: e62710e461]
2025-07-23 00:10:16 -07:00
Yi DING
bb1a05cc87 [CK_TILE] Fix tile_example_moe_sorting broke in #2436 (#2525)
[ROCm/composable_kernel commit: f0a8c18017]
2025-07-17 22:50:58 -07:00
Mateusz Ozga
96e2d5f86a Fix CI clang-format (#2521)
[ROCm/composable_kernel commit: 7fc000d7b3]
2025-07-17 14:41:29 +02:00
Haocong WANG
894750fd59 fix mfma32x32 dispatch (#2490)
[ROCm/composable_kernel commit: 28072adc3a]
2025-07-17 15:24:12 +08:00
Yi DING
cee5776046 [CK_TILE] Use read_tr in universal gemm (#2436)
* Use read_tr in universal gemm

* Enable all instances back

* Revert example37 changes

* Resolve comments

* resolve comments 2

* Fix assertion msg

* fix the gemm basic

* change index_t to bool for preshuffle variable

* Solve the comment

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>

[ROCm/composable_kernel commit: f1d8ad2818]
2025-07-16 23:56:22 -07:00
Po Yen Chen
e75bbb6a13 Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389)" (#2514)
This reverts commit c5014a86765c598215bf2b58f89c3a7c70f92bac.

[ROCm/composable_kernel commit: 722c22fb15]
2025-07-17 10:09:01 +08:00
linqunAMD
348dec0d0c Fix build errors on windows (#2456)
* Fix build errors on windows

* correct clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>

[ROCm/composable_kernel commit: 6e76b82059]
2025-07-16 07:58:23 -07:00
MHYangAMD
ff8d3d0d13 [CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409)
* Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass

* Update rmsnorm2d_fwd_pipeline_model_sensitive_pass

1.  Add BlockReduce2dTreeCrossWarpSync

* Add Rmsnorm2dFusedModelSensitiveEnum

* Update patch

1. Reverse generate.py
2. Remove comment in generate.py
3. Update tree cross warp reduce

* Refactor RMSNorm model enum and introduce T5-like option

* Update the n stage for cross warp reduce

* Add new cmdline option in RMSNorm for new pipeline testing

---------

Co-authored-by: Clement Lin <clement.lin@amd.com>
Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>

[ROCm/composable_kernel commit: 3499fe67ff]
2025-07-16 14:05:26 +08:00
carlushuang
34ca5f6a68 [CK_TILE] moe sorting optimize local_token (#2469)
* fix bug in loops that need use local tokens to compute

* support extra chain local_token

* update

* update

* refine some main

* update

* support dispatch_policy

* fix 15 example

[ROCm/composable_kernel commit: cfe211cc60]
2025-07-15 09:42:18 +08:00
Gino Lu
b5517fb522 [CK_TILE] Add pk_fp4 data type (#2422)
* [draft] Add pk_fp4 and test

* Add hw conversion for fp4

* Refine test code and pk_fp4 constructor.

* fix test indent

* modify according to comment.

* fix clang-format

* modify according comments.

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 141bf2d54d]
2025-07-14 20:35:06 +08:00
Khushbu Agarwal
f3120e7526 Merge flatmm Operator with universal gemm (#2434)
* Initial commit

* Adding new tile partitioner to flatmm

* intermediate changes

* debugging kernels

* Updating flatmm example to universal gemm example

* updated flatmm kernel to run via gemmKernel

* update universal gemm to incorporate flatmm

* debug

* Fix flatmm call

* Fixing other kernels and tests for API changes

* clang formatted

* fixing gemm tests

* added test for flatmm and simplify kernel arguments

* adding flatmm test

* fix test for flatmm

* simplify gemm kernel with flatmm

* remove flatmm related files

* addressing review comments and code clean up

* resolving empty file

* resolving empty file

* clang formatted

* addressing review comments

* enable persistent kernel for flatmm

* reverted the removed files for flatmm

* reverted the removed files for flatmm

* changed flatmm to weightPReshuffle; removed the _1 added in teh faltmm example

* some more renames

* clang formatted

[ROCm/composable_kernel commit: d239b91fd5]
2025-07-11 08:27:55 -07:00
Qianfeng
337126469c Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487)
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline

* i_nhead_ conversion type to prevent overflow

---------

Co-authored-by: ltqin <letaoqin@amd.com>

[ROCm/composable_kernel commit: 45904b8fd7]
2025-07-11 18:14:47 +08:00
shay-li77
4f08a02dae support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338)
* mask support ratio for y axis

* format code

* add notes for param y_ratio

* fix comments error

* support template and mdiv for ratio mask

* refactor y-ratio mask constructor

* optimize coordinate calculation

* add SimplifiedRatioAttentionMask

[ROCm/composable_kernel commit: d814fefe18]
2025-07-09 23:18:55 +08:00
Yi DING
9f5cf4f49d [CK_TILE] Avoid compile kernel in host pass (#2475)
[ROCm/composable_kernel commit: 032ca60015]
2025-07-09 22:27:54 +08:00
Haocong WANG
7c04d93083 [CK TILE] Fix FA build filter (#2369)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* cmake depends & bwd filter order fix

* revert unexpected reformat

* Avoid change fmha bwd filter order for downstream compatibility

* Revert unexpected changes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>

[ROCm/composable_kernel commit: 5557eadce6]
2025-07-08 10:42:07 +08:00
Illia Silin
99cf9b9cae fix compilation errors with clang20 (#2464)
[ROCm/composable_kernel commit: e033a1b4bf]
2025-07-07 19:40:30 -07:00
Po Yen Chen
a71dc1245f Eliminate warning caused by failed to meet occupancy requirement (#2389)
Co-authored-by: felix <felix.li@amd.com>

[ROCm/composable_kernel commit: b2dea90116]
2025-07-08 09:17:25 +08:00
Thomas Ning
1129c9dc4e Enable Async Copy for MI355 (#2425)
* add for async load builtin

* add async load api

* fix some compiling errors

* fix a compiling error

* fix some compiling errors

* add a pipeline which copies from v4

* add a new pipeline for async load

* fix some compiling errors

* add async load tests

* fix some issues in async load

* fix

* fix async inline assembly

* fix async inline assembly

* add ignore header file

* comment some not gfx950 codes

* comment some not gfx950 codes

* fix a error

* update async load apis

* fix lds descriptor

* fix a compiling error

* fix some compiling errors

* fix a descriptor issue

* update lds descriptor

* change async pipeline's tile distribution pattern from thread to warp

* fix clang format

* update async policy

* fix a CRTP issue

* fix a typo error

* change lds layout

* fix some sync issues

* improve codes

* delete the async test

* fix a commented format issue

* avoid compiling device functions when compile host

* make gemm run

* add the copy kernel support

* finish the feature

* Address comment

* add the support for buffer_builtin

* solved the merging problem

* Comment Addressed

---------

Co-authored-by: joye <joye@amd.com>
Co-authored-by: joyeamd <John.Ye@amd.com>

[ROCm/composable_kernel commit: f240ae3248]
2025-07-07 10:08:49 -07:00
ltqin
ba133fe9b7 ck tile pagedkv prefill (#2405)
* add prefetching physical block id for pagedkv

* start add pagedkv prefill

* rename pipeline

* add kernel for pagedkv

* add an init version pagedkv prefill

* fix redefine issue

* add struct BlockFmhaFwdPagedKVPipelineProblem and fmha_fwd_pagedkv_args

* generate dispatch code

* add body generating code

* comipling pass

* remove dropout from pagedkv

* set lse to false in generating code

* start changing qr kernel to pagedkv

* init version of  kernerl with pagedkv

* change names of file that are generated

* chang host validation for pagedkv prefill

* using iglp to change blockgemm

* add kernel files to op head file

* show parameters

* rewrite print parameter fun

* add fwd

* remove default parameter of GridSize

* format

* fix nhead issue and add seqlen_k_ptr to batch mode

* format code

* remove no-longer used code

* format

* fix some comments

---------

Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 9f4c5d7372]
2025-07-07 16:16:54 +08:00
carlushuang
8e15d99ddc default skip y point to r (#2457)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 0aecb5ab68]
2025-07-06 23:54:34 -07:00
carlushuang
4ed061c05d [CK_TILE][CORE] enhance slice_tile api (#2430)
* support slice cross p

* fix some bug in y_len

* more case

* fix a bug when R exist

* support -1 to hint end of current length

* format

* change commit

[ROCm/composable_kernel commit: a8742f7e31]
2025-07-06 20:13:12 -07:00