Commit Graph

394 Commits

Author SHA1 Message Date
Emily Martins
610284f16a [CK_TILE] Stream-K GEMM Implementation (#2781)
* Change splitk_batch_offset parameter to k_size in UniversalGemmKernel::MakeGemmTensorViews function

Prior to this change, the splitk_batch_offset parameter of
MakeGemmTensorViews had type SplitKBatchOffset. But, the only member
variable of the SplitKBatchOffset class used in the MakeGemmTensorViews
function was splitted_k (an int32_t). The splitted_k value was used as
part of defining the dimensions of the tensor view. That said, for
Stream K, we do not need to use the SplitKBatchOffset class since we are
not using Split K. Thus, this commit changes the splitk_batch_offset
parameter to a int32_t called k_size. This will avoid the constraint of
requiring a caller of MakeGemmTensorViews to use the SplitKBatchOffset
class while still providing the same functionality. Calls to
UniversalGemmKernel::MakeGemmTensorViews have been updated accordingly.

* StreamK Kernel RunGemm Implementation

Stream K cannot simply use UniversalGemmKernel's RunGemm for the
following reasons:

1. The UniversalGemmKernel::RunGemm function computes num_loop based on
   a static function of the TilePartitioner. That said, for Stream K,
num_loop must be computed using a member function (namely
GetCurrentIterLength from PR #2708).
2. The UniversalGemmKernel::RunGemm function requires the use of a
   SplitKBatchOffset object which is not used for Stream K since we are
not using Split K.

Thus, this change adds a RunGemm function in the StreamKKernel class.

* initial implementation for operator() for StreamKKernel: adding stream-k algorithm and calls to RunGemm

* Fix indexing and offset issues for StreamK

These changes do the following:
- Ensure offsets along the M and N dimensions are multiplied by
  MPerblock or NPerBlock, respectively. This ensures tile window origins
are at the correct locations.
- Fix bug in the tile partitioner's GetTileIdxWithOffset. Now, we apply
  divmod to the given references to ensure correct values are available
to the caller.
- Added documentation in the Stream-K operator()

* Initial gtests for Stream-K

These changes add an initial gtest suite for the CK Tile Stream-K
kernel. Currently, due to bugs in the StreamKTilePartitioner (which will
be handled in a future PR), there are validation issues for certain
cases which may differ on different architectures. Thus, we opted to run
cases that are only fully data-parallel (skipping others). A guard was
added to Stream-K's IsSupportedArgument method to ensure that callers
are aware of this constraint. Additionally, to ensure testing
reproducibility, options for setting the number of CUs and occupancy
were added to MakeKernelArgs.

* Use GemmPipeline operator() variant that takes hot loop and tail num

In Stream-K, the num_loop value varies per WG and per iteration of a
Stream-K loop. So instead, we use the version of the GemmPipeline's
operator() function that takes in has_hot_loop and tail_num. This is
similar to what is done in Grouped GEMM.

* changes from review: comments, move readfirstlane, remove ifndef

* Switch direction of C tensor traversal & add padding guard

Prior to this change, WGs travelled backwards through their assigned
macro tiles in the C tensor. For instance, if WG0 is responsible for C
tiles 0 and 1, it would first visit tile 1 then tile 0. This means that
the iter_end decrements in each iteration of the stream-K while loop.

Since we are working with unsigned integers, the subtraction operation
may not be safe. Thus, this change makes is such that WGs travel forward
so that their iter_start is incremented and their iter_end remains
fixed.

Additionally, we added a guard against WGs that are neither sk_blocks
nor dp_blocks to ensure such WGs do not participate in the GEMM.

Together, these changes make is such that the algorithm is correct when
sk_blocks is greater than zero.

* Disable StreamK_M256_N256_K256_SKBlocks12 test case

This instance involves >=3 WGs contributing to each macro tile in C. Due
to the use of atomics, this is resulting in precision errors. These
errors will not persist once the reduction strategy is implemented. We
will re-enable this test then.

---------

Co-authored-by: Astha Rai <astha.rai713@gmail.com>

[ROCm/composable_kernel commit: dee185d80c]
2025-09-16 16:21:47 -06:00
linqunAMD
2713d3df13 [CK_TILE][REGRESSION] Correct blockSize in Generic2dBlockShape (c254f… (#2837)
* [CK_TILE][REGRESSION] Correct blockSize in Generic2dBlockShape (797b107dd90ce )

WarpPerBlock_M * WarpPerBlock_N are not equal with ThreadPerBlock_M * ThreadPerBlock_N /warpSize. we should calculate BlockSize from WarpPerBlock_M * WarpPerBlock_N

To compatible with wave32, function GetBlockSize is added to calculate correct size in host side.

* fix blocksize for all kernel related with generic2dblockshap

* remove constexpr for blocks

[ROCm/composable_kernel commit: b7a806f244]
2025-09-16 08:47:55 -07:00
Cong Ma
f9f1e8e257 [CK TILE GEMM] Add support to convert i4 to OCP FP8/BF8 (#2853)
[ROCm/composable_kernel commit: 78a9823cb4]
2025-09-16 07:18:51 -07:00
JH-Leon-KIM-AMD
fb3b4011d7 [CK Tile] Grouped conv fwd splitn support (#2776)
## What's New
  Add Split-N support for grouped convolution forward to handle tensors >2GB by splitting the batch dimension.

  ## Bug Fix
  Fixed 32-bit integer overflow that caused crashes with 6+ splits:
  - Use `long_index_t` for batch offset calculations
  - Remove redundant GemmM initialization in constructors

  ## How It Works
  - Automatically splits batch dimension when tensor exceeds 2GB
  - Uses grid.z dimension for parallel processing of splits
  - Each split processes a subset of batches independently

  ## Testing
  Verified with tile_example_grouped_conv_fwd:
  - n=3000 (6 splits) ✓
  - n=3500 (7 splits) ✓
  - n=10480 (40 splits) ✓

[ROCm/composable_kernel commit: 804065a36b]
2025-09-16 16:56:11 +03:00
Haocong WANG
a905c2bb79 [CK_TILE] fix bug when iperm =0 in fmha fwd (#2820)
* fix bug when iperm =0 in fmha fwd

* Disable f8 fmha smoke test until fix pr merged

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 59cb906482]
2025-09-16 15:07:10 +08:00
Po Yen Chen
fccf1db11a [CK_TILE] FMHA FAv3 scheduling fine-tuning for performance (#2833)
* Re-mapping thread block indices for causal=True kernels

* Use more intuitive remap_opt value

* Fallback to origin remapping if seqlen_q >= 64K

* Use GenericAttentionMask to reduce mask computation

* Avoid unnecessary boundary check for IsMasking=false case

* Fix wrong kernel entry specifier

* Add s_nop to prevent delay wave0-3

* Refine scheduling

* Remove unnecessary sched_group_barrier()

* Move sched_group_barrier() call to scheduler

* Replace inline asm s_setprio with intrinsics

* Rephrase comments

* Expend some o_acc rescaling insts to avoid SIMD idle

* Fix block idx special mapping logic

* Tune block index mapping for causal=False cases

* Tune block index mapping for causal=True cases

* Fix wrong vmcnt()

* Remove parameter name

* Use boolean option for turn on/off causal mask

* Update benchmark_fwd_v3.sh option usages

* Add option if compiler support it

[ROCm/composable_kernel commit: 7fbc9d6c97]
2025-09-16 11:32:38 +08:00
linqunAMD
a9e6cb0ec0 Extend XDL kernel to Support RDNA3/4 - Part 5 (#2725)
* Enable xdl in gfx11 & gfx12

* update cmake file

* fix all instance build (cmake)

* fix batched_gemm_gemm(cmake)

* rebase cmake files

* fix cmake build error

* remve CK_ENABLE_DYNAMIC_WARP_SIZE

* update cmake build error2

* fix gfx11 build

CK_USE_XDL is enabled on gfx11 and gfx12

* fix gfx10 build

* fix gfx11 error

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>

[ROCm/composable_kernel commit: f22740df82]
2025-09-15 10:59:25 -07:00
Cong Ma
6b1ed5eb4d [CK TILE GEMM] set correct value to TiledMMAPermuteN_ (#2839)
- TiledMMAPermuteN_ should be set to true when config if GemmConfigPreshufflePrefill

[ROCm/composable_kernel commit: e5d73da2da]
2025-09-13 20:54:08 -07:00
linqunAMD
da4657eaca [CK_TILE] Enable ck_tile tests on gfx11 and gfx12 (#2821)
* [CK_TILE] Enable ck_tile test on gfx11 & gfx12

* revert an unnecessary change

* enable pk_int4 on gfx11 & gfx12

* revert .pre-commit-config.yaml

[ROCm/composable_kernel commit: b0ee317d83]
2025-09-12 12:45:14 -07:00
Thomas Ning
82cd975698 Fix the vector load & fix the gfx950 compv4 error (#2831)
[ROCm/composable_kernel commit: 1894a0dbc3]
2025-09-12 11:48:45 -07:00
Aviral Goel
358f7ab285 fix(copyright header): add header to missing files (#2807)
[ROCm/composable_kernel commit: f3239395dc]
2025-09-11 12:27:08 -07:00
Cong Ma
3790ed04c6 [CK TILE GEMM] Fixed the regression issue with transpose C in Quant Gemm (#2819)
The numerical error was introduced after merging row/col quant. And it is fixed.

[ROCm/composable_kernel commit: 2ed39f8d91]
2025-09-11 11:38:16 -07:00
linqunAMD
a2bbb7bff0 [CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12 (#2808)
* [CK_TILE] Fix example batched_gemm, grouped_gemm, gemm_multi_d, convolution on gfx11 & gfx12

* fix gemm_splitk_two_stage

* revert .pre-commit-config.yaml

[ROCm/composable_kernel commit: 60d3e8f504]
2025-09-11 07:27:33 -07:00
linqunAMD
3e397114b7 [CK_TILE] fix example reduces, permute and elementwise on gfx11 & gfx12 (#2810)
1. Refine Reduce2dShape to support both wave32 and wave64
2. Fix example reduce, permute and elementwise on gfx11 and gfx12

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 0b9a638f26]
2025-09-11 12:41:20 +08:00
Khushbu Agarwal
65666b6cab [CK-Tile] Fix quant example code (#2813)
* initial commit

* remove extra files

* fixing errors

* updated ReadMe file for mapping of diff quants with diff configs

* addressing review comments

* addressing review comments

* Resolved merge conflicts

* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled

The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.

---------

Co-authored-by: Cong Ma <congma13@amd.com>

[ROCm/composable_kernel commit: 80a61afb9b]
2025-09-10 17:15:39 -07:00
Illia Silin
84ee8c0a7e Revert "add vector load 16/32 for bf16/fp16 (#2779)" (#2818)
This reverts commit e04d4c6fcb124b6c00af9f7378fb672e7f0ef87d.

[ROCm/composable_kernel commit: b4207c01c7]
2025-09-10 13:35:15 -07:00
zjing14
72dcde393a add vector load 16/32 for bf16/fp16 (#2779)
* add vector load 16/32 for bf16/fp16

* comment addressed

* clang format

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 7ecdba878f]
2025-09-09 23:15:19 -07:00
Anton Gorenko
2ec8ebff31 [CK_TILE] Add gtests for FMHA (#2744)
* Improve random number generation

* use different seed for each input (Q, K, V...);
* use deterministic generation of:
  * seqstart_q/k (for group mode);
  * block_table (for paged-kvcahe);
  * cache_batch_idx (for kvcache);

* Extract arg_parser-related code from run functions to use them as tests

* Split examples into main programs and fmha runners, build instances separately

* Add dummy tests that use instances and runners

* Fix a missed corner case of f32->f8 conversion

When value if < min f8 denormal but > min f8 denormal / 2, it must be
rounded to min f8 denormal (i.e. 0b1), not to 0.

* Fix incorrect fp8 scales for P and O in validation code

DataTypeConfig was incorrectly compared with fp8_t.

* Add host generation of dropout random values and use it for validation

Previously host validation (reference_batched_dropout) used random
numbers generated by BlockDropout of the kernel, meaning that incorrect
generation on device (bad distribution, repeated numbers, too many zeros,
etc.) would not trigger any validation errors.

* Implement tests from smoke_test_bwd.sh

* Return result as enum to distinguish failure and missing instance

* Add tests for bwd features: bias, alibi, dropout

* Implement tests from smoke_test_fwd.sh

* Pass seqlen_q/k as vectors to fwd and bwd runners

* Add tests for fwd features: bias, alibi, dropout

* Add tests for pagedkv and splitkv

* Fix conditions when to use splitkv and pagedkv kernels

splitkv was executed only when use_kvcache which == (need_append_kvcache || use_cache_batch_idx || 0 < page_block_size).
In the SplitKV tests: the regular fwd kernel was executed if use_cache_batch_idx was not requested even when num_splitkv > 1.
In the AppendKV tests: the pagedkv kernel was executed but it often failed to find an instance.

* Add tests for appendkv

* Use is_v_rowmajor = true because there are no instances with column layout anymore

* Split public and private compile options for instances

Tests and examples need to know only about CK_TILE_FMHA_FWD_*_API.

* Improve parsing validation in bias and mask

* Pass bias as string for consistency with mask

* Catch parsing and other exceptions

* Add bwd test for deterministic flag

* Initialize fp8 tensors (-init=ufq) similarly to uf

* Fix splitkv/pagedkv invocation: use padded sk when seqlen_k_ptr is not null

seqlen_k cannot be used to determine padding when seqlen_k_ptr is
provided. The actual seqlen_k is taken from seqlen_k_ptr[b].
Even seqlen_k values (% bn0 == 0) use padded seqlen_k while seqlen_k_ptr
may contain arbitrary values.
In the example or tests this produces incorrect results with appendkv
(for example, -d=32 -s=1 -s_k=64 -s_knew=7 -vlayout=c -b=8).

* Fix use_pagedkv value when kvcache = true but page_block_size = 0

In this case block_table_ptr is nullptr which is accessed in the kernel.

* Clean up bwd tests

* Unify fwd tests for f16/bf16 and fp8

* Use better explicit instantiation declaration for fmha_bwd<2>

* Use the same seed for all tests, allow to override it with env variable

* Undo clang-format of one irrelevant file

For some reason my local clang-format-18 and the one in CI work differently.

* Do not build instances and tests on unsupported archs

* Build instance libraries as OBJECT library

* CI: Enable sccache for HIP

There are source files with LANGUAGE HIP, they need
-DCMAKE_HIP_COMPILER_LAUNCHER=sccache

* Add tests to REGRESSION_TESTS

* Fix OOB accesses in deterministic bwd due to incorrectly assumed kN0

The runner assumes kN0 = (hdim_q <= 128) ? 128 : 64 but there are
smaller tiles (for tr_load or fp32). This can create too small dq_acc_buf.

* Pass CK_TILE_FMHA_FWD_*_API as INTERFACE compile options

The instances don't actually depend on them, only examples and tests do.
Passing these definitions as INTERFACE allows to change FMHA_FWD_ENABLE_APIS
without recompiling instances that are already in ccache.

* Fix formatting and names

[ROCm/composable_kernel commit: ec006bb8e0]
2025-09-10 08:06:14 +05:00
linqunAMD
01b8e6a80f [CK_TILE] Refine Generic2dBlockShape to fix ck_tile example 2,10,11,14 on rdna3 and 4 (#2795)
BlockWarps, WarpTile in Generic2dBlockShape are wave size dependent, it causes mangled name mismatch between host and device side.

Solution: Replace them with ThreadPerBlock and move BlockWarps, WarpTile calculation into Generic2dBlockShape

[ROCm/composable_kernel commit: c254f3d7b4]
2025-09-10 08:29:20 +08:00
linqunAMD
9f852da7c8 [CK_TILE] Fix flatmm on gfx11 and gfx12 (#2790)
1. Correct shuffle_b and MakeBFlatDramTileDistribution according to WMMA warp layout
2. Add FlatmmConfig16_Wmma for gfx11 and gfx12

[ROCm/composable_kernel commit: df4ee556d6]
2025-09-10 08:28:00 +08:00
Cong Ma
b42e7a85b2 [CK TILE] Support fp8/fp16 with pk_int4_t as data types for tensors A and B (#2805)
- Add support for tensor A/B in both fp16+pk_int4_t and fp8+pk_int4_t formats
- Implement A(bf8) B(i4) support in universal GEMM
- Use new implementation for i4 to fp8 conversion in Block Scale

[ROCm/composable_kernel commit: 82890192dd]
2025-09-09 16:40:52 -07:00
Yi DING
64a4b6dc9e [CK_TILE] Fix kname & typo in FMHA BWD (#2809)
[ROCm/composable_kernel commit: 91178b4011]
2025-09-09 15:08:00 -07:00
lalala-sh
ea352f2510 [CK_TILE] Add permuteN optimization to remove lds operation in c_shuffle (#2764)
* permuteN optimization to remove lds operation in c_shuffle

* add the change log

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 75570d0fa8]
2025-09-08 22:02:48 -07:00
Max Podkorytov
c1a4d4c5f1 [Util] add a RAII stuct which inserts markers into generated asm (#2748)
* add asm scope raii printer

* add comment

* clang-format

* compress

* add Aviral's suggestion to extend the docstring

Thanks~

Co-authored-by: Aviral Goel <aviral.goel@amd.com>

* cleanup docstring

* clang-format

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: 92b07380d3]
2025-09-08 22:02:02 -07:00
lalala-sh
6c9ffce867 [CK_TILE] add atomic IGLP scheduler for wp gemm (#2739)
* add atomic IGLP scheduler

* clang format

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: e4a7728903]
2025-09-08 14:57:14 -07:00
kyle-256
f1aec610f1 [CK_TILE] Implement Row/Col quant grouped gemm (#2786)
* Add cshuffle epilogue test

* add the poc implementation to the epilogue and tests

* refactor cshuffle epilogue

* WIP: adding tensor/tile usage to scale_tile

* fix usage of tile_elementwise_inout

* add gemm_quant_kernel for generalizing gemm quant kernel

* Add problem specific to different quants, add QuantType to Traits

* Add quant_type to quant_kernel template parameters

* Create aq/bq_block_windows and views depending on QuantType

* Use tile windows as inputs in cshuffle epilogue

* Fix some issues in epilogue

* initial new example code for new general gemm quant kernel test

* Fix issues in kernel

* Add verification check for rowcol Quantmode

* use AccDataType instead of AQ in pipeline

* fix aquant preshuffle

* fix formatting

* some cleanup

* remove gemm_aquant_basic.cpp

* remove gemm_aquant_kernel.hpp

* fix tests for the renamed quant kernel

* fix formatting

* clean example files

* fix some merge conflicts

* fix preshufflequant rename issue

* updating

* fix some templates after merging with develop

* fix test preshuffle parameter

* fix formatting

* updating kernels

* change update user

* test username

* update quant_grouped_gemm example

* update example

* Unify bquant kernel to the common quant kernel

* remove bquant kernel also from common header

* fix formatting

* clean up commented code

* update grouped_gemm_quant example

* fix formatting config hpp

* fix merge mistake

* Non-const for movable windows

* fix formatting

* update tileloop pipleline

* Fix grammar in README

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Remove #include<bit> and clean up example

* fix strides

* Add some descriptions for move_windows

* fix tensor print bug

* update quant_grouped_gemm example

* remove useless code

* cleanup code

* clean up code & format code

* fix compile & running bug in grouped_gemm example

---------

Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: liyingli <liyingli@amd.com>
Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: 4eb415829e]
2025-09-08 10:25:57 -07:00
Yi DING
8f02933795 Reapply "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757) (#2761)
This reverts commit e7fa2e16f812f3a7cd2096a581e29976a7e7429f.

[ROCm/composable_kernel commit: 5ff205ca79]
2025-09-08 09:21:14 -07:00
Thomas Ning
7c9c48737c [FIX] fix on fmha_bwd (#2784)
* fix on fmha_bwd

* Add 'const' to the Default2DEpilogue call operator

* Fix more calls to Default2DEpilogue

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>

[ROCm/composable_kernel commit: 42a43d1523]
2025-09-08 14:31:27 +08:00
Aviral Goel
8eb4d67f9b feat(grouped_gemm): add preshuffle v2 support to grouped gemm example (#2721)
* docs(README): update readme with new build instructions

* feat(grouped_gemm): add support back for non persistent kernel

* refactor(grouped_gemm): simplify tensor creation

* refactor(grouped_gemm): Persistance is now GemmConfig value for easier management

* chore(grouped_gemm): add print statements to ease debugging

* WIP(grouped_gemm): add grouped_gemm_preshuffle example and update CMake configuration

* fix(tile_gemm_traits): change default value of Preshuffle_ from 0 to false for clarity

* WIP(grouped_gemm): add dummy variables to compile the preshuffle pipelines

* chore(grouped_gemm): add print statements and variables to debug numerical error with preshuffle

* style: clang format work so far

* BUG!(grouped_gemm_kernel.hpp): figured out a potential bug in for numerical errors in preshuffle pipeline

* fix(grouped_gemm_kernel): add function in the kernel code to dynamically calculate tail_number resolving numerical errors

* refactor(gemm_presuffle): make preshuffle pipeline v2 compatible with operator () calls from grouped gemm

* chore(grouped_gemm): add/remove debug comments and debug print statements

* feat(grouped_gemm): integrate preshuffle pipeline v2 into grouped gemm for all supported shapes

* chore(gemm_profile): add new argument combinations

* fix: branch cleanup, formatting, refactoring

* fix: branch cleanup, formatting, refactoring

* chore(changelog):  update changelog to reflect new featuer

* address review comments & nit

[ROCm/composable_kernel commit: e279e9420e]
2025-09-07 14:18:35 -07:00
Ville Pietilä
40361182ca [CK Tile] Fix building grouped conv examples in CK Tile (#2777)
* Fix compilation of the grouped conv examples.

* Fix grouped conv bwd weight example output in CK Tile.

[ROCm/composable_kernel commit: 83f607e2a6]
2025-09-05 09:14:21 +03:00
Sami Remes
75bd80b1be [CK_TILE] Row/Col quant gemm (#2729)
* Add cshuffle epilogue test

* add the poc implementation to the epilogue and tests

* refactor cshuffle epilogue

* WIP: adding tensor/tile usage to scale_tile

* fix usage of tile_elementwise_inout

* add gemm_quant_kernel for generalizing gemm quant kernel

* Add problem specific to different quants, add QuantType to Traits

* Add quant_type to quant_kernel template parameters

* Create aq/bq_block_windows and views depending on QuantType

* Use tile windows as inputs in cshuffle epilogue

* Fix some issues in epilogue

* initial new example code for new general gemm quant kernel test

* Fix issues in kernel

* Add verification check for rowcol Quantmode

* use AccDataType instead of AQ in pipeline

* fix aquant preshuffle

* fix formatting

* some cleanup

* remove gemm_aquant_basic.cpp

* remove gemm_aquant_kernel.hpp

* fix tests for the renamed quant kernel

* fix formatting

* clean example files

* fix some merge conflicts

* fix preshufflequant rename issue

* fix some templates after merging with develop

* fix test preshuffle parameter

* fix formatting

* Unify bquant kernel to the common quant kernel

* remove bquant kernel also from common header

* fix formatting

* clean up commented code

* fix formatting config hpp

* fix merge mistake

* Non-const for movable windows

* fix formatting

* Fix grammar in README

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Remove #include<bit> and clean up example

* fix strides

* Add some descriptions for move_windows

---------

Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: c6010f2953]
2025-09-04 16:17:12 -07:00
Illia Silin
c217c0fa93 Fix latest AITER failure and add more AITER tests in CK CI. (#2782)
* add aiter tests and move json_dump header

* remove example/include path from cmake

* extend time for aiter and pytorch stages

[ROCm/composable_kernel commit: ef6c28e989]
2025-09-04 13:44:00 -07:00
SamiAario-AMD
3641dcd64c [CK Tile] gemm splitk two stage (#2697)
* Fix a typo

* Use std::variant to call run_gemm_example_with_layouts with the available layout variant combinations

* Use a unified run_gemm_example_prec_type for basic gemm and universal gemm

* Factor out run_gemm_example_prec_type

* Refactor argument parsing in gemm_splitk_two_stage_reduce.cpp

* Parse arguments outside of create_args

* Move the gemm operators to separate structs to facilitate their reuse

* Move the invokers to separate files to facilitate their reuse

* Rename the invoker files for consistency with the examples that use them

* Add fp32 support to the elementwise examples, and produce an error message for unsupported types

* Get rid of four unused variables

* Make two variables const

* Add support for different input-output type combinations in elementwise examples

* Test support for different input and output types in elementwise examples

* Add support for different operations in the elementwise unary tests

* Add support for UnaryConvert in the elementwise unary tests

* Add support for bf16 in elementwise examples, excluding unsupported type combinations

* Make some operator parameters const in ElementWiseKernel

* Remove some unnecessary include statements

* Implement a two-stage GEMM that does a type conversion in the second stage using the elementwise kernel

* Clear workspace instead of output when flushing the cache in SplitKTwoStageInvoker::gemm

* Fix formatting issues reported by clang

* Add back CK_TILE_USE_WMMA related changes

* Use the right prec type for bf16 in the universal GEMM and two stage split K examples

* Add some brackets

* Add some brackets

* Separate the clearing of the GEMM output memory from the cache flushing in the universal GEMM example

* Separate the clearing of the GEMM output memory from the cache flushing in the split K two stage example

* Fix formatting

* No need to call SetZero on ws_m_n_dev_buf here, as clear_gemm_output now does this as part of the kernel preprocessing

* Add fp16 data type to splitk two stage example

* Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the basic GEMM example

[ROCm/composable_kernel commit: 1acd8e041c]
2025-09-04 14:33:44 +03:00
arai713
71986ea795 [CK TILE] Stream-K tile partitioner (#2708)
* initial commit for skeleton code

* replaced skeleton code with old streamk b2c map functions from old CK, still need to clean up the code

* fixed up code to match CK Tile convention: data type changes, naming changes, etc.

* change for num_sk_blocks data type

* formatting fix

* minor fixes

* moved reduction argument to template

* resolved comments from PR review: standardizing naming, pruning unneeded code

* resolve errors from merge of device op PR: moved enum to common file

* switching to uint32_t due to implementation constraints: divmod only takes uint32_t and mixing signed and unsigned types causes problems

* unsigned type fix

* add const qualifier

* added documentation for template parameters

* documentation edit

[ROCm/composable_kernel commit: 0282d98412]
2025-09-03 13:38:17 -07:00
msaffari-amd
5994c71a13 refactor: use snake_case naming in ck_tile/core components (#2766)
[ROCm/composable_kernel commit: 47d020a993]
2025-09-03 09:34:11 +02:00
Cong Ma
1ed315638e [CK TILE GEMM] Fix building issues (#2772)
- Add `WarpGemmMfma_f32_16x16x128_[fp8|bf8]_[fp8|bf8]_CTransposed`
- Replace `__gfx950__` with `CK_GFX950_SUPPORT`

[ROCm/composable_kernel commit: e1ab460d2d]
2025-09-02 22:40:18 -07:00
Po Yen Chen
d175d5b28a [CK_TILE] Fix fmha_fwd_v3() Default2DEpilogue usage (#2765)
* Fix Default2DEpilogue usage

* Fix Default2DEpilogue usage for batch_prefill

[ROCm/composable_kernel commit: 9f35cde374]
2025-09-02 09:51:56 -07:00
Sami Remes
4b690a85e0 Fix formatting problem (#2768)
[ROCm/composable_kernel commit: 4419fc34a2]
2025-09-02 14:14:10 +03:00
Michael Mcminn
53129b5add Adding fix for the gfx908 to the GEMM MFMA implementaitons of WarpGem… (#2751)
* Adding fix for the gfx908 to the GEMM MFMA implementaitons of WarpGemmMfmaBf16Bf16F32M4N64K16 WarpGemmMfmaBf16Bf16F32M64N4K16

* Adding support for offload target gfx9-4-generic

* This duplication here isn't ideal

[ROCm/composable_kernel commit: 022f369deb]
2025-09-02 10:35:07 +02:00
Haocong WANG
1b9b8084bc Fix naming issue (#2762)
[ROCm/composable_kernel commit: 33418b201f]
2025-09-02 11:18:53 +08:00
Po Yen Chen
884ecea071 [CK_TILE] Add FAv3 fwd pipeline (#2731)
* Add FAv3 fwd pipeline

* Unpack v_pk_mul to hide v_mov

* Avoid compiler moving l compute across phase

* Sync sched_group_barrier() setting for masking cases

[ROCm/composable_kernel commit: d876e87fe4]
2025-09-01 09:16:45 +08:00
Aviral Goel
f71b16afc0 chore(gemm): clang format to pass CI (#2758)
[ROCm/composable_kernel commit: fcff0043ae]
2025-08-29 00:38:46 -07:00
Vijay Krish
bad7262507 ck_tile kernel for gemm with groupwise quantized B tensor. (#2663)
* This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.

Scale tensor data, BQ is spliced across threads in registers and not stored in LDS.

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.

* Solve merge conflict

* [CK TILE] Update CHANGELOG.md

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>

[ROCm/composable_kernel commit: 4208e28988]
2025-08-28 23:43:02 -07:00
Cong Ma
16e29b0fe3 Support transposed C tile in Aquant (#2679)
The performance of Aquant has increased after enabling transposed C.

Do not need to exchange AQ elements among lanes after enabling
transposed C as one thread only holds data from one row.

[ROCm/composable_kernel commit: 428090f749]
2025-08-28 13:28:09 -07:00
Mateusz Ozga
d9f42f3777 [CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D (#2752)
* Init commit

* Quick fix, CI fails

* Remove CDElementWise

* Add CDEELementWise

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>

[ROCm/composable_kernel commit: 0758883fa4]
2025-08-28 12:45:50 -07:00
asleepzzz
154b02423f Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757)
This reverts commit 4eb18aab71ddb0cf7b63fb161424034615e2bdf5.

[ROCm/composable_kernel commit: 038ea82315]
2025-08-28 22:50:42 +08:00
linqunAMD
cd6d731322 [Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728)
* fix copy basic build error

* fix other ck tile test build error

[ROCm/composable_kernel commit: 4a49dac7c6]
2025-08-28 20:30:30 +08:00
Yi DING
8b537fb883 [CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)
* 16x192

* Use buffer_load_lds for lse/d

* Dispatch & cleanup

* Avoid zeroing dq & fix

* fix

[ROCm/composable_kernel commit: ead4447b20]
2025-08-28 18:54:18 +08:00
Linjun-AMD
16d37f9d67 use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline (#2711)
* add k_lds padding and iglp to improve dim256 fmha fwd

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update  block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp

* clang format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* use same naming style

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: bf7b458e6e]
2025-08-28 11:39:39 +08:00
Aviral Goel
5256e754cc feat(HostTensor): Extend support for HostTensor class' >> operator to print more data types (#2691)
* feat(check_err): add a variable to adjust number of incorrect values to print

* feat(host_tensor): add printing capability for fp8 bf8 int8 int4

* fix(gemm_utils): update acceptable data type

* fix(host_tensor): print both 4 bit ints in pk_int4_t

* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp

* feat(host_tensor): add print first n elements functions

[ROCm/composable_kernel commit: f5f795c4d6]
2025-08-27 18:17:24 -07:00