Commit Graph

51 Commits

Author SHA1 Message Date
Hosang
c800f88911 [rocm-libraries] ROCm/rocm-libraries#5088 (commit 36ca523)
[CK_TILE] Update gfx11 FMHA forward kernel configs

## Motivation
Tune gfx11 FMHA codegen to recover performance for mainly PSSK (padded
seqlen_q/k) cases.
This tuning is based on heuristic search and improves performance in
most tested shapes.
Performance should be evaluated on top of
[`ROCm/rocm-libraries#5018`](https://github.com/ROCm/rocm-libraries/pull/5018)
(required baseline).

## Technical Details

  - Updated gfx11 codegen heuristic choices for tile size and occupancy.
   - Updated gfx11 pipeline selection:
- Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently
slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad
cases are dispatched to the faster kernel path.`
- Kept gfx12 unchanged: with PSSK support from
[`ROCm/rocm-libraries#4957`](https://github.com/ROCm/rocm-libraries/pull/4957),
existing gfx12 config is already sufficient.
  - Tuning rationale:
    - In some cases, higher `kBlockPerCu` lowers register pressure.
- On RDNA, this generally aligns with better performance when
`waves_per_eu >= 6`.

## Test Plan
- test_ck_tile_fmha
- tile_example_fmha_fwd: tested this on gfx1100 and gfx1151
./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24
-d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}

## Test Result
- TFLOPs by sequence length target: `gfx1100` layout: `bhsd`
- mode: batch / VGPR usage: 225 vs 214

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 74.10 | 71.97 | 0.97x
4096 | 66.26 | 77.79 | 1.17x
8192 | 68.18 | 75.88 | 1.11x
12288 | 68.47 | 80.44 | 1.17x
16384 | 59.54 | 79.66 | 1.34x
20480 | 55.78 | 77.91 | 1.40x
24576 | 55.08 | 77.47 | 1.41x
27280 | 47.45 | 77.16 | 1.63x
- mode: group / VGPR usage: 256 vs 214

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 71.47 | 70.6 | 0.99x
4096 | 64.74 | 77.06 | 1.19x
8192 | 64.68 | 75.47 | 1.17x
12288 | 66.43 | 79.95 | 1.20x
16384 | 56.02 | 79.73 | 1.42x
20480 | 50.21 | 78.15 | 1.56x
24576 | 47.29 | 77.53 | 1.64x
27280 | 46.13 | 77.04 | 1.67x

- TFLOPs by sequence length target: `gfx1151` layout: `bshd`
- mode: batch / VGPR usage: 225 vs 223

Batch | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 26.85 | 29.17 | 1.09x
4096 | 24.75 | 26.01 | 1.05x
8192 | 25.24 | 25.50 | 1.01x
12288 | 25.18 | 25.00 | 0.99x
16384 | 24.79 | 25.91 | 1.05x
20480 | 25.56 | 25.24 | 0.99x
24576 | 25.13 | 26.20 | 1.04x
27280 | 10.78 | 26.35 | 2.44x
- mode: group / VGPR usage: 256 vs 229

Batch | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 27.44 | 26.71 | 0.97x
4096 | 21.89 | 23.09 | 1.05x
8192 | 22.85 | 24.49 | 1.07x
12288 | 24.33 | 24.42 | 1.00x
16384 | 20.05 | 24.98 | 1.24x
20480 | 14.70 | 25.15 | 1.71x
24576 | 11.30 | 26.31 | 2.33x
27280 | 10.10 | 26.32 | 2.61x

## Submission Checklist

- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
2026-03-10 16:47:43 +00:00
Thomas Ning
5cb8109535 [rocm-libraries] ROCm/rocm-libraries#4640 (commit 37b8c81)
Fix the Composable Kernel CI and versions incompatibility
 (#4640)

## Motivation

This PR has 4 patches:
1. Fix the CI error of grouped gemm.
2. Fix the incompatibility of old linux version.
3. Fix the potential errors of flatmm.
4. Address the previous comments of abquant eight warps pipeline
solution.
2026-02-18 15:00:26 +00:00
Estevan Vedovelli
1224bc0a82 Add support to gfx1153 and fix gfx115X WMMA config (#3496)
* Support for gfx115X

* Changes for gfx115X

* Add gfx1153

* Update changelog

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-05 10:03:30 -08:00
Illia Silin
14668a56e3 remove the LLVM_MAIN_REVISION usage (#3487) 2025-12-24 16:49:35 -08:00
Illia Silin
2d9c962e2c get LLVM_MAIN_REVISION macro from compiler header (#3469) 2025-12-19 14:57:12 -08:00
Aviral Goel
5e2d25e20f build: reduce build time for bquant tests by splitting into multiple cpp & support on other gfx10 case (#3395)
* build: reduce build time for bqaunt unit tests by splitting into multiple cpp

* reduce the test case & add the gfx10 support

* fix: copyright header for new file

* chore: add copyright to pass the CI

* build: Hot fix to reduce massive build time by just disabling the instances

* Update include/ck_tile/core/config.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: khushbu <khuagarw@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-12-15 07:19:29 -08:00
Illia Silin
b2925ee207 Fix compilation errors with latest clang22 version. (#3396)
* remove target attributes from deduction guides

* switch CK_TILE_HOST_DEVICE_EXTERN based on clang version
2025-12-11 08:09:29 -08:00
Thomas Ning
86a84ae611 Add the gfx1011 support on CK Tile with the SGPR builtin reading protection (#3350)
* Finish the fixes

* add the gfx1010 support macro

* Fix the compilation error
2025-12-05 14:18:30 -08:00
Matthias Gehre
678298d4c7 Add support for gfx1153 (#3306) 2025-11-27 08:48:00 +01:00
Aviral Goel
de6466481f chore(copyright): update copyright header for include directory (#3293) 2025-11-26 11:00:05 -07:00
Christopher Millette
b9c6cb1452 First look at mfma / wmma unification (#2704)
* First look at mfma / wmma unification

* Refactor

* Re-org file structure

* Restructure transform selection and WaveWiseMma class

* Update license files. Add missing gfx1151 support. Change wave size for HOST to 1. Update datatypes naming consistency

* Fixes default MmaSelector implentation

* Adds unit tests for amdgcn_mma and arch

* Consolidate common arch id checks to constexpr functions. Strongly type ids as amdgcn_target_arch_id object.

* Refactor is_any_value_of

* Fixes mma_selector logic

* Fix typo

* Add mma selector test for tile decomposition

* Fix compilation of mma.hpp

* Revert back to c++17 compatibility

* Fix compiler error by returning index_t from get_warp_size()

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixes compiler error for missing is_wave32() function

* Fixes compiler error for host wave_size() should be 64

* Fixes compiler errors where __cpp_concepts is not defined

* Fixes compiler errors where __cpp_concepts is not defined

* Fix test failure for host is wave64 by default

---------

Co-authored-by: Chris Millette <you@example.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-24 09:39:59 -08:00
Gavin Zhao
07314ac543 Add support for RDNA1 GPUs (#3220)
* Allow compilation for RDNA1 (__gfx101__)

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* More RDNA1 changes

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* Even more RDNA1 changes

Signed-off-by: Gavin Zhao <git@gzgz.dev>

* cmake: skip build quantization for unsupported arches

* add gfx10-1-generic support as well

* add gfx1013 and complete gfx10-1-generic

* fix clang format

* enable DL kernels on gfx101x

---------

Signed-off-by: Gavin Zhao <git@gzgz.dev>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-11-20 10:45:57 -08:00
Anton Gorenko
1e77695fe8 [CK_TILE] Support WMMA (gfx12) in FMHA (#2528)
* Pass hdim to tile_example_fmha_fwd in fp8 tests

* Add WMMA support to fwd FMHA pipelines

* Tune tile sizes a bit for less spilling

fp16 256 is still quite slow

* Fix Q grad tile distribution for warp size = 32 and hdim >= 256

With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile.

* Use code based on BlockDropout in BlockDropoutBwd

* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal

* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9
  because they both equal to 8 while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1],
  it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependecies allows to support:
    * any number of warps (power-of-2), not only 4;
    * kN1 = 16, not only 32;
    * any number of splits;

* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now

* Replace hard-coded kN1 in dispatch code with the requested tile size

* Add gfx12-specific tile sizes for split KV

* Pass GPU architecture to kernel generation scripts

This is still a temporary solution.

* Build and run FMHA CI tests for gfx12

* Fix issue after merging

* Fix bwd tile sizes

The current pipelines always read only one tile K and V tile, this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).

* Use hardware f32->f8 on gfx12, remove v_perm

__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of
 32-bit dword) is used to store results (two f8 values).

* Update changelog

* Add WMMA support to pagedkv

* Fix scripts after rebasing

* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout

Add comments with dropout implementation details

Fix performance regression of fwd+dropout

    * Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
    * "scalarize" seed and offset, they may come either from kernel args or from device memory
      (presumably loaded with vector loads).

    These changes help the compiler to procude more optimal code and reduce register spilling.

Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get  CWarpDstrEncoding

Use code based on BlockDropout in BlockDropoutBwd

Refactor BlockDropout (fwd)

Implement BlockDropout (fwd) for WMMA

    Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
    this version supports 16x16 tiles.
    If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
    to BlockDropoutBwd.

Implement BlockDropoutBwd for WMMA

Remove MakeRandValLds* functions unused in BlockDropoutBwd

Remove unused Run overload from BlockDropoutBwd

* Fix regression with philox seed and offset when they exceed 32-bit int

__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.

* Fix names after cherry-picking

* Fix selection of a fallback tile based on bm0

The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.

* Do not use filters related to qr_async_trload

They disable tiles/pipelines which are valid for gfx12.

* Use different dstr encoding when C is transposed

* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code

Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().

* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution

Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.

* Generalize usages of WarpGemmDispatcher for MFMA and WMMA

WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly becaus of swizzle factor = 4.

* Mark has_load_tr as maybe_unused

There are no transpose loading for RDNA.

* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code

* Detect BlockSize on host based on warp size of the current device

If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.

* Fix calculation of grid size for combine kernel with warp size = 32

* Add missing includes and header

* Support multiple archs in one binary for fwd

* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill

* Support multiple archs in one binary for bwd

* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
  they can be used as fallbacks (similarly to fwd);

* Extract common code from register_traits

* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"

To simplify merging , the proper fix is in develop already.

* Support new numerical d paddings in trait ordering checks

* Build fp32 tests only on gfx9

* Do not use hardcoded M0 = 64 for dot bwd kernel

* Use textwrap.indent from standard library

* Make fp8 pipelines on gfx12 consistent with gfx9

* Update tests for current pipelines

* Make ninja check more responsive in CI

ninja buffers output so this job looks hanging.

* Support fp8fp32 by limiting O vector size

The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).

* Remove unused cmake options

* Unify including  amd_buffer_addressing.hpp/_builtins.hpp

* Temporarily use amd_buffer_addressing.hpp on >=gfx10

amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
 * 1 scalar for exec masks instead of 2,
 * gfx12 uses different instruction names etc.

* Update asm in bf16 conversions to work with warp 32

* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd

* Add arch tags to kernels/host funcs, compile for each arch separately

* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename

* Add workaround for miscompilation of bwd with padded hdim

SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.

* Fix add_gtest_executable for absolute paths

Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.

* Run only hdim 128 smoke tests for fp8fp32

There are no instances for hdim 64 and 256.

* Format py with ruff to simplify merging develop

* Fix incorrect var name

* Codegen for gfx9,gfx950 when --targets is not specified

Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).

* Combine arch-related values into ArchTrait

This more centralized approach removes duplication of various formatting templates.

* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"

Some code is extracted into a function.
2025-10-29 13:31:08 -07:00
Sami Remes
4419fc34a2 Fix formatting problem (#2768) 2025-09-02 14:14:10 +03:00
Michael Mcminn
022f369deb Adding fix for the gfx908 to the GEMM MFMA implementaitons of WarpGem… (#2751)
* Adding fix for the gfx908 to the GEMM MFMA implementaitons of WarpGemmMfmaBf16Bf16F32M4N64K16 WarpGemmMfmaBf16Bf16F32M64N4K16

* Adding support for offload target gfx9-4-generic

* This duplication here isn't ideal
2025-09-02 10:35:07 +02:00
linqunAMD
9fcc1ee9fd Support Wave32 in CK_TILE - Part 1 (#2594)
* Support wave32/wave64 in CK_TILE - Part 1

* remove blocksize in kernel launch

* fix build error

* fix clang format

* fix clang format 2

* fix clang format 3

* fix fmha build error

* fix fmha build 2

* fix fmha build 3

* fix build error 4

* address review comment

* update change log

* replace KernelBlockSize with kBlockSize

* fix CI fail

* fix clang format

* address review comment and rebase code.

* fix universal test fail

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-08-18 10:08:31 -07:00
Tianyuan Wu
68134b60e4 [CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 (#2466)
* WMMA GEMM F16 Implementation

Signed-off-by: root <tianyuwu@amd.com>

* Self-review

Signed-off-by: root <tianyuwu@amd.com>

* ASIC check minor tweak

Signed-off-by: root <tianyuwu@amd.com>

* add missing include file

* Set GPU_TARGETS to gfx11/12 generic

Signed-off-by: root <tianyuwu@amd.com>

* INT8 GFX12

Signed-off-by: root <tianyuwu@amd.com>

* add int8x16 branch

* Fix CI script

Signed-off-by: root <tianyuwu@amd.com>

* Fix typo

Signed-off-by: root <tianyuwu@amd.com>

* Add CK_Tile WMMA example

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* Fix CI

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* fix clang format

* Set M/N_Warp Back to Constant

Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>

* Use GemmConfigComputeV3 by default

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Remove CK_Tile wmma gemm examples from the CI list

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Add atomic add fallback method for gfx11

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix typo

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Omit copyright year

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Support non-square cases

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix CI

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Add get_device_ip()

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Revert "Add atomic add fallback method for gfx11"

This reverts commit 07a79e797d.

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12"

This reverts commit ceee918007.

* Revise method name and typos

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* clang-format

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Try fix CI

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Revert "Try fix CI"

This reverts commit 7a7241085e.

* clang-format

Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>

* Fix typo caused by merge

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

* Fix typo caused by merging

Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>

---------

Signed-off-by: root <tianyuwu@amd.com>
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com>
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-08-15 16:22:27 -07:00
Haocong WANG
05a6e92705 Re-enable optimization for gfx950 fmha fwd (#2671)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

* refactor blockgemm change, isolate to v2;

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-08-13 14:57:43 +08:00
asleepzzz
5b39de4bb6 Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)
This reverts commit b7322a521a.
2025-08-12 20:27:10 +08:00
Haocong WANG
b7322a521a Optimize fmha fwd decode & prefill for gfx950 (#2641)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-08-12 19:43:14 +08:00
Illia Silin
788e8a878e update the switch condition for buffer built-ins (#2602) 2025-08-01 14:30:07 -07:00
carlushuang
0aecb5ab68 default skip y point to r (#2457)
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-07-06 23:54:34 -07:00
Thomas Ning
f59b8c7d3d OCP FP8 Macro restructure (#2331)
* solved the problem
2025-06-12 09:46:33 -07:00
MHYangAMD
9fcf21a4ec Fix fmha fwd precision issue on MI3XX series (#2285)
* Fix fmha fwd precision issue on MI3XX series

For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 for casting P would lead to precision
issues, since it uses __builtin_amdgcn_cvt_pkrtz, which is round to zero.

For examaple, fixing K,V to be all 1, and Q is random, which outputs are
expected to be all 1. But we found that it would have some incorrect
outputs 0.9995, which are smaller than the atol 0.001. (1 - 0.9995 =
0.0005 < 0.001) Thus, ck do not report this error.

* Add option to switch rtn/rtz for fmha fwd
2025-06-10 15:03:23 +08:00
carlushuang
4e9b76f88c [CK_TILE] optimize moe sorting kernel, boost large context case up to 20x (#2153)
* combine 2-3 as single stage

* support zeroing

* improve long tokens

* update specialization

* b16 ws

* 8bit topk optimize

* update 15 example
2025-05-06 17:32:07 +08:00
Illia Silin
3e6d21adeb enable gfx115x support (#2065) 2025-04-09 10:06:42 -07:00
Illia Silin
572cd820ce Split env.hpp header from the ck.hpp header. (#2049)
* split env.hpp out of main headers

* fix namespace logic
2025-04-03 15:30:21 -07:00
felix
a82f338fb9 hotfix fix sorting int64 (#2025)
* fix sorting int64

* clang format

* fix example issue

* update WA issue #

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2025-03-28 11:31:52 +08:00
carlushuang
3e81279d26 Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit 8cbcd3e0d0.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-13 11:41:39 +08:00
Illia Silin
8cbcd3e0d0 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 7a93b16ff6.
2025-03-11 10:40:18 -07:00
carlushuang
7a93b16ff6 [CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)
* support hdim=192/128 pair

* remove useless print

* update
2025-03-11 21:07:40 +08:00
carlushuang
c12fb0a624 [CK_TILE][HOTFIX] WA for address space by disable it completely (#1947)
* port all moe changes from ck_moe_gemm branch

* refine codes in the pr

* fix tail odd

* fix clang format

* fix clang format2

* make hot loop scheduler compatible with 16x16 and 32x32

* clang format

* fix per token quant

* rename moe example

* clang format

* WA for address space by disable it completely

* hot fix moe gemm2

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>
2025-03-06 12:01:25 +08:00
Illia Silin
9b51c08bf7 remove support for gfx940 and gfx941 targets (#1944)
* remove support for gfx940 and gfx941 targets

* update changelog
2025-03-05 11:07:33 -08:00
Juan Manuel Martinez Caamaño
57bb0e96a4 Missing _ in __HIPCC__ (#1930) 2025-03-03 08:19:47 -08:00
aledudek
c9bcfd755e [CK_TILE] Add EnvLogging and missing gemm args checks (#1896)
* [CK_TILE] Add EnvLogging - refactor IsSupported error messages

* [CK_TILE] Add EnvLogging - wrap gemm kernel error messages

* [CK_TILE] Add EnvLogging - Add missing k_batch args check

* [CK_TILE] Add EnvLogging - remove debug log

* Add one check

* [CK_TILE] EnvLogging - add CK_TILE_ERROR logs

* [CK_TILE] EnvLogging quotes fix

* [CK_TILE] EngLogging use function instead of macro for err logs

* [CK_TILE] EnvLogging - refactor checking env var
2025-02-25 11:48:38 +01:00
Bartłomiej Kocot
9ee69dd297 Fix pk_int4 cast and add pk_int4 dtype in ck tile (#1854)
* Fix pk_int4 cast and add pk_int4 dtype in ck tile

* fixes

* Improvements

* fix typo
2025-02-04 10:32:07 +01:00
Andriy Roshchenko
35aebe5936 Add OCP FP8 support in CK_TILE (#1829)
* Add OCP FP8 to CK_TILE

* Validate OCP FP8 in FMHA FWD under VALID=1
2025-01-27 11:59:49 -07:00
carlushuang
36c7ce4e0e [CK_TILE]Moe update index (#1672)
* update MOCK_ID for moe-sorting

* add moe-smoothquant

* update a comment

* fix format

* hot fix

* update topk in overflow case

* update comments

* update bf16 cvt

---------

Co-authored-by: valarLip <340077269@qq.com>
2024-11-25 13:12:35 +08:00
Illia Silin
75c5bfa364 enable compilation for generic navi targets (#1645) 2024-11-07 14:14:42 -08:00
carlushuang
b098b71b05 topk_softmax (#1592)
* topk_softmax

* remove some file

* fix atomix linear_offset

* address various comment, and change sfc get_index api to static(tuple)
2024-10-26 23:52:49 +08:00
ltqin
0394f8a713 update layernorm (#1570)
* port layernorm

* change warp_welford.hpp

* Update warpshuffle

* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()

* refine welford max count calculation

* unify layernorm api

* Rename file

* Remove save mean and inv std

* Revert "refine welford max count calculation"

This reverts commit 022365802b.

* Fix order of parameter

* refine welford max count calculation again

* Remove fp32 instances

* Fix bug of padding

* refactor api

* Support bf16

* Extract common function

* Refine arg of operator()

* Add kMThreadPerBlock to template parameter

* clang format

* Refine variable name

* Refine file name

* remove redundant line

* refactor layernorm2d pipeline and add block-per-block utility

* fix name

* rename more

* add more block-per-tile instance

* remove duplicated define

* update instance for 2048, 1024 case

* support up to 2048 now

* opt loading

* add n1536

* Add two pass pipeline

* format

* Fix incorrect type

* parallel compilation

* Use smaller N

* fix 2p pass

* Support Repeat_M in distribution

* Refine nameing

* Add reduce example

---------

Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-10-22 09:26:18 +08:00
Po Yen Chen
9868fd0245 Apply ROCm 6.2 WA to ROCm 6.3 and later (#1563) 2024-10-15 18:02:41 +08:00
Dan Yao
b8addae293 [CK_TILE] float -> bf16 inline asm rtn (#1482)
* asm rtn

* add asm rtn macro

* reorder macro

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-08-30 15:38:09 +08:00
Po Yen Chen
461ec98d78 Enable scratch memory workaround on ROCm 6.2 (#1486)
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-08-30 10:40:00 +08:00
carlushuang
8182976c37 [CK_TILE] wa prec, remove sgpr offset for inline asm (#1356)
* wa prec, remove sgpr offset for inline asm

* macro for set tile

* ignore unused param if no kernel instances in host API

* fix more prec issue

* cache buffer resource

* fix

* support pre-nop

* clear tile by vector type members

* add workaround to reduce scratch memory

* conditionally enable workaround code

* enable workaround start from certain build version

* fallback set_tile() implementation from certain build version

* undo template argument changes

* put dummy asm in load_raw()

* fix comments, refactor s_nop inside buffer_load

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2024-07-08 11:09:55 -07:00
Illia Silin
941d1f7ce0 Merging the gfx12 code into public repo. (#1362) 2024-06-27 00:33:34 -07:00
carlushuang
17ed368f58 [CK_TILE][FA] using pk f16_f32 (#1343)
* [CK_TILE][FA] using pk f16_f32

* correct a error
2024-06-17 17:16:46 +08:00
carlushuang
5055b3bdcb [CK_TILE] support group from cmdline (#1295)
* support cmdline seqlen decode

* silent print

* update readme

* update kernel launch 3d

* update tile partitioner

* fix spill for bf16

* modify based on comment

* modify payload_t

* fix bug for alibi mode

* fix alibi test err

* refactor kernel launch, support select timer

* add missing file

* remove useless code

* add some comments
2024-05-28 11:13:21 +08:00
Illia Silin
06b891c5c2 aggregate device macros in ck_tile config header (#1297) 2024-05-20 08:34:45 -07:00
carlushuang
851c3ed157 [CK_TILE] support alibi (#1269)
* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>
2024-05-07 22:32:54 +08:00