Commit Graph

1781 Commits

Author SHA1 Message Date
rocking
01ea8aa249 Reduce redundant space in bias tensor (#2024)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 8a20b62e91]
2025-03-28 21:58:06 +08:00
felix
20ffa0f474 hotfix fix sorting int64 (#2025)
* fix sorting int64

* clang format

* fix example issue

* update WA issue #

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

[ROCm/composable_kernel commit: a82f338fb9]
2025-03-28 11:31:52 +08:00
Illia Silin
895ba2b497 add gfx950 to default targets for rocm6.4+ (#2032)
[ROCm/composable_kernel commit: d142e15f5e]
2025-03-27 18:48:47 -07:00
spolifroni-amd
408c8b8125 creation of install doc and refactor of doc in general (#1908)
* creation of install doc and refactor of doc in general

* updates based on review comments

* updated based on review comments

* updated readme and contributors markdown

* added extra note to not use -j on its own

* added note about smoke tests and regression tests

* made changes as per Illia's feedback

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: a426f67301]
2025-03-27 15:13:18 -06:00
felix
900acdc2db ckmoe: change cmake; use smaller shape for i4 (#2027)
* change cmake; use smaller shape for i4

* fix pki4 run

* fix typo

* fix runtime arch logic for moe_gemm2 example

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 36d50de50e]
2025-03-27 09:04:31 -07:00
Rostyslav Geyyer
23ad59e1fd Add MX FP4 device conversion tests (#1889)
* Add conversion tests

* Fix ctor

* Fix nan logic

* Fix conversion logic

* Permute packed f4_t values

* Fix conversion to float, repack vector elements

* Fix device tests

* Permute elements in a vector

* Add a repro test

* Add a conversion for a repro test

* Update test vectors

* Update conversion

* Fix the test

* Update test vector generator

* Fix vector sr conversion

* Permute conversion args

* Update conversion

* Test

* Fix packing

* Simplify conversion function

* Pack conversion in a loop

* Pack conversion in a loop

* Pack another conversion in a loop

* Pack one more conversion in a loop

* Pack the last conversion in a loop

* Clean up

* Add printf to fix intrinsic

* Add a sw-based workaround

[ROCm/composable_kernel commit: 441343a23d]
2025-03-26 19:23:01 -05:00
Illia Silin
73a5a3c463 Disable all pk_i4 tests for all targets except gfx942/950. (#2022)
* only build gemm_fp8_pk_i4 examples for gfx942/950

* fix cmake logic

* moved the architecture check to IsSupported function

* Revert "moved the architecture check to IsSupported function"

This reverts commit 056d2a08b3.

* disable all pk_i4 tests for targets other than gfx942/950

* fix cmake logic

[ROCm/composable_kernel commit: 23a949706c]
2025-03-26 15:15:57 -07:00
Bartłomiej Kocot
6ccfb817e4 Add support for GKCYX grouped conv fwd (#2015)
* Add support for GKCYX grouped conv fwd

* fixes

* fix

* changelog

* Fixes

[ROCm/composable_kernel commit: 54c81a1fcf]
2025-03-26 21:13:38 +01:00
Illia Silin
fd8983d063 fix clang format (#2021)
[ROCm/composable_kernel commit: fd915b83f7]
2025-03-26 09:42:10 -07:00
Mirza Halilčević
614e3fee5e Add default arguments for prologue and epilogue. (#2020)
[ROCm/composable_kernel commit: 21e0ca197d]
2025-03-26 09:28:40 -07:00
Illia Silin
27de1d431a Make sure gemm_fp8_pk_i4 examples only build and run on gfx942/950. (#2010)
* only build gemm_fp8_pk_i4 examples for gfx942/950

* fix cmake logic

* moved the architecture check to IsSupported function

* Revert "moved the architecture check to IsSupported function"

This reverts commit 056d2a08b3.

[ROCm/composable_kernel commit: 99b2bbc1d6]
2025-03-25 14:43:38 -07:00
Andriy Roshchenko
75ef4c83bf MX GEMM examples with FP8, FP16, and E8M0 scales (#2016)
* Add `scalar_type` specification for E8M0 exponent

* Specialize `nnvb_data_t_selector` for E8M0 exponent

* Remove partial specializations for `scalar_type` of `non_native_vector_base` template

* Reword command line helper string

* Create MX GEMM examples for different scales


[ROCm/composable_kernel commit: 72d888821c]
2025-03-25 15:33:03 -06:00
Illia Silin
21af4139ad Enable ClangBuildAnalizer when doing ninja build traces. (#2009)
* enable ClangBuildAnalizer when doing ninja traces

* add branch and date to clang build log name

* fix jenkins syntax

* fix jenkins syntax once more

* fix jenkins syntax once more

* simplify the clang_build log name

* simplify the clang_build log name further

[ROCm/composable_kernel commit: 44c093ba0c]
2025-03-25 12:27:04 -07:00
Max Podkorytov
58789d03d3 use fast path for sequence generation in old CK (#1993)
[ROCm/composable_kernel commit: 1a58522f01]
2025-03-25 11:28:44 -07:00
ruanjm
ce1d20c2c6 [CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861)
* 50ms -> 28ms

* Fix bug in non fuse_add_store cases

* Fine tuned setting for 2 pass pipeline

* adjust workload

* remove unnecessary change

* add layernorm

* Adding output quant and unquant results at the same time.

* fix test

* fix format

* tune for cases 128x640 and 128x1024

* bug ifx

[ROCm/composable_kernel commit: d49abdaa87]
2025-03-25 20:09:45 +08:00
Illia Silin
b9e0e7d93e Split up data_type header. (#1996)
* split fp64 vector data type

* add missing header

* move e8m0 structs

* split off numeric_utils header

* fix typo

* split off numeric limits header

* update data_type header

* fix clang format

* split off vector type header

* fix clang format

* fix typo for binary_inf

[ROCm/composable_kernel commit: d2eab23958]
2025-03-24 15:08:54 -07:00
Andriy Roshchenko
bbdd7f6d57 Introduce MX GEMM for FP8 data type (#2000)
[ROCm/composable_kernel commit: 6660dc6b8e]
2025-03-24 15:41:07 -06:00
MHYang-gh
fd151c05d9 Fix A/B lds transform (#2007)
[ROCm/composable_kernel commit: c027637a8f]
2025-03-22 23:13:50 -07:00
Bartłomiej Kocot
ceb078163f Fix split N for large images in groupd conv fwd (#2004)
* Fix split N for large images in groupd conv fwd

* Fix comments

[ROCm/composable_kernel commit: 5b0873c31a]
2025-03-22 23:19:49 +01:00
carlushuang
e1122c5c27 add mask support in hdim=192/128 (#1999)
[ROCm/composable_kernel commit: 6c08c5c46d]
2025-03-21 18:28:43 +08:00
BingYuan.Zhou
c245d569d5 fix ck_tile/basic_gemm build error (#1988)
[ROCm/composable_kernel commit: 5a0d693b86]
2025-03-20 22:01:14 -07:00
felix
bd00da1848 change cmake (#2006)
Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 902dbe89ad]
2025-03-20 19:25:11 -07:00
Attila T. Áfra
081e3c7880 Fix compile errors on Windows and Linux (#2002)
* Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous)

* Fix compile error (no matching function for call to 'cast_to_f32_from_f8')

[ROCm/composable_kernel commit: c79bf11148]
2025-03-20 12:37:25 -07:00
carlushuang
23340c5dd5 [CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
* return value with macro and revert the return value

* [CK-TILE] no-macro launch api solution (#1992)

* no-macro solution

* address -Wcomma

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: e3c9886cdf]
2025-03-20 11:00:29 -07:00
jakpiase
f1262b783a [CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919)
* switch to universal gemm for batched and grouped gemms

* added reviewer comments

* fixed grouped gemm tests

[ROCm/composable_kernel commit: 0e91d32c61]
2025-03-20 11:17:04 +01:00
rocking
b0f323c4ec Sync the kname with instance name (#1989)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: b819c217e4]
2025-03-20 00:06:45 +08:00
felix
c2948a0634 Ck moe hot fix (#1979)
* fix useless code and remove usless oob

* clang format

* fix coredump in e2e test

* fix2

* fix clang format

* fix output oob

* clang format

* rm useless comments

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 7eaedeb36c]
2025-03-19 22:58:27 +08:00
Bartłomiej Kocot
ba83883cd2 Add grouped conv bwd wei merged grouped instance for larger filter (#1984)
* Add grouped conv bwd wei merged grouped instance for larger filter

* Update readme

[ROCm/composable_kernel commit: fdaff5603e]
2025-03-18 16:16:24 +01:00
Illia Silin
77ed99efe7 Add a daily CI build on gfx908. (#1987)
* add one daily ci build on gfx908

* add redis invocation tag for gfx908

* make ci build for gfx908 conditional

* fix groovy logic

* add option to run perf tests for gfx908

* disable a few tests on mi100

[ROCm/composable_kernel commit: 1342ecf7fb]
2025-03-17 18:08:53 -07:00
Illia Silin
3a0ce843b1 disable ck_tile basic gemm (#1986)
[ROCm/composable_kernel commit: 07f25186b2]
2025-03-17 15:26:43 -07:00
aledudek
73d207bd4e Async grouped gemm v3 (#1940)
* Fully async grouped gemm

* Remove commented code

* Remvoe maybe_unused

* host kernel args

* Checkpoint segfault debugging...

* Working part1

* Working part2

* Remvoe comments...

* Use void ptr for gemm kernel host args

* Fix device_grouped_gemm_multiple_d_dl build issue

* Fix device_grouped_gemm_xdl build issue

[ROCm/composable_kernel commit: 5095906975]
2025-03-17 16:42:43 +01:00
Bartłomiej Kocot
b8f58a234e Grouped conv bwd data NGCHW (#1967)
* Grouped conv bwd data NGCHW

* fixes

* fix

* Improvements

* Fix

* Fix

* add client example

[ROCm/composable_kernel commit: c2e4898b4b]
2025-03-17 13:32:00 +01:00
valarLip
99d7424a14 hotfix fmoe build issue (#1976)
[ROCm/composable_kernel commit: 52b1cd7780]
2025-03-13 15:11:59 +08:00
dependabot[bot]
39c02ebfab Bump rocm-docs-core from 1.17.1 to 1.18.1 in /docs/sphinx (#1977)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.1 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.1...v1.18.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: de7a745ca6]
2025-03-12 23:36:36 -07:00
carlushuang
f2dd57b76f Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit b92caa3d84.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 3e81279d26]
2025-03-13 11:41:39 +08:00
Illia Silin
7f849a89e3 disable tests that take too long to build for gfx90a (#1975)
[ROCm/composable_kernel commit: d4a6d69643]
2025-03-12 17:54:03 -07:00
feli
e3c5b2ae80 ck_moe: fix useless code and remove usless oob (#1972)
* fix useless code and remove usless oob

* clang format

---------

Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 251afab3b7]
2025-03-12 09:22:42 -07:00
Illia Silin
a7614ad594 use old instrinsics with staging compiler (#1970)
[ROCm/composable_kernel commit: 4c97cc511e]
2025-03-12 07:29:09 -07:00
Illia Silin
b92caa3d84 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 45fbd9210a.

[ROCm/composable_kernel commit: 8cbcd3e0d0]
2025-03-11 10:40:18 -07:00
Haocong WANG
1ed0b74c43 [Block Scale GEMM] Optimized block scale gemm (#1950)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipelie v1 and v2. v1 is work for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

* split weight preshuffle library targets

* bring back enable-post-misched=0

* fix build issues for gemm_multiply_multiply_fp8 instances

* fix clang format

* add verbose build flag when building for all targets

* reduce path names for new instances

* fix paths in cmake

* refactor gemm_multiply_multiply library target

* fix a bug in example

* fix example 65 cmake

* reduce the number of threads when building libs for all targets to 50

* use ninja to build for all targets

* reduce teh number of threads when building for all targets

* reduce the number of threads to 32 when building libs for all targets to 50

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: cbd74c2d12]
2025-03-11 10:11:21 -07:00
Haocong WANG
1f5c6f2db1 reduce test size to avoid timeout on specific silicon (#1966)
[ROCm/composable_kernel commit: ba209b9dab]
2025-03-11 09:15:26 -07:00
Illia Silin
8fe2095d43 disable example_moe_gemm2_xdl_pk_i4 on gfx950 (#1968)
[ROCm/composable_kernel commit: aa42c3db06]
2025-03-11 08:34:47 -07:00
carlushuang
45fbd9210a [CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)
* support hdim=192/128 pair

* remove useless print

* update

[ROCm/composable_kernel commit: 7a93b16ff6]
2025-03-11 21:07:40 +08:00
Mingtao Gu
fc98615212 Ck int4 moe develop (#1949)
* Add Gemm fp8xint4 example and kernel, function pass.

* Init Gemm_fp8xint4 Bpreshuffle

* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet

* General fix.

* fp8xint4 bpreshuffle function pass

* fix.

* init b preshuffle dequant in VGPR.

* fix bug, function pass.

* move b thread dequant copy to blockwise.

* fix bug, function now passes.

* modified the tile size to 256, 128x128x128.

* fixed a bug.

* Initial int4 moe, compile pass, function not check.

* fix bug in moe_gemm1.cpp, now function pass.

* test expert = 8 and function pass.

* Added moe_pk_i4_gemm2, function pass.

* Added b preshuffle pipeline v3 support.

* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.

* Split the blockwise pipeline for fp8xint4.

* commit missing files

* opt gemm2 to 2x2 wave

* fix swizzle = false

* update int4 moe with latest input changes.

* update tile size.

* enable pipeline v3.

* fix nswizzle = true

* commit a version for compiler debug.

* Updated transfer_v3r1_gather to support pk_i4_t type.

* for int4 moe2 for type_convert support.

* remove some values between mfma instructions.

* fix int4 moe

* Updated transfer_v3r1_gather to support pk_i4_t type.

* i4 support lds multiple shuffle

* fixed int4 moe tflops calculation.

* Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle

* updated gemm2.

* change int4 moe example names

* fix and format code.

* format.

* format codes.

* update fp8xint4 example tile size.

* add <unordered_map> header

* fixed.

* format.

* Added conditional compilation for int4 -> fp8 conversion kernels

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 0db7c8f0b2]
2025-03-10 11:16:44 +08:00
Thomas Ning
89f3ca4c89 Add the instance of MBlock=144 for GemmMultiplyMultiply (#1955)
* tempsave, not selected

* finish the feature and merge with develop

---------

Co-authored-by: aska-0096 <haocwang@amd.com>

[ROCm/composable_kernel commit: c954bd0cfa]
2025-03-07 13:44:06 -08:00
Thomas Ning
ed0649e4e6 Fix on the error (#1956)
[ROCm/composable_kernel commit: 9d51d17dd0]
2025-03-07 13:43:52 -08:00
Illia Silin
88bd358d65 add missing headers (#1959)
[ROCm/composable_kernel commit: 0e8e711ec8]
2025-03-07 11:11:30 -08:00
Max Podkorytov
9b160b318f refactor ck-tile kernel launch (#1925)
[ROCm/composable_kernel commit: 9e132eb77c]
2025-03-07 08:29:40 -08:00
Qianfeng
162253694a Ck tile/complete k prefetch (#1941)
* Re-implement qr_ks_vs_async pipeline by using kLoadOnce

* Remove last block_sync_lds() in the loop

* Tiny adjustment in qr_ks_vs_async pipeline for better performance

* Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline

* Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline

* Use un-rolled gemm for Gemm-0

* Use k0_loops small tile load/store to replace the big tile load/store for K

* Remove the commented lines in qx_ks_vs_custom_policy.hpp

* Tune the prefetching of V in qr_ks_vs_async pipeline

* Move the codes for storing the first v_lds tile some later

* Let BlockDropout reuse LDS with V

* Switch to separate code blocks according to iteration index

* Interleave code blocks for better performance

* Move clear_tile(s_acc) for better interleaving

* Move code interleaving

* Use MakeQDramTileDistribution for q_dram_window

* Roll-back to load Q directly from global memory instead of using LDS as intermediary stop

* Let V reuse the LDS of K

* Use array of tiles to represent Q in vgprs

* Use QLoadOnce == false for qr_ks_vs_async pipeline

* Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline

* Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile()

* Move the definition of v_tiles out from the loop

* Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile()

* Remove using KLoadOnce in qx_ks_vs_custom_policy

* Remove un-used get_slice_tile() call

* Move the code line of clear_tile(s_acc)

* Tune the lines of codes to make them more tidy

* Re-arrange the codes before the main-loop

* Add comments

* Unify the alignment to be 8 for Q/K/V Lds decriptors

* Tuning to K pre-loading

* Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false

* Adjust the pipeline codes

* Use NumPrefetchV to separate from NumVLdsBuffers

* Tune the location of a scheduler barrier code line

* Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths

* Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel

* Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline

* Move the location for store_tile() of first v_tile

* Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline

* Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<>

* Try to fix old bugs in qx_ks_vs_custom_policy

* Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler

* Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy

* Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV)

* Tiny Fix (#1888)

* Ck tile/paged attention workaround (#1894)

* Correction in GetRangeAlongX()

* Work-around to solve the failures in test_paged_attention_ck in xformers

* Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline

* Remove one call of move_tile_window for q_dram_window

* Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers()

* Tiny fix in qr_ks_vs_whole_k_prefetch pipeline

* Adjust the location of codes for storing the first V tile to LDS

* Tiny fix and add comments

* Change GetSmemKPackK size to improve performance

* Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy

* Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access

* Adjustment in intermediate iteration codes for tiny performance improvement

* Reduce the number of VLds buffers to 2 for whole_k_prefetch situtation

* Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue

* Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer()

* Remove useless AsyncopyV

* Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used

* Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy

* Recover the qr_ks_vs_async pipeline

* Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline

* Revert "Try to fix old bugs in qx_ks_vs_custom_policy"

This reverts commit 39b82ca194.

* Tiny fix with regard to whole_k_prefetch pipeline compiling

* Update kPadSeqLenK setting in fmha_fwd_kernel

* Use q_element_func and k_element_func

* Use single q_tile rather than multiple sliced q_tiles

* Codes refine according to the comments

* Re-format one file

* Mark qr_ks_vs_whole_k_prefetch as QLoadOnec == true

[ROCm/composable_kernel commit: 4f54fa3058]
2025-03-07 14:19:51 +08:00
Illia Silin
86ceda9438 RE-enable DL and DPP instances by default. (#1954)
* enable DL and DPP instances by default

* fix cmake logic

[ROCm/composable_kernel commit: 43c90b5234]
2025-03-06 21:45:31 -08:00