742 Commits

Author SHA1 Message Date
BingYuan.Zhou
4ec293cb4b [flatmm] implement basic fp16 flatmm (#2089)
* [flatmm] implement basic fp16 flatmm

* fix CI build fail

---------

Co-authored-by: root <root@hjbog-srdc-50.amd.com>
Co-authored-by: solin <bingzhou@amd.com>

[ROCm/composable_kernel commit: eaf1f0bf3b]
2025-04-16 16:51:17 +08:00
Andriy Roshchenko
d9c9f17c3d MX GEMM - New GEMM pipeline for MX data types (#2059)
* Allow selection of mfma_scale instructions

* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order

* Add constexpr and synchronize return type for `get_exponent_value`

* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`

* Add support for microscaling instructions in `XdlopsGemm`

* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper

* Remove software implementation of MX GEMM

* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction

* Update README

* Updated CHANGELOG

* Remove unused static methods

[ROCm/composable_kernel commit: 7106976a72]
2025-04-15 17:17:07 -06:00
Mingtao Gu
e8db9f0220 CK pk_i4_t test failures fix (SWDEV-518629) (#2075)
* fix pk_i4_v3 test failures in Ubuntu env.

* fix pk_i4_t test failures on Ubuntu.

* some fixes.

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: 56378f810f]
2025-04-14 16:58:57 +08:00
Thomas Ning
1b61d3a0ed Solve the Static Encoding Pattern compile error when the tile size is too small (#2079)
[ROCm/composable_kernel commit: 269f4f6af5]
2025-04-13 20:09:30 -07:00
Illia Silin
90612d0e37 Fix build issues for multiple targets. (#2077)
* build for multiple targets on gfx942

* add missing ignore statements

[ROCm/composable_kernel commit: 0d4f145078]
2025-04-11 12:12:53 -07:00
jakpiase
d76ebf9795 [CK_TILE] Add 2:4 structured sparsity support for fp16 gemm (#1957)
* add structured sparsity fp16 support for gemm

* added reviewer suggestions

* update changelog

* update changelog

* add reviewers suggestions

* Minor fix

* clang fix

* fix doxygen

[ROCm/composable_kernel commit: 6c61f4d237]
2025-04-11 12:18:26 +02:00
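
For readers unfamiliar with the term: 2:4 structured sparsity means that every group of four consecutive values holds at most two non-zero entries, so a group can be stored as two values plus two small position indices. The sketch below only illustrates that idea on the host side. `Sparse24Group`, `compress_2_4`, `decompress_2_4`, and the bit layout of `index_bits` are hypothetical and are not taken from the CK_TILE implementation or from the hardware metadata format.

```cpp
#include <array>
#include <cstdint>
#include <utility>

// Compressed form of one group of 4 values under 2:4 sparsity (illustrative
// layout): the two kept values plus their 2-bit positions within the group.
struct Sparse24Group
{
    std::array<float, 2> values;
    std::uint8_t index_bits; // bits [1:0]: position of values[0], bits [3:2]: position of values[1]
};

// Compress a group that is assumed to have at most two non-zero entries.
// If fewer than two are non-zero, zero entries fill the remaining slots.
inline Sparse24Group compress_2_4(const std::array<float, 4>& dense)
{
    int pos[2] = {0, 0};
    int count  = 0;
    // First pass: keep the non-zero entries.
    for(int i = 0; i < 4 && count < 2; ++i)
        if(dense[i] != 0.0f)
            pos[count++] = i;
    // Second pass: pad with zero entries so exactly two slots are used.
    for(int i = 0; i < 4 && count < 2; ++i)
        if(dense[i] == 0.0f)
            pos[count++] = i;
    if(pos[0] > pos[1])
        std::swap(pos[0], pos[1]); // store positions in ascending order

    Sparse24Group g{};
    g.values[0]  = dense[pos[0]];
    g.values[1]  = dense[pos[1]];
    g.index_bits = static_cast<std::uint8_t>(pos[0] | (pos[1] << 2));
    return g;
}

// Expand a compressed group back to four values; dropped positions are zero.
inline std::array<float, 4> decompress_2_4(const Sparse24Group& g)
{
    std::array<float, 4> dense{}; // value-initialized to all zeros
    dense[g.index_bits & 0x3]        = g.values[0];
    dense[(g.index_bits >> 2) & 0x3] = g.values[1];
    return dense;
}
```

Compressing `{0, 3, 0, 5}` keeps the values `3` and `5` with positions 1 and 3, and decompressing restores the original group.
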
slippedJim
959225947a add fmha fwd splitkv receipt for aiter c++ api (#2068)
* add s_randval for c++ api

* Fix bug of bias in splitkv

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

[ROCm/composable_kernel commit: 5f885d2b7a]
2025-04-10 23:21:13 +08:00
Juan Manuel Martinez Caamaño
7a42b06988 Replace inline assembly with builtins in FMHA (#2067)
* Replace inline assembly with builtins in FMHA

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: f14e648e7c]
2025-04-10 09:48:37 +02:00
Illia Silin
7546e4bafe enable gfx115x support (#2065)
[ROCm/composable_kernel commit: 3e6d21adeb]
2025-04-09 10:06:42 -07:00
MHYang-gh
62ce5b906b Make buffer coherence configurable in tensor view (#2041)
* Make buffer coherence configurable in tensor view

* Fix clang-format for tensor_view.hpp

[ROCm/composable_kernel commit: 03ce8729fd]
2025-04-08 15:34:11 -07:00
valarLip
c1d067be5c add passthrough for int32->float32 (#2062)
[ROCm/composable_kernel commit: 2c563fecf7]
2025-04-08 15:16:30 -07:00
Max Podkorytov
26724086f3 simplify generate_tuple (#2043)
[ROCm/composable_kernel commit: 6ce0797dad]
2025-04-08 09:00:51 -07:00
aledudek
6dbaeb5fe8 [CK_TILE] Fix GEMM Memory Pipeline (#2034)
* [CK_TILE] Fix GEMM Memory Pipeline

* Fix transpose tile

* Add comments

[ROCm/composable_kernel commit: 80aae6119b]
2025-04-08 12:40:04 +02:00
Illia Silin
32879114dc fix codegen issues (#2052)
[ROCm/composable_kernel commit: 1793228422]
2025-04-07 07:08:39 -07:00
Illia Silin
ada1b5f341 Split env.hpp header from the ck.hpp header. (#2049)
* split env.hpp out of main headers

* fix namespace logic

[ROCm/composable_kernel commit: 572cd820ce]
2025-04-03 15:30:21 -07:00
Rostyslav Geyyer
7fbc128e83 Add FP16/BF16<->FP8/BF8 conversions (#2035)
* Move conversion functions and add missing conversions

* Add tests

* Add missing conversions

* Add missing conversions

* Add bf8 tests

* Update clipping for vectors

* Add missing conversions

* Add bf16 fp8 tests

* Add bf16 bf8 tests

* Fix device conversion

* Fix conversions

* Fix vector use

* Minor fix

* Add a workaround flag

* Add a workaround flag for bf16 conversion

* Add another workaround

* Add a workaround for fp16 to bf8 conversion

* Update type alias

* Add docstrings and missing wrappers

* Fix if defined macros

* Fix more if defined macros

* Add comments

* Remove __host__ specifier

* Add a gfx950 guard

* Update function naming

[ROCm/composable_kernel commit: 265af71a71]
2025-04-03 12:42:03 -05:00
aledudek
b7359bcfac Post-merge changes for fully async args copy in ck grouped gemm (#1991)
* Post-merge changes for fully async args copy in ck grouped gemm

* Post-merge documentation and naming changes

* Build fix and updated changelog

* Revised comments

[ROCm/composable_kernel commit: 9329432f6c]
2025-04-03 13:35:43 +02:00
Bartłomiej Kocot
169e3cb4f8 Add support for GKCYX grouped conv weight (#2023)
* Grouped conv bwd weight GKCYX support

* fix and changelog

* fix

* fix

* fixes

* comments

* fix

[ROCm/composable_kernel commit: 2ccf914888]
2025-04-02 23:59:49 +02:00
Adam Osewski
5585c3121e Basic docs for universal gemm & ck-tile gemm. (#2014)
* Basic docs for universal gemm & ck-tile gemm.

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Reviewers suggestions.

* Align tparam names in doc with class tparams.

* More reviewers fine tuning ;)

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: e5ad48a784]
2025-04-02 11:03:40 +02:00
Bartłomiej Kocot
ca7ae808d4 Grouped conv backward data GKCYX support (#2029)
* Grouped conv backward data GKCYX support

* profiler

* Converter

* split instances

[ROCm/composable_kernel commit: 8c0ab61ece]
2025-04-01 13:24:38 -07:00
Bartłomiej Kocot
67c3bcfce1 Grouped conv fwd v3 fix for SplitN and G > 1 (#2038)
* Grouped conv fwd v3 fix for SplitN and G > 1

* Remove int8 large test

* Restore int8 test

[ROCm/composable_kernel commit: ec742908bd]
2025-04-01 13:19:35 -07:00
Seunghoon Lee
345ab65612 Fix Windows build. (#2012)
* Remove duplicate using uint64_t.

* Cast before shift.

[ROCm/composable_kernel commit: df32020f93]
2025-04-01 12:22:10 -07:00
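
The "Cast before shift" item describes a common portability pattern. The sketch below shows the general issue only and assumes the fix concerned a shift evaluated in a 32-bit type; the function names are hypothetical and the actual change in the commit is not shown here.

```cpp
#include <cstdint>

// The shift is evaluated in 32 bits, so bits shifted past bit 31 are lost
// before the result is widened to 64 bits.
constexpr std::uint64_t shift_then_widen(std::uint32_t x, int n)
{
    return static_cast<std::uint64_t>(x << n);
}

// The operand is widened first, so the shift is evaluated in 64 bits.
constexpr std::uint64_t widen_then_shift(std::uint32_t x, int n)
{
    return static_cast<std::uint64_t>(x) << n;
}

static_assert(shift_then_widen(0x80000000u, 4) == 0u, "high bits are lost");
static_assert(widen_then_shift(0x80000000u, 4) == 0x800000000ull, "high bits are kept");
```

This matters across platforms because `long` is 32 bits wide on Windows and 64 bits wide on most 64-bit Linux targets, so code that shifts a `long` can behave differently between the two.
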
Max Podkorytov
cf08db04a6 add a fast compilation path for static for (0..N) (#2005)
* add a fast compilation path for static for (0..N)

* Update functional2.hpp

add comment and put range applier into detail namespace

* Update functional.hpp

ditto for ck-tile

* prettify

* prettify more

* add comment

* clang-format

[ROCm/composable_kernel commit: c59a8bb206]
2025-04-01 12:06:25 -07:00
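
The idea of a fast compilation path for a static loop over 0..N can be sketched as follows. This is a minimal illustration, assuming a C++17 fold expression over `std::index_sequence` in place of one template instantiation per iteration. `apply_range`, `static_for_0_to_n`, and `sum_of_squares` are hypothetical names, not the identifiers used in `functional2.hpp` or `functional.hpp`.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

namespace detail {
// Expands to f(integral_constant<0>), ..., f(integral_constant<N-1>) in a
// single fold expression over the comma operator, evaluated left to right.
template <typename F, std::size_t... Is>
constexpr void apply_range(F&& f, std::index_sequence<Is...>)
{
    (f(std::integral_constant<std::size_t, Is>{}), ...);
}
} // namespace detail

// Hypothetical helper: calls f with the compile-time indices 0..N-1 in order.
template <std::size_t N, typename F>
constexpr void static_for_0_to_n(F&& f)
{
    detail::apply_range(std::forward<F>(f), std::make_index_sequence<N>{});
}

// Usage: sum the squares of 0..N-1, with each index known at compile time.
template <std::size_t N>
constexpr std::size_t sum_of_squares()
{
    std::size_t total = 0;
    static_for_0_to_n<N>([&](auto i) { total += i.value * i.value; });
    return total;
}

static_assert(sum_of_squares<4>() == 14, "0 + 1 + 4 + 9");
```

Because the index arrives as a `std::integral_constant`, the loop body can still use it as a template argument or array bound.
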
rocking
01ea8aa249 Reduce redundant space in bias tensor (#2024)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 8a20b62e91]
2025-03-28 21:58:06 +08:00
felix
20ffa0f474 hotfix fix sorting int64 (#2025)
* fix sorting int64

* clang format

* fix example issue

* update WA issue #

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

[ROCm/composable_kernel commit: a82f338fb9]
2025-03-28 11:31:52 +08:00
Rostyslav Geyyer
23ad59e1fd Add MX FP4 device conversion tests (#1889)
* Add conversion tests

* Fix ctor

* Fix nan logic

* Fix conversion logic

* Permute packed f4_t values

* Fix conversion to float, repack vector elements

* Fix device tests

* Permute elements in a vector

* Add a repro test

* Add a conversion for a repro test

* Update test vectors

* Update conversion

* Fix the test

* Update test vector generator

* Fix vector sr conversion

* Permute conversion args

* Update conversion

* Test

* Fix packing

* Simplify conversion function

* Pack conversion in a loop

* Pack conversion in a loop

* Pack another conversion in a loop

* Pack one more conversion in a loop

* Pack the last conversion in a loop

* Clean up

* Add printf to fix intrinsic

* Add a sw-based workaround

[ROCm/composable_kernel commit: 441343a23d]
2025-03-26 19:23:01 -05:00
Bartłomiej Kocot
6ccfb817e4 Add support for GKCYX grouped conv fwd (#2015)
* Add support for GKCYX grouped conv fwd

* fixes

* fix

* changelog

* Fixes

[ROCm/composable_kernel commit: 54c81a1fcf]
2025-03-26 21:13:38 +01:00
Andriy Roshchenko
75ef4c83bf MX GEMM examples with FP8, FP16, and E8M0 scales (#2016)
* Add `scalar_type` specification for E8M0 exponent

* Specialize `nnvb_data_t_selector` for E8M0 exponent

* Remove partial specializations for `scalar_type` of `non_native_vector_base` template

* Reword command line helper string

* Create MX GEMM examples for different scales


[ROCm/composable_kernel commit: 72d888821c]
2025-03-25 15:33:03 -06:00
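
As background for the E8M0 scales mentioned above: in the OCP Microscaling (MX) formats, an E8M0 scale is an 8-bit value with exponent bits only, which encodes 2^(e - 127), with 0xFF reserved for NaN. The sketch below decodes such a scale on the host and applies it to a block of values. `e8m0_to_float` and `apply_block_scale` are hypothetical helpers, not part of the CK API.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode an E8M0 scale: 8 exponent bits, no sign, no mantissa, bias 127.
// The encoding 0xFF is reserved for NaN.
inline float e8m0_to_float(std::uint8_t bits)
{
    if(bits == 0xFF)
        return std::numeric_limits<float>::quiet_NaN();
    return std::ldexp(1.0f, static_cast<int>(bits) - 127);
}

// Multiply a block of already-decoded element values by one shared scale.
inline void apply_block_scale(const float* in, float* out, int n, std::uint8_t scale_bits)
{
    const float scale = e8m0_to_float(scale_bits);
    for(int i = 0; i < n; ++i)
        out[i] = in[i] * scale;
}
```

For example, the scale byte 127 decodes to 1.0 and 130 decodes to 8.0.
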
Max Podkorytov
58789d03d3 use fast path for sequence generation in old CK (#1993)
[ROCm/composable_kernel commit: 1a58522f01]
2025-03-25 11:28:44 -07:00
ruanjm
ce1d20c2c6 [CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861)
* 50ms -> 28ms

* Fix bug in non fuse_add_store cases

* Fine tuned setting for 2 pass pipeline

* adjust workload

* remove unnecessary change

* add layernorm

* Adding output quant and unquant results at the same time.

* fix test

* fix format

* tune for cases 128x640 and 128x1024

* bug fix

[ROCm/composable_kernel commit: d49abdaa87]
2025-03-25 20:09:45 +08:00
Illia Silin
b9e0e7d93e Split up data_type header. (#1996)
* split fp64 vector data type

* add missing header

* move e8m0 structs

* split off numeric_utils header

* fix typo

* split off numeric limits header

* update data_type header

* fix clang format

* split off vector type header

* fix clang format

* fix typo for binary_inf

[ROCm/composable_kernel commit: d2eab23958]
2025-03-24 15:08:54 -07:00
Andriy Roshchenko
bbdd7f6d57 Introduce MX GEMM for FP8 data type (#2000)
[ROCm/composable_kernel commit: 6660dc6b8e]
2025-03-24 15:41:07 -06:00
MHYang-gh
fd151c05d9 Fix A/B lds transform (#2007)
[ROCm/composable_kernel commit: c027637a8f]
2025-03-22 23:13:50 -07:00
Bartłomiej Kocot
ceb078163f Fix split N for large images in grouped conv fwd (#2004)
* Fix split N for large images in grouped conv fwd

* Fix comments

[ROCm/composable_kernel commit: 5b0873c31a]
2025-03-22 23:19:49 +01:00
BingYuan.Zhou
c245d569d5 fix ck_tile/basic_gemm build error (#1988)
[ROCm/composable_kernel commit: 5a0d693b86]
2025-03-20 22:01:14 -07:00
Attila T. Áfra
081e3c7880 Fix compile errors on Windows and Linux (#2002)
* Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous)

* Fix compile error (no matching function for call to 'cast_to_f32_from_f8')

[ROCm/composable_kernel commit: c79bf11148]
2025-03-20 12:37:25 -07:00
carlushuang
23340c5dd5 [CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
* return value with macro and revert the return value

* [CK-TILE] no-macro launch api solution (#1992)

* no-macro solution

* address -Wcomma

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: e3c9886cdf]
2025-03-20 11:00:29 -07:00
jakpiase
f1262b783a [CK_TILE] Switch to universal gemm for batched and grouped gemms (#1919)
* switch to universal gemm for batched and grouped gemms

* added reviewer comments

* fixed grouped gemm tests

[ROCm/composable_kernel commit: 0e91d32c61]
2025-03-20 11:17:04 +01:00
rocking
b0f323c4ec Sync the kname with instance name (#1989)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: b819c217e4]
2025-03-20 00:06:45 +08:00
felix
c2948a0634 Ck moe hot fix (#1979)
* fix useless code and remove useless oob

* clang format

* fix coredump in e2e test

* fix2

* fix clang format

* fix output oob

* clang format

* rm useless comments

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 7eaedeb36c]
2025-03-19 22:58:27 +08:00
aledudek
73d207bd4e Async grouped gemm v3 (#1940)
* Fully async grouped gemm

* Remove commented code

* Remove maybe_unused

* host kernel args

* Checkpoint segfault debugging...

* Working part1

* Working part2

* Remove comments...

* Use void ptr for gemm kernel host args

* Fix device_grouped_gemm_multiple_d_dl build issue

* Fix device_grouped_gemm_xdl build issue

[ROCm/composable_kernel commit: 5095906975]
2025-03-17 16:42:43 +01:00
Bartłomiej Kocot
b8f58a234e Grouped conv bwd data NGCHW (#1967)
* Grouped conv bwd data NGCHW

* fixes

* fix

* Improvements

* Fix

* Fix

* add client example

[ROCm/composable_kernel commit: c2e4898b4b]
2025-03-17 13:32:00 +01:00
carlushuang
f2dd57b76f Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit b92caa3d84.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 3e81279d26]
2025-03-13 11:41:39 +08:00
feli
e3c5b2ae80 ck_moe: fix useless code and remove useless oob (#1972)
* fix useless code and remove useless oob

* clang format

---------

Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 251afab3b7]
2025-03-12 09:22:42 -07:00
Illia Silin
a7614ad594 use old intrinsics with staging compiler (#1970)
[ROCm/composable_kernel commit: 4c97cc511e]
2025-03-12 07:29:09 -07:00
Illia Silin
b92caa3d84 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 45fbd9210a.

[ROCm/composable_kernel commit: 8cbcd3e0d0]
2025-03-11 10:40:18 -07:00
Haocong WANG
1ed0b74c43 [Block Scale GEMM] Optimized block scale gemm (#1950)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipeline v1 and v2. v1 works for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

* split weight preshuffle library targets

* bring back enable-post-misched=0

* fix build issues for gemm_multiply_multiply_fp8 instances

* fix clang format

* add verbose build flag when building for all targets

* reduce path names for new instances

* fix paths in cmake

* refactor gemm_multiply_multiply library target

* fix a bug in example

* fix example 65 cmake

* reduce the number of threads when building libs for all targets to 50

* use ninja to build for all targets

* reduce the number of threads when building for all targets

* reduce the number of threads to 32 when building libs for all targets to 50

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: cbd74c2d12]
2025-03-11 10:11:21 -07:00
carlushuang
45fbd9210a [CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)
* support hdim=192/128 pair

* remove useless print

* update

[ROCm/composable_kernel commit: 7a93b16ff6]
2025-03-11 21:07:40 +08:00
Mingtao Gu
fc98615212 Ck int4 moe develop (#1949)
* Add Gemm fp8xint4 example and kernel, function pass.

* Init Gemm_fp8xint4 Bpreshuffle

* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet

* General fix.

* fp8xint4 bpreshuffle function pass

* fix.

* init b preshuffle dequant in VGPR.

* fix bug, function pass.

* move b thread dequant copy to blockwise.

* fix bug, function now passes.

* modified the tile size to 256, 128x128x128.

* fixed a bug.

* Initial int4 moe, compile pass, function not checked.

* fix bug in moe_gemm1.cpp, now function pass.

* test expert = 8 and function pass.

* Added moe_pk_i4_gemm2, function pass.

* Added b preshuffle pipeline v3 support.

* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.

* Split the blockwise pipeline for fp8xint4.

* commit missing files

* opt gemm2 to 2x2 wave

* fix swizzle = false

* update int4 moe with latest input changes.

* update tile size.

* enable pipeline v3.

* fix nswizzle = true

* commit a version for compiler debug.

* Updated transfer_v3r1_gather to support pk_i4_t type.

* for int4 moe2 for type_convert support.

* remove some values between mfma instructions.

* fix int4 moe

* Updated transfer_v3r1_gather to support pk_i4_t type.

* i4 support lds multiple shuffle

* fixed int4 moe tflops calculation.

* Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle

* updated gemm2.

* change int4 moe example names

* fix and format code.

* format.

* format codes.

* update fp8xint4 example tile size.

* add <unordered_map> header

* fixed.

* format.

* Added conditional compilation for int4 -> fp8 conversion kernels

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 0db7c8f0b2]
2025-03-10 11:16:44 +08:00
Thomas Ning
ed0649e4e6 Fix on the error (#1956)
[ROCm/composable_kernel commit: 9d51d17dd0]
2025-03-07 13:43:52 -08:00