Commit Graph

589 Commits

Author SHA1 Message Date
Anton Gorenko
db016cf6da DeviceGemm_Wmma_CShuffleV3 with BlockGemmPipelineVersion::v3 (#2096)
* Prepare files for DeviceGemm_Wmma_CShuffleV3

* Implement main part of CShuffleV3 with block pipeline v3 for WMMA

* Remove unused functions and template params for A/B descriptors

* Support both gfx11 and gfx12

* Enable SplitK for gfx12 and disable for gfx11

* Added RowColRow layout for DeviceGemmV2 fp16

* Added more instances for Row, Col, Row data layout

* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout

* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout

* Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout

* Fix formatting

* Add documentation

Based on 5585c3121e

* Enable gemm_universal profiling for gfx11/12

* Add WMMA intrinsics for F8/BF8

* Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances

* Add BF16 instances and tests

* Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>

[ROCm/composable_kernel commit: edd92fc546]
2025-04-28 10:14:21 +05:00
Illia Silin
93caa99277 make code compliant with std=c++20 (#2123)
[ROCm/composable_kernel commit: 01cb8379cd]
2025-04-24 10:14:52 -07:00
lalala-sh
7748793e09 Moe gemm activation (#2026)
* fix useless code and remove usless oob

* clang format

* fix coredump in e2e test

* fix2

* fix clang format

* fix output oob

* impl int64 but result not correct

* int64 index ok now

* input output all ok

* fix uint32

* revert v1 test

* use uint32

* mork to support 13w tokens

* moe sorting fix moebuf

* fix merge

* update moe api fix aiter build

* fix buid

* fuse silu

* silu ok

* acale ok

* add silu

* change code

* gemm2 ok

* gufusion compatible ok, fix warnings

* gu fusion for m32 m64 ok

* support bf16 cshuffle

* i4 gemm2 ok

* i4 gemm2 ok and i4 gemm1 build

* 16x16 run ok

* change flops; change cshuffle dtype

* fuse gelu silu act in moe gemm1

* fp8 with act ready

* int4 act ready

* remove useless changes

* remove useless code change

* fix clang format

* add the arch limit of int4 moe gemm

* fuse moe activation

* fix fp8 16x16

* fix no quant case

* fix bugs

* fix fp8 gufusion bug

* remove useless comments

* refine activation code & complete moe example

* fix int8 bugs

* merge tkw1

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: root <root@hjbog-srdc-51.amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 39ba03f25d]
2025-04-23 10:35:34 +08:00
Rostyslav Geyyer
9a8af1c82a Temporarily disable MX FP4 device tests (#2112)
[ROCm/composable_kernel commit: 416e851584]
2025-04-22 16:08:48 -05:00
Thomas Ning
79b01b0824 GEMM Multiply Multiply Fix (#2102)
* fix the type convert and increase the BF16 conversion + the profile comment

* fix the CI

[ROCm/composable_kernel commit: 0cca8fa28f]
2025-04-22 01:13:22 -07:00
Muhammed Emin Ozturk
16fd74e8d3 MI308 fix for streamk 1-Tile floating point exception (#2101)
[ROCm/composable_kernel commit: b092c18da7]
2025-04-21 11:44:07 -07:00
lalala-sh
2d0b5aba13 enable do top k weights in moe stage1 gemm (#2094)
* add switch for mul topk weights

* fix bf16/f16 bugs

* complete

[ROCm/composable_kernel commit: bcf5bb41be]
2025-04-18 10:45:49 +08:00
Andriy Roshchenko
6d0890b6f4 MX GEMM - Parameterized Test Template (#2088)
* Tests for MX FP8 GEMM

* Improve documentation


[ROCm/composable_kernel commit: 213b203a3c]
2025-04-16 19:56:00 -06:00
Andriy Roshchenko
348760d56e MX GEMM - Add MX BF8 example (#2071)
* Add MX GEMM example for MX BF8

* Verified MX FP8 with 16x16x128 scale builtin

* Verify MX BF8 GEMM with BF16 output


[ROCm/composable_kernel commit: da54464cce]
2025-04-16 15:25:02 -06:00
Andriy Roshchenko
d9c9f17c3d MX GEMM - New GEMM pipeline for MX data types (#2059)
* Allow selection of mfma_scale instructions

* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order

* Add constexpr and synchronize return type for `get_exponent_value`

* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`

* Add support for microscaling instructions in `XdlopsGemm`

* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper

* Remove software implementation of MX GEMM

* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction

* Update README

* Updated CHANGELOG

* Remove unused static methods

[ROCm/composable_kernel commit: 7106976a72]
2025-04-15 17:17:07 -06:00
Mingtao Gu
e8db9f0220 CK pk_i4_t test failures fix (SWDEV-518629) (#2075)
* fix pk_i4_v3 tests failures in Unbuntu env.

* fix pk_i4_t tests failure on Unbuntu issues.

* some fixed.

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: 56378f810f]
2025-04-14 16:58:57 +08:00
Illia Silin
90612d0e37 Fix build issues for multiple targets. (#2077)
* build for multiple targets on gfx942

* add missing ignore statements

[ROCm/composable_kernel commit: 0d4f145078]
2025-04-11 12:12:53 -07:00
Illia Silin
7546e4bafe enable gfx115x support (#2065)
[ROCm/composable_kernel commit: 3e6d21adeb]
2025-04-09 10:06:42 -07:00
valarLip
c1d067be5c add passthrough for int32->float32 (#2062)
[ROCm/composable_kernel commit: 2c563fecf7]
2025-04-08 15:16:30 -07:00
Max Podkorytov
26724086f3 simplify generate_tuple (#2043)
[ROCm/composable_kernel commit: 6ce0797dad]
2025-04-08 09:00:51 -07:00
Illia Silin
32879114dc fix codegen issues (#2052)
[ROCm/composable_kernel commit: 1793228422]
2025-04-07 07:08:39 -07:00
Illia Silin
ada1b5f341 Split env.hpp header from the ck.hpp header. (#2049)
* split env.hpp out of main headers

* fix namespace logic

[ROCm/composable_kernel commit: 572cd820ce]
2025-04-03 15:30:21 -07:00
Rostyslav Geyyer
7fbc128e83 Add FP16/BF16<->FP8/BF8 conversions (#2035)
* Move conversion functions and add missing conversions

* Add tests

* Add missing conversions

* Add missing conversions

* Add bf8 tests

* Update clipping for vectors

* Add missing conversions

* Add bf16 fp8 tests

* Add bf16 bf8 tests

* Fix device conversion

* Fix conversions

* Fix vector use

* Minor fix

* Add a workaround flag

* Add a workaround flag for bf16 conversion

* Add another workaround

* Add a workaround for fp16 to bf8 conversion

* Update type alias

* Add docstrings and missing wrappers

* Fix if defined macros

* Fix more if defined macros

* Add comments

* Remove __host__ specifier

* Add a gfx950 guard

* Update function naming

[ROCm/composable_kernel commit: 265af71a71]
2025-04-03 12:42:03 -05:00
aledudek
b7359bcfac Post-merge changes for fully async args copy in ck grouped gemm (#1991)
* Post-merge changes for fully async args copy in ck grouped gemm

* Post-merge documentation and naming changes

* Build fix and updated changelog

* Revised comments

[ROCm/composable_kernel commit: 9329432f6c]
2025-04-03 13:35:43 +02:00
Bartłomiej Kocot
169e3cb4f8 Add support for GKCYX grouped conv weight (#2023)
* Grouped conv bwd weight GKCYX support

* fix and changelog

* fix

* fix

* fixes

* comments

* fix

[ROCm/composable_kernel commit: 2ccf914888]
2025-04-02 23:59:49 +02:00
Adam Osewski
5585c3121e Basic docs for universal gemm & ck-tile gemm. (#2014)
* Basic docs for universal gemm & ck-tile gemm.

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Reviewers suggestions.

* Align tparam names in doc with class tparams.

* More reviewers fine tuning ;)

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: e5ad48a784]
2025-04-02 11:03:40 +02:00
Bartłomiej Kocot
ca7ae808d4 Grouped conv backward data GKCYX support (#2029)
* Grouped conv backward data GKCYX support

* profiler

* Converter

* split instances

[ROCm/composable_kernel commit: 8c0ab61ece]
2025-04-01 13:24:38 -07:00
Bartłomiej Kocot
67c3bcfce1 Grouped conv fwd v3 fix for SplitN an G > 1 (#2038)
* Grouped conv fwd v3 fix for SplitN an G > 1

* Remove int8 large test

* Retore int8 test

[ROCm/composable_kernel commit: ec742908bd]
2025-04-01 13:19:35 -07:00
Seunghoon Lee
345ab65612 Fix Windows build. (#2012)
* Remove duplicate using uint64_t.

* Cast before shift.

[ROCm/composable_kernel commit: df32020f93]
2025-04-01 12:22:10 -07:00
Max Podkorytov
cf08db04a6 add a fast compilation path for static for (0..N) (#2005)
* add a fast compilation path for static for (0..N)

* Update functional2.hpp

add comment and put range applier into detail namespace

* Update functional.hpp

ditto for ck-tile

* prettify

* prettify more

* add comment

* clang-format

[ROCm/composable_kernel commit: c59a8bb206]
2025-04-01 12:06:25 -07:00
Rostyslav Geyyer
23ad59e1fd Add MX FP4 device conversion tests (#1889)
* Add conversion tests

* Fix ctor

* Fix nan logic

* Fix conversion logic

* Permute packed f4_t values

* Fix conversion to float, repack vector elements

* Fix device tests

* Permute elements in a vector

* Add a repro test

* Add a conversion for a repro test

* Update test vectors

* Update conversion

* Fix the test

* Update test vector generator

* Fix vector sr conversion

* Permute conversion args

* Update conversion

* Test

* Fix packing

* Simplify conversion function

* Pack conversion in a loop

* Pack conversion in a loop

* Pack another conversion in a loop

* Pack one more conversion in a loop

* Pack the last conversion in a loop

* Clean up

* Add printf to fix intrinsic

* Add a sw-based workaround

[ROCm/composable_kernel commit: 441343a23d]
2025-03-26 19:23:01 -05:00
Bartłomiej Kocot
6ccfb817e4 Add support for GKCYX grouped conv fwd (#2015)
* Add support for GKCYX grouped conv fwd

* fixes

* fix

* changelog

* Fixes

[ROCm/composable_kernel commit: 54c81a1fcf]
2025-03-26 21:13:38 +01:00
Andriy Roshchenko
75ef4c83bf MX GEMM examples with FP8, FP16, and E8M0 scales (#2016)
* Add `scalar_type` specification for E8M0 exponent

* Specialize `nnvb_data_t_selector` for E8M0 exponent

* Remove partial specializations for `scalar_type` of `non_native_vector_base` template

* Reword command line helper string

* Create MX GEMM examples for different scales


[ROCm/composable_kernel commit: 72d888821c]
2025-03-25 15:33:03 -06:00
Max Podkorytov
58789d03d3 use fast path for sequence generation in old CK (#1993)
[ROCm/composable_kernel commit: 1a58522f01]
2025-03-25 11:28:44 -07:00
Illia Silin
b9e0e7d93e Split up data_type header. (#1996)
* split fp64 vector data type

* add missing header

* move e8m0 structs

* split off numeric_utils header

* fix typo

* split off numeric limits header

* update data_type header

* fix clang format

* split off vector type header

* fix clang format

* fix typo for binary_inf

[ROCm/composable_kernel commit: d2eab23958]
2025-03-24 15:08:54 -07:00
Andriy Roshchenko
bbdd7f6d57 Introduce MX GEMM for FP8 data type (#2000)
[ROCm/composable_kernel commit: 6660dc6b8e]
2025-03-24 15:41:07 -06:00
Bartłomiej Kocot
ceb078163f Fix split N for large images in groupd conv fwd (#2004)
* Fix split N for large images in groupd conv fwd

* Fix comments

[ROCm/composable_kernel commit: 5b0873c31a]
2025-03-22 23:19:49 +01:00
Attila T. Áfra
081e3c7880 Fix compile errors on Windows and Linux (#2002)
* Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous)

* Fix compile error (no matching function for call to 'cast_to_f32_from_f8')

[ROCm/composable_kernel commit: c79bf11148]
2025-03-20 12:37:25 -07:00
felix
c2948a0634 Ck moe hot fix (#1979)
* fix useless code and remove usless oob

* clang format

* fix coredump in e2e test

* fix2

* fix clang format

* fix output oob

* clang format

* rm useless comments

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 7eaedeb36c]
2025-03-19 22:58:27 +08:00
aledudek
73d207bd4e Async grouped gemm v3 (#1940)
* Fully async grouped gemm

* Remove commented code

* Remvoe maybe_unused

* host kernel args

* Checkpoint segfault debugging...

* Working part1

* Working part2

* Remvoe comments...

* Use void ptr for gemm kernel host args

* Fix device_grouped_gemm_multiple_d_dl build issue

* Fix device_grouped_gemm_xdl build issue

[ROCm/composable_kernel commit: 5095906975]
2025-03-17 16:42:43 +01:00
Bartłomiej Kocot
b8f58a234e Grouped conv bwd data NGCHW (#1967)
* Grouped conv bwd data NGCHW

* fixes

* fix

* Improvements

* Fix

* Fix

* add client example

[ROCm/composable_kernel commit: c2e4898b4b]
2025-03-17 13:32:00 +01:00
feli
e3c5b2ae80 ck_moe: fix useless code and remove usless oob (#1972)
* fix useless code and remove usless oob

* clang format

---------

Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 251afab3b7]
2025-03-12 09:22:42 -07:00
Illia Silin
a7614ad594 use old instrinsics with staging compiler (#1970)
[ROCm/composable_kernel commit: 4c97cc511e]
2025-03-12 07:29:09 -07:00
Haocong WANG
1ed0b74c43 [Block Scale GEMM] Optimized block scale gemm (#1950)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipelie v1 and v2. v1 is work for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

* split weight preshuffle library targets

* bring back enable-post-misched=0

* fix build issues for gemm_multiply_multiply_fp8 instances

* fix clang format

* add verbose build flag when building for all targets

* reduce path names for new instances

* fix paths in cmake

* refactor gemm_multiply_multiply library target

* fix a bug in example

* fix example 65 cmake

* reduce the number of threads when building libs for all targets to 50

* use ninja to build for all targets

* reduce teh number of threads when building for all targets

* reduce the number of threads to 32 when building libs for all targets to 50

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: cbd74c2d12]
2025-03-11 10:11:21 -07:00
Mingtao Gu
fc98615212 Ck int4 moe develop (#1949)
* Add Gemm fp8xint4 example and kernel, function pass.

* Init Gemm_fp8xint4 Bpreshuffle

* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet

* General fix.

* fp8xint4 bpreshuffle function pass

* fix.

* init b preshuffle dequant in VGPR.

* fix bug, function pass.

* move b thread dequant copy to blockwise.

* fix bug, function now passes.

* modified the tile size to 256, 128x128x128.

* fixed a bug.

* Initial int4 moe, compile pass, function not check.

* fix bug in moe_gemm1.cpp, now function pass.

* test expert = 8 and function pass.

* Added moe_pk_i4_gemm2, function pass.

* Added b preshuffle pipeline v3 support.

* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.

* Split the blockwise pipeline for fp8xint4.

* commit missing files

* opt gemm2 to 2x2 wave

* fix swizzle = false

* update int4 moe with latest input changes.

* update tile size.

* enable pipeline v3.

* fix nswizzle = true

* commit a version for compiler debug.

* Updated transfer_v3r1_gather to support pk_i4_t type.

* for int4 moe2 for type_convert support.

* remove some values between mfma instructions.

* fix int4 moe

* Updated transfer_v3r1_gather to support pk_i4_t type.

* i4 support lds multiple shuffle

* fixed int4 moe tflops calculation.

* Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle

* updated gemm2.

* change int4 moe example names

* fix and format code.

* format.

* format codes.

* update fp8xint4 example tile size.

* add <unordered_map> header

* fixed.

* format.

* Added conditional compilation for int4 -> fp8 conversion kernels

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 0db7c8f0b2]
2025-03-10 11:16:44 +08:00
Thomas Ning
ed0649e4e6 Fix on the error (#1956)
[ROCm/composable_kernel commit: 9d51d17dd0]
2025-03-07 13:43:52 -08:00
carlushuang
31b9e40b3d [CK_TILE][HOTFIX] WA for address space by disable it completely (#1947)
* port all moe changes from ck_moe_gemm branch

* refine codes in the pr

* fix tail odd

* fix clang format

* fix clang format2

* make hot loop scheduler compatible with 16x16 and 32x32

* clang format

* fix per token quant

* rename moe example

* clang format

* WA for address space by disable it completely

* hot fix moe gemm2

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>

[ROCm/composable_kernel commit: c12fb0a624]
2025-03-06 12:01:25 +08:00
Illia Silin
9d24409070 Replace buffer load/store intrinsics with builtins (#1876)
* replace buffer load/store intrinsics with builtins

* fix clang format

* replace buffer load/store intrinsics with built-ins in ck_tile

* fix clang format

* add switch between buffer intrinsics and built-ins

* change the builtins threshold to clang20

* fix clang format

* fix some compilation errors

* revert changes in ck_tile

* revert changes in ck_tile

* delete all root files and folders when CI completes

* try changing the username in CI

* fix groovy syntax

* add user and group id info to ci dockers

* change ownership of all files in CI to jenkins at the end

* update changelog

[ROCm/composable_kernel commit: a88bf76ecc]
2025-03-05 14:33:28 -08:00
Illia Silin
53b89436b2 remove support for gfx940 and gfx941 targets (#1944)
* remove support for gfx940 and gfx941 targets

* update changelog

[ROCm/composable_kernel commit: 9b51c08bf7]
2025-03-05 11:07:33 -08:00
feli
6fd94cff45 ck moe gemm implement (#1936)
* port all moe changes from ck_moe_gemm branch

* refine codes in the pr

* fix tail odd

* fix clang format

* fix clang format2

* make hot loop scheduler compatible with 16x16 and 32x32

* clang format

* fix per token quant

* rename moe example

* clang format

---------

Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 3786e16375]
2025-03-05 15:56:55 +08:00
jefyang1
dfd15c220d Remove CK_USE_AMD_MFMA_GFX950 (#1935)
* Add runtime check in example_gemm_xdl_streamk for gfx950

* Add runtime check in grouped conv fwd examples for gfx950

* Disable CK_USE_AMD_MFMA_GFX950

* Add new instances for gfx950

* Fix test_gemm_universal on gfx950

[ROCm/composable_kernel commit: c95bda93ba]
2025-03-04 10:32:25 -08:00
arai713
afc6806202 MIGraphX hipRTC fix (#1923)
* fixed hiprtc compilation issues from new additions, removed clashing mixed precision functionality from codegen(ignore the whole file)

* fixed device op error: misplaced header guard

* restrict virtual function use in device_gemm_multiple_d file for codegen hiprtc compilation

* add CK_CODE_GEN_RTC flag for compilation, since this flag has wider coverage for hiprtc compilation

* fixed conditional error in amd_ck_fp8.hpp

* Add MaskOutUpperTriangle as a problem parameter to
BatchedGemmSoftmaxGemm and disable tests with
MaskOutUpperTriangle==True.

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>

---------

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Mirza Halilcevic <mirza.halilcevic@amd.com>

[ROCm/composable_kernel commit: fd06ed926c]
2025-03-03 07:55:05 -08:00
asleepzzz
88ce49f185 Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913)" (#1933)
This reverts commit 1d09b0928c.

[ROCm/composable_kernel commit: ef16010273]
2025-03-03 07:17:39 -08:00
Haocong WANG
1d09b0928c [BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipelie v1 and v2. v1 is work for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>

[ROCm/composable_kernel commit: 020148d0f7]
2025-02-25 15:42:20 +08:00
coconutruben
eb914952cc device_prop.hpp - replace map with compile time hash and switch (#1898)
* device_prop.hpp - replace map with compile time hash and switch

Summary:

We replace a static const map with a compile time hash function
and a switch statement to achieve the same goal: translate names
to architectures. Most of these are very old, however the function
needs to continue to work.

Why? because the static map can cause issues when compiling into
libraries that get dynamically loaded/unloaded, leading to memory
corruption

Test Plan:

Running pytorch `torch.compile()` with CK enabled, and seeing it not
segfault on the 2nd kernel (1st reload of the library)

Reviewers:

Subscribers:

Tasks:

Tags:

* clang-format

[ROCm/composable_kernel commit: fcd4a6f3d1]
2025-02-24 09:57:55 -08:00