lalala-sh
2d0b5aba13
enable top-k weights in moe stage1 gemm ( #2094 )
...
* add switch for mul topk weights
* fix bf16/f16 bugs
* complete
[ROCm/composable_kernel commit: bcf5bb41be ]
2025-04-18 10:45:49 +08:00
Andriy Roshchenko
6d0890b6f4
MX GEMM - Parameterized Test Template ( #2088 )
...
* Tests for MX FP8 GEMM
* Improve documentation
[ROCm/composable_kernel commit: 213b203a3c ]
2025-04-16 19:56:00 -06:00
Andriy Roshchenko
348760d56e
MX GEMM - Add MX BF8 example ( #2071 )
...
* Add MX GEMM example for MX BF8
* Verified MX FP8 with 16x16x128 scale builtin
* Verify MX BF8 GEMM with BF16 output
[ROCm/composable_kernel commit: da54464cce ]
2025-04-16 15:25:02 -06:00
BingYuan.Zhou
4ec293cb4b
[flatmm] implement basic fp16 flatmm ( #2089 )
...
* [flatmm] implement basic fp16 flatmm
* fix CI build fail
---------
Co-authored-by: root <root@hjbog-srdc-50.amd.com>
Co-authored-by: solin <bingzhou@amd.com>
[ROCm/composable_kernel commit: eaf1f0bf3b ]
2025-04-16 16:51:17 +08:00
Andriy Roshchenko
d9c9f17c3d
MX GEMM - New GEMM pipeline for MX data types ( #2059 )
...
* Allow selection of mfma_scale instructions
* Read B tensor from LDS to VGPR in chunks of 16 in MFMA order
* Add constexpr and synchronize return type for `get_exponent_value`
* Pass scales by reference and add comments to `mfma_scale_f32_32x32x64`
* Add support for microscaling instructions in `XdlopsGemm`
* Fix `mfma_scale_f32_16x16x128f8f6f4` wrapper
* Remove software implementation of MX GEMM
* Make interface of `intrin_mfma_scale_f32_16x16x128f8f6f4<16, 16>` consistent with the other scale instruction
* Update README
* Updated CHANGELOG
* Remove unused static methods
[ROCm/composable_kernel commit: 7106976a72 ]
2025-04-15 17:17:07 -06:00
Mingtao Gu
e8db9f0220
CK pk_i4_t test failures fix (SWDEV-518629) ( #2075 )
...
* fix pk_i4_v3 test failures in Ubuntu env.
* fix pk_i4_t test failures on Ubuntu.
* some fixes.
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
[ROCm/composable_kernel commit: 56378f810f ]
2025-04-14 16:58:57 +08:00
Thomas Ning
1b61d3a0ed
Solve the Static Encoding Pattern compile error when the tile size is too small ( #2079 )
...
[ROCm/composable_kernel commit: 269f4f6af5 ]
2025-04-13 20:09:30 -07:00
Illia Silin
90612d0e37
Fix build issues for multiple targets. ( #2077 )
...
* build for multiple targets on gfx942
* add missing ignore statements
[ROCm/composable_kernel commit: 0d4f145078 ]
2025-04-11 12:12:53 -07:00
jakpiase
d76ebf9795
[CK_TILE] Add 2:4 structured sparsity support for fp16 gemm ( #1957 )
...
* add structured sparsity fp16 support for gemm
* added reviewer suggestions
* update changelog
* update changelog
* add reviewers suggestions
* Minor fix
* clang fix
* fix doxygen
[ROCm/composable_kernel commit: 6c61f4d237 ]
2025-04-11 12:18:26 +02:00
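PR #1957 above adds 2:4 structured sparsity, where each group of four elements keeps at most two non-zeros. A minimal host-side sketch of compressing one group (hypothetical helper names, not the CK_TILE kernel code):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical illustration of 2:4 structured sparsity: each group of four
// values keeps only its two largest-magnitude elements plus their 2-bit
// positions. Host-side sketch only; the real kernels operate on packed
// device-side metadata.
struct Compressed2of4
{
    std::array<float, 2> values;  // the two kept elements
    std::array<uint8_t, 2> index; // their positions within the group (0..3)
};

inline Compressed2of4 compress_group(const std::array<float, 4>& g)
{
    // Select the two largest-magnitude positions.
    uint8_t first = 0, second = 1;
    if(std::abs(g[second]) > std::abs(g[first]))
        std::swap(first, second);
    for(uint8_t i = 2; i < 4; ++i)
    {
        if(std::abs(g[i]) > std::abs(g[first]))
        {
            second = first;
            first  = i;
        }
        else if(std::abs(g[i]) > std::abs(g[second]))
        {
            second = i;
        }
    }
    if(first > second)
        std::swap(first, second); // keep indices in ascending order
    return {{g[first], g[second]}, {first, second}};
}
```

The GEMM then reads only the compressed operand plus the index metadata, halving memory traffic for the sparse matrix.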
slippedJim
959225947a
add fmha fwd splitkv receipt for aiter c++ api ( #2068 )
...
* add s_randval for c++ api
* Fix bug of bias in splitkv
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
[ROCm/composable_kernel commit: 5f885d2b7a ]
2025-04-10 23:21:13 +08:00
Juan Manuel Martinez Caamaño
7a42b06988
Replace inline assembly with builtins in FMHA ( #2067 )
...
* Replace inline assembly with builtins in FMHA
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
[ROCm/composable_kernel commit: f14e648e7c ]
2025-04-10 09:48:37 +02:00
Illia Silin
7546e4bafe
enable gfx115x support ( #2065 )
...
[ROCm/composable_kernel commit: 3e6d21adeb ]
2025-04-09 10:06:42 -07:00
MHYang-gh
62ce5b906b
Make buffer coherence configurable in tensor view ( #2041 )
...
* Make buffer coherence configurable in tensor view
* Fix clang-format for tensor_view.hpp
[ROCm/composable_kernel commit: 03ce8729fd ]
2025-04-08 15:34:11 -07:00
valarLip
c1d067be5c
add passthrough for int32->float32 ( #2062 )
...
[ROCm/composable_kernel commit: 2c563fecf7 ]
2025-04-08 15:16:30 -07:00
Max Podkorytov
26724086f3
simplify generate_tuple ( #2043 )
...
[ROCm/composable_kernel commit: 6ce0797dad ]
2025-04-08 09:00:51 -07:00
aledudek
6dbaeb5fe8
[CK_TILE] Fix GEMM Memory Pipeline ( #2034 )
...
* [CK_TILE] Fix GEMM Memory Pipeline
* Fix transpose tile
* Add comments
[ROCm/composable_kernel commit: 80aae6119b ]
2025-04-08 12:40:04 +02:00
Illia Silin
32879114dc
fix codegen issues ( #2052 )
...
[ROCm/composable_kernel commit: 1793228422 ]
2025-04-07 07:08:39 -07:00
Illia Silin
ada1b5f341
Split env.hpp header from the ck.hpp header. ( #2049 )
...
* split env.hpp out of main headers
* fix namespace logic
[ROCm/composable_kernel commit: 572cd820ce ]
2025-04-03 15:30:21 -07:00
Rostyslav Geyyer
7fbc128e83
Add FP16/BF16<->FP8/BF8 conversions ( #2035 )
...
* Move conversion functions and add missing conversions
* Add tests
* Add missing conversions
* Add missing conversions
* Add bf8 tests
* Update clipping for vectors
* Add missing conversions
* Add bf16 fp8 tests
* Add bf16 bf8 tests
* Fix device conversion
* Fix conversions
* Fix vector use
* Minor fix
* Add a workaround flag
* Add a workaround flag for bf16 conversion
* Add another workaround
* Add a workaround for fp16 to bf8 conversion
* Update type alias
* Add docstrings and missing wrappers
* Fix if defined macros
* Fix more if defined macros
* Add comments
* Remove __host__ specifier
* Add a gfx950 guard
* Update function naming
[ROCm/composable_kernel commit: 265af71a71 ]
2025-04-03 12:42:03 -05:00
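The clipping bullets in #2035 imply a saturation step before narrowing to 8 bits; a sketch under the assumption of OCP E4M3 (largest finite value 448) — the helper name is illustrative, not CK's API:

```cpp
#include <algorithm>
#include <cassert>

// Illustrative saturation before an FP8 (E4M3) downcast: clamp to the
// finite E4M3 range so over-range FP16/BF16 inputs map to +/-448 rather
// than overflowing. Not the CK implementation.
constexpr float e4m3_max = 448.0f;

inline float saturate_to_e4m3_range(float x)
{
    return std::min(std::max(x, -e4m3_max), e4m3_max);
}
```

Vector variants apply the same clamp elementwise before the packed conversion intrinsic.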
aledudek
b7359bcfac
Post-merge changes for fully async args copy in ck grouped gemm ( #1991 )
...
* Post-merge changes for fully async args copy in ck grouped gemm
* Post-merge documentation and naming changes
* Build fix and updated changelog
* Revised comments
[ROCm/composable_kernel commit: 9329432f6c ]
2025-04-03 13:35:43 +02:00
Bartłomiej Kocot
169e3cb4f8
Add support for GKCYX grouped conv weight ( #2023 )
...
* Grouped conv bwd weight GKCYX support
* fix and changelog
* fix
* fix
* fixes
* comments
* fix
[ROCm/composable_kernel commit: 2ccf914888 ]
2025-04-02 23:59:49 +02:00
Adam Osewski
5585c3121e
Basic docs for universal gemm & ck-tile gemm. ( #2014 )
...
* Basic docs for universal gemm & ck-tile gemm.
* Apply review suggestions to include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp and include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp
* Reviewers suggestions.
* Align tparam names in doc with class tparams.
* More reviewers fine tuning ;)
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
[ROCm/composable_kernel commit: e5ad48a784 ]
2025-04-02 11:03:40 +02:00
Bartłomiej Kocot
ca7ae808d4
Grouped conv backward data GKCYX support ( #2029 )
...
* Grouped conv backward data GKCYX support
* profiler
* Converter
* split instances
[ROCm/composable_kernel commit: 8c0ab61ece ]
2025-04-01 13:24:38 -07:00
Bartłomiej Kocot
67c3bcfce1
Grouped conv fwd v3 fix for SplitN and G > 1 ( #2038 )
...
* Grouped conv fwd v3 fix for SplitN and G > 1
* Remove int8 large test
* Restore int8 test
[ROCm/composable_kernel commit: ec742908bd ]
2025-04-01 13:19:35 -07:00
Seunghoon Lee
345ab65612
Fix Windows build. ( #2012 )
...
* Remove duplicate using uint64_t.
* Cast before shift.
[ROCm/composable_kernel commit: df32020f93 ]
2025-04-01 12:22:10 -07:00
Max Podkorytov
cf08db04a6
add a fast compilation path for static for (0..N) ( #2005 )
...
* add a fast compilation path for static for (0..N)
* Update functional2.hpp
add comment and put range applier into detail namespace
* Update functional.hpp
ditto for ck-tile
* prettify
* prettify more
* add comment
* clang-format
[ROCm/composable_kernel commit: c59a8bb206 ]
2025-04-01 12:06:25 -07:00
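The fast path in #2005 replaces a recursive compile-time loop; a sketch (hypothetical names, C++17, not the actual `ck::static_for` code) of expanding 0..N with a single index sequence and a fold expression:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical sketch of a non-recursive static_for(0..N): one
// index_sequence expansion and a fold expression call f once per index,
// avoiding N nested template instantiations.
template <typename F, std::size_t... Is>
constexpr void static_for_impl(F&& f, std::index_sequence<Is...>)
{
    (f(std::integral_constant<std::size_t, Is>{}), ...);
}

template <std::size_t N, typename F>
constexpr void static_for_fast(F&& f)
{
    static_for_impl(std::forward<F>(f), std::make_index_sequence<N>{});
}
```

Each call site still receives a distinct `integral_constant`, so indices remain usable in constant expressions inside the functor.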
rocking
01ea8aa249
Reduce redundant space in bias tensor ( #2024 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: 8a20b62e91 ]
2025-03-28 21:58:06 +08:00
felix
20ffa0f474
hotfix fix sorting int64 ( #2025 )
...
* fix sorting int64
* clang format
* fix example issue
* update WA issue #
---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
[ROCm/composable_kernel commit: a82f338fb9 ]
2025-03-28 11:31:52 +08:00
Rostyslav Geyyer
23ad59e1fd
Add MX FP4 device conversion tests ( #1889 )
...
* Add conversion tests
* Fix ctor
* Fix nan logic
* Fix conversion logic
* Permute packed f4_t values
* Fix conversion to float, repack vector elements
* Fix device tests
* Permute elements in a vector
* Add a repro test
* Add a conversion for a repro test
* Update test vectors
* Update conversion
* Fix the test
* Update test vector generator
* Fix vector sr conversion
* Permute conversion args
* Update conversion
* Test
* Fix packing
* Simplify conversion function
* Pack conversion in a loop
* Pack conversion in a loop
* Pack another conversion in a loop
* Pack one more conversion in a loop
* Pack the last conversion in a loop
* Clean up
* Add printf to fix intrinsic
* Add a sw-based workaround
[ROCm/composable_kernel commit: 441343a23d ]
2025-03-26 19:23:01 -05:00
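Several bullets in #1889 deal with how two 4-bit `f4_t` codes share a byte; a sketch of one packing convention (first element in the low nibble — an assumption for illustration, not necessarily CK's ordering):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative packing of two 4-bit codes into one byte, first element in
// the low nibble. Whether the low or high nibble holds the first element
// is exactly the kind of ordering issue the "Permute packed f4_t values"
// commits above address.
inline uint8_t pack_f4x2(uint8_t lo, uint8_t hi)
{
    return static_cast<uint8_t>((lo & 0xFu) | ((hi & 0xFu) << 4));
}

inline uint8_t unpack_f4_lo(uint8_t packed) { return packed & 0xFu; }
inline uint8_t unpack_f4_hi(uint8_t packed) { return packed >> 4; }
```

A mismatch between the pack and unpack nibble order silently swaps adjacent elements, which is why the tests permute vectors until both sides agree.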
Bartłomiej Kocot
6ccfb817e4
Add support for GKCYX grouped conv fwd ( #2015 )
...
* Add support for GKCYX grouped conv fwd
* fixes
* fix
* changelog
* Fixes
[ROCm/composable_kernel commit: 54c81a1fcf ]
2025-03-26 21:13:38 +01:00
Andriy Roshchenko
75ef4c83bf
MX GEMM examples with FP8, FP16, and E8M0 scales ( #2016 )
...
* Add `scalar_type` specification for E8M0 exponent
* Specialize `nnvb_data_t_selector` for E8M0 exponent
* Remove partial specializations for `scalar_type` of `non_native_vector_base` template
* Reword command line helper string
* Create MX GEMM examples for different scales
[ROCm/composable_kernel commit: 72d888821c ]
2025-03-25 15:33:03 -06:00
Max Podkorytov
58789d03d3
use fast path for sequence generation in old CK ( #1993 )
...
[ROCm/composable_kernel commit: 1a58522f01 ]
2025-03-25 11:28:44 -07:00
ruanjm
ce1d20c2c6
[CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance ( #1861 )
...
* 50ms -> 28ms
* Fix bug in non fuse_add_store cases
* Fine tuned setting for 2 pass pipeline
* adjust workload
* remove unnecessary change
* add layernorm
* Adding output quant and unquant results at the same time.
* fix test
* fix format
* tune for cases 128x640 and 128x1024
* bug fix
[ROCm/composable_kernel commit: d49abdaa87 ]
2025-03-25 20:09:45 +08:00
Illia Silin
b9e0e7d93e
Split up data_type header. ( #1996 )
...
* split fp64 vector data type
* add missing header
* move e8m0 structs
* split off numeric_utils header
* fix typo
* split off numeric limits header
* update data_type header
* fix clang format
* split off vector type header
* fix clang format
* fix typo for binary_inf
[ROCm/composable_kernel commit: d2eab23958 ]
2025-03-24 15:08:54 -07:00
Andriy Roshchenko
bbdd7f6d57
Introduce MX GEMM for FP8 data type ( #2000 )
...
[ROCm/composable_kernel commit: 6660dc6b8e ]
2025-03-24 15:41:07 -06:00
MHYang-gh
fd151c05d9
Fix A/B lds transform ( #2007 )
...
[ROCm/composable_kernel commit: c027637a8f ]
2025-03-22 23:13:50 -07:00
Bartłomiej Kocot
ceb078163f
Fix split N for large images in grouped conv fwd ( #2004 )
...
* Fix split N for large images in grouped conv fwd
* Fix comments
[ROCm/composable_kernel commit: 5b0873c31a ]
2025-03-22 23:19:49 +01:00
BingYuan.Zhou
c245d569d5
fix ck_tile/basic_gemm build error ( #1988 )
...
[ROCm/composable_kernel commit: 5a0d693b86 ]
2025-03-20 22:01:14 -07:00
Attila T. Áfra
081e3c7880
Fix compile errors on Windows and Linux ( #2002 )
...
* Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous)
* Fix compile error (no matching function for call to 'cast_to_f32_from_f8')
[ROCm/composable_kernel commit: c79bf11148 ]
2025-03-20 12:37:25 -07:00
carlushuang
23340c5dd5
[CK_TILE] return value with macro in ck_tile::kernel_launch API ( #1982 )
...
* return value with macro and revert the return value
* [CK-TILE] no-macro launch api solution (#1992 )
* no-macro solution
* address -Wcomma
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
[ROCm/composable_kernel commit: e3c9886cdf ]
2025-03-20 11:00:29 -07:00
jakpiase
f1262b783a
[CK_TILE] Switch to universal gemm for batched and grouped gemms ( #1919 )
...
* switch to universal gemm for batched and grouped gemms
* added reviewer comments
* fixed grouped gemm tests
[ROCm/composable_kernel commit: 0e91d32c61 ]
2025-03-20 11:17:04 +01:00
rocking
b0f323c4ec
Sync the kname with instance name ( #1989 )
...
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
[ROCm/composable_kernel commit: b819c217e4 ]
2025-03-20 00:06:45 +08:00
felix
c2948a0634
Ck moe hot fix ( #1979 )
...
* fix useless code and remove useless oob
* clang format
* fix coredump in e2e test
* fix2
* fix clang format
* fix output oob
* clang format
* rm useless comments
---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
[ROCm/composable_kernel commit: 7eaedeb36c ]
2025-03-19 22:58:27 +08:00
aledudek
73d207bd4e
Async grouped gemm v3 ( #1940 )
...
* Fully async grouped gemm
* Remove commented code
* Remove maybe_unused
* host kernel args
* Checkpoint segfault debugging...
* Working part1
* Working part2
* Remove comments...
* Use void ptr for gemm kernel host args
* Fix device_grouped_gemm_multiple_d_dl build issue
* Fix device_grouped_gemm_xdl build issue
[ROCm/composable_kernel commit: 5095906975 ]
2025-03-17 16:42:43 +01:00
Bartłomiej Kocot
b8f58a234e
Grouped conv bwd data NGCHW ( #1967 )
...
* Grouped conv bwd data NGCHW
* fixes
* fix
* Improvements
* Fix
* Fix
* add client example
[ROCm/composable_kernel commit: c2e4898b4b ]
2025-03-17 13:32:00 +01:00
carlushuang
f2dd57b76f
Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )" … ( #1971 )
...
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961 )" (#1969 )
This reverts commit b92caa3d84.
* fix codegen problem
* Update config.hpp
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit: 3e81279d26 ]
2025-03-13 11:41:39 +08:00
feli
e3c5b2ae80
ck_moe: fix useless code and remove useless oob ( #1972 )
...
* fix useless code and remove useless oob
* clang format
---------
Co-authored-by: coderfeli <coderfeli@163.com>
[ROCm/composable_kernel commit: 251afab3b7 ]
2025-03-12 09:22:42 -07:00
Illia Silin
a7614ad594
use old intrinsics with staging compiler ( #1970 )
...
[ROCm/composable_kernel commit: 4c97cc511e ]
2025-03-12 07:29:09 -07:00
Illia Silin
b92caa3d84
Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 ( #1961 )" ( #1969 )
...
This reverts commit 45fbd9210a.
[ROCm/composable_kernel commit: 8cbcd3e0d0 ]
2025-03-11 10:40:18 -07:00
Haocong WANG
1ed0b74c43
[Block Scale GEMM] Optimized block scale gemm ( #1950 )
...
* Added two kernel for M=32 problem
* Comment the first one
* Enable multiply_multiply for Scale_Block_M = 1 for deepseek
* Modify the a_thread offset since the A data load is different from B.
* edit fp8 ab scale for Scale_Block_M=1
* edit GemmSpec to MNKPadding
* enable blockwise pipeline v1 and v2. v1 works for small K.
* add instance for gemm_ab_scale
* fix cmakelist of ckProfiler
* optimize blockscale gemm. todo: reduce vgpr usage
* fix a correctness bug
* sanity checked
* revert ckprofiler cmake changes
* clang format
* revert unnecessary changes.
* remove commented codes.
* split weight preshuffle library targets
* bring back enable-post-misched=0
* fix build issues for gemm_multiply_multiply_fp8 instances
* fix clang format
* add verbose build flag when building for all targets
* reduce path names for new instances
* fix paths in cmake
* refactor gemm_multiply_multiply library target
* fix a bug in example
* fix example 65 cmake
* reduce the number of threads when building libs for all targets to 50
* use ninja to build for all targets
* reduce the number of threads when building for all targets
* reduce the number of threads to 32 when building libs for all targets to 50
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
[ROCm/composable_kernel commit: cbd74c2d12 ]
2025-03-11 10:11:21 -07:00