Commit Graph

781 Commits

Author SHA1 Message Date
lalala-sh
01344f2e19 add switch for topk weights in each stages 2025-04-12 15:47:14 +00:00
lalala-sh
0353337447 refine activation code & complete moe example 2025-04-11 02:18:16 +00:00
lalala-sh
d2a82f56e2 remove useless comments 2025-04-10 02:23:08 +00:00
lalala-sh
4ff3ca6a0f fix fp8 gufusion bug 2025-04-10 02:03:38 +00:00
lalala-sh
8da84640a4 Merge branch 'develop' into moe_gemm_activation 2025-04-10 09:35:03 +08:00
Illia Silin
3e6d21adeb enable gfx115x support (#2065) 2025-04-09 10:06:42 -07:00
lalala-sh
4c2abb376a Merge branch 'develop' into moe_gemm_activation 2025-04-09 15:15:00 +08:00
MHYang-gh
03ce8729fd Make buffer coherence configurable in tensor view (#2041)
* Make buffer coherence configurable in tensor view

* Fix clang-format for tensor_view.hpp
2025-04-08 15:34:11 -07:00
valarLip
2c563fecf7 add passthrough for int32->float32 (#2062) 2025-04-08 15:16:30 -07:00
Max Podkorytov
6ce0797dad simplify generate_tuple (#2043) 2025-04-08 09:00:51 -07:00
aledudek
80aae6119b [CK_TILE] Fix GEMM Memory Pipeline (#2034)
* [CK_TILE] Fix GEMM Memory Pipeline

* Fix transpose tile

* Add comments
2025-04-08 12:40:04 +02:00
root
fbf91ada78 merge from testx 2025-04-08 06:55:35 +00:00
lalala-sh
62f99e5cca Merge branch 'develop' into moe_gemm_activation 2025-04-08 14:43:41 +08:00
root
c0c1c04b50 fix bugs 2025-04-08 06:26:57 +00:00
Illia Silin
1793228422 fix codegen issues (#2052) 2025-04-07 07:08:39 -07:00
Illia Silin
572cd820ce Split env.hpp header from the ck.hpp header. (#2049)
* split env.hpp out of main headers

* fix namespace logic
2025-04-03 15:30:21 -07:00
Rostyslav Geyyer
265af71a71 Add FP16/BF16<->FP8/BF8 conversions (#2035)
* Move conversion functions and add missing conversions

* Add tests

* Add missing conversions

* Add missing conversions

* Add bf8 tests

* Update clipping for vectors

* Add missing conversions

* Add bf16 fp8 tests

* Add bf16 bf8 tests

* Fix device conversion

* Fix conversions

* Fix vector use

* Minor fix

* Add a workaround flag

* Add a workaround flag for bf16 conversion

* Add another workaround

* Add a workaround for fp16 to bf8 conversion

* Update type alias

* Add docstrings and missing wrappers

* Fix if defined macros

* Fix more if defined macros

* Add comments

* Remove __host__ specifier

* Add a gfx950 guard

* Update function naming
2025-04-03 12:42:03 -05:00
aledudek
9329432f6c Post-merge changes for fully async args copy in ck grouped gemm (#1991)
* Post-merge changes for fully async args copy in ck grouped gemm

* Post-merge documentation and naming changes

* Build fix and updated changelog

* Revised comments
2025-04-03 13:35:43 +02:00
root
20f6674bf6 fix no quant case 2025-04-03 02:46:01 +00:00
Bartłomiej Kocot
2ccf914888 Add support for GKCYX grouped conv weight (#2023)
* Grouped conv bwd weight GKCYX support

* fix and changelog

* fix

* fix

* fixes

* comments

* fix
2025-04-02 23:59:49 +02:00
root
b2b34fffbb fix fp8 16x16 2025-04-02 16:27:52 +00:00
Adam Osewski
e5ad48a784 Basic docs for universal gemm & ck-tile gemm. (#2014)
* Basic docs for universal gemm & ck-tile gemm.

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_v3.hpp

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Reviewers suggestions.

* Align tparam names in doc with class tparams.

* More reviewers fine tuning ;)

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
2025-04-02 11:03:40 +02:00
root
85f83330b5 fuse moe activation 2025-04-02 07:02:09 +00:00
Bartłomiej Kocot
8c0ab61ece Grouped conv backward data GKCYX support (#2029)
* Grouped conv backward data GKCYX support

* profiler

* Converter

* split instances
2025-04-01 13:24:38 -07:00
Bartłomiej Kocot
ec742908bd Grouped conv fwd v3 fix for SplitN an G > 1 (#2038)
* Grouped conv fwd v3 fix for SplitN an G > 1

* Remove int8 large test

* Retore int8 test
2025-04-01 13:19:35 -07:00
Illia Silin
dcfec66bc4 Merge branch 'develop' into moe_gemm_activation 2025-04-01 12:28:10 -07:00
Seunghoon Lee
df32020f93 Fix Windows build. (#2012)
* Remove duplicate using uint64_t.

* Cast before shift.
2025-04-01 12:22:10 -07:00
Max Podkorytov
c59a8bb206 add a fast compilation path for static for (0..N) (#2005)
* add a fast compilation path for static for (0..N)

* Update functional2.hpp

add comment and put range applier into detail namespace

* Update functional.hpp

ditto for ck-tile

* prettify

* prettify more

* add comment

* clang-format
2025-04-01 12:06:25 -07:00
illsilin
d3fb5a9b8d fix clang format 2025-03-28 11:52:40 -07:00
rocking
8a20b62e91 Reduce redundant space in bias tensor (#2024)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-03-28 21:58:06 +08:00
lalala-sh
656a3657cb remove useless code change 2025-03-28 16:25:00 +08:00
root
7054d45579 remove useless changes 2025-03-28 08:21:42 +00:00
root
c9ce26837c Merge remote-tracking branch 'origin/develop' into moe_gemm_activation 2025-03-28 07:57:53 +00:00
felix
a82f338fb9 hotfix fix sorting int64 (#2025)
* fix sorting int64

* clang format

* fix example issue

* update WA issue #

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2025-03-28 11:31:52 +08:00
root
529a1732cd fp8 with act ready 2025-03-28 02:34:38 +00:00
lalala-sh
d0f3e87129 Merge remote-tracking branch 'origin/develop' into moe_gemm_activation 2025-03-28 10:02:27 +08:00
root
4f4bce30cf fuse gelu silu act in moe gemm1 2025-03-27 16:59:53 +08:00
coderfeli
bd4a8c71d4 i4 gemm2 ok and i4 gemm1 build 2025-03-27 06:41:31 +00:00
Rostyslav Geyyer
441343a23d Add MX FP4 device conversion tests (#1889)
* Add conversion tests

* Fix ctor

* Fix nan logic

* Fix conversion logic

* Permute packed f4_t values

* Fix conversion to float, repack vector elements

* Fix device tests

* Permute elements in a vector

* Add a repro test

* Add a conversion for a repro test

* Update test vectors

* Update conversion

* Fix the test

* Update test vector generator

* Fix vector sr conversion

* Permute conversion args

* Update conversion

* Test

* Fix packing

* Simplify conversion function

* Pack conversion in a loop

* Pack conversion in a loop

* Pack another conversion in a loop

* Pack one more conversion in a loop

* Pack the last conversion in a loop

* Clean up

* Add printf to fix intrinsic

* Add a sw-based workaround
2025-03-26 19:23:01 -05:00
Bartłomiej Kocot
54c81a1fcf Add support for GKCYX grouped conv fwd (#2015)
* Add support for GKCYX grouped conv fwd

* fixes

* fix

* changelog

* Fixes
2025-03-26 21:13:38 +01:00
coderfeli
9729c9e3f7 i4 gemm2 ok 2025-03-26 11:41:21 +00:00
coderfeli
6a0cc4aad1 gu fusion for m32 m64 ok 2025-03-26 05:58:22 +00:00
coderfeli
74d8ac608f gufusion compatible ok, fix warnings 2025-03-26 02:20:30 +00:00
Andriy Roshchenko
72d888821c MX GEMM examples with FP8, FP16, and E8M0 scales (#2016)
* Add `scalar_type` specification for E8M0 exponent

* Specialize `nnvb_data_t_selector` for E8M0 exponent

* Remove partial specializations for `scalar_type` of `non_native_vector_base` template

* Reword command line helper string

* Create MX GEMM examples for different scales
2025-03-25 15:33:03 -06:00
Max Podkorytov
1a58522f01 use fast path for sequence generation in old CK (#1993) 2025-03-25 11:28:44 -07:00
coderfeli
6ca5892256 gemm2 ok 2025-03-25 15:01:10 +00:00
ruanjm
d49abdaa87 [CK_TILE] Improve RMS/Layer Normalization 2 Pass Pipeline Performance (#1861)
* 50ms -> 28ms

* Fix bug in non fuse_add_store cases

* Fine tuned setting for 2 pass pipeline

* adjust workload

* remove unnecessary change

* add layernorm

* Adding output quant and unquant results at the same time.

* fix test

* fix format

* tune for cases 128x640 and 128x1024

* bug ifx
2025-03-25 20:09:45 +08:00
coderfeli
234b8d415c change code 2025-03-25 09:44:32 +00:00
coderfeli
0d266bfd65 add silu 2025-03-25 03:01:27 +00:00
coderfeli
2b15b67b3f acale ok 2025-03-25 02:52:04 +00:00