Commit Graph

200 Commits

Author SHA1 Message Date
coderfeli
7ca2d03e82 merge moe sorting 2025-02-25 05:08:21 +00:00
coderfeli
f64b137521 merge haocong branch 2025-02-17 09:30:02 +00:00
coderfeli
7572a6916c merge develop 2025-02-15 03:23:00 +00:00
aska-0096
4599ee0079 Merge branch 'update_cka8w8_uc' of https://github.com/ROCm/composable_kernel into update_cka8w8_uc 2025-02-13 06:07:11 +00:00
aska-0096
0172488d64 hotfix for ckprofiler operator 2025-02-13 06:07:07 +00:00
Haocong WANG
d6e3e83a80 Merge branch 'develop' into update_cka8w8_uc 2025-02-11 16:06:08 +08:00
Mingtao Gu
d9f1ead347 Added Int4 mixed batch gemm support (#1839)
* remove redundant kernels.

* added batched_gemm_xdl_fp16int4_b_scale_v3

* Enabled the split K.

* added the batched_gemm_b_scale ckProfiler, meet function issue

* fix some typo

* fix ckProfiler build issue

* fix some bugs

* updated some debug info

* comment some code

* Fix

* fixed some bugs and refactor the code

* fixed a function bug.

* formatted files.

* formatted

* uncommented the ckProfiler CMakeLists

* fixed.

* fix ckProfiler for batched_gemm_b_scale

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-02-10 11:17:02 +08:00
aska-0096
5be42bb398 fix errors 2025-02-05 08:22:14 +00:00
aska-0096
1b616990b3 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8_uc 2025-02-05 08:07:20 +00:00
deepsek
e7dce4d247 Added bf16 instances grouped gemm fixed nk (#1825)
* Feat: Add bf16 input instances

* feat: Add BF16 profiler code

* fix: reorder enum types

* fix: CI fail due to clang-format

* fix: clang script format issue

* fix: clang format broke cmakelist file
2025-01-20 09:13:09 -08:00
deepsek
0fcbb25f70 fix: preprocessor directives logic error if/else (#1764)
* fix: preprocessors logic error if/else

* fix: added macros as preferred by CK team
2025-01-16 20:31:15 -08:00
feli
53ab1b9047 Dev/merge u8w8 (#1774)
* port tiles from a8w8

* rm debug used files

* add instances

* remove all non gemm in cmake

* merge; impl fp16

* recover cmake from develop

* add missed files; fix clang format

---------

Co-authored-by: coderfeli <coderfeli@163.com>
2025-01-13 10:25:14 -08:00
Mingtao Gu
4f62f6e9b7 Implement the fp16xint4 scale weight only kernel for Ali (#1786)
* enable int4 scale (weight only) kernel

* format some files

* Add unit test for int4 weight only

* fixed and formatted code

* fixed

* formatted

* formatted

* fixed

* fixed a bug in the ckProfiler, and formatted the code

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-03 18:35:21 +08:00
Muhammed Emin Ozturk
9e95d54cd2 BF16 GEMM Stream-K (#1541)
* initial

* Cmake file

* successful compilation but validation failed

* Cmake

* update

* gpu validation

* gemm universal

* gemm universal sk update

* sk bf16 universal instance

* gemm_universal_streamk.hpp

* only build for gfx94

* Cmakelist

* profiler update, bf16 sk only works at gfx42

* clang

* clang

* clang all

* no need flags

* cmake script

* delete comment

* gemm universal sk fix

* clang

* profiler fix

* clang

* update

* update

* delete comment

* code formatting

* cmake

* fix instance

* clang

* argument supported

* argument supported and clang

* update

* fix

* removing unnecessary comments

* clang formatting

* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt

Co-authored-by: afagaj <john.afaganis@gmail.com>

* CopyRight Comment 2025

* clang reformatting

* copy right 2025

---------

Co-authored-by: Emin Ozturk <ozturk.27@osu.edu>
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund>
Co-authored-by: root <root@splinter-126-wr-d3.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund>
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund>
Co-authored-by: afagaj <john.afaganis@gmail.com>
2025-01-02 10:30:04 -08:00
aska-0096
0dbe537032 refine weight preshuffle format. 2025-01-02 13:59:58 +00:00
Adam Osewski
1d8e4ec2ce Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762)
* add a prototype of int4

* clean

* debug

* clean

* clean

* move packed into dynamic_buffer

* fixed coord reset

* add fast pki4 to half conversion

* fix

* fixed reference and host_tensor

* fixed tensor init

* format

* debug i4_to_f16_convert

* format

* fixed splitk

* weight permute

* add b tile permute

* clean

* weight permute with splitki

* format

* improve weight layout

* add and_or_b32

* fixed splitk crush

* add permute switch as a template

* recover v3r1

* clean

* failure with intrawave v2

* fixed

* fixed

* add ckProfiler

* add bfp16 support

* add bf16 example

* fixed int4 to bhalf_t conversion

* format

* fixed int4 to bf16 conversion

* clean

* add instances for mem

* clean

* fixed host tensor size

* fixed

* debug

* fixed

* add pk_i4_t as a struct

* fix

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* revert

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed comments

* revert

* clean

* revert

* revert

* fixed

* Update CMakeLists.txt

* Update script/cmake-ck-dev.sh

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update CMakeLists.txt

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed

* fixed

* fixed

* revert

* revert

* add comments

* format

* fixed assert

* fixed

* Fix I4 define in ckProfiler

* Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-02 11:48:06 +08:00
aska-0096
72c1ddacb9 Merge branch 'add_a8w8_preshuffle_ckprofiler' of https://github.com/ROCm/composable_kernel into update_cka8w8_uc 2024-12-31 07:23:50 +00:00
aska-0096
bbbedc1fd7 add fp16 instances 2024-12-31 07:14:56 +00:00
aska-0096
f60f9d5917 sanity pass, most tile size enabled. TODO: NWave!=4 2024-12-30 18:22:08 +00:00
aska-0096
74ef5021b6 tempsave 2024-12-30 09:20:25 +00:00
coderfeli
fda5f8cfb0 fix missed files and fix clang format 2024-12-27 11:56:46 +00:00
coderfeli
e92395d9b1 Merge remote-tracking branch 'origin/cka8w8_devtimer' into update_cka8w8_uc 2024-12-27 11:09:05 +00:00
coderfeli
e2127d7a96 impl fp16 in ckprofiler 2024-12-27 06:53:40 +00:00
coderfeli
400cac2839 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8 2024-12-27 05:42:38 +00:00
coderfeli
04f09f087e fix build 2024-12-27 11:44:06 +08:00
coderfeli
19b7c1312c remove all non gemm in cmake 2024-12-24 07:44:12 +00:00
coderfeli
9ba219c875 rm debug used files 2024-12-24 03:59:12 +00:00
coderfeli
3f50b99e7b port tiles from a8w8 2024-12-23 14:13:16 +00:00
Bartłomiej Kocot
4d8fce33dd Add SplitK support into Batched GEMM V3 (#1729)
* add bmm api

* add bf16 multi_d

* add ckProfiler for bf16

* add ckProfiler files

* add more instance; fixed 64bit index issue

* fixed naming

* enabled batched Ds

* use long_index for ds offsets

* clean

* add bmm fp8 ckProfiler

* Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update profiler/src/profile_gemm_universal_batched.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* clean

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* refactor batch offset func

* add splitk support into bmm_v3

* clean

* clean

* format

* fixed

* fix

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2024-12-13 21:08:35 +01:00
aska-0096
c8c016ddad Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8 2024-12-13 09:18:50 +00:00
Adam Osewski
061ac0649c Polished Grouped GEMM APIs and new BF16 instances (#1600)
* Few small fixes.

* New GroupedGemm instances (BF16)

* Unify and refactor GroupedGEMM device API.

* Adapt changes to new API.

* Adapt grouped gemm profiler.

* Accept multiple kbatches for grouped gemm profiler.

- delete obsolete two stage as it is now covered by grouped gemm

* Update unit test for grouped gemm.

* Fix thresholds for BF16 and F8. Unblock tests.

* Fix few instances.

* Multiple small fixes.

* Adapt to new API, check dynamic casting.

* Uncomment few data types in grouped gemm profiler.

* Fix call to SetDeviceArgs.

* Fix profile grouped gemm multiply tile loop.

* Fix grouped gemm tile loop kernel args in client examples.

* Review comments.
2024-11-27 13:02:44 +01:00
Harisankar Sadasivan
d6d4c2788b universal streamk fp8 changes (#1665)
* universal streamk fp8 changes & ckprofiler instances

* revert strides to -1 and verification options

* fp8 exclusion on pre-gfx94 for universal_streamk

* PR review based revisions: permissions reverted, removed hip err checks


---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2024-11-21 08:21:37 -08:00
aska-0096
ec6b000c77 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8 2024-11-19 08:50:37 +00:00
Bartłomiej Kocot
754adc70e3 Batched GEMM Multiple D based on Universal GEMM (#1655)
* Batched GEMM Multiple D based on Universal GEMM

Co-authored-by: Jing Zhang <jizhan@fb.com>

* CI fixes

Co-authored-by: Jing Zhang <jizhan@fb.com>

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
2024-11-18 14:03:45 +01:00
aska-0096
f3bbfe3efe Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8 2024-11-18 07:32:39 +00:00
Illia Silin
b4a7904582 re-enable fp8 gemms in ckProfiler (#1667) 2024-11-14 16:15:01 -08:00
rocking
3599418aa8 Fix F16 type (#1583) 2024-11-06 11:32:44 -08:00
aska-0096
7a0ad60e77 remove the change in ckprofiler src 2024-11-05 08:05:41 +00:00
aska-0096
f20e48f1f4 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8 2024-11-05 07:03:42 +00:00
aska-0096
b97c68764e update ck_a8w8 library, update flush cache timing api 2024-11-05 06:57:48 +00:00
Illia Silin
03c6448ba3 Reduce build time. (#1621)
* disable fp8 gemm_universal on gfx90a and gfx908 by default

* fix cmake syntax

* fix clang format

* add ifdefs in amd_xdlops

* disable fp8 gemm instances on gfx90a by default

* update readme
2024-11-01 13:52:23 +08:00
aska-0096
b3e5048f12 tempsave 2024-10-30 07:38:59 +00:00
valarLip
37f7afed1e add int8 gemm multiply multiply a8w8 (#1591)
* add int8 gemm multiply multiply a8w8

* uncomment

* clang-format-12

* Add example_gemm_multiply_multiply_xdl_int8

* Remove shell scripts

* update preprocess number for mi308; bring back printout in ckprofiler

* format

---------

Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-10-26 16:39:34 +08:00
Bartłomiej Kocot
cedccd59c9 [POST MERGE PR] Enable grouped conv bwd wei bf16 NGCHW (#1594) 2024-10-23 12:02:33 +02:00
Haocong WANG
47294b4b22 Merge branch 'develop' into gemm_multiply_multiply_int8a8w8 2024-10-23 11:28:40 +08:00
Bartłomiej Kocot
82fc53835a Enable grouped conv bwd wei bf16 NGCHW (#1589)
* Enable grouped conv bwd wei bf16 NGCHW

* fixes

* fixes

* Fixes

* fixes

* fixes

* Fixes
2024-10-22 16:18:28 +02:00
chenjun
1670bba95f clang-format-12 2024-10-21 23:16:04 +08:00
Thomas Ning
560917b161 Ck profiler instance support (#1575)
* The draft on ckProfiler instance add

* support the ck profiler instance with same data types

* add a small feature on the M and N variable switch.

* Partially solve the incorrect result problem

* fix based on ci cd
2024-10-21 22:47:48 +08:00
chenjun
09852d3ba7 uncomment 2024-10-21 22:03:22 +08:00
chenjun
7fb0b3223c add int8 gemm multiply multiply a8w8 2024-10-21 21:57:41 +08:00