coderfeli
7ca2d03e82
merge moe sorting
2025-02-25 05:08:21 +00:00
coderfeli
f64b137521
merge haocong branch
2025-02-17 09:30:02 +00:00
coderfeli
7572a6916c
merge develop
2025-02-15 03:23:00 +00:00
aska-0096
4599ee0079
Merge branch 'update_cka8w8_uc' of https://github.com/ROCm/composable_kernel into update_cka8w8_uc
2025-02-13 06:07:11 +00:00
aska-0096
0172488d64
hotfix for ckprofiler operator
2025-02-13 06:07:07 +00:00
Haocong WANG
d6e3e83a80
Merge branch 'develop' into update_cka8w8_uc
2025-02-11 16:06:08 +08:00
Mingtao Gu
d9f1ead347
Added Int4 mixed batch gemm support (#1839)
* remove redundant kernels.
* added batched_gemm_xdl_fp16int4_b_scale_v3
* Enabled the split K.
* added the batched_gemm_b_scale ckProfiler, meet function issue
* fix some typo
* fix ckProfiler build issue
* fix some bugs
* updated some debug info
* comment some code
* Fix
* fixed some bugs and refactor the code
* fixed a function bug.
* formatted files.
* formatted
* uncommented the ckProfiler CMakeLists
* fixed.
* fix ckProfiler for batched_gemm_b_scale
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-02-10 11:17:02 +08:00
aska-0096
5be42bb398
fix errors
2025-02-05 08:22:14 +00:00
aska-0096
1b616990b3
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8_uc
2025-02-05 08:07:20 +00:00
deepsek
e7dce4d247
Added bf16 instances grouped gemm fixed nk (#1825)
* Feat: Add bf16 input instances
* feat: Add BF16 profiler code
* fix: reorder enum types
* fix: CI fail due to clang-format
* fix: clang script format issue
* fix: clang format broke cmakelist file
2025-01-20 09:13:09 -08:00
deepsek
0fcbb25f70
fix: preprocessor directives logic error if/else (#1764)
* fix: preprocessors logic error if/else
* fix: added macros as preferred by CK team
2025-01-16 20:31:15 -08:00
feli
53ab1b9047
Dev/merge u8w8 (#1774)
* port tiles from a8w8
* rm debug used files
* add instances
* remove all non gemm in cmake
* merge; impl fp16
* recover cmake from develop
* add missed files; fix clang format
---------
Co-authored-by: coderfeli <coderfeli@163.com>
2025-01-13 10:25:14 -08:00
Mingtao Gu
4f62f6e9b7
Implement the fp16xint4 scale weight only kernel for Ali (#1786)
* enable int4 scale (weight only) kernel
* format some files
* Add unit test for int4 weight only
* fixed and formatted code
* fixed
* formatted
* formatted
* fixed
* fixed a bug in the ckProfiler, and formatted the code
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-03 18:35:21 +08:00
Muhammed Emin Ozturk
9e95d54cd2
BF16 GEMM Stream-K (#1541)
* initial
* Cmake file
* successful compilation but validation failed
* Cmake
* update
* gpu validation
* gemm universal
* gemm universal sk update
* sk bf16 universal instance
* gemm_universal_streamk.hpp
* only build for gfx94
* Cmakelist
* profiler update, bf16 sk only works at gfx42
* clang
* clang
* clang all
* no need flags
* cmake script
* delete comment
* gemm universal sk fix
* clang
* profiler fix
* clang
* update
* update
* delete comment
* code formatting
* cmake
* fix instance
* clang
* argument supported
* argument supported and clang
* update
* fix
* removing unnecessary comments
* clang formatting
* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt
Co-authored-by: afagaj <john.afaganis@gmail.com>
* CopyRight Comment 2025
* clang reformatting
* copy right 2025
---------
Co-authored-by: Emin Ozturk <ozturk.27@osu.edu>
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund>
Co-authored-by: root <root@splinter-126-wr-d3.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund>
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund>
Co-authored-by: afagaj <john.afaganis@gmail.com>
2025-01-02 10:30:04 -08:00
aska-0096
0dbe537032
refine weight preshuffle format.
2025-01-02 13:59:58 +00:00
Adam Osewski
1d8e4ec2ce
Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762)
* add a prototype of int4
* clean
* debug
* clean
* clean
* move packed into dynamic_buffer
* fixed coord reset
* add fast pki4 to half conversion
* fix
* fixed reference and host_tensor
* fixed tensor init
* format
* debug i4_to_f16_convert
* format
* fixed splitk
* weight permute
* add b tile permute
* clean
* weight permute with splitki
* format
* improve weight layout
* add and_or_b32
* fixed splitk crush
* add permute switch as a template
* recover v3r1
* clean
* failure with intrawave v2
* fixed
* fixed
* add ckProfiler
* add bfp16 support
* add bf16 example
* fixed int4 to bhalf_t conversion
* format
* fixed int4 to bf16 conversion
* clean
* add instances for mem
* clean
* fixed host tensor size
* fixed
* debug
* fixed
* add pk_i4_t as a struct
* fix
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* revert
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* fixed comments
* revert
* clean
* revert
* revert
* fixed
* Update CMakeLists.txt
* Update script/cmake-ck-dev.sh
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update CMakeLists.txt
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* fixed
* fixed
* fixed
* revert
* revert
* add comments
* format
* fixed assert
* fixed
* Fix I4 define in ckProfiler
* Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-02 11:48:06 +08:00
aska-0096
72c1ddacb9
Merge branch 'add_a8w8_preshuffle_ckprofiler' of https://github.com/ROCm/composable_kernel into update_cka8w8_uc
2024-12-31 07:23:50 +00:00
aska-0096
bbbedc1fd7
add fp16 instances
2024-12-31 07:14:56 +00:00
aska-0096
f60f9d5917
sanity pass, most tile size enabled. TODO: NWave!=4
2024-12-30 18:22:08 +00:00
aska-0096
74ef5021b6
tempsave
2024-12-30 09:20:25 +00:00
coderfeli
fda5f8cfb0
fix missed files and fix clang format
2024-12-27 11:56:46 +00:00
coderfeli
e92395d9b1
Merge remote-tracking branch 'origin/cka8w8_devtimer' into update_cka8w8_uc
2024-12-27 11:09:05 +00:00
coderfeli
e2127d7a96
impl fp16 in ckprofiler
2024-12-27 06:53:40 +00:00
coderfeli
400cac2839
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8
2024-12-27 05:42:38 +00:00
coderfeli
04f09f087e
fix build
2024-12-27 11:44:06 +08:00
coderfeli
19b7c1312c
remove all non gemm in cmake
2024-12-24 07:44:12 +00:00
coderfeli
9ba219c875
rm debug used files
2024-12-24 03:59:12 +00:00
coderfeli
3f50b99e7b
port tiles from a8w8
2024-12-23 14:13:16 +00:00
Bartłomiej Kocot
4d8fce33dd
Add SplitK support into Batched GEMM V3 (#1729)
* add bmm api
* add bf16 multi_d
* add ckProfiler for bf16
* add ckProfiler files
* add more instance; fixed 64bit index issue
* fixed naming
* enabled batched Ds
* use long_index for ds offsets
* clean
* add bmm fp8 ckProfiler
* Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update profiler/src/profile_gemm_universal_batched.cpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp
Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>
* clean
* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp
* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp
* refactor batch offset func
* add splitk support into bmm_v3
* clean
* clean
* format
* fixed
* fix
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2024-12-13 21:08:35 +01:00
aska-0096
c8c016ddad
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8
2024-12-13 09:18:50 +00:00
Adam Osewski
061ac0649c
Polished Grouped GEMM APIs and new BF16 instances (#1600)
* Few small fixes.
* New GroupedGemm instances (BF16)
* Unify and refactor GroupedGEMM device API.
* Adapt changes to new API.
* Adapt grouped gemm profiler.
* Accept multiple kbatches for grouped gemm profiler.
- delete obsolete two stage as it is now covered by grouped gemm
* Update unit test for grouped gemm.
* Fix thresholds for BF16 and F8. Unblock tests.
* Fix few instances.
* Multiple small fixes.
* Adapt to new API, check dynamic casting.
* Uncomment few data types in grouped gemm profiler.
* Fix call to SetDeviceArgs.
* Fix profile grouped gemm multiply tile loop.
* Fix grouped gemm tile loop kernel args in client examples.
* Review comments.
2024-11-27 13:02:44 +01:00
Harisankar Sadasivan
d6d4c2788b
universal streamk fp8 changes (#1665)
* universal streamk fp8 changes & ckprofiler instances
* revert strides to -1 and verification options
* fp8 exclusion on pre-gfx94 for universal_streamk
* PR review based revisions: permissions reverted, removed hip err checks
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2024-11-21 08:21:37 -08:00
aska-0096
ec6b000c77
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8
2024-11-19 08:50:37 +00:00
Bartłomiej Kocot
754adc70e3
Batched GEMM Multiple D based on Universal GEMM (#1655)
* Batched GEMM Multiple D based on Universal GEMM
Co-authored-by: Jing Zhang <jizhan@fb.com>
* CI fixes
Co-authored-by: Jing Zhang <jizhan@fb.com>
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
2024-11-18 14:03:45 +01:00
aska-0096
f3bbfe3efe
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8
2024-11-18 07:32:39 +00:00
Illia Silin
b4a7904582
re-enable fp8 gemms in ckProfiler (#1667)
2024-11-14 16:15:01 -08:00
rocking
3599418aa8
Fix F16 type (#1583)
2024-11-06 11:32:44 -08:00
aska-0096
7a0ad60e77
remove the change in ckprofiler src
2024-11-05 08:05:41 +00:00
aska-0096
f20e48f1f4
Merge branch 'develop' of https://github.com/ROCm/composable_kernel into update_cka8w8
2024-11-05 07:03:42 +00:00
aska-0096
b97c68764e
update ck_a8w8 library, update flush cache timing api
2024-11-05 06:57:48 +00:00
Illia Silin
03c6448ba3
Reduce build time. (#1621)
* disable fp8 gemm_universal on gfx90a and gfx908 by default
* fix cmake syntax
* fix clang format
* add ifdefs in amd_xdlops
* disable fp8 gemm instances on gfx90a by default
* update readme
2024-11-01 13:52:23 +08:00
aska-0096
b3e5048f12
tempsave
2024-10-30 07:38:59 +00:00
valarLip
37f7afed1e
add int8 gemm multiply multiply a8w8 (#1591)
* add int8 gemm multiply multiply a8w8
* uncomment
* clang-format-12
* Add example_gemm_multiply_multiply_xdl_int8
* Remove shell scripts
* update preprocess number for mi308; bring back printout in ckprofiler
* format
---------
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-10-26 16:39:34 +08:00
Bartłomiej Kocot
cedccd59c9
[POST MERGE PR] Enable grouped conv bwd wei bf16 NGCHW (#1594)
2024-10-23 12:02:33 +02:00
Haocong WANG
47294b4b22
Merge branch 'develop' into gemm_multiply_multiply_int8a8w8
2024-10-23 11:28:40 +08:00
Bartłomiej Kocot
82fc53835a
Enable grouped conv bwd wei bf16 NGCHW (#1589)
* Enable grouped conv bwd wei bf16 NGCHW
* fixes
* fixes
* Fixes
* fixes
* fixes
* Fixes
2024-10-22 16:18:28 +02:00
chenjun
1670bba95f
clang-format-12
2024-10-21 23:16:04 +08:00
Thomas Ning
560917b161
Ck profiler instance support (#1575)
* The draft on ckProfiler instance add
* support the ck profiler instance with same data types
* add a small feature on the M and N variable switch.
* Partially solve the incorrect result problem
* fix based on ci cd
2024-10-21 22:47:48 +08:00
chenjun
09852d3ba7
uncomment
2024-10-21 22:03:22 +08:00
chenjun
7fb0b3223c
add int8 gemm multiply multiply a8w8
2024-10-21 21:57:41 +08:00