Aviral Goel
de6466481f
chore(copyright): update copyright header for include directory (#3293)
2025-11-26 11:00:05 -07:00
Michal Kulikowski
cd8af997e6
[CK] s_prefetch unit test fixes.
...
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-11-19 21:54:50 +01:00
Michal Kulikowski
f3ef7acca0
[CK] Added s_prefetch unit test.
...
-added s_buffer_load_b32/64 assembly
-added amd_s_buffer_load_impl
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-11-19 21:54:50 +01:00
Illia Silin
b38bb492a1
Disable DL kernels on all architectures except gfx103x. (#3218)
...
* disable dl kernels on all archs except gfx103
* add gfx10-3-generic target to cmake
2025-11-14 17:39:50 -08:00
Illia Silin
d2eab23958
Split up data_type header. (#1996)
...
* split fp64 vector data type
* add missing header
* move e8m0 structs
* split off numeric_utils header
* fix typo
* split off numeric limits header
* update data_type header
* fix clang format
* split off vector type header
* fix clang format
* fix typo for binary_inf
2025-03-24 15:08:54 -07:00
Mingtao Gu
0db7c8f0b2
Ck int4 moe develop (#1949)
...
* Add Gemm fp8xint4 example and kernel, function pass.
* Init Gemm_fp8xint4 Bpreshuffle
* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet
* General fix.
* fp8xint4 bpreshuffle function pass
* fix.
* init b preshuffle dequant in VGPR.
* fix bug, function pass.
* move b thread dequant copy to blockwise.
* fix bug, function now passes.
* modified the tile size to 256, 128x128x128.
* fixed a bug.
* Initial int4 moe, compile pass, function not check.
* fix bug in moe_gemm1.cpp, now function pass.
* test expert = 8 and function pass.
* Added moe_pk_i4_gemm2, function pass.
* Added b preshuffle pipeline v3 support.
* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.
* Split the blockwise pipeline for fp8xint4.
* commit missing files
* opt gemm2 to 2x2 wave
* fix swizzle = false
* update int4 moe with latest input changes.
* update tile size.
* enable pipeline v3.
* fix nswizzle = true
* commit a version for compiler debug.
* Updated transfer_v3r1_gather to support pk_i4_t type.
* for int4 moe2 for type_convert support.
* remove some values between mfma instructions.
* fix int4 moe
* Updated transfer_v3r1_gather to support pk_i4_t type.
* i4 support lds multiple shuffle
* fixed int4 moe tflops calculation.
* Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle
* updated gemm2.
* change int4 moe example names
* fix and format code.
* format.
* format codes.
* update fp8xint4 example tile size.
* add <unordered_map> header
* fixed.
* format.
* Added conditional compilation for int4 -> fp8 conversion kernels
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-10 11:16:44 +08:00
Mingtao Gu
4f62f6e9b7
Implement the fp16xint4 scale weight only kernel for Ali (#1786)
...
* enable int4 scale (weight only) kernel
* format some files
* Add unit test for int4 weight only
* fixed and formatted code
* fixed
* formatted
* formatted
* fixed
* fixed a bug in the ckProfiler, and formatted the code
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-03 18:35:21 +08:00
Adam Osewski
1d8e4ec2ce
Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762)
...
* add a prototype of int4
* clean
* debug
* clean
* clean
* move packed into dynamic_buffer
* fixed coord reset
* add fast pki4 to half conversion
* fix
* fixed reference and host_tensor
* fixed tensor init
* format
* debug i4_to_f16_convert
* format
* fixed splitk
* weight permute
* add b tile permute
* clean
* weight permute with splitki
* format
* improve weight layout
* add and_or_b32
* fixed splitk crash
* add permute switch as a template
* recover v3r1
* clean
* failure with intrawave v2
* fixed
* fixed
* add ckProfiler
* add bfp16 support
* add bf16 example
* fixed int4 to bhalf_t conversion
* format
* fixed int4 to bf16 conversion
* clean
* add instances for mem
* clean
* fixed host tensor size
* fixed
* debug
* fixed
* add pk_i4_t as a struct
* fix
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* revert
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* fixed comments
* revert
* clean
* revert
* revert
* fixed
* Update CMakeLists.txt
* Update script/cmake-ck-dev.sh
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update CMakeLists.txt
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* fixed
* fixed
* fixed
* revert
* revert
* add comments
* format
* fixed assert
* fixed
* Fix I4 define in ckProfiler
* Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-02 11:48:06 +08:00
zjing14
1837040a9c
Navi3 rel (#1176)
...
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* tempsave
* tempsave
* tempsave
* runtime bug, cannot find symbol
* workaround for incorrect HIP warpSize return value
* debugging
* tempsave
* Correctness OK, waiting for optimization
* Tidy up + format
* temp save
* temp save, reproduce the v_bfi_b32 issue
* add inline asm for wmmaop test
* tidy up
* clean some debug purpose code
* discard some codes
* clang format
* clang format
* compiler issue fixed + increase tile size
* navi3x_multipleD+example
* temp save
* workable
* batchedgemm[OK], groupconv[debug]
* groupconv: Sanity check[OK], Performance[Bad]
* navi3x_groupconv_need_optimization
* create necessary files
* save progress
* Add Inter-Row thread transfer
* save progress
* save debugging progress
* sanity check pass
* fix a host tensor bug and clean up flash-attn code
* format
* cancel unnecessary change
* cancel unnecessary change
* cancel unnecessary change
* temp save, add asm backend flag to amd_wmma
* Mat-A LDS Bypass sanity pass
* temp save
* gemm sanity fix
* Porting new blockwise gemm to flash attention
* Example branch provide to compiler team
* tempsave
* Fix a bug
* batched gemm ported
* conv A-skip lds ported
* Skip B-Lds real gemm
* Skip B Lds Gemm + MulD
* batched gemm, conv, skip b lds
* format
* Attn, skip b lds
* Change GridwiseOp name
* fix a typo caused bug
* Skip A_Lds sanity pass, Skip B_Lds scratch occurred
* Bug found, intra-row permute off caused
* bug found
* a fix
* disable buffer load due to incorrect 3rd dword
* update fmha config, no scratch generated
* update 3rd dword
* fmha config update
* FMHA, add support to gfx1101/gfx1102
* Merge origin dev (#2)
* [Navi3x] Fix Gridwise_multiple_d operation (#649)
* Add CMake Option "USE_OPT_NAVI3X"
* fix bug
* standardize docs (#655)
* Separate bibtex requirement from rocm-docs-core (#656)
* separate bibtex requirement from rocm-docs-core
* point requirements to source rocm-docs-core repo
* Add CMake Option "USE_OPT_NAVI3X" (#647)
* Add CMake Option "USE_OPT_NAVI3X"
* remove navi3x opt compile option from cmake script
* Conv + quantization + tanh (#645)
* Rename file. Prepare to support another activation
* Add comment for quantization
* Extract out_elementop
* Add tanh example
* Add conv + bias + tanh quantization instance
* Add missing parameter
* Refine cmake
* Add external api and client example
* Extract variable in example
* Fix the comment
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add a denorm test fix (#603)
* Add type_convert implementations for bf16
* Add the fix for conv_fwd
* Add the fix for conv_bwd_data
* Add the fix for conv_bwd_weight
* Format
* Format
* Another format
* Add a macro to use workaround on MI200 only
* Format
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* simplify karg in device/grid of split-k op (#644)
* simplify karg in device/grid split-k op
* fix mk_kn_mn instances
* add more instances
* use name from tensor layout
* fix 3rd dword of buffer source descriptor (#659)
* add fp64 instances (#658)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
This reverts commit bb5530af91.
* Groupnorm + swish external api (#668)
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Refactor instance generation, split into multiple cpp files
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
* add a macro to turn on/off denorm fix (off by default) (#673)
* add a macro to turn off denorm fix by default
* expose the macro
---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* fixed quant example (#672)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Add dependabot config and pin rocm-docs-core (#663)
* [gtest] suppress unsafe buffer warn (#670)
ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
* Add memory index guard in wmma device ops (#667)
* Add more macros to turn on/off denorm fix (#678)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
* Fix a typo (#676)
* Add (#677)
* Allow using ROCm release candidate compilers. (#679)
* enable use of rocm5.5 release candidate 4
* upgrade to ROCM5.5 RC5
* try fix the PUB_KEY error, remove the cmake-data package
* upgrade to latest cmake version
* use private dockerhub repo for rocm5.5 rc5
* add missing bracket
* add vector load check
* solve conflicts
---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Disable SkipLDS & Align AIT api (#3)
* fix layernorm, reduction Ops (#4)
* [Navi3x] Fix Gridwise_multiple_d operation (#649)
* Add CMake Option "USE_OPT_NAVI3X"
* fix bug
* standardize docs (#655)
* Separate bibtex requirement from rocm-docs-core (#656)
* separate bibtex requirement from rocm-docs-core
* point requirements to source rocm-docs-core repo
* Add CMake Option "USE_OPT_NAVI3X" (#647)
* Add CMake Option "USE_OPT_NAVI3X"
* remove navi3x opt compile option from cmake script
* Conv + quantization + tanh (#645)
* Rename file. Prepare to support another activation
* Add comment for quantization
* Extract out_elementop
* Add tanh example
* Add conv + bias + tanh quantization instance
* Add missing parameter
* Refine cmake
* Add external api and client example
* Extract variable in example
* Fix the comment
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add a denorm test fix (#603)
* Add type_convert implementations for bf16
* Add the fix for conv_fwd
* Add the fix for conv_bwd_data
* Add the fix for conv_bwd_weight
* Format
* Format
* Another format
* Add a macro to use workaround on MI200 only
* Format
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* simplify karg in device/grid of split-k op (#644)
* simplify karg in device/grid split-k op
* fix mk_kn_mn instances
* add more instances
* use name from tensor layout
* fix 3rd dword of buffer source descriptor (#659)
* add fp64 instances (#658)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Issue #666: Revert "simplify karg in device/grid of split-k op (#644)" (#665)
This reverts commit bb5530af91.
* Groupnorm + swish external api (#668)
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Refactor instance generation, split into multiple cpp files
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
* add a macro to turn on/off denorm fix (off by default) (#673)
* add a macro to turn off denorm fix by default
* expose the macro
---------
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* fixed quant example (#672)
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
* Add dependabot config and pin rocm-docs-core (#663)
* [gtest] suppress unsafe buffer warn (#670)
ref: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1912
* Add memory index guard in wmma device ops (#667)
* Add more macros to turn on/off denorm fix (#678)
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
* Fix a typo (#676)
* Add (#677)
* Allow using ROCm release candidate compilers. (#679)
* enable use of rocm5.5 release candidate 4
* upgrade to ROCM5.5 RC5
* try fix the PUB_KEY error, remove the cmake-data package
* upgrade to latest cmake version
* use private dockerhub repo for rocm5.5 rc5
* add missing bracket
* Disable SkipLDS & Align AIT api
* Update dependabot config (#682)
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
* update attn api
* solve type_convert bug + enable
---------
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>
* fix typo
* Fix attention with causal mask
* multiple fix, try ait compile
* Add A/B not use LDS pipeline
* Clang format, Add gfx1101, gfx1102 support of FMHA example
* cancel change of format script
* 1. Enable 2-stage global Prefetch ( May cause VGPR spilling)
2. Enable FP16 accumulator blockwise_gemm
* clang-format
* 1. change blockwise gemm loopover direction from kmn to mnk ( ~1% improvement)
2. change kernel timing mode to 50 warmup + 50 timed repeat
* Update low level abstraction of blockwise gemm wmma
* (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds
* (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds
* (4/5) grouped conv pass
* (5/5) attention pass, todo: debug lds perf bug
* AIT Attention API refactor (#8)
* sanity pass
* sanity pass 2
* confirm significant performance regression.
* turn on all instances
* turn off instance format
* Fix bug & tuning & format
* DML meta, self_attn+cross_attn
* sanity pass
* remove useless flag
* update tile and problem size used in AIT attention
* bug fix in grouped conv supporting check
* deprecate inline asm wmma
* Bug fix: double lds skip
* clang-format
* Fix errors in
1. example, fmha
2. gridwise pipeline
3. deviceop, fmha, change some containers from vector to array
* part2 of previous commit
* clang format
* API fix of gridwisegemmpipeline
* separate array base and vector base attention tensor transformation
* fix gemm
* clang format
* add gemm fp16 instances
* Temp save
* fpAintB kernel compile pass
* Sanity pass.
* Temp save
* debug code enabled
* Fp16AInt8B_GEMM sanity
* MQA implementation
* GQA-4 example
* tempsave
* Compile pass
* New implementation of fp16Aint8B Gemm, Achieve similar math throughput with native fp16 Gemm
* format
* Todo: fix gemm_bilinear_wmma instances compilation bug
* Solve a bug when K1=16
* remove unnecessary changes
* Remove tensor layout limitation to LDS usage in tensor contraction
* update self-attention and cross-attention
* fix a typo of name
* Add arch limiter for fp8 gemm
* enable fp8 gemm_xdl for all gfx9 targets
* temporarily disable gemm_xdl_fp16_fp8 on MI100/200
* fix the cmake logic for gemm_xdl_fp16_fp8
* re-enable the gemm_xdl_fp16_fp8 on MI100/200
---------
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Sam Wu <sjwu@ualberta.ca>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: samjwu <samjwu@users.noreply.github.com>
Co-authored-by: haocwang <Haocong.WANG@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-03-08 17:11:51 -08:00
Illia Silin
b94fd0b227
update copyright headers (#726)
2023-05-31 18:46:57 -05:00
Haocong WANG
087e310589
[Navi3x] Multiple issue fix (#612)
...
* Change gridwise gemm mD blockwise gemm to naive
* RRR Gemm fix
* Fix RCR gemm bug
* Isolate wmma instructions
* Update amd_inline_asm.hpp
* Update amd_wmma.hpp
* Update amd_wmma.hpp
* fix syntax and update Jenkinsfile
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2023-03-10 17:04:28 -06:00
Haocong WANG
919aeb1f52
[Navi3x-LWPCK-545] Block-wise GEMM + Real GEMM_WMMA_FP16 (#541)
...
* wmma_op + unit test
* add arch limitation to wmma test
* change arch limitation
* Refactor + Add all type unit test(int4 compile failed)
* Add f32_16x16x16_bf16 unit test
* tempsave
* tempsave
* tempsave
* runtime bug, cannot find symbol
* workaround for incorrect HIP warpSize return value
* debugging
* tempsave
* Correctness OK, waiting for optimization
* Tidy up + format
* temp save
* temp save, reproduce the v_bfi_b32 issue
* add inline asm for wmmaop test
* tidy up
* clean some debug purpose code
* discard some codes
* clang format
* clang format
* compiler issue fixed + increase tile size
2023-01-16 20:06:01 -06:00
Chao Liu
d3051d7517
add license in file (#303)
2022-06-24 23:32:43 -05:00
Chao Liu
5d37d7bff4
Reorganize files, Part 1 (#119)
...
* delete obsolete files
* move files
* build
* update cmake
* update cmake
* fix build
* reorg examples
* update cmake for example and test
2022-03-08 21:46:36 -06:00