Adam Osewski
1d8e4ec2ce
Jing's contribution: prototype of mixed-precision FP16/BF16 x int4 GEMM (#1762)
* add a prototype of int4
* clean
* debug
* clean
* clean
* move packed into dynamic_buffer
* fixed coord reset
* add fast pki4 to half conversion
* fix
* fixed reference and host_tensor
* fixed tensor init
* format
* debug i4_to_f16_convert
* format
* fixed splitk
* weight permute
* add b tile permute
* clean
* weight permute with splitki
* format
* improve weight layout
* add and_or_b32
* fixed splitk crash
* add permute switch as a template
* recover v3r1
* clean
* failure with intrawave v2
* fixed
* fixed
* add ckProfiler
* add bf16 support
* add bf16 example
* fixed int4 to bhalf_t conversion
* format
* fixed int4 to bf16 conversion
* clean
* add instances for mem
* clean
* fixed host tensor size
* fixed
* debug
* fixed
* add pk_i4_t as a struct
* fix
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* revert
* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* fixed comments
* revert
* clean
* revert
* revert
* fixed
* Update CMakeLists.txt
* Update script/cmake-ck-dev.sh
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Update CMakeLists.txt
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* fixed
* fixed
* fixed
* revert
* revert
* add comments
* format
* fixed assert
* fixed
* Fix I4 define in ckProfiler
* Fixed example_gemm_xdl_bf16_pk_i4_v3 test failure
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-02 11:48:06 +08:00
Adam Osewski
b6bcd76d88
CK-Tile first draft of universal block gemm with interwave & intrawave scheduler (#1676)
* Block universal gemm.
* Universal block gemm with interwave scheduler - draft.
* Refactoring
* Move a/b_warp_tiles into BlockGemmImpl
* set BlockGemmImpl as a class member
* Change tile size to better suit memory-bound cases.
* Introduce kKPerThread to WarpGemm
* Add documentation comment.
* Fix Interwave scheduler block gemm.
* Add compute/memory friendly tile configuration.
* Clean
* New tile configurations in gemm mem example.
* Add more static checks and fix loop order in block gemm.
* Add more static checks and use warp gemm mfma dispatcher.
* Add default scheduler block gemm.
* Remove logging in example.
2024-11-26 08:45:14 +01:00
Rostyslav Geyyer
7d576f1748
Update GPU verification (#1596)
* Update inits
* Update static_cast to type_convert
* Add verification option selection
2024-10-25 08:13:46 -07:00
Rostyslav Geyyer
3f710930f6
Update default stride (#1576)
* Update default stride value to -1
* Fix format
* Revert "Fix format"
This reverts commit ae0c3649ec.
---------
Co-authored-by: Harisankar Sadasivan <135730918+hsadasiv@users.noreply.github.com>
2024-10-21 08:45:22 -07:00
Haocong WANG
5b10dae6a4
Add gemm universal bf16 instances (#1484)
* revert ckprofiler change
* temp save
* Add test and test pass
* test pass
* Fix bug inside rotating buffer when tensor is not packed
* bug fix
* clang format
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2024-09-04 20:58:54 -07:00
Haocong WANG
764164b488
[GEMM] UniversalGemm update (#1262)
* Add bf16 instances
* Add bf16 gemm universal example
* tempsave
* Add guard to navi compilation
* workaround for a specific mixed gemm instance (bring it back when the compiler fix is uploaded)
* fix formatting condition statement issue
* solve conflict
---------
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
2024-04-26 12:56:07 -05:00
Haocong WANG
f83e9701e9
[GEMM] Gemm universal device operation (#1154)
* Optimize GEMM on MI200/300:
1. Add new blockwise gemm pipeline
2. Add irregular splitk instances
* clang format + typo fix
* Fix a bug
* initial commit
* Add more instances to irregular splitk
* blkgemm pipeline v1~4 prototype
* Sanity checked. Known issues:
1. Poor performance of splitk
2. Register spill on blkgemmpipeline v3
* Sanity and Performance fix:
1. fix a bug related to sanity in grouped b2c mapping
2. fix a bug related to sanity and performance in splitk offset
* Sanity and API update:
1. Remove prefetch stage
2. Fix valid check bug
3. Add first gemm_universal instance into ckProfiler
* Add NN instances for gemm universal
* 1. Add NT instances for gemm_universal
2. Fix a bug about Kpadding in gemm_universal
* Fix a bug in padding when K is odd
* remove kernel print
* Fix KPadding bug...
* Update safety check
* another try to fix kpadding..
* Sanity checked
* new instances..
* clang format+typo fix
* remove clang format script's change
* Add non-hotloop compile option
* 1. Add fp16xfp8 example
2. pull packed convert f8 from pr1150
* Some misc. optimizations and fixes
* Add pipeline description docs
* Split universal gemm instance library to cut profiler compiling time
* uncomment cmakefile
* Fix a bug caused by blockwise_gemm_pipe_v2
* reduce default splitk to 1
* Add 224x256x64 tile size
* update, including:
1. Experiment pipeline 5~7
2. Optimization for pipeline 4
3. Organized instance library
* temp save
* temp save
* Permuted lds layout, sanity and function checked
* clang format
* Move OOB check from RunRead to RunWrite, for better software pipeline.
TODO: agpr spill when NN layout
* clangformat
* A/B splitpipe scheduler for v3
* Fix two bugs
* bug fix
* fix a bug in oob check
* Example for mixed fp16_fp8 gemm
* Clean experimental code blocks
* Add mixed precision gemm into profiler
* tempsave
* optimize m/n major lds layout
* Add RRR GEMM mixed precision instances
* Optimize f8 matrix transpose
* Add test_gemm_universal
* A/B split schedule for blkpip v5
* Take ds_read2 into iglp scheduling scheme
* format
* fixed cmake
* Add llvm-option into CI cmake flag
---------
Co-authored-by: Jing Zhang <jizhan@amd.com>
2024-04-13 21:03:18 -05:00