* * [CK_TILE] Add sequence padding and variable length support in fmha (and v3)
- Group Mode Padding: Introduces the `-s_qpad` argument to support
physically padded layouts. Kernels now use padded start pointers
(`seqstart_padded_*_ptr`) for memory addressing.
- Batch Mode Variable Length: Adds `-q_eff_lens` and `-kv_eff_lens`
arguments for efficient processing of variable-length sequences by
passing cumulative effective lengths (`cu_seqlen_*_ptr`) to the kernel.
- FMHA examples: Support padding and variable length both in
group and batch mode. Dispatcher is updated as well (dispatch to
kPadSeqLenK enabled pipeline).
- New padding test cases: Add padding test cases to `smoke_test_fwd.sh` and
`test_fmha_fwd.inc`, and add benchmarks to `benchmark_fwd.sh` and
`benchmark_fwd_v3.sh` as well. These test cases and benchmarks
specifically validate/benchmark the new padding and variable-length
functionalities in both group and batch modes.
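Note on the two length arrays mentioned above: a minimal sketch, assuming both `seqstart_padded_*_ptr` and `cu_seqlen_*_ptr` hold exclusive prefix sums with a leading zero (a common layout for cumulative lengths; the actual kernel layout is not shown in this message). The helper name `exclusive_prefix_sum` and the sample lengths are illustrative only.

```cpp
#include <array>
#include <cstddef>

// Exclusive prefix sum: out[i] is the start offset of sequence i,
// out[N] is the total element count.
template <std::size_t N>
constexpr std::array<int, N + 1> exclusive_prefix_sum(const std::array<int, N>& lens)
{
    std::array<int, N + 1> out{};
    for(std::size_t i = 0; i < N; ++i)
        out[i + 1] = out[i] + lens[i];
    return out;
}

// Hypothetical batch of three sequences.
constexpr std::array<int, 3> effective_lens{3, 5, 2}; // valid tokens per sequence
constexpr std::array<int, 3> padded_lens{4, 8, 4};    // physical slots per sequence

// Cumulative effective lengths: {0, 3, 8, 10}
constexpr auto cu_seqlen = exclusive_prefix_sum(effective_lens);
// Padded start offsets used for memory addressing: {0, 4, 12, 16}
constexpr auto seqstart_padded = exclusive_prefix_sum(padded_lens);
```

Under these assumptions, sequence 1 starts at physical offset 4 and only the first 5 of its 8 slots hold valid tokens.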
* [CK_TILE] Fix build error in fmha unit tests
* [CK_TILE] add mqa, gqa to sequence padding unit tests
* [CK_TILE] Reduce the number of padding seqlen unit tests in FMHA to avoid timeouts in CI
* [CK_TILE] remove unnecessary MakeKargs overload in FmhaFwdV3Kernel and FmhaFwdKernel
* Remove C++20 code
C++20 features should not be used in CK. Remove all C++20 code.
* fix c++17 build
* format
* fix merge issue
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
* Update grouped_gemm example and pipeline
* find the root cause of the error: the transpose was not enabled correctly on gfx950
* Fix v3 pipeline, row and col major
* Disable f8 datatype tests, it fails on gfx950
* fix the abd test by clearing the unsupported runtime argument
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Mateusz Ozga <mateusz.ozga@amd.com>
* rename gemm_group_quant to gemm_quant
* Add TensorWise quant mode
* Cshuffle epilogue tests with tensor scaling
* Add tensor quant to example
* Don't use readfirstlane for reading scales - doesn't work for some reason
* Add to changelog
* revert include - from a merge problem?
* revert common.hpp include
* revert host.hpp include
* remove unused utility function
* rename quant pipeline problem
* refactor quant tests
* remove aquant utils
* use TEST_F
* fix all tests by changing gemm config
* Use typed tests
* fix copyright
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Yi DING <yi.ding@amd.com>
* change host using fp16 to check
* fp8 to fp8 compare
* rewrite input parameters
* add not squant
* remove some output code
* for scale = 1
* format
* saturates only for fp8
* add fp8bf16 data type
* add fp8bf16 data type
* fix test fp8 code
* add run_fp8bf16_tests
* change fmha fwd example parameter(adding fp8bf16)
* Support fp8bf16 for Aiter
* Support aiter fp8bf16 in c++
* fix comment about fp8 in readme.md
* add fp8fp32
* add fp8fp32 test
* remove range_q etc.
* format
* fix test parameters about squant and fmha example input fp8bf16 fp8fp32 data type
* add fp8bf16 to data_type function
* change colmajor to rowmajor in test_ck_tile_fmha_fwd_fp8
* format
* reset atol for fp8
* fix bug for atol
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
* fix(grouped_gemm): numerical errors on gfx950 by correctly calculating the tail num
* WIP: add temp config to stress test numerical error correction
* refactor: remove comments
* test(grouped_gemm): add gtests for the example to maintain its integrity
* test(grouped_gemm_preshuffle): add prefill variant to testbed to cover wider range
* fix: removed residue code to make b_shuffle() work again
* test(grouped_gemm_preshuffle): limit the test suite to gfx942 arch as it fails on gfx90a
* build: add gfx950 as build target for gtests
* test(grouped_gemm_preshuffle): temporarily disable fp8 prec tests due to numerical errors
* fix(grouped_gemm_preshuffle): resolved fp8 tests failure on gfx950 by adding correct compiler flag
* Use lse = false for PagedKV tests
There are no instances with lse = true so splitkv is actually launched
as a fallback.
* Reduce build time by disabling instances that are not tested
1. Refine Reduce2dShape to support both wave32 and wave64
2. Fix example reduce, permute and elementwise on gfx11 and gfx12
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* initial commit
* remove extra files
* fixing errors
* updated README with the mapping of different quant modes to different configs
* addressing review comments
* addressing review comments
* Resolved merge conflicts
* [CK TILE GEMM] Replace get_preshuffle_or with is_quantpreshuffle_enabled
The get_preshuffle_or was not working as expected, which led to incorrect behavior
in the quantization preshuffle process. This change replaces it with the more reliable
is_quantpreshuffle_enabled function to properly determine when preshuffle should be applied.
---------
Co-authored-by: Cong Ma <congma13@amd.com>
* Improve random number generation
* use different seed for each input (Q, K, V...);
* use deterministic generation of:
* seqstart_q/k (for group mode);
* block_table (for paged-kvcache);
* cache_batch_idx (for kvcache);
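One way to give each input its own reproducible seed, shown as a sketch only: derive per-tensor seeds from a single base seed with a bijective mixer (here the splitmix64 finalizer). The names `mix64`, `Tensor`, and `tensor_seed` are hypothetical and are not taken from the FMHA runners.

```cpp
#include <cstdint>

// splitmix64 finalizer: a bijection on 64-bit values, so distinct
// inputs always give distinct outputs.
constexpr std::uint64_t mix64(std::uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Hypothetical tags for the inputs that need independent random data.
enum class Tensor : std::uint64_t { Q = 1, K = 2, V = 3, Bias = 4 };

// Same base seed + same tensor -> same seed (deterministic);
// same base seed + different tensor -> different seed.
constexpr std::uint64_t tensor_seed(std::uint64_t base_seed, Tensor t)
{
    return mix64(base_seed + 0x9E3779B97F4A7C15ULL * static_cast<std::uint64_t>(t));
}
```

Each derived value can then seed a separate `std::mt19937_64`, so Q, K, and V no longer share one random stream.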
* Extract arg_parser-related code from run functions to use them as tests
* Split examples into main programs and fmha runners, build instances separately
* Add dummy tests that use instances and runners
* Fix a missed corner case of f32->f8 conversion
When the value is < min f8 denormal but > min f8 denormal / 2, it must be
rounded to min f8 denormal (i.e. 0b1), not to 0.
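The corner case can be illustrated with a small host-side sketch of round-to-nearest (ties-to-even) onto the subnormal grid. This is not the conversion code from the commit. The function name is hypothetical, and the grid spacing of 2^-9 assumes the OCP E4M3 format (other f8 variants use a different value).

```cpp
// Returns the subnormal mantissa code (0, 1, 2, ...) nearest to a small
// non-negative value, using round-to-nearest with ties-to-even.
// `min_denormal` is the spacing of the subnormal grid. Only meaningful
// for values in or near the subnormal range.
constexpr int nearest_subnormal_code(double value, double min_denormal)
{
    const double scaled = value / min_denormal;
    int code            = static_cast<int>(scaled); // truncation == floor for scaled >= 0
    const double frac   = scaled - code;
    if(frac > 0.5 || (frac == 0.5 && (code % 2) != 0))
        ++code;
    return code;
}

// Assumption: OCP E4M3, whose smallest subnormal is 2^-9.
constexpr double kMinDenormal = 1.0 / 512.0;
```

With this rule, `0.75 * kMinDenormal` maps to code 1 (the case fixed here), while `0.25 * kMinDenormal` and the exact tie `0.5 * kMinDenormal` map to 0.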
* Fix incorrect fp8 scales for P and O in validation code
DataTypeConfig was incorrectly compared with fp8_t.
* Add host generation of dropout random values and use it for validation
Previously host validation (reference_batched_dropout) used random
numbers generated by BlockDropout of the kernel, meaning that incorrect
generation on device (bad distribution, repeated numbers, too many zeros,
etc.) would not trigger any validation errors.
* Implement tests from smoke_test_bwd.sh
* Return result as enum to distinguish failure and missing instance
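A sketch of why a three-state result helps the tests: a missing instance can be reported as skipped instead of failed. All names below (`run_result`, `test_outcome`, `classify`) are hypothetical and may differ from the enum in the runners.

```cpp
// Hypothetical result type returned by a runner.
enum class run_result { success, failure, no_instance };

// Hypothetical outcome reported by the test harness.
enum class test_outcome { passed, failed, skipped };

// A missing kernel instance is not a correctness failure, so it is
// mapped to "skipped" rather than "failed".
constexpr test_outcome classify(run_result r)
{
    switch(r)
    {
    case run_result::success: return test_outcome::passed;
    case run_result::no_instance: return test_outcome::skipped;
    case run_result::failure: break;
    }
    return test_outcome::failed;
}
```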
* Add tests for bwd features: bias, alibi, dropout
* Implement tests from smoke_test_fwd.sh
* Pass seqlen_q/k as vectors to fwd and bwd runners
* Add tests for fwd features: bias, alibi, dropout
* Add tests for pagedkv and splitkv
* Fix conditions when to use splitkv and pagedkv kernels
splitkv was executed only when use_kvcache was set, where use_kvcache == (need_append_kvcache || use_cache_batch_idx || 0 < page_block_size).
In the SplitKV tests: the regular fwd kernel was executed if use_cache_batch_idx was not requested even when num_splitkv > 1.
In the AppendKV tests: the pagedkv kernel was executed but it often failed to find an instance.
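The SplitKV problem described above can be reproduced with a small predicate sketch. `old_uses_splitkv` follows the condition quoted in this message. `revised_uses_splitkv` is only one plausible correction and is an assumption, since the final condition is not spelled out here. The struct and function names are hypothetical.

```cpp
// Hypothetical bundle of the options named in the commit message.
struct kv_options
{
    bool need_append_kvcache;
    bool use_cache_batch_idx;
    int page_block_size;
    int num_splits;
};

constexpr bool use_kvcache(const kv_options& o)
{
    return o.need_append_kvcache || o.use_cache_batch_idx || 0 < o.page_block_size;
}

// Old rule as described: split-KV only when a KV cache is in use.
constexpr bool old_uses_splitkv(const kv_options& o) { return use_kvcache(o); }

// Assumed revision: also honor an explicit request for several splits.
constexpr bool revised_uses_splitkv(const kv_options& o)
{
    return 1 < o.num_splits || use_kvcache(o);
}
```

With `num_splits = 4` and no KV cache options, the old rule selects the regular fwd kernel, which is the behavior the SplitKV tests exposed.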
* Add tests for appendkv
* Use is_v_rowmajor = true because there are no instances with column layout anymore
* Split public and private compile options for instances
Tests and examples need to know only about CK_TILE_FMHA_FWD_*_API.
* Improve parsing validation in bias and mask
* Pass bias as string for consistency with mask
* Catch parsing and other exceptions
* Add bwd test for deterministic flag
* Initialize fp8 tensors (-init=ufq) similarly to uf
* Fix splitkv/pagedkv invocation: use padded sk when seqlen_k_ptr is not null
seqlen_k cannot be used to determine padding when seqlen_k_ptr is
provided. The actual seqlen_k is taken from seqlen_k_ptr[b].
Even when seqlen_k is a multiple of bn0 (seqlen_k % bn0 == 0), the padded
seqlen_k path must be used, because seqlen_k_ptr may contain arbitrary values.
In the example or tests this produces incorrect results with appendkv
(for example, -d=32 -s=1 -s_k=64 -s_knew=7 -vlayout=c -b=8).
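The rule can be summarized in a short sketch. The function name and the `bn0 = 64` value in the usage note are hypothetical, and the real dispatch code may be structured differently.

```cpp
// Decide whether the launch must use the seqlen_k-padded kernel variant.
// `bn0` is the tile size along seqlen_k.
constexpr bool needs_padded_seqlen_k(int seqlen_k, int bn0, bool has_seqlen_k_ptr)
{
    // With per-batch lengths from seqlen_k_ptr, the host-side seqlen_k says
    // nothing about the divisibility of the real lengths, so always pad.
    return has_seqlen_k_ptr || (seqlen_k % bn0 != 0);
}
```

For instance, with `seqlen_k = 64` and an assumed `bn0 = 64`, the unpadded path is valid only when no `seqlen_k_ptr` is supplied.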
* Fix use_pagedkv value when kvcache = true but page_block_size = 0
In this case block_table_ptr is nullptr which is accessed in the kernel.
* Clean up bwd tests
* Unify fwd tests for f16/bf16 and fp8
* Use better explicit instantiation declaration for fmha_bwd<2>
* Use the same seed for all tests, allow to override it with env variable
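A minimal sketch of an environment override for the seed. The variable name `EXAMPLE_TEST_SEED`, the default value, and both function names are placeholders, not the ones used by the tests.

```cpp
#include <cstdint>
#include <cstdlib>

// Parse a non-negative decimal seed; fall back when the text is missing,
// empty, or contains a non-digit. Overflow is not checked in this sketch.
constexpr std::uint64_t parse_seed_or(const char* text, std::uint64_t fallback)
{
    if(text == nullptr || *text == '\0')
        return fallback;
    std::uint64_t value = 0;
    for(const char* p = text; *p != '\0'; ++p)
    {
        if(*p < '0' || *p > '9')
            return fallback;
        value = value * 10 + static_cast<std::uint64_t>(*p - '0');
    }
    return value;
}

// Hypothetical default shared by all tests.
constexpr std::uint64_t kDefaultSeed = 1;

// Reads the (hypothetical) EXAMPLE_TEST_SEED variable once per call.
inline std::uint64_t test_seed()
{
    return parse_seed_or(std::getenv("EXAMPLE_TEST_SEED"), kDefaultSeed);
}
```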
* Undo clang-format of one irrelevant file
For some reason my local clang-format-18 and the one in CI work differently.
* Do not build instances and tests on unsupported archs
* Build instance libraries as OBJECT library
* CI: Enable sccache for HIP
There are source files with LANGUAGE HIP, they need
-DCMAKE_HIP_COMPILER_LAUNCHER=sccache
* Add tests to REGRESSION_TESTS
* Fix OOB accesses in deterministic bwd due to incorrectly assumed kN0
The runner assumes kN0 = (hdim_q <= 128) ? 128 : 64 but there are
smaller tiles (for tr_load or fp32). This can create too small dq_acc_buf.
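A sketch of the sizing problem, assuming the deterministic path keeps one dq_acc slice per kN0 tile along seqlen_k (an assumption, since the buffer layout is not given in this message). The tile size 64 used for hdim_q = 64 in the usage note is a hypothetical example of a "smaller tile".

```cpp
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// The tile size the runner assumed, as quoted in the commit message.
constexpr int assumed_kN0(int hdim_q) { return hdim_q <= 128 ? 128 : 64; }

// Number of dq_acc slices needed for a given tile size (assumed layout).
constexpr int dq_acc_slices(int seqlen_k, int kN0) { return ceil_div(seqlen_k, kN0); }
```

For `seqlen_k = 512` and `hdim_q = 64`, the assumed tile gives 4 slices, while a kernel that actually uses kN0 = 64 writes 8, so a buffer sized from the assumption is half as large as required.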
* Pass CK_TILE_FMHA_FWD_*_API as INTERFACE compile options
The instances don't actually depend on them, only examples and tests do.
Passing these definitions as INTERFACE allows to change FMHA_FWD_ENABLE_APIS
without recompiling instances that are already in ccache.
* Fix formatting and names
BlockWarps and WarpTile in Generic2dBlockShape are wave-size dependent, which causes a mangled-name mismatch between the host and device sides.
Solution: Replace them with ThreadPerBlock and move the BlockWarps and WarpTile calculation into Generic2dBlockShape.
- Add support for tensor A/B in both fp16+pk_int4_t and fp8+pk_int4_t formats
- Implement A(bf8) B(i4) support in universal GEMM
- Use new implementation for i4 to fp8 conversion in Block Scale
* Add cshuffle epilogue test
* add the poc implementation to the epilogue and tests
* refactor cshuffle epilogue
* WIP: adding tensor/tile usage to scale_tile
* fix usage of tile_elementwise_inout
* add gemm_quant_kernel for generalizing gemm quant kernel
* Add problem specific to different quants, add QuantType to Traits
* Add quant_type to quant_kernel template parameters
* Create aq/bq_block_windows and views depending on QuantType
* Use tile windows as inputs in cshuffle epilogue
* Fix some issues in epilogue
* initial new example code for new general gemm quant kernel test
* Fix issues in kernel
* Add verification check for rowcol Quantmode
* use AccDataType instead of AQ in pipeline
* fix aquant preshuffle
* fix formatting
* some cleanup
* remove gemm_aquant_basic.cpp
* remove gemm_aquant_kernel.hpp
* fix tests for the renamed quant kernel
* fix formatting
* clean example files
* fix some merge conflicts
* fix preshufflequant rename issue
* updating
* fix some templates after merging with develop
* fix test preshuffle parameter
* fix formatting
* updating kernels
* change update user
* test username
* update quant_grouped_gemm example
* update example
* Unify bquant kernel to the common quant kernel
* remove bquant kernel also from common header
* fix formatting
* clean up commented code
* update grouped_gemm_quant example
* fix formatting config hpp
* fix merge mistake
* Non-const for movable windows
* fix formatting
* update tileloop pipeline
* Fix grammar in README
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Remove #include<bit> and clean up example
* fix strides
* Add some descriptions for move_windows
* fix tensor print bug
* update quant_grouped_gemm example
* remove useless code
* cleanup code
* clean up code & format code
* fix compile & running bug in grouped_gemm example
---------
Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: liyingli <liyingli@amd.com>
Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* docs(README): update readme with new build instructions
* feat(grouped_gemm): add support back for non persistent kernel
* refactor(grouped_gemm): simplify tensor creation
* refactor(grouped_gemm): Persistence is now GemmConfig value for easier management
* chore(grouped_gemm): add print statements to ease debugging
* WIP(grouped_gemm): add grouped_gemm_preshuffle example and update CMake configuration
* fix(tile_gemm_traits): change default value of Preshuffle_ from 0 to false for clarity
* WIP(grouped_gemm): add dummy variables to compile the preshuffle pipelines
* chore(grouped_gemm): add print statements and variables to debug numerical error with preshuffle
* style: clang format work so far
* BUG!(grouped_gemm_kernel.hpp): figured out a potential bug causing numerical errors in preshuffle pipeline
* fix(grouped_gemm_kernel): add function in the kernel code to dynamically calculate tail_number resolving numerical errors
* refactor(gemm_preshuffle): make preshuffle pipeline v2 compatible with operator () calls from grouped gemm
* chore(grouped_gemm): add/remove debug comments and debug print statements
* feat(grouped_gemm): integrate preshuffle pipeline v2 into grouped gemm for all supported shapes
* chore(gemm_profile): add new argument combinations
* fix: branch cleanup, formatting, refactoring
* fix: branch cleanup, formatting, refactoring
* chore(changelog): update changelog to reflect new feature
* address review comments & nit
* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed when the following flags were set: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
* Fixed cmake build errors related to test_fp8
* Updates to support mixed precision
(cherry picked from commit e65d71180393e7b66169c56565a6bac740427de6)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip
(cherry picked from commit f8c06322df0abcbd5945a56cdf5bffe56480f9f0)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Added support for F8xF16xF16 to gemm_wmma_universal
(cherry picked from commit 15c851de6daa513a12c2e3af299bab0176175fb5)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Added support for F16xF8xF16 to gemm_wmma_universal
* Added support for BF16xI4xBF16 to gemm_wmma_universal
(cherry picked from commit c6a4a69d2d43d59bae8bdabfae80d648646f217e)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Added support for F16xI4xF16 to gemm_wmma_universal
* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType
* Added missing test class for FP16_KM_NK
* Pre-commit hooks fixes
* Added padding instances for f16xf16xf16
* Fixed cmake errors related to gemm_bilinear. Previously, the cmake build failed when the following flags were set: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
(cherry picked from commit 5bdc993dbf)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Fixed cmake build errors related to test_fp8
(cherry picked from commit 12176616b6)
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* Amending changes for adding support for padding instances for f16xf16xf16
* Fixes for padding instances for f16xf16xf16
* Added padding instances for bf16xbf16, f8xf8
* Added packed instances for bf16xi4xbf16
* Added padding instances for f8xf16xf16
* Added padding instances for f16xf8xf16, f16xi4xf16
* Fixed typos for bf16xbf16xbf16 padding instances
* Fixed typos for padded instances
* Added tests for fp16, KM_KN and KM_NK
* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.
* Fixed typos
* Updated the set of tests for FP16
* Updated the set of tests for FP16
* Fix typo
* Moved f16xi4 test under the correct data layout group
* example for gemm_universal_bf16
* Adding examples for gemm_wmma instances
* Added the missing parameters
* Fixed review comments and added executable to cmakeLists
* Fixing clang format
* Fixing build errors
* Fixed compilation failure.
* Modified some code as per gemm_universal_examples
* Fixed the gemm specialization error
* Fixed the build errors.
* Fix strides of a/b_thread_desc
The descriptors are larger than needed (even though the compiler doesn't allocate registers for unused values).
* Load in M/NRepeat dims with thread copy's slice instead of a loop
* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation
* Implement Intrawave and Interwave variants of pipeline v1
* Add instances for Interwave and Intrawave v1
* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0
* Remove instances that are too slow (mostly because of register spilling)
* Add a workaround for fp8/bf8->f32 packed conversion issue
* Add instances for Interwave and Intrawave v1
* Enable profiling of mixed precision with f8 and int4 on WMMA
* Fix segfault in profiler when B is pk_i4_t
b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.
* Remove instances that are too slow (mostly because of register spilling)
* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations
* Add test case for bf16_i4
* Add missing Regular tests
* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS
They take more than 30 seconds
* Fix a bug that fp16_i4 validation passes only with PermuteB
A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.
* Use PermuteB with f16_i4 in most instances (as xdl)
Some instances use PermuteB = false for checking correctness.
See also the previous commit.
* Fix cache flushing for pk_i4
* Add mixed precision examples
* Disable all tests and instances with f8 on gfx11
Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.
* Add FP16 KM_NK and KM_KN test suites for XDL
These tests were added to common .inc for better testing of WMMA instances
* Support multiple D in GridwiseGemm_wmma_cshuffle_v3
DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.
* Use ThreadGroupTensorSliceTransfer_v7r3
* Clone device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support
* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma
* Implement DeviceGemmMultipleD_Wmma_CShuffleV3
* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3
* Prepare gemm_add tests for adding wmma
* Add gemm_add_fastgelu instances and test
* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API
ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.
* removed unnecessary ck parts from compilation
* initial gemm_add_multiply instance implementations
* fixed profiler help message for gemm_add_multiply
* improved multiply_add profiler layout help
* fixed template arguments for test instances
* added test for gemm_add_multiply
* switched to splitK interface
* log print added to splitk benchmarks
* revert main cmake comments
* newline change reverted
* added add_fastgelu instances
* revert unintended change in xdl add_fastgelu
* created gemm_add_add_fastgelu instances
* created fastgelu instances
* added tests for all splitk fastgelus
* Added tests.
* multiply_add instances created
* updates to add_multiply splitk instances
* splitk xdl test fixes
* added wmma multiply_multiply instances
* fixed ONLY_XDL_AND_WMMA_KERNELS tag
* Added gemm_add examples for wmma v1 and v3
* fixed / worked around i8 instances
* Modified the v3 code to add one fp16 bxdl instance.
* added bf16 xdl instance.
* adding gemm_add wmma_cshuffle and other support
(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* add instances into CMakeLists
(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* This is work in progress, edited the template parameters in order to build
(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* temp work saved, changed the BDataType to f16 or bf16 since wmma currently does not support non-equal A and B datatypes
(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* added datatype and use clang-format-12
(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
* Fixing build errors
* Added instances for v3
* Adding instances and executables
* Code update of template parameters modified.
* Renamed file.
* Added tests.
* resolved error tests.
* Fixing build errors
* Updated comments
* removed the changes as per the MR review comment.
* Updated tests.
* fp8 instances - not tested
* Restored the Cmake file that was reverted by mistake during rebase.
* fixed wmma_op test
* Updated comments.
* Updated the template parameter description
* fixed rdna4 instances
* fixed back compatibility on gfx11
* cleanups
* fix ckProfiler
* one more cmake fix
* added fp8 instances
* Updated tests to add BF16 instances as per review comment
* Added include file and cleaned up(as per review comment)
* Updated and optimized the example code for all types.
* Fixed clang format
* Resolve "Implement `device_gemm_bilinear` for RDNA4"
* test generalization to handle FP16 shuffle better
* added missing changes
* Added bf16 wmma instance for add_relu
* Added f16 wmma instance and corrected bf16 instance errors.
* Added instances to Cmake
* Modified the template parameters to make the instances work.
* Fixed typo in profiler
* Added v3 instances for gemm_add_relu
* addressed core review comments
* Added test for gemm_add_relu wmma instance
* Cleaned up the code.
* Added examples for gemm_add_relu
* Fixing typo to resolve build errors.
* Fixes applied to fix the precision loss.
* fix bilinear test after merge
* Removed the old wmma instances.
* Added wrapper and renamed the wmma_v3 instances
* Updated copyrights and added wrappers.
* Fixes applied according to review comments
* Apply 1 suggestion(s) to 1 file(s)
Co-authored-by: Robin Voetter <robin@streamhpc.com>
* Removed the old wmma instances.
* Updated wrapper for the v3 instances
* removed the old wmma examples
* Renamed the v3 instances
* Deleted the gtest file added by mistake.
* Updated thge profiler with wrapper
* Fixed test errors.
* Fixed the review comments
* Fixed the if condition MACROS.
* REVERTED THE PROFILER CHANGES
* Revert "REVERTED THE PROFILER CHANGES"
This reverts commit 21cb98546c.
* Revert "Fixed test errors."
This reverts commit 13efcc6fe1.
* Revert "Updated thge profiler with wrapper"
This reverts commit 536f86661d.
* Added missing wrapper instances
* Updated copyrights.
* Fixed typo.
* Fixed copyrights.
* Updated copyrights.
* updated copyrights.
* comments on the atomics workaround
* fixed cmake comment
* Fix bug from merge
* clang-format-18
* Fix compilation error
* Fix linking error
* Fix bug in add and add_relu examples
* Fix error including file (typo)
* Quick fix to compile examples for different targets
* Fix for multi target
* implemented f16 and bf16 instances for gemm_silu
* addressed review comments
* addressed review comments
* Fix clang format
* Fix clang format
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com>
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Add cshuffle epilogue test
* add the poc implementation to the epilogue and tests
* refactor cshuffle epilogue
* WIP: adding tensor/tile usage to scale_tile
* fix usage of tile_elementwise_inout
* add gemm_quant_kernel for generalizing gemm quant kernel
* Add problem specific to different quants, add QuantType to Traits
* Add quant_type to quant_kernel template parameters
* Create aq/bq_block_windows and views depending on QuantType
* Use tile windows as inputs in cshuffle epilogue
* Fix some issues in epilogue
* initial new example code for new general gemm quant kernel test
* Fix issues in kernel
* Add verification check for rowcol Quantmode
* use AccDataType instead of AQ in pipeline
* fix aquant preshuffle
* fix formatting
* some cleanup
* remove gemm_aquant_basic.cpp
* remove gemm_aquant_kernel.hpp
* fix tests for the renamed quant kernel
* fix formatting
* clean example files
* fix some merge conflicts
* fix preshufflequant rename issue
* fix some templates after merging with develop
* fix test preshuffle parameter
* fix formatting
* Unify bquant kernel to the common quant kernel
* remove bquant kernel also from common header
* fix formatting
* clean up commented code
* fix formatting config hpp
* fix merge mistake
* Non-const for movable windows
* fix formatting
* Fix grammar in README
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Remove #include<bit> and clean up example
* fix strides
* Add some descriptions for move_windows
---------
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Fix a typo
* Use std::variant to call run_gemm_example_with_layouts with the available layout variant combinations
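A minimal sketch of the `std::variant` dispatch idea: runtime layout characters are turned into variants, and `std::visit` instantiates the templated runner for every layout combination. `run_with_layouts` is a hypothetical stand-in that returns an index instead of launching a GEMM, and the layout tag types are defined locally for illustration.

```cpp
#include <stdexcept>
#include <variant>

// Local stand-ins for the layout tag types.
struct RowMajor    { static constexpr char tag = 'R'; };
struct ColumnMajor { static constexpr char tag = 'C'; };
using Layout = std::variant<RowMajor, ColumnMajor>;

// Map a command-line character to a layout variant.
constexpr Layout parse_layout(char c)
{
    if(c == 'R') return RowMajor{};
    if(c == 'C') return ColumnMajor{};
    throw std::invalid_argument("unknown layout");
}

// Hypothetical stand-in for the templated runner: returns an index that
// identifies which (A, B) layout combination was instantiated.
template <typename ALayout, typename BLayout>
constexpr int run_with_layouts(ALayout, BLayout)
{
    return (ALayout::tag == 'C' ? 2 : 0) + (BLayout::tag == 'C' ? 1 : 0);
}

// std::visit picks the matching instantiation for the runtime values.
constexpr int dispatch_layouts(char a, char b)
{
    return std::visit(
        [](auto a_layout, auto b_layout) { return run_with_layouts(a_layout, b_layout); },
        parse_layout(a),
        parse_layout(b));
}
```

Compared with a hand-written chain of `if`/`else` over layout strings, the set of instantiated combinations follows directly from the alternatives listed in `Layout`.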
* Use a unified run_gemm_example_prec_type for basic gemm and universal gemm
* Factor out run_gemm_example_prec_type
* Refactor argument parsing in gemm_splitk_two_stage_reduce.cpp
* Parse arguments outside of create_args
* Move the gemm operators to separate structs to facilitate their reuse
* Move the invokers to separate files to facilitate their reuse
* Rename the invoker files for consistency with the examples that use them
* Add fp32 support to the elementwise examples, and produce an error message for unsupported types
* Get rid of four unused variables
* Make two variables const
* Add support for different input-output type combinations in elementwise examples
* Test support for different input and output types in elementwise examples
* Add support for different operations in the elementwise unary tests
* Add support for UnaryConvert in the elementwise unary tests
* Add support for bf16 in elementwise examples, excluding unsupported type combinations
* Make some operator parameters const in ElementWiseKernel
* Remove some unnecessary include statements
* Implement a two-stage GEMM that does a type conversion in the second stage using the elementwise kernel
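A host-side reference sketch of the two-stage idea, under these assumptions: stage 1 accumulates the split-K partial products in an fp32 workspace, and stage 2 is an elementwise conversion to the output type. The function name is hypothetical, `OutT = int` in the usage note stands in for fp16/bf16 (which standard C++17 lacks), and the real example runs both stages as device kernels.

```cpp
#include <array>
#include <cstddef>

// A is row-major M x K, B is row-major K x N, k_batch >= 1.
template <typename OutT, std::size_t M, std::size_t N, std::size_t K>
constexpr std::array<OutT, M * N> splitk_two_stage(const std::array<float, M * K>& a,
                                                   const std::array<float, K * N>& b,
                                                   std::size_t k_batch)
{
    // Stage 1: every K-slice adds its partial product into the fp32 workspace.
    std::array<float, M * N> workspace{}; // zero-initialized, i.e. "cleared"
    const std::size_t k_per_batch = (K + k_batch - 1) / k_batch;
    for(std::size_t batch = 0; batch < k_batch; ++batch)
    {
        const std::size_t k_begin = batch * k_per_batch;
        const std::size_t k_stop  = k_begin + k_per_batch;
        const std::size_t k_end   = k_stop < K ? k_stop : K;
        for(std::size_t m = 0; m < M; ++m)
            for(std::size_t n = 0; n < N; ++n)
                for(std::size_t k = k_begin; k < k_end; ++k)
                    workspace[m * N + n] += a[m * K + k] * b[k * N + n];
    }

    // Stage 2: elementwise type conversion from the workspace to the output.
    std::array<OutT, M * N> c{};
    for(std::size_t i = 0; i < M * N; ++i)
        c[i] = static_cast<OutT>(workspace[i]);
    return c;
}
```

The template sizes cannot be deduced from the array types, so a call spells them out, for example `splitk_two_stage<int, 2, 2, 4>(a, b, 2)`.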
* Clear workspace instead of output when flushing the cache in SplitKTwoStageInvoker::gemm
* Fix formatting issues reported by clang
* Add back CK_TILE_USE_WMMA related changes
* Use the right prec type for bf16 in the universal GEMM and two stage split K examples
* Add some brackets
* Add some brackets
* Separate the clearing of the GEMM output memory from the cache flushing in the universal GEMM example
* Separate the clearing of the GEMM output memory from the cache flushing in the split K two stage example
* Fix formatting
* No need to call SetZero on ws_m_n_dev_buf here, as clear_gemm_output now does this as part of the kernel preprocessing
* Add fp16 data type to splitk two stage example
* Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the basic GEMM example
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding json in 03_gemm
* Add json dump to 16_batched_gemm
* Add json dump to gemm_multi_d_fp16
* Add json dump to grouped_gemm
* fix fmha_bwd/fwd
* Fix clang-format errors
exclude include/rapidjson in jenkins as it's a third-party library
* Separating function and definition.
* Update Documentation of 03_gemm
* Refactoring as per code review
* Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt
* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt
* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
---------
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
* fix clang format
* remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt
* Fixing Readme and unifying jsondumps
* adding moe_smoothquant
* adding fused_moe
* Fixing Readme for batched_gemm
* Fixing Readme for grouped_gemm
* adding flatmm
* adding gemm_multi_d_fp16
* adding elementwise
* adding File name when json is dumped
* Fixing Reduce after merge
* adding batched_transpose
* Adding Warptile in Gemm
* Fixing Clang Format
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>