* add one daily ci build on gfx908
* add redis invocation tag for gfx908
* make ci build for gfx908 conditional
* fix groovy logic
* add option to run perf tests for gfx908
* disable a few tests on mi100
* Added two kernel for M=32 problem
* Comment the first one
* Enable multiply_multiply for Scale_Block_M = 1 for deepseek
* Modify the a_thread offset since the A data load is different from B.
* edit fp8 ab scale for Scale_Block_M=1
* edit GemmSpec to MNKPadding
* enable blockwise pipelie v1 and v2. v1 is work for small K.
* add instance for gemm_ab_scale
* fix cmakelist of ckProfiler
* optimize blockscale gemm. todo: reduce vgpr usage
* fix a correctness bug
* sanity checked
* revert ckprofiler cmake changes
* clang format
* revert unnecessary changes.
* remove commented codes.
* split weight preshuffle library targets
* bring back enable-post-misched=0
* fix build issues for gemm_multiply_multiply_fp8 instances
* fix clang format
* add verbose build flag when building for all targets
* reduce path names for new instances
* fix paths in cmake
* refactor gemm_multiply_multiply library target
* fix a bug in example
* fix example 65 cmake
* reduce the number of threads when building libs for all targets to 50
* use ninja to build for all targets
* reduce teh number of threads when building for all targets
* reduce the number of threads to 32 when building libs for all targets to 50
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Add Gemm fp8xint4 example and kernel, function pass.
* Init Gemm_fp8xint4 Bpreshuffle
* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet
* General fix.
* fp8xint4 bpreshuffle function pass
* fix.
* init b preshuffle dequant in VGPR.
* fix bug, function pass.
* move b thread dequant copy to blockwise.
* fix bug, function now passes.
* modified the tile size to 256, 128x128x128.
* fixed a bug.
* Initial int4 moe, compile pass, function not check.
* fix bug in moe_gemm1.cpp, now function pass.
* test expert = 8 and function pass.
* Added moe_pk_i4_gemm2, function pass.
* Added b preshuffle pipeline v3 support.
* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.
* Split the blockwise pipeline for fp8xint4.
* commit missing files
* opt gemm2 to 2x2 wave
* fix swizzle = false
* update int4 moe with latest input changes.
* update tile size.
* enable pipeline v3.
* fix nswizzle = true
* commit a version for compiler debug.
* Updated transfer_v3r1_gather to support pk_i4_t type.
* for int4 moe2 for type_convert support.
* remove some values between mfma instructions.
* fix int4 moe
* Updated transfer_v3r1_gather to support pk_i4_t type.
* i4 support lds multiple shuffle
* fixed int4 moe tflops calculation.
* Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle
* updated gemm2.
* change int4 moe example names
* fix and format code.
* format.
* format codes.
* update fp8xint4 example tile size.
* add <unordered_map> header
* fixed.
* format.
* Added conditional compilation for int4 -> fp8 conversion kernels
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
* Re-implement qr_ks_vs_async pipeline by using kLoadOnce
* Remove last block_sync_lds() in the loop
* Tiny adjustment in qr_ks_vs_async pipeline for better performance
* Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline
* Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline
* Use un-rolled gemm for Gemm-0
* Use k0_loops small tile load/store to replace the big tile load/store for K
* Remove the commented lines in qx_ks_vs_custom_policy.hpp
* Tune the prefetching of V in qr_ks_vs_async pipeline
* Move the codes for storing the first v_lds tile some later
* Let BlockDropout reuse LDS with V
* Switch to separate code blocks according to iteration index
* Interleave code blocks for better performance
* Move clear_tile(s_acc) for better interleaving
* Move code interleaving
* Use MakeQDramTileDistribution for q_dram_window
* Roll-back to load Q directly from global memory instead of using LDS as intermediary stop
* Let V reuse the LDS of K
* Use array of tiles to represent Q in vgprs
* Use QLoadOnce == false for qr_ks_vs_async pipeline
* Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline
* Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile()
* Move the definition of v_tiles out from the loop
* Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile()
* Remove using KLoadOnce in qx_ks_vs_custom_policy
* Remove un-used get_slice_tile() call
* Move the code line of clear_tile(s_acc)
* Tune the lines of codes to make them more tidy
* Re-arrange the codes before the main-loop
* Add comments
* Unify the alignment to be 8 for Q/K/V Lds decriptors
* Tuning to K pre-loading
* Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false
* Adjust the pipeline codes
* Use NumPrefetchV to separate from NumVLdsBuffers
* Tune the location of a scheduler barrier code line
* Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths
* Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel
* Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline
* Move the location for store_tile() of first v_tile
* Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline
* Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<>
* Try to fix old bugs in qx_ks_vs_custom_policy
* Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler
* Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy
* Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV)
* Tiny Fix (#1888)
* Ck tile/paged attention workaround (#1894)
* Correction in GetRangeAlongX()
* Work-around to solve the failures in test_paged_attention_ck in xformers
* Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline
* Remove one call of move_tile_window for q_dram_window
* Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers()
* Tiny fix in qr_ks_vs_whole_k_prefetch pipeline
* Adjust the location of codes for storing the first V tile to LDS
* Tiny fix and add comments
* Change GetSmemKPackK size to improve performance
* Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy
* Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access
* Adjustment in intermediate iteration codes for tiny performance improvement
* Reduce the number of VLds buffers to 2 for whole_k_prefetch situtation
* Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue
* Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer()
* Remove useless AsyncopyV
* Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used
* Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy
* Recover the qr_ks_vs_async pipeline
* Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline
* Revert "Try to fix old bugs in qx_ks_vs_custom_policy"
This reverts commit 39b82ca194.
* Tiny fix with regard to whole_k_prefetch pipeline compiling
* Update kPadSeqLenK setting in fmha_fwd_kernel
* Use q_element_func and k_element_func
* Use single q_tile rather than multiple sliced q_tiles
* Codes refine according to the comments
* Re-format one file
* Mark qr_ks_vs_whole_k_prefetch as QLoadOnec == true
* port all moe changes from ck_moe_gemm branch
* refine codes in the pr
* fix tail odd
* fix clang format
* fix clang format2
* make hot loop scheduler compatible with 16x16 and 32x32
* clang format
* fix per token quant
* rename moe example
* clang format
* WA for address space by disable it completely
* hot fix moe gemm2
---------
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>
* replace buffer load/store intrinsics with builtins
* fix clang format
* replace buffer load/store intrinsics with built-ins in ck_tile
* fix clang format
* add switch between buffer intrinsics and built-ins
* change the builtins threshold to clang20
* fix clang format
* fix some compilation errors
* revert changes in ck_tile
* revert changes in ck_tile
* delete all root files and folders when CI completes
* try changing the username in CI
* fix groovy syntax
* add user and group id info to ci dockers
* change ownership of all files in CI to jenkins at the end
* update changelog
* port all moe changes from ck_moe_gemm branch
* refine codes in the pr
* fix tail odd
* fix clang format
* fix clang format2
* make hot loop scheduler compatible with 16x16 and 32x32
* clang format
* fix per token quant
* rename moe example
* clang format
---------
Co-authored-by: coderfeli <coderfeli@163.com>
* Add runtime check in example_gemm_xdl_streamk for gfx950
* Add runtime check in grouped conv fwd examples for gfx950
* Disable CK_USE_AMD_MFMA_GFX950
* Add new instances for gfx950
* Fix test_gemm_universal on gfx950
* fixed hiprtc compilation issues from new additions, removed clashing mixed precision functionality from codegen(ignore the whole file)
* fixed device op error: misplaced header guard
* restrict virtual function use in device_gemm_multiple_d file for codegen hiprtc compilation
* add CK_CODE_GEN_RTC flag for compilation, since this flag has wider coverage for hiprtc compilation
* fixed conditional error in amd_ck_fp8.hpp
* Add MaskOutUpperTriangle as a problem parameter to
BatchedGemmSoftmaxGemm and disable tests with
MaskOutUpperTriangle==True.
Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
---------
Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
* moe sorting ex
* fix bug for race condition
* fix bug and optimze large expert
* fix
* optimize with sub_token_oneshot
* support skip empty tokens for expert sorting
* update moe_sorting
* tidy code
* support mp kernel
* hint mp
* remove use less code
* porting to example 15
---------
Co-authored-by: valarLip <340077269@qq.com>
* Added two kernel for M=32 problem
* Comment the first one
* Enable multiply_multiply for Scale_Block_M = 1 for deepseek
* Modify the a_thread offset since the A data load is different from B.
* edit fp8 ab scale for Scale_Block_M=1
* edit GemmSpec to MNKPadding
* enable blockwise pipelie v1 and v2. v1 is work for small K.
* add instance for gemm_ab_scale
* fix cmakelist of ckProfiler
* optimize blockscale gemm. todo: reduce vgpr usage
* fix a correctness bug
* sanity checked
* revert ckprofiler cmake changes
* clang format
* revert unnecessary changes.
* remove commented codes.
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
* device_prop.hpp - replace map with compile time hash and switch
Summary:
We replace a static const map with a compile time hash function
and a switch statement to achieve the same goal: translate names
to architectures. Most of these are very old, however the function
needs to continue to work.
Why? because the static map can cause issues when compiling into
libraries that get dynamically loaded/unloaded, leading to memory
corruption
Test Plan:
Running pytorch `torch.compile()` with CK enabled, and seeing it not
segfault on the 2nd kernel (1st reload of the library)
Reviewers:
Subscribers:
Tasks:
Tags:
* clang-format