ozturkosu
256aec203e
print hipOccupancyDefined Grid size for best in ckProfiler
2025-06-22 19:29:34 -04:00
root
b0e4ef4f96
compilation error is fixed
2025-06-22 05:05:35 +00:00
ozturkosu
9403f38a5c
update
2025-06-13 05:01:14 -04:00
ozturkosu
d7e80d2ae0
gridwise update
2025-06-13 04:54:51 -04:00
ozturkosu
2a88a04999
update
2025-06-13 04:45:18 -04:00
ozturkosu
287f8eb7e5
update
2025-06-13 04:31:28 -04:00
root
e296783bcc
print out grid size defined by hipOccupancyAPI
2025-06-13 02:15:05 -05:00
kylasa
5f1ad09b61
Code drop for 2 warp ping pong scheduler along K dimension. ( #2276 )
* Code drop for 2 warp ping pong scheduler along K dimension.
* Addressing code review comments.
* Addressing Clang formatting issues.
* Addressing build issues.
* Addressing build issues of other GEMM pipelines with ping pong scheduler code drop.
* Fix for LDS memory size for GEMM pipelines.
* Addressing code review feedback comments.
* Change log update.
* Addressing code review comments and build issues.
* Added new policy for pipeline specific logic about LDS needs.
* Clang Fix during build.
2025-06-12 18:24:02 -07:00
Thomas Ning
f59b8c7d3d
OCP FP8 Macro restructure ( #2331 )
* solved the problem
2025-06-12 09:46:33 -07:00
Bartłomiej Kocot
bb4f471b09
Grouped conv bwd weight with grouped gemm ( #2304 )
* Grouped conv bwd weight with grouped gemm
* fixes
* fix
* Fixes
* test comments
* restore atol
* fix
2025-06-12 10:15:07 +02:00
carlushuang
8aff45a8af
[CK_TILE] moe sorting optimization : refactor subtoken logic to let more kernel pickup mp kernel ( #2327 )
* refactor subtoken logic to let more kernel pickup mp kernel
* typo
2025-06-12 11:44:22 +08:00
Yi DING
37554c31e8
Add MoE & FP8 Blockscale WP Kernels for GFX950 ( #2297 )
* [fix] align v3 gufusion pipeline
* fix device kernel selection.
* Add .co direct asm support by CK_USE_ASM_MOE_STAGE2_BLOCKSCALE
* experimental optimization for scale load in blkscale gemm
* Add asm for no-loop v3_128x128x128
* fix bugs
* tune fp8 example
* Update v1_128x128x128 to 2x2 instead of 4x1
* wip
* add warmup to asm launch
* wip2
* 16x16 function merged to moe
* temp save, a performant version.
* wip3
* Update .co binary to 16x16
* 16x16x128 correct; 64x64x128 failed
* update
* use mem_op::set when topk=1
* add mx fp8 b_preshuffle support, function not yet tested.
* Split the fp4 target. Fix the known bugs. 128x128x128 sanity checked; remove prints
* some fixes
* fix update
* remove some unnecessary hacks; enable 256x256x256 tilesize
* update for function debug
* Add pipeline v3. Have some runtime issue and register spill
* Fix pipe v3 correctness issue
* remove unnecessary hacks
* clang format
* fix a bug
* fix the bug, functional test passed
* tempsave; buggy when passing 4 e8m0 to scaled mfma
* added fp4_bpreshuffle example, build failures
* fixed some bugs
* implement shuffled scale mxfp4gemm, blocker: opsel has no effect
* hotfix
* fix bugs, build passed
* (M, N, K)=(128, 128, 128) function failed.
* temp save for gemm1. Function not ready
* fix compile error. Gemm2 pass. Gemm1 WIP
* fix bug for a lds read
* update moe
* Compile pass. Gemm1 function WIP
* update moe
* fix fp8; fix even/odd
* tempsave
* update moe
* Revert "update"
This reverts commit 960b2bce1c .
* Revert "use mem_op::set when topk=1"
This reverts commit def952a178 .
* Add v3 128x128x128_4x4_16x16.co for gfx950
* temp cmake flag suppression for aiter test
* add code for mxfp4 gemm, blockscale not supported yet
* gemm1 up-only pass. GU WIP
* function pass with inline asm hacky
* revert unexpected file change
* updated and build passed
* update CE elementOP
* added code for debug
* Gemm1 GUFusion function pass. Perf WIP
* Fix fp8/bf8; remove duplicated code
* disable the scheduler in v3; bring it back when compiler feature ready.
* update moe v1 pipeline
* Add gemm1 v1 32x128x128
* remove schedule barrier
* updated
* Fix fp8/bf8 B-row
* mfma using asm, device result correct, host result need to check
* gemm1 v3 64x128x128 debug
* fix cpu ref
* a/b thread_desc stride fix
* Use random scale for init1
* 16x16x128 input size blockscale function passed
* fix blockscale gemm bug
* tempsave. Almost all instances passed.
* v1 fix for mi350.
* temp save
* debug save
* update debug
* fix the bug, 128x128x256 tile function passed
* v3
* rename moe block selector and pipeline
* Add gemm1 v1
* Add gemm1 v1 to selector
* added mx moe block v3 support, function passed
* compile error fix
* Improve the pipeline
* Pack e8m0 as int32_t
* v1 compile pass. Function not ready
* debug synchronize issue over different GPU/ROCm
* minor fix
* Add profiler filter
* Add f4 ckProfiler
* Fix example compile error
* Add f4 profiler examples
* tempsave
* v1 function pass.
* v3 function pass
* align file and function name
* mx_moe_fp4 ready for aiter with clang-format.
* modify the way we represent fp4
* generalize the pipeline scheduling.
* init moe mx f4 scale shuffle
* CMakeLists: disable compiler-bound flags
* mx_fp4 default parameter change
* Moe blockscale gemm1&gemm2 asm support for aiter. Suppress cmake flag until new compiler.
* update code
* tempsave; modify the way we represent fp4
* generalize the pipeline scheduling.
* Add gemm1 gfx942 .co support
* updated code, build passed.
* Update gemm2 asm with latest compiler flag
* Fix mx f4 ckProfiler
* Fix blockwise gemm mx v1
* lds conflict free + buffer load lds
* Add gemm2 v3 64x128x128
* fix a, b scale loading bugs; a, b scales now load correctly
* Add gemm2 v3 64x128x128
* commit with debug info
* fix fp4 profiler
* Add mx fp4 pipeline v1 instances
* Fix v2 topk_weight cal. Add silu asm.
* v2 topk_weight WIP
* init mx fp4 B no preshuffle version
* tempsave. compile pass, function wrong
* enable fp4 moe no weight preshuffle, function pass
* update the TFlops calculation in the example
* Add gemm2 64x128x128 asm. Fix BF16 ref.
* fix 2 typos in fp4_preshuffle
* Better kernel selection in device classes
* correct preShuffleBuffer
we should use packed k to do shuffle.
* lds conflict free + buffer load lds
* optimize offset math in dma
* Fix fp4 ckProfiler
* Fix MX MFMA tests
* fix f4 pipeline issues
* gemm1 func pass
* update mx moe gemm1_bns tile size to 64x128x256
* update mx moe gemm1 gemm2 TF and BW calculation
* fix typo
* temp save
* Fix example_gemm_mx build
* rename the block pipeline
* correct a typo in tail
* Add rotating to mx examples
* fix the correctness issue
* Fix v1; use M padding
* Add NT flag to B/BScale buffer
* Merge gemm_mx_common.hpp
* temp save, 4.4~4.5
* Fix 'Merge gemm_mx_common.hpp'
* refactor the pipeline
* Pad the M for scale buffer unconditionally
* update MX moe GEMM1 hotloopscheduling
* change the gemm1 tile from 64x128x128 to 128x64x128
* Unconditional Ascale padding
* Pad shuffled a scale only
* pad ascale
* add vmcnt guard for async copy
* Profiler add f4 wp
* Merge preshuffle device
* Add more fp4 wp instances
* Fix do_weight in gemm1. Fix cshuffle_datatype. Clang-format
* Clang-format after 2 merges
* Remove rocm6.3 workaround flags and macro
* Fix fp8 config
* Fix bf8 config
* flag and barrier fix for compiler branch MainOpSelV3
* Add fp8 profiler instances
* Remove debug infos; Enable flags for blockscale f8
* No asm ver. for merging moe blockscale fp8 into mainline
* update the flag name for f8blockscale
* recover example
* fix performance bug of bpreshuffle f8 gemm
* clang format, remove single rate mfma restriction for f8
* remove single rate mfma restriction for f8 blockscale gemm
* Fix moe blockscale gemm1 barrier 0x800 for new compiler
* add pipeline v1 for MOE Gemm2
* Use v1 pipeline for example_moe_gemm2_xdl_mx_fp4_bns
* Fix OOB; add MB96 instances
* remove unnecessary files
* fix the cmake issue
* Enable splitk for mxfp4; clang format;
* Generate random tensor values with multiple threads
* Use packed_size_v for A/BPackedSize
* Fix warning
* Fix target_compile_options for disabled target on gfx942
* fix moe pki4 on gfx950
* doc the kGroup definition
* Fix ThreadwiseTensorSliceTransfer_v4::Run (Fuse scale)
* Refactor thread_copy_lds_direct_load; fix gfx942 direct lds load example; fix f16_pki4 example
* Fix unknown compiler flag
* fix two failed examples.
* fix some failure tile size in gfx950 universal gemm. fix test_gemm_fp16
* workaround fix for test_gemm_f32;
* We have very limited support for lds direct load if input matrix is not K major
* fix test_gemm_splitk;
* Fix compile for mx_mfma_op
* add mfma selection logic for multipled_v3
* Clean up
* Fix device gemm mx link error
* improve the global atomic pattern
* Revert unnecessary copyright updates
* restore minimum_occupancy logic
* Avoid data race in moe gemm2 ref
* Build fp8 gemm_multiply_multiply and moe only on gfx94/95
* update the instance in device_mx_gemm
* Resolve comments
* Copyright 2025
* Remove unused code
* fix library linking issue
---------
Co-authored-by: OscarXu <huaiguxu@amd.com >
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com >
Co-authored-by: mtgu0705 <mtgu@amd.com >
Co-authored-by: aska-0096 <haocwang@amd.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: valarLip <340077269@qq.com >
Co-authored-by: feifei14119 <feiw@amd.com >
Co-authored-by: Lin, Qun <qlin@amd.com >
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
2025-06-12 09:25:59 +08:00
Bartłomiej Kocot
8c1ed6f4c1
Move SetZero functions inside the kernels for Grouped Conv ( #2255 )
* Disable SetZero before launch kernel for grouped conv fwd
* Move set zero to kernel
* wmma fix
* fix
---------
Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com >
2025-06-11 23:41:03 +02:00
Muhammed Emin Ozturk
6fad1c4874
Stream-K Reduction option as Runtime parameter and Compilation Error Fix (SK- Reduction) ( #2145 )
* reduction is passed as runtime parameter
* clang
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_streamk_v3.hpp
Co-authored-by: John Afaganis <john.afaganis@amd.com >
* Update include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp
* remove comment
---------
2025-06-11 10:59:44 -07:00
Thomas Ning
06e0b8436c
Epilogue cshuffle Improvement ( #2312 )
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* update cshuffle logic
* update cshuffle_logics
* add some change within review
* update some codes following the code review
* update epilogue logic
* remove from problem
* update codes following review.
* fix some issues
* solve the previous PR error, refine the code
* Update include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
* Comment addressed
* handling tile_engine failing case
* handling tile_engine failing case
---------
Co-authored-by: joyeamd <John.Ye@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
Co-authored-by: khushbu agarwal <khuagarw@amd.com >
2025-06-10 22:44:50 -07:00
Thomas Ning
14d229d6c8
fix on the typo ( #2326 )
2025-06-10 16:34:33 -07:00
Khushbu Agarwal
bd270fe4bc
fix flatmm kernel for bigger size for fp16 datatype ( #2302 )
2025-06-10 11:13:40 -07:00
Eisuke Kawashima
4e586ca958
chore: unset executable permission ( #2303 )
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com >
2025-06-10 09:13:59 -07:00
John Afaganis
6635d1bb88
Remove usage of 'warpSize' variable as it has been deprecated ( #2295 )
* SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround
* SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo
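The two notes above turn on `get_warp_size()` staying `constexpr`. A minimal sketch of why that matters, assuming a build-time switch: `EXAMPLE_WAVE64` and `BlockInfo` are hypothetical names for illustration, not the repository's macro or types.

```cpp
#include <cassert>

// Hypothetical stand-in for a per-target switch; not the repository's macro.
#ifndef EXAMPLE_WAVE64
#define EXAMPLE_WAVE64 1
#endif

// Compile-time wavefront size, fixed per build target (assumed here).
constexpr int get_warp_size()
{
#if EXAMPLE_WAVE64
    return 64;
#else
    return 32;
#endif
}

// A constexpr value can feed static_assert and template arithmetic,
// which a runtime-only variable cannot.
template <int BlockSize>
struct BlockInfo
{
    static_assert(BlockSize % get_warp_size() == 0, "block must hold whole waves");
    static constexpr int waves_per_block = BlockSize / get_warp_size();
};
```

With the default switch, `BlockInfo<256>::waves_per_block` evaluates to 4 at compile time.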
---------
Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com >
2025-06-10 07:34:54 -07:00
carlushuang
2e0536269e
hot fix ( #2315 )
2025-06-10 20:35:28 +08:00
Bartłomiej Kocot
7a83f1d510
Grouped conv bwd wei explicit GEMM for odd C/K ( #2306 )
2025-06-10 11:17:12 +02:00
MHYangAMD
9fcf21a4ec
Fix fmha fwd precision issue on MI3XX series ( #2285 )
* Fix fmha fwd precision issue on MI3XX series
For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 to cast P leads to precision
issues, because it uses __builtin_amdgcn_cvt_pkrtz, which rounds toward zero.
For example, with K and V fixed to all 1 and Q random, the outputs are
expected to be all 1. Instead, some outputs come out as 0.9995. The
resulting error is smaller than the atol of 0.001 (1 - 0.9995 =
0.0005 < 0.001), so CK does not report it.
* Add option to switch rtn/rtz for fmha fwd
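The effect described above can be reproduced on the host without any GPU intrinsic. The sketch below is an illustration only: the helper names are hypothetical, and it models fp16 rounding by snapping to the fp16 grid, which is valid only for inputs in [0.5, 1], where representable fp16 values are spaced 2^-11 apart.

```cpp
#include <cassert>
#include <cmath>

// fp16 carries 11 significant bits, so values in [0.5, 1) lie on a grid with
// spacing 2^-11. Scaling by 2048 maps that grid onto the integers.
// Both helpers are hypothetical and only valid for inputs in [0.5, 1].

// Round toward zero (the mode named in the commit message).
inline float to_fp16_grid_rtz(float x) { return std::trunc(x * 2048.0f) / 2048.0f; }

// Round to nearest (ties to even under the default floating-point environment).
inline float to_fp16_grid_rtn(float x) { return std::nearbyint(x * 2048.0f) / 2048.0f; }

// Absolute-tolerance check of the kind a reference comparison performs.
inline bool within_atol(float out, float ref, float atol)
{
    return std::fabs(out - ref) <= atol;
}
```

For an input of 0.99999f, the round-toward-zero helper yields 2047/2048 (about 0.9995) while the round-to-nearest helper yields 1.0. The truncated value still passes `within_atol(out, 1.0f, 0.001f)`, which is how such an error can go unreported.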
2025-06-10 15:03:23 +08:00
carlushuang
65835c0bbb
MUST USE INLINE FOR ANY NON TEMPLATE FUNCTION IN HEADER!!! ( #2305 )
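The rule in this subject line follows from C++'s one-definition rule: a header is compiled into every translation unit that includes it, so a non-template function defined there without `inline` produces one external definition per translation unit, and linking two of them fails with a multiple-definition error. A minimal sketch, with a hypothetical header and function names that are not taken from the repository:

```cpp
// example_math.hpp (hypothetical header)
#ifndef EXAMPLE_MATH_HPP
#define EXAMPLE_MATH_HPP

#include <cassert>

// Non-template function defined in a header: `inline` allows every translation
// unit that includes this file to carry its own identical definition.
inline int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Function templates may already be defined in multiple translation units,
// so they do not need the keyword.
template <typename T>
T ceil_div_t(T a, T b)
{
    return (a + b - 1) / b;
}

#endif // EXAMPLE_MATH_HPP
```

Dropping `inline` from `ceil_div` still compiles when a single source file includes the header; the failure only appears at link time once a second source file includes it.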
2025-06-10 10:40:54 +08:00
Aviral Goel
5a0bd157db
Code Refactor for check_err.hpp ( #2284 )
* refactor & add documentation
* removed return datatype from doxygen comments
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: John Afaganis <john.afaganis@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
---------
Co-authored-by: John Afaganis <john.afaganis@amd.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
2025-06-08 13:41:27 -07:00
Sami Remes
1c6f83df6c
[CK_TILE] Tileloop persistent gemm - resubmit ( #2299 )
* Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191 )" (#2293 )
This reverts commit 233e274077 .
* Add missing header for kentry
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
2025-06-06 14:18:49 -07:00
valarLip
8482977a37
extend buffer load to support load 32 bf16/fp16 at same time ( #2291 )
2025-06-06 17:21:19 +08:00
Bartłomiej Kocot
050cad09b5
Grouped Convolution Backward Weight Explicit GEMM ( #2282 )
* Grouped conv bwd weight explicit gemm
* 3d
* cmake fixes
* fix test
* fix
2025-06-06 10:30:08 +02:00
Andriy Roshchenko
00247e3c29
Optimized GEMMs for MX FP4/8 ( #2294 )
Adds V3 GEMM pipeline for MX FP4 and MX FP8
Adds V3 GEMM pipeline for MX FP4 with preshuffling
Adds MXFP4 GEMM tests (#2275 )
Adds MXFP4 GEMM examples
Adds MXFP4 GEMMs to ckProfiler
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com >
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com >
Co-authored-by: aska-0096 <haocwang@amd.com >
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com >
Co-authored-by: OscarXu <huaiguxu@amd.com >
Co-authored-by: mtgu0705 <mtgu@amd.com >
Co-authored-by: Ding, Yi <yi.ding@amd.com >
Co-authored-by: feifei14119 <feiw@amd.com >
Co-authored-by: Lin, Qun <qlin@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com >
2025-06-05 13:54:15 -06:00
Illia Silin
233e274077
Revert "[CK_TILE] Tile loop persistent gemm kernel ( #2191 )" ( #2293 )
This reverts commit ffb52783d0 .
2025-06-05 09:24:00 -07:00
Sami Remes
7ea1508b59
[CK_TILE] Move GEMM pipeline tail handling logic to pipelines ( #2222 )
* Add TailHandler for V3, V4 and Mem pipelines
* Adapt examples and tests to use TailHandler
* move tail-handling logic to pipeline in persistent grouped gemm
* Fix Mem pipeline dispatching, add CompV4 dispatching
* Use a macro for handling the many tails of Mem pipeline
* Fix formatting again
* Use const-ref RunFunction, remove unnecessary try_run
2025-06-04 11:50:21 +03:00
Sami Remes
ffb52783d0
[CK_TILE] Tile loop persistent gemm kernel ( #2191 )
* Implement tile loop persistent gemm kernel
* Enable timing
* Add tests for persistent gemm
* Fix formatting
* Fix gemm_basic
* Rename True/False to Persistent/NonPersistent
* Use only one set of layouts for persistent tests
* Fix gemm example persistent template parameter
* Fix formatting
2025-06-04 11:46:28 +03:00
Anton Gorenko
52b4860a30
WMMA GEMM universal pipeline v1, mixed precision and paddings, examples ( #2230 )
* Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
* Fixed cmake build errors related to test_fp8
* Updates to support mixed precision
* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip
* Added support for F8xF16xF16 to gemm_wmma_universal
* Added support for F16xF8xF16 to gemm_wmma_universal
* Added support for BF16xI4xBF16 to gemm_wmma_universal
* Added support for F16xI4xF16 to gemm_wmma_universal
* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType
* Added missing test class for FP16_KM_NK
* Pre-commit hooks fixes
* Added padding instances for f16xf16xf16
* Fixed cmake errors related to gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"
* Fixed cmake build errors related to test_fp8
* Amending changes for adding support for padding instances for f16xf16xf16
* Fixes for padding instances for f16xf16xf16
* Added padding instances for bf16xbf16, f8xf8
* Added packed instances for bf16xi4xbf16
* Added padding instances for f8xf16xf16
* Added padding instances for f16xf8xf16, f16xi4xf16
* Fixed typos for bf16xbf16xbf16 padding instances
* Fixed typos for padded instances
* Added tests for fp16, KM_KN and KM_NK
* Padding is not supported when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.
* Fixed typos
* Updated the set of tests for FP16
* Updated the set of tests for FP16
* Fix typo
* Moved f16xi4 test under the correct data layout group
* example for gemm_universal_bf16
* Adding examples for gemm_wmma instances
* Added the missing parameters
* Fixed review comments and added executable to cmakeLists
* Fixing clang format
* Fixing build errors
* Fixed compilation failure.
* Modified some code as per gemm_universal_examples
* Fixed the gemm specialization error
* Fixed the build errors.
* Fix strides of a/b_thread_desc
The descriptors are larger than needed (even though the compiler doesn't allocate registers for unused values).
* Load in M/NRepeat dims with thread copy's slice instead of a loop
* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation
* Implement Intrawave and Interwave variants of pipeline v1
* Add instances for Interwave and Intrawave v1
* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0
* Remove instances that are too slow (mostly because of register spilling)
* Add a workaround for fp8/bf8->f32 packed conversion issue
* Add instances for Interwave and Intrawave v1
* Enable profiling of mixed precision with f8 and int4 on WMMA
* Fix segfault in profiler when B is pk_i4_t
b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.
* Remove instances that are too slow (mostly because of register spilling)
* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations
* Add test case for bf16_i4
* Add missing Regular tests
* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS
They take more than 30 seconds
* Fix a bug that fp16_i4 validation passes only with PermuteB
A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.
* Use PermuteB with f16_i4 in most instances (as xdl)
Some instances use PermuteB = false for checking correctness.
See also the previous commit.
* Fix cache flushing for pk_i4
* Add mixed precision examples
* Disable all tests and instances with f8 on gfx11
Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.
* Add FP16 KM_NK and KM_KN test suites for XDL
These tests were added to common .inc for better testing of WMMA instances
* Fix int8 DTYPES check for gemm_bilinear
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com >
Co-authored-by: Apoorva Kalyani <apoorva@streamhpc.com >
2025-06-04 12:22:33 +06:00
Khushbu Agarwal
59a85cb4bc
[CK_Tile] Fix gemm kernel for 4,64,16 and 64,4,16 warp tile sizes ( #2262 )
* debugging issue
* debugging issue
* debugging
* debugging
* reverting debugging code
* clang formatted
* updating default_config.json
* fix ci failure
* clang formatted
2025-06-03 20:16:10 -07:00
Bartłomiej Kocot
6e5acee0f9
Add Clamp/Relu bf16/fp16 cast fixes ( #2279 )
* Add Clamp/Relu bf16/fp16 fixes
* fix
2025-06-03 18:31:46 +02:00
Xiaodong Wang
7f9eef40b0
Move pragma ahead ( #2231 )
2025-06-03 07:27:51 -07:00
Khushbu Agarwal
2e38eb4f1c
Rotating buffer PR CI fix ( #2257 )
* Revert "Revert "[CK_tile] Add rotating buffer feature for universal gemm (#2200 )" (#2256 )"
This reverts commit bbdaf79a52 .
* fix regression
2025-06-02 10:25:01 -07:00
valarLip
0fdbf6bcd1
extend buffer load for fp16/bf16x16 ( #2270 )
* extend buffer load for fp16/bf16x16
* format
2025-06-02 10:29:54 +08:00
Mirza Halilčević
fbce6c7bb6
Define CHAR_BIT during hipRTC ( #2264 )
* Fix failing codegen tests.
* fix clang format
---------
Co-authored-by: illsilin <Illia.Silin@amd.com >
2025-05-30 08:23:44 -07:00
Illia Silin
4e561af18c
Revert "add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue ( #2185 )" ( #2260 )
This reverts commit fd6a859b44 .
2025-05-29 16:22:16 -07:00
joyeamd
fd6a859b44
add CShuffleM/NXdlPerWavePerShuffle in cshuffle_epilogue ( #2185 )
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* update cshuffle logic
* update cshuffle_logics
* add some change within review
* update some codes following the code review
* update epilogue logic
* remove from problem
* update codes following review.
* fix some issues
2025-05-29 14:31:14 +02:00
Bartłomiej Kocot
e7906dd644
Change relu to clamp for grouped conv fwd instances ( #2249 )
2025-05-29 00:51:25 +02:00
Illia Silin
bbdaf79a52
Revert "[CK_tile] Add rotating buffer feature for universal gemm ( #2200 )" ( #2256 )
This reverts commit 99857e10e6 .
2025-05-28 09:46:52 -06:00
Khushbu Agarwal
99857e10e6
[CK_tile] Add rotating buffer feature for universal gemm ( #2200 )
* Add rotating buffer feature for universal gemm
* adding changes in tile_engine
* Updated code to merge kernel_launch
* removing comments
* Enable rotating buffer changes to flatmm
* Created diff launch_kernel function for rotating buffer
* Simplified calculation using macros
* merge code with new changes in tile_engine
* clang formatted
* Redefine macros
2025-05-27 23:00:58 -07:00
Casey-Shi
128f5a1eab
[Tile Engine] Add benchmark for tile engine gemm. ( #2193 )
* initial commit -m benchmark
* only support profile
* fix
* fix doc
* add default config
* add ci
* fix cmake
* tmp save for gen blobs
* fix bug
* merge
* range config
* test success
* fix
* fix
* move struct
* remove config property
* fix config
* remove comment
* add cmake option & modify
* add changelog
* fix
* format
* add pydantic module to the docker image
* fix
* add benchmark for cold and warm up
* python format
* add asm cache control
* fix README
* remove pydantic module
* modify changelog
* fix config
* recover benchmark_gemm and fix
* format python
* refactor profiler
* fix csv bug
* fix codegen bug
* add kernel instance object
* add benchmark gemm executable
* fix jenkins & delete extra header
* disable warning output & enable default config
* Disable sparsity for invalid warp tile combinations
* fix gemm host template func
* refactor gemm profiler
* filter out some instances
* default config test & fix codegen bug
* add sparse flag to gen more instances
---------
Co-authored-by: illsilin <Illia.Silin@amd.com >
Co-authored-by: khuagarw <khuagarw@amd.com >
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
2025-05-26 22:32:36 -07:00
Po Yen Chen
c42b957d65
[CK_TILE] For FMHA forward kernels, assign block indices reversely if using mask ( #2209 )
* Assign block indices reversely if kHasMask=true
* Assign block indices reversely for splitkv kernel
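One way to read "assign block indices reversely" is as a remapping from launch-order block index to the tile that block processes. The log does not state the motivation or show the kernel code, so the sketch below only illustrates that mapping; `tile_index` is a hypothetical helper, and `kHasMask` is borrowed from the note above.

```cpp
#include <cassert>

// Hypothetical helper: maps a launch-order block index to the tile it processes.
template <bool kHasMask>
constexpr int tile_index(int block_id, int num_tiles)
{
    // With a mask, walk the tiles from last to first; otherwise keep launch order.
    return kHasMask ? num_tiles - 1 - block_id : block_id;
}
```

With 8 tiles and the mask enabled, block 0 is assigned tile 7 and block 7 is assigned tile 0; with the mask disabled the mapping is the identity.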
2025-05-27 10:58:58 +08:00
Bartłomiej Kocot
037764bbc6
Fix grid size calc for bwd wei ( #2226 )
2025-05-26 16:51:09 +02:00
Zzz9990
ece38b9d7a
[VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q ( #2221 )
* fix splitkv compiler issue since lse is used to select kernel instances
* bypass seqlen == 1
* add chunked prefill into mha varlen
This reverts commit aa9847e42d .
* skip compile when receipt 2-4 and add comments
* fix
---------
Co-authored-by: fsx950223 <fsx950223@outlook.com >
2025-05-26 19:17:18 +08:00
Illia Silin
8146e471f1
fix the buffer intrinsic names for clang >=20 ( #2228 )
2025-05-23 14:58:25 -07:00
Illia Silin
1b846143c6
Revert "Update the buffer load/store intrinsic names for clang>=20. ( #2192 )" ( #2227 )
...
This reverts commit 58f9e9ffbc .
2025-05-22 15:41:17 -07:00
Aviral Goel
534d4594d0
Refactor tile_window.hpp, tile_window_linear.hpp into a CK Tile Hierarchy ( #2214 )
* window_origin variable now in base class
* abstracted more functions
* consolidated tile_window_static_distribution and tile_window_static_lengths
* clang format
* skeleton code for tile_window and tile_window_linear consolidation
* more abstraction
* moved variables from child to parent
* clang format
* removed comments
* removed debug code
* removed debug code
* abstracting traits WIP
* consolidated traits
* removed comments and clang formatted
2025-05-21 23:28:00 -07:00