Thomas Ning
f85c70b31e
Revert "Fix default epilogue ( #2358 )" ( #2364 )
...
This reverts commit b29e3830a6 .
[ROCm/composable_kernel commit: 64a2fda713 ]
2025-06-17 22:43:05 -07:00
linqunAMD
cd0bf60645
[CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd ( #2243 )
...
* [CK_TILE] fix build error in tile_add_rmsnorm2d_rdquant_fwd
* fix error with the latest develop code.
[ROCm/composable_kernel commit: 7aeec9a901 ]
2025-06-17 21:37:59 -07:00
carlushuang
f540c6ccb4
[CK_TILE] moe_sorting support "local_tokens" feature for EP case ( #2335 )
...
* support local_token for hipgraph
* update README
* fix comment
* fix fmoe example
[ROCm/composable_kernel commit: a4e1248dba ]
2025-06-18 10:49:43 +08:00
Kiefer van Teutem
609cb2c3ad
Fix argument order for calls to profile_batched_gemm_impl() ( #2277 )
...
* Fix argument order for calls to profile_batched_gemm_impl()
* Revert previous and swap the order of the profile_batched_gemm_impl() function arguments instead.
* Revert copyright years for unchanged files.
* Remove test_batched_gemm from REGRESSION_TESTS since it no longer takes more than 30 seconds to run.
---------
Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com >
[ROCm/composable_kernel commit: c7c6a0ccb3 ]
2025-06-17 19:29:09 -07:00
Max Podkorytov
b29e3830a6
Fix default epilogue ( #2358 )
...
* [ck-tile] fix default epilogue in gemm universal
* argument validation needs vector size D
* operator() needs to specify dram windows
* copy/paste from cshuffle epilogue
* clang-format
* mark unused argument
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
[ROCm/composable_kernel commit: cd606f72c1 ]
2025-06-17 17:30:21 -07:00
linqunAMD
af00674037
[CK_TILE] Support multi-config in tile_example_gemm_universal ( #2240 )
...
* [CK_TILE] Support multi-config in tile_example_gemm_universal
Add GemmConfig in run_gemm_example to support multiple tile config.
- It is useful when you need to compare gemm perf across different tile/pipeline configs
- We can also use it to simplify the code for wmma support in the future.
* [CK_TILE] Support multi-config in tile_example_gemm_universal
Address review comments
* rebase code and fix clang format.
* fix clang format
* support pipeline v5.
* fix merge conflict
* address review comment
* add missing file
* address review comment v2
* fix build error
[ROCm/composable_kernel commit: 0eb8974502 ]
2025-06-17 17:27:46 -07:00
John Afaganis
3ef7712ee3
Add missing copyright headers ( #2359 )
...
* Add missing copyright headers
* empty commit
[ROCm/composable_kernel commit: df54667102 ]
2025-06-17 14:29:45 -07:00
Illia Silin
073bb8d588
Revert "Shard several of the most costly targets. ( #2266 )" ( #2361 )
...
This reverts commit c1285aaada .
[ROCm/composable_kernel commit: cdfd7722bf ]
2025-06-17 13:56:30 -07:00
Bartłomiej Kocot
d9316dfbeb
Fix Add in dynamic buffer for fp32/i8 ( #2351 )
...
* Fix Add in dynamic buffer for fp32/i8
* fixes
* Fix
[ROCm/composable_kernel commit: cc98a41f46 ]
2025-06-17 22:25:56 +02:00
Satyanvesh Dittakavi
bde406245a
Do not use warpSize as compile time constant as it is removed ( #2320 )
...
* Do not use warpSize as compile time constant as it is removed
* Update tile_image_to_column_shape.hpp
update warpSize usage.
* clean-up all use of warpSize, make sure code builds
* fix
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: illsilin <Illia.Silin@amd.com >
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com >
[ROCm/composable_kernel commit: 4c57157d50 ]
2025-06-17 11:54:30 -07:00
Aviral Goel
66afddf431
add script to pre commit hooks for checking file permissions ( #2322 )
...
[ROCm/composable_kernel commit: 3af66e99ab ]
2025-06-17 07:07:08 -07:00
Thomas Ning
bc6af0fa49
Fix the CK Tile related operators ( #2356 )
...
* fix the flatmm
* Fix the pipeline
* address the comment
[ROCm/composable_kernel commit: 3c4cdfac4f ]
2025-06-16 17:38:52 -07:00
rahjain-amd
456719c9bc
Add cmake flag to enable Assembly dump ( #2347 )
...
This flag makes it easy to dump assembly for the example kernels.
[ROCm/composable_kernel commit: 6589f50bc9 ]
2025-06-16 09:29:35 -07:00
Illia Silin
0f4d68633b
Revert "fix the flatmm ( #2349 )" ( #2352 )
...
This reverts commit fc65195605 .
[ROCm/composable_kernel commit: 5523df4b2d ]
2025-06-16 07:54:55 -07:00
Bartłomiej Kocot
4ae33b454f
Grouped convolution forward with clamp ( #2334 )
...
* Grouped convolution forward with clamp
* Optimize clamp
* unary fixes
* test gk bias
* Revert "test gk bias"
This reverts commit 8e42e29d7b .
* Revert "Revert "test gk bias""
This reverts commit e73c0550ce .
* workaround comment
[ROCm/composable_kernel commit: f6c2ff9dce ]
2025-06-16 15:36:53 +02:00
Thomas Ning
fc65195605
fix the flatmm ( #2349 )
...
[ROCm/composable_kernel commit: d996bc78be ]
2025-06-16 02:17:53 -07:00
ruanjm
1fdac8b8fe
Add support for specifying valid flag when fetching elements for tile_scatter_gather ( #2332 )
...
* Add support for specifying valid flag when fetching elements for tile_scatter_gather
Add constexpr for operator[] of TrueGenerator
* Use different path when valid is enabled
[ROCm/composable_kernel commit: b34c234f51 ]
2025-06-16 17:17:03 +08:00
carlushuang
370dd01230
hot fix block_gemm fail with pipeline_problem by adding NumWaveGroups inside block gemm problem ( #2348 )
...
[ROCm/composable_kernel commit: fb97f75099 ]
2025-06-15 22:49:04 -07:00
Illia Silin
7eaa398458
Fix direct lds load for gfx950 and clang20 ( #2346 )
...
* fix direct lds load for gfx950 and clang20
* Update include/ck/utility/amd_buffer_addressing_builtins.hpp
* Fix format
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com >
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com >
[ROCm/composable_kernel commit: 2d8a804152 ]
2025-06-15 15:22:34 -07:00
Illia Silin
d15d80a68b
Limit the threads to build ck_tile engine, use ninja. ( #2342 )
...
* limit the threads to build ck_tile engine, use ninja
* disable ck_tile engine until it can be built safely
[ROCm/composable_kernel commit: 56f654a826 ]
2025-06-13 14:13:07 -07:00
Illia Silin
6c4973f734
check whether the misched-bottomup flag is valid ( #2341 )
...
[ROCm/composable_kernel commit: a0f4db8d9c ]
2025-06-13 13:34:22 -07:00
Mateusz Ozga
6b3ddd0e23
[CK_TILE] Multiple-D GEMM example ( #2219 )
...
* Multiple d, initial commit
* Check Ds Layout
* Readme and clang format
* Update branch & conflicts
* Multiple D - fix clang-formatter
* Rename elemetwise_op
* Fix CI
* Code review part1
* Remove printf
* Remove unnecessary comment
* Add new tests with Col layout
* Review part 2
* Added support for Multiple D GEMM
* Update comment
* Remove maybe_unused
* Clang-format
* Review part 3
* Add comment to function
* Add comment to function: another
* Take number of params for a reference function
* Remove additional d param for 0 tensor
* Change name of function
* Fix CI fails
[ROCm/composable_kernel commit: bd96ac9742 ]
2025-06-13 19:39:11 +02:00
John Shumway
c1285aaada
Shard several of the most costly targets. ( #2266 )
...
* Shard several of the most costly targets.
Introduces a filter_tuple_by_modulo to break up tuples.
Drops build time of target from 21 minutes to under 14 minutes with 64
build processes, or 11 minutes with 128 build processes.
time ninja -j 64 device_grouped_conv3d_fwd_instance
* fix clang format
* Fix build errors in instantiation code.
I wasn't sure how to test the header-only instantiation code on my
initial commit. From Jenkins CI test results, I see that there is a
test target that depends on these headers:
ninja -j 128 test_grouped_convnd_fwd
This allowed me to test the build locally. I found three mistakes I
made, mostly related to early experiments I tried on the code.
This was hard to find earlier because this PR is really too large.
I also discovered that there are five 2D convolution targets that now
dominate the compilation time. I will likely address those in a later
PR, rather than adding even more changes to this PR.
* Fix link errors from mismatched declarations.
Our pattern for instantiating MIOpen templates uses duplicate
declarations (instead of headers). This is fragile, and I didn't
notice that my last commit had a bunch of link errors. I fixed these
mistakes, and the bin/test_grouped_conv_fwd test target binary now links
correctly.
* Migrate the design to a code-generation approach.
Use a CMake function with template files to generate the source files for
instantiating the kernels and to generate the calling function.
* Shard the longest 2D convolution builds
Now that we have automated the shard instantiation, we can shard the 2D
convolution targets that take the longest to build. The target
test_grouped_conv2d_fwd now compiles in 15 minutes.
* Use PROJECT_SOURCE_DIR for submodule compatibility
I used CMAKE_SOURCE_DIR to refer to the top-level source directory in
the ShardInstantiation.cmake file, but this can cause issues with
git submodules. Instead, we should use PROJECT_SOURCE_DIR to ensure
compatibility when this project is used as a submodule in another
project.
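The sharding idea behind filter_tuple_by_modulo can be sketched in a few lines. This is only an illustration in Python, not the C++ utility from the PR: the function name filter_by_modulo, its parameters, and the instance names are hypothetical, and the real helper presumably works on compile-time tuples of kernel instances. The assumption here is that shard s keeps every element whose index i satisfies i % num_shards == s, so each shard can be compiled as a separate translation unit.

```python
def filter_by_modulo(items, num_shards, shard_id):
    """Keep the elements whose index i satisfies i % num_shards == shard_id.

    Hypothetical Python analogue of a compile-time tuple filter.
    """
    return tuple(
        item for i, item in enumerate(items) if i % num_shards == shard_id
    )


# Ten made-up kernel instances split across four shards.
instances = tuple(f"instance_{i}" for i in range(10))
shards = [filter_by_modulo(instances, 4, s) for s in range(4)]

for shard_id, shard in enumerate(shards):
    print(shard_id, shard)
```

Every instance lands in exactly one shard, and shard sizes differ by at most one, which is what lets the build spread the instantiation work evenly over parallel compile jobs.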
---------
Co-authored-by: illsilin <Illia.Silin@amd.com >
[ROCm/composable_kernel commit: 3a0cb27966 ]
2025-06-13 03:58:50 -07:00
kylasa
afbc0625f4
Code drop for 2 warp ping pong scheduler along K dimension. ( #2276 )
...
* Code drop for 2 warp ping pong scheduler along K dimension.
* Addressing code review comments.
* Addressing Clang formatting issues.
* Addressing build issues.
* Addressing build issues of other GEMM pipelines with ping pong scheduler code drop.
* Fix for LDS memory size for GEMM pipelines.
* Addressing code review feedback comments.
* Change log update.
* Addressing code review comments and build issues.
* Added new policy for pipeline specific logic about LDS needs.
* Clang Fix during build.
[ROCm/composable_kernel commit: 5f1ad09b61 ]
2025-06-12 18:24:02 -07:00
Aviral Goel
c6996c8b7f
fix(gemm_universal): Update gemm_utils.hpp so it builds successfully for memory pipeline ( #2336 )
...
[ROCm/composable_kernel commit: e5ece14467 ]
2025-06-12 15:27:14 -07:00
Thomas Ning
8b534e4037
OCP FP8 Macro restructure ( #2331 )
...
* solved the problem
[ROCm/composable_kernel commit: f59b8c7d3d ]
2025-06-12 09:46:33 -07:00
Bartłomiej Kocot
adfcb7ea2b
Grouped conv bwd weight with grouped gemm ( #2304 )
...
* Grouped conv bwd weight with grouped gemm
* fixes
* fix
* Fixes
* test comments
* restore atol
* fix
[ROCm/composable_kernel commit: bb4f471b09 ]
2025-06-12 10:15:07 +02:00
carlushuang
a7eb83a51b
[CK_TILE] moe sorting optimization : refactor subtoken logic to let more kernel pickup mp kernel ( #2327 )
...
* refactor subtoken logic to let more kernel pickup mp kernel
* typo
[ROCm/composable_kernel commit: 8aff45a8af ]
2025-06-12 11:44:22 +08:00
Yi DING
b40267f7a6
Add MoE & FP8 Blockscale WP Kernels for GFX950 ( #2297 )
...
* [fix] align v3 gufusion pipeline
* fix device kernel selection.
* Add .co direct asm support by CK_USE_ASM_MOE_STAGE2_BLOCKSCALE
* experimental optimization for scale load in blkscale gemm
* Add asm for no-loop v3_128x128x128
* fix bugs
* tune fp8 example
* Update v1_128x128x128 to 2x2 instead of 4x1
* wip
* add warmup to asm launch
* wip2
* 16x16 function merged to moe
* temp save, a performant version.
* wip3
* Update .co binary to 16x16
* 16x16x128 correct; 64x64x128 failed
* update
* use mem_op::set when topk=1
* add mx fp8 b_preshuffle support, function not yet tested.
* Split the fp4 target. Fix the known bugs. 128x128x128 sanity checked; remove prints
* some fixes
* fix update
* remove some unnecessary hacky; enable 256x256x256 tilesize
* update for function debug
* Add pipeline v3. Have some runtime issue and register spill
* Fix pipe v3 correctness issue
* remove unnecessary hacky
* clang format
* fix a bug
* fix the bug, functional test passed
* tempsave; buggy when passing 4 e8m0 to scaled mfma
* added fp4_bpreshuffle example, build failures
* fixed some bugs
* implement shuffled scale mxfp4gemm, blocker: opsel has no effect
* hotfix
* fix bugs, build passed
* (M, N, K)=(128, 128, 128) function failed.
* temp save for gemm1. Function not ready
* fix compile error. Gemm2 pass. Gemm1 WIP
* fix bug for a lds read
* update moe
* Compile pass. Gemm1 function WIP
* update moe
* fix fp8; fix even/odd
* tempsave
* update moe
* Revert "update"
This reverts commit c7d79dcb672616d9bc0fd9958f714fc80e7c84fd.
* Revert "use mem_op::set when topk=1"
This reverts commit 8c7772860735001a51421e7b6d0a28f6676d6c40.
* Add v3 128x128x128_4x4_16x16.co for gfx950
* temp cmake flag suppression for aiter test
* add code for mxfp4 gemm, blockscale not supported yet
* gemm1 up-only pass. GU WIP
* function pass with inline asm hacky
* revert unexpected file change
* updated and build passed
* update CE elementOP
* added code for debug
* Gemm1 GUFusion function pass. Perf WIP
* Fix fp8/bf8; remove duplicated code
* disable the scheduler in v3; bring it back when compiler feature ready.
* update moe v1 pipeline
* Add gemm1 v1 32x128x128
* remove schedule barrier
* updated
* Fix fp8/bf8 B-row
* mfma using asm, device result correct, host result need to check
* gemm1 v3 64x128x128 debug
* fix cpu ref
* a/b thread_desc stride fix
* Use random scale for init1
* 16x16x128 input size blockscale function passed
* fix blockscale gemm bug
* tempsave. Almost all instances passed.
* v1 fix for mi350.
* temp save
* debug save
* update debug
* fix the bug, 128x128x256 tile function passed
* v3
* rename moe block selector and pipeline
* Add gemm1 v1
* Add gemm1 v1 to selector
* added mx moe block v3 support, function passed
* compile error fix
* Improve the pipeline
* Pack e8m0 as int32_t
* v1 compile pass. Function not ready
* debug synchronize issue over different GPU/ROCm
* minor fix
* Add profiler filter
* Add f4 ckProfiler
* Fix example compile error
* Add f4 profiler examples
* tempsave
* v1 function pass.
* v3 function pass
* align file and function name
* mx_moe_fp4 ready for aiter with clang-format.
* modify the way we represent fp4
* generalize the pipeline scheduling.
* init moe mx f4 scale shuffle
* CMakeLists: disable compiler-bound flags
* mx_fp4 default parameter change
* Moe blockscale gemm1&gemm2 asm support for aiter. Suppress cmake flag until new compiler.
* update code
* tempsave; modify the way we represent fp4
* generalize the pipeline scheduling.
* Add gemm1 gfx942 .co support
* updated code, build passed.
* Update gemm2 asm with latest compiler flag
* Fix mx f4 ckProfiler
* Fix blockwise gemm mx v1
* lds conflict free + buffer load lds
* Add gemm2 v3 64x128x128
* fix a, b scale loading bugs; a, b scale loading is now correct
* Add gemm2 v3 64x128x128
* commit with debug info
* fix fp4 profiler
* Add mx fp4 pipeline v1 instances
* Fix v2 topk_weight cal. Add silu asm.
* v2 tok_weight WIP
* init mx fp4 B no preshuffle version
* tempsave. compile pass, function wrong
* enable fp4 moe no weight preshuffle, function pass
* update the TFlops calculation in the example
* Add gemm2 64x128x128 asm. Fix BF16 ref.
* fix 2 typos in fp4_preshuffle
* Better kernel selection in device classes
* correct preShuffleBuffer
We should use packed k to do the shuffle.
* lds conflict free + buffer load lds
* optimize offset math in dma
* Fix fp4 ckProfiler
* Fix MX MFMA tests
* fix f4 pipeline issues
* gemm1 func pass
* update mx moe gemm1_bns tile size to 64x128x256
* update mx moe gemm1 gemm2 TF and BW calculation
* fix typo
* temp save
* Fix example_gemm_mx build
* rename the block pipeline
* correct a typo in tail
* Add rotating to mx examples
* fix the correctness issue
* Fix v1; use M padding
* Add NT flag to B/BScale buffer
* Merge gemm_mx_common.hpp
* temp save, 4.4~4.5
* Fix 'Merge gemm_mx_common.hpp'
* refactor the pipeline
* Pad the M for scale buffer unconditionally
* update MX moe GEMM1 hotloopscheduling
* change the gemm1 tile from 64x128x128 to 128x64x128
* Unconditional Ascale padding
* Pad shuffled a scale only
* pad ascale
* add vmcnt guard for async copy
* Profiler add f4 wp
* Merge preshuffle device
* Add more fp4 wp instances
* Fix do_weight in gemm1. Fix cshuffle_datatype. Clang-format
* Clang-format after 2 merges
* Remove rocm6.3 workaround flags and macro
* Fix fp8 config
* Fix bf8 config
* flag and barrier fix for compiler branch MainOpSelV3
* Add fp8 profiler instances
* Remove debug infos; Enable flags for blockscale f8
* No asm ver. for merging moe blockscale fp8 into mainline
* update the flag name for f8blockscale
* recover example
* fix performance bug of bpreshuffle f8 gemm
* clang format, remove single rate mfma restriction for f8
* remove single rate mfma restriction for f8 blockscale gemm
* Fix moe blockscale gemm1 barrier 0x800 for new compiler
* add pipeline v1 for MOE Gemm2
* Use v1 pipeline for example_moe_gemm2_xdl_mx_fp4_bns
* Fix OOB; add MB96 instances
* remove unnecessary files
* fix the cmake issue
* Enable splitk for mxfp4; clang format;
* Generate random tensor values with multiple threads
* Use packed_size_v for A/BPackedSize
* Fix warning
* Fix target_compile_options for disabled target on gfx942
* fix moe pki4 on gfx950
* doc the kGroup definition
* Fix ThreadwiseTensorSliceTransfer_v4::Run (Fuse scale)
* Refactor thread_copy_lds_direct_load; fix gfx942 direct lds load example; fix f16_pki4 example
* Fix unknown compiler flag
* fix two failed examples.
* fix some failing tile sizes in gfx950 universal gemm. fix test_gemm_fp16
* workaround fix for test_gemm_f32; we have very limited support for lds direct load if the input matrix is not K major
* fix test_gemm_splitk;
* Fix compile for mx_mfma_op
* add mfma selection logic for multipled_v3
* Clean up
* Fix device gemm mx link error
* improve the global atomic pattern
* Revert unnecessary copyright updates
* restore minimum_occupancy logic
* Avoid data race in moe gemm2 ref
* Build fp8 gemm_multiply_multiply and moe only on gfx94/95
* update the instance in device_mx_gemm
* Resolve comments
* Copyright 2025
* Remove unused code
* fix library linking issue
---------
Co-authored-by: OscarXu <huaiguxu@amd.com >
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com >
Co-authored-by: mtgu0705 <mtgu@amd.com >
Co-authored-by: aska-0096 <haocwang@amd.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: valarLip <340077269@qq.com >
Co-authored-by: feifei14119 <feiw@amd.com >
Co-authored-by: Lin, Qun <qlin@amd.com >
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: 37554c31e8 ]
2025-06-12 09:25:59 +08:00
Bartłomiej Kocot
e98451e42f
Move SetZero functions inside the kernels for Grouped Conv ( #2255 )
...
* Disable SetZero before launch kernel for grouped conv fwd
* Move set zero to kernel
* wmma fix
* fix
---------
Co-authored-by: BrianHarrisonAMD <169072757+BrianHarrisonAMD@users.noreply.github.com >
[ROCm/composable_kernel commit: 8c1ed6f4c1 ]
2025-06-11 23:41:03 +02:00
Muhammed Emin Ozturk
6111449cd6
Stream-K Reduction option as Runtime parameter and Compilation Error Fix (SK- Reduction) ( #2145 )
...
* reduction is passed as runtime parameter
* clang
* Update include/ck/tensor_operation/gpu/device/impl/device_gemm_xdl_cshuffle_streamk_v3.hpp
Co-authored-by: John Afaganis <john.afaganis@amd.com >
* Update include/ck/tensor_operation/gpu/grid/block_to_ctile_map.hpp
* remove comment
---------
[ROCm/composable_kernel commit: 6fad1c4874 ]
2025-06-11 10:59:44 -07:00
Thomas Ning
46624a1abd
Epilogue cshuffle Improvement ( #2312 )
...
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* add cshuffle's mxdlperwavepershuffle support, not finished
* add epilogue functions
* update cshuffle logic
* update cshuffle_logics
* add some change within review
* update some codes following the code review
* update epilogue logic
* remove from problem
* update codes following review.
* fix some issues
* solve the previous PR error, refine the code
* Update include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
* Comment addressed
* handling tile_engine failing case
* handling tile_engine failing case
---------
Co-authored-by: joyeamd <John.Ye@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
Co-authored-by: khushbu agarwal <khuagarw@amd.com >
[ROCm/composable_kernel commit: 06e0b8436c ]
2025-06-10 22:44:50 -07:00
Thomas Ning
a052bfa81a
fix on the typo ( #2326 )
...
[ROCm/composable_kernel commit: 14d229d6c8 ]
2025-06-10 16:34:33 -07:00
Khushbu Agarwal
7afee6c536
fix flatmm kernel for bigger size for fp16 datatype ( #2302 )
...
[ROCm/composable_kernel commit: bd270fe4bc ]
2025-06-10 11:13:40 -07:00
Aviral Goel
9727cf5f62
Label CMakeLists message() as DEBUG or STATUS for clean build output ( #2301 )
...
* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)
* - marked all low importance build messages as log_level=DEBUG
[ROCm/composable_kernel commit: aed0f5880c ]
2025-06-10 10:46:47 -07:00
Max Podkorytov
9ff611028f
Convert CK (GeMM MulMul Weight Preshuffle) instances to use 16x16 xdl tile ( #2229 )
...
* compile profiler only for gemm-mulmul-weight-preshuffle
* m/n xdl; m/n xdl per wave; cshuffle block transfer cluster length m per block
* process all p1 instances
* process all p2 instances
* process all p3 instances
* convert p4 instance
* modify compute p1 instances
* modify compute p2 instances
* relax p4 instance c block transfer cluster len
* fix c block transfer cluster lengths comment
* add mfma (without 16x16) instances to the profiler
* roll back profiling cmakelists change
* clang-format
* re-add (now unused) 32x32 xdl-tile instances
* clang-format
* add more instances
* fit c block transfer lengths into block
* copy and write over the instance definitions from bf16 to fp16
* add instances to profiler
* unify instance tuple alias
[ROCm/composable_kernel commit: e6b5e31c20 ]
2025-06-10 09:37:14 -07:00
Eisuke Kawashima
87ad7ebf67
chore: unset executable permission ( #2303 )
...
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com >
[ROCm/composable_kernel commit: 4e586ca958 ]
2025-06-10 09:13:59 -07:00
John Afaganis
2fdbade459
Remove usage of 'warpSize' variable as it has been deprecated ( #2295 )
...
* SWDEV-535598 - remove usage of 'warpSize' variable as it has been deprecated. Ideally get_warp_size() should not be constexpr but this is just a workaround
* SWDEV-535598 - remove comment from get_warp_size as constexpr is required for this repo
---------
Co-authored-by: Gerardo Hernandez <gerardo.hernandez@amd.com >
[ROCm/composable_kernel commit: 6635d1bb88 ]
2025-06-10 07:34:54 -07:00
dependabot[bot]
67cddfa522
Bump rocm-docs-core[api_reference] from 1.20.0 to 1.20.1 in /docs/sphinx ( #2317 )
...
Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core ) from 1.20.0 to 1.20.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.20.1/CHANGELOG.md )
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.20.0...v1.20.1 )
---
updated-dependencies:
- dependency-name: rocm-docs-core[api_reference]
dependency-version: 1.20.1
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/composable_kernel commit: 3d9f5eafaf ]
2025-06-10 07:27:26 -07:00
Illia Silin
c3ee65fd0f
fix headers ( #2321 )
...
[ROCm/composable_kernel commit: 1ac5eeaea9 ]
2025-06-10 07:26:32 -07:00
carlushuang
e2c603e173
hot fix ( #2315 )
...
[ROCm/composable_kernel commit: 2e0536269e ]
2025-06-10 20:35:28 +08:00
Bartłomiej Kocot
d795af6795
Grouped conv bwd wei explicit GEMM for odd C/K ( #2306 )
...
[ROCm/composable_kernel commit: 7a83f1d510 ]
2025-06-10 11:17:12 +02:00
MHYangAMD
22250b2784
Fix fmha fwd precision issue on MI3XX series ( #2285 )
...
* Fix fmha fwd precision issue on MI3XX series
For fmha fwd fp16 cases, we found that using
impl::cast_tile_pk_fp16_fp32 for casting P leads to precision
issues, since it uses __builtin_amdgcn_cvt_pkrtz, which rounds toward zero.
For example, with K and V fixed to all 1 and Q random, the outputs are
expected to be all 1. However, some outputs come out as 0.9995. The
error is smaller than the atol of 0.001 (1 - 0.9995 = 0.0005 < 0.001),
so ck does not report it.
* Add option to switch rtn/rtz for fmha fwd
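The difference between the two rounding modes can be seen on a single value. The following is a minimal sketch, assuming Python's struct "e" format (IEEE half precision, which rounds to nearest) as the RTN reference. fp16_rtn and fp16_rtz are hypothetical helper names, not CK functions, and the RTZ emulation only mimics the rounding direction of a round-toward-zero conversion, not the GPU instruction itself.

```python
import struct


def fp16_rtn(x: float) -> float:
    """Round x to the nearest fp16 value (what struct's 'e' format does)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_rtz(x: float) -> float:
    """Round x to fp16 toward zero, emulated on top of the RTN conversion."""
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    nearest = struct.unpack("<e", struct.pack("<H", bits))[0]
    if abs(nearest) > abs(x):
        # fp16 is sign-magnitude, so subtracting 1 from the bit pattern
        # moves one ulp toward zero for both positive and negative values.
        bits -= 1
    return struct.unpack("<e", struct.pack("<H", bits))[0]


p = 0.7  # a made-up softmax probability
print(fp16_rtn(p))  # 0.7001953125  (rounded up, the nearest fp16 value)
print(fp16_rtz(p))  # 0.69970703125 (rounded down, toward zero)
```

Because RTZ never rounds a positive probability up, the errors in a row of P all point the same way, so a weighted sum with V fixed to 1 can only come out at or below the exact value. With RTN the individual errors can have either sign and tend to cancel.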
[ROCm/composable_kernel commit: 9fcf21a4ec ]
2025-06-10 15:03:23 +08:00
carlushuang
0ff21106d3
MUST USE INLINE FOR ANY NON TEMPLATE FUNCTION IN HEADER!!! ( #2305 )
...
[ROCm/composable_kernel commit: 65835c0bbb ]
2025-06-10 10:40:54 +08:00
Aviral Goel
ee294c1736
Code Refactor for check_err.hpp ( #2284 )
...
* refactor & add documentation
* removed return datatype from doxygen comments
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: John Afaganis <john.afaganis@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update include/ck_tile/host/check_err.hpp
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
---------
Co-authored-by: John Afaganis <john.afaganis@amd.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
[ROCm/composable_kernel commit: 5a0bd157db ]
2025-06-08 13:41:27 -07:00
Max Podkorytov
1fb96b85d9
Add a python script for running ckProfiler and processing the results ( #2288 )
...
* add profiler script
* add comments
* generalize and add some input validation
* format
* refactor
* Rename run_ck_profiler.py to run_ck_profiler_gemm_with_csv_shapes.py
rename script file
[ROCm/composable_kernel commit: aece3c6700 ]
2025-06-08 12:41:57 -07:00
Sami Remes
24beb3bc6b
[CK_TILE] Tileloop persistent gemm - resubmit ( #2299 )
...
* Reapply "[CK_TILE] Tile loop persistent gemm kernel (#2191 )" (#2293 )
This reverts commit 0c8aea8cb4 .
* Add missing header for kentry
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
[ROCm/composable_kernel commit: 1c6f83df6c ]
2025-06-06 14:18:49 -07:00
valarLip
06f87e3fb9
extend buffer load to support load 32 bf16/fp16 at same time ( #2291 )
...
[ROCm/composable_kernel commit: 8482977a37 ]
2025-06-06 17:21:19 +08:00
Bartłomiej Kocot
48034577d4
Grouped Convolution Backward Weight Explicit GEMM ( #2282 )
...
* Grouped conv bwd weight explicit gemm
* 3d
* cmake fixes
* fix test
* fix
[ROCm/composable_kernel commit: 050cad09b5 ]
2025-06-06 10:30:08 +02:00
Andriy Roshchenko
72054549e7
Optimized GEMMs for MX FP4/8 ( #2294 )
...
Adds V3 GEMM pipeline for MX FP4 and MX FP8
Adds V3 GEMM pipeline for MX FP4 with preshuffling
Adds MXFP4 GEMM tests (#2275 )
Adds MXFP4 GEMM examples
Adds MXFP4 GEMMs to ckProfiler
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com >
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com >
Co-authored-by: aska-0096 <haocwang@amd.com >
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com >
Co-authored-by: OscarXu <huaiguxu@amd.com >
Co-authored-by: mtgu0705 <mtgu@amd.com >
Co-authored-by: Ding, Yi <yi.ding@amd.com >
Co-authored-by: feifei14119 <feiw@amd.com >
Co-authored-by: Lin, Qun <qlin@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com >
[ROCm/composable_kernel commit: 00247e3c29 ]
2025-06-05 13:54:15 -06:00