Yi DING
f885dd8b33
[CK_TILE] FMHA BWD Fix Compilation with Bias ( #2682 )
...
* [CK_TILE] FMHA BWD Fix Compilation with Bias
* Fix appendkv kApplyRoPE
[ROCm/composable_kernel commit: 4cfa2c7158 ]
2025-08-22 10:01:10 +08:00
dnovakovic-dxc
c63fbe24ca
Script for generating list of files not referenced in tests ( #2696 )
...
* script for generating list of not referenced files in tests, list is in json format
* script comment added
* added empty line at the end of the script
* format changes
[ROCm/composable_kernel commit: 49c6b05c72 ]
2025-08-20 08:22:51 -07:00
Bartłomiej Kocot
092ba92f26
[CK Tile] Grouped convolution backward data ( #2652 )
...
* base working version for single groupped conv bwd data
* Fix 2d descriptor
* fix groups
* Add 3d support
* fixes
* fixes
* fixes
---------
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com >
[ROCm/composable_kernel commit: 4212bbc170 ]
2025-08-20 05:29:57 -07:00
Haocong WANG
aa28dcb161
[CK_TILE] Update the fmhafwd dispatch logic ( #2698 )
...
* update the fmhafwd dispatch logic
* Fix fmha test scripts
* Fix bash
---------
Co-authored-by: Ding, Yi <yi.ding@amd.com >
[ROCm/composable_kernel commit: 81b265cf91 ]
2025-08-20 16:24:43 +08:00
Vidyasagar Ananthan
0a5c8d6e45
Setting gpu target filtering for tile engine to gfx90a, gfx942 and gfx950. ( #2709 )
...
[ROCm/composable_kernel commit: bf3e719c16 ]
2025-08-19 18:12:06 -07:00
Emily Martins
aa964be163
[CK Tile] Stream K GEMM Kernel HostArgs and Kernel Classes ( #2681 )
...
* CK Tile Stream K Device Ops
Implementation of CK Tile StreamKHostArgs and StreamKKernel classes. The
StreamKKernel class injects Universal Gemm and includes functions to
facilitate kernel preparation for the GPU.
* Stream K Device Ops Fixes
- Update GetWorkSpaceSize to call TilePartitioner's GetWorkSpaceSize to
ensure we get size needed for accumulation buffers and semaphores.
- Pass in num_sk_blocks into TilePartitioner constructor
- Update documentation
* Add WarpTile dimensions to GetName function in StreamKKernel class
* Fix typos in StreamKHostArgs class description.
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com >
* Apply clang format on updated comment for StreamKHostArgs
* Explicitly specify type for StreamKReductionStrategy enum
* Remove unecessary scopes
* Unify the commenting style to inline comments
* Add explicit casts for occupancy and num_cu in MakeKernelArgs function
Both the static functions Occupancy and NumCU in the StreamKKernel class
use functions from the HIP API that result in the returned occupancy and
num_cu types being type int. The TilePartitioner interface for stream K will
have occupancy and num_cu being type ck_tile::index_t which is int32_t.
Thus, to be safe, this change ensures that both occupancy and num_cu are
cast to int32_t.
* Fix use of kentry due to interface update
PR #2594 updated the interface for the kentry function in
include/ck_tile/host/kernel_launch.hpp. As a result, the static function
Occupancy was updated to work correctly with the new interface.
PR #2594 also changed UniversalGemmKernel's KernelBlockSize static
variable to kBlockSize, so the StreamKKernel class was updated to
reflect this change.
* Switch type of num_sk_blocks from uint32_t to int32_t
This change switches the type of num_sk_blocks to type ck_tile::index_t
which is int32_t. This was done because parallel work for the CK Tile
StreamK TilePartitioner's constructor will have num_sk_blocks as
ck_tile::index_t. Thus, this change will help unify the interfaces to
avoid any type conversion errors.
---------
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com >
[ROCm/composable_kernel commit: 071165919f ]
2025-08-19 15:08:52 -06:00
jefyang1
8b37a2968f
Fix pk i4 v3 example test regression on gfx942 ( #2706 )
...
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
[ROCm/composable_kernel commit: 6ba9289b26 ]
2025-08-19 09:58:28 -07:00
joyeamd
f9e76244d8
fix grouped gemm example when wave32 enabled ( #2707 )
...
1, delete some unused variables
2, fix BlockSize when wave32 enabled
[ROCm/composable_kernel commit: a1589a9667 ]
2025-08-19 16:20:43 +08:00
mirchen-amd
020c9139ec
Mirchen/gemm blockscale wp segfault fix ( #2638 )
...
* Add stride validation to prevent segfault in blockscale GEMM
* run clang-format
* Update profiler/include/profiler/profile_gemm_blockscale_wp_impl.hpp
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com >
* added stride length checking to more gemm examples in ckprofiler
* ran clang format
* added validation header and implement in core gemm operations
* remove ck_tile transpose and gemm stages from CI (#2646 )
* update CK build instruction step 4 (#2563 )
Co-authored-by: Aviral Goel <aviral.goel@amd.com >
* Fixes to "General 2D Reduction Kernel" (#2535 ) (#2656 )
* fix reduce2d
- revret the combine_partial_results() chnages
- remove auto from function def
* clang-format
* enable aiter test_mha in daily CI (#2659 )
* feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582 )
* feat(copy_kernel): add basic copy kernel example with documentation
* docs(CHANGELOG): Updated changelog
* chore: performed clang format
* Update example/ck_tile/39_copy/copy_basic.cpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* fix(terminology): follow amd terms
* extract elementwise copy to a new kernel
* fix(copy_kernel): bug in verification
* add comments about vgpr usage
* lint and nits
* add notes and comments
* print hostTensor via stream
* print hostTensor via stream
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* [CK_TILE] FMHA BWD Optimization For GFX950 (#2628 )
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
* Fix num_byte calculations to use nhead_k for K & V size (#2653 )
Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.
Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```
After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```
* [CK_TILE] FMHA BWD Decode Pipeline (#2643 )
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
* fix (#2668 )
* Optimize fmha fwd decode & prefill for gfx950 (#2641 )
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
* Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641 )" (#2670 )
This reverts commit 747d127983 .
* added batch stride checking to batched gemm ops in profiler
* removed batch stride validation
* removed batched stride validation again
* Update include/ck/library/utility/profiler_validation_common.hpp
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com >
* refactor function names
* added gemm stride checking to more profiler gemm operations
* run clang format
* add stride checkign to 01 gemm example
* rename from profiler to validation common, used for examples and profiler
* build of ckProfiler success
* update file headers
---------
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: geozhai <44495440+geozhai@users.noreply.github.com >
Co-authored-by: Aviral Goel <aviral.goel@amd.com >
Co-authored-by: Yashvardhan Agarwal <yashagar@amd.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
Co-authored-by: Yi DING <yi.ding@amd.com >
Co-authored-by: Cameron Shinn <camerontshinn@gmail.com >
Co-authored-by: Mateusz Ozga <110818320+mozga-amd@users.noreply.github.com >
Co-authored-by: Haocong WANG <haocwang@amd.com >
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: 60320e90c1 ]
2025-08-19 01:19:17 -07:00
Max Podkorytov
339fc3da3f
invoke script directly ( #2687 )
...
[ROCm/composable_kernel commit: f38751fc2a ]
2025-08-19 00:23:07 -07:00
Max Podkorytov
e1f603e645
[Dev infra] cmake_ck_dev.sh inline docs and refactor argument list ( #2689 )
...
* invoke script directly
* script fixup
* keep the docs update separate
* add newline
* escape arg
* use portable way of setting IFS
[ROCm/composable_kernel commit: 696ef05784 ]
2025-08-19 00:22:23 -07:00
Max Podkorytov
3eb72ec7f4
remove script ( #2692 )
...
[ROCm/composable_kernel commit: 8f6dc23a89 ]
2025-08-19 00:20:54 -07:00
Geo Min
eaf2de9156
[TheRock CI] Adding presubmit check for CK ( #2688 )
...
* Adding presubmit check for CK
* Adding exclusion
* Enable forks
[ROCm/composable_kernel commit: b4f3487d84 ]
2025-08-18 14:16:31 -07:00
Illia Silin
9371dfb3a6
Build ckProfiler package for all architectures. ( #2701 )
...
* stash ckprofiler package built for all targets
* build the lib for all instances in newer docker
* make sure packages get posted
[ROCm/composable_kernel commit: 8b55afcd93 ]
2025-08-18 11:16:25 -07:00
linqunAMD
615ca9842d
Support Wave32 in CK_TILE - Part 1 ( #2594 )
...
* Support wave32/wave64 in CK_TILE - Part 1
* remove blocksize in kernel launch
* fix build error
* fix clang format
* fix clang format 2
* fix clang format 3
* fix fmha build error
* fix fmha build 2
* fix fmha build 3
* fix build error 4
* address review comment
* update change log
* replace KernelBlockSize with kBlockSize
* fix CI fail
* fix clang format
* address review comment and rebase code.
* fix universal test fail
---------
Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com >
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com >
[ROCm/composable_kernel commit: 9fcc1ee9fd ]
2025-08-18 10:08:31 -07:00
Sami Remes
13bfcba04c
Add other layouts for FP8 block scaled gemm ( #2665 )
...
* Start adding other layouts for gemm_ab_scale
* Add some instances
* Create tensor descriptors for A/B scales depending on A/B layout
* Fix formatting
* Revert some comments
* Revert commented instances in CMakeLists.txt
* Add some more instances for col-row gemm
* enable more row,row instances
* Use occupancy=1 for col,row layout to avoid spills
[ROCm/composable_kernel commit: 26d3300930 ]
2025-08-18 01:46:10 -07:00
Tianyuan Wu
a4d70b6e13
Fix CI build error ( #2695 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
[ROCm/composable_kernel commit: 7310830d14 ]
2025-08-18 01:45:40 -07:00
Tianyuan Wu
abb90422b4
[CK_TILE] CK_TILE GEMM WMMA Support for GFX11/GFX12 ( #2466 )
...
* WMMA GEMM F16 Implementation
Signed-off-by: root <tianyuwu@amd.com >
* Self-review
Signed-off-by: root <tianyuwu@amd.com >
* ASIC check minor tweak
Signed-off-by: root <tianyuwu@amd.com >
* add missing include file
* Set GPU_TARGETS to gfx11/12 generic
Signed-off-by: root <tianyuwu@amd.com >
* INT8 GFX12
Signed-off-by: root <tianyuwu@amd.com >
* add int8x16 branch
* Fix CI script
Signed-off-by: root <tianyuwu@amd.com >
* Fix typo
Signed-off-by: root <tianyuwu@amd.com >
* Add CK_Tile WMMA example
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* Fix CI
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* fix clang format
* Set M/N_Warp Back to Constant
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
* Use GemmConfigComputeV3 by default
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Remove CK_Tile wmma gemm examples from the CI list
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Add atomic add fallback method for gfx11
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix typo
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Omit copyright year
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Support non-square cases
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix CI
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Add get_device_ip()
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Revert "Add atomic add fallback method for gfx11"
This reverts commit 4f664969c01b37976c8518c19833d9f1574cd746.
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* Revert "Enable CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT for gfx12"
This reverts commit 949129a3858a825b2a2c4d3ec01663df18a165a5.
* Revise method name and typos
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* clang-format
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Try fix CI
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Revert "Try fix CI"
This reverts commit 084c683227e64ab6a8137db00c8165fb05bdc902.
* clang-format
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
* Fix typo caused by merge
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
* Fix typo caused by merging
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
---------
Signed-off-by: root <tianyuwu@amd.com >
Signed-off-by: Tianyuan Wu <tianyuwu@amd.com >
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
Co-authored-by: joye <joye@amd.com >
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com >
[ROCm/composable_kernel commit: 68134b60e4 ]
2025-08-15 16:22:27 -07:00
Thomas Ning
692699b5bc
Preshuffle Decode Prefill config fix ( #2693 )
...
* feat(gemm_wp): add two new configs for wp
* delete the unnecessary files
* fix the config error
* update the config
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com >
[ROCm/composable_kernel commit: 5ada85ec04 ]
2025-08-15 15:49:07 -07:00
Aviral Goel
117cd71ab6
feat(gemm_wp): add two new configs for gemm weight preshuffle in gemm_utils.h ( #2690 )
...
* feat(gemm_wp): add two new configs for wp
* delete the unnecessary files
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com >
[ROCm/composable_kernel commit: c06e8b4a66 ]
2025-08-15 15:00:25 -07:00
Thrupti Raj Lakshmana Gowda
bb1d39c958
Variable name correction in Jenkins file ( #2686 )
...
[ROCm/composable_kernel commit: 1c2078066b ]
2025-08-14 13:35:55 -07:00
jefyang1
1a412578d8
Add gemm universal f8 f8 bf16 instances on gfx950 ( #2662 )
...
[ROCm/composable_kernel commit: d7c95dd491 ]
2025-08-14 13:25:24 -07:00
Emily Martins
5d16d6fd3c
[CK_Tile] Refactor Permute and MOE Smoothquant ctests to gtests ( #2622 )
...
* Refactor CK tile permute ctests to gtests
* Refactor CK tile MOE smoothquant ctests to gtests
* fix typo in comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update invalid case in else clause for get_precision_string
* Refactor permute gtests to use templated versions of matrix_core_swizzle and permute functions
---------
Co-authored-by: root <root@splinter-126-wr-c2.aus.dcgpu >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
[ROCm/composable_kernel commit: 10395fc895 ]
2025-08-14 11:01:54 -07:00
Emily Martins
b8cd0b8dd4
[CK_Tile] Refactor MOE Sorting and Smoothquant ctests to gtests ( #2596 )
...
* refactor moe_sorting ctests to use gtest framework
* Refactor ctests for smoothquant to gtests
* fix clang format to use version 18
* Print local_eid in MOE sorting gtests
* Remove extra space in smoothquant output
[ROCm/composable_kernel commit: 70dce4e0c6 ]
2025-08-14 10:54:57 -07:00
Yashvardhan Agarwal
b13aa4217e
CK_TILE: Implement two-stage split-K GEMM with workspace reduction (LWPCK-2966) ( #2632 )
...
* CK_TILE: Implement two-stage split-K GEMM with reduction
- Added split-K GEMM with reduction example
* comment resolutions
[ROCm/composable_kernel commit: 7f14772406 ]
2025-08-14 10:18:52 +02:00
Gino Lu
3bbad88a97
fix wrong nan producion. ( #2640 )
...
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com >
[ROCm/composable_kernel commit: e5623d3825 ]
2025-08-14 15:12:31 +08:00
Jimniu
d6b1452592
Jimniu/tile_example_flatmm_basic fix ( #2680 )
...
* Add stride_b validation
* run clang-format
[ROCm/composable_kernel commit: 753b6227c5 ]
2025-08-13 16:06:08 -07:00
Aviral Goel
e371bb31f5
Minor Improvements in CK TILE memory copy EXAMPLE ( #2678 )
...
* Rename vector to ThreadTile
* more notes on tile encoding
* remove number<> from tuple of make_tile_window
* add script to stress test the copy example
[ROCm/composable_kernel commit: 8a698c7445 ]
2025-08-13 15:24:16 -07:00
joyeamd
d6347da784
[CK_TILE]fix elementwise example in gfx11/12 ( #2676 )
...
* fix elementwise examples
* improve the robust
* fix ck_tile's elementwise test
* update elementwise test
[ROCm/composable_kernel commit: bcc38deff7 ]
2025-08-13 15:21:46 -07:00
Enrico Degregori
ec1c249b63
Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) ( #2675 )
...
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
[ROCm/composable_kernel commit: a6f4029276 ]
2025-08-14 00:21:09 +02:00
JH-Leon-KIM-AMD
68eecd0d5e
CSV-driven convolution test pipeline ( #2581 )
...
* Add CSV-driven convolution test pipeline
- Add test_grouped_convnd_fwd_dataset_xdl.cpp with CSV reader functionality
- Add complete dataset generation toolchain in test_data/
- Add Jenkins integration with RUN_CONV_COMPREHENSIVE_DATASET parameter
- Ready for comprehensive convolution testing with scalable datasets
* Update convolution test dataset generation pipeline
* add 2d, 3d dataset csv files
* Remove CSV test dataset files from repository
* Update generate_test_dataset.sh
* Fix channel division for MIOpen to CK conversion
* Remove unnecessary test files
* Fix clang-format-18 formatting issues
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com >
[ROCm/composable_kernel commit: b963478759 ]
2025-08-13 16:24:34 +02:00
Haocong WANG
1b424cee6b
fix for aiter consume ( #2677 )
...
[ROCm/composable_kernel commit: 3142562c22 ]
2025-08-13 19:06:22 +08:00
SamiAario-AMD
8a32077ccd
Cleanups ( #2631 )
...
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp
* Simplify two templated operator calls by having the templated types deduced automatically
* Simplify two GemmPipeline calls
* Fix GemmPipelineAgBgCrCompV4::GetName
* Refactor use of ArgParser in CK tile GEMM examples
* Update args in README.md to match the implementation in create_args
* Remove some unnecessary include statements
* Rename two variables
* Factor out common code
* Factor out do_verify
* Add and use type aliases for memory operation integral constants
* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com >
[ROCm/composable_kernel commit: 28a97865f5 ]
2025-08-13 10:12:08 +02:00
Haocong WANG
e3e8de3477
Re-enable optimization for gfx950 fmha fwd ( #2671 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
* refactor blockgemm change, isolate to v2;
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
Co-authored-by: asleepzzz <hanwen.chang@amd.com >
[ROCm/composable_kernel commit: 05a6e92705 ]
2025-08-13 14:57:43 +08:00
Cong Ma
402b3f445b
Preshuffle AQ matrix in block scale gemm ( #2624 )
...
* Preshuffle AQ matrix in block scale gemm
* turns the output to fp16. Increase the repetition time.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com >
[ROCm/composable_kernel commit: 452791a3ba ]
2025-08-12 21:32:51 -07:00
Thomas Ning
bd3b4afb92
Finish the grouped gemm restructure with fp8 data type ( #2655 )
...
* Finish the grouped gemm restructure with data type
* restore gemm_utils.hpp
* Update example/ck_tile/17_grouped_gemm/run_grouped_gemm_example.inc
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Comment Addressed
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
[ROCm/composable_kernel commit: 0f42a92fc1 ]
2025-08-12 18:23:34 -07:00
Thrupti Raj Lakshmana Gowda
62fb072dbe
GEMM Multi D for CK Tile Engine ( #2660 )
...
* Readme for GEMM Multi D
* GEMM Multi D partial Progress
* GEMM Multi D partial Progress!
* CK Tile Engine GEMM Multi D : All Python files generated
* Partial Progress
* Partial Progress
* Partial Progress
* Partial Progress : Incorrect Result
* Partial Progress : Debugging
* Partial Progress : Correct Results
* Partial Progress - Incorrect Results
* Partial Progress - Commenting Passthrough bypass logic
* Changing Passthrough to MultiplyMultiply
* Correct Results!
* Fix and debug the pass through feature
* Sample commit
* Correct Results : MultiplyMultiply
* Code Cleanup
* Removing Failed Instances
* Working code before Unary element support
* Custom Elementwise Function support and working implementation for Mul and Add
* Updating README
* Working for Passthrough
* Review Comments : Minor Fixes
* Review Comments : Minor Fixes
* Readme Updated
* Partial Changes after Rebase
* Working Code : Changes after Rebase
* Updating Jenkins file
* Removing default value changed while testing
* Configuration changes in config files
* Tile Handler changes in GEMM Multi D Tile Engine
* Tile Handler changes in GEMM Multi D Example
* Change log for Gemm Multi D in CK Tile Engine
* Configuration changes in config files
---------
Co-authored-by: ThomasNing <thomasning@amd.com >
[ROCm/composable_kernel commit: 3f57ec3d2d ]
2025-08-12 16:05:05 -07:00
Geo Min
3f5d6a4d1f
[TheRock CI] Adding TheRock CI gate check ( #2648 )
...
* Adding initial TheRock CI
* Adding composable kernel link
* Adding correct repo for rocm-libraries
* Adding entire rocm-libraries checkout
* Adding correct flag
* Adding correct flag for fetch sources
* Fixing git health
* Removing patch
* Removing patching
* Removing manual check
* PR comments
* testing without dist
* Removing test branch
* PR comments
* PR comments
* PR comment
* Adding test_runs_on
[ROCm/composable_kernel commit: 30dafe8281 ]
2025-08-12 14:13:01 -07:00
joyeamd
60e654e2c4
[CK_TILE]fix ck_tile's moe_sorting example in gfx11 ( #2667 )
...
* fix ck_tile's moe_sorting example in gfx11
* fix clang format
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com >
[ROCm/composable_kernel commit: 0856b3f4a2 ]
2025-08-12 12:33:56 -07:00
Illia Silin
75e42f90da
fix builds with mainline/staging compilers ( #2674 )
...
[ROCm/composable_kernel commit: bbf41b27f2 ]
2025-08-12 10:23:08 -07:00
slippedJim
be39c00a8e
remove bad pipeline codegen ( #2673 )
...
[ROCm/composable_kernel commit: 20288caa2f ]
2025-08-13 00:23:40 +08:00
asleepzzz
4730050182
Revert "Optimize fmha fwd decode & prefill for gfx950 ( #2641 )" ( #2670 )
...
This reverts commit 747d127983 .
[ROCm/composable_kernel commit: 5b39de4bb6 ]
2025-08-12 20:27:10 +08:00
Haocong WANG
747d127983
Optimize fmha fwd decode & prefill for gfx950 ( #2641 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com >
[ROCm/composable_kernel commit: b7322a521a ]
2025-08-12 19:43:14 +08:00
Mateusz Ozga
634936e483
fix ( #2668 )
...
[ROCm/composable_kernel commit: c0c2ded566 ]
2025-08-12 13:02:10 +02:00
Yi DING
19ef22e567
[CK_TILE] FMHA BWD Decode Pipeline ( #2643 )
...
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
[ROCm/composable_kernel commit: 8e1eb0c1ee ]
2025-08-12 17:02:52 +08:00
Cameron Shinn
55d3f65a9e
Fix num_byte calculations to use nhead_k for K & V size ( #2653 )
...
Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.
Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```
After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```
[ROCm/composable_kernel commit: 352f87e684 ]
2025-08-12 13:44:01 +08:00
Yi DING
8cb8da53c9
[CK_TILE] FMHA BWD Optimization For GFX950 ( #2628 )
...
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
[ROCm/composable_kernel commit: 4fde1646e5 ]
2025-08-12 11:11:55 +08:00
Aviral Goel
eda4a5e804
feat(copy_kernel): add basic copy kernel example with beginner friendly documentation ( #2582 )
...
* feat(copy_kernel): add basic copy kernel example with documentation
* docs(CHANGELOG): Updated changelog
* chore: performed clang format
* Update example/ck_tile/39_copy/copy_basic.cpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
* fix(terminology): follow amd terms
* extract elementwise copy to a new kernel
* fix(copy_kernel): bug in verification
* add comments about vgpr usage
* lint and nits
* add notes and comments
* print hostTensor via stream
* print hostTensor via stream
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com >
[ROCm/composable_kernel commit: a7badc6ec5 ]
2025-08-11 10:54:37 -07:00
Illia Silin
7535dc70ea
enable aiter test_mha in daily CI ( #2659 )
...
[ROCm/composable_kernel commit: 6bfef63414 ]
2025-08-11 09:50:33 -07:00
Yashvardhan Agarwal
92c0435e29
Fixes to "General 2D Reduction Kernel" ( #2535 ) ( #2656 )
...
* fix reduce2d
- revret the combine_partial_results() chnages
- remove auto from function def
* clang-format
[ROCm/composable_kernel commit: 191c62967b ]
2025-08-11 15:01:33 +02:00