John Shumway
2fee665612
Use a set_submatrix helper.
...
We want to simplify testing many instances, so add a helper that makes the test and instantiation details clearer.
2025-09-06 16:46:57 +00:00
John Shumway
c5f67bc87c
Add all device_grouped_conv_fwd_xdl_bf16_comp_instances
2025-09-05 23:46:09 +00:00
John Shumway
032774448c
Add block GEMM pipeline version to builder.
2025-09-05 23:14:02 +00:00
John Shumway
ade741ae61
Generalized version to StringLiteral.
...
With the change, the following can be used for the version parameter:
"0.1.0" // string literal
constexpr char API_VERSION[] = "0.1.0";
constexpr ckb::StringLiteral API_VERSION = "0.1.0";
2025-09-05 22:30:00 +00:00
John Shumway
03f4beda5f
Update builder_utils.hpp.
...
This commit moves sequence_util to the more general builder_utils, adds unit tests, and converts ToSequence to better match standard library conventions.
2025-09-05 20:41:32 +00:00
John Shumway
46969bd05d
Convert SIGNATURE to non-template type parameter.
...
No functional changes yet, but this aligns with ALGORITHM and will allow testing different signatures.
2025-09-04 21:31:44 +00:00
John Shumway
d3404a8e6b
Add two more instances to tests.
...
We now have four instances; next we need to add the block GEMM pipeline version.
2025-09-04 14:46:04 +00:00
John Shumway
b70a58fcdd
Split builder tests and instance tests.
...
We have a typed test suite of all the instances we want to create, and a simple test of the builder. Split those into two separate test suites.
2025-09-04 14:02:34 +00:00
John Shumway
6b83f7e0d9
Migrate builder instantiation test to a TYPED_TEST_SUITE.
...
We can now drive the test from a constexpr std::array of TestCases and a ::testing::Types sequence of test indices.
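As a rough, framework-free sketch of the idea (not the test code from this commit): each test case lives in a constexpr table, and an index type selects one entry at compile time. The `TestCase` fields, `kTestCases`, `Index`, and `InstanceUnderTest` are all hypothetical names. In GoogleTest, the `Index<I>` types would be the ones listed in `::testing::Types` and passed to `TYPED_TEST_SUITE`.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical per-instance parameters; the real TestCase fields may differ.
struct TestCase
{
    int block_size;
    int m_per_block;
    int n_per_block;
};

// Table of cases, usable in constant expressions.
inline constexpr std::array<TestCase, 3> kTestCases{{
    {256, 128, 128},
    {256, 256, 128},
    {64, 64, 64},
}};

// One distinct type per table index, so a typed test can be
// instantiated once per case.
template <std::size_t I>
using Index = std::integral_constant<std::size_t, I>;

// Stand-in for the object a typed test would build from its type parameter.
template <typename IndexT>
struct InstanceUnderTest
{
    static constexpr TestCase params = kTestCases[IndexT::value];
    static constexpr int tile_area() { return params.m_per_block * params.n_per_block; }
};

// Visit every index, the way a typed test suite visits every listed type.
template <std::size_t... Is>
constexpr int total_tile_area(std::index_sequence<Is...>)
{
    return (InstanceUnderTest<Index<Is>>::tile_area() + ...);
}

inline constexpr int kTotalTileArea =
    total_tile_area(std::make_index_sequence<kTestCases.size()>{});
```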
2025-09-04 13:34:22 +00:00
John Shumway
e05879fbd6
Add block transfer parameters to builder.
...
These are very hard to test in the kernel class, so just test the values in the factory.
2025-09-02 23:08:32 +00:00
John Shumway
c3c92f1267
Add test for ak1 and bk1.
...
These are not easy to access in the instance object, so just test the factory values.
2025-09-02 21:06:16 +00:00
John Shumway
a2e661cd57
Making algorithm a non-type parameter
...
This simplifies the design by continuing to reduce the number of types and avoiding extra use of constexpr.
2025-09-02 17:22:29 +00:00
John Shumway
a79616f323
Add tuning parameters to builder.
...
Add support for setting ak1, bk1, m_xdl_per_wave, and n_xdl_per_wave.
Note: It's difficult to test ak1 and bk1, since the values are not stored in the class.
2025-09-02 16:32:32 +00:00
John Shumway
d34806a8ae
Simplify convolution builder tests.
...
Move static_assert concept checks out of tests and improve instance test names.
2025-09-01 22:04:04 +00:00
John Shumway
a24fc535b7
Add thread block info to factory.
...
Now we can set the thread block size and submatrix shape for the builder.
2025-09-01 21:54:41 +00:00
John Shumway
682d27cc40
Fix test files for convolution builder.
2025-08-28 02:02:48 +00:00
John Shumway
4117fcb36e
Initial commit of convolution builder.
...
Creates a single instance with template metaprogramming. Many things are still hard-coded.
2025-08-28 01:53:27 +00:00
Thrupti Raj Lakshmana Gowda
1c2078066b
Variable name correction in Jenkins file ( #2686 )
2025-08-14 13:35:55 -07:00
jefyang1
d7c95dd491
Add gemm universal f8 f8 bf16 instances on gfx950 ( #2662 )
2025-08-14 13:25:24 -07:00
Emily Martins
10395fc895
[CK_Tile] Refactor Permute and MOE Smoothquant ctests to gtests ( #2622 )
...
* Refactor CK tile permute ctests to gtests
* Refactor CK tile MOE smoothquant ctests to gtests
* fix typo in comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update invalid case in else clause for get_precision_string
* Refactor permute gtests to use templated versions of matrix_core_swizzle and permute functions
---------
Co-authored-by: root <root@splinter-126-wr-c2.aus.dcgpu>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-14 11:01:54 -07:00
Emily Martins
70dce4e0c6
[CK_Tile] Refactor MOE Sorting and Smoothquant ctests to gtests ( #2596 )
...
* refactor moe_sorting ctests to use gtest framework
* Refactor ctests for smoothquant to gtests
* fix clang format to use version 18
* Print local_eid in MOE sorting gtests
* Remove extra space in smoothquant output
2025-08-14 10:54:57 -07:00
Yashvardhan Agarwal
7f14772406
CK_TILE: Implement two-stage split-K GEMM with workspace reduction (LWPCK-2966) ( #2632 )
...
* CK_TILE: Implement two-stage split-K GEMM with reduction
- Added split-K GEMM with reduction example
* comment resolutions
2025-08-14 10:18:52 +02:00
Gino Lu
e5623d3825
fix wrong nan production. ( #2640 )
...
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-08-14 15:12:31 +08:00
Jimniu
753b6227c5
Jimniu/tile_example_flatmm_basic fix ( #2680 )
...
* Add stride_b validation
* run clang-format
2025-08-13 16:06:08 -07:00
Aviral Goel
8a698c7445
Minor Improvements in CK TILE memory copy EXAMPLE ( #2678 )
...
* Rename vector to ThreadTile
* more notes on tile encoding
* remove number<> from tuple of make_tile_window
* add script to stress test the copy example
2025-08-13 15:24:16 -07:00
joyeamd
bcc38deff7
[CK_TILE]fix elementwise example in gfx11/12 ( #2676 )
...
* fix elementwise examples
* improve robustness
* fix ck_tile's elementwise test
* update elementwise test
2025-08-13 15:21:46 -07:00
Enrico Degregori
a6f4029276
Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) ( #2675 )
...
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-08-14 00:21:09 +02:00
JH-Leon-KIM-AMD
b963478759
CSV-driven convolution test pipeline ( #2581 )
...
* Add CSV-driven convolution test pipeline
- Add test_grouped_convnd_fwd_dataset_xdl.cpp with CSV reader functionality
- Add complete dataset generation toolchain in test_data/
- Add Jenkins integration with RUN_CONV_COMPREHENSIVE_DATASET parameter
- Ready for comprehensive convolution testing with scalable datasets
* Update convolution test dataset generation pipeline
* add 2d, 3d dataset csv files
* Remove CSV test dataset files from repository
* Update generate_test_dataset.sh
* Fix channel division for MIOpen to CK conversion
* Remove unnecessary test files
* Fix clang-format-18 formatting issues
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-08-13 16:24:34 +02:00
Haocong WANG
3142562c22
fix for aiter consume ( #2677 )
2025-08-13 19:06:22 +08:00
SamiAario-AMD
28a97865f5
Cleanups ( #2631 )
...
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp
* Simplify two templated operator calls by having the templated types deduced automatically
* Simplify two GemmPipeline calls
* Fix GemmPipelineAgBgCrCompV4::GetName
* Refactor use of ArgParser in CK tile GEMM examples
* Update args in README.md to match the implementation in create_args
* Remove some unnecessary include statements
* Rename two variables
* Factor out common code
* Factor out do_verify
* Add and use type aliases for memory operation integral constants
* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-08-13 10:12:08 +02:00
Haocong WANG
05a6e92705
Re-enable optimization for gfx950 fmha fwd ( #2671 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
* refactor blockgemm change, isolate to v2;
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-08-13 14:57:43 +08:00
Cong Ma
452791a3ba
Preshuffle AQ matrix in block scale gemm ( #2624 )
...
* Preshuffle AQ matrix in block scale gemm
* turns the output to fp16. Increase the repetition time.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-08-12 21:32:51 -07:00
Thomas Ning
0f42a92fc1
Finish the grouped gemm restructure with fp8 data type ( #2655 )
...
* Finish the grouped gemm restructure with data type
* restore gemm_utils.hpp
* Update example/ck_tile/17_grouped_gemm/run_grouped_gemm_example.inc
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Comment Addressed
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-12 18:23:34 -07:00
Thrupti Raj Lakshmana Gowda
3f57ec3d2d
GEMM Multi D for CK Tile Engine ( #2660 )
...
* Readme for GEMM Multi D
* GEMM Multi D partial Progress
* GEMM Multi D partial Progress!
* CK Tile Engine GEMM Multi D : All Python files generated
* Partial Progress
* Partial Progress
* Partial Progress
* Partial Progress : Incorrect Result
* Partial Progress : Debugging
* Partial Progress : Correct Results
* Partial Progress - Incorrect Results
* Partial Progress - Commenting Passthrough bypass logic
* Changing Passthrough to MultiplyMultiply
* Correct Results!
* Fix and debug the pass through feature
* Sample commit
* Correct Results : MultiplyMultiply
* Code Cleanup
* Removing Failed Instances
* Working code before Unary element support
* Custom Elementwise Function support and working implementation for Mul and Add
* Updating README
* Working for Passthrough
* Review Comments : Minor Fixes
* Review Comments : Minor Fixes
* Readme Updated
* Partial Changes after Rebase
* Working Code : Changes after Rebase
* Updating Jenkins file
* Removing default value changed while testing
* Configuration changes in config files
* Tile Handler changes in GEMM Multi D Tile Engine
* Tile Handler changes in GEMM Multi D Example
* Change log for Gemm Multi D in CK Tile Engine
* Configuration changes in config files
---------
Co-authored-by: ThomasNing <thomasning@amd.com>
2025-08-12 16:05:05 -07:00
Geo Min
30dafe8281
[TheRock CI] Adding TheRock CI gate check ( #2648 )
...
* Adding initial TheRock CI
* Adding composable kernel link
* Adding correct repo for rocm-libraries
* Adding entire rocm-libraries checkout
* Adding correct flag
* Adding correct flag for fetch sources
* Fixing git health
* Removing patch
* Removing patching
* Removing manual check
* PR comments
* testing without dist
* Removing test branch
* PR comments
* PR comments
* PR comment
* Adding test_runs_on
2025-08-12 14:13:01 -07:00
joyeamd
0856b3f4a2
[CK_TILE]fix ck_tile's moe_sorting example in gfx11 ( #2667 )
...
* fix ck_tile's moe_sorting example in gfx11
* fix clang format
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-08-12 12:33:56 -07:00
Illia Silin
bbf41b27f2
fix builds with mainline/staging compilers ( #2674 )
2025-08-12 10:23:08 -07:00
slippedJim
20288caa2f
remove bad pipeline codegen ( #2673 )
2025-08-13 00:23:40 +08:00
asleepzzz
5b39de4bb6
Revert "Optimize fmha fwd decode & prefill for gfx950 ( #2641 )" ( #2670 )
...
This reverts commit b7322a521a.
2025-08-12 20:27:10 +08:00
Haocong WANG
b7322a521a
Optimize fmha fwd decode & prefill for gfx950 ( #2641 )
...
* Fix for fwd/bwd kernel build filter
* fix bwd code
* save an example for __bf16 type
* temp save, waiting for debug
* tempsave, fmha_decode
* temp save, change all instance to 1wave
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* temp save
* tempsave
* compile pass
* tempsave, trload+asyncload done
* tempsave. asynccopy+trload sanity checked
* remove unnecessary features
* fix the lds alignment caused performance regression
* enable prefill overload operator().
* remove all lds bankconflict with xor layouts
* enable larger tile size; upgrade xor pattern
* upgrade prefill pipeline; simple iglp; consistent data produce and consume order
* small refactor
* Load Q through lds, implement xor;
* add vmcnt guard before load ktile
* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA
* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug
* add __restrict__ to tr load
* merge fa_decode pipeline into fmha_fwd api
* remove unnecessary files; rename some files
* Remove unnecessary changes
* bug fix, clang format;
* remove non-necessary change
* fix clangformat with 18.1.3
* fix bugs
* fix bug
* fix bug on non-gfx950
* fix bugs in gemm
* fix bug in pki4
* tempsave, update the blocksync functions
* change the warp setting for hdim32 fmha fwd
* clang format
* fix conflict. disable all v-col instance for fmha fwd
* Fix the bug
* clang format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-08-12 19:43:14 +08:00
Mateusz Ozga
c0c2ded566
fix ( #2668 )
2025-08-12 13:02:10 +02:00
Yi DING
8e1eb0c1ee
[CK_TILE] FMHA BWD Decode Pipeline ( #2643 )
...
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
2025-08-12 17:02:52 +08:00
Cameron Shinn
352f87e684
Fix num_byte calculations to use nhead_k for K & V size ( #2653 )
...
Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.
Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```
After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```
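The effect of the fix can be approximated with a small sketch. It assumes the reported byte count is the sum of the Q, K, V, and O tensor sizes; the real example may count additional tensors, and `FmhaShape` and `fmha_fwd_bytes` are hypothetical names, not functions from the example.

```cpp
#include <cstdint>

// Hypothetical shape description mirroring the command-line flags above:
// -b, -h, -h_k, -s, -s_k, -d (hdim_v assumed equal to hdim_q here).
struct FmhaShape
{
    std::int64_t batch;
    std::int64_t nhead_q;
    std::int64_t nhead_k;
    std::int64_t seqlen_q;
    std::int64_t seqlen_k;
    std::int64_t hdim_q;
    std::int64_t hdim_v;
};

// Bytes moved, assuming only Q, K, V, and O are counted.
// use_nhead_k == false reproduces the old behaviour (K and V sized with
// the query head count); true sizes K and V with nhead_k.
constexpr std::int64_t fmha_fwd_bytes(const FmhaShape& s,
                                      std::int64_t elem_bytes,
                                      bool use_nhead_k)
{
    const std::int64_t kv_heads = use_nhead_k ? s.nhead_k : s.nhead_q;
    const std::int64_t q = s.batch * s.nhead_q * s.seqlen_q * s.hdim_q;
    const std::int64_t k = s.batch * kv_heads * s.seqlen_k * s.hdim_q;
    const std::int64_t v = s.batch * kv_heads * s.seqlen_k * s.hdim_v;
    const std::int64_t o = s.batch * s.nhead_q * s.seqlen_q * s.hdim_v;
    return elem_bytes * (q + k + v + o);
}

// Shape from the runs above: b=2, h=32, h_k=8, s=1, s_k=32768, d=128, bf16.
inline constexpr FmhaShape kShape{2, 32, 8, 1, 32768, 128, 128};
inline constexpr std::int64_t kBytesBefore = fmha_fwd_bytes(kShape, 2, false);
inline constexpr std::int64_t kBytesAfter  = fmha_fwd_bytes(kShape, 2, true);
```

Under these assumptions the old count is about 1.07 GB and the corrected count about 0.27 GB, which is consistent with the reported drop from roughly 6200 GB/s to roughly 1640 GB/s at similar kernel times.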
2025-08-12 13:44:01 +08:00
Yi DING
4fde1646e5
[CK_TILE] FMHA BWD Optimization For GFX950 ( #2628 )
...
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
2025-08-12 11:11:55 +08:00
Aviral Goel
a7badc6ec5
feat(copy_kernel): add basic copy kernel example with beginner friendly documentation ( #2582 )
...
* feat(copy_kernel): add basic copy kernel example with documentation
* docs(CHANGELOG): Updated changelog
* chore: performed clang format
* Update example/ck_tile/39_copy/copy_basic.cpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/39_copy/README.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Update example/ck_tile/39_copy/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* fix(terminology): follow amd terms
* extract elementwise copy to a new kernel
* fix(copy_kernel): bug in verification
* add comments about vgpr usage
* lint and nits
* add notes and comments
* print hostTensor via stream
* print hostTensor via stream
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
2025-08-11 10:54:37 -07:00
Illia Silin
6bfef63414
enable aiter test_mha in daily CI ( #2659 )
2025-08-11 09:50:33 -07:00
Yashvardhan Agarwal
191c62967b
Fixes to "General 2D Reduction Kernel" ( #2535 ) ( #2656 )
...
* fix reduce2d
- revert the combine_partial_results() changes
- remove auto from function def
* clang-format
2025-08-11 15:01:33 +02:00
geozhai
1e1ee758fa
update CK build instruction step 4 ( #2563 )
...
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
2025-08-11 00:26:13 -04:00
Illia Silin
8613aa1e40
remove ck_tile transpose and gemm stages from CI ( #2646 )
2025-08-08 10:48:44 -07:00
Illia Silin
7ac850ac72
Add daily AITER tests on gfx942. ( #2639 )
...
* add option to select aiter branch, add tests on gfx942
2025-08-08 09:30:46 -07:00