Commit Graph

1035 Commits

Author SHA1 Message Date
Ville Pietilä
558054eadb WIP: Simplify conv to gemm transformations and handle K > 1 and C > 1 cases. 2025-09-26 13:38:24 +00:00
Ville Pietilä
8babf7195a Fix strides in 1D conv to gemm transformation. 2025-09-26 09:38:11 +00:00
Ville Pietilä
354dd5039c Add compile check for assumed row-major layout. 2025-09-26 08:39:39 +00:00
Ville Pietilä
1764c77fb2 Enable running multiple GEMM batches of merged conv groups. 2025-09-26 07:51:29 +00:00
Ville Pietilä
b864c077ed Code clean-up for bwd tensor transformations. 2025-09-25 15:09:08 +00:00
Ville Pietilä
0ea3268d5d Remove debug and other dead code. 2025-09-25 09:41:33 +00:00
Ville Pietilä
cc7433efc6 Add more comments, disable debug code. 2025-09-25 09:37:15 +00:00
Ville Pietilä
97f842f2c6 Fully functional LDS to global mem transfer using tensor descriptor and tile distribution encoding. 2025-09-25 09:30:50 +00:00
Ville Pietilä
625a78b17b WIP: LDS to global mem transfer using CK tile tensor descriptor and tile distribution encoding. 2025-09-24 15:08:01 +00:00
Ville Pietilä
7280df1bc3 Add one more unit test for tensor view. 2025-09-24 12:10:26 +00:00
Ville Pietilä
8048d6ff73 Fix build. 2025-09-23 11:17:08 +00:00
Ville Pietilä
e6f6c4a6a3 Working baseline for depthwise convolution with merged conv groups. 2025-09-23 11:14:10 +00:00
Ville Pietilä
29e3112b9b Epilogue fixes. 2025-09-22 15:38:02 +00:00
Ville Pietilä
d7da3d5089 Offset fixes. 2025-09-22 15:37:46 +00:00
Ville Pietilä
7dfbac5d0b WIP: Separate epilogue for merged conv groups. 2025-09-19 13:52:33 +00:00
Ville Pietilä
af6838e5dc Integration test for CShuffle epilogue. 2025-09-19 12:09:08 +00:00
Ville Pietilä
7f52f84167 Fix tile window size for c block. 2025-09-19 08:08:19 +00:00
Ville Pietilä
6bcdb0947e LDS to global memory copy. 2025-09-18 14:59:32 +00:00
Ville Pietilä
0e09504057 WIP: merged conv groups GEMM epilogue changes. 2025-09-17 14:25:02 +00:00
Ville Pietilä
27a2ceb4f7 Increase the max number of reported errors. 2025-09-17 12:29:12 +00:00
Ville Pietilä
4ec81cb95c Add more logging. 2025-09-17 12:27:51 +00:00
Ville Pietilä
6d318ab481 Enable running multiple conv groups per batch. 2025-09-12 14:03:04 +00:00
Ville Pietilä
0d5c1b9638 WIP: Merged conv groups epilogue. 2025-09-11 15:24:36 +00:00
Ville Pietilä
970b40aa6c WIP: Merged conv groups offset calculation. 2025-09-09 11:33:31 +00:00
Ville Pietilä
d9f0a9cdd0 Fully working conv group merging for TransformConvBwdWeightToGemm. 2025-09-09 09:58:43 +00:00
Ville Pietilä
8845b23254 WIP: Tensor transformations. 2025-09-08 15:41:54 +00:00
Ville Pietilä
61b3c96273 Add number of groups to merge to ck tile grouped gemm example. 2025-09-04 14:24:23 +00:00
Ville Pietilä
2b1908a375 Fix compilation of the grouped conv examples. 2025-09-04 12:01:49 +00:00
linqunAMD
e2d28a92af Extend XDL kernel to Support RDNA3/4 - Part 2 (#2722)
Update Blockwise and Gridwise files to support both wave32 & wave64.

1. Calculate WaveSize from the template parameter instead of hard-coding it to 64; some literal "64" values are also replaced with WaveSize.
2. Move BN0Shuffled and BK0Shuffled to the device side, because the correct mfma instruction info is not available on the host side.
3. Update b_thread_offset_n and b_thread_offset_k in gridwise_gemm_xdl_cshuffle_v3_b_scale.hpp for gfx11. On gfx11, input data is duplicated for each 16 threads, unlike all other targets.
4. Modify a1_threadwise_copy in gridwise_batched_*gemm*gemm for gfx11. On gfx11, we need to duplicate the input and swizzle A if transposeC isn't enabled.
2025-09-04 08:33:40 +08:00
arai713
0282d98412 [CK TILE] Stream-K tile partitioner (#2708)
* initial commit for skeleton code

* replaced skeleton code with old streamk b2c map functions from old CK, still need to clean up the code

* fixed up code to match CK Tile convention: data type changes, naming changes, etc.

* change for num_sk_blocks data type

* formatting fix

* minor fixes

* moved reduction argument to template

* resolved comments from PR review: standardizing naming, pruning unneeded code

* resolve errors from merge of device op PR: moved enum to common file

* switching to uint32_t due to implementation constraints: divmod only takes uint32_t and mixing signed and unsigned types causes problems

* unsigned type fix

* add const qualifier

* added documentation for template parameters

* documentation edit
2025-09-03 13:38:17 -07:00
msaffari-amd
47d020a993 refactor: use snake_case naming in ck_tile/core components (#2766) 2025-09-03 09:34:11 +02:00
rahjain-amd
4d041837ad Add json dump support to output details from CK/CKTile Examples. (#2551)
* Adding RapidJson Library

* Adding Json Dumps in all CK_Tile Examples

Not verified yet

* Adding json to cktile Batched Transpose

* adding json dumps to layernorm2d_fwd

* Adding  json dump to flatmm_basic

* Adding json in 03_gemm

* Add json dump to 16_batched_gemm

* Add json dump to gemm_multi_d_fp16

* Add json dump to grouped_gemm

* fix fmha_bwd/fwd

* Fix clang-format errors

exclude include/rapidjson in jenkins as it's a third-party library

* Separating function and definition.

* Update Documentation of 03_gemm

* Refactoring as per code review

* Disable fp8 instances on unsupported targets (#2592)

* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt

* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt

* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt

---------

Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>

* fix clang format

* remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt

* Fixing Readme and unifying jsondumps

* adding moe_smoothquant

* adding fused_moe

* Fixing Readme for batched_gemm

* Fixing Readme for grouped_gemm

* adding flatmm

* adding gemm_multi_d_fp16

* adding elementwise

* adding File name when json is dumped

* Fixing Reduce after merge

* adding batched_transpose

* Adding Warptile in Gemm

* Fixing Clang Format

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
2025-09-02 23:31:29 -07:00
Cong Ma
e1ab460d2d [CK TILE GEMM] Fix building issues (#2772)
- Add `WarpGemmMfma_f32_16x16x128_[fp8|bf8]_[fp8|bf8]_CTransposed`
- Replace `__gfx950__` with `CK_GFX950_SUPPORT`
2025-09-02 22:40:18 -07:00
linqunAMD
00fd72b2d4 Fix a typo in intrin_wmma_bf16_16x16x16_bf16_w32 (#2727)
__builtin_amdgcn_wmma_bf16_16x16x16_bf16_w32 is only available in gfx11.
2025-09-03 08:07:09 +08:00
Po Yen Chen
9f35cde374 [CK_TILE] Fix fmha_fwd_v3() Default2DEpilogue usage (#2765)
* Fix Default2DEpilogue usage

* Fix Default2DEpilogue usage for batch_prefill
2025-09-02 09:51:56 -07:00
Sami Remes
4419fc34a2 Fix formatting problem (#2768) 2025-09-02 14:14:10 +03:00
Michael Mcminn
022f369deb Adding fix for the gfx908 to the GEMM MFMA implementations of WarpGem… (#2751)
* Adding fix for the gfx908 to the GEMM MFMA implementations of WarpGemmMfmaBf16Bf16F32M4N64K16 WarpGemmMfmaBf16Bf16F32M64N4K16

* Adding support for offload target gfx9-4-generic

* This duplication here isn't ideal
2025-09-02 10:35:07 +02:00
Haocong WANG
33418b201f Fix naming issue (#2762) 2025-09-02 11:18:53 +08:00
Po Yen Chen
d876e87fe4 [CK_TILE] Add FAv3 fwd pipeline (#2731)
* Add FAv3 fwd pipeline

* Unpack v_pk_mul to hide v_mov

* Avoid compiler moving l compute across phase

* Sync sched_group_barrier() setting for masking cases
2025-09-01 09:16:45 +08:00
Aviral Goel
fcff0043ae chore(gemm): clang format to pass CI (#2758) 2025-08-29 00:38:46 -07:00
Vijay Krish
4208e28988 ck_tile kernel for gemm with groupwise quantized B tensor. (#2663)
* This change introduces new pipelines with Intrawave scheduler and block gemm primitives that load the scale tensor into registers to perform dequantization post MFMA on the C tensor in registers.

Scale tensor data, BQ is spliced across threads in registers and not stored in LDS.

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.

* Solve merge conflict

* [CK TILE] Update CHANGELOG.md

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2025-08-28 23:43:02 -07:00
Cong Ma
428090f749 Support transposed C tile in Aquant (#2679)
The performance of Aquant has increased after enabling transposed C.

AQ elements no longer need to be exchanged among lanes after enabling
transposed C, as each thread only holds data from one row.
2025-08-28 13:28:09 -07:00
Mateusz Ozga
0758883fa4 [CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D (#2752)
* Init commit

* Quick fix, CI fails

* Remove CDElementWise

* Add CDEELementWise

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-08-28 12:45:50 -07:00
asleepzzz
038ea82315 Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757)
This reverts commit ead4447b20.
2025-08-28 22:50:42 +08:00
linqunAMD
4a49dac7c6 [Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728)
* fix copy basic build error

* fix other ck tile test build error
2025-08-28 20:30:30 +08:00
Yi DING
ead4447b20 [CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)
* 16x192

* Use buffer_load_lds for lse/d

* Dispatch & cleanup

* Avoid zeroing dq & fix

* fix
2025-08-28 18:54:18 +08:00
Linjun-AMD
bf7b458e6e use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline (#2711)
* add k_lds padding and iglp to improve dim256 fmha fwd

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update  block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp

* clang format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* use same naming style

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-28 11:39:39 +08:00
Aviral Goel
f5f795c4d6 feat(HostTensor): Extend support for HostTensor class' >> operator to print more data types (#2691)
* feat(check_err): add a variable to adjust number of incorrect values to print

* feat(host_tensor): add printing capability for fp8 bf8 int8 int4

* fix(gemm_utils): update acceptable data type

* fix(host_tensor): print both 4 bit ints in pk_int4_t

* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp

* feat(host_tensor): add print first n elements functions
2025-08-27 18:17:24 -07:00
Cong Ma
cd53e2e57e [CK TILE GEMM] Fix a merge conflict (#2753)
* Fixed a merge conflict in 245467f3
* Format the code
2025-08-27 11:08:09 -07:00
Bartłomiej Kocot
cfe5e448db Fix splitk autodeduce for grouped conv bwd weight (#2742) 2025-08-27 12:35:42 +02:00