Illia Silin
c217c0fa93
Fix latest AITER failure and add more AITER tests in CK CI. ( #2782 )
...
* add aiter tests and move json_dump header
* remove example/include path from cmake
* extend time for aiter and pytorch stages
[ROCm/composable_kernel commit: ef6c28e989 ]
2025-09-04 13:44:00 -07:00
rahjain-amd
7674eb6416
Add json dump support to output details from CK/CKTile Examples. ( #2551 )
...
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding json in 03_gemm
* Add json dump to 16_batched_gemm
* Add json dump to gemm_multi_d_fp16
* Add json dump to grouped_gemm
* fix fmha_bwd/fwd
* Fix clang-format errors
exclude include/rapidjson in jenkins as it's a third-party library
* Separating function and definition.
* Update Documentation of 03_gemm
* Refactoring as per code review
* Disable fp8 instances on unsupported targets (#2592 )
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt
* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt
* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
---------
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com >
* fix clang format
* remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt
* Fixing Readme and unifying jsondumps
* adding moe_smoothquant
* adding fused_moe
* Fixing Readme for batched_gemm
* Fixing Readme for grouped_gemm
* adding flatmm
* adding gemm_multi_d_fp16
* adding elementwise
* adding File name when json is dumped
* Fixing Reduce after merge
* adding batched_transpose
* Adding Warptile in Gemm
* Fixing Clang Format
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com >
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com >
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com >
[ROCm/composable_kernel commit: 4d041837ad ]
2025-09-02 23:31:29 -07:00
asleepzzz
154b02423f
Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 ( #2741 )" ( #2757 )
...
This reverts commit 4eb18aab71ddb0cf7b63fb161424034615e2bdf5.
[ROCm/composable_kernel commit: 038ea82315 ]
2025-08-28 22:50:42 +08:00
Yi DING
8b537fb883
[CK_TILE] FMHA BWD Enable Tile 16x192 ( #2741 )
...
* 16x192
* Use buffer_load_lds for lse/d
* Dispatch & cleanup
* Avoid zeroing dq & fix
* fix
[ROCm/composable_kernel commit: ead4447b20 ]
2025-08-28 18:54:18 +08:00
Yi DING
d732e1e379
[CK_TILE] FMHA BWD Fix Compilation with Bias ( #2682 )
...
* [CK_TILE] FMHA BWD Fix Compilation with Bias
* Fix appendkv kApplyRoPE
[ROCm/composable_kernel commit: 4cfa2c7158 ]
2025-08-22 10:01:10 +08:00
Yi DING
0afd7af89c
[CK_TILE] FMHA BWD Decode Pipeline ( #2643 )
...
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
[ROCm/composable_kernel commit: 8e1eb0c1ee ]
2025-08-12 17:02:52 +08:00
rahjain-amd
85c6fd56c5
Fixing Debug build ( #2404 )
...
Failed to build `tile_example_fmha_bwd` due to below error
```
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare]
358 | assert(slopes.size() == nhead);
| ~~~~~~~~~~~~~ ^ ~~~~~
/usr/include/assert.h:103:27: note: expanded from macro 'assert'
103 | (static_cast <bool> (expr) \
| ^~~~
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:989:16: note: in instantiation of function template specialization 'run<FmhaBwdFp16>' requested here
989 | return run<FmhaBwdFp16>(arg_parser) ? 0 : -2;
| ^
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:358:30: error: comparison of integers of different signs: 'size_type' (aka 'unsigned long') and 'ck_tile::index_t' (aka 'int') [-Werror,-Wsign-compare]
358 | assert(slopes.size() == nhead);
| ~~~~~~~~~~~~~ ^ ~~~~~
/usr/include/assert.h:103:27: note: expanded from macro 'assert'
103 | (static_cast <bool> (expr) \
| ^~~~
/home/rahjain/src/composable_kernel/example/ck_tile/01_fmha/fmha_bwd.cpp:993:16: note: in instantiation of function template specialization 'run<FmhaBwdBf16>' requested here
993 | return run<FmhaBwdBf16>(arg_parser) ? 0 : -2;
| ^
2 errors generated when compiling for gfx942.
```
Fixed with proper cast
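A minimal sketch of the kind of cast that avoids this `-Wsign-compare` error, assuming `index_t` is an alias for `int` as the diagnostic reports. The helper name `slopes_match_nhead` is hypothetical and only illustrates the idea; it is not taken from the commit.

```cpp
#include <cstddef>
#include <vector>

// Assumption: mirrors 'ck_tile::index_t' (aka 'int') from the diagnostic above.
using index_t = int;

// Hypothetical helper: compares an unsigned container size with a signed
// head count without mixing signedness in the comparison itself.
inline bool slopes_match_nhead(const std::vector<float>& slopes, index_t nhead)
{
    // Reject negative counts first, then compare in the unsigned domain.
    return nhead >= 0 && slopes.size() == static_cast<std::size_t>(nhead);
}
```

The result can then be passed to `assert` in place of the raw `slopes.size() == nhead` comparison.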
[ROCm/composable_kernel commit: ad593c286f ]
2025-07-07 14:46:22 +05:30
Anton Gorenko
193c84af34
Improve fmha_bwd tests performance ( #2376 )
...
* Avoid passing indices (std::vector) by value to host tensor's operator()
Each access requires 2 allocations and copies of the vector.
* Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification
* Compute ds_hp_host_ref in parallel
This sequential ForEach is the slowest part of validation and it benefits
from parallel computation.
* Do not use ForEach for simple copy and conversion of large tensors
These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.
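The first and last points above can be sketched as follows. `ToyTensor` and `copy_same_shape` are hypothetical stand-ins written for illustration, not CK's host tensor API: the accessor takes the index vector by const reference so no copy is made per access, and tensors of identical shape are copied element by element over the flat storage instead of recomputing linear indices.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for a packed, row-major host tensor.
struct ToyTensor
{
    std::vector<std::size_t> lengths;
    std::vector<std::size_t> strides;
    std::vector<float> data;

    explicit ToyTensor(std::vector<std::size_t> lens)
        : lengths(std::move(lens)), strides(lengths.size(), 1)
    {
        // Packed row-major strides: the last dimension is contiguous.
        for(std::size_t i = lengths.size(); i-- > 1;)
            strides[i - 1] = strides[i] * lengths[i];

        std::size_t total = 1;
        for(std::size_t len : lengths)
            total *= len;
        data.assign(total, 0.0f);
    }

    // Indices are taken by const reference, so the accessor itself
    // does not allocate or copy a vector on each call.
    float& operator()(const std::vector<std::size_t>& idx)
    {
        const std::size_t offset =
            std::inner_product(idx.begin(), idx.end(), strides.begin(), std::size_t{0});
        return data[offset];
    }
};

// Same-shape tensors share the same layout here, so a flat copy is enough.
inline void copy_same_shape(const ToyTensor& src, ToyTensor& dst)
{
    std::copy(src.data.begin(), src.data.end(), dst.data.begin());
}
```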
[ROCm/composable_kernel commit: 77123600ee ]
2025-06-24 07:45:24 -07:00
rocking
6e778cc529
[CK TILE] Use config name instead of data type in FmhaFwdTypeConfig<config> ( #1731 )
...
* Add data type config, Prepare to add mix precision in the future
* Fix compile error
[ROCm/composable_kernel commit: 94ae7113bd ]
2024-12-10 11:36:18 +08:00
kylasa
347dd2aa7b
Adding seed and offset pointer support to the philox random number generator. ( #1523 )
...
* Adding seed and offset pointer support to the philox random number generator.
* Separating seed and offset pointer checks with different condition statements.
* Add support for device seed and offset pointers; a union stores either the seed/offset values or the device pointers to minimize device SGPRs.
* Correcting a typo in the readme file
* Re-format files using remod.py
* Use STL type for API parameters
* Use simpler struct design for drop_seed & drop_offset
* Undo unnecessary changes
* Sync kargs style for fmha_fwd.hpp/.cpp
* Use templated union to reduce code
* Use structured binding to make code more readable
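A rough sketch of the value-or-pointer idea described in the bullets above, assuming C++17. All names here (`ValueOrPointer`, `DropSeedOffset`, `resolve_seed_offset`) are hypothetical and chosen for illustration; the actual kernel-argument layout in the commit may differ, and plain host pointers stand in for device pointers.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical templated union: holds either an immediate value or a
// pointer to where the value lives, sharing the same storage.
template <typename T>
union ValueOrPointer
{
    T val;
    const T* ptr;
};

// Hypothetical argument struct: one flag selects how both fields are read.
struct DropSeedOffset
{
    ValueOrPointer<std::uint64_t> seed;
    ValueOrPointer<std::uint64_t> offset;
    bool is_ptr;
};

// Returns {seed, offset}, dereferencing only when pointers were supplied.
inline std::pair<std::uint64_t, std::uint64_t> resolve_seed_offset(const DropSeedOffset& args)
{
    if(args.is_ptr)
        return {*args.seed.ptr, *args.offset.ptr};
    return {args.seed.val, args.offset.val};
}
```

A caller can unpack the result with a structured binding, for example `auto [seed, offset] = resolve_seed_offset(args);`.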
---------
Co-authored-by: Sudhir Kylasa <sukylasa@amd.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
[ROCm/composable_kernel commit: c24fae2346 ]
2024-10-05 02:48:47 +08:00
Dan Yao
c9be86c120
[CK_TILE] Fix compiler related FA bwd issues ( #1530 )
...
* add barriers
* tail bias barriers
* adjust bf16/hd256 tol
* continue adjust bf16/hd256 tol
[ROCm/composable_kernel commit: 9d69a099a4 ]
2024-09-26 12:18:39 -07:00
Dan Yao
14402bb211
[CK_TILE] FA bwd kernels optimization ( #1397 )
...
* tmp save
* fix batch deterministic bugs
* fix group deterministic bugs
* codegen update
* reorder files
* bias support
* hd256 bias support
* bwd smoke test update
* simplify convert dq
* fix hd256 dropout scratch
* do{}while() -> while(){}
* comments
* remove FmhaBwdTilePartitioner
* save clear_tile
* refactor dropout
* code cleanup
* code cleanup
* comments
* fix epilogue problem
* fix fwd dropout
* group convert_dq opt
* fix dq alignment
* Do not store storerandval in bwd for flash attention integration
* fix hd32 error and boost performance
* revert
* Remove duplicated WarpGemm definitions in the policy file
* dropout patch for mrepeat 16*16
* code sync up
* dq_acc stride
* dq_acc stride stuff
* codegen update
* fwd dropout revert
* fix hd128 scratches and boost performance
* receipt 3 for simplified smoke test
* more strides for fa integration
* fix hd64 scratches and boost performance
* non-iglp pipeline for headdim padding cases
* dpad same as dvpad for flash attention integration
* unpadded lse&d for group mode
* Support unpad layout for group lse
* Support unpad lse layout for splitkv
* Fix stride for splitkv kernel
* fix unpadded lse issue in fwd splitkv
* comment
* solve lds read&write conflicts
* rename
* bias rename
* tile index revert
---------
Co-authored-by: danyao12 <danyao12>
Co-authored-by: rocking <ChunYu.Lai@amd.com >
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com >
[ROCm/composable_kernel commit: 79a5d9c10c ]
2024-08-16 13:40:10 -07:00
Dan Yao
26840b623a
CK Tile FA Training kernels ( #1286 )
...
* FA fwd dropout
* FA bwd
* epilogue reuse
* CMakeLists update
* [CK_TILE] support alibi (#1269 )
* add alibi support
* fix code
* update code based on comment
* Support more hdim
* fix fp8 bias
* support seqlen_k=0 case
* remove unused printf
* fix format
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com >
* now fwd/bwd can build
* bwd alibi
* add bwd validation stream_config
* update generated filenames
* update bwd kernel launch
* CK_TILE_HOST_DEVICE in philox
* Transpose -> transpose
* format
* format
* format
* Generate the instance for FA required
* format
* fix error in WarpGemm
---------
Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com >
Co-authored-by: rocking <ChunYu.Lai@amd.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
Co-authored-by: Jing Zhang <jizhan@amd.com >
[ROCm/composable_kernel commit: 2cab8d39e3 ]
2024-06-04 13:12:45 -05:00