asleepzzz
038ea82315
Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 ( #2741 )" ( #2757 )
...
This reverts commit ead4447b20 .
2025-08-28 22:50:42 +08:00
Yi DING
ead4447b20
[CK_TILE] FMHA BWD Enable Tile 16x192 ( #2741 )
...
* 16x192
* Use buffer_load_lds for lse/d
* Dispatch & cleanup
* Avoid zeroing dq & fix
* fix
2025-08-28 18:54:18 +08:00
Yi DING
8e1eb0c1ee
[CK_TILE] FMHA BWD Decode Pipeline ( #2643 )
...
* Fix distr
* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr
* decode 16x16 o2
2025-08-12 17:02:52 +08:00
Yi DING
4fde1646e5
[CK_TILE] FMHA BWD Optimization For GFX950 ( #2628 )
...
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window
* simply duplicate
* trload pipeline
* Try two-stage
* add prefetch
* optimize & iglp
2025-08-12 11:11:55 +08:00
Yi DING
b0a97498b0
[CK_TILE] FMHA BWD Remove Unnecessary Padding ( #2550 )
...
* Remove unnecessary pssk
* Add BlockFmhaBwdDQDKDVPipeline wrapper
* Resolve copilot comments & Remove kpad & fix
* Remove spad
2025-08-07 21:24:43 +08:00
slippedJim
a9bcd3c98d
make fmha bwd api template for v2 & v3 ( #1918 )
...
* use template fmha_bwd function
* update
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2025-02-27 19:26:19 +08:00
rocking
94ae7113bd
[CK TILE] Use config name instead of data type in FmhaFwdTypeConfig<config> ( #1731 )
...
* Add data type config, Prepare to add mix precision in the future
* Fix compile error
2024-12-10 11:36:18 +08:00
Po Yen Chen
645fe812f6
[CK_TILE] Fix fMHA fwd MakeKargs() compilation errors ( #1689 )
...
* Fix mis-matched tuple<> elem types
* Rename MakeKargs() as MakeKargsImpl()
---------
Co-authored-by: Qianfeng <qianfeng.zhang@amd.com >
2024-11-25 15:30:35 +08:00
kylasa
c24fae2346
Adding seed and offset pointer support to the philox random number generator. ( #1523 )
...
* Adding seed and offset pointer support to the philox random number generator.
* Separating seed and offset pointer checks with different condition statements.
* Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs.
* Correcting a typo in the readme file
* Re-format files using remod.py
* Use STL type for API parameters
* Use simpler struct design for drop_seed & drop_offset
* Undo unnecessary changes
* Sync kargs style for fmha_fwd.hpp/.cpp
* Use templated union to reduce code
* Use structured binding to make code more readable
---------
Co-authored-by: Sudhir Kylasa <sukylasa@amd.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
2024-10-05 02:48:47 +08:00
Dan Yao
79a5d9c10c
[CK_TILE] FA bwd kernels optimization ( #1397 )
...
* tmp save
* fix batch deterministic bugs
* fix group deterministic bugs
* codegen update
* reorder files
* bias support
* hd256 bias support
* bwd smoke test update
* simplify convert dq
* fix hd256 dropout scratch
* do{}while() -> while(){}
* comments
* remove FmhaBwdTilePartitioner
* save clear_tile
* refactor dropout
* code cleanup
* code cleanup
* comments
* fix epilogue problem
* fix fwd dropout
* group convert_dq opt
* fix dq alignment
* Do not store storerandval in bwd for flash attention integration
* fix hd32 error and boost performance
* revert
* Remove duplicated WarpGemm definitions in the policy file
* dropout patch for mrepeat 16*16
* code sync up
* dq_acc stride
* dq_acc stride stuff
* codegen update
* fwd dropout revert
* fix hd128 scratches and boost performance
* receipt 3 for simplified smoke test
* more strides for fa integration
* fix hd64 scratches and boost performance
* non-iglp pipeline for headdim padding cases
* dpad same as dvpad for flash attention integration
* unpadded lse&d for group mode
* Support unpad layout for group lse
* Support unpad lse layout for splitkv
* Fix stride for splitkv kernel
* fix unpadded lse issue in fwd splitkv
* comment
* solve lds read&write conflicts
* rename
* bias rename
* tile index revert
---------
Co-authored-by: danyao12 <danyao12>
Co-authored-by: rocking <ChunYu.Lai@amd.com >
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com >
2024-08-16 13:40:10 -07:00
Dan Yao
2cab8d39e3
CK Tile FA Training kernels ( #1286 )
...
* FA fwd dropout
* FA bwd
* epilogue reuse
* CMakeLists update
* [CK_TILE] support alibi (#1269 )
* add alibi support
* fix code
* update code based on comment
* Support more hdim
* fix fp8 bias
* support seqlen_k=0 case
* remove unused printf
* fix format
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com >
* now fwd/bwd can build
* bwd alibi
* add bwd validation stream_config
* update generated filenames
* update bwd kernel launch
* CK_TILE_HOST_DEVICE in philox
* Transpose -> transpose
* format
* format
* format
* Generate the instance for FA required
* format
* fix error in WarpGemm
---------
Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com >
Co-authored-by: rocking <ChunYu.Lai@amd.com >
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com >
Co-authored-by: Jing Zhang <jizhan@amd.com >
2024-06-04 13:12:45 -05:00