Qianfeng Zhang
|
6cf17bc827
|
Miscellaneous small updates and corrections
|
2026-06-28 13:59:44 +00:00 |
|
Qianfeng Zhang
|
52eff34d21
|
Add scripts for testing hstu attention fwd with drop-out
|
2026-06-28 13:59:44 +00:00 |
|
Qianfeng Zhang
|
35234a492d
|
Fix in kernel and HstuBlockMask interface for kHasDropout == true
|
2026-06-28 13:59:43 +00:00 |
|
Qianfeng Zhang
|
83676066af
|
Fix in GetSmemSizeDropout()
|
2026-06-28 13:59:43 +00:00 |
|
Qianfeng Zhang
|
95d3ecfb14
|
Update for considering kHasDropout used with minfull_attn_seqlen > 0
|
2026-06-28 13:59:43 +00:00 |
|
Qianfeng Zhang
|
4250341a92
|
Fix identified issues for implementing kHasDropout == true
|
2026-06-28 13:59:43 +00:00 |
|
Qianfeng Zhang
|
c3df01a519
|
Add instances and enable dispatching for kHasDropout == true
|
2026-06-28 13:59:43 +00:00 |
|
Qianfeng Zhang
|
124f072158
|
Add p_drop parameter to example and drop-out logic to the reference code
|
2026-06-28 13:59:43 +00:00 |
|
Qianfeng Zhang
|
58fbfe766e
|
Update to kernel and host interface for generating random numbers
|
2026-06-28 13:59:43 +00:00 |
|
Qianfeng Zhang
|
8a4ec6382d
|
Add kernel and host interface for generating random numbers
|
2026-06-28 13:59:43 +00:00 |
|
Qianfeng Zhang
|
d5c872f504
|
Removing the class derivation to simplify struct HstuAttentionFwdCommonDropoutKargs
|
2026-06-24 09:03:39 +00:00 |
|
Qianfeng Zhang
|
0720bc48be
|
Fix using multiplies
|
2026-06-23 10:20:17 +00:00 |
|
Qianfeng Zhang
|
ad769baea4
|
Fix make_kernel() template parameters
|
2026-06-23 09:38:37 +00:00 |
|
Qianfeng Zhang
|
0b0684aff2
|
Revert "[ck_tile] Add get_partition_index_v2 which uses warp_id in vgpr and to be used by tile_windows on lds-based tensor_view"
This reverts commit 2b0f3791a6.
|
2026-06-23 09:31:31 +00:00 |
|
Qianfeng Zhang
|
1805c985a0
|
Update to example_hstu_attention_fwd.cpp
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
dc1d433351
|
Rename generate_instances.py to generate_fwd_instances.py
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
503a017c42
|
Remove the use of kSubQKHeaddim
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
a81f32331f
|
Add restriction on the relationship between HstuAttention<xxx>Problem and HstuAttention<xxx>TileSettingClass
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
79ebc7479d
|
Add static_assert() in HstuAttentionFwdTileSettingClass
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
7a67ae4dd3
|
Update to the comments in reference_hstu_attention_bwd.hpp
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
7d2e575fed
|
Fix the comments in reference_hstu_attention_fwd.hpp
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
d07b37c097
|
Remove unused element-wise functions passed through pipelines' operator() interfaces
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
cc7e216fa6
|
Rename GetKVBlockGemm to GetPVTBlockGemm
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
06547efd90
|
Remove the kHasBias==true instances to save building time
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
f41777fbd2
|
Renaming BUILD_HSTU_FOR_GFX95_ONLY to BUILD_HSTU_FOR_GFX95
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
295136e48b
|
Update the README.md according to the summary by claude code
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
a1ad9fc312
|
Fix the use of num_targets[] in run_group_hstu_attention
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
62627db768
|
Update to the comments in reference_hstu_attention_bwd.hpp
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
08873f0d50
|
Renaming in the dispatching codes and generate_instances.py scripts
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
c7de3af246
|
Split hstu_attention_util.hpp into host_util.hpp and kernel_util.hpp
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
89d6f5aa92
|
Remove unneeded includes from some hpp and cpp files
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
6f4f3eac48
|
Tiny update in generate_instances.py
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
673207ce59
|
Fix header file mapping bug in generate_instances.py
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
f73341de37
|
Some renaming in kernel and pipeline
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
42a3bfbab7
|
Update and fix for leaked changes, and make the scripts able to test/benchmark kStoreLSE cases
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
b17b41a1e6
|
Enable the kernel dispatching path from is_training & use_softmax to kStoreLSE
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
4414019296
|
Add instances and kStoreLSE template in dispatcher class to support outputting lse for fwd training
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
8f83a2841f
|
Set lse tensor dim strides in example
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
02e5c23f9c
|
Replace template kUseSoftmax/kStoreLSE by boolean parameters in reference fwd codes to save compiling time
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
bd126618d1
|
Add support for outputting lse in the example and reference hstu attention forward implementation
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
fa0a9c1656
|
Add support for preparing lse_dram_window in hstu fwd kernel
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
fd9af72c9f
|
Kernel use types declared in the problem rather than the pipeline
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
ee5bd0ebba
|
Tiny simplification in defining the Bias-related Kargs
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
f41b0176d3
|
Add parameters used for storing lse in the fwd and fwd_splitkv_combine kernels to prepare for supporting training
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
270d073c88
|
Move num_splits/o_acc_ptr/l_acc_ptr out of the HstuAttention<xxx>FwdParams struct
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
b2561b88e4
|
Add kStoreLSE template parameter to the problems
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
2a86bfb6f5
|
Implement host reference operator for hstu attention backward
|
2026-06-23 09:28:00 +00:00 |
|
Qianfeng Zhang
|
ec58f92f05
|
Rename the reference interfaces and the files
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
7d317adf37
|
Update MakeLSEaccDramTileDistribution to assign more threads to MThreadPerWarp so that block_tile_reduce_sync() works on fewer KThreadPerWarp
|
2026-06-23 09:27:59 +00:00 |
|
Qianfeng Zhang
|
30b5d7bd01
|
Use buffer_view to create lse_acc_dram_naive so that the out-of-boundary loading value can be specified (as -inf)
|
2026-06-23 09:27:59 +00:00 |
|