ROCm/composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-20 21:09:08 +00:00

Qianfeng Zhang f1f4e249a6 Adjust the v_tile and k_tile loading location

2025-04-28 09:25:09 +00:00

..

Fix in generate_instances.py and re-generated the instances

2025-04-23 10:30:40 +00:00

bench_batched_causal.sh

Rename the performance measurement scripts

2025-04-25 06:09:17 +00:00

bench_jagged_causal_local.sh

Rename the performance measurement scripts

2025-04-25 06:09:17 +00:00

bench_jagged_causal.sh

Add scripts for measuring jagged with/no causal cases

2025-04-25 15:59:51 +00:00

benchmark_hstu_attention.sh

Fix in calculation of total_flops and update benchmark scripts

2025-04-13 08:50:00 +00:00

CMakeLists.txt

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

example_hstu_attention.cpp

Fix the integer overflow in total_flops calculation

2025-04-17 10:34:13 +00:00

generate_instances.py

Fix in generate_instances.py and re-generated the instances

2025-04-23 10:30:40 +00:00

hstu_attention_batched_forward_bf16.cpp

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

hstu_attention_batched_forward_dispatch.hpp

Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations

2025-04-15 14:40:55 +00:00

hstu_attention_batched_forward_fp16.cpp

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

hstu_attention_bool_switch.hpp

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

hstu_attention_fwd_kernel.hpp

Update the GridSize() and GetTileIndex() in hstu kernel

2025-04-26 10:01:23 +00:00

hstu_attention_fwd_pipeline_default_policy.hpp

Put two gemms call inside one n0loop unroll

2025-04-28 06:41:37 +00:00

hstu_attention_fwd_pipeline.hpp

Adjust the v_tile and k_tile loading location

2025-04-28 09:25:09 +00:00

hstu_attention_fwd_setting.hpp

Use 16x16x16 WarpGemm

2025-04-24 08:14:09 +00:00

hstu_attention_fwd_type_config.hpp

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

hstu_attention_hdim_switch.hpp

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

hstu_attention_jagged_forward_bf16.cpp

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

hstu_attention_jagged_forward_dispatch.hpp

Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations

2025-04-15 14:40:55 +00:00

hstu_attention_jagged_forward_fp16.cpp

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

hstu_attention_params.hpp

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

hstu_attention_pipeline_problem.hpp

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

hstu_attention_traits.hpp

Add hstu attention kernel implementation, instances and interfaces (building succeeded)

2025-04-03 08:20:54 +00:00

hstu_block_masking.hpp

Add IsFullTileInsideMask() to avoid pixel-by-pixel checking when kUseCausl=true but kUseLocal=false

2025-04-27 09:31:38 +00:00

README.md

fix the jagged mode tensor access in reference_hstu_attention

2025-03-29 12:55:40 +00:00

reference_hstu_attention.hpp

Split HstuBlockMasking into HstuBlockMaskWithLocal and HstuBlockMaskNoLocal to save vgprs for non-local situations

2025-04-15 14:40:55 +00:00

test_hstu_attention.sh

Update to the scripts and error thresholds

2025-04-09 10:34:37 +00:00

README.md

HSTU attention operator

The HSTU attention operator takes tensors q: [batches, seqlen, nhead, hdim_qk], k: [batches, seqlen, nhead, hdim_qk], v: [batches, seqlen, nhead, hdim_v], and some parameters that define the functional mask as inputs, and does the following:

Multiply q: [batches, seqlen, nhead, hdim_qk] with k: [batches, seqlen, nhead, hdim_qk] to get temporary tensor s: [batches, nhead, seqlen, seqlen]
Update s by filtering its values according to a special functional mask, which combines the logic of a lower-triangular causal mask, a diagonal-window causal mask, and a sequence mask
Apply element-wise SiLU on the lower seqlen dimension of s to get temporary tensor p: [batches, nhead, seqlen, seqlen]
Multiply p: [batches, nhead, seqlen, seqlen] with v: [batches, seqlen, nhead, hdim_v] to get final output o: [batches, seqlen, nhead, hdim_v]
Jagged inputs are also supported, where each batch has its own seqlen defined by sequence_offsets[]
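
The steps above can be sketched on the CPU for a single batch and a single head. This is only an illustrative sketch, not the code in reference_hstu_attention.hpp: the function name `hstu_attention_sketch` is hypothetical, only a plain lower-triangular causal mask is modelled (the local window, targets, context_len and minfull_len logic is left out), no scaling factor is applied, and a masked position is assumed to contribute 0 to the output.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)); note that SiLU(0) == 0.
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Hypothetical helper for ONE batch and ONE head.
// q, k: row-major [seqlen, hdim_qk]; v: row-major [seqlen, hdim_v].
// Returns o: row-major [seqlen, hdim_v].
// Assumption: only a lower-triangular causal mask is modelled, and a
// masked position contributes 0 to the output.
std::vector<float> hstu_attention_sketch(const std::vector<float>& q,
                                         const std::vector<float>& k,
                                         const std::vector<float>& v,
                                         std::size_t seqlen,
                                         std::size_t hdim_qk,
                                         std::size_t hdim_v,
                                         bool causal)
{
    assert(q.size() == seqlen * hdim_qk && k.size() == seqlen * hdim_qk);
    assert(v.size() == seqlen * hdim_v);

    std::vector<float> o(seqlen * hdim_v, 0.0f);
    for(std::size_t i = 0; i < seqlen; ++i)
    {
        for(std::size_t j = 0; j < seqlen; ++j)
        {
            // Step 2: skip positions removed by the (simplified) mask.
            if(causal && j > i)
                continue;

            // Step 1: s[i][j] = dot(q[i], k[j]).
            float s = 0.0f;
            for(std::size_t d = 0; d < hdim_qk; ++d)
                s += q[i * hdim_qk + d] * k[j * hdim_qk + d];

            // Step 3: p[i][j] = SiLU(s[i][j]).
            const float p = silu(s);

            // Step 4: o[i] += p[i][j] * v[j].
            for(std::size_t d = 0; d < hdim_v; ++d)
                o[i * hdim_v + d] += p * v[j * hdim_v + d];
        }
    }
    return o;
}
```

Because SiLU is applied element by element, each (i, j) pair can be added to the output on its own, without any normalization across the row.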

implementation

The operator is implemented using a fused kernel in the example:

Tensor S and Tensor P exist only in VGPRs as per-workgroup tiles, so no global memory access is needed for them
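
The fusion idea can be illustrated on the CPU: the output is accumulated tile by tile, and only a small S/P tile ever exists. This is a sketch under assumptions, not the kernel's pipeline: `hstu_attention_tiled_sketch`, `tile_m` and `tile_n` are hypothetical names, the small `p_tile` buffer stands in for the VGPR-resident tile, and only a plain causal mask is applied.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One head of one batch. q, k: row-major [seqlen, hdim_qk]; v: row-major [seqlen, hdim_v].
// tile_m / tile_n are hypothetical tile sizes, not the kernel's real settings.
std::vector<float> hstu_attention_tiled_sketch(const std::vector<float>& q,
                                               const std::vector<float>& k,
                                               const std::vector<float>& v,
                                               std::size_t seqlen,
                                               std::size_t hdim_qk,
                                               std::size_t hdim_v,
                                               std::size_t tile_m,
                                               std::size_t tile_n)
{
    std::vector<float> o(seqlen * hdim_v, 0.0f);
    // Stands in for the on-chip S/P tile; the full [seqlen, seqlen] tensor is never stored.
    std::vector<float> p_tile(tile_m * tile_n, 0.0f);

    for(std::size_t m0 = 0; m0 < seqlen; m0 += tile_m) // one query tile per "workgroup"
    {
        const std::size_t mlen = std::min(tile_m, seqlen - m0);
        for(std::size_t n0 = 0; n0 < seqlen; n0 += tile_n) // loop over key/value tiles
        {
            // Every column of this and all later tiles is above the diagonal: fully masked.
            if(n0 > m0 + mlen - 1)
                break;
            const std::size_t nlen = std::min(tile_n, seqlen - n0);

            // First GEMM, causal mask and SiLU, written into the small tile only.
            for(std::size_t i = 0; i < mlen; ++i)
                for(std::size_t j = 0; j < nlen; ++j)
                {
                    const std::size_t row = m0 + i, col = n0 + j;
                    float p = 0.0f; // masked positions contribute 0 (assumption)
                    if(col <= row)
                    {
                        float s = 0.0f;
                        for(std::size_t d = 0; d < hdim_qk; ++d)
                            s += q[row * hdim_qk + d] * k[col * hdim_qk + d];
                        p = s / (1.0f + std::exp(-s)); // SiLU
                    }
                    p_tile[i * tile_n + j] = p;
                }

            // Second GEMM: accumulate this tile's contribution into the output rows.
            for(std::size_t i = 0; i < mlen; ++i)
                for(std::size_t j = 0; j < nlen; ++j)
                    for(std::size_t d = 0; d < hdim_v; ++d)
                        o[(m0 + i) * hdim_v + d] += p_tile[i * tile_n + j] * v[(n0 + j) * hdim_v + d];
        }
    }
    return o;
}
```

Since SiLU needs no row-wise normalization, the per-tile contributions can simply be summed, and the result should not depend on the chosen tile sizes beyond floating-point rounding.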

build

#> mkdir build
#> cd build
#> ../script/cmake-ck-dev.sh .. gfx942              ; use #> rocminfo |grep "gfx"   to check your gpu arch
#> make -j tile_example_hstu_attention

test/verify

#>  build/bin/tile_example_hstu_attention -v=1 -prec=bf16 -b=10 -jagged=1 -nhead=4 -hdim_qk=128 -hdim_v=128 -seqlen=750,730,733,860,870,788,760,821,833,779 -targets=5,5,6,6,5,6,5,6,4,6 \
    -causal=1 -local_len=5 -context_len=6 -minfull_len=6
#>  . example/ck_tile/07_hstu_attention/test_hstu_attention.sh
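
In jagged mode the command above passes one -seqlen value per batch. A plausible way to turn such a list into sequence_offsets[] is a prefix sum, sketched below. The helper names are hypothetical, and the layout (batches + 1 entries, starting at 0) is an assumption rather than something the example is confirmed to use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: sequence_offsets has batches + 1 entries, offsets[0] == 0, and
// batch b owns the token rows [offsets[b], offsets[b + 1]).
std::vector<int> make_sequence_offsets(const std::vector<int>& seqlens)
{
    std::vector<int> offsets(seqlens.size() + 1, 0);
    for(std::size_t b = 0; b < seqlens.size(); ++b)
        offsets[b + 1] = offsets[b] + seqlens[b];
    return offsets;
}

// Recover the per-batch sequence length from the offsets.
inline int seqlen_of_batch(const std::vector<int>& offsets, std::size_t b)
{
    assert(b + 1 < offsets.size());
    return offsets[b + 1] - offsets[b];
}
```

With the ten lengths from the command above, the last offset is the total number of tokens across all batches, i.e. the sum of the ten lengths.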

See the example file example_hstu_attention.cpp for the command-line arguments, which are defined as follows:

  arg_parser.insert("v", "1", "whether to do CPU validation or not")
      .insert("prec", "fp16", "data type. fp16/bf16")
      .insert("jagged", "0", "q/k/v batched sequence is jagged or not")
      .insert("b", "12", "batch size")
      .insert("nhead", "4", "number of heads")
      .insert("hdim_qk", "64", "headdim size of Q/K")
      .insert("hdim_v", "64", "headdim size of V/O")
      .insert("seqlen", "400", "seqlen of single or all batches for query and key/value tensor")
      .insert("targets", "16", "sequence length at the end of query/key token sequence that should be excluded from attention")
      .insert("causal", "1", "enable causal mask or not")
      .insert("local_len", "5", "length of the diagonal window for enabling masking, value 0 to disable")
      .insert("context_len", "6", "sequence length at the beginning of the query sequence that should be included for attention")
      .insert("minfull_len", "6", "sequence length at the end of the query sequence that should be included for attention")
      .insert("seed", "13579", "seed used by the uniform or normal distribution generator")
      .insert("perf", "0", "whether to measure execution time or not");
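
The seqlen and targets options accept either one value or a comma-separated list with one value per batch. The sketch below shows one way such a value could be expanded to per-batch integers. `parse_int_list` and `expand_per_batch` are hypothetical helpers written for illustration, and the actual handling in example_hstu_attention.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split "750,730,733" into {750, 730, 733}; empty fields are skipped.
std::vector<int> parse_int_list(const std::string& text)
{
    std::vector<int> values;
    std::stringstream ss(text);
    std::string item;
    while(std::getline(ss, item, ','))
        if(!item.empty())
            values.push_back(std::stoi(item));
    return values;
}

// Assumption: a single value applies to every batch, otherwise the list must
// contain exactly one value per batch.
std::vector<int> expand_per_batch(const std::string& text, std::size_t batches)
{
    std::vector<int> values = parse_int_list(text);
    if(values.size() == 1)
    {
        const int single = values[0]; // copy first: assign() must not alias the vector
        values.assign(batches, single);
    }
    assert(values.size() == batches);
    return values;
}
```

For example, `expand_per_batch("400", 12)` yields twelve entries of 400, while `expand_per_batch("750,730,733", 3)` keeps the three values as given.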