Commit Graph

64 Commits

Author SHA1 Message Date
slippedJim
ae9edce7ad remove restriction of group mode hd192 no lse (#2252)
Co-authored-by: Jim <jimguo12@amd.com>

[ROCm/composable_kernel commit: 57f497452a]
2025-05-30 10:14:21 +08:00
Po Yen Chen
144377ae38 [CK_TILE] FMHA forward batch_prefill optimization for low CU utilization (#2251)
* Add constraint on traits/tile/pipeline

* Use kM0=128 if max_seqlen_q == 8192

* Re-format codegen script

* Remove redundant attr name postix

* Fix import error: default field in dataclass

* Use kK0=64 & kK1=64 to hide latency

* Use CU utilization to decide tile size

[ROCm/composable_kernel commit: 28cd0dffc9]
2025-05-29 18:36:33 +09:00
Zzz9990
6538aae676 [VLLM V1] Add chunked prefill for FA to pass seq with small seqlen_q (#2221)
* fix splitkv compiler issue since lse is used to select kernel instances

* bypass seqlen == 1

* add chunked prefill into mha varlen

This reverts commit 5ba7148f7b34b1b438e9806748211c153ee8c433.

* skip compile when receipt 2-4 and add comments

* fix

---------

Co-authored-by: fsx950223 <fsx950223@outlook.com>

[ROCm/composable_kernel commit: ece38b9d7a]
2025-05-26 19:17:18 +08:00
Po Yen Chen
b54eceb59f Use only qr_async pipeline for batch_prefill (#2195)
[ROCm/composable_kernel commit: 8cb0474b3d]
2025-05-15 11:47:29 -07:00
Po Yen Chen
2944f929e7 [CK_TILE] Add logits soft-capping & customization support to the FMHA forward kernel/pipelines (#2163)
* hack for cap logits

* fix bug

* Re-format files

* Allow specifying logits_soft_cap through APIs

* Support turn on/off logits_soft_cap in async pipeline

* Do not generate non-verified kernels

* Align receipt used in Aiter

* Sync logits soft-capping across pipelines

* Re-enable some hdim pipelines

* fix perf

* Add attention variant for logits_soft_cap

* Add newline at end-of-file

* Fix performance

* Add comment to explain logits_soft_cap pre-processing

* Unify code

* Unify floating-point literal style

* Use class data member to slience the compilation error

* [CK_TILE] Update attention customizaton interface: add LogitsMask() (#2133)

* Send 'mask' along with variant params to the LogitsMask()

* Send block indices to the variant

* Add indices parameters in variant interface

* Fix fmha bwd codegen error

* Allow switch logits_soft_cap impl

* Eliminate register spills

* Fix compilation errors

* Fix wrong LSE

* Fix LSE for splitkv kernel

* Sync splitkv pipeline changes

* Add batch_prefill kernel/pipeline

* Fix codegen error

* Undo changes in CMakeLists.txt

* Merge pipeline filtering check

* Use different code path if kHasLogitsSoftCap=false

* Remove [[maybe_unused]] attribute

* Use pre-existing compile-time flag to instantiate templates

* Sync pipeline changes

* Update CHANGELOG.md

---------

Co-authored-by: Bernard <bernaliu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>

[ROCm/composable_kernel commit: 2920604786]
2025-05-13 12:19:25 +08:00
Illia Silin
df8daef645 Revert "Add ck tile examples to package (#1880)" (#2150)
[ROCm/composable_kernel commit: 9a9f59ae69]
2025-04-30 10:20:16 -07:00
jakpiase
d48f3b585b Add ck tile examples to package (#1880)
* add ck tile examples to package

* Update jenkinsfile

* fix for jenkinsfile

* fix for building ck tile code on non gfx9

* compile ck tile examples only for gfx94

* include ck tile examples in all target

* fix for basic gemm UseStructuredSparsity

* Update CMakeLists.txt

* Update gemm_pipeline_problem.hpp

* add targets to rocm install

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 434d19f696]
2025-04-28 09:53:19 -07:00
Po Yen Chen
7b61b0e4fa Avoid using store_tile_raw() for fp32 tensors (#2072)
[ROCm/composable_kernel commit: 3d4d70d2fc]
2025-04-26 23:07:41 -07:00
joyeamd
c27b313602 SWDEV-52596 for hdim=256, when use splitkv pipeline, two new pipelines need to be added (#2126)
[ROCm/composable_kernel commit: 41541aff7a]
2025-04-25 16:31:09 +08:00
rocking
cb729623b5 Only generate specific hdim (#2120)
[ROCm/composable_kernel commit: 02ce6d39ea]
2025-04-24 18:52:58 +08:00
joyeamd
19a9980cd5 fmha hdim256 vectorize improve (#2086)
For hdim 256, will not have vectorized buffer load when seqlen % 256 != 0 and hdim % 256 = 0; this commit tries to solve this condition.

[ROCm/composable_kernel commit: 94d47b1680]
2025-04-16 09:21:04 +08:00
slippedJim
959225947a add fmha fwd splitkv receipt for aiter c++ api (#2068)
* add s_randval for c++ api

* Fix bug of bias in splitkv

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

[ROCm/composable_kernel commit: 5f885d2b7a]
2025-04-10 23:21:13 +08:00
slippedJim
753a84d9d5 Add new receipt (#2055)
[ROCm/composable_kernel commit: 5a22b61de5]
2025-04-07 14:18:01 +08:00
rocking
01ea8aa249 Reduce redundant space in bias tensor (#2024)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 8a20b62e91]
2025-03-28 21:58:06 +08:00
carlushuang
e1122c5c27 add mask support in hdim=192/128 (#1999)
[ROCm/composable_kernel commit: 6c08c5c46d]
2025-03-21 18:28:43 +08:00
carlushuang
23340c5dd5 [CK_TILE] return value with macro in ck_tile::kernel_launch API (#1982)
* return value with macro and revert the return value

* [CK-TILE] no-macro launch api solution (#1992)

* no-macro solution

* address -Wcomma

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: e3c9886cdf]
2025-03-20 11:00:29 -07:00
rocking
b0f323c4ec Sync the kname with instance name (#1989)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: b819c217e4]
2025-03-20 00:06:45 +08:00
carlushuang
f2dd57b76f Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit b92caa3d84.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 3e81279d26]
2025-03-13 11:41:39 +08:00
Illia Silin
b92caa3d84 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 45fbd9210a.

[ROCm/composable_kernel commit: 8cbcd3e0d0]
2025-03-11 10:40:18 -07:00
carlushuang
45fbd9210a [CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)
* support hdim=192/128 pair

* remove useless print

* update

[ROCm/composable_kernel commit: 7a93b16ff6]
2025-03-11 21:07:40 +08:00
Max Podkorytov
9b160b318f refactor ck-tile kernel launch (#1925)
[ROCm/composable_kernel commit: 9e132eb77c]
2025-03-07 08:29:40 -08:00
Illia Silin
53b89436b2 remove support for gfx940 and gfx941 targets (#1944)
* remove support for gfx940 and gfx941 targets

* update changelog

[ROCm/composable_kernel commit: 9b51c08bf7]
2025-03-05 11:07:33 -08:00
rocking
1bb361e8dd explicit show no feature in kernel name (#1920)
[ROCm/composable_kernel commit: faa2235dad]
2025-02-28 14:23:30 +08:00
slippedJim
a8bf44629b make fmha bwd api template for v2 & v3 (#1918)
* use template fmha_bwd function

* update

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: a9bcd3c98d]
2025-02-27 19:26:19 +08:00
rocking
ddbaec1c1b Apply filter to every kernel in the codgen of FMHA (#1911)
* add receipt for fwd

* Add receipt for bwd

* Use kernel name to avoid more receipt

* apply filter to every kernel

[ROCm/composable_kernel commit: e9ee568681]
2025-02-26 20:20:29 +08:00
rocking
b6916590c5 only output the deterministic bwd kernel for aiter (#1903)
* only output the deterministic kernel

* Add comment

[ROCm/composable_kernel commit: e4358c01d9]
2025-02-20 04:27:01 +08:00
rocking
f7d2c5f79a Add receipt 10~12 for codegen of aiter integration (#1877)
* Add receipt for aiter integration

* update receipt

* Add hdim 96 instances

* Revert "Add hdim 96 instances"

This reverts commit f339449f54.

[ROCm/composable_kernel commit: f0d49d14fc]
2025-02-19 09:01:08 +08:00
Andres Lugo
5a73cb91dd Add receipt 4 option to codegen (#1875)
* Add receipt 4 option to codegen

* Remove repeated code

* Review comments

[ROCm/composable_kernel commit: 8086bbe3a7]
2025-02-11 10:11:46 -08:00
Andriy Roshchenko
f9938dfdc6 Add OCP FP8 support in CK_TILE (#1829)
* Add OCP FP8 to CK_TILE

* Validate OCP FP8 in FMHA FWD under VALID=1

[ROCm/composable_kernel commit: 35aebe5936]
2025-01-27 11:59:49 -07:00
Po Yen Chen
9e1b6988b8 [CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789)
* Update license year

* Add initial code to override decode problem

* Fix splitkv traits/args overriding error

* Reshape and transpose lse for decode

* Remove debug code

* Prettify example code

* Use better function name

* Add kMergeNumHeadGroupsSeqLenQ flag

Kernel user can use this switch to turn on/off optimization for
some problem sizes

* Add missing flag declarations

* Default turn off kMergeNumHeadGroupsSeqLenQ in codegen

* Group similar statements together

* Remove assumption of seqlen_q=1

* Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel

* Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel

* Run kMergeNumHeadGroupsSeqLenQ=true kernels when need

* Fix group mode block skip logics

* Undo changes of normal fwd kernel

* Update in GridSize() and using GridSize() for splitkv kernel (#1799)

---------

Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>

[ROCm/composable_kernel commit: 24b12d04af]
2025-01-07 18:49:24 +08:00
carlushuang
7804fe97aa [CK_TILE]naive attn support FP8 KVCache quant (#1747)
* quant

* fix bug

* simple smoothquant after softmax

* update kv-quant

* update stride

* fix fp8-pertoken-kvcache

* update int8/fp8 quant support

---------

Co-authored-by: so <a.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 6df5fe2ad8]
2025-01-03 18:43:07 +08:00
Qianfeng
d7de39fab3 Remove using partitioner for all fmha kernels (#1778)
* Remove using tile partitioner for fmha_fwd_kernel

* Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels

* Remove using tile partitioner for fmha_fwd_appendkv kernel

* Unify the format of GetTileIndex

[ROCm/composable_kernel commit: 4e076909b6]
2024-12-29 14:29:56 +08:00
Po Yen Chen
282f02cc66 Correct the dtype checking logics (#1775)
[ROCm/composable_kernel commit: 4c2eff023a]
2024-12-25 23:57:28 +08:00
Po Yen Chen
08975b5a8f [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705)
* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed piepline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stands out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow usig different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warning developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly form block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistant kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification
F#

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated experssions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes form develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix complilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when write lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debuging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove necessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-squre warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix with using static_for for passing CI checking

---------

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>

[ROCm/composable_kernel commit: 37cdbf4f0e]
2024-12-20 14:41:01 +08:00
carlushuang
96d0405d96 [CK_TILE] naive attn (#1708)
* add reference attention fwd

* refactor addresser

* update

* paged, and i8 reflect-quant

* lets call it forward-quant

* fix error in decode variation

* update naive-attn

* fix page table

* fix build err

[ROCm/composable_kernel commit: 77a38e0211]
2024-12-12 11:54:03 +08:00
rocking
a05c4ff96c [CK TILE] Use config name instead of data type in FmhaFwdTypeConfig<config> (#1731)
* Add data type config, Prepare to add mix precision in the future

* Fix compile error

[ROCm/composable_kernel commit: 94ae7113bd]
2024-12-10 11:36:18 +08:00
Po Yen Chen
b00d5d4750 [CK_TILE] Fix incorrect computation of group mode PagedAttention (#1688)
* Allow getting batch size from splitkv tile partitioner

* Fix wrong paged-kvcache impl for group mode

* Fix wrong example code for page-kvcache

* Undo changes in fmha_fwd.cpp

* Always use 2D block table

* Add is_gappy kernel argument for paged-kvcache

The is_gappy argument is used for differentiating seqstart_k_ptr usage
in flash-attention & xformers

* Remove out-of-date comments

* Remove no-longer used method

* Fix wrong # page-block calculation

* Fix wrong comment

---------

Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>

[ROCm/composable_kernel commit: cf2d635ea2]
2024-11-26 20:37:54 +08:00
Po Yen Chen
69aef1e11e [CK_TILE] Fix fMHA fwd MakeKargs() compilation errors (#1689)
* Fix mis-matched tuple<> elem types

* Rename MakeKargs() as MakeKargsImpl()

---------

Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>

[ROCm/composable_kernel commit: 645fe812f6]
2024-11-25 15:30:35 +08:00
Qianfeng
2a36aa8f41 Change in fwd-splitkv kernel to support num_splits=1 case (#1690)
* Change in fwd-splitkv kernel to support num_splits=1 case

* Update in codegen fwd-splitkv to make num_splits > 1 cases pass

* Specify instance traits in dispatch

* Fix link error for fp8 kernels

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: ce2bdf42a9]
2024-11-25 12:31:38 +08:00
Po Yen Chen
326639e80c [CK_TILE] Add paged-kvcache support in group mode fmha fwd splitkv kernels (#1678)
* Generate group mode paged-attn kernel

* Enable paged-kvcache + group mode support

* Add missing header: fused_moe.hpp

* Add comment to explain kernel arg usage

* Make error message more clear

* Add comment for confusing data member names

* Add more comment for confusing variable names

* Fix typo in option description

[ROCm/composable_kernel commit: fb1ccfa9df]
2024-11-21 14:53:10 +08:00
Po Yen Chen
ae9d04ac98 Fix 'sh' command compatibility of smoke_test_fwd.sh (#1553)
[ROCm/composable_kernel commit: af9546d9f4]
2024-11-09 09:55:14 +08:00
Juan Manuel Martinez Caamaño
4bb95f18ed [generate.py] Override blob list if it already exists (#1635)
Before, generate.py appended the list at the end of the output file.
When running the cmake configuration steps multiple times on the
examples, the blob list (such as fwd_blob_list.txt) would grow at every
configuration.
`library/src/tensor_operation_instance/gpu/mha/CMakeLists.txt` worked around
this issue by removing the output file if it exists.

Now, generate.py overrides the content of the output file.
There is no need for the workaround in the CMakeLists.txt;
and the issue is solved for the example projects too.

[ROCm/composable_kernel commit: 464abd235e]
2024-11-05 10:09:52 +01:00
Qianfeng
606fb4df2c [CK_TILE] Add fmha fwd headdim96 support (#1608)
* Add ceil_to_qualified_tile_length()

* Rename kK0BlockLength to kQKHeaddim

* Add kSubQKHeaddim concept to support headdim96

* Fix in math.hpp to avoid using __half interfaces

* Add LdsBufferSequence instance for headdim96

* Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing

* Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time

* Reformat one file

* Fix text alignment in fmha_fwd_splitkv.py

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 8632221814]
2024-10-30 14:03:16 +08:00
Po Yen Chen
919046f86c [CK_TILE] More fmha splitkv optimizations (#1588)
* Use pre-defined constants for readability

* Use vector write for o_acc tensor

* Remove no-longer used policy method

* Deprecate no-longer used policy/pipeline

* Specify gemm0/gemm1 block warps separately in codegen

* Fix wrong ps_idx creation logic

* Add single-warp block gemm

* Supoprt single-warp gemm0

* Make MakeCBlockTile() as static method

* Use MakeCBlockTile() to get underlying tile distribution

* Use kNumGemm1Warps to compute # threads for gemm1

* Put normal case in the if clause

* Refine fmha splitkv block mapping

* Refine & fix the lse_acc/o_acc layout

* Fix wrong LDS size for K tile

* Use kK0=64 for hdim=128,256 fmha splitkv kernels

* Use kK1=64 for hdim=32,64,128 fmha splitkv kernels

* Undo kK0/kK1 changes

* Use more reasonable GetAlignmentV() computation

* Using store_tile() in fmha splitkv kernel epilogue

[ROCm/composable_kernel commit: 54f0e6f4bb]
2024-10-26 18:35:45 +08:00
Po Yen Chen
267bf490e5 [CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577)
* Use smaller width for lse_accum dist tensor

* Update pipeline comment

* Fix wrong distribution for lse_accum

* Remove duplicate dim in lse_accum dist encoding

* Decide fmha splitkv combine kernel kBlockSize by kM0

* Remove assumption of MPerThread=1

* Add log<4> & log<8> specialization

* Enlarge occupancy array

* Fix vector size for small tile

* Add support for kMaxSplits=8

* Re-format gemm.hpp

* Use 16x16x16 warp gemm for fwd_splitkv

* Centralize policy code changes

* Leave fp8/bf8 tile settings unchanged

[ROCm/composable_kernel commit: 95e722a3b3]
2024-10-21 10:52:11 +08:00
Po Yen Chen
50f0f55fbc [CK_TILE] Update example README files & fix script compatibility issue (#1548)
* Fix text alignment of ArgParser::print()

* Update example README files

* Clarify make-ck-dev.sh <arch> usage

* Only keep some of the argument from '-?' output

* Undo command line output changes in README

* Only keep existing argument on doc and update description

* Fix text alignment

* Make cmake-ck-*.sh compatible with 'sh' command

[ROCm/composable_kernel commit: 0c094daa7e]
2024-10-08 10:45:12 +08:00
Qianfeng
1ca2b3d76c [CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549)
* Simplify the codes in splitkv_combine pipeline

* Always set kPadSeqLenK=true for fmha splitkv kernels

* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 74d68e3b99]
2024-10-08 10:44:34 +08:00
kylasa
6f048f54dc Adding seed and offset pointer support to the philox random number generator. (#1523)
* Adding seed and offset pointer support to the philox random number generator.

* Separating seed and offset pointer checks with different condition statements.

* Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs.

* Correcting a typo in the readme file

* Re-format files using remod.py

* Use STL type for API parameters

* Use simpler struct design for drop_seed & drop_offset

* Undo unnecessary changes

* Sync kargs style for fmha_fwd.hpp/.cpp

* Use templated union to reduce code

* Use structured binding to make code more readable

---------

Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: c24fae2346]
2024-10-05 02:48:47 +08:00
Po Yen Chen
1c4c07c669 [CK_TILE] Change output accum tensor layout of fmha fwd split-kv & combine kernels (#1527)
* Use same layout for o_acc and o tensor

* Use better param names in partitioner

* Remove redundant kargs 'max_seqlen_q'

* Use better param names in splitkv kernel

* Add comment for additional kernel arguments

* Sync empty loop early return logics between pipelines

* Pass more arguments to cmake in scripts

* Align backslashes

* Fix wrong o_acc tensor view strides

* Change o_acc layout if o_perm=0

* Handle whole row masked via attn_bias

* Use use vector width = 1 for o_acc

* Use more even split sizes

[ROCm/composable_kernel commit: a1c07e8d91]
2024-10-01 22:13:52 +08:00
Dan Yao
7460d19460 [CK_TILE] Fix compiler related FA bwd issues (#1530)
* add barriers

* tail bias barriers

* adjust bf16/hd256 tol

* continue adjust bf16/hd256 tol

[ROCm/composable_kernel commit: 9d69a099a4]
2024-09-26 12:18:39 -07:00