Commit Graph

529 Commits

Author SHA1 Message Date
Po Yen Chen
267bf490e5 [CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577)
* Use smaller width for lse_accum dist tensor

* Update pipeline comment

* Fix wrong distribution for lse_accum

* Remove duplicate dim in lse_accum dist encoding

* Decide fmha splitkv combine kernel kBlockSize by kM0

* Remove assumption of MPerThread=1

* Add log<4> & log<8> specialization

* Enlarge occupancy array

* Fix vector size for small tile

* Add support for kMaxSplits=8

* Re-format gemm.hpp

* Use 16x16x16 warp gemm for fwd_splitkv

* Centralize policy code changes

* Leave fp8/bf8 tile settings unchanged

[ROCm/composable_kernel commit: 95e722a3b3]
2024-10-21 10:52:11 +08:00
Qianfeng
3e0b77670e [CK_TILE] Improve headdim96 performance for fmha-bwd (#1573)
* Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96

* Remove the use of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor

* Fix in bwd_pipeline_default_policy

* Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines

* Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim

* Update to hd96 tile settings

* Add smoke test scripts for fmha-bwd hd96

* Revert "Add smoke test scripts for fmha-bwd hd96"

This reverts commit 7ca7e1a93d.

* Remove hd96 tile settings in fmha_bwd codegen to save compiling

* Fix lost code line in bwd_pipeline_default_policy

* Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits

* Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor

* tiny adjustments

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: danyao12 <Dan.Yao@amd.com>

[ROCm/composable_kernel commit: 14c3cfb1c6]
2024-10-16 18:14:32 +08:00
Bartłomiej Kocot
acb8a72aad [CK_TILE] Add block universal gemm pipeline policy (#1557)
* [CK_TILE] Add block universal gemm pipeline policy

* Fixes

* fixes2

* Fixes3

* Fixes

[ROCm/composable_kernel commit: d02a92cc0d]
2024-10-15 13:53:41 +02:00
Po Yen Chen
739f5210b0 Apply ROCm 6.2 WA to ROCm 6.3 and later (#1563)
[ROCm/composable_kernel commit: 9868fd0245]
2024-10-15 18:02:41 +08:00
Rostyslav Geyyer
0d8b3d36b2 Add custom type vector support (#1333)
* Add non_native_vector_type

* Add a test

* Add non-native vector type

* Fix CTOR

* Fix non-native vector type of 1

* Fix CTORs

* Use vector_type to cover non-native implementation as well

* Update the test

* Format

* Format

* Fix copyright years

* Remove BoolVecT so far

* Add AsType test cases

* Update assert error message

* Remove redundant type

* Update naming

* Add complex half type with tests

* Add tests for vector reshaping

* Add missing alignas

* Update test/data_type/test_custom_type.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Compare custom types to built-in types

* Add default constructor test

* Add an alignment test

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 4cf70b36c1]
2024-10-14 11:56:45 -05:00
Bartłomiej Kocot
87e4507543 Add transpose scale amax example (#1547)
* Add transpose scale amax example

* fixes

* Tune reduce instance

[ROCm/composable_kernel commit: f21cda2536]
2024-10-14 17:39:38 +02:00
Thomas Ning
2117e76277 decouple the calling from gemm_pipeline (#1571)
* decouple the calling from gemm_pipeline

* clang format

[ROCm/composable_kernel commit: 35c1777d59]
2024-10-14 13:59:26 +08:00
Adam Osewski
b05ec1b096 Implement GetWorkSpaceSize from BaseOperator. (#1564)
[ROCm/composable_kernel commit: 29d384d0b2]
2024-10-12 14:05:11 +08:00
Thomas Ning
0d711b3edf Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535)
* Make the cshuffle compilable

* Modify the reference on GPU and CPU; correct access of cshuffle

* fix the cpu reference code

* Complete the in tile shuffle logic

* restructure the kernel template input

* change the naming pattern of ck_tile gemm pipeline

* Re-format files using remod.py

* Solve the fmha conflict with gemm

* Comment Addressed from Carlus

---------

Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 6f27bc9872]
2024-10-10 18:02:22 +08:00
Christopher Millette
d6eae63f60 Fixes small memory leak from missing hipEventDestroy (#1554)
[ROCm/composable_kernel commit: ceaed8e097]
2024-10-09 09:41:35 +02:00
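The leak fix above comes down to pairing every event creation with a matching destroy call on all exit paths. A minimal self-contained sketch of the RAII pattern that guarantees this — using hypothetical `fake_event_create`/`fake_event_destroy` stand-ins in place of the HIP runtime calls, so the idea is illustrated without a GPU:

```cpp
#include <cassert>

// Stand-ins for hipEventCreate/hipEventDestroy so this sketch is
// self-contained; in real code these are the HIP runtime calls.
static int g_live_events = 0;
int fake_event_create(int** ev) { *ev = new int(0); ++g_live_events; return 0; }
int fake_event_destroy(int* ev) { delete ev; --g_live_events; return 0; }

// RAII guard: the destructor runs on every exit path, so the event can
// no longer leak when an early return or an exception skips manual cleanup.
class EventGuard {
    int* event_ = nullptr;
public:
    EventGuard() { fake_event_create(&event_); }
    ~EventGuard() { if (event_) fake_event_destroy(event_); }
    EventGuard(const EventGuard&) = delete;
    EventGuard& operator=(const EventGuard&) = delete;
    int* get() const { return event_; }
};
```

With such a guard, "missing destroy" bugs become impossible by construction rather than something to find in review.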
Po Yen Chen
50f0f55fbc [CK_TILE] Update example README files & fix script compatibility issue (#1548)
* Fix text alignment of ArgParser::print()

* Update example README files

* Clarify make-ck-dev.sh <arch> usage

* Only keep some of the arguments from '-?' output

* Undo command line output changes in README

* Only keep existing arguments in doc and update description

* Fix text alignment

* Make cmake-ck-*.sh compatible with 'sh' command

[ROCm/composable_kernel commit: 0c094daa7e]
2024-10-08 10:45:12 +08:00
Qianfeng
1ca2b3d76c [CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549)
* Simplify the codes in splitkv_combine pipeline

* Always set kPadSeqLenK=true for fmha splitkv kernels

* Change Oacc alignment and TileDistribution to be more adaptable to tile sizes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 74d68e3b99]
2024-10-08 10:44:34 +08:00
Illia Silin
881bc2c930 Fix build logic using GPU_ARCHS. (#1536)
* update build logic with GPU_ARCHS

* fix the GPU_ARCHS build for codegen

* unset GPU_TARGETS when GPU_ARCHS are set

[ROCm/composable_kernel commit: 7d8ea5f08b]
2024-10-07 08:18:23 -07:00
Bartłomiej Kocot
4aaf6ad633 [CK_TILE] Fix conv param multiple definition (#1550)
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: cc8f466a7e]
2024-10-07 15:21:21 +02:00
rocking
36b2a932b0 [Ck tile] Support layernorm one pass (#1512)
* Fix compile error

* Add one pass pipeline

* Extract creating tile_window to operator()

* clang format

* reduce duplicated code

* do not hardcode

* Support padding in layernorm

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 0023f01ab0]
2024-10-07 14:25:53 +08:00
kylasa
6f048f54dc Adding seed and offset pointer support to the philox random number generator. (#1523)
* Adding seed and offset pointer support to the philox random number generator.

* Separating seed and offset pointer checks with different condition statements.

* Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs.

* Correcting a typo in the readme file

* Re-format files using remod.py

* Use STL type for API parameters

* Use simpler struct design for drop_seed & drop_offset

* Undo unnecessary changes

* Sync kargs style for fmha_fwd.hpp/.cpp

* Use templated union to reduce code

* Use structured binding to make code more readable

---------

Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: c24fae2346]
2024-10-05 02:48:47 +08:00
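The commit above notes that a union stores either the seed/offset values or device pointers to them, to minimize device SGPRs. A minimal sketch of that idea — the names (`ValueOrPointer`, `DropoutArgs`, `get_seed`) are illustrative, not the actual CK API:

```cpp
#include <cassert>
#include <cstdint>

// A kernel argument can carry either an immediate seed/offset value or a
// device pointer to one, but never both at once, so a union halves the
// per-argument register footprint compared to two separate fields.
template <typename T>
union ValueOrPointer {
    T value;      // immediate host-provided value
    const T* ptr; // device pointer, dereferenced inside the kernel
};

struct DropoutArgs {
    ValueOrPointer<uint64_t> drop_seed;
    ValueOrPointer<uint64_t> drop_offset;
    bool is_device_ptr; // discriminator: which union member is active
};

uint64_t get_seed(const DropoutArgs& a) {
    return a.is_device_ptr ? *a.drop_seed.ptr : a.drop_seed.value;
}
```

The discriminator is shared by both fields, matching the "seed and offset pointer checks" the commit separates into distinct condition statements.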
Bartłomiej Kocot
47a2eb1cce Fix grouped gemm check to avoid overflow (#1545)
[ROCm/composable_kernel commit: 6b54d2faf8]
2024-10-04 17:32:43 +02:00
macurtis-amd
164963bf83 Fix compilation errors generated by forthcoming Clang changes (#1544)
Without this change, the following diagnostic is generated:
  a template argument list is expected after a name prefixed by the template
  keyword [-Wmissing-template-arg-list-after-template-kw]

See C++17 spec [temp.names] p5.

[ROCm/composable_kernel commit: aeb7c91f48]
2024-10-02 13:56:22 -07:00
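The diagnostic quoted above means the `template` disambiguation keyword may no longer precede a dependent member name unless an explicit argument list follows it ([temp.names]/5). A minimal sketch of the accepted form, with hypothetical names:

```cpp
#include <cassert>

template <typename T>
struct Holder {
    template <int I>
    int get() const { return I; }
};

// In a dependent context, `h.get<3>()` would parse `<` as less-than, so the
// `template` keyword is required -- and newer Clang also requires that an
// actual template argument list follow it, as written here.
template <typename T>
int use(const Holder<T>& h) {
    return h.template get<3>(); // OK: `template` keyword followed by <3>
}
```

Writing `h.template get` with no `<...>` after it (e.g. when taking the address of the member) is what the new warning flags.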
Illia Silin
ef193e048a [CK_TILE] add missing vector header (#1537)
* add missing vector header

* Re-format header using remod.py

---------

Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 8e4c3fb1bc]
2024-10-01 07:58:20 -07:00
Po Yen Chen
1c4c07c669 [CK_TILE] Change output accum tensor layout of fmha fwd split-kv & combine kernels (#1527)
* Use same layout for o_acc and o tensor

* Use better param names in partitioner

* Remove redundant kargs 'max_seqlen_q'

* Use better param names in splitkv kernel

* Add comment for additional kernel arguments

* Sync empty loop early return logics between pipelines

* Pass more arguments to cmake in scripts

* Align backslashes

* Fix wrong o_acc tensor view strides

* Change o_acc layout if o_perm=0

* Handle whole row masked via attn_bias

* Use vector width = 1 for o_acc

* Use more even split sizes

[ROCm/composable_kernel commit: a1c07e8d91]
2024-10-01 22:13:52 +08:00
Bartłomiej Kocot
da3172955b [CK_TILE] Image to Column kernel (#1532)
* [CK_TILE] Image to Column kernel

* Fixes

* Vector loads and stores

* Fixes

* Fixes

* change test dir name

[ROCm/composable_kernel commit: de3e3b6424]
2024-09-27 22:57:38 +02:00
Dan Yao
7460d19460 [CK_TILE] Fix compiler related FA bwd issues (#1530)
* add barriers

* tail bias barriers

* adjust bf16/hd256 tol

* continue adjust bf16/hd256 tol

[ROCm/composable_kernel commit: 9d69a099a4]
2024-09-26 12:18:39 -07:00
Illia Silin
04c756ea93 Fix compilation errors with Clang20.0. (#1533)
* fix clang20 compilation errors for gfx90a

* fix clang20 compilation errors for gfx11 targets

[ROCm/composable_kernel commit: 42e6dceacc]
2024-09-25 13:45:38 -07:00
Po Yen Chen
53b581e122 Early return if seqlen_k=0 on group mode (#1524)
[ROCm/composable_kernel commit: 770d2b7725]
2024-09-22 20:05:58 +08:00
Bartłomiej Kocot
e4f4e04add Add support for NGCHW in grouped conv fwd (#1499)
* Support NGCHW in grouped conv fwd

* Remove not needed variable

* Fixes

[ROCm/composable_kernel commit: 4ba52b35dc]
2024-09-20 10:45:46 +02:00
Adam Osewski
f6c6c375db Remove unsupported (fp8) type from Add memory operation. (#1521)
The dynamic buffer does not support fp8 in the `Update` operation, so fp8 cannot use `InMemoryDataOperation::Add`.

[ROCm/composable_kernel commit: 0c39954da9]
2024-09-20 09:40:45 +02:00
Thomas Ning
2ded318de8 Ck tile gemm padding dim (#1516)
* Support the N dimension padding

* Finished the padding feature for different dimension of K

[ROCm/composable_kernel commit: 694c300145]
2024-09-18 11:32:29 -07:00
Thomas Ning
84f3413bb2 Ck tile GPU verification sample develop & Add the CK TILE GEMM to the CI/CD test (#1505)
* Finished the feature of gpu verification

* Add the ck_tile_gemm test in the CI CD

* add the include of tensor_layout in reference_gemm

* Comment Addressed

* split ck_tile fmha and gemm tests into separate stages

* restructure the reference gemm

* restructure a new reference_gemm api that could read the device mem

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 844f5a1712]
2024-09-14 21:08:40 +08:00
Jun Liu
04a8584b87 Customize filesystem in CK for legacy systems (#1509)
* Legacy support: customized filesystem

* Update cmakefile for python alternative path

* fix build issues

* CK has no boost dependency

* More fixes to issues found on legacy systems

* fix clang format issue

* Check if blob is correctly generated in cmake

* fix the python issues

* add a compiler flag for codegen when using alternative python

* use target_link_options instead of target_compile_options

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 81bc1496b2]
2024-09-13 07:51:07 -07:00
Mateusz Ozga
9c0316d853 Pool2d max/avg kernel in the BWD version (#1494)
* Add pool2d instance BWD AVG

* Add pool2d instance BWD MAX

* Fix: avg review

* Fix review: part2

* Fix - enable test when type is compiled

* Fix review part3

[ROCm/composable_kernel commit: 448c0f56d8]
2024-09-12 11:47:52 +02:00
jakpiase
8aeb2afbe2 Rewrite pool2d fwd (#1462)
* added pool2d fwd

* add tests

* add reviewers changes

* Revert "Merge remote-tracking branch 'origin/develop' into jakpiase/pool2d_fwd_new"

This reverts commit 6b2ba7ff89, reversing
changes made to 22c82bea0c.

* Revert "add reviewers changes"

This reverts commit 22c82bea0c.

* added reviewers comments

* revert some old files

* add reviewers requests

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: e8d2887cb2]
2024-09-11 15:21:00 +02:00
jakpiase
681d36db5f Added structural sparsity blockwise gemm (#1435)
* Implemented smfmac xdlops

* Added smfmac blockwise xdlops

* fixes

* add reviewers suggestions

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 2a261afcdf]
2024-09-11 15:19:42 +02:00
Dan Yao
df8769d3c8 [CK_TILE] FA bwd repair (#1502)
* fix fa bwd

* revert kernelBlockSize in gemm_kernel.hpp

[ROCm/composable_kernel commit: d09572e8c2]
2024-09-10 10:45:32 -07:00
Thomas Ning
b736c9b51a Ck tile gemm example (#1488)
* Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout

* Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.

* Fix: Clang Format, API fixed from fmha

* fix with better naming convention

* revert back the pipeline code of fmha

* Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one.

* clang format with the reference_gemm file

* convert the clang format with the remod.py

* Changed the format and variable name of the kernel gemm_shape and partitioner

---------

Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>

[ROCm/composable_kernel commit: caacd38830]
2024-09-07 16:23:32 +08:00
M.Emin Ozturk
483f33772d Modification to fix issue "threadwise_tensor_slice_transfer_v5r1 issue #1279" (#1492)
* issue fix, one line changed for tmp

* clang

---------

Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Harisankar Sadasivan <135730918+hsadasiv@users.noreply.github.com>

[ROCm/composable_kernel commit: 8378855361]
2024-09-04 21:52:55 -07:00
Haocong WANG
505351b016 Add gemm universal bf16 instances (#1484)
* revert ckprofiler change

* temp save

* Add test and test pass

* test pass

* Fix bug inside rotating buffer when tensor is not packed

* bug fix

* clang format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 5b10dae6a4]
2024-09-04 20:58:54 -07:00
Bartłomiej Kocot
691144def1 Add support for NGCHW in grouped conv bwd wei (#1491)
* Add support for NGCHW in grouped conv bwd wei

* Comments fixes

* navi fixes

* Update function names

[ROCm/composable_kernel commit: 73b67f290f]
2024-09-03 10:52:03 +02:00
Bartłomiej Kocot
ebb827260e Revert "Revert "Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) (#1415)" (#1455)" (#1490)
This reverts commit a05bad520a.

[ROCm/composable_kernel commit: a9b170b541]
2024-09-02 10:39:49 +02:00
Dan Yao
34c2080c73 [CK_TILE] float -> bf16 inline asm rtn (#1482)
* asm rtn

* add asm rtn macro

* reorder macro

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>

[ROCm/composable_kernel commit: b8addae293]
2024-08-30 15:38:09 +08:00
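The commit above replaces a float-to-bf16 round-to-nearest (rtn) conversion with GPU inline asm. A portable sketch of the same round-to-nearest-even rule in plain C++ (NaN handling omitted for brevity; the in-tree asm macro is the authoritative version):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bf16 is the top 16 bits of an IEEE-754 float; rounding to nearest-even
// means adding 0x7FFF plus the LSB of the bits that will be kept, then
// truncating the low 16 bits.
uint16_t float_to_bf16_rtn(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t lsb = (bits >> 16) & 1u; // tie-breaker toward even
    bits += 0x7FFFu + lsb;
    return static_cast<uint16_t>(bits >> 16);
}
```

For example, 1.0f (0x3F800000) maps to bf16 0x3F80 unchanged, since its discarded mantissa bits are zero.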
Po Yen Chen
a412946042 Enable scratch memory workaround on ROCm 6.2 (#1486)
Co-authored-by: carlushuang <carlus.huang@amd.com>

[ROCm/composable_kernel commit: 461ec98d78]
2024-08-30 10:40:00 +08:00
Po Yen Chen
bc3b6b2c6f [CK_TILE] Add PagedAttention kernels (#1387)
* Use dictionary to config all the functions

* Add init codegen logic for fmha fwd appendkv

* Call HIP_CHECK_ERROR() macro to get real source info

* Setup meaningfull arguments

* Sync kernel name with the codegen

* Add knew/vnew tensors to the kernel argument

* Fix wrong K values after appending

* Fix vnew append error

* Extract common logics

* Fix Vnew tile dstr for row major case

* Conditionally add fwd_splitkv API in fmha_fwd example

* Conditionally add call to fmha_fwd_splitkv()

* Remove "EXAMPLE_" prefix of cmake variables

* Register API handlers automatically

* Early return if 0 < s_k_new is not supported

* Show message if we are ignoring option

* Unify CMakeLists.txt coding style

* Set num_splits=1 if split-kv is not supported

* Add length/stride getters for HostTensor

* Add RoPE example utilities

* Add reference_rotary_position_embedding() (not implemented)

* Finish reference_rotary_position_embedding() impl

* Fix typo of HostTensor<>::get_length()

* Fix compilation errors

* Fix wrong answer when interleaved=false

* Fix wrong answer when interleaved=true

* Append K/V in the host verification code

* Simplify K appending logics

* Simplify v_host_ref definition

* Reduce input/output dimensions

* Rename function: add "batched" prefix

* Apply RoPE on host side

* Rename RoPE utility function

* Fix wrong tensor size

* Avoid invoking deprecated method 'find_module'

* Pass RoPE kernel args

* Create Rotary Cos/Sin tile windows in kernel

* Add compute data type alias for RoPE

* Randomly generate seqlen_knew if needed

* Fix seqlen_knew enabling check logic

* Add minimum seqlen_k to generate compliant kvcache

* Fix compilation error in debug mode

* Fix wrong boundaries

* Fix wrong seqlen_k for kvcache

* Rename variables used in distribution encoding

* Fix rotary cos/sin tensor/tile size

* Add constraint to the rotary_dim option

* Remove unused inner namespace

* Add dram distribution for rotary_cos/rotary_sin (interleaved)

* Only apply interleaved RoPE on Knew for now

* Fix wrong thread starting offset

* Instantiate multiple kernels for RoPE approaches

* Clean-up pipeline

* Fix error in RoPE host reference

* Handle RoPE half-rotated logics

* Support 8x rotary_dim under half-rotated RoPE

* Add comment

* Apply elementwise function to the loaded tiles

* Unify parameter/variable naming style

* Remove constness from q_ptr

* Add code blocks for q_tile

* Apply RoPE to q_tile

* Remove debug print code in kernel

* Fix wrong knew/vnew appending positions

* Use better naming for tile indices

* Add make_tile_window() for adding distribution only

* Skip code if # of blocks is more than needed

* Move thread locating logics into policy

* Remove always true static_assert()

* Rename header

* Rename RotaryEmbeddingEnum

* Extract rotary embedding logic out

* Re-order parameters

* Align naming of some tile size constants

* Rename more tile size constants

* Fix wrong grid size

* Fix wrong shape of knew_host/vnew_host

* Fix wrong index into knew_host/vnew_host

* Fix wrong rotary_cos/rotary_sin memory size for Q

* Extract Q/Knew vector size to helper methods

* Use different rotary_cos/rotary_sin distr for Q/Knew

* Update host/device specifiers

* Fix wrong data type for Q rotary_cos/rotary_sin

* Remove RoPEComputeDataType type alias

* Shift rotary_cos/rotary_sin by cache_seqlen_k

* Add comment for why 't' is used for all padding flags

* Align commit message to the real comment

* Fix wrong pipeline

* Rename utility function

* Disable host verification if API not exist

* Fix wrong rope key for fp8 pipeline

* Allow only apply RoPE on Q (without append KV)

* Add append-kv smoke tests

* Remove debug statements

* Remove more debug statements

* Re-arrange the 'set +x' command

* Remove no-longer used method in pipeline

* Add missing init code

* Refine pipeline padding settings

* Enlarge rotary_dim limit (8 -> 16)

* Enlarge KPerThread for rotary_interleaved=false

* Update rotary_dim range in smoke_test_fwd.sh

* Add template argument 'kIsPagedKV' for splitkv kernels

* Launch splitkv kernel if given page_block_size

* Fix wrong kernel name

* Fix seqlen_k_min for pre-fill case (1 -> 0)

* Add copy_const<> type trait

* Add another make_tile_window()

* Introduce 'TileWindowNavigator' types

* Simplify TileWindowNavigator interfaces

* Fix tile window navigation bugs

* Disable calling fmha_fwd()

* Remove ununnecessary data members

* Simplify more make_tile_window() overloads

* Move V tile through TileWindowNavigator

* Fix uneven split checking logic

* Move code after decide seqlen_q/seqlen_k

* Make sure we always start reading complete tile

* Use 128 as minimum page_block_size

* Fix wrong origin for bias

* Add batch_stride_k/batch_stride_v in group mode

* Unify origin

* Add missing kernel arguments for group mode

* Add paged-kv codegen logic for appendkv kernels

* Add block_table kernel args for appendkv kernel

* Add tile navigators to the appendkv kernel

* Fix wrong tensor descriptor lengths

* Pass re-created tile window to pipeline

* Fix wrong strides for appendkv kernel

* Allow transit tile_window to another page-block

* Handle cross-page-block write

* Do not perform write again if already in last page-block

* Always add fmha_fwd() api

* Add missing group mode argument

* Remove debug macro usages

* Rename option s_k_new to s_knew

* Separate splitkv/non-splitkv args/traits

* Remove fmha_fwd_dispatch()

* Fix compilation errors

* Remove dropout code in splitkv kernel

* Allow problem types without define kHasDropout attr

* Use generic lambda to init traits objects

* Separate more non-splitkv & splitkv traits/args

* Display more info for specific kernels

* Show more detailed warning message

* Rename 'max_num_blocks' to 'max_num_page_blocks'

* Remove no-longer used pipeline files

* Wrap code by #if directives

* Move functors to the beginning of validation code

* Use generic lambda to init all the api traits/args

* Fix wrong seqlen for kvcache

* Add missing comment

* Rename TileWindowNavigator to PageBlockNavigator

* Only expose necessary methods (not attributes)

* Re-order pipeline parameters

* Refine smoke_test_fwd.sh

* Fix wrong argument count

* Make tile window directly via PageBlockNavigator

* Remove unused template parameter

* Remove group mode from appendkv kernel

* Fix skcheck logic

* Fix wrong syntax in skcheck expr

* Use meaningful options in smoke test

* Remove options

* Fix formatting

* Fix more format

* Re-organize bash functions

* Pass cache_batch_idx to kernels

* Support cache_batch_idx in example

* Fix compilation error

* Add more appendkv test

* Add more case for appendkv

* Fix nonexistent attribute

* Remove 0 < seqlen_knew constraint

* Clarify the case in warning message

* Remove macro checking

* Force batch mode when invoking appendkv & splitkv apis

* Fix mode overriding logics

* Fix wrong parameter name

* Randomize seqlen_k if use kvcache

* Use randomized seqlen_k for kvcache

* Avoid using too small rotary_cos & rotary_sin

* Rename parameter

* Add seqlen_q & seqlen_k rules

* Add comment

* Add more comments

* Fix compilation errors

* Fix typo in comment

* Remove type argument

* Avoid seqlen_k=0 for kvcache

* Revert "Avoid seqlen_k=0 for kvcache"

This reverts commit 21c4df89e4.

* Fix wrong uneven split checking logics

* Only randomize kvcache seqlen_k if 1 < batch

* Return earlier if split is empty

* Revert "Only randomize kvcache seqlen_k if 1 < batch"

This reverts commit b9a4ab0d7e.

* Re-order seqlen_k_start adjustment logics

* Fix compilation errors

* Re-format script

* Find executable from folder automatically

* Fix kvcache seqlen_k generating logic

* Make comment more clear

* Fix wrong knew/vnew appending logic on host

* Add s_barrier to sync threads

* Revert "Add s_barrier to sync threads"

This reverts commit d3f550f30c.

* Support only using 1 row of rotary_cos/rotary_sin

* Rotate Q in different way

* Unify tensor view creation logics

* Fix wrong argument

* Add mask to switch how we use the rotary_cos/sin

* Move attr from traits to problem

* Move has_mask to fmha_fwd_appendkv_args

* Support use uint32_t as SAD operand in Alibi<>

* Use sad_u32() in splitkv kernels

* Store tensor views in PageBlockNavigator

* Use stored tensor view to update tile windows

* Enlarge tensor view size

* Remove debug code

* Fix wrong tensor view size

* Wrap tensor view into PageBlockNavigator

* Add DataType member to PageBlockNavigator

* Remove unnecessary member functions

* Refine macro use

* Fix typo

* Add blank line between directives and actual code

* Re-format files

* Remove type in comment

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>

[ROCm/composable_kernel commit: c156989298]
2024-08-28 20:50:43 +08:00
Andriy Roshchenko
10be209218 Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. (#1473)
* Enable CMakePresets build

* Verify Convolution, Scaling and ReLU algorithms.

* Add tensor element-wise scale and type cast operation.

* Reduction implemented but does not work.

* Exploration of Reduction functionality.

* Completed example for Convolution scaled with ReLu activation and AMAX reduction.

* WIP: Add required instances for convolution.

* WIP: Create client example. Implement convolution stage.

* Add elementwise instances.

* Add elementwise scale + convert example.

* Add reduction instances.

* WIP: Client example for AMAX reduction.

* WIP: Add instances for multistage reduction.

* WIP: Implementation of multistage reduction.

* Refactoring.

* Clean up.

* Add CMakePresets.json

* Guard off FP8 instances when the data type is not available.

* Add example for Scaled FP8 Convolution with AMAX reduction.

* Refactor CombConvScaleRelu instances.

* Add CombConvScale instances.

* Add client example for Scaled FP8 Convolution with AMAX reduction.

* Cleanup.

[ROCm/composable_kernel commit: c3515f277c]
2024-08-21 15:22:41 -07:00
Rostyslav Geyyer
94954e9fe4 Set RNE fp8 conversion as a default (#1458)
* Set RNE fp8 conversion as a default

* Update f8 tests

* Disable failing test on gfx11

* Update bf8 tests

* Add a flag

* Fix the flag

* Raise flag for gfx10 as well

* Temp commit for tolerance testing

* Update tolerances

[ROCm/composable_kernel commit: e20f20efbf]
2024-08-21 09:09:48 -07:00
Dan Yao
9aa946fd9f [CK_TILE] FA bwd kernels optimization (#1397)
* tmp save

* fix batch deterministic bugs

* fix group deterministic bugs

* codegen update

* reorder files

* bias support

* hd256 bias support

* bwd smoke test update

* simplify convert dq

* fix hd256 dropout scratch

* do{}while() -> while(){}

* comments

* remove FmhaBwdTilePartitioner

* save clear_tile

* refactor dropout

* code cleanup

* code cleanup

* comments

* fix epilogue problem

* fix fwd dropout

* group convert_dq opt

* fix dq alignment

* Do not store storerandval in bwd for flash attention integration

* fix hd32 error and boost performance

* revert

* Remove duplicated WarpGemm definitions in the policy file

* dropout patch for mrepeat 16*16

* code sync up

* dq_acc stride

* dq_acc stride stuff

* codegen update

* fwd dropout revert

* fix hd128 scratches and boost performance

* receipt 3 for simplified smoke test

* more strides for fa integration

* fix hd64 scratches and boost performance

* non-iglp pipeline for headdim padding cases

* dpad same as dvpad for flash attention integration

* unpadded lse&d for group mode

* Support unpad layout for group lse

* Support unpad lse layout for splitkv

* Fix stride for splitkv kernel

* fix unpadded lse issue in fwd splitkv

* comment

* solve lds read&write conflicts

* rename

* bias rename

* tile index revert

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>

[ROCm/composable_kernel commit: 79a5d9c10c]
2024-08-16 13:40:10 -07:00
Haocong WANG
68d3fce998 [GEMM] gemm_universal related optimization (#1453)
* replace buffer_atomic with global_atomic

* fixed global_atomic_add

* added bf16 atomic_add

* format

* clang-format-12

* clean

* clean

* add guards

* Update gtest.cmake

* enabled splitk_gemm_multi_d

* format

* add ckProfiler

* format

* fixed naming

* format

* clean

* clean

* add guards

* fix clang format

* format

* add kbatch printout

* clean

* Add rocm6.2 related gemm optimization

* Limit bf16 atomic usage

* remove redundant RCR gemm_universal instance

* Add RRR fp8 gemm universal instance

* Bug fix

* Add GPU_TARGET guard to FP8/BF8 target

* bug fix

* update cmake

* remove all fp8/bf8 example if arch not support

* Enable fp8 RRR support in ckProfiler

* limit greedy-reverse flag to gemm_universal in ckProfiler

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 3049b5467c]
2024-08-14 10:42:30 +08:00
Mateusz Ozga
c71e3a3c28 Support large 12-d tensor size for reduction kernel (#1465)
[ROCm/composable_kernel commit: 0606e5498e]
2024-08-13 16:15:47 +02:00
Bartłomiej Kocot
c786468150 Fix bug with n block id calculation in DeviceGroupedConvXdlCShuffle (#1457)
* Fix typo in TransformConvFwdToGemm

* Fix bug in n offset calculation

[ROCm/composable_kernel commit: 4a870942e6]
2024-08-10 13:12:05 +02:00
Jun Liu
a05bad520a Revert "Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) (#1415)" (#1455)
This reverts commit aa83424e9c.

[ROCm/composable_kernel commit: 5ff8eeebf9]
2024-08-08 19:09:33 -07:00
Juan Manuel Martinez Caamaño
5f8f190b34 Remove reinterpret_cast uses that result in undefined behaviour. (#1445)
* Remove reinterpret_cast uses that result in undefined behaviour. Use a bitcast instead.

See https://en.cppreference.com/w/cpp/language/reinterpret_cast#Type_accessibility

Closes #1439

* fix clang format

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 901e5f1540]
2024-08-07 11:49:02 -07:00
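The undefined-behavior pattern this commit removes is reading an object's bytes through an unrelated pointer type, e.g. `*reinterpret_cast<uint32_t*>(&f)`. A minimal sketch of the well-defined replacement — the `memcpy` form that `std::bit_cast` (C++20) and `__builtin_bit_cast` lower to:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Accessing a float through a uint32_t lvalue violates type-accessibility
// rules (see the cppreference page cited above). Copying the object
// representation instead is defined behavior, and compilers optimize the
// memcpy away to a plain register move.
uint32_t bits_of(float f) {
    uint32_t u;
    static_assert(sizeof(u) == sizeof(f), "bit cast requires equal sizes");
    std::memcpy(&u, &f, sizeof(u));
    return u;
}
```

On C++20 toolchains the same function is simply `return std::bit_cast<uint32_t>(f);`.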
Juan Manuel Martinez Caamaño
f2b070380e Add missing constexpr to if conditions (#1444)
[ROCm/composable_kernel commit: fd9ef4e678]
2024-08-06 11:40:34 -07:00
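Why the missing `constexpr` matters in templated code: with a plain `if`, both branches must compile for every instantiation, whereas `if constexpr` discards the untaken branch at compile time. A minimal sketch with hypothetical names:

```cpp
#include <cassert>
#include <type_traits>

// With a plain `if`, the `static_cast<int>(v)` in the else-branch would be
// ill-formed when T is a pointer type, even though that branch is never
// taken. `if constexpr` discards the false branch, so each instantiation
// only has to compile the branch it actually uses.
template <typename T>
int describe(T v) {
    if constexpr (std::is_pointer_v<T>)
        return v == nullptr ? 0 : 1;
    else
        return static_cast<int>(v);
}
```

Dropping the `constexpr` here turns a working template into a compile error for pointer arguments, which is the class of fix the commit applies.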