composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-01 12:17:00 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	31bf253aeb	Add dynamic elementwise op (#1426 ) * Add dynamic elementwise op Co-authored-by: ThruptiRajLakshmanaGowda <thruptiraj.lakshmanagowda@amd.com> * CI issues fix * Custom parameter value for dynamic functions - Comments addressed --------- Co-authored-by: ThruptiRajLakshmanaGowda <thruptiraj.lakshmanagowda@amd.com> Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>	2024-10-26 15:22:37 +02:00
Po Yen Chen	54f0e6f4bb	[CK_TILE] More fmha splitkv optimizations (#1588 ) * Use pre-defined constants for readability * Use vector write for o_acc tensor * Remove no-longer used policy method * Deprecate no-longer used policy/pipeline * Specify gemm0/gemm1 block warps separately in codegen * Fix wrong ps_idx creation logic * Add single-warp block gemm * Supoprt single-warp gemm0 * Make MakeCBlockTile() as static method * Use MakeCBlockTile() to get underlying tile distribution * Use kNumGemm1Warps to compute # threads for gemm1 * Put normal case in the if clause * Refine fmha splitkv block mapping * Refine & fix the lse_acc/o_acc layout * Fix wrong LDS size for K tile * Use kK0=64 for hdim=128,256 fmha splitkv kernels * Use kK1=64 for hdim=32,64,128 fmha splitkv kernels * Undo kK0/kK1 changes * Use more reasonable GetAlignmentV() computation * Using store_tile() in fmha splitkv kernel epilogue	2024-10-26 18:35:45 +08:00
valarLip	37f7afed1e	add int8 gemm multiply multiply a8w8 (#1591 ) * add int8 gemm multiply multiply a8w8 * uncomment * clang-format-12 * Add example_gemm_multiply_multiply_xdl_int8 * Remove shell scripts * update preprocess number for mi308; bring back printout in ckprofiler * format --------- Co-authored-by: chenjun <junchen2@amd.com> Co-authored-by: Haocong WANG <haocwang@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com>	2024-10-26 16:39:34 +08:00
aledudek	9385caa306	Generic threshold calculation (#1546 ) * Calculate generic relative threshold pool3dfwd * Calculate absolute error threshold pool3d fwd * Generic threshold calculation take max input for relative error pool3dfwd * Remove max possible value for error calculation at runtime * Remove debug print in pool3dfwd * Pool3d fwd adjusted types in generic threshold calculation * Generic threshold calculation take into account number of accumulations and accdatatype * Generic threshold fix final error formula * Generic threshold calculation - num of accs fix * Generic threshold calculation - adjust absolute error * Generic threshold calculation - OutDataType in absolute error	2024-10-25 12:46:24 +02:00
dummycoderfe	9183ce69ca	hot_fix epsilon pos (#1597 ) Co-authored-by: dummycoderfe <noplydummmycoder@163.com>	2024-10-25 11:17:45 +08:00
Jatin Chaudhary	4d5248e2d1	Explicit cast values to half (#1593 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2024-10-22 11:17:32 -07:00
ltqin	0394f8a713	update layernorm (#1570 ) * port layernorm * change warp_welford.hpp * Update warpshuffle * 1. Add save mean and save std back 2. Move construction of tensor_view and tile_window to operator() * refine welford max count calculation * unify layernorm api * Rename file * Remove save mean and inv std * Revert "refine welford max count calculation" This reverts commit `022365802b`. * Fix order of parameter * refine welford max count calculation again * Remove fp32 instances * Fix bug of padding * refactor api * Support bf16 * Extract common function * Refine arg of operator() * Add kMThreadPerBlock to template parameter * clang format * Refine variable name * Refine file name * remove redundant line * refactor layernorm2d pipeline and add block-per-block utility * fix name * rename more * add more block-per-tile instance * remove duplicated define * update instance for 2048, 1024 case * support up to 2048 now * opt loading * add n1536 * Add two pass pipeline * format * Fix incorrect type * parallel compilation * Use smaller N * fix 2p pass * Support Repeat_M in distribution * Refine nameing * Add reduce example --------- Co-authored-by: letaoqin <letaoqin@amd.com> Co-authored-by: aska-0096 <haocwang@amd.com> Co-authored-by: rocking <ChunYu.Lai@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com>	2024-10-22 09:26:18 +08:00
Po Yen Chen	95e722a3b3	[CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577 ) * Use smaller width for lse_accum dist tensor * Update pipeline comment * Fix wrong distribution for lse_accum * Remove duplicate dim in lse_accum dist encoding * Decide fmha splitkv combine kernel kBlockSize by kM0 * Remove assumption of MPerThread=1 * Add log<4> & log<8> specialization * Enlarge occupancy array * Fix vector size for small tile * Add support for kMaxSplits=8 * Re-format gemm.hpp * Use 16x16x16 warp gemm for fwd_splitkv * Centralize policy code changes * Leave fp8/bf8 tile settings unchanged	2024-10-21 10:52:11 +08:00
Qianfeng	14c3cfb1c6	[CK_TILE] Improve headdim96 performance for fmha-bwd (#1573 ) * Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96 * Remove the using of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor * Fix in bwd_piple_default_policy * Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines * Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim * Update to hd96 tile settings * Add smoke test scripts for fmha-bwd hd96 * Revert "Add smoke test scripts for fmha-bwd hd96" This reverts commit `7ca7e1a93d`. * Remove hd96 tile settings in fmha_bwd codegen to save compiling * Fix lost code line in bwd_pipeline_default_policy * Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits * Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor * tiny adjustments --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: danyao12 <Dan.Yao@amd.com>	2024-10-16 18:14:32 +08:00
Bartłomiej Kocot	d02a92cc0d	[CK_TILE] Add block universal gemm pipeline policy (#1557 ) * [CK_TILE] Add block universal gemm pipeline policy * Fixes * fixes2 * Fixes3 * fixeS	2024-10-15 13:53:41 +02:00
Po Yen Chen	9868fd0245	Apply ROCm 6.2 WA to ROCm 6.3 and later (#1563 )	2024-10-15 18:02:41 +08:00
Rostyslav Geyyer	4cf70b36c1	Add custom type vector support (#1333 ) * Add non_native_vector_type * Add a test * Add non-native vector type * Fix CTOR * Fix non-native vector type of 1 * Fix CTORs * Use vector_type to cover non-native implementation as well * Update the test * Format * Format * Fix copyright years * Remove BoolVecT so far * Add AsType test cases * Update assert error message * Remove redundant type * Update naming * Add complex half type with tests * Add tests for vector reshaping * Add missing alignas * Update test/data_type/test_custom_type.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Compare custom types to built-in types * Add default constructor test * Add an alignment test --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-10-14 11:56:45 -05:00
Bartłomiej Kocot	f21cda2536	Add transpose scale amax example (#1547 ) * Add transpose scale amax example * fixes * Tune reduce instance	2024-10-14 17:39:38 +02:00
Thomas Ning	35c1777d59	decouple the calling from gemm_pipeline (#1571 ) * decouple the calling from gemm_pipeline * clang format	2024-10-14 13:59:26 +08:00
Adam Osewski	29d384d0b2	Implement GetWorkSpaceSize from BaseOperator. (#1564 )	2024-10-12 14:05:11 +08:00
Thomas Ning	6f27bc9872	Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535 ) * ake the cshuffle compilable * modify Mhe reference on gpu and cpu. Correaccess of cshuffle * fix the cpu reference code * Complete the in tile shuffle logic * restructure the kernel template input * change the naming pattern of ck_tile gemm pipeline * Re-format files using remod.py * Solve the fmha conflict with gemm * Comment Addressed from Carlus --------- Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>	2024-10-10 18:02:22 +08:00
Christopher Millette	ceaed8e097	Fixes small memory leak from missing hipEventDestroy (#1554 )	2024-10-09 09:41:35 +02:00
Po Yen Chen	0c094daa7e	[CK_TILE] Update example README files & fix script compatibility issue (#1548 ) * Fix text alignment of ArgParser::print() * Update example README files * Clarify make-ck-dev.sh <arch> usage * Only keep some of the argument from '-?' output * Undo command line output changes in README * Only keep existing argument on doc and update description * Fix text alignment * Make cmake-ck-*.sh compatible with 'sh' command	2024-10-08 10:45:12 +08:00
Qianfeng	74d68e3b99	[CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549 ) * Simplify the codes in splitkv_combine pipeline * Always set kPadSeqLenK=true for fmha splitkv kernels * Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-10-08 10:44:34 +08:00
Illia Silin	7d8ea5f08b	Fix build logic using GRU_ARCHS. (#1536 ) * update build logic with GPU_ARCHS * fix the GPU_ARCHS build for codegen * unset GPU_TARGETS when GPU_ARCHS are set	2024-10-07 08:18:23 -07:00
Bartłomiej Kocot	cc8f466a7e	[CK_TILE] Fix conv param multiple definition (#1550 ) Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-10-07 15:21:21 +02:00
rocking	0023f01ab0	[Ck tile] Support layernorm one pass (#1512 ) * Fix compile error * Add one pass pipeline * Extract creating tile_window to operator() * clang format * reduce duplicated code * do not hardcode * Support padding in layernorm --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-10-07 14:25:53 +08:00
kylasa	c24fae2346	Adding seed and offset pointer support to the philox random number generator. (#1523 ) * Adding seed and offset pointer support to the philox random number generator. * Separating seed and offset pointer checks with different condition statements. * Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs. * Correcting a typo in the readme file * Re-format files using remod.py * Use STL type for API parameters * Use simpler struct design for drop_seed & drop_offset * Undo unnecessary changes * Sync kargs style for fmha_fwd.hpp/.cpp * Use templated union to reduce code * Use structured binding to make code more readable --------- Co-authored-by: Sudhir Kylasa <sukylasa@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2024-10-05 02:48:47 +08:00
Bartłomiej Kocot	6b54d2faf8	Fix grouped gemm check to avoid overflow (#1545 )	2024-10-04 17:32:43 +02:00
macurtis-amd	aeb7c91f48	Fix compilation errors generated by forthcoming Clang changes (#1544 ) Without this change, the following diagnostic is generated: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw] See C++17 spec [temp.names] p5.	2024-10-02 13:56:22 -07:00
Illia Silin	8e4c3fb1bc	[CK_TILE] add missing vector header (#1537 ) * add missing vector header * Re-format header using remod.py --------- Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>	2024-10-01 07:58:20 -07:00
Po Yen Chen	a1c07e8d91	[CK_TILE] Change output accum tensor layout of fmha fwd split-kv & combine kernels (#1527 ) * Use same layout for o_acc and o tensor * Use better param names in partitioner * Remove redundant kargs 'max_seqlen_q' * Use better param names in splitkv kernel * Add comment for additional kernel arguments * Sync empty loop early return logics between pipelines * Pass more arguments to cmake in scripts * Align backslashes * Fix wrong o_acc tensor view strides * Change o_acc layout if o_perm=0 * Handle whole row masked via attn_bias * Use use vector width = 1 for o_acc * Use more even split sizes	2024-10-01 22:13:52 +08:00
Bartłomiej Kocot	de3e3b6424	[CK_TILE] Image to Column kernel (#1532 ) * [CK_TILE] Image to Column kernel * Fixes * Vector loads and stores * Fixes * Fixes * change test dir name	2024-09-27 22:57:38 +02:00
Dan Yao	9d69a099a4	[CK_TILE] Fix compiler related FA bwd issues (#1530 ) * add barriers * tail bias barriers * adjust bf16/hd256 tol * continue adjust bf16/hd256 tol	2024-09-26 12:18:39 -07:00
Illia Silin	42e6dceacc	Fix compilation errors with Clang20.0. (#1533 ) * fix clang20 compilation errors for gfx90a * fix clang20 compilation errors for gfx11 targets	2024-09-25 13:45:38 -07:00
Po Yen Chen	770d2b7725	Early return if seqlen_k=0 on group mode (#1524 )	2024-09-22 20:05:58 +08:00
Bartłomiej Kocot	4ba52b35dc	Add support for NGCHW in grouped conv fwd (#1499 ) * Support NGCHW in grouped conv fwd * Remove not needed variable * Fixes	2024-09-20 10:45:46 +02:00
Adam Osewski	0c39954da9	Remove unsupported (fp8) type from Add memory operation. (#1521 ) The dynamic buffer doesn't have support for fp8 in `Update` operation thus fp8 is not supporting `InMemoryDataOperation::Add`	2024-09-20 09:40:45 +02:00
Thomas Ning	694c300145	Ck tile gemm padding dim (#1516 ) * Support the N dimension padding * Finished the padding feature for different dimension of K	2024-09-18 11:32:29 -07:00
Thomas Ning	844f5a1712	Ck tile GPU verification sample develop & Add the CK TILE GEMM to the CI/CD test (#1505 ) * Finished the feature of gpu verification * Add the ck_tile_gemm test in the CI CD * add the include of tensor_layou in reference_gemm * Comment Addressed * split ck_tile fhma and gemm tests into separate stages * restructure the reference gemm * restructure a new reference_gemm api that could read the device mem --------- Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: illsilin <Illia.Silin@amd.com>	2024-09-14 21:08:40 +08:00
Jun Liu	81bc1496b2	Customize filesystem in CK for legacy systems (#1509 ) * Legacy support: customized filesystem * Update cmakefile for python alternative path * fix build issues * CK has no boost dependency * More fixes to issues found on legay systems * fix clang format issue * Check if blob is correctly generated in cmake * fix the python issues * add a compiler flag for codegen when using alternative python * use target_link_options instead of target_compile_options --------- Co-authored-by: illsilin <Illia.Silin@amd.com>	2024-09-13 07:51:07 -07:00
Mateusz Ozga	448c0f56d8	Pool2d max/avg kernel in the BWD version (#1494 ) * Add pool2d instance BWD AVG * Add pool2d instance BWD MAX * Fix: avg review * Fix review: part2 * Fix - enable test when type is compiled * Fix review part3	2024-09-12 11:47:52 +02:00
jakpiase	e8d2887cb2	Rewrite pool2d fwd (#1462 ) * added pool2d fwd * add tests * add reviewers changes * Revert "Merge remote-tracking branch 'origin/develop' into jakpiase/pool2d_fwd_new" This reverts commit `6b2ba7ff89`, reversing changes made to `22c82bea0c`. * Revert "add reviewers changes" This reverts commit `22c82bea0c`. * added reviewers comments * revert some old files * add reviewers requests --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2024-09-11 15:21:00 +02:00
jakpiase	2a261afcdf	Added structural sparsity blockwise gemm (#1435 ) * Implemented smfmac xdlops * Added smfmac blockwise xdlops * fixes * add reviewers suggestions --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>	2024-09-11 15:19:42 +02:00
Dan Yao	d09572e8c2	[CK_TILE] FA bwd repair (#1502 ) * fix fa bwd * revert kernelBlockSize in gemm_kernel.hpp	2024-09-10 10:45:32 -07:00
Thomas Ning	caacd38830	Ck tile gemm example (#1488 ) * Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout * Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future. * Fix: Clang Format, API fixed from fmha * fix with better naming convention * revert back the pipeline code of fmha * Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one. * clang format with the reference_gemm file * convert the clang format with the remod.py * Changed the format and variable name of the kernel gemm_shape and partitioner --------- Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>	2024-09-07 16:23:32 +08:00
M.Emin Ozturk	8378855361	Moficiation to fix this issue "threadwise_tensor_slice_transfer_v5r1 issue #1279 " (#1492 ) * issue fix, one line changed for tmp * clang --------- Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu> Co-authored-by: Harisankar Sadasivan <135730918+hsadasiv@users.noreply.github.com>	2024-09-04 21:52:55 -07:00
Haocong WANG	5b10dae6a4	Add gemm universal bf16 instances (#1484 ) * revert ckprofiler change * temp save * Add test and test pass * test pass * Fix bug inside rotating buffer when tensor is not packed * bug fix * clang format --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2024-09-04 20:58:54 -07:00
Bartłomiej Kocot	73b67f290f	Add support for NGCHW in grouped conv bwd wei (#1491 ) * Add support for NGCHW in grouped conv bwd wei * Comments fixes * navi fixes * Update function names	2024-09-03 10:52:03 +02:00
Bartłomiej Kocot	a9b170b541	Revert "Revert "Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382 ) (#1406 ) (#1415 )" (#1455 )" (#1490 ) This reverts commit `5ff8eeebf9`.	2024-09-02 10:39:49 +02:00
Dan Yao	b8addae293	[CK_TILE] float -> bf16 inline asm rtn (#1482 ) * asm rtn * add asm rtn macro * reorder macro --------- Co-authored-by: carlushuang <carlus.huang@amd.com>	2024-08-30 15:38:09 +08:00
Po Yen Chen	461ec98d78	Enable scratch memory workaround on ROCm 6.2 (#1486 ) Co-authored-by: carlushuang <carlus.huang@amd.com>	2024-08-30 10:40:00 +08:00
Po Yen Chen	c156989298	[CK_TILE] Add PagedAttention kernels (#1387 ) * Use dictionary to config all the functions * Add init codegen logic for fmha fwd appendkv * Call HIP_CHECK_ERROR() macro to get real source info * Setup meaningfull arguments * Sync kernel name with the codegen * Add knew/vnew tensors to the kernel argument * Fix wrong K values after appending * Fix vnew append errro * Extract common logics * Fix Vnew tile dstr for row major case * Conditionally add fwd_splitkv API in fmha_fwd example * Conditionally add call to fmha_fwd_splitkv() * Remove "EXAMPLE_" prefix of cmake variables * Regsiter API handlers automatically * Early return if 0 < s_k_new is not supported * Show message if we are ignoring option * Unify CMakeLists.txt coding style * Set num_splits=1 if split-kv is not supported * Add length/stride getters for HostTensor * Add RoPE example utilities * Add reference_rotary_position_embedding() (not implemented) * Finish reference_rotary_position_embedding() impl * Fix typo of HostTensor<>::get_length() * Fix compilation errors * Fix wrong answer when interleaved=false * Fix wrong answer when interleaved=true * Append K/V in the host verification code * Simplify K appending logics * Simplify v_host_ref definition * Reduce input/output dimensions * Rename function: add "batched" prefix * Apply RoPE on host side * Rename RoPE utility function * Fix wrong tensor size * Avoid invoking deprecated method 'find_module' * Pass RoPE kernel args * Create Rotary Cos/Sin tile windows in kernel * Add compute data type alias for RoPE * Randomly generate seqlen_knew if needed * Fix seqlen_knew enabling check logic * Add minimum seqlen_k to generate compliance kvcache * Fix compilation error in debug mode * Fix wrong boundaries * Fix wrong seqlen_k for kvcache * Rename variables used in distributio encoding * Fix rotary cos/sin tensor/tile size * Add constraint to the rotary_dim option * Remove unused inner namespace * Add dram distribution for 
rotary_cos/rotary_sin (interleaved) * Only apply interleaved RoPE on Knew for now * Fix wrong thread starting offset * Instantiate multiple kernels for RoPE approaches * Clean-up pipeline * Fix error in RoPE host reference * Handle RoPE half-rotated logics * Support 8x rotary_dim under half-rotated RoPE * Add comment * Apply elementwise function to the loaded tiles * Unify parameter/variable naming style * Remove constness from q_ptr * Add code blocks for q_tile * Apply RoPE to q_tile * Remove debug print code in kernel * Fix wrong knew/vnew appending positions * Use better naming for tile indices * Add make_tile_window() for adding distribution only * Skip code if # of block is more than needed * Move thread locating logics into policy * Remove always true static_assert() * Rename header * Rename RotaryEmbeddingEnum * Extract rotary embedding logic out * Re-order parameters * Align naming of some tile size constants * Rename more tile size constants * Fix wrong grid size * Fix wrong shape of knew_host/vnew_host * Fix wrong index into knew_host/vnew_host * Fix wrong rotary_cos/rotary_sin memory size for Q * Extract Q/Knew vector size to helper methods * Use different rotary_cos/rotary_sin distr for Q/Knew * Update host/device specifiers * Fix wrong data type for Q rotary_cos/rotary_sin * Remove RoPEComputeDataType type alias * Shift rotary_cos/rotary_sin by cache_seqlen_k * Add comment for why I just 't' for all padding flags * Align commit message to the real comment * Fix wrong pipeline * Rename utility function * Disable host verification if API not exist * Fix wrong rope key for fp8 pipeline * Allow only apply RoPE on Q (without append KV) * Add append-kv smoke tests * Remove debug statements * Remove more debug statements * Re-arrange the 'set +x' command * Remove no-longer used method in pipeline * Add missing init code * Refine pipeline padding settings * Enlarge rotary_dim limit (8 -> 16) * Enlarge KPerThread for rotary_interleaved=false * Update rotary_dim 
range in smoke_test_fwd.sh * Add template argument 'kIsPagedKV' for splitkv kernels * Launch splitkv kernel if given page_block_size * Fix wrong kernel name * Fix seqlen_k_min for pre-fill case (1 -> 0) * Add copy_const<> type trait * Add another make_tile_window() * Introduce 'TileWindowNavigator' types * Simplify TileWindowNavigator interfaces * Fix tile window navigation bugs * Disable calling fmha_fwd() * Remove ununnecessary data members * Simplify more make_tile_window() overloads * Move V tile through TileWindowNavigator * Fix uneven split checking logic * Move code after decide seqlen_q/seqlen_k * Make sure we always start reading complete tile * Use 128 as minimus page_block_size * Fix wrong origin for bias * Add batch_stride_k/batch_stride_v in group mode * Unify origin * Add missing kernel arguments for group mode * Add paged-kv codegen logic for appendkv kernels * Add block_table kernel args for appendkv kernel * Add tile navigators to the appendkv kernel * Fix wrong tensor descriptor lengths * Pass re-created tile window to pipeline * Fix wrong strides for appendkv kernel * Allow transit tile_window to another page-block * Handle cross-page-block write * Donot perform write again if already in last page-block * Always add fmha_fwd() api * Add missing group mode argument * Remove debug macro usages * Rename option s_k_new to s_knew * Separate splitkv/non-splitkv args/traits * Remove fmha_fwd_dispatch() * Fix compilation errors * Remove dropout code in splitkv kernel * Allow problem types without define kHasDropout attr * Use generic lambda to init traits objects * Separate more non-splitkv & splitkv traits/args * Display more info for specific kernels * Show more detailed warning message * Rename 'max_num_blocks' to 'max_num_page_blocks' * Remove no-longer used pipeline files * Wrap code by #if directives * Move functors to the begining of validation code * Use generic lambda to init all the api traits/args * Fix wrong seqlen for kvcache * Add missing 
comment * Rename TileWindowNavigator to PageBlockNavigator * Only expose necessary methods (not attributes) * Re-order pipeline paremeters * Refine smoke_test_fwd.sh * Fix wrong arugment count * Make tile window directly via PageBlockNavigator * Remove unused template paremeter * Remove group mode from appendkv kernel * Fix skcheck logic * Fix wrong syntax in skcheck expr * Use meaningful options in smoke test * Remove options * Fix formatting * Fix more format * Re-organize bash functions * Pass cache_batch_idx to kernels * Support cache_batch_idx in example * Fix compilation error * Add more appendkv test * Add more case for appendkv * Fix unexisted attribute * Remove 0 < seqlen_knew constraint * Clarify the case in warning message * Remove macro checking * Force batch mode when invoking appendkv & splitkv apis * Fix mode overriding logics * Fix wrong parameter name * Randomize seqlen_k if use kvcache * Use randomized seqlen_k for kvcache * Avoid using too small rotary_cos & rotary_sin * Rename parameter * Add seqlen_q & seqlen_k rules * Add comment * Add more comments * Fix compilation errors * Fix typo in comment * Remove type argument * Avoid seqlen_k=0 for kvcache * Revert "Avoid seqlen_k=0 for kvcache" This reverts commit `21c4df89e4`. * Fix wrong uneven split checking logics * Only randomize kvcache seqlen_k if 1 < batch * Return earlier if split is empty * Revert "Only randomize kvcache seqlen_k if 1 < batch" This reverts commit `b9a4ab0d7e`. * Re-order seqlen_k_start adjustment logics * Fix compilation errors * Re-format script * Find executable from folder automatically * Fix kvcache seqlen_k generating logic * Make comment more clear * Fix wrong knew/vew appending logic on host * Add s_barrier to sync threads * Revert "Add s_barrier to sync threads" This reverts commit `d3f550f30c`. 
* Support only using 1 row of rotary_cos/rotary_sin * Rotate Q in different way * Unify tensor view creation logics * Fix wrong argument * Add mask to switch how we use the rotary_cos/sin * Move attr from traits to problem * Move has_mask to fmha_fwd_appendkv_args * Support use uint32_t as SAD operand in Alibi<> * Use sad_u32() in splitkv kernels * Store tensor views in PageBlockNavigator * Use stored tensor view to update tile windows * Enlarge tensor view size * Remove debug code * Fix wrong tensor view size * Wrap tensor view into PageBlockNavigator * Add DataType member to PageBlockNavigator * Remove unnecessary member functions * Refind macro use * Fix typo * Add blank line between directives and actual code * Re-format files * Remove type in comment --------- Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: rocking <ChunYu.Lai@amd.com>	2024-08-28 20:50:43 +08:00
Andriy Roshchenko	c3515f277c	Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. (#1473 ) * Enable CMakePresets build * Verify Convolution, Scaling and ReLU algorithms. * Add tensor element-wise scale and type cast operation. * Reduction implemented but does not work. * Exploration of Reduction functionality. * Completed example for Convolution scaled with ReLu activation and AMAX reduction. * WIP: Add required instances for convolution. * WIP: Create client example. Implement convolution stage. * Add elementwise instances. * Add elementwise scale + convert example. * Add reduction instances. * WIP: Client example for AMAX reduction. * WIP: Add instances for multistage reduction. * WIP: Implementation of multistage reduction. * Refactoring. * Clean up. * Add CMakePresets.json * Guard off FP8 instances when the data type is not available. * Add example for Scaled FP8 Convolution with AMAX reduction. * Refactor CombConvScaleRelu instances. * Add CombConvScale instances. * Add client example for Scaled FP8 Convolution with AMAX reduction. * Cleanup.	2024-08-21 15:22:41 -07:00
Rostyslav Geyyer	e20f20efbf	Set RNE fp8 conversion as a default (#1458 ) * Set RNE fp8 conversion as a default * Update f8 tests * Disable failing test on gfx11 * Update bf8 tests * Add a flag * Fix the flag * Raise flag for gfx10 as well * Temp commit for tolerance testing * Update tolerances	2024-08-21 09:09:48 -07:00

1436 Commits