Commit Graph

488 Commits

Author SHA1 Message Date
Andriy Roshchenko
f6c6819b47 Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. (#1473)
* Enable CMakePresets build

* Verify Convolution, Scaling and ReLU algorithms.

* Add tensor element-wise scale and type cast operation.

* Reduction implemented but does not work.

* Exploration of Reduction functionality.

* Completed example for Convolution scaled with ReLU activation and AMAX reduction.

* WIP: Add required instances for convolution.

* WIP: Create client example. Implement convolution stage.

* Add elementwise instances.

* Add elementwise scale + convert example.

* Add reduction instances.

* WIP: Client example for AMAX reduction.

* WIP: Add instances for multistage reduction.

* WIP: Implementation of multistage reduction.

* Refactoring.

* Clean up.

* Add CMakePresets.json

* Guard off FP8 instances when the data type is not available.

* Add example for Scaled FP8 Convolution with AMAX reduction.

* Refactor CombConvScaleRelu instances.

* Add CombConvScale instances.

* Add client example for Scaled FP8 Convolution with AMAX reduction.

* Cleanup.

[ROCm/composable_kernel commit: c3515f277c]
2024-08-21 15:22:41 -07:00
Rostyslav Geyyer
0ab95a332e Set RNE fp8 conversion as a default (#1458)
* Set RNE fp8 conversion as a default

* Update f8 tests

* Disable failing test on gfx11

* Update bf8 tests

* Add a flag

* Fix the flag

* Raise flag for gfx10 as well

* Temp commit for tolerance testing

* Update tolerances

[ROCm/composable_kernel commit: e20f20efbf]
2024-08-21 09:09:48 -07:00
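
Round-to-nearest-even (RNE) breaks ties toward the even result instead of always rounding up, which keeps the bias accumulated over many fp8 conversions small; that is the behaviour this commit makes the default. The snippet below is a generic, library-agnostic illustration of the tie-breaking rule and does not use CK's actual fp8 conversion routines.

```cpp
// Generic illustration of round-to-nearest-even tie breaking (not CK code).
#include <cfenv>
#include <cmath>
#include <cstdio>

int main()
{
    std::fesetround(FE_TONEAREST); // RNE is the IEEE-754 default rounding mode
    for(double v : {0.5, 1.5, 2.5, 3.5})
    {
        // nearbyint() honours the current rounding mode, so ties go to the even integer.
        std::printf("%.1f -> RNE %.0f, round-half-up %.0f\n",
                    v, std::nearbyint(v), std::floor(v + 0.5));
    }
}
```
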
Dan Yao
14402bb211 [CK_TILE] FA bwd kernels optimization (#1397)
* tmp save

* fix batch deterministic bugs

* fix group deterministic bugs

* codegen update

* reorder files

* bias support

* hd256 bias support

* bwd smoke test update

* simplify convert dq

* fix hd256 dropout scratch

* do{}while() -> while(){}

* comments

* remove FmhaBwdTilePartitioner

* save clear_tile

* refactor dropout

* code cleanup

* code cleanup

* comments

* fix epilogue problem

* fix fwd dropout

* group convert_dq opt

* fix dq alignment

* Do not store storerandval in bwd for flash attention integration

* fix hd32 error and boost performance

* revert

* Remove duplicated WarpGemm definitions in the policy file

* dropout patch for mrepeat 16*16

* code sync up

* dq_acc stride

* dq_acc stride stuff

* codegen update

* fwd dropout revert

* fix hd128 scratches and boost performance

* receipt 3 for simplified smoke test

* more strides for fa integration

* fix hd64 scratches and boost performance

* non-iglp pipeline for headdim padding cases

* dpad same as dvpad for flash attention integration

* unpadded lse&d for group mode

* Support unpad layout for group lse

* Support unpad lse layout for splitkv

* Fix stride for splitkv kernel

* fix unpadded lse issue in fwd splitkv

* comment

* solve lds read&write conflicts

* rename

* bias rename

* tile index revert

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>

[ROCm/composable_kernel commit: 79a5d9c10c]
2024-08-16 13:40:10 -07:00
Haocong WANG
65d6442b4c [GEMM] gemm_universal related optimization (#1453)
* replace buffer_atomic with global_atomic

* fixed global_atomic_add

* added bf16 atomic_add

* format

* clang-format-12

* clean

* clean

* add guards

* Update gtest.cmake

* enabled splitk_gemm_multi_d

* format

* add ckProfiler

* format

* fixed naming

* format

* clean

* clean

* add guards

* fix clang format

* format

* add kbatch printout

* clean

* Add rocm6.2 related gemm optimization

* Limit bf16 atomic usage

* remove redundant RCR gemm_universal instance

* Add RRR fp8 gemm universal instance

* Bug fix

* Add GPU_TARGET guard to FP8/BF8 target

* bug fix

* update cmake

* remove all fp8/bf8 examples if the arch does not support them

* Enable fp8 RRR support in ckProfiler

* limit greedy-reverse flag to gemm_universal in ckProfiler

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 3049b5467c]
2024-08-14 10:42:30 +08:00
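
The buffer_atomic to global_atomic change above concerns how split-K partial results are combined: each work-group atomically adds its partial C tile into the output buffer. Below is a hedged HIP device-side sketch of that accumulation step for a float output; the names are placeholders, not CK's epilogue code, and the bf16 path mentioned in the commit needs extra handling.

```cpp
#include <hip/hip_runtime.h>

// Each split-K work-group adds its partial C tile into the global output buffer.
__global__ void splitk_accumulate(float* c, const float* partial, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n)
        atomicAdd(&c[i], partial[i]); // lowers to a global atomic add for float
}
```
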
Mateusz Ozga
7a4690b077 Support large: 12d tensor size for reduction kernel (#1465)
[ROCm/composable_kernel commit: 0606e5498e]
2024-08-13 16:15:47 +02:00
Bartłomiej Kocot
15ab8b0d5c Fix bug with n block id calculation in DeviceGroupedConvXdlCShuffle (#1457)
* Fix typo in TransformConvFwdToGemm

* Fix bug in n offset calculation

[ROCm/composable_kernel commit: 4a870942e6]
2024-08-10 13:12:05 +02:00
Jun Liu
254a7dadb6 Revert "Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) (#1415)" (#1455)
This reverts commit 0c367d5912486f4fcbae1dbb861a1fb8176ca308.

[ROCm/composable_kernel commit: 5ff8eeebf9]
2024-08-08 19:09:33 -07:00
Juan Manuel Martinez Caamaño
61ecdbc128 Remove reinterpret_cast uses that result in undefined behaviour. (#1445)
* Remove reinterpret_cast uses that result in undefined behaviour. Use a bitcast instead.

See https://en.cppreference.com/w/cpp/language/reinterpret_cast#Type_accessibility

Closes #1439

* fix clang format

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 901e5f1540]
2024-08-07 11:49:02 -07:00
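
Type-punning through reinterpret_cast reads an object through an unrelated type and is undefined behaviour (see the cppreference link above); a bit-cast copies the object representation instead. A minimal sketch of the before/after pattern, using std::bit_cast from C++20 here rather than whatever helper the library itself adopted:

```cpp
#include <bit>
#include <cstdint>

// Undefined behaviour: accesses a float through an unrelated pointer type.
uint32_t bits_ub(float f) { return *reinterpret_cast<uint32_t*>(&f); }

// Well defined: copies the object representation instead of aliasing it.
uint32_t bits_ok(float f) { return std::bit_cast<uint32_t>(f); }
```
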
Juan Manuel Martinez Caamaño
e539c37e7d Add missing constexpr to if conditions (#1444)
[ROCm/composable_kernel commit: fd9ef4e678]
2024-08-06 11:40:34 -07:00
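
When a condition depends only on compile-time values, `if constexpr` discards the untaken branch during instantiation, so that branch may use operations that would not compile for the given template arguments; a plain `if` still requires both branches to be valid. A minimal sketch of the difference (the function and types are illustrative, not CK's):

```cpp
#include <type_traits>

template <typename T>
auto load(const T& t)
{
    if constexpr(std::is_pointer_v<T>)
        return *t; // instantiated only when T is a pointer
    else
        return t;  // instantiated only otherwise
}
```
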
jakpiase
e8ee8856fa Fix for beta!=0 in reduce (#1440)
* fix for beta!=0 in reduce

* add reviewers suggestions

[ROCm/composable_kernel commit: b74d4d4d54]
2024-08-06 09:10:39 -07:00
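
In BLAS-style reductions the output blends the new result with whatever is already in the destination, y = alpha * reduce(x) + beta * y, so the beta != 0 path must read the previous contents of y before writing; skipping that read is the kind of bug this fix addresses. A hedged host-side sketch of the intended semantics (names are illustrative, not the CK device op's API):

```cpp
#include <numeric>
#include <vector>

// y = alpha * sum(x) + beta * y -- the beta != 0 path has to read the old y first.
void reduce_sum(const std::vector<float>& x, float& y, float alpha, float beta)
{
    const float r = std::accumulate(x.begin(), x.end(), 0.0f);
    y = alpha * r + beta * y;
}
```
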
Bartłomiej Kocot
69a6b563f9 Add Grouped Conv Fwd Large Tensor kernel (#1432)
* Support 64 bit indexing

* Add new grouped conv fwd kernel for large tensors

* Add instances large tensor

* Fixes for transform conv to gemm

* Fixes

* fixes

* Remove unneeded instances

* examples fixes

* Remove unneeded ds arrays

* Fix tests

* Add 2GB check in gridwise dl

* Fixes

[ROCm/composable_kernel commit: 4ec5c52a0c]
2024-08-06 10:06:10 +02:00
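
The 64-bit indexing and the 2 GB check above exist because a single 32-bit-addressed buffer view tops out at 2^31 bytes, so element counts must be accumulated in 64-bit before comparing against that limit. A hedged sketch of such a guard (not the gridwise kernel's actual check):

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// True if the tensor stays under the 2 GB single-buffer addressing limit.
bool fits_in_2gb(const std::vector<int64_t>& lengths, int64_t bytes_per_element)
{
    const int64_t elems = std::accumulate(lengths.begin(), lengths.end(),
                                          int64_t{1}, std::multiplies<int64_t>{});
    return elems * bytes_per_element < (int64_t{1} << 31);
}
```
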
arai713
735984bb5a Codegen: isSupportedArgument check (#1417)
* added isSupportedArgument check into codegen device op

* adding function call

* remove commented code

[ROCm/composable_kernel commit: d32997a792]
2024-07-31 07:12:15 -07:00
carlushuang
cecee51c65 workaround rocm-6.2 compiler issue (#1421)
[ROCm/composable_kernel commit: b3f86e79dd]
2024-07-31 16:03:59 +08:00
Bartłomiej Kocot
1567614d80 Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) (#1415)
[ROCm/composable_kernel commit: 33b399cc15]
2024-07-30 18:36:04 +02:00
zjing14
a94e87d868 Add rotating buff for gemm_multi_d (#1411)
* add rotating_buff for gemm_multi_d

* format

* Update flush_cache.hpp

* Update gtest.cmake

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>

[ROCm/composable_kernel commit: 105bd708c7]
2024-07-25 23:21:21 +08:00
Andriy Roshchenko
e3b469a493 Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412)
* Add CMakePresets configurations.

* Add binary elementwise ConvScaleAdd and an example.

* Numerical verification of results.

Observed significant irregularities in F8 to F32 type conversions:
```log
ConvScaleAdd: float=145.000000   f8_t=160.000000    e=144.000000
ConvScaleAdd: float=97.000000   f8_t=96.000000    e=104.000000
ConvScaleAdd: float=65.000000   f8_t=64.000000    e=72.000000
```

* Implemented ConvScaleAdd + Example.

* Add ConvScale+Bias Instances

* Add Client Example for ConvScale+Bias

* Fix number of bytes in an example.

* Cleanup.

[ROCm/composable_kernel commit: 4a8a1befd5]
2024-07-24 15:49:55 -05:00
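
The conversion gaps in the log above follow directly from fp8's 3-bit mantissa: in [128, 256) representable values are 16 apart, so 145 can only land on 144 or 160, and the disagreement is over which neighbour gets picked. A small standalone check of that spacing (normal-range values only, not CK's f8_t implementation):

```cpp
#include <cmath>
#include <cstdio>

int main()
{
    for(float x : {145.0f, 97.0f, 65.0f})
    {
        // 3 mantissa bits => spacing of 2^(exponent - 3) between representable fp8 values.
        const float step = std::exp2(std::floor(std::log2(x)) - 3);
        const float lo   = std::floor(x / step) * step;
        std::printf("%.0f lies between representable values %.0f and %.0f (step %.0f)\n",
                    x, lo, lo + step, step);
    }
}
```
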
Bartłomiej Kocot
1f93d3f961 Add support for half_t and bfloat to reduction operations (#1395)
* Add support for half_t and bfloat to reduction operations

* Fix bhalf convert

* Next fix bf16

[ROCm/composable_kernel commit: ffabd70a15]
2024-07-24 12:12:37 -05:00
Bartłomiej Kocot
b23a3fcf77 Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406)
[ROCm/composable_kernel commit: 5d8c3d8190]
2024-07-22 14:21:24 +02:00
Haocong WANG
a0e0f3cdcc [GEMM] F8 GEMM, performance optimized. (#1384)
* add ab_scale init support

* enabled interwave

* add scale type; update isSupport

* adjust example

* clean

* enable f8 pure gemm rcr ckprofiler

* Add gemm_multiply_multiply instances

* clang format

* Optimize for ScaleBlockMNK=128

* enable abscale f8 gemm ck profiler

* Add pure f8 gemm test suite

* Reverting to the state of the project at f60fd77

* update copyright

* clang format

* update copyright

---------

Co-authored-by: root <jizhan@amd.com>

[ROCm/composable_kernel commit: 8c90f25be3]
2024-07-19 22:06:52 +08:00
ltqin
50c6703b31 Universal gemm splitk using reduce (with multi-d) (#1341)
* init for reduce_threadwise multi_d

* add reduce_threadwise_multi_d

* add reduce_multi_d

* clean

* start add an other splitk device op

* add reduce template parameter to SplitKBatchOffset

* add reduce c matrix

* clean up code

* change example data type to bf16

* add bf16Ai8B example

* remove reduce template parameter

* add splitk atomic status to v4

* example add multi d parameters

* device op add multi-d parameters

* add multi-d to reduce

* fix kbatch=1 bug

* change B layout to col in bf16Ai8B example

* remove float adding struct

* change  multi-d interface

* change file and class name

* remove multi-d of bf16Ai8B example

* change IsReduce function to IsReduceAdd

* change example layout to RRR from RCR

* set ds stride according to layout

* reset parameter layout

* add gemm universal reduce instance

* add reduce factory

* add profile_gemm_universal_reduce

* add reduce to profiler

* fix reduce instance

* fix profiler reduce compiling bug

* format

* format library instance code

* add mem instance for reduce library

* fix call instance names

* add workspace for reduce in ckProfiler

* format

* add mn padding to reduce library instance

* add fp16 instance to reduce of profiler

* change copyright time

* restore profiler cmake file

* add reduce text to instances

* add DsLayout and DsDataType to instances template parameter

* fixed gemm_reduce_multi_d

* add an example without multi_d

* Update common.hpp

* Update gtest.cmake

* Update gemm_xdl_splitk_reduce_bf16.cpp

* clean

* Update gtest.cmake

* format

* fix api

* format

* default parameter change to RRR

* add vector_len for multi_d

* format

* Update gtest.cmake

* fix bf16Ai8B elementwise op

* add ReduceDataType

* move ReduceDataType to end position

* format

* remove googletest git method address

* fix copyright time

* update init data

---------

Co-authored-by: root <jizhan@amd.com>
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

[ROCm/composable_kernel commit: c544eb4da0]
2024-07-19 22:01:22 +08:00
Bartłomiej Kocot
bff9195296 Refactor transform conv to gemm fwd (#1391)
* Refactor transform conv to gemm fwd

* fixes codegen

* wmma fixes

* fix wmma

* Fix copyright

[ROCm/composable_kernel commit: 70a814f163]
2024-07-19 09:29:25 +02:00
Qianfeng
59a1c6464f Replace the use of __expf with __ocml_exp_f32 to work around the test_softmax_rank4 failure (#1394)
[ROCm/composable_kernel commit: ee768148f0]
2024-07-17 09:15:05 -07:00
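
Softmax is normally computed in max-subtracted form so the exponentials stay in range, and routing every exponential through one helper is what makes swapping __expf for __ocml_exp_f32 (as this work-around does) a one-line change. A host-side sketch of the math only, with a placeholder exp_impl, not the CK softmax kernel:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Single exp() call site so the underlying intrinsic is easy to swap on the device side.
inline float exp_impl(float x) { return std::exp(x); }

std::vector<float> softmax(const std::vector<float>& x)
{
    const float m = *std::max_element(x.begin(), x.end());
    std::vector<float> y(x.size());
    float sum = 0.0f;
    for(std::size_t i = 0; i < x.size(); ++i)
    {
        y[i] = exp_impl(x[i] - m); // subtract the row max for numerical stability
        sum += y[i];
    }
    for(float& v : y)
        v /= sum;
    return y;
}
```
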
Andriy Roshchenko
a765481437 Adding more instances of grouped convolution 3d forward for FP8 with ConvScale element-wise operation and ReLU activation. (#1386)
* Add CMakePresets configurations.

* Add ConvScale+ReLU Functor and an Example

* Account for ReLU FLOPs.

* Add instances of 3D convolutions with ConvscaleRelu operation.

* Implement Client Example

* Cleanup

[ROCm/composable_kernel commit: 802a8a1df1]
2024-07-16 08:51:49 -07:00
Bartłomiej Kocot
07ca6dacf1 Support access per groups and filter3x3 in grouped conv fwd (#1382)
* Support access per groups and filter3x3 in grouped conv fwd

* Fixes for large cases

* Fixes for large tensors

[ROCm/composable_kernel commit: 82e8a78a3f]
2024-07-12 11:08:42 -07:00
carlushuang
e3d0102174 [CK_TILE] wa prec, remove sgpr offset for inline asm (#1356)
* wa prec, remove sgpr offset for inline asm

* macro for set tile

* ignore unused param if no kernel instances in host API

* fix more prec issue

* cache buffer resource

* fix

* support pre-nop

* clear tile by vector type members

* add workaround to reduce scratch memory

* conditionally enable workaround code

* enable workaround start from certain build version

* fallback set_tile() implementation from certain build version

* undo template argument changes

* put dummy asm in load_raw()

* fix comments, refactor s_nop inside buffer_load

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 8182976c37]
2024-07-08 11:09:55 -07:00
Harisankar Sadasivan
c5f81450e1 Universal streamk with atomics (#1360)
* Universal stream-k with atomics, with ckProfiler support. grid_size and the stream-k strategy are tunable: grid_size of -1 leads to #WGs = maximum occupancy × num_CUs (see the sketch after this entry). The implementation supports several stream-k policies (1-tile, 2-tile, 3-tile and 4-tile); a stream-k strategy of -1 selects the default policy (4-tile).

* Update README.md

* fixing clang-format issues

* removed conflicts in struct members between streamk and universal streamk

* corrected arg parsing for streamk and universal streamk

* added stream-k policies for 3 tile and 4 tile

* fixed argument type issue with parsing cmd args

* Made changes suggested in PR review: removed comments and corrected copyright

* file permissions updated

* added default value support for grid_size and streamk-policy selection set to -1

* print messages for arguments

* print messages for arguments

* print messages for arguments1

[ROCm/composable_kernel commit: 75e622f02f]
2024-07-05 21:40:30 -07:00
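
The grid_size = -1 convention described in this commit resolves to one work-group per available slot, i.e. maximum occupancy per CU times the number of CUs. A hedged HIP host-side sketch of that calculation (the kernel and block size are placeholders; this is not the ckProfiler code):

```cpp
#include <hip/hip_runtime.h>

__global__ void gemm_kernel() {} // placeholder kernel for the occupancy query

int default_grid_size(int block_size)
{
    hipDeviceProp_t prop{};
    (void)hipGetDeviceProperties(&prop, 0);

    int max_blocks_per_cu = 0;
    (void)hipOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_cu, gemm_kernel, block_size, /*dynamic LDS*/ 0);

    // #WGs = maximum occupancy per CU x number of CUs
    return max_blocks_per_cu * prop.multiProcessorCount;
}
```
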
jakpiase
2f29b76d2e Add structural sparsity xdlops (#1363)
* Implemented smfmac xdlops

* add reviewer comments

[ROCm/composable_kernel commit: eaa870a1ab]
2024-07-04 12:00:14 +02:00
Jun Liu
100de21005 Fix issue with multiple targets and remove smfmac tests from unsupported test targets (#1372)
[ROCm/composable_kernel commit: 959073842c]
2024-07-03 23:34:38 -07:00
jakpiase
3a04bdded7 Add structural sparsity gemm instruction tests (#1309)
* first version of smfmac test

* add reviewer comments

* add reviewer suggestions

[ROCm/composable_kernel commit: ed21948bcd]
2024-06-27 11:30:32 +02:00
Illia Silin
cd1e33cce4 Merging the gfx12 code into public repo. (#1362)
[ROCm/composable_kernel commit: 941d1f7ce0]
2024-06-27 00:33:34 -07:00
Po Yen Chen
612bbfdccc Replace hipDeviceSynchronize() by hipStreamSynchronize(stream) calls (#1359)
[ROCm/composable_kernel commit: a32b1bc647]
2024-06-26 22:04:52 +08:00
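
hipDeviceSynchronize() blocks until every stream on the device has drained, while hipStreamSynchronize(stream) waits only for the stream the work was launched on, so timing and multi-stream callers stop interfering with each other. A minimal sketch of the replacement pattern (not the library's actual call sites):

```cpp
#include <hip/hip_runtime.h>

void wait_for(hipStream_t stream)
{
    // Before: hipDeviceSynchronize();   // waits for every stream on the device
    (void)hipStreamSynchronize(stream);  // after: waits only for this stream's work
}
```
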
Po Yen Chen
f351f9775c [CK_TILE] fmha forward split-kv + combine kernels (#1338)
* FA fwd dropout

* FA bwd

* epilogue reuse

* CMakeLists update

* [CK_TILE] support alibi (#1269)

* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

* now fwd/bwd can build

* bwd alibi

* add bwd validation stream_config

* update generated filenames

* update bwd kernel launch

* CK_TILE_HOST_DEVICE in philox

* Transpose -> transpose

* format

* format

* format

* Generate the instance for FA required

* format

* fix error in WarpGemm

* Add num_splits option and dummy split-kv api method

* Generate fmha_fwd_splitkv()

* Add SplitKV kernel codegen logics

* Add SplitKV combine kernel codegen logics

* Fix mismatched return type

* Clean-up code

* Replace sentinel value before storing

* Fix wrong layout of LSE/LSEacc/Oacc

* Format codes

* Fix o_acc memory error

* Fix wrong kBlockSize used in policy

* Reduce # of combine kernels

* Fix split-kv combine kernel name

* Fix wrong LDS indexing logics

* Fix wrong loop counter step logic

* Undo vector size changes

* Remove no-longer used field

* Remove inconsistent comment

* Remove debug statements in example

* Remove more debug statements

* Add constness to local variables

* Clean up generate.py

* Fix unstable clang-format comment

* Remove unused include directive

* Use shorter template parameter name

* Enable non-split-kv blobs

* Update license date

* Print num_splits conditionally

* Undo disabling data types

* Remove unnecessary tile size for fp8

* Fix wrong pipeline args for fp8

* Fix example output format

* Remove more debug code in combine pipeline

* Add stride kernel arguments for LSE/O acc workspace

* Re-order split-kv pipeline call operator arguments

* Pass LSE/O strides in kernel argument

* Re-order pipeline call operator arguments

* Use tensor_descriptor to locate LSEacc elements

* Support providing invalid element for tensor view

* Set invalid element value for LSEacc tensor view

* Remove hand-written store_tile() code

* Remove necessary value-overwrite logic

* Add transposed lds descriptor

* Support load_tile() for tile_window_with_static_lengths<>

* Undo removing necessary value-overwrite logic

* Use read descriptor to locate lds elements

* Simplify pipeline source code

* Add constraint to kMaxSplits

* Default use kMaxSplits=64 in generate.py

* Revert "Add constraint to kMaxSplits"

This reverts commit 0a2132d758.

* Revert "Default use kMaxSplits=64 in generate.py"

This reverts commit c7d9c80b77.

* Decide alignment by the padding parameter

* Remove no-longer used utility functions

* Remove not-working code

* Add comment & remove no-longer used code

* Fix computation errors

* Add heuristic to override num_splits option

* Add constraint to kMaxSplits

* Fix compilation error

* Clean up pipeline code

* Wrap pointer access as lambda function

* Rename confusing methods

* Use kLogMaxSplits as template parameter

* Finish splitkv combine kernel codegen

* Update kMaxSplits limit

* Use smaller kM0 for splitkv combine kernel

* Ignore dropout flag in splitkv pipeline

* Unify flag usage

* Add back flag kStoreLSE

* Merge lambda calls in pipeline

* Fix compilation errors

* Avoid all empty splits

* Always check for empty loop in splitkv pipelines

* Re-order parameters

* Remove redundant p_drop option check

* Add traits/problem for fwd splitkv kernel

* Conditionally enable uneven split boundary checks

* Add comment for the splitkv traits field

* Change even split criteria

* Re-order statements

* Refine occupancy value for hdim=128&256

* Refine occupancy value for hdim=32&64

* Remove redundant kernel argument

* Separate fmha bwd codegen logics

* Separate fmha fwd codegen logics

* Remove redundant direction parameter in fwd&bwd codegen logics

* Support generate multiple APIs for an example

* Let 'api' be an alias of the 'direction' option

* Remove choices for the 'direction' option

* Use dictionary to config all the functions

* Move fmha splitkv codegen logics to other file

* Add fwd_splitkv api for tile_example_fmha_fwd

---------

Co-authored-by: danyao12 <danyao12>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>

[ROCm/composable_kernel commit: 0cb2e06ddc]
2024-06-26 17:41:15 +08:00
arai713
4acf502f4c CK Instance Gen (#1145)
* Format

* Format

* Format

* Remove const

* Use the right template

* Format

* Format

* add row/col instances

* Add missing file

* fixed

* fixing block to etile error

* Format

* Updates

* Format

* fixed rrr layout

* generating a sample JSON file: currently contains includes, prologue/epilogue and instances

* version where the json is passed into the instances to generate a key

* updated run function to just launch kernel

* updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer

* adding in testing files

* cleaned up comments, still need to work on including header files

* removed unneeded files

* removed/commented out JSON implementation

* added fusion(prologue/epilogue) into instance generation

* working on instance selection

* added instance selection, need to fix instance validation

* removed block2etile map validity check for testing purposes

* test running: failing due to incorrect files/input

* all grid descs/ptrs completed, but device file not found

* Update test and embed modules

* Restore older version

* added convolution operation, written test, debugging generated code for compilation

* attempting to include CK in host directory: _Float16 error

* CK header file issues

* slight fix

* don't crash when hip can't report total memory

* dump generated code to a file

* changing sizes

* creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed

* some fixes to call the device code

* separating test files for conv and gemm

* completed arg ptr, now have linking errors

* clang format fix

* resolved linker issues in conv test

* remove dependency on libutility from ck

* resolved num dim error

* properly passing arg ptr, errors with passing typenames: redefinition/redeclaration

* undo the commenting of device function

* hand created kernel code to find rtc issues

* dump the full src to file

* resolved redeclaration errors, cleaned up errors for Amber's kernel code

* debugging purposes: redeclaration error

* config files

* resolved errors for NumTensor and redeclaration, formatted version.h

* resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type

* WIP: close to getting kernel compiled

* WIP: fixing rtc errors

* fixed sequence errors, formatting, still one error with run fcn

* yay: kernel compiles and runs

* updated templated/generated version to run and compile

* minor fixes

* working generated example, resolved memory access error due to padding

* adding in reference kernel, validation failing against reference

* debugging: printing kernel argsz

* reduced error in results

* debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues

* working validation (using reference convolution) with prologue function for both hard-coded and generated version

* WIP: create an alt version that creates Argument on the device

* wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments

* wip: making necessary methods device code

* added grid descs, working on grid pointers, errors with stl numerics

* wip: updating kernel args - issue, replacing some std functions

* replaced std::accumulate call with temp hardcoded version

* wip: args causing memory issue

* Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes

* adding object file dump

* temporary hardcoding of grid size, can remove device op inst + arg ptr

* minor fix for grid size

* added modified example where arg ptr is created on the device for generated version as well

* removed device op instance and arg ptr from modified examples

* moving device op file for testing purposes and to properly build CK

* commenting out print-outs

* adjust compiler args to produce a valid ELF file

* temporary removal of validation

* reverting compiler args back for working example

* retrieve necessary arguments from generated template parameters in correct format

* calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly

* scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example

* temporary change to generate ELF format binary object file

* removed unnecessary code, added comments

* formatting fix

* cleaned up code, added new tests, restructured library: move helper into CK

* refactored launch parameter calculation to be more concise

* renamed files and variables for more clarity/uniformity

* more code cleaning, removed debug statements

* moved majority of my files into codegen directory, running properly

* updated Embed.cmake(string_view) in codegen directory

* updated host directory to match Embed.cmake as well

* added old tests in

* updated instance generation methods to be more concise

* removed layout from launch parameter calculation

* working test

* fixed issue with verification, all instances working

* updated verification in other tests

* removed duplicate matrix padder file, removed code dumps

* removed old hard-coded tests

* removed old host directory, all files in codegen directory now

* fixed copyright in files

* commenting out validation

* renamed files

* made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code

* updated headers

* removing duplicate file for fwd conv to gemm, merging with original file

* fix building codegen with clang++ directly

* resolving build error from conv_fwd_to_gemm

* fix for previous error

* renaming tests

* created common test file

* cleaned up code, added comments

* renamed device op

* fixed typos in comments

* removed extra space

* code cleanup: resolving Amber's comments

* removed wrapper struct for matrix padder, fixed template

* cleaned up if statements for better readability

---------

Co-authored-by: Paul <pfultz2@yahoo.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 3e9711f0cb]
2024-06-25 16:37:35 -05:00
rocking
f70826b5fb layernorm2d forward (#1339)
* Add layernorm2d forward

* Refine file path

* clang format

* Exclude ck_tile op from all

* use add_executable instead

* refactor layernorm2d_fwd example

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>

[ROCm/composable_kernel commit: cb13839425]
2024-06-24 08:45:52 +08:00
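
Layernorm2d normalizes each row of an [M, N] input over its last dimension: y = (x - mean) / sqrt(var + eps) * gamma + beta. A plain host-side reference of that formula for one row (not the ck_tile kernel):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference layernorm for one row: y = (x - mean) / sqrt(var + eps) * gamma + beta.
void layernorm_row(const std::vector<float>& x, const std::vector<float>& gamma,
                   const std::vector<float>& beta, std::vector<float>& y, float eps = 1e-5f)
{
    const float n = static_cast<float>(x.size());
    float mean = 0.f;
    for(float v : x) mean += v;
    mean /= n;

    float var = 0.f;
    for(float v : x) var += (v - mean) * (v - mean);
    var /= n;

    const float inv_std = 1.f / std::sqrt(var + eps);
    for(std::size_t i = 0; i < x.size(); ++i)
        y[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
}
```
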
carlushuang
5cab9c9eac WA for rocm-6.2+ s constraint for buffer resource (#1346)
* WA for rocm-6.2+ s constraint for buffer resource

* add missing memory clobber

[ROCm/composable_kernel commit: fa129c1a5d]
2024-06-21 11:00:13 -05:00
Bartłomiej Kocot
cb58db5160 Fix cmake warnings (#1342)
* CMake: add -Wno-nvcc-compat

* Remove template without initialization list

* dpp remove template without init list

* Fixes

[ROCm/composable_kernel commit: 510325a468]
2024-06-21 09:47:58 +02:00
Dan Yao
10efd2a0b1 Fix FA bwd alibi+causal NaN errors (#1352)
* fix bwd alibi nan error

* fix datatype

---------

Co-authored-by: danyao12 <danyao12>

[ROCm/composable_kernel commit: 1da802bdf2]
2024-06-20 09:50:53 -05:00
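
ALiBi biases attention scores with a per-head linear penalty on the query/key distance, score(i, j) += -slope * (i - j) for causal keys j <= i; NaNs in the backward pass typically come from mixing that bias arithmetic with -inf masked positions. A generic sketch of the bias term itself (slopes per the original ALiBi paper; this is not the CK tile kernel):

```cpp
// Per-head ALiBi bias added to the attention score of query i and key j (causal case, j <= i).
// The head slope is typically 2^(-8 * (h + 1) / num_heads) in the original ALiBi formulation.
inline float alibi_bias(int i, int j, float slope)
{
    return -slope * static_cast<float>(i - j); // 0 on the diagonal, more negative further back
}
```
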
ThruptiRajLakshmanaGowda
cc15ede67e Adding Missing Activation Functions for Grouped 2D/3D Convolutions (#1348)
* Initial Push

* First Push

* Fixed Clang format

* Resolve merge conflict

* Addressed review comments

* Addressed review comments

* Addressed review comments

[ROCm/composable_kernel commit: 0162a5f6ba]
2024-06-20 09:24:54 -05:00
Qianfeng
19c52f8082 Fix dropout lambda to avoid a compilation issue on some docker/compiler environments (#1350)
[ROCm/composable_kernel commit: e3f44659cf]
2024-06-20 11:36:42 +08:00
Qianfeng
ceabd63e2a Hacking ck_tile fmha Dropout facility (#1344)
* Add NullBlockDropout to be used when kHasDropout is false

* Change to BlockDropout::Run() for forward to reduce conditional checkings

* Re-format files

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 1973903f49]
2024-06-19 10:37:22 +08:00
Bartłomiej Kocot
6935a2481c Add read_first_lane function for int64 (#1347)
[ROCm/composable_kernel commit: 8faec23cb4]
2024-06-18 15:05:30 -05:00
jakpiase
92853de60e Switch to universal gemm in grouped gemm tile loop (#1335)
* switch to universal gemm in grouped gemm tile loop

* minor fixes

* add reviewers comments

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: e2d139201b]
2024-06-18 09:01:49 -05:00
Bartłomiej Kocot
c0eda96fec Fix continuous dim selection in contraction (#1336)
* Fix continuous dim selection in contraction

* Fixes

[ROCm/composable_kernel commit: 933951ed48]
2024-06-18 10:26:49 +02:00
carlushuang
05adcc7f64 [CK_TILE][FA] using pk f16_f32 (#1343)
* [CK_TILE][FA] using pk f16_f32

* correct an error

[ROCm/composable_kernel commit: 17ed368f58]
2024-06-17 17:16:46 +08:00
zjing14
651ce5c272 disabled lds direct load inline asm (#1331)
[ROCm/composable_kernel commit: e02103168a]
2024-06-16 20:33:47 -05:00
Bartłomiej Kocot
5728b06e64 Support large tensors in grouped conv fwd (#1332)
* Support large tensors in grouped conv fwd

* Multi ABD fixes

* Fix calculate element space size

[ROCm/composable_kernel commit: dc1e9c5df9]
2024-06-14 09:53:03 -05:00
Qianfeng
9b0d87fe9a Fix the use of static_for in amd_buffer_addressing.hpp (#1337)
* Add insert_dummy_dep_per_dword over-loading for length 64

* Fix insert_dummy_dep_per_dword and remove over-loading for length 64

* Remove blank lines

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 37a347e380]
2024-06-13 16:12:20 +08:00
Rostyslav Geyyer
9416b16080 Add a convinvscale op, related instances and examples (#1307)
* Update the element op

* Add an example

* Add instances

* Add a client example

* make sure new instances only build on gfx9

* Update element op and its handling

* Format

* Update instances to take element op as an argument

* Update examples to use random scale values

* Format

* Update client example with random scales

* Format

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: ce66277a76]
2024-06-10 14:48:49 -05:00
Bartłomiej Kocot
4716f8f70b Integrate universal gemm with conv forward (#1320)
* Integrate universal gemm with conv fwd

* Fix conv fwd wmma test

* Fix instances

* Remove direct load check

[ROCm/composable_kernel commit: ac58cc5d1d]
2024-06-05 13:01:29 -05:00
Rostyslav Geyyer
692ae331ca Add a scale op, related instances and examples (#1242)
* Add a scale op

* Update the element op

* Add instances

* Add an example

* Add a client example

* Add a flag check

* Revert flag check addition

* Fix flag check

* Update d strides in example

* Update d strides in client example

* Apply suggestions from code review

Update copyright header

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Move the example

* Move the client example

* Update element op

* Update example with the new element op

* Add scalar layout

* Update example

* Update kernel for scalar Ds

* Revert kernel changes

* Update element op

* Update example to use scales' pointers

* Format

* Update instances

* Update client example

* Move element op to unary elements

* Update element op to work with values instead of pointers

* Update instances to take element op as an argument

* Update examples to use random scale values

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: cb0645bedc]
2024-06-04 19:28:15 -05:00