composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-16 19:09:59 +00:00

Author	SHA1	Message	Date
arai713	4acf502f4c	CK Instance Gen (#1145 ) * Format * Format * Format * Remove const * Use the right template * Format * Format * add row/col instances * Add missing file * fixed * fixing block to etile error * Format * Updates * Format * fixed rrr layout * generating a sample JSON file: currently contains includes, prologue/epilogue and instances * version where the json is passed into the instances to generate a key * updated run function to just launch kernel * updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer * adding in testing files * cleaned up comments, still need to work on including header files * removed unneeded files * removed/commented out JSON implementation * added fusion(prologue/epilogue) into instance generation * working on instance selection * added instance selection, need to fix instance validation * removed block2etile map validity check for testing purposes * test running: failing due to incorrect files/input * all grid descs/ptrs completed, but device file not found * Update test and embed modules * Restore older version * added convolution operation, written test, debugging generated code for compilation * attempting to include CK in host directory: _Float16 error * CK header file issues * slight fix * don't crash when hip can't report total memory * dump generated code to a file * changing sizes * creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed * some fixes to call the device code * separating test files for conv and gemm * completed arg ptr, now have linking errors * clang format fix * resolved linker issues in conv test * remove dependency on libutility from ck * resolved num dim error * properly passing arg ptr, errors with passing typenames: redefinition/redeclaration * undo the commenting of device function * hand created kernel 
code to find rtc issues * dump the full src to file * resolved redeclaration errors, cleaned up errors for Amber's kernel code * debugging purposes: redeclaration error * config files * resolved errors for NumTensor and redeclaration, formatted version.h * resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type * WIP: close to getting kernel compiled * WIP: fixing rtc errors * fixed sequence errors, formatting, still one error with run fcn * yay: kernel compiles and runs * updated templated/generated version to run and compile * minor fixes * working generated example, resolved memory access error due to padding * adding in reference kernel, validation failing against reference * debugging: printing kernel argsz * reduced error in results * debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues * working validation (using reference convolution) with prologue function for both hard-coded and generated version * WIP: create an alt version that creates Argument on the device * wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments * wip: making necessary methods device code * added grid descs, working on grid pointers, errors with stl numerics * wip: updating kernel args - issue, replacing some std functions * replaced std::accumulate call with temp hardcoded version * wip: args causing memory issue * Construct Argument object inside the kernel and use it to call convolution device function. 
Code runs and verification passes * adding object file dump * temporary hardcoding of grid size, can remove device op inst + arg ptr * minor fix for grid size * added modified example where arg ptr is created on the device for generated version as well * removed device op instance and arg ptr from modified examples * moving device op file for testing purposes and to properly build CK * commenting out print-outs * adjust compiler args to produce a valid ELF file * temporary removal of validation * reverting compiler args back for working example * retrieve necessary arguments from generated template parameters in correct format * calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly * scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example * temporary change to generate ELF format binary object file * removed unecessary code, added comments * formatting fix * cleaned up code, added new tests, restructured library: move helper into CK * refactored launch parameter calculation to be more concise * renamed files and variables for more clarity/uniformity * more code cleaning, removed debug statements * moved majority of my files into codegen directory, running properly * updated Embed.cmake(string_view) in codegen directory * updated host directory to match Embed.cmake as well * added old tests in * updated instance generation methods to be more concise * removed layout from launch parameter calculation * working test * fixed issue with verification, all instances working * updated verification in other tests * removed duplicate matrix padder file, removed code dumps * removed old hard-coded tests * removed old host directory, all files in codegen directory now * fixed copyright in files * commenting out validation * renamed files * made changes for review: fixed copyright, renamed files for clarity, removed comments, 
refactored code * updated headers * removing duplicate file for fwd conv to gemm, merging with original file * fix building codegen with clang++ directly * resolving build error from conv_fwd_to_gemm * fix for previous error * renaming tests * created common test file * cleaned up code, added comments * renamed device op * fixed typos in comments * removed extra space * code cleanup: resolving Amber's comments * removed wrapper struct for matrix padder, fixed template * cleaned up if statements for better readability --------- Co-authored-by: Paul <pfultz2@yahoo.com> Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `3e9711f0cb`]	2024-06-25 16:37:35 -05:00
rocking	f70826b5fb	layernorm2d forward (#1339 ) * Add layernorm2d forward * Refind file path * clang format * Exclude ck_tile op from all * use add_executable instead * refactor layernorm2d_fwd example --------- Co-authored-by: carlushuang <carlus.huang@amd.com> [ROCm/composable_kernel commit: `cb13839425`]	2024-06-24 08:45:52 +08:00
carlushuang	5cab9c9eac	WA for rocm-6.2+ s constrait for buffer resource (#1346 ) * WA for rocm-6.2+ s constrait for buffer resource * add missing memory clobber [ROCm/composable_kernel commit: `fa129c1a5d`]	2024-06-21 11:00:13 -05:00
Bartłomiej Kocot	cb58db5160	Fix cmake warnings (#1342 ) * Cmake add -Wno-nvcc-compt * Remove template without initialization list * dpp remove template without init list * Fixes [ROCm/composable_kernel commit: `510325a468`]	2024-06-21 09:47:58 +02:00
Dan Yao	10efd2a0b1	Fix FA bwd alibi+causal NaN errors (#1352 ) * fix bwd alibi nan error * fix datatype --------- Co-authored-by: danyao12 <danyao12> [ROCm/composable_kernel commit: `1da802bdf2`]	2024-06-20 09:50:53 -05:00
ThruptiRajLakshmanaGowda	cc15ede67e	Adding Missed Activation Functions for Grouped 2D/3D Convolutions (#1348 ) * Initial Push * First Push * Fixed Clang format * Resolve merge conflict * Addressed review comments * Addressed review comments * Addressed review comments [ROCm/composable_kernel commit: `0162a5f6ba`]	2024-06-20 09:24:54 -05:00
Qianfeng	19c52f8082	Fix in dropout lambda to avoid the compiling issue on some docker/compiler envs (#1350 ) [ROCm/composable_kernel commit: `e3f44659cf`]	2024-06-20 11:36:42 +08:00
Qianfeng	ceabd63e2a	Hacking ck_tile fmha Dropout facility (#1344 ) * Add NullBlockDropout to be used when kHasDropout is false * Change to BlockDropout::Run() for forward to reduce conditional checkings * Re-format files --------- Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `1973903f49`]	2024-06-19 10:37:22 +08:00
Bartłomiej Kocot	6935a2481c	Add read_first_lane function for int64 (#1347 ) [ROCm/composable_kernel commit: `8faec23cb4`]	2024-06-18 15:05:30 -05:00
jakpiase	92853de60e	Switch to universal gemm in grouped gemm tile loop (#1335 ) * switch to universal gemm in grouped gemm tile loop * minor fixes * add reviewers comments --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `e2d139201b`]	2024-06-18 09:01:49 -05:00
Bartłomiej Kocot	c0eda96fec	Fix continous dim selection in contraction (#1336 ) * Fix continous dim selection in contraction * Fixes [ROCm/composable_kernel commit: `933951ed48`]	2024-06-18 10:26:49 +02:00
carlushuang	05adcc7f64	[CK_TILE][FA] using pk f16_f32 (#1343 ) * [CK_TILE][FA] using pk f16_f32 * correct a error [ROCm/composable_kernel commit: `17ed368f58`]	2024-06-17 17:16:46 +08:00
zjing14	651ce5c272	disabled lds direct load inline asm (#1331 ) [ROCm/composable_kernel commit: `e02103168a`]	2024-06-16 20:33:47 -05:00
Bartłomiej Kocot	5728b06e64	Support large tensors in grouped conv fwd (#1332 ) * Support large tensors in grouped conv fwd * Multi ABD fixes * Fix calculate element space size [ROCm/composable_kernel commit: `dc1e9c5df9`]	2024-06-14 09:53:03 -05:00
Qianfeng	9b0d87fe9a	Fix to the using of static_for in amd_buffer_addressing.hpp (#1337 ) * Add insert_dummy_dep_per_dword over-loading for length 64 * Fix insert_dummy_dep_per_dword and remove over-loading for length 64 * Remove blank lines --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `37a347e380`]	2024-06-13 16:12:20 +08:00
Rostyslav Geyyer	9416b16080	Add a convinvscale op, related instances and examples (#1307 ) * Update the element op * Add an example * Add instances * Add a client example * make sure new instances only build on gfx9 * Update element op and its handling * Format * Update instances to take element op as an argument * Update examples to use random scale values * Format * Update client example with random scales * Format --------- Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `ce66277a76`]	2024-06-10 14:48:49 -05:00
Bartłomiej Kocot	4716f8f70b	Integrate universal gemm with conv forward (#1320 ) * Integrate universal gemm with conv fwd * Fix conv fwd wmma test * Fix instances * Remove direct load check [ROCm/composable_kernel commit: `ac58cc5d1d`]	2024-06-05 13:01:29 -05:00
Rostyslav Geyyer	692ae331ca	Add a scale op, related instances and examples (#1242 ) * Add a scale op * Update the element op * Add instances * Add an example * Add a client example * Add a flag check * Revert flag check addition * Fix flag check * Update d strides in example * Update d strides in client example * Apply suggestions from code review Update copyright header Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Move the example * Move the client example * Update element op * Update example with the new element op * Add scalar layout * Update example * Update kernel for scalar Ds * Revert kernel changes * Update element op * Update example to use scales' pointers * Format * Update instances * Update client example * Move element op to unary elements * Update element op to work with values instead of pointers * Update instances to take element op as an argument * Update examples to use random scale values --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `cb0645bedc`]	2024-06-04 19:28:15 -05:00
Dan Yao	26840b623a	CK Tile FA Training kernels (#1286 ) * FA fwd dropout * FA bwd * epilogue reuse * CMakeLists update * [CK_TILE] support alibi (#1269) * add alibi support * fix code * update code based on comment * Support more hdim * fix fp8 bias * support seqlen_k=0 case * remove unused printf * fix format --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> * now fwd/bwd can build * bwd alibi * add bwd validation stream_config * update generated filenames * update bwd kernel launch * CK_TILE_HOST_DEVICE in philox * Transpose -> transpose * format * format * format * Generate the instance for FA required * format * fix error in WarpGemm --------- Co-authored-by: danyao12 <danyao12> Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: rocking <ChunYu.Lai@amd.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: Jing Zhang <jizhan@amd.com> [ROCm/composable_kernel commit: `2cab8d39e3`]	2024-06-04 13:12:45 -05:00
zjing14	9227a76f8e	Post-merge fix of PR 1300 (#1313 ) * add f8 gemm with multiD for both row/col wise * change compute_type to fp8 * changed tuning parameters in the example * add rcr example * post-merge fix * fix * reduce init range [ROCm/composable_kernel commit: `6fb1f4e03f`]	2024-05-31 22:46:41 -07:00
zjing14	96356d2daf	add f8 gemm multiD with both row/col wise scale (#1300 ) * add f8 gemm with multiD for both row/col wise * change compute_type to fp8 * changed tuning parameters in the example * add rcr example [ROCm/composable_kernel commit: `80db62f08d`]	2024-05-28 12:04:22 -05:00
carlushuang	7ff08f6a52	[CK_TILE] support group from cmdline (#1295 ) * support cmdline seqlen decode * silent print * update readme * update kernel launch 3d * update tile partitioner * fix spill for bf16 * modify based on comment * modify payload_t * fix bug for alibi mode * fix alibi test err * refactor kernel launch, support select timer * add missing file * remove useless code * add some comments [ROCm/composable_kernel commit: `5055b3bdcb`]	2024-05-28 11:13:21 +08:00
Bartłomiej Kocot	c6431f6c07	Optimize grouped conv bwd weight for small M and N (#1303 ) * Optimize grouped conv bwd weight for small M and N * Fixes [ROCm/composable_kernel commit: `fd72380aeb`]	2024-05-22 21:01:01 +02:00
Illia Silin	b63dc7b530	aggregate device macros in ck_tile config header (#1297 ) [ROCm/composable_kernel commit: `06b891c5c2`]	2024-05-20 08:34:45 -07:00
Illia Silin	2026ce49e7	replace the ENV macro with CK_ENV (#1296 ) [ROCm/composable_kernel commit: `1274861a9d`]	2024-05-17 10:42:51 -07:00
rocking	545e4e9e77	Fix compile error (#1292 ) error: no viable conversion from returned value of type '__half' to function return type 'fp16_hip_t' (aka '_Float16') Co-authored-by: carlushuang <carlus.huang@amd.com> [ROCm/composable_kernel commit: `aaa8dfdae9`]	2024-05-17 17:19:17 +08:00
Illia Silin	6a57fe4ad4	remove wrong use of nonexistent class members (#1290 ) [ROCm/composable_kernel commit: `c44137838e`]	2024-05-15 08:08:17 -07:00
carlushuang	f16be4051d	remove operator-deref (#1291 ) [ROCm/composable_kernel commit: `dd0dd13d4e`]	2024-05-15 08:06:50 -07:00
jakpiase	c63db2b2ab	Add unit tests for grouped gemm two stage (#1256 ) * add unit tests for grouped gemm two stage * add reviewers suggestions --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `3e3471d5d2`]	2024-05-15 10:03:39 +02:00
Illia Silin	a90f0099fc	Code clean-up (#1285 ) * code clean-up * remove the profiling output samples [ROCm/composable_kernel commit: `566b6480a2`]	2024-05-10 09:41:39 -07:00
Bartłomiej Kocot	b7ee312021	Change output gemm type to AccDataType in two stage conv bwd wei (#1283 ) [ROCm/composable_kernel commit: `8346af9c68`]	2024-05-10 10:57:42 +02:00
Adam Osewski	84627bf589	Fix MakeArgument (#1284 ) [ROCm/composable_kernel commit: `a0ae1c6133`]	2024-05-09 09:42:41 -07:00
Adam Osewski	d395dbb19f	Add vector instruction coherency bits for gfx94 targets. (#1268 ) [ROCm/composable_kernel commit: `3c043cd10b`]	2024-05-09 07:30:17 -07:00
Illia Silin	6b5aca8f3c	fix the output formatting (#1282 ) [ROCm/composable_kernel commit: `fdbf8ccbd7`]	2024-05-08 16:11:54 -07:00
Bartłomiej Kocot	68b2757f11	Add two stage grouped conv bwd weight kernel (#1280 ) [ROCm/composable_kernel commit: `0b6b5d1785`]	2024-05-08 09:53:24 +02:00
Illia Silin	b62c21c3b5	Enable logging in CK with environment variable. (#1278 ) * enable logging using environment variable * update ck.hpp header * fix typo * fix clang format * Update include/ck/utility/env.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `bf42097646`]	2024-05-07 16:26:43 -07:00
carlushuang	7bfe56e5ca	[CK_TILE] support alibi (#1269 ) * add alibi support * fix code * update code based on comment * Support more hdim * fix fp8 bias * support seqlen_k=0 case * remove unused printf * fix format --------- Co-authored-by: rocking <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `851c3ed157`]	2024-05-07 22:32:54 +08:00
Illia Silin	48872cec09	add missing vector header (#1275 ) [ROCm/composable_kernel commit: `08d51d9bc4`]	2024-05-02 11:27:59 -07:00
Rostyslav Geyyer	2733f0aeab	Mark unneeded instances as "getting deprecated" (#1265 ) * Add a flag * Add flag check and messages --------- Co-authored-by: root <root@aus-g7-rogeyyer.amd.com> [ROCm/composable_kernel commit: `6ced3c12ff`]	2024-04-29 12:00:55 -07:00
Haocong WANG	6456722fae	[GEMM] UniversalGemm update (#1262 ) * Add bf16 instances * Add bf16 gemm universal example * tempsave * Add guard to navi compilation * workground on a specific mixed gemm instance ( bring back it when compiler fix upload) * fix formatting condition statement issue * solve conflict --------- Co-authored-by: Jun Liu <Liu.Jun@amd.com> [ROCm/composable_kernel commit: `764164b488`]	2024-04-26 12:56:07 -05:00
Rostyslav Geyyer	078c052109	Add element op (#1259 ) [ROCm/composable_kernel commit: `f044ff71fb`]	2024-04-26 12:55:45 -05:00
zjing14	1c712ea255	bf16A_Int8B with fastgelu/bias (#1264 ) * changed the copy function to v7r2 * adding multi_abd * in-progress * add post-load oob check * debugging * adjust instances * add run_lds * add elemntwise_op * replace multi_abd_device with v3 * clean up * clean * clean * Added LDSType * profiling * adjust oobcheck * add missing file * refactor * clean * add examples [ROCm/composable_kernel commit: `0d0150db20`]	2024-04-26 07:26:30 -05:00
Adam Osewski	672831fa3b	Grouped GEMM Multiple D tile loop. (#1247 ) * Overload output stream operator for LoopScheduler and PiplineVersion * Add Run overload accepting grid descriptors MK. * Add __device__ keyword for CalculateGridSize * Create device op GroupedGemmMultipleD * Add GroupedGemm MultipleD Tile Loop implementation. * Add an example for GroupedGemm MultipleD tile loop. * Device Op GroupedGEMMTileLoop. * Bunch of small changes in exmaple. * CkProfiler * Remove unused tparam. * Fix include statement. * Fix output stream overloads. * Do not make descriptors and check validity untill we find group. * Fix gemm desc initialization. * Revert device op * Fix compilation for DTYPES=FP16 * Validate tensor transfers paramters. * Validate on host only NK dims if M is not known. * Fix bug. * A convenient debug func for selecting threads. * Fix has main k block loop bug. * Make sure that b2c has up to date tile offset. * Output stream operator for Sequence type. * Cmake file formatting. [ROCm/composable_kernel commit: `b4032629e5`]	2024-04-25 15:12:53 -05:00
ltqin	fef66ea961	Universal gemm flush cache (#1251 ) * add flush cache to device op * add flush cache parameter to ckProfiler * change calculate size a and b method * chang evaluation time method foro AVERAGE to MEDIAN * format code * adjust some code * fix core dumped * remove loop call flush icache in kernel * remove loop(outer) call flush icache --------- Co-authored-by: letaoqin <letaoqin@amd.com> [ROCm/composable_kernel commit: `f448d179b7`]	2024-04-25 15:07:14 -05:00
Bartłomiej Kocot	2f333b5225	Fix contraction IsSupported checks (#1257 ) [ROCm/composable_kernel commit: `b1f8ae379b`]	2024-04-23 22:59:39 +02:00
rocking	abc7fe2f5e	Small refactor (#1246 ) * Remove kIsFp8 * Extract alias * Fix K, V and corresponding acc type --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `43879b89e4`]	2024-04-22 20:28:49 +08:00
Bartłomiej Kocot	95a7407a87	Refactor elementwise kernels (#1222 ) * Refactor elementwise kernels * Instances fixes * Fix cmake * Fix max pool bwd test * Update two stage gemm split k * Restore elementwise scale for hiptensor backward compatiblity * Fix Acc data type check in conv fwd multiple abd * Disable conv fp64 fwd example * Update grouped conv weight multi d [ROCm/composable_kernel commit: `ad1597c499`]	2024-04-19 13:31:17 +02:00
Bartłomiej Kocot	d3cdadb6e9	Add grouped conv bwd weight multi d kernel (#1237 ) * Add grouped conv bwd weight multi d kernel * Reference fix * Fix cmake files * bwd weight scale only xdl * Fixes * Fix client conv fwd example [ROCm/composable_kernel commit: `fd923b6d86`]	2024-04-18 23:35:04 +02:00
zjing14	37bbf7a46a	Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978 ) * added an example grouped_gemm_multi_abd * fixed ci * add setElementwiseOp * changed API * clean code: add multiA into example * fixed v7r2 copy * add transpose * clean * fixed vector_load check * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update 
include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * add reduce * testing * add example_b16_i8 * refactor example * clean * add mpading * disable reduce for kbatch = 1 * seperate reduce device op * add reduce op * add guard for workspace_size * add instances * format * fixed * add client example * add a colmajor * add instances * Update cmake-ck-dev.sh * Update profile_gemm_splitk.cpp * Update gridwise_gemm_xdlops_v2r4r2.hpp * format * Update profile_gemm_splitk.cpp * fixed * fixed * adjust test * adjust precision loss * adjust test * fixed * add bf16_i8 scale bias * fixed scale * fixed scale elementwise_op * revert contraction deviceop changes * fixed * Add AddFastGelu * Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example" This reverts commit `3b5d001efd`, reversing changes made to `943199a991`. * add Scales into elementwise * add gemm_multi_abd client example * add client examples * add rcr and crr * add grouped gemm client example * add grouped gemm client example * add instance for rcr crr * format * fixed * fixed cmake * fixed * fixed client_example * format * fixed contraction isSupport * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update device_reduce_threadwise.hpp * clean * Fixes * Fix example --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `12865fbf28`]	2024-04-15 21:09:45 -05:00
carlushuang	db614b49eb	introducing ck_tile! (#1216 ) * enable gfx940 * switch between intrinsic mfma routines on mi100/200 and mi300 * fix mfma_int8 on MI300 * disable 2 int8 examples on MI300 * Update cmake-ck-dev.sh * restore gitignore file * modify Jenkinsfile to the internal repo * Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * initial enablement of gfx950 * fix clang format * disable examples 31 and 41 int8 on gfx950 * add code * fix build wip * fix xx * now can build * naming * minor fix * wip fix * fix macro for exp2; fix warpgemm a/b in transposedC * unify as tuple_array * Update the required Python version to 3.9 * Update executable name in test scripts * re-structure tuple/array to avoid spill * Merge function templates * Fix format * Add constraint to array<> ctor * Re-use function * Some minor changes * remove wrong code in store_raw() * fix compile issue in transpose * Rename enum Rename 'cood_transform_enum' to 'coord_transform_enum' * let more integral_constant->constant, and formating * make sure thread_buffer can be tuple/array * temp fix buffer_store spill * not using custom data type by default, now we can have ISA-level same code as opt_padding * fix compile error, fp8 not ready now * fix fp8 duplicated move/shift/and/or problem * Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode * fix scratch in fp8 kernel * update some readme * fix merge from upstream * sync with upstream * sync 
upstream again * sync 22 * remove unused * fix clang-format * update README of ck_tile example * fix several issue * let python version to be 3.8 as minimal * remove ck_tile example from default cmake target like all/install/check * remove mistake * 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg * fix some bug in group-mode masking and codegen. update README * F8 quantization for FMHA forward (#1224) * Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline * Add element function to fmha api * Adjust P elementwise function * Fix bug of elementwise op, our elementwise op is not inout * Add some elementwise op, prepare to quantization * Let generate.py can generate different elementwise function * To prevent compiler issue, remove the elementwise function we have not used. * Remove f8 pipeline, we should share the same pipeline even in f8 * Remove remove_cvref_t * Avoid warning * Fix wrong fp8 QK/KV block gemm setting * Check fp8 rounding error in check_err() * Set fp8 rounding error for check_err() * Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode * 1. codgen the f8 api and kernel 2. 
f8 host code * prevent warning in filter mode * Remove not-in-use elementwise function kargs * Remove more not-in-use elementwise function kargs * Small refinements in C++ source files * Use conditional_t<> to simplify code * Support heterogeneous argument for binary function types * Re-use already-existing scales<> functor template * Fix wrong value produced by saturating * Generalize the composes<> template * Unify saturates<> implementation * Fix type errors in composes<> * Extend less_equal<> * Reuse the existing template less_equal<> in check_err() * Add equal<float> & equal<double> * Rename check_err() parameter * Rename check_err() parameter * Add FIXME comment for adding new macro in future * Remove unnecessary cast to void * Eliminate duplicated code * Avoid dividing api pool into more than 2 groups * Use more clear variable names * Use affirmative condition in if stmt * Remove blank lines * Donot perfect forwarding in composes<> * To fix compile error, revert generate.py back to `4439cc107d` * Fix bug of p element function * Add compute element op to host softmax * Remove element function in api interface * Extract user parameter * Rename pscale and oscale variable * rename f8 to fp8 * rename more f8 to fp8 * Add pipeline::operator() without element_functor * 1. Remove deprecated pipeline enum 2. Refine host code parameter * Use quantization range as input * 1. Rename max_dtype to dtype_max. 2. Rename scale to scale_s 3.Add init description * Refine description * prevent early return * unify _squant kernel name in cpp, update README * Adjust the default range. 
* Refine error message and bias range * Add fp8 benchmark and smoke test * fix fp8 swizzle_factor=4 case --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com> Co-authored-by: rocking <ChunYu.Lai@amd.com> [ROCm/composable_kernel commit: `db376dd8a4`]	2024-04-15 19:27:12 -05:00

456 Commits