composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-28 18:56:59 +00:00

Author	SHA1	Message	Date
Muhammed Emin Ozturk	9e95d54cd2	BF16 GEMM Stream-K (#1541 ) * initial * Cmake file * successfull compilation but validation failed * Cmake * update * gpu validation * gemm universal * gemm universal sk update * sk bf16 universal instance * gemm_universal_streamk.hpp * only build for gfx94 * Cmakelist * profiler update, bf16 sk only works at gfx42 * clang * clang * clang all * no need flags * cmake script * delete comment * gemm universal sk fix * clang * profiler fix * clang * update * update * delete comment * code formatting * cmake * fix instance * clang * argument supported * argument supported and clang * update * fix * removing unnecessary comments * clang formatting * Update library/src/tensor_operation_instance/gpu/CMakeLists.txt Co-authored-by: afagaj <john.afaganis@gmail.com> * CopyRight Comment 2025 * clang reformatting * copy right 2025 --------- Co-authored-by: Emin Ozturk <ozturk.27@osu.edu> Co-authored-by: root <root@ctr-ubbsmc16.amd.com> Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund> Co-authored-by: root <root@splinter-126-wr-d3.amd.com> Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund> Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund> Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund> Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu> Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund> Co-authored-by: afagaj <john.afaganis@gmail.com>	2025-01-02 10:30:04 -08:00
Adam Osewski	1d8e4ec2ce	Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762 ) * add a prototype of int4 * clean * debug * clean * clean * move packed into dynamic_buffer * fixed coord reset * add fast pki4 to half conversion * fix * fixed reference and host_tensor * fixed tensor init * format * debug i4_to_f16_convert * format * fixed splitk * weight permute * add b tile permute * clean * weight permute with splitki * format * improve weight layout * add and_or_b32 * fixed splitk crush * add permute switch as a template * recover v3r1 * clean * failure with intrawave v2 * fixed * fixed * add ckProfiler * add bfp16 support * add bf16 example * fixed int4 to bhalf_t conversion * format * fixed int4 to bf16 conversion * clean * add instances for mem * clean * fixed host tensor size * fixed * debug * fixed * add pk_i4_t as a struct * fix * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * revert * Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * fixed comments * revert * clean * revert * revert * fixed * Update CMakeLists.txt * Update 
script/cmake-ck-dev.sh Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * Update CMakeLists.txt Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> * fixed * fixed * fixed * revert * revert * add comments * format * fixed assert * fixed * Fix I4 define in ckProfiler * Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue --------- Co-authored-by: Jing Zhang <jizhan@fb.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: mtgu0705 <mtgu@amd.com>	2025-01-02 11:48:06 +08:00
Illia Silin	fdfe210230	upgrade sqlalchemy version (#1748 ) * upgrade sqlalchemy version * replace the connection with engine in to_sql call * change the hipTensor ctest syntax	2024-12-15 16:25:21 -08:00
Illia Silin	355893cdd8	Refactor CI performance tests. (#1726 ) * merge the build and performance tests CI stages together * add gemm performance test on gfx11/gfx12 * add suffices to distinguish gemm performance logs from different archs * use smaller gemm set in CI for gfx10/gfx11/gfx12 * disable performance tests on gfx1030 * fix the shashing logic * fix finding python3 for mha instances	2024-12-06 13:04:25 -08:00
Illia Silin	5fb150dbe7	restore collecting performance of mixed prec gemms (#1648 )	2024-11-11 09:25:08 -08:00
Bartłomiej Kocot	82fc53835a	Enable grouped conv bwd wei bf16 NGCHW (#1589 ) * Enable grouped conv bwd wei bf16 NGCHW * fixes * fixes * Fixes * fixes * fixes * Fixes	2024-10-22 16:18:28 +02:00
Po Yen Chen	0c094daa7e	[CK_TILE] Update example README files & fix script compatibility issue (#1548 ) * Fix text alignment of ArgParser::print() * Update example README files * Clarify make-ck-dev.sh <arch> usage * Only keep some of the argument from '-?' output * Undo command line output changes in README * Only keep existing argument on doc and update description * Fix text alignment * Make cmake-ck-*.sh compatible with 'sh' command	2024-10-08 10:45:12 +08:00
Po Yen Chen	a1c07e8d91	[CK_TILE] Change output accum tensor layout of fmha fwd split-kv & combine kernels (#1527 ) * Use same layout for o_acc and o tensor * Use better param names in partitioner * Remove redundant kargs 'max_seqlen_q' * Use better param names in splitkv kernel * Add comment for additional kernel arguments * Sync empty loop early return logics between pipelines * Pass more arguments to cmake in scripts * Align backslashes * Fix wrong o_acc tensor view strides * Change o_acc layout if o_perm=0 * Handle whole row masked via attn_bias * Use use vector width = 1 for o_acc * Use more even split sizes	2024-10-01 22:13:52 +08:00
Bartłomiej Kocot	4ba52b35dc	Add support for NGCHW in grouped conv fwd (#1499 ) * Support NGCHW in grouped conv fwd * Remove not needed variable * Fixes	2024-09-20 10:45:46 +02:00
Bartłomiej Kocot	73b67f290f	Add support for NGCHW in grouped conv bwd wei (#1491 ) * Add support for NGCHW in grouped conv bwd wei * Comments fixes * navi fixes * Update function names	2024-09-03 10:52:03 +02:00
Bartłomiej Kocot	dc82daa86e	Convert MIOpen driver to ckProfiler script typos fix (#1476 )	2024-08-20 19:04:14 +02:00
Bartłomiej Kocot	a6a7966505	Add script to convert MIOpen driver to ckProfiler (#1472 ) * Add script to convert MIOpen driver to ckProfiler * Fix	2024-08-19 08:24:56 -07:00
Bartłomiej Kocot	2581727d2a	Add performance and large tensor tests for grouped conv (#1456 ) * Add performance and large tensor tests for grouped conv * Resize tests * Resize tests * update the python script to parse the grouped_conv results * Remove int8 tests * change bwd wei layout --------- Co-authored-by: illsilin <Illia.Silin@amd.com>	2024-08-16 07:48:30 -07:00
Mateusz Ozga	ab60b390f8	Rewrite sh reduce unit tests to gtest: part 1 (#1407 ) Rewrite .sh test to Gtest * review chnages * Removew unused comments * Review v2 * Typo * Separete UT: AMAX, MAX, MIN; added template params to trigger them * Update test/reduce/reduce_no_index.cpp --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>	2024-08-12 16:28:10 +02:00
Illia Silin	12c1f68dd9	Run CK_TILE FMHA benchmarks and collect the performance data. (#1447 ) * run ck_tile benchmarks after the smoke tests and store logs * change the path of fmha benchmark logs * change the way of stashig ck_tile fmha logs * prevent the errors in stages where no logs are generated * fix the ck_tile fmha log names and headers * generate the fmha performance logs in the root folder * change jenkins scrip arguments format * use exact file names for stashing * modify scripts to process FMHA performance results * unstash FMHA logs before parsing them	2024-08-07 08:18:26 -07:00
Bartłomiej Kocot	5d8c3d8190	Revert Support access per groups and filter2x3 in grouped conv fwd (#1382 ) (#1406 )	2024-07-22 14:21:24 +02:00
Andriy Roshchenko	eb44e0472a	Add ckProfiler support for forward 3D convolutions with OUT element-wise operations. (#1354 )	2024-07-08 10:55:54 -07:00
Harisankar Sadasivan	75e622f02f	Universal streamk with atomics (#1360 ) * universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). * Update README.md * fixing clang-format issues * removed conflicts in struct members between streamk and universal streamk * corrected arg parsing for streamk and universal streamk * added stream-k policies for 3 tile and 4 tile * fixed argument type issue with parsing cmd args * changes suggested in PR review are made- removing comments and correcting copyright * file permissions updated * added default value support for grid_size and streamk-policy selection set to -1 * print messages for arguments * print messages for arguments * print messages for arguments1	2024-07-05 21:40:30 -07:00
Illia Silin	566b6480a2	Code clean-up (#1285 ) * code clean-up * remove the profiling output samples	2024-05-10 09:41:39 -07:00
carlushuang	db376dd8a4	introducing ck_tile! (#1216 ) * enable gfx940 * switch between intrinsic mfma routines on mi100/200 and mi300 * fix mfma_int8 on MI300 * disable 2 int8 examples on MI300 * Update cmake-ck-dev.sh * restore gitignore file * modify Jenkinsfile to the internal repo * Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * initial enablement of gfx950 * fix clang format * disable examples 31 and 41 int8 on gfx950 * add code * fix build wip * fix xx * now can build * naming * minor fix * wip fix * fix macro for exp2; fix warpgemm a/b in transposedC * unify as tuple_array * Update the required Python version to 3.9 * Update executable name in test scripts * re-structure tuple/array to avoid spill * Merge function templates * Fix format * Add constraint to array<> ctor * Re-use function * Some minor changes * remove wrong code in store_raw() * fix compile issue in transpose * Rename enum Rename 'cood_transform_enum' to 'coord_transform_enum' * let more integral_constant->constant, and formating * make sure thread_buffer can be tuple/array * temp fix buffer_store spill * not using custom data type by default, now we can have ISA-level same code as opt_padding * fix compile error, fp8 not ready now * fix fp8 duplicated move/shift/and/or problem * Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode * fix scratch in fp8 kernel * update some readme * fix merge from upstream * sync with upstream * sync 
upstream again * sync 22 * remove unused * fix clang-format * update README of ck_tile example * fix several issue * let python version to be 3.8 as minimal * remove ck_tile example from default cmake target like all/install/check * remove mistake * 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg * fix some bug in group-mode masking and codegen. update README * F8 quantization for FMHA forward (#1224) * Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline * Add element function to fmha api * Adjust P elementwise function * Fix bug of elementwise op, our elementwise op is not inout * Add some elementwise op, prepare to quantization * Let generate.py can generate different elementwise function * To prevent compiler issue, remove the elementwise function we have not used. * Remove f8 pipeline, we should share the same pipeline even in f8 * Remove remove_cvref_t * Avoid warning * Fix wrong fp8 QK/KV block gemm setting * Check fp8 rounding error in check_err() * Set fp8 rounding error for check_err() * Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode * 1. codgen the f8 api and kernel 2. 
f8 host code * prevent warning in filter mode * Remove not-in-use elementwise function kargs * Remove more not-in-use elementwise function kargs * Small refinements in C++ source files * Use conditional_t<> to simplify code * Support heterogeneous argument for binary function types * Re-use already-existing scales<> functor template * Fix wrong value produced by saturating * Generalize the composes<> template * Unify saturates<> implementation * Fix type errors in composes<> * Extend less_equal<> * Reuse the existing template less_equal<> in check_err() * Add equal<float> & equal<double> * Rename check_err() parameter * Rename check_err() parameter * Add FIXME comment for adding new macro in future * Remove unnecessary cast to void * Eliminate duplicated code * Avoid dividing api pool into more than 2 groups * Use more clear variable names * Use affirmative condition in if stmt * Remove blank lines * Donot perfect forwarding in composes<> * To fix compile error, revert generate.py back to `4439cc107d` * Fix bug of p element function * Add compute element op to host softmax * Remove element function in api interface * Extract user parameter * Rename pscale and oscale variable * rename f8 to fp8 * rename more f8 to fp8 * Add pipeline::operator() without element_functor * 1. Remove deprecated pipeline enum 2. Refine host code parameter * Use quantization range as input * 1. Rename max_dtype to dtype_max. 2. Rename scale to scale_s 3.Add init description * Refine description * prevent early return * unify _squant kernel name in cpp, update README * Adjust the default range. 
* Refine error message and bias range * Add fp8 benchmark and smoke test * fix fp8 swizzle_factor=4 case --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> Co-authored-by: carlushuang <carlus.huang@amd.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com> Co-authored-by: rocking <ChunYu.Lai@amd.com>	2024-04-15 19:27:12 -05:00
Haocong WANG	f83e9701e9	[GEMM] Gemm universal device operation (#1154 ) * Optimize GEMM on MI200/300: 1. Add new blockwise gemm pipeline 2. Add irregular splitk intances * clang format + typo fix * Fix a bug * initial commit * Add more instances to irregular splitk * blkgemm pipeline v1~4 prototype * Sanity Checked. Known issue: 1. Poor performance of splitk 2. Register spill on blkgemmpipeline v3 * Sanity and Performance fix: 1. fix a bug related to sanity in grouped b2c mapping 2. fix a bug related to sanity and performance in splitk offset * Sanity and API update: 1. Remove prefetch stage 2. Fix valid check bug 3, Add first gemm_universal instance into ckProfiler * Add NN instances for gemm universal * 1. Add NT instances for gemm_universal 2. Fix a bug about Kpadding in gemm_universal * Fix a bug regarding padding Odd K number * remove kernel print * Fix KPadding bug... * Update safety check * another try to fix kpadding.. * Sanity checked * new instances.. * clang format+typo fix * remove clang format script's change * Add non-hotloop compile option * 1. Add fp16xfp8 example 2. pull packed convert f8 from pr1150 * Some miscs.. opt and fix * Add pipeline description docs * Split universal gemm instance library to cut profiler compiling time * uncomment cmakefile * Fix a bug caused by blockwise_gemm_pipe_v2 * reduce default splitk to 1 * Add 224x256x64 tile size * update, including: 1. Experiment pipeline 5~7 2. Optimization for pipeline 4 3. Organized instance library * temp save * temp save * Permuted lds layout, sanity and function checked * clang format * Move OOB check from RunRead to RunWrite, for better software pipeline. 
TODO: agpr spill when NN layout * clangformat * A/B splitpipe scheduler for v3 * Fix two bugs * bug fix * fix a bug in oob check * Example for mixed fp16_fp8 gemm * Clean experimental code blocks * Add mixed precision gemm into profiler * tempsave * optimize m/n major lds layout * Add RRR GEMM mixed precision instances * Optimize f8 matrix transpose * Add test_gemm_universal * A/B spilt schedule for blkpip v5 * Take ds_read2 into iglp scheduling scheme * format * fixed cmake * Add llvm-option into CI cmake flag --------- Co-authored-by: Jing Zhang <jizhan@amd.com>	2024-04-13 21:03:18 -05:00
Bartłomiej Kocot	9c052804a7	Add elementwise with dynamic vector dim (#1198 ) * Add elementwise with dynamic vector dim * Reduce number of instaces * Fixes * Fixes	2024-03-22 10:40:43 +01:00
Illia Silin	bdcd037428	Re-enable the performance tracking in CI. (#1203 ) * test CK with rocm6.1 RC2 * add docker credentials for pull * update the performance db name * use environment variable for db name * add rocm-llvm-dev package to ck docker * turn off verification for daily performance runs * do not stash ckProfiler on MI300 node * add processing of mixed gemms to qa, fix parsing of splitk gemm logs * fix the splitk gemm log file name * turn the timing on for splitk gemm performance	2024-03-18 09:48:29 -07:00
Illia Silin	112b691bb7	add new performance tests for mixed fp16/fp8 gemms (#1151 )	2024-01-31 13:27:17 -08:00
Illia Silin	180e572076	Fixing most of the cppcheck errors. (#1142 ) * fix cppcheck errors, first pass * fix format * fix returned value in examples * add macro definitions for cppcheck * fix the profile_gemm logic * update the gemm profiler logic * add more difinitions to cppcheck, fix couple more errors * replace runtime error with message in device function * fix a couple of int4 issues * no return for fill function * fix errors in data_types.hpp * fix format * fix few remaining errors * fix errors in data_types.hpp * fix last couple of errors in datat_types.hpp	2024-01-24 13:47:48 -08:00
Illia Silin	68f2b5e7c7	add linker script to QA builds (#1030 )	2023-11-08 17:53:45 -08:00
zjing14	98fd41f597	Add Gemm instances for performance improvement (#1018 ) * improve kpad * more tuning parameters * f16_f8_fp16 * cut test time * add f16_f8_fp16 * add f16_f8_f16 * testing instances for skinny cases * format * clean * add fp16_f8_fp16 * clang-format * add grouped gemm instalces * fixed profile grouped_gemm * clean * clean * clean * clean * clean * add missing instance func * fixed inferface --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: root <root@sh5-1e707-rc06-38.mkm.dcgpu>	2023-11-07 09:09:58 -06:00
Illia Silin	4e44a9e8da	Enable sccache in the default docker and CI. (#1009 ) * replace ccache with sccache, pin package versions * put ccache back temporarily to avoid breaking other CI jobs * add sccashe_wrapper.sh script * fix the package version syntax * fix the pymysql package issue * run sccache_wrapper before build if ccache server found * set the paths before calling the sccache_wrapper * use /tmp instead of /usr/local for cache * try using sccache --start-server instead of wrapper * try using redis server with sccache * define SCCACHE_REDIS * add redis and ping packages, and redis port * use the new sccache redis server * do not use sccache with staging compiler * fix the condition syntax * add stunnel to redis * add tunnel verification * separate caches for different architectures * fix syntax for the cache tag * quse double brackets for conditions * add bash line to the script * add a switch for sccache and only use it in build stage * run check_host function when enabling sccache * fix the invocation tags for sccache * fix groovy syntax * set the invocation tag in groovy * disable sccache in clang-format stage * try another syntax for invocation tags * use local sccache server if can't connect to redis * fix script syntax * update README * refresh readme * readme updates * remove the timing and verification caveat from readme --------- Co-authored-by: Lisa Delaney <lisa.delaney@amd.com>	2023-10-30 13:16:29 -07:00
zjing14	c99323be6e	Revert "Grouped Gemm with looping over the tiles. (#788 )" (#982 ) This reverts commit `a4f72a314a`.	2023-10-11 14:27:29 -05:00
Adam Osewski	a4f72a314a	Grouped Gemm with looping over the tiles. (#788 ) * Introduce LocalBlockToCTileMap. * Change the signature of CalculateBottomIndex() function which now does not accept any argument. The B2C map which is already passed as an argument to the kernel Run function is calculating block's local id already outside at kernel entry point __global__ function. The LocalB2C map stores as members local block ID. * Use LocalBlockToCTile map in device ops. * First draft of tile loop work distribution. * Fix typo. * Simplify kernel arguments. Calculate descriptors & B2C maps on the device. * Use looping kernel. * Fix B2C constructor. * Fix Navi21 errors. * Calculate tile start/end in device kernel. * Change Run API to accept user provided workspace buffer. * Add new line at EOF. * Move Gemm KernelArguments to device op interface. * Remove unused code. * Update API. * Launch grid size which is min of occupancy vs tile count * Get back to use constant memory for gemm descriptors. * Remove unused code. * Add default virtual method implementation. * Update comments to conform with doxygen style. * Fix doc style and unused parameters. * Add thread cluster lengths to kernel name. * Remove old splitk impl and replace it with tile looping one. * Modify instances. * set KPerBlock to 64 * maximize wherever possible vector load size. * Fix instances cluster lengths. * Change comment style. * Use 128b store where possible in instances. * Update test cases, since KPerBlock has doubled. * Update output stream operator for Sequence. * Add pipeline version to GroupedGEMM device op type string. * Fix pipeline version type logging. * Fix input tensors type after merge. * Fix compiler error. * Fix output stream operator for Pipeline version. * Store using 128b. * Set of instances with kpb 32/64 * Limit number of instances * Remove commented out instances. * Fix function name. * Limit the number of instances. 
Add pipline version to the regular instances * Change thr cluster layout for reading B tensor. * disabled failed instances --------- Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: Jing Zhang <jizha@amd.com>	2023-10-10 22:21:15 -05:00
zjing14	f5ec04f091	Grouped Gemm with Fixed K and N with SplitK (#818 ) * move all arguments into device * add b2c_tile_map * add examples * add SetDeviceKernelArgs * dedicated fixed_nk solution * init client api * add grouped_gemm_bias example * add a instance * add instances * formatting * fixed cmake * Update EnableCompilerWarnings.cmake * Update cmake-ck-dev.sh * clean; fixed comments * fixed comment * add instances for fp32 output * add instances for fp32 output * add fp32 out client example * fixed CI * init commit for kbatch * add splitk gridwise * format * fixed * clean deviceop * clean code * finish splitk * fixed instances * change m_loops to tile_loops * add setkbatch * clean code * add splitK+bias * add instances * opt mk_nk instances * clean examples * fixed CI * remove zero * finished non-zero * clean * clean code * optimized global_barrier * fixed ci * fixed CI * removed AddBias * format * fixed CI * fixed CI * move 20_grouped_gemm to 21_grouped_gemm --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-08-31 09:22:12 -05:00
Jun Liu	c8a8385fdd	[HotFix] add config and version files to pass on build info (#856 ) * experiment with config file * experiment with version.h config * add more info to version.h * minor updates * minor updates * fix case where DTYPE is not used * large amount of files but minor changes * remove white space * minor changes to add more MACROs * fix cmakedefine01 * fix issue with CK internal conflict * fix define and define value * fix clang-format * fix formatting issue * experiment with cmake * clang format v12 to be consistent with miopen * avoid clang-format for config file	2023-08-23 11:36:17 -07:00
Bartłomiej Kocot	7761e5232c	Add s_nops after v_dot to avoid hazard (#808 ) * Add s_nops after v_dot to avoid hazard * Fix builtin for inner_produxt fp16 * Skip inline version to builtin * Add comments regarding isa * Fix comment regarding s_nop	2023-07-27 13:29:44 -05:00
Adam Osewski	237f9cd3aa	Add basic setup for precommit (#749 ) (#764 ) * Add basic setup for precommit * Update README.md with instructions on installing precommit hooks --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Bartlomiej Wroblewski <bwroblewski10@gmail.com>	2023-07-06 11:01:06 -05:00
Illia Silin	d140bdc9fa	do not build gfx941/942 targets during daily QA runs (#758 )	2023-06-16 12:13:16 -07:00
Illia Silin	027e46ee82	Enable gfx941 and gfx942 architectures. (#752 ) * enable gfx941/942 targets * fix clang format * fix the cmake logic for multiple targets * fix cmake syntax for looping over targets * add gfx941/942 support for gemm_xdl instances	2023-06-15 08:20:59 -07:00
Illia Silin	4feebedd41	Syncing up from internal repo to enable MI300. (#690 ) * enable gfx940 * switch between intrinsic mfma routines on mi100/200 and mi300 * fix mfma_int8 on MI300 * disable 2 int8 examples on MI300 * Update cmake-ck-dev.sh * restore gitignore file * modify Jenkinsfile to the internal repo --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-04-28 18:22:59 -05:00
Haocong WANG	4e097ad283	Add CMake Option "USE_OPT_NAVI3X" (#647 ) * Add CMake Option "USE_OPT_NAVI3X" * remove navi3x opt compile option from cmake script	2023-03-29 14:07:33 -05:00
Rostyslav Geyyer	fa998675fc	Update cmake-ck-dev.sh script (#641 ) Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>	2023-03-15 18:38:11 -05:00
zjing14	209baee299	disable tensor contraction f64 on MI100 (#602 )	2023-02-23 16:59:37 -08:00
zjing14	24c9ee1d22	Add contraction_fp64 example (#570 ) * add contraction_bilinear * add contraction_scale_xdl_fp64 * reduce tile size to avoid register spill --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com>	2023-02-15 12:00:58 -06:00
Illia Silin	f73574ffdd	Fix CI issues. (#572 ) * switch to recent staging compiler as default for CI * fix the baseline query * roll back sqlalchemy to version 1.4.46	2023-02-06 13:15:45 -06:00
rocking5566	226bc02b73	Conv perlayer int8 quantization (#471 ) * Add conv2d requant example * Fix bash error * Rename example * 1. Rename gemm quantization 2. shares the requantization lambda function with conv * Refine declare type * Add conv bias relu quantization exmaple * clang format * Fix compile error due to merge develop * Fix CI error * Extract quantization post operation into another file * Support quantization for non piecewise linear function * Add instance for conv quantization * Add convolution quantization factory * Add convolution quantization client example * Add more instances with different template parameters * clang format * Sync the naming with the develop	2022-11-02 13:56:26 -06:00
Illia Silin	0ee3aea16a	fix the script parsing the QA results (#495 )	2022-10-26 10:25:27 -06:00
Chao Liu	473ba5bc4a	update document: Readme, contributors, citation, (#463 ) * update cmake script * update readme * Update README.md * add citation * add images * Update README.md * update * Update README.md * Update CONTRIBUTORS.md * Update README.md * Update CITATION.cff * Update README.md * Update CITATION.cff	2022-10-03 00:48:24 -05:00
Illia Silin	85b0920dc8	Build the CK targets only once. (#433 ) * build CK only once, use deb package in all subsequent stages * update jenkins file * change prefix for build_CK stage * update writing deb metadata to control file * update ubuntu source for docker, script syntax for deb package metadata * try different way to create deb metadata * clean up DEBIAN before creating one * fix the CI folder names, fix splitK qa * use correct docker in all stages, separate tests for splitK verification and performance * clean old comments, change dir before packaging * use different package syntax * change packaging syntax * package with cmake * remove unnecessary build prefix * get rid of unnecessary paths * change paths during unpacking * change script syntax while unpacking * get rid of unneccesary steps * get rid of comments in the scripts * use double quotes for scripts * add ccache during build, try dpkg -x * pull and install each package separately * use full package names * try to use stashing for packages * change stash/unstash syntax * move unstash out of shell, run tests on any gpu node * unpack each package separately * try re-using existing workspace * merge the build and test stages, only stash ckProfiler * merge the build and test stages, only stash zipped ckProfiler * fix syntax * add GPU check before build and test, rename docker to usual name	2022-09-21 14:30:13 -05:00
Illia Silin	b22ebd4485	Upgrade the OS and ROCM versions. (#411 ) * upgrade the OS and ROCM versions in CK docker * add cxx flags to link code with rocm5.2 and ck-9110 compiler * rename the docker image * run ONNX gemms using init=1	2022-09-13 10:39:14 -05:00
Illia Silin	ce74cea407	Add stderr to QA logfiles, process splitK and ONNX gemm kernels (#402 ) * add processing for the onng_gemm and splitK_gemm * add profile_onnx_gemm.sh * add stderr to logfiles, add splitK and onnx gemm parsing * enable splitK gemm wresults posting to db	2022-09-07 13:59:44 -05:00
zjing14	9881625b2d	Fixed splitk gemm fp32 (#384 ) * add scripts * fixed splitK_gemm_fp32 * clean * clean	2022-08-26 09:59:50 -05:00
Adam Osewski	3ab20fd753	GEMM batched/splitK/cgemm/grouped int4 examples (#383 ) * Grouped GEmm int4. * Formatting + fix K dimension for int8. * Batched Gemm int4 example. * CGEMM int4 example. * Include inc filese in clang-format. * SplitK int4 example * Refactoring of performance measurement. * Fix #ifdef statements. Co-authored-by: Adam Osewski <aosewski@amd.com>	2022-08-25 17:19:15 -05:00

156 Commits