composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 17:00:18 +00:00

Author	SHA1	Message	Date
Bartlomiej Kocot	2b0b6d9f46	Fix copyrights for DeviceBatchedGemmMultipleD_Dl	2023-07-06 15:50:27 +02:00
Rostyslav Geyyer	61dc9aa932	Add the missing archs (#785 )	2023-07-05 18:29:56 -05:00
Rostyslav Geyyer	1cf5003179	Add fp8 GEMM and an example for it (#767 ) * Add fp8 xdl gemm * Add example * Use int8 intrinsics for buffer load/store * Format * Update cmakelists	2023-07-04 20:38:49 -06:00
Illia Silin	7797bd3d2b	Upgrade default docker to ROCM5.6 release. (#778 ) * upgrade default compiler to rocm5.6 release * do daily runs with rocm5.6 instead of 5.5	2023-06-30 08:06:54 -07:00
Illia Silin	d3adc66581	Add rocm5.6 RC4 and rocm5.7 to docker build options. (#770 ) * upgrade to rocm5.6 rc4 * add rocm5.7 docker	2023-06-28 08:58:28 -05:00
Illia Silin	3b18f1e38c	do not build gfx941/942 targets during CI (#766 )	2023-06-21 10:47:35 -07:00
Bartłomiej Kocot	63388e84ab	Support bf16/f32/f16 and NHWGC conv2d_bwd_data (#757 ) * Support bf16/f32/f16 and NHWGC conv2d_bwd_data * Add interface test * clang format * Comment fixes * Add more friendly error message	2023-06-21 08:20:31 -05:00
ltqin	32d2f52bf7	remove useless comments (#760 )	2023-06-19 19:25:08 -07:00
zjing14	05ea6452b6	changed pipeline v1 (#763 )	2023-06-19 19:24:18 -07:00
Illia Silin	645eb2f2a0	do not build gemm-gemm and conv-conv examples for gfx94* (#761 ) * do not build gemm-gemm and conv-conv examples for gfx94* * do not build gemm-gemm and conv-conv examples on navi	2023-06-19 16:55:03 -07:00
Rostyslav Geyyer	f0c620c42e	FP8 enablement - add a pseudorandom number generator, add conversion methods (#708 ) * Add basic fp8 definitions and prn-generator * Format * Add fp8<->fp32 type_convert * Format * Split type_convert and cast_to/from_f8 * Format * Minor fix * Minor fix * Move fp8 utils to a separate header * Add elementwise ops * Add fp8_convert_sr * Format * Add element op * Eliminate magic numbers * Split f8_convert_sr in host and device * Format * Add some constexpr * Add a datatype test * Format * Another format * Add fp8<->fp16 tests * Update type_converts * Format * Add fp16 casting functions * Format * Use seed as a runtime arg * Use element location for PRNG * Format * Add fp8<->fp16 to PassThrough element op * Clean up * Merge host and device implementations * Add comments on rounding modes * Remove leftover code * Put type_converts into a separate header * Put random number gen to a separate header * Rearrange f8_utils' namespaces * Refactor type_convert.hpp * Move f8_t definition	2023-06-19 11:20:35 -05:00
rocking	341ad95665	Maxpool bwd (#750 ) * Add maxpool f32 kernel and example * Revise copyright * Add device pool bwd device op * Support f16 and bf16 * Add compute datatype for reference code. Prevent error in bf16 * Fix type error * Remove layout * Fix bf16 error * Add f16 and bf16 example * Add more operations * Implement IsSupportedArgument * Add changelog * Add comment * Add comment * Remove useless header * Move initialize of workspace to the run * Move set din zero to the device operator * Save din_length_raw * Remove useless header * Calculate gridsize according to the number of CU * Calculate gridSize according to the number of CU. Remove useless header * Add put example * Remove useless header * Fix CI fail	2023-06-19 09:44:22 -05:00
Qianfeng	0d9118226b	Padded Generic Kernel Instance (#730 ) * Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting * Move the generic kernel instance to be the first of the instance list for elementwise op of normalization * Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax * Add testing of GetGenericInstance() in client_example of Softmax * Revert "Add testing of GetGenericInstance() in client_example of Softmax" This reverts commit `f629cd9a93`. * Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax" This reverts commit `a9f0d000eb`. * Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm * Move generic kernel instance to separate tuple for elementwise op of normalization * Remove un-used files for softmax instance * Store generic kernel instance to separate tuple for softmax * Add IsSupported checking for generic instance to client example of softmax * Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization * clang-format fix * Remove int8 from softmax instances --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-06-16 23:43:11 -05:00
Illia Silin	d140bdc9fa	do not build gfx941/942 targets during daily QA runs (#758 )	2023-06-16 12:13:16 -07:00
Illia Silin	027e46ee82	Enable gfx941 and gfx942 architectures. (#752 ) * enable gfx941/942 targets * fix clang format * fix the cmake logic for multiple targets * fix cmake syntax for looping over targets * add gfx941/942 support for gemm_xdl instances	2023-06-15 08:20:59 -07:00
zjing14	309b1c6461	Fixed Weight layout of grouped_conv 3d fwd (#743 ) * Changed wei layout * changed layout for examples * fixed client example --------- Co-authored-by: root <root@ctr-ubbsmc15.amd.com>	2023-06-15 10:19:33 -05:00
Qianfeng	c5f6ec842c	Using number of compute units to set gridSize (#754 ) * Add getAvailableComputeUnitCount() interface * Use available number of compute units to set kernel grid size	2023-06-15 10:13:59 -05:00
Illia Silin	d1838d328c	Fix the daily CI job with latest staging compiler. (#753 ) * fix CI builds with latest staging compiler * remove mount flags from dockerfile	2023-06-14 16:44:13 -07:00
Rostyslav Geyyer	54b68eb343	Add generic kernel instances for ck::tensor_operation::device::DeviceGemmMultipleD (#741 ) * Add generic instance gemm_add_add_fastgelu * Add a client example for generic gemm_add_add_fastgelu * Update CMakeLists * Format * Format * Add generic instance gemm_add_fastgelu * Format * Add a gemm_add_fastgelu client example * Format * Add generic instance gemm_fastgelu * Format * Fix argument order * Add gemm_fastgelu client example * Add exceptions if argument is not supported	2023-06-14 16:06:56 -05:00
Rostyslav Geyyer	a35456a3f4	Fix arg order (#751 )	2023-06-12 08:38:46 -05:00
Bartłomiej Kocot	fc9f97568f	Add DeviceBatchedGemmMultipleD_Dl (#732 ) * Add DeviceBatchedGemmMultipleD_Dl * Fix batched_gemm tests * Fix comments * test_batched_gemm_multi_d fixes * Fix args for isSupported batchedGemmMultipleDDl * Disable tests for gfx90a	2023-06-12 08:37:15 -05:00
Po Yen Chen	7c24654c24	Fix incomplete object size (=4n + 3) support of amd_wave_read_first_lane() (#738 ) * Fix wrong pointer type * Rename type trait get_unsigned_int<> to get_carrier<> * Add 3-bytes carrier type * Add missing __device__ specifier * Rename template non-type parameter * Leave the rest byte uninitialized * Avoid invoking (host) STL algorithms * Remove unnecessary 'inline' specifier * Extract common logic out as helper method * Hide dummy member function * Add missing __device__ specifier	2023-06-12 08:36:40 -05:00
ltqin	0ede66de54	Fix flash attn mask bug (#733 ) * add check input parameter * add instance for vector load = 1 * move general instance to first pos * fix read bias code * regular code for bias load --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-06-12 08:35:31 -05:00
carlushuang	016ebaa7f3	support dynamic buffer using memory coherence glc_slc bit from template (#725 )	2023-06-08 07:40:29 -05:00
Illia Silin	1dd455d633	Update docker (#744 ) * update dockerfile to build rocm5.6 rc3 * fix couple of docker issues	2023-06-07 09:35:14 -07:00
Illia Silin	4036590401	fix clang format (#740 )	2023-06-02 14:10:02 -07:00
who who who	e2ebc8e795	replace hipMemcpy with hipMemcpyWithStream (#734 )	2023-06-01 16:23:41 -05:00
Po Yen Chen	9eae73df9b	Simplify kernel argument of device operator Device(Batched)GemmXdl<> (#723 ) * Remove M/N/KPad local variables * Use M/N/KPad to name padded lengths * Replace duplicated local variable by parameters * Rename variables M/N/KRaw to M/N/K * Move AK0/BK0 compute logic into GridwiseGemm * Use macro to shorten code * Move CalculateGridSize() logic into GridwiseGemm * Add comment to credit the implementation source * Reuse the existing implementation * Remove no-longer used data members * Remove elementwise-op objects from interfaces * Reserve kernel arg as whole object in interfaces * Remove redundant data member * Make 3rd type parameter optional * Remove unnecessary type parameters * Remove no-longer used descriptor-creation methods * Move kernel arg type definition into GridwiseGemm * Add macro to switch between code sections * Move argument field computing logic into device op side * Make utility method 'static' * Declare special methods * Unify MakeArgument() usage * Adapt the new GridwiseGemm interface * Push-down class 'GridwiseGemm::Argument' fields * Remove no-longer used methods * Add unused parameters * Force copying parameters in 'Embed' ctor * Remove no-longer used descriptors * Fallback change on BaseArgument * Remove macro 'INTEGER_DIVIDE_CEIL' * Make variable naming more consistent * Make sure methods are only invoked on right place * Remove trailing underscore in public attribute name * Remove necessary methods * Hide computing logic of derived attributes * Make new 'Embed' ctor only available for device code * Make sure 'Embed' type args are not references * Move check for karg.K into CheckValidity() * Remove more integer division logic from device code * Undo changes on Embed * Separate 'Problem' concept out from 'Argument' * Add overloaded version of __builtin_amdgcn_readfirstlane() * Remove 'static' specifiers * Remove more 'static' specifier * Replace unsigned char by std::byte * Add 'const' specifier to never changing variable * Add 'inline' specifier to function definition * Share same name for kernel interfaces * Fix wrong boundary calculation logic * Leave the third template arg for compatibility * Remove unnecessary parameters * Fix wrong error message (for type name) * Create descriptor on device side * Fix wrong debug message * Remove no-longer used data members * Rename type trait * Remove std:: qualifier from standard types * Replace 'size_t' by 'unsigned' * Use type alias to hint usage * Replace static_for<> by ordinary 'for' loop * Reject unsupported argument * Rename readfirstlane() to amd_wave_read_first_lane() * Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp * Update function calls * Reorder statements * Re-format files --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-06-01 16:23:02 -05:00
Illia Silin	b94fd0b227	update copyright headers (#726 )	2023-05-31 18:46:57 -05:00
Po Yen Chen	582e31e88d	Add class type support for __builtin_amdgcn_readfirstlane() (#711 ) * Add overloaded version of __builtin_amdgcn_readfirstlane() * Remove 'static' specifiers * Remove more 'static' specifier * Replace unsigned char by std::byte * Add 'const' specifier to never changing variable * Add 'inline' specifier to function definition * Fix wrong boundary calculation logic * Rename type trait * Remove std:: qualifier from standard types * Replace 'size_t' by 'unsigned' * Use type alias to hint usage * Replace static_for<> by ordinary 'for' loop * Rename readfirstlane() to amd_wave_read_first_lane() * Rename file readfirstlance.hpp as amd_wave_read_first_lane.hpp * Reorder statements	2023-05-31 10:25:25 -05:00
Haocong WANG	6eef0755c9	fix wmma gemm int8; add grouped conv int8 example (#716 )	2023-05-30 07:18:53 -05:00
Po Yen Chen	1344a0f25b	Simplify kernel argument of device operator DeviceGemm_Xdl_CShuffle<> (#696 ) * Remove M/N/KPad local variables * Use M/N/KPad to name padded lengths * Replace duplicated local variable by parameters * Rename variables M/N/KRaw to M/N/K * Move AK0/BK0 compute logic into GridwiseGemm * Use macro to shorten code * Move CalculateGridSize() logic into GridwiseGemm * Add comment to credit the implementation source * Reuse the existing implementation * Remove no-longer used data members * Remove elementwise-op objects from interfaces * Reserve kernel arg as whole object in interfaces * Remove redundant data member * Make 3rd type parameter optional * Remove unnecessary type parameters * Remove no-longer used descriptor-creation methods * Move kernel arg type definition into GridwiseGemm * Add macro to switch between code sections * Move argument field computing logic into device op side * Make utility method 'static' * Declare special methods * Unify MakeArgument() usage * Adapt the new GridwiseGemm interface * Push-down class 'GridwiseGemm::Argument' fields * Remove no-longer used methods * Add unused parameters * Force copying parameters in 'Embed' ctor * Remove no-longer used descriptors * Fallback change on BaseArgument * Remove macro 'INTEGER_DIVIDE_CEIL' * Make variable naming more consistent * Make sure methods are only invoked on right place * Remove trailing underscore in public attribute name * Remove necessary methods * Hide computing logic of derived attributes * Make new 'Embed' ctor only available for device code * Make sure 'Embed' type args are not references * Move check for karg.K into CheckValidity() * Remove more integer division logic from device code * Undo changes on Embed * Separate 'Problem' concept out from 'Argument' * Share same name for kernel interfaces * Reject unsupported argument --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-05-30 07:09:55 -05:00
Adam Osewski	70e4eb567f	Multiple fixes to GroupedGemm+SplitK (#707 ) * Add license header. * Reduce number of logged output. Add constant initialization. * Add functional tests for grouped_gemm with different kbatch value. * Add debug log information + remove unused code. * Don't pass kbatch to CalculateKPadded. * Turn on logging in grouped gemm and gemm splitk profiler * Debug: limit number of test cases to run; * Log more information and initialize with constant value. * Turn on DEBUG_LOG * Add more debug log information. * Limit the number of instances to compile. * Use GridwiseGemmPipeline * Use KBatch to calculate K0 * Multiple DebugLog messages. * Unit tests for multiple KBatch values. * Refactoring * Disable logging * extract out of if statement KBatch update. * Uncomment instances. * Disable DebugLog. * Use Kbatch when calculating KPadded. * Fix CGridDesc padding. * Use available helper functions. * Uncomment code commented for debugging. * Remove unnecessary debug log messages. * Uncomment previously commented code for debug purposes. * Add KBatch info to profiler output summary log. * Add gtests for gemm splitk using ckProfiler API. * Add more test-cases for different data layout. * Add more test cases for gemm splitk * Remove old test. * Unit tests for MKNK ggemm interface. * Fix and add more unit-tests. * Constexpr everything! * Increase error threshold for fp16 and splitk. Since we're using fp16 atomic add for splitk there's a known precision loss. --------- Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-05-30 07:09:06 -05:00
Bartłomiej Kocot	c2d7a29dec	Add instances for fp16/int8 Gemm kernels (Navi21) (#717 ) * Add instances for fp16/int8 Gemm kernels (Navi21) * Extend instances with smaller tiles * Fix SrcVectorTensor for km_kn_mn int8	2023-05-30 07:07:17 -05:00
Illia Silin	ac9e01e2cc	Clean-up the headers (#713 ) * fix headers for gpu instances * remove unused headers --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-05-24 08:11:25 -07:00
rocking	76ec0089fb	Pool3d fwd (#697 ) * Expand the base class of pool2d, prepare to share base class with pool3d * Add pool3d device op * Add pool3d f16 example * Refactor the base class. implement generic pooling in the future * clang format * get original index in max pooling * Add outputindex to base class * Fix dimension * Add pooling instance * Use indexType instead * Remove useless header * Extract IndexDataType to template * Extract pooling reference code * clang format * clang format * Fix typo * Add tensor stride * Add missing header * Add index stride and output stride * Refine naming * Add type to base class * Rename file * Use proper size * Fix typo * Refine naming * Modify the argument into vector. * Add max pool profiler * Refine naming * Support f32 pool * Fix typo * Add avg pool2d fwd in profiler * clang format * Rename AccDatatype to ComputeDatatype * Fix init * test pool * Extract variable * Add client example * Check the pooling dim * clang format * Connect argv and arg_parser * Add found check * Remove useless header * Refine naming * Adjust the order of device_pool_fwd	2023-05-24 09:05:04 -05:00
Illia Silin	d821d1e54f	Enable gemm_dl and other kernels on Navi3x. (#714 ) * enable dl kernels on navi3 * do not build xdl tests and examples on Navi * run tests before building everything on jenkins * disable gemm_bilinear on gfx1030 * add gpu targets to installer on Navi * put tests in the same order as before * reduce the number of navi targets in CI * build CI installed for gfx940 as well * only build for MI300 during QA runs	2023-05-23 11:23:16 -05:00
Sam Wu	3cff340423	Documentation Updates (#710 ) * update documentation dependencies add version number to docs rename doc config directories enable more doc formats on rtd add license section in docs	2023-05-18 11:08:38 -06:00
Bartłomiej Kocot	642d5e9155	Add contraction profiler and tests (#701 ) * Add contraction profiler and tests * Build and style fixes * Allow to use any elementwise operator for ref_contraction * Introduce profile_contraction_scale and profile_contraction_bilinear * Make ref_contraction generic and extend interface tests * Stylistic minor fixes * Extend test_contraction_interface	2023-05-15 09:46:52 -05:00
rocking	a1e344b1ae	Normalization/split k (#615 )	2023-05-11 07:15:02 -05:00
Rostyslav Geyyer	b076a02ad2	Optimize bf16 conversion (#664 ) * Add TypeConvert class and start refactoring * Refactor TypeConvert as a struct * Get back to template functions type_convert * Add a type_convert_bf16_rtn, set rtz as default * Clean up * Add UnaryConvertPrecision struct for high-precision workloads * Format * Update type_convert to UnaryConvert on threadwise level * Update UnaryConvertPrecision * Format * Fix chmod * Add a flag to pick conversion method * Format * Remove the added flag * Merge elementwise op with type conversion * Move type_convert to elemwise op, update the op * Update type_convert_precision -> bf16_convert_rtn * Clean up * Update comments * Update the CK_WORKAROUND_DENORM_FIX flag handling * Update the unneeded op to work but warn user * Remove the message * Use a PassThrough instead of ConvertBF16RTN to calculate reference * Format * Add missing include	2023-05-04 10:25:47 -05:00
Illia Silin	b8635a25b2	Fix the group of quantization_int8 kernels on MI300. (#695 ) * replace amd_buffer_atomic_add with hip_atomic_add * fix grouped_gemm_splitk kernels on mi300 * fix syntax * revert experimental atomic_add changes * fix the group of kernels from ticket 723 on MI300 --------- Co-authored-by: Jing Zhang <jizhan@amd.com>	2023-05-03 18:27:04 -05:00
Illia Silin	4a51d2da9d	Fix grouped_gemm_splitk kernels on MI300. (#694 ) * replace amd_buffer_atomic_add with hip_atomic_add * fix grouped_gemm_splitk kernels on mi300 * fix syntax * revert experimental atomic_add changes --------- Co-authored-by: Jing Zhang <jizhan@amd.com>	2023-05-03 08:25:25 -07:00
Illia Silin	86e0190ec9	update daily build from rocm 5.4.3 to 5.5 (#693 )	2023-05-03 08:18:10 -07:00
zjing14	f53ede26e5	fixed init range (#691 )	2023-05-02 08:30:23 -07:00
Illia Silin	4feebedd41	Syncing up from internal repo to enable MI300. (#690 ) * enable gfx940 * switch between intrinsic mfma routines on mi100/200 and mi300 * fix mfma_int8 on MI300 * disable 2 int8 examples on MI300 * Update cmake-ck-dev.sh * restore gitignore file * modify Jenkinsfile to the internal repo --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> rocm-5.7.1 rocm-5.7.0	2023-04-28 18:22:59 -05:00
Haocong WANG	54c90aae13	add vector load check (#680 ) Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-04-26 15:58:57 -05:00
Jun Liu	7613c1d9b9	[CK] suppress unsafe buffer warn (#687 ) incomplete fix from https://github.com/ROCmSoftwarePlatform/composable_kernel/pull/670 So it does not only happen in gtest but also in CK code: We need to fix them as a quality improvement, but for now suppressing this warning in immediate releases: http://compiler-ci.amd.com/blue/rest/organizations/jenkins/pipelines/compiler-psdb-amd-stg-open/runs/2540/nodes/282/steps/3202/log/?start=0 e.g. ``` [2023-04-26T17:26:31.524Z] /jenkins/workspace/compiler-psdb-amd-stg-open/Libs/MIOpen/deps_hip/cget/build/tmp-a3db5da587a64213bde99fb856db1b43/composable_kernel-0f98035df1cc5ba3e90ab03187e672b426a25b00/include/ck/utility/generic_memory_space_atomic.hpp:52:19: error: unsafe pointer arithmetic [-Werror,-Wunsafe-buffer-usage] [2023-04-26T17:26:31.524Z] atomicAdd(c_style_pointer_cast<float>(p_dst) + 1, vx.template AsType<float>()[I1]); [2023-04-26T17:26:31.524Z] ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` ``` [2023-04-26T17:26:31.523Z] /jenkins/workspace/compiler-psdb-amd-stg-open/Libs/MIOpen/deps_hip/cget/build/tmp-a3db5da587a64213bde99fb856db1b43/composable_kernel-0f98035df1cc5ba3e90ab03187e672b426a25b00/include/ck/utility/amd_inline_asm.hpp:62:20: error: 'p_a_half2' is an unsafe pointer used for buffer access [-Werror,-Wunsafe-buffer-usage] [2023-04-26T17:26:31.523Z] const half2_t p_a_half2 = c_style_pointer_cast<const half2_t*>(&a); [2023-04-26T17:26:31.523Z] ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ```	2023-04-26 15:41:03 -05:00
Adam Osewski	8bb2bb4a05	Grouped Gemm + SplitK + simplified Kernel Args (#669 ) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * B2C with 3D grid for KSplit * Remove unused code. * Use default B2C (3D grid) in grid gemm v2r4r2. * Device gemm splitk use B2C map. * Device GroupedGemmXdlSplitKCShuffle * Example for GroupedGemm Xdl SplitK * Introduce Device GroupedGemmSplitK * Fix updating kbatch size. * Add instance mk-nk-mn * Enable set kbatch in profiler. * Add GGemmSplitK mk-kn-mn instances * Add more instances & split into multiple files. * minor fix * tuning * clean * disabled failed instances * use pipe v2 * Ignore arg on not supported arch. * fix warning --------- Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: Jing Zhang <jizhan@amd.com> Co-authored-by: root <root@ctr-ubbsmc15.amd.com>	2023-04-24 15:43:36 -05:00
zjing14	8b9cbba823	reduce initial number for half_t splitk (#685 )	2023-04-24 08:07:39 -05:00

922 Commits