composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 17:00:18 +00:00

Author	SHA1	Message	Date
Po Yen Chen	f7cc8c3b03	Update tuning parameter & compilation options of DeviceGemmXdl<> instance (layout=TT) (#819 ) * Enable pipeline v2 opt for layout=TT instance * Use better thread mapping for reading A tile * Conditionally enable pipeline v2 opt * Allow enabling only fp16 gemm instances in profiler * Fix formatting error * Fix compilation error if we enable fp32 in profiler	2023-08-02 10:32:22 -05:00
carlushuang	e7dca79d27	initial stream-k implementation with example (#699 ) * initial stream-k implementation with example * fix unexpected change in err * improve a little bit performance by reorganize pipeline. * improve perf a little bit by swizzle block idx * add profiler * update example * fix spelling * shrink karg for streamk * support dynamic buffer using memory coherence glc_slc bit from template * control memory coherence while construct dynamic buffer * update reduction for streamk(not ready yet) * Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting * fix build issue * fix several bug * now result is correct, everything works (but has scratch) * remove scratch by manually reset coordinate * update device code * fix a bug in final reduce * fix something in example * update async memset * fix enum as camel case * modify coherence enum name * clean code and use atomic streamk by default * remove unused var * throw exception if have empty pointer * fix format * fix CI warning * fix type in init * modify CI error * filter out on gfx10+ * restore changed example code --------- Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>	2023-07-26 14:18:15 -05:00
Illia Silin	9195435c77	Disable DL kernels by default. (#816 )	2023-07-26 11:06:45 -05:00
Po Yen Chen	f4ea560112	Speed-up global memory reading for GEMM instances (#813 ) * Use better ThreadClusterLengths to speed up * Update B tile reading pattern for layout=NN instance	2023-07-25 18:54:47 -05:00
ltqin	50643dd555	Add bias scalar vectorload = 1 for gemm bias gemm (#791 ) * first change bias load * add bias dim and scalervector parameter * make CDE0BlockTransferSrcVectorDim not work * changse toinstance * add limit for CDE0BlockTransferSrcScalarPerVector	2023-07-24 20:08:15 -05:00
Bartłomiej Kocot	10732847e7	Grouped conv bwd wei NDHWGC/NDHWGK (#804 )	2023-07-21 12:00:55 -05:00
Bartłomiej Kocot	49180fd60b	Grouped 3d conv backward data support (#799 ) * Grouped 3d conv backward data support * Fix comments	2023-07-18 11:01:33 -05:00
Illia Silin	189ea3b9aa	Add mechanism to build CK for select data types, add Navi3x CI. (#790 ) * allow building CK for specific data types * add CI build and test stage on Naiv3x without some int8 instances * add missing gemm fp16 instances * add the changes to the missed cmake file * add empty lines at end of source files * Do not build quantization client example on navi3 in CI * disable batched_gemm_multi_d_int8 instances with DTYPES * disable device_conv2d_bwd_data_instance with DTYPES * fix ckprofiler for conv_bwd_data for int8 * properly isolate the conv_bwd_data int8 instances * remove empty line	2023-07-17 18:02:42 -07:00
Bartłomiej Kocot	1ee99dcaa6	Support NHWGC conv2d_bwd_weight (#769 ) * Support NHWGC conv2d_bwd_weight * Fix client example * Fix client example * Fix comments * Redesign grouped_conv_bwd_weight instances * Clang format fix --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-07-12 08:25:02 -05:00
Po Yen Chen	850144a0d3	Split GEMM instance library & enable pipeline v2 optimization (#783 ) * Move source file into sub-directories * Add missing include directive * Split DeviceGemmXdl<> fp16 instances * Fix format * Remove unnecessary CMakeLists.txt * Add macros to toggle new features * Remove debug message * Turn off GEMM v2 pipeline optimization by default * Fix format * Extract duplicated string as list * Enlarge indent in CMakeLists.txt	2023-07-06 10:59:35 -05:00
Adam Osewski	f4dfc060b7	Move Device Ops implementations into impl directory. (#777 ) Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-07-06 16:15:51 +02:00
Bartlomiej Kocot	2b0b6d9f46	Fix copyrights for DeviceBatchedGemmMultipleD_Dl	2023-07-06 15:50:27 +02:00
Bartłomiej Kocot	63388e84ab	Support bf16/f32/f16 and NHWGC conv2d_bwd_data (#757 ) * Support bf16/f32/f16 and NHWGC conv2d_bwd_data * Add interface test * clang format * Comment fixes * Add more friendly error message	2023-06-21 08:20:31 -05:00
Rostyslav Geyyer	f0c620c42e	FP8 enablement - add a pseudorandom number generator, add conversion methods (#708 ) * Add basic fp8 definitions and prn-generator * Format * Add fp8<->fp32 type_convert * Format * Split type_convert and cast_to/from_f8 * Format * Minor fix * Minor fix * Move fp8 utils to a separate header * Add elementwise ops * Add fp8_convert_sr * Format * Add element op * Eliminate magic numbers * Split f8_convert_sr in host and device * Format * Add some constexpr * Add a datatype test * Format * Another format * Add fp8<->fp16 tests * Update type_converts * Format * Add fp16 casting functions * Format * Use seed as a runtime arg * Use element location for PRNG * Format * Add fp8<->fp16 to PassThrough element op * Clean up * Merge host and device implementations * Add comments on rounding modes * Remove leftover code * Put type_converts into a separate header * Put random number gen to a separate header * Rearrange f8_utils' namespaces * Refactor type_convert.hpp * Move f8_t definition	2023-06-19 11:20:35 -05:00
rocking	341ad95665	Maxpool bwd (#750 ) * Add maxpool f32 kernel and example * Revise copyright * Add device pool bwd device op * Support f16 and bf16 * Add compute datatype for reference code. Prevent error in bf16 * Fix type error * Remove layout * Fix bf16 error * Add f16 and bf16 example * Add more operations * Implement IsSupportedArgument * Add changelog * Add comment * Add comment * Remove useless header * Move initialize of workspace to the run * Move set din zero to the device operator * Save din_length_raw * Remove useless header * Calculate gridsize according to the number of CU * Calculate gridSize according to the number of CU. Remove useless header * Add put example * Remove useless header * Fix CI fail	2023-06-19 09:44:22 -05:00
Qianfeng	0d9118226b	Padded Generic Kernel Instance (#730 ) * Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting * Move the generic kernel instance to be the first of the instance list for elementwise op of normalization * Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax * Add testing of GetGenericInstance() in client_example of Softmax * Revert "Add testing of GetGenericInstance() in client_example of Softmax" This reverts commit `f629cd9a93`. * Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax" This reverts commit `a9f0d000eb`. * Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm * Move generic kernel instance to separate tuple for elementwise op of normalization * Remove un-used files for softmax instance * Store generic kernel instance to separate tuple for softmax * Add IsSupported checking for generic instance to client example of softmax * Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization * clang-format fix * Remove int8 from softmax instances --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-06-16 23:43:11 -05:00
zjing14	309b1c6461	Fixed Weight layout of grouped_conv 3d fwd (#743 ) * Changed wei layout * changed layout for examples * fixed client example --------- Co-authored-by: root <root@ctr-ubbsmc15.amd.com>	2023-06-15 10:19:33 -05:00
Rostyslav Geyyer	54b68eb343	Add generic kernel instances for ck::tensor_operation::device::DeviceGemmMultipleD (#741 ) * Add generic instance gemm_add_add_fastgelu * Add a client example for generic gemm_add_add_fastgelu * Update CMakeLists * Format * Format * Add generic instance gemm_add_fastgelu * Format * Add a gemm_add_fastgelu client example * Format * Add generic instance gemm_fastgelu * Format * Fix argument order * Add gemm_fastgelu client example * Add exceptions if argument is not supported	2023-06-14 16:06:56 -05:00
Bartłomiej Kocot	fc9f97568f	Add DeviceBatchedGemmMultipleD_Dl (#732 ) * Add DeviceBatchedGemmMultipleD_Dl * Fix batched_gemm tests * Fix comments * test_batched_gemm_multi_d fixes * Fix args for isSupported batchedGemmMultipleDDl * Disable tests for gfx90a	2023-06-12 08:37:15 -05:00
ltqin	0ede66de54	Fix flash attn mask bug (#733 ) * add check input parameter * add instance for vector load = 1 * move gerneral instance to first pos * fix read bias code * regular code for bias load --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-06-12 08:35:31 -05:00
Illia Silin	b94fd0b227	update copyright headers (#726 )	2023-05-31 18:46:57 -05:00
Adam Osewski	70e4eb567f	Multiple fixes to GroupedGemm+SplitK (#707 ) * Add license header. * Reduce number of logged output. Add constant initialization. * Add functional tests for grouped_gemm with different kbatch value. * Add debug log informations + remove unused code. * Don't pass kbatch to CalculateKPadded. * Turn on logging in grouped gemm and gemm splitk profiler * Debug: limit number of test cases to run; * Log more information and initialize with constant value. * Turn on DEBUG_LOG * Add more debug log informations. * Limit the number of instances to compile. * Use GridwiseGemmPipeline * Use KBatch to calculate K0 * Multiple DebugLog messages. * Unit tests for multiple KBatch values. * Refactoring * Disable logging * extract out of if statement KBatch update. * Uncomment instances. * Disable DebugLog. * Use Kbatch when calculate KPadded. * Fix CGridDesc padding. * Use available helper functions. * Uncomment code commented for debuggin. * Remove unnecessary debug log messages. * Uncomment previously commented code for debug purposes. * Add KBatch info to profiler output summary log. * Add gtests for gemm splitk using ckProfiler API. * Add more test-cases for different data layout. * Add more test cases for gemm splitk * Remove old test. * Unit tests for MKNK ggemm interface. * Fix and add more unit-tests. * Constepxr everything! * Increase error threshold for fp16 and splitk. Since we're using fp16 atomic add for splitk there's a known precision loss. --------- Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-05-30 07:09:06 -05:00
Bartłomiej Kocot	c2d7a29dec	Add instances for fp16/int8 Gemm kernels (Navi21) (#717 ) * Add instances for fp16/int8 Gemm kernels (Navi21) * Extend instances with smaller tiles * Fix SrcVectorTensor for km_kn_mn int8	2023-05-30 07:07:17 -05:00
Illia Silin	ac9e01e2cc	Clean-up the headers (#713 ) * fix headers for gpu instances * remove unused headers --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-05-24 08:11:25 -07:00
rocking	76ec0089fb	Pool3d fwd (#697 ) * Expand the base class of pool2d, prepare to share base class with pool3d * Add pool3d device op * Add pool3d f16 example * Refactor the base class. implement generic pooling in the future * clang format * get original index in max pooling * Add outputindex to base class * Fix dimension * Add pooling instance * Use indexType instead * Remove useless header * Extract IndexDataType to template * Extract pooling reference code * clang format * clang format * Fix typo * Add tensor stride * Add missing header * Add index stride and output stride * Refine naming * Add type to base class * Rename file * Use proper size * Fix typo * Refine naming * Modify the argument into vector. * Add max pool profiler * Refine naming * Support f32 pool * Fix typo * Add avg pool2d fwd in profiler * clang format * Rename AccDatatype to ComputeDatatype * Fix init * test pool * Extract variable * Add client example * Check the pooling dim * clang format * Connect argv and arg_parser * Add found check * Remove useless header * Refine naming * Adjust the order of device_pool_fwd	2023-05-24 09:05:04 -05:00
Bartłomiej Kocot	642d5e9155	Add contraction profiler and tests (#701 ) * Add contraction profiler and tests * Build and style fixes * Allow to use any elementwise operator for ref_contraction * Introduce profile_contraction_scale and profile_contraction_bilinear * Make ref_contraction generic and extend interface tests * Stylistic minor fixes * Extend test_contraction_interface	2023-05-15 09:46:52 -05:00
Rostyslav Geyyer	b076a02ad2	Optimize bf16 conversion (#664 ) * Add TypeConvert class and start refactoring * Refactor TypeConvert as a struct * Get back to template functions type_convert * Add a type_convert_bf16_rtn, set rtz as default * Clean up * Add UnaryConvertPrecision struct for high-precision workloads * Format * Update type_convert to UnaryConvert on threadwise level * Update UnaryConvertPrecision * Format * Fix chmod * Add a flag to pick converion method * Format * Remove the added flag * Merge elementwise op with type conversion * Move type_convert to elemwise op, update the op * Update type_convert_precision -> bf16_convert_rtn * Clean up * Update comments * Update the CK_WORKAROUND_DENORM_FIX flag handling * Update the unneeded op to work but warn user * Remove the message * Use a PassThrough instead of ConvertBF16RTN to calcaulate reference * Format * Add missing include	2023-05-04 10:25:47 -05:00
Adam Osewski	8bb2bb4a05	Grouped Gemm + SplitK + simplified Kernel Args (#669 ) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * B2C with 3D grid for KSplit * Remove unused code. * Use default B2C (3D grid) in grid gemm v2r4r2. * Device gemm splitk use B2C map. * Device GroupedGemmXdlSplitKCShuffle * Example for GroupedGemm Xdl SplitK * Introduce Device GroupedGemmSplitK * Fix updating kbatch size. * Add instance mk-nk-mn * Enable set kbatch in profiler. * Add GGemmSplitK mk-kn-mn instances * Add more instances & split into multiple files. * minor fix * tuning * clean * disabled failed instances * use pipe v2 * Ignore arg on not supported arch. * fix warning --------- Co-authored-by: carlushuang <carlus.huang@amd.com> Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: Jing Zhang <jizhan@amd.com> Co-authored-by: root <root@ctr-ubbsmc15.amd.com>	2023-04-24 15:43:36 -05:00
rocking	3eecbfb6ec	Revise layout of group convolution (#675 ) * [What] Remove pure conv int8 instance [Why] We will never use pure int8 conv in AI, use int8 quantization instead * Change layout * Share the kernel parameter * Support more type of NHWGC for group conv * Revise client example of conv 2d, use NHWGC layout * Add instance to cmake * Revise layout of group conv quantization instance * Revise layout of external api of group conv quantization * Revise layout of group conv quantization client example * Fix clang format * Add comment to describe meaning of each parameter	2023-04-23 23:40:00 -05:00
Illia Silin	903cd19ce3	Put back the split-k gemm code. (#684 ) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout --------- Co-authored-by: carlushuang <carlus.huang@amd.com>	2023-04-21 19:37:00 -05:00
rocking5566	fd11a4a12a	Add (#677 )	2023-04-17 10:12:10 -05:00
rocking5566	ed3a2e5226	Groupnorm + swish external api (#668 ) * Rename to proper naming * Add example of groupnorm + swish * Extract duplicate code in example * Add groupnorm + swish instances * Ractor instance generation, split into multiple cpp file * Add external api and client example * Refine profiler message * Use ck math version of exp * Refine problem size in example * Add host version of exp	2023-04-10 08:02:17 -05:00
Jun Liu	3248387bbb	Issue #666 : Revert "simplify karg in device/grid of split-k op (#644 )" (#665 ) This reverts commit `bb5530af91`.	2023-04-06 17:14:11 -07:00
zjing14	fde6d2742b	add fp64 instances (#658 ) Co-authored-by: root <root@ctr-ubbsmc15.amd.com>	2023-03-30 13:30:43 -05:00
carlushuang	bb5530af91	simplify karg in device/grid of split-k op (#644 ) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout	2023-03-29 19:03:07 -05:00
rocking5566	389e84a83b	Conv + quantization + tanh (#645 ) * Rename file. Prepare to support another activation * Add comment for quantization * Extract out_elementop * Add tanh example * Add conv + bias + tanh quantization instance * Add missing parameter * Refine cmake * Add external api and client example * Extract variable in example * Fix the comment --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-03-29 14:50:23 -05:00
ltqin	6ae12434d2	workaround 637 (#640 ) * add workaround 637 * format * change id --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-03-20 11:49:31 -05:00
rocking5566	16dc18e0f9	gemm/Conv xdlops + dlops quantization (#625 ) * Add conv perlayer quantization * Add gemm_dlops quantization * Support int8 for innerproduct * Refine gemm dlops int8 kernel parameter * Support gfx908(MI100) and gfx90a(MI200) * clang-format * Rename example number * Support different layout for d tensor * Add conv dlops perchannel quantization example * Move to example 40 * Extract the common code for different platform (dlops and xdlops) * Move ot subfolder. Prepare to add other op of quantization * Refine the quantization instance library * Add conv dl instances and client example * Remove unnecessary type * Add gemm quantization instance * Add external api and client example * Refine num_bytes * Separete different layout to different cpp * Add more xdl instances * Revert "Remove unnecessary type" This reverts commit `820869182f`. * Remove CShuffleDataType in dlops Let acc and CShuffleDataType be the same in xdlops --------- Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-03-15 15:29:40 -05:00
Adam Osewski	9096b1c7b2	GroupedGEMM + Gelu client example/instances/profiler (#614 ) * Grouped gemm + Gelu instances. * Device Instance Factory for GroupedGemm+Gelu * Client example * Rangify fill helper functions. * Fix name clash. * Profiler for grouped_gemm+gelu * No need to use full namespace name. * Add check for MRaw divisible by vector load. * Ugly fix for big errors. * Add grouped_gemm+gelu to profiler CMakelists. * Store in argument additional info. * Information about Mraw, Nraw, Kraw values. * Use FastGelu instead of Gelu. * Change client ex to use FastGelu * Remove relaxed error precision. * Remove duplicate output elementwise-op --------- Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-03-07 22:06:56 -06:00
pmaybank	e4bf6d422e	Generate output using Doxygen / Breathe (#598 ) * Modify Doxygen config to pick up include directories recursively * Add DeviceMem struct to API Reference guide * Add classes that are used in Flash Attention kernel * Add a reference and config for generating bibliography Co-authored-by: Philip Maybank <Philip.Maybank@amd.com>	2023-03-06 11:39:16 -06:00
rocking5566	6a6163a3d1	Improve normalization (#580 ) * Sync the order of type string with template parameter * Add more instances * Check the vector size and remove redundant var * Extract var to static, prepare to separate sweep once kernel * Separate sweeponce flow and optimize the flow * 1. Rename AccDatatype in normalization to computeData 2. Rename AccElementwiseOperation to YElementwiseOperation in normalization * Remove useless code * Update naive variance kernel * Refine string * Fix typo * Support naive variance for device_normalization * Check the blocksize * Share the VGPR of x and y * Share the VGPR of gamma and beta * Add more instances * Support fp16 sqrt for experiment * Add CHANGELOG * Fix typo * clang-format	2023-02-15 11:59:35 -06:00
Adam Osewski	e9fd122889	Conv3D FWD BWD WRW fp16 fp32 client examples (#559 ) * Conv3d bwd weight client example. * Update year in license * Convolution bwd data 3D fp16/fp32 client example. * Client example for convnd fwd fp16 fp32 * clang-format * Review remarks. * Fix compiler err. * Update data layout to standard one. * Add conv 3d fwd NDHWGC instances * clang-format * Conv3d fwd NDHWGC instances. --------- Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-02-15 11:16:47 -06:00
Adam Osewski	8f42780fd6	GroupedGEMM more bigger tiles. (#577 ) * Adding more bigger tiles. * Remove failing instance. * Remove instances which that don't improve perf. --------- Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-02-13 10:06:24 -06:00
rocking5566	f7d28f3e4b	Gemm+layernorm instance, ckProfiler, client example (#568 ) * Add gemm + layernorm instance * Add ckProfiler * Add test * Add client example * Detect if user forger to set the workrspace * Use literal in the example * [What] use builtin function for sqrt [Why] compiler will not use v_sqrt_f64_e64 if we use ::sqrt() * check gemm vaildity in IsSupportedArgument * Add more testcases * Merge duplicated folder in client example * Print more infomation * Use better kernel parameter for MS problem size * clang format * Add constexpr for if condition and remove redundant include * Remove cstdlib and add constexpr	2023-02-09 15:02:55 -06:00
guangzlu	76d144fa7c	Add instance for elementwise normlization (#573 ) * added instances for large N * add instance for elementwise normlization * added supported restrict in device_elementwise_normalization_impl.hpp	2023-02-09 09:37:29 -08:00
ltqin	332ccc3367	Add GemmAddSoftmaxGemm support for MSFT ORT (instances and client API) (#576 ) * add instance for gemm bias softmax gemm * add client example * change CGridDesc_G_M_N to CGridDesc_G_M_O * add gridwise * change c grid name * device add d0s data * fix 08 client_example * add example 47_fused_attention * example output correct * add d0 to example * add d0 element op * rechange instance code * change Acc0ElementwiseOperation to C0DEElementwiseOperation * change example name * update instance for cdeelementwiseop * add bhalf_t ScaleAdd * add test * not surport geem1 bias * remove some ignore * fix test bug	2023-02-08 14:34:45 -06:00
Adam Osewski	7494c1c611	Add more instances for irregular GEMM sizes. (#560 ) Co-authored-by: Adam Osewski <aosewski@amd.com>	2023-01-26 13:42:20 -06:00
Qianfeng	a1b2441f8d	Batchnorm inference instances, external API, client examples and gtests (#531 ) * File renaming and class renaming for device element-wise operation * Add batchnorm-infer instances, external API and client example * Add batchnorm-infer profiler module and gtests * Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp * Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer * Rename class and file due to conflict from device_elementwise_2d.hpp * Fix namespace in batcnnorm_infer_nhwc client example	2023-01-25 17:09:04 -06:00
Qianfeng	52abc2f371	Use double for all scaling values and float-point constant values at the Device Op API (#557 ) * Use double as alpha/beta values type in reduce device op api * Use double as alpha/beta values type in softmax device op api * Use double as alpha/beta values type in multiple-reduce device op api * Use double as epsilon value type in normalization/elementwise-normalization device op api	2023-01-18 12:02:50 -06:00
ltqin	d66421fe34	Add multiD Gemm client APIs (#534 ) * start add example * fix config * fix showinfo bug * add an elementop * change to padding * add xdl example * change elementwiseop * add instance * add instance to profiler * change file name * fix deive not support issue * add client example * fix client gemm_add_multiply name * change AddMultiply elementwiseop * fix elementwiseop * fix client example * fix addmultiply op * fix comments and fun name Co-authored-by: letaoqin <letaoqin@amd.com>	2023-01-18 11:53:56 -06:00


184 Commits