Commit Graph

16 Commits

Author SHA1 Message Date
Aviral Goel
004784ef98 chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template

* Fix build

---------

Co-authored-by: Sami Remes <samremes@amd.com>
2025-11-28 13:49:54 -08:00
Aviral Goel
d85f065b15 chore(copyright): update copyright header for example directory (#3273)
* chore(copyright): update copyright header for codegen directory

* chore(copyright): update copyright header for example directory
2025-11-24 18:02:41 -08:00
John Shumway
ad57f6ef0b [CK_BUILDER] Put global CK functions in an the CK namespace (#3232)
* Wrap ck host utitlies in CK namespace.

The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.

Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.

There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate  kernels from either template library.

* Add using declarations to test code.

After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace.

* Add using declarations to client examples.
2025-11-19 11:23:02 +01:00
Vidyasagar Ananthan
92c67a824f [DOCS] Documentation Addition (Readme updates) (#2495)
* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

Refine README files by removing outdated references to the Tile Engine

* Updates based on PR feedback 1

* Updates based on PR feedback 2

* Updates based on PR feedback 3

* Updates based on PR feedback 4

* Updates based on PR feedback 5

* Updates based on PR feedback 6

* Updates based on PR feedback 7

* Updates based on PR feedback 8

* Content Modification of CK Tile Example

* Modify the ck_tile gemm config

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-16 03:10:57 -07:00
linqunAMD
6e76b82059 Fix build errors on windows (#2456)
* Fix build errors on windows

* correct clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-07-16 07:58:23 -07:00
Bartłomiej Kocot
ad1597c499 Refactor elementwise kernels (#1222)
* Refactor elementwise kernels

* Instances fixes

* Fix cmake

* Fix max pool bwd test

* Update two stage gemm split k

* Restore elementwise scale for hiptensor backward compatiblity

* Fix Acc data type check in conv fwd multiple abd

* Disable conv fp64 fwd example

* Update grouped conv weight multi d
2024-04-19 13:31:17 +02:00
Qianfeng
8f5cafaf04 Batchnorm splitk single kernel (#771)
* Use dim 0 as faster dim for writing mean/var/count workspace in batchnorm multiblock method [performance]

* Add CountDataType as template parameter in blockwise_welford

* Add utility/get_shift.hpp

* Add BatchNorm multiblock single-kernel implementation

* Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a

* Renaming in device_batchnorm_forward_impl.hpp

* Tiny fix in the batchnorm_fwd profiler

* Revert "Add smem inline assembly based implementation of gms_init/gms_barrier/gms_reset for gfx90a"

This reverts commit d16d00919c.

* Use the old two-kernel batchnorm multiblock method for gfx1030

* Use the old two-kernel batchnorm multiblock method for gfx908

* use the single-kernel batchnorm multiblock method only for gfx90a

* Remove get_wave_id() from utility/get_id.hpp since it is not used

* Set true for testing running mean/variance and saving mean/invvariance in the examples

* Fix to copy-right words

* Remove un-needed including in utility/get_id.hpp

* Add comments to workgroup_synchronization.hpp

* Remove un-used codes in gridwise_multiblock_batchnorm_forward.hpp

* Renaming in the kernels

* Remove un-used kernel file
2023-07-06 10:58:55 -05:00
Illia Silin
b94fd0b227 update copyright headers (#726) 2023-05-31 18:46:57 -05:00
Qianfeng
a1b2441f8d Batchnorm inference instances, external API, client examples and gtests (#531)
* File renaming and class renaming for device element-wise operation

* Add batchnorm-infer instances, external API and client example

* Add batchnorm-infer profiler module and gtests

* Remove file device_elementwise_extension.hpp and move NormalizeInInfer operation to element_wise_operation.hpp

* Remove the using of class aliasing for DeviceElementwiseForBatchNormInfer

* Rename class and file due to conflict from device_elementwise_2d.hpp

* Fix namespace in batcnnorm_infer_nhwc client example
2023-01-25 17:09:04 -06:00
Qianfeng
63af525c06 BatchNorm backward instance/external API/profiler/tests (#519)
* Refine the device batchnorm-backward base API templates and data type assignments

* Remove duplicated kernel file

* Add batchnorm backward instances and external API

* Add batchnorm-backward profiler and tests

* Add client example which uses batchnorm backward external API

* Merge test/batchnorm_fwd and test/batchnorm_bwd into one directory

* Loose the threshold for batchnorm-backward check_err()
2022-11-30 13:32:20 -06:00
Qianfeng
44789d992a BatchNorm backward implementation (#461)
* Implemented batchnorm-backward Blockwise and Multiblock kernels

* Add batchnorm-backward device op

* Add batchnorm-backward host-reference op

* Add batchnorm-backward example

* Parameters renaming in batchnorm backward kernels and device op

* Change in the example to loose the threshold for ScaleDiff checking

* Add comments to explain the implementation of batchnorm-backward

* Parameters renaming again in batchnorm backward kernels

* Improve the expression calculation for performance

* Add batchnorm backward to README

* Add comments to explain inv-variance in batchnorm forward and backward

* Renaming the batchnorm forward training and inferring examples

* Add/update the comments for batchnorm-backward kernels

* Renaming again

* Add block_sync_lds between two consecutive blockwise reductions

* Move common expression 1/N out of the static_for loops

* Add dy_elementwise_op

* Renaming in backward example again

* Add checking for reduceDims in reference_batchnorm_backward

* Update to comments and codes format

* Rename in the comments

* Remove common expression out of the loop in reference_batchnorm_backward_nhwc_c

* Add block_sync_lds() between blockwise reduction again

* Fix comments again

* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
2022-11-28 20:51:10 -06:00
Qianfeng
4e6a5575be BatchNorm forward instance/external api/profiler/tests/client example (#511)
* Update to device_batchnorm_forward base class to include all template parameters for problem description

* Add batchnorm forward instances and external api

* Add batchnorm forward profiler module which uses the external api

* Add some comments in batchnorm_forward example to explain the dimensions in lengths[]

* Replace the reference_batchnorm_forward_nhwc_c by generic reference_batchnorm_forward

* Improvement to the batchnorm infer base API

* Add batchnorm forward client example which shows using the batchnorm forward external API

* Add test for batchnorm forward

* Tuning the batchnorm profiler initialized values and error threshold

* Add support for bhalf_t in instances/external api/tests

* Add support for int8_t in instances/external api/tests

* Add support for double in instances/external api/tests

* Let ScaleDataType and BiasDataType be same as XDataType and YDataType when creating instances

* Checking before running best instance in batchnorm_fwd_nhwc client example

* Add checking for YElementwiseOp in batchnorm_forward external API

* Add more types in batchnorm forward profiler

* Add more test lengths

Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
2022-11-24 18:02:27 -06:00
Po Yen Chen
4a2a56c22f Rangify constructor of HostTensorDescriptor & Tensor<> (#445)
* Rangify STL algorithms

This commit adapts rangified std::copy(), std::fill() & std::transform()

* Rangify check_err()

By rangifying check_err(), we can not only compare values between
std::vector<>s, but also compare any ranges which have same value
type.

* Allow constructing Tensor<> like a HostTensorDescriptor

* Simplify Tensor<> object construction logics

* Remove more unnecessary 'HostTensorDescriptor' objects

* Re-format example code

* Re-write more HostTensorDescriptor ctor call
2022-11-11 11:36:01 -06:00
Qianfeng
7fa892e63e Batchnorm-forward implemented using welford method to calculate variance (#403)
* Update to the batchnorm-forward API and base class

* Fix leeked header including in gridwise_set_buffer_value.hpp

* Add kernels and device file for batchnorm-forward welford supporting both blockwise and multi-block reduction

* Update to the batchnorm-forward example to use the new batchnorm-forward device interface

* Change the batchnorm-forward reference to use sequential welford method

* Change to assign the workspace into four buffers in the host layer

* Use GetReduceCountPerThread functor to replace the initial count for Blockwise and Multiblock welford

* Tiny correction and remove un-used file under example/34_batchnorm

* Renaming in the kernel arguments

* Explicitly use ck::math::sqrt in batchnorm-forward kernels

* Add some comments to some kernels

* Tiny fix

* Generalize the data types in reference_batchnorm_forward_nhwc_c

* Use ck::ignore to mark un-used parameters

* Move GetReduceCountPerThread functor codes from kernel to device

* Remove some un-used codes in device_batchnorm_forward_impl.hpp

* Tiny fix in batchnorm_forward example

* Move GetReduceCountPerThread() to welford_helper.hpp

* Use seperate data type for Scale and Bias

* Renaming in device Op

* Tiny fix in forward example

* Updata to batchnorm-infer (type spliting, renaming)

* Add time and bandwidth measurement to the batchnorm-forward example

* Add support of elementwise operation for batchnorm forward output

* Reduce object copying by passing object as reference type

* Tiny change for performance

* Updates for performance again

* Some Renamings

* Add GetActualVariance template parameter for ThreadwiseWelfordMerge

* Tiny update in reference batchnorm forward nhwc/c

* Move batchnorm multiblock kernel files to grid/batchnorm_multiblock sub-directory

* Fuse mean and bias in the normalization calculation

Co-authored-by: root <root@dc-smc-18.amd.com>
Co-authored-by: rocking5566 <ChunYu.Lai@amd.com>
2022-10-27 18:52:54 -06:00
Adam Osewski
3048028897 Refactor device op implementations into impl subdirectory. (#420)
* Move kernel implementation files under impl directory.

* Update examples paths.

* Update device kernel impl include paths.

* Update tensor operation instances include paths.

* Update profiler and tests include paths.

* Clang-format

* Update include paths for batched gemm reduce

* Refactor UnitTest ConvNDBwdWeight.

* Refactor fwd and bwd data convND UT.

* Fix used test macro.

* Fix include path.

* Fix include paths.

* Fix include paths in profiler and tests.

* Fix include paths.

Co-authored-by: Adam Osewski <aosewski@amd.com>
2022-10-13 09:05:08 -05:00
Qianfeng
53ea4713af Batchnorm-forward and Batchnorm-infer Implemented using generic kernels (#320)
* Implement multiple-reduction in one kernel (kernels, device ops, examples)

* Add generic elementwise kernel and device interface

* Add generator for normal-distributed data initialization

* Add host refer implementation of batchnorm-forward and batchnorm-infer

* Add examples for implementing batchnorm-forward and batchnorm-infer using generic kernels

* Remove un-needed including in batchnorm example

* Renaming generic_elementwise to elementiwise in kernel and device classes/functions

* Change in gemm_layernorm examples to use DeviceElementwise instead of Device5AryElementwise

* Change in exampe 19_binary_elementwise to use DeviceElementwise instead of DeviceBinaryElementwise

* Change in device_cgemm_4gemm_xdl_cshuffle.hpp to use kernel_elementwise instead of kernel_binary_elementwise

* Add DeviceElementwiseBase and use it in device_normalize_instance.cpp

* Removing and renaming files

* Update to synchronize gemm_layernorm client example to the generic element-wise device op API

* Update to synchronize with the latest headers directory and HostTensorDescriptor interface renaming

* Merge two static member functions in device_elementwise.hpp

* Remove unary_elementwise_1d kernel and device
2022-08-15 10:11:02 -05:00