Commit Graph

12 Commits

Author SHA1 Message Date
Aviral Goel
004784ef98 chore(copyright) update library wide CMakeLists.txt copyright header template (#3313)
* chore(copyright) update library wide CMakeLists.txt files copyright header template

* Fix build

---------

Co-authored-by: Sami Remes <samremes@amd.com>
2025-11-28 13:49:54 -08:00
Aviral Goel
d85f065b15 chore(copyright): update copyright header for example directory (#3273)
* chore(copyright): update copyright header for codegen directory

* chore(copyright): update copyright header for example directory
2025-11-24 18:02:41 -08:00
John Shumway
ad57f6ef0b [CK_BUILDER] Put global CK functions in an the CK namespace (#3232)
* Wrap ck host utitlies in CK namespace.

The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.

Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.

There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate  kernels from either template library.

* Add using declarations to test code.

After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace.

* Add using declarations to client examples.
2025-11-19 11:23:02 +01:00
Vidyasagar Ananthan
92c67a824f [DOCS] Documentation Addition (Readme updates) (#2495)
* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

* GH-2368 Adding a basic glossary

GH-2368 Minor edits

GH-2368 Adding missing READMEs and standardization.

resolving readme updates

GH-2368 Minor improvements to documentation.

Improving some readmes.

Further improvement for readmes.

Cleaned up the documentation in 'client_example' (#2468)

Update for PR

Update ACRONYMS.md to remove trivial terms

Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Apply suggestion from @spolifroni-amd

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.

revise 37_transpose readme

revise 36_copy readme

Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.

Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.

Remove references to the Tile Engine in README files across multiple examples

Refine README files by removing outdated references to the Tile Engine

* Updates based on PR feedback 1

* Updates based on PR feedback 2

* Updates based on PR feedback 3

* Updates based on PR feedback 4

* Updates based on PR feedback 5

* Updates based on PR feedback 6

* Updates based on PR feedback 7

* Updates based on PR feedback 8

* Content Modification of CK Tile Example

* Modify the ck_tile gemm config

---------

Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-10-16 03:10:57 -07:00
Michał Kulikowski
ffe9775e70 [CK][EXAMPLES] (#2826)
-Added parameter to enable/disable verification and timing of kernel in various examples that missed it.
-Added parameter to change number of groups to execute in grouped_gemm_examples.

Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>
2025-09-11 12:33:00 -07:00
Illia Silin
504b101da3 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
zjing14
bf435140dc Clean DTYPES conditions in CMake (#974)
* Add a condition to build fp8 instances

* simplified buffer_load/store

* add bfp8/fp8

* fixed

* remove all f8/bf8 condition include folder

* fixed cmake conditions

* fixed DTYPES=fp16/bfp16

* fix

* fixed buffer_load

* fixed buffer_store

* fix

* clean example cmake files

* fixed ci

* fixed cit

---------

Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
2023-10-18 11:14:14 -05:00
Illia Silin
bba085d2b5 Refactoring cmake files to build data types separately. (#932)
* refactor cmake files for the tests

* refactor cmake files for examples

* fix cmake for gemm example

* fix the cmake file for all examples

* add splitting by data types in gemm_splitk instance header

* rename test to reflect only dl instances are used

* clean up CI workspace, update cmake for instances

* change the jenkinsfile syntax

* build all instances except DL on gfx11

* move workspace cleanup after stages

* clean up workspace after every stage

* isolate data types in grouped_conv_fwd header

* isolate dl instances for grouped_conv2d_fwd

* fix syntax

* fix cmake and batchnorm instances

* fix typo

* fix reduction instances

* fix grouped_conv headers

* fix syntax

* replace parsing logic for instances, replace bfp16 with bf16

* fix the client examples build

* clean up DTYPES from instances cmake files

* update the parsing logic in cmake files

* make an exception for reduction kernels

* update few remaining cmake files to handle DTYPES

* fix syntax

* fix cmake conflicts

* replace f8 with fp8 test name

* resolve conflicts for dpp instances
2023-09-20 22:15:56 -07:00
Illia Silin
08eb176929 Allow building CK for specific data types and split off last remaining DL instances. (#830)
* properly split conv_nd_bwd_data instances

* split conv2d_fwd instance data types

* split the gemm, conv2d_fwd and batched_gemm_softamx_gemm

* split the tests by data types where possible

* filter examples by DTYPES

* split few remaining examples by DTYPES

* filter most instances by DTYPES

* add new lines at end of headers, fix grouped_gemm profiler

* fix syntax

* split the ckprofiler instances by DTYPES

* split the conv2d and quantization DL and XDL instances

* fix the splitting of conv2d DL instances

* split softmax and pool_fwd tests for fp16 and fp32 types

* fix syntax

* fix the dl_int8 quantization instances isolation
2023-08-07 14:56:10 -07:00
Illia Silin
b94fd0b227 update copyright headers (#726) 2023-05-31 18:46:57 -05:00
Po Yen Chen
4a2a56c22f Rangify constructor of HostTensorDescriptor & Tensor<> (#445)
* Rangify STL algorithms

This commit adapts rangified std::copy(), std::fill() & std::transform()

* Rangify check_err()

By rangifying check_err(), we can not only compare values between
std::vector<>s, but also compare any ranges which have same value
type.

* Allow constructing Tensor<> like a HostTensorDescriptor

* Simplify Tensor<> object construction logics

* Remove more unnecessary 'HostTensorDescriptor' objects

* Re-format example code

* Re-write more HostTensorDescriptor ctor call
2022-11-11 11:36:01 -06:00
Po Yen Chen
f584ab0c54 Add 'Permute' device op & example (#408)
* Add example folder for 'DeviceElementwise'

* Re-structure example files

* Move common parts into common.hpp

* Use more strict input

* Add more helper methods in 'DeviceElementwise'

* Use more specific method to write example

* Allow specify problem through command line argument

* Allow specify problem 'axes' through command line argument

* Add check to template type argument

* Add transpose_shape() to generalize shape permute

* Generalize transpose utility functions

* Use better name for tensor indices

* Add checks in helper functions

* Remove debug messages

* Refine error message for check_err()

* Generalize variable naming in example code

* Add device op 'DevicePermute'

This device op is clone of 'DeviceElementwise'

* Use 'DevicePermute' device op in example

* Remove 'elementwise' from identifiers

* Remove 'elementwise' from file paths

* Remove base class of 'DevicePermute'

* Let 'DevicePermute' inherit from 'BaseOperator'

* Add simple type traits to validate device op type

* Add static_assert() to check type constraints

* Create 'DevicePermuteBase' to generate methods

* Use indirect base type to generate methods

* Remove 'is_device_op<>' type traits

* Only accept single-input-single-output for 'DervicePermute'

* Simplify 'DevicePermute' interface

* Re-format 'DeviceElementwise'

* Use CRTP to generate overridden virtual method

* Remove unnecessary include directives

* Distinguish input & output shape in 'DevicePermute'

* Passing 'axes' to 'DevicePermute'

* Use more reasonable return value for Invoker::Run()

* Add 'GridwisePermute' kernel

This kernel is a clone of 'GridwiseElementwise_1D'

* Remove no-longer used type argument

* Check if input/output shape meet the requirement

* Remove no-longer used method

* Remove never-entered-if-clause

* Change problem description for 'DevicePermute'

* Transform descriptor into 3 dimensions

* Add debug code the verify result

* Add comment to indicate template argument location

* Add N/H/WPerBlock template parameter to 'DevicePermute'

* Rename 'GridwisePermute' to 'GridwiseCopy'

* Check tensor descriptor dimensions in 'GridwiseElementwise_1D'

* Add missing include directive

* Add 'BlockSize' parameter to 'DevicePermute'

* Remove no-longer used method

* Add 'BlockToTileMap' for 'GridwiseCopy'

* Use the normal Block2TileMap convention

* Rename 'BlockToTileMap' as 'Block2TileMap'

* Fix most of compilation errors

* Let 'Block2TileMap' map block to 2d coordinate

* Allow data transfer in 'GridwiseCopy'

* Fix wrong output descriptor for 2nd blockwise copy

* Rename 'GridwiseCopy' as 'GridwisePermute'

* Remove '1d' in identifiers

* Remove commented-out codes

* Remove 'MPerThread' template parameter

* Seperate template parameters

* Unify variable namming convention

* Use more verbose way to create expressions

* Add template parameter 'InBlockLdsExtraW'

* Release the constraint on In/OutGridDesc

* Use date type directly as template argument

* Re-arrange template arguments for blockwise copy

* Remove no-longer used template parameters

* Embed layout in the variable names

* Add GridwisePermute::CheckValidity()

* Extract local types as template parameters

* Rename local type alias

* Add more template parameters (vector width related)

* Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions

* Fill tensor values start from 1

* Re-formate example code

* Avoid too-large block id

* Add comment

* Make sure 'SrcVectorDim' is not same as 'DstVectorDim'

* Add check for the 'VectorDim' & 'ScalarPerVector' template params

* Let 'DstVectorDim' equals 'SrcVectorDim' after transpose out grid desc

* Remove no-longer used template parameter 'NPerBlock'

* Fix wrong descriptor creation logics

* Specify problem in each examples

* Use better example name

* Add new example 'example_permute_NxHxW_fp32'

* Add example for demonstrating bundle multiple elems in tensor

* Add support to permute multiple elements together

* Change the default problem size

* Add span<> class template

* Use span<> to generalize check_err() interface

* Fix ambiguous ctor call

* Avoid create necessary objects

* Use helper functions to simplify example code

* Add example for 4xfp16 permute

* Disable failed-to-compile example

* Add check for the NUM_ELEMS_IN_BUNDLE

* Remove redundant parameter in helper lambda function

* Add check for the input tensor type's byte-size

* Check scalar-per-vector with padded length

* Use more verbose name to avoid name collision

* Use fixed 'VectorDim' & 'ScalarPerVector' for LDS

* Embed shape info in name of descriptor constructor

* Rename example folder '36_permute' into '37_permute'

* Avoid using too-large LDS in kernel code

* Remove redundant example

* Usw switch() to group similar codes

* Add const to the span<> type arguement

* Simply initialize tensor with floating point values

* Use fp16 as data type in all examples

* Enlarge tensor size in example

* Enalrge N-dim in example

* Add check for the bundled type in example

* Use more stricter error threshold

* Remove global load/store loop in kernel code

* Measure execution time by default

* Use faster device op config for example 'NxHxW_fp16'

* Use faster device op config for example '1xHxW_fp16'

* Use faster device op config for example 'HxWx4_fp16'

* Remove cmd arg parsing logics

* Rename functions

* Extract bundle permutation logic out

* Simplify permute bundle example

* Add Tensor<>::GetElementSpaceSizeInBytes()

* Add Tensor<>::data()

* Use new methods to simplify code

* Use type alias to replace duplicated code

* Use existing method to shorten code

* Allow FillUniformDistribution accept range arugment

* Intialize random values in range

* Add Tensor<>::size()

* Use more meaningful names in permute bundle example

* Use more meaningful names in permute element examples

* Use rangified copy() to copy elements

* Use function return value directly to eliminate variables

* Add to_array() conversion tool to eliminate more variables

* Add Tensor<>::AsSpan<>() to create view of tensor values

* Use AsSpan() to shorten check_err() calls

* Remove no-longer-used 'using' directives

* Move 'using' directive to proper code position

* Remove redudant variables

* Remove useless static_assert()

* Add check for range types

* Declare variable right before first use

* Move long return type as tailing return type

* Add BaseInvokerCRTP<> class template to generate method

* Create new base type for 'DervicePermute' implementations

* Move 'NumDim' template param to the first

* Rename 'DevicePermute' to 'DevicePermuteImpl'

* Add 'noexcept' specifier to CRTP generated method

* Move 'Block2TileMap' definition into 'GridwisePermute'

* Use type alias to reduce code

* Unify naming style in 'DevicePermute'

* Add comments in 'GridwisePermute'

* Rename permute example folder

* Use std::cerr to report error

* Use larger shape in examples

* Rename '38_permute' to '39_permute'

* Make sure we use unsigned type for shape & indices

* Remove opt-ed out assertion

* Remove template BaseInvokerCRTP<>
2022-09-19 21:30:25 -05:00