* enable use of rocm5.5 release candidate 4
* upgrade to ROCM5.5 RC5
* try to fix the PUB_KEY error, remove the cmake-data package
* upgrade to latest cmake version
* use private dockerhub repo for rocm5.5 rc5
* add missing bracket
* Rename to proper naming
* Add example of groupnorm + swish
* Extract duplicate code in example
* Add groupnorm + swish instances
* Refactor instance generation, split into multiple cpp files
* Add external api and client example
* Refine profiler message
* Use ck math version of exp
* Refine problem size in example
* Add host version of exp
* Add type_convert implementations for bf16
* Add the fix for conv_fwd
* Add the fix for conv_bwd_data
* Add the fix for conv_bwd_weight
* Format
* Format
* Another format
* Add a macro to use workaround on MI200 only
* Format
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add conv perlayer quantization
* Add gemm_dlops quantization
* Support int8 for innerproduct
* Refine gemm dlops int8 kernel parameter
* Support gfx908(MI100) and gfx90a(MI200)
* clang-format
* Rename example number
* Support different layout for d tensor
* Add conv dlops perchannel quantization example
* Move to example 40
* Extract the common code for different platform (dlops and xdlops)
* Move to subfolder. Prepare to add other quantization ops
* Refine the quantization instance library
* Add conv dl instances and client example
* Remove unnecessary type
* Add gemm quantization instance
* Add external api and client example
* Refine num_bytes
* Separate different layouts into different cpp files
* Add more xdl instances
* Revert "Remove unnecessary type"
This reverts commit 820869182f.
* Remove CShuffleDataType in dlops
Let acc and CShuffleDataType be the same in xdlops
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Pass shared mem pointer as pointer to void.
* Device Op GroupedGEMM Multiple D
* Example for grouped gemm multiple d.
* Add MI200 to supported archs.
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* make conv_fwd_bias_activation kernel id unique
* add more parameters to conv and gemm kernel names
* update GetTypeString for conv and gemm kernels
* fix two more kernel strings
* Grouped gemm + Gelu instances.
* Device Instance Factory for GroupedGemm+Gelu
* Client example
* Rangify fill helper functions.
* Fix name clash.
* Profiler for grouped_gemm+gelu
* No need to use full namespace name.
* Add check for MRaw divisible by vector load.
* Ugly fix for big errors.
* Add grouped_gemm+gelu to profiler CMakeLists.
* Store in argument additional info.
* Information about Mraw, Nraw, Kraw values.
* Use FastGelu instead of Gelu.
* Change client ex to use FastGelu
* Remove relaxed error precision.
* Remove duplicate output elementwise-op
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Modify Doxygen config to pick up include directories recursively
* Add DeviceMem struct to API Reference guide
* Add classes that are used in Flash Attention kernel
* Add a reference and config for generating bibliography
Co-authored-by: Philip Maybank <Philip.Maybank@amd.com>
* add new parallel stage on navi node
* don't run performance tests on navi, get rid of 9110 compiler
* only run navi build when not doing QA
* fix syntax
* use navi21 label
* don't stash profiler on navi nodes, scp deb package to ginger
* disable tests on navi nodes
* test posting a binary to ginger
* add sshpass and use it to copy deb package
* fix the scp example
* fix syntax
* debug the scp issues
* add jenkins user to docker
* don't try whoami
* change jenkins uid and add user with uid=1002
* try scp from the last stage on micimaster
* rename and stash the package, scp from micimaster
* fix a bug blocking wmma_gemm_multipleD
* Utilize matrix padder in device_wmma_op
* cosmetic change for gemmpadding format
* clang format
* Change gridwise gemm from FIFO to KMN loop fashion
* Add DeviceOp and examples
* Format DeviceOp template arguments
* Remove bf16 example
* Format
* Format
* Update MakeABCGridDescriptor_A_K0_M_K1_B_K0_N_K1_C_M_N
* Refactor argument preparation
* Update conv_bwd_weight_dl to grouped_conv_bwd_weight_dl
* Rename device op file
* Update include directive in the example file
* Update descriptor preparation for grouped op
* Update the argument
* Update batch handling
* Add gridwise gemm supporting batched input
* Update blockwise indexing, working version
* Update copyright year
* Update check if argument is supported
* Refactor and make consistent with xdl examples
* Update check if argument is supported
* Add changelog entry
* Added comments on Dl op split_k>1 support
---------
Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* New docs directory with minimal config
* Based on docs directory of rocBLAS
* Config for running Doxygen then Sphinx to generate HTML
* Add minimal content - intro to doc
* Add some boilerplate sections to doc
* content still needs to be done,
* e.g., need to generate API documentation using Doxygen
* need to write contributor guide
* Start Softmax section of Support Primitives doc
* Written as a test bed for typesetting math content
* Need to decide how much detail to go into
* add doc directories to git ignore file.
* Minor edits - new line at EOF, change year in copyright notices
* Port Markdown files to reStructuredText
* Copy Markdown files from pre-existing doc directory to docs directory
* Convert to reStructuredText (rst) - section headings, links, tables
have a different syntax in rst
* New rst files added to index - can generate HTML with same style as
HTML generated from rst files in previous commits
* Intention is to make all the content in doc redundant and use rst
throughout rather than a mix of md and rst
* Extend Softmax section of Primitives Guide
* rename l to z
* add material on applying softmax row-wise to matrix
* define macro for diag operator (represents diagonal matrix)
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* clean up output from kernel_launch
* set RUN_WARMUP to 0 by default
* split the warm-up into a separate issue
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>