* add bf16 gemms for gfx11/gfx12
* reduce the input values in test_gemm
* add int8 wmma gemm instances for gfx11/gfx12
* add example gemm_wmma_int8
* fix bug in gemm_wmma_int8 test
* increase bf16 gemm test tolerance
* update the dates and clean-up commented-out instances
* Calculate generic relative threshold pool3dfwd
* Calculate absolute error threshold pool3d fwd
* Generic threshold calculation: take max input for relative error in pool3dfwd
* Remove max possible value for error calculation at runtime
* Remove debug print in pool3dfwd
* Pool3d fwd adjusted types in generic threshold calculation
* Generic threshold calculation: take into account number of accumulations and AccDataType
* Generic threshold fix final error formula
* Generic threshold calculation - num of accs fix
* Generic threshold calculation - adjust absolute error
* Generic threshold calculation - OutDataType in absolute error
* Add non_native_vector_type
* Add a test
* Add non-native vector type
* Fix CTOR
* Fix non-native vector type of 1
* Fix CTORs
* Use vector_type to cover non-native implementation as well
* Update the test
* Format
* Format
* Fix copyright years
* Remove BoolVecT so far
* Add AsType test cases
* Update assert error message
* Remove redundant type
* Update naming
* Add complex half type with tests
* Add tests for vector reshaping
* Add missing alignas
* Update test/data_type/test_custom_type.cpp
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Compare custom types to built-in types
* Add default constructor test
* Add an alignment test
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Support 64 bit indexing
* Add new grouped conv fwd kernel for large tensors
* Add instances large tensor
* Fixes for transform conv to gemm
* Fixes
* fixes
* Remove unneeded instances
* examples fixes
* Remove unneeded ds arrays
* Fix tests
* Add 2GB check in gridwise dl
* Fixes
* Format
* Format
* Format
* Remove const
* Use the right template
* Format
* Format
* add row/col instances
* Add missing file
* fixed
* fixing block to etile error
* Format
* Updates
* Format
* fixed rrr layout
* generating a sample JSON file: currently contains includes, prologue/epilogue and instances
* version where the json is passed into the instances to generate a key
* updated run function to just launch kernel
* updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer
* adding in testing files
* cleaned up comments, still need to work on including header files
* removed unneeded files
* removed/commented out JSON implementation
* added fusion(prologue/epilogue) into instance generation
* working on instance selection
* added instance selection, need to fix instance validation
* removed block2etile map validity check for testing purposes
* test running: failing due to incorrect files/input
* all grid descs/ptrs completed, but device file not found
* Update test and embed modules
* Restore older version
* added convolution operation, written test, debugging generated code for compilation
* attempting to include CK in host directory: _Float16 error
* CK header file issues
* slight fix
* don't crash when hip can't report total memory
* dump generated code to a file
* changing sizes
* creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed
* some fixes to call the device code
* separating test files for conv and gemm
* completed arg ptr, now have linking errors
* clang format fix
* resolved linker issues in conv test
* remove dependency on libutility from ck
* resolved num dim error
* properly passing arg ptr, errors with passing typenames: redefinition/redeclaration
* undo the commenting of device function
* hand created kernel code to find rtc issues
* dump the full src to file
* resolved redeclaration errors, cleaned up errors for Amber's kernel code
* debugging purposes: redeclaration error
* config files
* resolved errors for NumTensor and redeclaration, formatted version.h
* resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type
* WIP: close to getting kernel compiled
* WIP: fixing rtc errors
* fixed sequence errors, formatting, still one error with run fcn
* yay: kernel compiles and runs
* updated templated/generated version to run and compile
* minor fixes
* working generated example, resolved memory access error due to padding
* adding in reference kernel, validation failing against reference
* debugging: printing kernel args
* reduced error in results
* debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues
* working validation (using reference convolution) with prologue function for both hard-coded and generated version
* WIP: create an alt version that creates Argument on the device
* wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments
* wip: making necessary methods device code
* added grid descs, working on grid pointers, errors with stl numerics
* wip: updating kernel args - issue, replacing some std functions
* replaced std::accumulate call with temp hardcoded version
* wip: args causing memory issue
* Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes
* adding object file dump
* temporary hardcoding of grid size, can remove device op inst + arg ptr
* minor fix for grid size
* added modified example where arg ptr is created on the device for generated version as well
* removed device op instance and arg ptr from modified examples
* moving device op file for testing purposes and to properly build CK
* commenting out print-outs
* adjust compiler args to produce a valid ELF file
* temporary removal of validation
* reverting compiler args back for working example
* retrieve necessary arguments from generated template parameters in correct format
* calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly
* scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example
* temporary change to generate ELF format binary object file
* removed unnecessary code, added comments
* formatting fix
* cleaned up code, added new tests, restructured library: move helper into CK
* refactored launch parameter calculation to be more concise
* renamed files and variables for more clarity/uniformity
* more code cleaning, removed debug statements
* moved majority of my files into codegen directory, running properly
* updated Embed.cmake(string_view) in codegen directory
* updated host directory to match Embed.cmake as well
* added old tests in
* updated instance generation methods to be more concise
* removed layout from launch parameter calculation
* working test
* fixed issue with verification, all instances working
* updated verification in other tests
* removed duplicate matrix padder file, removed code dumps
* removed old hard-coded tests
* removed old host directory, all files in codegen directory now
* fixed copyright in files
* commenting out validation
* renamed files
* made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code
* updated headers
* removing duplicate file for fwd conv to gemm, merging with original file
* fix building codegen with clang++ directly
* resolving build error from conv_fwd_to_gemm
* fix for previous error
* renaming tests
* created common test file
* cleaned up code, added comments
* renamed device op
* fixed typos in comments
* removed extra space
* code cleanup: resolving Amber's comments
* removed wrapper struct for matrix padder, fixed template
* cleaned up if statements for better readability
---------
Co-authored-by: Paul <pfultz2@yahoo.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Overload output stream operator for LoopScheduler and PipelineVersion
* Add Run overload accepting grid descriptors MK.
* Add __device__ keyword for CalculateGridSize
* Create device op GroupedGemmMultipleD
* Add GroupedGemm MultipleD Tile Loop implementation.
* Add an example for GroupedGemm MultipleD tile loop.
* Device Op GroupedGEMMTileLoop.
* Bunch of small changes in example.
* CkProfiler
* Remove unused tparam.
* Fix include statement.
* Fix output stream overloads.
* Do not make descriptors and check validity until we find group.
* Fix gemm desc initialization.
* Revert device op
* Fix compilation for DTYPES=FP16
* Validate tensor transfer parameters.
* Validate on host only NK dims if M is not known.
* Fix bug.
* A convenient debug func for selecting threads.
* Fix has main k block loop bug.
* Make sure that b2c has up to date tile offset.
* Output stream operator for Sequence type.
* Cmake file formatting.
* add flush cache to device op
* add flush cache parameter to ckProfiler
* change calculate size a and b method
* change evaluation time method from AVERAGE to MEDIAN
* format code
* adjust some code
* fix core dump
* remove loop call flush icache in kernel
* remove loop(outer) call flush icache
---------
Co-authored-by: letaoqin <letaoqin@amd.com>
* clip fp8 to +/-240 on all targets
* if inputs to fp8 conversion are +/-inf, they remain unaltered
* increase tolerance for test_elementwise_layernorm to prevent false errors
* change the input values for gemm examples to floats
* reduce gemm example float input values to prevent errors
* increase the tolerance for gemm examples
* fix cppcheck errors, first pass
* fix format
* fix returned value in examples
* add macro definitions for cppcheck
* fix the profile_gemm logic
* update the gemm profiler logic
* add more definitions to cppcheck, fix a couple more errors
* replace runtime error with message in device function
* fix a couple of int4 issues
* no return for fill function
* fix errors in data_types.hpp
* fix format
* fix a few remaining errors
* fix errors in data_types.hpp
* fix last couple of errors in data_types.hpp
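
The fp8 clipping rule from the list above (clip to +/-240, leave +/-inf unaltered) can be sketched on the host as follows. This is only an illustration under stated assumptions: `clip_for_fp8` is a hypothetical helper name, not a function from the library, and the sketch covers only the value clamp applied before conversion, not the fp8 encoding itself.

```cpp
#include <cmath>

// Hypothetical helper (not a library function): clamps a float into the
// +/-240 range named in the commit list before it is converted to fp8.
// Infinite inputs are passed through unchanged, as the list states.
inline float clip_for_fp8(float x)
{
    constexpr float kFp8Max = 240.0f; // limit taken from the commit message

    if(std::isinf(x))
        return x; // +/-inf remain unaltered
    if(x > kFp8Max)
        return kFp8Max;
    if(x < -kFp8Max)
        return -kFp8Max;
    return x; // in-range values (and NaN) fall through untouched
}
```

For example, `clip_for_fp8(1000.0f)` yields `240.0f`, while an infinite input comes back as the same infinity.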
This PR introduces support for double buffering in LDS for GEMM kernels that use direct load instructions.
Direct loads now use inline asm instead of intrinsics. With intrinsics, the compiler adds extra waitcnt instructions, which breaks the load/compute overlap that double buffering relies on.
Because inline asm hides the data dependencies between global->LDS and LDS->registers from the compiler, sched_barrier is needed to keep the compiler from incorrectly rescheduling instructions.
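
The double-buffering pattern described above can be sketched in plain host C++. This is a sequential simulation under stated assumptions, not the kernel code from this PR: `kTile`, `load_tile`, and `sum_double_buffered` are hypothetical names, the two `std::array` buffers stand in for the two LDS buffers, and a simple sum stands in for the GEMM math. On the GPU the load of the next tile runs asynchronously while the current tile is consumed, and a wait is needed before a buffer is read; the simulation only shows the buffer alternation.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical tile size; stands in for one LDS buffer's worth of data.
constexpr std::size_t kTile = 4;

// Stand-in for the global->LDS direct load: copies one tile into `dst`.
inline void load_tile(const std::vector<float>& src,
                      std::size_t tile,
                      std::array<float, kTile>& dst)
{
    for(std::size_t i = 0; i < kTile; ++i)
        dst[i] = src[tile * kTile + i];
}

// Sums `src` tile by tile using two alternating buffers.
// Assumes src.size() is a multiple of kTile.
inline float sum_double_buffered(const std::vector<float>& src)
{
    const std::size_t num_tiles = src.size() / kTile;
    if(num_tiles == 0)
        return 0.0f;

    std::array<std::array<float, kTile>, 2> lds{};

    load_tile(src, 0, lds[0]); // prologue: fill the first buffer

    float acc = 0.0f;
    for(std::size_t t = 0; t < num_tiles; ++t)
    {
        const std::size_t cur = t % 2;

        // Issue the load of the next tile into the other buffer first...
        if(t + 1 < num_tiles)
            load_tile(src, t + 1, lds[1 - cur]);

        // ...then compute on the current buffer. On the GPU these two
        // steps are what double buffering allows to overlap.
        for(float v : lds[cur])
            acc += v;
    }
    return acc;
}
```

Calling `sum_double_buffered` on a vector holding the values 1 through 8 processes two tiles and returns 36.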
* Add basic support for direct loads from global to LDS
* Clean the code and comments
* Add support for fp16
* Add comments
* Add check for thread cluster lengths
* Align non-direct-load fp16 example
* Small fixes
* Extend IsSupported to check for supported GPU gens
* Build examples only on the supported HW
* Do not throw when instance not supported in 04 example
* Review: Apply review suggestions
* Review: small fix
* Review: small fix