The performance of Aquant has increased after enabling transposed C.
Do not need to exchange AQ elements among lanes after enabling
transposed C as one thread only holds data from one row.
* add gemm unit tests for int4, int8 datatypes
* minor changes based on reviews
---------
Co-authored-by: msaffari-amd <msaffari@banff-cyxtera-s78-2.ctr.dcgpu>
In #2443, a helper function was added test new print functionality, but it used auto for the function parameter types.
We need to support c++17 for downstream libraries, so we cannot use auto there. This PR replaces it witht the equivalent function template implementation.
* [CK TILE] Fix bugs in AQuant preshuffle
- Make Aquant works with block Mx64x256. `M` could be 16, 32, 64
- Make Aquant works with warp 16x16x32 and 32x32x16.
* [CK TILE] Rename Preshuffle to PreshuffleQuant
The new name, PreshuffleQuant, explicitly states the function's purpose:
to preshuffle the quantization matrix.
* [CK TILE Block Scale] Use GemmConfig to save tile properties
- Remove specialization of GemmQuantTypeConfig
- Pass GemmConfig around which contains tile properties. Stop using hard
coded tile properties in `gemm_calc_aquant()`
* [CK TILE Block Scale] Rename GemmConfig used in block scale
- Remove unused GemmConfig
- Rename GemmConfig used in block scale
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Add CSV-driven convolution test pipeline
- Add test_grouped_convnd_fwd_dataset_xdl.cpp with CSV reader functionality
- Add complete dataset generation toolchain in test_data/
- Add Jenkins integration with RUN_CONV_COMPREHENSIVE_DATASET parameter
- Ready for comprehensive convolution testing with scalable datasets
* Update convolution test dataset generation pipeline
* add 2d, 3d dataset csv files
* Remove CSV test dataset files from repository
* Update generate_test_dataset.sh
* Fix channel division for MIOpen to CK conversion
* Remove unnecessary test files
* Fix clang-format-18 formatting issues
* TEST: Enable comprehensive dataset tests by default
* Fix test_data path in Jenkins - build runs from build directory
* Add Python dependencies and debug output for CSV generation
* Remove Python package installation - not needed
* Add better debugging for generate_test_dataset.sh execution
* Fix Jenkinsfile syntax error - escape dollar signs
* Add PyTorch to Docker image for convolution test dataset generation
- Install PyTorch CPU version for lightweight model execution
- Fixes Jenkins CI failures where CSV files were empty due to missing PyTorch
- Model generation scripts require PyTorch to extract convolution parameters
* Add debugging to understand Jenkins directory structure and CSV file status
- Print current working directory
- List CSV files in test_data directory
- Show line counts of CSV files
- Will help diagnose why tests fail in Jenkins
* Fix clang-format-18 formatting issues
- Applied clang-format-18 to test file
- Fixed brace placement and whitespace issues
* Add detailed debugging for CSV dataset investigation
- Check generated_datasets directory contents
- List all CSV files with line counts
- Show first 5 lines of main CSV file
- Applied clang-format-18 formatting
- This will help identify why CSV files are empty in Jenkins
* keep testing add pytorch installation in shell script
* Use virtual environment for PyTorch installation
- Jenkins user doesn't have permission to write to /.local
- Create virtual environment in current directory (./pytorch_venv)
- Install PyTorch in virtual environment to avoid permission issues
- Use PYTHON_CMD variable to run all Python scripts with correct interpreter
- Virtual environment will be reused if it already exists
* Remove debug code and reduce verbose logging in Jenkins
- Remove bash -x and debug commands from Jenkinsfile execute_args
- Remove all debug system() calls and getcwd from C++ test file
- Remove unistd.h include that was only needed for getcwd
- Remove debug print in CSV parser
- Add set +x to generate_test_dataset.sh to disable command echo
- Redirect Python script stdout to /dev/null for cleaner output
This makes Jenkins logs much cleaner while still showing progress messages.
* install gpu torch
* Clean up and optimize comprehensive dataset test pipeline
- Reorder Jenkinsfile execution: build -> generate data -> run test
- Remove commented-out debug code from generate_test_dataset.sh
- Ensure all files end with proper newline character (POSIX compliance)
- Keep useful status messages while removing development debug prints
- Set MAX_ITERATIONS=0 for unlimited test generation in production
* Add configuration modes to reduce test execution time
- Add --mode option (half/full) to generate_model_configs.py
- half mode (default): ~278 configs (224 2D + 54 3D) -> ~1,058 total tests
- full mode: ~807 configs (672 2D + 135 3D) -> ~3,093 total tests
- Update generate_test_dataset.sh to use CONFIG_MODE environment variable
- Keeps all model types but reduces parameter combinations intelligently
- Fixes Jenkins timeout issue (was running 3,669 tests taking 17+ hours)
- Default half mode should complete in ~4-5 hours instead of 17+ hours
* Add small mode for quick testing of comprehensive dataset
* jenkins pipeline test done
* jenkins test done
* Trigger CI build
* remove test comment and update data generation option as half
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Remove some unnecessary calls to create_args in basic and universal GEMM tests
* Remove unnecessary include statements in universal GEMM tests
* Improve compilation time of basic GEMM tests by only compiling the precision variants that we need
* Universal GEMM PrecType should be the same as CDataType
* Improve compilation time of universal GEMM tests by only compiling the precision variants that we need
* Revert to constexpr when defining some constants
* Refactor CK tile permute ctests to gtests
* Refactor CK tile MOE smoothquant ctests to gtests
* fix typo in comment
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update invalid case in else clause for get_precision_string
* Refactor permute gtests to use templated versions of matrix_core_swizzle and permute functions
---------
Co-authored-by: root <root@splinter-126-wr-c2.aus.dcgpu>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* refactor moe_sorting ctests to use gtest framework
* Refactor ctests for smoothquant to gtests
* fix clang format to use version 18
* Print local_eid in MOE sorting gtests
* Remove extra space in smoothquant output
* Preshuffle AQ matrix in block scale gemm
* turns the output to fp16. Increase the repetition time.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* add failing tests
* swap out and reference
* add constraint assert to transpose input distribution
* test both pipelines with rectangular block tile
* print mismatched indices
* add a smaller failing test for old pipeline
* print grid and block
* fill output before operating on it
* swap m/n tile sizes and make one test pass
* add device syncs
* add one more flipped test case
* flip block tile at host arg init
* fix tiles for lds pipeline
* clang-format
* rename tests
* roll back error check
* remove device syncs
* reduce large test case's size
* Add more printing to core cktile
* Revert other changes in static encoding pattern
* Refactor to using a free print() function
* Remove loops and print just the containers
* Print tuple with better formatting, fix sequence compilation
* Add some tests for print utility
* Add print utility header
* Print for static_encoding_pattern
* add buffer_view printing
* Align vector_traits
* Fix formatting
* Lower-case enum strings
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* Remove empty comment lines
* Fix test with lower-case too
* Reduce repeated code in print tests, move helper function closer to type definition, test X&Y
* Add test_print_common.hpp
* add print.hpp in core.hpp
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* General 2D Reduction Kernel
* Move the reduction kernel from the example
* Split the code and add the necessary policy, problem, shape files as
per ck_tile convention
* Add/modify the headers
* Modified the example to work with the 'new' kernel
* Added tests for the kernel
* N-D refernce reduce
* Added support for N-D input with transform to 2D
* Added padding to support various input sized tensors
* Bug fix in the thread buffer constructor
* Some comments to explain the reduce2d block kernel
* comments resolution
* clang-format
* comments resolution
* clang-format
* clang-format
* comments resolution
* clang-format
* Split-K autodeduction for DeviceGroupedConvBwdWeight_Xdl_CShuffle and DeviceGroupedConvBwdWeight_Xdl_CShuffleV3.
* Split-K autodeduction for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle.
* Use simple best occupancy model to calculate the split-K.
* Handle split-K autodeduction in explicit gemm conv.
* Add unit tests for split-K autodeduction.
* Remove oversubscription.
* Small fixes.
* Added split-K autodeduction for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle.
* Run clang formatting.
* Fix error handling in the conv profiler.
* Add missing documentation for the autodeducted split-K values.
* Add split-K autodeduction to DeviceGroupedConvBwdWeight_Explicit_Xdl solver.
* Fix clang formatting and split-K profiler documentation.
* Rename max_occupancy value variable.
* Calculate grid size for split-K autodeduction directly from input array shapes and template params.
---------
Co-authored-by: Ville Pietilä <>
* Add tests for host convesion f32/f16 to f8
* Add tests for host convesion from f8 to f32/f16
* Fix UB and corner cases in f32/f16 to/from f8 conversion
* There are UBs when very small values are converted to f8: bitshifts
can be larger that type width. Using unsigned long long does not help
because exponent_diff >= 64 in such cases. This causes that values
like 2.117582368e-22 are converted to non-zero f8 in host validation
of FMHA tests, test_f8 crashes with segfault in completely irrelevant
code like GTest internals or produces non-deterministic results etc.
* Fix FNUZ conversion to return NaN for NaN inputs.
* Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16
conversion.
* Replace some magic numbers with values from numeric_traits
* Build tests only on devices supporting the type
* add a dummy test file
* add kernel launch logic to the test
* transfer all test cases into gtest params
* factor kernel out into test config
* add load transpose pipeline tests
* add padded tests and skip invalid kernels at runtime
* enum class for pipeline type
* add multiwarp test cases
* fix type
* try to solve the problem
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* update gpu_timer for rotating buffer as hipblasLt's implementation
* timing fix
* Updating gpu timer for old ck as well
* Revert "Updating gpu timer for old ck as well"
This reverts commit 958cd1bc99.
* code clean up with runtime argument; function rename
* code cleanup
* general timer fixes
* bug fix
* clang formatted
* addressing reveiew comments
* clang formatted
* Addressing review comments
* CI fix
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* unify pipeline signature with existing example
* iwyu
* move stuff around in load-tile-transpose
* cleanups in batched transpose pipeline
* comments
* use same inputs size
* cleaner printf
* print host args
* use 64 block sides in the 37_transpose example
* roll back grid dimension size adjustment for 37_transpose example
* transpose grid for 37_transpose to unify with 35_batched_transpose
* unify grid computation logic
* make policy methods device only (since they are used only on device from the pipeline)
* more host/device attribute cleanups
* copy over problem
* move over pipeline and policy
* add switch to batched transpose api
* make the lds problem more similar to original problem
* factor out logic into traits
* factor out conditional compilation into trait parameter
* propagate pipeline to args
* unhardcode pipeline dispatch parameter
* refactor vector size
* put warp tile out of dispatch
* rename template parameter for trait
* rewrite vector size in terms of problem
* mark policy-internal struct variable as device
* factor out input distribution and thread access pattern from policies
* reword vector size
* use datatype across batched transpose pipelines, problems and kernel
* remove transpose traits from lds pipeline
* add padding to the lds pipeline *interface*
* add comment
* remove ck_tile example #37
* update cmakelists
* add test for new pipeline
* update batched transpose test
* roll back load_tile_transpose changes
* remove comments
* pack dispatch parameters into a config
* padM can be enabled
* adjust lds vector size to enable padding along N
* update test
* clean up logic
* swap m/n input vector size
* adjust perf test script
* sweep over C/W in perf test
* count both read and written bytes into bandwidth (x2 the number)
* clang-format
* widen size range for perf test
* remove 64k x 64k case; it's too large for index
* remove thread tile from dispatch
* Solve merge conflict
* fix compile
* modify the transpose
* solve the test error and clang format
* Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.
* Less casting in AddClamp
* Conv+bias+clamp instances & profiler BF16
* Fix 3D instances & run just 1x for verification.
* :Run just once for verification conv fwd.
* ckProfiler conv fwd clampwq
* Remove exec bit & formatting
* Add support for MultiD for grouped conv fwd v3.
* Enable 2Lds.
* clean
* align instances
* align instances
* profiler fixes
* Fixes
* fix
* fix
---------
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Fixing 0ms and inf GB/s issue in img2col (#2565)
issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```
solution :
======
Problem occured because config.time_kernel is false by default.
if false, then no need to calculate perf, just print proper message
`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`
* merge with develop
* solve clang format
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
* [CK_TILE] Disable moe_sorting unit test on gfx908
- gfx908 does not support instruction used in moe_sorting
* Update CMakeLists.txt
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Elementwise kernel implementation
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: yashagar <yashagar@amd.com>
* Elementwise with generalized nDims
* Adding the n-ary input tensor feature
* Generalize dimensions on top of inputs
* Add TFLOPS + remove std usage for tuples
* 1D basecase optimization
* Cleanup code + refactoring to a common interface
* Generalize to unary and add an example
* Cleanup, refactoring and commenting
* Suggestions for LWPCK-3170: elementwise kernel improvements
* Clang-format: remod.py
* Replace InputTensorType with XDataType as the type of input_tensors
* Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs
* Move examples to folder 19_elementwise
* Add missing copyright headers and fix some existing ones
* Replace an assert with throw std::runtime_error in elementwise example
* Avoid reading the output by using make_static_distributed_tensor for y_tile
* Removed two unused includes
* No need to move windows to the next block when each workgroup processes a single tile
* Only copy input tensors to the device
* Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example
* Adding output strides to the kernel, transposition example and update the other examples
* Changes made by remod.py
* Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view
* Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp
* Reuse generic reference binary/unary operation in examples + refactoring the transpose reference
* Fix comments in elementwise_example.cpp
- Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses
- ElementWiseTraits was renamed to ElementWiseShape
- Adopt suggestions made by Copilot when prompted to check for factual or typographical errors
* Simplify CMakeLists.txt and remove the unused variables this uncovers
* Rename a file and fix some copyright statements
* Changes made by script/clang-format-overwrite.sh
* Add basic unit test for ElementWiseKernel
* Remove left-over uninformative comment in apply unit test
* Changes made by clang-format-overwrite.sh
* fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view
* Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp
* Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array
* Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally
* Move examples to folder 20_elementwise
* Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able benchmark against old CK
* Fix CLang formating
* Bump up the elementwise example folder number
* Elementwise: add padding + minor cleanup
* Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view
* Add isSupportedArg to Elementwise kernel + addapt example and unit tests
* Fix clang-format on the unit test file
---------
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
* fix async copytest bug
* Add block_sync_lds_direct_load utility
* fix the s_waitcnt_imm calculation
* Improve s_waitcnt_imm calculation
* fix vmcnt shift
* add input validation and bug fix
* remove unnecessary output
* move test_copy into test
* change bit width check
* refactor macros into constexpr functions
which still get inlined
* wrap s_waitcnt api
* parameterize test
* cleanup
* cleanup fp8 stub
* add fp8 test cases; todo which input parameters are valid?
* replace n for fp8 in test cases
* add large shapes; fp8 fails again
* change input init
* test sync/async
* time the test
* clang-format test
* use float instead of bfloat to cover a 4-byte type
* fix logic - arg sections should be 'or'd
* make block_sync_lds_direct_load interface similar to old ck
* fix a few comment typos
* name common shapes
* revert the example to original logic of not waiting lds
* clang-format
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* ck_tile kernel for gemm with groupwise quantized A or B tensor.
This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.
Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS.
Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.
1. fp8, fp8 -> f32
2. bf8, bf8 -> f32
3. i4, fp8 -> f32
4. i4, bf8 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.
For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers.
* [Block Scale Gemm] Only run gemm quant examples on __gfx94__
- Only run gemm quant examples on __gfx94__ for usage of
`v_cvt_pk_fp8_f32`
- Format the code
* [Block Scale Gemm] Remove Bquant Gemm BlockScale
This cleanup is in preparation for future development of bquant. By
isolating Aquant-related code, we can streamline the codebase and make
it easier to add and maintain bquant functionality in subsequent
updates.
* [Block Scale Gemm] Format code with clang-format-12
The latest clang-format (v19) in ROCm 7.0 generate different result than
clang-format-12 which is used in CK CI.
Format code with clang-format-12 for consistency.
* [Block Scale Gemm] Split the k direction loop
- Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp
to make the logic clearer.
- Disable C transposition.
* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm
* [Block Scale Gemm] Update copyright
* test
* Add TailHandler
* Move TileDistributionEncodingPatternAQ
* Refactor
* refactor
* fix bug
* fix bug
* help solve the PR comment
* Format the code
* [Block Scale Gemm] Add unit tests
* [Block Scale Gemm] Add support to 16x16x32 MFMA
- Add support to 16x16x32 MFMA
- Fix a bug when exchange data crossing lanes
---------
Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com>
Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
All our platforms support C++20 now, so update to C++20 standard
for language features such as concepts, designated initializers,
range-based for initializers, and consteval. This PR only switches
the compiler flags to C++20, no other changes.
[CK_TILE] Add new ck tile unit test
* Add new ck tile unit test smoke-gemm-universal
* Add new ck tile unit test smoke-gemm-basic
* Add new ck tile unit test topk_softmax
* Add new ck tile unit test add_rmsnorm2d_rdquant_fwd
* Convert ck-tile 06_permute smoke test to unit tests for fp16, fp8, and fp32
* Apply clang format and update copy right year
* Convert ck tile moe sorting example smoke tests to unit tests
* fix CMakelists to ensure that permute and moe_sorting are built for gfx9 only.
* Remove number prefix from permute and moe_sorting directory names
* code cleanup
* add missing test cases for fp16 permute
* remove unecessary parentheses
* Cleanup
* Remove uneccessary final nullptr
* update copyright and licensing statement in files
* Add custom target for permute tests
* Add missing new line at end of file for moe sorting CMakelist.
* Update MOE sorting tests to account for MOE sorting example updates
The ck_tile/13_moe_sorting example was updated to include different
cases dependending on whether MOE_SORTING_FMOE_2D_BUF is set. So,
the ck_tile tests for MOE sorting were updated to account for these
changes.
---------
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
* CK tile tests for flatmm using example
* MOE smoothquant draft tests
* fix create_arg default index to zero for MOE smoothquant
* revert MOE smoothquant changes
* code clean up
* Add back MOE smoothquant changes
* Add MOE smoothquant cases for different precisions and update cmake
* clean up comments
* Update flamm cmake
* revert change made to moe_smoothquant smoke_test.sh EXE path
* remove unecessary comment in MOE smoothquant cmakelist
* comment out adding moe_smoothquant subdirectory for now due to bugs with GPU core dump issue on gfx942 and gfx90a
* Clean up run_test_case function in MOE smootquant tests
* update copyright and licensing on files
* Remove flatmm test dir since tests should be done as weighted preshuffle gemm
* Add flamm smoke test cases to weighted preshuffle gemm gtests
* remove blank line from CMakeLists
---------
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Create tests for ck tile batched transpose using example
* Create ck tile tests for smoothquant using examples
* fix precision input strings and convert batched transpose to regression tests
* Code cleanup and fix asserts
* add missing licenses
* update copyright and licensing in files
* Update smoothquant tests to use example's smoothquant.cpp
* Add custom target for batched transpose tests
* Add missing new lines at end of files for CMakelists
* fix typo in batched transpose CMakeList target_compile_options
---------
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>