Commit Graph

3006 Commits

Author SHA1 Message Date
Ville Pietilä
e2225e2baa Git ignore rocprofv3 files. 2026-02-04 10:05:14 -05:00
Ville Pietilä
c97795b139 Remove .dat file. 2026-02-04 10:04:55 -05:00
Ville Pietilä
1c842f39a4 Git ignore profiler output. 2026-02-04 10:02:57 -05:00
Ville Pietilä
1c1ac4ef10 Small fixes to runner script. 2026-02-04 09:55:05 -05:00
Ville Pietilä
73b459c5a4 Runner script for benchmarking. 2026-02-04 09:38:16 -05:00
Ville Pietilä
3e6415a8ea True baseline benchmarking results. 2026-02-04 07:56:15 -05:00
Ville Pietilä
20cf6df685 Best instances for benchmark shapes. 2026-02-04 07:27:34 -05:00
Ville Pietilä
2cfc4209bb Profile optionally only a given instance. 2026-02-04 06:31:36 -05:00
Ville Pietilä
f6f381dbd4 Benchmarking shapes and baseline results. 2026-02-04 06:07:19 -05:00
Ville Pietilä
403f36ed26 Disable building all but fwd convs for CK profiler. 2026-02-04 06:07:06 -05:00
Bartłomiej Kocot
fbb073f276 Update device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v3.hpp 2026-01-31 20:46:58 +01:00
Jakub Piasecki
2086516deb fixed building errors 2026-01-30 19:22:34 +00:00
Jakub Piasecki
ae2d2d9f2c fixed conflicts 2026-01-30 18:47:19 +00:00
Bartlomiej Kocot
a7b57187cf Grouped Convolution Backward Data Direct Load
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
2026-01-30 18:45:23 +00:00
Graner, Johannes
c815e734c7 Add good instance 2026-01-30 10:50:06 -05:00
Ville Pietilä
7ffa682bd3 Add missing applicability check to v3 fwd convs. 2026-01-30 05:05:38 -05:00
Ville Pietilä
966706bb21 Add new grouped conv instance to the gfx950 branch. 2026-01-30 04:13:36 -05:00
Graner, Johannes
5301efc8e4 Add NumGroupsToMerge to BwdWeight type string 2026-01-29 09:30:02 -05:00
Ville Pietilä
0fba67a7e7 Add fwd conv group merging to the v3 conv instances. 2026-01-29 08:16:11 -05:00
Ville Pietilä
44960922a2 Merge remote-tracking branch 'origin/jograner/bwd-weight-splitk-autodeduce' into features/grouped-conv-perf-uplift 2026-01-28 10:57:40 -05:00
Ville Pietilä
c92b954537 Add new fwd conv fp16/bf16 instances optimized for unit group size. 2026-01-28 10:50:46 -05:00
Graner, Johannes
029efffeb5 Update test with new applicability 2026-01-28 09:19:41 -05:00
Graner, Johannes
0eee2d3392 Fix threshold calculation 2026-01-28 09:18:03 -05:00
Graner, Johannes
55d8e9b4f0 Add missing logic to wmma multiple d kernel 2026-01-28 02:12:18 -05:00
Graner, Johannes
74eb200c73 Fix error threshold calculations 2026-01-27 09:23:12 -05:00
Graner, Johannes
ad3954f119 Enable bwd weight splitk autodeduction with cap 2026-01-27 08:46:53 -05:00
Max Podkorytov
b66597ed96 Add build time optimization documentation (#3608)
This document describes techniques for reducing C++ template instantiation
overhead in the Composable Kernel codebase, including:

- Replacing recursive templates with pack expansion (O(N) → O(1) depth)
- Using named functors instead of lambdas to share instantiations
- Replacing template recursion with constexpr loops
- Using fold expressions for accumulation operations

These techniques can significantly reduce build times for template-heavy code.
2026-01-27 06:07:27 -07:00
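As a rough illustration of two techniques named in the b66597ed96 message above (fold expressions for accumulation, and constexpr loops in place of template recursion), here is a minimal sketch. The names `sum_recursive`, `sum_fold`, and `factorial_loop` are hypothetical and do not come from the Composable Kernel sources. It assumes a C++17 compiler.

```cpp
// Hypothetical names for illustration only; not taken from the CK codebase.

// Recursive formulation: each element peels off one more instantiation,
// so an N-element pack needs a chain of N+1 nested instantiations.
template <int... Is>
struct sum_recursive;

template <>
struct sum_recursive<>
{
    static constexpr int value = 0;
};

template <int I, int... Rest>
struct sum_recursive<I, Rest...>
{
    static constexpr int value = I + sum_recursive<Rest...>::value;
};

// Fold-expression formulation (C++17): the whole pack is accumulated
// inside a single instantiation, with no recursive chain.
template <int... Is>
struct sum_fold
{
    static constexpr int value = (0 + ... + Is);
};

// constexpr loop: an ordinary function evaluated at compile time,
// standing in for a recursive template such as factorial<N>.
constexpr int factorial_loop(int n)
{
    int result = 1;
    for(int i = 2; i <= n; ++i)
    {
        result *= i;
    }
    return result;
}
```

Both formulations give the same values. The difference lies only in how many template instantiations the compiler has to create to get there.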
Bartłomiej Kocot
3d67e6c492 [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624)
* [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err

* Update test_grouped_convnd_fwd_tile.cpp

* Update test_grouped_convnd_fwd_tile.cpp

* Update conv_tuning_params.hpp

* clang format fix

* Update CMakeLists.txt
2026-01-27 11:04:11 +02:00
Johannes Graner
c190d8d61f [CK tests] Extend conv GPU reference (#3539)
* test_convnd_fwd

* test_convnd_bwd_data

* test_conv_bwd_data_scale

* test_grouped_convnd_fwd_clamp

* test_grouped_convnd_fwd_scale

* multiple A/B tensors and D tensor for fwd GPU ref

* test_grouped_convnd_fwd_scaleadd_ab

* test_grouped_convnd_fwd_bias_clamp

* test_grouped_convnd_fwd_bilinear

* test_grouped_convnd_fwd_gk_bias_clamp

* Extend GPU reference to enable batchnorm epilogue

* test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp

* test_grouped_conv_bwd_data_bilinear

* test_grouped_convnd_bwd_weight_bilinear

* Add missing template instantiation

* Perform operations in float in reference

* Slightly increase tolerance for batchnorm profiler

* Revert "Slightly increase tolerance for batchnorm profiler"

This reverts commit a3b2475229.

* Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp"

This reverts commit 6da4576060.

* Revert "Extend GPU reference to enable batchnorm epilogue"

This reverts commit e2f75fa10e.

* Clarify variable names

* Refactor elementwise ops into helper functions

* Make helpers C++17-compatible
2026-01-27 09:49:42 +01:00
Ville Pietilä
6361810fb5 Fix merge conflict. 2026-01-27 03:08:22 -05:00
Robin Voetter
cc75948d1c [CK_BUILDER] conv bwd weight testing (#3618)
* ck-builder: restructure testing conv

In order to prepare for bwd of conv testing, this commit moves some
files and types around so that we can reuse ckt::Args for both forward
and backwards convolution.

* ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp

This will allow us to more easily include fwd.hpp from backwards
definitions, which is required for initializing bwd values.

* ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3

Turns out that the supplied layout isn't actually supported...

* ck-builder: ck and reference conv integration for bwd weight

* ck-builder: ck bwd weight execution test

* ck-builder: ckt::run support for ck-tile bwd weight

* ck-builder: ck tile bwd weight execution test

* ck-builder: extra debug printing in MatchesReference

* ck-builder: make ckt::run return RunResult

This type is more convenient than std::tuple, as it will allow us to
use google test matchers with this in the future.

* ck-builder: RunResult matcher

Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error
message about how and why running an algorithm failed.

* ck-builder: doc fixes

* ck-builder: add missing headers
2026-01-26 23:50:15 +01:00
Andrew Clark
8654c0628f Finished testing failure types. Removed testing code. 2026-01-26 15:09:49 -07:00
Andrew Clark
402f21d0a6 Removed working tests. Validating remaining tests. 2026-01-26 15:09:49 -07:00
Andrew Clark
1397924c21 Removed working tests. Validating remaining tests. 2026-01-26 15:09:49 -07:00
Andrew Clark
6c596b9553 Testing a pattern to support all text variations 2026-01-26 15:09:49 -07:00
Andrew Clark
58e1d03244 Removing working cases to test other failure examples 2026-01-26 15:09:49 -07:00
Andrew Clark
95768d1b22 Adding forcing failure to test notifications 2026-01-26 15:09:49 -07:00
Andrew Clark
786965b95e Fixing Jenkinsfile too large error 2026-01-26 15:09:49 -07:00
Andrew Clark
42a731b791 Updating failure patterns to be more reliable and adding tests to verify they are caught in the logs 2026-01-26 15:09:49 -07:00
John Shumway
a213ce676b Add python analysis scripts for Clang's time trace (#3644)
This PR introduces a Python toolkit for analyzing Clang's `-ftime-trace` build performance data. This is the foundation for our systematic effort to reduce CK and CK-Tile build times (#3575).

The toolkit provides fast parsing of trace JSON files into pandas DataFrames using orjson, with specialized functions for analyzing template instantiation costs and compilation phase breakdowns. It includes a core library (`trace_analysis/`), example scripts for quick analysis, a comprehensive README with usage documentation, and an interactive Jupyter notebook demonstration.

Key features include memory-efficient DataFrame schemas with optimized dtypes, recursive hierarchical phase analysis, automatic metadata extraction (source file, compilation timing), and template instantiation filtering. The design supports both standalone scripts and interactive Jupyter notebook workflows.

This single-file analysis capability lays the groundwork for future multi-file analysis across thousands of compilation units, enabling data-driven optimization and build time regression detection.
2026-01-26 13:44:36 -08:00
Enrico Degregori
2e49b6b2f7 Padding support for wave transfer (#3537)
* Add padding support with transpose

Also move the check before writing, storing is_src_valid during reading

* Add/modify instances to use wave transfer for gemm universal

The condition is changed so that the vector size of both VMEM reading and LDS
writing must be equal to 8 in order to use the wave transfer

* Fix clang format

* Modify example

* Fix bwd data

* Add restriction for wave transfer with padding and transpose

Add test case which shows this limitation

* Fix validity checks for 8-bit types

* Add validity check gemm_bias_add_reduce

* Add validity check grouped gemm tile loop

* Fix validity checks new flavours

* Minor fixes

* Fix clang format
2026-01-26 12:57:09 -08:00
Thrupti Raj Lakshmana Gowda
bd5fec81af Removing [4,64,16] warp tile from Tile Engine (#3643) 2026-01-26 11:56:06 -08:00
yinglu
8942a19d5e ck: add CK_USE_GFX950 macro (#3636) 2026-01-26 11:38:45 -08:00
Aviral Goel
b8751e505d feat: Add Interwave scheduler for aquant memory pipeline (#3540)
* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipeline in aquant

* test: add unit test for aquant memory pipeline

* WIP: host level interwave pipeline compiles

* WIP: interwave implementation computes correct GEMM result when no aquant

* WIP: quantization works for subset of problem shapes

* WIP: quantization works for subset of problem shapes

* WIP: interwave memory pipeline passes local test

* feat: Add interwave pipeline implementation for memory pipeline in aquant

* fix: compilation error on gfx950

* chore: remove debug statements from the code

* test: resolve merge conflict

* test: remove non rcr unit tests from test suite
2026-01-26 11:27:42 -08:00
Thomas Ning
3900e1e7ce Solve the CTAD regression & add the shell file for Docker management in testing (#3634)
* Finished the work

* Fix the pipeline
2026-01-26 10:29:28 -08:00
SamiAario-AMD
834642202c Re-enable f8 x bf8 tests on compv3 and compv4 (#3605)
* Re-enable f8 x bf8 tests on CompV3 as they now pass

* On CompV4, fp8 x bf8 tests now pass with K_BlockSize I32

* Add a changelog entry

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2026-01-26 10:23:26 -08:00
chris-tsiaousis-hpc
917f35553a Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617)
* Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation

This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct.

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

* Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>

---------

Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>
2026-01-26 10:20:30 -08:00
Max Podkorytov
de59c0716c Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585)
This change significantly improves compile-time performance by reducing template
instantiation depth for sequence generation and merging operations:

Optimizations:
- sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using
  __make_integer_seq to generate indices in a single step, then applying the
  functor via pack expansion
- uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq
  with a helper that applies a constant value via pack expansion
- sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction
  strategy. Added direct concatenation specializations for 1-4 sequences to
  avoid recursion in common cases, falling back to binary tree merging for 5+
  sequences

Documentation:
- Added extensive inline comments explaining why sequence_merge cannot achieve
  O(1) depth like sequence_gen (requires computing cumulative sequence lengths
  from heterogeneous inputs, inherently requiring recursion)
- Documented the binary tree reduction approach and why it's superior to fold
  expressions for this use case

Testing:
- Added comprehensive unit tests for uniform_sequence_gen with different values,
  sizes, and edge cases
- Added tests for sequence_gen with custom functors (double, square, identity,
  constant) to verify the new implementation works with arbitrary functors
- Added tests for sequence_merge with 4, 5, and many sequences to verify both
  the direct concatenation path and binary tree reduction path
- Added tests for empty sequence edge cases
2026-01-26 10:08:55 -08:00
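The pack-expansion idea described in the de59c0716c message above can be sketched roughly as follows. This is not the CK implementation: the names `seq`, `seq_gen`, `uniform_seq_gen`, `seq_merge`, `square`, and `constant_fn` are hypothetical, and the standard `std::make_index_sequence` is used in place of the `__make_integer_seq` builtin that the commit mentions. Only the direct concatenation of one or two sequences is shown. The binary tree reduction for larger counts is omitted.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical compile-time integer sequence (stand-in for CK's own type).
template <int... Is>
struct seq
{
};

// Apply functor F to every index in one pack expansion instead of
// building the result element by element through recursion.
template <typename F, std::size_t... Idx>
constexpr auto seq_gen_impl(std::index_sequence<Idx...>)
{
    return seq<F{}(static_cast<int>(Idx))...>{};
}

// seq_gen<N, F> is seq<F(0), F(1), ..., F(N-1)>.
template <std::size_t N, typename F>
using seq_gen = decltype(seq_gen_impl<F>(std::make_index_sequence<N>{}));

// Example functors; each needs a constexpr call operator.
struct square
{
    constexpr int operator()(int i) const { return i * i; }
};

template <int Value>
struct constant_fn
{
    constexpr int operator()(int) const { return Value; }
};

// A uniform sequence is generation with a functor that ignores the index.
template <std::size_t N, int Value>
using uniform_seq_gen = seq_gen<N, constant_fn<Value>>;

// Direct concatenation for one or two sequences, without recursion.
template <typename... Seqs>
struct seq_merge;

template <int... Xs>
struct seq_merge<seq<Xs...>>
{
    using type = seq<Xs...>;
};

template <int... Xs, int... Ys>
struct seq_merge<seq<Xs...>, seq<Ys...>>
{
    using type = seq<Xs..., Ys...>;
};
```

With these definitions, `seq_gen<4, square>` names the type `seq<0, 1, 4, 9>`, and `seq_merge<seq<1, 2>, seq<3>>::type` names `seq<1, 2, 3>`.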
Illia Silin
054c437dec add dockerfile for manylinux (#3651) 2026-01-26 09:23:19 -08:00
Ville Pietilä
a1a2f05b3c Merge remote-tracking branch 'origin/barkocot/direct-load-conv-wrw' into features/grouped-conv-perf-uplift 2026-01-26 09:06:07 -05:00