composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-01 04:07:56 +00:00

Author	SHA1	Message	Date
Ville Pietilä	e2225e2baa	Git ignore rocprofv3 files.	2026-02-04 10:05:14 -05:00
Ville Pietilä	c97795b139	Remove .dat file.	2026-02-04 10:04:55 -05:00
Ville Pietilä	1c842f39a4	Git ignore profiler output.	2026-02-04 10:02:57 -05:00
Ville Pietilä	1c1ac4ef10	Small fixes to runner script.	2026-02-04 09:55:05 -05:00
Ville Pietilä	73b459c5a4	Runner script for benchmarking.	2026-02-04 09:38:16 -05:00
Ville Pietilä	3e6415a8ea	True baseline benchmarking results.	2026-02-04 07:56:15 -05:00
Ville Pietilä	20cf6df685	Best instances for benchmark shapes.	2026-02-04 07:27:34 -05:00
Ville Pietilä	2cfc4209bb	Profile optionally only a given instance.	2026-02-04 06:31:36 -05:00
Ville Pietilä	f6f381dbd4	Benchmarking shapes and baseline results.	2026-02-04 06:07:19 -05:00
Ville Pietilä	403f36ed26	Disable building all but fwd convs for CK profiler.	2026-02-04 06:07:06 -05:00
Bartłomiej Kocot	fbb073f276	Update device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v3.hpp	2026-01-31 20:46:58 +01:00
Jakub Piasecki	2086516deb	fixed building errors	2026-01-30 19:22:34 +00:00
Jakub Piasecki	ae2d2d9f2c	fixed conflicts	2026-01-30 18:47:19 +00:00
Bartlomiej Kocot	a7b57187cf	Grouped Convolution Backward Data Direct Load Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>	2026-01-30 18:45:23 +00:00
Graner, Johannes	c815e734c7	Add good instance	2026-01-30 10:50:06 -05:00
Ville Pietilä	7ffa682bd3	Add missing applicability check to v3 fwd convs.	2026-01-30 05:05:38 -05:00
Ville Pietilä	966706bb21	Add new grouped conv instance to the gfx950 branch.	2026-01-30 04:13:36 -05:00
Graner, Johannes	5301efc8e4	Add NumGroupsToMerge to BwdWeight type string	2026-01-29 09:30:02 -05:00
Ville Pietilä	0fba67a7e7	Add fwd conv group merging to the v3 conv instances.	2026-01-29 08:16:11 -05:00
Ville Pietilä	44960922a2	Merge remote-tracking branch 'origin/jograner/bwd-weight-splitk-autodeduce' into features/grouped-conv-perf-uplift	2026-01-28 10:57:40 -05:00
Ville Pietilä	c92b954537	Add new fwd conv fp16/bf16 instances optimized for unit group size.	2026-01-28 10:50:46 -05:00
Graner, Johannes	029efffeb5	Update test with new applicability	2026-01-28 09:19:41 -05:00
Graner, Johannes	0eee2d3392	Fix threshold calculation	2026-01-28 09:18:03 -05:00
Graner, Johannes	55d8e9b4f0	Add missing logic to wmma multiple d kernel	2026-01-28 02:12:18 -05:00
Graner, Johannes	74eb200c73	Fix error threshold calculations	2026-01-27 09:23:12 -05:00
Graner, Johannes	ad3954f119	Enable bwd weight splitk autodeduction with cap	2026-01-27 08:46:53 -05:00
Max Podkorytov	b66597ed96	Add build time optimization documentation (#3608) This document describes techniques for reducing C++ template instantiation overhead in the Composable Kernel codebase, including: - Replacing recursive templates with pack expansion (O(N) → O(1) depth) - Using named functors instead of lambdas to share instantiations - Replacing template recursion with constexpr loops - Using fold expressions for accumulation operations These techniques can significantly reduce build times for template-heavy code.	2026-01-27 06:07:27 -07:00
Bartłomiej Kocot	3d67e6c492	[CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err (#3624) * [CK TILE] Enable CK TILE Conv Fwd tests in CI and fix check_err * Update test_grouped_convnd_fwd_tile.cpp * Update test_grouped_convnd_fwd_tile.cpp * Update conv_tuning_params.hpp * clang format fix * Update CMakeLists.txt	2026-01-27 11:04:11 +02:00
Johannes Graner	c190d8d61f	[CK tests] Extend conv GPU reference (#3539) * test_convnd_fwd * test_convnd_bwd_data * test_conv_bwd_data_scale * test_grouped_convnd_fwd_clamp * test_grouped_convnd_fwd_scale * multiple A/B tensors and D tensor for fwd GPU ref * test_grouped_convnd_fwd_scaleadd_ab * test_grouped_convnd_fwd_bias_clamp * test_grouped_convnd_fwd_bilinear * test_grouped_convnd_fwd_gk_bias_clamp * Extend GPU reference to enable batchnorm epilogue * test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp * test_grouped_conv_bwd_data_bilinear * test_grouped_convnd_bwd_weight_bilinear * Add missing template instantiation * Perform operations in float in reference * Slightly increase tolerance for batchnorm profiler * Revert "Slightly increase tolerance for batchnorm profiler" This reverts commit `a3b2475229`. * Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp" This reverts commit `6da4576060`. * Revert "Extend GPU reference to enable batchnorm epilogue" This reverts commit `e2f75fa10e`. * Clarify variable names * Refactor elementwise ops into helper functions * Make helpers C++17-compatible	2026-01-27 09:49:42 +01:00
Ville Pietilä	6361810fb5	Fix merge conflict.	2026-01-27 03:08:22 -05:00
Robin Voetter	cc75948d1c	[CK_BUILDER] conv bwd weight testing (#3618) * ck-builder: restructure testing conv In order to prepare for bwd of conv testing, this commit moves some files and types around so that we can reuse ckt::Args for both forward and backwards convolution. * ck-builder: decouple fwd_ck.hpp and fwd_reference.hpp from fwd.hpp This will allow us to more easily include fwd.hpp from backwards definitions, which is required for initializing bwd values. * ck-builder: fix layout of test_ckb_conv_bwd_weight_xdl_cshuffle_v3 Turns out that the supplied layout isn't actually supported... * ck-builder: ck and reference conv integration for bwd weight * ck-builder: ck bwd weight execution test * ck-builder: ckt::run support for ck-tile bwd weight * ck-builder: ck tile bwd weight execution test * ck-builder: extra debug printing in MatchesReference * ck-builder: make ckt::run return RunResult This type is more convenient than std::tuple, as it will allow us to use google test matchers with this in the future. * ck-builder: RunResult matcher Using EXPECT_THAT(..., SuccessfulRun()) will generate a check and a nice error message about how and why running an algorithm failed. * ck-builder: doc fixes * ck-builder: add missing headers	2026-01-26 23:50:15 +01:00
Andrew Clark	8654c0628f	Finished testing failure types. Removed testing code.	2026-01-26 15:09:49 -07:00
Andrew Clark	402f21d0a6	Removed working tests. Validating remaining tests.	2026-01-26 15:09:49 -07:00
Andrew Clark	1397924c21	Removed working tests. Validating remaining tests.	2026-01-26 15:09:49 -07:00
Andrew Clark	6c596b9553	Testing a pattern to support all text variations	2026-01-26 15:09:49 -07:00
Andrew Clark	58e1d03244	Removing working cases to test other failure examples	2026-01-26 15:09:49 -07:00
Andrew Clark	95768d1b22	Adding forcing failure to test notifications	2026-01-26 15:09:49 -07:00
Andrew Clark	786965b95e	Fixing Jenkinsfile too large error	2026-01-26 15:09:49 -07:00
Andrew Clark	42a731b791	Updating failure patterns to be more reliable and adding tests to verify they are caught in the logs	2026-01-26 15:09:49 -07:00
John Shumway	a213ce676b	Add python analysis scripts for Clang's time trace (#3644) This PR introduces a Python toolkit for analyzing Clang's `-ftime-trace` build performance data. This is the foundation for our systematic effort to reduce CK and CK-Tile build times (#3575). The toolkit provides fast parsing of trace JSON files into pandas DataFrames using orjson, with specialized functions for analyzing template instantiation costs and compilation phase breakdowns. It includes a core library (`trace_analysis/`), example scripts for quick analysis, a comprehensive README with usage documentation, and an interactive Jupyter notebook demonstration. Key features include memory-efficient DataFrame schemas with optimized dtypes, recursive hierarchical phase analysis, automatic metadata extraction (source file, compilation timing), and template instantiation filtering. The design supports both standalone scripts and interactive Jupyter notebook workflows. This single-file analysis capability lays the groundwork for future multi-file analysis across thousands of compilation units, enabling data-driven optimization and build time regression detection.	2026-01-26 13:44:36 -08:00
Enrico Degregori	2e49b6b2f7	Padding support for wave transfer (#3537) * Add padding support with transpose Also move check before writing storing is_src_valid during reading * Add/modify instances to use wave transfer for gemm universal Condition is changed so now the vectorsize of vmem reading and lds writing must be equal to 8 in order to use the wave transfer * Fix clang format * Modify example * Fix bwd data * Add restriction for wave transfer with padding and transpose Add test case which shows this limitation * Fix validity checks 8 bit types * Add validity check gemm_bias_add_reduce * Add validity check grouped gemm tile loop * Fix validity checks new flavours * Minor fixes * Fix clang format	2026-01-26 12:57:09 -08:00
Thrupti Raj Lakshmana Gowda	bd5fec81af	Removing [4,64,16] warp tile from Tile Engine (#3643)	2026-01-26 11:56:06 -08:00
yinglu	8942a19d5e	ck: add CK_USE_GFX950 macro (#3636)	2026-01-26 11:38:45 -08:00
Aviral Goel	b8751e505d	feat: Add Interwave scheduler for aquant memory pipeline (#3540) * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * test: add unit test for aquant memory pipeline * WIP: host level interwave pipeline compiles * WIP: interwave implementation computes correct GEMM result when no aquant * WIP: quantization works for subset of problem shapes * WIP: quantization works for subset of problem shapes * WIP: interwave memory pipeline passes local test * feat: Add interwave pipeline implementation for memory pipline in aquant * fix: compilation error on gfx950 * chore: remove debug statements from the code * test: resolve merge conflict * test: remove non rcr unit tests from test suite	2026-01-26 11:27:42 -08:00
Thomas Ning	3900e1e7ce	Solve the CTAD regression & add up the Shell file for the docker management in testing (#3634) * Finished the work * Fix the pipeline	2026-01-26 10:29:28 -08:00
SamiAario-AMD	834642202c	Re enable f8 x bf8 tests on compv3 and compv4 (#3605) * Re-enable f8 x bf8 tests on CompV3 as they now pass * On CompV4, fp8 x bf8 tests now pass with K_BlockSize I32 * Add a changelog entry --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-01-26 10:23:26 -08:00
chris-tsiaousis-hpc	917f35553a	Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617) * Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-01-26 10:20:30 -08:00
Max Podkorytov	de59c0716c	Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585) This change significantly improves compile-time performance by reducing template instantiation depth for sequence generation and merging operations: Optimizations: - sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using __make_integer_seq to generate indices in a single step, then applying the functor via pack expansion - uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq with a helper that applies a constant value via pack expansion - sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction strategy. Added direct concatenation specializations for 1-4 sequences to avoid recursion in common cases, falling back to binary tree merging for 5+ sequences Documentation: - Added extensive inline comments explaining why sequence_merge cannot achieve O(1) depth like sequence_gen (requires computing cumulative sequence lengths from heterogeneous inputs, inherently requiring recursion) - Documented the binary tree reduction approach and why it's superior to fold expressions for this use case Testing: - Added comprehensive unit tests for uniform_sequence_gen with different values, sizes, and edge cases - Added tests for sequence_gen with custom functors (double, square, identity, constant) to verify the new implementation works with arbitrary functors - Added tests for sequence_merge with 4, 5, and many sequences to verify both the direct concatenation path and binary tree reduction path - Added tests for empty sequence edge cases	2026-01-26 10:08:55 -08:00
Illia Silin	054c437dec	add dockerfile for manylinux (#3651)	2026-01-26 09:23:19 -08:00
Ville Pietilä	a1a2f05b3c	Merge remote-tracking branch 'origin/barkocot/direct-load-conv-wrw' into features/grouped-conv-perf-uplift	2026-01-26 09:06:07 -05:00

3006 Commits
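
The build-time techniques named in commits b66597ed96 and de59c0716c (pack expansion instead of recursive templates, named functors instead of lambdas, fold expressions for accumulation) can be illustrated with a minimal sketch. This is not the Composable Kernel implementation: `seq`, `generate_t`, `merge2`, `sum`, and `square` are hypothetical names chosen for illustration, and the standard `std::make_index_sequence` stands in for the `__make_integer_seq` builtin mentioned in the commit message. The sketch assumes C++17.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical compile-time integer sequence (illustrative, not CK's type).
template <int... Is>
struct seq
{
};

// Named functor: every use of `square` shares one type, unlike a lambda,
// whose closure type is unique at each point of definition.
struct square
{
    constexpr int operator()(int i) const { return i * i; }
};

// Pack expansion: F is applied to all indices in one step,
// with no recursive template instantiation.
template <typename F, std::size_t... Is>
constexpr auto generate_impl(std::index_sequence<Is...>)
{
    return seq<F{}(static_cast<int>(Is))...>{};
}

template <typename F, std::size_t N>
using generate_t = decltype(generate_impl<F>(std::make_index_sequence<N>{}));

// Direct concatenation of two sequences, again via pack expansion.
template <int... As, int... Bs>
constexpr auto merge2(seq<As...>, seq<Bs...>)
{
    return seq<As..., Bs...>{};
}

// Fold expression: accumulates the whole pack without recursion.
template <int... Is>
constexpr int sum(seq<Is...>)
{
    return (0 + ... + Is);
}

static_assert(std::is_same_v<generate_t<square, 4>, seq<0, 1, 4, 9>>);
static_assert(sum(generate_t<square, 4>{}) == 14);
```

Here `generate_t<square, 4>` names the type `seq<0, 1, 4, 9>`, and `sum` of that sequence evaluates to 14 at compile time. Merging more than two sequences would need the pairwise (binary tree) reduction that the de59c0716c message describes, which this sketch does not attempt.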