composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-01 20:27:42 +00:00

Author	SHA1	Message	Date
Bartłomiej Kocot	fbb073f276	Update device_grouped_conv_bwd_data_multiple_d_xdl_cshuffle_v3.hpp	2026-01-31 20:46:58 +01:00
Jakub Piasecki	2086516deb	fixed building errors	2026-01-30 19:22:34 +00:00
Jakub Piasecki	ae2d2d9f2c	fixed conflicts	2026-01-30 18:47:19 +00:00
Bartlomiej Kocot	a7b57187cf	Grouped Convolution Backward Data Direct Load Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>	2026-01-30 18:45:23 +00:00
Ville Pietilä	7ffa682bd3	Add missing applicability check to v3 fwd convs.	2026-01-30 05:05:38 -05:00
Graner, Johannes	5301efc8e4	Add NumGroupsToMerge to BwdWeight type string	2026-01-29 09:30:02 -05:00
Ville Pietilä	0fba67a7e7	Add fwd conv group merging to the v3 conv instances.	2026-01-29 08:16:11 -05:00
Ville Pietilä	44960922a2	Merge remote-tracking branch 'origin/jograner/bwd-weight-splitk-autodeduce' into features/grouped-conv-perf-uplift	2026-01-28 10:57:40 -05:00
Graner, Johannes	55d8e9b4f0	Add missing logic to wmma multiple d kernel	2026-01-28 02:12:18 -05:00
Graner, Johannes	ad3954f119	Enable bwd weight splitk autodeduction with cap	2026-01-27 08:46:53 -05:00
Max Podkorytov	b66597ed96	Add build time optimization documentation (#3608 ) This document describes techniques for reducing C++ template instantiation overhead in the Composable Kernel codebase, including: - Replacing recursive templates with pack expansion (O(N) → O(1) depth) - Using named functors instead of lambdas to share instantiations - Replacing template recursion with constexpr loops - Using fold expressions for accumulation operations These techniques can significantly reduce build times for template-heavy code.	2026-01-27 06:07:27 -07:00
Johannes Graner	c190d8d61f	[CK tests] Extend conv GPU reference (#3539 ) * test_convnd_fwd * test_convnd_bwd_data * test_conv_bwd_data_scale * test_grouped_convnd_fwd_clamp * test_grouped_convnd_fwd_scale * multiple A/B tensors and D tensor for fwd GPU ref * test_grouped_convnd_fwd_scaleadd_ab * test_grouped_convnd_fwd_bias_clamp * test_grouped_convnd_fwd_bilinear * test_grouped_convnd_fwd_gk_bias_clamp * Extend GPU reference to enable batchnorm epilogue * test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp * test_grouped_conv_bwd_data_bilinear * test_grouped_convnd_bwd_weight_bilinear * Add missing template instantiation * Perform operations in float in reference * Slightly increase tolerance for batchnorm profiler * Revert "Slightly increase tolerance for batchnorm profiler" This reverts commit `a3b2475229`. * Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp" This reverts commit `6da4576060`. * Revert "Extend GPU reference to enable batchnorm epilogue" This reverts commit `e2f75fa10e`. * Clarify variable names * Refactor elementwise ops into helper functions * Make helpers C++17-compatible	2026-01-27 09:49:42 +01:00
Enrico Degregori	2e49b6b2f7	Padding support for wave transfer (#3537 ) * Add padding support with transpose Also move check before writing storing is_src_valid during reading * Add/modify instances to use wave transfer for gemm universal Condition is changed so now the vectorsize of vmem reading and lds writing must be equal to 8 in order to use the wave transfer * Fix clang format * Modify example * Fix bwd data * Add restriction for wave transfer with padding and transpose Add test case which shows this limitation * Fix validity checks 8 bit types * Add validity check gemm_bias_add_reduce * Add validity check grouped gemm tile loop * Fix validity checks new flavours * Minor fixes * Fix clang format	2026-01-26 12:57:09 -08:00
yinglu	8942a19d5e	ck: add CK_USE_GFX950 macro (#3636 )	2026-01-26 11:38:45 -08:00
chris-tsiaousis-hpc	917f35553a	Remove code duplications in batched gemm (multi D) gemm (multi D) wmma (#3617 ) * Added common struct to enable code reduction in gemm gemm and gemm multi_d gemm multi_d wmma implementation This file includes all shared components. The (shared between the two implementations) kernel, the pointer offset computation struct, the grid descriptor creator and definitions, the invoker struct and the argument struct. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the common struct in the batched gemm gemm wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Used the shared structs in the gemm multiple D gemm multiple D wmma cshuffle v3 implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: IWYU paradigm in the gemm gemm and gemm multiple D gemm multiple D wmma cshuffle v3 implementations Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-01-26 10:20:30 -08:00
Max Podkorytov	de59c0716c	Optimize sequence metaprogramming utilities to reduce template instantiation depth (#3585 ) This change significantly improves compile-time performance by reducing template instantiation depth for sequence generation and merging operations: Optimizations: - sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using __make_integer_seq to generate indices in a single step, then applying the functor via pack expansion - uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq with a helper that applies a constant value via pack expansion - sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction strategy. Added direct concatenation specializations for 1-4 sequences to avoid recursion in common cases, falling back to binary tree merging for 5+ sequences Documentation: - Added extensive inline comments explaining why sequence_merge cannot achieve O(1) depth like sequence_gen (requires computing cumulative sequence lengths from heterogeneous inputs, inherently requiring recursion) - Documented the binary tree reduction approach and why it's superior to fold expressions for this use case Testing: - Added comprehensive unit tests for uniform_sequence_gen with different values, sizes, and edge cases - Added tests for sequence_gen with custom functors (double, square, identity, constant) to verify the new implementation works with arbitrary functors - Added tests for sequence_merge with 4, 5, and many sequences to verify both the direct concatenation path and binary tree reduction path - Added tests for empty sequence edge cases	2026-01-26 10:08:55 -08:00
Ville Pietilä	a1a2f05b3c	Merge remote-tracking branch 'origin/barkocot/direct-load-conv-wrw' into features/grouped-conv-perf-uplift	2026-01-26 09:06:07 -05:00
Graner, Johannes	a9db3100b8	Implement group merging for bwd_weight and add instances	2026-01-26 08:48:42 -05:00
Bartłomiej Kocot	25cb0283ed	Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp	2026-01-26 13:16:15 +01:00
Bartlomiej Kocot	20b056ded0	Grouped Conv Bwd Weight Direct Load	2026-01-26 10:21:39 +00:00
Ville Pietilä	7ac3794284	Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor. (#3639 ) Co-authored-by: Ville Pietilä <>	2026-01-25 13:42:23 +01:00
chris-tsiaousis-hpc	e1c46ff548	Remove code duplications in batched gemm wmma (#3580 ) * Moved device struct for batched gemm wmma to a common file Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Use the common device struct in the scaled batched gemm wmma implementation Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Boy-scout: Remove unused includes and ambiguous comment Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved pointer offset calculation and gridwise argument to common struct This change enables further code reduction by re-using the common structs for the batched gemm and batched gemm b scale wmma implementations. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> * Moved type string to the common struct of DeviceBatchedGemm_Wmma_CShuffleV3_Common" Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> --------- Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com>	2026-01-23 12:39:03 -08:00
Graner, Johannes	0a10cb3582	Implement group merging for bwd_weight and add instances	2026-01-23 06:28:07 -05:00
Ville Pietilä	f88efddea4	Add new instances for merging multiple fwd conv groups into a single GEMM batch. Allow group merging for C > 1 when vector load/store size is 1 for the output tensor.	2026-01-23 06:26:41 -05:00
Wojciech Laskowski	81ee19bd2c	WMMA grouped conv fwd large tensor extra flavors (#3582) * Additional flavors for WMMA conv fwd large tensor - added F16/BF16 clamp operation - added F16/BF16 bias_clamp operation - small modification to the device code to accommodate extra tensors * changed strategy to handle GemmArgs array * Adding generic instance * Added generic instance to clamp and bias_clamp ops	2026-01-23 12:19:51 +01:00
Erwin Terpstra	d5ae81b292	Implement batched gemm add relu gemm add for rdna4 (#3391) * wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implementation * wip: many fixes in implementation of batched gemm gemm multiple d * wip: batched gemm gemm multiple d gridwise op compiling, not working yet * fix: incorrect d0 grid indexing in batched gemm gemm multiple d * feat: add instances for batched gemm add relu gemm add * chore: configure instance with low vector transfer size for odd sizes * chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense * fix: update device_batched_gemm_gemm_wmma to work with new gridwise changes * fix: disable odd size tests on XDL archs * chore: removed temporary logging * chore: update some references to C tensor to E tensor * Tentative fix for example template params * Tentative fix for non-multi-D batched gemm gemm device impl. * Tentative fix for xdl example template params * Tentative fix for profiler build on gfx90a * chore: improve device batched gemm gemm multi D comment to include all ops and dimensions * chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply * fix: make the gemm1 data types match what happens in the device op * feat: add d0s/d1s datatypes and layouts to the device op type string * chore: change element-wise op so addition happens in fp32 * chore: add static asserts for gemm0/gemm1 calculated wave sizes * chore: also updated other element-wise ops to use fp32 calculations * chore: log number of supported instances * chore: update instance comment * chore: disable kernel timing in example by default * fix: gemm1 wave size calculation * fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions * chore: remove increased tolerance in batched gemm gemm multiple d example * chore: add comment explaining that verification fails for certain input values * chore: clarify instance comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>	2026-01-20 13:06:59 -08:00
music-dino	6300ad3c62	Batched gemm softmax gemm descriptor fix (#3564 ) * Add rocm to prefix path for codegen * Fix issue with c0_matrix_mask construction	2026-01-20 07:25:30 -08:00
Wojciech Laskowski	b09121f860	WMMA support for batched_gemm_reduce (#3332 ) Summary: - added new device impl of Batched GEMM Reduce for WMMA - added instance library - added WMMA impl to the Batched GEMM Reduce tests	2026-01-20 10:50:46 +01:00
Bartłomiej Kocot	0727e85e52	[CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518) * [BUILDER] Add grouped conv fwd ck tile profiler * [CK TILE] Fix grouped conv kernels splitk and double lds * Updates * Fixes * Move to ckProfiler * Fixes * fix * fix * Change instances to empty list by default * fix * fix * Update grouped_convolution_signatures.hpp * Update grouped_convolution_forward_tile_algs.hpp * [CK TILE] Add grouped convolution forward tests (#3556) * [CK TILE] Add grouped convolution forward tests * fix jenkins * fixes * comments fixes * unit test * unit test fix * Move instances outside builder * fix includes * clang format fix * readme fix * fix includes * fixes	2026-01-19 22:29:01 -07:00
Erwin Terpstra	fe40a5d139	Implement batched gemm bias permute for RDNA4 (#3534 ) * feat: test setup for batched contraction (aka batched gemm multiple d e permute) * wip: device struct for WMMA batched contraction multiple d based on new gridwise op * feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases * fix: failure to resolve template parameters when calling new function overload * fix: passing reference type as parameter instead of underlying types * fix: merge error caused duplicate definitions * fix: make sure constness of template and parameters types match * fix: don't compile batched contraction test on unsupported architectures * feat: add example for new wmma implementation, and consolidate example code between platforms * style: return inline instead of with branch * chore: add extra assert on vector memory access sizes * chore: clean up some unused variables * fix: correct tail number calculation, added small cases and extra instances to the test * fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method	2026-01-17 08:30:27 +01:00
logicat	fec81109f1	Remove unnecessary hip_fp16 include from stream_config (#3549 )	2026-01-16 10:40:05 -08:00
Yung-sheng Tu	6df2d70143	Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429 ) * add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp * add examples * add instances to test * remove duplicate code between examples	2026-01-15 07:19:31 -08:00
John Shumway	5122637215	[CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547 ) * Factor helpers out of conv_traits.hpp * Create a non-templated conv_traits struct * Migrate to new instance-specific instance_to_conv_traits functions * Clean up reflection concepts * Clean up ConvTraits helpers * Update testing for convolution traits This is a lot of cleanup on tests to have verbose coverage of feature extraction, explicit tests for each supported device kernel, and simple, readable test code. * Address reviewer comments and resolve merge conflict	2026-01-15 10:03:21 +01:00
Bartłomiej Kocot	a346cfa960	Disable ActiveWorkgroupsPerCU for different arch in wmma kernels (#3566 )	2026-01-14 12:37:12 -08:00
Bartłomiej Kocot	a07c8e38bd	Fix grouped conv bwd data wmma check (#3562 )	2026-01-14 11:04:37 -08:00
Johannes Graner	f173642087	[CK] Refactor GPU verification kernel to gather error stats on GPU (#3551) * Refactor GPU verification kernel to gather error stats on GPU * Check if result is all zero * non-negative error count doesn't need custom Atomics * Remove unnecessary AtomicMaxFloat function * Simpler warp reduction, remove passed flag * Move verification header to include * Fix header path in test * Fix block reduction loop	2026-01-14 16:04:50 +01:00
Enrico Degregori	693ff3bbb3	Add support for direct store in epilogue and padding support for wave transfer without transpose (#3465 ) - Add support for direct store in epilogue instead of cshuffle - Add padding support for wave transfer without transpose - Add wave transfer with interleaved layout to support direct store - Enable new functionalities on GEMMs - Add optional new functionality support for grouped convolution fwd - Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed)	2026-01-14 11:02:19 +01:00
Ville Pietilä	9908a87c31	[CK_BUILDER] Add bwd weight factories (#3509) * Add placeholder test. * Initial conv bwd weight factory. * Conv builder test refactoring. * Add missing pieces to bwd weight factory. * Improve compile-time error message when no matching factory is found. * Use macro to ensure automatic matching between concepts and their string representations. * Improve compile time diagnostics. * Small improvements. * Improve missing member/wrong type compile-time errors. * Improve compile time diagnostics. * Concept bug fixes. * Remove debug assert. * Update algorithm signature diagnostics. * Factory bug fixes. * First functional version of bwd weight conv factory. * Refactor handling of GEMM-K batch template parameter in conv bwd weight factory. * Concept improvements. * Improve concept diagnostics. * Introduce a common size type for concepts. * Update compile-time diagnostics to use the size type. * Update conv specialization enum. * Fix fwd conv builder tests. * Fix smoke tests. * Separate bwd weight and bwd data tests into separate targets. * Clean-up CK Tile builder tests. * Add bwd weight XDL CShuffle V3 factory. * Build conv bwd weight v3 instances successfully. * Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Test fix. * Add instance traits for bwd weight algorithms. * Add unit tests for instance strings. * Build new instance traits unit tests but exclude WMMA for now. * Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Conv bwd weight DL factory. * Final implementation for bwd weight DL factory. * Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle * Treat ref algorithm the same way as real algorithms in the dispatcher. * Refactor large tensor support and WMMA configuration. * Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3. * Update Readme. * Fix WMMA bwd weight tests. * Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3. * Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle. * Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 * Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and compute types for input and output tensor in bwd weight convs. * Fix fwd factories after refactoring. * clang-format * Move compile-time diagnostics to a separate branch. * Fix ref algorithm dispatching. * Fix smoke tests. * clang-format * Fix factory for regular WMMA conv bwd weight. * Clarify builder Readme. * Remove obsolete test file. * Fix test after merge. * clang-format * Remove the C++26 extensions. * Unify conv elementwise ops and layout definitions for fwd and bwd directions. * Remove old layout and elementwise ops. * Unify handling of conv tensor types between fwd and bwd directions. * Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank. * Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms. * clang-format --------- Co-authored-by: Ville Pietilä <>	2026-01-13 18:12:38 +02:00
Erwin Terpstra	eb041079a3	Implement grouped gemm tile loop for RDNA4 (#3304 ) * feat: grouped gemm tile loop support for RDNA4 * fix: removed extra parameter from grouped gemm example instance * fix: FP8 check incorrectly enabling FP8 on RDNA3	2026-01-13 07:14:23 +01:00
yadaish	32408c8bc0	moe fp8 blockscale use nt (#3524 ) * nt on fp8 blockscale * some improve and tests needs to be fixed * update * fix format * revert useless change * revert any change in amd_buffer_coherence	2026-01-12 10:48:10 +08:00
Johannes Graner	ee2c35b92d	[CK] Allow tensors larger than 2GB in grouped conv bwd weight (#3169 ) * Take split_k into account when checking 2GB tensor limit. * Revert "Take split_k into account when checking 2GB tensor limit." This reverts commit `adf35c91be`. * Optimize grouped conv bwd wei split_k off calc (cherry picked from commit `6f61dd56c5`) * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp (cherry picked from commit `b33877c10f`) * Fix tensor descriptors and stride calculations * Don't miss half of the elements * Fix buffer size calculations * Disable hack if stride not divisible by k_batch * Clean up comments * Disallow hack in non-contiguous edge cases * Index -> Dim * Fix broken test * Refactor applicability checks into separate function * fix missed variable name * Fix variable name in info print * update V3 2GB check * No more regression, use templates instead * Code deduplication * Regression fix for cshuffle * arch-guarded atomic_add implementations for gfx11 * Similar for half(4\|8)_t as well * Only use both offset hacks at the same time * Revert "arch-guarded atomic_add implementations for gfx11" This reverts commit `3883fe6935`. This reverts commit `5311ec608d`. * Reapply "arch-guarded atomic_add implementations for gfx11" This reverts commit `1972adeddc`. * Only remove float4 atomic_add * Refactor to single flag * Consolidate template parameters * Consolidate flag in transformers --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>	2026-01-08 08:02:02 +01:00
Bartłomiej Kocot	f449a5faaa	Disable fp32 atomic adds on gfx11 (#3510 ) * Disable fp32 atomic adds on gfx11 * Fixes is supported	2026-01-07 15:32:04 -08:00
Enrico Degregori	aad4cf0985	Wmma support for gemm_bias_add_reduce (#3316 ) * Add tests for gemm_bias_add_reduce * Initial working implementation * Generalize implementation of reduce epilogue * Add tests for all layouts * Add instances * Fix test archs * Fix xdl bug * Remove library/profiler duplications * Fix num_byted error profiler * Fix typos * Fix copyright	2026-01-07 10:27:16 -08:00
Erwin Terpstra	f9c6ba0403	Implement grouped gemm fastgelu for RDNA4 (#3303 ) * Implement grouped gemm fastgelu for RDNA4 * chore: some cleanup and minor inconsistencies in grouped gemm profiler * chore: clarified logic and reporting of supported instance warnings	2026-01-07 10:20:44 -08:00
Estevan Vedovelli	1224bc0a82	Add support to gfx1153 and fix gfx115X WMMA config (#3496 ) * Support for gfx115X * Changes for gfx115X * Add gfx1153 * Update changelog --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2026-01-05 10:03:30 -08:00
John Shumway	4670df5ca6	[CK_BUILDER] Remove cmath include (#3508) Remove the <cmath> dependency from device_tensor_generator.hpp and fix a typo from a previous force push. The changes replace standard library math functions with their ck::math equivalents and define PI as a local constant instead of computing it using std::acos. Key changes: * Removed #include <cmath> header dependency * Replaced std::acos(-1.0) with hardcoded PI constant 3.141592653f * Replaced std::sqrt, std::cos, and std::sin with ck::math equivalents	2026-01-02 16:58:35 -08:00
John Shumway	355ce9230d	Remove non-standard M_PI (#3507 ) Just use PI=acos(-1.0) as a local static constexpr. This has been causing build issues on windows.	2026-01-02 14:21:46 -08:00
John Shumway	1da340031c	Enable math defines for MSVC. (#3503) The symbol M_PI is breaking the build on Windows. The _USE_MATH_DEFINES macro enables M_PI and other math constants on Windows. (I'm guessing this is more idiomatic than the old trick of using PI=acos(-1.0).) https://learn.microsoft.com/en-us/cpp/c-runtime-library/math-constants?view=msvc-170 Co-authored-by: BradPepersAMD <Brad.Pepers@amd.com>	2026-01-02 14:36:42 -05:00
Ville Pietilä	6e8c401e33	[CK_BUILDER] Instance traits for conv bwd weight algorithms (#3498) Added instance traits for the following bwd weight conv algorithms: DeviceGroupedConvBwdWeight_Xdl_CShuffleV3, DeviceGroupedConvBwdWeight_Wmma_CShuffleV3, DeviceGroupedConvBwdWeight_Wmma_CShuffle, DeviceGroupedConvBwdWeight_TwoStage_Xdl_CShuffle, DeviceGroupedConvBwdWeight_TwoStage_Wmma_CShuffleV3, DeviceGroupedConvBwdWeight_DL, DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle, DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3. Also added unit tests for instance traits of those bwd weight algorithms that are currently exposed by the narrow CK build for MIOpen. --------- Co-authored-by: Ville Pietilä <>	2025-12-31 15:41:15 -08:00
kabrahamAMD	f86bbb1aef	[CK_Builder] [testing] Integrate device random generators (#3427 ) Implemented device random number generators for ck tensors. Includes tests and integration to ck builder testing interface.	2025-12-30 10:03:05 -08:00
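
Note on commit b66597ed96 (build time optimization documentation): its message names fold expressions for accumulation and pack expansion with named functors as ways to cut template instantiation depth. The C++17 sketch below only illustrates that general idea. The names `sum_recursive`, `sum_fold`, `square`, and `sum_mapped` are hypothetical and are not taken from the Composable Kernel sources.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical names for illustration only; none of these are CK APIs.

// Recursive accumulation: one class instantiation per element, depth O(N).
template <int... Is>
struct sum_recursive;

template <>
struct sum_recursive<>
{
    static constexpr int value = 0;
};

template <int I, int... Rest>
struct sum_recursive<I, Rest...>
{
    static constexpr int value = I + sum_recursive<Rest...>::value;
};

// Fold expression: a single instantiation, whatever the pack size.
template <int... Is>
struct sum_fold
{
    static constexpr int value = (0 + ... + Is);
};

// Named functor: every call site that passes `square` shares one type,
// whereas each lambda expression would introduce a distinct closure type.
struct square
{
    constexpr int operator()(int i) const { return i * i; }
};

// Pack expansion applies the functor to all indices in one step.
template <typename F, std::size_t... Is>
constexpr int sum_mapped(F f, std::index_sequence<Is...>)
{
    return (0 + ... + f(static_cast<int>(Is)));
}

// Both formulations agree; only the instantiation depth differs.
static_assert(sum_recursive<1, 2, 3, 4>::value == 10);
static_assert(sum_fold<1, 2, 3, 4>::value == 10);
static_assert(sum_mapped(square{}, std::make_index_sequence<4>{}) == 14);
```

The recursive and folded versions compute the same value, so a change of this kind can be checked with `static_assert` before comparing build times.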


822 Commits
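
Note on commit de59c0716c (sequence metaprogramming utilities): its message describes merging sequences with a binary tree reduction so that instantiation depth grows as O(log N) rather than O(N). The C++17 sketch below shows one possible way to arrange such a reduction. It is an assumption-laden illustration, not the CK implementation: `seq`, `concat2`, `merge_range`, and `merge_t` are hypothetical names, and the use of `std::tuple` with `std::tuple_element_t` to index the inputs is a choice made for this example only.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical stand-in for an integer sequence type.
template <int... Is>
struct seq
{
};

// Direct concatenation of two sequences, no recursion involved.
template <typename A, typename B>
struct concat2;

template <int... As, int... Bs>
struct concat2<seq<As...>, seq<Bs...>>
{
    using type = seq<As..., Bs...>;
};

// Merge the sequences at positions [Begin, End) of Tuple by splitting the
// range in half, so the nesting of merge_range is O(log N) deep.
template <typename Tuple, std::size_t Begin, std::size_t End, typename = void>
struct merge_range
{
    static constexpr std::size_t Mid = Begin + (End - Begin) / 2;
    using type = typename concat2<typename merge_range<Tuple, Begin, Mid>::type,
                                  typename merge_range<Tuple, Mid, End>::type>::type;
};

// Leaf: exactly one sequence in the range.
template <typename Tuple, std::size_t Begin, std::size_t End>
struct merge_range<Tuple, Begin, End, std::enable_if_t<End - Begin == 1>>
{
    using type = std::tuple_element_t<Begin, Tuple>;
};

// Leaf: empty range.
template <typename Tuple, std::size_t Begin, std::size_t End>
struct merge_range<Tuple, Begin, End, std::enable_if_t<End == Begin>>
{
    using type = seq<>;
};

template <typename... Seqs>
using merge_t = typename merge_range<std::tuple<Seqs...>, 0, sizeof...(Seqs)>::type;

static_assert(std::is_same_v<merge_t<seq<1>, seq<2, 3>, seq<>, seq<4>, seq<5, 6>>,
                             seq<1, 2, 3, 4, 5, 6>>);
```

The commit message also mentions direct concatenation specializations for 1-4 sequences; those are left out here to keep the sketch short.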