Introduces cutlass::epilogue::thread::Snake, a two-operand activation
functor implementing Snake_a(x) = x + (1/a) * sin^2(a*x) from
Ziyin et al. 2020 (arXiv:2006.08195). The per-channel learnable
frequency `a` flows through an EVT child (e.g. Sm90RowBroadcast),
composing into Sm90EVT<Sm90Compute<Snake, ...>, x_node, alpha_node>
for fused GEMM+Snake epilogues used in neural vocoders.
Adds unit tests in test/unit/epilogue/thread/activation.cu covering
f32 and bf16 paths, validated against float64 reference goldens.
Closes #3141
The store(frag, tile_offset) method computed the pointer offset
without dividing by kElementsPerAccess, while the matching load(frag,
tile_offset) method includes this division. Both load_with_pointer_offset
and store_with_pointer_offset apply the same byte conversion, so the
tile_offset -> pointer_offset calculation must also match.
When kElementsPerAccess > 1, load and store therefore referenced
different memory locations for the same logical tile offset.
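A minimal sketch of the mismatch (constants and function names are illustrative, not the iterator's actual members):

```cpp
// Both load() and store() forward to *_with_pointer_offset(), which expect
// the offset in access units, so tile_offset must be scaled the same way.
constexpr int kElementsPerAccess = 4;     // example value
constexpr int kTileElements      = 128;   // elements advanced per tile step

int load_pointer_offset(int tile_offset) {
  return tile_offset * kTileElements / kElementsPerAccess;  // existing load() behavior
}

int store_pointer_offset_before(int tile_offset) {
  return tile_offset * kTileElements;                       // bug: missing the division
}

int store_pointer_offset_after(int tile_offset) {
  return tile_offset * kTileElements / kElementsPerAccess;  // now matches load()
}
```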
Fixes #3017
Signed-off-by: Blake Ledden <bledden@users.noreply.github.com>
The workspace parameter is marked as unused, as in other kernels, but it is actually used after 3.3.0, so the code that marks it as unused can be removed.
* fix print_layout printf format in device code
* Replace %.*s format specifier with explicit loop
* Remove unused delim variable
The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing the literal %.*s to appear in the output.
Fixes #2496
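A sketch of the workaround, assuming a hypothetical print_prefix helper rather than the exact print_tensor.hpp change:

```cpp
#include <cstdio>

// Device-side printf mishandles the dynamic-width "%.*s" specifier, so emit
// at most n characters of s with an explicit character loop instead.
__host__ __device__
inline void print_prefix(char const* s, int n) {
  for (int i = 0; i < n && s[i] != '\0'; ++i) {
    printf("%c", s[i]);
  }
}
```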
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
---------
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Add blockscaled ragged contiguous grouped GEMM for MoEs
* Clean up the example
* Improve the introduction to the example
---------
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
* Blackwell DistGEMM bug fixes
1. If using a preferred cluster, there needs to be a branch so that
the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape on Blackwell,
and DistGEMM was previously using the per-device shape to evaluate
the workspace size instead of the per-GEMM shape.
3. The flattened size used to initialize host tensors can overflow (in
the Hopper example as well); see the sketch after this list.
4. Preferred and fallback cluster args need to be set explicitly;
otherwise, if someone modifies the example to use the preferred
cluster, it will simply fail.
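For item 3, a sketch of the overflow-safe size computation (the helper name and extents are illustrative, not the example's actual code):

```cpp
#include <cstdint>

// Accumulate the flattened host-tensor size in 64-bit arithmetic so the
// product of the per-GEMM extents cannot wrap a 32-bit int.
inline int64_t flattened_size(int m, int n, int l) {
  return static_cast<int64_t>(m) * n * l;
}
```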
* Fix example runtimes
* Set default fallback cluster shapes to the static ones
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update docs to encourage users to install the DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com>