* fix print_layout printf format in device code
* Replace %.*s format specifier with explicit loop
* Remove unused delim variable
The printf format %.*s with dynamic width does not work correctly
in CUDA device code, causing literal %.*s to appear in output.
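
A minimal host-side sketch of the explicit-loop approach described above. The helper name `print_prefix` and its return value are hypothetical and not taken from the patch; in device code the function would additionally need the appropriate `__host__ __device__` annotations.

```cpp
#include <cstdio>

// Hypothetical helper (not from the patch): print the first `count`
// characters of `str` one at a time instead of relying on "%.*s".
// Stops early at the terminating NUL. Returns the number of characters printed.
inline int print_prefix(const char* str, int count) {
  int printed = 0;
  for (int i = 0; i < count && str[i] != '\0'; ++i) {
    std::printf("%c", str[i]);
    ++printed;
  }
  return printed;
}
```

Printing one character per call avoids the dynamic-precision specifier, at the cost of more printf calls.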
Fixes #2496
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Update include/cute/util/print_tensor.hpp
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
---------
Co-authored-by: Cris Cecka <ccecka@users.noreply.github.com>
* Adding blockscaled ragged contiguous grouped gemm for MoEs
* cleaning up the example
* introduction to example improved
---------
Co-authored-by: Shreya Gaur <shgaur@dc2-container-xterm-012.prd.it.nvidia.com>
* Blackwell DistGEMM bug fixes
1. If using preferred cluster, there needs to be a branch so that
the universal GEMM wrapper finds the correct base params.
2. Workspace sizes can change depending on problem shape in Blackwell,
and DistGEMM was previously using the per-device shape to evaluate
workspace size instead of the per-gemm shape.
3. The flattened size used to initialize host tensors can overflow (in
the Hopper example as well).
4. Preferred and fallback cluster args need to be set explicitly,
otherwise if someone modifies the example to use preferred cluster,
it will just fail.
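
A minimal sketch of the kind of overflow described in item 3, assuming the flattened size is a product of three 32-bit extents. The helper name `flattened_size` and the parameters `m`, `n`, `l` are hypothetical and not taken from the example code.

```cpp
#include <cstdint>

// Hypothetical helper: widen the first extent to 64 bits before multiplying,
// so the whole product is computed in 64-bit arithmetic instead of
// overflowing a 32-bit int.
inline std::int64_t flattened_size(int m, int n, int l) {
  return static_cast<std::int64_t>(m) * n * l;
}
```

With `m = n = 65536` and `l = 2`, the product is 2^33, which does not fit in a 32-bit `int` but is represented exactly here.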
* Fix example runtimes
* Set default fallback cluster shapes to the static ones
* v4.3 update.
* Update the cute_dsl_api changelog's doc link
* Update version to 4.3.0
* Update the example link
* Update doc to encourage user to install DSL from requirements.txt
---------
Co-authored-by: Larry Wu <larwu@nvidia.com>
* Fix an incorrectly defined static constexpr in the sm100 gemm that breaks compilation on Windows
* More Windows fixes
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>
* Revert "More Windows fixes"
This reverts commit 2e8cfc1382.
---------
Signed-off-by: Javier <25750030+SystemPanic@users.noreply.github.com>