* Implement argument passing to element-wise functions for fwd convolution
* Add files for fwd + bias + clamp example
* Implement Bias
* Implement Clamp
* Elementwise function composition
* Composition unit test
* Implement fwd + bias + clamp example
* Simplify argument passing and composition
* Rename elfunc to bias_and_clamp
* Rename function to specify example
* Move element-wise function instantiation to kernel
* Make bias a runtime tensor
* No ugly namespace aliasing
* Initialize element-wise function on host
* Remove function initialization helper, simplify Compose initialization
* Remove unintended LSP compatibility patch
* Clean up includes and unused code
* Switch names in cshuffle epilogue
* Move CDElementwise to conv traits
* Re-add required include
* Initialize bias in same way as other tensors
* Better type specification for ds pointer
* Disable 1D convolution
* Add warning for non-group-constant bias
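
For readers unfamiliar with the pattern, the bias + clamp composition described in the items above can be sketched roughly as follows. This is a minimal illustration only: `AddBias`, `Clamp`, `Compose`, and the scalar `float` signatures are hypothetical stand-ins, not the actual types or interfaces introduced by these commits.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical element-wise functor: adds a per-element bias value.
struct AddBias {
    float bias;
    float operator()(float x) const { return x + bias; }
};

// Hypothetical element-wise functor: clamps to the closed range [lo, hi].
struct Clamp {
    float lo;
    float hi;
    float operator()(float x) const { return std::min(std::max(x, lo), hi); }
};

// Hypothetical composition: applies f first, then g.
template <typename F, typename G>
struct Compose {
    F f;
    G g;
    float operator()(float x) const { return g(f(x)); }
};

// Applies bias followed by clamp to a single accumulator value.
float bias_and_clamp(float acc, float bias, float lo, float hi) {
    Compose<AddBias, Clamp> op{AddBias{bias}, Clamp{lo, hi}};
    return op(acc);
}
```

In this sketch the functors are constructed on the host and passed by value, which mirrors the "initialize on host" item only in spirit; the real kernel argument plumbing is not shown.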
## What's New
Add Split-N support for grouped convolution forward to handle tensors >2GB by splitting the batch dimension.
## Bug Fix
Fixed 32-bit integer overflow that caused crashes with 6+ splits:
- Use `long_index_t` for batch offset calculations
- Remove redundant GemmM initialization in constructors
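
As an illustration of this class of bug, the sketch below shows one way a batch offset can exceed the 32-bit range and how widening before the multiplication avoids it. The type aliases, the function name `batch_offset`, and its parameters are assumptions made for this example and may not match the actual code.

```cpp
#include <cassert>
#include <cstdint>

using index_t      = std::int32_t;  // assumed 32-bit index type
using long_index_t = std::int64_t;  // assumed 64-bit index type

// Element offset of the first batch handled by a given split.
// The first operand is widened to 64 bits, so the whole product is
// evaluated in 64-bit arithmetic and cannot wrap at 2^31 - 1.
long_index_t batch_offset(index_t split_idx, index_t n_per_split, index_t batch_stride) {
    return static_cast<long_index_t>(split_idx) * n_per_split * batch_stride;
}
```

With `split_idx = 5`, `n_per_split = 500`, and `batch_stride = 1000000`, the offset is 2,500,000,000 elements, which does not fit in a signed 32-bit integer; computing the same product in `index_t` would overflow.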
## How It Works
- Automatically splits batch dimension when tensor exceeds 2GB
- Uses grid.z dimension for parallel processing of splits
- Each split processes a subset of batches independently
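
The splitting logic above could be sketched on the host side roughly as follows. This is a simplified sketch under stated assumptions: `SplitPlan`, `plan_split_n`, `batches_for_split`, and the even "ceil-divide" distribution are hypothetical, and the actual heuristic that chooses the number of splits may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result of planning a Split-N launch.
struct SplitPlan {
    std::int64_t num_splits;   // would map to the grid.z dimension
    std::int64_t n_per_split;  // batches per split (the last split may get fewer)
};

// Chooses the largest per-split batch count whose data stays within
// limit_bytes (default 2 GiB). Assumes n > 0 and bytes_per_batch > 0.
SplitPlan plan_split_n(std::int64_t n, std::int64_t bytes_per_batch,
                       std::int64_t limit_bytes = (std::int64_t{1} << 31)) {
    std::int64_t n_per_split = limit_bytes / bytes_per_batch;
    if (n_per_split < 1) n_per_split = 1;  // a single batch may itself exceed the limit
    if (n_per_split > n) n_per_split = n;  // no split needed
    std::int64_t num_splits = (n + n_per_split - 1) / n_per_split;
    return {num_splits, n_per_split};
}

// Half-open batch range [begin, end) processed by split z.
struct Range {
    std::int64_t begin;
    std::int64_t end;
};

Range batches_for_split(const SplitPlan& plan, std::int64_t n, std::int64_t z) {
    std::int64_t begin = z * plan.n_per_split;
    std::int64_t end   = std::min(begin + plan.n_per_split, n);
    return {begin, end};
}
```

Because each split covers a disjoint batch range, the splits have no data dependencies on one another, which is what allows them to run in parallel along grid.z.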
## Testing
Verified with `tile_example_grouped_conv_fwd`:
- n=3000 (6 splits) ✓
- n=3500 (7 splits) ✓
- n=10480 (40 splits) ✓
* Base working version for single grouped conv bwd data
* Fix 2d descriptor
* Fix groups
* Add 3d support
* fixes
---------
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>