mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-16 10:59:55 +00:00
Add CachePtrInt32OverflowPossible template parameter (default false) to all unified attention kernel traits.

This enables dual kernel variants:
- Small cache (false): compile-time elimination of overflow checks for <100K blocks
- Large cache (true): runtime overflow checking with pointer rebasing for >=100K blocks

Key changes:
- Add CachePtrInt32OverflowPossible as 14th template parameter to UnifiedAttentionPipelineProblem
- Pass parameter through all kernel traits: decode, decode_small, decode_tiny, decode_bs32
- Implement overflow checking in pipeline with if constexpr for zero overhead when disabled
- Update dispatch macros with _SMALL_CACHE and _LARGE_CACHE variants
- Create instance files for both small and large cache variants (narrow, _s, _m tiers)
- Remove old MAX_NUM_BLOCKS inference logic (num_kv_heads is runtime, cannot infer)

Python calculates overflow possibility based on actual cache size and passes it explicitly via the cache_ptr_int32_overflow_possible parameter.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
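
Illustrative sketch (not the actual pipeline code): the snippet below shows how a bool template parameter combined with if constexpr can drop the overflow check entirely from the small-cache variant while keeping a rebasing path in the large-cache variant. Only the name CachePtrInt32OverflowPossible comes from this commit. CacheAddressing, Rebased, locate, and cache_ptr_int32_overflow_possible_sketch are hypothetical names, and the host-side check (total element count compared against INT32_MAX) is an assumption about how the overflow possibility might be derived from the cache size.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a kernel trait carrying the new flag.
template <bool CachePtrInt32OverflowPossible = false>
struct CacheAddressing
{
    // A 64-bit offset applied to the base pointer plus a 32-bit local offset.
    struct Rebased
    {
        std::int64_t base_offset;
        std::int32_t local_offset;
    };

    static constexpr Rebased locate(std::int64_t block_idx, std::int64_t block_stride)
    {
        const std::int64_t full = block_idx * block_stride;
        if constexpr (CachePtrInt32OverflowPossible)
        {
            // Large-cache variant: check at runtime and rebase the pointer
            // when the offset no longer fits in 32 bits.
            if (full > std::numeric_limits<std::int32_t>::max())
            {
                return {full, 0};
            }
            return {0, static_cast<std::int32_t>(full)};
        }
        else
        {
            // Small-cache variant: the check above is never instantiated,
            // so the caller must guarantee the offset fits in 32 bits.
            return {0, static_cast<std::int32_t>(full)};
        }
    }
};

// Assumed host-side decision: overflow is possible when the total number of
// elements in the cache exceeds what a signed 32-bit offset can address.
constexpr bool cache_ptr_int32_overflow_possible_sketch(std::int64_t num_blocks,
                                                        std::int64_t elems_per_block)
{
    return num_blocks * elems_per_block > std::numeric_limits<std::int32_t>::max();
}
```

In this sketch the caller would pick CacheAddressing<true> or CacheAddressing<false> from the host-side result, which mirrors the role the _LARGE_CACHE and _SMALL_CACHE dispatch variants play in this change.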