Add CachePtrInt32OverflowPossible template parameter (default false) to all
unified attention kernel traits. This enables dual kernel variants:
- Small cache (false): compile-time elimination of overflow checks for <100K blocks
- Large cache (true): runtime overflow checking with pointer rebasing for >=100K blocks
Key changes:
- Add CachePtrInt32OverflowPossible as the 14th template parameter to UnifiedAttentionPipelineProblem
- Pass parameter through all kernel traits: decode, decode_small, decode_tiny, decode_bs32
- Implement overflow checking in pipeline with if constexpr for zero overhead when disabled
- Update dispatch macros with _SMALL_CACHE and _LARGE_CACHE variants
- Create instance files for both small and large cache variants (narrow, _s, _m tiers)
- Remove old MAX_NUM_BLOCKS inference logic (num_kv_heads is a runtime value, so the block count cannot be inferred at compile time)
The Python side calculates whether overflow is possible from the actual cache size and passes
it explicitly via the cache_ptr_int32_overflow_possible parameter.
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>