composable_kernel/include
Matthias Gehre 15b4a580dc [CK] AIESW-32282: bake fp32->bf16 truncation in as the only behavior; drop *Truncate variants and their DequantPolicyFor specialization
i4_to_bhalf4_scale and i4_to_bhalf4_zp_scale now use fp32_to_bhalf_truncate
(bit-cast >>16) for the trailing fp32->bf16 step. The IEEE round-to-
nearest-even path's per-nibble v_add3_u32 + v_cmp_o_f32 + v_cndmask_b16
chain is gone — saves ~3 RDNA3.5 VALU instructions per nibble. Worst-case
0.5 ULP of bf16 error (~4e-3 relative, inside the 5e-3 op-test tolerance);
lm_eval on Orion-zhen/Qwen3-1.7B-AWQ shows truncate statistically
indistinguishable from Triton (gsm8k 5-shot n=500, McNemar p=1.000).
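
For readers unfamiliar with the two conversions, here is a minimal host-side
sketch of the difference. The helper names (bf16_truncate,
bf16_round_nearest_even, bf16_to_float) are hypothetical and are not the CK
functions; the rounding variant is one common software formulation, assumed
here only for comparison.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: keep the upper 16 bits of the fp32 pattern.
// This is the "bit-cast >>16" truncation; it rounds toward zero.
inline std::uint16_t bf16_truncate(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: a common round-to-nearest-even formulation.
// It needs a bias add, a NaN guard, and a select, which is the kind of
// extra work the truncating path avoids.
inline std::uint16_t bf16_round_nearest_even(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    if (std::isnan(x)) {
        // Keep NaN a NaN by forcing a mantissa bit that survives the shift.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    const std::uint32_t lsb = (bits >> 16) & 1u; // ties go to the even result
    bits += 0x7FFFu + lsb;
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: widen a bf16 pattern back to fp32 (exact).
inline float bf16_to_float(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}
```

For an input whose discarded low 16 bits are 0xFFFF, the two helpers return
neighboring bf16 patterns; for values already representable in bf16 they agree.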

Changes:
- DequantPack8Truncate / DequantPack8WithZpTruncate structs removed —
  DequantPack8 / DequantPack8WithZp now ARE the truncate variants on the
  bf16 path. fp16 path unchanged (no rounding chain to skip; the fp16
  bit-trick is already optimal).
- DequantPolicyFor<DequantPack8WithZpTruncate> specialization removed.
- The generic DequantPolicyFor<> and its DequantPack8WithZp specialization stay
  in place, so the device-op's BDequantOp template hook continues to work
  as a generic plug-in point for callers that want a custom dequant op
  (kept upstream-friendly: no call site has to change).
- Comments in threadwise_tensor_slice_transfer.hpp and
  blockwise_gemm_pipeline_wmmaops_v1.hpp updated to drop the Truncate
  examples that no longer exist.
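
The plug-in pattern described in the list above (a generic policy trait plus
one specialization, selected by the op type) can be sketched as follows. Every
name here (PolicyFor, PlainDequant, ZeroPointDequant, dequant_one) is a
hypothetical stand-in chosen for illustration; the actual CK definitions of
DequantPolicyFor and BDequantOp are not shown in this commit message and may
look different. The sketch assumes C++17.

```cpp
// Hypothetical stand-in for a dequant op without zero points.
struct PlainDequant {
    float operator()(int q, float scale) const {
        return static_cast<float>(q) * scale;
    }
};

// Hypothetical stand-in for a dequant op that subtracts a zero point.
struct ZeroPointDequant {
    float operator()(int q, float scale, int zp) const {
        return static_cast<float>(q - zp) * scale;
    }
};

// Generic policy: by default an op is assumed to take no zero point.
template <typename Op>
struct PolicyFor {
    static constexpr bool needs_zero_point = false;
};

// Specialization: this op asks the caller to supply a zero point.
template <>
struct PolicyFor<ZeroPointDequant> {
    static constexpr bool needs_zero_point = true;
};

// The "device op" side: it consults the policy and never names a concrete op,
// so a caller can plug in a custom op without changing this function.
template <typename Op>
float dequant_one(int q, float scale, int zp, Op op = Op{}) {
    if constexpr (PolicyFor<Op>::needs_zero_point) {
        return op(q, scale, zp);
    } else {
        return op(q, scale);
    }
}
```

A caller selects behavior purely through the template argument, for example
dequant_one<ZeroPointDequant>(q, scale, zp); an op with no specialization falls
back to the generic policy.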
2026-05-21 04:22:38 -06:00