mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-29 19:28:33 +00:00
i4_to_bhalf4_scale and i4_to_bhalf4_zp_scale now use
fp32_to_bhalf_truncate (bit-cast >>16) for the trailing fp32->bf16
step. The IEEE round-to-nearest-even path's per-nibble
v_add3_u32 + v_cmp_o_f32 + v_cndmask_b16 chain is gone, which saves
~3 RDNA3.5 VALU instructions per nibble.

Worst-case 0.5 ULP of bf16 error (~4e-3 relative, inside the 5e-3
op-test tolerance); lm_eval on Orion-zhen/Qwen3-1.7B-AWQ shows truncate
statistically indistinguishable from Triton (gsm8k 5-shot n=500,
McNemar p=1.000).

Changes:
- DequantPack8Truncate / DequantPack8WithZpTruncate structs removed:
  DequantPack8 / DequantPack8WithZp now ARE the truncate variants on
  the bf16 path. fp16 path unchanged (no rounding chain to skip; the
  fp16 bit-trick is already optimal).
- DequantPolicyFor<DequantPack8WithZpTruncate> specialization removed.
- The generic DequantPolicyFor<> + DequantPack8WithZp specialization
  stay in place so the device-op's BDequantOp template hook continues
  to work as a generic plug-in point for callers that want a custom
  dequant op (kept upstream-friendly: no caller side has to change).
- Comments in threadwise_tensor_slice_transfer.hpp and
  blockwise_gemm_pipeline_wmmaops_v1.hpp updated to drop the Truncate
  examples that no longer exist.
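
For reference, the two fp32->bf16 conversions contrasted above can be
sketched in host-side C++ as below. This is an illustration only, not
the composable_kernel code: every function name here is hypothetical,
and the round-to-nearest-even variant is the usual textbook bias-add
formulation, assumed to correspond to the add / NaN-compare / select
chain that was dropped.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper names for illustration; these are not the
// composable_kernel functions.

// Reinterpret an fp32 value as its 32-bit pattern.
inline std::uint32_t fp32_bits_sketch(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    return u;
}

// Truncation: keep the upper 16 bits (sign, 8 exponent bits, top 7
// mantissa bits) and discard the rest. One shift, no rounding logic.
inline std::uint16_t bf16_truncate_sketch(float x)
{
    return static_cast<std::uint16_t>(fp32_bits_sketch(x) >> 16);
}

// Round-to-nearest-even: add a bias that depends on the lowest kept
// bit, then shift. NaN inputs are handled separately so the bias add
// cannot turn them into infinity.
inline std::uint16_t bf16_rne_sketch(float x)
{
    const std::uint32_t u = fp32_bits_sketch(x);
    if((u & 0x7fffffffu) > 0x7f800000u) // NaN: force a quiet-NaN payload
        return static_cast<std::uint16_t>((u >> 16) | 0x0040u);
    const std::uint32_t bias = 0x7fffu + ((u >> 16) & 1u);
    return static_cast<std::uint16_t>((u + bias) >> 16);
}

// Widen a bf16 bit pattern back to fp32 (exact, no rounding needed).
inline float bf16_bits_to_fp32_sketch(std::uint16_t h)
{
    const std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &u, sizeof(x));
    return x;
}

// Small self-check: 1.005859375f is 1 + 0.75 bf16 ULP, so truncation
// rounds down to 1.0 while round-to-nearest-even rounds up by one ULP.
inline void bf16_sketch_self_check()
{
    assert(bf16_truncate_sketch(1.005859375f) == 0x3f80);
    assert(bf16_rne_sketch(1.005859375f) == 0x3f81);
}
```

The two results never differ by more than one bf16 ULP, and they agree
whenever the discarded low 16 bits are at most half a ULP with an even
kept mantissa.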