chenghanke 37eecb4e99 Enable DeepSeek V4 Flash inference on Ampere GPUs (#58)
* feat: enable DeepSeek V4 Flash inference on Ampere (SM_86) GPUs without FP8

Ampere GPUs (SM_86, e.g. RTX 3090) lack native FP8 support, preventing
DeepSeek V4's compressed attention from running. This commit adds BF16
fallback paths throughout the V4 Flash attention pipeline.

Changes ported from deepseek-v4-ampere branch, excluding all MTP
(Multi-Token Prediction) speculative decoding code.

=== BF16 KV Cache Mode ===
- mem_cache/deepseekv4_memory_pool.py: Add use_bf16_cache parameter,
  SetBf16KAndS Triton kernel, BF16 byte/token size calculations
- model_runner_kv_cache_mixin.py: BF16 cell size calculation for
  compressed attention
- memory_profiler.py: BF16 memory profile calculations

=== V4 Flash NSA/Compressed Attention BF16 Kernels ===
- nsa/v4_triton_kernel.py: FP8 LUT decode for SM<90 + full BF16 sparse
  attention kernel (decode_sparse_attention_bf16)
- nsa/index_buf_accessor_v4.py: SetBf16KAndS class with Triton kernel
  and torch fallback; NopeBf16RopeBf16Pack data class
- nsa/quant_k_cache_v4.py: BF16 quantize/dequantize path
- nsa/triton_kernel.py: SM version check for act_quant fallback
- nsa/index_buf_accessor.py: SM version check fallback to vanilla path
- compressed/indexer.py: tilelang-based BF16 paged MQA logits kernels
  (bf16_paged_mqa_logits_tilelang, bf16_direct_paged_mqa_logits_tilelang)
  + torch fallback for CUDA graph safety
- compressed/compressor.py: SM-based dispatch for BF16 vs FP8 compressor
- compressed/paged_prefill.py: SWA BF16 cache store dispatch
- debug_flash_mla_adapter.py: BF16 decode dispatch
- deepseek_v4_backend_radix.py: BF16 SWA key storage dispatch
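
Illustration of the "FP8 LUT decode" idea listed above, as a minimal pure-Python sketch rather than the Triton kernel from this commit. It assumes the cache stores FP8 in the E4M3FN layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7), which the commit message does not state. The names build_e4m3fn_lut and decode_fp8 are hypothetical. On hardware without native FP8 conversion, each stored byte can be used as an index into a 256-entry table of precomputed values:

```python
import math


def build_e4m3fn_lut():
    """Return a 256-entry table mapping each FP8 E4M3FN byte to a float.

    Assumption: E4M3FN layout (sign:1, exponent:4, mantissa:3, bias 7).
    """
    lut = []
    for byte in range(256):
        sign = -1.0 if byte & 0x80 else 1.0
        exp = (byte >> 3) & 0xF
        mant = byte & 0x7
        if exp == 0xF and mant == 0x7:
            # E4M3FN has no infinities; this bit pattern encodes NaN.
            value = math.nan
        elif exp == 0:
            # Subnormal: no implicit leading one, fixed exponent of -6.
            value = sign * (mant / 8.0) * 2.0 ** -6
        else:
            # Normal: implicit leading one, biased exponent.
            value = sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)
        lut.append(value)
    return lut


E4M3FN_LUT = build_e4m3fn_lut()


def decode_fp8(raw: bytes, scale: float = 1.0):
    """Decode raw FP8 bytes by table lookup, then apply a dequantization scale."""
    return [E4M3FN_LUT[b] * scale for b in raw]
```

A GPU kernel would keep the same table in fast memory and gather from it per element; the arithmetic above only runs once, when the table is built.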

=== SM_89 (Ada Lovelace) Capability Guard Fixes ===
- All compute capability checks use cc < (8, 9) tuple comparison
  instead of major < 9, correctly excluding SM_89 (RTX 4090) which
  has native FP8 support and should not use BF16 fallbacks
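
The difference between the two guards can be seen in a small sketch. The function names below are hypothetical and are not taken from the patch; the capability is passed in as a (major, minor) tuple, which is the shape torch.cuda.get_device_capability() returns:

```python
def needs_bf16_fallback_tuple(cc):
    """Guard described above: fall back to BF16 only below SM_89."""
    return tuple(cc) < (8, 9)


def needs_bf16_fallback_major_only(cc):
    """Older guard: looks at the major version only, so it misclassifies SM_89."""
    return cc[0] < 9


if __name__ == "__main__":
    for name, cc in [("SM_86", (8, 6)), ("SM_89", (8, 9)), ("SM_90", (9, 0))]:
        print(name,
              "tuple guard:", needs_bf16_fallback_tuple(cc),
              "major-only guard:", needs_bf16_fallback_major_only(cc))
```

Python compares tuples lexicographically, so (8, 6) < (8, 9) is true while (8, 9) < (8, 9) is false. The major-only check returns true for both, which is what sent SM_89 down the BF16 path.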

=== Function Call / Reasoning Parser ===
- function_call/deepseekv4_detector.py: JSON format + structural tag
- function_call/function_call_parser.py: Register DeepSeekV4Detector
- parser/reasoning_parser.py: DeepSeekV3Detector with explicit_thinking

=== Infrastructure ===
- cuda_graph_runner.py: Fix KTMoEWrapper.set_capture_batch_sizes() to
  pass num_tokens (bs * num_tokens_per_bs) instead of num_seqs, since
  KExpertsCPUBuffer.get_buffer() indexes by hidden_states.shape[0]
- server_args.py: Document V4 compressed attention FP8 dtype requirement
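
The cuda_graph_runner.py fix above can be illustrated with a toy stand-in. ToyBufferPool and capture_token_counts are hypothetical and do not reproduce KTMoEWrapper or KExpertsCPUBuffer; the sketch only assumes, as the commit message states, that buffers are looked up by the first dimension of hidden_states, i.e. by token count:

```python
def capture_token_counts(capture_bs, num_tokens_per_bs):
    """Convert captured batch sizes (sequences) into token counts."""
    return [bs * num_tokens_per_bs for bs in capture_bs]


class ToyBufferPool:
    """Hypothetical pool that preallocates one buffer per registered row count."""

    def __init__(self):
        self._buffers = {}

    def set_capture_sizes(self, sizes):
        self._buffers = {n: [0.0] * n for n in sizes}

    def get_buffer(self, num_rows):
        # Raises KeyError when no buffer was preallocated for this row count.
        return self._buffers[num_rows]


if __name__ == "__main__":
    capture_bs = [1, 2, 4]
    num_tokens_per_bs = 2

    pool = ToyBufferPool()
    pool.set_capture_sizes(capture_token_counts(capture_bs, num_tokens_per_bs))
    # hidden_states.shape[0] would be 8 for bs=4 with 2 tokens per sequence.
    print(len(pool.get_buffer(4 * num_tokens_per_bs)))
```

If the pool were registered with the sequence counts [1, 2, 4] instead, the lookup for 8 rows would miss whenever num_tokens_per_bs is greater than 1.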

* revert: restore upstream docstring and remove unused logging in DeepSeekV4Detector

Revert the class docstring to the concise upstream version and remove
unused  / .
Keep  as the only functional addition.

Co-Authored-By: Claude <noreply@anthropic.com>

* Update python/sglang/srt/layers/attention/compressed/paged_prefill.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Benjamin <159887351+yyj6666667@users.noreply.github.com>
2026-06-22 13:53:55 +08:00