mirror of
https://github.com/kvcache-ai/sglang.git
synced 2026-06-30 03:37:51 +00:00
* feat: enable DeepSeek V4 Flash inference on Ampere (SM_86) GPUs without FP8

  Ampere GPUs (SM_86, e.g. RTX 3090) lack native FP8 support, preventing
  DeepSeek V4's compressed attention from running. This commit adds BF16
  fallback paths throughout the V4 Flash attention pipeline.

  Changes ported from deepseek-v4-ampere branch, excluding all MTP
  (Multi-Token Prediction) speculative decoding code.

  === BF16 KV Cache Mode ===
  - mem_cache/deepseekv4_memory_pool.py: Add use_bf16_cache parameter,
    SetBf16KAndS Triton kernel, BF16 byte/token size calculations
  - model_runner_kv_cache_mixin.py: BF16 cell size calculation for
    compressed attention
  - memory_profiler.py: BF16 memory profile calculations

  === V4 Flash NSA/Compressed Attention BF16 Kernels ===
  - nsa/v4_triton_kernel.py: FP8 LUT decode for SM<90 + full BF16 sparse
    attention kernel (decode_sparse_attention_bf16)
  - nsa/index_buf_accessor_v4.py: SetBf16KAndS class with Triton kernel
    and torch fallback; NopeBf16RopeBf16Pack data class
  - nsa/quant_k_cache_v4.py: BF16 quantize/dequantize path
  - nsa/triton_kernel.py: SM version check for act_quant fallback
  - nsa/index_buf_accessor.py: SM version check fallback to vanilla path
  - compressed/indexer.py: tilelang-based BF16 paged MQA logits kernels
    (bf16_paged_mqa_logits_tilelang, bf16_direct_paged_mqa_logits_tilelang)
    + torch fallback for CUDA graph safety
  - compressed/compressor.py: SM-based dispatch for BF16 vs FP8 compressor
  - compressed/paged_prefill.py: SWA BF16 cache store dispatch
  - debug_flash_mla_adapter.py: BF16 decode dispatch
  - deepseek_v4_backend_radix.py: BF16 SWA key storage dispatch

  === SM_89 (Ada Lovelace) Capability Guard Fixes ===
  - All compute capability checks use cc < (8, 9) tuple comparison
    instead of major < 9, correctly excluding SM_89 (RTX 4090) which has
    native FP8 support and should not use BF16 fallbacks

  === Function Call / Reasoning Parser ===
  - function_call/deepseekv4_detector.py: JSON format + structural tag
  - function_call/function_call_parser.py: Register DeepSeekV4Detector
  - parser/reasoning_parser.py: DeepSeekV3Detector with explicit_thinking

  === Infrastructure ===
  - cuda_graph_runner.py: Fix KTMoEWrapper.set_capture_batch_sizes() to
    pass num_tokens (bs * num_tokens_per_bs) instead of num_seqs, since
    KExpertsCPUBuffer.get_buffer() indexes by hidden_states.shape[0]
  - server_args.py: Document V4 compressed attention FP8 dtype requirement

* revert: restore upstream docstring and remove unused logging in DeepSeekV4Detector

  Revert the class docstring to the concise upstream version and remove
  unused / . Keep as the only functional addition.

  Co-Authored-By: Claude <noreply@anthropic.com>

* Update python/sglang/srt/layers/attention/compressed/paged_prefill.py

  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update python/sglang/srt/layers/attention/deepseek_v4_backend_radix.py

  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Benjamin <159887351+yyj6666667@users.noreply.github.com>
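
Note on the SM_89 capability guard: the difference between the tuple comparison and the major-only check described in the commit message can be illustrated with a minimal sketch. The function names and the `FP8_MIN_CAPABILITY` constant below are hypothetical and are not taken from the sglang source; the sketch only assumes that the capability is available as a `(major, minor)` tuple, which is the shape `torch.cuda.get_device_capability()` returns.

```python
from typing import Tuple

# Hypothetical constant: lowest compute capability treated as having native
# FP8 support, per the commit message (SM_89, Ada Lovelace).
FP8_MIN_CAPABILITY: Tuple[int, int] = (8, 9)


def needs_bf16_fallback(cc: Tuple[int, int]) -> bool:
    """Tuple comparison: True only for devices below SM_89."""
    # Python compares tuples lexicographically, so (8, 6) < (8, 9) is True
    # while (8, 9) < (8, 9) and (9, 0) < (8, 9) are both False.
    return cc < FP8_MIN_CAPABILITY


def needs_bf16_fallback_major_only(cc: Tuple[int, int]) -> bool:
    """Major-only check, shown for contrast: it ignores the minor version."""
    major, _minor = cc
    return major < 9


if __name__ == "__main__":
    # In a real process the tuple would come from the device, for example
    # torch.cuda.get_device_capability(); fixed tuples are used here so the
    # sketch runs without a GPU.
    for name, cc in [("SM_86", (8, 6)), ("SM_89", (8, 9)), ("SM_90", (9, 0))]:
        print(name, needs_bf16_fallback(cc), needs_bf16_fallback_major_only(cc))
```

With the major-only check, SM_89 is routed to the BF16 fallback along with SM_86; the tuple comparison keeps SM_89 on the FP8 path while still catching every SM_8x device below it.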