DFlash Accelerates DeepSeek-V4 Inference
DFlash delivers up to 6× faster LLM inference with no quality loss. It is open-source, plug-and-play, and supports vLLM, SGLang, and MLX.
DFlash Accelerates Models by up to 6× ⚡
Same output quality, completely lossless, open-source, plug-and-play ⚡
This article breaks down the DFlash project in exhaustive detail.
The Core Bottleneck of LLM Generation
The essence of large model generation is autoregression: the N-th token cannot be computed until the (N-1)-th token is finished. Tokens are generated serially, so speed is fundamentally limited.
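To make the serial dependency concrete, here is a minimal sketch of an autoregressive loop. The names `generate` and `toy_next_token` are hypothetical, and the toy "model" is just an arithmetic rule standing in for a real forward pass; the point is only that each step must wait for the previous one.

```python
from typing import Callable, List

def generate(next_token: Callable[[List[int]], int],
             prompt: List[int], n_new: int) -> List[int]:
    """Generate n_new tokens one at a time (hypothetical helper)."""
    tokens = list(prompt)
    for _ in range(n_new):
        # Token N depends on every token before it, so this loop
        # cannot be parallelized across steps.
        tokens.append(next_token(tokens))
    return tokens

def toy_next_token(context: List[int]) -> int:
    """Toy stand-in for a model forward pass: sum of the context mod 10."""
    return sum(context) % 10

out = generate(toy_next_token, [1, 2], 3)
print(out)  # [1, 2, 3, 6, 2]
```

Generating `n_new` tokens costs `n_new` sequential calls, which is the latency floor that speculative decoding tries to get around.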
The current mainstream solution in the industry is Speculative Decoding:
A small draft model quickly "guesses" a sequence of tokens.
The large target model verifies them in parallel.
Accepted tokens are kept; rejected ones are discarded and regenerated.
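The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not DFlash's or EAGLE-3's implementation: it uses greedy verification (a drafted token is accepted only if it equals the target's own greedy choice), and the toy `target` and `draft` functions are hypothetical stand-ins for real models. In a real system the verification step is one batched forward pass of the target model; here it is emulated with a list comprehension.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]

def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> Tuple[List[int], int]:
    """One draft-then-verify cycle. Returns (new tokens, number accepted)."""
    # 1. The draft model guesses k tokens (serially, in this sketch).
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))
    # 2. The target scores all k+1 positions. A real system does this in a
    #    single parallel forward pass; the loop here only emulates it.
    target_preds = [target(context + proposal[:i]) for i in range(k + 1)]
    # 3. Keep the longest prefix on which draft and target agree.
    n_accepted = 0
    while n_accepted < k and proposal[n_accepted] == target_preds[n_accepted]:
        n_accepted += 1
    # The target's own token replaces the first rejected guess
    # (or is a bonus token if every guess was accepted).
    return proposal[:n_accepted] + [target_preds[n_accepted]], n_accepted

def speculative_generate(target: NextToken, draft: NextToken,
                         prompt: List[int], n_new: int, k: int) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        new_tokens, _ = speculative_step(target, draft, tokens, k)
        tokens.extend(new_tokens)
    return tokens[:len(prompt) + n_new]

# Hypothetical toy models: the target counts upward mod 10; the draft
# agrees except that it guesses wrong right after seeing a 4.
def target(context: List[int]) -> int:
    return (context[-1] + 1) % 10

def draft(context: List[int]) -> int:
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10

new_tokens, n_accepted = speculative_step(target, draft, [1], k=4)
print(new_tokens, n_accepted)  # [2, 3, 4, 5] 3
```

Because every emitted token is either confirmed or produced by the target, the output under greedy decoding matches what the target would have generated on its own. That is the sense in which this scheme is lossless.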
In theory, this can dramatically boost throughput. However, even the strongest current method, EAGLE-3, achieves only a 2-3× speedup, because the draft model itself is still autoregressive: it generates one token at a time, which makes the drafting step the bottleneck.
DFlash does something radical: it replaces the autoregressive draft model with a block diffusion model.
With a single forward pass, it generates an entire block of 16 tokens at once, so drafting is no longer serial.
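A back-of-the-envelope cost model shows why the number of draft passes matters. Everything below is an assumption made for illustration: the function `cycle_speedup`, the relative draft cost of 0.1, and the 6 tokens gained per cycle are hypothetical inputs, not measurements from DFlash or EAGLE-3. The model also assumes both drafters get the same number of tokens accepted, which need not hold in practice.

```python
def cycle_speedup(tokens_per_cycle: float, draft_passes: int,
                  draft_cost: float, target_cost: float = 1.0) -> float:
    """Speedup over plain autoregressive decoding for one draft/verify cycle.

    Plain decoding pays target_cost per token. A speculative cycle pays
    draft_passes * draft_cost for drafting plus one target pass for
    verification, and yields tokens_per_cycle tokens.
    """
    cycle_cost = draft_passes * draft_cost + target_cost
    return tokens_per_cycle * target_cost / cycle_cost

BLOCK = 16        # block size mentioned in the article
DRAFT_COST = 0.1  # hypothetical: one draft pass costs 10% of a target pass
TOKENS = 6.0      # hypothetical: tokens gained per cycle, same for both

# Autoregressive drafter: one pass per drafted token.
ar = cycle_speedup(TOKENS, draft_passes=BLOCK, draft_cost=DRAFT_COST)
# Block drafter: the whole block comes out of a single pass.
block = cycle_speedup(TOKENS, draft_passes=1, draft_cost=DRAFT_COST)

print(f"autoregressive draft: {ar:.2f}x, block draft: {block:.2f}x")
```

In this model the drafting term grows linearly with block size for an autoregressive drafter but stays constant for a single-pass block drafter, so longer blocks stop being penalized by drafting latency.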



