Google highlights 3x TPU speedups with DFlash speculative decoding

DFlash speculative decoding boosts TPU v5p LLM serving with 3.13x throughput gains. See how UCSD's vLLM TPU integration outperforms EAGLE-3 and speeds up inference.

Google Cloud highlighted an open-source implementation of block-diffusion speculative decoding on TPU v5p, built by UCSD researchers and integrated into the vLLM TPU inference stack. The team said the DFlash approach, which generates a block of draft tokens in a single forward pass, improved average throughput by 3.13x across benchmarks and reached nearly 6x speedups on some math workloads. In a head-to-head comparison with EAGLE-3 on the same TPU hardware and Llama-3.1-8B target model, DFlash delivered a 2.29x end-to-end serving gain versus 1.30x for EAGLE-3. The post also described the engineering work needed to adapt the method to TPU and JAX, including changes to attention caching, context handling, and metadata alignment. Google said the implementation has been submitted to the vLLM tpu-inference repository and could help shape future TPU-based LLM serving research.
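
For readers unfamiliar with the technique, the draft-then-verify loop behind block-style speculative decoding can be sketched in a few lines of Python. This is a toy illustration of generic greedy speculative decoding, not the DFlash drafter or the UCSD TPU code: `target_next`, `draft_block`, and the other names are hypothetical stand-ins, the "models" are simple arithmetic functions, and the drafter is assumed to guess wrong at the third position of every block. A real system would score the whole drafted block in one target forward pass; the loop in `verify_block` only simulates that.

```python
from typing import List, Tuple

Token = int


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the target model's greedy next token."""
    return (sum(context) + len(context)) % 7


def draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Hypothetical drafter that proposes a whole block of tokens at once.

    Assumption for the demo: it agrees with the target except at the
    third position of each block, where it is deliberately wrong.
    """
    ctx = list(context)
    block: List[Token] = []
    for i in range(block_size):
        tok = target_next(ctx)
        if i == 2:
            tok = (tok + 1) % 7  # inject a wrong guess
        block.append(tok)
        ctx.append(tok)
    return block


def verify_block(context: List[Token], proposal: List[Token]) -> List[Token]:
    """Keep the longest draft prefix the target agrees with, plus one target token."""
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # bonus token if all matched
    return accepted


def greedy_generate(prompt: List[Token], num_new: int) -> List[Token]:
    """Baseline: one target step per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target_next(out))
    return out


def speculative_generate(
    prompt: List[Token], num_new: int, block_size: int
) -> Tuple[List[Token], int]:
    """Draft a block, verify it, repeat. Returns the tokens and the number of verify steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < num_new:
        out.extend(verify_block(out, draft_block(out, block_size)))
        steps += 1
    return out[: len(prompt) + num_new], steps


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, steps = speculative_generate(prompt, num_new=12, block_size=4)
    print(tokens == greedy_generate(prompt, 12), steps)
```

In this toy setup the speculative output matches plain greedy decoding exactly, while the target is consulted once per block rather than once per token. The fraction of drafted tokens that survive verification is what drives the kind of throughput gains reported above.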