China has reportedly charted a new course away from reliance on NVIDIA's export-limited AI accelerators. Thanks to DeepSeek's latest endeavor, achievable TFLOPS on the Hopper H800 AI accelerator have risen dramatically, reportedly reaching roughly eight times the figure the card typically delivers.
### DeepSeek’s FlashMLA: A Boost for China’s AI Industry with Optimized NVIDIA Hopper GPUs
In an impressive move of self-sufficiency, China is leveraging its domestic talent, notably through companies like DeepSeek, to push its hardware capabilities further. Instead of waiting for external hardware advancements, DeepSeek is making headlines with its software innovations. Through strategic memory and resource management during inference, the company has unlocked remarkable performance gains from NVIDIA's "cut-down" Hopper H800 GPUs.
Here's a snippet from DeepSeek's announcement on Twitter during its #OpenSourceWeek, where the company introduced FlashMLA, a decoding kernel crafted for NVIDIA's Hopper GPUs. The reveal set an optimistic tone for the week, with DeepSeek describing FlashMLA's potential market impact as groundbreaking.
DeepSeek proudly reported hitting an astonishing 580 TFLOPS for BF16 matrix multiplication on the Hopper H800, nearly eight times what the industry typically expects from the card. What's more, FlashMLA pushes memory bandwidth up to 3,000 GB/s, almost double what the H800 had previously delivered in practice. Crucially, this leap comes purely from clever software, with no hardware modifications required.
Another Twitter highlight drew attention to FlashMLA's processing speed and memory-handling abilities. Hitting 580 TFLOPS on the H800, where the industry average lingers around 73.5 TFLOPS, while also surpassing historical memory-bandwidth limits, illustrates the software's prowess.
FlashMLA employs "low-rank key-value compression," which shrinks the attention cache into compact latent representations for faster processing and reduces memory use by a reported 40–60%. It also uses a block-based paging strategy that allocates memory dynamically based on task demands, improving efficiency when processing variable-length sequences.
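The two ideas above can be illustrated with a toy sketch. This is not DeepSeek's implementation; the dimensions, weight matrices, and helper names below are hypothetical, chosen only to show the shape of the technique: instead of caching full-size keys and values per token, a small latent vector is cached and K/V are reconstructed on the fly, and the cache is carved into fixed-size pages (FlashMLA's paged KV cache uses a block size of 64).

```python
import numpy as np

# Hypothetical sizes for illustration; real models use their own dimensions.
d_model = 1024    # hidden size per token
d_latent = 128    # compressed latent size (what actually gets cached)
n_tokens = 512

rng = np.random.default_rng(0)
hidden = rng.standard_normal((n_tokens, d_model))

# Low-rank key-value compression: project each token's hidden state down
# to a small latent, cache only that, and reconstruct K and V on demand.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)

latent_cache = hidden @ W_down   # (n_tokens, d_latent): all we store
k = latent_cache @ W_up_k        # rebuilt during attention, not cached
v = latent_cache @ W_up_v

full_cache_floats = 2 * n_tokens * d_model  # separate K and V caches
mla_cache_floats = n_tokens * d_latent
print(f"cache shrinks to {mla_cache_floats / full_cache_floats:.1%} of full K/V")

# Block-based paging: split the cache into fixed-size pages so a
# variable-length sequence only occupies the blocks it actually needs.
BLOCK = 64                        # FlashMLA's paged cache block size
n_blocks = -(-n_tokens // BLOCK)  # ceiling division
pages = [latent_cache[i * BLOCK:(i + 1) * BLOCK] for i in range(n_blocks)]
print(f"{n_tokens} tokens stored across {n_blocks} pages of up to {BLOCK}")
```

The toy ratio here is far more aggressive than the 40–60% savings the article cites, because real systems spend memory on more than the latent cache; the point is only the mechanism, not the exact numbers.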
DeepSeek's FlashMLA underlines that AI computational advancement transcends mere hardware innovation; it is an intricate, multifaceted endeavor. For now, FlashMLA is fine-tuned for Hopper GPUs, and anticipation is growing for broader adaptations, including fuller exploitation of the flagship H100, hinting at future breakthroughs.