The hardware efficiency gains are honestly the most interesting part of the paper. The main reason DeepSeek-V4 is so cheap to run comes down to how they completely bypassed the quadratic cost of standard attention for massive context windows.
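To put rough numbers on why that matters (the hyperparameters below are made up for illustration, not V4's actual config), the back-of-envelope looks like this:

```python
# Back-of-envelope: why dense attention hurts at a 1M-token context.
# Hyperparameters are illustrative assumptions, not DeepSeek-V4's real config.
seq_len = 1_000_000
n_layers, n_kv_heads, head_dim = 64, 8, 128
bytes_per_elem = 2                      # FP16/BF16

# KV cache grows linearly with context: 2 (K and V) * layers * heads * head_dim * seq_len.
kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV cache per sequence: {kv_cache_bytes / 2**30:.0f} GiB")            # ~244 GiB

# Attention work grows quadratically: with causal masking, one full pass over the
# sequence touches roughly seq_len^2 / 2 query-key pairs per head per layer.
print(f"Query-key pairs per head per layer: {seq_len * seq_len // 2:.2e}")   # 5.00e+11
```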
They built a hybrid attention architecture that interleaves Compressed Sparse Attention and Heavily Compressed Attention. Standard models keep every single token in the KV cache, which absolutely kills memory at long context. CSA fixes this by compressing the KV cache of multiple tokens into a single entry and then using a sparse routing mechanism to compute attention over only the top-k most relevant compressed blocks. HCA takes it a step further, compressing an even larger number of tokens into one entry but computing dense attention over all of them. The upshot: the 1.6T-parameter Pro model uses only a third of the compute FLOPs and 10% of the KV cache memory compared to DeepSeek-V3.2 at a one-million-token context.
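Here's a minimal single-head sketch of the CSA idea as I read it; the block size, top-k, and mean-pooled compression are placeholders, not the paper's actual learned compression. HCA would look similar, just with a much larger block size and attention over all compressed entries instead of a top-k subset.

```python
import torch
import torch.nn.functional as F

def compressed_sparse_attention(q, k, v, block_size=16, top_k=4):
    """q: (d,), k/v: (seq_len, d). Returns a (d,) output vector."""
    seq_len, d = k.shape
    n_blocks = seq_len // block_size

    # 1) Compress each block of `block_size` tokens into a single KV entry
    #    (mean pooling here; the real compression would be learned).
    k_blocks = k[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(dim=1)
    v_blocks = v[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(dim=1)

    # 2) Sparse routing: score the query against every compressed key, keep the top-k blocks.
    scores = k_blocks @ q / d**0.5                      # (n_blocks,)
    top_idx = scores.topk(min(top_k, n_blocks)).indices

    # 3) Attention only over the selected compressed entries.
    attn = F.softmax(scores[top_idx], dim=-1)           # (top_k,)
    return attn @ v_blocks[top_idx]                     # (d,)

q = torch.randn(64)
k, v = torch.randn(1024, 64), torch.randn(1024, 64)
out = compressed_sparse_attention(q, k, v)              # attends to 4 of 64 compressed blocks
```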
They also pushed low-precision formats aggressively, applying FP4 quantization-aware training to the Mixture-of-Experts weights and the attention query-key paths. MoE models are notoriously memory bound because you have to constantly shuttle massive expert weights into the GPU cores. Dropping these to FP4 slashes the memory-bandwidth bottleneck and lets the model run way faster during inference without ruining accuracy, since the quantization is handled dynamically during training.
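For anyone who hasn't seen quantization-aware training before, the usual trick is a fake-quant forward pass with a straight-through estimator in the backward pass. Something like the sketch below, where the E2M1 grid is the standard FP4 value set but the per-tensor dynamic scaling is a generic choice on my part, not necessarily DeepSeek's recipe:

```python
import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 (E2M1) magnitudes

def fake_quant_fp4(w: torch.Tensor) -> torch.Tensor:
    scale = w.abs().max() / 6.0                       # dynamic scale: map the largest |w| to FP4's max value
    scaled = (w / scale).clamp(-6.0, 6.0)
    # Snap every value to the nearest representable magnitude, keeping its sign.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    q = torch.sign(scaled) * FP4_GRID[idx] * scale
    # Straight-through estimator: forward uses the quantized weights,
    # backward treats the rounding as identity so gradients keep flowing.
    return w + (q - w).detach()

w = torch.randn(1024, 1024, requires_grad=True)
w_q = fake_quant_fp4(w)    # use w_q in the forward matmul while training
```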
On the infrastructure side, they wrote a custom fused kernel in TileLang that overlaps communication and computation. When running expert parallelism across multiple GPUs, you usually hit a wall waiting for the network. DeepSeek slices the experts into micro-waves so the GPU is crunching matrix math on the first wave while the network is simultaneously pulling in the data for the second wave. They basically hide the network latency behind the compute time, which means you do not need super expensive interconnects to get peak hardware utilization out of the cluster.
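The scheduling pattern itself is classic double-buffering / prefetching. A toy Python illustration (the sleeps stand in for the all-to-all transfer and the expert GEMM; this is just the overlap pattern, nothing to do with TileLang itself):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(wave):            # stand-in for the network transfer of one micro-wave's tokens
    time.sleep(0.05)
    return f"data_{wave}"

def compute(data):          # stand-in for the expert matmul on one micro-wave
    time.sleep(0.05)
    return f"out_{data}"

def run_pipelined(n_waves=8):
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as net:
        pending = net.submit(fetch, 0)                    # kick off the first transfer
        for wave in range(n_waves):
            data = pending.result()                       # wait only if the transfer isn't done yet
            if wave + 1 < n_waves:
                pending = net.submit(fetch, wave + 1)     # prefetch the next wave in the background
            outputs.append(compute(data))                 # compute overlaps the in-flight transfer
    return outputs

start = time.time()
run_pipelined()
print(f"pipelined: {time.time() - start:.2f}s")           # ~0.45s instead of ~0.8s sequential
```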
Holy shit, I barely learned what the quadratic cost of attention was like 2 weeks ago. Can we hit the brakes a bit, before we start optimizing the shit out of everything? I am going to get lost in the layers of abstraction.