NVIDIA Slashes DeepSeek V4 Token Costs By Up To 5x Through Blackwell Software Tuning

By: M 2 days ago

Codesigned with NVIDIA GPUs, CPUs, networking and systems, and strengthened by a broad open source ecosystem, NVIDIA's full-stack inference software continuously improves performance on existing hardware. On the NVIDIA Blackwell platform, the software stack has already cut token costs for the DeepSeek V4 model by up to 5x in about one month.

Three-Layer Synergy Lowers LLM Inference Costs Across the Board

NVIDIA's inference software stack lowers cost per token by connecting three layers: production operation, application acceleration and infrastructure access.

Production Operation: Coordinates distributed serving, orchestration, autoscaling and memory management so inference can run across the right compute and storage resources.

Application Acceleration: Runs models with high performance while giving developers room to tune and customize, using runtime optimizations such as kernel fusion and overlapping compute with communication.

Infrastructure Access: Exposes NVIDIA GPU, networking, memory and system capabilities without requiring developers to manage every device instruction set or data-transfer protocol directly.

When these layers work as one system, individual optimizations compound. 

Combined Core Technologies Deliver Up to 20x Token Throughput

Disaggregated serving, large expert parallelism over NVIDIA NVLink interconnect technology, NVFP4 precision and multi-token prediction each deliver meaningful gains on their own. When these optimizations are combined, token throughput per GPU on the Blackwell platform can increase by up to 20x.
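As a rough illustration of how such gains can compound, the sketch below assumes that the four techniques named above contribute independent, multiplicative speedups. That assumption, the equal split between techniques and the function names are hypothetical and are not NVIDIA's published breakdown; only the 20x combined figure comes from the article.

```python
import math


def combined_speedup(factors):
    """Multiply per-technique speedups, assuming they are independent."""
    return math.prod(factors)


def equal_factor_for_target(target, count):
    """Speedup each technique would need if `count` equal gains reach `target`."""
    return target ** (1.0 / count)


# The article names four techniques; splitting the gain equally is an assumption.
per_technique = equal_factor_for_target(20.0, 4)
print(f"{per_technique:.2f}x each")  # 2.11x each
print(f"{combined_speedup([per_technique] * 4):.1f}x combined")  # 20.0x combined
```

Under these assumptions, each technique would need only a little over 2x on its own for the combination to reach 20x, which is why stacking several moderate gains can matter more than any single one. Real gains are unlikely to be equal or fully independent.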

Broad Open Source Ecosystem Support Delivers Fast, Large-Scale Cost Reduction

Both the size of the DeepSeek V4 cost reduction and the speed at which it arrived depend on a mature open source ecosystem.

When a new frontier open model like DeepSeek V4 is released, leading inference frameworks like vLLM and SGLang have day-zero deployment recipes for the NVIDIA Blackwell architecture, making the model accessible across millions of Blackwell GPUs. That ecosystem support is also why DeepSeek V4 performance on Blackwell improved by up to 5x within about a month across the vLLM and SGLang frameworks, cutting token costs to roughly one-fifth of previous levels.
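The link between a 5x throughput gain and a one-fifth token cost can be checked with simple arithmetic. The sketch below assumes a fixed hourly GPU cost; the dollar figure, the baseline throughput and the function name are hypothetical placeholders, not values from NVIDIA or the article.

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """Cost to generate one million tokens on one GPU at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical placeholder values: only the 5x ratio comes from the article.
GPU_COST_PER_HOUR = 3.0          # assumed hourly GPU cost in dollars
BASELINE_TOKENS_PER_SEC = 1000.0  # assumed throughput before tuning

baseline = cost_per_million_tokens(GPU_COST_PER_HOUR, BASELINE_TOKENS_PER_SEC)
tuned = cost_per_million_tokens(GPU_COST_PER_HOUR, BASELINE_TOKENS_PER_SEC * 5)

print(f"cost ratio after tuning: {tuned / baseline:.2f}")  # 0.20
```

Because the hardware cost per hour is unchanged, cost per token scales inversely with throughput, so the ratio is 0.20 regardless of the placeholder values chosen.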