Run DeepSeek V4 Flash/Pro on B300
Target hardware: 8* B300 (native) or 8× H200 (192GB each, 1.5TB total VRAM) Covers V4-Flash (284B / 13B active, ~146GB) and V4-Pro (1.6T / 49B active, ~960GB) Both models are MIT licensed.
Created based on: https://www.lmsys.org/blog/2026-04-25-deepseek-v4/
1. Spin Up the Virtual Machine
Log in to yottalabs.ai → Compute → VMs→ Launch.
For those who intend to run V4-Flash, two H200 SXM in a single pod is enough. If you need full 1M context or high QPS, choose 8× H200, which can cost significantly more.
For V4-Pro, 8× HGX B300 is the its native FP4 execution, full 1M context, no model-len cap. H200 cluster nodes is also a good choice, if there are not as many B300 as we want.
Pod Settings
GPU Type
H200
GPU Count
8
Image
lmsysorg/sglang:deepseek-v4-hopper
System Volume
≥ 300 GB (Flash) / ≥ 1.2 TB (Pro)
System volume sizing rule: The SGLang Blackwell image is ~15GB. Yotta Labs requires volume ≥ image size × 3, plus model weights plus buffer:
Flash: 45GB + 200GB + 55GB buffer = ≥ 300GB
Pro: 45GB + 1000GB + 155GB buffer = ≥ 1.2TB
Once deployed, click Connect → SSH to open a terminal.
2. Environment Setup
Set HuggingFace token
Get a token at huggingface.co/settings/tokens.
Install the hf CLI
hf CLIUbuntu 24.04 has an externally managed Python. Install to user directory:
⚠️ Do NOT use
huggingface-cli— it may not be on PATH. Usehfinstead.
Accept model licenses on HuggingFace
Visit each link and click "Access repository":
https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash ← for Flash
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro ← for Pro
Verify GPUs
3. Download Model Weights
This could take 40~60 minutes depends on which version you choose.
Always download to /home/ubuntu/models/ — the pod runs as the ubuntu user and cannot write to /root/.
Run downloads inside tmux — SSH disconnects will kill the process otherwise.
hf downloadsupports resume: if interrupted, just re-run the same command and it will skip already-completed files. If you seeStill waiting to acquire lock, a previous process is still running — find and kill it first withpkill -f "hf download", then delete lock files withfind /home/ubuntu/models -name "*.lock" -delete.
4. Launch SGLang
Decide which image to use based on what GPUs you chose:

Chart from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4
Key points for H200 :
Use image
lmsysorg/sglang:deepseek-v4-hopperUse official
deepseek-ai/checkpoints (native FP4+FP8)Enable
--moe-runner-backend flashinfer_mxfp4for FP4 expert kernelMount model as
-v /home/ubuntu/models/<model>:/workspace/modeland pass--model-path /workspace/modelAlways use
sudo dockerin our virtual machine
Watch startup logs
Health check
Basic inference
Always use
temperature=1.0, top_p=1.0— DeepSeek officially recommends these for V4. Lowering temperature degrades reasoning quality on MoE architectures.
Python client
6. Enabling Thinking Mode
V4 has three reasoning levels: Non-think, Think High, Think Max.
Think Max requires
--context-length≥ 384K. Achievable on 8× B200 with Flash; for Pro use Think High (128K) until you've confirmed stable operation.
Streaming with thinking
Summary Cheat Sheet
References: SGLang DeepSeek-V4 Docs · LMSYS Day-0 Blog · Yotta Labs Pod Docs
Last updated
Was this helpful?