whaleRun DeepSeek V4 Flash/Pro on B300

Target hardware: 8* B300 (native) or 8× H200 (192GB each, 1.5TB total VRAM) Covers V4-Flash (284B / 13B active, ~146GB) and V4-Pro (1.6T / 49B active, ~960GB) Both models are MIT licensed.

Created based on: https://www.lmsys.org/blog/2026-04-25-deepseek-v4/arrow-up-right


1. Spin Up the Virtual Machine

Log in to yottalabs.aiarrow-up-rightCompute → VMs→ Launch.

For those who intend to run V4-Flash, two H200 SXM in a single pod is enough. If you need full 1M context or high QPS, choose 8× H200, which can cost significantly more.

For V4-Pro, 8× HGX B300 is the its native FP4 execution, full 1M context, no model-len cap. H200 cluster nodes is also a good choice, if there are not as many B300 as we want.

Pod Settings

Field
Value

GPU Type

H200

GPU Count

8

Image

lmsysorg/sglang:deepseek-v4-hopper

System Volume

≥ 300 GB (Flash) / ≥ 1.2 TB (Pro)

circle-exclamation

Once deployed, click Connect → SSH to open a terminal.


2. Environment Setup

Set HuggingFace token

Get a token at huggingface.co/settings/tokensarrow-up-right.

Install the hf CLI

Ubuntu 24.04 has an externally managed Python. Install to user directory:

⚠️ Do NOT use huggingface-cli — it may not be on PATH. Use hf instead.

Accept model licenses on HuggingFace

Visit each link and click "Access repository":

  • https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash ← for Flash

  • https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro ← for Pro

Verify GPUs


3. Download Model Weights

This could take 40~60 minutes depends on which version you choose.

circle-exclamation

Run downloads inside tmux — SSH disconnects will kill the process otherwise. hf download supports resume: if interrupted, just re-run the same command and it will skip already-completed files. If you see Still waiting to acquire lock, a previous process is still running — find and kill it first with pkill -f "hf download", then delete lock files with find /home/ubuntu/models -name "*.lock" -delete.

Option A — V4-Flash (~146GB, FP4+FP8)


4. Launch SGLang

Decide which image to use based on what GPUs you chose:

Chart from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4arrow-up-right

Key points for H200 :

  • Use image lmsysorg/sglang:deepseek-v4-hopper

  • Use official deepseek-ai/ checkpoints (native FP4+FP8)

  • Enable --moe-runner-backend flashinfer_mxfp4 for FP4 expert kernel

  • Mount model as -v /home/ubuntu/models/<model>:/workspace/model and pass --model-path /workspace/model

  • Always use sudo docker in our virtual machine

4a. V4-Flash on 8× H200

Watch startup logs

Health check

Basic inference

Always use temperature=1.0, top_p=1.0 — DeepSeek officially recommends these for V4. Lowering temperature degrades reasoning quality on MoE architectures.

Python client


6. Enabling Thinking Mode

V4 has three reasoning levels: Non-think, Think High, Think Max.

Think Max requires --context-length ≥ 384K. Achievable on 8× B200 with Flash; for Pro use Think High (128K) until you've confirmed stable operation.

Streaming with thinking


Summary Cheat Sheet


References: SGLang DeepSeek-V4 Docsarrow-up-right · LMSYS Day-0 Blogarrow-up-right · Yotta Labs Pod Docsarrow-up-right

Last updated

Was this helpful?