Running DFlash with Qwen3.6-35B-A3B on RTX PRO 6000
DFlash is a speculative decoding framework from Z Lab that uses a lightweight block diffusion model to draft multiple tokens in parallel, delivering up to 6× lossless inference acceleration over standard autoregressive decoding — and up to 2.5× faster than EAGLE-3. This guide walks you through deploying Qwen3.6-35B-A3B with DFlash on a Yotta Labs GPU Pod, using the pod-templates-jupyterlab template.
Note: Yotta Labs provided compute resources for training the
Qwen3.6-35B-A3B-DFlashdraft model. The draft model is still under active training (2000 steps); early benchmark numbers are included at the end of this guide.
Prerequisites
A Yotta Labs account with sufficient credits
A Hugging Face account — you must accept the model access conditions for
z-lab/Qwen3.6-35B-A3B-DFlashandQwen/Qwen3.6-35B-A3Bbefore downloadingA Hugging Face access token (
HF_TOKEN) with read permissions
GPU Requirements
Qwen3.6-35B-A3B is a Mixture-of-Experts model with 35B total parameters and ~3B active parameters per token. The BF16 weights alone are 71 GB, so you need a GPU with sufficient VRAM to hold the full model plus the DFlash drafter and KV cache.
RTX PRO 6000
96 GB
single GPU
✅ Recommended on Yotta Labs — fits BF16 comfortably with headroom for KV cache
2× RTX 5090
64 GB total
--tensor-parallel-size 2
Good consumer-GPU alternative
H100 80GB
80 GB
single GPU
Best for high-concurrency serving
A100 80GB
80 GB
single GPU
Solid alternative; no native FP8 Tensor Cores
Why RTX PRO 6000? With 96 GB VRAM, the full BF16 model (71 GB) + DFlash drafter (~2 GB) + KV cache and runtime overhead all fit on a single card with comfortable headroom. No tensor parallelism needed, which simplifies deployment and avoids inter-GPU communication overhead.
Step 1 — Deploy a Dflash Pod
Log in to the Yotta Labs Console.
Navigate to Compute → Pods and click Deploy in the top right.
On the GPU Selection page, choose RTX PRO 6000/H200/2*RTX5090.
Under Pod Template, select Dflash.

Configure the following settings:
SettingRecommended ValueSystem Volume
200 GB (BF16 weights ≈ 73 GB total for target + drafter, plus working space)
GPU Count
1
Add the following Environment Variable so your Hugging Face token is available inside the Pod:
Click Deploy. The Pod will enter
Initializingstate while resources are allocated.
Tip: The System Volume must be at least 3× the image size. With BF16 weights for both models (~73 GB) plus the JupyterLab base image, 200 GB gives you comfortable headroom for checkpoints and logs.
Step 2 — Open JupyterLab
Once the Pod status shows Running:
Click Connect on the Pod card.
Under HTTP services, click the link for port 8888 to open JupyterLab in your browser.
Create a new notebook: File → New → Notebook, select the Python 3 kernel.
All steps below run inside notebook cells. Commands prefixed with %pip or ! run as shell commands inside the notebook.
Step 3 — Install Dependencies
Run each of the following in separate notebook cells. Use %pip (not pip) so packages are installed into the active kernel.
vLLM has stable DFlash support built in. Use the nightly build until DFlash lands in a stable release.
After installing either option: go to Kernel → Restart Kernel so the newly installed packages are picked up before running the next cells.
Step 4 — Download Model Weights
Run this cell after restarting the kernel. Downloads go to /root/models/, which is persisted across Pod restarts.
Note: Total download is approximately 73 GB. This may take 15–30 minutes. Because
/rootis persisted on Yotta Labs Pods, the weights survive Pod restarts and kernel restarts — you only need to download once.
Step 5 — Send Your First Request
Run this in a new cell after the server shows ✅ ready:
Optionally run a health check before your first request:
Performance Reference
The following acceptance lengths are measured with the Qwen3.6-35B-A3B-DFlash draft model (thinking enabled, block size 16, SGLang backend). Higher acceptance length = more tokens accepted per draft step = faster effective throughput.
GSM8K
5.8
Math500
6.3
HumanEval
5.2
MBPP
4.8
MT-Bench
4.4
These are early results — the draft model is still under training. Numbers will improve as training progresses.
Tips & Troubleshooting
Model download fails / access denied Accept the model access conditions at huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash while logged into Hugging Face, then re-run Cell 2.
HF_TOKEN is empty inside the notebook The token must be set as a Pod environment variable before deployment. If you forgot, terminate and redeploy with HF_TOKEN added. As a temporary workaround you can set it directly in a cell:
Packages not found after install Always run Kernel → Restart Kernel after running Cell 1. %pip install takes effect only after the kernel is restarted.
Server process exits immediately Check the log for errors:
Out of GPU memory (OOM) Change "0.92" to "0.88" in the --gpu-memory-utilization argument in Cell 3, then re-run. If the problem persists, add "--max-model-len", "65536" to cap the context window and free KV cache budget.
Server not reachable on port 30000 Make sure port 30000 was configured as an HTTP port when deploying the Pod. Click Connect on the Pod card — port 30000 should appear in the HTTP services list. If not, you need to redeploy with the port added.
Kernel restarted and server is gone The server subprocess is tied to the kernel session. If the kernel restarts, re-run Cell 3. Model weights in /root/models/ are still on disk — no re-download needed.
Slow generation / DFlash not kicking in DFlash is most effective at low-to-medium batch sizes (1–32 concurrent requests). For high concurrency, add "--speculative-disable-by-batch-size", "32" to the argument list in Cell 3a.
Additional Resources
Last updated
Was this helpful?