Sol-RL on A100: Post-Training Diffusion Models at the Speed of Light

Sol-RL (Speed-of-light RL) is NVIDIA's framework for aligning text-to-image diffusion models with human preferences — significantly faster and cheaper than standard approaches. The core idea: use NVFP4-quantized rollouts to cheaply explore a massive candidate pool, then train exclusively on the best samples regenerated at full BF16 precision. The result is up to 4.64× faster convergence with essentially no quality loss, verified across SANA, FLUX.1, and SD3.5-L. This tutorial walks through the paper, the key ideas, and how to run Sol-RL post-training on a Yotta Labs multi-GPU Pod.

RL-based post-training for diffusion models has been picking up serious momentum lately. The basic premise borrows from what made GRPO so effective for LLMs: generate a group of candidate outputs, score them with a reward model, and use the relative advantage between high- and low-reward samples to update the policy. Translated to image generation, this means: for each prompt, sample a batch of images, evaluate them against a human preference signal (ImageReward, CLIPScore, PickScore, HPSv2, etc.), and optimize the diffusion model to produce more of the high-reward kind.
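The group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, assuming the common "subtract the group mean, divide by the group standard deviation" normalization; the function name, the `eps` constant, and the reward values are made up for the example and are not taken from the Sol-RL code.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical reward scores for four images generated from the same prompt.
rewards = [0.2, 0.9, 0.5, 1.4]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Samples above the group mean get a positive advantage and samples below it a negative one, so the update pushes the policy toward the former and away from the latter.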

The catch — and it's a real one — is that rollout generation is expensive. On a model like FLUX.1-12B, running BF16 forward passes at scale is brutally slow. The research consistently shows that larger rollout group sizes lead to better alignment (more candidates → sharper advantage signal → cleaner gradient → faster convergence), but scaling rollout group size linearly scales compute. You quickly find yourself in a situation where candidate generation consumes 80% of wall-clock training time, and the policy update itself becomes almost a rounding error. This is the bottleneck Sol-RL is designed to break.

Why FP4 Rollouts Work as a Proxy

The key insight is deceptively simple. In ODE-based diffusion sampling — which covers most modern flow-matching models including SANA, FLUX, and SD3 — the semantic content of a generated image is largely determined by the initial noise seed. Change the denoising path slightly (as FP4 quantization does), and you'll get a different-looking image, but the reward it would receive tends to stay consistent with its BF16 counterpart. The ranking within a group is preserved even when absolute pixel quality degrades.
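A toy example can make the "same seed, slightly perturbed path" idea concrete. The sketch below is not a diffusion model: it integrates a one-dimensional ODE with Euler steps, and the rounding of the velocity is only a loose stand-in for low-precision error. The ODE, the `quant_step` parameter, and all names are assumptions chosen for illustration.

```python
import random

def euler_sample(seed, steps=20, target=1.0, quant_step=None):
    """Deterministic Euler integration of dx/dt = target - x from a seeded start.

    If quant_step is given, the velocity is rounded to that grid to mimic
    (very loosely) the error introduced by low-precision arithmetic.
    """
    x = random.Random(seed).gauss(0.0, 1.0)  # initial "noise" is fixed by the seed
    dt = 1.0 / steps
    for _ in range(steps):
        v = target - x
        if quant_step is not None:
            v = round(v / quant_step) * quant_step
        x += dt * v
    return x

full = euler_sample(seed=7)
coarse = euler_sample(seed=7, quant_step=0.05)
gap = abs(full - coarse)
print(full, coarse, gap)
```

Running the sampler twice with the same seed gives the same result, and for this particular contractive ODE the gap between the two paths stays within half a quantization step. Real models are far less tidy, so this shows only the shape of the argument.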

This means FP4 rollouts can serve as reliable proxy reward rankings. You're not asking them to be perfect images — you're just asking them to tell you which seeds are worth regenerating at full precision. The paper formalizes this in two ways: first, low-precision denoising can be modeled as a bounded perturbation to the BF16 trajectory, so FP4 reward values stay close to their BF16 counterparts; second, as rollout group size N grows, the probability that FP4-identified top/bottom-K seeds match the true BF16 top/bottom-K improves rapidly. In other words, the proxy gets more reliable precisely when you need it most — at large N.
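One way to build intuition for the selection argument is a small Monte Carlo experiment. The sketch below assumes Gaussian "true" rewards and a uniform bounded perturbation as the proxy error; both are modelling choices made for illustration, and it does not reproduce the paper's analysis. It reports how much of the true top/bottom-K set a noisy ranking recovers.

```python
import random

def extreme_sets(scores, k):
    """Return (top-k index set, bottom-k index set) under the given scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return set(order[-k:]), set(order[:k])

def selection_agreement(n, k, noise, trials=200, seed=0):
    """Average fraction of true top/bottom-k seeds recovered from a noisy proxy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        true = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Bounded perturbation: uniform noise in [-noise, +noise].
        proxy = [r + rng.uniform(-noise, noise) for r in true]
        t_top, t_bot = extreme_sets(true, k)
        p_top, p_bot = extreme_sets(proxy, k)
        total += (len(t_top & p_top) + len(t_bot & p_bot)) / (2 * k)
    return total / trials

print(selection_agreement(n=96, k=12, noise=0.2))
```

Varying `n`, `k`, and `noise` shows how the three interact under these toy assumptions; the numbers it prints say nothing about any particular reward model.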

What's notably absent from the paper is any evidence that this ranking consistency holds universally. The authors are careful to scope it to the ODE-style deterministic samplers they use throughout (classifier-free guidance disabled for SANA and SD3.5, guidance embedding fixed at 1.0 for FLUX.1), and to warn that naively using FP4 outputs directly as training targets — without the two-stage decoupling — does degrade quality. The proxy ranking insight only works because you're using FP4 exclusively for selection, not for gradient computation.

The Two-Stage Pipeline

Each training iteration runs as follows.

Stage 1: FP4 Explore. Sample N=96 initial noise seeds per prompt. Run the NVFP4-quantized model with a reduced step count (the paper uses 6 preview steps, since a few denoising steps are enough to rank candidates even though they would not produce final quality). Score all 96 images with the reward model. Keep the K=24 most contrastive seeds (the default), taken from the top and bottom of the reward ranking; this subset produces the sharpest advantage signal for GRPO.

Stage 2: BF16 Train. Take the 24 selected seeds and regenerate them from scratch using the full BF16 model with the full denoising step budget. Compute the GRPO loss on these high-fidelity regenerations using the proxy rewards from Stage 1, then update the policy weights. Finally, re-quantize the updated weights to NVFP4 for the next iteration's Stage 1.
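The two stages can be sketched as a single function with the models passed in as callables. This is a structural sketch only, under stated assumptions: the even split between the lowest- and highest-ranked seeds, the function and parameter names, and the toy stand-ins for the rollouts and the reward model are all hypothetical. The source only fixes the totals (24 selected out of 96).

```python
from statistics import fmean, pstdev

def sol_rl_iteration(seeds, cheap_rollout, full_rollout, reward_fn, k=24):
    """One explore-then-train step: returns (selected seeds, samples, advantages)."""
    # Stage 1: score every seed with the cheap (FP4-like) rollout.
    proxy = {s: reward_fn(cheap_rollout(s)) for s in seeds}
    ranked = sorted(seeds, key=proxy.get)
    # Keep the k most contrastive seeds: k // 2 lowest plus the rest from the top.
    selected = ranked[: k // 2] + ranked[len(ranked) - (k - k // 2):]
    # Stage 2: regenerate only the selected seeds at full precision.
    samples = [full_rollout(s) for s in selected]
    # Group-relative advantages computed from the Stage 1 proxy rewards.
    rewards = [proxy[s] for s in selected]
    mean, std = fmean(rewards), pstdev(rewards)
    advantages = [(r - mean) / (std + 1e-8) for r in rewards]
    return selected, samples, advantages

# Toy stand-ins so the sketch runs without any model.
seeds = list(range(96))
cheap = lambda s: (s * 37 % 101) / 101 + 0.01  # slightly perturbed "image"
full = lambda s: (s * 37 % 101) / 101          # full-precision "image"
selected, samples, advantages = sol_rl_iteration(
    seeds, cheap, full, reward_fn=lambda image: image
)
print(len(selected), len(samples))
```

In a real run the returned samples and advantages would feed the policy-gradient loss, followed by the weight update and re-quantization described above.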

The overhead is minimal: Stage 2 runs BF16 on exactly the same number of samples a standard GRPO run would train on anyway. The N−K discarded candidates only ever cost cheap FP4 passes. The paper reports roughly 2% computational overhead from re-quantizing the weights after each Stage 2 update and 1.4–2.4× rollout-phase acceleration from FP4, depending on the model, for a net end-to-end speedup of 1.25× on SANA and 1.62× on FLUX.1. The convergence headline (4.64×) refers to wall-clock time to reach a target reward level: because Sol-RL fits more iterations into the same time budget, it reaches alignment levels that naive BF16 pipelines cannot reach without prohibitive cost.

Results 🎉

Across three base models and four reward metrics, the results are consistent:

Efficiency (seconds per iteration):

| Base Model | Rollout (Naive BF16) | Rollout (Sol-RL) | Rollout Speedup | E2E Speedup |
| --- | --- | --- | --- | --- |
| FLUX.1 | 184s | 79s | 2.33× | 1.62× |
| SD3.5-Large | 451s | 187s | 2.41× | 1.61× |
| SANA | 65s | 46s | 1.41× | 1.25× |
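The relationship between the rollout and end-to-end columns can be checked with a little arithmetic. The sketch below recomputes the rollout speedup from the reported seconds and then, assuming each iteration is "rollout time plus a fixed remainder" (a simplification that ignores the re-quantization overhead), backs out the remainder implied by the end-to-end figure. That remainder is an inference made here, not a number from the paper.

```python
# (naive BF16 rollout s, Sol-RL rollout s, reported E2E speedup) from the table above
rows = {
    "FLUX.1": (184, 79, 1.62),
    "SD3.5-Large": (451, 187, 1.61),
    "SANA": (65, 46, 1.25),
}

def rollout_speedup(before, after):
    """Ratio of naive rollout time to Sol-RL rollout time."""
    return before / after

def implied_other_seconds(before, after, e2e):
    """Solve (before + t) / (after + t) = e2e for t, the non-rollout time."""
    return (before - e2e * after) / (e2e - 1.0)

for name, (before, after, e2e) in rows.items():
    print(name,
          round(rollout_speedup(before, after), 2),
          round(implied_other_seconds(before, after, e2e), 1))
```

The end-to-end gain is smaller than the rollout gain because the non-rollout part of each iteration is not accelerated.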

Alignment quality (FLUX.1, ImageReward):

| Method | Score | Δ vs. Base |
| --- | --- | --- |
| DanceGRPO | 1.4937 | +1.04 |
| FlowGRPO | 1.5331 | +1.08 |
| AWM | 1.6693 | +1.21 |
| DiffusionNFT | 1.6707 | +1.22 |
| Sol-RL | 1.7636 | +1.31 |

Sol-RL outperforms all baselines on all metrics across all three models while training faster than any of them. The quality gap versus a pure BF16 rollout pipeline (where all N=96 candidates are generated at full precision) is approximately 1% on ImageReward — effectively noise.


Running Sol-RL on a Yotta Labs Pod

The official scripts are configured for single-node 8-GPU training, which is the setup validated in the paper. SANA 1.6B is the most accessible starting point — it's the smallest model and the fastest to iterate on. FLUX.1 and SD3.5-L require more VRAM per GPU but run on the same scripts.

Prerequisites

  • A Yotta Labs account with sufficient credits

  • A Hugging Face account and access token (HF_TOKEN)

  • Familiarity with multi-GPU JupyterLab environments and torchrun

1. Deploy an 8-GPU Pod

Log in to the Yotta Labs Console. Navigate to Compute → Pods, click Deploy, and select an 8× GPU configuration. The official training was done on H100 80GB GPUs; on Yotta Labs, 8× A100 will work for SANA 1.6B with the default config.

| Setting | Recommended Value |
| --- | --- |
| Template | pod-templates-jupyterlab |
| System Volume | 200 GB (model weights, checkpoints, reward model caches) |
| GPU Count | 8 |

Add your Hugging Face token as an environment variable named HF_TOKEN before deploying.

Click Deploy, wait for Running status, then connect to JupyterLab on port 8888.

2. Clone and Install

Open a terminal over SSH and set up the base environment:

Sol-RL's NVFP4 path (naive_quant and sol_rl config families) requires transformer-engine. Install it with the same Python interpreter that will run torchrun:

Note on setuptools: If you hit a ModuleNotFoundError: No module named 'pkg_resources' during setup (caused by setuptools 70+ breaking backwards compatibility), fix it first: pip install "setuptools<70" --force-reinstall, then re-run.

3. Download Reward Model Checkpoints

Most reward models download automatically on first use, but HPSv2 requires manual setup. Create the checkpoint directory and download its weights:

The other reward models (clipscore, pickscore, imagereward) are downloaded automatically the first time training runs. If you're planning to use only those, you can skip this step.

4. Run Sol-RL Training on SANA

The repo ships ready-to-run launcher scripts for each model. For SANA, the default entry point is the run_sana_single_node_8gpu.sh launcher.

To select a specific config family and reward, set CONFIG_SPEC before launching. The config naming pattern is <model>_<family>_<reward>. For example, to run the full Sol-RL two-stage pipeline with ImageReward on SANA:

If you want to start with something simpler to validate the pipeline first — no NVFP4 required, no transformer-engine dependency — use the diffusionnft family:

Config families at a glance:

| Family | What it does | NVFP4 needed |
| --- | --- | --- |
| diffusionnft | PEFT-only baseline (24-in-24) | No |
| naive_scaling | BF16 brute-force scaling (24-in-96) | No |
| compile | BF16 compiled scaling (24-in-96) | No |
| naive_quant | Direct NVFP4 rollout (don't use for real training) | Yes |
| sol_rl | Two-stage decoupled Sol-RL (24-in-96) | Yes |

The diffusionnft family is the recommended starting point for a first run. It validates that your environment, data loading, and reward scoring all work correctly before you bring in the NVFP4 machinery.
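The `<model>_<family>_<reward>` naming pattern and the family table can be captured in a small helper that assembles a spec string and reports whether the chosen family needs transformer-engine. This is an illustrative sketch: the helper functions are hypothetical, and the lowercase model token `sana` and the reward token `hpsv2` are assumptions, so check the repo's config directory for the exact names before relying on the output.

```python
# Family names come from the table above; reward tokens are partly assumed.
FAMILIES = {"diffusionnft", "naive_scaling", "compile", "naive_quant", "sol_rl"}
REWARDS = {"clipscore", "pickscore", "imagereward", "hpsv2"}  # "hpsv2" is assumed
NVFP4_FAMILIES = {"naive_quant", "sol_rl"}

def config_spec(model, family, reward):
    """Compose a spec string following the <model>_<family>_<reward> pattern."""
    if family not in FAMILIES:
        raise ValueError(f"unknown config family: {family!r}")
    if reward not in REWARDS:
        raise ValueError(f"unknown reward: {reward!r}")
    return f"{model}_{family}_{reward}"

def needs_transformer_engine(family):
    """True for the families that use the NVFP4 path."""
    return family in NVFP4_FAMILIES

spec = config_spec("sana", "sol_rl", "imagereward")  # model token is assumed
print(spec, needs_transformer_engine("sol_rl"))
```

Validating the family and reward up front catches typos before an 8-GPU job is launched.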

5. Run Inference to Verify Alignment

Once you have a checkpoint (or midway through training), test your aligned model against the base model to see the quality difference:

To quantify the improvement with reward scores, use the ImageReward evaluation script:

The alignment improvement is most visible on compositionally complex prompts — spatial relationships, counting, attribute binding. The RL-aligned model handles "a red cube on top of a blue sphere to the left of a green cylinder" noticeably better than the base model, because the reward signal specifically pushes the policy toward prompt fidelity.
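Once both models have been scored on the same prompts, a paired comparison is a simple way to summarize the difference. The sketch below assumes you already have two lists of per-prompt reward scores in matching order; the function name and the numbers are invented for illustration and do not come from the repo's evaluation script.

```python
from statistics import fmean

def compare_scores(base, aligned):
    """Paired comparison of per-prompt reward scores (same prompts, same order)."""
    if len(base) != len(aligned) or not base:
        raise ValueError("score lists must be non-empty and the same length")
    deltas = [a - b for a, b in zip(aligned, base)]
    return {
        "mean_delta": fmean(deltas),
        "win_rate": sum(d > 0 for d in deltas) / len(deltas),
    }

# Made-up scores for four prompts.
base = [0.41, 0.55, 0.30, 0.62]
aligned = [0.93, 0.51, 1.10, 1.20]
summary = compare_scores(base, aligned)
print(summary)
```

Reporting a win rate alongside the mean helps distinguish a broad improvement from a large gain on only a few prompts.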


Troubleshooting

transformer-engine install fails. This package is tied tightly to your CUDA and PyTorch versions. Run python -c "import torch; print(torch.__version__, torch.version.cuda)" first, then check the transformer-engine release page for a matching wheel. On Blackwell GPUs (e.g. RTX 5090 or B200), CUDA 12.8+ and a PyTorch build with Blackwell support (2.7+) are required.

run_sana_single_node_8gpu.sh exits immediately with no output. Usually means torchrun can't find all 8 GPUs. Check nvidia-smi — all 8 should be visible with no processes holding them. If another process is using a GPU, kill it first.
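The GPU check can be scripted. The sketch below uses `nvidia-smi`'s CSV query output and flags GPUs that are missing or already holding memory; the 500 MiB "busy" threshold and the function names are arbitrary choices for illustration.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"]

def parse_gpu_rows(text):
    """Parse 'index, memory_used_MiB' CSV rows into (index, mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        rows.append((int(index), int(used)))
    return rows

def busy_gpus(rows, threshold_mib=500):
    """Indices of GPUs whose used memory exceeds the (arbitrary) threshold."""
    return [index for index, used in rows if used > threshold_mib]

def gpus_ready(expected=8):
    """Run nvidia-smi and report whether all expected GPUs are visible and idle."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    rows = parse_gpu_rows(out)
    return len(rows) == expected and not busy_gpus(rows), rows

# Parsing a canned sample, so this part runs even on a machine without GPUs.
sample = "0, 3\n1, 3\n2, 40211\n"
print(busy_gpus(parse_gpu_rows(sample)))
```

On the Pod, calling `gpus_ready()` before launching gives a quick yes/no plus the per-GPU memory readings.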

Reward scores don't improve after 100+ iterations. First check that the reward model is actually scoring correctly by looking at per-sample scores in the training logs. If scores are near-constant, the reward model may have failed to load (common if reward_ckpts/ is missing for HPSv2). Switch to pickscore or imagereward, which auto-download and are more robust to environment issues.
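A near-constant reward signal is easy to detect once the per-sample scores are in a list. The sketch below is a hypothetical helper: the tolerance is an arbitrary choice, and how you extract scores from the logs depends on the log format, which is not shown here.

```python
from statistics import pstdev

def looks_constant(scores, tol=1e-3):
    """True if the scores barely vary, which suggests the reward model is not scoring."""
    return len(scores) > 1 and pstdev(scores) < tol

# Made-up per-sample scores from one rollout group.
healthy = [0.31, 0.87, 0.55, 1.12]
suspicious = [0.5000, 0.5001, 0.5000, 0.4999]
print(looks_constant(healthy), looks_constant(suspicious))
```

A group with almost no spread also yields group-relative advantages dominated by noise, so the policy gets no useful signal even though training appears to run.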

OOM during the BF16 Stage 2 regeneration. Reduce train_group_size (K) from 24 to 16 in your config. This reduces the BF16 batch size with minimal impact on gradient quality, since you're still selecting from the same N=96 FP4 candidates.

Want to run on FLUX.1 or SD3.5-L? The launchers are identical in structure, just swap the script:

FLUX.1 shows the largest absolute efficiency gains (2.33× rollout speedup), making it the most rewarding target if you have the VRAM budget.


Resources: Paper (arXiv:2604.06916) · GitHub · Project Page · Official Sol-RL Docs
