GRPO on MathVista with Unsloth

🦥 This tutorial is based on the official Unsloth notebooks.

Model: Qwen3-VL-8B-Instruct (4-bit QLoRA)
Algorithm: GRPO / GSPO (Group Relative Policy Optimization)
Task: Teaching a VLM to solve math problems from images

What Are We Solving?

We train the model to answer visual math questions: it must look at an image, reason about it, and output a numeric answer in a structured format.


Why GRPO on a VLM?

Standard supervised fine-tuning (SFT) requires ground-truth reasoning traces. GRPO instead uses reward signals — we only need the final answer label, and the model learns how to reason on its own. This is the same technique used to build o1-style thinking models.

| Method | Needs reasoning traces? | Sample efficiency |
| --- | --- | --- |
| SFT | ✅ Yes | Moderate |
| GRPO | ❌ No | High |


Section 1 — YottaLabs Platform

Before running any code in this notebook, you need to spin up a GPU Pod on YottaLabs.

| GPU | VRAM | Speed vs A100 | Recommended? | Note |
| --- | --- | --- | --- | --- |
| H100 80GB | 80 GB | Best | ✅ | Plenty of headroom, fastest training |

> TL;DR: Use 1× H100 80GB. With 4-bit QLoRA the model uses ~25 GB, giving you lots of room.

Launch a Pod

To get started with our pods, see the official docs.


Section 2 — Environment Setup

> Run all remaining cells inside your YottaLabs Pod (via Jupyter at `http://<POD_IP>:8888`).

The YottaLabs base image already has PyTorch 2.8 + CUDA 12.8 pre-installed. We only need to add the Unsloth fine-tuning stack on top.
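A typical setup cell looks like the following. The exact package list is an assumption; the official Unsloth notebook pins its own versions, so prefer those if they differ.

```shell
# Add the fine-tuning stack on top of the preinstalled PyTorch 2.8 / CUDA 12.8 image
pip install --upgrade unsloth trl datasets pillow
```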


Section 3 — Load Qwen3-VL 8B with QLoRA

We load Qwen3-VL-8B-Instruct in 4-bit quantization (QLoRA). This reduces the model's memory footprint from ~16 GB (BF16) to ~5 GB, while preserving most of the model's capability.

Key Parameters Explained

| Parameter | Value | Why |
| --- | --- | --- |
| `max_seq_length` | 16384 | VLMs need long context for image tokens (~1000 tokens per image) |
| `load_in_4bit` | True | QLoRA: reduces VRAM from ~16 GB to ~5 GB |
| `gpu_memory_utilization` | 0.85 | Leaves a 15% buffer for GRPO rollout buffers |
| `fast_inference` | False | Disables vLLM for training (enable only for inference-only use) |

Attach LoRA Adapters

Instead of fine-tuning all 8B parameters, LoRA injects small trainable matrices into the attention and MLP layers. Only these adapter weights are updated during training — the base model stays frozen in 4-bit.

Why freeze the vision encoder? We set finetune_vision_layers = False. The vision encoder (which converts images to tokens) is already well-trained and generalises well; training it risks overfitting on a small dataset and uses extra VRAM.
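Putting the two steps together, a loading-plus-adapter sketch looks like this. The model id, LoRA rank, and alpha are assumptions for illustration; check the official Unsloth notebook for the exact arguments your version supports.

```python
from unsloth import FastVisionModel

# Load the 4-bit base model (QLoRA)
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3-VL-8B-Instruct",  # assumed model id
    load_in_4bit=True,               # 4-bit base weights, ~5 GB
    max_seq_length=16384,            # room for ~1000 image tokens + reasoning
    gpu_memory_utilization=0.85,     # 15% buffer for GRPO rollouts
    fast_inference=False,            # no vLLM during training
)

# Attach LoRA adapters; only these small matrices are trained
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,    # keep the vision encoder frozen
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,                            # illustrative LoRA rank
    lora_alpha=16,
)
```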

Section 4 — Prepare the MathVista Dataset

MathVista is a benchmark of math problems that require understanding visual context (charts, geometry diagrams, tables, etc.). We use the testmini split (~1000 examples) as our training set.

Data Pipeline

Why 512×512?

The original images vary in size. Resizing to a fixed 512×512:

  • Keeps image token count predictable (~1000 tokens per image)

  • Prevents sequence length from exceeding max_seq_length

  • Speeds up training (smaller feature maps in the vision encoder)

Step 1 — Load the dataset

Step 2 — Keep only numeric answers and preprocess images
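The filtering half of this step reduces to a plain predicate. The `answer` field name is an assumption about the MathVista schema; adjust it to the actual column name.

```python
def has_numeric_answer(example: dict) -> bool:
    """Keep only examples whose ground-truth answer parses as a number."""
    try:
        float(str(example["answer"]))
        return True
    except (KeyError, ValueError):
        return False

# With a Hugging Face dataset you would apply it as:
# dataset = dataset.filter(has_numeric_answer)
```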

Step 3 — Build the prompt template

We use a structured output format with XML-style tags to make it easy for reward functions to parse the model's response:

This is similar to DeepSeek-R1's <think> / </think> format. The two-part structure lets us:

  1. Reward format compliance — did the model produce both tags?

  2. Reward correctness — is the number inside <SOLUTION> correct?
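A minimal version of such a system prompt (the exact wording here is illustrative, not the notebook's):

```python
SYSTEM_PROMPT = (
    "You are given a math problem about an image. "
    "First, think through the problem step by step inside "
    "<REASONING> ... </REASONING> tags. "
    "Then output only the final numeric answer inside "
    "<SOLUTION> ... </SOLUTION> tags."
)
```

Both reward functions below only depend on the two tag pairs being present, so the surrounding instructions can be tuned freely.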

Step 4 — Sanity check examples


Section 5 — Design Reward Functions

Reward functions are the heart of GRPO. Instead of manually labelling reasoning traces, we define two automatic signals that tell the model what "good" looks like:

Reward 1 — Format Compliance (formatting_reward_func)

Checks whether the response contains exactly one <REASONING>...</REASONING> block and exactly one <SOLUTION>...</SOLUTION> block.

| Condition | Points |
| --- | --- |
| Has exactly 1 `<REASONING>` block | +1.0 |
| Has exactly 1 `<SOLUTION>` block | +1.0 |
| >50% of output is `addCriterion` / `\n` spam | −2.0 |
| Max possible | +2.0 |
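A sketch of this scoring logic on a single response (the notebook's actual reward functions take batched completions and extra kwargs from the trainer):

```python
import re

def formatting_reward_func(response: str) -> float:
    """Score structural compliance of one completion per the table above."""
    score = 0.0
    if len(re.findall(r"<REASONING>.*?</REASONING>", response, re.DOTALL)) == 1:
        score += 1.0
    if len(re.findall(r"<SOLUTION>.*?</SOLUTION>", response, re.DOTALL)) == 1:
        score += 1.0
    # Penalise the known Qwen-VL degenerate mode: endless addCriterion/newline spam
    spam_chars = sum(len(m) for m in re.findall(r"addCriterion|\n", response))
    if response and spam_chars / len(response) > 0.5:
        score -= 2.0
    return score
```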

Reward 2 — Correctness (correctness_reward_func)

Checks whether the number inside <SOLUTION> exactly matches the ground-truth answer.

| Condition | Points |
| --- | --- |
| Answer matches ground truth | +2.0 |
| Answer is wrong or missing | 0.0 |
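A single-response sketch of this check. Note it compares numerically after parsing; the notebook may instead use an exact string match.

```python
import re

def correctness_reward_func(response: str, answer: str) -> float:
    """+2.0 if the number inside <SOLUTION> matches the ground truth, else 0.0."""
    m = re.search(r"<SOLUTION>(.*?)</SOLUTION>", response, re.DOTALL)
    if m is None:
        return 0.0
    try:
        return 2.0 if float(m.group(1).strip()) == float(answer) else 0.0
    except ValueError:
        return 0.0
```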

Total Reward Range: −2.0 to +4.0

The model learns to maximise this sum across its GRPO rollouts.

> What is the addCriterion penalty? Qwen-VL models have a known degenerate mode where they repeat addCriterion and newlines endlessly instead of generating meaningful text. The penalty shuts this down early.


Section 6 — Pre-Training Baseline Inference

Before training, let's run the model on a sample problem to establish a baseline. This shows us what the model can already do, and gives us a qualitative benchmark to compare against after training.


Section 7 — Configure & Launch GRPO Training

How GRPO Works (Briefly)

For each training step, GRPO:

  1. Samples num_generations completions from the current policy (the model)

  2. Scores each completion with the reward functions

  3. Updates the policy to increase the probability of high-reward completions relative to the group mean

This is fundamentally different from PPO (which needs a separate value network) — GRPO only needs the policy model, making it much more memory-efficient.
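Step 3 above reduces to computing group-relative advantages. A minimal sketch of the vanilla GRPO normalisation (mean-centred, divided by the group's reward standard deviation):

```python
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each rollout relative to its group: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

The `dr_grpo` loss used later in this notebook drops the std division; this shows the standard form for intuition.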

GSPO Extension (Enabled Here)

We also enable GSPO (Group Sequence Policy Optimisation) via:

  • importance_sampling_level = "sequence": importance weights are computed at the sequence level rather than per token

  • loss_type = "dr_grpo": the Dr. GRPO ("GRPO Done Right") loss, which removes the length and reward-std normalisation biases of vanilla GRPO

This improves training stability, especially for long completions typical of VLMs.

Key Hyperparameters

| Parameter | Value | Effect |
| --- | --- | --- |
| `learning_rate` | 5e-6 | Small LR; RL is sensitive to overshooting |
| `num_generations` | 4 | More rollouts give a better gradient estimate (the H100 has room) |
| `per_device_train_batch_size` | 1 | VLMs are memory-heavy; batch size 1 is standard |
| `gradient_accumulation_steps` | 4 | Effective batch of 4 without extra VRAM |
| `max_grad_norm` | 0.1 | Gradient clipping, critical for RL stability |
| `optim` | adamw_8bit | 8-bit Adam: same quality, half the optimizer-state VRAM |
| `num_train_epochs` | 1.0 | Full pass over the dataset (~600 steps) |
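Assembled into a TRL-style config, the table might look like the following sketch. Whether every argument is available depends on your trl and Unsloth versions, so treat this as an outline rather than the notebook's exact cell.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=4,               # rollouts per prompt
    max_grad_norm=0.1,               # tight clipping for RL stability
    optim="adamw_8bit",
    num_train_epochs=1.0,
    # GSPO-style settings from the section above
    importance_sampling_level="sequence",
    loss_type="dr_grpo",
    output_dir="outputs",
)
```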


Section 8 — Post-Training Inference

Now let's run the same inference test as Section 6, but with the trained model. You should see:

  1. Structured output — the model reliably uses <REASONING> and <SOLUTION> tags

  2. Better accuracy — the numeric answer inside <SOLUTION> should be closer to correct

  3. Coherent reasoning — the <REASONING> block should show sensible working-out steps


Section 9 — Save & Export

There are several ways to save the trained model depending on your use case:

| Method | File size | Use case |
| --- | --- | --- |
| LoRA adapters only | ~100 MB | Fine-tune further, share on Hugging Face |
| Merged 16-bit | ~16 GB | Deploy with vLLM or standard transformers |
| Merged 4-bit | ~5 GB | Memory-constrained deployment |
| GGUF q4_k_m | ~5 GB | llama.cpp / Ollama local inference |
| GGUF f16 | ~16 GB | llama.cpp high-quality inference |
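Unsloth exposes a helper for each row in the table. A sketch (method names follow the Unsloth docs, but verify against your installed version; directory names are placeholders):

```python
# LoRA adapters only (~100 MB): small, shareable, resumable
model.save_pretrained("qwen3vl-mathvista-lora")
tokenizer.save_pretrained("qwen3vl-mathvista-lora")

# Merged full weights for deployment
model.save_pretrained_merged("qwen3vl-mathvista-16bit", tokenizer, save_method="merged_16bit")
model.save_pretrained_merged("qwen3vl-mathvista-4bit", tokenizer, save_method="merged_4bit")

# GGUF export for llama.cpp / Ollama
model.save_pretrained_gguf("qwen3vl-mathvista-gguf", tokenizer, quantization_method="q4_k_m")
```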

