Fine-tune a Reasoning Model to Think in a Target Language with Unsloth

🦥 This tutorial is based on the official Unsloth notebooks.

  • Model: DeepSeek-R1-0528-Qwen3-8B (4-bit QLoRA + vLLM fast inference)

  • Algorithm: GRPO (Group Relative Policy Optimization)

  • Dataset: DAPO-Math-17k


Most GRPO tutorials only reward correctness. This notebook adds a language-consistency reward — inspired directly by the DeepSeek-R1 paper — that forces the model's internal <think> reasoning chain to use Bahasa Indonesia. The technique generalises to any target language.

Reward signal breakdown (max total = 17.5 per step):
  match_format_exactly       +3.0  — <think>...</think> tags present and correct
  match_format_approximately +1.0  — partial credit for tag structure
  check_answer               +5.0  — exact answer match (with ratio-based partial credit)
  check_numbers              +3.5  — float comparison with comma normalisation
  language_consistency       +5.0  — reasoning trace detected as Bahasa Indonesia
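The first reward in this table can be sketched with a simple regex check. This is a minimal sketch: the exact tag pattern and scoring in the notebook's `match_format_exactly` may differ, and `GRPOTrainer` actually passes completions as lists of message dicts rather than plain strings.

```python
import re

# Hypothetical sketch of match_format_exactly: +3.0 when the completion
# contains a well-formed <think>...</think> block followed by an answer.
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*.+$", re.DOTALL)

def match_format_exactly(completions):
    """Return one score per completion: 3.0 for correct tag structure, else 0.0."""
    return [3.0 if FORMAT_RE.match(c.strip()) else 0.0 for c in completions]

scores = match_format_exactly([
    "<think>2 + 2 = 4</think> Jawabannya 4.",  # well-formed
    "Jawabannya 4.",                           # missing tags
])
print(scores)  # → [3.0, 0.0]
```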

Section 1 — YottaLabs Platform

Before running any code in this notebook, you need to spin up a virtual machine on YottaLabs.

| GPU | VRAM | Speed vs A100 | Recommended? | Note |
| --- | --- | --- | --- | --- |
| H100 80GB | 80 GB | Best | Yes | Plenty of headroom, fastest training |

TL;DR: Use 1× H100 80GB. With 4-bit QLoRA the model uses ~25 GB, giving you lots of room.

Launch the VM

To get started with our VM, see our official doc.


Section 2 — Environment Setup

ℹ️ Run all cells from here inside your YottaLabs Pod via Jupyter.

Key Differences from Standard Unsloth Setup

This notebook uses vLLM for fast GRPO rollout generation. vLLM dramatically speeds up the num_generations sampling step — instead of running the model sequentially for each rollout, vLLM batches them with continuous batching and PagedAttention.

Additionally, we install langid — a fast language identification library used in our language-consistency reward function.

| Package | Version | Why pinned |
| --- | --- | --- |
| transformers | 4.56.2 | Compatible with DeepSeek-R1 tokenizer + Unsloth patches |
| trl | 0.22.2 | GRPO trainer version tested with vLLM integration |
| vllm | auto | Fast inference engine for GRPO rollouts |
| langid | latest | Language detection for the consistency reward |


Section 3 — Load DeepSeek-R1-0528-Qwen3-8B

Why fast_inference = True?

Unlike the Vision GRPO notebook, this text-only model enables vLLM fast inference during training. This is possible because:

  1. vLLM is installed and compatible with this model architecture

  2. Text-only GRPO rollouts can be batched efficiently with vLLM's continuous batching

  3. On an H100, vLLM generates ~4 rollouts in roughly the same time standard generation produces 1

LoRA Configuration

We use lora_alpha = rank * 2 here (alpha = 64 for rank 32). This is a common trick that effectively doubles the learning rate for LoRA layers without changing the base LR, leading to faster convergence in RL settings.

| Parameter | Value | Why |
| --- | --- | --- |
| lora_rank | 32 | Higher than vision notebook — text reasoning benefits from more capacity |
| lora_alpha | 64 (rank×2) | 2× alpha speeds up LoRA convergence |
| max_seq_length | 1024 | Short for speed; increase to 4096+ for longer reasoning |
| fast_inference | True | Enable vLLM — critical for GRPO rollout speed |
| gpu_memory_utilization | 0.9 | High — vLLM needs KV cache VRAM |


Section 4 — Understanding the DeepSeek-R1 Chat Template

DeepSeek-R1 models use special tokens to delimit the internal reasoning chain from the final answer. The GRPOTrainer automatically prepends the <think> opening token to every completion, so the model always starts in reasoning mode.

Token Discovery

We programmatically find the special tokens rather than hardcoding them, making this code robust across model variants:
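A minimal sketch of that discovery step. The toy vocabulary and token ids below are illustrative; with a real tokenizer you would pass `tokenizer.get_vocab()` instead.

```python
# Hypothetical sketch: look the reasoning-delimiter tokens up in the tokenizer
# vocabulary instead of hardcoding their ids.
def find_special_tokens(vocab, markers=("<think>", "</think>")):
    found = {m: vocab[m] for m in markers if m in vocab}
    missing = [m for m in markers if m not in found]
    if missing:
        raise ValueError(f"tokenizer is missing special tokens: {missing}")
    return found

# Illustrative toy vocab; the ids are made up for demonstration.
toy_vocab = {"<think>": 151667, "</think>": 151668, "halo": 42}
print(find_special_tokens(toy_vocab))  # → {'<think>': 151667, '</think>': 151668}
```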

System Prompt for Language Targeting

We instruct the model to reason in Bahasa Indonesia via the system prompt. The language-consistency reward function (Section 6) then reinforces this behaviour during training.

ℹ️ Design principle: the system prompt asks for the target language; the reward function enforces it. Neither alone is sufficient — both are needed for reliable steering.
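A sketch of how the prompt side fits together. The prompt wording here is an assumption for illustration, not the notebook's exact text.

```python
# Illustrative system prompt asking for Indonesian-language reasoning inside
# the <think> tags (wording is hypothetical).
SYSTEM_PROMPT = (
    "You are a math assistant. Think step by step in Bahasa Indonesia "
    "inside <think> tags, then give the final answer."
)

def build_messages(problem):
    """Wrap a math problem in the chat format the chat template expects."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": problem},
    ]

msgs = build_messages("Berapakah 12 × 8?")
print(msgs[0]["role"], "|", msgs[1]["content"])  # → system | Berapakah 12 × 8?
```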


Section 5 — Prepare the DAPO-Math Dataset

We use DAPO-Math-17k — a curated math reasoning dataset from the Open-R1 project, containing 17k problems with ground-truth solutions compatible with GRPO training.

Why Filter by Token Length?

GRPO requires every prompt in a batch to fit within max_prompt_length. Rather than truncating (which corrupts problems), we remove the top 10% longest prompts. This ensures:

  • No silent truncation of math problem statements

  • Predictable memory usage per step

  • Faster per-step time (no padding to extreme lengths)
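The filtering step above reduces to a percentile cutoff. A minimal sketch in plain Python; in the notebook the lengths come from tokenizing each prompt, while here precomputed token counts keep the sketch self-contained.

```python
# Drop the top 10% longest prompts rather than truncating them.
def filter_longest(examples, token_lengths, keep_quantile=0.9):
    """Keep only examples at or below the keep_quantile length cutoff."""
    cutoff = sorted(token_lengths)[int(keep_quantile * (len(token_lengths) - 1))]
    return [ex for ex, n in zip(examples, token_lengths) if n <= cutoff]

examples = list("abcdefghij")                      # 10 dummy examples
lengths  = [10, 12, 11, 90, 13, 14, 12, 11, 200, 13]
kept = filter_longest(examples, lengths)
print(len(kept))  # → 9 (the length-200 outlier is dropped)
```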

Data Pipeline
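A minimal sketch of the mapping step, assuming hypothetical raw column names `"prompt"` and `"solution"`; adapt to the actual DAPO-Math-17k schema.

```python
# Hypothetical sketch: turn one raw dataset row into the {prompt, answer}
# dict GRPO training consumes. Column names are assumptions.
def to_grpo_example(row):
    return {
        "prompt": [
            {"role": "system",
             "content": "Think step by step in Bahasa Indonesia, then answer."},
            {"role": "user", "content": row["prompt"]},
        ],
        "answer": str(row["solution"]).strip(),
    }

ex = to_grpo_example({"prompt": "Berapakah 7 + 5?", "solution": " 12 "})
print(ex["answer"])  # → 12
```

With a `datasets.Dataset` this function would typically be applied via `.map(to_grpo_example)`.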


Section 6 — Five Reward Functions

This notebook uses five complementary reward functions that together guide the model towards structured, correct, and Indonesian-language reasoning. Each targets a different aspect of quality.

Reward Architecture Overview

Why Multiple Rewards?

  • match_format_exactly gives a strong binary signal for correct structure

  • match_format_approximately provides gradient even for near-miss formatting

  • check_answer rewards exact string matches and ratio-based proximity

  • check_numbers catches cases where the answer is embedded in a sentence

  • language_consistency is the novel reward that steers the reasoning language

Using multiple rewards is better than one composite reward because each signal is cleaner and easier to tune independently.
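The novel reward can be sketched as follows. The notebook uses `langid.classify(text)`, which returns a `(lang_code, score)` pair; here the classifier is injected as a parameter so the sketch stays self-contained, and the +5.0 scale follows the breakdown table above (the exact scoring may differ in the notebook).

```python
import re

# Sketch of language_consistency: extract the <think> trace and reward it
# when the detected language matches the target ("id" = Bahasa Indonesia).
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def language_consistency(completions, classify, target="id", reward=5.0):
    scores = []
    for text in completions:
        m = THINK_RE.search(text)
        trace = m.group(1) if m else ""
        lang = classify(trace)[0] if trace.strip() else None
        scores.append(reward if lang == target else 0.0)
    return scores

# Toy stand-in classifier for demonstration (langid would be used for real).
toy_classify = lambda s: ("id", 1.0) if "jadi" in s.lower() else ("en", 1.0)
print(language_consistency(
    ["<think>Jadi, 2 + 2 = 4</think> 4", "<think>So, 2 + 2 = 4</think> 4"],
    toy_classify,
))  # → [5.0, 0.0]
```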


Section 7 — Configure & Launch GRPO Training

vLLM Integration

When fast_inference=True, GRPOTrainer routes all rollout generation through vLLM's SamplingParams. This gives a significant speedup on H100:

| Method | 4 rollouts time (approx) |
| --- | --- |
| Standard HuggingFace generate | ~8 seconds |
| vLLM with PagedAttention | ~2 seconds |

We configure SamplingParams separately from GRPOConfig — vLLM uses its own sampling API.

Sequence Length Calculation

We compute max_completion_length dynamically from the measured max prompt length, ensuring the total sequence always fits within max_seq_length. This prevents silent truncation of reasoning chains, which would produce noisy gradients.
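The calculation itself is simple arithmetic. A sketch with an illustrative measured prompt length (the real value comes from tokenizing the filtered dataset):

```python
max_seq_length = 1024        # from the model config in Section 3
measured_max_prompt = 287    # illustrative; the notebook measures this

max_prompt_length = measured_max_prompt + 1             # +1 token of headroom
max_completion_length = max_seq_length - max_prompt_length

# The prompt budget plus the completion budget always fills max_seq_length,
# so no reasoning chain is silently truncated.
assert max_prompt_length + max_completion_length == max_seq_length
print(max_completion_length)  # → 736
```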

Training Duration

We run for 100 steps here (max_steps=100) — enough to see rewards start climbing and to verify the training loop works. For production quality, train for 1–3 full epochs (set num_train_epochs and remove max_steps).

ℹ️ Watch the reward and rewards/language columns in the training table. Expect reward to climb from ~0 to ~2+ over 100 steps, and the language reward to become increasingly positive as the model adopts Indonesian.


Section 8 — Inference & Language Compliance Evaluation

We run three comparison experiments:

  1. Without LoRA — baseline model behaviour (no system prompt)

  2. With LoRA, no system prompt — does LoRA affect general reasoning?

  3. With LoRA + system prompt — the intended deployment mode (Indonesian reasoning)

Then we run a batch evaluation on 20 samples to measure the language compliance rate quantitatively.

ℹ️ model.save_lora() is used here instead of model.save_pretrained() because we enabled fast_inference=True (vLLM mode). The vLLM-backed model uses the Unsloth save_lora API.

Batch Language Compliance Evaluation

A single example is anecdotal. We run over 20 randomly sampled problems and measure the percentage that produce Indonesian reasoning chains — with and without the LoRA.
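The metric itself reduces to a percentage. A sketch assuming the per-sample language codes have already been detected (e.g. by running langid over each `<think>` trace); the counts below are illustrative.

```python
# Compliance rate: fraction of reasoning traces detected as the target language.
def compliance_rate(detected_langs, target="id"):
    if not detected_langs:
        return 0.0
    hits = sum(1 for lang in detected_langs if lang == target)
    return 100.0 * hits / len(detected_langs)

detected = ["id"] * 17 + ["en"] * 3   # e.g. 17 of 20 samples in Indonesian
print(f"{compliance_rate(detected):.0f}%")  # → 85%
```

Running this once for the base model and once for the LoRA-augmented model gives the with/without comparison described above.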


Section 9 — Save & Export

Export Options

| Method | Size | Use Case |
| --- | --- | --- |
| LoRA adapters | ~200 MB | Share on HuggingFace, continue training, hot-swap with vLLM |
| Merged 16-bit | ~16 GB | Deploy with vLLM as a standalone model |
| Merged 4-bit | ~5 GB | Memory-constrained vLLM deployment |
| GGUF q4_k_m | ~5 GB | llama.cpp / Ollama local inference |
| GGUF q8_0 | ~9 GB | High-quality llama.cpp inference |


GGUF / llama.cpp Conversion

Saving to GGUF / llama.cpp is supported natively: Unsloth clones llama.cpp and saves to q8_0 by default, with all quantization methods (such as q4_k_m) available. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.

Some supported quant methods (full list on our docs page):

  • q8_0 - Fast conversion. High resource use, but generally acceptable.

  • q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.

  • q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

Then use the resulting deepseek_r1_finetune.Q8_0.gguf or deepseek_r1_finetune.Q4_K_M.gguf file in llama.cpp.
