Fine-tune a Reasoning Model to Think in Target Language with Unsloth
Reward signal breakdown (max total = 17.5 per step):
match_format_exactly +3.0 — <think>...</think> tags present and correct
match_format_approximately +1.0 — partial credit for tag structure
check_answer +5.0 — exact answer match (with ratio-based partial credit)
check_numbers +3.5 — float comparison with comma normalisation
language_consistency +5.0 — reasoning trace detected as Bahasa Indonesia

Section 1 — YottaLabs Platform
Recommended GPU Configuration
GPU | VRAM | Speed vs A100 | Recommended? | Note
Launch the VM
Section 2 — Environment Setup
Key Differences from Standard Unsloth Setup
Package | Version | Why pinned
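As a rough sketch of the setup step (the exact pinned versions belong in the table above and are deliberately not guessed here):

```shell
# Illustrative install only -- pin each package to the version listed in
# the table above before running on a real machine.
pip install --upgrade unsloth vllm trl datasets
```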
Section 3 — Load DeepSeek-R1-0528-Qwen3-8B
Why fast_inference = True?
LoRA Configuration
Parameter | Value | Why
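Put together, the load-plus-LoRA step might look like the sketch below. The model id, rank, sequence length, and target modules are assumptions drawn from typical Unsloth GRPO notebooks, not necessarily this tutorial's exact values:

```python
from unsloth import FastLanguageModel

# Sketch only -- substitute the values from the tables above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-0528-Qwen3-8B",
    max_seq_length=2048,
    load_in_4bit=True,
    fast_inference=True,   # spins up the in-process vLLM engine for fast rollouts
    max_lora_rank=32,
)

# Attach LoRA adapters; only these low-rank matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```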
Section 4 — Understanding the DeepSeek-R1 Chat Template
Token Discovery
System Prompt for Language Targeting
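A hypothetical version of such a system prompt (the tutorial's exact wording may differ); it pins both the reasoning language and the required tag structure:

```python
# Hypothetical system prompt: "You are a math assistant. Think step by step
# in Indonesian inside <think>...</think>, then give the final answer."
SYSTEM_PROMPT = (
    "Anda adalah asisten matematika. "
    "Pikirkan langkah demi langkah dalam Bahasa Indonesia di dalam "
    "<think>...</think>, lalu berikan jawaban akhir setelahnya."
)

def build_chat(question: str) -> list:
    # Messages in the shape expected by tokenizer.apply_chat_template.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```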
Section 5 — Prepare the DAPO-Math Dataset
Why Filter by Token Length?
Data Pipeline
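A minimal sketch of the token-length filter, with a pluggable `encode` callable so any tokenizer works (pass `tokenizer.encode` in the real pipeline); the 512-token cutoff is an assumption, not necessarily the tutorial's value:

```python
def keep_short_prompts(prompts, encode, max_prompt_tokens=512):
    # Drop prompts whose tokenized length exceeds the budget, so long
    # problems don't starve the completion of sequence room.
    return [p for p in prompts if len(encode(p)) <= max_prompt_tokens]
```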
Section 6 — Five Reward Functions
Reward Architecture Overview
Why Multiple Rewards?
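To make the shape of these rewards concrete, here is a minimal sketch of two of the five functions. The regex and the injected `detect` callable are assumptions (in practice `detect` could be `langdetect.detect`, which returns `"id"` for Indonesian):

```python
import re

# One well-formed reasoning block, non-greedy so multiple blocks count separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def match_format_exactly(completion: str) -> float:
    # +3.0 only when exactly one well-formed <think>...</think> block exists.
    return 3.0 if len(THINK_RE.findall(completion)) == 1 else 0.0

def language_consistency(completion: str, detect) -> float:
    # +5.0 when the reasoning trace is detected as Indonesian ("id").
    m = THINK_RE.search(completion)
    if m is None:
        return 0.0
    return 5.0 if detect(m.group(1)) == "id" else 0.0
```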
Section 7 — Configure & Launch GRPO Training
vLLM Integration
Method | Time for 4 rollouts (approx.)
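With vLLM serving the rollouts, the trainer config might look like this sketch using TRL's `GRPOConfig`; the lengths, step count, and learning rate are illustrative assumptions, not the tutorial's exact numbers:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    use_vllm=True,            # rollouts generated by the in-process vLLM engine
    num_generations=4,        # 4 rollouts per prompt, as in the timing above
    max_prompt_length=512,
    max_completion_length=1536,
    learning_rate=5e-6,
    max_steps=300,
    output_dir="outputs",
)
```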
Sequence Length Calculation
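The budget is additive: every prompt token eats into the room left for the completion. With illustrative numbers (not necessarily the tutorial's):

```python
# prompt tokens + completion tokens must fit inside max_seq_length
max_seq_length = 2048
max_prompt_length = 512
max_completion_length = max_seq_length - max_prompt_length
```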
Training Duration
Section 8 — Inference & Language Compliance Evaluation
Batch Language Compliance Evaluation
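One way to score a batch is the fraction of generations whose reasoning trace is detected as Indonesian. This is a sketch, not the tutorial's exact evaluator; `detect` is injected (e.g. `langdetect.detect` in a real run):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def compliance_rate(completions, detect) -> float:
    # Fraction of completions whose <think> trace is detected as "id".
    hits = 0
    for text in completions:
        m = THINK_RE.search(text)
        if m and detect(m.group(1)) == "id":
            hits += 1
    return hits / len(completions) if completions else 0.0
```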
Section 9 — Save & Export
Export Options
Method | Size | Use Case
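For the merged-weights path, a sketch along the lines of Unsloth's export API (the directory name is an assumption; `model` and `tokenizer` are the objects loaded in Section 3):

```python
# Merge the LoRA adapters into the base weights and save in FP16;
# pass save_method="lora" instead to export only the small adapter files.
model.save_pretrained_merged(
    "model_merged_16bit",
    tokenizer,
    save_method="merged_16bit",
)
```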
GGUF / llama.cpp Conversion
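In Unsloth the GGUF export is typically a single call; the output directory and quantization method below are assumptions (`model` and `tokenizer` come from Section 3):

```python
# q4_k_m is a common llama.cpp quantization; swap in another method
# (e.g. "q8_0") if you need higher fidelity at a larger file size.
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="q4_k_m")
```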