Reinforcement Learning with Miles
Prerequisites
Ideally, access to a YottaLabs RTX 5090 GPU.
Start a pod with our official template and open JupyterLab.
Run the cell below to see what we're working with:
# Let's see what awesome hardware we have!
!nvidia-smi
# Check Python version
import sys
print(f"\nPython version: {sys.version}")
# Check current working directory
import os
print(f"\nCurrent directory: {os.getcwd()}")
Step 1: Environment Setup
The original miles repository quickstart provides a build_conda.sh script, but we'll adapt it for our JupyterLab environment.
Let's grab the miles framework from GitHub. This is where all the magic happens.
Now let's install miles and its dependencies. This might take a few minutes.
Miles uses Megatron-LM as one of its training backends.
Prepare Models and Datasets
We'll be working with:
Model: GLM-Z1-9B (a 9 billion parameter language model)
Training Data: dapo-math-17k (17,000 math problems)
Evaluation Data: aime-2024 (American Invitational Mathematics Examination)
These downloads can take a while depending on your connection.
Let's organize our downloads neatly.
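A tidy directory layout makes the later steps easier to follow. The sketch below is one possible layout, not the one miles requires - the `~/workspace` base path and the folder names are assumptions you should adjust to your own pod.

```python
import os

# Assumed layout - adjust the base path and names to match your pod.
BASE = os.path.expanduser("~/workspace")
DIRS = [
    os.path.join(BASE, "models"),       # GLM-Z1-9B weights go here
    os.path.join(BASE, "datasets"),     # dapo-math-17k and aime-2024
    os.path.join(BASE, "checkpoints"),  # converted / trained weights
]

for d in DIRS:
    os.makedirs(d, exist_ok=True)  # idempotent: safe to re-run the cell
    print("ready:", d)
```

Because `exist_ok=True` makes the cell idempotent, you can re-run it freely without errors.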
We're using huggingface-cli to download the GLM-Z1-9B model. This is a fantastic 9B parameter model perfect for learning.

The dapo-math-17k dataset contains 17,000 math problems. Great for training our model to think mathematically.

The AIME-2024 dataset will help us evaluate how well our model is learning. Think of it as the final exam.

Model Weight Conversion
Miles uses Megatron for training, which requires a specific weight format. We need to convert our Hugging Face model to Megatron's torch_dist format.
Think of this as translating a book from one language to another - same content, different format.
First, we need to load the model's configuration. Miles provides ready-made configs for popular models.
Now let's do the actual conversion! This process reads the Hugging Face weights and reorganizes them into Megatron's format.
Understanding Training Parameters
Before we start training, let's take a moment to understand what's going on under the hood. Trust me, this will make everything click.
Miles training follows a reinforcement learning loop:
Let me break down the key parameters that control this loop:
Phase 1: Data Sampling (Rollout)
rollout-batch-size: How many prompts we sample each round. Example: 16 prompts
n-samples-per-prompt: How many responses we generate per prompt. Example: 8 responses per prompt
Total samples generated = 16 × 8 = 128 samples
Phase 2: Model Training
global-batch-size: Samples needed for one parameter update. Example: 128 samples
num-steps-per-rollout: How many updates to make with the current data. Example: 1 update (on-policy learning)
Total samples consumed = 128 × 1 = 128 samples
Golden Rule
Samples Generated = Samples Consumed
(rollout-batch-size × n-samples-per-prompt) = (global-batch-size × num-steps-per-rollout)
This ensures we use exactly the data we generate - no waste, no shortfall.
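The golden rule is easy to check in code. Here's a quick sanity check using the example numbers above - handy to run before launching a job with your own settings:

```python
# Sanity-check the golden rule with the example numbers above.
rollout_batch_size = 16
n_samples_per_prompt = 8
global_batch_size = 128
num_steps_per_rollout = 1

generated = rollout_batch_size * n_samples_per_prompt
consumed = global_batch_size * num_steps_per_rollout
print(f"generated={generated}, consumed={consumed}")
assert generated == consumed, "rollout and training batch settings are inconsistent"
```

If the assertion fires, adjust one of the four parameters until the two sides balance.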
Start Ray and Launch Training
Now for the exciting part - let's actually start training.
Miles uses Ray for distributed computing. We'll need to:
Start a Ray cluster
Submit our training job
Monitor progress
Ray is pretty awesome - it handles all the distributed computing magic for us. Let's start Ray.
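On a single node, a Ray cluster is typically started with the standard `ray start --head` CLI command. The sketch below wraps it so the cell fails gracefully if the CLI isn't on the path; the port number is an assumption (6379 is Ray's default GCS port) - change it if something else is listening there.

```python
import shutil
import subprocess

# 'ray start --head' boots a single-node Ray cluster.
# --port 6379 is Ray's default GCS port (an assumption - adjust if needed).
cmd = ["ray", "start", "--head", "--port", "6379"]

if shutil.which("ray"):
    subprocess.run(cmd, check=True)
else:
    print("ray CLI not found; would run:", " ".join(cmd))
```

Run `ray status` afterwards to confirm the cluster is up before submitting the training job.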
Let's build our training command. I'll show you each part so you understand what's happening!
This is it - the moment we've been building up to.
Training will run in the background. You can monitor progress in the next cell.
Let's keep an eye on how things are going! This cell will show you the latest updates from the training log.
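One stdlib-only way to monitor progress is to tail the training log from a notebook cell. The log path below is hypothetical - point `LOG_PATH` at wherever your launch command redirected its output.

```python
import os
from collections import deque

# Hypothetical log path - use wherever your launch command writes its output.
LOG_PATH = os.path.expanduser("~/workspace/train.log")

def tail(path, n=20):
    """Return the last n lines of a text file."""
    if not os.path.exists(path):
        return [f"(no log yet at {path})"]
    with open(path, "r", errors="replace") as f:
        return list(deque(f, maxlen=n))  # deque keeps only the last n lines

for line in tail(LOG_PATH):
    print(line.rstrip())
```

Re-run the cell whenever you want a fresh snapshot of the latest log lines.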
Understanding What's Happening
While training runs, let me explain what's happening behind the scenes.
The GRPO Algorithm
Miles uses GRPO (Group Relative Policy Optimization) by default. Here's what's happening in each iteration:
1. Sampling Phase
Model generates multiple responses for each prompt
Each response gets a reward score from the reward model
We collect both good and bad responses
2. Advantage Calculation
Compare each response's reward within its group
Responses better than the group average get positive advantages
Poor responses get negative advantages
3. Policy Update
Update the model to increase probability of high-advantage responses
Decrease probability of low-advantage responses
Use PPO-style clipping to prevent drastic changes
4. Evaluation
Every few rollouts, test on the AIME dataset
Track progress and save checkpoints
In the logs, you'll see these key metrics:
loss: Training loss (should decrease over time)
reward_mean: Average reward score (should increase)
kl_divergence: How much the model changes (we want controlled change)
eval_accuracy: Performance on the test set (the ultimate goal!)
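If you want to track these metrics programmatically, a small parser works well. The log line format below is hypothetical - adapt the pattern to what your run actually prints; only the metric names match those described above.

```python
import re

# Hypothetical log line format - adapt the regex to your actual log output.
sample = "step 42 | loss: 0.531 | reward_mean: 0.62 | kl_divergence: 0.014"

def parse_metrics(line):
    """Pull 'name: number' pairs out of a log line into a dict of floats."""
    return {k: float(v) for k, v in re.findall(r"(\w+):\s*([-\d.]+)", line)}

print(parse_metrics(sample))
```

Collecting these dicts across log lines gives you a simple history you can plot to watch loss fall and reward_mean rise.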