Reinforcement Learning with Miles

Prerequisites

Ideally, you'll have access to a YottaLabs RTX 5090 GPU.

Start a pod with our official template and open JupyterLab.


Run the cell below to see what we're working with:

# Let's see what awesome hardware we have! 🖥️
!nvidia-smi

# Check Python version
import sys
print(f"\n🐍 Python version: {sys.version}")

# Check current working directory
import os
print(f"\n📁 Current directory: {os.getcwd()}")

Step 1: Environment Setup

Note:

The original miles repository quickstart provides a build_conda.sh script, but we'll adapt it for our JupyterLab environment.

Let's grab the miles framework from GitHub. This is where all the magic happens.
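A typical clone step looks like the cell below. The repository URL is a placeholder — substitute the actual miles GitHub address:

```
# Clone the miles framework (replace the URL with the real repository address)
!git clone https://github.com/your-org/miles.git
%cd miles
```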

Now let's install miles and its dependencies. This might take a few minutes.
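An editable install is the usual approach for a cloned framework; the exact extras and pinned dependencies may differ, so compare against the repository's `build_conda.sh`:

```
# Install miles in editable mode along with its Python dependencies
!pip install -e .
```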

Miles uses Megatron-LM as one of its training backends.

Step 2: Prepare Models and Datasets

We'll be working with:

  • Model: GLM-Z1-9B (a 9-billion-parameter language model)

  • Training Data: dapo-math-17k (17,000 math problems)

  • Evaluation Data: aime-2024 (American Invitational Mathematics Examination)

These downloads can take a while depending on your connection.

Let's organize our downloads neatly.
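A simple layout keeps weights and data separated; the directory names below are just a suggestion:

```shell
# Keep model weights and datasets in separate folders
mkdir -p models datasets
```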

We're using huggingface-cli to download the GLM-Z1-9B model. This is a fantastic 9B parameter model perfect for learning.
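The download looks roughly like this; the Hugging Face repo id is assumed here, so verify it on the Hub before running:

```
# Download the GLM-Z1-9B weights (repo id is an assumption -- verify on the Hub)
!huggingface-cli download THUDM/GLM-Z1-9B-0414 --local-dir models/GLM-Z1-9B
```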

The dapo-math-17k dataset contains 17,000 math problems. Great for training our model to think mathematically.

The AIME-2024 dataset will help us evaluate how well our model is learning. Think of it as the final exam.
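Both datasets can be fetched the same way; the dataset repo ids below are assumptions, so check the Hub for the exact names:

```
# Training set: 17k math problems (dataset repo ids are assumptions -- verify)
!huggingface-cli download --repo-type dataset zhuzilin/dapo-math-17k --local-dir datasets/dapo-math-17k

# Evaluation set: AIME 2024 problems
!huggingface-cli download --repo-type dataset zhuzilin/aime-2024 --local-dir datasets/aime-2024
```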

Step 3: Model Weight Conversion

Miles uses Megatron for training, which requires a specific weight format. We need to convert our Hugging Face model to Megatron's torch_dist format.

Think of this as translating a book from one language to another - same content, different format.

First, we need to load the model's configuration. Miles provides ready-made configs for popular models.

Now let's do the actual conversion! This process reads the Hugging Face weights and reorganizes them into Megatron's format.
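The invocation below is a hypothetical sketch — the converter script name, location, and flags ship with miles, so check its tools directory and docs for the real command:

```
# Hypothetical converter invocation (script name and flags are assumptions)
!python tools/convert_hf_to_torch_dist.py \
    --hf-checkpoint models/GLM-Z1-9B \
    --save models/GLM-Z1-9B-torch-dist
```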

Step 4: Understanding Training Parameters

Before we start training, let's take a moment to understand what's going on under the hood. Trust me, this will make everything click.

Miles training follows a reinforcement learning loop:

Let me break down the key parameters that control this loop:

Phase 1: Data Sampling (Rollout)

  • rollout-batch-size: How many prompts we sample each round

    • Example: 16 prompts

  • n-samples-per-prompt: How many responses we generate per prompt

    • Example: 8 responses per prompt

  • Total samples generated = 16 Γ— 8 = 128 samples

Phase 2: Model Training

  • global-batch-size: Samples needed for one parameter update

    • Example: 128 samples

  • num-steps-per-rollout: How many updates to make with current data

    • Example: 1 update (on-policy learning)

  • Total samples consumed = 128 Γ— 1 = 128 samples

Note:

Golden Rule 🌟

Samples Generated = Samples Consumed

(rollout-batch-size Γ— n-samples-per-prompt) = (global-batch-size Γ— num-steps-per-rollout)

This ensures we use exactly the data we generate - no waste, no shortfall.
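The golden rule is cheap to verify in a cell before launching a run; the values below mirror the example above:

```python
# Sanity-check the golden rule before launching a run
rollout_batch_size = 16      # prompts sampled per round
n_samples_per_prompt = 8     # responses generated per prompt
global_batch_size = 128      # samples per parameter update
num_steps_per_rollout = 1    # updates per rollout (1 = on-policy)

generated = rollout_batch_size * n_samples_per_prompt
consumed = global_batch_size * num_steps_per_rollout
assert generated == consumed, f"mismatch: {generated} generated vs {consumed} consumed"
print(f"OK: {generated} samples generated and consumed per rollout")
```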

Step 5: Prepare Your Training Script

Alright, time to put it all together.

Step 6: Start Ray and Launch Training

Now for the exciting part - let's actually start training.

Miles uses Ray for distributed computing. We'll need to:

  1. Start a Ray cluster

  2. Submit our training job

  3. Monitor progress

Ray is pretty awesome - it handles all the distributed computing magic for us. ✨ Let's start Ray.
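On a single node, starting a head node is one command; the Ray dashboard then listens on port 8265 by default:

```
# Start a single-node Ray cluster (head node)
!ray start --head
```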

Let's build our training command. I'll show you each part so you understand what's happening!
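Here is a sketch of what the launch can look like. The entry-point script name is an assumption (check the miles repo for the real one); the flag values mirror the parameter example above, and `ray job submit` targets the dashboard started in the previous step:

```
# Sketch of a launch command (entry-point script name is an assumption)
!ray job submit --address http://127.0.0.1:8265 -- \
    python train.py \
    --rollout-batch-size 16 \
    --n-samples-per-prompt 8 \
    --global-batch-size 128 \
    --num-steps-per-rollout 1
```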

This is it - the moment we've been building up to.

Training will run in the background. You can monitor progress in the next cell.

Let's keep an eye on how things are going! This cell will show you the latest updates from the training log.
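A small helper makes it easy to re-run a cell and peek at the newest log lines. The log path is an assumption — point it at wherever your launch command redirected its output:

```python
from pathlib import Path

def tail(path, n=20):
    """Return the last n lines of a text file, or [] if it doesn't exist yet."""
    p = Path(path)
    if not p.exists():
        return []
    return p.read_text().splitlines()[-n:]

# Log path is an assumption -- adjust to your launch command's output file
for line in tail("train.log", n=20):
    print(line)
```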

Step 7: Understanding What's Happening

While training runs, let me explain what's happening behind the scenes.

The GRPO Algorithm

Miles uses GRPO (Group Relative Policy Optimization) by default. Here's what's happening in each iteration:

✅ Sampling Phase

  • Model generates multiple responses for each prompt

  • Each response gets a reward score from the reward model

  • We collect both good and bad responses

✅ Advantage Calculation

  • Compare each response's reward within its group

  • Responses better than the group average get positive advantages

  • Poor responses get negative advantages

✅ Policy Update

  • Update the model to increase probability of high-advantage responses

  • Decrease probability of low-advantage responses

  • Use PPO-style clipping to prevent drastic changes

✅ Evaluation

  • Every few rollouts, test on the AIME dataset

  • Track progress and save checkpoints
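The advantage-calculation step above can be sketched in a few lines. This is one common GRPO formulation (rewards standardized within each prompt's group), not necessarily miles's exact implementation:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each response's reward within
    its prompt group, so above-average responses get positive advantages
    and below-average responses get negative ones."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: identical rewards -> all-zero advantages
    return [(r - mean) / std for r in rewards]

# 4 responses to one prompt, scored 0/1 by the reward model
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)  # correct answers get +1.0, wrong answers -1.0
```

Note that advantages within a group always sum to zero: the model is pushed toward its better-than-average responses and away from its worse ones, which is exactly the "group relative" idea.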

In the logs, you'll see these key metrics:

  • loss: Training loss (should decrease over time)

  • reward_mean: Average reward score (should increase)

  • kl_divergence: How far the policy drifts from the reference model (should stay controlled)

  • eval_accuracy: Performance on test set (the ultimate goal!)
