GRPO on GSM8K with Virtual Machine
1. Connect to your YottaLabs instance
ssh [email protected] -i private_key.pem
6. Launch training
uv run --isolated --extra fsdp -m skyrl.train.entrypoints.main_base \
  data.train_data="['$HOME/data/gsm8k/train.parquet']" \
  data.val_data="['$HOME/data/gsm8k/validation.parquet']" \
  trainer.algorithm.advantage_estimator="grpo" \
  trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
  trainer.strategy=fsdp2 \
  trainer.placement.colocate_all=true \
  trainer.placement.policy_num_gpus_per_node=1 \
  trainer.eval_batch_size=1024 \
  trainer.eval_before_train=true \
  trainer.eval_interval=5 \
  trainer.ckpt_interval=10 \
  generator.inference_engine.backend=vllm \
  generator.inference_engine.num_engines=1 \
  generator.inference_engine.tensor_parallel_size=1 \
  generator.inference_engine.weight_sync_backend=nccl \
  environment.env_class=gsm8k
Data
data.train_data="['$HOME/data/gsm8k/train.parquet']"
data.val_data="['$HOME/data/gsm8k/validation.parquet']"

Algorithm
trainer.algorithm.advantage_estimator="grpo"

Base model
trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct"

Training strategy
trainer.strategy=fsdp2
trainer.placement.colocate_all=true
trainer.placement.policy_num_gpus_per_node=1

Evaluation
trainer.eval_batch_size=1024
trainer.eval_before_train=true
trainer.eval_interval=5
trainer.ckpt_interval=10

Inference engine
generator.inference_engine.backend=vllm
generator.inference_engine.num_engines=1
generator.inference_engine.tensor_parallel_size=1
generator.inference_engine.weight_sync_backend=nccl

Environment
environment.env_class=gsm8k

7. What a healthy startup looks like
Started a local Ray instance.
Infrastructure logs will be written to: /tmp/skyrl-logs/...
Synced registries to ray actor
InferenceEngineClient initialized with 1 engines.
Total steps: 7
Validation set size: 2
init policy/ref/critic models done
No checkpoint found, starting training from scratch
Started: 'eval'
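The startup numbers can be cross-checked with a little arithmetic. Assuming (this is not shown in the log) that GSM8K's train split holds 7,473 problems and that the generator samples 5 responses per prompt, as the reward/avg_pass_at_5 metric later suggests:

```python
# Cross-check "Total steps: 7" from the startup log.
# Assumptions not present in the log itself:
gsm8k_train_size = 7473     # problems in the GSM8K train split
samples_per_prompt = 5      # inferred from the reward/avg_pass_at_5 metric
batch_num_seq = 5120        # batch_num_seq reported in the training log

prompts_per_batch = batch_num_seq // samples_per_prompt  # 1024 prompts/batch
total_steps = gsm8k_train_size // prompts_per_batch      # 7 full batches
print(total_steps)  # → 7
```

The evaluation side works out similarly: GSM8K's 1,319 test problems with trainer.eval_batch_size=1024 is two evaluation batches, which is plausibly what "Validation set size: 2" is counting.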
8. What is happening during training?
(skyrl_entrypoint pid=8455) Input: [{'content': 'The selling price of a bicycle that had sold for $220 last year was increased by 15%. What is the new price? Let\'s think step by step and output the final answer after "####".', 'role': 'user'}]
(skyrl_entrypoint pid=8455) Output (Total Reward: 0.0000):
(skyrl_entrypoint pid=8455) To find the new price of the bicycle after a 15% increase from the original price of $220, we can follow these steps:
(skyrl_entrypoint pid=8455)
(skyrl_entrypoint pid=8455) 1. Calculate the amount of the increase by multiplying the original price by 15%. We will use 15% as 0.15 in decimal form, so we calculate:
(skyrl_entrypoint pid=8455) \[
(skyrl_entrypoint pid=8455) 220 \times 0.15
(skyrl_entrypoint pid=8455) \]
(skyrl_entrypoint pid=8455)
(skyrl_entrypoint pid=8455) 2. Perform the multiplication:
(skyrl_entrypoint pid=8455) \[
(skyrl_entrypoint pid=8455) 220 \times 0.15 = 33
(skyrl_entrypoint pid=8455) \]
(skyrl_entrypoint pid=8455)
(skyrl_entrypoint pid=8455) 3. Add the increase to the original price to find the new price:
(skyrl_entrypoint pid=8455) \[
(skyrl_entrypoint pid=8455) 220 + 33 = 253
(skyrl_entrypoint pid=8455) \]
(skyrl_entrypoint pid=8455)
(skyrl_entrypoint pid=8455) Therefore, the new price of the bicycle is **$253**.<|im_end|>
(skyrl_entrypoint pid=8455) 2026-03-26 14:44:59.431 | INFO | skyrl.train.trainer:train:261 - Started: 'convert_to_training_input'
(skyrl_entrypoint pid=8455) 2026-03-26 14:45:00.969 | INFO | skyrl.train.trainer:convert_to_training_input:665 - batch_num_seq: 5120, batch_padded_seq_len: 1172
(skyrl_entrypoint pid=8455) 2026-03-26 14:45:00.970 | INFO | skyrl.train.trainer:convert_to_training_input:683 - Number of sequences before padding: 5120
(skyrl_entrypoint pid=8455) 2026-03-26 14:45:00.970 | INFO | skyrl.train.trainer:convert_to_training_input:685 - Number of sequences after padding: 5120
(skyrl_entrypoint pid=8455) 2026-03-26 14:45:00.970 | INFO | skyrl.train.trainer:train:261 - Finished: 'convert_to_training_input', time cost: 1.54s
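batch_padded_seq_len: 1172 above means every one of the 5,120 variable-length sequences was padded to 1,172 tokens so the batch stacks into a single rectangular tensor. A toy sketch of that padding step (pad_batch and pad_id are illustrative names, not SkyRL's API):

```python
def pad_batch(sequences, pad_id=0):
    # Pad every token sequence to the length of the longest one, so the
    # batch can be stacked into a single (num_seq, padded_len) tensor.
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

batch = pad_batch([[1, 2, 3], [4, 5], [6]])
# → [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
```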
(skyrl_entrypoint pid=8455) 2026-03-26 14:45:00.970 | INFO | skyrl.train.trainer:train:265 - Started: 'fwd_logprobs_values_reward'
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.325 | INFO | skyrl.train.trainer:train:265 - Finished: 'fwd_logprobs_values_reward', time cost: 377.36s
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.326 | INFO | skyrl.train.trainer:train:274 - Started: 'compute_advantages_and_returns'
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.430 | INFO | skyrl.train.trainer:compute_advantages_and_returns:873 - avg_final_rewards: 0.13398437201976776, avg_response_length: 273.6802734375
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.431 | INFO | skyrl.train.trainer:train:274 - Finished: 'compute_advantages_and_returns', time cost: 0.11s
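The avg_final_rewards computed here feed GRPO's advantage estimator. GRPO uses no learned value model: each sampled response's advantage is its reward normalized against the other samples drawn for the same prompt. A minimal sketch (the function name is illustrative, and SkyRL's exact normalization may differ):

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-6):
    # Normalize each sampled response's reward by the mean and std of its
    # prompt group: correct samples get positive advantages, wrong ones negative.
    m = mean(group_rewards)
    s = pstdev(group_rewards)
    return [(r - m) / (s + eps) for r in group_rewards]

# One prompt, 5 samples, only the first one correct:
adv = grpo_advantages([1.0, 0.0, 0.0, 0.0, 0.0])
# ≈ [2.0, -0.5, -0.5, -0.5, -0.5]
```

A group where every sample gets the same reward produces near-zero advantages, which is why fully-solved or fully-failed prompts contribute little gradient signal under GRPO.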
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.431 | INFO | skyrl.train.trainer:train:291 - Started: 'train_critic_and_policy'
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.431 | INFO | skyrl.train.trainer:train_critic_and_policy:1121 - Started: 'policy_train'
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:44.917 | INFO | skyrl.train.trainer:train_critic_and_policy:1121 - Finished: 'policy_train', time cost: 866.49s
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:44.938 | INFO | skyrl.train.trainer:train:291 - Finished: 'train_critic_and_policy', time cost: 866.51s
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:44.938 | INFO | skyrl.train.trainer:train:316 - Started: 'sync_weights'
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:51.979 | INFO | skyrl.train.trainer:train:316 - Finished: 'sync_weights', time cost: 7.04s
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:51.980 | INFO | skyrl.train.trainer:train:210 - Finished: 'step', time cost: 1331.13s
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:51.980 | INFO | skyrl.train.trainer:train:320 - {'final_loss': -0.0001395885121230879, 'policy_loss': -0.0001400031556840986, 'policy_entropy': 0.36564826631365577, 'response_length': 1024.0, 'policy_lr': 9.999999974752427e-07, 'loss_metrics/clip_ratio': 0.000960409866502232, 'policy_kl': 0.00041451671746912667, 'grad_norm': 0.3702232241630554}
Training Batches Processed: 14%|█▍ | 1/7 [22:36<2:15:37, 1356.23s/it]
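The loss_metrics/clip_ratio and policy_kl entries in the metrics dict above come from a PPO-style clipped surrogate objective, which GRPO reuses for its policy update: clip_ratio is the fraction of tokens whose probability ratio got clipped, and policy_kl tracks drift from the old policy. A per-token sketch under standard PPO-clip assumptions (clipped_token_loss is an illustrative name; SkyRL's actual loss may add KL and other terms):

```python
import math

def clipped_token_loss(logp_new, logp_old, advantage, eps=0.2):
    # PPO-style clipped surrogate for one token: take the more pessimistic
    # of the unclipped and clipped objectives, negated to act as a loss.
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    was_clipped = not (1.0 - eps <= ratio <= 1.0 + eps)
    return -min(unclipped, clipped), was_clipped

loss, was_clipped = clipped_token_loss(-1.0, -1.1, advantage=2.0)
# ratio ≈ 1.105 lies inside [0.8, 1.2], so this token is not clipped
```

The tiny clip_ratio (≈0.001) and policy_kl (≈0.0004) logged above indicate the first update barely moved the policy, which is expected at a 1e-6 learning rate.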
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:52.046 | INFO | skyrl.train.trainer:train:210 - Started: 'step'
(skyrl_entrypoint pid=8455)
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:52.075 | INFO | skyrl.train.trainer:train:227 - Started: 'generate'
(skyrl_entrypoint pid=8455)   0%|          | 0/5120 [00:00<?, ?it/s]
(skyrl_entrypoint pid=8455)  10%|█         | 512/5120 [00:10<01:37, 47.35it/s]
(skyrl_entrypoint pid=8455)  20%|██        | 1024/5120 [00:22<01:32, 44.08it/s]
(skyrl_entrypoint pid=8455)  30%|███       | 1536/5120 [00:28<01:01, 58.46it/s]
(skyrl_entrypoint pid=8455)  40%|████      | 2048/5120 [00:37<00:53, 57.86it/s]
(skyrl_entrypoint pid=8455)  50%|█████     | 2581/5120 [00:42<00:36, 68.80it/s]
(skyrl_entrypoint pid=8455)  60%|██████    | 3093/5120 [00:51<00:31, 64.64it/s]
(skyrl_entrypoint pid=8455)  71%|███████   | 3615/5120 [00:56<00:20, 72.97it/s]
(skyrl_entrypoint pid=8455)  81%|████████  | 4127/5120 [01:05<00:14, 67.50it/s]
Generating Trajectories: 100%|██████████| 5120/5120 [01:18<00:00, 65.35it/s]
(skyrl_entrypoint pid=8455) 2026-03-26 15:07:10.550 | INFO | skyrl.train.trainer:train:227 - Finished: 'generate', time cost: 78.47s
(skyrl_entrypoint pid=8455) 2026-03-26 15:07:10.654 | INFO | skyrl.train.trainer:train:248 - Started: 'postprocess_generator_output'
(skyrl_entrypoint pid=8455) 2026-03-26 15:07:10.801 | INFO | skyrl.train.trainer:postprocess_generator_output:776 - reward/avg_pass_at_5: 0.6533203125, reward/avg_raw_reward: 0.2173828125, reward/mean_positive_reward: 0.2173828125
(skyrl_entrypoint pid=8455) 2026-03-26 15:07:10.801 | INFO | skyrl.train.trainer:train:248 - Finished: 'postprocess_generator_output', time cost: 0.15s
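The pass@5 and raw-reward numbers above come from a rule-based verifier. Note that the earlier sample computed 220 × 1.15 = 253 correctly yet scored Total Reward 0.0000: the prompt asks for the final answer after "####", and the model never emitted that marker. A minimal sketch of such a verifier (gsm8k_reward is an illustrative name, not SkyRL's actual implementation):

```python
import re

def gsm8k_reward(response, ground_truth):
    # Extract the number after '####' and compare it with the reference answer.
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", response)
    if m is None:
        return 0.0  # correct math without the '#### <answer>' marker earns nothing
    return 1.0 if m.group(1).replace(",", "") == ground_truth else 0.0

gsm8k_reward("the new price of the bicycle is **$253**.", "253")  # → 0.0
gsm8k_reward("220 * 1.15 = 253\n#### 253", "253")                 # → 1.0
```

Format-following is therefore part of what the policy must learn here, and early training steps often improve reward simply by adopting the "#### answer" convention.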
1. The model first generates an answer

Input: [{'content': 'The selling price of a bicycle that had sold for $220 last year was increased by 15%. What is the new price? ...', 'role': 'user'}]
Output: ... Therefore, the new price of the bicycle is **$253**.<|im_end|>

2. SkyRL converts the generated outputs into training tensors

Started: 'convert_to_training_input'
batch_num_seq: 5120, batch_padded_seq_len: 1172
Number of sequences before padding: 5120
Number of sequences after padding: 5120
Finished: 'convert_to_training_input', time cost: 1.54s

3. SkyRL computes logprobs, values, and rewards

Started: 'fwd_logprobs_values_reward'
Finished: 'fwd_logprobs_values_reward', time cost: 377.36s

4. SkyRL computes advantages and returns

Started: 'compute_advantages_and_returns'
avg_final_rewards: 0.13398437201976776, avg_response_length: 273.6802734375
Finished: 'compute_advantages_and_returns', time cost: 0.11s

5. SkyRL updates the policy

Started: 'train_critic_and_policy'
Started: 'policy_train'
Finished: 'policy_train', time cost: 866.49s
Finished: 'train_critic_and_policy', time cost: 866.51s

6. The updated weights are synced back to inference

Started: 'sync_weights'
Finished: 'sync_weights', time cost: 7.04s

7. One full RL step is now complete

Finished: 'step', time cost: 1331.13s
{'final_loss': -0.0001395885121230879,
 'policy_loss': -0.0001400031556840986,
 'policy_entropy': 0.36564826631365577,
 'response_length': 1024.0,
 'policy_lr': 9.999999974752427e-07,
 'loss_metrics/clip_ratio': 0.000960409866502232,
 'policy_kl': 0.00041451671746912667,
 'grad_norm': 0.3702232241630554}
Training Batches Processed: 14%|█▍ | 1/7

The reward metrics are the ones to watch. Between the first and second training batches they already move up:

reward/avg_pass_at_5: 0.4912109375 → 0.6533203125
reward/avg_raw_reward: 0.133984375 → 0.2173828125

Useful output locations
Checkpoints
/home/ray/ckpts/

Exported evaluation results

Infrastructure logs
Written under /tmp/skyrl-logs/ (the exact path is printed at startup).