# GRPO on GSM8K with SkyRL

This guide shows how to run a **working SkyRL training job on a YottaLabs virtual machine** using:

* **1 H200 GPU**
* **Docker**
* **SkyRL**
* **GSM8K**
* **Qwen/Qwen2.5-1.5B-Instruct**
* **GRPO + FSDP2**
* **vLLM backend**

***

{% stepper %}
{% step %}
#### Connect to your YottaLabs instance

Start a virtual machine on our platform. See our [Virtual Machine Guide](https://docs.yottalabs.ai/products/virtual-machines/launching-a-virtual-machine) for a step-by-step tutorial.

From your local machine, SSH into the YottaLabs host, for example:

```bash
ssh root@instance-2037157542617944064.yottadeos.com -i private_key.pem
```

Once connected, you should be on the remote machine as `root`.
{% endstep %}

{% step %}
#### Start the SkyRL Docker container

Run the official training container with GPU access enabled:

```bash
docker run -it \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16g \
  --name skyrl \
  novaskyai/skyrl-train-ray-2.51.1-py3.12-cu12.8 \
  /bin/bash
```

This container already includes the right CUDA-compatible environment for SkyRL training.

We also give Docker a larger shared memory segment with:

```bash
--shm-size=16g
```

Ray's object store and PyTorch worker processes both rely on `/dev/shm`, and Docker's default of 64 MB is usually too small for training workloads.

Once the container starts, you should land inside a shell that looks roughly like this:

```
(base) ray@...:~$
```
{% endstep %}

{% step %}
#### Clone the SkyRL repository

Inside the container, run:

```bash
git clone https://github.com/novasky-ai/SkyRL.git
cd SkyRL
```

At this point you should be in:

`/home/ray/SkyRL`

{% endstep %}

{% step %}
#### Prepare the GSM8K dataset

Generate the parquet files SkyRL expects for training and validation:

```bash
uv run --isolated examples/train/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
```

This command downloads and preprocesses GSM8K into parquet format.

When it finishes, verify that the dataset files exist:

```bash
ls -lh $HOME/data/gsm8k
```

You should see files such as:

```
train.parquet
validation.parquet
```

That confirms the data preparation step is done.
{% endstep %}

{% step %}
#### Export your W\&B API key

SkyRL logs this run to Weights & Biases, so set your API key before training:

```bash
export WANDB_API_KEY=your_wandb_api_key_here
```

Use your real key in place of the placeholder above.

If you want to confirm the variable exists without printing the key itself, you can run:

```bash
echo "$WANDB_API_KEY" | wc -c
```

Because `echo` appends a newline, an unset or empty variable prints `1`. A result greater than 1 means the variable is set.
{% endstep %}

{% step %}
#### Launch training

Here is the exact single-GPU training command. Use the **one-line version**, which is the least error-prone form to paste into a shell:

```bash
uv run --isolated --extra fsdp -m skyrl.train.entrypoints.main_base data.train_data="['$HOME/data/gsm8k/train.parquet']" data.val_data="['$HOME/data/gsm8k/validation.parquet']" trainer.algorithm.advantage_estimator="grpo" trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" trainer.strategy=fsdp2 trainer.placement.colocate_all=true trainer.placement.policy_num_gpus_per_node=1 trainer.eval_batch_size=1024 trainer.eval_before_train=true trainer.eval_interval=5 trainer.ckpt_interval=10 generator.inference_engine.backend=vllm generator.inference_engine.num_engines=1 generator.inference_engine.tensor_parallel_size=1 generator.inference_engine.weight_sync_backend=nccl environment.env_class=gsm8k
```

Let us briefly decode the important parts so the command is not just a magic spell.

#### Data

```
data.train_data="['$HOME/data/gsm8k/train.parquet']"
data.val_data="['$HOME/data/gsm8k/validation.parquet']"
```

These point SkyRL to the GSM8K parquet files you just generated.

#### Algorithm

```
trainer.algorithm.advantage_estimator="grpo"
```

This tells SkyRL to train with **GRPO**.

#### Base model

```
trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct"
```

This is the policy model being trained.

#### Training strategy

```
trainer.strategy=fsdp2
trainer.placement.colocate_all=true
trainer.placement.policy_num_gpus_per_node=1
```

This configures a **single-GPU FSDP2 run** and colocates all components on the same GPU, which is what a one-GPU machine needs.

#### Evaluation

```
trainer.eval_batch_size=1024
trainer.eval_before_train=true
trainer.eval_interval=5
trainer.ckpt_interval=10
```

This tells SkyRL to:

* use an evaluation batch size of 1024
* run evaluation before training starts
* evaluate every 5 steps
* save checkpoints every 10 steps

#### Inference engine

```
generator.inference_engine.backend=vllm
generator.inference_engine.num_engines=1
generator.inference_engine.tensor_parallel_size=1
generator.inference_engine.weight_sync_backend=nccl
```

This uses **vLLM** for rollout generation with a **single engine** and **tensor parallel size 1**, which matches a single-GPU setup. Updated weights are synced to the engine over NCCL.

#### Environment

```
environment.env_class=gsm8k
```

This tells SkyRL to use the GSM8K task environment for reward computation and evaluation.
{% endstep %}

{% step %}
#### What a healthy startup looks like

After launching training, you should see logs that indicate the system is booting correctly.

Here are the kinds of messages you want to see:

```
Started a local Ray instance.
Infrastructure logs will be written to: /tmp/skyrl-logs/...
Synced registries to ray actor
InferenceEngineClient initialized with 1 engines.
Total steps: 7
Validation set size: 2
init policy/ref/critic models done
No checkpoint found, starting training from scratch
Started: 'eval'
```

These lines tell you:

* Ray started
* SkyRL initialized its worker stack
* the inference engine came up
* the dataset loaded
* the trainer is entering evaluation and then training
{% endstep %}

{% step %}
#### What is happening during training?

One full **reinforcement learning training step** in SkyRL looks like:

```
(skyrl_entrypoint pid=8455) Input: [{'content': 'The selling price of a bicycle that had sold for $220 last year was increased by 15%. What is the new price? Let\'s think step by step and output the final answer after "####".', 'role': 'user'}]
(skyrl_entrypoint pid=8455) Output (Total Reward: 0.0000):
(skyrl_entrypoint pid=8455) To find the new price of the bicycle after a 15% increase from the original price of $220, we can follow these steps:
(skyrl_entrypoint pid=8455)
(skyrl_entrypoint pid=8455) 1. Calculate the amount of the increase by multiplying the original price by 15%. We will use 15% as 0.15 in decimal form, so we calculate:
(skyrl_entrypoint pid=8455) \[
(skyrl_entrypoint pid=8455) 220 \times 0.15
(skyrl_entrypoint pid=8455) \]
(skyrl_entrypoint pid=8455)
(skyrl_entrypoint pid=8455) 2. Perform the multiplication:
(skyrl_entrypoint pid=8455) \[
(skyrl_entrypoint pid=8455) 220 \times 0.15 = 33
(skyrl_entrypoint pid=8455) \]
(skyrl_entrypoint pid=8455)
(skyrl_entrypoint pid=8455) 3. Add the increase to the original price to find the new price:
(skyrl_entrypoint pid=8455) \[
(skyrl_entrypoint pid=8455) 220 + 33 = 253
(skyrl_entrypoint pid=8455) \]
(skyrl_entrypoint pid=8455)
(skyrl_entrypoint pid=8455) Therefore, the new price of the bicycle is **$253**.<|im_end|>
(skyrl_entrypoint pid=8455) 2026-03-26 14:44:59.431 | INFO | skyrl.train.trainer:train:261 - Started: 'convert_to_training_input'
(skyrl_entrypoint pid=8455) 2026-03-26 14:45:00.969 | INFO | skyrl.train.trainer:convert_to_training_input:665 - batch_num_seq: 5120, batch_padded_seq_len: 1172
(skyrl_entrypoint pid=8455) 2026-03-26 14:45:00.970 | INFO | skyrl.train.trainer:convert_to_training_input:683 - Number of sequences before padding: 5120
(skyrl_entrypoint pid=8455) 2026-03-26 14:45:00.970 | INFO | skyrl.train.trainer:convert_to_training_input:685 - Number of sequences after padding: 5120
(skyrl_entrypoint pid=8455) 2026-03-26 14:45:00.970 | INFO | skyrl.train.trainer:train:261 - Finished: 'convert_to_training_input', time cost: 1.54s
(skyrl_entrypoint pid=8455) 2026-03-26 14:45:00.970 | INFO | skyrl.train.trainer:train:265 - Started: 'fwd_logprobs_values_reward'
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.325 | INFO | skyrl.train.trainer:train:265 - Finished: 'fwd_logprobs_values_reward', time cost: 377.36s
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.326 | INFO | skyrl.train.trainer:train:274 - Started: 'compute_advantages_and_returns'
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.430 | INFO | skyrl.train.trainer:compute_advantages_and_returns:873 - avg_final_rewards: 0.13398437201976776, avg_response_length: 273.6802734375
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.431 | INFO | skyrl.train.trainer:train:274 - Finished: 'compute_advantages_and_returns', time cost: 0.11s
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.431 | INFO | skyrl.train.trainer:train:291 - Started: 'train_critic_and_policy'
(skyrl_entrypoint pid=8455) 2026-03-26 14:51:18.431 | INFO | skyrl.train.trainer:train_critic_and_policy:1121 - Started: 'policy_train'
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:44.917 | INFO | skyrl.train.trainer:train_critic_and_policy:1121 - Finished: 'policy_train', time cost: 866.49s
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:44.938 | INFO | skyrl.train.trainer:train:291 - Finished: 'train_critic_and_policy', time cost: 866.51s
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:44.938 | INFO | skyrl.train.trainer:train:316 - Started: 'sync_weights'
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:51.979 | INFO | skyrl.train.trainer:train:316 - Finished: 'sync_weights', time cost: 7.04s
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:51.980 | INFO | skyrl.train.trainer:train:210 - Finished: 'step', time cost: 1331.13s
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:51.980 | INFO | skyrl.train.trainer:train:320 - {'final_loss': -0.0001395885121230879, 'policy_loss': -0.0001400031556840986, 'policy_entropy': 0.36564826631365577, 'response_length': 1024.0, 'policy_lr': 9.999999974752427e-07, 'loss_metrics/clip_ratio': 0.000960409866502232, 'policy_kl': 0.00041451671746912667, 'grad_norm': 0.3702232241630554}
Training Batches Processed: 14%|█▍ | 1/7 [22:36<2:15:37, 1356.23s/it]
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:52.046 | INFO | skyrl.train.trainer:train:210 - Started: 'step'
(skyrl_entrypoint pid=8455)
(skyrl_entrypoint pid=8455) 2026-03-26 15:05:52.075 | INFO | skyrl.train.trainer:train:227 - Started: 'generate'
(skyrl_entrypoint pid=8455) %| | 0/5120 [00:00
```

**1. The policy generates rollouts**

The `Input:` and `Output` lines at the top of the log mean the current policy model has already done rollout generation for that prompt.

In other words, the model is not being trained on a fixed answer here. It is **producing its own answer first**, and that answer becomes part of the RL batch.

So in plain English:

* the model saw a math word problem
* it generated a chain-of-thought style response
* that generated response is now part of the training material for this step

Note that this sample shows `Total Reward: 0.0000` even though 253 is the correct result. A likely explanation (an assumption based on the prompt's instruction, not something the log states) is that the response never writes its final answer after `####`.

***

**2. SkyRL converts the generated outputs into training tensors**

Next you see:

```
Started: 'convert_to_training_input'
batch_num_seq: 5120, batch_padded_seq_len: 1172
Number of sequences before padding: 5120
Number of sequences after padding: 5120
Finished: 'convert_to_training_input', time cost: 1.54s
```

This stage takes all of the newly generated samples and prepares them for training.

What that means in practice:

* SkyRL has collected **5120 generated sequences**
* each sequence is a prompt plus a model-generated response
* they are padded to a common length, here **1172 tokens**
* then they are packed into tensors that the trainer can feed into the model

So this is the “data formatting” phase of the RL step.

A good mental model is:

> generation produces raw text, and `convert_to_training_input` turns that raw text into something the optimizer can use

***

**3. SkyRL computes logprobs, values, and rewards**

Then training enters one of the expensive stages:

```
Started: 'fwd_logprobs_values_reward'
Finished: 'fwd_logprobs_values_reward', time cost: 377.36s
```

This is where the trainer runs forward passes to compute the quantities needed for RL optimization.

That stage typically includes:

* **logprobs**: how likely the policy thought each generated token was
* **values**: the value function / critic estimate for the trajectory, when a critic is in use (this run's log shows only `policy_train`, with no separate critic update)
* **reward-related information**: the signals used to determine whether the output was good or bad

This phase took **377.36 seconds**, which tells you it is one of the heavy parts of the step.

That makes sense because the batch is large:

* **5120 sequences**
* padded to **1172 tokens**

So the trainer is doing a lot of token-level computation here.

***

**4. SkyRL computes advantages and returns**

Next:

```
Started: 'compute_advantages_and_returns'
avg_final_rewards: 0.13398437201976776, avg_response_length: 273.6802734375
Finished: 'compute_advantages_and_returns', time cost: 0.11s
```

This is the stage where RL training targets are prepared.
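
For intuition about what "preparing training targets" means under GRPO: each sampled answer is scored relative to the other answers sampled for the same prompt. The sketch below is not SkyRL's code. It assumes the commonly described formulation (reward minus the group mean, divided by the group standard deviation), uses the population standard deviation and a small epsilon as simplifying assumptions, and the helper name `group_advantages` and the reward values are made up for illustration.

```bash
# Hypothetical helper: normalize one prompt's rewards against its own group.
# Arguments: the rewards of all answers sampled for a single prompt.
group_advantages() {
  echo "$@" | awk '{
    n = NF; sum = 0; var = 0
    for (i = 1; i <= n; i++) sum += $i
    mean = sum / n
    for (i = 1; i <= n; i++) var += ($i - mean) ^ 2
    std = sqrt(var / n)   # population std (assumption)
    # Epsilon avoids division by zero when all rewards are equal.
    for (i = 1; i <= n; i++)
      printf "%s%.2f", (i > 1 ? " " : ""), ($i - mean) / (std + 1e-6)
    print ""
  }'
}

# Five sampled answers to one prompt; only the first was rewarded.
group_advantages 1 0 0 0 0
# prints: 2.00 -0.50 -0.50 -0.50 -0.50
```

Under this formulation, the one rewarded answer gets a positive advantage and the others a small negative one. A group in which every answer receives the same reward yields all-zero advantages and so contributes no learning signal.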

The important number in the `compute_advantages_and_returns` log is:

```
avg_final_rewards: 0.13398437201976776
```

That means the average reward across this batch was about **0.134**.

The other useful number is:

```
avg_response_length: 273.68
```

So even though sequences were padded to a much larger common length for batching, the average actual generated response was around **274 tokens**.

Conceptually, this step answers:

* which sampled outputs were better?
* by how much?
* what learning signal should the optimizer use?

This stage is fast because the heavy model forward work already happened in the previous phase.

***

**5. SkyRL updates the policy**

Then the trainer starts the actual model update:

```
Started: 'train_critic_and_policy'
Started: 'policy_train'
Finished: 'policy_train', time cost: 866.49s
Finished: 'train_critic_and_policy', time cost: 866.51s
```

This is the point where the policy is actually being trained.

The important thing to understand is:

* the model has already generated outputs
* those outputs have already been scored
* advantages have already been computed
* now the optimizer uses that information to **change the model weights**

This took **866.49 seconds**, which is even longer than the forward-statistics phase.

So in this run, the biggest chunk of time in the step is the actual policy update.

In human terms:

> this is where the model learns from the answers it just produced

***

**6. The updated weights are synced back to inference**

After the policy update, you see:

```
Started: 'sync_weights'
Finished: 'sync_weights', time cost: 7.04s
```

SkyRL uses a training stack and an inference stack together. After the policy changes, the inference engine needs the newest weights so the next rollout uses the updated model instead of the old one.

That is what this sync step is doing.

It is short compared with the training phases, but it is important because it closes the loop between:

* training the model
* using the updated model to generate the next batch

***

**7. One full RL step is now complete**

Then SkyRL reports:

```
Finished: 'step', time cost: 1331.13s
```

So one complete RL step took:

* about **1331 seconds**
* roughly **22.2 minutes**

That full step included:

* generating rollouts for the batch
* converting generated outputs into training input
* computing logprobs / values / rewards
* computing advantages and returns
* training the policy
* syncing the updated weights

The five timed phases shown in the log add up to about 1253 seconds. The remaining ~79 seconds fall before the excerpt begins, which is presumably where rollout generation ran.

Immediately after that, SkyRL prints training metrics:

```
{'final_loss': -0.0001395885121230879, 'policy_loss': -0.0001400031556840986, 'policy_entropy': 0.36564826631365577, 'response_length': 1024.0, 'policy_lr': 9.999999974752427e-07, 'loss_metrics/clip_ratio': 0.000960409866502232, 'policy_kl': 0.00041451671746912667, 'grad_norm': 0.3702232241630554}
```

These are the summary numbers for that training update.

And then the progress bar confirms the trainer has completed **1 out of 7** batches:

```
Training Batches Processed: 14%|█▍ | 1/7
```

So at this point:

* the first RL training step is done
* the model has been updated once
* training is moving on to step 2

**On the next step, the run showed higher reward metrics:**

```
reward/avg_pass_at_5: 0.6533203125
reward/avg_raw_reward: 0.2173828125
```

Compared to the earlier step:

```
reward/avg_pass_at_5: 0.4912109375
reward/avg_raw_reward: 0.133984375
```

That is a useful sign that training is not just running; it is moving in a promising direction.
{% endstep %}
{% endstepper %}

***

## Useful output locations

During the run, SkyRL wrote outputs to a few standard locations.

### Checkpoints

```bash
/home/ray/ckpts/
```

### Exported evaluation results

```bash
/home/ray/exports/dumped_evals/
```

For example:

```bash
/home/ray/exports/dumped_evals/global_step_0_evals/openai_gsm8k.jsonl
/home/ray/exports/dumped_evals/global_step_0_evals/aggregated_results.jsonl
```

### Infrastructure logs

```bash
/tmp/skyrl-logs/
```

These are useful if you want to inspect trainer or Ray-side behavior later.
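
To check all three locations from a shell inside the container, something like the following can serve as a minimal sketch. The helper name `report_dir` is made up for this example, and the paths are simply the ones listed above. A directory that has not been written yet is reported as missing rather than treated as an error.

```bash
# Hypothetical helper: list a directory if it exists, otherwise say so.
report_dir() {
  if [ -d "$1" ]; then
    echo "found: $1"
    ls -lh "$1"
  else
    echo "missing: $1"
  fi
}

# Output locations named in this guide.
for dir in /home/ray/ckpts/ /home/ray/exports/dumped_evals/ /tmp/skyrl-logs/; do
  report_dir "$dir"
done
```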