Notes on RL Questions

Jun 7, 2026

35 questions covering LLM RL algorithms and infrastructure. Compiled as a personal reference — answers are deliberately brief and can be extended much further with follow-ups.

Sources: arXiv preprints, official GitHub repos, Lilian Weng’s blog. Based on Xiuyu Li (@sheriyuo).


Algorithm (Q1–Q19)

Q1. Why use Actor-Critic instead of a pure Critic approach?

Pure critic (value-only) methods like DQN choose actions by maximizing Q(s, a) over actions, which requires a discrete (or discretized) action space or an expensive inner optimization at every step. Actor-Critic separates the policy (actor) from value estimation (critic): the actor parameterizes the policy directly, so it scales to continuous and very large action spaces, while the critic supplies a baseline that lowers the variance of the policy-gradient estimate.

Q2. What is the relationship between KL divergence, cross entropy, and MLE?

KL(P||Q) = H(P,Q) − H(P). Minimizing cross entropy H(P,Q) over Q (as in MLE on data distribution P) is equivalent to minimizing KL(P||Q), because H(P) does not depend on Q. In RL, the KL from the reference policy acts as a regularizer: on samples drawn from π_θ it is estimated by the log-ratio log π_θ − log π_ref, so it amounts to a log-ratio penalty.
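The identity can be checked numerically. The sketch below uses plain Python and two made-up distributions over three tokens (the values of `p` and `q` are assumptions chosen only for illustration).

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three tokens.
p = [0.7, 0.2, 0.1]  # data distribution P
q = [0.5, 0.3, 0.2]  # model distribution Q

# KL(P||Q) and H(P,Q) - H(P) should agree up to floating-point error.
identity_gap = abs(kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p)))
print(identity_gap < 1e-12)
```

Because H(P) is fixed by the data, any change to `q` that lowers the cross entropy lowers the KL by the same amount.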

Q3. How should rewards be designed in different RL scenarios?

Q4. How do importance sampling, rejection sampling, and other Monte Carlo methods fit into RL?

Q5. How is advantage computed in PPO and GRPO? Why subtract a baseline? Is std normalization necessary?

Q6. How do RL training and test-time scaling perform exploration differently?

Q7. How does PPO clipping work? Why take the minimum? What happens without clipping? How does CISPO differ?

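PPO clips the IS ratio r = π_θ/π_old: L = min(r·A, clip(r, 1−ε, 1+ε)·A). Taking the min makes the objective a pessimistic lower bound on the unclipped surrogate. Once r leaves [1−ε, 1+ε] in the direction that would increase the objective (r > 1+ε with A > 0, or r < 1−ε with A < 0), the clipped term is selected and the gradient is zero. When r has moved in the direction that hurts the objective, the unclipped term is kept, so the update can still correct it.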
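A minimal per-sample sketch of the clipped term in plain Python may make the four cases concrete. The function names and the default ε = 0.2 are assumptions for illustration; real implementations operate on batched tensors of log-probabilities.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def has_gradient(ratio, advantage, eps=0.2):
    """True when the unclipped branch is selected, so the term still depends on the ratio."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return ratio * advantage <= clipped_ratio * advantage

# (ratio, advantage) pairs covering the four out-of-range cases.
cases = [(1.5, 1.0), (1.5, -1.0), (0.5, 1.0), (0.5, -1.0)]
for r, a in cases:
    print(r, a, ppo_clip_term(r, a), has_gradient(r, a))
```

The two cases that lose their gradient are r = 1.5 with A > 0 and r = 0.5 with A < 0: the policy has already moved far enough in the helpful direction. The other two keep the unclipped term, which is what lets the update undo a move in the wrong direction.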

Q8. Why does GRPO include a KL penalty? How is KL computed? Why do DAPO and GSPO remove it?

Q9. During LLM training, what happens if loss is accidentally All Reduced multiple times?

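All Reduce combines a tensor across ranks with a sum or a mean. Repeating a mean reduction is harmless, since every rank already holds the same value, but each extra sum reduction multiplies the value by world_size, and each extra division by world_size (for example, normalizing the loss by hand when the framework already averages gradients) shrinks it by that factor. The gradient scale is then off by a power of world_size, which acts like a mis-set learning rate: training can silently diverge, stall, or converge to a suboptimal solution. Hard to detect without gradient norm logging.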
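Which way the scale drifts depends on the reduction op and on where the normalization happens. The toy simulation below models an all-reduce over a plain Python list, one entry per rank; `all_reduce` here is a hypothetical stand-in, not the `torch.distributed` API, and the per-rank losses are made up.

```python
def all_reduce(values, op="sum"):
    """Simulate an all-reduce: every rank ends up holding the same reduced value."""
    total = sum(values)
    reduced = total if op == "sum" else total / len(values)
    return [reduced] * len(values)

world_size = 4
per_rank_loss = [1.0, 2.0, 3.0, 2.0]  # hypothetical local losses, one per rank

once_sum = all_reduce(per_rank_loss, "sum")    # [8.0, 8.0, 8.0, 8.0]
twice_sum = all_reduce(once_sum, "sum")        # [32.0, ...]: scaled by world_size again
once_mean = all_reduce(per_rank_loss, "mean")  # [2.0, 2.0, 2.0, 2.0]
twice_mean = all_reduce(once_mean, "mean")     # unchanged: mean of identical values

# An extra manual normalization on top of the mean shrinks the value by world_size.
double_normalized = [v / world_size for v in once_mean]  # [0.5, 0.5, 0.5, 0.5]

print(twice_sum[0] / once_sum[0], twice_mean[0] / once_mean[0], double_normalized[0])
```

Since gradients are linear in the loss scale, the same factors carry over to the gradient, which is why logging the gradient norm per step is a cheap way to catch this class of bug.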

Q10. What is the reward function in DPO? Can reward hacking occur? How can it be mitigated?

Q11. What methods address train-inference mismatch in MoE models, and how do they work?

Q12. How should group size, learning rate, PPO epochs, and generation length be selected during RL training?

Q13. Compared with GRPO, how do Dr.GRPO, DAPO, GSPO, CISPO, SAPO, DPPO, MaxRL, and SimKO improve training?

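| Method | Key Improvement | Limitation |
| --- | --- | --- |
| Dr.GRPO | Removes the response-length and std normalization terms that bias GRPO | Minor gain over GRPO |
| DAPO | No KL, clip-higher, token-level loss, dynamic sampling | Careful entropy tuning needed |
| GSPO | Sequence-level (length-normalized) importance ratio with sequence-level clipping | All tokens in a response share one ratio, so clipping is all-or-nothing per sequence |
| CISPO | Clips the IS weight itself (with stop-gradient) instead of dropping clipped tokens, so every token still contributes gradient | Requires careful choice of the IS-weight clip bounds |
| SAPO | Replaces hard clipping with a smooth, temperature-controlled gate on the ratio; better stability | Extra hyperparameters |
| DPPO | Distributed PPO with async rollout | Staleness risk |
| MaxRL | Reward maximization with explicit diversity | Experimental |
| SimKO | Simplified KL objective, no reference model | Weaker regularization |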

Q14. How do TRPO, DPPO, and AReaL enforce trust-region constraints?

Q15. Can RL fundamentally expand the capability frontier of LLMs?

Debated. RL can unlock latent capabilities (e.g., step-by-step reasoning in R1) and improve reliability on verifiable tasks. However, it likely cannot introduce genuinely new knowledge — it redistributes probability mass over already-learned behaviors. Capability frontier expansion requires the base model to have learned the skill implicitly during pre-training.

Q16. Based on ProRL, how should we think about scaling RL training?

ProRL shows that prolonged RL training (>1000 steps) with curriculum and diverse task mixing continues to improve — contradicting early saturation beliefs. Key findings: task diversity prevents forgetting, reward signal quality matters more than quantity, and entropy maintenance is critical for sustained improvement.

Q17. What improvements does OPD introduce over traditional RL and SFT?

OPD (On-Policy Distillation) has the student generate its own rollouts and trains it to match a stronger teacher's per-token distribution on those rollouts, typically by minimizing a reverse KL. The dense per-token signal avoids the cold-start and sparse-reward problems of pure RL, and training on on-policy data overcomes SFT’s distribution shift. Applications: math reasoning, code generation with execution feedback.

Q18. At which stage of training does reasoning ability emerge in LLMs?

Evidence suggests reasoning emerges during pre-training at scale (chain-of-thought is latent in the base model). RL (e.g., R1-Zero) can surface and amplify it without SFT cold-start. However, the reliability and length of reasoning chains are shaped by post-training.


Infrastructure (Q20–Q35)

Q20. Ignoring CPU offload, how many model copies exist in memory during GRPO training?

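At minimum: a frozen reference model (π_ref) + the actor (π_θ) + the actor's gradients and optimizer states (Adam keeps two fp32 moments per parameter, and mixed-precision training usually adds an fp32 master copy of the weights). GRPO has no value network, so there are ~2 model copies (actor + reference); with optimizer states, the total is ≈ 3–4× parameter memory in practice. If rollouts run in a separate inference engine, that engine holds one more copy of the actor weights. With a critic: add another copy.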
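A back-of-the-envelope calculation makes the accounting concrete. The byte counts below are assumptions about one common mixed-precision setup (bf16 weights and gradients, fp32 master weights and Adam moments), not a description of any particular framework; activations, KV cache, and sharding are ignored, and `grpo_memory_gb` is a hypothetical helper.

```python
def grpo_memory_gb(num_params, separate_rollout_engine=False):
    """Rough static memory (GB) for GRPO training, by component."""
    bytes_bf16, bytes_fp32 = 2, 4
    bytes_per_param = {
        "actor_weights_bf16": bytes_bf16,
        "actor_grads_bf16": bytes_bf16,
        "fp32_master_weights": bytes_fp32,
        "adam_first_moment_fp32": bytes_fp32,
        "adam_second_moment_fp32": bytes_fp32,
        "reference_weights_bf16": bytes_bf16,
    }
    if separate_rollout_engine:
        # Assumed: the inference engine keeps its own bf16 copy of the actor.
        bytes_per_param["rollout_engine_weights_bf16"] = bytes_bf16
    return {name: b * num_params / 1e9 for name, b in bytes_per_param.items()}

breakdown = grpo_memory_gb(7e9)  # hypothetical 7B-parameter model
total_gb = sum(breakdown.values())
total_with_engine_gb = sum(grpo_memory_gb(7e9, separate_rollout_engine=True).values())
print(total_gb, total_with_engine_gb)
```

Under these assumptions a 7B model needs about 126 GB of static state (18 bytes per parameter), or about 140 GB with a separate rollout engine, before any activation or KV-cache memory.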

Q21. Distributed inference: KV cache transfer optimization and multi-GPU communication strategies.

Q22. INT8 versus FP8. What are the tradeoffs? Which precisions are preferred for training and inference?

Q23. What is the long-tail problem in RL rollouts, and how can it be addressed?

Some prompts generate much longer sequences than others, causing GPU idle time while waiting for the longest sequence in a batch. Solutions:

Q24. What issues does continuous batching introduce in RL training? How do vLLM and SGLang differ?

Continuous batching mixes sequences at different stages — problematic for RL because you need complete trajectories before computing rewards. Solutions: track per-request state, flush complete sequences.
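The "track per-request state, flush complete sequences" idea can be sketched as a small buffer keyed by request id. `TrajectoryBuffer` and the event tuples are hypothetical names for illustration; they do not correspond to vLLM or SGLang APIs.

```python
class TrajectoryBuffer:
    """Accumulates streamed token chunks per request and releases only complete trajectories."""

    def __init__(self):
        self._partial = {}  # request_id -> tokens generated so far

    def add_chunk(self, request_id, tokens, finished):
        """Append a chunk; return the full trajectory once the request finishes, else None."""
        self._partial.setdefault(request_id, []).extend(tokens)
        if finished:
            return self._partial.pop(request_id)
        return None

    def in_flight(self):
        """Request ids that are still generating."""
        return sorted(self._partial)

buffer = TrajectoryBuffer()
# Interleaved chunks, as a continuous-batching engine might emit them.
events = [
    ("req-a", [11, 12], False),
    ("req-b", [21], False),
    ("req-a", [13], True),
    ("req-b", [22, 23], False),
]
completed = []
for request_id, tokens, finished in events:
    trajectory = buffer.add_chunk(request_id, tokens, finished)
    if trajectory is not None:
        completed.append((request_id, trajectory))  # safe to score with the reward function

print(completed, buffer.in_flight())
```

Only `req-a` reaches the reward stage here; `req-b` stays buffered until its final chunk arrives.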

Q25. How do you measure utilization in vLLM and SGLang? How do you evaluate KV cache utilization during training?

Q26. How is backpropagation implemented in large-scale multi-node RL training?

Q27. What asynchronous RL frameworks exist, and what synchronization bottlenecks do they solve?

Q28. In AReaL or other partial-rollout frameworks, are KV caches from previous policies preserved?

No, in general. When policy weights update, the KV cache computed under the old weights is stale: reusing it would mix keys and values from two different policies, so the outputs would match neither. AReaL interrupts in-flight generation on a weight update, discards the KV cache computed under the old weights, and recomputes it with the new weights before decoding continues. Some systems use speculative decoding-style checks, but recomputation is the safe default.

Q29. How does Expert Parallelism affect MoE throughput?

Expert Parallelism (EP) shards experts across GPUs, so each GPU holds a subset of the experts. A token routed to expert i must be sent to the GPU holding that expert via All-to-All communication, and its output must be returned by a second All-to-All. The added latency is roughly proportional to the bytes exchanged (routed tokens × hidden size × bytes per element) divided by interconnect bandwidth. High EP degree → lower memory per GPU but higher communication overhead. Optimal EP degree balances compute vs. network saturation.
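To see where the All-to-All traffic comes from, the sketch below counts how many tokens each GPU receives and how many have to cross GPUs. It assumes top-1 routing, contiguous expert placement, and a number of experts divisible by the EP degree; `ep_dispatch_stats` and the routing data are hypothetical.

```python
from collections import Counter

def ep_dispatch_stats(expert_ids, source_gpus, num_experts, ep_degree):
    """Return (tokens received per GPU, number of tokens that cross GPUs)."""
    experts_per_gpu = num_experts // ep_degree
    received = Counter()
    remote = 0
    for expert, src in zip(expert_ids, source_gpus):
        dst = expert // experts_per_gpu  # GPU that owns this expert
        received[dst] += 1
        if dst != src:
            remote += 1  # this token must travel in the dispatch All-to-All
    return dict(received), remote

# Eight tokens, two per source GPU, routed to one of eight experts.
expert_ids = [0, 1, 2, 5, 7, 7, 3, 6]
source_gpus = [0, 0, 1, 1, 2, 2, 3, 3]

received, remote = ep_dispatch_stats(expert_ids, source_gpus, num_experts=8, ep_degree=4)
print(received, remote)
```

In this example half of the tokens are remote and GPU 3 receives three tokens while GPU 2 receives one, which shows both costs of EP: network traffic and load imbalance across experts.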

Q30. In long-context training, how should compute-communication overlap be designed? How do Megatron and FSDP differ in parallelism strategies?

Q31. How do you enable deterministic execution? What is batch invariance? What causes it? Is atomic add involved?

Q32. How do AReaL and slime differ in their understanding of the RL rollout bottleneck?

Q33. How should we think about staleness in fully asynchronous RL training? What are typical values in practice?

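Staleness = number of policy updates between when a rollout was generated and when it’s used for training. High staleness → the IS ratio π_θ/π_old drifts far from 1 → either most tokens are clipped and carry no learning signal, or, with loose clipping, updates become high-variance. Typical practice: tag each sample with the policy version that generated it, then discard or reweight samples beyond a threshold (e.g., 2–4 updates stale). AReaL caps staleness with a maximum-staleness hyperparameter enforced when rollouts are admitted, and pairs it with a decoupled PPO objective that separates the behavior policy from the proximal policy. In practice, staleness of 1–3 steps is generally acceptable for LLM RL workloads.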
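Per-sample staleness tracking can be as simple as storing the policy version alongside each rollout. The sketch below is a hypothetical filter, not the mechanism of any specific framework; `Rollout`, `split_by_staleness`, and the threshold of 3 are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt_id: int
    policy_version: int  # version of the weights that generated this sample

def split_by_staleness(rollouts, current_version, max_staleness):
    """Partition rollouts into those fresh enough to train on and those to drop or reweight."""
    fresh, stale = [], []
    for rollout in rollouts:
        staleness = current_version - rollout.policy_version
        (fresh if staleness <= max_staleness else stale).append(rollout)
    return fresh, stale

# Hypothetical buffer: the trainer is at version 10.
rollout_buffer = [Rollout(i, v) for i, v in enumerate([10, 9, 7, 6, 2])]
fresh, stale = split_by_staleness(rollout_buffer, current_version=10, max_staleness=3)
print([r.prompt_id for r in fresh], [r.prompt_id for r in stale])
```

Samples generated at versions 10, 9, and 7 pass the filter; those from versions 6 and 2 exceed the threshold. Whether to discard or reweight the stale set is a design choice that depends on how expensive rollouts are.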

Q34. How does data flow through slime? How is it integrated with Megatron? How is the loss computed?

Q35. If you had to choose among VeRL, TRL, Unsloth, AReaL, and slime, which one would you use and why?

| Framework | Best For |
| --- | --- |
| TRL | Rapid prototyping, small models, research experiments |
| Unsloth | Single-GPU fine-tuning, memory efficiency (QLoRA), low-budget |
| VeRL | Production LLM RL at scale, Ray-based, good vLLM integration |
| AReaL | Async RL research, studying staleness, academic scale |
| slime | Megatron-native shops, largest-scale training, MoE models |

Recommendation: VeRL for most industry use cases (mature, well-documented, Ray ecosystem). slime if you’re training >100B parameter MoE models with Megatron already in your stack.