Real-World RL for Manipulation with HIL-SERL on LeRobot SO-101

At the Seeed Studio LeRobot hackathon in Mountain View, our team implemented HIL-SERL (Human-in-the-Loop Sample-Efficient Robotic Reinforcement Learning) on the SO-101 robotic arm from Hugging Face’s LeRobot platform.

Motivation

We initially trained and deployed Vision-Language-Action (VLA) models for simple pick tasks, but found that they overfit to exact object positions and had to be retrained whenever the workspace changed. This motivated us to explore reinforcement learning as a route to more generalizable policies.

Approach

HIL-SERL combines two key ideas:

Soft Actor-Critic (SAC): An off-policy, maximum-entropy RL algorithm that learns a stochastic policy
RLPD (RL with Prior Data): Bootstrapping RL with expert demonstrations instead of training from scratch
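
For reference, the maximum-entropy objective that SAC optimizes can be written as below. This is the standard textbook form rather than anything specific to our setup; the symbols (policy \pi, reward r, temperature \alpha, entropy \mathcal{H}, state-action distribution \rho_\pi) are the usual notation and do not appear elsewhere in this write-up.

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

The temperature \alpha trades off reward against policy entropy: a larger value encourages more exploration.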

On top of these, the algorithm adds a human-in-the-loop correction procedure: an operator can take over during training, which helps the policy experience reward more frequently.
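
The data flow described above can be sketched roughly as follows. This is a minimal illustration, not the LeRobot implementation: the names `Transition`, `DualBuffer`, and `select_action` are hypothetical, and the 50/50 batch split and the routing of intervention data into both buffers are assumptions based on our reading of the RLPD and HIL-SERL papers.

```python
import random
from typing import List, NamedTuple, Optional, Sequence, Tuple


class Transition(NamedTuple):
    obs: Sequence[float]
    action: Sequence[float]
    reward: float
    next_obs: Sequence[float]
    done: bool
    is_intervention: bool  # True if the human supplied the action


class DualBuffer:
    """Hypothetical two-buffer store: prior (demo) data and online data."""

    def __init__(self, seed: int = 0) -> None:
        self.prior: List[Transition] = []
        self.online: List[Transition] = []
        self.rng = random.Random(seed)

    def add_demo(self, t: Transition) -> None:
        # Offline expert demonstrations go only into the prior buffer.
        self.prior.append(t)

    def add_online(self, t: Transition) -> None:
        # Every online step goes into the online buffer; human corrections
        # are additionally treated as prior data (assumption, see lead-in).
        self.online.append(t)
        if t.is_intervention:
            self.prior.append(t)

    def sample(self, batch_size: int) -> List[Transition]:
        # Assumed RLPD-style symmetric sampling: half prior, half online.
        if not self.prior or not self.online:
            raise ValueError("both buffers need at least one transition")
        half = batch_size // 2
        prior_part = self.rng.choices(self.prior, k=half)
        online_part = self.rng.choices(self.online, k=batch_size - half)
        return prior_part + online_part


def select_action(
    policy_action: Sequence[float],
    human_action: Optional[Sequence[float]] = None,
) -> Tuple[Sequence[float], bool]:
    """Return the action to execute and whether it was an intervention."""
    if human_action is not None:
        return human_action, True
    return policy_action, False
```

In a training loop, each environment step would call `select_action`, execute the result, and pass the resulting transition to `add_online`, while the learner draws batches with `sample`.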

Experiments

We ran a progression of experiments:

Sim validation: Ported the HIL-SERL demo from MuJoCo Franka to SO-101 in Isaac Sim, achieving successful cube picking after ~1 hour of training
Sim-to-real attempt: Trained for 16 hours in Isaac Sim and attempted direct transfer to hardware. This failed because of the large sim-to-real gap (visual differences, camera mounting, shadows)
Fine-tuning from sim checkpoint: Resumed training on real hardware from the sim-trained checkpoint, achieving successful picks after ~4 hours of additional real-world training
Training from scratch on hardware: With tighter workspace bounds, achieved a working pick policy in approximately 40 minutes of real-world training

Key Findings

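HIL-SERL can learn a manipulation policy on hardware from scratch in roughly 40 minutes when the workspace bounds are appropriately constrained
The sim-to-real gap for vision-based policies remains significant: direct transfer failed, and it is unclear whether sim pretraining provided meaningful benefit. Fine-tuning from the sim checkpoint took ~4 hours, versus ~40 minutes from scratch, but the from-scratch run also used tighter workspace bounds, so the two are not directly comparable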
RL data collection via interventions is less tedious than collecting full demonstration trajectories for imitation learning
Strict workspace bounds around the task area dramatically improved sample efficiency by reducing irrelevant exploration
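
As an illustration of what a workspace bound can look like in code, here is a minimal sketch that clamps an end-effector target to an axis-aligned box before it is sent to the arm. The function name `clip_to_workspace` and the bound values are hypothetical, chosen for illustration only; they are not the limits we used on the SO-101.

```python
from typing import Sequence, Tuple

# Hypothetical box around the task area, in meters (x, y, z).
LOWER = (0.15, -0.10, 0.02)
UPPER = (0.30, 0.10, 0.15)


def clip_to_workspace(
    target_xyz: Sequence[float],
    lower: Sequence[float] = LOWER,
    upper: Sequence[float] = UPPER,
) -> Tuple[float, ...]:
    """Clamp each coordinate of the target into [lower, upper]."""
    return tuple(
        min(max(value, lo), hi)
        for value, lo, hi in zip(target_xyz, lower, upper)
    )


# A target outside the box is pulled back onto its nearest face.
clipped = clip_to_workspace((0.5, 0.0, -0.1))
```

Because every commanded position stays inside the box, the policy cannot spend exploration steps far from the object, which is the effect the finding above describes.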

Team

Bruce Kim, Siddharth Saha, Cyan Ding, Indraneel Patil