Physical Modelling of sim2real SO101 Arm Project

Good afternoon everyone. This post is about the sim-to-real step of NVIDIA’s recent SO101 Sim 2 Real Course. It is cross-posted from NVIDIA’s forums.

Current Setup

  • Isaac Sim Version: 5.1.0
  • OS: Ubuntu 24.04 (arm64)
  • GPU Setup: Dell Pro Max GB10 - Blackwell Architecture - CUDA v13.0, Driver 580.142
  • Lerobot Tag: e670ac5daf9b76 (Just around v0.4.3)

Topic Description

Is it possible to roll out the example pretrained VLA policies in the real world if we copy the original workspace?

Detailed Description

I am currently following the Train an SO-101 Robot From Sim-to-Real With NVIDIA Isaac Learning Tutorial. I was able to set up the docker container without problems and to roll out the various example policies provided in the tutorial in Isaac Lab simulation with high (70-90%) success rates. The issue lies with the Simulation to Real portion, particularly the Real Evaluation section. I built the lightbox and mounted the light, external camera, and SO101 robot arm as described. I would assume the same conditions carry over between the provided example policies, the original real setup, and a new setup, yet I have not once been able to place a vial into the rack over roughly 150-200 episodes, despite varying the environment many times.

Within the “Real Evaluation” section there is no mention of a success rate, no video footage of the physical arm completing the task, and no evaluation dataset to visualize on Hugging Face. I looked at one of the datasets in the Dataset Visualizer from LeRobot and noticed that the arm (in sim) was not positioned at the base of the lightbox, but at the base of the mat.

I understand that VLAs like GR00T N1.6 require finetuning for a variety of tasks, especially when deployed in general scenarios, and I don’t expect high success rates (I am aiming for around 30-40%). But I am just recreating the original environment, which was quite isolated from the rest of the world (hence the lightbox), so the policies should be able to run on my setup as well.

Steps to Reproduce

  1. Make sure the real docker container from the tutorial has been built

  2. Start the real docker container, select a model, run the policy server
    2.1) export MODEL=aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left/checkpoint-1000
    2.2) python Isaac-GR00T/gr00t/eval/run_gr00t_server.py --model-path /workspace/models/$MODEL

  3. Run the evaluation rollout: attach to the running container and run
    3.1) docker exec -it real-robot /bin/bash
    3.2)
    ```
    python Isaac-GR00T/gr00t/eval/real_robot/SO100/so101_eval.py \
      --robot.type=so101_follower \
      --robot.port="$ROBOT_PORT" \
      --robot.id="$ROBOT_ID" \
      --robot.cameras="{
        wrist: {type: opencv, index_or_path: $CAMERA_GRIPPER, width: 640, height: 480, fps: 30},
        front: {type: opencv, index_or_path: $CAMERA_EXTERNAL, width: 640, height: 480, fps: 30}
      }" \
      --policy_host=localhost \
      --policy_port=5555 \
      --lang_instruction="Pick up the vial and place it in the yellow rack" \
      --rerun True
    ```

  4. Let it run and watch it not succeed.

Error Messages

No errors, just poor success rate (0%)

Screenshots or Videos

Additional photos are in the Google Drive link. Sorry for the lack of “film production”, but I hope this gives some insight into my environment. It is a bit unfortunate that, as a new user, I can only attach 2 photos/videos.

IMG_8647

Additional Information

What I’ve Tried

  • I double-checked against the “setting up the workspace” section, and all of my initial measurements match the original authors’.
  • I also recalibrated both arms, verified that they work in simulation, and checked the calibration: the standard deviations of my joint positions are of the same order of magnitude as those shown in the image.
  • Tried to recreate the environment based on a given dataset
  • Tried multiple orientations of the external camera to limit noise
  • Tried multiple poses (position + orientation) of the rack relative to the arm
  • Tried different numbers of vials, along with different vial poses
  • Tried multiple light intensities (25% - 100%) with the provided light bar

Related Issues

N/A

Additional Context

The authors did a fantastic job on the tutorial, which certainly provides more information than many other tutorials, and I am glad to see a sim2real pipeline with VLAs. It is just a bit unfortunate that the real evaluation portion is lacking.

Also, when I tried again recently, I was able to get the real arm to consistently and autonomously pick up a vial by positioning the vial central to the arm with random orientations, but below the top of the rack (imagine a transversal line drawn at the top of the rack, close to the back wall, running across the width of the lightbox; the cap would be below that line).

Edit:

I began changing the physical setup to match the simulation environment (without domain randomization), based on the relative positions of objects, and got it to work with 80% success over 10 episodes! It was very particular though, which is a bit concerning for a generalist policy like GR00T.
Conditions: 50% light, 1 vial in top left of container, 1 vial on mat, top left of mat 4.5 cm from back, top right of mat 2.5 cm from back. Rack: 18 cm from left wall, 17 cm from back wall, at exactly 90°. Vial: 18.5 cm right of the rack, 2 cm from the transversal line at the top of the rack, also at exactly 90°.

Model: so101_telop_vials_rack_left/checkpoint-100005

Personally, given the topic area, I would probably also suggest that you ask in the LeRobot Discord, since people there may have more hands-on SO-101 / camera / calibration experience. But before doing that, I think it may help to organize the report a bit so that others can reproduce or diagnose it more easily:


I have not run your exact setup myself, so please treat this as a practical debugging / reproducibility checklist rather than a diagnosis. But based on your description, I would not reduce this to only “the physical modeling is wrong” yet.

A more useful way to frame it may be:

Is this mainly a workspace / visual distribution issue, a camera / calibration issue, an expected limitation of the sim-only checkpoint, or a lower-level actuation / backlash / physical-modeling issue?

Your update that success improves when the physical rack / vial / mat placement is made closer to the sim / dataset-like setup is especially informative. If that observation is reproducible, it suggests that the policy may be quite sensitive to the reference workspace geometry and camera observations.

1. First clarify the exact checkpoint

The NVIDIA Real Evaluation docs appear to use this sim-only checkpoint for real evaluation:

export MODEL=aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left/checkpoint-10000

So I would explicitly confirm whether the run used exactly checkpoint-10000, or whether it used another checkpoint such as checkpoint-1000, checkpoint-100005, a local checkpoint, or a later training artifact.

This matters because otherwise it is hard to compare:

  • sim evaluation,
  • real evaluation,
  • the documented tutorial,
  • and other users’ results.

Useful information to add:

Model repo:
<model repo>

Checkpoint actually loaded by the GR00T server:
<checkpoint path>

Server log showing the loaded model:
<server log excerpt>

Was this exactly the documented checkpoint-10000?
<yes/no/unclear>

If the checkpoint differs from the documented one, the result may still be useful, but it becomes a different comparison.
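
As a rough sketch of that comparison, the snippet below extracts the step number from a model path and compares it with the documented checkpoint-10000. The function names are hypothetical, and it only inspects the path string you pass in; it does not confirm what the GR00T server actually loaded, so the server log is still the better evidence.

```python
import re

# Step number of the checkpoint quoted from the Real Evaluation docs above.
DOCUMENTED_STEP = 10000


def checkpoint_step(model_path):
    """Return N from a trailing 'checkpoint-N' path component, or None."""
    match = re.search(r"checkpoint-(\d+)/?$", model_path.strip())
    return int(match.group(1)) if match else None


def describe_checkpoint(model_path):
    """Fill in the 'yes/no/unclear' line of the template above."""
    step = checkpoint_step(model_path)
    if step is None:
        return "unclear: no checkpoint-N suffix found"
    if step == DOCUMENTED_STEP:
        return "yes: matches documented checkpoint-10000"
    return f"no: checkpoint-{step} (documented: checkpoint-{DOCUMENTED_STEP})"


if __name__ == "__main__":
    repo = "aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left"
    for suffix in ("checkpoint-1000", "checkpoint-10000", "checkpoint-100005"):
        path = f"{repo}/{suffix}"
        print(path, "->", describe_checkpoint(path))
```

The three suffixes in the loop are the ones that appear in this thread, which is why it is worth stating the loaded one explicitly.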

2. Confirm camera assignment with actual frames

The same Real Evaluation docs mention using Rerun so that you can inspect joint actions and camera feeds while the policy runs. The NVIDIA troubleshooting guide also calls out camera index changes, wrong camera feed assignment, and camera positioning as possible causes of deployment problems.

So I would add concrete camera evidence, not just “the cameras are detected.”

For example:

lerobot-find-cameras opencv

Then include:

Camera detection output:
<output>

CAMERA_GRIPPER:
<index>

CAMERA_EXTERNAL:
<index>

One frame from wrist camera:
<link or image>

One frame from front camera:
<link or image>

Rerun screenshot while policy is running:
<link or image>

Things that would be worth checking:

  • Are front and wrist definitely not swapped?
  • Is the wrist camera image oriented as expected?
  • Is the external camera seeing roughly the same workspace composition as the dataset visualizer?
  • Are the OpenCV camera indices stable after unplug / replug?
  • Is the camera physically fixed and not vibrating?
  • Are focus, exposure, brightness, and white balance stable?
  • Are the camera views 640x480 as expected by the tutorial command?
  • Does Rerun show reasonable joint actions and camera feeds during the rollout?

A camera mismatch can easily produce a situation where the policy runs without a runtime error but fails behaviorally.
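
One way to collect the single-frame evidence is a small script like the one below. It is only a sketch: it assumes the OpenCV Python package (`cv2`) is available in the container, and `save_frame`, `size_matches`, and the `camera_check` output directory are names I made up for illustration. The 640x480 size comes from the tutorial command quoted in the original post.

```python
from pathlib import Path

# (width, height) requested for both cameras in the evaluation command.
EXPECTED_SIZE = (640, 480)


def size_matches(shape, expected=EXPECTED_SIZE):
    """Check an image array shape (height, width, ...) against (width, height)."""
    height, width = shape[0], shape[1]
    return (width, height) == expected


def save_frame(index_or_path, name, out_dir="camera_check"):
    """Grab one frame from a camera, save it as PNG, and report the size check."""
    import cv2  # imported here so size_matches() works without OpenCV installed

    cap = cv2.VideoCapture(index_or_path)
    try:
        # Request the tutorial resolution; the driver may silently ignore this.
        cap.set(cv2.CAP_PROP_FRAME_WIDTH, EXPECTED_SIZE[0])
        cap.set(cv2.CAP_PROP_FRAME_HEIGHT, EXPECTED_SIZE[1])
        ok, frame = cap.read()
        if not ok:
            raise RuntimeError(f"could not read a frame from {index_or_path!r}")
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        path = Path(out_dir) / f"{name}.png"
        cv2.imwrite(str(path), frame)
        return path, size_matches(frame.shape)
    finally:
        cap.release()
```

Calling `save_frame(<value of CAMERA_GRIPPER>, "wrist")` and `save_frame(<value of CAMERA_EXTERNAL>, "front")` and then opening the two PNGs is a quick way to see whether the feeds are swapped. Run it while the policy client is not holding the cameras, since many drivers allow only one reader at a time.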

3. Quantify the workspace geometry

Your observation that real success improves after matching the physical layout more closely seems important. I would make this part quantitative.

Instead of only describing the setup in photos, it would help to provide robot-base-relative or mat-relative measurements.

For example:

| Item | Measurement |
|---|---|
| Robot base center → mat corner | <x mm, y mm> |
| Robot base center → rack center | <x mm, y mm> |
| Rack yaw | |
| Robot base center → vial initial position | <x mm, y mm> |
| Vial yaw | |
| External camera position | <x, y, z, yaw, pitch, roll if known> |
| Wrist camera mount | <photo / approximate angle> |
| Light brightness / color temperature | |
| Camera exposure / focus | |
| Mat / rack / vial dimensions | |

The practical question is:

How narrow is the successful region in workspace coordinates?

If success only appears when the rack / vial / mat layout is very close to the dataset visualizer or simulated reference setup, then the dominant issue may be workspace / visual distribution sensitivity rather than just physical dynamics.

I would phrase that cautiously:

If the reported improvement after matching rack / vial / mat placement is reproducible, that suggests the policy may be quite sensitive to the reference workspace geometry and camera observations. I would not conclude “bad physical modeling” until checkpoint identity, camera assignment, camera pose, and workspace geometry are ruled out.
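
To make "how narrow is the successful region" measurable, one option is to log each layout as positions in a shared frame and compute how far every item is from the reference. The sketch below assumes planar (x, y) positions in millimetres relative to the robot base center; the function names, the coordinates, and the 10 mm tolerance are hypothetical placeholders, not values from the tutorial.

```python
import math


def planar_offsets(measured, reference):
    """Distance between measured and reference (x, y) for each shared item."""
    offsets = {}
    for name, (ref_x, ref_y) in reference.items():
        if name not in measured:
            continue  # item not measured in this trial
        meas_x, meas_y = measured[name]
        offsets[name] = math.hypot(meas_x - ref_x, meas_y - ref_y)
    return offsets


def items_outside_tolerance(measured, reference, tolerance):
    """Names of items whose offset exceeds the tolerance, sorted by name."""
    offsets = planar_offsets(measured, reference)
    return sorted(name for name, dist in offsets.items() if dist > tolerance)


if __name__ == "__main__":
    # Hypothetical numbers, in mm, purely to show the bookkeeping.
    reference = {"rack": (180.0, 170.0), "vial": (365.0, 150.0)}
    measured = {"rack": (183.0, 174.0), "vial": (365.0, 180.0)}
    print(planar_offsets(measured, reference))
    print(items_outside_tolerance(measured, reference, tolerance=10.0))
```

Recording these offsets next to each batch of episodes would show whether success falls off sharply beyond a certain distance from the reference layout.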

4. Add a failure-mode table

“0% success” is useful, but it is hard to debug without knowing how the failures look. A failure-mode table would help others reason about the cause.

For example:

| Failure mode | Count | Possible interpretation |
|---|---|---|
| Does not move toward vial | | Camera / language / action interface issue |
| Reaches near vial but laterally offset | | Camera pose / workspace geometry / calibration issue |
| Grasps but drops vial | | Gripper / friction / timing issue |
| Grasps vial but misses rack | | Rack pose / precision / actuation issue |
| Always misses by the same offset | | Calibration / camera positioning / kinematic offset |
| Random-looking failures | | Visual instability / distribution shift |
| Stuttering / jerky execution | | Action execution / latency / chunking issue |
| Succeeds only after geometry tuning | | Narrow workspace / visual distribution |

This would be especially useful if you can attach a few short clips or frame sequences for representative failures.
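
Filling in the Count column can be as simple as writing one label per episode and tallying them. This is a minimal sketch; the labels in the example are hypothetical shorthand for the rows above, and `tally_failures` is just an illustrative name.

```python
from collections import Counter


def tally_failures(episode_labels):
    """Return (label, count, fraction) rows, most frequent label first."""
    counts = Counter(episode_labels)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(label, count, count / total) for label, count in counts.most_common()]


def format_rows(rows):
    """Render the rows as plain text lines for pasting into a post."""
    return "\n".join(f"{label}: {count} ({frac:.0%})" for label, count, frac in rows)


if __name__ == "__main__":
    # One hand-written label per episode; these values are made up.
    labels = [
        "lateral offset near vial",
        "lateral offset near vial",
        "grasps but misses rack",
        "lateral offset near vial",
        "does not move toward vial",
        "grasps but misses rack",
    ]
    print(format_rows(tally_failures(labels)))
```

Even ten labelled episodes per configuration would make it much clearer whether one failure mode dominates.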

5. Compare the available policy variants

The NVIDIA Datasets and Models page lists several relevant model variants:

  • sim-only model: aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left
  • sim+real model: aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left_sim_and_real
  • Cosmos-augmented models:
    • aravindhs-NV/sreetz-so101_teleop_vials_rack_left_augment_02
    • aravindhs-NV/sreetz-so101_teleop_vials_rack_left_augment_10

A useful diagnostic experiment would be to run the same physical setup with multiple checkpoints:

| Policy | Same physical geometry? | Result |
|---|---|---|
| sim-only checkpoint | yes | <success / trials> |
| sim-only checkpoint, reference-like geometry | yes | <success / trials> |
| sim+real checkpoint | yes | <success / trials> |
| Cosmos 7 checkpoint | yes | <success / trials> |
| Cosmos 70 checkpoint | yes | <success / trials> |

Possible interpretations:

  • Only reference-like geometry works
    → likely narrow workspace / initial-condition / visual distribution sensitivity.

  • Cosmos improves performance
    → likely visual variation, lighting, texture, or object-position variation matters.

  • sim+real improves performance
    → real-world grounding is important; sim-only may simply be too weak for robust zero-shot transfer.

  • All policies miss by the same spatial offset
    → calibration, camera pose, or kinematic offset becomes more likely.

  • All policies stutter or pause
    → action execution, latency, or chunk-boundary behavior may need attention.

  • All policies fail despite correct cameras and geometry
    → then deeper actuation / calibration / physical-model mismatch becomes more plausible.

This comparison would be more useful than only asking whether the sim-only policy “should work.”
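
When comparing checkpoints over a handful of episodes, it helps to attach an uncertainty range to each success rate so that small differences are not over-read. The sketch below uses the Wilson score interval for a binomial proportion; that is my choice of method rather than something the tutorial prescribes, and the default `z=1.96` corresponds to roughly 95% confidence.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval (low, high) for a success rate successes/trials."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z
        * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
        / denom
    )
    # Clamp against tiny floating-point excursions outside [0, 1].
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(8, 10)
    print(f"8/10 successes: {low:.2f} to {high:.2f}")
```

For the 8 out of 10 reported in the edit above, the interval runs from roughly 49% to 94%, so two checkpoints evaluated on 10 episodes each would need a large gap before the difference means much.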

6. Clarify the expected real-world baseline

One documentation question seems worth asking directly:

What real-world success rate should users expect from the documented sim-only checkpoint under the reference SO-101 setup?

The Real Evaluation docs explain how to run the real robot and inspect Rerun, but it would be useful to know whether near-zero real success is expected for the sim-only checkpoint outside a narrow reference setup, or whether it indicates a setup problem.

Related documentation questions:

  1. Is checkpoint-10000 the canonical checkpoint for real evaluation?
  2. Was the real-evaluation video / baseline success rate measured internally?
  3. Can the authors share a short successful real rollout video?
  4. Can the authors share reference front/wrist camera frames?
  5. Can the authors share approximate reference measurements for robot base, mat, rack, vial initial position, and external camera?
  6. Is the sim-only checkpoint intended mainly as a baseline before sim+real co-training / Cosmos augmentation?
  7. Are camera exposure, focus, and white balance expected to be fixed?

There is also a small dataset/model-count clarification that may be worth asking. The Datasets and Models page describes the sim+real dataset as 75 sim-only demonstrations + 5 real-world demonstrations, while the model table describes the sim+real model as 75 sim + 50 real. It would be useful to clarify which number is correct, because the amount of real data strongly affects the expected real-world behavior.

7. Co-training may be the intended next step

The Co-Training section describes combining simulation data with real-world demonstrations, including small real datasets such as 5 real episodes.

So I would not assume that the sim-only checkpoint is expected to be robust zero-shot across copied physical setups. It may be better interpreted as a baseline for observing the sim-to-real gap before trying:

  • sim+real co-training,
  • Cosmos augmentation,
  • or actuation-gap modeling.

A useful question for the maintainers would be:

Is the expected workflow that users first observe the sim-to-real gap with the sim-only checkpoint, and then move to sim+real / Cosmos / SAGE? Or should the sim-only checkpoint already achieve a meaningful real-world success rate in the reference workspace?

8. Cosmos comparison can test visual / workspace-distribution sensitivity

The Cosmos augmentation section discusses augmenting data with visual variations such as lighting, object position, textures, and environmental changes.

That makes the Cosmos checkpoints useful as a diagnostic tool here.

Suggested comparison:

Same camera placement:
<yes/no>

Same rack / vial / mat placement:
<yes/no>

sim-only success:
<n>/<N>

Cosmos-7 success:
<n>/<N>

Cosmos-70 success:
<n>/<N>

Failure-mode differences:
<short notes>

If Cosmos helps, the issue is probably not only actuator physics. It would suggest that visual / workspace distribution is a major factor.

If Cosmos does not help, but sim+real helps, then real-world grounding may be more important than synthetic visual variation.

If neither helps and the failure is spatially consistent, calibration / camera pose / actuation gap becomes more likely.

9. Actuation / backlash is still possible, but I would check it later

The SAGE + GapONet section discusses sim-to-real actuation gaps and notes that SO-101 hobby servos can introduce backlash that accumulates through the kinematic chain.

So actuation gap is definitely a real possibility.

But I would check it after:

  1. exact checkpoint,
  2. camera assignment,
  3. camera pose / focus / exposure,
  4. reference-like workspace geometry,
  5. failure-mode consistency,
  6. sim-only vs sim+real vs Cosmos behavior.

If the robot always misses by the same spatial offset even when the camera views and workspace geometry are correct, then calibration / camera pose / actuation gap becomes much more likely.
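
"Always misses by the same offset" can be checked numerically if you record a miss vector per failed episode, for example the planar distance from the gripper to the vial at the moment of the attempted grasp. The sketch below compares the mean miss (bias) with the scatter around that mean. The function names and the factor of 2 are arbitrary assumptions for illustration, not a calibrated threshold.

```python
import math
import statistics


def offset_summary(misses):
    """Summarize (dx, dy) miss vectors: mean vector, bias length, and spread."""
    if len(misses) < 2:
        raise ValueError("need at least two miss vectors")
    xs = [dx for dx, _ in misses]
    ys = [dy for _, dy in misses]
    mean_dx = statistics.fmean(xs)
    mean_dy = statistics.fmean(ys)
    bias = math.hypot(mean_dx, mean_dy)
    # Root of summed per-axis population variances: scatter around the mean.
    spread = math.sqrt(statistics.pvariance(xs) + statistics.pvariance(ys))
    return {"mean": (mean_dx, mean_dy), "bias": bias, "spread": spread}


def looks_systematic(misses, ratio=2.0):
    """True when the mean miss is large compared with the scatter."""
    summary = offset_summary(misses)
    return summary["bias"] > ratio * summary["spread"]


if __name__ == "__main__":
    # Hypothetical miss vectors in mm.
    consistent = [(10.0, 0.0), (11.0, 1.0), (9.0, -1.0), (10.0, 0.0)]
    scattered = [(10.0, 0.0), (-10.0, 0.0), (0.0, 10.0), (0.0, -10.0)]
    print(looks_systematic(consistent), looks_systematic(scattered))
```

A large bias with small spread points toward calibration, camera pose, or a kinematic offset, while a small bias with large spread fits visual instability or distribution shift better.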

10. Related reports, but not necessarily the same root cause

There are a few related community reports that may be worth reading. I would not assume they have the same root cause, but they show that SO-101 / GR00T / LeRobot real deployment can be sensitive to grasping, calibration, camera setup, and execution details.

Again, these are not proof that this issue has the same cause. They are just useful context.

11. If you also ask in LeRobot Discord

For a LeRobot Discord follow-up, I would make the question short and evidence-heavy. The goal would not be to repost the whole thread, but to ask whether other SO-101 users can compare their working setup against yours.

Something like this might be easier for people to answer:

Has anyone reproduced the NVIDIA SO-101 sim-to-real real evaluation with the provided GR00T sim-only checkpoint?

I am trying to distinguish between:
- camera feed / camera assignment issues,
- workspace geometry or initial-condition distribution shift,
- SO-101 calibration / backlash,
- and the expected limitation of the sim-only checkpoint.

The interesting observation is that real success improves when the rack / vial / mat positions are manually matched more closely to the sim / dataset-like layout.

Could anyone who has run this share:
1. exact checkpoint used,
2. real success rate,
3. front/wrist camera frames,
4. external camera pose,
5. rack/mat/robot-base measurements,
6. whether sim+real or Cosmos checkpoints worked better?

The most useful attachments would probably be:

  • one front-camera frame,
  • one wrist-camera frame,
  • one Rerun screenshot,
  • one top-down setup photo,
  • a small table of workspace measurements,
  • success / failure counts,
  • and a short failure-mode table.

12. Compact information block to add to this thread

If possible, I would add a compact block like this to the original post:

Exact checkpoint:
<checkpoint path>

GR00T / workshop repo commit:
<commit>

LeRobot version / commit:
<version or commit>

Docker image / tag:
<tag>

Sim eval:
<n>/<N> success

Real eval before geometry adjustment:
<n>/<N> success

Real eval after geometry adjustment:
<n>/<N> success

Camera detection:
<lerobot-find-cameras opencv output>

Front camera frame:
<link>

Wrist camera frame:
<link>

Rerun screenshot:
<link>

Robot-base-relative workspace measurements:
<measurements>

Main failure modes:
<counts and notes>

Other checkpoints tested:
<sim+real / Cosmos / none>

That would make the question much easier to answer for both LeRobot users and NVIDIA / GR00T maintainers.

My current read

Based only on the description, I would not jump straight to “the physical model is wrong.”

The sharp improvement after matching the physical layout more closely seems more consistent with one or more of:

  • narrow workspace / initial-condition distribution,
  • camera pose or camera assignment mismatch,
  • visual distribution shift,
  • calibration offset,
  • and only later, actuation / backlash / physical-modeling gaps.

The most useful next step is probably a small reproducibility package:

  1. exact checkpoint,
  2. camera evidence,
  3. workspace measurements,
  4. failure taxonomy,
  5. and a comparison between sim-only / sim+real / Cosmos checkpoints under the same physical setup.

That would also make a LeRobot Discord follow-up much more actionable.