Given the topic area, I would also suggest asking in the LeRobot Discord, since people there may have more hands-on SO-101 / camera / calibration experience. Before doing that, though, it may help to organize the report a bit so that others can reproduce or diagnose it more easily.
I have not run your exact setup myself, so please treat this as a practical debugging / reproducibility checklist rather than a diagnosis. But based on your description, I would not reduce this to only “the physical modeling is wrong” yet.
A more useful way to frame it may be:
Is this mainly a workspace / visual distribution issue, a camera / calibration issue, an expected limitation of the sim-only checkpoint, or a lower-level actuation / backlash / physical-modeling issue?
Your update that success improves when the physical rack / vial / mat placement is made closer to the sim / dataset-like setup is especially informative. If that observation is reproducible, it suggests that the policy may be quite sensitive to the reference workspace geometry and camera observations.
1. First clarify the exact checkpoint
The NVIDIA Real Evaluation docs appear to use this sim-only checkpoint for real evaluation:
export MODEL=aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left/checkpoint-10000
So I would explicitly confirm whether the run used exactly checkpoint-10000, or whether it used another checkpoint such as checkpoint-1000, checkpoint-100005, a local checkpoint, or a later training artifact.
This matters because otherwise it is hard to compare:
- sim evaluation,
- real evaluation,
- the documented tutorial,
- and other users’ results.
Useful information to add:
Model repo:
<model repo>
Checkpoint actually loaded by the GR00T server:
<checkpoint path>
Server log showing the loaded model:
<server log excerpt>
Was this exactly the documented checkpoint-10000?
<yes/no/unclear>
If the checkpoint differs from the documented one, the result may still be useful, but it becomes a different comparison.
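As a rough illustration of the check above, here is a minimal Python sketch that compares a loaded checkpoint path against the documented identifier. The function names are my own, and it assumes the path is written in the same `<org>/<repo>/checkpoint-<step>` form as the documented `MODEL` value. A local cache path would fail this check and would need to be inspected by hand.

```python
from typing import Optional

# Documented identifier, copied from the Real Evaluation docs quoted above.
DOCUMENTED_MODEL = (
    "aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left/checkpoint-10000"
)


def checkpoint_step(path: str) -> Optional[int]:
    """Return the step from a trailing 'checkpoint-<step>' component, else None."""
    last = path.rstrip("/").split("/")[-1]
    prefix = "checkpoint-"
    suffix = last[len(prefix):]
    if last.startswith(prefix) and suffix.isdigit():
        return int(suffix)
    return None


def matches_documented(loaded_path: str) -> bool:
    """True only if the last three path components equal the documented ones."""
    loaded_parts = loaded_path.rstrip("/").split("/")[-3:]
    return loaded_parts == DOCUMENTED_MODEL.split("/")
```

For example, `matches_documented(DOCUMENTED_MODEL)` is true, while the same path ending in `checkpoint-1000` is rejected, which is exactly the kind of one-digit difference that is easy to miss in a log.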
2. Confirm camera assignment with actual frames
The same Real Evaluation docs mention using Rerun so that you can inspect joint actions and camera feeds while the policy runs. The NVIDIA troubleshooting guide also calls out camera index changes, wrong camera feed assignment, and camera positioning as possible causes of deployment problems.
So I would add concrete camera evidence, not just “the cameras are detected.”
For example:
lerobot-find-cameras opencv
Then include:
Camera detection output:
<output>
CAMERA_GRIPPER:
<index>
CAMERA_EXTERNAL:
<index>
One frame from wrist camera:
<link or image>
One frame from front camera:
<link or image>
Rerun screenshot while policy is running:
<link or image>
Things that would be worth checking:
- Are front and wrist definitely not swapped?
- Is the wrist camera image oriented as expected?
- Is the external camera seeing roughly the same workspace composition as the dataset visualizer?
- Are the OpenCV camera indices stable after unplug / replug?
- Is the camera physically fixed and not vibrating?
- Are focus, exposure, brightness, and white balance stable?
- Are the camera views 640x480 as expected by the tutorial command?
- Does Rerun show reasonable joint actions and camera feeds during the rollout?
A camera mismatch can easily produce a situation where the policy runs without a runtime error but fails behaviorally.
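If it helps to collect the frames listed above, here is a small Python sketch that grabs one frame per camera index with OpenCV and checks it against the 640x480 size. This is only an illustration, not part of the tutorial: the role names and the output file naming are my own assumptions, and the indices should come from your own `lerobot-find-cameras opencv` output.

```python
EXPECTED_SIZE = (640, 480)  # (width, height), as used by the tutorial command


def resolution_ok(frame_shape, expected=EXPECTED_SIZE):
    """Check a frame shape; OpenCV frames are (height, width, channels)."""
    height, width = frame_shape[:2]
    return (width, height) == tuple(expected)


def save_one_frame(index, role, expected=EXPECTED_SIZE):
    """Grab one frame from an OpenCV camera index and write it to a PNG."""
    import cv2  # imported lazily so resolution_ok works without OpenCV

    cap = cv2.VideoCapture(index)
    try:
        # Request the expected size; the driver may silently ignore this.
        cap.set(cv2.CAP_PROP_FRAME_WIDTH, expected[0])
        cap.set(cv2.CAP_PROP_FRAME_HEIGHT, expected[1])
        ok, frame = cap.read()
    finally:
        cap.release()
    if not ok:
        raise RuntimeError(f"camera index {index} ({role}) returned no frame")
    path = f"{role}_index{index}.png"
    cv2.imwrite(path, frame)
    return path, resolution_ok(frame.shape, expected)
```

Calling something like `save_one_frame(0, "wrist")` and `save_one_frame(2, "front")` (hypothetical indices) before and after an unplug / replug gives a quick visual check of whether the index-to-camera mapping is stable.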
3. Quantify the workspace geometry
Your observation that real success improves after matching the physical layout more closely seems important. I would make this part quantitative.
Instead of only describing the setup in photos, it would help to provide robot-base-relative or mat-relative measurements.
For example:
| Item | Measurement |
| --- | --- |
| Robot base center → mat corner | <x mm, y mm> |
| Robot base center → rack center | <x mm, y mm> |
| Rack yaw | |
| Robot base center → vial initial position | <x mm, y mm> |
| Vial yaw | |
| External camera position | <x, y, z, yaw, pitch, roll if known> |
| Wrist camera mount | <photo / approximate angle> |
| Light brightness / color temperature | |
| Camera exposure / focus | |
| Mat / rack / vial dimensions | |
The practical question is:
How narrow is the successful region in workspace coordinates?
If success only appears when the rack / vial / mat layout is very close to the dataset visualizer or simulated reference setup, then the dominant issue may be workspace / visual distribution sensitivity rather than just physical dynamics.
I would phrase that cautiously:
If the reported improvement after matching rack / vial / mat placement is reproducible, that suggests the policy may be quite sensitive to the reference workspace geometry and camera observations. I would not conclude “bad physical modeling” until checkpoint identity, camera assignment, camera pose, and workspace geometry are ruled out.
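To make "how close to the reference layout" concrete, one option is to compute the planar distance between each measured position and a reference position in the robot-base frame. The sketch below assumes you have (x, y) values in millimetres for both. All numbers are made up for illustration; I do not know the actual reference measurements.

```python
import math


def layout_deviation(measured, reference):
    """Per-item planar distance (mm) between measured and reference positions."""
    return {
        name: math.hypot(measured[name][0] - ref_x, measured[name][1] - ref_y)
        for name, (ref_x, ref_y) in reference.items()
    }


# Hypothetical robot-base-relative positions in mm, for illustration only.
reference = {"rack_center": (250.0, 100.0), "vial_start": (200.0, -80.0)}
measured = {"rack_center": (253.0, 104.0), "vial_start": (200.0, -80.0)}

deviation = layout_deviation(measured, reference)
for name, distance_mm in deviation.items():
    print(f"{name}: {distance_mm:.1f} mm from reference")
```

Recording this deviation next to each trial's outcome would show directly how far the layout can drift before success drops off.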
4. Add a failure-mode table
“0% success” is useful, but it is hard to debug without knowing how the failures look. A failure-mode table would help others reason about the cause.
For example:
| Failure mode | Count | Possible interpretation |
| --- | --- | --- |
| Does not move toward vial | | Camera / language / action interface issue |
| Reaches near vial but laterally offset | | Camera pose / workspace geometry / calibration issue |
| Grasps but drops vial | | Gripper / friction / timing issue |
| Grasps vial but misses rack | | Rack pose / precision / actuation issue |
| Always misses by the same offset | | Calibration / camera positioning / kinematic offset |
| Random-looking failures | | Visual instability / distribution shift |
| Stuttering / jerky execution | | Action execution / latency / chunking issue |
| Succeeds only after geometry tuning | | Narrow workspace / visual distribution |
This would be especially useful if you can attach a few short clips or frame sequences for representative failures.
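If you label each trial by hand, filling the Count column can be as simple as the following Python sketch. The label strings are hypothetical shorthand for the rows above, not an established taxonomy.

```python
from collections import Counter


def tally_failures(trial_labels):
    """Return (label, count, fraction) tuples, most frequent label first."""
    if not trial_labels:
        return []
    counts = Counter(trial_labels)
    total = len(trial_labels)
    return [(label, n, n / total) for label, n in counts.most_common()]


# Hypothetical per-trial labels, one entry per rollout.
labels = ["lateral_offset", "lateral_offset", "drops_vial", "success"]
for label, n, fraction in tally_failures(labels):
    print(f"| {label} | {n} | {fraction:.0%} |")
```

Keeping one label per rollout also makes it easy to re-tally later if you decide to split or merge categories.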
5. Compare the available policy variants
The NVIDIA Datasets and Models page lists several relevant model variants:
- sim-only model:
aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left
- sim+real model:
aravindhs-NV/grootn16-finetune_sreetz-so101_teleop_vials_rack_left_sim_and_real
- Cosmos-augmented models:
aravindhs-NV/sreetz-so101_teleop_vials_rack_left_augment_02
aravindhs-NV/sreetz-so101_teleop_vials_rack_left_augment_10
A useful diagnostic experiment would be to run the same physical setup with multiple checkpoints:
| Policy | Same physical geometry? | Result |
| --- | --- | --- |
| sim-only checkpoint | yes | <success / trials> |
| sim-only checkpoint, reference-like geometry | yes | <success / trials> |
| sim+real checkpoint | yes | <success / trials> |
| Cosmos 7 checkpoint | yes | <success / trials> |
| Cosmos 70 checkpoint | yes | <success / trials> |
Possible interpretations:
- Only reference-like geometry works → likely narrow workspace / initial-condition / visual distribution sensitivity.
- Cosmos improves performance → likely visual variation, lighting, texture, or object-position variation matters.
- sim+real improves performance → real-world grounding is important; sim-only may simply be too weak for robust zero-shot transfer.
- All policies miss by the same spatial offset → calibration, camera pose, or kinematic offset becomes more likely.
- All policies stutter or pause → action execution, latency, or chunk-boundary behavior may need attention.
- All policies fail despite correct cameras and geometry → then deeper actuation / calibration / physical-model mismatch becomes more plausible.
This comparison would be more useful than only asking whether the sim-only policy “should work.”
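One caveat when comparing checkpoints: with only a handful of trials per policy, the success counts are very noisy. A Wilson score interval is one standard way to show that uncertainty, and the sketch below uses it. This is my own suggestion, not something the docs prescribe. For example, 0/10 at 95% confidence is still compatible with a true success rate of up to roughly 28%.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts, for illustration only.
for name, (n, total) in {"sim-only": (0, 10), "sim+real": (6, 10)}.items():
    low, high = wilson_interval(n, total)
    print(f"{name}: {n}/{total}, 95% interval {low:.2f} to {high:.2f}")
```

If the intervals for two checkpoints overlap heavily, it is probably worth running more trials before reading much into the difference.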
6. Clarify the expected real-world baseline
One documentation question seems worth asking directly:
What real-world success rate should users expect from the documented sim-only checkpoint under the reference SO-101 setup?
The Real Evaluation docs explain how to run the real robot and inspect Rerun, but it would be useful to know whether near-zero real success is expected for the sim-only checkpoint outside a narrow reference setup, or whether it indicates a setup problem.
Related documentation questions:
- Is checkpoint-10000 the canonical checkpoint for real evaluation?
- Was the real-evaluation video / baseline success rate measured internally?
- Can the authors share a short successful real rollout video?
- Can the authors share reference front/wrist camera frames?
- Can the authors share approximate reference measurements for robot base, mat, rack, vial initial position, and external camera?
- Is the sim-only checkpoint intended mainly as a baseline before sim+real co-training / Cosmos augmentation?
- Are camera exposure, focus, and white balance expected to be fixed?
There is also a small dataset/model-count clarification that may be worth asking. The Datasets and Models page describes the sim+real dataset as 75 sim-only demonstrations + 5 real-world demonstrations, while the model table describes the sim+real model as 75 sim + 50 real. It would be useful to clarify which number is correct, because the amount of real data strongly affects the expected real-world behavior.
7. Co-training may be the intended next step
The Co-Training section describes combining simulation data with real-world demonstrations, including small real datasets such as 5 real episodes.
So I would not assume that the sim-only checkpoint is expected to be robust zero-shot across copied physical setups. It may be better interpreted as a baseline for observing the sim-to-real gap before trying:
- sim+real co-training,
- Cosmos augmentation,
- or actuation-gap modeling.
A useful question for the maintainers would be:
Is the expected workflow that users first observe the sim-to-real gap with the sim-only checkpoint, and then move to sim+real / Cosmos / SAGE? Or should the sim-only checkpoint already achieve a meaningful real-world success rate in the reference workspace?
8. Cosmos comparison can test visual / workspace-distribution sensitivity
The Cosmos augmentation section discusses augmenting data with visual variations such as lighting, object position, textures, and environmental changes.
That makes the Cosmos checkpoints useful as a diagnostic tool here.
Suggested comparison:
Same camera placement:
<yes/no>
Same rack / vial / mat placement:
<yes/no>
sim-only success:
<n>/<N>
Cosmos-7 success:
<n>/<N>
Cosmos-70 success:
<n>/<N>
Failure-mode differences:
<short notes>
If Cosmos helps, the issue is probably not only actuator physics. It would suggest that visual / workspace distribution is a major factor.
If Cosmos does not help, but sim+real helps, then real-world grounding may be more important than synthetic visual variation.
If neither helps and the failure is spatially consistent, calibration / camera pose / actuation gap becomes more likely.
9. Actuation / backlash is still possible, but I would check it later
The SAGE + GapONet section discusses sim-to-real actuation gaps and notes that SO-101 hobby servos can introduce backlash that accumulates through the kinematic chain.
So an actuation gap is a real possibility.
But I would check it after:
- exact checkpoint,
- camera assignment,
- camera pose / focus / exposure,
- reference-like workspace geometry,
- failure-mode consistency,
- sim-only vs sim+real vs Cosmos behavior.
If the robot always misses by the same spatial offset even when the camera views and workspace geometry are correct, then calibration / camera pose / actuation gap becomes much more likely.
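To judge whether the misses really share "the same spatial offset", it may help to separate the mean offset (systematic bias) from the spread around that mean (scatter). The sketch below assumes you can estimate each miss as an (x, y) offset in millimetres, for example from a top-down photo. The sample values are invented.

```python
import math
from statistics import mean


def offset_summary(offsets_mm):
    """Return (bias_xy, bias_norm, scatter) for a list of (x, y) miss offsets."""
    bias_x = mean(x for x, _ in offsets_mm)
    bias_y = mean(y for _, y in offsets_mm)
    # RMS distance of each miss from the mean miss.
    scatter = math.sqrt(
        mean((x - bias_x) ** 2 + (y - bias_y) ** 2 for x, y in offsets_mm)
    )
    return (bias_x, bias_y), math.hypot(bias_x, bias_y), scatter


# Hypothetical miss offsets in mm, relative to the target.
misses = [(10.0, 0.0), (12.0, 0.0), (8.0, 0.0)]
bias, bias_norm, scatter = offset_summary(misses)
print(f"bias={bias}, |bias|={bias_norm:.1f} mm, scatter={scatter:.1f} mm")
```

A bias that is large compared with the scatter points toward a consistent offset (calibration, camera pose, kinematics), while a small bias with large scatter looks more like the random-looking failures in the table from section 4.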
10. Related reports, but not necessarily the same root cause
There are a few related community reports that may be worth reading. I would not assume they have the same root cause, but they show that SO-101 / GR00T / LeRobot real deployment can be sensitive to grasping, calibration, camera setup, and execution details.
Again, these are not proof that this issue has the same cause. They are just useful context.
11. If you also ask in LeRobot Discord
For a LeRobot Discord follow-up, I would make the question short and evidence-heavy. The goal would not be to repost the whole thread, but to ask whether other SO-101 users can compare their working setup against yours.
Something like this might be easier for people to answer:
Has anyone reproduced the NVIDIA SO-101 sim-to-real real evaluation with the provided GR00T sim-only checkpoint?
I am trying to distinguish between:
- camera feed / camera assignment issues,
- workspace geometry or initial-condition distribution shift,
- SO-101 calibration / backlash,
- and the expected limitation of the sim-only checkpoint.
The interesting observation is that real success improves when the rack / vial / mat positions are manually matched more closely to the sim / dataset-like layout.
Could anyone who has run this share:
1. exact checkpoint used,
2. real success rate,
3. front/wrist camera frames,
4. external camera pose,
5. rack/mat/robot-base measurements,
6. whether sim+real or Cosmos checkpoints worked better?
The most useful attachments would probably be:
- one front-camera frame,
- one wrist-camera frame,
- one Rerun screenshot,
- one top-down setup photo,
- a small table of workspace measurements,
- success / failure counts,
- and a short failure-mode table.
12. Compact information block to add to this thread
If possible, I would add a compact block like this to the original post:
Exact checkpoint:
<checkpoint path>
GR00T / workshop repo commit:
<commit>
LeRobot version / commit:
<version or commit>
Docker image / tag:
<tag>
Sim eval:
<n>/<N> success
Real eval before geometry adjustment:
<n>/<N> success
Real eval after geometry adjustment:
<n>/<N> success
Camera detection:
<lerobot-find-cameras opencv output>
Front camera frame:
<link>
Wrist camera frame:
<link>
Rerun screenshot:
<link>
Robot-base-relative workspace measurements:
<measurements>
Main failure modes:
<counts and notes>
Other checkpoints tested:
<sim+real / Cosmos / none>
That would make the question much easier to answer for both LeRobot users and NVIDIA / GR00T maintainers.
My current read
Based only on the description, I would not jump straight to “the physical model is wrong.”
The sharp improvement after matching the physical layout more closely seems more consistent with one or more of:
- narrow workspace / initial-condition distribution,
- camera pose or camera assignment mismatch,
- visual distribution shift,
- calibration offset,
- and only later, actuation / backlash / physical-modeling gaps.
The most useful next step is probably a small reproducibility package:
- exact checkpoint,
- camera evidence,
- workspace measurements,
- failure taxonomy,
- and a comparison between sim-only / sim+real / Cosmos checkpoints under the same physical setup.
That would also make a LeRobot Discord follow-up much more actionable.