arXiv Paper Digest

Edit3r: Instant 3D Scene Editing from Sparse Unposed Images

Authors: Jiageng Liu, Weijie Lyu, Xueting Li, Yejie Guo, Ming-Hsuan Yang

First: 2025-12-31T18:59:53+00:00 · Latest: 2025-12-31T18:59:53+00:00

Comments: Project page: https://edit3r.github.io/edit3r/

Abstract

We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.
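
The abstract gives no implementation details for the recoloring or the asymmetric input, so the following is only a minimal Python sketch of one possible reading. The same tint is blended into the masked object in every view, and the training input pairs the recolored reference view with raw auxiliary views. The masks stand in for SAM2 output (SAM2 is not called here). The function names, the brightness-preserving blend, the `strength` parameter, and the use of the recolored views as targets are assumptions made for illustration, not the authors' code.

```python
import numpy as np


def recolor_views(views, masks, tint, strength=0.6):
    """Blend one shared RGB tint into the masked region of every view.

    views: list of float arrays of shape (H, W, 3) with values in [0, 1]
    masks: list of bool arrays of shape (H, W), one per view, marking the
           same object (assumed to come from a segmenter such as SAM2)
    tint:  length-3 RGB color in [0, 1], identical for all views
    """
    tint = np.asarray(tint, dtype=np.float64)
    edited = []
    for image, mask in zip(views, masks):
        out = image.astype(np.float64).copy()
        # Scale the tint by per-pixel brightness so shading is preserved.
        brightness = out[mask].mean(axis=-1, keepdims=True)
        out[mask] = (1.0 - strength) * out[mask] + strength * brightness * tint
        edited.append(np.clip(out, 0.0, 1.0))
    return edited


def build_training_pair(views, masks, tint, reference_index=0):
    """Assumed asymmetric setup: recolored reference plus raw auxiliary views.

    Returns (inputs, targets). Only the reference view in `inputs` carries
    the edit; `targets` holds the recolored version of every view.
    """
    targets = recolor_views(views, masks, tint)
    inputs = [
        targets[i] if i == reference_index else views[i].astype(np.float64)
        for i in range(len(views))
    ]
    return inputs, targets
```

Because every view receives the same tint, the targets agree on the object's new color across views, while the network sees the edit in only one input view and has to propagate it to the others.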

Summary

Edit3r is a feed-forward framework that reconstructs and edits 3D scenes in a single pass from sparse, unposed, instruction-edited images, with no per-scene optimization or pose estimation. Because multi-view-consistent edited images are not available for supervision, training relies on a SAM2-based recoloring strategy that yields cross-view-consistent targets, plus an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views so that the network learns to fuse and align disparate observations. At inference, the model handles images edited by 2D methods such as InstructPix2Pix even though it saw no such edits during training. On DL3DV-Edit-Bench (20 scenes, 4 edit types, 100 edits), the authors report better semantic alignment and 3D consistency than recent baselines at significantly higher inference speed, which makes the approach promising for real-time 3D editing.

Coordinated Humanoid Manipulation with Choice Policies

Authors: Haozhi Qi, Yen-Jen Wang, Toru Lin, Brent Yi, Yi Ma, Koushil Sreenath, Jitendra Malik

First: 2025-12-31T18:59:53+00:00 · Latest: 2025-12-31T18:59:53+00:00

Comments: Code and Website: https://choice-policy.github.io/

Abstract

Humanoid robots hold great promise for operating in human-centric environments, yet achieving robust whole-body coordination across the head, hands, and legs remains a major challenge. We present a system that combines a modular teleoperation interface with a scalable learning framework to address this problem. Our teleoperation design decomposes humanoid control into intuitive submodules, which include hand-eye coordination, grasp primitives, arm end-effector tracking, and locomotion. This modularity allows us to collect high-quality demonstrations efficiently. Building on this, we introduce Choice Policy, an imitation learning approach that generates multiple candidate actions and learns to score them. This architecture enables both fast inference and effective modeling of multimodal behaviors. We validate our approach on two real-world tasks: dishwasher loading and whole-body loco-manipulation for whiteboard wiping. Experiments show that Choice Policy significantly outperforms diffusion policies and standard behavior cloning. Furthermore, our results indicate that hand-eye coordination is critical for success in long-horizon tasks. Our work demonstrates a practical path toward scalable data collection and learning for coordinated humanoid manipulation in unstructured environments.
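
The abstract says only that Choice Policy "generates multiple candidate actions and learns to score them", so the following is a minimal Python sketch of one plausible way to realize that idea, not the authors' implementation. It assumes a winner-take-all objective in which only the candidate closest to the expert action is regressed, and a score head trained to predict each candidate's error so that inference can pick the lowest predicted error. The function names, the mean-squared-error metric, and both losses are assumptions made for illustration.

```python
import numpy as np


def training_losses(candidates, predicted_errors, expert_action):
    """Compute illustrative losses for one training sample.

    candidates:       (K, D) array, K candidate actions from the policy
    predicted_errors: (K,) array, the score head's error estimate per candidate
    expert_action:    (D,) array, the demonstrated action
    """
    # Actual mean-squared error of each candidate against the expert action.
    errors = np.mean((candidates - expert_action) ** 2, axis=-1)
    winner = int(np.argmin(errors))
    # Winner-take-all: only the closest candidate contributes to this loss.
    action_loss = errors[winner]
    # The score head learns to predict how far off each candidate is.
    score_loss = np.mean((predicted_errors - errors) ** 2)
    return action_loss, score_loss, winner


def select_action(candidates, predicted_errors):
    """At inference, return the candidate with the lowest predicted error."""
    best = int(np.argmin(predicted_errors))
    return candidates[best], best
```

In this sketch, inference needs one forward pass to produce the candidates and their scores, followed by an `argmin`. That is consistent with the fast inference the abstract mentions, and keeping several candidates lets different modes of behavior coexist instead of being averaged together.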

Summary

The work targets robust whole-body coordination of the head, hands, and legs for humanoid robots in human-centric environments. A modular teleoperation interface splits control into hand-eye coordination, grasp primitives, arm end-effector tracking, and locomotion, which makes collecting high-quality demonstrations efficient. Building on that data, Choice Policy, an imitation learning approach, generates multiple candidate actions and learns to score them, giving fast inference and effective modeling of multimodal behaviors. On two real-world tasks, dishwasher loading and whole-body loco-manipulation for whiteboard wiping, Choice Policy significantly outperforms diffusion policies and standard behavior cloning, and the results indicate that hand-eye coordination is critical for long-horizon tasks.

Scaling Open-Ended Reasoning to Predict the Future

Authors: Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping

First: 2025-12-31T18:59:51+00:00 · Latest: 2025-12-31T18:59:51+00:00

Comments: 45 pages