arXiv Paper Digest

Pixel-Perfect Visual Geometry Estimation

Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang

First: 2026-01-08T18:59:49+00:00 · Latest: 2026-01-08T18:59:49+00:00

Comments: Code: https://github.com/gangweix/pixel-perfect-depth

Abstract

Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
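
As a rough illustration of what the abstract calls flying pixels: when a depth map is back-projected to a point cloud, pixels whose predicted depth falls between a foreground and a background surface end up floating in empty space. The sketch below is not the paper's method. It is a simple neighbour-based heuristic on a toy depth map, and the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and `rel_threshold` are all assumptions chosen for the example.

```python
def unproject(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) to (x, y, z) points, pinhole model."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def flying_pixel_mask(depth, rel_threshold=0.2):
    """Flag pixels whose depth is far from BOTH opposite neighbours on some axis.

    Illustrative heuristic only (an assumption, not taken from the paper):
    such a pixel sits between two surfaces and belongs to neither.
    Depths are assumed to be positive.
    """
    h, w = len(depth), len(depth[0])

    def far(a, b):
        return abs(a - b) > rel_threshold * min(a, b)

    mask = [[False] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            horizontal = (0 < u < w - 1
                          and far(z, depth[v][u - 1])
                          and far(z, depth[v][u + 1]))
            vertical = (0 < v < h - 1
                        and far(z, depth[v - 1][u])
                        and far(z, depth[v + 1][u]))
            mask[v][u] = horizontal or vertical
    return mask


# Toy scene: foreground at depth 1.0, background at 5.0, and one blurred
# column at 3.0 along the boundary between them.
depth = [[1.0, 1.0, 3.0, 5.0, 5.0] for _ in range(3)]
mask = flying_pixel_mask(depth)
points = unproject(depth, fx=1.0, fy=1.0, cx=2.0, cy=1.0)

# Keep only the points that were not flagged (points are in row-major order).
flat_mask = [flag for row in mask for flag in row]
clean = [p for p, flag in zip(points, flat_mask) if not flag]
print(len(clean))  # 12
```

A filter like this removes the floating column in the toy scene, but it would also discard genuinely thin structures, which is one reason a model that avoids producing flying pixels in the first place is attractive.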

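Title / Abstract (translated from Chinese)

Title: Pixel-Perfect Visual Geometry Estimation

Recovering clean and accurate geometric structure from images is essential for robotics and augmented reality. However, existing geometry foundation models are still severely affected by flying pixels and loss of detail. In this paper, we propose pixel-perfect visual geometry models that predict high-quality point clouds free of flying pixels by using generative modeling in pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model based on pixel-space diffusion transformers (DiT). To address the high computational complexity of pixel-space diffusion, we propose two key designs: 1) a Semantics-Prompted DiT, which uses semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual detail; and 2) a Cascade DiT architecture, which progressively increases the number of image tokens to improve both efficiency and accuracy. To extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. Reference-guided token propagation is then performed within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models perform best among all generative monocular and video depth estimation models and produce cleaner point clouds than all other models.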

Summary

This paper addresses the recovery of clean and accurate geometry from images for robotics and augmented reality. It introduces pixel-perfect visual geometry models, Pixel-Perfect Depth (PPD) and its video extension PPVD, which use pixel-space diffusion transformers (DiT) to predict high-quality point clouds without flying pixels. To keep pixel-space diffusion tractable, PPD relies on two designs: a Semantics-Prompted DiT, which preserves global semantics while enhancing fine details, and a Cascade DiT, which progressively increases the number of image tokens to improve efficiency and accuracy. The models achieve the best performance among generative monocular and video depth estimation models and produce significantly cleaner point clouds than other models.

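This paper tackles the challenge of recovering clean, accurate geometry from images, which is essential for robotics and augmented reality. It proposes pixel-perfect visual geometry models, namely Pixel-Perfect Depth (PPD) and its video extension PPVD, which predict high-quality point clouds free of flying pixels. PPD uses a pixel-space diffusion transformer (DiT) with semantic prompting to preserve global semantics while enhancing fine-grained visual detail. The Cascade DiT architecture improves efficiency and accuracy. For video, a Semantics-Consistent DiT is introduced to maintain temporal consistency. The models perform strongly in monocular and video depth estimation and produce cleaner point clouds.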

Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration

Authors: Xingyi He, Adhitya Polavaram, Yunhao Cao, Om Deshmukh, Tianrui Wang, Xiaowei Zhou, Kuan Fang

First: 2026-01-08T18:59:30+00:00 · Latest: 2026-01-08T18:59:30+00:00

Comments: Project Page: https://cordex-manipulation.github.io/

Abstract

Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.
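
To make the "transfer" step more concrete, here is a minimal sketch of moving grasp contacts from a demonstrated object to a new instance via correspondences. The abstract does not say how CorDex estimates correspondences, so the nearest-neighbour match over per-point descriptors used here is a stand-in assumption, and every name (`nearest_index`, `transfer_contacts`, the toy descriptors and points) is hypothetical. The subsequent optimization-based adaptation is not shown.

```python
import math


def nearest_index(query, candidates):
    """Index of the candidate descriptor closest to `query` (Euclidean distance)."""
    return min(range(len(candidates)),
               key=lambda i: math.dist(query, candidates[i]))


def transfer_contacts(src_contacts, src_desc, tgt_desc, tgt_points):
    """Map contact indices on the source object to 3D points on the target.

    src_contacts: indices of source points touched in the demonstration
    src_desc, tgt_desc: one feature descriptor per source / target point
    tgt_points: 3D coordinates of the target points
    """
    return [tgt_points[nearest_index(src_desc[i], tgt_desc)]
            for i in src_contacts]


# Toy data (hypothetical): 2-D descriptors for three points on each object.
src_desc = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
tgt_desc = [(0.9, 0.1), (0.1, 0.9), (0.45, 0.55)]
tgt_points = [(0.0, 0.0, 0.10), (0.05, 0.0, 0.02), (0.02, 0.03, 0.06)]

# The demonstration touched source points 0 and 2.
contacts_on_target = transfer_contacts([0, 2], src_desc, tgt_desc, tgt_points)
print(contacts_on_target)  # [(0.05, 0.0, 0.02), (0.02, 0.03, 0.06)]
```

Transferred contacts like these would generally not yield a stable or collision-free grasp on their own, which is consistent with the abstract's description of a further adaptation step through optimization.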

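Title / Abstract (translated from Chinese)

Title: Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration

Functional dexterous grasping is essential for enabling robotic hands to use tools and perform complex manipulation, but progress has been limited by two persistent bottlenecks: the scarcity of large-scale datasets and the lack of integrated semantic and geometric reasoning in learned models. In this work, we propose CorDex, a framework that robustly learns functional dexterous grasps of novel objects from synthetic data generated from a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Starting from the human demonstration, the data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that combines visual and geometric information. By designing a local-global fusion module and an importance-aware sampling mechanism, we achieve robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we show that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.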

Summary

This work addresses learning functional dexterous grasping from limited data by proposing CorDex, a framework that generates diverse synthetic training data from a single human demonstration. Its correspondence-based data engine generates diverse object instances of the demonstrated category in simulation, transfers the expert grasp to them through correspondence estimation, and adapts the grasp through optimization. Experiments across various object categories show that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.

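The study addresses the challenge of learning functional dexterous grasping by proposing the CorDex framework, which generates diverse training data from a single human demonstration. The method uses a correspondence-based data engine to generate high-quality synthetic data, transfers the expert grasp through correspondence estimation, and adapts it through optimization. A multimodal prediction network integrates visual and geometric information to predict functional grasps. Experiments show that CorDex generalizes well to unseen objects and outperforms existing methods.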

Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation

Authors: Emerson P. Grabke, Babak Taati, Masoom A. Haider

First: 2025-06-11T23:12:48+00:00 · Latest: 2026-01-08T18:59:27+00:00

Comments: Accepted for publication in IEEE Transactions on Biomedical Engineering, 2025. This is the accepted author version. The final published version is available at https://doi.org/10.1109/TBME.2025.3648426