arXiv Paper Digest

Pixel-Perfect Visual Geometry Estimation

Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang

First: 2026-01-08T18:59:49+00:00 · Latest: 2026-01-08T18:59:49+00:00

Comments: Code: https://github.com/gangweix/pixel-perfect-depth

Abstract

Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
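The abstract says the Cascade DiT progressively increases the number of image tokens. The following is a minimal sketch of why that matters for cost, assuming tokens come from non-overlapping square patches and that full self-attention compares every token pair. The image size, stage names, and patch sizes are hypothetical and are not taken from the paper.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def attention_pairs(tokens: int) -> int:
    """Token pairs compared by full self-attention (quadratic in token count)."""
    return tokens * tokens


# Hypothetical settings for illustration only; the paper's values may differ.
HEIGHT, WIDTH = 512, 512
STAGES = [("coarse", 16), ("fine", 8)]  # (stage name, patch size in pixels)

for name, patch in STAGES:
    n = num_tokens(HEIGHT, WIDTH, patch)
    print(f"{name}: patch={patch} tokens={n} attention_pairs={attention_pairs(n)}")
```

Under these assumptions, halving the patch size quadruples the token count and multiplies the attention pair count by 16, which is one reason to reserve the large token counts for later stages.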

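Chinese Title / Abstract (English translation)

Title: Pixel-Perfect Visual Geometry Estimation

Recovering clean and accurate geometric structure from images is essential for robotics and augmented reality. However, existing geometry foundation models are still severely affected by flying pixels and the loss of detail. In this paper, we propose pixel-perfect visual geometry models that predict high-quality point clouds free of flying pixels by using generative modeling in pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model based on pixel-space diffusion transformers (DiT). To address the high computational complexity of pixel-space diffusion, we propose two key designs: 1) the Semantics-Prompted DiT, which uses semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual detail; and 2) the Cascade DiT architecture, which progressively increases the number of image tokens, improving efficiency and accuracy. To extend PPD further to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. Reference-guided token propagation is then performed within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models perform best among all generative monocular and video depth estimation models, and the point clouds they produce are cleaner than those of all other models.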

Summary

This paper targets clean, accurate geometry recovery from images, which matters for robotics and augmented reality. It introduces pixel-perfect visual geometry models that run generative modeling directly in pixel space: Pixel-Perfect Depth (PPD), a monocular depth model built on pixel-space diffusion transformers (DiT), and its video extension PPVD. A Semantics-Prompted DiT and a Cascade DiT architecture address the cost of pixel-space diffusion while preserving fine detail; for video, a Semantics-Consistent DiT with reference-guided token propagation maintains temporal coherence. The authors report the best performance among generative monocular and video depth estimation models and significantly cleaner point clouds than all other models.
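For readers unfamiliar with the term, "flying pixels" usually refers to predicted points that float in empty space between a foreground and a background surface at a depth discontinuity. The toy sketch below illustrates one way they can arise, assuming a pinhole camera and a depth edge that has been smoothed by a 3-tap box blur. The intrinsics, depths, tolerance, and the blur itself are illustrative assumptions, not the paper's analysis.

```python
def unproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows, in metres) to 3D points with a pinhole model."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def box_blur(row):
    """3-tap box blur with edge clamping, standing in for an over-smoothed prediction."""
    last = len(row) - 1
    return [(row[max(i - 1, 0)] + row[i] + row[min(i + 1, last)]) / 3
            for i in range(len(row))]


def flying_pixels(row, near, far, tol):
    """Indices whose depth lies between the two surfaces, more than tol from either."""
    return [i for i, z in enumerate(row) if near + tol < z < far - tol]


# One scanline: foreground at 1.0 m, background at 5.0 m (hypothetical values).
sharp = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
smooth = box_blur(sharp)

print(flying_pixels(sharp, near=1.0, far=5.0, tol=0.1))   # no in-between depths
print(flying_pixels(smooth, near=1.0, far=5.0, tol=0.1))  # pixels on the blurred edge
points = unproject([smooth], fx=100.0, fy=100.0, cx=3.0, cy=0.0)
```

In the unprojected point cloud, the two flagged pixels land at depths of roughly 2.3 m and 3.7 m, where neither surface exists.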

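This paper addresses the recovery of clean, accurate geometry from images for robotics and augmented reality. It proposes pixel-perfect visual geometry models, in particular Pixel-Perfect Depth (PPD) and its video extension PPVD, which use pixel-space diffusion transformers (DiT) to predict high-quality point clouds free of flying pixels. Key innovations include the Semantics-Prompted DiT and the Cascade DiT architecture, which improve efficiency and accuracy, and the Semantics-Consistent DiT for video. The models perform strongly in monocular and video depth estimation and generate cleaner point clouds.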

Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration

Authors: Xingyi He, Adhitya Polavaram, Yunhao Cao, Om Deshmukh, Tianrui Wang, Xiaowei Zhou, Kuan Fang

First: 2026-01-08T18:59:30+00:00 · Latest: 2026-01-08T18:59:30+00:00

Comments: Project Page: https://cordex-manipulation.github.io/

Abstract

Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.
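The "transfer" step described in the abstract moves an expert grasp onto generated objects through correspondence estimation. A toy sketch of that idea follows, assuming each surface point carries a feature descriptor and that correspondence is a nearest-neighbor match in descriptor space. The function names, descriptors, and coordinates are hypothetical; the paper's correspondence estimator is not specified here, and the generation and optimization-based adaptation steps are omitted.

```python
import math


def nearest(descriptor, candidates):
    """Index of the candidate descriptor closest in Euclidean distance."""
    return min(range(len(candidates)),
               key=lambda i: math.dist(descriptor, candidates[i]))


def transfer_contacts(contact_ids, src_desc, tgt_desc, tgt_points):
    """Map grasp contacts on the source object to matched points on the target."""
    return [tgt_points[nearest(src_desc[i], tgt_desc)] for i in contact_ids]


# Hypothetical per-point descriptors on the demonstrated (source) object.
src_desc = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Hypothetical descriptors and 3D positions on a generated (target) object.
tgt_desc = [(0.1, 0.9), (0.9, 0.1), (1.0, 0.9)]
tgt_points = [(0.0, 0.0, 0.2), (0.1, 0.0, 0.0), (0.1, 0.0, 0.2)]

# The demonstrated grasp touches source points 0 and 2.
transferred = transfer_contacts([0, 2], src_desc, tgt_desc, tgt_points)
print(transferred)
```

A transferred grasp of this kind is only an initial guess; per the abstract, the framework then adapts the grasp through optimization.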

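Chinese Title / Abstract (English translation)

Title: Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration

Functional dexterous grasping is essential for enabling robotic hands to use tools and perform complex manipulation, but progress has been limited by two persistent bottlenecks: the scarcity of large-scale datasets and the lack of integrated semantic and geometric reasoning in learned models. In this paper, we propose the CorDex framework, which robustly learns functional dexterous grasps of novel objects from synthetic data generated from just a single human demonstration. The core of our method is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, the data engine generates multiple object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By designing a local-global fusion module and an importance-aware sampling mechanism, we achieve robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we show that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.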

Summary

This work tackles two bottlenecks in functional dexterous grasping: the scarcity of large-scale datasets and the lack of integrated semantic and geometric reasoning in learned models. CorDex starts from a single human demonstration; its correspondence-based data engine generates diverse object instances of the same category in simulation, transfers the expert grasp to them through correspondence estimation, and adapts the grasp through optimization. A multimodal prediction network with a local-global fusion module and importance-aware sampling is then trained on the generated data. In experiments across various object categories, the authors report that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.

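The study proposes a framework named CorDex, which addresses the challenge of learning functional dexterous grasping by generating diverse training data from a single human demonstration. The method uses a correspondence-based data engine to generate high-quality synthetic data, transfers the expert grasp through correspondence estimation, and adapts it through optimization. A multimodal prediction network integrates visual and geometric information, improving the robustness and efficiency of grasp prediction. Experiments show that CorDex generalizes well to unseen objects and outperforms existing methods.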

Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation

Authors: Emerson P. Grabke, Babak Taati, Masoom A. Haider

First: 2025-06-11T23:12:48+00:00 · Latest: 2026-01-08T18:59:27+00:00

Comments: Accepted for publication in IEEE Transactions on Biomedical Engineering, 2025. This is the accepted author version. The final published version is available at https://doi.org/10.1109/TBME.2025.3648426