arXiv Paper Digest

Pixel-Perfect Visual Geometry Estimation

Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang

First: 2026-01-08T18:59:49+00:00 · Latest: 2026-01-08T18:59:49+00:00

Comments: Code: https://github.com/gangweix/pixel-perfect-depth

Abstract

Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
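To make the term "flying pixels" concrete, here is a minimal sketch of one common way they show up: when a depth estimate is smoothed across an object boundary, the in-between depth values unproject to 3D points that lie on neither surface. The scanline values, camera intrinsics (`fx`, `cx`), and the mean filter are illustrative assumptions, not details from the paper, and the filter is only a stand-in for a blurry prediction, not the authors' account of the cause.

```python
def unproject(u: float, depth: float, fx: float = 500.0, cx: float = 320.0):
    """Back-project pixel column u at the given depth to camera-space (x, z).

    Pinhole model; fx and cx are hypothetical intrinsics chosen for illustration.
    """
    return ((u - cx) * depth / fx, depth)


def mean_filter3(values):
    """3-tap mean filter with clamped borders (stand-in for an over-smoothed depth map)."""
    n = len(values)
    return [
        (values[max(i - 1, 0)] + values[i] + values[min(i + 1, n - 1)]) / 3
        for i in range(n)
    ]


def is_flying(depth, surfaces=(1.0, 5.0), tol=0.5):
    """True if the depth is far from every real surface in the scene."""
    return all(abs(depth - s) > tol for s in surfaces)


# Hypothetical scanline: a foreground object at 1 m in front of a wall at 5 m.
true_depth = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
smoothed = mean_filter3(true_depth)

columns = range(318, 324)
points = [unproject(u, d) for u, d in zip(columns, smoothed)]

# Points whose depth sits between the two surfaces float in empty space.
flying = [p for p in points if is_flying(p[1])]
print(flying)
```

In this toy scanline only the two samples next to the depth edge end up between the surfaces; the sharp ground-truth depth produces none.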

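Title/abstract (translated from Chinese)

Title: Pixel-Perfect Visual Geometry Estimation

Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and loss of detail. In this paper, we propose pixel-perfect visual geometry models that predict high-quality, flying-pixel-free point clouds by using generative modeling in pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model based on pixel-space diffusion transformers (DiT). To address the high computational complexity of pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which injects semantic representations from vision foundation models into the diffusion process, preserving global semantics while enhancing fine-grained visual detail; and 2) a Cascade DiT architecture, which progressively increases the number of image tokens, improving both efficiency and accuracy. To extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT that extracts temporally consistent semantics from a multi-view geometry foundation model. Reference-guided token propagation within the DiT then maintains temporal coherence with minimal computational and memory overhead. Our models perform best among all generative monocular and video depth estimation models and produce cleaner point clouds than all other models.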

Summary

This paper addresses the recovery of clean, accurate geometry from images, which matters for robotics and augmented reality. It introduces pixel-perfect visual geometry models that apply generative modeling in pixel space. Pixel-Perfect Depth (PPD) is built on pixel-space diffusion transformers (DiT) and uses semantic prompting and a cascade architecture to preserve fine-grained detail while keeping computation manageable. Its video extension, PPVD, adds temporally consistent semantics and reference-guided token propagation. The authors report the best performance among generative monocular and video depth estimation models and significantly cleaner point clouds than all other models.
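A rough sketch of why a cascade that starts with fewer image tokens can save computation: the token count of a patch-based transformer grows with the inverse square of the patch size, and full self-attention cost grows with the square of the token count. The 512x512 resolution and the patch sizes 16 and 8 below are assumptions for illustration only; the abstract does not state the actual configuration.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for a height x width image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def attention_pairs(tokens: int) -> int:
    """Token pairs scored by one full self-attention layer (quadratic in tokens)."""
    return tokens * tokens


# Hypothetical two-stage cascade: coarse patches in early blocks, finer patches later.
stages = [(512, 512, 16), (512, 512, 8)]
counts = [num_tokens(h, w, p) for h, w, p in stages]

for (h, w, p), n in zip(stages, counts):
    print(f"patch {p:>2}: {n} tokens, {attention_pairs(n)} attention pairs per layer")
```

Under these assumed settings, halving the patch size quadruples the token count and multiplies the per-layer attention pairs by 16, so any blocks that run at the coarse stage are much cheaper than they would be at full token resolution.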

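This paper tackles the challenge of recovering clean, accurate geometric structure from images, which is essential for robotics and augmented reality. It proposes pixel-perfect visual geometry models based on pixel-space generative modeling, including Pixel-Perfect Depth (PPD) and its video extension PPVD. These models use pixel-space diffusion transformers (DiT) combined with semantic prompting and a cascade architecture to improve fine-grained detail and computational efficiency. Experimental results show that the models lead other generative approaches in monocular and video depth estimation and generate cleaner point clouds.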

Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration

Authors: Xingyi He, Adhitya Polavaram, Yunhao Cao, Om Deshmukh, Tianrui Wang, Xiaowei Zhou, Kuan Fang

First: 2026-01-08T18:59:30+00:00 · Latest: 2026-01-08T18:59:30+00:00

Comments: Project Page: https://cordex-manipulation.github.io/


Abstract

Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.

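Title/abstract (translated from Chinese)

Title: Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration

Functional dexterous grasping is essential for enabling robotic hands to use tools and perform complex manipulation, yet progress has been limited by two persistent bottlenecks: the scarcity of large-scale datasets and the lack of integrated semantic and geometric reasoning in learned models. In this paper, we propose CorDex, a framework that robustly learns functional dexterous grasps of novel objects from synthetic data generated from a single human demonstration. At the core of our method is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Starting from the human demonstration, the data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that combines visual and geometric information. By designing a local-global fusion module and an importance-aware sampling mechanism, we achieve robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across a range of object categories, we show that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.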

Summary

The research addresses the challenge of learning functional dexterous grasping from a single human demonstration, targeting two bottlenecks: the scarcity of large-scale datasets and the lack of integrated semantic and geometric reasoning in learned models. The CorDex framework uses a correspondence-based data engine to generate diverse object instances in simulation, transfer the expert grasp to them through correspondence estimation, and adapt the grasp through optimization. A multimodal prediction network then integrates visual and geometric information, using a local-global fusion module and importance-aware sampling for robust and efficient grasp prediction. Experiments across various object categories show that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.

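The research aims to address the challenge of learning dexterous functional grasping through a single human demonstration and synthetic data generation. The method uses a correspondence-based data engine to generate diverse training data in simulation, transferring the expert grasp to new objects and refining it through optimization. A multimodal prediction network combines visual and geometric information to achieve robust and efficient grasp prediction. Experiments show that CorDex performs well on unseen objects and outperforms existing methods.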

Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation

Authors: Emerson P. Grabke, Babak Taati, Masoom A. Haider

First: 2025-06-11T23:12:48+00:00 · Latest: 2026-01-08T18:59:27+00:00

Comments: Accepted for publication in IEEE Transactions on Biomedical Engineering, 2025. This is the accepted author version. The final published version is available at https://doi.org/10.1109/TBME.2025.3648426