arXiv 论文速递

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen

First: 2025-12-18T18:59:59+00:00 · Latest: 2025-12-18T18:59:59+00:00

Comments: Project page and code: https://worldcanvas.github.io/

Abstract

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.
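To make "trajectories encoding motion, timing, and visibility" concrete, here is a minimal sketch of how such a prompt could be represented. The class names, fields, and the `visible_spans` helper are hypothetical illustrations, not WorldCanvas's actual input format, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrajectoryPoint:
    frame: int      # timing: index of the video frame this sample belongs to
    x: float        # motion: normalized horizontal position
    y: float        # motion: normalized vertical position
    visible: bool   # visibility: False while the object is hidden or off-screen


@dataclass
class EventPrompt:
    # Hypothetical container bundling the three modalities named in the abstract.
    text: str
    trajectory: List[TrajectoryPoint]
    reference_image: Optional[str] = None   # path to an identity reference, if any


def visible_spans(points):
    """Return (first_frame, last_frame) spans during which the object is visible."""
    spans, start, prev = [], None, None
    for p in sorted(points, key=lambda q: q.frame):
        if p.visible and start is None:
            start = p.frame
        elif not p.visible and start is not None:
            spans.append((start, prev))
            start = None
        prev = p.frame
    if start is not None:
        spans.append((start, prev))
    return spans
```

An object that leaves the frame and later re-enters shows up as two separate spans, which is the "temporary disappearance" case the abstract highlights.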

中文标题/摘要

标题：世界即画布：使用参考图像、轨迹和文本绘制可提示事件

我们提出了WorldCanvas框架，该框架通过结合文本、轨迹和参考图像，实现了丰富且用户导向的模拟。与仅使用文本的方法和现有的基于轨迹控制的图像到视频方法不同，我们的多模态方法将轨迹（编码运动、时间、可见性）与自然语言（用于语义意图）和参考图像（用于物体身份的视觉定位）相结合，从而生成连贯且可控的事件，包括多智能体交互、物体进出、参考引导的外观以及反常识事件。生成的视频不仅展示了时间连贯性，还展示了在短暂消失后的一致性，保持了物体身份和场景。通过支持富有表现力的世界事件生成，WorldCanvas将世界模型从被动预测者提升为交互式的、用户导向的模拟器。我们的项目页面可在：https://worldcanvas.github.io/获取。

Summary / 总结

WorldCanvas is a framework that combines text, trajectories, and reference images to simulate rich, user-directed world events. Unlike text-only methods or trajectory-controlled image-to-video approaches, it pairs trajectories that encode motion, timing, and visibility with natural language for semantic intent and reference images for visual grounding of object identity. The resulting events are coherent and controllable, covering multi-agent interactions, object entry/exit, reference-guided appearance, and counterintuitive events. The generated videos show temporal coherence and emergent consistency, preserving object identity and scene despite temporary disappearance, which moves world models from passive predictors toward interactive, user-shaped simulators.

WorldCanvas 是一个框架，通过结合文本、轨迹和参考图像实现丰富的用户导向模拟。不同于以往仅依赖文本的方法或轨迹控制的图像到视频技术，WorldCanvas 将轨迹与自然语言和参考图像结合，生成包含多智能体交互和物体进出的连贯且可控的事件。生成的视频不仅展示了时间连贯性，还展示了临时消失后的场景一致性，保持了物体身份和场景的完整性。该框架将世界模型从被动预测者转变为交互式的、用户导向的模拟器。

Next-Embedding Prediction Makes Strong Vision Learners

Authors: Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu

First: 2025-12-18T18:59:58+00:00 · Latest: 2025-12-18T18:59:58+00:00

Comments: Project Page: https://sihanxu.me/nepa

Abstract

Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
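The objective described above (predict the next patch embedding from past ones, with causal masking and stop-gradient) can be sketched in a few lines. This is an illustrative toy, not the authors' code: the cosine-distance loss is an assumption, and since plain Python has no autodiff, the stop-gradient is only marked in a comment.

```python
import math


def causal_mask(n):
    # mask[i][j] is True when position i may attend to position j (only j <= i).
    return [[j <= i for j in range(n)] for i in range(n)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def next_embedding_loss(predicted, targets):
    """Mean cosine distance between predicted[i] and targets[i + 1].

    predicted[i] is the model output at position i, i.e. its guess for the
    embedding of patch i + 1. In an autodiff framework the targets would be
    wrapped in a stop-gradient so they act as constants; the choice of cosine
    distance here is an assumption.
    """
    pairs = list(zip(predicted[:-1], targets[1:]))
    return sum(1.0 - cosine(p, t) for p, t in pairs) / len(pairs)
```

The last prediction has no target inside the sequence, so it is dropped, mirroring how next-token losses shift predictions and labels by one position.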

中文标题/摘要

标题：下一代嵌入预测造就强大的视觉学习者

受自然语言生成预训练成功的启发，我们询问同样的原则是否可以产生强大的自监督视觉学习者。我们不是训练模型输出用于下游使用的特征，而是训练它们生成嵌入以直接执行预测任务。这项工作探讨了从学习表示到学习模型的转变。具体来说，模型学习根据过去的嵌入预测未来的嵌入，使用因果掩码和停止梯度，我们称之为下一代嵌入预测自回归（NEPA）。我们证明，一个仅以下一代嵌入预测作为其唯一学习目标在ImageNet-1k上预训练的简单Transformer是有效的——没有像素重建、离散标记、对比损失或特定任务的头部。这种表述保留了架构的简洁性和可扩展性，无需额外的设计复杂性。NEPA在各种任务中取得了出色的结果，在使用ViT-B和ViT-L骨干网络微调后分别在ImageNet-1K上达到了83.8%和85.3%的顶级准确率，并且能够有效地转移到ADE20K的语义分割上。我们认为，从嵌入生成预训练提供了一种简单、可扩展且可能跨模态的视觉自监督学习替代方案。

Summary / 总结

This work explores the application of generative pretraining principles to vision tasks, introducing Next-Embedding Predictive Autoregression (NEPA) where models learn to predict future patch embeddings based on past ones. The study demonstrates that a simple Transformer pretrained on ImageNet-1k with this objective achieves strong results, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and effectively transferring to semantic segmentation on ADE20K.

该研究探索了将生成预训练原则应用于视觉任务，提出了一种名为Next-Embedding Predictive Autoregression (NEPA)的方法，其中模型学习预测基于过去嵌入的未来嵌入。研究显示，一个仅以此目标预训练在ImageNet-1k上的简单Transformer，在ViT-B和ViT-L骨干网络微调后分别在ImageNet-1K上达到83.8%和85.3%的顶级准确率，并且有效转移到ADE20K的语义分割任务上。

EasyV2V: A High-quality Instruction-based Video Editing Framework

Authors: Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei

First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00

Comments: Project page: https://snap-research.github.io/easyv2v/

Abstract

While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/
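The idea of "pseudo pairs with shared affine motion" can be illustrated with a small sketch: the same per-frame affine transform is applied to both the source image and its edited counterpart, so the two synthetic clips move identically. Everything here is hypothetical (function names, the linear pan/zoom schedule, and the use of 2D points as a stand-in for pixels); the abstract does not describe the actual procedure.

```python
def affine_schedule(num_frames, max_shift, max_scale):
    """Per-frame 2x3 affine matrices, interpolated from identity to a target pan/zoom."""
    mats = []
    for k in range(num_frames):
        a = k / (num_frames - 1) if num_frames > 1 else 0.0
        s = 1.0 + a * (max_scale - 1.0)
        mats.append(((s, 0.0, a * max_shift[0]),
                     (0.0, s, a * max_shift[1])))
    return mats


def apply_affine(m, point):
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def pseudo_video_pair(source_points, edited_points, mats):
    # The SAME matrices drive both clips, so only the edit differs between them.
    source_clip = [[apply_affine(m, p) for p in source_points] for m in mats]
    edited_clip = [[apply_affine(m, p) for p in edited_points] for m in mats]
    return source_clip, edited_clip
```

Because the motion is shared, any difference between corresponding frames of the two clips comes from the edit itself, which is what makes the pair usable as supervision.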

中文标题/摘要

标题：EasyV2V：一种基于指令的高质量视频编辑框架

虽然图像编辑已经取得了快速进展，但视频编辑仍然较少被探索，面临着一致性、控制和泛化的挑战。我们研究了数据、架构和控制的设计空间，并引入了EasyV2V，这是一种简单有效的基于指令的视频编辑框架。在数据方面，我们通过组合现有的专家和快速逆向操作来构建多样化的视频对，通过单帧监督和共享仿射运动的伪对将图像编辑对提升为视频，挖掘密集字幕片段以构建视频对，并添加过渡监督以教授编辑的展开方式。在模型方面，我们观察到预训练的文本到视频模型具有编辑能力，这激励了简化的设计。简单的序列拼接作为条件，并结合轻量级LoRA微调足以训练出强大的模型。对于控制，我们通过单一掩码机制统一了时空控制，并支持可选的参考图像。总体而言，EasyV2V 可以灵活地处理输入，例如视频+文本、视频+掩码+文本、视频+掩码+参考+文本，并实现了最先进的视频编辑结果，超越了同时期和商用系统。项目页面：https://snap-research.github.io/easyv2v/

Summary / 总结

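EasyV2V is a framework for instruction-based video editing that addresses the challenges of consistency, control, and generalization. Its training data come from four sources: video pairs built by composing existing experts with fast inverses, image edit pairs lifted to video via single-frame supervision and pseudo pairs with shared affine motion, video pairs mined from dense-captioned clips, and transition supervision that teaches how edits unfold. The model builds on a pretrained text-to-video model, conditioned through simple sequence concatenation and trained with light LoRA fine-tuning, and a single mask mechanism unifies spatiotemporal control with optional reference images. EasyV2V accepts flexible input combinations and achieves state-of-the-art results, outperforming concurrent and commercial systems.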

EasyV2V 是一种基于指令的视频编辑框架，旨在解决视频编辑中的连贯性、控制性和泛化性挑战。它通过现有专家、单帧监督和密集字幕片段来创建多样化的视频对。该模型利用预训练的文本到视频模型，并采用简单的序列拼接和轻量级 LoRA 微调。EasyV2V 支持多种输入类型，并达到了最先进的效果，超越了同时期和商业系统。

DVGT: Driving Visual Geometry Transformer

Authors: Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu

First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00

Comments: Code is available at https://github.com/wzzheng/DVGT

Abstract

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations is still lacking. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.
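To clarify what "a global point map in the ego coordinate of the first frame" means, the sketch below maps points from each frame's own ego coordinates into the first frame using a per-frame ego pose. It is a planar (x, y, yaw) simplification written for illustration only; DVGT itself decodes 3D geometry and poses with learned heads, and none of these function names come from its code.

```python
import math


def to_first_frame(point_local, ego_pose):
    """Map a point from the ego frame at time t into the first frame's ego frame.

    ego_pose = (x, y, yaw) is the pose of the ego vehicle at time t, expressed
    in the first frame's coordinates (planar simplification, an assumption).
    """
    px, py = point_local
    x, y, yaw = ego_pose
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotate the local point by the ego heading, then translate by the ego position.
    return (x + c * px - s * py, y + s * px + c * py)


def merge_point_maps(per_frame_points, ego_poses):
    """Collect all per-frame points into one list expressed in first-frame coordinates."""
    return [to_first_frame(p, pose)
            for points, pose in zip(per_frame_points, ego_poses)
            for p in points]
```

With metric-scaled predictions, points merged this way are already in meters, which is why no post-alignment against an external sensor is needed.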

中文标题/摘要

标题：DVGT：驾驶目标视觉几何变换器

从视觉输入感知和重建3D场景几何对于自动驾驶至关重要。然而，仍然缺乏一种针对驾驶场景的密集几何感知模型，能够适应不同的场景和相机配置。为了解决这一问题，我们提出了一种驾驶目标视觉几何变换器（DVGT），它可以从前序的多视角未校正视觉输入中重建全局密集的3D点云图。我们首先使用DINO主干网络提取每张图像的视觉特征，然后采用交替的同视角局部注意力、跨视角空间注意力和跨帧时间注意力来推断图像间的几何关系。接着，我们使用多个解码头在第一帧的 ego 坐标系中解码全局点云图，并为每一帧计算 ego 姿态。与依赖精确相机参数的传统方法不同，DVGT 不需要显式的3D几何先验，能够灵活处理任意的相机配置。DVGT 直接从图像序列中预测出度量标定的几何结构，消除了与外部传感器进行后对齐的需要。在包括 nuScenes、OpenScene、Waymo、KITTI 和 DDAD 等多种驾驶数据集的大规模混合训练下，DVGT 在各种场景中显著优于现有模型。代码可在 https://github.com/wzzheng/DVGT 获取。

Summary / 总结

DVGT (Driving Visual Geometry Transformer) reconstructs a global dense 3D point map from a sequence of unposed multi-view images for autonomous driving. It extracts per-image features with a DINO backbone, alternates intra-view local, cross-view spatial, and cross-frame temporal attention to infer geometric relations across images, and decodes both the point map (in the first frame's ego coordinates) and the ego pose of each frame. DVGT predicts metric-scaled geometry directly from image sequences without relying on precise camera parameters, and outperforms existing models across various driving scenarios.

DVGT 是一种用于自动驾驶的 Driving Visual Geometry Transformer，旨在从视觉输入中感知和重建 3D 场景几何。它使用 DINO 主干提取特征，并应用注意力机制来推断图像间的几何关系。DVGT 直接从图像序列中预测具有度量尺度的几何结构，无需依赖精确的相机参数，实现了在各种场景中对密集 3D 点图重建和 ego 姿态估计的显著性能提升。

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Authors: Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu

First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00

Comments: project page: https://auditdm.github.io/

Abstract

Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
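One way to picture an auditor that "maximizes disagreement among target models" is a reward equal to the fraction of model pairs that give different answers to a generated question. This is a hypothetical stand-in; the abstract does not give AuditDM's actual reward, and exact string comparison is a simplifying assumption.

```python
from itertools import combinations


def disagreement(answers):
    """Fraction of answer pairs that differ; 0.0 means all target models agree."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def rank_probes(probe_answers):
    """Sort candidate probes (question -> list of model answers) by disagreement, highest first."""
    return sorted(probe_answers, key=lambda q: disagreement(probe_answers[q]), reverse=True)
```

Probes that end up at the top of the ranking are the ones most likely to expose a capability gap between the audited models.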

中文标题/摘要

标题：关键差异：审计模型以发现和纠正能力差距

传统的多模态大语言模型（MLLMs）评估方法缺乏可解释性，往往无法充分揭示模型间的显著能力差距。为解决这一问题，我们引入了AuditDM，这是一种自动化的框架，通过审计模型间的差异来主动发现和纠正其失败模式。AuditDM通过强化学习微调一个MLLM作为审计员，生成能够最大化目标模型间分歧的挑战性问题和反事实图像。训练完成后，审计员能够揭示多样且可解释的示例，揭示模型的弱点，并作为无需标注的数据用于纠正。当应用于如Gemma-3和PaliGemma-2等最先进的模型时，AuditDM发现了超过20种不同的失败类型。基于这些发现的微调在16个基准测试中持续改进了所有模型，并使一个3B模型超越了其28B的对照组。我们的结果表明，在数据规模效应减弱时，有针对性的模型审计为模型诊断和改进提供了一条有效途径。

Summary / 总结

The paper introduces AuditDM, an automated framework that identifies and rectifies capability gaps in multimodal LLMs by auditing their divergence. It uses reinforcement learning to fine-tune an auditor model that generates challenging questions and counterfactual images to maximize disagreement among target models. When applied to state-of-the-art models like Gemma-3 and PaliGemma-2, AuditDM discovered over 20 distinct failure types, and fine-tuning on these discoveries improved all models across 16 benchmarks, even enabling a 3B model to surpass its 28B counterpart.

论文介绍了AuditDM，这是一种自动化的框架，通过审计模型的差异来识别和纠正能力差距。它使用强化学习来微调一个审计模型，生成具有挑战性的问题和反事实图像，以最大化目标模型之间的分歧。当应用于如Gemma-3和PaliGemma-2等最先进的模型时，AuditDM发现了超过20种不同的失败类型，并通过对这些发现的微调，所有模型在16个基准测试中都得到了改进，甚至使一个3B模型超越了其28B的版本。

AdaTooler-V: Adaptive Tool-Use for Images and Videos

Authors: Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue

First: 2025-12-18T18:59:55+00:00 · Latest: 2025-12-18T18:59:55+00:00

Comments: Project page: https://github.com/CYWang735/AdaTooler-V

Abstract

Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
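The abstract does not define the Tool Benefit Score or the exact reward scaling in AT-GRPO, so the following is only a guess at the general shape. It assumes the score is the accuracy gain from using tools on a sample, adds a hypothetical bonus or penalty to rollouts that call tools, and then applies the group-relative mean/std normalization used in GRPO-style training.

```python
import statistics


def tool_benefit(acc_with_tools, acc_without_tools):
    # Assumed definition: how much tools help on this sample (negative if they hurt).
    return acc_with_tools - acc_without_tools


def shaped_reward(correct, used_tools, benefit, alpha=0.5):
    # Hypothetical shaping: tool-using rollouts are nudged up or down by the benefit.
    base = 1.0 if correct else 0.0
    return base + (alpha * benefit if used_tools else 0.0)


def group_advantages(rewards):
    # Group-relative normalization: compare each rollout to its sampling group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

Under these assumptions, two equally correct rollouts on a sample where tools hurt receive different advantages, and the one that skipped the tool is preferred.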

中文标题/摘要

标题：AdaTooler-V：自适应图像和视频工具使用

最近的研究表明，多模态大型语言模型（MLLMs）从多模态交错的思维链（CoT）与视觉工具交互中受益。然而，现有的开源模型经常表现出盲目的工具使用推理模式，即使在不需要时也调用视觉工具，这显著增加了推理开销并降低了模型性能。为此，我们提出了AdaTooler-V，这是一种MLLM，能够根据视觉问题是否真正需要工具来执行自适应工具使用。首先，我们引入了AT-GRPO，这是一种基于每个样本的工具收益分数自适应调整奖励尺度的强化学习算法，鼓励模型仅在工具提供真正改进时才调用工具。此外，我们构建了两个数据集以支持训练：AdaTooler-V-CoT-100k 用于SFT冷启动，AdaTooler-V-300k 用于具有可验证奖励的强化学习，涵盖单图像、多图像和视频数据。在十二个基准测试中的实验表明，AdaTooler-V 具有强大的推理能力，在各种视觉推理任务中优于现有方法。值得注意的是，AdaTooler-V-7B 在高分辨率基准测试 V* 中的准确率为 89.8%，超过了商业专有模型 GPT-4o 和 Gemini 1.5 Pro。所有代码、模型和数据均已发布。

Summary / 总结

AdaTooler-V is an MLLM that adapts tool-use by determining the necessity of vision tools. It introduces AT-GRPO, a reinforcement learning algorithm that adjusts reward scales based on the Tool Benefit Score, encouraging the model to use tools only when they are genuinely beneficial. AdaTooler-V outperforms existing methods across twelve benchmarks, achieving 89.8% accuracy on the high-resolution benchmark V*. The model surpasses commercial proprietary models like GPT-4o and Gemini 1.5 Pro. All code, models, and data are publicly available.

AdaTooler-V旨在通过使多模态大语言模型能够适应性地使用工具来提高效率和性能。它引入了AT-GRPO，这是一种基于工具效益评分调整奖励尺度的强化学习算法，鼓励模型仅在必要时使用工具。实验表明，AdaTooler-V在各种视觉推理任务中表现优于现有方法，其在高分辨率基准V*上的准确率达到89.8%。

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

Authors: Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille

First: 2025-12-18T18:59:54+00:00 · Latest: 2025-12-18T18:59:54+00:00

Abstract

Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
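The review schedule, which partitions a reasoning chain into "logically complete slices of comparable length", might look roughly like the greedy packing below. This is an assumed sketch: it treats each reasoning step as an indivisible unit and measures length in whitespace-separated words, neither of which is confirmed by the abstract.

```python
def slice_chain(steps, target_len):
    """Greedily pack whole reasoning steps into slices of at most ~target_len words.

    A step is never split, which keeps each slice logically complete; a single
    step longer than target_len becomes its own slice.
    """
    slices, current, size = [], [], 0
    for step in steps:
        n = len(step.split())
        if current and size + n > target_len:
            slices.append(current)
            current, size = [], 0
        current.append(step)
        size += n
    if current:
        slices.append(current)
    return slices
```

Each slice would then be judged separately by the discriminator, so a long chain yields several step-level reward signals instead of a single exact-match signal at the end.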

中文标题/摘要

标题：生成对抗推理器：通过对抗强化学习增强LLM推理能力

具有明确推理能力的大语言模型（LLMs）在数学推理方面表现出色，但仍会犯过程错误，如错误计算、脆弱逻辑和表面上合理但实际上无效的步骤。在本文中，我们介绍了生成对抗推理器，这是一种通过对抗强化学习共同进化LLM推理器和基于LLM的鉴别器的在策略联合训练框架，旨在通过逻辑推理能力的共同进化来增强推理能力。计算高效的审查计划将每个推理链分割成逻辑上完整的、长度相近的片段，并通过简洁、结构化的论证来评估每个片段的合理性。学习结合互补信号：LLM推理器因逻辑一致且得出正确答案的步骤而获得奖励，而鉴别器因正确检测错误或区分推理过程中的痕迹而获得奖励。这产生了密集、校准良好的在策略步骤级奖励，补充了稀疏的精确匹配信号，提高了信用分配，增加了样本效率，并增强了LLM的整体推理质量。在各种数学基准测试中，该方法在标准RL训练后相对于强基线实现了持续改进。具体来说，在AIME24上，我们使DeepSeek-R1-Distill-Qwen-7B从54.0提高到61.3（+7.3），DeepSeek-R1-Distill-Llama-8B从43.7提高到53.7（+10.0）。模块化的鉴别器还使教师蒸馏、偏好对齐和基于数学证明的推理等目标的奖励塑造变得灵活。

Summary / 总结

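This paper introduces Generative Adversarial Reasoner, an on-policy joint training framework that uses adversarial reinforcement learning to enhance the reasoning of large language models (LLMs). An LLM reasoner and an LLM-based discriminator co-evolve: the reasoner is rewarded for logically consistent steps that reach correct answers, and the discriminator for correctly detecting errors in slices of the reasoning chain. This yields dense step-level rewards that supplement sparse exact-match signals. The approach delivers consistent gains over strong baselines on mathematical benchmarks; on AIME24 it improves DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3 points) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0 points).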

论文提出了生成对抗推理器，该框架通过对抗强化学习增强LLM的推理能力。它通过计算高效的审查计划将推理链分割，并提供结构化的反馈，以促进LLM推理器和判别器的共同进化。该方法通过奖励逻辑一致性和错误检测来提高推理质量，导致在数学基准测试中的一致改进。具体来说，它在AIME24上分别将DeepSeek-R1-Distill-Qwen-7B和DeepSeek-R1-Distill-Llama-8B的得分提高了7.3和10.0。

Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

Authors: Nikhil Prakash, Donghao Ren, Dominik Moritz, Yannick Assogba

First: 2025-12-18T18:59:46+00:00 · Latest: 2025-12-18T18:59:46+00:00

Comments: 18 pages, 3 figures

Abstract

Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing specific tasks. Additionally, it has been shown that model performance improvement through fine-tuning often results from the strengthening of existing circuits in the model. Taken together, these findings suggest the possibility of intervening directly on such circuits to make precise, task-targeted updates. Motivated by these findings, we propose a novel method called Constructive Circuit Amplification which identifies pivotal tokens from model reasoning traces as well as model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. These results demonstrate that targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components.
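The core mechanic, updating only the components identified as responsible for the task, amounts to a masked parameter update. The sketch below uses plain gradient descent on a dictionary of named components; the component names, the optimizer, and the helper functions are illustrative assumptions, and the step that actually identifies the circuit is not shown.

```python
def targeted_update(params, grads, selected, lr=0.1):
    """Apply a gradient step only to components listed in `selected`."""
    updated = {}
    for name, weights in params.items():
        if name in selected:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)   # frozen: copied through unchanged
    return updated


def fraction_updated(params, selected):
    """Share of components that were allowed to change."""
    return sum(1 for name in params if name in selected) / len(params)
```

Reporting `fraction_updated` alongside accuracy is what allows a statement such as "modifying as little as 1.59% of model components".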

中文标题/摘要

标题：建设性电路放大：通过目标导向的子网络更新提高LLMs的数学推理能力

先前研究发现，LLMs内部存在负责执行特定任务的稀疏子网络，通常称为电路。此外，模型性能通过微调改进通常源于增强模型中现有的电路。这些发现表明，可以直接干预这些电路，进行精确的任务导向更新。受这些发现的启发，我们提出了一种名为建设性电路放大（Constructive Circuit Amplification）的新方法，该方法从模型推理痕迹中识别关键标记，并确定负责所需任务的模型组件，仅更新这些组件。应用于数学推理，该方法在多个模型上提高了高达11.4%的准确性，同时仅修改了1.59%的模型组件，根据MMLU、TriviaQA和TruthfulQA的测量，对其他能力的影响最小。这些结果表明，通过选择性更新稀疏的模型组件，可以可靠地增强特定能力。

Summary / 总结

The study aims to improve math reasoning in large language models (LLMs) by directly updating the sparse sub-networks, or circuits, responsible for the task. The proposed method, Constructive Circuit Amplification, identifies pivotal tokens in model reasoning traces along with the model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to 11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA.

研究旨在通过直接更新负责数学推理的特定子网络（电路）来增强大型语言模型（LLM）的数学推理能力。提出的Constructive Circuit Amplification方法识别与数学推理相关的关键令牌和模型组件，并仅更新这些组件。这种方法在多个模型上将准确性提高了最多11.4%，同时仅修改了1.59%的模型组件，其他能力（如MMLU、TriviaQA和TruthfulQA）的影响也最小。

Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Authors: Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin

First: 2025-12-18T18:59:27+00:00 · Latest: 2025-12-18T18:59:27+00:00

Comments: 35 pages

Abstract

This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
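Two quantities carry the argument above: the clipped policy-gradient objective and the policy entropy. The sketch assumes the PPO-style ratio clipping commonly used in RLVR training; the abstract does not spell out the exact objective, so treat the formula and the default `eps` as assumptions.

```python
import math


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for a single token or sample.

    ratio = pi_new(a | s) / pi_old(a | s). Outside [1 - eps, 1 + eps] the
    objective stops rewarding further movement in the favorable direction.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def entropy(probs):
    """Shannon entropy (in nats) of a discrete policy distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Lower entropy corresponds to the "more confident and deterministic outputs" described in the abstract: a uniform two-way policy has entropy ln 2, while a fully deterministic one has entropy 0.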

中文标题/摘要

标题：探索 vs 开发：通过剪裁、熵和虚假奖励重新思考可验证奖励强化学习（RLVR）

本文探讨了强化学习中可验证奖励（RLVR）框架下的探索-开发权衡问题，该框架旨在提高大型语言模型（LLMs）的推理能力。近期研究表明，RLVR可以通过两个看似矛盾的机制激发LLMs的强大数学推理能力：虚假奖励，它通过奖励与真实结果无关的结果来抑制开发；熵最小化，它通过促使模型更加自信和确定来抑制探索，揭示了一个令人困惑的动态：两者都抑制开发和探索反而提高了推理性能，但其背后的原理仍不甚明了。我们关注两个基本问题：（i）策略熵与性能的关系，（ii）虚假奖励是否能带来收益，可能是通过剪裁偏差和模型污染的相互作用。我们的结果显示，虚假奖励下的剪裁偏差降低了策略熵，导致更加自信和确定的输出，而仅通过熵最小化无法实现改进。我们进一步提出了一种奖励错配模型，解释了为什么虚假奖励可以在污染环境中增强性能。我们的发现阐明了虚假奖励收益背后的机制，并为更有效的RLVR训练提供了原则。

Summary / 总结

This paper investigates the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), focusing on how spurious rewards and entropy minimization affect reasoning performance in Large Language Models (LLMs). The study shows that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. The authors also propose a reward-misalignment model that explains why spurious rewards can enhance performance beyond contaminated settings, clarifying the mechanisms behind spurious-reward benefits and offering principles for more effective RLVR training.

该论文研究了RLVR中探索与利用之间的权衡，这是一种增强LLM推理的框架。研究探讨了伪奖励和熵最小化如何影响模型性能，结果显示伪奖励减少了策略的熵，导致更自信的输出，而仅通过熵最小化无法提高性能。研究还提出了一种奖励错配模型，以解释为什么伪奖励可以在受污染环境中进一步提升性能。

SFTok: Bridging the Performance Gap in Discrete Tokenizers

Authors: Qihang Rao, Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu

First: 2025-12-18T18:59:04+00:00 · Latest: 2025-12-18T18:59:04+00:00

Comments: Under review. Code is available at https://github.com/Neur-IO/SFTok

Abstract

Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and a debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in the multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).

中文标题/摘要

标题：SFTok：在离散分词器中弥合性能差距

近期多模态模型的发展突显了图像分词在高分辨率图像生成中的关键作用。通过将图像压缩为紧凑的潜在表示，分词器使生成模型能够在低维空间中运行，从而提高计算效率并降低复杂性。离散分词器自然与自回归范式相契合，但仍然落后于连续分词器，限制了其在多模态系统中的应用。为了解决这一问题，我们提出了**SFTok**，一种结合多步迭代机制进行精确重建的离散分词器。通过整合**自我强化引导视觉重建**和**去偏见和拟合训练策略**，SFTok解决了多步过程中的训练-推理不一致性，显著提高了图像重建质量。在仅64个分词的高压缩率下，SFTok在ImageNet上的重建质量达到最新水平（rFID = 1.21），并在类别到图像生成任务中表现出色（gFID = 2.29）。

Summary / 总结

SFTok is a discrete image tokenizer designed to close the quality gap between discrete and continuous tokenizers. It uses a multi-step iterative mechanism for precise reconstruction, combined with self-forcing guided visual reconstruction and a debias-and-fitting training strategy that resolve the training-inference inconsistency of the multi-step process. At a high compression rate of 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and strong class-to-image generation results (gFID = 2.29).

SFTok 是为提高离散分词器在多模态模型中的性能而提出的，特别适用于高分辨率图像生成。它采用多步迭代机制和自我强化引导视觉重建以及去偏和拟合训练策略来提升图像重建质量。SFTok 在 ImageNet 上以每张图像 64 个令牌的高压缩率达到了最先进的结果，并在类别到图像生成任务中表现出色。

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Authors: Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath

First: 2025-12-18T18:59:03+00:00 · Latest: 2025-12-18T18:59:03+00:00

Comments: 25 pages, 10 figures. Project page: https://hybridrobotics.github.io/MomaGraph/

Abstract

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
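A scene graph that unifies spatial and functional relations, tracks object state, and exposes actionable parts could be modeled along the following lines. The class and field names are hypothetical and chosen only to mirror the properties listed in the abstract; they are not taken from MomaGraph-Scenes.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    name: str
    state: str = "unknown"                      # e.g. "open", "closed", "on"
    parts: list = field(default_factory=list)   # actionable parts, e.g. ["handle"]


@dataclass
class Relation:
    source: str
    target: str
    label: str   # e.g. "inside", "opens"
    kind: str    # "spatial" or "functional"


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.name] = node

    def relate(self, source, target, label, kind):
        self.relations.append(Relation(source, target, label, kind))

    def set_state(self, name, state):
        # Temporal update: the graph is not a static snapshot.
        self.nodes[name].state = state

    def of_kind(self, kind):
        return [(r.source, r.label, r.target) for r in self.relations if r.kind == kind]
```

For a task such as "get the milk", a planner working in a Graph-then-Plan style could read off that the milk is inside the fridge (spatial), that the handle opens the fridge (functional), and that the fridge is currently closed (state).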

中文标题/摘要

标题：MomaGraph：基于视觉语言模型的统一场景图及其在体感任务规划中的状态感知

家庭中的移动机械臂必须同时导航和操作。这需要一种紧凑且语义丰富的场景表示，能够捕捉物体的位置、功能以及哪些部分可以操作。场景图是一个自然的选择，但先前的工作往往将空间关系和功能关系分开处理，将场景视为静态快照，不包含物体状态或时间更新，也忽略了与当前任务相关的最重要信息。为了解决这些限制，我们引入了MomaGraph，这是一种将空间功能关系和部分级交互元素整合在一起的统一场景表示。然而，推进这种表示需要合适的数据和严格的评估，这些方面目前仍然不足。因此，我们贡献了MomaGraph-Scenes，这是第一个包含丰富注释、任务驱动的场景图的大规模数据集，以及MomaGraph-Bench，这是一个涵盖从高层规划到细粒度场景理解的六个推理能力的系统评估套件。在此基础上，我们进一步开发了MomaGraph-R1，这是一种7B参数的视觉语言模型，通过强化学习在MomaGraph-Scenes上进行训练。MomaGraph-R1预测任务导向的场景图，并在Graph-then-Plan框架下作为零样本任务规划器。广泛的实验表明，我们的模型在开源模型中达到了最先进的结果，准确率达到71.6%（比最佳基线高11.4%），并且在公共基准测试中具有泛化能力，并且能够有效地转移到真实机器人实验。

Summary / 总结

MomaGraph addresses the limitations of prior scene graph representations by integrating spatial-functional relationships and part-level interactive elements. It introduces MomaGraph-Scenes, a large-scale dataset of richly annotated, task-driven scene graphs in household environments, and MomaGraph-Bench, an evaluation suite spanning six reasoning capabilities. MomaGraph-R1, a 7B vision-language model trained with reinforcement learning, predicts task-oriented scene graphs and serves as a zero-shot task planner. It reaches 71.6% accuracy on the benchmark, state of the art among open-source models and +11.4% over the best baseline, and transfers to public benchmarks and real-robot experiments.

MomaGraph通过整合空间功能关系和部件级交互元素来解决先前场景图表示的局限性。它引入了MomaGraph-Scenes，一个包含任务驱动场景图的大规模数据集，以及MomaGraph-Bench，一个用于体态代理的评估套件。MomaGraph-R1，一个7B视觉语言模型，在基准测试中以71.6%的准确率预测任务导向的场景图，展示了强大的泛化能力和转移能力。

SceneDiff: A Benchmark and Method for Multiview Object Change Detection

Authors: Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou, Shenlong Wang, Derek Hoiem

First: 2025-12-18T18:59:02+00:00 · Latest: 2025-12-18T18:59:02+00:00

Abstract

We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
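The reported gains are relative AP improvements, which are easy to misread as absolute differences. A short helper makes the arithmetic explicit; the AP values in the usage note are made-up numbers for illustration, not results from the paper.

```python
def relative_improvement(baseline_ap, new_ap):
    """Relative improvement in percent: how much larger new_ap is than baseline_ap."""
    return (new_ap - baseline_ap) / baseline_ap * 100.0
```

For example, a hypothetical jump from 0.25 AP to 0.485 AP is a 94% relative improvement even though the absolute gain is 0.235 AP.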

中文标题/摘要

标题：SceneDiff：一种多视角物体变化检测基准和方法

我们研究了在不同时间同一场景的成对捕获（图像或视频）之间识别已添加、移除或移动的物体的问题。检测此类变化对于许多应用非常重要，例如机器人整理或建筑进度和安全监控。主要挑战在于不同视角的变化可能导致物体错误地被检测为变化。我们引入了SceneDiff基准，这是第一个包含物体实例注释的多视角变化检测基准，包含350个多样化的视频对，数千个变化的物体。我们还引入了SceneDiff方法，这是一种新的无需训练的多视角物体变化检测方法，利用预训练的3D、分割和图像编码模型来稳健地跨多个基准进行预测。该方法在3D中对齐捕获，提取物体区域，并比较空间和语义区域特征以检测变化。在多视角和两视角基准上的实验表明，我们的方法在现有方法的基础上取得了显著的性能提升（相对AP改进94%和37.4%）。基准和代码将公开发布。

Summary / 总结

The research aims to detect changes in objects between two captures of the same scene taken at different times, which is crucial for applications like robotic tidying and construction monitoring. The authors introduce the SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, and the SceneDiff method, a training-free approach that uses pretrained 3D, segmentation, and image encoding models to align captures, extract object regions, and compare spatial and semantic features to detect changes. The method shows significant improvements over existing approaches on both multi-view and two-view benchmarks, with relative AP improvements of 94% and 37.4%.

研究旨在检测同一场景在不同时间点的两次捕捉之间物体的变化，这对于机器人整理和建筑安全监控等应用至关重要。SceneDiff方法利用预训练的3D、分割和图像编码模型在3D中对齐捕捉，提取物体区域，并比较空间和语义特征来检测变化。该方法在多视图和两视图基准上的表现显著优于现有方法，相对AP改进分别为94%和37.4%。同时，还引入了包含350个多样视频对和数千个变化物体的SceneDiff基准。

Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Authors: Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu

First: 2025-12-18T18:59:01+00:00 · Latest: 2025-12-18T18:59:01+00:00

Comments: Project website: https://egoman-project.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.

中文标题/摘要

标题：从推理到运动：基于第一人称人类互动视频的3D手部轨迹预测学习

先前的3D手部轨迹预测工作受限于将运动与语义监督脱钩的数据集以及弱化推理与动作链接的模型。为解决这些问题，我们首先提出了EgoMAN数据集，这是一个用于交互阶段感知的3D手部轨迹预测的大规模第一人称数据集，包含219,000个6自由度轨迹和300万结构化问答对，用于语义、空间和运动推理。我们随后引入了EgoMAN模型，这是一种通过轨迹标记接口将视觉语言推理与运动生成链接的推理到运动框架。通过逐步训练使推理与运动动力学对齐，我们的方法能够生成准确且阶段感知的轨迹，并在真实场景中泛化。

Summary / 总结

The research addresses the limitations of existing datasets and models in 3D hand trajectory prediction by introducing the EgoMAN dataset and model. The EgoMAN dataset includes 219K 6DoF trajectories and 3M structured QA pairs for reasoning, while the EgoMAN model is a reasoning-to-motion framework that links vision-language reasoning with motion generation. The model progressively aligns reasoning with motion dynamics, resulting in accurate and stage-aware trajectories with generalization across real-world scenes.

研究旨在通过解决现有数据集和模型的限制，改进3D手部轨迹预测。它引入了EgoMAN数据集，其中包括219K条6DoF轨迹和3M个结构化问答对用于推理，并提出了EgoMAN模型，这是一种将视觉语言推理与运动生成链接的推理到运动框架。该模型通过逐步使推理与运动动力学对齐，生成准确且阶段感知的轨迹，并能泛化到真实场景。

Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Authors: Kaixin Ding, Yang Zhou, Xi Chen, Miao Yang, Jiarong Ou, Rui Chen, Xin Tao, Hengshuang Zhao

First: 2025-12-18T18:57:58+00:00 · Latest: 2025-12-18T18:57:58+00:00

Comments: project page: https://kxding.github.io/project/Alchemist/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches to Text-to-Image data filtering rely on costly manual curation or heuristic scoring based on single-dimensional features. Although meta-learning-based methods have been explored for LLMs, they have not been adapted to image modalities. To this end, we propose **Alchemist**, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.

中文标题/摘要

标题：炼金师：通过元梯度数据选择提高文本到图像模型训练效率

近年来，文本到图像（T2I）生成模型的最新进展，如Imagen、Stable Diffusion和FLUX，显著提高了视觉质量。然而，其性能从根本上受限于训练数据的质量。网络抓取和合成图像数据集往往包含低质量或冗余样本，导致视觉保真度下降、训练不稳定和计算效率低下。因此，有效数据选择对于提高数据效率至关重要。现有方法依赖于昂贵的手动筛选或基于文本到图像数据单维度特征的启发式评分。虽然在LLM中已经探索了基于元学习的方法，但尚未针对图像模态进行适应。为此，我们提出**炼金师**，一种基于元梯度的框架，用于从大规模文本-图像数据对中选择合适的子集。我们的方法通过从数据为中心的角度迭代优化模型，自动学习评估每个样本的影响。炼金师包括两个关键阶段：数据评级和数据修剪。我们训练一个轻量级的评级器，基于梯度信息估计每个样本的影响，并增强多粒度感知。然后，我们使用Shift-G采样策略选择信息丰富的子集，以实现高效的模型训练。炼金师是第一个自动、可扩展的基于元梯度的数据选择框架，用于文本到图像模型训练。在合成和网络抓取数据集上的实验表明，炼金师能够一致地提高视觉质量和下游性能。使用炼金师选择的数据集的50%进行训练可以超越使用完整数据集的训练。

Summary / 总结

Alchemist is a meta-gradient-based framework designed to improve the efficiency of Text-to-Image (T2I) model training by selecting informative data subsets. A lightweight rater estimates each sample's influence from gradient information, enhanced with multi-granularity perception, and a Shift-Gsampling strategy then prunes the data. Experiments show that Alchemist improves visual quality and downstream performance, and that training on an Alchemist-selected 50% of the data can outperform training on the full dataset.

Alchemist 是一种基于元梯度的数据选择框架，旨在通过选择合适的子数据集来提高文本到图像模型训练的效率。它使用梯度信息和多粒度感知自动评估每个样本的影响，然后进行数据修剪以选择有信息量的子集。实验表明，使用 Alchemist 选择的 50% 数据进行训练可以优于使用完整数据集进行训练，从而提高视觉质量和下游性能。

How Good is Post-Hoc Watermarking With Language Model Rephrasing?

Authors: Pierre Fernandez, Tom Sander, Hady Elsahar, Hongyan Chang, Tomáš Souček, Valeriu Lacatusu, Tuan Tran, Sylvestre-Alvise Rebuffi, Alexandre Mourachko

First: 2025-12-18T18:57:33+00:00 · Latest: 2025-12-18T18:57:33+00:00

Comments: Code at https://github.com/facebookresearch/textseal

Abs · PDF · Code1 · Code2 · Code3

Abstract

Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore *post-hoc watermarking* where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents, or detect their use in training or RAG via watermark radioactivity. Unlike generation-time approaches, which are constrained by how LLMs are served, this setting offers additional degrees of freedom for both generation and detection. We investigate how allocating compute (through larger rephrasing models, beam search, multi-candidate generation, or entropy filtering at detection) affects the quality-detectability trade-off. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books. Among our findings, the simple Gumbel-max scheme surprisingly outperforms more recent alternatives under nucleus sampling, and most methods benefit significantly from beam search. However, most approaches struggle when watermarking verifiable text such as code, where we counterintuitively find that smaller models outperform larger ones. This study reveals both the potential and limitations of post-hoc watermarking, laying groundwork for practical applications and future research.
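To make the Gumbel-max scheme mentioned in the abstract more concrete, the following is a minimal sketch of how such a watermark is commonly described: each candidate token gets a pseudorandom value r seeded by a secret key and the preceding tokens, the token maximizing r^(1/p) is emitted, and detection averages -log(1 - r) over the observed tokens. This is not the textseal implementation. The names `uniform`, `toy_probs`, `pick_token`, `generate`, and `detection_score`, the 50-token vocabulary, the 4-token context window, and the fixed toy distribution standing in for the rephrasing LLM are all assumptions made for illustration.

```python
import hashlib
import math
import random

VOCAB = 50   # hypothetical toy vocabulary size
WINDOW = 4   # hypothetical number of preceding tokens used to seed the randomness

def uniform(key, context, token):
    """Pseudorandom value strictly inside (0, 1), fixed by key, context, and candidate token."""
    payload = f"{key}|{','.join(map(str, context))}|{token}".encode()
    bits = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big") >> 12
    return (bits + 0.5) / 2**52

def toy_probs(context):
    """Stand-in for the rephrasing LLM: a fixed, non-uniform next-token distribution."""
    total = VOCAB * (VOCAB + 1) / 2
    return [(t + 1) / total for t in range(VOCAB)]

def pick_token(probs, key, context):
    """Gumbel-max rule: argmax of r ** (1 / p), computed in log space as log(r) / p."""
    return max(range(len(probs)),
               key=lambda t: math.log(uniform(key, context, t)) / probs[t])

def generate(key, length):
    """Emit `length` watermarked tokens after a dummy start context."""
    tokens = [0] * WINDOW
    for _ in range(length):
        context = tokens[-WINDOW:]
        tokens.append(pick_token(toy_probs(context), key, context))
    return tokens

def detection_score(tokens, key):
    """Mean of -log(1 - r) over scored tokens; close to 1.0 when no watermark is present."""
    scores = [
        -math.log(1.0 - uniform(key, tokens[i - WINDOW:i], tokens[i]))
        for i in range(WINDOW, len(tokens))
    ]
    return sum(scores) / len(scores)

marked = generate(key="secret", length=200)
rng = random.Random(0)
plain = [0] * WINDOW + rng.choices(range(VOCAB), weights=toy_probs([]), k=200)
print("watermarked score:", round(detection_score(marked, "secret"), 2))
print("unwatermarked score:", round(detection_score(plain, "secret"), 2))
```

In this toy setting the watermarked sequence scores well above the unwatermarked one, because the selection rule favors tokens whose pseudorandom value is close to 1. Low-entropy text such as code leaves the rule less room to choose, which is consistent with the difficulty the abstract reports for verifiable text.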

中文标题/摘要

标题：使用语言模型改写的事后水印效果如何？

生成时文本水印将统计信号嵌入文本中，以实现 AI 生成内容的可追溯性。我们探讨了 *事后水印*，即 LLM 在重写现有文本的同时应用生成时水印，以保护版权文档或通过水印放射性检测其在训练或 RAG 中的使用。与受限于 LLM 服务方式的生成时方法不同，此设置为生成和检测提供了额外的自由度。我们研究了计算资源的分配（如使用更大的重写模型、束搜索、多候选生成或检测时的熵过滤）如何影响质量-可检测性权衡。我们的策略在书籍等开放式文本上实现了强大的可检测性和语义保真度。在我们的发现中，简单的 Gumbel-max 方案在核采样下出人意料地优于更近期的替代方案，而大多数方法从束搜索中获益显著。然而，在对代码等可验证文本加水印时，大多数方法表现不佳，我们意外地发现较小的模型优于较大的模型。本研究揭示了事后水印的潜力和局限性，为实际应用和未来研究奠定了基础。

Summary / 总结

This study investigates post-hoc watermarking for text, where an LLM rewrites existing text while embedding watermarks. The research explores how different computational strategies (larger models, beam search, multi-candidate generation, entropy filtering) affect the quality-detectability trade-off. Key findings include the surprising effectiveness of the simple Gumbel-max scheme and the benefit of beam search for most methods, though smaller models outperform larger ones when watermarking verifiable text like code. This work highlights the potential and limitations of post-hoc watermarking for practical applications and future research.

研究探讨了通过LLM重新编写现有文本并嵌入水印的后处理水印技术。研究考察了不同计算策略（更大模型、束搜索、多候选生成、熵过滤）对质量-可检测性权衡的影响。关键发现包括简单Gumbel-max方案的有效性以及束搜索对大多数方法的益处，但当对可验证文本（如代码）进行水印时，较小模型的表现优于较大模型。这项工作揭示了后处理水印的潜力和局限性，为实际应用和未来研究奠定了基础。

In-Context Algebra

Authors: Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau

First: 2025-12-18T18:56:50+00:00 · Latest: 2025-12-18T18:56:50+00:00

Comments: 28 pages, 18 figures. Code and data at https://algebra.baulab.info

Abs · PDF · Code1 · Code2

Abstract

We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
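The task setup described above can be illustrated with a small data generator. This is only a sketch under assumptions: it uses a cyclic group Z_n, letter tokens, and a flat "x y = z" fact format, none of which are confirmed by the abstract, and `sample_sequence` is a hypothetical helper name. The point it shows is that the symbol-to-element assignment is reshuffled for every sequence, so a token's meaning can only be recovered from the facts in context.

```python
import random

def sample_sequence(order, num_facts, rng):
    """Return (symbols, facts) for the cyclic group Z_order with a fresh symbol assignment.

    symbols[k] is the token standing for group element k in this sequence only.
    """
    symbols = [chr(ord("a") + i) for i in range(order)]
    rng.shuffle(symbols)  # the assignment of tokens to elements changes per sequence
    facts = []
    for _ in range(num_facts):
        x = rng.randrange(order)
        y = rng.randrange(order)
        z = (x + y) % order  # group operation: addition modulo the group order
        facts.append((symbols[x], symbols[y], symbols[z]))
    return symbols, facts

rng = random.Random(0)
symbols, facts = sample_sequence(order=5, num_facts=8, rng=rng)
for left, right, result in facts:
    print(f"{left} {right} = {result}")
```

Under this sketch, the token `symbols[0]` plays the role of the identity element, and which letter that is differs between sequences, so a model cannot memorize a fixed embedding for it.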

中文标题/摘要

标题：上下文相关代数

我们研究了当Transformer被训练用于求解序列上的算术问题时出现的机制，其中的符号是变量，其含义仅通过它们的相互作用来确定。尽管先前的工作发现Transformer会发展出反映代数结构的几何嵌入，但这些发现来自算术值符号具有固定含义的设置。我们设计了一个新任务，在该任务中，符号与特定代数群元素的对应关系在每个序列中都不同。尽管设置具有挑战性，Transformer在该任务上仍达到接近完美的准确率，甚至可以泛化到未见过的代数群。我们开发了有针对性的数据分布来对一组假设机制进行因果测试，并分离出模型一致学到的三种机制：由专门的注意力头复制答案的交换复制机制，区分包含单位元的事实的单位元识别机制，以及跟踪群成员身份以限制有效答案的基于封闭性的消去机制。与固定符号设置中发现的几何表示互补，我们的研究结果表明，当模型被训练在变量含义不固定的上下文中进行推理时，会发展出符号推理机制。

Summary / 总结

The study investigates how transformers solve arithmetic problems over variables whose meanings are fixed only by context and vary from one sequence to the next. Despite the challenge, transformers achieve near-perfect accuracy and generalize to unseen algebraic groups. The research isolates three mechanisms: commutative copying, identity element recognition, and closure-based cancellation. These findings complement the geometric embeddings reported in fixed-symbol settings and show that transformers develop symbolic reasoning mechanisms when symbol meanings are not fixed.

研究探讨了Transformer在求解变量含义依赖上下文的算术问题时的工作机制。尽管任务具有挑战性，Transformer仍能达到接近完美的准确率并泛化到未见过的代数群。研究分离出三种机制：交换复制、单位元识别和基于封闭性的消去，这些机制帮助模型在符号含义不固定的设置中进行符号推理。这些发现与固定符号设置中发现的几何嵌入互补，突显了Transformer在基于变量的任务中的符号推理能力。

Impacts of Racial Bias in Historical Training Data for News AI

Authors: Rahul Bhargava, Malene Hornstrup Jespersen, Emily Boardman Ndulue, Vivica Dsouza

First: 2025-12-18T18:56:11+00:00 · Latest: 2025-12-18T18:56:11+00:00

Abs · PDF · Code1 · Code2

Abstract

AI technologies have rapidly moved into business and research applications that involve large text corpora, including computational journalism research and newsroom settings. These models, trained on extant data from various sources, can be conceptualized as historical artifacts that encode decades-old attitudes and stereotypes. This paper investigates one such example trained on the broadly-used New York Times Annotated Corpus to create a multi-label classifier. Our use in research settings surfaced the concerning "blacks" thematic topic label. Through quantitative and qualitative means we investigate this label's use in the training corpus, what concepts it might be encoding in the trained classifier, and how those concepts impact our model use. Via the application of explainable AI methods, we find that the "blacks" label operates partially as a general "racism detector" across some minoritized groups. However, it performs poorly against expectations on modern examples such as COVID-19 era anti-Asian hate stories, and reporting on the Black Lives Matter movement. This case study of interrogating embedded biases in a model reveals how similar applications in newsroom settings can lead to unexpected outputs that could impact a wide variety of potential uses of any large language model: story discovery, audience targeting, summarization, etc. The fundamental tension this exposes for newsrooms is how to adopt AI-enabled workflow tools while reducing the risk of reproducing historical biases in news coverage.

中文标题/摘要

标题：历史训练数据中的种族偏见对新闻AI的影响

AI技术已迅速应用于涉及大量文本语料库的商业和研究领域，包括计算新闻学研究和新闻编辑室环境。这些模型基于各种来源的现有数据进行训练，可以被视为包含数十年来态度和刻板印象的历史文物。本文研究了其中一个例子，该模型基于广泛使用的纽约时报注释语料库创建了一个多标签分类器。我们在研究环境中使用该模型时发现了令人担忧的“blacks”主题标签。通过定量和定性方法，我们调查了该标签在训练语料库中的使用情况，它在训练分类器中可能编码的概念以及这些概念如何影响我们的模型使用。通过应用可解释的AI方法，我们发现“blacks”标签在某些少数群体中部分作为“种族主义检测器”发挥作用。然而，它在现代示例如COVID-19时期的反亚裔仇恨故事和报道黑命贵运动方面的表现不尽如人意。这一案例研究揭示了类似应用在新闻编辑室环境中如何导致意想不到的输出，这些输出可能会影响任何大型语言模型的广泛潜在用途，如故事发现、受众定位、摘要等。这种暴露的基本紧张关系是新闻编辑室如何在降低再现新闻报道中历史偏见风险的同时采用AI驱动的工作流程工具。

Summary / 总结

This paper examines racial bias in a multi-label classifier trained on the New York Times Annotated Corpus, focusing on the 'blacks' thematic label. Through quantitative and qualitative analysis and explainable AI methods, the study finds that the label operates partially as a general 'racism detector' across some minoritized groups, yet performs poorly on modern examples such as COVID-19 era anti-Asian hate stories and reporting on the Black Lives Matter movement. This highlights the risk of reproducing historical biases in AI-driven news applications and the care newsrooms need when adopting AI tools.

本文研究了基于纽约时报注释语料库训练的多标签分类器中的种族偏见问题，重点关注令人担忧的‘blacks’主题标签。通过定量和定性分析以及可解释AI方法，研究发现‘blacks’标签在一些少数群体中部分地充当了一般的‘种族主义检测器’，但在识别现代问题如新冠疫情时期针对亚裔的仇恨报道和黑命贵运动报道方面表现不佳。研究强调，在新闻编辑室采用AI工具时存在重现历史偏见的风险，需要在AI应用中减轻此类偏见。

FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

Authors: Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Kai Qiu, Chong Luo, Zuxuan Wu

First: 2025-12-18T18:56:05+00:00 · Latest: 2025-12-18T18:56:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
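The step-skipping idea in the abstract, predicting future latents from higher-order derivatives at the current timestep, can be sketched with finite differences over cached latents. This is an illustration under assumptions, not the paper's method: it uses uniformly spaced steps, only first and second differences (Newton's backward formula), and plain Python lists in place of latent tensors. The function name `extrapolate_latent` is hypothetical, and the adaptive choice of when and how far to skip (based on latent variation rate and derivative magnitude ratios) is not modeled.

```python
def extrapolate_latent(history, steps_ahead):
    """Predict a latent `steps_ahead` steps past the newest entry of `history`.

    Uses first and second backward differences of the three most recent latents,
    which is exact when each coordinate follows a quadratic in the step index.
    """
    older, previous, current = history[-3], history[-2], history[-1]
    k = steps_ahead
    predicted = []
    for a, b, c in zip(older, previous, current):
        first = c - b              # first backward difference
        second = c - 2 * b + a     # second backward difference
        predicted.append(c + k * first + 0.5 * k * (k + 1) * second)
    return predicted

# Two-coordinate toy latent: the first coordinate follows t**2, the second 2*t + 1.
history = [[0.0, 1.0], [1.0, 3.0], [4.0, 5.0]]
print(extrapolate_latent(history, steps_ahead=2))
```

Each extrapolated latent replaces one or more denoiser evaluations, which is where the speedup would come from. The prediction error grows with `steps_ahead`, so an adaptive rule for when to skip matters in practice.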

Summary / 总结

FlashPortrait aims to improve the identity consistency in long portrait animations by using an end-to-end video diffusion transformer. It introduces a Normalized Facial Expression Block to align facial features with diffusion latents, and employs a dynamic sliding-window scheme during inference to ensure smooth transitions and ID consistency. By predicting future latents using higher-order derivatives, FlashPortrait achieves up to 6x faster inference speed while maintaining high-quality animations. Experiments demonstrate its effectiveness in both qualitative and quantitative evaluations.

FlashPortrait旨在通过解决基于扩散的方法在身份一致性方面的问题，提高合成无限长度的保身份肖像动画的效果。它使用了一个端到端的视频扩散变换器和一个归一化面部表情块来对齐面部特征与扩散潜变量，并采用动态滑动窗口方案以确保长时间动画中的平滑过渡。在推理过程中，FlashPortrait利用当前时间步的高阶潜变量导数直接预测未来时间步的潜变量，从而跳过多个去噪步骤，实现6倍的速度加速。实验表明，它在保持身份稳定性和加速推理速度方面都具有有效性。

Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Authors: Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad

First: 2025-12-18T18:56:04+00:00 · Latest: 2025-12-18T18:56:04+00:00

Comments: Code and data available at https://github.com/facebookresearch/MMRB2

Abs · PDF · Code1 · Code2 · Code3

Abstract

Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
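The two quantities discussed above, a judge's accuracy on preference pairs and Best-of-N sampling, can be sketched as follows. The word-overlap `overlap_reward` judge and the three text-only pairs are made-up stand-ins for a real multimodal reward model and the benchmark data, and `pairwise_accuracy` and `best_of_n` are hypothetical helper names rather than the MMRB2 code.

```python
def overlap_reward(prompt, response):
    """Hypothetical toy judge: number of distinct prompt words that appear in the response."""
    return len(set(prompt.split()) & set(response.split()))

def pairwise_accuracy(pairs, reward):
    """Fraction of (prompt, chosen, rejected) pairs where the judge scores `chosen` higher."""
    correct = sum(
        1 for prompt, chosen, rejected in pairs
        if reward(prompt, chosen) > reward(prompt, rejected)
    )
    return correct / len(pairs)

def best_of_n(prompt, candidates, reward):
    """Best-of-N sampling: return the candidate the judge scores highest."""
    return max(candidates, key=lambda candidate: reward(prompt, candidate))

# Made-up preference pairs; the human-preferred response comes first.
pairs = [
    ("red cube", "a red cube on a table", "a blue ball"),
    ("two cats", "two cats sleeping", "one dog"),
    ("a cat without a hat", "a cat", "a cat wearing a hat"),  # the toy judge gets this wrong
]
print(pairwise_accuracy(pairs, overlap_reward))
print(best_of_n("red cube", ["a blue ball", "a red ball", "a red cube on a table"], overlap_reward))
```

The third pair shows why judge accuracy matters for Best-of-N: a judge that rewards surface overlap would select the response that violates the negation in the prompt.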

中文标题/摘要

标题：Multimodal RewardBench 2：评估处理交错文本和图像的全能奖励模型

奖励模型（RMs）对于训练大型语言模型（LLMs）至关重要，但它们在处理交错图像和文本序列的全能模型方面仍被严重忽视。我们引入了Multimodal RewardBench 2（MMRB2），这是第一个全面评估奖励模型在多模态理解和（交错）生成方面的基准。MMRB2 包含四个任务：文本到图像、图像编辑、交错生成和多模态推理（“图像思考”），每个任务提供了来自 23 个模型和代理的 1,000 对专家注释的偏好对，这些模型和代理来自 21 个源任务。MMRB2 设计有：(1) 实用但具有挑战性的提示；(2) 来自最先进的模型和代理的响应；以及 (3) 通过集成筛选策略精心挑选的具有强烈人类专家共识的偏好对。使用 MMRB2，我们研究了每个子任务的现有评判者，包括多模态 LLM 作为评判者和使用人类偏好训练的模型。最新的 Gemini 3 Pro 达到 75-80% 的准确率。GPT-5 和 Gemini 2.5 Pro 达到 66-75% 的准确率，而人类的准确率超过 90%，但超过了广泛使用的 GPT-4o（59%）。性能最佳的开源模型 Qwen3-VL-32B 达到了与 Gemini 2.5 Flash（64%）相似的准确率。我们还展示了 MMRB2 的性能与下游任务的成功之间存在强烈的相关性，并通过 Best-of-N 抽样进行了深入分析，展示了未来改进奖励模型的关键领域。

Summary / 总结

The study introduces Multimodal RewardBench 2 (MMRB2), a benchmark for evaluating reward models on multimodal understanding and generation tasks. It includes four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning. Using MMRB2, the study evaluates existing judges, finding that Gemini 3 Pro attains 75-80% accuracy, while GPT-5 and Gemini 2.5 Pro reach 66-75%, compared to over 90% for humans and 59% for the widely used GPT-4o. The best open-source model, Qwen3-VL-32B, achieves accuracy similar to Gemini 2.5 Flash (64%). The study also shows that MMRB2 performance correlates strongly with downstream task success under Best-of-N sampling and identifies key areas for improving reward models.

论文介绍了Multimodal RewardBench 2 (MMRB2)，这是一个用于评估奖励模型在多模态理解和生成任务上的基准，包括文本到图像、图像编辑、交错生成和多模态推理。使用MMRB2，研究发现Gemini 3 Pro的准确率为75-80%，GPT-5和Gemini 2.5 Pro的准确率为66-75%，均低于人类的准确率（>90%），但高于广泛使用的GPT-4o（59%）；最佳开源模型Qwen3-VL-32B的准确率与Gemini 2.5 Flash（64%）相近。MMRB2上的表现与下游任务的成功密切相关，深入分析还指出了改进奖励模型的关键方向。

LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Authors: Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu

First: 2025-12-18T18:52:18+00:00 · Latest: 2025-12-18T18:52:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.

中文标题/摘要

标题：LinkedOut：从视频LLM中链接世界知识表示以实现下一代视频推荐

视频大型语言模型（VLLMs）通过在互联网规模数据上进行预训练，解锁了对视频的理解能力，并已在电影分析和视频问答等任务上展示了潜力。然而，将VLLMs部署到视频推荐等下游任务仍然具有挑战性，因为实际系统需要多视频输入、轻量级骨干网络、低延迟序列推理和快速响应。实践中，(1) 只解码生成会导致序列推理的高延迟，(2) 传统接口不支持多视频输入，(3) 将输出限制为语言会丢弃对下游视觉任务重要的细粒度视觉细节。我们认为这些限制源于缺乏一种同时保留像素级细节并利用世界知识的表示。我们提出了LinkedOut，一种直接从视频中提取VLLM世界知识的表示，以实现快速推理、支持多视频历史记录，并移除语言瓶颈。LinkedOut 使用VLLMs从原始帧中提取语义上接地、知识导向的标记，由可提示查询和可选辅助模态引导。我们引入了一种跨层知识融合MoE，从丰富的VLLM特征中选择适当的抽象级别，从而实现个性化、可解释和低延迟的推荐。据我们所知，LinkedOut 是第一个在不使用手工制作标签的情况下直接在原始帧上操作的VLLM基视频推荐方法，实现了标准基准上的最佳结果。解释性研究和消融实验证实了层多样性及层内融合的好处，指出了一个实用的路径，该路径充分利用了VLLM世界知识先验和视觉推理，以实现推荐等下游视觉任务。

Summary / 总结

The research aims to address the challenges of deploying Video Large Language Models (VLLMs) for video recommendation by proposing LinkedOut, a representation that integrates world knowledge from videos for fast and interpretable inference. LinkedOut extracts semantically grounded tokens from raw video frames using VLLMs and a cross-layer knowledge fusion MoE, enabling multi-video support and low-latency recommendations. Key findings include state-of-the-art performance on standard benchmarks and confirmed benefits of layer diversity and fusion for downstream vision tasks.

研究旨在通过提出LinkedOut，一种直接从原始视频帧中提取世界知识的表示方法，解决将Video Large Language Models (VLLMs)部署到视频推荐中的挑战。该方法使用VLLMs生成语义上相关的令牌，并由可提示的查询和辅助模态引导，引入了一种跨层知识融合MoE，以实现个性化、可解释和低延迟的推荐。关键发现表明，LinkedOut在标准基准上达到了最先进的效果，且无需手工标注标签，并且解释性研究证实了层多样性与层融合的好处。

AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning

Authors: Tzu-Han Lin, Wei-Lin Chen, Chen-An Li, Hung-yi Lee, Yun-Nung Chen, Yu Meng

First: 2025-12-18T18:50:01+00:00 · Latest: 2025-12-18T18:50:01+00:00

Comments: Preprint. Code and artifacts will be uploaded to https://github.com/hank0316/AdaSearch

Abs · PDF · Code1 · Code2 · Code3

Abstract

Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
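The abstract does not spell out its F1-based decision metric, so the following is only one plausible reading, stated as an assumption: treat "this question needs search" (the model cannot answer it from parametric knowledge) as the positive class, treat "the agent invoked search" as the prediction, and compute F1 over those decisions. The function name `decision_f1` and the five-question example are hypothetical.

```python
def decision_f1(needs_search, did_search):
    """F1 of the search decision, with 'search was needed' as the positive class."""
    tp = sum(1 for n, d in zip(needs_search, did_search) if n and d)
    fp = sum(1 for n, d in zip(needs_search, did_search) if not n and d)
    fn = sum(1 for n, d in zip(needs_search, did_search) if n and not d)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up labels: only the first two questions exceed the model's parametric knowledge.
needs_search = [True, True, False, False, False]
always_search = [True, True, True, True, True]
adaptive = [True, True, False, False, True]

print(round(decision_f1(needs_search, always_search), 3))
print(round(decision_f1(needs_search, adaptive), 3))
```

Unlike a raw call count, this kind of score separates necessary from unnecessary calls: the always-search agent has perfect recall but low precision, so it scores lower than the agent that skips search on questions it can already answer.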

中文标题/摘要

标题：AdaSearch：通过强化学习平衡大型语言模型中的参数知识和搜索

通过强化学习（RL）为大型语言模型（LLMs）配备搜索引擎已成为构建搜索代理的有效方法。然而，过度依赖搜索会引入不必要的成本，并且存在接触到嘈杂或恶意内容的风险，而仅依赖参数知识则存在幻觉的风险。核心挑战在于开发能够适当地平衡参数知识与外部搜索的代理，仅在必要时才调用搜索。先前的工作通过围绕工具调用次数塑造奖励来缓解搜索过度使用的问题。然而，这些惩罚需要大量的奖励工程，提供模糊的信用分配，并且可以被表面上减少调用次数的代理所利用。此外，仅通过调用次数来评估性能混淆了必要的和不必要的搜索，掩盖了真正适应行为的测量。为了解决这些局限性，我们首先通过基于F1的决策度量来量化现有搜索代理的自我知识意识，发现诸如Search-R1等方法往往忽视了可用的参数知识。受这些发现的启发，我们提出了AdaSearch，这是一种简单的两阶段、结果导向的RL框架，将问题解决与是否调用搜索的决策分离，并使这一决策过程变得明确和可解释。这种透明性对于金融和医学问答等高风险领域至关重要，而先前的方法大多忽略了这一点。在多个模型家族和规模上的实验表明，AdaSearch显著提高了知识边界意识，减少了不必要的搜索调用，保持了强大的任务性能，并提供了更透明、可解释的决策行为。

Summary / 总结

AdaSearch is a reinforcement learning framework designed to balance the use of parametric knowledge and external search in large language models. It addresses the limitations of previous methods by quantifying self-knowledge awareness and proposing a two-stage, outcome-driven approach that makes the decision to invoke search explicit and interpretable. Experiments show that AdaSearch enhances knowledge-boundary awareness, reduces unnecessary search calls, maintains strong task performance, and provides more transparent decision behaviors.

AdaSearch 是一个基于强化学习的框架，旨在平衡大型语言模型中参数知识和外部搜索的使用。它通过量化自我知识意识并提出两阶段、结果导向的方法来解决先前方法的局限性，使调用搜索的决策过程变得明确和可解释。实验表明，AdaSearch 提高了知识边界意识，减少了不必要的搜索调用，保持了强大的任务性能，并提供了更透明和可解释的决策行为。

Semi-Supervised Online Learning on the Edge by Transforming Knowledge from Teacher Models

Authors: Jiabin Xue

First: 2025-12-18T18:37:28+00:00 · Latest: 2025-12-18T18:37:28+00:00

Abs · PDF · Code1 · Code2

Abstract

Edge machine learning (Edge ML) enables training ML models using the vast data distributed across network edges. However, many existing approaches assume static models trained centrally and then deployed, making them ineffective against unseen data. To address this, Online Edge ML allows models to be trained directly on edge devices and updated continuously with new data. This paper explores a key challenge of Online Edge ML: "How to determine labels for truly future, unseen data points". We propose Knowledge Transformation (KT), a hybrid method combining Knowledge Distillation, Active Learning, and causal reasoning. In short, KT acts as the oracle in active learning by transforming knowledge from a teacher model to generate pseudo-labels for training a student model. To verify the validity of the method, we conducted simulation experiments with two setups: (1) using a less stable teacher model and (2) a relatively more stable teacher model. Results indicate that when a stable teacher model is given, the student model can eventually reach its expected maximum performance. KT is potentially beneficial for scenarios that meet the following circumstances: (1) when the teacher's task is generic, which means existing pre-trained models might be adequate for its task, so there will be no need to train the teacher model from scratch; and/or (2) when the label for the student's task is difficult or expensive to acquire.
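The core loop described above, a teacher supplying pseudo-labels so that a student can keep learning online from unlabeled data, can be sketched in a few lines. This is a simplified illustration under assumptions: the threshold `teacher`, the one-dimensional data stream, and the `OnlineCentroidStudent` class are hypothetical, teacher and student share the same task here, and the causal-reasoning component of KT is not modeled.

```python
import random

def teacher(x):
    """Hypothetical pretrained teacher: labels a reading as 1 when it exceeds 0.5."""
    return 1 if x > 0.5 else 0

class OnlineCentroidStudent:
    """Nearest-centroid classifier whose class means are updated one sample at a time."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def update(self, x, label):
        self.sums[label] = self.sums.get(label, 0.0) + x
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, x):
        # Pick the class whose running mean is closest to x.
        return min(self.counts, key=lambda c: abs(x - self.sums[c] / self.counts[c]))

rng = random.Random(0)
student = OnlineCentroidStudent()
for _ in range(200):
    x = rng.random()               # unlabeled sample arriving at the edge device
    student.update(x, teacher(x))  # the teacher's pseudo-label replaces a human annotation

print(student.predict(0.1), student.predict(0.9))
```

The student can only be as good as its pseudo-labels, which matches the reported result that a stable teacher lets the student eventually reach its expected maximum performance.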

中文标题/摘要

标题：边缘设备上的半监督在线学习通过从教师模型转化知识

边缘机器学习（Edge ML）允许使用网络边缘分布的数据训练机器学习模型。然而，许多现有方法假设中心训练静态模型然后部署，这使得它们对未见过的数据无效。为了解决这个问题，在线边缘机器学习允许模型直接在边缘设备上进行训练，并不断用新数据进行更新。本文探讨了在线边缘机器学习的关键挑战：“如何为真正未来的未见过的数据点确定标签”。我们提出了知识转化（KT），这是一种结合知识蒸馏、主动学习和因果推理的混合方法。简而言之，KT 通过从教师模型转化知识来生成用于训练学生模型的伪标签，从而充当主动学习中的标注者（oracle）。为了验证该方法的有效性，我们进行了两种设置的仿真实验：（1）使用一个不太稳定的教师模型；（2）一个相对更稳定的教师模型。结果显示，当给定一个稳定的教师模型时，学生模型最终可以达到其预期的最大性能。KT 对于满足以下条件的场景可能有益：（1）当教师的任务是通用的，这意味着现有的预训练模型可能足以完成其任务，因此不需要从头开始训练教师模型；和/或（2）当学生任务的标签难以获取或成本高昂时。

Summary / 总结

This paper addresses the challenge of labeling unseen data in online edge machine learning by proposing Knowledge Transformation (KT), a hybrid method combining Knowledge Distillation, Active Learning, and causal reasoning. KT acts as the oracle in active learning, transforming knowledge from a teacher model into pseudo-labels for training a student model. Simulation experiments with two setups, a less stable and a relatively more stable teacher model, indicate that with a stable teacher the student model can eventually reach its expected maximum performance. KT is potentially beneficial when the teacher's task is generic enough for an existing pre-trained model, or when labels for the student's task are difficult or expensive to acquire.

本文提出了一种名为Knowledge Transformation (KT) 的混合方法，结合了Knowledge Distillation、Active Learning和因果推理，以解决在线边缘机器学习中未见数据的标签问题。KT 从教师模型中提取知识生成伪标签用于训练学生模型。实验结果显示，当使用稳定的教师模型时，学生模型可以达到其预期的最大性能，这使得KT 在教师任务通用且标签获取困难或昂贵的情况下具有潜在应用价值。

RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

Authors: Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia

First: 2025-12-18T18:34:23+00:00 · Latest: 2025-12-18T18:34:23+00:00

Comments: Precise region control and planning for instruction-based image editing. Our project page: https://replan-iv-edit.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io

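Chinese Title / Abstract

Title: RePlan: Reasoning-Guided Region Planning for Complex Instruction-Driven Image Editing

Instruction-driven image editing allows visual modifications to be controlled through natural language, but existing models perform poorly under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We propose RePlan (Region-aligned Planning), a plan-then-execute framework that pairs a vision-language planner with a diffusion editor. The planner decomposes instructions through step-by-step reasoning and explicitly ties them to target regions; the editor then applies the changes with a training-free attention-region injection mechanism, achieving precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning on 1,000 instruction-only examples, which markedly improves reasoning fidelity and format reliability. We also introduce the IV-Edit benchmark, which focuses on fine-grained grounding and knowledge-intensive edits. In IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io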

Summary

RePlan is a plan-then-execute framework for instruction-based image editing that addresses the challenge of IV-Complexity by using a vision-language planner to decompose instructions and ground them to target regions, followed by a diffusion editor that applies changes without iterative inpainting. This approach, enhanced by GRPO-based reinforcement learning, improves reasoning fidelity and format reliability, outperforming strong baselines in regional precision and overall fidelity across complex settings.

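RePlan is a plan-then-execute framework for instruction-based image editing. A vision-language planner decomposes instructions and explicitly ties them to target regions, and a diffusion editor then applies the modifications without iterative inpainting. Strengthened by GRPO-based reinforcement learning, the method improves reasoning fidelity and format reliability, and it outperforms strong baselines in complex scenes on both regional precision and overall fidelity.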
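
The region-grounding step described for RePlan can be illustrated with a small sketch. The abstract does not specify how attention-region injection is implemented, so the following is only one plausible reading: each planned sub-instruction is paired with a boolean mask over image tokens, and an editor could restrict that instruction's influence to the masked tokens. The function names, the bounding-box format, and the example plan are all hypothetical.

```python
def region_token_mask(grid_h, grid_w, box):
    """Return a flat row-major boolean mask over image tokens.

    box is (top, left, bottom, right) in token coordinates, with bottom and
    right exclusive. Tokens inside the box are True.
    """
    top, left, bottom, right = box
    return [
        top <= row < bottom and left <= col < right
        for row in range(grid_h)
        for col in range(grid_w)
    ]


def build_region_masks(grid_h, grid_w, plan):
    """Map each planned sub-instruction to the image tokens it may influence.

    plan is a list of (sub_instruction, box) pairs, as a planner might emit.
    """
    return {text: region_token_mask(grid_h, grid_w, box) for text, box in plan}


# Hypothetical plan for a 4x4 token grid: two edits in disjoint regions.
plan = [
    ("make the mug red", (0, 0, 2, 2)),
    ("remove the sticker", (2, 2, 4, 4)),
]
masks = build_region_masks(4, 4, plan)

# True if any image token is claimed by both sub-instructions.
overlap = any(a and b for a, b in zip(*masks.values()))
```

When the masks do not overlap, as in this toy plan, the two edits touch different tokens and could in principle be applied in a single pass, which is the intuition behind parallel multi-region editing.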

ReinforceGen: Hybrid Skill Policies with Automated Data Generation and Reinforcement Learning

Authors: Zihan Zhou, Animesh Garg, Ajay Mandlekar, Caelan Garrett

First: 2025-12-18T18:32:39+00:00 · Latest: 2025-12-18T18:32:39+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Long-horizon manipulation has been a long-standing challenge in the robotics community. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches an 80% success rate on all tasks with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contribute to an 89% average performance increase. More results and videos are available at https://reinforcegen.github.io/

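Chinese Title / Abstract

Title: ReinforceGen: Hybrid Skill Policies Combining Automated Data Generation and Reinforcement Learning

Long-horizon manipulation has long been a challenge in robotics. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, then fine-tuned through online adaptation and reinforcement learning. Benchmarked on the Robosuite dataset, ReinforceGen reaches an 80% success rate on all tasks with visuomotor control in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contribute to an 89% average performance increase. More results and videos are available at https://reinforcegen.github.io/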

Summary

ReinforceGen is a system designed to address long-horizon manipulation challenges in robotics by integrating task decomposition, data generation, imitation learning, and motion planning. It segments tasks into localized skills and connects them through motion planning, initially training these components with imitation learning on a dataset generated from human demonstrations. The system then fine-tunes these components through reinforcement learning, achieving an 80% success rate on all tasks in the highest reset range setting of the Robosuite dataset. Ablation studies indicate that the fine-tuning approaches contribute to an 89% average performance increase.

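ReinforceGen tackles long-horizon manipulation in robotics by combining task decomposition, data generation, imitation learning, and motion planning. The system breaks a task into localized skills and connects them through motion planning, with initial training done by imitation learning on a dataset generated from human demonstrations. These components are then further refined through reinforcement learning. On the Robosuite dataset, ReinforceGen reaches an 80% success rate with visuomotor control in the highest reset range setting, and ablation studies show that the fine-tuning approaches yield an 89% average performance increase.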

Distributional AGI Safety

Authors: Nenad Tomašev, Matija Franklin, Julian Jacobs, Sébastien Krier, Simon Osindero

First: 2025-12-18T18:29:50+00:00 · Latest: 2025-12-18T18:29:50+00:00

Abs · PDF · Code1 · Code2

Abstract

AI safety and alignment research has predominantly been focused on methods for safeguarding individual AI systems, resting on the assumption of an eventual emergence of a monolithic Artificial General Intelligence (AGI). The alternative AGI emergence hypothesis, where general capability levels are first manifested through coordination in groups of sub-AGI individual agents with complementary skills and affordances, has received far less attention. Here we argue that this patchwork AGI hypothesis needs to be given serious consideration, and should inform the development of corresponding safeguards and mitigations. The rapid deployment of advanced AI agents with tool-use capabilities and the ability to communicate and coordinate makes this an urgent safety consideration. We therefore propose a framework for distributional AGI safety that moves beyond evaluating and aligning individual agents. This framework centers on the design and implementation of virtual agentic sandbox economies (impermeable or semi-permeable), where agent-to-agent transactions are governed by robust market mechanisms, coupled with appropriate auditability, reputation management, and oversight to mitigate collective risks.

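Chinese Title / Abstract

Title: Distributional AGI Safety

AI safety and alignment research has mainly concentrated on methods for safeguarding individual AI systems, on the assumption that a monolithic Artificial General Intelligence (AGI) will eventually emerge. By contrast, the alternative hypothesis of AGI emergence, in which general capability first appears through coordination among sub-AGI individual agents with complementary skills and affordances, has received far less attention. We argue that this patchwork AGI hypothesis deserves serious consideration and should guide the development of corresponding safeguards and mitigations. The rapid deployment of advanced AI agents that can use tools, communicate, and coordinate makes this an urgent safety concern. We therefore propose a framework for distributional AGI safety that goes beyond evaluating and aligning individual agents. The framework centers on designing and implementing virtual agentic sandbox economies (impermeable or semi-permeable), in which agent-to-agent transactions are governed by robust market mechanisms and are complemented by appropriate auditability, reputation management, and oversight to mitigate collective risks.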

Summary

This paper addresses the need to consider the patchwork AGI hypothesis, where AGI capabilities emerge through groups of sub-AGI agents, rather than a monolithic AGI. It proposes a framework for distributional AGI safety that involves creating virtual agentic sandbox economies with robust market mechanisms and oversight to mitigate collective risks. Key findings include the importance of evaluating and aligning not just individual agents but also the interactions and transactions between them.

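The paper argues for taking seriously the patchwork AGI hypothesis, in which AGI emerges through groups of sub-AGI agents, as an alternative to the monolithic AGI hypothesis. It proposes a framework for distributional AGI safety centered on virtual agentic sandbox economies with robust market mechanisms, auditability, reputation management, and oversight, in order to mitigate the collective risks posed by coordinating sub-AGI agents.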

TOGGLE: Temporal Logic-Guided Large Language Model Compression for Edge

Authors: Khurram Khalil, Khaza Anuarul Hoque

First: 2025-12-18T18:27:42+00:00 · Latest: 2025-12-18T18:27:42+00:00

Comments: Published in the IEEE ICCAD 2025 conference

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) deliver exceptional performance across natural language tasks but demand substantial computational resources, limiting their deployment on resource-constrained edge devices. Existing compression techniques, such as quantization and pruning, often degrade critical linguistic properties and lack formal guarantees for preserving model behavior. We propose Temporal Logic-Guided Large Language Model Compression (TOGGLE), a novel framework that leverages Signal Temporal Logic (STL) to formally specify and enforce linguistic properties during compression. TOGGLE employs an STL robustness-guided Bayesian optimization to systematically explore layer-wise quantization and pruning configurations, generating compressed models that formally satisfy specified linguistic constraints without retraining or fine-tuning. Evaluating TOGGLE on four LLM architectures (GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, and Mistral 7B), we achieve up to 3.3x reduction in computational costs (FLOPs) and up to a 68.8% reduction in model size while satisfying all linguistic properties. TOGGLE represents the first integration of formal methods into LLM compression, enabling efficient, verifiable deployment of LLMs on edge hardware.

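Chinese Title / Abstract

Title: TOGGLE: Temporal Logic-Guided Large Language Model Compression for Edge Devices

Large Language Models (LLMs) perform very well on natural language tasks but require substantial computational resources, which limits their deployment on resource-constrained edge devices. Existing compression techniques such as quantization and pruning often damage critical linguistic properties and lack formal guarantees that model behavior is preserved. We propose Temporal Logic-Guided Large Language Model Compression (TOGGLE), a novel framework that uses Signal Temporal Logic (STL) to formally specify and enforce linguistic properties during compression. TOGGLE uses STL robustness-guided Bayesian optimization to systematically explore layer-wise quantization and pruning configurations, producing compressed models that formally satisfy the specified linguistic constraints without retraining or fine-tuning. Evaluating TOGGLE on four LLM architectures (GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, and Mistral 7B), we achieve up to a 3.3x reduction in computational cost (FLOPs) and up to a 68.8% reduction in model size while satisfying all linguistic properties. TOGGLE is the first integration of formal methods into LLM compression, enabling efficient and verifiable deployment of LLMs on edge hardware.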

Summary

TOGGLE is a novel framework that uses Signal Temporal Logic (STL) to guide the compression of Large Language Models (LLMs) for edge devices. It employs STL robustness-guided Bayesian optimization to explore quantization and pruning configurations, ensuring that the compressed models satisfy specified linguistic constraints without retraining. TOGGLE achieves up to 3.3x reduction in computational costs and up to 68.8% reduction in model size while preserving all linguistic properties across different LLM architectures.

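TOGGLE is a new framework that uses Signal Temporal Logic (STL) to compress Large Language Models (LLMs) while preserving critical linguistic properties. It uses STL robustness-guided Bayesian optimization to explore quantization and pruning configurations, ensuring that the compressed models satisfy the specified linguistic constraints. Evaluation on four LLM architectures shows up to a 3.3x reduction in computational cost and up to a 68.8% reduction in model size, with all linguistic properties preserved.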
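
The idea of scoring a compressed model against an STL property can be illustrated with a toy sketch. For a property of the form "always (signal >= threshold)", the standard quantitative robustness is the worst-case margin over time, and a configuration satisfies the property when that margin is non-negative. Everything else here is an assumption made for illustration: the similarity signal is a made-up formula rather than a measurement, the bit-width choices are arbitrary, and an exhaustive search stands in for the Bayesian optimization that the abstract describes.

```python
from itertools import product


def always_ge_robustness(signal, threshold):
    """Robustness of the STL formula G(signal >= threshold).

    The value is the smallest margin over all time steps; it is non-negative
    exactly when the property holds at every step.
    """
    return min(value - threshold for value in signal)


def toy_similarity_signal(bits_per_layer, steps=5):
    """Hypothetical similarity between compressed and original model outputs.

    Fewer bits per layer mean more degradation, and the degradation grows
    slightly with each generation step. This is not a real measurement.
    """
    degradation = sum(0.4 / bits for bits in bits_per_layer) / len(bits_per_layer)
    return [1.0 - degradation * (1 + 0.1 * t) for t in range(steps)]


def cheapest_satisfying_config(num_layers=3, choices=(2, 4, 8), threshold=0.85):
    """Return (config, cost, robustness) for the cheapest config that satisfies
    the property, or None if no config does. Cost is the total bit count."""
    best = None
    for config in product(choices, repeat=num_layers):
        rho = always_ge_robustness(toy_similarity_signal(config), threshold)
        cost = sum(config)
        if rho >= 0 and (best is None or cost < best[1]):
            best = (config, cost, rho)
    return best


best = cheapest_satisfying_config()
```

The design point this sketch tries to capture is that robustness is a real number rather than a pass/fail flag, so a search procedure can use it as a guide toward configurations that just satisfy the constraint.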

Wrist Photoplethysmography Predicts Dietary Information

Authors: Kyle Verrier, Achille Nazaret, Joseph Futoma, Andrew C. Miller, Guillermo Sapiro

First: 2025-11-24T16:12:03+00:00 · Latest: 2025-12-18T18:27:29+00:00

Comments: 20 pages, 2 figures

Abs · PDF · Code1 · Code2

Abstract

Whether wearable photoplethysmography (PPG) contains dietary information remains unknown. We trained a language model on 1.1M meals to predict meal descriptions from PPG, aligning PPG to text. PPG nontrivially predicts meal content; predictability decreases for PPGs farther from meals. This transfers to dietary tasks: PPG increases AUC by 11% for intake and satiety across held-out and independent cohorts, with gains robust to text degradation. Wearable PPG may enable passive dietary monitoring.

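Chinese Title / Abstract

Title: Wrist Photoplethysmography Predicts Dietary Information

Whether dietary information can be extracted from wearable photoplethysmography (PPG) is still unknown. We trained a language model on 1.1 million meals to predict meal descriptions from PPG, aligning PPG to text. PPG nontrivially predicts meal content, and predictability weakens the farther the PPG is from the meal. This carries over to dietary tasks: PPG increases AUC by 11% for intake and satiety across held-out and independent cohorts, and the gains are robust to text degradation. Wearable PPG may enable passive dietary monitoring.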

Summary

The study investigates whether wrist photoplethysmography (PPG) can predict dietary information. By training a language model on 1.1 million meals, the researchers were able to align PPG data with meal descriptions. The results show that PPG can nontrivially predict meal content, with predictability decreasing as the time gap between PPG and meal increases. This finding is further validated in dietary tasks, where PPG improves AUC by 11% for intake and satiety across different cohorts, demonstrating its potential for passive dietary monitoring.

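The study examines whether wrist photoplethysmography (PPG) can predict dietary information. By training a language model on 1.1 million meals, the researchers aligned PPG data with meal descriptions. The results show that PPG can nontrivially predict meal content, with predictive accuracy falling as the time interval between the PPG and the meal grows. The finding is further supported in dietary tasks, where PPG improves AUC for intake and satiety by 11% across different cohorts, showing its potential for passive dietary monitoring.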
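
For readers less familiar with the metric behind the reported gain, AUC can be computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below implements that pairwise definition in plain Python. The labels and scores are invented toy values, not data from the paper, and the abstract does not state whether the 11% gain is absolute or relative, so the example makes no claim about that.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Invented toy scores for a binary intake task (1 = ate, 0 = did not eat).
labels = [1, 1, 1, 0, 0, 0]
text_only_scores = [0.6, 0.4, 0.7, 0.5, 0.3, 0.65]
text_plus_ppg_scores = [0.8, 0.6, 0.9, 0.5, 0.3, 0.65]

auc_text = auc(labels, text_only_scores)
auc_ppg = auc(labels, text_plus_ppg_scores)
```

Comparing `auc_text` and `auc_ppg` on the same labels is the kind of before/after comparison that a statement such as "PPG increases AUC" refers to.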

GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

Authors: Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, Marjan Ghazvininejad

First: 2025-12-18T18:26:56+00:00 · Latest: 2025-12-18T18:26:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is better aligned with human judgment and argue is less likely to drift from human alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.

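Chinese Title / Abstract

Title: GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

Automating the evaluation of Text-to-Image (T2I) models is challenging: a judge model must be used to score correctness, and the test prompts must be chosen so that they are challenging for current T2I models but not for the judge. We argue that satisfying these constraints can lead to benchmark drift, in which static benchmark judges fail to keep up with the capabilities of newer models over time. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval agreed closely with human judgment when it was released, it has since moved far from human judgment, producing an absolute error of as much as 17.7% for current models. Drift of this size strongly suggests that GenEval has been saturated for some time, which we verify through a large-scale human study. To fill this evaluation gap, we introduce a new benchmark, GenEval 2, which covers primitive visual concepts more broadly and has a higher degree of compositionality, and we show that it is more challenging for current models. We also introduce Soft-TIFA, an evaluation method that combines judgments for visual primitives; we show that it agrees better with human judgment and argue that, compared with more holistic judges such as VQAScore, it is less likely to lose agreement with human judgment over time. Although we hope GenEval 2 will serve as a strong benchmark for many years, avoiding benchmark drift is far from guaranteed, and our work more broadly highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.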

Summary

The research addresses benchmark drift in Text-to-Image (T2I) model evaluation. The study shows that the original GenEval has drifted significantly from human judgment, with an absolute error of as much as 17.7% for current models. To mitigate this, the authors introduce GenEval 2, which improves coverage of primitive visual concepts and compositionality, together with a new evaluation method, Soft-TIFA. Soft-TIFA is shown to align better with human judgment and is argued to be less prone to drift over time. The work emphasizes the need for continual audits and improvements in T2I benchmarks.

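The research addresses benchmark drift in Text-to-Image (T2I) model evaluation by introducing GenEval 2, an improvement on the original GenEval benchmark. The study shows that the original GenEval has deviated significantly from human judgment, with an absolute error of as much as 17.7% for current models. To mitigate this, GenEval 2 improves visual concept coverage and compositional complexity, making it more challenging for current models. The research also introduces the Soft-TIFA evaluation method, which combines judgments on visual primitives, agrees more closely with human judgment, and is less likely than holistic judges such as VQAScore to drift away from human alignment over time.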
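
Two quantities in this entry lend themselves to a short worked sketch: a score that combines per-primitive judgments, and the absolute error between a benchmark's score and human judgment. The abstract does not give Soft-TIFA's exact formula, so the arithmetic mean of judge probabilities below is an assumption chosen for simplicity, and the function names and example numbers are hypothetical.

```python
def soft_primitive_score(primitive_probs):
    """Combine per-primitive judgments into a single soft score.

    Each entry is a judge's probability that one visual primitive in the
    prompt (for example an object, a count, or a color) is satisfied.
    Averaging is an assumed aggregation, not necessarily the paper's.
    """
    if not primitive_probs:
        raise ValueError("need at least one primitive judgment")
    return sum(primitive_probs) / len(primitive_probs)


def hard_all_or_nothing_score(primitive_probs, cutoff=0.5):
    """Contrast case: the image passes only if every primitive passes."""
    return 1.0 if all(p >= cutoff for p in primitive_probs) else 0.0


def absolute_error(benchmark_score, human_score):
    """Gap between an automated score and the human-judged score,
    both expressed on the same scale."""
    return abs(benchmark_score - human_score)


# Hypothetical judgments for a prompt with three primitives.
probs = [1.0, 0.5, 0.0]
soft = soft_primitive_score(probs)
hard = hard_all_or_nothing_score(probs)
```

The soft score gives partial credit for the primitives that were rendered correctly, whereas the all-or-nothing variant collapses the same judgments to zero.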

Meta-RL Induces Exploration in Language Agents

Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic

First: 2025-12-18T18:22:17+00:00 · Latest: 2025-12-18T18:22:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.

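Chinese Title / Abstract

Title: Meta-RL Induces Exploration in Language Agents

Reinforcement learning (RL) has made it possible for large language model (LLM) agents to interact with an environment and solve multi-turn, long-horizon tasks. However, RL-trained agents often perform poorly on tasks that require active exploration and cannot adapt efficiently from trial-and-error experience. In this paper we present LaMer, a general Meta-RL framework that enables LLM agents to explore actively at test time and learn from environment feedback. LaMer has two key components: (i) a cross-episode training framework that encourages exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, which lets the agent adapt its policy from task feedback signals without gradient updates. Experiments in a range of environments show that LaMer significantly improves on RL baselines, with performance gains of 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively. LaMer also generalizes better than RL-trained agents to more challenging or previously unseen tasks. Overall, our results show that Meta-RL offers a principled way to induce exploration in language agents, allowing them to adapt more robustly to new environments through learned exploration strategies.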

Summary

This paper addresses the challenge of exploration in reinforcement learning (RL)-trained language agents, which often fail to efficiently explore and adapt in tasks requiring active exploration. The authors introduce LaMer, a Meta-RL framework that includes a cross-episode training framework for encouraging exploration and long-term reward optimization, and in-context policy adaptation via reflection. Experiments show that LaMer outperforms RL baselines by 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively, and demonstrates better generalization to new tasks.

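This paper targets the difficulty that reinforcement learning (RL)-trained language model agents have with active exploration: such agents often fail to explore and adapt efficiently from trial-and-error experience. The authors propose LaMer, a Meta-RL framework that includes a cross-episode training framework to encourage exploration and optimize long-term reward, together with in-context policy adaptation via reflection. Experiments show that LaMer outperforms RL baselines by 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively, and that it also generalizes to new tasks better than RL-trained agents.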
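
The two ingredients named for LaMer, cross-episode credit and in-context adaptation via reflection, can be made concrete with a toy sketch. It assumes a trivially simple guessing task in place of a real environment, a hand-written rule in place of an LLM policy, and a discounted sum over episodes as the cross-episode objective. None of these details come from the paper; they only illustrate how an agent can improve across episodes by carrying textual notes forward while its parameters stay fixed.

```python
def cross_episode_return(episode_rewards, gamma=0.9):
    """Discounted sum of per-episode rewards over a multi-episode trial.

    Under this (assumed) objective, a failed exploratory episode is still
    worthwhile if it makes later episodes in the same trial succeed.
    """
    return sum((gamma ** k) * r for k, r in enumerate(episode_rewards))


def run_trial(hidden, candidates, num_episodes):
    """Toy multi-episode trial: one guess per episode at a hidden value.

    Failed guesses are stored as textual reflections and read back as
    context in the next episode. No parameters are updated at any point.
    Assumes hidden is one of the candidates.
    """
    reflections = []
    rewards = []
    for _ in range(num_episodes):
        # "Read" the reflections to find guesses already known to fail.
        ruled_out = {int(note.split()[-1]) for note in reflections}
        guess = next(c for c in candidates if c not in ruled_out)
        reward = 1.0 if guess == hidden else 0.0
        rewards.append(reward)
        if reward == 0.0:
            reflections.append(f"guess failed: {guess}")
    return rewards, reflections


rewards, reflections = run_trial(hidden=3, candidates=[1, 2, 3, 4], num_episodes=4)
trial_return = cross_episode_return(rewards, gamma=0.5)
```

In this toy run the first two episodes fail, their reflections rule out those guesses, and the remaining episodes succeed, which is the pattern of exploring early and exploiting later that a cross-episode objective rewards.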

OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Authors: Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou, Rishi Shiv, Yaqi Li, Haoyu Xiong, Crystal Elaine Owens, Yilun Du, Yiyue Luo, Xianyi Cheng, Antonio Torralba, Wojciech Matusik, Paul Pu Liang

First: 2025-12-18T18:18:17+00:00 · Latest: 2025-12-18T18:18:17+00:00

Comments: https://opentouch-tactile.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

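Chinese Title / Abstract

Title: OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

The human hand is our main interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully contact is made. Robust wearable tactile sensors are scarce, and no existing in-the-wild dataset aligns first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.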

Summary

The paper introduces OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, which includes synchronized video, touch, and pose data. The dataset aims to bridge the gap between visual perception and physical interaction. Key findings show that tactile signals are effective for grasp understanding and cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this dataset and benchmarks, the authors seek to advance multimodal egocentric perception and robotic manipulation.

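The paper introduces OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 annotated clips. The dataset aims to bridge the gap between visual perception and physical interaction. Using it, the authors propose retrieval and classification benchmark tasks that explore how touch grounds perception and action. Key findings show that tactile signals are highly effective for grasp understanding and cross-modal alignment, and that they can be reliably retrieved from in-the-wild video queries. By releasing this annotated dataset, the authors hope to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.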
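
A retrieval benchmark of the kind mentioned for OpenTouch is commonly scored with recall at k: for each video query, rank all touch clips by similarity and check whether the paired clip appears among the top k. The sketch below shows that computation in plain Python. It assumes that embeddings already exist, that query i is paired with touch clip i, and that cosine similarity is the ranking function; the abstract does not specify the benchmark's actual metric or encoders, and the vectors here are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(video_embs, touch_embs, k):
    """Fraction of video queries whose paired touch clip ranks in the top k.

    Query i is assumed to be paired with touch clip i.
    """
    hits = 0
    for i, query in enumerate(video_embs):
        ranked = sorted(
            range(len(touch_embs)),
            key=lambda j: cosine(query, touch_embs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(video_embs)


# Invented 2-D embeddings for three paired video and touch clips.
video_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
touch_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

r1 = recall_at_k(video_embs, touch_embs, k=1)
r2 = recall_at_k(video_embs, touch_embs, k=2)
r3 = recall_at_k(video_embs, touch_embs, k=3)
```

Recall at k can only stay the same or grow as k increases, and it reaches 1.0 once k equals the number of candidates, which makes it a convenient sanity check when wiring up a retrieval evaluation.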