arXiv Paper Digest

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen

First: 2025-12-18T18:59:59+00:00 · Latest: 2025-12-18T18:59:59+00:00

Comments: Project page and code: https://worldcanvas.github.io/

Abstract

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.

Summary

WorldCanvas is a framework for user-directed simulation of world events that combines text, trajectories, and reference images. Unlike text-only or trajectory-only methods, it uses trajectories for motion, timing, and visibility, text for semantic intent, and reference images for object identity, producing coherent, controllable events such as multi-agent interactions and object entry/exit. The generated videos show temporal coherence and emergent consistency, preserving object identity and scene despite temporary disappearance. The project page and code are available at https://worldcanvas.github.io/.

Next-Embedding Prediction Makes Strong Vision Learners

Authors: Sihan Xu, Ziqiao Ma, Wenhao Chai, Xuweiyi Chen, Weiyang Jin, Joyce Chai, Saining Xie, Stella X. Yu

First: 2025-12-18T18:59:58+00:00 · Latest: 2025-12-18T18:59:58+00:00

Comments: Project Page: https://sihanxu.me/nepa

Abstract

Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.

Summary

This study applies the principles of generative pretraining to vision, proposing Next-Embedding Predictive Autoregression (NEPA), in which a model learns to predict future patch embeddings from past ones. A simple Transformer pretrained on ImageNet-1k with this as its sole objective, without pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads, attains 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transfers effectively to semantic segmentation on ADE20K.
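
As a rough illustration of the objective described in the abstract, the following Python sketch scores the prediction at position t against the embedding at position t+1. The cosine-distance loss, the function names, and the toy vectors are assumptions made for illustration; the abstract does not specify the loss, and stop-gradient is only mimicked here by treating the target embeddings as constants.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def next_embedding_loss(predictions, embeddings):
    """Average distance between the prediction at t and the embedding at t+1.

    Assumptions: predictions[t] was computed only from embeddings[0..t]
    (causal masking), and the targets are plain constants, which stands in
    for a stop-gradient on the target branch.
    """
    pairs = list(zip(predictions[:-1], embeddings[1:]))
    return sum(cosine_distance(p, e) for p, e in pairs) / len(pairs)

# Toy data: three 2-D "patch embeddings" and predictions that point
# in the same direction as the next embedding.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
predictions = [[0.0, 2.0], [3.0, 3.0], [0.5, 0.5]]
loss = next_embedding_loss(predictions, embeddings)
```

The last prediction has no target and is dropped, which mirrors how autoregressive objectives shift targets by one position.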

EasyV2V: A High-quality Instruction-based Video Editing Framework

Authors: Jinjie Mai, Chaoyang Wang, Guocheng Gordon Qian, Willi Menapace, Sergey Tulyakov, Bernard Ghanem, Peter Wonka, Ashkan Mirzaei

First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00

Comments: Project page: https://snap-research.github.io/easyv2v/

Abstract

While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce EasyV2V, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: https://snap-research.github.io/easyv2v/

Summary

EasyV2V is a simple framework for instruction-based video editing that addresses challenges of consistency, control, and generalization. On the data side, it builds diverse video pairs by composing existing experts with fast inverses and lifts image edit pairs into videos via single-frame supervision. On the model side, it adapts a pretrained text-to-video model with simple sequence concatenation and light LoRA fine-tuning. EasyV2V accepts flexible inputs such as video+text, video+mask+text, and video+mask+reference+text, and achieves state-of-the-art results, surpassing concurrent and commercial systems.

DVGT: Driving Visual Geometry Transformer

Authors: Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu

First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00

Comments: Code is available at https://github.com/wzzheng/DVGT

Abstract

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

Summary

DVGT targets dense 3D geometry perception for autonomous driving, addressing the lack of a driving-targeted model that adapts to different scenarios and camera configurations. It extracts visual features with a DINO backbone and alternates intra-view local, cross-view spatial, and cross-frame temporal attention to infer geometric relations across images. DVGT predicts metric-scaled geometry directly from image sequences without relying on precise camera parameters and, trained on a mixture of driving datasets, significantly outperforms existing models across various scenarios.

Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification

Authors: Qihao Liu, Chengzhi Mao, Yaojie Liu, Alan Yuille, Wen-Sheng Chu

First: 2025-12-18T18:59:57+00:00 · Latest: 2025-12-18T18:59:57+00:00

Comments: project page: https://auditdm.github.io/

Abstract

Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
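
To make the idea of "maximizing disagreement among target models" concrete, here is a minimal Python sketch of one possible disagreement measure. The abstract does not describe the actual reward used by AuditDM; the pairwise-mismatch score and the function name `disagreement_score` are assumptions chosen purely for illustration.

```python
from itertools import combinations

def disagreement_score(answers):
    """Fraction of target-model pairs whose answers differ.

    Hypothetical proxy: 0.0 means all models agree, 1.0 means every
    pair of models gives a different answer.
    """
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

# Three target models answer the same generated question.
score = disagreement_score(["cat", "cat", "dog"])
```

Under this assumption, an auditor would be rewarded more for a question that splits the target models than for one they all answer identically.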

AdaTooler-V: Adaptive Tool-Use for Images and Videos

Authors: Chaoyang Wang, Kaituo Feng, Dongyang Chen, Zhongyu Wang, Zhixun Li, Sicheng Gao, Meng Meng, Xu Zhou, Manyuan Zhang, Yuzhang Shang, Xiangyu Yue

First: 2025-12-18T18:59:55+00:00 · Latest: 2025-12-18T18:59:55+00:00

Comments: Project page: https://github.com/CYWang735/AdaTooler-V

Abstract

Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary model GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.

Summary

AdaTooler-V is an MLLM that uses tools adaptively by first deciding whether a visual problem actually requires them. It introduces AT-GRPO, a reinforcement learning algorithm that adjusts reward scales according to each sample's Tool Benefit Score, so that tool invocation is encouraged only when it helps. Across twelve benchmarks AdaTooler-V outperforms existing methods, and AdaTooler-V-7B reaches 89.8% accuracy on the high-resolution benchmark V*.
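
The reward-scaling idea can be sketched in a few lines of Python. The abstract does not define the Tool Benefit Score or the AT-GRPO reward, so everything below is hypothetical: the score is assumed to be accuracy with tools minus accuracy without tools, and the scaling rule, function names, and `scale` parameter are illustrative only.

```python
def tool_benefit_score(acc_with_tools, acc_without_tools):
    """Hypothetical definition: how much tools help on a given sample."""
    return acc_with_tools - acc_without_tools

def scaled_tool_reward(base_reward, used_tools, benefit, scale=1.0):
    """Adjust the reward for a rollout depending on whether tools helped.

    Assumption: rollouts that skip tools keep the base reward, while
    rollouts that call tools gain a bonus (or a penalty) proportional
    to the sample's benefit score.
    """
    if not used_tools:
        return base_reward
    return base_reward + scale * benefit

helpful = scaled_tool_reward(1.0, True, tool_benefit_score(0.9, 0.5))
wasteful = scaled_tool_reward(1.0, True, tool_benefit_score(0.6, 0.8))
no_tools = scaled_tool_reward(1.0, False, tool_benefit_score(0.6, 0.8))
```

With these toy numbers, calling tools is rewarded above the base reward only on the sample where tools improve accuracy.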

Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

Authors: Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille

First: 2025-12-18T18:59:54+00:00 · Latest: 2025-12-18T18:59:54+00:00

Abstract

Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.

Summary

The paper introduces Generative Adversarial Reasoner, a framework that enhances LLM reasoning through adversarial reinforcement learning. It co-evolves an LLM reasoner and an LLM-based discriminator, using a compute-efficient review schedule that splits each reasoning chain into slices for the discriminator to evaluate. The resulting dense, well-calibrated step-level rewards improve credit assignment and sample efficiency, giving consistent gains over strong baselines on mathematical benchmarks. On AIME24, scores for DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B rise by 7.3 and 10.0 points, respectively.
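
The review schedule, which partitions a reasoning chain into slices of comparable length, might look roughly like the Python sketch below. The greedy grouping rule, the character-count threshold, and the function name are assumptions; the abstract does not say how slice boundaries are chosen.

```python
def partition_into_slices(steps, target_chars):
    """Greedily group whole reasoning steps into slices of similar size.

    Assumption: each step is already a logically complete unit, so a
    slice is closed as soon as its total length reaches target_chars.
    """
    slices, current, size = [], [], 0
    for step in steps:
        current.append(step)
        size += len(step)
        if size >= target_chars:
            slices.append(current)
            current, size = [], 0
    if current:  # keep any trailing steps as a final, shorter slice
        slices.append(current)
    return slices

steps = [
    "Let x = 3.",
    "Then 2x = 6.",
    "Add 4 to get 10.",
    "So the answer is 10.",
]
slices = partition_into_slices(steps, target_chars=20)
```

Each slice would then be passed to the discriminator for a soundness judgment, giving one reward signal per slice rather than one per chain.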

Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates

Authors: Nikhil Prakash, Donghao Ren, Dominik Moritz, Yannick Assogba

First: 2025-12-18T18:59:46+00:00 · Latest: 2025-12-18T18:59:46+00:00

Comments: 18 pages, 3 figures

Abstract

Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing specific tasks. Additionally, it has been shown that model performance improvement through fine-tuning often results from the strengthening of existing circuits in the model. Taken together, these findings suggest the possibility of intervening directly on such circuits to make precise, task-targeted updates. Motivated by these findings, we propose a novel method called Constructive Circuit Amplification which identifies pivotal tokens from model reasoning traces as well as model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. These results demonstrate that targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components.

Summary

The study aims to improve mathematical reasoning in large language models (LLMs) by directly updating the sparse subnetworks, or circuits, responsible for the task. The proposed method, Constructive Circuit Amplification, identifies pivotal tokens in reasoning traces and the model components responsible for the task, then updates only those components. It improves accuracy by up to 11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA.
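
Updating only a selected set of components can be illustrated with a masked gradient step. This Python sketch is a toy under stated assumptions: components are single scalars, the component names are invented, and a plain SGD step stands in for whatever update rule the paper actually uses.

```python
def masked_update(params, grads, selected, lr=0.1):
    """Apply a gradient step only to the components listed in `selected`.

    All other components are returned unchanged, which is the property
    the sparse-update idea relies on.
    """
    return {
        name: value - lr * grads[name] if name in selected else value
        for name, value in params.items()
    }

# Hypothetical component names and values.
params = {"head_0": 1.0, "head_1": 2.0, "mlp_0": 3.0}
grads = {"head_0": 0.5, "head_1": 0.5, "mlp_0": 0.5}
updated = masked_update(params, grads, selected={"head_1"})
```

Here only `head_1` moves, while `head_0` and `mlp_0` keep their original values.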

Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Authors: Peter Chen, Xiaopeng Li, Ziniu Li, Wotao Yin, Xi Chen, Tianyi Lin

First: 2025-12-18T18:59:27+00:00 · Latest: 2025-12-18T18:59:27+00:00

Comments: 35 pages

Abstract

This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.

Summary

This paper investigates the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), focusing on the roles of spurious rewards and entropy minimization. It finds that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. The authors also propose a reward-misalignment model that explains why spurious rewards can enhance performance beyond contaminated settings, clarifying the mechanisms behind spurious-reward benefits and offering principles for more effective RLVR training.
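
Two quantities from this discussion, policy entropy and clipping, can be written down directly. The Python sketch below assumes that "clipping" refers to the standard PPO-style ratio clipping used in GRPO-like objectives; the abstract does not state the exact objective, and the function names and `eps` default are illustrative.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one sample.

    ratio is new_prob / old_prob; the clip keeps the update from moving
    the policy too far from the one that generated the sample.
    """
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)

uniform_entropy = policy_entropy([0.5, 0.5])   # maximal for two actions
peaked_entropy = policy_entropy([1.0, 0.0])    # fully deterministic
positive_case = clipped_surrogate(1.5, 1.0)    # gain is capped at 1 + eps
negative_case = clipped_surrogate(0.5, -1.0)   # clipped term is the minimum
```

Lower entropy corresponds to the "more confident and deterministic outputs" mentioned in the abstract.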

SFTok: Bridging the Performance Gap in Discrete Tokenizers

Authors: Qihang Rao, Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu

First: 2025-12-18T18:59:04+00:00 · Latest: 2025-12-18T18:59:04+00:00

Comments: Under review. Code is available at https://github.com/Neur-IO/SFTok

Abstract

Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose SFTok, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating self-forcing guided visual reconstruction and debias-and-fitting training strategy, SFTok resolves the training-inference inconsistency in multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).

Summary

SFTok aims to close the performance gap between discrete and continuous tokenizers in multimodal models, particularly for high-resolution image generation. It uses a multi-step iterative mechanism with self-forcing guided visual reconstruction and a debias-and-fitting training strategy to improve image reconstruction quality. At a high compression rate of 64 tokens per image, SFTok achieves state-of-the-art reconstruction on ImageNet (rFID = 1.21) and performs well in class-to-image generation (gFID = 2.29).

MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Authors: Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath

First: 2025-12-18T18:59:03+00:00 · Latest: 2025-12-18T18:59:03+00:00

Comments: 25 pages, 10 figures. Project page: https://hybridrobotics.github.io/MomaGraph/

Abstract

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.

Summary

MomaGraph addresses the limitations of previous scene graph representations by integrating spatial-functional relationships and part-level interactive elements, which are crucial for embodied task planning. To support this, the authors introduce MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, and MomaGraph-Bench, a systematic evaluation suite. They further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning, which predicts task-oriented scene graphs and serves as a zero-shot task planner. Experiments show that MomaGraph-R1 achieves 71.6% accuracy on the benchmark, 11.4% above the best baseline.
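
A state-aware scene graph with both spatial and functional relations can be pictured as a small data structure. The Python sketch below is an assumption-laden illustration; the class names, fields, and relation labels are invented here and are not the MomaGraph schema.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    name: str
    state: str                                  # e.g. "closed", "on"
    parts: list = field(default_factory=list)   # actionable parts

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (subject, relation, object, kind)

    def add_object(self, node):
        self.nodes[node.name] = node

    def relate(self, subject, relation, obj, kind):
        """kind is 'spatial' or 'functional' in this sketch."""
        self.edges.append((subject, relation, obj, kind))

    def update_state(self, name, new_state):
        """Temporal update: the graph is not a static snapshot."""
        self.nodes[name].state = new_state

    def relations_of(self, name):
        return [edge for edge in self.edges if edge[0] == name]

graph = SceneGraph()
graph.add_object(ObjectNode("cabinet", "closed", parts=["handle"]))
graph.add_object(ObjectNode("mug", "clean"))
graph.relate("mug", "inside", "cabinet", "spatial")
graph.relate("handle", "opens", "cabinet", "functional")
graph.update_state("cabinet", "open")
```

A planner following a Graph-then-Plan scheme would read such a graph to decide, for example, that the cabinet must be opened via its handle before the mug can be reached.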

SceneDiff: A Benchmark and Method for Multiview Object Change Detection

Authors: Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou, Shenlong Wang, Derek Hoiem

First: 2025-12-18T18:59:02+00:00 · Latest: 2025-12-18T18:59:02+00:00

Abstract

We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.

Summary / 总结

The research addresses detecting objects that were added, removed, or moved between two captures of the same scene taken at different times, which matters for applications such as robotic tidying and construction monitoring. The authors introduce the SceneDiff Benchmark, a multiview change detection benchmark with object instance annotations (350 video pairs), and the SceneDiff method, a training-free approach that uses pretrained 3D, segmentation, and image encoding models to align captures in 3D, extract object regions, and compare spatial and semantic region features. The method outperforms existing approaches, with relative AP improvements of 94% and 37.4% on multi-view and two-view benchmarks respectively.

研究旨在检测同一场景在不同时间拍摄的两幅图像或视频之间对象的变化，这对于机器人整理和建筑进度及安全监控等应用至关重要。SceneDiff方法利用预训练的3D、分割和图像编码模型在3D中对齐捕获图像，提取对象区域，并比较空间和语义特征以检测变化。实验表明，该方法在多视图和两视图基准上的表现显著优于现有方法，相对AP改进分别为94%和37.4%。

Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Authors: Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu

First: 2025-12-18T18:59:01+00:00 · Latest: 2025-12-18T18:59:01+00:00

Comments: Project website: https://egoman-project.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.

中文标题/摘要

标题：从推理到运动：基于第一人称人类互动视频的3D手部轨迹预测学习

先前的3D手部轨迹预测工作受限于将运动与语义监督脱钩的数据集以及弱化推理与动作联系的模型。为解决这些问题，我们首先提出了EgoMAN数据集，这是一个用于交互阶段感知的3D手部轨迹预测的大规模第一人称数据集，包含219,000个6自由度轨迹和300万结构化问答对，用于语义、空间和运动推理。我们随后引入了EgoMAN模型，这是一种通过轨迹标记接口将视觉语言推理与运动生成联系起来的推理到运动框架。通过逐步训练使推理与运动动力学对齐，我们的方法能够生成准确且阶段感知的轨迹，并在真实场景中泛化。

Summary / 总结

The research aims to improve 3D hand trajectory prediction by addressing limitations in existing datasets and models. It introduces the EgoMAN dataset, which includes 219K 6DoF hand trajectories and 3M structured QA pairs for reasoning, and the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning with motion generation. The model is trained to align reasoning with motion dynamics, resulting in accurate and stage-aware trajectories that generalize well across real-world scenes.

研究旨在通过解决现有数据集和模型的局限性，提高3D手部轨迹预测的准确性。研究引入了EgoMAN数据集，包含219K 6DoF轨迹和3M结构化问答对用于推理，并提出了EgoMAN模型，这是一种将视觉语言推理与运动生成链接的推理到运动框架。该模型通过使推理与运动动力学对齐，实现了准确且场景泛化的阶段感知轨迹。

Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

Authors: Kaixin Ding, Yang Zhou, Xi Chen, Miao Yang, Jiarong Ou, Rui Chen, Xin Tao, Hengshuang Zhao

First: 2025-12-18T18:57:58+00:00 · Latest: 2025-12-18T18:57:58+00:00

Comments: project page: https://kxding.github.io/project/Alchemist/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches to Text-to-Image data filtering rely on costly manual curation or heuristic scoring based on single-dimensional features. Although meta-learning-based methods have been explored for LLMs, they have not been adapted to image modalities. To this end, we propose **Alchemist**, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.

中文标题/摘要

标题：炼金师：通过元梯度数据选择提高文本到图像模型训练效率

近年来，文本到图像（T2I）生成模型的最新进展，如Imagen、Stable Diffusion和FLUX，显著提高了视觉质量。然而，其性能从根本上受限于训练数据的质量。网络抓取和合成图像数据集往往包含低质量或重复的样本，导致视觉保真度下降、训练不稳定和计算效率低下。因此，有效的数据选择对于提高数据效率至关重要。现有方法依赖于昂贵的手动筛选或基于文本到图像数据单维度特征的启发式评分。虽然在大语言模型（LLM）中探索了基于元学习的方法，但尚未针对图像模态进行适应。为此，我们提出了一种基于元梯度的框架**炼金师**，用于从大规模文本-图像数据对中选择合适的子集。我们的方法通过从数据为中心的角度迭代优化模型，自动学习评估每个样本的影响。炼金师包括两个关键阶段：数据评级和数据修剪。我们训练了一个轻量级的评级器，基于梯度信息估计每个样本的影响，并增强多粒度感知。然后，我们使用Shift-G采样策略选择信息丰富的子集，以实现高效的模型训练。炼金师是第一个自动、可扩展的基于元梯度的数据选择框架，用于文本到图像模型训练。在合成和网络抓取数据集上的实验表明，炼金师能够一致地提高视觉质量和下游性能。使用炼金师选择的数据训练，仅需50%的数据即可超越使用完整数据集的训练。

Summary / 总结

Alchemist is a meta-gradient-based framework that improves the efficiency of Text-to-Image model training by selecting a suitable subset of the training data. A lightweight rater estimates each sample's influence from gradient information with multi-granularity perception, and the Shift-Gsampling strategy then selects informative subsets. Experiments on synthetic and web-crawled datasets show that Alchemist improves visual quality and downstream performance, and that training on an Alchemist-selected 50% of the data can outperform training on the full dataset.

Alchemist 是一种基于元梯度的数据选择框架，旨在通过选择合适的子数据集来提高 Text-to-Image 模型的训练效率。它基于梯度信息和多粒度感知自动评估和修剪数据样本，并使用 Shift-Gsampling 选择有信息量的子集。实验表明，Alchemist 提高了视觉质量和下游性能，使用 Alchemist 选择的 50% 数据进行训练的效果优于使用完整数据集进行训练。

How Good is Post-Hoc Watermarking With Language Model Rephrasing?

Authors: Pierre Fernandez, Tom Sander, Hady Elsahar, Hongyan Chang, Tomáš Souček, Valeriu Lacatusu, Tuan Tran, Sylvestre-Alvise Rebuffi, Alexandre Mourachko

First: 2025-12-18T18:57:33+00:00 · Latest: 2025-12-18T18:57:33+00:00

Comments: Code at https://github.com/facebookresearch/textseal

Abs · PDF · Code1 · Code2 · Code3

Abstract

Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore *post-hoc watermarking* where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents, or detect their use in training or RAG via watermark radioactivity. Unlike generation-time approaches, which are constrained by how LLMs are served, this setting offers additional degrees of freedom for both generation and detection. We investigate how allocating compute (through larger rephrasing models, beam search, multi-candidate generation, or entropy filtering at detection) affects the quality-detectability trade-off. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books. Among our findings, the simple Gumbel-max scheme surprisingly outperforms more recent alternatives under nucleus sampling, and most methods benefit significantly from beam search. However, most approaches struggle when watermarking verifiable text such as code, where we counterintuitively find that smaller models outperform larger ones. This study reveals both the potential and limitations of post-hoc watermarking, laying groundwork for practical applications and future research.

中文标题/摘要

标题：基于语言模型改写的后置（post-hoc）水印效果如何？

生成时文本水印将统计信号嵌入文本，以便追溯 AI 生成的内容。我们探讨*后置（post-hoc）水印*：由 LLM 改写现有文本，并在改写过程中应用生成时水印，用以保护受版权保护的文档，或借助水印的“放射性”检测这些文档是否被用于训练或 RAG。生成时方法受限于 LLM 的服务方式，而这一设置在生成和检测两端都提供了更多自由度。我们研究了计算资源的分配方式（更大的改写模型、束搜索、多候选生成，或检测时的熵过滤）如何影响质量与可检测性之间的权衡。我们的策略在书籍等开放式文本上实现了较强的可检测性和语义保真度。研究发现，简单的 Gumbel-max 方案在核采样下出人意料地优于更新的替代方案，且大多数方法都能从束搜索中显著获益。然而，在为代码等可验证文本添加水印时，大多数方法表现不佳，并且我们发现了一个反直觉的现象：较小的模型优于较大的模型。这项研究揭示了后置水印的潜力与局限，为实际应用和未来研究奠定了基础。

Summary / 总结

This study investigates post-hoc watermarking, where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents or detect their use in training or RAG. It examines how allocating compute (larger rephrasing models, beam search, multi-candidate generation, entropy filtering at detection) affects the quality-detectability trade-off. The simple Gumbel-max scheme outperforms more recent alternatives under nucleus sampling, and most methods benefit significantly from beam search. Most approaches struggle on verifiable text such as code, where smaller models counterintuitively outperform larger ones. The study highlights both the potential and limitations of post-hoc watermarking, providing insights for practical applications and future research.

研究探讨了后置水印技术，即在LLM重写现有文本的同时应用生成时水印，以保护版权文档或检测其使用情况。研究发现，简单的Gumbel-max方案在核采样下优于近期的替代方案，而beam搜索显著提高了性能。然而，当对可验证文本（如代码）进行水印时，较小的模型反而优于较大的模型。研究揭示了后置水印技术的潜力和局限性，为实际应用和未来研究奠定了基础。
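
As a rough illustration of the Gumbel-max scheme mentioned in the entry above, the sketch below watermarks a toy token stream and then detects it. It is a minimal sketch under stated assumptions, not the paper's implementation: `toy_probs` is a hypothetical stand-in for the rephrasing LLM, and the vocabulary size, context length, key strings, and hashing scheme are all illustrative choices.

```python
import hashlib
import math
import random

VOCAB_SIZE = 50   # toy vocabulary size (assumption)
CONTEXT_LEN = 2   # previous tokens hashed together with the key (assumption)


def keyed_uniforms(key, context):
    """One pseudo-random uniform in [0, 1) per vocabulary entry, seeded by key + context."""
    material = f"{key}|{','.join(map(str, context))}".encode()
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.random() for _ in range(VOCAB_SIZE)]


def toy_probs(step):
    """Hypothetical stand-in for the rephrasing LLM's next-token distribution."""
    rng = random.Random(1000 + step)
    weights = [rng.random() + 1e-3 for _ in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def generate_watermarked(key, length):
    """Gumbel-max sampling: pick argmax_i r_i ** (1 / p_i), computed in log space."""
    tokens = [0] * CONTEXT_LEN
    for step in range(length):
        probs = toy_probs(step)
        r = keyed_uniforms(key, tokens[-CONTEXT_LEN:])
        scores = [math.log(max(r_i, 1e-12)) / p_i for r_i, p_i in zip(r, probs)]
        tokens.append(max(range(VOCAB_SIZE), key=scores.__getitem__))
    return tokens


def generate_plain(length, seed):
    """Ordinary sampling from the same toy model, with no watermark."""
    rng = random.Random(seed)
    tokens = [0] * CONTEXT_LEN
    for step in range(length):
        tokens.append(rng.choices(range(VOCAB_SIZE), weights=toy_probs(step))[0])
    return tokens


def detection_score(key, tokens):
    """Mean of -log(1 - r_token): near 1.0 in expectation without the watermark, larger with it."""
    terms = []
    for t in range(CONTEXT_LEN, len(tokens)):
        r = keyed_uniforms(key, tokens[t - CONTEXT_LEN:t])
        terms.append(-math.log(1.0 - r[tokens[t]]))
    return sum(terms) / len(terms)
```

Detection only needs the key and the tokens, not the model: each term is exponentially distributed with mean 1 when the text was produced independently of the key, so a clearly larger average is evidence of the watermark.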

In-Context Algebra

Authors: Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau

First: 2025-12-18T18:56:50+00:00 · Latest: 2025-12-18T18:56:50+00:00

Comments: 28 pages, 18 figures. Code and data at https://algebra.baulab.info

Abs · PDF · Code1 · Code2

Abstract

We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.

中文标题/摘要

标题：上下文中的代数

我们研究了 Transformer 在如下序列上被训练求解算术问题时所形成的机制：序列中的标记是变量，其含义仅由它们之间的相互作用决定。先前的工作发现 Transformer 会形成反映代数结构的几何嵌入，但这些发现来自算术标记具有固定含义的设置。我们设计了一个新任务，其中符号到具体代数群元素的分配因序列而异。尽管这一设置颇具挑战性，Transformer 在该任务上仍达到了接近完美的准确率，甚至可以泛化到未见过的代数群。我们构造了有针对性的数据分布，对一组假设的机制进行因果检验，并分离出模型稳定习得的三种机制：交换复制（commutative copying），即由一个专用的头（head）复制答案；单位元识别，用于区分包含单位元的事实；以及基于封闭性的消去（closure-based cancellation），通过跟踪群成员关系来约束有效答案。与固定符号设置中发现的几何表示相互补充，我们的结果表明，当模型被训练在上下文中对含义不固定的变量进行推理时，会形成符号推理机制。

Summary / 总结

The study investigates how transformers solve arithmetic over variable tokens whose meanings are determined only in context, with the assignment of symbols to group elements changing from one sequence to another. Despite this, transformers achieve near-perfect accuracy and generalize to unseen algebraic groups. Using targeted data distributions as causal tests, the research isolates three mechanisms: commutative copying, identity element recognition, and closure-based cancellation. These findings suggest that transformers develop symbolic reasoning mechanisms when trained on variables without fixed meanings, complementing the geometric representations found in fixed-symbol settings.

研究探讨了 Transformer 如何求解由含义依赖上下文的变量构成的算术问题。尽管任务颇具挑战性，Transformer 仍能达到接近完美的准确率，并泛化到未见过的代数群。研究识别出三种关键机制：交换复制、单位元识别和基于封闭性的消去，它们帮助模型在这种符号含义不固定的设置中进行符号推理。这些发现表明，当 Transformer 被训练处理含义依赖上下文的变量时，会形成符号推理能力，这与固定符号设置中发现的几何表示互为补充。
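
To make the task setup in the entry above concrete, the sketch below generates sequences of facts in which the symbol-to-element assignment is re-drawn for every sequence. It is only an illustration: the choice of the cyclic group Z_n, the lowercase-letter symbols, and the `a b = c` rendering are assumptions, not the paper's data format.

```python
import random
import string


def make_sequence(order, num_facts, seed):
    """Build facts over the cyclic group Z_order using a fresh, sequence-specific symbol assignment."""
    rng = random.Random(seed)
    symbols = rng.sample(string.ascii_lowercase, order)
    symbol_of = dict(enumerate(symbols))      # hidden mapping: group element -> symbol
    facts = []
    for _ in range(num_facts):
        x, y = rng.randrange(order), rng.randrange(order)
        facts.append((symbol_of[x], symbol_of[y], symbol_of[(x + y) % order]))
    return symbol_of, facts


def render(facts):
    """Render facts as text, e.g. 'q d = m , d d = q'."""
    return " , ".join(f"{a} {b} = {c}" for a, b, c in facts)
```

Because a symbol stands for a different group element in each sequence, a model cannot rely on a fixed embedding per symbol and has to infer each symbol's role from the facts in context.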

Impacts of Racial Bias in Historical Training Data for News AI

Authors: Rahul Bhargava, Malene Hornstrup Jespersen, Emily Boardman Ndulue, Vivica Dsouza

First: 2025-12-18T18:56:11+00:00 · Latest: 2025-12-18T18:56:11+00:00

Abs · PDF · Code1 · Code2

Abstract

AI technologies have rapidly moved into business and research applications that involve large text corpora, including computational journalism research and newsroom settings. These models, trained on extant data from various sources, can be conceptualized as historical artifacts that encode decades-old attitudes and stereotypes. This paper investigates one such example trained on the broadly-used New York Times Annotated Corpus to create a multi-label classifier. Our use in research settings surfaced the concerning "blacks" thematic topic label. Through quantitative and qualitative means we investigate this label's use in the training corpus, what concepts it might be encoding in the trained classifier, and how those concepts impact our model use. Via the application of explainable AI methods, we find that the "blacks" label operates partially as a general "racism detector" across some minoritized groups. However, it performs poorly against expectations on modern examples such as COVID-19 era anti-Asian hate stories, and reporting on the Black Lives Matter movement. This case study of interrogating embedded biases in a model reveals how similar applications in newsroom settings can lead to unexpected outputs that could impact a wide variety of potential uses of any large language model: story discovery, audience targeting, summarization, etc. The fundamental tension this exposes for newsrooms is how to adopt AI-enabled workflow tools while reducing the risk of reproducing historical biases in news coverage.

中文标题/摘要

标题：历史训练数据中的种族偏见对新闻AI的影响

AI技术已迅速应用于涉及大量文本语料库的商业和研究领域，包括计算新闻学研究和新闻编辑室环境。这些模型基于各种来源的现有数据进行训练，可以被视为编码了数十年前的态度和刻板印象的历史产物。本文研究了其中一个例子，该模型基于广泛使用的纽约时报注释语料库训练，用于构建多标签分类器。我们在研究环境中使用该模型时发现了令人担忧的“黑人”主题标签。通过定量和定性方法，我们调查了该标签在训练语料库中的使用情况、它在训练出的分类器中可能编码的概念，以及这些概念如何影响我们对模型的使用。通过应用可解释的AI方法，我们发现“黑人”标签在一定程度上充当了针对某些少数群体的通用“种族主义检测器”。然而，它在现代样例上的表现低于预期，例如COVID-19时期的反亚裔仇恨报道，以及关于“黑人的命也是命”（Black Lives Matter）运动的报道。这一审视模型内嵌偏见的案例研究揭示了类似应用在新闻编辑室环境中如何导致意想不到的输出，这些输出可能会影响任何大型语言模型的各种潜在用途，如故事发现、受众定位、摘要等。新闻编辑室面临的根本矛盾是如何采用AI驱动的工作流程工具，同时降低在新闻报道中再现历史偏见的风险。

Summary / 总结

This paper examines racial bias in a multi-label classifier trained on the New York Times Annotated Corpus, focusing on the "blacks" thematic label. Through quantitative and qualitative analysis and explainable AI methods, the study finds that the label operates partially as a general "racism detector" across some minoritized groups, but performs poorly on modern examples such as COVID-19 era anti-Asian hate stories and reporting on the Black Lives Matter movement. The research highlights the tension newsrooms face in adopting AI-enabled workflow tools while reducing the risk of reproducing historical biases in news coverage.

本文研究了基于《纽约时报注释语料库》训练的多标签分类器中的种族偏见，重点关注“黑人”主题标签。通过定量和定性分析，研究发现该标签在一定程度上充当了针对某些少数群体的通用“种族主义检测器”，但在反亚裔仇恨报道和“黑人的命也是命”（Black Lives Matter）运动报道等现代样例上表现不佳。研究强调，新闻编辑室在采用AI辅助工作流程工具时，需要降低再现历史偏见的风险。

FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

Authors: Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Kai Qiu, Chong Luo, Zuxuan Wu

First: 2025-12-18T18:56:05+00:00 · Latest: 2025-12-18T18:56:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.

中文标题/摘要

标题：FlashPortrait：6倍速无限肖像动画的自适应潜在预测

当前基于扩散的长肖像动画加速方法难以保证身份一致性。本文提出FlashPortrait，这是一种端到端的视频扩散变换器，能够合成保持身份、无限长度的视频，同时实现高达6倍的推理速度加速。具体而言，FlashPortrait首先使用现成的提取器计算身份无关的表情特征。然后引入归一化表情特征块，通过将它们分别的均值和方差进行归一化，以改善面部建模中的身份稳定性。在推理过程中，FlashPortrait采用动态滑动窗口方案并在重叠区域进行加权融合，确保长动画中的平滑过渡和身份一致性。在每个上下文窗口中，根据特定时间步的潜在变化率和扩散层间导数幅度比，FlashPortrait利用当前时间步的高阶潜在导数直接预测未来时间步的潜在值，从而跳过多个去噪步骤，实现6倍速度加速。基准实验表明，FlashPortrait在定性和定量上均有效。

Summary / 总结

FlashPortrait aims to improve the identity consistency in long portrait animations using a diffusion-based approach. It employs an end-to-end video diffusion transformer that computes identity-agnostic facial expression features and uses a Normalized Facial Expression Block to align these features with diffusion latents, enhancing identity stability. During inference, FlashPortrait uses a dynamic sliding-window scheme and higher-order latent derivatives to skip denoising steps, achieving up to 6x speed acceleration. Experiments demonstrate its effectiveness in maintaining identity consistency and accelerating inference speed.

FlashPortrait旨在通过使用端到端的视频扩散变换器生成具有身份保持性和无限长度的肖像动画，并实现高达6倍的推理速度提升。它采用归一化面部表情块将面部特征与扩散潜变量对齐，并在推理过程中采用动态滑动窗口方案以确保平滑过渡和身份一致性。通过在当前时间步直接预测未来时间步的潜变量，FlashPortrait跳过了多个去噪步骤，从而实现显著的加速效果。实验在基准测试中展示了其在定性和定量方面的有效性。
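
Two ideas from the FlashPortrait entry above can be sketched in a few lines: weighted blending of overlapping sliding windows, and predicting a future latent from derivatives at the current timestep. This is a minimal sketch under stated assumptions, not the paper's method: latents are plain floats here, the linear cross-fade weights are an illustrative choice, and the derivative-based prediction is approximated by second-order backward differences. The adaptive criteria the abstract mentions (latent variation rate, derivative magnitude ratio) are not modeled.

```python
def blend_overlap(prev_tail, next_head):
    """Cross-fade two equally long lists of overlapping frame latents with linear weights."""
    if len(prev_tail) != len(next_head):
        raise ValueError("overlapping segments must have the same length")
    n = len(prev_tail)
    blended = []
    for i, (a, b) in enumerate(zip(prev_tail, next_head)):
        w = (i + 1) / (n + 1)          # weight of the newer window grows across the overlap
        blended.append((1.0 - w) * a + w * b)
    return blended


def extrapolate_latent(history, steps_ahead):
    """Predict a future latent from the last three latents via backward differences.

    Second-order Newton backward-difference extrapolation; exact when the latent
    follows a quadratic in the (uniformly spaced) step index.
    """
    if len(history) < 3:
        raise ValueError("need at least three past latents")
    x0, x1, x2 = history[-3:]
    first = x2 - x1                    # approximates first derivative times step size
    second = x2 - 2.0 * x1 + x0        # approximates second derivative times step size squared
    k = steps_ahead
    return x2 + k * first + 0.5 * k * (k + 1) * second
```

Each skipped denoising step replaces a network evaluation with this cheap arithmetic, which is where the speed-up in such schemes comes from; the trade-off is extrapolation error when the latent trajectory changes quickly.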

Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image

Authors: Yushi Hu, Reyhane Askari-Hemmat, Melissa Hall, Emily Dinan, Luke Zettlemoyer, Marjan Ghazvininejad

First: 2025-12-18T18:56:04+00:00 · Latest: 2025-12-18T18:56:04+00:00

Comments: Code and data available at https://github.com/facebookresearch/MMRB2

Abs · PDF · Code1 · Code2 · Code3

Abstract

Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to >90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.

中文标题/摘要

标题：Multimodal RewardBench 2：评估全能奖励模型在交错文本和图像中的表现

奖励模型（RM）对于训练大型语言模型（LLM）至关重要，但面向处理交错图像和文本序列的全能（omni）模型的奖励模型仍缺乏充分研究。我们引入了 Multimodal RewardBench 2（MMRB2），这是首个面向多模态理解与（交错）生成的奖励模型综合基准。MMRB2 涵盖四个任务：文本到图像、图像编辑、交错生成和多模态推理（“用图像思考”），每个任务提供 1,000 对专家标注的偏好对，这些偏好对来自 23 个模型和代理，覆盖 21 个源任务。MMRB2 的设计包括：（1）实用但具有挑战性的提示；（2）来自最先进模型和代理的响应；（3）通过集成过滤策略筛选出的、具有高度人类专家共识的偏好对。借助 MMRB2，我们研究了各子任务上的现有评判模型，包括多模态 LLM-as-a-judge 和基于人类偏好训练的模型。最新的 Gemini 3 Pro 达到 75-80% 的准确率。GPT-5 和 Gemini 2.5 Pro 达到 66-75% 的准确率，低于人类的 >90%，但超过了广泛使用的 GPT-4o（59%）。表现最佳的开源模型 Qwen3-VL-32B 达到了与 Gemini 2.5 Flash（64%）相近的准确率。我们还通过 Best-of-N 采样表明，MMRB2 上的表现与下游任务的成功高度相关，并通过深入分析指出了奖励模型今后需要改进的关键方向。

Summary / 总结

The research introduces Multimodal RewardBench 2 (MMRB2), a comprehensive benchmark for evaluating reward models on multimodal understanding and interleaved generation. It covers four tasks (text-to-image, image editing, interleaved generation, and multimodal reasoning) with 1,000 expert-annotated preference pairs each. On MMRB2, Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75%, surpassing the widely used GPT-4o (59%) but still trailing human accuracy (>90%). The best open-source model, Qwen3-VL-32B, achieves accuracy similar to Gemini 2.5 Flash (64%). MMRB2 performance also correlates strongly with downstream task success under Best-of-N sampling, and the analysis highlights areas where reward models can improve.

论文介绍了Multimodal RewardBench 2 (MMRB2)，这是一个用于评估奖励模型在多模态理解和生成任务上的基准，包括文本到图像、图像编辑、交错生成和多模态推理。使用每任务1,000个专家标注的偏好对，研究发现最新的Gemini 3 Pro和Gemini 2.5 Pro分别达到了75-80%和66-75%的准确率，而人类的准确率超过90%。开源模型如Qwen3-VL-32B的表现也与Gemini 2.5 Flash相似。奖励模型在MMRB2上的表现与它们在下游任务中的成功密切相关。
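
The two measurements named in the MMRB2 entry above, accuracy on preference pairs and Best-of-N selection, can be sketched as follows. This is an illustrative sketch only: responses are plain strings rather than images or interleaved content, and `length_judge` is a hypothetical toy judge, not any model evaluated in the paper.

```python
def pairwise_accuracy(pairs, judge):
    """Fraction of (prompt, chosen, rejected) pairs where the judge scores `chosen` higher."""
    correct = sum(
        1 for prompt, chosen, rejected in pairs
        if judge(prompt, chosen) > judge(prompt, rejected)
    )
    return correct / len(pairs)


def best_of_n(prompt, candidates, judge):
    """Best-of-N selection: return the candidate with the highest judge score."""
    return max(candidates, key=lambda response: judge(prompt, response))


def length_judge(prompt, response):
    """Hypothetical toy judge that simply prefers longer responses."""
    return len(response)
```

A judge that agrees with human preferences more often on such pairs should, when used as the selector in Best-of-N, pick better candidates; this is the link between benchmark accuracy and downstream success that the abstract reports.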

LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation

Authors: Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu

First: 2025-12-18T18:52:18+00:00 · Latest: 2025-12-18T18:52:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.

中文标题/摘要

标题：LinkedOut：从视频LLM中链接世界知识表示以实现下一代视频推荐

视频大型语言模型（VLLMs）通过在互联网规模数据上进行预训练，解锁了对视频的理解能力，并已在电影分析和视频问答等任务上展示了潜力。然而，将VLLMs部署到下游任务如视频推荐仍然具有挑战性，因为实际系统需要多视频输入、轻量级骨干网络、低延迟序列推理和快速响应。实践中，(1) 只解码生成会导致序列推理的高延迟，(2) 传统接口不支持多视频输入，(3) 限制输出为语言会丢弃对下游视觉任务重要的细粒度视觉细节。我们认为这些限制源于缺乏一种同时保留像素级细节并利用世界知识的表示。我们提出了LinkedOut，一种直接从视频中提取VLLM世界知识的表示，以实现快速推理、支持多视频历史记录，并移除语言瓶颈。LinkedOut 使用VLLMs从原始帧中提取语义上合理的、知识导向的标记，由可提示查询和可选辅助模态引导。我们引入了一种跨层知识融合MoE，从丰富的VLLM特征中选择适当的抽象级别，实现个性化、可解释和低延迟的推荐。据我们所知，LinkedOut 是第一个基于VLLM在原始帧上操作且无需手工标签的方法，在标准基准上取得了最先进的结果。可解释性研究和消融实验证实了层多样性及层内融合的好处，指出了一个实用的路径，充分利用VLLM世界知识先验和视觉推理，以实现如推荐等下游视觉任务。

Summary / 总结

LinkedOut addresses the challenges of deploying Video Large Language Models (VLLMs) for video recommendation by introducing a representation that extracts world knowledge directly from video frames, enabling fast inference and supporting multi-video inputs. Key findings include the use of a cross-layer knowledge fusion MoE to select the appropriate level of abstraction, resulting in state-of-the-art performance on standard benchmarks and improved interpretability.

LinkedOut通过结合世界知识和像素级细节，解决了将视频大型语言模型（VLLMs）应用于视频推荐的挑战。它从原始视频帧中使用VLLMs提取知识感知的令牌，并支持多视频输入，实现快速推理和低延迟推荐。LinkedOut在标准基准上取得了最先进的结果，并展示了层多样性及层内融合的好处，提供了一种实用的解决方案，用于下游视觉任务如推荐。

AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning

Authors: Tzu-Han Lin, Wei-Lin Chen, Chen-An Li, Hung-yi Lee, Yun-Nung Chen, Yu Meng

First: 2025-12-18T18:50:01+00:00 · Latest: 2025-12-18T18:50:01+00:00

Comments: Preprint. Code and artifacts will be uploaded to https://github.com/hank0316/AdaSearch

Abs · PDF · Code1 · Code2 · Code3

Abstract

Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.

中文标题/摘要

标题：AdaSearch：通过强化学习平衡大型语言模型中的参数知识和搜索

通过强化学习（RL）为大型语言模型（LLMs）配备搜索引擎已成为构建搜索代理的有效方法。然而，过度依赖搜索会引入不必要的成本，并且存在接触到嘈杂或恶意内容的风险，而仅依赖参数知识则存在幻觉的风险。核心挑战在于开发能够适当地平衡参数知识与外部搜索的代理，仅在必要时才调用搜索。先前的工作通过围绕工具调用次数塑造奖励来缓解搜索过度使用的问题。然而，这些惩罚需要大量的奖励工程，提供模糊的信用分配，并且可以被表面上减少调用次数的代理所利用。此外，仅通过调用次数来评估性能混淆了必要的和不必要的搜索，掩盖了真正适应行为的测量。为了解决这些局限性，我们首先通过基于F1的决策度量来量化现有搜索代理的自我知识意识，发现诸如Search-R1等方法往往忽视了现成的参数知识。受这些发现的启发，我们提出了AdaSearch，这是一种简单的两阶段、结果导向的RL框架，将问题解决与是否调用搜索的决策分离，并使这一决策过程变得明确和可解释。这种透明性对于金融和医疗问答等高风险领域至关重要，而先前的方法对此大多忽视。在多个模型家族和规模的实验中表明，AdaSearch显著提高了知识边界意识，减少了不必要的搜索调用，保持了强大的任务性能，并提供了更透明和可解释的决策行为。

Summary / 总结

AdaSearch is a reinforcement learning framework that balances the use of parametric knowledge and external search in large language models. It addresses the limitations of previous methods by quantifying self-knowledge awareness and proposing a two-stage, outcome-driven approach that makes the decision to invoke search explicit and interpretable. Experimental results show that AdaSearch enhances knowledge-boundary awareness, reduces unnecessary search calls, maintains strong task performance, and provides more transparent and interpretable decision behaviors across different model sizes and families.

AdaSearch 是一个强化学习框架，旨在平衡大型语言模型中参数知识和外部搜索的使用。它通过量化自我知识意识并提出两阶段、基于结果的方法来解决先前方法的局限性，使调用搜索的决定过程变得明确和可解释。实验表明，AdaSearch 提高了知识边界意识，减少了不必要的搜索调用，保持了强大的任务性能，并提供了更透明和可解释的决策行为。
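
The AdaSearch entry above mentions an F1-based decision metric without defining it. One plausible reading, sketched below, scores the agent's search decisions against whether search was actually needed. The choice of "search was needed" as the positive class, and the way that ground truth would be obtained (for example, by checking whether the model answers correctly without search), are assumptions rather than the paper's definition.

```python
def decision_f1(needs_search, did_search):
    """F1 of the agent's search decisions, treating 'search was needed' as the positive class."""
    pairs = list(zip(needs_search, did_search))
    tp = sum(1 for need, did in pairs if need and did)
    fp = sum(1 for need, did in pairs if not need and did)
    fn = sum(1 for need, did in pairs if need and not did)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Unlike a raw count of tool calls, this score penalizes both failure modes: an agent that always searches loses precision, and one that never searches loses recall.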

Semi-Supervised Online Learning on the Edge by Transforming Knowledge from Teacher Models

Authors: Jiabin Xue

First: 2025-12-18T18:37:28+00:00 · Latest: 2025-12-18T18:37:28+00:00

Abs · PDF · Code1 · Code2

Abstract

Edge machine learning (Edge ML) enables training ML models using the vast data distributed across network edges. However, many existing approaches assume static models trained centrally and then deployed, making them ineffective against unseen data. To address this, Online Edge ML allows models to be trained directly on edge devices and updated continuously with new data. This paper explores a key challenge of Online Edge ML: "How to determine labels for truly future, unseen data points". We propose Knowledge Transformation (KT), a hybrid method combining Knowledge Distillation, Active Learning, and causal reasoning. In short, KT acts as the oracle in active learning by transforming knowledge from a teacher model to generate pseudo-labels for training a student model. To verify the validity of the method, we conducted simulation experiments with two setups: (1) using a less stable teacher model and (2) a relatively more stable teacher model. Results indicate that when a stable teacher model is given, the student model can eventually reach its expected maximum performance. KT is potentially beneficial for scenarios that meet the following circumstances: (1) when the teacher's task is generic, which means existing pre-trained models might be adequate for its task, so there will be no need to train the teacher model from scratch; and/or (2) when the label for the student's task is difficult or expensive to acquire.

中文标题/摘要

标题：通过转换教师模型的知识实现边缘端的半监督在线学习

边缘机器学习（Edge ML）允许使用分布在网络边缘的海量数据训练机器学习模型。然而，许多现有方法假设模型是在中心端训练好后再部署的静态模型，这使得它们难以应对未见过的数据。为了解决这个问题，在线边缘机器学习（Online Edge ML）允许模型直接在边缘设备上训练，并不断用新数据进行更新。本文探讨了在线边缘机器学习的一个关键挑战：“如何为真正来自未来的、未见过的数据点确定标签”。我们提出了知识转换（KT），这是一种结合知识蒸馏、主动学习和因果推理的混合方法。简而言之，KT 在主动学习中充当“预言机”（oracle，即提供标签的一方），通过转换教师模型的知识生成伪标签来训练学生模型。为了验证该方法的有效性，我们进行了两种设置的仿真实验：（1）使用一个不太稳定的教师模型；（2）使用一个相对更稳定的教师模型。结果显示，当给定一个稳定的教师模型时，学生模型最终可以达到其预期的最大性能。KT 对于满足以下条件的场景可能有益：（1）教师的任务是通用的，这意味着现有的预训练模型可能足以完成其任务，因此不需要从头训练教师模型；和/或（2）学生任务的标签难以获取或获取成本高昂。

Summary / 总结

This paper addresses the challenge of labeling unseen data in Online Edge ML by proposing Knowledge Transformation (KT), which combines Knowledge Distillation, Active Learning, and causal reasoning. KT acts as the oracle in active learning: it transforms knowledge from a teacher model into pseudo-labels for training a student model. Simulation experiments with a less stable and a more stable teacher indicate that, given a stable teacher, the student can eventually reach its expected maximum performance. KT may suit scenarios where the teacher's task is generic enough for an existing pretrained model, and/or where labels for the student's task are difficult or expensive to obtain.

本文提出了一种名为Knowledge Transformation (KT) 的混合方法，结合了Knowledge Distillation、Active Learning和因果推理，以解决在线边缘机器学习中未见数据的标签问题。KT 从教师模型中提取知识生成伪标签来训练学生模型。实验结果显示，稳定的教师模型可以帮助学生模型达到其预期的最大性能，使KT 适用于教师任务通用且标签难以获取的场景。
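
The pseudo-labeling loop described in the entry above can be sketched as a teacher whose output is transformed into labels for a student that learns online. Everything concrete here is an assumption made for illustration: `teacher_predict` and `transform` are hypothetical stand-ins, the data are synthetic one-dimensional readings, and the student is a simple running-centroid classifier. The causal-reasoning component mentioned in the abstract is not modeled.

```python
import random


def teacher_predict(x):
    """Hypothetical stand-in for a pretrained teacher solving a generic task."""
    return "object_present" if x > 0.0 else "object_absent"


def transform(teacher_output):
    """Hypothetical rule mapping the teacher's output to a label for the student's task."""
    return 1 if teacher_output == "object_present" else 0


class OnlineCentroidStudent:
    """Tiny online learner: keeps a running mean per class and predicts the nearest one."""

    def __init__(self):
        self.mean = {}
        self.count = {}

    def update(self, x, label):
        n = self.count.get(label, 0) + 1
        prev = self.mean.get(label, 0.0)
        self.mean[label] = prev + (x - prev) / n   # incremental mean update
        self.count[label] = n

    def predict(self, x):
        if not self.mean:
            return None
        return min(self.mean, key=lambda label: abs(x - self.mean[label]))


def run_stream(num_points, seed):
    """Feed a synthetic unlabeled stream to the student, labeling each point via the teacher."""
    rng = random.Random(seed)
    student = OnlineCentroidStudent()
    for _ in range(num_points):
        x = rng.gauss(rng.choice([-2.0, 2.0]), 1.0)   # synthetic unlabeled reading
        student.update(x, transform(teacher_predict(x)))
    return student
```

In this arrangement the student never sees a human-provided label, so its quality is bounded by the teacher's, which is consistent with the abstract's observation that a stable teacher is needed for the student to reach its expected performance.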

RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

Authors: Tianyuan Qu, Lei Ke, Xiaohang Zhan, Longxiang Tang, Yuqi Liu, Bohao Peng, Bei Yu, Dong Yu, Jiaya Jia

First: 2025-12-18T18:34:23+00:00 · Latest: 2025-12-18T18:34:23+00:00

Comments: Precise region control and planning for instruction-based image editing. Our project page: https://replan-iv-edit.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io

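Chinese title and abstract (translated)

Title: RePlan: Reasoning-Based Region Planning for Complex Instruction-Driven Image Editing

Instruction-based image editing lets users control visual modifications through natural language, but existing models perform poorly under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions through step-by-step reasoning and explicitly grounds them to target regions; the editor then applies the changes with a training-free attention-region injection mechanism, which enables precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning on 1,000 instruction-only examples, which yields substantial gains in reasoning fidelity and format reliability. We also present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: https://replan-iv-edit.github.io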

Summary / 总结

RePlan is a plan-then-execute framework for instruction-based image editing that addresses the challenge of Instruction-Visual Complexity. It uses a vision-language planner to decompose instructions and ground them to target regions, followed by a diffusion editor for precise, parallel multi-region edits. RePlan improves regional precision and overall fidelity in complex settings, outperforming strong baselines despite being trained on smaller datasets. It employs GRPO-based reinforcement learning to enhance reasoning fidelity and format reliability.

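RePlan is a plan-then-execute framework for instruction-based image editing that targets Instruction-Visual Complexity. A vision-language planner decomposes the instruction and explicitly grounds each part to a target region, and a diffusion editor then performs precise, parallel multi-region edits. RePlan improves regional precision and overall fidelity in complex scenes and surpasses strong baselines even though it is trained on less data. It uses GRPO-based reinforcement learning to improve reasoning fidelity and format reliability.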
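As a rough illustration of the RePlan entry's idea of grounding sub-instructions to regions and editing them in parallel, the sketch below turns a plan into one patch-level mask per region. Everything here is an assumption made for illustration: the `RegionEdit` record, the pixel box format, and the patch grid are hypothetical, and the abstract does not describe how the attention-region injection itself is implemented.

```python
from dataclasses import dataclass


@dataclass
class RegionEdit:
    """One planned edit: a target box plus its sub-instruction (hypothetical format)."""
    box: tuple  # (x0, y0, x1, y1) in pixels, exclusive on the right and bottom edges
    instruction: str


def region_patch_mask(box, image_size, patch):
    """Return a row-major list of booleans, True where a patch overlaps the box."""
    width, height = image_size
    x0, y0, x1, y1 = box
    cols = (width + patch - 1) // patch   # number of patch columns (ceiling division)
    rows = (height + patch - 1) // patch  # number of patch rows
    mask = []
    for r in range(rows):
        for c in range(cols):
            px0, py0 = c * patch, r * patch
            px1, py1 = px0 + patch, py0 + patch
            # Standard rectangle-overlap test between the patch and the box.
            mask.append(px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0)
    return mask


def plan_to_masks(plan, image_size, patch):
    """Build one patch mask per planned edit so every region is available in a single pass."""
    return [(edit.instruction, region_patch_mask(edit.box, image_size, patch))
            for edit in plan]


# Toy plan on a 4x4 image split into 2x2 patches (made-up instructions).
plan = [
    RegionEdit(box=(0, 0, 2, 2), instruction="make the mug red"),
    RegionEdit(box=(2, 2, 4, 4), instruction="remove the spoon"),
]
masks = plan_to_masks(plan, image_size=(4, 4), patch=2)
```

A real editor would presumably use masks like these to restrict which image tokens each sub-instruction can influence; that step depends on the model and is left out.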

ReinforceGen: Hybrid Skill Policies with Automated Data Generation and Reinforcement Learning

Authors: Zihan Zhou, Animesh Garg, Ajay Mandlekar, Caelan Garrett

First: 2025-12-18T18:32:39+00:00 · Latest: 2025-12-18T18:32:39+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Long-horizon manipulation has been a long-standing challenge in the robotics community. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches an 80% success rate on all tasks with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contribute to an 89% average performance increase. More results and videos are available at https://reinforcegen.github.io/

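Chinese title and abstract (translated)

Title: ReinforceGen: Hybrid Skill Policies Combining Automated Data Generation and Reinforcement Learning

Long-horizon manipulation has long been a challenge in robotics. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and then improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion-planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and are then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches an 80% success rate with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches raise performance by 89% on average. More results and videos are available at https://reinforcegen.github.io/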

Summary / 总结

ReinforceGen addresses long-horizon manipulation in robotics by combining task decomposition, data generation, imitation learning, and motion planning. It segments a task into localized skills connected through motion planning. The skills and motion-planning targets are first trained with imitation learning on a dataset generated from 10 human demonstrations, and are then fine-tuned through online adaptation and reinforcement learning. On the Robosuite dataset, ReinforceGen achieves an 80% success rate with visuomotor controls in the highest reset range setting. Ablation studies indicate that the fine-tuning approaches contribute to an 89% average performance increase.

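ReinforceGen aims to solve long-horizon manipulation in robotics by combining task decomposition, data generation, imitation learning, and motion planning. It breaks a task into localized skills and connects them through motion planning, with initial training done by imitation learning on a dataset generated from human demonstrations. These components are then fine-tuned with reinforcement learning. On the Robosuite dataset, ReinforceGen achieves an 80% success rate with visuomotor controls in the highest reset range setting. Ablation studies show that the fine-tuning methods raise performance by 89% on average.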

Distributional AGI Safety

Authors: Nenad Tomašev, Matija Franklin, Julian Jacobs, Sébastien Krier, Simon Osindero

First: 2025-12-18T18:29:50+00:00 · Latest: 2025-12-18T18:29:50+00:00

Abs · PDF · Code1 · Code2

Abstract

AI safety and alignment research has predominantly been focused on methods for safeguarding individual AI systems, resting on the assumption of an eventual emergence of a monolithic Artificial General Intelligence (AGI). The alternative AGI emergence hypothesis, where general capability levels are first manifested through coordination in groups of sub-AGI individual agents with complementary skills and affordances, has received far less attention. Here we argue that this patchwork AGI hypothesis needs to be given serious consideration, and should inform the development of corresponding safeguards and mitigations. The rapid deployment of advanced AI agents with tool-use capabilities and the ability to communicate and coordinate makes this an urgent safety consideration. We therefore propose a framework for distributional AGI safety that moves beyond evaluating and aligning individual agents. This framework centers on the design and implementation of virtual agentic sandbox economies (impermeable or semi-permeable), where agent-to-agent transactions are governed by robust market mechanisms, coupled with appropriate auditability, reputation management, and oversight to mitigate collective risks.

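Chinese title and abstract (translated)

Title: Distributional AGI Safety

AI safety and alignment research has focused mainly on methods for safeguarding individual AI systems, based on the assumption that a single monolithic Artificial General Intelligence (AGI) will eventually emerge. The alternative hypothesis, in which general capability first appears through coordination among groups of sub-AGI agents with complementary skills and affordances, has received far less attention. Here we argue that this patchwork AGI hypothesis deserves serious consideration and should guide the development of corresponding safeguards and mitigations. The rapid deployment of advanced AI agents that can use tools, communicate, and coordinate makes this an urgent safety concern. We therefore propose a framework for distributional AGI safety that goes beyond evaluating and aligning individual agents. The framework centers on designing and implementing virtual agentic sandbox economies (impermeable or semi-permeable), in which agent-to-agent transactions are governed by robust market mechanisms and complemented by appropriate auditability, reputation management, and oversight to mitigate collective risks.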

Summary / 总结

The paper addresses the need to consider the alternative hypothesis of AGI emergence through groups of sub-AGI agents rather than a single monolithic AGI. It proposes a framework for distributional AGI safety, focusing on virtual agentic sandbox economies with robust market mechanisms and oversight to mitigate collective risks. Key findings include the importance of evaluating and aligning groups of agents rather than individual ones, and the urgent need for corresponding safeguards due to the rapid deployment of advanced AI agents with tool-use capabilities and communication abilities.

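The paper argues for taking seriously the patchwork AGI hypothesis, in which AGI emerges through a group of sub-AGI agents with complementary skills and affordances rather than as a single monolithic AGI. It proposes a framework for distributional AGI safety centered on virtual agentic sandbox economies with robust market mechanisms and oversight to mitigate collective risks. Its main points are the importance of evaluating and aligning interactions among agents rather than agents in isolation, and the urgent need for safeguards given the rapid deployment of advanced AI agents that can use tools and coordinate.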

TOGGLE: Temporal Logic-Guided Large Language Model Compression for Edge

Authors: Khurram Khalil, Khaza Anuarul Hoque

First: 2025-12-18T18:27:42+00:00 · Latest: 2025-12-18T18:27:42+00:00

Comments: Published in the IEEE ICCAD 2025 conference

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) deliver exceptional performance across natural language tasks but demand substantial computational resources, limiting their deployment on resource-constrained edge devices. Existing compression techniques, such as quantization and pruning, often degrade critical linguistic properties and lack formal guarantees for preserving model behavior. We propose Temporal Logic-Guided Large Language Model Compression (TOGGLE), a novel framework that leverages Signal Temporal Logic (STL) to formally specify and enforce linguistic properties during compression. TOGGLE employs an STL robustness-guided Bayesian optimization to systematically explore layer-wise quantization and pruning configurations, generating compressed models that formally satisfy specified linguistic constraints without retraining or fine-tuning. Evaluating TOGGLE on four LLM architectures (GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, and Mistral 7B), we achieve up to 3.3x reduction in computational costs (FLOPs) and up to a 68.8% reduction in model size while satisfying all linguistic properties. TOGGLE represents the first integration of formal methods into LLM compression, enabling efficient, verifiable deployment of LLMs on edge hardware.

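Chinese title and abstract (translated)

Title: TOGGLE: Temporal Logic-Guided Large Language Model Compression for Edge Devices

Large Language Models (LLMs) perform very well across natural language tasks but require substantial computational resources, which limits their deployment on resource-constrained edge devices. Existing compression techniques such as quantization and pruning often degrade critical linguistic properties and lack formal guarantees that model behavior is preserved. We propose Temporal Logic-Guided Large Language Model Compression (TOGGLE), a novel framework that uses Signal Temporal Logic (STL) to formally specify and enforce linguistic properties during compression. TOGGLE uses STL robustness-guided Bayesian optimization to systematically explore layer-wise quantization and pruning configurations, producing compressed models that formally satisfy the specified linguistic constraints without retraining or fine-tuning. Evaluating TOGGLE on four LLM architectures (GPT-2, DeepSeek-V2 7B, LLaMA 3 8B, and Mistral 7B), we achieve up to a 3.3x reduction in computational cost (FLOPs) and up to a 68.8% reduction in model size while satisfying all linguistic properties. TOGGLE is the first integration of formal methods into LLM compression and enables efficient, verifiable deployment of LLMs on edge hardware.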

Summary / 总结

TOGGLE is a novel framework that uses Signal Temporal Logic (STL) to guide the compression of Large Language Models (LLMs) for edge devices. It employs STL robustness-guided Bayesian optimization to explore quantization and pruning configurations, ensuring that the compressed models satisfy specified linguistic constraints without retraining. Evaluations on four LLM architectures show up to 3.3x reduction in computational costs and up to 68.8% reduction in model size while maintaining all linguistic properties.

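TOGGLE is a new framework that uses Signal Temporal Logic (STL) to guide the compression of Large Language Models (LLMs) for edge devices. It explores quantization and pruning configurations with STL robustness-guided Bayesian optimization, so that the compressed models satisfy the specified linguistic constraints without retraining. Evaluations on four LLM architectures show up to a 3.3x reduction in computational cost and up to a 68.8% reduction in model size while preserving all linguistic properties.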
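To make the notion of STL robustness in the TOGGLE entry more concrete, here is a minimal sketch under stated assumptions. It computes the standard quantitative robustness of the formula "always (signal >= threshold)", which is the smallest margin over the trace, and uses it to filter compression candidates. The exhaustive loop is a deliberately simple stand-in for the Bayesian optimization mentioned in the abstract, and `toy_evaluate`, the single `bits` setting, and all numbers are made up for illustration rather than taken from the paper.

```python
def always_at_least(signal, threshold):
    """Robustness of the STL formula G(signal >= threshold).

    The value is the smallest margin over the trace: non-negative means the
    property holds at every step, negative means it is violated somewhere.
    """
    return min(value - threshold for value in signal)


def cheapest_satisfying(candidates, evaluate, threshold):
    """Return the lowest-cost candidate whose robustness is non-negative, or None.

    This plain exhaustive search stands in for the robustness-guided
    Bayesian optimization described in the abstract.
    """
    feasible = []
    for config in candidates:
        signal, cost = evaluate(config)
        if always_at_least(signal, threshold) >= 0:
            feasible.append((cost, config))
    if not feasible:
        return None
    return min(feasible, key=lambda item: item[0])[1]


def toy_evaluate(config):
    """Hypothetical evaluator: (quality score per evaluation step, relative cost)."""
    table = {
        16: ([0.95, 0.93, 0.94], 1.00),
        8: ([0.90, 0.85, 0.88], 0.50),
        4: ([0.80, 0.60, 0.75], 0.25),  # dips below the threshold at one step
    }
    return table[config["bits"]]


candidates = [{"bits": 16}, {"bits": 8}, {"bits": 4}]
best = cheapest_satisfying(candidates, toy_evaluate, threshold=0.7)
```

With these toy numbers the 4-bit candidate is rejected because its trace dips below the threshold once, so the 8-bit candidate is the cheapest one that still satisfies the property.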

Wrist Photoplethysmography Predicts Dietary Information

Authors: Kyle Verrier, Achille Nazaret, Joseph Futoma, Andrew C. Miller, Guillermo Sapiro

First: 2025-11-24T16:12:03+00:00 · Latest: 2025-12-18T18:27:29+00:00

Comments: 20 pages, 2 figures

Abs · PDF · Code1 · Code2

Abstract

Whether wearable photoplethysmography (PPG) contains dietary information remains unknown. We trained a language model on 1.1M meals to predict meal descriptions from PPG, aligning PPG to text. PPG nontrivially predicts meal content; predictability decreases for PPGs farther from meals. This transfers to dietary tasks: PPG increases AUC by 11% for intake and satiety across held-out and independent cohorts, with gains robust to text degradation. Wearable PPG may enable passive dietary monitoring.

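Chinese title and abstract (translated)

Title: Wrist Photoplethysmography Predicts Dietary Information

Whether wearable photoplethysmography (PPG) contains dietary information is still unknown. We trained a language model on 1.1 million meals to predict meal descriptions from PPG, aligning PPG with text. PPG predicts meal content nontrivially, and predictability weakens the farther the PPG recording is from the meal. This carries over to dietary tasks: PPG raises AUC by 11% for intake and satiety across held-out and independent cohorts, and the gains are robust to text degradation. Wearable PPG may enable passive dietary monitoring.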

GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

Authors: Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, Marjan Ghazvininejad

First: 2025-12-18T18:26:56+00:00 · Latest: 2025-12-18T18:26:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is more well-aligned with human judgment and argue is less likely to drift from human-alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.

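Chinese title and abstract (translated)

Title: GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation

Automating Text-to-Image (T2I) model evaluation is challenging: a judge model must be used to score correctness, and test prompts must be chosen so that they are challenging for current T2I models but not for the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where static benchmark judges fail to keep up with the capabilities of newer models. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval agreed closely with human judgment when it was released, it has since drifted far from human judgment, resulting in an absolute error of up to 17.7% for current models. Drift of this size strongly suggests that GenEval has been saturated for some time, which we verify with a large-scale human study. To fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives; we show that it agrees better with human judgment and argue that it is less likely to drift away from human alignment over time than more holistic judges such as VQAScore. Although we hope GenEval 2 will serve as a strong benchmark for many years, avoiding benchmark drift is far from guaranteed, and our work more generally highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.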

Summary / 总结

The research addresses the issue of benchmark drift in Text-to-Image (T2I) model evaluation by introducing GenEval 2, which improves coverage of visual concepts and compositionality. The study shows that GenEval, a popular T2I benchmark, has drifted significantly from human judgment, with an absolute error of up to 17.7% for current models. To mitigate this, GenEval 2 and a new evaluation method called Soft-TIFA are proposed, which are more aligned with human judgment and less prone to drift over time. The work emphasizes the need for continual audits and improvements in T2I benchmarks to maintain their relevance.

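The study addresses benchmark drift in Text-to-Image (T2I) model evaluation by introducing GenEval 2, which is designed to cover primitive visual concepts better and to pose harder compositional challenges. It shows that the original GenEval has drifted significantly from human judgment, with an absolute error of up to 17.7% for current models. To mitigate this, it proposes GenEval 2 together with a new evaluation method, Soft-TIFA, which agrees more closely with human judgment and is less prone to drifting away from human alignment over time than holistic judges such as VQAScore.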
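The GenEval 2 entry describes two quantities that a small worked example can make concrete: a score that combines per-primitive judgments, and the absolute error between an automatic score and human judgment. The sketch below is an assumption-laden illustration only. The abstract does not say how Soft-TIFA aggregates its judgments, so the arithmetic mean used here is a placeholder, and the function names and numbers are hypothetical.

```python
def soft_primitive_score(probabilities):
    """Combine per-primitive judge probabilities into one score.

    ASSUMPTION: a plain arithmetic mean; the actual Soft-TIFA aggregation
    is not specified in the abstract.
    """
    if not probabilities:
        raise ValueError("at least one primitive judgment is required")
    return sum(probabilities) / len(probabilities)


def absolute_error(benchmark_score, human_score):
    """Absolute gap between an automatic benchmark score and a human score."""
    return abs(benchmark_score - human_score)


# Made-up judge probabilities for three primitives of one prompt,
# for example an object, its color, and its position.
image_score = soft_primitive_score([1.0, 0.5, 0.0])

# Made-up benchmark-level scores on a 0-1 scale.
drift = absolute_error(0.75, 0.5)
```

Scoring each primitive separately keeps partial credit visible: an image that gets the object right but the position wrong lands between 0 and 1 instead of failing outright.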

Meta-RL Induces Exploration in Language Agents

Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic

First: 2025-12-18T18:22:17+00:00 · Latest: 2025-12-18T18:22:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents often struggle in tasks that require active exploration and fail to efficiently adapt from trial-and-error experiences. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from the environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework to encourage exploration and long-term rewards optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt their policy from task feedback signal without gradient update. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper and Webshop, respectively. Moreover, LaMer also demonstrates better generalization to more challenging or previously unseen tasks compared to the RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.

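Chinese title and abstract (translated)

Title: Meta-RL Induces Exploration in Language Agents

Reinforcement learning (RL) has made it possible to train large language model (LLM) agents to interact with an environment and solve multi-turn, long-horizon tasks. However, RL-trained agents often perform poorly on tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer has two key components: (i) a cross-episode training framework that encourages exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, which lets the agent adjust its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with gains of 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively. LaMer also generalizes better than RL-trained agents to more challenging or previously unseen tasks. Overall, our results show that Meta-RL offers a principled way to induce exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.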

Summary / 总结

This paper addresses the challenge of active exploration in reinforcement learning (RL)-trained language model agents, which often fail to efficiently explore and adapt from trial-and-error experiences. The authors introduce LaMer, a Meta-RL framework that includes a cross-episode training framework for encouraging exploration and long-term reward optimization, and in-context policy adaptation via reflection. Experiments show that LaMer outperforms RL baselines by 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively, and demonstrates better generalization to new tasks.

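This paper addresses the inefficiency of reinforcement learning (RL)-trained language agents when exploring and adapting to new tasks. The authors introduce LaMer, a Meta-RL framework that includes a cross-episode training mechanism to encourage exploration and long-term reward optimization, together with in-context policy adaptation via reflection. Experiments show that LaMer outperforms RL baselines by 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively, and that it also generalizes better to new tasks than RL-trained agents.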
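One way to picture why a cross-episode objective, as mentioned in the LaMer entry, can encourage exploration is to credit each attempt with the discounted reward of the attempts that follow it on the same task. The sketch below shows that generic computation. It is an assumption for illustration, not LaMer's actual objective: the function name, the discount factor, and the reward values are all hypothetical.

```python
def cross_episode_returns(episode_rewards, gamma):
    """Discounted return for each episode in a sequence of attempts on one task.

    Episode i is credited with its own reward plus gamma times the return of
    the following episode, so an unrewarded exploratory attempt still earns
    credit when it is followed by a successful one.
    """
    returns = [0.0] * len(episode_rewards)
    running = 0.0
    # Walk backwards so each return can reuse the one computed after it.
    for i in reversed(range(len(episode_rewards))):
        running = episode_rewards[i] + gamma * running
        returns[i] = running
    return returns


# Made-up example: two failed attempts followed by a success.
returns = cross_episode_returns([0.0, 0.0, 1.0], gamma=0.5)
```

With these toy values the two failed attempts receive returns of 0.25 and 0.5 rather than zero, which is the kind of signal that can reward information-gathering early in a trial sequence.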

OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Authors: Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou, Rishi Shiv, Yaqi Li, Haoyu Xiong, Crystal Elaine Owens, Yilun Du, Yiyue Luo, Xianyi Cheng, Antonio Torralba, Wojciech Matusik, Paul Pu Liang

First: 2025-12-18T18:18:17+00:00 · Latest: 2025-12-18T18:18:17+00:00

Comments: https://opentouch-tactile.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

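Chinese title and abstract (translated)

Title: OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully the hand makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild dataset aligns first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.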

Summary / 总结

The paper presents OpenTouch, an in-the-wild egocentric full-hand tactile dataset, which includes 5.1 hours of synchronized video, touch, and pose data and 2,900 curated clips with detailed annotations. The dataset aims to bridge the gap between visual perception and physical interaction. Using this dataset, the authors introduce benchmarks for retrieval and classification tasks, demonstrating that tactile signals are crucial for grasp understanding and cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. The goal is to advance multimodal egocentric perception, embodied learning, and robotic manipulation.

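The paper introduces OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 clips with detailed text annotations. The dataset aims to bridge the gap between visual perception and physical interaction. Using it, the authors introduce retrieval and classification benchmarks to explore how touch affects perception and action. Key findings include the effectiveness of tactile signals for understanding grasps and strengthening cross-modal alignment, and the ability to reliably retrieve tactile information from in-the-wild video queries. By releasing the dataset and benchmarks, the authors aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.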