arXiv Paper Digest

Snapshot: 20260225_0353

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Authors: Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad, Ritesh Thawkar, Omkar Thawakar, Senmao Li, Hisham Cholakkal, Ian Reid, Eric P. Xing, Salman Khan, Fahad Shahbaz Khan

First: 2026-02-23T18:59:58+00:00 · Latest: 2026-02-23T18:59:58+00:00

Comments: Project page: https://amshaker.github.io/Mobile-O/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/

中文标题/摘要

标题：Mobile-O：移动设备上的统一多模态理解和生成

统一多模态模型可以在单一架构中同时理解和生成视觉内容。现有模型仍然数据饥渴且过于沉重，无法部署在边缘设备上。我们提出了Mobile-O，这是一种紧凑的视觉-语言-扩散模型，将统一的多模态智能带到了移动设备上。其核心模块，移动条件投影器（MCP），使用深度可分离卷积和层间对齐将视觉-语言特征与扩散生成器融合。这种设计使得跨模态条件化在最小的计算成本下得以实现。Mobile-O仅在几百万样本上进行训练，并以新颖的四元组格式（生成提示、图像、问题、答案）进行后续训练，从而同时增强了视觉理解和生成能力。尽管效率高，Mobile-O在GenEval上的表现与其它统一模型相当或更优，达到74%，并且比Show-O和JanusFlow分别快6倍和11倍，同时在视觉理解方面，Mobile-O在七个基准测试中的平均表现优于它们15.3%和5.1%。在iPhone上，Mobile-O每处理一张512x512的图像仅需约3秒，建立了首个适用于边缘设备的实时统一多模态理解和生成的实用框架。我们希望Mobile-O能够简化在设备上完全运行的实时统一多模态智能的研究，无需依赖云服务。我们的代码、模型、数据集和移动应用程序可在https://amshaker.github.io/Mobile-O/获取。

Summary / 总结

Mobile-O is a compact vision-language-diffusion model for unified multimodal understanding and generation on mobile devices. Its Mobile Conditioning Projector (MCP) fuses vision-language features with a diffusion generator through depthwise-separable convolutions and layerwise alignment, which keeps cross-modal conditioning cheap. Trained on only a few million samples, Mobile-O reaches 74% on GenEval, outperforming Show-O and JanusFlow by 5% and 11% while running 6x and 11x faster, respectively. On visual understanding it surpasses the same two models by 15.3% and 5.1% on average across seven benchmarks, and it generates a 512x512 image in about 3 seconds on an iPhone.

Mobile-O 是一种紧凑的视觉-语言-扩散模型，旨在移动设备上运行，通过高效的数据利用和轻量化设计解决现有统一多模态模型的不足。它使用 Mobile Conditioning Projector (MCP) 将视觉-语言特征与扩散生成器融合，实现高效的跨模态条件处理。Mobile-O 在 GenEval 和其他基准测试中表现出竞争力或优越性，运行速度比 Show-O 和 JanusFlow 快 6 倍，同时在七个基准测试中的视觉理解方面分别超越它们 15.3% 和 5.1%。它在 iPhone 上每张 512x512 图像运行约 3 秒，是第一个在边缘设备上实现实时统一多模态理解和生成的实用框架。
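
To make the efficiency argument behind the depthwise-separable convolutions mentioned in the Mobile-O abstract concrete, the sketch below compares weight counts for a standard convolution and its depthwise-separable counterpart. The layer sizes (256 channels, 3x3 kernel) and the function names are assumptions chosen for illustration; the abstract does not give the MCP's actual dimensions.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 256, 256, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.1f}x")
```

For these assumed sizes the separable layer needs roughly an order of magnitude fewer weights, which is the general reason this building block is popular in mobile architectures.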

OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents

Authors: Akashah Shabbir, Muhammad Umer Sheikh, Muhammad Akhtar Munir, Hiyam Debary, Mustansar Fiaz, Muhammad Zaigham Zaheer, Paolo Fraccaro, Fahad Shahbaz Khan, Muhammad Haris Khan, Xiao Xiang Zhu, Salman Khan

First: 2026-02-19T18:59:54+00:00 · Latest: 2026-02-23T18:59:54+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.

中文标题/摘要

标题：OpenEarthAgent：统一的工具增强地理空间代理框架

近期多模态推理的进步使代理能够解释图像、将其与语言关联起来并执行结构化分析任务。将此类能力扩展到遥感领域仍然具有挑战性，因为模型必须在保持连贯的多步逻辑的同时，在空间尺度、地理结构和多光谱指数上进行推理。为了弥合这一差距，OpenEarthAgent 引入了一个统一框架，用于开发基于卫星图像、自然语言查询和详细推理轨迹训练的工具增强地理空间代理。训练管道依赖于结构化推理轨迹的监督微调，使模型与跨多种分析上下文的验证多步工具交互对齐。伴随的语料库包括14,538个训练实例和1,169个评估实例，训练集中有超过100,000个推理步骤，评估集中有超过7,000个推理步骤。它涵盖了城市、环境、灾害和基础设施领域，并结合了GIS操作和NDVI、NBR和NDBI等指数分析。基于显式的推理轨迹，学习到的代理展示了结构化的推理、稳定的地理空间理解以及通过工具驱动的地理空间交互实现的可解释行为。我们报告了相对于强大基线的一致改进，并且在与最近的开源和闭源模型的性能上具有竞争力。

Summary / 总结

OpenEarthAgent is a unified framework for building tool-augmented geospatial agents that reason over remote sensing data. The agents are trained by supervised fine-tuning on structured reasoning trajectories that combine satellite imagery, natural-language queries, and verified multistep tool interactions. The accompanying corpus contains 14,538 training and 1,169 evaluation instances spanning urban, environmental, disaster, and infrastructure domains. The trained agent shows structured reasoning, stable spatial understanding, and interpretable behavior, with consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.

研究旨在通过结合多模态推理和工具增强，开发能够处理遥感领域复杂任务的地理空间代理。方法是训练一个统一框架OpenEarthAgent，使用卫星图像、自然语言查询和推理轨迹。关键发现表明，该代理在基线模型上表现出一致的改进，并且与最近的开源和封闭源模型相比具有竞争力，同时展示了结构化的推理和稳定的地理空间理解能力，在各种地理空间环境中表现良好。
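
The index analyses named in the OpenEarthAgent abstract (NDVI, NBR, NDBI) are all normalized band differences. The sketch below uses the commonly cited definitions; the function names and the sample reflectance values are assumptions for illustration, and the abstract does not say how the framework's own tools compute these indices.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    denom = a + b
    return 0.0 if denom == 0 else (a - b) / denom


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def nbr(nir: float, swir2: float) -> float:
    """Normalized Burn Ratio: (NIR - SWIR2) / (NIR + SWIR2)."""
    return normalized_difference(nir, swir2)


def ndbi(swir1: float, nir: float) -> float:
    """Normalized Difference Built-up Index: (SWIR1 - NIR) / (SWIR1 + NIR)."""
    return normalized_difference(swir1, nir)


# Hypothetical surface reflectance values for a single vegetated pixel.
pixel = {"red": 0.1, "nir": 0.5, "swir1": 0.3, "swir2": 0.2}
print("NDVI:", round(ndvi(pixel["nir"], pixel["red"]), 3))
print("NBR: ", round(nbr(pixel["nir"], pixel["swir2"]), 3))
print("NDBI:", round(ndbi(pixel["swir1"], pixel["nir"]), 3))
```

All three indices fall in the range [-1, 1], so a tool-calling agent can compare them against thresholds without knowing the sensor's raw value range.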

tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Authors: Chen Wang, Hao Tan, Wang Yifan, Zhiqin Chen, Yuheng Liu, Kalyan Sunkavalli, Sai Bi, Lingjie Liu, Yiwei Hu

Venue: CVPR 2026

First: 2026-02-23T18:59:45+00:00 · Latest: 2026-02-23T18:59:45+00:00

Comments: Accepted by CVPR 2026. Project Page: https://cwchenwang.github.io/tttLRM

Abs · PDF · Code1 · Code2 · Project1

Abstract

We propose tttLRM, a novel large 3D reconstruction model that leverages a Test-Time Training (TTT) layer to enable long-context, autoregressive 3D reconstruction with linear computational complexity, further scaling the model's capability. Our framework efficiently compresses multiple image observations into the fast weights of the TTT layer, forming an implicit 3D representation in the latent space that can be decoded into various explicit formats, such as Gaussian Splats (GS) for downstream applications. The online learning variant of our model supports progressive 3D reconstruction and refinement from streaming observations. We demonstrate that pretraining on novel view synthesis tasks effectively transfers to explicit 3D modeling, resulting in improved reconstruction quality and faster convergence. Extensive experiments show that our method achieves superior performance in feedforward 3D Gaussian reconstruction compared to state-of-the-art approaches on both objects and scenes.

中文标题/摘要

标题：tttLRM：测试时训练的长上下文和自回归3D重建

我们提出了一种名为tttLRM的新颖大型3D重建模型，该模型利用测试时训练（TTT）层，以线性计算复杂度实现长上下文、自回归3D重建，进一步扩展了模型的能力。我们的框架高效地将多个图像观察压缩到TTT层的快速权重中，在潜在空间中形成隐式的3D表示，可以解码为各种显式格式，例如用于下游应用的高斯斑点（GS）。我们的模型的在线学习变体支持从流式观察中进行渐进的3D重建和细化。我们证明，对新颖视图合成任务的预训练可以有效地转移到显式的3D建模，从而提高重建质量并加快收敛速度。大量实验表明，与当前最先进的方法相比，我们的方法在物体和场景的前馈3D高斯重建方面表现出更优的性能。

Summary / 总结

The research motivation is to enable long-context, autoregressive 3D reconstruction with linear computational complexity. The main method involves using a Test-Time Training (TTT) layer to compress multiple image observations into fast weights, forming an implicit 3D representation that can be decoded into various explicit formats. Key experimental findings show that the proposed tttLRM method outperforms state-of-the-art approaches in feedforward 3D Gaussian reconstruction on both objects and scenes, with improved reconstruction quality and faster convergence after pretraining on novel view synthesis tasks.

tttLRM 是一种新型的 3D 重建模型，利用 Test-Time Training (TTT) 层实现高效长上下文、自回归的 3D 重建，并具有线性计算复杂度。该模型将多个图像观察压缩到 TTT 层的快速权重中，形成隐式的 3D 表示，可以解码为各种显式的格式。预训练在新颖视图合成任务上可以提高重建质量和收敛速度。实验表明，tttLRM 在物体和场景的 feedforward 3D 高斯重建方面优于最先进的方法。
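
The idea of compressing observations into the fast weights of a TTT layer, as described in the tttLRM abstract, can be illustrated with a toy linear model. The sketch below is not the authors' architecture: it assumes a single fast-weight matrix updated by one gradient step per observation on a squared reconstruction loss, with a hypothetical linear map standing in for the scene. Each observation costs the same amount of work, so the total cost grows linearly with the number of observations.

```python
import random


def ttt_update(W, key, value, lr=0.1):
    """One gradient step on 0.5 * ||W @ key - value||^2 with respect to W."""
    pred = [sum(w * k for w, k in zip(row, key)) for row in W]
    err = [p - v for p, v in zip(pred, value)]
    # Gradient is the outer product err * key^T.
    return [[w - lr * e * k for w, k in zip(row, key)] for row, e in zip(W, err)]


def readout(W, query):
    """Decode from the fast weights: a plain matrix-vector product."""
    return [sum(w * q for w, q in zip(row, query)) for row in W]


random.seed(0)
dim = 4
# Hypothetical ground-truth linear map standing in for "the scene".
target = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

# Stream observations one at a time; memory use stays fixed at dim x dim.
W = [[0.0] * dim for _ in range(dim)]
for _ in range(2000):
    key = [random.uniform(-1, 1) for _ in range(dim)]
    W = ttt_update(W, key, readout(target, key))

query = [0.5, -0.25, 0.1, 0.9]
error = max(abs(a - b) for a, b in zip(readout(W, query), readout(target, query)))
print("max readout error:", error)
```

Because the state is only the matrix `W`, new observations can keep refining it, which mirrors the streaming, progressive reconstruction setting the abstract describes.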

A Very Big Video Reasoning Suite

Authors: Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fang, John Kitaoka, Yile Xu, Hua Xu, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, Shaoyue Wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, Raphaël Millière, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille, Yilun Du, Ziming Liu, Bo Li, Dahua Lin, Ziwei Liu, Vikash Kumar, Yijiang Li, Lei Yang, Zhongang Cai, Hokin Deng

First: 2026-02-23T18:59:41+00:00 · Latest: 2026-02-23T18:59:41+00:00

Comments: Homepage: https://video-reason.com/

Abs · PDF · Code1 · Code2

Abstract

Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at https://video-reason.com/ .

中文标题/摘要

标题：一个非常大的视频推理套件

视频模型的快速发展主要集中在视觉质量上，而对其推理能力的探索则相对不足。视频推理将智能置于时空一致的视觉环境中，超越了文本所能自然捕捉的内容，使人们能够直观地推理时空结构，如连续性、交互性和因果关系。然而，系统地研究视频推理及其扩展行为受到大规模训练数据缺乏的阻碍。为解决这一问题，我们引入了非常大的视频推理（VBVR）数据集，这是一个前所未有的大规模资源，涵盖了200个经过精心分类的推理任务，涉及超过一百万段视频片段，比现有数据集大三个数量级。我们还提出了VBVR-Bench，这是一种可验证的评估框架，通过引入基于规则、与人类对齐的评分者，超越了基于模型的评判，使视频推理能力的再现和解释成为可能。利用VBVR套件，我们进行了第一个大规模的视频推理扩展研究，并观察到了对未见过的推理任务的早期泛化迹象。总体而言，VBVR为通用视频推理的下一阶段研究奠定了基础。数据、基准工具包和模型可在https://video-reason.com/ 公开获取。

Summary / 总结

This paper addresses video reasoning, an underexplored capability of video models whose systematic study has been held back by a lack of large-scale training data. The authors introduce the Very Big Video Reasoning (VBVR) Dataset, which spans 200 curated reasoning tasks and over one million video clips, roughly three orders of magnitude more than existing datasets. They also present VBVR-Bench, a verifiable evaluation framework that uses rule-based, human-aligned scorers instead of model-based judging for reproducible and interpretable results. A large-scale scaling study on the suite shows early signs of emergent generalization to unseen reasoning tasks.

该论文关注视频推理这一尚未充分探索的领域，对于理解时空一致的视觉环境至关重要。为了解决大规模训练数据不足的问题，作者引入了Very Big Video Reasoning (VBVR) 数据集，包含200个精心策划的推理任务和超过一百万的视频片段，远大于现有数据集。他们还提出了VBVR-Bench，这是一种可验证的评估框架，包括基于规则的评分者，能够实现可重复和可解释的视频推理能力评估。研究使用VBVR套件揭示了对未见推理任务的早期泛化迹象，为未来通用视频推理研究奠定了基础。

Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

Authors: David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko

First: 2026-02-23T18:59:27+00:00 · Latest: 2026-02-23T18:59:27+00:00

Abs · PDF · Code1 · Code2

Abstract

LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill-based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files. SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware-like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context-aware authorization frameworks. Our benchmark is available at https://www.skill-inject.com/.

中文标题/摘要

标题：技能注入：衡量代理对技能文件攻击的脆弱性

LLM代理正在迅速发展，得益于代码执行、工具以及最近引入的代理技能功能。技能允许用户通过专门的第三方代码、知识和指令扩展LLM应用程序的功能。虽然这可以将代理能力扩展到新的领域，但也为提示注入攻击提供了新的攻击面。我们识别出基于技能的提示注入是一个重大威胁，并引入了SkillInject基准，评估广泛使用的LLM代理通过技能文件遭受注入攻击的易感性。SkillInject包含202个注入任务对，攻击范围从明显的恶意注入到隐藏在合法指令中的微妙、上下文相关的攻击。我们对前沿LLM进行了评估，从安全性和实用性两个方面衡量其对注入攻击的易感性。结果显示，当前的代理高度易受攻击，前沿模型的攻击成功率高达80%，经常执行极其有害的指令，包括数据泄露、破坏性操作和类似勒索软件的行为。此外，这些结果表明，这个问题不会通过模型扩展或简单的输入过滤来解决，而是需要具备上下文感知授权框架的稳健代理安全。我们的基准可以在https://www.skill-inject.com/获取。

Summary / 总结

The paper studies the vulnerability of LLM agents to skill-based prompt injection, in which malicious instructions are hidden in the third-party skill files that extend an agent's capabilities. The SkillInject benchmark contains 202 injection-task pairs, ranging from obviously malicious injections to subtle, context-dependent attacks, and measures both security (harmful instruction avoidance) and utility (legitimate instruction compliance). Frontier models reach attack success rates of up to 80%, often executing harmful instructions such as data exfiltration, destructive actions, and ransomware-like behavior. The results suggest that robust agent security will require context-aware authorization frameworks rather than model scaling or simple input filtering.

论文关注LLM代理对基于技能的提示注入攻击的脆弱性，这些攻击利用代理技能特性扩展其功能。研究引入了SkillInject基准，包含202个注入任务对，以评估LLM代理对这类攻击的易感性。评估结果显示，领先模型的攻击成功率高达80%，经常执行诸如数据泄露和勒索软件行为等有害指令。结果表明，稳健的安全性需要上下文感知的授权框架，而不仅仅是模型扩展或简单的输入过滤。
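
The two quantities the SkillInject abstract reports, security and utility, can be computed from per-trial outcomes along the following lines. This is a minimal sketch: the `Trial` record and its field names are hypothetical, the sample outcomes are made up, and the benchmark's actual scoring code may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Outcome of one injection-task pair (hypothetical record format)."""
    injection_executed: bool               # agent followed the injected instruction
    legitimate_instruction_followed: bool  # agent complied with the benign instruction


def attack_success_rate(trials):
    """Fraction of trials in which the injected instruction was executed."""
    return sum(t.injection_executed for t in trials) / len(trials)


def utility(trials):
    """Fraction of trials in which the legitimate instruction was followed."""
    return sum(t.legitimate_instruction_followed for t in trials) / len(trials)


# Made-up outcomes for five trials, for illustration only.
trials = [
    Trial(True, True),
    Trial(False, True),
    Trial(True, False),
    Trial(False, True),
    Trial(True, True),
]
print("attack success rate:", attack_success_rate(trials))
print("utility:", utility(trials))
```

Reporting both numbers matters because an agent that refuses every skill instruction would score perfectly on security while being useless in practice.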

Agentic AI for Scalable and Robust Optical Systems Control

Authors: Zehao Wang, Mingzhe Han, Wei Cheng, Yue-Kai Huang, Philip Ji, Denton Wu, Mahdi Safari, Flemming Holtorf, Kenaish AlQubaisi, Norbert M. Linke, Danyang Zhuo, Yiran Chen, Ting Wang, Dirk Englund, Tingjun Chen

First: 2026-02-23T18:54:32+00:00 · Latest: 2026-02-23T18:54:32+00:00

Abs · PDF · Code1 · Code2

Abstract

We present AgentOptics, an agentic AI framework for high-fidelity, autonomous optical system control built on the Model Context Protocol (MCP). AgentOptics interprets natural language tasks and executes protocol-compliant actions on heterogeneous optical devices through a structured tool abstraction layer. We implement 64 standardized MCP tools across 8 representative optical devices and construct a 410-task benchmark to evaluate request understanding, role-aware responses, multi-step coordination, robustness to linguistic variation, and error handling. We assess two deployment configurations, commercial online LLMs and locally hosted open-source LLMs, and compare them with LLM-based code generation baselines. AgentOptics achieves 87.7% to 99.0% average task success rates, significantly outperforming code-generation approaches, which reach up to 50% success. We further demonstrate broader applicability through five case studies extending beyond device-level control to system orchestration, monitoring, and closed-loop optimization. These include DWDM link provisioning and coordinated monitoring of coherent 400 GbE and analog radio-over-fiber (ARoF) channels; autonomous characterization and bias optimization of a wideband ARoF link carrying 5G fronthaul traffic; multi-span channel provisioning with launch power optimization; closed-loop fiber polarization stabilization; and distributed acoustic sensing (DAS)-based fiber monitoring with LLM-assisted event detection. These results establish AgentOptics as a scalable, robust paradigm for autonomous control and orchestration of heterogeneous optical systems.

中文标题/摘要

标题：代理AI在可扩展和稳健的光学系统控制中的应用

我们提出了AgentOptics，一种基于模型上下文协议（MCP）的高保真度自主光学系统控制的代理AI框架。AgentOptics 解释自然语言任务并通过结构化的工具抽象层执行符合协议的操作，覆盖了8种代表性光学设备上的64个标准化MCP工具，并构建了一个包含410个任务的基准测试，以评估请求理解、角色感知响应、多步协调、语言变异的鲁棒性以及错误处理。我们评估了两种部署配置——商用在线LLM和本地托管的开源LLM，并与基于LLM的代码生成基线进行比较。AgentOptics 实现了87.7%至99.0%的平均任务成功率，显著优于代码生成方法，后者最高成功率仅为50%。我们还通过五个案例研究进一步展示了其更广泛的应用，这些案例研究不仅扩展到设备级控制，还涉及系统编排、监控和闭环优化。这些案例包括DWDM链路配置、相干400 GbE和模拟射频光纤（ARoF）通道的协调监控；宽带ARoF链路的自主表征和偏置优化，该链路承载5G前传流量；多段通道配置，包括发射功率优化；闭环光纤偏振稳定；以及基于LLM辅助事件检测的分布式声学传感（DAS）光纤监控。这些结果确立了AgentOptics 作为自主控制和编排异构光学系统的可扩展和稳健范式的地位。

Summary / 总结

AgentOptics is an agentic AI framework for autonomous optical system control built on the Model Context Protocol (MCP). It interprets natural language tasks and executes protocol-compliant actions through 64 standardized MCP tools spanning 8 representative optical devices. On a 410-task benchmark it achieves 87.7% to 99.0% average task success rates, significantly outperforming LLM-based code-generation baselines, which reach at most 50%. Five case studies extend the framework beyond device-level control to system orchestration, monitoring, and closed-loop optimization.

AgentOptics 是一个使用 Model Context Protocol (MCP) 的自主光学系统控制框架，能够解析自然语言任务并在多种光学设备上执行操作。该框架的任务成功率达到了 87.7% 到 99.0%，显著优于代码生成方法。它通过 DWDM 链路配置、相干 400 GbE 和模拟射频光纤 (ARoF) 监控、以及闭环光纤偏振稳定等案例研究展示了其广泛的适用性。

TROLL: Trust Regions improve Reinforcement Learning for Large Language Models

Authors: Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, Gerhard Neumann

Venue: ICLR 2026

First: 2025-10-04T14:14:20+00:00 · Latest: 2026-02-23T18:54:13+00:00

Comments: Published as a conference paper at ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model's most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model's inference behavior. Across mathematical reasoning and code generation tasks, model families, as well as advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.

中文标题/摘要

标题：TROLL：信任区域提高大型语言模型的强化学习

使用PPO类似剪裁目标的强化学习（RL）已成为基于奖励的大型语言模型（LLM）微调的标准选择。尽管最近的工作探索了改进的优势估计和归一化方法，但剪裁机制本身仍未得到改进。剪裁最初作为原则性KL信任区域的代理引入，但它是对KL约束的粗略近似，经常导致不稳定的更新和次优性能。我们用一种新颖的离散可微信任区域投影取代剪裁目标，提供原则性的令牌级KL约束。投影作用于模型最重要的令牌logits的稀疏子集，以平衡计算成本和投影效果。我们的方法，大型语言模型的信任区域优化（TROLL），在训练期间直接替代PPO类似的剪裁，而不改变模型的推理行为。在数学推理和代码生成任务、模型系列以及优势估计方法方面，TROLL在训练速度、稳定性和最终成功率方面均优于PPO类似的剪裁。

Summary / 总结

The paper addresses the limitations of clipping in reinforcement learning for large language models, which often leads to unstable updates and suboptimal performance. It introduces TROLL, a method that replaces the clip objective with a discrete differentiable trust region projection, providing principled token-level KL constraints. TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates across various tasks and model families.

TROLL 通过将 PPO 类目标中的剪裁机制替换为一种新颖的离散可微信任区域投影，改进了大型语言模型的强化学习。这种方法提供了原理上的 token 级别 KL 约束，并平衡了计算成本和投影效果。在各种任务和模型家族中，TROLL 在训练速度、稳定性和最终成功率方面均优于 PPO 类剪裁机制。
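
The TROLL abstract calls clipping a crude proxy for a KL-based trust region. The sketch below shows why that can happen, using the standard PPO clipped surrogate and a KL divergence over a toy three-token vocabulary. It does not implement TROLL's projection; the probabilities, the KL direction, and the function names are assumptions for illustration.

```python
import math


def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate for one sampled token."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical token distributions before and after an update.
old_policy = [0.4, 0.3, 0.3]
new_policy = [0.4, 0.59, 0.01]

sampled_token = 0
ratio = new_policy[sampled_token] / old_policy[sampled_token]
surrogate = ppo_clip_term(ratio, advantage=1.0)
kl = kl_divergence(old_policy, new_policy)

print(f"ratio on sampled token: {ratio:.2f} (inside the clip range)")
print(f"clipped surrogate: {surrogate:.2f}")
print(f"KL(old || new) over all tokens: {kl:.3f}")
```

The probability ratio on the sampled token is exactly 1, so clipping sees no change, yet the full distribution has moved substantially. A constraint on the token-level KL, which is what the abstract says TROLL enforces, would register that shift.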

Recurrent Structural Policy Gradient for Partially Observable Mean Field Games

Authors: Clarisse Wibault, Johannes Forkel, Sebastian Towers, Tiphaine Wibault, Juan Duque, George Whittle, Andreas Schaab, Yucheng Yang, Chiyuan Wang, Michael Osborne, Benjamin Moll, Jakob Foerster

First: 2026-02-23T18:53:09+00:00 · Latest: 2026-02-23T18:53:09+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Mean Field Games (MFGs) provide a principled framework for modeling interactions in large population models: at scale, population dynamics become deterministic, with uncertainty entering only through aggregate shocks, or common noise. However, algorithmic progress has been limited since model-free methods are too high variance and exact methods scale poorly. Recent Hybrid Structural Methods (HSMs) use Monte Carlo rollouts for the common noise in combination with exact estimation of the expected return, conditioned on those samples. However, HSMs have not been scaled to Partially Observable settings. We propose Recurrent Structural Policy Gradient (RSPG), the first history-aware HSM for settings involving public information. We also introduce MFAX, our JAX-based framework for MFGs. By leveraging known transition dynamics, RSPG achieves state-of-the-art performance as well as an order-of-magnitude faster convergence and solves, for the first time, a macroeconomics MFG with heterogeneous agents, common noise and history-aware policies. MFAX is publicly available at: https://github.com/CWibault/mfax.

中文标题/摘要

标题：部分可观测的均场博弈的循环结构策略梯度

均场博弈（MFGs）提供了一种原理性的框架来建模大规模群体模型中的相互作用：在大规模情况下，群体动力学变得确定性，不确定性仅通过总体冲击或公共噪声进入。然而，由于无模型方法的方差过高且精确方法的可扩展性较差，算法进展有限。最近的混合结构方法（HSMs）使用蒙特卡洛展开公共噪声，并结合基于这些样本的预期回报的精确估计。然而，HSMs尚未扩展到部分可观测的设置。我们提出了循环结构策略梯度（RSPG），这是第一个具有历史意识的HSM，适用于涉及公共信息的场景。我们还引入了MFAX，这是一个基于JAX的MFG框架。通过利用已知的转换动力学，RSPG实现了最先进的性能，收敛速度提高了数量级，并首次解决了包含异质代理、公共噪声和历史意识策略的宏观经济MFG。MFAX可在以下网址获取：https://github.com/CWibault/mfax。

Summary / 总结

The paper targets Mean Field Games (MFGs) with common noise, where model-free methods suffer from high variance and exact methods scale poorly. It introduces Recurrent Structural Policy Gradient (RSPG), the first history-aware Hybrid Structural Method (HSM) for partially observable settings involving public information; like other HSMs, it combines Monte Carlo rollouts of the common noise with exact estimation of the expected return conditioned on those samples. The authors also release MFAX, a JAX-based framework for MFGs. RSPG achieves state-of-the-art performance with order-of-magnitude faster convergence and solves, for the first time, a macroeconomics MFG with heterogeneous agents, common noise, and history-aware policies.

论文解决了在大型人口模型中应用模型自由方法的挑战，其中不确定性源于公共噪声。它提出了递归结构策略梯度（RSPG）方法，该方法结合了蒙特卡洛滚动和预期回报的确切估计，并首次处理了包含公共信息的不完全可观测设置。RSPG实现了最先进的性能和更快的收敛速度，并首次解决了包含异质代理、公共噪声和历史感知策略的宏观经济MFG模型。
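
The hybrid structure described in the RSPG abstract, sampling the common noise while treating everything else exactly, can be sketched on a toy model. The example below assumes a hypothetical two-state population with known transition matrices and a fixed policy; it has no policy optimisation or recurrence and does not use MFAX. It only shows that, once a shock path is sampled, the population distribution and its expected return follow deterministically.

```python
import random

# Hypothetical two-state toy model with known transition dynamics.
# TRANSITIONS[shock][i][j] is the probability of moving from state i to j.
TRANSITIONS = {
    "boom": [[0.9, 0.1], [0.5, 0.5]],
    "bust": [[0.6, 0.4], [0.2, 0.8]],
}
REWARD = [1.0, 0.0]  # per-step reward for being in state 0 or state 1


def step_distribution(mu, shock):
    """Exact population update for one step, given the common shock."""
    P = TRANSITIONS[shock]
    n = len(mu)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]


def exact_return_given_shocks(mu0, shocks, gamma=0.95):
    """Discounted expected return, computed exactly for one shock path."""
    mu, total, discount = list(mu0), 0.0, 1.0
    for shock in shocks:
        total += discount * sum(m * r for m, r in zip(mu, REWARD))
        mu = step_distribution(mu, shock)
        discount *= gamma
    return total, mu


def hybrid_estimate(mu0, horizon, n_paths, seed=0):
    """Monte Carlo over common-noise paths, exact expectation within each."""
    rng = random.Random(seed)
    returns = []
    for _ in range(n_paths):
        shocks = [rng.choice(["boom", "bust"]) for _ in range(horizon)]
        ret, _ = exact_return_given_shocks(mu0, shocks)
        returns.append(ret)
    return sum(returns) / n_paths


print("estimated return:", hybrid_estimate([1.0, 0.0], horizon=20, n_paths=200))
```

The only randomness in the estimate comes from the sampled shock paths, which is why this kind of estimator has lower variance than sampling individual agent trajectories as well.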

Towards a Science of AI Agent Reliability

Authors: Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan

First: 2026-02-18T18:05:44+00:00 · Latest: 2026-02-23T18:49:07+00:00

Comments: Interactive dashboard available at: https://hal.cs.princeton.edu/reliability

Abs · PDF · Code1 · Code2 · Project1

Abstract

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.

中文标题/摘要

标题：迈向AI代理可靠性的科学

AI代理正越来越多地被部署以执行重要任务。尽管在标准基准上的准确率得分不断提高，表明快速进步，但许多代理仍然在实践中继续失败。这种差异突显了当前评估的基本局限性：将代理行为压缩为单一成功指标掩盖了关键的操作缺陷。值得注意的是，它忽略了代理是否在多次运行中表现一致、能否抵御干扰、能否预测性地失败或具有有限的误差严重性。基于安全关键工程，我们通过提出十二个具体的指标，从四个关键维度分解代理可靠性：一致性、鲁棒性、可预测性和安全性，提供了一个全面的性能概况。在两个互补基准上评估14个模型，我们发现最近的能力提升仅带来了可靠性的小幅提高。通过揭示这些持续的局限性，我们的指标补充了传统的评估，同时提供了关于代理如何表现、退化和失败的推理工具。

Summary / 总结

The research aims to address the gap between the high accuracy scores of AI agents on benchmarks and their practical failures. It introduces twelve metrics to evaluate AI agent reliability across four dimensions: consistency, robustness, predictability, and safety. Evaluating 14 models, the study finds that recent improvements in capability have only led to minor enhancements in reliability, highlighting persistent limitations in current AI systems.

研究旨在解决AI代理在基准测试中的表现与实际可靠性之间的差距。它提出了十二个指标来评估代理可靠性的四个关键维度：一致性、鲁棒性、可预测性和安全性。通过对两个基准测试中14个模型的评估，研究发现最近的进步仅在可靠性方面带来了微小的改进，揭示了代理性能、鲁棒性和可预测性中的持续问题。
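
The reliability abstract argues that a single success metric hides whether an agent behaves consistently across runs. The sketch below contrasts an average success rate with a stricter "solved in every run" rate. Both functions are illustrative assumptions rather than any of the paper's twelve metrics, and the outcomes are made up.

```python
def mean_success_rate(outcomes):
    """Average success over all tasks and runs (the usual headline number)."""
    total_runs = sum(len(runs) for runs in outcomes.values())
    return sum(sum(runs) for runs in outcomes.values()) / total_runs


def all_runs_success_rate(outcomes):
    """Fraction of tasks solved in every repeated run (a consistency view)."""
    return sum(all(runs) for runs in outcomes.values()) / len(outcomes)


# Made-up outcomes: task name -> success flag for each of three runs.
outcomes = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, True],
    "task_d": [True, True, True],
}
print("mean success rate:", mean_success_rate(outcomes))
print("solved in every run:", all_runs_success_rate(outcomes))
```

In this toy data the agent succeeds on 75% of runs but solves only half of the tasks reliably, which is the kind of gap a single aggregate score conceals.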

Do Large Language Models Understand Data Visualization Rules?

Authors: Martin Sinnona, Valentin Bonas, Emmanuel Iarussi, Viviana Siless

First: 2026-02-23T18:47:51+00:00 · Latest: 2026-02-23T18:47:51+00:00

Abs · PDF · Code1 · Code2

Abstract

Data visualization rules, derived from decades of research in design and perception, ensure trustworthy chart communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they can reason about and enforce visualization rules directly. Constraint-based systems such as Draco encode these rules as logical constraints for precise automated checks, but maintaining symbolic encodings requires expert effort, motivating the use of LLMs as flexible rule validators. In this paper, we present the first systematic evaluation of LLMs against visualization rules using hard-verification ground truth derived from Answer Set Programming (ASP). We translated a subset of Draco's constraints into natural-language statements and generated a controlled dataset of 2,000 Vega-Lite specifications annotated with explicit rule violations. LLMs were evaluated on both accuracy in detecting violations and prompt adherence, which measures whether outputs follow the required structured format. Results show that frontier models achieve high adherence (Gemma 3 4B / 27B: 100%, GPT-oss 20B: 98%) and reliably detect common violations (F1 up to 0.82), yet performance drops for subtler perceptual rules (F1 < 0.15 for some categories) and for outputs generated from technical ASP formulations. Translating constraints into natural language improved performance by up to 150% for smaller models. These findings demonstrate the potential of LLMs as flexible, language-driven validators while highlighting their current limitations compared to symbolic solvers.

中文标题/摘要

标题：大型语言模型理解数据可视化规则吗？

数据可视化规则源自数十年的设计和感知研究，确保了图表通信的可信度。尽管先前的工作表明大型语言模型（LLMs）能够生成图表或标记误导性图表，但尚不清楚它们是否能够直接推理和执行可视化规则。基于约束的系统如Draco将这些规则编码为逻辑约束，以实现精确的自动化检查，但维护符号编码需要专家努力，因此推动了使用LLMs作为灵活的规则验证器。在本文中，我们首次使用来自Answer Set Programming (ASP)的硬验证地面真相对LLMs进行了系统评估，以测试其对可视化规则的遵守情况。我们将Draco的一部分约束转换为自然语言陈述，并生成了一个包含2,000个Vega-Lite规范的受控数据集，这些规范被明确标注了规则违规。LLMs在检测违规行为的准确性以及提示遵守度（衡量输出是否遵循所需的结构化格式）方面进行了评估。结果显示，前沿模型在遵守度方面表现良好（Gemma 3 4B / 27B：100%，GPT-oss 20B：98%），并且能够可靠地检测常见违规行为（F1值高达0.82），但性能在更微妙的感知规则方面下降（某些类别F1值<0.15），并且对于从技术ASP公式生成的输出表现不佳。将约束转换为自然语言可将较小模型的性能提高高达150%。这些发现展示了LLMs作为灵活的语言驱动验证器的潜力，同时也指出了它们与符号求解器相比的当前局限性。

Summary / 总结

This paper evaluates large language models (LLMs) in enforcing data visualization rules by translating constraint-based rules into natural language and generating a controlled dataset of 2,000 Vega-Lite specifications. The results show that frontier models like Gemma and GPT-oss achieve high adherence to the required structured format and can reliably detect common violations with F1 scores up to 0.82, although performance decreases for more subtle perceptual rules. Smaller models benefit significantly from this natural language approach, improving performance by up to 150%. This study highlights the potential of LLMs as flexible validators but also their current limitations compared to symbolic solvers.

本文评估了大型语言模型（LLMs）理解和执行数据可视化规则的能力。研究使用了2,000个带有明确规则违规标注的Vega-Lite规范数据集，这些规范从Draco的逻辑约束翻译成了自然语言。结果显示，LLMs，尤其是前沿模型，能够实现对所需结构格式的高度遵守，并且可以可靠地检测常见的违规行为，F1分数最高可达0.82。然而，对于更微妙的知觉规则，性能会下降，而较小的模型从这种自然语言翻译方法中受益显著。
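
The two measures in the visualization-rules abstract, prompt adherence and violation-detection F1, can be sketched as follows. The JSON output schema (`rule`, `violated`) is a hypothetical stand-in, since the abstract only says outputs must follow a required structured format, and the sample predictions are made up.

```python
import json

# Hypothetical output schema; the paper's required format is not specified here.
REQUIRED_KEYS = {"rule", "violated"}


def adheres_to_format(raw_output: str) -> bool:
    """True if the model output parses as JSON with the expected fields."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(parsed, dict)
        and REQUIRED_KEYS.issubset(parsed)
        and isinstance(parsed["violated"], bool)
    )


def f1_score(predicted, actual):
    """F1 for binary violation detection (True means 'rule violated')."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(adheres_to_format('{"rule": "bar_zero_baseline", "violated": true}'))
print(adheres_to_format("The chart looks fine to me."))
print(f1_score([True, True, False, False], [True, False, True, False]))
```

Keeping adherence separate from F1 avoids penalising a model twice: an unparseable answer counts against adherence, while detection quality is judged only on outputs that can be read.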

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Authors: Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak

First: 2026-02-23T18:46:27+00:00 · Latest: 2026-02-23T18:46:27+00:00

Comments: Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.

中文标题/摘要

标题：KNIGHT：基于知识图谱的自适应难度调整多项选择题生成

随着大型语言模型（LLMs）的发展，它们在检索增强生成（RAG）等应用中变得至关重要。然而，评估这些系统仍然受到构建专门评估数据集所需时间和成本的限制。我们引入了KNIGHT，这是一种基于LLM和知识图谱的框架，可以从外部来源生成多项选择题（MCQ）数据集。KNIGHT构建了一个特定主题的知识图谱，这是一种结构化且简洁的实体和关系总结，可以重复使用以生成由教师控制的难度级别，包括多跳问题，而无需反复重新输入完整源文本。这个知识图谱作为可重复使用的压缩状态，使得问题生成成为图上的廉价读取操作。我们使用维基百科/维基数据实例化KNIGHT，同时保持框架的领域无关性和本体无关性。作为案例研究，KNIGHT生成了六个历史、生物学和数学领域的MCQ数据集。我们从五个标准评估了质量：流畅性、明确性（单一正确答案）、主题相关性、选项独特性和基于提供的来源可回答性（作为幻觉的代理）。结果表明，KNIGHT能够从可重复使用的图表示中实现高效生成，这些标准下的质量都很高，并且模型排名与MMLU风格的基准一致，同时支持特定主题和难度控制的评估。

Summary / 总结

KNIGHT is an LLM-based framework that generates multiple-choice questions (MCQs) from a topic-specific knowledge graph, enabling the creation of instructor-controlled difficulty levels without re-feeding the full source text. It produces six MCQ datasets in History, Biology, and Mathematics, and achieves high quality across fluency, unambiguity, topic relevance, option uniqueness, and answerability. The framework supports token- and cost-efficient generation and yields model rankings aligned with MMLU-style benchmarks.

KNIGHT 是一个基于 LLM 的框架，通过主题特定的知识图谱生成多项选择题（MCQ），无需重新输入完整源文本即可生成由教师控制难度级别的问题。它生成了六个关于历史、生物学和数学的 MCQ 数据集，这些数据集在流畅性、明确性、主题相关性、选项独特性和可回答性方面都达到了高质量。该框架支持节省令牌和成本的生成，所得模型排名与 MMLU 风格的基准一致。
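
The idea of treating the knowledge graph as a reusable state that is read repeatedly, with difficulty tied to the number of hops, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy triples, the function names `build_graph`, `paths`, and `make_question`, and the question template are hypothetical and are not KNIGHT's actual pipeline, which uses an LLM and Wikipedia/Wikidata.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, relation, object); not taken from the paper.
TRIPLES = [
    ("Marie Curie", "born in", "Warsaw"),
    ("Warsaw", "capital of", "Poland"),
    ("Poland", "located in", "Europe"),
]

def build_graph(triples):
    """Index triples by subject so later reads never touch the source text."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph

def paths(graph, start, hops):
    """Enumerate all relation paths of exactly `hops` edges starting at `start`."""
    if hops == 0:
        return [([], start)]
    found = []
    for rel, obj in graph.get(start, []):
        for rels, end in paths(graph, obj, hops - 1):
            found.append(([rel] + rels, end))
    return found

def make_question(graph, start, hops):
    """Turn the first path of the requested length into a (question, answer) pair."""
    candidates = paths(graph, start, hops)
    if not candidates:
        return None
    rels, answer = candidates[0]
    chain = " -> ".join(rels)
    return f"Starting from {start}, follow: {chain}. Where do you end up?", answer

graph = build_graph(TRIPLES)
easy = make_question(graph, "Marie Curie", 1)   # single-hop question
hard = make_question(graph, "Marie Curie", 2)   # two-hop question
```

In this sketch the `hops` argument plays the role of a difficulty knob: the graph is built once and each additional question is only a traversal over it.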

AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization

Authors: Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, Ion Stoica

First: 2026-02-23T18:45:31+00:00 · Latest: 2026-02-23T18:45:31+00:00

Abs · PDF · Code1 · Code2

Abstract

The paradigm of automated program generation is shifting from one-shot generation to inference-time search, where Large Language Models (LLMs) function as semantic mutation operators within evolutionary loops. While effective, these systems are currently governed by static schedules that fail to account for the non-stationary dynamics of the search process. This rigidity results in substantial computational waste, as resources are indiscriminately allocated to stagnating populations while promising frontiers remain under-exploited. We introduce AdaEvolve, a framework that reformulates LLM-driven evolution as a hierarchical adaptive optimization problem. AdaEvolve uses an "accumulated improvement signal" to unify decisions across three levels: Local Adaptation, which dynamically modulates the exploration intensity within a population of solution candidates; Global Adaptation, which routes the global resource budget via bandit-based scheduling across different solution candidate populations; and Meta-Guidance, which, when progress stalls, generates novel solution tactics based on previously generated solutions and their corresponding improvements. We demonstrate that AdaEvolve consistently outperforms the open-sourced baselines across 185 different open-ended optimization problems, including combinatorial, systems optimization, and algorithm design problems.

中文标题/摘要

标题：AdaEvolve：自适应的大语言模型驱动零阶优化

自动化程序生成的范式正从单次生成转向推理时的搜索，在这种范式中，大型语言模型（LLMs）作为语义变异操作符在进化循环中发挥作用。虽然有效，但这些系统目前由静态的调度方案控制，未能考虑搜索过程中的非平稳动态。这种刚性导致了巨大的计算浪费，因为资源被无差别地分配给停滞不前的群体，而有潜力的前沿则未被充分利用。我们提出了AdaEvolve框架，将LLM驱动的进化重新表述为分层自适应优化问题。AdaEvolve使用“累积改进信号”在三个层次上统一决策：局部自适应，动态调节候选解群体内的探索强度；全局自适应，通过基于多臂老虎机的调度将全局资源预算分配到不同的候选解群体；元指导，在进展停滞时基于先前生成的解及其相应改进生成新的解策略。我们证明了AdaEvolve在185个不同的开放式优化问题（包括组合优化、系统优化和算法设计问题）上始终优于开源基线。

Summary / 总结

AdaEvolve is a framework that reformulates LLM-driven evolution as a hierarchical adaptive optimization problem. It uses an 'accumulated improvement signal' to dynamically adjust exploration intensity, route global resources, and generate novel tactics. AdaEvolve consistently outperforms open-sourced baselines across 185 open-ended optimization problems, including combinatorial, systems optimization, and algorithm design tasks.

AdaEvolve 是一个框架，将基于 LLM 的进化重新表述为分层自适应优化问题。它使用 '累积改进信号' 来动态调整探索强度、分配全局资源并生成新的策略。实验表明，AdaEvolve 在 185 个开放性优化问题上优于开源基准，包括组合、系统优化和算法设计任务。
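
The abstract describes Global Adaptation only as "bandit-based scheduling" across populations. As a rough illustration of what such scheduling can look like, the sketch below uses the standard UCB1 rule, which is a swapped-in textbook technique and not necessarily the one AdaEvolve uses. The names `ucb_pick` and `allocate`, the exploration constant `c`, and the fixed per-population gains are all assumptions.

```python
import math

def ucb_pick(counts, mean_gain, c=1.0):
    """Return the index of the population to fund next under a UCB1-style rule."""
    for i, n in enumerate(counts):
        if n == 0:          # try every population once before comparing
            return i
    total = sum(counts)
    scores = [
        mean_gain[i] + c * math.sqrt(2 * math.log(total) / counts[i])
        for i in range(len(counts))
    ]
    return scores.index(max(scores))

def allocate(gain_per_step, budget, c=1.0):
    """Spend `budget` evaluation steps across populations.

    gain_per_step[i] is a hypothetical, fixed improvement observed each time
    population i is advanced; a real search would observe noisy gains instead.
    """
    k = len(gain_per_step)
    counts = [0] * k
    mean_gain = [0.0] * k
    for _ in range(budget):
        i = ucb_pick(counts, mean_gain, c)
        counts[i] += 1
        mean_gain[i] += (gain_per_step[i] - mean_gain[i]) / counts[i]
    return counts

# A stagnating population (gain 0.0) versus an improving one (gain 0.5).
spent = allocate([0.0, 0.5], budget=50)
```

With these toy gains, most of the budget flows to the improving population while the stagnating one still receives occasional exploratory steps.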

LAD: Learning Advantage Distribution for Reasoning

Authors: Wendi Li, Sharon Li

First: 2026-02-23T18:44:10+00:00 · Latest: 2026-02-23T18:44:10+00:00

Abs · PDF · Code1 · Code2

Abstract

Current reinforcement learning objectives for large-model reasoning primarily focus on maximizing expected rewards. This paradigm can lead to overfitting to dominant reward signals, while neglecting alternative yet valid reasoning trajectories, thereby limiting diversity and exploration. To address this issue, we introduce Learning Advantage Distributions (LAD), a distribution-matching framework that replaces advantage maximization with learning the advantage-induced distribution. By establishing the equivalence between the optimal policy update and an advantage-based target distribution, we derive a practical LAD objective formulated as minimizing an $f$-divergence between the policy-induced and advantage-induced distributions. This yields a gradient update that increases likelihood for high-advantage responses while suppressing over-confident probability growth, preventing collapse without requiring auxiliary entropy regularization. LAD incurs no extra training cost compared to GRPO and scales naturally to LLM post-training. In a controlled bandit setting, LAD faithfully recovers the multimodal advantage distribution, validating the theoretical formulation. Experiments on math and code reasoning tasks across several LLM backbones show that LAD reliably improves both accuracy and generative diversity.

中文标题/摘要

标题：LAD：推理中的学习优势分布

当前大规模模型推理的强化学习目标主要集中在最大化预期奖励上。这种范式可能导致过度拟合到主导的奖励信号，而忽视了其他同样有效的推理路径，从而限制了多样性和探索。为了解决这一问题，我们引入了学习优势分布（LAD），这是一种分布匹配框架，用学习由优势引起的分布来替代优势最大化。通过建立最优策略更新与基于优势的目标分布之间的等价性，我们推导出一个实用的LAD目标，该目标以最小化由策略引起的分布与由优势引起的分布之间的$f$-散度的形式表示。这产生了一个梯度更新，增加了高优势响应的可能性，同时抑制了过度自信的概率增长，防止了崩溃，而无需额外的熵正则化。与GRPO相比，LAD没有额外的训练成本，并且自然地扩展到LLM后训练。在受控的多臂老虎机环境中，LAD准确地恢复了多模态优势分布，验证了理论形式。在多个LLM基础模型上的数学和代码推理任务中进行的实验表明，LAD能够可靠地提高准确性和生成多样性。

Summary / 总结

The research aims to enhance diversity and exploration in large-model reasoning by addressing the limitations of current reinforcement learning objectives, which focus on maximizing expected rewards. The method, Learning Advantage Distributions (LAD), shifts the focus from maximizing advantage to learning the distribution induced by advantage. In a controlled bandit setting, LAD recovers the multimodal advantage distribution, supporting the theoretical formulation; on math and code reasoning tasks across several LLM backbones, it improves both accuracy and generative diversity.

研究旨在通过解决当前强化学习目标主要关注最大化预期奖励的问题，增强大型模型推理的多样性和探索性。提出的Learning Advantage Distribution (LAD) 方法引入了一种分布匹配框架，学习优势诱导的分布而非最大化优势。这种方法导致了一个梯度更新，增加了高优势响应的可能性并抑制了过度自信的概率增长，在多种LLM后训练中提高了数学和代码推理任务的准确性和生成多样性。
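
To make the distribution-matching idea concrete, here is a minimal sketch under stated assumptions. The abstract does not give the form of the advantage-induced distribution or the specific $f$-divergence, so the softmax target, the `temperature` parameter, and the choice of KL divergence (the $f$-divergence generated by $f(t) = t \log t$) with the target as first argument are all illustrative assumptions, not the paper's objective.

```python
import math

def advantage_distribution(advantages, temperature=1.0):
    """Softmax over advantages: an assumed form of an advantage-induced target."""
    m = max(advantages)
    weights = [math.exp((a - m) / temperature) for a in advantages]
    z = sum(weights)
    return [w / z for w in weights]

def kl_divergence(p, q):
    """KL(p || q), the f-divergence generated by f(t) = t log t."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two equally good responses and one poor one (hypothetical advantages).
target = advantage_distribution([1.0, 1.0, -1.0])

collapsed = [0.98, 0.01, 0.01]   # policy that put nearly all mass on one mode
balanced = [0.45, 0.45, 0.10]    # policy that keeps both good modes alive

gap_collapsed = kl_divergence(target, collapsed)
gap_balanced = kl_divergence(target, balanced)
```

In this toy setting the collapsed policy sits much farther from the target than the balanced one, which is the intuition behind matching a multimodal target instead of pushing all mass onto a single high-advantage response.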

To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering

Authors: Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen, Yu Hou, Yifan Wu, Yang Ruan, Rui Zhang

First: 2026-02-23T18:42:50+00:00 · Latest: 2026-02-23T18:42:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Objective: To improve the efficiency of medical question answering (MedQA) with large language models (LLMs) by avoiding unnecessary reasoning while maintaining accuracy. Methods: We propose Selective Chain-of-Thought (Selective CoT), an inference-time strategy that first predicts whether a question requires reasoning and generates a rationale only when needed. Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks: HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA. Metrics included accuracy, total generated tokens, and inference time. Results: Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss (≤4%). In some model-task pairs, it achieved both higher accuracy and greater efficiency than standard CoT. Compared with fixed-length CoT, Selective CoT reached similar or superior accuracy at substantially lower computational cost. Discussion: Selective CoT dynamically balances reasoning depth and efficiency by invoking explicit reasoning only when beneficial, reducing redundancy on recall-type questions while preserving interpretability. Conclusion: Selective CoT provides a simple, model-agnostic, and cost-effective approach for medical QA, aligning reasoning effort with question complexity to enhance real-world deployability of LLM-based clinical systems.

中文标题/摘要

标题：推理还是不推理：医学问答中的选择性思维链

目标：通过避免不必要的推理来提高大型语言模型（LLM）在医学问答（MedQA）中的效率，同时保持准确性。方法：我们提出了选择性链式思考（Selective CoT），这是一种推理时策略，首先预测问题是否需要推理，仅在需要时生成推理。在四个生物医学问答基准测试（HeadQA、MedQA-USMLE、MedMCQA、PubMedQA）上评估了两个开源LLM（Llama-3.1-8B和Qwen-2.5-7B）。评估指标包括准确率、生成的总令牌数和推理时间。结果：选择性CoT将推理时间减少了13-45%，令牌使用量减少了8-47%，准确率损失不超过4%。在某些模型-任务配对中，它在准确性和效率上都优于标准CoT。与固定长度CoT相比，选择性CoT在显著降低计算成本的同时达到了相似或更高的准确率。讨论：选择性CoT通过仅在有益时调用显式推理来动态平衡推理深度和效率，减少回忆型问题上的冗余，同时保持可解释性。结论：选择性CoT为医学问答提供了一种简单、模型无关且成本效益高的方法，将推理努力与问题复杂性对齐，以增强基于LLM的临床系统的实际部署能力。

Summary / 总结

The study aims to enhance the efficiency of medical question answering using large language models by employing Selective Chain-of-Thought (Selective CoT), which predicts whether reasoning is necessary and generates rationales only when needed. Evaluations on four biomedical QA benchmarks showed that Selective CoT reduced inference time by 13-45% and token usage by 8-47% with minimal accuracy loss. It also achieved higher accuracy and efficiency in some model-task pairs compared to standard CoT, and reached similar or superior accuracy at lower computational cost than fixed-length CoT.

研究旨在通过使用Selective Chain-of-Thought（Selective CoT）来提高大型语言模型在医学问答中的效率，Selective CoT预测是否需要推理，并仅在必要时生成推理。在四个生物医学问答基准上的评估显示，Selective CoT将推理时间减少了13-45%，生成的令牌减少了8-47%，同时保持了最小的准确性损失。在某些模型-任务组合中，它还实现了比标准CoT更高的准确性和更高的效率，是一种成本效益高的医学问答方法。
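
The two-stage control flow of Selective CoT (decide first, reason only if needed) can be sketched as a small routing function. This is an illustration under assumptions: in the paper the model itself predicts whether reasoning is required, whereas `keyword_gate` below is a hypothetical keyword heuristic, `echo_model` is a stand-in for an LLM call, and the prompt wording is invented.

```python
from typing import Callable

def selective_cot(question: str,
                  needs_reasoning: Callable[[str], bool],
                  generate: Callable[[str], str]) -> str:
    """Route a question to a direct-answer prompt or a chain-of-thought prompt."""
    if needs_reasoning(question):
        prompt = f"{question}\nThink step by step, then give the final answer."
    else:
        prompt = f"{question}\nAnswer with the option letter only."
    return generate(prompt)

# Hypothetical stand-ins so the sketch runs without a model.
def keyword_gate(question: str) -> bool:
    """Flag questions whose wording suggests multi-step clinical reasoning."""
    cues = ("most likely", "next step", "best explains")
    return any(cue in question.lower() for cue in cues)

def echo_model(prompt: str) -> str:
    """Return the instruction line so the chosen route is visible."""
    return prompt.splitlines()[-1]

recall_route = selective_cot("Which vitamin is fat-soluble?", keyword_gate, echo_model)
reason_route = selective_cot("What is the most likely diagnosis?", keyword_gate, echo_model)
```

The token savings reported in the abstract come from the first branch: recall-type questions skip rationale generation entirely.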

NanoKnow: How to Know What Your Language Model Knows

Authors: Lingwei Gu, Nour Jedidi, Jimmy Lin

First: 2026-02-23T18:37:49+00:00 · Latest: 2026-02-23T18:37:49+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

How do large language models (LLMs) know what they know? Answering this question has been difficult because pre-training data is often a "black box" -- unknown or inaccessible. The recent release of nanochat -- a family of small LLMs with fully open pre-training data -- addresses this as it provides a transparent view into where a model's parametric knowledge comes from. Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre-training corpus. Using these splits, we can now properly disentangle the sources of knowledge that LLMs rely on when producing an output. To demonstrate NanoKnow's utility, we conduct experiments using eight nanochat checkpoints. Our findings show: (1) closed-book accuracy is strongly influenced by answer frequency in the pre-training data, (2) providing external evidence can mitigate this frequency dependence, (3) even with external evidence, models are more accurate when answers were seen during pre-training, demonstrating that parametric and external knowledge are complementary, and (4) non-relevant information is harmful, with accuracy decreasing based on both the position and the number of non-relevant contexts. We release all NanoKnow artifacts at https://github.com/castorini/NanoKnow.

中文标题/摘要

标题：NanoKnow：如何了解你的语言模型知道什么

大型语言模型（LLMs）是如何知道它们所知道的内容的？回答这个问题一直很困难，因为预训练数据通常是“黑箱”——未知或不可访问的。最近发布的nanochat——一系列具有完全开放预训练数据的小型LLMs——解决了这一问题，因为它提供了一个透明的视角，可以看到模型的参数知识来自何处。为了理解知识是如何被LLMs编码的，我们发布了NanoKnow，这是一个基准数据集，将自然问题和SQuAD中的问题分为基于答案是否出现在nanochat的预训练语料库中的分割。利用这些分割，我们现在可以正确地解开LLMs在生成输出时依赖的知识来源。为了展示NanoKnow的实用性，我们使用八个nanochat检查点进行了实验。我们的发现表明：（1）闭卷准确率强烈受预训练数据中答案频率的影响，（2）提供外部证据可以减轻这种频率依赖性，（3）即使有外部证据，当答案在预训练期间被看到时，模型更准确，这表明参数知识和外部知识是互补的，（4）无关信息是有害的，准确性会根据无关上下文的位置和数量而降低。我们在https://github.com/castorini/NanoKnow/发布了所有NanoKnow的资源。

Summary / 总结

The paper addresses the challenge of understanding how large language models (LLMs) acquire their knowledge by leveraging nanochat, a family of small LLMs with fully open pre-training data. The authors introduce NanoKnow, a benchmark dataset that categorizes questions based on whether their answers are present in nanochat's pre-training corpus. Through experiments with eight nanochat checkpoints, they find that closed-book accuracy is heavily influenced by answer frequency in the pre-training data, and that providing external evidence can reduce this dependence. The study also shows that models are more accurate when answers were seen during pre-training, indicating the complementary nature of parametric and external knowledge, and that non-relevant information negatively impacts accuracy.

论文通过利用具有完全开放预训练数据的nanochat小语言模型家族，解决了理解大型语言模型（LLMs）知识获取机制的挑战。作者引入了NanoKnow基准数据集，将来自Natural Questions和SQuAD的问题根据其答案是否出现在nanochat的预训练数据中进行分类。使用八个nanochat检查点进行的实验表明，闭卷准确率受到预训练数据中答案频率的影响较大，提供外部证据可以减轻这种依赖性。研究还发现，当答案在预训练期间被看到时，模型更准确，表明参数知识和外部知识是互补的，而不相关的信息会损害准确性。
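
Partitioning questions by whether their answers occur in an open pre-training corpus can be sketched as follows. The matching rule is an assumption: this sketch uses normalized whole-phrase string matching over an in-memory list of documents, while the actual NanoKnow construction (and how it searches nanochat's corpus at scale) may differ. The names `normalize` and `split_by_presence` are hypothetical.

```python
import re

def normalize(text):
    """Lowercase and collapse punctuation/whitespace for loose string matching."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())

def split_by_presence(qa_pairs, corpus_docs):
    """Partition (question, answer) pairs by whether the answer occurs in any document."""
    # Pad with spaces so "lima" does not match inside a longer word.
    docs = [f" {normalize(d)} " for d in corpus_docs]
    seen, unseen = [], []
    for question, answer in qa_pairs:
        needle = f" {normalize(answer)} "
        bucket = seen if any(needle in d for d in docs) else unseen
        bucket.append((question, answer))
    return seen, unseen

# Hypothetical toy corpus and questions.
corpus = ["Paris is the capital of France."]
qa = [("Capital of France?", "Paris"), ("Capital of Peru?", "Lima")]
seen, unseen = split_by_presence(qa, corpus)
```

Closed-book accuracy can then be reported separately on the `seen` and `unseen` splits, which is the kind of comparison the abstract's findings rely on.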

Towards Unifying Perceptual Reasoning and Logical Reasoning

Authors: Hiroyuki Kido

First: 2022-06-27T10:32:47+00:00 · Latest: 2026-02-23T18:36:24+00:00

Abs · PDF · Code1 · Code2

Abstract

An increasing number of scientific experiments support the view of perception as Bayesian inference, which is rooted in Helmholtz's view of perception as unconscious inference. Recent work in logic presents a view of logical reasoning as Bayesian inference. In this paper, we give a simple probabilistic model that is applicable to both perceptual reasoning and logical reasoning. We show that the model unifies the two essential processes common in perceptual and logical systems: on the one hand, the process by which perceptual and logical knowledge is derived from other knowledge, and on the other hand, the process by which such knowledge is derived from data. We fully characterise the model in terms of logical consequence relations.

中文标题/摘要

标题：向着统一感知推理和逻辑推理的方向

越来越多的科学研究支持感知是贝叶斯推理的观点，这源于赫尔姆霍兹关于感知是无意识推理的观点。最近对逻辑的研究提出了逻辑推理也是贝叶斯推理的观点。在本文中，我们提供了一个简单的概率模型，该模型适用于感知推理和逻辑推理。我们展示了该模型如何统一感知系统和逻辑系统中两种基本过程：一方面，感知和逻辑知识如何从其他知识中推导出来，另一方面，这些知识如何从数据中推导出来。我们完全用逻辑后果关系来描述该模型。

Summary / 总结

This paper aims to unify perceptual reasoning and logical reasoning by presenting a simple probabilistic model that is applicable to both. The model treats both kinds of reasoning as Bayesian inference and unifies two processes: deriving knowledge from other knowledge and deriving knowledge from data. The model itself is fully characterised in terms of logical consequence relations.

本文旨在通过提出一个同时适用于感知推理和逻辑推理的简单概率模型来统一这两种推理方式。该模型将两种推理都视为贝叶斯推理，统一了从其他知识推导知识和从数据推导知识这两个过程。该模型本身可以完全用逻辑后果关系来刻画。

NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning

Authors: Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, George Konidaris

First: 2026-02-23T18:35:18+00:00 · Latest: 2026-02-23T18:35:18+00:00

Comments: 25 pages, 15 figures. Project webpage: https://nova-plan.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/

中文标题/摘要

标题：NovaPlan：通过闭环视频语言规划实现零样本长时程操作

解决长时程任务需要机器人将高层次语义推理与低层次物理交互相结合。尽管视觉-语言模型（VLM）和视频生成模型可以分解任务并想象结果，但它们往往缺乏实现世界执行所需的物理基础。我们提出了NovaPlan，这是一种分层框架，将闭环VLM和视频规划与几何上接地的机器人执行统一起来，以实现零样本长时程操作。在高层次上，VLM规划器将任务分解为子目标，并在闭环中监控机器人执行，使系统能够通过自主重新规划从单步失败中恢复。为了计算低层次的机器人动作，我们从生成的视频中提取并利用与任务相关的物体关键点和人类手部姿态作为运动学先验，并采用切换机制选择更好的一个作为机器人动作的参考，即使在严重遮挡或深度不准确的情况下也能保持稳定的执行。我们在三个长时程任务和功能性操作基准（FMB）上展示了NovaPlan的有效性。我们的结果表明，NovaPlan可以在没有任何先验演示或训练的情况下执行复杂的装配任务并表现出灵巧的错误恢复行为。项目页面：https://nova-plan.github.io/

Summary / 总结

NovaPlan is a hierarchical framework that integrates closed-loop vision-language planning and geometrically grounded robot execution for zero-shot long-horizon manipulation. It decomposes tasks into sub-goals and monitors robot execution, allowing for autonomous re-planning. To compute low-level actions, it uses task-relevant object keypoints and human hand poses from generated videos, switching between them to maintain stable execution. NovaPlan demonstrates effectiveness on complex assembly tasks and the Functional Manipulation Benchmark without prior demonstrations or training.

NovaPlan 是一个层次框架，结合了闭环视觉语言规划和几何上接地的机器人执行，用于零样本长时程操作。它将任务分解为子目标，并监控机器人执行情况，允许在出现故障时进行自主重新规划。为了计算低级动作，它使用生成视频中的任务相关对象关键点和人类手部姿势作为运动先验，并根据其质量进行切换。NovaPlan 在三个任务和功能性操作基准（FMB）上展示了在复杂装配任务和错误恢复方面的有效性，无需任何先验演示或训练。

ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models

Authors: Andre He, Nathaniel Weir, Kaj Bostrom, Allen Nie, Darion Cassel, Sam Bayless, Huzefa Rangwala

First: 2026-02-23T18:34:29+00:00 · Latest: 2026-02-23T18:34:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs) by leveraging supervision from verifiers. Although verifier implementation is easier than solution annotation for many tasks, existing synthetic data generation methods remain largely solution-centric, while verifier-based methods rely on a few hand-crafted procedural environments. In this work, we scale RLVR by introducing ReSyn, a pipeline that generates diverse reasoning environments equipped with instance generators and verifiers, covering tasks such as constraint satisfaction, algorithmic puzzles, and spatial reasoning. A Qwen2.5-7B-Instruct model trained with RL on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, including a 27% relative improvement on the challenging BBEH benchmark. Ablations show that verifier-based supervision and increased task diversity both contribute significantly, providing empirical evidence that generating reasoning environments at scale can enhance reasoning abilities in RLMs.

中文标题/摘要

标题：ReSyn：自主扩展合成环境以支持推理模型

可验证奖励的强化学习（RLVR）已成为利用验证器提供的监督来训练推理语言模型（RLMs）的一种有前途的方法。尽管对许多任务而言，实现验证器比标注解答更容易，但现有的合成数据生成方法仍主要以解答为中心，而基于验证器的方法则依赖于少数手工构建的程序化环境。在本工作中，我们通过引入ReSyn来扩展RLVR。ReSyn是一条生成多样化推理环境的流水线，每个环境都配备实例生成器和验证器，涵盖约束满足、算法谜题和空间推理等任务。使用RL在ReSyn数据上训练的Qwen2.5-7B-Instruct模型在推理基准和域外数学基准上均取得了一致的提升，包括在具有挑战性的BBEH基准上相对提高了27%。消融实验表明，基于验证器的监督和任务多样性的增加都做出了显著贡献，为大规模生成推理环境可以增强RLMs的推理能力提供了实证证据。

Summary / 总结

The research motivation is to improve the training of reasoning language models (RLMs) using reinforcement learning with verifiable rewards (RLVR). The main method involves developing ReSyn, a pipeline that generates diverse reasoning environments with instance generators and verifiers, covering various tasks. Key experimental findings show that a Qwen2.5-7B-Instruct model trained with RL on ReSyn data achieves consistent gains across reasoning benchmarks and out-of-domain math benchmarks, with a 27% relative improvement on the BBEH benchmark. Ablations indicate that verifier-based supervision and increased task diversity are crucial for enhancing reasoning abilities in RLMs.

ReSyn 是一个生成多样推理环境的管道，包含实例生成器和验证器，用于使用可验证奖励的强化学习（RLVR）训练推理语言模型（RLMs）。它涵盖了约束满足、算法谜题和空间推理等任务。使用 ReSyn 数据进行 RL 训练的 Qwen2.5-7B-Instruct 模型在推理基准测试和跨域数学基准测试中表现出一致的改进，BBEH 基准测试的相对改进达到 27%。消融实验表明，基于验证器的监督和任务多样性的增加对于增强 RLMs 的推理能力至关重要。
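
A reasoning environment "equipped with an instance generator and a verifier" can be pictured with a small hand-written example. The subset-sum task, the function names `generate_instance` and `verify`, and the 0/1 reward convention are assumptions for illustration; ReSyn generates such environments automatically, and this sketch only shows the shape of one.

```python
import random

def generate_instance(rng, n=5):
    """Instance generator: pick numbers and a target reachable by some subset."""
    numbers = [rng.randint(1, 20) for _ in range(n)]
    chosen = [x for x in numbers if rng.random() < 0.5] or [numbers[0]]
    return {"numbers": numbers, "target": sum(chosen)}

def verify(instance, answer):
    """Verifier: reward 1.0 for any sub-multiset of the numbers summing to the target."""
    pool = list(instance["numbers"])
    for x in answer:
        if x not in pool:       # value not available, or used more often than it appears
            return 0.0
        pool.remove(x)
    return 1.0 if sum(answer) == instance["target"] else 0.0

instance = generate_instance(random.Random(0))
fixed = {"numbers": [3, 5, 7], "target": 10}
reward_ok = verify(fixed, [3, 7])     # valid subset that hits the target
reward_bad = verify(fixed, [5, 5])    # reuses 5, which appears only once
```

Note that the verifier never needs a reference solution: it checks any proposed answer directly, which is the property that makes verifier-based supervision cheaper than solution annotation.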

Benchmarking Unlearning for Vision Transformers

Authors: Kairan Zhao, Iurie Luca, Peter Triantafillou

First: 2026-02-23T18:33:16+00:00 · Latest: 2026-02-23T18:33:16+00:00

Abs · PDF · Code1 · Code2

Abstract

Research in machine unlearning (MU) has gained strong momentum: MU is now widely regarded as a critical capability for building safe and fair AI. In parallel, research into transformer architectures for computer vision tasks has been highly successful: Vision Transformers (VTs) are increasingly emerging as strong alternatives to CNNs. Yet, MU research for vision tasks has largely centered on CNNs, not VTs. While MU benchmarking efforts have addressed LLMs, diffusion models, and CNNs, none exist for VTs. This work is the first to attempt this, benchmarking MU algorithm performance in different VT families (ViT and Swin-T) and at different capacities. The work employs (i) different datasets, selected to assess the impacts of dataset scale and complexity; (ii) different MU algorithms, selected to represent fundamentally different approaches for MU; and (iii) both single-shot and continual unlearning protocols. Additionally, it focuses on benchmarking MU algorithms that leverage training data memorization, since leveraging memorization was recently found to significantly improve the performance of previously SOTA algorithms. En route, the work characterizes how VTs memorize training data relative to CNNs, and assesses the impact of different memorization proxies on performance. The benchmark uses unified evaluation metrics that capture two complementary notions of forget quality along with accuracy on unseen (test) data and on retained data. Overall, this work offers a benchmarking basis, enabling reproducible, fair, and comprehensive comparisons of existing (and future) MU algorithms on VTs. And, for the first time, it sheds light on how well existing algorithms work in VT settings, establishing a promising reference performance baseline.

中文标题/摘要

标题：视觉变换器的遗忘基准测试

机器遗忘（MU）研究已获得强劲动力：MU现被广泛认为是构建安全和公平AI的关键能力。同时，针对计算机视觉任务的变换器架构研究也非常成功：视觉变换器（VTs）逐渐成为CNNs的强大替代品。然而，视觉任务的MU研究主要集中在CNNs上，而不是VTs。虽然MU基准测试已涵盖LLMs、扩散模型和CNNs，但尚无针对VTs的基准测试。这项工作首次进行这一尝试，对不同VT家族（ViT和Swin-T）及不同容量下的MU算法性能进行了基准测试。该工作采用了(i) 不同的数据集，以评估数据集规模和复杂性的影响；(ii) 不同的MU算法，以代表根本不同的MU方法；(iii) 单次遗忘和持续遗忘两种协议。此外，它还重点对利用训练数据记忆的MU算法进行基准测试，因为最近发现利用记忆能显著提高此前SOTA算法的性能。在这一过程中，该工作刻画了VTs相对于CNNs如何记忆训练数据，并评估了不同记忆代理指标对性能的影响。基准测试使用统一的评估指标，这些指标涵盖遗忘质量的两个互补概念，以及在未见过的（测试）数据和保留数据上的准确率。总体而言，这项工作提供了一个基准测试基础，使人们能够对现有（和未来）的MU算法在VTs上进行可重复、公平和全面的比较。并且，它首次揭示了现有算法在VT设置中的表现，建立了有前景的参考性能基线。

Summary / 总结

This work benchmarks machine unlearning (MU) for Vision Transformers (VTs), addressing a gap in the literature by focusing on ViT and Swin-T families. It evaluates MU algorithms across different datasets, MU approaches, and unlearning protocols, using unified metrics that measure both forget quality and accuracy. The study reveals how VTs memorize training data compared to CNNs and assesses the impact of different memorization proxies on performance, offering a comprehensive basis for comparing MU algorithms on VTs.

这项研究对视觉变换器（VTs）进行了机器遗忘（MU）基准测试，这些变换器在计算机视觉任务中越来越受欢迎。它在不同的VT家族和容量下，使用多种数据集和协议评估了不同的MU算法。关键发现包括VTs与CNNs相比在训练数据记忆方面的特征化，以及不同记忆代理对MU性能的影响。该研究引入了统一的评估指标来评估遗忘质量和准确性，为在VTs上比较MU算法提供了全面的基础。

AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking

Authors: Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe

Venue: ICLR 2026

First: 2025-06-09T13:34:50+00:00 · Latest: 2026-02-23T18:25:13+00:00

Comments: ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in grade school math (GSM) reasoning. In particular, they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In this work, we instead focus on the strategy of "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. Focusing on GSM, we find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks. Besides, improving GSM robustness via AbstRaL is shown to also implicitly benefit LLMs' capabilities on OOD mathematical and general reasoning tasks, indicating that abstract thinking broadly enables better generalizability.

中文标题/摘要

标题：AbstRaL：通过强化抽象思维增强LLMs的推理能力

近期研究表明，大型语言模型（LLMs），尤其是较小的模型，在小学数学（GSM）推理方面往往缺乏稳健性。特别是在面对分布变化时，如数值或名义变量的变化，或插入分散性从句时，它们的性能往往会下降。一种可能的策略是生成合成数据以进一步“实例化”推理问题的潜在变化。在本文中，我们反而关注“抽象化”推理问题的策略。这不仅有助于抵消分布变化，还促进了与符号工具的连接，以推导解决方案。专注于GSM，我们发现这种抽象过程通过强化学习（RL）比单纯的监督微调更容易获得，后者往往无法产生忠实的抽象。我们的方法AbstRaL——通过RL在粒度抽象数据上促进LLMs的抽象推理——显著减轻了在最近的GSM扰动基准上的性能下降。此外，通过AbstRaL提高GSM稳健性也被证明可以隐式地增强LLMs在OOD数学和一般推理任务上的能力，表明抽象思维广泛地促进了更好的泛化。

Summary / 总结

This study addresses the robustness issues of large language models (LLMs) in grade school math reasoning, particularly their performance drops under distribution shifts. Instead of generating synthetic data, the authors propose an abstraction strategy using reinforcement learning (RL) to enhance LLMs' abstract thinking. The method, AbstRaL, significantly improves GSM robustness and also benefits LLMs in out-of-distribution (OOD) mathematical and general reasoning tasks, demonstrating the importance of abstract thinking for better generalizability.

该研究针对大型语言模型（LLMs）在小学数学推理中的鲁棒性问题，特别是它们在分布变化下的性能下降。作者提出了一种通过强化学习（RL）增强抽象策略的方法来提升LLMs的推理能力。方法AbstRaL在小学数学推理扰动基准测试中显著减少了性能下降，并增强了LLMs的一般推理能力，表明抽象思维广泛提高了模型的泛化能力。
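
The contrast between "instantiating" and "abstracting" a grade-school math problem can be shown with a toy sketch. This is only an illustration of the general idea: the regex-based `abstract_problem` helper and the hand-written `answer` rule are hypothetical, whereas in AbstRaL the abstraction is produced by an LLM trained with RL and then connected to symbolic tools.

```python
import re

def abstract_problem(text):
    """Replace each integer literal with a placeholder x0, x1, ... and record its value."""
    values = {}

    def swap(match):
        name = f"x{len(values)}"
        values[name] = int(match.group())
        return name

    return re.sub(r"\d+", swap, text), values

# A hand-written abstract solution for this template (in AbstRaL the model would produce it).
def answer(bindings):
    return bindings["x0"] + bindings["x1"]

template, values = abstract_problem("Tom has 3 apples and buys 5 more.")
original_result = answer(values)                    # uses the numbers from the text
shifted_result = answer({"x0": 30, "x1": 50})       # same rule under changed numbers
```

Because the solution is expressed over placeholders, a change to the numerical values (one of the distribution shifts named in the abstract) leaves the reasoning untouched.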

EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization

Authors: Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, Lizhu Zhang

First: 2026-02-05T00:33:02+00:00 · Latest: 2026-02-23T18:23:57+00:00

Abs · PDF · Code1 · Code2

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing the reasoning capabilities of Large Language Models (LLMs). However, dominant approaches like Group Relative Policy Optimization (GRPO) face critical stability challenges: they suffer from high estimator variance under computational constraints (small group sizes) and vanishing gradient signals in saturated failure regimes where all responses yield identical zero rewards. To address this, we propose Empirical Bayes Policy Optimization (EBPO), a novel framework that regularizes local group-based baselines by borrowing strength from the policy's accumulated global statistics. Instead of estimating baselines in isolation, EBPO employs a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford's online algorithm. Theoretically, we demonstrate that EBPO guarantees strictly lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals in failure scenarios compared to GRPO. Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench. Notably, EBPO exhibits superior training stability, achieving high-performance gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.

中文标题/摘要

标题：EBPO：经验贝叶斯收缩以稳定组相对策略优化

可验证奖励的强化学习（RLVR）已被证明能够增强大型语言模型（LLMs）的推理能力。然而，主流方法如组相对策略优化（GRPO）面临严重的稳定性挑战：在计算约束条件下（小组规模较小）它们遭受高估计方差问题，并且在所有响应均产生相同零奖励的饱和失败状态下，梯度信号消失。为解决这一问题，我们提出了一种新的经验贝叶斯策略优化（EBPO）框架，该框架通过借用策略累积的全局统计信息来正则化局部组基线。EBPO 不是孤立地估计基线，而是使用一个动态平衡局部组统计信息与通过 Welford 在线算法更新的全局先验的收缩估计器。理论上，我们证明了与 GRPO 相比，EBPO 严格具有更低的均方误差（MSE）、有界熵衰减和在失败场景中非消失的惩罚信号。实验上，EBPO 在包括 AIME 和 OlympiadBench 在内的多种基准测试中均优于 GRPO 和其他现有基准，表现出更优的训练稳定性，即使在小组规模较小的情况下也能实现高性能提升，并且从难度分层的课程学习中获益显著。

Summary / 总结

The paper addresses the stability challenges in Group Relative Policy Optimization (GRPO) for Reinforcement Learning with Verifiable Rewards (RLVR), particularly high variance and vanishing gradients. It introduces Empirical Bayes Policy Optimization (EBPO), which regularizes local group baselines using global statistics through a shrinkage estimator. Theoretically, EBPO is shown to have lower Mean Squared Error, bounded entropy decay, and non-vanishing penalty signals. Empirically, EBPO outperforms GRPO and other baselines across various benchmarks, demonstrating superior training stability even with small group sizes.

论文针对组相对策略优化（GRPO）在验证奖励强化学习（RLVR）中的稳定性问题，特别是高方差和梯度消失。提出了一种新的Empirical Bayes策略优化（EBPO）框架，通过收缩估计器利用全局统计信息来正则化局部组基线。理论上，EBPO被证明具有更低的均方误差、有界熵衰减和非消失的惩罚信号。实验上，EBPO在各种基准测试中优于GRPO和其他基线，即使在小组规模下也能表现出更优的训练稳定性。
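
The combination of a Welford-updated global prior with a shrunken group baseline can be sketched in a few lines. Welford's online algorithm below is the standard one; the shrinkage rule is an assumption, using a simple pseudo-count weight `k / (k + prior_strength)` that is not taken from the paper, whose estimator may weight local and global statistics differently. The names `RunningStats` and `shrunk_baseline` are hypothetical.

```python
class RunningStats:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

def shrunk_baseline(group_rewards, prior, prior_strength=4.0):
    """Pull the group mean toward the global mean.

    `prior_strength` is a hypothetical pseudo-count; the paper's weighting may differ.
    """
    k = len(group_rewards)
    group_mean = sum(group_rewards) / k
    weight = k / (k + prior_strength)      # more samples -> trust the group more
    return weight * group_mean + (1.0 - weight) * prior.mean

# Global statistics accumulated from earlier rewards (toy values).
prior = RunningStats()
for r in [1.0, 0.0, 1.0, 1.0]:
    prior.update(r)

# A saturated-failure group: every response gets zero reward.
group = [0.0, 0.0, 0.0, 0.0]
baseline = shrunk_baseline(group, prior)
advantages = [r - baseline for r in group]
```

With a purely local baseline this all-zero group would yield zero advantages and therefore no gradient; with the shrunken baseline the advantages are negative, which illustrates the non-vanishing penalty signal the abstract refers to.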

Align When They Want, Complement When They Need! Human-Centered Ensembles for Adaptive Human-AI Collaboration

Authors: Hasan Amin, Ming Yin, Rajiv Khanna

Venue: AAAI 2026

First: 2026-02-23T18:22:58+00:00 · Latest: 2026-02-23T18:22:58+00:00

Comments: AAAI 2026

Abs · PDF · Code1 · Code2

Abstract

In human-AI decision making, designing AI that complements human expertise has been a natural strategy to enhance human-AI collaboration, yet it often comes at the cost of decreased AI performance in areas of human strengths. This can inadvertently erode human trust and cause them to ignore AI advice precisely when it is most needed. Conversely, an aligned AI fosters trust yet risks reinforcing suboptimal human behavior and lowering human-AI team performance. In this paper, we start by identifying this fundamental tension between performance-boosting (i.e., complementarity) and trust-building (i.e., alignment) as an inherent limitation of the traditional approach for training a single AI model to assist human decision making. To overcome this, we introduce a novel human-centered adaptive AI ensemble that strategically toggles between two specialist AI models - the aligned model and the complementary model - based on contextual cues, using an elegantly simple yet provably near-optimal Rational Routing Shortcut mechanism. Comprehensive theoretical analyses elucidate why the adaptive AI ensemble is effective and when it yields maximum benefits. Moreover, experiments on both simulated and real-world data show that when humans are assisted by the adaptive AI ensemble in decision making, they can achieve significantly higher performance than when they are assisted by single AI models that are trained to either optimize for their independent performance or even the human-AI team performance.

中文标题/摘要

标题：在他们想要时对齐，在他们需要时补充！以人为本的自适应人机协作集成

在人机决策中，设计能够补充人类专长的AI一直是增强人机协作的自然策略，但往往会在人类强项领域降低AI性能。这可能会无意中削弱人类的信任，导致他们在最需要时忽视AI建议。相反，对齐的AI可以培养信任，但有强化人类次优行为、降低人机团队性能的风险。在本文中，我们首先指出，提升性能（即互补性）与建立信任（即对齐性）之间的这一基本矛盾，是训练单一AI模型来辅助人类决策的传统方法的固有局限。为克服这一问题，我们引入了一种新颖的以人为本的自适应AI集成，它根据上下文线索在两个专家AI模型（对齐模型和互补模型）之间策略性地切换，使用一种简洁优雅且可证明接近最优的理性路由捷径（Rational Routing Shortcut）机制。全面的理论分析阐明了自适应AI集成为何有效，以及何时能获得最大收益。此外，在模拟数据和真实数据上的实验表明，与由单一AI模型（无论其训练目标是优化自身独立性能还是优化人机团队性能）辅助相比，人类在自适应AI集成的辅助下做决策时能取得显著更高的性能。

Summary / 总结

This paper addresses the tension between performance-boosting and trust-building in human-AI collaboration by introducing an adaptive AI ensemble. The ensemble toggles between an aligned model and a complementary model based on contextual cues, using a Rational Routing Shortcut mechanism. Experiments show that this approach leads to significantly higher performance in decision-making tasks compared to using single AI models optimized for either individual or team performance.

本文通过引入一种自适应AI集成来解决人类与AI协作中性能提升与信任建立之间的矛盾。该集成根据上下文线索在对齐模型和补充模型之间切换，使用一种简洁而证明有效的理性捷径机制。实验表明，与分别优化个体或团队性能的单一AI模型相比，这种自适应集成在决策任务中可以显著提高人类的表现。
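
The toggling idea in the summary above can be illustrated with a small sketch. The abstract does not describe how the Rational Routing Shortcut decides, so the rule below (a threshold on a hypothetical `human_confidence` cue) is only a stand-in for illustration, and `Case`, `route`, and `threshold` are assumed names rather than anything taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Case:
    """One decision instance (all fields are hypothetical)."""
    features: Tuple[float, ...]  # task input shown to the models
    human_confidence: float      # contextual cue in [0, 1]


def route(
    case: Case,
    aligned_model: Callable[[Case], int],
    complementary_model: Callable[[Case], int],
    threshold: float = 0.7,
) -> Tuple[str, int]:
    """Pick one specialist per case.

    Assumed rule: when the human is confident, show the aligned model's
    advice; otherwise show the complementary model's advice.
    """
    if case.human_confidence >= threshold:
        return "aligned", aligned_model(case)
    return "complementary", complementary_model(case)


if __name__ == "__main__":
    aligned = lambda c: 1        # stub specialist
    complementary = lambda c: 0  # stub specialist
    print(route(Case((0.2, 0.4), 0.9), aligned, complementary))
    print(route(Case((0.2, 0.4), 0.3), aligned, complementary))
```

The point of the sketch is only the structure: two fixed specialists and a cheap per-case selector, rather than one model trained to trade off both goals.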

BarrierSteer: LLM Safety via Learning Barrier Steering

Authors: Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao

First: 2026-02-23T18:19:46+00:00 · Latest: 2026-02-23T18:19:46+00:00

Comments: This paper introduces SafeBarrier, a framework that enforces safety in large language models by steering their latent representations with control barrier functions during inference, reducing adversarial and unsafe outputs

Abs · PDF · Code1 · Code2

Abstract

Despite the state-of-the-art performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a major obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and supported by rigorous theory. We introduce BarrierSteer, a novel framework that formalizes response safety by embedding learned non-linear safety constraints directly into the model's latent representation space. BarrierSteer employs a steering mechanism based on Control Barrier Functions (CBFs) to efficiently detect and prevent unsafe response trajectories during inference with high precision. By enforcing multiple safety constraints through efficient constraint merging, without modifying the underlying LLM parameters, BarrierSteer preserves the model's original capabilities and performance. We provide theoretical results establishing that applying CBFs in latent space offers a principled and computationally efficient approach to enforcing safety. Our experiments across multiple models and datasets show that BarrierSteer substantially reduces adversarial success rates, decreases unsafe generations, and outperforms existing methods.

中文标题/摘要

标题：BarrierSteer：通过学习障碍导向控制提升大语言模型安全性

尽管大型语言模型（LLMs）在多种任务上表现出色，但它们对对抗攻击和不安全内容生成的易感性仍然是部署的主要障碍，尤其是在高风险环境中。解决这一挑战需要既实用又具有严格理论支持的安全机制。我们提出了BarrierSteer，这是一种新颖的框架，通过将学习到的非线性安全约束直接嵌入模型的潜在表示空间中来形式化响应安全性。BarrierSteer 使用基于控制障碍函数（CBFs）的导向机制，在推理过程中高效地检测和防止不安全的响应轨迹。通过在不修改底层LLM参数的情况下强制执行多个安全约束，BarrierSteer 保留了模型的原始能力和性能。我们提供了理论结果，证明在潜在空间中应用CBFs是一种原理上合理且计算高效的强制执行安全的方法。我们在多个模型和数据集上的实验表明，BarrierSteer 显著降低了对抗攻击的成功率，减少了不安全的生成，并优于现有方法。

Summary / 总结

BarrierSteer is a framework that enhances the safety of large language models by embedding non-linear safety constraints into the model's latent space. It uses Control Barrier Functions (CBFs) to steer the model away from unsafe response trajectories during inference, without altering the model parameters. Experiments demonstrate that BarrierSteer significantly reduces adversarial success rates and unsafe content generation compared to existing methods.

BarrierSteer 是一个框架，通过将非线性安全约束嵌入到模型的潜在空间中来增强大型语言模型的安全性。它使用控制屏障函数（CBFs）在推理过程中引导模型远离不安全的响应轨迹，而不改变模型参数。实验表明，BarrierSteer 显著减少了对抗性和不安全内容的生成，优于现有方法。
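
To make the control-barrier idea concrete, here is a minimal sketch under strong simplifying assumptions. The paper learns non-linear constraints; this example instead assumes a single affine barrier h(z) = w·z + b, with the safe set h(z) ≥ 0, and applies the standard discrete-time condition h(z_next) ≥ (1 - alpha)·h(z_prev) through a minimum-norm correction. The names `barrier`, `steer`, and `alpha` are illustrative and not from the paper.

```python
from typing import List, Sequence


def barrier(z: Sequence[float], w: Sequence[float], b: float) -> float:
    """Affine barrier h(z) = w.z + b; h(z) >= 0 is treated as 'safe'."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def steer(
    z_prev: Sequence[float],
    z_nominal: Sequence[float],
    w: Sequence[float],
    b: float,
    alpha: float = 0.5,
) -> List[float]:
    """Return the latent closest to z_nominal (in L2 norm) that satisfies
    h(z_next) >= (1 - alpha) * h(z_prev).

    For one affine constraint the minimum-norm correction has a closed
    form: move along w by exactly the amount the constraint is violated.
    """
    target = (1.0 - alpha) * barrier(z_prev, w, b)
    deficit = target - barrier(z_nominal, w, b)
    if deficit <= 0.0:
        return list(z_nominal)  # already satisfies the condition
    scale = deficit / sum(wi * wi for wi in w)
    return [zi + scale * wi for zi, wi in zip(z_nominal, w)]


if __name__ == "__main__":
    w, b = [1.0, 0.0], 0.0
    z_prev = [1.0, 0.0]       # h = 1.0, safe
    z_nominal = [-1.0, 2.0]   # h = -1.0, the unsteered update is unsafe
    z_next = steer(z_prev, z_nominal, w, b)
    print(z_next, barrier(z_next, w, b))
```

Because the correction touches only the latent state, the model weights stay unchanged, which matches the abstract's claim that the underlying LLM parameters are not modified.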

The Illusion of Human AI Parity Under Uncertainty: Navigating Elusive Ground Truth via a Probabilistic Paradigm

Authors: Aparna Elangovan, Lei Xu, Mahsa Elyasi, Ismail Akdulum, Mehmet Aksakal, Enes Gurun, Brian Hur, Saab Mansour, Ravid Shwartz Ziv, Karin Verspoor, Dan Roth

First: 2026-01-09T03:19:37+00:00 · Latest: 2026-02-23T18:16:48+00:00

Abs · PDF · Code1 · Code2

Abstract

Benchmarking the relative capabilities of AI systems, including Large Language Models (LLMs) and Vision Models, typically ignores the impact of uncertainty in the underlying ground truth answers from experts. This ambiguity is not limited to human preferences; it is also consequential in safety-critical domains such as medicine, where uncertainty is pervasive. In this paper, we introduce a probabilistic paradigm to explain theoretically why high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores, whereas in datasets with high variation in ground truth answers there may be little difference between a random labeller and an expert. Therefore, ignoring uncertainty in ground truth evaluation data can result in the misleading conclusion that a non-expert has similar performance to that of an expert. Using the probabilistic paradigm, we thus bring forth the concepts of expected accuracy and expected F1 to estimate the score an expert human or system can achieve given ground truth answer variability. Our work leads to the recommendation that when establishing the capability of a system, results should be stratified by probability of the ground truth answer, typically measured by the agreement rate of ground truth experts. Stratification becomes critical when the overall performance drops below a threshold of 80%. Under stratified evaluation, performance comparison becomes more reliable in high certainty bins, mitigating the effect of the key confounding factor: uncertainty.

中文标题/摘要

标题：不确定性下人类与AI对等的幻象：通过概率范式应对难以捉摸的真实标签

在基准测试AI系统的相对能力时，包括大型语言模型（LLMs）和视觉模型，通常会忽略底层专家答案不确定性的影响。这种模糊不仅限于人类偏好，甚至在医学等关键安全领域也是如此，这些领域充满了不确定性。在本文中，我们引入了概率范式来理论解释：即使对于专家来说，高确定性的底层答案几乎总是必要的，而在具有高底层答案变异性数据集上，随机标注者和专家之间的差异可能很小。因此，在忽略底层答案评估数据中的不确定性时，可能会得出误导性的结论，即非专家的表现与专家相似。利用概率范式，我们提出了预期准确率和预期F1的概念，以估计给定底层答案变异性时专家人类或系统的得分。我们的工作导致了这样的建议：在确定系统的能力时，结果应按底层答案概率分层，通常通过底层答案专家的一致率来衡量。当整体性能低于80%的阈值时，分层评估变得至关重要。在分层评估下，高确定性区间内的性能比较更加可靠，减轻了关键混杂因素——不确定性的影响。

Summary / 总结

This paper addresses the issue of ignoring uncertainty in ground truth answers when benchmarking AI systems, particularly in safety-critical domains. It introduces a probabilistic paradigm to explain how high certainty in ground truth is crucial for experts to achieve high scores, while in datasets with high variation, there may be little difference between a random labeler and an expert. The authors propose using expected accuracy and expected F1 to estimate scores given ground truth variability, recommending stratification by the probability of ground truth answers for more reliable performance comparison, especially when overall performance drops below 80%.

本文提出了一个概率范式来考虑地面真实答案中的不确定性问题。研究表明，在地面真实答案高度确定的情况下，专家才能获得高分，而在高地面真实答案变异性数据集中，非专家的表现可能与专家相似。作者提出了使用预期准确率和预期F1来估计在不同地面真实答案不确定性下的专家表现，并建议根据地面真实答案的一致性概率对结果进行分层，特别是在整体性能低于80%时。这种方法通过减轻不确定性的影响，增强了性能比较的可靠性。
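
A small worked example may help show why uncertain ground truth narrows the gap between an expert and a random labeler. The formula below is one simple reading of "expected accuracy" and is an assumption, not the paper's definition: if the ground-truth label follows a distribution p over classes (for instance, estimated from expert agreement rates) and a labeler answers with distribution q, the expected accuracy is the sum over classes of p_c · q_c. The function names are illustrative.

```python
from typing import List, Sequence


def expected_accuracy(truth_probs: Sequence[float],
                      answer_probs: Sequence[float]) -> float:
    """Probability that a sampled answer matches a sampled ground-truth label."""
    return sum(p * q for p, q in zip(truth_probs, answer_probs))


def majority_answer(truth_probs: Sequence[float]) -> List[float]:
    """An idealised expert: always answers with the most likely label."""
    best = max(range(len(truth_probs)), key=lambda i: truth_probs[i])
    return [1.0 if i == best else 0.0 for i in range(len(truth_probs))]


def uniform_answer(num_classes: int) -> List[float]:
    """A random labeler: picks every class with equal probability."""
    return [1.0 / num_classes] * num_classes


if __name__ == "__main__":
    for truth in ([0.95, 0.05], [0.55, 0.45]):
        expert = expected_accuracy(truth, majority_answer(truth))
        rand = expected_accuracy(truth, uniform_answer(len(truth)))
        print(truth, "expert:", round(expert, 3), "random:", round(rand, 3))
```

With 95% agreement the idealised expert is far ahead of the random labeler, while at 55% agreement the two are nearly indistinguishable. That is the motivation for reporting results per agreement bin.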

FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding

Authors: João Pereira, Vasco Lopes, João Neves, David Semedo

Venue: AAAI 2026

First: 2026-01-24T02:17:07+00:00 · Latest: 2026-02-23T18:12:49+00:00

Comments: Accepted at AAAI 2026

Abs · PDF · Code1 · Code2

Abstract

Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fails to capture the rich, free-form, and visually grounded nature of LVLM responses, while the latter focuses on assessing language quality over factual relevance, often resulting in subjective judgments that are misaligned with human perception. In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding key descriptive elements of anomalies in video: events (What), participating entities (Who) and location (Where). Our benchmark introduces a) FVScore, a novel, human-aligned evaluation metric that assesses the presence of critical visual elements in LVLM answers, providing interpretable, fine-grained feedback; and b) FineW3, a novel, comprehensive dataset curated through a structured and fully automatic procedure that augments existing human annotations with high quality, fine-grained visual information. Human evaluation reveals that our proposed metric has a superior alignment with human perception of anomalies in comparison to current approaches. Detailed experiments on FineVAU unveil critical limitations in LVLMs' ability to perceive anomalous events that require spatial and fine-grained temporal understanding, despite strong performance on coarse-grained, static information, and events with strong visual cues.

中文标题/摘要

标题：FineVAU：一种新的细粒度视频异常理解人对齐基准

视频异常理解（VAU）是一个专注于描述视频中异常事件的新任务。尽管引起了越来越多的兴趣，但VAU的评估仍然是一个开放的挑战。现有的基准依赖于基于n-gram的度量标准（例如，BLEU，ROUGE-L）或基于LLM的评估。前者无法捕捉到LVLM响应的丰富、自由形式和视觉基础的特性，而后者则侧重于评估语言质量而非事实相关性，往往导致主观判断与人类感知不一致。在本文中，我们通过提出FineVAU，一种新的VAU基准，解决了这一问题，该基准将重点转向了对异常视频的丰富、细粒度和领域特定的理解。我们将VAU表述为一个三重问题，旨在全面理解视频中异常事件的关键描述元素：事件（What）、参与者（Who）和位置（Where）。我们的基准引入了a) FVScore，一种新的、人对齐的评估指标，评估LVLM答案中关键视觉元素的出现情况，提供可解释的细粒度反馈；以及b) FineW3，一种通过结构化和全自动程序编纂的新颖、全面的数据集，该数据集通过现有的人类注释增加了高质量的细粒度视觉信息。人类评估表明，我们提出的方法在与当前方法对异常事件的人类感知的对齐方面具有优越性。对FineVAU的详细实验揭示了LVLM在感知需要空间和细粒度时间理解的异常事件方面的关键局限性，尽管在粗粒度、静态信息和具有强烈视觉线索的事件上表现出色。

Summary / 总结

FineVAU is a new benchmark for Video Anomaly Understanding (VAU) that targets rich, fine-grained, and domain-specific understanding of anomalous videos along three axes: events (What), participating entities (Who), and location (Where). It introduces FVScore, a human-aligned evaluation metric that checks for critical visual elements in LVLM answers, and FineW3, a dataset built by a fully automatic procedure that augments existing human annotations with fine-grained visual information. Experiments show that current LVLMs struggle with anomalous events that require spatial and fine-grained temporal understanding, despite performing well on coarse-grained, static information and on events with strong visual cues.

FineVAU 是一个新的视频异常理解 (VAU) 基准，专注于异常的丰富、细致和领域特定的理解。它引入了 FVScore，一个与人类感知对齐的评估指标，以及 FineW3，一个带有高质量、细致视觉信息的综合数据集。实验表明，当前的语言模型在空间和细致的时间理解异常方面存在局限性，尽管在静态信息和具有强烈视觉线索的事件上表现出色。

Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning

Authors: Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, Kun Qian, Yuxin Tang, Ran Xue, Houyu Zhang, Qingjun Cui, Yufan Guo, Dakuo Wang

Venue: ICLR 2026

First: 2025-07-23T18:10:43+00:00 · Latest: 2026-02-23T18:12:05+00:00

Comments: Accepted by ICLR 2026. The project page is available at https://damon-demon.github.io/shop-r1.html

Abs · PDF · Code1 · Code2 · Project1

Abstract

Large Language Models (LLMs) have recently demonstrated strong potential in generating 'believable human-like' behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline. The project page is available at https://damon-demon.github.io/shop-r1.html.

中文标题/摘要

标题：Shop-R1：通过强化学习奖励LLM在在线购物中模拟人类行为

大型语言模型（LLM）最近在生成‘可信的人类行为’方面展示了强大的潜力。先前的工作探索了通过LLM合成的推理理由增强训练数据，并应用监督微调（SFT）来提高推理能力，从而改善下游行为预测。然而，这些方法的性能仍然受限于生成推理理由的模型的推理能力。在本文中，我们提出了Shop-R1，这是一种新颖的强化学习（RL）框架，旨在通过LLM增强模拟在线购物环境中真实人类行为的推理能力。具体而言，Shop-R1将人类行为模拟任务分解为两个阶段：理由生成和行为预测，每个阶段都由不同的奖励信号引导。在理由生成阶段，我们利用内部模型信号（例如，logit分布）以自监督的方式引导推理过程。在行为预测阶段，我们提出了一种具有难度感知缩放的分层奖励结构，以防止奖励作弊并实现细粒度奖励分配。该设计评估了高级行为类型和细粒度子行为细节（属性和值）的正确性，奖励输出与其难度成比例。实验结果表明，与基线相比，我们的方法相对提高了超过65%。项目页面可在https://damon-demon.github.io/shop-r1.html获取。

Summary / 总结

Shop-R1 is a reinforcement learning framework designed to enhance the reasoning ability of LLMs for simulating human behavior in online shopping. It decomposes the task into rationale generation and action prediction, using distinct reward signals for each. The method leverages internal model signals (e.g., logit distributions) for self-supervised rationale generation and proposes a hierarchical reward structure with difficulty-aware scaling for action prediction. Experiments show a relative improvement of over 65% compared to the baseline.

Shop-R1 是一个强化学习框架，旨在增强 LLMs 在在线购物环境中的推理能力以模拟人类行为。该框架将任务分解为推理解释生成和动作预测两个阶段，使用不同的奖励信号。方法利用内部模型信号进行推理解释生成，并提出了一种分层奖励结构来预测动作。实验结果显示，相对于基线方法，改进幅度超过 65%。
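
The hierarchical, difficulty-aware reward described above can be sketched as follows. Every number and name here is an assumption made for illustration: the action types, the `DIFFICULTY` weights, and the 0.5 / 0.25 / 0.25 credit split are not taken from the paper.

```python
from typing import Dict, Optional

# Hypothetical difficulty weights: harder action types earn a larger reward,
# so always predicting the easiest action is not a winning strategy.
DIFFICULTY: Dict[str, float] = {
    "terminate": 0.5,
    "click": 1.0,
    "type_and_submit": 2.0,
}

Action = Dict[str, Optional[str]]


def action_reward(pred: Action, gold: Action) -> float:
    """Hierarchical reward: action type first, then attribute, then value."""
    if pred.get("type") != gold["type"]:
        return 0.0  # wrong high-level action type earns nothing
    weight = DIFFICULTY.get(gold["type"] or "", 1.0)
    if gold.get("attribute") is None:
        return weight  # no sub-action details to check
    credit = 0.5  # partial credit for the correct action type
    if pred.get("attribute") == gold["attribute"]:
        credit += 0.25
        # The value is only checked once the attribute is right.
        if pred.get("value") == gold.get("value"):
            credit += 0.25
    return weight * credit


if __name__ == "__main__":
    gold = {"type": "type_and_submit", "attribute": "search_box", "value": "red shoes"}
    print(action_reward(dict(gold), gold))
    print(action_reward({"type": "type_and_submit", "attribute": "search_box",
                         "value": "blue shoes"}, gold))
    print(action_reward({"type": "click", "attribute": "search_box",
                         "value": "red shoes"}, gold))
```

Scoring the levels in order gives partial credit for getting the coarse action right, while the difficulty weight keeps the full reward for a hard action above the full reward for an easy one.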

CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching

Authors: Yuzhe Wang, Yaochen Zhu, Jundong Li

First: 2026-02-23T18:06:15+00:00 · Latest: 2026-02-23T18:06:15+00:00

Comments: 8 pages plus references, 3 figures, 3 tables. Under review

Abs · PDF · Code1 · Code2

Abstract

As large language models (LLMs) witness increasing deployment in complex, high-stakes decision-making scenarios, it becomes imperative to ground their reasoning in causality rather than spurious correlations. However, strong performance on traditional reasoning benchmarks does not guarantee true causal reasoning ability of LLMs, as high accuracy may still arise from memorizing semantic patterns instead of analyzing the underlying true causal structures. To bridge this critical gap, we propose a new causal reasoning benchmark, CausalFlip, designed to encourage the development of new LLM paradigms or training algorithms that ground LLM reasoning in causality rather than semantic correlation. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. Based on this, for each event triple, we construct pairs of semantically similar questions that reuse the same events but yield opposite causal answers, so that models that rely heavily on semantic matching are systematically driven toward incorrect predictions. To further probe models' reliance on semantic patterns, we introduce a noisy-prefix evaluation that prepends causally irrelevant text before intermediate causal reasoning steps without altering the underlying causal relations or the logic of the reasoning process. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought (CoT) supervision, and a proposed internalized causal reasoning approach that aims to mitigate explicit reliance on correlation in the reasoning process. Our results show that explicit CoT can still be misled by spurious semantic correlations, whereas internalizing reasoning steps yields substantially improved causal grounding, suggesting that it is promising to better elicit the latent causal reasoning capabilities of base LLMs.

中文标题/摘要

标题：CausalFlip：超越语义匹配的LLM因果判断基准

随着大型语言模型（LLMs）在复杂、高风险决策场景中的应用日益增多，将它们的推理基于因果关系而非偶然的相关性变得至关重要。然而，传统推理基准上的强大表现并不能保证LLMs真正具备因果推理能力，因为高准确率可能只是由于记忆了语义模式而非分析了潜在的真实因果结构。为了弥合这一关键差距，我们提出了一种新的因果推理基准CausalFlip，旨在鼓励开发新的LLM范式或训练算法，使LLM的推理基于因果关系而非语义相关性。CausalFlip由基于事件三元组构建的因果判断问题组成，这些事件三元组可以形成不同的共因、链式和碰撞关系。基于此，对于每个事件三元组，我们构建了语义相似的问题对，这些问题重用了相同的事件但导致相反的因果答案，使得依赖于语义匹配的模型系统地产生错误预测。为了进一步探究模型对语义模式的依赖，我们引入了一种噪声前缀评估，该评估在中间因果推理步骤前添加因果无关的文本，而不改变潜在的因果关系或推理过程的逻辑。我们对多种训练范式下的LLMs进行了评估，包括仅答案训练、显式的因果推理链（CoT）监督，以及一种旨在减轻推理过程中对相关性依赖的内部化因果推理方法。结果显示，显式的CoT仍然可能被虚假的语义相关性误导，而内部化推理步骤则显著提高了因果定位，表明更好地激发基底LLMs的潜在因果推理能力是可行的。

Summary / 总结

CausalFlip is a new benchmark designed to evaluate the causal reasoning ability of large language models (LLMs) beyond mere semantic matching. It consists of causal judgment questions built over event triples with different causal relations, and introduces a noisy-prefix evaluation to probe models' reliance on semantic patterns. Evaluations show that explicit Chain-of-Thought (CoT) can still be misled by spurious semantic correlations, while internalizing reasoning steps significantly improves causal grounding. This suggests that internalized causal reasoning is promising for better eliciting latent causal reasoning capabilities of LLMs.

CausalFlip 是一个新的基准，旨在测试大型语言模型（LLMs）的因果推理能力，而不仅仅是语义匹配。它包含基于事件三元组构建的因果判断问题，这些事件三元组可以形成不同的因果关系，并且包含一对语义相似的问题，但因果答案相反。基准还包括一个噪声前缀评估，以探究模型对语义模式的依赖。评估结果显示，显式的因果推理链（CoT）仍然可能被虚假的语义相关性误导，而内化推理步骤可以显著提高因果定位的效果。
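
The "same events, opposite answers" construction can be illustrated with a toy sketch. It assumes the three textbook structures over a triple (X, Y, Z) and one question template, "Does X cause Z?", answered by directed-path reachability. The benchmark's actual question templates and labelling rules are not given in the abstract, so `STRUCTURES`, `causes`, and `make_item` are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (cause, effect)

# Three textbook relations over the same event triple (X, Y, Z).
STRUCTURES: Dict[str, List[Edge]] = {
    "chain": [("X", "Y"), ("Y", "Z")],       # X -> Y -> Z
    "confounder": [("Y", "X"), ("Y", "Z")],  # X <- Y -> Z
    "collider": [("X", "Y"), ("Z", "Y")],    # X -> Y <- Z
}


def causes(edges: List[Edge], source: str, target: str) -> bool:
    """True if a directed path leads from source to target."""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        for cause, effect in edges:
            if cause == node and effect not in seen:
                if effect == target:
                    return True
                seen.add(effect)
                stack.append(effect)
    return False


def make_item(structure: str) -> Dict[str, object]:
    """Build one judgment question over the shared events X and Z."""
    edges = STRUCTURES[structure]
    return {
        "structure": structure,
        "question": "Does X cause Z?",
        "answer": causes(edges, "X", "Z"),
    }


if __name__ == "__main__":
    for name in STRUCTURES:
        print(make_item(name))
```

The chain and confounder items share the same events and the same question wording but carry opposite answers, so a model that matches on surface semantics cannot get both right.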

Closing the Gap Between Text and Speech Understanding in LLMs

Authors: Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly, Zakaria Aldeneh

First: 2025-10-15T14:57:16+00:00 · Latest: 2026-02-23T18:05:51+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.

中文标题/摘要

标题：缩小大型语言模型在文本与语音理解之间的差距

大型语言模型（LLMs）可以被调整以扩展其文本能力以处理语音输入。然而，这些调整后的语音LLMs在语言理解任务上的表现始终不如基于文本的对应模型，甚至不如级联管道。我们称这种不足为文本-语音理解差距：当语音调整后的LLM处理语音输入时，相对于原始基于文本的LLM处理等效文本时观察到的性能下降。最近缩小这一差距的方法要么依赖大规模的文本语料库语音合成，这既昂贵又高度依赖合成数据，要么依赖大规模的专有语音数据集，这些数据集不可复现。因此，仍需要数据效率更高的替代方案来缩小文本-语音理解差距。在本研究中，我们分析了这一差距由两个因素驱动：（i）适应过程中对文本能力的遗忘，以及（ii）语音和文本之间的跨模态不一致。基于这一分析，我们引入了SALAD（Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation），它将跨模态蒸馏与有针对性的合成数据相结合，以提高对齐并减轻遗忘。将SALAD应用于3B和7B LLMs时，其训练所用的公共语料库语音数据少了一个数量级以上，却在广泛领域的知识、语言理解和推理基准测试中实现了与强大的开放权重模型相当的性能。

Summary / 总结

This study addresses the text-speech understanding gap in Large Language Models (LLMs), attributing it to two factors: forgetting of text capabilities during adaptation and cross-modal misalignment between speech and text. The proposed method, SALAD, combines cross-modal distillation with targeted synthetic data to improve alignment and mitigate forgetting. Applied to 3B and 7B LLMs, it achieves performance competitive with a strong open-weight model across knowledge, language understanding, and reasoning benchmarks while training on over an order of magnitude less speech data from public corpora.

本文探讨了大型语言模型（LLMs）在文本与语音理解之间的差距问题，即语音适应的LLMs在性能上低于其文本版本。作者提出了一种名为SALAD的方法，结合跨模态蒸馏和目标合成数据来提高对齐并减轻遗忘。SALAD在各种基准测试中实现了竞争力的表现，同时使用了比先前方法少得多的公共语料库中的语音数据。
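
Cross-modal distillation, as named above, can be sketched with a per-token loss. The abstract does not state which objective SALAD uses; the example assumes a common choice, the KL divergence from the text model's next-token distribution (teacher) to the speech-adapted model's distribution (student) on paired inputs. The function names are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits: Sequence[float],
                      student_logits: Sequence[float]) -> float:
    """Assumed objective: pull the speech student's next-token distribution
    toward the text teacher's distribution for the same content."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


if __name__ == "__main__":
    text_teacher = [2.0, 0.5, -1.0]    # logits from the text LLM on the transcript
    speech_student = [1.0, 1.0, -0.5]  # logits from the speech LLM on the audio
    print(distillation_loss(text_teacher, speech_student))
    print(distillation_loss(text_teacher, text_teacher))
```

The loss is zero only when the speech model reproduces the text model's distribution, which is why this kind of objective addresses both misalignment and forgetting at once.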

PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

Authors: Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, Ismini Lourentzou

First: 2026-01-22T18:58:55+00:00 · Latest: 2026-02-23T18:05:24+00:00

Abs · PDF · Code1 · Code2

Abstract

Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.

中文标题/摘要

标题：PyraTok：语言对齐的分层分词器用于视频理解和生成

离散视频VAEs是现代文本到视频生成和视频理解系统的基石，但现有的分词器通常在单尺度上学习视觉码本，词汇量有限且语言监督浅薄，导致跨模态对齐差且零样本迁移效果不佳。我们提出了PyraTok，一种语言对齐的分层分词器，能够在多个时空分辨率上学习语义结构化的离散潜在变量。PyraTok 基于一个预训练的视频VAE和一个新颖的语言对齐分层量化（LaPQ）模块，该模块使用共享的大二进制码本在多个深度上离散化编码特征，从而产生紧凑且富有表现力的视频分词序列。为了紧密耦合视觉分词与语言，PyraTok 联合优化多尺度文本引导量化和分词层次上的全局自回归目标。在十个基准测试中，PyraTok 在视频重建方面达到最先进的（SOTA）性能，一致地提高了文本到视频的质量，并在视频分割、动作定位和视频理解方面设立了新的SOTA零样本性能，能够稳健地扩展到4K/8K分辨率。

Summary / 总结

PyraTok is designed to improve cross-modal alignment and zero-shot transfer in text-to-video generation and video understanding by learning semantically structured discrete latents across multiple spatiotemporal resolutions. It uses a Language aligned Pyramidal Quantization (LaPQ) module to discretize encoder features at several depths with a shared large binary codebook, and jointly optimizes multi-scale text-guided quantization and a global autoregressive objective. Across ten benchmarks, PyraTok achieves state-of-the-art video reconstruction, consistently improves text-to-video quality, and sets new state-of-the-art zero-shot results on video segmentation, temporal action localization, and video understanding, scaling to resolutions up to 4K/8K.

PyraTok 通过在多个时空分辨率上学习语义结构化的离散潜变量，旨在提高文本到视频生成和视频理解中的跨模态对齐和零样本迁移。它使用语言对齐的分层量化（LaPQ）模块，在共享的大二进制码本中对编码特征进行多尺度离散化，并联合优化多尺度文本引导量化和全局自回归目标。PyraTok 在视频重建、文本到视频质量、零样本视频分割、动作定位和理解等各个基准上均达到最先进的性能，支持到 4K/8K 分辨率的扩展。

How Retrieved Context Shapes Internal Representations in RAG

Authors: Samuel Yeh, Sharon Li

First: 2026-02-23T18:02:04+00:00 · Latest: 2026-02-23T18:02:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial. In realistic retrieval settings, the retrieved document set often contains a mixture of documents that vary in relevance and usefulness. While prior work has largely examined these phenomena through output behavior, little is known about how retrieved context shapes the internal representations that mediate information integration in RAG. In this work, we study RAG through the lens of latent representations. We systematically analyze how different types of retrieved documents affect the hidden states of LLMs, and how these internal representation shifts relate to downstream generation behavior. Across four question-answering datasets and three LLMs, we analyze internal representations under controlled single- and multi-document settings. Our results reveal how context relevancy and layer-wise processing influence internal representations, providing explanations of LLM output behaviors and insights for RAG system design.

中文标题/摘要

标题：检索上下文如何塑造RAG中的内部表示

检索增强生成（RAG）通过在生成时基于检索到的外部文档进行条件化，增强了大型语言模型（LLMs），但检索到的上下文的影响往往是非平凡的。在实际的检索设置中，检索到的文档集通常包含相关性和有用性各异的文档混合体。尽管先前的工作主要通过输出行为来研究这些现象，但关于检索到的上下文如何塑造RAG中信息整合的内部表示知之甚少。在本研究中，我们从潜在表示的角度研究RAG。我们系统地分析了不同类型检索到的文档如何影响LLMs的隐藏状态，以及这些内部表示的变化如何与下游生成行为相关。在四个问答数据集和三种LLMs上，我们分析了在单文档和多文档控制设置下的内部表示。我们的结果揭示了上下文的相关性和逐层处理如何影响内部表示，为解释LLMs的输出行为和RAG系统设计提供了见解。

Summary / 总结

This study investigates how retrieved context influences the internal representations of LLMs in Retrieval-Augmented Generation (RAG). It analyzes hidden states across four question-answering datasets and three LLMs under controlled single- and multi-document settings. The results show how context relevancy and layer-wise processing influence internal representations, which helps explain LLM output behavior and offers insights for RAG system design.

本研究探讨检索上下文如何影响 Retrieval-Augmented Generation (RAG) 系统的内部表示。通过在不同类型的检索文档和多个数据集上分析潜在表示，研究发现上下文的相关性和逐层处理显著影响内部表示，为 RAG 输出行为提供了解释，并为系统设计提供了指导。
