Semantic Chunking and the Entropy of Natural Language
Authors: Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks
First: 2026-02-13T18:58:10+00:00 · Latest: 2026-02-18T18:59:22+00:00
Comments: 29 pages, 9 figures; typos fixed
Abstract
The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which is captured by the only free parameter in our model.
中文标题/摘要
标题:语义切块与自然语言的熵
印刷英语的熵率著名地估计为每个字符大约一个比特,这是一个现代大型语言模型(LLMs)仅最近才接近的基准。该熵率意味着英语相对于预期的每个字符五比特的随机文本,几乎含有80%的冗余。我们引入了一个统计模型,试图捕捉自然语言复杂的多层次结构,提供了一个从第一原理出发的冗余水平解释。该模型描述了一种自相似地将文本切块为语义上连贯的片段的过程,直到单个单词级别。文本的语义结构可以逐级分解,从而进行分析处理。现代LLMs和开源数据集的数值实验表明,我们的模型在语义层次的不同水平上定量地捕捉了真实文本的结构。我们的模型预测的熵率与印刷英语估计的熵率一致。此外,我们的理论还揭示了自然语言的熵率不是固定的,而是应该随着语料库的语义复杂性系统地增加,这由我们模型中的唯一自由参数来捕捉。
Summary / 总结
The paper offers a first-principles account of the redundancy of natural language, whose entropy rate is estimated at about one bit per character. The authors propose a statistical model that self-similarly segments text into semantically coherent chunks down to the single-word level, yielding a hierarchical decomposition that can be treated analytically. Numerical experiments with modern LLMs and open datasets suggest that the model quantitatively captures the structure of real texts at different levels of the semantic hierarchy, and its predicted entropy rate agrees with the estimate for printed English. The theory also predicts that the entropy rate is not fixed but increases with the semantic complexity of a corpus, which is captured by the model's only free parameter.
该论文旨在通过估算自然语言的熵率(约为每个字符一个比特)来理解其中的冗余。为此,作者提出了一种统计模型,将文本逐层分割成语义上连贯的片段,从而对文本的结构进行层次分解。实验表明,该模型能够准确捕捉真实文本的语义结构,并预测的熵率与印刷英语的估计熵率一致。此外,该模型还表明,熵率会随着语料库的语义复杂性系统地增加,这由模型中的唯一自由参数决定。
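The redundancy figure quoted in the abstract above follows from simple arithmetic, sketched here in Python. The first calculation uses only the two numbers in the abstract (about 1 bit per character against 5 bits per character for random text). The second uses a 27-symbol alphabet (26 letters plus space), which is an assumption made for illustration and is not taken from the paper.

```python
import math


def redundancy(entropy_rate_bits: float, max_bits: float) -> float:
    """Fraction of the maximum per-character information that is predictable."""
    return 1.0 - entropy_rate_bits / max_bits


# Figures quoted in the abstract: ~1 bit/char against 5 bits/char for random text.
quoted = redundancy(1.0, 5.0)

# Assumption for illustration only: a 27-symbol alphabet (26 letters plus space),
# whose uniform entropy is log2(27) bits per character.
uniform_27 = math.log2(27)
alt = redundancy(1.0, uniform_27)

print(f"redundancy vs 5 bits/char: {quoted:.2f}")
print(f"log2(27) = {uniform_27:.2f} bits/char, redundancy: {alt:.2f}")
```

Both baselines give a redundancy close to 80 percent, which is consistent with the "nearly 80 percent" wording.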
Policy Compiler for Secure Agentic Systems
Authors: Nils Palumbo, Sarthak Choudhary, Jihye Choi, Prasad Chalasani, Mihai Christodorescu, Somesh Jha
First: 2026-02-18T18:57:12+00:00 · Latest: 2026-02-18T18:57:12+00:00
Abstract
LLM-based agents are increasingly being deployed in contexts requiring complex authorization policies: customer service protocols, approval workflows, data access restrictions, and regulatory compliance. Embedding these policies in prompts provides no enforcement guarantees. We present PCAS, a Policy Compiler for Agentic Systems that provides deterministic policy enforcement.
Enforcing such policies requires tracking information flow across agents, which linear message histories cannot capture. Instead, PCAS models the agentic system state as a dependency graph capturing causal relationships among events such as tool calls, tool results, and messages. Policies are expressed in a Datalog-derived language, as declarative rules that account for transitive information flow and cross-agent provenance. A reference monitor intercepts all actions and blocks violations before execution, providing deterministic enforcement independent of model reasoning.
PCAS takes an existing agent implementation and a policy specification, and compiles them into an instrumented system that is policy-compliant by construction, with no security-specific restructuring required. We evaluate PCAS on three case studies: information flow policies for prompt injection defense, approval workflows in a multi-agent pharmacovigilance system, and organizational policies for customer service. On customer service tasks, PCAS improves policy compliance from 48% to 93% across frontier models, with zero policy violations in instrumented runs.
中文标题/摘要
标题:安全代理系统策略编译器
基于LLM的代理在需要复杂授权策略的场景中越来越普遍:客户服务协议、审批工作流、数据访问限制和合规性。将这些策略嵌入提示中无法提供执行保证。我们提出了PCAS,一种代理系统策略编译器,提供确定性的策略执行。
执行这些策略需要跟踪代理间的信息流,而线性消息历史无法捕捉。相反,PCAS将代理系统状态建模为依赖图,捕捉事件(如工具调用、工具结果和消息)之间的因果关系。策略以Datalog衍生语言表达,作为描述性规则,考虑传递性信息流和跨代理来源。参考监视器拦截所有操作,在执行前阻止违规行为,提供独立于模型推理的确定性执行。
PCAS接受现有的代理实现和策略规范,并将它们编译成一个符合策略的系统,无需进行任何特定于安全性的重构。我们在三个案例研究中评估了PCAS:提示注入防御的信息流策略、多代理药效监测系统中的审批工作流,以及客户服务中的组织策略。在客户服务任务中,PCAS将前沿模型的策略合规性从48%提高到93%,且在编译运行中无策略违规。
Summary / 总结
PCAS is a policy compiler for agentic systems that provides deterministic enforcement of complex authorization policies. It models the system state as a dependency graph of causal relationships among events and expresses policies as declarative rules in a Datalog-derived language; a reference monitor intercepts every action and blocks violations before execution. PCAS is evaluated on three case studies; on customer service tasks it improves policy compliance from 48% to 93% across frontier models, with zero policy violations in instrumented runs.
PCAS 是一种用于智能代理系统的策略编译器,确保复杂授权策略的确定性执行。它将系统状态建模为依赖图以跨代理跟踪信息流,并使用一种基于 Datalog 的语言来表达策略,以考虑传递性信息流和跨代理来源。PCAS 将现有代理和策略编译成一个安全系统,在客户服务任务中将策略合规性从 48% 提高到 93%,且在仪器化运行中没有策略违规。
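The PCAS abstract above describes a dependency graph over events and a reference monitor that checks transitive information flow before an action runs. The following is a minimal Python sketch of that idea, not PCAS itself: the class `DependencyGraph`, the function `allowed`, the labels "trusted"/"untrusted", and the single rule are all hypothetical, and the rule is hard-coded in Python rather than written in the Datalog-derived language the paper uses.

```python
from collections import defaultdict


class DependencyGraph:
    """Events (tool calls, tool results, messages) linked by causal edges."""

    def __init__(self):
        self.parents = defaultdict(set)  # event id -> ids it directly depends on
        self.labels = {}                 # event id -> label such as "untrusted"

    def add_event(self, event_id, label, depends_on=()):
        self.labels[event_id] = label
        self.parents[event_id].update(depends_on)

    def ancestors(self, event_id):
        """All events that transitively influenced event_id."""
        seen, stack = set(), list(self.parents[event_id])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.parents[node])
        return seen


def allowed(graph, depends_on, sensitive):
    """Hypothetical rule: a sensitive action must not depend, even
    transitively, on any event labelled "untrusted"."""
    if not sensitive:
        return True
    influences = set(depends_on)
    for dep in depends_on:
        influences |= graph.ancestors(dep)
    return all(graph.labels[e] != "untrusted" for e in influences)


g = DependencyGraph()
g.add_event("web_result", "untrusted")
g.add_event("summary_msg", "trusted", depends_on=["web_result"])
g.add_event("user_msg", "trusted")

# The monitor would run this check before executing a proposed action.
blocked = not allowed(g, ["summary_msg"], sensitive=True)
permitted = allowed(g, ["user_msg"], sensitive=True)
print(blocked, permitted)
```

The point of the graph is visible in the first check: `summary_msg` is itself labelled trusted, but it is blocked because it descends from an untrusted tool result, a relationship that a flat message history would not record.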
Reinforced Fast Weights with Next-Sequence Prediction
Authors: Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky
First: 2026-02-18T18:53:18+00:00 · Latest: 2026-02-18T18:53:18+00:00
Abstract
Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
中文标题/摘要
标题:强化快速权重与后续序列预测
快速权重架构为长上下文建模提供了一种有希望的替代方案,通过保持恒定的内存开销,无论上下文长度如何。然而,它们的潜力受到后续标记预测(NTP)训练范式的限制。NTP优化单个标记的预测,而忽略了后续标记在前缀之后的语义连贯性。因此,快速权重模型,它们动态更新参数以存储上下文信息,学习到次优表示,无法捕捉长距离依赖关系。我们引入了REFINE(强化快速权重与后续序列预测),这是一种强化学习框架,它在后续序列预测(NSP)目标下训练快速权重模型。REFINE基于预测熵选择信息性标记位置,生成多标记展开,分配自我监督的序列级奖励,并使用组相对策略优化(GRPO)优化模型。REFINE适用于预训练语言模型训练生命周期中的各个阶段:中期训练、后期训练和测试时训练。我们在LaCT-760M和DeltaNet-1.3B上的实验表明,REFINE在针头在干草堆检索、长上下文问答以及LongBench中的各种任务中,始终优于基于NTP的监督微调。REFINE为改进快速权重架构中的长上下文建模提供了一个有效且多功能的框架。
Summary / 总结
The paper addresses the limitation of fast weight architectures in capturing long-range dependencies due to their training with next-token prediction (NTP), which focuses on single-token predictions. To overcome this, REFINE (Reinforced Fast weIghts with Next sEquence prediction) is proposed, which trains fast weight models under the next-sequence prediction (NSP) objective. REFINE uses prediction entropy to select informative token positions, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). The experiments show that REFINE outperforms supervised fine-tuning with NTP on various tasks, including needle-in-a-haystack retrieval and long-context question answering.
论文针对快权重架构因单个词预测训练 paradigm 无法捕捉长距离依赖的问题,提出了一种基于下一个序列预测的强化学习框架 REFINE。REFINE 利用预测熵选择信息丰富的词位置,生成多词卷积,并通过组相对策略优化分配自监督序列级奖励。实验结果表明,REFINE 在各种长上下文任务和多种基准测试中均优于基于下一个词预测的监督微调方法。
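Two steps named in the REFINE abstract above can be sketched in a few lines of Python: choosing token positions by prediction entropy, and turning a group of rollout rewards into relative advantages. This is an illustrative sketch under assumptions, not the authors' code: the function names are hypothetical, the probability lists stand in for model outputs, and the advantage formula is the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation), which the abstract does not spell out.

```python
import math
from statistics import mean, pstdev


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_positions(distributions, k):
    """Indices of the k positions with the highest predictive entropy."""
    ranked = sorted(
        range(len(distributions)),
        key=lambda i: entropy(distributions[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy next-token distributions at four positions (made-up numbers).
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
positions = select_positions(distributions, k=2)

# Toy sequence-level rewards for four rollouts started at one position.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(positions, advantages)
```

Rollouts would then be generated only from the selected high-entropy positions, so that the sequence-level signal is spent where the model is least certain.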
Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology
Authors: Shen Zhou Hong, Alex Kleinman, Alyssa Mathiowetz, Adam Howes, Julian Cohen, Suveer Ganta, Alex Letizia, Dora Liao, Deepika Pahari, Xavier Roberts-Gaal, Luca Righetti, Joe Torres
First: 2026-02-18T18:51:28+00:00 · Latest: 2026-02-18T18:51:28+00:00
Abstract
Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet, whether this translates to improved human performance in the physical laboratory remains unclear. To address this, we conducted a pre-registered, investigator-blinded, randomized controlled trial (June-August 2025; n = 153) evaluating whether LLMs improve novice performance in tasks that collectively model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rate of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably for the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of pooled data estimates an approximate 1.4-fold increase (95% CrI 0.74-2.62) in success for a "typical" reverse genetics task under LLM assistance. Ordinal regression modelling suggests that participants in the LLM arm were more likely to progress through intermediate steps across all tasks (posterior probability of a positive effect: 81%-96%). Overall, mid-2025 LLMs did not substantially increase novice completion of complex laboratory procedures but were associated with a modest performance benefit. These results reveal a gap between in silico benchmarks and real-world utility, underscoring the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.
中文标题/摘要
标题:测量2025年中期LLM辅助对生物学初学者表现的影响
大型语言模型(LLMs)在生物基准测试中表现出色,引发了对其可能帮助初学者获得双重用途实验室技能的担忧。然而,这种能力是否转化为物理实验室中的人类表现提升仍不清楚。为了解决这一问题,我们在2025年6月至8月进行了一个预先注册、研究者盲法、随机对照试验(n = 153),评估LLMs是否能提高初学者在模拟病毒反向遗传学工作流程任务中的表现。我们没有观察到主要终点(工作流程完成)的显著差异(5.2% LLM vs. 6.6%互联网;P = 0.759),也没有在单个任务的成功率上观察到显著差异。然而,LLM组在四个任务中的成功率数值上高于互联网组,尤其是在细胞培养任务中(68.8% LLM vs. 55.3%互联网;P = 0.059)。对合并数据的后验贝叶斯建模估计,在LLM辅助下,“典型”的反向遗传学任务成功率大约增加1.4倍(95% CrI 0.74-2.62)。有序回归建模表明,LLM组的参与者更有可能在所有任务中跨过中间步骤(正效应后验概率:81%-96%)。总体而言,2025年中期的LLMs并未显著增加初学者完成复杂实验室程序,但与轻微的表现优势相关。这些结果揭示了计算基准与实际应用之间的差距,强调了随着模型能力和用户熟练度的发展,需要对AI生物安全评估进行物理世界验证的必要性。
Summary / 总结
The study tested whether LLMs improve novice performance on tasks that together model a viral reverse genetics workflow. A pre-registered, investigator-blinded randomized controlled trial with 153 participants found no significant difference in workflow completion between the LLM and Internet arms (5.2% vs. 6.6%; P = 0.759), nor in the success rate of any individual task. The LLM arm had numerically higher success rates in four of the five tasks, most notably cell culture (68.8% vs. 55.3%; P = 0.059), and post-hoc Bayesian modeling estimated an approximate 1.4-fold increase in success, although the 95% credible interval (0.74-2.62) spans 1. Ordinal regression suggested that LLM-arm participants were more likely to progress through intermediate steps. The results highlight a gap between in silico benchmarks and real-world utility.
研究旨在评估大型语言模型(LLMs)是否能提高初学者在生物工作流程中的表现。一项包含153名参与者的随机对照试验发现,LLM组和互联网控制组在整体工作流程完成率上没有显著差异。然而,LLM组在五个任务中有四个任务表现更好,特别是在细胞培养方面,贝叶斯建模显示成功率为1.4倍。序贯回归分析表明,LLM组的参与者更有可能在所有任务中完成中间步骤。尽管如此,LLMs并未显著提升初学者在复杂实验室程序中的表现,这突显了在模型能力和用户熟练度发展过程中,虚拟基准与实际应用之间的差距。
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions
Authors: Hanjie Chen, Zhouxiang Fang, Yash Singla, Mark Dredze
Venue: NAACL 2025
First: 2024-02-28T05:44:41+00:00 · Latest: 2026-02-18T18:50:32+00:00
Comments: NAACL 2025
Abstract
LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exams or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. Datasets and code are available at https://github.com/HanjieChen/ChallengeClinicalQA. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiency of LLMs for explainable medical QA.
中文标题/摘要
标题:大型语言模型在回答和解释具有挑战性的医学问题方面的基准测试
LLMs在回答医学问题方面表现出色,例如在医学执照考试中取得通过成绩。然而,医学委员会考试或一般的临床问题无法捕捉到实际临床案例的复杂性。此外,缺乏参考解释使得我们无法轻易评估模型决策的合理性,这是支持医生做出复杂医学决策的关键组成部分。为了解决这些挑战,我们构建了两个新的数据集:JAMA临床挑战和Medbullets。数据集和代码可在https://github.com/HanjieChen/ChallengeClinicalQA获取。JAMA临床挑战由基于具有挑战性的临床案例的问题组成,而Medbullets则包含模拟的临床问题。两个数据集都以多项选择题-回答任务的形式呈现,并附有专家撰写的解释。我们使用各种提示对七个LLM在两个数据集上的表现进行了评估。实验表明,我们的数据集比之前的基准更难。深入的自动和人工评估模型生成的解释提供了关于LLMs在可解释医学问答方面的潜力和不足的见解。
Summary / 总结
The research evaluates large language models (LLMs) on answering and explaining challenging medical questions that medical board exams and general clinical questions do not capture. It introduces two datasets with expert-written explanations: JAMA Clinical Challenge, built from challenging clinical cases, and Medbullets, made up of simulated clinical questions, both structured as multiple-choice question answering. Seven LLMs were evaluated under various prompts, and the experiments show that the new datasets are harder than previous benchmarks. Automatic and human evaluations of model-generated explanations highlight the promise and deficiencies of LLMs for explainable medical QA.
该研究旨在通过创建两个新数据集JAMA临床挑战和Medbullets来评估大型语言模型(LLMs)在回答和解释复杂医学问题方面的表现。这些数据集包含基于挑战性临床案例和模拟临床场景的多项选择题,并附有专家解释。使用这些数据集评估了七种LLMs,结果显示新基准比之前的更难。对模型生成解释的详细评估揭示了LLMs在可解释医学问答系统中的潜力和局限性。
Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
Authors: Mingjia Shi, Yinhan He, Yaochen Zhu, Jundong Li
First: 2026-02-18T18:49:56+00:00 · Latest: 2026-02-18T18:49:56+00:00
Comments: preprint 10 pages, 4 figures
Abstract
Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enable stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.
中文标题/摘要
标题:注意引导多路径思考:重访视觉-语言推理
视觉-语言模型(VLMs)旨在通过联合利用视觉和文本模态进行推理。虽然为大型语言模型(LLMs)分配额外的推理时间计算已被证明是有效的,但在VLMs中实现类似的扩展仍然具有挑战性。一个关键障碍是视觉输入通常只在生成的开始阶段提供一次,而文本推理(例如,早期视觉摘要)是自回归生成的,这使得推理变得越来越以文本为主导,并允许早期视觉定位错误累积。此外,推理期间的视觉定位指导通常粗糙且嘈杂,这使得在长文本上引导推理变得困难。为了解决这些挑战,我们提出了\emph{注意引导原则}(SAP)选择。SAP 在高层次的推理原则上操作,而不是在标记级轨迹上,这使得在嘈杂反馈下稳定控制离散生成成为可能,同时允许后续推理步骤在需要重新定位时重新咨询视觉证据。此外,SAP 支持多路径推理,允许并行探索多种推理行为。SAP 是模型无关的,不需要额外的数据,也不需要额外的训练。实验证明,SAP 在与可比的标记生成预算下实现了竞争力的性能,特别是在减少对象幻觉方面,同时提供了比CoT风格的长序列推理更稳定的推理和更低的响应延迟。
Summary / 总结
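The paper addresses the difficulty of keeping reasoning visually grounded in vision-language models (VLMs) by proposing Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which gives stable control over discrete generation under noisy feedback, lets later reasoning steps re-consult visual evidence, and supports multi-route inference without additional training. Experiments show that, under comparable token-generation budgets, SAP achieves competitive performance, especially in reducing object hallucination, with more stable reasoning and lower response latency than CoT-style long sequential reasoning.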
本文提出了一种Saliency-Aware Principle (SAP) 选择方法,该方法在高层推理原则上操作,以控制离散生成并支持多路线推理。SAP 使推理更加稳定,并减少了对象幻觉,实现了与较低响应延迟相比的竞争力表现。
Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Authors: Wenxuan Ding, Nicholas Tomlin, Greg Durrett
First: 2026-02-18T18:46:14+00:00 · Latest: 2026-02-18T18:46:14+00:00
Abstract
LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs in when to stop exploring and commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about the correctness of that code; the cost of writing a test is nonzero, but typically lower than the cost of making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs, then perform more optimal environment exploration. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), where we feed the LLM this additional context to enable it to act more optimally. This improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover more optimal decision-making strategies.
中文标题/摘要
标题:校准后再行动:LLM代理的成本意识探索
大型语言模型(LLMs)越来越多地被用于解决需要与环境交互以获取信息的复杂问题,而不仅仅是通过单个响应来解决。在这种情况下,LLMs必须在何时停止探索并承诺一个答案时权衡固有的成本不确定性。例如,在编程任务中,如果LLM对生成的代码片段的正确性不确定,它应该测试该代码;编写测试的成本是非零的,但通常低于犯错误的成本。在本研究中,我们展示了如何促使LLMs明确地权衡这些成本不确定性,从而进行更有效的环境探索。我们将多个任务,包括信息检索和编程,形式化为不确定性下的顺序决策问题。每个问题都有一个潜在的环境状态,可以通过传递给LLM代理的先验来推理。我们引入了一个称为校准后再行动(CTA)的框架,通过向LLM提供额外的上下文,使其能够更有效地行动。即使在对基线和CTA进行强化学习训练时,这种改进也得以保留。我们在信息寻求问答和简化编程任务上的结果表明,通过CTA使成本效益权衡变得明确,可以帮助代理发现更优的决策策略。
Summary / 总结
This work addresses cost-aware exploration in LLM agents by formalizing tasks as sequential decision-making problems under uncertainty. It introduces the Calibrate-Then-Act (CTA) framework, which passes the LLM a prior over the latent environment state so that it can reason explicitly about cost-uncertainty tradeoffs. Results on information-seeking QA and a simplified coding task show that CTA helps agents discover more optimal decision-making strategies, and the improvement is preserved when both the baseline and CTA are trained with reinforcement learning.
该研究通过将任务形式化为不确定性下的顺序决策问题,解决了LLM代理的成本感知探索挑战。引入了Calibrate-Then-Act (CTA)框架,为LLM提供额外的上下文,使其能够推理成本-不确定性权衡。实验结果表明,CTA使代理能够发现比基线方法更优的决策策略,即使在使用强化学习训练时也是如此。
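The coding example in the Calibrate-Then-Act abstract above (test the code when uncertain, because a test is cheaper than a mistake) reduces to an expected-cost comparison. The Python sketch below is one simplified way to write that comparison and is not the CTA framework itself: the function name `should_test`, the cost units, and the assumption that a test perfectly reveals an error (which is then fixed at no further cost) are all introduced here for illustration.

```python
def should_test(p_correct: float, cost_test: float, cost_mistake: float) -> bool:
    """Return True when testing is cheaper in expectation than committing.

    Simplifying assumptions (not from the paper): a test always reveals
    whether the code is wrong, and a revealed error is fixed at no extra cost.
    """
    expected_cost_of_committing = (1.0 - p_correct) * cost_mistake
    return expected_cost_of_committing > cost_test


# An agent that is only 60% sure should pay for a test...
uncertain = should_test(p_correct=0.60, cost_test=1.0, cost_mistake=10.0)
# ...while one that is 95% sure should commit to its answer.
confident = should_test(p_correct=0.95, cost_test=1.0, cost_mistake=10.0)
print(uncertain, confident)
```

The rule only works if `p_correct` is calibrated, which is why the framework supplies a prior to the agent before it acts.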
Causality is Key for Interpretability Claims to Generalise
Authors: Shruti Joshi, Aaron Mueller, David Klindt, Wieland Brendel, Patrik Reizinger, Dhanya Sridhar
First: 2026-02-18T18:45:04+00:00 · Latest: 2026-02-18T18:45:04+00:00
Abstract
Interpretability research on large language models (LLMs) has yielded important insights into model behaviour, yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl's causal hierarchy clarifies what an interpretability study can justify. Observations establish associations between model behaviour and internal components. Interventions (e.g., ablations or activation patching) support claims about how these edits affect a behavioural metric (e.g., average change in token probabilities) over a set of prompts. However, counterfactual claims -- i.e., asking what the model output would have been for the same prompt under an unobserved intervention -- remain largely unverifiable without controlled supervision. We show how causal representation learning (CRL) operationalises this hierarchy, specifying which variables are recoverable from activations and under what assumptions. Together, these motivate a diagnostic framework that helps practitioners select methods and evaluations matching claims to evidence such that findings generalise.
中文标题/摘要
标题:因果关系是解释性声明能否泛化的关键
大规模语言模型(LLMs)的解释性研究提供了关于模型行为的重要见解,但反复出现的陷阱仍然存在:缺乏泛化的发现,以及超出证据范围的因果解释。我们的观点是,因果推理指明了从模型激活到不变的高层结构的有效映射是什么,实现它的数据或假设是什么,以及它能支持哪些推断。具体来说,Pearl的因果层次结构澄清了解释性研究可以证明什么。观察结果建立了模型行为与内部组件之间的关联。干预措施(例如,消融或激活补丁)支持关于这些编辑如何影响行为指标(例如,平均词概率变化)的声明。然而,反事实声明——即询问在未观察到的干预下,同一提示下的模型输出会是什么——在没有受控监督的情况下仍然难以验证。我们展示了因果表示学习(CRL)如何实现这一层次结构,指明哪些变量可以从激活中恢复以及在什么假设下。这些共同促使一种诊断框架,帮助实践者选择方法和评估,使声明与证据匹配,从而使得发现能够泛化。
Summary / 总结
The paper addresses non-generalizable findings and overreaching causal interpretations in interpretability research on large language models (LLMs), and argues that causal inference is needed to ground such claims. It uses Pearl's causal hierarchy to clarify what an interpretability study can justify: observations establish associations, interventions support claims about effects on a behavioural metric over a set of prompts, and counterfactual claims remain largely unverifiable without controlled supervision. The authors show how causal representation learning (CRL) operationalizes this hierarchy by specifying which variables are recoverable from model activations and under what assumptions. The resulting diagnostic framework helps practitioners match claims to evidence so that findings generalize.
论文探讨了大型语言模型(LLMs)可解释性研究中非泛化的发现问题,并认为因果推理对于有效的可解释性声明至关重要。作者提出使用佩尔的因果层次结构来澄清可以由可解释性研究证明的内容,区分观察到的相关性和干预。他们展示了如何通过因果表示学习(CRL)来实现这一层次结构,使其能够确定在某些假设下可以从模型激活中恢复哪些变量。这导致了一个诊断框架,帮助实践者选择适当的方法和评估,以确保发现能够泛化。
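The interventional level described in the abstract above (an ablation supports a claim about the average change in a behavioural metric over a set of prompts) can be illustrated with a toy example. Everything in this Python sketch is hypothetical: `toy_model` is a hand-written two-unit function rather than an LLM, the "prompts" are scalars, and zero-ablation stands in for whichever intervention a real study would use.

```python
import math


def toy_model(x, ablate=None):
    """Hypothetical two-unit 'network' returning P(target) for input x."""
    hidden = [max(0.0, 2.0 * x), max(0.0, 1.0 - x)]
    if ablate is not None:
        hidden[ablate] = 0.0  # zero-ablation intervention on one unit
    logit = 1.5 * hidden[0] - 0.5 * hidden[1]
    return 1.0 / (1.0 + math.exp(-logit))


def average_effect(prompts, unit):
    """Mean change in target probability caused by ablating `unit`."""
    deltas = [toy_model(x, ablate=unit) - toy_model(x) for x in prompts]
    return sum(deltas) / len(deltas)


prompts = [0.2, 0.5, 0.8]
effect_unit0 = average_effect(prompts, unit=0)
effect_unit1 = average_effect(prompts, unit=1)
print(effect_unit0, effect_unit1)
```

The numbers license a claim of the form "ablating unit 0 lowers the target probability on average over these prompts". Whether that claim holds on other prompts is a separate question, which is the generalisation issue the paper raises.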
DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
Authors: Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull, Daniel R. Cahn, Matteo Malgaroli
First: 2025-12-23T21:21:08+00:00 · Latest: 2026-02-18T18:41:36+00:00
Abstract
Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge. An effective simulator must expose the failure modes of the systems under evaluation. This work introduces Direct Iterative Adversarial Learning (DIAL), a DPO-based adversarial training framework that iteratively enhances user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. When applied to mental health support, a domain characterized by diverse failure types and a critical dependence on realistic user behavior for failure detection, DIAL restores lexical diversity diminished by supervised fine-tuning and reduces discriminator accuracy from near-perfect to near-random levels. The resulting simulator exhibits a strong correlation between simulated and real failure occurrence rates while maintaining low distributional divergence of failure modes. These findings indicate that DIAL is a promising method for developing realistic user simulators in multi-turn dialogue, facilitating rapid, reliable, and cost-effective system evaluation prior to deployment.
中文标题/摘要
标题:DIAL:直接迭代对抗学习在现实多轮对话模拟中的应用
现实用户模拟对于训练和评估多轮对话系统至关重要,但创建能够准确复制人类行为的模拟器仍然是一个重大挑战。有效的模拟器必须揭示被评估系统可能出现的失败模式。本文介绍了直接迭代对抗学习(DIAL),这是一种基于DPO的对抗训练框架,通过生成器(用户模拟器)和判别器之间的竞争动态,逐步提升用户模拟器的现实性。当应用于以多种失败类型为特征且依赖于现实用户行为进行失败检测的心理健康支持领域时,DIAL恢复了由监督微调减少的词汇多样性,并将判别器的准确性从近乎完美降低到近乎随机的水平。结果表明,模拟器在模拟和实际失败发生率之间表现出强烈的相关性,同时保持了失败模式分布的低差异性。这些发现表明,DIAL是一种在多轮对话中开发现实用户模拟器的有前途的方法,有助于在部署前实现快速、可靠和成本效益的系统评估。
Summary / 总结
The research aims to develop a realistic user simulator for multi-turn dialogue systems, addressing the challenge of accurately replicating human behavior. DIAL, a DPO-based adversarial training framework, iteratively improves the simulator's realism through competition between a generator (the user simulator) and a discriminator. Applied to mental health support, DIAL restores lexical diversity diminished by supervised fine-tuning and reduces discriminator accuracy from near-perfect to near-random levels. The resulting simulator shows a strong correlation between simulated and real failure occurrence rates with low distributional divergence of failure modes, making it a promising tool for system evaluation prior to deployment.
研究旨在通过解决准确复制人类行为的挑战,开发多轮对话系统的现实用户模拟器。DIAL是一种基于DPO的对抗训练框架,通过生成器和判别器的竞争迭代提高模拟器的现实性。应用于心理健康支持时,DIAL恢复了词汇多样性并降低了判别器的准确性,表明模拟和真实失败发生率之间存在强烈的相关性,同时保持了失败模式的低分布差异。这表明DIAL是开发现实模拟器的有效方法,可以促进系统在部署前的快速、可靠和低成本评估。
EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents
Authors: Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski
First: 2025-03-24T16:06:04+00:00 · Latest: 2026-02-18T18:37:52+00:00
Comments: v3 was a major revision with updated experiments and analysis; v4 consists of minor edits
Abstract
We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs. First, we develop benchmarks derived from key problems in economics -- procurement, scheduling, and pricing -- that test an LLM's ability to learn from the environment in context. Second, we develop the framework of litmus tests, evaluations that quantify an LLM's choice behavior on a stylized decision-making task with multiple conflicting objectives. Each litmus test outputs a litmus score, which quantifies an LLM's tradeoff response, a reliability score, which measures the coherence of an LLM's choice behavior, and a competency score, which measures an LLM's capability at the same task when the conflicting objectives are replaced by a single, well-specified objective. Evaluating a broad array of frontier LLMs, we (1) investigate changes in LLM capabilities and tendencies over time, (2) derive economically meaningful insights from the LLMs' choice behavior and chain-of-thought, (3) validate our litmus test framework by testing self-consistency, robustness, and generalizability. Overall, this work provides a foundation for evaluating LLM agents as they are further integrated into economic decision-making.
中文标题/摘要
标题:EconEvals:评估LLM经济决策能力的标准与试金石
我们开发了评估LLM经济决策能力和倾向的方法。首先,我们从经济学中的关键问题——采购、调度和定价——中开发了基准,测试LLM在情境中从环境中学习的能力。其次,我们开发了试金石框架,这是一种评估LLM在具有多个冲突目标的简化决策任务中的选择行为的框架,量化了其选择行为。每个试金石输出一个试金石分数,量化了LLM的权衡响应,一个可靠度分数,衡量了LLM选择行为的一致性,以及一个能力分数,衡量了当冲突目标被单一明确的目标取代时,LLM在同一任务上的能力。评估了广泛的前沿LLM,我们(1)研究了LLM能力与倾向随时间的变化,(2)从LLM的选择行为和推理链中获得了经济上有意义的见解,(3)通过测试自我一致性、稳健性和泛化性验证了我们的试金石框架。总体而言,这项工作为评估进一步整合到经济决策中的LLM代理奠定了基础。
Summary / 总结
The research develops benchmarks and litmus tests to evaluate LLMs' economic decision-making capabilities and tendencies. The benchmarks are derived from key economic problems (procurement, scheduling, and pricing) and test an LLM's ability to learn from the environment in context. Litmus tests quantify an LLM's choice behavior on a stylized task with multiple conflicting objectives, and each outputs a litmus score for the tradeoff response, a reliability score for the coherence of choices, and a competency score for performance when the conflict is replaced by a single objective. Evaluations of a broad array of frontier LLMs track changes in capabilities and tendencies over time, yield economically meaningful insights from choice behavior and chain-of-thought, and validate the litmus test framework's self-consistency, robustness, and generalizability.
研究开发了基准和试金石来评估LLM的经济决策能力。基准是从采购、调度和定价等关键经济问题中得出的,测试LLM在上下文中学习的能力。试金石量化了LLM在具有多个目标的任务上的选择行为,输出贸易权衡、选择一致性以及单一目标性能的分数。对各种LLM的评估揭示了能力的变化、选择行为的见解,并验证了试金石框架的自一致性、稳健性和泛化能力。
Are Object-Centric Representations Better At Compositional Generalization?
Authors: Ferdinand Kapl, Amir Mohammad Karimi Mamaghan, Maximilian Seitzer, Karl Henrik Johansson, Carsten Marr, Stefan Bauer, Andrea Dittadi
First: 2026-02-18T18:34:07+00:00 · Latest: 2026-02-18T18:34:07+00:00
Abstract
Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.
中文标题/摘要
标题:对象中心表示法在组合泛化方面更优越吗?
组合泛化,即对熟悉概念的新组合进行推理的能力,是人类认知的基础,也是机器学习中的关键挑战。对象中心(OC)表示法将场景编码为一组对象,常被认为支持这种泛化,但在视觉丰富的环境中系统性的证据有限。我们引入了一个跨三个受控视觉世界(CLEVRTex、Super-CLEVR和MOVi-C)的视觉问答基准,以衡量具有和不具有对象中心偏见的视觉编码器在对未见过的对象属性组合进行泛化方面的表现。为了确保公平和全面的比较,我们仔细考虑了训练数据多样性、样本大小、表示大小、下游模型容量和计算量。我们使用DINOv2和SigLIP2作为基础模型及其OC版本。我们的主要发现表明:(1) 在更难的组合泛化设置中,OC方法更优越;(2) 原始密集表示法仅在更简单的设置中优于OC,通常需要更多的下游计算;(3) OC模型更具样本效率,在较少的图像中实现更强的泛化,而密集编码器仅在有足够的数据和多样性时才能赶上或超过它们。总体而言,当数据集大小、训练数据多样性或下游计算受到限制时,对象中心表示法提供了更强的组合泛化能力。
Summary / 总结
The study investigates whether object-centric (OC) representations enhance compositional generalization in visually rich settings. Using a Visual Question Answering benchmark across three controlled visual worlds, the research compares OC approaches with dense representations, accounting for training data diversity, sample size, and downstream model capacity. Key findings indicate that OC approaches outperform dense representations in harder compositional generalization tasks, are more sample efficient, and require less downstream compute, though dense encoders can match or surpass OC models with sufficient data and diversity.
研究探讨了在视觉丰富环境中,对象中心(OC)表示是否能增强组合泛化能力。通过在三个受控视觉世界中使用视觉问答基准,比较了具有和不具有对象中心偏见的OC和密集表示。关键发现表明,在更难的组合泛化任务中,OC方法优于密集表示,而密集表示仅在较简单的任务中表现更好,并且需要更多的下游计算资源。此外,OC模型更具样本效率,在较少的图像中就能实现更好的泛化,而密集编码器只有在有足够的数据和多样性时才能赶上或超过它们。
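The benchmark above measures generalization to unseen combinations of object properties. A compositional split of that kind can be sketched in Python as follows. The property names (shape and color), the function `compositional_split`, and the held-out pairs are assumptions chosen for illustration; the abstract does not say which properties or splits the benchmark uses.

```python
from itertools import product


def compositional_split(shapes, colors, held_out):
    """Split all (shape, color) pairs so held-out pairs appear only at test time."""
    held_out = set(held_out)
    all_pairs = list(product(shapes, colors))
    train = [p for p in all_pairs if p not in held_out]
    test = [p for p in all_pairs if p in held_out]

    # Every individual property value in the test set must still occur in
    # training; otherwise the split tests new concepts, not new combinations.
    train_shapes = {s for s, _ in train}
    train_colors = {c for _, c in train}
    for shape, color in test:
        if shape not in train_shapes or color not in train_colors:
            raise ValueError(f"{(shape, color)} contains an unseen property value")
    return train, test


shapes = ["cube", "sphere", "cylinder"]
colors = ["red", "green", "blue"]
train, test = compositional_split(
    shapes, colors, held_out=[("cube", "red"), ("sphere", "blue")]
)
print(len(train), len(test))
```

Holding out more pairs makes the split harder, which is one plausible way to obtain the "easier" and "harder" settings the findings distinguish, though the paper's own definition may differ.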
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
First: 2024-11-18T16:33:52+00:00 · Latest: 2026-02-18T18:33:19+00:00
Abstract
Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To further push the performance upper bound, we incorporate an optional auxiliary loss that further enhances the proposed personalized prompts. To support VLM personalization research, we contribute a high-quality dataset. We carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at https://github.com/arctanxarc/MC-LLaVA.
中文标题/摘要
标题:MC-LLaVA:多概念个性化视觉语言模型
当前的视觉语言模型(VLMs)在各种任务上表现出色,如视觉问答。为了提升用户体验,最近的研究探讨了VLM的个性化,以理解用户提供的概念。然而,这些研究主要集中在单一概念上,忽视了多个概念的存在及其相互作用,这限制了其在实际中的应用。本文提出了一种多概念个性化范式MC-LLaVA。具体而言,MC-LLaVA采用多概念指令调优策略,在单个训练步骤中有效整合多个概念。为了降低训练成本,我们提出了一种个性化的文本提示,利用视觉标记信息初始化概念标记。此外,在推理过程中,我们引入了个性化的视觉提示,聚合位置图以增强识别和语义关联能力。为了进一步提高性能上限,我们引入了一个可选的辅助损失,更好地增强了提出的个性化提示。为了丰富VLM个性化研究,我们贡献了一个高质量的数据集。我们精心收集了来自电影的多角色和多物体图像,并手动创建了多概念场景下的问题-答案样本,具有更高的多样性。全面的实验表明,MC-LLaVA实现了令人印象深刻的多概念个性化响应,为VLM成为更好的用户助手铺平了道路。代码和数据集将在https://github.com/arctanxarc/MC-LLaVA发布。
Summary / 总结
This paper introduces MC-LLaVA, a multi-concept personalized vision-language model that enhances user experience by integrating multiple concepts during training. It uses a multi-concept instruction tuning strategy and personalized textual and visual prompts to improve recognition and grounding. Experimental results show that MC-LLaVA can generate impressive multi-concept personalized responses, advancing the field of VLMs as better user assistants. The authors also contribute a high-quality dataset to support VLM personalization research.
本文提出了MC-LLaVA,一种多概念个性化范式,解决了仅关注单一概念的局限性。该方法采用多概念指令调优策略,并使用个性化文本和视觉提示来增强识别和语义关联能力。实验结果表明,MC-LLaVA能够生成令人印象深刻的多概念个性化响应,提高VLMs作为用户助手的实用性。
Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
Authors: Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held, Diyi Yang
First: 2026-02-18T18:32:46+00:00 · Latest: 2026-02-18T18:32:46+00:00
Abstract
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
中文标题/摘要
标题:使用交错语义、声学和文本标记扩展开放离散音频基础模型
当前的音频语言模型主要是以文本为主,要么扩展预训练的文本LLM主干,要么依赖于仅语义的音频标记,这限制了对通用音频建模的能力。本文介绍了对原生音频基础模型的系统性实证研究,这些模型在大规模音频上应用下一个标记预测,联合建模语义内容、声学细节和文本,以支持通用音频生成和跨模态能力。我们提供了构建此类模型的全面实证见解:(1) 我们系统地研究了设计选择(数据来源、文本混合比例和标记组成),确立了一个经过验证的训练方案。(2) 我们通过IsoFLOP分析对64个模型进行了首次离散音频模型的扩展律研究,这些模型的FLOP范围从$3{\times}10^{18}$到$3{\times}10^{20}$,发现最优数据的增长速度是最优模型大小的1.6倍。(3) 我们将这些经验应用于训练SODA(扩展开放离散音频),这是一套从1.35亿到40亿参数的模型,基于5000亿标记训练,并与我们的扩展预测和现有模型进行比较。SODA 可作为多种音频/文本任务的灵活主干;我们通过语音保留的语音到语音翻译微调来证明这一点,使用相同的统一架构。
Summary / 总结
This paper addresses the limitations of text-first audio language models by introducing a systematic approach to build native audio foundation models that jointly model semantic content, acoustic details, and text. The authors investigate design choices such as data sources, text mixture ratios, and token composition, and conduct a scaling law study for discrete audio models, finding that optimal data grows 1.6 times faster than optimal model size. They train SODA, a suite of models ranging from 135M to 4B parameters, and demonstrate its effectiveness in voice-preserving speech-to-speech translation tasks using a unified architecture.
该论文通过引入一种系统方法来构建能够同时建模语义内容、声学细节和文本的原生音频基础模型,解决了以文本为主导的音频语言模型的局限性。作者研究了数据来源、文本混合比例和令牌组成等设计选择,并对离散音频模型进行了扩展规律研究,发现最优数据的增长速度是最优模型大小的1.6倍。他们训练了从135M到4B参数的SODA系列模型,并通过统一架构展示了其在语音保留的语音到语音翻译任务中的有效性。
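The scaling result above ("optimal data grows 1.6 times faster than optimal model size") can be turned into rough numbers under some stated assumptions. The sketch below assumes, hypothetically, that the optimal model size and data follow power laws N_opt ~ C^a and D_opt ~ C^b, that "1.6 times faster" means b = 1.6a, and that compute scales as C ~ 6ND so that a + b = 1. None of these assumptions is confirmed by the abstract; they follow a common convention in scaling-law analyses.

```python
# Assumption: "1.6x faster" is read as a ratio of power-law exponents, b = 1.6 * a.
RATIO = 1.6

# Assumption: N_opt ~ C**a, D_opt ~ C**b, and C ~ 6 * N * D, which forces a + b = 1.
a = 1.0 / (1.0 + RATIO)    # exponent for optimal model size
b = RATIO / (1.0 + RATIO)  # exponent for optimal training tokens

# Compute span quoted in the abstract: 3e18 to 3e20 FLOPs.
compute_factor = 3e20 / 3e18

n_growth = compute_factor ** a  # growth in optimal model size over the span
d_growth = compute_factor ** b  # growth in optimal data over the span

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"model size grows ~{n_growth:.1f}x, data grows ~{d_growth:.1f}x")
```

Under these assumptions the exponents are about 0.38 and 0.62, so across the 100-fold compute range the optimal model size would grow by roughly 6x and the optimal data by roughly 17x.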
Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition
Authors: Bo Pan, Peter Zhiping Zhang, Hao-Wei Pang, Alex Zhu, Xiang Yu, Liying Zhang, Liang Zhao
First: 2026-02-18T18:27:21+00:00 · Latest: 2026-02-18T18:27:21+00:00
Abstract
Matched molecular pairs (MMPs) capture the local chemical edits that medicinal chemists routinely use to design analogs, but existing ML approaches either operate at the whole-molecule level with limited edit controllability or learn MMP-style edits from restricted settings and small models. We propose a variable-to-variable formulation of analog generation and train a foundation model on large-scale MMP transformations (MMPTs) to generate diverse variables conditioned on an input variable. To enable practical control, we develop prompting mechanisms that let the users specify preferred transformation patterns during generation. We further introduce MMPT-RAG, a retrieval-augmented framework that uses external reference analogs as contextual guidance to steer generation and generalize from project-specific series. Experiments on general chemical corpora and patent-specific datasets demonstrate improved diversity, novelty, and controllability, and show that our method recovers realistic analog structures in practical discovery scenarios.
中文标题/摘要
标题:用于重现药物化学直觉的匹配分子对变换增强基础模型
匹配分子对(MMPs)捕捉了药物化学家常用的局部化学编辑,但现有的机器学习方法要么在分子整体水平上操作,编辑可控性有限,要么从受限环境和小型模型中学习MMP风格的编辑。我们提出了一种变量到变量的类比生成形式,并在大规模MMP变换(MMPTs)上训练基础模型,根据输入变量生成多样化的变量。为了实现实际控制,我们开发了提示机制,允许用户在生成过程中指定首选的变换模式。我们还引入了MMPT-RAG,这是一种检索增强框架,使用外部参考类比作为上下文指导,引导生成并从项目特定系列中泛化。在通用化学语料库和专利特定数据集上的实验表明,我们的方法在多样性和新颖性方面有所改进,并且在实际发现场景中能够恢复现实的类比结构。
Summary / 总结
The research aims to improve the generation of analogs in medicinal chemistry by leveraging matched molecular pairs (MMPs) and foundation models. The method involves training a foundation model on large-scale MMP transformations and using retrieval-augmented generation to enable practical control over the analog generation process. Key findings include improved diversity, novelty, and controllability of generated analogs, as well as the ability to produce realistic analog structures in practical discovery scenarios.
研究旨在通过匹配分子对(MMPs)提高机器学习模型在生成类似物方面的可控性和有效性。方法包括在大规模MMP转换上训练基础模型,并结合检索增强生成,通过用户提示实现实际控制。关键发现包括生成类似物的多样性和新颖性提高,以及能够在实际发现场景中生成现实的类似物结构。
Learning Situated Awareness in the Real World
Authors: Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang
First: 2026-02-18T18:22:52+00:00 · Latest: 2026-02-18T18:22:52+00:00
Abstract
A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.
中文标题/摘要
标题:学习现实世界中的情境意识
人类感知的核心方面是情境意识,即能够将自己与周围的物理环境联系起来,并在上下文中推理可能的动作。然而,大多数现有的多模态基础模型(MFM)基准主要强调环境中心的空间关系(场景中物体之间的关系),而忽视了需要根据代理视角、姿态和运动进行相对推理的观察者中心关系。为了弥合这一差距,我们引入了SAW-Bench(现实世界中的情境意识),这是一个使用真实世界视频评估第一人称情境意识的新基准。SAW-Bench 包含786个使用Ray-Ban Meta(Gen 2)智能眼镜录制的自我记录视频,覆盖了多种室内外环境,并包含超过2,071个人类标注的问题-答案对。它通过六个不同的意识任务来测试模型的观察者中心理解。我们的全面评估揭示了即使在表现最佳的MFM(Gemini 3 Flash)中,人类模型性能差距仍高达37.66%。除了这一差距之外,我们深入分析还发现了几个值得注意的发现;例如,虽然模型可以利用第一人称视频中的部分几何线索,但它们往往无法推断出连贯的摄像机几何结构,导致系统性的空间推理错误。我们将SAW-Bench 定位为情境空间智能的基准,超越了被动观察,转向理解物理上接地的观察者中心动态。
Summary / 总结
The research aims to evaluate multimodal foundation models' ability to understand egocentric situated awareness, which involves reasoning about the environment relative to the agent's viewpoint. SAW-Bench, a new benchmark using real-world videos, includes 786 self-recorded videos and 2,071 human-annotated question-answer pairs, covering diverse environments. The evaluation shows a significant performance gap of 37.66% between humans and the best-performing model, Gemini 3 Flash, highlighting models' challenges in inferring coherent camera geometry from partial cues.
该论文提出了SAW-Bench,这是一个新的基准,用于评估使用真实世界视频的主观情境意识。它通过关注观察者为中心的关系,弥补了现有基准的不足,而不是环境为中心的空间关系。该基准包括786个自我录制的视频和2,071个人类标注的问题-答案对,涵盖了多种环境。实验结果表明,人类与模型之间的性能差距为37.66%,表明模型在从部分线索推断出连贯的摄像机几何结构和空间推理方面存在困难。
VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection
Authors: Yingyuan Yang, Tian Lan, Yifei Gao, Yimeng Lu, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang
First: 2026-02-18T18:22:22+00:00 · Latest: 2026-02-18T18:22:22+00:00
Abstract
Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to a lack of temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code available at: https://github.com/yyyangcoder/VETime.
中文标题/摘要
标题:VETime:视觉增强的零样本时间序列异常检测
时间序列异常检测(TSAD)需要识别即时点异常和长范围上下文异常。然而,现有的基础模型面临一个根本性的权衡:一维时间模型提供精细的点定位但缺乏全局上下文视角,而基于视觉的二维模型能够捕捉全局模式但因缺乏时间对齐和粗粒度点定位而受到信息瓶颈的限制。为了解决这一困境,我们提出了VETime,这是第一个通过精细视觉-时间对齐和动态融合统一时间和视觉模态的TSAD框架。VETime引入了可逆图像转换和块级时间对齐模块,以建立共享的视觉-时间时间线,同时保留区分性细节并保持时间敏感性。此外,我们设计了异常窗口对比学习机制和任务自适应多模态融合,以适应性地整合两种模态的互补感知优势。大量实验表明,VETime在零样本场景中显著优于最先进的模型,实现了更高的定位精度且计算开销低于当前基于视觉的方法。代码可在:https://github.com/yyyangcoder/VETime 获取。
Summary / 总结
VETime is a novel framework for time-series anomaly detection that integrates temporal and visual modalities to address the limitations of existing models. It uses fine-grained visual-temporal alignment and dynamic fusion, including a Reversible Image Conversion and a Patch-Level Temporal Alignment module, to establish a shared timeline. Additionally, it employs Anomaly Window Contrastive Learning and Task-Adaptive Multi-Modal Fusion to enhance anomaly detection. Experiments show that VETime outperforms state-of-the-art models in zero-shot scenarios with higher precision and lower computational cost compared to vision-based approaches.
VETime 是一种新颖的时间序列异常检测框架,结合了时间和视觉模态,以解决现有模型的局限性。它使用细粒度的视觉-时间对齐和动态融合,包括可逆图像转换和补丁级时间对齐模块,以建立共享的时间线。VETime 还采用了异常窗口对比学习和任务自适应多模态融合来整合两种模态的互补感知优势。实验结果表明,VETime 在零样本场景中优于最先进的模型,具有更高的定位精度和更低的计算开销,相比当前基于视觉的方法。
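The VETime abstract mentions a Reversible Image Conversion that keeps a shared visual-temporal timeline. The abstract does not describe how it works, so the following is only a generic sketch of one reversible way to fold a 1D series into a 2D grid and map each pixel back to its time index. The function names and the zero-padding choice are assumptions, not VETime's implementation.

```python
def series_to_image(series, width):
    """Fold a 1D series into rows of `width` values, zero-padding the last row."""
    if width <= 0:
        raise ValueError("width must be positive")
    values = list(series)
    pad = (-len(values)) % width
    padded = values + [0.0] * pad
    rows = [padded[i:i + width] for i in range(0, len(padded), width)]
    # The original length is returned so the padding can be removed later.
    return rows, len(values)


def image_to_series(rows, length):
    """Invert series_to_image by flattening the rows and dropping the padding."""
    flat = [v for row in rows for v in row]
    return flat[:length]


def pixel_to_time(row, col, width):
    """Map a pixel position back to its index in the original series."""
    return row * width + col


series = [float(t) for t in range(10)]
rows, n = series_to_image(series, width=4)
recovered = image_to_series(rows, n)
print(len(rows), "rows; round trip exact:", recovered == series)
```

Because every pixel has a known time index, a pointwise anomaly score computed on the grid can be carried back to the original timeline, which is the kind of alignment the abstract alludes to.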
Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective
Authors: Feilong Liu
First: 2026-01-09T23:07:14+00:00 · Latest: 2026-02-18T18:17:41+00:00
Abstract
Mixture-of-Experts (MoE) architectures are widely used for efficiency and conditional computation, but their effect on the geometry of learned functions and representations remains poorly understood. We study MoEs through a geometric lens, interpreting routing as soft partitioning into overlapping expert-local charts. We introduce a Dual Jacobian-PCA spectral probe that analyzes local function geometry via Jacobian singular value spectra and representation geometry via weighted PCA of routed hidden states. Using a controlled MLP-MoE setting with exact Jacobian computation, we compare dense, Top-k, and fully soft routing under matched capacity. Across random seeds, MoE routing consistently reduces local sensitivity: expert-local Jacobians show smaller leading singular values and faster spectral decay than dense baselines. Weighted PCA reveals that expert-local representations distribute variance across more principal directions, indicating higher effective rank. We further observe low alignment among expert Jacobians, suggesting decomposition into low-overlap expert-specific transformations. Routing sharpness modulates these effects: Top-k routing yields more concentrated, lower-rank expert structure, while fully soft routing produces broader, higher-rank representations. Experiments on a 3-layer transformer with WikiText confirm curvature reduction on natural language and show lower cross-expert alignment for Top-k routing. These findings support interpreting MoEs as soft partitionings of function space that flatten local curvature while redistributing representation variance, yielding testable predictions for expert scaling, hallucination reduction, and ensemble diversity.
中文标题/摘要
标题:专家混合作为软聚类:双重雅可比-PCA光谱几何视角
专家混合(MoE)架构广泛用于提高效率和条件计算,但它们对学习函数和表示的几何结构的影响仍然知之甚少。我们从几何学的角度研究MoE,将路由解释为将输入软划分到重叠的专家局部图中。我们引入了一种双重雅可比-PCA光谱探针,通过雅可比奇异值光谱分析局部函数几何结构,并通过加权PCA分析路由隐藏状态的表示几何结构。使用具有精确雅可比计算的受控MLP-MoE设置,我们在匹配容量的情况下比较了密集、Top-k和完全软路由。在随机种子上,MoE路由始终降低了局部敏感性:专家局部雅可比矩阵显示较小的主导奇异值和更快的光谱衰减,而密集基线则不然。加权PCA表明,专家局部表示在更多的主方向上分配方差,表明更高的有效秩。我们还观察到专家雅可比矩阵之间的低对齐度,这表明可以将其分解为低重叠的专家特定变换。路由的锐度调节了这些效果:Top-k路由产生更集中、更低秩的专家结构,而完全软路由则产生更广泛、更高秩的表示。在具有WikiText的3层变压器上的实验表明,Top-k路由降低了自然语言上的曲率,并且跨专家对齐度较低。这些发现支持将MoE解释为对函数空间进行软划分,从而在局部曲率上进行拉平并重新分配表示方差,从而为专家缩放、幻觉减少和集成多样性提供可测试的预测。
Summary / 总结
This study investigates Mixture-of-Experts (MoE) architectures from a geometric perspective, interpreting routing as soft clustering. The authors introduce a Dual Jacobian-PCA method to analyze local function geometry and representation geometry. Across different routing methods, MoE consistently reduces local sensitivity, with expert-local Jacobians showing smaller leading singular values and faster spectral decay compared to dense baselines. Weighted PCA indicates that expert-local representations distribute variance across more principal directions, and routing sharpness affects the structure of expert-specific transformations, with Top-k routing leading to more concentrated, lower-rank expert structure and fully soft routing producing broader, higher-rank representations. Experiments on a transformer model confirm these findings and suggest MoEs flatten local curvature while redistributing representation variance.
该研究从几何角度探讨了Mixture-of-Experts (MoE)架构,将路由解释为将函数空间软分割成重叠的专家局部图。作者引入了一种Dual Jacobian-PCA光谱探针来分析局部函数几何和表示几何。实验表明,MoE路由降低了局部敏感性,与密集基线相比,专家局部雅可比矩阵具有更小的主导奇异值和更快的谱衰减。加权PCA显示,专家局部表示将方差分布在更多的主方向上,而路由的锐度影响专家特定变换的结构,Top-k路由导致更集中且更低秩的结构,而完全软路由产生更广泛且更高秩的表示。
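The MoE abstract reports that expert-local representations spread variance over more principal directions, "indicating higher effective rank". The abstract does not say which definition of effective rank is used. One common choice is the participation ratio of the PCA eigenvalues, sketched below as an assumption with made-up spectra.

```python
def participation_ratio(eigenvalues):
    """Effective rank as (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Equals the number of directions when variance is spread evenly,
    and 1 when all variance lies along a single direction.
    """
    total = sum(eigenvalues)
    squares = sum(v * v for v in eigenvalues)
    if squares == 0:
        raise ValueError("spectrum must contain a nonzero eigenvalue")
    return total * total / squares


# Hypothetical spectra for illustration only.
concentrated = [1.0, 0.0, 0.0, 0.0]  # variance in one direction
spread = [1.0, 1.0, 1.0, 1.0]        # variance shared by four directions

print(participation_ratio(concentrated), participation_ratio(spread))
```

With this measure, a spectrum that decays slowly scores closer to the full dimensionality, which matches the qualitative contrast the abstract draws between fully soft and Top-k routing.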
Statistical Inference Leveraging Synthetic Data with Distribution-Free Guarantees
Authors: Meshi Bashari, Yonghoon Lee, Roy Maor Lotan, Edgar Dobriban, Yaniv Romano
First: 2025-09-24T17:37:14+00:00 · Latest: 2026-02-18T18:13:32+00:00
Abstract
The rapid proliferation of high-quality synthetic data -- generated by advanced AI models or collected as auxiliary data from related tasks -- presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around any statistical inference procedure to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard inference method using only real data when synthetic data is of low quality. The error of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction and the comparison of large reasoning models on complex math problems.
中文标题/摘要
标题:利用合成数据进行统计推断的无分布保证方法
高质量合成数据的迅速增长——由先进的人工智能模型生成或从相关任务中收集作为辅助数据——为统计推断提供了机遇和挑战。本文介绍了一种通用合成驱动推断(GESPI)框架,该框架可以包裹在任何统计推断程序周围,通过结合合成数据和真实数据来安全地提高样本效率。该框架利用高质量的合成数据增强统计功效,但在合成数据质量较低时,会自适应地默认使用仅使用真实数据的标准推断方法。我们的方法的误差保持在用户指定的界限以下,且无需对合成数据的分布做出假设,并随着合成数据质量的提高而降低。这种灵活性使得该方法能够无缝集成到符合性预测、风险控制、假设检验和多重检验程序中,而无需修改基础推断方法。我们通过具有有限标记数据的挑战性任务展示了该方法的优势,包括AlphaFold蛋白质结构预测和比较大型推理模型在复杂数学问题上的表现。
Summary / 总结
This paper addresses the challenge of leveraging synthetic data for statistical inference, especially when the synthetic data quality is uncertain. It proposes the GEneral Synthetic-Powered Inference (GESPI) framework, which integrates synthetic and real data to enhance sample efficiency while maintaining statistical power. The method ensures that the error remains below a user-specified bound without requiring distributional assumptions on the synthetic data, and it improves as the synthetic data quality increases. The framework can be applied to various statistical procedures such as conformal prediction, risk control, and hypothesis testing. Experiments show that GESPI improves performance on tasks with limited labeled data, such as protein structure prediction and complex math problem solving.
本文解决了利用合成数据进行统计推断时,特别是在合成数据质量不确定的情况下所面临的挑战。提出了GEneral Synthetic-Powered Inference (GESPI)框架,该框架将合成数据和真实数据结合以提高样本效率并保持统计功效。该方法确保误差在用户指定的范围内,无需对合成数据的分布做出假设,并且随着合成数据质量的提高而改善。该框架具有灵活性,可以应用于各种统计任务,如置信预测、风险控制和假设检验。实验结果显示,该方法在有限标注数据的任务中,如蛋白质结构预测和复杂数学问题中表现出色。
Modeling Human Behavior in a Strategic Network Game with Complex Group Dynamics
Authors: Jonathan Skaggs, Jacob W. Crandall
First: 2025-05-01T18:13:20+00:00 · Latest: 2026-02-18T18:09:33+00:00
Comments: In Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems, Paphos, Cyprus, 2026
Abstract
Human networks greatly impact important societal outcomes, including wealth and health inequality, poverty, and bullying. As such, understanding human networks is critical to learning how to promote favorable societal outcomes. As a step toward better understanding human networks, we compare and contrast several methods for learning models of human behavior in a strategic network game called the Junior High Game (JHG) [39]. These modeling methods differ with respect to the assumptions they use to parameterize human behavior (behavior matching vs. community-aware behavior) and the moments they model (mean vs. distribution). Results show that the highest-performing method, called hCAB, models the distribution of human behavior rather than the mean and assumes humans use community-aware behavior rather than behavior matching. When applied to small societies, the hCAB model closely mirrors the population dynamics of human groups (with notable differences). Additionally, in a user study, human participants had difficulty distinguishing hCAB agents from other humans, thus illustrating that the hCAB model also produces plausible (individual) behavior in this strategic network game.
中文标题/摘要
标题:在具有复杂群体动态的战略网络游戏中建模人类行为
人类网络极大地影响着重要的社会结果,包括财富和健康不平等、贫困和欺凌。因此,理解人类网络对于学习如何促进有利的社会结果至关重要。为了更好地理解人类网络,我们比较了几种在名为初中游戏(JHG)的战略网络游戏中学习人类行为模型的方法。这些建模方法在参数化人类行为(行为匹配 vs. 社区感知行为)以及建模的时刻(均值 vs. 分布)方面有所不同。结果显示,表现最佳的方法称为hCAB,它建模的是人类行为的分布而非均值,并假设人类使用的是社区感知行为而非行为匹配。当应用于小型社会时,hCAB模型与人类群体的人口动态非常相似(有显著差异)。此外,在用户研究中,人类参与者难以区分hCAB代理与其他人类,从而说明hCAB模型在这一战略网络游戏中也产生了可信的(个体)行为。
Summary / 总结
This study investigates human behavior in a strategic network game called the Junior High Game (JHG) to better understand human networks and their impact on societal outcomes. The research compares different modeling methods, finding that the hCAB model, which assumes community-aware behavior and models the distribution of human behavior rather than the mean, outperforms the others. When applied to small societies, the hCAB model closely mirrors the population dynamics of human groups, though with notable differences, and in a user study participants had difficulty distinguishing hCAB agents from human players.
研究旨在通过比较不同方法在Junior High Game中的应用,来理解人类在网络中的行为。hCAB模型假设人类具有社区意识的行为,并模拟人类行为的分布,表现出色。它能够模拟人类群体的动力学,并在用户研究中难以与人类玩家区分,表明其合理性和有效性。
SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation
Authors: Jaid Monwar Chowdhury, Chi-An Fu, Reyhaneh Jabbarvand
First: 2026-02-18T18:09:03+00:00 · Latest: 2026-02-18T18:09:03+00:00
Comments: 9 pages, 6 figures, 4 tables
Abstract
Automated unit test generation for C remains a formidable challenge due to the semantic gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. While Large Language Models (LLMs) exhibit strong generative capabilities, direct intent-to-code synthesis frequently suffers from the leap-to-code failure mode, where models prematurely emit code without grounding in program structure, constraints, and semantics. This results in non-compilable tests, hallucinated function signatures, low branch coverage, and semantically irrelevant assertions that cannot properly capture bugs. We introduce SPARC, a neuro-symbolic, scenario-based framework that bridges this gap through four stages: (1) Control Flow Graph (CFG) analysis, (2) an Operation Map that grounds LLM reasoning in validated utility helpers, (3) path-targeted test synthesis, and (4) an iterative self-correction validation loop using compiler and runtime feedback. We evaluate SPARC on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matching or exceeding the symbolic execution tool KLEE on complex subjects. SPARC retains 94.3% of tests through iterative repair and produces code with significantly higher developer-rated readability and maintainability. By aligning LLM reasoning with program structure, SPARC provides a scalable path for industrial-grade testing of legacy C codebases.
中文标题/摘要
标题:SPARC:面向自动C单元测试生成的场景规划与推理
由于高级程序意图与C语言严格语法约束之间的语义差距,自动C单元测试生成仍然是一个严峻的挑战。尽管大型语言模型(LLMs)具有强大的生成能力,但直接从意图到代码的合成经常会出现跳过代码的问题,即模型在缺乏程序结构、约束和语义的情况下过早地生成代码,这将导致不可编译的测试、虚构的功能签名、低分支覆盖率和语义无关的断言,无法有效捕捉错误。我们提出了SPARC,这是一种神经-符号、基于场景的框架,通过四个阶段来弥合这一差距:(1)控制流图(CFG)分析,(2)操作图将LLM推理与验证的实用程序助手联系起来,(3)路径导向的测试合成,以及(4)使用编译器和运行时反馈的迭代自我纠正验证循环。我们在59个真实世界和算法主题上评估了SPARC,其行覆盖率提高了31.36%,分支覆盖率提高了26.01%,突变分数提高了20.78%,在复杂主题上与符号执行工具KLEE相当或超过KLEE。SPARC通过迭代修复保留了94.3%的测试,并生成了具有显著更高开发者可读性和可维护性的代码。通过将LLM推理与程序结构对齐,SPARC为工业级测试遗留C代码库提供了一条可扩展的路径。
Summary / 总结
SPARC is a neuro-symbolic framework designed to bridge the semantic gap in automated C unit test generation. It consists of four stages: CFG analysis, an Operation Map that grounds LLM reasoning in validated utility helpers, path-targeted test synthesis, and an iterative validation loop driven by compiler and runtime feedback. Evaluated on 59 real-world and algorithmic subjects, SPARC outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, and matches or exceeds KLEE on complex subjects. It retains 94.3% of tests through iterative repair and produces code that developers rate as more readable and maintainable.
SPARC 是一个神经符号框架,旨在通过弥合高级程序意图与低级语法约束之间的语义差距来自动化生成 C 单元测试。它包括四个阶段:控制流图分析、操作图以验证辅助函数为基础的 LLM 理解、路径导向的测试合成以及迭代验证循环。SPARC 在行、分支和突变覆盖率方面分别优于基线 31.36%、26.01% 和 20.78%,并生成了更易读和可维护的代码。它成功地迭代修复了 94.3% 的测试,并在复杂主题上与 KLEE 的性能相当。
PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction
Authors: Bo Lang, Nirav Savaliya, Zhihao Zheng, Jinglun Feng, Zheng-Hang Yeh, Mooi Choo Chuah
Venue: WACV 2026
First: 2026-02-18T18:08:26+00:00 · Latest: 2026-02-18T18:08:26+00:00
Comments: WACV 2026
Abstract
High-definition (HD) maps are crucial to autonomous driving, providing structured representations of road elements to support navigation and planning. However, existing query-based methods often employ random query initialization and depend on implicit temporal modeling, which lead to temporal inconsistencies and instabilities during the construction of a global map. To overcome these challenges, we introduce a novel end-to-end framework for consistent online HD vectorized map construction, which jointly performs map instance tracking and short-term prediction. First, we propose a Semantic-Aware Query Generator that initializes queries with spatially aligned semantic masks to capture scene-level context globally. Next, we design a History Rasterized Map Memory to store fine-grained instance-level maps for each tracked instance, enabling explicit historical priors. A History-Map Guidance Module then integrates rasterized map information into track queries, improving temporal continuity. Finally, we propose a Short-Term Future Guidance module to forecast the immediate motion of map instances based on the stored history trajectories. These predicted future locations serve as hints for tracked instances to further avoid implausible predictions and keep temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed method outperforms state-of-the-art (SOTA) methods with good efficiency.
中文标题/摘要
标题:PredMapNet:未来和历史推理在一致的在线高清矢量地图构建中的应用
高分辨率(HD)地图对自动驾驶至关重要,提供了道路元素的结构化表示,以支持导航和规划。然而,现有的基于查询的方法通常采用随机查询初始化,并依赖于隐式的时序建模,这导致在构建全局地图时出现时序不一致和不稳定。为了解决这些挑战,我们提出了一种新的端到端框架,用于一致的在线高清矢量地图构建,该框架同时执行地图实例跟踪和短期预测。首先,我们提出了一种语义感知查询生成器,使用空间对齐的语义掩码初始化查询,以捕捉全局场景级上下文。接下来,我们设计了一种历史栅格化地图记忆,用于存储每个跟踪实例的细粒度实例级地图,从而启用显式的历史先验。然后,历史地图引导模块将栅格化地图信息整合到跟踪查询中,提高时序连续性。最后,我们提出了一种短期未来引导模块,根据存储的历史轨迹预测地图实例的即时运动。这些预测的未来位置作为跟踪实例的提示,以进一步避免不合理的预测并保持时序一致性。在nuScenes和Argoverse2数据集上的广泛实验表明,我们提出的方法在效率良好的情况下优于最先进的(SOTA)方法。
Summary / 总结
The research aims to address the temporal inconsistencies and instabilities in the construction of global HD vectorized maps for autonomous driving. The method introduces PredMapNet, an end-to-end framework that combines map instance tracking and short-term prediction. It uses a Semantic-Aware Query Generator to initialize queries with spatially aligned semantic masks, a History Rasterized Map Memory to store fine-grained instance-level maps, and a History-Map Guidance Module to integrate historical map information into track queries. Additionally, a Short-Term Future Guidance module forecasts the immediate motion of map instances. Experiments show that PredMapNet outperforms SOTA methods with good efficiency.
研究旨在解决全球HD矢量化地图构建中的时间不一致性和不稳定性问题。PredMapNet框架结合了语义感知查询生成、历史栅格化地图记忆、历史地图引导和短期未来引导模块,以实现一致的在线地图构建。实验结果表明,PredMapNet在nuScenes和Argoverse2数据集上的效率和准确性均优于现有方法。
Towards a Science of AI Agent Reliability
Authors: Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan
First: 2026-02-18T18:05:44+00:00 · Latest: 2026-02-18T18:05:44+00:00
Abstract
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
中文标题/摘要
标题:迈向AI代理可靠性的科学
AI代理正越来越多地被部署以执行重要任务。尽管在标准基准上的准确率得分表明了快速的进步,但许多代理仍然在实践中继续失败。这种差异突显了当前评估的基本局限性:将代理行为压缩为单一的成功指标掩盖了关键的操作缺陷。值得注意的是,它忽略了代理是否在多次运行中表现一致、能否抵御干扰、能否以可预测的方式失败或其误差严重性是否受到限制。基于安全关键工程,我们通过提出十二个具体的指标,从四个关键维度分解代理可靠性:一致性、鲁棒性、可预测性和安全性,提供了一个全面的性能概况。在两个互补基准上评估14个代理模型后,我们发现最近的能力提升仅在可靠性方面带来了微小的改进。通过揭示这些持续存在的局限性,我们的指标补充了传统的评估,同时提供了关于代理如何表现、退化和失败的推理工具。
Summary / 总结
The research aims to address the gap between AI agent performance on benchmarks and their reliability in practical applications. The authors propose a comprehensive evaluation framework with twelve metrics to assess four key dimensions of agent reliability: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two benchmarks, the study reveals that recent advancements have only marginally improved reliability, highlighting persistent issues in agent performance and failure modes.
研究旨在解决AI代理在基准测试中高准确率与实际应用中失败之间的差距。作者提出了一种新的评估框架,包含十二个评估指标,分别从一致性、鲁棒性、可预测性和安全性四个方面评估AI代理的可靠性。研究评估了14种代理模型,发现近期的进步仅在可靠性方面带来了微小的改进,揭示了这些模型中存在的持续性问题。
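The abstract above argues that a single success metric hides whether agents "behave consistently across runs". The twelve metrics themselves are not listed in the abstract, so the sketch below is only a hypothetical illustration of the general point. It contrasts mean accuracy with the fraction of tasks solved in every repeated run, using made-up outcomes for two agents.

```python
def accuracy_and_consistency(outcomes):
    """Return (mean accuracy, fraction of tasks solved in all runs).

    outcomes: dict mapping a task id to a list of booleans, one per repeated run.
    Both quantities are illustrative and are not taken from the paper.
    """
    n_tasks = len(outcomes)
    if n_tasks == 0:
        raise ValueError("at least one task is required")
    mean_acc = sum(sum(runs) / len(runs) for runs in outcomes.values()) / n_tasks
    all_pass = sum(all(runs) for runs in outcomes.values()) / n_tasks
    return mean_acc, all_pass


# Hypothetical agents with identical mean accuracy but different consistency.
steady_agent = {"t1": [True, True, True, True], "t2": [False, False, False, False]}
flaky_agent = {"t1": [True, False, True, False], "t2": [False, True, False, True]}

print(accuracy_and_consistency(steady_agent))
print(accuracy_and_consistency(flaky_agent))
```

Both agents score the same mean accuracy, yet only the first solves any task reliably, which is the kind of difference a single success rate cannot show.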
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Authors: Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai
Venue: ICLR 2026
First: 2026-02-18T18:01:23+00:00 · Latest: 2026-02-18T18:01:23+00:00
Comments: Accepted by ICLR 2026
Abstract
The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment. However, recent efforts to extend alignment to other languages often require substantial resources, either through large-scale, high-quality supervision in the target language or through pairwise alignment with high-resource languages, which limits scalability. In this work, we propose a resource-efficient method for improving multilingual safety alignment. We introduce a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. By improving collinearity between multilingual representation vectors, our method encourages directional consistency at the multilingual semantic level in a single update. This allows simultaneous alignment across multiple languages using only multilingual prompt variants without requiring additional response-level supervision in low-resource languages. We validate the proposed method across different model architectures and alignment paradigms, and demonstrate its effectiveness in enhancing multilingual safety with limited impact on general model utility. Further evaluation across languages and tasks indicates improved cross-lingual generalization, suggesting the proposed approach as a practical solution for multilingual consistency alignment under limited supervision.
中文标题/摘要
标题:一次对齐,多语言受益:为LLM安全对齐强制多语言一致性
跨语言社区广泛部署大型语言模型(LLMs)需要可靠的多语言安全对齐。然而,将对齐扩展到其他语言的近期努力往往需要大量资源,要么通过目标语言的大规模高质量监督,要么通过与高资源语言的成对对齐,这限制了可扩展性。在本文中,我们提出了一种资源高效的方法来提高多语言安全对齐。我们引入了一种即插即用的多语言一致性(MLC)损失,可以集成到现有的单语言对齐管道中。通过提高多语言表示向量之间的共线性,我们的方法在单次更新中鼓励多语言语义层面的方向一致性。这使得仅使用多语言提示变体即可同时对齐多种语言,而无需在低资源语言中进行额外的响应级监督。我们通过不同的模型架构和对齐范式验证了所提出的方法,并证明了其在有限影响一般模型效用的情况下提高了多语言安全性。进一步的语言和任务评估表明,跨语言泛化能力得到提高,表明所提出的方法是有限监督下多语言一致性对齐的实用解决方案。
Summary / 总结
This work addresses the challenge of multilingual safety alignment for large language models by proposing a resource-efficient Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. The method improves collinearity between multilingual representation vectors, encouraging directional consistency at the multilingual semantic level. It allows for simultaneous alignment across multiple languages using only multilingual prompt variants, without requiring additional response-level supervision in low-resource languages. The approach is validated across different model architectures and alignment paradigms, showing effectiveness in enhancing multilingual safety with minimal impact on general model utility.
本文提出了一种资源高效的多语言一致性(MLC)损失,可以集成到现有的单语言对齐管道中,以解决大型语言模型的多语言安全性对齐问题。该方法通过提高多语言表示向量之间的共线性,鼓励在语义层面的方向一致性。实验表明,所提出的方法在最小影响模型实用性的情况下增强了多语言安全性,并且在不同的模型架构和对齐范式下提高了跨语言的一般化能力。
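The abstract above describes the MLC loss only as "improving collinearity between multilingual representation vectors". A minimal sketch of one way such an objective could look, assuming collinearity is scored by the mean pairwise cosine similarity between the representations of the same prompt in several languages. The function names and this particular formulation are illustrative assumptions, not the authors' definition.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_loss(reps):
    """Hypothetical consistency loss: 1 - mean pairwise cosine similarity.

    `reps` holds one representation vector per language for the SAME prompt.
    The loss is 0 when all vectors point in the same direction and grows as
    the directions diverge. This is an assumed stand-in for the MLC loss.
    """
    pairs = list(combinations(reps, 2))
    mean_cos = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_cos


# Same direction at different scales: directionally consistent.
aligned = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.5, 1.0, 1.5]]
# Mutually orthogonal directions: no directional consistency.
scattered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(consistency_loss(aligned), consistency_loss(scattered))
```

Because the score depends only on directions, it needs prompt variants in several languages and no response-level labels, which matches the supervision setting the abstract describes.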
Closing the Distribution Gap in Adversarial Training for LLMs
Authors: Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn
First: 2026-02-16T22:34:52+00:00 · Latest: 2026-02-18T17:57:10+00:00
Abstract
Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.
中文标题/摘要
标题:闭合大规模语言模型对抗训练中的分布差距
对抗训练是提高大规模语言模型鲁棒性的最有前途的方法之一。然而,尽管取得了显著进展,模型仍然容易受到简单的同分布攻击,例如将提示改写为过去时或将其翻译成其他语言。我们认为这种持续的脆弱性源于当前对抗训练算法的基本局限性:它们在训练集上最小化对抗损失,但未能充分覆盖数据分布,导致对看似简单的攻击仍然脆弱。为了弥合这一差距,我们提出了分布对抗训练(DAT)。我们利用扩散语言模型近似提示和响应的真实联合分布,从而生成多样且高概率的样本,以解决泛化失败问题。通过结合由扩散模型提供的数据分布优化与连续对抗训练,DAT 达到了比之前方法更高的对抗鲁棒性。
Summary / 总结
The research aims to enhance the robustness of large language models (LLMs) against adversarial attacks by addressing the distribution gap in adversarial training. The method, Distributional Adversarial Training (DAT), uses Diffusion LLMs to generate diverse and high-likelihood samples, which helps in covering the data distribution more comprehensively. This approach results in significantly improved adversarial robustness compared to previous methods.
研究旨在通过解决对抗训练中的分布差距,增强LLM的抗攻击能力。方法是使用扩散LLM生成多样且高概率的样本,从而更全面地覆盖数据分布。这使得对抗鲁棒性显著提高,优于以往的方法。
Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments
Authors: Yangjie Xu, Lujun Li, Lama Sleem, Niccolo Gentile, Yewei Song, Yiqun Wang, Siming Ji, Wenbo Wu, Radu State
First: 2026-02-18T17:52:17+00:00 · Latest: 2026-02-18T17:52:17+00:00
Abstract
The Agent Skill framework, now widely and officially supported by major players such as GitHub Copilot, LangChain, and OpenAI, performs especially well with proprietary models by improving context engineering, reducing hallucinations, and boosting task accuracy. Based on these observations, an investigation is conducted to determine whether the Agent Skill paradigm provides similar benefits to small language models (SLMs). This question matters in industrial scenarios where continuous reliance on public APIs is infeasible due to data-security and budget constraints, and where SLMs often show limited generalization in highly customized scenarios. This work introduces a formal mathematical definition of the Agent Skill process, followed by a systematic evaluation of language models of varying sizes across multiple use cases. The evaluation encompasses two open-source tasks and a real-world insurance claims data set. The results show that tiny models struggle with reliable skill selection, while moderately sized SLMs (approximately 12B-30B parameters) benefit substantially from the Agent Skill approach. Moreover, code-specialized variants at around 80B parameters achieve performance comparable to closed-source baselines while improving GPU efficiency. Collectively, these findings provide a comprehensive and nuanced characterization of the capabilities and constraints of the framework, while providing actionable insights for the effective deployment of Agent Skills in SLM-centered environments.
中文标题/摘要
标题:代理技能框架:小语言模型在工业环境中的潜力视角
代理技能框架现已被GitHub Copilot、LangChain和OpenAI等主要玩家广泛且正式支持,通过改进上下文工程、减少幻觉并提高任务准确性,特别适用于专有模型。基于这些观察,本文探讨代理技能范式是否也能为小语言模型(SLMs)提供类似的益处。这一问题在工业场景中尤为重要,因为连续依赖公共API由于数据安全和预算限制等原因不可行,且SLMs在高度定制的场景中通常表现出有限的泛化能力。本文引入了代理技能过程的正式数学定义,随后对不同大小的语言模型在多个应用场景中进行了系统评估。评估涵盖了两个开源任务和一个实际的保险索赔数据集。结果表明,小型模型在可靠技能选择方面存在困难,而中等大小的SLMs(约12B-30B参数)则显著受益于代理技能方法。此外,大约80B参数的代码专业化变体在性能上与闭源基准相当,同时提高了GPU效率。这些发现共同提供了框架能力与限制的全面而细致的描述,并为在以SLMs为中心的环境中有效部署代理技能提供了可操作的见解。
Summary / 总结
The research explores whether the Agent Skill framework, which improves context engineering and reduces hallucinations with proprietary models, can similarly benefit small language models (SLMs) in industrial settings. The study defines the Agent Skill process mathematically and evaluates language models of varying sizes across two open-source tasks and a real-world insurance claims data set. Results indicate that tiny models have difficulty with reliable skill selection, while moderately sized SLMs (approximately 12B-30B parameters) benefit substantially from the Agent Skill approach. Additionally, code-specialized variants at around 80B parameters achieve performance comparable to closed-source baselines while improving GPU efficiency.
研究探讨了Agent Skill框架是否能帮助小型语言模型(SLMs)在工业环境中,特别是在数据安全和预算限制使得依赖公共API不可行的情况下。研究引入了Agent Skill过程的正式定义,并在不同的应用场景中评估了不同规模的语言模型,包括开源任务和实际的保险索赔数据集。研究发现,小型模型在可靠技能选择方面存在困难,而大约12B-30B参数的中等规模SLMs显著受益于Agent Skill方法。此外,约80B参数的代码专业化变体在性能上与闭源基线相当,同时提高了GPU效率。
Retrieval Augmented Generation of Literature-derived Polymer Knowledge: The Example of a Biodegradable Polymer Expert System
Authors: Sonakshi Gupta, Akhlak Mahmood, Wei Xiong, Rampi Ramprasad
First: 2026-02-18T17:46:09+00:00 · Latest: 2026-02-18T17:46:09+00:00
Abstract
Polymer literature contains a large and growing body of experimental knowledge, yet much of it is buried in unstructured text and inconsistent terminology, making systematic retrieval and reasoning difficult. Existing tools typically extract narrow, study-specific facts in isolation, failing to preserve the cross-study context required to answer broader scientific questions. Retrieval-augmented generation (RAG) offers a promising way to overcome this limitation by combining large language models (LLMs) with external retrieval, but its effectiveness depends strongly on how domain knowledge is represented. In this work, we develop two retrieval pipelines: a dense semantic vector-based approach (VectorRAG) and a graph-based approach (GraphRAG). Using over 1,000 polyhydroxyalkanoate (PHA) papers, we construct context-preserving paragraph embeddings and a canonicalized structured knowledge graph supporting entity disambiguation and multi-hop reasoning. We evaluate these pipelines through standard retrieval metrics, comparisons with general state-of-the-art systems such as GPT and Gemini, and qualitative validation by a domain chemist. The results show that GraphRAG achieves higher precision and interpretability, while VectorRAG provides broader recall, highlighting complementary trade-offs. Expert validation further confirms that the tailored pipelines, particularly GraphRAG, produce well-grounded, citation-reliable responses with strong domain relevance. By grounding every statement in evidence, these systems enable researchers to navigate the literature, compare findings across studies, and uncover patterns that are difficult to extract manually. More broadly, this work establishes a practical framework for building materials science assistants using curated corpora and retrieval design, reducing reliance on proprietary models while enabling trustworthy literature analysis at scale.
中文标题/摘要
标题:基于检索增强生成的聚合物知识生成:以一种可生物降解聚合物专家系统为例
聚合物文献包含大量且不断增长的实验知识,但其中大部分被埋藏在无结构文本和不一致的术语中,使得系统检索和推理变得困难。现有工具通常孤立地提取狭窄的研究特定事实,未能保留回答更广泛科学问题所需的跨研究上下文。检索增强生成(RAG)通过结合大型语言模型(LLMs)和外部检索提供了一种有希望的方法来克服这一限制,但其有效性取决于领域知识的表示方式。在本研究中,我们开发了两种检索管道:基于密集语义向量的方法(VectorRAG)和基于图的方法(GraphRAG)。使用超过1,000篇聚羟基脂肪酸酯(PHA)论文,我们构建了保留上下文的段落嵌入和支持实体消歧和多跳推理的规范化结构知识图。我们通过标准检索指标、与通用的最新系统如GPT和Gemini的比较以及领域化学家的定性验证来评估这些管道。结果表明,GraphRAG在精确度和可解释性方面更高,而VectorRAG提供了更广泛的召回率,突显了互补的权衡。专家验证进一步证实,定制的管道,尤其是GraphRAG,生成了基于证据、引文可靠且具有强烈领域相关性的回答。通过将每个陈述都与证据联系起来,这些系统使研究人员能够导航文献、比较研究结果并在难以手动提取的模式中发现规律。更广泛地说,本研究建立了一种实用框架,用于使用精心策划的语料库和检索设计构建材料科学助手,减少了对专有模型的依赖,同时在大规模上实现了可信的文献分析。
Summary / 总结
This study addresses the challenge of systematically retrieving and reasoning about experimental knowledge in polymer literature, much of which is buried in unstructured text and inconsistent terminology. The authors developed two retrieval pipelines, VectorRAG and GraphRAG, to enhance the effectiveness of retrieval-augmented generation (RAG) for polymer knowledge. Using over 1,000 polyhydroxyalkanoate (PHA) papers, they constructed context-preserving paragraph embeddings and a canonicalized structured knowledge graph. Evaluation showed that GraphRAG achieved higher precision and interpretability, while VectorRAG provided broader recall. Validation by a domain chemist confirmed that these pipelines, particularly GraphRAG, produce well-grounded, citation-reliable responses with strong domain relevance, enabling researchers to navigate and analyze the literature more effectively.
该研究旨在解决聚合物文献中实验知识系统检索和推理的挑战,这些知识通常结构不完整且上下文不一致。作者开发了两种检索管道,VectorRAG和GraphRAG,以增强检索增强生成(RAG)的有效性。利用超过1,000篇聚羟基脂肪酸酯(PHA)论文,他们构建了上下文保留的嵌入和结构化知识图谱。评估结果显示,GraphRAG在精确度和可解释性方面表现更好,而VectorRAG提供了更广泛的召回率。专家验证进一步确认了这些管道生成的响应具有坚实的基础、引文可靠性且具有很强的领域相关性,有助于研究人员更有效地导航和分析文献。
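The precision versus recall trade-off reported for GraphRAG and VectorRAG rests on standard retrieval metrics. A minimal sketch of precision@k and recall@k is shown here; the document identifiers and relevance labels are hypothetical placeholders, not data from the paper.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one query.

    retrieved: ranked list of document ids returned by a pipeline.
    relevant:  set of document ids judged relevant for the query.
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked results and relevance judgments for a single query.
ranked = ["d1", "d2", "d3", "d4"]
gold = {"d1", "d3", "d5"}

print(precision_recall_at_k(ranked, gold, k=2))
print(precision_recall_at_k(ranked, gold, k=4))
```

A pipeline that returns few but well-targeted passages tends to score higher on precision, while one that returns a wider neighborhood of passages tends to score higher on recall, which is the complementary behavior the abstract attributes to the two pipelines.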
Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Authors: Subrit Dikshit
First: 2026-02-18T17:29:43+00:00 · Latest: 2026-02-18T17:29:43+00:00
Comments: 5 pages, 2 tables
Abstract
The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter counts (7B+) and cloud-based inference, rendering them inaccessible to practitioners in resource-constrained environments and posing significant data sovereignty risks. This paper introduces Quecto-V1, a domain-specific Small Language Model (SLM) engineered to democratize access to Indian legal intelligence. Built upon a custom configuration of the GPT-2 architecture (124 million parameters), Quecto-V1 was trained from scratch exclusively on a corpus of Indian statutes, including the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Constitution of India. Unlike generalist models, which prioritize broad world knowledge, our approach maximizes "lexical density" within the legal domain. Furthermore, we address the deployment bottleneck by applying post-training 8-bit quantization (GGUF format), compressing the model to a memory footprint of under 150 MB. Our empirical analysis demonstrates that Quecto-V1 achieves high fidelity in retrieving statutory definitions and penal provisions, outperforming general-purpose SLMs in domain-specific exact match tasks while running entirely offline on consumer-grade CPUs. We further present an ablation study showing that 8-bit quantization yields a 74% reduction in model size with less than 3.5% degradation in retrieval accuracy compared to full-precision baselines. These findings suggest that for specialized, high-stakes domains like law, domain-specific training coupled with aggressive quantization offers a viable, privacy-preserving alternative to monolithic cloud models.
中文标题/摘要
标题:Quecto-V1:8位量化小型语言模型在设备端法律检索中的实证分析
大型语言模型(LLMs)的迅速普及已经彻底改变了自然语言处理(NLP),但同时也造成了“资源鸿沟”。最先进的法律智能系统通常依赖于庞大的参数数量(7B+)和基于云的推理,这使得它们在资源受限的环境中难以获得,并且带来了重大的数据主权风险。本文介绍了Quecto-V1,这是一种针对特定领域的小型语言模型,旨在使印度法律智能的访问更加民主化。Quecto-V1基于自定义的GPT-2架构(1.24亿参数),从头开始仅在印度法典(包括印度刑法典、刑事诉讼法典和印度宪法)的语料库上进行训练。与侧重于广泛世界知识的一般模型不同,我们的方法在法律领域内最大化了“词汇密度”。此外,我们通过应用后训练的8位量化(GGUF格式)解决了部署瓶颈,将模型压缩到小于150 MB的内存占用。我们的实证分析表明,Quecto-V1在检索法典定义和处罚条款方面具有高度的准确性,优于通用的小型语言模型,在特定领域的精确匹配任务中表现更佳,且完全在消费级CPU上离线运行。我们还进行了消融研究,表明8位量化将模型大小减少了74%,与全精度基线相比,检索准确性下降不到3.5%。这些发现表明,在像法律这样的专业化和高风险领域,特定领域的训练结合激进的量化提供了一种可行的、保护隐私的替代方案,替代了庞大的云模型。
Summary / 总结
This paper introduces Quecto-V1, a 124 million parameter Small Language Model (SLM) tailored for Indian legal intelligence. Built on a custom GPT-2 configuration and trained from scratch exclusively on Indian statutes, Quecto-V1 uses post-training 8-bit quantization (GGUF format) to reduce its memory footprint to under 150 MB, enabling offline operation on consumer-grade CPUs. Empirical analysis shows Quecto-V1 outperforms general-purpose SLMs in domain-specific exact match tasks, and an ablation study reports a 74% reduction in model size with less than 3.5% degradation in retrieval accuracy compared to full-precision baselines.
本文介绍了Quecto-V1,这是一种专为资源受限环境设计的法律智能小语言模型。该模型基于1.24亿参数的GPT-2配置,并专门针对印度法律文本进行训练。通过8位量化,模型的大小被压缩到不足150 MB。实验证明,Quecto-V1在检索法律定义和条款方面优于通用模型,能够在消费级CPU上离线运行。研究还表明,8位量化将模型大小减少了74%,同时对检索准确性的影响不到3.5%。
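The size figures in the Quecto-V1 abstract can be sanity-checked with simple arithmetic. The sketch below assumes a 32-bit floating-point baseline and a block-wise symmetric 8-bit scheme in which every 32 weights share one 16-bit scale (8.5 bits per weight on average). Both are assumptions for illustration; the abstract states only "8-bit quantization (GGUF format)". The helper names are hypothetical.

```python
def quantize_block(weights):
    """Symmetric 8-bit quantization of one block of weights.

    Returns (scale, codes) with integer codes in [-127, 127].
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the scale and integer codes."""
    return [scale * c for c in codes]


def model_size_mb(num_params, bits_per_weight):
    """Storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


NUM_PARAMS = 124e6                            # parameter count from the abstract
fp32_mb = model_size_mb(NUM_PARAMS, 32)       # assumed full-precision baseline
int8_mb = model_size_mb(NUM_PARAMS, 8.5)      # assumed 8 bits + shared block scale
reduction = 1.0 - int8_mb / fp32_mb

scale, codes = quantize_block([0.5, -1.0, 0.25, 0.0])
restored = dequantize_block(scale, codes)
print(fp32_mb, int8_mb, reduction, restored)
```

Under these assumptions the weights shrink from about 496 MB to about 132 MB, a reduction of roughly 73 percent, which is in line with the reported 74% and the stated footprint of under 150 MB.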
AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models
Authors: Adib Sakhawat, Fardeen Sadab
First: 2026-02-18T17:28:28+00:00 · Latest: 2026-02-18T17:28:28+00:00
Comments: 15 pages, 5 figures, 11 tables. Includes appendix with detailed experimental results and prompts
Abstract
Evaluating the social intelligence of Large Language Models (LLMs) increasingly requires moving beyond static text generation toward dynamic, adversarial interaction. We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial resources. Using a round-robin tournament across frontier models, AREG enables joint evaluation of offensive (persuasion) and defensive (resistance) capabilities within a single interactional framework. Our analysis provides evidence that these capabilities are weakly correlated ($\rho = 0.33$) and empirically dissociated: strong persuasive performance does not reliably predict strong resistance, and vice versa. Across all evaluated models, resistance scores exceed persuasion scores, indicating a systematic defensive advantage in adversarial dialogue settings. Further linguistic analysis suggests that interaction structure plays a central role in these outcomes. Incremental commitment-seeking strategies are associated with higher extraction success, while verification-seeking responses are more prevalent in successful defenses than explicit refusal. Together, these findings indicate that social influence in LLMs is not a monolithic capability and that evaluation frameworks focusing on persuasion alone may overlook asymmetric behavioral vulnerabilities.
中文标题/摘要
标题:AREG:针对大型语言模型评估说服与抵抗能力的对抗资源提取游戏
评估大型语言模型(LLMs)的社会智能越来越需要从静态文本生成转向动态、对抗性的互动。我们引入了对抗资源提取游戏(AREG),这是一个将说服与抵抗操作化为多轮次、零和谈判的基准,涉及金融资源。通过在前沿模型之间进行循环赛,AREG能够在单一互动框架内同时评估进攻(说服)和防御(抵抗)能力。我们的分析提供了这些能力弱相关性的证据(ρ=0.33),并且实验证明它们是分离的:强大的说服表现并不可靠地预测强大的抵抗,反之亦然。在所有评估的模型中,抵抗得分超过说服得分,表明在对抗对话环境中存在系统性的防御优势。进一步的语言分析表明,互动结构在这些结果中起着核心作用。逐步寻求承诺的策略与更高的提取成功率相关,而验证寻求的回应在成功的防御中比明确拒绝更为常见。这些发现表明,LLMs中的社会影响不是单一的能力,并且仅关注说服的评估框架可能忽视不对称的行为漏洞。
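The AREG abstract reports a weak correlation between persuasion and resistance scores measured in a round-robin tournament. A minimal sketch of both pieces follows, assuming that the reported coefficient is Spearman's rank correlation (the abstract does not say which coefficient is used). The model names and score lists are hypothetical placeholders.

```python
import math
from itertools import permutations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def round_robin(models):
    """Every ordered (persuader, resister) pairing of distinct models."""
    return list(permutations(models, 2))


# Hypothetical per-model scores, for illustration only.
persuasion = [1, 2, 3, 4, 5]
resistance = [2, 1, 4, 3, 5]
print(spearman(persuasion, resistance))
print(round_robin(["model_a", "model_b", "model_c"]))
```

Ordered pairings matter in this setting because each model has to be scored once in the persuader role and once in the resister role against every opponent.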
Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes
Authors: Ethan Blaser, Jiuqi Wang, Shangtong Zhang
First: 2026-02-18T17:24:27+00:00 · Latest: 2026-02-18T17:24:27+00:00
Abstract
The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in learning rates tied to state visit counts, which practitioners do not use and which does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.
中文标题/摘要
标题:平均奖励马尔可夫决策过程的微分时差学习几乎必然收敛
平均奖励是强化学习(RL)中一个基本的性能指标,侧重于代理的长期性能。微分时差(TD)学习算法是平均奖励RL的一个重要进展,它们提供了一种在有策略和无策略设置中高效在线学习与平均奖励相关的价值函数的方法。然而,现有的收敛保证需要与状态访问次数相关的局部时钟来绑定学习率,这与实践者使用的不同,并且不适用于表格设置之外的情况。我们通过证明使用标准递减学习率的有策略$n$步微分TD在没有局部时钟的情况下几乎必然收敛来解决这一限制。然后,我们推导出三个充分条件,在这些条件下,无策略$n$步微分TD在没有局部时钟的情况下也收敛。这些结果加强了微分TD的理论基础,并使其收敛分析更接近实际实现。
Summary / 总结
This paper addresses the convergence of differential temporal difference (TD) learning algorithms for average reward Markov decision processes. The authors prove the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock, and derive three sufficient conditions for off-policy $n$-step differential TD to converge without a local clock as well. This work strengthens the theoretical foundations of differential TD and brings its convergence analysis closer to practical implementations.
该论文解决了差分时差(TD)学习算法在平均奖励马尔可夫决策过程中的收敛性问题。作者证明了使用标准递减学习率时,任何$n$步的差分TD在策略内收敛,并推导出三个充分条件,使得在策略外的$n$步差分TD无需局部时钟也能收敛。这项工作加强了差分TD的理论基础,并使其收敛分析更接近实际应用。
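To make the setting concrete, here is a minimal tabular sketch of on-policy $n$-step differential TD on a two-state Markov reward process. The learning rate is a function of the global step counter only, with no per-state visit counts (no local clock). The update rule follows the commonly used differential TD form and may differ in detail from the variant analyzed in the paper. The chain, the rewards, the step-size exponent, and `eta` are illustrative assumptions.

```python
import random


def n_step_differential_td(P, R, n=3, steps=60000, eta=0.5, seed=0):
    """On-policy n-step differential TD with a global diminishing step size.

    P[s][s2] is the transition probability under the fixed policy and R[s] is
    the reward received when leaving state s. Returns (values, r_bar), where
    r_bar is the running estimate of the average reward.
    """
    rng = random.Random(seed)
    values = [0.0] * len(P)
    r_bar = 0.0
    state = 0
    states = [state]   # sliding window S_t, ..., S_{t+n}
    rewards = []       # sliding window R_{t+1}, ..., R_{t+n}
    for t in range(steps):
        reward = R[state]
        # Sample the next state from row P[state].
        u, cumulative, next_state = rng.random(), 0.0, len(P) - 1
        for candidate, prob in enumerate(P[state]):
            cumulative += prob
            if u < cumulative:
                next_state = candidate
                break
        rewards.append(reward)
        states.append(next_state)
        state = next_state
        if len(rewards) == n:
            # Global step size: depends on t only, not on state visit counts.
            alpha = (t + 1) ** -0.7
            delta = (sum(rw - r_bar for rw in rewards)
                     + values[states[-1]] - values[states[0]])
            values[states[0]] += alpha * delta
            r_bar += eta * alpha * delta
            states.pop(0)
            rewards.pop(0)
    return values, r_bar


# Two-state chain with stationary distribution (2/3, 1/3).
P = [[0.9, 0.1], [0.2, 0.8]]
R = [1.0, 0.0]
values, r_bar = n_step_differential_td(P, R)
print(values, r_bar)
```

For this chain the true average reward is 2/3, so the printed estimate of `r_bar` should settle near that value. The step size $(t+1)^{-0.7}$ is one example of a standard diminishing schedule whose sum diverges while the sum of its squares converges.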
View Invariant Learning for Vision-Language Navigation in Continuous Environments
Authors: Josh Qixuan Sun, Xiaoying Xing, Huaiyuan Weng, Chul Min Yeum, Mark Crowley
First: 2025-07-05T18:04:35+00:00 · Latest: 2026-02-18T17:20:08+00:00
Comments: This paper is accepted to RA-L 2026
Abstract
Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent's observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components, thus eliminating the cost for individual module training. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method.
中文标题/摘要
标题:连续环境中的视点不变学习在视觉-语言导航中的应用
连续环境中的视觉-语言导航(VLNCE),其中代理遵循指令并自由移动以到达目的地,是具身人工智能中的关键研究问题。然而,大多数导航策略对视点变化敏感,即相机高度和视角变化导致的代理观察变化。本文引入了一种通用场景,V2-VLNCE(具有变化视点的VLNCE),并提出了一种视点不变后训练策略VIL(视点不变学习),以增强现有导航策略对相机视点变化的鲁棒性。VIL采用对比学习框架学习稀疏且视点不变的特征。此外,我们还引入了一种教师-学生框架,用于大多数VLNCE基线的核心组件航点预测模块,其中视点依赖的教师模型将知识提炼到视点不变的学生模型中。我们采用端到端的训练范式联合优化这些组件,从而消除了单独模块训练的成本。实验结果表明,在两个标准基准数据集R2R-CE和RxR-CE上,以成功率衡量,我们的方法在V2-VLNCE上比最先进的方法高出8-15%。此外,我们在标准VLNCE设置下评估了VIL,发现尽管它被训练用于变化的视点,但它通常仍然能提高性能。在更具挑战性的RxR-CE数据集上,与其他无地图方法相比,我们的方法在所有指标上也达到了最先进的性能。这表明添加VIL不会削弱标准视点性能,并且可以作为即插即用的后训练方法。
Summary / 总结
This paper addresses the issue of viewpoint sensitivity in Vision-Language Navigation in Continuous Environments (VLNCE) by introducing the V2-VLNCE scenario (VLNCE with Varied Viewpoints) and proposing VIL (View Invariant Learning), a view-invariant post-training strategy. VIL uses a contrastive learning framework to learn sparse and view-invariant features and includes a teacher-student framework for the Waypoint Predictor Module, with all components trained end to end. On V2-VLNCE, the method outperforms state-of-the-art approaches by 8-15% in Success Rate on the R2R-CE and RxR-CE benchmark datasets. Under the standard VLNCE setting it often still improves performance, and on the more challenging RxR-CE dataset it achieves state-of-the-art results across all metrics among map-free methods, suggesting that VIL can serve as a plug-and-play post-training method without diminishing standard viewpoint performance.
论文通过引入VIL(视点不变学习)策略,解决视点敏感性问题,该策略增强现有导航策略在连续环境中的鲁棒性。VIL使用对比学习框架学习视点不变特征,并使用教师-学生框架优化航点预测模块。实验结果表明,VIL在两个基准数据集上的成功率提高了8-15%,并在RxR-CE数据集上达到最佳性能,表明它可以在标准VLNCE设置中有效应用而不降低性能。
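The abstract says that VIL learns view-invariant features with a contrastive framework but does not name the objective. As an illustration only, the sketch below uses an InfoNCE-style loss, a common contrastive objective that is swapped in here as an assumption: features of the same scene under two camera viewpoints form a positive pair, and features of other scenes in the batch act as negatives. The feature values and the temperature are hypothetical, and the sparsity aspect of VIL is not modeled.

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE-style loss over a batch of paired features.

    view_a[i] and view_b[i] are features of the SAME scene seen from two
    viewpoints (positive pair); view_b[j] for j != i serve as negatives.
    """
    a = [_normalize(v) for v in view_a]
    b = [_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Hypothetical features: two scenes, each observed from two viewpoints.
standard_view = [[1.0, 0.0], [0.0, 1.0]]
varied_view = [[0.9, 0.1], [0.1, 0.9]]

print(info_nce(standard_view, varied_view))        # matched pairs
print(info_nce(standard_view, varied_view[::-1]))  # mismatched pairs
```

The loss is lower when each scene's features stay close across viewpoints than when the pairing is scrambled, which is the property a view-invariant encoder is trained to achieve.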