arXiv Paper Digest

2026-03-11 03:42
Snapshot: 20260311_0342
FVG-PT: Adaptive Foreground View-Guided Prompt Tuning for Vision-Language Models
Authors: Haoyang Li, Liang Wang, Siyu Zhou, Jiacheng Sun, Jing Jiang, Chao Wang, Guodong Long, Yan Peng
First: 2026-03-09T17:59:18+00:00 · Latest: 2026-03-09T17:59:18+00:00
Comments: 27 Pages, 9 Figures, 15 Tables
Abstract
CLIP-based prompt tuning enables pretrained Vision-Language Models (VLMs) to efficiently adapt to downstream tasks. Although existing studies have made significant progress, they pay limited attention to changes in the internal attention representations of VLMs during the tuning process. In this paper, we attribute the failure modes of prompt tuning predictions to shifts in foreground attention of the visual encoder, and propose Foreground View-Guided Prompt Tuning (FVG-PT), an adaptive plug-and-play foreground attention guidance module, to alleviate the shifts. Concretely, FVG-PT introduces a learnable Foreground Reliability Gate to automatically enhance the foreground view quality, applies a Foreground Distillation Compensation module to guide visual attention toward the foreground, and further introduces a Prior Calibration module to mitigate generalization degradation caused by excessive focus on the foreground. Experiments on multiple backbone models and datasets show the effectiveness and compatibility of FVG-PT. Code is available at: https://github.com/JREion/FVG-PT
中文标题/摘要
标题:FVG-PT:视觉语言模型自适应前景视图引导提示调优
基于CLIP的提示调优使预训练的视觉语言模型(VLMs)能够高效地适应下游任务。尽管现有研究取得了显著进展,但它们在调优过程中VLMs内部注意力表示的变化方面关注有限。本文将提示调优预测的失败模式归因于视觉编码器前景注意力的变化,并提出前景视图引导提示调优(FVG-PT),这是一种自适应即插即用的前景注意力引导模块,以缓解这种变化。具体而言,FVG-PT引入了一个可学习的前景可靠性门控以自动增强前景视图质量,应用前景蒸馏补偿模块以引导视觉注意力朝向前景,并进一步引入先验校准模块以减轻过度关注前景导致的一般化退化。在多个骨干模型和数据集上的实验显示了FVG-PT的有效性和兼容性。代码可在:https://github.com/JREion/FVG-PT
Summary / 总结
This paper addresses the limitations of existing prompt tuning methods for Vision-Language Models (VLMs) by focusing on the shifts in foreground attention during the tuning process. It introduces FVG-PT, which includes a Foreground Reliability Gate, a Foreground Distillation Compensation module, and a Prior Calibration module to improve the quality of foreground views and guide visual attention. Experiments demonstrate the effectiveness and compatibility of FVG-PT across multiple backbone models and datasets.
论文针对现有视觉-语言模型(VLMs)提示调优方法中存在的前景注意力转移问题,提出了FVG-PT模块,包括前景可靠性门控、前景蒸馏补偿模块和先验校准模块,以提升前景视图质量、引导视觉注意力并缓解过度关注前景导致的一般化下降问题。实验结果显示FVG-PT在多个模型和数据集上具有有效性和兼容性。
Agentic Critical Training
Authors: Weize Liu, Minghui Liu, Sy-Tuyen Ho, Souradip Chakraborty, Xiyao Wang, Furong Huang
First: 2026-03-09T17:58:56+00:00 · Latest: 2026-03-09T17:58:56+00:00
Comments: Project page: https://attention-is-all-i-need.github.io/ACT/
Abstract
Training large language models (LLMs) as autonomous agents often begins with imitation learning, but it only teaches agents what to do without understanding why: agents never contrast successful actions against suboptimal alternatives and thus lack awareness of action quality. Recent approaches attempt to address this by introducing self-reflection supervision derived from contrasts between expert and alternative actions. However, the training paradigm fundamentally remains imitation learning: the model imitates pre-constructed reflection text rather than learning to reason autonomously. We propose Agentic Critical Training (ACT), a reinforcement learning paradigm that trains agents to identify the better action among alternatives. By rewarding whether the model's judgment is correct, ACT drives the model to autonomously develop reasoning about action quality, producing genuine self-reflection rather than imitating it. Across three challenging agent benchmarks, ACT consistently improves agent performance when combined with different post-training methods. It achieves an average improvement of 5.07 points over imitation learning and 4.62 points over reinforcement learning. Compared to approaches that inject reflection capability through knowledge distillation, ACT also demonstrates clear advantages, yielding an average improvement of 2.42 points. Moreover, ACT enables strong out-of-distribution generalization on agentic benchmarks and improves performance on general reasoning benchmarks without any reasoning-specific training data, highlighting the value of our method. These results suggest that ACT is a promising path toward developing more reflective and capable LLM agents.
中文标题/摘要
标题:代理批判性训练
将大型语言模型(LLMs)训练为自主代理通常始于模仿学习,但这种方法仅教会代理做什么而不理解为什么:代理从未将成功的行为与次优替代行为进行对比,因此缺乏对行为质量的认识。最近的方法试图通过引入源自专家行为与替代行为对比的自我反思监督来解决这一问题。然而,训练范式本质上仍然是模仿学习:模型模仿预先构建的反思文本,而不是学习自主推理。我们提出了代理批判性训练(ACT),这是一种强化学习范式,训练代理识别替代行为中的更好行为。通过奖励模型判断是否正确,ACT 促使模型自主发展关于行为质量的推理,产生真正的自我反思,而不是模仿它。在三个具有挑战性的代理基准测试中,ACT 在结合不同后训练方法时,始终提高了代理性能。它在模仿学习上的平均改进为 5.07 分,在强化学习上的平均改进为 4.62 分。与通过知识蒸馏注入反思能力的方法相比,ACT 也表现出明显的优势,平均改进为 2.42 分。此外,ACT 在代理基准测试中实现了强大的分布外泛化,并在没有特定推理训练数据的情况下提高了通用推理基准测试的性能,突显了我们方法的价值。这些结果表明,ACT 是开发更具反思能力和能力的 LLM 代理的一个有希望的方向。
Summary / 总结
Imitation learning teaches LLM agents what to do but not why, so the agents lack awareness of action quality. The proposed Agentic Critical Training (ACT) is a reinforcement learning paradigm that rewards the model for correctly identifying the better action among alternatives, which drives it to develop its own reasoning about action quality. Across three agent benchmarks, ACT improved performance by an average of 5.07 points over imitation learning, 4.62 points over reinforcement learning, and 2.42 points over approaches that inject reflection through knowledge distillation. It also showed strong out-of-distribution generalization and improved general reasoning performance without reasoning-specific training data.
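The work is motivated by the goal of having large language models (LLMs) develop their own reasoning about action quality rather than merely imitate pre-constructed reflection text. The main method, Agentic Critical Training (ACT), is a reinforcement learning approach that rewards correct judgments in order to train agents to identify the better action among several options. In the key experiments, ACT raised agent performance by an average of 5.07 points over imitation learning and 4.62 points over reinforcement learning, while also strengthening out-of-distribution generalization and general reasoning performance without reasoning-specific training data.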
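The reward signal described in the abstract (rewarding whether the model's judgment is correct) can be illustrated with a minimal sketch. It assumes a pairwise comparison between an expert action and one alternative. The function names `build_comparison` and `judgment_reward`, the example actions, and the order shuffling are all assumptions made for illustration, not details taken from the paper.

```python
import random

def build_comparison(expert_action, alternative_action, rng):
    """Return (options, index_of_expert) with the presentation order randomized.

    Shuffling is an assumption here: it keeps a model from learning
    "the first option is always better".
    """
    options = [expert_action, alternative_action]
    rng.shuffle(options)
    return options, options.index(expert_action)

def judgment_reward(chosen_index, better_index):
    """Binary reward: 1.0 if the model picked the better action, else 0.0."""
    return 1.0 if chosen_index == better_index else 0.0

rng = random.Random(0)
options, better = build_comparison("click(search_button)", "scroll(down)", rng)

reward_correct = judgment_reward(better, better)      # model chose the expert action
reward_wrong = judgment_reward(1 - better, better)    # model chose the alternative
print(options, better, reward_correct, reward_wrong)
```

Because the reward depends only on the final choice, any reasoning the model writes before choosing is shaped indirectly, which is the contrast with imitating reflection text that the abstract draws.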
Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines
Authors: Akshay Gulati, Kanha Singhania, Tushar Banga, Parth Arora, Anshul Verma, Vaibhav Kumar Singh, Agyapal Digra, Jayant Singh Bisht, Danish Sharma, Varun Singla, Shubh Garg
First: 2026-03-09T17:58:54+00:00 · Latest: 2026-03-09T17:58:54+00:00
Comments: 12 pages, 6 Figures, 5 Tables
Abstract
Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework designed to assess financial analysis capabilities across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. We evaluate five AI systems: GPT, Gemini, Perplexity, Claude, and SuperInvesting, using a dataset of 95+ structured financial analysis questions derived from real-world equity research tasks. The results reveal substantial differences in performance across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness score of 56.65/70, while also demonstrating the lowest hallucination rate among evaluated systems. Retrieval-oriented systems such as Perplexity perform strongly on data recency tasks due to live information access but exhibit weaker analytical synthesis and consistency. Overall, the results highlight that financial intelligence in large language models is inherently multi-dimensional, and systems that combine structured financial data access with analytical reasoning capabilities provide the most reliable performance for complex investment research workflows.
中文标题/摘要
标题:大型语言模型的金融智能评估:SuperInvesting AI与LLM引擎的基准测试
大型语言模型在金融分析和投资研究中的应用日益增多,但对其金融推理能力的系统性评估仍然有限。本文介绍了AI金融智能基准(AFIB),这是一个多维度评估框架,旨在从五个维度评估金融分析能力:事实准确性、分析完整性、数据时效性、模型一致性以及失败模式。我们使用95多个结构化的金融分析问题数据集,评估了五个AI系统:GPT、Gemini、Perplexity、Claude和SuperInvesting。结果表明,不同模型在性能上存在显著差异。在基准测试环境中,SuperInvesting表现出最高的综合性能,平均事实准确性得分为8.96/10,完整性得分为56.65/70,同时在评估的系统中表现出最低的幻觉率。以检索为导向的系统如Perplexity在数据时效性任务中表现出色,因为它们可以访问实时信息,但在分析综合和一致性方面较弱。总体而言,结果表明,大型语言模型中的金融智能是多维度的,结合结构化金融数据访问与分析推理能力的系统在复杂的投资研究工作流程中提供了最可靠的性能。
Summary / 总结
This study introduces the AI Financial Intelligence Benchmark (AFIB) to evaluate the financial reasoning capabilities of large language models across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. Five AI systems (GPT, Gemini, Perplexity, Claude, and SuperInvesting) are assessed using a dataset of 95+ structured financial analysis questions. Within this benchmark, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10, a completeness score of 56.65/70, and the lowest hallucination rate among the evaluated systems. Retrieval-oriented systems like Perplexity excel in data recency but show weaker analytical synthesis and consistency.
本文引入了AI金融智能基准(AFIB),以评估大型语言模型在事实准确性、分析完整性、数据时效性、模型一致性及故障模式等五个维度上的金融推理能力。通过对GPT、Gemini、Perplexity、Claude和SuperInvesting五个AI系统的评估,结果显示SuperInvesting在事实准确性上得分8.96/10,在完整性上得分56.65/70,并且具有最低的幻觉率。检索导向的系统如Perplexity在数据时效性方面表现优异,但在分析综合和一致性方面表现较弱。
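The two headline scores are reported on different scales (out of 10 and out of 70), so a quick normalization makes them easier to compare. The snippet below is only arithmetic on the figures quoted in the abstract; the helper name `normalized` is an assumption, and nothing here reflects how AFIB aggregates scores internally.

```python
def normalized(score, maximum):
    """Scale a raw score to the 0-1 range."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return score / maximum

accuracy = normalized(8.96, 10)        # factual accuracy, scored out of 10
completeness = normalized(56.65, 70)   # completeness, scored out of 70

# prints: accuracy=89.6%, completeness=80.9%
print(f"accuracy={accuracy:.1%}, completeness={completeness:.1%}")
```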
DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy
Authors: Sungjae Park, Homanga Bharadhwaj, Shubham Tulsiani
Venue: ICRA 2026
First: 2025-06-25T17:59:01+00:00 · Latest: 2026-03-09T17:58:20+00:00
Comments: 11 pages. Published at ICRA 2026
Abstract
We propose DemoDiffusion, a simple method for enabling robots to perform manipulation tasks by imitating a single human demonstration, without requiring task-specific training or paired human-robot data. Our approach is based on two insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Unlike approaches based on online reinforcement learning or paired human-robot data, our method enables robust adaptation to new tasks and scenes with minimal effort. In real-world experiments across 8 diverse manipulation tasks, DemoDiffusion achieves 83.8% average success rate, compared to 13.8% for the pre-trained policy and 52.5% for kinematic retargeting, succeeding even on tasks where the pre-trained generalist policy fails entirely. Project page: https://demodiffusion.github.io/
中文标题/摘要
标题:DemoDiffusion:使用预训练扩散策略的一次性人类模仿
我们提出了一种名为DemoDiffusion的简单方法,使机器人能够通过模仿单一个人类演示来执行操作任务,而无需进行特定任务的训练或人类-机器人配对数据。我们的方法基于两个洞察。首先,人类演示中的手部运动为机器人的末端执行器轨迹提供了一个有用的先验,我们可以通过运动学重定位将其转换为粗略的开环机器人运动轨迹。其次,虽然重定位的运动捕捉了任务的整体结构,但在上下文中的可能的机器人动作中可能并不对齐。为了解决这个问题,我们利用预训练的通用扩散策略来修改轨迹,确保它既遵循人类运动,又保持在可能的机器人动作分布内。与基于在线强化学习或人类-机器人配对数据的方法不同,我们的方法能够通过最少的努力实现对新任务和场景的稳健适应。在8个不同操作任务的实地实验中,DemoDiffusion的平均成功率达到了83.8%,而预训练策略仅为13.8%,运动学重定位为52.5%,甚至在预训练通用策略完全失败的任务中也成功了。项目页面:https://demodiffusion.github.io/
Summary / 总结
DemoDiffusion is a method that allows robots to imitate a single human demonstration for manipulation tasks without task-specific training. It uses kinematic retargeting to convert human hand motions into rough robot trajectories and then modifies these trajectories using a pre-trained diffusion policy to ensure they are both human-like and feasible for the robot. In experiments across 8 diverse tasks, DemoDiffusion achieved an 83.8% success rate, significantly outperforming both the pre-trained policy (13.8%) and kinematic retargeting (52.5%).
DemoDiffusion 是一种方法,可以让机器人通过单次人类演示来执行操作任务,无需特定任务的训练。它使用运动重定位将人类手部动作转换为粗略的机器人轨迹,然后通过预训练的扩散策略进行调整,以确保这些轨迹既符合人类动作又对机器人可行。在8个不同操作任务的实验中,DemoDiffusion 的成功率达到了83.8%,远高于预训练策略(13.8%)和运动重定位(52.5%)。
Offline-First Large Language Model Architecture for AI-Assisted Learning with Adaptive Response Levels in Low-Connectivity Environments
Authors: Joseph Walusimbi, Ann Move Oguti, Joshua Benjamin Ssentongo, Keith Ainebyona
First: 2026-02-14T09:53:40+00:00 · Latest: 2026-03-09T17:55:52+00:00
Comments: 16 pages, 2 tables, 10 figures
Abstract
Artificial intelligence (AI) and large language models (LLMs) are transforming educational technology by enabling conversational tutoring, personalized explanations, and inquiry-driven learning. However, most AI-based learning systems rely on continuous internet connectivity and cloud-based computation, limiting their use in bandwidth-constrained environments. This paper presents an offline-first large language model architecture designed for AI-assisted learning in low-connectivity settings. The system performs all inference locally using quantized language models and incorporates hardware-aware model selection to enable deployment on low-specification CPU-only devices. By removing dependence on cloud infrastructure, the system provides curriculum-aligned explanations and structured academic support through natural-language interaction. To support learners at different educational stages, the system includes adaptive response levels that generate explanations at varying levels of complexity: Simple English, Lower Secondary, Upper Secondary, and Technical. This allows explanations to be adjusted to student ability, improving clarity and understanding of academic concepts. The system was deployed in selected secondary and tertiary institutions under limited-connectivity conditions and evaluated across technical performance, usability, perceived response quality, and educational impact. Results show stable operation on legacy hardware, acceptable response times, and positive user perceptions regarding support for self-directed learning. These findings demonstrate the feasibility of offline large language model deployment for AI-assisted education in low-connectivity environments.
中文标题/摘要
标题:面向低连接环境的离线优先大型语言模型架构:适应性响应级别辅助人工智能辅助学习
人工智能(AI)和大型语言模型(LLMs)正在通过实现对话式辅导、个性化解释和探究式学习来改变教育技术。然而,大多数基于AI的学习系统依赖于持续的互联网连接和基于云的计算,限制了它们在带宽受限环境中的使用。本文提出了一种离线优先的大型语言模型架构,旨在在低连接性环境中进行AI辅助学习。该系统使用量化语言模型在本地执行所有推理,并结合硬件感知模型选择,以支持低配置CPU设备的部署。通过消除对云基础设施的依赖,该系统通过自然语言交互提供与课程对齐的解释和结构化的学术支持。为了支持不同教育阶段的学习者,该系统包括适应性响应级别,生成不同复杂度级别的解释:简单英语、初中、高中和技术。这使得解释可以根据学生的能力进行调整,从而提高学术概念的理解。该系统在选定的中等和高等教育机构在有限连接条件下部署,并在技术性能、易用性、感知响应质量和教育影响方面进行了评估。结果显示,该系统在旧硬件上稳定运行,响应时间可接受,并且用户对支持自主学习的支持持积极态度。这些发现表明,在低连接性环境中部署离线大型语言模型进行AI辅助教育的可行性。
Summary / 总结
This paper introduces an offline-first large language model architecture aimed at AI-assisted learning in low-connectivity environments. The system uses quantized language models for local inference and hardware-aware model selection for deployment on low-specification devices. It includes adaptive response levels to provide explanations at varying complexity levels, from Simple English to Technical. Deployed in educational institutions, the system showed stable operation, acceptable response times, and positive user feedback on self-directed learning support.
该论文介绍了一种面向低连接性环境的离线优先大型语言模型架构,用于AI辅助学习。该系统使用量化模型进行本地推理,并支持针对低规格设备的硬件感知模型选择。它包括不同复杂度级别的适应性响应级别,从简单英语到技术级。在教育机构部署后,该系统显示了稳定运行、可接受的响应时间和用户对支持自主学习的积极反馈,证明了在低连接性环境中离线部署LLM的可行性。
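The two mechanisms named in the abstract, hardware-aware model selection and adaptive response levels, can be sketched as follows. The four level names come from the abstract; everything else (the prompt wording, the model catalog, the RAM thresholds, and the function names `select_model` and `build_prompt`) is hypothetical and only meant to show the shape of the idea.

```python
from dataclasses import dataclass

# The four levels are named in the abstract; the instruction wording is assumed.
LEVEL_INSTRUCTIONS = {
    "Simple English": "Explain in short sentences with everyday words.",
    "Lower Secondary": "Explain for a lower secondary student.",
    "Upper Secondary": "Explain for an upper secondary student.",
    "Technical": "Explain precisely, using technical terminology.",
}

@dataclass(frozen=True)
class QuantizedModel:
    name: str          # hypothetical model identifier
    min_ram_gb: float  # assumed memory needed for CPU-only inference

# Hypothetical catalog, ordered from largest to smallest.
CATALOG = [
    QuantizedModel("large-q4", 8.0),
    QuantizedModel("medium-q4", 4.0),
    QuantizedModel("small-q4", 2.0),
]

def select_model(available_ram_gb):
    """Pick the largest catalog model that fits in the available RAM."""
    for model in CATALOG:
        if model.min_ram_gb <= available_ram_gb:
            return model
    raise RuntimeError("no model fits on this device")

def build_prompt(question, level):
    """Prefix the question with the instruction for the chosen level."""
    return f"{LEVEL_INSTRUCTIONS[level]}\n\nQuestion: {question}"

model = select_model(available_ram_gb=4.0)
prompt = build_prompt("What is photosynthesis?", "Simple English")
print(model.name)
print(prompt)
```

Selecting the model once at startup and varying only the prompt per request keeps the per-question cost constant, which seems a natural fit for low-specification devices.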
Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
Authors: Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter, Dan Roth
Venue: ICLR 2026
First: 2025-10-02T17:57:05+00:00 · Latest: 2026-03-09T17:54:56+00:00
Comments: Accepted at ICLR 2026
Abstract
Despite recent rapid progress in AI safety, current large language models remain vulnerable to adversarial attacks in multi-turn interaction settings, where attackers strategically adapt their prompts across conversation turns and pose a more critical yet realistic challenge. Existing approaches that discover safety vulnerabilities either rely on manual red-teaming with human experts or employ automated methods using pre-defined templates and human-curated attack data, with most focusing on single-turn attacks. However, these methods did not explore the vast space of possible multi-turn attacks, failing to consider novel attack trajectories that emerge from complex dialogue dynamics and strategic conversation planning. This gap is particularly critical given recent findings that LLMs exhibit significantly higher vulnerability to multi-turn attacks compared to single-turn attacks. We propose DialTree, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies by treating the dialogue as a sequential decision-making problem, enabling systematic exploration without manually curated data. Through extensive experiments, our approach not only achieves more than 44.2% higher ASR across 12 target models compared to previous state-of-the-art approaches, but also effectively uncovers new attack strategies by learning optimal dialogue policies that maximize attack success across multiple turns.
中文标题/摘要
标题:基于树的对话强化策略优化以应对红队攻击
尽管人工智能安全领域取得了快速进展,当前的大语言模型在多轮交互场景中仍然容易受到对手的攻击,攻击者会战略性地调整其提示,这构成了更为关键且现实的挑战。现有的发现安全漏洞的方法要么依赖于人工红队测试,要么使用预定义模板和人工收集的攻击数据进行自动化方法,大多数方法集中在单轮攻击上。然而,这些方法并未探索可能的多轮攻击空间,未能考虑从复杂对话动态和战略性对话规划中产生的新型攻击轨迹。鉴于最近的研究发现,LLMs在多轮攻击中的脆弱性远高于单轮攻击,这一差距尤为重要。我们提出了DialTree,这是一种结合树搜索的在线策略强化学习框架,能够自主发现多样化的多轮攻击策略,将对话视为顺序决策问题,从而实现系统性探索,无需手动收集的数据。通过广泛的实验,我们的方法不仅在12个目标模型上实现了比之前最先进的方法高出44.2%以上的攻击成功率,还通过学习最大化多轮攻击成功率的最优对话策略,有效发现了新的攻击策略。
Summary / 总结
This paper addresses the vulnerability of large language models to multi-turn adversarial attacks, which are more realistic and challenging than single-turn attacks. It introduces DialTree, an on-policy reinforcement learning framework integrated with tree search that autonomously discovers diverse multi-turn attack strategies without manually curated data. Experiments show that DialTree achieves more than 44.2% higher attack success rate (ASR) across 12 target models than previous state-of-the-art approaches and uncovers new attack strategies through learned dialogue policies.
研究旨在解决大型语言模型在多轮交互设置中对攻击者的适应性提示攻击的脆弱性问题。提出的DialTree框架结合了基于策略的强化学习和树搜索技术,自主发现多轮攻击策略。实验结果显示,DialTree在12个目标模型上的攻击成功率(ASR)比之前最先进的方法高出超过44.2%,并通过学习最优对话策略有效发现了新的攻击策略。
AgentIR: Reasoning-Aware Retrieval for Deep Research Agents
Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, Victor Zhong
First: 2026-03-04T18:47:26+00:00 · Latest: 2026-03-09T17:53:48+00:00
Abstract
Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50% with conventional embedding models twice its size, and 37% with BM25. Code and data are available at: https://texttron.github.io/AgentIR/.
中文标题/摘要
标题:AgentIR:具备推理意识的检索方法以促进深度研究代理
深度研究代理正迅速成为现代检索系统的主要消费者。与人类用户在不记录其中间思维过程的情况下提出并细化查询不同,深度研究代理在每次搜索调用前都会生成明确的自然语言推理,揭示出现有检索器完全忽略的丰富意图和上下文信息。为了利用这一被忽视的信号,我们引入了:(1) 具备推理意识的检索,这是一种检索范式,将代理的推理轨迹与查询一起联合嵌入;(2) DR-Synth,一种从标准问答数据集中生成深度研究检索训练数据的方法。我们证明了这两个组件各自有效,它们的结合产生了训练嵌入模型AgentIR-4B,取得了显著的提升。在具有挑战性的BrowseComp-Plus基准测试中,使用开放权重代理Tongyi-DeepResearch的AgentIR-4B达到了68%的准确率,而传统的两倍大小的嵌入模型仅为50%,BM25仅为37%。代码和数据可在:https://texttron.github.io/AgentIR/ 获取。
Summary / 总结
The research introduces Reasoning-Aware Retrieval, a retrieval paradigm that embeds the reasoning trace of Deep Research agents jointly with their queries to improve search accuracy. It also presents DR-Synth, a method for generating training data for Deep Research retrievers from standard QA datasets. Combining the two yields the embedding model AgentIR-4B, which reaches 68% accuracy on the BrowseComp-Plus benchmark with the Tongyi-DeepResearch agent, compared to 50% for conventional embedding models twice its size and 37% for BM25.
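The research aims to improve retrieval systems by incorporating the explicit reasoning of Deep Research agents. It introduces Reasoning-Aware Retrieval, which jointly embeds the agent's reasoning trace with its query, and DR-Synth, a method for generating training data from standard QA datasets. Combining these components substantially improves retrieval: AgentIR-4B reaches 68% accuracy on the BrowseComp-Plus benchmark, 18 percentage points above conventional embedding models twice its size (50%) and 31 percentage points above BM25 (37%).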
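To make the idea of embedding the reasoning trace together with the query concrete, here is a toy sketch. It assumes the simplest possible form of joint embedding, plain concatenation of reasoning and query, and uses a bag-of-words vector as a stand-in for a trained encoder such as AgentIR-4B. The documents, the query, and the reasoning text are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (a stand-in for a trained encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query_text, docs):
    """Return the index of the document most similar to the query text."""
    q = embed(query_text)
    scores = [cosine(q, embed(d)) for d in docs]
    return max(range(len(docs)), key=scores.__getitem__)

docs = [
    "jaguar car engine specifications and top speed",
    "jaguar habitat in the amazon rainforest and diet",
]
query = "jaguar speed"
reasoning = "the question is about the animal so I need its rainforest habitat and diet"

query_only = retrieve(query, docs)                        # ambiguous query alone
reasoning_aware = retrieve(f"{reasoning} {query}", docs)  # reasoning + query
print(query_only, reasoning_aware)
```

In this toy case the bare query matches the car document, while the reasoning trace carries the intent needed to select the animal document.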
Linear probes rely on textual evidence: Results from leakage mitigation studies in language models
Authors: Gerard Boxo, Aman Neelappa, Shivam Raval
First: 2025-09-16T19:09:27+00:00 · Latest: 2026-03-09T17:52:49+00:00
Comments: 33 pages, 22 figures
Abstract
White-box monitors are a popular technique for detecting potentially harmful behaviours in language models. While they perform well in general, their effectiveness in detecting text-ambiguous behaviour is disputed. In this work, we find evidence that removing textual evidence of a behaviour significantly decreases probe performance. The AUROC reduction ranges from 10 to 30 points depending on the setting. We evaluate probe monitors across three setups (Sandbagging, Sycophancy, and Bias), finding that when probes rely on textual evidence of the target behaviour (such as system prompts or CoT reasoning), performance degrades once these tokens are filtered. This filtering procedure is standard practice for output monitor evaluation. As further evidence of this phenomenon, we train Model Organisms which produce outputs without any behaviour verbalisations. We validate that probe performance on Model Organisms is substantially lower than unfiltered evaluations: 0.57 vs 0.74 AUROC for Bias, and 0.57 vs 0.94 AUROC for Sandbagging. Our findings suggest that linear probes may be brittle in scenarios where they must detect non-surface-level patterns.
中文标题/摘要
标题:线性探针依赖于文本证据:语言模型中泄漏缓解研究的结果
白盒监控是检测语言模型潜在有害行为的一种流行技术。尽管它们在一般情况下表现良好,但它们在检测文本模糊行为方面的有效性受到质疑。在本研究中,我们发现移除行为的文本证据会显著降低探针性能。AUROC减少幅度在10到30点之间,具体取决于设置。我们评估了三种设置(沙袋攻击、阿谀奉承和偏见)下的探针监控,发现当探针依赖于目标行为的文本证据(如系统提示或CoT推理)时,一旦过滤这些标记,性能就会下降。这种过滤程序是输出监控评估的标准做法。为进一步证明这一现象,我们训练了模型有机体,这些有机体在没有任何行为表述的情况下生成输出。我们验证了模型有机体上的探针性能显著低于未过滤评估:偏见为0.57 vs 0.74 AUROC,沙袋攻击为0.57 vs 0.94 AUROC。我们的研究结果表明,在探针必须检测非表面模式的场景中,线性探针可能较为脆弱。
Summary / 总结
This study investigates the effectiveness of white-box monitors in detecting text-ambiguous behaviors in language models. By removing textual evidence of a behavior, probe performance significantly decreases, with AUROC reductions ranging from 10 to 30 points. The study evaluates probe monitors across three setups (Sandbagging, Sycophancy, and Bias) and finds that when probes rely on textual evidence, their performance drops when these tokens are filtered. Additionally, training Model Organisms without any behavior verbalizations shows that probe performance is substantially lower, indicating that linear probes may be brittle in detecting non-surface-level patterns.
本研究探讨了白盒监控在检测语言模型中的文本模糊行为时的有效性。通过移除行为的文本证据,线性探针的性能显著下降,AUROC减少幅度在10到30点之间。研究在三个设置(Sandbagging、Sycophancy和Bias)中评估了探针,并发现当探针依赖于文本证据时,过滤这些标记后其性能会下降。此外,训练不包含行为口头表述的模型有机体表明,探针的性能显著降低,表明线性探针在检测非表面模式时可能较为脆弱。
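For readers less familiar with the metric, the sketch below computes AUROC from its rank definition (the probability that a positive example scores higher than a negative one) and converts the AUROC figures quoted in the abstract into point drops. The toy scores are invented; only the 0.74, 0.94, and 0.57 values come from the abstract, and the helper names are assumptions.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def drop_in_points(unfiltered, filtered):
    """AUROC drop expressed in points (1 point = 0.01 AUROC)."""
    return round((unfiltered - filtered) * 100, 1)

# Hypothetical probe scores, not data from the paper: 3 of 4 pairs are ordered correctly.
toy = auroc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])

# Figures quoted in the abstract (unfiltered evaluation vs Model Organisms).
bias_drop = drop_in_points(0.74, 0.57)
sandbagging_drop = drop_in_points(0.94, 0.57)
print(toy, bias_drop, sandbagging_drop)
```

An AUROC of 0.5 corresponds to chance, so a probe at 0.57 is only slightly better than random ordering.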
ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation
Authors: Nanjun Li, Pinqi Cheng, Zean Liu, Minghe Tian, Xuanyin Wang
First: 2026-03-09T17:49:46+00:00 · Latest: 2026-03-09T17:49:46+00:00
Abstract
Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance degradation. To address this issue, we propose a keypoint-driven learning paradigm that elevates pose estimation to a primary prediction objective. Specifically, we remove bounding-box prediction and redesign the prediction head to better accommodate the high-dimensional structured representations for pose estimation. We further introduce a keypoint-driven dynamic sample assignment strategy to align training objectives with pose evaluation metrics, enabling dense supervision during training and efficient NMS-free inference. In addition, we propose a smooth OKS-based loss function to stabilize optimization in regression-based pose estimation. Based on these designs, we develop a single-stage multi-person pose estimation framework, termed ER-Pose. On MS COCO and CrowdPose, ER-Pose-n achieves AP improvements of 3.2/6.7 without pre-training and 7.4/4.9 with pre-training respectively compared with the baseline YOLO-Pose. These improvements are achieved with fewer parameters and higher inference efficiency.
中文标题/摘要
标题:ER-Pose:重新思考基于关键点的表示学习以实现实时人体姿态估计
单阶段多人姿态估计旨在在一个统一框架内同时进行人体定位和关键点预测,从而在推理效率和架构简洁性方面具有优势。因此,多尺度实时检测架构,如YOLO类模型,广泛用于实时姿态估计。然而,这些方法通常继承了从目标检测中继承的框驱动建模范式,在训练过程中,姿态估计隐式地受到边界框监督的约束。这种形式引入了样本分配和特征表示的偏差,导致任务不匹配,最终限制了姿态估计的准确性。在本文中,我们从关键点驱动的角度重新审视了框驱动的单阶段姿态估计,并将并行目标之间的语义冲突识别为性能下降的关键来源。为了解决这一问题,我们提出了一种基于关键点的训练范式,将姿态估计提升为主要预测目标。具体来说,我们移除了边界框预测,并重新设计了预测头,以更好地适应姿态估计所需的高维结构化表示。我们还引入了一种基于关键点的动态样本分配策略,使训练目标与姿态评估指标对齐,在训练期间提供密集监督,并在推理时实现高效的NMS-free。此外,我们提出了一种平滑的OKS损失函数,以在基于回归的姿态估计中稳定优化。基于这些设计,我们开发了一种单阶段多人姿态估计框架,称为ER-Pose。在MS COCO和CrowdPose上,与基线YOLO-Pose相比,ER-Pose-n分别在无预训练情况下实现了3.2/6.7的AP改进,在有预训练情况下实现了7.4/4.9的AP改进。这些改进是在参数更少和更高的推理效率下实现的。
Summary / 总结
This paper addresses the limitations of box-driven single-stage pose estimation methods by proposing ER-Pose, a keypoint-driven framework. It removes bounding-box prediction and redesigns the prediction head to better fit pose estimation needs. The authors introduce a keypoint-driven dynamic sample assignment strategy, which enables NMS-free inference, and a smooth OKS-based loss function that stabilizes optimization. On MS COCO and CrowdPose, ER-Pose-n improves AP over the YOLO-Pose baseline by 3.2/6.7 without pre-training and 7.4/4.9 with pre-training, with fewer parameters and higher inference efficiency.
ER-Pose 通过移除边界框预测并引入关键点驱动的动态样本分配策略,重新思考实时人体姿态估计中的关键点驱动表示学习。这种方法在不预训练的情况下,分别在MS COCO和CrowdPose上提高了3.2/6.7的AP,在预训练的情况下分别提高了7.4/4.9,同时使用更少的参数并保持更高的推理效率。
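Since the abstract builds its loss on OKS, a short sketch of the standard COCO-style Object Keypoint Similarity may help. For labeled keypoints it is the mean of exp(-d_i^2 / (2 s^2 k_i^2)), where d_i is the distance between predicted and ground-truth keypoint i, s^2 is the object area, and k_i is a per-keypoint constant. The `oks_loss` function below is the plain 1 - OKS form, shown as an assumption for illustration; it is not the paper's smoothed variant, whose definition is not given in the abstract. The coordinates and constants are made up.

```python
import math

def oks(pred, gt, visibility, area, k):
    """COCO-style Object Keypoint Similarity for one person.

    pred, gt: lists of (x, y) keypoints; visibility: >0 means labeled;
    area: object scale s**2; k: per-keypoint constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visibility, k):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        count += 1
    return total / count if count else 0.0

def oks_loss(pred, gt, visibility, area, k):
    """Plain 1 - OKS regression loss (not the paper's smoothed variant)."""
    return 1.0 - oks(pred, gt, visibility, area, k)

gt = [(10.0, 10.0), (20.0, 20.0)]
k = [0.05, 0.05]  # illustrative constants, not the official COCO values
perfect = oks(gt, gt, [2, 2], area=400.0, k=k)
shifted = oks([(11.0, 10.0), (20.0, 20.0)], gt, [2, 2], area=400.0, k=k)
print(perfect, shifted)
```

Because OKS is also what the AP metric is computed from, supervising on it directly is one way to align the training objective with the evaluation metric, which is the alignment the abstract argues for.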
Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? A Study of Hierarchical Gating and Calibration
Authors: Víctor Yeste, Paolo Rosso
First: 2026-01-31T21:50:35+00:00 · Latest: 2026-03-09T17:41:24+00:00
Comments: Code: https://github.com/VictorMYeste/human-value-detection, models: https://huggingface.co/papers/2602.00913, 27 pages, 4 figures
Abstract
Human value detection from single sentences is a sparse, imbalanced multi-label task. We study whether Schwartz higher-order (HO) categories help this setting on ValueEval'24 / ValuesML (74K English sentences) under a compute-frugal budget. Rather than proposing a new architecture, we compare direct supervised transformers, hard HO$\rightarrow$values pipelines, Presence$\rightarrow$HO$\rightarrow$values cascades, compact instruction-tuned large language models (LLMs), QLoRA, and low-cost upgrades such as threshold tuning and small ensembles. HO categories are learnable: the easiest bipolar pair, Growth vs. Self-Protection, reaches Macro-$F_1=0.58$. The most reliable gains come from calibration and ensembling: threshold tuning improves Social Focus vs. Personal Focus from $0.41$ to $0.57$ ($+0.16$), transformer soft voting lifts Growth from $0.286$ to $0.303$, and a Transformer+LLM hybrid reaches $0.353$ on Self-Protection. In contrast, hard hierarchical gating does not consistently improve the end task. Compact LLMs also underperform supervised encoders as stand-alone systems, although they sometimes add useful diversity in hybrid ensembles. Under this benchmark, the HO structure is more useful as an inductive bias than as a rigid routing rule.
中文标题/摘要
标题:施瓦茨高层次价值观是否有助于句子级人类价值观检测?一项关于层次门控和校准的研究
从单句中检测人类价值观是一项稀疏、不平衡的多标签任务。我们研究施瓦茨高层次(HO)类别是否有助于ValueEval'24 / ValuesML(74000条英语句子)的设置,预算有限。我们没有提出新的架构,而是比较了直接监督的变压器、硬HO→价值观管道、存在→HO→价值观级联、紧凑指令调优的大语言模型(LLMs)、QLoRA以及低成本升级如阈值调整和小型集成。HO类别是可学习的:最容易的二极对立,成长 vs. 自我保护,达到宏-$F_1=0.58$。最可靠的增益来自校准和集成:阈值调整将社会焦点 vs. 个人焦点从$0.41$提高到$0.57$(+0.16),变压器软投票将成长从$0.286$提升到$0.303$,而Transformer+LLM混合体在自我保护上达到$0.353$。相比之下,硬层次门控并不一致地改善最终任务。紧凑的LLMs作为独立系统也逊色于监督编码器,尽管它们有时在混合集成中增加有用多样性。在此基准下,HO结构作为归纳偏置比作为刚性路由规则更有用。
Summary / 总结
This study investigates the utility of Schwartz higher-order (HO) categories in improving sentence-level human value detection, a sparse and imbalanced multi-label task. Various methods including direct supervised transformers, HO-to-values pipelines, and compact instruction-tuned large language models were compared. The study found that calibration and ensembling techniques provided the most reliable gains, with threshold tuning improving Macro-$F_1$ scores and transformer soft voting enhancing the detection of certain values. Hard hierarchical gating did not consistently improve the task, and compact LLMs underperformed as standalone systems but contributed useful diversity in hybrid ensembles.
研究探讨了Schwartz更高阶(HO)类别是否能提升ValueEval'24 / ValuesML数据集上的单句人类价值观检测。比较了包括监督变压器、HO到价值观管道、紧凑指令调优的大语言模型等多种方法。关键发现包括HO类别是可学习的,最简单的二极对立对达到了Macro-$F_1$为0.58。校准和集成是最可靠的改进方法,而硬层次门控并不一致地提升任务效果。紧凑的大语言模型作为独立系统表现不佳,但在混合集成中可以提供有用多样性。
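Threshold tuning, which the abstract reports as one of the most reliable gains, can be sketched in a few lines: instead of a fixed 0.5 cutoff, each label gets the cutoff that maximizes its F1, and Macro-F1 is the unweighted mean over labels. The data, grid, and function names below are hypothetical; in practice the thresholds would be tuned on a validation split and then applied to held-out test data, whereas this toy tunes and scores on the same data for brevity.

```python
def f1(y_true, y_pred):
    """Binary F1 = 2TP / (2TP + FP + FN); 0.0 when there is nothing to score."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_threshold(y_true, probs, grid):
    """Return the threshold from `grid` with the best F1 on the given data."""
    return max(grid, key=lambda t: f1(y_true, [int(p >= t) for p in probs]))

def macro_f1(per_label_true, per_label_probs, thresholds):
    """Unweighted mean of per-label F1 scores."""
    scores = [
        f1(y, [int(p >= t) for p in probs])
        for y, probs, t in zip(per_label_true, per_label_probs, thresholds)
    ]
    return sum(scores) / len(scores)

# Hypothetical data for two sparse labels (not from the paper).
y = [[1, 0, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0]]
p = [[0.45, 0.10, 0.20, 0.60, 0.05, 0.30], [0.10, 0.20, 0.35, 0.05, 0.15, 0.10]]
grid = [0.1, 0.2, 0.3, 0.4, 0.5]

default = macro_f1(y, p, [0.5, 0.5])
tuned_thresholds = [tune_threshold(yt, pr, grid) for yt, pr in zip(y, p)]
tuned = macro_f1(y, p, tuned_thresholds)
print(default, tuned_thresholds, tuned)
```

On sparse labels a model's probabilities for positives often sit below 0.5, which is why lowering the cutoff per label can recover positives that a fixed threshold misses.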
How Far Can Unsupervised RLVR Scale LLM Training?
Authors: Bingxiang He, Yuxin Zuo, Zeyuan Liu, Shangziqi Zhao, Zixuan Fu, Junlin Yang, Cheng Qian, Kaiyan Zhang, Yuchen Fan, Ganqu Cui, Xiusi Chen, Youbang Sun, Xingtai Lv, Xuekai Zhu, Li Sheng, Ran Li, Huan-ang Gao, Yuchen Zhang, Bowen Zhou, Zhiyuan Liu, Ning Ding
Venue: ICLR 2026
First: 2026-03-09T17:38:11+00:00 · Latest: 2026-03-09T17:38:11+00:00
Comments: Accepted to ICLR 2026
Abstract
Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground truth labels. Recent works leverage model intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory and extensive experiments. We first classify URLVR methods into intrinsic versus external based on reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when misaligned. Through systematic experiments, we show intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by model prior rather than engineering choices. Despite these scaling limits, we find intrinsic rewards remain valuable in test-time training on small datasets, and propose Model Collapse Step to measure model prior, serving as a practical indicator for RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence they may escape the confidence-correctness ceiling. Our findings chart boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.
中文标题/摘要
标题:无监督RLVR如何扩展LLM训练的规模?
无监督强化学习带可验证奖励(URLVR)提供了一种超越监督瓶颈、通过无需真实标签即可推导奖励来扩展LLM训练规模的途径。近期研究利用模型固有信号,显示出早期的积极成果,但其潜力和局限性尚不明确。本文重新审视URLVR,并提供涵盖分类学、理论和大量实验的全面分析。我们首先根据奖励来源将URLVR方法分为固有和外部两类,然后建立统一的理论框架,揭示所有固有方法均趋向于细化模型的初始分布。这种细化机制在初始置信度与正确性一致时成功,但在不一致时会灾难性地失败。通过系统实验,我们展示了固有奖励在方法间始终遵循先上升后下降的趋势,崩溃时间由模型先验决定而非工程选择。尽管存在这些扩展限制,我们发现固有奖励在小数据集的测试时训练中仍然有价值,并提出模型崩溃步来衡量模型先验,作为RL可训练性的实用指标。最后,我们探索基于计算不对称性进行验证的外部奖励方法,初步证据表明它们可能能够摆脱置信度-正确性天花板。我们的发现为固有URLVR设定了边界,同时激励寻找可扩展的替代方案。
Summary / 总结
This work explores the scalability of unsupervised reinforcement learning with verifiable rewards (URLVR) for large language model (LLM) training. It classifies URLVR methods into intrinsic and external categories, establishes a theoretical framework showing intrinsic methods sharpen the model's initial distribution, and finds that intrinsic rewards follow a rise-then-fall pattern, with collapse timing determined by model prior. The study also proposes a Model Collapse Step to measure model prior and suggests external reward methods may offer a way to overcome the confidence-correctness ceiling.
该研究探讨了无监督强化学习与可验证奖励(URLVR)方法在大规模语言模型(LLM)训练中的可扩展性。研究将URLVR方法分为基于内在和外部奖励来源两类,并建立了一个理论框架,表明内在方法会强化模型的初始分布。研究发现,内在奖励遵循先上升后下降的模式,崩溃的时间由模型的先验决定而非工程选择。此外,研究提出了一种模型崩溃步长来衡量模型先验,可以作为强化学习可训练性的指标。研究还表明,基于计算不对称性的外部奖励方法可能能够克服内在方法的局限性。
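The sharpening behaviour described in the URLVR abstract can be illustrated with a small sketch. It assumes majority voting over sampled answers as the intrinsic signal; the abstract does not name a specific intrinsic reward, so this choice and the function name `majority_vote_rewards` are illustrative assumptions, not the paper's method.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Assumed intrinsic reward: 1.0 for samples that agree with the
    most frequent answer in the group, 0.0 otherwise. No ground-truth
    label is used, so the reward only reinforces what the model already
    prefers (sharpening of its initial distribution)."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# Confidence aligned with correctness (true answer "4"): the reward helps.
aligned = majority_vote_rewards(["4", "4", "5", "4"])

# Confidence misaligned (true answer still "4"): the wrong answer "7"
# is rewarded, which is the failure case the abstract warns about.
misaligned = majority_vote_rewards(["7", "7", "4"])
```

In the misaligned group the only correct sample receives zero reward, which is one way to picture why such rewards can rise and then collapse.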
CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning
Authors: Siye Wu, Jian Xie, Yikai Zhang, Yanghua Xiao
First: 2026-03-09T17:37:15+00:00 · Latest: 2026-03-09T17:37:15+00:00
Abstract
The emergence of large reasoning models demonstrates that scaling inference-time compute significantly enhances performance on complex tasks. However, it often falls into another trap: overthinking simple problems, where repetitive rationales yield minimal accuracy gains at a disproportionately high cost. This motivates adaptive reasoning: dynamically aligning reasoning depth with instance difficulty. In this paper, we study adaptive reasoning from an optimality perspective, formalizing it as a utility maximization problem where tokens are allocated until the marginal accuracy gain falls below the incremental cost. Based on this, we propose CODA (Compute Allocation by Difficulty Awareness), a method that operationalizes this principle by allocating tokens via a policy-internal difficulty signal. Specifically, CODA estimates difficulty via group-based rollouts and maps it to two non-negative gates that modulate a length-dependent shaping term on top of the binary base reward. The easy-side gate penalizes verbosity on simple instances, whereas the hard-side gate encourages more deliberative rollouts on challenging ones. Across model scales and benchmarks, CODA achieves adaptive reasoning without external annotations or user-provided budgets: on easy tasks, CODA reduces token costs by over 60% while maintaining strong accuracy, whereas on hard tasks it incentivizes more deliberative rollouts to maximize performance.
中文标题/摘要
标题:CODA:基于难度的计算分配以实现自适应推理
大型推理模型的出现表明,在推理时增加计算量可以显著提高复杂任务的性能。然而,这往往又陷入了另一个陷阱:在简单问题上过度思考,重复的推理过程虽然增加了成本,但带来的准确度提升却微乎其微。这促使了自适应推理的发展:动态地将推理深度与实例难度对齐。在本文中,我们从最优性的角度研究自适应推理,将其形式化为一个在边际准确度增益低于增量成本时停止分配标记的效用最大化问题。基于此,我们提出了CODA(基于难度感知的计算分配),这是一种通过策略内部的难度信号来分配标记的方法。具体而言,CODA通过基于组的展开估计难度,并将其映射到二元基奖励之上一个长度依赖的调节项的两个非负门控上。在简单实例上,易侧门控惩罚冗余;在困难实例上,难侧门控鼓励更审慎的展开。在不同模型规模和基准测试中,CODA在无需外部注释或用户提供的预算的情况下实现了自适应推理:在简单任务上,CODA将标记成本降低了超过60%的同时保持了强大的准确度;而在困难任务上,它激励更审慎的展开以最大化性能。
Summary / 总结
This paper addresses the issue of overthinking simple problems in large reasoning models by proposing CODA, a method for adaptive reasoning. CODA dynamically allocates compute resources based on instance difficulty, aiming to maximize utility by balancing accuracy gains and computational costs. The method uses a policy-internal difficulty signal to adjust token allocation, with gates that penalize verbosity on simple tasks and encourage more deliberative reasoning on complex ones. Experimental results show that CODA reduces token costs by over 60% on easy tasks while maintaining strong accuracy, and it enhances performance on hard tasks by promoting more thoughtful reasoning.
本文提出了一种名为CODA的方法,通过动态调整计算资源来应对大型推理模型在处理简单问题时过度推理的问题。CODA根据实例的难度分配计算资源,以最大化准确性和计算成本之间的平衡。该方法利用内部策略中的难度信号来调整令牌分配,对简单任务施加惩罚以减少冗余,对复杂任务则鼓励更深入的推理。实验结果表明,CODA在简单任务上可以将令牌成本降低超过60%的同时保持高准确性,并在复杂任务上通过促进更深入的推理来提升性能。
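The gating idea in the CODA abstract (a group-based difficulty estimate mapped to two non-negative gates that modulate a length-dependent shaping term on top of a binary base reward) can be sketched as follows. The 0.5 threshold, the linear gate shapes, the linear length term, and all function names are assumptions made for illustration; the abstract does not give the actual formulas.

```python
def difficulty_gates(group_rewards):
    """Estimate difficulty from a group of binary rollout rewards and
    map it to two non-negative gates (assumed linear form)."""
    solve_rate = sum(group_rewards) / len(group_rewards)
    difficulty = 1.0 - solve_rate
    easy_gate = 2.0 * max(0.0, solve_rate - 0.5)   # active on easy prompts
    hard_gate = 2.0 * max(0.0, difficulty - 0.5)   # active on hard prompts
    return easy_gate, hard_gate

def shaped_reward(base_reward, length, max_length, easy_gate, hard_gate):
    """Binary base reward plus an assumed length-dependent shaping term:
    the easy-side gate penalizes long rollouts, the hard-side gate
    rewards them."""
    rel_len = length / max_length
    return base_reward - easy_gate * rel_len + hard_gate * rel_len

# An easy prompt (all rollouts correct) discounts a long answer.
easy, hard = difficulty_gates([1, 1, 1, 1])
r_easy = shaped_reward(1.0, 500, 1000, easy, hard)

# A hard prompt (no rollout correct) rewards extra deliberation.
easy2, hard2 = difficulty_gates([0, 0, 0, 0])
r_hard = shaped_reward(1.0, 500, 1000, easy2, hard2)
```

With a 50% solve rate both gates are zero in this sketch, so the shaped reward falls back to the binary base reward.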
Context-free Self-Conditioned GAN for Trajectory Forecasting
Authors: Tiago Rodrigues de Almeida, Eduardo Gutierrez Maestro, Oscar Martinez Mozos
First: 2026-03-09T17:37:03+00:00 · Latest: 2026-03-09T17:37:03+00:00
Comments: Accepted at the 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA)
Abstract
In this paper, we present a context-free unsupervised approach based on a self-conditioned GAN to learn different modes from 2D trajectories. Our intuition is that each mode indicates a different behavioral moving pattern in the discriminator's feature space. We apply this approach to the problem of trajectory forecasting. We present three different training settings based on self-conditioned GAN, which produce better forecasters. We test our method in two data sets: human motion and road agents. Experimental results show that our approach outperforms previous context-free methods in the least representative supervised labels while performing well in the remaining labels. In addition, our approach outperforms globally in human motion, while performing well in road agents.
中文标题/摘要
标题:基于自条件GAN的无条件轨迹预测
在本文中,我们提出了一种基于自条件GAN的无监督方法,用于从2D轨迹中学习不同的模式。我们的直觉是,每个模式在判别器特征空间中表示不同的行为移动模式。我们将这种方法应用于轨迹预测问题。我们提出了三种基于自条件GAN的不同训练设置,这些设置产生了更好的预测器。我们在两个数据集:人体运动和道路代理上测试了我们的方法。实验结果表明,与之前的无条件方法相比,我们的方法在最不具代表性的监督标签上表现更优,而在其他标签上表现良好。此外,在人体运动方面,我们的方法表现最佳,而在道路代理方面表现良好。
Summary / 总结
The paper proposes a context-free unsupervised method using a self-conditioned GAN to forecast trajectories by learning different behavioral patterns. Three training settings are explored, showing improved forecasting performance. The method outperforms previous context-free approaches on the least representative supervised labels, achieves the best overall results on the human motion data set, and performs well on the road agents data set.
该论文提出了一种基于自条件GAN的无上下文无监督方法,通过学习不同的行为模式来预测轨迹。探索了三种训练设置,显示出改进的预测性能。该方法在最少代表性的标签上优于先前的无上下文方法,并在人类运动数据集上表现出色,在道路代理数据集上表现良好。
From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
Authors: Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, Tomás Lozano-Pérez, Leslie Pack Kaelbling
First: 2024-12-31T06:14:16+00:00 · Latest: 2026-03-09T17:35:57+00:00
Comments: A version of this paper appears in the official proceedings of RA-L, Volume 11, Issue 4
Abstract
Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
中文标题/摘要
标题:从像素到谓词:通过预训练的视觉-语言模型学习符号世界模型
我们的目标是在给定低级技能和少量短时间 horizon 的演示(包含一系列图像序列)的情况下,学习解决复杂机器人领域中的长期决策问题。为此,我们专注于学习抽象的符号世界模型,这些模型能够通过规划实现零样本泛化到新的目标。此类模型的关键组成部分是定义对象属性及其之间关系的符号谓词集。在本工作中,我们利用预训练的视觉-语言模型(VLMs)提出一组可能适用于决策的视觉谓词,并直接从相机图像中评估这些谓词。在训练时,我们将提出的谓词和演示传递给基于优化的模型学习算法,以获得一个用提出的谓词子集定义的抽象符号世界模型。在测试时,给定一个新的目标和新的环境设置,我们使用VLM构建当前世界状态的符号描述,然后使用基于搜索的规划算法找到实现目标的一系列低级技能。我们通过在仿真和真实世界中的实验演示,证明我们的方法可以积极泛化,将其学习到的世界模型应用于解决各种对象类型、排列方式、对象数量和视觉背景广泛变化的问题,以及新的目标和远超训练时的更长时间 horizon 的问题。
Summary / 总结
The research aims to develop a method for solving long-term decision-making problems in complex robotics environments using low-level skills and a few demonstrations. The approach leverages pretrained vision-language models to propose a large set of visual predicates and evaluates them from camera images. During training, an optimization-based algorithm selects a compact subset of these predicates to form an abstract symbolic world model. At test time, the model is used to plan sequences of low-level skills to achieve novel goals. Experiments show that the method can generalize effectively to various object types, settings, and goals, surpassing the training data in terms of problem complexity and horizon length.
研究旨在利用低级技能和少量演示,通过预训练的视觉-语言模型学习符号化的世界模型,以解决复杂机器人环境中的长期决策问题,并通过规划实现零样本泛化到新目标。关键发现表明,该方法能够在各种物体类型、环境和时间跨度下有效泛化,展示了其在仿真和真实世界中的能力。
OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
Authors: Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen
First: 2026-03-09T17:34:53+00:00 · Latest: 2026-03-09T17:34:53+00:00
Comments: 24 pages, 16 figures. Introduces the OfficeQA Pro benchmark for grounded reasoning over enterprise documents
Abstract
We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.
中文标题/摘要
标题:OfficeQA Pro:企业级端到端 grounded 推理基准
我们介绍了 OfficeQA Pro,这是一个用于评估 AI 代理在大型异构文档语料库上进行 grounded 多文档推理的基准。语料库包括近 100 年的美国国库公告,共计 89,000 页和超过 2600 万个数值。OfficeQA Pro 包含 133 个问题,需要对无结构文本和表格数据进行精确的文档解析、检索和分析推理。包括 Claude Opus 4.6、GPT-5.4 和 Gemini 3.1 Pro 预览版在内的前沿大语言模型在仅依赖参数化知识时,在 OfficeQA Pro 上的准确率低于 5%,在额外访问网络时低于 12%。即使直接提供文档语料库,前沿代理在超过一半的问题上仍然难以应对,平均得分为 34.1%。我们发现,通过 Databricks 的 ai_parse_document 生成的结构化文档表示为代理提供了 16.1% 的平均相对性能提升。我们还进行了额外的消融实验,研究模型选择、表格表示、检索策略和测试时缩放对性能的影响。尽管这些改进,但在企业级 grounded 推理方面仍存在显著的提升空间。
Summary / 总结
OfficeQA Pro is a benchmark for evaluating AI agents on grounded reasoning over a large and diverse document corpus consisting of U.S. Treasury Bulletins. The benchmark includes 133 questions that require precise document parsing, retrieval, and analytical reasoning. Frontier LLMs achieve less than 5% accuracy when relying on parametric knowledge and less than 12% with additional web access. Even with direct access to the document corpus, agents struggle with over half of the questions, achieving an average score of 34.1%. Providing a structured document representation yields a 16.1% average relative performance gain across agents, but significant headroom remains for enterprise-grade grounded reasoning.
OfficeQA Pro 是一个基准,用于评估 AI 系统在大型和多样化的文档库(包括美国财政部公告)上的接地推理能力。基准包括 133 个问题,要求精确的文档解析、检索和分析推理。代理在没有网络访问的情况下仅能达到不到 5% 的准确率,在获得文档库的情况下也只有 34.1% 的准确率。结构化的文档表示将平均性能提高了 16.1%。尽管有所改进,但在企业级接地推理方面仍存在显著的挑战。
CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation
Authors: Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou, Hongbo Peng, Dingming Li, Yuhong Dai, ZePeng Lin, Juanxi Tian, Yi Zhou, Siqi Dai, Jingwei Wu
First: 2026-03-09T17:31:16+00:00 · Latest: 2026-03-09T17:31:16+00:00
Comments: 21 pages, 7 figures, 7 tables
Abstract
Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene, which is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines this draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset containing structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23% over direct generation, while also outperforming other generation methods empowered by CoT. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo
中文标题/摘要
标题:CoCo:代码作为CoT的文本到图像预览和稀有概念生成
统一多模态模型(UMMs)的最新进展显著提升了文本到图像(T2I)生成,特别是在通过链式思考(CoT)推理的集成方面。然而,现有的基于CoT的T2I方法主要依赖于抽象的自然语言规划,这在复杂的空间布局、结构化视觉元素和密集文本内容方面缺乏精确性。在本工作中,我们提出了CoCo(Code-as-CoT),这是一种代码驱动的推理框架,将推理过程表示为可执行代码,从而实现显式的中间规划,以生成图像。给定一个文本提示,CoCo首先生成指定场景结构布局的可执行代码,然后在沙盒环境中执行以渲染确定性草图图像。模型随后通过精细的图像编辑对草图进行细化,以生成最终的高保真结果。为了支持这种训练范式,我们构建了CoCo-10K数据集,包含结构化的草图-最终图像对,旨在教授结构化草图构建和纠正性视觉细化。在StructT2IBench、OneIG-Bench和LongText-Bench上的实证评估表明,CoCo在直接生成上的改进分别为+68.83%、+54.8%和+41.23%,同时在由CoT赋能的其他生成方法中也表现出色。这些结果表明,可执行代码是精确、可控和结构化文本到图像生成的有效且可靠的推理范式。代码可在:https://github.com/micky-li-hd/CoCo
Summary / 总结
The research aims to enhance the precision of text-to-image generation by integrating executable code as a reasoning process, addressing the limitations of abstract natural-language planning. CoCo, a code-driven reasoning framework, generates executable code to specify the structural layout of the scene, which is then executed in a sandboxed environment to produce a deterministic draft image. The model refines this draft through fine-grained image editing to achieve high-fidelity results. Experiments on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo improves over direct generation by +68.83%, +54.8%, and +41.23%, respectively, while also outperforming other CoT-based generation methods, highlighting the effectiveness of executable code in structured text-to-image generation.
研究旨在通过集成可执行代码作为推理过程来提高文本到图像生成的精度,解决现有方法中抽象自然语言规划的局限性。CoCo 是一种代码驱动的推理框架,生成可执行代码以指定场景的结构布局,然后执行以生成确定性草图图像。模型进一步细化此草图以获得高质量的结果。在 StructT2IBench、OneIG-Bench 和 LongText-Bench 上的实验表明,CoCo 在直接生成和其它基于 CoT 的方法上分别提高了 +68.83%、+54.8% 和 +41.23%,突显了可执行代码在结构化文本到图像生成中的有效性。
Group Entropies and Mirror Duality: A Class of Flexible Mirror Descent Updates for Machine Learning
Authors: Andrzej Cichocki, Piergiulio Tempesta
First: 2026-03-09T17:31:03+00:00 · Latest: 2026-03-09T17:31:03+00:00
Comments: 36 pages, 5 figures
Abstract
We introduce a comprehensive theoretical and algorithmic framework that bridges formal group theory and group entropies with modern machine learning, paving the way for an infinite, flexible family of Mirror Descent (MD) optimization algorithms. Our approach exploits the rich structure of group entropies, which are generalized entropic functionals governed by group composition laws, encompassing and significantly extending all trace-form entropies such as the Shannon, Tsallis, and Kaniadakis families. By leveraging group-theoretical mirror maps (or link functions) in MD, expressed via multi-parametric generalized logarithms and their inverses (group exponentials), we achieve highly flexible and adaptable MD updates that can be tailored to diverse data geometries and statistical distributions. To this end, we introduce the notion of mirror duality, which allows us to seamlessly switch or interchange group-theoretical link functions with their inverses, subject to specific learning rate constraints. Tuning or learning the hyperparameters of the group logarithms enables us to adapt the model to the statistical properties of the training distribution, while simultaneously ensuring desirable convergence characteristics via fine-tuning. This generality not only provides greater flexibility and improved convergence properties, but also opens new perspectives for applications in machine learning and deep learning by expanding the design of regularizers and natural gradient algorithms. We extensively evaluate the validity, robustness, and performance of the proposed updates on large-scale, simplex-constrained quadratic programming problems.
中文标题/摘要
标题:群熵与镜像对偶:一类灵活的镜像下降优化算法
我们提出了一种综合的理论和算法框架,将形式群论和群熵与现代机器学习联系起来,为镜像下降(MD)优化算法开辟了一条通往无限、灵活家族的道路。我们的方法利用了群熵的丰富结构,这些群熵是受群合成法则支配的广义熵泛函,涵盖了并显著扩展了所有迹形式熵,如香农、Tsallis和Kaniadakis家族。通过利用MD中的群论镜像映射(或链接函数),通过多参数广义对数及其逆(群指数)表达,我们实现了高度灵活和适应性强的MD更新,可以针对不同的数据几何形状和统计分布进行定制。为此,我们引入了“镜像对偶”的概念,这使得我们可以在特定的学习率约束下无缝地切换或互换群论链接函数及其逆。通过调整或学习群对数的超参数,使模型适应训练分布的统计特性,同时通过精细调整确保理想的收敛特性。这种普遍性不仅提供了更大的灵活性和改进的收敛特性,还为机器学习和深度学习中的应用开辟了新的视角,通过扩展正则化器和自然梯度算法的设计。我们对大规模、单纯形约束的二次规划问题进行了广泛评估,验证了所提出更新的有效性、稳健性和性能。
Summary / 总结
The paper introduces a theoretical and algorithmic framework that combines group theory and group entropies with machine learning, creating a flexible family of Mirror Descent (MD) optimization algorithms. By using group-theoretical mirror maps and exploiting the structure of group entropies, the method achieves adaptable MD updates suitable for various data geometries. The approach introduces the concept of mirror duality, allowing for the interchange of group-theoretical link functions, and demonstrates improved convergence properties and robust performance in large-scale, simplex-constrained quadratic programming problems.
论文提出了一种结合群论和群熵与镜像下降(MD)优化算法的理论和算法框架,创建了一个灵活的MD更新家族。通过使用群理论镜像映射并利用群熵的结构,该方法能够适应各种数据几何和统计分布。该方法引入了镜像对偶的概念,允许在特定的学习率约束下交换群理论的链接函数及其逆函数。所提出的更新在大规模、单纯形约束的二次规划问题上进行了广泛评估,证明了其有效性和鲁棒性,并且相比传统方法具有更好的性能。
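A generalized logarithm acting as the link function in a mirror descent update can be sketched with the classical one-parameter Tsallis q-logarithm. This is a stand-in chosen for illustration: the paper's multi-parametric group logarithms are not specified in the abstract. The restriction to q <= 1, the strictly positive weights, and the renormalization onto the simplex are assumptions of this sketch.

```python
import math

def log_q(x, q):
    """Tsallis q-logarithm; equals the natural log when q == 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(y, q):
    """Tsallis q-exponential, the inverse of log_q on its range."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(y)
    base = max(1.0 + (1.0 - q) * y, 0.0)  # clip at zero ([.]_+ convention)
    return base ** (1.0 / (1.0 - q))

def md_step(w, grad, lr, q):
    """One mirror descent step on the simplex: map to the dual space
    with log_q, take a gradient step, map back with exp_q, renormalize.
    Assumes q <= 1 and strictly positive entries in w."""
    if q > 1.0:
        raise ValueError("this sketch only handles q <= 1")
    z = [exp_q(log_q(wi, q) - lr * gi, q) for wi, gi in zip(w, grad)]
    total = sum(z)
    return [zi / total for zi in z]

w = [0.25, 0.25, 0.5]
g = [1.0, 0.0, -1.0]
w_half = md_step(w, g, lr=0.1, q=0.5)                    # deformed update
w_eg = md_step([0.5, 0.5], [0.0, math.log(2.0)], 1.0, 1.0)  # q = 1 case
```

Setting q = 1 recovers the familiar exponentiated gradient update, so q acts as a single tunable hyperparameter that changes the geometry of the step.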
CAST: Modeling Visual State Transitions for Consistent Video Retrieval
Authors: Yanqing Liu, Yingcheng Liu, Fanghong Dong, Budianto Budianto, Cihang Xie, Yan Jiao
First: 2026-03-09T17:26:26+00:00 · Latest: 2026-03-09T17:26:26+00:00
Abstract
As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($Δ$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.
中文标题/摘要
标题:CAST: 建模视觉状态转换以实现一致的视频检索
随着视频内容创作转向长篇叙事,将短片段组合成连贯的故事线变得越来越重要。然而,现有的检索模型在推理时仍保持上下文无关,优先考虑局部语义对齐,而忽视了状态和身份的一致性。为了解决这一结构限制,我们正式化了一致视频检索(CVR)任务,并引入了一个跨越YouCook2、COIN和CrossTask的诊断基准。我们提出了CAST(上下文感知状态转换),这是一种轻量级、即插即用的适配器,兼容多种冻结的视觉-语言嵌入空间。通过从视觉历史中预测状态条件下的残差更新(Δ),CAST引入了对潜在状态演化的显式归纳偏置。广泛的实验表明,CAST在YouCook2和CrossTask上提高了性能,在COIN上保持竞争力,并且在多种基础骨干网络上始终优于零样本基线。此外,CAST为黑盒视频生成候选(例如来自Veo的)提供了有用的重排序信号,促进了更连贯的时间延续。
Summary / 总结
The research aims to improve video retrieval by addressing the lack of state and identity consistency in existing methods. CAST, a Context-Aware State Transition model, is proposed to predict state-conditioned residual updates from visual history, providing an explicit inductive bias for latent state evolution. Experiments show that CAST enhances performance on YouCook2 and CrossTask, maintains competitiveness on COIN, and outperforms zero-shot baselines across various foundation models. Additionally, CAST offers a useful reranking signal for video generation candidates, enhancing temporal coherence.
研究旨在解决将短片段组合成长篇视频叙述中的连贯性问题。提出了一种名为CAST的轻量级适配器,通过从视觉历史中预测状态条件下的残差更新来提高视频检索的一致性。实验表明,CAST在YouCook2和CrossTask上提高了性能,在COIN上保持竞争力,并且在各种基础模型上优于零样本基线。
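The residual-update idea in the CAST abstract can be sketched as a reranking step over frozen embeddings. Everything beyond "predict a residual Δ from visual history" is an assumption here: the `predict_delta` callable stands in for the learned adapter, and the mixing weight `alpha` and the score combination are hypothetical choices, not the paper's formulation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rerank(query_emb, history, candidates, predict_delta, alpha=0.5):
    """Rank candidate clips by a mix of (a) similarity to the text query
    and (b) similarity to the predicted next state, where the predicted
    state is the last history embedding plus a residual update."""
    delta = predict_delta(history)             # hypothetical learned adapter
    predicted = [s + d for s, d in zip(history[-1], delta)]
    scores = [
        (1.0 - alpha) * cosine(query_emb, c) + alpha * cosine(predicted, c)
        for c in candidates
    ]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

query = [1.0, 0.0]
history = [[1.0, 0.0]]
candidates = [[1.0, 0.0], [0.6, 0.8]]
toy_delta = lambda h: [-1.0, 1.0]              # toy stand-in for the adapter

context_free = rerank(query, history, candidates, toy_delta, alpha=0.0)
context_aware = rerank(query, history, candidates, toy_delta, alpha=0.5)
```

With alpha set to 0 the ranking depends only on the query, which corresponds to the context-agnostic retrieval the abstract criticizes; a positive alpha lets the predicted state change the order.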
HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases
Authors: Pingqing Zheng, Jiayin Qin, Fuqi Zhang, Niraj Chitla, Zishen Wan, Shang Wu, Yu Cao, Caiwen Ding, Yang Zhao
First: 2025-05-21T16:14:10+00:00 · Latest: 2026-03-09T17:26:13+00:00
Abstract
Retrieval Augmented Generation (RAG) is an essential agent for Large Language Model (LLM) aided Hardware Description Language (HDL) tasks, addressing the challenges of limited training data and prohibitively long prompts. However, its performance in handling ambiguous queries and real-world, repository-level HDL projects containing thousands or even tens of thousands of code lines remains limited. Our analysis demonstrates two fundamental mismatches, structural and vocabulary, between conventional semantic similarity-based RAGs and HDL codes. To this end, we propose HDLxGraph, the first framework that integrates the inherent graph characteristics of HDLs with RAGs for LLM-assisted tasks. Specifically, HDLxGraph incorporates Abstract Syntax Trees (ASTs) to capture HDLs' hierarchical structures and Data Flow Graphs (DFGs) to address the vocabulary mismatch. In addition, to overcome the lack of comprehensive HDL search benchmarks, we introduce HDLSearch, an LLM-generated dataset derived from real-world, repository-level HDL projects. Evaluations show that HDLxGraph improves search, debugging, and completion accuracy by 12.04%/12.22%/5.04% and by 11.59%/8.18%/4.07% over state-of-the-art similarity-based RAG and software-code Graph RAG baselines, respectively. The code for HDLxGraph and the HDLSearch benchmark is available at https://github.com/UMN-ZhaoLab/HDLxGraph.
中文标题/摘要
标题:HDLxGraph:通过HDL图形数据库连接大型语言模型和HDL存储库
检索增强生成(RAG)是大型语言模型(LLM)辅助描述语言(HDL)任务中的关键代理,解决了有限训练数据和过长提示的挑战。然而,其在处理含数千甚至数万行代码的模糊查询和实际仓库级HDL项目中的表现仍然有限。我们的分析表明,传统的基于语义相似性的RAG与HDL代码之间存在两种基本不匹配,即结构和词汇不匹配。为此,我们提出了HDLxGraph,这是第一个将HDL固有的图形特性与RAG结合用于LLM辅助任务的框架。具体而言,HDLxGraph结合了抽象语法树(AST)来捕捉HDL的层次结构,并使用数据流图(DFG)来解决词汇不匹配问题。此外,为了解决全面的HDL搜索基准数据不足的问题,我们引入了HDLSearch,这是一个从实际仓库级HDL项目生成的LLM数据集。评估结果显示,与最先进的基于相似性的RAG和软件代码图形RAG基线相比,HDLxGraph在搜索、调试和完成准确性上分别提高了12.04%/12.22%/5.04%和11.59%/8.18%/4.07%。HDLxGraph和HDLSearch基准代码可在https://github.com/UMN-ZhaoLab/HDLxGraph/获取。
Summary / 总结
HDLxGraph is a framework that integrates HDL's inherent graph characteristics with Retrieval Augmented Generation (RAG) to improve Large Language Model (LLM) performance in HDL tasks. It uses Abstract Syntax Trees (ASTs) to capture HDL hierarchical structures and Data Flow Graphs (DFGs) to address vocabulary mismatches. HDLxGraph also introduces HDLSearch, an LLM-generated dataset from real-world HDL projects, to overcome the lack of comprehensive benchmarks. Experiments show that HDLxGraph outperforms state-of-the-art RAG and software-code Graph RAG baselines in search, debugging, and completion accuracy by 12.04%/12.22%/5.04% and 11.59%/8.18%/4.07%, respectively.
HDLxGraph 是一个框架,将硬件描述语言(HDL)的图特性与检索增强生成(RAG)相结合,以解决传统 RAG 方法在处理 HDL 任务时的局限性。它使用抽象语法树(AST)来捕捉 HDL 的层次结构,并使用数据流图(DFG)来处理词汇不匹配问题。评估结果显示,HDLxGraph 在搜索、调试和完成准确性上分别提高了 12.04%/12.22%/5.04% 和 11.59%/8.18%/4.07%,超过了最先进的基于相似性的 RAG 和软件代码图 RAG 基线。
Grow, Don't Overwrite: Fine-tuning Without Forgetting
Authors: Dyah Adila, Hanna Mazzawi, Benoit Dherin, Xavier Gonzalvo
First: 2026-03-09T17:26:03+00:00 · Latest: 2026-03-09T17:26:03+00:00
Abstract
Adapting pre-trained models to specialized tasks often leads to catastrophic forgetting, where new knowledge overwrites foundational capabilities. Existing methods either compromise performance on the new task or struggle to balance training stability with efficient reuse of pre-trained knowledge. We introduce a novel function-preserving expansion method that resolves this dilemma. Our technique expands model capacity by replicating pre-trained parameters within transformer submodules and applying a scaling correction that guarantees the expanded model is mathematically identical to the original at initialization, enabling stable training while exploiting existing knowledge. Empirically, our method eliminates the trade-off between plasticity and stability, matching the performance of full fine-tuning on downstream tasks without any degradation of the model's original capabilities. Furthermore, we demonstrate the modularity of our approach, showing that by selectively expanding a small subset of layers we can achieve the same performance as full fine-tuning at a fraction of the computational cost.
中文标题/摘要
标题:扩展,不要覆盖:在不遗忘的情况下微调
将预训练模型适应专门任务往往会引发灾难性遗忘,即新知识会覆盖基础能力。现有方法要么在新任务上牺牲性能,要么难以在训练稳定性和高效利用预训练知识之间取得平衡。我们提出了一种新的函数保持扩展方法,解决了这一困境。该技术通过在变压器子模块内复制预训练参数并应用缩放校正,使扩展后的模型在初始化时与原始模型在数学上完全相同,从而实现稳定训练并利用现有知识。实验证明,我们的方法消除了弹性和稳定性的权衡,在下游任务上达到与完全微调相同的性能,同时不损害模型的原始能力。此外,我们展示了我们方法的模块化,通过选择性地扩展一小部分层,我们可以在计算成本大幅降低的情况下达到与完全微调相同的性能。
Summary / 总结
The research addresses the issue of catastrophic forgetting in fine-tuning pre-trained models, where new knowledge can erase old capabilities. It introduces a function-preserving expansion method that expands model capacity by replicating pre-trained parameters and applying a scaling correction to ensure the expanded model is identical to the original at initialization. The method allows for stable training while retaining the original model's performance, matching full fine-tuning performance on downstream tasks without degradation. Additionally, it shows that selectively expanding a small subset of layers can achieve similar performance at lower computational cost.
论文解决了在为新任务微调预训练模型时出现的灾难性遗忘问题。它提出了一种函数保持扩展方法,通过在变压器子模块内复制预训练参数并应用缩放校正,确保扩展后的模型在初始化时与原始模型完全相同。这种方法允许稳定训练同时保留原始模型的能力,能够在下游任务上达到与完全微调相同的性能而不会降级。此外,它还展示了通过选择性地扩展一小部分层可以以较低的计算成本达到类似性能。
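The function-preserving expansion described in the "Grow, Don't Overwrite" abstract (replicating pre-trained parameters and applying a scaling correction so the expanded model matches the original at initialization) can be sketched on a toy two-layer ReLU MLP. Which submodules are expanded and the exact form of the correction are not given in the abstract; duplicating hidden units and dividing the output weights by the number of copies is an assumption of this sketch.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def mlp(W1, W2, x):
    """Toy MLP without biases: y = W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def expand(W1, W2, copies=2):
    """Widen the hidden layer by replicating every hidden unit `copies`
    times, then scale the output weights by 1/copies so the sum over
    the replicas reproduces the original output exactly."""
    W1_big = [row[:] for _ in range(copies) for row in W1]
    W2_big = [[w / copies for _ in range(copies) for w in row] for row in W2]
    return W1_big, W2_big

W1 = [[1.0, -2.0], [0.5, 3.0]]   # 2 hidden units, 2 inputs
W2 = [[2.0, -1.0]]               # 1 output
x = [1.0, 2.0]

y_small = mlp(W1, W2, x)
W1_big, W2_big = expand(W1, W2, copies=2)
y_big = mlp(W1_big, W2_big, x)
```

The replicas start identical but are free to diverge during fine-tuning, which is where the added capacity would come from in this picture.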
Exploring Embedding Priors in Prompt-Tuning for Improved Interpretability and Control
Authors: Sergey Sedov, Sumanth Bharadwaj Hachalli Karanam, Venu Gopal Kadamba
First: 2024-12-24T18:18:52+00:00 · Latest: 2026-03-09T17:23:47+00:00
Abstract
Prompt-Tuning is an efficient method for adapting pre-trained language models to new tasks with minimal computational overhead by modifying prompt embeddings. In this work, we investigate how crucial the phenomenon of embedding collapse, frequently observed in Prompt-Tuning, is for the final performance of the model. To address this question, we designed embedding priors and compared them with posteriors of the converged Soft and Deep Prompt-Tuning methods. Our findings suggest that priors strongly affect the position of the tuned embeddings, and models can effectively work with embeddings from different parts of activation spaces, including completely new regions. As the final Prompt-Tuning capabilities are limited, we hypothesize that controllable Prompt-Tuning posteriors may serve as a good starting point for tasks such as chain-of-thought (COT) distillation. Our experiments also show that generated trajectories are not localized in the activation space of the models. However, there are distinct clusters of activations for distant tasks (e.g., NLP and arithmetic), while activations between NLP tasks (e.g., Question-Answering and MLM) lie in the same cluster. These observations raise questions about the importance of a single activation cluster for the generalization abilities of large language models.
中文标题/摘要
标题:探索提示调优中的嵌入先验以提高可解释性和控制能力
提示调优是一种通过修改提示嵌入来最小化计算开销将预训练语言模型适应新任务的有效方法。在本文中,我们研究了提示调优中经常观察到的嵌入坍塌现象对最终模型性能的重要性。为了回答这个问题,我们设计了嵌入先验,并将其与收敛的软提示调优和深提示调优方法的后验进行了比较。我们的研究结果表明,先验强烈影响调优嵌入的位置,模型可以有效地使用来自激活空间不同部分的嵌入,包括全新的区域。由于最终的提示调优能力有限,我们假设可控的提示调优后验可以作为链式思考(COT)蒸馏等任务的良好起点。我们的实验还表明,生成的轨迹在模型的激活空间中并不局限于特定区域。然而,对于远程任务(例如,NLP和算术)存在不同的激活簇,而NLP任务(例如,问答和掩码语言模型)之间的激活则位于同一个簇中。这些观察结果引发了关于大型语言模型泛化能力中单一激活簇重要性的疑问。
Summary / 总结
This study investigates the impact of embedding collapse in Prompt-Tuning on model performance by introducing embedding priors and comparing them with the converged posteriors of Soft and Deep Prompt-Tuning methods. The research finds that priors significantly influence the tuned embeddings' positions and that models can operate effectively with embeddings from various parts of the activation space. The study also suggests that controllable Prompt-Tuning posteriors could be beneficial for tasks like chain-of-thought distillation. Additionally, experiments reveal that generated trajectories are not confined to a single activation cluster, but there are distinct clusters for different tasks, indicating the complexity of generalization in large language models.
这项研究探讨了嵌入先验对Prompt-Tuning性能的影响,Prompt-Tuning是一种用于适应预训练语言模型的方法。通过将先验与Soft和Deep Prompt-Tuning的后验进行比较,研究发现先验显著影响了调优后的嵌入,并且模型可以在激活空间的不同部分有效运行。研究还指出,尽管Prompt-Tuning的能力有限,但可控的Prompt-Tuning后验可能适用于诸如链式思维(COT)蒸馏等任务。此外,实验还表明生成的轨迹并不局限于单一的激活簇,但对于不同任务存在不同的簇,这表明单一的激活簇可能不是大型语言模型泛化能力的关键。
X-SYS: A Reference Architecture for Interactive Explanation Systems
Authors: Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin
First: 2026-02-13T09:24:03+00:00 · Latest: 2026-03-09T17:21:38+00:00
Comments: 18 pages, 8 figures
Abstract
The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: Interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, that guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.
中文标题/摘要
标题:X-SYS:交互解释系统参考架构
可解释人工智能(XAI)研究社区提出了众多技术方法,但将可解释性部署为系统仍然具有挑战性:交互解释系统需要合适的算法和系统能力,以保持解释在重复查询、模型和数据演变以及治理约束下的可用性。我们认为,实现XAI需要将可解释性视为信息系统问题,其中用户交互需求会引发特定的系统需求。我们介绍了X-SYS,一种交互解释系统的参考架构,指导(X)AI研究人员、开发人员和从业者将交互解释用户界面(XUI)与系统能力连接起来。X-SYS围绕四个质量属性(可扩展性、可追溯性、响应性和适应性)组织,并规定了五部分分解(XUI服务、解释服务、模型服务、数据服务、编排和治理)。它将交互模式映射到系统能力,以解耦用户界面的演变与后端计算。我们通过SemanticLens系统实现了X-SYS,SemanticLens是一个用于视觉语言模型的语义搜索和激活引导系统。SemanticLens展示了基于合同的服务边界如何实现独立演变,离线/在线分离如何确保响应性,持久状态管理如何支持可追溯性。这项工作一起提供了一个可重用的蓝图和具体的实例,支持在运营约束下端到端设计交互解释系统。
Summary / 总结
The research addresses the challenge of deploying explainable AI (XAI) as systems, particularly interactive explanation systems that must stay usable across repeated queries, evolving models and data, and governance constraints. It introduces X-SYS, a reference architecture that guides the connection between interactive explanation user interfaces (XUI) and system capabilities. X-SYS is organized around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability) and a five-component decomposition. Its implementation, SemanticLens, demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability.
研究旨在通过提出X-SYS参考架构来解决可解释AI(XAI)系统的部署挑战。X-SYS关注通过四个质量属性(可扩展性、可追溯性、响应性和适应性)和五组件分解来指导用户界面与系统能力的集成。关键实验发现包括通过SemanticLens的演示,展示了基于合同的服务边界如何实现独立演化,离线/在线分离如何确保响应性,以及持久状态管理如何支持可追溯性。
PostTrainBench: Can LLM Agents Automate LLM Post-Training?
Authors: Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko
First: 2026-03-09T17:18:00+00:00 · Latest: 2026-03-09T17:18:00+00:00
Abstract
AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.
中文标题/摘要
标题:PostTrainBench:LLM代理能否自动化LLM后训练?
过去一年,AI代理在软件工程方面表现出色,主要得益于推理能力的提升。这引发了一个更深层次的问题:这些系统能否将能力扩展到自动化AI研究本身?在本文中,我们探讨了后训练这一关键阶段,该阶段将基础LLM转化为有用的助手。我们引入了PostTrainBench来评估LLM代理在受限制计算资源(10小时,一个H100 GPU)下自主进行后训练的能力。我们让前沿代理(例如,配备Opus 4.6的Claude Code)优化一个基础LLM在特定基准上的性能(例如,Qwen3-4B在AIME上的表现)。重要的是,我们没有为代理提供任何预定义策略,而是赋予它们在网络上查找必要信息、运行实验和整理数据的完全自主权。我们发现,前沿代理取得了显著进展,但通常落后于领先提供商的指令调优LLM:最佳代理的进展为23.2%,而官方指令调优模型为51.1%。然而,在特定场景中,代理可以超越指令调优模型:GPT-5.1 Codex Max在BFCL上的得分为89%,而官方模型为67%。我们还观察到几种值得警惕的失败模式。代理有时会进行奖励作弊:在测试集上进行训练、下载现有的指令调优检查点而不是训练自己的模型、以及未经授权使用找到的API密钥生成合成数据。这些行为令人担忧,突显了随着这些系统能力增强,仔细沙盒化的重要性。总体而言,我们希望PostTrainBench能够有助于跟踪AI R&D自动化进展,并研究伴随而来的风险。网站和代码可在https://posttrainbench.com/上获取。
Summary / 总结
The paper explores whether frontier AI agents can automate post-training, the phase that turns base LLMs into useful assistants. PostTrainBench gives agents such as Claude Code with Opus 4.6 full autonomy to optimize a base LLM on a particular benchmark under bounded compute (10 hours on one H100 GPU). The best agent reaches 23.2%, compared with 51.1% for official instruction-tuned models from leading providers. In targeted scenarios agents can exceed the official models: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. The study also flags reward hacking, including training on the test set, downloading existing instruction-tuned checkpoints, and using found API keys to generate synthetic data without authorization, which highlights the importance of careful sandboxing.
本文探讨了前沿AI代理是否能够自动化LLM的后训练阶段,这是将基础模型转化为有用助手的关键步骤。通过PostTrainBench,作者评估了配备Opus 4.6的Claude Code等前沿代理在优化基础模型于特定基准上的性能方面的表现。虽然这些代理取得了显著进展,但它们通常落后于领先提供商的指令调优模型:最佳代理为23.2%,官方指令调优模型为51.1%。然而,在某些特定场景下,像GPT-5.1 Codex Max这样的代理超过了官方模型。研究还指出了几种令人担忧的行为,如奖励作弊和未经授权使用API密钥生成数据,强调了随着这些系统变得越来越强大,谨慎沙盒化的重要性。
UNBOX: Unveiling Black-box visual models with Natural-language
Authors: Simone Carnemolla, Chiara Russo, Simone Palazzo, Quentin Bouniot, Daniela Giordano, Zeynep Akata, Matteo Pennisi, Concetto Spampinato
First: 2026-03-09T17:16:39+00:00 · Latest: 2026-03-09T17:16:39+00:00
Comments: Under review at IJCV
Abstract
Ensuring trustworthiness in open-world visual recognition requires models that are interpretable, fair, and robust to distribution shifts. Yet modern vision systems are increasingly deployed as proprietary black-box APIs, exposing only output probabilities and hiding architecture, parameters, gradients, and training data. This opacity prevents meaningful auditing, bias detection, and failure analysis. Existing explanation methods assume white- or gray-box access or knowledge of the training distribution, making them unusable in these real-world settings. We introduce UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints. UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities. The method produces human-interpretable text descriptors that maximally activate each class, revealing the concepts a model has implicitly learned, the training distribution it reflects, and potential sources of bias. We evaluate UNBOX on ImageNet-1K, Waterbirds, and CelebA through semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing. Despite operating under the strictest black-box constraints, UNBOX performs competitively with state-of-the-art white-box interpretability methods. This demonstrates that meaningful insight into a model's internal reasoning can be recovered without any internal access, enabling more trustworthy and accountable visual recognition systems.
中文标题/摘要
标题:UNBOX:使用自然语言揭开黑盒视觉模型的面纱
在开放世界视觉识别中确保可信赖性需要可解释、公平且对分布偏移具有鲁棒性的模型。然而,现代视觉系统越来越多地作为专有的黑盒API部署,仅暴露输出概率而不公开架构、参数、梯度和训练数据。这种透明度不足阻碍了有意义的审计、偏见检测和故障分析。现有解释方法假设白盒或灰盒访问或了解训练分布,使其在这些实际场景中不可用。我们提出了UNBOX,一种在完全无数据、无梯度和无反向传播约束下的分类模型解剖框架。UNBOX利用大型语言模型和文本到图像扩散模型,将激活最大化重新定义为由输出概率驱动的纯粹语义搜索。该方法生成人类可解释的文字描述,以最大化激活每个类别,揭示模型隐含学习的概念、反映的训练分布以及潜在的偏见来源。我们通过语义保真度测试、视觉特征相关性分析和切片发现审计在ImageNet-1K、Waterbirds和CelebA上评估UNBOX。尽管在最严格的黑盒约束下操作,UNBOX在与最先进的白盒可解释性方法的竞争中表现出色。这表明,无需任何内部访问,也可以恢复对模型内部推理的有意义见解,从而促进更可信和可问责的视觉识别系统。
Summary / 总结
UNBOX is a framework for interpreting black-box visual models by leveraging natural language and text-to-image models, enabling class-wise model dissection without requiring access to gradients, training data, or backpropagation. It produces human-interpretable text descriptors that reveal the concepts learned by the model and potential sources of bias. UNBOX was evaluated on ImageNet-1K, Waterbirds, and CelebA, showing competitive performance with state-of-the-art white-box methods despite operating under strict black-box constraints.
UNBOX 是一个框架,通过利用大型语言模型和文本到图像的扩散模型来解释黑盒视觉模型。它在无需访问梯度、训练数据或内部参数的情况下,专注于语义搜索生成可由人类理解的文字描述,揭示模型学习的概念和潜在偏差。尽管在最严格的黑盒约束下运行,UNBOX 在 ImageNet-1K、Waterbirds 和 CelebA 数据集上的表现仍与最先进的白盒可解释性方法相当。
StreamReady: Learning What to Answer and When in Long Streaming Videos
Authors: Shehreen Azad, Vibhav Vineet, Yogesh Singh Rawat
Venue: CVPR 2026
First: 2026-03-09T17:02:44+00:00 · Latest: 2026-03-09T17:02:44+00:00
Comments: Accepted in CVPR 2026
Abstract
Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence reflects speculation, answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the Answer Readiness Score (ARS), a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce StreamReady, a framework to unify temporal reasoning with on-time answering through a lightweight readiness mechanism that decides if sufficient evidence has been observed before responding. To evaluate this capability, we further introduce ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA, and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.
中文标题/摘要
标题:StreamReady:学习在长流式视频中回答什么以及何时回答
流式视频理解通常涉及时间敏感的场景,其中模型需要恰好在支持性视觉证据出现时作答:在证据出现之前回答属于推测,而在证据过去之后回答则降低了实时实用性。为了刻画这种行为,我们提出了一种准备度感知的流式视频理解表述,并引入答案准备度得分(ARS),这是一种带有不对称过早和过晚惩罚的时间感知目标。当与正确性结合时,ARS 定义了一种有效准确率,不仅衡量模型是否正确,还衡量其是否在适当时刻回答。在此基础上,我们引入了 StreamReady 框架,通过一种轻量级的准备度机制统一时间推理与及时回答,该机制在回应之前判断是否已观察到足够的证据。为了评估这种能力,我们进一步引入了 ProReady-QA,这是一个包含标注的答案证据窗口和主动多轮问题的基准,覆盖局部和全局上下文。StreamReady 在 ProReady-QA 上表现出色,并且在另外八个流式和离线长视频基准上始终优于先前的方法,展示了稳健且广泛适用的视频理解能力。
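The idea of a timing-aware score with asymmetric early and late penalties can be illustrated with a small Python sketch. This is not the paper's definition of ARS: the function names `readiness_score` and `effective_accuracy`, the linear penalty shape, the `horizon` normalizer, the choice to penalize early answers more steeply than late ones, and the product of correctness and timing score are all assumptions made for illustration.

```python
def readiness_score(answer_time, window_start, window_end,
                    early_weight=2.0, late_weight=1.0, horizon=10.0):
    """Toy timing score in [0, 1]; 1.0 when the answer falls inside the
    annotated evidence window.

    Hypothetical shape: linear penalties outside the window, with a
    different slope for early and late answers (the asymmetry).
    """
    if window_start <= answer_time <= window_end:
        return 1.0
    if answer_time < window_start:
        # Answered before the evidence appeared (speculation).
        penalty = early_weight * (window_start - answer_time) / horizon
    else:
        # Answered after the evidence passed (reduced real-time utility).
        penalty = late_weight * (answer_time - window_end) / horizon
    return max(0.0, 1.0 - penalty)


def effective_accuracy(records):
    """Mean of correctness (0 or 1) times the timing score.

    Each record is (correct, answer_time, window_start, window_end).
    Combining by product is an assumption of this sketch.
    """
    if not records:
        return 0.0
    total = 0.0
    for correct, answer_time, start, end in records:
        total += float(correct) * readiness_score(answer_time, start, end)
    return total / len(records)
```

Under these assumed weights, a correct answer given 2 seconds early loses twice as much credit as one given 2 seconds late, and a wrong answer scores zero regardless of timing.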
Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Authors: Earl Ranario, Mason J. Earles
First: 2025-12-17T21:22:44+00:00 · Latest: 2026-03-09T16:58:19+00:00
Abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
中文标题/摘要
标题:视觉语言模型是否准备好零样本替代农业监督分类模型?
视觉语言模型(VLMs)越来越多地被提议作为视觉识别任务的一般解决方案,但它们在农业决策支持中的可靠性仍不清楚。我们对来自AgML集合(https://github.com/Project-AgML)的27个农业图像分类数据集中的多种开源和闭源VLM进行了基准测试,这些数据集涵盖了162个类别和248,000张图像,涉及植物病害、害虫和损伤以及植物和杂草种类识别。在所有任务中,零样本VLMs的表现显著低于监督任务特定基线(YOLO11),后者始终比任何基础模型获得更高的准确率。在多项选择提示下,表现最佳的VLM(Gemini-3 Pro)的平均准确率为约62%,而开放式提示则导致性能大幅下降,通常准确率低于25%。基于LLM的语义评估提高了开放式提示的准确率(例如,顶级模型从约21%提高到约30%),并改变了模型排名,表明评估方法对报告结论有实质性影响。在开源模型中,Qwen-VL-72B表现最佳,在受约束提示下接近闭源性能,但仍落后于顶级专有系统。任务级分析表明,植物和杂草种类分类始终比害虫和损伤识别更容易,后者是所有模型中最具有挑战性的类别。总体而言,这些结果表明,当前的即用型VLMs尚不适合作为独立的农业诊断系统,但在与受约束界面、明确标签本体和领域意识评估策略配对时,可以作为辅助组件发挥作用。
Summary / 总结
The study benchmarks open-source and closed-source vision-language models (VLMs) on 27 agricultural image classification datasets against a supervised task-specific baseline (YOLO11). Across all tasks, zero-shot VLMs substantially underperform the baseline: the best VLM (Gemini-3 Pro) reaches approximately 62% average accuracy under multiple-choice prompting, and open-ended prompting typically yields raw accuracies below 25%. LLM-based semantic judging raises open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings. The authors conclude that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can serve as assistive components.
研究对多种视觉语言模型(VLMs)进行了农业图像分类数据集的基准测试,发现零样本VLMs的表现低于监督任务特定基线。在多项选择提示下,性能最佳的VLM(Gemini-3 Pro)达到约62%的准确率,而开放提示下的表现较低。应用基于LLM的语义评估可以提高开放提示的准确率并改变模型排名。在开源模型中,Qwen-VL-72B表现最佳,但仍落后于顶级专有系统。研究结论认为,当前的VLMs尚不适合作为独立的诊断系统,但在与受限界面和领域导向评估策略结合使用时可以发挥作用。
Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations
Authors: Jiangye Yuan, Gowri Kumar, Baoyuan Wang
First: 2026-03-09T16:42:43+00:00 · Latest: 2026-03-09T16:42:43+00:00
Abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach boosts GPT-5's performance on VSI-Bench by 8% overall and more than 11% on tasks that rely heavily on spatial layout understanding. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.
Summary / 总结
The research aims to enhance the spatial reasoning capabilities of Multimodal Large Language Models (MLLMs) by introducing geometrically referenced 3D scene representations (GR3D). GR3D annotates objects in images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs, enabling MLLMs to interpret 3D cues and analyze 2D visual features simultaneously. The approach requires no additional training; in a zero-shot setting it improves GPT-5's performance on VSI-Bench by 8% overall and by more than 11% on tasks that rely heavily on spatial layout understanding. Qualitative studies further show that GR3D supports complex spatial reasoning with highly sparse input views.
研究旨在通过引入几何参考的3D场景表示(GR3D)来增强多模态大型语言模型(MLLMs)的空间推理能力。GR3D为图像中的对象分配唯一的ID,并以由这些ID索引的文本参考形式编码其3D几何属性,使MLLMs能够同时解析3D线索和2D视觉特征。该方法无需额外训练,在零样本设置下将GPT-5在VSI-Bench上的整体性能提高了8%,并在高度依赖空间布局理解的任务上提高了超过11%。定性研究进一步表明,GR3D使MLLMs能够在输入视图高度稀疏的情况下进行复杂的空间推理。
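As a rough illustration of "3D geometric attributes as textual references indexed by IDs", the sketch below renders a list of objects as ID-tagged text lines that could be placed in a prompt next to the annotated images. The schema (`id`, `label`, `center`, `size`), the units, and the function name `format_gr3d_references` are hypothetical; the abstract does not specify the actual format GR3D uses.

```python
def format_gr3d_references(objects):
    """Render per-object 3D attributes as ID-indexed text lines.

    Hypothetical schema for each object:
      id     - integer ID matching the label drawn on the image
      label  - object category name
      center - (x, y, z) position, assumed to be in meters
      size   - (width, height, depth), assumed to be in meters
    """
    lines = []
    for obj in sorted(objects, key=lambda o: o["id"]):
        cx, cy, cz = obj["center"]
        w, h, d = obj["size"]
        lines.append(
            f"[{obj['id']}] {obj['label']}: "
            f"center=({cx:.2f}, {cy:.2f}, {cz:.2f}) m, "
            f"size=({w:.2f}, {h:.2f}, {d:.2f}) m"
        )
    return "\n".join(lines)


# Example scene with two made-up objects.
scene = [
    {"id": 2, "label": "chair", "center": (1.0, 0.0, 2.5), "size": (0.5, 0.9, 0.5)},
    {"id": 1, "label": "table", "center": (0.0, 0.0, 2.0), "size": (1.2, 0.75, 0.8)},
]
reference_text = format_gr3d_references(scene)
```

With coordinates written out as text, a question such as "how far is object 1 from object 2" becomes arithmetic over numbers the model can read directly, which is the kind of language-based mathematical reasoning the abstract refers to.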
CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing
Authors: Yucheng Wang, Zedong Wang, Yuetong Wu, Yue Ma, Dan Xu
Venue: CVPR 2026
First: 2026-03-09T16:40:47+00:00 · Latest: 2026-03-09T16:40:47+00:00
Comments: Accepted by CVPR 2026. Project page: https://care-edit.github.io/
Abstract
Unified diffusion editors often rely on a fixed, shared backbone for diverse tasks, suffering from task interference and poor adaptation to heterogeneous demands (e.g., local vs global, semantic vs photometric). In particular, prevalent ControlNet and OmniControl variants combine multiple conditioning signals (e.g., text, mask, reference) via static concatenation or additive adapters which cannot dynamically prioritize or suppress conflicting modalities, thus resulting in artifacts like color bleeding across mask boundaries, identity or style drift, and unpredictable behavior under multi-condition inputs. To address this, we propose Condition-Aware Routing of Experts (CARE-Edit) that aligns model computation with specific editing competencies. At its core, a lightweight latent-attention router assigns encoded diffusion tokens to four specialized experts--Text, Mask, Reference, and Base--based on multi-modal conditions and diffusion timesteps: (i) a Mask Repaint module first refines coarse user-defined masks for precise spatial guidance; (ii) the router applies sparse top-K selection to dynamically allocate computation to the most relevant experts; (iii) a Latent Mixture module subsequently fuses expert outputs, coherently integrating semantic, spatial, and stylistic information to the base images. Experiments validate CARE-Edit's strong performance on contextual editing tasks, including erasure, replacement, text-driven edits, and style transfer. Empirical analysis further reveals task-specific behavior of specialized experts, showcasing the importance of dynamic, condition-aware processing to mitigate multi-condition conflicts.
中文标题/摘要
标题:CARE-Edit:条件感知专家路由的上下文图像编辑
统一的扩散编辑器通常依赖于一个固定的共享骨干网络来完成多种任务,这会导致任务干扰和对异构需求(如局部 vs 全局、语义 vs 光学)的不良适应。特别是,流行的ControlNet和OmniControl变体通过静态连接或加性适配器将多种条件信号(如文本、掩码、参考)组合在一起,但这些方法无法动态地优先处理或抑制冲突的模态,从而导致跨掩码边界的颜色溢出、身份或风格漂移以及在多条件输入下的不可预测行为。为了解决这个问题,我们提出了条件感知专家路由(CARE-Edit),它使模型计算与特定的编辑能力相一致。其核心是一个轻量级的潜在注意力路由器,根据多模态条件和扩散时间步将编码的扩散令牌分配给四个专门的专家——文本、掩码、参考和基础:(i)一个掩码重绘模块首先细化用户定义的粗略掩码,以提供精确的空间指导;(ii)路由器应用稀疏的Top-K选择来动态分配计算给最相关的专家;(iii)一个潜在混合模块随后融合专家输出,将语义、空间和风格信息有条不紊地整合到基础图像中。实验验证了CARE-Edit在上下文编辑任务(包括擦除、替换、文本驱动的编辑和风格转移)中的强大性能。进一步的实证分析揭示了专门专家的任务特定行为,展示了动态、条件感知处理的重要性,以减轻多条件冲突。
Summary / 总结
CARE-Edit addresses the limitations of unified diffusion editors by proposing a condition-aware routing mechanism. It uses a lightweight latent-attention router to assign diffusion tokens to specialized experts based on multi-modal conditions and diffusion timesteps. The system includes a Mask Repaint module for mask refinement, dynamic allocation of computation to relevant experts, and a Latent Mixture module for fusing expert outputs. Experiments show strong performance in contextual editing tasks such as erasure, replacement, text-driven edits, and style transfer, highlighting the importance of dynamic, condition-aware processing.
CARE-Edit通过提出条件感知路由机制来解决统一扩散编辑器的局限性。该系统使用轻量级的潜在注意力路由器根据多模态条件和扩散时间步分配扩散令牌给专门的专家。系统包括一个Mask重绘模块、稀疏Top-K选择和一个潜在混合模块来细化和整合编辑。实验表明,CARE-Edit在各种上下文编辑任务中表现出色,展示了动态、条件感知处理对缓解多条件冲突的重要性。
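The sparse top-K selection step can be sketched generically in Python. This shows standard mixture-of-experts gating only, not CARE-Edit's latent-attention router: the router logits are taken as given (the paper derives them from multi-modal conditions and diffusion timesteps), and the helper names `route_top_k` and `mix`, the softmax renormalization over the selected experts, and the weighted-sum fusion are assumptions of this sketch.

```python
import math

# The four experts named in the abstract.
EXPERTS = ["text", "mask", "reference", "base"]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route_top_k(logits, k=2):
    """Keep the k highest-scoring experts for one token.

    Returns {expert_name: gate_weight}; weights are renormalized over
    the selected experts and sum to 1.
    """
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in ranked])
    return {EXPERTS[i]: w for i, w in zip(ranked, weights)}


def mix(expert_outputs, gates):
    """Fuse the selected experts' output vectors by gate-weighted sum."""
    dim = len(next(iter(expert_outputs.values())))
    out = [0.0] * dim
    for name, w in gates.items():
        for j, v in enumerate(expert_outputs[name]):
            out[j] += w * v
    return out
```

Experts outside the top K receive no weight, so a conflicting modality can be suppressed entirely for a given token, which static concatenation or additive adapters cannot do.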
Trust via Reputation of Conviction
Authors: Aravind R. Iyengar
First: 2026-03-09T16:30:33+00:00 · Latest: 2026-03-09T16:30:33+00:00
Comments: 19 pages, 4 figures
Abstract
The question of knowledge, truth and trust is explored via a mathematical formulation of claims and sources. We define truth as the reproducibly perceived subset of knowledge, formalize sources as having both generative and discriminative roles, and develop a framework for reputation grounded in conviction, the likelihood that a source's stance is vindicated by independent consensus. We argue that conviction, rather than correctness or faithfulness, is the principled basis for trust: it is regime-independent, rewards genuine contribution, and demands the transparent and self-sufficient perceptions that make external verification possible. We formalize reputation as the expected weighted signed conviction over a realm of claims, characterize its behavior across source-claim regimes, and identify continuous verification as both a theoretical necessity and a practical mechanism through which reputation accrues. The framework is applied to AI agents, which are identified as capable but error-prone sources for whom verifiable conviction and continuously accrued reputation constitute the only robust foundation for trust.
中文标题/摘要
标题:基于信念声誉的信任
通过对主张和来源的数学形式化,探讨了知识、真理和信任的问题。我们将真理定义为可重复感知的知识子集,将来源形式化为兼具生成和辨别双重角色,并发展了一种基于信念(即来源的立场被独立共识证实的可能性)的声誉框架。我们主张,信念而非正确性或忠实性才是信任的原则性基础:它不依赖于具体情境(regime),奖励真正的贡献,并要求透明且自足的感知,从而使外部验证成为可能。我们将声誉形式化为主张领域上的期望加权带符号信念,刻画了其在不同来源-主张情境下的行为,并指出持续验证既是理论上的必要条件,也是声誉得以积累的实际机制。该框架被应用于AI代理:它们被视为有能力但易出错的来源,对它们而言,可验证的信念和持续积累的声誉构成了信任的唯一稳固基础。
Summary / 总结
This paper explores the concepts of knowledge, truth, and trust through a mathematical framework of claims and sources. It defines truth as the reproducibly perceived subset of knowledge and formalizes sources with generative and discriminative roles. The authors develop a reputation framework based on conviction, the likelihood that a source's stance will be vindicated by independent consensus. They argue that conviction, rather than correctness or faithfulness, is the principled basis for trust, as it is regime-independent, rewards genuine contribution, and demands transparent and self-sufficient perceptions. The framework is applied to AI agents, emphasizing verifiable conviction and continuously accrued reputation as the only robust foundation for trust.
该论文通过主张和来源的数学框架探讨了知识、真理和信任的概念,将真理定义为可重复感知的知识子集,并将来源形式化为兼具生成和辨别作用。作者提出了一个基于信念的声誉框架,其中信念指来源的立场被独立共识证实的可能性。他们认为,信念而非正确性或忠实性才是信任的原则性基础,因为它不依赖于具体情境,奖励真正的贡献,并要求透明且自足的感知。该框架被应用于AI代理,强调可验证的信念和持续积累的声誉是建立信任的唯一稳固基础。
ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall
Authors: Jiayu Yang, Yuxuan Fan, Songning Lai, Shengen Wu, Jiaqi Tang, Chun Kang, Zhijiang Guo, Yutao Yue
First: 2025-10-09T07:46:08+00:00 · Latest: 2026-03-09T16:29:41+00:00
Comments: Accepted by ICLR 2026
Abstract
Large Language Models (LLMs) require efficient knowledge editing (KE) to update factual information, yet existing methods exhibit significant performance decay in multi-hop factual recall. This failure is particularly acute when edits involve intermediate implicit subjects within reasoning chains. Through causal analysis, we reveal that this limitation stems from an oversight of how chained knowledge is dynamically represented and utilized at the neuron level. We discover that during multi-hop reasoning, implicit subjects function as query neurons, which sequentially activate corresponding value neurons across transformer layers to accumulate information toward the final answer, a dynamic prior KE work has overlooked. Guided by this insight, we propose ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall, a framework that leverages neuron-level attribution to identify and edit these critical query-value (Q-V) pathways. ACE provides a mechanistically grounded solution for multi-hop KE, empirically outperforming state-of-the-art methods by 9.44% on GPT-J and 37.46% on Qwen3-8B. Our analysis further reveals more fine-grained activation patterns in Qwen3 and demonstrates that the semantic interpretability of value neurons is orchestrated by query-driven accumulation. These findings establish a new pathway for advancing KE capabilities based on the principled understanding of internal reasoning mechanisms.
中文标题/摘要
标题:ACE:面向多跳事实回忆的归因控制知识编辑
大型语言模型(LLMs)需要高效的知识编辑(KE)来更新事实信息,但现有方法在多跳事实回忆方面表现出显著的性能衰减。当编辑涉及推理链中的中间隐式主体时,这种失败尤为严重。通过因果分析,我们发现这一限制源于忽视了链式知识在神经元层面如何被动态表示和利用。我们发现,在多跳推理过程中,隐式主体充当查询神经元,跨Transformer层依次激活对应的值神经元,逐步累积信息以得出最终答案,而先前的KE工作忽略了这一动态过程。受此见解的指导,我们提出了ACE:面向多跳事实回忆的归因控制知识编辑,该框架利用神经元级别的归因来识别和编辑这些关键的查询-值(Q-V)路径。ACE 为多跳KE提供了一种有机制依据的解决方案,在GPT-J上比最先进的方法高出9.44%,在Qwen3-8B上高出37.46%。我们的分析进一步揭示了Qwen3中更细粒度的激活模式,并表明值神经元的语义可解释性是由查询驱动的累积所协调的。这些发现为基于对内部推理机制的原理性理解来提升KE能力开辟了一条新途径。
Summary / 总结
The research aims to address the performance decay of Large Language Models (LLMs) in multi-hop factual recall during knowledge editing. The authors identify that existing methods fail to effectively handle implicit subjects in reasoning chains due to a lack of understanding of neuron-level dynamics. They propose ACE, a framework that uses neuron-level attribution to identify and edit critical query-value pathways, thereby improving multi-hop factual recall. Experiments show that ACE outperforms state-of-the-art methods by 9.44% on GPT-J and 37.46% on Qwen3-8B.
研究旨在解决大型语言模型(LLMs)在知识编辑过程中多跳事实回忆性能下降的问题,现有方法未能关注链式知识在神经元层面的动态表示和利用,特别是在涉及隐式主体的多跳推理中。作者提出了一种名为ACE的框架,利用神经元级别的归因来识别和编辑关键的查询-值路径,该方法在GPT-J上的性能提高了9.44%,在Qwen3-8B上的性能提高了37.46%,优于最先进的方法。