arXiv 论文速递

2026-02-04 03:51
Snapshot: 20260204_0351
Reward-free Alignment for Conflicting Objectives
Authors: Peter Chen, Xiaopeng Li, Xi Chen, Tianyi Lin
First: 2026-02-02T18:59:52+00:00 · Latest: 2026-02-02T18:59:52+00:00
Comments: 27 pages
Abstract
Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.
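The claim that a weighted loss may fail to find an update direction improving every objective can be checked on a toy case. The sketch below is not RACO or its clipped conflict-averse update; it uses made-up two-dimensional gradients and compares a fixed 50/50 weighted sum against the classic minimum-norm combination of two gradients. All names (`combine`, `min_norm_weight`) and the numbers are assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def combine(g1, g2, w1):
    """Return the convex combination w1 * g1 + (1 - w1) * g2."""
    return [w1 * a + (1.0 - w1) * b for a, b in zip(g1, g2)]


def min_norm_weight(g1, g2):
    """Weight on g1 of the minimum-norm point on the segment between g1 and g2."""
    diff = [b - a for a, b in zip(g1, g2)]  # g2 - g1
    denom = dot(diff, diff)
    if denom == 0.0:  # identical gradients: any weight gives the same direction
        return 0.5
    return min(1.0, max(0.0, dot(diff, g2) / denom))


# Hypothetical, strongly conflicting gradients of two objectives.
g1 = [1.0, 0.0]
g2 = [-0.9, 0.1]

weighted = combine(g1, g2, 0.5)                    # naive fixed weighting
common = combine(g1, g2, min_norm_weight(g1, g2))  # minimum-norm combination

# To first order, a step along -d lowers objective i only if dot(d, g_i) > 0.
for name, d in [("weighted", weighted), ("min-norm", common)]:
    print(name, dot(d, g1), dot(d, g2))
```

In this toy case the weighted direction has a negative inner product with `g2`, so a step along it raises the second objective to first order, while the minimum-norm direction has positive inner products with both gradients.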
中文标题/摘要
标题:冲突目标的无奖励对齐
直接对齐方法越来越多地用于使大型语言模型(LLMs)与人类偏好对齐。然而,许多实际对齐问题涉及多个冲突的目标,其中简单的偏好聚合可能导致训练不稳定和较差的权衡。特别是,加权损失方法可能无法识别同时改善所有目标的更新方向,而现有的多目标方法通常依赖于显式的奖励模型,引入了额外的复杂性并扭曲了用户指定的偏好。本文的贡献有两个方面。首先,我们提出了一种名为冲突目标无奖励对齐(RACO)的框架,直接利用成对偏好数据并通过一种新颖的冲突规避梯度下降的裁剪变体解决梯度冲突。我们提供了收敛到尊重用户指定目标权重的帕累托关键点的保证,并进一步表明在两目标设置中裁剪可以严格提高收敛速度。其次,我们使用一些启发式方法改进了该方法,并进行了实验以证明所提出框架在LLM对齐中的兼容性。在多个LLM家族(Qwen 3、Llama 3、Gemma 3)上的多目标总结和安全性对齐任务的定性和定量评估中,我们的方法始终优于现有的多目标对齐基线。
Summary / 总结
This paper addresses the challenge of aligning large language models with multiple conflicting objectives by proposing a Reward-free Alignment framework for Conflicted Objectives (RACO). RACO uses pairwise preference data directly and resolves gradient conflicts with a novel clipped variant of conflict-averse gradient descent. The method comes with convergence guarantees to Pareto-critical points that respect user-specified objective weights, and clipping is shown to strictly improve the convergence rate in the two-objective setting. Experiments on multi-objective summarization and safety alignment tasks across Qwen 3, Llama 3, and Gemma 3 show that RACO achieves better Pareto trade-offs than existing multi-objective alignment baselines.
本文提出了一种名为RACO的Reward-free Alignment框架,用于解决大型语言模型与多个冲突目标之间的对齐问题。RACO利用成对偏好数据和一种新的剪裁变体的冲突规避梯度下降来解决梯度冲突。实验表明,RACO在多目标总结和安全性对齐任务中实现了比现有多目标对齐方法更好的帕累托折衷,适用于不同的LLM家族。
MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training
Authors: Dulhan Jayalath, Oiwi Parker Jones
First: 2026-02-02T18:59:50+00:00 · Latest: 2026-02-02T18:59:50+00:00
Comments: 19 pages, 8 figures, 5 tables
Abstract
Clinical brain-to-text interfaces are designed for paralysed patients who cannot provide extensive training recordings. Pre-training improves data-efficient generalisation by learning statistical priors across subjects, but these priors critically depend on context. While natural speech might unfold gradually over minutes, most methods pre-train with only a few seconds of context. Thus, we propose MEG-XL, a model pre-trained with 2.5 minutes of MEG context per sample, 5-300x longer than prior work, and equivalent to 191k tokens, capturing extended neural context. Fine-tuning on the task of word decoding from brain data, MEG-XL matches supervised performance with a fraction of the data (e.g. 1hr vs 50hrs) and outperforms brain foundation models. We find that models pre-trained with longer contexts learn representations that transfer better to word decoding. Our results indicate that long-context pre-training helps exploit extended neural context that other methods unnecessarily discard. Code, model weights, and instructions are available at https://github.com/neural-processing-lab/MEG-XL .
中文标题/摘要
标题:MEG-XL:通过长上下文预训练实现数据高效的大脑到文本接口
临床大脑到文本接口旨在为无法提供大量训练录音的瘫痪患者设计。预训练通过在不同受试者之间学习统计先验来提高数据高效泛化,但这些先验严重依赖于上下文。虽然自然语言可能在几分钟内逐渐展开,但大多数方法仅使用几秒钟的上下文进行预训练。因此,我们提出了MEG-XL模型,该模型每样本预训练使用2.5分钟的MEG上下文,比以往工作长5-300倍,相当于191k个令牌,捕捉到扩展的神经上下文。在从大脑数据解码单词的任务上微调,MEG-XL使用少量数据(例如1小时 vs 50小时)匹配监督性能,并优于大脑基础模型。我们发现,使用更长上下文进行预训练的模型在解码单词方面学到的表示具有更好的迁移性。我们的结果表明,长上下文预训练有助于利用其他方法不必要的丢弃的扩展神经上下文。代码、模型权重和说明可在https://github.com/neural-processing-lab/MEG-XL 获取。
Summary / 总结
MEG-XL is designed for clinical brain-to-text interfaces for paralyzed patients with limited training data. It uses long-context pre-training to learn statistical priors, with 2.5 minutes of context per sample, 5-300 times longer than previous methods. MEG-XL matches supervised performance with much less data and outperforms brain foundation models. Longer pre-training context improves transfer to word decoding tasks.
MEG-XL 旨在为瘫痪患者设计临床脑-文本接口,这些患者的数据训练有限。它通过每样本2.5分钟的上下文进行预训练,比以往方法长5-300倍,以捕捉扩展的神经上下文。在从脑数据解码单词的任务上微调后,MEG-XL 以少量数据匹配监督性能并优于脑基础模型。更长的预训练上下文有助于模型在解码任务上更好地转移。
DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
Authors: Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Zhiqi Bai, Yuchi Xu, Wenbo Su, Bo Zheng
Venue: ICLR 2026
First: 2025-08-18T08:49:29+00:00 · Latest: 2026-02-02T18:59:08+00:00
Comments: Accepted to ICLR 2026. Project page: https://attention-is-all-i-need.github.io/Design-Logic-Reasoning
Abstract
Large language models (LLMs) perform strongly on many language tasks but still struggle with complex multi-step reasoning across disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, as well as guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents to generate multidisciplinary questions. The central insight is the notion of Design Logic, a form of reusable meta-knowledge that encapsulates the structured process human experts use to transform knowledge into complex exam questions, enabling LLMs to generate new questions with the same complex reasoning patterns from entirely different source texts with explicit control over difficulty, diversity, and question types. We use LLMs to reverse-engineer and abstract over 120,000 Design Logics from existing questions across various disciplines. By designing a two-stage retrieve-and-generate mechanism to match these Design Logics with raw corpus, we synthesized two large-scale reasoning datasets that span 75 disciplines: DLR-Book (3.04 million questions from the book corpus) and DLR-Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity compared to those in the baseline datasets. Supervised fine-tuning (SFT) on Qwen3 and Llama3 with our data substantially improves multidisciplinary reasoning and outperforms baseline datasets. Notably, by applying SFT on the base versions of these models using only our data, we even surpass their official final models that have undergone the full post-training.
中文标题/摘要
标题:DESIGNER: 设计逻辑引导的多学科数据合成以指导LLM推理
大型语言模型(LLMs)在许多语言任务上表现出色,但在跨学科的复杂多步推理方面仍然存在困难。现有的推理数据集往往缺乏学科广度、推理深度和多样性,以及问题合成的指导原则。我们提出了DESIGNER:一种设计逻辑引导的推理数据合成管道,利用自然可用的大量原始文档生成多学科问题。核心洞察是设计逻辑的概念,这是一种可重用的元知识形式,封装了人类专家将知识转化为复杂考试问题的结构化过程,使LLMs能够从完全不同的原始文本中生成具有相同复杂推理模式的新问题,并且具有对难度、多样性和问题类型的显式控制。我们使用LLMs从各个学科的现有问题中逆向工程并抽象出超过120,000个设计逻辑。通过设计两阶段检索和生成机制将这些设计逻辑与原始语料库匹配,我们合成了两个大规模推理数据集,涵盖了75个学科:DLR-Book(来自书本语料库的304万个问题)和DLR-Web(来自网络语料库的166万个问题)。数据分析表明,我们方法合成的问题比基线数据集中的问题更具难度和多样性。在Qwen3和Llama3上的监督微调(SFT)显著提高了多学科推理能力,并优于基线数据集。值得注意的是,仅使用我们数据对这些模型的基础版本进行SFT,我们甚至超过了它们经过完整后训练的官方最终模型。
Summary / 总结
The paper proposes DESIGNER, a pipeline that uses Design Logic, reusable meta-knowledge about how human experts turn knowledge into complex exam questions, to synthesize multidisciplinary reasoning questions for LLM training. By reverse-engineering over 120,000 Design Logics from existing questions and matching them to raw corpora, the authors synthesize two large-scale datasets spanning 75 disciplines (DLR-Book and DLR-Web) whose questions are more difficult and diverse than those in baseline datasets. Supervised fine-tuning of Qwen3 and Llama3 on these datasets substantially improves multidisciplinary reasoning, and base models fine-tuned only on this data surpass the official final models that underwent full post-training.
论文提出了DESIGNER,一种使用设计逻辑生成跨学科问题的管道,以提高LLM的推理能力。通过从现有问题中反向工程120,000个设计逻辑并合成两个大规模数据集(DLR-Book和DLR-Web),该方法增强了问题的难度和多样性。使用这些数据集对Qwen3和Llama3进行监督微调,显著提高了LLM的跨学科推理性能,在某些情况下甚至超过了它们的官方最终模型。
RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System
Authors: Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang
First: 2026-02-02T18:59:04+00:00 · Latest: 2026-02-02T18:59:04+00:00
Comments: Code: https://github.com/Gen-Verse/Open-AgentRL
Abstract
We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also find that optimized reward-model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL
中文标题/摘要
标题:RLAnything:在完全动态的RL系统中锻造环境、策略和奖励模型
我们提出了RLAnything,这是一种强化学习框架,通过闭环优化动态锻造环境、策略和奖励模型,增强学习信号,强化整个RL系统,适用于任何LLM或代理场景。具体而言,策略通过逐步反馈和结果信号的整合进行训练,而奖励模型则通过一致性反馈联合优化,进一步提高策略训练效果。此外,基于理论动机的自动环境适应通过利用每个模型的评论反馈,增强了奖励和策略模型的训练,使学习经验化。实验证明,每个新增组件都一致地提高了整个系统的效果,RLAnything在各种代表性LLM和代理任务中取得了显著的提升,分别提高了Qwen3-VL-8B-Thinking在OSWorld上的9.1%,Qwen2.5-7B-Instruct在AlfWorld和LiveBench上的18.7%和11.9%。我们还发现优化后的奖励模型信号优于依赖于人工标签的结果。代码:https://github.com/Gen-Verse/Open-AgentRL
Summary / 总结
RLAnything is a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for LLM and agentic scenarios. The policy is trained using integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which further improves policy training. Empirically, each added component improves the system: RLAnything boosts Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. Optimized reward-model signals also outperform outcomes that rely on human labels.
RLAnything 是一种通过闭环优化动态构建环境、策略和奖励模型的强化学习框架,增强学习信号并改善各种场景下的整体 RL 系统。策略使用步骤和结果反馈进行训练,而奖励模型通过一致性反馈进行优化,进一步提升策略训练效果。实验表明,每个新增组件都能提升系统性能,RLAnything 在不同任务上的表现显著提升,分别提高了 Qwen3-VL-8B-Thinking 9.1%,Qwen2.5-7B-Instruct 18.7% 和 11.9%。优化后的奖励模型信号优于依赖人类标签的结果。
RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents
Authors: Jialiang Zhu, Gongrui Zhang, Xiaolong Ma, Lin Xu, Miaosen Zhang, Ruiqi Yang, Song Wang, Kai Qiu, Zhirong Wu, Qi Dai, Ruichun Ma, Bei Liu, Yifan Yang, Chong Luo, Zhengyuan Yang, Linjie Li, Lijuan Wang, Weizhu Chen, Xin Geng, Baining Guo
First: 2026-02-02T18:58:07+00:00 · Latest: 2026-02-02T18:58:07+00:00
Abstract
LLM-based deep research agents are largely built on the ReAct framework. This linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient search. We propose Re-TRAC, an agentic framework that performs cross-trajectory exploration by generating a structured state representation after each trajectory to summarize evidence, uncertainties, failures, and future plans, and conditioning subsequent trajectories on this state representation. This enables iterative reflection and globally informed planning, reframing research as a progressive process. Empirical results show that Re-TRAC consistently outperforms ReAct by 15-20% on BrowseComp with frontier LLMs. For smaller models, we introduce Re-TRAC-aware supervised fine-tuning, achieving state-of-the-art performance at comparable scales. Notably, Re-TRAC shows a monotonic reduction in tool calls and token usage across rounds, indicating progressively targeted exploration driven by cross-trajectory reflection rather than redundant search.
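The cross-trajectory loop described above (run a trajectory, compress it into a structured state, condition the next trajectory on that state) can be sketched as follows. This is a minimal illustration, not the Re-TRAC implementation: `SearchState`, `recursive_search`, and the stub callables `fake_run` and `fake_summarize` are hypothetical names, and in practice the two callables would be LLM-backed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SearchState:
    """Hypothetical structured summary carried from one trajectory to the next."""
    evidence: List[str] = field(default_factory=list)
    uncertainties: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)
    plans: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the state as plain text that can be prepended to the next round."""
        sections = [
            ("Evidence", self.evidence),
            ("Uncertainties", self.uncertainties),
            ("Failures", self.failures),
            ("Plans", self.plans),
        ]
        return "\n".join(f"{title}: {'; '.join(items) or 'none'}" for title, items in sections)


def recursive_search(
    question: str,
    run_trajectory: Callable[[str], str],
    summarize: Callable[[SearchState, str], SearchState],
    rounds: int = 3,
) -> SearchState:
    """Run several trajectories, each conditioned on the state from earlier ones."""
    state = SearchState()
    for _ in range(rounds):
        prompt = f"{question}\n\n{state.to_prompt()}"
        trajectory = run_trajectory(prompt)   # e.g. one ReAct-style rollout
        state = summarize(state, trajectory)  # compress the rollout into a new state
    return state


# Stub callables so the sketch runs without any model (assumptions, not real agents).
def fake_run(prompt: str) -> str:
    return f"rollout conditioned on {len(prompt)} characters of context"


def fake_summarize(state: SearchState, trajectory: str) -> SearchState:
    return SearchState(
        evidence=state.evidence + [trajectory],
        plans=["check a second source"],
    )


final = recursive_search("Who wrote X?", fake_run, fake_summarize, rounds=3)
print(final.to_prompt())
```

Passing the state as rendered text keeps each round's context bounded by the summary size instead of the full history of earlier rollouts, which is the design choice this sketch is meant to highlight.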
中文标题/摘要
标题:RE-TRAC: RE递归轨迹压缩用于深度搜索代理
基于LLM的深度研究代理大多建立在ReAct框架之上。这种线性设计使得难以回顾早期状态、分支到替代搜索方向或在长上下文中保持全局意识,通常导致局部最优解、重复探索和低效搜索。我们提出了一种名为Re-TRAC的代理框架,该框架在每次轨迹后生成结构化的状态表示来总结证据、不确定性、失败和未来计划,并基于此状态表示条件化后续轨迹。这使得迭代反思和全局指导规划成为可能,将研究重新定义为渐进过程。实验证明,Re-TRAC在使用前沿LLM的BrowseComp上比ReAct高出15-20%。对于较小的模型,我们引入了Re-TRAC意识的监督微调,实现了在相似规模下的最佳性能。值得注意的是,Re-TRAC在各轮次中工具调用和令牌使用量呈现出单调减少的趋势,表明通过跨轨迹反思驱动的渐进探索而非重复搜索实现了逐步聚焦的探索。
Summary / 总结
The paper addresses the limitations of the ReAct framework in revisiting earlier states, branching into alternative search directions, and maintaining global awareness. It introduces Re-TRAC, which generates a structured state representation after each trajectory to summarize evidence and future plans, enabling iterative reflection and globally informed planning. Experiments show that Re-TRAC outperforms ReAct by 15-20% on BrowseComp with frontier LLMs and achieves state-of-the-art performance with smaller models through supervised fine-tuning, reducing tool calls and token usage across rounds.
论文针对ReAct框架在重新访问早期状态、分支到替代搜索方向以及在长上下文中保持全局意识方面的局限性,提出了Re-TRAC框架。该框架在每个轨迹后生成一个结构化的状态表示,总结证据、不确定性以及未来计划,并在后续轨迹中基于此状态进行条件化。这使得可以进行迭代反思和全局指导的规划。实验结果显示,Re-TRAC在BrowseComp与前沿LLM上比ReAct高出15-20%,并通过监督微调在较小模型上达到最先进的性能。Re-TRAC还显示出工具调用和令牌使用量的逐步减少,表明了更精准的探索。
Expanding the Capabilities of Reinforcement Learning via Text Feedback
Authors: Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, Andrea Zanette
First: 2026-02-02T18:56:56+00:00 · Latest: 2026-02-02T18:56:56+00:00
Comments: 43 pages, 6 figures
Abstract
The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout as binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), where text feedback is available during training but not at inference. Therefore, models must learn to internalize the feedback in order to improve their test-time single-turn performance. To do this, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis on both methods, and empirically evaluate on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.
中文标题/摘要
标题:通过文本反馈扩展强化学习的能力
对于LLM后训练的RL成功,其根源在于一个不合理地不具信息性的来源:每次回放中仅有一比特信息作为二元奖励或偏好标签。在另一极端,蒸馏提供了密集的监督,但需要演示,这既昂贵又难以扩展。我们研究文本反馈作为中间信号:比标量奖励更丰富,但比完整演示更便宜。文本反馈是人类互动的自然方式,并且在许多现实世界环境中已经非常普遍,用户、注释者和自动裁判经常对LLM输出进行评价。为了大规模利用文本反馈,我们形式化了一个多轮RL设置,文本反馈强化学习(RLTF),其中在训练期间可用文本反馈,但在推理时不可用。因此,模型必须学会内化反馈以提高其测试时单轮性能。为此,我们提出了两种方法:自我蒸馏(RLTF-SD),训练单轮策略使其匹配自身反馈条件下的第二轮生成;反馈建模(RLTF-FM),将预测反馈作为辅助目标。我们对这两种方法进行了理论分析,并在推理谜题、竞赛数学和创造性写作任务上进行了实证评估。结果显示,两种方法在基准测试中均优于强基线,突显了大规模使用额外丰富监督源的RL潜力。
Summary / 总结
This paper explores the use of text feedback to enhance reinforcement learning (RL) for large language models (LLMs). The motivation is to provide richer feedback than binary rewards but without the cost of full demonstrations. The authors propose two methods: Self Distillation (RLTF-SD) and Feedback Modeling (RLTF-FM), which train the model to internalize text feedback during training. Experiments on reasoning puzzles, competition math, and creative writing tasks show that both methods outperform strong baselines, demonstrating the potential of text feedback in RL for LLMs.
本文探讨了使用文本反馈来增强大型语言模型(LLM)的强化学习(RL)。动机是提供比二元奖励更丰富的反馈,但又不需要完整的演示。作者提出了两种方法:自我蒸馏(RLTF-SD)和反馈建模(RLTF-FM),这些方法在训练过程中让模型内化文本反馈。实验结果显示,这两种方法在逻辑谜题、竞赛数学和创造性写作任务上均优于强基线,展示了使用文本反馈来提高RL性能的潜力。
Flow Policy Gradients for Robot Control
Authors: Brent Yi, Hongsuk Choi, Himanshu Gaurav Singh, Xiaoyu Huang, Takara E. Truong, Carmelo Sferrazza, Yi Ma, Rocky Duan, Pieter Abbeel, Guanya Shi, Karen Liu, Angjoo Kanazawa
First: 2026-02-02T18:56:49+00:00 · Latest: 2026-02-02T18:56:49+00:00
Comments: Project webpage: https://hongsukchoi.github.io/fpo-control
Abstract
Likelihood-based policy gradient methods are the dominant approach for training robot control policies from rewards. These methods rely on differentiable action likelihoods, which constrain policy outputs to simple distributions like Gaussians. In this work, we show how flow matching policy gradients -- a recent framework that bypasses likelihood computation -- can be made effective for training and fine-tuning more expressive policies in challenging robot control settings. We introduce an improved objective that enables success in legged locomotion, humanoid motion tracking, and manipulation tasks, as well as robust sim-to-real transfer on two humanoid robots. We then present ablations and analysis on training dynamics. Results show how policies can exploit the flow representation for exploration when training from scratch, as well as improved fine-tuning robustness over baselines.
中文标题/摘要
标题:机器人控制的流策略梯度
基于似然性的策略梯度方法是通过奖励训练机器人控制策略的主要方法。这些方法依赖于可微的动作似然性,这限制了策略输出到简单的分布,如高斯分布。在本文中,我们展示了如何使流匹配策略梯度——一种最近的框架,绕过了似然性计算——能够有效地训练和微调更具表现力的策略,以应对具有挑战性的机器人控制环境。我们引入了一个改进的目标,使其能够在腿部运动、类人运动跟踪和操作任务中取得成功,并在两个类人机器人上实现了鲁棒的仿真到现实的转移。然后,我们提出了训练动态的消融实验和分析。结果表明,当从头开始训练时,策略如何利用流表示进行探索,以及与基线相比,改进了微调的鲁棒性。
Summary / 总结
This paper addresses the limitations of likelihood-based policy gradients, which restrict policy outputs to simple distributions such as Gaussians, by showing how flow matching policy gradients, a recent framework that bypasses likelihood computation, can train and fine-tune more expressive robot control policies. An improved objective enables success in legged locomotion, humanoid motion tracking, and manipulation tasks, as well as robust sim-to-real transfer on two humanoid robots. Ablations show that policies exploit the flow representation for exploration when trained from scratch and that fine-tuning is more robust than with baselines.
本文解决了基于似然性的策略梯度方法在训练复杂机器人控制策略时的局限性。它引入了流匹配策略梯度,该方法不需要似然计算,从而允许使用更复杂的策略。该方法在包括腿部运动、类人运动跟踪和操作等任务中进行了测试,显示出成功和稳健的模拟到现实世界的转移。此外,研究还提供了训练动态的分析,并展示了使用流表示进行探索和微调的好处。
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
Authors: Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang
First: 2026-01-20T18:58:10+00:00 · Latest: 2026-02-02T18:56:21+00:00
Comments: 27 pages. Project page: https://github.com/UmeanNever/RankSurprisalRatio
Abstract
Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
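The definition of RSR given above, average token-wise rank divided by average negative log-likelihood, can be written out directly. The sketch below assumes the student model's next-token distribution is available at every position, that ranks are 1-indexed with rank 1 for the most probable token, and that ties are resolved by counting only strictly more probable tokens; these conventions and the function name `rank_surprisal_ratio` are assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def rank_surprisal_ratio(
    token_probs: Sequence[Sequence[float]],
    targets: Sequence[int],
) -> float:
    """Average rank of the trajectory tokens divided by their average surprisal.

    token_probs[i] is the student's probability distribution at position i;
    targets[i] is the index of the token the trajectory actually contains there.
    Target probabilities must be positive, and the average surprisal nonzero.
    """
    ranks = []
    surprisals = []
    for probs, target in zip(token_probs, targets):
        p = probs[target]
        # Rank 1 means no other token is strictly more probable than the target.
        ranks.append(1 + sum(1 for q in probs if q > p))
        surprisals.append(-math.log(p))
    mean_rank = sum(ranks) / len(ranks)
    mean_surprisal = sum(surprisals) / len(surprisals)
    return mean_rank / mean_surprisal


# Toy three-token vocabulary and a two-token trajectory (made-up numbers).
probs = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
targets = [0, 2]
rsr = rank_surprisal_ratio(probs, targets)
print(rsr)
```

Here the target tokens have ranks 1 and 2 and probabilities 0.5 and 0.3, so the ratio is 1.5 divided by the mean of -ln 0.5 and -ln 0.3.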
中文标题/摘要
标题:哪种推理轨迹能更好地教授学生推理?一种简单的信息对齐度量
长链推理(CoT)轨迹为从教师到学生大语言模型提炼推理提供了丰富的监督信号。然而,先前的工作和我们的实验都表明,来自更强教师的轨迹并不一定产生更好的学生模型,突显了数据-学生适合性在提炼中的重要性。现有方法主要通过学生模型的似然性来评估适合性,倾向于那些与学生模型当前行为高度对齐的轨迹,而忽视了更具信息性的轨迹。为解决这一问题,我们提出了一种简单度量——排名惊异比(RSR),该度量同时捕捉对齐和信息性来评估推理轨迹的适合性。RSR 的动机是观察到有效的轨迹通常通过结合低绝对概率和相对高排名的令牌来平衡学习信号强度和行为对齐。具体而言,RSR 定义为轨迹的平均令牌排名与其平均负对数似然的比率,易于计算和解释。在五个学生模型和来自 11 位不同教师的推理轨迹上,RSR 与训练后推理性能(平均斯皮尔曼相关系数 0.86)高度相关,并且始终优于现有度量。我们进一步展示了其在轨迹选择和教师选择中的实际应用价值。
Summary / 总结
This study investigates the effectiveness of reasoning trajectories in teaching LLMs to reason better, focusing on the importance of data-student suitability. It introduces Rank-Surprisal Ratio (RSR), a metric that balances alignment and informativeness, outperforming existing methods in assessing trajectory suitability. Experiments across various student models and teacher trajectories show RSR strongly correlates with improved reasoning performance post-training.
该研究通过提出平衡对齐性和信息性的Rank-Surprisal Ratio (RSR)指标,探讨了推理轨迹在教导语言模型推理能力方面的有效性。RSR在评估推理轨迹的适用性方面优于现有方法,并与后续训练的推理性能有很强的相关性。该方法被用于选择轨迹和教师,展示了其实用性。
Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability
Authors: Xiao Liang, Zhong-Zhi Li, Zhenghao Lin, Eric Hancheng Jiang, Hengyuan Zhang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Yeyun Gong, Weizhu Chen
First: 2026-02-02T18:54:54+00:00 · Latest: 2026-02-02T18:54:54+00:00
Abstract
Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
中文标题/摘要
标题:训练LLMs进行分而治之推理提升测试时扩展性
大型语言模型(LLMs)通过逐步链式思考(CoT)推理展示了强大的推理能力。然而,在模型能力的极限下,CoT往往不够充分,其严格顺序的性质限制了测试时的扩展性。一种潜在的替代方法是分而治之(DAC)推理,它将复杂问题分解为子问题以促进更有效的解决方案探索。尽管前景看好,但我们的分析揭示了通用后训练与DAC风格推理之间基本的不一致,这限制了模型充分利用这一潜力的能力。为了弥合这一差距并完全解锁LLMs在最具有挑战性任务上的推理能力,我们提出了一种端到端的强化学习(RL)框架来增强其DAC风格的推理能力。在每一步中,策略将问题分解为一组子问题,依次解决它们,并在基于子问题解决方案的基础上解决原始问题,将分解和解决方案都整合到RL训练中。在相似的训练下,我们的DAC风格框架赋予模型更高的性能上限和更强的测试时扩展性,在竞赛级别基准测试中,Pass@1和Pass@32分别超越CoT 8.6%和6.3%。
Summary / 总结
This paper addresses the limitations of step-by-step chain-of-thought (CoT) reasoning in large language models (LLMs) at the limits of model capability, where CoT's strictly sequential nature constrains test-time scalability. The authors propose a divide-and-conquer (DAC) reasoning framework, which decomposes complex problems into subproblems, solves them, and then answers the original problem conditioned on the subproblem solutions. Trained end-to-end with reinforcement learning (RL), the model surpasses CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
论文针对大型语言模型(LLMs)在极限能力下链式思考(CoT)推理的局限性,重点在于CoT的顺序性质限制了测试时的可扩展性。为克服这一问题,作者提出了一种分而治之(DAC)推理框架,使用端到端的强化学习(RL)方法。该框架将复杂问题分解为子问题,依次解决这些子问题,并基于子问题的解决方案来解决原始问题。实验结果显示,这种方法在基准测试中的Pass@1和Pass@32上分别比CoT高出8.6%和6.3%,展示了更高的性能和更强的测试时可扩展性。
AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
Authors: Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, Chetan Bansal
First: 2026-02-02T18:54:07+00:00 · Latest: 2026-02-02T18:54:07+00:00
Abstract
AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and releasing a novel benchmark of 115 failed trajectories spanning structured API workflows, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory-derived, cross-domain failure taxonomy. To mitigate the human cost of failure attribution, we present AGENTRX, an automated domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints, evaluates them step-by-step, and produces an auditable validation log of constraint violations with associated evidence; an LLM-based judge uses this log to localize the critical step and category. Our framework improves step localization and failure attribution over existing baselines across three domains.
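The step-by-step constraint evaluation and validation log described above can be illustrated with a small sketch. It is not the AGENTRX implementation: the constraints here are hand-written predicates instead of synthesized ones, the trajectory format (a list of dictionaries with `tool` and `output` keys) is made up, and no LLM judge is involved. The names `Violation` and `build_validation_log` are hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple

Step = Dict[str, str]


class Violation(NamedTuple):
    step: int        # index of the step in the trajectory
    constraint: str  # name of the violated constraint
    evidence: str    # raw step content kept for auditing


def build_validation_log(
    trajectory: List[Step],
    constraints: Dict[str, Callable[[Step], bool]],
) -> List[Violation]:
    """Check every constraint at every step and record each violation with evidence."""
    log: List[Violation] = []
    for index, step in enumerate(trajectory):
        for name, holds in constraints.items():
            if not holds(step):
                log.append(Violation(index, name, repr(step)))
    return log


# Hypothetical constraints; a real system would derive these per task or domain.
constraints = {
    "tool_output_not_error": lambda s: "error" not in s["output"].lower(),
    "tool_is_known": lambda s: s["tool"] in {"search", "open_file"},
}

# Made-up failed trajectory.
trajectory = [
    {"tool": "search", "output": "3 results"},
    {"tool": "open_file", "output": "Error: file not found"},
    {"tool": "delete_db", "output": "ok"},
]

log = build_validation_log(trajectory, constraints)
for violation in log:
    print(violation.step, violation.constraint)
```

A downstream judge would read a log like this to decide which violation marks the critical failure step; taking the earliest violation is one simple heuristic, though the abstract assigns that decision to an LLM-based judge.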
中文标题/摘要
标题:AgentRx:从执行轨迹诊断AI代理故障
AI代理常常以难以定位的方式失败,因为执行具有概率性、长时序、多代理性,并且受到嘈杂工具输出的中介。我们通过手动标注失败的代理运行并发布了一个包含115个失败轨迹的新基准,这些轨迹涵盖了结构化API工作流、事件管理以及开放性网络/文件任务。每个轨迹都标注了一个关键失败步骤和一个从扎根理论推导出的跨域故障分类学中的类别。为了减轻失败归因的人力成本,我们提出了AGENTRX,一个自动化的领域无关诊断框架,能够定位失败代理轨迹中的关键失败步骤。它综合了约束条件,逐步骤评估它们,并生成一个可审计的验证日志,其中包含与证据相关的约束违规;基于LLM的法官使用这个日志来定位关键步骤和类别。我们的框架在三个领域中优于现有基线,提高了步骤定位和故障归因的准确性。
Summary / 总结
The paper addresses the challenge of diagnosing AI agent failures in probabilistic, long-horizon, multi-agent scenarios by manually annotating 115 failed trajectories across different domains. It introduces AgentRx, an automated diagnostic framework that identifies the critical failure step and categorizes it using a validation log of constraint violations. The framework outperforms existing methods in localizing the failure step and attributing the failure across structured API workflows, incident management, and open-ended web/file tasks.
论文旨在解决AI代理在执行过程中出现的故障诊断难题,这些故障通常是概率性的、长期的和多代理的。研究引入了一个包含115个失败轨迹的基准数据集,涵盖了各种任务。该研究提出了一个名为AgentRx的自动化诊断框架,能够自动识别故障的关键步骤。该框架通过使用约束条件和基于LLM的法官来定位故障,并在不同领域中比现有方法提高了故障定位和归因的准确性。
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Authors: Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang
First: 2026-02-02T18:53:28+00:00 · Latest: 2026-02-02T18:53:28+00:00
Comments: Code is available at https://github.com/ViktorAxelsen/MemSkill
Abstract
Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a controller that learns to select a small set of relevant skills, paired with an LLM-based executor that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.
中文标题/摘要
标题:MemSkill:自我进化代理的习得与进化记忆技能
大多数大型语言模型(LLM)代理的记忆系统依赖于一组静态的手工设计的操作来提取记忆。这些固定的程序硬编码了关于存储什么以及如何修订记忆的人类先验知识,使其在多样化的交互模式下变得僵硬,并且在处理长期历史记录时效率低下。为此,我们提出了**MemSkill**,它将这些操作重新定义为可学习和可进化的记忆技能,这些技能是结构化且可重用的提取、巩固和修剪交互痕迹信息的规程。受代理技能设计哲学的启发,MemSkill 使用一个**控制器**来学习选择一组相关的技能,并配以基于LLM的**执行器**,后者根据技能生成指导性记忆。除了学习技能选择之外,MemSkill 引入了一个**设计师**,它会定期回顾那些被选中技能产生错误或不完整记忆的难例,并通过提出改进和新技能来进化技能集。总体而言,MemSkill 形成了一种闭环过程,可以提高技能选择策略本身以及技能集。在LoCoMo、LongMemEval、HotpotQA和ALFWorld上的实验表明,MemSkill 在强基线之上提高了任务性能,并且在不同场景中具有良好的泛化能力。进一步的分析揭示了技能如何进化,为LLM代理的更适应性和自我进化记忆管理提供了见解。
Summary / 总结
MemSkill is designed to enhance memory management in LLM agents by learning and evolving memory skills, which are reusable routines for extracting, consolidating, and pruning information from interaction traces. The system includes a controller for selecting relevant skills, an executor for producing skill-guided memories, and a designer for evolving the skill set. Experiments show that MemSkill improves task performance and generalizes well across different settings compared to strong baselines.
MemSkill 通过学习和进化记忆技能来提升 LLM 代理的记忆管理,这些技能是用于从交互痕迹中提取、巩固和修剪信息的结构化和可重用的规程。系统使用控制器来选择相关技能,LLM 基础的执行器来生成技能导向的记忆。此外,一个设计师会定期审查并进化技能集以提高性能。实验表明,MemSkill 在 LoCoMo、LongMemEval、HotpotQA 和 ALFWorld 等各种设置中均优于强基线并表现出良好的泛化能力。
Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge
Authors: Xutao Ma, Yixiao Huang, Hanlin Zhu, Somayeh Sojoudi
First: 2026-02-02T18:50:57+00:00 · Latest: 2026-02-02T18:50:57+00:00
Abstract
Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple logical reasoning such as the "reversal curse" -- when trained on forward knowledge data of the form "$A \rightarrow B$" (e.g., Alice's husband is Bob), the model is unable to deduce the reversal knowledge "$B \leftarrow A$" (e.g., Bob's wife is Alice) during test. Extensive prior research suggests that this failure is an inherent, fundamental limit of autoregressive causal LLMs, indicating that these models tend to memorize factual-level knowledge rather than capture higher-level rules. In this paper, we challenge this view by showing that this seemingly fundamental limit can be mitigated by slightly tweaking the training data with a simple regularization data recipe called the Identity Bridge of the form "$A \to A$" (e.g., The name of Alice is Alice). Theoretically, we prove that under this recipe, even a one-layer transformer can break the reversal curse by analyzing the implicit bias of gradient descent. Empirically, we show that a 1B pretrained language model finetuned with the proposed data recipe achieves a 40% success rate on reversal tasks, in stark contrast to a near-zero success rate when trained solely on forward-knowledge data. Our work provides a novel theoretical foundation for the reversal curse and offers a principled, low-cost path to encouraging LLMs to learn higher-level rules from data.
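The data recipe described above, forward facts of the form A → B augmented with Identity Bridge examples of the form A → A, can be sketched as a small data-building routine. The sentence templates reuse the abstract's own examples; adding the identity sentence only for entity A, one per forward fact, is an assumption, since the abstract does not state which entities are covered or the mixing ratio. The function names are hypothetical.

```python
from typing import List, Tuple


def forward_example(a: str, b: str) -> str:
    """Forward-knowledge sentence of the form A -> B."""
    return f"{a}'s husband is {b}"


def identity_example(a: str) -> str:
    """Identity Bridge sentence of the form A -> A."""
    return f"The name of {a} is {a}"


def build_training_set(pairs: List[Tuple[str, str]]) -> List[str]:
    """Interleave each forward fact with an identity sentence for its subject.

    Assumption: one identity sentence per forward fact, for entity A only.
    """
    data: List[str] = []
    for a, b in pairs:
        data.append(forward_example(a, b))
        data.append(identity_example(a))
    return data


examples = build_training_set([("Alice", "Bob")])
for line in examples:
    print(line)
```

The reversal query ("Bob's wife is ...") is deliberately absent from the training set; it would only appear at test time.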
中文标题/摘要
标题:通过身份桥梁打破自回归语言模型的反转诅咒
自回归大型语言模型(LLMs)在许多复杂任务中取得了显著成功,但在简单的逻辑推理任务如“反转诅咒”中仍会失败——当训练数据为“$A \rightarrow B$”形式(例如,爱丽丝的丈夫是鲍勃)时,模型在测试中无法推断出“$B \leftarrow A$”形式的反转知识(例如,鲍勃的妻子是爱丽丝)。大量前期研究认为,这种失败是自回归因果LLMs固有的、基本的限制,表明这些模型倾向于记忆事实性知识而非捕捉高层次规则。在本文中,我们挑战了这一观点,通过使用一种简单的正则化数据食谱——“身份桥梁”形式的“$A \to A$”(例如,爱丽丝的名字是爱丽丝)——来稍微调整训练数据,从而减轻了这一看似基本的限制。理论上,我们证明,在这种食谱下,即使是一层的变换器也能通过分析梯度下降的隐式偏置来打破反转诅咒。实验上,我们展示了使用所提数据食谱微调的1B预训练语言模型在反转任务中的成功率为40%,而在仅使用前向知识数据训练时,成功率几乎为零。我们的工作为反转诅咒提供了新的理论基础,并提供了一条鼓励LLMs从数据中学习高层次规则的原理性、低成本路径。
Summary / 总结
This paper addresses the 'reversal curse' in autoregressive language models, where models struggle to infer reverse relationships even after being trained on forward relationships. The authors propose a simple regularization technique called the Identity Bridge to mitigate this issue. Theoretically, they prove that even a one-layer transformer can break the reversal curse. Empirically, a 1B pretrained language model fine-tuned with this technique achieved a 40% success rate on reversal tasks, significantly higher than the near-zero success rate when only forward-knowledge data was used for training.
本文探讨了自回归语言模型中的‘反转诅咒’问题,即模型在训练于正向知识数据时难以推断反向知识。作者提出了一种简单的正则化方法,称为身份桥梁,以缓解这一问题。理论上,他们证明即使是单层 Transformer 也能打破反转诅咒。实验上,他们展示了使用该方法微调的1B预训练语言模型在反转任务上的成功率达到了40%,远高于仅训练于正向知识数据时近乎零的成功率。
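The Identity Bridge recipe described in the abstract above can be illustrated with a small data-construction sketch. The abstract only gives the form "A → A" and one example sentence; the triple format, the function name `build_training_set`, and the choice to add an identity sentence for every entity (both subjects and objects) are assumptions made here for illustration, not the authors' implementation.

```python
def build_training_set(forward_facts, identity_template="The name of {e} is {e}."):
    """Combine forward-knowledge sentences with Identity Bridge sentences.

    forward_facts: list of (subject, relation, object) triples (hypothetical format).
    identity_template: wording taken from the abstract's example; assumed here
    to be applied to every entity that appears in the forward facts.
    """
    samples = []
    entities = []
    for subj, rel, obj in forward_facts:
        # Forward knowledge of the form "A -> B".
        samples.append(f"{subj}'s {rel} is {obj}.")
        for entity in (subj, obj):
            if entity not in entities:
                entities.append(entity)
    # Identity Bridge regularization data of the form "A -> A".
    for entity in entities:
        samples.append(identity_template.format(e=entity))
    return samples


facts = [("Alice", "husband", "Bob")]
data = build_training_set(facts)
for line in data:
    print(line)
```

The resulting list would then be used as fine-tuning text in place of the forward-only data; reversal queries such as "Bob's wife is ..." are held out for evaluation.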
Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts
Authors: Aiden Yiliu Li, Xinyue Hao, Shilong Liu, Mengdi Wang
First: 2026-02-02T18:50:07+00:00 · Latest: 2026-02-02T18:50:07+00:00
Abstract
Despite advances in multimodal large language models, autonomous web agents still struggle to reliably execute long-horizon tasks on complex and dynamic web interfaces. Existing agents often suffer from inaccurate element grounding, the absence of site-specific procedural knowledge, and unstable long-term task tracking and memory, particularly when operating over complex Document Object Model structures. To address these limitations, we introduce Avenir-Web, a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. Avenir-Web leverages a Mixture of Grounding Experts, Experience-Imitation Planning for incorporating procedural priors, and a task-tracking checklist combined with adaptive memory to enable robust and seamless interaction across diverse user interface paradigms. We evaluate Avenir-Web on Online-Mind2Web, a rigorous benchmark of live and user-centered web tasks. Our results demonstrate that Avenir-Web significantly surpasses prior open-source agents and attains performance parity with top-tier proprietary models, thereby establishing a new open-source state of the art for reliable web agents on live websites.
中文标题/摘要
标题:Avenir-Web:基于混合定位专家的模仿人类经验的多模态网络代理
尽管在多模态大型语言模型方面取得了进展,但自主网络代理仍然难以可靠地在复杂和动态的网络界面中执行长期任务。现有代理往往遭受元素定位不准确、缺乏特定站点的操作知识以及长期任务跟踪和记忆不稳定等问题,尤其是在处理复杂的文档对象模型结构时。为解决这些限制,我们引入了Avenir-Web,这是一种在实际部署中达到Online-Mind2Web基准新开源先进水平的网络代理。Avenir-Web利用混合定位专家、经验模仿规划以融入操作先验,并结合任务跟踪清单和自适应记忆,以实现跨不同用户界面范式的稳健和无缝交互。我们在严格的Online-Mind2Web基准上评估了Avenir-Web,这是一个针对实时和用户中心网络任务的基准。我们的结果表明,Avenir-Web显著超越了先前的开源代理,并达到了顶级专有模型的性能水平,从而为可靠的实时网站网络代理建立了新的开源先进水平。
Summary / 总结
Avenir-Web is designed to improve the reliability of web agents in executing long-horizon tasks on complex web interfaces. It uses a Mixture of Grounding Experts, Experience-Imitation Planning, and a task-tracking checklist with adaptive memory. Avenir-Web outperforms previous open-source agents and matches the performance of top-tier proprietary models on the Online-Mind2Web benchmark, setting a new open-source state of the art for web agents.
Avenir-Web旨在提高网络代理在执行复杂网页界面的长期任务时的可靠性。它使用混合定位专家、经验模仿规划以及带有自适应记忆的任务跟踪清单。Avenir-Web在Online-Mind2Web基准测试中超越了之前的开源代理,并达到了顶级专有模型的性能,从而为可靠的网络代理设定了新的开源先进水平。
Indications of Belief-Guided Agency and Meta-Cognitive Monitoring in Large Language Models
Authors: Noam Steinmetz Yalon, Ariel Goldstein, Liad Mudrik, Mor Geva
First: 2026-02-02T18:49:39+00:00 · Latest: 2026-02-02T18:49:39+00:00
Abstract
Rapid advancements in large language models (LLMs) have sparked the question whether these models possess some form of consciousness. To tackle this challenge, Butlin et al. (2023) introduced a list of indicators for consciousness in artificial systems based on neuroscientific theories. In this work, we evaluate a key indicator from this list, called HOT-3, which tests for agency guided by a general belief-formation and action selection system that updates beliefs based on meta-cognitive monitoring. We view beliefs as representations in the model's latent space that emerge in response to a given input, and introduce a metric to quantify their dominance during generation. Analyzing the dynamics between competing beliefs across models and tasks reveals three key findings: (1) external manipulations systematically modulate internal belief formation, (2) belief formation causally drives the model's action selection, and (3) models can monitor and report their own belief states. Together, these results provide empirical support for the existence of belief-guided agency and meta-cognitive monitoring in LLMs. More broadly, our work lays methodological groundwork for investigating the emergence of agency, beliefs, and meta-cognition in LLMs.
中文标题/摘要
标题:信念引导的代理和元认知监控在大型语言模型中的迹象
大型语言模型(LLMs)的迅速发展引发了关于这些模型是否具有某种形式的意识的问题。为应对这一挑战,Butlin等人(2023)基于神经科学理论提出了一套评估人工系统意识的指标。在此项工作中,我们评估了该列表中的一个关键指标HOT-3,该指标测试由普遍信念形成和行动选择系统引导的代理,该系统根据元认知监控更新信念。我们认为信念是模型潜在空间中的表示,这些表示在响应给定输入时出现,并引入了一个度量标准来量化其在生成过程中的主导性。分析不同模型和任务中竞争信念之间的动态关系揭示了三个关键发现:(1)外部操纵系统性地调节内部信念形成,(2)信念形成因果地驱动模型的动作选择,(3)模型可以监控并报告其自身的信念状态。这些结果共同提供了大型语言模型中存在信念引导的代理和元认知监控的实证支持。更广泛地说,我们的工作为研究代理、信念和元认知在大型语言模型中的涌现奠定了方法论基础。
Summary / 总结
This study evaluates the HOT-3 indicator for consciousness in large language models (LLMs) by assessing agency guided by belief-formation and meta-cognitive monitoring. The research finds that external manipulations systematically affect internal belief formation, belief formation drives the model's actions, and models can monitor and report their belief states, providing evidence for belief-guided agency and meta-cognitive monitoring in LLMs.
该研究通过测试信念引导的代理和元认知监控来评估大型语言模型(LLMs)的HOT-3指标。研究发现外部操纵系统性地影响内部信念形成,信念形成驱动模型的行为,模型可以监控和报告自己的信念状态,为LLMs中存在信念引导的代理和元认知监控提供了实证支持。
MentisOculi: Revealing the Limits of Reasoning with Mental Imagery
Authors: Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel
First: 2026-02-02T18:49:06+00:00 · Latest: 2026-02-02T18:49:06+00:00
Comments: 9 pages, 8 figures
Abstract
Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.
中文标题/摘要
标题:MentisOculi:揭示心智图像推理的局限性
前沿模型正在从仅能摄取视觉信息的多模态大型语言模型(MLLMs)过渡到能够原生交错生成的统一多模态模型(UMMs)。这一转变激发了使用中间可视化作为推理辅助工具的兴趣,类似于人类的心智图像。这一想法的核心在于能够以目标导向的方式形成、维持和操作视觉表征。为了评估和探究这一能力,我们开发了MentisOculi,这是一个程序化的、分层的多步骤推理问题套件,适用于视觉解决方案,并针对前沿模型设置了挑战。评估从潜在标记到显式生成图像的各种视觉策略,我们发现它们通常未能提高性能。对UMMs的具体分析揭示了一个关键局限:尽管它们拥有解决任务的文本推理能力,并且有时能够生成正确的视觉,但它们会遭受累积生成错误,并且无法利用真实视觉信息。我们的研究结果表明,尽管视觉思维具有内在吸引力,但它们尚未对模型推理产生益处。MentisOculi为分析并弥合这一差距奠定了必要的基础,适用于多种模型家族。
Summary / 总结
The research evaluates whether models can use mental imagery for reasoning by developing MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution. Visual strategies, ranging from latent tokens to explicitly generated imagery, generally fail to improve performance. Unified multimodal models (UMMs) in particular suffer from compounding generation errors and fail to leverage even ground-truth visualizations. This suggests that current models do not yet effectively use visual thoughts for reasoning.
研究旨在评估统一多模态模型(UMMs)利用视觉推理的能力,灵感来源于人类的心理图像。开发了MentisOculi,一个包含多步推理问题的套件来测试这一能力。研究发现,UMMs在使用视觉策略时通常无法提高性能,并且即使使用真实视觉信息也会出现累积生成错误。这表明,尽管视觉思考具有潜力,但目前尚未能增强模型的推理能力。MentisOculi为分析并解决这一问题提供了框架,适用于不同类型的模型。
Abstract Activation Spaces for Content-Invariant Reasoning in Large Language Models
Authors: Gabriele Maraia, Marco Valentino, Fabio Massimo Zanzotto, Leonardo Ranaldi
First: 2026-02-02T18:48:44+00:00 · Latest: 2026-02-02T18:48:44+00:00
Abstract
Large Language Models (LLMs) often struggle with deductive judgment in syllogistic reasoning, systematically conflating semantic plausibility with formal validity, a phenomenon known as the content effect. This bias persists even when models generate step-wise explanations, indicating that intermediate rationales may inherit the same semantic shortcuts that affect answers. Recent approaches propose mitigating this issue by increasing inference-time structural constraints, either by encouraging abstract intermediate representations or by intervening directly in the model's internal computations; however, reliably suppressing semantic interference remains an open challenge. To make formal deduction less sensitive to semantic content, we introduce a framework for abstraction-guided reasoning that explicitly separates structural inference from lexical semantics. We construct paired content-laden and abstract syllogisms and use the model's activations on abstract inputs to define an abstract reasoning space. We then learn lightweight Abstractors that, from content-conditioned residual-stream states, predict representations aligned with this space and integrate these predictions via multi-layer interventions during the forward pass. Using cross-lingual transfer as a test bed, we show that abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance. Our results position activation-level abstraction as a scalable mechanism for enhancing the robustness of formal reasoning in LLMs against semantic interference.
中文标题/摘要
标题:抽象激活空间在大型语言模型中实现内容不变推理
大型语言模型(LLMs)在三段论推理的演绎判断中经常遇到困难,系统地将语义合理性与形式有效性混淆,这种现象被称为内容效应。即使模型生成逐步解释,这种偏差仍然存在,表明中间推理可能继承了影响答案的相同语义捷径。最近的方法通过增加推理时的结构约束来缓解这一问题,要么通过鼓励抽象的中间表示,要么直接干预模型的内部计算;然而,可靠地抑制语义干扰仍然是一个开放的挑战。为了使形式演绎对语义内容不那么敏感,我们提出了一种基于抽象引导的推理框架,明确地将结构推理与词汇语义分离。我们构建了配对的内容丰富和抽象的三段论,并使用模型在抽象输入上的激活来定义一个抽象推理空间。然后,我们学习轻量级的抽象器,从内容条件下的残差流状态中预测与该空间对齐的表示,并在前向传递过程中通过多层干预将这些预测集成起来。使用跨语言迁移作为测试床,我们展示了与抽象对齐的引导减少了由内容驱动的错误并提高了形式有效性敏感的性能。我们的结果将激活级别的抽象定位为增强LLMs在语义干扰下形式推理稳健性的可扩展机制。
Summary / 总结
This paper addresses the content effect in large language models (LLMs), where models conflate semantic plausibility with formal validity in syllogistic reasoning. To mitigate this, the authors propose an abstraction-guided reasoning framework that separates structural inference from lexical semantics. They construct paired syllogisms and use model activations on abstract inputs to define an abstract reasoning space. Lightweight Abstractors are then trained to predict representations aligned with this space, which are integrated during the forward pass. Experiments show that this approach reduces content-driven errors and improves validity-sensitive performance, suggesting activation-level abstraction as a scalable mechanism to enhance LLM robustness against semantic interference.
论文针对大型语言模型(LLMs)在演绎推理过程中存在的语义偏见问题,即模型常常将语义合理性与形式有效性混淆。为了解决这一问题,作者提出了一种基于抽象的推理框架,将结构推理与词汇语义分离。他们构建了成对的三段论,一个是富含内容的,另一个是抽象的,并使用模型在抽象输入上的激活来定义一个抽象推理空间。通过学习轻量级的Abstractors,从内容条件下的残差流状态中预测与该空间对齐的表示,并在前向传递过程中进行多层干预。实验表明,这种方法减少了由内容驱动的错误,并在跨语言迁移任务中提高了形式推理的准确性。
Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction
Authors: Han Bao, Zheyuan Zhang, Pengcheng Jing, Zhengqing Yuan, Kaiwen Shi, Yanfang Ye
First: 2026-02-02T18:46:16+00:00 · Latest: 2026-02-02T18:46:16+00:00
Comments: 65 pages, 40 figures
Abstract
As Large Language Models transition to autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that text-only evaluations do not capture. Existing benchmarks typically assume well-specified instructions or restrict evaluation to text-only, single-turn clarification, and thus do not measure multi-turn disambiguation under grounded execution risk. We introduce Drift-Bench, the first diagnostic benchmark that evaluates agentic pragmatics under input faults through multi-turn clarification across state-oriented and service-oriented execution environments. Grounded in classical theories of communication, Drift-Bench provides a unified taxonomy of cooperative breakdowns and employs a persona-driven user simulator with the Rise evaluation protocol. Experiments show substantial performance drops under these faults, with clarification effectiveness varying across user personas and fault types. Drift-Bench bridges clarification research and agent safety evaluation, enabling systematic diagnosis of failures that can lead to unsafe executions.
中文标题/摘要
标题:Drift-Bench:通过多轮交互诊断输入故障下LLM代理的合作破裂
随着大型语言模型转变为自主代理,用户输入经常违反合作假设(例如,隐含意图、缺失参数、虚假预设或模糊表达),这些执行风险是纯文本评估无法捕捉到的。现有基准通常假设明确的指令或仅限于文本单轮澄清的评估,因此无法衡量基于执行风险的多轮澄清。我们引入了Drift-Bench,这是第一个通过多轮澄清评估在输入故障下代理语用学的诊断基准,跨越状态导向和服务导向的执行环境。基于经典通信理论,Drift-Bench 提供了一致的合作破裂分类,并采用以角色为导向的用户模拟器和Rise评估协议。实验显示,在这些故障下性能显著下降,澄清效果在不同用户角色和故障类型之间有所不同。Drift-Bench 将澄清研究与代理安全性评估相结合,使系统诊断可能导致不安全执行的失败成为可能。
Summary / 总结
Drift-Bench evaluates cooperative breakdowns in LLM agents under input faults through multi-turn interaction, addressing risks not captured by text-only evaluations. It introduces a unified taxonomy of cooperative breakdowns and uses a persona-driven user simulator with the Rise evaluation protocol. Experiments show significant performance drops under input faults, with varying effectiveness of clarification across user personas and fault types.
Drift-Bench 是一个诊断基准,通过多轮澄清评估 LLM 代理在输入故障下的合作破裂。它解决了现有基准的局限性,关注具体的执行风险和多轮澄清。实验显示,在输入故障下性能显著下降,不同用户角色和故障类型下的澄清效果有所不同。
World-Gymnast: Training Robots with Reinforcement Learning in a World Model
Authors: Ansh Kumar Sharma, Yixiang Sun, Ninghao Lu, Yunzhe Zhang, Jiarao Liu, Sherry Yang
First: 2026-02-02T18:44:45+00:00 · Latest: 2026-02-02T18:44:45+00:00
Comments: https://world-gymnast.github.io/
Abstract
Robot learning from interacting with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations and reinforcement learning (RL) in a software-based simulator, are limited by the amount of expert data available and the sim-to-real gap for manipulation. With the recent emergence of world models learned from real-world video-action data, we ask the question of whether training a policy in a world model can be more effective than supervised learning or software simulation in achieving better real-robot performance. We propose World-Gymnast, which performs RL finetuning of a vision-language-action (VLA) policy by rolling out the policy in an action-conditioned video world model and rewarding the rollouts with a vision-language model (VLM). On the Bridge robot setup, World-Gymnast outperforms SFT by as much as 18x and outperforms software simulator by as much as 2x. More importantly, World-Gymnast demonstrates intriguing capabilities of RL with a world model, including training on diverse language instructions and novel scenes from the world model, test-time training in a novel scene, and online iterative world model and policy improvement. Our results suggest learning a world model and training robot policies in the cloud could be the key to bridging the gap between robots that work in demonstrations and robots that can work in anyone's household.
中文标题/摘要
标题:世界体操手:使用世界模型中的强化学习训练机器人
机器人通过与物理世界交互学习受到物理交互成本的限制。两种替代方案,基于专家演示的监督微调(SFT)和基于软件模拟的强化学习(RL),分别受限于可用的专家数据量和模拟与现实之间的差距。随着世界模型从真实世界视频-动作数据中学习的最近出现,我们提出了一个问题:在世界模型中训练策略是否比监督学习或软件模拟更能实现更好的机器人性能。我们提出了World-Gymnast,它通过在动作条件下的视频世界模型中展开策略,并用视觉语言模型(VLM)奖励展开来执行视觉语言动作(VLA)策略的RL微调。在Bridge机器人设置中,World-Gymnast在SFT上的表现高出18倍,在软件模拟上的表现高出2倍。更重要的是,World-Gymnast展示了使用世界模型进行RL的有趣能力,包括在多种语言指令和世界模型中的新场景上进行训练,在新场景中的测试时训练,以及在线迭代改进世界模型和策略。我们的结果表明,学习世界模型并在云端训练机器人策略可能是弥合演示中工作的机器人和在任何家庭中工作的机器人之间的差距的关键。
Summary / 总结
World-Gymnast addresses the limitations of robot learning by using reinforcement learning (RL) in a world model, which is trained from real-world video-action data. This method outperforms supervised finetuning by up to 18 times and software simulation by up to 2 times on the Bridge robot setup. Key findings include the ability to train on diverse language instructions, test-time training in novel scenes, and online iterative improvement of both the world model and policy.
World-Gymnast 通过使用基于真实世界视频-动作数据训练的世界模型来进行强化学习 (RL),解决了机器人学习的限制。该方法在Bridge机器人设置上比监督微调高18倍,比软件模拟高2倍。关键发现包括能够训练多种语言指令、在新场景中进行测试时的训练以及在线迭代改进世界模型和策略。
Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling
Authors: Andong Chen, Wenxin Zhu, Qiuyu Ding, Yuchen Song, Muyun Yang, Tiejun Zhao
First: 2026-02-02T18:43:57+00:00 · Latest: 2026-02-02T18:43:57+00:00
Comments: Working paper
Abstract
Chain-of-Thought reasoning has driven large language models to extend from thinking with text to thinking with images and videos. However, different modalities still have clear limitations: static images struggle to represent temporal structure, while videos introduce substantial redundancy and computational cost. In this work, we propose Thinking with Comics, a visual reasoning paradigm that uses comics as a high information-density medium positioned between images and videos. Comics preserve temporal structure, embedded text, and narrative coherence while requiring significantly lower reasoning cost. We systematically study two reasoning paths based on comics and evaluate them on a range of reasoning tasks and long-context understanding tasks. Experimental results show that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks, while remaining substantially more efficient than Thinking with Video. Further analysis indicates that different comic narrative structures and styles consistently affect performance across tasks, suggesting that comics serve as an effective intermediate visual representation for improving multimodal reasoning.
中文标题/摘要
标题:用漫画思考:通过结构化视觉叙事增强多模态推理
链式思考推理促使大型语言模型从仅处理文本扩展到处理图像和视频。然而,不同模态仍然存在明显局限:静态图像难以表现时间结构,而视频则引入了大量冗余和计算成本。在本研究中,我们提出用漫画思考,这是一种视觉推理范式,使用漫画作为介于图像和视频之间的一种高信息密度媒介。漫画保留了时间结构、嵌入文本和叙述连贯性,同时所需推理成本显著降低。我们系统地研究了两种基于漫画的推理路径,并在多种推理任务和长上下文理解任务上进行了评估。实验结果表明,用漫画思考在多步时间因果推理任务上优于用图像思考,同时在效率上远超用视频思考。进一步分析表明,不同漫画叙述结构和风格在各种任务中始终影响性能,这表明漫画作为一种有效的中间视觉表示,有助于提高多模态推理。
Summary / 总结
This work introduces Thinking with Comics, a visual reasoning paradigm that uses comics to enhance multimodal reasoning: comics preserve temporal structure, embedded text, and narrative coherence at a lower reasoning cost than videos. The study evaluates two comic-based reasoning paths and finds that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks while being more efficient than Thinking with Video. Different comic narrative structures and styles are shown to affect performance across tasks, indicating that comics are an effective intermediate visual representation for multimodal reasoning.
本文提出了一种视觉推理范式——通过漫画进行思考,利用漫画保留时间结构和叙事连贯性,同时相比视频降低推理成本。研究评估了两种基于漫画的推理路径,发现其在多步时间与因果推理任务中优于图像,且效率高于视频。不同漫画叙事结构和风格被证明会影响跨任务的性能,表明漫画作为多模态推理的有效中间视觉表示的作用。
RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval
Authors: Tyler Skow, Alexander Martin, Benjamin Van Durme, Rama Chellappa, Reno Kriz
First: 2026-02-02T18:40:37+00:00 · Latest: 2026-02-02T18:40:37+00:00
Abstract
Reranking is a critical component of modern retrieval systems, which typically pair an efficient first-stage retriever with a more expressive model to refine results. While large reasoning models have driven rapid progress in text-centric reranking, reasoning-based reranking for video retrieval remains underexplored. To address this gap, we introduce RANKVIDEO, a reasoning-based reranker for video retrieval that explicitly reasons over query-video pairs using video content to assess relevance. RANKVIDEO is trained using a two-stage curriculum consisting of perception-grounded supervised fine-tuning followed by reranking training that combines pointwise, pairwise, and teacher confidence distillation objectives, and is supported by a data synthesis pipeline for constructing reasoning-intensive query-video pairs. Experiments on the large-scale MultiVENT 2.0 benchmark demonstrate that RANKVIDEO consistently improves retrieval performance within a two-stage framework, yielding an average improvement of 31% on nDCG@10 and outperforming text-only and vision-language reranking alternatives, while being more efficient.
中文标题/摘要
标题:RANKVIDEO: 基于推理的视频检索重排序
重排序是现代检索系统中的关键组件,通常通过一个高效的初步检索器与一个更具表现力的模型配对来优化结果。虽然大型推理模型在文本中心的重排序方面取得了快速进展,但基于推理的视频检索重排序仍然未被充分探索。为了解决这一差距,我们引入了RANKVIDEO,这是一种基于推理的视频检索重排序器,它明确地通过视频内容对查询-视频对进行推理以评估相关性。RANKVIDEO 通过一个包含感知导向的监督微调和结合点式、对式和教师置信度蒸馏目标的重排序训练的两阶段课程进行训练,并通过数据合成管道构建推理密集型查询-视频对。在大规模的MultiVENT 2.0基准测试上的实验表明,RANKVIDEO 在两阶段框架内始终提高了检索性能,nDCG@10 的平均改进率为 31%,并优于仅基于文本和视觉语言的重排序替代方案,同时更为高效。
Summary / 总结
RANKVIDEO is a reasoning-based reranker for video retrieval that assesses the relevance of query-video pairs by explicitly reasoning over video content. It is trained with a two-stage curriculum and supported by a data synthesis pipeline. Within a two-stage retrieval framework on MultiVENT 2.0, it yields an average 31% improvement in nDCG@10 and outperforms text-only and vision-language reranking alternatives. The method addresses the underexplored area of reasoning-based reranking for video retrieval.
RANKVIDEO 是一种基于推理的视频检索重排序器,利用视频内容评估相关性,解决了视频检索中推理重排序的不足。它通过两阶段课程进行训练,并在 nDCG@10 上平均提高了 31%,优于仅基于文本和视觉语言的重排序方法,展示了其在两阶段框架中增强检索性能的有效性。
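The nDCG@10 metric reported for RANKVIDEO above can be computed as in the sketch below, which compares a first-stage ranking with a reranked list. The relevance labels are made up for illustration, the gain is linear in the label (identical to the exponential gain for binary labels), and the ideal ranking is derived only from the items in the candidate list; these are simplifying assumptions, not the benchmark's official scorer.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """Normalize DCG by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels, listed in ranked order (1 = relevant video).
first_stage = [0, 1, 0, 1]
reranked = [1, 1, 0, 0]

before = ndcg_at_k(first_stage, k=10)
after = ndcg_at_k(reranked, k=10)
print(f"nDCG@10 before reranking: {before:.3f}")
print(f"nDCG@10 after reranking:  {after:.3f}")
```

A reranker improves nDCG@10 whenever it moves relevant videos toward the top of the candidate list returned by the first-stage retriever.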
Certain Head, Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE
Authors: Yuanteng Chen, Peisong Wang, Nanxin Zeng, Yuantian Shao, Gang Li, Jing Liu, Jian Cheng
First: 2026-02-02T18:39:33+00:00 · Latest: 2026-02-02T18:39:33+00:00
Comments: 24 pages, 13 figures
Abstract
Test-time scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. While single-run greedy accuracy remains stable when fewer experts are activated, multi-sample pass@n degrades significantly, suggesting that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Motivated by these findings, we propose Expert-Sample, a training-free method that preserves high-confidence selections while injecting controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs. Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct evaluated on GPQA-Diamond with 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.
中文标题/摘要
标题:确定头部,不确定尾部:测试时缩放在细粒度MoE中的专家样本
测试时缩放通过生成多个候选解决方案来提高LLM性能,但按token级采样需要温度调节,这在多样性和稳定性之间进行权衡。细粒度MoE,每层包含数百个训练良好的专家,并且每个token具有多专家激活,提供了一个未被探索的替代方案,通过其丰富的路由空间。我们实证研究了细粒度MoE路由,并发现了一个有信息的模式:路由器分数表现出高置信度专家的确定头部,随后是低置信度候选者的不确定尾部。当激活更少的专家时,单次运行贪婪准确率保持稳定,而多样本pass@n显著下降——这表明确定头部管理核心推理能力,而不确定尾部与推理多样性相关。受这些发现的启发,我们提出了一种无需训练的方法——专家样本,该方法保留高置信度选择的同时,向不确定尾部注入可控的随机性,从而实现多样生成而不破坏输出。在数学、知识推理和代码任务等多个细粒度MoE模型上进行评估,专家样本一致地提高了pass@n和基于验证的准确性。在Qwen3-30B-A3B-Instruct上,使用GPQA-Diamond进行32并行样本评估,pass@32从85.4%提高到91.9%,准确率从59.1%提高到62.6%,使用Best-of-N验证。
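The "certain head, uncertain tail" idea from the Expert-Sample abstract above can be sketched as a routing function that keeps the top-scoring experts fixed and samples the remaining slots. The abstract does not specify the sampling rule, so the parameters `head_size` and `tail_pool`, the score-proportional sampling without replacement, and the function name `expert_sample` are all assumptions for illustration rather than the authors' method.

```python
import random


def expert_sample(router_scores, k, head_size, tail_pool, rng):
    """Select k experts for one token.

    The top `head_size` experts (the certain head) are always kept.
    The remaining k - head_size slots are sampled without replacement from the
    next `tail_pool` experts (the uncertain tail), proportionally to their
    router scores. Scores are assumed to be positive (e.g. post-softmax).
    """
    order = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    head = order[:head_size]
    candidates = order[head_size:head_size + tail_pool]
    tail = []
    for _ in range(k - head_size):
        weights = [router_scores[i] for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        tail.append(pick)
        candidates.remove(pick)
    return head + tail


# Hypothetical router probabilities for 7 experts on one token.
scores = [0.30, 0.25, 0.15, 0.10, 0.08, 0.07, 0.05]
rng = random.Random(0)
chosen = expert_sample(scores, k=4, head_size=2, tail_pool=4, rng=rng)
print(chosen)
```

Plain top-k routing would always return experts 0 to 3 here; in this sketch, repeated calls with different random states vary only the last two slots, which is where the diversity across parallel samples would come from.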
Large Language Models for Mental Health: A Multilingual Evaluation
Authors: Nishat Raihan, Sadiya Sayara Chowdhury Puspo, Ana-Maria Bucur, Stevie Chancellor, Marcos Zampieri
First: 2026-02-02T18:34:53+00:00 · Latest: 2026-02-02T18:34:53+00:00
Abstract
Large Language Models (LLMs) have remarkable capabilities across NLP tasks. However, their performance in multilingual contexts, especially within the mental health domain, has not been thoroughly explored. In this paper, we evaluate proprietary and open-source LLMs on eight mental health datasets in various languages, as well as their machine-translated (MT) counterparts. We compare LLM performance in zero-shot, few-shot, and fine-tuned settings against conventional NLP baselines that do not employ LLMs. In addition, we assess translation quality across language families and typologies to understand its influence on LLM performance. Proprietary LLMs and fine-tuned open-source LLMs achieve competitive F1 scores on several datasets, often surpassing state-of-the-art results. However, performance on MT data is generally lower, and the extent of this decline varies by language and typology. This variation highlights both the strengths of LLMs in handling mental health tasks in languages other than English and their limitations when translation quality introduces structural or lexical mismatches.
中文标题/摘要
标题:大型语言模型在心理健康领域的多语言评估
大型语言模型(LLMs)在NLP任务中具有显著的能力。然而,它们在多语言环境中的表现,尤其是在心理健康领域,尚未得到充分探索。在本文中,我们评估了多种专有和开源LLM在多种语言的八个心理健康数据集上的表现,以及它们的机器翻译(MT)版本。我们将LLM在零样本、少样本和微调设置下的表现与不使用LLM的传统NLP基线进行比较。此外,我们还评估了跨语言家族和类型学的翻译质量,以了解其对LLM表现的影响。专有LLM和微调的开源LLM在多个数据集上取得了竞争力的F1分数,通常超过了最先进的结果。然而,MT数据上的表现普遍较低,这种下降的程度因语言和类型学而异。这种差异突显了LLM在处理除英语以外其他语言的心理健康任务方面的优势及其在翻译质量引入结构或词汇不匹配时的局限性。
Summary / 总结
This study evaluates the performance of proprietary and open-source Large Language Models (LLMs) on mental health datasets in multiple languages, including their machine-translated counterparts. The research compares LLMs in zero-shot, few-shot, and fine-tuned settings against traditional NLP methods. The results show that proprietary LLMs and fine-tuned open-source LLMs achieve competitive F1 scores, often outperforming existing methods. However, performance on machine-translated data is generally lower, with varying degrees of decline depending on the language and typology, indicating both the strengths and limitations of LLMs in the mental health domain.
本文评估了专有和开源大型语言模型在八个跨语言心理健康数据集上的表现,将它们的零样本、少量样本和微调设置与传统NLP基线进行比较。专有模型和微调的开源模型在某些数据集上取得了竞争力的F1分数,通常优于最先进的结果。然而,机器翻译数据上的性能普遍较低,且这种差异取决于语言和语系,表明大型语言模型在心理健康领域的强项和局限性。
UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing
Authors: Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song, Yibin Wang, Zhixiong Zhang, Tianhang Wang, Siyuan Wang, Zhongyu Wei, Jiaqi Wang
First: 2026-02-02T18:34:35+00:00 · Latest: 2026-02-02T18:34:35+00:00
Abstract
Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.
中文标题/摘要
标题:UniReason 1.0:统一的知识对齐图像生成与编辑推理框架
统一的多模态模型在处理需要深入推理的复杂合成任务时往往表现不佳,通常将文本到图像生成和图像编辑视为孤立的能力,而不是相互关联的推理步骤。为了解决这个问题,我们提出了UniReason,这是一种通过双重推理范式将这两种任务统一起来的框架。我们将生成过程形式化为增强的世界知识规划,以注入隐式约束,并利用编辑能力进行精细的视觉修正,通过自我反思进一步纠正视觉错误。这种方法在共享表示中统一了生成和编辑,反映了人类认知过程中的规划随后是细化。我们通过系统构建一个大规模的以推理为中心的数据集(约30万样本),涵盖了五个主要的知识领域(例如,文化常识、物理等)来进行规划,以及一个代理生成的语料库进行视觉自我修正。广泛的实验表明,UniReason在WISE、KrisBench和UniREditBench等推理密集型基准测试中实现了先进的性能,同时保持了优越的一般合成能力。
Embedding Perturbation may Better Reflect the Uncertainty in LLM Reasoning
Authors: Qihao Wen, Jiahao Wang, Yang Nan, Pengfei He, Ravi Tandon, Han Xu
First: 2026-02-02T18:27:26+00:00 · Latest: 2026-02-02T18:27:26+00:00
Abstract
Large Language Models (LLMs) have achieved significant breakthroughs across diverse domains; however, they can still produce unreliable or misleading outputs. For responsible LLM application, Uncertainty Quantification (UQ) techniques are used to estimate a model's uncertainty about its outputs, indicating the likelihood that those outputs may be problematic. For LLM reasoning tasks, it is essential to estimate the uncertainty not only for the final answer, but also for the intermediate steps of the reasoning, as this can enable more fine-grained and targeted interventions. In this study, we explore which UQ metrics better reflect the LLM's "intermediate uncertainty" during reasoning. Our study reveals that an LLM's incorrect reasoning steps tend to contain tokens that are highly sensitive to perturbations of the preceding token embeddings. In this way, incorrect (uncertain) intermediate steps can be readily identified using this sensitivity score as guidance in practice. In our experiments, we show that such a perturbation-based metric achieves stronger uncertainty quantification performance than baseline methods such as token (generation) probability and token entropy. Moreover, unlike approaches that rely on multiple sampling, the perturbation-based metrics offer better simplicity and efficiency.
中文标题/摘要
标题:嵌入扰动可能更好地反映LLM推理中的不确定性
大型语言模型(LLMs)在多个领域取得了显著突破;然而,它们仍然可能产生不可靠或误导性的输出。为了负责任地应用LLM,使用不确定性量化(UQ)技术来估计模型对其输出的不确定性,这表明这些输出可能存在问题的可能性。对于LLM推理任务,不仅需要估计最终答案的不确定性,还需要估计推理过程中的中间步骤的不确定性,这可以实现更精细和针对性的干预。在本研究中,我们探讨了哪些UQ指标更能反映LLM在推理过程中的“中间不确定性”。我们的研究表明,LLM的错误推理步骤往往包含对前一个嵌入扰动高度敏感的标记。通过这种方式,可以使用这种敏感性分数作为指导,识别出错误(不确定)的中间步骤。在我们的实验中,我们展示了基于扰动的指标在不确定性量化性能上优于基线方法,如标记生成概率和标记熵。此外,与依赖多次采样的方法不同,基于扰动的指标提供了更好的简洁性和效率。
Summary / 总结
This study investigates UQ metrics for quantifying the uncertainty in LLM reasoning, focusing on intermediate steps. It finds that perturbing token embeddings can better identify uncertain reasoning steps, which are often sensitive to such perturbations. Experiments show that this perturbation-based method outperforms traditional metrics like token probability and entropy in quantifying uncertainty, and it is simpler and more efficient than sampling-based methods.
本研究探讨了用于量化LLM推理过程中不确定性(尤其是中间步骤)的UQ指标。研究发现,通过扰动词嵌入可以更好地识别出不确定的推理步骤,这些步骤往往对前一个词的扰动非常敏感。实验表明,基于扰动的方法在量化不确定性方面优于传统的词概率和熵等指标,并且比基于采样的方法更为简单和高效。
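The perturbation-based sensitivity score described above can be illustrated on a toy model. The sketch below adds Gaussian noise to the preceding token embeddings and measures how much the probability of the generated token changes. The toy "language model" (mean-pooled embeddings followed by a linear layer and softmax), the noise scale `sigma`, and the averaging over `trials` are assumptions for illustration; the paper's exact score and model are not given in the abstract.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_prob(prefix_embeddings, output_weights, token_id):
    """Toy next-token model: mean-pool the prefix embeddings, apply a linear layer."""
    dim = len(prefix_embeddings[0])
    context = [sum(e[d] for e in prefix_embeddings) / len(prefix_embeddings) for d in range(dim)]
    logits = [sum(w[d] * context[d] for d in range(dim)) for w in output_weights]
    return softmax(logits)[token_id]


def sensitivity(prefix_embeddings, output_weights, token_id, sigma=0.1, trials=32, seed=0):
    """Mean absolute change in the token's probability under embedding noise."""
    rng = random.Random(seed)
    base = token_prob(prefix_embeddings, output_weights, token_id)
    total = 0.0
    for _ in range(trials):
        noisy = [[x + rng.gauss(0.0, sigma) for x in e] for e in prefix_embeddings]
        total += abs(token_prob(noisy, output_weights, token_id) - base)
    return total / trials


# Hypothetical 2-dimensional embeddings for a 2-token prefix and a 3-token vocabulary.
prefix = [[1.0, 0.0], [0.5, 0.5]]
weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
score = sensitivity(prefix, weights, token_id=0)
print(f"sensitivity of token 0: {score:.4f}")
```

In practice the same loop would wrap a real model's forward pass, and tokens with high scores would be flagged as belonging to potentially incorrect reasoning steps.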
How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models
Authors: Parth Asawa, Alan Zhu, Abby O'Neill, Matei Zaharia, Alexandros G. Dimakis, Joseph E. Gonzalez
First: 2025-10-02T18:02:39+00:00 · Latest: 2026-02-02T18:23:32+00:00
Abstract
Frontier language models are deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. We introduce Advisor Models, a method to train small open-weight models to generate dynamic, per-instance natural language advice that improves the capabilities of black-box frontier models. Advisor Models improve GPT-5's performance on RuleArena (Taxes) by 71%, reduce Gemini 3 Pro's steps taken in SWE agent tasks by 24.6%, and outperform static prompt optimizers in personalizing GPT-5 to user preferences (85-100% vs. 40-60%). We also find that advisors are transferable: an advisor trained with a low-cost student model still transfers improvements to a frontier model. Moreover, Advisor Models are robust: we observe no degradation on benchmarks other than those the pipeline is trained on. Our method shows how to perform parametric optimization for black-box frontier models in a practical and cost-effective way.
中文标题/摘要
标题:如何训练你的顾问:使用顾问模型引导黑盒大语言模型
前沿语言模型作为黑盒服务部署,模型权重不可修改,定制仅限于提示。我们介绍了顾问模型,这是一种方法,通过训练小型开放权重模型生成针对每个实例的自然语言建议,以提高黑盒前沿模型的能力。顾问模型使GPT-5在RuleArena(税收)上的性能提高了71%,减少了Gemini 3 Pro在SWE代理任务中的步骤24.6%,并且在个性化GPT-5以满足用户偏好方面优于静态提示优化器(85-100% vs. 40-60%)。我们还发现,顾问具有可迁移性:用低成本学生模型训练的顾问仍然可以提高前沿模型的表现。此外,顾问模型具有鲁棒性:我们发现在训练管道之外的其他基准上没有观察到性能下降。我们的方法展示了如何以实用和成本效益的方式对黑盒前沿模型进行参数优化。
Uncertainty-Aware Knowledge Tracing Models
Authors: Joshua Mitton, Prarthana Bhattacharyya, Ralph Abboud, Simon Woodhead
First: 2025-09-25T20:06:02+00:00 · Latest: 2026-02-02T18:21:09+00:00
Comments: 10 pages, 7 figures. Joshua Mitton and Prarthana Bhattacharyya contributed equally to this paper
Abstract
The main focus of research on Knowledge Tracing (KT) models is on model development aimed at improving predictive accuracy. Most of these models make the most incorrect predictions when students choose a distractor, so student errors go undetected. We present an approach that adds new capabilities to KT models by capturing predictive uncertainty, and demonstrate that larger predictive uncertainty aligns with incorrect model predictions. We show that uncertainty in KT models is informative and that this signal would be pedagogically useful in an educational learning platform used in a limited-resource setting where understanding student ability is necessary.
中文标题/摘要
标题:意识不确定性知识追踪模型
知识追踪(KT)模型研究的主要焦点在于提高预测准确性的模型开发。大多数模型在学生选择干扰项时做出最不正确的预测,导致学生错误未被检测到。我们提出了一种方法,通过捕捉预测不确定性来增强KT模型的新能力,并证明了更大的预测不确定性与模型错误预测相一致。我们展示了在知识追踪模型中不确定性是具有信息价值的,这种信号对于在资源有限的教育学习平台中应用,以了解学生能力是有教育意义的。
Summary / 总结
This research aims to enhance Knowledge Tracing (KT) models by incorporating predictive uncertainty to better detect student errors, especially when students choose distractors. The study demonstrates that higher predictive uncertainty aligns with incorrect model predictions, suggesting that uncertainty signals can be pedagogically useful. The findings indicate that uncertainty-aware KT models can provide valuable insights in educational settings with limited resources, aiding in understanding student abilities.
本文针对现有知识追踪(KT)模型在预测学生错误时的局限性,尤其是在他们选择干扰项时的准确性不足。作者提出了一种方法,将预测不确定性引入KT模型,表明更高的不确定性与错误预测相关。这种方法增强了模型检测学生错误的能力,使信号对教育平台具有教育意义,特别是在资源有限的环境中,理解学生能力至关重要。
Structure Enables Effective Self-Localization of Errors in LLMs
Authors: Ankur Samanta, Akshayaa Magesh, Ayush Jain, Kavosh Asadi, Youliang Yu, Daniel Jiang, Boris Vidolov, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni
First: 2026-02-02T18:15:59+00:00 · Latest: 2026-02-02T18:15:59+00:00
Abstract
Self-correction in language models remains elusive. In this work, we explore whether language models can explicitly localize errors in incorrect reasoning, as a path toward building AI systems that can effectively correct themselves. We introduce a prompting method that structures reasoning as discrete, semantically coherent thought steps, and show that models are able to reliably localize errors within this structure, while failing to do so in conventional, unstructured chain-of-thought reasoning. Motivated by how the human brain monitors errors at discrete decision points and resamples alternatives, we introduce Iterative Correction Sampling of Thoughts (Thought-ICS), a self-correction framework. Thought-ICS iteratively prompts the model to generate reasoning one discrete and complete thought at a time--where each thought represents a deliberate decision by the model--creating natural boundaries for precise error localization. Upon verification, the model localizes the first erroneous step, and the system backtracks to generate alternative reasoning from the last correct point. When asked to correct reasoning verified as incorrect by an oracle, Thought-ICS achieves a 20-40% self-correction lift. In a completely autonomous setting without external verification, it outperforms contemporary self-correction baselines.
中文标题/摘要
标题:结构使大语言模型能够有效自我定位错误
语言模型的自我纠正仍然难以实现。在本研究中,我们探索语言模型是否能够明确定位错误推理中的错误,作为构建能够有效自我纠正的AI系统的途径。我们引入了一种提示方法,将推理结构化为离散的、语义上连贯的思想步骤,并展示了模型能够在这种结构中可靠地定位错误,而在传统的非结构化链式思考推理中则无法做到这一点。受人类大脑在离散决策点监测错误并重新采样替代方案的启发,我们引入了迭代纠正采样思想(Thought-ICS)这一自我纠正框架。Thought-ICS 逐步提示模型以生成一个接一个的离散且完整的思考——每个思考代表模型的一个明确决策,从而创建出精确错误定位的自然边界。经过验证后,模型能够定位第一个错误步骤,并回溯生成从最后一个正确点开始的替代推理。当要求模型纠正由先验知识验证为错误的推理时,Thought-ICS 达到了20-40%的自我纠正提升。在完全自主的环境中,它优于当前的自我纠正基线。
Summary / 总结
This work investigates whether language models can self-correct by localizing errors in their reasoning. The authors introduce a prompting method that structures reasoning as discrete, coherent thought steps, and build on it a self-correction framework called Thought-ICS, which generates one thought at a time, localizes the first erroneous step, and backtracks to resample from the last correct point. Thought-ICS achieves a 20-40% self-correction lift when reasoning is verified as incorrect by an oracle, and outperforms existing self-correction baselines in a fully autonomous setting.
研究探讨了语言模型是否可以通过定位错误来自我纠正。通过将推理结构化为离散且语义连贯的思想步骤,模型能够可靠地识别错误。引入的迭代纠正采样思想(Thought-ICS)框架进一步通过逐个提示模型生成完整的思想,使精确的错误定位成为可能。实验结果显示,Thought-ICS 在经过金标准验证时的自我纠正率提高了 20-40%,并在完全自主的环境中优于现有自我纠正基准方法。
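The generate, localize, backtrack loop described in the Thought-ICS abstract can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_thought`, `is_final`, and `localize_first_error` are hypothetical callables standing in for model prompts, and the round and length limits are invented for the sketch.

```python
def iterative_correction(problem, generate_thought, is_final,
                         localize_first_error, max_rounds=5, max_thoughts=20):
    """Generate reasoning one thought at a time; on a detected error,
    backtrack to the last correct thought and resample from there.

    All three callables are hypothetical stand-ins for model calls:
      generate_thought(problem, thoughts) -> str
      is_final(thought) -> bool
      localize_first_error(problem, thoughts) -> index of first bad thought, or None
    """
    thoughts = []
    for _ in range(max_rounds):
        # Extend the chain until a final thought is produced (or a length cap hits).
        while len(thoughts) < max_thoughts and not (thoughts and is_final(thoughts[-1])):
            thoughts.append(generate_thought(problem, thoughts))
        bad_index = localize_first_error(problem, thoughts)
        if bad_index is None:
            return thoughts  # no error localized: accept the chain
        # Keep only the thoughts before the first erroneous one.
        thoughts = thoughts[:bad_index]
    return thoughts
```

Because each thought is a separate list element, backtracking reduces to a slice, which is the kind of natural boundary the abstract attributes to structured reasoning.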
Misconception Diagnosis From Student-Tutor Dialogue: Generate, Retrieve, Rerank
Authors: Joshua Mitton, Prarthana Bhattacharyya, Digory Smith, Thomas Christie, Ralph Abboud, Simon Woodhead
First: 2026-02-02T18:14:35+00:00 · Latest: 2026-02-02T18:14:35+00:00
Comments: 21 pages, 8 figures, 8 tables. Joshua Mitton and Prarthana Bhattacharyya contributed equally to this paper
Abstract
Timely and accurate identification of student misconceptions is key to improving learning outcomes and pre-empting the compounding of student errors. However, this task is highly dependent on the effort and intuition of the teacher. In this work, we present a novel approach for detecting misconceptions from student-tutor dialogues using large language models (LLMs). First, we use a fine-tuned LLM to generate plausible misconceptions, and then retrieve the most promising candidates among these using embedding similarity with the input dialogue. These candidates are then assessed and re-ranked by another fine-tuned LLM to improve misconception relevance. Empirically, we evaluate our system on real dialogues from an educational tutoring platform. We consider multiple base LLMs, including LLaMA, Qwen and Claude, in zero-shot and fine-tuned settings. We find that our approach improves predictive performance over baseline models, and that fine-tuning improves generated misconception quality and can outperform larger closed-source models. Finally, we conduct ablation studies to validate the importance of both our generation and reranking steps for misconception generation quality.
中文标题/摘要
标题:学生-导师对话中的误解诊断:生成、检索、重排序
及时准确地识别学生的误解是提高学习成果和预防学生错误累积的关键。然而,这项任务高度依赖于教师的努力和直觉。在本工作中,我们提出了一种使用大规模语言模型(LLM)检测学生-导师对话中误解的新型方法。首先,我们使用微调后的LLM生成可能的误解,然后使用输入对话的嵌入相似性检索最有希望的候选者。这些候选者随后由另一个微调后的LLM评估和重新排序,以提高误解的相关性。通过实验证明,我们在一个教育辅导平台的真实对话上评估了我们的系统。我们考虑了包括LLaMA、Qwen和Claude在内的多个基础LLM模型,在零样本和微调设置下进行评估。我们发现,我们的方法在预测性能上优于基线模型,并且微调可以提高生成的误解质量,甚至可以超越更大的闭源模型。最后,我们进行了消融研究,以验证生成和重新排序步骤在误解生成质量上的重要性。
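The three stages named in the title (generate, retrieve, rerank) can be sketched as a short pipeline. This is a toy illustration only: the bag-of-words `bow_embed` is a stand-in for a learned embedding model, and `generate` and `rerank_score` are hypothetical callables in place of the two fine-tuned LLMs mentioned in the abstract.

```python
import math
from collections import Counter


def bow_embed(text):
    """Toy embedding (an assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def diagnose(dialogue, generate, rerank_score, top_k=3):
    """Generate candidate misconceptions, keep the top_k most similar
    to the dialogue, then order them by a reranker score."""
    candidates = generate(dialogue)            # stage 1: generate
    query = bow_embed(dialogue)
    retrieved = sorted(                        # stage 2: retrieve by similarity
        candidates,
        key=lambda c: cosine(query, bow_embed(c)),
        reverse=True,
    )[:top_k]
    return sorted(                             # stage 3: rerank
        retrieved,
        key=lambda c: rerank_score(dialogue, c),
        reverse=True,
    )
```

Separating retrieval from reranking lets a cheap similarity filter shrink the candidate set before the more expensive scorer runs, which is a common motivation for this kind of pipeline.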
Masked Autoencoders as Universal Speech Enhancer
Authors: Rajalaxmi Rajagopalan, Ritwik Giri, Zhiqiang Tang, Kyu Han
First: 2026-02-02T18:13:59+00:00 · Latest: 2026-02-02T18:13:59+00:00
Abstract
Supervised speech enhancement methods have been very successful. However, in practical scenarios, clean speech is scarce, and there is a need for self-supervised learning (SSL) based speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications. In this work, we develop a masked-autoencoder-based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. During pre-training, the masked autoencoder model learns to remove the added distortions while reconstructing the masked regions of the spectrogram. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features for denoising and dereverberation downstream tasks. We explore different augmentations (like single- or multi-speaker) in the pre-training augmentation stack and the effect of different noisy input feature representations (like $log1p$ compression) on pre-trained embeddings and downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets.
中文标题/摘要
标题:遮罩自编码器作为通用语音增强器
监督语音增强方法非常成功。然而,在实际场景中缺乏干净的语音,因此需要基于自监督学习(SSL)的语音增强方法,这些方法可以提供可比的增强性能,并且可以应用于其他语音相关的下游应用。在本文中,我们开发了一种基于遮罩自编码器的通用语音增强器,该增强器对影响语音的失真类型不敏感,可以同时处理多种失真,并以自监督方式训练。增强堆栈进一步向嘈杂的输入数据中添加失真。遮罩自编码器模型在预训练过程中学习去除添加的失真并重建光谱图的遮罩区域。预训练的嵌入随后用于在少量配对数据上训练特定下游任务的微调模型。我们评估了预训练特征在去噪和除混响下游任务中的性能。我们探索了预训练增强堆栈中不同的增强(如单说话人或多说话人)以及不同的嘈杂输入特征表示(如$log1p$压缩)对预训练嵌入和下游微调增强性能的影响。我们展示了所提出的方法不仅优于基线,而且在领域内和领域外评估数据集上均达到了最先进的性能。
Summary / 总结
This study addresses the scarcity of clean speech for supervised speech enhancement by developing a masked-autoencoder-based universal speech enhancer that handles multiple distortion types and is trained in a self-supervised manner. During pre-training, an augmentation stack adds further distortions to noisy input data, and the model learns to remove them while reconstructing masked spectrogram regions. The pre-trained embeddings are then used by models fine-tuned on a small amount of paired data for denoising and dereverberation. The method outperforms the baseline and achieves state-of-the-art performance on both in-domain and out-of-domain datasets.
该研究旨在解决需要能够处理多种类型失真并应用于下游语音相关任务的自监督语音增强方法的问题。作者开发了一种基于掩码自编码器的通用语音增强器,该模型通过添加失真的增强堆栈在自监督方式下进行训练。预训练模型经过特定任务的微调,并在去噪和降混响任务中表现出色,相比基线方法具有更优的表现,在领域内和领域外的评估数据集上均达到了最先进的水平。
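The pre-training setup in the abstract above (add distortions, mask part of the spectrogram, reconstruct the un-augmented input) can be sketched as a function that builds one training pair. The details are assumptions for illustration: masking whole time frames, the `mask_value` of 0.0, and the Gaussian-noise augmentation are not specified in the abstract, and `make_pretraining_pair` and `add_gaussian_noise` are hypothetical names.

```python
import random


def add_gaussian_noise(spec, rng, sigma=0.1):
    """Toy augmentation (an assumption): additive Gaussian noise per bin."""
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in spec]


def make_pretraining_pair(noisy_spec, augment, mask_ratio=0.5,
                          mask_value=0.0, seed=0):
    """Build (model_input, target, masked_frames) for one spectrogram.

    noisy_spec: list of time frames, each a list of frequency-bin values.
    The input is the further-distorted spectrogram with some frames masked;
    the target is the original noisy spectrogram, so a model trained on the
    pair must both undo the added distortion and fill in the masked frames.
    """
    rng = random.Random(seed)
    distorted = augment(noisy_spec, rng)
    n_frames = len(distorted)
    n_masked = int(round(mask_ratio * n_frames))
    masked = set(rng.sample(range(n_frames), n_masked))
    model_input = [
        [mask_value] * len(frame) if t in masked else list(frame)
        for t, frame in enumerate(distorted)
    ]
    target = [list(frame) for frame in noisy_spec]
    return model_input, target, sorted(masked)
```

No clean reference appears anywhere in the pair, which is what makes the objective self-supervised in this sketch.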
Reverse Engineering Human Preferences with Reinforcement Learning
Authors: Lisa Alazraki, Tan Yi-Chern, Jon Ander Campos, Maximilian Mozes, Marek Rei, Max Bartolo
Venue: NeurIPS 2025 Spotlight
First: 2025-05-21T17:48:16+00:00 · Latest: 2026-02-02T18:10:19+00:00
Comments: NeurIPS 2025 (Spotlight)
Abstract
The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework--known as LLM-as-a-judge--is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these models attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks which intervene directly on the model's response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models that are not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse engineered effectively, by pipelining LLMs to optimise upstream preambles via reinforcement learning--an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.
中文标题/摘要
标题:使用强化学习逆向工程人类偏好
大型语言模型(LLM)的能力通常由其他LLM训练的模型来评估,这些模型旨在预测人类偏好。这种框架被称为LLM作为裁判,具有高度可扩展性和相对较低的成本。然而,它也容易受到恶意利用,因为LLM的响应可以调整以过度拟合裁判的偏好。先前的研究表明,候选LLM生成的答案可以在事后编辑以最大化裁判LLM赋予它们的分数。在本研究中,我们采用不同的方法,利用裁判LLM提供的信号作为奖励,以对抗性方式调整生成文本前言的模型,这些前言旨在提升下游性能。我们发现,冻结的LLM与这些模型结合后,获得的LLM评估分数高于现有框架。至关重要的是,与其他直接干预模型响应的框架不同,我们的方法几乎不可检测。我们还证明,当候选LLM和裁判LLM被训练中未使用的模型替换时,调整后的前言生成器的有效性仍然有效。这些发现提出了关于更可靠LLM作为裁判评估设置设计的重要问题。它们还表明,通过将LLM管道化以通过强化学习优化上游前言,人类偏好可以有效地逆向工程,这种方法在未来可能在各种任务和领域中找到应用,而不仅仅是对抗性攻击。
Summary / 总结
This study addresses the vulnerability of Large Language Model (LLM)-as-a-judge evaluation by proposing a new method to adversarially tune models that generate text preambles to boost downstream performance. By using the judge-LLM's feedback as a reward signal, the method improves the LLM-evaluation scores without directly altering the model's responses, making it harder to detect. The effectiveness of this approach is shown to transfer across different models, raising concerns about the reliability of current LLM evaluation methods and suggesting potential applications in various tasks and domains.
本研究针对大型语言模型(LLM)评估中的漏洞,提出了一种新的方法,通过生成文本前言来提升下游性能。该方法利用裁判LLM的评分作为奖励信号,实现了比现有框架更高的评估分数。更重要的是,这种方法是不可检测的,并且其有效性在不同模型之间转移,这引发了对LLM作为裁判评估可靠性的担忧,并暗示了在各种领域中的潜在应用。