Recursive Language Models
Authors: Alex L. Zhang, Tim Kraska, Omar Khattab
First: 2025-12-31T03:43:41+00:00 · Latest: 2026-01-28T18:59:39+00:00
Comments: 9 pages, 33 with Appendix
Abstract
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context scaffolds across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first natively recursive language model. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.
中文标题/摘要
标题:递归语言模型
我们通过推理时缩放的视角研究允许大型语言模型(LLMs)处理任意长的提示。我们提出了递归语言模型(RLMs),这是一种通用的推理范式,将长提示视为外部环境的一部分,并允许LLM程序化地检查、分解并递归地在提示片段上调用自身。我们发现RLMs能够成功处理输入,其长度比模型上下文窗口大两个数量级,即使对于较短的提示,其质量也显著优于基础的前沿LLMs和常见的长上下文支架,在四个不同的长上下文任务中表现更佳,且成本相当。在较小的规模下,我们后训练了第一个原生递归语言模型。我们的模型RLM-Qwen3-8B在平均上比基础的Qwen3-8B模型高出28.3%,甚至在三个长上下文任务中接近vanilla GPT-5的质量。代码可在https://github.com/alexzhang13/rlm获取。
Summary / 总结
The research aims to enable large language models to handle long prompts by proposing Recursive Language Models (RLMs), which allow the model to examine and recursively call itself over prompt snippets. The study finds that RLMs can process inputs up to two orders of magnitude longer than the model's context window and outperform vanilla LLMs and long-context scaffolds in four diverse tasks, with comparable cost. At a small scale, the RLM-Qwen3-8B model outperforms the underlying Qwen3-8B model by 28.3% on average and approaches the quality of vanilla GPT-5 on three long-context tasks.
研究旨在通过提出递归语言模型(RLMs),使大型语言模型能够处理长输入,允许模型对提示片段进行检查并递归调用自身。研究发现,RLMs可以处理比模型上下文窗口大两个数量级的输入,并在四个不同任务中超越了传统的LLM和长上下文支架,成本相当。在小规模下,RLM-Qwen3-8B模型在平均上比底层的Qwen3-8B模型高出28.3%,并在三个长上下文任务中接近vanilla GPT-5的质量。
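The recursive decomposition described in the entry above can be illustrated with a minimal sketch. It is not the authors' implementation: in the paper the model programmatically examines the prompt itself, whereas this sketch hard-codes a split-and-recurse policy. The names `rlm_answer` and `toy_llm`, and the `window`, `fanout`, and `max_depth` parameters, are hypothetical, and `toy_llm` is a stand-in for a real model call.

```python
from typing import Callable


def rlm_answer(context: str, query: str, llm: Callable[[str], str],
               window: int = 200, fanout: int = 4,
               depth: int = 0, max_depth: int = 8) -> str:
    """Answer `query` over `context` without ever sending more than
    `window` characters of context to a single `llm` call."""
    lines = context.splitlines()
    # Base case: the context fits, cannot be split further, or we hit the depth cap.
    if len(context) <= window or len(lines) <= 1 or depth >= max_depth:
        return llm(f"Context:\n{context[:window]}\n\nQuestion: {query}")
    # Split on line boundaries into `fanout` snippets (ceiling division).
    step = -(-len(lines) // fanout)
    snippets = ["\n".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    # Recursive calls over each snippet of the prompt.
    partials = [rlm_answer(s, query, llm, window, fanout, depth + 1, max_depth)
                for s in snippets]
    # Combine the partial answers with one more (possibly recursive) call.
    return rlm_answer("\n".join(partials), query, llm,
                      window, fanout, depth + 1, max_depth)


def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model: returns the first KEY=... line it sees."""
    for line in prompt.splitlines():
        if line.startswith("KEY="):
            return line
    return "not found"
```

With a long filler document that hides a single `KEY=42` line, `rlm_answer(doc, "What is KEY?", toy_llm)` recovers that line while each individual call sees only a short snippet.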
Evolutionary Strategies lead to Catastrophic Forgetting in LLMs
Authors: Immanuel Abdi, Akshat Gupta, Micah Mok, Alexander Lu, Nicholas Lee, Gopala Anumanchipalli
First: 2026-01-28T18:59:34+00:00 · Latest: 2026-01-28T18:59:34+00:00
Abstract
One of the biggest missing capabilities in current AI systems is the ability to learn continuously after deployment. Implementing such continually learning systems has several challenges, one of which is the large memory requirement of gradient-based algorithms that are used to train state-of-the-art LLMs. Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms and have shown encouraging performance on specific tasks in LLMs. In this paper, we perform a comprehensive analysis of ES and specifically evaluate its forgetting curves when training for an increasing number of update steps. We first find that ES is able to reach performance numbers close to GRPO for math and reasoning tasks with a comparable compute budget. However, and most importantly for continual learning, the performance gains in ES are accompanied by significant forgetting of prior abilities, limiting its applicability for training models online. We also explore the reason behind this behavior and show that the updates made using ES are much less sparse and have orders of magnitude larger $\ell_2$ norm compared to corresponding GRPO updates, explaining the contrasting forgetting curves between the two algorithms. With this study, we aim to highlight the issue of forgetting in gradient-free algorithms like ES and hope to inspire future work to mitigate these issues.
中文标题/摘要
标题:进化策略导致大模型出现灾难性遗忘
当前AI系统中最大的缺失能力之一是在部署后持续学习的能力。实现这样的持续学习系统存在多个挑战,其中之一是用于训练最先进大模型的基于梯度的算法需要大量内存。进化策略(ES)最近作为传统学习算法的无梯度替代方案重新受到关注,并在大模型的特定任务上表现出令人鼓舞的性能。在本文中,我们对ES进行了全面分析,并特别评估了其在更新步数不断增加时的遗忘曲线。我们首先发现,在计算预算相当的情况下,ES能够在数学和推理任务上达到与GRPO相近的性能水平。然而,对持续学习而言最重要的是,ES的性能提升伴随着对先前能力的显著遗忘,限制了其在在线训练模型中的应用。我们还探讨了这种行为的原因,并表明使用ES进行的更新比相应的GRPO更新稀疏程度低得多,且$\ell_2$范数大出若干个数量级,这解释了两种算法之间截然不同的遗忘曲线。通过这项研究,我们旨在突出ES等无梯度算法中的遗忘问题,并希望激发未来的工作来缓解这些问题。
Summary / 总结
This paper investigates Evolutionary Strategies (ES) as a gradient-free method for training large language models (LLMs) in continual learning settings. ES reaches performance close to GRPO on math and reasoning tasks with a comparable compute budget. However, its gains come with significant forgetting of prior abilities as the number of update steps grows, which limits its suitability for online training. The study attributes this to ES updates being much less sparse and having orders of magnitude larger $\ell_2$ norm than the corresponding GRPO updates, and it highlights the need for future work on forgetting in gradient-free algorithms.
该论文研究了进化策略(ES)在训练大规模语言模型(LLMs)进行持续学习中的应用。ES在数学和推理任务上达到了与梯度方法如GRPO相近的性能,并且具有相似的计算资源。然而,ES存在灾难性遗忘的问题,即在继续学习新任务时会忘记之前学到的能力,这限制了其在线训练的适用性。研究指出,这归因于ES做出的更新比GRPO更新更不稀疏且幅度更大。
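The two update statistics named in the entry above (sparsity and $\ell_2$ norm) can be measured as in the following toy sketch. It assumes a standard reward-weighted Gaussian-perturbation ES estimator, which may differ from the variant the paper evaluates; GRPO is not implemented here. The names `es_step` and `update_stats`, the tolerance `tol`, and all hyperparameter values are hypothetical.

```python
import math
import random


def es_step(theta, reward_fn, sigma=0.1, lr=0.05, population=16, seed=0):
    """One ES update: perturb all parameters with Gaussian noise, score each
    perturbation, and return the reward-weighted average noise as the update."""
    rng = random.Random(seed)
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(population)]
    rewards = [reward_fn([t + sigma * e for t, e in zip(theta, eps)])
               for eps in noises]
    mean = sum(rewards) / population
    # Fall back to 1.0 if all rewards are identical (zero standard deviation).
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / population) or 1.0
    weights = [(r - mean) / std for r in rewards]
    scale = lr / (population * sigma)
    return [scale * sum(w * eps[j] for w, eps in zip(weights, noises))
            for j in range(len(theta))]


def update_stats(delta, tol=1e-8):
    """Return (fraction of near-zero entries, l2 norm) of an update vector."""
    sparsity = sum(abs(d) <= tol for d in delta) / len(delta)
    l2 = math.sqrt(sum(d * d for d in delta))
    return sparsity, l2
```

Because every parameter receives Gaussian noise, the ES update in this sketch touches essentially all coordinates, so its measured sparsity is near zero. Applying `update_stats` to the parameter difference produced by any other training step gives a like-for-like comparison.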
LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs
Authors: Piyush Jha, Arnav Arora, Vijay Ganesh
Venue: AAAI 2025
First: 2024-11-13T18:44:30+00:00 · Latest: 2026-01-28T18:58:57+00:00
Comments: Accepted at AAAI 2025
Abstract
We introduce LLMStinger, a novel approach that leverages Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. Unlike traditional methods, which require complex prompt engineering or white-box access, LLMStinger uses a reinforcement learning (RL) loop to fine-tune an attacker LLM, generating new suffixes based on existing attacks for harmful questions from the HarmBench benchmark. Our method significantly outperforms existing red-teaming approaches (we compared against 15 of the latest methods), achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2, both models known for their extensive safety measures. Additionally, we achieved a 94.97% ASR on GPT-3.5 and 99.4% on Gemma-2B-it, demonstrating the robustness and adaptability of LLMStinger across open and closed-source models.
中文标题/摘要
标题:LLMStinger:使用RL微调的LLM破解大型语言模型
我们介绍了LLMStinger,这是一种新颖的方法,利用大型语言模型(LLMs)自动生成针对破解攻击的对抗后缀。与传统方法不同,后者需要复杂的提示工程或白盒访问,LLMStinger使用强化学习(RL)循环微调攻击LLM,根据现有攻击生成新的后缀,以应对来自HarmBench基准的有害问题。我们的方法在现有红队方法中表现显著优于(我们与15种最新方法进行了比较),在LLaMA2-7B-chat上实现了攻击成功率(ASR)提高57.2%,在Claude 2上提高了50.3%的ASR,这两款模型因其广泛的安全措施而闻名。此外,我们在GPT-3.5上实现了94.97%的ASR,在Gemma-2B-it上实现了99.4%的ASR,这表明LLMStinger在开源和闭源模型中具有稳健性和适应性。
Summary / 总结
LLMStinger is a novel approach that uses Large Language Models (LLMs) to automatically generate adversarial suffixes for jailbreak attacks. It employs a reinforcement learning (RL) loop to fine-tune an attacker LLM, which generates new suffixes based on existing attacks. LLMStinger significantly outperforms 15 state-of-the-art red-teaming methods, achieving a +57.2% improvement in Attack Success Rate (ASR) on LLaMA2-7B-chat and a +50.3% ASR increase on Claude 2. It also demonstrates high ASR on GPT-3.5 and Gemma-2B-it, showing its robustness and adaptability across different models.
LLMStinger 是一种方法,利用大型语言模型(LLMs)自动生成用于越狱攻击的对抗后缀。不同于传统方法,它使用强化学习(RL)循环来微调攻击 LLM,基于现有攻击生成新的后缀。该方法在与 15 种最新红队方法的比较中显著提高了攻击成功率(ASR),在 LLaMA2-7B-chat 上提高了 +57.2%,在 Claude 2 上提高了 +50.3% 的 ASR。此外,它在 GPT-3.5 和 Gemma-2B-it 上也表现出高 ASR,表明其在不同模型上的鲁棒性和适应性。
DCP-Bench-Open: Evaluating LLMs for Constraint Modelling of Discrete Combinatorial Problems
Authors: Kostis Michailidis, Dimos Tsouros, Tias Guns
First: 2025-06-06T12:56:02+00:00 · Latest: 2026-01-28T18:58:23+00:00
Comments: This version is currently submitted and it is under review. For CP-Bench (the paper accepted at ECAI25), please refer to the previous version of this entry (v2)
Abstract
Discrete Combinatorial Problems (DCPs) are prevalent in industrial decision-making and optimisation. However, while constraint solving technologies for DCPs have advanced significantly, the core process of formalising them, namely constraint modelling, requires significant expertise and remains a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, the existing evaluation datasets for discrete constraint modelling are often limited to small, homogeneous, or domain-specific problems, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing DCP-Bench-Open, a novel benchmark that includes a diverse set of well-known discrete combinatorial problems sourced from the Constraint Programming (CP) and Operations Research (OR) communities, structured explicitly for evaluating LLM-driven constraint modelling. With this dataset, and given the variety of modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate the use of prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 91% on this highly challenging benchmark. DCP-Bench-Open is publicly available.
中文标题/摘要
标题:DCP-Bench-Open:评估大型语言模型在离散组合问题约束建模中的应用
离散组合问题(DCPs)在工业决策和优化中普遍存在。然而,尽管DCPs的约束求解技术取得了显著进步,但将它们形式化的核心过程,即约束建模,仍需要大量专业知识,成为更广泛应用的瓶颈。为缓解这一瓶颈,最近的研究探索了使用大型语言模型(LLMs)将组合问题描述转换为可执行的约束模型。然而,现有的离散约束建模评估数据集通常局限于小规模、同质或特定领域的例子,无法捕捉到现实世界的多样性。本研究通过引入DCP-Bench-Open,一种包含来自约束编程(CP)和运筹学(OR)社区的多种知名离散组合问题的新基准,填补了这一空白,这些问题明确结构化以评估LLM驱动的约束建模。借助此数据集,考虑到不同的建模框架,我们比较和评估了LLMs在三种不同抽象级别和底层语法的约束建模系统中的建模能力。值得注意的是,结果表明,使用基于Python的高级框架进行建模时性能更高。此外,我们系统地评估了提示方法和推理时计算方法在不同LLM中的使用情况,这进一步提高了准确性,最高达到91%的这个极具挑战性的基准。DCP-Bench-Open是公开可用的。
Summary / 总结
This work introduces DCP-Bench-Open, a benchmark for evaluating how well Large Language Models (LLMs) transform combinatorial problem descriptions into executable constraint models. It addresses the limitations of existing datasets by including a diverse set of well-known problems from the CP and OR communities. The study compares LLMs' performance across three constraint modelling systems and finds higher performance with a high-level Python-based framework. Prompt-based and inference-time compute methods further increase accuracy, reaching up to 91% on the benchmark.
该研究引入了DCP-Bench-Open,这是一个用于评估大型语言模型(LLMs)将组合问题描述转换为可执行约束模型能力的基准。基准包括来自CP和OR社区的一系列多样化的离散组合问题。研究比较了LLMs在三种不同约束建模系统中的建模能力,并发现使用高级Python基础框架时性能更高。此外,研究还评估了基于提示和推理时计算方法,达到了高达91%的准确率,这在这一极具挑战性的基准上具有重要意义。
When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation
Authors: David Tan, Pinzhen Chen, Josef van Genabith, Koel Dutta Chowdhury
First: 2026-01-28T18:56:21+00:00 · Latest: 2026-01-28T18:56:21+00:00
Comments: 5 pages of content, 15 total. 5 figures, 12 tables total. Accepted to EACL 2026 main conference. Code can be found here: github.com/Mr-Ao-25/cross-ling-contamination
Abstract
Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz's FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbation efforts like paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.
中文标题/摘要
标题:当弗洛雷斯之花绽放错误:机器翻译评估中的跨方向污染
大型语言模型(LLMs)可能会受到基准污染的影响,导致分数被夸大,掩盖了记忆而非泛化。在多语言环境中,这种记忆甚至可以转移到“未污染”的语言中。使用FLORES-200翻译基准作为诊断工具,我们研究了两个7-8B指令调优的多语言LLMs:Bloomz,它在FLORES上进行了训练,以及Llama作为未污染的对照组。我们确认了Bloomz的FLORES污染,并展示了机器翻译污染可以是跨方向的,由于目标侧的记忆,导致在未见过的翻译方向上的人工性能提升。进一步的分析表明,尽管采取了各种源侧扰动措施,如改写和命名实体替换,记忆参考的召回率仍然存在。然而,替换命名实体导致BLEU分数的一致下降,这表明对于污染模型的记忆进行有效探测的方法。
Summary / 总结
The study investigates cross-direction contamination in machine translation evaluation, using the FLORES-200 benchmark as a diagnostic. Comparing Bloomz, an instruction-tuned multilingual LLM trained on FLORES, with Llama as an uncontaminated control, the research confirms Bloomz's contamination and demonstrates that contamination can be cross-directional: target-side memorization artificially boosts performance in unseen translation directions. Recall of memorized references often persists under source-side perturbations such as paraphrasing, but named entity replacement leads to a consistent BLEU decrease, making it an effective probe for memorization in contaminated models.
研究探讨了机器翻译评估中的跨方向污染问题,以FLORES-200基准为对象。通过使用在FLORES上训练的多语言LLM Bloomz和未受污染的Llama作为对照,研究者确认了Bloomz的污染情况,并展示了目标侧记忆可以人为提升未见过的翻译方向的性能。研究还发现,尽管进行了源侧的扰动,记忆的参考信息召回率仍然存在,但替换命名实体会一致地降低BLEU分数,表明这是一种有效的探测污染模型中记忆的方法。
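The named-entity replacement probe from the entry above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's pipeline: `unigram_precision` is a crude stand-in for BLEU, the entity maps are hand-built, and scoring the perturbed output against an entity-swapped reference is an assumption of this sketch. The `translate` argument is a hypothetical callable wrapping whatever model is under test.

```python
from collections import Counter


def unigram_precision(hyp: str, ref: str) -> float:
    """Clipped unigram precision of `hyp` against `ref` (simplified BLEU-1)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    total = sum(h.values())
    return sum(min(c, r[w]) for w, c in h.items()) / total if total else 0.0


def replace_entities(text: str, mapping: dict) -> str:
    """Swap each named entity in `text` for its replacement."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


def probe(translate, source, reference, src_map, tgt_map):
    """Score the model before and after entity replacement.
    A large drop suggests the output was recalled rather than translated."""
    base = unigram_precision(translate(source), reference)
    new_source = replace_entities(source, src_map)
    new_reference = replace_entities(reference, tgt_map)
    perturbed = unigram_precision(translate(new_source), new_reference)
    return base, perturbed
```

A model that reproduces a memorized reference regardless of the input scores perfectly on the original pair but loses the swapped entity on the perturbed pair, which is the kind of score decrease the probe looks for.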
FreeFix: Boosting 3D Gaussian Splatting via Fine-Tuning-Free Diffusion Models
Authors: Hongyu Zhou, Zisen Shao, Sheng Miao, Pan Wang, Dongfeng Bai, Bingbing Liu, Yiyi Liao
First: 2026-01-28T18:56:03+00:00 · Latest: 2026-01-28T18:56:03+00:00
Comments: Our project page is at https://xdimlab.github.io/freefix
Abstract
Neural Radiance Fields and 3D Gaussian Splatting have advanced novel view synthesis, yet still rely on dense inputs and often degrade at extrapolated views. Recent approaches leverage generative models, such as diffusion models, to provide additional supervision, but face a trade-off between generalization and fidelity: fine-tuning diffusion models for artifact removal improves fidelity but risks overfitting, while fine-tuning-free methods preserve generalization but often yield lower fidelity. We introduce FreeFix, a fine-tuning-free approach that pushes the boundary of this trade-off by enhancing extrapolated rendering with pretrained image diffusion models. We present an interleaved 2D-3D refinement strategy, showing that image diffusion models can be leveraged for consistent refinement without relying on costly video diffusion models. Furthermore, we take a closer look at the guidance signal for 2D refinement and propose a per-pixel confidence mask to identify uncertain regions for targeted improvement. Experiments across multiple datasets show that FreeFix improves multi-frame consistency and achieves performance comparable to or surpassing fine-tuning-based methods, while retaining strong generalization ability.
中文标题/摘要
标题:FreeFix:通过无需微调的扩散模型增强3D高斯散点图
神经辐射场和3D高斯散点图在新颖视图合成方面取得了进展,但仍依赖于密集输入,并且在外推视图中往往会退化。最近的方法利用生成模型,如扩散模型,提供额外的监督,但存在泛化能力和保真度之间的权衡:用于去除伪影的扩散模型微调可以提高保真度,但存在过拟合风险,而无需微调的方法可以保持泛化能力,但通常保真度较低。我们提出了FreeFix,这是一种无需微调的方法,通过使用预训练的图像扩散模型增强外推渲染,从而在这一权衡中推动边界。我们提出了一种交错的2D-3D细化策略,表明图像扩散模型可以用于一致的细化,而无需依赖昂贵的视频扩散模型。此外,我们更深入地研究了2D细化的指导信号,并提出了一种逐像素置信度掩码来识别需要改进的不确定区域。在多个数据集上的实验表明,FreeFix提高了多帧一致性,并实现了与或超越了基于微调方法的性能,同时保持了强大的泛化能力。
Summary / 总结
FreeFix is a fine-tuning-free approach that enhances 3D Gaussian splatting by using pretrained image diffusion models for consistent 2D-3D refinement. It introduces an interleaved 2D-3D refinement strategy and a per-pixel confidence mask to improve multi-frame consistency and achieve performance comparable to or surpassing fine-tuning-based methods while maintaining strong generalization ability.
FreeFix 是一种无需微调的方法,通过使用预训练的图像扩散模型进行一致的 2D-3D 精炼来增强 3D 高斯散点图。它引入了一种交错的 2D-3D 精炼策略和一个逐像素置信掩码,以提高多帧一致性并实现与或超越基于微调方法的性能,同时保持强大的泛化能力。
SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models
Authors: Sebastiano Monti, Carlo Nicolini, Gianni Pellegrini, Jacopo Staiano, Bruno Lepri
First: 2026-01-28T18:56:00+00:00 · Latest: 2026-01-28T18:56:00+00:00
Abstract
Although the capabilities of large language models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated. In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence. Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting a fundamental constraint on forward planning capacity. We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools allows for modest improvements, suggesting inherent architectural limitations which might not be overcome by test-time scaling approaches alone.
中文标题/摘要
标题:SokoBench:评估大型语言模型的长期规划和推理能力
尽管大型语言模型的能力在复杂的推理任务中得到了越来越多的测试,但它们的长期规划能力尚未得到广泛研究。在本文中,我们系统地评估了最先进的大型推理模型(LRMs)的规划和长期推理能力。我们提出了一种基于推箱子谜题的新基准,故意简化以隔离长期规划与状态持久性。我们的研究结果表明,当需要超过25步才能达到解决方案时,规划性能会出现一致下降,这表明存在基本的前向规划能力限制。我们展示了将LRMs与规划领域定义语言(PDDL)解析、验证和求解工具结合使用可以带来适度的改进,这表明固有的架构限制可能无法仅通过测试时的扩展方法来克服。
Summary / 总结
The research evaluates the long-horizon planning and reasoning abilities of state-of-the-art Large Reasoning Models (LRMs) using a novel Sokoban-based benchmark. The study finds that LRMs show consistent performance degradation when more than 25 moves are required to solve the puzzles, indicating a fundamental constraint in their forward planning capacity. Integrating PDDL tools provides only modest improvements, suggesting inherent architectural limitations that may not be addressed by test-time scaling alone.
研究旨在通过提出基于Sokoban的新型基准来评估大型语言模型(LRMs)的长期规划和推理能力。研究发现,当需要超过25步才能解决问题时,LRMs的表现会出现一致的下降,这表明它们在前向规划能力方面存在根本性的限制。整合PDDL工具只能提供轻微的改进,表明可能存在固有的架构限制,这些限制可能不会通过测试时的扩展方法来克服。
From Specialist to Generalist: Unlocking SAM's Learning Potential on Unlabeled Medical Images
Authors: Vi Vu, Thanh-Huy Nguyen, Tien-Thinh Nguyen, Ba-Thinh Lam, Hoang-Thien Nguyen, Tianyang Wang, Xingjian Li, Min Xu
Venue: ISBI 2026
First: 2026-01-25T18:13:48+00:00 · Latest: 2026-01-28T18:55:46+00:00
Comments: Accepted to ISBI 2026
Abstract
Foundation models like the Segment Anything Model (SAM) show strong generalization, yet adapting them to medical images remains difficult due to domain shift, scarce labels, and the inability of Parameter-Efficient Fine-Tuning (PEFT) to exploit unlabeled data. While conventional models like U-Net excel in semi-supervised medical learning, their potential to assist a PEFT SAM has been largely overlooked. We introduce SC-SAM, a specialist-generalist framework where U-Net provides point-based prompts and pseudo-labels to guide SAM's adaptation, while SAM serves as a powerful generalist supervisor to regularize U-Net. This reciprocal guidance forms a bidirectional co-training loop that allows both models to effectively exploit the unlabeled data. Across prostate MRI and polyp segmentation benchmarks, our method achieves state-of-the-art results, outperforming other existing semi-supervised SAM variants and even medical foundation models like MedSAM, highlighting the value of specialist-generalist cooperation for label-efficient medical image segmentation. Our code is available at https://github.com/vnlvi2k3/SC-SAM.
中文标题/摘要
标题:从专家到通才:解锁SAM在未标注医学图像上的学习潜力
基础模型如分割一切模型(SAM)表现出强大的泛化能力,但将其适应医学图像仍然困难重重,由于领域转移、稀缺的标签以及参数高效微调(PEFT)无法利用未标注数据。尽管像U-Net这样的传统模型在半监督医学学习中表现出色,但它们协助PEFT SAM的潜力却常常被忽视。我们提出了SC-SAM,这是一种专家-通才框架,其中U-Net提供基于点的提示和伪标签来引导SAM的适应,而SAM则作为强大的通才监督者来正则化U-Net。这种相互指导形成了双向的协同训练循环,使两个模型都能有效地利用未标注数据。在前列腺MRI和息肉分割基准测试中,我们的方法取得了最先进的成果,超越了其他现有的半监督SAM变体,甚至医学基础模型如MedSAM,突显了专家-通才合作在标签高效医学图像分割中的价值。我们的代码可在https://github.com/vnlvi2k3/SC-SAM获取。
Summary / 总结
The research aims to enhance the adaptability of the Segment Anything Model (SAM) to medical images by leveraging the strengths of U-Net. The method introduces SC-SAM, a specialist-generalist framework where U-Net provides prompts and pseudo-labels to guide SAM's adaptation, while SAM acts as a generalist supervisor to regularize U-Net. This bidirectional co-training loop effectively utilizes unlabeled data. The method outperforms other semi-supervised SAM variants and medical foundation models like MedSAM on prostate MRI and polyp segmentation benchmarks, demonstrating the value of specialist-generalist cooperation for label-efficient medical image segmentation.
研究旨在通过利用U-Net的优势来增强Segment Anything Model (SAM)在医疗图像中的适应性。方法引入了SC-SAM,这是一种专家-通才框架,其中U-Net提供提示和伪标签来引导SAM的适应,而SAM则作为通才监督者来正则化U-Net。这种双向协同训练循环有效地利用了未标记的数据。该方法在前列腺MRI和息肉分割基准测试中优于其他半监督SAM变体和医疗基础模型如MedSAM,展示了专家-通才合作在标签高效医疗图像分割中的价值。
HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
Authors: Hongzheng Chen, Yingheng Wang, Yaohui Cai, Hins Hu, Jiajie Li, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang
Venue: ICLR 2026
First: 2025-06-09T17:46:47+00:00 · Latest: 2026-01-28T18:52:54+00:00
Comments: Accepted to ICLR'26
Abstract
While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
中文标题/摘要
标题:HeuriGym:LLM生成启发式算法在组合优化中的代理基准
虽然大型语言模型(LLMs)在推理和基于代理的问题解决方面取得了显著进展,但当前的评估方法未能充分评估其能力:现有基准要么依赖于容易饱和和记忆的封闭式问题,要么依赖于缺乏一致性和严谨性的主观比较。在本文中,我们引入了HeuriGym,这是一种代理框架,用于评估LLM生成的启发式算法在组合优化问题中的表现,这些问题具有明确的目标和广阔的解空间。HeuriGym使LLM能够提出启发式算法,通过代码执行接收评估反馈,并逐步改进其解决方案。我们对九个最先进的模型在计算机系统、物流和生物学等领域中的九个问题进行了评估,揭示了工具使用、规划和适应性推理方面的持续局限性。为了量化性能,我们提出了质量产量指数(QYI),这是一个同时捕捉解通过率和质量的指标。即使是顶级模型如GPT-o4-mini-high和Gemini-2.5-Pro的QYI得分也只有0.6,远低于专家基准的1。我们的开源基准旨在引导LLM向更有效的科学和工程领域问题解决发展。
Summary / 总结
HeuriGym is an agentic benchmark for evaluating LLM-generated heuristics in combinatorial optimization, addressing limitations of existing benchmarks. It allows LLMs to propose heuristics, receive feedback through code execution, and iteratively refine their solutions. Evaluating nine state-of-the-art models on nine problems, the study reveals persistent limitations in tool use, planning, and adaptive reasoning. The proposed Quality-Yield Index (QYI) captures both solution pass rate and quality; even top models like GPT-o4-mini-high and Gemini-2.5-Pro reach QYI scores of only 0.6, well below the expert baseline of 1.
HeuriGym 是一个用于评估 LLM 生成的组合优化启发式算法的代理框架,解决了现有基准的局限性。它使 LLM 能够提出启发式算法,通过代码执行接收反馈并迭代改进解决方案。研究对九个最先进的模型在九个问题上的评估显示了在工具使用、规划和自适应推理方面的局限性,顶级模型如 GPT-o4-mini-high 和 Gemini-2.5-Pro 的 QYI 分数低于专家基准 1。
C3Box: A CLIP-based Class-Incremental Learning Toolbox
Authors: Hao Sun, Da-Wei Zhou
First: 2026-01-28T18:52:36+00:00 · Latest: 2026-01-28T18:52:36+00:00
Comments: The code is available at https://github.com/LAMDA-CL/C3Box
Abstract
Traditional machine learning systems are typically designed for static data distributions and suffer from catastrophic forgetting when learning from evolving data streams. Class-Incremental Learning (CIL) addresses this challenge by enabling learning systems to continuously learn new classes while preserving prior knowledge. With the rise of pre-trained models (PTMs) such as CLIP, leveraging their strong generalization and semantic alignment capabilities has become a promising direction in CIL. However, existing CLIP-based CIL methods are often scattered across disparate codebases and rely on inconsistent configurations, hindering fair comparisons, reproducibility, and practical adoption. Therefore, we propose C3Box (CLIP-based Class-inCremental learning toolBOX), a modular and comprehensive Python toolbox. C3Box integrates representative traditional CIL methods, ViT-based CIL methods, and state-of-the-art CLIP-based CIL methods into a unified CLIP-based framework. By inheriting the streamlined design of PyCIL, C3Box provides a JSON-based configuration and standardized execution pipeline. This design enables reproducible experimentation with low engineering overhead and makes C3Box a reliable benchmark platform for continual learning research. Designed to be user-friendly, C3Box relies only on widely used open-source libraries and supports major operating systems. The code is available at https://github.com/LAMDA-CL/C3Box.
中文标题/摘要
标题:C3Box:基于CLIP的类增量学习工具箱
传统的机器学习系统通常针对静态数据分布设计,当从不断变化的数据流中学习时,会遭受灾难性遗忘。类增量学习(CIL)通过使学习系统能够连续学习新类并保留先前知识来应对这一挑战。随着预训练模型(PTMs)如CLIP的兴起,利用其强大的泛化能力和语义对齐能力在CIL中成为了一个有前景的方向。然而,现有的基于CLIP的CIL方法往往分散在不同的代码库中,依赖于不一致的配置,阻碍了公平比较、可重复性和实际应用。因此,我们提出了C3Box(CLIP基于的类增量学习工具箱),一个模块化和全面的Python工具箱。C3Box将传统的CIL方法、ViT基于的CIL方法和最先进的基于CLIP的CIL方法整合到一个统一的基于CLIP的框架中。通过继承PyCIL的精简设计,C3Box提供了一个基于JSON的配置和标准化执行管道。这种设计使得在低工程开销下进行可重复实验成为可能,并使C3Box成为持续学习研究的可靠基准平台。C3Box设计用户友好,仅依赖广泛使用的开源库,并支持主要的操作系统。代码可在https://github.com/LAMDA-CL/C3Box获取。
Summary / 总结
C3Box is a modular Python toolbox designed to facilitate class-incremental learning (CIL) using pre-trained models like CLIP. It integrates various traditional and vision transformer-based CIL methods into a unified CLIP-based framework, providing a standardized execution pipeline and JSON-based configuration for reproducible experiments. C3Box aims to improve the fairness of comparisons and practical adoption of CIL methods by reducing engineering overhead and ensuring consistency in configurations.
C3Box 是一个模块化的 Python 工具箱,旨在使用基于 CLIP 的方法促进类增量学习 (CIL)。它将各种传统和最先进的 CIL 方法整合到一个统一的框架中,提供标准化的执行管道和基于 JSON 的配置以实现可重复的实验。主要发现包括提高了 CIL 方法的可重复性和实际应用,使 C3Box 成为持续学习研究中一个有价值的基准平台。
Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs
Authors: Rui Pan, Zhuofu Chen, Hongyi Liu, Arvind Krishnamurthy, Ravi Netravali
First: 2025-12-23T18:16:58+00:00 · Latest: 2026-01-28T18:48:35+00:00
Abstract
Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.7$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
中文标题/摘要
标题:快速失败,赢得更大:通过扩散大语言模型重新思考推测性解码的起草策略
扩散大语言模型(dLLMs)提供快速并行的标记生成,但它们的独立使用受到效率与质量固有的权衡。我们表明,如果谨慎应用,dLLMs 的属性实际上可以成为起草者在自回归(AR)验证器辅助下的推测性解码中的优势。我们的核心见解是,dLLM 的并行解码速度大大降低了昂贵拒绝的风险,提供了一种实用机制,以有效实现(难以捉摸的)长篇草案,这些草案在推测性解码中能带来大量加速。我们提出了 FailFast,一种基于 dLLM 的推测性解码框架,通过动态调整其推测长度来实现这一方法。它“快速失败”通过在难以推测的区域花费最少的计算资源来缩短推测延迟,并在容易推测的区域“赢得更大”通过积极扩展草案长度来减少验证延迟(在许多情况下,一次推测并接受 70 个标记!)。无需任何微调,FailFast 为 AR LLM 提供无损加速,并在多种模型和工作负载上分别实现了高达 4.9 倍、1.7 倍和 1.7 倍的速度提升。我们开源了 FailFast,地址为 https://github.com/ruipeterpan/failfast。
Summary / 总结
The paper addresses the efficiency-quality tradeoff in diffusion large language models (dLLMs) and proposes FailFast, a speculative decoding framework that leverages dLLMs' speed to reduce the risk of costly rejections. By dynamically adjusting speculation length, FailFast 'fails fast' in hard-to-speculate regions and 'wins big' by extending draft lengths in easier regions, achieving up to 4.9 times speedup over vanilla decoding and 1.7 times over the best naive dLLM drafter across various models and workloads.
论文探讨了在使用扩散大语言模型(dLLMs)进行推测性解码时的效率与质量权衡问题。它提出了FailFast框架,该框架动态调整推测长度,以减少昂贵的拒绝并最大化在容易区域的草稿长度。FailFast在各种模型和工作负载上实现了高达4.9倍的加速,且无需微调,比最佳的朴素dLLM草稿者快1.7倍。
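The "fail fast, win big" idea of adapting the speculation length, as described in the entry above, can be sketched with a toy draft-and-verify loop. This is not the FailFast algorithm itself: the doubling and halving rule below is a hypothetical policy chosen for illustration, and `target_next` and `draft_next` are stand-in callables for the autoregressive verifier and the drafter. The loop is lossless in the sense that its output matches greedy decoding with `target_next`.

```python
def speculative_decode(target_next, draft_next, prompt, n_tokens,
                       min_len=1, max_len=64):
    """Generate `n_tokens` tokens. Returns (tokens, number of verify rounds)."""
    out = list(prompt)
    goal = len(prompt) + n_tokens
    k = min_len  # current speculation length
    rounds = 0
    while len(out) < goal:
        rounds += 1
        # Draft up to k tokens with the cheap drafter.
        ctx = list(out)
        draft = []
        for _ in range(min(k, goal - len(out))):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # Verify: keep the longest prefix the target model agrees with.
        # (A real system checks all draft positions in one forward pass.)
        ctx = list(out)
        accepted = 0
        for tok in draft:
            if target_next(ctx) != tok:
                break
            ctx.append(tok)
            accepted += 1
        out = ctx
        if accepted == len(draft):
            k = min(max_len, k * 2)       # easy region: speculate further
        else:
            out.append(target_next(out))  # take the verifier's own token
            k = max(min_len, k // 2)      # hard region: shrink the draft
    return out[:goal], rounds
```

When the drafter agrees with the verifier over long stretches, the draft length grows and the number of verification rounds falls well below the number of generated tokens. After a rejection the draft length shrinks, so little drafting work is wasted in hard regions.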
A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion
Authors: Willams de Lima Costa, Thifany Ketuli Silva de Souza, Jonas Ferreira Silva, Carlos Gabriel Bezerra Pereira, Bruno Reis Vila Nova, Leonardo Silvino Brito, Rafael Raider Leoni, Juliano Silva, Valter Ferreira, Sibele Miguel Soares Neto, Samantha Uehara, Daniel Giacomo, João Marcelo Teixeira, Veronica Teichrieb, Cristiano Coelho de Araújo
First: 2026-01-28T18:46:29+00:00 · Latest: 2026-01-28T18:46:29+00:00
Abstract
Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.
中文标题/摘要
标题:一种基于相机-IMU融合的新数据集和鲁棒路面分类框架
路面分类(RSC)是环境感知预测维护系统的关键使能器。然而,现有的RSC技术往往由于传感模态有限和缺乏环境多样性而难以在狭窄的操作条件下泛化。本研究通过引入一种多模态框架来解决这些限制,该框架使用轻量级双向交叉注意力模块融合图像和惯性测量,并通过自适应门控层在领域转移时调整模态贡献。鉴于当前基准的局限性,尤其是缺乏变异性,我们引入了ROAD数据集,由三个互补子集组成:(i)使用黄金标准行业数据记录仪同步的RGB-IMU实时多模态记录,涵盖多种照明、天气和路面条件;(ii)一个大型仅视觉子集,旨在评估在不良照明和异构捕获设置下的鲁棒性;(iii)一个合成子集,用于研究在实践中难以获得的场景中的离分布泛化。实验表明,我们的方法在PVS基准上比之前最先进的方法提高了1.4个百分点,在我们的多模态ROAD子集上提高了11.6个百分点,且在少数类别的F1分数上始终更高。该框架还展示了在具有挑战性的视觉条件下(包括夜间、大雨和混合路面过渡)的稳定性能。这些发现表明,将经济实惠的相机和IMU传感器与多模态注意力机制相结合,为路面理解提供了一个可扩展且鲁棒的基础,特别是在环境多样性和成本限制限制了高端传感套件的采用的地区尤为重要。
Summary / 总结
This work introduces a new multimodal framework for road surface classification that fuses images and inertial measurements using a lightweight bidirectional cross-attention module and an adaptive gating layer. The framework is evaluated on a new dataset called ROAD, which includes real-world multimodal recordings, a large vision-only subset, and a synthetic subset. The method achieves a 1.4 percentage point improvement over the previous state-of-the-art on the PVS benchmark and an 11.6 percentage point improvement on the multimodal ROAD subset, with better F1-scores on minority classes. The framework also shows stable performance under challenging visual conditions such as nighttime, heavy rain, and mixed-surface transitions.
该研究提出了一种新的多模态框架,用于道路表面分类,该框架融合了图像和惯性测量数据,使用了轻量级双向交叉注意力模块和自适应门控层。该框架在包含真实世界多模态记录、大量仅视觉子集和合成子集的新数据集ROAD上进行了评估。该方法在PVS基准上实现了1.4个百分点的改进,在多模态ROAD子集上实现了11.6个百分点的改进,且在少数类别的F1分数上表现更好。该框架还在夜间、大雨和混合表面过渡等具有挑战性的视觉条件下表现出稳定的性能。
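The adaptive gating idea in the road-surface entry above can be illustrated with a minimal sketch. It assumes a single scalar gate computed from the concatenated image and IMU features; the function name `gated_fusion` and the parameters `gate_w` and `gate_b` are hypothetical and are not taken from the paper, whose actual gating layer may differ.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    img_feat: Sequence[float],
    imu_feat: Sequence[float],
    gate_w: Sequence[float],
    gate_b: float,
) -> Tuple[List[float], float]:
    """Blend two equal-length feature vectors with a scalar gate.

    Assumption (not from the paper): the gate is a sigmoid over a linear
    projection of the concatenated features. A gate near 1 favours the
    image branch, a gate near 0 favours the IMU branch.
    """
    if len(img_feat) != len(imu_feat):
        raise ValueError("feature vectors must have the same length")
    concat = list(img_feat) + list(imu_feat)
    if len(gate_w) != len(concat):
        raise ValueError("gate_w must match the concatenated feature length")
    g = sigmoid(sum(w * x for w, x in zip(gate_w, concat)) + gate_b)
    fused = [g * a + (1.0 - g) * b for a, b in zip(img_feat, imu_feat)]
    return fused, g


if __name__ == "__main__":
    # With zero weights and zero bias the gate is 0.5, so the result is the mean.
    print(gated_fusion([1.0, 3.0], [3.0, 1.0], [0.0] * 4, 0.0))
```

Under a domain shift such as nighttime footage, a learned gate of this kind could in principle push weight toward the IMU branch; whether the paper's layer is scalar or per-channel is not stated in the abstract.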
Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve)
Authors: Saurav Prateek
First: 2026-01-28T18:45:39+00:00 · Latest: 2026-01-28T18:45:39+00:00
Comments: 11 pages, 6 figures, 2 tables, source code: https://github.com/SauravP97/deep-researcher-reflect-evolve/
Abstract
This paper introduces a novel Deep Researcher architecture designed to generate detailed research reports on complex PhD-level topics by addressing the inherent limitations of the Parallel Scaling paradigm. Our system utilizes two key innovations: Sequential Research Plan Refinement via Reflection and a Candidates Crossover algorithm. The sequential refinement process is demonstrated as an efficient method that allows the agent to maintain a centralized Global Research Context, enabling it to look back at current progress, reason about the research plan, and intelligently make changes at runtime. This dynamic adaptation contrasts with parallel approaches, which often suffer from siloed knowledge. The Candidates Crossover algorithm further enhances search efficiency by deploying multiple LLM candidates with varied parameters to explore a larger search space, with their findings synthesized to curate a comprehensive final research response. The process concludes with One Shot Report Generation, ensuring the final document is informed by a unified narrative and high fact density. Powered by the Gemini 2.5 Pro model, our Deep Researcher was evaluated on the DeepResearch Bench, a globally recognized benchmark of 100 doctoral-level research tasks. Our architecture achieved an overall score of 46.21, surpassing leading deep research agents listed on the actively running DeepResearch Bench leaderboard, such as Claude Researcher, Nvidia AIQ Research Assistant, Perplexity Research, Kimi Researcher, and Grok Deeper Search. This performance marginally exceeds our previous work, Static DRA, and reinforces the finding that sequential scaling consistently outperforms the parallel self-consistency paradigm.
中文标题/摘要
标题:具有序列计划反思和候选交叉的深度研究员(深度研究员反思进化)
本文介绍了一种新型的深度研究员架构,旨在通过解决并行扩展范式固有的局限性来生成复杂博士级主题的详细研究报告。我们的系统利用了两项关键创新:序列研究计划反思和候选交叉算法。序列精炼过程被证明是一种高效的方法,使代理能够保持一个集中的全局研究上下文,使其能够回顾当前进展,对研究计划进行推理,并在运行时智能地进行修改。这种动态适应与并行方法中常见的知识孤岛形成对比。候选交叉算法进一步提高了搜索效率,通过部署具有不同参数的多个LLM候选者来探索更大的搜索空间,并综合其发现以编撰全面的最终研究回应。该过程以一次生成报告结束,确保最终文档由统一的叙述和高密度事实支持。由Gemini 2.5 Pro模型驱动,我们的深度研究员在DeepResearch Bench上进行了评估,这是一个全球公认的包含100项博士级研究任务的基准测试。我们的架构获得了46.21的总体评分,展示了优于Claude Researcher、Nvidia AIQ Research Assistant、Perplexity Research、Kimi Researcher和Grok Deeper Search等领先深度研究代理的性能。这一性能略高于我们之前的工作Static DRA,并进一步证实了序列扩展始终优于并行自我一致性范式。
Summary / 总结
This paper presents a Deep Researcher architecture that addresses the limitations of the Parallel Scaling paradigm by incorporating Sequential Plan Reflection and Candidates Crossover. The sequential refinement process allows the agent to maintain a centralized research context, enabling dynamic adaptation. The Candidates Crossover algorithm enhances search efficiency by deploying multiple LLM candidates with varied parameters. Evaluated on the DeepResearch Bench, the Deep Researcher achieved a score of 46.21, surpassing other leading research assistants and confirming the superiority of sequential scaling over parallel approaches.
本文提出了一种Deep Researcher架构,通过引入顺序计划反思和候选交叉算法来解决并行扩展范式的局限性。顺序精炼过程使代理能够保持集中化的全局研究上下文,实现动态适应并克服知识孤岛问题。候选交叉算法通过使用具有不同参数的多个LLM候选者来增强搜索效率。在DeepResearch Bench上的评估结果显示,Deep Researcher的整体得分为46.21,超过了其他领先的科研代理,并强化了顺序扩展优于并行自我一致性范式的结论。
BlindSight: Harnessing Sparsity for Efficient Vision-Language Models
Authors: Tharun Adithya Srikrishnan, Deval Shah, Timothy Hein, Ahmed Hasssan, Stephen Youn, Steven K. Reinhardt
First: 2025-07-11T23:15:30+00:00 · Latest: 2026-01-28T18:45:01+00:00
Abstract
Large vision-language models (VLMs) enable joint processing of text and images. However, incorporating vision data significantly increases the prompt length, resulting in a longer time to first token (TTFT). This bottleneck can be alleviated by leveraging the inherent sparsity in the attention computation. Analyzing these attention patterns in VLMs when processing a series of images, we observe the absence of inter-image attention in a substantial portion of layers. Based on this, we propose BlindSight: an approach to optimize multi-image VLM inference using an input-template-aware attention sparsity mask with no runtime overhead. We utilize a dataset to derive a prompt-agnostic categorization for attention heads: Dense, Sink, Intra-Image, and Intra-Image+Sink. We develop a Triton-based GPU kernel to leverage this sparsity. BlindSight achieves a 1.8-3.2x speedup in the attention computation (prompt length 36K-300K). BlindSight generalizes across VLMs (Qwen2-VL, Qwen2.5-VL, Gemma 3), with only a 0.78% absolute accuracy degradation on average on multi-image comprehension benchmarks. Finally, we advocate for the design of efficient VLMs that combine BlindSight-inspired sparse and dense layers.
中文标题/摘要
标题:BlindSight:利用稀疏性提高视觉语言模型效率
大型视觉语言模型(VLMs)能够联合处理文本和图像。然而,引入视觉数据显著增加了提示长度,导致第一个标记生成时间(TTFT)变长。通过利用注意力计算中的固有稀疏性,可以缓解这一瓶颈。在处理一系列图像时分析VLMs的这些注意力模式,我们观察到在大量层中不存在跨图像注意力。基于此,我们提出BlindSight:一种利用输入模板感知的注意力稀疏性掩码优化多图像VLM推理的方法,且无运行时开销。我们利用数据集为注意力头开发了一种提示无关的分类:密集型、汇流型、图像内和图像内+汇流型。我们开发了一个基于Triton的GPU内核以利用这种稀疏性。BlindSight在注意力计算中实现了1.8-3.2倍的加速(提示长度36K-300K)。BlindSight在不同VLMs(Qwen2-VL、Qwen2.5-VL、Gemma 3)上具有良好的泛化能力,平均在多图像理解基准测试中绝对准确率下降0.78%。最后,我们提倡设计结合BlindSight启发式稀疏和密集层的高效VLMs。
Summary / 总结
BlindSight optimizes multi-image vision-language model inference with an input-template-aware attention sparsity mask, achieving a 1.8-3.2x speedup in the attention computation (prompt lengths of 36K-300K) with no runtime overhead. It categorizes attention heads into Dense, Sink, Intra-Image, and Intra-Image+Sink and leverages a Triton-based GPU kernel. The approach generalizes across VLMs (Qwen2-VL, Qwen2.5-VL, Gemma 3), with only a 0.78% absolute accuracy degradation on average on multi-image comprehension benchmarks.
BlindSight 通过利用注意力计算中的固有稀疏性来优化多图像视觉-语言模型的推理,实现1.8-3.2倍的注意力计算加速,且无运行时开销。它将注意力头分类为密集型、汇流型、图像内型和图像内+汇流型,并利用这种分类和基于Triton的GPU内核来加速推理。该方法在不同VLM上的表现具有普适性,并且在多图像理解基准测试上的绝对准确率下降仅为0.78%。
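To make the Intra-Image and Intra-Image+Sink head categories from the BlindSight entry concrete, here is a minimal sketch of a causal attention mask that blocks inter-image attention. The function name `intra_image_sink_mask`, the `segment_ids` encoding, and the rule that text tokens keep dense causal attention are all assumptions made for illustration; the paper's actual masks and its Triton kernel are not reproduced here.

```python
from typing import List, Sequence


def intra_image_sink_mask(
    segment_ids: Sequence[int], num_sink: int = 0
) -> List[List[bool]]:
    """Build a causal boolean mask allowed[q][k] for one attention head.

    segment_ids[i] is the image index of token i, or -1 for a text token
    (hypothetical encoding). Assumed rules:
      * no token attends to a later position (causal);
      * an image token attends only to tokens of the same image;
      * the first `num_sink` positions are always visible (sink tokens);
      * text tokens keep dense causal attention.
    """
    n = len(segment_ids)
    allowed = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: keys at or before the query
            same_image = segment_ids[q] >= 0 and segment_ids[q] == segment_ids[k]
            text_query = segment_ids[q] < 0
            is_sink = k < num_sink
            allowed[q][k] = same_image or text_query or is_sink
    return allowed


if __name__ == "__main__":
    # One text token, two images of two tokens each, then one text token.
    mask = intra_image_sink_mask([-1, 0, 0, 1, 1, -1], num_sink=1)
    for row in mask:
        print("".join("x" if cell else "." for cell in row))
```

Setting `num_sink=0` gives the pure Intra-Image pattern under the same assumptions. The sparsity that a kernel could exploit grows with the number of images, since each image block only covers its own keys.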
Reward Models Inherit Value Biases from Pretraining
Authors: Brian Christian, Jessica A. F. Thompson, Elle Michelle Yang, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska
First: 2026-01-28T18:40:29+00:00 · Latest: 2026-01-28T18:40:29+00:00
Abstract
Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pre-trained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the "Big Two" psychological axes, we show a robust preference of Llama RMs for "agency" and a corresponding robust preference of Gemma RMs for "communion." This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pre-trained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that open-source developers' choice of base model is as much a consideration of values as of performance.
中文标题/摘要
标题:奖励模型继承预训练的价值偏见
奖励模型(RMs)在使大型语言模型(LLMs)与人类价值观对齐方面至关重要,但它们受到的关注远不及预训练和后训练的LLMs。由于RMs是从LLMs初始化的,因此它们继承了影响其行为的表示,但这种影响的性质和程度尚未得到充分研究。通过使用验证过的心理语言学语料库对10个领先的开放权重RMs进行全面研究,我们发现RMs在多个维度上的人类价值方面表现出显著差异,这取决于其基础模型。使用“大两个”心理轴,我们展示了Llama RMs对“自主性”的稳健偏好,以及Gemma RMs对“社群性”的相应稳健偏好。即使偏好数据和微调过程相同,这一现象仍然存在,我们将其追溯到相应指令调优和预训练模型的logits。这些log-概率差异本身可以被表述为一个隐含的RM;我们推导出可使用的隐含奖励分数,并表明它们表现出相同的自主性/社群性差异。我们进行了实验,通过删除偏好数据来源和数量的实验,证明了这种效应不仅可重复,而且出人意料地持久。尽管RMs旨在代表人类偏好,但我们的证据表明,它们的输出受到其基于的预训练LLMs的影响。这项工作强调了预训练阶段安全和对齐努力的重要性,并明确指出开源开发人员选择基础模型不仅关乎性能,也关乎价值观。
Summary / 总结
The study investigates how reward models (RMs) inherit value biases from their pre-trained language models (LLMs). Using 10 leading open-weight RMs and validated psycholinguistic corpora, the research shows significant differences in RMs' preferences for 'agency' and 'communion' based on their base models. Experiments demonstrate that these biases are robust and durable, indicating that RMs' outputs are influenced by the pre-trained LLMs they are initialized from. This work highlights the need for safety and alignment efforts during pretraining and suggests that developers should consider values when choosing base models.
研究探讨了奖励模型(RMs)如何从预训练的语言模型(LLMs)继承价值偏见。使用10个领先的开放权重RMs和验证的心理语言学语料库,研究显示RMs在'agency'和'communion'方面的偏好显著不同,这取决于它们的基础模型。实验表明这些偏见是稳健且持久的,表明RMs的输出受到它们初始化的预训练LLMs的影响。这项工作强调了预训练阶段的安全和对齐努力的重要性,并表明开发人员在选择基础模型时应考虑价值观。
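The reward-model entry above notes that log-probability differences between the instruction-tuned and pre-trained models can be formulated as an implicit RM. A minimal sketch of that idea follows, assuming the score is the summed per-token log-probability difference over a response; the function name `implicit_reward` and the scaling factor `beta` are hypothetical, and the paper's exact formulation may differ.

```python
from typing import Sequence


def implicit_reward(
    logp_tuned: Sequence[float],
    logp_base: Sequence[float],
    beta: float = 1.0,
) -> float:
    """Score a response by how much more likely the tuned model finds it.

    logp_tuned[i] and logp_base[i] are the log-probabilities that the
    instruction-tuned and pre-trained models assign to response token i.
    Assumption: the implicit reward is beta times the summed difference.
    """
    if len(logp_tuned) != len(logp_base):
        raise ValueError("both models must score the same tokens")
    return beta * sum(t - b for t, b in zip(logp_tuned, logp_base))


if __name__ == "__main__":
    # Two hypothetical responses to the same prompt, scored by both models.
    agentic = implicit_reward([-1.0, -2.0], [-1.5, -2.5])
    communal = implicit_reward([-1.5, -2.5], [-1.0, -2.0])
    print("prefers agentic response:", agentic > communal)
```

Comparing such scores across word lists for "agency" and "communion" is one way the base-model difference described in the abstract could be probed; the token log-probabilities here are made-up inputs, not results.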
Multimodal Conversation Structure Understanding
Authors: Kent K. Chang, Mackenzie Hanh Cramer, Anna Ho, Ti Ti Nguyen, Yilin Yuan, David Bamman
First: 2025-05-23T06:41:54+00:00 · Latest: 2026-01-28T18:39:09+00:00
Comments: accepted to EACL 2026 main conference; 22 pages, 9 figures, 10 tables
Abstract
While multimodal large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation -- conversational roles and threading -- remains underexplored. In this work, we introduce a suite of tasks and release TV-MMPC, a new annotated dataset, for multimodal conversation structure understanding. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider experiences a substantial drop in performance when character identities of the conversation are anonymized. Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates in proportion to their speaking time, they are 1.2 times more likely than men to be cast as an addressee or side-participant, and the presence of side-participants shifts the conversational register from personal to social.
中文标题/摘要
标题:多模态对话结构理解
虽然多模态大型语言模型(LLMs)在对话方面表现出色,但它们是否能够充分解析对话结构——对话角色和对话线程——仍待探索。在本研究中,我们引入了一系列任务并发布了TV-MMPC,这是一个新的标注数据集,用于多模态对话结构理解。我们的评估表明,尽管所有多模态LLMs都优于我们的启发式基线,但当对话中的角色身份被匿名化时,我们考虑的最佳模型的性能仍会显著下降。除了评估之外,我们对TVQA中的350,842个语句进行了社会语言学分析。我们发现,虽然女性角色的对话发起率与其发言时间成比例,但她们比男性更有可能被设定为受话人或旁参与者,旁参与者的存在使对话风格从个人转向社会。
Summary / 总结
This study addresses the gap in understanding how multimodal large language models (LLMs) handle conversation structure, such as conversational roles and threading. It introduces new tasks and a dataset, TV-MMPC, for multimodal conversation structure understanding. The evaluation shows that while all models outperform a heuristic baseline, anonymizing character identities substantially reduces the performance of even the best model. Sociolinguistic analysis of 350,842 TVQA utterances reveals that female characters are 1.2 times more likely than men to be cast as addressees or side-participants, and that the presence of side-participants shifts the conversational register from personal to social.
本研究旨在填补关于多模态大型语言模型(LLMs)如何处理对话结构(如对话角色和对话线程)的研究空白。它引入了新的任务和数据集TV-MMPC,用于多模态对话结构理解。评估结果显示,尽管所有模型都优于启发式基线,但匿名化角色身份会显著降低其性能。对TVQA数据的社会语言学分析表明,女性角色更有可能成为受话人或旁参与者,而旁参与者的存在会将对话风格从个人转向社会。
Open-Vocabulary Functional 3D Human-Scene Interaction Generation
Authors: Jie Liu, Yu Sun, Alpar Cseke, Yao Feng, Nicolas Heron, Michael J. Black, Yan Zhang
First: 2026-01-28T18:34:25+00:00 · Latest: 2026-01-28T18:34:25+00:00
Comments: 18 pages
Abstract
Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as "sitting on a sofa", but also supports fine-grained functional human-scene interactions, e.g., "increasing the room temperature". Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.
中文标题/摘要
标题:开放词汇功能性的三维人类-场景交互生成
生成能够功能性地与三维场景交互的三维人类仍然是一个开放问题,具有在具身人工智能、机器人技术和交互内容创作中的应用。关键挑战在于既要推理三维场景中功能元素的语义,又要推理实现功能感知交互所需的三维人类姿态。不幸的是,现有方法通常缺乏对物体功能及其相应的人-场景接触的显式推理,导致交互不合理或功能不正确。在本工作中,我们提出了一种无需训练的功能驱动框架FunHSI,该框架能够从开放词汇的任务提示中生成功能性正确的交互。给定一个任务提示,FunHSI 进行功能感知的接触推理,以识别功能性的场景元素,重建它们的三维几何结构,并通过接触图建模高层次的交互。然后利用视觉-语言模型合成执行任务的人类,并估计提出的三维身体和手部姿态。最后,通过阶段优化对提出的三维身体配置进行细化,以确保物理合理性和功能性正确。与现有方法相比,FunHSI 不仅能够合成更合理的通用三维交互,如“坐在沙发上”,还支持细粒度的功能性人类-场景交互,例如“提高房间温度”。大量实验表明,FunHSI 能够在各种室内外场景中一致地生成功能性正确且物理合理的交互。
Summary / 总结
The research aims to generate 3D humans that functionally interact with 3D scenes, addressing the challenge of reasoning about object functionality and human-scene contact. FunHSI, a training-free framework, identifies functional scene elements, reconstructs their geometry, and models interactions through a contact graph. It synthesizes humans performing tasks and refines their 3D poses for physical plausibility and functional correctness, demonstrating consistent generation of correct and plausible interactions across various scenes.
研究解决了生成能够与3D场景功能互动的3D人类的问题,这是在具身AI和交互式内容创作中的关键问题。提出的FunHSI框架通过功能感知的接触推理来识别和建模功能场景元素,并通过接触图和视觉语言模型合成3D人类姿态。该方法通过阶段优化确保物理合理性和功能性正确性。实验表明,FunHSI在各种室内外场景中生成了更合理且功能正确的互动,优于现有方法。
Discrete Variational Autoencoding via Policy Search
Authors: Michael Drolet, Firas Al-Hafez, Aditya Bhatt, Jan Peters, Oleg Arenz
First: 2025-09-29T12:44:05+00:00 · Latest: 2026-01-28T18:33:31+00:00
Abstract
Discrete latent bottlenecks in variational autoencoders (VAEs) offer high bit efficiency and can be modeled with autoregressive discrete distributions, enabling parameter-efficient multimodal search with transformers. However, discrete random variables do not allow for exact differentiable parameterization; therefore, discrete VAEs typically rely on approximations, such as Gumbel-Softmax reparameterization or straight-through gradient estimates, or employ high-variance gradient-free methods such as REINFORCE that have had limited success on high-dimensional tasks such as image reconstruction. Inspired by popular techniques in policy search, we propose a training framework for discrete VAEs that leverages the natural gradient of a non-parametric encoder to update the parametric encoder without requiring reparameterization. Our method, combined with automatic step size adaptation and a transformer-based encoder, scales to challenging datasets such as ImageNet and outperforms both approximate reparameterization methods and quantization-based discrete autoencoders in reconstructing high-dimensional data from compact latent spaces.
中文标题/摘要
标题:离散变分自编码器通过策略搜索
变分自编码器(VAE)中的离散潜变量瓶颈提供了高比特效率,并且可以使用自回归离散分布建模,从而利用变换器实现参数高效的多模态搜索。然而,离散随机变量不允许精确的可微参数化;因此,离散VAE通常依赖于近似方法,如Gumbel-Softmax重参数化或直接通过梯度估计,或者使用高方差的无梯度方法如REINFORCE,这些方法在高维任务如图像重建方面效果有限。受策略搜索中流行技术的启发,我们提出了一种离散VAE的训练框架,利用非参数编码器的自然梯度来更新参数编码器,而无需重参数化。结合自适应步长调整和基于变换器的编码器,该方法可以扩展到具有挑战性的数据集如ImageNet,并在从紧凑的潜空间重建高维数据方面优于近似重参数化方法和基于量化离散自编码器。
Summary / 总结
This paper addresses the challenge of training discrete latent VAEs by proposing a new framework that uses the natural gradient of a non-parametric encoder to update the parametric encoder, avoiding the need for reparameterization. This method, combined with automatic step size adaptation and a transformer-based encoder, is applied to large datasets like ImageNet and shows superior performance in reconstructing high-dimensional data compared to approximate reparameterization methods and quantization-based discrete autoencoders.
论文提出了一种新的框架,利用非参数编码器的自然梯度来更新参数编码器,从而避免了重参数化的需求。该方法结合自适应步长调整和基于Transformer的编码器,能够处理大规模数据集如ImageNet,并在高维数据重构任务中优于近似重参数化方法和基于量化的离散自编码器。
MemCtrl: Using MLLMs as Active Memory Controllers on Embodied Agents
Authors: Vishnu Sashank Dorbala, Dinesh Manocha
First: 2026-01-28T18:31:17+00:00 · Latest: 2026-01-28T18:31:17+00:00
Abstract
Foundation models rely on in-context learning for personalized decision making. The limited size of this context window necessitates memory compression and retrieval systems like RAG. These systems, however, often treat memory as large offline storage spaces, which is unfavorable for embodied agents that are expected to operate online under strict memory and compute constraints. In this work, we propose MemCtrl, a novel framework that uses Multimodal Large Language Models (MLLMs) for pruning memory online. MemCtrl augments MLLMs with a trainable memory head μ that acts as a gate to determine which observations or reflections to retain, update, or discard during exploration. We evaluate two ways of training μ, 1) via an offline expert, and 2) via online RL, and observe significant improvement in overall embodied task completion ability on μ-augmented MLLMs. In particular, on augmenting two low-performing MLLMs with MemCtrl on multiple subsets of the EmbodiedBench benchmark, we observe that μ-augmented MLLMs show an improvement of around 16% on average, with over 20% on specific instruction subsets. Finally, we present a qualitative analysis of the memory fragments collected by μ, noting the superior performance of μ-augmented MLLMs on long and complex instruction types.
中文标题/摘要
标题:MemCtrl:使用MLLM作为具身代理的主动内存控制器
基础模型依赖于上下文学习进行个性化的决策。由于上下文窗口的限制,需要使用记忆压缩和检索系统,如RAG。然而,这些系统通常将记忆视为大型离线存储空间,这不利于需要在严格的记忆和计算约束下在线操作的具身代理。在本研究中,我们提出了一种名为MemCtrl的新框架,该框架利用多模态大型语言模型(MLLMs)进行在线记忆修剪。MemCtrl通过一个可训练的记忆头μ来增强MLLMs,μ作为门控机制,决定在探索过程中保留、更新或丢弃哪些观察或反思。我们通过两种方式训练μ:1)通过离线专家,2)通过在线强化学习,并观察到在μ增强的MLLMs上整体具身任务完成能力显著提高。特别是,在使用MemCtrl对EmbodiedBench基准的不同子集增强两个低性能的MLLMs时,我们发现μ增强的MLLMs在平均上提高了约16%,在特定指令子集上提高了超过20%。最后,我们对μ收集的记忆片段进行了定性分析,指出μ增强的MLLMs在长且复杂的指令类型上表现出更优的性能。
Summary / 总结
MemCtrl is a framework that uses Multimodal Large Language Models (MLLMs) to prune memory online for embodied agents. It introduces a trainable memory head μ that acts as a gate to decide which observations or reflections to retain, update, or discard. Evaluations show that μ-augmented MLLMs improve overall task completion by around 16% on average, with over 20% improvement on specific instruction subsets of the EmbodiedBench benchmark.
MemCtrl 是一个框架,利用多模态大型语言模型(MLLMs)在线修剪记忆,以满足严格约束下的内存压缩和检索需求。它引入了一个可训练的记忆头 μ,作为门控机制来决定保留、更新或丢弃哪些观察或反思。评估结果显示,μ 增强的 MLLMs 在 EmbodiedBench 基准测试中的平均改进幅度约为 16%,在特定指令子集上的改进幅度超过 20%。
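The retain, update, or discard gating described in the MemCtrl entry can be sketched as a small memory wrapper driven by a pluggable gate. Everything here is an illustrative assumption: the class `GatedMemory`, the keyed memory layout, and the hand-written `toy_gate` stand in for the trainable head μ, which in the paper is learned from an offline expert or via online RL.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

RETAIN, UPDATE, DISCARD = "retain", "update", "discard"

# A gate maps (key, observation, current memory) to one of the three actions.
Gate = Callable[[str, str, Dict[str, str]], str]


@dataclass
class GatedMemory:
    """Keyed memory whose writes are filtered by a gate (hypothetical layout)."""

    gate: Gate
    entries: Dict[str, str] = field(default_factory=dict)

    def observe(self, key: str, observation: str) -> str:
        """Ask the gate what to do with an observation and apply the action."""
        action = self.gate(key, observation, self.entries)
        if action == RETAIN:
            self.entries.setdefault(key, observation)  # store as a new entry
        elif action == UPDATE:
            self.entries[key] = observation  # overwrite the stale entry
        elif action != DISCARD:
            raise ValueError(f"unknown gate action: {action!r}")
        return action


def toy_gate(key: str, observation: str, entries: Dict[str, str]) -> str:
    """Rule-based stand-in for a learned gate: drop empty or duplicate input."""
    if not observation.strip():
        return DISCARD
    if key in entries:
        return UPDATE if entries[key] != observation else DISCARD
    return RETAIN


if __name__ == "__main__":
    memory = GatedMemory(gate=toy_gate)
    for key, obs in [("mug", "on table"), ("mug", "in sink"), ("mug", "in sink")]:
        print(key, obs, "->", memory.observe(key, obs))
```

Keeping the gate as a plain callable means a learned scorer could replace `toy_gate` without touching the memory logic, which is the separation the abstract's "memory head" description suggests.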
VSCOUT: A Hybrid Variational Autoencoder Approach to Outlier Detection in High-Dimensional Retrospective Monitoring
Authors: Waldyn G. Martinez
First: 2026-01-28T18:30:48+00:00 · Latest: 2026-01-28T18:30:48+00:00
Abstract
Modern industrial and service processes generate high-dimensional, non-Gaussian, and contamination-prone data that challenge the foundational assumptions of classical Statistical Process Control (SPC). Heavy tails, multimodality, nonlinear dependencies, and sparse special-cause observations can distort baseline estimation, mask true anomalies, and prevent reliable identification of an in-control (IC) reference set. To address these challenges, we introduce VSCOUT, a distribution-free framework designed specifically for retrospective (Phase I) monitoring in high-dimensional settings. VSCOUT combines an Automatic Relevance Determination Variational Autoencoder (ARD-VAE) architecture with ensemble-based latent outlier filtering and changepoint detection. The ARD prior isolates the most informative latent dimensions, while the ensemble and changepoint filters identify pointwise and structural contamination within the determined latent space. A second-stage retraining step removes flagged observations and re-estimates the latent structure using only the retained inliers, mitigating masking and stabilizing the IC latent manifold. This two-stage refinement produces a clean and reliable IC baseline suitable for subsequent Phase II deployment. Extensive experiments across benchmark datasets demonstrate that VSCOUT achieves superior sensitivity to special-cause structure while maintaining controlled false alarms, outperforming classical SPC procedures, robust estimators, and modern machine-learning baselines. Its scalability, distributional flexibility, and resilience to complex contamination patterns position VSCOUT as a practical and effective method for retrospective modeling and anomaly detection in AI-enabled environments.
中文标题/摘要
标题:VSCOUT:高维回顾性监控中的混合变分自编码器异常检测方法
现代工业和服务过程生成高维、非高斯和受污染的数据,挑战了经典统计过程控制(SPC)的基本假设。重尾、多模态、非线性依赖关系和稀疏的特殊原因观测可以扭曲基线估计,掩盖真正的异常,并阻止可靠地识别处于控制状态(IC)的参考集。为了解决这些挑战,我们引入了VSCOUT,这是一种分布无关的框架,专门设计用于高维环境中的回顾性(第一阶段)监控。VSCOUT结合了自动相关性确定变分自编码器(ARD-VAE)架构与基于集成的潜在异常过滤和变化点检测。ARD先验隔离了最具信息量的潜在维度,而集成和变化点过滤器在确定的潜在空间内识别点状和结构异常。第二阶段重新训练步骤移除标记的观测值,并仅使用保留的内点重新估计潜在结构,从而减轻掩盖并稳定IC潜在流形。这种两阶段精炼产生了一个干净且可靠的IC基线,适用于后续第二阶段部署。在基准数据集上的广泛实验表明,VSCOUT在检测特殊原因结构方面具有更高的灵敏度,同时保持了受控的误报率,优于经典SPC程序、稳健估计和现代机器学习基线。其可扩展性、分布灵活性以及对复杂污染模式的抗性使VSCOUT成为AI驱动环境中回顾性建模和异常检测的一种实用且有效的方法。
Summary / 总结
VSCOUT is a hybrid variational autoencoder approach designed for outlier detection in high-dimensional, non-Gaussian data. It uses an ARD-VAE to isolate informative latent dimensions and an ensemble-based filtering method to detect pointwise and structural contamination. The method re-trains the model using only inliers to refine the latent structure, producing a clean baseline for subsequent monitoring. Experiments show VSCOUT outperforms classical SPC, robust estimators, and machine learning baselines in detecting special-cause anomalies while controlling false alarms.
VSCOUT 是一种用于高维数据回顾性监控的无分布框架,结合了ARD-VAE、基于集合的潜在异常过滤和变化点检测。它隔离了信息性的潜在维度,并识别点状和结构性的异常,通过仅使用保留的内点重新训练来生成清洁的IC基线。实验表明,VSCOUT 在检测特殊原因结构方面优于经典SPC、稳健估计和机器学习方法,同时控制着误报。
Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
Authors: Minwu Kim, Safal Shrestha, Keith Ross
First: 2026-01-28T18:29:21+00:00 · Latest: 2026-01-28T18:29:21+00:00
Comments: 16 pages
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
中文标题/摘要
标题:通过失败前缀条件化训练饱和问题的推理模型
可验证奖励的强化学习(RLVR)显著提升了大型语言模型(LLMs)的推理能力,但随着问题变得饱和,训练往往停滞不前。我们发现核心挑战在于信息性失败的可访问性差:学习信号存在但很少在标准滚动中遇到。为了解决这一问题,我们提出了一种简单而有效的方法——失败前缀条件化,用于从饱和问题中学习。我们的方法不是从原始问题开始,而是通过条件化训练于来自罕见错误推理轨迹的前缀,重新分配探索,从而使模型接触到失败倾向的状态。我们观察到,失败前缀条件化带来的性能提升与在中等难度问题上训练相当,同时保持了令牌效率。此外,我们分析了模型的鲁棒性,发现我们的方法减少了在误导性失败前缀下的性能下降,尽管在遵循正确早期推理方面略有妥协。最后,我们展示了迭代方法,在训练过程中刷新失败前缀,可以在性能平台期后解锁额外的收益。总体而言,我们的结果表明,失败前缀条件化为扩展RLVR在饱和问题上的训练提供了一条有效途径。
Summary / 总结
The paper addresses the challenge of training large language models (LLMs) using Reinforcement Learning with Verifiable Rewards (RLVR) on saturated problems, where training often stalls due to the lack of informative failures. The authors propose failure-prefix conditioning, a method that conditions training on prefixes from rare incorrect reasoning trajectories to expose the model to failure-prone states. This approach improves performance to match that of training on medium-difficulty problems while maintaining token efficiency. The method also enhances robustness against misleading failures but slightly compromises adherence to correct early reasoning. Iterative refreshing of failure prefixes further improves performance after initial plateaus.
论文解决了使用可验证奖励的强化学习(RLVR)在饱和问题上训练大型语言模型(LLMs)时遇到的挑战,即在标准采样(rollout)中很少遇到有信息量的失败。作者提出了一种失败前缀条件化的方法,该方法通过条件化训练于来自罕见错误推理轨迹的前缀,从而提高模型性能而不牺牲令牌效率。迭代刷新失败前缀进一步在初始平台期后增强了性能。
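The failure-prefix conditioning idea in the entry above can be sketched as a data-preparation step: collect the rare incorrect rollouts for a saturated question, truncate each to a prefix, and use question plus prefix as the new starting state. The function name `build_failure_prefix_prompts`, the representation of a rollout as a list of step strings, and the `prefix_fraction` parameter are assumptions for illustration; the paper's prefix selection rule is not given in the abstract.

```python
from typing import Callable, List, Sequence


def build_failure_prefix_prompts(
    question: str,
    rollouts: Sequence[Sequence[str]],
    is_correct: Callable[[Sequence[str]], bool],
    prefix_fraction: float = 0.5,
) -> List[str]:
    """Turn incorrect rollouts into prompts that start from a failure prefix.

    Each rollout is a list of reasoning steps (hypothetical representation).
    Rollouts with fewer than two steps are skipped so that every prompt
    leaves at least one step for the model to generate.
    """
    if not 0.0 < prefix_fraction < 1.0:
        raise ValueError("prefix_fraction must be strictly between 0 and 1")
    prompts = []
    for trace in rollouts:
        if is_correct(trace) or len(trace) < 2:
            continue
        # Keep at least one step and always drop at least the final step.
        cut = min(len(trace) - 1, max(1, int(len(trace) * prefix_fraction)))
        prompts.append(question + "\n" + "\n".join(trace[:cut]))
    return prompts


if __name__ == "__main__":
    rollouts = [
        ["2 + 2 = 5", "5 * 3 = 15", "answer: 15"],  # rare failure
        ["2 + 2 = 4", "4 * 3 = 12", "answer: 12"],  # usual success
    ]
    prompts = build_failure_prefix_prompts(
        "Compute (2 + 2) * 3.", rollouts, lambda t: t[-1] == "answer: 12"
    )
    print(prompts)
```

In an RLVR loop, rollouts would then be sampled from these prompts instead of from the bare question; the iterative variant mentioned in the abstract would rebuild the prompt set periodically from fresh failures.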
In-Context Bias Propagation in LLM-Based Tabular Data Generation
Authors: Pol G. Recasens, Alberto Gutierrez, Jordi Torres, Josep Ll. Berral, Javier Carnerero-Cano, Anisa Halimi, Kieran Fraser
First: 2025-06-11T11:39:29+00:00 · Latest: 2026-01-28T18:25:52+00:00
Abstract
Large Language Models (LLMs) are increasingly used for synthetic tabular data generation through in-context learning (ICL), offering a practical solution for data augmentation in data scarce scenarios. While prior work has shown the potential of LLMs to improve downstream task performance through augmenting underrepresented groups, these benefits often assume access to a subset of unbiased in-context examples, representative of the real dataset. In real-world settings, however, data is frequently noisy and demographically skewed. In this paper, we systematically study how statistical biases within in-context examples propagate to the distribution of synthetic tabular data, showing that even mild in-context biases lead to global statistical distortions. We further introduce an adversarial scenario where a malicious contributor can inject bias into the synthetic dataset via a subset of in-context examples, ultimately compromising the fairness of downstream classifiers for a targeted and protected subgroup. Finally, we evaluate mitigation strategies based on preprocessing in-context examples, demonstrating that while such interventions can attenuate disparity, the inherent sensitivity of LLMs to adversarial prompts remains a persistent challenge. Our findings highlight a critical new vulnerability in LLM-based data generation pipelines within sensitive domains.
中文标题/摘要
标题:基于LLM的表格数据生成中的上下文偏差传播
大型语言模型(LLMs)越来越多地通过上下文学习(ICL)用于合成表格数据生成,为数据稀缺场景下的数据增强提供了一种实用的解决方案。尽管先前的研究表明,LLMs通过增强代表性不足的群体来提高下游任务性能具有潜力,但这些好处通常假设可以访问一组无偏的上下文示例,这些示例代表了真实数据集。然而,在现实世界中,数据通常噪声较大且人口统计特征失衡。在本文中,我们系统地研究了上下文示例中的统计偏差如何传播到合成表格数据的分布,表明即使是轻微的上下文偏差也会导致全局统计失真。我们进一步引入了一种对抗性场景,在该场景中,恶意贡献者可以通过一组上下文示例向合成数据集注入偏差,最终损害目标和受保护子群体的下游分类器的公平性。最后,我们评估了预处理上下文示例的缓解策略,证明虽然这些干预措施可以减轻差异,但LLMs对对抗性提示的固有敏感性仍然是一个持续的挑战。我们的研究结果突显了LLM基于的数据生成管道在敏感领域中的一个关键新漏洞。
Summary / 总结
This paper investigates how biases in in-context examples propagate to synthetic tabular data generated by Large Language Models (LLMs) through in-context learning. The study shows that even mild biases in context examples can lead to significant statistical distortions in the generated data. It also introduces an adversarial scenario where malicious contributors can inject biases, affecting the fairness of downstream classifiers. The research evaluates preprocessing methods to mitigate these biases but finds that LLMs remain sensitive to adversarial prompts, highlighting a new vulnerability in data generation pipelines.
本文研究了在通过上下文学习生成的大型语言模型(LLM)合成表格数据中,上下文示例中的偏差如何传播。研究显示,即使上下文示例中的偏差较轻,也可能导致全局统计失真。此外,研究引入了一个恶意贡献者通过部分上下文示例注入偏差的对抗场景,最终损害了下游分类器的公平性。作者评估了预处理方法以减轻这些偏差,但发现LLMs对对抗提示的固有敏感性仍然是一个持续的挑战。
Neural Theorem Proving for Verification Conditions: A Real-World Benchmark
Authors: Qiyuan Xu, Xiaokun Luan, Renxi Wang, Joshua Ong Jun Leang, Peixin Wang, Haonan Li, Wenda Li, Conrad Watt
Venue: ICLR
First: 2026-01-26T20:37:11+00:00 · Latest: 2026-01-28T18:25:21+00:00
Comments: Accepted in ICLR'26
Abstract
Theorem proving is fundamental to program verification, where the automated proof of Verification Conditions (VCs) remains a primary bottleneck. Real-world program verification frequently encounters hard VCs that existing Automated Theorem Provers (ATPs) cannot prove, leading to a critical need for extensive manual proofs that burden practical application. While Neural Theorem Proving (NTP) has achieved significant success in mathematical competitions, demonstrating the potential of machine learning approaches to formal reasoning, its application to program verification--particularly VC proving--remains largely unexplored. Despite existing work on annotation synthesis and verification-related theorem proving, no benchmark has specifically targeted this fundamental bottleneck: automated VC proving. This work introduces Neural Theorem Proving for Verification Conditions (NTP4VC), presenting the first real-world multi-language benchmark for this task. From real-world projects such as Linux and Contiki-OS kernel, our benchmark leverages industrial pipelines (Why3 and Frama-C) to generate semantically equivalent test cases across formal languages of Isabelle, Lean, and Rocq. We evaluate large language models (LLMs), both general-purpose and those fine-tuned for theorem proving, on NTP4VC. Results indicate that although LLMs show promise in VC proving, significant challenges remain for program verification, highlighting a large gap and opportunity for future research.
中文标题/摘要
标题:神经定理证明在验证条件中的应用:一个实际基准
定理证明是程序验证的基础,其中验证条件(VC)的自动证明仍然是主要瓶颈。实际程序验证经常遇到现有的自动定理证明器(ATPs)无法证明的难题VC,导致需要大量的手动证明,这严重阻碍了实际应用。尽管神经定理证明(NTP)在数学竞赛中取得了显著成功,展示了机器学习方法在形式推理中的潜力,但其在程序验证中的应用,特别是VC证明,仍然鲜有探索。尽管已有工作集中在注释合成和验证相关的定理证明,但没有基准专门针对这一根本瓶颈:自动VC证明。本研究引入了神经定理证明用于验证条件(NTP4VC),提出了第一个用于此任务的实际多语言基准。从如Linux和Contiki-OS内核的实际项目中,我们的基准利用工业管道(Why3和Frama-C)生成形式语言(Isabelle、Lean和Rocq)之间的语义等价测试用例。我们评估了大型语言模型(LLMs),包括通用型和针对定理证明微调的模型,对NTP4VC的性能。结果表明,尽管LLMs在VC证明中显示出潜力,但在程序验证中仍面临重大挑战,突显了未来研究的巨大差距和机会。
Summary / 总结
This work addresses the bottleneck of automated proof of Verification Conditions (VCs) in program verification by introducing NTP4VC, a real-world multi-language benchmark. It evaluates large language models, both general-purpose and fine-tuned for theorem proving, on this benchmark. The results suggest that while these models show potential, they still face significant challenges in program verification, indicating a need for further research.
该研究通过引入面向验证条件的神经定理证明(NTP4VC),一个实际多语言基准,解决了程序验证中自动证明验证条件(VCs)的瓶颈问题。研究评估了大型语言模型,包括通用模型和专门针对定理证明微调的模型,在此基准上的表现。结果显示,大型语言模型在VC证明中表现出潜力,但也指出了显著的挑战和未来研究的巨大空间。
Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
Authors: Li-Chun Lu, Miri Liu, Pin-Chun Lu, Yufei Tian, Shao-Hua Sun, Nanyun Peng
First: 2025-08-07T15:11:48+00:00 · Latest: 2026-01-28T18:20:03+00:00
Comments: EACL 2026
Abstract
We examine, analyze, and compare four representative creativity measures--perplexity, LLM-as-a-Judge, the Creativity Index (CI; measuring n-gram overlap with web corpora), and syntactic templates (detecting repetition of common part-of-speech patterns)--across diverse creative domains, such as creative writing, unconventional problem-solving, and research ideation. For each domain, we compile datasets with human-aligned creative and uncreative examples and evaluate each metric's ability to discriminate between the two sets. Our analyses reveal limited consistency across both domains and metrics, as metrics that distinguish creativity in one domain fail in others (e.g., CI correctly distinguishes in creative writing but fails in problem-solving), and different metrics often disagree on the same data points (e.g., CI suggests one set to be more creative, while perplexity indicates the other set to be more creative). We highlight key limitations, such as perplexity reflecting fluency rather than novelty; LLM-as-a-Judge producing inconsistent judgments under minor prompt variations and exhibiting bias towards particular labels; CI primarily measuring lexical diversity, with high sensitivity to implementation choices; and syntactic templates being ineffective in settings dominated by formulaic language. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.
中文标题/摘要
标题:重新思考创造力评估:现有创造力评估的批判性分析
我们检视、分析并比较了四种代表性的创造力度量标准:困惑度、LLM作为评判者、创造力指数(CI;衡量n-gram与网络语料库的重叠)以及句法模板(检测常见词性模式的重复),这些度量标准涵盖了多种多样的创造性领域,如创造性写作、非传统问题解决和研究构想。对于每个领域,我们收集了人类对齐的创造性与非创造性示例数据集,并评估每个度量标准区分这两组数据的能力。我们的分析揭示了在不同领域和度量标准之间的一致性有限,即在某一领域能够区分创造力的度量标准在其他领域可能无效(例如,CI在创造性写作中正确区分,但在问题解决中失败),并且不同的度量标准在相同的数据点上经常存在分歧(例如,CI认为一组更具有创造性,而困惑度则认为另一组更具有创造性)。我们强调了关键的局限性,如困惑度反映流畅性而非新颖性;LLM作为评判者在轻微提示变化下产生不一致的判断,并表现出对特定标签的偏见;CI主要衡量词汇多样性,对实现选择高度敏感;以及句法模板在以公式化语言为主导的环境中无效。我们的研究结果强调了需要更稳健、通用的评估框架,更好地与人类对创造力的判断相一致。
Summary / 总结
This study evaluates four creativity metrics—perplexity, LLM-as-a-Judge, the Creativity Index, and syntactic templates—across various domains like creative writing and problem-solving. The research finds that these metrics often fail to consistently distinguish creative from uncreative outputs across different domains, with each metric performing well in some areas but poorly in others. Key limitations include perplexity reflecting fluency rather than novelty, LLM-as-a-Judge showing inconsistent judgments, and the Creativity Index being highly sensitive to implementation choices.
研究评估了四个创造力指标——困惑度、LLM作为评判者、创造力指数(CI)和句法模板——在创意写作、问题解决和研究构想等不同领域中的表现。研究发现,这些指标在不同领域中的表现不一致,并且在相同的数据点上经常产生分歧,突显了每种方法的局限性。例如,CI在创意写作中表现良好,但在问题解决中表现不佳,而困惑度倾向于衡量流畅性而非新颖性。研究呼吁需要开发更稳健和通用的评估框架,更好地与人类对创造力的判断相一致。
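As a small illustration of why perplexity tracks fluency rather than novelty, the sketch below computes perplexity from per-token log-probabilities. It assumes natural-log probabilities from some language model; the function name and the sample values are hypothetical and do not come from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token.

    `token_logprobs` holds natural-log probabilities that a language model
    assigned to each token of a text (hypothetical inputs for illustration).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A predictable (fluent) text: every token gets probability 0.5.
fluent = [math.log(0.5)] * 8
# A surprising text: every token gets probability 0.05.
surprising = [math.log(0.05)] * 8

fluent_ppl = perplexity(fluent)
surprising_ppl = perplexity(surprising)
```

The score depends only on how predictable each token is to the model, so an unusual but incoherent text and an unusual but inventive one can receive similar values.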
AI Annotation Orchestration: Evaluating LLM verifiers to Improve the Quality of LLM Annotations in Learning Analytics
Authors: Bakhtawar Ahtisham, Kirk Vanacore, Jinsook Lee, Zhuqian Zhou, Doug Pietrzak, Rene F. Kizilcec
First: 2025-11-12T22:35:36+00:00 · Latest: 2026-01-28T18:09:36+00:00
Abstract
Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration, that is, prompting models to check their own labels (self-verification) or audit one another (cross-verification), improves qualitative coding of tutoring discourse. Using transcripts from 30 one-to-one math sessions, we compare three production LLMs (GPT, Claude, Gemini) under three conditions: unverified annotation, self-verification, and cross-verification across all orchestration configurations. Outputs are benchmarked against a blinded, disagreement-focused human adjudication using Cohen's kappa. Overall, orchestration yields a 58 percent improvement in kappa. Self-verification nearly doubles agreement relative to unverified baselines, with the largest gains for challenging tutor moves. Cross-verification achieves a 37 percent improvement on average, with pair- and construct-dependent effects: some verifier-annotator pairs exceed self-verification, while others reduce alignment, reflecting differences in verifier strictness. We contribute: (1) a flexible orchestration framework instantiating control, self-, and cross-verification; (2) an empirical comparison across frontier LLMs on authentic tutoring data with blinded human "gold" labels; and (3) a concise notation, verifier(annotator) (e.g., Gemini(GPT) or Claude(Claude)), to standardize reporting and make directional effects explicit for replication. Results position verification as a principled design lever for reliable, scalable LLM-assisted annotation in Learning Analytics.
中文标题/摘要
标题:AI注解编排:评估LLM验证器以提高LLM注解在学习分析中的质量
大型语言模型(LLMs)越来越多地用于标注学习互动,但对其可靠性的担忧限制了其应用。我们测试了验证导向的编排(即提示模型检查自己的标签(自我验证)或互相审核(交叉验证))是否能提高辅导对话的定性编码质量。使用30个一对一数学会话的转录,我们比较了三种生产环境中的LLM(GPT、Claude、Gemini)在三种条件下的表现:未验证标注、自我验证,以及覆盖所有编排配置的交叉验证。输出以盲审、侧重分歧的人工裁定为基准,使用Cohen's kappa进行比较。总体而言,编排使kappa提高了58%。相对于未验证的基线,自我验证使一致性几乎翻倍,其中较难判定的辅导动作收益最大。交叉验证平均提高了37%,但效果取决于配对和构念:一些验证者-标注者配对超过了自我验证,而另一些则降低了一致性,反映了验证者严格程度的不同。我们的贡献包括:(1)一个灵活的编排框架,实现了对照、自我验证和交叉验证;(2)在真实辅导数据上、以盲审人工“黄金”标签为参照,对前沿LLM进行的实证比较;(3)一种简洁的表示法“验证者(标注者)”(例如Gemini(GPT)或Claude(Claude)),用于规范报告并明确方向性效应,便于复现。结果表明,验证是在学习分析中实现可靠、可扩展的LLM辅助标注的一个有原则的设计杠杆。
Summary / 总结
The study tests whether verification-oriented orchestration, in which LLMs check their own labels (self-verification) or audit one another's (cross-verification), improves LLM annotation quality in learning analytics. Using transcripts from 30 one-to-one math sessions, it compares three LLMs (GPT, Claude, Gemini) under unverified annotation, self-verification, and cross-verification, benchmarking outputs against blinded human adjudication with Cohen's kappa. Overall, orchestration yields a 58 percent improvement in kappa. Self-verification nearly doubles agreement relative to unverified baselines, with the largest gains for challenging tutor moves. Cross-verification improves kappa by 37 percent on average, but its effect depends on the verifier-annotator pair and the construct: some pairs exceed self-verification while others reduce alignment, reflecting differences in verifier strictness.
研究检验了验证导向的编排(即提示模型检查自身标签的自我验证,或让模型相互审核的交叉验证)能否提高大型语言模型(LLM)在学习分析中的标注质量。使用30个一对一数学会话的转录,研究比较了三种LLM(GPT、Claude、Gemini)在未验证标注、自我验证和交叉验证三种条件下的表现。以Cohen's kappa衡量,编排总体上使一致性提高了58%。自我验证使一致性相对于未验证基线几乎翻倍,在较难判定的辅导动作上收益最大。交叉验证平均提高了37%,但效果取决于验证者-标注者配对、所编码的构念以及验证者的严格程度。
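For readers unfamiliar with the agreement statistic used above, here is a minimal sketch of Cohen's kappa for two label sequences, such as a human adjudicator and an LLM annotator. The label names ("hint", "praise", "probe") and the data are hypothetical, and this is not the authors' evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently at their
    # own marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical tutor-move labels for six utterances.
human = ["hint", "praise", "hint", "probe", "hint", "praise"]
model = ["hint", "praise", "probe", "probe", "hint", "hint"]
kappa = cohens_kappa(human, model)  # 4/6 raw agreement, kappa = 11/23
```

In the paper's verifier(annotator) notation, the second sequence would hold the labels that remain after a verifier has audited an annotator's output; a relative improvement in kappa then compares that value with the unverified baseline.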
ProToken: Token-Level Attribution for Federated Large Language Models
Authors: Waris Gill, Ahmad Humayun, Ali Anwar, Muhammad Ali Gulzar
First: 2026-01-27T14:53:12+00:00 · Latest: 2026-01-28T18:05:55+00:00
Abstract
Federated Learning (FL) enables collaborative training of Large Language Models (LLMs) across distributed data sources while preserving privacy. However, when federated LLMs are deployed in critical applications, it remains unclear which client(s) contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification. We present ProToken, a novel Provenance methodology for Token-level attribution in federated LLMs that addresses client attribution during autoregressive text generation while maintaining FL privacy constraints. ProToken leverages two key insights to enable provenance at each token: (1) transformer architectures concentrate task-specific signals in later blocks, enabling strategic layer selection for computational tractability, and (2) gradient-based relevance weighting filters out irrelevant neural activations, focusing attribution on neurons that directly influence token generation. We evaluate ProToken across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding). ProToken achieves 98% average attribution accuracy in correctly localizing responsible client(s), and maintains high accuracy when the number of clients is scaled, validating its practical viability for real-world deployment settings.
中文标题/摘要
标题:ProToken:联邦大型语言模型的令牌级归因方法
联邦学习(FL)允许在分布式数据源上协作训练大型语言模型(LLMs)同时保护隐私。然而,当联邦LLMs部署在关键应用中时,仍不清楚哪些客户端贡献了特定生成的响应,这阻碍了调试、恶意客户端识别、公平奖励分配和信任验证。我们提出了ProToken,这是一种新颖的溯源方法,用于在自回归文本生成过程中解决联邦LLMs的令牌级客户端归因问题,同时满足FL隐私约束。ProToken利用两个关键洞察在每个令牌上实现溯源:(1)Transformer架构将任务特定信号集中在靠后的块中,因此可以有策略地选择层以保证计算可行;(2)基于梯度的相关性加权过滤掉无关的神经激活,将归因集中在直接影响令牌生成的神经元上。我们在涵盖四种LLM架构(Gemma、Llama、Qwen、SmolLM)和四种领域(医疗、金融、数学、编程)的16种配置中评估了ProToken。ProToken在正确定位负责客户端方面平均实现了98%的归因准确率,并且在客户端数量增加时保持了高准确率,验证了其在实际部署环境中的实用可行性。
Summary / 总结
ProToken is a provenance methodology for token-level attribution in federated LLMs: it identifies which client(s) contributed to a specific generated response during autoregressive generation while respecting FL privacy constraints. It restricts attribution to later transformer blocks, where task-specific signals concentrate, and applies gradient-based relevance weighting to focus on neurons that directly influence token generation. Across 16 configurations spanning four LLM architectures and four domains, ProToken achieves 98% average attribution accuracy and maintains high accuracy as the number of clients is scaled.
ProToken 是一种用于联邦大语言模型的 token 级溯源归因方法,能够在满足联邦学习隐私约束的同时,在自回归文本生成过程中识别哪些客户端对特定响应作出了贡献。它仅在靠后的 Transformer 块上进行归因,并利用基于梯度的相关性加权聚焦于直接影响 token 生成的神经元。在涵盖四种大语言模型架构和四个领域的 16 种配置中,ProToken 实现了 98% 的平均归因准确率,并在客户端数量增加时保持高准确率。
Context-Augmented Code Generation Using Programming Knowledge Graphs
Authors: Shahd Seddik, Fahd Seddik, Iman Saberi, Fatemeh Fard, Minh Hieu Huynh, Patanamon Thongtanunam
First: 2026-01-28T17:58:30+00:00 · Latest: 2026-01-28T17:58:30+00:00
Abstract
Large Language Models (LLMs) excel at code generation but struggle with complex problems. Retrieval-Augmented Generation (RAG) mitigates this issue by integrating external knowledge, yet retrieval models often miss relevant context, and generation models hallucinate with irrelevant data. We propose Programming Knowledge Graph (PKG) for semantic representation and fine-grained retrieval of code and text. Our approach enhances retrieval precision through tree pruning and mitigates hallucinations via a re-ranking mechanism that integrates non-RAG solutions. Structuring external data into finer-grained nodes improves retrieval granularity. Evaluations on HumanEval and MBPP show up to 20% pass@1 accuracy gains and a 34% improvement over baselines on MBPP. Our findings demonstrate that our proposed PKG approach, along with the re-ranker, effectively addresses complex problems while maintaining minimal negative impact on solutions that are already correct without RAG. The replication package is published at https://github.com/iamshahd/ProgrammingKnowledgeGraph
中文标题/摘要
标题:基于编程知识图谱的语境增强代码生成
大型语言模型(LLMs)在代码生成方面表现出色,但在处理复杂问题时存在困难。检索增强生成(RAG)通过整合外部知识来缓解这一问题,但检索模型往往遗漏相关语境,而生成模型则会生成不相关的内容。我们提出了编程知识图谱(PKG)以实现语义表示和细粒度检索代码和文本。我们的方法通过树修剪增强检索精度,并通过结合非RAG解决方案的重新排名机制来减轻生成幻觉。将外部数据结构化为更细粒度的节点,提高了检索的粒度。在HumanEval和MBPP上的评估显示,我们的方法在pass@1准确率上提高了最高20%,在MBPP上的改进幅度为34%。我们的研究结果表明,我们的PKG方法及其重新排名器能够有效解决复杂问题,同时对不需要RAG的正确解决方案的负面影响最小。复制包已发布于https://github.com/iamshahd/ProgrammingKnowledgeGraph
Summary / 总结
This paper addresses the limitations of Large Language Models (LLMs) on complex code generation tasks by proposing a Programming Knowledge Graph (PKG) for fine-grained retrieval of code and text. Tree pruning improves retrieval precision, and a re-ranking mechanism that also considers non-RAG solutions mitigates hallucinations. Experiments on HumanEval and MBPP show up to 20% pass@1 accuracy gains and a 34% improvement over baselines on MBPP, with minimal negative impact on problems the model already solves correctly without RAG.
本文针对大型语言模型(LLMs)在处理复杂代码生成任务时的局限性,提出了编程知识图谱(PKG),用于对代码和文本进行细粒度检索。树修剪提高了检索精度,而同时考虑非RAG解决方案的重新排名机制减轻了幻觉问题。在HumanEval和MBPP上的实验显示,pass@1准确率最高提升20%,并且在MBPP上相对基线提升34%,同时对无需RAG即可正确求解的问题负面影响最小。
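The pass@1 figures above are a special case of the pass@k metric. The abstract does not say how the authors computed it; the sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The function name and sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: budget.
    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples drawn, samples correct).
results = [(10, 3), (10, 0), (10, 10)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n per problem, so benchmark-level pass@1 is the mean fraction of correct samples.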
Helping Johnny Make Sense of Privacy Policies with LLMs
Authors: Vincent Freiberger, Arthur Fleig, Erik Buchmann
First: 2025-01-27T13:27:04+00:00 · Latest: 2026-01-28T17:54:54+00:00
Comments: 21 pages, 3 figures, 3 tables, ACM CHI 2026
Abstract
Understanding and engaging with privacy policies is crucial for online privacy, yet these documents remain notoriously complex and difficult to navigate. We present PRISMe, an interactive browser extension that combines LLM-based policy assessment with a dashboard and customizable chat interface, enabling users to skim quick overviews or explore policy details in depth while browsing. We conduct a user study (N=22) with participants of diverse privacy knowledge to investigate how users interpret the tool's explanations and how it shapes their engagement with privacy policies, identifying distinct interaction patterns. Participants valued the clear overviews and conversational depth, but flagged some issues, particularly adversarial robustness and hallucination risks. Thus, we investigate how a retrieval-augmented generation (RAG) approach can alleviate issues by re-running the chat queries from the study. Our findings surface design challenges as well as technical trade-offs, contributing actionable insights for developing future user-centered, trustworthy privacy policy analysis tools.
中文标题/摘要
标题:使用LLM帮助约翰尼理解隐私政策
理解和参与隐私政策对于在线隐私至关重要,但这些文件仍然非常复杂且难以导航。我们介绍了PRISMe,这是一种交互式浏览器扩展程序,结合了基于LLM的政策评估、仪表板和可定制的聊天界面,使用户能够在浏览时快速浏览概述或深入探索政策细节。我们对具有不同隐私知识的22名参与者进行了用户研究,以调查用户如何解释工具的解释以及它如何影响用户对隐私政策的参与,识别出不同的交互模式。参与者重视清晰的概述和对话深度,但也指出了某些问题,特别是对抗性鲁棒性和幻觉风险。因此,我们通过重新运行研究中的聊天查询来研究检索增强生成(RAG)方法如何缓解这些问题。我们的研究揭示了设计挑战和技术权衡,为开发未来以用户为中心、可信赖的隐私政策分析工具提供了可操作的见解。
Summary / 总结
The research aims to help users better understand complex privacy policies through an interactive browser extension called PRISMe, which uses LLMs for policy assessment and provides a customizable chat interface. The study involved 22 participants with varying levels of privacy knowledge, revealing that users appreciated clear overviews and conversational depth but also highlighted issues like adversarial robustness and hallucination risks. The team then explored a retrieval-augmented generation (RAG) approach to address these issues, identifying design and technical challenges that contribute to developing more user-centered and trustworthy privacy policy analysis tools.
研究旨在通过一个名为PRISMe的交互式浏览器扩展帮助用户更好地理解复杂的隐私政策,该扩展利用LLMs进行政策评估,并提供可定制的聊天界面。研究涉及22名具有不同隐私知识水平的参与者,结果显示用户喜欢清晰的概述和对话深度,但也指出了诸如对抗鲁棒性和幻觉风险等问题。研究团队随后探索了检索增强生成(RAG)方法来解决这些问题,揭示了设计和技术上的挑战,为开发更以用户为中心和值得信赖的隐私政策分析工具提供了行动建议。
Structured Semantic Information Helps Retrieve Better Examples for In-Context Learning in Few-Shot Relation Extraction
Authors: Aunabil Chakma, Mihai Surdeanu, Eduardo Blanco
First: 2026-01-28T17:48:58+00:00 · Latest: 2026-01-28T17:48:58+00:00
Abstract
This paper presents several strategies to automatically obtain additional examples for in-context learning of one-shot relation extraction. Specifically, we introduce a novel strategy for example selection, in which new examples are selected based on the similarity of their underlying syntactic-semantic structure to the provided one-shot example. We show that this method results in complementary word choices and sentence structures when compared to LLM-generated examples. When these strategies are combined, the resulting hybrid system achieves a more holistic picture of the relations of interest than either method alone. Our framework transfers well across datasets (FS-TACRED and FS-FewRel) and LLM families (Qwen and Gemma). Overall, our hybrid selection method consistently outperforms alternative strategies and achieves state-of-the-art performance on FS-TACRED and strong gains on a customized FewRel subset.
中文标题/摘要
标题:结构化语义信息有助于在少量样本关系提取的上下文学习中检索更好的示例
本文提出了若干策略,以自动获取用于单样本(one-shot)关系抽取上下文学习的额外示例。具体而言,我们介绍了一种新的示例选择策略,其中新示例是基于其潜在的句法-语义结构与所提供的单个示例的相似性来选择的。我们表明,与LLM生成的示例相比,这种方法在单词选择和句子结构上具有互补性。当这些策略结合使用时,所得到的混合系统比单独使用任何一种方法都能更全面地描绘出感兴趣的关系。我们的框架在数据集(FS-TACRED和FS-FewRel)和LLM家族(Qwen和Gemma)之间具有良好的迁移性。总体而言,我们的混合选择方法始终优于其他策略,并在FS-TACRED上达到了最先进的性能,在定制的FewRel子集上也取得了显著的进展。
Summary / 总结
This paper introduces a strategy for selecting additional examples for in-context learning in one-shot relation extraction by focusing on the similarity of their syntactic-semantic structure to the provided example. The method complements LLM-generated examples and, when combined, outperforms alternative strategies, achieving state-of-the-art performance on FS-TACRED and strong gains on a customized FewRel subset.
该论文提出了一种基于句法-语义结构相似性的额外示例选择策略,用于单样本(one-shot)关系抽取的上下文学习。该方法补充了LLM生成的示例,并且当结合使用时,优于其他策略,在FS-TACRED上达到最先进的性能,并在定制的FewRel子集上取得了显著进步。
Reinforcement Learning via Self-Distillation
Authors: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause
First: 2026-01-28T17:45:12+00:00 · Latest: 2026-01-28T17:45:12+00:00
Abstract
Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
中文标题/摘要
标题:通过自我蒸馏进行强化学习
大型语言模型越来越多地通过可验证领域(如代码和数学)的强化学习进行后训练。然而,当前的可验证奖励强化学习(RLVR)方法仅从每次尝试的标量结果奖励中学习,这造成了严重的信用分配瓶颈。许多可验证环境实际上提供了丰富的文本反馈,如运行时错误或评判评价,这些反馈解释了尝试失败的原因。我们将此设置形式化为具有丰富反馈的强化学习,并引入了自我蒸馏策略优化(SDPO),该方法将标记化的反馈转换为密集的学习信号,无需任何外部教师或显式的奖励模型。SDPO 将以反馈为条件的当前模型视为自我教师,并将其结合反馈得到的下一个标记预测蒸馏回策略。这样,SDPO 利用了模型在上下文中回顾性地识别自身错误的能力。在科学推理、工具使用和 LiveCodeBench v6 的竞赛编程中,SDPO 相比强大的 RLVR 基线提高了样本效率和最终准确率。值得注意的是,SDPO 还通过使用成功的轨迹(rollout)作为失败尝试的隐式反馈,在仅返回标量反馈的标准 RLVR 环境中优于基线。最后,在测试时将 SDPO 应用于单个问题可以加速在困难的二元奖励任务中的发现,以三分之一的尝试次数达到与 best-of-k 采样或多轮对话相同的发现概率。
Summary / 总结
The paper addresses a credit-assignment bottleneck in reinforcement learning with verifiable rewards (RLVR), where current methods learn only from a scalar outcome reward per attempt. It introduces Self-Distillation Policy Optimization (SDPO), which converts rich textual feedback into a dense learning signal by treating the feedback-conditioned model as a self-teacher and distilling its next-token predictions back into the policy. SDPO improves sample efficiency and final accuracy over strong RLVR baselines in scientific reasoning, tool use, and competitive programming. It also outperforms baselines in scalar-feedback environments by using successful rollouts as implicit feedback for failed attempts, and at test time it matches the discovery probability of best-of-k sampling or multi-turn conversations with 3x fewer attempts on difficult binary-reward tasks.
论文针对可验证奖励强化学习(RLVR)仅从每次尝试的标量结果奖励中学习所导致的信用分配瓶颈,提出了 Self-Distillation Policy Optimization (SDPO) 方法。该方法将以反馈为条件的模型视为自我教师,并将其下一个标记预测蒸馏回策略,从而把丰富的文本反馈转化为密集的学习信号。SDPO 在科学推理、工具使用和竞赛编程任务中相比强大的 RLVR 基线提高了样本效率和最终准确率。此外,它还能在仅提供标量反馈的环境中,以成功轨迹作为失败尝试的隐式反馈来超越基线方法,并在测试时以三分之一的尝试次数在困难的二元奖励任务上达到与 best-of-k 采样或多轮对话相同的发现概率。
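To make the self-teacher idea concrete, the toy sketch below computes a token-level distillation loss between two sets of next-token logits: those of the model conditioned on feedback (teacher) and those of the same model without feedback (student). The abstract does not specify SDPO's objective, so the forward KL divergence, the function names, and the logit values are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit list."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over a sequence.

    Assumed objective: the teacher is the model's prediction when the
    textual feedback is in context; the student is the policy without it.
    """
    if len(teacher_logits) != len(student_logits) or not teacher_logits:
        raise ValueError("need two non-empty sequences of equal length")
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Hypothetical logits for a 2-token attempt over a 3-word vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]  # feedback in context
student = [[1.0, 1.0, -1.0], [0.1, 1.5, 0.3]]  # no feedback
loss = distillation_loss(teacher, student)
```

Only the first position contributes to the loss here, because the two distributions agree at the second. This is the sense in which such a signal is dense: every token position gets its own learning signal, unlike a single scalar reward per attempt.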