arXiv Paper Digest

ctELM: Decoding and Manipulating Embeddings of Clinical Trials with Embedding Language Models

Authors: Brian Ondov, Chia-Hsuan Chang, Yujia Zhou, Mauro Giuffrè, Hua Xu

First: 2026-01-26T18:58:46+00:00 · Latest: 2026-01-26T18:58:46+00:00

Abstract

Text embeddings have become an essential part of a variety of language applications. However, methods for interpreting, exploring and reversing embedding spaces are limited, reducing transparency and precluding potentially valuable generative use cases. In this work, we align Large Language Models to embeddings of clinical trials using the recently reported Embedding Language Model (ELM) method. We develop an open-source, domain-agnostic ELM architecture and training framework, design training tasks for clinical trials, and introduce an expert-validated synthetic dataset. We then train a series of ELMs exploring the impact of tasks and training regimes. Our final model, ctELM, can accurately describe and compare unseen clinical trials from embeddings alone and produce plausible clinical trials from novel vectors. We further show that generated trial abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects. Our public ELM implementation and experimental results will aid the alignment of Large Language Models to embedding spaces in the biomedical domain and beyond.
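The idea of moving an embedding along a concept vector can be illustrated with a minimal sketch. The abstract does not say how the age and sex concept vectors are built, so the difference-of-means construction below is an assumption, and the function names and toy values are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concept_vector(positive, negative):
    """Unit direction from the mean of `negative` to the mean of `positive`.

    Assumption: a concept is modeled as a difference of group means.
    """
    diff = [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def shift_embedding(embedding, direction, alpha):
    """Move an embedding `alpha` units along a concept direction."""
    return [e + alpha * d for e, d in zip(embedding, direction)]

# Toy 3-dimensional embeddings (hypothetical values, not from the paper).
older_trials = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
younger_trials = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
age_direction = concept_vector(older_trials, younger_trials)
shifted = shift_embedding([0.5, 1.0, -1.0], age_direction, alpha=2.0)
```

In a setup like the one described, the shifted vector would then be passed to the embedding-conditioned language model to generate a new abstract; that decoding step is not sketched here.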

Summary

This study aligns Large Language Models (LLMs) with embeddings of clinical trials using the Embedding Language Model (ELM) method, with the goal of making embedding spaces more interpretable and usable for generation. The authors develop an open-source, domain-agnostic ELM architecture and training framework and train a series of models to compare training tasks and regimes. The final model, ctELM, can accurately describe and compare unseen clinical trials from their embeddings alone and generate plausible trials from novel vectors. Generated trial abstracts also respond to moving embeddings along concept vectors for the age and sex of study subjects.

Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes

Authors: Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, Sang Michael Xie

First: 2026-01-26T18:57:00+00:00 · Latest: 2026-01-26T18:57:00+00:00

Abstract

Typical reinforcement learning (RL) methods for LLM reasoning waste compute on hard problems, where correct on-policy traces are rare, policy gradients vanish, and learning stalls. To bootstrap more efficient RL, we consider reusing old sampling FLOPs (from prior inference or RL training) in the form of off-policy traces. Standard off-policy methods supervise against off-policy data, causing instabilities during RL optimization. We introduce PrefixRL, where we condition on the prefix of successful off-policy traces and run on-policy RL to complete them, side-stepping off-policy instabilities. PrefixRL boosts the learning signal on hard problems by modulating the difficulty of the problem through the off-policy prefix length. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more sample efficient. Empirically, we discover back-generalization: training only on prefixed problems generalizes to out-of-distribution unprefixed performance, with learned strategies often differing from those in the prefix. In our experiments, we source the off-policy traces by rejection sampling with the base model, creating a self-improvement loop. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data then RL), even after accounting for the compute spent on the initial rejection sampling, and increases the final reward by 3x. The gains transfer to held-out benchmarks, and PrefixRL is still effective when off-policy traces are derived from a different model family, validating its flexibility in practical settings.
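The way prefix length modulates problem difficulty can be sketched as follows. This is an illustration only: the whitespace-joined token handling, the fraction schedule, and the function name are assumptions, not the authors' implementation.

```python
def prefixed_prompts(problem, trace_tokens, fractions):
    """Build one prompt per prefix fraction of a successful off-policy trace.

    A longer prefix leaves less of the solution for the policy to complete,
    so the prefixed problem becomes easier.
    """
    prompts = []
    for fraction in fractions:
        cut = int(len(trace_tokens) * fraction)
        prefix = " ".join(trace_tokens[:cut])
        # A fraction of 0.0 yields the original, unprefixed problem.
        prompts.append(problem if not prefix else f"{problem}\n{prefix}")
    return prompts

# Hypothetical four-step trace for a toy problem.
trace = ["step1", "step2", "step3", "step4"]
prompts = prefixed_prompts("Q: solve it.", trace, [0.0, 0.5, 1.0])
```

On-policy RL would then sample completions from each prefixed prompt and score them with the usual task reward.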

Summary

The paper targets the inefficiency of typical RL methods on hard problems, where correct on-policy traces are rare and learning stalls. It proposes PrefixRL, which conditions on prefixes of successful off-policy traces and runs on-policy RL to complete them, avoiding the instabilities of supervising on off-policy data. In the experiments, the off-policy traces come from rejection sampling with the base model, which creates a self-improvement loop. PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data followed by RL) and increases the final reward by 3x. Training only on prefixed problems also generalizes to unprefixed problems (back-generalization), the gains transfer to held-out benchmarks, and the method remains effective when the off-policy traces come from a different model family.

MEGnifying Emotion: Sentiment Analysis from Annotated Brain Data

Authors: Brian Liu, Oiwi Parker Jones

First: 2026-01-26T18:55:44+00:00 · Latest: 2026-01-26T18:55:44+00:00

Abstract

Decoding emotion from brain activity could unlock a deeper understanding of the human experience. While a number of existing datasets align brain data with speech and with speech transcripts, no datasets have annotated brain data with sentiment. To bridge this gap, we explore the use of pre-trained Text-to-Sentiment models to annotate non-invasive brain recordings, acquired using magnetoencephalography (MEG), while participants listened to audiobooks. Having annotated the text, we employ force-alignment of the text and audio to align our sentiment labels with the brain recordings. It is straightforward then to train Brain-to-Sentiment models on these data. Experimental results show an improvement in balanced accuracy for Brain-to-Sentiment compared to baseline, supporting the proposed approach as a proof-of-concept for leveraging existing MEG datasets and learning to decode sentiment directly from the brain.
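The alignment step can be sketched as mapping time-stamped, sentiment-labeled words onto recording samples. The span format, the integer label encoding, the sampling rate, and the function name below are all hypothetical; the abstract does not describe the actual pipeline at this level.

```python
def labels_per_sample(word_spans, n_samples, sfreq, fill=-1):
    """Project (start_s, end_s, label) word spans onto recording samples.

    word_spans: word timings in seconds, as a forced aligner might produce,
                each carrying a sentiment label from a text model.
    sfreq:      sampling rate of the recording in Hz.
    Samples not covered by any word keep the `fill` value.
    """
    labels = [fill] * n_samples
    for start_s, end_s, label in word_spans:
        first = max(0, round(start_s * sfreq))
        last = min(n_samples, round(end_s * sfreq))
        for i in range(first, last):
            labels[i] = label
    return labels

# Two words at a toy 10 Hz sampling rate: 1 = positive, 0 = negative.
aligned = labels_per_sample([(0.0, 0.5, 1), (0.5, 1.0, 0)], n_samples=12, sfreq=10)
```

The resulting per-sample labels could then serve as targets for a classifier over windows of the recording.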

Summary

The research explores decoding sentiment from brain activity. Pre-trained Text-to-Sentiment models annotate the text of audiobooks that participants listened to during magnetoencephalography (MEG) recording, and forced alignment of text and audio maps the sentiment labels onto the brain recordings. Brain-to-Sentiment models trained on these data improve balanced accuracy over the baseline, which supports the approach as a proof of concept for decoding sentiment directly from existing MEG datasets.

Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets

Authors: Iaroslav Chelombitko, Mika Hämäläinen, Aleksey Komissarov

First: 2026-01-26T18:55:28+00:00 · Latest: 2026-01-26T18:55:28+00:00

Comments: 15 pages, 4 figures, 4 tables

Abstract

We present a large-scale comparative study of 242 Latin and Cyrillic-script languages using subword-based methodologies. By constructing 'glottosets' from Wikipedia lexicons, we introduce a framework for simultaneous cross-linguistic comparison via Byte-Pair Encoding (BPE). Our approach utilizes rank-based subword vectors to analyze vocabulary overlap, lexical divergence, and language similarity at scale. Evaluations demonstrate that BPE segmentation aligns with morpheme boundaries 95% better than random baseline across 15 languages (F1 = 0.34 vs 0.15). BPE vocabulary similarity correlates significantly with genetic language relatedness (Mantel r = 0.329, p < 0.001), with Romance languages forming the tightest cluster (mean distance 0.51) and cross-family pairs showing clear separation (0.82). Analysis of 26,939 cross-linguistic homographs reveals that 48.7% receive different segmentations across related languages, with variation correlating to phylogenetic distance. Our results provide quantitative macro-linguistic insights into lexical patterns across typologically diverse languages within a unified analytical framework.
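For readers unfamiliar with Byte-Pair Encoding, the merge-learning loop can be sketched in a few lines. This is a generic textbook-style BPE learner over a word list, with an arbitrary deterministic tie-break chosen here for reproducibility; it is not the authors' glottoset pipeline, and the rank-based subword vectors mentioned in the abstract are not reproduced.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe(["aab", "aab", "ab"], num_merges=5)
```

Merge lists learned separately per language could then be compared, for example by overlap or by merge rank, which is the kind of comparison the abstract describes.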

Summary

The study compares 242 Latin- and Cyrillic-script languages using subword-based methods. The researchers construct 'glottosets' from Wikipedia lexicons and use Byte-Pair Encoding (BPE) to analyze vocabulary overlap, lexical divergence, and language similarity. BPE segmentation aligns with morpheme boundaries better than a random baseline (F1 = 0.34 vs 0.15 across 15 languages), and BPE vocabulary similarity correlates significantly with genetic language relatedness (Mantel r = 0.329, p < 0.001), with Romance languages clustering most tightly and cross-family pairs clearly separated. Of 26,939 cross-linguistic homographs, 48.7% receive different segmentations across related languages, and this variation correlates with phylogenetic distance.

MortalMATH: Evaluating the Conflict Between Reasoning Objectives and Emergency Contexts

Authors: Etienne Lanzeray, Stephane Meilliez, Malo Ruelle, Damien Sileo

First: 2026-01-26T18:55:07+00:00 · Latest: 2026-01-26T18:55:07+00:00

Abstract

Large Language Models are increasingly optimized for deep reasoning, prioritizing the correct execution of complex tasks over general conversation. We investigate whether this focus on calculation creates a "tunnel vision" that ignores safety in critical situations. We introduce MortalMATH, a benchmark of 150 scenarios where users request algebra help while describing increasingly life-threatening emergencies (e.g., stroke symptoms, freefall). We find a sharp behavioral split: generalist models (like Llama-3.1) successfully refuse the math to address the danger. In contrast, specialized reasoning models (like Qwen-3-32b and GPT-5-nano) often ignore the emergency entirely, maintaining over 95 percent task completion rates while the user describes dying. Furthermore, the computational time required for reasoning introduces dangerous delays: up to 15 seconds before any potential help is offered. These results suggest that training models to relentlessly pursue correct answers may inadvertently unlearn the survival instincts required for safe deployment.

Summary

The study asks whether optimizing large language models for deep reasoning creates a 'tunnel vision' that neglects safety in critical situations. It introduces MortalMATH, a benchmark of 150 scenarios in which users request algebra help while describing increasingly life-threatening emergencies. Generalist models such as Llama-3.1 decline the math to address the danger, whereas specialized reasoning models such as Qwen-3-32b and GPT-5-nano often ignore the emergency entirely and maintain task completion rates above 95 percent. Reasoning time also delays any potential help by up to 15 seconds. The authors suggest that training models to relentlessly pursue correct answers may inadvertently unlearn the survival instincts required for safe deployment.

Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings

Authors: Mumin Jia, Jairo Diaz-Rodriguez

First: 2026-01-26T18:54:34+00:00 · Latest: 2026-01-26T18:54:34+00:00

Comments: arXiv admin note: substantial text overlap with arXiv:2510.03437

Abstract

Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
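A penalized kernel change-point objective of the kind described can be minimized exactly with dynamic programming. The sketch below is a generic optimal-partitioning solver under stated assumptions: a linear kernel on the raw vectors, a fixed penalty per segment, and an O(n^2) search. The paper's actual kernel, penalty choice, and solver are not given in the abstract, and the function name is hypothetical.

```python
def kcpd_boundaries(X, penalty):
    """Return segment start indices minimizing kernel cost + penalty per segment.

    Segment cost for X[a:b] with kernel k is
        sum_i k(x_i, x_i) - (1 / (b - a)) * sum_{i,j} k(x_i, x_j),
    which is the within-segment scatter in feature space.
    """
    n = len(X)
    # Linear-kernel Gram matrix (assumption; any PSD kernel could be used).
    gram = [[sum(u * v for u, v in zip(x, y)) for y in X] for x in X]
    # S[i][j] holds the sum of gram[:i][:j], so block sums take O(1).
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        row_sum = 0.0
        for j in range(n):
            row_sum += gram[i][j]
            S[i + 1][j + 1] = S[i][j + 1] + row_sum
    diag = [0.0] * (n + 1)
    for i in range(n):
        diag[i + 1] = diag[i] + gram[i][i]

    def cost(a, b):
        block = S[b][b] - S[a][b] - S[b][a] + S[a][a]
        return (diag[b] - diag[a]) - block / (b - a)

    # best[b] is the minimal penalized cost of segmenting X[:b].
    best = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for b in range(1, n + 1):
        for a in range(b):
            total = best[a] + cost(a, b) + penalty
            if total < best[b]:
                best[b], prev[b] = total, a
    # Backtrack to recover where each new segment starts.
    boundaries, b = [], n
    while b > 0:
        b = prev[b]
        if b > 0:
            boundaries.append(b)
    return boundaries[::-1]

# Toy "sentence embeddings": five vectors near one topic, five near another.
toy = [[0.0, 0.0]] * 5 + [[10.0, 10.0]] * 5
found = kcpd_boundaries(toy, penalty=1.0)
```

A larger penalty favors fewer segments: with `penalty=1000.0` the same toy input is kept as a single segment.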

Summary

The paper proposes Embed-KCPD, a training-free method for unsupervised text segmentation that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized kernel change-point detection (KCPD) objective. It develops a dependence-aware theory for KCPD under $m$-dependent sequences, including an oracle inequality and a localization guarantee for each true change point. An LLM-based simulation framework with known boundaries validates the predicted scaling behavior. On standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines, and a case study on Taylor Swift's tweets illustrates its practical use.

Design Techniques for LLM-Powered Interactive Storytelling: A Case Study of the Dramamancer System

Authors: Tiffany Wang, Yuqian Sun, Yi Wang, Melissa Roemmele, John Joon Young Chung, Max Kreminski

Venue: EMNLP

First: 2026-01-26T18:51:20+00:00 · Latest: 2026-01-26T18:51:20+00:00

Comments: Extended abstract presented at the 2025 Wordplay Workshop at EMNLP

Abstract

The rise of Large Language Models (LLMs) has enabled a new paradigm for bridging authorial intent and player agency in interactive narrative. We consider this paradigm through the example of Dramamancer, a system that uses an LLM to transform author-created story schemas into player-driven playthroughs. This extended abstract outlines some design techniques and evaluation considerations associated with this system.

Summary

This extended abstract examines how Large Language Models (LLMs) can bridge authorial intent and player agency in interactive narrative, using the Dramamancer system as a case study. Dramamancer uses an LLM to transform author-created story schemas into player-driven playthroughs. The paper outlines design techniques and evaluation considerations associated with the system.

POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration

Authors: Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, Aviral Kumar

First: 2026-01-26T18:47:21+00:00 · Latest: 2026-01-26T18:47:21+00:00

Abstract

Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods still fail to learn on many training problems. On hard problems, on-policy RL rarely explores even a single correct rollout, yielding zero reward and no learning signal for driving improvement. We find that natural solutions to remedy this exploration problem from classical RL, such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives, do not resolve this issue and often destabilize optimization without improving solvability. A natural alternative is to leverage transfer from easier problems. However, we show that mixing easy and hard problems during RL training is counterproductive due to ray interference, where optimization focuses on already-solvable problems in a way that actively inhibits progress on harder ones. To address this challenge, we introduce Privileged On-Policy Exploration (POPE), an approach that leverages human- or other oracle solutions as privileged information to guide exploration on hard problems, unlike methods that use oracle solutions as training targets (e.g., off-policy RL methods or warmstarting from SFT). POPE augments hard problems with prefixes of oracle solutions, enabling RL to obtain non-zero rewards during guided rollouts. Crucially, the resulting behaviors transfer back to the original, unguided problems through a synergy between instruction-following and reasoning. Empirically, POPE expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.
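Why an all-zero batch of rollouts gives no learning signal can be seen with a small sketch. The abstract does not name the RL algorithm, so the group-normalized advantage below (in the style of GRPO) is an assumption used purely for illustration.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages for rollouts of a single problem.

    Assumption: each advantage is the reward minus the group mean,
    divided by the group standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hard problem, unguided: no rollout is correct, so every advantage is zero.
unguided = group_advantages([0, 0, 0, 0])

# Same problem with a guiding prefix: one rollout succeeds, so the
# advantages become non-zero and the policy gradient no longer vanishes.
guided = group_advantages([0, 0, 0, 1])
```

Under this assumption, any mechanism that produces even one successful rollout per group, such as a guiding prefix, restores a usable gradient.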

Summary

The paper addresses hard problems on which on-policy RL for LLM reasoning rarely samples a single correct rollout and therefore receives no learning signal. Entropy bonuses, more permissive clipping, and pass@k objectives do not fix this, and mixing easy and hard problems is counterproductive because of ray interference. The proposed method, Privileged On-Policy Exploration (POPE), augments hard problems with prefixes of human- or other oracle solutions, used as privileged guidance rather than as training targets, so that guided rollouts obtain non-zero rewards. The resulting behaviors transfer back to the original, unguided problems. Empirically, POPE expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.

Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

Authors: Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier, Julia Kempe

First: 2026-01-26T18:46:56+00:00 · Latest: 2026-01-26T18:46:56+00:00

Abstract

Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR: A self-improvement framework designed to surface these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy, and is rewarded with its improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well-posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to actually solve the hard problems, paving a principled path to escape reasoning plateaus without additional curated data.

Summary

The study asks whether a pretrained LLM can generate an automated curriculum for problems it cannot initially solve. The researchers design SOAR, a self-improvement framework based on meta-reinforcement learning (meta-RL) in which a teacher copy of the model proposes synthetic problems for a student copy and is rewarded by the student's measured improvement on a small subset of hard problems. On the hardest subsets of mathematical benchmarks (0/128 success), three findings emerge: bi-level meta-RL can unlock learning under sparse, binary rewards; grounded rewards outperform the intrinsic reward schemes of prior LLM self-play and avoid their instability and diversity collapse; and the structural quality and well-posedness of generated questions matter more for learning progress than solution correctness. The results suggest that generating useful stepping stones does not require the preexisting ability to solve the hard problems.

PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

Authors: Abhishek Divekar, Anirban Majumder

Venue: AAAI 2026

First: 2026-01-26T18:46:49+00:00 · Latest: 2026-01-26T18:46:49+00:00

Comments: Accepted at AAAI 2026 - Innovative Applications of AI (IAAI-26)

Abstract

Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. In recent times, several deployed systems have explored the usage of Large Language Models (LLMs) as automated judges for this task while their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics which require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, reducing annotation requirements significantly compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations at the query-document level. By reformulating the metric-integration space, we reduced the computational complexity from O(2^|C|) to O(2^K), where |C| represents corpus size (in order of millions). Detailed experiments across prominent retrieval datasets demonstrate that our method reduces the variance of estimates for the business-critical Precision@K metric, while effectively correcting for LLM bias in low-resource settings.
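The core of Prediction-Powered Inference for a mean can be sketched in a few lines: average the LLM judgments on the large unlabeled set, then add a correction measured on the small human-labeled set. This shows only the basic PPI mean estimator; the extension to sub-instance, query-document annotations and Precision@K that PRECISE introduces is not reproduced, and the function name and toy data are hypothetical.

```python
def ppi_mean(human_labels, judge_on_labeled, judge_on_unlabeled):
    """Basic prediction-powered estimate of a mean.

    human_labels:       gold labels Y on the small labeled set.
    judge_on_labeled:   LLM judgments f(X) on that same labeled set.
    judge_on_unlabeled: LLM judgments f(X') on the large unlabeled set.
    """
    n = len(human_labels)
    # Average gap between human labels and LLM judgments ("rectifier").
    rectifier = sum(y - f for y, f in zip(human_labels, judge_on_labeled)) / n
    judge_mean = sum(judge_on_unlabeled) / len(judge_on_unlabeled)
    return judge_mean + rectifier

# Toy relevance data: the judge marks everything relevant on the labeled
# set, while humans mark half as relevant, so the judge overestimates by 0.5.
estimate = ppi_mean(
    human_labels=[1, 0, 1, 0],
    judge_on_labeled=[1, 1, 1, 1],
    judge_on_unlabeled=[1, 1, 1, 0],
)
```

Here the raw judge average on the unlabeled set is 0.75, and the correction of -0.5 brings the estimate down to 0.25.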

Summary

The research presents PRECISE, a statistical framework that extends Prediction-Powered Inference (PPI) to reduce the bias of LLM-based evaluation of search, ranking, and RAG systems. It combines a small number of human annotations with LLM judgments to produce reliable estimates of metrics that require sub-instance annotations at the query-document level. The method needs as few as 100 human-annotated queries and 10,000 unlabeled examples, and reformulating the metric-integration space reduces computational complexity from O(2^|C|) to O(2^K). Experiments on prominent retrieval datasets show that PRECISE reduces the variance of Precision@K estimates and corrects for LLM bias in low-resource settings.

Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory

Authors: Yanming Liu, Xinyue Peng, Zixuan Yan, Yanxin Shen, Wenjie Xu, Yuefeng Huang, Xinyi Wang, Jiannan Cao, Jianwei Yin, Xuhong Zhang

First: 2026-01-26T18:42:33+00:00 · Latest: 2026-01-26T18:42:33+00:00

Comments: Dep-Search 1st version

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, particularly when augmented with search mechanisms that enable systematic exploration of external knowledge bases. The field has evolved from traditional retrieval-augmented generation (RAG) frameworks to more sophisticated search-based frameworks that orchestrate multi-step reasoning through explicit search strategies. However, existing search frameworks still rely heavily on implicit natural language reasoning to determine search strategies and how to leverage retrieved information across reasoning steps. This reliance on implicit reasoning creates fundamental challenges for managing dependencies between sub-questions, efficiently reusing previously retrieved knowledge, and learning optimal search strategies through reinforcement learning. To address these limitations, we propose Dep-Search, a dependency-aware search framework that advances beyond existing search frameworks by integrating structured reasoning, retrieval, and persistent memory through GRPO. Dep-Search introduces explicit control mechanisms that enable the model to decompose questions with dependency relationships, retrieve information when needed, access previously stored knowledge from memory, and summarize long reasoning contexts into reusable memory entries. Through extensive experiments on seven diverse question answering datasets, we demonstrate that Dep-Search significantly enhances LLMs' ability to tackle complex multi-hop reasoning tasks, achieving substantial improvements over strong baselines across different model scales.

Summary

Dep-Search is a dependency-aware search framework that integrates structured reasoning, retrieval, and persistent memory through GRPO. Its explicit control mechanisms let the model decompose questions with dependency relationships, retrieve information when needed, access knowledge stored in memory, and summarize long reasoning contexts into reusable memory entries. Experiments on seven question answering datasets show substantial improvements over strong baselines on complex multi-hop reasoning across different model scales.

HiCache: A Plug-in Scaled-Hermite Upgrade for Taylor-Style Cache-then-Forecast Diffusion Acceleration

Authors: Liang Feng, Shikang Zheng, Jiacheng Liu, Yuqi Lin, Qinming Zhou, Peiliang Cai, Xinyu Wang, Junjie Chen, Chang Zou, Yue Ma, Linfeng Zhang

First: 2025-08-23T10:35:16+00:00 · Latest: 2026-01-26T18:39:41+00:00

Abstract

Diffusion models have achieved remarkable success in content generation but often incur prohibitive computational costs due to iterative sampling. Recent feature caching methods accelerate inference via temporal extrapolation, yet can suffer quality degradation from inaccurate modeling of the complex dynamics of feature evolution. We propose HiCache (Hermite Polynomial-based Feature Cache), a training-free acceleration framework that improves feature prediction by aligning mathematical tools with empirical properties. Our key insight is that feature-derivative approximations in diffusion Transformers exhibit multivariate Gaussian characteristics, motivating the use of Hermite polynomials as a potentially optimal basis for Gaussian-correlated processes. We further introduce a dual-scaling mechanism that ensures numerical stability while preserving predictive accuracy, and is also effective when applied standalone or integrated with TaylorSeer. Extensive experiments demonstrate HiCache's superiority, achieving 5.55x speedup on FLUX.1-dev while matching or exceeding baseline quality, and maintaining strong performance across text-to-image, video generation, and super-resolution tasks. Moreover, HiCache can be naturally added to previous caching methods to enhance their performance, e.g., improving ClusCa from 0.9480 to 0.9840 in terms of image rewards. Code: https://github.com/fenglang918/HiCache

中文标题/摘要

标题：HiCache：面向Taylor式“先缓存后预测”扩散加速的即插即用缩放Hermite升级

扩散模型在内容生成方面取得了显著成功，但由于迭代采样，往往会产生高昂的计算成本。最近的特征缓存方法通过时间外推加速推理，但可能会因对特征演变复杂动态建模不准确而导致质量下降。我们提出了HiCache（基于Hermite多项式的特征缓存），这是一种无需训练的加速框架，通过将数学工具与经验特性对齐来提高特征预测。我们的核心见解是，扩散Transformer中的特征导数近似表现出多元高斯特性，这促使我们使用Hermite多项式作为高斯相关过程的潜在最优基。我们还引入了一种双重缩放机制，以确保数值稳定性同时保持预测准确性，并且在单独使用或与TaylorSeer集成时也有效。广泛的实验表明HiCache的优越性，在FLUX.1-dev上实现了5.55倍的加速，同时匹配或超越基线质量，并在文本到图像、视频生成和超分辨率任务中保持了强大的性能。此外，HiCache可以自然地添加到先前的缓存方法中以增强其性能，例如，将ClusCa的图像奖励从0.9480提高到0.9840。代码：https://github.com/fenglang918/HiCache

Summary / 总结

HiCache is a training-free acceleration framework for diffusion models that uses Hermite polynomials to improve feature prediction, achieving 5.55x speedup on FLUX.1-dev while maintaining or exceeding baseline quality. It introduces a dual-scaling mechanism for numerical stability and can be integrated with existing methods to enhance their performance.

HiCache 是一种无需训练的加速框架，利用 Hermite 多项式改进特征预测。引入了双重缩放机制以确保数值稳定性和准确性。实验表明，HiCache 在 FLUX.1-dev 上实现了 5.55 倍的加速，同时在文本到图像、视频生成和超分辨率等任务中保持或提升了质量。它还能增强 ClusCa 等先前的缓存方法的性能。
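
As a rough illustration of the cache-then-forecast idea described above, the sketch below extrapolates one cached scalar feature from backward finite differences, once with the monomial (Taylor-style) basis and once with a scaled Hermite basis. Everything here is an assumption made for illustration: the function names, the choice of probabilists' Hermite polynomials, and the way the contraction factor `sigma` enters are not taken from the HiCache code.

```python
import math


def hermite_e(n, x):
    """Probabilists' Hermite polynomial He_n(x) via He_{m+1} = x*He_m - m*He_{m-1}."""
    prev, curr = 1.0, x
    if n == 0:
        return prev
    for m in range(1, n):
        prev, curr = curr, x * curr - m * prev
    return curr


def backward_differences(history, order):
    """history: cached values, newest last. Returns [f, Δf, Δ²f, ...] at the newest step."""
    level = [float(v) for v in history]
    diffs = [level[-1]]
    for _ in range(order):
        level = [b - a for a, b in zip(level, level[1:])]
        diffs.append(level[-1])
    return diffs


def forecast(history, k, order=2, sigma=None):
    """Predict the value k steps ahead.

    sigma=None uses the monomial basis k**n (Taylor-style);
    otherwise the basis is sigma**n * He_n(sigma * k) (hypothetical scaling).
    Requires len(history) >= order + 1.
    """
    total = 0.0
    for n, d in enumerate(backward_differences(history, order)):
        if sigma is None:
            basis = float(k) ** n
        else:
            basis = (sigma ** n) * hermite_e(n, sigma * k)
        total += d * basis / math.factorial(n)
    return total
```

For tensors, the same arithmetic would be applied element-wise. The point of the sketch is only that swapping the basis changes how higher-order differences contribute to the forecast; which scaling works well in practice is an empirical question the paper addresses.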

Consistent Kernel Change-Point Detection under m-Dependence for Text Segmentation

Authors: Jairo Diaz-Rodriguez, Mumin Jia

First: 2025-10-03T18:57:22+00:00 · Latest: 2026-01-26T18:36:37+00:00

Comments: This paper is withdrawn due to an error in the proof of Proposition 3, which is used to support Theorem 1

Abstract

Kernel change-point detection (KCPD) has become a widely used tool for identifying structural changes in complex data. While existing theory establishes consistency under independence assumptions, real-world sequential data such as text exhibits strong dependencies. We establish new guarantees for KCPD under $m$-dependent data: specifically, we prove consistency in the number of detected change points and weak consistency in their locations under mild additional assumptions. We perform an LLM-based simulation that generates synthetic $m$-dependent text to validate the asymptotics. To complement these results, we present the first comprehensive empirical study of KCPD for text segmentation with modern embeddings. Across diverse text datasets, KCPD with text embeddings outperforms baselines in standard text segmentation metrics. We demonstrate through a case study on Taylor Swift's tweets that KCPD not only provides strong theoretical and simulated reliability but also practical effectiveness for text segmentation tasks.

中文标题/摘要

标题：m-依赖下用于文本分割的一致核变化点检测

核变化点检测（KCPD）已成为识别复杂数据中结构变化的常用工具。虽然现有理论在独立假设下建立了其一致性，但诸如文本等实际序列数据表现出强烈的依赖性。我们为KCPD在m-依赖数据下建立了新的保证：具体而言，我们证明了在检测到的变化点数量上的一致性，并在轻微的附加假设下证明了其位置的弱一致性。我们使用基于LLM的模拟生成合成的m-依赖文本来验证渐近性。为了补充这些结果，我们首次对KCPD在文本分段中的现代嵌入进行了全面的实证研究。在多种文本数据集上，KCPD与文本嵌入的表现优于基线方法在标准文本分段指标中的表现。我们通过泰勒·斯威夫特的推文案例研究证明，KCPD不仅提供了强大的理论和模拟可靠性，而且在文本分段任务中具有实际有效性。

Summary / 总结

The research aims to improve kernel change-point detection (KCPD) for identifying structural changes in text data, which often exhibits strong dependencies. The authors establish new theoretical guarantees for KCPD under $m$-dependent data, proving consistency in the number of detected change points and weak consistency in their locations. Empirical studies using modern text embeddings show that KCPD outperforms baselines in text segmentation tasks across various datasets. A case study on Taylor Swift's tweets confirms the practical effectiveness of KCPD for text segmentation. However, the paper is withdrawn due to an error in the proof of Proposition 3, which supports Theorem 1.

论文解决了文本数据中强依赖性带来的结构变化检测挑战。它为在$m$-依赖数据下进行核变化点检测（KCPD）建立了新的理论保证，证明了检测到的变化点数量的一致性和位置的弱一致性。使用现代文本嵌入进行的实证研究表明，KCPD在各种数据集上的文本分割任务中优于基线方法。通过泰勒·斯威夫特的推文案例研究，展示了KCPD在文本分割任务中的实际有效性。然而，由于支持定理1的命题3的证明中存在错误，论文被撤回。
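
To make the kernel change-point idea concrete, here is a minimal sketch of penalized kernel segmentation by dynamic programming over scalar inputs. It is a generic textbook-style formulation, not the authors' implementation: the Gaussian kernel, the `bandwidth` and `penalty` values, and the use of scalars instead of text embeddings are all assumptions for illustration.

```python
import math


def gaussian_gram(xs, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on scalar observations."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in xs] for a in xs]


def segment_cost(gram, s, t):
    """Kernel scatter of points s..t-1: sum_i k(x_i, x_i) - (1/n) sum_ij k(x_i, x_j)."""
    n = t - s
    diag = sum(gram[i][i] for i in range(s, t))
    block = sum(gram[i][j] for i in range(s, t) for j in range(s, t))
    return diag - block / n


def kernel_change_points(xs, penalty=1.0, bandwidth=1.0):
    """Minimize total segment cost plus a per-segment penalty; return change indices."""
    n = len(xs)
    gram = gaussian_gram(xs, bandwidth)
    best = [0.0] + [math.inf] * n  # best[t]: minimal penalized cost of xs[:t]
    last = [0] * (n + 1)           # last[t]: start index of the final segment of xs[:t]
    for t in range(1, n + 1):
        for s in range(t):
            cand = best[s] + segment_cost(gram, s, t) + penalty
            if cand < best[t]:
                best[t], last[t] = cand, s
    change_points, t = [], n
    while t > 0:
        t = last[t]
        if t > 0:
            change_points.append(t)
    return sorted(change_points)
```

For text segmentation, `xs` would be replaced by sentence embeddings and the kernel by one defined on vectors. The segment cost is recomputed naively here, which is fine for short sequences but would need prefix sums or pruning at scale.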

DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception

Authors: Tim Broedermann, Christos Sakaridis, Luigi Piccinelli, Wim Abbeloos, Luc Van Gool

First: 2025-09-11T20:03:00+00:00 · Latest: 2026-01-26T18:33:05+00:00

Comments: Code and models are available at https://github.com/timbroed/DGFusion

Abstract

Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DeLiVER datasets. Code and models are available at https://github.com/timbroed/DGFusion

中文标题/摘要

标题：DGFusion：深度引导的传感器融合以实现鲁棒的语义感知

自动驾驶车辆的鲁棒语义感知依赖于有效结合具有互补优势和劣势的多种传感器。最先进的语义感知传感器融合方法通常在输入的空间范围内均匀处理传感器数据，这在面对挑战性条件时会阻碍性能。相比之下，我们提出了一种新颖的深度引导多模态融合方法，通过整合深度信息来提升条件感知融合。我们的网络DGFusion将多模态分割视为一个多任务问题，利用通常在户外传感器套件中可用的激光雷达测量数据，作为模型的输入之一和学习深度的地面真实值。我们的相应辅助深度头有助于学习深度感知特征，这些特征被编码为在空间上变化的局部深度令牌，以条件我们的注意跨模态融合。结合全局条件令牌，这些局部深度令牌动态适应场景中每个传感器的空间变化可靠性，这在很大程度上取决于深度。此外，我们提出了一种鲁棒的深度损失，这对于在通常在恶劣条件下稀疏且噪声的激光雷达输入中学习至关重要。我们的方法在具有挑战性的MUSES和DeLiVER数据集上实现了最先进的全景和语义分割性能。代码和模型可在https://github.com/timbroed/DGFusion获取

Summary / 总结

DGFusion proposes a depth-guided multimodal fusion method for robust semantic perception in autonomous vehicles. The method integrates depth information to condition the fusion process, improving performance in challenging conditions. DGFusion uses lidar measurements as both input and ground truth to learn depth-aware features, which are encoded into local depth tokens that adapt sensor fusion based on the spatial reliability of each sensor. The method achieves state-of-the-art performance on MUSES and DeLiVER datasets for panoptic and semantic segmentation.

DGFusion 提出了一种深度引导的多模态融合方法，以增强自主车辆的语义感知。该方法通过整合深度信息来调整融合过程，从而在恶劣条件下提高性能。网络 DGFusion 使用激光雷达测量作为输入和 ground truth 来学习深度感知特征，这些特征被编码为局部深度令牌，根据每个传感器在场景中的空间可靠性动态调整融合。该方法在 MUSES 和 DeLiVER 数据集上实现了最先进的全景和语义分割性能。

RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation

Authors: Andrei Kozyrev, Nikita Khramov, Gleb Solovev, Anton Podkopaev

First: 2025-05-28T20:26:11+00:00 · Latest: 2026-01-26T18:27:30+00:00

Abstract

Interactive Theorem Proving was repeatedly shown to be fruitful when combined with Generative Artificial Intelligence. This paper assesses multiple approaches to Rocq generation and illuminates potential avenues for improvement. We identify retrieval-based premise selection as a central component of effective Rocq proof generation and propose a novel approach based on a self-attentive embedder model. The evaluation of the designed approach shows up to 28% relative increase of the generator's performance. We tackle the problem of writing Rocq proofs using a multi-stage agentic system, tailored for formal verification, and demonstrate its high effectiveness. We conduct an ablation study and demonstrate that incorporating multi-agent debate during the planning stage increases the proof success rate by 20% overall and nearly doubles it for complex theorems, while the reflection mechanism further enhances stability and consistency.

中文标题/摘要

标题：RocqStar：利用相似性驱动检索与自主系统进行Rocq生成

交互式定理证明在与生成型人工智能结合时已被证明是富有成效的。本文评估了多种Rocq生成方法，并阐明了改进的潜在途径。我们确定基于检索的前提选择是有效Rocq证明生成的核心组成部分，并提出了一种基于自我注意嵌入模型的新颖方法。所设计方法的评估显示生成器性能相对提高高达28%。我们使用一个针对形式验证定制的多阶段自主系统解决Rocq证明的编写问题，并证明其具有很高的有效性。我们进行了消融研究，并证明在规划阶段引入多智能体辩论可以将证明成功率整体提高20%，对于复杂定理几乎翻倍，而反思机制进一步增强了稳定性和一致性。

Summary / 总结

This paper explores the use of similarity-driven retrieval and agentic systems to improve Rocq proof generation in Interactive Theorem Proving. It introduces a self-attentive embedder model for premise selection and reports up to a 28% relative improvement in generator performance. It also proposes a multi-stage agentic system tailored for formal verification; in an ablation study, multi-agent debate during the planning stage increases the proof success rate by 20% overall and nearly doubles it for complex theorems, while a reflection mechanism further improves stability and consistency.

该论文探讨了将相似性驱动的检索与代理系统结合以提升定理证明中的Rocq生成。作者提出了一种自注意嵌入模型用于前提选择，并评估其性能，实现了生成器效率高达28%的相对提升。此外，他们引入了一种针对形式验证的多阶段代理系统，通过多代理辩论和反思机制，整体提高了证明成功率20%，对于复杂定理几乎翻倍。
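
Retrieval-based premise selection, as described above, amounts to ranking known statements by their embedding similarity to the current goal. The sketch below shows that ranking step with cosine similarity. The lemma names and the three-dimensional vectors are made up for illustration; in the paper the vectors would come from the self-attentive embedder, which is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_premises(goal_vec, library, k=2):
    """Return the names of the k library statements most similar to the goal."""
    ranked = sorted(library.items(), key=lambda item: cosine(goal_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical embeddings of three lemmas and one goal.
LIBRARY = {
    "add_comm": [1.0, 0.0, 0.0],
    "mul_comm": [0.0, 1.0, 0.0],
    "add_assoc": [0.9, 0.1, 0.0],
}
GOAL = [1.0, 0.05, 0.0]
```

Calling `select_premises(GOAL, LIBRARY)` ranks the two addition lemmas ahead of `mul_comm`; the selected premises would then be placed in the generator's context.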

Beyond Preferences: Learning Alignment Principles Grounded in Human Reasons and Values

Authors: Henry Bell, Lara Neubauer da Costa Schertel, Bochu Ding, Brandon Fain

First: 2026-01-26T18:27:00+00:00 · Latest: 2026-01-26T18:27:00+00:00

Abstract

A crucial consideration when developing and deploying Large Language Models (LLMs) is the human values to which these models are aligned. In the constitutional framework of alignment, models are aligned to a set of principles (the constitution) specified in natural language. However, it is unclear how to fairly determine this constitution with widespread stakeholder input. In this work, we propose Grounded Constitutional AI (GCAI), a unified framework for generating constitutions of principles that are representative of both users' general expectations toward AI (general principles) and their interaction-time preferences (contextual principles). We extend the Inverse Constitutional AI (ICAI) approach to generate contextual principles from human preference annotation data by leveraging human-provided reasons for their preferences. We supplement these contextual principles with general principles surfaced from user statements of values regarding AI. We show that a constitution generated by GCAI is preferred by humans over one generated through ICAI, both personally and for widespread use in governing AI behavior. Additionally, participants consider the GCAI constitution to be more morally grounded, coherent, and pluralistic.

中文标题/摘要

标题：超越偏好：基于人类理由和价值观的学习对齐原则

在开发和部署大型语言模型（LLMs）时，一个关键考虑因素是这些模型与哪些人类价值观相一致。在对齐模型的宪法框架中，模型被对齐到一组用自然语言指定的原则（宪法）。然而，如何公平地确定这组宪法以包含广泛的多方利益相关者输入尚不明确。在本研究中，我们提出了基于人类理由的宪法AI（GCAI），这是一种统一框架，用于生成既代表用户对AI的普遍期望（一般原则）又代表其交互时偏好（情境原则）的原则宪法。我们扩展了逆向宪法AI（ICAI）方法，通过利用人类提供的偏好理由数据来生成情境原则。我们还通过用户关于AI的声明中的价值观补充了这些情境原则。我们展示了由GCAI生成的宪法比由ICAI生成的宪法更受人类的青睐，无论是个人偏好还是广泛应用于治理AI行为。此外，参与者认为GCAI宪法更具道德基础、更连贯和包容性。

Summary / 总结

This research addresses the challenge of fairly determining the principles that Large Language Models should be aligned with, by proposing a unified framework called Grounded Constitutional AI (GCAI). GCAI generates constitutions of principles that reflect both general user expectations and interaction-time preferences, by incorporating human-provided reasons for preferences and values from user statements. The study demonstrates that constitutions generated by GCAI are more preferred by humans and considered more morally grounded, coherent, and pluralistic compared to those generated by the Inverse Constitutional AI (ICAI) approach.

本研究旨在通过提出Grounded Constitutional AI (GCAI)来解决公平确定大型语言模型应遵循的原则的问题。GCAI生成的宪法既包含了从用户价值观中提取的一般原则，也包含了基于人类偏好理由的上下文原则。研究结果表明，GCAI生成的宪法比通过逆向宪法AI (ICAI)方法生成的宪法更受人类的青睐，并且被认为更加道德、连贯和包容。

$α^3$-SecBench: A Large-Scale Evaluation Suite of Security, Resilience, and Trust for LLM-based UAV Agents over 6G Networks

Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

First: 2026-01-26T18:25:07+00:00 · Latest: 2026-01-26T18:25:07+00:00

Abstract

Autonomous unmanned aerial vehicle (UAV) systems are increasingly deployed in safety-critical, networked environments where they must operate reliably in the presence of malicious adversaries. While recent benchmarks have evaluated large language model (LLM)-based UAV agents in reasoning, navigation, and efficiency, systematic assessment of security, resilience, and trust under adversarial conditions remains largely unexplored, particularly in emerging 6G-enabled settings. We introduce $α^{3}$-SecBench, the first large-scale evaluation suite for assessing the security-aware autonomy of LLM-based UAV agents under realistic adversarial interference. Building on multi-turn conversational UAV missions from $α^{3}$-Bench, the framework augments benign episodes with 20,000 validated security overlay attack scenarios targeting seven autonomy layers, including sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. $α^{3}$-SecBench evaluates agents across three orthogonal dimensions: security (attack detection and vulnerability attribution), resilience (safe degradation behavior), and trust (policy-compliant tool usage). We evaluate 23 state-of-the-art LLMs from major industrial providers and leading AI labs using thousands of adversarially augmented UAV episodes sampled from a corpus of 113,475 missions spanning 175 threat types. While many models reliably detect anomalous behavior, effective mitigation, vulnerability attribution, and trustworthy control actions remain inconsistent. Normalized overall scores range from 12.9% to 57.1%, highlighting a significant gap between anomaly detection and security-aware autonomous decision-making. We release $α^{3}$-SecBench on GitHub: https://github.com/maferrag/AlphaSecBench

中文标题/摘要

标题：$α^3$-SecBench：基于6G网络的LLM驱动无人机代理安全、韧性和信任的大规模评估套件

自主无人机系统越来越多地部署在安全关键的网络化环境中，必须在存在恶意对手的情况下可靠运行。尽管最近的基准测试评估了基于大型语言模型（LLM）的无人机代理在推理、导航和效率方面的表现，但在对抗条件下对安全、韧性和信任的系统性评估仍然鲜有探索，特别是在新兴的6G环境下。我们提出了$α^{3}$-SecBench，这是首个在现实对抗干扰下评估基于LLM的无人机代理安全感知自主性的大规模评估套件。该框架基于$α^{3}$-Bench的多轮对话无人机任务，为良性任务片段叠加了20,000个经过验证的安全攻击场景，针对七个自主层：传感、感知、规划、控制、通信、边缘/云基础设施和LLM推理。$α^{3}$-SecBench从三个正交维度评估代理：安全（攻击检测与漏洞归因）、韧性（安全降级行为）和信任（符合策略的工具使用）。我们从涵盖175种威胁类型、共113,475次任务的语料库中抽取数千个对抗增强的无人机任务片段，评估了来自主要工业提供商和领先AI实验室的23种最先进的LLM。尽管许多模型能够可靠地检测异常行为，但有效的缓解、漏洞归因和可信赖的控制行动仍然不一致。归一化总体得分范围从12.9%到57.1%，突显了异常检测与安全感知自主决策之间的巨大差距。我们已在GitHub上发布了$α^{3}$-SecBench：https://github.com/maferrag/AlphaSecBench

VoxGuard: Evaluating User and Attribute Privacy in Speech via Membership Inference Attacks

Authors: Efthymios Tsaprazlis, Thanathai Lertpetchpun, Tiantian Feng, Sai Praneeth Karimireddy, Shrikanth Narayanan

First: 2025-09-22T20:57:48+00:00 · Latest: 2026-01-26T18:23:42+00:00

Abstract

Voice anonymization aims to conceal speaker identity and attributes while preserving intelligibility, but current evaluations rely almost exclusively on Equal Error Rate (EER) that obscures whether adversaries can mount high-precision attacks. We argue that privacy should instead be evaluated in the low false-positive rate (FPR) regime, where even a small number of successful identifications constitutes a meaningful breach. To this end, we introduce VoxGuard, a framework grounded in differential privacy and membership inference that formalizes two complementary notions: User Privacy, preventing speaker re-identification, and Attribute Privacy, protecting sensitive traits such as gender and accent. Across synthetic and real datasets, we find that informed adversaries, especially those using fine-tuned models and max-similarity scoring, achieve orders-of-magnitude stronger attacks at low-FPR despite similar EER. For attributes, we show that simple transparent attacks recover gender and accent with near-perfect accuracy even after anonymization. Our results demonstrate that EER substantially underestimates leakage, highlighting the need for low-FPR evaluation, and recommend VoxGuard as a benchmark for evaluating privacy leakage.

中文标题/摘要

标题：VoxGuard：通过成员推理攻击评估语音中的用户和属性隐私

语音匿名化旨在在保持可懂度的同时隐藏说话人身份和属性，但当前的评估几乎完全依赖于等错误率（EER），这掩盖了对手能否发起高精度攻击的情况。我们认为，隐私应该在低假阳性率（FPR）的范围内进行评估，在这种情况下，即使成功识别的次数很少也构成了一种有意义的泄露。为此，我们引入了VoxGuard框架，该框架基于差分隐私和成员推理，正式化了两种互补的概念：用户隐私，防止重新识别说话人，以及属性隐私，保护敏感特征，如性别和口音。在合成和真实数据集上，我们发现，知情的攻击者，尤其是使用微调模型和最大相似度评分的攻击者，尽管EER相近，却能在低FPR下实现强出数个数量级的攻击。对于属性，我们展示了简单的透明攻击在匿名化后仍能以近乎完美的准确度恢复性别和口音。我们的结果表明，EER严重低估了泄露，突显了低FPR评估的必要性，并推荐VoxGuard作为评估隐私泄露的基准。

Summary / 总结

VoxGuard evaluates user and attribute privacy in speech through membership inference attacks, arguing that current evaluations based on Equal Error Rate (EER) are insufficient because even a few successful identifications at a low false-positive rate (FPR) constitute a meaningful breach. The framework formalizes two privacy notions: User Privacy, which prevents speaker re-identification, and Attribute Privacy, which protects sensitive traits like gender and accent. The study finds that informed adversaries, especially those using fine-tuned models and max-similarity scoring, mount orders-of-magnitude stronger attacks at low FPR despite similar EER. Simple transparent attacks recover gender and accent with near-perfect accuracy even after anonymization, underscoring the need for low-FPR evaluation.

VoxGuard 通过成员推断攻击评估语音中的用户和属性隐私，认为当前基于等错误率（EER）的评估不足。研究引入了一个框架，在低假阳性率（FPR）的环境中评估隐私，因为在这种情况下即使少量成功的识别也构成有意义的泄露。研究发现，知情的攻击者，尤其是使用微调模型和最大相似度评分的攻击者，尽管EER相近，却能在低FPR下实现强出数个数量级的攻击。对于性别和口音等属性，简单的透明攻击在匿名化后仍能以近乎完美的准确率恢复这些特征，说明应以低FPR评估取代仅依赖EER的评估。
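
The contrast between EER and low-FPR evaluation can be shown on toy data. The sketch below computes both metrics from raw attack scores and constructs two hypothetical attackers that share the same EER but differ sharply in true-positive rate at zero false positives. The scores and helper names are invented for illustration and are unrelated to the paper's experiments.

```python
def roc_points(target_scores, nontarget_scores):
    """(FPR, TPR) at every observed score used as a 'score >= threshold' cutoff."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores), reverse=True)
    points = []
    for th in thresholds:
        tpr = sum(s >= th for s in target_scores) / len(target_scores)
        fpr = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        points.append((fpr, tpr))
    return points


def equal_error_rate(target_scores, nontarget_scores):
    """Operating point where FPR and FNR are closest; report their mean."""
    fpr, tpr = min(roc_points(target_scores, nontarget_scores),
                   key=lambda p: abs(p[0] - (1 - p[1])))
    return (fpr + (1 - tpr)) / 2


def tpr_at_fpr(target_scores, nontarget_scores, max_fpr):
    """Best TPR among operating points whose FPR does not exceed max_fpr."""
    feasible = [tpr for fpr, tpr in roc_points(target_scores, nontarget_scores)
                if fpr <= max_fpr]
    return max(feasible, default=0.0)


# Hypothetical attackers: same impostor scores, different target-score shapes.
NONTARGET = [0.90, 0.20, 0.10, 0.05]
CONFIDENT_ATTACKER = [0.99, 0.98, 0.97, 0.30]  # a few very confident hits
DIFFUSE_ATTACKER = [0.80, 0.70, 0.60, 0.30]    # never above the top impostor
```

Both attackers have an EER of 0.25, yet at zero false positives the first identifies 75% of targets and the second none. A single EER figure hides exactly this difference.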

HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs

Authors: Xinyue Zeng, Junhong Lin, Yujun Yan, Feng Guo, Liang Shi, Jun Wu, Dawei Zhou

Venue: ICLR

First: 2026-01-26T18:23:09+00:00 · Latest: 2026-01-26T18:23:09+00:00

Comments: Accepted by ICLR'26

Abstract

The reliability of Large Language Models (LLMs) in high-stakes domains such as healthcare, law, and scientific discovery is often compromised by hallucinations. These failures typically stem from two sources: data-driven hallucinations and reasoning-driven hallucinations. However, existing detection methods usually address only one source and rely on task-specific heuristics, limiting their generalization to complex scenarios. To overcome these limitations, we introduce the Hallucination Risk Bound, a unified theoretical framework that formally decomposes hallucination risk into data-driven and reasoning-driven components, linked respectively to training-time mismatches and inference-time instabilities. This provides a principled foundation for analyzing how hallucinations emerge and evolve. Building on this foundation, we introduce HalluGuard, an NTK-based score that leverages the induced geometry and captured representations of the NTK to jointly identify data-driven and reasoning-driven hallucinations. We evaluate HalluGuard on 10 diverse benchmarks, 11 competitive baselines, and 9 popular LLM backbones, consistently achieving state-of-the-art performance in detecting diverse forms of LLM hallucinations.

中文标题/摘要

标题：HalluGuard：揭开大型语言模型数据驱动和推理驱动幻觉的面纱

大型语言模型（LLMs）在医疗、法律和科学发现等高风险领域中的可靠性常常因幻觉而受损。这些失败通常源自两个来源：数据驱动的幻觉和推理驱动的幻觉。然而，现有的检测方法通常只解决其中一个来源，并依赖于特定任务的启发式方法，限制了它们在复杂场景中的泛化能力。为克服这些限制，我们引入了幻觉风险界，这是一种统一的理论框架，正式将幻觉风险分解为数据驱动和推理驱动的组件，分别与训练时的不匹配和推理时的不稳定性相关联。这为分析幻觉的出现和发展提供了原则性的基础。在此基础上，我们引入了HalluGuard，这是一种基于NTK的评分方法，利用NTK诱导的几何结构和捕获的表示来联合识别数据驱动和推理驱动的幻觉。我们在10个不同的基准上评估了HalluGuard，与11个竞争性基线和9个流行的LLM基础模型进行了比较，一致地实现了检测LLM幻觉的最新性能。

Summary / 总结

The paper addresses the issue of hallucinations in Large Language Models (LLMs) in high-stakes domains by introducing a unified theoretical framework called the Hallucination Risk Bound, which decomposes hallucination risk into data-driven and reasoning-driven components. Based on this framework, the authors developed HalluGuard, an NTK-based score that identifies both types of hallucinations. HalluGuard outperformed 11 competitive baselines across 10 diverse benchmarks and 9 popular LLM backbones, demonstrating its effectiveness in detecting various forms of LLM hallucinations.

论文针对大型语言模型（LLM）在高风险领域中的幻觉问题，引入了名为“幻觉风险界”的统一理论框架，将幻觉风险分解为数据驱动和推理驱动两个部分。基于此框架，提出了一个基于NTK的评分方法HalluGuard，用于同时识别这两种类型的幻觉。实验结果表明，HalluGuard在10个不同的基准测试和9个流行的LLM骨干模型上优于11个竞争性基线，能够更有效地检测各种形式的LLM幻觉。

Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour

Authors: Max Norris, Kobi Gal, Sahan Bulathwela

First: 2025-11-04T14:20:56+00:00 · Latest: 2026-01-26T18:20:30+00:00

Abstract

Modelling student knowledge is a key challenge when leveraging AI in education, with major implications for personalised learning. The Knowledge Tracing (KT) task aims to predict how students will respond to educational questions in learning environments, based on their prior interactions. Existing KT models typically use response correctness along with metadata like skill tags and timestamps, often overlooking the question text, which is an important source of pedagogical insight. This omission poses a lost opportunity while limiting predictive performance. We propose Next Token Knowledge Tracing (NTKT), a novel approach that reframes KT as a next-token prediction task using pretrained Large Language Models (LLMs). NTKT represents both student histories and question content as sequences of text, allowing LLMs to learn patterns in both behaviour and language. Our series of experiments significantly improves performance over state-of-the-art neural KT models and generalises much better to cold-start questions and users. These findings highlight the importance of question content in KT and demonstrate the benefits of leveraging pretrained representations of LLMs to model student learning more effectively.

中文标题/摘要

标题：下一词知识追踪：利用预训练大语言模型表示解码学生行为

在教育中利用AI建模学生知识是一项关键挑战，对个性化学习具有重大影响。知识追踪（KT）任务旨在根据学生之前的互动预测他们在学习环境中对教育问题的响应。现有KT模型通常使用响应正确性以及技能标签和时间戳等元数据，往往忽略了问题文本，这是一项重要的教学洞察来源。这种遗漏导致了预测性能的损失。我们提出了下一词知识追踪（NTKT），这是一种新颖的方法，将KT重新定义为使用预训练大语言模型（LLMs）的下一个词预测任务。NTKT将学生历史和问题内容表示为文本序列，使LLMs能够学习行为和语言中的模式。我们的系列实验显著提高了与最先进的神经KT模型相比的性能，并且在冷启动问题和用户方面有更好的泛化能力。这些发现突显了在KT中问题内容的重要性，并展示了利用预训练LLMs表示来更有效地建模学生学习的好处。

Capturing P: On the Expressive Power and Efficient Evaluation of Boolean Retrieval

Authors: Amir Aavani

First: 2026-01-26T18:07:40+00:00 · Latest: 2026-01-26T18:07:40+00:00

Abstract

Modern information retrieval is transitioning from simple document filtering to complex, neuro-symbolic reasoning workflows. However, current retrieval architectures face a fundamental efficiency dilemma when handling the rigorous logical and arithmetic constraints required by this new paradigm. Standard iterator-based engines (Document-at-a-Time) do not natively support complex, nested logic graphs; forcing them to execute such queries typically results in intractable runtime performance. Conversely, naive recursive approaches (Term-at-a-Time), while capable of supporting these structures, suffer from prohibitive memory consumption when enforcing broad logical exclusions. In this paper, we propose that a retrieval engine must be capable of "Capturing $\mathbf{P}$": evaluating any polynomial-time property directly over its index in a computationally efficient manner. We define a formal Retrieval Language ($\mathcal{L}_R$) based on Directed Acyclic Graphs (DAGs) and prove it precisely captures the complexity class $\mathbf{P}$. We introduce ComputePN, a novel evaluation algorithm that makes $\mathcal{L}_R$ tractable. By combining native DAG traversal with a memory-efficient "Positive-Negative" response mechanism, ComputePN ensures the efficient evaluation of any query in $\mathcal{L}_R$. This work establishes the theoretical foundation for turning the search index into a general-purpose computational engine.
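
One way to read the memory problem with broad exclusions is that a negated sub-query should never be materialized as "every document except a few". The sketch below evaluates a Boolean query DAG while carrying each intermediate result as a pair `(negated, docs)`, so a complement is stored as a flag plus a small set. This is an illustrative guess at what a positive/negative representation could look like, not the paper's ComputePN algorithm; the tuple encoding of nodes and the toy index are assumptions.

```python
def combine_and(left, right):
    """AND of two (negated, docs) results without materializing complements."""
    (neg_a, a), (neg_b, b) = left, right
    if not neg_a and not neg_b:
        return (False, a & b)
    if not neg_a and neg_b:
        return (False, a - b)
    if neg_a and not neg_b:
        return (False, b - a)
    return (True, a | b)  # NOT a AND NOT b == NOT (a OR b)


def evaluate(node, index, cache=None):
    """Evaluate a query DAG; shared sub-nodes (same object) are computed once."""
    cache = {} if cache is None else cache
    key = id(node)
    if key in cache:
        return cache[key]
    op = node[0]
    if op == "term":
        result = (False, frozenset(index.get(node[1], ())))
    elif op == "not":
        negated, docs = evaluate(node[1], index, cache)
        result = (not negated, docs)
    elif op == "and":
        result = combine_and(evaluate(node[1], index, cache),
                             evaluate(node[2], index, cache))
    elif op == "or":
        # De Morgan: a OR b == NOT (NOT a AND NOT b)
        (neg_a, a) = evaluate(node[1], index, cache)
        (neg_b, b) = evaluate(node[2], index, cache)
        negated, docs = combine_and((not neg_a, a), (not neg_b, b))
        result = (not negated, docs)
    else:
        raise ValueError(f"unknown operator: {op}")
    cache[key] = result
    return result


# Toy inverted index and a query that reuses the "rust" node twice.
INDEX = {"rust": {1, 2, 3}, "go": {3, 4}, "legacy": {2}}
RUST = ("term", "rust")
QUERY = ("and", ("or", RUST, ("term", "go")), ("not", ("and", RUST, ("term", "legacy"))))
```

Only a result that is still negated at the root needs the full document universe to be enumerated, and that step is left out here.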

TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

Authors: Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao, Lianhui Qin, Furong Huang, Bin Hu, Tianyi Zhou

First: 2026-01-26T18:04:54+00:00 · Latest: 2026-01-26T18:04:54+00:00

Abstract

Time series data is ubiquitous in real-world scenarios and crucial for critical applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve practical problems. However, this dimension is notably absent from existing benchmarks of generalist models. To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making; and ii) 15 tasks across the 4 dimensions that evaluate essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluated over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs within TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual representations of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models. Our code and dataset are available at https://tsrbench.github.io/.

中文标题/摘要

标题：TSRBench：全面的多任务多模态时间序列推理基准，用于通用模型

时间序列数据在现实场景中无处不在，并且对于从能源管理到交通控制等关键应用至关重要。因此，能够对时间序列进行推理是通用模型解决实际问题的一项基本技能。然而，这一维度在现有通用模型基准中明显缺失。为了弥合这一差距，我们引入了TSRBench，这是一个全面的多模态基准，旨在全面测试时间序列推理能力。TSRBench 特点包括：i) 来自14个领域的4125个多样化问题，按感知、推理、预测和决策制定4个主要维度分类；ii) 4个维度下的15项任务评估关键推理能力（例如，数值推理）。通过广泛的实验，我们在TSRBench中评估了超过30个领先的专有和开源LLM、VLM和TSLLM。我们的发现表明：i) 感知和推理遵循规模法则，但预测则不然；ii) 强大的推理并不保证准确的上下文感知预测，表明语义理解与数值预测之间存在脱节；iii) 尽管文本和视觉表示作为时间序列输入具有互补性，当前的多模态模型未能有效融合它们以实现相互性能提升。TSRBench 提供了一个标准化评估平台，不仅突显了现有挑战，还为推进通用模型提供了宝贵见解。我们的代码和数据集可在https://tsrbench.github.io/ 获取。

Summary / 总结

TSRBench is a comprehensive benchmark for evaluating the time series reasoning capabilities of generalist models across 14 domains. It includes 4125 problems and 15 tasks covering perception, reasoning, prediction, and decision-making. Experiments with over 30 leading models show that scaling laws hold for perception and reasoning but break down for prediction. Strong reasoning does not ensure accurate context-aware forecasting, and current multimodal models fail to combine textual and visual time series inputs in a way that improves performance.

TSRBench 是一个全面的基准，用于评估通用模型在14个领域中的时间序列推理能力，包含4125个问题和15项任务，集中在感知、推理、预测和决策制定上。实验表明，虽然感知和推理遵循规模法则，但预测却不遵循。强大的推理并不保证准确的上下文感知预测，而且当前的多模态模型难以有效结合文本和视觉时间序列数据以获得更好的性能。

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change

Authors: Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger

Venue: ICLR 2026

First: 2025-05-25T21:29:00+00:00 · Latest: 2026-01-26T18:01:53+00:00

Comments: 45 pages, 21 figures, ICLR 2026

Abstract

Ambivalence and hesitancy (A/H), two closely related constructs, are the primary reasons why individuals delay, avoid, or abandon health behaviour changes. A/H is a subtle and conflicting emotion that sets a person in a state between positive and negative orientations, or between acceptance and refusal to do something. It manifests as a discord in affect between multiple modalities or within a modality, such as facial and vocal expressions, and body language. Although experts can be trained to recognize A/H, as is done for in-person interactions, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital behaviour change interventions. However, no dataset currently exists for the design of machine learning models to recognize A/H. This paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset collected for multimodal recognition of A/H in videos. It contains 1,427 videos with a total duration of 10.60 hours captured from 300 participants across Canada answering predefined questions to elicit A/H. It is intended to mirror real-world online personalized behaviour change interventions. BAH is annotated by three experts to provide timestamps that indicate where A/H occurs, and frame- and video-level annotations with A/H cues. Video transcripts, cropped and aligned faces, and participants' meta-data are also provided. Since A and H manifest similarly in practice, we provide a binary annotation indicating the presence or absence of A/H. Additionally, this paper includes benchmarking results using baseline models on BAH for frame- and video-level recognition, zero-shot prediction, and personalization using source-free domain adaptation. The data, code, and pretrained weights are available.

中文标题/摘要

标题：BAH数据集：视频中数字行为改变中犹豫/矛盾识别

犹豫和矛盾（A/H）是个人推迟、避免或放弃健康行为改变的主要原因。这是一种微妙且矛盾的情绪，使人在正向和负向态度之间、接受和拒绝之间处于一种状态。它表现为情感在多种模态之间或同一模态内的不一致，如面部和语音表达、肢体语言。尽管专家可以被训练来识别A/H，如面对面互动中所做的那样，将其整合到数字健康干预措施中是成本高昂且效果较差的。因此，自动识别A/H对于数字行为改变干预措施的个性化和成本效益至关重要。然而，目前没有用于设计机器学习模型识别A/H的数据集。本文介绍了为视频中多模态A/H识别收集的Behavioral Ambivalence/Hesitancy (BAH)数据集。该数据集包含1,427个视频，总时长10.60小时，来自加拿大300名参与者回答预定义问题以引发A/H。它旨在模拟现实世界的在线个性化行为改变干预措施。BAH由三位专家注释，提供A/H发生的时间戳，以及帧级和视频级带有A/H线索的注释。还提供了视频转录、裁剪和对齐的脸部以及参与者元数据。由于A和H在实践中表现相似，我们提供了二元注释，表明A/H的存在或不存在。此外，本文还包括在BAH上使用基线模型进行帧级和视频级识别、零样本预测和个人化（使用无源域适应）的基准测试结果。数据、代码和预训练权重均可用。

Summary / 总结

This paper introduces the BAH dataset for recognizing ambivalence and hesitancy (A/H) in videos, a capability that matters for personalized, cost-effective digital health interventions. The dataset includes 1,427 videos (10.60 hours) from 300 participants across Canada, annotated by three experts with timestamps and frame- and video-level A/H cues. The paper also reports baseline benchmarking results for frame- and video-level recognition, zero-shot prediction, and personalization via source-free domain adaptation.

论文介绍了用于识别视频中矛盾和犹豫（A/H）的BAH数据集，对于数字健康干预至关重要。该数据集包含来自300名参与者的1,427个视频，通过多模态表达捕捉A/H。关键发现表明，该数据集可以有效支持识别A/H的机器学习模型，基准测试结果展示了模型在帧级和视频级识别、零样本预测和个人化方面的性能。数据、代码和预训练权重可供进一步研究使用。

SeNeDiF-OOD: Semantic Nested Dichotomy Fusion for Out-of-Distribution Detection Methodology in Open-World Classification. A Case Study on Monument Style Classification

Authors: Ignacio Antequera-Sánchez, Juan Luis Suárez-Díaz, Rosana Montes, Francisco Herrera

First: 2026-01-26T18:01:46+00:00 · Latest: 2026-01-26T18:01:46+00:00

Comments: 28 pages

Abstract

Out-of-distribution (OOD) detection is a fundamental requirement for the reliable deployment of artificial intelligence applications in open-world environments. However, addressing the heterogeneous nature of OOD data, ranging from low-level corruption to semantic shifts, remains a complex challenge that single-stage detectors often fail to resolve. To address this issue, we propose SeNeDiF-OOD, a novel methodology based on Semantic Nested Dichotomy Fusion. This framework decomposes the detection task into a hierarchical structure of binary fusion nodes, where each layer is designed to integrate decision boundaries aligned with specific levels of semantic abstraction. To validate the proposed framework, we present a comprehensive case study using MonuMAI, a real-world architectural style recognition system exposed to an open environment. This application faces a diverse range of inputs, including non-monument images, unknown architectural styles, and adversarial attacks, making it an ideal testbed for our proposal. Through extensive experimental evaluation in this domain, results demonstrate that our hierarchical fusion methodology significantly outperforms traditional baselines, effectively filtering these diverse OOD categories while preserving in-distribution performance.

中文标题/摘要

标题：SeNeDiF-OOD：开放世界分类中基于语义嵌套二分融合的分布外检测方法：以纪念物风格分类为例

分布外（OOD）检测是将人工智能应用可靠地部署在开放世界环境中的一项基本要求。然而，处理从低层级损坏到语义偏移的异构OOD数据，仍然是单阶段检测器往往难以解决的复杂挑战。为了解决这一问题，我们提出了一种基于语义嵌套二分融合的新方法SeNeDiF-OOD。该框架将检测任务分解为分层结构的二元融合节点，每一层都设计用于整合与特定语义抽象层次相一致的决策边界。为了验证该框架的有效性，我们使用MonuMAI，一个暴露在开放环境中的真实世界建筑风格识别系统，进行了全面的案例研究。该应用面临各种输入，包括非纪念物图像、未知建筑风格和对抗性攻击，使其成为我们方案的理想测试平台。通过在该领域的广泛实验评估，结果表明，我们的分层融合方法显著优于传统基线，有效地过滤了这些多样化的OOD类别，同时保持了分布内性能。

Summary / 总结

SeNeDiF-OOD is an OOD detection method that addresses the challenge of heterogeneous OOD data by decomposing the detection task into a hierarchical structure of binary fusion nodes, with each layer integrating decision boundaries at a specific level of semantic abstraction. The method was validated in a case study on MonuMAI, a real-world architectural style recognition system exposed to non-monument images, unknown architectural styles, and adversarial attacks. In that setting, SeNeDiF-OOD outperformed traditional baselines at filtering these OOD categories while preserving in-distribution performance.

SeNeDiF-OOD 是一种新颖的 OOD 检测方法，通过将检测任务分解为二元融合节点的层次结构来应对开放世界分类中异构 OOD 数据的挑战。该框架的每一层整合与特定语义抽象层次相对应的决策边界。该方法在 MonuMAI（一个真实世界的建筑风格识别系统）上的案例研究中得到了验证，结果表明它在过滤各种 OOD 类别方面优于传统基线，同时保持了分布内性能。
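
The abstract describes a hierarchy of binary fusion nodes, each handling one level of semantic abstraction. A minimal sketch of that general idea follows. It is not the authors' implementation: the mean-score fusion rule, the thresholds, the node names, and the `FusionNode` and `cascade` helpers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Sample = Dict[str, float]  # hypothetical: precomputed detector scores per input


@dataclass
class FusionNode:
    """One binary node: fuses several detector scores into an OOD decision."""
    name: str
    detectors: Sequence[Callable[[Sample], float]]
    threshold: float

    def is_ood(self, x: Sample) -> bool:
        # Assumed fusion rule: average the detector scores, then threshold.
        scores = [detector(x) for detector in self.detectors]
        return sum(scores) / len(scores) > self.threshold


def cascade(nodes: Sequence[FusionNode], x: Sample) -> str:
    """Run nodes from coarse to fine; return the first node that rejects x."""
    for node in nodes:
        if node.is_ood(x):
            return node.name
    return "in-distribution"
```

Ordering the nodes from coarse to fine means an input that is clearly off-domain is rejected early, and only inputs that pass every node reach the in-distribution classifier.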

Why Keep Your Doubts to Yourself? Trading Visual Uncertainties in Multi-Agent Bandit Systems

Authors: Jusheng Zhang, Yijia Fan, Kaitong Cai, Jing Yang, Jiawei Yao, Jian Wang, Guanlong Qu, Ziliang Chen, Keze Wang

Venue: ICLR 2026

First: 2026-01-26T17:58:53+00:00 · Latest: 2026-01-26T17:58:53+00:00

Comments: Accepted to ICLR 2026

Abstract

Vision-Language Models (VLMs) enable powerful multi-agent systems, but scaling them is economically unsustainable: coordinating heterogeneous agents under information asymmetry often spirals costs. Existing paradigms, such as Mixture-of-Agents and knowledge-based routers, rely on heuristic proxies that ignore costs and collapse uncertainty structure, leading to provably suboptimal coordination. We introduce Agora, a framework that reframes coordination as a decentralized market for uncertainty. Agora formalizes epistemic uncertainty into a structured, tradable asset (perceptual, semantic, inferential), and enforces profitability-driven trading among agents based on rational economic rules. A market-aware broker, extending Thompson Sampling, initiates collaboration and guides the system toward cost-efficient equilibria. Experiments on five multimodal benchmarks (MMMU, MMBench, MathVision, InfoVQA, CC-OCR) show that Agora outperforms strong VLMs and heuristic multi-agent strategies, e.g., achieving +8.5% accuracy over the best baseline on MMMU while reducing cost by over 3x. These results establish market-based coordination as a principled and scalable paradigm for building economically viable multi-agent visual intelligence systems.

中文标题/摘要

标题：为何将疑虑藏在心中？在多智能体老虎机（bandit）系统中交易视觉不确定性

视觉-语言模型（VLMs）使强大的多智能体系统成为可能，但将其扩展在经济上是不可持续的：在信息不对称下协调异构智能体往往会导致成本螺旋上升。现有的范式，如混合智能体（Mixture-of-Agents）和基于知识的路由器，依赖于忽略成本并使不确定性结构坍塌的启发式代理指标，从而导致可证明的次优协调。我们提出了Agora框架，将协调重新定义为一个去中心化的不确定性市场。Agora将认知不确定性形式化为一种结构化、可交易的资产（感知、语义、推理），并基于理性经济规则在智能体之间实施以盈利为驱动的交易。一个扩展了Thompson采样的市场感知经纪人负责发起协作，并引导系统走向成本高效的均衡。在五个多模态基准（MMMU、MMBench、MathVision、InfoVQA、CC-OCR）上的实验表明，Agora优于强大的VLMs和启发式多智能体策略，例如，在MMMU上比最佳基线的准确率高出8.5%，同时成本降低超过3倍。这些结果确立了基于市场的协调是一种有原则且可扩展的范式，可用于构建经济可行的多智能体视觉智能系统。

Summary / 总结

This paper addresses the economic inefficiency of coordinating multi-agent systems built on Vision-Language Models (VLMs) under information asymmetry. It introduces Agora, a framework that formalizes epistemic uncertainty as a structured, tradable asset (perceptual, semantic, inferential) and uses a market-aware broker, extending Thompson Sampling, to steer agents toward cost-efficient collaboration. Experiments on five multimodal benchmarks show that Agora outperforms strong VLMs and heuristic multi-agent strategies, for example +8.5% accuracy over the best baseline on MMMU at more than 3x lower cost.

论文针对在信息不对称下使用视觉语言模型（VLMs）协调异构智能体时的经济效率问题，提出了名为Agora的框架，将认知不确定性转化为可交易资产，并通过扩展了Thompson采样的市场感知经纪人促进成本高效的协作。实验结果显示，Agora在五个多模态基准上优于强大的VLMs和启发式策略，例如在MMMU上比最佳基线的准确率高出8.5%，同时成本降低超过3倍。
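
The abstract states that Agora's broker extends Thompson Sampling but gives no details. As background only, here is a minimal sketch of plain Beta-Bernoulli Thompson Sampling with a simple cost penalty. The class name `CostAwareThompsonBroker`, the linear `cost_weight` penalty, and the binary success signal are assumptions for illustration, not the paper's market mechanism.

```python
import random


class CostAwareThompsonBroker:
    """Beta-Bernoulli Thompson Sampling over agents, penalized by agent cost."""

    def __init__(self, costs, cost_weight=0.1, seed=0):
        self.costs = list(costs)
        self.cost_weight = cost_weight      # assumed linear cost penalty
        self.alpha = [1.0] * len(self.costs)  # prior pseudo-count of successes
        self.beta = [1.0] * len(self.costs)   # prior pseudo-count of failures
        self.rng = random.Random(seed)

    def select(self):
        # Sample a plausible success rate per agent, then subtract its cost.
        scores = [
            self.rng.betavariate(a, b) - self.cost_weight * c
            for a, b, c in zip(self.alpha, self.beta, self.costs)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, agent, success):
        # Bayesian update of the chosen agent's Beta posterior.
        if success:
            self.alpha[agent] += 1.0
        else:
            self.beta[agent] += 1.0
```

Because each selection draws from the posterior rather than using its mean, agents with few observations still get tried occasionally, which is the exploration behavior Thompson Sampling is known for.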

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Authors: Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover

First: 2026-01-26T17:56:50+00:00 · Latest: 2026-01-26T17:56:50+00:00

Comments: 13 pages

Abstract

Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 4-8x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.

中文标题/摘要

标题：自蒸馏推理器：面向大语言模型的同策略自蒸馏

知识蒸馏通过压缩教师大语言模型（LLM）的知识来训练较小的LLM，从而改善大语言模型的推理能力。同策略（on-policy）蒸馏进一步推进了这一方法：学生自行采样轨迹，同时教师LLM提供密集的词元级监督，从而解决了异策略（off-policy）蒸馏方法中训练与推理之间的分布不匹配问题。然而，同策略蒸馏通常需要一个单独的、往往更大的教师LLM，并且没有显式利用推理数据集中可用的标准答案。受“足够强大的LLM能够合理化外部特权推理轨迹并教导较弱的自己（即无法访问特权信息的版本）”这一直觉的启发，我们提出了同策略自蒸馏（OPSD），在该框架中，单个模型通过以不同上下文为条件，同时充当教师和学生。教师策略以特权信息（例如，经过验证的推理轨迹）为条件，而学生策略仅看到问题；训练在学生自身的采样轨迹上最小化这两个分布之间的逐词元散度。我们在多个数学推理基准上展示了该方法的有效性，与GRPO等强化学习方法相比，实现了4-8倍的词元效率，并且优于异策略蒸馏方法。

Summary / 总结

The research aims to improve large language model reasoning through On-Policy Self-Distillation (OPSD), in which a single model acts as both teacher and student. The teacher policy conditions on privileged information such as verified reasoning traces, the student policy sees only the question, and training minimizes the per-token divergence between the two distributions over the student's own rollouts. Experiments on mathematical reasoning benchmarks show that OPSD achieves 4-8x token efficiency compared to reinforcement learning methods such as GRPO and outperforms off-policy distillation methods.

研究旨在通过让单个模型同时充当教师和学生的同策略自蒸馏（OPSD）方法提高大语言模型的推理能力。教师以特权信息为条件，学生仅以问题为条件，训练在学生自身的采样轨迹上最小化两者分布之间的逐词元散度。实验结果显示，该方法在数学推理基准测试中相比强化学习方法（如GRPO）实现了4-8倍的词元效率，并优于异策略蒸馏方法。
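
The training signal described above, a per-token divergence between teacher and student distributions on the student's own rollout, can be sketched in a few lines. The abstract does not say which divergence or direction is used, so the forward KL below, the mean reduction, and the helper names `per_token_kl` and `opsd_loss_sketch` are assumptions; in practice both sets of logits would come from the same model under two different contexts.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized row."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at each position of one student rollout."""
    kls = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # teacher: conditioned on privileged information
        q = softmax(s_row)  # student: conditioned on the question only
        kls.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return kls


def opsd_loss_sketch(teacher_logits, student_logits):
    """Mean per-token divergence over the rollout (assumed reduction)."""
    kls = per_token_kl(teacher_logits, student_logits)
    return sum(kls) / len(kls)
```

Every token position contributes a loss term, which is what makes the supervision dense compared with a single scalar reward at the end of a rollout.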

Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge

Authors: Li Kang, Heng Zhou, Xiufeng Song, Rui Li, Bruno N. Y. Chen, Ziye Wang, Ximeng Meng, Stone Tao, Yiran Qin, Xiaohong Liu, Ruimao Zhang, Lei Bai, Yilun Du, Hao Su, Philip Torr, Zhenfei Yin, Ruihao Gong, Yejun Zeng, Fengjun Zhong, Shenghao Jin, Jinyang Guo, Xianglong Liu, Xiaojun Jia, Tianqi Shan, Wenqi Ren, Simeng Qin, Jialing Yang, Xiaoyu Ma, Tianxing Chen, Zixuan Li, Zijian Cai, Yan Qin, Yusen Qin, Qiangyu Chen, Kaixuan Wang, Zhaoming Han, Yao Mu, Ping Luo, Yuanqi Yao, Haoming Song, Jan-Nico Zaech, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool

Venue: NeurIPS 2025

First: 2026-01-26T17:56:19+00:00 · Latest: 2026-01-26T17:56:19+00:00

Comments: MARS Challenge @ NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI. Challenge page: https://mars-eai.github.io/MARS-Challenge-Webpage/

Abstract

Recent advancements in multimodal large language models and vision-language-action models have significantly driven progress in Embodied AI. As the field transitions toward more complex task scenarios, multi-agent system frameworks are becoming essential for achieving scalable, efficient, and collaborative solutions. This shift is fueled by three primary factors: increasing agent capabilities, enhancing system efficiency through task delegation, and enabling advanced human-agent interactions. To address the challenges posed by multi-agent collaboration, we propose the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE. The competition focuses on two critical areas: planning and control, where participants explore multi-agent embodied planning using vision-language models (VLMs) to coordinate tasks and policy execution to perform robotic manipulation in dynamic environments. By evaluating solutions submitted by participants, the challenge provides valuable insights into the design and coordination of embodied multi-agent systems, contributing to the future development of advanced collaborative AI systems.

中文标题/摘要

标题：多智能体机器人系统（MARS）挑战的进展与创新

近期多模态大型语言模型和视觉-语言-动作模型的进展显著推动了具身AI的发展。随着领域向更复杂的任务场景过渡，多智能体系统框架变得至关重要，以实现可扩展、高效和协作的解决方案。这一转变由三个主要因素推动：增强的智能体能力、通过任务委派提高系统效率以及实现高级的人机交互。为应对多智能体协作带来的挑战，我们提出了多智能体机器人系统（MARS）挑战，该挑战在NeurIPS 2025空间、视觉、语言和具身AI研讨会中举办。竞赛集中在两个关键领域：规划与控制，参赛者利用视觉-语言模型（VLMs）进行多智能体具身规划，以协调任务并执行机器人操作，以在动态环境中完成操作。通过评估参赛者提交的解决方案，挑战提供了有关具身多智能体系统设计和协调的宝贵见解，为未来先进协作AI系统的开发做出了贡献。

Summary / 总结

The motivation is to advance Embodied AI through multi-agent systems, addressing the need for scalable, efficient, and collaborative solutions in complex task scenarios. The paper presents the MARS Challenge, held at a NeurIPS 2025 workshop, which has two tracks: planning, where vision-language models coordinate multi-agent embodied tasks, and control, where policies execute robotic manipulation in dynamic environments. Evaluating the submitted solutions yields insights into the design and coordination of embodied multi-agent systems.

研究动机是通过多智能体系统推进具身人工智能，在复杂任务场景中实现可扩展、高效和协作的解决方案。论文介绍了在NeurIPS 2025研讨会上举办的MARS挑战，包含两个方向：规划（利用视觉-语言模型协调多智能体具身任务）和控制（通过策略执行在动态环境中完成机器人操作）。对参赛方案的评估为具身多智能体系统的设计和协调提供了见解。

One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment

Authors: Hongru Cai, Yongqi Li, Tiezheng Yu, Fengbin Zhu, Wenjie Wang, Fuli Feng, Wenjie Li

First: 2026-01-26T17:55:52+00:00 · Latest: 2026-01-26T17:55:52+00:00

Abstract

Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learning the process of preference adaptation. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user's reward model as a weighted combination of base reward functions, and optimize the initialization of these weights using a Model-Agnostic Meta-Learning (MAML)-style framework to support fast adaptation under limited feedback. To ensure robustness, we introduce the Robust Personalization Objective (RPO), which places greater emphasis on hard-to-learn users during meta optimization. Extensive experiments on personalized preference datasets validate that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.

中文标题/摘要

标题：适配任何需求：元奖励建模实现个性化大语言模型对齐

大语言模型（LLMs）的对齐旨在使模型输出与人类偏好一致，而个性化对齐则进一步使模型适应个体用户。这依赖于能够捕捉用户特定偏好并自动提供个性化反馈的个性化奖励模型。然而，开发这些模型面临两个关键挑战：个体用户的反馈稀缺以及需要高效适应未见过的用户。我们认为，解决这些限制需要从拟合数据以学习用户偏好转向学习偏好适应的过程。为实现这一目标，我们提出了元奖励建模（MRM），将个性化奖励建模重新表述为一个元学习问题。具体而言，我们将每个用户的奖励模型表示为基奖励函数的加权组合，并使用MAML（模型无关元学习）风格的框架优化这些权重的初始化，以在有限反馈下支持快速适应。为了确保鲁棒性，我们引入了鲁棒个性化目标（RPO），在元优化过程中更重视难以学习的用户。在个性化偏好数据集上的大量实验验证了MRM提高了少样本个性化能力、增强了用户鲁棒性，并且始终优于基线模型。

Summary / 总结

The paper addresses the challenge of aligning Large Language Models (LLMs) with individual user preferences by proposing Meta Reward Modeling (MRM). MRM reformulates personalized reward modeling as a meta-learning problem: each user's reward model is a weighted combination of base reward functions, and the initialization of those weights is optimized with a Model-Agnostic Meta-Learning (MAML)-style framework so that adaptation is fast under limited feedback. The Robust Personalization Objective (RPO) places greater emphasis on hard-to-learn users during meta optimization. Experiments on personalized preference datasets show that MRM improves few-shot personalization and user robustness and consistently outperforms baselines.

论文通过提出元奖励建模（MRM）来解决将大语言模型（LLMs）与个体用户偏好对齐的挑战。MRM 将个性化奖励建模重新表述为一个元学习问题，其中每个用户的奖励模型被表示为基奖励函数的加权组合。该方法使用模型无关元学习（MAML）风格的框架来优化这些权重的初始化，以在有限反馈下实现快速适应，并引入鲁棒个性化目标（RPO），在元优化过程中更重视难以学习的用户。实验表明，MRM 提升了少样本个性化能力，增强了用户鲁棒性，并且优于基线方法。
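
To make the "weighted combination of base reward functions" concrete, the sketch below adapts a shared weight initialization to one user from a handful of pairwise preferences. It covers only an inner adaptation loop; the outer meta-optimization and RPO are omitted. The Bradley-Terry preference loss, the learning rate, the step count, and the helper names `reward` and `adapt` are assumptions for illustration, not details taken from the paper.

```python
import math


def reward(weights, base_scores):
    """User reward: weighted sum of base reward function outputs."""
    return sum(w * s for w, s in zip(weights, base_scores))


def adapt(init_weights, preferences, lr=0.5, steps=5):
    """Few-shot adaptation of the weights for one user.

    preferences: list of (chosen_scores, rejected_scores), where each item
    holds the base reward function outputs for one response.
    """
    w = list(init_weights)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for chosen, rejected in preferences:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(wi * d for wi, d in zip(w, diff))
            # Gradient of -log(sigmoid(margin)) with respect to w.
            coeff = -(1.0 - 1.0 / (1.0 + math.exp(-margin)))
            for i, d in enumerate(diff):
                grad[i] += coeff * d / len(preferences)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w
```

Because only a small weight vector is adapted per user, a few preference pairs are enough to shift the reward toward that user's tastes; a MAML-style outer loop would then tune `init_weights` so that this shift works well across users.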

Physiology-Informed Generative Multi-Task Network for Contrast-Free CT Perfusion

Authors: Wasif Khan, John Rees, Kyle B. See, Simon Kato, Ziqian Huang, Amy Lazarte, Kyle Douglas, Xiangyang Lou, Teng J. Peng, Dhanashree Rajderkar, Pina Sanelli, Amita Singh, Ibrahim Tuna, Christina A. Wilson, Ruogu Fang

First: 2025-05-12T22:58:55+00:00 · Latest: 2026-01-26T17:55:25+00:00

Comments: Under Review

Abstract

Perfusion imaging is extensively utilized to assess hemodynamic status and tissue perfusion in various organs. Computed tomography perfusion (CTP) imaging plays a key role in the early assessment and planning of stroke treatment. While CTP provides essential perfusion parameters to identify abnormal blood flow in the brain, the use of contrast agents in CTP can lead to allergic reactions and adverse side effects, along with costing USD 4.9 billion worldwide in 2022. To address these challenges, we propose a novel deep learning framework called Multitask Automated Generation of Intermodal CT perfusion maps (MAGIC). This framework combines generative artificial intelligence and physiological information to map non-contrast computed tomography (CT) imaging to multiple contrast-free CTP imaging maps. We demonstrate enhanced image fidelity by incorporating physiological characteristics into the loss terms. Our network was trained and validated using CT image data from patients referred for stroke at UF Health and demonstrated robustness to abnormalities in brain perfusion activity. A double-blinded study was conducted involving seven experienced neuroradiologists and vascular neurologists. This study validated MAGIC's visual quality and diagnostic accuracy showing favorable performance compared to clinical perfusion imaging with intravenous contrast injection. Overall, MAGIC holds great promise in revolutionizing healthcare by offering contrast-free, cost-effective, and rapid perfusion imaging.

中文标题/摘要

标题：基于生理学指导的生成多任务网络用于无对比剂CT灌注

灌注成像被广泛用于评估不同器官的血流动力学状态和组织灌注。计算机断层扫描灌注（CTP）成像在中风治疗的早期评估和规划中起着关键作用。虽然CTP提供了识别脑部异常血流的关键灌注参数，但使用对比剂可能导致过敏反应和不良副作用，并且2022年在全球范围内的花费达49亿美元。为了解决这些挑战，我们提出了一种名为多任务自动生成跨模态CT灌注图（MAGIC）的新型深度学习框架。该框架结合了生成式人工智能和生理信息，将非对比剂CT成像映射到多个无对比剂的CTP成像图。通过将生理特征纳入损失项，我们展示了图像保真度的提升。我们的网络使用UF Health因中风转诊的患者的CT图像数据进行训练和验证，并展示了对脑灌注活动异常的鲁棒性。我们开展了一项双盲研究，涉及七名经验丰富的神经放射科医生和血管神经科医生。该研究验证了MAGIC的视觉质量和诊断准确性，与静脉注射对比剂的临床灌注成像相比表现良好。总体而言，MAGIC通过提供无对比剂、低成本和快速的灌注成像，有望革新医疗保健。

Summary / 总结

The research aims to address the challenges of using contrast agents in computed tomography perfusion (CTP) imaging, such as allergic reactions and high costs. The proposed MAGIC framework combines generative artificial intelligence and physiological information to generate multiple contrast-free CTP imaging maps from non-contrast CT images. The study demonstrated that incorporating physiological characteristics into the loss terms improved image fidelity and the network's robustness to brain perfusion abnormalities. A double-blinded study by seven experts validated MAGIC's visual quality and diagnostic accuracy, showing favorable performance compared to traditional CTP imaging. This work holds significant potential for cost-effective and rapid perfusion imaging without the use of contrast agents.

研究旨在解决CTP成像中使用对比剂带来的问题，如过敏反应和高昂成本。提出的MAGIC框架结合生成式人工智能和生理信息，从非对比剂CT图像生成多个无对比剂的CTP成像图。研究显示，将生理特征纳入损失项可以提高图像保真度。该网络使用临床数据进行了训练和验证，在由七名专家参与的双盲研究中，其视觉质量和诊断准确性与使用静脉对比剂注射的临床灌注成像相比表现良好。

Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale

Authors: Henry Bell, Caroline Zhang, Mohammed Mobasserul Haque, Dhaval Potdar, Samia Zaman, Brandon Fain

First: 2026-01-26T17:54:54+00:00 · Latest: 2026-01-26T17:54:54+00:00

Abstract

The constitutional framework of alignment aims to align large language models (LLMs) with value-laden principles written in natural language (such as to avoid using biased language). Prior work has focused on parameter fine-tuning techniques, such as reinforcement learning from human feedback (RLHF), to instill these principles. However, these approaches are computationally demanding, require careful engineering and tuning, and often require difficult-to-obtain human annotation data. We propose Reflect, an inference-time framework for constitutional alignment that does not require any training or data, providing a plug-and-play approach for aligning an instruction-tuned model to a set of principles. Reflect operates entirely in-context, combining (i) a constitution-conditioned base response with post-generation (ii) self-evaluation, (iii)(a) self-critique, and (iii)(b) final revision. Reflect's technique of explicit in-context reasoning over principles during post-generation outperforms standard few-shot prompting and provides transparent reasoning traces. Our results demonstrate that Reflect significantly improves LLM conformance to diverse and complex principles, including principles quite distinct from those emphasized in the model's original parameter fine-tuning, without sacrificing factual reasoning. Reflect is particularly effective at reducing the rate of rare but significant violations of principles, thereby improving safety and robustness in the tail end of the distribution of generations. Finally, we show that Reflect naturally generates useful training data for traditional parameter fine-tuning techniques, allowing for efficient scaling and the reduction of inference-time computational overhead in long-term deployment scenarios.

中文标题/摘要

标题：Reflect: 透明原则导向推理在大规模宪法对齐中的应用

宪法框架对齐旨在使大型语言模型（LLMs）与自然语言中写入的价值导向原则（如避免使用有偏见的语言）保持一致。先前的工作主要集中在参数微调技术上，例如基于人类反馈的强化学习（RLHF），以灌输这些原则。然而，这些方法计算需求大，需要精细的工程和调整，并且通常需要难以获取的人类注释数据。我们提出了一种名为Reflect的推理时框架，用于宪法对齐，该框架不需要任何训练或数据，提供了一种即插即用的方法，将指令调整模型与一组原则对齐。Reflect完全基于上下文运行，结合了(i)宪法条件下的基础响应与(ii)生成后的自我评估，(iii)(a)自我批判，和(iii)(b)最终修订。Reflect的技术在于生成后在上下文中显式地对原则进行推理，优于标准的少量示例提示，并提供了透明的推理痕迹。我们的结果表明，Reflect显著提高了LLM对多样且复杂的原则的遵从性，包括与模型原始参数微调中强调的原则截然不同的原则，同时不牺牲事实推理。Reflect特别有效地减少了原则的罕见但重要的违反率，从而在生成分布的尾端提高了安全性和鲁棒性。最后，我们展示了Reflect自然生成了传统参数微调技术的有用训练数据，允许在长期部署场景中高效扩展并减少推理时的计算开销。

Summary / 总结

The paper introduces Reflect, a framework for aligning large language models with value-laden principles during inference without requiring any training or data. It achieves this by incorporating explicit in-context reasoning over principles post-generation, which outperforms standard few-shot prompting. Reflect improves LLM conformance to diverse and complex principles, reduces rare but significant violations, and generates useful training data for traditional parameter fine-tuning techniques, enhancing safety and robustness in long-term deployment.

论文介绍了Reflect框架，该框架在推理阶段无需任何训练或数据即可使大型语言模型与价值导向的原则对齐。它首先在上下文中以原则为条件生成基础响应，随后进行自我评估、自我批判和最终修订。Reflect优于标准的少量示例提示，并提供了透明的推理痕迹。结果显示，Reflect显著提高了LLM对多样原则的遵从性，减少了罕见但严重的违规，增强了安全性和鲁棒性，并且可以生成用于参数微调技术的有用训练数据。
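
The four stages named in the abstract (base response, self-evaluation, self-critique, final revision) can be outlined as a small pipeline. This is a sketch under stated assumptions rather than the authors' implementation: `generate` is a hypothetical callable wrapping any instruction-tuned model, the prompt wording is invented, and skipping critique and revision when the evaluation reports no violation is an assumed control flow.

```python
from typing import Callable, Dict, List


def reflect_sketch(prompt: str, principles: List[str],
                   generate: Callable[[str], str]) -> Dict[str, str]:
    """Run the four stages in-context and return the full reasoning trace."""
    constitution = "\n".join(f"- {p}" for p in principles)
    trace: Dict[str, str] = {}

    # (i) Constitution-conditioned base response.
    trace["base"] = generate(
        f"Follow these principles:\n{constitution}\n\nUser: {prompt}\nResponse:")

    # (ii) Self-evaluation of the base response against the principles.
    trace["evaluation"] = generate(
        f"Principles:\n{constitution}\n\nResponse:\n{trace['base']}\n\n"
        "Does the response violate any principle? Answer VIOLATION or OK.")
    if "VIOLATION" not in trace["evaluation"].upper():
        trace["final"] = trace["base"]  # assumed early exit
        return trace

    # (iii)(a) Self-critique, then (iii)(b) final revision.
    trace["critique"] = generate(
        f"Principles:\n{constitution}\n\nResponse:\n{trace['base']}\n\n"
        "Explain which principles are violated and how.")
    trace["final"] = generate(
        f"Original response:\n{trace['base']}\n\nCritique:\n{trace['critique']}\n\n"
        "Rewrite the response so that it follows every principle.")
    return trace
```

Returning the whole trace instead of only the final text keeps the intermediate reasoning inspectable, and the (prompt, final) pairs could be logged as candidate fine-tuning data.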