Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
Authors: Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano
First: 2026-02-11T18:58:54+00:00 · Latest: 2026-02-11T18:58:54+00:00
Abstract
Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
中文标题/摘要
标题:数据重复优于数据扩展在长链推理微调中的应用
链式思考数据上的监督微调(SFT)是推理语言模型训练后的一个重要步骤。标准的机器学习直觉认为,使用更多的独特训练样本可以更好地泛化。出乎意料的是,我们展示了重复训练的好处:在固定更新预算下,多次在较小的数据集上训练比在较大数据集上单次训练表现更好。在AIME'24/25和GPQA基准测试中,Olmo3-7B在400个样本上训练128个epoch的表现比在51200个样本上单次训练的表现高出12-26个百分点,且没有出现灾难性遗忘。我们发现,训练标记准确率可以可靠地指示重复训练何时饱和;额外的epoch带来的改进在完全记忆化时趋于平缓,这一模式在所有设置中都是一致的。这些发现为推理微调提供了一种实用的方法,其中使用标记准确率作为停止标准的epoch扩展可以替代昂贵的无向数据扩展。我们提出了重复优势这一新问题,即完全记忆化与改进泛化相一致,作为社区理解大型语言模型训练动力学的新开放问题。
Summary / 总结
The study investigates the impact of data repetition versus data scaling in supervised fine-tuning for reasoning language models. It demonstrates that under a fixed update budget, training with more epochs on smaller datasets outperforms single-epoch training on larger datasets, achieving up to 26 percentage point improvements on benchmarks like AIME'24/25 and GPQA. The research finds that training token accuracy can indicate when repetition has saturated, suggesting a practical approach to replace expensive data scaling with controlled repetition.
研究挑战了传统观点,即更多的独特训练样本会带来更好的泛化能力。相反,它表明,在固定更新预算下,重复较小的数据集优于在更大数据集上进行单个epoch的训练。在AIME'24/25和GPQA基准测试上,Olmo3-7B在400个样本上训练128个epoch的表现优于在51200个样本上训练1个epoch,提高了12-26个百分点,且没有灾难性遗忘。研究还表明,训练令牌准确率可以指示重复何时饱和,这为基于令牌准确率作为停止标准来调整SFT的epoch数量提供了一种实用方法。
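The fixed-budget comparison in the entry above can be made concrete with a little arithmetic: 128 epochs on 400 samples and 1 epoch on 51200 samples both consume 51200 sample presentations. The sketch below shows that bookkeeping together with a token-accuracy stopping check. The function names and the 0.99 threshold are illustrative assumptions, not values taken from the paper.

```python
def epochs_for_budget(update_budget_samples: int, dataset_size: int) -> int:
    """Number of full passes that spends the same sample budget on a smaller dataset."""
    if dataset_size <= 0 or update_budget_samples % dataset_size != 0:
        raise ValueError("budget must be a positive multiple of the dataset size")
    return update_budget_samples // dataset_size


def token_accuracy(predicted_ids, target_ids):
    """Fraction of target tokens whose predicted token matches (teacher-forced)."""
    if len(predicted_ids) != len(target_ids) or not target_ids:
        raise ValueError("sequences must be non-empty and of equal length")
    hits = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return hits / len(target_ids)


def should_stop(accuracy_per_epoch, threshold=0.99):
    """Hypothetical rule: stop once training token accuracy reaches the threshold."""
    return bool(accuracy_per_epoch) and accuracy_per_epoch[-1] >= threshold


# 51200 sample presentations spent on a 400-sample dataset
print(epochs_for_budget(51200, 400))
```

In practice the accuracy would be computed over all training tokens after each epoch; the threshold would need tuning per setup.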
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
Authors: Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo
First: 2026-02-11T18:57:29+00:00 · Latest: 2026-02-11T18:57:29+00:00
Comments: Code: https://github.com/HKUST-C4G/diffusion-rm
Abstract
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
中文标题/摘要
标题:超越基于VLM的奖励:扩散原生潜在奖励建模
扩散和流匹配模型的偏好优化依赖于既具有判别鲁棒性又具有计算效率的奖励函数。视觉语言模型(VLMs)已经成为了主要的奖励提供者,利用其丰富的跨模态先验来引导对齐。然而,它们的计算和内存成本可能相当高,并且通过像素空间奖励优化潜在扩散生成器会引入领域不匹配,这使得对齐更加复杂。在本文中,我们提出了DiNa-LRM,这是一种直接在噪声扩散状态上进行偏好学习的扩散原生潜在奖励模型。我们的方法引入了一种噪声校准的Thurstone似然性,具有扩散噪声依赖的不确定性。DiNa-LRM 利用了一个预训练的潜在扩散主干,并带有时间步条件化的奖励头,支持推理时的噪声集成,提供了一种扩散原生的测试时扩展机制和鲁棒奖励机制。在图像对齐基准测试中,DiNa-LRM 显著优于现有的基于扩散的奖励基线,并且在计算成本仅为现有最佳VLMs的一小部分的情况下,实现了具有竞争力的性能。在偏好优化中,我们展示了DiNa-LRM 改进了偏好优化动力学,使其能够更快且更节省资源地进行模型对齐。
Summary / 总结
This paper addresses the challenge of preference optimization for diffusion and flow-matching models by proposing DiNa-LRM, a diffusion-native latent reward model. It formulates preference learning directly on noisy diffusion states, using a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a lower computational cost. In preference optimization, DiNa-LRM improves the dynamics, enabling faster and more resource-efficient model alignment.
本文提出了一种扩散本征的潜在奖励模型DiNa-LRM,以解决扩散模型的偏好优化问题。该模型直接在噪声扩散状态上进行偏好学习,并引入了噪声校准的Thurstone似然。实验表明,DiNa-LRM在较低的计算成本下,超过了现有的基于扩散的奖励基线,并且其性能与最先进的视觉语言模型相当。此外,它还改善了偏好优化的动力学,使得模型对齐更快且更节省资源。
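To illustrate the idea of a Thurstone likelihood whose uncertainty depends on the diffusion noise level, here is a minimal sketch. It assumes two independent Gaussian utilities with a shared standard deviation that grows with the noise level t; the linear schedule in `noise_sigma`, its constants, and the simple averaging in `ensemble_reward` are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def noise_sigma(t, sigma_min=0.5, sigma_max=2.0):
    """Hypothetical schedule: uncertainty grows linearly with noise level t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return sigma_min + (sigma_max - sigma_min) * t


def thurstone_preference_prob(reward_win, reward_lose, t):
    """P(winner preferred) under a Thurstone model with noise-dependent sigma."""
    sigma = noise_sigma(t)
    # The difference of two independent N(., sigma^2) utilities has std sigma * sqrt(2).
    return normal_cdf((reward_win - reward_lose) / (math.sqrt(2.0) * sigma))


def thurstone_nll(reward_win, reward_lose, t):
    """Negative log-likelihood of one observed preference pair."""
    return -math.log(thurstone_preference_prob(reward_win, reward_lose, t))


def ensemble_reward(reward_fn, latent, timesteps):
    """Average a timestep-conditioned reward over several noise levels."""
    return sum(reward_fn(latent, t) for t in timesteps) / len(timesteps)
```

With this form, the same reward gap yields a less confident preference at high noise, so heavily noised pairs contribute a softer training signal.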
GENIUS: Generative Fluid Intelligence Evaluation Suite
Authors: Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, Wentao Zhang
First: 2026-02-11T18:55:54+00:00 · Latest: 2026-02-11T18:55:54+00:00
Abstract
Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce GENIUS (GEN Fluid Intelligence EvalUation Suite). We formalize GFI as a synthesis of three primitives. These include Inducing Implicit Patterns (e.g., inferring personalized visual preferences), Executing Ad-hoc Constraints (e.g., visualizing abstract metaphors), and Adapting to Contextual Knowledge (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, GENIUS establishes a rigorous standard for GFI, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: https://github.com/arctanxarc/GENIUS.
中文标题/摘要
标题:GENIUS:生成流体智能评估套件
统一多模态模型(UMMs)在视觉生成方面取得了显著进展。然而,现有的基准测试主要评估的是$\textit{晶体智力}$,这依赖于回忆积累的知识和学习的模式。这种关注忽视了$\textit{生成流体智力(GFI)}$:即在不断变化的情境中诱导模式、通过约束进行推理和适应新场景的能力。为了严格评估这种能力,我们引入了$\textbf{GENIUS}$($\textbf{G}$ 生成$\textbf{E}$ 流体$\textbf{I}$ 智能$\textbf{U}$ 评估$\textbf{S}$ 套件)。我们将$\textit{GFI}$ 形式化为三个基本要素的综合。这些包括$\textit{诱导隐含模式}$(例如,推断个性化的视觉偏好),$\textit{执行临时约束}$(例如,可视化抽象的隐喻),以及$\textit{适应上下文知识}$(例如,模拟反直觉的物理现象)。这些基本要素共同挑战模型解决完全基于当前上下文的问题。我们对12个代表性模型的系统评估揭示了这些任务中的显著性能缺陷。至关重要的是,我们的诊断分析将这些失败模式分离出来。它表明缺陷源于对上下文理解的有限性,而不是生成能力的不足。为了弥合这一差距,我们提出了一种无需训练的注意力干预策略。最终,$\textbf{GENIUS}$ 为$\textit{GFI}$ 设立了一个严格的标准,引导该领域从知识利用转向动态、通用的推理。我们的数据集和代码将在:$\href{https://github.com/arctanxarc/GENIUS}{https://github.com/arctanxarc/GENIUS}$ 发布。
Summary / 总结
The research introduces GENIUS, a suite designed to evaluate Generative Fluid Intelligence (GFI) in models, which is the ability to induce patterns, reason through constraints, and adapt to novel scenarios. By formalizing GFI into three primitives—inducing implicit patterns, executing ad-hoc constraints, and adapting to contextual knowledge—the study systematically evaluates 12 models and finds significant performance deficits. The analysis shows these deficits are due to limited context comprehension rather than intrinsic generative capability. The study proposes an attention intervention strategy to address this issue and establishes a rigorous standard for GFI evaluation, guiding future research towards dynamic, general-purpose reasoning.
研究引入了GENIUS套件,旨在评估模型的生成流体智能(GFI),重点关注其在模式诱导、约束推理和适应新场景方面的能力。研究将GFI形式化为三个基本要素,并评估了12个模型,揭示了由于上下文理解有限而导致的重大性能缺陷。提出了一种无需训练的注意力干预策略来解决这些问题。
Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning via Normalizing Flows
Authors: Shaswat Garg, Matin Moezzi, Brandon Da Silva
First: 2026-02-11T18:54:48+00:00 · Latest: 2026-02-11T18:54:48+00:00
Comments: 9 pages, 3 figures, IEEE International Conference on Robotics and Automation 2026
Abstract
Hierarchical goal-conditioned reinforcement learning (H-GCRL) provides a powerful framework for tackling complex, long-horizon tasks by decomposing them into structured subgoals. However, its practical adoption is hindered by poor data efficiency and limited policy expressivity, especially in offline or data-scarce regimes. This work introduces normalizing flow-based hierarchical implicit Q-learning (NF-HIQL), a novel framework that replaces unimodal Gaussian policies with expressive normalizing flow policies at both the high and low levels of the hierarchy. This design enables tractable log-likelihood computation, efficient sampling, and the ability to model rich multimodal behaviors. New theoretical guarantees are derived, including explicit KL-divergence bounds for real-valued non-volume preserving (RealNVP) policies and PAC-style sample efficiency results, showing that NF-HIQL preserves stability while improving generalization. Empirically, NF-HIQL is evaluated across diverse long-horizon tasks in locomotion, ball-dribbling, and multi-step manipulation from OGBench. NF-HIQL consistently outperforms prior goal-conditioned and hierarchical baselines, demonstrating superior robustness under limited data and highlighting the potential of flow-based architectures for scalable, data-efficient hierarchical reinforcement learning.
中文标题/摘要
标题:基于归一化流的数据高效分层目标导向强化学习
分层目标导向强化学习(H-GCRL)提供了一种强大的框架,用于通过将复杂、长期的任务分解为结构化的子目标来应对挑战。然而,其实际应用受到数据效率差和策略表达能力有限的阻碍,尤其是在离线或数据稀缺的环境中。本文提出了一种新的框架——基于归一化流的分层隐式Q学习(NF-HIQL),该框架在分层的高低层中用表达性强的归一化流策略替代了单模高斯策略。这种设计使得对数似然计算变得可行,提高了采样效率,并能够建模丰富的多模态行为。推导了新的理论保证,包括实值非体积保持(RealNVP)策略的显式KL散度界和PAC风格的数据效率结果,表明NF-HIQL在保持稳定性的同时提高了泛化能力。实验上,NF-HIQL在OGBench中的行走、带球和多步操作等长期任务中进行了评估。NF-HIQL在所有任务中都优于先前的目标导向和分层基线,展示了在数据有限条件下更强的鲁棒性,并突显了基于流的架构在可扩展、数据高效分层强化学习中的潜力。
Summary / 总结
This paper addresses the limitations of hierarchical goal-conditioned reinforcement learning (H-GCRL) in terms of data efficiency and policy expressivity. It introduces NF-HIQL, which uses normalizing flow policies to enhance the expressivity of the model while maintaining computational efficiency. The framework improves generalization and stability, and empirical evaluations on various tasks show that NF-HIQL outperforms existing methods, particularly in data-scarce environments.
该论文针对层级目标条件强化学习(H-GCRL)在数据效率和策略表达性方面的局限性,提出了NF-HIQL框架,使用流形策略增强层级策略的表达能力。理论保证和实验评估表明,NF-HIQL在数据稀缺环境中提高了泛化能力和鲁棒性,特别是在长时任务如运动和操作方面超越了先前的方法。
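The tractable log-likelihood and efficient sampling mentioned in the entry above both follow from the structure of a RealNVP affine coupling layer. The sketch below shows one such layer on a two-dimensional input in plain Python. The tiny `scale_and_shift` function stands in for a learned network and its weights are hypothetical; this is a generic RealNVP illustration, not the NF-HIQL policy architecture.

```python
import math


def scale_and_shift(x1, w_s=0.5, w_t=1.5):
    """Hypothetical tiny 'network': maps the conditioning input to log-scale s and shift t."""
    return math.tanh(w_s * x1), w_t * x1


def coupling_forward(x1, x2):
    """Affine coupling: x1 passes through, x2 is scaled and shifted conditioned on x1."""
    s, t = scale_and_shift(x1)
    y1 = x1
    y2 = x2 * math.exp(s) + t
    return y1, y2, s  # the Jacobian is triangular, so log|det J| = s


def coupling_inverse(y1, y2):
    """Exact inverse, available in closed form because y1 == x1."""
    s, t = scale_and_shift(y1)
    return y1, (y2 - t) * math.exp(-s)


def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def sample_log_prob(z1, z2):
    """Push a base sample through the flow; return the output and its log-density."""
    y1, y2, log_det = coupling_forward(z1, z2)
    base = standard_normal_logpdf(z1) + standard_normal_logpdf(z2)
    # Change of variables: log p_Y(y) = log p_Z(z) - log|det J|
    return (y1, y2), base - log_det
```

Stacking several such layers, alternating which coordinate is held fixed, gives a multimodal density while keeping both sampling and likelihood evaluation to a single pass.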
AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models
Authors: R E Zera Marveen Lyngkhoi, Chirag Chawla, Pratinav Seth, Utsav Avaiya, Soham Bhattacharjee, Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu
First: 2026-02-10T10:08:51+00:00 · Latest: 2026-02-11T18:51:19+00:00
Comments: Library is open source and available at https://github.com/Lexsi-Labs/aligntune
Abstract
Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
中文标题/摘要
标题:AlignTune:大型语言模型后训练对齐的模块化工具包
后训练对齐是部署大型语言模型(LLMs)的核心,但实际工作流仍然分散在特定后端工具和临时粘合代码之间,使得实验难以复现。我们确定后端干扰、奖励碎片化和不可复现的管道是对齐研究中的关键障碍。我们引入了AlignTune,这是一种模块化工具包,提供了一个统一接口用于监督微调(SFT)和RLHF风格优化,支持可互换的TRL和Unsloth后端。AlignTune标准化了配置,提供了可扩展的奖励层(基于规则和学习),并集成了标准基准和自定义任务的评估。通过将后端特定逻辑隔离在单一工厂边界后面,AlignTune使可控比较和可复现对齐实验成为可能。
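The "single factory boundary" described in the entry above is a general design pattern: callers build a trainer from a configuration, and only the factory knows which backend is behind it. The sketch below shows that pattern with stub trainers. Every name here (`TrainConfig`, `register_backend`, `create_trainer`, the stub classes) is hypothetical and does not reflect AlignTune's actual API; no TRL or Unsloth code is imported.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class Trainer(Protocol):
    def train(self) -> str: ...


@dataclass
class TrainConfig:
    backend: str      # which backend to use, e.g. "trl" or "unsloth"
    method: str       # training method, e.g. "sft"
    model_name: str


# Hypothetical registry mapping backend names to trainer factories.
_REGISTRY: Dict[str, Callable[[TrainConfig], Trainer]] = {}


def register_backend(name: str):
    """Decorator that registers a factory under a backend name."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator


@dataclass
class _StubTrainer:
    """Stand-in for a real backend trainer; only echoes what it would run."""
    config: TrainConfig
    label: str

    def train(self) -> str:
        return f"{self.label}:{self.config.method}:{self.config.model_name}"


@register_backend("trl")
def _make_trl(config: TrainConfig) -> Trainer:
    return _StubTrainer(config, "trl")


@register_backend("unsloth")
def _make_unsloth(config: TrainConfig) -> Trainer:
    return _StubTrainer(config, "unsloth")


def create_trainer(config: TrainConfig) -> Trainer:
    """The single boundary: callers never import backend-specific code."""
    try:
        factory = _REGISTRY[config.backend]
    except KeyError:
        raise ValueError(f"unknown backend: {config.backend!r}") from None
    return factory(config)
```

Because the configuration is the only thing that changes between runs, swapping `backend="trl"` for `backend="unsloth"` keeps the rest of an experiment identical, which is what makes a controlled comparison possible.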
TabICLv2: A better, faster, scalable, and open tabular foundation model
Authors: Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan
First: 2026-02-11T18:51:02+00:00 · Latest: 2026-02-11T18:51:02+00:00
Abstract
Tabular foundation models, such as TabPFNv2 and TabICL, have recently dethroned gradient-boosted trees at the top of predictive benchmarks, demonstrating the value of in-context learning for tabular data. We introduce TabICLv2, a new state-of-the-art foundation model for regression and classification built on three pillars: (1) a novel synthetic data generation engine designed for high pretraining diversity; (2) various architectural innovations, including a new scalable softmax in attention improving generalization to larger datasets without prohibitive long-sequence pretraining; and (3) optimized pretraining protocols, notably replacing AdamW with the Muon optimizer. On the TabArena and TALENT benchmarks, TabICLv2 without any tuning surpasses the performance of the current state of the art, RealTabPFN-2.5 (hyperparameter-tuned, ensembled, and fine-tuned on real data). With only moderate pretraining compute, TabICLv2 generalizes effectively to million-scale datasets under 50GB GPU memory while being markedly faster than RealTabPFN-2.5. We provide extensive ablation studies to quantify these contributions and commit to open research by first releasing inference code and model weights at https://github.com/soda-inria/tabicl, with synthetic data engine and pretraining code to follow.
中文标题/摘要
标题:TabICLv2:更好的、更快的、可扩展的和开放的表格基础模型
表格基础模型,如TabPFNv2和TabICL,最近在预测基准测试中取代了梯度提升树,展示了上下文学习对表格数据的价值。我们介绍了TabICLv2,这是一种新的最先进的回归和分类基础模型,基于三个支柱:(1)一种新型的合成数据生成引擎,旨在实现高预训练多样性;(2)各种架构创新,包括一种新的可扩展的注意力softmax,改进了对大型数据集的一般化,无需进行禁止性的长序列预训练;(3)优化的预训练协议,显著地用Muon优化器取代了AdamW。在TabArena和TALENT基准测试中,TabICLv2在没有任何调整的情况下超越了当前最先进的模型RealTabPFN-2.5(超参数调整、集成和在真实数据上微调)。仅在中等预训练计算下,TabICLv2在50GB GPU内存下有效泛化到百万规模的数据集,同时比RealTabPFN-2.5更快。我们提供了详尽的消融研究来量化这些贡献,并承诺开放研究,首先在https://github.com/soda-inria/tabicl发布推理代码和模型权重,合成数据引擎和预训练代码将随后发布。
Weight Decay Improves Language Model Plasticity
Authors: Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade
First: 2026-02-11T18:49:26+00:00 · Latest: 2026-02-11T18:49:26+00:00
Abstract
The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and casts light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
中文标题/摘要
标题:权重衰减提高语言模型的可塑性
大型语言模型(LLM)开发的主流范式是在预训练一个基础模型后,进行进一步训练以提高性能和模型行为。然而,超参数优化和缩放法则主要从基础模型的验证损失角度进行研究,忽略了下游适应性。在本文中,我们从模型可塑性的角度研究预训练,即基础模型通过微调成功适应下游任务的能力。我们重点关注预训练期间关键正则化参数权重衰减的作用。通过系统实验,我们表明使用较大权重衰减值训练的模型在下游任务微调时表现出更大的性能提升,这意味着它们具有更大的可塑性。这一现象可能导致反直觉的权衡,即在预训练后表现较差的基础模型在微调后可能表现更好。进一步研究权重衰减对模型行为的机制影响揭示了它鼓励线性可分表示、正则化注意力矩阵并减少对训练数据的过拟合。总之,本文证明了使用交叉熵损失之外的评估指标进行超参数优化的重要性,并揭示了单一优化超参数在塑造模型行为方面所扮演的多面角色。
Summary / 总结
This study investigates the role of weight decay in enhancing the plasticity of large language models, focusing on the model's ability to adapt to downstream tasks after pretraining. Through systematic experiments, the research shows that models trained with higher weight decay values exhibit greater plasticity, leading to better performance gains during fine-tuning. This finding suggests that optimizing beyond traditional validation loss metrics can improve model adaptability and highlights the complex impact of weight decay on model behavior.
研究探讨了权重衰减在增强大型语言模型适应性方面的作用,重点关注模型在预训练后对下游任务的适应能力。通过系统实验,研究显示,使用较高权重衰减值训练的模型在微调时表现出更大的适应性,从而获得更好的性能提升。这一发现表明,优化超越传统交叉熵损失的评估指标可以提高模型的适应性,并突显了权重衰减对模型行为的复杂影响。
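For readers who want to see where the weight decay hyperparameter enters an update, the sketch below implements one optimizer step with decoupled weight decay in the style of AdamW, on a plain list of scalar parameters. This is an assumption for illustration: the abstract does not state which optimizer or which form of weight decay was used, and the default values here are hypothetical.

```python
import math


def adamw_step(params, grads, m, v, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.1):
    """One AdamW-style update; returns new (params, m, v). `step` starts at 1."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = m_i / (1.0 - beta1 ** step)
        v_hat = v_i / (1.0 - beta2 ** step)
        # Decoupled decay: shrink the weight directly, independent of the gradient.
        p = p - lr * weight_decay * p
        # Adaptive gradient step.
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

With a zero gradient the only effect is the multiplicative shrink by `1 - lr * weight_decay`, which makes the regularizing pull toward zero easy to isolate.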
Proficient Graph Neural Network Design by Accumulating Knowledge on Large Language Models
Authors: Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou
Venue: WSDM 2026
First: 2024-08-13T08:22:01+00:00 · Latest: 2026-02-11T18:49:00+00:00
Comments: Accepted at WSDM 2026. Title changed from "Computation-friendly graph neural network design by accumulating knowledge on large language models" to "Proficient Graph Neural Network Design by Accumulating Knowledge on Large Language Models"
Abstract
High-level automation is increasingly critical in AI, driven by rapid advances in large language models (LLMs) and AI agents. However, LLMs, despite their general reasoning power, struggle significantly in specialized, data-sensitive tasks such as designing Graph Neural Networks (GNNs). This difficulty arises from (1) the inherent knowledge gaps in modeling the intricate, varying relationships between graph properties and suitable architectures and (2) the external noise from misleading descriptive inputs, often resulting in generic or even misleading model suggestions. Achieving proficiency in designing data-aware models -- defined as the meta-level capability to systematically accumulate, interpret, and apply data-specific design knowledge -- remains challenging for existing automated approaches, due to their inefficient construction and application of meta-knowledge. To achieve meta-level proficiency, we propose DesiGNN, a knowledge-centered framework that systematically converts past model design experience into structured, fine-grained knowledge priors well-suited for meta-learning with LLMs. To account for the inherent variability and external noise, DesiGNN aligns empirical property filtering from extensive benchmarks with adaptive elicitation of literature insights via LLMs. By constructing a solid meta-knowledge between unseen graph understanding and known effective architecture patterns, DesiGNN can deliver top-5.77% initial model proposals for unseen datasets within seconds and achieve consistently superior performance with minimal search cost compared to baselines.
中文标题/摘要
标题:通过在大型语言模型上积累知识来熟练设计图神经网络
高级自动化在AI中越来越关键,受到大型语言模型(LLMs)和AI代理快速进步的推动。然而,尽管LLMs具有广泛的推理能力,但在设计图神经网络(GNNs)等专门的数据敏感任务上仍面临巨大挑战。这种困难源于(1)建模图属性与合适架构之间复杂多变关系的知识缺口,以及(2)来自误导性描述输入的外部噪音,通常导致通用甚至误导性的模型建议。实现数据感知模型的专业设计——定义为元层面的能力,能够系统地积累、解释和应用数据特定的设计知识——对于现有自动化方法来说仍然是一个挑战,因为它们在元知识的构建和应用上效率低下。为了实现元层面的专业性,我们提出了一种以知识为中心的框架DesiGNN,该框架系统地将过去的模型设计经验转化为结构化、细粒度的知识先验,这些先验非常适合与LLMs进行元学习。为了应对固有的变异性与外部噪音,DesiGNN将广泛的基准测试中的经验属性过滤与通过LLMs适应性地提取文献见解相结合。通过在未见图的理解与已知有效架构模式之间建立坚实的元知识,DesiGNN可以在几秒内为未见数据集提供排名前5.77%的初始模型提案,并且与基线相比,具有最小的搜索成本,实现了一致的优越性能。
Summary / 总结
The research aims to address the challenge of designing efficient and data-aware Graph Neural Networks (GNNs) using high-level automation, particularly leveraging the reasoning power of large language models (LLMs). DesiGNN, a knowledge-centered framework, converts past design experiences into structured knowledge priors, aligning empirical property filtering with adaptive literature insights from LLMs. This approach enables DesiGNN to propose top-5.77% initial model designs for unseen datasets within seconds, outperforming baseline methods with minimal search cost.
该论文通过利用大型语言模型(LLMs)的知识来解决设计高效图神经网络(GNNs)的挑战。方法DesiGNN将过往的设计经验转化为结构化的知识先验,以增强元学习。它通过对接经验属性过滤与适应性文献提取来处理变异性与外部噪声。实验表明,DesiGNN可以在几秒内为未见数据集生成排名前5.77%的初始模型提案,并且与基线相比具有最小的搜索成本且性能更优。
FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight
Authors: Jiayi Zhou, Yang Sheng, Hantao Lou, Yaodong Yang, Jie Fu
First: 2026-02-11T18:48:11+00:00 · Latest: 2026-02-11T18:48:11+00:00
Comments: 27 pages
Abstract
As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing FormalJudge, a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 satisfiability modulo theories (SMT) solving, which produces mathematical guarantees rather than probabilistic scores. We validate FormalJudge across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that FormalJudge achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.
中文标题/摘要
标题:FormalJudge:一种代理监督的神经符号范式
随着基于LLM的代理在具有实际后果的高风险领域中越来越多地运行,确保其行为安全变得至关重要。占主导地位的监督范式,LLM作为法官,面临着一个根本性的困境:概率系统如何可靠地监督其他概率系统而不继承其失败模式?我们认为形式化验证提供了一种从困境中解脱出来的原则性方法,但其采用受到了一个关键瓶颈的阻碍:自然语言需求到形式化规范的翻译。本文通过提出一种神经符号框架来弥合这一缺口,该框架采用双向Formal-of-Thought架构:LLM作为规范编译器,自上而下地将高层次的人类意图分解为可验证的原子约束,然后自下而上地使用Dafny规范和Z3可满足性模理论求解来证明合规性,从而产生数学保证而非概率分数。我们在三个基准测试中进行了验证,涵盖行为安全、多域约束遵守和代理向上欺骗检测。针对7个代理模型的实验表明,相比LLM作为法官基准,实现了平均16.6%的改进,使弱到强的泛化成为可能,其中7B法官能够检测72B代理的欺骗,准确率超过90%,并通过迭代细化提供了接近线性的安全改进。
Summary / 总结
FormalJudge is a neuro-symbolic framework designed to ensure the behavioral safety of LLM-based agents in high-stakes domains. It addresses the limitations of the LLM-as-a-Judge paradigm by translating natural language requirements into formal specifications using a bidirectional Formal-of-Thought architecture. Experiments show that FormalJudge outperforms LLM-as-a-Judge baselines by an average of 16.6%, enables weak-to-strong generalization, and provides near-linear safety improvements through iterative refinement.
FormalJudge 是一个神经符号框架,旨在提高 LLM 基础代理在高风险领域的行为安全性。它通过双向 Formal-of-Thought 架构解决将自然语言要求转换为正式规范的挑战。实验表明,FormalJudge 在安全性上比 LLM-as-a-Judge 基线提高了 16.6%,实现了从弱到强的泛化,并通过迭代优化提供了接近线性的安全性改进。
Just on Time: Token-Level Early Stopping for Diffusion Language Models
Authors: Zahar Kohut, Severyn Shykula, Dmytro Khamula, Mykola Vysotskyi, Taras Rumezhak, Volodymyr Karpiv
First: 2026-02-11T18:44:04+00:00 · Latest: 2026-02-11T18:44:04+00:00
Comments: Under review
Abstract
Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. Our method leverages lightweight signals derived from the model's predictions and local context to dynamically determine when individual tokens can be finalized. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required. Across diverse benchmarks, spanning mathematical reasoning, general question answering, and scientific understanding, our approach achieves state-of-the-art efficiency gains while preserving generation quality.
中文标题/摘要
标题:恰到好处:扩散语言模型的标记级早期停止
扩散语言模型通过迭代细化生成文本,这一过程通常计算效率低下,因为许多标记在最终去噪步骤之前就已经稳定。我们提出了一种无需训练的标记级早期停止方法,能够在每个位置独立识别收敛。该方法利用模型预测和局部上下文中的轻量级信号,动态确定何时可以最终确定个别标记。这实现了适应性的标记级冻结,无需针对特定任务的微调,大幅减少了所需的扩散步骤总数。在涵盖数学推理、通用问答和科学理解等多种基准测试中,我们的方法在保持生成质量的同时实现了最先进的效率提升。
Summary / 总结
The research aims to improve the computational efficiency of diffusion language models by introducing a token-level early stopping method. This method identifies when individual tokens have reached stability and can be finalized without task-specific fine-tuning. The approach uses lightweight signals from the model's predictions and local context to dynamically determine when to stop refining each token. Experiments across various benchmarks show that this method achieves significant efficiency gains while maintaining generation quality.
研究旨在通过引入基于令牌的早期停止方法提高扩散语言模型的计算效率。该方法能够识别何时单个令牌达到稳定状态并可以被最终确定,无需特定任务的微调。实验结果显示,该方法在数学推理、通用问答和科学理解等多种基准上减少了所需扩散步骤的数量,同时保持了生成质量。
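The per-token freezing idea in the entry above can be sketched with a simple rule: finalize a position once its predicted token has stayed the same for a few consecutive denoising steps and the model's confidence is high. This criterion, along with the `patience` and `min_conf` parameters, is a hypothetical stand-in; the abstract only says the method uses lightweight signals from the model's predictions and local context.

```python
def update_frozen(history, confidences, frozen, patience=2, min_conf=0.9):
    """Return the set of positions that can be finalized after the latest step.

    history:     list of per-step predictions, each a list of token ids
    confidences: max predicted probability per position at the latest step
    frozen:      positions already finalized in earlier steps
    """
    result = set(frozen)
    latest = history[-1]
    for pos, token in enumerate(latest):
        if pos in result:
            continue
        # Predictions for this position over the last `patience` steps.
        recent = [step[pos] for step in history[-patience:]]
        stable = len(recent) == patience and all(tok == token for tok in recent)
        if stable and confidences[pos] >= min_conf:
            result.add(pos)
    return result


# Three denoising steps over a three-token sequence.
history = [[5, 7, 9], [5, 8, 9], [5, 8, 9]]
confidences = [0.95, 0.95, 0.50]
print(sorted(update_frozen(history, confidences, frozen=set())))
```

Frozen positions would then be skipped in later denoising steps, which is where the compute saving comes from.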
Expanding the Capabilities of Reinforcement Learning via Text Feedback
Authors: Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, Andrea Zanette
First: 2026-02-02T18:56:56+00:00 · Latest: 2026-02-11T18:43:26+00:00
Comments: 43 pages, 6 figures
Abstract
The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout as binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), where text feedback is available during training but not at inference. Therefore, models must learn to internalize the feedback in order to improve their test-time single-turn performance. To do this, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis on both methods, and empirically evaluate on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.
中文标题/摘要
标题:通过文本反馈扩展强化学习的能力
对于LLM后训练的RL成功,其根源在于一个不合理地不具信息性的来源:每次回放中仅有一比特信息作为二元奖励或偏好标签。在另一极端,蒸馏提供了密集的监督,但需要演示,这既昂贵又难以扩展。我们研究文本反馈作为中间信号:比标量奖励更丰富,但比完整演示更便宜。文本反馈是人类互动的自然方式,并且在许多现实世界环境中已经非常普遍,用户、注释者和自动裁判经常对LLM输出进行评价。为了大规模利用文本反馈,我们形式化了一个多轮RL设置,文本反馈强化学习(RLTF),其中在训练期间可用文本反馈,但在推理时不可用。因此,模型必须学会内化反馈以提高其测试时单轮性能。为此,我们提出了两种方法:自我蒸馏(RLTF-SD),训练单轮策略使其匹配自身反馈条件下的第二轮生成;反馈建模(RLTF-FM),将预测反馈作为辅助目标。我们对这两种方法进行了理论分析,并在推理谜题、竞赛数学和创造性写作任务上进行了实证评估。结果显示,两种方法在基准测试中均优于强基线,突显了大规模使用额外丰富监督源的RL潜力。
Summary / 总结
This paper explores the use of text feedback as a form of intermediate supervision for reinforcement learning (RL) of large language models (LLMs). The study formulates a multi-turn RL setup called RL from Text Feedback (RLTF), where text feedback is provided during training but not at inference. Two methods, Self Distillation (RLTF-SD) and Feedback Modeling (RLTF-FM), are proposed to leverage this feedback. The methods are evaluated on reasoning puzzles, competition math, and creative writing tasks, and both methods outperform strong baselines, demonstrating the potential of RL with rich supervision at scale.
该论文探讨了使用文本反馈来增强大型语言模型(LLM)的强化学习(RL)。动机是提供比标量奖励更丰富的反馈,但避免完整示范的成本。作者提出了两种方法:自我蒸馏(RLTF-SD)和反馈建模(RLTF-FM)。这两种方法通过在训练期间内化文本反馈来提高单轮性能。实验结果显示,这两种方法在逻辑推理谜题、竞赛数学和创造性写作任务上均优于强基线,展示了文本反馈在大规模RL中的潜力。
Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards
Authors: Reinhard Heckel, Mahdi Soltanolkotabi, Christos Thramboulidis
First: 2026-02-11T18:39:42+00:00 · Latest: 2026-02-11T18:39:42+00:00
Abstract
Reinforcement learning with verifiable rewards has driven recent advances in LLM post-training, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients depending on the rewards. The most popular algorithms including GRPO, DAPO, and RLOO focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downgrading gradients with very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low, or even zero, empirical success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and less so in post-SFT RL where the model already starts at high accuracy. We also provide theory that characterizes prompt weights which minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.
中文标题/摘要
标题:具有可验证奖励的强化学习中的非对称提示加权
具有可验证奖励的强化学习推动了LLM后训练的近期进展,特别是在推理方面。策略优化算法为给定的提示生成多个响应,然后根据奖励有效地加权相应的梯度。最受欢迎的算法包括GRPO、DAPO和RLOO,它们侧重于模糊提示,即成功率介于中间的提示,而降低非常容易和非常难的提示的梯度权重。在本文中,我们考虑了非对称提示加权,为具有低,甚至零,经验成功率的提示分配更高的权重。我们发现,非对称加权特别有利于从零开始的RL(如R1-Zero),其中训练跨越广泛的准确率范围,而在后SFT RL中,模型已经从高准确率开始,其受益较少。我们还提供了理论,描述了在固定更新预算下,使成功率从初始水平提高到目标准确率所需时间最小化的提示权重。在成功率低的区域,由于信息性响应稀少,响应成本占主导地位,这些最优权重变得非对称,加权低成功率,从而加速有效时间的收敛。
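To see why standard group-based methods concentrate on prompts with intermediate success probability, it helps to compute their advantages on binary rewards. The sketch below uses the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation) and the leave-one-out baseline of RLOO. The final function, `asymmetric_prompt_weight`, is a purely hypothetical example of a weight that grows as the empirical success rate falls; it is not the weighting derived in the paper.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: each reward minus the mean of the others."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def asymmetric_prompt_weight(success_rate, alpha=1.0):
    """Hypothetical weight that is largest for low-success prompts: (1 - p) ** alpha."""
    return (1.0 - success_rate) ** alpha


# A prompt where every sampled response fails gives no learning signal at all.
print(grpo_advantages([0, 0, 0, 0]))
print(rloo_advantages([0, 0, 0, 0]))
```

Both estimators return all zeros when every response in the group fails or every response succeeds, so very hard and very easy prompts contribute nothing to the update under these schemes.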
PhyCritic: Multimodal Critic Models for Physical AI
Authors: Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, Zhiding Yu
First: 2026-02-11T18:35:39+00:00 · Latest: 2026-02-11T18:35:39+00:00
Abstract
With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
中文标题/摘要
标题:PhyCritic:面向物理AI的多模态批评模型
随着大型多模态模型的快速发展,可靠且准确的批评和评判模型对于开放式的评估和偏好对齐变得至关重要,它们能够提供成对的偏好、数值评分以及解释性的理由来评估模型生成的响应。然而,现有的批评模型主要是在一般视觉领域如描述或图像问答中进行训练,而涉及感知、因果推理和规划的物理AI任务则被严重忽视。我们引入了PhyCritic,这是一种通过两阶段RLVR流水线优化的多模态批评模型:物理技能预热阶段增强物理导向的感知和推理能力,随后是自我参照批评微调阶段,在此阶段批评模型生成自己的预测作为内部参考,再对候选响应进行评判,从而提高评判的稳定性和物理正确性。在物理和通用多模态评判基准测试中,PhyCritic在开源基线之上取得了显著的性能提升,当将其作为策略模型应用时,进一步提高了物理接地任务中的感知和推理能力。
Summary / 总结
PhyCritic is a multimodal critic model designed for physical AI tasks. It uses a two-stage RLVR pipeline: a physical skill warmup stage to enhance perception and reasoning, followed by self-referential critic finetuning. This model outperforms open-source baselines on both physical and general-purpose multimodal judge benchmarks and improves perception and reasoning in physically grounded tasks.
PhyCritic 是一种针对物理 AI 任务的多模态批评模型。它使用两阶段的 RLVR 管道来增强物理感知和推理,随后进行自我参照批评微调。这种方法提高了判断的稳定性和物理正确性。PhyCritic 在物理和通用多模态判断基准测试中均优于开源基线,并且能够增强物理接地任务中的感知和推理能力。
From Natural Language to Materials Discovery: The Materials Knowledge Navigation Agent
Authors: Genmao Zhuang, Amir Barati Farimani
First: 2026-02-11T18:34:24+00:00 · Latest: 2026-02-11T18:34:24+00:00
Comments: 22 pages, 5 figures
Abstract
Accelerating the discovery of high-performance materials remains a central challenge across energy, electronics, and aerospace technologies, where traditional workflows depend heavily on expert intuition and computationally expensive simulations. Here we introduce the Materials Knowledge Navigation Agent (MKNA), a language-driven system that translates natural-language scientific intent into executable actions for database retrieval, property prediction, structure generation, and stability evaluation. Beyond automating tool invocation, MKNA autonomously extracts quantitative thresholds and chemically meaningful design motifs from literature and database evidence, enabling data-grounded hypothesis formation. Applied to the search for high-Debye-temperature ceramics, the agent identifies a literature-supported screening criterion (Theta_D > 800 K), rediscovers canonical ultra-stiff materials such as diamond, SiC, SiN, and BeO, and proposes thermodynamically stable, previously unreported Be-C-rich compounds that populate the sparsely explored 1500-1700 K regime. These results demonstrate that MKNA not only finds stable candidates but also reconstructs interpretable design heuristics, establishing a generalizable platform for autonomous, language-guided materials exploration.
中文标题/摘要
标题:从自然语言到材料发现:材料知识导航代理
在能源、电子和航空航天技术领域,加速高性能材料的发现仍然是一个核心挑战,传统的工作流程高度依赖专家直觉和计算成本高昂的模拟。在此,我们介绍了材料知识导航代理(MKNA),这是一种基于语言的系统,能够将自然语言的科学意图转化为数据库检索、性质预测、结构生成和稳定性评估的可执行操作。MKNA不仅自动化工具调用,还能自主从文献和数据库证据中提取定量阈值和化学上有意义的设计模式,从而实现数据驱动的假设形成。应用于寻找高德拜温度陶瓷的搜索中,代理识别了一个文献支持的筛选标准(Θ_D > 800 K),重新发现了如钻石、SiC、SiN和BeO等经典的超硬材料,并提出了热力学稳定的、此前未报道的Be-C丰富的化合物,这些化合物填充了探索不足的1500-1700 K区间。这些结果表明,MKNA不仅能够找到稳定的候选材料,还能重建可解释的设计启发式规则,建立了一个可泛化的、基于语言的自主材料探索平台。
Summary / 总结
The Materials Knowledge Navigation Agent (MKNA) is designed to accelerate materials discovery by translating natural language into executable actions for database retrieval, property prediction, and structure generation. Applied to the search for high-Debye-temperature ceramics, MKNA identifies a screening criterion, rediscovers known ultra-stiff materials, and proposes new Be-C-rich compounds, demonstrating its capability to find stable candidates and reconstruct design heuristics.
Materials Knowledge Navigation Agent (MKNA) 通过将自然语言转化为数据库检索、性质预测和结构生成的执行动作来加速材料发现。应用于寻找高Debye温度陶瓷的搜索中,MKNA 确定了一个筛选标准(Theta_D > 800 K),重新发现了已知的超刚性材料,并提出了新的热力学稳定的 Be-C 富集化合物,位于 1500-1700 K 区域,展示了其发现稳定候选材料和重构设计启发式的能力。
Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations
Authors: Kadircan Aksoy, Protim Bhattacharjee, Peter Jung
First: 2026-01-28T10:46:44+00:00 · Latest: 2026-02-11T18:32:03+00:00
Abstract
We study the supervised training dynamics of neural classifiers through the lens of binary hypothesis testing. We model classification as a set of binary tests between class-conditional distributions of representations and empirically show that, along training trajectories, well-generalizing networks increasingly align with Neyman-Pearson optimal decision rules via monotonic improvements in KL divergence that relate to error rate exponents. We finally discuss how this yields an explanation and possible training or regularization strategies for different classes of neural networks.
中文标题/摘要
标题:隐含假设检验与神经网络表示中的散度保持
我们通过二元假设检验的视角研究了神经分类器的监督训练动态。我们将分类视为类条件表示分布之间的二元测试集,并实验证明,在训练轨迹中,泛化良好的网络通过KL散度的单调改进与Neyman-Pearson最优决策规则逐渐对齐,这与错误率指数相关。最后,我们讨论了这如何提供不同类别的神经网络的解释和可能的训练或正则化策略。
Summary / 总结
The study investigates how neural classifiers learn through the perspective of binary hypothesis testing. It shows empirically that well-generalizing networks increasingly align with Neyman-Pearson optimal decision rules during training, through monotonic improvements in KL divergence that relate to error-rate exponents. This leads to insights into training dynamics and potential training or regularization strategies for different classes of neural networks.
该研究从二元假设检验的角度考察了神经分类器的学习过程。研究表明,表现良好的网络在训练过程中会与Neyman-Pearson最优决策规则对齐,KL散度会单调改善,这与错误率指数相关。研究提供了训练动态的见解,并可能为不同类型的神经网络提出策略。
Learning to Compose for Cross-domain Agentic Workflow Generation
Authors: Jialiang Wang, Shengxiang Xu, Hanmo Liu, Jiachuan Wang, Yuyu Luo, Shimin Di, Min-Ling Zhang, Lei Chen
First: 2026-02-11T18:27:22+00:00 · Latest: 2026-02-11T18:27:22+00:00
Abstract
Automatically generating agentic workflows -- executable operator graphs or codes that orchestrate reasoning, verification, and repair -- has become a practical way to solve complex tasks beyond what single-pass LLM generation can reliably handle. Yet what constitutes a good workflow depends heavily on the task distribution and the available operators. Under domain shift, current systems typically rely on iterative workflow refinement to discover a feasible workflow from a large workflow space, incurring high iteration costs and yielding unstable, domain-specific behavior. In response, we internalize a decompose-recompose-decide mechanism into an open-source LLM for cross-domain workflow generation. To decompose, we learn a compact set of reusable workflow capabilities across diverse domains. To recompose, we map each input task to a sparse composition over these bases to generate a task-specific workflow in a single pass. To decide, we attribute the success or failure of workflow generation to counterfactual contributions from learned capabilities, thereby capturing which capabilities actually drive success by their marginal effects. Across stringent multi-domain, cross-domain, and unseen-domain evaluations, our 1-pass generator surpasses SOTA refinement baselines that consume 20 iterations, while substantially reducing generation latency and cost.
中文标题/摘要
标题:面向跨域智能体工作流生成的组合学习
自动生成代理工作流——可执行的操作图或代码,协调推理、验证和修复——已成为解决单次生成无法可靠处理的复杂任务的一种实用方法。然而,什么是好的工作流在很大程度上取决于任务分布和可用的操作。在领域迁移下,当前系统通常依赖于迭代工作流细化,从庞大的工作流空间中发现可行的工作流,这会带来高昂的迭代成本并导致特定领域的不稳定行为。为应对这一挑战,我们将分解-重组-决策机制内置于开源LLM中,以实现跨域工作流生成。通过分解,我们学习了一组跨域可复用的工作流能力。通过重组,我们将每个输入任务映射到这些基底的稀疏组合,以一次生成特定任务的工作流。通过决策,我们将工作流生成的成功或失败归因于学习能力的反事实贡献,从而捕捉到哪些能力通过边际效应真正驱动了成功。在严格的多域、跨域和未见域评估中,我们的1次生成器超越了消耗20次迭代的SOTA细化基线,同时显著减少了生成延迟和成本。
Summary / 总结
The research aims to improve the automatic generation of agentic workflows by addressing the limitations of current iterative refinement methods. The method involves an open-source LLM that incorporates a decompose-recompose-decide mechanism. This mechanism learns reusable workflow capabilities across different domains, maps each task to a sparse composition of these capabilities, and attributes the success or failure of workflow generation to the learned capabilities. The experimental results show that the proposed 1-pass generator outperforms state-of-the-art 20-iteration refinement baselines, reducing generation latency and cost while maintaining high performance across various domains.
研究旨在通过解决当前迭代优化方法的局限性,改进自动生成代理工作流的能力。方法包括一个集成分解-重组-决策机制的开源LLM。该机制跨不同领域学习可重用的工作流能力,将每个任务映射到这些基础能力的稀疏组合,并将工作流生成的成功或失败归因于学习到的能力。实验结果表明,提出的单次生成器在多个领域、跨领域和未见过的领域中均优于20次迭代的最新基线,同时显著减少了生成延迟和成本。
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
Authors: Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
First: 2026-02-10T18:55:41+00:00 · Latest: 2026-02-11T18:20:25+00:00
Comments: 41 pages
Abstract
Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
中文标题/摘要
标题:代理世界模型:无限合成环境的自主强化学习
大型语言模型(LLM)的最新进展使自主代理能够执行需要与工具和环境进行多轮交互的复杂任务。然而,这种代理训练的扩展受到缺乏多样性和可靠环境的限制。在本文中,我们提出了代理世界模型(AWM),这是一种完整的合成环境生成流水线。使用此流水线,我们扩展到1,000个环境,涵盖了日常生活场景,在这些环境中,代理可以与丰富的工具集(每个环境平均35种工具)进行交互并获得高质量的观察结果。值得注意的是,这些环境是代码驱动的,并由数据库支持,提供了比LLM模拟的环境更可靠和一致的状态转换。此外,与从现实环境中收集轨迹相比,它们使代理交互更加高效。为了展示此资源的有效性,我们对多轮工具使用代理进行了大规模强化学习。得益于完全可执行的环境和可访问的数据库状态,我们还可以设计可靠的奖励函数。在三个基准上的实验表明,仅在合成环境中进行训练,而不是特定基准环境,可以实现强大的离分布泛化。代码可在https://github.com/Snowflake-Labs/agent-world-model/获取。
Summary / 总结
The research aims to address the challenge of scaling agent training by providing diverse and reliable synthetic environments. The main method involves developing Agent World Model (AWM), a pipeline for generating 1,000 synthetic environments with rich toolsets and reliable state transitions. Key experimental findings show that training agents exclusively in these synthetic environments leads to strong out-of-distribution generalization on three benchmarks, demonstrating the effectiveness of this approach.
研究旨在通过提出Agent World Model (AWM) 合成环境生成管道来解决代理训练规模化的挑战,该管道生成了1,000个多样且可靠的环境,用于代理强化学习。方法包括生成代码驱动的环境,配备丰富的工具集和高质量的观察,这些环境比由大语言模型模拟的环境更可靠。实验表明,仅在这些合成环境中训练可以实现三个基准上的强泛化能力。代码可在GitHub上获得。
GameDevBench: Evaluating Agentic Capabilities Through Game Development
Authors: Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten, Qiuhong Anna Wei, Runkun Chen, Alexander Wang, Valerie Chen, Ameet Talwalkar, Chris Donahue
First: 2026-02-11T18:15:11+00:00 · Latest: 2026-02-11T18:15:11+00:00
Abstract
Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times the amount of lines of code and file changes compared to prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
中文标题/摘要
标题:GameDevBench:通过游戏开发评估自主能力
尽管在编码代理方面取得了快速进展,但其多模态对应物的进步却落后了。一个关键挑战是缺乏能够结合软件开发复杂性和对深入多模态理解需求的评估测试平台。游戏开发提供了这样一个测试平台,因为代理必须在处理着色器、精灵和动画等内在多模态资产的同时,导航大型密集的代码库,这些资产位于视觉游戏场景中。我们提出了GameDevBench,这是第一个用于评估代理在游戏开发任务上的基准。GameDevBench 包含了从网络和视频教程中提取的132项任务。这些任务需要大量的多模态理解并且非常复杂——平均解决方案的代码行数和文件更改量是之前软件开发基准的三倍多。代理仍然难以应对游戏开发,最好的代理仅解决了54.5%的任务。我们发现感知任务难度与多模态复杂性之间存在强烈相关性,成功率为游戏导向任务的46.9%下降到2D图形任务的31.6%。为了提高多模态能力,我们引入了两种简单的基于图像和视频的反馈机制。尽管这些方法很简单,但它们始终能够提高性能,最大的变化是Claude Sonnet 4.5的性能从33.3%提高到47.7%。我们公开发布了GameDevBench,以支持进一步研究代理游戏开发。
Summary / 总结
GameDevBench evaluates agentic capabilities in game development by addressing the scarcity of evaluation testbeds that combine software development complexity with deep multimodal understanding. The benchmark includes 132 tasks from web and video tutorials, which require significant multimodal understanding and are more complex than prior benchmarks. The best agent solved only 54.5% of tasks, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. Two simple image and video-based feedback mechanisms improved performance, with the largest change being an increase in Claude Sonnet 4.5's success rate from 33.3% to 47.7%.
GameDevBench 通过游戏开发任务评估代理的能力,解决了多模态代理缺乏全面测试床的问题。它包含来自网络和视频教程的132项任务,需要大量的多模态理解。最好的代理仅解决了54.5%的任务,成功率从46.9%的游戏玩法任务下降到31.6%的2D图形任务。引入简单的图像和视频反馈机制提高了性能,Claude Sonnet 4.5的成功率从33.3%提高到47.7%。
Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
Authors: Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha, Amrit Singh Bedi
First: 2026-02-11T18:09:17+00:00 · Latest: 2026-02-11T18:09:17+00:00
Abstract
Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
中文标题/摘要
标题:推理模型的安全恢复仅需几步早期转向即可实现
基于强化学习(RL)的后训练方法(例如GRPO)可以提高多模态大规模推理模型(MLRM)的推理能力。但最近的证据表明,这种方法同时会降低安全对齐并增加脱逃成功率。我们提出了一种名为SafeThink的轻量级推理时防御方法,将安全恢复视为满足约束而非最大化目标。SafeThink使用安全奖励模型监控推理轨迹,并仅在安全阈值被违反时有条件地注入优化的短纠正前缀(“请等一下,安全思考”)。在对六种开源MLRM和四种脱逃基准(JailbreakV-28K、Hades、FigStep和MM-SafetyBench)的评估中,SafeThink将攻击成功率降低了30-60%(例如,LlamaV-o1:从JailbreakV-28K的63.33%降至5.74%,R1-Onevision:从Hades的69.07%降至5.65%),同时保持了推理性能(MathVista准确率:从65.20%降至65.00%)。我们实验中的一个关键经验发现是,安全恢复通常只需几步转向即可实现:干预前1-3步推理通常足以引导整个生成过程走向安全完成。
Summary / 总结
The paper addresses the issue of safety degradation in reasoning models enhanced by reinforcement learning, proposing SafeThink as a lightweight defense mechanism. SafeThink monitors the reasoning process and injects a short corrective prefix only when safety is compromised, significantly reducing jailbreak success rates by 30-60% across various models and benchmarks while maintaining reasoning performance. Key findings indicate that safety recovery is often achievable within the first few reasoning steps.
论文解决了强化学习改进的多模态大规模推理模型在安全性方面退化的问题。它提出了SafeThink,一种轻量级的防御机制,在推理过程中监控安全状态并在安全被破坏时注入纠正前缀。SafeThink在多种模型和基准测试中将越狱成功率降低了30-60%,同时保持了推理性能。关键发现是,安全性恢复往往只需要在推理过程的前1-3步进行少量调整即可引导生成安全的完成。
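The SafeThink control loop described above (monitor the reasoning trace, inject a corrective prefix only when a safety threshold is violated, and only in the first few steps) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `next_step` and `safety_score` are hypothetical stand-ins for the model and the safety reward model, the threshold value is arbitrary, and the choice to discard the flagged step and regenerate after the prefix is a guess, since the abstract does not specify it.

```python
PREFIX = "Wait, think safely"  # corrective prefix quoted in the abstract


def safe_generate(next_step, safety_score, threshold=0.5,
                  max_steps=8, steer_steps=3, prefix=PREFIX):
    """Generate a reasoning trace, steering only within the first `steer_steps` steps."""
    trace = []
    for i in range(max_steps):
        step = next_step(trace)
        if step is None:  # generator signals completion
            break
        # Intervene only early, and only when the safety score drops below threshold.
        if i < steer_steps and safety_score(trace + [step]) < threshold:
            trace.append(prefix)      # inject the corrective prefix
            step = next_step(trace)   # assumption: regenerate the flagged step
            if step is None:
                break
        trace.append(step)
    return trace


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_next_step(trace):
    if len(trace) >= 3:
        return None
    return "safe step" if PREFIX in trace else "unsafe plan"


def toy_safety_score(trace):
    return 0.0 if any("unsafe" in s for s in trace) else 1.0


if __name__ == "__main__":
    print(safe_generate(toy_next_step, toy_safety_score))
```

In this toy run the first candidate step is flagged, the prefix is injected once, and the remaining steps pass the check without further intervention, mirroring the claim that a few early steering steps are often enough.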
Can Large Language Models Make Everyone Happy?
Authors: Usman Naseem, Gautam Siddharth Kashyap, Ebad Shabbir, Sushant Kumar Ray, Abdullah Mohammad, Rafiq Ali
First: 2026-02-11T17:57:23+00:00 · Latest: 2026-02-11T17:57:23+00:00
Abstract
Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions, leading to behaviors that diverge from human expectations in real-world settings where these dimensions must co-occur. Existing benchmarks, such as SAFETUNEBED (safety-centric), VALUEBENCH (value-centric), and WORLDVIEW-BENCH (culture-centric), primarily evaluate these dimensions in isolation and therefore provide limited insight into their interactions and trade-offs. More recent efforts, including MIB and INTERPRETABILITY BENCHMARK (based on mechanistic interpretability), offer valuable perspectives on model failures; however, they remain insufficient for systematically characterizing cross-dimensional trade-offs. To address these gaps, we introduce MisAlign-Profile, a unified benchmark for measuring misalignment trade-offs inspired by mechanistic profiling. First, we construct MISALIGNTRADE, an English misaligned-aligned dataset across 112 normative domain taxonomies, including 14 safety, 56 value, and 42 cultural domains. In addition to domain labels, each prompt is classified with one of three orthogonal semantic types (object, attribute, or relation misalignment) using Gemma-2-9B-it and expanded via Qwen3-30B-A3B-Instruct-2507 with SimHash-based fingerprinting to avoid duplication. Each prompt is paired with misaligned and aligned responses through two-stage rejection sampling to ensure quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs on MISALIGNTRADE, revealing 12%-34% misalignment trade-offs across dimensions.
中文标题/摘要
标题:大型语言模型能否让每个人都满意?
大型语言模型(LLMs)的不一致是指未能同时满足安全、价值观和文化维度,导致在这些维度必须共存的现实场景中出现与人类期望相悖的行为。现有的基准测试,如SAFETUNEBED(以安全为中心)、VALUEBENCH(以价值观为中心)和WORLDVIEW-BENCH(以文化为中心),主要分别评估这些维度,因此对它们的相互作用和权衡提供有限的见解。最近的努力,包括MIB和基于机制可解释性的INTERPRETABILITY BENCHMARK,提供了关于模型失败的宝贵视角;然而,它们仍然不足以系统地表征跨维度权衡。为了解决这些差距,我们引入了MisAlign-Profile,这是一种受机制表型启发的统一基准,用于测量不一致权衡。首先,我们构建了MISALIGNTRADE,这是一个涵盖112个规范领域分类的英语不一致对齐数据集,包括14个安全、56个价值观和42个文化领域。除了领域标签,每个提示还被分类为对象、属性或关系不一致中的一个正交语义类型,并通过Qwen3-30B-A3B-Instruct-2507和基于SimHash的指纹扩展以避免重复。每个提示通过两阶段拒绝采样与不一致和对齐的响应配对,以确保质量。其次,我们在MISALIGNTRADE上对通用、微调和开源权重的LLMs进行了基准测试,揭示了12%-34%的跨维度不一致权衡。
Summary / 总结
The paper addresses the issue of misalignment in Large Language Models (LLMs) by introducing MisAlign-Profile, a unified benchmark to measure cross-dimensional trade-offs. It constructs MISALIGNTRADE, an English dataset across 112 normative domains, and uses a two-stage rejection sampling method to ensure high-quality misaligned and aligned responses. The benchmark reveals that general-purpose, fine-tuned, and open-weight LLMs exhibit misalignment trade-offs ranging from 12% to 34% across different dimensions.
论文通过引入MisAlign-Profile统一基准来衡量LLMs在不同维度上的权衡问题。它构建了涵盖112个规范领域的MISALIGNTRADE数据集,并使用两阶段拒绝采样方法确保高质量的对齐和未对齐响应。基准测试显示,通用、微调和开放权重的LLMs在不同维度上的未对齐权衡范围为12%到34%。
DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
Authors: Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, Kai Chen
First: 2026-02-11T17:56:15+00:00 · Latest: 2026-02-11T17:56:15+00:00
Abstract
In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the data recipe, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate end-to-end data recipe generation for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.
中文标题/摘要
标题:DataChef:通过强化学习烹饪出最优数据配方以适应大规模语言模型
在当前大规模语言模型(LLM)的背景下,大规模高质量训练数据的策划是模型性能的主要驱动力。关键杠杆是数据配方,它包括一个数据处理管道,将原始数据转换为训练语料库。尽管使用LLM自动化单个数据处理步骤,如数据合成和过滤越来越普遍,但整体数据配方的设计仍然主要依赖于手动和劳动密集型的方法,需要大量的人力专业知识和迭代。为了解决这一差距,我们提出了端到端数据配方生成以适应LLM。给定一个目标基准和可用数据源的池,模型需要输出一个完整的数据配方,将基础LLM适应到目标任务。我们提出了DataChef-32B,它使用代理奖励进行在线强化学习,该奖励预测候选配方的下游性能。在六个保留任务中,DataChef-32B生成了实用的配方,其下游性能与人类专家策划的配方相当。值得注意的是,DataChef-32B生成的配方将Qwen3-1.7B-Base适应到数学领域,取得了AIME'25 66.7的成绩,超过了Qwen3-1.7B。这项工作为自动化LLM训练和开发自我进化的AI系统提供了新的视角。
Summary / 总结
The research aims to automate the process of creating data recipes for Large Language Models (LLMs) by formulating end-to-end data recipe generation using reinforcement learning. DataChef-32B, the proposed model, generates complete data recipes that adapt a base LLM to target tasks. Across six tasks, DataChef-32B produces recipes comparable to those curated by human experts, demonstrating its capability to automate data recipe generation and improve model performance. Notably, it adapts Qwen3-1.7B-Base to the math domain, achieving a score of 66.7 on AIME'25, surpassing the base model.
研究旨在通过强化学习自动化生成数据食谱以适应大型语言模型(LLMs)。DataChef-32B 在线执行强化学习,生成的数据食谱在六个任务上的表现与人工专家生成的食谱相当。特别地,它将 Qwen3-1.7B-Base 在数学领域的表现提升至 AIME'25 的 66.7,超过了基模型的表现。
Cross-Attention Speculative Decoding
Authors: Wei Zhong, Manasa Bharadwaj, Yixiao Wang, Yipeng Ji, Chul Lee
First: 2025-05-30T12:52:35+00:00 · Latest: 2026-02-11T17:48:57+00:00
Abstract
Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), the first, to our knowledge, cross-attention-based Transformer decoder SD model that achieves performance on par with leading self-attention SD models (EAGLE-v2) while eliminating the need for pooling or auxiliary components, simplifying the architecture, improving training efficiency, and maintaining stable memory usage during training-time simulation. To enable effective training of this novel architecture, we propose Two-Stage Block-Attention Training, a new method that achieves training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, offering a strong alternative for architectures in speculative decoding.
中文标题/摘要
标题:跨注意力推测解码
推测解码(SD)是用于加速大型语言模型(LLMs)推理的一种广泛采用的方法,尤其是在草稿模型和目标模型高度对齐的情况下。然而,最先进的SD方法通常依赖于紧密耦合的、基于自注意力的Transformer解码器,经常附加辅助池化或融合层。这种耦合使它们变得越来越复杂,并且难以在不同的模型之间进行泛化。我们提出了Budget EAGLE(Beagle),这是迄今为止第一个无需池化或辅助组件即可达到与领先自注意力SD模型(EAGLE-v2)相当性能的跨注意力Transformer解码器SD模型,简化了架构,提高了训练效率,并在训练时保持稳定的内存使用。为了使这种新型架构能够有效训练,我们提出了两阶段块注意力训练,这是一种新的方法,可以在块级注意力场景中实现训练稳定性和收敛效率。在多个LLM和数据集上的广泛实验表明,Beagle实现了与EAGLE-v2相当的推理加速和更高的训练效率,为推测解码中的架构提供了一个强有力的替代方案。
Summary / 总结
The research aims to improve speculative decoding (SD) methods for accelerating large language model inference by proposing Budget EAGLE (Beagle), a cross-attention-based SD model that simplifies the architecture and enhances training efficiency. Beagle achieves performance comparable to the state-of-the-art EAGLE-v2 while eliminating pooling and auxiliary components and keeping memory usage stable during training-time simulation. Experiments demonstrate that Beagle provides competitive inference speedups and higher training efficiency than EAGLE-v2, making it a strong alternative for speculative decoding architectures.
研究旨在通过提出基于交叉注意力的SD模型Budget EAGLE(Beagle),简化架构并提高训练效率,来改进大型语言模型推理的加速方法。Beagle在不使用辅助组件的情况下,实现了与最先进的EAGLE-v2相当的性能,同时在训练时保持稳定的内存使用。实验表明,Beagle取得了与EAGLE-v2相当的推理加速和更高的训练效率,是推测解码架构的一个强有力替代方案。
Token-Efficient Change Detection in LLM APIs
Authors: Timothée Chauvin, Clément Lalanne, Erwan Le Merrer, Jean-Michel Loubes, François Taïani, Gilles Tredan
First: 2026-02-11T17:48:29+00:00 · Latest: 2026-02-11T17:48:29+00:00
Abstract
Remote change detection in LLMs is a difficult problem. Existing methods are either too expensive for deployment at scale, or require initial white-box access to model weights or grey-box access to log probabilities. We aim to achieve both low cost and strict black-box operation, observing only output tokens.
Our approach hinges on specific inputs we call Border Inputs, for which there exists more than one output top token. From a statistical perspective, optimal change detection depends on the model's Jacobian and the Fisher information of the output distribution. Analyzing these quantities in low-temperature regimes shows that border inputs enable powerful change detection tests.
Building on this insight, we propose the Black-Box Border Input Tracking (B3IT) scheme. Extensive in-vivo and in-vitro experiments show that border inputs are easily found for the tested non-reasoning endpoints, and that they achieve performance on par with the best available grey-box approaches. B3IT reduces costs by $30\times$ compared to existing methods, while operating in a strict black-box setting.
中文标题/摘要
标题:LLM API中的高效标记变化检测
在LLM中远程检测变化是一个难题。现有方法要么部署成本过高,要么需要初始的白盒访问模型权重或灰色盒访问输出概率。我们旨在实现低成本和严格的黑盒操作,仅观察输出标记。
我们的方法依赖于我们称为边界输入的特定输入,对于这些输入,存在多个输出顶级标记。从统计角度来看,最优变化检测依赖于模型的雅可比矩阵和输出分布的费雪信息。在低温状态下分析这些量显示,边界输入能够实现强大的变化检测测试。
基于这一洞察,我们提出了黑盒边界输入跟踪(B3IT)方案。广泛的体内和体外实验表明,边界输入对于非推理测试端点易于找到,并且在性能上与最佳可用的灰色盒方法相当。B3IT将成本降低了30倍,同时在严格的黑盒环境中运行。
Summary / 总结
The research aims to develop a low-cost and strict black-box method for detecting changes in LLM APIs. The approach relies on Border Inputs, which have multiple top tokens. Experiments show that these inputs enable effective change detection, achieving performance comparable to the best grey-box methods while reducing costs by 30 times. The Black-Box Border Input Tracking (B3IT) scheme is proposed and validated through extensive experiments.
研究旨在开发一种低成本且严格的黑盒方法来检测LLM API的变化。该方法依赖于Border Inputs,这些输入有多个顶级输出。实验表明,Border Inputs能够有效进行变化检测,性能与最佳灰盒方法相当。B3IT相比现有方法成本降低了30倍,同时保持了黑盒操作。
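As a rough illustration of why border inputs are useful for black-box change detection: if an input has two near-tied top tokens, the observed share of each token is sensitive to small model changes, so comparing token counts before and after is informative. The sketch below uses a generic two-proportion z-test on sampled output tokens. It is an assumption chosen for illustration and is not the B3IT test itself; the function names and the critical value are hypothetical.

```python
import math


def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> float:
    """Pooled two-proportion z statistic for counts k1/n1 versus k2/n2."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:  # both samples are all-one-token: no evidence of change
        return 0.0
    return (k1 / n1 - k2 / n2) / se


def changed(reference, current, token, z_crit: float = 3.0) -> bool:
    """Flag a change if the share of `token` differs significantly between samples."""
    z = two_proportion_z(reference.count(token), len(reference),
                         current.count(token), len(current))
    return abs(z) > z_crit


if __name__ == "__main__":
    reference = ["A"] * 50 + ["B"] * 50   # border input: two tied top tokens
    drifted = ["A"] * 95 + ["B"] * 5      # after a (simulated) model change
    print(changed(reference, reference, "A"), changed(reference, drifted, "A"))
```

Only output tokens are used here, which matches the strict black-box setting described in the abstract; a non-border input whose top token is emitted almost every time would give this kind of test very little signal.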
Algorithmically Establishing Trust in Evaluators
Authors: Adrian de Wynter
First: 2025-06-03T17:04:22+00:00 · Latest: 2026-02-11T17:47:41+00:00
Abstract
An evaluator, such as an LLM-as-a-judge, is trustworthy when there exists some agreed-upon way to measure its performance as a labeller. Traditional approaches either rely on testing the evaluator against references or assume that it somehow 'knows' the correct labelling. Both approaches fail when references are unavailable: the former requires data, and the latter is an assumption, not evidence. To address this, we introduce the 'No-Data Algorithm', which provably establishes trust in an evaluator without requiring any labelled data. Our algorithm works by successively posing challenges to said evaluator. We prove that after $r$ challenge rounds, it accepts an evaluator which knows the correct labels with probability $\geq 1 - (1/4)^r$, and reliably flags untrustworthy ones. We present formal proofs of correctness, empirical tests, and applications to assessing trust in LLMs-as-judges for low-resource language labelling. Our work enables scientifically-grounded evaluator trust in low-data domains, addressing a critical bottleneck for scalable, trustworthy LLM deployment.
Summary / 总结
The paper addresses the challenge of establishing trust in evaluators, particularly in the absence of labeled data. It introduces the 'No-Data Algorithm', which tests evaluators through successive challenges. After $r$ rounds, the algorithm accepts an evaluator that knows the correct labels with probability at least $1 - (1/4)^r$, and reliably identifies untrustworthy ones. The algorithm is supported by formal proofs and empirical tests, demonstrating its effectiveness in assessing trust in LLMs-as-judges for low-resource language labeling.
论文解决了在缺乏标注数据的情况下建立评估者可信度的问题。它提出了‘无数据算法’,通过连续的挑战来测试评估者。在$r$轮后,该算法以至少$1 - (1/4)^r$的概率接受知道正确标签的评估者,并且能够可靠地识别出不可信的评估者。该算法得到了形式证明和实证测试的支持,展示了其在低资源语言标注中评估LLM-as-judge可信度的有效性。
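The acceptance bound $1 - (1/4)^r$ quoted above tightens quickly with the number of challenge rounds. The short sketch below only evaluates that stated bound and finds the smallest number of rounds that reaches a target guarantee; it does not implement the No-Data Algorithm, and the function names are hypothetical.

```python
def acceptance_lower_bound(rounds: int) -> float:
    """Lower bound 1 - (1/4)^r on accepting an evaluator that knows the labels."""
    return 1.0 - 0.25 ** rounds


def rounds_needed(target: float) -> int:
    """Smallest r such that 1 - (1/4)^r >= target, for 0 <= target < 1."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must lie in [0, 1)")
    r = 0
    while acceptance_lower_bound(r) < target:
        r += 1
    return r


if __name__ == "__main__":
    for r in range(1, 6):
        print(r, acceptance_lower_bound(r))
    print("rounds for 0.999:", rounds_needed(0.999))
```

For example, two rounds already give a bound of 0.9375, and five rounds push it above 0.999, since $(1/4)^5 = 1/1024$.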
SteuerLLM: Local specialized large language model for German tax law analysis
Authors: Sebastian Wind, Jeta Sopa, Laurin Schmid, Quirin Jackl, Sebastian Kiefer, Fei Wu, Martin Mayr, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh
First: 2026-02-11T17:46:01+00:00 · Latest: 2026-02-11T17:46:01+00:00
Abstract
Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at https://steuerllm.i5.ai.fau.de.
中文标题/摘要
标题:SteuerLLM:针对德国税法分析的本地专业化大型语言模型
大型语言模型(LLMs)展示了强大的通用推理和语言理解能力,但在受严格形式规则、精确术语和法律约束结构支配的领域中,其性能会下降。税法就是一个典型的例子,因为正确的答案需要精确的法定引用、结构化的法律论证和在严格的评分方案下的数值准确性。我们通过算法生成了SteuerEx,这是第一个基于真实的德国大学税法考试题目的公开基准。SteuerEx 包含了115个专家验证的考试题目,覆盖了六个核心税法领域和多个学术层次,并采用了一种基于陈述的、部分评分的评估框架,该框架与实际考试实践非常接近。我们还介绍了SteuerLLM,这是一种针对德国税法的领域适应型LLM,它基于一个大规模合成数据集进行训练,该数据集是通过使用受控检索增强管道生成的真实考试材料构建的。SteuerLLM(280亿参数)在多个方面始终优于同等规模的一般用途指令调优模型,甚至在某些情况下,其性能超过了更大规模的系统,这表明领域特定的数据和架构适应比参数规模对现实法律推理任务的性能更为关键。所有基准数据、训练数据集、模型权重和评估代码均公开发布,以支持领域特定法律人工智能的可重复研究。SteuerLLM 的网络演示可在 https://steuerllm.i5.ai.fau.de/ 上获得。
In-the-Wild Model Organisms: Mitigating Undesirable Emergent Behaviors in Production LLM Post-Training via Data Attribution
Authors: Frank Xiao, Santiago Aranguri
First: 2026-02-11T17:45:31+00:00 · Latest: 2026-02-11T17:45:31+00:00
Abstract
We propose activation-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, we identify datapoints that cause specific behaviors and validate these attributions causally by retraining with modified data. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors. Applying this to OLMo 2's production DPO training, we surfaced distractor-triggered compliance: a harmful behavior where the model complies with dangerous requests when benign formatting instructions are appended. Filtering top-ranked datapoints reduces this behavior by 63% while switching their labels achieves 78%. Our method outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both. This in-the-wild model organism - emerging from contaminated preference data rather than deliberate injection - provides a realistic benchmark for safety techniques.
中文标题/摘要
标题:野外模型生物:通过数据归因减轻生产LLM后训练中不良 emergent 行为
我们提出了一种基于激活的数据归因方法,该方法将后训练语言模型的行为变化追溯到负责的训练数据点。通过为测试提示和偏好对计算激活差异向量,并按余弦相似度进行排序,我们识别出导致特定行为的数据点,并通过使用修改后的数据重新训练来进行因果验证。聚类行为-数据点相似矩阵还能够实现无监督的新兴行为发现。将此方法应用于OLMo 2的生产DPO训练,我们揭示了诱饵触发的合规性:一种有害行为,即当良性格式化指令附加时,模型会遵从危险请求。过滤排名靠前的数据点可将此行为降低63%,而切换其标签则可实现78%的降低。我们的方法在性能上优于基于梯度的归因和LLM-judge基线,同时成本比两者低10倍以上。这种野外模型生物——源自受污染的偏好数据而非故意注入——为安全技术提供了一个现实基准。
Summary / 总结
The research aims to mitigate undesirable behaviors in production language models by identifying the training data responsible for them. The method, activation-based data attribution, traces behavioral changes to specific training datapoints by ranking activation-difference vectors by cosine similarity, and validates these attributions causally through retraining. Applied to OLMo 2's production DPO training, it surfaces a harmful behavior called distractor-triggered compliance; filtering the top-ranked datapoints reduces this behavior by 63%, and switching their labels reduces it by 78%. The method outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors.
研究旨在通过识别和过滤负责的训练数据来减轻生产语言模型中的不良行为。方法包括基于激活的数据归因,该方法将行为变化追溯到特定的训练数据点,并通过重新训练进行验证。关键发现包括通过过滤排名靠前的数据点,减少有害行为‘诱饵触发合规性’63%,通过切换这些数据点的标签实现78%的减少。该方法在性能上优于基于梯度的归因和LLM裁判基线,并且成本低得多。研究强调了在生产模型中发现新兴行为的重要性。
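The ranking step of activation-based data attribution can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which activations are subtracted, so the sketch assumes one difference vector per test prompt and one per preference pair (here, chosen minus rejected), and all names and toy values are hypothetical.

```python
import math


def diff(a: list[float], b: list[float]) -> list[float]:
    """Element-wise difference a - b between two activation vectors."""
    return [x - y for x, y in zip(a, b)]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_datapoints(behavior_vec, datapoint_vecs):
    """Return (index, similarity) pairs sorted from most to least similar."""
    scores = [(i, cosine(behavior_vec, v)) for i, v in enumerate(datapoint_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy activations (assumed values, three dimensions for readability).
behavior_vec = diff([1.0, 2.0, 0.0], [0.0, 0.0, 0.0])  # test-prompt difference
datapoint_vecs = [
    diff([2.0, 4.0, 0.0], [0.0, 0.0, 0.0]),  # same direction as the behavior
    diff([0.0, 0.0, 1.0], [0.0, 0.0, 0.0]),  # orthogonal to the behavior
    diff([0.0, 0.0, 0.0], [1.0, 2.0, 0.0]),  # opposite direction
]

ranking = rank_datapoints(behavior_vec, datapoint_vecs)
for index, score in ranking:
    print(index, round(score, 3))
```

The top-ranked datapoints are the candidates the abstract describes filtering or relabeling before retraining.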
Interpretable Attention-Based Multi-Agent PPO for Latency Spike Resolution in 6G RAN Slicing
Authors: Kavan Fatehi, Mostafa Rahmani Ghourtani, Amir Sonee, Poonam Yadav, Alessandra M Russo, Hamed Ahmadi, Radu Calinescu
First: 2026-02-11T17:44:03+00:00 · Latest: 2026-02-11T17:44:03+00:00
Comments: This work has been accepted to appear in the IEEE International Conference on Communications (ICC)
Abstract
Sixth-generation (6G) radio access networks (RANs) must enforce strict service-level agreements (SLAs) for heterogeneous slices, yet sudden latency spikes remain difficult to diagnose and resolve with conventional deep reinforcement learning (DRL) or explainable RL (XRL). We propose Attention-Enhanced Multi-Agent Proximal Policy Optimization (AE-MAPPO), which integrates six specialized attention mechanisms into multi-agent slice control and surfaces them as zero-cost, faithful explanations. The framework operates across O-RAN timescales with a three-phase strategy: predictive, reactive, and inter-slice optimization.
A URLLC case study shows AE-MAPPO resolves a latency spike in 18 ms, restores latency to 0.98 ms with 99.9999% reliability, and reduces troubleshooting time by 93% while maintaining eMBB and mMTC continuity. These results confirm AE-MAPPO's ability to combine SLA compliance with inherent interpretability, enabling trustworthy and real-time automation for 6G RAN slicing.
中文标题/摘要
标题:可解释的基于注意力机制的多智能体PPO方法用于解决6G RAN切片中的延迟突增
第六代(6G)无线接入网络(RAN)必须为异构切片强制执行严格的SLA,但突发的延迟突增仍然难以诊断和解决,传统的深度强化学习(DRL)或可解释的RL(XRL)也是如此。我们提出了\emph{增强注意力机制的多智能体近端策略优化(AE-MAPPO)},将六种专门的注意力机制集成到多智能体切片控制中,并以零成本、忠实的解释方式呈现。该框架在O-RAN时间尺度上运行,采用三阶段策略:预测、反应和跨切片优化。
一个URLLC案例研究显示,AE-MAPPO在18毫秒内解决了延迟突增,将延迟恢复到0.98毫秒,可靠性为99.9999%,同时将故障排除时间减少了93%,并保持了eMBB和mMTC的连续性。这些结果证实了AE-MAPPO结合SLA合规性和内在可解释性的能力,使其能够为6G RAN切片提供可信赖的实时自动化。
Summary / 总结
The paper addresses the challenge of resolving sudden latency spikes in 6G RAN slicing by proposing AE-MAPPO, which integrates six specialized attention mechanisms into multi-agent slice control. The framework uses a three-phase strategy and demonstrates effectiveness through a URLLC case study, resolving a latency spike in 18ms, restoring latency to 0.98ms with 99.9999% reliability, and reducing troubleshooting time by 93% while maintaining service continuity for other slices.
论文提出AE-MAPPO,通过将六种专门的注意力机制集成到多代理切片控制中,解决6G RAN中突发的延迟峰值问题。该框架采用三阶段策略,并在URLLC案例研究中证明了其有效性,通过18ms解决延迟峰值,恢复延迟至0.98ms,可靠性达到99.9999%,同时减少93%的故障排查时间,同时保持eMBB和mMTC的服务连续性。
Chatting with Images for Introspective Visual Thinking
Authors: Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan
First: 2026-02-11T17:42:37+00:00 · Latest: 2026-02-11T17:42:37+00:00
Abstract
Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recent proposal of "thinking with images" attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
中文标题/摘要
标题:基于图像的对话促进内省视觉思考
当前的大规模视觉-语言模型(LVLMs)通常依赖基于单次视觉编码的文本推理,这往往会导致细粒度视觉信息的丢失。最近提出的“通过图像思考”试图通过外部工具或代码操作图像来缓解这一限制;然而,由此产生的视觉状态往往缺乏语言语义的充分支撑,影响了有效的跨模态对齐——特别是在需要在远距离区域或多个图像之间推理视觉语义或几何关系时。为了解决这些挑战,我们提出了一种新的“基于图像的对话”框架,将视觉操作重新构想为语言引导的特征调制。在表达性语言提示的指导下,模型动态地对多个图像区域进行联合重新编码,从而增强了语言推理与视觉状态更新之间的耦合。我们通过ViLaVT实现这一范式,这是一种新型的LVLM,配备了专门设计用于此类交互式视觉推理的动态视觉编码器,并通过结合监督微调和强化学习的两阶段课程进行训练,以促进有效的推理行为。在八个基准测试中的广泛实验表明,ViLaVT实现了显著且一致的改进,特别是在复杂的多图像和基于视频的空间推理任务上表现尤为突出。
Summary / 总结
This paper addresses the limitations of current large vision-language models (LVLMs), which rely on text-only reasoning over a single-pass visual encoding and often lose fine-grained visual information. To improve cross-modal alignment, the authors propose "chatting with images", a framework that reframes visual manipulation as language-guided feature modulation: language prompts guide the model to jointly re-encode multiple image regions, tightening the coupling between linguistic reasoning and visual state updates. ViLaVT, a novel LVLM equipped with a dynamic vision encoder, is trained using a two-stage curriculum combining supervised fine-tuning and reinforcement learning. Experiments across eight benchmarks show strong and consistent improvements, especially on complex multi-image and video-based spatial reasoning tasks.
研究旨在通过解决细粒度视觉信息丢失和跨模态对齐的挑战,增强大型视觉-语言模型(LVLM)的推理能力。提出的‘与图像对话’方法将视觉操作重新定义为语言引导的特征调制,使模型在表达性语言提示的指导下动态重新编码多个图像区域。实验表明,配备动态视觉编码器的新型LVLM ViLaVT 在复杂的多图像和视频空间推理任务中表现出显著且一致的改进。
Simultaneous Speech-to-Speech Translation Without Aligned Data
Authors: Tom Labiausse, Romain Fabre, Yannick Estève, Alexandre Défossez, Neil Zeghidour
First: 2026-02-11T17:41:01+00:00 · Latest: 2026-02-11T17:41:01+00:00
Comments: See inference code at: https://github.com/kyutai-labs/hibiki-zero
Abstract
Simultaneous speech translation requires translating source speech into a target language in real time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide examples, model weights, and inference code, and we release a benchmark containing 45h of multilingual data for speech translation evaluation.
中文标题/摘要
标题:无对齐数据的同步口译
同步口译需要在处理非单调词依赖的同时,实时将源语言口译为目标语言。传统方法依赖于带有词级对齐数据的监督训练,但这种数据难以大规模收集,因此依赖于使用特定语言的启发式方法生成的次优对齐。我们提出了Hibiki-Zero,完全消除了对词级对齐的需求。这从根本上简化了训练管道,并使无缝扩展到具有不同语法结构的多种语言成为可能,消除了设计特定语言对齐启发式方法的瓶颈。我们首先在句级对齐数据上进行训练,以在高延迟下学习口译,然后使用GRPO的新型强化学习策略优化延迟同时保持翻译质量。Hibiki-Zero在五个X到英语任务中的翻译准确性、延迟、声音转移和自然度方面均达到最佳性能。此外,我们证明了我们的模型可以通过少于1000小时的语音数据适应支持新的输入语言。我们提供了示例、模型权重、推理代码,并发布了包含45小时多语言数据的基准用于口译评估。
Summary / 总结
Hibiki-Zero performs simultaneous speech translation without requiring word-level alignments, which simplifies the training pipeline and enables scaling across diverse languages. It first trains on sentence-level aligned data to learn speech translation at high latency, then applies reinforcement learning with GRPO to optimize latency while preserving translation quality. The model achieves state-of-the-art translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. It can also be adapted to a new input language with less than 1000 hours of speech. Examples, model weights, inference code, and a benchmark containing 45 hours of multilingual data are released.
Hibiki-Zero旨在无需词级对齐的情况下进行实时语音翻译,简化了训练流程并使多种语言的扩展成为可能。它首先在句级对齐数据上进行训练以实现高翻译准确性,然后使用强化学习优化延迟同时保持质量。该模型在五个X到英语的任务中表现出色,分别在翻译准确性、延迟、声音转移和自然度方面达到最佳效果。此外,它还可以用少量数据(不到1000小时的语音)支持新输入语言。模型提供了代码和基准数据。
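The GRPO step mentioned in the abstract relies on group-relative advantages: several candidate outputs are sampled for the same input, and each reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization. The reward function, which trades translation quality against latency, is a hypothetical stand-in, since the abstract does not specify how Hibiki-Zero defines its reward.

```python
import statistics


def reward(quality: float, latency_s: float, latency_weight: float = 0.1) -> float:
    """Hypothetical scalar reward: quality score minus a latency penalty."""
    return quality - latency_weight * latency_s


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled translations of the same input (assumed toy values):
# (quality score, average latency in seconds).
candidates = [(0.80, 4.0), (0.78, 2.0), (0.60, 1.0)]
rewards = [reward(q, lat) for q, lat in candidates]
advantages = group_relative_advantages(rewards)

for (q, lat), adv in zip(candidates, advantages):
    print(f"quality={q:.2f} latency={lat:.1f}s advantage={adv:+.2f}")
```

Under this assumed reward, the second candidate gets the largest advantage because it gives up little quality for a large latency reduction. Candidates with positive advantage are reinforced and those with negative advantage are discouraged.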
Is In-Context Learning Learning?
Authors: Adrian de Wynter
Venue: ICLR 2026
First: 2025-09-12T17:12:04+00:00 · Latest: 2026-02-11T17:39:46+00:00
Comments: Accepted to ICLR 2026 -- CR version
Abstract
In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models' ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL fits the definition of learning; however, its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL, ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that, empirically, ICL is limited in its ability to learn and generalise to unseen tasks. Namely, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's ad-hoc encoding is not a robust mechanism for learning, and that it suggests limited all-purpose generalisability.
中文标题/摘要
标题:基于上下文学习是否真正学习?
基于上下文学习(ICL)允许一些自回归模型通过下一个标记预测来解决任务,而无需进一步训练。这导致了关于这些模型仅通过提示中的少量示例(范例)就能解决(学习)未见过的任务的声明。然而,演绎并不总是意味着学习,因为ICL并没有明确编码给定的观察。相反,模型依赖于它们的先验知识和给出的示例,如果有的话。我们认为,从数学上讲,ICL 符合学习的定义;然而,其全面特征需要通过实证工作来确定。然后我们进行了一项大规模分析,消融或考虑了记忆、预训练、分布变化以及提示风格和措辞。我们发现,从实证上讲,ICL 在学习和泛化到未见过的任务方面能力有限。具体来说,在示例变得越来越多的情况下,准确度对示例分布、模型、提示风格和输入的语言特征都不敏感。相反,它从提示中的规律中推导出模式,这导致了分布敏感性,尤其是在链式思考等提示风格中尤为明显。鉴于任务形式相似但准确度各异,我们得出结论,自回归的临时编码机制不是一种稳健的学习机制,这表明其泛化能力有限。
Summary / 总结
The study investigates whether in-context learning (ICL) constitutes learning by analyzing its ability to learn and generalize to unseen tasks without further training. It argues that ICL mathematically fits the definition of learning, but a large-scale empirical analysis finds that this ability is limited: as exemplars become more numerous, accuracy becomes insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. ICL instead deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in chain-of-thought prompting. The author concludes that autoregression's ad-hoc encoding is not a robust mechanism for learning and suggests limited all-purpose generalizability.
研究探讨了在上下文学习(ICL)在仅通过少量示例提示解决未见过的任务时的真实学习能力。研究者发现,ICL 更多地是从提示中推导模式而非真正的学习,其性能对示例的数量和分布、模型类型和输入特征等都不敏感。这表明ICL的有效性有限,不具备广泛适用的泛化能力。