Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
First: 2026-02-12T18:59:59+00:00 · Latest: 2026-02-12T18:59:59+00:00
Abstract
The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
中文标题/摘要
标题:扩展验证比扩展策略学习更能有效实现视觉-语言-行动对齐
通用机器人长期愿景依赖于它们理解和执行自然语言指令的能力。视觉-语言-行动(VLA)模型在这一目标上取得了显著进展,但它们生成的动作仍然可能与给定的指令不一致。在本文中,我们研究测试时验证作为缩小“意图-行动差距”的手段。我们首先表征了基于指令的执行的测试时扩展定律,证明了同时扩展重述指令的数量和生成动作的数量大大增加了测试时样本多样性,通常比独立扩展每个维度更有效地恢复正确的动作。为了利用这些扩展定律,我们提出了CoVer,一种视觉-语言-行动对齐的对比验证器,并展示了我们的架构随着额外计算资源和数据的增加而平滑扩展。然后,我们介绍了“启动时计算”和一个分层验证推理流水线,用于VLAs。在部署时,我们的框架从视觉-语言模型(VLM)预计算一组多样化的重述指令,反复为每条指令生成动作候选,然后使用验证器选择最优的高层提示和低层动作片段。与在相同数据上扩展策略预训练相比,我们的验证方法在SIMPLER基准测试中获得了22%的同分布改进和13%的异分布改进,在实际实验中进一步提高了45%。在PolaRiS基准测试中,CoVer实现了14%的任务进展和9%的成功率改进。
Summary / 总结
This paper explores test-time verification as a method to improve the alignment between actions and natural language instructions in vision-language-action models. By characterizing the scaling laws for instruction following, the authors demonstrate that jointly scaling rephrased instructions and generated actions enhances test-time sample diversity. They introduce CoVer, a contrastive verifier, which scales effectively with additional resources. The proposed framework, which includes boot-time compute and a hierarchical verification pipeline, achieves significant improvements in both in-distribution and out-of-distribution settings on the SIMPLER benchmark, with further enhancements in real-world experiments. On the PolaRiS benchmark, CoVer shows a 14% increase in task progress and a 9% increase in success rate.
本文探讨了测试时验证作为提高视觉-语言-行动模型中自然语言指令与行动之间对齐的方法。研究表明,同时增加重述指令的数量和生成动作的数量可以提高测试时样本多样性,更有效地恢复正确的动作。CoVer架构作为一种对比验证器,能够随着资源的增加而平滑扩展,并且提出的分层验证推理管道提高了性能。与政策预训练相比,在SIMPLER基准测试中,验证方法在分布内获得了22%的提升,在分布外获得了13%的提升,而在实际实验中进一步提高了性能。在PolaRiS基准测试中,CoVer实现了14%的任务进度提升和9%的成功率提升。
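The deployment loop described in the abstract (rephrase the instruction, sample several actions per rephrasing, let a verifier pick) can be illustrated with a minimal sketch. This is a flat best-of-N selection over the joint grid, not the paper's hierarchical pipeline or the CoVer architecture; `select_best`, `toy_policy`, and `toy_verifier` are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Action = List[float]


def select_best(
    instructions: Sequence[str],
    sample_action: Callable[[str], Action],
    verifier_score: Callable[[str, Action], float],
    samples_per_instruction: int,
) -> Tuple[str, Action, float]:
    """Return the (instruction, action, score) triple the verifier ranks highest."""
    best = None
    for instruction in instructions:
        for _ in range(samples_per_instruction):
            action = sample_action(instruction)          # stand-in for a VLA policy call
            score = verifier_score(instruction, action)  # stand-in for a learned verifier
            if best is None or score > best[2]:
                best = (instruction, action, score)
    if best is None:
        raise ValueError("need at least one instruction and one sample")
    return best


# Toy stand-ins (assumptions, not the paper's models): the "policy" emits the
# instruction length as a one-dimensional action, and the "verifier" prefers
# larger values.
toy_instructions = ["pick up the cup", "grasp the mug by its handle"]


def toy_policy(instruction: str) -> Action:
    return [float(len(instruction))]


def toy_verifier(instruction: str, action: Action) -> float:
    return action[0]


best_instruction, best_action, best_score = select_best(
    toy_instructions, toy_policy, toy_verifier, samples_per_instruction=3
)
```

The cost of this loop grows with the product of the number of rephrasings and the number of samples per rephrasing, which is the joint scaling axis the abstract refers to.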
UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Authors: Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang Dai, Jialiang Wang, Zecheng He, Jianwei Yang, Chunyuan Li, Junzhe Sun, Chu Wang, Serena Yeung-Levy, Felix Juefei-Xu
First: 2026-02-12T18:59:49+00:00 · Latest: 2026-02-12T18:59:49+00:00
Abstract
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
中文标题/摘要
标题:UniT:统一多模态链式思维测试时扩展
统一模型可以在单一架构中处理多模态理解和生成,但通常在单次通过中运行,而不进行迭代细化。许多多模态任务,尤其是涉及复杂空间组合、多个相互作用的对象或不断变化的指令的任务,需要分解指令、验证中间结果并进行迭代修正。虽然测试时扩展(TTS)已经证明,为迭代推理分配额外的推理计算可以显著提高语言模型的性能,但将这一范式扩展到统一的多模态模型仍然是一个开放的挑战。我们引入了UniT,这是一种多模态链式思维测试时扩展的框架,使单一统一模型能够在多轮次中进行推理、验证和细化。UniT 结合了代理数据合成、统一模型训练和灵活的测试时推理,以引发包括验证、子目标分解和内容记忆在内的认知行为。我们的主要发现是:(1) 统一模型在短推理轨迹上的训练在测试时能够泛化到更长的推理链;(2) 顺序链式思维推理比并行采样提供了一种更可扩展和计算高效的TTS策略;(3) 在生成和编辑轨迹上的训练提高了分布外视觉推理能力。这些结果确立了多模态测试时扩展作为一种有效范式,以推进统一模型中的生成和理解。
Summary / 总结
The research aims to improve the performance of unified multimodal models by enabling iterative reasoning at test time. UniT, a framework for multimodal chain-of-thought test-time scaling, allows a single unified model to decompose instructions, verify intermediate results, and refine outputs iteratively. Key findings include the generalization of unified models to longer inference chains, the scalability and efficiency of sequential chain-of-thought reasoning, and the improvement in out-of-distribution visual reasoning through training on generation and editing trajectories.
研究旨在通过使统一模型能够进行迭代推理来提高其处理复杂多模态任务的能力。UniT 是一种多模态链式思维测试时扩展框架,允许统一模型分解指令、验证中间结果并进行多次迭代修正。主要发现包括统一模型在更长推理链上的泛化能力、顺序链式思维推理的可扩展性和计算效率,以及通过训练生成和编辑轨迹来提高离分布视觉推理能力。
AttentionRetriever: Attention Layers are Secretly Long Document Retrievers
Authors: David Jiahao Fu, Lam Thanh Do, Jiayu Li, Kevin Chen-Chuan Chang
First: 2026-02-12T18:59:35+00:00 · Latest: 2026-02-12T18:59:35+00:00
Abstract
Retrieval-augmented generation (RAG) has been widely adopted to help Large Language Models (LLMs) process tasks involving long documents. However, existing retrieval models are not designed for long documents and fail to address several key challenges of long document retrieval, including context-awareness, causal dependence, and scope of retrieval. In this paper, we propose AttentionRetriever, a novel long document retrieval model that leverages the attention mechanism and entity-based retrieval to build context-aware embeddings for long documents and determine the scope of retrieval. In extensive experiments, AttentionRetriever outperforms existing retrieval models on long document retrieval datasets by a large margin while remaining as efficient as dense retrieval models.
中文标题/摘要
标题:AttentionRetriever: 注意力层实际上是长文档检索器
检索增强生成(RAG)已被广泛采用,以帮助大型语言模型(LLMs)处理涉及长文档的任务。然而,现有的检索模型并未针对长文档检索进行设计,并且无法解决长文档检索中的几个关键挑战,包括上下文感知性、因果依赖性和检索范围。在本文中,我们提出了AttentionRetriever,这是一种新颖的长文档检索模型,利用注意力机制和基于实体的检索来构建长文档的上下文感知嵌入,并确定检索范围。通过广泛的实验,我们发现AttentionRetriever在长文档检索数据集上的性能显著优于现有检索模型,同时保持与密集检索模型相当的效率。
Summary / 总结
The research aims to address the limitations of existing retrieval models in handling long documents, such as context-awareness and causal dependence. AttentionRetriever, a novel model, uses attention mechanisms and entity-based retrieval to create context-aware embeddings and define the scope of retrieval. Experimental results show that AttentionRetriever outperforms existing models on long document retrieval datasets while maintaining efficiency comparable to dense retrieval models.
论文针对现有长文档检索模型存在的上下文感知不足和因果依赖性差等问题,提出了AttentionRetriever模型,该模型利用注意力机制和实体检索来构建上下文感知的嵌入并确定检索范围。实验结果显示,AttentionRetriever在长文档检索任务上的表现优于现有模型,同时保持与密集检索模型相当的效率。
Agentic Test-Time Scaling for WebAgents
Authors: Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
First: 2026-02-12T18:58:30+00:00 · Latest: 2026-02-12T18:58:30+00:00
Abstract
Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long horizons; and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents. We find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting, but that can overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
中文标题/摘要
标题:代理测试时缩放以适应网络代理
测试时缩放已成为提高神经网络模型性能和增强其可靠性的标准方法。然而,它在代理型、多步任务上的行为尚不完全清楚:小的每步误差会在长时间范围内累积;我们发现,简单地均匀增加采样会显示出递减的回报。在本文中,我们提出了CATTS,一种用于动态为多步代理分配计算资源的简单技术。我们首先对网络代理的推理时缩放进行了实证研究。我们发现,均匀增加每步计算资源在长时间环境中很快达到饱和。然后我们研究了更强的聚合策略,包括基于LLM的仲裁者,它可以超越简单的投票,但可以推翻高度一致的决策。我们展示了代理自身投票分布得出的不确定性统计(熵和top-1/top-2差距)与下游成功相关,并提供了一种实用的动态计算分配信号。基于这些发现,我们引入了基于信心的测试时缩放(CATTS),它仅在决策真正有争议时才使用投票得出的不确定性来分配计算资源。CATTS在WebArena-Lite和GoBrowse上的性能比React提高了最多9.1%,同时使用的令牌数量最多减少了2.3倍,提供了效率提升和可解释的决策规则。
Summary / 总结
This work addresses the challenge of test-time scaling for agentic, multi-step tasks, where small errors can accumulate over time. The authors introduce CATTS, a method that dynamically allocates compute based on decision uncertainty. Their empirical study shows that uniformly increasing per-step compute quickly saturates, while uncertainty-derived signals like entropy and margin provide practical signals for dynamic compute allocation. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% compared to React, using up to 2.3x fewer tokens.
本文针对多步任务中累积误差的问题,研究了测试时缩放的方法。作者提出了CATTS技术,根据代理自身投票的不确定性动态分配计算资源。实验证明,均匀增加每步计算在长时间环境中很快达到饱和,而CATTS在WebArena-Lite和GoBrowse上的性能提高了最多9.1%,且使用的令牌数量比均匀缩放少2.3倍,提供了效率提升和可解释的决策规则。
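The two vote-derived uncertainty statistics named in the abstract, entropy and the top-1/top-2 margin, are straightforward to compute from a list of sampled actions. The sketch below shows one way to turn them into a "spend more compute?" decision; the thresholds and the function names `vote_uncertainty` and `needs_more_compute` are illustrative assumptions, not values or APIs from the paper.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def vote_uncertainty(votes: Sequence[str]) -> Tuple[float, float]:
    """Return (entropy in nats, top-1 minus top-2 vote share) for one step."""
    if not votes:
        raise ValueError("votes must be non-empty")
    counts = Counter(votes)
    total = len(votes)
    shares = sorted((c / total for c in counts.values()), reverse=True)
    entropy = -sum(p * math.log(p) for p in shares)
    runner_up = shares[1] if len(shares) > 1 else 0.0
    return entropy, shares[0] - runner_up


def needs_more_compute(
    votes: Sequence[str],
    entropy_threshold: float = 0.5,  # hypothetical threshold
    margin_threshold: float = 0.4,   # hypothetical threshold
) -> bool:
    """Flag a step as contentious when the vote is spread out or the lead is thin."""
    entropy, margin = vote_uncertainty(votes)
    return entropy > entropy_threshold or margin < margin_threshold


unanimous = ["click(submit)"] * 5
split = ["click(submit)", "click(submit)", "type(query)", "type(query)", "scroll(down)"]
```

Under these assumed thresholds, a unanimous vote is accepted as-is, while a split vote is routed to extra sampling or a stronger aggregator.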
On-Policy Context Distillation for Language Models
Authors: Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, Furu Wei
First: 2026-02-12T18:58:28+00:00 · Latest: 2026-02-12T18:58:28+00:00
Abstract
Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
中文标题/摘要
标题:语言模型的在线策略上下文蒸馏
上下文蒸馏使语言模型能够将上下文中的知识内化到其参数中。在我们的工作中,我们提出了在线策略上下文蒸馏(OPCD)框架,该框架通过训练学生模型在其自身生成的轨迹上,并通过最小化逆Kullback-Leibler散度来与上下文条件化的教师进行对比,将在线策略蒸馏与上下文蒸馏相结合。我们展示了OPCD在两个重要应用中的有效性:经验知识蒸馏,其中模型从其历史解决方案轨迹中提取和提炼可转移的知识;系统提示蒸馏,其中模型内化编码在优化提示中的有益行为。在数学推理、基于文本的游戏和特定领域任务中,OPCD在所有情况下都优于基线方法,同时在保持分布外能力方面表现更好。我们还展示了OPCD在跨规模蒸馏中的有效性,其中较小的学生模型可以从较大的教师中内化经验知识。
Summary / 总结
The research aims to improve language models by enabling them to internalize in-context knowledge. The proposed On-Policy Context Distillation (OPCD) framework trains a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. The study demonstrates that OPCD performs better than baseline methods in experiential knowledge distillation and system prompt distillation, achieving higher task accuracy while preserving out-of-distribution capabilities. Additionally, OPCD allows smaller student models to effectively learn from larger teacher models.
研究旨在通过使语言模型能够内化上下文知识来提升它们。提出了On-Policy Context Distillation (OPCD)框架,该框架通过最小化学生模型自动生成轨迹与上下文条件教师之间的逆Kullback-Leibler散度来进行训练。研究在经验知识内化和系统提示内化两个应用中展示了OPCD的有效性。在各种任务中,OPCD优于基线方法,实现了更高的任务准确率并保留了分布外能力。此外,OPCD还支持大小模型之间的有效知识内化,使较小的模型能够从较大的模型中学习。
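The objective described above, reverse Kullback-Leibler divergence evaluated along the student's own trajectory, can be written down in a few lines. The sketch below only computes the loss value from already-available next-token distributions; sampling the trajectory, conditioning the teacher on the context, and taking gradients are omitted, and the function names are hypothetical.

```python
import math
from typing import Sequence


def reverse_kl(
    student_probs: Sequence[float],
    teacher_probs: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """KL(student || teacher) for a single next-token distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def trajectory_loss(
    student_steps: Sequence[Sequence[float]],
    teacher_steps: Sequence[Sequence[float]],
) -> float:
    """Mean per-token reverse KL along one student-generated trajectory.

    student_steps[t] is the student's distribution at step t (no context);
    teacher_steps[t] is the context-conditioned teacher's distribution over
    the same prefix. Both are assumed to be given.
    """
    if not student_steps or len(student_steps) != len(teacher_steps):
        raise ValueError("need matching, non-empty step lists")
    total = sum(reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps))
    return total / len(student_steps)
```

Because the expectation is taken under the student, probability mass the student places where the teacher assigns little is penalized heavily, which is the usual mode-seeking behavior of reverse KL.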
MonarchRT: Efficient Attention for Real-Time Video Generation
Authors: Krish Agarwal, Zhuoming Chen, Cheng Luo, Yongqi Chen, Haizhong Zheng, Xun Huang, Atri Rudra, Beidi Chen
First: 2026-02-12T18:56:53+00:00 · Latest: 2026-02-12T18:56:53+00:00
Abstract
Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of parameterization through finetuning, with custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly-capable sparse attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs respectively, providing kernel speedups in the range of 1.4-11.8X. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
中文标题/摘要
标题:MonarchRT:实时视频生成的高效注意力机制
实时视频生成中的扩散变换器受到三维自注意力的二次成本限制,尤其是在既少步又自回归的实时环境中,错误会随时间累积,每个去噪步骤必须携带更多的信息。在这种环境中,我们发现先前的稀疏注意力近似失效,尽管在双向、多步扩散中表现出色。具体来说,我们观察到视频注意力不是可靠的稀疏,而是由时空位置驱动的显著周期性结构与动态稀疏语义对应和密集混合相结合,超过了甚至先验top-k注意力的表示能力。基于这一洞察,我们提出了Monarch-RT,这是一种用于视频扩散模型的结构化注意力参数化,通过Monarch矩阵分解注意力。通过适当的块结构对齐和我们扩展的Tiled Monarch参数化,我们实现了高表达性同时保持计算效率。我们进一步通过微调克服了参数化的开销,使用了自定义的Triton内核。我们首先验证了Monarch-RT在现有仅针对双向模型设计的稀疏基线中的高有效性。我们还观察到,当应用于最先进的模型Self-Forcing时,Monarch-RT可以达到95%的注意力稀疏性,而不会损失质量,使Monarch-RT成为实时视频生成中高能力稀疏注意力参数化的开创性工作。我们的优化实现分别在Nvidia RTX 5090、H100和B200 GPU上优于FlashAttention-2、FlashAttention-3和FlashAttention-4内核,提供了1.4-11.8倍的内核加速。这使我们首次能够在单个RTX 5090上以16 FPS实现真正的实时视频生成。
Summary / 总结
MonarchRT addresses the challenge of real-time video generation by proposing a structured attention parameterization that factorizes attention using Monarch matrices, achieving high expressivity while maintaining computational efficiency. The method overcomes the limitations of sparse-attention approximations and outperforms existing FlashAttention kernels, enabling true real-time video generation at 16 FPS on a single RTX 5090. MonarchRT attains up to 95% attention sparsity without compromising quality when applied to the state-of-the-art model Self-Forcing, demonstrating its effectiveness in real-time regimes.
MonarchRT 提出了一种结构化的注意力参数化方法,通过使用 Monarch 矩阵分解注意力,实现了高表达性的同时保持计算效率。该方法克服了稀疏注意力近似方法的局限性,并超越了现有的 FlashAttention 内核,使得首次在单个 RTX 5090 上实现了每秒 16 帧的真实时间视频生成。当应用于最先进的模型 Self-Forcing 时,MonarchRT 可以达到 95% 的注意力稀疏性而不影响质量,展示了其在真实时间环境中的有效性。
CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
Authors: Zhen Zhang, Kaiqiang Song, Xun Wang, Yebowen Hu, Weixiang Yan, Chenyang Zhao, Henry Peng Zou, Haoyun Deng, Sathish Reddy Indurthi, Shujian Liu, Simin Ma, Xiaoyang Wang, Xin Eric Wang, Song Wang
First: 2026-02-12T18:55:09+00:00 · Latest: 2026-02-12T18:55:09+00:00
Abstract
AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau^-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
中文标题/摘要
标题:CM2:使用检查表奖励的多轮多步代理工具使用强化学习
AI代理越来越多地用于通过推理多轮用户交互和调用外部工具来解决实际任务。然而,在这种设置中应用强化学习仍然很困难:现实目标往往缺乏可验证的奖励,而是强调开放性行为;此外,多轮、多步代理工具使用中的RL仍然未被充分探索;而且构建和维护可执行的工具环境成本高昂,限制了规模和覆盖面。我们提出了CM2,这是一种RL框架,用检查表奖励取代了可验证的结果奖励。CM2将每轮预期行为细分为具有明确证据基础和结构化元数据的细粒度二元标准,将开放性判断转化为更稳定的分类决策。为了平衡稳定性和信息量,我们的方法采用了稀疏奖励分配但密集评估标准的策略。训练在可扩展的LLM模拟工具环境中进行,避免了为大型工具集进行大量工程工作。实验表明,CM2在tau^-Bench上比监督微调提高了8个点,在BFCL-V4上提高了10个点,在ToolSandbox上提高了12个点。结果与或甚至超过了同样规模的开源基线,包括判断模型。因此,CM2提供了一种无需依赖可验证奖励来优化多轮多步工具使用代理的可扩展方法。开源社区提供的代码:https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
Summary / 总结
The paper proposes CM2, a reinforcement learning framework that uses checklist rewards to address the challenges of training multi-turn and multi-step agentic tool-using AI agents. By decomposing each turn's behavior into binary criteria with explicit evidence, CM2 transforms open-ended judgments into more stable classification decisions. Experiments show that CM2 outperforms supervised fine-tuning on multiple benchmarks, improving by 8 points on tau^-Bench, 10 points on BFCL-V4, and 12 points on ToolSandbox, matching or outperforming similar-sized open-source baselines.
该论文提出了CM2,一种使用清单奖励的强化学习框架,旨在解决训练处理多轮和多步骤与外部工具交互的AI代理所面临的挑战。通过将每轮行为分解为二元标准,并采用稀疏奖励策略,CM2在多个基准测试中优于监督微调方法,包括tau^-Bench、BFCL-V4和ToolSandbox,取得了显著的提升。
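A checklist reward of the kind described above can be pictured as a set of binary criteria, each with the evidence the judge cited, reduced to a single scalar. The sketch below uses the fraction of satisfied criteria as that scalar; this aggregation rule and the names `ChecklistItem` and `checklist_reward` are assumptions for illustration, since the abstract does not specify how CM2 combines criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChecklistItem:
    criterion: str  # what the turn was supposed to do
    passed: bool    # binary judgment for this criterion
    evidence: str   # span of the transcript the judgment is grounded in


def checklist_reward(items: List[ChecklistItem]) -> float:
    """Fraction of criteria satisfied (hypothetical aggregation rule)."""
    if not items:
        raise ValueError("checklist must contain at least one item")
    return sum(item.passed for item in items) / len(items)


example_turn = [
    ChecklistItem("calls the search tool before answering", True, "tool_call: search(...)"),
    ChecklistItem("passes the user's date as an argument", True, "args: {date: ...}"),
    ChecklistItem("asks for confirmation before booking", False, "no confirmation request found"),
]
turn_reward = checklist_reward(example_turn)
```

Keeping each criterion binary is what turns an open-ended quality judgment into a series of classification-style decisions, at the cost of needing a well-written checklist per turn.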
T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization
Authors: Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas
First: 2026-02-12T18:52:35+00:00 · Latest: 2026-02-12T18:52:35+00:00
Abstract
Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at https://github.com/Tyrion58/T3D.
中文标题/摘要
标题:T3D:通过轨迹自蒸馏与直接判别优化的少量步骤扩散语言模型
扩散大型语言模型(DLLMs)有可能通过并行解码多个标记来实现快速文本生成。然而,在实践中,它们的推理效率受到需要许多细化步骤的限制,而大幅减少步骤会导致生成质量显著下降。为了解决这个问题,我们提出了一种轨迹自蒸馏框架,通过蒸馏模型自身的生成轨迹来改进少量步骤解码。我们结合了直接判别优化(DDO),这是一种反向KL目标,促进模式寻求蒸馏,并鼓励学生集中于高概率教师模式。在各种基准测试中,我们的方法在紧缩步骤预算下始终优于强大的少量步骤基线和标准训练。尽管全步骤解码仍然更优,但我们显著缩小了差距,为实用的少量步骤DLLMs奠定了坚实的基础。源代码可在https://github.com/Tyrion58/T3D获取。
Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
Authors: Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu
First: 2026-02-12T18:49:27+00:00 · Latest: 2026-02-12T18:49:27+00:00
Abstract
Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations directly from data, without modeling the multi-step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.
中文标题/摘要
标题:像科学家一样思考:基于物理的LLM代理进行方程发现
通过符号、可解释的公式来解释观察到的现象是科学的基本目标。近年来,大型语言模型(LLMs)因其广泛的领域知识和强大的推理能力,已成为符号方程发现的有前途的工具。然而,现有的大多数基于LLM的系统试图直接从数据中猜测方程,而没有建模科学家通常遵循的多步推理过程:首先推断诸如对称性等物理属性,然后使用这些属性作为先验来限制候选方程的空间。我们引入了开普勒代理,这是一种明确遵循这一科学推理过程的代理框架。该代理协调基于物理的工具来提取中间结构,并使用这些结果来配置符号回归引擎,如PySINDy和PySR,包括它们的功能库和结构约束。在一系列物理方程基准测试中,开普勒代理在符号准确性方面显著高于LLM和传统基线,并且对噪声数据具有更强的鲁棒性。
Summary / 总结
The research aims to enhance the ability of large language models to discover symbolic equations by incorporating a physics-guided reasoning process. KeplerAgent, an agent-based framework, follows this process by first inferring physical properties and then using them to restrict the space of candidate equations. Across various physical equation benchmarks, KeplerAgent demonstrates higher symbolic accuracy and better robustness to noisy data compared to both LLMs and traditional methods.
研究旨在通过模仿科学家的推理过程来增强大型语言模型(LLMs)发现符号方程的能力。KeplerAgent是一种基于代理的框架,明确地通过首先推断物理属性,然后使用这些属性作为先验来限制候选方程的空间来建模这一过程。这种方法在各种物理方程基准测试中比LLMs和传统方法都具有更高的符号准确性和更好的鲁棒性。
Privacy Risks in Time Series Forecasting: User- and Record-Level Membership Inference
Authors: Nicolas Johansson, Tobias Olsson, Daniel Nilsson, Johan Östman, Fazeleh Hoseini
First: 2025-09-04T12:43:45+00:00 · Latest: 2026-02-12T18:46:20+00:00
Abstract
Membership inference attacks (MIAs) aim to determine whether specific data were used to train a model. While extensively studied on classification models, their impact on time series forecasting remains largely unexplored. We address this gap by introducing two new attacks: (i) an adaptation of multivariate LiRA, a state-of-the-art MIA originally developed for classification models, to the time-series forecasting setting, and (ii) a novel end-to-end learning approach called Deep Time Series (DTS) attack. We benchmark these methods against adapted versions of other leading attacks from the classification setting.
We evaluate all attacks in realistic settings on the TUH-EEG and ELD datasets, targeting two strong forecasting architectures, LSTM and the state-of-the-art N-HiTS, under both record- and user-level threat models. Our results show that forecasting models are vulnerable, with user-level attacks often achieving perfect detection. The proposed methods achieve the strongest performance in several settings, establishing new baselines for privacy risk assessment in time series forecasting. Furthermore, vulnerability increases with longer prediction horizons and smaller training populations, echoing trends observed in large language models.
中文标题/摘要
标题:时间序列预测中的隐私风险:用户级和记录级成员推断
成员推断攻击(MIAs)旨在确定特定数据是否被用于训练模型。尽管在分类模型上进行了广泛研究,但它们对时间序列预测的影响仍很大程度上未被探索。我们通过引入两种新的攻击方法来填补这一空白:(i) 将当前最先进的MIA方法多变量LiRA的适应版本应用于时间序列预测设置,以及(ii) 一种名为深度时间序列(DTS)攻击的端到端学习方法。我们使用来自分类设置的其他领先攻击方法的适应版本对这些方法进行了基准测试。
我们在现实环境中对TUH-EEG和ELD数据集上的所有攻击方法进行了评估,针对两种强大的预测架构LSTM和最先进的N-HiTS,在记录级和用户级威胁模型下进行。我们的结果表明,预测模型存在漏洞,用户级攻击往往能够实现完美的检测。所提出的方法在多种情况下表现出最强的性能,为时间序列预测中的隐私风险评估建立了新的基准。此外,预测时间越长和训练样本越少,漏洞越严重,这与大型语言模型中观察到的趋势一致。
Summary / 总结
This study investigates privacy risks in time series forecasting models through membership inference attacks. The research introduces two new attacks: an adapted multivariate LiRA method and a novel Deep Time Series (DTS) attack. Evaluations on the TUH-EEG and ELD datasets show that forecasting models are vulnerable to these attacks, with user-level attacks achieving perfect detection in many cases. The study highlights that vulnerability increases with longer prediction horizons and smaller training populations.
该研究通过成员推理攻击探讨时间序列预测模型中的隐私风险,引入了两种新攻击方法:适应后的多变量LiRA方法和新型Deep Time Series (DTS) 攻击。在TUH-EEG和ELD数据集上的评估显示,这些攻击可以使预测模型受到威胁,尤其是在用户级攻击中,检测准确率往往达到完美。研究还指出,随着预测时间窗口的延长和训练数据量的减少,模型的脆弱性会增加。
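For readers unfamiliar with LiRA, the core of the attack is a likelihood-ratio test on a per-example statistic. The sketch below shows the univariate form: Gaussians are fitted to the statistic under shadow models trained with and without the example, and the observed value is scored under both. Using a scalar forecasting error as the statistic is an assumption, and this is not the multivariate adaptation the paper proposes.

```python
import math
import statistics
from typing import Sequence


def gaussian_logpdf(x: float, mean: float, std: float) -> float:
    """Log density of a normal distribution at x."""
    var = std * std
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def lira_score(
    observed: float,
    in_stats: Sequence[float],
    out_stats: Sequence[float],
    min_std: float = 1e-6,
) -> float:
    """Log-likelihood ratio; positive values favor 'member'.

    in_stats / out_stats hold the same statistic (here: forecasting error)
    from shadow models trained with / without the example. Each needs at
    least two values so a standard deviation can be estimated.
    """
    mu_in, mu_out = statistics.mean(in_stats), statistics.mean(out_stats)
    sd_in = max(statistics.stdev(in_stats), min_std)
    sd_out = max(statistics.stdev(out_stats), min_std)
    return gaussian_logpdf(observed, mu_in, sd_in) - gaussian_logpdf(observed, mu_out, sd_out)


# Made-up illustrative numbers, not results from the paper.
errors_when_member = [0.10, 0.12, 0.11, 0.09]
errors_when_non_member = [0.30, 0.28, 0.33, 0.31]
```

A user-level variant would typically aggregate such scores over all records belonging to one user before thresholding, which is one intuition for why user-level attacks can be stronger than record-level ones.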
Do language models accommodate their users? A study of linguistic convergence
Authors: Terra Blevins, Susanne Schmalwieser, Benjamin Roth
First: 2025-08-05T09:55:40+00:00 · Latest: 2026-02-12T18:46:13+00:00
Comments: EACL 2026
Abstract
While large language models (LLMs) are generally considered proficient in generating language, how similar their language usage is to that of humans remains understudied. In this paper, we test whether models exhibit linguistic convergence, a core pragmatic element of human language communication: do models adapt, or converge, to the linguistic patterns of their user? To answer this, we systematically compare model completions of existing dialogues to original human responses across sixteen language models, three dialogue corpora, and various stylometric features. We find that models strongly converge to the conversation's style, often significantly overfitting relative to the human baseline. While convergence patterns are often feature-specific, we observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained and smaller counterparts. Given the differences in human and model convergence patterns, we hypothesize that the underlying mechanisms driving these behaviors are very different.
中文标题/摘要
标题:语言模型是否适应用户?一种语言趋同研究
虽然大型语言模型(LLMs)通常被认为在生成语言方面表现出色,但它们的语言使用与人类的相似程度仍是一个未被充分研究的问题。在本文中,我们测试语言模型是否表现出语言趋同,这是人类语言交流的核心语用要素:模型是否会适应或趋同于用户的语言模式?为了回答这个问题,我们系统地比较了十六种语言模型、三种对话语料库和各种文体特征下模型对现有对话的完成与原始人类回应之间的差异。我们发现,模型强烈趋同于对话的风格,通常显著超越人类基准。虽然趋同模式往往是特征特定的,但我们观察到在不同的建模环境中趋同模式的一致性变化,指令调整和较大的模型比预训练和较小的模型趋同程度更低。鉴于人类和模型趋同模式之间的差异,我们假设驱动这些行为的底层机制非常不同。
Summary / 总结
This study investigates whether large language models (LLMs) adapt their language usage to match that of their human users, a phenomenon known as linguistic convergence. By comparing model-generated responses to original human responses in various dialogues and using different stylometric features, the researchers found that models strongly adapt to the conversation's style, often overfitting compared to human responses. However, the degree of convergence varies across different models and settings, with instruction-tuned and larger models showing less convergence than their pretrained and smaller counterparts.
本研究探讨大型语言模型(LLMs)是否能调整其语言使用方式以匹配人类用户,这一现象称为语言趋同。通过在多种对话中比较模型生成的回复与原始人类回复,并使用多种文体特征,研究发现模型显著适应对话的风格,往往超过人类的回复。然而,不同模型的趋同程度有所不同,指令调优和较大的模型显示出较少的趋同,相比之下,预训练和较小的模型则表现出更多的趋同。
EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph
Authors: Nan Jiang, Ziyi Wang, Yexiang Xue
Venue: ICLR 2026
First: 2025-11-08T04:39:11+00:00 · Latest: 2026-02-12T18:38:11+00:00
Comments: Camera-ready version accepted for ICLR 2026
Abstract
Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, which is an important task in AI-driven scientific discovery. Yet the exponential growth of the expression search space renders the task computationally challenging. A promising yet underexplored direction for reducing the search space and accelerating training lies in *symbolic equivalence*: many expressions, although syntactically different, define the same function -- for example, $\log(x_1^2x_2^3)$, $\log(x_1^2)+\log(x_2^3)$, and $2\log(x_1)+3\log(x_2)$. Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning. We introduce EGG-SR, a unified framework that integrates symbolic equivalence into a class of modern symbolic regression methods, including Monte Carlo Tree Search (MCTS), Deep Reinforcement Learning (DRL), and Large Language Models (LLMs). EGG-SR compactly represents equivalent expressions through the proposed EGG module (via equality graphs), accelerating learning by: (1) pruning redundant subtree exploration in EGG-MCTS, (2) aggregating rewards across equivalent generated sequences in EGG-DRL, and (3) enriching feedback prompts in EGG-LLM. Theoretically, we show the benefit of embedding EGG into learning: it tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirically, EGG-SR consistently enhances a class of symbolic regression models across several benchmarks, discovering more accurate expressions within the same time limit. The project page is at: https://nan-jiang-group.github.io/egg-sr.
Summary / 总结
EGG-SR is a framework that integrates symbolic equivalence into modern symbolic regression methods to reduce the search space and accelerate learning. It uses an EGG module to represent equivalent expressions through equality graphs, which helps in pruning redundant exploration in MCTS, aggregating rewards in DRL, and enriching feedback prompts in LLMs. Experiments show that EGG-SR improves the performance of symbolic regression models across various benchmarks, leading to more accurate expressions within the same time limit.
EGG-SR 是一个框架,将符号等价性整合到 MCTS、DRL 和 LLM 等符号回归方法中,以减少搜索空间并加速学习。它使用 EGG 模块通过等式图紧凑地表示等价表达式,有助于剪枝冗余探索并聚合奖励。实验表明,EGG-SR 能够在各种基准测试中提高符号回归模型的性能,从而在相同的时间限制内发现更准确的表达式。
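To make the symbolic-equivalence example from the abstract concrete, the following minimal sketch checks numerically that the three logarithmic forms agree on positive inputs. It is not the EGG module, which the abstract says relies on equality graphs; the function names, sampling range, and tolerance are illustrative assumptions, and random sampling can only suggest equivalence, not prove it.

```python
import math
import random

# The three syntactically different forms quoted in the abstract.
def f1(x1, x2):
    return math.log(x1**2 * x2**3)

def f2(x1, x2):
    return math.log(x1**2) + math.log(x2**3)

def f3(x1, x2):
    return 2 * math.log(x1) + 3 * math.log(x2)

def numerically_equivalent(funcs, trials=1000, seed=0, tol=1e-9):
    """Return True if all functions agree on random positive inputs (hypothetical helper)."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Positive inputs only: the identities require x1 > 0 and x2 > 0.
        x1 = rng.uniform(0.1, 10.0)
        x2 = rng.uniform(0.1, 10.0)
        values = [f(x1, x2) for f in funcs]
        if any(not math.isclose(v, values[0], rel_tol=tol, abs_tol=tol) for v in values):
            return False
    return True
```

A search procedure that treats f1, f2, and f3 as distinct candidates evaluates the same function three times, which is the redundancy the abstract describes.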
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Authors: Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
First: 2026-02-12T18:36:09+00:00 · Latest: 2026-02-12T18:36:09+00:00
Abstract
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with fewer than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
中文标题/摘要
标题:"对不起,我没有听清楚那句话": 为何语音模型忽视了最重要的内容
尽管语音识别系统在标准基准上的单词错误率较低,但在实际部署中,它们往往在短且高风险的口头表达上失败。在这里,我们研究了这种失败模式在一项高风险任务中的表现:美国参与者说出的美国街道名称的转录。我们评估了来自OpenAI、Deepgram、Google和Microsoft的15个模型在来自语言多样化的美国发言者录音上的表现,并发现平均转录错误率为44%。我们通过地理区域量化了转录失败的下游影响,并表明错误转录系统性地影响了所有发言者,但非英语母语发言者的路由距离错误是英语母语发言者的两倍。为了减轻这种伤害,我们引入了一种合成数据生成方法,使用开源文本转语音模型生成命名实体的多样化发音。使用不到1000个合成样本进行微调后,非英语母语发言者的街道名称转录准确性提高了近60%(相对于基线模型)。我们的结果突显了语音系统基准性能与实际可靠性之间的关键差距,并展示了减少高风险转录错误的一个简单且可扩展的途径。
ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction
Authors: Nick Ferguson, Josh Pennington, Narek Beghian, Aravind Mohan, Douwe Kiela, Sheshansh Agrawal, Thien Hang Nguyen
First: 2026-02-12T18:31:37+00:00 · Latest: 2026-02-12T18:31:37+00:00
Abstract
Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end-to-end benchmark evaluates PDF-to-JSON extraction under enterprise-scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open-source benchmark and evaluation framework for PDF-to-JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and human-annotated gold labels across economically valuable domains, yielding 12,867 evaluatable fields spanning schema complexities from tens to hundreds of fields. The evaluation framework treats the schema as an executable specification: each field declares its scoring metric. Baseline evaluations reveal that frontier models (GPT-5/5.2, Gemini-3 Flash/Pro, Claude 4.5 Opus/Sonnet) remain unreliable on realistic schemas. Performance degrades sharply with schema breadth, culminating in 0% valid output on a 369-field financial reporting schema across all tested models. We release ExtractBench at https://github.com/ContextualAI/extract-bench.
中文标题/摘要
标题:ExtractBench:复杂结构化提取的基准和评估方法
未结构化的文档如PDF包含有价值的信息,但下游系统需要这些数据以可靠且标准化的形式存在。LLMs越来越多地被部署以自动化此提取过程,因此准确性和可靠性至关重要。然而,进展受到两个瓶颈的阻碍。首先,没有端到端的基准可以评估大规模企业级模式下的PDF到JSON提取。其次,没有系统的方法来捕捉嵌套提取的语义,其中字段需要不同的正确性概念(标识符的精确匹配,数量的容忍度,名称的语义等价),数组需要对齐,遗漏必须与幻觉区分开。我们通过ExtractBench解决了这两个问题,这是一个开源的PDF到JSON结构化提取基准和评估框架。基准测试将35份PDF文档与经济价值领域的JSON模式和人工标注的金标准标签配对,产生了涵盖从几十到几百字段的12,867个可评估字段。评估框架将模式视为可执行规范:每个字段声明其评分标准。基线评估表明,前沿模型(GPT-5/5.2,Gemini-3 Flash/Pro,Claude 4.5 Opus/Sonnet)在现实模式下仍然不可靠。随着模式范围的增加,性能急剧下降,所有测试模型在包含369个字段的财务报告模式下均无有效输出。我们已在https://github.com/ContextualAI/extract-bench/发布了ExtractBench。
Summary / 总结
ExtractBench addresses the lack of an end-to-end benchmark for evaluating PDF-to-JSON extraction and introduces a principled methodology for nested extraction semantics. It includes 35 PDF documents with JSON schemas and human-annotated gold labels, covering 12,867 fields across various domains. Evaluations show that leading models struggle with realistic schemas, with performance dropping significantly as schema complexity increases, producing no valid output on a 369-field schema.
ExtractBench 是一个用于 PDF 到 JSON 结构化提取的基准和评估框架,解决了缺乏端到端基准和规范方法的问题。它包含 35 份 PDF 文档、JSON 架构和人工标注的黄金标签,涵盖了 12,867 个字段,覆盖了多个领域。基线评估显示,最先进的模型在现实架构面前表现不佳,随着架构复杂性的增加,性能急剧下降,最终在包含 369 个字段的财务报告架构上没有任何有效输出。
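The idea of treating the schema as an executable specification, where each field declares its own scoring metric, can be sketched as follows. This is a hedged illustration and not the ExtractBench implementation: the metric names, the tolerance value, the example fields, and the whitespace-and-case normalization used as a stand-in for semantic equivalence are all assumptions.

```python
import math

def exact_match(pred, gold):
    # Identifiers must match exactly.
    return pred == gold

def numeric_tolerance(pred, gold, rel_tol=0.01):
    # Quantities are accepted within a relative tolerance (1% here, an assumed value).
    try:
        return math.isclose(float(pred), float(gold), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return False

def normalized_text(pred, gold):
    # Crude stand-in for semantic equivalence: ignore case and extra whitespace.
    if not isinstance(pred, str) or not isinstance(gold, str):
        return False
    return " ".join(pred.lower().split()) == " ".join(gold.lower().split())

METRICS = {"exact": exact_match, "tolerance": numeric_tolerance, "text": normalized_text}

# Hypothetical schema: each field names the metric used to score it.
SCHEMA = {
    "invoice_id": {"metric": "exact"},
    "total": {"metric": "tolerance"},
    "vendor": {"metric": "text"},
}

def score(pred, gold, schema=SCHEMA):
    """Label every field as correct, incorrect, omitted, or hallucinated."""
    report = {}
    for field, spec in schema.items():
        if field not in gold:
            continue
        if field not in pred or pred[field] is None:
            report[field] = "omitted"
        elif METRICS[spec["metric"]](pred[field], gold[field]):
            report[field] = "correct"
        else:
            report[field] = "incorrect"
    # Predicted fields with no gold counterpart are flagged separately from omissions.
    for field in pred:
        if field not in gold:
            report[field] = "hallucinated"
    return report
```

Keeping omission and hallucination as separate labels reflects the distinction the abstract calls out; array alignment, which the abstract also mentions, is not covered by this sketch.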
Towards Autonomous Mathematics Research
Authors: Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong
First: 2026-02-10T18:50:15+00:00 · Latest: 2026-02-12T18:27:29+00:00
Comments: 35 pages. Accompanying blog post: https://deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think/
Abstract
Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem-solving to professional research, however, requires navigating vast literature and constructing long-horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems, a novel inference-time scaling law that extends beyond Olympiad-level problems, and intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD-level exercises and, most notably, through several distinct milestones in AI-assisted mathematics research: (a) a research paper (Feng26) generated by AI without any human intervention in calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human-AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) an extensive semi-autonomous evaluation (Feng et al., 2026a) of 700 open problems on Bloom's Erdos Conjectures database, including autonomous solutions to four open questions. To help the public better understand developments in AI and mathematics, we suggest quantifying standard levels of autonomy and novelty of AI-assisted results, and propose a novel concept of human-AI interaction cards for transparency. We conclude with reflections on human-AI collaboration in mathematics and share all prompts as well as model outputs at https://github.com/google-deepmind/superhuman/tree/main/aletheia.
中文标题/摘要
标题:迈向自主数学研究
近期基础模型的进展产生了能够在国际数学奥林匹克竞赛中获得金牌水平的推理系统。然而,从竞赛级别的问题解决过渡到专业研究,需要导航大量的文献并构建长期的证明。在本工作中,我们介绍了Aletheia,这是一种能够迭代生成、验证和修订自然语言解决方案的数学研究代理。具体而言,Aletheia 由为具有挑战性的推理问题提供动力的Gemini Deep Think的高级版本、一种扩展到奥林匹克级别以上问题的新型推理时扩展法则以及密集的工具使用来应对数学研究的复杂性。我们展示了Aletheia 从奥林匹克问题到博士级别的练习的能力,并且特别地,通过AI辅助数学研究的几个不同里程碑:(a) 一篇由AI生成的研究论文(Feng26),在没有人类干预的情况下计算了算术几何中称为特征权值的某些结构常数;(b) 一篇展示了人类与AI合作证明相互作用粒子系统(称为独立集)的界的研究论文(LeeSeo26);(c) 对Bloom的Erdos猜想数据库中的700个开放问题进行广泛的半自主评估(Feng et al., 2026a),包括自主解决四个开放问题。为了帮助公众更好地理解AI和数学的发展,我们建议量化AI辅助结果的标准自主性和新颖性水平,并提出一种新的透明度概念——人类与AI互动卡片。我们以反思数学中的人机合作结束,并在https://github.com/google-deepmind/superhuman/tree/main/aletheia/ 公开所有提示以及模型输出。
Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
Authors: Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko
First: 2026-02-12T18:15:08+00:00 · Latest: 2026-02-12T18:15:08+00:00
Comments: Accepted to EACL 2026 Student Research Workshop. 14 pages, 6 tables, 1 figure
Abstract
Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define *token overflow* as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
中文标题/摘要
标题:检测压缩词表示中的溢出以增强检索生成
对于当前的大语言模型(LLMs),特别是在资源受限的环境中,高效处理长上下文仍然是一个关键挑战。软压缩架构通过用一组学习到的压缩词替换长词序列,承诺可以延长有效的上下文长度。然而,压缩的极限——以及何时压缩开始消除与任务相关的内容——仍然没有得到充分探索。在本文中,我们将压缩表示不再包含足够信息来回答给定查询的阶段定义为“词溢出”,并提出了一种方法来表征和检测它。在xRAG软压缩设置中,我们发现查询无关的饱和统计量可靠地将压缩表示与未压缩表示区分开来,提供了一种实用工具来识别压缩词,但显示出有限的溢出检测能力。通过对查询和上下文xRAG表示的轻量级探针分类器在HotpotQA、SQuADv2和TriviaQA数据集上平均0.72 AUC-ROC检测溢出,表明结合查询信息可以提高检测性能。这些结果从查询无关的诊断发展到查询感知的检测器,使低成本的预LLM门控能够减轻压缩引起的错误。
Summary / 总结
This paper addresses efficient long-context processing in large language models by defining token overflow as a regime in which compressed representations no longer contain sufficient information to answer a query. In the xRAG setting, query-agnostic saturation statistics reliably separate compressed from uncompressed token representations but detect overflow only to a limited extent, whereas lightweight probing classifiers over both query and context representations reach an average AUC-ROC of 0.72 on HotpotQA, SQuADv2, and TriviaQA. Incorporating query information thus improves detection, moving from query-independent diagnostics to query-aware detectors.
本文通过定义压缩表示在无法提供足够信息回答查询时的“token overflow”现象,来解决大型语言模型中高效处理长上下文的挑战。作者提出了一种使用饱和统计和轻量级探针分类器检测该问题的方法,这些方法在HotpotQA、SQuADv2和TriviaQA数据集上的平均AUC-ROC为0.72。查询信息显著提高了检测性能,从独立于查询的诊断方法进展到基于查询的检测器。
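For readers less familiar with the reported metric, the sketch below computes AUC-ROC directly from its pairwise definition: the probability that a randomly chosen positive example (here, an overflow case) receives a higher detector score than a randomly chosen negative one, with ties counted as one half. The function name and the example scores are illustrative assumptions and are unrelated to the paper's data.

```python
def auc_roc(scores, labels):
    """AUC-ROC via pairwise comparison of positive and negative scores.

    scores: detector outputs, higher meaning "more likely overflow".
    labels: 1 for overflow, 0 for no overflow.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))
```

Under this reading, an AUC-ROC of 0.72 means the detector ranks an overflow case above a non-overflow case in about 72% of such pairs, while 0.5 corresponds to chance.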
Evaluating LLM Reasoning Beyond Correctness and CoT
Authors: Soheil Abbasloo
First: 2025-10-20T22:08:59+00:00 · Latest: 2026-02-12T18:07:50+00:00
Abstract
What does it truly mean for a language model to "reason"? Current evaluations reward models' correct standalone answers, but correctness alone reveals little about the process that produced them. We argue that reasoning should be understood not as a static chain of steps but as a dynamic trajectory in which ideas interact, clash, and evolve into integrated insights. Building on the philosophical tradition of dialectics, we introduce SIEV, a structured evaluation framework that assesses reasoning through explicit thesis-antithesis-synthesis interactions. SIEV produces interpretable trajectories that highlight key properties of reasoning (robustness to challenge, adaptability under conflict, and synthesis across competing viewpoints), dimensions that conventional correctness-based metrics cannot capture. Empirical results on GSM and MMLU demonstrate substantial gaps in the reasoning abilities of state-of-the-art models: for example, GPT-5-chat loses more than 40 points (out of 100) on GSM when evaluated through SIEV's process-oriented lens. By shifting focus from what answer a model gives to how it arrives there, SIEV enables a more transparent and principled distinction between structured reasoning and surface-level pattern generation, offering a clearer foundation for assessing and understanding the reasoning capabilities of LLMs.
中文标题/摘要
标题:超越正确性和思维过程评估大语言模型的推理能力
语言模型究竟如何“推理”意味着什么?当前的评估奖励模型的正确独立答案,但仅凭正确性无法揭示其背后的推理过程。我们认为,推理不应被视为静态的步骤链,而应被视为一种动态轨迹,在其中思想相互作用、碰撞并发展成综合见解。基于辩证法的哲学传统,我们引入了SIEV结构化评估框架,通过明确的论题-反题-合题互动来评估推理。SIEV生成可解释的轨迹,突出推理的关键属性——面对挑战的稳健性、冲突下的适应性以及观点竞争中的综合能力——这些维度是传统基于正确性的度量无法捕捉到的。在GSM和MMLU上的实证结果表明,最先进的模型在推理能力上存在显著差距:例如,GPT-5-chat在通过SIEV的过程导向视角评估时,得分下降超过40分(满分100分)。通过将焦点从模型给出的答案转向其如何得出答案,SIEV使结构化推理与表面模式生成之间的区分更加透明和原则化,为评估和理解大语言模型的推理能力提供了更清晰的基础。
Summary / 总结
The paper aims to redefine the concept of reasoning in language models by moving beyond mere correctness to evaluate the dynamic process of reasoning. It introduces SIEV, a structured evaluation framework based on dialectical philosophy, which assesses reasoning through explicit interactions of thesis, antithesis, and synthesis. Key findings show significant gaps in the reasoning abilities of state-of-the-art models, such as GPT-5-chat, when evaluated through SIEV's process-oriented lens, highlighting robustness to challenges, adaptability under conflict, and synthesis across competing viewpoints as crucial dimensions of reasoning that conventional metrics fail to capture.
论文旨在超越正确性评估语言模型的推理能力。它引入了基于辩证法的SIEV结构化评估框架,通过明确的论题-反题-合题互动来评估推理。关键发现表明,如GPT-5-chat等最先进的模型在通过SIEV的过程导向评估时,在应对挑战的稳健性和冲突下的适应性等方面存在显著差距,而这些是传统基于正确性的度量无法捕捉到的。
Diffusion Alignment Beyond KL: Variance Minimisation as Effective Policy Optimiser
Authors: Zijing Ou, Jacob Si, Junyi Zhu, Ondrej Bohdal, Mete Ozay, Taha Ceritli, Yingzhen Li
First: 2026-02-12T18:06:03+00:00 · Latest: 2026-02-12T18:06:03+00:00
Abstract
Diffusion alignment adapts pretrained diffusion models to sample from reward-tilted distributions along the denoising trajectory. This process naturally admits a Sequential Monte Carlo (SMC) interpretation, where the denoising model acts as a proposal and reward guidance induces importance weights. Motivated by this view, we introduce Variance Minimisation Policy Optimisation (VMPO), which formulates diffusion alignment as minimising the variance of log importance weights rather than directly optimising a Kullback-Leibler (KL) based objective. We prove that the variance objective is minimised by the reward-tilted target distribution and that, under on-policy sampling, its gradient coincides with that of standard KL-based alignment. This perspective offers a common lens for understanding diffusion alignment. Under different choices of potential functions and variance minimisation strategies, VMPO recovers various existing methods, while also suggesting new design directions beyond KL.
中文标题/摘要
标题:超越KL的扩散对齐:方差最小化作为有效的策略优化器
扩散对齐将预训练的扩散模型适应为沿去噪轨迹采样奖励倾斜分布。这一过程自然地具有顺序蒙特卡洛(SMC)的解释,其中去噪模型作为提议者,奖励指导诱导重要性权重。受这一观点的启发,我们引入了方差最小化策略优化(VMPO),将扩散对齐视为最小化对数重要性权重的方差,而不是直接优化基于KL的目标。我们证明方差目标在奖励倾斜目标分布下最小化,并且在基于策略采样下,其梯度与标准基于KL对齐的梯度一致。这一视角为理解扩散对齐提供了一个共同的视角。在不同的潜在函数和方差最小化策略选择下,VMPO恢复了各种现有方法,同时也提出了超越KL的新设计方向。
Summary / 总结
The paper introduces Variance Minimisation Policy Optimisation (VMPO) to improve diffusion alignment, which adapts pretrained diffusion models to sample from reward-tilted distributions. VMPO formulates the process as minimizing the variance of log importance weights instead of directly optimizing a KL-based objective. Key findings include that the variance objective is minimized by the reward-tilted target distribution and that its gradient under on-policy sampling matches that of standard KL-based alignment, offering a unified perspective on diffusion alignment methods.
论文提出了Variance Minimisation Policy Optimisation (VMPO) 来改进扩散对齐,这是一种将预训练的扩散模型调整为从奖励倾斜分布中采样的技术。VMPO 将过程表述为最小化对数重要权重的方差,而不是直接优化基于KL的目标。关键实验发现表明,在不同的潜在函数和方差最小化策略下,VMPO 与现有方法保持一致,同时也提出了新的设计方向。
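The claim that the variance objective is minimised by the reward-tilted target can be illustrated on a toy discrete distribution. The sketch below is a single-step analogue only, not VMPO itself and not a denoising trajectory; the state space, the base distribution, the reward values, and the function names are all assumptions made for illustration.

```python
import math

def tilted_target(base, reward):
    """Reward-tilted distribution proportional to base[x] * exp(reward[x])."""
    unnorm = [p * math.exp(r) for p, r in zip(base, reward)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def log_weight_variance(proposal, base, reward):
    """Variance under the proposal of log w(x) = log base[x] + reward[x] - log proposal[x].

    The weights are unnormalised; a constant normaliser shifts log w but not its variance.
    """
    log_w = [math.log(p) + r - math.log(q) for p, r, q in zip(base, reward, proposal)]
    mean = sum(q * lw for q, lw in zip(proposal, log_w))
    return sum(q * (lw - mean) ** 2 for q, lw in zip(proposal, log_w))

# Toy setup: uniform base over four states with hand-picked rewards.
base = [0.25, 0.25, 0.25, 0.25]
reward = [0.0, 1.0, 2.0, -1.0]
target = tilted_target(base, reward)
```

When the proposal equals the tilted target, every log weight reduces to the same constant (the log normaliser), so the variance vanishes; sampling from the base instead leaves the variance of the reward itself.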
Bandit Learning in Matching Markets with Interviews
Authors: Amirmahdi Mirfakhar, Xuchuang Wang, Mengfan Xu, Hedyeh Beyhaghi, Mohammad Hajiesmaili
First: 2026-02-12T18:03:37+00:00 · Latest: 2026-02-12T18:03:37+00:00
Abstract
Two-sided matching markets rely on preferences from both sides, yet it is often impractical to evaluate preferences. Participants, therefore, conduct a limited number of interviews, which provide early, noisy impressions and shape final decisions. We study bandit learning in matching markets with interviews, modeling interviews as *low-cost hints* that reveal partial preference information to both sides. Our framework departs from existing work by allowing firm-side uncertainty: firms, like agents, may be unsure of their own preferences and can make early hiring mistakes by hiring less preferred agents. To handle this, we extend the firm's action space to allow *strategic deferral* (choosing not to hire in a round), enabling recovery from suboptimal hires and supporting decentralized learning without coordination. We design novel algorithms for (i) a centralized setting with an omniscient interview allocator and (ii) decentralized settings with two types of firm-side feedback. Across all settings, our algorithms achieve time-independent regret, a substantial improvement over the $O(\log T)$ regret bounds known for learning stable matchings without interviews. Also, under mild structured markets, decentralized performance matches the centralized counterpart up to polynomial factors in the number of agents and firms.
Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
Authors: Miaosen Zhang, Yishan Liu, Shuxia Lin, Xu Yang, Qi Dai, Chong Luo, Weihao Jiang, Peng Hou, Anxiang Zeng, Xin Geng, Baining Guo
First: 2026-02-12T17:59:58+00:00 · Latest: 2026-02-12T17:59:58+00:00
Abstract
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on-policy data. We propose a framework to bridge this chasm by enabling On-Policy SFT. We first present *Distribution Discriminant Theory (DDT)*, which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) *In-Distribution Finetuning (IDFT)*, a loss-level method to enhance generalization ability of SFT, and (ii) *Hinted Decoding*, a data-level technique that can re-align the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We open-source the code here: https://github.com/zhangmiaosen2000/Towards-On-Policy-SFT
中文标题/摘要
标题:迈向基于策略的微调:分布判别理论及其在大模型训练中的应用
监督微调(SFT)计算效率高,但通常在泛化能力上不如强化学习(RL)。这一差距主要是由于RL使用了基于策略的数据。我们提出了一种框架来弥合这一差距,使其能够实现基于策略的微调。我们首先提出了分布判别理论(DDT),该理论解释并量化了数据与模型诱导分布之间的对齐程度。利用DDT,我们引入了两种互补的技术:(i)在分布内微调(IDFT),一种在损失层面增强SFT泛化能力的方法;(ii)提示解码,一种在数据层面的技术,可以重新对齐训练语料库以匹配模型的分布。大量实验表明,我们的框架在泛化性能上与著名的离线RL算法(包括DPO和SimPO)相当,同时保持了SFT管道的效率。因此,该提出的框架为在RL不可行的领域提供了一种实用的替代方案。我们在此开源代码:https://github.com/zhangmiaosen2000/Towards-On-Policy-SFT
Summary / 总结
The paper aims to improve the generalization of supervised fine-tuning (SFT) by bridging the gap with reinforcement learning (RL). It introduces Distribution Discriminant Theory (DDT) to explain the alignment between data and model-induced distribution, and proposes two techniques: In-Distribution Finetuning (IDFT) and Hinted Decoding. Experiments show that the proposed framework matches the performance of offline RL algorithms like DPO and SimPO while maintaining the efficiency of SFT, offering a practical alternative in RL-infeasible domains.
论文通过提出面向策略的微调(On-Policy SFT)框架来弥合监督微调(SFT)和强化学习(RL)之间的差距。它引入了分布判别理论(DDT)来量化数据与模型诱导分布之间的对齐,并提出了两种技术:In-Distribution Finetuning(IDFT)和提示解码。实验表明,该框架在泛化性能上与离线RL算法如DPO和SimPO相当,同时保持了SFT的计算效率,因此在RL不可行的领域提供了一个实用的替代方案。
Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
Authors: Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang, Adheesh Juvekar, Tianshu Bao, Lin Chai, Sparsh Mittal, Inderjit S Dhillon, Ismini Lourentzou
First: 2026-02-12T17:59:08+00:00 · Latest: 2026-02-12T17:59:08+00:00
Abstract
We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
中文标题/摘要
标题:兼而有之:通过统一离散流匹配实现多模态推理与生成
我们提出了一种统一的离散流匹配框架UniDFlow,用于多模态理解、生成和编辑。该框架通过特定任务的低秩适配器解耦理解与生成,避免了目标干扰和表示纠缠,同时一种新颖的基于参考的多模态偏好对齐优化了在相同条件下的相对结果,提高了忠实度和可控性,而无需大规模重新训练。UniDFlow在八个基准测试中实现了SOTA性能,并且在包括 inpainting、上下文图像生成、基于参考的编辑和组合生成等任务中表现出强大的零样本泛化能力,尽管没有进行明确的特定任务训练。
Summary / 总结
UniDFlow is a unified framework that decouples understanding and generation using task-specific low-rank adapters to avoid objective interference and representation entanglement. It also includes a reference-based multimodal preference alignment to optimize relative outcomes under identical conditioning, enhancing faithfulness and controllability. The model achieves state-of-the-art performance across eight benchmarks and demonstrates strong zero-shot generalization to various tasks such as inpainting, in-context image generation, reference-based editing, and compositional generation without explicit task-specific training.
UniDFlow 是一个统一框架,通过使用任务特定的低秩适配器将理解和生成解耦,以避免目标干扰和表示纠缠。它还包括基于参考的多模态偏好对齐,以在相同条件下优化相对结果,增强真实性和可控性。该模型在八个基准测试中达到了最先进的性能,并且在包括修复、上下文图像生成、基于参考的编辑和组合生成等任务中展示了强大的零样本泛化能力,而无需进行特定任务的显式训练。
The Observer Effect in World Models: Invasive Adaptation Corrupts Latent Physics
Authors: Christian Internò, Jumpei Yamaguchi, Loren Amdahl-Culleton, Markus Olhofer, David Klindt, Barbara Hammer
First: 2026-02-12T17:56:07+00:00 · Latest: 2026-02-12T17:56:07+00:00
Abstract
Determining whether neural models internalize physical laws as world models, rather than exploiting statistical shortcuts, remains challenging, especially under out-of-distribution (OOD) shifts. Standard evaluations often test latent capability via downstream adaptation (e.g., fine-tuning or high-capacity probes), but such interventions can change the representations being measured and thus confound what was learned during self-supervised learning (SSL). We propose a non-invasive evaluation protocol, PhyIP. We test whether physical quantities are linearly decodable from frozen representations, motivated by the linear representation hypothesis. Across fluid dynamics and orbital mechanics, we find that when SSL achieves low error, latent structure becomes linearly accessible. PhyIP recovers internal energy and Newtonian inverse-square scaling on OOD tests (e.g., $\rho > 0.90$). In contrast, adaptation-based evaluations can collapse this structure ($\rho \approx 0.05$). These findings suggest that adaptation-based evaluation can obscure latent structures and that low-capacity probes offer a more accurate evaluation of physical world models.
中文标题/摘要
标题:世界模型中的观察者效应:侵入性适应破坏潜在物理定律
确定神经模型是否将物理定律作为世界模型内化,而不是利用统计捷径,尤其是在分布外(OOD)转移时,仍然具有挑战性。标准评估通常通过下游适应(例如微调或高容量探针)测试潜在能力,但这些干预措施会改变正在测量的表示,从而混淆自监督学习(SSL)期间学到的内容。我们提出了一种非侵入性评估协议,PhyIP。我们测试物理量是否可以从冻结的表示中线性可解码,受线性表示假设的启发。在流体动力学和轨道力学中,我们发现当SSL达到低误差时,潜在结构变得线性可访问。PhyIP在分布外测试中恢复了内部能量和牛顿平方反比缩放(例如,ρ>0.90)。相比之下,基于适应的评估可能会使这种结构崩溃(ρ≈0.05)。这些发现表明,基于适应的评估可能会掩盖潜在结构,而低容量探针提供了对物理世界模型更准确的评估。
Summary / 总结
The study aims to evaluate whether neural models internalize physical laws as world models without altering the representations through invasive adaptation. It introduces PhyIP, a non-invasive evaluation protocol, which tests the linear decodability of physical quantities from frozen representations. Across fluid dynamics and orbital mechanics, the study finds that when self-supervised learning achieves low error, latent physical structures become linearly accessible, with PhyIP recovering internal energy and Newtonian inverse-square scaling on out-of-distribution tests. In contrast, adaptation-based evaluations obscure these structures. This suggests that low-capacity probes provide a more accurate evaluation of physical world models than adaptation-based methods.
研究旨在评估神经模型是内化物理定律还是依赖统计捷径,特别是在分布外转移时。它提出了一种非侵入性评估协议PhyIP,通过测试冻结表示中物理量的线性可解性来进行评估。在流体动力学和轨道力学中,研究发现当自我监督学习达到低误差时,物理结构变得线性可访问,PhyIP在分布外测试中恢复了内部能量和牛顿反平方定律。相比之下,适应性评估会掩盖这些结构。这表明适应性评估可能无法准确反映神经模型中的物理世界结构。
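The notion of a non-invasive linear probe can be sketched in a few lines: fit a linear map on frozen features from one split, leave the features untouched, and report the correlation between predictions and the true physical quantity on a held-out split. This is a deliberately minimal one-feature version under stated assumptions, not the PhyIP protocol; the function names and the use of Pearson correlation as the score are illustrative choices.

```python
import math

def fit_linear_probe(features, targets):
    """Ordinary least squares for y = a * z + b on a single frozen feature z."""
    n = len(features)
    mean_z = sum(features) / n
    mean_y = sum(targets) / n
    cov = sum((z - mean_z) * (y - mean_y) for z, y in zip(features, targets))
    var = sum((z - mean_z) ** 2 for z in features)
    a = cov / var
    return a, mean_y - a * mean_z

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

def probe_score(train_z, train_y, test_z, test_y):
    """Fit the probe on the training split only, then score it on held-out data."""
    a, b = fit_linear_probe(train_z, train_y)
    predictions = [a * z + b for z in test_z]
    return pearson(predictions, test_y)
```

Because only the two probe parameters are fitted and the features are never updated, a high held-out correlation reflects structure already present in the representation rather than structure introduced by the evaluation.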
VIRENA: Virtual Arena for Research, Education, and Democratic Innovation
Authors: Emma Hoes, K. Jonathan Klueser, Fabrizio Gilardi
First: 2026-02-12T17:46:52+00:00 · Latest: 2026-02-12T17:46:52+00:00
Comments: VIRENA is under active development and currently in use at the University of Zurich, supported by the DIZH Innovation Program: 2nd Founder-Call. This preprint will be updated as new features are released. For the latest version and to inquire about demos or pilot collaborations, contact the authors
Abstract
Digital platforms shape how people communicate, deliberate, and form opinions. Studying these dynamics has become increasingly difficult due to restricted data access, ethical constraints on real-world experiments, and limitations of existing research tools. VIRENA (Virtual Arena) is a platform that enables controlled experimentation in realistic social media environments. Multiple participants interact simultaneously in realistic replicas of feed-based platforms (Instagram, Facebook, Reddit) and messaging apps (WhatsApp, Messenger). Large language model-powered AI agents participate alongside humans with configurable personas and realistic behavior. Researchers can manipulate content moderation approaches, pre-schedule stimulus content, and run experiments across conditions through a visual interface requiring no programming skills. VIRENA makes possible research designs that were previously impractical: studying human--AI interaction in realistic social contexts, experimentally comparing moderation interventions, and observing group deliberation as it unfolds. Built on open-source technologies that ensure data remain under institutional control and comply with data protection requirements, VIRENA is currently in use at the University of Zurich and available for pilot collaborations. Designed for researchers, educators, and public organizations alike, VIRENA's no-code interface makes controlled social media simulation accessible across disciplines and sectors. This paper documents its design, architecture, and capabilities.
中文标题/摘要
标题:VIRENA:虚拟竞技场,用于研究、教育和民主创新
数字平台塑造了人们的交流、讨论和形成观点的方式。由于数据访问受限、现实世界实验的伦理限制以及现有研究工具的局限性,研究这些动态变得越来越困难。VIRENA(虚拟竞技场)是一个平台,它能够在现实社交媒体环境中进行受控实验。多个参与者可以同时在基于信息流的平台(Instagram、Facebook、Reddit)和即时通讯应用(WhatsApp、Messenger)的现实复制品中互动。由大型语言模型驱动的AI代理可以与人类一起参与,具有可配置的人设和现实行为。研究人员可以通过无需编程技能的可视化界面操控内容审核方法、预排定刺激内容,并在不同条件下运行实验。VIRENA 使得以前不切实际的研究设计成为可能:在现实社会环境中研究人类与AI的互动、实验性地比较干预措施的效果以及观察群体讨论的展开。VIRENA 建立在开源技术之上,确保数据保留在机构控制之下并符合数据保护要求,目前在苏黎世大学使用,并可供试点合作。VIRENA 的无代码界面使其跨学科和跨行业的受控社交媒体模拟变得可行。本文档记录了其设计、架构和功能。
Summary / 总结
VIRENA is a platform designed to enable controlled experimentation in realistic social media environments, addressing the challenges of restricted data access and ethical constraints. It allows multiple participants to interact in replicas of popular platforms like Instagram, Facebook, Reddit, WhatsApp, and Messenger, with AI agents participating alongside humans. Researchers can manipulate content moderation and run experiments through a no-code visual interface. Key findings include the ability to study human-AI interactions, compare moderation interventions, and observe group deliberation in realistic settings, making previously impractical research designs possible.
VIRENA 是一个平台,旨在通过在现实社交媒体环境中进行受控实验来促进研究和教育。它允许参与者在 Instagram、Facebook、Reddit、WhatsApp 和 Messenger 等流行平台的复制品中互动,并可选择加入由大型语言模型驱动的 AI 代理。研究人员无需编程技能即可操控内容审核并运行实验。主要发现包括能够研究人机互动、比较审核干预措施以及观察群体讨论。VIRENA 目前在苏黎世大学使用,并可供试点合作。
CONSENT: A Negotiation Framework for Leveraging User Flexibility in Vehicle-to-Building Charging under Uncertainty
Authors: Rishav Sen, Fangqi Liu, Jose Paolo Talusan, Ava Pettet, Yoshinori Suzue, Mark Bailey, Ayan Mukhopadhyay, Abhishek Dubey
First: 2026-01-04T15:59:52+00:00 · Latest: 2026-02-12T17:45:11+00:00
Comments: Submitted to AAMAS 2026. 38 pages, 13 figures, 14 tables
Abstract
The growth of Electric Vehicles (EVs) creates a conflict in vehicle-to-building (V2B) settings between building operators, who face high energy costs from uncoordinated charging, and drivers, who prioritize convenience and a full charge. To resolve this, we propose a negotiation-based framework that, by design, guarantees voluntary participation, strategy-proofness, and budget feasibility. It transforms EV charging into a strategic resource by offering drivers a range of incentive-backed options for modest flexibility in their departure time or requested state of charge (SoC). Our framework is calibrated with user survey data and validated using real operational data from a commercial building and an EV manufacturer. Simulations show that our negotiation protocol creates a mutually beneficial outcome: lowering the building operator's costs by over 3.5% compared to an optimized, non-negotiating smart charging policy, while simultaneously reducing user charging expenses by 22% below the utility's retail energy rate. By aligning operator and EV user objectives, our framework provides a strategic bridge between energy and mobility systems, transforming EV charging from a source of operational friction into a platform for collaboration and shared savings.
中文标题/摘要
标题:CONSENT:一种在不确定性条件下利用用户灵活性的车辆到建筑充电谈判框架
电动汽车(EVs)的增长在车辆到建筑(V2B)设置中引发了建筑运营商和驾驶员之间的冲突,前者面临因充电不协调而导致的高昂能源成本,后者则更注重便利性和满电状态。为解决这一问题,我们提出了一种基于谈判的框架,该框架通过设计确保了自愿参与、策略稳健性和预算可行性。该框架通过为驾驶员提供一系列激励支持的选择,将电动汽车充电转变为一种战略资源,允许他们在离开时间或请求的荷电状态(SoC)方面表现出适度的灵活性。我们使用用户调查数据对该框架进行了校准,并使用一家商业建筑和一家电动汽车制造商的实际运营数据进行了验证。模拟结果显示,我们的谈判协议创造了一个互惠互利的结果:与优化的非谈判智能充电政策相比,降低了建筑运营商超过3.5%的成本,同时将用户的充电费用降低了22%以上,低于公用事业的零售能源费率。通过使运营商和电动汽车用户的目标一致,我们的框架为能源和移动性系统之间提供了一座战略桥梁,将电动汽车充电从运营摩擦转变为合作和共享节省的平台。
Summary / 总结
The paper addresses the conflict between building operators and EV drivers in V2B settings by proposing a negotiation framework that guarantees voluntary participation, strategy-proofness, and budget feasibility. It offers drivers incentive-backed options in exchange for modest flexibility in departure time or requested state of charge; the framework is calibrated with user survey data and validated using real operational data. Simulations show that it lowers building operator costs by over 3.5% compared to an optimized, non-negotiating smart charging policy while reducing user charging expenses by 22% below the utility's retail energy rate.
论文提出了一种谈判框架,解决了V2B设置中建筑运营商与电动汽车驾驶员之间的冲突,确保了自愿参与、策略稳健性和预算可行性。该框架为驾驶员提供了适度充电调整的灵活选项,相比非谈判智能充电策略,降低了建筑运营商成本超过3.5%,同时降低了用户充电费用22%,低于零售能源费率。
DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
Authors: Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, Jiaqi Wang
First: 2026-02-12T17:44:24+00:00 · Latest: 2026-02-12T17:44:24+00:00
Abstract
Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
中文标题/摘要
标题:DeepGen 1.0:一种轻量级统一多模态模型,用于推进图像生成和编辑
当前用于图像生成和编辑的统一多模态模型通常依赖于庞大的参数规模(例如,>10B),导致高昂的训练成本和部署足迹。在本工作中,我们提出了DeepGen 1.0,这是一种轻量级的5B统一模型,其综合能力与或超越了更大的同类模型。为了克服紧凑模型在语义理解和精细控制方面的局限性,我们引入了堆叠通道桥接(SCB),这是一种深度对齐框架,从多个VLM层中提取层次特征,并与可学习的“思考标记”融合,为生成骨干提供结构化、富含推理的指导。我们还设计了一种以数据为中心的训练策略,分为三个渐进阶段:(1)在大规模图像-文本对和编辑三元组上进行对齐预训练,以同步VLM和DiT表示;(2)在高质量的生成、编辑和推理任务混合集上进行联合监督微调,以培养全方位的能力;(3)使用MR-GRPO的强化学习,利用混合奖励函数和监督信号,显著提高了生成质量和与人类偏好的一致性,同时保持了稳定的训练进展并避免了视觉伪影。尽管仅在约5000万样本上进行训练,DeepGen 1.0在多种基准测试中均表现出领先性能,在WISE上超越了80B的HunyuanImage 28%,在UniREditBench上超越了27B的Qwen-Image-Edit 37%。通过开源我们的训练代码、权重和数据集,我们提供了一种高效、高性能的替代方案,以促进统一多模态研究的民主化。
Summary / 总结
DeepGen 1.0 is a lightweight 5B unified model for image generation and editing, addressing the high training and deployment costs of larger models. It introduces Stacked Channel Bridging (SCB) to enhance semantic understanding and fine-grained control. The model undergoes three progressive training stages: alignment pre-training, joint supervised fine-tuning, and reinforcement learning. DeepGen 1.0 outperforms much larger models on various benchmarks, achieving 28% better performance on WISE and 37% on UniREditBench compared to 80B HunyuanImage and 27B Qwen-Image-Edit respectively.
DeepGen 1.0 是一个轻量级的 5B 统一模型,用于图像生成和编辑,解决了大型模型高昂的训练和部署成本问题。它引入了堆叠通道桥接(SCB)来增强语义理解和细粒度控制。该模型通过三个阶段进行训练:对齐预训练、联合监督微调和强化学习。DeepGen 1.0 在各种基准测试中表现出色,分别在 WISE 和 UniREditBench 上优于 80B 的 HunyuanImage 和 27B 的 Qwen-Image-Edit,分别提高了 28% 和 37%。通过开源其训练代码、权重和数据集,DeepGen 1.0 提供了一个高效的替代方案,以促进统一多模态研究。
ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images
Authors: Mathieu Sibue, Andres Muñoz Garza, Samuel Mensah, Pranav Shetty, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso
First: 2026-02-12T17:38:57+00:00 · Latest: 2026-02-12T17:38:57+00:00
Comments: EACL 2026, main conference
Abstract
Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.
中文标题/摘要
标题:ExStrucTiny:面向文档图像的可变模式结构化信息提取基准
企业文档,如表单和报告,嵌入了下游应用(如数据归档、自动化工作流和分析)所需的关键信息。尽管通用视觉语言模型(VLMs)在现有的文档理解基准测试中表现良好,但它们在多种文档类型和灵活模式下进行整体、精细结构化提取的能力尚未得到充分研究。现有的关键实体提取(KEE)、关系提取(RE)和视觉问答(VQA)数据集受限于狭窄的实体本体、简单的查询或同质的文档类型,往往忽略了适应性和结构化提取的需求。为解决这些差距,我们引入了ExStrucTiny,这是一个新的基准数据集,用于文档图像中的结构化信息提取(IE),将KEE、RE和VQA的各个方面统一起来。通过结合手动和合成的人工验证样本的新颖管道构建,ExStrucTiny涵盖了更多样化的文档类型和提取场景。我们分析了开放和封闭的VLMs在该基准上的表现,突出了模式适应、查询欠具体化和答案定位等挑战。我们希望我们的工作能够为改进通用模型在文档中的结构化IE提供基础。
Summary / 总结
ExStrucTiny is a new benchmark dataset for structured information extraction from document images, addressing the limitations of existing benchmarks by covering diverse document types and extraction scenarios. It combines manual and synthetic human-validated samples to evaluate the ability of Vision Language Models (VLMs) to perform holistic, fine-grained structured extraction across different schemas. The study highlights challenges such as schema adaptation and query under-specification, aiming to improve generalist models for structured information extraction in documents.
ExStrucTiny 是一个新的基准数据集,旨在评估 Vision Language Models (VLMs) 在多样化的文档图像中进行结构化信息提取的能力。该数据集通过结合手动和合成的人工验证样本构建,涵盖了广泛的文档类型和提取场景,并统一了关键实体提取、关系提取和视觉问答的方面。实验表明,该基准数据集揭示了诸如模式适应和查询欠具体化等挑战,为改进 VLMs 在结构化信息提取任务中的表现提供了参考。
PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery
Authors: Bowei He, Lihao Yin, Hui-Ling Zhen, Xiaokun Zhang, Mingxuan Yuan, Chen Ma
Venue: ICLR 2026
First: 2025-02-18T07:11:08+00:00 · Latest: 2026-02-12T17:38:20+00:00
Comments: Accepted by ICLR 2026
Abstract
Model pruning is an effective approach for compressing large language models (LLMs). However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, some irrelevant instructions may also introduce negative effects to model capacity recovery. To address these challenges, we propose the Post-training dAta Selection method for Efficient pruned large language model Recovery (PASER). PASER aims to identify instructions to recover the most compromised model capacities with a certain data budget. Our approach first applies manifold learning and spectral clustering to group recovery instructions in the semantic space, revealing capability-specific instruction sets. Then, the data budget is adaptively allocated across clusters by the degree of corresponding model capability degradation. In each cluster, we prioritize data samples that lead to the most decline of model performance. To mitigate potential negative tuning effects, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4%-20% of the original post-training data. We provide the code repository at https://github.com/BokwaiHo/PASER.
中文标题/摘要
标题:PASER:高效剪枝大型语言模型恢复的后训练数据选择
模型剪枝是压缩大型语言模型(LLMs)的有效方法。然而,这一过程往往会导致模型能力显著下降。虽然后训练技术如指令调优常被用来恢复模型性能,但现有方法往往忽视了模型能力的不均匀下降,并且计算成本高。此外,一些无关指令也可能对模型能力恢复产生负面影响。为了解决这些挑战,我们提出了后训练数据选择方法以实现高效剪枝大型语言模型恢复(PASER)。PASER旨在以一定的数据预算识别出能够恢复最受损模型能力的指令。我们的方法首先应用流形学习和谱聚类在语义空间中对恢复指令进行分组,揭示出能力特定的指令集。然后,根据相应模型能力下降的程度,适应性地分配数据预算。在每个聚类中,我们优先选择导致模型性能下降最多的数据样本。为了减轻潜在的负面调优影响,我们还检测并过滤出冲突或无关的恢复数据。广泛的实验表明,PASER显著优于传统基线,有效恢复了剪枝LLMs的一般能力,同时仅使用原始后训练数据的4%-20%。我们在GitHub上提供了代码库:https://github.com/BokwaiHo/PASER。
Summary / 总结
PASER is a post-training data selection method for efficiently recovering the capabilities of pruned large language models. It uses manifold learning and spectral clustering to group instructions and allocate data budget based on model capability degradation. PASER prioritizes data samples that lead to the most decline in model performance and filters out conflicting or irrelevant data. Experiments show that PASER outperforms conventional baselines, recovering the general capabilities of pruned LLMs with only 4%-20% of the original post-training data.
PASER 是一种后训练数据选择方法,旨在高效恢复剪枝的大语言模型性能。它使用流形学习和谱聚类来识别并优先处理最受损的模型能力对应的指令。通过在簇间适配地分配有限的数据预算,PASER 减轻了负面调优效果,并显著优于传统方法,仅使用原始后训练数据的 4%-20% 就能恢复剪枝 LLM 的通用能力。
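The budget allocation step described in the PASER abstract can be illustrated with a small sketch. The abstract only says the budget is allocated "by the degree of corresponding model capability degradation"; the strictly proportional split, the largest-remainder rounding, and the names `allocate_budget`, `degradation`, and `budget` below are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict


def allocate_budget(degradation: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` samples across clusters in proportion to degradation.

    `degradation` maps a (hypothetical) cluster name to a non-negative score
    measuring how much the matching capability dropped after pruning.
    """
    total = sum(degradation.values())
    if total <= 0:
        raise ValueError("at least one cluster must show positive degradation")

    # Ideal fractional share for each cluster.
    ideal = {name: budget * d / total for name, d in degradation.items()}

    # Floor every share, then give the leftover samples to the clusters
    # with the largest fractional remainders so the total equals `budget`.
    alloc = {name: int(share) for name, share in ideal.items()}
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(ideal, key=lambda n: ideal[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


if __name__ == "__main__":
    # Made-up degradation scores for three illustrative capability clusters.
    demo = {"reasoning": 0.30, "coding": 0.15, "knowledge": 0.05}
    print(allocate_budget(demo, budget=1000))
```

In this sketch, clusters whose capability degraded more receive a larger share of the recovery data; the within-cluster sample ranking and the filtering of conflicting data mentioned in the abstract are not modeled.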
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
First: 2025-09-26T08:55:09+00:00 · Latest: 2026-02-12T17:32:59+00:00
Abstract
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
中文标题/摘要
标题:CoSpaDi: 通过校准引导的稀疏字典学习压缩大型语言模型
大型语言模型(LLMs)的后训练压缩通常依赖于低秩权重近似,将权重矩阵的每一列表示在共享的低维子空间中。这种策略计算效率高,但其背后的约束对于异构投影权重来说可能过于僵硬,可能会导致不必要的准确度损失。我们提出了CoSpaDi(通过稀疏字典学习压缩),这是一种无需训练的框架,用结构化稀疏分解替换低秩分解,其中每个权重矩阵表示为一个稠密字典乘以列稀疏系数矩阵。这产生了一种子空间模型:权重矩阵的列表示为不同字典原子子集的线性组合,从而在固定参数预算下提高表达能力。CoSpaDi 是校准引导的:使用一个小的校准集,我们优化分解以最小化层输出的功能重建误差,而不是权重空间误差。激活衍生的Gram正交化将这种数据感知的目标重新表述为转换后的权重上的标准字典学习问题,并且我们支持层内压缩和组内相似投影之间的跨层字典共享。在Llama和Qwen模型家族中,CoSpaDi 在20-40%的压缩比下,相对于基于SVD的先进基线和强大的结构化剪枝基线,始终能够改善准确度-压缩和困惑度-压缩的权衡。由此产生的结构化稀疏性使稀疏-密集计算成为可能,并与稀疏系数的后训练量化集成。
Summary / 总结
CoSpaDi is a training-free compression framework for large language models that replaces low-rank factorization with a structured sparse decomposition: each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix, which improves expressiveness at a fixed parameter budget. It optimizes the factorization on a small calibration set to minimize the functional reconstruction error of layer outputs rather than weight-space error, and supports both per-layer compression and cross-layer dictionary sharing. Across Llama and Qwen model families, CoSpaDi improves the accuracy-compression and perplexity-compression trade-offs over SVD-based and structured pruning baselines at 20-40% compression ratios.
CoSpaDi 是一种无需训练的大型语言模型压缩框架,它使用结构化的稀疏分解来替代低秩分解,从而更灵活且准确地表示权重矩阵。它使用一个小的校准集来优化分解,以最小化层输出的功能重建误差,并支持逐层压缩和跨层字典共享。实验表明,CoSpaDi 在 Llama 和 Qwen 模型上优于基于 SVD 的最先进的基线和强结构剪枝基线,在 20-40% 的压缩比下,在准确性和困惑度折衷方面表现出更优的结果。
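To make the factorization structure concrete, the sketch below builds a weight approximation of the form W ≈ D S, with a dense dictionary D and a coefficient matrix S that has a fixed number of nonzeros per column. It only illustrates the shape of the decomposition and the resulting stored-value count; it does not perform dictionary learning or calibration, and the sizes and the names `random_column_sparse` and `stored_values` are hypothetical.

```python
import numpy as np


def random_column_sparse(k: int, n: int, s: int, rng: np.random.Generator) -> np.ndarray:
    """Return a k x n matrix with s randomly placed coefficients per column."""
    S = np.zeros((k, n))
    for j in range(n):
        # Each column picks its own subset of dictionary atoms, which is
        # what makes this a union-of-subspaces model.
        support = rng.choice(k, size=s, replace=False)
        S[support, j] = rng.standard_normal(s)
    return S


def stored_values(m: int, n: int, k: int, s: int) -> int:
    """Values kept for W ~ D @ S: dense D (m*k) plus s coefficients per column.

    Storage for the sparse index pattern is ignored here for simplicity.
    """
    return m * k + s * n


rng = np.random.default_rng(0)
m, n, k, s = 64, 128, 32, 4          # illustrative sizes, not from the paper
D = rng.standard_normal((m, k))      # dense dictionary with k atoms
S = random_column_sparse(k, n, s, rng)
W_hat = D @ S                        # reconstructed m x n weight matrix
```

Under these illustrative sizes the factorized form keeps `stored_values(m, n, k, s)` numbers instead of the `m * n` entries of a dense matrix, while different columns remain free to use different atoms, unlike a low-rank factorization where all columns share one subspace.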
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Authors: Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
Venue: Nat Mach Intell 8, 20-31 (2026)
First: 2024-10-18T05:21:05+00:00 · Latest: 2026-02-12T17:29:23+00:00
Comments: Published at Nature Machine Intelligence
Abstract
Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models (LLMs) and vision language models (VLMs) now assist in experiment design and procedural guidance, yet their "illusion of understanding" may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios, encompassing 3,128 open-ended tasks. Evaluations on 19 advanced LLMs and VLMs show that no model evaluated on hazard identification surpasses 70% accuracy. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying AI systems in real laboratory settings.
中文标题/摘要
标题:LabSafety Bench:评估大型语言模型在科学实验室安全问题上的表现
人工智能(AI)正在革新科学研究,但其在实验室环境中的日益集成带来了关键的安全挑战。大型语言模型(LLMs)和视觉语言模型(VLMs)现在协助实验设计和程序指导,但它们的“理解错觉”可能导致研究人员过度信任不安全的输出。我们展示当前模型远未达到安全实验室操作所需的可靠性。我们引入了LabSafety Bench,这是一个全面的基准测试,评估模型在危害识别、风险评估和后果预测方面的表现,涵盖765个多项选择题和404个现实实验室场景,共计3,128个开放式任务。对19个先进LLM和VLM的评估显示,在危害识别方面,没有模型的准确率超过70%。虽然专有模型在结构化评估中表现良好,但在开放式推理方面并没有明显优势。这些结果强调了在实际实验室环境中部署AI系统之前,迫切需要专门的安全评估框架。
Summary / 总结
The paper addresses the safety challenges posed by the integration of AI in scientific laboratories. It introduces LabSafety Bench, a benchmark that tests models on hazard identification, risk assessment, and consequence prediction. Evaluations on 19 advanced LLMs and VLMs reveal that no model achieves over 70% accuracy in hazard identification, highlighting the need for specialized safety evaluation frameworks before deploying AI in real laboratory settings.
论文探讨了AI在科学实验室中的集成所带来的安全挑战。它引入了LabSafety Bench基准,测试模型在危害识别、风险评估和后果预测方面的表现。对19种先进LLM和VLM的评估显示,没有模型在危害识别上的准确率超过70%,强调了在实际实验室环境中部署AI系统前需要专门的安全评估框架的紧迫性。
Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education
Authors: Mohamed Huti, Alasdair Mackintosh, Amy Waldock, Dominic Andrews, Maxime Lelièvre, Moritz Boos, Tobias Murray, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod
First: 2026-02-12T17:29:03+00:00 · Latest: 2026-02-12T17:29:03+00:00
Abstract
AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck, particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.
中文标题/摘要
标题:视觉推理基准:评估多模态大语言模型在小学课堂真实视觉问题上的推理能力
AI模型在文本推理方面已达到最先进的水平;然而,它们在处理空间和关系结构方面的推理能力仍然是一个关键瓶颈——特别是在依赖视觉的早期数学中。本文介绍了视觉推理基准(VRB),这是一个新型数据集,旨在评估多模态大语言模型(MLLMs)解决课堂真实视觉问题的能力。该基准基于来自赞比亚和印度小学考试的701个问题构建,涵盖了诸如类比推理、模式填充和空间匹配等一系列任务。我们概述了基准的方法和开发过程,故意使用未经编辑、文字最少的图像来测试模型是否能满足小学教育的实际需求。我们的研究发现,模型在静态技能如计数和缩放方面表现出色,但在面对折叠、反射和旋转等动态操作时却达到了一个明显的“空间天花板”。这些弱点对课堂使用视觉推理问题构成风险,可能导致错误评分、虚假支持和强化学生的错误概念。因此,像VRB这样的面向教育的基准对于确定用于课堂的多模态工具的功能边界至关重要。