arXiv Paper Digest / arXiv 论文速递

Snapshot: 20260216_0328

Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone

First: 2026-02-12T18:59:59+00:00 · Latest: 2026-02-12T18:59:59+00:00

Abstract

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
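
The deployment-time selection described in the abstract (rephrase the instruction, sample several action candidates per rephrasing, let a verifier pick the best pair) might be sketched as below. This is only an illustrative sketch: `select_best`, `toy_policy`, and `toy_verifier` are hypothetical stand-ins and are not CoVer's actual interface, and the toy scoring rule is an assumption chosen so the example runs without a robot or a model.

```python
from typing import Callable, List, Optional, Tuple

def select_best(
    original: str,
    rephrasings: List[str],
    sample_actions: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    actions_per_instruction: int,
) -> Tuple[str, str, float]:
    """Search the (rephrasing x action) grid and keep the verifier's top pick."""
    best: Optional[Tuple[str, str, float]] = None
    for instruction in rephrasings:
        for action in sample_actions(instruction, actions_per_instruction):
            # The verifier judges each candidate against the ORIGINAL instruction.
            s = score(original, action)
            if best is None or s > best[2]:
                best = (instruction, action, s)
    if best is None:
        raise ValueError("no candidates were generated")
    return best

# Hypothetical stand-ins so the sketch is runnable.
def toy_policy(instruction: str, n: int) -> List[str]:
    words = instruction.split()
    return [words[i % len(words)] for i in range(n)]

def toy_verifier(original: str, action: str) -> float:
    return 1.0 if action == original.split()[-1] else 0.0

result = select_best(
    "pick up the cup",
    ["grab the mug", "pick up the cup"],
    toy_policy,
    toy_verifier,
    actions_per_instruction=4,
)
```

Scaling both loops at once widens the candidate pool along two axes, which is the joint scaling the abstract argues is more sample-efficient than growing either axis alone.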

中文标题/摘要

标题：扩展验证比扩展策略学习更能有效实现视觉-语言-行动对齐

通用机器人长期愿景依赖于它们理解和执行自然语言指令的能力。视觉-语言-行动（VLA）模型在这一目标上取得了显著进展，但它们生成的动作仍然可能与给定的指令不一致。在本文中，我们研究测试时验证作为缩小“意图-行动差距”的手段。我们首先表征了基于指令的执行的测试时扩展定律，证明了同时扩展重述指令的数量和生成动作的数量大大增加了测试时样本多样性，通常比独立扩展每个维度更有效地恢复正确的动作。为了利用这些扩展定律，我们提出了CoVer，一种对比验证器，用于视觉-语言-行动对齐，并展示了我们的架构随着额外计算资源和数据的增加而平滑扩展。然后，我们介绍了“启动时计算”和一个分层验证推理流水线，用于VLAs。在部署时，我们的框架从视觉语言模型（VLM）预计算一组多样化的重述指令，反复为每条指令生成动作候选，然后使用验证器选择最优的高层提示和低层动作片段。与在相同数据上扩展策略预训练相比，我们的验证方法在SIMPLER基准测试中获得了22%的同分布改进和13%的异分布改进，在实际实验中进一步提高了45%。在PolaRiS基准测试中，CoVer实现了14%的任务进展和9%的成功率改进。

Summary / 总结

This paper explores test-time verification as a method to improve the alignment between actions and natural language instructions in vision-language-action models. It demonstrates that jointly scaling the number of rephrased instructions and generated actions increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. CoVer, a contrastive verifier, scales gracefully with additional compute and data, and the proposed hierarchical verification pipeline uses it to select both the high-level prompt and the low-level action chunks. Compared to scaling policy pre-training on the same data, the verification approach improves in-distribution performance by 22% and out-of-distribution performance by 13% on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.

本文探讨了测试时验证作为提高视觉-语言-动作模型中动作与自然语言指令之间对齐的方法。研究表明，同时扩大重述指令和生成动作的数量可以提高测试时样本多样性，从而更高效地恢复正确的动作。提出的CoVer架构在额外资源下能够平滑扩展，并且该框架预先计算多样化的重述指令，生成动作候选，并使用验证器选择最优的动作。这种方法在SIMPLER基准测试中实现了22%的分布内和13%的分布外性能提升，在PolaRiS基准测试中分别实现了14%的任务进展和9%的成功率提升。

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Authors: Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang Dai, Jialiang Wang, Zecheng He, Jianwei Yang, Chunyuan Li, Junzhe Sun, Chu Wang, Serena Yeung-Levy, Felix Juefei-Xu

First: 2026-02-12T18:59:49+00:00 · Latest: 2026-02-12T18:59:49+00:00

Abstract

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.

中文标题/摘要

标题：UniT：统一多模态链式思维测试时扩展

统一模型可以在单一架构中处理多模态理解和生成，但通常它们在单次通过过程中运行，而不进行迭代细化。许多多模态任务，尤其是涉及复杂的空间组合、多个相互作用的对象或不断变化的指令的任务，需要分解指令、验证中间结果并进行迭代修正。虽然测试时扩展（TTS）已经证明，为迭代推理分配额外的推理计算可以显著提高语言模型的性能，但将这一范式扩展到统一的多模态模型仍然是一个开放的挑战。我们引入了UniT，这是一种多模态链式思维测试时扩展的框架，使单一统一模型能够在多轮次中进行推理、验证和细化。UniT 结合了代理数据合成、统一模型训练和灵活的测试时推理，以引发包括验证、子目标分解和内容记忆在内的认知行为。我们的主要发现是：(1) 统一模型在短推理轨迹上的训练在测试时能够泛化到更长的推理链；(2) 顺序链式思维推理比并行采样提供了一种更具扩展性和计算效率的TTS策略；(3) 在生成和编辑轨迹上的训练提高了分布外视觉推理的能力。这些结果确立了多模态测试时扩展作为一种有效范式，以推进统一模型中的生成和理解。

Summary / 总结

The motivation for this work is to improve the performance of unified multimodal models on complex tasks by enabling iterative reasoning. The method involves a framework called UniT, which allows a single unified model to reason, verify, and refine its outputs across multiple rounds. Key experimental findings include: unified models trained on short reasoning trajectories generalize well to longer inference chains; sequential chain-of-thought reasoning is more scalable and compute-efficient than parallel sampling; and training on generation and editing trajectories enhances out-of-distribution visual reasoning.

研究旨在通过在测试时启用迭代推理来提升统一多模态模型的性能。UniT 是一种多模态链式思维测试时扩展框架，允许单一统一模型在多轮次中进行推理、验证和修正。关键发现包括统一模型在更长推理链上的泛化能力、顺序链式思维推理的可扩展性和计算效率，以及通过生成和编辑轨迹训练来提升跨分布视觉推理能力。

AttentionRetriever: Attention Layers are Secretly Long Document Retrievers

Authors: David Jiahao Fu, Lam Thanh Do, Jiayu Li, Kevin Chen-Chuan Chang

First: 2026-02-12T18:59:35+00:00 · Latest: 2026-02-12T18:59:35+00:00

Abstract

Retrieval-augmented generation (RAG) has been widely adopted to help Large Language Models (LLMs) process tasks involving long documents. However, existing retrieval models are not designed for long-document retrieval and fail to address several of its key challenges, including context-awareness, causal dependence, and scope of retrieval. In this paper, we propose AttentionRetriever, a novel long-document retrieval model that leverages the attention mechanism and entity-based retrieval to build context-aware embeddings for long documents and determine the scope of retrieval. Through extensive experiments, we find that AttentionRetriever outperforms existing retrieval models on long-document retrieval datasets by a large margin while remaining as efficient as dense retrieval models.

中文标题/摘要

标题：AttentionRetriever: 注意力层实际上是长文档检索器

检索增强生成（RAG）已被广泛采用，以帮助大型语言模型（LLMs）处理涉及长文档的任务。然而，现有的检索模型并未设计用于长文档检索，并且无法解决长文档检索中的几个关键挑战，包括上下文意识、因果依赖性和检索范围。在本文中，我们提出了AttentionRetriever，这是一种新颖的长文档检索模型，利用注意力机制和基于实体的检索来构建长文档的上下文感知嵌入，并确定检索范围。通过广泛的实验，我们发现AttentionRetriever在长文档检索数据集上的性能显著优于现有检索模型，同时保持与密集检索模型相当的效率。

Summary / 总结

The research motivation is to address the limitations of existing retrieval models in handling long documents, such as context-awareness, causal dependence, and retrieval scope. The main method involves using AttentionRetriever, which combines attention mechanisms and entity-based retrieval to create context-aware embeddings and determine the retrieval scope. Key experimental findings show that AttentionRetriever outperforms existing models on long document retrieval datasets with significant improvements while maintaining efficiency similar to dense retrieval models.

研究旨在通过解决上下文意识、因果依赖和检索范围等挑战，改进大型语言模型（LLMs）的长文档检索。AttentionRetriever 是一种新型模型，利用注意力机制和基于实体的检索来创建上下文感知嵌入并确定检索范围。实验结果表明，AttentionRetriever 在长文档数据集上的表现优于现有检索模型，同时保持与密集检索模型相当的效率。

Agentic Test-Time Scaling for WebAgents

Authors: Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

First: 2026-02-12T18:58:30+00:00 · Latest: 2026-02-12T18:58:30+00:00

Abstract

Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long horizons; and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents. We find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting, but that can overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
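
The vote-derived uncertainty signals named in the abstract (entropy and top-1/top-2 margin) can be computed from a list of sampled actions as in the sketch below. The thresholds, the definition of margin as a difference in vote shares, and the function names `vote_uncertainty` and `needs_more_compute` are assumptions made for illustration; the paper's actual decision rule may differ.

```python
import math
from collections import Counter
from typing import List, Tuple

def vote_uncertainty(votes: List[str]) -> Tuple[float, float]:
    """Return (entropy in nats, top-1 minus top-2 vote share)."""
    if not votes:
        raise ValueError("need at least one vote")
    total = len(votes)
    shares = sorted((c / total for c in Counter(votes).values()), reverse=True)
    entropy = -sum(p * math.log(p) for p in shares)
    runner_up = shares[1] if len(shares) > 1 else 0.0
    return entropy, shares[0] - runner_up

def needs_more_compute(
    votes: List[str],
    max_entropy: float = 0.5,  # hypothetical threshold
    min_margin: float = 0.4,   # hypothetical threshold
) -> bool:
    """Spend extra samples only when the vote looks contentious."""
    entropy, margin = vote_uncertainty(votes)
    return entropy > max_entropy or margin < min_margin

unanimous = ["click"] * 5
split = ["click", "click", "type", "type", "scroll"]
```

With these inputs, the unanimous vote is left alone while the split vote is flagged for additional sampling, which is the behavior that lets a confidence-aware scheme save tokens relative to uniform scaling.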

中文标题/摘要

标题：代理测试时缩放以适应网络代理

测试时缩放已成为提高神经网络模型性能和增强其可靠性的标准方法。然而，它在代理型、多步任务上的行为尚不完全清楚：小的每步误差会在长时间范围内累积；我们发现，均匀增加采样的简单策略会显示出递减的回报。在本文中，我们提出了CATTS，这是一种简单的技术，用于动态为多步代理分配计算资源。我们首先对网络代理的推理时缩放进行了实证研究。我们发现，均匀增加每步计算资源在长时间环境中很快达到饱和。然后我们研究了更强的聚合策略，包括基于LLM的仲裁者，它可以超越简单的投票，但可以推翻高度一致的决策。我们展示了代理自身投票分布得出的不确定性统计（熵和top-1/top-2差距）与下游成功相关，并提供了一种实用的动态计算分配信号。基于这些发现，我们引入了基于信心的测试时缩放(CATTS)，它仅在决策真正有争议时才使用投票得出的不确定性来分配计算资源。CATTS在WebArena-Lite和GoBrowse上的性能比React提高了最多9.1%，同时使用的令牌数量最多减少了2.3倍，从而提供了效率提升和可解释的决策规则。

Summary / 总结

This paper studies test-time scaling for agentic, multi-step tasks, where small per-step errors can compound. It shows that uniformly increasing per-step compute quickly saturates in long-horizon environments and introduces CATTS, which uses vote-derived uncertainty (entropy and top-1/top-2 margin) to allocate extra compute only when decisions are contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling.

该研究针对多步任务中累积误差的问题，提出了一种基于决策不确定性的动态计算分配方法CATTS。研究发现，均匀增加每步计算很快会饱和，而基于不确定性的策略则表现出更好的性能。CATTS通过使用投票得出的不确定性，仅在决策真正有争议时分配计算，相较于React在WebArena-Lite和GoBrowse上性能提升高达9.1%，同时使用更少的计算资源。

On-Policy Context Distillation for Language Models

Authors: Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, Furu Wei

First: 2026-02-12T18:58:28+00:00 · Latest: 2026-02-12T18:58:28+00:00

Abstract

Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
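
To make the objective concrete, the sketch below evaluates the reverse Kullback-Leibler divergence, KL(student || teacher), on a single toy next-token distribution and contrasts it with the forward direction. The probabilities are invented for illustration. In actual on-policy training the quantity would be estimated on trajectories sampled from the student, with the teacher conditioned on the extra context; neither is modeled here.

```python
import math
from typing import Sequence

def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
student = [0.7, 0.2, 0.1]   # model without the context
teacher = [0.5, 0.3, 0.2]   # same model family, conditioned on the context

reverse_kl = kl(student, teacher)  # expectation taken under the student
forward_kl = kl(teacher, student)  # expectation taken under the teacher
```

Because the reverse direction averages under the student's own distribution, it penalizes the student most where it places mass the teacher does not, which is why it is usually described as mode-seeking.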

中文标题/摘要

标题：基于策略的上下文蒸馏语言模型

上下文蒸馏使语言模型能够将上下文中的知识内化到其参数中。在我们的工作中，我们提出了基于策略的上下文蒸馏（OPCD）框架，该框架通过训练学生模型在其自身生成的轨迹上，并通过最小化与上下文条件教师的逆Kullback-Leibler散度来实现策略蒸馏与上下文蒸馏的结合。我们展示了OPCD在两个重要应用中的有效性：经验知识蒸馏，其中模型从其历史解决方案痕迹中提取和提炼可转移的知识；系统提示蒸馏，其中模型内化编码在优化提示中的有益行为。在数学推理、基于文本的游戏以及特定领域任务中，OPCD在所有情况下都优于基线方法，同时在保持分布外能力方面表现更好。我们还展示了OPCD在跨规模蒸馏中的有效性，其中较小的学生模型可以从较大的教师中内化经验知识。

Summary / 总结

The research aims to improve language models by enabling them to internalize in-context knowledge. On-Policy Context Distillation (OPCD) is proposed, which trains a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. The method is effective in experiential knowledge distillation and system prompt distillation, showing higher task accuracy and better preservation of out-of-distribution capabilities across various tasks. Additionally, OPCD allows smaller models to learn from larger ones effectively.

研究旨在通过使语言模型能够内化上下文知识来提升其性能。提出了On-Policy Context Distillation (OPCD)框架，该框架通过最小化学生模型自动生成轨迹与上下文条件化教师之间的逆Kullback-Leibler散度来进行训练。研究显示，OPCD在经验知识内化和系统提示内化中表现优于基线方法，实现了更高的任务准确率并保留了分布外能力。此外，OPCD还允许较小模型从较大模型中有效学习，实现跨规模的模型内化。

MonarchRT: Efficient Attention for Real-Time Video Generation

Authors: Krish Agarwal, Zhuoming Chen, Cheng Luo, Yongqi Chen, Haizhong Zheng, Xun Huang, Atri Rudra, Beidi Chen

First: 2026-02-12T18:56:53+00:00 · Latest: 2026-02-12T18:56:53+00:00

Abstract

Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of parameterization through finetuning, with custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly-capable sparse attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs respectively, providing kernel speedups in the range of 1.4-11.8X. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
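
For readers unfamiliar with Monarch matrices, the sketch below shows one common form of the structure: a block-diagonal multiply, a reshape-transpose permutation, a second block-diagonal multiply, and the permutation again. It is written in plain Python for clarity and only illustrates the matrix family; how Monarch-RT aligns blocks with video tokens, its tiled extension, and its Triton kernels are not reproduced here, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

def block_diag_apply(blocks: List[Matrix], x: List[float], m: int) -> List[float]:
    """Multiply x by a block-diagonal matrix made of m blocks of size m x m."""
    out: List[float] = []
    for b, block in enumerate(blocks):
        chunk = x[b * m:(b + 1) * m]
        out.extend(sum(w * v for w, v in zip(row, chunk)) for row in block)
    return out

def transpose_perm(x: List[float], m: int) -> List[float]:
    """View x as an m x m row-major grid and return its transpose, flattened."""
    return [x[i * m + j] for j in range(m) for i in range(m)]

def monarch_matvec(left: List[Matrix], right: List[Matrix], x: List[float], m: int) -> List[float]:
    """Compute (P L P R) x for n = m * m, where L and R are block-diagonal."""
    y = block_diag_apply(right, x, m)
    y = transpose_perm(y, m)
    y = block_diag_apply(left, y, m)
    return transpose_perm(y, m)

identity = [[1, 0], [0, 1]]
swap = [[0, 1], [1, 0]]
y = monarch_matvec([swap, identity], [identity, identity], [1, 2, 3, 4], m=2)
```

Each of the two factors stores m blocks of m x m entries, so a vector of length n = m * m is transformed with about 2 * n^1.5 multiplications instead of the n^2 needed by a dense matrix.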

中文标题/摘要

标题：MonarchRT：实时视频生成的高效注意力机制

实时视频生成使用扩散变换器受到3D自我注意力的二次成本瓶颈限制，特别是在既少步又自回归的实时环境中，错误会随时间累积，每个去噪步骤必须携带更多的信息。在这种环境中，我们发现先前的稀疏注意力近似失效，尽管在双向、多步扩散中表现出色。具体来说，我们观察到视频注意力不是可靠的稀疏，而是由时空位置驱动的显著周期性结构与动态稀疏语义对应和密集混合相结合，超过了甚至先验top-k注意力的表示能力。基于这一洞察，我们提出了Monarch-RT，这是一种用于视频扩散模型的结构化注意力参数化，通过使用Monarch矩阵分解注意力。通过适当对齐的块结构和我们扩展的Tiled Monarch参数化，我们实现了高表达性同时保持计算效率。我们进一步通过微调克服了参数化的开销，使用了自定义的Triton内核。我们首先验证了Monarch-RT在现有仅针对双向模型设计的稀疏基线中的高有效性。我们还观察到，当应用于最先进的模型Self-Forcing时，Monarch-RT可以达到95%的注意力稀疏性，而不会损失质量，使Monarch-RT成为实时视频生成中高能力稀疏注意力参数化的开创性工作。我们的优化实现分别在Nvidia RTX 5090、H100和B200 GPU上优于FlashAttention-2、FlashAttention-3和FlashAttention-4内核，提供了1.4-11.8倍的内核加速。这使我们首次能够在单个RTX 5090上以16 FPS实现真正的实时视频生成。

Summary / 总结

The paper addresses the challenge of real-time video generation with Diffusion Transformers, which are hindered by the quadratic cost of 3D self-attention. It introduces Monarch-RT, a structured attention parameterization that uses Monarch matrices to factorize attention, achieving high expressivity while maintaining computational efficiency. Monarch-RT outperforms existing sparse baselines and enables true real-time video generation at 16 FPS on a single RTX 5090, with speedups up to 11.8X compared to FlashAttention kernels.

MonarchRT 通过提出一种新的注意力机制来解决实时视频生成中的计算瓶颈问题，该机制利用 Monarch 矩阵来分解注意力，同时保持高效性。实验结果表明，MonarchRT 显著优于现有的稀疏注意力基线，并且在应用于 Self-Forcing 模型时可达到最高 95% 的注意力稀疏度，同时保持质量不变，从而在单个 RTX 5090 GPU 上以 16 FPS 实现真正的实时视频生成。优化后的 MonarchRT 实现相比 FlashAttention 内核在各种 GPU 上提供了 1.4-11.8 倍的速度提升，使其成为实时视频生成中稀疏注意力参数化的一个开创性工作。

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

Authors: Zhen Zhang, Kaiqiang Song, Xun Wang, Yebowen Hu, Weixiang Yan, Chenyang Zhao, Henry Peng Zou, Haoyun Deng, Sathish Reddy Indurthi, Shujian Liu, Simin Ma, Xiaoyang Wang, Xin Eric Wang, Song Wang

First: 2026-02-12T18:55:09+00:00 · Latest: 2026-02-12T18:55:09+00:00

Abstract

AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on tau^-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
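
One plausible reading of "sparse reward assignment but dense evaluation criteria" is that a judge evaluates many binary checklist items per turn, each tied to explicit evidence, and the results are collapsed into a single scalar reward. The sketch below follows that reading. The `ChecklistItem` fields, the example criteria, and the averaging rule are assumptions for illustration and are not taken from the CM2 code base.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ChecklistItem:
    criterion: str   # fine-grained, binary question about the turn
    evidence: str    # span of the trajectory the judgment is grounded in
    passed: bool     # the judge's yes/no decision

def turn_reward(items: List[ChecklistItem]) -> float:
    """Collapse many binary judgments into one scalar reward in [0, 1]."""
    if not items:
        raise ValueError("a turn needs at least one checklist item")
    return sum(item.passed for item in items) / len(items)

# Hypothetical checklist for one turn of a flight-booking dialogue.
checklist = [
    ChecklistItem("Called the search tool before booking", "tool_call[0]", True),
    ChecklistItem("Passed the date the user asked for", "tool_call[0].args", True),
    ChecklistItem("Confirmed the price with the user", "assistant_msg[2]", False),
]
reward = turn_reward(checklist)
```

Framing each criterion as a yes/no question is what turns open-ended judging into the more stable classification-style decision the abstract describes.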

中文标题/摘要

标题：CM2：使用检查表奖励的多轮多步代理工具使用强化学习

AI代理越来越多地用于通过推理多轮用户交互和调用外部工具来解决实际任务。然而，将强化学习应用于此类环境仍然具有挑战性：现实目标往往缺乏可验证的奖励，而是强调开放性行为；此外，多轮、多步代理工具使用中的RL仍然未被充分探索；而且构建和维护可执行的工具环境成本高昂，限制了规模和覆盖面。我们提出了CM2，这是一种RL框架，用检查表奖励取代了可验证的结果奖励。CM2将每轮预期行为细分为具有明确证据基础和结构化元数据的细粒度二元标准，将开放性判断转化为更稳定的分类决策。为了平衡稳定性和信息量，我们的方法采用了稀疏奖励分配但密集评估标准的策略。训练在可扩展的LLM模拟工具环境中进行，避免了为大型工具集进行大量工程设计。实验表明，CM2在监督微调的基础上持续改进。从8B基础模型开始，在8k例样本的RL数据集上进行训练，CM2在tau^-Bench上比SFT对齐版本高出8分，在BFCL-V4上高出10分，在ToolSandbox上高出12分。结果与甚至超过了同样规模的开源基线，包括评判模型。因此，CM2提供了一种无需依赖可验证奖励来优化多轮多步工具使用代理的可扩展方法。开源社区提供的代码：https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.

Summary / 总结

The paper introduces CM2, a reinforcement learning framework that uses checklist rewards to address the challenges of training AI agents for multi-turn and multi-step tasks involving external tools. By decomposing each turn's behavior into binary criteria with explicit evidence, CM2 transforms open-ended judgments into more stable classification decisions. Experiments show that CM2 outperforms supervised fine-tuning on multiple benchmarks, achieving improvements of 8 points on tau^-Bench, 10 points on BFCL-V4, and 12 points on ToolSandbox, demonstrating its effectiveness in optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards.

论文针对将强化学习应用于需要与用户多轮交互并使用外部工具的AI代理的挑战，提出了一种名为CM2的强化学习框架，使用检查表奖励来替代可验证的结果奖励。CM2将每轮的行为分解为具有明确证据的二元标准，使决策过程更加稳定。实验表明，CM2在多个基准测试中优于监督微调，分别在tau^-Bench上提高了8个点，在BFCL-V4上提高了10个点，在ToolSandbox上提高了12个点，证明了其在优化多轮、多步骤工具使用代理方面的有效性，无需依赖可验证的奖励。

T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization

Authors: Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas

First: 2026-02-12T18:52:35+00:00 · Latest: 2026-02-12T18:52:35+00:00

Abstract

Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at https://github.com/Tyrion58/T3D.

中文标题/摘要

标题：T3D：通过轨迹自蒸馏与直接判别优化的少量步骤扩散语言模型

扩散大型语言模型（DLLMs）有可能通过并行解码多个标记来实现快速文本生成。然而，在实践中，它们的推理效率受到需要许多细化步骤的限制，而大幅减少步骤会导致生成质量显著下降。为了解决这个问题，我们提出了一种轨迹自蒸馏框架，通过蒸馏模型自身的生成轨迹来改进少量步骤解码。我们结合了直接判别优化（DDO），这是一种反向KL目标，促进模式寻求蒸馏，并鼓励学生集中于高概率教师模式。在各种基准测试中，我们的方法在紧缩步骤预算下始终优于强大的少量步骤基线和标准训练。尽管全步骤解码仍然更优，但我们显著缩小了差距，为实用的少量步骤DLLMs奠定了坚实的基础。源代码可在https://github.com/Tyrion58/T3D获取。

Think like a Scientist: Physics-guided LLM Agent for Equation Discovery

Authors: Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu

First: 2026-02-12T18:49:27+00:00 · Latest: 2026-02-12T18:49:27+00:00

Abstract

Explaining observed phenomena through symbolic, interpretable formulas is a fundamental goal of science. Recently, large language models (LLMs) have emerged as promising tools for symbolic equation discovery, owing to their broad domain knowledge and strong reasoning capabilities. However, most existing LLM-based systems try to guess equations directly from data, without modeling the multi-step reasoning process that scientists often follow: first inferring physical properties such as symmetries, then using these as priors to restrict the space of candidate equations. We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process. The agent coordinates physics-based tools to extract intermediate structure and uses these results to configure symbolic regression engines such as PySINDy and PySR, including their function libraries and structural constraints. Across a suite of physical equation benchmarks, KeplerAgent achieves substantially higher symbolic accuracy and greater robustness to noisy data than both LLM and traditional baselines.

中文标题/摘要

标题：像科学家一样思考：基于物理的LLM代理进行方程发现

通过符号、可解释的公式来解释观察到的现象是科学的基本目标。近年来，大型语言模型（LLMs）因其广泛的领域知识和强大的推理能力，已成为符号方程发现的有前途的工具。然而，现有的大多数基于LLM的系统试图直接从数据中猜测方程，而不建模科学家通常遵循的多步推理过程：首先推断诸如对称性等物理属性，然后使用这些属性作为先验来限制候选方程的空间。我们引入了KeplerAgent，这是一种明确遵循这一科学推理过程的代理框架。该代理协调基于物理的工具来提取中间结构，并使用这些结果来配置符号回归引擎，如PySINDy和PySR，包括它们的功能库和结构约束。在一系列物理方程基准测试中，KeplerAgent在符号准确性方面显著高于LLM和传统基线，并且对噪声数据具有更强的鲁棒性。

Summary / 总结

The research aims to enhance symbolic equation discovery by mimicking the scientific reasoning process. KeplerAgent, an agent-based framework, explicitly models the multi-step reasoning process scientists use, starting with inferring physical properties and then using these as priors to restrict the space of candidate equations. This approach, using physics-based tools and symbolic regression engines, leads to higher symbolic accuracy and better robustness to noisy data compared to existing LLMs and traditional methods.

研究旨在通过模拟科学推理过程来提高符号方程发现的准确性。KeplerAgent是一个基于代理的框架，首先推断物理属性如对称性，并利用这些属性来限制候选方程的空间。这种方法在各种物理方程基准测试中比现有的LLM和传统方法具有更高的符号准确性和更好的鲁棒性。

Privacy Risks in Time Series Forecasting: User- and Record-Level Membership Inference

Authors: Nicolas Johansson, Tobias Olsson, Daniel Nilsson, Johan Östman, Fazeleh Hoseini

First: 2025-09-04T12:43:45+00:00 · Latest: 2026-02-12T18:46:20+00:00

Abstract

Membership inference attacks (MIAs) aim to determine whether specific data were used to train a model. While extensively studied on classification models, their impact on time series forecasting remains largely unexplored. We address this gap by introducing two new attacks: (i) an adaptation of multivariate LiRA, a state-of-the-art MIA originally developed for classification models, to the time-series forecasting setting, and (ii) a novel end-to-end learning approach called Deep Time Series (DTS) attack. We benchmark these methods against adapted versions of other leading attacks from the classification setting. We evaluate all attacks in realistic settings on the TUH-EEG and ELD datasets, targeting two strong forecasting architectures, LSTM and the state-of-the-art N-HiTS, under both record- and user-level threat models. Our results show that forecasting models are vulnerable, with user-level attacks often achieving perfect detection. The proposed methods achieve the strongest performance in several settings, establishing new baselines for privacy risk assessment in time series forecasting. Furthermore, vulnerability increases with longer prediction horizons and smaller training populations, echoing trends observed in large language models.
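
The core idea behind a LiRA-style attack can be sketched as a likelihood-ratio test: fit one Gaussian to the losses that shadow models assign to a record when it was in their training set, another to the losses when it was not, and compare how well each explains the loss observed on the target model. The sketch below is a simplified univariate version on scalar forecast losses with made-up numbers; the paper's multivariate adaptation and its DTS attack are not reproduced, and summing record scores to get a user-level score is an assumption.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import List, Sequence

def lira_score(observed: float, in_losses: Sequence[float], out_losses: Sequence[float]) -> float:
    """Log-likelihood ratio; positive values suggest the record was a training member."""
    in_dist = NormalDist(mean(in_losses), stdev(in_losses))
    out_dist = NormalDist(mean(out_losses), stdev(out_losses))
    return math.log(in_dist.pdf(observed)) - math.log(out_dist.pdf(observed))

def user_level_score(record_scores: List[float]) -> float:
    """Aggregate per-record evidence for one user (assumed: sum of log-ratios)."""
    return sum(record_scores)

# Made-up shadow-model losses for one record.
in_losses = [0.10, 0.12, 0.09, 0.11]    # shadow models trained WITH the record
out_losses = [0.50, 0.55, 0.45, 0.52]   # shadow models trained WITHOUT it

member_like = lira_score(0.10, in_losses, out_losses)
non_member_like = lira_score(0.50, in_losses, out_losses)
```

Aggregating over many records from the same user accumulates evidence, which is consistent with the abstract's observation that user-level attacks are often much stronger than record-level ones.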

中文标题/摘要

标题：时间序列预测中的隐私风险：用户级和记录级成员推断

成员推断攻击（MIAs）旨在确定特定数据是否被用于训练模型。尽管在分类模型上进行了广泛研究，但它们对时间序列预测的影响仍很大程度上未被探索。我们通过引入两种新的攻击方法来填补这一空白：(i) 将当前最先进的MIA方法多变量LiRA的适应版本应用于时间序列预测设置，(ii) 一种新的端到端学习方法，称为Deep Time Series (DTS)攻击。我们使用来自分类设置的其他领先攻击的适应版本对这些方法进行了基准测试。我们在现实环境中对TUH-EEG和ELD数据集上的所有攻击进行了评估，针对两种强大的预测架构LSTM和最先进的N-HiTS，在记录级和用户级威胁模型下进行。我们的结果表明，预测模型存在漏洞，用户级攻击通常能够实现完美的检测。所提出的方法在多种情况下表现出最强的性能，为时间序列预测中的隐私风险评估建立了新的基准。此外，预测时间越长，训练样本越少，漏洞越严重，这与大型语言模型中观察到的趋势一致。

Summary / 总结

This study investigates privacy risks in time series forecasting models through membership inference attacks. Two new attacks are introduced: an adaptation of multivariate LiRA to the forecasting setting and a novel end-to-end learning approach called the Deep Time Series (DTS) attack. Evaluations on the TUH-EEG and ELD datasets, targeting LSTM and N-HiTS under both record- and user-level threat models, show that forecasting models are vulnerable, with user-level attacks often achieving perfect detection. The study establishes new baselines for privacy risk assessment in time series forecasting and finds that vulnerability increases with longer prediction horizons and smaller training populations.

该研究通过成员推断攻击研究时间序列预测模型中的隐私风险，引入了两种新攻击方法：多变量LiRA的适应版本和新型Deep Time Series (DTS)攻击。研究在TUH-EEG和ELD数据集上评估了这些方法，针对LSTM和N-HiTS架构，在记录级和用户级威胁模型下进行。结果表明，这些预测模型对这些攻击非常脆弱，用户级攻击在许多情况下实现了完美的检测。研究还指出，随着预测时间范围的延长和训练样本数量的减少，脆弱性会增加。

Do language models accommodate their users? A study of linguistic convergence

Authors: Terra Blevins, Susanne Schmalwieser, Benjamin Roth

First: 2025-08-05T09:55:40+00:00 · Latest: 2026-02-12T18:46:13+00:00

Comments: EACL 2026

Abstract

While large language models (LLMs) are generally considered proficient in generating language, how similar their language usage is to that of humans remains understudied. In this paper, we test whether models exhibit linguistic convergence, a core pragmatic element of human language communication: do models adapt, or converge, to the linguistic patterns of their user? To answer this, we systematically compare model completions of existing dialogues to original human responses across sixteen language models, three dialogue corpora, and various stylometric features. We find that models strongly converge to the conversation's style, often significantly overfitting relative to the human baseline. While convergence patterns are often feature-specific, we observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained and smaller counterparts. Given the differences in human and model convergence patterns, we hypothesize that the underlying mechanisms driving these behaviors are very different.

中文标题/摘要

标题：语言模型是否适应用户？一种语言趋同研究

虽然大型语言模型（LLMs）通常被认为在生成语言方面表现出色，但它们的语言使用与人类的相似程度仍然研究不足。在本文中，我们测试模型是否表现出语言趋同，这是人类语言交流的核心语用要素：模型是否会适应或趋同于用户的语言模式？为了回答这个问题，我们系统地比较了十六种语言模型、三种对话语料库和各种文体特征下模型对现有对话的完成与原始人类回应。我们发现，模型强烈趋同于对话的风格，通常相对于人类基准显著过拟合。虽然趋同模式往往是特征特定的，但我们观察到在建模设置中趋同模式的一致性变化，指令调整和较大的模型比预训练和较小的模型趋同程度更低。鉴于人类和模型趋同模式的差异，我们假设驱动这些行为的底层机制非常不同。

Summary / 总结

This study investigates whether large language models (LLMs) adapt their language usage to match that of their human users, a phenomenon known as linguistic convergence. By comparing model-generated responses to original human responses in various dialogues and using different stylometric features, the research finds that models strongly adapt to the conversation's style, often exceeding human responses in style matching. However, the degree of convergence varies across different model types, with instruction-tuned and larger models showing less convergence compared to their pretrained and smaller counterparts.

本研究探讨了大型语言模型（LLMs）是否能够调整其语言使用方式以匹配人类用户，这一现象被称为语言趋同。通过在多种对话中比较模型生成的回应与原始人类回应，并使用多种文体特征，研究发现模型显著地适应了对话的风格，通常比人类回应更过度拟合。研究还指出，指令调优和较大的模型比预训练和较小的模型显示出更少的趋同性。

EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph

Authors: Nan Jiang, Ziyi Wang, Yexiang Xue

Venue: ICLR 2026

First: 2025-11-08T04:39:11+00:00 · Latest: 2026-02-12T18:38:11+00:00

Comments: Camera-ready version accepted for ICLR 2026

Abstract

Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, which is an important task in AI-driven scientific discovery. Yet the exponential growth of the search space of expressions renders the task computationally challenging. A promising yet underexplored direction for reducing the search space and accelerating training lies in *symbolic equivalence*: many expressions, although syntactically different, define the same function -- for example, $\log(x_1^2x_2^3)$, $\log(x_1^2)+\log(x_2^3)$, and $2\log(x_1)+3\log(x_2)$. Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning. We introduce EGG-SR, a unified framework that integrates symbolic equivalence into a class of modern symbolic regression methods, including Monte Carlo Tree Search (MCTS), Deep Reinforcement Learning (DRL), and Large Language Models (LLMs). EGG-SR compactly represents equivalent expressions through the proposed EGG module (via equality graphs), accelerating learning by: (1) pruning redundant subtree exploration in EGG-MCTS, (2) aggregating rewards across equivalent generated sequences in EGG-DRL, and (3) enriching feedback prompts in EGG-LLM. Theoretically, we show the benefit of embedding EGG into learning: it tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirically, EGG-SR consistently enhances a class of symbolic regression models across several benchmarks, discovering more accurate expressions within the same time limit. Project page is at: https://nan-jiang-group.github.io/egg-sr.
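
The redundancy the abstract describes can be seen by grouping the three example expressions into one equivalence class. The sketch below does this by numerical probing at a few fixed positive points, which is a much cruder stand-in for the equality graphs EGG-SR actually uses: it can only suggest equivalence, not prove it. The function names, probe points, and tolerances are assumptions for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Expr = Callable[[float, float], float]

# Fixed positive probe points so every logarithm is defined.
POINTS: List[Tuple[float, float]] = [(0.5, 1.5), (1.2, 0.7), (2.0, 2.0)]

def equivalent(f: Expr, g: Expr) -> bool:
    """Heuristic check: the two expressions agree at every probe point."""
    return all(
        math.isclose(f(x1, x2), g(x1, x2), rel_tol=1e-9, abs_tol=1e-9)
        for x1, x2 in POINTS
    )

def group_equivalent(exprs: Dict[str, Expr]) -> List[List[str]]:
    """Partition expressions into classes that agree with a class representative."""
    classes: List[List[Tuple[str, Expr]]] = []
    for name, f in exprs.items():
        for cls in classes:
            if equivalent(f, cls[0][1]):
                cls.append((name, f))
                break
        else:
            classes.append([(name, f)])
    return [[name for name, _ in cls] for cls in classes]

candidates: Dict[str, Expr] = {
    "log_of_product": lambda x1, x2: math.log(x1**2 * x2**3),
    "sum_of_logs": lambda x1, x2: math.log(x1**2) + math.log(x2**3),
    "expanded": lambda x1, x2: 2 * math.log(x1) + 3 * math.log(x2),
    "different": lambda x1, x2: math.log(x1) + math.log(x2),
}
groups = group_equivalent(candidates)
# groups == [["log_of_product", "sum_of_logs", "expanded"], ["different"]]
```

A search procedure that recognizes the first three candidates as one class needs to evaluate or reward only one of them, which is the saving that pruning in MCTS and reward aggregation in DRL exploit.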

Summary / 总结

EGG-SR is a framework that integrates symbolic equivalence into symbolic regression methods, such as Monte Carlo Tree Search, Deep Reinforcement Learning, and Large Language Models, to reduce redundant exploration and accelerate learning. It uses an EGG module to represent equivalent expressions through equality graphs, which helps in pruning redundant subtree exploration, aggregating rewards, and enriching feedback prompts. Experiments show that EGG-SR improves the performance of symbolic regression models across various benchmarks, leading to more accurate expressions within the same time limit.

EGG-SR 是一个将符号等价性整合到符号回归方法（如蒙特卡洛树搜索、深度强化学习和大型语言模型）中的框架。它利用 EGG 模块以等式图紧凑地表示等价表达式，从而减少冗余探索并加速学习。实验表明，在相同的时间限制内，EGG-SR 在多个基准测试中发现了更准确的表达式。
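To make the symbolic-equivalence example from the EGG-SR abstract concrete, the three logarithmic forms can be compared numerically. The following is only a sanity-check sketch: it does not build equality graphs, it is not the authors' EGG module, and all function names are hypothetical.

```python
import math
import random

# Three syntactically different forms of the same function (valid for x1, x2 > 0).
def form_a(x1, x2):
    return math.log(x1**2 * x2**3)

def form_b(x1, x2):
    return math.log(x1**2) + math.log(x2**3)

def form_c(x1, x2):
    return 2 * math.log(x1) + 3 * math.log(x2)

def numerically_equivalent(f, g, trials=1000, seed=0):
    """Return True if f and g agree on random positive inputs.

    This is a heuristic check, not a proof of symbolic equivalence.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x1, x2 = rng.uniform(0.1, 10.0), rng.uniform(0.1, 10.0)
        if not math.isclose(f(x1, x2), g(x1, x2), rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True
```

A search procedure that treated `form_a`, `form_b`, and `form_c` as three distinct candidates would evaluate the same function three times, which is the redundancy the abstract describes.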

"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

Authors: Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou

First: 2026-02-12T18:36:09+00:00 · Latest: 2026-02-12T18:36:09+00:00

Abstract

Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.

中文标题/摘要

标题："抱歉，我没有听清楚那句话"：为何语音模型错失关键信息

尽管语音识别系统在标准基准上的单词错误率较低，但在实际部署中，它们往往在短且高风险的口头表达上失败。在这里，我们研究了这种失败模式在一项高风险任务中的表现：美国参与者说出的美国街道名称的转录。我们评估了来自OpenAI、Deepgram、Google和Microsoft的15个模型在来自语言多样化的美国发言者录音上的表现，发现平均转录错误率为44%。我们通过地理区域量化了转录失败的下游影响，并表明错误转录系统性地影响了所有发言者，但非英语母语发言者的路由距离错误是英语母语发言者的两倍。为了减轻这种危害，我们引入了一种合成数据生成方法，使用开源文本转语音模型生成命名实体的多样化发音。使用不到1000个合成样本进行微调后，非英语母语发言者的街道名称转录准确性提高了近60%（相对于基线模型）。我们的结果突显了语音系统基准性能与实际可靠性之间的关键差距，并展示了减少高风险转录错误的一个简单可扩展的方法。

Summary / 总结

This study examines the failure of speech recognition systems on short, high-stakes utterances, focusing on the transcription of U.S. street names. Evaluating 15 models from OpenAI, Deepgram, Google, and Microsoft, the researchers found an average transcription error rate of 44%. Mis-transcriptions caused downstream routing errors for all speakers, but routing distance errors were twice as large for non-English primary speakers as for English primary speakers. To address this, they developed a synthetic data generation method based on open-source text-to-speech models; fine-tuning on fewer than 1,000 synthetic samples improved street name transcription accuracy for non-English primary speakers by nearly 60% relative to base models.

研究探讨了语音识别系统在短时高风险语音片段上的失败情况，特别是在美国街道名称的转录中。评估了来自不同公司的15个模型，发现平均错误率为44%。研究还发现，非英语母语说话者经历的路由距离错误是英语母语说话者的两倍。为了应对这一问题，研究引入了一种合成数据生成方法，在使用不到1000个合成样本进行微调后，可以将非英语母语说话者的街道名称转录准确性提高近60%。

ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction

Authors: Nick Ferguson, Josh Pennington, Narek Beghian, Aravind Mohan, Douwe Kiela, Sheshansh Agrawal, Thien Hang Nguyen

First: 2026-02-12T18:31:37+00:00 · Latest: 2026-02-12T18:31:37+00:00

Abstract

Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end-to-end benchmark evaluates PDF-to-JSON extraction under enterprise-scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open-source benchmark and evaluation framework for PDF-to-JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and human-annotated gold labels across economically valuable domains, yielding 12,867 evaluatable fields spanning schema complexities from tens to hundreds of fields. The evaluation framework treats the schema as an executable specification: each field declares its scoring metric. Baseline evaluations reveal that frontier models (GPT-5/5.2, Gemini-3 Flash/Pro, Claude 4.5 Opus/Sonnet) remain unreliable on realistic schemas. Performance degrades sharply with schema breadth, culminating in 0% valid output on a 369-field financial reporting schema across all tested models. We release ExtractBench at https://github.com/ContextualAI/extract-bench.

中文标题/摘要

标题：ExtractBench：复杂结构化提取的基准和评估方法

非结构化文档如PDF包含有价值的结构化信息，但下游系统需要这些数据以可靠且标准化的形式存在。LLMs越来越多地被部署以自动化此提取过程，因此准确性和可靠性至关重要。然而，进展受到两个瓶颈的阻碍。首先，没有端到端的基准可以评估在企业规模范围内的PDF到JSON提取。其次，没有系统的方法来捕捉嵌套提取的语义，其中字段需要不同的正确性概念（标识符的精确匹配，数量的容忍度，名称的语义等价），数组需要对齐，遗漏必须与幻觉区分开。我们通过ExtractBench解决了这两个问题，这是一个开源的PDF到JSON结构化提取基准和评估框架。基准测试将35份PDF文档与经济价值领域的JSON模式和人工标注的黄金标准标签配对，生成了涵盖从几十到几百个字段的12,867个可评估字段。评估框架将模式视为可执行规范：每个字段声明其评分标准。基线评估表明，前沿模型（GPT-5/5.2，Gemini-3 Flash/Pro，Claude 4.5 Opus/Sonnet）在现实模式下仍然不可靠。随着模式范围的增加，性能急剧下降，所有测试模型在包含369个字段的财务报告模式下均无有效输出。我们已在https://github.com/ContextualAI/extract-bench/发布ExtractBench。

Summary / 总结

ExtractBench addresses the lack of an end-to-end benchmark for evaluating PDF-to-JSON extraction and introduces a principled methodology for nested extraction semantics. The benchmark includes 35 PDF documents with JSON schemas and human-annotated gold labels, covering 12,867 fields across various domains. Evaluations show that state-of-the-art models struggle with realistic schemas, with performance dropping sharply as schema complexity increases, resulting in no valid outputs for a 369-field schema.

ExtractBench 解决了缺乏端到端的 PDF 到 JSON 提取评估基准的问题，并引入了一种处理嵌套提取语义的规范方法。基准包括 35 份 PDF 文档、JSON 架构和人工标注的黄金标准，涵盖了 12,867 个字段，覆盖了多个领域。评估结果显示，最先进的模型在处理真实架构时表现不佳，随着架构复杂性的增加，性能急剧下降，最终在包含 369 个字段的财务报告架构上没有任何有效的输出。
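The idea in the ExtractBench abstract of treating the schema as an executable specification, where each field declares its scoring metric, can be sketched as follows. The field names, scorers, tolerance, and report labels are all hypothetical illustrations and are not taken from the ExtractBench code base; array alignment is left out.

```python
import math

# Hypothetical scorers, one per notion of correctness named in the abstract.
def exact_match(gold, pred):
    return gold == pred

def numeric_tolerance(gold, pred, rel_tol=0.01):
    return math.isclose(gold, pred, rel_tol=rel_tol)

def normalized_name(gold, pred):
    # Crude stand-in for semantic equivalence: ignore case and extra whitespace.
    return " ".join(gold.lower().split()) == " ".join(pred.lower().split())

# Hypothetical schema: each field declares the metric used to score it.
SCHEMA = {
    "invoice_id": exact_match,
    "total_amount": numeric_tolerance,
    "vendor_name": normalized_name,
}

def evaluate(gold, pred, schema=SCHEMA):
    """Label each schema field as correct, incorrect, omitted, or hallucinated."""
    report = {}
    for field, scorer in schema.items():
        in_gold, in_pred = field in gold, field in pred
        if in_gold and not in_pred:
            report[field] = "omitted"        # gold has a value, model returned none
        elif in_pred and not in_gold:
            report[field] = "hallucinated"   # model returned a value with no gold label
        elif in_gold and in_pred:
            ok = scorer(gold[field], pred[field])
            report[field] = "correct" if ok else "incorrect"
    return report
```

Keeping the metric next to the field definition means the scoring rule travels with the schema, so an identifier and a monetary amount are never judged by the same comparison.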

Towards Autonomous Mathematics Research

Authors: Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong

First: 2026-02-10T18:50:15+00:00 · Latest: 2026-02-12T18:27:29+00:00

Comments: 35 pages. Accompanied blog post https://deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think/

Abstract

Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem-solving to professional research, however, requires navigating vast literature and constructing long-horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems, a novel inference-time scaling law that extends beyond Olympiad-level problems, and intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD-level exercises and most notably, through several distinct milestones in AI-assisted mathematics research: (a) a research paper (Feng26) generated by AI without any human intervention in calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human-AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) an extensive semi-autonomous evaluation (Feng et al., 2026a) of 700 open problems on Bloom's Erdos Conjectures database, including autonomous solutions to four open questions. In order to help the public better understand the developments pertaining to AI and mathematics, we suggest quantifying standard levels of autonomy and novelty of AI-assisted results, as well as propose a novel concept of human-AI interaction cards for transparency. We conclude with reflections on human-AI collaboration in mathematics and share all prompts as well as model outputs at https://github.com/google-deepmind/superhuman/tree/main/aletheia.

中文标题/摘要

标题：迈向自主数学研究

近期基础模型的进展产生了能够在国际数学奥林匹克竞赛中获得金牌标准的推理系统。然而，从竞赛级别的问题解决过渡到专业研究，需要导航大量的文献并构建长期的证明。在本工作中，我们介绍了Aletheia，这是一种能够迭代生成、验证和修订自然语言解决方案的数学研究代理。具体而言，Aletheia 由一个增强版的Gemini Deep Think、一种扩展到奥林匹克级别以上问题的新型推理时扩展法则以及密集的工具使用来应对数学研究的复杂性提供动力。我们展示了Aletheia 从奥林匹克问题到博士级别的练习题的能力，并且通过AI辅助数学研究的几个重要里程碑进行了展示：(a) 一篇由AI生成的研究论文(Feng26)，在没有人类干预的情况下计算了算术几何中称为特征权的结构常数；(b) 一篇展示了人类与AI合作证明交互粒子系统独立集上界的研究论文(LeeSeo26)；(c) 对Bloom的Erdos猜想数据库中的700个开放问题进行了广泛的半自主评估，包括自主解决四个开放问题。为了帮助公众更好地理解AI和数学的发展，我们建议量化AI辅助结果的标准自主性和新颖性水平，并提出了一种新的透明度概念——人类与AI互动卡片。最后，我们反思了数学中的人机合作，并在https://github.com/google-deepmind/superhuman/tree/main/aletheia/ 公布了所有提示及模型输出。

Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation

Authors: Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko

First: 2026-02-12T18:15:08+00:00 · Latest: 2026-02-12T18:15:08+00:00

Comments: Accepted to EACL 2026 Student Research Workshop. 14 pages, 6 tables, 1 figure

Abstract

Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define *token overflow* as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.

中文标题/摘要

标题：检索增强生成中压缩词元表示的溢出检测

在资源受限的环境中，高效处理长上下文仍然是当前大型语言模型（LLMs）面临的关键挑战。软压缩架构通过用一组学习到的压缩词替换长词序列，有望延长有效的上下文长度。然而，压缩的极限——以及何时压缩开始消除与任务相关的内容——仍然没有得到充分探索。在本文中，我们将压缩表示不再包含足够信息来回答给定查询的阶段定义为“词溢出”，并提出了一种方法来表征和检测它。在xRAG软压缩设置中，我们发现查询无关的饱和统计量可靠地将压缩表示与未压缩表示区分开来，提供了一种实用工具来识别压缩词，但显示出有限的溢出检测能力。轻量级的探针分类器在查询和上下文xRAG表示上检测溢出，在HotpotQA、SQuADv2和TriviaQA数据集上的平均AUC-ROC为0.72，表明结合查询信息可以提高检测性能。这些结果从查询无关的诊断发展到查询感知的检测器，使低成本的预LLM门控能够减轻压缩引起的错误。

Summary / 总结

This paper addresses the challenge of efficient long-context processing in large language models by defining token overflow as a regime where compressed representations lack sufficient information to answer a given query. In the xRAG soft-compression setting, query-agnostic saturation statistics reliably separate compressed from uncompressed token representations but show limited ability to detect overflow. Lightweight probing classifiers over both query and context representations detect overflow with an average AUC-ROC of 0.72 on HotpotQA, SQuADv2, and TriviaQA, indicating that query information improves on query-independent diagnostics.

本文通过定义token溢出为压缩表示不再包含足够信息来回答查询的阶段，来解决大型语言模型中高效处理长上下文的挑战。作者提出了一种使用饱和统计和轻量级探针分类器来检测这一问题的方法，这些方法在HotpotQA、SQuADv2和TriviaQA等数据集上的平均AUC-ROC为0.72。通过引入查询信息，检测性能得到了提高，从独立于查询的诊断转向了查询感知的检测器。
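For readers less familiar with the metric reported for the overflow detectors, AUC-ROC can be read as the probability that a randomly chosen positive example (here, an overflow case) receives a higher detector score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name is hypothetical and nothing here reproduces the paper's probing classifiers.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1, where 1 marks an overflow example.
    scores: detector scores, higher meaning "more likely overflow".
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Under this reading, a score of 0.5 corresponds to chance and 1.0 to perfect separation, so the reported 0.72 means an overflow case outranks a non-overflow case in roughly 72% of such pairs.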

Evaluating LLM Reasoning Beyond Correctness and CoT

Authors: Soheil Abbasloo

First: 2025-10-20T22:08:59+00:00 · Latest: 2026-02-12T18:07:50+00:00

Abstract

What does it truly mean for a language model to "reason"? Current evaluations reward models' correct standalone answers, but correctness alone reveals little about the process that produced them. We argue that reasoning should be understood not as a static chain of steps but as a dynamic trajectory in which ideas interact, clash, and evolve into integrated insights. Building on the philosophical tradition of dialectics, we introduce SIEV, a structured evaluation framework that assesses reasoning through explicit thesis-antithesis-synthesis interactions. SIEV produces interpretable trajectories that highlight key properties of reasoning -- robustness to challenge, adaptability under conflict, and synthesis across competing viewpoints -- dimensions that conventional correctness-based metrics cannot capture. Empirical results on GSM and MMLU demonstrate substantial gaps in the reasoning abilities of state-of-the-art models: for example, GPT-5-chat loses more than 40 points (out of 100) on GSM when evaluated through SIEV's process-oriented lens. By shifting focus from what answer a model gives to how it arrives there, SIEV enables a more transparent and principled distinction between structured reasoning and surface-level pattern generation, offering a clearer foundation for assessing and understanding the reasoning capabilities of LLMs.

中文标题/摘要

标题：超越正确性和思维过程评估大语言模型的推理能力

语言模型究竟如何"推理"意味着什么？当前的评估奖励模型的正确独立答案，但仅凭正确性并不能揭示其背后的推理过程。我们认为，推理不应被视为静态的步骤链，而应被视为一种动态轨迹，在这种轨迹中，思想相互作用、碰撞并发展成为综合的见解。基于辩证法的哲学传统，我们引入了SIEV，这是一种结构化的评估框架，通过明确的论题-反题-合题互动来评估推理。SIEV生成可解释的轨迹，突出推理的关键属性，如面对挑战的稳健性、在冲突下的适应性以及在对立观点间的综合能力，这些维度是传统基于正确性的度量无法捕捉到的。在GSM和MMLU上的实证结果表明，最先进的模型在推理能力上存在显著差距：例如，GPT-5-chat在通过SIEV的过程导向视角评估时，得分下降了超过40分（满分100分）。通过将焦点从模型给出的答案转向其如何得出答案的过程，SIEV能够实现更透明和原则性的区分，即结构化推理与表面模式生成之间的区别，从而为评估和理解大语言模型的推理能力提供更清晰的基础。

Summary / 总结

The paper evaluates language models' reasoning beyond mere correctness by introducing SIEV, a structured evaluation framework based on dialectical philosophy. SIEV assesses reasoning through explicit thesis-antithesis-synthesis interactions, revealing key properties such as robustness to challenges and adaptability under conflict. Empirical results on GSM and MMLU show significant gaps in reasoning abilities of state-of-the-art models when evaluated with SIEV, highlighting the importance of process-oriented assessment over static correctness metrics.

论文提出了SIEV框架，超越正确性评估语言模型的推理能力，通过论题-反题-合题的互动来评估推理。关键发现表明，像GPT-5-chat这样的先进模型在通过SIEV评估时显示出推理能力的显著差距，强调了稳健性、适应性和综合等推理的关键维度，这些维度是传统评估方法无法捕捉到的。

Diffusion Alignment Beyond KL: Variance Minimisation as Effective Policy Optimiser

Authors: Zijing Ou, Jacob Si, Junyi Zhu, Ondrej Bohdal, Mete Ozay, Taha Ceritli, Yingzhen Li

First: 2026-02-12T18:06:03+00:00 · Latest: 2026-02-12T18:06:03+00:00

Abstract

Diffusion alignment adapts pretrained diffusion models to sample from reward-tilted distributions along the denoising trajectory. This process naturally admits a Sequential Monte Carlo (SMC) interpretation, where the denoising model acts as a proposal and reward guidance induces importance weights. Motivated by this view, we introduce Variance Minimisation Policy Optimisation (VMPO), which formulates diffusion alignment as minimising the variance of log importance weights rather than directly optimising a Kullback-Leibler (KL) based objective. We prove that the variance objective is minimised by the reward-tilted target distribution and that, under on-policy sampling, its gradient coincides with that of standard KL-based alignment. This perspective offers a common lens for understanding diffusion alignment. Under different choices of potential functions and variance minimisation strategies, VMPO recovers various existing methods, while also suggesting new design directions beyond KL.

中文标题/摘要

标题：超越KL的扩散对齐：方差最小化作为有效的策略优化器

扩散对齐将预训练的扩散模型适应为沿去噪轨迹从奖励倾斜分布中采样。这一过程自然地具有顺序蒙特卡洛（SMC）的解释，其中去噪模型作为提议者，奖励指导诱导重要性权重。受这一观点的启发，我们引入了方差最小化策略优化（VMPO），它将扩散对齐公式化为最小化对数重要性权重的方差，而不是直接优化基于KL的目标。我们证明方差目标在奖励倾斜目标分布下最小化，并且在基于策略采样的情况下，其梯度与标准基于KL对齐的梯度一致。这一视角为理解扩散对齐提供了一个共同的视角。在不同的潜在函数和方差最小化策略选择下，VMPO恢复了各种现有方法，同时也提出了超越KL的新设计方向。

Summary / 总结

The paper introduces Variance Minimisation Policy Optimisation (VMPO), which aligns pretrained diffusion models to reward-tilted distributions by minimising the variance of log importance weights instead of directly optimising a KL-based objective. The authors prove that this variance objective is minimised by the reward-tilted target distribution and that, under on-policy sampling, its gradient coincides with that of standard KL-based alignment. Under different choices of potential functions and variance minimisation strategies, VMPO recovers various existing methods and suggests new design directions beyond KL.

论文提出方差最小化策略优化（VMPO），将扩散对齐表述为最小化对数重要性权重的方差，而不是直接优化基于KL的目标。作者证明该方差目标在奖励倾斜的目标分布处取得最小值，且在同策略采样下其梯度与标准KL对齐的梯度一致。在不同的势函数和方差最小化策略下，VMPO可还原多种现有方法，并提示了超越KL的新设计方向。
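The two claims in the VMPO abstract can be illustrated in a simplified single-step setting. The derivation below is a sketch under assumed notation that does not appear in the abstract: a reference model $p_{\mathrm{ref}}$, a reward $r$, a temperature $\beta$, an aligned model $p_\theta$, and a sampling distribution $q$ with full support. The paper itself works along the full denoising trajectory with SMC potentials, which this sketch omits.

```latex
% Reward-tilted target and log importance weight (assumed notation).
\begin{align*}
p^\ast(x) &= \frac{1}{Z}\, p_{\mathrm{ref}}(x)\, \exp\!\big(r(x)/\beta\big), \\
\log w_\theta(x) &= \log p_{\mathrm{ref}}(x) + r(x)/\beta - \log p_\theta(x).
\end{align*}

% Variance objective: zero exactly when \log w_\theta is constant under q,
% i.e. when p_\theta \propto p_{\mathrm{ref}} \exp(r/\beta), so p_\theta = p^\ast.
\begin{align*}
\mathcal{L}(\theta) &= \tfrac{1}{2}\, \mathrm{Var}_{x \sim q}\big[\log w_\theta(x)\big] \;\ge\; 0.
\end{align*}

% On-policy case: set q = p_\theta but do not differentiate through the sampler.
% Using \nabla_\theta \log w_\theta = -\nabla_\theta \log p_\theta and
% \mathbb{E}_{p_\theta}[\nabla_\theta \log p_\theta] = 0:
\begin{align*}
\nabla_\theta \mathcal{L}(\theta)
  &= \mathbb{E}_{p_\theta}\!\Big[\big(\log w_\theta - \mathbb{E}_{p_\theta}[\log w_\theta]\big)\, \nabla_\theta \log w_\theta\Big]
   = -\,\mathbb{E}_{p_\theta}\!\big[\log w_\theta\, \nabla_\theta \log p_\theta\big], \\
\nabla_\theta\, \mathrm{KL}\big(p_\theta \,\|\, p^\ast\big)
  &= -\,\nabla_\theta\, \mathbb{E}_{p_\theta}\!\big[\log w_\theta\big]
   = -\,\mathbb{E}_{p_\theta}\!\big[\log w_\theta\, \nabla_\theta \log p_\theta\big].
\end{align*}
```

In this simplified setting the two gradients agree term by term, which matches the abstract's statement for on-policy sampling; choosing a $q$ other than $p_\theta$ gives a different objective.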

Bandit Learning in Matching Markets with Interviews

Authors: Amirmahdi Mirfakhar, Xuchuang Wang, Mengfan Xu, Hedyeh Beyhaghi, Mohammad Hajiesmaili

First: 2026-02-12T18:03:37+00:00 · Latest: 2026-02-12T18:03:37+00:00

Abstract

Two-sided matching markets rely on preferences from both sides, yet it is often impractical to evaluate preferences. Participants, therefore, conduct a limited number of interviews, which provide early, noisy impressions and shape final decisions. We study bandit learning in matching markets with interviews, modeling interviews as *low-cost hints* that reveal partial preference information to both sides. Our framework departs from existing work by allowing firm-side uncertainty: firms, like agents, may be unsure of their own preferences and can make early hiring mistakes by hiring less preferred agents. To handle this, we extend the firm's action space to allow *strategic deferral* (choosing not to hire in a round), enabling recovery from suboptimal hires and supporting decentralized learning without coordination. We design novel algorithms for (i) a centralized setting with an omniscient interview allocator and (ii) decentralized settings with two types of firm-side feedback. Across all settings, our algorithms achieve time-independent regret, a substantial improvement over the $O(\log T)$ regret bounds known for learning stable matchings without interviews. Also, under mild structured markets, decentralized performance matches the centralized counterpart up to polynomial factors in the number of agents and firms.

Summary / 总结

The paper investigates bandit learning in matching markets where participants use interviews as low-cost hints to gain partial preference information. It extends the firm's action space to include strategic deferral, allowing for recovery from suboptimal hires. The study designs algorithms for both centralized and decentralized settings, achieving time-independent regret, a significant improvement over previous bounds. Under certain market structures, decentralized performance matches the centralized counterpart up to polynomial factors in the number of agents and firms.

论文研究了匹配市场中的多臂老虎机问题，参与者通过面试获得部分偏好信息。研究扩展了企业的行动空间，允许战略性推迟，以便从不理想的招聘中恢复。研究设计了适用于集中式和分散式设置的算法，实现了时间无关的遗憾，这是对之前已知的遗憾界限的重要改进。在某些市场结构下，分散式性能与集中式性能在多项式因子内相匹配。

Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training

Authors: Miaosen Zhang, Yishan Liu, Shuxia Lin, Xu Yang, Qi Dai, Chong Luo, Weihao Jiang, Peng Hou, Anxiang Zeng, Xin Geng, Baining Guo

First: 2026-02-12T17:59:58+00:00 · Latest: 2026-02-12T17:59:58+00:00

Abstract

Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on-policy data. We propose a framework to bridge this chasm by enabling On-Policy SFT. We first present *Distribution Discriminant Theory (DDT)*, which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) *In-Distribution Finetuning (IDFT)*, a loss-level method to enhance generalization ability of SFT, and (ii) *Hinted Decoding*, a data-level technique that can re-align the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We open-source the code here: https://github.com/zhangmiaosen2000/Towards-On-Policy-SFT

中文标题/摘要

标题：迈向基于策略的微调：分布判别理论及其在大模型训练中的应用

监督微调（SFT）计算效率高，但通常在泛化能力上不如强化学习（RL）。这一差距主要由RL使用基于策略数据引起。我们提出了一种框架来弥合这一差距，使其能够实现基于策略的微调。我们首先提出了分布判别理论（DDT），解释并量化了数据与模型诱导分布之间的对齐程度。利用DDT，我们引入了两种互补的技术：（i）在分布内微调（IDFT），一种在损失层面增强SFT泛化能力的方法；（ii）提示解码，一种在数据层面的技术，可以重新对齐训练语料库以匹配模型的分布。大量实验表明，我们的框架在泛化性能上与著名的离线RL算法（包括DPO和SimPO）相当，同时保持了SFT管道的效率。因此，该提出的框架为在RL不可行的领域提供了一种实用的替代方案。我们在此开源代码：https://github.com/zhangmiaosen2000/Towards-On-Policy-SFT

Summary / 总结

The paper addresses the gap between supervised fine-tuning (SFT) and reinforcement learning (RL) by proposing a framework for on-policy SFT. It introduces Distribution Discriminant Theory (DDT) to quantify the alignment between data and model distribution, and develops two techniques: In-Distribution Finetuning (IDFT) and Hinted Decoding. Experiments show that this framework matches the generalization performance of offline RL algorithms like DPO and SimPO while maintaining the efficiency of SFT, making it a practical alternative for RL-infeasible domains.

论文通过提出面向策略的微调（On-Policy SFT）框架来弥合监督微调（SFT）和强化学习（RL）之间的差距。它引入了分布判别理论（DDT）来量化数据与模型诱导分布之间的对齐，并提出了两种技术：In-Distribution Finetuning（IDFT）和提示解码。实验表明，该框架在泛化性能上与DPO和SimPO等离线RL算法相当，同时保持了SFT的高效性，因此在RL不可行的领域提供了一个实用的替代方案。

Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching

Authors: Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang, Adheesh Juvekar, Tianshu Bao, Lin Chai, Sparsh Mittal, Inderjit S Dhillon, Ismini Lourentzou

First: 2026-02-12T17:59:08+00:00 · Latest: 2026-02-12T17:59:08+00:00

Abstract

We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.

中文标题/摘要

标题：兼而有之：通过统一离散流匹配实现多模态推理与生成

我们提出了一种统一的离散流匹配框架UniDFlow，用于多模态理解、生成和编辑。该框架通过特定任务的低秩适配器解耦理解与生成，避免了目标干扰和表示纠缠，同时一种新颖的基于参考的多模态偏好对齐优化了在相同条件下的相对结果，提高了忠实度和可控性，而无需大规模重新训练。UniDFlow在八个基准测试中实现了SOTA性能，并且在包括 inpainting、上下文图像生成、基于参考的编辑和组合生成等任务上表现出强大的零样本泛化能力，尽管没有进行明确的特定任务训练。

The Observer Effect in World Models: Invasive Adaptation Corrupts Latent Physics

Authors: Christian Internò, Jumpei Yamaguchi, Loren Amdahl-Culleton, Markus Olhofer, David Klindt, Barbara Hammer

First: 2026-02-12T17:56:07+00:00 · Latest: 2026-02-12T17:56:07+00:00

Abstract

Determining whether neural models internalize physical laws as world models, rather than exploiting statistical shortcuts, remains challenging, especially under out-of-distribution (OOD) shifts. Standard evaluations often test latent capability via downstream adaptation (e.g., fine-tuning or high-capacity probes), but such interventions can change the representations being measured and thus confound what was learned during self-supervised learning (SSL). We propose a non-invasive evaluation protocol, PhyIP. We test whether physical quantities are linearly decodable from frozen representations, motivated by the linear representation hypothesis. Across fluid dynamics and orbital mechanics, we find that when SSL achieves low error, latent structure becomes linearly accessible. PhyIP recovers internal energy and Newtonian inverse-square scaling on OOD tests (e.g., $\rho > 0.90$). In contrast, adaptation-based evaluations can collapse this structure ($\rho \approx 0.05$). These findings suggest that adaptation-based evaluation can obscure latent structures and that low-capacity probes offer a more accurate evaluation of physical world models.

中文标题/摘要

标题：世界模型中的观察者效应：侵入性适应破坏潜在物理定律

确定神经模型是否将物理定律作为世界模型内化，而不是利用统计捷径，尤其是在分布外（OOD）转移时，仍然具有挑战性。标准评估通常通过下游适应（例如微调或高容量探针）测试潜在能力，但这些干预措施会改变正在测量的表示，从而混淆自监督学习（SSL）期间学到的内容。我们提出了一种非侵入性评估协议，PhyIP。我们测试物理量是否可以从冻结的表示中线性可解码，受线性表示假设的启发。在流体动力学和轨道力学中，我们发现当SSL达到低误差时，潜在结构变得线性可访问。PhyIP在分布外测试中恢复了内部能量和牛顿平方反比缩放（例如，ρ>0.90）。相比之下，基于适应的评估可能会使这种结构崩溃（ρ≈0.05）。这些发现表明，基于适应的评估可能会掩盖潜在结构，而低容量探针提供了对物理世界模型更准确的评估。

Summary / 总结

The study aims to evaluate whether neural models internalize physical laws as world models or rely on statistical shortcuts, particularly under out-of-distribution shifts. It introduces a non-invasive evaluation protocol, PhyIP, which tests the linear decodability of physical quantities from frozen representations. The results show that when self-supervised learning achieves low error, physical structures become linearly accessible, recovering internal energy and Newtonian inverse-square scaling with high correlation ($\rho > 0.90$) on OOD tests. In contrast, adaptation-based evaluations obscure these structures, indicating that low-capacity probes provide a more accurate assessment of physical world models.

研究旨在评估神经模型是内化物理定律还是依赖统计捷径，特别是在分布外变化时。研究引入了一种非侵入性评估协议PhyIP，测试冻结表示中物理量的线性可解性。结果表明，当自我监督学习达到低误差时，物理结构变得线性可解，能够在分布外测试中恢复内部能量和牛顿反平方定律，相关系数高达0.90。相比之下，适应性评估方法会掩盖这些结构。这表明非侵入性方法如PhyIP比侵入性适应方法更能准确评估物理世界模型。

VIRENA: Virtual Arena for Research, Education, and Democratic Innovation

Authors: Emma Hoes, K. Jonathan Klueser, Fabrizio Gilardi

First: 2026-02-12T17:46:52+00:00 · Latest: 2026-02-12T17:46:52+00:00

Comments: VIRENA is under active development and currently in use at the University of Zurich, supported by the DIZH Innovation Program: 2nd Founder-Call. This preprint will be updated as new features are released. For the latest version and to inquire about demos or pilot collaborations, contact the authors

Abstract

Digital platforms shape how people communicate, deliberate, and form opinions. Studying these dynamics has become increasingly difficult due to restricted data access, ethical constraints on real-world experiments, and limitations of existing research tools. VIRENA (Virtual Arena) is a platform that enables controlled experimentation in realistic social media environments. Multiple participants interact simultaneously in realistic replicas of feed-based platforms (Instagram, Facebook, Reddit) and messaging apps (WhatsApp, Messenger). Large language model-powered AI agents participate alongside humans with configurable personas and realistic behavior. Researchers can manipulate content moderation approaches, pre-schedule stimulus content, and run experiments across conditions through a visual interface requiring no programming skills. VIRENA makes possible research designs that were previously impractical: studying human--AI interaction in realistic social contexts, experimentally comparing moderation interventions, and observing group deliberation as it unfolds. Built on open-source technologies that ensure data remain under institutional control and comply with data protection requirements, VIRENA is currently in use at the University of Zurich and available for pilot collaborations. Designed for researchers, educators, and public organizations alike, VIRENA's no-code interface makes controlled social media simulation accessible across disciplines and sectors. This paper documents its design, architecture, and capabilities.

中文标题/摘要

标题：VIRENA：虚拟竞技场，用于研究、教育和民主创新

数字平台塑造了人们的交流、讨论和形成观点的方式。由于数据访问受限、现实世界实验的伦理限制以及现有研究工具的局限性，研究这些动态变得越来越困难。VIRENA（虚拟竞技场）是一个平台，它能够在现实社交媒体环境中进行受控实验。多个参与者可以同时在基于信息流的平台（Instagram、Facebook、Reddit）和即时通讯应用（WhatsApp、Messenger）的现实复制品中互动。由大型语言模型驱动的AI代理可以与人类一起参与，具有可配置的人设和现实行为。研究人员可以通过无需编程技能的可视化界面操控内容审核方法、预排定刺激内容，并在不同条件下运行实验。VIRENA 使得以前不切实际的研究设计成为可能：研究人类与AI的互动、实验性地比较干预措施的效果以及观察群体讨论的展开。VIRENA 建立在开源技术之上，确保数据保留在机构控制之下并符合数据保护要求，目前在苏黎世大学使用，并可供试点合作。VIRENA 的无代码界面使其跨学科和跨领域的受控社交媒体模拟变得可行。本文记录了其设计、架构和功能。

Summary / 总结

VIRENA is a platform designed to enable controlled experimentation in realistic social media environments, addressing the challenges of restricted data access and ethical constraints. Researchers can manipulate content moderation, pre-schedule stimulus content, and run experiments through a visual interface without programming skills. Key findings include the ability to study human-AI interaction, experimentally compare moderation interventions, and observe group deliberation in realistic social contexts, making VIRENA a valuable tool for researchers, educators, and public organizations.

VIRENA 是一个平台，旨在使研究人员能够在现实的社交媒体环境中进行受控实验，解决数据访问受限和伦理约束的问题。它允许研究人员通过无代码的可视化界面操纵内容审核、预排定刺激内容并跨条件运行实验。主要发现包括能够研究人-AI 交互、实验性比较审核干预措施以及观察群体讨论在现实社会环境中的演变，使以前难以实现的研究设计成为可能。

CONSENT: A Negotiation Framework for Leveraging User Flexibility in Vehicle-to-Building Charging under Uncertainty

Authors: Rishav Sen, Fangqi Liu, Jose Paolo Talusan, Ava Pettet, Yoshinori Suzue, Mark Bailey, Ayan Mukhopadhyay, Abhishek Dubey

First: 2026-01-04T15:59:52+00:00 · Latest: 2026-02-12T17:45:11+00:00

Comments: Submitted to AAMAS 2026. 38 pages, 13 figures, 14 tables

Abstract

The growth of Electric Vehicles (EVs) creates a conflict in vehicle-to-building (V2B) settings between building operators, who face high energy costs from uncoordinated charging, and drivers, who prioritize convenience and a full charge. To resolve this, we propose a negotiation-based framework that, by design, guarantees voluntary participation, strategy-proofness, and budget feasibility. It transforms EV charging into a strategic resource by offering drivers a range of incentive-backed options for modest flexibility in their departure time or requested state of charge (SoC). Our framework is calibrated with user survey data and validated using real operational data from a commercial building and an EV manufacturer. Simulations show that our negotiation protocol creates a mutually beneficial outcome: lowering the building operator's costs by over 3.5% compared to an optimized, non-negotiating smart charging policy, while simultaneously reducing user charging expenses by 22% below the utility's retail energy rate. By aligning operator and EV user objectives, our framework provides a strategic bridge between energy and mobility systems, transforming EV charging from a source of operational friction into a platform for collaboration and shared savings.

中文标题/摘要

标题：CONSENT：一种在不确定性条件下利用用户灵活性的车辆到建筑充电谈判框架

电动汽车（EVs）的增长在车辆到建筑（V2B）设置中引发了建筑运营商和驾驶员之间的冲突，前者面临因充电不协调而导致的高昂能源成本，后者则更注重便利性和满电状态。为解决这一问题，我们提出了一种基于谈判的框架，该框架通过设计确保了自愿参与、策略证明性和预算可行性。该框架通过为驾驶员提供一系列激励支持的选择，将电动汽车充电转变为一种战略资源，这些选择包括适度的离站时间或所需荷电状态（SoC）的灵活性。我们使用用户调查数据对该框架进行了校准，并使用一家商业建筑和一家电动汽车制造商的实际运营数据进行了验证。模拟结果显示，我们的谈判协议创造了双赢的结果：与优化的非谈判智能充电政策相比，降低了建筑运营商超过3.5%的成本，同时将用户的充电费用降低了22%以上，低于公用事业的零售能源费率。通过使运营商和电动汽车用户的目标一致，我们的框架为能源和移动性系统之间提供了一座战略桥梁，将电动汽车充电从运营摩擦转变为合作和共享节省的平台。

Summary / 总结

The paper addresses the conflict between building operators and EV drivers in vehicle-to-building (V2B) settings by proposing a negotiation framework that guarantees voluntary participation, strategy-proofness, and budget feasibility. It offers drivers incentive-backed options in exchange for modest flexibility in their departure time or requested state of charge (SoC). Simulations show that the framework lowers the building operator's costs by over 3.5% compared to an optimized, non-negotiating smart charging policy, while reducing user charging expenses by 22% below the utility's retail energy rate.

论文提出了一种谈判框架，旨在解决V2B设置中建筑运营商与电动汽车驾驶员之间的冲突，该框架确保了自愿参与、策略稳健性和预算可行性。它为驾驶员提供了适度充电调整的灵活选项，这些选项受到激励。模拟结果显示，该框架将建筑运营商的成本降低了超过3.5%，同时将用户的充电费用降低了22%，相比非谈判的智能充电策略而言。

DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

Authors: Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, Jiaqi Wang

First: 2026-02-12T17:44:24+00:00 · Latest: 2026-02-12T17:44:24+00:00

Abstract

Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., >10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.

中文标题/摘要

标题：DeepGen 1.0：一种轻量级统一多模态模型，用于推动图像生成和编辑

当前用于图像生成和编辑的统一多模态模型通常依赖于庞大的参数规模（例如，>10B），导致高昂的训练成本和部署足迹。在本工作中，我们提出了DeepGen 1.0，这是一种轻量级的5B统一模型，其综合能力与或超越了更大的同类模型。为克服紧凑模型在语义理解和精细控制方面的局限性，我们引入了堆叠通道桥接（SCB），这是一种深度对齐框架，从多个VLM层中提取层次特征，并与可学习的“思考令牌”融合，为生成骨干提供结构化、富含推理的指导。我们还设计了一种以数据为中心的训练策略，分为三个渐进阶段：（1）在大规模图像-文本对和编辑三元组上进行对齐预训练，以同步VLM和DiT表示；（2）在高质量的生成、编辑和推理任务混合集上进行联合监督微调，以培养全方位的能力；（3）使用MR-GRPO的强化学习，利用混合奖励函数和监督信号，从而在生成质量和与人类偏好的一致性方面取得显著提升，同时保持稳定的训练进展并避免视觉伪影。尽管仅在约5000万样本上进行训练，DeepGen 1.0在各种基准测试中均表现出领先性能，在WISE上超越了80B的HunyuanImage 28%，在UniREditBench上超越了27B的Qwen-Image-Edit 37%。通过开源我们的训练代码、权重和数据集，我们提供了一种高效、高性能的替代方案，以促进统一多模态研究的民主化。

Summary / 总结

DeepGen 1.0 is a lightweight 5B unified model for image generation and editing, addressing the high training and deployment costs of larger models. It introduces Stacked Channel Bridging (SCB), which fuses hierarchical features from multiple VLM layers with learnable "think tokens" to improve semantic understanding and fine-grained control. The model is trained in three stages: alignment pre-training, joint supervised fine-tuning, and reinforcement learning with MR-GRPO. Trained on only about 50M samples, DeepGen 1.0 surpasses the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench.

DeepGen 1.0 是一个轻量级的 5B 模型，在图像生成和编辑方面表现出色，与更大规模的模型相当或超越。它引入了堆叠通道桥接（SCB）来增强语义理解和精细控制。该模型经过三个阶段的训练：对齐预训练、联合监督微调和强化学习。DeepGen 1.0 在各种基准测试中表现出色，分别在 WISE 和 UniREditBench 上比 80B HunyuanImage 高出 28%，比 27B Qwen-Image-Edit 高出 37%。

ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Authors: Mathieu Sibue, Andres Muñoz Garza, Samuel Mensah, Pranav Shetty, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso

First: 2026-02-12T17:38:57+00:00 · Latest: 2026-02-12T17:38:57+00:00

Comments: EACL 2026, main conference

Abstract

Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.

中文标题/摘要

标题：ExStrucTiny：文档图像中结构化信息提取基准

企业文档，如表单和报告，嵌入了下游应用（如数据归档、自动化工作流和分析）所需的关键信息。尽管通用视觉语言模型（VLMs）在现有的文档理解基准测试中表现良好，但它们在跨多种文档类型和灵活模式进行整体、精细的结构化提取方面的能力尚未得到充分研究。现有的关键实体提取（KEE）、关系提取（RE）和视觉问答（VQA）数据集由于实体本体狭窄、查询简单或文档类型单一，往往忽略了适应性和结构化提取的需求。为解决这些差距，我们引入了ExStrucTiny，这是一个新的基准数据集，用于文档图像中的结构化信息提取（IE），将KEE、RE和VQA的各个方面统一起来。通过结合手动和合成的人工验证样本的新颖管道构建，ExStrucTiny涵盖了更多样化的文档类型和提取场景。我们分析了开放和封闭的VLMs在这一基准上的表现，突出了模式适应、查询不明确和答案定位等挑战。我们希望我们的工作能够为改进通用模型在文档中的结构化IE提供基础。

Summary / 总结

ExStrucTiny is a new benchmark for structured information extraction from document images, addressing the limitations of existing datasets by covering a wider range of document types and scenarios. It combines manual and synthetic human-validated samples to evaluate the ability of Vision Language Models (VLMs) to perform holistic, fine-grained extraction across diverse document schemas. Key findings include challenges such as schema adaptation and answer localization, which highlight the need for improved models in this area.

ExStrucTiny 是一个针对文档图像的结构化信息提取的新基准，通过涵盖多种文档类型和灵活的结构来解决现有数据集的局限性。它结合了手动和合成的人工验证样本，统一了关键实体提取、关系提取和视觉问答。该基准测试了开放和封闭的视觉语言模型，揭示了诸如结构适应和答案定位等挑战。这项工作旨在改进通用模型在文档中的结构化信息提取能力。

PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery

Authors: Bowei He, Lihao Yin, Hui-Ling Zhen, Xiaokun Zhang, Mingxuan Yuan, Chen Ma

Venue: ICLR 2026

First: 2025-02-18T07:11:08+00:00 · Latest: 2026-02-12T17:38:20+00:00

Comments: Accepted by ICLR 2026

Abstract

Model pruning is an effective approach for compressing large language models (LLMs). However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, some irrelevant instructions may also introduce negative effects to model capacity recovery. To address these challenges, we propose the Post-training dAta Selection method for Efficient pruned large language model Recovery (PASER). PASER aims to identify instructions to recover the most compromised model capacities with a certain data budget. Our approach first applies manifold learning and spectral clustering to group recovery instructions in the semantic space, revealing capability-specific instruction sets. Then, the data budget is adaptively allocated across clusters by the degree of corresponding model capability degradation. In each cluster, we prioritize data samples that lead to the most decline of model performance. To mitigate potential negative tuning effects, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4%-20% of the original post-training data. We provide the code repository at https://github.com/BokwaiHo/PASER.

中文标题/摘要

标题：PASER：高效剪枝大型语言模型恢复的后训练数据选择

模型剪枝是压缩大型语言模型（LLMs）的有效方法。然而，这一过程往往会导致模型能力显著下降。虽然后训练技术如指令调优常被用来恢复模型性能，但现有方法往往忽视了模型能力的不均匀下降，并且计算成本高。此外，一些无关指令也可能对模型能力恢复产生负面影响。为了解决这些挑战，我们提出了后训练数据选择方法以实现高效剪枝大型语言模型恢复（PASER）。PASER旨在以一定的数据预算识别出能够恢复最受损模型能力的指令。我们的方法首先应用流形学习和谱聚类在语义空间中对恢复指令进行分组，揭示出能力特定的指令集。然后，根据相应模型能力下降的程度，适应性地分配数据预算。在每个集群中，我们优先选择导致模型性能下降最多的数据样本。为了减轻潜在的负面调优效果，我们还检测并过滤出冲突或无关的恢复数据。广泛实验表明，PASER显著优于传统基线，有效恢复了剪枝LLMs的一般能力，仅使用原始后训练数据的4%-20%。我们在GitHub上提供了代码库：https://github.com/BokwaiHo/PASER。

Summary / 总结

PASER is a post-training data selection method designed to efficiently recover the capabilities of pruned large language models. It uses manifold learning and spectral clustering to group recovery instructions in the semantic space, then allocates the data budget across clusters according to how much the corresponding model capability has degraded. Within each cluster it prioritizes the samples associated with the largest performance decline, and it filters out conflicting or irrelevant data to prevent negative tuning effects. Experiments show that PASER outperforms conventional baselines, recovering the general capabilities of pruned LLMs using only 4%-20% of the original post-training data.

PASER 是一种后训练数据选择方法，用于高效恢复剪枝的大语言模型性能。它使用流形学习和谱聚类来识别和优先处理在有限数据预算内恢复最受损模型能力的指令。PASER 还过滤掉冲突或无关的数据以减轻负面影响。实验表明，PASER 超过传统方法，仅使用原始后训练数据的 4%-20% 就能恢复剪枝 LLM 的一般能力。
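
The budget-allocation step described in the PASER abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the per-cluster and per-sample "degradation" scores, and the largest-remainder rounding rule are all assumptions made for illustration, and the clustering and conflict-filtering steps are omitted. It only shows the idea of splitting a fixed data budget across clusters in proportion to capability degradation, then taking the highest-scoring samples within each cluster.

```python
from typing import Dict, List, Tuple

# (sample_id, hypothetical per-sample degradation score); higher means the
# pruned model got worse on this sample. The scoring itself is assumed given.
Sample = Tuple[str, float]


def allocate_budget(degradation: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` across clusters in proportion to non-negative degradation scores."""
    total = sum(degradation.values())
    if total <= 0:
        raise ValueError("degradation scores must sum to a positive value")
    # Exact proportional share per cluster, then its integer part.
    raw = {name: budget * score / total for name, score in degradation.items()}
    alloc = {name: int(share) for name, share in raw.items()}
    # Give the samples lost to rounding to the largest fractional remainders.
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


def select_samples(
    clusters: Dict[str, List[Sample]],
    degradation: Dict[str, float],
    budget: int,
) -> List[str]:
    """Pick the highest-scoring samples in each cluster, up to its allocation."""
    alloc = allocate_budget(degradation, budget)
    selected: List[str] = []
    for name, samples in clusters.items():
        ranked = sorted(samples, key=lambda s: s[1], reverse=True)
        selected.extend(sample_id for sample_id, _ in ranked[: alloc[name]])
    return selected


if __name__ == "__main__":
    clusters = {
        "math": [("m1", 0.9), ("m2", 0.1), ("m3", 0.5)],
        "code": [("c1", 0.2), ("c2", 0.8)],
    }
    print(select_samples(clusters, {"math": 2.0, "code": 1.0}, budget=3))
```

If a cluster holds fewer samples than its allocation, this sketch simply returns fewer samples than the budget; redistributing the unused share is left out to keep the example short.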

CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis

First: 2025-09-26T08:55:09+00:00 · Latest: 2026-02-12T17:32:59+00:00

Abstract

Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.

中文标题/摘要

标题：CoSpaDi: 通过校准引导的稀疏字典学习压缩大型语言模型

大型语言模型（LLMs）的后训练压缩通常依赖于低秩权重近似，即将权重矩阵的每一列表示在共享的低维子空间中。这种策略计算效率高，但其背后的约束对于异构投影权重来说可能过于僵硬，可能会导致不必要的准确度损失。我们提出了CoSpaDi（通过稀疏字典学习压缩），这是一种无需训练的框架，用结构化稀疏分解替代低秩分解，其中每个权重矩阵表示为一个稠密字典乘以列稀疏系数矩阵。这产生了一种子空间模型：权重矩阵的列表示为不同字典原子子集的线性组合，从而在固定参数预算下提高表达能力。CoSpaDi 是校准引导的：使用一个小的校准集，我们优化分解以最小化层输出的功能重建误差，而不是权重空间误差。激活衍生的Gram正交化将这种数据感知目标重新表述为转换后的权重上的标准字典学习问题，并支持层内压缩和组内相似投影跨层字典共享。在Llama和Qwen模型家族中，CoSpaDi 在20-40%的压缩比下，相对于基于SVD的先进基线和强大的结构剪枝基线，始终能提高准确度-压缩和困惑度-压缩的权衡。由此产生的结构稀疏性使稀疏-密集计算成为可能，并与稀疏系数的后训练量化集成。

Summary / 总结

CoSpaDi is a training-free compression framework for large language models that replaces low-rank factorization with a structured sparse decomposition: each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix, which improves expressiveness at a fixed parameter budget. It optimizes the factorization on a small calibration set to minimize the functional reconstruction error of layer outputs rather than weight-space error, and supports both per-layer compression and cross-layer dictionary sharing. Experiments on Llama and Qwen models show that CoSpaDi improves the accuracy-compression and perplexity-compression trade-offs over SVD-based and structured pruning baselines at 20-40% compression ratios.

CoSpaDi 是一种无需训练的大型语言模型压缩框架，通过结构化的稀疏分解来表示权重矩阵，提高表达能力同时保持准确性。它使用一个小的校准集来优化因子分解，以最小化层输出的功能重构误差，从而在20-40%的压缩比下比基于SVD的方法和结构化剪枝方法表现更好。结构化的稀疏性允许高效的稀疏-密集计算，并与后训练量化相结合。
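
The decomposition named in the CoSpaDi abstract, a dense dictionary multiplied by a column-sparse coefficient matrix, can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than the paper's code: the function names, the dict-based sparse column format, and the value-count formulas (which ignore the cost of storing sparse indices). The calibration-guided fitting of the dictionary is not shown; the sketch only reconstructs columns from a given factorization and compares stored-value counts against a rank-r factorization.

```python
from typing import Dict, List

Vector = List[float]


def reconstruct_columns(
    atoms: List[Vector], sparse_columns: List[Dict[int, float]]
) -> List[Vector]:
    """Rebuild each weight column as a sparse combination of dictionary atoms.

    atoms[k] is the k-th dictionary atom (a column of the dense dictionary).
    sparse_columns[j] maps atom index -> coefficient for weight column j;
    different columns may use different atoms (a union of subspaces).
    """
    dim = len(atoms[0])
    columns: List[Vector] = []
    for coeffs in sparse_columns:
        column = [0.0] * dim
        for atom_index, weight in coeffs.items():
            for i in range(dim):
                column[i] += weight * atoms[atom_index][i]
        columns.append(column)
    return columns


def low_rank_values(rows: int, cols: int, rank: int) -> int:
    """Stored values for a rank-`rank` factorization (rows x rank) @ (rank x cols)."""
    return rank * (rows + cols)


def sparse_dictionary_values(
    rows: int, cols: int, num_atoms: int, nonzeros_per_column: int
) -> int:
    """Stored values for a dense dictionary plus column-sparse coefficients.

    Index storage for the nonzero positions is not counted here.
    """
    return rows * num_atoms + cols * nonzeros_per_column


if __name__ == "__main__":
    atoms = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    sparse_columns = [{0: 2.0}, {2: 3.0}, {1: 1.0, 2: -1.0}]
    print(reconstruct_columns(atoms, sparse_columns))
```

In a low-rank factorization every column is a combination of the same `rank` basis vectors; here each column picks its own small subset of atoms, which is the extra flexibility the abstract attributes to the union-of-subspaces model.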

LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs

Authors: Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang

Venue: Nat Mach Intell 8, 20-31 (2026)

First: 2024-10-18T05:21:05+00:00 · Latest: 2026-02-12T17:29:23+00:00

Comments: Published at Nature Machine Intelligence

Abstract

Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models (LLMs) and vision language models (VLMs) now assist in experiment design and procedural guidance, yet their "illusion of understanding" may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios, encompassing 3,128 open-ended tasks. Evaluations on 19 advanced LLMs and VLMs show that no model evaluated on hazard identification surpasses 70% accuracy. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying AI systems in real laboratory settings.

中文标题/摘要

标题：实验室安全基准：评估大型语言模型在科学实验室安全问题上的表现

人工智能（AI）正在革新科学研究，但其在实验室环境中的日益集成带来了关键的安全挑战。大型语言模型（LLMs）和视觉语言模型（VLMs）现在协助实验设计和程序指导，但它们的“理解错觉”可能导致研究人员过度信任不安全的输出。我们展示了当前模型远未达到安全实验室操作所需的可靠性。我们引入了实验室安全基准（LabSafety Bench），这是一个全面的基准，评估模型在危害识别、风险评估和后果预测方面的表现，涵盖765个多项选择题和404个真实的实验室场景，共计3,128个开放式任务。对19个先进LLM和VLM的评估显示，在危害识别方面，没有模型的准确率超过70%。虽然专有模型在结构化评估中表现良好，但在开放式推理方面并没有明显优势。这些结果强调了在实际实验室环境中部署AI系统之前，迫切需要专门的安全评估框架。

Summary / 总结

The research addresses the safety challenges posed by the integration of AI in scientific labs, where large language models and vision language models may lead researchers to overtrust unsafe outputs. LabSafety Bench, a new benchmark, evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios encompassing 3,128 open-ended tasks. Across 19 advanced LLMs and VLMs, no model exceeds 70% accuracy on hazard identification, and proprietary models show no clear advantage in open-ended reasoning. The study highlights the need for specialized safety evaluation frameworks before deploying AI in real lab settings.

研究关注AI在科学实验室中的安全挑战，当前的大语言模型和视觉语言模型在识别风险和评估风险方面往往不够准确。LabSafety Bench 是一个新的基准，评估模型在765个选择题和404个真实场景上的表现，结果显示没有模型在识别风险上的准确率超过70%。研究强调在实际实验室环境中部署AI之前需要专门的安全评估框架。

Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education

Authors: Mohamed Huti, Alasdair Mackintosh, Amy Waldock, Dominic Andrews, Maxime Lelièvre, Moritz Boos, Tobias Murray, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod

First: 2026-02-12T17:29:03+00:00 · Latest: 2026-02-12T17:29:03+00:00

Abstract

AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck, particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.

中文标题/摘要

标题：视觉推理基准：评估多模态大语言模型在小学课堂真实视觉问题上的推理能力

AI模型在文本推理方面已达到最先进的水平；然而，它们在处理空间和关系结构方面的推理能力仍然是一个关键瓶颈——特别是在依赖视觉的早期数学中。本文介绍了视觉推理基准（VRB），这是一个新型数据集，旨在评估多模态大语言模型（MLLMs）在解决课堂真实视觉问题方面的能力。该基准基于赞比亚和印度小学考试中的701个问题构建，涵盖了诸如类比推理、模式填充和空间匹配等一系列任务。我们概述了基准的方法和开发过程，故意使用未经编辑的、文字最少的图像来测试模型是否能满足小学教育的实际需求。我们的研究发现，模型在静态技能如计数和缩放方面表现出色，但在面对折叠、反射和旋转等动态操作时却达到了一个明显的“空间天花板”。这些弱点对课堂中的视觉推理问题使用构成了风险，可能导致错误评分、虚假支持和强化学生的错误观念。因此，像VRB这样的面向教育的基准对于确定用于课堂的多模态工具的功能边界至关重要。

Summary / 总结

The paper introduces the Visual Reasoning Benchmark (VRB) to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve classroom-authentic visual problems from primary education. The benchmark consists of 701 questions from primary school examinations in Zambia and India, covering tasks such as reasoning by analogy, pattern completion, and spatial matching. The study finds that models perform better on static skills like counting and scaling but struggle with dynamic operations like folding, reflection, and rotation, indicating a "spatial ceiling" that limits their effectiveness in educational settings.

论文介绍了视觉推理基准（VRB），用于评估多模态大型语言模型（MLLMs）解决小学课堂实际视觉问题的能力。基准包括来自赞比亚和印度的701个问题，涵盖了类比推理和空间匹配等任务。研究发现，模型在计数和缩放等静态技能上表现良好，但在折叠和反射等动态操作上却遇到困难，表明存在一个‘空间天花板’，限制了它们在教育环境中的有效性。
