arXiv Paper Digest

Snapshot: 20260209_0330

EigenLoRAx: Recycling Adapters to Find Principal Subspaces for Resource-Efficient Adaptation and Inference

Authors: Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, Alan Yuille

Venue: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pages 649-659

First: 2025-02-07T07:07:04+00:00 · Latest: 2026-02-05T18:59:59+00:00

Abstract

The rapid growth of large models has raised concerns about their environmental impact and equity in accessibility due to significant computational costs. Low-Rank Adapters (LoRA) offer a lightweight solution for finetuning large models, resulting in an abundance of publicly available adapters tailored to diverse domains. We ask: Can these pretrained adapters be leveraged to further streamline adaptation to new tasks while addressing these challenges? We introduce EigenLoRAx, a parameter-efficient finetuning method that recycles existing adapters to create a principal subspace aligned with their shared domain knowledge, which can be further augmented with orthogonal basis vectors in low-resource scenarios. This enables rapid adaptation to new tasks by learning only lightweight coefficients on the principal components of the subspace, eliminating the need to finetune entire adapters. EigenLoRAx requires significantly fewer parameters and less memory, improving efficiency for both training and inference. Our method demonstrates strong performance across diverse domains and tasks, offering a scalable solution for edge-based applications, personalization, and equitable deployment of large models in resource-constrained environments.

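Chinese Title / Abstract (English translation)

Title: EigenLoRAx: Recycling Adapters to Find Principal Subspaces for Resource-Efficient Adaptation and Inference

The rapid growth of large models has raised concerns about their environmental impact and about inequitable access caused by their substantial computational cost. Low-Rank Adapters (LoRA) provide a lightweight fine-tuning solution, which has made a large number of adapters for different domains publicly available. The question we ask is: can these pretrained adapters be used to further streamline adaptation to new tasks while addressing these challenges? We introduce EigenLoRAx, a parameter-efficient fine-tuning method that recycles existing adapters to create a principal subspace aligned with their shared domain knowledge, and that further adds orthogonal basis vectors in low-resource scenarios. This makes rapid adaptation to new tasks possible by learning only lightweight coefficients on the principal components of the subspace, removing the need to fine-tune entire adapters. EigenLoRAx needs significantly fewer parameters and less memory, improving the efficiency of both training and inference. Our method performs strongly across different domains and tasks, providing a scalable solution for edge applications, personalization, and the equitable deployment of large models in resource-constrained environments.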

Summary / 总结

EigenLoRAx is a parameter-efficient method that recycles pretrained adapters to create a principal subspace aligned with shared domain knowledge, enabling rapid adaptation to new tasks with fewer parameters and less memory. It augments this subspace with orthogonal basis vectors in low-resource scenarios and performs strongly across various domains and tasks, which makes it suitable for resource-constrained environments.

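EigenLoRAx is a parameter-efficient method that recycles existing adapters to create a principal subspace aligned with shared domain knowledge, enabling fast adaptation to new tasks with fewer parameters and less memory. It performs well across a range of domains and tasks and is suited to resource-constrained environments and the equitable deployment of large models.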
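The subspace idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the adapters are synthetic flattened vectors, the subspace is found with a plain SVD, and the coefficients are fitted by least squares against a toy target rather than learned by gradient descent on a task loss. The orthogonal augmentation step is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_adapters, k = 64, 12, 4  # flattened adapter size, adapter count, kept components

# Hypothetical stand-in for public adapters: noisy vectors sharing a rank-3 structure.
shared = rng.normal(size=(3, d))
adapters = rng.normal(size=(num_adapters, 3)) @ shared
adapters += 0.01 * rng.normal(size=(num_adapters, d))

# Principal subspace of the adapter collection via SVD of the centered stack.
mean = adapters.mean(axis=0)
_, _, vt = np.linalg.svd(adapters - mean, full_matrices=False)
basis = vt[:k]  # shape (k, d), orthonormal rows, kept frozen

# Toy "new task" adapter that lies in the same shared structure.
target = rng.normal(size=3) @ shared

# Adaptation fits only k coefficients on the frozen basis instead of d weights.
coeffs, *_ = np.linalg.lstsq(basis.T, target - mean, rcond=None)
recon = mean + coeffs @ basis

rel_err = np.linalg.norm(recon - target) / np.linalg.norm(target)
print(f"trainable parameters: {k} instead of {d}, relative error: {rel_err:.4f}")
```

The point of the sketch is the parameter count: once the basis is fixed, a new task needs only `k` numbers.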

Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

Authors: Xuejun Zhang, Aditi Tiwari, Zhenhailong Wang, Heng Ji

First: 2026-02-05T18:59:55+00:00 · Latest: 2026-02-05T18:59:55+00:00

Abstract

Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.

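Chinese Title / Abstract (English translation)

Title: Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

Multi-image spatial reasoning remains a challenge for current multimodal large language models (MLLMs). While single-view perception is inherently two-dimensional, multi-view reasoning requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, in which a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into the visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we collect CAMCUE-DATA, which contains 27,668 training instances and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy (within 20°) and over 90% translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, cutting inference time per example from 256.6 s to 1.45 s and enabling fast, interactive use in real-world scenarios.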

Summary / 总结

The research aims to enhance multi-image spatial reasoning for multimodal large language models by addressing the challenge of perspective taking. CAMCUE, a pose-aware multi-image framework, uses camera pose as a geometric anchor for cross-view fusion and novel-view reasoning. It improves overall accuracy by 9.06% and predicts target poses with high rotation and translation accuracy. This direct grounding reduces inference time from 256.6s to 1.45s per example, enabling fast, interactive use in real-world scenarios.

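The research aims to improve multi-image spatial reasoning, particularly for multimodal large language models, by addressing perspective taking. CAMCUE is a pose-aware framework that uses camera pose as a geometric anchor for cross-view fusion and novel-view reasoning. The method raises overall accuracy by 9.06% and performs well on rotation and translation accuracy. It also cuts inference time from 256.6 seconds to 1.45 seconds per example, allowing fast, interactive use in real-world scenarios.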
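The pose metrics quoted above (rotation accuracy within 20°, translation accuracy within a 0.5 threshold) can be made concrete with a small sketch. The abstract does not define the error measures, so this assumes the usual geodesic angle between rotation matrices and a Euclidean translation distance; `rot_z` and the toy poses are hypothetical helpers for illustration only.

```python
import numpy as np

def rotation_error_deg(r_pred, r_true):
    """Geodesic angle, in degrees, between two 3x3 rotation matrices."""
    cos_angle = (np.trace(r_pred.T @ r_true) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def rot_z(deg):
    """Rotation about the z-axis, used only to build toy poses."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pose_accuracy(preds, truths, rot_thresh_deg=20.0, trans_thresh=0.5):
    """Fractions of poses within the rotation and translation thresholds."""
    rot_hits = [rotation_error_deg(rp, rt) <= rot_thresh_deg
                for (rp, _), (rt, _) in zip(preds, truths)]
    trans_hits = [np.linalg.norm(np.asarray(tp) - np.asarray(tt)) <= trans_thresh
                  for (_, tp), (_, tt) in zip(preds, truths)]
    return float(np.mean(rot_hits)), float(np.mean(trans_hits))

# Two toy (rotation, translation) pairs: the first prediction is close, the second is not.
truths = [(rot_z(0), [0.0, 0.0, 0.0]), (rot_z(90), [1.0, 0.0, 0.0])]
preds = [(rot_z(10), [0.1, 0.0, 0.2]), (rot_z(45), [2.0, 0.0, 0.0])]
rot_acc, trans_acc = pose_accuracy(preds, truths)
print(rot_acc, trans_acc)
```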

DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching

Authors: Yuxing Lu, Yucheng Hu, Xukai Zhao, Jiuxin Cao

First: 2026-02-05T18:59:51+00:00 · Latest: 2026-02-05T18:59:51+00:00

Abstract

Multi-agent systems built from prompted large language models can improve multi-round reasoning, yet most existing pipelines rely on fixed, trajectory-wide communication patterns that are poorly matched to the stage-dependent needs of iterative problem solving. We introduce DyTopo, a manager-guided multi-agent framework that reconstructs a sparse directed communication graph at each round. Conditioned on the manager's round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms the strongest baseline (avg. +6.2). Beyond accuracy, DyTopo yields an interpretable coordination trace via the evolving graphs, enabling qualitative inspection of how communication pathways reconfigure across rounds.

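Chinese Title / Abstract (English translation)

Title: DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching

Multi-agent systems built from prompted large language models can improve multi-round reasoning, but most existing pipelines rely on fixed communication patterns that span the whole trajectory and are poorly matched to the stage-dependent needs of iterative problem solving. We introduce DyTopo, a manager-guided multi-agent framework that rebuilds a sparse directed communication graph at each round. Conditioned on the manager's round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms the strongest baseline (+6.2 on average). Beyond accuracy, DyTopo provides an interpretable coordination trace through the evolving graphs, making it possible to inspect qualitatively how communication pathways are reconfigured across rounds.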

Summary / 总结

DyTopo is a manager-guided multi-agent framework that dynamically reconstructs a communication graph at each round based on the manager's goal. Agents output lightweight natural-language queries and offers, which are then matched semantically to route private messages. DyTopo outperforms the strongest baseline by 6.2 on average across various benchmarks and LLM backbones, and provides interpretable coordination traces through evolving graphs.

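DyTopo is a manager-guided multi-agent framework that dynamically rebuilds the communication graph at each round according to the manager's goal. Agents output lightweight natural-language queries and offers, and private messages are then routed by semantic matching. Across various benchmarks and LLM backbones, DyTopo beats the strongest baseline by 6.2 on average and provides an interpretable coordination trace through the evolving graphs.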
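A rough sketch of how query/key matching could induce a sparse directed graph for one round follows. The cosine-similarity threshold, the exclusion of self-loops, and the toy one-hot embeddings are all assumptions; the abstract only states that descriptors are embedded and semantically matched.

```python
import numpy as np

def build_round_graph(query_emb, key_emb, threshold=0.5):
    """Return directed edges (sender, receiver) where the sender's offer matches the receiver's need."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    k = key_emb / np.linalg.norm(key_emb, axis=1, keepdims=True)
    sim = k @ q.T                 # sim[i, j]: agent i's key (offer) vs agent j's query (need)
    np.fill_diagonal(sim, -1.0)   # an agent does not message itself
    senders, receivers = np.where(sim >= threshold)
    return list(zip(senders.tolist(), receivers.tolist()))

# Hypothetical embeddings for three agents in one round.
query_emb = np.array([[1.0, 0.0, 0.0],   # agent 0 needs topic A
                      [0.0, 1.0, 0.0],   # agent 1 needs topic B
                      [0.0, 0.0, 1.0]])  # agent 2 needs topic C
key_emb = np.array([[0.0, 1.0, 0.0],     # agent 0 offers topic B
                    [0.0, 0.0, 1.0],     # agent 1 offers topic C
                    [0.0, 0.0, 1.0]])    # agent 2 offers topic C (matches only itself)

edges = build_round_graph(query_emb, key_emb)
print(edges)
```

Recomputing `edges` from fresh descriptors each round is what would let the topology change with the stage of problem solving.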

SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Authors: Jintao Tong, Shilin Yan, Hongwei Xue, Xiaojun Tang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou

First: 2026-02-05T18:59:51+00:00 · Latest: 2026-02-05T18:59:51+00:00

Comments: Project Page: https://accio-lab.github.io/SwimBird

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.

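Chinese Title / Abstract (English translation)

Title: SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Multimodal large language models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs reason mainly through textual chain-of-thought (CoT), which limits their effectiveness on vision-intensive tasks. Recent methods inject a fixed number of continuous hidden states into the reasoning process as "visual thoughts", which improves visual performance but usually at the expense of text-based logical reasoning. We argue that the core limitation is a rigid, predefined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes depending on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and we design a systematic reasoning-mode curation strategy to build SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning modes. Through flexible, query-adaptive mode selection, SwimBird keeps strong textual logic while substantially improving performance on vision-dense tasks. Experiments on diverse benchmarks covering textual reasoning and challenging visual understanding show that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.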

Summary / 总结

SwimBird is designed to address the limitations of existing MLLMs by introducing a switchable reasoning mode that dynamically adapts to different user queries. It employs a hybrid autoregressive formulation and a reasoning-mode curation strategy to support three reasoning modes: text-only, vision-only, and interleaved vision-text. Experiments show that SwimBird maintains strong text-based logical reasoning while significantly improving performance on vision-intensive tasks, achieving state-of-the-art results across various benchmarks.

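SwimBird aims to address the limitations of existing MLLMs by introducing a switchable reasoning mode that adapts dynamically to different user queries. It uses a hybrid autoregressive formulation and a reasoning-mode curation strategy to support three reasoning modes: text-only, vision-only, and interleaved vision-text. Experiments show that SwimBird keeps strong text-based logical reasoning while substantially improving performance on vision-dense tasks, achieving state-of-the-art results across various benchmarks.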
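One way to picture the hybrid autoregressive formulation is a per-step loss that switches between next-token prediction and next-embedding prediction. The sketch below is an assumption-laden illustration: the abstract does not name the embedding loss, so mean squared error is used as a placeholder, and `hybrid_step_loss` and the toy trace are hypothetical.

```python
import numpy as np

def hybrid_step_loss(is_visual, logits=None, target_token=None,
                     pred_embed=None, target_embed=None):
    """Loss for one reasoning step: cross-entropy for text, MSE for visual thoughts."""
    if is_visual:
        diff = np.asarray(pred_embed, dtype=float) - np.asarray(target_embed, dtype=float)
        return float(np.mean(diff ** 2))
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()                       # stabilise the log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target_token])

# An interleaved trace with toy values: text step, visual step, text step.
steps = [
    dict(is_visual=False, logits=[2.0, 0.0, 0.0, 0.0], target_token=0),
    dict(is_visual=True, pred_embed=[0.5, 0.5], target_embed=[1.0, 0.0]),
    dict(is_visual=False, logits=[0.0, 0.0, 0.0, 0.0], target_token=3),
]
total = sum(hybrid_step_loss(**s) for s in steps)
print(total)
```

A text-only or vision-only trace is just the special case where every step has the same `is_visual` value.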

CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

Authors: Xiaopan Zhang, Zejin Wang, Zhixu Li, Jianpeng Yao, Jiachen Li

Venue: ICRA 2026

First: 2026-02-05T18:59:45+00:00 · Latest: 2026-02-05T18:59:45+00:00

Comments: IEEE International Conference on Robotics and Automation (ICRA 2026); Project Website: https://comm-cp.github.io/

Abstract

To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.

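Chinese Title / Abstract (English translation)

Title: CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

To complete tasks that humans specify in natural language, robots must interpret commands, generate and answer relevant questions to understand the scene, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different tasks cooperatively. Beyond specialized manipulation skills, effective information gathering is essential to completing these tasks. To address this part of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, a novel extension of canonical Embodied Question Answering (EQA) in which effective communication is crucial for coordinating efforts without redundancy. To solve it, we propose CommCP, a novel LLM-based decentralized communication framework for MM-EQA. Our framework uses conformal prediction to calibrate the generated messages, thereby reducing receiver distraction and improving communication reliability. To evaluate the framework, we introduce an MM-EQA benchmark of diverse, photo-realistic household scenarios with embodied questions. Experimental results show that CommCP significantly outperforms baselines in task success rate and exploration efficiency. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io/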

Summary / 总结

The paper addresses the challenge of multiple robots working together to complete tasks given by humans. It formulates the problem as a multi-agent multi-task Embodied Question Answering (MM-EQA) problem, emphasizing the importance of effective communication. To tackle this, the authors propose CommCP, a communication framework that uses conformal prediction to improve message reliability. Experiments show that CommCP improves task success and exploration efficiency compared to baseline methods.

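The paper aims to improve the ability of multiple robots to cooperatively complete human-assigned tasks through efficient communication. It proposes CommCP, an LLM-based communication framework that uses conformal prediction to reduce distractions and improve communication reliability. Experimental results show that, compared with baseline methods, CommCP significantly improves task success rate and exploration efficiency on an MM-EQA benchmark featuring diverse household scenarios.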
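For readers unfamiliar with conformal prediction, here is a generic split-conformal sketch for classification-style answers. It is not CommCP's procedure: how the paper applies calibration to generated messages is not described in the abstract, and the calibration probabilities and the `1 - p` nonconformity score are illustrative assumptions.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Score threshold that gives at least 1 - alpha coverage under exchangeability."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))   # finite-sample corrected rank
    if rank > n:
        return float("inf")                   # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, qhat):
    """Indices of all answers whose nonconformity score 1 - p is within the threshold."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= qhat]

# Hypothetical calibration data: model probability assigned to the true answer.
cal_true_probs = [0.9, 0.8, 0.7, 0.6, 0.95, 0.85, 0.45, 0.35, 0.5]
cal_scores = [1.0 - p for p in cal_true_probs]
qhat = conformal_quantile(cal_scores, alpha=0.25)

confident = prediction_set([0.70, 0.25, 0.05], qhat)   # one answer survives
ambiguous = prediction_set([0.48, 0.47, 0.05], qhat)   # two answers survive
print(qhat, confident, ambiguous)
```

One plausible use, stated here only as an assumption, is to share a message with teammates when the set is a singleton and to keep exploring when it is not.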

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Authors: Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, Xiaodan Liang

First: 2026-02-05T18:59:32+00:00 · Latest: 2026-02-05T18:59:32+00:00

Abstract

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.

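Chinese Title / Abstract (English translation)

Title: Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Recent progress in spatial reasoning with multimodal large language models (MLLMs) increasingly draws on geometric priors provided by 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused indiscriminately, which often causes semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Rather than mixing features, GeoThinker lets the model selectively retrieve geometric evidence according to its internal reasoning demands. It achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation shows that GeoThinker sets a new state of the art in spatial intelligence, reaching a peak score of 72.6 on VSI-Bench. GeoThinker also shows robust generalization and significantly improved spatial perception in complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code is available at https://github.com/Li-Hao-yuan/GeoThinker.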

Summary / 总结

The research aims to enhance spatial reasoning in multimodal large language models (MLLMs) by integrating geometric priors more effectively. GeoThinker, a proposed framework, shifts from passive geometric fusion to active perception, allowing the model to selectively retrieve geometric evidence based on its reasoning needs. This is achieved through Spatial-Grounded Fusion at specific VLM layers, where semantic visual priors query and integrate relevant geometry via frame-strict cross-attention, further refined by Importance Gating. GeoThinker achieves a new state-of-the-art score of 72.6 on the VSI-Bench and shows robust generalization and improved spatial perception in complex scenarios like embodied referring and autonomous driving.

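The research aims to improve spatial reasoning by addressing the limitations of passive geometry integration in multimodal large language models (MLLMs). GeoThinker is a new framework that shifts the paradigm from passive fusion to active perception, letting the model selectively retrieve geometric evidence according to its reasoning needs. This is achieved through Spatial-Grounded Fusion at specific VLM layers, where semantic visual priors query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating. GeoThinker reaches a new best score of 72.6 on VSI-Bench and performs well on complex downstream tasks such as embodied referring and autonomous driving.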

DFlash: Block Diffusion for Flash Speculative Decoding

Authors: Jian Chen, Yesheng Liang, Zhijian Liu

First: 2026-02-05T18:59:30+00:00 · Latest: 2026-02-05T18:59:30+00:00

Abstract

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.

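Chinese Title / Abstract (English translation)

Title: DFlash: Block Diffusion for Flash Speculative Decoding

Autoregressive large language models (LLMs) perform strongly but require inherently sequential decoding, which leads to high inference latency and poor GPU utilization. Speculative decoding eases this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models usually lag behind autoregressive models in performance. In this paper we introduce DFlash, a speculative decoding framework that uses a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash achieves efficient, high-quality drafting with higher acceptance rates. Experiments show that DFlash delivers more than 6x lossless acceleration across a range of models and tasks, with up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.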

Summary / 总结

DFlash is a speculative decoding framework that uses a lightweight block diffusion model for parallel drafting, addressing the sequential nature of autoregressive models. It generates draft tokens in a single forward pass and conditions the draft model on context features from the target model, achieving over 6x lossless acceleration across various models and tasks, with up to 2.5x higher speedup compared to the state-of-the-art speculative decoding method EAGLE-3.

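DFlash is a speculative decoding framework that uses a lightweight block diffusion model for parallel drafting, addressing the sequential-decoding limitation of autoregressive large language models (LLMs). It generates draft tokens in a single forward pass and conditions the draft model on context features extracted from the target model, achieving more than 6x lossless acceleration across various models and tasks and up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.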
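The "lossless" property of speculative decoding can be seen in a toy greedy-decoding sketch. Nothing below models block diffusion: `target_next` and `draft_block` are hypothetical stand-ins (a deterministic rule and a deliberately imperfect block proposer), and the verification loop simulates what a real system would do in one parallel target pass.

```python
def target_next(context):
    """Hypothetical target model: greedy next token given the context."""
    return (context[-1] + 2) % 7

def draft_block(context, block_size):
    """Hypothetical drafter: proposes a whole block at once, wrong at every 4th position."""
    last, n = context[-1], len(context)
    return [(last + 2 * (i + 1) + (1 if (n + i) % 4 == 0 else 0)) % 7
            for i in range(block_size)]

def speculative_decode(prompt, num_tokens, block_size=4):
    out, target_passes = list(prompt), 0
    while len(out) - len(prompt) < num_tokens:
        draft = draft_block(out, block_size)
        target_passes += 1            # one verification pass per drafted block
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)   # replace the first mismatch with the target's token
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(out + accepted))  # bonus token when all drafts match
        out.extend(accepted)
    return out[:len(prompt) + num_tokens], target_passes

def greedy_decode(prompt, num_tokens):
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out

spec_out, passes = speculative_decode([1], num_tokens=12)
print(spec_out == greedy_decode([1], 12), passes)
```

Because every accepted token equals the target's own greedy choice, the output matches plain greedy decoding while using fewer target passes than tokens.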

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

Authors: Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei, Yu-Xiong Wang, Liangyan Gui

First: 2026-02-05T18:59:27+00:00 · Latest: 2026-02-05T18:59:27+00:00

Comments: Webpage: https://sirui-xu.github.io/InterPrior/

Abstract

Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.

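Chinese Title / Abstract (English translation)

Title: InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

Humans rarely plan their interactions with objects at the level of explicit whole-body movements. Instead, high-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we propose InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining followed by reinforcement learning fine-tuning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. Although the distilled policy can reconstruct training behaviors, it does not generalize reliably because of the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations and then use reinforcement learning fine-tuning to improve competence on unseen goals and initial states. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data; for example, it can incorporate interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for deployment on real robots.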

Summary / 总结

InterPrior is a scalable framework that learns a unified generative controller through imitation pretraining and reinforcement learning. It first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. Data augmentation with physical perturbations and reinforcement learning fine-tuning improve the policy's generalization to unseen goals and initializations, enabling the framework to generalize beyond the training data and incorporate new behaviors. The method demonstrates effectiveness in user-interactive control and potential for real robot deployment.

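InterPrior is a scalable framework for human-object interaction that learns a unified generative controller through imitation pretraining and reinforcement learning. It first creates a versatile policy that reconstructs motion from observations and high-level intent, then uses data augmentation and reinforcement learning to improve its handling of unseen goals and initial states. The result is a motion prior that can handle new behaviors and unseen objects, making it suitable for interactive control and real-robot deployment.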

V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Authors: Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka

First: 2026-02-05T18:59:21+00:00 · Latest: 2026-02-05T18:59:21+00:00

Abstract

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (23.0% on average), perception-driven reasoning reliability, and generalization.

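Chinese Title / Abstract (English translation)

Title: V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Multimodal large language models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves the reranking of candidates. However, existing methods remain largely language-driven: they rely on static visual encodings and cannot actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that recasts multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver lets an MLLM selectively acquire visual evidence during reasoning through external visual tools, carrying out a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy that combines supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments on multiple multimodal retrieval benchmarks show consistent improvements in retrieval accuracy (23.0% on average), perception-driven reasoning reliability, and generalization.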

Summary / 总结

V-Retrver is an evidence-driven retrieval framework that reformulates multimodal retrieval as agentic reasoning grounded in visual inspection. It enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, alternating between hypothesis generation and targeted visual verification. Experiments show consistent improvements in retrieval accuracy (23.0% on average), reasoning reliability, and generalization.

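V-Retrver is an evidence-driven retrieval framework that recasts multimodal retrieval as an agentic reasoning process grounded in visual inspection. It lets an MLLM selectively acquire visual evidence during reasoning through external visual tools, alternating between hypothesis generation and targeted visual verification. Experimental results show gains in retrieval accuracy, reasoning reliability, and generalization, with retrieval accuracy improving by 23.0% on average.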

Can vision language models learn intuitive physics from interaction?

Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz

First: 2026-02-05T18:59:20+00:00 · Latest: 2026-02-05T18:59:20+00:00

Abstract

Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.

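Chinese Title / Abstract (English translation)

Title: Can vision language models learn intuitive physics from interaction?

Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn physical rules that generalize. Drawing on research in cognitive science, we hypothesize that models need to interact with an environment to learn its physical dynamics properly. We use reinforcement learning to train models that learn through interaction with the environment. Although learning from interaction lets models improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even when the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.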

Summary / 总结

The study investigates whether vision language models can develop intuitive physics understanding through interaction. Despite improvements in task performance with supervised fine-tuning, models still lack robust generalizable physical intuitions. Models trained via reinforcement learning from interaction show enhanced task performance but fail to generalize to related tasks, suggesting that interaction alone is insufficient for learning transferable physical knowledge.

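The study examines whether interaction can give vision language models physical intuitions. Although supervised fine-tuning improves task performance, models still lack generalizable physical intuitions. Models that learn from interaction through reinforcement learning improve on the trained task but still generalize poorly to related tasks, indicating that interaction alone is not enough for models to learn transferable physical knowledge.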

PhysicsAgentABM: Physics-Guided Generative Agent-Based Modeling

Authors: Kavana Venkatesh, Yinhan He, Jundong Li, Jiaming Cui

First: 2026-02-05T18:59:01+00:00 · Latest: 2026-02-05T18:59:01+00:00

Abstract

Large language model (LLM)-based multi-agent systems enable expressive agent reasoning but are expensive to scale and poorly calibrated for timestep-aligned state-transition simulation, while classical agent-based models (ABMs) offer interpretability but struggle to integrate rich individual-level signals and non-stationary behaviors. We propose PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state-specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty-aware epistemic fusion yields calibrated cluster-level transition distributions. Individual agents then stochastically realize transitions under local constraints, decoupling population inference from entity-level variability. We further introduce ANCHOR, an LLM agent-driven clustering strategy based on cross-contextual behavioral responses and a novel contrastive loss, reducing LLM calls by up to 6-8 times. Experiments across public health, finance, and social sciences show consistent gains in event-time accuracy and calibration over mechanistic, neural, and LLM baselines. By re-architecting generative ABM around population-level inference with uncertainty-aware neuro-symbolic fusion, PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs.

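Chinese Title / Abstract (English translation)

Title: PhysicsAgentABM: Physics-Guided Generative Agent-Based Modeling

Multi-agent systems based on large language models (LLMs) enable expressive agent reasoning but are expensive to scale and poorly calibrated for timestep-aligned state-transition simulation, while classical agent-based models (ABMs) are interpretable but struggle to integrate rich individual-level signals and non-stationary behaviors. We propose PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state-specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty-aware epistemic fusion produces calibrated cluster-level transition distributions. Individual agents then realize transitions stochastically under local constraints, decoupling population inference from entity-level variability. We also introduce ANCHOR, an LLM agent-driven clustering strategy based on cross-contextual behavioral responses and a novel contrastive loss, which reduces LLM calls by up to 6-8 times. Experiments in public health, finance, and the social sciences show consistent gains in event-time accuracy and calibration over mechanistic, neural, and LLM baselines. By re-architecting generative ABM around population-level inference with uncertainty-aware neuro-symbolic fusion, PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs.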

Summary / 总结

PhysicsAgentABM integrates physics-guided generative agent-based modeling to address the scalability and calibration issues of large language models (LLMs) and the interpretability and signal integration challenges of classical ABMs. It uses state-specialized symbolic agents to encode mechanistic transition priors, a multimodal neural model to capture temporal and interaction dynamics, and uncertainty-aware epistemic fusion to yield calibrated cluster-level transition distributions. The ANCHOR clustering strategy further reduces LLM calls by up to 6-8 times. Experiments across public health, finance, and social sciences demonstrate consistent improvements in event-time accuracy and calibration over various baselines.

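PhysicsAgentABM combines physics-guided generative agent-based modeling to address the scalability and calibration problems of large language models (LLMs) and the interpretability and signal-integration problems of classical ABMs. It uses state-specialized symbolic agents to encode mechanistic transition priors, a multimodal neural model to capture temporal and interaction dynamics, and epistemic fusion to produce calibrated cluster-level transition distributions. ANCHOR, an LLM agent-driven clustering strategy, further reduces LLM calls. Experiments across different domains show improvements in event-time accuracy and calibration over existing models.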
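The cluster-then-individual split described in the abstract can be sketched in a few lines. This is only an illustration under strong assumptions: the fusion rule here is simple inverse-uncertainty weighting (the paper's epistemic fusion is not specified in the abstract), and the states, probabilities, and uncertainty values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
states = ["S", "I", "R"]

# Hypothetical next-state distributions for one cluster currently in state "S".
symbolic_prior = np.array([0.90, 0.10, 0.00])   # mechanistic rule-based estimate
neural_pred = np.array([0.70, 0.30, 0.00])      # learned transition model estimate
symbolic_unc, neural_unc = 0.2, 0.1             # made-up uncertainty scores

# Assumed fusion rule: weight each source by the inverse of its uncertainty.
w_symbolic = (1 / symbolic_unc) / (1 / symbolic_unc + 1 / neural_unc)
fused = w_symbolic * symbolic_prior + (1 - w_symbolic) * neural_pred

# One cluster-level inference, then each of 1000 agents samples its own transition.
next_states = rng.choice(len(states), size=1000, p=fused)
counts = np.bincount(next_states, minlength=len(states))
print(dict(zip(states, counts.tolist())))
```

The expensive inference happens once per cluster; the per-agent step is only a cheap random draw, which is where the scaling benefit would come from.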

Curiosity is Knowledge: Self-Consistent Learning and No-Regret Optimization with Active Inference

Authors: Yingke Li, Anjali Parashar, Enlu Zhou, Chuchu Fan

First: 2026-02-05T18:58:32+00:00 · Latest: 2026-02-05T18:58:32+00:00

Abstract

Active inference (AIF) unifies exploration and exploitation by minimizing the Expected Free Energy (EFE), balancing epistemic value (information gain) and pragmatic value (task performance) through a curiosity coefficient. Yet it has been unclear when this balance yields both coherent learning and efficient decision-making: insufficient curiosity can drive myopic exploitation and prevent uncertainty resolution, while excessive curiosity can induce unnecessary exploration and regret. We establish the first theoretical guarantee for EFE-minimizing agents, showing that a single requirement, sufficient curiosity, simultaneously ensures self-consistent learning (Bayesian posterior consistency) and no-regret optimization (bounded cumulative regret). Our analysis characterizes how this mechanism depends on initial uncertainty, identifiability, and objective alignment, thereby connecting AIF to classical Bayesian experimental design and Bayesian optimization within one theoretical framework. We further translate these theories into practical design guidelines for tuning the epistemic-pragmatic trade-off in hybrid learning-optimization problems, validated through real-world experiments.

中文标题/摘要

标题：好奇心即知识：基于主动推断的自洽学习与无遗憾优化

主动推断（AIF）通过最小化预期自由能（EFE），以好奇心系数平衡认知价值（信息增益）和实用价值（任务性能），统一了探索与利用。然而，这种平衡何时能同时实现连贯学习和高效决策尚不清楚：好奇心不足可能导致短视的利用并阻碍不确定性的消解，而好奇心过度则可能导致不必要的探索和遗憾。我们首次为EFE最小化智能体提供了理论保证，表明单一要求（足够的好奇心）同时确保了自洽学习（贝叶斯后验一致性）和无遗憾优化（有界累积遗憾）。我们的分析刻画了这种机制如何依赖于初始不确定性、可识别性和目标对齐，从而将AIF与经典贝叶斯实验设计和贝叶斯优化联系在同一个理论框架中。我们进一步将这些理论转化为实用的设计指南，用于调整混合学习-优化问题中的认知-实用权衡，并通过真实世界实验进行了验证。

Summary / 总结

The paper studies how active inference (AIF) agents balance exploration and exploitation when they minimize Expected Free Energy (EFE), with a curiosity coefficient weighting epistemic value against pragmatic value. It provides the first theoretical guarantee that sufficient curiosity ensures both self-consistent learning (Bayesian posterior consistency) and no-regret optimization (bounded cumulative regret). The analysis characterizes how this mechanism depends on initial uncertainty, identifiability, and objective alignment, connecting AIF to Bayesian experimental design and Bayesian optimization. Practical guidelines for tuning the epistemic-pragmatic trade-off are derived and validated through real-world experiments.

论文旨在通过最小化预期自由能（EFE）来解决主动推理（AIF）中的探索与利用之间的平衡问题。研究提供了理论保证，即足够的好奇心可以同时确保自我一致的学习和无遗憾的优化。关键发现表明，这种平衡取决于初始不确定性、可识别性和目标对齐，将AIF与贝叶斯实验设计和优化联系起来。还提供了实用的设计指南来调整认知-实践权衡，并通过实际实验进行了验证。
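
The epistemic-pragmatic balance described above can be sketched for a toy problem. The following is a minimal illustration, not the paper's formulation: it assumes each action has a Gaussian belief over its reward mean, uses the standard Gaussian information gain as the epistemic term, and the names `expected_free_energy`, `select_action`, and `curiosity` are hypothetical.

```python
import math

def expected_free_energy(mean, var, noise_var, curiosity):
    """Negative of (pragmatic value + curiosity * epistemic value) for one action."""
    pragmatic = mean  # expected reward under the current belief (assumed objective)
    # Information gain about a Gaussian reward mean from one noisy observation.
    epistemic = 0.5 * math.log(1.0 + var / noise_var)
    return -(pragmatic + curiosity * epistemic)

def select_action(beliefs, noise_var=1.0, curiosity=1.0):
    """Return the index of the action with the lowest expected free energy."""
    scores = [expected_free_energy(m, v, noise_var, curiosity) for m, v in beliefs]
    return min(range(len(scores)), key=scores.__getitem__)

# Action 0 is well known and slightly better; action 1 is highly uncertain.
beliefs = [(1.0, 0.01), (0.8, 4.0)]
greedy_choice = select_action(beliefs, curiosity=0.0)   # exploits action 0
curious_choice = select_action(beliefs, curiosity=1.0)  # explores action 1
```

With zero curiosity the agent never resolves its uncertainty about the second action, which is the myopic-exploitation failure the abstract mentions.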

Language Models and Logic Programs for Trustworthy Tax Reasoning

Authors: William Jurayj, Nils Holzenberger, Benjamin Van Durme

Venue: AAAI 2026

First: 2025-08-28T17:55:07+00:00 · Latest: 2026-02-05T18:58:31+00:00

Comments: Accepted to AAAI 2026

Abstract

According to the United States Internal Revenue Service, "the average American spends $270 and 13 hours filing their taxes". Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how combining up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the effectiveness of applying semantic parsing methods to statutory reasoning, and show promising economic feasibility of neuro-symbolic architectures for increasing access to reliable tax assistance.

中文标题/摘要

标题：语言模型与逻辑程序在可信税务推理中的应用

根据美国国税局的数据，“平均美国人填写税务申报表花费270美元和13小时”。即使在美国之外，税务申报也需要复杂的推理，结合应用重叠规则和数值计算。由于错误可能会导致高昂的罚款，任何自动化系统都必须提供高准确性和可审计性，使得现代大型语言模型（LLMs）不适合此任务。我们提出了一种将LLMs与符号求解器集成的方法，以计算税务义务。我们使用具有挑战性的StAtutory Reasoning Assessment (SARA)数据集评估了该系统的变体，并提出了一种基于税务错误实际罚款的新方法来估算部署此类系统的成本。我们还展示了如何通过将文本规则预先翻译成形式逻辑程序，结合智能检索的形式案例表示示例，可以显著提高此任务的性能，并将成本降低到远低于实际平均水平。我们的结果表明，应用语义解析方法进行法规推理的有效性，并展示了神经-符号架构在提高可靠税务援助可及性方面的有希望的经济可行性。

Summary / 总结

The research addresses the complexity of tax filing, where errors can incur costly penalties, by proposing an approach that integrates large language models with a symbolic solver. The study evaluates variants of this system on the SARA dataset and introduces a novel method for estimating deployment cost from real-world penalties for tax errors. Key findings show that combining up-front translation of plain-text rules into formal logic programs with intelligently retrieved exemplars improves performance dramatically and reduces costs to well below real-world averages, indicating promising economic feasibility of neuro-symbolic architectures for tax assistance.

该论文针对税务申报中的复杂性和错误，提出了一种将大型语言模型与符号求解器结合的方法。系统在SARA数据集上进行了评估，并引入了一种基于实际处罚的新型成本估算方法。研究显示，将文本规则翻译成形式逻辑程序并与检索到的案例示例结合使用，可以显著提高性能并降低成本至低于实际平均水平，证明了语义解析方法在法规推理中的有效性以及神经-符号架构在提高可靠税务援助方面的经济可行性。
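
The division of labor between a language model and a symbolic solver can be illustrated with a small sketch. Everything here is assumed for illustration: the paper works with logic programs, whereas this uses a plain Python function as a stand-in for the solver, and the `Case` fields and the bracket schedule are hypothetical rather than taken from SARA or any real statute. The idea is only that the model's job ends at producing structured facts, and the arithmetic is done deterministically and can be audited.

```python
from dataclasses import dataclass

@dataclass
class Case:
    """Structured facts a parser (e.g. an LLM) would extract from a case description."""
    gross_income: float
    deductions: float

# Hypothetical progressive schedule: (lower bound of bracket, marginal rate).
BRACKETS = [(0.0, 0.10), (10_000.0, 0.20), (50_000.0, 0.30)]

def tax_owed(case: Case) -> float:
    """Apply the bracket rules deterministically to the extracted facts."""
    taxable = max(0.0, case.gross_income - case.deductions)
    owed = 0.0
    for i, (lower, rate) in enumerate(BRACKETS):
        upper = BRACKETS[i + 1][0] if i + 1 < len(BRACKETS) else float("inf")
        if taxable > lower:
            # Only the slice of income inside this bracket is taxed at its rate.
            owed += (min(taxable, upper) - lower) * rate
    return owed

example = tax_owed(Case(gross_income=60_000.0, deductions=5_000.0))
```

Because the rule application is ordinary code, each intermediate amount can be logged and checked, which is the auditability property the abstract asks for.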

Context Forcing: Consistent Autoregressive Video Generation with Long Context

Authors: Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen

First: 2026-02-05T18:58:01+00:00 · Latest: 2026-02-05T18:58:01+00:00

Abstract

Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.

中文标题/摘要

标题：上下文强制：使用长上下文的一致自回归视频生成

近期的实时长视频生成方法通常采用流式调优策略，试图通过短上下文（无记忆）教师训练一个长上下文学生。在这些框架中，学生进行长时间的展开，但仅能从短至5秒的窗口中获得监督。这种结构上的不匹配导致了一个关键的学生-教师不匹配：由于教师无法访问长期历史，它无法引导学生学习全局时间依赖性，从而限制了学生能够使用的上下文长度。为了解决这一问题，我们提出了一种名为上下文强制的新框架，通过长上下文教师训练长上下文学生。通过确保教师了解完整的生成历史，我们消除了监督不匹配，使模型能够稳健地训练并实现长期一致性。为了使这种计算在极端持续时间（例如2分钟）下可行，我们引入了一种上下文管理系统，将线性增长的上下文转换为慢速-快速记忆架构，显著减少了视觉冗余。大量实验结果表明，我们的方法能够实现超过20秒的有效上下文长度，比LongLive和Infinite-RoPE等最先进的方法长2到10倍。通过利用这种扩展的上下文，上下文强制能够保持在长时间内的一致性，超越各种长视频评估指标上的最先进的基线方法。

Summary / 总结

The paper addresses the issue of student-teacher mismatch in real-time long video generation by proposing Context Forcing, which trains a long-context student using a long-context teacher. This method ensures the teacher has access to the full generation history, eliminating the supervision mismatch. To handle the computational challenge, a Slow-Fast Memory architecture is introduced, reducing visual redundancy. The results show that Context Forcing enables context lengths exceeding 20 seconds, outperforming state-of-the-art methods like LongLive and Infinite-RoPE in maintaining long-term consistency.

论文通过提出Context Forcing框架解决了实时长视频生成中的学生-教师不匹配问题，该框架使用长上下文教师来训练长上下文学生，确保教师能够访问完整的生成历史，从而消除监督不匹配。为了在长时长下保持计算可行性，作者引入了慢速-快速记忆架构，减少了视觉冗余。实验结果表明，Context Forcing能够实现超过20秒的有效上下文长度，并在长期一致性方面超越了LongLive和Infinite-RoPE等最先进的方法。
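
One way to picture how a linearly growing context can be compressed into a slow and a fast part is sketched below. This is only an assumption-laden illustration of the general idea: the abstract does not specify how Slow-Fast Memory selects or compresses frames, and `slow_fast_context`, `fast_window`, and `slow_stride` are hypothetical names.

```python
def slow_fast_context(frames, fast_window=4, slow_stride=3):
    """Keep the last `fast_window` frames densely and subsample the older ones."""
    split = max(0, len(frames) - fast_window)
    slow = frames[:split][::slow_stride]  # sparse long-term memory
    fast = frames[split:]                 # dense short-term memory
    return slow + fast

# Twelve frame indices shrink to seven context entries.
context = slow_fast_context(list(range(12)))
```

The context still grows with video length in this sketch, but at a fraction of the original rate, since neighboring old frames are largely redundant.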

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Authors: Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang

First: 2026-02-05T18:57:09+00:00 · Latest: 2026-02-05T18:57:09+00:00

Comments: Code is available at https://github.com/ViktorAxelsen/BudgetMem

Abstract

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

中文标题/摘要

标题：面向运行时代理记忆的查询感知预算层级路由学习

记忆对于在单一上下文窗口之外运行的大型语言模型（LLM）代理越来越关键，但大多数现有系统依赖于离线、与查询无关的记忆构建，这可能效率低下并可能丢弃对查询至关重要的信息。尽管运行时记忆利用是一个自然的替代方案，但先前的工作往往会产生大量开销，并且对性能-成本权衡的显式控制有限。在本文中，我们提出了BudgetMem，这是一种运行时代理记忆框架，用于显式的、查询感知的性能-成本控制。BudgetMem 将记忆处理结构化为一组记忆模块，每个模块提供三个预算层级（即低/中/高）。一个轻量级路由器在模块之间执行预算层级路由，以平衡任务性能和记忆构建成本，该路由器实现为通过强化学习训练的紧凑神经策略。使用BudgetMem作为统一的测试平台，我们研究了实现预算层级的三种互补策略：实现（方法复杂度）、推理（推理行为）和容量（模块模型大小）。在LoCoMo、LongMemEval和HotpotQA上，当优先考虑性能（即高预算设置）时，BudgetMem超越了强基线，并在更紧的预算下提供了更好的准确率-成本前沿。此外，我们的分析厘清了不同层级策略的优势和劣势，阐明了在不同预算条件下各个维度何时能提供最有利的权衡。

Summary / 总结

The research addresses the inefficiency of offline, query-agnostic memory construction in LLM agents by proposing BudgetMem, a runtime agent memory framework that allows explicit, query-aware control over the performance-cost trade-off. BudgetMem structures memory processing as a set of modules, each offered in three budget tiers (Low, Mid, High), and a lightweight router, implemented as a compact neural policy trained with reinforcement learning, selects a tier for each module. The study evaluates three strategies for realizing budget tiers: implementation, reasoning, and capacity. Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem outperforms strong baselines in high-budget settings and provides better accuracy-cost frontiers under tighter budgets.

BudgetMem 是一个支持显式、查询感知的性能-成本控制的运行时代理记忆框架，它将记忆处理结构化为一组记忆模块，每个模块提供三个预算层级，并使用一个轻量级路由器来平衡任务性能和记忆构建成本。在 LoCoMo、LongMemEval 和 HotpotQA 上的实验表明，BudgetMem 在高预算设置中优于强基线，并在更紧的预算下提供了更好的准确率-成本前沿。此外，分析还厘清了不同层级策略在不同预算条件下的优势和劣势。
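
The per-module tier choice can be pictured with a greedy stand-in for the router. This sketch is not BudgetMem's learned policy: the paper trains a compact neural policy with reinforcement learning, while the code below simply assumes that a per-tier gain estimate is already available. The tier costs, module names, and the `route` function are all hypothetical.

```python
TIER_COST = {"low": 1.0, "mid": 3.0, "high": 9.0}  # hypothetical relative costs

def route(predicted_gain, cost_weight):
    """For each module, pick the tier maximizing predicted gain minus weighted cost."""
    plan = {}
    for module, gains in predicted_gain.items():
        plan[module] = max(
            TIER_COST, key=lambda tier: gains[tier] - cost_weight * TIER_COST[tier]
        )
    return plan

# Assumed query-specific gain estimates for two hypothetical memory modules.
predicted_gain = {
    "retrieve": {"low": 0.2, "mid": 0.5, "high": 0.6},
    "summarize": {"low": 0.3, "mid": 0.35, "high": 0.9},
}
balanced_plan = route(predicted_gain, cost_weight=0.05)
```

Raising `cost_weight` moves every module toward the low tier, and setting it to zero recovers the performance-first, high-budget setting.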

Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering

Authors: Miranda Muqing Miao, Young-Min Cho, Lyle Ungar

First: 2026-02-05T18:55:56+00:00 · Latest: 2026-02-05T18:55:56+00:00

Abstract

Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.

中文标题/摘要

标题：CORAL（正确性优化残差激活透镜）：可移植且校准意识的推理时校正导向

大型语言模型（LLMs）在指令调优和偏好对齐后表现出持续的校准不足。修改后的训练目标可以改善校准，但重新训练成本高昂。推理时校正提供了一种轻量级的替代方案，但大多数现有方法优化的是正确性的代理指标而非正确性本身。我们引入了CORAL（正确性优化残差激活透镜），这是一种正则化推理时校正方法，通过权重衰减MLP探针捕捉模型内部激活中的分布式正确性信号。我们在三个7B参数模型上评估了CORAL，发现它在平均情况下将准确率提高了10%并降低了50%的预期校准误差（ECE）。我们还展示了这些增益在无需重新训练的情况下转移到四个保留基准测试的完整发布测试集（ARC-Challenge、HellaSwag、Math-MC、OpenBookQA）上，平均准确率提高了14%并降低了49%的ECE。我们的结果支持了这样一个假设：当单个神经元不足时，可以使用正则化探针从模型内部提取分布式信息。因此，CORAL提供了一种计算高效、可移植且校准意识的方法，以提高推理时的多项选择题问答性能。

Summary / 总结

The paper introduces CORAL, a regularized inference-time steering method that enhances the accuracy and calibration of large language models. By using weight-decay MLP probes to capture distributed correctness signals from model activations, CORAL improves accuracy by 10% and expected calibration error (ECE) by 50% on average across three 7B-parameter models. These improvements are transferable to four held-out benchmarks without retraining, with average accuracy and ECE improvements of 14% and 49%, respectively. This method offers a compute-efficient, transferable, and calibration-aware approach to improve multiple-choice question answering performance during inference.

论文提出了CORAL，一种正则化的推理时引导方法，通过使用权重衰减MLP探针从模型激活中捕获分布式正确性信号来提升大语言模型的准确性和校准。在三个7B参数模型上，CORAL平均将准确率提高了10%，并将预期校准误差（ECE）改善了50%。这些改进无需重新训练即可迁移到四个保留基准测试集上，平均准确率提高14%，ECE改善49%，展示了CORAL的可迁移性和校准感知能力。
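
Since the results above are reported in terms of expected calibration error, a short reference computation may help. The sketch below uses one common binning convention (equal-width bins, left-closed) and is not taken from the paper's code; the function name and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

# Four answers given with full confidence, only half of them right.
overconfident = expected_calibration_error([1.0] * 4, [1, 0, 0, 1])
```

A model that says 90% and is right 90% of the time scores zero, so a lower ECE means stated confidence tracks actual accuracy more closely.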

Diffusion Model's Generalization Can Be Characterized by Inductive Biases toward a Data-Dependent Ridge Manifold

Authors: Ye He, Yitong Qiu, Molei Tao

First: 2026-02-05T18:55:03+00:00 · Latest: 2026-02-05T18:55:03+00:00

Abstract

When a diffusion model is not memorizing the training data set, how does it generalize exactly? A quantitative understanding of the distribution it generates would be beneficial to, for example, an assessment of the model's performance for downstream applications. We thus explicitly characterize what diffusion model generates, by proposing a log-density ridge manifold and quantifying how the generated data relate to this manifold as inference dynamics progresses. More precisely, inference undergoes a reach-align-slide process centered around the ridge manifold: trajectories first reach a neighborhood of the manifold, then align as being pushed toward or away from the manifold in normal directions, and finally slide along the manifold in tangent directions. Within the scope of this general behavior, different training errors will lead to different normal and tangent motions, which can be quantified, and these detailed motions characterize when inter-mode generations emerge. More detailed understanding of training dynamics will lead to more accurate quantification of the generation inductive bias, and an example of random feature model will be considered, for which we can explicitly illustrate how diffusion model's inductive biases originate as a composition of architectural bias and training accuracy, and how they evolve with the inference dynamics. Experiments on synthetic multimodal distributions and MNIST latent diffusion support the predicted directional effects, in both low- and high-dimensions.

中文标题/摘要

标题：扩散模型的泛化可以由数据依赖的岭流形上的归纳偏置来表征

当扩散模型不记忆训练数据集时，它如何进行泛化？对其生成分布的定量理解将有助于例如下游应用中模型性能的评估。因此，我们通过提出对数密度岭流形并量化生成数据与该流形的关系来明确表征扩散模型的生成内容。更具体地说，推理过程围绕岭流形进行拉近-对齐-滑动的过程：轨迹首先拉近流形的邻域，然后在法向方向被推离或推向流形，最后在切向方向沿着流形滑动。在这一一般行为的范围内，不同的训练误差会导致不同的法向和切向运动，这些运动可以被量化，并且这些详细的运动表征了跨模态生成何时出现。对训练动力学更详细的理解将导致对生成归纳偏置更准确的量化，我们将考虑一个随机特征模型的例子，其中可以明确展示扩散模型的归纳偏置如何源自架构偏置和训练准确性组成的组合，并且如何随着推理动力学的发展而演变。在合成多模态分布和MNIST潜在扩散上的实验支持了预测的方向效应，在低维和高维空间中均是如此。

Summary / 总结

The paper investigates how diffusion models generalize by proposing a log-density ridge manifold and quantifying how generated data relate to it as inference progresses. It describes a reach-align-slide process in which trajectories first reach a neighborhood of the manifold, then align as they are pushed toward or away from it in normal directions, and finally slide along it in tangent directions. Different training errors lead to different normal and tangent motions, which can be quantified and which characterize when inter-mode generations emerge. Experiments on synthetic multimodal distributions and MNIST latent diffusion support the predicted directional effects in both low and high dimensions.

论文通过提出对数密度岭流形并分析推理动力学，研究了扩散模型的泛化能力。描述了从接近流形、沿法向方向对齐再到沿流形切线方向滑动的reach-align-slide过程。不同的训练误差会导致不同的法向和切线运动，从而产生跨模态生成。通过合成多模态分布和MNIST数据的实验，研究支持了这些预测，并展示了低维和高维空间中的方向效应。

MambaVF: State Space Model for Efficient Video Fusion

Authors: Zixiang Zhao, Yukun Cui, Lilun Deng, Haowen Bai, Haotong Qin, Tao Feng, Konrad Schindler

First: 2026-02-05T18:53:47+00:00 · Latest: 2026-02-05T18:53:47+00:00

Abstract

Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing up to 92.25% of parameters and 88.79% of computational FLOPs and a 2.1x speedup compared to existing methods. Project page: https://mambavf.github.io

中文标题/摘要

标题：MambaVF：基于状态空间模型的高效视频融合框架

视频融合是各种视频处理任务中的基本技术。然而，现有的视频融合方法严重依赖于光流估计和特征扭曲，导致了严重的计算开销和有限的可扩展性。本文提出了一种基于状态空间模型（SSM）的高效视频融合框架MambaVF，该框架在无需显式运动估计的情况下进行时间建模。首先，通过将视频融合重新表述为一个顺序状态更新过程，MambaVF以线性复杂度捕获了长程时间依赖性，同时显著减少了计算和内存成本。其次，MambaVF提出了一种轻量级的基于SSM的融合模块，该模块通过时空双向扫描机制替代了传统的流引导对齐。该模块使跨帧的信息聚合变得高效。在多个基准上的广泛实验表明，我们的MambaVF在多曝光、多焦点、红外可见和医学视频融合任务中达到了最先进的性能。我们强调MambaVF具有高效率，参数减少了高达92.25%，计算FLOPs减少了88.79%，并且比现有方法快2.1倍。项目页面：https://mambavf.github.io

Summary / 总结

MambaVF is an efficient video fusion framework that reformulates video fusion as a sequential state update process using state space models (SSMs), reducing computational overhead and memory costs. It introduces a lightweight SSM-based fusion module that replaces conventional flow-guided alignment, enabling efficient information aggregation across frames. Experimental results show that MambaVF outperforms existing methods in various fusion tasks and achieves up to 92.25% fewer parameters, 88.79% fewer FLOPs, and a 2.1x speedup.

MambaVF 是一种高效的视频融合框架，通过使用状态空间模型将视频融合重新表述为顺序状态更新过程，从而减少计算开销和内存使用，同时捕捉长程时间依赖性。该框架在多种视频融合任务中实现了最先进的性能，并将参数减少高达 92.25%，计算 FLOPs 减少高达 88.79%，速度提升 2.1 倍，优于现有方法。
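
The linear-complexity claim rests on the state space recurrence, which visits each time step once. The following is a deliberately simplified scalar sketch and not MambaVF's module: real Mamba-style layers use learned, input-dependent parameters and vector states, and the paper's spatio-temporal scanning order is not described in the abstract. The function names and default coefficients are assumptions.

```python
def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Scalar linear state space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:  # one pass over the sequence, so cost is linear in its length
        h = a * h + b * x
        ys.append(c * h)
    return ys

def bidirectional_scan(xs, **params):
    """Sum a forward and a backward scan so each step sees both past and future."""
    forward = ssm_scan(xs, **params)
    backward = ssm_scan(xs[::-1], **params)[::-1]
    return [f + g for f, g in zip(forward, backward)]

# An impulse at the first step decays geometrically through the state.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0)
```

No pairwise comparison between frames is needed, which is where the savings over flow estimation and warping would come from.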

A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

Authors: Panagiotis Kaliosis, Adithya V Ganesan, Oscar N. E. Kjell, Whitney Ringwald, Scott Feltman, Melissa A. Carr, Dimitris Samaras, Camilo Ruggero, Benjamin J. Luft, Roman Kotov, Andrew H. Schwartz

First: 2026-02-05T18:53:17+00:00 · Latest: 2026-02-05T18:53:17+00:00

Comments: 18 pages, 3 figures, 5 tables

Abstract

Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge on what factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge like subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs few shot, amount of reasoning effort, model sizes, structured subscales vs direct scalar prediction, output rescaling and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, Deepseek), plateau beyond 70B parameters while closed-weight (o3-mini, gpt-5) models improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.

中文标题/摘要

标题：大型语言模型在PTSD严重程度估计中的系统评估：背景知识和建模策略的作用

大型语言模型（LLMs）越来越多地以零样本方式用于评估心理健康状况，但我们对影响其准确性的因素知之甚少。本研究利用包含1,437名个体自然语言叙述和自我报告的PTSD严重程度评分的临床数据集，全面评估了11种最先进的LLM的性能。为了理解影响准确性的因素，我们系统地变化了（i）背景知识，如子量表定义、分布摘要和访谈问题，以及（ii）建模策略，包括零样本与少量样本、推理努力程度、模型大小、结构化子量表与直接标量预测、输出重新缩放和九种集成方法。我们的发现表明：（a）当LLMs获得详细的构念定义和叙述背景时，其准确性最高；（b）增加推理努力程度可以提高估计准确性；（c）开放权重模型（Llama, Deepseek）在超过700亿参数后性能趋于平稳，而封闭权重（o3-mini, gpt-5）模型随着新版本的推出而性能提升；（d）当监督模型与零样本LLM集成时，可以获得最佳性能。综上所述，结果表明选择背景知识和建模策略对于部署LLMs以准确评估心理健康状况至关重要。

GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

Authors: Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang

First: 2026-02-05T18:52:48+00:00 · Latest: 2026-02-05T18:52:48+00:00

Comments: Project Page: https://genarena.github.io/, Code: https://github.com/ruihanglix/genarena

Abstract

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.

中文标题/摘要

标题：GenArena：我们如何实现视觉生成任务的人类对齐评估？

视觉生成模型的快速发展已经超越了传统的评估方法，迫使其采用视觉语言模型作为替代的评判者。在本文中，我们系统地研究了当前广泛使用的绝对点对评分标准在各种视觉生成任务中的可靠性。我们的分析表明，这种范式由于随机不一致性和与人类感知的不良对齐而受到限制。为了解决这些限制，我们引入了GenArena，这是一种利用成对比较范式来确保稳定和人类对齐评估的统一评估框架。我们的实验揭示了一个变革性的发现，即仅采用这种成对协议即可使现成的开源模型超越顶级专有模型。值得注意的是，我们的方法将评估准确性提高了超过20%，并与权威的LMArena排行榜获得了0.86的斯皮尔曼相关性，远超点对方法的0.36相关性。基于GenArena，我们对多种视觉生成模型进行了基准测试，为视觉生成提供了一个严格且自动化的评估标准。

Summary / 总结

This study examines the absolute pointwise scoring standard commonly used when Vision-Language Models act as judges for visual generation, and finds it limited by stochastic inconsistency and poor alignment with human perception. The authors introduce GenArena, a unified framework built on pairwise comparison, to ensure more stable and human-aligned evaluation. Experiments show that simply adopting the pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models, boosting evaluation accuracy by over 20% and reaching a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far above the 0.36 correlation of pointwise methods.

该论文针对视觉生成模型评价方法落后于模型发展速度的问题，提出了一种名为GenArena的统一评价框架，采用成对比较的方式确保评价的稳定性和与人类感知的一致性。实验结果显示，采用这种方法显著提高了评价准确性，开源模型超越了顶级专有模型，并且与权威的LMArena排行榜的Spearman相关性达到0.86，而传统的点对点方法仅为0.36。
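
To make the pairwise protocol and the reported Spearman correlation concrete, here is a small sketch that turns pairwise verdicts into a ranking and compares it with a reference ranking. It is an assumption-based illustration: GenArena's actual aggregation rule is not given in the abstract, so plain win rate is used as a stand-in, and the Spearman formula below is the textbook version for data without ties and at least two items.

```python
def win_rates(models, battles):
    """battles: list of (winner, loser) pairs produced by a pairwise judge."""
    wins = {m: 0 for m in models}
    games = {m: 0 for m in models}
    for winner, loser in battles:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {m: wins[m] / games[m] if games[m] else 0.0 for m in models}

def spearman(x, y):
    """Spearman rank correlation for equal-length sequences without ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        result = [0] * len(values)
        for rank, index in enumerate(order):
            result[index] = rank
        return result
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

models = ["A", "B", "C"]
rates = win_rates(models, [("A", "B"), ("A", "C"), ("B", "C")])
reference_scores = [3, 2, 1]  # hypothetical leaderboard scores for A, B, C
agreement = spearman([rates[m] for m in models], reference_scores)
```

A correlation of 1 means the judge orders the models exactly as the reference does, and -1 means the order is fully reversed.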

AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Authors: Xianyang Liu, Shangding Gu, Dawn Song

First: 2026-02-05T18:50:36+00:00 · Latest: 2026-02-05T18:50:36+00:00

Abstract

Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product-dependent valuations, and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at the link: https://github.com/SafeRL-Lab/AgenticPay.

中文标题/摘要

标题：AgenticPay：多智能体LLM谈判系统用于买家卖家交易

基于大型语言模型（LLM）的代理越来越多地被期望自主谈判、协调和交易，但现有的基准测试缺乏评估语言中介的多智能体经济互动的规范性设置。我们引入了AgenticPay，这是一种多智能体买家卖家谈判基准和仿真框架，由自然语言驱动。AgenticPay 模拟了买家和卖家拥有私人约束和产品依赖价值的市场，并且必须通过多轮语言谈判达成协议，而不仅仅是通过数字竞价。该框架支持超过110项任务的多样化套件，从双边讨价还价到多对多市场，具有结构化的行动提取和可行性、效率和福利的度量标准。对最先进的专有和开源权重LLM的基准测试揭示了谈判性能的巨大差距，并突显了长期战略推理的挑战，确立了AgenticPay作为研究代理商业和语言驱动的市场互动的基础。代码和数据集可在以下链接获取：https://github.com/SafeRL-Lab/AgenticPay。

Summary / 总结

AgenticPay is a benchmark and simulation framework for evaluating multi-agent buyer-seller negotiations using natural language. It models markets with private constraints and product-dependent valuations, requiring agents to reach agreements through multi-round linguistic negotiation. Key findings show significant gaps in negotiation performance among state-of-the-art LLMs, particularly in long-horizon strategic reasoning, establishing AgenticPay as a valuable tool for studying agentic commerce and language-based market interaction.

AgenticPay 是一个用于评估多代理买家卖家谈判的基准和模拟框架，使用自然语言进行。它模拟了具有私人约束和产品依赖估值的市场，要求代理通过多轮语言谈判达成协议。关键发现表明，最先进的语言模型在谈判表现上存在显著差距，尤其是在长期战略推理方面，确立了AgenticPay作为研究代理商业和基于语言的市场交互的重要工具的地位。
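
The kind of outcome scoring such a benchmark needs can be sketched for a single bilateral deal. The definitions below are generic textbook notions of feasibility and surplus, assumed for illustration, and are not AgenticPay's actual metrics. The function and field names are hypothetical.

```python
def evaluate_deal(price, buyer_max, seller_min):
    """Score one bilateral outcome given private reservation prices.

    `price` is the agreed price extracted from the dialogue, or None if no deal.
    """
    feasible = price is not None and seller_min <= price <= buyer_max
    if not feasible:
        return {"feasible": False, "buyer_surplus": 0.0,
                "seller_surplus": 0.0, "welfare": 0.0}
    buyer_surplus = buyer_max - price
    seller_surplus = price - seller_min
    return {"feasible": True, "buyer_surplus": buyer_surplus,
            "seller_surplus": seller_surplus,
            "welfare": buyer_surplus + seller_surplus}

fair_deal = evaluate_deal(price=80.0, buyer_max=100.0, seller_min=60.0)
```

Total welfare in a single deal does not depend on the price, only on whether agreement is reached, so the split between the two surpluses is what reflects negotiation skill.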

VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation

Authors: Jie Deng, Kaichun Yao, Libo Zhang

First: 2026-02-05T18:45:53+00:00 · Latest: 2026-02-05T18:45:53+00:00

Abstract

Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.

中文标题/摘要

标题：VisRefiner：从视觉差异中学习以实现屏幕截图到代码生成

屏幕截图到代码生成旨在将用户界面屏幕截图转换为忠实再现目标布局和样式的可执行前端代码。现有的多模态大型语言模型直接从屏幕截图进行这种映射，但它们在生成代码时没有观察到视觉结果。相比之下，人类开发人员会迭代地渲染他们的实现，将其与设计进行比较，并学习视觉差异如何与代码更改相关联。受此过程的启发，我们提出了一种训练框架VisRefiner，使模型能够从渲染预测与参考设计之间的视觉差异中学习。我们构建了差异对齐的监督，将视觉差异与相应的代码编辑关联起来，使模型能够理解外观变化是如何由实现更改引起的。在此基础上，我们引入了一种强化学习阶段进行自我完善，其中模型通过观察渲染输出和目标设计之间的视觉差异，并相应地更新代码来改进其生成的代码。实验表明，VisRefiner 显著提高了单步生成质量和布局保真度，同时赋予模型强大的自我完善能力。这些结果表明，从视觉差异中学习对于推进屏幕截图到代码生成的有效性。

Summary / 总结

VisRefiner is a training framework that enables models to learn from visual differences between rendered predictions and reference designs for screenshot-to-code generation. It constructs difference-aligned supervision to associate visual discrepancies with corresponding code edits and introduces a reinforcement learning stage for self-refinement. Experiments show that VisRefiner improves single-step generation quality and layout fidelity and enhances the model's self-refinement ability.

研究旨在通过使模型能够学习渲染预测与参考设计之间的视觉差异来提升截图到代码的生成。VisRefiner 训练框架将视觉差异与代码编辑对齐，使模型能够理解视觉变化是如何由代码修改引起的。该框架还包括一个自改进阶段，模型通过观察渲染输出与目标设计之间的视觉差异来改进其生成的代码。实验表明，VisRefiner 显著提高了生成质量和布局准确性，并增强了模型的自改进能力。

Transmuting prompts into weights

Authors: Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo

First: 2025-10-09T18:40:39+00:00 · Latest: 2026-02-05T18:44:09+00:00

Abstract

A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to token-dependent implicit weight updates (Dherin et al., 2025), we derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector-and-matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.

中文标题/摘要

标题：将提示转化为权重

越来越多的研究表明，可以通过直接修改大型语言模型的内部状态，在推理时有效控制其行为，这些方法可以通过向其激活值添加向量或更新其权重矩阵来实现。虽然这些技术非常强大，但它们通常受到经验启发式的指导，例如从对比提示的平均激活值中推导出引导向量。这项工作为这些干预措施提供了理论基础，解释了它们如何源自变压器架构的基本计算。基于最近发现的提示影响可以数学映射到与标记相关的隐式权重更新（Dherin等人，2025年），我们推导出一种原理性的方法，将这些信息凝练为与标记无关的思想向量和思想矩阵。这些构造为现有的向量和矩阵为基础的模型编辑技术提供了理论解释，并提供了一种直接且计算上可验证的方法，将文本输入转化为可重用的权重更新。

Summary / 总结

This research provides a theoretical foundation for controlling the behavior of large language models at inference time by modifying their internal states. Building on the recent finding that a prompt's influence can be mapped to token-dependent implicit weight updates, it derives a principled method for condensing that influence into token-independent thought vectors and thought matrices. These constructs explain existing vector- and matrix-based model editing techniques and offer a direct, computationally grounded way to transmute textual input into reusable weight updates.

该研究旨在为通过修改大型语言模型的内部状态在推理时控制其行为提供理论基础。它基于最近的发现，推导出将提示影响凝练为与标记无关的思想向量和矩阵的方法，提供了一种直接且计算上可行的方法，将文本输入转化为可重用的权重更新。关键发现包括这些构造的推导，它们解释了现有的模型编辑技术，并提供了一种控制模型行为的新方法。
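
Two ingredients mentioned above can be shown in a few lines: the contrastive-mean heuristic for steering vectors, and the elementary identity that an activation shift for a given input equals a rank-one weight update that depends on that input. This is a sketch of those standard facts only, not the paper's derivation of thought vectors or thought matrices. All function names are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrastive prompt sets."""
    return [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]

def rank_one_update(v, x):
    """Matrix dW = v x^T / (x . x), chosen so that (W + dW) x = W x + v for this x."""
    norm2 = sum(xi * xi for xi in x)
    return [[vi * xj / norm2 for xj in x] for vi in v]

def matvec(matrix, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

v = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
x = [3.0, 4.0]
shift = matvec(rank_one_update(v, x), x)  # reproduces v up to rounding
```

Because `rank_one_update` depends on `x`, the equivalent weight change differs from token to token, which is the dependence the paper sets out to remove when it condenses the update into token-independent form.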

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Authors: Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz

First: 2026-02-05T18:42:00+00:00 · Latest: 2026-02-05T18:42:00+00:00

Abstract

Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose "Diamond Maps", stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.

中文标题/摘要

标题：钻石地图：通过随机流图高效实现奖励对齐

流和扩散模型生成高质量样本，但在训练后适应用户偏好或约束仍然成本高昂且脆弱，这一挑战通常被称为奖励对齐。我们认为，高效的奖励对齐应该是生成模型本身的特性，而不是事后考虑的问题，并重新设计了模型以提高适应性。我们提出了“钻石地图”，一种随机流图模型，能够在推理时高效且准确地对齐到任意奖励。钻石地图将许多模拟步骤合并为单步采样器，类似于流图，同时保留了实现最优奖励对齐所需的随机性。这种设计使得搜索、顺序蒙特卡洛和引导变得可扩展，因为它们能够高效且一致地估计价值函数。我们的实验表明，钻石地图可以通过从GLASS流中蒸馏学习，实现更强的奖励对齐性能，并且比现有方法更具可扩展性。我们的结果表明了一条实用的道路，即生成模型可以在推理时快速适应任意偏好和约束。

Summary / 总结

The research aims to address the challenge of reward alignment in generative models, where adapting models to user preferences post-training is costly and brittle. The authors propose Diamond Maps, a stochastic flow map model that enables efficient and accurate reward alignment at inference time. Diamond Maps combine the efficiency of flow maps with the necessary stochasticity for optimal reward alignment, making search and guidance scalable. Experiments demonstrate that Diamond Maps can be learned efficiently from GLASS Flows, achieve better reward alignment performance, and scale better than existing methods.

研究旨在解决生成模型中的奖励对齐问题，该问题在后训练阶段进行时成本高且脆弱。作者提出了一种称为Diamond Maps的随机流图模型，能够在推理时高效且准确地进行奖励对齐。Diamond Maps结合了流图的高效性和必要的随机性以实现最优的奖励对齐，从而使搜索和指导变得可扩展。实验表明，Diamond Maps可以从GLASS Flows高效学习，实现更好的奖励对齐，并且比现有方法更具可扩展性。
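
The role of a stochastic single-step sampler in value estimation can be illustrated with a toy sketch. It assumes a simplified value function (the expected reward of final samples reachable from an intermediate state), which may differ from the definition used in the paper; `sample_endpoint`, `reward`, `target`, and `noise` are hypothetical stand-ins, not part of Diamond Maps.

```python
import random


def sample_endpoint(x_t, rng, noise=0.2):
    """Hypothetical stochastic one-step sampler: jump from x_t to a final sample."""
    return x_t + rng.gauss(0.0, noise)


def reward(x, target=1.0):
    """Toy reward: higher when the final sample is closer to the target."""
    return -(x - target) ** 2


def estimate_value(x_t, rng, n_samples=2000):
    """Monte Carlo estimate of the expected reward of final samples reached from x_t."""
    total = sum(reward(sample_endpoint(x_t, rng)) for _ in range(n_samples))
    return total / n_samples


def pick_best(candidates, rng):
    """Search step: keep the intermediate state with the highest estimated value."""
    return max(candidates, key=lambda x_t: estimate_value(x_t, rng))


rng = random.Random(0)
best = pick_best([0.0, 0.5, 0.9], rng)
```

Because each value estimate needs many final samples, a sampler that reaches the endpoint in one stochastic step makes this kind of search far cheaper than simulating a full trajectory per sample.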

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

Authors: Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang

First: 2026-02-05T18:41:38+00:00 · Latest: 2026-02-05T18:41:38+00:00

Abstract

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and show the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.

中文标题/摘要

标题：DSB：动态滑动块调度算法用于扩散大语言模型

扩散大语言模型（dLLMs）已成为文本生成的一种有前途的替代方案，以其原生支持并行解码而著称。实际上，块推理对于避免全局双向解码中的顺序错位并提高输出质量至关重要。然而，广泛使用的固定预定义块（朴素）调度策略忽略了语义难度，使其在质量和效率方面都是次优策略：它可能会过早地对不确定的位置做出承诺，同时推迟接近块边界的简单位置。在本文中，我们分析了朴素块调度的局限性，并揭示了根据语义难度动态调整调度以实现可靠和高效推理的重要性。受此启发，我们提出了动态滑动块（DSB），这是一种无需训练的块调度方法，使用动态大小的滑动块来克服朴素块的僵化。为了进一步提高效率，我们引入了DSB缓存，这是一种针对DSB定制的无需训练的KV缓存机制。在多个模型和基准上的广泛实验表明，DSB与DSB缓存一起，能够一致地提高dLLMs的生成质量和推理效率。代码已发布在https://github.com/lizhuo-luo/DSB。

Summary / 总结

The paper addresses the limitations of fixed block scheduling in diffusion large language models (dLLMs), proposing Dynamic Sliding Block (DSB) as a method to dynamically adjust block sizes based on semantic difficulty. DSB, combined with DSB Cache, enhances both generation quality and inference efficiency. Experiments across various models and benchmarks show consistent improvements over the traditional fixed block scheduling approach.

本文针对固定块调度在扩散大型语言模型（dLLMs）中的局限性，提出了动态滑动块（DSB）方法，该方法基于语义难度动态调整块大小。DSB结合了专门为DSB设计的DSB缓存机制，提高了生成质量和推理效率。跨多种模型和基准的实验显示，DSB方法在各方面都优于传统方法。

Layer-wise LoRA fine-tuning: a similarity metric approach

Authors: Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao

First: 2026-02-05T18:38:53+00:00 · Latest: 2026-02-05T18:38:53+00:00

Comments: Code is available at https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA

Abstract

Pre-training Large Language Models (LLMs) on web-scale datasets has become fundamental for advancing general-purpose AI. In contrast, enhancing their predictive performance on downstream tasks typically involves adapting their knowledge through fine-tuning. Parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), aim to reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. In comparison to full fine-tuning, these methods achieve over 99% reduction in trainable parameter count, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address this problem by systematically selecting only a few layers to fine-tune using LoRA or its variants. We argue that not all layers contribute equally to the model adaptation. Leveraging this, we identify the most relevant layers to fine-tune by measuring their contribution to changes in internal representations. Our method is orthogonal to and readily compatible with existing low-rank adaptation techniques. We reduce the trainable parameters in LoRA-based techniques by up to 50%, while maintaining the predictive performance across different models and tasks. Specifically, on encoder-only architectures, this reduction in trainable parameters leads to a negligible predictive performance drop on the GLUE benchmark. On decoder-only architectures, we achieve a small drop or even improvements in the predictive performance on mathematical problem-solving capabilities and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe competitive results relative to fine-tuning with LoRA modules in all layers. Code is available at: https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA

中文标题/摘要

标题：逐层LoRA微调：一种相似度度量方法

在网页规模数据集上预训练大型语言模型（LLMs）已成为推动通用人工智能进步的基础。相比之下，通过微调来增强其在下游任务中的预测性能通常涉及调整其知识。参数高效微调技术，如低秩适应（LoRA），旨在通过冻结预训练模型并更新较少的参数来降低此过程的计算成本。与全微调相比，这些方法的可训练参数数量减少了超过99%，具体取决于配置。不幸的是，随着LLMs的规模继续扩大，这种减少可能变得不足。在本研究中，我们通过系统地选择仅微调少数几层来解决上述问题，使用LoRA或其变体。我们认为，并非所有层对模型适应的贡献都相等。利用这一点，我们通过测量它们对内部表示变化的贡献来识别最相关的层进行微调。我们的方法与现有的低秩适应技术是正交的，并且易于兼容。我们通过LoRA技术将可训练参数减少多达50%，同时在不同模型和任务上保持预测性能。具体而言，在仅编码器架构中，这种可训练参数的减少在GLUE基准测试中的预测性能下降可以忽略不计。在仅解码器架构中，我们实现了数学问题解决能力和编程任务上的小幅度下降或甚至改进。最后，这种方法也适用于多模态模型，在这些模型中，我们还观察到与在所有层使用LoRA模块进行微调相比具有竞争力的结果。代码可在：https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA

Summary / 总结

This paper addresses the challenge of fine-tuning large language models (LLMs) by proposing a layer-wise LoRA fine-tuning approach. The method selects a few critical layers for fine-tuning based on their contribution to internal representation changes, reducing the number of trainable parameters by up to 50% while maintaining or improving predictive performance across different models and tasks. On encoder-only architectures, there is a negligible drop in performance on the GLUE benchmark, and on decoder-only architectures, there is a small drop or improvement in performance on mathematical problem-solving and coding tasks. The approach is compatible with existing low-rank adaptation techniques and is available in the provided code repository.

本文提出了一种分层LoRA微调方法，通过基于内部表示变化的贡献选择关键层进行微调，从而将可训练参数减少高达50%，同时在不同模型和任务上保持或提高预测性能。对于编码器架构，GLUE基准上的性能下降可以忽略不计；而对于解码器架构，数学问题解决和编程任务上的性能有所下降或提升。该方法与现有的低秩适应技术兼容，并在提供的代码库中可用。
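
One way to measure how much a layer changes internal representations is linear Centered Kernel Alignment (CKA), which the repository name suggests the authors use. The sketch below is an assumption-laden illustration rather than the paper's procedure: it scores each layer by 1 - CKA between its input and output activations and ranks layers by that score. The function names and toy matrices are hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices with one row per example."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator


def rank_layers(layer_io):
    """Sort layer names by 1 - CKA(input, output), largest representation change first."""
    scores = {name: 1.0 - linear_cka(inp, out) for name, (inp, out) in layer_io.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy activations: 4 examples, 2 features (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
scaled = [[2.0 * v for v in row] for row in x]                # same geometry, CKA = 1
mixed = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]  # different geometry

ranking = rank_layers({"layer_a": (x, scaled), "layer_b": (x, mixed)})
```

Under this assumed scoring, only the top-ranked layers would receive LoRA modules; the rest would stay frozen.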

SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model

Authors: Yu Guo, Zhiqiang Lao, Xiyun Song, Yubin Zhou, Heather Yu

First: 2026-01-12T05:03:12+00:00 · Latest: 2026-02-05T18:37:54+00:00

Comments: 12 pages, 14 figures, accepted in WACVW 2026

Abstract

Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of a Large Multimodal Model (LMM), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.

中文标题/摘要

标题：SIRR-LMM：基于大型多模态模型的单张图像反射去除

玻璃表面会产生复杂的反射和透射光相互作用，使得单张图像反射去除（SIRR）具有挑战性。现有数据集在合成数据中缺乏物理真实感，或在实际捕获中规模不足。我们提出了一种合成数据集生成框架，通过在真实背景图像上路径追踪3D玻璃模型来创建具有多种玻璃属性、相机设置和后处理效果的物理准确的反射场景。为了利用大型多模态模型（LMM）的能力，我们将图像层合并为单一复合输入，进行联合描述，并使用针对特定任务的LoRA进行微调，而不是进行全面参数训练。这使我们的方法在反射去除和分离性能方面优于现有最先进的方法。

Summary / 总结

The research addresses the challenge of single-image reflection removal (SIRR) from glass surfaces, which is complicated by the interaction of reflected and transmitted light. To overcome limitations in existing datasets, the authors developed a synthetic dataset generation framework that uses path-tracing to create realistic reflection scenarios. They then fine-tuned a Large Multimodal Model (LMM) by concatenating image layers and applying task-specific LoRA, achieving better reflection removal and separation than current state-of-the-art methods.

研究旨在解决来自玻璃表面的单图像反射去除（SIRR）问题，由于反射和透射光的相互作用而复杂化。为克服现有数据集的限制，作者创建了一个新的合成数据集，使用路径追踪的3D玻璃模型在真实背景上模拟反射场景。然后，他们使用大型多模态模型（LMM）并以复合输入形式处理，并使用任务特定的LoRA进行微调，从而在反射去除和分离方面取得了比当前方法更好的结果。

RISE-Video: Can Video Generators Decode Implicit World Rules?

Authors: Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang

First: 2026-02-05T18:36:10+00:00 · Latest: 2026-02-05T18:36:10+00:00

Comments: 38 pages, 16 figures, 3 tables; Code: https://github.com/VisionXLab/RISE-Video; HuggingFace: https://huggingface.co/datasets/VisionXLab/RISE-Video

Abstract

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.

中文标题/摘要

标题：RISE-Video：视频生成器能否解码隐含的世界规则？

尽管生成式视频模型在视觉保真度方面取得了显著进展，但它们在内化和推理隐含世界规则方面的能力仍然是一个关键但尚未充分探索的领域。为弥合这一差距，我们提出了RISE-Video，这是一种开创性的基于推理的Text-Image-to-Video (TI2V) 合成基准，将评估重点从表面美学转移到深层次的认知推理。RISE-Video 包含467个精心的人工标注样本，涵盖八个严格的类别，为从常识和空间动态到专业主题领域的模型智能提供了一个结构化的测试平台。我们的框架引入了四个维度的评估协议，包括推理一致性、时间一致性、物理合理性以及视觉质量。为了进一步支持可扩展的评估，我们提出了一种基于大型多模态模型（LMMs）的自动化流程，以模拟人类评估。在11个最先进的TI2V模型上的广泛实验揭示了在隐含约束下模拟复杂场景的普遍缺陷，为未来世界模拟生成模型的发展提供了关键见解。

Summary / 总结

RISE-Video is a benchmark for evaluating generative video models' ability to understand and reason about implicit world rules. It introduces a multi-dimensional evaluation protocol with four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. Experiments on 11 state-of-the-art models reveal that these models struggle to simulate complex scenarios under implicit constraints, highlighting the need for improved cognitive reasoning capabilities in generative models.

RISE-Video 是一个针对文本-图像到视频合成的推理导向基准，评估模型在隐含世界规则上的推理能力而非仅仅视觉保真度。它包含467个人标注样本，覆盖八个类别，并引入了四个指标：推理对齐、时间一致性、物理合理性以及视觉质量。对11个最先进的模型的实验显示，这些模型在处理隐含约束下的复杂场景时存在缺陷，这表明需要改进生成模型的推理能力。

DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents

Authors: Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, You Li, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

First: 2025-10-29T02:21:10+00:00 · Latest: 2026-02-05T18:29:20+00:00

Abstract

Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior (e.g., premature convergence) and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 36,383 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, enabling evaluation at the utterance and group levels (and supporting future individual-level analyses). We instantiate "digital twin" RPLAs with seven LLMs and evaluate across two settings: next-message prediction and full conversation rollout, using stance-alignment and opinion-convergence metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. Post-training via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improves stance alignment and brings group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions.

中文标题/摘要

标题：辩论：评估角色扮演大语言模型代理意见动态的大规模基准

准确地通过社会互动建模意见变化对于理解并缓解极化、错误信息和社会冲突至关重要。近期研究使用角色扮演大语言模型代理（RPLA）模拟意见动态，但多代理模拟往往表现出不自然的群体行为（例如，过早收敛），并且缺乏评估其与真实人类群体互动一致性的经验基准。我们引入了DEBATE，一个大规模基准，用于评估多代理RPLA模拟中意见动态的真实性。DEBATE 包含来自708个群体和107个主题的2,832名美国参与者的36,383条消息，包括公开消息和私人李克特量表信念，支持在语句和群体层面进行评估（并支持未来个体层面的分析）。我们使用七种大语言模型实例化“数字双胞胎”RPLA，并在两种设置下进行评估：下一条消息预测和完整对话展开，使用立场一致性和意见收敛度指标。在零样本设置中，RPLA群体相对于人类群体表现出强烈的意见收敛。通过监督微调（SFT）和直接偏好优化（DPO）进行后训练提高了立场一致性，并使群体层面的收敛更接近人类行为，尽管意见变化和信念更新仍存在差异。DEBATE 为模拟意见动态提供了严格的基准测试，并支持未来研究将多代理RPLA与现实人类互动对齐。

Summary / 总结

The research aims to evaluate the authenticity of opinion dynamics in role-playing LLM agents (RPLAs) through a large-scale benchmark called DEBATE. The method involves collecting 36,383 messages from 2,832 participants across 708 groups and 107 topics, using both public messages and private beliefs. Key findings show that RPLA groups exhibit strong opinion convergence compared to human groups, but discrepancies remain in opinion change and belief updating. Supervised fine-tuning and Direct Preference Optimization improve stance alignment and group-level convergence, though not fully matching human behavior.

研究旨在评估多代理角色扮演LLM代理模拟中的意见动态的真实性，这对于理解社会冲突至关重要。研究引入了DEBATE，一个包含36,383条消息的大型基准，来自2,832名参与者，涉及708个群体和107个主题。该基准使用立场对齐和意见收敛度量来评估RPLAs，在零样本设置下表现出强烈的意见收敛，但在经过监督微调和直接偏好优化后，群体层面的收敛性更接近人类行为，尽管意见变化和信念更新仍存在差异，这为未来研究指明了方向。
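
The abstract does not spell out how opinion convergence is computed. One plausible group-level measure, shown here purely as an assumption, is the drop in the spread of private Likert-scale beliefs from before to after a conversation. The ratings below are made up and do not come from the DEBATE data.

```python
from statistics import pstdev


def convergence(pre_beliefs, post_beliefs):
    """Drop in within-group spread of Likert ratings; positive means opinions moved closer."""
    return pstdev(pre_beliefs) - pstdev(post_beliefs)


# Hypothetical 4-person groups starting from the same private ratings.
human_group = {"pre": [1, 3, 5, 6], "post": [2, 3, 5, 5]}
agent_group = {"pre": [1, 3, 5, 6], "post": [4, 4, 4, 4]}

human_score = convergence(human_group["pre"], human_group["post"])
agent_score = convergence(agent_group["pre"], agent_group["post"])
```

In this toy case the simulated group collapses to a single rating while the human group only narrows slightly, which is the kind of gap the benchmark is meant to expose.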

SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Authors: Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, Chen Zhao

Venue: ACL

First: 2026-02-05T18:25:24+00:00 · Latest: 2026-02-05T18:25:24+00:00

Comments: Submission to ACL ARR 2026 January

Abstract

Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.

中文标题/摘要

标题：SAGE：评估和提升深度研究代理的检索能力

深度研究代理已发展成为处理复杂查询的强大系统。与此同时，基于LLM的检索器展示了在遵循指令或推理方面的能力。这引发了一个关键问题：基于LLM的检索器能否有效支持深度研究代理的工作流程？为了探讨这一问题，我们引入了SAGE，这是一个由1200个跨四个科学领域的问题组成的科学文献检索基准，包含20万篇论文的检索语料库。我们评估了六种深度研究代理，并发现所有系统在需要推理的检索任务中都表现不佳。以DR Tulu为骨干，我们进一步比较了BM25和基于LLM的检索器（即ReasonIR和gte-Qwen2-7B-instruct）作为替代搜索工具。令人惊讶的是，BM25在性能上显著优于基于LLM的检索器，大约高出30%，因为现有代理生成的是关键词导向的子查询。为了提高性能，我们提出了一种基于语料库的测试时缩放框架，利用LLM增强文档的元数据和关键词，使现成的检索器更容易进行检索。这分别在简短和开放式问题上提高了8%和2%。

Summary / 总结

The paper introduces SAGE, a benchmark for evaluating scientific literature retrieval with 1,200 queries across four domains and a 200,000-paper corpus. It evaluates six deep research agents and finds that they struggle with reasoning-intensive retrieval. Using DR Tulu, the study compares BM25 and LLM-based retrievers, showing that BM25 outperforms LLM-based retrievers by about 30%. To improve performance, the authors propose a corpus-level test-time scaling framework that uses LLMs to augment documents, resulting in 8% and 2% gains for short-form and open-ended questions, respectively.

论文介绍了SAGE基准，包含1,200个跨四个领域的问题和一个20万篇论文的语料库。研究发现，深度研究代理在推理密集型检索中表现不佳。使用DR Tulu，研究比较了BM25和基于LLM的检索器，BM25的表现比基于LLM的检索器高出约30%。为了提高性能，作者提出了一种基于语料库的测试时缩放框架，使用LLM增强文档的元数据和关键词，分别在短形式和开放式问题上实现了8%和2%的提升。
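
BM25 rewards exact term overlap, which helps explain why keyword-oriented sub-queries favor it. The sketch below is a minimal Okapi BM25 scorer with whitespace tokenization and common default parameters (`k1=1.5`, `b=0.75`). The three-document corpus is a made-up example, and this is not the retrieval code evaluated in SAGE.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF; the "+ 1.0" keeps it non-negative for common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            # Length normalization: long documents need more matches to score as high.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


corpus = [
    "graph neural networks for molecule property prediction",
    "protein folding with diffusion models",
    "retrieval benchmarks for scientific literature",
]
scores = bm25_scores("diffusion protein", corpus)
best_index = max(range(len(corpus)), key=scores.__getitem__)
```

Augmenting documents with extra keywords, as the corpus-level framework does, raises the chance that such exact-match terms appear in the relevant paper.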
