arXiv 论文速递

Snapshot: 20260208_0328

EigenLoRAx: Recycling Adapters to Find Principal Subspaces for Resource-Efficient Adaptation and Inference

Authors: Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, Alan Yuille

Venue: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pages 649-659

First: 2025-02-07T07:07:04+00:00 · Latest: 2026-02-05T18:59:59+00:00

Abstract

The rapid growth of large models has raised concerns about their environmental impact and equity in accessibility due to significant computational costs. Low-Rank Adapters (LoRA) offer a lightweight solution for finetuning large models, resulting in an abundance of publicly available adapters tailored to diverse domains. We ask: Can these pretrained adapters be leveraged to further streamline adaptation to new tasks while addressing these challenges? We introduce EigenLoRAx, a parameter-efficient finetuning method that recycles existing adapters to create a principal subspace aligned with their shared domain knowledge which can be further augmented with orthogonal basis vectors in low-resource scenarios. This enables rapid adaptation to new tasks by learning only lightweight coefficients on the principal components of the subspace, eliminating the need to finetune entire adapters. EigenLoRAx requires significantly fewer parameters and memory, improving efficiency for both training and inference. Our method demonstrates strong performance across diverse domains and tasks, offering a scalable solution for edge-based applications, personalization, and equitable deployment of large models in resource-constrained environments.
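The core idea in the abstract, building a principal subspace from existing adapters and then fitting only a few coefficients, can be illustrated with a small standard-library sketch. Everything here is an assumption for illustration: adapters are treated as flattened vectors, the subspace is found with plain power iteration and deflation, and the coefficients come from projecting a known target rather than from gradient training on a task loss as the paper describes.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def principal_basis(adapters, k, iters=200, seed=0):
    """Top-k (uncentered) principal directions of flattened adapter vectors,
    computed by power iteration with deflation."""
    rng = random.Random(seed)
    dim = len(adapters[0])
    residual = [list(v) for v in adapters]
    basis = []
    for _ in range(k):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # Apply X^T X to w without forming the matrix explicitly.
            proj = [dot(row, w) for row in residual]
            w = [sum(p * row[j] for p, row in zip(proj, residual)) for j in range(dim)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                return basis  # adapters span fewer than k directions
            w = [x / norm for x in w]
        basis.append(w)
        # Deflate: remove the found direction from every adapter.
        deflated = []
        for row in residual:
            c = dot(row, w)
            deflated.append([x - c * wj for x, wj in zip(row, w)])
        residual = deflated
    return basis

def fit_coefficients(basis, target):
    """With an orthonormal basis, the least-squares coefficients are projections."""
    return [dot(b, target) for b in basis]

def reconstruct(basis, coeffs):
    dim = len(basis[0])
    return [sum(c * b[j] for c, b in zip(coeffs, basis)) for j in range(dim)]

# Hypothetical "pretrained adapters": four vectors in a 2-D subspace of R^6.
adapters = [
    [1.0, 2.0, 2.0, 0.0, 0.0, 0.0],
    [2.0, -1.0, -1.0, 0.0, 0.0, 0.0],
    [3.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
]
basis = principal_basis(adapters, k=2)
new_task = [5.0, -3.0, -3.0, 0.0, 0.0, 0.0]   # lies in the shared subspace
coeffs = fit_coefficients(basis, new_task)    # only 2 numbers instead of 6
approx = reconstruct(basis, coeffs)
```

In this toy case the new adapter is described by two coefficients rather than six raw parameters, which mirrors the parameter saving the abstract claims, though at a vastly smaller scale.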

中文标题/摘要

标题：EigenLoRAx：回收适配器以发现资源高效适应和推理的主要子空间

大型模型的快速增长引发了对其环境影响和访问公平性的担忧，因为它们需要大量的计算资源。低秩适配器（LoRA）提供了一种轻量级的解决方案，用于微调大型模型，从而产生了大量针对不同领域量身定制的公开适配器。我们提出的问题是：这些预训练的适配器能否进一步简化对新任务的适应，同时解决这些挑战？我们介绍了EigenLoRAx，这是一种参数高效的微调方法，通过回收现有适配器来创建与它们共享领域知识对齐的主要子空间，并在低资源场景中进一步扩展为正交基向量。这使得通过仅学习子空间主成分上的轻量级系数来快速适应新任务成为可能，从而消除了对整个适配器进行微调的需要。EigenLoRAx 需要的参数和内存显著减少，提高了训练和推理的效率。我们的方法在各种领域和任务中表现出强大的性能，为边缘应用、个性化和资源受限环境中大型模型的公平部署提供了可扩展的解决方案。

Summary / 总结

EigenLoRAx is a parameter-efficient method that recycles pretrained adapters to create a principal subspace aligned with shared domain knowledge, enabling rapid adaptation to new tasks with fewer parameters and memory. It offers strong performance across various domains and tasks, suitable for resource-constrained environments and equitable deployment of large models.

EigenLoRAx 是一种参数高效的方法，通过回收现有适配器来创建与共享领域知识对齐的主要子空间，从而实现快速适应新任务，同时减少所需参数和内存。它在各种领域和任务中表现出色，适用于边缘设备上的应用和资源受限环境中的公平部署。

Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

Authors: Xuejun Zhang, Aditi Tiwari, Zhenhailong Wang, Heng Ji

First: 2026-02-05T18:59:55+00:00 · Latest: 2026-02-05T18:59:55+00:00

Abstract

Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
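The pose-accuracy criterion mentioned above (rotation within 20°, translation within a 0.5 threshold) can be made concrete with a short sketch. The abstract does not say how the errors are measured, so the geodesic rotation angle and the Euclidean translation distance used here are assumptions, as are all function names.

```python
import math

def rotation_z(deg):
    """3x3 rotation about the z-axis, used only to build test poses."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_error_deg(r_pred, r_true):
    """Geodesic angle between two rotations: arccos((trace(R_pred^T R_true) - 1) / 2)."""
    # trace(A^T B) equals the sum of elementwise products of A and B.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error(t_pred, t_true):
    """Euclidean distance between camera positions (an assumed metric)."""
    return math.dist(t_pred, t_true)

def pose_within_thresholds(r_pred, t_pred, r_true, t_true,
                           rot_thresh_deg=20.0, trans_thresh=0.5):
    """Return (rotation_ok, translation_ok) for one predicted pose."""
    return (rotation_error_deg(r_pred, r_true) <= rot_thresh_deg,
            translation_error(t_pred, t_true) <= trans_thresh)

rot_ok, trans_ok = pose_within_thresholds(
    rotation_z(30.0), [0.0, 0.0, 0.0],
    rotation_z(15.0), [0.3, 0.4, 0.0],
)
```

The reported inference times also imply a speedup of 256.6 / 1.45, or roughly 177 times, per example.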

中文标题/摘要

标题：从视角描述预测相机姿态以进行空间推理

多图像空间推理仍然是当前多模态大型语言模型（MLLMs）面临的挑战。虽然单视角感知本质上是二维的，但多视角推理需要在不同视角之间构建连贯的场景理解。特别是，我们研究了视角转换，其中模型必须从多视角观察中构建连贯的三维理解，并使用它从新的、语言指定的视角进行推理。我们引入了CAMCUE，这是一种姿态感知的多图像框架，使用相机姿态作为跨视图融合和新视图推理的显式几何锚点。CAMCUE 将每视角姿态注入视觉标记，将自然语言视角描述定位到目标相机姿态，并合成姿态条件下的想象目标视图以支持回答。为了支持这一设置，我们收集了CAMCUE-DATA，其中包括27,668个训练实例和508个测试实例，这些实例将多视角图像和姿态与多样化的目标视角描述和视角转换问题配对。我们还在测试分割中包括了人工标注的视角描述，以评估对人类语言的泛化能力。CAMCUE 的整体准确率提高了9.06%，并且能够从自然语言视角描述中预测目标姿态，旋转准确率超过90%（误差在20°以内），平移准确率在0.5误差阈值以内超过90%。这种直接定位避免了昂贵的测试时搜索和匹配，将每个示例的推理时间从256.6秒减少到1.45秒，从而在实际场景中实现快速、交互式使用。

Summary / 总结

The paper addresses the challenge of multi-image spatial reasoning for multimodal large language models by introducing CAMCUE, a pose-aware framework that uses camera pose as a geometric anchor for cross-view fusion and novel-view reasoning. The method injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view. Experiments on the CAMCUE-DATA dataset show a 9.06% improvement in overall accuracy and over 90% accuracy in predicting target poses from natural-language descriptions within specified error thresholds, significantly reducing inference time from 256.6s to 1.45s per example.

论文通过引入CAMCUE框架，解决多视角空间推理的挑战，该框架以相机姿态作为几何锚点进行跨视角融合和新视角推理。方法将每视角姿态注入视觉标记，并将自然语言视角描述定位到目标相机姿态，生成姿态条件下的想象目标视角。实验表明，CAMCUE在CAMCUE-DATA数据集上的整体准确率提高了9.06%，在指定误差阈值内从自然语言描述预测目标姿态的旋转准确率超过90%，并将推理时间从每例256.6秒显著减少到1.45秒，便于实时交互使用。

DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching

Authors: Yuxing Lu, Yucheng Hu, Xukai Zhao, Jiuxin Cao

First: 2026-02-05T18:59:51+00:00 · Latest: 2026-02-05T18:59:51+00:00

Abstract

Multi-agent systems built from prompted large language models can improve multi-round reasoning, yet most existing pipelines rely on fixed, trajectory-wide communication patterns that are poorly matched to the stage-dependent needs of iterative problem solving. We introduce DyTopo, a manager-guided multi-agent framework that reconstructs a sparse directed communication graph at each round. Conditioned on the manager's round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms the strongest baseline (avg. +6.2). Beyond accuracy, DyTopo yields an interpretable coordination trace via the evolving graphs, enabling qualitative inspection of how communication pathways reconfigure across rounds.
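A minimal sketch of the query/key matching step may help make the routing idea concrete. It assumes a bag-of-words embedding and a fixed cosine threshold as stand-ins for whatever embedding model and matching rule DyTopo actually uses; the agent names, descriptors, and threshold are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (a stand-in for a learned text encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a if w in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def build_round_graph(descriptors, threshold=0.3):
    """Return directed edges (sender, receiver): an edge exists when the
    sender's key (offer) matches the receiver's query (need)."""
    edges = []
    for receiver, r_desc in descriptors.items():
        need = embed(r_desc["query"])
        for sender, s_desc in descriptors.items():
            if sender != receiver and cosine(embed(s_desc["key"]), need) >= threshold:
                edges.append((sender, receiver))
    return edges

# Hypothetical descriptors for one round.
round_descriptors = {
    "coder":   {"query": "failing test cases", "key": "draft python implementation"},
    "tester":  {"query": "python implementation", "key": "failing test cases and logs"},
    "planner": {"query": "none needed", "key": "task decomposition plan"},
}
edges = build_round_graph(round_descriptors)
```

Here the coder and tester exchange messages in both directions while the planner stays disconnected, so the graph is sparse and can be rebuilt from fresh descriptors at the next round.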

中文标题/摘要

标题：DyTopo：基于语义匹配的多智能体动态拓扑路由

由提示的大语言模型构建的多智能体系统可以提高多轮推理能力，但大多数现有流程依赖于固定且贯穿整个轨迹的通信模式，这些模式与迭代问题解决过程中阶段特定的需求匹配不佳。我们引入了DyTopo，这是一种由管理者指导的多智能体框架，在每一轮中重构一个稀疏的有向通信图。基于管理者的轮次目标，每个智能体输出轻量级的自然语言查询（需求）和键（提供）描述符；DyTopo嵌入这些描述符并进行语义匹配，仅沿诱导的边路由私有消息。在代码生成和数学推理基准测试以及四个LLM基础模型上，DyTopo始终优于最强基线（平均+6.2）。除了准确性之外，DyTopo还通过不断变化的图提供了可解释的协调轨迹，使人们能够定性地检查通信路径如何在轮次之间重新配置。

Summary / 总结

DyTopo is a manager-guided multi-agent framework that improves multi-round reasoning by reconstructing a sparse directed communication graph at each round, conditioned on the manager's round goal. Each agent outputs lightweight natural-language query (need) and key (offer) descriptors, which are embedded and semantically matched so that private messages travel only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo outperforms the strongest baseline by an average of +6.2. The evolving graphs also provide an interpretable coordination trace, allowing qualitative inspection of how communication pathways reconfigure across rounds.

DyTopo 是一种动态拓扑路由框架，通过在每轮根据管理者的目标重建稀疏有向通信图来提升多轮推理。各智能体输出轻量级的自然语言查询（需求）和键（提供）描述符，然后通过语义匹配来路由私有消息。DyTopo 在多种基准测试和 LLM 基础模型上均优于最强基线，平均提升 6.2。此外，DyTopo 通过不断变化的图提供了可解释的协调轨迹，便于定性检查通信路径在各轮之间如何重新配置。

SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Authors: Jintao Tong, Shilin Yan, Hongwei Xue, Xiaojun Tang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou

First: 2026-02-05T18:59:51+00:00 · Latest: 2026-02-05T18:59:51+00:00

Comments: Project Page: https://accio-lab.github.io/SwimBird

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
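The hybrid autoregressive formulation described above pairs next-token prediction for textual thoughts with next-embedding prediction for visual thoughts. A minimal sketch of such a combined objective follows. The abstract does not give the loss, so using cross-entropy for text steps, mean squared error for visual steps, and a scalar weight between them are all assumptions.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

def mean_squared_error(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def hybrid_loss(steps, embed_weight=1.0):
    """Average loss over a mixed sequence of reasoning steps.

    Each step is ("text", logits, target_token_id) or
    ("visual", predicted_embedding, target_embedding).
    """
    total = 0.0
    for kind, pred, target in steps:
        if kind == "text":
            total += cross_entropy(pred, target)
        else:
            total += embed_weight * mean_squared_error(pred, target)
    return total / len(steps)

# An interleaved sequence: one textual step, then one visual step.
example_steps = [
    ("text", [0.0, 0.0], 0),
    ("visual", [1.0, 1.0], [1.0, 3.0]),
]
loss = hybrid_loss(example_steps)
```

A text-only or vision-only trace is the special case where every step has the same kind, which is how one objective could cover all three reasoning modes.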

中文标题/摘要

标题：SwimBird：在混合自回归MLLM中引发可切换的推理模式

多模态大型语言模型（MLLMs）通过将视觉与语言结合，在多模态感知和推理方面取得了显著进展。然而，大多数现有的MLLMs主要通过文本CoT进行推理，这限制了它们在视觉密集型任务上的效果。最近的方法将固定数量的连续隐藏状态作为“视觉思考”注入推理过程，从而提高了视觉性能，但通常会牺牲基于文本的逻辑推理能力。我们认为核心限制在于一种僵化的、预先定义的推理模式，无法根据不同用户查询自适应地选择最合适的思考模态。我们引入了SwimBird，这是一种可切换的MLLM，根据输入动态切换三种推理模式：（1）仅文本推理，（2）仅视觉推理（连续隐藏状态作为视觉思考），（3）视觉-文本交错推理。为了实现这一能力，我们采用了一种混合自回归公式，将文本思考的下一个标记预测与视觉思考的下一个嵌入预测统一起来，并设计了一种系统性的推理模式策展策略，构建了SwimBird-SFT-92K，这是一个涵盖所有三种推理模式的多样化监督微调数据集。通过实现灵活、查询自适应的模式选择，SwimBird在保持强大文本逻辑的同时，显著提高了视觉密集任务的性能。跨多种涵盖文本推理和挑战性视觉理解的基准实验表明，SwimBird在先前固定模式多模态推理方法上取得了最先进的结果和稳健的提升。

Summary / 总结

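SwimBird is a reasoning-switchable MLLM that selects among text-only, vision-only, and interleaved vision-text reasoning depending on the input. It unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts in a hybrid autoregressive formulation, and is trained on SwimBird-SFT-92K, a supervised fine-tuning dataset covering all three reasoning modes. Across textual-reasoning and visual-understanding benchmarks, SwimBird preserves textual logic while improving on vision-dense tasks relative to prior fixed-pattern multimodal reasoning methods.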

SwimBird 是一种动态切换三种推理模式的 MLLM，根据输入切换为文本-only、视觉-only 和视觉-文本交织模式。通过采用混合自回归建模和推理模式整理策略，SwimBird 在视觉密集任务上表现出色，同时保持了强大的文本逻辑推理能力。实验表明，SwimBird 在各种基准测试中优于之前的固定模式多模态推理方法。

CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction

Authors: Xiaopan Zhang, Zejin Wang, Zhixu Li, Jianpeng Yao, Jiachen Li

Venue: ICRA 2026

First: 2026-02-05T18:59:45+00:00 · Latest: 2026-02-05T18:59:45+00:00

Comments: IEEE International Conference on Robotics and Automation (ICRA 2026); Project Website: https://comm-cp.github.io/

Abstract

To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.
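For readers unfamiliar with conformal prediction, the following is a sketch of the standard split-conformal recipe: calibrate a nonconformity threshold on held-out scores, then keep only candidates that fall under it. How CommCP applies this to generated messages is not specified in the abstract, so the candidate answers, probabilities, and the `1 - p` nonconformity score are illustrative assumptions.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest
    calibration nonconformity score (infinite if that rank exceeds n)."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(calibration_scores)[rank - 1]

def prediction_set(candidate_probs, threshold):
    """Keep candidates whose nonconformity (1 - probability) is within the threshold."""
    return [c for c, p in candidate_probs.items() if 1.0 - p <= threshold]

# Hypothetical calibration scores: 1 - p(true answer) on held-out questions.
calibration = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7]
threshold = conformal_threshold(calibration, alpha=0.25)

# Hypothetical model confidences for "where is the mug?".
candidates = {"kitchen": 0.6, "living room": 0.3, "bedroom": 0.1}
to_share = prediction_set(candidates, threshold)
```

In this toy case only the "kitchen" hypothesis passes calibration, so an agent following this rule would share a single confident answer rather than distracting its teammates with low-confidence alternatives.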

中文标题/摘要

标题：CommCP：通过基于LLM的通信与符合性预测实现高效的多智能体协调

为了完成人类以自然语言下达的任务，机器人必须解释命令、生成和回答相关问题以理解场景，并操作目标物体。实际部署中，通常需要具有不同操作能力的多个异构机器人协同处理不同的任务。除了需要专门的操作技能外，有效的信息收集对于完成这些任务也很重要。为了解决问题的这一部分，我们将完全协作环境中的信息收集过程形式化为一个尚未被充分探索的多智能体多任务具身问答（MM-EQA）问题，这是对经典具身问答（EQA）的新颖扩展，其中有效的沟通对于无冗余地协调各方工作至关重要。为了解决这个问题，我们提出了CommCP，一种面向MM-EQA的新型基于LLM的去中心化通信框架。我们的框架采用符合性预测来校准生成的消息，从而减少对接收者的干扰并提高通信可靠性。为了评估我们的框架，我们引入了一个MM-EQA基准，其中包含多样化、照片级逼真的家庭场景和具身问题。实验结果表明，与基线相比，CommCP显著提高了任务成功率和探索效率。实验视频、代码和数据集可在我们的项目网站上获得：https://comm-cp.github.io/

Summary / 总结

CommCP is a novel LLM-based communication framework designed to enhance multi-agent coordination in Embodied Question Answering tasks, where robots must interpret commands and collaborate effectively. It uses conformal prediction to calibrate messages, reducing distractions and improving communication reliability. Experiments show that CommCP significantly improves task success rates and exploration efficiency compared to baseline methods.

研究旨在提高多机器人协作完成自然语言任务的能力。提出了一种新颖的基于LLM的去中心化通信框架CommCP，以增强信息收集并减少冗余。该框架使用符合性预测来校准消息，提高通信可靠性。实验结果显示，CommCP在新的包含多样家庭场景的MM-EQA基准测试中，显著提高了任务成功率和探索效率，优于基线方法。

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Authors: Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, Xiaodan Liang

First: 2026-02-05T18:59:32+00:00 · Latest: 2026-02-05T18:59:32+00:00

Abstract

Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
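To illustrate what "frame-strict cross-attention" with a gate might look like, here is a single-query sketch. It is not GeoThinker's implementation: the restriction of each semantic query to geometry tokens from its own frame follows the abstract, but the scaled dot-product scoring, the scalar gate on a residual update, and all names are assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def frame_strict_attend(query, query_frame, geometry_tokens, gate):
    """Update one semantic query vector with geometry from its own frame only.

    geometry_tokens: list of (frame_id, vector) pairs.
    gate: scalar in [0, 1] scaling how much geometry is mixed in (assumed form).
    """
    same_frame = [vec for frame, vec in geometry_tokens if frame == query_frame]
    if not same_frame:
        return list(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, vec)) / scale for vec in same_frame]
    weights = softmax(scores)
    attended = [sum(w * vec[j] for w, vec in zip(weights, same_frame))
                for j in range(len(query))]
    # Residual update: the gate controls how much retrieved geometry is added.
    return [q + gate * a for q, a in zip(query, attended)]

geometry = [(0, [2.0, 0.0]), (1, [100.0, 100.0])]
updated = frame_strict_attend([1.0, 0.0], 0, geometry, gate=1.0)
```

The token from frame 1 has no effect on a query from frame 0, and setting the gate to zero leaves the semantic feature untouched, which is the selective behaviour the abstract contrasts with indiscriminate fusion.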

中文标题/摘要

标题：用几何思考：面向空间推理的主动几何集成

多模态大型语言模型（MLLMs）在空间推理方面的最新进展越来越多地利用3D编码器提供的几何先验。然而，大多数现有的集成策略仍然是被动的：几何信息作为全局流暴露，并以不加区分的方式融合，这往往导致语义-几何错位和冗余信号。我们提出了GeoThinker框架，将范式从被动融合转向主动感知。GeoThinker 不做特征混合，而是使模型能够根据其内部推理需求选择性地检索几何证据。GeoThinker 通过在精心选择的VLM层上应用空间接地融合（Spatial-Grounded Fusion）来实现这一点，其中语义视觉先验通过帧严格的交叉注意力选择性地查询和整合与任务相关的几何信息，并通过重要性门控（Importance Gating）进一步校准，使逐帧注意力偏向与任务相关的结构。全面的评估结果表明，GeoThinker 在空间智能方面取得了新的最先进水平，在VSI-Bench上达到峰值分数72.6。此外，GeoThinker 在复杂下游场景中展示了强大的泛化能力和显著改进的空间感知能力，包括具身指代和自动驾驶。我们的结果表明，主动整合空间结构的能力对于下一代空间智能至关重要。代码可以在 https://github.com/Li-Hao-yuan/GeoThinker 获取。

Summary / 总结

The research aims to enhance spatial reasoning by actively integrating geometric information into multimodal large language models (MLLMs). GeoThinker, the proposed framework, shifts from passive fusion to active perception, allowing the model to selectively retrieve geometric evidence based on its reasoning needs. This is achieved through Spatial-Grounded Fusion at specific layers, where semantic visual priors query and integrate task-relevant geometry via frame-strict cross-attention, further refined by Importance Gating. GeoThinker achieves a new state-of-the-art score of 72.6 on the VSI-Bench and shows robust performance in complex downstream scenarios such as embodied referring and autonomous driving.

研究旨在通过解决多模态大型语言模型（MLLMs）中被动几何集成的局限性，提升空间推理能力。GeoThinker 是一种新框架，将范式从被动融合转向主动感知，使模型能够根据其推理需求选择性地检索几何证据。这通过在特定 VLM 层上的空间接地融合实现，使用帧严格的交叉注意力和重要性门控。结果显示，GeoThinker 在 VSI-Bench 上达到 72.6 的峰值得分，并在诸如体感引用和自动驾驶等复杂场景中表现出色。

DFlash: Block Diffusion for Flash Speculative Decoding

Authors: Jian Chen, Yesheng Liang, Zhijian Liu

First: 2026-02-05T18:59:30+00:00 · Latest: 2026-02-05T18:59:30+00:00

Abstract

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
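The draft-then-verify loop behind speculative decoding can be sketched with toy stand-ins for both models. This is the generic greedy-verification scheme, not DFlash itself: `target_next` and `toy_drafter` are hypothetical deterministic functions, and the per-position loop in `verify_block` emulates what a real target model would score in one parallel forward pass.

```python
def target_next(prefix):
    """Toy stand-in for the target LLM's greedy next token."""
    return (prefix[-1] + 1) % 5

def toy_drafter(prefix, block_size):
    """Toy stand-in for a block drafter: proposes a whole block at once and
    is deliberately wrong whenever the correct token would be 3."""
    block = [(prefix[-1] + 1 + i) % 5 for i in range(block_size)]
    return [0 if tok == 3 else tok for tok in block]

def verify_block(prefix, draft, next_token):
    """Accept the longest draft prefix that matches the target's greedy choices,
    then append one token from the target (a correction or a bonus token)."""
    accepted = []
    for tok in draft:
        expected = next_token(prefix + accepted)
        if tok != expected:
            accepted.append(expected)   # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(next_token(prefix + accepted))  # whole block accepted
    return accepted

def speculative_decode(prefix, n_tokens, drafter, next_token, block_size=4):
    """Generate n_tokens; returns (sequence, number of verification rounds)."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < n_tokens:
        out += verify_block(out, drafter(out, block_size), next_token)
        rounds += 1
    return out[:len(prefix) + n_tokens], rounds

def greedy_decode(prefix, n_tokens, next_token):
    """Reference: ordinary one-token-at-a-time decoding with the target."""
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(next_token(out))
    return out

sequence, rounds = speculative_decode([0], 8, toy_drafter, target_next)
```

The output is identical to plain greedy decoding (the "lossless" property), yet the eight tokens here need only two verification rounds because most drafted tokens are accepted. Higher acceptance rates translate directly into fewer rounds.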

中文标题/摘要

标题：DFlash：用于快速（Flash）投机解码的块扩散

自回归大型语言模型（LLMs）表现出色，但其解码本质上是顺序的，导致高推理延迟和较差的GPU利用率。投机解码通过使用快速草稿模型并由目标LLM并行验证其输出来缓解这一瓶颈；然而，现有方法仍然依赖于自回归草稿生成，这仍然是顺序的，限制了实际加速。扩散LLMs能够并行生成，提供了一种有前景的替代方案，但当前的扩散模型通常在性能上不如自回归模型。在本文中，我们介绍了DFlash，这是一种投机解码框架，采用轻量级块扩散模型进行并行草稿生成。通过在单次前向传递中生成草稿标记，并以从目标模型提取的上下文特征作为草稿模型的条件，DFlash能够高效地生成草稿，输出质量高且接受率更高。实验表明，DFlash在各种模型和任务上实现了超过6倍的无损加速，其加速比最高可达最先进的投机解码方法EAGLE-3的2.5倍。

Summary / 总结

DFlash is a speculative decoding framework that uses a lightweight block diffusion model for parallel drafting, addressing the sequential nature of autoregressive models. It generates draft tokens in a single forward pass and conditions the draft model on context features from the target model, achieving over 6x lossless acceleration across various models and tasks, with up to 2.5x higher speedup compared to the state-of-the-art speculative decoding method EAGLE-3.

DFlash 是一种投机解码框架，使用轻量级的块扩散模型进行并行草稿生成，解决了自回归模型的顺序性问题。它在一次前向传递中生成草稿标记，并以从目标模型提取的上下文特征作为草稿模型的条件，在各种模型和任务上实现了超过 6 倍的无损加速，其加速比最高可达最先进的投机解码方法 EAGLE-3 的 2.5 倍。

InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

Authors: Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei, Yu-Xiong Wang, Liangyan Gui

First: 2026-02-05T18:59:27+00:00 · Latest: 2026-02-05T18:59:27+00:00

Comments: Webpage: https://sirui-xu.github.io/InterPrior/

Abstract

Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.

中文标题/摘要

标题：InterPrior：扩展基于物理的人-物交互生成式控制

人类很少在显式全身动作的层面上规划与物体的全身交互。高层次意图（如可供性）定义目标，而协调的平衡、接触和操作则可以从底层的物理和运动先验中自然地涌现出来。扩展这些先验对于使人形机器人能够跨多种情境组合和泛化移动操作技能，同时保持物理上连贯的全身协调至关重要。为此，我们提出了InterPrior，这是一种可扩展的框架，通过大规模模仿预训练和基于强化学习的后训练来学习一个统一的生成式控制器。InterPrior首先将一个全参考模仿专家蒸馏为一个多功能、以目标为条件的变分策略，该策略可以从多模态观察和高层次意图中重建运动。虽然蒸馏出的策略可以重建训练行为，但由于大规模人-物交互的配置空间十分庞大，它无法可靠地泛化。为了解决这个问题，我们应用了带物理扰动的数据增强，然后通过强化学习微调来提高在未见过的目标和初始状态上的能力。这些步骤共同将重建的潜在技能整合成一个有效的流形，产生一个泛化能力超出训练数据的运动先验，例如，它可以纳入与未见过的物体交互等新行为。我们进一步展示了其在用户交互控制中的有效性及其在真实机器人部署中的潜力。

Summary / 总结

The research aims to enable humanoids to generalize loco-manipulation skills across various contexts by scaling physical and motor priors. InterPrior is a scalable framework that combines large-scale imitation pretraining with reinforcement learning post-training. It first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. Because the distilled policy alone does not generalize reliably, the authors add data augmentation with physical perturbations and then finetune with reinforcement learning on unseen goals and initializations. The resulting motion prior can incorporate new behaviors such as interactions with unseen objects; the authors also demonstrate user-interactive control and report potential for real robot deployment.

InterPrior 是一个可扩展的框架，用于训练统一的生成控制器，使类人机器人能够执行多种移动操作技能。它通过大规模模仿预训练和强化学习提炼出一个多功能、基于目标的变分策略。通过物理扰动的数据增强和强化学习微调，该策略能够更好地泛化到未见过的目标和初始状态，从而使类人机器人能够融入新行为并与未见过的物体互动。

V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Authors: Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka

First: 2026-02-05T18:59:21+00:00 · Latest: 2026-02-05T18:59:21+00:00

Abstract

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (with 23.0% improvements on average), perception-driven reasoning reliability, and generalization.

中文标题/摘要

标题：V-Retrver：基于证据的代理推理在通用多模态检索中的应用

多模态大型语言模型（MLLMs）最近被应用于通用多模态检索，其中推理链（CoT）推理改善了候选检索结果的重新排序。然而，现有方法仍然主要依赖语言驱动，依赖静态视觉编码，缺乏主动验证细粒度视觉证据的能力，这往往导致在视觉含糊情况下进行推测性推理。我们提出了一种基于证据的检索框架V-Retrver，将多模态检索重新定义为基于视觉检查的代理推理过程。V-Retrver使MLLM能够在推理过程中通过外部视觉工具选择性地获取视觉证据，执行交替进行假设生成和目标视觉验证的多模态交织推理过程。为了训练这种证据收集检索代理，我们采用了一种基于课程的学习策略，结合监督推理激活、拒绝基础细化和与证据对齐的目标强化学习。在多个多模态检索基准上的实验表明，检索准确性（平均提高23.0%）、感知驱动的推理可靠性和泛化能力均得到了一致的提升。

Summary / 总结

V-Retrver is an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process. It enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process. Experiments show consistent improvements in retrieval accuracy, reasoning reliability, and generalization, with an average improvement of 23.0%.

研究旨在通过引入视觉证据驱动的推理来提升多模态检索，解决现有语言驱动方法的局限性。V-Retrver 是一种证据驱动的检索框架，使 MLLM 能够在推理过程中主动收集和验证视觉证据，从而提高检索准确性和可靠性。实验结果显示，检索准确率平均提高了 23.0%，并且具有更好的泛化能力。

Can vision language models learn intuitive physics from interaction?

Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz

First: 2026-02-05T18:59:20+00:00 · Latest: 2026-02-05T18:59:20+00:00

Abstract

Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.

中文标题/摘要

标题：视觉语言模型能否通过交互学习直观的物理知识？

预训练的视觉语言模型对物理世界的直觉不够好。最近的研究表明，监督微调可以提高模型在简单物理任务上的表现。然而，微调后的模型似乎没有学会能够泛化的稳健物理规则。基于认知科学的研究，我们假设模型需要与环境进行交互才能正确学习其物理动力学。我们使用强化学习训练通过与环境交互来学习的模型。虽然通过交互学习可以让模型在任务内的表现得到提升，但无法产生具有泛化物理直觉的模型。我们发现，即使任务共享视觉统计和物理原理，针对一个任务训练的模型也不可靠地泛化到相关任务，无论模型是通过交互还是其他方式训练。

Summary / 总结

The study investigates whether vision language models can learn intuitive physics from interaction. Despite improvements in performance through supervised fine-tuning, models still lack robust, generalizable physical intuitions. The research hypothesizes that interaction with the environment is necessary for learning physical dynamics. However, models trained through interaction do not generalize well to new tasks, even when tasks share similar visual and physical characteristics.

研究探讨了视觉语言模型是否可以通过互动学习直观的物理知识。尽管通过监督微调可以提高模型的性能，但模型仍然缺乏稳健且能够泛化的物理直觉。研究假设环境互动是学习物理动态所必需的。然而，通过互动训练的模型在新任务上的泛化能力较差，即使任务具有相似的视觉和物理特征也是如此。

PhysicsAgentABM: Physics-Guided Generative Agent-Based Modeling

Authors: Kavana Venkatesh, Yinhan He, Jundong Li, Jiaming Cui

First: 2026-02-05T18:59:01+00:00 · Latest: 2026-02-05T18:59:01+00:00

Abstract

Large language model (LLM)-based multi-agent systems enable expressive agent reasoning but are expensive to scale and poorly calibrated for timestep-aligned state-transition simulation, while classical agent-based models (ABMs) offer interpretability but struggle to integrate rich individual-level signals and non-stationary behaviors. We propose PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state-specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty-aware epistemic fusion yields calibrated cluster-level transition distributions. Individual agents then stochastically realize transitions under local constraints, decoupling population inference from entity-level variability. We further introduce ANCHOR, an LLM agent-driven clustering strategy based on cross-contextual behavioral responses and a novel contrastive loss, reducing LLM calls by up to 6-8 times. Experiments across public health, finance, and social sciences show consistent gains in event-time accuracy and calibration over mechanistic, neural, and LLM baselines. By re-architecting generative ABM around population-level inference with uncertainty-aware neuro-symbolic fusion, PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs.
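The two-level structure described above, a fused cluster-level transition probability followed by stochastic realization per agent, can be sketched in a few lines. The abstract does not define the fusion rule, so the inverse-variance weighting below is an assumed stand-in for "uncertainty-aware epistemic fusion". The S/I state labels, the `allowed` constraint hook, and all numbers are hypothetical.

```python
import random

def fuse_estimates(p_symbolic, var_symbolic, p_neural, var_neural):
    """Inverse-variance weighted average of two transition-probability
    estimates: the less uncertain source gets the larger weight."""
    w_s = 1.0 / var_symbolic
    w_n = 1.0 / var_neural
    return (w_s * p_symbolic + w_n * p_neural) / (w_s + w_n)

def realize_transitions(states, p_transition, rng, allowed=lambda index: True):
    """Each agent in state "S" moves to "I" with the cluster-level probability,
    subject to a local constraint; other agents keep their state."""
    return [
        "I" if state == "S" and allowed(i) and rng.random() < p_transition else state
        for i, state in enumerate(states)
    ]

# One inference per cluster, then cheap per-agent sampling.
p_cluster = fuse_estimates(p_symbolic=0.2, var_symbolic=0.01,
                           p_neural=0.4, var_neural=0.01)
rng = random.Random(0)
cluster_states = ["S", "S", "I", "S"]
next_states = realize_transitions(cluster_states, p_cluster, rng)
```

The expensive estimate is computed once per cluster rather than once per agent, which is the sense in which population inference is decoupled from entity-level variability.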

中文标题/摘要

标题：PhysicsAgentABM: 物理引导的生成性基于代理的建模

基于大型语言模型（LLM）的多代理系统能够实现富有表现力的代理推理，但扩展成本高昂，且在时间步对齐的状态转移模拟中校准不佳，而传统的基于代理的模型（ABM）则提供可解释性，但难以整合丰富的个体级信号和非平稳行为。我们提出了PhysicsAgentABM，将推理转移到行为一致的代理集群：状态专门化的符号代理编码机制性转换先验，多模态神经转换模型捕捉时间动态和交互动态，不确定性感知的认知融合生成校准的集群级转换分布。个体代理随后在局部约束下随机实现转换，将群体推理与实体级变异性解耦。我们还引入了ANCHOR，这是一种基于跨上下文行为响应和一种新颖对比损失的LLM代理驱动聚类策略，最多可将LLM调用次数减少6-8倍。在公共卫生、金融和社会科学领域的实验表明，与机制性、神经网络和LLM基线相比，PhysicsAgentABM在事件时间准确性和校准方面均表现出一致的改进。通过以不确定性感知的神经符号融合为基础、围绕群体级推理重构生成式ABM，PhysicsAgentABM确立了基于LLM的可扩展且校准良好的模拟新范式。

Summary / 总结

PhysicsAgentABM is designed to address the scalability and calibration issues of large language model (LLM)-based multi-agent systems and the interpretability and signal integration challenges of classical ABMs. It uses state-specialized symbolic agents to encode mechanistic transition priors, a multimodal neural transition model to capture temporal and interaction dynamics, and uncertainty-aware epistemic fusion to yield calibrated cluster-level transition distributions. Experiments across public health, finance, and social sciences demonstrate consistent improvements in event-time accuracy and calibration over mechanistic, neural, and LLM baselines. Additionally, ANCHOR, an LLM agent-driven clustering strategy, reduces LLM calls by up to 6-8 times. The method establishes a new paradigm for scalable and calibrated simulation with LLMs.

PhysicsAgentABM旨在解决大规模语言模型（LLM）基于的多智能体系统在可扩展性和校准方面的不足，以及经典ABM在可解释性和个体级信号整合方面的挑战。它通过行为一致的智能体集群工作，其中符号智能体编码机制性转换先验，神经模型捕捉时间动态和交互动态，而知识融合生成校准的集群级转换分布。进一步引入的ANCHOR是一种基于跨上下文行为响应的LLM驱动聚类策略，减少了LLM调用次数。实验结果显示，在公共卫生、金融和社会科学领域，相对于各种基线模型，其在事件时间准确性与校准方面均有所提升。

Curiosity is Knowledge: Self-Consistent Learning and No-Regret Optimization with Active Inference

Authors: Yingke Li, Anjali Parashar, Enlu Zhou, Chuchu Fan

First: 2026-02-05T18:58:32+00:00 · Latest: 2026-02-05T18:58:32+00:00

Abstract

Active inference (AIF) unifies exploration and exploitation by minimizing the Expected Free Energy (EFE), balancing epistemic value (information gain) and pragmatic value (task performance) through a curiosity coefficient. Yet it has been unclear when this balance yields both coherent learning and efficient decision-making: insufficient curiosity can drive myopic exploitation and prevent uncertainty resolution, while excessive curiosity can induce unnecessary exploration and regret. We establish the first theoretical guarantee for EFE-minimizing agents, showing that a single requirement--sufficient curiosity--simultaneously ensures self-consistent learning (Bayesian posterior consistency) and no-regret optimization (bounded cumulative regret). Our analysis characterizes how this mechanism depends on initial uncertainty, identifiability, and objective alignment, thereby connecting AIF to classical Bayesian experimental design and Bayesian optimization within one theoretical framework. We further translate these theories into practical design guidelines for tuning the epistemic-pragmatic trade-off in hybrid learning-optimization problems, validated through real-world experiments.
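The trade-off governed by the curiosity coefficient can be shown with a toy action-selection sketch. The abstract gives no formula, so the form used here, expected free energy as the negative of pragmatic value plus curiosity-weighted information gain, is an assumption, and the two candidate actions and their values are hypothetical.

```python
def expected_free_energy(pragmatic_value, information_gain, curiosity):
    """Assumed form: lower is better; curiosity scales the epistemic term."""
    return -(pragmatic_value + curiosity * information_gain)

def choose_action(actions, curiosity):
    """Pick the action with minimal expected free energy."""
    best = min(
        actions,
        key=lambda a: expected_free_energy(a["pragmatic"], a["info_gain"], curiosity),
    )
    return best["name"]

candidate_actions = [
    {"name": "exploit", "pragmatic": 1.0, "info_gain": 0.0},
    {"name": "explore", "pragmatic": 0.6, "info_gain": 0.5},
]

low_curiosity_choice = choose_action(candidate_actions, curiosity=0.1)
high_curiosity_choice = choose_action(candidate_actions, curiosity=2.0)
```

With a small coefficient the agent exploits and never gathers the information it would need to resolve uncertainty; with a larger one it explores. The paper's contribution is a guarantee about when the coefficient is large enough, which this sketch does not attempt to reproduce.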

中文标题/摘要

标题：好奇心即知识：自洽学习与无遗憾优化中的主动推理

主动推理（AIF）通过最小化预期自由能量（EFE），以好奇心系数平衡先验价值（信息获取）和实用价值（任务性能），统一了探索与利用。然而，这种平衡何时能同时实现连贯学习和高效决策尚不清楚：好奇心不足可能导致短视的利用并阻止不确定性解决，而好奇心过度则可能导致不必要的探索和遗憾。我们首次为EFE最小化代理提供了理论保证，表明单一要求——足够的好奇心——同时确保了自洽学习（贝叶斯后验一致性）和无遗憾优化（有界累积遗憾）。我们的分析描述了这种机制如何依赖于初始不确定性、可识别性和目标对齐，从而将AIF与经典贝叶斯实验设计和贝叶斯优化统一在一个理论框架中。我们进一步将这些理论转化为在混合学习-优化问题中调整先验-实用权衡的实际设计指南，并通过实际实验进行了验证。

Summary / 总结

The paper studies how active inference (AIF) agents balance exploration and exploitation: they minimize Expected Free Energy (EFE), with a curiosity coefficient weighting epistemic value (information gain) against pragmatic value (task performance). It provides the first theoretical guarantee that sufficient curiosity ensures both self-consistent learning (Bayesian posterior consistency) and no-regret optimization (bounded cumulative regret). The analysis characterizes how this guarantee depends on initial uncertainty, identifiability, and objective alignment, connecting AIF to classical Bayesian experimental design and Bayesian optimization. Practical design guidelines for tuning the epistemic-pragmatic trade-off are derived and validated through real-world experiments.

论文旨在通过最小化预期自由能量（EFE）和好奇心系数来解决主动推理（AIF）中的探索与利用之间的平衡问题。研究建立了理论保证，即足够的好奇心可以同时确保自我一致的学习和无遗憾优化。研究显示，这种平衡依赖于初始不确定性、可识别性和目标对齐，将AIF与贝叶斯实验设计和优化联系起来。提供了实用的设计指南来调整认知-实践的权衡，并通过实际实验进行了验证。
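
As a minimal sketch of the trade-off summarized above, the snippet below scores each action by a negative Expected Free Energy of the form "pragmatic value plus curiosity coefficient times expected information gain". The two-hypothesis setup, the names `beta`, `expected_info_gain`, and `select_action`, and all numbers are illustrative assumptions, not the paper's formulation.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def expected_info_gain(prior, likelihoods):
    """Expected entropy reduction about the hypothesis after one observation.

    prior[h] is P(h); likelihoods[h][o] is P(o | h) under the chosen action.
    """
    n_hyp, n_obs = len(prior), len(likelihoods[0])
    gain = 0.0
    for o in range(n_obs):
        p_o = sum(prior[h] * likelihoods[h][o] for h in range(n_hyp))
        if p_o == 0:
            continue
        posterior = [prior[h] * likelihoods[h][o] / p_o for h in range(n_hyp)]
        gain += p_o * (entropy(prior) - entropy(posterior))
    return gain

def select_action(prior, actions, beta):
    """Pick the action minimizing EFE = -(pragmatic + beta * epistemic)."""
    def efe(name):
        a = actions[name]
        return -(a["reward"] + beta * expected_info_gain(prior, a["likelihoods"]))
    return min(actions, key=efe)

prior = [0.5, 0.5]
actions = {
    # Higher reward, but the observation says nothing about the hypothesis.
    "exploit": {"reward": 1.0, "likelihoods": [[0.5, 0.5], [0.5, 0.5]]},
    # Lower reward, but the observation fully reveals the hypothesis.
    "explore": {"reward": 0.7, "likelihoods": [[1.0, 0.0], [0.0, 1.0]]},
}
low_curiosity = select_action(prior, actions, beta=0.1)
high_curiosity = select_action(prior, actions, beta=1.0)
```

In this toy setup a small `beta` selects the exploiting action and a larger `beta` selects the informative one, which mirrors the abstract's point that too little curiosity can leave uncertainty unresolved.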

Language Models and Logic Programs for Trustworthy Tax Reasoning

Authors: William Jurayj, Nils Holzenberger, Benjamin Van Durme

Venue: AAAI 2026

First: 2025-08-28T17:55:07+00:00 · Latest: 2026-02-05T18:58:31+00:00

Comments: Accepted to AAAI 2026

Abs · PDF · Code1 · Code2

Abstract

According to the United States Internal Revenue Service, "the average American spends $270 and 13 hours filing their taxes". Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the effectiveness of applying semantic parsing methods to statutory reasoning, and show promising economic feasibility of neuro-symbolic architectures for increasing access to reliable tax assistance.

中文标题/摘要

标题：语言模型与逻辑程序在可信税务推理中的应用

根据美国国内收入局的数据，“平均美国人填写税务申报表花费270美元和13小时”。即使在美国之外，税务申报也需要复杂的推理，结合应用重叠规则和数值计算。由于错误可能会导致高昂的罚款，任何自动化系统都必须提供高准确性和可审计性，使得现代大型语言模型（LLMs）不适合此任务。我们提出了一种将LLMs与符号求解器集成的方法，以计算税务义务。我们使用具有挑战性的StAtutory Reasoning Assessment (SARA)数据集评估了该系统的变体，并提出了一种基于实际税务错误罚款的新方法来估算部署此类系统的成本。我们还展示了如何通过将文本规则预先翻译成形式逻辑程序，并结合智能检索的形式案例表示示例，可以显著提高此任务的性能，并将成本降低到远低于实际平均水平。我们的结果表明，应用语义解析方法进行法规推理的有效性，并展示了神经符号架构在提高可靠税务援助可及性方面的经济可行性。

Summary / 总结

This research aims to make automated tax calculation accurate and auditable by integrating large language models (LLMs) with a symbolic solver. The study evaluates system variants on the SARA dataset and introduces a novel method for estimating deployment cost from real-world penalties for tax errors. Key findings show that translating plain-text rules into formal logic programs up front, combined with intelligently retrieved exemplars for formal case representations, significantly improves performance and reduces costs to below real-world averages, highlighting the potential of neuro-symbolic architectures for reliable tax assistance.

研究旨在通过将大型语言模型与符号求解器结合，解决税务申报中复杂性和高错误率的问题，特别是在美国。研究在SARA数据集上评估了所提出的方法，并引入了一种估算部署成本的新方法。关键发现表明，将文本规则翻译成形式逻辑程序并与检索到的案例示例相结合，可以显著提高性能并降低成本，使其低于现实世界平均水平，展示了神经符号架构在提高可靠税务援助方面的潜力。
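
The division of labor described above can be sketched as follows: a language model would only extract structured facts, a deterministic rule program computes the obligation, and deployment cost is estimated from error penalties. Everything here is a hypothetical illustration: the `Case` schema, the toy statute in `tax_owed`, the abstention and fallback terms in `expected_cost`, and all numbers are assumptions, not the paper's rules or cost model.

```python
from dataclasses import dataclass

@dataclass
class Case:
    """Structured facts a parser would be asked to extract (hypothetical schema)."""
    gross_income: float
    dependents: int

def tax_owed(case: Case) -> float:
    """Toy statute encoded as deterministic rules. This is NOT real tax law."""
    deduction = 1000.0 * case.dependents           # hypothetical per-dependent deduction
    taxable = max(0.0, case.gross_income - deduction)
    return round(0.10 * taxable, 2)                # hypothetical flat 10% rate

def expected_cost(p_wrong: float, penalty: float,
                  p_abstain: float, fallback_fee: float) -> float:
    """Expected cost per return: penalties on wrong answers plus a paid fallback on abstentions."""
    p_answered = 1.0 - p_abstain
    return p_answered * p_wrong * penalty + p_abstain * fallback_fee

owed = tax_owed(Case(gross_income=50_000.0, dependents=2))
cost = expected_cost(p_wrong=0.05, penalty=400.0, p_abstain=0.10, fallback_fee=270.0)
```

Because the rule program is ordinary code, every intermediate value (deduction, taxable income) can be inspected, which is the auditability argument in miniature.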

Context Forcing: Consistent Autoregressive Video Generation with Long Context

Authors: Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen

First: 2026-02-05T18:58:01+00:00 · Latest: 2026-02-05T18:58:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.

中文标题/摘要

标题：上下文强制：使用长上下文的一致自回归视频生成

近期的实时长视频生成方法通常采用流式调优策略，试图通过短上下文（无记忆）教师训练一个长上下文学生。在这些框架中，学生进行长时间的展开，但仅能从短至5秒的窗口中获得监督。这种结构上的不匹配导致了一个关键的学生-教师不匹配：由于教师无法访问长期历史，它无法引导学生学习全局时间依赖性，从而限制了学生能够使用的上下文长度。为了解决这一问题，我们提出了一种名为上下文强制的新框架，通过长上下文教师训练长上下文学生。通过确保教师了解完整的生成历史，我们消除了监督不匹配，使模型能够稳健地训练并实现长期一致性。为了使这种计算在极端持续时间（例如2分钟）下可行，我们引入了一种上下文管理系统，将线性增长的上下文转换为慢速-快速记忆架构，显著减少了视觉冗余。大量实验结果表明，我们的方法能够实现超过20秒的有效上下文长度——比LongLive和Infinite-RoPE等最先进的方法长2到10倍。通过利用这种扩展的上下文，上下文强制能够保持在长时间内的一致性，超越各种长视频评估指标上的最先进的基线方法。

Summary / 总结

The paper addresses the issue of student-teacher mismatch in real-time long video generation by proposing Context Forcing, a framework that trains a long-context student using a long-context teacher. This approach ensures the teacher has access to the full generation history, eliminating the supervision mismatch and enabling robust training for long-term consistency. The method introduces a Slow-Fast Memory architecture to manage the context, making it computationally feasible for long durations. Experimental results show that Context Forcing can achieve effective context lengths exceeding 20 seconds, surpassing state-of-the-art methods like LongLive and Infinite-RoPE in long video generation metrics.

研究旨在解决现有方法中长期上下文需求与短期监督之间的不匹配问题。提出的Context Forcing框架使用长上下文教师来引导长上下文学生，消除监督不匹配。这通过一种慢速-快速记忆架构高效管理上下文，使得有效上下文长度超过20秒。实验结果表明，Context Forcing在各种长视频评估指标上优于LongLive和Infinite-RoPE等最先进的基线方法，在长期一致性方面表现更优。
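
One way to picture a context that does not grow linearly is sketched below: recent frames are kept densely while older frames are subsampled. This is only an assumed reading of "Slow-Fast Memory" for illustration. The function name `slow_fast_context` and the `fast_window` and `slow_stride` parameters are hypothetical, and the paper's actual mechanism may compress history differently.

```python
def slow_fast_context(num_frames, fast_window, slow_stride):
    """Return the indices of frames kept in context.

    The most recent `fast_window` frames are kept densely ("fast" memory);
    older frames are kept only every `slow_stride` frames ("slow" memory).
    """
    fast_start = max(0, num_frames - fast_window)
    slow = list(range(0, fast_start, slow_stride))
    fast = list(range(fast_start, num_frames))
    return slow + fast

# 100 generated frames reduce to 7 sparse old frames plus 16 dense recent ones.
kept = slow_fast_context(num_frames=100, fast_window=16, slow_stride=12)
```

With these settings the retained context grows by roughly one frame per `slow_stride` new frames instead of one per frame, which is the kind of redundancy reduction the abstract refers to.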

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Authors: Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang

First: 2026-02-05T18:57:09+00:00 · Latest: 2026-02-05T18:57:09+00:00

Comments: Code is available at https://github.com/ViktorAxelsen/BudgetMem

Abs · PDF · Code1 · Code2 · Code3

Abstract

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

中文标题/摘要

标题：学习查询感知预算层级路由以运行时代理内存

内存对于大型语言模型（LLM）代理在单个上下文窗口之外运行变得越来越关键，但大多数现有系统依赖于离线、查询无关的内存构建，这可能效率低下并可能丢弃查询关键信息。尽管运行时内存利用是一个自然的替代方案，但先前的工作往往会产生大量开销，并且对性能成本权衡的控制有限。在本文中，我们提出了**BudgetMem**，这是一种运行时代理内存框架，用于明确、查询感知的性能成本控制。BudgetMem 将内存处理结构化为一组内存模块，每个模块提供三个预算层级（即**低**/**中**/**高**）。一个轻量级的路由器在模块之间执行预算层级路由，以平衡任务性能和内存构建成本，这通过强化学习训练的紧凑神经策略实现。使用BudgetMem作为统一的测试平台，我们研究了三种互补的预算层级实现策略：实现（方法复杂度）、推理（推理行为）和容量（模块模型大小）。在LoCoMo、LongMemEval和HotpotQA中，当优先考虑性能（即高预算设置）时，BudgetMem超越了强大的基线，并在更紧的预算下提供了更好的准确度成本前沿。此外，我们的分析将不同层级策略的优势和劣势分离开来，阐明了在不同预算条件下，每个轴在提供最有利权衡时的表现。

Summary / 总结

BudgetMem is a runtime agent memory framework designed for explicit, query-aware performance-cost control in Large Language Model (LLM) agents. It structures memory processing as a set of memory modules, each offered in three budget tiers (Low/Mid/High), and uses a lightweight router, implemented as a compact neural policy trained with reinforcement learning, to choose a tier for each module. Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem outperforms strong baselines in high-budget settings and provides better accuracy-cost trade-offs under tighter budgets. The analysis also clarifies the strengths and weaknesses of the three tiering strategies (implementation, reasoning, and capacity) under varying budget regimes.

BudgetMem 是一个允许对性能和成本进行显式、查询感知控制的运行时代理内存框架。它将内存处理结构化为三个预算层级，并使用一个轻量级路由器在不同层级的内存模块之间路由查询。BudgetMem 在高预算设置下优于强基线，并在更紧的预算下提供了更好的准确性和成本折衷。研究还分析了不同层级策略在不同预算条件下的优缺点。
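
The routing idea can be sketched as a per-module choice that trades predicted utility against tier cost. The linear scorer below is a stand-in for the learned neural policy described in the abstract. The module names (`filter`, `summarize`), the tier costs, the feature vector, and the `cost_weight` parameter are all hypothetical.

```python
TIERS = ("low", "mid", "high")
TIER_COST = {"low": 1.0, "mid": 3.0, "high": 9.0}  # hypothetical relative costs

def route(query_features, weights, cost_weight):
    """Choose one tier per module by maximizing predicted utility minus weighted cost.

    weights[module][tier] holds coefficients of a linear utility score.
    """
    plan = {}
    for module, per_tier in weights.items():
        def score(tier, per_tier=per_tier):
            utility = sum(w * x for w, x in zip(per_tier[tier], query_features))
            return utility - cost_weight * TIER_COST[tier]
        plan[module] = max(TIERS, key=score)
    return plan

weights = {
    "filter":    {"low": [0.2, 0.0], "mid": [0.5, 0.1], "high": [0.6, 0.2]},
    "summarize": {"low": [0.1, 0.1], "mid": [0.3, 0.8], "high": [0.4, 3.0]},
}
query = [1.0, 1.0]  # hypothetical query features
tight_budget = route(query, weights, cost_weight=1.0)
loose_budget = route(query, weights, cost_weight=0.05)
```

Raising `cost_weight` pushes every module toward the cheapest tier, while lowering it lets modules with a large predicted benefit move up, so different modules can end up on different tiers for the same query.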

Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering

Authors: Miranda Muqing Miao, Young-Min Cho, Lyle Ungar

First: 2026-02-05T18:55:56+00:00 · Latest: 2026-02-05T18:55:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.

中文标题/摘要

标题：CORAL（正确性优化的残差激活透镜）：可移植且校准意识的推理时导向

大型语言模型（LLMs）在指令调优和偏好对齐后表现出持续的校准不足。修改后的训练目标可以改善校准，但重新训练成本高昂。推理时导向提供了一种轻量级的替代方案，但大多数现有方法优化的是正确性的代理指标而非正确性本身。我们引入了CORAL（正确性优化的残差激活透镜），这是一种正则化推理时导向方法，通过权重衰减MLP探针捕捉模型内部激活中的分布式正确性信号。我们在三个7B参数模型上评估了CORAL，发现它在平均情况下将准确率提高了10%，预期校准误差（ECE）降低了50%。我们还展示了这些增益在无需重新训练的情况下转移到四个保留基准测试的完整发布测试集（ARC-Challenge、HellaSwag、Math-MC、OpenBookQA）上，平均准确率提高了14%，ECE降低了49%。我们的结果支持了这样一个假设：当单个神经元不足时，可以使用正则化探针从模型内部提取分布式信息。因此，CORAL提供了一种计算高效、可移植且校准意识的方法，以提高推理时的多项选择题问答性能。

Summary / 总结

CORAL is a regularized inference-time steering method that enhances the accuracy and calibration of large language models by optimizing for correctness directly using weight-decay MLP probes. It improves accuracy by 10% and expected calibration error (ECE) by 50% on average across three 7B-parameter models. These improvements transfer to four held-out benchmarks without retraining, achieving an average 14% accuracy increase and 49% ECE reduction. This suggests that distributed information in model internals can be effectively extracted using regularized probes, providing a compute-efficient and transferable approach to improve multiple-choice question answering performance during inference.

论文介绍了CORAL，一种正则化推理时校正方法，通过使用权重衰减MLP探针捕捉模型内部激活的分布式正确性信号来提升大型语言模型的准确性和校准度。在三个7B参数模型上，CORAL将准确率提高了10%，预期校准误差（ECE）降低了50%。这些改进无需重新训练即可转移到四个保留的基准测试集上，平均提高了14%的准确率和49%的ECE。该方法计算效率高、可迁移且校准意识强，提供了一种轻量级的推理时改进MCQA性能的解决方案。
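
Since the results above are reported in terms of expected calibration error, a short sketch of the metric may help. It uses the common equal-width binning definition, in which ECE is the sample-weighted average gap between accuracy and mean confidence per bin. The binning scheme and the toy data are assumptions, and the paper may compute ECE with different settings.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (bin share) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - mean_conf)
    return ece

# Four overconfident answers (95% confidence, half correct) and two near-calibrated ones.
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hits = [True, True, False, False, True, False]
ece = expected_calibration_error(confs, hits)
```

Lower is better: a model whose stated confidence matches its accuracy in every bin scores 0.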

Diffusion Model's Generalization Can Be Characterized by Inductive Biases toward a Data-Dependent Ridge Manifold

Authors: Ye He, Yitong Qiu, Molei Tao

First: 2026-02-05T18:55:03+00:00 · Latest: 2026-02-05T18:55:03+00:00

Abs · PDF · Code1 · Code2

Abstract

When a diffusion model is not memorizing the training data set, how exactly does it generalize? A quantitative understanding of the distribution it generates would be beneficial to, for example, an assessment of the model's performance for downstream applications. We thus explicitly characterize what a diffusion model generates, by proposing a log-density ridge manifold and quantifying how the generated data relate to this manifold as the inference dynamics progress. More precisely, inference undergoes a reach-align-slide process centered around the ridge manifold: trajectories first reach a neighborhood of the manifold, then align as they are pushed toward or away from the manifold in normal directions, and finally slide along the manifold in tangent directions. Within the scope of this general behavior, different training errors lead to different normal and tangent motions, which can be quantified, and these detailed motions characterize when inter-mode generations emerge. A more detailed understanding of training dynamics leads to more accurate quantification of the generation inductive bias. As an example, we consider a random feature model, for which we can explicitly illustrate how a diffusion model's inductive biases originate as a composition of architectural bias and training accuracy, and how they evolve with the inference dynamics. Experiments on synthetic multimodal distributions and MNIST latent diffusion support the predicted directional effects in both low and high dimensions.

中文标题/摘要

标题：扩散模型的泛化可以由数据依赖的岭流形上的归纳偏置来表征

当扩散模型不记忆训练数据集时，它如何泛化？对其生成分布的定量理解将有助于例如下游应用中模型性能的评估。因此，我们通过提出对数密度岭流形并量化生成数据与该流形的关系来明确表征扩散模型的生成内容。更具体地说，推理过程围绕着岭流形进行“接近-对齐-滑动”过程：轨迹首先接近流形的邻域，然后在法向方向上被推向或远离流形进行对齐，最后在切向方向上沿着流形滑动。在这一总体行为的范围内，不同的训练误差将导致不同的法向和切向运动，这些运动可以被量化，并且这些详细的运动表征了跨模态生成何时出现。对训练动力学更详细的理解将导致对生成归纳偏置更准确的量化，我们将考虑一个随机特征模型的例子，其中可以明确展示扩散模型的归纳偏置如何作为架构偏置和训练准确性组成的组合而起源，并且如何随着推理动力学的发展而演变。在合成多模态分布和MNIST潜在扩散上的实验支持了预测的方向性效应，在低维和高维空间中均是如此。

Summary / 总结

This study investigates how diffusion models generalize by proposing a log-density ridge manifold and analyzing the inference dynamics. The model's inference process is described as a reach-align-slide mechanism centered around the ridge manifold. Different training errors result in distinct normal and tangent motions, which can be quantified to understand inter-mode generations. Experiments on synthetic and MNIST data support the directional effects predicted by the model.

研究旨在理解当扩散模型不记忆训练数据集时，它如何进行泛化。通过提出一个对数密度岭流形，研究量化了生成数据与该流形的关系。推理过程被描述为一个接近-对齐-滑动机制，其中数据轨迹首先接近流形，然后沿法向量方向对齐，最后沿切向量方向滑动。不同的训练误差会导致不同的法向量和切向量运动，这些可以被量化并解释跨模式生成。研究结果支持在合成数据和MNIST潜在扩散实验中预测的方向性效果。

MambaVF: State Space Model for Efficient Video Fusion

Authors: Zixiang Zhao, Yukun Cui, Lilun Deng, Haowen Bai, Haotong Qin, Tao Feng, Konrad Schindler

First: 2026-02-05T18:53:47+00:00 · Latest: 2026-02-05T18:53:47+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing parameters by up to 92.25% and computational FLOPs by up to 88.79%, with a 2.1x speedup compared to existing methods. Project page: https://mambavf.github.io

中文标题/摘要

标题：MambaVF：基于状态空间模型的高效视频融合框架

视频融合是各种视频处理任务中的基本技术。然而，现有的视频融合方法严重依赖于光流估计和特征扭曲，导致了巨大的计算开销和有限的可扩展性。本文提出了一种基于状态空间模型（SSM）的高效视频融合框架MambaVF，该框架在无需显式运动估计的情况下进行时间建模。首先，通过将视频融合重新表述为一个顺序状态更新过程，MambaVF以线性复杂度捕获了长程时间依赖性，同时显著减少了计算和内存成本。其次，MambaVF提出了一种轻量级的基于SSM的融合模块，该模块通过时空双向扫描机制替代了传统的流引导对齐，从而实现了跨帧的高效信息聚合。在多个基准上的广泛实验表明，我们的MambaVF在多曝光、多焦点、红外可见和医学视频融合任务中达到了最先进的性能。我们强调MambaVF具有高效率，参数减少了高达92.25%，计算FLOPs减少了88.79%，并且比现有方法快2.1倍。项目页面：https://mambavf.github.io

Summary / 总结

MambaVF is an efficient video fusion framework that reformulates video fusion as a sequential state update process, using state space models (SSMs) to capture long-range temporal dependencies with linear complexity and without explicit motion estimation. A lightweight SSM-based fusion module with spatio-temporal bidirectional scanning replaces conventional flow-guided alignment. Experiments show that MambaVF achieves state-of-the-art results on multi-exposure, multi-focus, infrared-visible, and medical video fusion, with up to 92.25% fewer parameters, 88.79% fewer FLOPs, and a 2.1x speedup compared to existing methods.

MambaVF 是一种高效的视频融合框架，通过使用状态空间模型（SSMs）将视频融合重新表述为一个顺序状态更新过程，从而减少计算开销和内存使用。它引入了一个轻量级的 SSM 基础融合模块，取代了传统的基于流的对齐方式，实现了跨帧的有效信息聚合。实验表明，MambaVF 在多种视频融合任务中表现出色，并且最多可减少 92.25% 的参数和 88.79% 的计算 FLOPs，速度提升 2.1 倍。
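
The "sequential state update" with linear complexity can be sketched with a scalar linear recurrence scanned in both temporal directions. This is a deliberately simplified illustration with fixed scalar parameters `a`, `b`, `c` (all hypothetical). Mamba-style models use learned, input-dependent parameters, and the paper's scan is spatio-temporal rather than purely temporal.

```python
def ssm_scan(xs, a, b, c):
    """Scalar linear state space recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h, ys = 0.0, []
    for x in xs:  # one pass over the sequence, so cost is linear in its length
        h = a * h + b * x
        ys.append(c * h)
    return ys

def bidirectional_scan(xs, a=0.5, b=1.0, c=1.0):
    """Sum a forward scan and a time-reversed scan so each step sees past and future."""
    forward = ssm_scan(xs, a, b, c)
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]

frames = [1.0, 0.0, 0.0, 0.0]  # one toy feature value per frame
out = bidirectional_scan(frames)  # [2.0, 0.5, 0.25, 0.125]
```

The impulse in the first frame decays geometrically through later frames, which shows how information propagates across time without any explicit motion estimate or pairwise frame comparison.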

A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

Authors: Panagiotis Kaliosis, Adithya V Ganesan, Oscar N. E. Kjell, Whitney Ringwald, Scott Feltman, Melissa A. Carr, Dimitris Samaras, Camilo Ruggero, Benjamin J. Luft, Roman Kotov, Andrew H. Schwartz

First: 2026-02-05T18:53:17+00:00 · Latest: 2026-02-05T18:53:17+00:00

Comments: 18 pages, 3 figures, 5 tables

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge of what factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge such as subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs. few-shot prompting, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, Deepseek) plateaus beyond 70B parameters, while closed-weight models (o3-mini, gpt-5) improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest that the choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.

中文标题/摘要

标题：大型语言模型在 PTSD 严重程度估计中的系统评估：背景知识和建模策略的作用

大型语言模型（LLMs）越来越多地以零样本方式用于评估心理健康状况，但我们对影响其准确性的因素知之甚少。本研究利用包含1,437名个体自然语言叙述和自我报告的PTSD严重程度评分的临床数据集，全面评估了11种最先进的LLM的性能。为了理解影响准确性的因素，我们系统地变化了（i）背景知识，如子量表定义、分布摘要和访谈问题，以及（ii）建模策略，包括零样本与少量样本、推理努力程度、模型大小、结构化子量表与直接标量预测、输出重新缩放和九种集成方法。我们的研究结果表明：（a）当LLMs获得详细的构念定义和叙述背景时，其准确性最高；（b）增加推理努力程度可以提高估计准确性；（c）开放权重模型（Llama, Deepseek）在超过700亿参数后性能趋于平稳，而封闭权重（o3-mini, gpt-5）模型随着新版本的推出而性能提升；（d）当监督模型与零样本LLM集成时，可以获得最佳性能。综上所述，结果表明选择背景知识和建模策略对于部署LLMs以准确评估心理健康状况至关重要。

Summary / 总结

This study evaluates the performance of 11 state-of-the-art large language models (LLMs) in estimating PTSD severity using a clinical dataset. By varying contextual knowledge and modeling strategies, the research finds that detailed construct definitions and increased reasoning effort enhance accuracy. Open-weight models plateau beyond 70B parameters, while closed-weight models improve with newer generations. Ensemble methods combining supervised models with zero-shot LLMs yield the best results, highlighting the importance of contextual knowledge and modeling strategies for accurate mental health assessment with LLMs.

本研究评估了11个最先进的大型语言模型（LLMs）在使用临床数据估计PTSD严重程度方面的性能。通过改变上下文知识和建模策略，研究发现详细的构念定义和增加推理努力可以提高准确性。开放权重模型在超过70B参数后达到饱和，而封闭权重模型则随着新版本的推出而改进。结合监督模型和零样本LLM的集成方法能获得最佳效果，强调了上下文知识和建模策略对于使用LLM准确评估心理健康的重要性。

GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

Authors: Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang

First: 2026-02-05T18:52:48+00:00 · Latest: 2026-02-05T18:52:48+00:00

Comments: Project Page: https://genarena.github.io/, Code: https://github.com/ruihanglix/genarena

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.

中文标题/摘要

标题：GenArena：我们如何实现视觉生成任务的人类对齐评估？

视觉生成模型的快速发展已经超越了传统的评估方法，迫切需要采用视觉语言模型作为替代的评判者。在本文中，我们系统地研究了当前广泛使用的绝对点对点评分标准在各种视觉生成任务中的可靠性。我们的分析表明，这种范式由于随机不一致性和与人类感知的不良对齐而受到限制。为了解决这些限制，我们引入了GenArena，这是一种统一的评估框架，利用成对比较范式确保稳定和人类对齐的评估。关键的是，我们的实验揭示了一个变革性的发现，即仅采用这种成对协议即可使现成的开源模型超越顶级专有模型。值得注意的是，我们的方法将评估准确性提高了超过20%，并与权威的LMArena排行榜获得了0.86的斯皮尔曼相关性，远远超过了点对点方法的0.36相关性。基于GenArena，我们对多种视觉生成模型进行了基准测试，为视觉生成提供了一个严格且自动化的评估标准。

Summary / 总结

This study addresses the limitations of traditional absolute pointwise scoring in evaluating visual generation models, proposing GenArena, a pairwise comparison framework that enhances evaluation reliability and alignment with human perception. Experiments show that GenArena significantly improves evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, outperforming pointwise methods by a wide margin.

本文针对传统绝对点评分在评估视觉生成模型方面的局限性，这些模型已迅速发展。作者引入了GenArena，这是一种成对比较框架，以确保更稳定和符合人类感知的评估。实验表明，GenArena显著提高了评估准确性，超过20%，并与权威的LMArena排行榜实现了0.86的Spearman相关性，远超点评分方法的0.36相关性。此外，这种方法还使开源模型超越了顶级专有模型在视觉生成任务中的表现。
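
A minimal sketch of the pairwise protocol's bookkeeping: pairwise verdicts are aggregated into per-model scores, and agreement with a reference leaderboard is measured with Spearman rank correlation. The win-rate aggregation is a simplification chosen for brevity (arena-style leaderboards typically fit rating models instead), and the battle list and reference ratings are made up for illustration.

```python
from collections import defaultdict

def win_rates(battles):
    """battles: list of (winner, loser) pairs produced by a pairwise judge."""
    wins, games = defaultdict(int), defaultdict(int)
    for winner, loser in battles:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {m: wins[m] / games[m] for m in games}

def ranks(scores):
    """Rank 1 is best. Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {m: i + 1 for i, m in enumerate(ordered)}

def spearman(scores_a, scores_b):
    """Spearman rank correlation for tie-free scores over the same set of models."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))

battles = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"),
           ("C", "B"), ("A", "D"), ("B", "D"), ("C", "D")]
judge_scores = win_rates(battles)
reference = {"A": 1300, "B": 1260, "C": 1250, "D": 1100}  # hypothetical leaderboard ratings
rho = spearman(judge_scores, reference)
```

Here the judge swaps the middle two models relative to the reference, giving a correlation of 0.8. The same computation over real judge verdicts is what a figure such as the reported 0.86 summarizes.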

AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Authors: Xianyang Liu, Shangding Gu, Dawn Song

First: 2026-02-05T18:50:36+00:00 · Latest: 2026-02-05T18:50:36+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product-dependent valuations, and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at the link: https://github.com/SafeRL-Lab/AgenticPay.

中文标题/摘要

标题：AgenticPay：多智能体LLM谈判系统用于买家卖家交易

基于大型语言模型（LLM）的代理越来越多地被期望自主谈判、协调和交易，但现有的基准测试缺乏评估语言中介的多智能体经济互动的规范性设置。我们引入了AgenticPay，这是一种用于由自然语言驱动的多智能体买家卖家谈判的基准测试和仿真框架。AgenticPay 模拟了买家和卖家拥有私人约束和产品依赖价值的市场，并且必须通过多轮语言谈判达成协议，而不仅仅是数字竞价。该框架支持超过110项任务的多样化套件，从双边讨价还价到多对多市场，具有结构化的行动提取和可行性、效率和福利的度量标准。对最先进的专有和开源权重LLM的基准测试揭示了谈判表现的巨大差距，并突显了长期战略推理的挑战，确立了AgenticPay作为研究代理商业和语言驱动的市场互动的基础。代码和数据集可在链接处获取：https://github.com/SafeRL-Lab/AgenticPay。

VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation

Authors: Jie Deng, Kaichun Yao, Libo Zhang

First: 2026-02-05T18:45:53+00:00 · Latest: 2026-02-05T18:45:53+00:00

Abs · PDF · Code1 · Code2

Abstract

Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.

中文标题/摘要

标题：VisRefiner: 从视觉差异中学习的屏幕截图到代码生成

屏幕截图到代码生成旨在将用户界面屏幕截图转换为忠实再现目标布局和样式的可执行前端代码。现有的多模态大型语言模型直接从屏幕截图进行这种映射，但它们的训练过程中没有观察到生成代码的视觉结果。相比之下，人类开发人员会迭代地渲染他们的实现，将其与设计进行比较，并学习视觉差异如何与代码更改相关联。受此过程的启发，我们提出了一种训练框架VisRefiner，使模型能够从渲染预测与参考设计之间的视觉差异中学习。我们构建了差异对齐的监督，将视觉差异与相应的代码编辑关联起来，使模型能够理解外观变化是如何由实现更改引起的。在此基础上，我们引入了一种强化学习阶段进行自我完善，其中模型通过观察渲染输出和目标设计之间的视觉差异，并相应地更新代码来改进其生成的代码。实验表明，VisRefiner 显著提高了单步生成质量和布局保真度，同时赋予模型强大的自我完善能力。这些结果表明，从视觉差异中学习对于推进屏幕截图到代码生成的有效性。

Summary / 总结

VisRefiner is a training framework that enables models to learn from visual differences between rendered predictions and reference designs, improving screenshot-to-code generation quality and layout fidelity. It uses difference-aligned supervision to associate visual discrepancies with code edits and introduces a reinforcement learning stage for self-refinement, allowing the model to improve its generated code by observing visual differences. Experiments show that VisRefiner significantly enhances single-step generation and self-refinement capabilities.

VisRefiner 是一种训练框架，通过将渲染预测与参考设计之间的视觉差异与代码编辑关联起来，提高截图到代码生成的质量和布局准确性。它使用差异对齐的监督来关联视觉差异与代码编辑，并引入了一个自改进阶段，使模型能够通过观察视觉差异来改进生成的代码。实验表明，VisRefiner 显著提高了单步生成质量和自改进能力。
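
To make "visual differences" concrete, the sketch below locates the region where a rendered prediction and a reference design disagree, which is the kind of signal that could be paired with a code edit. Treating images as small grids of pixel values and reporting a single bounding box are simplifying assumptions. The function `diff_bbox` is hypothetical and is not the paper's difference-aligned supervision.

```python
def diff_bbox(rendered, target):
    """Bounding box (top, left, bottom, right) of differing pixels, or None if identical.

    Both images are equal-sized lists of rows of pixel values.
    """
    coords = [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(rendered, target))
        for c, (p, q) in enumerate(zip(row_a, row_b))
        if p != q
    ]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))

rendered = [[0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]
target = [[0, 0, 0, 0],
          [0, 0, 1, 0],
          [0, 0, 0, 1]]
box = diff_bbox(rendered, target)  # (1, 2, 2, 3)
```

A refinement loop could render the generated code, compute such a region against the design, and condition the next edit on it.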

Transmuting prompts into weights

Authors: Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo

First: 2025-10-09T18:40:39+00:00 · Latest: 2026-02-05T18:44:09+00:00

Abs · PDF · Code1 · Code2

Abstract

A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to token-dependent implicit weight updates (Dherin et al., 2025), we derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally grounded method for transmuting textual input into reusable weight updates.

中文标题/摘要

标题：将提示转化为权重

越来越多的研究表明，可以通过直接修改大型语言模型的内部状态，在推理时有效控制其行为，这可以通过对其激活进行向量添加或更新其权重矩阵来实现。虽然这些技术非常强大，但它们通常由经验性启发式方法指导，例如从对比提示的平均激活中推导出引导向量。这项工作为这些干预措施提供了理论基础，解释了它们如何源自变压器架构的基本计算。基于最近发现的提示影响可以数学映射到与标记相关的隐式权重更新（Dherin等人，2025年），我们推导出一种原理性的方法，将这些信息凝练成与标记无关的思想向量和思想矩阵。这些构造为现有的向量和矩阵为基础的模型编辑技术提供了理论解释，并提供了一种直接且计算上可验证的方法，将文本输入转化为可重用的权重更新。

Summary / 总结

This research aims to provide a theoretical foundation for controlling the behavior of large language models at inference time by modifying their internal states. It builds on the recent finding that a prompt's influence can be mapped to token-dependent implicit weight updates, and derives a principled method for condensing that influence into token-independent thought vectors and thought matrices. These constructs give a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally grounded way to transmute textual input into reusable weight updates.

该研究旨在为通过修改大型语言模型的内部状态来在其推理过程中控制其行为提供理论基础。它基于最近的发现，推导出一种将提示影响凝练为与标记无关的思想向量和矩阵的方法，提供了一种直接且计算上可行的方法，将文本输入转化为可重用的权重更新。关键实验结果表明，这些方法能够有效地控制模型行为，而无需依赖经验性启发式方法。
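
For context, the empirical heuristic that the abstract mentions, deriving a steering vector from the average activations of contrastive prompts, can be sketched as a difference of means added to a hidden activation. This illustrates only that baseline heuristic, not the paper's derived thought vectors or thought matrices. The toy activations and the `alpha` scale are assumptions.

```python
def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations between two contrastive prompt sets."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def steer(activation, vector, alpha=1.0):
    """Add the scaled steering vector to a hidden activation."""
    return [a + alpha * v for a, v in zip(activation, vector)]

pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # toy activations for prompts showing the behavior
neg = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]  # toy activations for prompts without it
v = steering_vector(pos, neg)             # [2.0, 0.0, -1.0]
steered = steer([0.5, 0.5, 0.5], v, alpha=0.5)
```

The paper's contribution, as summarized above, is a derivation of where such additive interventions come from, rather than this averaging recipe itself.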

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

Authors: Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz

First: 2026-02-05T18:42:00+00:00 · Latest: 2026-02-05T18:42:00+00:00

Abstract

Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose "Diamond Maps", stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.

中文标题/摘要

标题：钻石地图：通过随机流图高效实现奖励对齐

流和扩散模型生成高质量样本，但在训练后适应用户偏好或约束仍然成本高昂且脆弱，这一挑战通常被称为奖励对齐。我们认为，高效的奖励对齐应该是生成模型本身的特性，而不是事后考虑的问题，并重新设计了模型以提高适应性。我们提出了“钻石地图”，一种随机流图模型，能够在推理时高效且准确地对齐到任意奖励。钻石地图将许多模拟步骤合并为单步采样器，类似于流图，同时保留了实现最优奖励对齐所需的随机性。这种设计使得搜索、顺序蒙特卡洛和引导变得可扩展，因为它们能够高效且一致地估计价值函数。我们的实验表明，钻石地图可以通过从GLASS流中蒸馏学习，实现更强的奖励对齐性能，并且比现有方法更具可扩展性。我们的结果指出了生成模型在推理时能够快速适应任意偏好和约束的实际途径。

Summary / 总结

The research aims to address the challenge of reward alignment in generative models, which is costly and brittle after training. The authors propose Diamond Maps, a type of stochastic flow map model, to enable efficient and accurate reward alignment at inference time. Diamond Maps combine the efficiency of flow maps with the necessary stochasticity for optimal reward alignment, making search and guidance scalable. Experiments show that Diamond Maps can be learned efficiently from GLASS Flows, perform better in reward alignment, and scale better than existing methods.

研究旨在解决生成模型中的奖励对齐问题，该问题在后训练阶段进行时成本高且脆弱。作者提出了Diamond Maps，这是一种随机流图模型，能够在推理时高效且准确地进行奖励对齐。Diamond Maps结合了流图的高效性和必要的随机性以实现最优的奖励对齐，从而使搜索和引导变得可扩展。实验表明，Diamond Maps可以从GLASS Flows高效学习，优于现有方法的奖励对齐性能，并且具有更好的可扩展性。
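To make the role of stochasticity in value estimation concrete, here is a toy Monte Carlo sketch. It assumes the soft value function form log E[exp(r(x_1)) | x_t], which is common in reward alignment but is not stated in the abstract, and it uses a hypothetical noisy one-step sampler and reward in place of a trained model.

```python
import math
import random

def one_step_sampler(x_t, rng):
    """Hypothetical stochastic one-step map: a noisy jump from x_t to a final sample."""
    return x_t + rng.gauss(0.0, 0.5)

def reward(x):
    """Hypothetical reward that prefers samples near 2.0."""
    return -(x - 2.0) ** 2

def soft_value(x_t, n_samples, rng):
    """Monte Carlo estimate of log E[exp(reward(x_1)) | x_t] (assumed value form)."""
    rewards = [reward(one_step_sampler(x_t, rng)) for _ in range(n_samples)]
    m = max(rewards)  # log-sum-exp shift for numerical stability
    return m + math.log(sum(math.exp(r - m) for r in rewards) / n_samples)

rng = random.Random(0)
candidates = [-1.0, 0.5, 2.0]
values = {x: soft_value(x, 256, rng) for x in candidates}
best = max(values, key=values.get)  # a search step keeps the highest-value state
```

A deterministic one-step map would return a single final sample per state, so the expectation above could not be estimated by repeated sampling; that is the sense in which the sketch depends on a stochastic sampler.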

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

Authors: Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang

First: 2026-02-05T18:41:38+00:00 · Latest: 2026-02-05T18:41:38+00:00

Abstract

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and show the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.

中文标题/摘要

标题：DSB: 动态滑动块调度算法用于扩散大语言模型

扩散大语言模型（dLLMs）已成为文本生成的一种有前途的替代方案，以其原生支持并行解码为特点。实际上，块推理对于避免全局双向解码中的顺序错位并提高输出质量至关重要。然而，广泛使用的固定预定义块（朴素）调度策略忽略了语义难度，使其在质量和效率方面都是次优策略：它可能会在不确定的位置上过早地做出承诺，同时推迟接近块边界的简单位置。在本文中，我们分析了朴素块调度的局限性，并揭示了根据语义难度动态调整调度以实现可靠和高效推理的重要性。受此启发，我们提出了动态滑动块（DSB），这是一种无需训练的块调度方法，使用动态大小的滑动块来克服朴素块的僵化。为了进一步提高效率，我们引入了DSB缓存，这是一种针对DSB设计的无需训练的KV缓存机制。在多个模型和基准上的广泛实验表明，DSB与DSB缓存一起，能够一致地提高dLLMs的生成质量和推理效率。代码已发布在https://github.com/lizhuo-luo/DSB。

Summary / 总结

This work addresses the limitations of fixed block scheduling in diffusion large language models (dLLMs) by proposing Dynamic Sliding Block (DSB), a training-free method that dynamically adjusts block size based on semantic difficulty. DSB, along with DSB Cache, a tailored KV-cache mechanism, enhances both generation quality and inference efficiency. Experiments across multiple models and benchmarks show consistent improvements over the naive block scheduling approach.

论文针对固定块调度在扩散大语言模型（dLLMs）中的局限性，提出了一种基于语义难度动态调整块大小的无训练方法Dynamic Sliding Block (DSB)。DSB通过避免过早承诺和延迟容易位置来提高生成质量和推理效率。此外，还引入了DSB Cache以进一步提高效率。实验结果表明，DSB及其缓存机制在多个模型和基准测试中均表现出一致的改进。
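The idea of a block whose size follows difficulty can be illustrated with a toy scheduler. This is not the DSB algorithm from the paper: the rule below (extend the block over following positions whose confidence clears a threshold) and the parameters `min_size`, `max_size`, and `threshold` are assumptions, and the confidences are held fixed, whereas a real dLLM would update them at every denoising step.

```python
def next_block(confidence, decoded, min_size=2, max_size=6, threshold=0.8):
    """Return (start, end) of the next block, end exclusive.

    The block starts at the first undecoded position, always covers
    min_size positions, and then extends over following positions while
    they look easy (confidence >= threshold), up to max_size.
    """
    n = len(confidence)
    start = next((i for i in range(n) if not decoded[i]), n)
    end = min(start + min_size, n)
    while end < n and end - start < max_size and confidence[end] >= threshold:
        end += 1
    return start, end

# Hypothetical per-position confidences for an 8-token sequence.
conf = [0.9, 0.95, 0.85, 0.9, 0.3, 0.4, 0.9, 0.9]
decoded = [False] * len(conf)
blocks = []
while not all(decoded):
    start, end = next_block(conf, decoded)
    blocks.append((start, end))
    for i in range(start, end):
        decoded[i] = True  # toy: commit the whole block at once
```

With these numbers the first block stops just before the low-confidence position 4 instead of at a fixed boundary, which is the kind of behavior a fixed schedule cannot express.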

Layer-wise LoRA fine-tuning: a similarity metric approach

Authors: Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao

First: 2026-02-05T18:38:53+00:00 · Latest: 2026-02-05T18:38:53+00:00

Comments: Code is available at https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA

Abstract

Pre-training Large Language Models (LLMs) on web-scale datasets has become fundamental for advancing general-purpose AI. In contrast, enhancing their predictive performance on downstream tasks typically involves adapting their knowledge through fine-tuning. Parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), aim to reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. In comparison to full fine-tuning, these methods achieve over 99% reduction in trainable parameter count, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address this problem by systematically selecting only a few layers to fine-tune using LoRA or its variants. We argue that not all layers contribute equally to the model adaptation. Leveraging this, we identify the most relevant layers to fine-tune by measuring their contribution to changes in internal representations. Our method is orthogonal to and readily compatible with existing low-rank adaptation techniques. We reduce the trainable parameters in LoRA-based techniques by up to 50%, while maintaining the predictive performance across different models and tasks. Specifically, on encoder-only architectures, this reduction in trainable parameters leads to a negligible predictive performance drop on the GLUE benchmark. On decoder-only architectures, we achieve a small drop, or even improvements, in predictive performance on mathematical problem-solving and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe competitive results relative to fine-tuning with LoRA modules in all layers. Code is available at: https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA

中文标题/摘要

标题：逐层LoRA微调：一种相似度度量方法

在互联网规模数据集上预训练大型语言模型（LLMs）已成为推动通用人工智能发展的基础。相比之下，通过微调来增强其在下游任务中的预测性能通常涉及调整其知识。参数高效微调技术，如低秩适应（LoRA），旨在通过冻结预训练模型并更新较少的参数来降低此过程的计算成本。与全微调相比，这些方法的可训练参数数量减少了超过99%，具体取决于配置。不幸的是，随着LLMs的规模不断扩大，这种减少可能变得不足。在本研究中，我们通过系统地选择仅微调几层来解决上述问题，使用LoRA或其变体。我们认为，并非所有层对模型适应的贡献都相等。利用这一点，我们通过测量它们对内部表示变化的贡献来识别最相关的层进行微调。我们的方法与现有的低秩适应技术是正交的，并且易于兼容。我们通过LoRA技术将可训练参数减少多达50%，同时在不同模型和任务上保持预测性能。具体而言，在仅编码器架构中，这种可训练参数的减少导致在GLUE基准测试上的预测性能下降可以忽略不计。在仅解码器架构中，我们实现了数学问题解决能力和编程任务上的小幅度下降或甚至改进。最后，这种方法也适用于多模态模型，在这些模型中，我们还观察到与在所有层使用LoRA模块进行微调相比具有竞争力的结果。代码可在：https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA

Summary / 总结

This study addresses the challenge of fine-tuning large language models (LLMs) by proposing a layer-wise LoRA fine-tuning method. The method selects a few critical layers for fine-tuning based on their contribution to internal representation changes, reducing the number of trainable parameters by up to 50% while maintaining or improving predictive performance across various models and tasks. On encoder-only architectures, there is a negligible performance drop on the GLUE benchmark, and on decoder-only architectures, there is a small drop or even improvements in mathematical problem-solving and coding tasks. The approach is compatible with existing low-rank adaptation techniques.

该研究提出了一种分层LoRA微调方法，通过根据层对内部表示变化的贡献选择关键层进行微调，从而将可训练参数减少高达50%，同时在GLUE基准测试、数学问题解决和编程任务等不同任务上保持或甚至改善了预测性能。该方法与现有的LoRA技术兼容，并适用于多模态模型，显示出竞争力的结果。代码可在https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA 获取。
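A possible way to score layers by how much they change internal representations is sketched below with linear CKA. The use of CKA is inferred from the repository name only; the selection rule (prefer layers whose output is least similar to their input), the function names, and the toy matrices are assumptions, not the authors' procedure.

```python
import math

def center(X):
    """Subtract the per-feature mean from every row."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y for (samples x features) matrices."""
    total = 0.0
    for xcol in zip(*X):
        for ycol in zip(*Y):
            dot = sum(a * b for a, b in zip(xcol, ycol))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA between two representation matrices over the same samples."""
    Xc, Yc = center(X), center(Y)
    num = cross_frobenius_sq(Xc, Yc)
    den = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return num / den

def select_layers(layer_io, k):
    """Assumed rule: keep the k layers whose output is least similar to their input."""
    scores = {name: linear_cka(x_in, x_out) for name, (x_in, x_out) in layer_io.items()}
    return sorted(scores, key=scores.get)[:k], scores

# Toy layer inputs/outputs (hypothetical): layer "a" only rescales, layer "b" reshapes.
x = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
layer_io = {
    "a": (x, [[2.0 * v for v in row] for row in x]),
    "b": (x, [[1.0, 0.0], [1.0, 0.0], [-2.0, 0.0]]),
}
selected, scores = select_layers(layer_io, k=1)
```

Because linear CKA is invariant to isotropic scaling, layer "a" scores 1.0 and is skipped, while layer "b" is selected for adaptation.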

SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model

Authors: Yu Guo, Zhiqiang Lao, Xiyun Song, Yubin Zhou, Heather Yu

First: 2026-01-12T05:03:12+00:00 · Latest: 2026-02-05T18:37:54+00:00

Comments: 12 pages, 14 figures, accepted in WACVW 2026

Abstract

Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of a Large Multimodal Model (LMM), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.

中文标题/摘要

标题：SIRR-LMM：基于大型多模态模型的单张图像反射去除

玻璃表面会产生复杂的反射和透射光相互作用，使得单张图像反射去除（SIRR）具有挑战性。现有数据集在合成数据中缺乏物理真实感，或在实际捕获中规模不足。我们提出了一种合成数据集生成框架，通过路径追踪3D玻璃模型在真实背景图像上创建具有多种玻璃属性、相机设置和后处理效果的物理准确反射场景。为了利用大型多模态模型（LMM）的能力，我们将图像层连接成单一复合输入，应用联合描述，并使用针对特定任务的LoRA进行微调，而不是进行全面参数训练。这使我们的方法在反射去除和分离性能上优于现有最先进的方法。

Summary / 总结

The research addresses the challenge of single-image reflection removal (SIRR) from glass surfaces, which is complicated by the complex interactions of light. To overcome limitations in existing datasets, the authors developed a synthetic dataset generation framework that uses path-tracing to create physically accurate reflection scenarios. They then used a Large Multimodal Model (LMM) with a composite input and fine-tuning via Low-Rank Adaptation (LoRA) to achieve better performance in reflection removal and separation compared to existing methods.

研究旨在解决来自玻璃表面的单图像反射去除（SIRR）问题，由于光线的复杂交互使得这一任务具有挑战性。为克服现有数据集的限制，作者开发了一种合成数据生成框架，使用路径追踪创建现实的反射场景。然后通过拼接图像层并应用任务特定的LoRA对大型多模态模型（LMM）进行微调，从而在反射去除和分离性能上优于先前的方法。

RISE-Video: Can Video Generators Decode Implicit World Rules?

Authors: Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang

First: 2026-02-05T18:36:10+00:00 · Latest: 2026-02-05T18:36:10+00:00

Comments: 38 pages, 16 figures, 3 tables; Code: https://github.com/VisionXLab/RISE-Video; HuggingFace: https://huggingface.co/datasets/VisionXLab/RISE-Video

Abstract

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.

中文标题/摘要

标题：RISE-Video：视频生成器能否解码隐含的世界规则？

尽管生成视频模型在视觉保真度方面取得了显著进展，但它们内化和推理隐含世界规则的能力仍然是一个关键但尚未充分探索的领域。为弥合这一差距，我们提出了RISE-Video，这是一种开创性的基于文本-图像到视频（TI2V）合成的认知推理基准，将评估重点从表面美学转移到深层次的认知推理。RISE-Video 包含467个精心的人工标注样本，涵盖八个严格的类别，为从常识和空间动态到专业主题领域的模型智能提供了一个结构化的测试平台。我们的框架引入了四个评估维度的多维评估协议：推理一致性、时间一致性、物理合理性以及视觉质量。为了进一步支持可扩展的评估，我们提出了一种基于大型多模态模型（LMMs）的自动化流程，以模拟人类评估。在11个最先进的TI2V模型上的广泛实验揭示了在隐含约束下模拟复杂场景的普遍缺陷，为未来世界模拟生成模型的发展提供了关键见解。

Summary / 总结

RISE-Video is a reasoning-oriented benchmark for evaluating Text-Image-to-Video synthesis models, focusing on their ability to understand and reason about implicit world rules rather than just visual fidelity. The benchmark includes 467 human-annotated samples across eight categories and introduces a multi-dimensional evaluation protocol with four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. Experiments on 11 state-of-the-art models show that these models struggle with complex scenarios under implicit constraints, highlighting the need for improved reasoning capabilities in generative models.

RISE-Video 是一个针对文本-图像到视频合成的推理导向基准，评估模型在隐含世界规则上的推理能力，而非仅仅视觉保真度。它包含467个人标注样本，覆盖八个类别，并引入了四个评估指标：推理对齐、时间一致性、物理合理性以及视觉质量。对11个最先进的模型的实验表明，当前模型在模拟隐含约束下的复杂场景时存在缺陷，这表明需要改进生成模型的推理能力。

DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents

Authors: Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, You Li, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

First: 2025-10-29T02:21:10+00:00 · Latest: 2026-02-05T18:29:20+00:00

Abstract

Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior (e.g., premature convergence) and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 36,383 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, enabling evaluation at the utterance and group levels (and supporting future individual-level analyses). We instantiate "digital twin" RPLAs with seven LLMs and evaluate across two settings: next-message prediction and full conversation rollout, using stance-alignment and opinion-convergence metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. Post-training via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improves stance alignment and brings group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions.

中文标题/摘要

标题：辩论：评估角色扮演大模型代理意见动态的大规模基准

准确地通过社会互动建模意见变化对于理解并缓解极化、虚假信息和社会冲突至关重要。近期研究使用角色扮演大模型代理（RPLA）模拟意见动态，但多代理模拟往往表现出不自然的群体行为（例如，过早收敛），并且缺乏评估其与真实人类群体互动一致性的经验基准。我们引入了DEBATE，这是一个大规模基准，用于评估多代理RPLA模拟中意见动态的真实性。DEBATE 包含来自708个群体和107个主题的2,832名美国参与者发送的36,383条消息，包括公开消息和私人李克特量表信念，这使得可以在消息和群体层面进行评估（并支持未来个体层面的分析）。我们使用七种大模型实例化“数字双胞胎”RPLA，并在两种设置下进行评估：下一条消息预测和完整对话展开，使用立场一致性和意见收敛度指标。在零样本设置中，RPLA群体相对于人类群体表现出强烈的意见收敛。通过监督微调（SFT）和直接偏好优化（DPO）进行训练后，立场一致性和群体层面的收敛度更接近人类行为，尽管意见变化和信念更新仍存在差异。DEBATE 使模拟意见动态的基准测试变得严格，并支持未来研究将多代理RPLA与现实人类互动对齐。

Summary / 总结

The research aims to evaluate the authenticity of opinion dynamics in multi-agent role-playing LLM agent (RPLA) simulations, crucial for understanding social interactions and mitigating societal conflicts. The study introduces DEBATE, a large-scale benchmark with 36,383 messages from 2,832 participants across 708 groups and 107 topics. RPLAs were instantiated with seven LLMs and evaluated in terms of opinion convergence and stance alignment. While RPLA groups showed strong opinion convergence in zero-shot settings, supervised fine-tuning and Direct Preference Optimization improved alignment but left some discrepancies in opinion change and belief updating. DEBATE provides a rigorous benchmark for evaluating RPLAs and supports future research on aligning them with human behavior.

研究旨在评估多代理角色扮演LLM代理（RPLA）模拟中的意见动态真实性，这对于理解社会互动和缓解社会冲突至关重要。研究引入了DEBATE，这是一个大规模基准，包含来自2,832名参与者的36,383条消息，覆盖708个群体和107个主题。使用七种LLM实例化RPLA，并从意见收敛和立场一致性两个方面进行了评估。虽然RPLA群体在零样本设置下表现出强烈的意见收敛，但通过监督微调和直接偏好优化提高了立场一致性，但仍存在一些意见变化和信念更新方面的差异。DEBATE为评估RPLA提供了严格的基准，并支持未来研究使其与人类行为更加一致。
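One simple way to quantify group-level opinion convergence from private Likert-scale beliefs is the drop in within-group spread between the start and end of a conversation. The sketch below uses that measure as an assumption; it is not necessarily the opinion-convergence metric defined in DEBATE, and the belief values are invented for illustration.

```python
from statistics import pstdev

def convergence(before, after):
    """Drop in within-group spread of Likert beliefs; positive means the group converged."""
    return pstdev(before) - pstdev(after)

# Hypothetical pre/post beliefs for one human group and its simulated counterpart.
human_group = {"before": [1, 3, 5, 6], "after": [2, 3, 5, 5]}
agent_group = {"before": [1, 3, 5, 6], "after": [4, 4, 4, 4]}

human_c = convergence(human_group["before"], human_group["after"])
agent_c = convergence(agent_group["before"], agent_group["after"])
gap = agent_c - human_c  # positive gap: the simulated group converged more than the humans
```

A positive `gap` corresponds to the pattern the abstract reports for zero-shot agents, which converge more strongly than human groups.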

SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Authors: Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, Chen Zhao

Venue: ACL

First: 2026-02-05T18:25:24+00:00 · Latest: 2026-02-05T18:25:24+00:00

Comments: Submission to ACL ARR 2026 January

Abstract

Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.

中文标题/摘要

标题：SAGE：评估和提升深度研究代理的检索能力

深度研究代理已成为应对复杂查询的强大系统。与此同时，基于LLM的检索器展示了在遵循指令或推理方面的强大能力。这引发了一个关键问题：基于LLM的检索器能否有效贡献于深度研究代理的工作流程？为了探讨这一问题，我们引入了SAGE，这是一个由1200个跨四个科学领域的问题组成的科学文献检索基准，包含20万篇论文的检索语料库。我们评估了六种深度研究代理，发现所有系统在需要推理的检索任务中都表现不佳。以DR Tulu为骨干，我们进一步将BM25和基于LLM的检索器（即ReasonIR和gte-Qwen2-7B-instruct）作为替代搜索工具进行了比较。令人惊讶的是，BM25在性能上比基于LLM的检索器高出约30%，因为现有代理生成的是关键词导向的子查询。为了提高性能，我们提出了一种基于语料库的测试时缩放框架，使用LLM来增强文档中的元数据和关键词，使现成的检索器更容易进行检索。这分别在简短形式和开放式问题上提高了8%和2%。

Summary / 总结

SAGE is a benchmark for evaluating scientific literature retrieval, comprising 1,200 queries across four domains with a 200,000 paper corpus. It finds that deep research agents struggle with reasoning-intensive retrieval, and BM25 outperforms LLM-based retrievers by about 30%. To enhance performance, a corpus-level test-time scaling framework is proposed, which uses LLMs to augment documents, leading to 8% and 2% gains on short-form and open-ended questions, respectively.

研究旨在评估LLM基于的检索器在深度研究代理工作流中的有效性。引入了包含四个科学领域1,200个查询的SAGE基准来评估六个深度研究代理。结果显示，所有系统在推理密集型检索方面都存在问题，BM25比LLM基于的检索器高出约30%。为了提高性能，提出了一种基于语料库的测试时缩放框架，使用LLM增强文档的元数据和关键词，分别在短形式和开放式问题上获得了8%和2%的提升。
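The effect of augmenting documents with keywords on a lexical retriever can be sketched with a small Okapi BM25 scorer. The scorer below uses the common Lucene-style IDF and default `k1`/`b` values as assumptions; the corpus, query, and appended keywords are invented, and the keywords stand in for what an LLM might generate rather than reproducing the paper's pipeline.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = [
    "we study protein folding with deep networks".split(),
    "a survey of graph neural networks".split(),
]
# Hypothetical LLM-generated keywords appended to the first document.
augmented = [corpus[0] + ["alphafold", "structure", "prediction"], corpus[1]]

query = "alphafold structure prediction".split()
before = bm25_scores(query, corpus)
after = bm25_scores(query, augmented)
```

Before augmentation neither document shares a term with the keyword-style query, so both score zero; after augmentation the first document becomes retrievable without changing the retriever itself.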
