EigenLoRAx: Recycling Adapters to Find Principal Subspaces for Resource-Efficient Adaptation and Inference
Authors: Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, Alan Yuille
Venue: Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pages 649-659
First: 2025-02-07T07:07:04+00:00 · Latest: 2026-02-05T18:59:59+00:00
Abstract
The rapid growth of large models has raised concerns about their environmental impact and equity in accessibility due to significant computational costs. Low-Rank Adapters (LoRA) offer a lightweight solution for finetuning large models, resulting in an abundance of publicly available adapters tailored to diverse domains. We ask: Can these pretrained adapters be leveraged to further streamline adaptation to new tasks while addressing these challenges? We introduce EigenLoRAx, a parameter-efficient finetuning method that recycles existing adapters to create a principal subspace aligned with their shared domain knowledge, which can be further augmented with orthogonal basis vectors in low-resource scenarios. This enables rapid adaptation to new tasks by learning only lightweight coefficients on the principal components of the subspace, eliminating the need to finetune entire adapters. EigenLoRAx requires significantly fewer parameters and memory, improving efficiency for both training and inference. Our method demonstrates strong performance across diverse domains and tasks, offering a scalable solution for edge-based applications, personalization, and equitable deployment of large models in resource-constrained environments.
中文标题/摘要
标题:EigenLoRAx:回收适配器以发现资源高效适应和推理的主要子空间
大型模型的快速增长引发了对其环境影响和访问公平性的担忧,因为它们需要大量的计算资源。低秩适配器(LoRA)提供了一种轻量级的微调解决方案,使得针对不同领域的大量适配器得以公开。我们提出的问题是:这些预训练的适配器能否进一步简化对新任务的适应,同时解决这些挑战?我们引入了EigenLoRAx,这是一种参数高效的微调方法,通过回收现有的适配器来创建一个与它们共享的知识领域对齐的主要子空间,并在低资源场景中进一步扩展为正交基向量。这使得在学习主要子空间的轻量级系数时能够快速适应新任务,从而消除对整个适配器进行微调的需要。EigenLoRAx 需要的参数和内存显著减少,提高了训练和推理的效率。我们的方法在多种领域和任务中表现出色,为边缘应用、个性化和资源受限环境中大型模型的公平部署提供了可扩展的解决方案。
Summary / 总结
EigenLoRAx is a parameter-efficient method that recycles existing adapters to create a principal subspace aligned with shared domain knowledge, enabling rapid adaptation to new tasks with fewer parameters and memory. It demonstrates strong performance across various domains and tasks, offering scalability for edge-based applications and equitable deployment of large models in resource-constrained environments.
EigenLoRAx 是一种参数高效的回收预训练适配器的方法,创建一个与共享领域知识对齐的主要子空间,以实现快速的新任务适应,同时减少所需参数和内存。该方法在多种领域和任务上表现出色,适用于边缘设备应用和资源受限环境中的公平部署。
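The subspace idea in the EigenLoRAx abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each adapter has been flattened into a plain vector, finds principal directions by uncentered power iteration, and obtains the coefficients for a new adapter by projection rather than by training on a task loss. All function names here are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("adapters span fewer directions than requested")
    return [a / norm for a in v]

def principal_directions(adapters, k, iters=200, seed=0):
    """Top-k principal directions of the adapter vectors (uncentered),
    found by power iteration with deflation on X^T X."""
    rng = random.Random(seed)
    dim = len(adapters[0])
    basis = []
    for _ in range(k):
        v = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
        for _ in range(iters):
            # Apply X^T X to v without forming the matrix.
            proj = [dot(row, v) for row in adapters]
            w = [sum(p * row[j] for p, row in zip(proj, adapters)) for j in range(dim)]
            # Deflate: remove components along directions already found.
            for b in basis:
                c = dot(w, b)
                w = [x - c * y for x, y in zip(w, b)]
            v = normalize(w)
        basis.append(v)
    return basis

def fit_coefficients(basis, target):
    """Assumption: coefficients come from projection onto the orthonormal
    basis; the paper instead learns them during finetuning."""
    return [dot(b, target) for b in basis]

def reconstruct(basis, coeffs):
    dim = len(basis[0])
    return [sum(c * b[j] for c, b in zip(coeffs, basis)) for j in range(dim)]

# Four toy "adapters" that all lie in a 2-dimensional subspace of R^4.
adapters = [[2, 0, 0, 0], [0, 1, 0, 0], [4, 0, 0, 0], [0, 3, 0, 0]]
basis = principal_directions(adapters, k=2)
new_adapter = [1.5, -2.0, 0.0, 0.0]
coeffs = fit_coefficients(basis, new_adapter)  # 2 numbers instead of 4
approx = reconstruct(basis, coeffs)
```

In this toy setting the new adapter is described by two coefficients rather than four weights, which mirrors the parameter saving the abstract describes at a much smaller scale.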
Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning
Authors: Xuejun Zhang, Aditi Tiwari, Zhenhailong Wang, Heng Ji
First: 2026-02-05T18:59:55+00:00 · Latest: 2026-02-05T18:59:55+00:00
Abstract
Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
中文标题/摘要
标题:从透视描述预测相机姿态以进行空间推理
多图像空间推理仍然是当前多模态大型语言模型(MLLMs)面临的挑战。虽然单视角感知本质上是二维的,但多视角推理需要在不同视角之间构建连贯的场景理解。特别是,我们研究了视角转换,其中模型必须从多视角观察中构建连贯的三维理解,并用于从新的语言指定视角进行推理。我们引入了CAMCUE,这是一种姿态感知的多图像框架,使用相机姿态作为跨视角融合和新视角推理的显式几何锚点。CAMCUE 将每视角的姿态注入视觉标记,将自然语言视角描述定位到目标相机姿态,并合成姿态条件下的想象目标视图以支持回答。为了支持这一设置,我们收集了CAMCUE-DATA,其中包括27,668个训练实例和508个测试实例,这些实例将多视角图像和姿态与多样化的目标视角描述和视角转换问题配对。我们还在测试分割中包括了人工标注的视角描述,以评估对人类语言的泛化能力。CAMCUE 的整体准确率提高了9.06%,并且能够从自然语言视角描述中预测目标姿态,旋转准确率超过90%(误差在20°以内),平移准确率在0.5误差阈值以内超过90%。这种直接定位避免了昂贵的测试时搜索和匹配,将每个示例的推理时间从256.6秒减少到1.45秒,从而在实际场景中实现快速、交互式使用。
Summary / 总结
The research aims to improve multi-image spatial reasoning for multimodal large language models by addressing the challenge of perspective taking. CAMCUE, a pose-aware multi-image framework, uses camera pose as a geometric anchor for cross-view fusion and novel-view reasoning. It enhances accuracy by 9.06% and predicts target poses with high rotation and translation accuracy. This method reduces inference time from 256.6s to 1.45s per example, enabling fast, interactive use in real-world scenarios.
研究通过引入CAMCUE框架,解决当前多模态大语言模型在多图像空间推理方面的挑战,该框架使用相机姿态作为几何锚点进行跨视图融合和新颖视图推理。该框架提高了整体准确性9.06%,并在20°旋转误差和0.5误差阈值内实现了超过90%的平移准确性。该框架通过将推理时间从每例256.6秒减少到1.45秒,支持快速交互式使用于现实世界场景中。
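The pose metrics quoted in the CAMCUE abstract (rotation within 20°, translation within a 0.5 threshold) can be made concrete with a small sketch. The abstract does not define the error measures, so this assumes the geodesic angle between rotation matrices and the Euclidean distance between translations; the function names are hypothetical.

```python
import math

def rot_z(deg):
    """3x3 rotation about the z axis, used only to build toy poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_error_deg(r_pred, r_true):
    """Geodesic angle between two rotations: acos((trace(R_true^T R_pred) - 1) / 2)."""
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding
    return math.degrees(math.acos(cos_angle))

def translation_error(t_pred, t_true):
    return math.dist(t_pred, t_true)

def pose_accuracy(preds, truths, rot_thresh_deg=20.0, trans_thresh=0.5):
    """Fraction of examples whose rotation / translation error is within threshold."""
    n = len(preds)
    rot_ok = sum(rotation_error_deg(p[0], t[0]) <= rot_thresh_deg for p, t in zip(preds, truths))
    trans_ok = sum(translation_error(p[1], t[1]) <= trans_thresh for p, t in zip(preds, truths))
    return rot_ok / n, trans_ok / n

preds = [(rot_z(10), (0.0, 0.0, 0.2)), (rot_z(45), (0.0, 0.0, 1.0))]
truths = [(rot_z(0), (0.0, 0.0, 0.0)), (rot_z(0), (0.0, 0.0, 0.0))]
rot_acc, trans_acc = pose_accuracy(preds, truths)
```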
DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching
Authors: Yuxing Lu, Yucheng Hu, Xukai Zhao, Jiuxin Cao
First: 2026-02-05T18:59:51+00:00 · Latest: 2026-02-05T18:59:51+00:00
Abstract
Multi-agent systems built from prompted large language models can improve multi-round reasoning, yet most existing pipelines rely on fixed, trajectory-wide communication patterns that are poorly matched to the stage-dependent needs of iterative problem solving. We introduce DyTopo, a manager-guided multi-agent framework that reconstructs a sparse directed communication graph at each round. Conditioned on the manager's round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms the strongest baseline (avg. +6.2). Beyond accuracy, DyTopo yields an interpretable coordination trace via the evolving graphs, enabling qualitative inspection of how communication pathways reconfigure across rounds.
中文标题/摘要
标题:DyTopo:基于语义匹配的多智能体动态拓扑路由
由提示的大语言模型构建的多智能体系统可以提高多轮推理能力,但大多数现有管道依赖于固定且贯穿整个轨迹的通信模式,这些模式与迭代问题解决过程中阶段特定的需求匹配不佳。我们引入了DyTopo,这是一种由管理者指导的多智能体框架,在每一轮中重建一个稀疏的有向通信图。基于管理者的轮次目标,每个智能体输出轻量级的自然语言查询(需求)和关键(提供)描述;DyTopo嵌入这些描述并进行语义匹配,仅沿诱导的边路由私有消息。在代码生成和数学推理基准测试以及四个LLM基础模型中,DyTopo在最强基线之上始终表现出色(平均提高6.2%)。除了准确性之外,DyTopo还通过不断变化的图提供了可解释的协调轨迹,使人们能够定性地检查通信路径如何在轮次之间重新配置。
Summary / 总结
DyTopo is a dynamic topology routing framework designed to enhance multi-agent reasoning in multi-round problem solving by reconstructing a sparse directed communication graph at each round. Agents output lightweight natural-language query (need) and key (offer) descriptors, which DyTopo embeds and semantically matches so that private messages are routed only along the induced edges. DyTopo outperforms the strongest baseline across code generation and mathematical reasoning benchmarks and four LLM backbones, with an average gain of +6.2. The evolving graphs provide interpretable coordination traces, allowing for qualitative analysis of communication pathway reconfigurations.
DyTopo 是一种动态拓扑路由框架,通过适应迭代问题解决过程中的阶段依赖需求来改进多轮推理。每个代理输出轻量级的自然语言查询和关键描述符,DyTopo 使用这些描述符进行语义匹配以路由私有消息。在各种基准测试中,DyTopo 的表现优于最强基线,平均高出 6.2%,并通过不断变化的图提供可解释的协调轨迹,便于对通信路径的重新配置进行定性检查。
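The query/key matching step described for DyTopo can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: descriptors are embedded with a toy bag-of-words vector instead of a learned embedding model, edges are kept when cosine similarity reaches a fixed threshold, and the names `embed`, `cosine`, and `build_graph` are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): bag-of-words counts over lowercase tokens."""
    return Counter(text.lower().split())

def cosine(u, v):
    num = sum(u[t] * v[t] for t in u if t in v)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def build_graph(agents, threshold=0.5):
    """Return directed edges (sender, receiver): the sender's key (offer)
    matches the receiver's query (need) with similarity >= threshold."""
    edges = set()
    for receiver, r_desc in agents.items():
        for sender, s_desc in agents.items():
            if sender == receiver:
                continue
            if cosine(embed(r_desc["query"]), embed(s_desc["key"])) >= threshold:
                edges.add((sender, receiver))
    return edges

agents = {
    "coder": {"query": "need test cases for parser", "key": "offer parser implementation"},
    "tester": {"query": "need parser implementation", "key": "offer test cases for parser"},
    "planner": {"query": "need final report", "key": "offer task plan"},
}
edges = build_graph(agents)
```

With these toy descriptors the coder and tester are linked in both directions while the planner stays isolated for the round, which is the kind of sparse, round-specific routing the abstract describes.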
SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs
Authors: Jintao Tong, Shilin Yan, Hongwei Xue, Xiaojun Tang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou
First: 2026-02-05T18:59:51+00:00 · Latest: 2026-02-05T18:59:51+00:00
Comments: Project Page: https://accio-lab.github.io/SwimBird
Abstract
Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
中文标题/摘要
标题:SwimBird:在混合自回归MLLM中引发可切换的推理模式
多模态大型语言模型(MLLMs)通过视觉和语言的结合,在多模态感知和推理方面取得了显著进展。然而,大多数现有的MLLMs主要通过文本的逐步推理(CoT)进行推理,这限制了它们在视觉密集型任务上的效果。最近的方法将固定数量的连续隐藏状态作为“视觉思考”注入推理过程,从而提高了视觉性能,但通常会牺牲基于文本的逻辑推理能力。我们认为核心限制在于一种僵化的、预先定义的推理模式,无法根据不同用户查询自适应地选择最合适的思考模态。我们引入了SwimBird,这是一种可切换的MLLM,根据输入动态切换三种推理模式:(1)仅文本推理,(2)仅视觉推理(连续隐藏状态作为视觉思考),(3)视觉-文本交织推理。为了实现这一能力,我们采用了一种混合自回归公式,将文本思考的下一个词预测与视觉思考的下一个嵌入预测统一起来,并设计了一种系统性的推理模式策展策略,构建了SwimBird-SFT-92K,这是一个涵盖所有三种推理模式的多样化监督微调数据集。通过实现灵活、查询自适应的模式选择,SwimBird在保持强大的文本逻辑推理能力的同时,显著提高了视觉密集型任务的性能。跨多种涵盖文本推理和挑战性视觉理解的基准实验表明,SwimBird在先前固定模式多模态推理方法上取得了最先进的结果和稳健的提升。
Summary / 总结
SwimBird is a reasoning-switchable MLLM that dynamically switches among text-only, vision-only, and interleaved vision-text reasoning modes based on input queries. It uses a hybrid autoregressive formulation and a systematic reasoning-mode curation strategy to construct a diverse fine-tuning dataset. Experiments show that SwimBird maintains strong textual logic while significantly improving performance on vision-intensive tasks, achieving state-of-the-art results across various benchmarks.
SwimBird 是一种可根据输入查询动态切换文本、视觉和视觉-文本交织推理模式的 MLLM。它采用混合自回归建模和系统推理模式构建策略来生成涵盖所有三种推理模式的多样监督微调数据集。实验表明,SwimBird 在保持强大文本逻辑的同时显著提高了视觉密集任务的表现,实现了各种基准测试中的最先进成果。
CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction
Authors: Xiaopan Zhang, Zejin Wang, Zhixu Li, Jianpeng Yao, Jiachen Li
Venue: ICRA 2026
First: 2026-02-05T18:59:45+00:00 · Latest: 2026-02-05T18:59:45+00:00
Comments: IEEE International Conference on Robotics and Automation (ICRA 2026); Project Website: https://comm-cp.github.io/
Abstract
To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: https://comm-cp.github.io.
中文标题/摘要
标题:CommCP:通过基于LLM的通信与符合性预测实现高效的多智能体协调
为了通过自然语言完成人类提供的任务,机器人必须解释命令、生成和回答相关问题以理解场景,并操作目标物体。实际部署中,通常需要不同操作能力的多个异构机器人协同处理不同的任务。除了需要专门的操作技能外,有效的信息收集对于完成这些任务至关重要。为了解决这一问题,我们将信息收集过程在完全合作的环境中形式化为一个未被充分探索的多任务多智能体体态问答(MM-EQA)问题,这是体态问答(EQA)的经典问题的一个新颖扩展,其中有效的通信对于协调努力且不重复至关重要。为了解决这一问题,我们提出CommCP,一种专为MM-EQA设计的基于LLM的分布式通信框架。我们的框架采用符合性预测来校准生成的消息,从而减少接收者的分心并提高通信可靠性。为了评估我们的框架,我们引入了一个包含多种多样的、逼真的家庭场景的MM-EQA基准,其中包含体态问题。实验结果表明,CommCP在任务成功率和探索效率上显著优于基线。实验视频、代码和数据集可在我们的项目网站上获取:https://comm-cp.github.io/
Summary / 总结
The paper addresses the challenge of multiple robots working together to complete tasks given in natural language, emphasizing the importance of effective communication. It introduces CommCP, a communication framework using LLMs and conformal prediction to enhance coordination among robots. Experimental results show that CommCP improves task success and exploration efficiency compared to baseline methods.
论文旨在通过将多智能体多任务体感问答(MM-EQA)问题形式化,提高多智能体在自然语言命令下的协调能力。提出了一种新颖的基于LLM的去中心化通信框架CommCP,使用校准预测来校准消息,减少干扰并提高通信可靠性。实验结果显示,CommCP在任务成功率和探索效率上显著优于基线方法。
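The abstract states that CommCP uses conformal prediction to calibrate generated messages but does not give the procedure. As background, the sketch below shows generic split conformal prediction on a toy classification task; how CommCP applies it to LLM messages is not specified here, and the room labels, scores, and function names are illustrative assumptions.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest
    nonconformity score on held-out calibration data."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: keep every label
    return sorted(calibration_scores)[k - 1]

def prediction_set(probs, threshold):
    """Keep every label whose nonconformity score (1 - probability) is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}

# Nonconformity scores (1 - probability assigned to the true answer) on calibration examples.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q = conformal_threshold(calibration, alpha=0.2)

# Hypothetical model belief about where a target object is located.
belief = {"kitchen": 0.7, "bedroom": 0.25, "garage": 0.05}
candidates = prediction_set(belief, q)
```

A small prediction set suggests the agent is confident enough to share a specific claim, while a large one suggests the message would be more distracting than useful; this reading is an interpretation of the abstract, not a description of the released code.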
Thinking with Geometry: Active Geometry Integration for Spatial Reasoning
Authors: Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, Xiaodan Liang
First: 2026-02-05T18:59:32+00:00 · Latest: 2026-02-05T18:59:32+00:00
Abstract
Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
中文标题/摘要
标题:空间几何思维:基于空间几何感知的主动几何整合
多模态大型语言模型(MLLMs)在空间推理方面的最新进展越来越多地利用3D编码器提供的几何先验。然而,大多数现有的整合策略仍然被动:几何信息以全局流的形式呈现,并以不分青红皂白的方式融合,这往往导致语义-几何错位和冗余信号。我们提出了GeoThinker框架,将范式从被动融合转变为主动感知。GeoThinker 不是通过特征混合,而是使模型能够根据其内部推理需求选择性地检索几何证据。GeoThinker 通过在精心选择的VLM层上应用空间接地融合来实现这一点,其中语义视觉先验通过帧严格的交叉注意力选择性地查询和整合与任务相关的几何信息,并通过重要性门控进一步校准,以偏向于与任务相关的结构的帧间注意力。全面的评估结果表明,GeoThinker 在空间智能方面达到了新的最先进水平,在VSI-Bench上达到峰值得分为72.6。此外,GeoThinker 在复杂下游场景中展示了稳健的泛化能力和显著改进的空间感知能力,包括体感指代和自主驾驶。我们的结果表明,主动整合空间结构的能力对于下一代空间智能至关重要。代码可以在 https://github.com/Li-Hao-yuan/GeoThinker 获取。
Summary / 总结
The research aims to enhance spatial reasoning by integrating geometric information more effectively into multimodal large language models. GeoThinker, a new framework, shifts from passive geometric fusion to active perception, allowing the model to selectively retrieve and integrate geometric evidence based on its reasoning needs. This is achieved through Spatial-Grounded Fusion at specific VLM layers, calibrated by Importance Gating. Experimental results show that GeoThinker outperforms previous methods, achieving a peak score of 72.6 on the VSI-Bench and demonstrating robust performance in complex scenarios like embodied referring and autonomous driving.
研究旨在通过更主动地整合几何先验来提升多模态大语言模型的空间推理能力。GeoThinker 提出的框架将融合方式从被动转向主动感知,使模型能够根据其推理需求选择性地检索几何证据。这通过在特定 VLM 层应用 Spatial-Grounded 融合实现,其中语义视觉先验通过帧严格的交叉注意力查询并整合任务相关的几何信息,进一步通过重要性门控进行校准。GeoThinker 在 VSI-Bench 上达到 72.6 的新最佳分数,并在诸如体感引用和自动驾驶等复杂下游场景中展示了强大的泛化能力。
DFlash: Block Diffusion for Flash Speculative Decoding
Authors: Jian Chen, Yesheng Liang, Zhijian Liu
First: 2026-02-05T18:59:30+00:00 · Latest: 2026-02-05T18:59:30+00:00
Abstract
Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
中文标题/摘要
标题:DFlash:块扩散用于闪存投机解码
自回归大型语言模型(LLMs)表现出色,但需要固有的顺序解码,导致高推理延迟和低GPU利用率。投机解码通过使用快速草稿模型并行验证目标LLM的输出来缓解这一瓶颈,但现有方法仍然依赖于自回归草稿,这仍然是顺序的并限制了实际加速。扩散LLMs提供了一种有前景的替代方案,通过并行生成来启用,但当前的扩散模型通常在性能上不如自回归模型。在本文中,我们介绍了DFlash,这是一种投机解码框架,使用轻量级块扩散模型进行并行草稿。通过在单次前向传递中生成草稿标记,并将草稿模型基于目标模型提取的上下文特征进行条件化,DFlash能够实现高效且高质量的草稿生成,并具有更高的接受率。实验表明,DFlash在各种模型和任务上实现了超过6倍的无损加速,比最先进的投机解码方法EAGLE-3提供了高达2.5倍的更高加速。
Summary / 总结
DFlash is a speculative decoding framework that uses a lightweight block diffusion model for parallel drafting, addressing the sequential nature of autoregressive models. It generates draft tokens in a single forward pass and conditions the draft model on context features from the target model, achieving over 6x lossless acceleration across various models and tasks, with up to 2.5x higher speedup compared to the state-of-the-art speculative decoding method EAGLE-3.
DFlash 是一种 speculative 解码框架,使用轻量级的块扩散模型进行并行草稿生成,解决了自回归模型的顺序性问题。它通过单次前向传递生成草稿令牌,并将草稿模型条件化为来自目标模型的上下文特征,实现了在各种模型和任务上超过 6 倍的无损加速,比最先进的 speculative 解码方法 EAGLE-3 高出 2.5 倍的加速效果。
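The verification step that makes speculative decoding lossless can be sketched in a few lines. This shows only the generic greedy accept-or-correct rule, not DFlash's block diffusion drafter; `target_next` is a hypothetical stand-in for the target model's greedy next-token choice, and a real system would score all draft positions in one parallel forward pass instead of the loop used here for clarity.

```python
def verify_block(prefix, draft, target_next):
    """Accept the longest draft prefix that matches the target's greedy output,
    then append one token from the target (a correction or a bonus token)."""
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(token)
    # Every draft token matched: the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted

def toy_target(sequence):
    """Toy deterministic 'target model' (assumption): next token is len(sequence) mod 5."""
    return len(sequence) % 5

prefix = [0, 1]
partly_right = verify_block(prefix, [2, 3, 9, 0], toy_target)  # third draft token is wrong
all_right = verify_block(prefix, [2, 3, 4, 0], toy_target)     # whole block accepted
```

Because every emitted token equals what the target would have produced greedily, output quality is unchanged; the speedup depends on how many draft tokens are accepted per target pass.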
InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
Authors: Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei, Yu-Xiong Wang, Liangyan Gui
First: 2026-02-05T18:59:27+00:00 · Latest: 2026-02-05T18:59:27+00:00
Comments: Webpage: https://sirui-xu.github.io/InterPrior/
Abstract
Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.
中文标题/摘要
标题:InterPrior:扩展基于物理的人机物交互生成控制
人类很少在整体身体层面上计划与物体的交互,而是通过高层次意图,如功能,来定义目标,而协调的平衡、接触和操作则可以从底层的物理和运动先验中自然地涌现出来。扩展这些先验对于使类人机器人能够跨多种情境组合和泛化肢体操作技能,同时保持物理上连贯的整体身体协调至关重要。为此,我们提出了InterPrior,这是一种可扩展的框架,通过大规模模仿预训练和后续的强化学习微调来学习一个统一的生成控制器。InterPrior首先将一个全参考模仿专家提炼成一个多功能、目标条件化的变分策略,该策略可以从多模态观察和高层次意图中重建运动。虽然提炼出的策略可以重建训练行为,但由于大规模人机物交互的庞大配置空间,它无法可靠地泛化。为了解决这个问题,我们应用了物理扰动的数据增强,并通过强化学习微调来提高对未见过的目标和初始状态的技能。这些步骤共同将重建的潜在技能凝聚成一个有效的流形,产生一个泛化能力超出训练数据的运动先验,例如,它可以包含与未见过的物体的交互等新行为。我们进一步展示了其在用户交互控制中的有效性及其在实际机器人部署中的潜力。
Summary / 总结
InterPrior is a scalable framework that learns a unified generative controller for human-object interactions by combining large-scale imitation pretraining and reinforcement learning. It first creates a versatile policy that can reconstruct motion from multimodal observations and high-level intent, then uses data augmentation and reinforcement learning to improve its ability to generalize to unseen goals and initializations. This leads to a motion prior that can handle new behaviors, such as interactions with unseen objects, and is effective for user-interactive control and real robot deployment.
InterPrior 是一个可扩展的框架,用于学习统一的生成控制器,使类人机器人能够进行与物体的物理连贯的全身交互。它使用大规模的模仿预训练和强化学习将一个完整的参考模仿专家提炼成一个多功能、目标条件化的变分策略。通过物理扰动的数据增强和强化学习微调,该策略能够更好地泛化到未见过的目标和初始状态,从而使类人机器人能够融入新行为并与未见过的物体进行交互。
V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval
Authors: Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka
First: 2026-02-05T18:59:21+00:00 · Latest: 2026-02-05T18:59:21+00:00
Abstract
Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (with 23.0% improvements on average), perception-driven reasoning reliability, and generalization.
中文标题/摘要
标题:V-Retrver:基于证据的代理推理在通用多模态检索中的应用
多模态大型语言模型(MLLMs)最近被应用于通用多模态检索,其中推理链(CoT)推理改善了候选检索结果的重新排序。然而,现有方法仍然主要依赖语言驱动,依赖静态视觉编码,缺乏主动验证细粒度视觉证据的能力,这往往导致在视觉含糊情况下进行推测性推理。我们提出了一种基于证据的检索框架V-Retrver,将多模态检索重新定义为基于视觉检查的代理推理过程。V-Retrver使MLLM能够在推理过程中通过外部视觉工具选择性地获取视觉证据,执行一种多模态交替推理过程,交替进行假设生成和目标视觉验证。为了训练这种证据收集检索代理,我们采用了一种基于课程的学习策略,结合监督推理激活、拒绝基础的细化以及与证据对齐的目标的强化学习。在多个多模态检索基准上的实验表明,检索准确性(平均提高23.0%)、感知驱动的推理可靠性和泛化能力均得到了一致的提升。
Summary / 总结
V-Retrver is an evidence-driven retrieval framework that enhances multimodal retrieval by incorporating visual evidence verification. It reformulates the retrieval process as an agentic reasoning task, allowing the model to actively seek and verify visual evidence. This approach leads to improved retrieval accuracy and more reliable reasoning, with an average improvement of 23.0% across multiple benchmarks.
V-Retrver 是一种基于证据的检索框架,将多模态检索重新定义为基于视觉检查的代理推理过程。该框架使 MLLM 在推理过程中能够选择性地获取视觉证据,并通过外部视觉工具进行多模态交替推理,交替进行假设生成和目标视觉验证。实验结果显示,在多个基准测试中,检索准确性和感知驱动的推理可靠性得到了一致的提升。
Can vision language models learn intuitive physics from interaction?
Authors: Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz
First: 2026-02-05T18:59:20+00:00 · Latest: 2026-02-05T18:59:20+00:00
Abstract
Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.
中文标题/摘要
标题:视觉语言模型能否通过交互学习直观的物理知识?
预训练的视觉语言模型对物理世界的直觉不够好。最近的研究表明,监督微调可以提高模型在简单物理任务上的表现。然而,微调后的模型似乎没有学会能够泛化的稳健物理规则。基于认知科学的研究,我们假设模型需要与环境互动才能正确学习其物理动态。我们使用强化学习训练通过与环境互动来学习的模型。虽然通过互动学习可以让模型提高其任务内的表现,但无法产生具有泛化物理直觉的模型。我们发现,即使任务共享视觉统计和物理原理,针对一个任务训练的模型也不可靠地泛化到相关任务,无论模型是通过互动还是其他方式训练。
PhysicsAgentABM: Physics-Guided Generative Agent-Based Modeling
Authors: Kavana Venkatesh, Yinhan He, Jundong Li, Jiaming Cui
First: 2026-02-05T18:59:01+00:00 · Latest: 2026-02-05T18:59:01+00:00
Abstract
Large language model (LLM)-based multi-agent systems enable expressive agent reasoning but are expensive to scale and poorly calibrated for timestep-aligned state-transition simulation, while classical agent-based models (ABMs) offer interpretability but struggle to integrate rich individual-level signals and non-stationary behaviors. We propose PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state-specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty-aware epistemic fusion yields calibrated cluster-level transition distributions. Individual agents then stochastically realize transitions under local constraints, decoupling population inference from entity-level variability. We further introduce ANCHOR, an LLM agent-driven clustering strategy based on cross-contextual behavioral responses and a novel contrastive loss, reducing LLM calls by up to 6-8 times. Experiments across public health, finance, and social sciences show consistent gains in event-time accuracy and calibration over mechanistic, neural, and LLM baselines. By re-architecting generative ABM around population-level inference with uncertainty-aware neuro-symbolic fusion, PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs.
中文标题/摘要
标题:PhysicsAgentABM:基于物理引导的生成性基于代理的建模
基于大型语言模型(LLM)的多代理系统能够实现富有表现力的代理推理,但难以扩展且不适用于时间步长对齐的状态转换模拟,而经典的基于代理的模型(ABMs)虽然具有可解释性,但在整合丰富的个体级信号和非平稳行为方面存在困难。我们提出了PhysicsAgentABM,该方法将推理转移到行为一致的代理集群中:状态专门化的符号代理编码机制性转换先验,多模态神经转换模型捕捉时间动态和交互动态,不确定性意识的表征融合生成校准的集群级转换分布。个体代理随后在局部约束下随机实现转换,从而解耦群体推理与实体级变异性。我们还引入了基于跨上下文行为响应的LLM代理驱动聚类策略ANCHOR,以及一种新颖的对比损失,最多可减少6-8倍的LLM调用次数。在公共卫生、金融和社会科学领域的实验表明,与机制性、神经网络和LLM基线相比,PhysicsAgentABM在事件时间准确性和校准方面均表现出一致的改进。通过围绕群体级推理重构生成性ABM,并结合不确定性意识的神经-符号融合,PhysicsAgentABM确立了LLM支持的可扩展且校准的模拟新范式。
Summary / 总结
PhysicsAgentABM integrates physics-guided generative agent-based modeling to address the scalability and calibration issues of large language models (LLMs) and the interpretability and signal integration challenges of classical ABMs. It uses state-specialized symbolic agents for mechanistic transition priors, a multimodal neural model for temporal and interaction dynamics, and epistemic fusion for calibrated cluster-level transitions. ANCHOR, an LLM agent-driven clustering strategy, further reduces LLM calls. Experiments across public health, finance, and social sciences demonstrate consistent improvements in event-time accuracy and calibration over various baselines.
PhysicsAgentABM 结合了基于物理的生成性基于代理的建模,以解决大型语言模型 (LLM) 的可扩展性和校准问题以及经典 ABM 的可解释性和个体级信号整合问题。它使用状态专业化符号代理来编码机制性先验,多模态神经模型来捕捉时间和交互动态,并使用知识融合来获得集群级别的校准转换分布。进一步引入的 ANCHOR 是一种基于跨上下文行为响应的 LLM 驱动聚类策略,通过新颖的对比损失减少 LLM 调用次数。实验结果表明,在公共卫生、金融和社会科学领域的一致改进,事件时间准确性和校准度都优于各种基线模型。
Curiosity is Knowledge: Self-Consistent Learning and No-Regret Optimization with Active Inference
Authors: Yingke Li, Anjali Parashar, Enlu Zhou, Chuchu Fan
First: 2026-02-05T18:58:32+00:00 · Latest: 2026-02-05T18:58:32+00:00
Abstract
Active inference (AIF) unifies exploration and exploitation by minimizing the Expected Free Energy (EFE), balancing epistemic value (information gain) and pragmatic value (task performance) through a curiosity coefficient. Yet it has been unclear when this balance yields both coherent learning and efficient decision-making: insufficient curiosity can drive myopic exploitation and prevent uncertainty resolution, while excessive curiosity can induce unnecessary exploration and regret. We establish the first theoretical guarantee for EFE-minimizing agents, showing that a single requirement--sufficient curiosity--simultaneously ensures self-consistent learning (Bayesian posterior consistency) and no-regret optimization (bounded cumulative regret). Our analysis characterizes how this mechanism depends on initial uncertainty, identifiability, and objective alignment, thereby connecting AIF to classical Bayesian experimental design and Bayesian optimization within one theoretical framework. We further translate these theories into practical design guidelines for tuning the epistemic-pragmatic trade-off in hybrid learning-optimization problems, validated through real-world experiments.
中文标题/摘要
标题:好奇心即知识:自洽学习与无遗憾优化中的主动推理
主动推理(AIF)通过最小化预期自由能量(EFE),以好奇心系数平衡先验价值(信息获取)和实用价值(任务性能),统一了探索与利用。然而,这种平衡何时能同时实现连贯学习和高效决策尚不清楚:好奇心不足可能导致短视的利用并阻止不确定性解决,而好奇心过度则可能导致不必要的探索和遗憾。我们首次为EFE最小化代理提供了理论保证,表明单一要求——足够的好奇心——同时确保了自洽学习(贝叶斯后验一致性)和无遗憾优化(有界累积遗憾)。我们的分析描述了这种机制如何依赖于初始不确定性、可识别性和目标对齐,从而将AIF与经典贝叶斯实验设计和贝叶斯优化统一在一个理论框架中。我们进一步将这些理论转化为在混合学习-优化问题中调整先验-实用权衡的实际设计指南,并通过实际实验进行了验证。
Summary / 总结
The paper aims to address the balance between exploration and exploitation in learning and optimization by minimizing Expected Free Energy (EFE) through a curiosity coefficient. It establishes a theoretical guarantee that sufficient curiosity ensures both self-consistent learning and no-regret optimization. Key findings show that this mechanism depends on initial uncertainty, identifiability, and objective alignment, connecting AIF to Bayesian experimental design and optimization.
研究旨在通过最小化预期自由能(EFE)和好奇心系数来解决探索与利用之间的平衡问题。研究提供了理论保证,即适当的好奇心同时确保自我一致的学习和无遗憾的优化。关键发现包括该机制如何依赖初始不确定性、可识别性和目标对齐的特征描述,并将AIF与经典贝叶斯方法联系起来,通过实际设计指南和现实世界实验进行了验证。
Language Models and Logic Programs for Trustworthy Tax Reasoning
Authors: William Jurayj, Nils Holzenberger, Benjamin Van Durme
Venue: AAAI 2026
First: 2025-08-28T17:55:07+00:00 · Latest: 2026-02-05T18:58:31+00:00
Comments: Accepted to AAAI 2026
Abstract
According to the United States Internal Revenue Service, "the average American spends $270 and 13 hours filing their taxes". Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the effectiveness of applying semantic parsing methods to statutory reasoning, and show promising economic feasibility of neuro-symbolic architectures for increasing access to reliable tax assistance.
中文标题/摘要
标题:语言模型与逻辑程序在可信税务推理中的应用
根据美国国税局的数据,“平均美国人填写税务申报表花费270美元和13小时”。即使在美国之外,税务申报也需要复杂的推理,结合应用重叠规则和数值计算。由于错误可能会导致高昂的罚款,任何自动化系统都必须提供高准确性和可审计性,使得现代大型语言模型(LLMs)不适合此任务。我们提出了一种将LLMs与符号求解器集成的方法,以计算税务义务。我们使用具有挑战性的StAtutory Reasoning Assessment(SARA)数据集评估了该系统的不同变体,并提出了一种基于实际税务错误罚款的新方法来估算部署此类系统的成本。我们还展示了如何通过将文本规则预先翻译成形式逻辑程序,并结合智能检索的形式案例表示示例,可以显著提高此任务的性能,并将成本降低到远低于实际平均水平。我们的结果表明,应用语义解析方法进行法规推理的有效性,并展示了神经-符号架构在提高可靠税务援助可及性方面的经济可行性。
Summary / 总结
The research aims to address the complexity and errors in tax filing by leveraging large language models (LLMs) integrated with symbolic solvers. The study evaluates different system variants on the SARA dataset and introduces a novel cost estimation method based on real-world penalties. Key findings show that combining plain-text rule translation into formal logic programs with intelligent exemplar retrieval significantly improves performance and reduces costs below real-world averages, highlighting the economic feasibility of neuro-symbolic architectures for tax assistance.
研究旨在通过利用大型语言模型(LLMs)与符号求解器的结合来解决税务申报的复杂性和成本问题,以实现准确和可审计的税务推理。研究在SARA数据集上评估了不同系统变体,并引入了一种基于实际税务错误罚金的新成本估算方法。关键发现包括通过语义解析和形式逻辑程序的应用提高了性能,并将成本降低到了实际水平以下。
Context Forcing: Consistent Autoregressive Video Generation with Long Context
Authors: Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen
First: 2026-02-05T18:58:01+00:00 · Latest: 2026-02-05T18:58:01+00:00
Abstract
Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds -- 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
中文标题/摘要
标题:上下文强制:具有长上下文的一致性自回归视频生成
近期的实时长视频生成方法通常采用流式调优策略,试图通过短上下文(无记忆)教师训练一个长上下文学生。在这些框架中,学生进行长时间的展开,但只能从短至5秒的窗口中获得监督。这种结构上的不匹配导致了一个关键的学生-教师不匹配:由于教师无法访问长期历史,它无法引导学生理解全局时间依赖性,从而限制了学生能够使用的上下文长度。为了解决这一问题,我们提出了上下文强制这一新颖框架,通过长上下文教师训练长上下文学生。通过确保教师了解完整的生成历史,我们消除了监督不匹配,从而能够训练出能够长期一致的模型。为了使这一方法在极端时长(例如2分钟)下计算上可行,我们引入了一种上下文管理系统,将线性增长的上下文转换为慢速-快速记忆架构,显著减少了视觉冗余。大量实验结果表明,我们的方法能够实现超过20秒的有效上下文长度——比LongLive和Infinite-RoPE等最先进的方法长2到10倍。通过利用这种扩展的上下文,上下文强制能够保持在长时间内的一致性,超越了各种长视频评估指标上的最先进的基线方法。
Summary / 总结
The paper addresses the issue of student-teacher mismatch in real-time long video generation by proposing Context Forcing, which trains a long-context student using a long-context teacher. This method ensures the teacher has access to the full generation history, eliminating the supervision mismatch. To handle long durations computationally, a Slow-Fast Memory architecture is introduced, reducing visual redundancy. The results show that Context Forcing enables effective context lengths over 20 seconds, outperforming state-of-the-art methods like LongLive and Infinite-RoPE in terms of long-term consistency.
论文通过提出Context Forcing方法解决了实时长视频生成中的学生-教师不匹配问题,该方法使用长历史上下文的教师来训练长历史上下文的学生,确保教师能够访问完整的生成历史,从而提供准确的监督。为了应对计算需求,引入了Slow-Fast Memory架构,减少了视觉冗余。结果表明,Context Forcing可以实现超过20秒的上下文长度,优于LongLive和Infinite-RoPE等最先进的方法,在长视频评估指标上表现出更优的一致性。
Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
Authors: Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang
First: 2026-02-05T18:57:09+00:00 · Latest: 2026-02-05T18:57:09+00:00
Comments: Code is available at https://github.com/ViktorAxelsen/BudgetMem
Abstract
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
中文标题/摘要
标题:学习查询感知预算层级路由以运行时代理内存
内存对于大型语言模型(LLM)代理在单个上下文窗口之外运行变得越来越重要,但大多数现有系统依赖于离线、查询无关的内存构建,这可能效率低下并可能丢弃查询关键信息。尽管运行时内存利用是一个自然的替代方案,但先前的工作往往会产生大量开销,并且对性能成本权衡的控制有限。在本文中,我们提出了BudgetMem,这是一种运行时代理内存框架,用于明确、查询感知的性能成本控制。BudgetMem 将内存处理结构化为一组内存模块,每个模块提供三个预算层级(即低/中/高)。一个轻量级的路由器在模块之间执行预算层级路由,以平衡任务性能和内存构建成本,这通过强化学习训练的紧凑神经策略实现。使用BudgetMem作为统一的测试平台,我们研究了三种互补的预算层级实现策略:实现(方法复杂性)、推理(推理行为)和容量(模块模型大小)。在LoCoMo、LongMemEval和HotpotQA中,当优先考虑性能(即高预算设置)时,BudgetMem超越了强大的基线,并在更紧的预算下提供了更好的准确度成本前沿。此外,我们的分析将不同层级策略的优势和劣势分离开来,阐明了在不同预算条件下,每个轴在何时提供最有利的权衡。
Summary / 总结
BudgetMem is a runtime agent memory framework that allows explicit, query-aware control over performance and cost. It structures memory processing into three budget tiers and uses a lightweight router with a compact neural policy to route queries efficiently. BudgetMem outperforms strong baselines in high-budget settings and provides better accuracy-cost trade-offs under tighter budgets across different benchmarks.
该研究针对LLM代理中查询无关的内存构建效率低下问题,引入了BudgetMem,这是一种允许查询感知的性能-成本显式控制的运行时代理内存框架。BudgetMem使用轻量级路由器将内存处理路由到三个预算级别(低、中、高),并通过强化学习训练的紧凑型神经策略来平衡任务性能和内存构建成本。研究发现,在高预算设置下,BudgetMem在基准测试中表现出色,并在更紧的预算下提供了更好的准确性和成本折衷方案。
Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering
Authors: Miranda Muqing Miao, Young-Min Cho, Lyle Ungar
First: 2026-02-05T18:55:56+00:00 · Latest: 2026-02-05T18:55:56+00:00
Abstract
Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.
中文标题/摘要
标题:CORAL(正确性优化残差激活透镜):可移植且校准意识的推理时校正导向
大型语言模型(LLMs)在指令调优和偏好对齐后表现出持续的校准不足。修改后的训练目标可以改善校准,但重新训练成本高昂。推理时校正提供了一种轻量级的替代方案,但大多数现有方法优化的是正确性的代理指标而非正确性本身。我们引入了CORAL(正确性优化残差激活透镜),这是一种正则化推理时校正方法,通过权重衰减MLP探针捕捉模型内部激活中的分布式正确性信号。我们在三个7B参数模型上评估了CORAL,发现它在平均情况下将准确率提高了10%,预期校准误差(ECE)降低了50%。我们还证明了这些增益在无需重新训练的情况下转移到了四个保留基准测试的完整发布测试集(ARC-Challenge、HellaSwag、Math-MC、OpenBookQA)上,平均准确率提高了14%,ECE降低了49%。我们的结果支持了这样一个假设:当单个神经元不足时,可以使用正则化探针从模型内部提取分布式信息。因此,CORAL提供了一种计算高效、可移植且校准意识的方法,以提高推理时的多项选择题问答性能。
Summary / 总结
The research aims to address the persistent miscalibration in large language models (LLMs) after instruction tuning and preference alignment. CORAL, a regularized inference-time steering method, is introduced to capture distributed correctness signals from model internal activations using weight-decay MLP probes. Experiments across three 7B-parameter models show that CORAL improves accuracy by 10% and expected calibration error (ECE) by 50% on average. These improvements transfer to four held-out benchmarks without retraining, averaging 14% accuracy and 49% ECE improvements, demonstrating a compute-efficient, transferable, and calibration-aware approach to enhance MCQA performance during inference.
CORAL 是一种通过内部激活的权重衰减 MLP 探针优化正确性的正则化推理时校正方法。在三个 7B 参数模型上,CORAL 将准确率提高了 10%,预期校准误差降低了 50%。这些改进在四个未见过的基准测试集上无需重新训练也得到了验证,平均准确率提高了 14%,预期校准误差降低了 49%。
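The CORAL entry above reports gains in expected calibration error (ECE). As a reader aid, here is a minimal sketch of the common equal-width-bin ECE estimate: the sample-weighted mean gap between accuracy and average confidence per bin. The function name `expected_calibration_error` and the default of 10 bins are assumptions for illustration, and the authors' exact evaluation protocol may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: booleans, True where the chosen answer was right.
    n_bins: number of equal-width bins (an illustrative default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to the share of samples it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

For instance, four answers each given with confidence 0.9, of which three are correct, fall into one bin with accuracy 0.75, so the estimate is about 0.15.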
Diffusion Model's Generalization Can Be Characterized by Inductive Biases toward a Data-Dependent Ridge Manifold
Authors: Ye He, Yitong Qiu, Molei Tao
First: 2026-02-05T18:55:03+00:00 · Latest: 2026-02-05T18:55:03+00:00
Abstract
When a diffusion model is not memorizing the training data set, how exactly does it generalize? A quantitative understanding of the distribution it generates would be beneficial to, for example, an assessment of the model's performance for downstream applications. We thus explicitly characterize what a diffusion model generates by proposing a log-density ridge manifold and quantifying how the generated data relate to this manifold as the inference dynamics progress. More precisely, inference undergoes a reach-align-slide process centered around the ridge manifold: trajectories first reach a neighborhood of the manifold, then align as they are pushed toward or away from the manifold in normal directions, and finally slide along the manifold in tangent directions. Within the scope of this general behavior, different training errors lead to different normal and tangent motions, which can be quantified, and these detailed motions characterize when inter-mode generations emerge. A more detailed understanding of the training dynamics leads to a more accurate quantification of the generation inductive bias. As an example, we consider a random feature model, for which we explicitly illustrate how a diffusion model's inductive biases originate as a composition of architectural bias and training accuracy, and how they evolve with the inference dynamics. Experiments on synthetic multimodal distributions and MNIST latent diffusion support the predicted directional effects in both low and high dimensions.
中文标题/摘要
标题:扩散模型的泛化可以由数据依赖的岭流形上的归纳偏置来表征
当扩散模型不记忆训练数据集时,它如何泛化?对其生成分布的定量理解将有助于例如下游应用中模型性能的评估。因此,我们通过提出对数密度岭流形并量化生成数据与该流形的关系来明确表征扩散模型的生成内容。更具体地说,推理过程围绕岭流形进行一个到达-对齐-滑动的过程:轨迹首先到达流形的邻域,然后在法向方向被推近或远离流形,最后在切向方向沿着流形滑动。在这一一般行为的范围内,不同的训练误差会导致不同的法向和切向运动,这些运动可以被量化,并且这些详细的运动表征了跨模态生成何时出现。对训练动力学更详细的理解将导致对生成归纳偏置更准确的量化,我们将考虑一个随机特征模型的例子,其中可以明确展示扩散模型的归纳偏置如何作为架构偏置和训练准确性组成的组合而出现,并且如何随着推理动力学的发展而演变。在合成多模态分布和MNIST潜在扩散上的实验支持了预测的方向效应,在低维和高维空间中都是如此。
Summary / 总结
This study investigates how diffusion models generalize when they do not memorize the training data. By proposing a log-density ridge manifold, the research characterizes the generated distribution through a reach-align-slide process. The study finds that different training errors result in distinct normal and tangent motions, which can be quantified and explain when inter-mode generations occur. These findings provide a detailed understanding of the model's inductive biases and how they evolve during inference, supported by experiments on synthetic and MNIST data.
研究通过提出对数密度岭流形并分析推理动态,探讨了扩散模型的泛化机制。模型的推理过程被描述为围绕该流形的reach-align-slide机制,其中轨迹首先接近流形,然后沿法向方向对齐,最后沿切向方向滑动。不同的训练误差会导致不同的法向和切向运动,这些运动可以被量化以理解何时出现跨模态生成。实验结果支持这些发现,展示了模型归纳偏好的方向效应,包括合成数据和MNIST数据上的结果。
MambaVF: State Space Model for Efficient Video Fusion
Authors: Zixiang Zhao, Yukun Cui, Lilun Deng, Haowen Bai, Haotong Qin, Tao Feng, Konrad Schindler
First: 2026-02-05T18:53:47+00:00 · Latest: 2026-02-05T18:53:47+00:00
Abstract
Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment with a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF is highly efficient, cutting up to 92.25% of parameters and 88.79% of computational FLOPs while delivering a 2.1x speedup over existing methods. Project page: https://mambavf.github.io
中文标题/摘要
标题:MambaVF:基于状态空间模型的高效视频融合框架
视频融合是各种视频处理任务中的基本技术。然而,现有的视频融合方法严重依赖于光流估计和特征扭曲,导致了巨大的计算开销和有限的可扩展性。本文提出了一种基于状态空间模型(SSM)的高效视频融合框架MambaVF,该框架在无需显式运动估计的情况下进行时间建模。首先,通过将视频融合重新表述为一个顺序状态更新过程,MambaVF以线性复杂度捕获了长程时间依赖性,同时显著减少了计算和内存成本。其次,MambaVF提出了一种轻量级的基于SSM的融合模块,该模块通过时空双向扫描机制替代了传统的流引导对齐,从而实现了跨帧的高效信息聚合。在多个基准上的广泛实验表明,我们的MambaVF在多曝光、多焦点、红外可见和医学视频融合任务中达到了最先进的性能。我们强调MambaVF具有高效率,参数减少了高达92.25%,计算FLOPs减少了88.79%,并且比现有方法快2.1倍。项目页面:https://mambavf.github.io
Summary / 总结
MambaVF is an efficient video fusion framework that reformulates video fusion as a sequential state update process using state space models, reducing computational overhead and memory costs. It introduces a lightweight SSM-based fusion module that aggregates information across frames without explicit motion estimation, achieving state-of-the-art performance in various video fusion tasks while reducing up to 92.25% of parameters and 88.79% of computational FLOPs, and providing a 2.1x speedup compared to existing methods.
MambaVF 是一种高效的视频融合框架,通过将视频融合重新表述为状态空间模型来捕捉长距离的时间依赖性,而不进行显式的运动估计,从而减少计算开销和内存使用。它在多种视频融合任务中达到了最先进的性能,并且非常高效,参数减少高达 92.25%,计算 FLOPs 减少 88.79%,相比现有方法速度提升 2.1 倍。
A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies
Authors: Panagiotis Kaliosis, Adithya V Ganesan, Oscar N. E. Kjell, Whitney Ringwald, Scott Feltman, Melissa A. Carr, Dimitris Samaras, Camilo Ruggero, Benjamin J. Luft, Roman Kotov, Andrew H. Schwartz
First: 2026-02-05T18:53:17+00:00 · Latest: 2026-02-05T18:53:17+00:00
Comments: 18 pages, 3 figures, 5 tables
Abstract
Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge of which factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge such as subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs. few-shot, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) the performance of open-weight models (Llama, Deepseek) plateaus beyond 70B parameters, while closed-weight models (o3-mini, gpt-5) improve with newer generations; and (d) the best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest that the choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.
中文标题/摘要
标题:大型语言模型在 PTSD 严重程度估计中的系统评估:背景知识和建模策略的作用
大型语言模型(LLMs)越来越多地以零样本方式用于评估心理健康状况,但我们对影响其准确性的因素知之甚少。本研究利用包含1,437名个体自然语言叙述和自我报告的PTSD严重程度评分的临床数据集,全面评估了11种最先进的LLM的性能。为了理解影响准确性的因素,我们系统地变化了(i)背景知识,如子量表定义、分布摘要和访谈问题,以及(ii)建模策略,包括零样本与少量样本、推理努力程度、模型大小、结构化子量表与直接标量预测、输出重新缩放和九种集成方法。我们的研究结果表明:(a)当LLMs获得详细的构念定义和叙述背景时,其准确性最高;(b)增加推理努力程度可以提高估计准确性;(c)开放权重模型(Llama, Deepseek)在超过700亿参数后性能趋于平稳,而封闭权重(o3-mini, gpt-5)模型随着新版本的推出而性能提升;(d)当监督模型与零样本LLM集成时,可以获得最佳性能。综上所述,结果表明选择背景知识和建模策略对于部署LLMs以准确评估心理健康状况至关重要。
GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
Authors: Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang
First: 2026-02-05T18:52:48+00:00 · Latest: 2026-02-05T18:52:48+00:00
Comments: Project Page: https://genarena.github.io/, Code: https://github.com/ruihanglix/genarena
Abstract
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
中文标题/摘要
标题:GenArena:我们如何实现视觉生成任务的人类对齐评估?
视觉生成模型的快速发展已经超越了传统的评估方法,迫切需要采用视觉语言模型作为替代的评判者。在本文中,我们系统地研究了当前广泛使用的绝对点对点评分标准在各种视觉生成任务中的可靠性。我们的分析表明,这种范式由于随机不一致性和与人类感知的不良对齐而受到限制。为了解决这些限制,我们引入了GenArena,这是一种统一的评估框架,利用成对比较范式确保稳定且人类对齐的评估。关键的是,我们的实验揭示了一个变革性的发现,即简单采用这种成对协议可以使现成的开源模型超越顶级专有模型。值得注意的是,我们的方法将评估准确性提高了超过20%,并与权威的LMArena排行榜获得了0.86的斯皮尔曼相关性,远远超过了0.36的点对点方法相关性。基于GenArena,我们对多种视觉生成模型进行了基准测试,为视觉生成提供了严格的自动化评估标准。
Summary / 总结
This work addresses the limitations of traditional absolute pointwise scoring in evaluating visual generation models, which have outpaced traditional evaluation methods. It introduces GenArena, a pairwise comparison framework that enhances evaluation reliability and alignment with human perception. Experiments show that GenArena significantly improves evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, outperforming pointwise methods by a large margin.
该研究针对传统绝对点评分在评估视觉生成模型方面的局限性,这些模型已经超越了传统评估方法。作者引入了基于成对比较的GenArena统一评估框架,以确保稳定和人类一致的评估。实验表明,采用这种成对协议可以提高评估准确性超过20%,并与权威的LMArena排行榜实现了0.86的Spearman相关性,显著优于点评分方法。GenArena在各种任务中对最先进的视觉生成模型进行了基准测试,为视觉生成提供了一个严格和自动化的评估标准。
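The GenArena entry above rests on two computations: turning pairwise judgments into a model ranking, and comparing that ranking with a reference leaderboard via Spearman correlation. The sketch below illustrates both under stated assumptions: `win_rates` uses a plain round-robin win fraction (the abstract does not say how GenArena aggregates comparisons), and `prefer` is a hypothetical stand-in for a Vision-Language Model judge.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def win_rates(models, prefer):
    """Fraction of round-robin comparisons each model wins.

    prefer(a, b) is a hypothetical judge that returns the preferred model.
    """
    wins = {m: 0 for m in models}
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            wins[prefer(a, b)] += 1
    return {m: wins[m] / (len(models) - 1) for m in models}
```

A typical use would compute `win_rates` over the evaluated models, then pass the resulting scores and the reference leaderboard scores (in the same model order) to `spearman`.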
AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions
Authors: Xianyang Liu, Shangding Gu, Dawn Song
First: 2026-02-05T18:50:36+00:00 · Latest: 2026-02-05T18:50:36+00:00
Abstract
Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product-dependent valuations, and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at the link: https://github.com/SafeRL-Lab/AgenticPay.
中文标题/摘要
标题:AgenticPay:多智能体LLM谈判系统,用于买家卖家交易
基于大型语言模型(LLM)的代理越来越多地被期望自主谈判、协调和交易,但现有的基准测试缺乏评估语言介导的多智能体经济互动的规范性设置。我们引入了AgenticPay,这是一种多智能体买家卖家谈判基准和仿真框架,由自然语言驱动。AgenticPay 模拟了买家和卖家拥有私人约束和产品依赖价值的市场,并且必须通过多轮语言谈判达成协议,而不仅仅是数字竞价。该框架支持超过110项任务的多样化套件,从双边讨价还价到多对多市场,具有结构化的行动提取和可行性、效率和福利的度量标准。对最先进的专有和开源权重LLM的基准测试揭示了谈判表现的巨大差距,并突显了长期战略推理的挑战,确立了AgenticPay作为研究代理商业和语言驱动的市场互动的基础。代码和数据集可在以下链接获取:https://github.com/SafeRL-Lab/AgenticPay.
VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation
Authors: Jie Deng, Kaichun Yao, Libo Zhang
First: 2026-02-05T18:45:53+00:00 · Latest: 2026-02-05T18:45:53+00:00
Abstract
Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
中文标题/摘要
标题:VisRefiner:从视觉差异中学习以实现屏幕截图到代码生成
屏幕截图到代码生成旨在将用户界面屏幕截图转换为能够忠实再现目标布局和样式的可执行前端代码。现有的多模态大型语言模型直接从屏幕截图进行这种映射,但它们的训练过程中并未观察到生成代码的视觉结果。相比之下,人类开发人员会迭代地渲染他们的实现,将其与设计进行比较,并学习视觉差异如何与代码更改相关联。受此过程的启发,我们提出了一种名为VisRefiner的训练框架,该框架使模型能够从渲染预测与参考设计之间的视觉差异中学习。我们构建了差异对齐的监督,将视觉差异与相应的代码编辑关联起来,从而使模型能够理解外观变化是如何由实现更改引起的。在此基础上,我们引入了一个强化学习阶段,用于自我完善,其中模型通过观察渲染输出和目标设计之间的视觉差异,并相应地更新代码来改进其生成的代码。实验表明,VisRefiner显著提高了单步生成质量和布局保真度,同时赋予模型强大的自我完善能力。这些结果表明,从视觉差异中学习对于推进屏幕截图到代码生成的有效性。
Summary / 总结
VisRefiner is a training framework that enables models to learn from visual differences between rendered predictions and reference designs for screenshot-to-code generation. It constructs difference-aligned supervision to associate visual discrepancies with corresponding code edits and introduces a reinforcement learning stage for self-refinement. Experiments show that VisRefiner improves single-step generation quality and layout fidelity, and enhances the model's self-refinement ability.
VisRefiner 是一种训练框架,通过将渲染预测与参考设计之间的视觉差异与相应的代码编辑关联起来,使模型能够学习这些差异。它还引入了一种强化学习阶段进行自我精炼,模型通过观察渲染输出和目标设计之间的视觉差异来改进生成的代码。实验表明,VisRefiner 提高了一步生成质量和布局准确性,并增强了模型的自我精炼能力。
Transmuting prompts into weights
Authors: Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo
First: 2025-10-09T18:40:39+00:00 · Latest: 2026-02-05T18:44:09+00:00
Abstract
A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to token-dependent implicit weight updates (Dherin et al., 2025), we derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector-and-matrix-based model editing techniques and offer a direct, computationally grounded method for transmuting textual input into reusable weight updates.
中文标题/摘要
标题:将提示转化为权重
越来越多的研究表明,可以通过直接修改大型语言模型的内部状态,在推理时有效控制其行为,这些方法可以通过向其激活值添加向量或更新其权重矩阵来实现。虽然这些技术非常强大,但它们通常由经验性启发式方法指导,例如从对比提示的平均激活值中推导出引导向量。这项工作为这些干预措施提供了理论基础,解释了它们如何源自变压器架构的基本计算。基于最近发现的提示影响可以数学映射到与标记相关的隐式权重更新(Dherin等人,2025年),我们推导出一种原理性的方法,将这些信息凝练成与标记无关的思想向量和思想矩阵。这些构造为现有的向量和矩阵为基础的模型编辑技术提供了理论解释,并提供了一种直接且计算上可验证的方法,将文本输入转化为可重用的权重更新。
Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps
Authors: Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz
First: 2026-02-05T18:42:00+00:00 · Latest: 2026-02-05T18:42:00+00:00
Abstract
Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose "Diamond Maps", stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.
中文标题/摘要
标题:钻石地图:通过随机流图高效实现奖励对齐
流和扩散模型生成高质量样本,但在训练后适应用户偏好或约束仍然成本高昂且脆弱,这一挑战通常被称为奖励对齐。我们认为,高效的奖励对齐应该是生成模型本身的特性,而不是事后考虑的问题,并重新设计了模型以提高适应性。我们提出了“钻石地图”,一种随机流图模型,能够在推理时高效且准确地对齐到任意奖励。钻石地图将许多模拟步骤合并为单步采样器,类似于流图,同时保留了实现最优奖励对齐所需的随机性。这种设计使得搜索、序列蒙特卡洛和引导变得可扩展,因为它们能够高效且一致地估计价值函数。我们的实验表明,钻石地图可以通过从GLASS流中蒸馏学习,实现更强的奖励对齐性能,并且比现有方法更具可扩展性。我们的结果指出了生成模型在推理时能够快速适应任意偏好和约束的实际途径。
Summary / 总结
The research aims to improve the adaptability of generative models to user preferences and constraints, addressing the challenge of reward alignment. Diamond Maps, a new type of stochastic flow map model, are proposed to enable efficient and accurate reward alignment at inference time. These models combine the efficiency of flow maps with the necessary stochasticity for optimal reward alignment, making search and guidance scalable. Experiments demonstrate that Diamond Maps can be learned efficiently from GLASS Flows, perform better in reward alignment, and scale better than existing methods.
研究旨在解决生成模型中的奖励对齐问题,该问题在后训练阶段进行时成本高且脆弱。作者提出了Diamond Maps,这是一种随机流图模型,能够在推理时实现高效且准确的奖励对齐。Diamond Maps结合了流图的效率和最优奖励对齐所需的随机性,使搜索和指导变得可扩展。实验表明,Diamond Maps可以高效学习,实现更好的奖励对齐性能,并且比现有方法更具可扩展性。
DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
Authors: Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang
First: 2026-02-05T18:41:38+00:00 · Latest: 2026-02-05T18:41:38+00:00
Abstract
Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.
中文标题/摘要
标题:DSB:动态滑动块调度算法用于扩散大语言模型
扩散大语言模型(dLLMs)已成为文本生成的一种有前途的替代方案,以其原生支持并行解码而著称。实际上,块推理对于避免全局双向解码中的顺序错位并提高输出质量至关重要。然而,广泛使用的固定预定义块(朴素)调度策略忽略了语义难度,使其在质量和效率方面都是一种次优策略:它可能会过早地对不确定的位置做出承诺,同时推迟接近块边界的简单位置。在本文中,我们分析了朴素块调度的局限性,并揭示了根据语义难度动态调整调度以实现可靠和高效推理的重要性。受此启发,我们提出了动态滑动块(DSB),这是一种无需训练的块调度方法,使用动态大小的滑动块来克服朴素块的僵化。为了进一步提高效率,我们引入了DSB缓存,这是一种针对DSB设计的无需训练的KV缓存机制。在多个模型和基准上的广泛实验表明,DSB与DSB缓存一起,能够一致地提高dLLMs的生成质量和推理效率。代码已发布在https://github.com/lizhuo-luo/DSB。
Summary / 总结
This paper addresses the limitations of fixed block scheduling in diffusion large language models (dLLMs), which ignores semantic difficulty. It proposes Dynamic Sliding Block (DSB), a training-free scheduling method that uses a sliding block with a dynamic size, and DSB Cache, a training-free KV-cache mechanism tailored to DSB. Experiments across multiple models and benchmarks show that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs.
论文针对固定块调度在扩散大语言模型(dLLMs)中的局限性,提出了一种基于语义难度动态调整块大小的无训练方法Dynamic Sliding Block (DSB),以避免过早承诺和延迟。此外,还引入了DSB Cache来进一步提高效率。实验结果显示,在多个模型和基准上均实现了生成质量和推理效率的提升。
Layer-wise LoRA fine-tuning: a similarity metric approach
Authors: Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao
First: 2026-02-05T18:38:53+00:00 · Latest: 2026-02-05T18:38:53+00:00
Comments: Code is available at https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA
Abstract
Pre-training Large Language Models (LLMs) on web-scale datasets has become fundamental for advancing general-purpose AI. In contrast, enhancing their predictive performance on downstream tasks typically involves adapting their knowledge through fine-tuning. Parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), aim to reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. In comparison to full fine-tuning, these methods achieve over 99% reduction in trainable parameter count, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address this problem by systematically selecting only a few layers to fine-tune using LoRA or its variants. We argue that not all layers contribute equally to the model adaptation. Leveraging this, we identify the most relevant layers to fine-tune by measuring their contribution to changes in internal representations. Our method is orthogonal to and readily compatible with existing low-rank adaptation techniques. We reduce the trainable parameters in LoRA-based techniques by up to 50%, while maintaining the predictive performance across different models and tasks. Specifically, on encoder-only architectures, this reduction in trainable parameters leads to a negligible predictive performance drop on the GLUE benchmark. On decoder-only architectures, we achieve a small drop or even improvements in the predictive performance on mathematical problem-solving capabilities and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe competitive results relative to fine-tuning with LoRA modules in all layers. Code is available at: https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA
中文标题/摘要
标题:逐层LoRA微调:一种相似度度量方法
在网页规模数据集上预训练大型语言模型(LLMs)已成为推动通用人工智能发展的基础。相比之下,通过微调来增强其在下游任务中的预测性能通常涉及调整其知识。参数高效微调技术,如低秩适应(LoRA),旨在通过冻结预训练模型并更新较少的参数来降低此过程的计算成本。与全微调相比,这些方法的可训练参数数量减少了超过99%,具体取决于配置。不幸的是,随着LLMs的规模不断扩大,这种减少可能变得不足。在本文中,我们通过系统地选择仅微调少数几层来解决上述问题,使用LoRA或其变体。我们认为,并非所有层对模型适应的贡献都相等。利用这一点,我们通过测量它们对内部表示变化的贡献来识别最相关的层进行微调。我们的方法与现有的低秩适应技术是正交的,并且易于兼容。我们通过LoRA技术将可训练参数减少多达50%,同时在不同模型和任务上保持预测性能。具体而言,在仅编码器架构中,这种可训练参数的减少导致在GLUE基准测试上的预测性能下降可以忽略不计。在仅解码器架构中,我们在数学问题解决能力和编程任务上的预测性能上实现了轻微下降或甚至有所改进。最后,这种方法也适用于多模态模型,在这些模型中,我们还观察到与在所有层使用LoRA模块进行微调相比具有竞争力的结果。代码可在:https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA
Summary / 总结
This paper addresses the cost of fine-tuning large language models (LLMs) by proposing a layer-wise LoRA fine-tuning method. It selects a few critical layers for fine-tuning based on their contribution to internal representation changes, reducing the number of trainable parameters in LoRA-based techniques by up to 50%. Predictive performance is largely maintained: the drop on GLUE is negligible for encoder-only models, and decoder-only models show a small drop or even improvements on mathematical problem-solving and coding tasks. The method is compatible with existing low-rank adaptation techniques, and code is available at https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA.
本文提出了一种分层LoRA微调方法,通过基于内部表示变化选择关键层进行微调,从而将可训练参数减少多达50%,同时在不同模型和任务上保持或提高预测性能。该方法与现有的低秩适应技术兼容,并可在提供的代码库中获得。
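The abstract does not name the similarity metric, but the repository name (Layer-wise-LoRA-with-CKA) suggests Centered Kernel Alignment. The following is a minimal sketch, not the authors' implementation: it assumes linear CKA between each layer's input and output activations, and it assumes that a lower similarity (a larger change in representation) marks a layer as more relevant for fine-tuning. The names `linear_cka` and `rank_layers` are hypothetical.

```python
import math


def center_columns(X):
    """Subtract the per-feature mean from every row of X (n samples x d features)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def gram(X):
    """Linear-kernel Gram matrix X X^T (n x n)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def frob_inner(K, L):
    """Frobenius inner product of two equally sized matrices."""
    return sum(k * l for rk, rl in zip(K, L) for k, l in zip(rk, rl))


def linear_cka(X, Y):
    """Linear CKA between two activation matrices with the same number of rows."""
    K = gram(center_columns(X))
    L = gram(center_columns(Y))
    denom = math.sqrt(frob_inner(K, K) * frob_inner(L, L))
    return frob_inner(K, L) / denom if denom > 0 else 0.0


def rank_layers(activations, keep):
    """Return indices of the `keep` layers whose output differs most from their input.

    `activations` is a list of (layer_input, layer_output) pairs.
    Scoring by 1 - CKA is an assumption made for this sketch.
    """
    scores = [1.0 - linear_cka(x_in, x_out) for x_in, x_out in activations]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep])
```

Under these assumptions, LoRA modules would be attached only to the layers returned by `rank_layers`, and the remaining layers would stay frozen without adapters.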
SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model
Authors: Yu Guo, Zhiqiang Lao, Xiyun Song, Yubin Zhou, Heather Yu
First: 2026-01-12T05:03:12+00:00 · Latest: 2026-02-05T18:37:54+00:00
Comments: 12 pages, 14 figures, accepted in WACVW 2026
Abstract
Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of a Large Multimodal Model (LMM), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.
中文标题/摘要
标题:SIRR-LMM:基于大型多模态模型的单张图像反射去除
玻璃表面会产生复杂的反射和透射光相互作用,使得单张图像反射去除(SIRR)具有挑战性。现有数据集在合成数据中缺乏物理现实性,或在真实捕获中规模不足。我们提出了一种合成数据集生成框架,通过路径追踪3D玻璃模型在真实背景图像上创建具有多种玻璃属性、相机设置和后处理效果的物理准确反射场景。为了利用大型多模态模型(LMM)的能力,我们将图像层合并为单一复合输入,应用联合描述,并使用针对特定任务的LoRA进行微调,而不是进行全面参数训练。这使我们的方法在反射去除和分离性能方面优于现有最先进的方法。
Summary / 总结
The research addresses the challenge of single-image reflection removal (SIRR) from glass surfaces, which is complicated by the interactions of reflected and transmitted light. To overcome limitations in existing datasets, the authors created a new synthetic dataset by path-tracing 3D glass models over real backgrounds, ensuring physical accuracy. They then used a Large Multimodal Model (LMM) with a composite input of image layers, joint captioning, and fine-tuning with task-specific LoRA, achieving better reflection removal and separation than previous methods.
研究旨在解决来自玻璃表面的单图像反射去除(SIRR)问题,这由于反射和透射光的相互作用而变得复杂。为克服现有数据集的局限性,作者开发了一种合成数据生成框架,使用路径追踪技术准确模拟反射场景。然后,他们使用大型多模态模型(LMM),通过连接图像层并使用任务特定的LoRA进行微调,实现了比先前方法更好的反射去除和分离性能。
RISE-Video: Can Video Generators Decode Implicit World Rules?
Authors: Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang
First: 2026-02-05T18:36:10+00:00 · Latest: 2026-02-05T18:36:10+00:00
Comments: 38 pages, 16 figures, 3 tables; Code: https://github.com/VisionXLab/RISE-Video; HuggingFace: https://huggingface.co/datasets/VisionXLab/RISE-Video
Abstract
While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
中文标题/摘要
标题:RISE-Video:视频生成器能否解码隐含的世界规则?
尽管生成式视频模型在视觉保真度方面取得了显著进展,但它们在内化和推理隐含世界规则方面的能力仍然是一个关键但尚未充分探索的领域。为弥合这一差距,我们提出了RISE-Video,这是一种开创性的基于文本-图像到视频(TI2V)合成的认知推理基准,将评估重点从表面美学转移到深层次的认知推理。RISE-Video 包含467个精心的人工标注样本,涵盖八个严格的类别,为从常识和空间动态到专业主题领域的模型智能提供了一个结构化的测试平台。我们的框架引入了四个多维度评估指标:推理一致性、时间一致性、物理合理性以及视觉质量。为了进一步支持可扩展的评估,我们提出了一种基于大型多模态模型(LMM)的自动化流程,以模拟人类评估。在11个最先进的TI2V模型上的广泛实验揭示了在隐含约束下模拟复杂场景的普遍缺陷,为未来世界模拟生成模型的发展提供了宝贵的见解。
Summary / 总结
RISE-Video is a reasoning-oriented benchmark for Text-Image-to-Video synthesis that evaluates models based on their ability to reason about implicit world rules, rather than just visual aesthetics. It includes 467 human-annotated samples across eight categories and introduces four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. Experiments on 11 state-of-the-art models highlight their limitations in handling complex scenarios under implicit constraints, providing valuable insights for future model development.
RISE-Video 是一个针对文本-图像到视频合成模型的推理导向基准,旨在评估模型处理隐含世界规则的能力。它包含467个人工注释的样本,涵盖八个类别,并引入了一个多维度评估协议,包括四个指标:推理一致性、时间一致性、物理合理性以及视觉质量。对11个最先进的模型的实验表明,它们在处理隐含约束下的复杂场景时存在缺陷,强调了未来生成模型中增强模型智能的必要性。
DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents
Authors: Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, You Li, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers
First: 2025-10-29T02:21:10+00:00 · Latest: 2026-02-05T18:29:20+00:00
Abstract
Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior (e.g., premature convergence) and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 36,383 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, enabling evaluation at the utterance and group levels (and supporting future individual-level analyses). We instantiate "digital twin" RPLAs with seven LLMs and evaluate across two settings: next-message prediction and full conversation rollout, using stance-alignment and opinion-convergence metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. Post-training via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improves stance alignment and brings group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions.
中文标题/摘要
标题:辩论:评估角色扮演大语言模型代理意见动态的大规模基准
准确地通过社会互动建模意见变化对于理解并缓解极化、错误信息和社会冲突至关重要。近期研究使用角色扮演大语言模型代理(RPLA)模拟意见动态,但多代理模拟往往表现出不自然的群体行为(例如,过早收敛),并且缺乏评估其与真实人类群体互动一致性的经验基准。我们引入了DEBATE,这是一个大规模基准,用于评估多代理RPLA模拟中意见动态的真实性。DEBATE 包含来自708个群体和107个主题的2,832名美国参与者的36,383条消息,包括公开消息和私人李克特量表信念,这使得可以在语句和群体层面进行评估(并支持未来个体层面的分析)。我们使用七种大语言模型实例化“数字双胞胎”RPLA,并在两种设置下进行评估:下一条消息预测和完整对话展开,使用立场一致性和意见收敛度指标。在零样本设置中,RPLA群体相对于人类群体表现出强烈的意见收敛。通过监督微调(SFT)和直接偏好优化(DPO)进行训练后,立场一致性和群体层面的收敛度更接近人类行为,尽管意见变化和信念更新仍存在差异。DEBATE 使模拟意见动态的严格基准化成为可能,并支持未来研究如何使多代理RPLA与现实人类互动相一致。
Summary / 总结
The research introduces DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent simulations with role-playing LLM agents (RPLAs). DEBATE consists of 36,383 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, allowing evaluation at the utterance and group levels. The study instantiates RPLAs with seven LLMs and evaluates them in two settings: next-message prediction and full conversation rollout. In zero-shot settings, RPLA groups converge in opinion more strongly than human groups do. Post-training with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improves stance alignment and brings group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating persist.
研究旨在通过引入DEBATE基准来评估多代理角色扮演LLM代理(RPLA)模拟中的意见动态的真实性。DEBATE包含来自2,832名参与者的36,383条消息,覆盖708个群体和107个主题,允许在语句和群体层面进行评估。研究使用七种LLM实例化RPLA,并在下一个消息预测和完整对话展开设置中进行评估。在零样本设置中,RPLA群体在意见收敛方面表现出色,但通过监督微调和直接偏好优化进行后训练可以改善立场对齐,并使群体层面的收敛更接近人类行为,尽管意见变化和信念更新的差异仍然存在。
SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
Authors: Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, Chen Zhao
Venue: ACL
First: 2026-02-05T18:25:24+00:00 · Latest: 2026-02-05T18:25:24+00:00
Comments: Submission to ACL ARR 2026 January
Abstract
Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a retrieval corpus of 200,000 papers. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
中文标题/摘要
标题:SAGE:评估和提升深度研究代理的检索能力
深度研究代理已发展成为处理复杂查询的强大系统。与此同时,基于LLM的检索器展示了在遵循指令或推理方面的能力。这引发了一个关键问题:基于LLM的检索器能否有效支持深度研究代理的工作流程?为了探讨这一问题,我们引入了SAGE,这是一个由1200个跨四个科学领域的问题组成的科学文献检索基准,包含20万篇论文的检索语料库。我们评估了六种深度研究代理,并发现所有系统在需要推理的检索任务中都表现不佳。以DR Tulu为骨干,我们进一步比较了BM25和基于LLM的检索器(即ReasonIR和gte-Qwen2-7B-instruct)作为替代搜索工具。令人惊讶的是,BM25在性能上显著优于基于LLM的检索器,大约高出30%,因为现有代理生成的是关键词导向的子查询。为了提高性能,我们提出了一种基于语料库的测试时缩放框架,利用LLM增强文档的元数据和关键词,使现成的检索器更容易进行检索。这分别在简短和开放式问题上提高了8%和2%。
Summary / 总结
The paper introduces SAGE, a benchmark for evaluating scientific literature retrieval, involving 1,200 queries across four domains and a corpus of 200,000 papers. It evaluates six deep research agents and finds that they struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, it compares BM25 and LLM-based retrievers, finding BM25 outperforms by about 30%. To enhance performance, the authors propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, improving retrieval by 8% and 2% for short-form and open-ended questions, respectively.
该研究引入了SAGE,一个包含1,200个跨四个领域查询的科学文献检索基准,以及一个包含200,000篇论文的语料库。它评估了六种深度研究代理,并发现它们在需要推理的任务上表现不佳。使用DR Tulu作为基础,它比较了BM25和基于LLM的检索器,结果显示BM25在性能上比基于LLM的检索器高出约30%。为了提高性能,作者提出了一种基于语料库的测试时扩展框架,利用LLM为文档添加元数据和关键词,从而分别在短形式和开放式问题上获得了8%和2%的提升。
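To illustrate why keyword-oriented sub-queries favor BM25, and why augmenting documents with keywords can help an off-the-shelf retriever, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the SAGE pipeline: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the function name `bm25_scores` are assumptions, and any appended keywords would stand in for the LLM-generated metadata the abstract describes.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems use richer analysis)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that discusses a topic without using the query's exact terms scores zero under BM25. Appending relevant keywords to that document at indexing time gives the same query something to match, which is the intuition behind corpus-level augmentation.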