arXiv Paper Digest

LongVideoAgent: Multi-Agent Reasoning with Long Videos

Authors: Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen

First: 2025-12-23T18:59:49+00:00 · Latest: 2025-12-23T18:59:49+00:00

Abstract

Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.

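Title and Abstract (translated from Chinese)

Title: LongVideoAgent: Multi-Agent Reasoning over Long Videos

Recent multimodal LLMs and tool-using systems for long-video question answering suggest that reasoning over hour-long episodes is feasible. However, many methods still compress content into lossy summaries or rely on limited toolsets, which weakens temporal grounding and misses fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent, which localizes question-relevant segments, and a vision agent, which extracts targeted textual observations. The master agent plans under a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips through grounding, supplements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens the trained agent's reasoning and planning. Code and data will be shared at https://longvideoagent.github.io/.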

Summary

The research aims to improve long-video question answering with a multi-agent framework in which a master language model coordinates a grounding agent, which localizes question-relevant segments, and a vision agent, which extracts targeted textual observations. The master agent plans under a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient cooperation. The system significantly outperforms strong non-agent baselines on the LongTVQA and LongTVQA+ datasets, and the experiments indicate that reinforcement learning further strengthens the trained agent's reasoning and planning. According to the abstract, code and data will be shared at https://longvideoagent.github.io/.

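The study sets out to improve long-video question answering by building a multi-agent framework in which a master language model coordinates a grounding agent and a vision agent. The master agent plans under a step limit and is trained with reinforcement learning to strengthen multi-agent cooperation. On the LongTVQA and LongTVQA+ datasets the system significantly outperforms non-agent baselines, with reinforcement learning further improving reasoning and planning. Code and data are to be shared at https://longvideoagent.github.io/.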
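
The coordination pattern described in the abstract (a master agent that calls a grounding agent and a vision agent under a step limit) can be illustrated with a toy loop. This is only a sketch under heavy assumptions, not the authors' implementation: `ground`, `look`, `master_agent`, and `Trajectory` are hypothetical names, the "agents" are simple stand-in functions rather than models, and the master follows a fixed rule instead of a policy trained with reinforcement learning.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Trajectory:
    """Record of the steps taken, kept so the run can be inspected afterwards."""
    steps: List[Tuple[str, object]] = field(default_factory=list)
    answer: Optional[str] = None


def ground(question: str, subtitles: List[str]) -> int:
    """Hypothetical grounding agent: pick the clip whose subtitle shares the most words with the question."""
    q_words = set(question.lower().split())
    overlaps = [len(q_words & set(s.lower().split())) for s in subtitles]
    return max(range(len(subtitles)), key=overlaps.__getitem__)


def look(clip_index: int, frame_notes: List[str]) -> str:
    """Hypothetical vision agent: return a textual observation for one clip."""
    return frame_notes[clip_index]


def master_agent(question, subtitles, frame_notes, max_steps=3) -> Trajectory:
    """Fixed-rule stand-in for the master: ground, then look, then answer, within max_steps."""
    traj = Trajectory()
    clip = None
    observation = None
    for _ in range(max_steps):
        if clip is None:
            clip = ground(question, subtitles)
            traj.steps.append(("ground", clip))
        elif observation is None:
            observation = look(clip, frame_notes)
            traj.steps.append(("look", observation))
        else:
            traj.answer = f"{subtitles[clip]} | {observation}"
            traj.steps.append(("answer", traj.answer))
            break
    return traj  # answer stays None if the step budget ran out first
```

With a budget of three steps the toy master grounds, looks, and answers; with a smaller budget it stops without an answer, which is the behavior a step limit is meant to enforce.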

SpatialTree: How Spatial Abilities Branch Out in MLLMs

Authors: Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang

First: 2025-12-23T18:59:46+00:00 · Latest: 2025-12-23T18:59:46+00:00

Comments: webpage: https://spatialtree.github.io/

Abstract

Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.

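Title and Abstract (translated from Chinese)

Title: SpatialTree: How Spatial Abilities Branch Out in MLLMs

Cognitive science indicates that spatial ability develops step by step, from perception to reasoning and interaction. In multimodal LLMs (MLLMs), however, this hierarchy is still unclear, because most studies concentrate on a small number of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that divides spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we build the first capability-centric hierarchical benchmark and comprehensively evaluate mainstream MLLMs on 27 sub-abilities. The results reveal a clear structure: L1 skills are largely independent of one another, whereas higher-level skills are strongly correlated, indicating growing interdependency. Through targeted supervised fine-tuning, we find a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities, with notable synergy. Finally, we explore how to improve the whole hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but harms intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, allowing RL to improve performance consistently at every level. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.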

Summary

The research aims to understand how spatial abilities develop in multimodal large language models (MLLMs) by introducing SpatialTree, a cognitive-science-inspired hierarchy with four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). The study evaluates mainstream MLLMs across 27 sub-abilities and finds that L1 skills are largely orthogonal, while higher-level skills are strongly correlated. Targeted supervised fine-tuning reveals negative transfer within L1 but strong cross-level transfer from low- to high-level abilities. The study also finds that naive reinforcement learning (RL) that encourages extensive "thinking" helps complex reasoning but hurts intuitive perception, and it proposes an auto-think strategy that suppresses unnecessary deliberation so that RL improves performance consistently across all levels.

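The study seeks to understand the development of spatial abilities in multimodal large language models (MLLMs) by introducing SpatialTree, a cognitive-science-inspired hierarchy that divides these abilities into four levels: perception, mental mapping, simulation, and agentic competence. It evaluates mainstream MLLMs on 27 sub-abilities and finds that lower-level skills are essentially independent while higher-level skills are strongly correlated. Through fine-tuning, it reveals a transfer dynamic in which low-level skills strengthen high-level abilities, and it proposes an auto-think strategy that suppresses unnecessary deliberation, yielding consistent gains at every level of the hierarchy.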
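
The claim that some skills are "largely orthogonal" while others are "strongly correlated" is a statement about how per-model scores on two sub-abilities co-vary. One common way to quantify this is the Pearson correlation; the abstract does not say which statistic the authors use, so treating it as Pearson is an assumption here. The skill names and numbers below are invented purely to exercise the function and are not benchmark results.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length score lists (one score per model)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented placeholder scores for four hypothetical models (NOT data from the paper).
toy_scores = {
    "skill_a": [1.0, 2.0, 3.0, 4.0],
    "skill_b": [2.0, 4.0, 6.0, 8.0],    # rises in lockstep with skill_a
    "skill_c": [1.0, -1.0, -1.0, 1.0],  # varies with no linear relation to skill_a
}

r_ab = pearson(toy_scores["skill_a"], toy_scores["skill_b"])  # strongly correlated pair
r_ac = pearson(toy_scores["skill_a"], toy_scores["skill_c"])  # orthogonal pair
```

A value near 1 for a pair of skills corresponds to the "strongly correlated" case, and a value near 0 to the "orthogonal" case.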

Active Intelligence in Video Avatars via Closed-loop World Modeling

Authors: Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen

First: 2025-12-23T18:59:16+00:00 · Latest: 2025-12-23T18:59:16+00:00

Comments: Project Page: https://xuanhuahe.github.io/ORCA/

Abstract

Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.

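Title and Abstract (translated from Chinese)

Title: Active Intelligence in Video Avatars via Closed-loop World Modeling

Current video avatar generation methods perform well at identity preservation and motion alignment but lack genuine agency and cannot autonomously pursue long-term goals through adaptive interaction with the environment. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework to give video avatars active intelligence. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that keeps state tracking robust under generative uncertainty by continuously checking predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture in which System 2 performs strategic reasoning and state prediction while System 1 turns abstract plans into concrete, model-specific action instructions. By modeling avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA lets avatars complete multi-step tasks autonomously in open-domain scenarios. Extensive experiments show that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for moving video avatar intelligence from passive animation to active, goal-oriented behavior.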

Summary

The research addresses the lack of genuine agency in current video avatar generation methods by introducing L-IVA and ORCA. L-IVA is a task and benchmark for evaluating goal-directed planning, while ORCA is a framework that enables active intelligence in video avatars through a closed-loop OTAR cycle and a hierarchical dual-system architecture. Experiments show that ORCA outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating the effectiveness of the IWM-inspired design for advancing video avatar intelligence.

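The study tackles the lack of genuine agency in current video avatar generation methods by introducing L-IVA and ORCA. L-IVA is a task and benchmark for evaluating goal-directed planning, while ORCA is a framework that gives video avatars active intelligence. With its closed-loop OTAR cycle and hierarchical dual-system architecture, ORCA can complete multi-step tasks autonomously and significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence.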
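
The closed-loop idea behind the OTAR cycle (act, then verify the predicted outcome against what was actually generated before updating the belief state) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ORCA itself: `run_otar` and `flaky_generator` are hypothetical names, the stochastic video generator is replaced by a stub that fails on its first attempt at each action, and outcome verification is reduced to string equality.

```python
from typing import Callable, List, Tuple


def run_otar(sub_goals: List[str],
             generate: Callable[[str], str],
             max_attempts: int = 20) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Toy Observe-Think-Act-Reflect loop over an ordered list of sub-goals."""
    belief: List[str] = []             # sub-goals believed to be completed
    log: List[Tuple[str, str]] = []    # (action, verdict) pairs for inspection
    attempts = 0
    while len(belief) < len(sub_goals) and attempts < max_attempts:
        attempts += 1
        # Observe + Think: read the belief state, pick the next sub-goal, predict its outcome.
        action = sub_goals[len(belief)]
        predicted = action
        # Act: ask the (unreliable) generator to realize the action.
        actual = generate(action)
        # Reflect: only update the belief if the outcome matches the prediction.
        if actual == predicted:
            belief.append(action)
            log.append((action, "accepted"))
        else:
            log.append((action, "retry"))
    return belief, log


def flaky_generator() -> Callable[[str], str]:
    """Stub generator that produces a wrong outcome the first time each action is tried."""
    seen = set()

    def generate(action: str) -> str:
        if action not in seen:
            seen.add(action)
            return "unexpected outcome"
        return action

    return generate
```

An open-loop variant would append every action to `belief` without the check, so the first failed generation would silently corrupt the tracked state; the reflect step is what triggers the retry instead.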

Making Large Language Models Efficient Dense Retrievers

Authors: Yibin Lei, Shwai He, Ang Li, Andrew Yates

First: 2025-12-23T18:58:25+00:00 · Latest: 2025-12-23T18:58:25+00:00