arXiv 论文速递

2025-12-25 03:20
Snapshot: 20251225_0320
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Authors: Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen
First: 2025-12-23T18:59:49+00:00 · Latest: 2025-12-23T18:59:49+00:00
Abstract
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.
中文标题/摘要
标题:长视频代理:多智能体长视频推理
近期多模态LLM和使用工具进行长视频问答的系统表明,可以在长达一小时的剧集中进行推理。然而,许多方法仍然将内容压缩成失真的摘要,或者依赖有限的工具集,削弱了时间定位并错过了细微线索。我们提出了一种多智能体框架,在该框架中,一个主LLM协调一个定位智能体来定位与问题相关的时间段,并协调一个视觉智能体来提取目标文本观察。主智能体在步数限制下进行规划,并通过强化学习训练以促进简洁、准确和高效的多智能体合作。这种设计有助于主智能体通过定位关注相关片段,补充字幕的视觉细节,并产生可解释的轨迹。在我们提出的LongTVQA和LongTVQA+(从TVQA/TVQA+汇总而成的集水平数据集)上,我们的多智能体系统显著优于强大的非智能体基线。实验还表明,强化学习进一步增强了训练智能体的推理和规划能力。代码和数据将在https://longvideoagent.github.io/上共享。
Summary / 总结
The research aims to improve long-video question answering with a multi-agent framework in which a master language model coordinates a grounding agent and a vision agent. The master agent plans under a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient cooperation. The system significantly outperforms non-agent baselines on the LongTVQA and LongTVQA+ datasets, and reinforcement learning further strengthens the trained agent's reasoning and planning. The authors state that code and data will be shared at https://longvideoagent.github.io/.
研究旨在通过开发一个多代理框架来提高长视频问答能力,该框架使用主语言模型协调定位代理和视觉代理。主代理以步限进行规划,并通过强化学习训练以增强多代理合作。该系统在LongTVQA和LongTVQA+数据集上显著优于非代理基线,展示了通过强化学习增强的推理和规划能力。代码和数据可在https://longvideoagent.github.io/获取。
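The coordination pattern described in the LongVideoAgent entry above (a master agent that alternates grounding and visual lookup under a step limit) can be sketched as a small loop. This is only an illustration: the `ground`, `look`, and `decide` callables, the `Trajectory` type, and the toy agents are hypothetical stand-ins, since the digest does not describe the actual interfaces or the reinforcement-learning training.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types: the source does not specify these interfaces.
Segment = Tuple[int, int]  # (start_second, end_second)


@dataclass
class Trajectory:
    steps: List[str] = field(default_factory=list)
    answer: str = ""


def run_master(
    question: str,
    ground: Callable[[str], Segment],
    look: Callable[[Segment, str], str],
    decide: Callable[[str, List[str]], str],
    max_steps: int = 3,
) -> Trajectory:
    """Localize a segment, collect a textual observation, then try to answer."""
    traj = Trajectory()
    observations: List[str] = []
    for _ in range(max_steps):
        segment = ground(question)            # grounding agent
        traj.steps.append(f"ground -> {segment}")
        obs = look(segment, question)         # vision agent
        observations.append(obs)
        traj.steps.append(f"vision -> {obs}")
        answer = decide(question, observations)
        if answer:                            # empty string means "need more evidence"
            traj.answer = answer
            return traj
    traj.answer = "unknown"                   # step budget exhausted
    return traj


# Toy agents, for illustration only.
def toy_ground(question: str) -> Segment:
    return (120, 150)


def toy_look(segment: Segment, question: str) -> str:
    return "a red mug is on the desk"


def toy_decide(question: str, observations: List[str]) -> str:
    return "red" if any("red" in o for o in observations) else ""


result = run_master("What colour is the mug?", toy_ground, toy_look, toy_decide)
```

The recorded `steps` list plays the role of the interpretable trajectory mentioned in the abstract; the step limit is what keeps a confused master from looping indefinitely.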
SpatialTree: How Spatial Abilities Branch Out in MLLMs
Authors: Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang
First: 2025-12-23T18:59:46+00:00 · Latest: 2025-12-23T18:59:46+00:00
Comments: webpage: https://spatialtree.github.io/
Abstract
Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.
中文标题/摘要
标题:SpatialTree:多模态LLMs中空间能力的分支发展
认知科学表明,空间能力从感知到推理和互动逐步发展。然而,在多模态LLMs(MLLMs)中,这种层次结构仍不明确,大多数研究集中在少数任务上。我们引入了SpatialTree,这是一种认知科学启发式的层次结构,将空间能力分为四个层次:低级感知(L1)、心理制图(L2)、模拟(L3)和能动性(L4)。基于这一分类,我们构建了第一个能力导向的层次基准,全面评估了主流MLLMs的27个子能力。评估结果揭示了一个清晰的结构:L1技能大多相互独立,而更高层次的技能则高度相关,表明了不断增加的相互依赖性。通过有针对性的监督微调,我们发现了一个令人惊讶的转移动态:L1内的负向转移,但低级到高级能力之间存在强大的跨层次转移,且具有显著的协同效应。最后,我们探讨了如何改进整个层次结构。我们发现,鼓励大量“思考”的简单RL是不可靠的:它有助于复杂推理,但损害了直观感知。我们提出了一种简单的自动思考策略,抑制不必要的思考,使RL能够在所有层次上一致地提高性能。通过构建SpatialTree,我们提供了一个概念验证框架,用于理解和系统地扩展MLLMs中的空间能力。
Summary / 总结
The research aims to understand how spatial abilities develop in multimodal large language models (MLLMs) by introducing SpatialTree, a cognitive-science-inspired hierarchy. The hierarchy organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). The study evaluates mainstream MLLMs across 27 sub-abilities and finds that L1 skills are largely independent, while higher-level skills are strongly correlated. Targeted supervised fine-tuning reveals negative transfer within L1 but strong cross-level transfer from lower to higher abilities. The research also finds that naive reinforcement learning (RL) that encourages extensive "thinking" helps complex reasoning but hurts intuitive perception, and it proposes an auto-think strategy that suppresses unnecessary deliberation so that RL improves performance consistently across all levels.
研究旨在通过引入SpatialTree这一认知科学启发的层次结构来理解多模态语言模型(MLLMs)中的空间能力发展,该层次结构将这些能力分为感知、心理映射、模拟和行动能力四个级别。研究对主流MLLMs的27个子能力进行了评估,并发现较低级别的技能基本上是独立的,而高级别的技能则高度相关。通过微调,研究揭示了低级别技能可以增强高级别能力的转移动态,并提出了一种自动思考策略来抑制不必要的思考,从而在所有级别的层次结构中持续提升性能。
Active Intelligence in Video Avatars via Closed-loop World Modeling
Authors: Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen
First: 2025-12-23T18:59:16+00:00 · Latest: 2025-12-23T18:59:16+00:00
Comments: Project Page: https://xuanhuahe.github.io/ORCA/
Abstract
Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
中文标题/摘要
标题:通过闭环世界建模的视频头像中的主动智能
当前的视频头像生成方法在身份保留和运动对齐方面表现出色,但缺乏真正的自主性,无法通过适应性环境交互自主追求长期目标。我们通过引入L-IVA(长期交互视觉头像)任务和基准来解决这一问题,用于评估随机生成环境中的目标导向规划,以及ORCA(在线推理和认知架构),这是第一个使视频头像具备主动智能的框架。ORCA 通过两个关键创新体现了内部世界模型(IWM)能力:(1) 闭环OTAR循环(观察-思考-行动-反思),通过不断验证预测结果与实际生成结果来在生成不确定性下保持稳健的状态跟踪;(2) 分层双系统架构,其中系统2进行战略推理并预测状态,系统1将抽象计划转化为具体的模型特定行动指令。通过将头像控制建模为POMDP并实施连续信念更新和结果验证,ORCA 使头像能够在开放域场景中自主完成多步任务。大量实验表明,ORCA 在任务成功率和行为一致性方面显著优于开环和非反思基线,验证了我们基于IWM的设计,使视频头像智能从被动动画提升到主动、目标导向的行为。
Summary / 总结
The research addresses the lack of genuine agency in current video avatar generation methods by introducing L-IVA and ORCA. L-IVA is a task and benchmark for evaluating goal-directed planning, while ORCA is a framework that enables active intelligence in video avatars through a closed-loop OTAR cycle and a hierarchical dual-system architecture. Experiments show that ORCA outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating the effectiveness of the IWM-inspired design for advancing video avatar intelligence.
研究通过引入L-IVA和ORCA来解决当前视频头像生成方法缺乏真正自主性的问题。L-IVA是一个用于评估目标导向规划的任务和基准,而ORCA是一个使视频头像具备主动智能的框架。ORCA具备闭环OTAR循环和分层双系统架构,能够实现自主多步任务完成,并在任务成功率和行为一致性方面显著优于开环和非反思基线。
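The closed-loop Observe-Think-Act-Reflect cycle from the ORCA entry above can be illustrated with a toy loop that only advances its belief state when the observed outcome matches the predicted one. Everything here is an assumption made for illustration: `otar_loop`, `flaky_generator`, the retry budget, and the use of plain strings as outcomes are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Tuple


def otar_loop(
    plan: List[str],
    act: Callable[[str, int], str],
    max_retries: int = 2,
) -> Tuple[List[str], bool]:
    """Advance through sub-goals, keeping only outcomes that were verified."""
    belief: List[str] = []  # sub-goals verified as completed
    for subgoal in plan:    # Think: choose the next sub-goal and predict its outcome
        for attempt in range(max_retries + 1):
            predicted = subgoal
            observed = act(subgoal, attempt)  # Act, then Observe the generated result
            if observed == predicted:         # Reflect: verify prediction against outcome
                belief.append(subgoal)
                break
        else:
            return belief, False              # retries exhausted, goal not reached
    return belief, True


def flaky_generator(subgoal: str, attempt: int) -> str:
    # Toy stochastic environment: the first attempt at "pour water" goes wrong.
    if subgoal == "pour water" and attempt == 0:
        return "spill water"
    return subgoal


state, done = otar_loop(["pick up kettle", "pour water", "serve tea"], flaky_generator)
```

An open-loop baseline corresponds to `max_retries=0` with no verification, which is where the toy run would stop after the first sub-goal.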
Making Large Language Models Efficient Dense Retrievers
Authors: Yibin Lei, Shwai He, Ang Li, Andrew Yates
First: 2025-12-23T18:58:25+00:00 · Latest: 2025-12-23T18:58:25+00:00
Abstract
Recent work has shown that directly fine-tuning large language models (LLMs) for dense retrieval yields strong performance, but their substantial parameter counts make them computationally inefficient. While prior studies have revealed significant layer redundancy in LLMs for generative tasks, it remains unclear whether similar redundancy exists when these models are adapted for retrieval tasks, which require encoding entire sequences into fixed representations rather than generating tokens iteratively. To this end, we conduct a comprehensive analysis of layer redundancy in LLM-based dense retrievers. We find that, in contrast to generative settings, MLP layers are substantially more prunable, while attention layers remain critical for semantic aggregation. Building on this insight, we propose EffiR, a framework for developing efficient retrievers that performs large-scale MLP compression through a coarse-to-fine strategy (coarse-grained depth reduction followed by fine-grained width reduction), combined with retrieval-specific fine-tuning. Across diverse BEIR datasets and LLM backbones, EffiR achieves substantial reductions in model size and inference cost while preserving the performance of full-size models.
中文标题/摘要
标题:提高大型语言模型高效密集检索器的效率
近期研究表明,直接对大型语言模型(LLMs)进行密集检索微调可以取得良好的性能,但其庞大的参数量使其在计算上不够高效。尽管先前的研究揭示了LLMs在生成任务中存在显著的层冗余,但对于需要将整个序列编码为固定表示的检索任务而言,是否也存在类似的冗余尚不清楚。为此,我们对基于LLM的密集检索器中的层冗余进行了全面分析。我们发现,与生成设置不同,MLP层可以大幅压缩,而注意力层对于语义聚合仍然至关重要。基于这一洞察,我们提出了EffiR框架,该框架通过粗到细策略(粗粒度深度减少后进行细粒度宽度减少)大规模压缩MLP,并结合检索特定的微调,从而在不同BEIR数据集和LLM基础模型上实现了模型大小和推理成本的显著减少,同时保持了全尺寸模型的性能。
Summary / 总结
This study addresses the computational inefficiency of large language models (LLMs) when fine-tuned for dense retrieval tasks. By analyzing layer redundancy in LLM-based dense retrievers, the authors find that MLP layers are more prunable than attention layers. Based on this insight, they propose EffiR, a framework that compresses MLP layers through a coarse-to-fine strategy and includes retrieval-specific fine-tuning. EffiR significantly reduces model size and inference cost without compromising performance on various BEIR datasets and LLM backbones.
该研究旨在通过解决大型语言模型(LLM)在密集检索中的计算效率问题来提高其效率。研究分析了LLM基于密集检索的层冗余,并发现MLP层比注意力层更容易压缩。基于此,作者提出了EffiR框架,该框架通过粗到细的策略压缩MLP层,并包含特定于检索的微调,从而在各种数据集和LLM基础模型上实现了显著的模型大小和推理成本减少,同时保持了全大小模型的性能。
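The coarse-to-fine strategy in the EffiR entry above (depth reduction first, then width reduction, applied to MLP layers only) can be sketched as a pruning plan. This is a minimal sketch under stated assumptions: the importance scores, the function `coarse_to_fine_plan`, and the "keep the top-scoring units" rule are hypothetical, because the digest does not say how EffiR scores layers or hidden units.

```python
from typing import Dict, List


def coarse_to_fine_plan(
    neuron_scores: Dict[int, List[float]],
    layer_scores: Dict[int, float],
    drop_layers: int,
    keep_width: int,
) -> Dict[int, List[int]]:
    """Return, per layer, the MLP hidden-unit indices to keep ([] = MLP removed)."""
    # Coarse step (depth): remove the MLP blocks with the lowest layer-level score.
    # Attention layers are never touched in this sketch.
    dropped = set(sorted(layer_scores, key=layer_scores.get)[:drop_layers])
    plan: Dict[int, List[int]] = {}
    for layer, scores in neuron_scores.items():
        if layer in dropped:
            plan[layer] = []
            continue
        # Fine step (width): keep the highest-scoring hidden units in surviving MLPs.
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:keep_width]
        plan[layer] = sorted(top)
    return plan


# Made-up scores for a three-layer toy model.
plan = coarse_to_fine_plan(
    neuron_scores={0: [0.9, 0.1, 0.5, 0.3], 1: [0.2, 0.2, 0.1, 0.1], 2: [0.4, 0.8, 0.7, 0.05]},
    layer_scores={0: 0.8, 1: 0.1, 2: 0.6},
    drop_layers=1,
    keep_width=2,
)
```

In the abstract's pipeline, a plan like this would be followed by retrieval-specific fine-tuning to recover quality, a step this sketch does not attempt.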
Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs
Authors: Dhruv Anand, Ehsan Shareghi
First: 2025-12-23T18:43:05+00:00 · Latest: 2025-12-23T18:43:05+00:00
Comments: 27 pages, 5 figures, 9 tables. Cube available at https://github.com/dana-23/cube-bench
Abstract
We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.
中文标题/摘要
标题:立方体基准:多模态大语言模型空间视觉推理评估
我们介绍了立方体基准,这是一个魔方基准,用于评估多模态大语言模型(MLLMs)的空间和序列推理能力。基准将性能分解为五个技能:(i) 从图像和文本重建魔方面,(ii) 选择最佳下一步,(iii) 预测候选动作的结果而不执行它,(iv) 执行多步计划并从错误中恢复,以及(v) 检测并修正自己的错误。使用一组共享的打乱魔方状态,相同的提示和解析器,以及单一的到解决状态的距离度量,我们按打乱深度比较了最近的MLLMs。在七个MLLMs中,准确性随着深度的增加而急剧下降;一旦轨迹停滞或发散,模型很少能恢复,高面重建准确性并不保证有能力的动作选择或多步执行。明显的闭源与开源差距出现:最强的闭源模型在单步感知任务和多步控制任务中均领先,而开源权重模型在最困难的设置中接近随机;然而,即使最好的MLLM在更高复杂度的魔方上也会退化。简单的自我纠正通过反思思考可以带来适度的收益,但也可能引发过度思考。立方体基准提供了一个紧凑且可重复的空间序列推理探针。
Summary / 总结
Cube Bench evaluates spatial and sequential reasoning in MLLMs through a Rubik's cube benchmark, decomposing performance into five skills. The benchmark compares seven MLLMs on accuracy, which drops sharply with scramble depth, and highlights a closed-source model advantage in both single-step perception and multi-step control tasks, while open-weight models struggle. Simple self-correction via reflective thinking provides modest gains but can lead to overthinking. Overall, MLLMs degrade at higher cube complexity despite high face-reconstruction accuracy.
Cube Bench 通过 Rubik's-cube 基准评估 MLLMs 的空间和序列推理能力,将性能分解为五个技能。在七个 MLLMs 中,随着打乱程度的增加,准确率急剧下降,模型难以从错误中恢复。观察到明显的闭源与开源模型差距,闭源模型在单步感知任务和多步控制任务中表现出色。简单的自我纠正通过反思思考可以带来适度的提升,但也可能导致过度思考。Cube Bench 提供了一种紧凑且可重复的方法来探究 MLLMs 的序列空间推理能力。
Leveraging High-Fidelity Digital Models and Reinforcement Learning for Mission Engineering: A Case Study of Aerial Firefighting Under Perfect Information
Authors: İbrahim Oğuz Çetinkaya, Sajad Khodadadian, Taylan G. Topçu
First: 2025-12-23T18:36:07+00:00 · Latest: 2025-12-23T18:36:07+00:00
Abstract
As systems engineering (SE) objectives evolve from the design and operation of monolithic systems to complex System of Systems (SoS), the discipline of Mission Engineering (ME) has emerged and is increasingly accepted as a new line of thinking in the SE community. Mission environments are uncertain and dynamic, and mission outcomes are a direct function of how mission assets interact with the environment. This makes static architectures brittle and calls for analytically rigorous approaches to ME. To that end, this paper proposes an intelligent mission coordination methodology that integrates digital mission models with Reinforcement Learning (RL) and specifically addresses the need for adaptive task allocation and reconfiguration. More specifically, we leverage a Digital Engineering (DE) based infrastructure composed of a high-fidelity digital mission model and an agent-based simulation; we then formulate the mission tactics management problem as a Markov Decision Process (MDP) and employ an RL agent trained via Proximal Policy Optimization. Using the simulation as a sandbox, we map system states to actions and refine the policy based on realized mission outcomes. The utility of the RL-based intelligent mission coordinator is demonstrated through an aerial firefighting case study. Our findings indicate that the RL-based intelligent mission coordinator not only surpasses baseline performance but also significantly reduces the variability in mission performance. Thus, this study serves as a proof of concept demonstrating that DE-enabled mission simulations combined with advanced analytical tools offer a mission-agnostic framework for improving ME practice, one that can be extended in the future to more complicated fleet design and selection problems from a mission-first perspective.
中文标题/摘要
标题:利用高保真数字模型和强化学习进行任务工程:在完美信息下的空中灭火案例研究
随着系统工程(SE)目标从单一系统的设计和运行转变为复杂的系统集合体(System of Systems, SoS),任务工程(Mission Engineering, ME)这一学科已经出现,并逐渐被SE社区接受为一种新的思维方式。此外,任务环境是不确定的、动态的,任务结果直接取决于任务资产如何与环境互动。这表明静态架构是脆弱的,并要求ME采用分析上严谨的方法。为此,本文提出了一种智能任务协调方法,将数字任务模型与强化学习(Reinforcement Learning, RL)相结合,以适应性任务分配和重新配置的需求。具体而言,我们利用基于数字工程(Digital Engineering, DE)的基础设施,该基础设施由高保真数字任务模型和基于代理的模拟组成;然后将任务战术管理问题形式化为马尔可夫决策过程(Markov Decision Process, MDP),并采用通过近端策略优化训练的RL代理。通过利用模拟作为沙盒,我们将系统状态映射到行动,并根据实现的任务结果来优化策略。通过空中灭火案例研究展示了基于RL的智能任务协调器的实用性。我们的研究结果表明,基于RL的智能任务协调器不仅超越了基线性能,还显著减少了任务性能的变异性。因此,这项研究作为概念验证,证明了DE使能的任务模拟与高级分析工具结合提供了一种任务无关的框架,以改进ME实践;从任务优先的角度出发,未来可以将其扩展到更复杂的舰队设计和选择问题。
Summary / 总结
This paper proposes an intelligent mission coordination methodology that integrates digital mission models with Reinforcement Learning (RL) to address adaptive task allocation and reconfiguration in complex mission environments. The approach leverages a high-fidelity digital mission model and agent-based simulation, formulating the mission tactics management problem as a Markov Decision Process (MDP) and employing an RL agent trained via Proximal Policy Optimization. The study demonstrates the utility of this RL-based intelligent mission coordinator through an aerial firefighting case study, showing that it surpasses baseline performance and significantly reduces mission performance variability.
本文提出了一种将数字任务模型与强化学习(RL)结合的智能任务协调方法,用于复杂系统中的自适应任务分配和重新配置。通过将任务战术管理问题形式化为马尔可夫决策过程(MDP),并使用Proximal Policy Optimization训练RL代理,研究通过空中灭火案例研究展示了该方法的实用性。研究结果表明,基于RL的智能任务协调器不仅超越了基线方法,还显著降低了任务性能的变异性。
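The entry above names Proximal Policy Optimization as the training method. As a reminder of what that usually involves, here is a minimal sketch of the standard clipped surrogate objective from the PPO literature. It is an assumption that the paper uses this common variant; the function name, the clipping value of 0.2, and the plain-list inputs are illustrative choices, not details from the source.

```python
import math
from typing import Sequence


def ppo_clip_objective(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped surrogate objective (to be maximized) over a batch."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping removes the incentive to move the policy far from the one that collected the data, which is one plausible reason a PPO-trained coordinator would show lower outcome variability, although the abstract does not make that causal claim.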
Automated stereotactic radiosurgery planning using a human-in-the-loop reasoning large language model agent
Authors: Humza Nusrat, Luke Francisco, Bing Luo, Hassan Bagher-Ebadian, Joshua Kim, Karen Chin-Snyder, Salim Siddiqui, Mira Shah, Eric Mellon, Mohammad Ghassemi, Anthony Doemer, Benjamin Movsas, Kundan Thind
First: 2025-12-23T18:32:17+00:00 · Latest: 2025-12-23T18:32:17+00:00
Abstract
Stereotactic radiosurgery (SRS) demands precise dose shaping around critical structures, yet black-box AI systems have limited clinical adoption due to opacity concerns. We tested whether chain-of-thought reasoning improves agentic planning in a retrospective cohort of 41 patients with brain metastases treated with 18 Gy single-fraction SRS. We developed SAGE (Secure Agent for Generative Dose Expertise), an LLM-based planning agent for automated SRS treatment planning. Two variants generated plans for each case: one using a non-reasoning model, one using a reasoning model. The reasoning variant showed comparable plan dosimetry relative to human planners on primary endpoints (PTV coverage, maximum dose, conformity index, gradient index; all p > 0.21) while reducing cochlear dose below human baselines (p = 0.022). When prompted to improve conformity, the reasoning model demonstrated systematic planning behaviors including prospective constraint verification (457 instances) and trade-off deliberation (609 instances), while the standard model exhibited none of these deliberative processes (0 and 7 instances, respectively). Content analysis revealed that constraint verification and causal explanation concentrated in the reasoning agent. The optimization traces serve as auditable logs, offering a path toward transparent automated planning.
中文标题/摘要
标题:使用人类在环推理大型语言模型代理的自动化立体定向放射外科计划
立体定向放射外科(SRS)需要在关键结构周围进行精确的剂量塑形,但由于黑盒AI系统的不透明性,其在临床中的应用受到限制。我们在41例接受18Gy单次分割SRS治疗的脑转移瘤患者的回顾性队列中,测试了链式思考推理是否能改善代理式规划。我们开发了SAGE(Secure Agent for Generative Dose Expertise),这是一种基于LLM的自动化SRS治疗规划代理。两种变体为每个病例生成了计划:一种使用非推理模型,另一种使用推理模型。推理变体在主要终点(PTV覆盖、最大剂量、适形指数、梯度指数;所有p > 0.21)上的计划剂量学与人类规划者相当,同时将耳蜗剂量降低到人类基线以下(p = 0.022)。当被要求改进适形度时,推理模型展示了系统性的规划行为,包括前瞻性约束验证(457次)和权衡考量(609次),而标准模型几乎没有表现出这些审慎过程(分别为0次和7次)。内容分析显示,约束验证和因果解释集中在推理代理中。优化轨迹可作为可审计的日志,为透明的自动化规划提供了途径。
Summary / 总结
The study aimed to improve the clinical adoption of AI in stereotactic radiosurgery (SRS) by enhancing the transparency of automated planning through chain-of-thought reasoning. SAGE, an LLM-based agent, generated treatment plans for 41 patients with brain metastases, comparing non-reasoning and reasoning models. The reasoning model produced plans with comparable dosimetry to human planners but reduced cochlear dose. It also showed systematic planning behaviors like constraint verification and trade-off deliberation, which the non-reasoning model lacked. These findings suggest that reasoning models can enhance transparency and clinical acceptance of AI in SRS planning.
研究旨在通过增加AI规划的透明度来提高AI在立体定向放射外科(SRS)中的临床应用。开发了基于LLM的规划代理SAGE来生成SRS治疗计划。测试了两种SAGE变体:一种是非推理模型,另一种是推理模型。推理模型生成的计划在剂量学方面与人类规划者相当,并且降低了耳蜗剂量,同时展示了诸如约束验证和权衡考量等系统性规划行为,而非推理模型几乎没有这些行为。这些发现表明,推理模型可以提高AI在SRS规划中的透明度和临床接受度。
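Two of the primary endpoints in the SAGE entry above, conformity index and gradient index, are simple volume ratios. The sketch below uses the Paddick conformity index and the usual half-prescription gradient index as commonly defined in the radiosurgery literature. That choice is an assumption: the abstract does not say which definitions the authors used, and the voxel sets and function names are made up for illustration.

```python
from typing import Set


def paddick_conformity_index(target: Set[int], prescription: Set[int]) -> float:
    """(TV intersect PIV)^2 / (TV * PIV); 1.0 means perfect conformity.

    Volumes are counted in voxels, assuming all voxels have equal volume.
    """
    overlap = len(target & prescription)
    return overlap ** 2 / (len(target) * len(prescription))


def gradient_index(half_prescription: Set[int], prescription: Set[int]) -> float:
    """Volume receiving half the prescription dose divided by the prescription volume."""
    return len(half_prescription) / len(prescription)


# Toy 1-D "voxel" sets: the prescription isodose is shifted by two voxels.
target_voxels = set(range(0, 10))
prescription_voxels = set(range(2, 12))
half_dose_voxels = set(range(0, 30))

ci = paddick_conformity_index(target_voxels, prescription_voxels)
gi = gradient_index(half_dose_voxels, prescription_voxels)
```

A lower gradient index indicates steeper dose fall-off outside the target, which is the property that matters for sparing nearby structures such as the cochlea.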
Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Authors: Amirhosein Ghasemabadi, Di Niu
First: 2025-12-23T18:21:32+00:00 · Latest: 2025-12-23T18:21:32+00:00
Abstract
Large language models (LLMs) generate fluent and complex outputs but often fail to recognize their own mistakes and hallucinations. Existing approaches typically rely on external judges, multi-sample consistency, or text-based self-critique, which incur additional compute or correlate weakly with true correctness. We ask: can LLMs predict their own failures by inspecting internal states during inference? We introduce Gnosis, a lightweight self-awareness mechanism that enables frozen LLMs to perform intrinsic self-verification by decoding signals from hidden states and attention patterns. Gnosis passively observes internal traces, compresses them into fixed-budget descriptors, and predicts correctness with negligible inference cost, adding only ~5M parameters and operating independently of sequence length. Across math reasoning, open-domain question answering, and academic knowledge benchmarks, and over frozen backbones ranging from 1.7B to 20B parameters, Gnosis consistently outperforms strong internal baselines and large external judges in both accuracy and calibration. Moreover, it generalizes zero-shot to partial generations, enabling early detection of failing trajectories and compute-aware control. These results show that reliable correctness cues are intrinsic to the generation process and can be extracted efficiently without external supervision.
中文标题/摘要
标题:大型语言模型能否预测自身的失败?通过内部电路实现自我意识
大型语言模型(LLMs)生成流畅且复杂的输出,但往往无法识别自身的错误和幻觉。现有方法通常依赖外部评判者、多样本一致性或基于文本的自我批判,这会增加额外的计算量或与真正的正确性关联较弱。我们提出的问题是:LLMs能否在推理过程中检查内部状态来预测自身的失败?我们引入了Gnosis,这是一种轻量级的自我意识机制,使冻结的LLMs能够通过解码隐藏状态和注意力模式的信号来进行内在的自我验证。Gnosis被动地观察内部痕迹,将其压缩为固定预算的描述符,并以几乎无推理成本的方式预测正确性,仅增加约500万参数且独立于序列长度。在数学推理、开放领域问答和学术知识基准测试中,以及在从17亿到200亿参数的冻结骨干网络上,Gnosis在准确性和校准方面始终优于强大的内部基线和大型外部评判者。此外,它能够零样本泛化到部分生成,实现早期失败轨迹检测和计算感知控制。这些结果表明,可靠的正确性线索内生于生成过程,并且可以在无需外部监督的情况下高效提取。
Summary / 总结
The research aims to improve the self-awareness of large language models (LLMs) by enabling them to predict their own failures during inference. Gnosis, a lightweight mechanism, allows frozen LLMs to inspect internal states and predict correctness with minimal additional cost. Across various benchmarks, Gnosis outperforms strong internal baselines and large external judges in both accuracy and calibration, demonstrating that reliable correctness cues are intrinsic to the generation process and can be efficiently extracted without external supervision.
研究旨在通过使大型语言模型(LLMs)在推理过程中能够预测自己的错误来增强其自我意识。Gnosis 是一种轻量级机制,允许冻结的 LLM 检查内部状态并以最小的额外成本预测正确性。在各种基准测试中,Gnosis 在准确性和校准方面均优于强大的内部基线和大型外部评判者,展示了内在的正确性线索可以无需外部监督而高效地提取。
Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs
Authors: Rui Pan, Zhuofu Chen, Ravi Netravali
First: 2025-12-23T18:16:58+00:00 · Latest: 2025-12-23T18:16:58+00:00
Abstract
Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that a dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9× speedup over vanilla decoding, 1.7× over the best naive dLLM drafter, and 1.4× over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
中文标题/摘要
标题:快速失败,赢得胜利:通过扩散大语言模型重新思考推测性解码的起草策略
扩散大语言模型(dLLMs)提供快速并行的标记生成,但它们的独立使用受到效率与质量固有的权衡。我们表明,如果谨慎应用,dLLMs 的特性实际上可以成为在自回归(AR)验证器辅助下的推测性解码中的优势。我们的核心见解是,dLLM 的并行解码速度大大降低了昂贵的拒绝风险,提供了一种实用机制来有效实现(难以捉摸的)长篇草案,这些草案能够通过推测性解码带来大量加速。我们提出了 FailFast,一种基于 dLLM 的推测性解码框架,通过动态调整其推测长度来实现这一方法。它“快速失败”通过在难以推测的区域花费最少的计算资源来缩短推测延迟,“赢得胜利”通过在容易推测的区域积极扩展草案长度来减少验证延迟(在许多情况下,一次推测并接受 70 个标记!)。无需任何微调,FailFast 为 AR LLM 提供无损加速,并在多种模型和工作负载上分别实现了高达 4.9 倍、1.7 倍和 1.4 倍的速度提升。我们已在 https://github.com/ruipeterpan/failfast 开源了 FailFast。
Summary / 总结
The paper addresses the efficiency-quality tradeoff in using diffusion large language models (dLLMs) for speculative decoding. It introduces FailFast, a framework that dynamically adjusts speculation length to minimize costly rejections and maximize draft length in easy regions. This approach results in up to 4.9 times speedup over vanilla decoding and 1.7 times over the best naive dLLM drafter, without any fine-tuning. The key insight is that dLLM's parallel decoding speed reduces the risk of rejections, enabling effective speculative decoding with autoregressive verifiers.
论文探讨了在使用扩散大语言模型(dLLM)进行推测性解码时的效率与质量权衡问题。它提出了FailFast框架,该框架动态调整推测长度,以减少昂贵的拒绝并在较易推测的区域最大化草案长度。无需微调,FailFast相对于普通解码实现了高达4.9倍的无损加速,相对于最佳的朴素dLLM起草器实现了高达1.7倍的加速,适用于多种模型和工作负载。
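The "fail fast, win big" idea in the entry above can be illustrated with a toy greedy speculative decoder whose draft length grows after a fully accepted draft and resets after a rejection. The doubling and reset rule, the function names, and the toy drafter are assumptions made for illustration; the digest does not describe FailFast's actual adaptation policy, and a real verifier would check a whole draft in one parallel forward pass rather than token by token.

```python
from typing import Callable, List, Tuple


def speculative_decode(
    target_next: Callable[[List[int]], int],
    draft: Callable[[List[int], int], List[int]],
    num_tokens: int,
    min_len: int = 2,
    max_len: int = 64,
) -> Tuple[List[int], List[int]]:
    """Greedy speculative decoding with an adaptive draft length (toy version)."""
    out: List[int] = []
    lengths: List[int] = []  # draft length used in each round
    k = min_len
    while len(out) < num_tokens:
        proposal = draft(out, k)
        lengths.append(k)
        accepted = 0
        for tok in proposal:
            if tok == target_next(out):  # verifier agrees with the drafted token
                out.append(tok)
                accepted += 1
            else:
                break
        # The verifier always contributes one token: the correction after a
        # rejection, or a bonus token after a fully accepted draft.
        out.append(target_next(out))
        # Hypothetical rule: grow the draft while drafts are accepted, reset on rejection.
        k = min(max_len, k * 2) if accepted == len(proposal) else min_len
    return out[:num_tokens], lengths


def toy_target(prefix: List[int]) -> int:
    return len(prefix) % 7


def toy_draft(prefix: List[int], k: int) -> List[int]:
    # Correct everywhere except positions 9, 19, 29, ... (the "hard" regions).
    start = len(prefix)
    return [(p % 7) if p % 10 != 9 else -1 for p in range(start, start + k)]


tokens, draft_lengths = speculative_decode(toy_target, toy_draft, num_tokens=40)
```

Because every emitted token is either verified against or produced by `toy_target`, the output matches plain greedy decoding exactly, which is the sense in which such schemes are lossless.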
Distilling to Hybrid Attention Models via KL-Guided Layer Selection
Authors: Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, Yoon Kim
First: 2025-12-23T18:12:22+00:00 · Latest: 2025-12-23T18:12:22+00:00
Abstract
Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding on which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected, we use a recent pipeline for the distillation process itself (RADLADS; Goldstein et al., 2025), which consists of attention weight transfer, hidden state alignment, and KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attentions based on a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.
中文标题/摘要
标题:通过KL引导层选择提炼至混合注意力模型
将预训练的softmax注意力变换器提炼成更高效的混合架构,这些架构交替使用softmax和线性注意力层,是一种有望提高LLM推理效率的方法,而无需从头开始进行昂贵的预训练。转换过程中的关键因素是层选择,即决定将哪些层转换为线性注意力变体。本文描述了一种简单且高效的层选择方法,该方法使用从通用文本数据少量训练中得出的层重要性得分。一旦选择了层,我们使用最近的提炼过程管道(RADLADS),该过程包括注意力权重转移、隐藏状态对齐、基于KL的分布匹配,最后进行少量微调。我们发现,这种方法比现有的层选择方法更有效,包括基于固定比例均匀交替线性注意力的启发式方法,以及依赖于专门诊断数据集的更复杂方法。
Summary / 总结
This paper aims to improve the inference efficiency of large language models (LLMs) by distilling pretrained softmax attention Transformers into hybrid architectures that interleave softmax and linear attention layers. The key step is layer selection: layer importance scores, derived from a small amount of training on generic text data, determine which layers are converted to linear attention. Distillation then follows the RADLADS pipeline of attention weight transfer, hidden state alignment, and KL-based distribution matching, followed by a small amount of finetuning. The study finds that this selection recipe outperforms existing approaches, including heuristics that uniformly interleave linear attention at a fixed ratio and methods that rely on specialized diagnostic datasets.
该论文针对将预训练的softmax注意力Transformer转换为更高效的混合模型中的层选择问题进行了研究。方法使用从通用文本数据的少量训练中得出的重要性分数来选择层,随后进行注意力权重转移、隐藏状态对齐和基于KL的分布匹配的蒸馏过程。该方法优于按固定比例均匀插入线性注意力的启发式方法,以及依赖专门诊断数据集的方法。
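One way to picture KL-guided layer selection, as in the entry above, is to score each layer by how far the output distribution drifts from the teacher when that layer alone is converted, and then convert the layers with the smallest drift. This is a hypothetical reading for illustration only: the digest says the scores come from a small amount of training on generic text but does not give the scoring formula, and all names and numbers below are made up.

```python
import math
from typing import Dict, List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def layer_scores(
    teacher: List[List[float]],
    converted: Dict[int, List[List[float]]],
) -> Dict[int, float]:
    """Mean per-position KL between the teacher and each single-layer conversion."""
    scores: Dict[int, float] = {}
    for layer, outputs in converted.items():
        kls = [kl_divergence(p, q) for p, q in zip(teacher, outputs)]
        scores[layer] = sum(kls) / len(kls)
    return scores


def layers_to_convert(scores: Dict[int, float], num_convert: int) -> List[int]:
    """Convert the layers whose conversion disturbs the teacher distribution least."""
    return sorted(sorted(scores, key=scores.get)[:num_convert])


# Toy next-token distributions at two positions (made-up numbers).
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
converted = {
    0: [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],  # converting layer 0 changes nothing
    1: [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]],  # mild drift
    2: [[0.1, 0.2, 0.7], [0.4, 0.2, 0.4]],  # large drift: keep softmax attention here
}
scores = layer_scores(teacher, converted)
chosen = layers_to_convert(scores, num_convert=2)
```

A fixed-ratio heuristic would instead pick layers by position alone, which is the baseline the abstract reports improving on.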
Similarity Field Theory: A Mathematical Framework for Intelligence
Authors: Kei-Sing Ng
First: 2025-09-21T22:34:00+00:00 · Latest: 2025-12-23T18:09:51+00:00
Abstract
We posit that persisting and transforming similarity relations form the structural basis of any comprehensible dynamic system. This paper introduces Similarity Field Theory, a mathematical framework that formalizes the principles governing similarity values among entities and their evolution. We define: (1) a similarity field $S: U \times U \to [0,1]$ over a universe of entities $U$, satisfying reflexivity $S(E,E)=1$ and treated as a directed relational field (asymmetry and non-transitivity are allowed); (2) the evolution of a system through a sequence $Z_p=(X_p,S^{(p)})$ indexed by $p=0,1,2,\ldots$; (3) concepts $K$ as entities that induce fibers $F_\alpha(K)=\{E\in U \mid S(E,K)\ge \alpha\}$, i.e., superlevel sets of the unary map $S_K(E):=S(E,K)$; and (4) a generative operator $G$ that produces new entities. Within this framework, we formalize a generative definition of intelligence: an operator $G$ is intelligent with respect to a concept $K$ if, given a system containing entities belonging to the fiber of $K$, it generates new entities that also belong to that fiber. Similarity Field Theory thus offers a foundational language for characterizing, comparing, and constructing intelligent systems. At a high level, this framework reframes intelligence and interpretability as geometric problems on similarity fields (preserving and composing level-set fibers) rather than purely statistical ones. We prove two theorems: (i) asymmetry blocks mutual inclusion; and (ii) stability implies either an anchor coordinate or asymptotic confinement to the target level (up to arbitrarily small tolerance). Together, these results constrain similarity-field evolution and motivate an interpretive lens that can be applied to large language models.
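The definitions in the abstract above (a reflexive, possibly asymmetric similarity field, fibers as superlevel sets, and the generative notion of intelligence) are concrete enough to check on a toy universe. The sketch below is only an illustration: the entities, similarity values, and function names are made up, and the check covers a single generation step rather than the full sequence $Z_p$.

```python
from typing import Callable, Dict, Set, Tuple

Similarity = Callable[[str, str], float]


def fiber(S: Similarity, universe: Set[str], concept: str, alpha: float) -> Set[str]:
    """F_alpha(K) = {E in U | S(E, K) >= alpha}."""
    return {e for e in universe if S(e, concept) >= alpha}


def generates_into_fiber(
    S: Similarity, before: Set[str], after: Set[str], concept: str, alpha: float
) -> bool:
    """One-step check of the generative definition of intelligence.

    The starting system must contain members of the fiber, and every newly
    generated entity must fall inside the fiber as well.
    """
    new_entities = after - before
    return (
        bool(fiber(S, before, concept, alpha))
        and bool(new_entities)
        and new_entities <= fiber(S, after, concept, alpha)
    )


# Made-up directed similarities; note S("kitten","cat") != S("cat","kitten").
TABLE: Dict[Tuple[str, str], float] = {
    ("kitten", "cat"): 0.9,
    ("cat", "kitten"): 0.4,
    ("lion", "cat"): 0.85,
    ("rock", "cat"): 0.05,
}


def S(a: str, b: str) -> float:
    if a == b:
        return 1.0  # reflexivity: S(E, E) = 1
    return TABLE.get((a, b), 0.0)


U0 = {"cat", "kitten", "rock"}
cat_fiber = fiber(S, U0, "cat", 0.8)
good_step = generates_into_fiber(S, U0, U0 | {"lion"}, "cat", 0.8)
bad_step = generates_into_fiber(S, U0, U0 | {"pebble"}, "cat", 0.8)
```

With these values, "kitten" lies in the 0.8-fiber of "cat" while "cat" does not lie in the 0.8-fiber of "kitten", a small instance of the asymmetry the framework allows.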
FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
Authors: Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, Keze Wang
First: 2025-12-23T18:05:43+00:00 · Latest: 2025-12-23T18:05:43+00:00
Comments: Under submission
Abstract
Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment. We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8 percent of visual tokens on LLaVA 1.5, and maintaining 92.8 percent accuracy even under 94.4 percent compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
中文标题/摘要
标题:FlashVLM:文本引导的视觉标记选择框架用于大型多模态模型
大型视觉-语言模型(VLMs)通常处理每张图像或视频帧数百或数千个视觉标记,导致二次注意力成本和大量冗余。现有的标记减少方法往往忽视了文本查询或依赖于深度注意力图,这些图在剧烈剪枝下不稳定,导致语义对齐下降。 我们提出了一种FlashVLM,这是一种文本引导的视觉标记选择框架,能够动态适应查询。FlashVLM 不依赖于嘈杂的注意力权重,而是计算投影图像标记与语言模型空间中归一化文本嵌入之间的显式跨模态相似性。这种外在的相关性与内在的视觉显著性通过对数域加权和温度控制锐化相结合。此外,一种保留多样性的划分保留了一组最小但具有代表性的背景标记,以保持全局上下文。 在相同的标记预算和评估协议下,FlashVLM 实现了超越无损压缩,略优于未剪枝基线,同时在LLaVA 1.5上剪枝高达77.8%的视觉标记,并在94.4%的压缩下保持92.8%的准确性。在14个图像和视频基准上的大量实验表明,FlashVLM 在保持强大鲁棒性和泛化能力的同时,提供了最先进的效率性能折衷。
Summary / 总结
FlashVLM is a text-guided visual token selection framework that dynamically adapts visual inputs to textual queries by computing explicit cross-modal similarities and fusing them with intrinsic visual saliency. It achieves beyond lossless compression, surpassing the unpruned baseline while pruning up to 77.8% of visual tokens on LLaVA 1.5, and maintaining 92.8% accuracy even under 94.4% compression. Extensive experiments on 14 benchmarks show that FlashVLM offers state-of-the-art efficiency while maintaining robustness and generalization across various VLMs.
FlashVLM 是一种文本引导的视觉标记选择框架,能够根据文本查询动态调整视觉输入。它在 LLaVA 1.5 上剪除多达 77.8% 的视觉标记时仍略优于未剪枝基线,并在 94.4% 的压缩率下保持 92.8% 的准确性。它通过计算图像标记与文本嵌入之间的显式跨模态相似性,融合视觉显著性,并保留少量背景标记以保持全局上下文。实验表明,FlashVLM 在 14 个图像和视频基准测试中表现出色,在效率和鲁棒性方面优于现有方法。
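The scoring idea described in the FlashVLM abstract (query relevance fused with saliency in the log domain, then sharpened by a temperature) might look roughly like the sketch below. This is not the authors' implementation: the fusion formula, the `weight`, `temperature`, and `eps` parameters, the softmax normalization, and the function names are all assumptions, and the diversity-preserving background partition is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def fused_scores(image_tokens, text_embedding, saliency,
                 weight=0.5, temperature=0.5, eps=1e-6):
    """Assumed fusion: weighted sum of log-relevance and log-saliency, then a softmax.

    A lower temperature concentrates more of the softmax mass on the top tokens.
    """
    logits = []
    for token, sal in zip(image_tokens, saliency):
        # Map cosine similarity from [-1, 1] to [0, 1] so the logarithm is defined.
        relevance = (cosine(token, text_embedding) + 1.0) / 2.0
        fused = weight * math.log(relevance + eps) + (1.0 - weight) * math.log(sal + eps)
        logits.append(fused / temperature)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def select_tokens(scores, keep):
    """Return the indices of the `keep` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical 2-D "token embeddings" and a query embedding, for illustration only.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.0]
saliency = [0.5, 0.5, 0.5, 0.5]
scores = fused_scores(tokens, query, saliency)
kept = select_tokens(scores, keep=2)
```

In this toy setup the temperature does not change which tokens rank highest; it only changes how peaked the normalized scores are, which would matter if the scores were used as soft weights rather than for a hard top-k cut.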
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Authors: Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi
First: 2025-12-23T17:56:36+00:00 · Latest: 2025-12-23T17:56:36+00:00
Abstract
Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.
中文标题/摘要
标题:学习在四维中推理:视觉语言模型的动态空间理解
视觉语言模型(VLM)在一般理解方面表现出色,但在动态空间推理(DSR)方面仍然较弱,即在时间维度上对3D空间中物体几何形状和关系的演变进行推理,这主要是由于缺乏可扩展的4D感知训练资源。为了在数据集、基准和模型的各个方面弥合这一差距,我们引入了DSR套件。首先,我们提出了一种自动流水线,从野外视频中生成DSR的多项选择题-答案对。通过利用现代视觉基础模型,该流水线提取了丰富的几何和运动信息,包括相机姿态、局部点云、物体掩码、方向和3D轨迹。这些几何线索使得DSR-Train的构建成为可能,并进一步构建了DSR-Bench用于评估。与以往工作相比,我们的数据强调了(i)野外视频来源,(ii)物体和场景级别的3D要求,(iii)视角变换,(iv)多物体交互,以及(v)细粒度、程序化的答案。除了数据,我们还提出了一种轻量级的几何选择模块(GSM),以无缝地将几何先验整合到VLM中,该模块压缩了问题语义,并从预训练的4D重建先验中提取与问题相关的信息,形成一组紧凑的几何标记。这种有针对性的提取避免了向模型灌输无关知识。实验表明,将DSR-Train和GSM集成到Qwen2.5-VL-7B中显著增强了其动态空间推理能力,同时在通用视频理解基准测试中保持了准确性。
Summary / 总结
The research aims to improve vision-language models' ability in dynamic spatial reasoning (DSR) by addressing the scarcity of 4D-aware training resources. It introduces DSR Suite, which includes an automated pipeline for generating DSR question-answer pairs from in-the-wild videos and a lightweight Geometry Selection Module (GSM) to integrate geometric priors into VLMs. The key findings show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability while maintaining performance on general video understanding benchmarks.
研究旨在通过解决4D感知训练资源稀缺问题,提高视觉语言模型在动态空间推理(DSR)方面的能力。引入了DSR套件,包括从野外视频自动生成DSR问答对的自动化管道和几何选择模块(GSM),以将几何先验无缝集成到VLM中。关键发现表明,将DSR-Train和GSM整合到Qwen2.5-VL-7B中,显著增强了其动态空间推理能力,同时在通用视频理解基准测试中保持了准确性。
Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios
Authors: Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li
Venue: WACV 2026
First: 2025-12-23T17:55:35+00:00 · Latest: 2025-12-23T17:55:35+00:00
Comments: Accepted to WACV 2026
Abstract
Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.
中文标题/摘要
标题:多粒度文本引导图像融合以应对多曝光和多焦点场景
图像融合旨在从在具有挑战性条件下拍摄的一对输入中合成一张高质量的单张图像,例如不同的曝光水平或焦深。核心挑战在于有效处理输入之间的动态范围和焦深差异。随着视觉语言模型的发展,最近的方法开始将文本描述作为辅助指导以提高融合质量。然而,简单地引入粗粒度描述会阻碍对细粒度细节的理解,并且对跨模态对齐提出了挑战。为了解决这些限制,我们提出了多粒度文本引导图像融合(MTIF),这是一种具有三个关键设计的新型融合范式。首先,它引入了多粒度的文本描述,分别捕捉细粒度细节、结构线索和语义内容,并通过分层跨模态调制模块引导图像融合。其次,它在每个粒度级别引入监督信号,以促进视觉和文本特征之间的对齐并增强辅助文本的实用性。第三,它采用了一种基于显著性的增强模块,通过密集的语义内容增强训练数据,进一步加强跨模态调制和对齐。广泛的实验表明,MTIF在多曝光和多焦点图像融合任务中始终优于先前的方法。
Summary / 总结
The paper addresses the challenge of image fusion under challenging conditions such as varying exposure and focus. It proposes Multi-grained Text-guided Image Fusion (MTIF), which uses hierarchical textual descriptions to guide the fusion process. MTIF introduces multi-grained textual descriptions to capture fine details, structural cues, and semantic content, and uses supervision signals at each granularity to enhance cross-modal alignment. The method also includes a saliency-driven enrichment module to strengthen cross-modal modulation and alignment. Experiments demonstrate that MTIF outperforms previous methods in both multi-exposure and multi-focus image fusion tasks.
研究旨在通过利用文本描述来解决动态范围和焦深差异带来的挑战,以提高图像融合质量。方法Multi-grained Text-guided Image Fusion (MTIF) 引入了多粒度文本描述和层次化的跨模态调制模块来引导融合过程。此外,还使用各粒度上的监督信号和显著性驱动的增强模块来增强跨模态对齐。实验表明,MTIF 在多曝光和多焦点图像融合任务中均优于先前的方法。
Advancing Multimodal Teacher Sentiment Analysis: The Large-Scale T-MED Dataset & The Effective AAM-TSA Model
Authors: Zhiyi Duan, Xiangren Wang, Hongyu Yuan, Qianli Xing
First: 2025-12-23T17:42:16+00:00 · Latest: 2025-12-23T17:42:16+00:00
Abstract
Teachers' emotional states are critical in educational scenarios, profoundly impacting teaching efficacy, student engagement, and learning achievements. However, existing studies often fail to accurately capture teachers' emotions due to the performative nature and overlook the critical impact of instructional information on emotional expression. In this paper, we systematically investigate teacher sentiment analysis by building both the dataset and the model accordingly. We construct the first large-scale teacher multimodal sentiment analysis dataset, T-MED. To ensure labeling accuracy and efficiency, we employ a human-machine collaborative labeling process. The T-MED dataset includes 14,938 instances of teacher emotional data from 250 real classrooms across 11 subjects ranging from K-12 to higher education, integrating multimodal text, audio, video, and instructional information. Furthermore, we propose a novel asymmetric attention-based multimodal teacher sentiment analysis model, AAM-TSA. AAM-TSA introduces an asymmetric attention mechanism and hierarchical gating unit to enable differentiated cross-modal feature fusion and precise emotional classification. Experimental results demonstrate that AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.
中文标题/摘要
标题:推进多模态教师情感分析:T-MED数据集与有效的AAM-TSA模型
教师的情感状态在教育场景中至关重要,深刻影响着教学效果、学生参与度和学习成就。然而,现有研究往往由于表演性特征而未能准确捕捉教师的情感,并且忽视了教学信息对情感表达的关键影响。在本文中,我们系统地研究了教师情感分析,相应地构建了数据集和模型。我们构建了首个大规模教师多模态情感分析数据集T-MED。为了确保标注的准确性和效率,我们采用了人机协作标注过程。T-MED数据集包含来自11个学科的250个真实教室的14,938个教师情感数据实例,涵盖了从K-12到高等教育的各个阶段,整合了多模态文本、音频、视频和教学信息。此外,我们提出了一种新颖的非对称注意力机制多模态教师情感分析模型AAM-TSA。AAM-TSA引入了非对称注意力机制和分层门控单元,以实现跨模态特征的差异化融合和精确的情感分类。实验结果表明,AAM-TSA在T-MED数据集上的准确性和可解释性显著优于现有最先进的方法。
Summary / 总结
This paper addresses the importance of teachers' emotional states in education by developing the T-MED dataset and the AAM-TSA model. T-MED is a large-scale multimodal dataset that includes 14,938 instances of teacher emotional data from 250 classrooms, integrating text, audio, video, and instructional information. The AAM-TSA model uses an asymmetric attention mechanism and hierarchical gating unit to achieve better cross-modal feature fusion and emotional classification, outperforming existing methods in accuracy and interpretability.
本文通过开发T-MED数据集和AAM-TSA模型,关注教师情绪状态在教育中的重要性。T-MED数据集包含来自250个教室的14,938个教师情绪实例,涵盖文本、音频、视频和教学信息。AAM-TSA模型采用不对称注意力机制和层次门控单元,融合跨模态特征并精确分类情绪,在T-MED数据集上的准确性和可解释性优于现有方法。
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Authors: Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Donglei Ji, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Yichong Leng, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Shanshan Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Jihao Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Kai Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Yan Sun, Zeyu Sun, Wenjing Tang, Yaxue Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang, Ke Wang, Kexin Wang, Qingyi Wang, Rui Wang, Sen Wang, Shuai Wang, Tingru Wang, Weichen Wang, Xin Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Ziyu Wang, Guoqiang Wei, Wanru Wei, Di Wu, Guohong Wu, Hanjie Wu, Jian Wu, Jie Wu, Ruolan Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Liang Xiang, Fei Xiao, XueFeng Xiao, Pan Xie, Shuangyi Xie, Shuang Xu, Jinlan Xue, Shen Yan, Bangbang Yang, Ceyuan Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yang Yang, Yihang Yang, ZhiXian Yang, Ziyan Yang, Songting Yao, Yifan Yao, Zilyu Ye, Bowen Yu, Jian Yu, Chujie Yuan, Linxiao Yuan, Sichun Zeng, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Chuntao Zhang, Heng Zhang, Jingjie Zhang, Kuo Zhang, Liang Zhang, Liying Zhang, Manlin Zhang, Ting Zhang, Weida Zhang, Xiaohe Zhang, Xinyan Zhang, Yan Zhang, Yuan Zhang, Zixiang Zhang, Fengxuan Zhao, Huating Zhao, Yang Zhao, Hao Zheng, Jianbin Zheng, Xiaozheng Zheng, Yangyang Zheng, Yijie Zheng, Jiexin Zhou, Jiahui Zhu, Kuan Zhu, Shenhan Zhu, Wenjia Zhu, Benhui Zou, Feilong Zuo
First: 2025-12-15T16:36:52+00:00 · Latest: 2025-12-23T17:38:46+00:00
Comments: Seedance 1.5 pro Technical Report
Abstract
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
中文标题/摘要
标题:Seedance 1.5 pro:一种原生音视频联合生成基础模型
近期视频生成技术的进步为统一的音视频生成铺平了道路。在此项工作中,我们介绍了Seedance 1.5 pro,这是一种专门针对原生音视频联合生成的基础模型。该模型利用双分支扩散变换器架构,结合跨模态联合模块和专门的多阶段数据管道,实现了卓越的音视频同步和生成质量。为了确保其实用性,我们实施了精细的后训练优化,包括在高质量数据集上进行监督微调(SFT)和多维度奖励模型的人工反馈强化学习(RLHF)。此外,我们还引入了一种加速框架,将推理速度提高了超过10倍。Seedance 1.5 pro 通过精确的多语言和方言唇同步、动态电影级摄像机控制和增强的叙事连贯性脱颖而出,定位为专业级内容创作的强大引擎。Seedance 1.5 pro 现已可在火山引擎上访问:https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo。
AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
Authors: Anna Šárová Mikeštíková, Médéric Fourmy, Martin Cífka, Josef Sivic, Vladimir Petrik
First: 2025-12-23T17:29:08+00:00 · Latest: 2025-12-23T17:29:08+00:00
Comments: 18 pages, 9 figures
Abstract
Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.
中文标题/摘要
标题:AlignPose:通过多视图特征度量对齐实现通用的6D姿态估计
基于单视图RGB模型的对象姿态估计方法在泛化能力上表现出色,但从根本上受到深度模糊、杂乱和遮挡的限制。多视图姿态估计方法有可能解决这些问题,但现有工作依赖于精确的单视图姿态估计或缺乏对未见过的对象的泛化能力。我们通过以下三个贡献来应对这些挑战。首先,我们引入了AlignPose,这是一种6D物体姿态估计方法,可以从多个外在校准的RGB视图中聚合信息,无需任何特定于物体的训练或对称标注。其次,该方法的关键组成部分是一种新的多视图特征度量细化方法,专门用于物体姿态。它优化了一个一致的世界坐标系中的物体姿态,最小化了在所有视图中实时渲染的物体特征与观察到的图像特征之间的特征差异。第三,我们在四个数据集(YCB-V、T-LESS、ITODD-MV、HouseCat6D)上进行了广泛的实验,并使用BOP基准评估表明,AlignPose在泛化能力上优于其他已发表的方法,特别是在实践中多视图易于获取的工业数据集上。
Summary / 总结
The research aims to improve 6D pose estimation by addressing limitations of single-view methods, such as depth ambiguity and occlusions. The method, AlignPose, uses multiple calibrated RGB views to estimate object poses without requiring specific training or symmetry annotations. It introduces a multi-view feature-metric refinement that optimizes a consistent object pose by minimizing feature discrepancies across all views. Experiments on four datasets show that AlignPose outperforms existing methods, particularly on industrial datasets with multiple views available.
研究旨在通过解决单视图方法的深度模糊和遮挡等问题,改进6D姿态估计。AlignPose 不需要特定对象的训练,通过多视图RGB图像的信息聚合来优化一致的对象姿态,同时最小化渲染特征与观察特征之间的差异。在四个数据集上的实验表明,AlignPose 在工业数据集中的表现优于现有方法,尤其是多视图可用的情况。
Benchmarking LLMs for Predictive Applications in the Intensive Care Units
Authors: Chehak Malhotra, Mehak Gopal, Akshaya Devadiga, Pradeep Singh, Ridam Pal, Ritwik Kashyap, Tavpritesh Sethi
First: 2025-12-23T17:08:31+00:00 · Latest: 2025-12-23T17:08:31+00:00
Abstract
With the advent of LLMs, various tasks across the natural language processing domain have been transformed. However, their application in predictive tasks remains less researched. This study compares large language models, including GatorTron-Base (trained on clinical data), Llama 8B, and Mistral 7B, against models like BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec, setting benchmarks for predicting shock in critically ill patients. Timely prediction of shock can enable early interventions, thus improving patient outcomes. Text data from 17,294 ICU stays of patients in the MIMIC III database were scored for length of stay > 24 hours and shock index (SI) > 0.7 to yield 355 and 87 patients with normal and abnormal SI, respectively. Both focal and cross-entropy losses were used during finetuning to address class imbalances. Our findings indicate that while GatorTron Base achieved the highest weighted recall of 80.5%, the overall performance metrics were comparable between small language models (SLMs) and LLMs. This suggests that LLMs are not inherently superior to SLMs in predicting future clinical events despite their strong performance on text-based tasks. To achieve meaningful clinical outcomes, future efforts in training LLMs should prioritize developing models capable of predicting clinical trajectories rather than focusing on simpler tasks such as named entity recognition or phenotyping.
中文标题/摘要
标题:重症监护病房中预测应用的大语言模型基准测试
随着大语言模型(LLMs)的出现,自然语言处理领域中的各种任务都得到了转变。然而,它们在预测任务中的应用研究较少。本研究将包括GatorTron-Base(基于临床数据训练)、Llama 8B和Mistral 7B在内的大语言模型与BioBERT、DocBERT、BioClinicalBERT、Word2Vec和Doc2Vec等模型进行比较,为预测重症患者休克设定基准。及时预测休克可以实现早期干预,从而改善患者预后。从MIMIC III数据库中17,294例ICU住院患者的文本数据中,筛选出住院时间超过24小时且休克指数(SI)大于0.7的患者,分别得到355例正常SI指数和87例异常SI指数的患者。在微调过程中,使用焦点损失和交叉熵损失来解决类别不平衡问题。研究结果表明,虽然GatorTron Base的加权召回率最高,达到80.5%,但整体性能指标在SLMs和LLMs之间相当。这表明,尽管LLMs在文本任务上表现出色,但它们在预测未来临床事件方面并不比SLMs更具优越性。为了实现有意义的临床结果,未来在训练LLMs时应优先开发能够预测临床轨迹的模型,而不是专注于命名实体识别或表型识别等简单任务。
Summary / 总结
This study benchmarks large language models (LLMs) and small language models (SLMs) for predicting shock in critically ill patients using text data from the MIMIC III database. Models like GatorTron-Base, Llama 8B, and Mistral 7B were compared against traditional models such as BioBERT and DocBERT. GatorTron Base achieved the highest weighted recall of 80.5%, but overall performance metrics were similar between LLMs and SLMs, indicating that LLMs are not inherently superior for clinical event prediction despite their strong performance on text-based tasks.
该研究使用MIMIC III数据库中的文本数据,对比了大型语言模型(LLMs)和小型语言模型(SLMs)在预测重症患者休克方面的表现。包括GatorTron-Base、Llama 8B和Mistral 7B在内的模型与传统模型如BioBERT和Word2Vec进行了比较。GatorTron Base的加权召回率最高,达到80.5%,但LLMs和SLMs的整体性能相似。研究指出,LLMs在临床事件预测方面并不天然优于SLMs,未来应侧重于预测临床轨迹而非简单的命名实体识别或表型识别任务。
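Two quantities from the ICU abstract above can be made concrete with a short sketch: the shock index threshold and the focal loss used against class imbalance. The shock index is conventionally heart rate divided by systolic blood pressure; the abstract gives only the 0.7 cutoff. The focal loss below follows the standard binary formulation, and the `gamma` and `alpha` defaults are the commonly used values, not hyperparameters reported in this study. All function names are hypothetical.

```python
import math


def shock_index(heart_rate_bpm, systolic_bp_mmhg):
    """Shock index (SI): heart rate divided by systolic blood pressure."""
    return heart_rate_bpm / systolic_bp_mmhg


def is_abnormal_si(heart_rate_bpm, systolic_bp_mmhg, threshold=0.7):
    """Label a reading abnormal when SI exceeds the 0.7 cutoff named in the abstract."""
    return shock_index(heart_rate_bpm, systolic_bp_mmhg) > threshold


def binary_cross_entropy(p, y):
    """Cross-entropy for one example; p is the predicted probability of class 1, 0 < p < 1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Standard binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    gamma and alpha are assumed defaults, not values taken from the study.
    The (1 - p_t)**gamma factor shrinks the loss of well-classified examples,
    so hard (often minority-class) examples contribute relatively more.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident, correct prediction
hard = binary_focal_loss(0.1, 1)  # confident, wrong prediction
```

Setting `gamma=0` removes the down-weighting factor, which reduces the focal loss to an alpha-weighted cross-entropy.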
Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition
Authors: Gorjan Radevski
First: 2025-12-23T16:46:58+00:00 · Latest: 2025-12-23T16:46:58+00:00
Comments: Ph.D. manuscript; Supervisors/Mentors: Marie-Francine Moens and Tinne Tuytelaars
Abstract
This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing unique challenges in multimodal machine learning. Chapter 3 introduces Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements between clip-arts. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding. Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function leveraging spatial co-occurrences of medical terms to create interpretable mappings, significantly enhancing medical text navigability. Chapter 5 tackles translating structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights. Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method fusing video frames and object detection representations, improving recognition robustness and accuracy. Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic multimodal fusion-based capabilities, reducing computational requirements while maintaining performance. These contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems' ability to process complex, multimodal inputs across diverse applications.
中文标题/摘要
标题:跨模态融合与知识转移:增强的多模态理解和识别
本文探讨了多模态对齐、翻译、融合和转移,以增强机器对复杂输入的理解。我们将工作分为五个章节,每个章节都针对多模态机器学习中的独特挑战。 第三章介绍了空间推理BERT,用于将基于文本的空间关系翻译成剪贴画之间的2D排列。这使得空间语言的有效解码为视觉表示成为可能,为与人类空间理解相一致的自动化场景生成铺平了道路。 第四章提出了一种将医学文本翻译到解剖学图谱中特定3D位置的方法。我们引入了一个利用医学术语空间共现性的损失函数,创建了可解释的映射,显著提高了医学文本的可导航性。 第五章解决了将结构化文本翻译为知识图谱中的标准事实的问题。我们开发了一个基准,用于将自然语言链接到实体和谓词,解决了文本提取中的歧义性,提供了更清晰、可操作的见解。 第六章探讨了多模态融合方法在组合动作识别中的应用。我们提出了一种融合视频帧和对象检测表示的方法,提高了识别的鲁棒性和准确性。 第七章研究了多模态知识转移在第一人称动作识别中的应用。我们展示了多模态知识蒸馏如何使仅使用RGB的模型模仿多模态融合的能力,同时减少计算需求并保持性能。 这些贡献推进了空间语言理解、医学文本解释、知识图谱丰富和动作识别的方法,增强了计算系统处理各种应用中复杂多模态输入的能力。
Summary / 总结
This manuscript aims to enhance machine understanding of complex multimodal inputs through alignment, translation, fusion, and transference. The work introduces methods for spatial reasoning, medical text translation, knowledge graph linking, multimodal action recognition, and multimodal knowledge distillation. Key findings include effective spatial language decoding, interpretable medical text mappings, benchmarking for natural language to knowledge graph linking, improved action recognition robustness, and reduced computational requirements for egocentric action recognition through multimodal knowledge transfer.
本论文旨在通过多模态对齐、翻译、融合和转移来增强机器对复杂多模态输入的理解。工作引入了如Spatial-Reasoning Bert等方法,将空间关系翻译为视觉表示,提出了用于将医学文本映射到3D解剖图的损失函数,并开发了一个将自然语言链接到知识图谱的基准。还提出了用于动作识别的多模态融合方法,并展示了多模态知识蒸馏在自中心动作识别中的应用,减少了计算需求同时保持性能。这些贡献提高了空间语言理解、医学文本解释、知识图谱丰富和动作识别的能力。
Deep Research Comparator: A Platform For Fine-grained Human Annotations of Deep Research Agents
Authors: Prahaladh Chandrahasan, Jiahe Jin, Zhihan Zhang, Tevin Wang, Andy Tang, Lucy Mo, Morteza Ziyadi, Leonardo F. R. Ribeiro, Zimeng Qiu, Markus Dreyer, Akari Asai, Chenyan Xiong
First: 2025-07-07T21:35:09+00:00 · Latest: 2025-12-23T16:43:12+00:00
Abstract
Effectively evaluating deep research agents that autonomously search the web, analyze information, and generate reports remains a major challenge, particularly when it comes to assessing long reports and giving detailed feedback on their intermediate steps. To address these gaps, we introduce Deep Research Comparator, a platform that offers a holistic framework for deep research agent hosting, side-by-side comparison, fine-grained human feedback collection, and ranking calculation. Given a user query, our platform displays the final reports from two different agents along with their intermediate steps during generation. Annotators can evaluate the overall quality of final reports based on side-by-side comparison, and also provide detailed feedback separately by assessing intermediate steps or specific text spans within the final report. Furthermore, we develop Simple Deepresearch, an end-to-end agent scaffold. This scaffold serves as a baseline that facilitates the easy integration of various large language models to transform them into deep research agents for evaluation. To demonstrate the platform's utility for deep research agent development, we have collected real user preference data from 17 annotators on three deep research agents. A demo video of our platform can be found at https://www.youtube.com/watch?v=g4d2dnbdseg.
中文标题/摘要
标题:深度研究比较器:一种深度研究代理精细人工标注平台
有效评估自主搜索网络、分析信息并生成报告的深度研究代理仍然是一项重大挑战,尤其是在评估长报告和提供其中间步骤的详细反馈方面。为了解决这些差距,我们引入了深度研究比较器,这是一个提供深度研究代理托管、并排比较、精细的人工反馈收集和排名计算的综合框架的平台。给定用户查询,我们的平台会显示两个不同代理的最终报告及其生成过程中的中间步骤。标注者可以根据并排比较来评估最终报告的整体质量,也可以分别评估中间步骤或最终报告中的特定文本段落。此外,我们还开发了简单深度研究,这是一种端到端的代理框架。该框架作为基准,有助于各种大型语言模型的轻松集成,从而将它们转化为用于评估的深度研究代理。为了展示该平台在深度研究代理开发中的实用性,我们从17名标注者那里收集了针对三个深度研究代理的真实用户偏好数据。我们的平台演示视频可以在https://www.youtube.com/watch?v=g4d2dnbdseg找到。
Summary / 总结
The paper introduces Deep Research Comparator, a platform designed to evaluate deep research agents by providing a side-by-side comparison of their final reports and intermediate steps. Annotators can give detailed feedback on both the overall quality and specific parts of the reports. The platform also includes Simple Deepresearch, a baseline agent scaffold that helps integrate large language models into deep research agents. Experimental results from 17 annotators on three agents demonstrate the platform's effectiveness in collecting user preferences and providing detailed feedback.
论文介绍了Deep Research Comparator平台,该平台旨在通过提供报告托管、对比和收集详细人工反馈的整体框架来评估深度研究代理。它包括最终报告和中间步骤的并排显示,允许注释者进行评估并提供反馈。该平台还包含一个名为Simple Deepresearch的端到端代理支架,可以将大型语言模型集成到深度研究代理中进行评估。来自17名注释者对三个代理的实验结果表明,该平台对于深度研究代理的开发和评估具有实用性。
Step-DeepResearch Technical Report
Authors: Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu Liu, Jing Bai, Junlan Liu, Manjiao Liu, Na Wang, Qiuping Wu, Qinxin Du, Shiwei Li, Wen Sun, Yifeng Gong, Yonglin Chen, Yuling Zhao, Yuxuan Lin, Ziqi Ren, Zixuan Wang, Aihu Zhang, Brian Li, Buyun Ma, Kang An, Li Xie, Mingliang Li, Pan Li, Shidong Yang, Xi Chen, Xiaojia Liu, Yuchu Luo, Yuan Song, YuanHao Ding, Yuanwei Liang, Zexi Li, Zhaoning Zhang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu
First: 2025-12-23T16:32:27+00:00 · Latest: 2025-12-23T16:32:27+00:00
Abstract
As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.
中文标题/摘要
标题:Step-DeepResearch 技术报告
随着大语言模型(LLMs)向自主代理转变,Deep Research 成为了一个关键指标。然而,现有的学术基准如 BrowseComp 往往无法满足开放研究的实际需求,这需要强大的意图识别、长期决策和跨源验证能力。为解决这一问题,我们提出了 Step-DeepResearch,这是一种经济高效的端到端代理。我们提出了一种基于原子能力的数据合成策略,以增强规划和报告撰写能力,并结合了从代理式中期训练到监督微调(SFT)和强化学习(RL)的渐进式训练路径。通过一种清单式评判器的增强,这种方法显著提高了鲁棒性。此外,为了弥合中文领域的评估差距,我们建立了 ADR-Bench 以适应现实的深度研究场景。实验结果显示,Step-DeepResearch(32B)在 Scale AI 研究评分表上得分为 61.4%。在 ADR-Bench 上,它显著优于同类模型,并与 OpenAI 和 Gemini DeepResearch 等顶级闭源模型相媲美。这些发现证明了精细训练能够使中型模型在行业领先的成本效率下实现专家级能力。
Summary / 总结
The research aims to develop a cost-effective autonomous agent for open-ended research tasks, addressing limitations of existing benchmarks. The method involves a Data Synthesis Strategy Based on Atomic Capabilities and a progressive training path from mid-training to SFT and RL, enhanced by a Checklist-style Judger. The model, Step-DeepResearch (32B), scores 61.4% on Scale AI Research Rubrics and outperforms comparable models on ADR-Bench, demonstrating refined training enables medium-sized models to achieve expert-level capabilities efficiently.
研究旨在开发一种经济高效的自主代理以应对开放研究任务,解决现有基准的局限性。方法包括基于原子能力的数据合成策略和从中期训练到SFT和RL的渐进训练路径,并辅以清单式评判器。模型Step-DeepResearch (32B) 在Scale AI研究评分表上得分61.4%,并在ADR-Bench上显著优于同类模型,证明中型模型可以在较低成本下实现专家级能力。
SweRank+: Multilingual, Multi-Turn Code Ranking for Software Issue Localization
Authors: Revanth Gangi Reddy, Ye Liu, Wenting Zhao, JaeHyeok Doo, Tarun Suresh, Daniel Lee, Caiming Xiong, Yingbo Zhou, Semih Yavuz, Shafiq Joty
First: 2025-12-23T16:18:39+00:00 · Latest: 2025-12-23T16:18:39+00:00
Abstract
Maintaining large-scale, multilingual codebases hinges on accurately localizing issues, which requires mapping natural-language error descriptions to the relevant functions that need to be modified. However, existing ranking approaches are often Python-centric and perform a single-pass search over the codebase. This work introduces SweRank+, a framework that couples SweRankMulti, a cross-lingual code ranking tool, with SweRankAgent, an agentic search setup, for iterative, multi-turn reasoning over the code repository. SweRankMulti comprises a code embedding retriever and a listwise LLM reranker, and is trained using a carefully curated large-scale issue localization dataset spanning multiple popular programming languages. SweRankAgent adopts an agentic search loop that moves beyond single-shot localization with a memory buffer to reason and accumulate relevant localization candidates over multiple turns. Our experiments on issue localization benchmarks spanning various languages demonstrate new state-of-the-art performance with SweRankMulti, while SweRankAgent further improves localization over single-pass ranking.
中文标题/摘要
标题:SweRank+: 多语言、多轮次代码排名在软件问题定位中的应用
维护大规模多语言代码库的关键在于准确地定位问题,这需要将自然语言错误描述映射到需要修改的相关函数。然而,现有的排名方法往往以Python为中心,并且只进行一次代码库搜索。这项工作引入了SweRank+框架,该框架结合了SweRankMulti,一种跨语言代码排名工具,以及SweRankAgent,一种代理搜索设置,用于在代码库上进行迭代的多轮次推理。SweRankMulti包括代码嵌入检索器和列表级LLM重排序器,并使用一个精心策划的跨多种流行编程语言的大规模问题定位数据集进行训练。SweRankAgent采用了一个代理搜索循环,超越了一次性定位,使用记忆缓冲区进行推理并在多个回合中累积相关的定位候选。我们在涵盖多种语言的问题定位基准测试中展示了SweRankMulti的新最佳性能,而SweRankAgent进一步提高了定位性能,优于单次排名。
Summary / 总结
The research aims to improve the accuracy of localizing issues in large-scale multilingual codebases by addressing the limitations of existing single-pass ranking approaches. SweRank+ introduces SweRankMulti, which combines a code embedding retriever and a listwise LLM reranker, and SweRankAgent, an iterative multi-turn reasoning framework. SweRankMulti is trained on a large, multilingual dataset, while SweRankAgent uses an agentic search loop to iteratively refine localization candidates. Experiments show that SweRankMulti achieves new state-of-the-art performance, and SweRankAgent further enhances localization accuracy compared to single-pass ranking methods.
研究旨在通过解决现有单次排名方法的局限性,提高大规模多语言代码库中问题定位的准确性。SweRank+引入了SweRankMulti,它结合了代码嵌入检索器和列表级LLM重排序器,以及SweRankAgent,这是一种迭代的多轮推理框架。SweRankMulti在大规模多语言数据集上进行训练,而SweRankAgent使用代理搜索循环逐步细化定位候选。实验表明,SweRankMulti达到了新的最佳性能,而SweRankAgent进一步提高了与单次排名方法相比的定位准确性。
Coherence in the brain unfolds across separable temporal regimes
Authors: Davide Stauba, Finn Rabe, Akhil Misra, Yves Pauli, Roya Hüppi, Nils Lang, Lars Michels, Victoria Edkins, Sascha Frühholz, Iris Sommer, Wolfram Hinzen, Philipp Homan
First: 2025-12-23T16:16:42+00:00 · Latest: 2025-12-23T16:16:42+00:00
Abstract
Coherence in language requires the brain to satisfy two competing temporal demands: gradual accumulation of meaning across extended context and rapid reconfiguration of representations at event boundaries. Despite their centrality to language and thought, how these processes are implemented in the human brain during naturalistic listening remains unclear. Here, we tested whether these two processes can be captured by annotation-free drift and shift signals and whether their neural expression dissociates across large-scale cortical systems. These signals were derived from a large language model (LLM) and formalized contextual drift and event shifts directly from the narrative input. To enable high-precision voxelwise encoding models with stable parameter estimates, we densely sampled one healthy adult across more than 7 hours of listening to thirteen crime stories while collecting ultra high-field (7T) BOLD data. We then modeled the feature-informed hemodynamic response using a regularized encoding framework validated on independent stories. Drift predictions were prevalent in default-mode network hubs, whereas shift predictions were evident bilaterally in the primary auditory cortex and language association cortex. Furthermore, activity in default-mode and parietal networks was best explained by a signal capturing how meaning accumulates and gradually fades over the course of the narrative. Together, these findings show that coherence during language comprehension is implemented through dissociable neural regimes of slow contextual integration and rapid event-driven reconfiguration, offering a mechanistic entry point for understanding disturbances of language coherence in psychiatric disorders.
中文标题/摘要
标题:大脑中的连贯性在分离的时间区间中展开
语言中的连贯性要求大脑满足两个相互竞争的时间需求:在扩展语境中逐步积累意义和在事件边界处快速重新配置表征。尽管这些过程对语言和思维至关重要,但在自然听力过程中人类大脑如何实现这些过程仍然不清楚。在这里,我们测试了是否可以通过无注释漂移和位移信号捕捉这两种过程,并且它们在大规模皮层系统中的神经表达是否分离。这些信号源自一个大型语言模型(LLM),并直接从叙述输入中形式化了上下文漂移和事件位移。为了启用高精度体素级编码模型并获得稳定的参数估计,我们在超过7小时的十三个犯罪故事听力过程中密集采样了一名健康成人,并收集了超高场(7T)BOLD数据。然后,我们使用在独立故事上验证过的正则化编码框架,使用特征导向的血流动力学响应进行建模。漂移预测在默认模式网络枢纽中普遍存在,而位移预测在双侧初级听觉皮层和语言关联皮层中明显。此外,默认模式和顶叶网络的活动最好地由一个信号解释,该信号捕捉了叙述过程中意义如何逐步积累和逐渐消退。总之,这些发现表明,在语言理解过程中连贯性通过缓慢的上下文整合和快速的事件驱动重新配置的分离神经机制实现,为理解精神疾病中语言连贯性障碍的机制提供了切入点。
Summary / 总结
This study investigates how the brain processes language by satisfying two temporal demands: gradual accumulation of meaning and rapid reconfiguration at event boundaries. Using ultra-high-field BOLD imaging and a large language model, the researchers found that drift signals, indicative of gradual meaning accumulation, were prevalent in default-mode network hubs, while shift signals, indicating rapid reconfiguration, were evident in the primary auditory cortex and language association cortex. The findings suggest that coherence during language comprehension is implemented through distinct neural regimes, providing insights into language coherence disturbances in psychiatric disorders.
该研究探讨了大脑在处理语言时如何应对两种相互竞争的时间需求:逐步积累意义和在事件边界处快速重新配置表征。通过使用超高场BOLD成像和大型语言模型,研究人员发现,指示意义逐步积累的漂移信号主要出现在默认模式网络的枢纽区域,而指示快速重新配置的位移信号则出现在初级听觉皮层和语言关联皮层。研究结果表明,语言理解中的连贯性是通过缓慢的上下文整合与快速的事件驱动重新配置这两种可分离的神经机制实现的,为理解精神疾病中的语言连贯性障碍提供了机制性的切入点。
UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images
Authors: Yiming Zhao, Yuanpeng Gao, Yuxuan Luo, Jiwei Duan, Shisong Lin, Longfei Xiong, Zhouhui Lian
Venue: SIGGRAPH Asia 2025
First: 2025-12-23T16:13:55+00:00 · Latest: 2025-12-23T16:13:55+00:00
Comments: 22 pages, 25 figures, SIGGRAPH Asia 2025, Conference Paper
Abstract
AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at https://github.com/ZYM-PKU/UTDesign.
中文标题/摘要
标题:UTDesign:图形设计图像中风格化文本编辑与生成的统一框架
AI辅助的图形设计已成为自动化设计元素(如海报、横幅和广告)创建和编辑的强大工具。尽管基于扩散的文本到图像模型在视觉内容生成方面表现出强大的能力,但它们在小尺度字体和非拉丁文字符的文本渲染性能方面仍然有限。在本文中,我们提出了一种名为UTDesign的统一框架,用于设计图像中的高精度风格化文本编辑和条件文本生成,支持英文字体和中文字体。我们的框架引入了一种从合成数据集从零开始训练的新型DiT基文本风格转换模型,能够生成透明的RGBA文本前景,保留参考字符的风格。我们进一步通过在包含详细文本注释的精心策划数据集上训练多模态条件编码器,将该模型扩展为条件文本生成框架,使其能够根据背景图像、提示和布局规范生成准确且风格一致的文本合成。最后,我们通过集成预训练的文本到图像(T2I)模型和基于MLLM的布局规划器,将我们的方法整合到一个完全自动化的文本到设计(T2D)流水线中。广泛的实验表明,UTDesign在开源方法中在风格一致性与文本准确性方面达到了最先进的性能,并且与专有商业方法相比具有独特优势。本文的代码和数据可在https://github.com/ZYM-PKU/UTDesign获取。
Summary / 总结
UTDesign is a unified framework for stylized text editing and generation in graphic design images, addressing limitations in text rendering for small-scale typography and non-Latin scripts. It uses a DiT-based text style transfer model and a multi-modal condition encoder to generate accurate, style-consistent text. Experiments show UTDesign outperforms open-source methods in stylistic consistency and text accuracy, and has unique advantages over proprietary commercial approaches.
UTDesign 是一个统一框架,用于在图形设计图像中进行高精度的风格化文本编辑和条件文本生成,支持英文和中文两种文字。它引入了基于DiT的文本风格迁移模型和多模态条件编码器,并结合预训练的文本到图像模型和基于MLLM的布局规划器,构成完全自动化的文本到设计(T2D)流水线。实验表明,UTDesign在风格一致性和文本准确性方面达到了开源方法中的最先进性能,并且相对于专有商业方法展现出独特优势。
Bohrium + SciMaster: Building the Infrastructure and Ecosystem for Agentic Science at Scale
Authors: Linfeng Zhang, Siheng Chen, Yuzhu Cai, Jingyi Chai, Junhan Chang, Kun Chen, Zhi X. Chen, Zhaohan Ding, Yuwen Du, Yuanpeng Gao, Yuan Gao, Jing Gao, Zhifeng Gao, Qiangqiang Gu, Yanhui Hong, Yuan Huang, Xi Fang, Xiaohong Ji, Guolin Ke, Zixing Lei, Xinyu Li, Yongge Li, Ruoxue Liao, Hang Lin, Xiaolu Lin, Yuxiang Liu, Xinzijian Liu, Zexi Liu, Jintan Lu, Tingjia Miao, Haohui Que, Weijie Sun, Yanfeng Wang, Bingyang Wu, Tianju Xue, Rui Ye, Jinzhe Zeng, Duo Zhang, Jiahui Zhang, Linfeng Zhang, Tianhan Zhang, Wenchang Zhang, Yuzhi Zhang, Zezhong Zhang, Hang Zheng, Hui Zhou, Tong Zhu, Xinyu Zhu, Qingguo Zhou, Weinan E
First: 2025-12-23T16:04:41+00:00 · Latest: 2025-12-23T16:04:41+00:00
Abstract
AI agents are emerging as a practical way to run multi-step scientific workflows that interleave reasoning with tool use and verification, pointing to a shift from isolated AI-assisted steps toward agentic science at scale. This shift is increasingly feasible, as scientific tools and models can be invoked through stable interfaces and verified with recorded execution traces, and increasingly necessary, as AI accelerates scientific output and stresses the peer-review and publication pipeline, raising the bar for traceability and credible evaluation. However, scaling agentic science remains difficult: workflows are hard to observe and reproduce; many tools and laboratory systems are not agent-ready; execution is hard to trace and govern; and prototype AI Scientist systems are often bespoke, limiting reuse and systematic improvement from real workflow signals. We argue that scaling agentic science requires an infrastructure-and-ecosystem approach, instantiated in Bohrium+SciMaster. Bohrium acts as a managed, traceable hub for AI4S assets -- akin to a HuggingFace of AI for Science -- that turns diverse scientific data, software, compute, and laboratory systems into agent-ready capabilities. SciMaster orchestrates these capabilities into long-horizon scientific workflows, on which scientific agents can be composed and executed. Between infrastructure and orchestration, a scientific intelligence substrate organizes reusable models, knowledge, and components into executable building blocks for workflow reasoning and action, enabling composition, auditability, and improvement through use. We demonstrate this stack with eleven representative master agents in real workflows, achieving orders-of-magnitude reductions in end-to-end scientific cycle time and generating execution-grounded signals from real workloads at multi-million scale.
中文标题/摘要
标题:Bohrium + SciMaster:构建大规模智能体科学的基础设施与生态系统
AI智能体正成为运行多步骤科学工作流的实用方式,这些工作流将推理与工具使用和验证交织在一起,预示着从孤立的AI辅助步骤向大规模智能体科学(agentic science at scale)的转变。这一转变日益可行,因为科学工具和模型可以通过稳定接口调用,并借助记录的执行轨迹进行验证;这一转变也日益必要,因为AI加速了科学产出,给同行评审和出版流程带来压力,提高了对可追溯性和可信评估的要求。然而,扩展智能体科学仍然困难:工作流难以观察和复现;许多工具和实验室系统尚未为智能体做好准备;执行难以追踪和治理;并且原型AI科学家系统往往是定制的,限制了复用以及基于真实工作流信号的系统性改进。我们认为,扩展智能体科学需要一种基础设施与生态系统并重的方法,其具体实现为Bohrium+SciMaster。Bohrium作为AI4S资产的受管理、可追溯的枢纽,类似于科学智能领域的HuggingFace,将多样的科学数据、软件、算力和实验室系统转化为可供智能体使用的能力。SciMaster将这些能力编排成长周期科学工作流,科学智能体可以在其上组合和执行。在基础设施和编排之间,一个科学智能基座将可复用的模型、知识和组件组织成用于工作流推理和行动的可执行构建块,支持组合、可审计性以及在使用中改进。我们通过真实工作流中的十一个代表性主智能体展示了这一技术栈,实现了端到端科学周期时间数量级的缩短,并从数百万规模的真实负载中生成了基于执行的信号。
Summary / 总结
The paper aims to address the challenges of scaling agentic science by proposing an infrastructure-and-ecosystem approach, exemplified by Bohrium+SciMaster. Bohrium acts as a managed, traceable hub for AI4S assets, converting scientific data and systems into agent-ready capabilities. SciMaster then orchestrates these capabilities into long-horizon workflows. The authors demonstrate this stack with eleven master agents, achieving significant reductions in scientific cycle time and generating execution-grounded signals at a large scale.
论文旨在通过提出基础设施与生态系统并重的方法来应对扩展智能体科学的挑战,其具体实现为Bohrium+SciMaster。Bohrium作为AI4S资产的受管理、可追溯的枢纽,将科学数据和系统转化为可供智能体使用的能力。SciMaster则将这些能力编排成长周期科学工作流。作者通过十一个主智能体在真实工作流中进行演示,大幅缩短了端到端科学周期时间,并在大规模下生成了基于执行的信号。
The Aligned Economic Index & The State Switching Model
Authors: Ilias Aarab
Venue: Financieel Forum Bank en Financiewezen 2020 3 pp 252-261
First: 2025-12-23T15:55:10+00:00 · Latest: 2025-12-23T15:55:10+00:00
Abstract
A growing empirical literature suggests that equity-premium predictability is state dependent, with much of the forecasting power concentrated around recessionary periods \parencite{Henkel2011,DanglHalling2012,Devpura2018}. I study U.S. stock return predictability across economic regimes and document strong evidence of time-varying expected returns across both expansionary and contractionary states. I contribute in two ways. First, I introduce a state-switching predictive regression in which the market state is defined in real time using the slope of the yield curve. Relative to the standard one-state predictive regression, the state-switching specification increases both in-sample and out-of-sample performance for the set of popular predictors considered by \textcite{WelchGoyal2008}, improving the out-of-sample performance of most predictors in economically meaningful ways. Second, I propose a new aggregate predictor, the Aligned Economic Index, constructed via partial least squares (PLS). Under the state-switching model, the Aligned Economic Index exhibits statistically and economically significant predictive power in sample and out of sample, and it outperforms widely used benchmark predictors and alternative predictor-combination methods.
中文标题/摘要
标题:对齐经济指数与状态转换模型
越来越多的经验研究表明,股权溢价可预测性具有状态依赖性,大部分预测能力集中在衰退期 \parencite{Henkel2011,DanglHalling2012,Devpura2018}。我研究了美国股票收益在不同经济状态下的可预测性,并记录了在扩张和收缩状态下预期收益时间变化的强烈证据。我有两点贡献。首先,我引入了一种状态转换预测回归模型,其中市场状态是通过收益率曲线的斜率在实时定义的。与标准的一状态预测回归相比,状态转换模型提高了考虑的流行预测因子的样本内和样本外表现,大多数预测因子在经济上有意义地提高了样本外表现。其次,我提出了一种新的总体预测指标,即对齐经济指数,通过偏最小二乘法(PLS)构建。在状态转换模型下,对齐经济指数在样本内和样本外表现出统计学和经济学上的显著预测能力,并且优于广泛使用的基准预测指标和替代预测组合方法。
Summary / 总结
This paper investigates the predictability of U.S. stock returns across economic states and introduces a state-switching predictive regression in which the market state is defined in real time by the slope of the yield curve. Relative to the standard one-state regression, the model improves the in-sample and out-of-sample performance of popular predictors. A new aggregate predictor, the Aligned Economic Index, is constructed via partial least squares and shows statistically and economically significant predictive power in sample and out of sample, outperforming widely used benchmark predictors and alternative predictor-combination methods.
研究考察了不同经济状态下美国股票回报的可预测性,发现了预期回报随时间变化的有力证据。研究引入了一种状态转换预测回归,其中市场状态由收益率曲线的斜率实时定义,这提高了常用预测因子的样本内和样本外表现。此外,研究提出了一种新的综合预测指标——对齐经济指数,该指数通过偏最小二乘法(PLS)构建,在样本内和样本外均显示出统计上和经济上显著的预测能力,并优于现有基准预测指标和其他预测组合方法。
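As a rough illustration of the state-switching predictive regression described in the entry above, the sketch below fits a separate intercept and slope coefficient for each market state, with the state decided by the yield-curve slope observed at the time of the forecast. The two-state split, the `threshold=0.0` cut-off, the state names `"flat"` and `"steep"`, and the function names are all assumptions made for illustration; the abstract does not say how the paper maps the yield-curve slope to states, and this is not the author's code.

```python
from statistics import mean


def ols(xs, ys):
    """Return (intercept, slope) of y = a + b * x fitted by least squares."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation in this state")
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def state_of(yield_slope, threshold=0.0):
    # Hypothetical rule: a slope at or below the threshold marks the "flat" state.
    return "flat" if yield_slope <= threshold else "steep"


def fit_state_switching(predictor, yield_slope, next_return, threshold=0.0):
    """Fit one predictive regression per state; each state needs >= 2 points."""
    samples = {"flat": ([], []), "steep": ([], [])}
    for x, s, r in zip(predictor, yield_slope, next_return):
        xs, rs = samples[state_of(s, threshold)]
        xs.append(x)
        rs.append(r)
    return {name: ols(xs, rs) for name, (xs, rs) in samples.items()}


def predict(params, x, yield_slope, threshold=0.0):
    """Forecast the next return using the coefficients of the current state."""
    a, b = params[state_of(yield_slope, threshold)]
    return a + b * x
```

Because the state at time t depends only on the yield-curve slope known at time t, the forecast uses no future information, which is the sense in which the state is defined in real time.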
Stochastic activations
Authors: Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazaré, Hervé Jégou
First: 2025-09-26T13:53:56+00:00 · Latest: 2025-12-23T15:51:07+00:00
Abstract
We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU or RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents the gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup in the CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
中文标题/摘要
标题:随机激活
我们引入了随机激活。这一新颖策略在大型语言模型的前馈层中于几种非线性函数之间随机选择。具体而言,我们根据伯努利抽样在SILU或RELU之间进行选择。这一策略绕过了RELU的优化问题,即负输入的恒定形状导致梯度流受阻。我们以两种方式利用这一策略: (1) 在预训练中使用随机激活,在微调时使用RELU,RELU在推理时用于提供稀疏的潜在向量,这减少了推理FLOPs并显著提高了CPU速度。有趣的是,这比从零开始使用RELU激活函数训练效果好得多。 (2) 我们评估了随机激活在生成中的应用。这一策略表现相当不错:它仅略逊于最佳确定性非线性,即SILU结合温度缩放。这为现有策略提供了一种替代方案,通过提供一种可控的方式来增加生成文本的多样性。
Summary / 总结
The research introduces stochastic activations, a strategy that randomly selects between SILU and RELU in the feed-forward layer of a large language model according to a Bernoulli draw. This sidesteps the optimization problem of RELU, whose constant output for negative inputs blocks gradient flow. In the first use, the model is pre-trained with stochastic activations and then fine-tuned with RELU, which is used at inference time to produce sparse latent vectors; this reduces inference FLOPs, speeds up CPU inference, and gives much better results than training from scratch with RELU. In the second use, stochastic activations are evaluated for text generation, where they are only slightly inferior to the best deterministic non-linearity (SILU combined with temperature scaling) and offer a controlled way to increase the diversity of generated text.
研究引入了随机激活这一新颖策略,根据伯努利抽样在大型语言模型的前馈层中随机选择SILU或RELU。该方法绕开了RELU的优化问题,即其对负输入的恒定输出会阻碍梯度流。第一种用法是在预训练中使用随机激活,随后用RELU进行微调,并在推理时使用RELU以产生稀疏的潜在向量,从而减少推理FLOPs并加快CPU推理速度,其结果明显优于从头开始使用RELU训练。第二种用法是将随机激活用于文本生成,其表现仅略逊于最佳确定性非线性(SILU结合温度缩放),并提供了一种可控地增加生成文本多样性的方式。
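The Bernoulli choice between SILU and RELU described in the entry above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the probability `p_relu`, the decision to make one draw per call (rather than per token or per element), and the function names are assumptions, since the abstract does not specify them. The `training=False` branch mirrors the first use case, where RELU alone is used at inference time.

```python
import math
import random


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def relu(x: float) -> float:
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def stochastic_activation(xs, p_relu=0.5, training=True, rng=random):
    """Apply RELU with probability p_relu, otherwise SILU (assumed: one draw per call).

    When training is False, RELU is always used, which yields exact zeros
    (sparse outputs) for negative inputs.
    """
    if not training:
        return [relu(x) for x in xs]
    use_relu = rng.random() < p_relu  # Bernoulli(p_relu) draw
    activation = relu if use_relu else silu
    return [activation(x) for x in xs]
```

On the draws where SILU is selected, negative inputs still receive a non-zero gradient, which is how the strategy avoids the blocked gradient flow of a pure RELU network.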
Binarization-Aware Adjuster for Discrete Decision Learning with an Application to Edge Detection
Authors: Hao Shu
First: 2025-06-14T11:56:44+00:00 · Latest: 2025-12-23T15:42:00+00:00
Comments: 28 pages
Abstract
Discrete decision tasks in machine learning exhibit a fundamental misalignment between training and inference: models are optimized with continuous-valued outputs but evaluated using discrete predictions. This misalignment arises from the discontinuity of discretization operations, which prevents decision behavior from being directly incorporated into gradient-based optimization. To address this issue, we propose a theoretically grounded framework termed the Binarization-Aware Adjuster (BAA), which embeds binarization characteristics into continuous optimization. The framework is built upon the Distance Weight Function (DWF), which modulates loss contributions according to prediction correctness and proximity to the decision threshold, thereby aligning optimization emphasis with decision-critical regions while remaining compatible with standard learning pipelines. We apply the proposed BAA framework to the edge detection (ED) task, a representative binary decision problem. Experimental results on representative models and datasets show that incorporating BAA into optimization leads to consistent performance improvements, supporting its effectiveness. Overall, this work establishes a principled approach for aligning continuous optimization with discrete decision behavior, with its effectiveness demonstrated in a concrete application setting.
中文标题/摘要
标题:面向离散决策学习的二值化感知调整器及其在边缘检测中的应用
机器学习中的离散决策任务在训练和推理之间存在根本性不匹配:模型使用连续值输出进行优化,但使用离散预测进行评估。这种不匹配源于离散化操作的不连续性,阻止了决策行为直接被梯度优化所利用。为了解决这一问题,我们提出了一种名为二值化感知调整器(BAA)的理论框架,该框架将二值化特性嵌入到连续优化中。该框架基于距离加权函数(DWF),根据预测的正确性和接近决策阈值的程度来调节损失贡献,从而将优化重点与决策关键区域对齐,同时保持与标准学习管道的兼容性。我们将提出的BAA框架应用于边缘检测(ED)任务,这是一个典型的二元决策问题。在代表性模型和数据集上的实验结果表明,将BAA纳入优化可以带来一致的性能提升,支持其有效性。总体而言,这项工作建立了一种原理性的方法,将连续优化与离散决策行为对齐,并通过具体的应用场景证明了其有效性。
Summary / 总结
The paper addresses the misalignment between continuous training and discrete evaluation in machine learning by proposing a Binarization-Aware Adjuster (BAA) framework. BAA embeds binarization characteristics into continuous optimization using a Distance Weight Function (DWF) that adjusts loss contributions based on prediction correctness and proximity to the decision threshold. The framework is applied to edge detection, showing consistent performance improvements when integrated into optimization, thus validating its effectiveness in aligning continuous optimization with discrete decision behavior.
论文提出了一种名为Binarization-Aware Adjuster (BAA)的框架,以解决机器学习中连续训练与离散推理之间的不匹配问题。BAA通过距离权重函数(DWF)根据预测的正确性和接近决策阈值的程度来调整损失贡献,使优化与决策关键区域保持一致。在边缘检测任务上的实验表明,BAA能够一致地提高模型性能,证明了其在使连续优化与离散决策行为保持一致方面的有效性。
Dual-Encoder Transformer-Based Multimodal Learning for Ischemic Stroke Lesion Segmentation Using Diffusion MRI
Authors: Muhammad Usman, Azka Rehman, Muhammad Mutti Ur Rehman, Abd Ur Rehman, Muhammad Umar Farooq
First: 2025-12-23T15:24:31+00:00 · Latest: 2025-12-23T15:24:31+00:00
Abstract
Accurate segmentation of ischemic stroke lesions from diffusion magnetic resonance imaging (MRI) is essential for clinical decision-making and outcome assessment. Diffusion-Weighted Imaging (DWI) and Apparent Diffusion Coefficient (ADC) scans provide complementary information on acute and sub-acute ischemic changes; however, automated lesion delineation remains challenging due to variability in lesion appearance. In this work, we study ischemic stroke lesion segmentation using multimodal diffusion MRI from the ISLES 2022 dataset. Several state-of-the-art convolutional and transformer-based architectures, including U-Net variants, Swin-UNet, and TransUNet, are benchmarked. Based on performance, a dual-encoder TransUNet architecture is proposed to learn modality-specific representations from DWI and ADC inputs. To incorporate spatial context, adjacent slice information is integrated using a three-slice input configuration. All models are trained under a unified framework and evaluated using the Dice Similarity Coefficient (DSC). Results show that transformer-based models outperform convolutional baselines, and the proposed dual-encoder TransUNet achieves the best performance, reaching a Dice score of 85.4% on the test set. The proposed framework offers a robust solution for automated ischemic stroke lesion segmentation from diffusion MRI.
中文标题/摘要
标题:基于双编码器变换器的多模态学习在使用弥散MRI的缺血性中风病灶分割中的应用
从弥散磁共振成像(MRI)中准确分割缺血性中风病灶对于临床决策和结果评估至关重要。弥散加权成像(DWI)和表观扩散系数(ADC)扫描提供了急性及亚急性缺血变化的互补信息;然而,由于病灶外观的变异性,自动病灶勾画仍然具有挑战性。 在这项工作中,我们使用ISLES 2022数据集中的多模态弥散MRI研究缺血性中风病灶分割。多种最先进的卷积和变换器架构,包括U-Net变体、Swin-UNet和TransUNet,进行了基准测试。基于性能,提出了一种双编码器TransUNet架构,从DWI和ADC输入中学习模态特定的表示。为了整合空间上下文,使用三片输入配置整合了相邻切片信息。 所有模型都在统一框架下进行训练,并使用Dice相似性系数(DSC)进行评估。结果显示,基于变换器的模型优于基于卷积的基线模型,提出的双编码器TransUNet在测试集上达到85.4%的Dice分数,提供了从弥散MRI自动分割缺血性中风病灶的稳健解决方案。
Summary / 总结
This study aims to improve the accuracy of ischemic stroke lesion segmentation from diffusion MRI by leveraging multimodal data. The authors benchmark several state-of-the-art architectures and propose a dual-encoder TransUNet that learns modality-specific representations from DWI and ADC inputs. By integrating adjacent slice information, the model achieves a Dice score of 85.4% on the test set, outperforming convolutional baselines and demonstrating robust performance for automated lesion segmentation.
该研究旨在通过利用多模态数据提高缺血性中风病灶从扩散MRI中的分割准确性。作者对比了几种最先进的架构,并提出了一种双编码器TransUNet模型,该模型从DWI和ADC输入中学习模态特定的表示。通过整合相邻切片信息,该模型在测试集上的Dice得分为85.4%,优于卷积基线模型,并展示了自动病灶分割的稳健性能。
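The Dice Similarity Coefficient used to evaluate the segmentation models in the entry above is defined for two binary masks A and B as 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration, and the paper's exact evaluation protocol (per slice, per volume, or averaged over cases) is not stated in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice score between two equal-length binary masks (flattened)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total
```

For example, with a prediction `[1, 1, 0, 0]` and a ground truth `[1, 0, 1, 0]`, one voxel overlaps and each mask has two foreground voxels, so the score is 2 * 1 / 4 = 0.5.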
Video Generation Models Are Good Latent Reward Models
Authors: Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang
First: 2025-11-26T16:14:18+00:00 · Latest: 2025-12-23T15:17:06+00:00
Abstract
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
中文标题/摘要
标题:视频生成模型是良好的潜在奖励模型
奖励反馈学习(ReFL)已被证明对于使图像生成与人类偏好对齐是有效的。然而,将其扩展到视频生成面临着重大挑战。现有的视频奖励模型依赖于为像素空间输入设计的视觉-语言模型,这将ReFL优化限制在昂贵的VAE解码之后的近完全去噪步骤中。这种像素空间的方法会产生大量的内存开销并增加训练时间,而且其后期优化缺乏早期监督,仅能优化视觉质量而不是基本的运动动态和结构一致性。在本文中,我们展示了预训练的视频生成模型自然适合在嘈杂的潜在空间中进行奖励建模,因为它们明确设计为可以处理任意时间步的嘈杂潜在表示,并通过其序列建模能力内在地保留时间信息。因此,我们提出了过程奖励反馈学习(PRFL)框架,该框架在潜在空间中完全进行偏好优化,从而在整个去噪链中实现高效的梯度反向传播,而无需VAE解码。广泛的实验表明,PRFL在显著提高与人类偏好的对齐程度的同时,与RGB ReFL相比实现了内存消耗和训练时间的大幅减少。
Summary / 总结
This work addresses the challenges of applying reward feedback learning (ReFL) to video generation by proposing Process Reward Feedback Learning (PRFL). PRFL leverages pre-trained video generation models to model rewards in the noisy latent space, avoiding the need for computationally expensive VAE decoding. This approach leads to better alignment with human preferences, reduced memory consumption, and shorter training times compared to traditional pixel-space ReFL methods.
本文提出了一种称为Process Reward Feedback Learning (PRFL)的方法,以解决将奖励反馈学习(ReFL)应用于视频生成的挑战。PRFL利用预训练的视频生成模型在含噪潜空间中进行奖励建模和偏好优化,避免了昂贵的VAE解码步骤。与传统的RGB ReFL方法相比,该方法显著提高了与人类偏好的一致性,同时大幅减少了内存消耗和训练时间。