arXiv Paper Digest

2025-12-18 03:35
Snapshot: 20251218_0335
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
Authors: Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, Hengshuang Zhao
First: 2025-12-16T18:59:59+00:00 · Latest: 2025-12-16T18:59:59+00:00
Comments: Project Page: https://sihuiji.github.io/MemFlow.github.io/
Abstract
The core challenge in streaming video generation is maintaining content consistency over a long context, which places high demands on memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different video chunks to be generated should refer to different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the next chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to the text prompt of this chunk. This design enables narrative coherence even if a new event happens or the scenario switches in future frames. In addition, during generation, we activate only the most relevant tokens in the memory bank for each query in the attention layers, which keeps generation efficient. In this way, MemFlow achieves outstanding long-context consistency with negligible computation burden (7.9% speed reduction compared with the memory-free baseline) and remains compatible with any streaming video generation model with KV cache.
中文标题/摘要
标题:MemFlow:流动的自适应内存以实现一致且高效的长视频叙述
长视频生成的核心挑战在于保持长时间上下文中的内容一致性,这对内存设计提出了很高的要求。大多数现有解决方案通过预定义策略压缩历史帧来维持内存。然而,不同待生成的视频片段应该参考不同的历史线索,这很难用固定策略满足。在本文中,我们提出了MemFlow来解决这个问题。具体来说,在生成即将来的片段之前,我们通过检索与该片段文本提示最相关的历史帧来动态更新内存库。这种设计即使在未来的帧中发生新事件或场景切换时也能保持叙述连贯性。此外,在生成过程中,我们仅在注意力层中的每个查询中激活内存库中最相关的令牌,从而有效保证生成效率。这样,MemFlow在几乎不增加计算负担(与无内存基线相比,速度降低7.9%)的情况下实现了出色的长上下文一致性,并且与任何带有KV缓存的流式视频生成模型兼容。
Summary / 总结
MemFlow addresses the challenge of maintaining content consistency in long video narratives by dynamically updating the memory bank with relevant historical frames based on the text prompt of the current chunk. This approach ensures narrative coherence and efficiency during generation, achieving excellent long-context consistency with minimal computational overhead (7.9% speed reduction compared to a memory-free baseline).
MemFlow 通过根据当前片段的文本提示动态更新包含相关历史帧的内存库来解决长视频叙事中保持内容一致性的挑战。这种方法在生成过程中确保了叙事连贯性和效率,并且在最小的计算开销(与无内存基线相比减少7.9%的计算速度)下实现了良好的长上下文一致性。
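The retrieval step described in the MemFlow abstract (pick the historical frames most relevant to the next chunk's text prompt) can be illustrated with a minimal sketch. It assumes that frames and the prompt are already embedded in a shared vector space and that relevance is cosine similarity; the function names, the embeddings, and the choice of similarity are all hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(prompt_emb, frame_embs, k):
    """Return indices of the k frames most similar to the prompt, best first."""
    order = sorted(range(len(frame_embs)),
                   key=lambda i: cosine(prompt_emb, frame_embs[i]),
                   reverse=True)
    return order[:k]

# Toy 2-D embeddings (hypothetical values, for illustration only).
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
prompt = [1.0, 0.1]

# Frames kept in the memory bank for the next chunk.
memory_bank = retrieve_top_k(prompt, frames, k=2)
print(memory_bank)
```

The same top-k idea could in principle be applied per query inside an attention layer, which is how the abstract describes keeping generation efficient, but that part is not sketched here.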
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Authors: Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
First: 2025-12-16T18:59:58+00:00 · Latest: 2025-12-16T18:59:58+00:00
Comments: Project Page: https://timelens-arc-lab.github.io/
Abstract
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
中文标题/摘要
标题:TimeLens:以多模态大语言模型重新思考视频时间定位
本文并未引入新的方法,而是为视频时间定位(VTG)这一视频理解的核心能力建立了简单、渐进但至关重要的基准。尽管多模态大语言模型(MLLMs)在各种视频理解任务中表现出色,但针对VTG进行优化的食谱仍处于探索阶段。本文介绍了TimeLens,这是一种系统性的研究,旨在构建具有强大VTG能力的MLLMs,主要从数据质量和算法设计两个维度进行。我们首先揭示了现有VTG基准中的关键质量问题,并引入了TimeLens-Bench,包含三个流行基准的严格重新注释版本。我们的分析显示,与传统基准相比,模型排名发生了巨大变化,证实了先前评估标准的不可靠性。我们还通过自动重新注释流水线解决了嘈杂的训练数据问题,生成了TimeLens-100K,这是一个大规模、高质量的训练数据集。基于我们的数据基础,我们深入探索了算法设计原则,产生了许多有意义的见解和有效且高效的实践。这些包括交错的文本编码用于时间表示、无思考的强化学习(RLVR)训练范式以及RLVR训练的精心设计配方。这些努力最终产生了TimeLens模型,这是一个具有开源模型中最佳VTG性能的MLLM家族,甚至超越了GPT-5和Gemini-2.5-Flash等专有模型。所有代码、数据和模型将被发布,以促进未来的研究。
Summary / 总结
TimeLens aims to establish a robust baseline for video temporal grounding (VTG) by addressing data quality and algorithmic design. It introduces TimeLens-Bench, which comprises re-annotated versions of three popular VTG benchmarks, and TimeLens-100K, a large-scale training dataset produced by an automated re-annotation pipeline. The paper also explores algorithmic design principles (interleaved textual encoding of time and thinking-free RLVR training), leading to TimeLens models that achieve state-of-the-art VTG performance among open-source models and surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. The authors state that all code, data, and models will be released.
TimeLens旨在通过解决数据质量和算法设计问题,为视频时间定位(VTG)建立一个稳健的基础。作者引入了TimeLens-Bench,这是一个严格质量标准下重新注释的现有基准版本,以及TimeLens-100K,一个大规模、高质量的训练数据集。他们还提出了交错文本编码和无思考的强化学习与可验证奖励(RLVR)方法,从而在开源和专有模型中都达到了最先进的VTG性能。
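To make "verifiable reward" concrete for temporal grounding, here is a minimal sketch of temporal intersection-over-union (tIoU) between a predicted and an annotated time span. tIoU is a common VTG evaluation measure; using it as the RLVR reward is an assumption made for illustration, since the abstract does not specify the reward function.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))   # overlap length, 0 if disjoint
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical prediction and annotation, for illustration only.
reward = temporal_iou((2.0, 8.0), (4.0, 10.0))
print(reward)
```

A reward of this kind is "verifiable" in the sense that it is computed directly from the annotation, with no learned judge involved.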
Love First, Know Later: Persona-Based Romantic Compatibility Through LLM Text World Engines
Authors: Haoyang Shang, Zhengyang Yan, Xuan Liu
Venue: NeurIPS 2025 Workshop on LLM Persona Modeling (Oral)
First: 2025-12-04T02:07:05+00:00 · Latest: 2025-12-16T18:59:14+00:00
Comments: NeurIPS 2025 Workshop: First Workshop on LLM Persona Modeling (Oral)
Abstract
We propose Love First, Know Later: a paradigm shift in computational matching that simulates interactions first, then assesses compatibility. Instead of comparing static profiles, our framework leverages LLMs as text world engines that operate in a dual capacity: as persona-driven agents following behavioral policies and as the environment modeling interaction dynamics. We formalize compatibility assessment as a reward-modeling problem: given observed matching outcomes, we learn to extract signals from simulations that predict human preferences. Our key insight is that relationships hinge on responses to critical moments; we translate this observation from relationship psychology into mathematical hypotheses, enabling effective simulation. Theoretically, we prove that as LLM policies better approximate human behavior, the induced matching converges to optimal stable matching. Empirically, we validate on speed dating data for initial chemistry and divorce prediction for long-term stability. This paradigm enables interactive, personalized matching systems where users iteratively refine their agents, unlocking future possibilities for transparent and interactive compatibility assessment.
中文标题/摘要
标题:先爱后知:基于人设的浪漫兼容性通过LLM文本世界引擎
我们提出“先爱后知”:一种计算匹配范式的转变,先模拟互动,再评估兼容性。我们框架利用LLM作为文本世界引擎,双重作用为基于人设的代理并遵循行为策略,以及作为环境模拟互动动态。我们将兼容性评估形式化为奖励建模问题:给定匹配结果,我们学习从模拟中提取信号以预测人类偏好。我们的核心见解是,关系取决于对关键时刻的反应——我们将这一观察从关系心理学转化为数学假设,从而实现有效的模拟。理论上,我们证明随着LLM策略更好地逼近人类行为,诱导的匹配将收敛到最优稳定匹配。实验上,我们在速配数据上验证了初始化学反应,在离婚预测上验证了长期稳定性。这一范式使用户能够迭代优化其代理,从而解锁未来透明和互动兼容性评估的可能性。
Summary / 总结
The work reframes computational matching by simulating interactions between potential partners, with LLMs acting as text world engines: both as persona-driven agents and as the environment that models interaction dynamics. Compatibility assessment is cast as reward modeling, and an insight from relationship psychology (relationships hinge on responses to critical moments) is translated into mathematical hypotheses. Theoretically, the authors prove that as LLM policies better approximate human behavior, the induced matching converges to the optimal stable matching. Empirically, they validate on speed dating data for initial chemistry and on divorce prediction for long-term stability.
研究提出了一种新的计算匹配方法——Love First, Know Later,通过LLM作为文本世界引擎模拟互动。该方法通过学习模拟互动来评估兼容性,而不是比较静态资料。关键发现表明,随着LLM策略更好地模拟人类行为,匹配结果将趋向于最优稳定匹配。系统在速配数据上验证了初始化学反应,在长期稳定性上预测离婚,展示了其在个性化匹配系统中的有效性。
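For readers unfamiliar with the term "stable matching" used in the abstract, the classical Gale-Shapley deferred-acceptance algorithm is sketched below. This is the textbook algorithm, not a procedure the paper is known to use; in the paper's setting the preference lists would presumably be derived from learned compatibility signals, which is an assumption here. All names and preferences are hypothetical.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a stable matching as {proposer: receiver}.

    Each argument maps a name to a complete preference list,
    most preferred first.
    """
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next receiver to propose to
    engaged = {}                                  # receiver -> proposer
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = p                        # r was unmatched: accept
        elif rank[r][p] < rank[r][current]:
            engaged[r] = p                        # r prefers p: swap partners
            free.append(current)
        else:
            free.append(p)                        # r rejects p
    return {p: r for r, p in engaged.items()}

# Hypothetical two-by-two example.
matching = gale_shapley(
    {"A": ["X", "Y"], "B": ["X", "Y"]},
    {"X": ["B", "A"], "Y": ["A", "B"]},
)
print(matching)
```

A matching is stable when no proposer and receiver both prefer each other over their assigned partners; the loop above terminates with such a matching whenever both sides have equally many members and complete preference lists.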
Universal Reasoning Model
Authors: Zitian Gao, Lynx Chen, Yihao Xiao, He Xing, Ran Tao, Haoming Luo, Joey Zhou, Bryan Dai
First: 2025-12-16T18:58:45+00:00 · Latest: 2025-12-16T18:58:45+00:00
Abstract
Universal transformers (UTs) have been widely used for complex reasoning tasks such as ARC-AGI and Sudoku, yet the specific sources of their performance gains remain underexplored. In this work, we systematically analyze UT variants and show that improvements on ARC-AGI primarily arise from the recurrent inductive bias and strong nonlinear components of Transformer, rather than from elaborate architectural designs. Motivated by this finding, we propose the Universal Reasoning Model (URM), which enhances the UT with short convolution and truncated backpropagation. Our approach substantially improves reasoning performance, achieving state-of-the-art 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2. Our code is available at https://github.com/zitian-gao/URM.
中文标题/摘要
标题:通用推理模型
通用变换器(UTs)已被广泛用于复杂的推理任务,如ARC-AGI和数独,但其性能提升的具体来源尚未充分探索。在本文中,我们系统地分析了UTs的变体,并表明ARC-AGI上的改进主要来自于Transformer的递归归纳偏见和强大的非线性组件,而不是复杂的架构设计。受这一发现的启发,我们提出了通用推理模型(URM),该模型通过短卷积和截断反向传播增强了UT。我们的方法显著提高了推理性能,在ARC-AGI 1上实现了最先进的53.8%的pass@1,在ARC-AGI 2上实现了16.0%的pass@1。我们的代码可在https://github.com/zitian-gao/URM获取。
Summary / 总结
This study investigates the performance gains of universal transformers (UTs) in complex reasoning tasks like ARC-AGI and Sudoku, finding that improvements mainly come from recurrent inductive bias and strong nonlinear components. To enhance UTs, the authors propose the Universal Reasoning Model (URM), which includes short convolution and truncated backpropagation. URM significantly improves reasoning performance, achieving state-of-the-art results of 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2.
研究旨在理解通用变压器(UTs)在复杂推理任务中的性能提升原因。通过分析UT变体,研究发现UT性能提升主要归因于Transformer的递归归纳偏置和强大的非线性组件,而非复杂的架构设计。基于此,提出了增强的通用推理模型(URM),该模型在UT中增加了短卷积和截断反向传播。该模型显著提升了推理性能,在ARC-AGI上取得了最先进的结果,ARC-AGI 1的pass@1为53.8%,ARC-AGI 2的pass@1为16.0%。
MMGR: Multi-Modal Generative Reasoning
Authors: Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Xiao Wen, Jiuxiang Gu, Nanyun Peng, Junjie Hu
First: 2025-12-16T18:58:04+00:00 · Latest: 2025-12-16T18:58:04+00:00
Comments: work in progress
Abstract
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
中文标题/摘要
标题:MMGR:多模态生成推理
视频基础模型生成视觉上逼真且时间上连贯的内容,但它们作为世界模拟器的可靠性取决于是否捕捉了物理、逻辑和空间约束。现有指标如弗雷切视频距离(FVD)强调感知质量,而忽视了推理失败,包括因果关系、物理法则和全局一致性等违反。我们引入了MMGR(多模态生成推理评估与基准),一个基于五种推理能力的原理性评估框架:物理、逻辑、三维空间、二维空间和时间。MMGR在抽象推理(ARC-AGI、数独)、体感导航(现实世界三维导航和定位)和物理常识(体育和组合交互)三个领域评估生成推理。MMGR应用细粒度指标,要求视频和图像生成的整体正确性。我们对领先视频模型(Veo-3、Sora-2、Wan-2.2)和图像模型(Nano-banana、Nano-banana Pro、GPT-4o-image、Qwen-image)进行了基准测试,揭示了不同领域的性能差距。模型在物理常识任务上表现出适度的成功,但在抽象推理(ARC-AGI准确率低于10%)和体感设置中的长期空间规划方面表现不佳。我们的分析突显了当前模型的关键局限性,包括过度依赖感知数据、全局状态一致性较弱以及奖励视觉合理性而非因果正确性的目标。MMGR提供了一个统一的诊断基准,并为推理感知生成世界模型指明了方向。
Summary / 总结
MMGR (Multi-Modal Generative Reasoning) is an evaluation framework and benchmark that assesses video and image generation models on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. It evaluates models across three domains: abstract reasoning, embodied navigation, and physical commonsense. Benchmarking leading models reveals large performance gaps across domains, with moderate success on physical commonsense but poor results on abstract reasoning (below 10 percent accuracy on ARC-AGI) and long-horizon spatial planning. Key limitations include overreliance on perceptual data, weak global state consistency, and training objectives that reward visual plausibility over causal correctness.
MMGR 是一个新的评估框架,用于评估视频和图像生成模型在物理、逻辑和空间领域的推理能力。它通过细粒度的指标,在抽象推理、体感导航和物理常识任务上评估模型。基准测试显示,领先模型在物理常识任务上表现良好,但在抽象推理和长时空间规划方面表现不佳,这表明模型需要在推理和因果正确性方面超越感知质量进行改进。
Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization
Authors: Yen-Ju Lu, Kunxiao Gao, Mingrui Liang, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Najim Dehak, Jesus Villalba
First: 2025-12-16T18:54:20+00:00 · Latest: 2025-12-16T18:54:20+00:00
Comments: 12 pages, 2 figures
Abstract
Recent audio language models can follow long conversations. However, research on emotion-aware or spoken dialogue summarization is constrained by the lack of data that links speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus aligning raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels, then tags each utterance with emotion, pitch, and speaking rate. Second, an expressive TTS engine synthesizes speech from the tagged scripts, aligned with paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/. Baselines show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
中文标题/摘要
标题:Spoken DialogSum:一种用于口语对话摘要的情感丰富对话数据集
近期的音频语言模型可以跟随长时间的对话。然而,关于情感意识或口语对话摘要的研究受到缺乏将语音、摘要和副语言线索联系起来的数据的限制。我们引入了Spoken DialogSum,这是第一个将原始对话音频与事实摘要、情感丰富的摘要以及说话人年龄、性别和情感的单元级标签对齐的数据集。数据集分为两个阶段构建:首先,一个LLM使用Switchboard风格的填充和回话通道重写DialogSum脚本,然后为每个单元标注情感、音高和说话速率。其次,一个具有表现力的TTS引擎从标注的脚本中合成语音,并与副语言标签对齐。Spoken DialogSum包含13,460个情感多样的对话,每个对话都配有一个事实摘要和一个情感聚焦的摘要。数据集可在https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/在线获取。基线结果显示,音频LLM相比级联ASR-LLM系统,情感摘要ROUGE-L提高了28%,证实了端到端语音建模的价值。
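The Spoken DialogSum abstract reports a 28% relative gain in ROUGE-L. As a reminder of what is being measured, here is a minimal sketch of ROUGE-L as an F1 score over the longest common subsequence (LCS) of whitespace tokens, plus the relative-gain arithmetic. The whitespace tokenization, the equal weighting of precision and recall, and the example scores are assumptions; the paper's exact ROUGE configuration is not given in the abstract.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (precision and recall weighted equally)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def relative_gain(new, old):
    """Relative improvement of new over old, e.g. 0.28 means +28%."""
    return (new - old) / old

score = rouge_l_f1("the speaker sounds upset", "the speaker sounds very upset")
# Hypothetical scores chosen only to illustrate a 28% relative gain.
gain = relative_gain(0.32, 0.25)
print(score, gain)
```

Note that a relative gain of 28% is not the same as 28 ROUGE-L points: in the hypothetical numbers above it corresponds to an absolute difference of 0.07.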
Early Warning Index for Patient Deteriorations in Hospitals
Authors: Dimitris Bertsimas, Yu Ma, Kimberly Villalobos Carballo, Gagan Singh, Michal Laskowski, Jeff Mather, Dan Kombert, Howard Haronian
First: 2025-12-16T18:47:27+00:00 · Latest: 2025-12-16T18:47:27+00:00
Abstract
Hospitals lack automated systems to harness the growing volume of heterogeneous clinical and operational data to effectively forecast critical events. Early identification of patients at risk for deterioration is essential not only for patient care quality monitoring but also for physician care management. However, translating varied data streams into accurate and interpretable risk assessments poses significant challenges due to inconsistent data formats. We develop a multimodal machine learning framework, the Early Warning Index (EWI), to predict the aggregate risk of ICU admission, emergency response team dispatch, and mortality. Key to EWI's design is a human-in-the-loop process: clinicians help determine alert thresholds and interpret model outputs, which are enhanced by explainable outputs using Shapley Additive exPlanations (SHAP) to highlight clinical and operational factors (e.g., scheduled surgeries, ward census) driving each patient's risk. We deploy EWI in a hospital dashboard that stratifies patients into three risk tiers. Using a dataset of 18,633 unique patients at a large U.S. hospital, our approach automatically extracts features from both structured and unstructured electronic health record (EHR) data and achieves a C-statistic of 0.796. It is currently used as a triage tool for proactively managing at-risk patients. The proposed approach saves physicians valuable time by automatically sorting patients of varying risk levels, allowing them to concentrate on patient care rather than sifting through complex EHR data. By further pinpointing specific risk drivers, the proposed model provides data-informed adjustments to caregiver scheduling and allocation of critical resources. As a result, clinicians and administrators can avert downstream complications, including costly procedures or high readmission rates, and improve overall patient flow.
中文标题/摘要
标题:医院患者恶化早期预警指数
医院缺乏自动系统来利用日益增长的异质临床和运营数据有效预测关键事件。早期识别处于恶化风险的患者不仅对于监测患者护理质量至关重要,也对于医生的护理管理至关重要。然而,将各种数据流转化为准确且可解释的风险评估面临巨大挑战,因为数据格式不一致。我们开发了一种多模态机器学习框架——早期预警指数(EWI),以预测ICU入院、紧急响应团队派遣和死亡的综合风险。EWI设计的关键在于有人参与的过程:临床医生帮助确定警报阈值并解释模型输出,通过Shapley加性解释(SHAP)可解释输出来突出临床和运营因素(如计划手术、病房人数)对每位患者风险的影响。我们部署EWI在医院仪表板中,将患者分为三个风险等级。使用美国一家大型医院的18,633名独特患者的资料集,我们的方法自动从结构化和非结构化的电子健康记录(EHR)数据中提取特征,C统计值为0.796。目前,该方法被用作优先管理处于风险中的患者的分诊工具。该提议的方法通过自动对不同风险级别的患者进行分类,为医生节省了宝贵的时间,使他们能够专注于患者护理,而不是处理复杂的EHR数据。通过进一步明确特定的风险驱动因素,该提议的模型提供了基于数据的护理人员排班和关键资源分配的调整,从而让临床医生和管理人员能够预防下游并发症,包括昂贵的程序或高再入院率,并改善整体患者流动。
Summary / 总结
The study addresses the need for automated systems to predict patient deterioration in hospitals. It introduces the Early Warning Index (EWI), a multimodal machine learning framework that combines structured and unstructured EHR data to forecast ICU admissions, emergency response team dispatch, and mortality. EWI uses a human-in-the-loop process where clinicians set alert thresholds and interpret model outputs, which are explained using SHAP values. The system stratifies patients into three risk tiers and achieves a C-statistic of 0.796. It is currently used as a triage tool to help physicians manage at-risk patients more efficiently, saving time and improving patient care.
研究旨在通过利用临床和运营数据开发自动系统来预测医院患者的恶化情况。Early Warning Index (EWI) 使用多模态机器学习框架来预测ICU入院、紧急响应团队派遣和死亡率。关键功能包括人工参与设置警报阈值和使用SHAP值进行可解释的风险评估。EWI 在医院仪表盘中部署,将患者分为三个风险等级。该方法使用18,633名患者的记录实现了0.796的C统计值,并作为分诊工具用于管理高风险患者,节省了医生的时间并提高了患者护理质量。
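The C-statistic quoted for EWI is the probability that a randomly chosen patient who deteriorated receives a higher risk score than a randomly chosen patient who did not. A minimal sketch of that pairwise definition, together with a three-tier stratification, is given below. The tier cut-offs (0.3 and 0.7) and the example scores are hypothetical; in the paper the alert thresholds are chosen with clinicians and are not stated in the abstract.

```python
def c_statistic(scores, labels):
    """Concordance over all (positive, negative) pairs; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = sum(1.0 if p > n else 0.5 if p == n else 0.0
                     for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))

def risk_tier(score, low=0.3, high=0.7):
    """Map a risk score to one of three tiers (cut-offs are hypothetical)."""
    if score >= high:
        return "high"
    if score >= low:
        return "medium"
    return "low"

# Hypothetical risk scores and outcomes (1 = deteriorated).
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
auc = c_statistic(scores, labels)
tiers = [risk_tier(s) for s in scores]
print(auc, tiers)
```

The pairwise loop is quadratic in the number of patients, which is fine for a sketch; a production implementation would typically use a rank-based formula instead.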
TomoGraphView: 3D Medical Image Classification with Omnidirectional Slice Representations and Graph Neural Networks
Authors: Johannes Kiechle, Stefan M. Fischer, Daniel M. Lang, Cosmin I. Bercea, Matthew J. Nyflot, Lina Felsner, Julia A. Schnabel, Jan C. Peeken
First: 2025-11-12T16:30:34+00:00 · Latest: 2025-12-16T18:46:50+00:00
Comments: Preprint submitted to Medical Image Analysis (MedIA)
Abstract
The sharp rise in medical tomography examinations has created a demand for automated systems that can reliably extract informative features for downstream tasks such as tumor characterization. Although 3D volumes contain richer information than individual slices, effective 3D classification remains difficult: volumetric data encode complex spatial dependencies, and the scarcity of large-scale 3D datasets has constrained progress toward 3D foundation models. As a result, many recent approaches rely on 2D vision foundation models trained on natural images, repurposing them as feature extractors for medical scans with surprisingly strong performance. Despite their practical success, current methods that apply 2D foundation models to 3D scans via slice-based decomposition remain fundamentally limited. Standard slicing along axial, sagittal, and coronal planes often fails to capture the true spatial extent of a structure when its orientation does not align with these canonical views. More critically, most approaches aggregate slice features independently, ignoring the underlying 3D geometry and losing spatial coherence across slices. To overcome these limitations, we propose TomoGraphView, a novel framework that integrates omnidirectional volume slicing with spherical graph-based feature aggregation. Instead of restricting the model to axial, sagittal, or coronal planes, our method samples both canonical and non-canonical cross-sections generated from uniformly distributed points on a sphere enclosing the volume. We publicly share our accessible code base at http://github.com/compai-lab/2025-MedIA-kiechle and provide a user-friendly library for omnidirectional volume slicing at https://pypi.org/project/OmniSlicer.
中文标题/摘要
标题:TomoGraphView:使用全方位切片表示和图神经网络的3D医学图像分类
医学断层扫描检查的急剧增加催生了对能够可靠提取用于下游任务(如肿瘤表征)的特征的自动化系统的市场需求。尽管3D体数据包含比单个切片更多的信息,但有效的3D分类仍然具有挑战性:体数据编码复杂的空间依赖性,大规模3D数据集的稀缺性限制了3D基础模型的发展。因此,许多最近的方法依赖于在自然图像上训练的2D视觉基础模型,并重新利用它们作为医学扫描的特征提取器,表现出令人惊讶的性能。尽管这些方法在实践中取得了成功,但通过切片分解将2D基础模型应用于3D扫描的方法仍然从根本上受到限制。标准的沿轴向、矢状和冠状平面的切片往往无法捕捉到结构的真实空间范围,尤其是当其方向不与这些标准视图对齐时。更重要的是,大多数方法独立地聚合切片特征,忽略了潜在的3D几何结构,导致切片之间失去空间连贯性。为了克服这些限制,我们提出了一种名为TomoGraphView的新框架,该框架结合了全方位体积切片和基于球面图的特征聚合。我们的方法不仅采样标准的轴向、矢状和冠状切片,还采样从包围体积的球体上均匀分布的点生成的非标准横截面。我们公开分享了我们的可访问代码库http://github.com/compai-lab/2025-MedIA-kiechle,并提供了一个用户友好的全方位体积切片库https://pypi.org/project/OmniSlicer。
Summary / 总结
TomoGraphView is a framework for 3D medical image classification that combines omnidirectional slice representations with graph neural networks. It targets two limitations of slice-based methods built on 2D foundation models: canonical axial, sagittal, and coronal slices can miss the true spatial extent of structures that are not aligned with those planes, and independent aggregation of slice features discards the underlying 3D geometry. To address them, it samples both canonical and non-canonical cross-sections from uniformly distributed points on a sphere enclosing the volume and aggregates their features over a spherical graph. The code base and a library for omnidirectional volume slicing are publicly available.
TomoGraphView 是一种新颖的3D医学图像分类框架,结合了全方位切片表示和图神经网络。它通过从包围体积的球体上均匀分布的点生成的切片,克服了传统切片方法的限制,同时采样标准和非标准的横截面。实验结果表明,TomoGraphView 在准确性和切片间的空间一致性方面优于现有的基于2D基础模型的方法,展示了其在捕捉3D医学图像中的复杂空间依赖性方面的有效性。
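The abstract mentions cross-sections generated from uniformly distributed points on a sphere. One common way to obtain approximately uniform points is the Fibonacci (golden-angle) lattice, sketched below; each point can be read as the normal vector of a slicing plane through the volume center. Whether TomoGraphView or the OmniSlicer library uses this particular construction is not stated in the abstract, so this is an illustrative assumption and does not call the library's API.

```python
import math

def fibonacci_sphere(n):
    """Return n approximately evenly spaced unit vectors (x, y, z)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n    # heights evenly spaced in (-1, 1)
        r = math.sqrt(1.0 - z * z)       # radius of the circle at height z
        theta = golden_angle * i         # rotate by the golden angle each step
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points

# 16 candidate slicing-plane normals (the count is arbitrary here).
normals = fibonacci_sphere(16)
print(normals[0])
```

Because opposite normals define the same plane, a real slicing pipeline would likely deduplicate antipodal pairs or sample only a hemisphere; that step is omitted here.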
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
Authors: Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang
First: 2025-12-16T18:45:18+00:00 · Latest: 2025-12-16T18:45:18+00:00
Abstract
Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. Models trained under this paradigm, Jacobi Forcing Models, achieve 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
中文标题/摘要
标题:使用雅可比强迫实现快速准确的因果并行解码
多令牌生成已成为加速基于变压器的大模型推理的一种有前途的范式。近期努力主要探索扩散大型语言模型(dLLMs)以并行解码来减少推理延迟。为了实现AR级生成质量,许多技术将AR模型适应为dLLMs,以实现并行解码。然而,它们与AR模型相比,速度提升有限,因为存在预训练到后训练的不匹配。具体来说,后训练中的遮罩数据分布与预训练期间看到的真实世界数据分布有显著差异,而dLLMs依赖双向注意力,这与预训练期间学习的因果先验相冲突,阻碍了精确KV缓存重用的整合。为解决这一问题,我们引入了雅可比强迫,这是一种渐进式蒸馏范式,其中模型在自己的生成并行解码轨迹上进行训练,平滑地将AR模型转换为高效的并行解码器,同时保留其预训练的因果推理特性。在这一范式下训练的模型,雅可比强迫模型,在编码和数学基准测试中实现了3.8倍的时钟速度提升,性能损失最小。基于雅可比强迫模型的轨迹特征,我们引入了多块解码与拒绝回收,这使得每迭代的令牌接受计数最多提高4.5倍,并且接近4.0倍的时钟速度提升,有效地用额外的计算换取更低的推理延迟。我们的代码可在https://github.com/hao-ai-lab/JacobiForcing/ 获取。
Summary / 总结
The research aims to speed up parallel decoding in causal transformer-based language models by addressing the pretrain-to-posttrain mismatch that limits approaches adapting AR models into diffusion LLMs. The authors introduce Jacobi Forcing, a progressive distillation method that trains models on their own generated parallel decoding trajectories, preserving the pretrained causal inference property while achieving a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. They also propose multi-block decoding with rejection recycling, which yields up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, trading additional compute for lower inference latency. The code is available at https://github.com/hao-ai-lab/JacobiForcing.
本文解决了在保持质量的同时提高变压器模型并行解码速度的挑战。它引入了Jacobi Forcing,一种逐步将AR模型转化为高效并行解码器的训练方法,同时保留其因果推理特性。Jacobi Forcing模型在编码和数学基准测试中实现了3.8倍的加速,性能损失微乎其微。此外,论文还提出了多块解码与拒绝回收,通过增加计算量来降低推理延迟,从而在每次迭代中将令牌接受率提高4.5倍,加速4.0倍。
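The "parallel decoding trajectories" in the Jacobi Forcing abstract refer to Jacobi-style fixed-point decoding: guess a whole block of tokens, recompute every position in parallel from the previous guess, and repeat until nothing changes. The sketch below shows only this inference-time iteration, with a toy deterministic function standing in for a greedy causal language model. The toy model and all names are hypothetical, and the training procedure (Jacobi Forcing itself) is not shown.

```python
def next_token(prefix):
    """Toy deterministic stand-in for a greedy causal LM (hypothetical)."""
    return (sum(prefix) * 31 + len(prefix)) % 50

def greedy_decode(prompt, n):
    """Standard autoregressive decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]

def jacobi_decode(prompt, n):
    """Refine all n positions in parallel until a fixed point is reached."""
    guess = [0] * n
    iterations = 0
    while True:
        iterations += 1
        # Each position is recomputed from the previous iterate only, so the
        # n calls are independent and could run as one batched forward pass.
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:
            return guess, iterations
        guess = new

tokens, iters = jacobi_decode([3, 7], 6)
print(tokens, iters)
```

After iteration t at least the first t tokens match the autoregressive output, so the loop needs at most n + 1 iterations and returns exactly the greedy sequence. Any speedup comes from iterations that fix several tokens at once, which is the behavior the abstract says training on such trajectories is meant to encourage.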
VIBE: Can a VLM Read the Room?
Authors: Tania Chakraborty, Eylon Caplan, Dan Goldwasser
Venue: Findings of EMNLP 2025
First: 2025-06-11T19:07:35+00:00 · Latest: 2025-12-16T18:42:51+00:00
Comments: Findings of EMNLP, 2025
Abstract
Understanding human social behavior, such as recognizing emotions and the social dynamics causing them, is an important and challenging problem. While LLMs have made remarkable advances, they are limited to the textual domain and cannot account for the major role that non-verbal cues play in understanding social situations. Vision Language Models (VLMs) can potentially close this gap; however, their ability to make correct inferences over such social cues has received little attention. In this paper, we explore the capabilities of VLMs at social reasoning. We identify a previously overlooked limitation in VLMs: the Visual Social-Pragmatic Inference gap. To target this gap, we propose a new task for VLMs: Visual Social-Pragmatic Inference. We construct a high-quality dataset to test the abilities of a VLM on this task and benchmark the performance of several VLMs on it.
中文标题/摘要
标题:VIBE:VLM能否读懂房间里的社交信号?
理解人类社会行为,如识别情绪及其背后的社会动态,是一个重要且具有挑战性的问题。尽管语言模型取得了显著进展,但它们仅限于文本领域,无法解释非言语线索在理解社交情境中的重要作用。视觉语言模型(VLMs)有可能弥补这一差距,但它们在推理此类社会线索方面的能力尚未受到广泛关注。在本文中,我们探讨了VLM在社会推理方面的能力。我们发现VLM的一个先前未被注意到的局限性:视觉社会-语用推理差距。为解决这一差距,我们为VLM提出了一项新任务:视觉社会-语用推理。我们构建了一个高质量的数据集来测试VLM在该任务上的能力,并在该数据集上对几种VLM的性能进行了基准测试。
Summary / 总结
This paper investigates the social reasoning capabilities of Vision Language Models (VLMs) by introducing a new task called Visual Social-Pragmatic Inference. The authors identify a limitation in VLMs known as the Visual Social-Pragmatic Inference gap and propose a high-quality dataset to evaluate VLMs. The study benchmarks several VLMs on this task, highlighting their current limitations in understanding non-verbal social cues.
论文探讨了视觉语言模型(VLM)在社会推理方面的能力,指出了视觉社会-语用推理差距这一限制。为此,作者提出了一项名为视觉社会-语用推理的新任务,并构建了一个高质量的数据集来评估VLM。主要实验发现是当前的VLM在处理视觉社会线索方面存在困难,表明需要在这一领域进行改进。
COMMA: A Communicative Multimodal Multi-Agent Benchmark
Authors: Timothy Ossowski, Danyal Maqbool, Jixuan Chen, Zefan Cai, Tyler Bradshaw, Junjie Hu
Venue: Transactions on Machine Learning Research, 2025
First: 2024-10-10T02:49:47+00:00 · Latest: 2025-12-16T18:36:40+00:00
Abstract
The rapid advances of multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce COMMA: a novel puzzle benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of multimodal puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models like GPT-4o and reasoning models like o4-mini. Many chain of thought reasoning models such as R1-Onevision and LLaVA-CoT struggle to outperform even a random baseline in agent-agent collaboration, indicating a potential growth area in their communication abilities.
中文标题/摘要
标题:COMMA:一种交流多模态多智能体基准
基于大型基础模型构建的多模态智能体的迅速发展很大程度上忽视了它们在协作任务中通过语言进行智能体间交流的潜力。这种忽视在理解其在实际部署中的有效性方面,尤其是在与人类交流时,存在关键缺口。现有的智能体基准未能解决智能体间交流和协作的关键方面,特别是在智能体信息获取不平等且必须合作完成超出个体能力范围的任务的场景中。为填补这一缺口,我们引入了COMMA:一种新型的谜题基准,旨在通过语言交流评估多模态多智能体系统的协作性能。我们的基准包括多种多模态谜题,提供了在交流协作环境中对智能体能力的全面评估。我们的研究结果揭示了最先进的模型,包括强大的专有模型GPT-4o和推理模型o4-mini的惊人弱点。许多链式推理模型如R1-Onevision和LLaVA-CoT在智能体间协作中难以超越随机基线,表明它们在交流能力方面存在潜在的增长空间。
Summary / 总结
The paper introduces COMMA, a benchmark designed to evaluate the communication and collaboration abilities of multimodal multi-agent systems. It addresses the gap in existing benchmarks by focusing on scenarios where agents must communicate and work together to solve complex tasks. Key findings show that even advanced models like GPT-4o and reasoning models like o4-mini struggle in agent-agent collaboration, highlighting the need for improvement in their communication skills.
论文介绍了COMMA基准,用于评估多模态多智能体系统在语言通信下的协作性能。它通过关注智能体信息不平等且必须协作完成超出个体能力的任务的场景,弥补了现有基准的不足。主要发现表明,包括GPT-4o和推理模型o4-mini在内的先进模型在智能体间协作中表现出显著的弱点,这表明它们的通信能力需要改进。
Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models
Authors: Zefang Liu, Nam H. Nguyen, Yinzhu Quan, Shi-Xiong Zhang
First: 2025-12-15T18:10:51+00:00 · Latest: 2025-12-16T18:35:40+00:00
Abstract
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
中文标题/摘要
标题:事件序列建模中大型语言模型的时间分词策略
在使用大型语言模型(LLMs)建模时间事件序列时,表示连续时间是一个关键但尚未充分探索的挑战。已经提出了各种策略,如字节级表示或日历标记,但最优方法仍不清楚,特别是考虑到现实世界事件数据的多样统计分布,这些分布从平滑的对数正态分布到离散的尖峰模式不等。本文首次对事件序列的时间分词进行了实证研究,比较了不同的编码策略:原始数字字符串、高精度字节级表示、基于人类语义的日历标记、经典均匀分箱以及自适应残差标量量化。我们通过在代表这些多样分布的现实世界数据集上微调LLMs来评估这些策略。我们的分析表明,并不存在一种普遍更优的策略;相反,预测性能高度依赖于分词器与数据统计属性的对齐,对数策略在偏斜分布上表现出色,而以人类为中心的格式对混合模态具有鲁棒性。
Summary / 总结
This paper explores the challenge of representing continuous time in event sequence modeling with large language models. It compares various temporal tokenization strategies, including numeric strings, byte-level representations, calendar tokens, uniform binning, and adaptive quantization, by evaluating their performance on real-world datasets with different statistical distributions. The study finds that no single strategy is universally optimal, and the best approach depends on the data's statistical properties, with log-based strategies performing well on skewed distributions and human-centric formats proving robust for mixed modalities.
该论文探讨了在使用大型语言模型进行事件序列建模时如何表示连续时间的问题。它比较了包括数值字符串、字节级表示、日历标记、均匀分箱和自适应量化在内的多种时间分词策略,并通过在多样化的实际数据集上进行评估来检验它们的表现。研究发现,没有一种策略是普遍最优的,对偏斜分布,对数基策略表现更好,而以人类为中心的格式在混合数据类型中表现出更强的鲁棒性。
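To make the contrast between classic uniform binning and log-based strategies concrete, here is a minimal Python sketch that tokenizes inter-event gaps. It is illustrative only and not the paper's implementation: the function names, the `<t_k>` token format, the bin count, the clipping range, and the sample gaps are all assumptions.

```python
import math

def uniform_bin(dt, max_dt, n_bins):
    """Map an inter-event time to one of n_bins equal-width bins on [0, max_dt]."""
    clipped = min(max(dt, 0.0), max_dt)
    return min(int(clipped / max_dt * n_bins), n_bins - 1)

def log_bin(dt, min_dt, max_dt, n_bins):
    """Map an inter-event time to one of n_bins bins that are equal-width in log space."""
    clipped = min(max(dt, min_dt), max_dt)
    frac = math.log(clipped / min_dt) / math.log(max_dt / min_dt)
    return min(int(frac * n_bins), n_bins - 1)

# Hypothetical, heavily skewed gaps in seconds (made-up data).
gaps = [0.5, 2.0, 30.0, 3600.0]

uniform_ids = [uniform_bin(g, 3600.0, 8) for g in gaps]
log_ids = [log_bin(g, 0.1, 3600.0, 8) for g in gaps]

# Hypothetical token strings an LLM vocabulary could be extended with.
uniform_tokens = [f"<t_{i}>" for i in uniform_ids]
log_tokens = [f"<t_{i}>" for i in log_ids]
```

On this toy data, uniform binning places the three short gaps in the same bin, while the log-spaced bins keep them apart. That is in line with the abstract's observation that log-based strategies suit skewed distributions.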
EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
Authors: Zechen Bai, Chen Gao, Mike Zheng Shou
First: 2025-12-16T18:26:38+00:00 · Latest: 2025-12-16T18:26:38+00:00
Comments: 15 pages
Abstract
Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through environmental interaction, which is akin to how humans master skills through practice. Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models, yet remain fundamentally limited by Supervised Finetuning (SFT): requiring hundreds of demonstrations per task, rigidly memorizing trajectories, and failing to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations. The key technical challenge is replacing oracle reward signals (unavailable at test time) with autonomous feedback. We address this through a learned progress estimator providing dense feedback, and critically, we design our framework to "tame" this inherently noisy signal via two mechanisms: (1) an accumulative progress estimation mechanism smoothing noisy point-wise estimates, and (2) a progressive horizon extension strategy enabling gradual policy evolution. EVOLVE-VLA achieves substantial gains: +8.6% on long-horizon tasks, +22.0% in 1-shot learning, and enables cross-task generalization, achieving 20.8% success on unseen tasks without task-specific demonstrations training (vs. 0% for pure SFT). Qualitative analysis reveals emergent capabilities absent in demonstrations, including error recovery and novel strategies. This work represents a critical step toward VLAs that truly learn and adapt, moving beyond static imitation toward continuous self-improvements.
中文标题/摘要
标题:EVOLVE-VLA:通过环境反馈进行测试时训练的视觉-语言-行动模型
实现真正适应性的具身智能需要能够通过不断与环境互动来改进的代理,而不仅仅是通过模仿静态演示来学习,这类似于人类通过练习掌握技能的方式。视觉-语言-行动(VLA)模型通过利用大型语言模型推进了机器人的操作,但仍然受到监督微调(SFT)的根本限制:需要每项任务数百次演示,僵化地记忆轨迹,并在部署条件与训练条件不一致时无法适应。我们提出了EVOLVE-VLA,这是一种在环境互动中实现VLAs连续适应的测试时训练框架,几乎不需要特定任务的演示。关键技术挑战是用自主反馈替换测试时不可用的先验奖励信号。我们通过一个学习的进步估计器提供密集反馈来解决这一问题,并且关键地,我们设计了框架来“驯服”这种固有的噪声信号,通过两种机制:(1)累积的进步估计机制平滑噪声的点估计,(2)渐进的视野扩展策略使策略逐步进化。EVOLVE-VLA实现了显著的改进:在长时任务上提高了8.6%,在1次学习上提高了22.0%,并且实现了跨任务泛化——在没有特定任务演示的情况下训练,实现了20.8%的成功率(而纯SFT为0%)。定性分析表明,演示中不存在的新兴能力包括错误恢复和新颖策略。这项工作代表了向真正学习和适应的VLAs迈出的关键一步,从静态模仿转向持续自我改进。
Summary / 总结
EVOLVE-VLA is a test-time training framework that lets Vision-Language-Action models adapt continuously through environment interaction with minimal or zero task-specific demonstrations. Because oracle rewards are unavailable at test time, it relies on a learned progress estimator for dense feedback and tames that noisy signal with two mechanisms: accumulative progress estimation and progressive horizon extension. Reported gains are +8.6% on long-horizon tasks and +22.0% in 1-shot learning, plus 20.8% success on unseen tasks without task-specific demonstration training (vs. 0% for pure SFT), along with emergent behaviors such as error recovery that are absent from the demonstrations.
EVOLVE-VLA 是一种测试时训练框架,使视觉-语言-行动模型能够通过环境交互持续适应,减少对特定任务演示的需求。它通过使用学习进度估计器和两种机制(累积进度估计和渐进时间范围扩展)来应对使用自主反馈的挑战。该框架显著提高了性能,在长时任务上实现了8.6%的提升,在1次学习中达到了22.0%的提升,并展示了无需特定任务演示训练数据即可实现跨任务泛化,展示了演示中不存在的新兴能力。
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
Authors: Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao, Tieru Wu, Rui Ma, Bo Zhang, Zhe Li
First: 2025-12-16T18:13:54+00:00 · Latest: 2025-12-16T18:13:54+00:00
Comments: Code is available at https://github.com/Leon-LihongWang/ViRC
Abstract
CoT has significantly enhanced the reasoning ability of LLMs while it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose a ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset by using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.
中文标题/摘要
标题:ViRC:增强视觉交错数学推理链的推理块化
链式思考(CoT)显著提升了语言模型(LLMs)的推理能力,但在扩展到多模态领域时面临挑战,特别是在数学任务中。现有的多模态语言模型(MLLMs)通常仅从单一静态数学图像进行文本推理,忽视了推理过程中的动态视觉获取。相比之下,人类会反复检查视觉图像,并采用逐步推理来证明中间命题。这种将问题解决过程分解为关键逻辑节点的方法遵循了认知科学中的米勒定律。受此启发,我们提出了一种ViRC框架,用于多模态数学任务,引入了推理块化机制,将多模态数学链式思考(CoT)结构化为连续的关键推理单元(CRUs),以模拟人类专家的问题解决模式。CRUs确保了单元内的文本连贯性,用于中间命题验证,同时整合跨单元的视觉信息以生成后续命题并支持结构化推理。为此,我们使用三种视觉工具和四种推理模式构建了CRUX数据集,为每个数学问题提供了多条推理路径上的显式标注CRUs。利用CRUX数据集,我们提出了一种受人类认知学习启发的渐进式训练策略,包括指令性SFT、练习性SFT和策略性RL,旨在进一步增强模型的推理块化能力。最终,ViRC-7B模型在多个数学基准测试中平均提高了18.8%。代码可在https://github.com/Leon-LihongWang/ViRC获取。
Summary / 总结
The paper targets multimodal mathematical reasoning, where existing MLLMs reason in text from a single static image and overlook visual acquisition during reasoning. The ViRC framework introduces a Reason Chunking mechanism that structures multimodal CoT into consecutive Critical Reasoning Units (CRUs), simulating human expert problem-solving. Trained on the CRUX dataset with a progressive strategy (Instructional SFT, Practice SFT, and Strategic RL), the ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks.
论文提出了ViRC框架以解决将CoT扩展到多模态领域,特别是数学任务中的挑战。通过引入Reason Chunking将问题解决过程分解为CRUs,确保文本连贯性和视觉信息的整合。使用CRUX数据集和基于人类认知学习的渐进式训练策略开发了ViRC-7B模型,该模型在多个数学基准测试中取得了18.8%的平均改进。
Segmental Attention Decoding With Long Form Acoustic Encodings
Authors: Pawel Swietojanski, Xinwei Li, Mingbin Xu, Takaaki Hori, Dogan Can, Xiaodan Zhuang
First: 2025-12-16T18:12:37+00:00 · Latest: 2025-12-16T18:12:37+00:00
Comments: 5 pages, 1 fig
Abstract
We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish. The model loses ability to order acoustic encodings due to permutation invariance of keys and values in cross-attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder.
中文标题/摘要
标题:基于长段声学编码的段落注意力解码
我们解决了基于注意力的编码器-解码器(AED)模型与长段声学编码之间的根本不兼容性。在分割的语音片段上训练的AED模型通过利用段边界之外的有限声学上下文来学习编码绝对帧位置,但在解码长段时由于这些提示消失而无法泛化。由于交叉注意力中键和值的排列不变性,模型失去了对声学编码排序的能力。我们提出了四种修改:(1)为每个解码的段落注入显式的绝对位置编码,(2)使用扩展的声学上下文进行长段训练以消除隐式的绝对位置编码,(3)段落拼接以覆盖训练期间所需的多样化分割,以及(4)语义分割以使AED解码的段落与训练段对齐。我们展示了这些修改缩小了连续和分割声学编码之间的准确度差距,使注意力解码器能够自回归使用。
Summary / 总结
The paper addresses the challenge of using attention-based encoder-decoder models with long-form acoustic encodings. It proposes four modifications: injecting explicit positional encodings, long-form training with extended context, segment concatenation, and semantic segmentation. These changes help close the accuracy gap between continuous and segmented acoustic encodings, allowing for the auto-regressive use of the attention decoder.
论文解决了使用基于注意力的编码器-解码器模型处理长形式声学编码时遇到的挑战,这些模型在分段语音片段上训练时会因为丢失绝对位置信息而无法泛化。作者提出了四种改进措施:注入显式的位置编码、使用包含扩展上下文的长形式训练、分段连接以及语义分段。这些改进有助于缩小连续和分段声学编码之间的准确率差距,使注意力解码器能够自回归使用。
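The abstract's remark about permutation invariance of keys and values in cross-attention can be checked with a small, self-contained Python sketch. It implements plain single-query scaled dot-product attention (not the paper's model); the vectors and the shuffle order are made-up values for illustration.

```python
import math

def attend(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

# Hypothetical decoder query and three "acoustic frames" (keys with their values).
query = [1.0, 0.5]
keys = [[0.2, 0.1], [0.9, -0.3], [-0.4, 0.8]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Shuffle the frames, keeping each key paired with its value.
order = [2, 0, 1]
shuffled_keys = [keys[i] for i in order]
shuffled_values = [values[i] for i in order]

original = attend(query, keys, values)
shuffled = attend(query, shuffled_keys, shuffled_values)
same = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(original, shuffled))
```

Because the output is a weighted sum over key-value pairs, reordering the pairs leaves it unchanged up to floating-point rounding. Without positional information in the keys, the decoder therefore cannot tell frame order, which is the gap that explicit positional encodings are meant to close.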
AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection
Authors: Lorenzo Pellegrini, Davide Cozzolino, Serafino Pandolfini, Davide Maltoni, Matteo Ferrara, Luisa Verdoliva, Marco Prati, Marco Ramilli
First: 2025-04-29T15:41:13+00:00 · Latest: 2025-12-16T18:11:21+00:00
Comments: Accepted at Verimedia workshop, IJCNN 2025. 9 pages, 6 figures, 4 tables, code available: https://github.com/MI-BioLab/AI-GenBench
Abstract
The rapid advancement of generative AI has revolutionized image creation, enabling high-quality synthesis from text prompts while raising critical challenges for media authenticity. We present Ai-GenBench, a novel benchmark designed to address the urgent need for robust detection of AI-generated images in real-world scenarios. Unlike existing solutions that evaluate models on static datasets, Ai-GenBench introduces a temporal evaluation framework where detection methods are incrementally trained on synthetic images, historically ordered by their generative models, to test their ability to generalize to new generative models, such as the transition from GANs to diffusion models. Our benchmark focuses on high-quality, diverse visual content and overcomes key limitations of current approaches, including arbitrary dataset splits, unfair comparisons, and excessive computational demands. Ai-GenBench provides a comprehensive dataset, a standardized evaluation protocol, and accessible tools for both researchers and non-experts (e.g., journalists, fact-checkers), ensuring reproducibility while maintaining practical training requirements. By establishing clear evaluation rules and controlled augmentation strategies, Ai-GenBench enables meaningful comparison of detection methods and scalable solutions. Code and data are publicly available to ensure reproducibility and to support the development of robust forensic detectors to keep pace with the rise of new synthetic generators.
中文标题/摘要
标题:AI-GenBench:一种新的持续基准测试,用于检测AI生成的图像
生成式AI的迅速发展已经彻底改变了图像创作,使从文本提示生成高质量图像成为可能,同时也带来了媒体真实性方面的关键挑战。我们提出了AI-GenBench,这是一种新型基准测试,旨在解决在现实场景中检测AI生成图像的迫切需求。与现有解决方案仅在静态数据集上评估模型不同,AI-GenBench 引入了一种时间上的评估框架,其中检测方法逐步训练在按其生成模型历史顺序排列的合成图像上,以测试其在面对新的生成模型(例如从GAN到扩散模型的过渡)时的泛化能力。我们的基准测试专注于高质量、多样化的视觉内容,并克服了当前方法的关键局限性,包括任意的数据集划分、不公平的比较以及过高的计算需求。AI-GenBench 提供了一个全面的数据集、标准化的评估协议和易于访问的工具,确保可重复性同时保持实际的训练需求。通过建立明确的评估规则和受控的增强策略,AI-GenBench 使检测方法的有意义比较和可扩展解决方案成为可能。代码和数据已公开,以确保可重复性并支持开发与新合成生成器同步的稳健法医检测器。
Summary / 总结
The paper introduces AI-GenBench, an ongoing benchmark for detecting AI-generated images in real-world scenarios. Rather than evaluating on a static dataset, it uses a temporal framework: detectors are incrementally trained on synthetic images ordered historically by their generative models, which tests generalization to new generators (e.g., the transition from GANs to diffusion models). The benchmark provides a comprehensive dataset, a standardized evaluation protocol, and accessible tools for researchers and non-experts alike, and it addresses arbitrary dataset splits, unfair comparisons, and excessive computational demands. Code and data are publicly available.
论文介绍了Ai-GenBench,这是一个新的用于检测AI生成图像的基准,旨在应对媒体真实性挑战。与静态数据集不同,Ai-GenBench 使用了一个时间框架,其中模型会逐步训练合成图像,这些图像由不同的生成模型随着时间推移生成。关键发现表明,这种方法提高了检测方法在处理新生成模型(如从GAN到扩散模型的过渡)时的泛化能力。基准提供了全面的数据集、标准化的评估协议和易于使用的工具,确保可重复性并满足研究人员和非专家(如记者、事实核查员)的实用训练需求。
Understanding Sampler Stochasticity in Training Diffusion Models for RLHF
Authors: Jiayuan Sheng, Hanyang Zhao, Haoxian Chen, David D. Yao, Wenpin Tang
First: 2025-10-12T19:08:38+00:00 · Latest: 2025-12-16T18:10:07+00:00
Abstract
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to fine-tune diffusion models, but a key challenge arises from the mismatch between stochastic samplers used during training and deterministic samplers used during inference. In practice, models are fine-tuned using stochastic SDE samplers to encourage exploration, while inference typically relies on deterministic ODE samplers for efficiency and stability. This discrepancy induces a reward gap, raising concerns about whether high-quality outputs can be expected during inference. In this paper, we theoretically characterize this reward gap and provide non-vacuous bounds for general diffusion models, along with sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. Methodologically, we adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high levels of stochasticity, preserving data marginals throughout. Empirically, our findings through large-scale experiments on text-to-image models using denoising diffusion policy optimization (DDPO) and mixed group relative policy optimization (MixGRPO) validate that reward gaps consistently narrow over training, and ODE sampling quality improves when models are updated using higher-stochasticity SDE training.
中文标题/摘要
标题:理解训练RLHF扩散模型时采样器的随机性
人类反馈强化学习(RLHF)越来越多地用于微调扩散模型,但一个关键挑战是训练中使用的随机采样器与推理中使用的确定性采样器之间的不匹配。实践中,模型使用随机SDE采样器进行微调以鼓励探索,而推理通常依赖于确定性ODE采样器以提高效率和稳定性。这种差异导致了奖励差距,引发了关于推理时能否期望高质量输出的担忧。在本文中,我们从理论上描述了这种奖励差距,并为通用扩散模型提供了非空洞的边界,同时为发散爆炸(VE)和方差保持(VP)高斯模型提供了更尖锐的收敛速率。方法上,我们采用广义去噪扩散隐式模型(gDDIM)框架,以支持任意高的随机性,同时在整个过程中保持数据边缘。实验上,通过使用去噪扩散策略优化(DDPO)和混合组相对策略优化(MixGRPO)在文本到图像模型上的大规模实验,我们的发现验证了奖励差距在训练过程中一致地缩小,当模型使用更高随机性的SDE训练更新时,ODE采样质量会提高。
Summary / 总结
This paper studies the reward gap that arises in RLHF fine-tuning of diffusion models when training uses stochastic SDE samplers but inference uses deterministic ODE samplers. It theoretically characterizes this gap, giving non-vacuous bounds for general diffusion models and sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. The authors adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high stochasticity while preserving data marginals. Large-scale text-to-image experiments with DDPO and MixGRPO show that reward gaps consistently narrow over training and that ODE sampling quality improves when models are updated with higher-stochasticity SDE training.
本文探讨了在使用扩散模型进行强化学习从人类反馈(RLHF)时,由于训练中使用随机采样器与推理中使用确定性采样器之间的差异导致的奖励差距问题。研究通过理论分析了这种奖励差距,并为通用扩散模型提供了非空边界,同时为特定模型提供了更精确的收敛速率。通过在文本到图像模型上使用DDPO和MixGRPO进行大规模实验,研究发现奖励差距随模型使用更高随机性的SDE训练而减小,且ODE采样质量也得到提升。
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Authors: Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu
First: 2025-12-08T18:57:26+00:00 · Latest: 2025-12-16T18:04:34+00:00
Abstract
Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
中文标题/摘要
标题:一层足够:适应预训练视觉编码器进行图像生成
视觉生成模型(例如,扩散模型)通常在压缩的潜在空间中运行,以平衡训练效率和样本质量。同时,人们越来越关注利用高质量的预训练视觉表示,无论是通过在VAEs内部对齐它们,还是直接在生成模型内部。然而,由于理解导向的特征与生成友好的潜在空间之间存在根本性差异,因此调整这些表示仍然具有挑战性。表示编码器受益于高维潜在空间,可以捕捉遮罩区域的多种假设,而生成模型则偏好低维潜在空间,必须忠实保留注入的噪声。这种差异导致先前的工作依赖于复杂的优化目标和架构。在本项工作中,我们提出了一种简单的有效框架FAE(特征自编码器),它仅使用一个注意力层即可将预训练的视觉表示适应为适合生成的低维潜在空间,同时保留足够的信息用于重建和理解。关键在于结合两个独立的深度解码器:一个用于重建原始特征空间,另一个则以重建的特征作为输入进行图像生成。FAE是通用的,可以与各种自监督编码器(例如,DINO,SigLIP)结合,并插入两种不同的生成家族:扩散模型和归一化流。在类别条件和文本到图像基准测试中,FAE表现出色。例如,在ImageNet 256x256上,我们的带有CFG的扩散模型达到接近当前最佳的FID为1.29(800个周期)和1.70(80个周期)。不使用CFG时,FAE达到当前最佳的FID为1.48(800个周期)和2.08(80个周期),展示了高质量和快速学习的特点。
Summary / 总结
This work adapts pre-trained visual representations for image generation with FAE (Feature Auto-Encoder), which uses as little as a single attention layer to map high-dimensional encoder features into low-dimensional latents suitable for generation. FAE couples two separate deep decoders: one reconstructs the original feature space, and the other takes the reconstructed features as input for image generation. The framework works with various self-supervised encoders (e.g., DINO, SigLIP) and with both diffusion models and normalizing flows. On ImageNet 256x256, the diffusion model attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs) with CFG, and a state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs) without CFG.
该研究提出了FAE(特征自编码器),仅用一个注意力层即可将预训练视觉表示的高维特征转换为适合生成的低维潜在表示。FAE 结合了两个独立的深度解码器:一个用于重建原始特征空间,另一个以重建的特征为输入进行图像生成。在 ImageNet 256x256 上,使用 CFG 时 FID 达到接近最先进的 1.29(800 个周期)和 1.70(80 个周期);不使用 CFG 时达到最先进的 1.48(800 个周期)和 2.08(80 个周期)。
Chase Anonymisation: Privacy-Preserving Knowledge Graphs with Logical Reasoning
Authors: Luigi Bellomarini, Costanza Catalano, Andrea Coletta, Michela Iezzi, Pierangela Samarati
Venue: Proceedings of the 42nd IEEE International Conference on Data Engineering (ICDE) 2026
First: 2024-10-16T10:04:02+00:00 · Latest: 2025-12-16T17:57:52+00:00
Comments: 16 pages, 5 figures
Abstract
We propose a novel framework to enable Knowledge Graphs (KGs) sharing while ensuring that information that should remain private is not directly released nor indirectly exposed via derived knowledge, maintaining at the same time the embedded knowledge of the KGs to support business downstream tasks. Our approach produces a privacy-preserving KG as an augmentation of the input one via controlled addition of nodes and edges as well as re-labeling of nodes and perturbation of weights. We introduce a novel privacy measure for KGs, which considers derived knowledge, a new utility metric that captures the business semantics we want to preserve, and propose two novel anonymisation algorithms. Our extensive experimental evaluation, with both synthetic graphs and real-world datasets, confirms the effectiveness of our approach.
中文标题/摘要
标题:追踪匿名化:基于逻辑推理的隐私保护知识图谱
我们提出了一种新型框架,以确保在共享知识图谱(KGs)时,不会直接或间接地泄露应保持私密的信息,同时保持KGs中的嵌入知识以支持下游任务。我们的方法通过受控添加节点和边以及节点重新标记和权重扰动,生成一个隐私保护的KG作为输入KG的扩展。我们引入了KG的新隐私度量,该度量考虑了衍生知识,一个新的实用性度量,用于捕捉我们希望保留的业务语义,并提出了两种新的匿名化算法。我们的广泛实验评估,使用合成图和真实世界数据集,证实了我们方法的有效性。
Summary / 总结
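The paper proposes a framework for sharing Knowledge Graphs (KGs) without releasing private information directly or exposing it indirectly through derived knowledge, while keeping the embedded knowledge needed for downstream business tasks. The privacy-preserving KG is built as an augmentation of the input KG through controlled addition of nodes and edges, re-labeling of nodes, and perturbation of weights. The authors introduce a privacy measure that accounts for derived knowledge, a utility metric that captures business semantics, and two anonymisation algorithms. Experiments on synthetic graphs and real-world datasets confirm the approach's effectiveness.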
研究旨在开发一种框架,以便在保护隐私的同时共享知识图谱(KGs)。方法包括对输入KG进行控制添加节点和边、重新标记节点和扰动权重的增强。实验结果表明,该方法有效地保护了隐私并保留了支持下游任务的嵌入知识。
Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
Authors: Jungsuk Oh, Jay-Yoon Lee
First: 2025-08-25T18:36:28+00:00 · Latest: 2025-12-16T17:50:43+00:00
Abstract
Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. LSC's lightweight forward processing of summary tokens only introduces negligible runtime overhead (at most 0.9%) on top of standard decoding of the base LLM, and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC, and WUCS on both short-form and long-form on average performance, while adding negligible computational overhead on vanilla inference. These results position LSC as a reliable consistency-selection method that works effectively across various answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low expected calibration error across both answer formats.
中文标题/摘要
标题:潜在自我一致性在短答案和长答案推理中的可靠多数集选择
大型语言模型(LLMs)中的概率解码经常产生不一致的输出,特别是在复杂或长格式的问题上。自我一致性(SC)通过在短格式问答中对精确字符串进行多数投票来缓解这一问题,而通用自我一致性(USC)和加权未克分子一致性得分(WUCS)则扩展到长格式响应,但在短格式基准测试中会失去准确性。 我们引入了**潜在自我一致性(LSC)**,它使用可学习的标记嵌入来选择最语义一致的响应。LSC的轻量级前向处理仅在标准解码的基础上引入了可忽略的运行时开销(最多0.9%),并且不需要更改模型架构。 在6个短格式和5个长格式推理基准测试(例如,MATH、MMLU、TruthfulQA)中,LSC在短格式和长格式上的平均性能均超过了SC、USC和WUCS,同时在纯推理中增加了可忽略的计算开销。这些结果将LSC定位为一种在各种答案格式中都能有效工作的可靠的一致性选择方法。此外,LSC提供了良好的置信度估计,两种答案格式的预期校准误差均较低。
Summary / 总结
The paper addresses inconsistent outputs from probabilistic decoding in Large Language Models (LLMs), particularly on complex or long-form questions. It introduces Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks, LSC surpasses Self-Consistency (SC), Universal Self-Consistency (USC), and Weighted Unigram Consistency Score (WUCS) on average, adds at most 0.9% runtime overhead, and yields well-calibrated confidence estimates.
研究针对大型语言模型(LLMs)概率解码输出不一致的问题,提出了潜在自我一致性(LSC)方法,利用可学习的标记嵌入选择语义上最一致的响应。LSC仅引入可忽略的运行时开销(最多0.9%),在短格式和长格式推理基准测试上的平均性能均优于自我一致性(SC)、通用自我一致性(USC)和加权一元语法一致性得分(WUCS),同时在两种答案格式上保持较低的预期校准误差。
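For reference, the Self-Consistency baseline that the abstract describes as "majority voting over exact strings" can be sketched in a few lines of Python. This covers the SC baseline only, not LSC itself. The function name, the light normalization (strip and lowercase), and the sample answers are assumptions made for illustration.

```python
from collections import Counter

def self_consistency_vote(answers):
    """Return the most frequent answer string and its share of the votes."""
    # Light normalization is an assumption here; strict SC compares exact strings.
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(normalized)

# Hypothetical sampled outputs for one short-form question.
samples = ["42", "42 ", "41", "42", "The answer is 42"]
answer, share = self_consistency_vote(samples)
```

The last sample agrees with the majority in meaning but not as a string, so exact-match voting counts it as a separate answer. This is the kind of failure on longer responses that motivates semantic selection methods such as USC, WUCS, and LSC.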
Trace Gadgets: Minimizing Code Context for Machine Learning-Based Vulnerability Prediction
Authors: Felix Mächtle, Nils Loose, Tim Schulz, Florian Sieck, Jan-Niclas Serr, Ralf Möller, Thomas Eisenbarth
First: 2025-04-18T13:13:39+00:00 · Latest: 2025-12-16T17:48:24+00:00
Comments: Published in the Proceedings of ACM ASIA CCS 2026
Abstract
As the number of web applications and API endpoints exposed to the Internet continues to grow, so does the number of exploitable vulnerabilities. Manually identifying such vulnerabilities is tedious. Meanwhile, static security scanners tend to produce many false positives. While machine learning-based approaches are promising, they typically perform well only in scenarios where training and test data are closely related. A key challenge for ML-based vulnerability detection is providing suitable and concise code context, as excessively long contexts negatively affect the code comprehension capabilities of machine learning models, particularly smaller ones. This work introduces Trace Gadgets, a novel code representation that minimizes code context by removing non-related code. Trace Gadgets precisely capture the statements that cover the path to the vulnerability. As input for ML models, Trace Gadgets provide a minimal but complete context, thereby improving the detection performance. Moreover, we collect a large-scale dataset generated from real-world applications with manually curated labels to further improve the performance of ML-based vulnerability detectors. Our results show that state-of-the-art machine learning models perform best when using Trace Gadgets compared to previous code representations, surpassing the detection capabilities of industry-standard static scanners such as GitHub's CodeQL by at least 4% on a fully unseen dataset. By applying our framework to real-world applications, we identify and report previously unknown vulnerabilities in widely deployed software.
中文标题/摘要
标题:追踪小工具:最小化代码上下文以实现基于机器学习的漏洞预测
随着暴露在互联网上的web应用程序和API端点数量不断增加,可利用的漏洞数量也在增加。手动识别这些漏洞是繁琐的。同时,静态安全扫描器往往会生成许多误报。虽然基于机器学习的方法很有前景,但它们通常仅在训练和测试数据紧密相关的情况下表现良好。基于机器学习的漏洞检测的关键挑战是提供合适的简洁代码上下文,因为过长的上下文会负面影响机器学习模型的代码理解能力,尤其是较小的模型。 本研究引入了追踪小工具(Trace Gadgets),这是一种新颖的代码表示方法,通过移除与漏洞无关的代码来最小化代码上下文。追踪小工具精确地捕捉到覆盖漏洞路径的语句。作为机器学习模型的输入,追踪小工具提供了一个最小但完整的上下文,从而提高了检测性能。此外,我们收集了一个大规模的数据集,该数据集来自真实世界的应用程序,并带有手动标注的标签,以进一步提高基于机器学习的漏洞检测器的性能。我们的结果显示,最先进的机器学习模型在使用追踪小工具时表现最佳,与之前的代码表示相比,超过了GitHub CodeQL等工业标准静态扫描器至少4%的检测能力,特别是在完全未见过的数据集上。通过将我们的框架应用于真实世界的应用程序,我们识别并报告了广泛部署的软件中之前未知的漏洞。
Summary / 总结
This work addresses the challenge of identifying exploitable vulnerabilities in web applications and API endpoints by introducing Trace Gadgets, a novel code representation that minimizes the code context by removing non-related code. Trace Gadgets precisely capture the statements leading to the vulnerability, providing a minimal but complete context for machine learning models. The study uses a large-scale dataset with manually curated labels and demonstrates that state-of-the-art machine learning models perform best with Trace Gadgets, outperforming industry-standard static scanners by at least 4% on a fully unseen dataset.
这项工作通过引入Trace Gadgets,一种通过移除无关代码来最小化代码上下文的新代码表示方法,解决了在Web应用程序和API端点中识别可利用漏洞的挑战。Trace Gadgets精确地捕捉到通往漏洞的语句,为机器学习模型提供了一个最小但完整的上下文。研究使用了一个大规模的数据集,并带有手动标注的标签,结果表明,最先进的机器学习模型在使用Trace Gadgets时表现最佳,比行业标准的静态扫描器(如GitHub的CodeQL)在完全未见过的数据集上的检测能力高出至少4%。
JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
Authors: Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
First: 2025-12-16T17:33:00+00:00 · Latest: 2025-12-16T17:33:00+00:00
Comments: Project page: https://mmmu-japanese-benchmark.github.io/JMMMU_Pro/
Abstract
This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
中文标题/摘要
标题:JMMMU-Pro: 基于图像的日本多学科多模态理解基准及Vibe基准构建方法
本文介绍了基于图像的日本多学科多模态理解基准JMMMU-Pro及其可扩展的构建方法Vibe基准构建。JMMMU-Pro在JMMMU的基础上通过将问题图像和问题文本组合成单个图像,扩展了JMMMU,从而创建了一个需要通过视觉感知进行视觉-文本综合理解的基准。为了构建JMMMU-Pro,我们提出了Vibe基准构建方法,该方法使用图像生成模型(例如Nano Banana Pro)生成候选视觉问题,人类验证输出,并在必要时使用调整后的提示重新生成以确保质量。通过利用Nano Banana Pro高度逼真的图像生成能力和嵌入干净日文文本的能力,我们以较低的成本构建了一个高质量的基准,涵盖了广泛的背景和布局设计。实验结果表明,所有开源LMM在JMMMU-Pro上都遇到了显著困难,突显了JMMMU-Pro作为指导开源社区未来努力的重要基准的作用。我们相信JMMMU-Pro为评估LMM的日本能力提供了更严格的评估工具,而我们的Vibe基准构建方法也为未来基于图像的VQA基准的发展提供了高效的指南。
Summary / 总结
JMMMU-Pro is an image-based Japanese multimodal understanding benchmark that extends JMMMU by combining question images and texts into a single image, requiring integrated visual-textual understanding. The Vibe Benchmark Construction method uses an image generative model to produce candidate visual questions, which are then verified by humans to ensure quality. Experiments show that open-source LMMs struggle significantly with JMMMU-Pro, highlighting its importance for evaluating Japanese multimodal understanding capabilities.
JMMMU-Pro 是一个基于图像的日本多模态理解基准,通过将问题图像和文本合并为一个图像,要求进行视觉和文本的综合理解。Vibe 基准构建方法使用图像生成模型生成候选视觉问题,然后由人类验证以确保质量。实验结果显示开源 LMM 在 JMMMU-Pro 上表现不佳,突显了其在评估多模态理解系统中日语能力方面的重要性。
Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes
Authors: Alessandro Trapasso, Luca Iocchi, Fabio Patrizi
First: 2025-12-16T17:26:24+00:00 · Latest: 2025-12-16T17:26:24+00:00
Comments: 19 pages, 32 figures, includes appendix
Abstract
Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to ε-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.
中文标题/摘要
标题:基于模型的强化学习在离散动作非马尔可夫奖励决策过程中的应用
许多实际的决策问题涉及那些其成功依赖于整个系统历史的任务,而不是达到具有所需属性的状态。马尔可夫强化学习(RL)方法不适用于此类任务,而基于非马尔可夫奖励决策过程(NMRDP)的RL方法使智能体能够处理时间依赖性任务。这种方法长期以来一直缺乏关于(近)最优性和样本效率的正式保证。我们通过QR-MAX,一种新的基于模型的离散NMRDP算法,解决了这两个问题,该算法通过奖励机器将马尔可夫转换学习与非马尔可夫奖励处理分解。据我们所知,这是第一个利用这种分解在离散动作NMRDP中获得PAC收敛到ε-最优策略的基于模型的RL算法,具有多项式样本复杂性。然后,我们通过基于SimHash的分箱QR-MAX扩展了QR-MAX,该分箱器保持了相同的分解结构,并实现了快速稳定的无手动网格化或函数逼近的学习。我们在复杂度递增的环境中实验性地将我们的方法与现代基于模型的RL方法进行了比较,显示出样本效率的显著提高和在寻找最优策略时的增强鲁棒性。
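The abstract above says QR-MAX separates Markovian transition learning from non-Markovian reward handling by means of reward machines. As a rough illustration of what a reward machine is (not the authors' implementation), the sketch below uses a small finite-state machine whose state summarizes the relevant history, so the reward depends on the order of past events. The class name `RewardMachine`, the labels `key` and `goal`, and the task itself are hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Finite-state machine that emits rewards based on the history of observed labels."""
    initial: str
    # Maps (machine_state, label) -> (next_machine_state, reward).
    delta: dict
    state: str = field(init=False)

    def __post_init__(self):
        self.state = self.initial

    def reset(self):
        self.state = self.initial

    def step(self, label):
        """Advance on one label; labels without a transition self-loop with reward 0."""
        next_state, reward = self.delta.get((self.state, label), (self.state, 0.0))
        self.state = next_state
        return reward


# Hypothetical task: reward 1 only when "goal" is reached after "key" was picked up.
rm = RewardMachine(
    initial="u0",
    delta={
        ("u0", "key"): ("u1", 0.0),
        ("u1", "goal"): ("u2", 1.0),
    },
)


def episode_return(labels):
    """Total reward the machine assigns to a sequence of labels."""
    rm.reset()
    return sum(rm.step(label) for label in labels)


print(episode_return(["goal", "key"]))           # goal visited before the key
print(episode_return(["key", "empty", "goal"]))  # key first, then goal
```

The two calls visit the same labels in different orders and receive different returns, which a reward defined only on the current environment state could not express. Pairing the environment state with the machine state restores the Markov property, which is the usual motivation for this kind of factorization.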
Unitless Unrestricted Markov-Consistent SCM Generation: Better Benchmark Datasets for Causal Discovery
Authors: Rebecca J. Herman, Jonas Wahl, Urmi Ninad, Jakob Runge
Venue: Proceedings of Machine Learning Research, 2025
First: 2025-03-21T10:46:50+00:00 · Latest: 2025-12-16T17:25:08+00:00
Comments: 4th Conference on Causal Learning and Reasoning. Code published in python package "UUMCdata" (https://pypi.org/project/UUMCdata/)
Abstract
Causal discovery aims to extract qualitative causal knowledge in the form of causal graphs from data. Because causal ground truth is rarely known in the real world, simulated data plays a vital role in evaluating the performance of the various causal discovery algorithms proposed in the literature. But recent work highlighted certain artifacts of commonly used data generation techniques for a standard class of structural causal models (SCM) that may be nonphysical, including var- and R2-sortability, where the variables' variance and coefficients of determination (R2) after regressing on all other variables, respectively, increase along the causal order. Some causal methods exploit such artifacts, leading to unrealistic expectations for their performance on real-world data. Some modifications have been proposed to remove these artifacts; notably, the internally-standardized structural causal model (iSCM) avoids varsortability and largely alleviates R2-sortability on sparse causal graphs, but exhibits a reversed R2-sortability pattern for denser graphs not featured in their work. We analyze which sortability patterns we expect to see in real data, and propose a method for drawing coefficients that we argue more effectively samples the space of SCMs. Finally, we propose a novel extension of our SCM generation method to the time series setting.
中文标题/摘要
标题:无量纲无限制马尔可夫一致SCM生成:更好的因果发现基准数据集
因果发现旨在从数据中提取以因果图形式表示的定性因果知识。由于现实世界中很少知道因果真相,因此模拟数据在评估文献中提出的各种因果发现算法的性能方面起着至关重要的作用。但最近的工作指出了标准结构因果模型(SCM)中常用数据生成技术的一些特定缺陷,这些缺陷可能是非物理的,包括方差排序性和R2排序性,其中变量在回归所有其他变量后的方差和决定系数(R2)沿因果顺序增加。一些因果方法利用了这些缺陷,导致对它们在真实数据上的性能产生了不切实际的期望。已经提出了一些修改来消除这些缺陷;值得注意的是,内部标准化结构因果模型(iSCM)避免了方差排序性,并在稀疏因果图上大大缓解了R2排序性,但在更密集的图上表现出相反的R2排序性模式,这不在他们的工作中有所体现。我们分析了在真实数据中期望看到哪些排序性模式,并提出了一种绘制系数的方法,我们认为这种方法更有效地抽样SCM的空间。最后,我们提出了一种将我们的SCM生成方法扩展到时间序列设置的新方法。
Summary / 总结
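Simulated data is central to evaluating causal discovery algorithms, but common SCM generation techniques introduce artifacts such as var- and R2-sortability that some methods exploit. The paper analyzes which sortability patterns to expect in real data, proposes a method for drawing coefficients that the authors argue samples the space of SCMs more effectively, and extends the generation method to the time series setting.
本文针对因果发现模拟数据中的非现实伪影问题,尤其是方差可排序性(varsortability)和R2可排序性(R2-sortability);一些因果发现方法会利用这些伪影,导致对其在真实数据上的性能期望过高。作者分析了真实数据中预期出现的排序性模式,提出了一种他们认为能更有效地对SCM空间进行采样的系数抽取方法,并将该生成方法扩展到时间序列场景。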
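The var-sortability artifact described in the abstract above (variance increasing along the causal order) can be reproduced with a toy simulation. The sketch below is not the UUMCdata generator. It assumes a hypothetical three-variable linear chain X1 -> X2 -> X3 with unit-variance Gaussian noise and an edge weight of 2, chosen only to make the effect visible.

```python
import random
import statistics


def simulate_chain(n_samples, weight, n_vars=3, seed=0):
    """Sample X1 -> X2 -> ... from a linear SCM with unit-variance Gaussian noise."""
    rng = random.Random(seed)
    columns = [[] for _ in range(n_vars)]
    for _ in range(n_samples):
        parent = 0.0  # the root variable has no parent contribution
        for j in range(n_vars):
            value = weight * parent + rng.gauss(0.0, 1.0)
            columns[j].append(value)
            parent = value
    return columns


columns = simulate_chain(n_samples=5000, weight=2.0)
variances = [statistics.variance(col) for col in columns]

# Sorting variables by sample variance recovers the causal order without any causal reasoning.
order_by_variance = sorted(range(len(variances)), key=lambda j: variances[j])
print(variances)
print(order_by_variance)
```

In this setup the population variances follow Var(X_j) = weight^2 * Var(X_{j-1}) + 1, that is 1, 5, and 21, so a method that simply sorts by variance would look successful on such data for reasons unrelated to causal structure.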
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Authors: Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo
First: 2025-12-16T17:22:46+00:00 · Latest: 2025-12-16T17:22:46+00:00
Comments: project page: https://3d-models.hunyuan.tencent.com/world/, demo: https://3d.hunyuan.tencent.com/sceneTo3D
Abstract
This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.
中文标题/摘要
标题:WorldPlay:实现实时交互世界建模的长期几何一致性
本文介绍了WorldPlay,一种流式视频扩散模型,能够实现实时、交互式的世界建模,并保持长期的几何一致性,解决了限制当前方法的速度与内存之间的权衡。WorldPlay 依靠三项关键创新。1) 我们使用双动作表示法,以应对用户的键盘和鼠标输入,实现稳健的动作控制。2) 为了确保长期一致性,我们的重构上下文记忆从过去的帧中动态重建上下文,并使用时间重塑来保持几何上重要但时间久远的帧的可访问性,从而有效缓解了记忆衰减。3) 我们还提出了一种名为上下文强迫的新蒸馏方法,专为记忆感知模型设计。通过在教师和学生之间对齐记忆上下文,保持学生利用长距离信息的能力,从而实现实时速度,同时防止误差漂移。综上所述,WorldPlay 能以 24 FPS 生成长时程的 720p 流式视频,具有更优的一致性,与现有技术相比表现更佳,并且在多种场景中具有较强的泛化能力。项目页面和在线演示可以在 https://3d-models.hunyuan.tencent.com/world/ 和 https://3d.hunyuan.tencent.com/sceneTo3D 查看。
Summary / 总结
WorldPlay is a streaming video diffusion model that addresses the trade-off between speed and memory in real-time interactive world modeling. It introduces a Dual Action Representation for robust action control, a Reconstituted Context Memory to maintain long-term geometric consistency, and Context Forcing for memory-aware distillation. The model generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes.
WorldPlay 是一种流式视频扩散模型,解决了实时互动世界建模中速度与内存之间的权衡问题。它引入了用于稳健动作控制的 Dual Action Representation,用于维持长期几何一致性的 Reconstituted Context Memory,以及用于内存感知蒸馏的 Context Forcing。该模型以24 FPS的速度生成长时程的720p流式视频,具有更高的一致性,与现有技术相比表现更佳,并在多种场景中表现出强大的泛化能力。
Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
Authors: Antonio Guillen-Perez
First: 2025-12-12T20:07:04+00:00 · Latest: 2025-12-16T17:15:46+00:00
Abstract
The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a "System 2" inference-time alignment strategy, utilizing a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to the best single scout models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.
中文标题/摘要
标题:Semantic-Drive: 通过开放词汇接地和神经符号VLM共识促进长尾数据整理的民主化
自主车辆(AV)的稳健开发受到“长尾”训练数据稀缺的限制。尽管车队收集了大量视频日志,但识别罕见的安全关键事件(例如,不规则的随意横穿马路、施工改道)仍然是一个手动且成本高昂的过程。现有解决方案依赖于粗略的元数据搜索,缺乏精确性,或者基于云的VLM,这侵犯了隐私并昂贵。我们提出了Semantic-Drive,这是一种本地优先的神经符号框架,用于语义数据挖掘。我们的方法将感知分为两个阶段:(1)通过实时开放词汇检测器(YOLOE)进行符号接地,以锚定注意力;(2)通过推理VLM进行认知分析,执行法医场景分析。为了减轻幻觉,我们实现了一种“系统2”推理时对齐策略,利用多模型“法官-侦察员”共识机制。在nuScenes数据集上与Waymo开放数据集(WOD-E2E)分类法进行基准测试,Semantic-Drive的召回率为0.966(而CLIP为0.475),与最佳单侦察员模型相比,风险评估误差降低了40%。该系统完全在消费级硬件(NVIDIA RTX 3090)上运行,提供了一种隐私保护的替代方案,替代云服务。
Summary / 总结
Semantic-Drive addresses the challenge of curating long-tail data for autonomous vehicles by introducing a local-first, neuro-symbolic framework. It uses a real-time open-vocabulary detector (YOLOE) for symbolic grounding and a Reasoning VLM for cognitive analysis. The system achieves a recall of 0.966 in identifying rare safety-critical events and reduces risk assessment error by 40% compared to single scout models, all while running on consumer hardware and preserving privacy.
Semantic-Drive通过引入一个本地优先的神经符号框架来解决自动驾驶汽车中长尾数据采集的挑战。它使用实时的开放词汇检测器(YOLOE)进行符号定位,并使用推理VLM进行认知分析。该系统在nuScenes数据集上实现了0.966的高召回率,并将风险评估误差降低了40%,同时在消费级硬件(NVIDIA RTX 3090)上运行,从而保护了隐私。
LLmFPCA-detect: LLM-powered Multivariate Functional PCA for Anomaly Detection in Sparse Longitudinal Texts
Authors: Prasanjit Dubey, Aritra Guha, Zhengyi Zhou, Qiong Wu, Xiaoming Huo, Paromita Dubey
First: 2025-12-16T17:14:10+00:00 · Latest: 2025-12-16T17:14:10+00:00
Abstract
Sparse longitudinal (SL) textual data arises when individuals generate text repeatedly over time (e.g., customer reviews, occasional social media posts, electronic medical records across visits), but the frequency and timing of observations vary across individuals. These complex textual data sets have immense potential to inform future policy and targeted recommendations. However, because SL text data lack dedicated methods and are noisy, heterogeneous, and prone to anomalies, detecting and inferring key patterns is challenging. We introduce LLmFPCA-detect, a flexible framework that pairs LLM-based text embeddings with functional data analysis to detect clusters and infer anomalies in large SL text datasets. First, LLmFPCA-detect embeds each piece of text into an application-specific numeric space using LLM prompts. Sparse multivariate functional principal component analysis (mFPCA) conducted in the numeric space forms the workhorse to recover primary population characteristics, and produces subject-level scores which, together with baseline static covariates, facilitate data segmentation, unsupervised anomaly detection and inference, and enable other downstream tasks. In particular, we leverage LLMs to perform dynamic keyword profiling guided by the data segments and anomalies discovered by LLmFPCA-detect, and we show that cluster-specific functional PC scores from LLmFPCA-detect, used as features in existing pipelines, help boost prediction performance. We support the stability of LLmFPCA-detect with experiments and evaluate it on two different applications using public datasets, Amazon customer-review trajectories, and Wikipedia talk-page comment streams, demonstrating utility across domains and outperforming state-of-the-art baselines.
中文标题/摘要
标题:LLmFPCA-detect:基于LLM的多元功能性主成分分析方法用于稀疏纵向文本中的异常检测
稀疏纵向(SL)文本数据在个体在时间上反复生成文本(例如,客户评论、偶尔的社会媒体帖子、多次访问的电子医疗记录)时产生,但观察的频率和时间因个体而异。这些复杂的文本数据集具有巨大的潜力,可以为未来的政策和针对性建议提供信息。然而,由于缺乏专门的方法,且SL文本数据具有噪声、异质性和易发生异常的特点,因此检测和推断关键模式具有挑战性。我们引入了LLmFPCA-detect,这是一种灵活的框架,将基于LLM的文本嵌入与函数型数据分析相结合,以检测大型SL文本数据集中的聚类并推断异常。首先,LLmFPCA-detect使用LLM提示将每段文本嵌入到特定应用的数值空间中。在数值空间中进行的稀疏多元函数型主成分分析(mFPCA)是核心工具,用于恢复主要的总体特征,并生成个体水平的评分,这些评分与基线静态协变量一起,有助于数据分段、无监督异常检测和推断,并使其他下游任务成为可能。特别是,我们利用LLM根据LLmFPCA-detect发现的数据段和异常进行动态关键词特征分析,并展示了LLmFPCA-detect生成的特定聚类的函数型主成分评分作为现有管道中的特征,有助于提高预测性能。我们通过实验支持LLmFPCA-detect的稳定性,并使用公共数据集Amazon客户评论轨迹和维基百科讨论页面评论流对其进行评估,证明其在不同领域的实用性,并优于最先进的基线方法。
Summary / 总结
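LLmFPCA-detect pairs LLM-based text embeddings with functional data analysis to detect clusters and infer anomalies in sparse longitudinal text, where the frequency and timing of observations vary across individuals. It embeds each text into an application-specific numeric space via LLM prompts, then applies sparse multivariate functional PCA to recover population characteristics and produce subject-level scores for segmentation and anomaly detection. Evaluated on Amazon customer-review trajectories and Wikipedia talk-page comment streams, it outperforms state-of-the-art baselines.
LLmFPCA-detect 是一种将基于 LLM 的文本嵌入与函数型数据分析相结合的框架,用于检测稀疏纵向文本中的异常。该方法使用 LLM 提示将文本嵌入到数值空间,并应用稀疏多元函数型主成分分析以恢复主要特征并检测异常。该方法利用 LLM 进行动态关键词分析,并显示出改进的预测性能。实验表明,该方法在亚马逊客户评论和维基百科讨论页面上具有有效性,并优于现有方法。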
LLM-driven Knowledge Enhancement for Multimodal Cancer Survival Prediction
Authors: Chenyu Zhao, Yingxue Xu, Fengtao Zhou, Yihui Wang, Hao Chen
First: 2025-12-16T17:03:56+00:00 · Latest: 2025-12-16T17:03:56+00:00
Abstract
Current multimodal survival prediction methods typically rely on pathology images (WSIs) and genomic data, both of which are high-dimensional and redundant, making it difficult to extract discriminative features from them and align different modalities. Moreover, using a simple survival follow-up label is insufficient to supervise such a complex task. To address these challenges, we propose KEMM, an LLM-driven Knowledge-Enhanced Multimodal Model for cancer survival prediction, which integrates expert reports and prognostic background knowledge. 1) Expert reports, provided by pathologists on a case-by-case basis and refined by a large language model (LLM), offer succinct and clinically focused diagnostic statements. This information may suggest different survival outcomes. 2) Prognostic background knowledge (PBK), generated concisely by an LLM, provides valuable prognostic context on different cancer types, which also enhances survival prediction. To leverage this knowledge, we introduce the knowledge-enhanced cross-modal (KECM) attention module. KECM can effectively guide the network to focus on discriminative and survival-relevant features from highly redundant modalities. Extensive experiments on five datasets demonstrate that KEMM achieves state-of-the-art performance. The code will be released upon acceptance.
中文标题/摘要
标题:基于LLM的知识增强多模态癌症生存预测
当前的多模态生存预测方法通常依赖于病理图像(WSIs)和基因组数据,这两种数据都是高维度且冗余的,使得难以从中提取具有区分性的特征并使不同模态对齐。此外,使用简单的生存随访标签不足以监督这样一个复杂的任务。为了解决这些挑战,我们提出了KEMM,这是一种基于LLM的知识增强多模态模型,用于癌症生存预测,该模型整合了专家报告和预后背景知识。1)专家报告,由病理学家针对每个病例提供并由大型语言模型(LLM)精炼,提供了简洁且临床导向的诊断陈述。这些信息通常会暗示不同的生存结果。2)预后背景知识(PBK),由LLM简洁生成,提供了不同癌症类型的有价值预后背景知识,也增强了生存预测。为了利用这些知识,我们引入了知识增强跨模态(KECM)注意力模块。KECM可以有效地引导网络关注冗余模态中的具有区分性和生存相关性特征。在五个数据集上的广泛实验表明,KEMM 达到了最先进的性能。代码将在接受后发布。
Summary / 总结
The paper proposes KEMM, an LLM-driven Knowledge-Enhanced Multimodal Model for cancer survival prediction, addressing the challenges of high-dimensional and redundant data by integrating expert reports and prognostic background knowledge. KEMM includes a knowledge-enhanced cross-modal attention module that helps the network focus on relevant features. Experiments on five datasets show that KEMM outperforms existing methods in cancer survival prediction.
该论文旨在解决从高维度和冗余的病理图像和基因组数据中提取区分性特征以进行癌症生存预测的挑战。它提出了一个LLM驱动的知识增强多模态模型KEMM,该模型整合了专家报告和预后背景知识。该模型引入了知识增强的跨模态(KECM)注意力模块,以聚焦于与生存相关的特征。在五个数据集上的实验表明,KEMM 在癌症生存预测中优于现有方法。
Text Embedded Swin-UMamba for DeepLesion Segmentation
Authors: Ruida Cheng, Tejas Sudharshan Mathai, Pritam Mukherjee, Benjamin Hou, Qingqing Zhu, Zhiyong Lu, Matthew McAuliffe, Ronald M. Summers
First: 2025-08-08T16:54:06+00:00 · Latest: 2025-12-16T17:03:17+00:00
Abstract
Segmentation of lesions on CT enables automatic measurement for clinical assessment of chronic diseases (e.g., lymphoma). Integrating large language models (LLMs) into the lesion segmentation workflow has the potential to combine imaging features with descriptions of lesion characteristics from the radiology reports. In this study, we investigate the feasibility of integrating text into the Swin-UMamba architecture for the task of lesion segmentation. The publicly available ULS23 DeepLesion dataset was used along with short-form descriptions of the findings from the reports. On the test dataset, our method achieved a high Dice score of 82.64, and a low Hausdorff distance of 6.34 pixels was obtained for lesion segmentation. The proposed Text-Swin-U/Mamba model outperformed prior approaches: 37.79% improvement over the LLM-driven LanGuideMedSeg model (p < 0.001), and surpassed the purely image-based XLSTM-UNet and nnUNet models by 2.58% and 1.01%, respectively. The dataset and code can be accessed at https://github.com/ruida/LLM-Swin-UMamba
中文标题/摘要
标题:Text嵌入Swin-UMamba用于DeepLesion分割
CT上病变的分割能够实现自动测量,用于慢性疾病(如淋巴瘤)的临床评估。将大型语言模型(LLMs)集成到病变分割工作流程中,有可能结合影像特征与放射报告中病变特征的描述。在本研究中,我们探讨了将文本集成到Swin-UMamba架构中进行病变分割任务的可行性。使用公开的ULS23 DeepLesion数据集以及报告中的简短发现描述。在测试数据集上,我们的方法获得了82.64的高Dice分数,并且病变分割的低Hausdorff距离为6.34像素。提出的Text-Swin-U/Mamba模型优于先前的方法:比LLM驱动的LanGuideMedSeg模型高出37.79%(p < 0.001),并分别超越基于图像的XLSTM-UNet和nnUNet模型2.58%和1.01%。数据集和代码可在https://github.com/ruida/LLM-Swin-UMamba获取
Summary / 总结
This study explores integrating text from radiology reports into the Swin-UMamba architecture for lesion segmentation on CT images. Using the ULS23 DeepLesion dataset, the proposed Text-Swin-U/Mamba model achieved a Dice score of 82.64 and a Hausdorff distance of 6.34 pixels. It improved on the LLM-driven LanGuideMedSeg model by 37.79% (p < 0.001) and on the image-only XLSTM-UNet and nnUNet models by 2.58% and 1.01%, respectively.
本研究探索将放射报告中的文本信息集成到Swin-UMamba架构中,用于CT图像上的病灶分割。使用ULS23 DeepLesion数据集,提出的Text-Swin-U/Mamba模型获得了82.64的Dice分数和6.34像素的Hausdorff距离,分别比LanGuideMedSeg、XLSTM-UNet和nnUNet高出37.79%、2.58%和1.01%。
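The entry above reports a Dice score and a Hausdorff distance in pixels. For readers unfamiliar with these two segmentation metrics, the sketch below computes them on toy masks stored as sets of (row, col) pixel coordinates. It uses the textbook definitions; the abstract does not say which variant the paper uses (for example, boundary-only or 95th-percentile Hausdorff), and the toy masks are hypothetical.

```python
import math


def dice_score(pred, truth):
    """Dice coefficient 2|A & B| / (|A| + |B|) between two sets of pixel coordinates."""
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff_distance(pred, truth):
    """Symmetric Hausdorff distance in pixels (brute force; both sets must be non-empty)."""
    def directed(a, b):
        # Largest distance from a point in a to its nearest point in b.
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(pred, truth), directed(truth, pred))


# Toy masks: a 4x4 ground-truth square and a prediction shifted down by one row.
truth = {(r, c) for r in range(2, 6) for c in range(2, 6)}
pred = {(r, c) for r in range(3, 7) for c in range(2, 6)}

print(dice_score(pred, truth))          # 12 shared pixels out of 16 + 16 -> 0.75
print(hausdorff_distance(pred, truth))  # worst-case mismatch is one pixel -> 1.0
```

Dice rewards overlap, so higher is better, while Hausdorff distance measures the worst boundary disagreement, so lower is better. The digest reports Dice on a 0 to 100 scale, which corresponds to multiplying the value above by 100.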
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Authors: Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Donglei Ji, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Yichong Leng, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Shanshan Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Jihao Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Kai Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Yan Sun, Zeyu Sun, Wenjing Tang, Yaxue Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang, Ke Wang, Kexin Wang, Qingyi Wang, Rui Wang, Sen Wang, Shuai Wang, Tingru Wang, Weichen Wang, Xin Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Ziyu Wang, Guoqiang Wei, Wanru Wei, Di Wu, Guohong Wu, Hanjie Wu, Jian Wu, Jie Wu, Ruolan Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Liang Xiang, Fei Xiao, XueFeng Xiao, Pan Xie, Shuangyi Xie, Shuang Xu, Jinlan Xue, Shen Yan, Bangbang Yang, Ceyuan Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yang Yang, Yihang Yang, ZhiXian Yang, Ziyan Yang, Songting Yao, Yifan Yao, Zilyu Ye, Bowen Yu, Jian Yu, Chujie Yuan, Linxiao Yuan, Sichun Zeng, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Chuntao Zhang, Heng Zhang, Jingjie Zhang, Kuo Zhang, Liang Zhang, Liying Zhang, Manlin Zhang, Ting Zhang, Weida Zhang, Xiaohe Zhang, Xinyan Zhang, Yan Zhang, Yuan Zhang, Zixiang Zhang, Fengxuan Zhao, Huating Zhao, Yang Zhao, Hao Zheng, Jianbin Zheng, Xiaozheng Zheng, Yangyang Zheng, Yijie Zheng, Jiexin Zhou, Jiahui Zhu, Kuan Zhu, Shenhan Zhu, Wenjia Zhu, Benhui Zou, Feilong Zuo
First: 2025-12-15T16:36:52+00:00 · Latest: 2025-12-16T16:58:55+00:00
Comments: Seedance 1.5 pro Technical Report
Abstract
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
中文标题/摘要
标题:Seedance 1.5 pro:一种原生音视频联合生成基础模型
近期音视频生成技术的进步为统一的音视频生成铺平了道路。在此项工作中,我们介绍了Seedance 1.5 pro,这是一种专门针对原生音视频联合生成的基础模型。该模型利用双分支扩散变换器架构,结合跨模态联合模块和专门的多阶段数据管道,实现了卓越的音视频同步和生成质量。为了确保其实用性,我们实施了精细的后训练优化,包括在高质量数据集上进行监督微调(SFT)和多维度奖励模型的人工反馈强化学习(RLHF)。此外,我们还引入了一种加速框架,将推理速度提高了超过10倍。Seedance 1.5 pro 通过精确的多语言和方言唇同步、动态电影级摄像机控制和增强的叙事连贯性,使其成为专业级内容创作的强大引擎。Seedance 1.5 pro 现已可在火山引擎上访问:https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo。
Summary / 总结
Seedance 1.5 pro is a foundational model designed for native audio-visual generation using a dual-branch Diffusion Transformer architecture. It integrates a cross-modal joint module and a specialized multi-stage data pipeline, achieving high synchronization and quality. Post-training optimizations, including Supervised Fine-Tuning and RLHF, ensure practical utility, and an acceleration framework enhances inference speed. Key features include precise multilingual lip-syncing, dynamic camera control, and enhanced narrative coherence, making it suitable for professional content creation.
Seedance 1.5 pro 是一种专为原生音视频生成设计的基础模型,采用双分支扩散变换器架构。它集成了跨模态联合模块和专门的多阶段数据管道,实现了高同步性和高质量。通过监督微调和基于人类反馈的强化学习等后训练优化确保其实用性,并通过加速框架提升了推理速度。主要功能包括精确的多语言唇同步、动态摄像机控制和增强的叙事连贯性,使其适用于专业内容创作。