arXiv 论文速递

2026-03-23 03:28
Snapshot: 20260323_0328
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Authors: Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai
First: 2026-03-19T17:59:58+00:00 · Latest: 2026-03-19T17:59:58+00:00
Comments: 31 pages, 12 figures
Abstract
While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
中文标题/摘要
标题:生成模型了解空间:释放隐式3D先验以促进场景理解
虽然多模态大型语言模型展示了令人印象深刻的语义能力,但它们往往在空间感知方面存在局限性,难以进行精细的几何推理和物理动力学处理。现有解决方案通常依赖于显式的3D模态或复杂的几何结构,这些方法受限于数据稀缺性和泛化挑战。在本文中,我们提出了一种范式转变,通过利用大规模视频生成模型中的隐式空间先验。我们认为,为了合成时空连贯的视频,这些模型会内在地学习稳健的3D结构先验和物理法则。我们引入了VEGA-3D(视频提取生成意识)框架,该框架将预训练的视频扩散模型重新用于潜空间模拟器。通过从中间噪声级别提取时空特征,并通过基于token的自适应门控融合机制将其与语义表示集成,我们为MLLMs提供了密集的几何线索,而无需显式的3D监督。在3D场景理解、空间推理和具身操作基准测试中的广泛实验表明,我们的方法优于最先进的基线,验证了生成先验为物理世界理解提供了可扩展的基础。代码可在https://github.com/H-EmbodVis/VEGA-3D公开获取。
Summary / 总结
This work addresses the spatial limitations of multimodal large language models (MLLMs) by leveraging implicit 3D priors from video generation models. VEGA-3D, a plug-and-play framework, repurposes a pre-trained video diffusion model as a Latent World Simulator: it extracts spatiotemporal features from intermediate noise levels and fuses them with semantic representations through token-level adaptive gated fusion, giving MLLMs dense geometric cues without explicit 3D supervision. The authors report that the method outperforms state-of-the-art baselines on 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks, which supports the claim that generative priors offer a scalable foundation for physical-world understanding.
本文提出VEGA-3D框架,利用视频生成模型中的隐式3D先验来缓解多模态大语言模型的空间局限性。通过将时空特征与语义表示融合,VEGA-3D为MLLMs提供了密集的几何线索,提升了3D场景理解、空间推理和具身操作任务的表现。实验表明,VEGA-3D优于最先进的基线,验证了生成先验在物理世界理解中的有效性。
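To make the phrase "token-level adaptive gated fusion" concrete, here is a minimal sketch of one common way to gate two feature streams per token. It is not taken from the VEGA-3D code: the function names, the scalar gate, and the linear gate parameters (`w_sem`, `w_geo`, `bias`) are all assumptions made for illustration, and the paper's actual mechanism may differ.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(semantic, geometric, w_sem, w_geo, bias=0.0):
    """Fuse one token's semantic and geometric features with a scalar gate.

    Hypothetical sketch: the gate is a sigmoid of a linear function of both
    feature vectors, so every token gets its own mixing weight.
    """
    logit = bias
    logit += sum(w * s for w, s in zip(w_sem, semantic))
    logit += sum(w * g for w, g in zip(w_geo, geometric))
    gate = sigmoid(logit)
    # gate -> 1 favors the geometric stream, gate -> 0 the semantic stream.
    fused = [gate * g + (1.0 - gate) * s for s, g in zip(semantic, geometric)]
    return fused, gate


def fuse_tokens(sem_tokens, geo_tokens, w_sem, w_geo, bias=0.0):
    """Apply the per-token gate independently to every token pair."""
    return [
        gated_fuse(s, g, w_sem, w_geo, bias)[0]
        for s, g in zip(sem_tokens, geo_tokens)
    ]
```

With all gate parameters set to zero the gate is 0.5 and the fused token is the plain average of the two streams; a learned gate would instead shift weight toward whichever stream is more informative for that token.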
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens
Authors: Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu
Venue: CVPR 2026
First: 2026-03-19T17:59:55+00:00 · Latest: 2026-03-19T17:59:55+00:00
Comments: Accepted by CVPR 2026 main track; Code: https://github.com/YuqingWang1029/CubiD
Abstract
Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
中文标题/摘要
标题:立方离散扩散:高维表示令牌上的离散视觉生成
使用离散令牌进行视觉生成获得了显著关注,因为它允许与语言模型共享统一的令牌预测范式,有望实现无缝的多模态架构。然而,当前的离散生成方法仍然局限于低维潜变量(通常为8-32维),牺牲了理解所需的语义丰富性。虽然高维预训练表示(768-1024维)可以弥补这一差距,但它们的离散生成提出了根本性的挑战。在本文中,我们提出了立方离散扩散(CubiD),这是第一个用于高维表示的离散生成模型。CubiD 在高维离散表示中执行精细粒度的掩码——任何维度在任何位置都可以被掩码并从部分观察中预测。这使模型能够学习丰富的空间位置内和跨位置的相关性,生成步骤数固定为 $T$,与特征维度无关,其中 $T \ll hwd$。在ImageNet-256上,CubiD 达到了最先进的离散生成效果,从9亿到37亿参数具有强大的扩展行为。至关重要的是,我们验证这些离散化令牌保留了原始表示能力,证明了相同的离散令牌可以有效地服务于理解和生成任务。我们希望这项工作能够激发未来研究向统一的多模态架构方向发展。代码可在:https://github.com/YuqingWang1029/CubiD 获取。
Summary / 总结
Cubic Discrete Diffusion (CubiD) is the first discrete generation model for high-dimensional representations, addressing the challenge of generating discrete tokens from high-dimensional pretrained features. By performing fine-grained masking and prediction in a high-dimensional discrete space, CubiD learns rich correlations within and across spatial positions. On ImageNet-256, CubiD achieves state-of-the-art discrete generation performance with strong scaling from 900M to 3.7B parameters, and the discretized tokens preserve the original representation capabilities for both understanding and generation tasks.
Cubic Discrete Diffusion (CubiD)通过在高维离散表示中进行细粒度的掩码与预测,解决了高维预训练表示的离散生成难题:任意位置的任意维度都可以被掩码,并根据部分观察进行预测。该方法在ImageNet-256上实现了最先进的离散生成结果,并且参数规模从900M到3.7B具有良好的扩展性。关键发现是这些离散化令牌保留了原始表示能力,能够同时用于理解和生成任务。
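The masking idea in the abstract (any dimension at any position of an h x w x d token cube can be masked, with generation taking a fixed number of steps T much smaller than hwd) can be sketched as follows. This is an illustrative toy, not the CubiD implementation: the helper names are hypothetical, and the cosine unmasking schedule is an assumption borrowed from common masked-generation practice rather than something the abstract specifies.

```python
import math
import random


def random_cube_mask(h, w, d, mask_ratio, rng):
    """Pick (row, col, dim) entries to mask, uniformly over the whole cube."""
    entries = [(i, j, k) for i in range(h) for j in range(w) for k in range(d)]
    n_mask = round(mask_ratio * len(entries))
    return set(rng.sample(entries, n_mask))


def unmask_schedule(total, steps):
    """Entries revealed at each step under a cosine schedule (an assumption).

    The number of still-masked entries follows cos(pi/2 * t / steps), so few
    entries are revealed in early steps and more in later ones.
    """
    remaining = [
        round(total * math.cos(math.pi / 2 * t / steps))
        for t in range(steps + 1)
    ]
    return [remaining[t] - remaining[t + 1] for t in range(steps)]


if __name__ == "__main__":
    rng = random.Random(0)
    h, w, d, steps = 4, 4, 8, 8
    mask = random_cube_mask(h, w, d, mask_ratio=0.5, rng=rng)
    print(len(mask), "of", h * w * d, "entries masked")
    print("revealed per step:", unmask_schedule(h * w * d, steps))
```

The point of the schedule is that the step count stays at `steps` no matter how large `d` is, because each step reveals many cube entries in parallel.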
MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
Authors: Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu
First: 2026-03-19T17:59:52+00:00 · Latest: 2026-03-19T17:59:52+00:00
Comments: Project page: https://lihaitian.com/MonoArt
Abstract
Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
中文标题/摘要
标题:MonoArt:面向单目铰接式3D重建的渐进结构推理
从单张图像重建铰接式3D对象,需要在有限的视觉证据下联合推断对象几何、部件结构和运动参数。关键难点在于运动线索与对象结构的纠缠,这使得直接回归铰接参数不稳定。现有方法通过多视角监督、基于检索的组装或辅助视频生成来应对这一挑战,但往往牺牲了可扩展性或效率。我们提出了 MonoArt,这是一种基于渐进结构推理的统一框架。MonoArt 不是从图像特征直接预测铰接参数,而是在单一架构中逐步将视觉观察转化为规范几何、结构化部件表示和运动感知嵌入。这种结构化推理过程无需外部运动模板或多阶段流水线,即可实现稳定且可解释的铰接推断。在 PartNet-Mobility 上的广泛实验表明,MonoArt 在重建精度和推理速度方面均达到最先进的性能。该框架还可进一步推广到机器人操作和铰接场景重建。
Summary / 总结
The research aims to reconstruct articulated 3D objects from a single image by jointly inferring geometry, part structure, and motion parameters. MonoArt proposes a unified framework using progressive structural reasoning, transforming visual observations into canonical geometry and motion-aware embeddings within a single architecture. This approach achieves state-of-the-art performance in reconstruction accuracy and inference speed on PartNet-Mobility, and generalizes to robotic manipulation and articulated scene reconstruction.
MonoArt在统一架构中逐步将视觉观察转换为规范几何、结构化部件表示和运动感知嵌入,以应对从单张图像重建铰接式3D对象的挑战,避免了直接回归铰接参数的不稳定性,并在PartNet-Mobility上实现了重建精度和推理速度方面的最先进性能。该框架还适用于机器人操作和铰接场景重建。
NavTrust: Benchmarking Trustworthiness for Embodied Navigation
Authors: Huaide Jiang, Yash Chaudhary, Yuping Wang, Zehao Wang, Raghav Sharma, Manan Mehta, Yang Zhou, Lichao Sun, Zhiwen Fan, Zhengzhong Tu, Jiachen Li
First: 2026-03-19T17:59:51+00:00 · Latest: 2026-03-19T17:59:51+00:00
Comments: Project Website: https://navtrust.github.io
Abstract
There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To our best knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instructions corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.
中文标题/摘要
标题:NavTrust:评估实体导航可信度基准
实体导航主要分为两类:视觉-语言导航(VLN),其中代理通过遵循自然语言指令进行导航;以及目标-对象导航(OGN),其中代理导航至指定目标对象。然而,现有工作主要在理想条件下评估模型性能,忽视了真实世界环境中可能出现的潜在干扰。为解决这一问题,我们提出了NavTrust,这是一个统一基准,系统地在现实场景中对输入模态(包括RGB、深度和指令)进行干扰,并评估其对导航性能的影响。据我们所知,NavTrust是第一个在统一框架中使实体导航代理暴露于多种RGB-Depth干扰和指令变化的基准。我们对七种最先进的方法进行了广泛评估,发现它们在现实干扰下的性能显著下降,这突显了关键的鲁棒性差距,并为更可信的实体导航系统指明了道路。此外,我们系统地评估了四种不同的缓解策略,以增强对RGB-Depth和指令干扰的鲁棒性。我们的基础模型包括Uni-NaVid和ETPNav。我们在一个真实的移动机器人上部署了它们,并观察到对干扰的鲁棒性有所提高。项目网站:https://navtrust.github.io
Summary / 总结
NavTrust benchmarks the trustworthiness of embodied navigation systems by systematically corrupting RGB, depth, and instruction inputs in VLN and OGN tasks. It evaluates seven state-of-the-art approaches and finds substantial performance degradation under realistic corruptions, indicating critical robustness gaps. NavTrust also assesses four mitigation strategies against these corruptions; the authors deployed the base models (Uni-NaVid and ETPNav) on a real mobile robot and observed improved robustness. The project website is: https://navtrust.github.io.
NavTrust 通过在现实场景中系统地干扰 RGB、深度和指令输入来评估具身导航的可信度。它评估了七种最先进的方法,发现这些方法在现实干扰下性能显著下降,突显了鲁棒性差距。NavTrust 还评估了四种缓解策略,并在真实移动机器人上观察到鲁棒性有所提高。
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
Authors: Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang
First: 2026-03-19T17:59:51+00:00 · Latest: 2026-03-19T17:59:51+00:00
Comments: 24 pages, 12 figures
Abstract
Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
中文标题/摘要
标题:SAMA:因子化语义锚定与运动对齐的指令引导视频编辑
当前的指令引导视频编辑模型难以同时平衡精确的语义修改与忠实的运动保留。现有方法依赖于注入显式的外部先验(例如,VLM特征或结构条件)来缓解这些问题,但这种依赖严重限制了模型的鲁棒性和泛化能力。为克服这一限制,我们提出了SAMA(因子化语义锚定与运动对齐),一种将视频编辑分解为语义锚定和运动建模的框架。首先,我们引入了语义锚定,通过在稀疏锚定帧上联合预测语义令牌和视频潜在变量来建立可靠的视觉锚定,从而实现纯粹基于指令的结构规划。其次,运动对齐在以运动为中心的视频恢复预训练任务(立方体填充、速度扰动和管子打乱)上预训练相同的骨干网络,使模型能够直接从原始视频中内化时间动态。SAMA 通过两阶段优化:无配对视频-指令编辑数据的因子化预训练阶段,学习固有的语义-运动表示,随后在配对编辑数据上进行监督微调。值得注意的是,仅因子化预训练本身就已经表现出强大的零样本视频编辑能力,验证了所提出的分解的有效性。SAMA 在开源模型中达到了最先进的性能,并且与领先的商业系统(例如 Kling-Omni)竞争。代码、模型和数据集将被发布。
Summary / 总结
SAMA addresses the difficulty of balancing precise semantic modifications with faithful motion preservation in instruction-guided video editing by factorizing the task into semantic anchoring and motion modeling. Semantic Anchoring establishes a visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames; Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle). Training uses a two-stage pipeline: factorized pre-training without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. The factorized pre-training alone already yields strong zero-shot editing ability, and the full model achieves state-of-the-art performance among open-source models while remaining competitive with leading commercial systems such as Kling-Omni.
SAMA通过将视频编辑分解为语义锚定和运动建模,来解决指令引导视频编辑中精确语义修改与忠实运动保留难以兼顾的问题。语义锚定在稀疏锚定帧上联合预测语义令牌和视频潜在变量,以建立可靠的视觉锚点;运动对齐则在以运动为中心的视频恢复任务上预训练同一骨干网络,使模型直接从原始视频中内化时间动态。SAMA采用两阶段训练:先进行因子化预训练,再在配对编辑数据上进行监督微调。仅因子化预训练就已带来较强的零样本视频编辑能力;完整模型在开源模型中达到最先进的性能,并可与领先的商业系统相竞争。
FinTradeBench: A Financial Reasoning Benchmark for LLMs
Authors: Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta
First: 2026-03-19T17:59:41+00:00 · Latest: 2026-03-19T17:59:41+00:00
Comments: 8 pages main text, 22 pages total (including references and appendix). 5 figures, 14 tables. Preprint under review. Code and data will be made available upon publication
Abstract
Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with the advancement of Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals. To take advantage of the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.
中文标题/摘要
标题:FinTradeBench:LLMs的金融推理基准
现实世界的金融决策是一个具有挑战性的问题,需要在公司基本面(来自监管文件的数据)和价格动态计算的交易信号等多种信号之间进行推理。近年来,随着大型语言模型(LLMs)的发展,金融分析师已经开始使用它们进行金融决策任务。然而,现有的金融问答基准主要侧重于公司资产负债表数据,很少评估公司股票在市场上的交易情况及其与基本面的互动。为了充分利用两种方法的优势,我们引入了FinTradeBench,这是一个结合公司基本面和交易信号的金融推理基准。FinTradeBench 包含了1400个基于纳斯达克100公司的问题,时间跨度为十年。基准测试分为三大类推理问题:以基本面为主、以交易信号为主和需要跨信号推理的混合问题。为了确保大规模的可靠性,我们采用了校准-扩展框架,结合了专家种子问题、多模型响应生成、模型内自我筛选、数值审计和人-LLM法官对齐。我们在零样本提示和检索增强设置下评估了14个LLM,并观察到了明显的性能差距。检索显著提高了对文本基本面的推理能力,但对交易信号推理的帮助有限。这些发现突显了当前LLM在数值和时间序列推理方面的基本挑战,并激发了未来金融智能研究的动力。
Summary / 总结
FinTradeBench is a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals, addressing the limitations of existing benchmarks. It contains 1,400 questions on NASDAQ-100 companies over a ten-year period, categorized into three types: fundamentals-focused, trading-signal-focused, and hybrid questions. The evaluation of 14 LLMs under zero-shot and retrieval-augmented settings revealed a clear performance gap, with retrieval improving reasoning over textual fundamentals but providing limited benefit for trading-signal reasoning. This highlights challenges in numerical and time-series reasoning for current LLMs and motivates future research in financial intelligence.
FinTradeBench 是一个结合公司基本面和交易信号的金融推理基准,旨在弥补现有基准的不足。它包含1,400个关于纳斯达克100公司的十年期问题,分为三大类:基本面导向、交易信号导向和混合问题。对14个LLM在零样本和检索增强设置下的评估显示,检索在提高对文本基本面的推理方面表现出色,但在交易信号推理方面提供的帮助有限。这突显了当前LLM在数值和时间序列推理方面的挑战,并激发了未来金融智能研究的动力。
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
First: 2026-03-19T17:59:21+00:00 · Latest: 2026-03-19T17:59:21+00:00
Abstract
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
中文标题/摘要
标题:F2LLM-v2:包容性、高性能且高效的多语言嵌入模型
我们提出了F2LLM-v2,这是一种新的通用多语言嵌入模型系列,包含8种不同规模,从80M到14B。该模型基于6000万条高质量公开数据样本的新综合训练而成,支持超过200种语言,特别强调了之前未充分服务的中低资源语言。通过结合两阶段基于LLM的嵌入训练流水线、matryoshka学习、模型剪枝和知识蒸馏技术,我们展示了比之前基于LLM的嵌入模型更高效的模型,同时保持了竞争力。广泛的评估证实,F2LLM-v2-14B在11个MTEB基准测试中排名第一,而该系列中的较小模型也为资源受限的应用设定了新的标准。为了促进开源嵌入模型研究,我们发布了所有模型、数据、代码和中间检查点。
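The abstract mentions matryoshka learning as one of the efficiency techniques. The practical consequence of that style of training is that a prefix of the embedding can be used as a smaller embedding in its own right. The sketch below shows only that inference-side step (truncate, then re-normalize); the function names are hypothetical, the vectors are made-up toy values, and nothing here reflects the F2LLM-v2 API or its supported dimensions.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("prefix has zero norm and cannot be normalized")
    return [x / norm for x in prefix]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Toy 4-dimensional "embeddings" (illustrative values only).
    query = [0.9, 0.1, 0.3, -0.2]
    doc = [0.8, 0.2, -0.1, 0.4]
    for dim in (2, 4):
        q = truncate_and_normalize(query, dim)
        d = truncate_and_normalize(doc, dim)
        print(dim, round(dot(q, d), 4))
```

Smaller prefixes trade some retrieval quality for lower storage and faster similarity search, which is why this pairs naturally with the pruned and distilled smaller models the abstract describes.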
Online Learning and Equilibrium Computation with Ranking Feedback
Authors: Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman Ozdaglar, Kaiqing Zhang
First: 2026-03-19T17:59:07+00:00 · Latest: 2026-03-19T17:59:07+00:00
Abstract
Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on \emph{numeric} utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a \emph{ranking} over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the \emph{instantaneous} utility at the current timestep, and rankings induced by the \emph{time-average} utility up to the current timestep, under both \emph{full-information} and \emph{bandit} feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, \emph{i.e.}, under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.
中文标题/摘要
标题:在线学习与排序反馈下的均衡计算
任意(甚至可能是对抗性)环境中的在线学习在序贯决策领域已被广泛研究,并与博弈论中的均衡计算密切相关。现有的大多数在线学习算法依赖于环境提供的数值效用反馈,但在人在回路的应用中,这种反馈可能无法获得,或因隐私考虑而受限。本文研究一种在线学习模型,其中学习者在每个时间步仅能观察到一组提议动作的排序。我们考虑两种排序机制:由当前时间步的瞬时效用诱导的排序,以及由截至当前时间步的时间平均效用诱导的排序,并分别在完全信息反馈和bandit反馈设置下进行分析。使用标准的外部遗憾度量,我们证明在一般情况下,瞬时效用排序反馈无法实现亚线性遗憾。此外,当排序模型相对确定时,即在温度足够小的Plackett-Luce模型下,时间平均效用排序反馈同样无法实现亚线性遗憾。随后,我们在效用序列具有亚线性总变差的额外假设下,提出了能够实现亚线性遗憾的新算法。值得注意的是,对于完全信息下的时间平均效用排序反馈,这一额外假设可以去除。因此,当标准式博弈中的所有参与者都遵循我们的算法时,重复博弈将得到近似粗相关均衡。我们还在在线大语言模型路由任务中展示了算法的有效性。
Summary / 总结
This paper investigates online learning when the learner observes only a ranking over proposed actions at each timestep, rather than numeric utility feedback. It considers two ranking mechanisms, one induced by the instantaneous utility and one by the time-average utility, under both full-information and bandit feedback. The authors show that sublinear regret is impossible in general with instantaneous-utility rankings, and also with time-average rankings when the ranking model is relatively deterministic (a Plackett-Luce model with sufficiently small temperature). They then develop algorithms that achieve sublinear regret when the utility sequence has sublinear total variation; for full-information time-average rankings this extra assumption can be dropped. As a consequence, when all players in a normal-form game follow these algorithms, repeated play yields an approximate coarse correlated equilibrium. The algorithms are also evaluated on an online large-language-model routing task.
本文研究学习者仅能获得动作排序反馈、而非数值效用反馈的在线学习问题,考虑了瞬时效用排序和时间平均效用排序两种机制。研究表明,在一般情况下,瞬时效用排序反馈无法实现亚线性遗憾;当排序模型相对确定(即Plackett-Luce模型的温度足够小)时,时间平均效用排序反馈同样无法实现亚线性遗憾。作者进而在效用序列具有亚线性总变差的假设下提出了可实现亚线性遗憾的新算法,使标准式博弈的重复博弈得到近似粗相关均衡。这些算法在在线大语言模型路由任务中也表现有效。
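For readers unfamiliar with the Plackett-Luce model named in the abstract, the sketch below samples a ranking from utilities with a temperature parameter. It illustrates only the feedback model, not the paper's learning algorithms; the function name and the exact parameterization (weights proportional to `exp(utility / temperature)`) are assumptions, chosen because they match the standard form of the model.

```python
import math
import random


def sample_pl_ranking(utilities, temperature, rng):
    """Sample a ranking (best first) of action indices from a Plackett-Luce model.

    Items are drawn one at a time without replacement, each with probability
    proportional to exp(utility / temperature) among the items still left.
    """
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        top = max(utilities[i] for i in remaining)
        # Subtract the current max before exponentiating for numerical stability.
        weights = [
            math.exp((utilities[i] - top) / temperature) for i in remaining
        ]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


if __name__ == "__main__":
    rng = random.Random(0)
    utilities = [0.2, 0.9, 0.5]
    print("low temperature :", sample_pl_ranking(utilities, 1e-3, rng))
    print("high temperature:", sample_pl_ranking(utilities, 10.0, rng))
```

As the temperature shrinks, the sampled ranking concentrates on the exact utility ordering, which is the "relatively deterministic" regime in which the abstract states that sublinear regret is impossible for time-average ranking feedback.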
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
Authors: Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
First: 2026-03-19T17:58:52+00:00 · Latest: 2026-03-19T17:58:52+00:00
Comments: We release the model and data at https://huggingface.co/collections/nvidia/nemotron-cascade-2
Abstract
We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoint and training data.
中文标题/摘要
标题:Nemotron-Cascade 2:利用级联RL与多领域在线策略蒸馏对大语言模型进行后训练
我们介绍了Nemotron-Cascade 2,这是一个开放的30B模型,具有3B激活参数,提供最佳推理能力和强大的代理能力。尽管其体积较小,但在数学和编程推理性能上接近前沿的开放模型。它是继DeepSeekV3.2-Speciale-671B-A37B之后第二个在2025年国际数学奥林匹克(IMO)、国际信息学奥林匹克(IOI)和ICPC世界总决赛中获得金牌水平表现的开放权重大语言模型,显示出极高的智能密度,参数量减少20倍。与Nemotron-Cascade 1相比,关键的技术进步如下。在精心策划的数据集上进行SFT后,我们大幅扩展了级联RL,涵盖了更广泛的推理和代理领域。此外,我们引入了多领域在线策略蒸馏,从每个领域的最强中间教师模型进行蒸馏,整个级联RL过程中,使我们能够高效地恢复基准回归并保持强劲的性能提升。我们发布了模型检查点和训练数据的集合。
Summary / 总结
Nemotron-Cascade 2 is an open 30B MoE model with 3B activated parameters that targets reasoning and agentic capabilities. Per the abstract, it is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to reach Gold Medal-level performance in the 2025 IMO, IOI, and ICPC World Finals, with 20x fewer parameters. Relative to Nemotron-Cascade 1, it expands Cascade RL to a broader range of reasoning and agentic domains after SFT, and adds multi-domain on-policy distillation from the strongest intermediate teacher model for each domain, which helps recover benchmark regressions and sustain gains during training. The model checkpoints and training data are released on Hugging Face.
Nemotron-Cascade 2 是一个30B参数的模型,其中3B参数是激活的,它在推理和行动能力方面表现出色,即使规模较小也能在国际大赛中取得顶级成绩。该模型使用后训练方法结合级联强化学习和多领域在线策略蒸馏来提升推理和行动能力。通过扩展级联强化学习并结合来自强中间教师模型的蒸馏,该模型在各种领域中保持了高水平的表现。
Rethinking Vector Field Learning for Generative Segmentation
Authors: Chaoyang Wang, Yaobo Liang, Boci Peng, Fan Duan, Jingdong Wang, Yunhai Tong
First: 2026-03-19T17:58:19+00:00 · Latest: 2026-03-19T17:58:19+00:00
Abstract
Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
中文标题/摘要
标题:重新思考生成分割中的向量场学习
控制扩散模型进行生成分割已引起越来越多的关注。尽管现有方法主要集中在架构调整或训练启发式方法上,但对连续流匹配目标与离散感知任务之间的内在不匹配仍缺乏理解。在本文中,我们从向量场学习的角度重新审视扩散分割。我们识别出两种常用流匹配目标的关键限制:梯度消失和轨迹穿越,这导致收敛速度慢且类别区分差。为解决这些问题,我们提出了一种原理性的向量场重塑策略,通过附加一个分离的距离感知校正项来增强学习的流场。该校正项引入了吸引和排斥的相互作用,增强了靠近质心处的梯度大小,同时保留了原始的扩散训练框架。此外,我们设计了一种受克罗内克序列启发的高效计算、准随机类别编码方案,该方案与端到端像素神经场框架无缝集成,用于像素级语义对齐。广泛的实验一致表明,与原始流匹配方法相比,显著提高了性能,大幅缩小了生成分割与强大判别专家之间的性能差距。
Summary / 总结
This work revisits diffusion-based generative segmentation from the perspective of vector field learning. It identifies two limitations of the standard flow matching objective, gradient vanishing and trajectory traversing, and proposes a vector field reshaping strategy that adds a detached, distance-aware correction term with attractive and repulsive interactions, increasing gradient magnitudes near centroids. It also introduces a quasi-random category encoding scheme inspired by Kronecker sequences for pixel-level semantic alignment. Experiments show significant improvements over vanilla flow matching and narrow the performance gap with strong discriminative models.
本文从向量场学习的角度重新审视基于扩散的生成分割,指出流匹配目标存在梯度消失和轨迹穿越两个局限,并提出向量场重塑策略:引入分离的距离感知校正项,以增强质心附近的梯度幅度并改善类别分离。此外,设计了一种受克罗内克序列启发的高效准随机类别编码方案,以实现像素级语义对齐。实验表明,该方法显著优于原始流匹配方法,缩小了生成分割与判别式专家模型之间的性能差距。
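The abstract says the category encoding is inspired by Kronecker sequences. A Kronecker sequence maps the integer n to the fractional part of n times an irrational vector, which spreads points evenly over the unit cube without any random sampling. The sketch below shows that construction only; the function name, the three-dimensional code size, and the particular multipliers are assumptions for illustration, and the paper's actual encoding may differ.

```python
import math

# Illustrative irrational multipliers: fractional parts of square roots of
# small primes (an assumption, not the paper's choice).
ALPHAS = [math.sqrt(p) % 1.0 for p in (2, 3, 5)]


def kronecker_code(category, alphas=ALPHAS):
    """Map a non-negative integer category id to a point in [0, 1)^d.

    Coordinate j is the fractional part of category * alphas[j].
    """
    return [math.fmod(category * a, 1.0) for a in alphas]


if __name__ == "__main__":
    for category in range(1, 6):
        print(category, [round(x, 3) for x in kronecker_code(category)])
```

Because the codes are computed in closed form from the category id, no lookup table has to be stored or learned, which is one plausible reading of the "computationally efficient" claim in the abstract.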
LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
Authors: Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang
First: 2026-03-19T17:58:13+00:00 · Latest: 2026-03-19T17:58:13+00:00
Comments: Project page: https://kd-tao.github.io/LVOmniBench/
Abstract
Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.
中文标题/摘要
标题:LVOmniBench:面向多模态LLM的长音频视频理解评估先驱
近年来,多模态大型语言模型(OmniLLM)在理解音频和视频输入方面取得了显著进步。然而,当前的评估主要集中在10秒到5分钟的短音频和视频片段上,未能反映实际应用中的需求,而实际应用中的视频通常持续数十分钟。为解决这一关键缺口,我们引入了LVOmniBench,这是一个专门用于长格式音频和视频跨模态理解的新基准。该数据集包含来自开放平台的高质量视频,这些视频具有丰富的音频-视觉动态。通过严格的手动选择和标注,LVOmniBench 包含275个视频,时长从10分钟到90分钟不等,以及1,014个问答(QA)对。LVOmniBench旨在严格评估OmniLLM在各个领域的能力,包括长期记忆、时间定位、细粒度理解以及多模态感知。我们的广泛评估表明,当前的OmniLLM在处理扩展的音频-视觉输入时面临重大挑战。开源模型通常的准确率低于35%,而Gemini 3 Pro达到的峰值准确率约为65%。我们预计,该数据集以及我们的实证研究结果将激发进一步的研究,并促进开发能够解决长格式音频-视频上下文中的复杂跨模态理解问题的高级模型。
Summary / 总结
LVOmniBench is a new benchmark for evaluating cross-modal comprehension of long-form audio and video by omnimodal large language models (OmniLLMs), addressing the limitations of existing evaluations that focus on clips of 10 seconds to 5 minutes. It includes 275 manually selected and annotated videos ranging from 10 to 90 minutes and 1,014 question-answer pairs. The evaluation shows that current OmniLLMs struggle with long-form inputs: open-source models generally score below 35% accuracy, while Gemini 3 Pro peaks at approximately 65%.
LVOmniBench 是一个用于评估 omnimodal 大型语言模型(OmniLLMs)在长音频视频跨模态理解能力的新基准,包含 275 条时长从 10 到 90 分钟的视频和 1,014 组问题-答案对。评估结果显示,当前的 OmniLLMs 面临显著挑战,开源模型的准确率低于 35%,而 Gemini 3 Pro 达到约 65% 的准确率。
DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising
Authors: Tianjiao Yu, Xinzhuo Li, Muntasir Wahed, Jerry Xiong, Yifan Shen, Ying Shen, Ismini Lourentzou
First: 2026-03-19T17:58:11+00:00 · Latest: 2026-03-19T17:58:11+00:00
Abstract
Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.
中文标题/摘要
标题:DreamPartGen: 基于语义的部件级3D生成通过协作去噪
理解并生成由有意义部件组成的3D对象是人类感知和推理的基础。然而,大多数文本到3D的方法忽略了部件的语义和功能结构。虽然近期的部件感知方法引入了分解,但它们仍然主要集中在几何结构上,缺乏语义基础,无法建模部件如何与文本描述或部件间关系对齐。我们提出了DreamPartGen,一种基于语义的部件感知文本到3D生成框架。DreamPartGen引入了双部件潜在变量(DPLs),联合建模每个部件的几何形状和外观,并引入了关系语义潜在变量(RSLs),捕捉从语言中推导出的部件间依赖关系。同步的联合去噪过程确保了几何和语义的一致性,使3D合成具有连贯性、可解释性和文本对齐性。在多个基准测试中,DreamPartGen在几何保真度和文本形状对齐方面达到了最先进的性能。
Summary / 总结
DreamPartGen is a framework for semantically grounded, part-aware text-to-3D generation. It introduces Duplex Part Latents (DPLs) to model each part's geometry and appearance, and Relational Semantic Latents (RSLs) to capture inter-part dependencies from language. A synchronized co-denoising process ensures mutual geometric and semantic consistency, leading to coherent and text-aligned 3D synthesis. DreamPartGen outperforms existing methods in geometric fidelity and text-shape alignment across multiple benchmarks.
DreamPartGen旨在生成具有语义接地和部分级意识的3D对象,解决了之前文本到3D方法的局限性。它使用双重部分潜变量来建模部分的几何和外观,使用关系语义潜变量来捕捉语言中的部分间依赖关系。同步协同去噪过程确保几何和语义一致性,从而实现连贯、可解释且与文本对齐的3D合成。实验表明,DreamPartGen在几何保真度和文本形状对齐方面超越了现有方法,在多个基准测试中表现出色。
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Authors: Shang-Jui Ray Kuo, Paola Cascante-Bonilla
First: 2026-03-19T17:56:32+00:00 · Latest: 2026-03-19T17:56:32+00:00
Comments: Project page: https://lab-spell.github.io/vlm-ssm-vision-encoders/ ; Code: https://github.com/raykuo18/vlm-ssm-vision-encoders
Abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
中文标题/摘要
标题:VLMs是否需要视觉变换器?评估状态空间模型作为视觉编码器
大型视觉-语言模型(VLMs)通常使用冻结的视觉骨干,其图像特征通过轻量级连接器映射到大型语言模型中。虽然基于变换器的编码器是标准的视觉骨干,但我们询问状态空间模型(SSM)视觉骨干是否可以成为强有力的替代品。我们在受控环境中系统地评估了SSM视觉骨干在VLMs中的表现。在匹配的ImageNet-1K初始化下,SSM骨干在VQA和定位/标注方面实现了最强的整体性能。我们进一步适应了SSM和ViT家族的骨干,并进行了检测或分割训练,发现密集任务调优通常在家族中提高了性能;在此适应后,SSM骨干在较小的模型规模下仍具有竞争力。我们还观察到,(i) 更高的ImageNet准确度或更大的骨干并不一定能可靠地转化为更好的VLM性能,(ii) 一些视觉骨干在定位方面不稳定。基于这些发现,我们提出了稳定策略,以提高两个骨干家族的鲁棒性,并强调SSM骨干作为VLMs中基于变换器视觉编码器的强有力替代品。
Summary / 总结
This study evaluates state space model (SSM) vision backbones in large vision-language models (VLMs), finding that SSMs outperform transformer-based encoders in VQA and grounding/localization tasks under matched ImageNet-1K initialization. After adaptation with detection or segmentation training, SSM backbones remain competitive while being smaller in scale. The research also highlights that higher ImageNet accuracy or larger backbones do not necessarily translate to better VLM performance, and some visual backbones are unstable in localization tasks. These findings suggest SSMs as a strong alternative to transformer-based vision encoders in VLMs.
研究评估了状态空间模型(SSM)在大型视觉-语言模型(VLM)中的应用,发现SSM在VQA和定位/检测任务中优于基于变换器的编码器,尤其是在匹配的ImageNet-1K初始化条件下。经过密集任务适应后,SSM仍具有竞争力且模型规模更小。研究还指出,更高的ImageNet准确度或更大的模型并不一定意味着更好的VLM性能,某些视觉编码器在定位任务中不稳定。研究提出了稳定策略以提高两种编码器家族的鲁棒性,并建议SSM作为VLM中变换器基视觉编码器的强替代方案。
Enhancing Lexicon-Based Text Embeddings with Large Language Models
Authors: Yibin Lei, Tao Shen, Yu Cao, Andrew Yates
Venue: ACL 2025
First: 2025-01-16T18:57:20+00:00 · Latest: 2026-03-19T17:55:24+00:00
Comments: ACL 2025
Abstract
Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first lexicon-based embeddings (LENS) leveraging LLMs that achieve competitive performance on these tasks. LENS consolidates the vocabulary space through token embedding clustering to handle the issue of token redundancy in LLM vocabularies. To further improve performance, we investigate bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexical matching with redundant vocabularies by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact representations with dimensionality comparable to dense counterparts. Furthermore, LENS inherently supports efficient embedding dimension pruning without any specialized objectives like Matryoshka Representation Learning. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).
中文标题/摘要
标题:利用大型语言模型增强基于词典的文本嵌入
近年来,大型语言模型(LLMs)在通用文本嵌入任务中表现出色。尽管密集嵌入在相关研究中占主导地位,我们引入了第一个利用LLMs的基于词典的嵌入(LENS),在这些任务中取得了竞争力的表现。LENS通过词嵌入聚类来整合词汇空间,以处理LLM词汇表中的标记冗余问题。为了进一步提高性能,我们研究了双向注意力和各种池化策略。具体来说,LENS通过将每个维度分配给特定的标记簇来简化与冗余词汇表的词汇匹配,其中语义相似的标记被分组在一起。大量实验表明,LENS在大规模文本嵌入基准测试(MTEB)中优于密集嵌入,提供与密集嵌入相当维度的紧凑表示。此外,LENS固有地支持高效的嵌入维度剪枝,无需像Matryoshka表示学习那样的专门目标。值得注意的是,将LENS与密集嵌入结合使用在MTEB的检索子集(即BEIR)上实现了最先进的性能。
Summary / 总结
This study introduces LENS, lexicon-based text embeddings built on large language models (LLMs). LENS clusters token embeddings to address token redundancy in LLM vocabularies and assigns each embedding dimension to one token cluster, which simplifies lexical matching; bidirectional attention and various pooling strategies are investigated to improve performance further. Experimental results show that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB) and supports efficient embedding dimension pruning. Combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (BEIR).
本文提出了一种名为LENS的基于词典的文本嵌入方法,利用大型语言模型在文本嵌入任务上取得了竞争力的表现。LENS通过聚类词嵌入简化了词典匹配,并使用双向注意力和池化策略来提升性能。实验表明,LENS在MTEB基准上优于密集嵌入,并支持高效的维度修剪。将LENS与密集嵌入结合使用在BEIR检索子集上达到了最先进的性能。
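The cluster-per-dimension idea in the LENS entry above can be illustrated with a minimal sketch. It is not the authors' implementation: the hand-written `CLUSTER_OF` table stands in for a clustering of token embeddings, and max-pooling scores within a cluster is an assumption, since the abstract does not say how per-token scores are aggregated. All names here are hypothetical.

```python
from typing import Dict, List


def lexicon_embedding(token_scores: Dict[str, float],
                      cluster_of: Dict[str, int],
                      n_clusters: int) -> List[float]:
    """Collapse per-token scores into one value per token cluster.

    Assumes non-negative scores and uses max-pooling inside each cluster
    (an illustrative choice, not taken from the paper).
    """
    emb = [0.0] * n_clusters
    for token, score in token_scores.items():
        c = cluster_of[token]
        emb[c] = max(emb[c], score)
    return emb


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product used as the matching score."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical clustering: redundant surface forms share one dimension.
CLUSTER_OF = {"car": 0, "cars": 0, "auto": 0, "dog": 1, "dogs": 1}

query = lexicon_embedding({"car": 0.9, "dog": 0.0}, CLUSTER_OF, 2)
doc_a = lexicon_embedding({"auto": 0.8, "cars": 0.3}, CLUSTER_OF, 2)
doc_b = lexicon_embedding({"dogs": 0.7}, CLUSTER_OF, 2)
```

Because "car", "cars", and "auto" map to the same dimension, the query matches `doc_a` even though the two share no exact token, while `doc_b` scores zero against it.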
RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing
Authors: Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, Lijun Zhang
First: 2026-03-19T17:54:43+00:00 · Latest: 2026-03-19T17:54:43+00:00
Abstract
Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, as well as overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose Representation-Pivoted AutoEncoder, a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.
中文标题/摘要
标题:RPiAE:基于表示的自编码器增强图像生成和编辑
扩散模型已成为图像生成和编辑的主导范式,潜空间扩散模型将去噪移至紧凑的潜空间,以提高效率和可扩展性。最近尝试利用预训练的视觉表示模型作为标记器先验,要么将扩散特征对齐到表示特征,要么直接重用表示编码器作为冻结的标记器。尽管这些方法可以提高生成指标,但由于冻结编码器,它们往往在重建保真度上受到限制,进而影响编辑质量,同时高维潜空间也使得扩散建模变得困难。为解决这些局限性,我们提出了基于表示的自编码器(RPiAE),一种基于表示的标记器,能够同时提高生成和编辑质量。我们引入了表示点正则化,这是一种训练策略,使表示初始化的编码器能够进行重建微调,同时保留预训练表示空间的语义结构,随后通过变分桥梁将潜空间压缩为紧凑空间,以更好地进行扩散建模。我们采用目标解耦的阶段训练策略,依次优化生成可操作性和重建保真度目标。这些组件共同产生了一个能够保留强语义、忠实重建并生成具有降低扩散建模复杂度的潜空间的标记器。实验表明,RPiAE 在文本到图像生成和图像编辑方面优于其他视觉标记器,同时在基于表示的标记器中提供最佳的重建保真度。
Summary / 总结
The paper proposes RPiAE, a Representation-Pivoted AutoEncoder, to enhance both image generation and editing by addressing the limitations of previous approaches. It introduces Representation-Pivot Regularization to fine-tune a representation-initialized encoder for reconstruction while preserving semantic structure, and a variational bridge to compress the latent space. The method uses an objective-decoupled training strategy to optimize generative tractability and reconstruction fidelity. Experiments show that RPiAE outperforms other visual tokenizers in text-to-image generation and image editing, and achieves the best reconstruction fidelity among representation-based tokenizers.
研究旨在通过提出Representation-Pivoted AutoEncoder (RPiAE) 来同时提升图像生成和编辑的质量。方法引入了Representation-Pivot Regularization,以对初始化的表示进行微调以实现重建,同时保留语义结构,随后通过变分桥梁压缩潜在空间。实验表明,RPiAE 在文本到图像生成和图像编辑方面优于其他视觉分词器,并且在基于表示的分词器中实现了最佳的重建保真度。
Tinted Frames: Question Framing Blinds Vision-Language Models
Authors: Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta
First: 2026-03-19T17:53:09+00:00 · Latest: 2026-03-19T17:53:09+00:00
Comments: Preprint. Project page: https://davidhalladay.github.io/tinted_frames_demo/
Abstract
Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended framings, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.
中文标题/摘要
标题:着色边框:问题框架限制了视觉语言模型的视野
视觉语言模型(VLMs)已被证明是盲目的,即使在需要视觉推理的任务中,它们也常常未能充分利用视觉输入。在本研究中,我们展示了VLMs的选择性盲视。它们根据语言框架调整对视觉输入的注意力程度,即使存在其他框架要求相同的视觉推理。通过使用视觉注意力作为探针,我们量化了框架如何改变对图像的关注量及其分布。受限框架,如多项选择和是/否,相比开放式框架,显著降低了对图像上下文的关注,减少了对任务相关区域的关注,并将注意力转移到无信息性标记上。我们进一步证明,这种注意力分配不当是导致准确度下降和跨框架不一致的主要原因。基于这一机制洞察,我们引入了一种轻量级的提示调优方法,使用可学习标记鼓励在开放式设置中观察到的稳健、视觉接地的注意力模式,从而提高视觉接地并改善不同框架下的性能。
Summary / 总结
This study investigates why Vision-Language Models (VLMs) underutilize visual inputs, even on tasks requiring visual reasoning. By analyzing visual attention patterns, the research shows that VLMs adjust their attention based on question framing, leading to less attention on image context for constrained questions like multiple choice or yes/no, compared to open-ended questions. This misallocation of attention is linked to lower accuracy and inconsistency across different framings. The study proposes a prompt-tuning method using learnable tokens to encourage more robust and visually grounded attention, enhancing performance across various question framings.
研究探讨了为什么视觉语言模型(VLMs)在需要视觉推理的任务中未能充分利用视觉输入。通过分析视觉注意力模式,研究人员发现,VLMs会根据问题的表述调整其注意力,即使不同的表述要求相同的视觉推理。研究显示,如多项选择和是非题等受限表述会导致对图像上下文的关注减少,并将注意力转向无关信息,从而导致性能下降。作者提出了一种使用可学习标记的提示调优方法,以鼓励在开放表述环境中观察到的稳健且视觉导向的注意力模式,从而提高视觉定位和不同表述下的性能。
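The "amount of attention over the image" used as a probe in the Tinted Frames entry above can be pictured as the share of attention mass that lands on image tokens. The following is a minimal sketch under that assumption; the paper's exact metric, layers, and heads are not given in the abstract, and the function name and toy numbers are made up for illustration.

```python
from typing import List


def image_attention_share(attn: List[float], is_image: List[bool]) -> float:
    """Fraction of one query token's attention mass that lands on image tokens."""
    if len(attn) != len(is_image):
        raise ValueError("attn and is_image must have the same length")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    on_image = sum(w for w, img in zip(attn, is_image) if img)
    return on_image / total


# Toy inputs (illustrative only, not measurements from the paper):
# one text token followed by three image tokens.
toy_attn = [0.4, 0.2, 0.2, 0.2]
toy_mask = [False, True, True, True]
share = image_attention_share(toy_attn, toy_mask)
```

Computing this share for the same image and question under different framings would give one simple way to compare how much attention each framing directs at the image.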
How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation
Authors: Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang, Zhehuai Chen, Sung-Feng Huang, Chih-Kai Yang, Yi-Cheng Lin, Chi-Yuan Hsiao, Wenze Ren, En-Pei Hu, Yu-Han Huang, An-Yu Cheng, Cheng-Han Chiang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee
First: 2026-03-19T17:50:07+00:00 · Latest: 2026-03-19T17:50:07+00:00
Comments: Project website: https://kehanlu.github.io/AKB
Abstract
Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text-only and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into a Large Audio Language Model (LALM) with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.
中文标题/摘要
标题:大规模语言模型中听觉知识如何塑造音频语言模型:全面评估
大规模语言模型(LLMs)广泛用作大型音频语言模型(LALMs)的知识基础,但它们通过纯文本预训练编码了多少听觉知识以及这对下游性能的影响尚不清楚。我们通过在两种纯文本和一种音频基础设置下比较不同LLMs来研究这一差距:(1) 直接探测AKB-2000,这是一个测试听觉知识广度和深度的精心策划基准;(2) 级联评估,其中LLMs基于音频描述进行推理;(3) 音频基础评估,其中每个LLM通过音频编码器微调为大型音频语言模型(LALM)。我们的研究发现听觉知识在不同家族之间差异很大,纯文本结果与音频性能高度相关。我们的工作为全面理解音频研究中的LLMs提供了实证基础。
Summary / 总结
This study investigates how auditory knowledge in large language model (LLM) backbones influences Large Audio Language Models (LALMs). It compares different LLMs under three settings: direct probing on AKB-2000, cascade evaluation, and audio-grounded evaluation. The research finds that auditory knowledge varies significantly among LLM families, and that results from the text-only settings strongly correlate with audio performance. This work offers empirical insights into LLMs' role in audio research.
研究探讨了大型语言模型(LLM)通过文本-only预训练所编码的听觉知识及其对下游音频语言模型(ALM)性能的影响。研究在三个设置下比较了不同LLM的表现:直接在AKB-2000上进行探针测试、级联评估和基于音频的评估。主要发现包括LLM家族间听觉知识的显著差异以及文本-only性能与音频性能之间的强烈关联。
Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting
Authors: Yiren Lu, Xin Ye, Burhaneddin Yaman, Jingru Luo, Zhexiao Xiong, Liu Ren, Yu Yin
First: 2026-03-19T17:49:43+00:00 · Latest: 2026-03-19T17:49:43+00:00
Comments: Project page at https://vulab-ai.github.io/Splat2BEV/
Abstract
Bird's-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.
中文标题/摘要
标题:重建问题:通过3D 高斯点绘制学习几何对齐的BEV表示
鸟瞰图(BEV)感知是自主驾驶的核心基础,提供了一种统一的空间表示,将周围视图图像融合起来,以实现语义分割、3D目标检测和运动预测等多种下游任务的推理。然而,现有的大多数BEV感知框架采用端到端的训练范式,其中图像特征直接转换到BEV空间,并仅通过下游任务监督进行优化。这种形式将整个感知过程视为黑盒,通常缺乏明确的3D几何理解和可解释性,导致性能不佳。在本文中,我们主张明确的3D表示对于准确的BEV感知很重要,并提出了一种基于3D高斯点绘制的Splat2BEV框架,用于BEV任务。Splat2BEV旨在学习既丰富语义又精确几何的BEV特征表示。我们首先预训练一个高斯生成器,显式地从多视图输入重建3D场景,从而生成几何对齐的特征表示。然后将这些表示投影到BEV空间,作为下游任务的输入。在nuScenes和argoverse数据集上的大量实验表明,Splat2BEV达到了最先进的性能,并验证了将明确的3D重建纳入BEV感知的有效性。
Summary / 总结
This paper addresses the limitations of existing BEV perception frameworks by proposing Splat2BEV, which incorporates explicit 3D reconstruction. Splat2BEV learns geometry-aligned BEV representations through a two-step process: pre-training a Gaussian generator to reconstruct 3D scenes from multi-view inputs, and then projecting these representations into the BEV space for downstream tasks. Experiments on the nuScenes and Argoverse datasets show that Splat2BEV achieves state-of-the-art performance, supporting the value of explicit 3D reconstruction for BEV perception.
本文通过提出Splat2BEV框架,将显式的3D重建引入BEV感知中以改善几何理解。Splat2BEV预先训练一个高斯生成器从多视角输入中重建3D场景,生成几何对齐的特征表示,然后将这些表示投影到BEV空间中。在nuScenes和argoverse数据集上的实验表明,Splat2BEV在现有方法中表现更优,验证了在BEV感知中引入显式3D重建的重要性。
Score Reversal Is Not Free for Quantum Diffusion Models
Authors: Ammar Fayad
First: 2026-03-06T17:16:17+00:00 · Latest: 2026-03-19T17:48:32+00:00
Abstract
Classical reverse diffusion is generated by changing the drift at fixed noise. We show that the quantum version of this principle obeys an exact law with a sharp phase boundary. For Gaussian pure-loss dynamics, the canonical model of continuous-variable decoherence, we prove that the unrestricted instantaneous reverse optimum exhibits a noiseless-to-noisy transition: below a critical squeezing-to-thermal ratio, reversal can be noiseless; above it, complete positivity forces irreducible reverse noise whose minimum cost we determine in closed form. The optimal reverse diffusion is uniquely covariance-aligned and simultaneously minimizes the geometric, metrological, and thermodynamic price of reversal. For multimode trajectories, the exact cost is additive in a canonical set of mode-resolved data, and a globally continuous protocol attains this optimum on every mixed-state interval. If a pure nonclassical endpoint is included, the same pointwise law holds for every $t>0$, but the optimum diverges as $2/t$: exact Gaussian reversal of a pure quantum state is dynamically unattainable. These results establish the exact Gaussian benchmark against which any broader theory of quantum reverse diffusion must be measured.
中文标题/摘要
标题:量子扩散模型中的分数逆转并非免费
经典的逆向扩散通过固定噪声改变漂移生成。我们证明了这一原理的量子版本遵循一个精确的相变定律。对于高斯纯损耗动力学,即连续变量退相干的典范模型,我们证明了无限制的瞬时逆向最优表现出无噪到有噪的转变:在临界压缩比之下,逆转可以无噪;超过它,完全正性迫使不可约的逆向噪声,我们以闭式形式确定了其最小成本。最优逆向扩散唯一地与协方差对齐,并同时最小化逆转的几何、计量和热力学价格。对于多模式轨迹,精确成本在一组模式解析数据中是可加的,且一个全局连续协议在每个混合态区间内达到这一最优。如果包含一个纯非经典的终点,同样的点律在每个$t>0$时都成立,但最优值随着$2/t$发散:精确的高斯逆转一个纯量子态是动态上不可达的。这些结果确立了高斯基准,任何更广泛的量子逆向扩散理论都必须以此为标准进行衡量。
Summary / 总结
The paper explores the reversibility of quantum diffusion models by extending the classical reverse diffusion principle. It demonstrates that the quantum reverse diffusion follows a precise law with a sharp phase boundary. For Gaussian pure-loss dynamics, the study proves that the unrestricted instantaneous reverse optimum can be noiseless below a critical squeezing-to-thermal ratio but becomes noisy above it due to complete positivity constraints, with the minimum cost given in closed form. The optimal reverse diffusion is uniquely covariance-aligned and simultaneously minimizes the geometric, metrological, and thermodynamic costs of reversal. For multimode trajectories, the cost is additive over mode-resolved data, and a globally continuous protocol attains the optimum on every mixed-state interval. If a pure nonclassical endpoint is included, the optimum diverges as $2/t$, indicating that exact Gaussian reversal of a pure quantum state is dynamically unattainable. These findings provide a benchmark for broader theories of quantum reverse diffusion.
论文通过扩展经典反向扩散原理,研究了量子扩散模型的可逆性。研究表明,量子反向扩散遵循精确的定律,并有明确的相变边界。对于高斯纯损耗动力学,研究证明,在特定的挤压-热平衡比以下,反向扩散可以无噪声,但超过该比值则由于完全正性约束而变得有噪声。最优反向扩散是唯一对齐的,并且最小化各种成本。对于多模式轨迹,成本是可加的,全局连续协议可以实现最优反向扩散。如果考虑纯非经典终点,最优反向扩散会随着$2/t$发散,表明纯量子态的精确高斯反向扩散在动力学上是不可能的。这些发现为更广泛的量子反向扩散理论提供了一个基准。
OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards
Authors: Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding
First: 2026-03-19T17:47:47+00:00 · Latest: 2026-03-19T17:47:47+00:00
Abstract
Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.
中文标题/摘要
标题:OS-Themis:通用GUI奖励的可扩展批评框架
强化学习(RL)有潜力提高GUI代理在随机环境中的鲁棒性,但训练对奖励函数的质量非常敏感。现有的奖励方法难以同时实现可扩展性和性能。为了解决这个问题,我们提出了OS-Themis,一种可扩展且准确的多代理批评框架。与单一裁判不同,OS-Themis将轨迹分解为可验证的里程碑,以隔离决策所需的关键证据,并采用审查机制在做出最终裁决前严格审计证据链。为了便于评估,我们进一步引入了OmniGUIRewardBench(OGRBench),这是一个跨平台的GUI结果奖励综合基准,所有评估模型在使用OS-Themis时均达到最佳性能。在AndroidWorld上的广泛实验表明,当用于支持在线RL训练时,OS-Themis可提高10.3%;在自我训练循环中用于轨迹验证和过滤时,可提高6.9%,突显了其推动代理进化的能力。
Summary / 总结
The paper addresses the challenge of training robust GUI agents using Reinforcement Learning by proposing OS-Themis, a scalable multi-agent critic framework. Unlike traditional single-judge methods, OS-Themis decomposes trajectories into verifiable milestones and uses a review mechanism to audit the evidence chain before giving a final verdict. Experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, demonstrating its effectiveness in enhancing agent performance.
论文提出了一种可扩展的多智能体批评框架OS-Themis,以解决使用强化学习训练稳健的GUI代理的问题。与传统的单一裁判方法不同,OS-Themis将轨迹分解为里程碑,并使用审查机制确保证据的准确性后再做决策。在AndroidWorld的实验中,OS-Themis在在线RL训练中提高了10.3%,在轨迹验证和过滤中提高了6.9%,展示了其在提升代理性能方面的有效性。
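The milestone-then-review structure described in the OS-Themis entry above can be sketched as follows. This is a toy illustration, not the authors' framework: the string-matching checks stand in for model-based verifiers, and the rule that evidence must appear in milestone order is an assumption. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Milestone:
    name: str
    check: Callable[[str], bool]  # hypothetical verifier over one trajectory step


def find_evidence(steps: List[str], milestone: Milestone) -> Optional[int]:
    """Return the index of the first step that satisfies the milestone, if any."""
    for i, step in enumerate(steps):
        if milestone.check(step):
            return i
    return None


def review(evidence: List[Optional[int]]) -> bool:
    """Audit the evidence chain: every milestone needs evidence, in order."""
    if any(e is None for e in evidence):
        return False
    return all(a <= b for a, b in zip(evidence, evidence[1:]))


def outcome_reward(steps: List[str], milestones: List[Milestone]) -> float:
    """Binary outcome reward: 1.0 only if the reviewed evidence chain holds."""
    evidence = [find_evidence(steps, m) for m in milestones]
    return 1.0 if review(evidence) else 0.0


milestones = [
    Milestone("settings opened", lambda s: "settings" in s),
    Milestone("wifi toggled", lambda s: "toggle wifi" in s),
]
good = ["open settings", "tap wifi", "toggle wifi on"]
bad = ["open settings", "tap wifi"]
```

Here `outcome_reward(good, milestones)` returns 1.0 and `outcome_reward(bad, milestones)` returns 0.0, because the second trajectory has no evidence for the final milestone.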
iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification
Authors: Zixun Xiong, Gaoyi Wu, Qingyang Yu, Mingyu Derek Ma, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang
Venue: AAAI 2026
First: 2025-11-12T02:30:19+00:00 · Latest: 2026-03-19T17:47:21+00:00
Comments: Accepted by AAAI 2026
Abstract
Given the high cost of large language model (LLM) training from scratch, safeguarding LLM intellectual property (IP) has become increasingly crucial. As the standard paradigm for IP ownership verification, LLM fingerprinting thus plays a vital role in addressing this challenge. Existing LLM fingerprinting methods verify ownership by extracting or injecting model-specific features. However, they overlook potential attacks during the verification process, leaving them ineffective when the model thief fully controls the LLM's inference process. In such settings, attackers may share prompt-response pairs to enable fingerprint unlearning or manipulate outputs to evade exact-match verification. We propose iSeal, the first fingerprinting method designed for reliable verification when the model thief controls the suspected LLM in an end-to-end manner. It injects unique features into both the model and an external module, reinforced by an error-correction mechanism and a similarity-based verification strategy. These components are resistant to verification-time attacks, including collusion-based fingerprint unlearning and response manipulation, backed by both theoretical analysis and empirical results. iSeal achieves 100 percent Fingerprint Success Rate (FSR) on 12 LLMs against more than 10 attacks, while baselines fail under unlearning and response manipulations.
中文标题/摘要
标题:iSeal:加密指纹识别以确保大型语言模型产权验证的可靠性
鉴于从头开始训练大型语言模型(LLM)的成本高昂,保护LLM的知识产权变得越来越重要。作为知识产权所有权验证的标准范式,LLM指纹识别在应对这一挑战中发挥着重要作用。现有的LLM指纹识别方法通过提取或注入模型特定特征来验证所有权,但它们忽略了验证过程中的潜在攻击,使得当模型窃贼完全控制LLM的推理过程时,这些方法无效。在这种情况下,攻击者可能会共享提示-响应对以实现指纹遗忘或操纵输出以逃避精确匹配验证。我们提出了iSeal,这是首个在模型窃贼完全控制疑似LLM的端到端情况下设计的可靠验证方法。它通过在模型和外部模块中注入独特的特征,并结合纠错机制和基于相似性的验证策略,增强了这些组件对验证时攻击的抵抗力,包括基于合谋的指纹遗忘和响应操纵,这些都得到了理论分析和实验结果的支持。iSeal在超过10种攻击下对12种LLM实现了100%的指纹成功率(FSR),而基线方法在遗忘和响应操纵下失败。
Summary / 总结
iSeal is a novel encrypted fingerprinting method designed to ensure reliable ownership verification of large language models (LLMs) even when the model thief fully controls the LLM. It injects unique features into both the model and an external module, incorporating an error-correction mechanism and a similarity-based verification strategy to resist attacks such as collusion-based fingerprint unlearning and response manipulation. Experimental results show that iSeal achieves a 100 percent Fingerprint Success Rate (FSR) on 12 LLMs against over 10 attacks, whereas baseline methods fail under these attacks.
iSeal 是一种新颖的指纹识别方法,用于验证大型语言模型(LLM)的所有权,即使模型窃贼完全控制LLM的推理过程也是如此。它将独特的特征注入到模型和外部模块中,并使用错误校正机制和基于相似性的验证策略来抵御攻击。iSeal 在 12 个 LLM 上实现了 100% 的指纹成功率(FSR),针对超过 10 种攻击,而现有方法在去学习和响应操纵下会失败。
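The contrast between exact-match and similarity-based verification mentioned in the iSeal entry above can be illustrated with a small sketch. It is not iSeal's actual scheme: the character-level similarity from Python's standard `difflib`, the 0.8 threshold, and the fingerprint strings are all assumptions chosen for illustration, and the encryption and error-correction components are omitted.

```python
from difflib import SequenceMatcher


def exact_match(expected: str, response: str) -> bool:
    """Baseline check: accept only a byte-for-byte identical response."""
    return expected == response


def similarity_match(expected: str, response: str, threshold: float = 0.8) -> bool:
    """Accept when the response is close enough to the expected fingerprint."""
    return SequenceMatcher(None, expected, response).ratio() >= threshold


expected = "7f3a9c21e4"    # hypothetical fingerprint response
tampered = "7f3a9c21e5"    # one character altered by the attacker
unrelated = "hello world"  # response from a model without the fingerprint
```

With these toy strings, exact matching rejects the lightly tampered response, while the similarity check still accepts it and rejects the unrelated one.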
Improving RCT-Based Treatment Effect Estimation Under Covariate Mismatch via Calibrated Alignment
Authors: Amir Asiaee, Samhita Pal
First: 2026-03-19T17:43:12+00:00 · Latest: 2026-03-19T17:43:12+00:00
Abstract
Randomized controlled trials (RCTs) are the gold standard for estimating heterogeneous treatment effects, yet they are often underpowered for detecting effect heterogeneity. Large observational studies (OS) can supplement RCTs for conditional average treatment effect (CATE) estimation, but a key barrier is covariate mismatch: the two sources measure different, only partially overlapping, covariates. We propose CALM (Calibrated ALignment under covariate Mismatch), which bypasses imputation by learning embeddings that map each source's features into a common representation space. OS outcome models are transferred to the RCT embedding space and calibrated using trial data, preserving causal identification from randomization. Finite-sample risk bounds decompose into alignment error, outcome-model complexity, and calibration complexity terms, identifying when embedding alignment outperforms imputation. Under the calibration-based linear variant, the framework provides protection against negative transfer; the neural variant can be vulnerable under severe distributional shift. Under sparse linear models, the embedding approach strictly generalizes imputation. Simulations across 51 settings confirm that (i) calibration-based methods are equivalent for linear CATEs, and (ii) the neural embedding variant wins all 22 nonlinear-regime settings with large margins.
中文标题/摘要
标题:通过校准对齐提高基于RCT的治疗效果估计
随机对照试验(RCT)是估计异质治疗效果的金标准,但往往缺乏检测效果异质性的能力。大型观察性研究(OS)可以补充RCT以估计条件平均治疗效果(CATE),但关键障碍是协变量不匹配:两种来源测量不同的、部分重叠的协变量。我们提出了CALM(校准对齐在协变量不匹配下),通过学习将每个来源的特征映射到公共表示空间的嵌入来绕过插补。OS结果模型被转移到RCT嵌入空间并使用试验数据校准,从而保留随机化带来的因果识别。有限样本风险界分解为对齐误差、结果模型复杂性和校准复杂性项,确定了嵌入对齐何时优于插补。在基于校准的线性变体下,该框架提供了防止负面转移的保护;而神经变体在严重分布偏移下可能易受攻击。在稀疏线性模型下,嵌入方法严格泛化插补。跨51种设置的模拟证实了(i) 基于校准的方法对于线性CATE等效,以及(ii) 神经嵌入变体在22种非线性状态下以较大优势胜出。
Summary / 总结
The paper addresses the challenge of estimating heterogeneous treatment effects by aligning data from randomized controlled trials (RCTs) and large observational studies (OS) despite covariate mismatch. It introduces CALM, a method that learns embeddings to map features from both sources into a common space, avoiding imputation. The method transfers OS outcome models to the RCT space and calibrates them using trial data, ensuring causal identification. Simulations show that calibration-based methods perform equivalently for linear CATEs, while the neural embedding variant outperforms in nonlinear settings with significant margins.
论文旨在通过整合RCT和观察性研究(OS)来估计异质治疗效果,尽管存在协变量不匹配的问题。提出了一种名为CALM的方法,通过学习嵌入将两个来源的特征映射到一个共同的空间中,避免了插补。该方法将OS的结果模型转移到RCT的空间,并使用试验数据进行校准。模拟结果显示,对于线性CATE,校准方法表现相当;而在非线性设置中,神经嵌入变体在显著优势下胜出。
This looks like what? Challenges and Future Research Directions for Part-Prototype Models
Authors: Khawla Elhadri, Tomasz Michalski, Adam Wróbel, Jörg Schlötterer, Bartosz Zieliński, Christin Seifert
First: 2025-02-13T14:00:55+00:00 · Latest: 2026-03-19T17:41:24+00:00
Comments: Accepted at the 4th World Conference on eXplainable Artificial Intelligence (XAI-2026)
Abstract
The growing interest in eXplainable Artificial Intelligence (XAI) has stimulated research on models with built-in interpretability, among which part-prototype models are particularly prominent. Part-Prototype Models (PPMs) classify inputs by comparing them to learned prototypes and provide human-understandable explanations of the form "this looks like that". Despite this intrinsic interpretability, PPMs have not yet emerged as a competitive alternative to post-hoc explanation methods. This survey reviews work published between 2019 and 2025 and derives a taxonomy of the challenges faced by current PPMs. The analysis reveals a diverse set of open problems. The main issue concerns the quality and number of learned prototypes. Further challenges include limited generalization across tasks and contexts, as well as methodological shortcomings such as non-standardized evaluation. Five broad research directions are identified: improving predictive performance, developing theoretically grounded architectures, establishing frameworks for human-AI collaboration, aligning models with human concepts, and defining robust metrics and benchmarks for evaluation. The survey aims to stimulate further research and promote intrinsically interpretable models for practical applications. A curated list of the surveyed papers is available at https://github.com/aix-group/ppm-survey.
中文标题/摘要
标题:这看起来像什么?部分原型模型的挑战与未来研究方向
随着对可解释人工智能(XAI)兴趣的增长,研究具有内置可解释性的模型变得活跃起来,其中部分原型模型尤为突出。部分原型模型(PPMs)通过将输入与学习到的原型进行比较来进行分类,并提供易于理解的解释,形式为“这看起来像那个”。尽管具有这种内在的可解释性,PPMs尚未成为后验解释方法的有力竞争者。本文综述了2019年至2025年间发表的工作,并推导出当前PPMs面临的挑战分类。分析揭示了一系列开放性问题。主要问题在于学习原型的质量和数量。进一步的挑战包括在任务和上下文之间有限的一般化能力,以及方法论上的不足,如缺乏标准化的评估。确定了五个广泛的研究方向:提高预测性能、开发理论依据的架构、建立人机协作框架、使模型与人类概念相一致以及定义评估的稳健度量和基准。综述旨在激发进一步的研究,促进适用于实际应用的内在可解释模型。所综述论文的精选列表可在https://github.com/aix-group/ppm-survey/获取。
Summary / 总结
This paper reviews part-prototype models (PPMs) that classify inputs by comparing them to learned prototypes and provide human-understandable explanations. Despite their interpretability, PPMs face challenges such as the quality and number of prototypes, limited generalization, and methodological shortcomings. The study identifies five research directions: improving predictive performance, developing theoretically grounded architectures, enhancing human-AI collaboration, aligning models with human concepts, and defining robust evaluation metrics. The goal is to stimulate further research and promote intrinsically interpretable models for practical applications.
本文回顾了部分原型模型(PPMs),这些模型通过将输入与学习到的原型进行比较来进行分类,并提供易于理解的解释。尽管PPMs具有内在的可解释性,但它们面临诸如原型的质量和数量、泛化能力和方法论缺陷等挑战。研究确定了五个研究方向:提高预测性能、开发理论基础架构、增强人机协作、使模型与人类概念相匹配以及定义稳健的评估指标。这些发现旨在促进进一步的研究,并推动在实际应用中使用内在可解释的模型。
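The "this looks like that" decision rule described above can be illustrated with a minimal sketch. It assumes image patches and prototypes are already embedded as vectors, uses `exp(-squared distance)` as the similarity, and uses a hand-set linear layer; all names (`prototype_similarities`, `classify`, the toy arrays) are hypothetical and not taken from any surveyed model.

```python
import numpy as np

def prototype_similarities(patches, prototypes):
    """Compare every patch with every prototype.

    patches: (P, D) patch embeddings; prototypes: (K, D) learned prototypes.
    Returns the best similarity per prototype and the patch index that produced it.
    """
    # Squared Euclidean distance between each patch and each prototype: (P, K).
    d2 = ((patches[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    sim = np.exp(-d2)  # assumed similarity; 1.0 means an exact match
    return sim.max(axis=0), sim.argmax(axis=0)

def classify(patches, prototypes, class_weights):
    """Score classes as a weighted sum of per-prototype best similarities."""
    best_sim, best_patch = prototype_similarities(patches, prototypes)
    logits = class_weights @ best_sim  # (C, K) @ (K,) -> (C,)
    return int(logits.argmax()), best_sim, best_patch

# Toy input: three 2-D patch embeddings, two prototypes, one prototype per class.
patches = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
prototypes = np.array([[1.0, 0.1], [-3.0, -3.0]])
class_weights = np.eye(2)

pred, best_sim, best_patch = classify(patches, prototypes, class_weights)
print(pred, best_patch.tolist())
```

The explanation comes from `best_patch`: for each prototype it names the input patch that "looks like" it, which is the evidence a user would inspect.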
Box Maze: A Process-Control Architecture for Reliable LLM Reasoning
Authors: Zou Qiang
First: 2026-03-19T17:41:18+00:00 · Latest: 2026-03-19T17:41:18+00:00
Comments: 10 pages, 5 tables, 0 figures. Conceptual architecture with preliminary simulation-based validation
Abstract
Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity. This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions. While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.
中文标题/摘要
标题:盒迷宫:一种可靠的LLM推理过程控制架构
大型语言模型(LLMs)展示了强大的生成能力,但在对抗性提示下仍易出现幻觉和不可靠的推理。现有的安全方法——如基于人类反馈的强化学习(RLHF)和输出过滤——主要在行为层面运作,可能缺乏明确的架构机制来确保推理过程的完整性。 本文提出了盒迷宫框架,这是一种概念性的过程控制架构,将LLM推理分解为三个明确的层次:记忆接地、结构化推理和边界约束。我们进行了初步的基于仿真的评估,涉及跨多个异构LLM系统(DeepSeek-V3、Doubao、Qwen)的边界侵蚀场景。来自n=50个对抗性场景的结果表明,明确的认知控制层可能有助于提高边界维护的一致性,架构约束在对抗条件下将边界失败率从约40%(基线RLHF)降低到低于1%。 当前的验证是基于仿真的,但这些初步结果表明,过程级控制可能为提高大型语言模型推理可靠性提供一个有前景的方向。
Summary / 总结
This paper addresses hallucination and unreliable reasoning in large language models (LLMs) by proposing the Box Maze framework, a conceptual architecture that decomposes LLM reasoning into memory grounding, structured inference, and boundary enforcement layers. In a preliminary simulation-based evaluation of n=50 adversarial scenarios across multiple LLM systems (DeepSeek-V3, Doubao, Qwen), explicit cognitive control layers reduced boundary failure rates from about 40% (baseline RLHF) to below 1%, which suggests that process-level control may improve reasoning consistency.
本文提出了Box Maze框架,通过将LLM的推理过程分解为记忆接地、结构化推理和边界约束三个层次,来解决LLM在生成过程中出现的幻觉和不可靠推理问题。通过在多个LLM系统上的模拟评估,研究发现,明确的认知控制层可以将边界失败率从基线RLHF的约40%降低到低于1%,表明在推理一致性和可靠性方面可能存在改进的空间。
Steering Awareness: Detecting Activation Steering from Within
Authors: Joshua Fonseca Rivera, David Demitri Africa
First: 2025-11-26T13:49:43+00:00 · Latest: 2026-03-19T17:37:06+00:00
Abstract
Activation steering -- adding a vector to a model's residual stream to modify its behavior -- is widely used in safety evaluations as if the model cannot detect the intervention. We test this assumption, introducing steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and what concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection, 71.2% concept identification, and zero false positives on clean inputs. This generalizes to unseen steering vector construction methods when their directions have high cosine similarity to the training distribution but not otherwise, indicating a geometric detector rather than a generic anomaly detector. Surprisingly, detection does not confer resistance; on both factual and safety benchmarks, detection-trained models are consistently more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises not from a localized circuit, but from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction. Activation steering should therefore not be considered an invisible intervention in safety evaluations.
中文标题/摘要
标题:操控意识:从内部检测激活操控
激活操控——向模型的残差流中添加一个向量以修改其行为——在安全性评估中被广泛使用,仿佛模型无法检测到这种干预。我们测试了这一假设,引入了操控意识:模型在其自身前向传播过程中推断出是否注入了操控向量及其所编码的概念的能力。经过微调后,七个指令调优模型在未见过的概念上发展出强大的操控意识;最佳模型达到95.5%的检测率、71.2%的概念识别率,并且在干净输入上没有误报。这在训练分布方向与未见过的操控向量构造方法方向具有高余弦相似度时泛化,表明其为几何检测器而非通用异常检测器。令人惊讶的是,检测能力并未赋予模型抵抗力;在事实性和安全性基准测试中,检测训练模型比其基础版本更易受操控影响。机制上,操控意识并非源自局部电路,而是源自一种分布式变换,逐步将各种注入向量旋转到共享的检测方向。因此,激活操控不应被视为安全性评估中的隐形干预。
Summary / 总结
The study investigates steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and which concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection and 71.2% concept identification, with zero false positives on clean inputs. Detection does not confer resistance, however: on both factual and safety benchmarks, detection-trained models are more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction.
研究探讨了模型是否能够检测激活引导,即向模型的残差流中添加一个向量以改变其行为。经过微调的七个指令调优模型发展出了强大的激活引导意识,准确检测干预达到了95.5%,概念识别准确率为71.2%,且在干净输入上没有误报。这种能力在与训练分布高余弦相似的未见过的引导向量方法上有所体现,表明这是一种几何检测器。有趣的是,经过检测训练的模型在事实和安全基准测试中比基线版本更易受引导影响。
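As a rough illustration of the two ideas in this entry, the sketch below adds a steering vector to a toy residual stream and scores the shift against a fixed direction using cosine similarity. It is a simplification under stated assumptions: the activations are random arrays rather than a real model, the detector is given the clean activations (which a model would not have), and the names `steer` and `detection_score` are hypothetical. It is not the paper's fine-tuned detector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def steer(hidden, direction, alpha):
    """Add alpha * unit(direction) to every token position of a (tokens, d_model) array."""
    return hidden + alpha * unit(direction)

def detection_score(hidden_clean, hidden_obs, detector_dir):
    """Cosine similarity between the mean activation shift and a fixed detector direction."""
    shift = (hidden_obs - hidden_clean).mean(axis=0)
    norm = np.linalg.norm(shift)
    if norm == 0.0:
        return 0.0  # nothing was injected
    return float(shift @ unit(detector_dir) / norm)

hidden = rng.normal(size=(10, d_model))   # toy residual stream
concept = rng.normal(size=d_model)        # direction the detector knows
other = rng.normal(size=d_model)
other = other - (other @ unit(concept)) * unit(concept)  # made orthogonal to concept

score_same = detection_score(hidden, steer(hidden, concept, 4.0), concept)
score_other = detection_score(hidden, steer(hidden, other, 4.0), concept)
score_clean = detection_score(hidden, hidden, concept)
print(round(score_same, 3), round(score_other, 3), score_clean)
```

A detector that only responds along known directions fires for the aligned injection and stays near zero for the orthogonal one, which mirrors the abstract's observation that generalization depends on cosine similarity to the training distribution.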
WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference
Authors: Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang
First: 2026-03-11T16:04:25+00:00 · Latest: 2026-03-19T17:37:01+00:00
Abstract
Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains insufficiently studied. Existing topology inference attempts rely on impractical assumptions, including control over the administrative agent and direct identity queries via jailbreaks, which are easily defeated by basic keyword-based defenses. As a result, prior analyses fail to capture the real-world threat of such attacks. To bridge this realism gap, we propose WebWeaver, an attack framework that infers the complete LLM-MAS topology by compromising only a single arbitrary agent instead of the administrative agent. Unlike prior approaches, WebWeaver relies solely on agent contexts rather than agent IDs, enabling significantly stealthier inference. WebWeaver further introduces a new covert jailbreak-based mechanism and a novel fully jailbreak-free diffusion design to handle cases where jailbreaks fail. Additionally, we address a key challenge in diffusion-based inference by proposing a masking strategy that preserves known topology during diffusion, with theoretical guarantees of correctness. Extensive experiments show that WebWeaver substantially outperforms state-of-the-art (SOTA) baselines, achieving about 60% higher inference accuracy under active defenses with negligible overhead.
中文标题/摘要
标题:WebWeaver:通过隐蔽的基于上下文推理打破LLM多智能体系统的拓扑保密性
通信拓扑是LLM基础的多智能体系统(LLM-MAS)的实用性和安全性的一个关键因素,使其成为一个高价值的知识产权(IP),其保密性研究尚不充分。现有的拓扑推理尝试依赖于不切实际的假设,包括对管理代理的控制和通过越狱直接身份查询,这些假设很容易被基于关键词的基本防御措施击败。因此,之前的分析未能捕捉到此类攻击的实际威胁。为了弥合这种现实差距,我们提出了WebWeaver攻击框架,通过仅破坏一个任意代理而不是管理代理来推断完整的LLM-MAS拓扑。与之前的方案不同,WebWeaver仅依赖于代理上下文而不是代理ID,从而实现更隐蔽的推理。WebWeaver还引入了一种新的隐蔽越狱机制和一种全新的完全无越狱扩散设计,以处理越狱失败的情况。此外,我们通过提出一种掩码策略解决了基于扩散的推理中的一个关键挑战,该策略在扩散过程中保留已知的拓扑结构,并具有正确性的理论保证。广泛的实验表明,WebWeaver在主动防御下显著优于最先进的基线,准确率提高了约60%,且几乎没有额外开销。
Summary / 总结
WebWeaver is an attack framework that infers the complete topology of LLM-based multi-agent systems by compromising a single arbitrary agent, rather than the administrative agent. It relies on agent contexts for stealthy inference and introduces a new covert jailbreak mechanism and a novel diffusion design to handle cases where jailbreaks fail. Experiments show that WebWeaver outperforms state-of-the-art baselines by about 60% in inference accuracy under active defenses with minimal overhead.
WebWeaver 是一种攻击框架,通过攻击单一任意代理而非管理代理来推断 LLM 基础的多代理系统的完整拓扑结构。它依赖于代理上下文进行更隐蔽的推断,并引入了一种隐蔽的 jailbreak 机制和无 jailbreak 的扩散设计。实验表明,WebWeaver 在具有可忽略开销的情况下比最先进的方法具有更高的准确性。
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
Authors: Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu, Mengnan Du, Wei Cheng, Zhaoran Wang, Tianlong Chen, Dimitris N. Metaxas
First: 2026-03-03T18:48:15+00:00 · Latest: 2026-03-19T17:36:27+00:00
Abstract
In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, the farther the shift, the sparser the representations. This sparsity-difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design Sparsity-Guided Curriculum In-Context Learning (SG-ICL), a strategy that explicitly uses representation sparsity to schedule few-shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at https://github.com/MingyuJ666/sparsityLLM.
中文标题/摘要
标题:迁移越大,表示越稀疏:LLM中OOD机制分析
在本研究中,我们探讨了大型语言模型(LLMs)在遇到难度增加的输入时,其内部表示如何适应,这些输入的难度通过分布外(OOD)迁移的程度来量化。我们揭示了一种一致且可量化的现象:随着任务难度的增加,无论是通过更难的推理问题、更长的上下文还是增加答案选项,LLM的最后隐藏状态变得显著稀疏。简而言之,迁移越大,表示越稀疏。这种稀疏性与难度的关系在不同的模型和领域中都是可观察到的,表明语言模型在面对不熟悉或复杂的输入时,会将计算集中在最后隐藏状态的专门子空间中。通过一系列受学习动力学解释控制的分析,我们证明这种稀疏性不是偶然的,而是适应机制,用于在OOD下稳定推理。利用这一见解,我们设计了稀疏引导的上下文内少样本学习(SG-ICL)策略,该策略明确利用表示稀疏性来安排少样本演示,从而显著提高性能。我们的研究为LLM如何内化OOD挑战提供了新的机制性见解。源代码可在以下URL获取:https://github.com/MingyuJ666/sparsityLLM.
Summary / 总结
This study examines how Large Language Models (LLMs) adjust their internal representations as they encounter increasingly difficult inputs, with difficulty measured as the degree of out-of-distribution (OOD) shift. As task difficulty increases, the last hidden states of LLMs become sparser, suggesting that LLMs concentrate computation into specialized subspaces to handle unfamiliar or complex inputs. Controlled analyses indicate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD conditions. The study introduces Sparsity-Guided Curriculum In-Context Learning (SG-ICL), which uses representation sparsity to schedule few-shot demonstrations and improves performance.
研究探讨了大型语言模型(LLMs)在遇到难度增加的输入时,即分布外(out-of-distribution,OOD)偏移时,如何调整其内部表示。研究发现,随着任务难度的增加,LLMs的最后隐藏状态变得更加稀疏,表明LLMs在OOD条件下会将计算集中在特定的子空间中。这种稀疏性不是偶然的,而是适应机制。研究还引入了基于稀疏性的 Curriculum In-Context Learning (SG-ICL) 策略,该策略利用表示稀疏性来安排少量的演示,从而提高性能。研究提供了关于LLMs如何应对OOD挑战的新机制性见解。
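To make "sparser hidden state" concrete, here is a minimal sketch that scores a vector with the Hoyer sparsity measure and orders demonstrations by that score. Both choices are assumptions for illustration: the abstract does not say which sparsity metric the paper uses or how SG-ICL schedules demonstrations, and `hoyer_sparsity` and `order_demonstrations` are hypothetical names.

```python
import numpy as np

def hoyer_sparsity(h):
    """Hoyer sparsity in [0, 1]: 0 for equal-magnitude entries, 1 for a one-hot vector."""
    h = np.asarray(h, dtype=float)
    n = h.size
    if n < 2:
        raise ValueError("need at least two entries")
    l1 = np.abs(h).sum()
    l2 = np.sqrt((h ** 2).sum())
    if l2 == 0.0:
        return 0.0  # all-zero vector: treat as not sparse by convention
    return float((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0))

def order_demonstrations(demos, hidden_states):
    """Sort demonstrations from least to most sparse (assumed easy-to-hard schedule)."""
    scores = [hoyer_sparsity(h) for h in hidden_states]
    order = np.argsort(scores, kind="stable")
    return [demos[i] for i in order], [scores[i] for i in order]

# Toy "last hidden states": dense, partly concentrated, and one-hot.
dense = np.ones(16)
mixed = np.array([4.0, 1.0] + [0.0] * 14)
one_hot = np.eye(16)[3]

demos, scores = order_demonstrations(
    ["hard", "easy", "medium"], [one_hot, dense, mixed]
)
print(demos, [round(s, 3) for s in scores])
```

Under the abstract's claim that harder inputs yield sparser states, sorting by this score gives one plausible curriculum; in practice the hidden states would come from the model's final layer rather than hand-built arrays.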
DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
First: 2025-07-03T16:28:25+00:00 · Latest: 2026-03-19T17:35:34+00:00
Comments: Published in IEEE Transactions on Audio, Speech and Language Processing (TASLP). Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio
Abstract
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets. However, existing LALMs have often suffered from the catastrophic forgetting of the LLM's original abilities. Therefore, balancing knowledge retention and audio perception has become a critical challenge. To address this, we revisit the data construction pipeline and propose a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets, named DeSTA. This approach aims at preserving the LLM's native language proficiency thereby enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.
中文标题/摘要
标题:DeSTA2.5-Audio:通用大型音频语言模型的自我生成跨模态对齐
我们介绍了DeSTA2.5-Audio,这是一种通用的大型音频语言模型(LALM),旨在实现稳健的听觉感知和指令跟随。最近的LALMs通过在大规模音频指令数据集上进行训练,将听觉能力添加到大型语言模型(LLMs)中。然而,现有的LALMs往往遭受了LLMs原始能力的灾难性遗忘。因此,平衡知识保留和听觉感知已成为一个关键挑战。为了解决这个问题,我们重新审视了数据构建管道,并提出了一种自我生成的跨模态对齐策略,其中骨干LLM生成自己的训练目标,称为DeSTA。该方法旨在保留LLM的本源语言能力,从而实现零样本泛化,无需特定任务的调整。我们构建了DeSTA-AQA5M,这是一个包含500万训练样本的大规模、任务无关的数据集,这些样本来自7000小时的音频,涵盖了50个不同的数据集,包括语音、环境声音和音乐。DeSTA2.5-Audio在包括Dynamic-SUPERB、MMAU、SAKURA、Speech-IFEval和VoiceBench在内的多种音频-语言基准测试中实现了最先进的或竞争性的性能。全面的比较研究表明,我们的自我生成策略优于现有的训练策略。我们的研究结果强调了在LALM开发中精心设计的数据构建的重要性,并为构建稳健的通用LALM提供了实用见解。
Summary / 总结
DeSTA2.5-Audio is a general-purpose Large Audio Language Model (LALM) that aims to balance auditory perception with retention of the backbone LLM's language proficiency. It uses a self-generated cross-modal alignment strategy, named DeSTA, in which the backbone LLM generates its own training targets. This approach enables zero-shot generalization without task-specific tuning and achieves state-of-the-art or competitive performance across various audio-language benchmarks. The findings highlight the importance of carefully designed data construction in LALM development.
DeSTA2.5-Audio 是一种旨在平衡听觉感知和语言能力的通用大型音频语言模型。它采用了一种自生成的跨模态对齐策略,即 LLM 生成自己的训练目标 DeSTA,以保留其原始的语言能力。这种方法实现了在各种音频语言基准测试中的领先或竞争力表现。研究结果强调了在 LALM 开发中精心设计数据构造的重要性。
STELLAR: Structure-guided LLM Assertion Retrieval and Generation for Formal Verification
Authors: Saeid Rajabi, Chengmo Yang, Satwik Patnaik
First: 2025-11-28T05:31:07+00:00 · Latest: 2026-03-19T17:33:20+00:00
Comments: Accepted at the 63rd Design Automation Conference (DAC 2026), Long Beach, CA, USA (July 26-29, 2026). 7 pages, 6 figures
Abstract
Formal Verification (FV) relies on high-quality SystemVerilog Assertions (SVAs), but the manual writing process is slow and error-prone. Existing LLM-based approaches either generate assertions from scratch or ignore structural patterns in hardware designs and expert-crafted assertions. This paper presents STELLAR, the first framework that guides LLM-based SVA generation with structural similarity. STELLAR represents RTL blocks as AST structural fingerprints, retrieves structurally relevant (RTL, SVA) pairs from a knowledge base, and integrates them into structure-guided prompts. Experiments show that STELLAR achieves superior syntax correctness, stylistic alignment, and functional correctness, highlighting structure-aware retrieval as a promising direction for industrial FV.
中文标题/摘要
标题:STELLAR:结构引导的大语言模型断言检索与生成在形式验证中的应用
形式验证(FV)依赖于高质量的SystemVerilog断言(SVAs),但手动编写过程缓慢且容易出错。现有的基于大语言模型(LLM)的方法要么从头生成断言,要么忽略硬件设计和专家编写断言的结构模式。本文提出了STELLAR,这是第一个使用结构相似性引导LLM生成SVAs的框架。STELLAR将RTL块表示为AST结构指纹,从知识库中检索结构相关的(RTL,SVA)对,并将它们整合到结构引导的提示中。实验表明,STELLAR在语法正确性、风格对齐和功能正确性方面表现出色,强调结构感知检索是工业FV的一个有前途的方向。
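The retrieve-by-structure idea can be sketched in a few lines. This is a loose analogy only: it fingerprints Python source with the standard `ast` module as a stand-in for RTL, counts node types as the fingerprint, and ranks by cosine similarity. STELLAR's actual fingerprinting, knowledge base, and prompt construction are not described in the abstract, so every name and data item below is hypothetical.

```python
import ast
from collections import Counter
from math import sqrt

def fingerprint(source):
    """Count AST node types, ignoring identifier names and literal values."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def cosine(a, b):
    """Cosine similarity between two node-type count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_src, knowledge_base, top_k=1):
    """Return the top_k (score, assertion) pairs whose design is structurally closest."""
    q = fingerprint(query_src)
    scored = [(cosine(q, fingerprint(src)), assertion) for src, assertion in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical knowledge base of (design snippet, assertion note) pairs.
kb = [
    ("def f(a, b):\n    if a > b:\n        return a\n    return b\n",
     "assertion template for a comparator"),
    ("def g(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n",
     "assertion template for an accumulator"),
]
query = "def h(p, q):\n    if p > q:\n        return p\n    return q\n"

best_score, best_assertion = retrieve(query, kb)[0]
print(best_assertion)
```

Because the fingerprint ignores names, the query matches the comparator entry despite sharing no identifiers with it; the retrieved pair would then be placed into the generation prompt as an example.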
Few-shot Acoustic Synthesis with Multimodal Flow Matching
Authors: Amandine Brunetto
Venue: CVPR 2026
First: 2026-03-19T17:32:06+00:00 · Latest: 2026-03-19T17:32:06+00:00
Comments: To appear at CVPR 2026. 23 pages, 16 figures. Project Page: https://amandinebtto.github.io/FLAC/
Abstract
Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
中文标题/摘要
标题:基于多模态流匹配的少量样本声学合成
生成与场景声学一致的音频对于沉浸式虚拟环境至关重要。近期的神经声学场方法能够实现空间连续的声音渲染,但仍然保持场景特定性,需要密集的音频测量和昂贵的每次环境的训练成本。少量样本的方法提高了跨房间的可扩展性,但仍依赖于多个录音,并且作为确定性方法,无法捕捉在稀疏上下文中场景声学的固有不确定性。我们引入了流匹配声学生成(FLAC),这是一种基于少量场景上下文的概率方法,用于建模给定最小场景上下文时可能的房间冲激响应(RIRs)的分布。FLAC 利用一个通过流匹配目标训练的扩散变换器,在新场景中的任意位置生成 RIRs,条件于空间、几何和声学线索。FLAC 在 AcousticRooms 和 Hearing Anything Anywhere 数据集上的一次样本优于最先进的八次样本基线。为了补充标准的感知度量,我们进一步引入了 AGREE,一种联合声学-几何嵌入,通过检索和分布度量使生成的 RIRs 的几何一致性评估成为可能。这项工作是首次将生成流匹配应用于显式 RIR 合成,确立了稳健和数据高效声学合成的新方向。
Summary / 总结
The research aims to generate acoustically consistent audio for virtual environments from minimal scene context, addressing the limitations of existing deterministic and data-intensive approaches. FLAC, a probabilistic method, uses a diffusion transformer trained with a flow-matching objective to generate room impulse responses (RIRs) at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. In the one-shot setting, it outperforms state-of-the-art eight-shot baselines on the AcousticRooms and Hearing Anything Anywhere datasets. The work also introduces AGREE, a joint acoustic-geometry embedding that enables geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics.
研究旨在开发一种在虚拟环境中使用少量场景信息生成声学一致音频的方法。FLAC是一种概率方法,利用一个通过流匹配目标训练的扩散变换器,根据空间、几何和声学线索为新场景生成房间冲激响应(RIR)。该方法在AcousticRooms和Hearing Anything Anywhere数据集上的单次性能优于现有八次采样基线。此外,引入了AGREE新指标来评估生成RIR的几何一致性,增强对虚拟环境中生成音频的评估。
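For readers unfamiliar with flow matching, the sketch below shows a common form of the objective: interpolate linearly between a noise sample and a data sample, and regress the model's velocity onto their difference. It is an illustration under stated assumptions, not FLAC's implementation: the straight-line path is assumed, scene conditioning is omitted, random vectors stand in for RIRs, and an "oracle" function stands in for the trained diffusion transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted velocity and the path velocity x1 - x0."""
    x_t = interpolate(x0, x1, t)
    target = x1 - x0
    return float(np.mean((velocity_fn(x_t, t) - target) ** 2))

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler updates."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

x1 = rng.normal(size=256)  # stand-in for a target RIR
x0 = rng.normal(size=256)  # noise sample

def oracle(x_t, t):
    """Knows the true velocity; a real system would use a trained network here."""
    return x1 - x0

def zero_velocity(x_t, t):
    """Untrained baseline that predicts no motion."""
    return np.zeros_like(x_t)

loss_oracle = flow_matching_loss(oracle, x0, x1, t=0.3)
loss_zero = flow_matching_loss(zero_velocity, x0, x1, t=0.3)
sample = euler_sample(oracle, x0)
print(loss_oracle, loss_zero > 0.0, np.allclose(sample, x1))
```

Sampling different noise vectors `x0` and integrating the learned velocity field is what makes such a generator probabilistic: each draw yields a different plausible output for the same context.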