arXiv Paper Digest

Snapshot: 20260322_0328

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Authors: Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai

First: 2026-03-19T17:59:58+00:00 · Latest: 2026-03-19T17:59:58+00:00

Comments: 31 pages, 12 figures

Abstract

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.

中文标题/摘要

标题：生成模型了解空间：释放隐含的三维先验以促进场景理解

虽然多模态大型语言模型展示了令人印象深刻的语义能力，但它们往往在空间感知方面存在缺陷，难以进行精细的几何推理和物理动力学处理。现有解决方案通常依赖于显式的三维模态或复杂的几何结构，这些方法受限于数据稀缺性和泛化挑战。在本工作中，我们提出了一种范式转变，通过利用大规模视频生成模型中的隐含空间先验。我们认为，为了合成时空连贯的视频，这些模型会内在地学习稳健的三维结构先验和物理法则。我们引入了VEGA-3D（视频提取生成意识）框架，该框架将预训练的视频扩散模型重新用于潜空间模拟器。通过从中间噪声级别提取时空特征，并通过基于标记的自适应门控融合机制将其与语义表示集成，我们为MLLMs提供了密集的几何线索，而无需显式的三维监督。在三维场景理解、空间推理和具身操作基准测试中的广泛实验表明，我们的方法优于最先进的基线，验证了生成先验为物理世界理解提供了可扩展的基础。代码可在https://github.com/H-EmbodVis/VEGA-3D公开获取。

Summary / 总结

This work addresses the spatial limitations of multimodal large language models by leveraging implicit 3D priors from video generation models. It introduces VEGA-3D, a framework that enhances MLLMs with geometric cues through a token-level adaptive gated fusion mechanism, without requiring explicit 3D supervision. Experiments show that VEGA-3D outperforms state-of-the-art methods in 3D scene understanding, spatial reasoning, and embodied manipulation tasks, validating the effectiveness of generative priors for physical-world understanding.

本文提出VEGA-3D，利用大规模视频生成模型中的隐式3D先验来缓解多模态大语言模型的空间局限性。VEGA-3D通过词元级自适应门控融合将时空特征与语义表示相结合，在无需显式3D监督的情况下为MLLMs补充密集的几何线索。实验表明，VEGA-3D在3D场景理解、空间推理和具身操作任务上优于最先进的基线方法，验证了生成先验对物理世界理解的有效性。
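
The token-level adaptive gated fusion named in the abstract can be pictured with a small sketch. The abstract gives no formula, so the sigmoid gate, the weight vectors `w_sem` and `w_geo`, and the function name `gated_fuse` are assumptions made for illustration, not the VEGA-3D implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(semantic, geometric, w_sem, w_geo, bias=0.0):
    """Fuse two token sequences with one scalar gate per token.

    semantic, geometric: lists of tokens, each token a list of floats
    with the same length in both streams.
    w_sem, w_geo: hypothetical gate weights, one float per feature.
    """
    fused = []
    for s_tok, g_tok in zip(semantic, geometric):
        # The gate logit reads both streams, so every token gets its own gate.
        logit = bias
        logit += sum(w * x for w, x in zip(w_sem, s_tok))
        logit += sum(w * x for w, x in zip(w_geo, g_tok))
        gate = sigmoid(logit)
        # gate -> 1 favors the geometric feature, gate -> 0 the semantic one.
        fused.append([gate * g + (1.0 - gate) * s for s, g in zip(s_tok, g_tok)])
    return fused
```

With all gate weights at zero the gate is 0.5 and the output is the plain average of the two streams; a trained gate would instead shift weight toward whichever stream is more informative for a given token.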

Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Authors: Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu

Venue: CVPR 2026

First: 2026-03-19T17:59:55+00:00 · Latest: 2026-03-19T17:59:55+00:00

Comments: Accepted by CVPR 2026 main track; Code: https://github.com/YuqingWang1029/CubiD


Abstract

Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.

中文标题/摘要

标题：立方离散扩散：高维表示令牌上的离散视觉生成

使用离散令牌进行视觉生成已获得显著关注，因为它能够与语言模型共享统一的令牌预测范式，有望实现无缝的多模态架构。然而，当前的离散生成方法仍然局限于低维潜变量（通常为8-32维），牺牲了理解所需的语义丰富性。虽然高维预训练表示（768-1024维）可以弥补这一差距，但它们的离散生成提出了根本性的挑战。在本文中，我们提出了立方离散扩散（CubiD），这是第一个用于高维表示的离散生成模型。CubiD 在高维离散表示中执行精细粒度的掩码——任何维度在任何位置都可以被掩码并从部分观察中预测。这使模型能够学习丰富的内部和跨空间位置的相关性，生成步骤数固定为 $T$，与特征维度无关，其中 $T \ll hwd$。在ImageNet-256上，CubiD 达到了从900M到3.7B参数的最先进的离散生成效果，具有强大的扩展行为。至关重要的是，我们验证这些离散化令牌保留了原始表示能力，证明了相同的离散令牌可以有效地服务于理解和生成任务。我们希望这项工作能够激发未来研究向统一的多模态架构方向发展。代码可在：https://github.com/YuqingWang1029/CubiD 获取。

Summary / 总结

Cubic Discrete Diffusion (CubiD) is the first discrete generation model for high-dimensional representations, addressing the challenge of generating discrete tokens from high-dimensional pretrained features. By performing fine-grained masking and prediction, CubiD learns rich correlations within and across spatial positions. On ImageNet-256, CubiD achieves state-of-the-art discrete generation performance, scaling effectively from 900M to 3.7B parameters, and the discretized tokens maintain the original representation capabilities for both understanding and generation tasks.

Cubic Discrete Diffusion (CubiD) 旨在解决低维度令牌在捕捉语义丰富性方面的局限性，通过在所有维度上进行精细的掩码和预测来学习丰富的关联性，并保持从900M到3.7B参数的强扩展行为。在ImageNet-256上，CubiD 达到了最先进的结果，并且离散化令牌保留了原始表示能力，支持理解和生成任务。这项工作旨在启发未来统一的多模态架构研究。
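
The fine-grained masking described in the abstract, where any dimension at any position can be masked, can be sketched as below. The sentinel `MASK`, the uniform sampling of entries, and the function name `mask_cube` are assumptions for illustration; the abstract does not specify CubiD's actual masking schedule.

```python
import random

MASK = -1  # hypothetical sentinel id for a masked entry


def mask_cube(tokens, ratio, rng):
    """Mask a fraction of entries in an h x w x d cube of discrete token ids.

    Every (row, col, dim) entry is its own masking candidate, so a spatial
    position can end up only partially observed.
    """
    h, w, d = len(tokens), len(tokens[0]), len(tokens[0][0])
    coords = [(i, j, k) for i in range(h) for j in range(w) for k in range(d)]
    chosen = set(rng.sample(coords, int(round(ratio * len(coords)))))
    masked = [
        [
            [MASK if (i, j, k) in chosen else tokens[i][j][k] for k in range(d)]
            for j in range(w)
        ]
        for i in range(h)
    ]
    return masked, chosen
```

A model trained on such inputs would predict the ids at the `chosen` coordinates from the visible ones. Revealing many entries per step is what lets the step count stay at a fixed $T$ instead of growing with $hwd$.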

MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Authors: Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu

First: 2026-03-19T17:59:52+00:00 · Latest: 2026-03-19T17:59:52+00:00

Comments: Project page: https://lihaitian.com/MonoArt


Abstract

Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

中文标题/摘要

标题：MonoArt：面向单目铰接式3D重建的渐进结构推理

从单张图像重建铰接式3D物体，需要在有限的视觉证据下联合推断物体几何、部件结构和运动参数。关键难点在于运动线索与物体结构相互纠缠，这使得直接回归铰接参数不稳定。现有方法通过多视角监督、基于检索的组装或辅助视频生成来应对这一挑战，但往往牺牲了可扩展性或效率。我们提出了 MonoArt，这是一种基于渐进结构推理的统一框架。MonoArt 不是从图像特征直接预测铰接参数，而是在单一架构内逐步将视觉观察转化为规范几何、结构化部件表示和运动感知嵌入。这种结构化推理过程无需外部运动模板或多阶段流水线，即可实现稳定且可解释的铰接推断。在 PartNet-Mobility 上的大量实验表明，MonoArt 在重建精度和推理速度方面均达到最先进的性能。该框架还可进一步推广到机器人操作和铰接场景重建。

Summary / 总结

MonoArt addresses the challenge of reconstructing articulated 3D objects from a single image by progressively transforming visual observations into canonical geometry and motion-aware embeddings within a unified framework. This method avoids the instability of direct articulation regression and achieves state-of-the-art performance in reconstruction accuracy and inference speed on PartNet-Mobility. The framework also generalizes well to robotic manipulation and articulated scene reconstruction.

MonoArt在统一框架中逐步将视觉观察转换为规范几何、结构化部件表示和运动感知嵌入，以应对从单张图像重建铰接式3D物体的挑战。这种方法避免了直接回归铰接参数的不稳定性，并在PartNet-Mobility上取得了最先进的重建精度和推理速度。该框架还可推广到机器人操作和铰接场景重建。

NavTrust: Benchmarking Trustworthiness for Embodied Navigation

Authors: Huaide Jiang, Yash Chaudhary, Yuping Wang, Zehao Wang, Raghav Sharma, Manan Mehta, Yang Zhou, Lichao Sun, Zhiwen Fan, Zhengzhong Tu, Jiachen Li

First: 2026-03-19T17:59:51+00:00 · Latest: 2026-03-19T17:59:51+00:00

Comments: Project Website: https://navtrust.github.io


Abstract

There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.

中文标题/摘要

标题：NavTrust：具身导航可信度基准评测

具身导航主要分为两类：视觉-语言导航（VLN），其中智能体通过遵循自然语言指令进行导航；以及物体目标导航（OGN），其中智能体导航至指定目标物体。然而，现有工作主要在理想条件下评估模型性能，忽视了真实世界环境中可能出现的输入损坏。为解决这一问题，我们提出了NavTrust，这是一个统一基准，在现实场景中系统地损坏输入模态（包括RGB、深度和指令），并评估其对导航性能的影响。据我们所知，NavTrust是第一个在统一框架中让具身导航智能体面对多种RGB-Depth损坏和指令变化的基准。我们对七种最先进的方法进行了广泛评估，发现它们在现实损坏下的性能显著下降，这突显了关键的鲁棒性差距，并为构建更可信的具身导航系统指明了方向。此外，我们系统地评估了四种不同的缓解策略，以增强对RGB-Depth和指令损坏的鲁棒性。我们的基础模型包括Uni-NaVid和ETPNav。我们将它们部署在真实的移动机器人上，并观察到对损坏的鲁棒性有所提高。项目网站：https://navtrust.github.io

Summary / 总结

NavTrust benchmarks the trustworthiness of embodied navigation systems by systematically corrupting RGB, depth, and instructions in realistic scenarios. It evaluates seven state-of-the-art approaches and finds significant performance degradation under realistic corruptions, highlighting robustness gaps. NavTrust also assesses four mitigation strategies to enhance robustness against corruptions, with improved results observed on a real mobile robot.

NavTrust 在视觉-语言导航（VLN）和物体目标导航（OGN）任务中对 RGB、深度和指令输入施加贴近现实的损坏，以评估具身导航系统的可信度。它评估了七种最先进的方法，发现这些方法在现实损坏下性能显著下降，表明存在关键的鲁棒性差距。NavTrust 还评估了四种不同的缓解策略，并在真实移动机器人上观察到鲁棒性提升。项目网站：https://navtrust.github.io
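
The idea of corrupting each input modality before it reaches the agent can be sketched as follows. The abstract does not list the benchmark's concrete corruption types, so the three below (Gaussian pixel noise, dropped depth readings, dropped instruction words) and all function names are assumptions chosen for illustration. Images are single-channel nested lists to keep the sketch short.

```python
import random


def corrupt_rgb(image, sigma, rng):
    """Add Gaussian noise to 0-255 pixel values and clamp to the valid range."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in image
    ]


def corrupt_depth(depth, drop_prob, rng):
    """Zero out depth readings at random to mimic missing sensor returns."""
    return [[0.0 if rng.random() < drop_prob else z for z in row] for row in depth]


def corrupt_instruction(text, drop_prob, rng):
    """Drop words at random from a navigation instruction."""
    kept = [word for word in text.split() if rng.random() >= drop_prob]
    return " ".join(kept)
```

Sweeping `sigma` or `drop_prob` from zero upward and re-running the same episodes gives a degradation curve per modality, which is the kind of comparison a robustness benchmark reports.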

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Authors: Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang

First: 2026-03-19T17:59:51+00:00 · Latest: 2026-03-19T17:59:51+00:00

Comments: 24 pages, 12 figures


Abstract

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.

中文标题/摘要

标题：SAMA：因子化语义锚定与运动对齐的指令引导视频编辑

当前的指令引导视频编辑模型难以同时平衡精确的语义修改与忠实的运动保留。现有方法依赖于注入显式的外部先验（例如，VLM特征或结构条件）来缓解这些问题，但这种依赖严重限制了模型的鲁棒性和泛化能力。为克服这一限制，我们提出了SAMA（因子化语义锚定与运动对齐），一种将视频编辑分解为语义锚定和运动建模的框架。首先，我们引入了语义锚定，通过联合预测稀疏锚帧的语义标记和视频潜在变量，建立可靠的视觉锚点，实现纯粹基于指令的结构规划。其次，运动对齐在以运动为中心的视频恢复预训练任务（立方体填充、速度扰动和管子打乱）上预训练相同的骨干网络，使模型能够直接从原始视频中内化时间动态。SAMA通过两阶段优化：无配对视频-指令编辑数据的因子化预训练阶段，学习固有的语义-运动表示，随后在配对编辑数据上进行监督微调。值得注意的是，仅因子化预训练就已表现出强大的零样本视频编辑能力，验证了所提出的分解的有效性。SAMA在开源模型中达到了最先进的性能，并且与领先的商业系统（例如Kling-Omni）竞争。代码、模型和数据集将被发布。

Summary / 总结

SAMA addresses the challenge of balancing semantic precision and motion fidelity in instruction-guided video editing by factorizing the process into semantic anchoring and motion modeling. It introduces Semantic Anchoring to create a reliable visual anchor through joint prediction of semantic tokens and video latents at sparse frames, and Motion Alignment to pre-train the model on motion-centric tasks. SAMA is optimized in a two-stage pipeline, with factorized pre-training learning semantic-motion representations and supervised fine-tuning on paired data. The factorized pre-training alone shows strong zero-shot video editing capability, and SAMA achieves state-of-the-art performance among open-source models and is competitive with commercial systems.

SAMA 针对指令引导视频编辑中精确语义修改与忠实运动保留难以兼顾的问题，将视频编辑分解为语义锚定和运动建模。语义锚定通过在稀疏锚帧上联合预测语义标记和视频潜变量来建立可靠的视觉锚点；运动对齐则在以运动为中心的视频恢复任务上预训练同一骨干网络。该框架采用两阶段流水线进行优化，在开源模型中达到最先进的性能，并可与领先的商业系统相媲美。
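
Two of the motion-centric pretext tasks named in the abstract, speed perturbation and tube shuffle, can be illustrated with a toy clip. The abstract does not define these tasks, so the readings below (frame subsampling, and one patch permutation shared by all frames) are assumptions, as are the function names.

```python
import random


def speed_perturb(frames, stride):
    """Keep every `stride`-th frame, so the clip plays `stride` times faster."""
    return frames[::stride]


def tube_shuffle(frames, rng):
    """Permute patch positions, using the same permutation in every frame.

    frames: list of frames, each frame a list of patches.
    Returns the shuffled clip and the permutation a model would have to undo.
    """
    n_patches = len(frames[0])
    perm = list(range(n_patches))
    rng.shuffle(perm)
    shuffled = [[frame[p] for p in perm] for frame in frames]
    return shuffled, perm
```

In a restoration setup the model would see the perturbed clip and be trained to recover the original, which needs no paired video-instruction editing data.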

FinTradeBench: A Financial Reasoning Benchmark for LLMs

Authors: Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta

First: 2026-03-19T17:59:41+00:00 · Latest: 2026-03-19T17:59:41+00:00

Comments: 8 pages main text, 22 pages total (including references and appendix). 5 figures, 14 tables. Preprint under review. Code and data will be made available upon publication


Abstract

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with the advancement of Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning over how company stocks trade in the market or their interactions with fundamentals. To take advantage of the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

中文标题/摘要

标题：FinTradeBench：LLMs的金融推理基准

现实世界的金融决策是一个具有挑战性的问题，需要在公司基本面（来自监管文件的数据）和价格动态计算的交易信号等多种信号之间进行推理。近年来，随着大型语言模型（LLMs）的发展，金融分析师开始使用它们进行金融决策任务。然而，现有的金融问答基准主要侧重于公司资产负债表数据，很少评估公司股票在市场上的交易情况及其与基本面的互动。为了充分利用两种方法的优势，我们引入了FinTradeBench，这是一个结合公司基本面和交易信号的金融推理基准。FinTradeBench包含1400个基于纳斯达克100公司的问题，时间跨度为十年。基准分为三类推理问题：以基本面为主、以交易信号为主和需要跨信号推理的混合问题。为了确保大规模的可靠性，我们采用了校准-扩展框架，结合了专家种子问题、多模型响应生成、模型内自我筛选、数值审计和人-LLM法官对齐。我们在零样本提示和检索增强设置下评估了14个LLM，并观察到了明显的性能差距。检索显著提高了对文本基本面的推理，但在交易信号推理方面提供的帮助有限。这些发现突显了当前LLM在数值和时间序列推理方面的基本挑战，并激发了未来金融智能研究的动力。

Summary / 总结

FinTradeBench is a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals, addressing the limitations of existing benchmarks. It contains 1,400 questions on NASDAQ-100 companies over a ten-year period, categorized into three types: fundamentals-focused, trading-signal-focused, and hybrid. The evaluation involves 14 LLMs under zero-shot and retrieval-augmented settings, revealing a clear performance gap, with retrieval improving reasoning over textual fundamentals but not trading signals. This highlights challenges in numerical and time-series reasoning for current LLMs and motivates future research in financial intelligence.

FinTradeBench 是一个结合公司基本面和交易信号的金融推理基准，旨在弥补现有基准的不足。它包含1,400个问题，覆盖NASDAQ-100公司长达十年的历史数据，分为三类：基本面导向、交易信号导向和混合问题。评估涉及14个LLM，在零样本和检索增强设置下，显示出明显的性能差距，检索在提高文本基本面推理方面效果显著，但在交易信号推理方面效果有限。这突显了当前LLM在数值和时间序列推理方面的挑战，并激发了未来金融智能研究的动力。
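
As a concrete picture of "trading signals computed from price dynamics", the sketch below computes two textbook quantities from a price series. Which signals FinTradeBench actually uses is not stated in the abstract, so the choice of a lookback return and a moving average, and both function names, are assumptions.

```python
def simple_return(prices, lookback):
    """Return over the last `lookback` steps: p_t / p_{t-lookback} - 1."""
    if lookback <= 0 or lookback >= len(prices):
        raise ValueError("lookback must be between 1 and len(prices) - 1")
    return prices[-1] / prices[-1 - lookback] - 1.0


def moving_average(prices, window):
    """Mean of the last `window` prices."""
    if window <= 0 or window > len(prices):
        raise ValueError("window must be between 1 and len(prices)")
    return sum(prices[-window:]) / window
```

Both quantities depend on a numeric series rather than on filing text, which may help explain why the abstract reports that retrieval helps fundamentals questions more than trading-signal questions.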

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

First: 2026-03-19T17:59:21+00:00 · Latest: 2026-03-19T17:59:21+00:00


Abstract

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.

中文标题/摘要

标题：F2LLM-v2：包容性、高性能且高效的多语言嵌入模型

我们提出了F2LLM-v2，这是一种新的通用多语言嵌入模型系列，包含8种不同规模，从80M到14B。该模型基于6000万条高质量公开数据样本的新综合训练而成，支持超过200种语言，特别强调了之前未充分服务的中低资源语言。通过结合两阶段基于LLM的嵌入训练流水线、matryoshka学习、模型剪枝和知识蒸馏技术，我们展示了比之前基于LLM的嵌入模型更高效的模型，同时保持了竞争力。广泛的评估证实，F2LLM-v2-14B在11个MTEB基准测试中排名第一，而该系列中的较小模型也为资源受限的应用设定了新的标准。为了促进开源嵌入模型研究，我们发布了所有模型、数据、代码和中间检查点。
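
Matryoshka learning, mentioned in the abstract, trains embeddings so that a leading prefix of the vector is itself a usable embedding. A minimal sketch of how such a vector might be consumed is shown below; the helper names are hypothetical and nothing here is taken from the F2LLM-v2 code.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("prefix has zero norm")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

A retrieval system could store only the truncated vectors to save memory and then compare them with `cosine`, trading some accuracy for a smaller index.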

Online Learning and Equilibrium Computation with Ranking Feedback

Authors: Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman Ozdaglar, Kaiqing Zhang

First: 2026-03-19T17:59:07+00:00 · Latest: 2026-03-19T17:59:07+00:00


Abstract

Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on numeric utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a ranking over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the instantaneous utility at the current timestep, and rankings induced by the time-average utility up to the current timestep, under both full-information and bandit feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, i.e., under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.

中文标题/摘要

标题：基于排序反馈的在线学习与均衡计算

任意的、可能是对抗性的环境中的在线学习已在序贯决策中得到广泛研究，并与博弈论中的均衡计算密切相关。现有的大多数在线学习算法依赖于环境提供的数值效用反馈，而在人在回路的应用中，这种反馈可能无法获得，也可能受到隐私方面的限制。在本文中，我们研究一种在线学习模型，其中学习者在每个时间步仅能观察到一组候选动作的排序。我们考虑两种排序机制：由当前时间步的瞬时效用诱导的排序，以及由截至当前时间步的时间平均效用诱导的排序，并分别在完全信息反馈和bandit反馈设置下进行研究。采用标准的外部遗憾度量，我们证明在一般情况下，瞬时效用排序反馈无法实现亚线性遗憾。此外，当排序模型相对确定时，即在温度足够小的Plackett-Luce模型下，时间平均效用排序反馈同样无法实现亚线性遗憾。随后，我们在效用序列具有亚线性总变差的额外假设下，提出了能够实现亚线性遗憾的新算法。值得注意的是，对于完全信息下的时间平均效用排序反馈，这一额外假设可以去掉。因此，当标准式博弈中的所有玩家都遵循我们的算法时，重复博弈将得到近似的粗相关均衡。我们还在一个在线大型语言模型路由任务中展示了算法的有效性。

Summary / 总结

This paper studies online learning when the learner observes only a ranking over proposed actions rather than numeric utility feedback. It considers rankings induced by instantaneous utility and by time-average utility, under both full-information and bandit feedback. Sublinear external regret is shown to be impossible in general with instantaneous-utility rankings, and also with time-average-utility rankings when the ranking model is nearly deterministic (Plackett-Luce with a sufficiently small temperature). New algorithms achieve sublinear regret when the utility sequence has sublinear total variation, an assumption that can be dropped for full-information time-average feedback. When all players in a normal-form game follow these algorithms, repeated play yields an approximate coarse correlated equilibrium. The algorithms are also shown to be effective on an online large-language-model routing task.

本文研究学习者仅能观察到候选动作排序、而无法获得数值效用反馈的在线学习问题，并考虑两种排序机制：基于瞬时效用的排序和基于时间平均效用的排序。研究表明，在一般情况下，瞬时效用排序反馈无法实现亚线性遗憾；当排序模型接近确定性（Plackett-Luce模型的温度足够小）时，时间平均效用排序反馈同样无法实现亚线性遗憾。在效用序列具有亚线性总变差的假设下，本文提出的新算法可实现亚线性遗憾，从而在标准式博弈的重复博弈中得到近似的粗相关均衡。这些算法在在线大型语言模型路由任务中也表现出有效性。
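
The ranking feedback in this setting can be simulated with the Plackett-Luce model that the abstract refers to: items are drawn one at a time without replacement, each with probability proportional to exp(utility / temperature). The sketch below is a generic sampler, not the paper's code, and the function name and argument layout are assumptions.

```python
import math
import random


def sample_ranking(utilities, temperature, rng):
    """Draw a ranking (best first) of action indices from a Plackett-Luce model."""
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        top = max(utilities[i] for i in remaining)
        # Subtract the current maximum before exponentiating for stability.
        weights = [math.exp((utilities[i] - top) / temperature) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking
```

As the temperature shrinks, the sampled ranking concentrates on the exact sort order of the utilities, which is the "relatively deterministic" regime in which the abstract states the impossibility result for time-average feedback.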

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Authors: Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

First: 2026-03-19T17:58:52+00:00 · Latest: 2026-03-19T17:58:52+00:00

Comments: We release the model and data at https://huggingface.co/collections/nvidia/nemotron-cascade-2


Abstract

We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoint and training data.

中文标题/摘要

标题：Nemotron-Cascade 2：利用级联RL和多域同策略蒸馏对LLM进行后训练

我们介绍了Nemotron-Cascade 2，这是一个开放的30B MoE模型，激活参数为3B，具备同类最佳的推理能力和强大的智能体能力。尽管规模紧凑，其数学和编程推理性能接近前沿开放模型。它是继DeepSeekV3.2-Speciale-671B-A37B之后，第二个在2025年国际数学奥林匹克（IMO）、国际信息学奥林匹克（IOI）和ICPC世界总决赛中达到金牌水平的开放权重LLM，以少20倍的参数量展现出极高的智能密度。与Nemotron-Cascade 1相比，关键的技术进步如下。在精心策划的数据集上进行SFT后，我们大幅扩展了级联RL，使其覆盖更广泛的推理和智能体领域。此外，我们在整个级联RL过程中引入多域同策略蒸馏，从每个领域最强的中间教师模型进行蒸馏，从而高效地弥补基准性能回退，并持续保持强劲的性能提升。我们发布了模型检查点和训练数据。

Summary / 总结

Nemotron-Cascade 2 is an open 30B MoE model with 3B activated parameters that delivers strong reasoning and agentic capabilities, reaching Gold Medal-level performance in the 2025 IMO, IOI, and ICPC World Finals despite its compact size. Compared with Nemotron-Cascade 1, it expands Cascade RL to a broader range of reasoning and agentic domains and adds multi-domain on-policy distillation from the strongest intermediate teacher models, which helps recover benchmark regressions and sustain performance gains. The model checkpoints and training data are released.

Nemotron-Cascade 2 是一个30B模型，有3B活跃参数，擅长推理和代理任务，在国际大赛中表现出色，尽管规模较小。它使用后训练方法，扩展了 Cascade RL 并引入了多领域在线策略蒸馏，以维持强大的性能。该模型通过覆盖更广泛的领域并高效恢复基准水平，超越了之前的版本。

Rethinking Vector Field Learning for Generative Segmentation

Authors: Chaoyang Wang, Yaobo Liang, Boci Peng, Fan Duan, Jingdong Wang, Yunhai Tong

First: 2026-03-19T17:58:19+00:00 · Latest: 2026-03-19T17:58:19+00:00


Abstract

Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.

中文标题/摘要

标题：重新思考生成分割中的向量场学习

控制扩散模型用于生成分割已引起越来越多的关注。尽管现有方法主要集中在架构调整或训练启发式方法上，但对连续流匹配目标与离散感知任务之间的内在不匹配仍缺乏理解。在本文中，我们从向量场学习的角度重新审视扩散分割。我们识别出两种常用流匹配目标的关键限制：梯度消失和轨迹穿越，这导致了收敛速度慢和类别分离差。为解决这些问题，我们提出了一种原理性的向量场重塑策略，通过附加一个分离的距离感知修正项来增强学习的流场。该修正项引入了吸引和排斥的相互作用，增强了接近质心处的梯度大小，同时保留了原始的扩散训练框架。此外，我们设计了一种由克罗内克序列启发的高效计算、准随机类别编码方案，该方案与端到端像素神经场框架无缝集成，用于像素级语义对齐。广泛的实验一致表明，与原始流匹配方法相比，显著提高了性能，大幅缩小了生成分割与强大判别专家之间的性能差距。

Summary / 总结

This paper revisits diffusion-based generative segmentation from the perspective of vector field learning. The authors identify two limitations of the standard flow matching objective, gradient vanishing and trajectory traversing, which cause slow convergence and poor class separation. They reshape the learned velocity field with a detached distance-aware correction term that adds attractive and repulsive interactions, and they design a quasi-random category encoding scheme inspired by Kronecker sequences for pixel-level semantic alignment. Experiments show significant improvements over vanilla flow matching, substantially narrowing the gap to strong discriminative models.

该研究针对扩散模型在生成分割中的流匹配目标存在的梯度消失和轨迹穿越问题，提出了一种矢量场重塑策略。该方法通过引入距离感知的校正项来增强梯度幅度和改善类别分离，同时保持原始的扩散训练框架。此外，还设计了一种高效的类别编码方案，以促进像素级语义对齐。实验结果表明，该方法显著优于传统的流匹配方法，缩小了生成分割与强判别模型之间的性能差距。
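
A Kronecker sequence places the n-th point at the fractional part of n times a vector of irrational numbers, which spreads points evenly without random sampling. The sketch below uses that construction to give each class id a code in the unit cube. The abstract only says the encoding is inspired by Kronecker sequences, so the choice of square roots of primes and the function name `kronecker_code` are assumptions, not the paper's scheme.

```python
import math

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]


def kronecker_code(class_id, dim):
    """Map a class id to a point in [0, 1)^dim via x_j = frac(class_id * sqrt(p_j))."""
    if dim > len(PRIMES):
        raise ValueError("extend PRIMES to support more dimensions")
    return [(class_id * math.sqrt(p)) % 1.0 for p in PRIMES[:dim]]
```

The codes are cheap to compute on the fly and need no lookup table, and new class ids can be added without re-deriving the codes already assigned.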

LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Authors: Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang

First: 2026-03-19T17:58:13+00:00 · Latest: 2026-03-19T17:58:13+00:00

Comments: Project page: https://kd-tao.github.io/LVOmniBench/


Abstract

Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

中文标题/摘要

标题：LVOmniBench：面向全模态LLM的长音视频理解评测

近年来，多模态大型语言模型（OmniLLMs）在理解音频和视频输入方面取得了显著进步。然而，当前的评估主要集中在10秒到5分钟的短音频和视频片段上，未能反映实际应用中的需求，而实际应用中的视频通常持续数十分钟。为解决这一关键问题，我们引入了LVOmniBench，这是一个专门用于长格式音频和视频跨模态理解的新基准。该数据集包含来自开放平台的高质量视频，这些视频具有丰富的音频-视觉动态。通过严格的手动选择和标注，LVOmniBench 包含275个视频，时长从10分钟到90分钟不等，以及1,014个问答（QA）对。LVOmniBench旨在全面评估OmniLLMs在各个领域的能力，包括长期记忆、时间定位、细粒度理解以及多模态感知。我们的广泛评估表明，当前的OmniLLMs在处理扩展的音频-视觉输入时面临重大挑战。开源模型通常准确率低于35%，而Gemini 3 Pro达到的峰值准确率约为65%。我们预计，该数据集以及我们的实证研究结果将激发进一步的研究，并推动开发能够解决长格式音频-视频上下文中的复杂跨模态理解问题的高级模型。

Summary / 总结

LVOmniBench is a new benchmark for evaluating the cross-modal comprehension of long-form audio and video by omnimodal large language models (OmniLLMs). It includes 275 videos ranging from 10 to 90 minutes and 1,014 question-answer pairs, sourced from open platforms. The evaluation shows that current OmniLLMs struggle with extended audio-visual inputs, with open-source models generally achieving accuracies below 35% and Gemini 3 Pro peaking at approximately 65%.

LVOmniBench 是一个用于评估全模态大型语言模型（OmniLLMs）对长音频和视频的跨模态理解能力的新基准。它包含来自开放平台的 275 条时长从 10 到 90 分钟的视频和 1,014 个问答对。评估结果显示，当前的 OmniLLMs 在处理长时音视频输入时存在困难，开源模型的准确率普遍低于 35%，而 Gemini 3 Pro 的峰值准确率约为 65%。

DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Authors: Tianjiao Yu, Xinzhuo Li, Muntasir Wahed, Jerry Xiong, Yifan Shen, Ying Shen, Ismini Lourentzou

First: 2026-03-19T17:58:11+00:00 · Latest: 2026-03-19T17:58:11+00:00

Abstract

Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part's geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.

中文标题/摘要

标题：DreamPartGen：通过协同潜变量去噪实现语义锚定的部件级3D生成

理解和生成由有意义部件组成的3D对象是人类感知和推理的基础。然而，大多数文本到3D的方法忽略了部件的语义和功能结构。虽然近期的部件感知方法引入了分解，但它们仍然主要集中在几何结构上，缺乏语义基础，无法建模部件如何与文本描述或部件间关系对齐。我们提出DreamPartGen，一种基于语义的部件感知文本到3D生成框架。DreamPartGen引入了双部件潜在变量(DPLs)，联合建模每个部件的几何形状和外观，并引入了关系语义潜在变量(RSLs)，捕捉从语言中推导出的部件间依赖关系。同步的联合去噪过程确保了几何和语义的一致性，使3D合成具有连贯性、可解释性和文本对齐性。在多个基准测试中，DreamPartGen在几何保真度和文本形状对齐方面达到了最先进的性能。

Summary / 总结

DreamPartGen is a framework for semantically grounded, part-aware text-to-3D generation that introduces Duplex Part Latents (DPLs) to model each part's geometry and appearance, and Relational Semantic Latents (RSLs) to capture inter-part dependencies. The synchronized co-denoising process ensures mutual geometric and semantic consistency, leading to coherent and text-aligned 3D synthesis. DreamPartGen outperforms existing methods in geometric fidelity and text-shape alignment across multiple benchmarks.

DreamPartGen旨在解决现有文本到3D方法忽视部件语义和功能结构的局限。它使用双部件潜变量（DPLs）联合建模每个部件的几何和外观，并使用关系语义潜变量（RSLs）捕捉源自语言的部件间依赖关系。该框架采用同步协同去噪过程，以确保几何和语义的相互一致，从而实现连贯、可解释且与文本对齐的3D合成。实验表明，DreamPartGen在多个基准测试中的几何保真度和文本-形状对齐方面优于现有方法。

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Authors: Shang-Jui Ray Kuo, Paola Cascante-Bonilla

First: 2026-03-19T17:56:32+00:00 · Latest: 2026-03-19T17:56:32+00:00

Comments: Project page: https://lab-spell.github.io/vlm-ssm-vision-encoders/ ; Code: https://github.com/raykuo18/vlm-ssm-vision-encoders

Abstract

Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.

中文标题/摘要

标题：VLMs是否需要视觉变换器？评估状态空间模型作为视觉编码器

大型视觉-语言模型（VLMs）通常使用冻结的视觉主干，其图像特征通过轻量级连接器映射到大型语言模型中。虽然基于变换器的编码器是标准的视觉主干，但我们探究状态空间模型（SSM）视觉主干能否成为强有力的替代方案。我们在受控环境中系统地评估了用于VLMs的SSM视觉主干。在匹配的ImageNet-1K初始化下，SSM主干在VQA和指代定位/定位（grounding/localization）任务上均取得了最强的整体性能。我们进一步通过检测或分割训练对SSM和ViT家族的主干进行适配，发现密集任务微调通常能提升各家族的性能；经过这一适配后，SSM主干仍具有竞争力，而模型规模要小得多。我们还观察到，(i) 更高的ImageNet准确率或更大的主干并不能可靠地转化为更好的VLM性能，(ii) 一些视觉主干在定位方面不稳定。基于这些发现，我们提出了稳定化策略，以提高两种主干家族的鲁棒性，并强调SSM主干可作为VLMs中基于变换器的视觉编码器的强有力替代方案。

Summary / 总结

This study investigates the use of state space model (SSM) vision backbones in large vision-language models (VLMs), comparing them to transformer-based encoders in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both visual question answering and grounding/localization tasks. After adaptation with detection or segmentation training, the SSM backbone remains competitive while operating at a substantially smaller model scale. The study also finds that higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and that some visual backbones are unstable in localization, for which it proposes stabilization strategies. The findings suggest that SSM backbones can be a strong alternative to transformer-based vision encoders in VLMs.

该研究探讨了在大型视觉-语言模型（VLMs）中使用状态空间模型（SSM）视觉骨干网络的可能性，将其与基于变换器的编码器进行比较。研究发现，在匹配的ImageNet-1K初始化条件下，SSM骨干网络在视觉问答和指代定位/定位（grounding/localization）任务中均取得了最强的整体性能。此外，经过检测或分割训练的适配后，SSM骨干网络仍保持竞争力，同时模型规模明显更小。研究还指出，更高的ImageNet准确率或更大的骨干网络并不一定能够提高VLM的性能，某些视觉骨干网络在定位任务中不稳定。研究结果表明，SSM骨干网络可以作为VLM中基于变换器的视觉编码器的有力替代方案。

Enhancing Lexicon-Based Text Embeddings with Large Language Models

Authors: Yibin Lei, Tao Shen, Yu Cao, Andrew Yates

Venue: ACL 2025

First: 2025-01-16T18:57:20+00:00 · Latest: 2026-03-19T17:55:24+00:00

Comments: ACL 2025

Abstract

Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first lexicon-based embeddings (LENS) leveraging LLMs that achieve competitive performance on these tasks. LENS consolidates the vocabulary space through token embedding clustering to handle the issue of token redundancy in LLM vocabularies. To further improve performance, we investigate bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexical matching with redundant vocabularies by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact representations with dimensionality comparable to dense counterparts. Furthermore, LENS inherently supports efficient embedding dimension pruning without any specialized objectives like Matryoshka Representation Learning. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).

中文标题/摘要

标题：利用大型语言模型增强基于词典的文本嵌入

近年来，大型语言模型（LLMs）在通用文本嵌入任务中表现出色。尽管密集嵌入在相关研究中占主导地位，我们提出了首个利用LLMs的基于词典的嵌入（LENS），并在这些任务上取得了有竞争力的表现。LENS通过词元嵌入聚类来整合词汇空间，以解决LLM词汇表中的词元冗余问题。为了进一步提高性能，我们研究了双向注意力和各种池化策略。具体来说，LENS将每个维度分配给一个特定的词元簇（语义相似的词元被归为一簇），从而简化了冗余词汇表下的词汇匹配。大量实验表明，LENS在大规模文本嵌入基准测试（MTEB）中优于密集嵌入，并提供维度与密集嵌入相当的紧凑表示。此外，LENS天然支持高效的嵌入维度剪枝，无需像Matryoshka表示学习那样的专门目标。值得注意的是，将LENS与密集嵌入结合使用，在MTEB的检索子集（即BEIR）上实现了最先进的性能。

Summary / 总结

The research enhances lexicon-based text embeddings with large language models (LLMs) to reach competitive performance on general-purpose text embedding tasks. LENS, a lexicon-based embedding method, clusters token embeddings to consolidate the redundant LLM vocabulary, assigns each embedding dimension to one token cluster, and explores bidirectional attention and pooling strategies. Experiments show that LENS outperforms dense embeddings on the MTEB benchmark while keeping comparable dimensionality and supporting dimension pruning without specialized objectives. Combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (BEIR).

本文提出了一种名为LENS的基于词典的文本嵌入方法，利用大型语言模型来解决词汇冗余问题。通过嵌入聚类和使用双向注意力及池化策略，LENS在MTEB基准测试中取得了与稠密嵌入相当甚至更好的性能。此外，LENS支持高效的维度修剪，并且与稠密嵌入结合时，在BEIR检索子集上达到了最先进的性能。
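
The LENS entry above describes assigning each embedding dimension to a token cluster. A minimal Python sketch of that idea follows, assuming the cluster assignment has already been computed; the toy vocabulary, the logits, and the helper names `lexicon_embedding` and `dot` are hypothetical, and max-pooling is only one of the pooling strategies the abstract alludes to, not necessarily the one the authors use.

```python
def lexicon_embedding(token_logits, token_to_cluster, num_clusters):
    """Collapse per-position vocabulary logits into one weight per token cluster.

    token_logits: list of dicts, one per input position, mapping token -> logit.
    token_to_cluster: dict mapping token -> cluster id (assumed precomputed,
        for example by clustering the model's token embeddings).
    """
    embedding = [0.0] * num_clusters
    for position_logits in token_logits:
        for token, logit in position_logits.items():
            cluster = token_to_cluster[token]
            # Max-pool over positions and over tokens sharing a cluster;
            # the 0.0 floor keeps the vector non-negative.
            embedding[cluster] = max(embedding[cluster], logit, 0.0)
    return embedding


def dot(u, v):
    """Plain dot product used as the matching score."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vocabulary: surface variants share one cluster (dimension).
token_to_cluster = {"car": 0, "cars": 0, "Car": 0, "fast": 1, "quick": 1, "tree": 2}

# Made-up logits for a two-position query and a one-position document.
query_logits = [{"car": 2.0, "fast": 1.5}, {"cars": 0.5, "tree": -1.0}]
doc_logits = [{"Car": 1.8, "quick": 1.2}]

q = lexicon_embedding(query_logits, token_to_cluster, 3)
d = lexicon_embedding(doc_logits, token_to_cluster, 3)
score = dot(q, d)
print(q, d, score)
```

In this toy case the query and document share no surface token, yet they match on the two shared clusters (score of about 5.4), which illustrates why merging redundant tokens can simplify lexical matching.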

RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Authors: Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, Lijun Zhang

First: 2026-03-19T17:54:43+00:00 · Latest: 2026-03-19T17:54:43+00:00

Abstract

Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, as well as overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose the Representation-Pivoted AutoEncoder (RPiAE), a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge that compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.

中文标题/摘要

标题：RPiAE：以表示为枢轴的自编码器，同时增强图像生成和编辑

扩散模型已成为图像生成和编辑的主导范式，潜空间扩散模型将去噪移至紧凑的潜空间，以提高效率和可扩展性。最近利用预训练视觉表示模型作为标记器先验的尝试，要么将扩散特征对齐到表示特征，要么直接重用表示编码器作为冻结的标记器。尽管这些方法可以提高生成指标，但由于编码器被冻结，它们的重建保真度往往受限，进而降低编辑质量；同时过高维度的潜变量也使扩散建模变得困难。为了解决这些限制，我们提出了表示枢轴自编码器（RPiAE），这是一种基于表示的标记器，可以同时提高生成和编辑质量。我们引入了表示枢轴正则化（Representation-Pivot Regularization），这是一种训练策略，使由表示模型初始化的编码器能够针对重建进行微调，同时保留预训练表示空间的语义结构；随后通过变分桥将潜空间压缩为紧凑空间，以更好地进行扩散建模。我们采用目标解耦的分阶段训练策略，依次优化生成易处理性和重建保真度目标。这些组件共同产生了一个能够保留强语义、忠实重建并生成扩散建模复杂度更低的潜变量的标记器。实验表明，RPiAE 在文本到图像生成和图像编辑方面优于其他视觉标记器，同时在基于表示的标记器中提供最佳的重建保真度。

Summary / 总结

The paper proposes RPiAE, a representation-pivoted autoencoder that enhances both image generation and editing. It introduces Representation-Pivot Regularization to fine-tune a representation-initialized encoder for reconstruction while preserving semantic structure, followed by a variational bridge to compress the latent space. Experiments show that RPiAE outperforms other visual tokenizers in text-to-image generation and image editing, and achieves the best reconstruction fidelity among representation-based tokenizers.

论文提出了RPiAE，一种以表示为枢轴的自编码器，能够同时提升图像生成和编辑的质量。它引入了表示枢轴正则化，使由表示模型初始化的编码器能够针对重建进行微调，同时保留语义结构，随后通过变分桥压缩潜空间。实验表明，RPiAE在文本到图像生成和图像编辑方面优于其他视觉标记器，并且在基于表示的标记器中实现了最佳的重建保真度。

Tinted Frames: Question Framing Blinds Vision-Language Models

Authors: Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta

First: 2026-03-19T17:53:09+00:00 · Latest: 2026-03-19T17:53:09+00:00

Comments: Preprint. Project page: https://davidhalladay.github.io/tinted_frames_demo/

Abstract

Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Compared to open-ended framings, constrained framings such as multiple choice and yes/no induce substantially lower attention to image context, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.

中文标题/摘要

标题：着色框：问题框架使视觉-语言模型失明

视觉-语言模型（VLMs）已被证明是失明的，即使在需要视觉推理的任务中，它们也经常未能充分利用视觉输入。在本研究中，我们展示了VLMs是选择性失明的。它们根据语言框架调整对视觉输入的注意力程度，即使存在其他框架要求相同的视觉推理。通过使用视觉注意力作为探针，我们量化了框架如何改变对图像的关注量及其分布。受限的框架，如多项选择和是/否，相比开放式框架，显著降低了对图像上下文的关注，减少了对任务相关区域的关注，并将注意力转移到无信息性标记上。我们进一步证明，这种注意力分配不当是导致准确度下降和跨框架不一致的主要原因。基于这一机制洞察，我们引入了一种轻量级的提示调优方法，使用可学习标记来鼓励在开放式设置中观察到的稳健、视觉接地的注意力模式，从而提高视觉接地并改善不同框架下的性能。

Summary / 总结

This study shows that Vision-Language Models (VLMs) are selectively blind: they modulate how much attention they pay to visual inputs according to linguistic framing, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, the research shows that constrained framings such as multiple choice and yes/no lead to less attention on image context, reduced focus on task-relevant regions, and a shift towards uninformative tokens. The study also demonstrates that this misallocation of attention is the main cause of reduced accuracy and inconsistency across framings. A prompt-tuning method using learnable tokens is introduced to encourage the robust, visually grounded attention patterns seen in open-ended settings, improving performance across framings.

研究探讨了为什么视觉语言模型（VLMs）在不同语言框架下对视觉输入的选择性忽视。通过分析视觉注意力模式，研究发现，如多项选择或是/否这类受限框架会导致模型对图像上下文的关注较少，而更多地关注无信息性的标记，相比之下，开放式问题则不然。研究证明，这种注意力分配的错误是导致模型性能下降和不同框架间不一致的主要原因。为解决这一问题，研究提出了一种使用可学习标记的提示调优方法，以促进在开放式设置中观察到的稳健的视觉定位，从而提高模型在各种问题框架下的表现。
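
The Tinted Frames entry above uses visual attention as a probe of how much a model looks at the image under each framing. A minimal sketch of one such measurement follows, assuming a single row of attention weights over the input tokens and a mask marking which tokens come from the image; the function name `image_attention_share` and all numbers are made-up illustrations, not the paper's metric or results.

```python
def image_attention_share(attention, is_image_token):
    """Return the fraction of total attention mass that lands on image tokens."""
    total = sum(attention)
    if total == 0:
        raise ValueError("attention weights sum to zero")
    on_image = sum(w for w, is_img in zip(attention, is_image_token) if is_img)
    return on_image / total


# Hypothetical layout: three image tokens followed by two text tokens.
is_image = [True, True, True, False, False]

# Toy attention rows for the same question under two framings (made-up values).
open_ended = [0.30, 0.25, 0.15, 0.20, 0.10]
multiple_choice = [0.10, 0.05, 0.05, 0.30, 0.50]

share_open = image_attention_share(open_ended, is_image)
share_mc = image_attention_share(multiple_choice, is_image)
print(f"open-ended: {share_open:.2f}, multiple choice: {share_mc:.2f}")
```

Comparing this share across framings of the same visual question is one simple way to quantify the amount of attention on the image; the distribution over task-relevant regions would need region annotations that this sketch does not model.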

How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Authors: Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang, Zhehuai Chen, Sung-Feng Huang, Chih-Kai Yang, Yi-Cheng Lin, Chi-Yuan Hsiao, Wenze Ren, En-Pei Hu, Yu-Han Huang, An-Yu Cheng, Cheng-Han Chiang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee

First: 2026-03-19T17:50:07+00:00 · Latest: 2026-03-19T17:50:07+00:00

Comments: Project website: https://kehanlu.github.io/AKB

Abstract

Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into an LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.

中文标题/摘要

标题：大规模语言模型中听觉知识如何塑造音频语言模型：全面评估

大规模语言模型（LLMs）广泛用作大型音频语言模型（LALMs）的知识基础，但它们通过纯文本预训练编码了多少听觉知识以及这对下游性能有何影响仍不清楚。我们通过在两种纯文本和一种音频基础设置下比较不同LLMs来研究这一差距：（1）直接探测AKB-2000，这是一个测试听觉知识广度和深度的精心策划基准；（2）级联评估，其中LLMs基于音频描述进行推理；（3）音频基础评估，其中每个LLM通过音频编码器微调为大型音频语言模型（LALM）。我们的研究发现听觉知识在不同家族之间差异很大，纯文本结果与音频性能高度相关。我们的工作为全面理解音频研究中的LLMs提供了实证基础。

Summary / 总结

This study investigates how auditory knowledge in large language model (LLM) backbones influences the performance of Large Audio Language Models (LALMs). By comparing different LLMs under two text-only settings (direct probing on AKB-2000 and cascade evaluation) and one audio-grounded setting, the research finds that auditory knowledge varies substantially across LLM families, and that text-only evaluation results strongly correlate with audio performance. The work offers empirical insights into the role of auditory knowledge in LLMs for audio research.

研究探讨了大型语言模型（LLM）通过纯文本预训练编码的听觉知识及其对下游大型音频语言模型（LALM）性能的影响。研究在三个设置下比较了不同LLM：直接在AKB-2000上进行探测、级联评估和音频基础评估。主要发现包括LLM家族间听觉知识的显著差异，以及纯文本结果与音频性能之间的强相关性。

Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting

Authors: Yiren Lu, Xin Ye, Burhaneddin Yaman, Jingru Luo, Zhexiao Xiong, Liu Ren, Yu Yin

First: 2026-03-19T17:49:43+00:00 · Latest: 2026-03-19T17:49:43+00:00

Comments: Project page at https://vulab-ai.github.io/Splat2BEV/

Abstract

Bird's-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

中文标题/摘要

标题：重建至关重要：通过3D高斯点绘制学习几何对齐的BEV表示

鸟瞰图（BEV）感知是自动驾驶的核心基础，提供了一种统一的空间表示，将周围视图图像融合起来，以实现语义分割、3D目标检测和运动预测等多种下游任务的推理。然而，现有的大多数BEV感知框架采用端到端的训练范式，其中图像特征直接转换到BEV空间，并仅通过下游任务监督进行优化。这种形式将整个感知过程视为一个黑箱，往往缺乏明确的3D几何理解和可解释性，导致性能不佳。在本文中，我们主张明确的3D表示对于准确的BEV感知至关重要，并提出了一种基于3D高斯点绘制的Splat2BEV框架，用于BEV任务。Splat2BEV旨在学习语义丰富且几何精确的BEV特征表示。我们首先预训练一个高斯生成器，从多视图输入中显式重建3D场景，从而生成几何对齐的特征表示。然后将这些表示投影到BEV空间，作为下游任务的输入。在nuScenes和Argoverse数据集上的大量实验表明，Splat2BEV达到了最先进的性能，并验证了将明确的3D重建纳入BEV感知的有效性。

Summary / 总结

This paper addresses the limitations of existing BEV perception frameworks by proposing Splat2BEV, which incorporates explicit 3D reconstruction. Splat2BEV first pre-trains a Gaussian generator that reconstructs 3D scenes from multi-view inputs, producing geometry-aligned feature representations that are then projected into the BEV space for downstream tasks. Experiments on the nuScenes and Argoverse datasets show that Splat2BEV achieves state-of-the-art performance and validate the importance of explicit 3D reconstruction in BEV perception.

本文提出了Splat2BEV框架，通过显式重建3D场景生成几何对齐的特征表示来解决现有BEV感知框架的局限性。该方法首先预训练一个高斯生成器，从多视图输入中重建3D场景，然后将这些表示投影到BEV空间以供下游任务使用。在nuScenes和Argoverse数据集上的实验表明，Splat2BEV在性能上优于现有方法，验证了在BEV感知中融入显式3D重建的有效性。

Score Reversal Is Not Free for Quantum Diffusion Models

Authors: Ammar Fayad

First: 2026-03-06T17:16:17+00:00 · Latest: 2026-03-19T17:48:32+00:00

Abstract

Classical reverse diffusion is generated by changing the drift at fixed noise. We show that the quantum version of this principle obeys an exact law with a sharp phase boundary. For Gaussian pure-loss dynamics, the canonical model of continuous-variable decoherence, we prove that the unrestricted instantaneous reverse optimum exhibits a noiseless-to-noisy transition: below a critical squeezing-to-thermal ratio, reversal can be noiseless; above it, complete positivity forces irreducible reverse noise whose minimum cost we determine in closed form. The optimal reverse diffusion is uniquely covariance-aligned and simultaneously minimizes the geometric, metrological, and thermodynamic price of reversal. For multimode trajectories, the exact cost is additive in a canonical set of mode-resolved data, and a globally continuous protocol attains this optimum on every mixed-state interval. If a pure nonclassical endpoint is included, the same pointwise law holds for every $t>0$, but the optimum diverges as $2/t$: exact Gaussian reversal of a pure quantum state is dynamically unattainable. These results establish the exact Gaussian benchmark against which any broader theory of quantum reverse diffusion must be measured.

中文标题/摘要

标题：量子扩散模型中的分数逆转并非免费

经典的逆向扩散是在噪声固定的情况下通过改变漂移项生成的。我们证明了这一原理的量子版本遵循一条具有尖锐相边界的精确定律。对于高斯纯损耗动力学，即连续变量退相干的典范模型，我们证明了无限制的瞬时逆向最优解表现出从无噪到有噪的转变：在临界压缩-热噪比之下，逆转可以无噪；超过它，完全正性迫使出现不可约的逆向噪声，我们以闭式形式确定了其最小成本。最优逆向扩散唯一地与协方差对齐，并同时最小化逆转的几何、计量和热力学代价。对于多模轨迹，精确成本在一组典范的模式分辨数据上是可加的，且一个全局连续的协议在每个混合态区间上都能达到这一最优值。如果包含一个纯非经典端点，同样的逐点定律对每个$t>0$都成立，但最优值按$2/t$发散：对纯量子态的精确高斯逆转在动力学上不可实现。这些结果确立了精确的高斯基准，任何更广泛的量子逆向扩散理论都必须以此为标准进行衡量。

Summary / 总结

The study explores the principle of reverse diffusion in quantum systems, showing that while classical reverse diffusion changes the drift at fixed noise, the quantum version obeys an exact law with a sharp phase boundary. For Gaussian pure-loss dynamics, the research proves that reversal can be noiseless below a critical squeezing-to-thermal ratio but becomes irreducibly noisy above it, with a minimum cost given in closed form. The optimal reverse diffusion is uniquely covariance-aligned and simultaneously minimizes the geometric, metrological, and thermodynamic costs of reversal. For multimode trajectories, the cost is additive, and a globally continuous protocol achieves the optimum on every mixed-state interval. When a pure nonclassical endpoint is included, the optimum diverges as $2/t$, so exact Gaussian reversal of a pure quantum state is dynamically unattainable.

研究探讨了量子扩散模型的可逆性，比较了经典和量子逆向扩散原理。研究表明，量子逆向扩散存在一个尖锐的相边界。对于高斯纯损耗动力学，逆转在临界压缩-热噪比以下可以无噪声，但超过该比值则由于完全正性而不可避免地产生噪声。最优的逆向扩散唯一地与协方差对齐，并且可以同时最小化几何、计量和热力学代价。对于多模轨迹，成本是可加的，全局连续协议可以在每个混合态区间内达到最优。研究还表明，纯量子态的精确高斯逆转在动力学上是不可实现的，从而为更广泛的量子逆向扩散理论设定了基准。

OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Authors: Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding

First: 2026-03-19T17:47:47+00:00 · Latest: 2026-03-19T17:47:47+00:00

Abstract

Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.

中文标题/摘要

标题：OS-Themis：通用GUI奖励的可扩展批评框架

强化学习（RL）有潜力提高GUI代理在随机环境中的鲁棒性，但训练对奖励函数的质量非常敏感。现有的奖励方法难以同时实现可扩展性和性能。为了解决这个问题，我们提出了OS-Themis，一种可扩展且准确的多代理批评框架。与单一裁判不同，OS-Themis将轨迹分解为可验证的里程碑，以隔离决策所需的关键证据，并采用审查机制在做出最终裁决前严格审计证据链。为了便于评估，我们进一步引入了OmniGUIRewardBench（OGRBench），这是一个跨平台的GUI结果奖励综合基准，所有被评估的模型在使用OS-Themis时均达到最佳性能。在AndroidWorld上的广泛实验表明，OS-Themis用于支持在线RL训练时带来10.3%的提升，用于自训练循环中的轨迹验证和过滤时带来6.9%的增益，这突显了其推动代理进化的潜力。

Summary / 总结

The paper introduces OS-Themis, a scalable multi-agent critic framework that provides reward signals for training GUI agents in stochastic environments. Unlike single-judge approaches, OS-Themis decomposes trajectories into verifiable milestones and uses a review mechanism to audit the evidence chain before issuing a final verdict. On AndroidWorld, OS-Themis yields a 10.3% improvement when used to support online RL training and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, indicating its potential to enhance agent performance.

论文提出了OS-Themis，这是一种可扩展的多代理批评框架，用于为随机环境中GUI代理的训练提供奖励信号。与传统的单一裁判方法不同，OS-Themis将轨迹分解为可验证的里程碑，并使用审查机制在做出最终裁决前审计证据链。在AndroidWorld上的实验显示，该框架用于支持在线RL训练时带来10.3%的提升，用于自训练循环中的轨迹验证和过滤时带来6.9%的增益，表明其有潜力提升代理性能。

iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification

Authors: Zixun Xiong, Gaoyi Wu, Qingyang Yu, Mingyu Derek Ma, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang

Venue: AAAI 2026

First: 2025-11-12T02:30:19+00:00 · Latest: 2026-03-19T17:47:21+00:00

Comments: Accepted by AAAI 2026

Abstract

Given the high cost of large language model (LLM) training from scratch, safeguarding LLM intellectual property (IP) has become increasingly crucial. As the standard paradigm for IP ownership verification, LLM fingerprinting thus plays a vital role in addressing this challenge. Existing LLM fingerprinting methods verify ownership by extracting or injecting model-specific features. However, they overlook potential attacks during the verification process, leaving them ineffective when the model thief fully controls the LLM's inference process. In such settings, attackers may share prompt-response pairs to enable fingerprint unlearning or manipulate outputs to evade exact-match verification. We propose iSeal, the first fingerprinting method designed for reliable verification when the model thief controls the suspected LLM in an end-to-end manner. It injects unique features into both the model and an external module, reinforced by an error-correction mechanism and a similarity-based verification strategy. These components are resistant to verification-time attacks, including collusion-based fingerprint unlearning and response manipulation, backed by both theoretical analysis and empirical results. iSeal achieves 100 percent Fingerprint Success Rate (FSR) on 12 LLMs against more than 10 attacks, while baselines fail under unlearning and response manipulations.

中文标题/摘要

标题：iSeal：加密指纹识别以确保大型语言模型产权验证的可靠性

鉴于从头开始训练大型语言模型（LLM）的成本高昂，保护LLM的知识产权变得越来越重要。作为知识产权所有权验证的标准范式，LLM指纹识别在应对这一挑战中发挥着关键作用。现有的LLM指纹识别方法通过提取或注入模型特定特征来验证所有权，但它们忽略了验证过程中的潜在攻击，使得当模型窃贼完全控制LLM的推理过程时，这些方法无效。在这种情况下，攻击者可能会共享提示-响应对以实现指纹遗忘或操纵输出以逃避精确匹配验证。我们提出了iSeal，这是首个在模型窃贼完全控制疑似LLM的端到端场景下实现可靠验证的指纹识别方法。它通过在模型和外部模块中注入独特的特征，并结合错误校正机制和基于相似性的验证策略，增强了这些组件对验证时攻击的抵抗力，包括基于合谋的指纹遗忘和响应操纵，这些都得到了理论分析和实验结果的支持。iSeal在超过10种攻击下对12种LLM实现了100%的指纹成功率（FSR），而基线方法在遗忘和响应操纵下失败。

Summary / 总结

iSeal is a new fingerprinting method designed to verify the ownership of large language models (LLMs) even when the model thief fully controls the LLM's inference process. It injects unique features into both the model and an external module, incorporating an error-correction mechanism and a similarity-based verification strategy to resist attacks. iSeal achieves a 100 percent Fingerprint Success Rate (FSR) on 12 LLMs against over 10 attacks, whereas existing methods fail under unlearning and response manipulation.

iSeal 是一种新颖的指纹识别方法，用于在模型窃贼完全控制 LLM 推理过程的情况下实现可靠的大型语言模型 (LLM) 所有权验证。它将独特的特征注入到模型和外部模块中，并结合了错误校正机制和基于相似性的验证策略，以抵御攻击。iSeal 在 12 个 LLM 上针对超过 10 种攻击实现了 100% 的指纹成功率 (FSR)，而现有方法在指纹遗忘和响应操纵攻击下会失败。
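
The iSeal entry above contrasts exact-match verification with a similarity-based strategy. The sketch below is not iSeal's algorithm; it only illustrates, under stated assumptions, why a similarity threshold can tolerate light response manipulation where exact matching fails. The fingerprint string, the example responses, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace so trivial edits do not matter."""
    return " ".join(text.lower().split())


def exact_match(expected, response):
    """Strict verification: any change to the response breaks the match."""
    return expected == response


def similarity_match(expected, response, threshold=0.8):
    """Accept when the normalized strings are similar enough.

    The threshold is a hypothetical value chosen for this toy example.
    """
    ratio = SequenceMatcher(None, normalize(expected), normalize(response)).ratio()
    return ratio >= threshold


expected = "ZX-7741-ALPHA"              # hypothetical fingerprint response
tampered = "zx-7741-alpha."             # lightly manipulated output
unrelated = "I cannot help with that."  # output unrelated to the fingerprint

exact_tampered = exact_match(expected, tampered)
similar_tampered = similarity_match(expected, tampered)
similar_unrelated = similarity_match(expected, unrelated)
print(exact_tampered, similar_tampered, similar_unrelated)
```

Here the lightly edited response fails the exact check but passes the similarity check, while the unrelated response is rejected. A real scheme would also need the error-correction and encryption components that the abstract mentions and this sketch omits.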

Improving RCT-Based Treatment Effect Estimation Under Covariate Mismatch via Calibrated Alignment

Authors: Amir Asiaee, Samhita Pal

First: 2026-03-19T17:43:12+00:00 · Latest: 2026-03-19T17:43:12+00:00

Abstract

Randomized controlled trials (RCTs) are the gold standard for estimating heterogeneous treatment effects, yet they are often underpowered for detecting effect heterogeneity. Large observational studies (OS) can supplement RCTs for conditional average treatment effect (CATE) estimation, but a key barrier is covariate mismatch: the two sources measure different, only partially overlapping, covariates. We propose CALM (Calibrated ALignment under covariate Mismatch), which bypasses imputation by learning embeddings that map each source's features into a common representation space. OS outcome models are transferred to the RCT embedding space and calibrated using trial data, preserving causal identification from randomization. Finite-sample risk bounds decompose into alignment error, outcome-model complexity, and calibration complexity terms, identifying when embedding alignment outperforms imputation. Under the calibration-based linear variant, the framework provides protection against negative transfer; the neural variant can be vulnerable under severe distributional shift. Under sparse linear models, the embedding approach strictly generalizes imputation. Simulations across 51 settings confirm that (i) calibration-based methods are equivalent for linear CATEs, and (ii) the neural embedding variant wins all 22 nonlinear-regime settings with large margins.

中文标题/摘要

标题：通过校准对齐改进协变量不匹配下基于RCT的治疗效果估计

随机对照试验（RCT）是估计异质治疗效果的金标准，但在检测效果异质性方面往往统计功效不足。大型观察性研究（OS）可以补充RCT以估计条件平均治疗效果（CATE），但关键障碍是协变量不匹配：两种来源测量的协变量不同，仅部分重叠。我们提出了CALM（协变量不匹配下的校准对齐），通过学习将每个来源的特征映射到公共表示空间的嵌入来绕过插补。OS结果模型被迁移到RCT嵌入空间并使用试验数据校准，从而保留随机化带来的因果识别。有限样本风险界分解为对齐误差、结果模型复杂度和校准复杂度三项，从而指出嵌入对齐何时优于插补。在基于校准的线性变体下，该框架能够防止负迁移；而神经变体在严重分布偏移下可能表现脆弱。在稀疏线性模型下，嵌入方法严格推广了插补。覆盖51种设置的模拟证实：（i）对于线性CATE，基于校准的各方法表现等效；（ii）神经嵌入变体在全部22种非线性设置中均以较大优势胜出。

Summary / 总结

The paper addresses the challenge of covariate mismatch between randomized controlled trials (RCTs) and observational studies (OS) in estimating heterogeneous treatment effects. It introduces CALM (Calibrated ALignment under covariate Mismatch), a method that learns embeddings to map features from both sources into a common space, avoiding imputation. The method transfers OS outcome models to the RCT embedding space and calibrates them using trial data, preserving causal identification from randomization. The calibration-based linear variant protects against negative transfer, whereas the neural variant can be vulnerable under severe distributional shift. Simulations across 51 settings show that calibration-based methods are equivalent for linear CATEs and that the neural embedding variant wins all 22 nonlinear-regime settings by large margins.

论文旨在通过整合RCT和观察性研究（OS）来估计异质性治疗效果，并解决两者之间协变量不匹配的问题。提出了一种名为CALM的方法，通过学习嵌入将两个来源的特征映射到一个共同的空间中，避免了插补。该方法将OS的结果模型迁移到RCT嵌入空间，并使用试验数据进行校准，保留随机化带来的因果识别。基于校准的线性变体能够防止负迁移，而神经变体在严重分布偏移下可能表现脆弱。模拟结果显示，基于校准的方法在线性CATE上表现等效，神经嵌入变体则在全部22种非线性设置中以较大优势胜出。
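
The CALM entry above describes calibrating a transferred OS outcome model on trial data. The sketch below shows one way such a calibration step could look, assuming the OS model's CATE predictions on the RCT units are already available and treatment is randomized with a known probability `p`. It uses the standard transformed-outcome trick and a one-variable least-squares fit; this is an illustrative choice and not necessarily the estimator used in the paper, and the function names and data are hypothetical.

```python
def transformed_outcome(y, t, p):
    """Pseudo-outcome whose conditional mean equals the CATE when treatment
    is randomized with known probability p."""
    return y * (t - p) / (p * (1.0 - p))


def fit_linear_calibration(os_cate, pseudo):
    """Least-squares fit of pseudo ~ a + b * os_cate; returns (a, b)."""
    n = len(os_cate)
    mean_x = sum(os_cate) / n
    mean_y = sum(pseudo) / n
    sxx = sum((x - mean_x) ** 2 for x in os_cate)
    if sxx == 0:
        # Constant OS prediction: fall back to the RCT average effect.
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(os_cate, pseudo))
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Toy RCT rows (made up): (OS CATE prediction, treatment indicator, outcome).
rct = [(1.0, 1, 3.0), (1.0, 0, 1.0), (2.0, 1, 5.0), (2.0, 0, 1.0)]
p = 0.5  # known randomization probability

os_preds = [x for x, _, _ in rct]
pseudo = [transformed_outcome(y, t, p) for _, t, y in rct]
a, b = fit_linear_calibration(os_preds, pseudo)
calibrated = [a + b * x for x in os_preds]
print(a, b, calibrated)
```

In this toy data the OS model understates the effect by a factor of two, and the calibration recovers a slope of 2 with zero intercept. When the OS predictions carry no signal, the fitted slope shrinks toward zero and the estimate falls back to the trial's average effect, which is one intuition for how calibration can limit negative transfer.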

This looks like what? Challenges and Future Research Directions for Part-Prototype Models

Authors: Khawla Elhadri, Tomasz Michalski, Adam Wróbel, Jörg Schlötterer, Bartosz Zieliński, Christin Seifert

First: 2025-02-13T14:00:55+00:00 · Latest: 2026-03-19T17:41:24+00:00

Comments: Accepted at the 4th World Conference on eXplainable Artificial Intelligence (XAI-2026)

Abstract

The growing interest in eXplainable Artificial Intelligence (XAI) has stimulated research on models with built-in interpretability, among which part-prototype models are particularly prominent. Part-Prototype Models (PPMs) classify inputs by comparing them to learned prototypes and provide human-understandable explanations of the form "this looks like that". Despite this intrinsic interpretability, PPMs have not yet emerged as a competitive alternative to post-hoc explanation methods. This survey reviews work published between 2019 and 2025 and derives a taxonomy of the challenges faced by current PPMs. The analysis reveals a diverse set of open problems. The main issue concerns the quality and number of learned prototypes. Further challenges include limited generalization across tasks and contexts, as well as methodological shortcomings such as non-standardized evaluation. Five broad research directions are identified: improving predictive performance, developing theoretically grounded architectures, establishing frameworks for human-AI collaboration, aligning models with human concepts, and defining robust metrics and benchmarks for evaluation. The survey aims to stimulate further research and promote intrinsically interpretable models for practical applications. A curated list of the surveyed papers is available at https://github.com/aix-group/ppm-survey.

中文标题/摘要

标题：这看起来像什么？部分原型模型的挑战与未来研究方向

随着对可解释人工智能(XAI)兴趣的增长，研究具有内置可解释性的模型变得活跃起来，其中部分原型模型尤为突出。部分原型模型(PPMs)通过将输入与学习到的原型进行比较来进行分类，并提供易于理解的解释，形式为“这看起来像那个”。尽管具有这种内在的可解释性，PPMs尚未成为后验解释方法的有力竞争者。本文综述了2019年至2025年间发表的工作，并推导出当前PPMs面临的挑战分类。分析揭示了一系列开放性问题。主要问题在于学习原型的质量和数量。进一步的挑战包括在任务和上下文之间有限的一般化能力，以及方法论上的不足，如缺乏标准化的评估。确定了五个广泛的研究方向：提高预测性能、开发理论依据的架构、建立人机协作框架、使模型与人类概念相一致以及定义评估的稳健度量和基准。综述旨在激发进一步的研究，并促进适用于实际应用的内在可解释模型。所综述论文的精选列表可在https://github.com/aix-group/ppm-survey/获取。

Summary / 总结

This survey reviews the challenges and future research directions for Part-Prototype Models (PPMs), which classify inputs by comparing them to learned prototypes and provide human-understandable explanations of the form "this looks like that". Covering work published between 2019 and 2025, it derives a taxonomy of challenges: chiefly the quality and number of learned prototypes, along with limited generalization across tasks and contexts and methodological shortcomings such as non-standardized evaluation. Five research directions are identified: improving predictive performance, developing theoretically grounded architectures, establishing frameworks for human-AI collaboration, aligning models with human concepts, and defining robust metrics and benchmarks for evaluation. The survey aims to stimulate further research on intrinsically interpretable models for practical applications.

本文回顾了部分原型模型（PPMs）面临的挑战及其未来研究方向，PPMs通过将输入与学习到的原型进行比较来进行分类，并提供易于理解的解释。尽管具有内在可解释性，但PPMs仍面临原型质量与数量、泛化能力有限以及方法论不足等问题。研究确定了五个关键的研究方向：提高预测性能、开发理论基础架构、建立人机协作框架、使模型与人类概念相一致以及定义评估的稳健指标。该调查旨在促进进一步的研究，并推广用于实际应用的内在可解释模型。

Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Authors: Zou Qiang

First: 2026-03-19T17:41:18+00:00 · Latest: 2026-03-19T17:41:18+00:00

Comments: 10 pages, 5 tables, 0 figures. Conceptual architecture with preliminary simulation-based validation

Abstract

Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches -- such as reinforcement learning from human feedback (RLHF) and output filtering -- primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity. This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions. While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.

中文标题/摘要

标题：盒迷宫：一种可靠的LLM推理过程控制架构

大型语言模型（LLMs）展示了强大的生成能力，但在对抗性提示下仍易出现幻觉和不可靠的推理。现有的安全方法——如基于人类反馈的强化学习（RLHF）和输出过滤——主要在行为层面运作，可能缺乏明确的架构机制来确保推理过程的完整性。本文提出了盒迷宫框架，这是一种概念性的过程控制架构，将LLM推理分解为三个明确的层次：记忆接地、结构化推理和边界约束。我们进行了初步的基于仿真的评估，涉及跨多个异构LLM系统（DeepSeek-V3、Doubao、Qwen）的边界侵蚀场景。来自n=50个对抗性场景的结果表明，明确的认知控制层可能有助于提高边界维护的一致性，架构约束将边界失败率从基线RLHF的约40%降低到对抗性条件下的不到1%。当前的验证基于仿真，但这些初步结果表明，过程级控制可能为提高大型语言模型推理可靠性提供一个有前景的方向。

Summary / 总结

This paper addresses hallucination and unreliable reasoning in large language models (LLMs) under adversarial prompting by proposing the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into memory grounding, structured inference, and boundary enforcement layers. In a preliminary simulation-based evaluation across multiple LLM systems (DeepSeek-V3, Doubao, Qwen), results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with boundary failure rates falling from about 40% (baseline RLHF) to below 1% under adversarial conditions.

论文针对大型语言模型（LLMs）在对抗性提示下容易出现幻觉和不可靠推理的问题，提出了Box Maze框架，将LLM推理分解为记忆接地、结构化推理和边界约束三个层次。基于n=50个对抗性场景的初步仿真结果显示，该过程控制架构在多种LLM系统中将边界失败率从约40%（基线RLHF）降低到低于1%，表明过程级控制可能有助于提高边界维护的一致性。

Steering Awareness: Detecting Activation Steering from Within

Authors: Joshua Fonseca Rivera, David Demitri Africa

First: 2025-11-26T13:49:43+00:00 · Latest: 2026-03-19T17:37:06+00:00

Abstract

Activation steering -- adding a vector to a model's residual stream to modify its behavior -- is widely used in safety evaluations as if the model cannot detect the intervention. We test this assumption, introducing steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and what concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection, 71.2% concept identification, and zero false positives on clean inputs. This generalizes to unseen steering vector construction methods when their directions have high cosine similarity to the training distribution but not otherwise, indicating a geometric detector rather than a generic anomaly detector. Surprisingly, detection does not confer resistance; on both factual and safety benchmarks, detection-trained models are consistently more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises not from a localized circuit, but from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction. Activation steering should therefore not be considered an invisible intervention in safety evaluations.

Summary / 总结

The study investigates steering awareness: a model's ability to infer, during its own forward pass, that a steering vector was injected and which concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts, with the best model reaching 95.5% detection and 71.2% concept identification, with zero false positives on clean inputs. Detection generalizes to unseen steering-vector construction methods only when their directions have high cosine similarity to the training distribution, indicating a geometric detector rather than a generic anomaly detector. Detection does not confer resistance: on both factual and safety benchmarks, detection-trained models are more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction.

研究探讨了引导意识（steering awareness），即模型在自身前向传播过程中推断是否被注入了引导向量及其所编码概念的能力。经过微调后，七个指令调优模型在未见过的概念上发展出了强大的引导意识，其中表现最好的模型检测率达到95.5%，概念识别率达到71.2%，且在干净输入上没有误报。只有当未见过的引导向量构造方法的方向与训练分布具有较高余弦相似度时，检测能力才能泛化，表明这是一种几何检测器而非通用异常检测器。然而，检测能力并不能带来抵抗力：在事实和安全基准测试中，经过检测训练的模型比其基础模型更容易受到引导的影响。从机制上看，引导意识源于一种分布式变换，它将各种注入向量逐步旋转到共享的检测方向。
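
The intervention described in the abstract, adding a vector to the residual stream, and the reported dependence of detection on cosine similarity can be illustrated with a toy sketch. Everything here is an assumption for illustration: the function names, the threshold, and especially the detector's access to a clean reference activation, which a real model does not have during its own forward pass. It is not the authors' method.

```python
import math

def add_steering(residual, steering_vec, alpha=1.0):
    """Activation steering: return residual + alpha * steering_vec."""
    if len(residual) != len(steering_vec):
        raise ValueError("dimension mismatch")
    return [r + alpha * s for r, s in zip(residual, steering_vec)]

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def detect_steering(clean, observed, detection_dir, threshold=0.5):
    """Toy geometric detector (hypothetical): flag the input when the shift
    observed - clean aligns with a fixed detection direction."""
    delta = [o - c for o, c in zip(observed, clean)]
    return cosine(delta, detection_dir) > threshold

# Usage: a shift aligned with the detection direction is flagged,
# an orthogonal shift and an unmodified input are not.
clean = [1.0, 0.0, 0.0]
direction = [0.0, 1.0, 0.0]
aligned = add_steering(clean, [0.0, 2.0, 0.0], alpha=0.5)
orthogonal = add_steering(clean, [0.0, 0.0, 1.0])
print(detect_steering(clean, aligned, direction))
print(detect_steering(clean, orthogonal, direction))
print(detect_steering(clean, clean, direction))
```

The orthogonal case mirrors, in miniature, why a detector tied to one direction would not act as a generic anomaly detector.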

WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference

Authors: Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang

First: 2026-03-11T16:04:25+00:00 · Latest: 2026-03-19T17:37:01+00:00

Abstract

Communication topology is a critical factor in the utility and safety of LLM-based multi-agent systems (LLM-MAS), making it a high-value intellectual property (IP) whose confidentiality remains insufficiently studied. Existing topology inference attempts rely on impractical assumptions, including control over the administrative agent and direct identity queries via jailbreaks, which are easily defeated by basic keyword-based defenses. As a result, prior analyses fail to capture the real-world threat of such attacks. To bridge this realism gap, we propose WebWeaver, an attack framework that infers the complete LLM-MAS topology by compromising only a single arbitrary agent instead of the administrative agent. Unlike prior approaches, WebWeaver relies solely on agent contexts rather than agent IDs, enabling significantly stealthier inference. WebWeaver further introduces a new covert jailbreak-based mechanism and a novel fully jailbreak-free diffusion design to handle cases where jailbreaks fail. Additionally, we address a key challenge in diffusion-based inference by proposing a masking strategy that preserves known topology during diffusion, with theoretical guarantees of correctness. Extensive experiments show that WebWeaver substantially outperforms state-of-the-art (SOTA) baselines, achieving about 60% higher inference accuracy under active defenses with negligible overhead.

中文标题/摘要

标题：WebWeaver：通过隐蔽的基于上下文推理打破LLM多智能体系统的拓扑保密性

通信拓扑是基于LLM的多智能体系统（LLM-MAS）的实用性和安全性的一个关键因素，使其成为一个高价值的知识产权（IP），其保密性研究尚不充分。现有的拓扑推理尝试依赖于不切实际的假设，包括对管理代理的控制和通过越狱直接进行身份查询，这些做法很容易被基于关键词的基本防御措施击败。因此，之前的分析未能捕捉到此类攻击的实际威胁。为了弥合这种现实差距，我们提出了一种名为WebWeaver的攻击框架，通过仅攻击一个任意代理而不是管理代理来推断完整的LLM-MAS拓扑。与之前的方案不同，WebWeaver仅依赖于代理上下文而不是代理ID，从而实现更隐蔽的推理。WebWeaver还引入了一种新的隐蔽越狱机制和一种全新的完全无越狱扩散设计，以处理越狱失败的情况。此外，我们通过提出一种掩码策略解决了基于扩散的推理中的一个关键挑战，该策略在扩散过程中保留已知的拓扑结构，并具有正确性的理论保证。广泛的实验表明，WebWeaver在主动防御下显著优于最先进的（SOTA）基线，推理准确率提高了约60%，且几乎没有额外开销。

Summary / 总结

WebWeaver is an attack framework that infers the complete topology of LLM-based multi-agent systems by compromising a single arbitrary agent, rather than the administrative agent. It relies on agent contexts for stealthy inference and introduces a covert jailbreak mechanism and a novel diffusion design to handle cases where jailbreaks fail. Experiments show that WebWeaver outperforms state-of-the-art baselines by about 60% in inference accuracy under active defenses with minimal overhead.

WebWeaver 是一种攻击框架，通过攻击一个任意代理而非管理代理来推断基于 LLM 的多智能体系统的完整拓扑结构。它依赖于代理上下文进行隐蔽推断，并引入了一种隐蔽的越狱（jailbreak）机制和一种新型的扩散设计来处理越狱失败的情况。实验表明，WebWeaver 在面对主动防御时的推断准确率比最先进的基线高出约 60%，且几乎没有额外开销。

Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs

Authors: Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu, Mengnan Du, Wei Cheng, Zhaoran Wang, Tianlong Chen, Dimitris N. Metaxas

First: 2026-03-03T18:48:15+00:00 · Latest: 2026-03-19T17:36:27+00:00

Abstract

In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, the farther the shift, the sparser the representations. This sparsity-difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design Sparsity-Guided Curriculum In-Context Learning (SG-ICL), a strategy that explicitly uses representation sparsity to schedule few-shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at the URL: https://github.com/MingyuJ666/sparsityLLM.

中文标题/摘要

标题：迁移越大，表示越稀疏：LLM的OOD机制分析

在本研究中，我们探讨了大型语言模型（LLMs）在遇到难度不断增加的输入时，其内部表示如何适应，输入难度以分布外（OOD）偏移的程度来量化。我们揭示了一个一致且可量化的现象：随着任务难度的增加，无论是通过更难的推理问题、更长的上下文还是增加答案选项，LLMs的最后隐藏状态都会变得显著更稀疏。简而言之，偏移越大，表示越稀疏。这种稀疏性与难度的关系在不同的模型和领域中都可以观察到，表明语言模型在面对不熟悉或复杂的输入时，会将计算集中到最后隐藏状态的专门子空间中。通过一系列对照分析并结合学习动力学解释，我们证明这种稀疏性不是偶然的，而是一种在OOD下稳定推理的适应机制。利用这一见解，我们设计了稀疏性引导的课程式上下文学习（SG-ICL）策略，该策略明确利用表示稀疏性来安排少样本示范，从而显著提高性能。我们的研究为LLMs如何内化OOD挑战提供了新的机制性见解。源代码可在以下网址获取：https://github.com/MingyuJ666/sparsityLLM

Summary / 总结

This study examines how Large Language Models (LLMs) adjust their internal representations when faced with increasingly difficult inputs, measured by the degree of out-of-distribution (OOD) shift. It reveals that as task difficulty increases, the last hidden states of LLMs become sparser, indicating that the models concentrate computation in specialized subspaces to stabilize reasoning under OOD conditions. The research demonstrates that this sparsity is an adaptive mechanism and proposes Sparsity-Guided Curriculum In-Context Learning (SG-ICL) to enhance performance by scheduling few-shot demonstrations based on representation sparsity. The study provides new insights into how LLMs handle OOD challenges.

研究探讨了大型语言模型（LLMs）在遇到难度逐渐增加的输入时，如何调整其内部表示。研究发现，随着任务难度的增加，LLMs的最后一层隐藏状态变得更为稀疏，表明LLMs通过将计算集中在特定子空间来稳定在OOD条件下的推理。这种稀疏性与难度的关系在不同模型和领域中都能观察到，研究提出了一种基于稀疏性的 Curriculum In-Context Learning (SG-ICL) 策略，通过利用这一机制来提升性能。研究结果表明，LLMs使用稀疏性作为一种适应机制来处理不熟悉或复杂的输入。
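
The idea of scheduling few-shot demonstrations by representation sparsity can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract gives neither the sparsity metric nor the scheduling rule, so the near-zero-fraction measure, the `eps` threshold, the easy-to-hard ordering, and the function names are all hypothetical stand-ins, not SG-ICL itself.

```python
def sparsity(hidden, eps=1e-3):
    """Fraction of entries whose magnitude is below eps (an assumed metric)."""
    if not hidden:
        raise ValueError("hidden state must be non-empty")
    return sum(1 for h in hidden if abs(h) < eps) / len(hidden)

def schedule_demos(demos, eps=1e-3):
    """Order (text, last_hidden_state) pairs from densest to sparsest.

    Assumption: a sparser last hidden state marks a harder or more
    out-of-distribution demonstration, so this is an easy-to-hard curriculum.
    """
    ranked = sorted(demos, key=lambda d: sparsity(d[1], eps))
    return [text for text, _ in ranked]

# Usage with made-up hidden states.
demos = [
    ("hard", [0.0, 0.0, 0.0, 1.0]),
    ("easy", [1.0, 1.0, 0.0, 1.0]),
    ("mid", [0.0, 1.0, 0.0, 1.0]),
]
print(schedule_demos(demos))
```

In practice the hidden states would come from the model's final layer for each candidate demonstration; here they are hand-written lists so the sketch stays self-contained.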

DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

First: 2025-07-03T16:28:25+00:00 · Latest: 2026-03-19T17:35:34+00:00

Comments: Published in IEEE Transactions on Audio, Speech and Language Processing (TASLP). Model and code available at: https://github.com/kehanlu/DeSTA2.5-Audio

Abstract

We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets. However, existing LALMs have often suffered from the catastrophic forgetting of the LLM's original abilities. Therefore, balancing knowledge retention and audio perception has become a critical challenge. To address this, we revisit the data construction pipeline and propose a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets, named DeSTA. This approach aims at preserving the LLM's native language proficiency thereby enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.

中文标题/摘要

标题：DeSTA2.5-Audio：通用大型音频语言模型的自我生成跨模态对齐

我们介绍了DeSTA2.5-Audio，这是一种通用的大型音频语言模型（LALM），旨在实现稳健的听觉感知和指令遵循。最近的LALMs通过在大规模音频指令数据集上进行训练，将听觉能力添加到大型语言模型（LLMs）中。然而，现有的LALMs往往在保留LLM原始能力方面存在灾难性遗忘的问题。因此，平衡知识保留和听觉感知成为了一个关键挑战。为了解决这个问题，我们重新审视了数据构建管道，并提出了一种自我生成的跨模态对齐策略，其中骨干LLM生成自己的训练目标，称为DeSTA。该方法旨在保留LLM的本源语言能力，从而实现零样本泛化，无需特定任务的调优。我们构建了DeSTA-AQA5M，这是一个包含500万训练样本的大规模、任务无关的数据集，这些样本来自7000小时的音频，涵盖了50个不同的数据集，包括语音、环境声音和音乐。DeSTA2.5-Audio在包括Dynamic-SUPERB、MMAU、SAKURA、Speech-IFEval和VoiceBench在内的广泛音频语言基准测试中实现了最先进的或竞争性的性能。全面的比较研究表明，我们的自我生成策略优于现有的训练策略。我们的研究结果强调了在LALM开发中精心设计数据构建的重要性，并为构建稳健的通用LALM提供了实用见解。

STELLAR: Structure-guided LLM Assertion Retrieval and Generation for Formal Verification

Authors: Saeid Rajabi, Chengmo Yang, Satwik Patnaik

First: 2025-11-28T05:31:07+00:00 · Latest: 2026-03-19T17:33:20+00:00

Comments: Accepted at the 63rd Design Automation Conference (DAC 2026), Long Beach, CA, USA (July 26-29, 2026). 7 pages, 6 figures

Abstract

Formal Verification (FV) relies on high-quality SystemVerilog Assertions (SVAs), but the manual writing process is slow and error-prone. Existing LLM-based approaches either generate assertions from scratch or ignore structural patterns in hardware designs and expert-crafted assertions. This paper presents STELLAR, the first framework that guides LLM-based SVA generation with structural similarity. STELLAR represents RTL blocks as AST structural fingerprints, retrieves structurally relevant (RTL, SVA) pairs from a knowledge base, and integrates them into structure-guided prompts. Experiments show that STELLAR achieves superior syntax correctness, stylistic alignment, and functional correctness, highlighting structure-aware retrieval as a promising direction for industrial FV.

中文标题/摘要

标题：STELLAR：结构引导的大语言模型断言检索与生成在形式验证中的应用

形式验证（FV）依赖于高质量的SystemVerilog断言（SVAs），但手动编写过程缓慢且容易出错。现有的基于大语言模型的方法要么从头生成断言，要么忽略硬件设计和专家编写断言的结构模式。本文提出了STELLAR，这是第一个使用结构相似性引导大语言模型生成SVAs的框架。STELLAR将RTL块表示为AST结构指纹，从知识库中检索结构相关的（RTL，SVA）对，并将它们整合到结构引导的提示中。实验表明，STELLAR在语法正确性、风格对齐和功能正确性方面表现出色，强调结构感知检索是工业FV的一个有前途的方向。
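
The retrieval step described above (fingerprint an RTL block by its AST structure, then fetch structurally similar (RTL, SVA) pairs) might look roughly like the sketch below. It assumes, purely for illustration, that a fingerprint is a bag of AST node-type labels compared by cosine similarity; the abstract does not specify STELLAR's fingerprint or similarity measure, and the node labels, dictionary keys, and function names are hypothetical.

```python
import math
from collections import Counter

def fingerprint(node_types):
    """Bag-of-node-types fingerprint (an assumed representation)."""
    return Counter(node_types)

def similarity(fp_a, fp_b):
    """Cosine similarity between two fingerprints."""
    dot = sum(fp_a[k] * fp_b[k] for k in fp_a.keys() & fp_b.keys())
    norm_a = math.sqrt(sum(v * v for v in fp_a.values()))
    norm_b = math.sqrt(sum(v * v for v in fp_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_nodes, knowledge_base, k=1):
    """Return the SVAs of the k entries most structurally similar to the query."""
    query_fp = fingerprint(query_nodes)
    ranked = sorted(
        knowledge_base,
        key=lambda entry: similarity(query_fp, fingerprint(entry["nodes"])),
        reverse=True,
    )
    return [entry["sva"] for entry in ranked[:k]]

# Usage with a made-up two-entry knowledge base.
kb = [
    {"nodes": ["always_ff", "if", "assign"], "sva": "A"},
    {"nodes": ["case", "assign", "assign"], "sva": "B"},
]
print(retrieve(["always_ff", "if", "if"], kb))
```

The retrieved assertions would then be placed in the prompt as examples alongside the target RTL; prompt construction is omitted here.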

Few-shot Acoustic Synthesis with Multimodal Flow Matching

Authors: Amandine Brunetto

Venue: CVPR 2026

First: 2026-03-19T17:32:06+00:00 · Latest: 2026-03-19T17:32:06+00:00

Comments: To appear at CVPR 2026. 23 pages, 16 figures. Project Page: https://amandinebtto.github.io/FLAC/

Abstract

Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.

中文标题/摘要

标题：基于多模态流匹配的少量样本声学合成

生成与场景声学一致的音频对于沉浸式虚拟环境至关重要。近期的神经声学场方法能够实现空间连续的声音渲染，但仍然保持场景特定性，需要密集的音频测量和昂贵的训练成本。少量样本的方法提高了跨房间的可扩展性，但仍依赖于多个录音，并且由于是确定性的，无法捕捉稀疏上下文中场景声学的固有不确定性。我们引入了流匹配声学生成（FLAC），这是一种基于少量场景上下文的概率方法，用于建模给定最小场景上下文时可能的房间冲激响应（RIRs）的分布。FLAC 利用一个使用流匹配目标训练的扩散变换器，在新场景中的任意位置生成 RIRs，条件是空间、几何和声学线索。FLAC 在 AcousticRooms 和 Hearing Anything Anywhere 数据集上的一次样本优于最先进的八次样本基线。为了补充标准的感知度量，我们进一步引入了 AGREE，一种联合声学-几何嵌入，通过检索和分布度量使生成的 RIRs 的几何一致性评估成为可能。这项工作是首次将生成流匹配应用于显式 RIR 合成，为稳健和数据高效声学合成开辟了新方向。

Summary / 总结

The research aims to improve the scalability and robustness of acoustic synthesis in virtual environments through a probabilistic few-shot approach. FLAC models the distribution of plausible room impulse responses (RIRs) given minimal scene context, using a diffusion transformer trained with a flow-matching objective. With a single shot, it outperforms state-of-the-art eight-shot baselines on the AcousticRooms and Hearing Anything Anywhere datasets. The authors also introduce AGREE, a joint acoustic-geometry embedding that enables geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis.

研究旨在通过利用最少的场景上下文实现房间冲激响应（RIR）的少样本生成，提高虚拟环境中声学合成的可扩展性和鲁棒性。FLAC是一种概率方法，使用以流匹配目标训练的扩散变换器，以空间、几何和声学线索为条件，在新场景中的任意位置生成RIRs。该方法仅用一次样本就在AcousticRooms和Hearing Anything Anywhere数据集上优于最先进的八次样本基线。此外，研究还引入了AGREE（一种联合声学-几何嵌入），通过检索和分布度量评估生成的RIRs的几何一致性。这项工作首次将生成流匹配应用于显式RIR合成，为稳健且数据高效的声学合成开辟了新方向。
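
For readers unfamiliar with the flow-matching objective mentioned above, a common form is the conditional flow-matching loss with a linear interpolation path. The block below is a generic sketch, not FLAC's stated formulation: the abstract does not give the path, the parameterization, or the conditioning details, so the symbols $x_0$ (noise sample), $x_1$ (a ground-truth RIR), $c$ (scene context), and $v_\theta$ (the learned velocity field) are assumptions.

```latex
% Linear interpolation between noise x_0 and data x_1 (assumed path)
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

% Regress the velocity field onto the constant target velocity x_1 - x_0
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}
  \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2

% Sampling: integrate the learned ODE from t = 0 to t = 1
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t, c)
```

Because the initial noise $x_0$ is random, repeated sampling under the same context $c$ yields different plausible RIRs, which is what makes such a model probabilistic rather than deterministic.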
