arXiv Paper Digest

Snapshot: 20260308_0327

Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Authors: Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar

First: 2026-03-05T18:59:32+00:00 · Latest: 2026-03-05T18:59:32+00:00

Abstract

Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
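
The calibrate-then-skip idea in the abstract can be illustrated with a toy, pure-Python sketch. This is not the authors' implementation: the function names, the `eps` threshold, and the "keep a tile if any probability in it exceeds `eps` on any calibration sample" rule are assumptions made for illustration, and the repetition patterns and hardware-efficient kernels the abstract mentions are not modeled. `eps` should stay below 1/(number of keys) so that every query row keeps at least one tile.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scores(q, keys):
    """Scaled dot-product scores of one query against a list of keys."""
    scale = math.sqrt(len(q))
    return [sum(a * b for a, b in zip(q, k)) / scale for k in keys]


def dense_attention(Q, K, V):
    out = []
    for q in Q:
        w = softmax(scores(q, K))
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out


def calibrate_block_mask(samples, block, eps):
    """Offline pass: keep a (query-block, key-block) tile if any attention
    probability inside it exceeds eps on at least one calibration sample."""
    n_q = len(samples[0][0]) // block
    n_k = len(samples[0][1]) // block
    keep = [[False] * n_k for _ in range(n_q)]
    for Q, K in samples:
        for i, q in enumerate(Q):
            for j, p in enumerate(softmax(scores(q, K))):
                if p > eps:
                    keep[i // block][j // block] = True
    return keep


def block_sparse_attention(Q, K, V, keep, block):
    """Inference: score only the keys that fall in kept tiles."""
    out = []
    for i, q in enumerate(Q):
        cols = [j for j in range(len(K)) if keep[i // block][j // block]]
        w = softmax(scores(q, [K[j] for j in cols]))
        out.append([sum(wj * V[j][c] for wj, j in zip(w, cols)) for c in range(len(V[0]))])
    return out


# Toy data (hypothetical): queries and keys are scaled one-hot vectors,
# so attention is almost exactly block-diagonal.
Q = [[10.0 if c == r else 0.0 for c in range(4)] for r in range(4)]
K = [row[:] for row in Q]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]

mask = calibrate_block_mask([(Q, K)], block=2, eps=1e-6)
sparse = block_sparse_attention(Q, K, V, mask, block=2)
dense = dense_attention(Q, K, V)
max_err = max(abs(a - b) for r1, r2 in zip(sparse, dense) for a, b in zip(r1, r2))
```

In this toy case the two off-diagonal tiles are dropped, so half of the score computations are skipped while the output stays numerically indistinguishable from dense attention. A real system would presumably store one such mask per layer, head, and diffusion timestep, as the abstract describes.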

Summary

This paper tackles the slow runtime of diffusion models for text-to-video generation with CalibAtt, a training-free method based on calibrated sparse attention. An offline calibration pass identifies token-to-token connections whose attention scores are consistently negligible, and the resulting block-level sparsity patterns are compiled into optimized attention operations that skip those connections at inference time. Experiments show that CalibAtt achieves up to 1.58x end-to-end speedup while maintaining video quality and text-video alignment.

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Authors: Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu

First: 2026-03-05T18:59:23+00:00 · Latest: 2026-03-05T18:59:23+00:00

Comments: Technical report v1 (14 pages, 7 figures, project page: https://spherelab.ai/poetx/)

Abstract

Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, and in contrast, standard optimizers such as AdamW run out of memory under the same settings.
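
Why an orthogonal equivalence transformation is "spectrum-preserving" can be checked numerically. The sketch below assumes the transformed weight has the form W = R W0 P with R and P orthogonal; the abstract does not spell out the form, so treat it as an assumption. It uses 2x2 rotations and hypothetical weight values, and says nothing about how POET-X makes the computation cheaper.

```python
import math


def rotation(theta):
    """2x2 rotation matrix, the simplest orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def singular_values_2x2(M):
    """Closed-form singular values of a 2x2 matrix."""
    f = sum(x * x for row in M for x in row)         # s1^2 + s2^2
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]      # +/- s1 * s2
    disc = math.sqrt(max(f * f - 4.0 * det * det, 0.0))
    return math.sqrt((f + disc) / 2.0), math.sqrt(max((f - disc) / 2.0, 0.0))


W0 = [[3.0, 1.0], [0.5, 2.0]]           # fixed base weights (hypothetical values)
R, P = rotation(0.7), rotation(-1.3)    # the two orthogonal factors
W = matmul(matmul(R, W0), P)            # transformed weights (assumed form R @ W0 @ P)

before = singular_values_2x2(W0)
after = singular_values_2x2(W)
```

The entries of `W` differ from those of `W0`, yet `before` and `after` agree to floating-point precision: multiplying by orthogonal matrices on either side leaves the singular values untouched.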

Summary

POET-X is a memory-efficient variant of the POET framework for training large language models (LLMs). It performs orthogonal equivalence transformations at reduced computational cost and memory consumption, keeping the generalization and stability benefits of the original POET while improving throughput. Experiments show that POET-X can pretrain billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda

First: 2026-03-05T18:58:14+00:00 · Latest: 2026-03-05T18:58:14+00:00

Abstract

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.

Summary

This study evaluates honesty elicitation and lie detection techniques on open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics. Sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs close to an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest elicitation techniques also transfer to frontier open-weights models such as DeepSeek R1, but no technique completely eliminates false responses.

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, Victor Zhong

First: 2026-03-04T18:47:26+00:00 · Latest: 2026-03-05T18:56:37+00:00

Abstract

Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50% with conventional embedding models twice its size, and 37% with BM25. Code and data are available at: https://texttron.github.io/AgentIR/.
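
A toy sketch of why embedding the reasoning trace together with the query can change what is retrieved. Everything here is a stand-in: the bag-of-words `embed` function replaces a learned encoder, joining the trace and query by string concatenation is an assumption about how "jointly embeds" might be realized, and the documents are made up.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a learned encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(query_text, docs):
    """Return (document indices sorted best first, similarity per document)."""
    q = embed(query_text)
    sims = [cosine(q, embed(d)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -sims[i]), sims


docs = [
    "python snake habitat and diet",
    "python programming language release history",
]
reasoning = "I need the release date of the programming language"  # agent's trace
query = "python"                                                   # its search call

plain_order, plain_sims = rank(query, docs)
aware_order, aware_sims = rank(f"{reasoning} {query}", docs)
```

With the query alone, both documents score identically, so the ranking is arbitrary. Once the reasoning trace is included, the second document wins because it shares the intent-bearing terms.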

Summary

The paper introduces Reasoning-Aware Retrieval, a retrieval paradigm that embeds a Deep Research agent's reasoning trace together with its query, and DR-Synth, a data synthesis technique that generates retriever training data from standard QA datasets. The resulting embedding model, AgentIR-4B, reaches 68% accuracy on the BrowseComp-Plus benchmark, compared with 50% for conventional embedding models twice its size and 37% for BM25.

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Authors: Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo

First: 2026-03-05T18:55:16+00:00 · Latest: 2026-03-05T18:55:16+00:00

Abstract

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B and GPT-OSS 120B) and finds task-difficulty-specific differences: the model's final answer is decodable from activations far earlier in the CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, "aha" moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
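
The probe-guided early exit mentioned at the end of the abstract can be sketched as a simple stopping rule. The confidence values, the threshold of 0.9, and the equal token count per step are all hypothetical; the abstract does not state how the exit criterion is actually defined.

```python
def early_exit_step(confidences, threshold):
    """Index of the first CoT step whose probe confidence reaches the threshold,
    or the last step if it never does. Expects a non-empty list."""
    for step, conf in enumerate(confidences):
        if conf >= threshold:
            return step
    return len(confidences) - 1


def token_savings(tokens_per_step, exit_step):
    """Fraction of generated tokens avoided by stopping after exit_step."""
    used = sum(tokens_per_step[: exit_step + 1])
    return 1.0 - used / sum(tokens_per_step)


# Hypothetical probe confidence in the final answer after each reasoning step.
confidences = [0.35, 0.60, 0.93, 0.95, 0.97, 0.99, 0.99, 0.99, 0.99, 0.99]
tokens_per_step = [40] * 10

step = early_exit_step(confidences, threshold=0.9)
saved = token_savings(tokens_per_step, step)
```

Here the probe crosses the threshold at the third step, so generation would stop after 120 of 400 tokens, a 70% reduction. The threshold trades token savings against the risk of exiting before the model's belief has settled.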

Deep FlexQP: Accelerated Nonlinear Programming via Deep Unfolding

Authors: Alex Oshin, Rahul Vodeb Ghosh, Augustinos D. Saravanos, Evangelos A. Theodorou

Venue: ICLR 2026

First: 2025-12-01T11:38:45+00:00 · Latest: 2026-03-05T18:54:48+00:00

Comments: Accepted to ICLR 2026

Abstract

We propose FlexQP, an always-feasible convex quadratic programming (QP) solver based on an $\ell_1$ elastic relaxation of the QP constraints. If the original constraints are feasible, FlexQP provably recovers the optimal solution. If the constraints are infeasible, FlexQP identifies a solution that minimizes the constraint violation while keeping the number of violated constraints sparse. Such infeasibilities arise naturally in sequential quadratic programming (SQP) subproblems due to the linearization of the constraints. We prove the convergence of FlexQP under mild coercivity assumptions, making it robust to both feasible and infeasible QPs. We then apply deep unfolding to learn LSTM-based, dimension-agnostic feedback policies for the algorithm parameters, yielding an accelerated Deep FlexQP. To preserve the exactness guarantees of the relaxation, we propose a normalized training loss that incorporates the Lagrange multipliers. We additionally design a log-scaled loss for PAC-Bayes generalization bounds that yields substantially tighter performance certificates, which we use to construct an accelerated SQP solver with guaranteed QP subproblem performance. Deep FlexQP outperforms state-of-the-art learned QP solvers on a suite of benchmarks including portfolio optimization, classification, and regression problems, and scales to dense QPs with over 10k variables and constraints via fine-tuning. When deployed within SQP, our approach solves nonlinear trajectory optimization problems 4-16x faster than SQP with OSQP while substantially improving success rates. On predictive safety filter problems, Deep FlexQP reduces safety violations by over 70% and increases task completion by 43% compared to existing methods.
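
The behaviour the abstract attributes to an $\ell_1$ elastic relaxation (exact recovery when feasible, minimal and sparse violation when infeasible) can be seen on a one-dimensional toy problem. This sketch uses the generic $\ell_1$ penalty form, objective plus `mu` times the summed constraint violations, solved by brute-force grid search; it is not FlexQP's formulation or algorithm, and the helper names are made up.

```python
def violation(x, constraints):
    """Per-constraint violation max(0, g(x)) for constraints written as g(x) <= 0."""
    return [max(0.0, g(x)) for g in constraints]


def solve_relaxed(objective, constraints, mu, lo=-2.0, hi=2.0, steps=4000):
    """Grid search for min objective(x) + mu * ||violation(x)||_1 on [lo, hi]."""
    best_x, best_val = None, float("inf")
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        val = objective(x) + mu * sum(violation(x, constraints))
        if val < best_val:
            best_x, best_val = x, val
    return best_x


def objective(x):
    return 0.5 * x * x


feasible = [lambda x: 1.0 - x]                  # x >= 1; optimum x* = 1, multiplier 1
infeasible = [lambda x: 1.0 - x, lambda x: x]   # x >= 1 and x <= 0 cannot both hold

x_exact = solve_relaxed(objective, feasible, mu=10.0)   # penalty above the multiplier
x_weak = solve_relaxed(objective, feasible, mu=0.5)     # penalty below the multiplier
x_inf = solve_relaxed(objective, infeasible, mu=10.0)
```

With `mu` above the optimal Lagrange multiplier (1 in this problem) the relaxed problem returns the constrained optimum x = 1, while a too-small `mu` gives x = 0.5 and violates the constraint. That dependence on the multipliers is presumably why the abstract ties its exactness guarantees to them. In the infeasible case the solution x = 0 satisfies one constraint exactly and concentrates all violation in the other.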

Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

Authors: Benjamin Feuer, Lucas Rosenblatt, Oussama Elachqar

First: 2026-03-05T18:52:28+00:00 · Latest: 2026-03-05T18:52:28+00:00

Abstract

As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.

Summary

This paper addresses the problem of building LLM judges with strong guarantees for autonomous workflows, particularly when biases are unknown or adversarially discovered. It introduces average bias-boundedness (A-BB), an algorithmic framework that formally guarantees a reduction in harm from any measurable bias in an LLM judge. Evaluated on Arena-Hard-Auto with four LLM judges, A-BB achieves bias-bounded guarantees while retaining 61-99% correlation with the original rankings under formatting and schematic biases, with most judge-bias combinations exceeding 80%.

Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Authors: Guo Chen, Lidong Lu, Yicheng Liu, Liangrui Dong, Lidong Zou, Jixin Lv, Zhenquan Li, Xinyi Mao, Baoqi Pei, Shihao Wang, Zhiqi Li, Karan Sapra, Fuxiao Liu, Yin-Dong Zheng, Yifei Huang, Limin Wang, Zhiding Yu, Andrew Tao, Guilin Liu, Tong Lu

First: 2026-03-05T18:52:12+00:00 · Latest: 2026-03-05T18:52:12+00:00

Abstract

While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.

Summary

The research aims to bridge the gap between existing video understanding datasets and natural, unscripted daily life by introducing MM-Lifelong, a dataset of 181.1 hours of footage structured across Day, Week, and Month scales. Evaluations of current paradigms show that end-to-end MLLMs face a Working Memory Bottleneck, while representative agentic baselines suffer Global Localization Collapse on sparse, month-long timelines. To address these issues, the authors propose the Recursive Multimodal Agent (ReMA), which uses dynamic memory management to iteratively update a recursive belief state and significantly outperforms existing methods. The dataset also includes splits that isolate temporal and domain biases, supporting future research in supervised learning and out-of-distribution generalization.

FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction

Authors: Jiaxin Yuan, Haizhao Yang, Maria Cameron

First: 2025-10-31T04:49:41+00:00 · Latest: 2026-03-05T18:50:52+00:00

Abstract

Fast and accurate simulation of dynamical systems is a fundamental challenge across scientific and engineering domains. Traditional numerical integrators often face a trade-off between accuracy and computational efficiency, while existing neural network-based approaches typically require training a separate model for each case. To overcome these limitations, we introduce a novel multi-modal foundation model for large-scale simulations of differential equations: FMint-SDE (Foundation Model based on Initialization for stochastic differential equations). Based on a decoder-only transformer with in-context learning, FMint-SDE leverages numerical and textual modalities to learn a universal error-correction scheme. It is trained using prompted sequences of coarse solutions generated by conventional solvers, enabling broad generalization across diverse systems. We evaluate our models on a suite of challenging SDE benchmarks spanning applications in molecular dynamics, mechanical systems, finance, and biology. Experimental results show that our approach achieves a superior accuracy-efficiency tradeoff compared to classical solvers, underscoring the potential of FMint-SDE as a general-purpose simulation tool for dynamical systems.
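
To make "coarse solutions generated by conventional solvers" and "error correction" concrete, the sketch below runs Euler-Maruyama, a standard SDE integrator, on one Brownian path at a fine and a coarse step size and records the gap between them. The choice of solver, the test equation, and the step sizes are assumptions for illustration. The abstract does not say which solvers FMint-SDE uses, and the learned correction model itself is not shown.

```python
import math
import random


def euler_maruyama(x0, drift, diffusion, dt, increments):
    """Euler-Maruyama path for dX = drift(X) dt + diffusion(X) dW."""
    path = [x0]
    for dw in increments:
        x = path[-1]
        path.append(x + drift(x) * dt + diffusion(x) * dw)
    return path


def coarse_and_fine(x0, drift, diffusion, dt_fine, k, n_coarse, seed=0):
    """Simulate one Brownian path at two resolutions (coarse step = k * dt_fine)."""
    rng = random.Random(seed)
    fine_dw = [rng.gauss(0.0, math.sqrt(dt_fine)) for _ in range(k * n_coarse)]
    # Coarse increments are sums of fine ones, so both solvers see the same noise.
    coarse_dw = [sum(fine_dw[i * k:(i + 1) * k]) for i in range(n_coarse)]
    fine = euler_maruyama(x0, drift, diffusion, dt_fine, fine_dw)[::k]
    coarse = euler_maruyama(x0, drift, diffusion, k * dt_fine, coarse_dw)
    return coarse, fine


# Hypothetical test equation: dX = -X dt + 0.3 dW.
coarse, fine = coarse_and_fine(1.0, lambda x: -x, lambda x: 0.3,
                               dt_fine=0.01, k=10, n_coarse=5)
# The gap at each coarse time point is the kind of error term a correction
# model would be trained to predict from the coarse trajectory.
errors = [f - c for f, c in zip(fine, coarse)]
```

The coarse solver takes ten times fewer steps. If a model can predict `errors` from the coarse trajectory, adding the prediction back would give close to fine-grid accuracy at roughly coarse-grid cost, which is the accuracy-efficiency trade-off the abstract targets.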

Summary

The research aims to make simulation of dynamical systems both faster and more accurate, avoiding the accuracy-efficiency trade-off of traditional numerical integrators and the per-case training required by existing neural approaches. FMint-SDE, a multimodal foundation model, uses a decoder-only transformer with in-context learning to learn a universal error-correction scheme from numerical and textual modalities. It is trained on coarse solutions from conventional solvers and achieves a better accuracy-efficiency trade-off than classical solvers on SDE benchmarks spanning molecular dynamics, mechanical systems, finance, and biology.

Thermodynamic Response Functions in Singular Bayesian Models

Authors: Sean Plummer

First: 2026-03-05T18:50:20+00:00 · Latest: 2026-03-05T18:50:20+00:00

Abstract

Singular statistical models, including mixtures, matrix factorization, and neural networks, violate regular asymptotics due to parameter non-identifiability and degenerate Fisher geometry. Although singular learning theory characterizes marginal likelihood behavior through invariants such as the real log canonical threshold and singular fluctuation, these quantities remain difficult to interpret operationally. At the same time, widely used criteria such as WAIC and WBIC appear disconnected from underlying singular geometry. We show that posterior tempering induces a one-parameter deformation of the posterior distribution whose associated observables generate a hierarchy of thermodynamic response functions. A universal covariance identity links derivatives of tempered expectations to posterior fluctuations, placing WAIC, WBIC, and singular fluctuation within a unified response framework. Within this framework, classical quantities from singular learning theory acquire natural thermodynamic interpretations: RLCT governs the leading free-energy slope, singular fluctuation corresponds to curvature of the tempered free energy, and WAIC measures predictive fluctuation. We formalize an observable algebra that quotients out non-identifiable directions, allowing structurally meaningful order parameters to be constructed in singular models. Across canonical singular examples, including symmetric Gaussian mixtures, reduced-rank regression, and overparameterized neural networks, we empirically demonstrate phase-transition-like behavior under tempering. Order parameters collapse, susceptibilities peak, and complexity measures align with structural reorganization in posterior geometry. Our results suggest that thermodynamic response theory provides a natural organizing framework for interpreting complexity, predictive variability, and structural reorganization in singular Bayesian learning.
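
The covariance identity can be verified numerically on a small discrete example. The sketch assumes the usual tempering convention, where the posterior weight of state i is proportional to `prior[i] * exp(-beta * energy[i])` with `energy` standing in for the negative log-likelihood. Under that assumption the derivative of a tempered expectation with respect to beta equals minus the covariance of the observable with the energy, $\frac{d}{d\beta}\mathbb{E}_\beta[A] = -\mathrm{Cov}_\beta(A, H)$. The paper's exact notation may differ, and all numbers below are made up.

```python
import math


def tempered_weights(prior, energy, beta):
    """Normalized weights proportional to prior[i] * exp(-beta * energy[i])."""
    raw = [p * math.exp(-beta * h) for p, h in zip(prior, energy)]
    z = sum(raw)
    return [r / z for r in raw]


def expectation(values, weights):
    return sum(v * w for v, w in zip(values, weights))


def covariance(a, b, weights):
    ea, eb = expectation(a, weights), expectation(b, weights)
    return expectation([x * y for x, y in zip(a, b)], weights) - ea * eb


prior = [0.1, 0.4, 0.3, 0.2]
energy = [0.5, 1.0, 2.5, 4.0]        # stands in for the negative log-likelihood
observable = [1.0, -2.0, 0.5, 3.0]
beta, h = 0.8, 1e-5

# Left side: central finite difference of the tempered expectation.
lhs = (expectation(observable, tempered_weights(prior, energy, beta + h))
       - expectation(observable, tempered_weights(prior, energy, beta - h))) / (2 * h)
# Right side: minus the tempered covariance between observable and energy.
rhs = -covariance(observable, energy, tempered_weights(prior, energy, beta))
```

`lhs` and `rhs` agree up to finite-difference error. Setting the observable equal to the energy turns the right-hand side into minus the energy variance, which is the sense in which a response (a derivative in beta) is read off from a fluctuation.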

Summary

This paper addresses singular statistical models such as mixtures and neural networks, where standard asymptotic theory fails because of parameter non-identifiability and degenerate Fisher geometry. The authors use posterior tempering to generate a hierarchy of thermodynamic response functions, with a universal covariance identity linking derivatives of tempered expectations to posterior fluctuations. Classical quantities from singular learning theory then gain thermodynamic interpretations: the real log canonical threshold governs the leading free-energy slope, and the singular fluctuation corresponds to the curvature of the tempered free energy. In canonical singular models, tempering produces phase-transition-like behavior, with order parameters collapsing and susceptibilities peaking in line with structural reorganization in posterior geometry.

Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

Authors: Artem Vazhentsev, Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Seleznyov, Mikhail Salnikov, Elena Tutubalina, Vasily Konovalov, Irina Nikishina, Alexander Panchenko, Viktor Moskvoretskii

First: 2026-03-05T18:42:51+00:00 · Latest: 2026-03-05T18:42:51+00:00

Comments: Preprint

Abstract

Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the model's intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.

Summary

The research aims to make agentic AI systems built on Large Language Models (LLMs) more trustworthy by fact-checking claims without retrieval, which avoids the constraints of retrieval errors and external data availability. It introduces a comprehensive evaluation framework that tests robustness to long-tail knowledge, variation in claim sources, multilinguality, and long-form generation. Logit-based approaches often underperform methods that use internal model representations, and the proposed method, INTRA, which exploits interactions between internal representations, achieves state-of-the-art performance with strong generalization.

HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token

Authors: Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun, Jiawei Zhou

Venue: The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)

First: 2026-03-05T18:36:31+00:00 · Latest: 2026-03-05T18:36:31+00:00

Abstract

Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
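
The AUROC figures quoted above measure how well a probe's risk score separates hallucinated from non-hallucinated cases. A minimal sketch of the metric, using its pairwise-ranking definition, is shown below. The probe scores and labels are hypothetical, and the abstract does not describe the paper's own evaluation code.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical probe outputs: estimated hallucination risk per prompt, computed
# before generation; label 1 means the model's later answer was a hallucination.
risk = [0.9, 0.8, 0.3, 0.1]
print(auroc(risk, [1, 1, 0, 0]))  # 1.0
print(auroc(risk, [1, 0, 1, 0]))  # 0.75
```

An AUROC of 0.5 corresponds to chance-level ranking, so a value of 0.93 means the probe ranks a hallucinated case above a non-hallucinated one in 93% of such pairs.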

中文标题/摘要

标题：HALP：无需生成单个词元即可检测视觉语言模型中的幻觉

幻觉仍然是视觉语言模型（VLMs）的一个持续性挑战，它们经常描述不存在的对象或编造事实。现有的检测方法通常在文本生成之后进行操作，这使得干预既昂贵又不及时。我们研究了是否可以在生成任何词元之前通过探测模型的内部表示来预测幻觉风险。在一系列视觉语言任务和八种现代VLMs（包括Llama-3.2-Vision、Gemma-3、Phi-4-VL和Qwen2.5-VL）中，我们检查了三种内部表示家族：（i）仅视觉特征而不进行多模态融合，（ii）文本解码器中的视觉词元表示，以及（iii）在生成之前整合视觉和文本信息的查询词元表示。基于这些表示训练的探测器在无需解码的情况下实现了强大的幻觉检测性能，达到Gemma-3-12B、Phi-4-VL 5.6B和Molmo 7B上的0.93 AUROC。大多数模型中，后期查询词元状态最具预测性，而视觉或中间层特征在少数架构中占主导地位（例如，Qwen2.5-VL-7B使用仅视觉特征的AUROC约为0.79）。这些结果表明：（1）幻觉风险可以在生成之前检测到，（2）最具信息量的层和模态在不同架构中有所不同，（3）轻量级探测器有可能实现早期避免、选择性路由和自适应解码，以提高安全性和效率。

Summary / 总结

The research aims to detect hallucinations in vision-language models before any text generation by probing internal representations. Across various tasks and eight models, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, the study evaluates three types of internal representations and finds that probes trained on late query-token states achieve strong hallucination-detection performance, with up to 0.93 AUROC on some models. The results indicate that hallucination risk can be detected pre-generation, and the most informative layer and modality vary across different architectures.

研究旨在在任何文本生成之前检测视觉语言模型中的幻觉，这比现有方法更高效和及时。研究调查了三种内部表示：仅视觉特征、视觉标记表示和查询标记表示。这些内部表示训练的探针实现了强大的幻觉检测性能，大多数模型中晚期查询标记状态是最具预测性的。结果表明，幻觉风险可以在生成前检测到，不同架构中最信息丰富的层和模态各不相同，轻量级探针可以实现早期干预以提高安全性和效率。
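
The probing setup described above can be illustrated with a minimal sketch: a linear probe maps a hidden state to a hallucination-risk score, and AUROC measures how well those scores separate hallucinated from faithful examples. The hidden states, probe weights, and labels below are made-up toy values, not data or parameters from the paper, and the function names are hypothetical.

```python
import math

def probe_score(hidden, weights, bias):
    """Logistic probe: sigmoid of a linear readout of one hidden state."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of
    examples, which is fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy 2-dimensional "hidden states" (assumed values); label 1 = hallucination.
hidden_states = [[2.0, 0.1], [1.5, -0.3], [-1.0, 0.4], [-2.0, 0.0]]
labels = [1, 1, 0, 0]
weights, bias = [1.0, 0.0], 0.0  # hypothetical, already-trained probe

scores = [probe_score(h, weights, bias) for h in hidden_states]
print(auroc(scores, labels))
```

Because the probe only reads a state that exists before decoding starts, the score is available without generating any token, which is the property the abstract emphasizes.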

NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance

Authors: Abrar Eyasir, Tahsin Ahmed, Muhammad Ibrahim

First: 2026-03-05T18:35:03+00:00 · Latest: 2026-03-05T18:35:03+00:00

Comments: 18 pages, 7 figures, 6 tables. Dataset contains 87,805 Bangla QA pairs from NCTB textbooks

Abstract

Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions. These systems tend to produce unreliable responses when correct answers are absent from context. To solve this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions. NCTB-QA also includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning. BERT achieves 313% relative improvement in F1 score (0.150 to 0.620). Semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering. This study demonstrates that domain-specific fine-tuning is critical for robust performance in low-resource settings.

中文标题/摘要

标题：NCTB-QA：大规模孟加拉语教育问答数据集及基准性能评估

针对低资源语言的阅读理解系统在处理无法回答的问题时面临重大挑战。当正确答案不在上下文中时，这些系统往往会生成不可靠的响应。为了解决这一问题，我们引入了NCTB-QA，这是一个包含87,805个问答对的大规模孟加拉语问答数据集，这些问答对是从孟加拉国国家课程和课本委员会出版的50本教科书中提取的。与现有的孟加拉语数据集不同，NCTB-QA保持了可回答问题（57.25%）和无法回答问题（42.75%）的平衡分布。NCTB-QA还包含对抗设计的实例，其中包含可能的干扰项。我们对三种基于变换器的模型（BERT、RoBERTa、ELECTRA）进行了基准测试，并通过微调展示了显著的改进。BERT在F1分数上实现了313%的相对改进（从0.150到0.620）。通过BERTScore衡量的语义答案质量在所有模型中也显著提高。我们的结果确立了NCTB-QA作为孟加拉语教育问答具有挑战性的基准。本研究证明，在低资源环境中，领域特定的微调对于稳健性能至关重要。

Summary / 总结

The research addresses the challenge of handling unanswerable questions in reading comprehension systems for low-resource languages by introducing NCTB-QA, a large-scale Bangla question answering dataset. The dataset includes 87,805 question-answer pairs from 50 Bangla textbooks and maintains a balanced distribution of answerable and unanswerable questions. The study benchmarks three transformer-based models (BERT, RoBERTa, ELECTRA) and shows significant improvements through fine-tuning, with BERT achieving a 313% relative improvement in F1 score. The results highlight the importance of domain-specific fine-tuning for robust performance in low-resource settings.

研究通过引入包含87,805个问题-答案对的NCTB-QA大型孟加拉语数据集，解决了低资源语言阅读理解系统处理不可回答问题的挑战。该数据集保持了可回答和不可回答问题的平衡分布，并包含对抗性实例。对三种基于变换器的模型（BERT、RoBERTa、ELECTRA）进行了基准测试，通过微调显示出显著改进，BERT的F1分数提高了313%。研究强调了在低资源环境中实现稳健性能的关键在于领域特定的微调。
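
As a quick check on the reported figure, the relative improvement can be recomputed from the two F1 values quoted in the abstract (0.150 before fine-tuning, 0.620 after). The helper name is illustrative only.

```python
def relative_improvement(before, after):
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("relative improvement is undefined for before == 0")
    return (after - before) / before * 100.0

# F1 values quoted in the abstract for BERT.
gain = relative_improvement(0.150, 0.620)
print(f"{gain:.1f}%")
```

The result rounds to the 313% stated in the abstract; note that this is a relative gain, while the absolute gain is 0.47 F1 points.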

DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates

Authors: Klaywert Danillo Ferreira de Souza, David Eduardo Pereira, Cláudio E. C. Campelo, Larissa Lucena Vasconcelos

First: 2026-03-05T18:30:10+00:00 · Latest: 2026-03-05T18:30:10+00:00

Abstract

The process of debating is essential in our daily lives, whether in studying, work activities, simple everyday discussions, political debates on TV, or online discussions on social networks. The range of uses for debates is broad. Due to the diverse applications, structures, and formats of debates, developing corpora that account for these variations can be challenging, and the scarcity of debate corpora in the state of the art is notable. For this reason, the current research proposes the DEBISS corpus: a collection of spoken and individual debates with semi-structured features. The corpus carries a broad range of NLP task annotations, such as speech-to-text, speaker diarization, argument mining, and debater quality assessment.

中文标题/摘要

标题：DEBISS：个体、半结构化和口语辩论语料库

辩论的过程在我们的日常生活中至关重要，无论是学习、工作活动、简单的日常讨论、电视上的政治辩论，还是社交媒体上的在线讨论。辩论的应用范围很广。由于辩论的多样应用、结构和格式，开发能够涵盖这些差异的语料库具有挑战性，目前该领域的辩论语料库稀缺性尤为明显。因此，当前研究提出了DEBISS语料库：一个包含口语和个体辩论的半结构化集合。该语料库包含广泛的自然语言处理任务注释，如语音转文本、说话人辨识、论点挖掘和辩手质量评估。

Summary / 总结

The research aims to address the scarcity of debate corpora that capture the diverse structures and formats of debates. Its contribution is the DEBISS corpus, a collection of spoken, individual debates with semi-structured features. The corpus is annotated for a broad range of NLP tasks, including speech-to-text, speaker diarization, argument mining, and debater quality assessment.

研究旨在解决能够涵盖辩论多样结构和格式的语料库稀缺的问题。其主要贡献是构建DEBISS语料库，收录具有半结构化特征的口语个体辩论。该语料库提供面向多种自然语言处理任务的标注，如语音转文本、说话人日志、论点挖掘和辩手质量评估。

Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

Authors: Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan

Venue: ICLR 2026

First: 2026-03-05T18:25:26+00:00 · Latest: 2026-03-05T18:25:26+00:00

Comments: Accepted at ICLR 2026

Abstract

Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance': committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.

中文标题/摘要

标题：超越零散接受：通过最长稳定前缀实现DLM的快速一致推理

扩散语言模型（DLMs）有望实现高度并行的文本生成，但其实际推理速度往往受限于次优的解码调度器。标准方法依赖于“零散接受”，即在序列中不连续的位置提交高置信度的标记。这种方法无意中割裂了键值（KV）缓存，破坏了内存局部性，并迫使模型在不稳定的标记边界上进行昂贵的重复修复。为了解决这个问题，我们提出了最长稳定前缀（LSP）调度器，这是一种基于整体式前缀吸收、无需训练且与模型无关的推理范式。在每个去噪步骤中，LSP 通过单次前向传播评估标记的稳定性，动态识别一个连续的左对齐稳定预测块，并在原子提交前将其边界对齐到自然语言或结构分隔符。这种前缀优先的拓扑结构带来了双重好处：在系统层面，它将碎片化的KV缓存更新转换为高效的连续追加；在算法层面，它保留了对几何级缩小的活动后缀的双向前瞻，大幅降低了标记翻转率和去噪器调用次数。在LLaDA-8B和Dream-7B上的广泛评估表明，LSP 在包括数学推理、代码生成、多语言（CJK）任务和创意写作在内的严格基准测试中将推理加速了高达3.4倍，同时保持或略微提高了输出质量。通过从根本上重构提交拓扑，LSP 弥合了DLMs的理论并行性与实际硬件效率之间的差距。

Summary / 总结

The paper addresses the issue of slow inference in Diffusion Language Models (DLMs) due to suboptimal decoding schedulers that commit high confidence tokens at disjoint positions, leading to fragmented KV cache updates and increased computational costs. It introduces the Longest Stable Prefix (LSP) scheduler, which evaluates token stability and commits a contiguous block of predictions, thereby preserving memory locality and reducing token flip rates. Experiments on LLaDA-8B and Dream-7B show that LSP can accelerate inference by up to 3.4x across various benchmarks while maintaining or slightly improving output quality.

论文针对扩散语言模型（DLMs）因“零散接受”式解码调度器导致推理缓慢的问题：该调度器在序列中不连续的位置提交高置信度的标记，从而破坏KV缓存并增加计算成本。文中提出最长稳定前缀（LSP）调度器，在每个去噪步骤中评估标记的稳定性，动态识别一个连续的左对齐稳定预测块，并在原子提交前将其边界对齐到自然语言或结构分隔符。该方法将碎片化的KV缓存更新转换为高效的连续追加，并保留对逐渐缩小的活动后缀的双向前瞻，在LLaDA-8B和Dream-7B上将推理速度最多提升3.4倍，同时保持或略微提升输出质量。
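
The prefix selection step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the confidence threshold, the delimiter set, and the rule "snap back to the last delimiter inside the stable run, or keep the whole run if there is none" are hypothetical choices, since the abstract does not specify how stability is scored or how the boundary is snapped.

```python
# Hypothetical delimiter set; the paper's actual delimiters are not given here.
DELIMITERS = {".", ",", ";", "\n"}

def longest_stable_prefix(tokens, confidences, threshold=0.9,
                          delimiters=DELIMITERS):
    """Return how many leading tokens of the active suffix to commit.

    1. Scan left to right and stop at the first token whose confidence
       falls below `threshold` (contiguous, left-aligned stable block).
    2. Snap the boundary back to the last delimiter inside that block.
       If the block holds no delimiter, commit the whole block
       (an assumption made for this sketch).
    """
    stable = 0
    for conf in confidences:
        if conf < threshold:
            break
        stable += 1
    for end in range(stable, 0, -1):
        if tokens[end - 1] in delimiters:
            return end
    return stable

tokens = ["The", "cat", "sat", ".", "It", "ran"]
confidences = [0.99, 0.95, 0.97, 0.93, 0.91, 0.40]
commit = longest_stable_prefix(tokens, confidences)
print(tokens[:commit])
```

Committing only a contiguous prefix means the committed region grows by appends, which is the property the abstract links to contiguous KV cache updates; the remaining suffix stays open for further denoising.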

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Authors: Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao

First: 2026-03-05T18:24:49+00:00 · Latest: 2026-03-05T18:24:49+00:00

Abstract

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3× speedup over cuDNN 9.13 and 2.7× over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30× faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.

中文标题/摘要

标题：FlashAttention-4：面向非对称硬件扩展的算法与内核流水线协同设计

注意力机制作为无处不在的Transformer架构的核心层，是大型语言模型和长上下文应用的瓶颈。虽然FlashAttention-3通过异步执行和warp专用化优化了Hopper GPU上的注意力计算，但它主要针对H100架构。AI行业已迅速转向部署基于Blackwell的系统，如B200和GB200；由于非对称硬件扩展，这些系统表现出根本不同的性能特征：张量核心吞吐量翻倍，而其他功能单元（共享内存带宽、指数单元）扩展较慢或保持不变。我们开发了几种技术来应对Blackwell GPU上这些不断转移的瓶颈：(1) 重新设计的流水线，利用完全异步的MMA操作和更大的分块大小；(2) 软件模拟的指数运算和条件softmax重缩放，减少非矩阵乘法运算；(3) 利用张量内存和2-CTA MMA模式，减少反向传播中的共享内存流量和原子加法。我们证明，我们的方法FlashAttention-4在B200 GPU上使用BF16时，相对于cuDNN 9.13实现了高达1.3倍的加速，相对于Triton实现了高达2.7倍的加速，最高达到1613 TFLOPs/s（71%利用率）。除算法创新外，我们完全使用嵌入Python的CuTe-DSL实现FlashAttention-4，与传统的基于C++模板的方法相比，编译速度快20-30倍，同时保持完整的表达能力。

Summary / 总结

FlashAttention-4 optimizes attention mechanisms for Blackwell-based GPUs like B200 and GB200, which have asymmetric hardware scaling. It introduces redesigned pipelines with fully asynchronous MMA operations, software-emulated exponential and conditional softmax rescaling, and leverages tensor memory to reduce shared memory traffic. These techniques result in up to 1.3× speedup over cuDNN 9.13 and 2.7× over Triton on B200 GPUs with BF16, achieving up to 1613 TFLOPs/s (71% utilization).

FlashAttention-4 针对 B200、GB200 等基于 Blackwell 的 GPU 优化注意力计算，这类 GPU 的硬件扩展是非对称的。该方法引入了重新设计的流水线，利用完全异步的 MMA 操作和更大的分块大小，通过软件模拟的指数运算和条件 softmax 重缩放来减少非矩阵乘法操作，并利用张量内存和 2-CTA MMA 模式减少反向传播中的共享内存流量和原子加操作。在 B200 GPU 上使用 BF16 时，其速度最高可达 cuDNN 9.13 的 1.3 倍、Triton 的 2.7 倍，最高达到 1613 TFLOPs/s（71% 利用率）。
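
To make the idea of softmax rescaling concrete, here is a scalar sketch of the blockwise (online) softmax that FlashAttention-style kernels are built on, with the rescale performed only when the running maximum actually changes. The abstract does not state the condition FlashAttention-4 uses, so the rule below is an assumption chosen because it is exact: when the maximum is unchanged, the rescale factor is exp(0) = 1 and can be skipped. This is plain Python for one query row, not the paper's kernel.

```python
import math

def online_softmax_weighted_sum(scores, values, block=2):
    """Softmax(scores)-weighted sum of values, processed block by block.

    Keeps a running max `m`, denominator `denom`, and accumulator `acc`.
    The accumulated terms are rescaled only when a block raises the max.
    Returns (result, number_of_rescales). Assumes `scores` is non-empty.
    """
    m = float("-inf")
    denom = 0.0
    acc = 0.0
    rescales = 0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_m = max(m, max(s_blk))
        if new_m > m:
            if m != float("-inf"):
                # Bring previously accumulated terms onto the new max.
                scale = math.exp(m - new_m)
                denom *= scale
                acc *= scale
                rescales += 1
            m = new_m
        for s, v in zip(s_blk, v_blk):
            p = math.exp(s - m)
            denom += p
            acc += p * v
    return acc / denom, rescales

def reference(scores, values):
    """Direct softmax-weighted sum, for comparison."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

scores = [1.0, 0.0, 5.0, 2.0]
values = [10.0, 20.0, 30.0, 40.0]
result, rescales = online_softmax_weighted_sum(scores, values)
print(result, reference(scores, values), rescales)
```

Every skipped rescale saves multiplications and an exponential, which are exactly the non-matmul operations the abstract identifies as scaling more slowly than tensor core throughput on Blackwell.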

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

Authors: Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy

First: 2026-03-05T18:22:55+00:00 · Latest: 2026-03-05T18:22:55+00:00

Comments: 10 pages, 4 figures

Abstract

Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.

中文标题/摘要

标题：分布式部分信息谜题：认知不对称下的共同基础构建研究

建立共同基础，即共享的一组信念和相互认可的事实，是协作的根本，但对当前的AI系统而言仍是一个挑战，尤其是在多模态、多参与者的场景中，各协作者掌握的信息各不相同。我们引入了分布式部分信息谜题（DPIP），这是一种在认知不对称下引发丰富多模态交流的协作构建任务。我们提供了这些交互的多模态数据集，并在语音、手势和动作模态上进行了标注和时间对齐，以支持对命题内容和信念动态的推理。然后，我们评估了两种建模共同基础（CG）的范式：（1）最先进的大型语言模型（LLMs），通过提示使其从多模态更新中推断共享信念；（2）一个基于动态认知逻辑（DEL）的公理化流水线，以增量方式执行相同任务。在已标注DPIP数据上的结果表明，该任务对现代LLMs跟踪任务进展和信念状态的能力构成了挑战。

Summary / 总结

The research aims to improve AI systems' ability to establish common ground in collaborative tasks, particularly in multimodal and multiparty settings. It introduces the Distributed Partial Information Puzzle (DPIP) to examine common ground construction under epistemic asymmetry. The study evaluates two approaches: state-of-the-art large language models (LLMs) and an axiomatic pipeline based on Dynamic Epistemic Logic (DEL). The results show that modern LLMs struggle to track both task progression and belief states effectively, highlighting the challenge of common ground construction in such settings.

论文引入分布式部分信息谜题（DPIP），用于研究认知不对称情况下的共同基础构建。它评估了两种范式：最先进的大型语言模型（LLMs）和基于动态认知逻辑（DEL）的公理化流水线。结果表明，现代LLMs难以同时跟踪任务进展和信念状态，突显了在此类场景中构建共同基础的挑战。

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Authors: Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover

First: 2026-01-26T17:56:50+00:00 · Latest: 2026-03-05T18:19:57+00:00

Comments: Code is released here: https://github.com/siyan-zhao/OPSD

Abstract

Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 8-12x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.

中文标题/摘要

标题：自我蒸馏推理器：面向大规模语言模型的在线策略自我蒸馏

知识蒸馏通过压缩教师大规模语言模型（LLM）的知识来训练较小的LLM，从而改善大型语言模型的推理能力。在线策略蒸馏通过让学生在教师LLM提供密集的标记级监督的同时采样自己的轨迹，推进了这一方法，解决了与离线策略蒸馏方法之间训练和推理分布不匹配的问题。然而，在线策略蒸馏通常需要一个单独的、通常更大的教师LLM，并且不明确利用推理数据集中可用的真实解决方案。受足够强大的LLM能够合理化外部特权推理轨迹并教其较弱版本（即没有访问特权信息的版本）这一直觉的启发，我们引入了在线策略自我蒸馏（OPSD）框架，其中单个模型同时作为教师和学生，通过不同的上下文进行条件化。教师策略通过条件化特权信息（例如，验证的推理轨迹）进行条件化，而学生策略仅看到问题；训练通过最小化学生自己轨迹上这些分布之间的每个标记差异来实现。我们通过多个数学推理基准展示了我们方法的有效性，与强化学习方法（如GRPO）相比，实现了8-12倍的标记效率，并且优于离线策略蒸馏方法。

Summary / 总结

The research aims to improve large language model reasoning through on-policy self-distillation, where a single model acts as both teacher and student. The method conditions the teacher policy on privileged information and the student policy on the question alone, minimizing token-level divergence during training. Experiments show that OPSD achieves 8-12x token efficiency compared to reinforcement learning methods and outperforms off-policy distillation methods on mathematical reasoning benchmarks.

研究旨在通过在线策略自我蒸馏来提升大型语言模型的推理能力，其中单个模型同时充当教师和学生。教师策略以特权信息为条件，学生策略仅以问题为条件，训练时在学生自己的采样轨迹上最小化两者之间的逐标记分布差异。实验表明，在数学推理基准上，OPSD 的标记效率是 GRPO 等强化学习方法的 8-12 倍，并且优于离线策略蒸馏方法。
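
The training objective described above, a per-token divergence between the teacher and student distributions along the student's own rollout, can be sketched in a few lines. The abstract does not say which divergence is used; the forward KL divergence KL(teacher || student) below is an assumption for illustration, and the toy next-token distributions are invented.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def per_token_distillation_loss(teacher_dists, student_dists):
    """Mean per-token KL(teacher || student) over one student rollout.

    Each list holds one next-token distribution per position of the
    rollout. In the setting of the abstract, the teacher distribution
    would come from the same model conditioned on privileged
    information, and the student distribution from the question alone.
    """
    if len(teacher_dists) != len(student_dists):
        raise ValueError("teacher and student must cover the same positions")
    total = sum(kl_divergence(t, s)
                for t, s in zip(teacher_dists, student_dists))
    return total / len(teacher_dists)

# Toy rollout of two positions over a 2-token vocabulary (assumed values).
teacher = [[0.5, 0.5], [0.9, 0.1]]
student = [[0.25, 0.75], [0.9, 0.1]]
loss = per_token_distillation_loss(teacher, student)
print(loss)
```

The second position contributes zero because the two distributions agree there, so the loss concentrates on positions where the privileged context changes the prediction.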

HydroGEM: A Self Supervised Zero Shot Hybrid TCN Transformer Foundation Model for Continental Scale Streamflow Quality Control

Authors: Ijaz Ul Haq, Byung Suk Lee, Julia N. Perdrial, David Baude

First: 2025-12-16T05:39:26+00:00 · Latest: 2026-03-05T18:19:22+00:00

Comments: Supplementary materials, datasets, and implementation code will be made publicly available upon acceptance for publication in a peer-reviewed journal

Abstract

Advances in sensor networks have enabled real-time stream discharge monitoring, yet persistent sensor malfunctions limit data utility. Manual quality control by expert hydrologists cannot scale with networks generating millions of measurements annually. We introduce HydroGEM, a foundation model for continental-scale streamflow quality control designed to support human expertise. HydroGEM uses self-supervised pretraining on 6.03 million clean sequences from 3,724 USGS stations to learn general hydrological representations, followed by fine-tuning with synthetic anomalies for detection and reconstruction. A hybrid TCN-Transformer architecture (14.2M parameters) captures both local and long-range temporal dependencies, while hierarchical normalization handles six orders of magnitude in discharge. On held-out observations from 799 stations with 18 synthetic anomaly types grounded in USGS standards, HydroGEM achieves F1=0.792 for detection and 68.7% reconstruction error reduction, outperforming the strongest baseline by 36.3%. For cross-national validation on 100 Environment and Climate Change Canada stations using tolerant evaluation with a plus or minus 24-hour buffer, HydroGEM achieves Tolerant F1=0.70 with 90.1% segment-level event detection, demonstrating cross-national generalization. The model maintains consistent detection across correction magnitudes and aligns with operational seasonal patterns, with peak flagging during winter ice-affected periods matching hydrologists' correction behavior. Architectural separation between simplified training anomalies and complex test anomalies confirms that performance reflects learned hydrometric principles rather than pattern memorization.

中文标题/摘要

标题：HydroGEM：一种用于大陆尺度径流质量控制的自监督零样本混合TCN变换器基础模型

传感器网络的进步使得实时径流监测成为可能，但持续的传感器故障限制了数据的实用性。由专家水文学家进行的手动质量控制无法与每年生成数百万测量值的网络规模相匹配。我们引入了HydroGEM，一种用于大陆尺度径流质量控制的基础模型，旨在支持人类专业知识。HydroGEM通过在来自3,724个USGS站的603万条干净序列上进行自监督预训练来学习通用水文学表示，然后通过使用合成异常进行微调以进行检测和重建。混合TCN-变换器架构（14.2M参数）捕捉局部和长程时间依赖性，而分层规范化处理了六个数量级的径流。在来自799个站的18种基于USGS标准的合成异常类型的保留观测中，HydroGEM的检测F1=0.792，重建误差减少68.7%，优于最强基线36.3%。在使用正负24小时缓冲进行容忍评估的100个环境和气候变化加拿大站中，HydroGEM的容忍F1=0.70，段级事件检测率为90.1%，展示了跨国界的泛化能力。该模型在不同校正幅度下保持一致的检测，并与操作季节性模式一致，在冬季冰影响期间的峰值标记与水文学家的校正行为一致。简化训练异常与复杂测试异常之间的架构分离证实了性能反映了学习的水文测量原理而非模式记忆。

Summary / 总结

HydroGEM is a self-supervised, zero-shot hybrid TCN-Transformer foundation model for continental-scale streamflow quality control. It is pretrained on 6.03 million clean sequences from 3,724 USGS stations and fine-tuned with synthetic anomalies for detection and reconstruction. On held-out observations from 799 stations, HydroGEM achieves an F1 score of 0.792 for detection and a 68.7% reduction in reconstruction error, outperforming the strongest baseline by 36.3%. It also generalizes across national borders, reaching a Tolerant F1 of 0.70 on 100 Environment and Climate Change Canada stations, and it detects anomalies consistently across correction magnitudes and in line with operational seasonal patterns.

HydroGEM 是一种用于大陆尺度径流质量控制的基础模型，先进行自监督预训练，再使用合成异常进行微调。该模型的检测 F1 达到 0.792，重建误差减少 68.7%，优于最强基线。它还展示了跨国泛化能力，其标记行为与水文学家的校正行为一致。
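
The "tolerant evaluation with a plus or minus 24-hour buffer" mentioned in the abstract can be read as an F1 score in which a flag counts as correct if it falls within the buffer of a true anomaly. The exact matching rule used in the paper is not given in the abstract; the sketch below is one plausible reading, with timestamps expressed as hours and all values invented for illustration.

```python
def tolerant_f1(pred_hours, true_hours, buffer_hours=24):
    """F1 where predictions and ground truth match within +/- buffer_hours.

    Precision: share of predicted flags that lie within the buffer of
    any true anomaly. Recall: share of true anomalies that have at
    least one predicted flag within the buffer.
    """
    if not pred_hours or not true_hours:
        return 0.0
    hit_pred = sum(any(abs(p - t) <= buffer_hours for t in true_hours)
                   for p in pred_hours)
    hit_true = sum(any(abs(p - t) <= buffer_hours for p in pred_hours)
                   for t in true_hours)
    precision = hit_pred / len(pred_hours)
    recall = hit_true / len(true_hours)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: three flags, two true anomalies (hours since start).
score = tolerant_f1([10, 100, 500], [20, 300])
print(score)
```

A buffer like this avoids penalizing a detector for flagging an event a few hours before or after the hydrologist's correction timestamp, which matters when labels from another agency follow different conventions.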

LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification

Authors: Md Akib Haider, Ahsan Bulbul, Nafis Fuad Shahid, Aimaan Ahmed, Mohammad Ishrak Abedin

First: 2026-03-04T11:36:32+00:00 · Latest: 2026-03-05T18:19:21+00:00

Abstract

Code comment classification is a critical task for automated software documentation and analysis. In the context of the NLBSE'26 Tool Competition, we present LoRA-MME, a Multi-Model Ensemble architecture utilizing Parameter-Efficient Fine-Tuning (PEFT). Our approach addresses the multi-label classification challenge across Java, Python, and Pharo by combining the strengths of four distinct transformer encoders: UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa. By independently fine-tuning these models using Low-Rank Adaptation (LoRA) and aggregating their predictions via a learned weighted ensemble strategy, we maximize classification performance without the memory overhead of full model fine-tuning. Our tool achieved a weighted F1 score of 0.7906 and a macro F1 of 0.6867 on the test set. However, the computational cost of the ensemble resulted in a final submission score of 41.20%, highlighting the trade-off between semantic accuracy and inference efficiency.

中文标题/摘要

标题：LoRA-MME：用于代码注释分类的LoRA调优编码器多模型集成

代码注释分类是自动化软件文档和分析中的关键任务。在NLBSE'26工具竞赛的背景下，我们提出了LoRA-MME，这是一种利用参数高效微调（PEFT）的多模型集成架构。我们的方法结合了四种不同Transformer编码器（UniXcoder、CodeBERT、GraphCodeBERT和CodeBERTa）的优势，以应对跨Java、Python和Pharo的多标签分类挑战。通过使用低秩适应（LoRA）独立微调这些模型，并通过学习得到的加权集成策略聚合它们的预测，我们在避免全模型微调内存开销的同时最大化了分类性能。我们的工具在测试集上取得了0.7906的加权F1分数和0.6867的宏F1分数。然而，集成的计算成本导致最终提交得分为41.20%，突显了语义准确性和推理效率之间的权衡。
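
The aggregation step can be illustrated with a minimal weighted ensemble for multi-label prediction. The abstract says the weights are learned but does not describe how; the weights, per-model probabilities, and the 0.5 decision threshold below are placeholder assumptions, not values from the paper.

```python
def weighted_ensemble(probs_per_model, weights, threshold=0.5):
    """Combine per-model label probabilities with a weighted average.

    probs_per_model: one list of per-label probabilities per model.
    weights: one non-negative weight per model (normalized here).
    Returns (combined_probabilities, binary_predictions).
    """
    if len(probs_per_model) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_labels = len(probs_per_model[0])
    combined = [
        sum(w * probs[j] for w, probs in zip(weights, probs_per_model)) / total
        for j in range(n_labels)
    ]
    return combined, [int(c >= threshold) for c in combined]

# Four encoders, two labels; all numbers are illustrative.
probs = [[0.9, 0.2], [0.6, 0.4], [0.8, 0.1], [0.3, 0.7]]
weights = [0.4, 0.3, 0.2, 0.1]
combined, preds = weighted_ensemble(probs, weights)
print(combined, preds)
```

Each label is decided independently, which is what makes this a multi-label rather than a multi-class combination. Running four encoders per comment is also the source of the inference cost that the abstract says lowered the final submission score.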

NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

Authors: Kanon Amemiya, Daichi Yashima, Kei Katsumata, Takumi Komatsu, Ryosuke Korekata, Seitaro Otsuki, Komei Sugiura

Venue: CVPR 2026

First: 2026-03-05T18:12:29+00:00 · Latest: 2026-03-05T18:12:29+00:00

Comments: Accepted to CVPR 2026 Findings

Abstract

We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.

中文标题/摘要

标题：NaiLIA：基于密集意图描述和调色板查询的多模态指甲设计检索

我们专注于基于密集意图描述检索指甲设计图像的任务，这些描述代表了用户对指甲设计的多层意图。这具有挑战性，因为这些描述指定了未受约束的绘画元素和预先制造的装饰品，以及视觉特征、主题和整体印象。除了这些描述之外，我们假设用户通过颜色拾取器指定零个或多个颜色来提供调色板查询，这使得微妙和连续的颜色细微差别得以表达。现有的视觉-语言基础模型往往难以结合这些描述和调色板。为了解决这个问题，我们提出了NaiLIA，一种针对指甲设计图像的多模态检索方法，在检索过程中全面对齐密集意图描述和调色板查询。我们的方法引入了一种基于未标注图像置信分数的宽松损失，可以与描述对齐。为了评估NaiLIA，我们构建了一个基准，包含来自不同文化背景的10,625张图像。这些图像由超过200名注释者标注了长且密集的意图描述。实验结果表明，NaiLIA优于标准方法。

Summary / 总结

The research aims to develop a method for retrieving nail design images based on detailed intent descriptions and color palette queries. NaiLIA, a multimodal retrieval approach, aligns with both dense descriptions and color palettes, addressing the challenge of unconstrained painted elements and visual characteristics. Experiments show that NaiLIA outperforms standard methods on a benchmark of 10,625 images with diverse cultural backgrounds and detailed annotations.

研究旨在开发一种基于详细意图描述和颜色调色板查询的指甲设计图像检索方法。NaiLIA是一种多模态检索方法，能够同时与密集意图描述和颜色调色板对齐，解决了现有视觉-语言模型的局限性。该方法引入了一种基于未标注图像置信分数的松弛损失，提高了与描述的对齐度。实验结果显示，NaiLIA在包含10,625张具有不同文化背景图像的基准上优于标准方法，并且这些图像由超过200名注释者进行了详细的标注。

Agentic Very Long Video Understanding

Authors: Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, Hyo Jin Kim

First: 2026-01-26T05:20:47+00:00 · Latest: 2026-03-05T18:12:22+00:00

Comments: 27 pages, 7 figures, 8 tables

Abstract

The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks. Code is available at https://github.com/facebookresearch/egagent.

中文标题/摘要

标题：代理型非常长视频理解

随着全天候可穿戴设备如智能眼镜所支持的始终在线个人AI助手的出现，需要一种新的上下文理解水平，这种理解超越了短暂孤立的事件，涵盖了持续的、纵向的自我中心视频流。实现这一愿景需要在长时视频理解方面取得进展，其中系统必须解释和回忆跨越数天甚至数周的视觉和音频信息。现有的方法，包括大型语言模型和检索增强生成，受限于有限的上下文窗口，并且缺乏在非常长的视频流上进行组合式、多跳推理的能力。在本文中，我们通过EGAgent这一增强的代理框架来应对这些挑战，该框架以实体场景图为中心，表示随着时间推移的人、地点、物体及其关系。我们的系统为规划代理配备了结构化搜索和推理这些图的工具，以及混合视觉和音频搜索能力，从而实现详细、跨模态和时间连贯的推理。在EgoLifeQA和Video-MME（长）数据集上的实验表明，我们的方法在EgoLifeQA上达到了最先进的性能（57.5%），在Video-MME（长）上也取得了竞争力的性能（74.1%），用于复杂的纵向视频理解任务。代码可在https://github.com/facebookresearch/egagent/ 获取。

Summary / 总结

This paper addresses the need for long-horizon video understanding in the context of always-on personal AI assistants, proposing EGAgent, an enhanced agentic framework that uses entity scene graphs to represent and reason over people, places, and objects over time. Experiments show that EGAgent outperforms existing methods on EgoLifeQA with 57.5% accuracy and achieves competitive performance on Video-MME (Long) with 74.1% accuracy for complex longitudinal video understanding tasks.

本文针对始终在线的个人AI助手对长时段视频理解的需求，提出了一种增强的代理框架EGAgent，该框架使用实体场景图来表示和推理长时间的视频流。该系统包括结构化搜索和推理的工具，以及混合视觉和音频搜索能力。实验表明，EGAgent在EgoLifeQA上优于现有方法（57.5%），并在Video-MME（长）上实现了竞争力的表现（74.1%）。

LLEMA: Evolutionary Search with LLMs for Multi-Objective Materials Discovery

Authors: Nikhil Abhyankar, Sanchit Kabra, Saaketh Desai, Chandan K. Reddy

Venue: ICLR 2026

First: 2025-10-26T02:47:15+00:00 · Latest: 2026-03-05T18:02:19+00:00

Comments: ICLR 2026

Abstract

Materials discovery requires navigating vast chemical and structural spaces while satisfying multiple, often conflicting, objectives. We present LLM-guided Evolution for MAterials discovery (LLEMA), a unified framework that couples the scientific knowledge embedded in large language models with chemistry-informed evolutionary rules and memory-based refinement. At each iteration, an LLM proposes crystallographically specified candidates under explicit property constraints; a surrogate-augmented oracle estimates physicochemical properties; and a multi-objective scorer updates success/failure memories to guide subsequent generations. Evaluated on 14 realistic tasks that span electronics, energy, coatings, optics, and aerospace, LLEMA discovers candidates that are chemically plausible, thermodynamically stable, and property-aligned, achieving higher hit rates and improved Pareto front quality relative to generative and LLM-only baselines. Ablation studies confirm the importance of rule-guided generation, memory-based refinement, and surrogate prediction. By enforcing synthesizability and multi-objective trade-offs, LLEMA provides a principled approach to accelerating practical materials discovery. Project website: https://scientific-discovery.github.io/llema-project/

中文标题/摘要

标题：LLEMA：使用LLM的多目标材料发现进化搜索

材料发现需要在广泛的化学和结构空间中导航，同时满足多个常常相互冲突的目标。我们提出了由大语言模型引导的材料发现进化（LLEMA），这是一种统一框架，将大型语言模型中嵌入的科学知识与化学启发的进化规则和基于记忆的改进相结合。在每次迭代中，一个LLM在明确的性质约束下提出晶体学指定的候选物；一个增强的代理评估物理化学性质；一个多目标评分器更新成功/失败记忆以指导后续代。在涵盖电子、能源、涂层、光学和航空航天的14个现实任务上进行评估，LLEMA发现的候选物在化学上合理、热力学稳定且性质对齐，相对于生成性和仅LLM基线，其命中率更高，帕累托前沿质量更好。消融研究证实了规则引导生成、基于记忆的改进和代理预测的重要性。通过强制执行合成性和多目标权衡，LLEMA为加速实际材料发现提供了一种原则性的方法。项目网站：https://scientific-discovery.github.io/llema-project/

Summary / 总结

LLEMA is a framework that uses large language models to guide the evolutionary search for materials, integrating scientific knowledge, chemistry-informed rules, and memory-based refinement. It proposes crystallographically specified candidates, estimates their properties using a surrogate model, and updates success/failure memories to improve future generations. LLEMA outperforms generative and LLM-only baselines in discovering chemically plausible, thermodynamically stable, and property-aligned materials across various applications, demonstrating the importance of rule-guided generation, memory-based refinement, and surrogate prediction.

LLEMA 是一个统一框架，结合了大型语言模型、化学导向的进化规则和基于记忆的改进，用于多目标材料发现。每次迭代中，LLM 在属性约束下提出晶体结构指定的候选材料，伴随的近似模型估计物理化学性质，多目标评分器更新成功/失败记忆以指导后续代。LLEMA 在涵盖电子、能源、涂层、光学和航空航天的 14 个任务中表现出更高的命中率和改进的帕累托前沿质量，优于生成性和仅 LLM 基准。消融研究强调了规则导向生成、基于记忆的改进和近似预测的重要性。
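
Since the abstract evaluates candidates by Pareto front quality, a short sketch of how a Pareto front is extracted from multi-objective scores may help. This is the generic non-dominated filter, not LLEMA's scorer; the two objectives and the candidate scores are invented, and both objectives are assumed to be maximized.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and
    strictly better on at least one (all objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Toy scores: (objective_1, objective_2), e.g. two target properties.
candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
front = pareto_front(candidates)
print(front)
```

With conflicting objectives there is usually no single best candidate, so comparing methods by the quality of the whole front, as the abstract does, is more informative than comparing one scalar score.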

Latent Wasserstein Adversarial Imitation Learning

Authors: Siqi Yang, Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

Venue: ICLR 2026

First: 2026-03-05T18:01:49+00:00 · Latest: 2026-03-05T18:01:49+00:00

Comments: 10 pages, accepted to ICLR 2026

Abstract

Imitation Learning (IL) enables agents to mimic expert behavior by learning from demonstrations. However, traditional IL methods require large amounts of medium-to-high-quality demonstrations as well as actions of expert demonstrations, both of which are often unavailable. To reduce this need, we propose Latent Wasserstein Adversarial Imitation Learning (LWAIL), a novel adversarial imitation learning framework that focuses on state-only distribution matching. It benefits from the Wasserstein distance computed in a dynamics-aware latent space. This dynamics-aware latent space differs from prior work and is obtained via a pre-training stage, where we train the Intention Conditioned Value Function (ICVF) to capture a dynamics-aware structure of the state space using a small set of randomly generated state-only data. We show that this enhances the policy's understanding of state transitions, enabling the learning process to use only one or a few state-only expert episodes to achieve expert-level performance. Through experiments on multiple MuJoCo environments, we demonstrate that our method outperforms prior Wasserstein-based IL methods and prior adversarial IL methods, achieving better results across various tasks.

中文标题/摘要

标题：潜隐 Wasserstein 对抗模仿学习

模仿学习（IL）使代理能够通过学习演示来模仿专家行为。然而，传统的IL方法需要大量中等到高质量的演示以及专家演示的动作，这两种情况通常都不可用。为了减少这种需求，我们提出了潜隐 Wasserstein 对抗模仿学习（LWAIL），这是一种新颖的对抗模仿学习框架，专注于仅状态分布匹配。它得益于在动态感知潜隐空间中计算的 Wasserstein 距离。这种动态感知潜隐空间不同于先前的工作，并通过预训练阶段获得，其中我们使用少量随机生成的仅状态数据训练意图条件价值函数（ICVF），以捕捉状态空间的动态感知结构。我们表明，这增强了策略对状态转换的理解，使学习过程仅使用一个或几个仅状态专家演示就能达到专家水平的表现。通过在多个 MuJoCo 环境中的实验，我们证明了我们的方法优于先前的基于 Wasserstein 的 IL 方法和先前的对抗 IL 方法，在各种任务上取得了更好的结果。

Summary / 总结

The research addresses the dependence of Imitation Learning (IL) on large amounts of expert demonstrations and on expert actions by proposing Latent Wasserstein Adversarial Imitation Learning (LWAIL). LWAIL matches state-only distributions using a Wasserstein distance computed in a dynamics-aware latent space, which is obtained by pre-training an Intention Conditioned Value Function (ICVF) on a small set of randomly generated state-only data. This allows the policy to reach expert-level performance from only one or a few state-only expert episodes. Experiments on multiple MuJoCo environments show that LWAIL outperforms prior Wasserstein-based and adversarial IL methods across various tasks.

研究旨在通过提出Latent Wasserstein Adversarial Imitation Learning (LWAIL)来解决传统模仿学习（IL）需要大量专家演示的问题。LWAIL关注使用动态感知的潜在空间进行仅状态分布匹配，并通过Intention Conditioned Value Function (ICVF)在随机生成的状态数据上进行预训练。实验结果表明，LWAIL在MuJoCo环境中优于之前的Wasserstein基模仿学习和对抗模仿学习方法，能够用更少的专家演示达到专家水平的表现。

OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation

Authors: Yoonjin Oh, Yongjin Kim, Hyomin Kim, Donghwan Chi, Sungwoong Kim

First: 2025-05-28T03:45:42+00:00 · Latest: 2026-03-05T18:01:26+00:00

Comments: 11 pages, 6 figures

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text-image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relations. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting self-constructed preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object hallucination to persist. To tackle this problem, we propose Object-centric Self-improving Preference Optimization (OSPO), a self-improving framework designed to enhance object-level text-image alignment. OSPO explicitly constructs object-centric preference data without relying on any external data or external models. We also introduce a new approach that leverages attention-based object masks together with an object-weighted SimPO loss to enhance object-specific fidelity. Extensive experiments on three compositional image generation benchmarks demonstrate that OSPO significantly improves fine-grained alignment and reduces object hallucination, outperforming prior self-improving methods and even specialized diffusion-based text-to-image models.

中文标题/摘要

标题：OSPO：面向对象的自我改进偏好优化以实现文本到图像生成

近期多模态大型语言模型（MLLMs）的发展使统一的多模态理解和生成成为可能。然而，它们仍然难以实现精细的文本-图像对齐，经常无法准确描绘具有正确属性（如颜色、形状和空间关系）的对象。为解决这一问题，先前的研究探索了偏好优化方法，如DPO和GRPO，但这些方法在构建偏好数据和执行优化方面会带来巨大的计算成本。这促使了自我改进的偏好优化方法的发展，在这些方法中，MLLM自主生成自己的训练数据，自我估计偏好反馈，并使用由此产生的自我构建的偏好对进行自我优化。然而，现有的自我改进方法仍然忽视了细粒度的对象级语义，允许对象幻觉持续存在。为解决这一问题，我们提出了面向对象的自我改进偏好优化（OSPO），这是一种旨在增强对象级文本-图像对齐的自我改进框架。OSPO明确构建了面向对象的偏好数据，不依赖于任何外部数据和外部模型。我们还引入了一种新的方法，利用基于注意力的对象掩码与对象加权SimPO损失相结合，以增强对象特定的保真度。在三个组成图像生成基准上的广泛实验表明，OSPO显著提高了细粒度对齐并减少了对象幻觉，优于先前的自我改进方法，甚至优于专门的基于扩散的文本到图像模型。

Summary / 总结

The research aims to improve the fine-grained alignment between text and images generated by Multimodal Large Language Models (MLLMs) by addressing the issue of object hallucination. OSPO, a self-improving framework, is proposed to autonomously generate training data and optimize preferences without external data or models. It uses attention-based object masks and an object-weighted SimPO loss to enhance object-specific fidelity. Experiments show that OSPO significantly improves fine-grained alignment and reduces object hallucination, outperforming previous self-improving methods and specialized diffusion-based models.

研究旨在通过解决对象幻觉问题，提高多模态大型语言模型（MLLMs）生成的文本与图像之间的细粒度对齐。提出的对象中心自改进偏好优化（OSPO）框架能够自主生成训练数据并优化偏好，无需外部数据或模型。OSPO利用基于注意力的对象掩码和对象加权的SimPO损失来增强对象特定的保真度。在三个基准测试上的实验表明，OSPO在细粒度对齐和减少对象幻觉方面优于之前的自改进方法，甚至优于专门的扩散基文本到图像模型。
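The OSPO entry above mentions an object-weighted SimPO loss but gives no formula. The sketch below shows the standard SimPO objective, `-log sigmoid(beta * (avg_logp_chosen - avg_logp_rejected) - gamma)`, and then generalizes the length-normalized average to a weighted average over tokens. The weighting scheme, the function names, and the default `beta` and `gamma` values are assumptions made for illustration; the paper's actual loss may differ.

```python
import math
from typing import List, Optional


def weighted_avg_logprob(token_logps: List[float],
                         weights: Optional[List[float]] = None) -> float:
    """Weighted mean of per-token log-probabilities (uniform weights if none given)."""
    if weights is None:
        weights = [1.0] * len(token_logps)
    return sum(w * lp for w, lp in zip(weights, token_logps)) / sum(weights)


def simpo_loss(chosen_logps: List[float],
               rejected_logps: List[float],
               beta: float = 2.0,
               gamma: float = 0.5,
               chosen_weights: Optional[List[float]] = None,
               rejected_weights: Optional[List[float]] = None) -> float:
    """SimPO loss; with weights it becomes a hypothetical object-weighted variant."""
    margin = beta * (
        weighted_avg_logprob(chosen_logps, chosen_weights)
        - weighted_avg_logprob(rejected_logps, rejected_weights)
    ) - gamma
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical per-token log-probs; the last two tokens are assumed to lie on the object.
chosen = [-0.2, -0.3, -0.1, -0.1]
rejected = [-0.2, -0.3, -1.5, -1.8]
object_weights = [1.0, 1.0, 3.0, 3.0]

plain = simpo_loss(chosen, rejected)
weighted = simpo_loss(chosen, rejected,
                      chosen_weights=object_weights,
                      rejected_weights=object_weights)
print(plain, weighted)
```

With uniform weights the function reduces to ordinary SimPO. Up-weighting the object tokens widens the margin in this toy case because the two responses differ only on those tokens.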

SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

Authors: Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Minju Jeon, Hyungee Kim, Dong-Jin Kim

Venue: CVPR 2026

First: 2026-03-05T17:59:58+00:00 · Latest: 2026-03-05T17:59:58+00:00

Comments: Accepted to CVPR 2026

Abstract

Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos using models trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, existing methods focus merely on generating non-overlapping masks without considering their semantic relationship to the corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.

中文标题/摘要

标题：SAIL：基于相似性感知的指导与跨图例增强学习的弱监督密集视频图释

弱监督密集视频图释旨在仅基于字幕注释训练时，定位和描述视频中的事件，而无需时间边界。先前的工作引入了一种基于高斯掩码和互补字幕的隐式监督范式。然而，现有方法仅关注生成不重叠的掩码，而未考虑其与相应事件的语义关系，导致生成简单且均匀分布的掩码，无法捕捉到语义上有意义的区域。此外，仅依赖真实字幕会导致性能不佳，因为现有数据集的稀疏性。在本工作中，我们提出了SAIL，通过跨模态对齐构建语义感知的掩码。我们的相似性感知训练目标引导掩码强调与相应事件字幕高度相似的视频区域。此外，为了在稀疏注释设置下引导更准确的掩码生成，我们引入了一种基于LLM的增强策略，生成合成字幕以提供额外的对齐信号。这些合成字幕通过跨掩码机制整合，为精确的时间定位提供辅助指导，而不损害主要目标。在ActivityNet Captions和YouCook2上的实验表明，SAIL在图释和定位指标上均达到了最先进的性能。

Summary / 总结

The research aims to improve weakly-supervised dense video captioning by addressing two limitations of existing methods: simplistic, uniformly distributed masks and reliance on sparse ground-truth captions. SAIL proposes a similarity-aware training objective that guides masks toward video regions with high similarity to their event captions, and an LLM-based augmentation strategy that generates synthetic captions as additional alignment signals, incorporated through an inter-mask mechanism. Experiments on ActivityNet Captions and YouCook2 show state-of-the-art performance on both captioning and localization metrics.

SAIL 通过跨模态对齐构建语义感知的掩码，使用相似性感知的训练目标强调与事件描述高度相似的视频区域，并引入基于LLM的增强策略生成合成描述以提供额外的对齐信号。实验表明，SAIL 在 ActivityNet Captions 和 YouCook2 上在描述和定位指标上均优于现有方法。

On-Policy Self-Distillation for Reasoning Compression

Authors: Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun

First: 2026-03-05T17:54:40+00:00 · Latest: 2026-03-05T17:54:40+00:00

Abstract

Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token.

中文标题/摘要

标题：基于策略自我蒸馏的推理压缩

推理模型会大声思考，但其中大部分内容都是噪音。我们引入了OPSDC（基于策略自我蒸馏的推理压缩），一种通过将模型自身的简洁行为蒸馏回自身来教会模型更简洁地推理的方法。整个方法归结为一个想法：在“要简洁”的指令下条件化同一个模型以获得教师概率分布，并在学生的自我展开中最小化每个词元的逆KL散度。无需真实答案，无需词元预算，无需难度估计器。只有自我蒸馏。然而，这种简单性背后隐藏着惊人的复杂性：OPSDC会自动对简单问题进行激进压缩，同时保留解决难题所需的推理。在Qwen3-8B和Qwen3-14B上，我们在MATH-500上实现了57-59%的词元减少，同时准确率提高了9-16个绝对点。在AIME 2024上，14B模型在41%的压缩下获得了10分的提升。秘诀在于，推理模型生成的内容不仅冗余，而且是主动有害的，每个不必要的词元都会累积错误。

Summary / 总结

OPSDC (On-Policy Self-Distillation for Reasoning Compression) is a method that teaches reasoning models to reason more concisely by distilling their own concise behavior back into themselves. The same model is conditioned on a "be concise" instruction to obtain teacher logits, and the per-token reverse KL is minimized on the student's own rollouts, with no ground-truth answers, token budgets, or difficulty estimators. On Qwen3-8B and Qwen3-14B, OPSDC achieves 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The authors attribute these gains to much of the reasoning output being not merely redundant but harmful, with each unnecessary token compounding errors.

OPSDC（On-Policy Self-Distillation for Reasoning Compression）是一种通过将模型自己的简洁行为反向蒸馏回自身来教会模型更简洁推理的方法。该方法通过将模型条件化在'简洁'指令上以获得教师概率，并在学生的卷出上最小化每个令牌的反KL散度。在Qwen3-8B和Qwen3-14B上，OPSDC在MATH-500上实现了57-59%的令牌减少，同时提高了9-16个点的准确性。在AIME 2024上，14B模型通过41%的压缩获得了10个点的进步，这表明推理模型生成的内容不仅冗余，而且有害，每个不必要的令牌都会加剧错误的累积。
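The core quantity in the OPSDC entry above, a per-token reverse KL between student and teacher, can be illustrated in a few lines. This is a minimal sketch on toy logits, assuming the teacher logits come from the same model prompted with a "be concise" instruction and that per-token values are averaged over the rollout. The averaging and all names are assumptions; the abstract does not state how tokens are aggregated.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: List[float], teacher_logits: List[float]) -> float:
    """KL(student || teacher) at a single token position."""
    log_p = log_softmax(student_logits)  # student distribution
    log_q = log_softmax(teacher_logits)  # teacher distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def rollout_loss(student_seq: List[List[float]],
                 teacher_seq: List[List[float]]) -> float:
    """Mean per-token reverse KL over one student rollout (aggregation is assumed)."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)


# Toy logits over a 3-token vocabulary at two positions of a student rollout.
student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
print(rollout_loss(student, teacher))
```

The first position contributes zero because student and teacher agree there; only the second position, where the teacher is more peaked, produces a positive penalty. In the reverse direction the expectation is taken under the student's own distribution, which matches the "on-policy" framing of the abstract.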

Ensembling Language Models with Sequential Monte Carlo

Authors: Robin Shing Moon Chan, Tianyu Liu, Samuel Kiegeland, Clemente Pasti, Jacob Hoover Vigly, Timothy J. O'Donnell, Ryan Cotterell, Tim Vieira

First: 2026-03-05T17:54:31+00:00 · Latest: 2026-03-05T17:54:31+00:00

Abstract

Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.

中文标题/摘要

标题：使用序列蒙特卡洛方法集成语言模型

从业者可以访问大量语言模型和提示策略以解决许多语言建模任务；然而，先前的工作表明，模型性能对这些选择的高度敏感。经典的机器学习集成技术提供了一种原则性的方法：通过聚合多个来源的预测来实现比任何单一来源更好的性能。然而，在解码过程中将集成应用于语言模型是具有挑战性的：简单地聚合下一个标记的概率会产生一个局部归一化、有偏的近似集成字符串分布，该分布通常是不可计算的。在本文中，我们介绍了一种统一框架，用于将 $K$ 个语言模型组合成一系列函数 $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$ 的 $f$-集成分布。为了从这些分布中采样，我们提出了一种字节级序列蒙特卡洛（SMC）算法，该算法在共享字符空间中运行，使得具有不同词汇表的模型的集成能够一致地采样。我们评估了不同提示和模型组合下的 $f$-集成分布，涵盖了各种结构化文本生成任务，强调了替代聚合策略相对于传统概率平均的优势，并展示了更好的后验近似可以提高集成性能。

Summary / 总结

This work addresses the challenge of ensembling language models during decoding by introducing a unified framework for composing multiple language models into ensemble distributions using various functions. The authors propose a byte-level sequential Monte Carlo algorithm to sample from these distributions, enabling consistent sampling across models with different vocabularies. Experiments across various structured text generation tasks demonstrate that alternative aggregation strategies can outperform traditional probability averaging, leading to improved ensemble performance.

该研究通过引入一种统一框架，使用多种函数将多个语言模型组合成集成分布，解决了在解码过程中集成语言模型的挑战。作者提出了一种字节级别的顺序蒙特卡洛算法来从这些分布中采样，该算法能够处理具有不同词汇表的模型。实验结果表明，替代聚合策略优于传统的概率平均，从而在结构化文本生成任务中提高了集成性能。
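The claim in the entry above, that naively aggregating next-token probabilities gives a locally normalized, biased approximation of the ensemble distribution over strings, can be checked by brute force on a tiny example. The sketch below assumes a product aggregation, a two-symbol vocabulary, and fixed-length strings so the true ensemble can be enumerated; the two toy models are invented for illustration. It does not implement the paper's SMC sampler.

```python
import math
from itertools import product
from typing import Dict, List

Model = Dict[str, Dict[str, float]]  # prefix -> next-token distribution
VOCAB = "ab"
LENGTH = 2  # fixed string length, an assumption that keeps enumeration trivial

# Two hypothetical models that disagree sharply after the prefix "a".
MODEL_1: Model = {"": {"a": 0.5, "b": 0.5}, "a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}
MODEL_2: Model = {"": {"a": 0.5, "b": 0.5}, "a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}


def all_strings() -> List[str]:
    return ["".join(t) for t in product(VOCAB, repeat=LENGTH)]


def string_prob(model: Model, s: str) -> float:
    """Probability of the whole string under one autoregressive model."""
    return math.prod(model[s[:i]][tok] for i, tok in enumerate(s))


def global_ensemble(models: List[Model]) -> Dict[str, float]:
    """Product of string probabilities, normalized once over all strings."""
    scores = {s: math.prod(string_prob(m, s) for m in models) for s in all_strings()}
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}


def local_ensemble(models: List[Model]) -> Dict[str, float]:
    """Product of next-token probabilities, renormalized at every step."""
    out = {}
    for s in all_strings():
        p = 1.0
        for i, tok in enumerate(s):
            step = {t: math.prod(m[s[:i]][t] for m in models) for t in VOCAB}
            p *= step[tok] / sum(step.values())
        out[s] = p
    return out


target = global_ensemble([MODEL_1, MODEL_2])
naive = local_ensemble([MODEL_1, MODEL_2])
for s in all_strings():
    print(s, round(target[s], 4), round(naive[s], 4))
```

In this toy case the locally normalized scheme assigns 0.25 to every string, while the globally normalized product puts only about 0.13 on each of "aa" and "ab", because the models disagree after "a". Per-step renormalization discards exactly that disagreement, which is the bias the abstract describes.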

RA-QA: A Benchmarking System for Respiratory Audio Question Answering Under Real-World Heterogeneity

Authors: Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo

First: 2026-02-04T13:25:47+00:00 · Latest: 2026-03-05T17:54:01+00:00

Abstract

As conversational multimodal AI tools are increasingly adopted to process patient data for health assessment, robust benchmarks are needed to measure progress and expose failure modes under realistic conditions. Despite the importance of respiratory audio for mobile health screening, respiratory audio question answering remains underexplored, with existing studies evaluated narrowly and lacking real-world heterogeneity across modalities, devices, and question types. We hence introduce the Respiratory-Audio Question-Answering (RA-QA) benchmark, including a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol. RA-QA harmonizes public RA datasets into a collection of 9 million format-diverse QA pairs covering diagnostic and contextual attributes. We benchmark classical ML baselines alongside multimodal audio-language models, establishing reproducible reference points and showing how current approaches fail under heterogeneity.

中文标题/摘要

标题：RA-QA：在现实世界异质性条件下呼吸音频问答基准系统

随着对话型多模态AI工具被越来越多地采用来处理患者数据以进行健康评估，需要稳健的基准来衡量在现实条件下的进步并揭示失败模式。尽管呼吸音频对于移动健康筛查至关重要，但呼吸音频问答仍被研究不足，现有研究评估狭窄且缺乏跨模态、设备和问题类型的现实世界异质性。因此，我们引入了呼吸音频问答（RA-QA）基准，包括标准化的数据生成管道、全面的多模态问答集合以及统一的评估协议。RA-QA 将公共呼吸音频数据集整合为包含900万对格式多样的问答对，涵盖诊断和上下文属性。我们基准测试了经典机器学习基线和多模态音频-语言模型，建立了可重复的参考点，并展示了当前方法在异质性条件下失效的情况。

Summary / 总结

The RA-QA benchmark evaluates respiratory audio question answering under real-world conditions, addressing the lack of robust benchmarks in this area. It includes a standardized data generation pipeline, a comprehensive multimodal QA collection of 9 million format-diverse QA pairs, and a unified evaluation protocol. Benchmarking classical ML baselines alongside multimodal audio-language models establishes reproducible reference points and shows that current approaches fail under heterogeneity.

RA-QA基准旨在评估呼吸音频问答系统在真实世界条件下的性能，解决了该领域缺乏稳健基准的问题。该基准包括标准化的数据生成管道、包含900万对格式多样问答对的多模态问答集合，以及统一的评估协议。作者对经典机器学习基线和多模态音频-语言模型进行了基准测试，建立了可复现的参考点，并表明当前方法在异质性条件下会失效。

RelaxFlow: Text-Driven Amodal 3D Generation

Authors: Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao

First: 2026-03-05T17:45:47+00:00 · Latest: 2026-03-05T17:45:47+00:00

Comments: Code: https://github.com/viridityzhu/RelaxFlow

Abstract

Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.

中文标题/摘要

标题：RelaxFlow：文本驱动的非可见3D生成

从图像到3D的生成面临着在遮挡下固有的语义模糊性，其中仅部分观察往往不足以确定物体类别。在本文中，我们形式化了文本驱动的非可见3D生成，其中文本提示引导未见区域的完成，同时严格保留输入观察。关键的是，我们发现这些目标需要不同的控制粒度：对观察进行刚性控制，而对提示进行放松结构控制。为此，我们提出了一种无需训练的双分支框架RelaxFlow，通过多先验一致性模块和放松机制解耦控制粒度。理论上，我们证明了我们的放松等同于在生成向量场中应用低通滤波器，这抑制了高频实例细节以隔离几何结构，以适应观察。为了便于评估，我们引入了两个诊断基准，ExtremeOcc-3D和AmbiSem-3D。广泛的实验表明，RelaxFlow成功地引导了未见区域的生成以匹配提示意图，而不牺牲视觉保真度。

Summary / 总结

The research addresses the challenge of generating complete 3D models from partial observations using text prompts. It introduces RelaxFlow, a training-free framework that uses a Multi-Prior Consensus Module and a Relaxation Mechanism to decouple control granularity between the observed regions and the prompt-driven regions. Theoretical analysis shows that this relaxation suppresses high-frequency details to focus on geometric structure. Experiments on new diagnostic benchmarks demonstrate that RelaxFlow effectively matches the prompt intent while maintaining visual fidelity.

研究旨在通过文本提示从部分观察中生成完整的3D模型。提出了RelaxFlow框架，该框架通过多先验一致性模块和放松机制解耦观察区域和未观察区域的控制粒度。方法确保观察部分的刚性控制，同时允许对未观察部分的灵活结构控制。实验表明，RelaxFlow能够生成与文本提示一致的未观察区域，同时保持视觉保真度。
