arXiv Paper Digest

2026-03-07 03:39
Snapshot: 20260307_0339
Accelerating Text-to-Video Generation with Calibrated Sparse Attention
Authors: Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar
First: 2026-03-05T18:59:32+00:00 · Latest: 2026-03-05T18:59:32+00:00
Abstract
Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
Summary
This paper addresses the slow runtime of text-to-video diffusion models by proposing CalibAtt, a training-free method based on calibrated sparse attention. An offline calibration pass identifies block-level token connections that contribute negligibly to the attention result, and these patterns are compiled into optimized attention operations for each layer, head, and diffusion timestep so that the unselected connections can be skipped at inference. CalibAtt achieves up to 1.58x end-to-end speedup while maintaining video quality and text-video alignment.
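The idea of computing only a pre-selected set of block-to-block connections can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the boolean `block_mask` layout, and the fixed `block_size` are assumptions made for illustration, and a real system would use fused GPU kernels rather than a Python loop.

```python
import numpy as np

def dense_attention(q, k, v):
    """Standard softmax attention for one head; q, k, v have shape (n, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def block_sparse_attention(q, k, v, block_mask, block_size):
    """Attention restricted to the (query block, key block) pairs kept in block_mask.

    block_mask[i, j] is True when query block i attends to key block j.
    Assumes n is a multiple of block_size and every row keeps at least one block.
    """
    n_blocks = q.shape[0] // block_size
    out = np.zeros_like(v)
    for i in range(n_blocks):
        rows = slice(i * block_size, (i + 1) * block_size)
        kept = np.flatnonzero(block_mask[i])
        cols = np.concatenate(
            [np.arange(j * block_size, (j + 1) * block_size) for j in kept]
        )
        # Only the kept key/value blocks are ever touched for this query block.
        out[rows] = dense_attention(q[rows], k[cols], v[cols])
    return out
```

With an all-True mask the sketch reproduces dense attention exactly; the speedup in practice comes from how many blocks the calibrated mask drops.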
POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation
Authors: Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu
First: 2026-03-05T18:59:23+00:00 · Latest: 2026-03-05T18:59:23+00:00
Comments: Technical report v1 (14 pages, 7 figures, project page: https://spherelab.ai/poetx/)
Abstract
Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, and in contrast, standard optimizers such as AdamW run out of memory under the same settings.
Summary
POET-X is a scalable, memory-efficient variant of the Reparameterized Orthogonal Equivalence Training (POET) framework, designed to remove the high memory consumption and computational overhead of the original implementation. By performing orthogonal equivalence transformations at reduced computational cost, POET-X keeps the generalization and stability benefits of POET while improving throughput and memory efficiency. In the reported experiments, POET-X pretrains billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.
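The "spectrum-preserving" property of an orthogonal equivalence transformation can be checked numerically. The sketch below assumes the transformation has the form W = R W0 P with orthogonal R and P; the helper names are hypothetical and the random orthogonal matrices stand in for whatever parameterization the method actually learns.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal

def orthogonal_equivalence(w0, r, p):
    """Apply W = R @ W0 @ P (assumed form of the transformation)."""
    return r @ w0 @ p

rng = np.random.default_rng(0)
w0 = rng.normal(size=(6, 4))       # fixed initial weight matrix
r = random_orthogonal(6, rng)      # left orthogonal factor
p = random_orthogonal(4, rng)      # right orthogonal factor
w = orthogonal_equivalence(w0, r, p)

# Singular values (the spectrum) are unchanged by the transformation.
sv_before = np.linalg.svd(w0, compute_uv=False)
sv_after = np.linalg.svd(w, compute_uv=False)
```

Because only R and P change during training under this form, the singular values of the weight matrix stay fixed at their initial values.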
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda
First: 2026-03-05T18:58:14+00:00 · Latest: 2026-03-05T18:58:14+00:00
Abstract
Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
Summary
This study evaluates honesty elicitation and lie detection techniques on open-weights LLMs from Chinese developers that are trained to censor politically sensitive topics. Qwen3 models serve as the testbed: they frequently produce falsehoods about sensitive topics while occasionally answering correctly, which indicates that they hold knowledge they are trained to suppress. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses, and the strongest techniques also transfer to frontier open-weights models including DeepSeek R1. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. No technique fully eliminates false responses. All prompts, code, and transcripts are released.
AgentIR: Reasoning-Aware Retrieval for Deep Research Agents
Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, Victor Zhong
First: 2026-03-04T18:47:26+00:00 · Latest: 2026-03-05T18:56:37+00:00
Abstract
Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50% with conventional embedding models twice its size, and 37% with BM25. Code and data are available at: https://texttron.github.io/AgentIR/.
Summary
The research aims to improve retrieval for Deep Research agents by using the explicit reasoning they produce before each search call. It introduces Reasoning-Aware Retrieval, which jointly embeds the agent's reasoning trace with its query, and DR-Synth, which synthesizes Deep Research retriever training data from standard QA datasets. Combining the two yields the embedding model AgentIR-4B, which reaches 68% accuracy on the BrowseComp-Plus benchmark with the Tongyi-DeepResearch agent, compared to 50% for conventional embedding models twice its size and 37% for BM25.
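A toy sketch can show why embedding the reasoning trace together with the query changes what is retrieved. Everything here is a stand-in: the bag-of-words `embed` function replaces a trained embedding model, and the simple string concatenation is an assumed way of combining reasoning and query, not the format used by AgentIR-4B.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(
        sum(x * x for x in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, docs, reasoning=""):
    """Return the best-matching doc; the reasoning trace, if given, is embedded with the query."""
    q = embed(f"{reasoning} {query}".strip())
    return max(docs, key=lambda d: cosine(q, embed(d)))

docs = [
    "jaguar cat habitat",
    "jaguar car engine specifications and models",
]
query_only = retrieve("jaguar speed", docs)
with_reasoning = retrieve(
    "jaguar speed", docs, reasoning="need car engine specifications"
)
```

In this example the bare query is ambiguous and matches the animal document, while the query plus reasoning matches the car document, because the reasoning carries the intent that the query alone omits.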
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Authors: Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo
First: 2026-03-05T18:55:16+00:00 · Latest: 2026-03-05T18:55:16+00:00
Abstract
We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task-difficulty-specific differences: the model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
Summary
The study investigates performative chain-of-thought (CoT) in reasoning models, where a model is already strongly confident in its final answer but keeps generating tokens without revealing that belief. Comparing activation probing, early forced answering, and a CoT monitor across two large models, it finds that the final answer is decodable from activations much earlier in the CoT than a monitor can detect for easy recall-based MMLU questions, in contrast to the genuine reasoning seen on difficult multihop GPQA-Diamond questions. Inflection points such as backtracking occur almost exclusively in responses where probes show large belief shifts, which suggests that they track genuine uncertainty rather than performative reasoning. Probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy.
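Probe-guided early exit can be sketched as a simple stopping rule over per-step probe confidences. The function names, the 0.9 threshold, and the representation of a CoT as a list of per-step confidences and token counts are all assumptions for illustration; the paper's actual exit criterion may differ.

```python
def early_exit_step(probe_confidences, threshold=0.9):
    """Index of the first CoT step whose probe confidence reaches the threshold.

    Falls back to the last step if the threshold is never reached.
    """
    for step, confidence in enumerate(probe_confidences):
        if confidence >= threshold:
            return step
    return len(probe_confidences) - 1

def fraction_tokens_saved(probe_confidences, tokens_per_step, threshold=0.9):
    """Fraction of generated tokens avoided by stopping at the early-exit step."""
    stop = early_exit_step(probe_confidences, threshold)
    used = sum(tokens_per_step[: stop + 1])
    return 1 - used / sum(tokens_per_step)

# Hypothetical trace: the probe is confident after the second of four steps.
confidences = [0.30, 0.95, 0.97, 0.99]
tokens = [100, 100, 100, 100]
saved = fraction_tokens_saved(confidences, tokens)
```

In this hypothetical trace, generation stops after the second step and half of the tokens are avoided; a trace whose confidence never crosses the threshold runs to completion and saves nothing.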
Deep FlexQP: Accelerated Nonlinear Programming via Deep Unfolding
Authors: Alex Oshin, Rahul Vodeb Ghosh, Augustinos D. Saravanos, Evangelos A. Theodorou
Venue: ICLR 2026
First: 2025-12-01T11:38:45+00:00 · Latest: 2026-03-05T18:54:48+00:00
Comments: Accepted to ICLR 2026
Abstract
We propose FlexQP, an always-feasible convex quadratic programming (QP) solver based on an $\ell_1$ elastic relaxation of the QP constraints. If the original constraints are feasible, FlexQP provably recovers the optimal solution. If the constraints are infeasible, FlexQP identifies a solution that minimizes the constraint violation while keeping the number of violated constraints sparse. Such infeasibilities arise naturally in sequential quadratic programming (SQP) subproblems due to the linearization of the constraints. We prove the convergence of FlexQP under mild coercivity assumptions, making it robust to both feasible and infeasible QPs. We then apply deep unfolding to learn LSTM-based, dimension-agnostic feedback policies for the algorithm parameters, yielding an accelerated Deep FlexQP. To preserve the exactness guarantees of the relaxation, we propose a normalized training loss that incorporates the Lagrange multipliers. We additionally design a log-scaled loss for PAC-Bayes generalization bounds that yields substantially tighter performance certificates, which we use to construct an accelerated SQP solver with guaranteed QP subproblem performance. Deep FlexQP outperforms state-of-the-art learned QP solvers on a suite of benchmarks including portfolio optimization, classification, and regression problems, and scales to dense QPs with over 10k variables and constraints via fine-tuning. When deployed within SQP, our approach solves nonlinear trajectory optimization problems 4-16x faster than SQP with OSQP while substantially improving success rates. On predictive safety filter problems, Deep FlexQP reduces safety violations by over 70% and increases task completion by 43% compared to existing methods.
Summary
The paper proposes FlexQP, an always-feasible convex quadratic programming (QP) solver built on an $\ell_1$ elastic relaxation of the QP constraints. For feasible problems it provably recovers the optimal solution; for infeasible ones it minimizes the constraint violation while keeping the set of violated constraints sparse, and it converges under mild coercivity assumptions. Deep unfolding is then used to learn LSTM-based, dimension-agnostic feedback policies for the algorithm parameters, giving the accelerated Deep FlexQP, with a normalized training loss to preserve exactness and a log-scaled loss for tighter PAC-Bayes performance certificates. Deep FlexQP outperforms state-of-the-art learned QP solvers on the reported benchmarks, solves nonlinear trajectory optimization problems 4-16x faster than SQP with OSQP, and on predictive safety filter problems reduces safety violations by over 70% while increasing task completion by 43%.
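For readers unfamiliar with elastic relaxations, the following shows one common way to write an $\ell_1$ relaxation of a convex QP. The symbols $P, q, A, b, G, h$ and the penalty weight $\mu$ are notation assumed here for illustration; the exact formulation used by FlexQP may differ.

```latex
% Original convex QP (P positive semidefinite):
\begin{aligned}
\min_{x}\quad & \tfrac{1}{2} x^\top P x + q^\top x \\
\text{s.t.}\quad & A x = b, \qquad G x \le h .
\end{aligned}

% An \ell_1 elastic relaxation with penalty weight \mu > 0 (assumed form):
\min_{x}\quad \tfrac{1}{2} x^\top P x + q^\top x
  + \mu \left( \lVert A x - b \rVert_1
  + \lVert \max(G x - h,\, 0) \rVert_1 \right)
```

The relaxed problem has no hard constraints, so it is always feasible. By the classical exact-penalty result, when the original QP is feasible and $\mu$ exceeds the largest absolute value of an optimal Lagrange multiplier, the two problems share their minimizers, which is consistent with the role the abstract gives to the Lagrange multipliers in preserving exactness.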
Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation
Authors: Benjamin Feuer, Lucas Rosenblatt, Oussama Elachqar
First: 2026-03-05T18:52:28+00:00 · Latest: 2026-03-05T18:52:28+00:00
Abstract
As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.
Summary
This paper addresses the use of LLMs as judges in autonomous AI systems where ground truth is sparse or non-deterministic. It introduces average bias-boundedness (A-BB), an algorithmic framework that formally guarantees a reduction in the harm or impact caused by any measurable bias in an LLM judge. On Arena-Hard-Auto with four LLM judges, A-BB achieves (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with the original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%.
Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
Authors: Guo Chen, Lidong Lu, Yicheng Liu, Liangrui Dong, Lidong Zou, Jixin Lv, Zhenquan Li, Xinyi Mao, Baoqi Pei, Shihao Wang, Zhiqi Li, Karan Sapra, Fuxiao Liu, Yin-Dong Zheng, Yifei Huang, Limin Wang, Zhiding Yu, Andrew Tao, Guilin Liu, Tong Lu
First: 2026-03-05T18:52:12+00:00 · Latest: 2026-03-05T18:52:12+00:00
Abstract
While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
Summary
This paper addresses the gap between existing video understanding datasets and natural, unscripted daily life by introducing MM-Lifelong, a dataset of 181.1 hours of footage structured across Day, Week, and Month scales. Evaluations show that end-to-end MLLMs hit a Working Memory Bottleneck caused by context saturation, while representative agentic baselines suffer Global Localization Collapse on sparse, month-long timelines. The authors propose the Recursive Multimodal Agent (ReMA), which uses dynamic memory management to iteratively update a recursive belief state and significantly outperforms existing methods. The paper also provides dataset splits designed to isolate temporal and domain biases.
FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction
Authors: Jiaxin Yuan, Haizhao Yang, Maria Cameron
First: 2025-10-31T04:49:41+00:00 · Latest: 2026-03-05T18:50:52+00:00
Abstract
Fast and accurate simulation of dynamical systems is a fundamental challenge across scientific and engineering domains. Traditional numerical integrators often face a trade-off between accuracy and computational efficiency, while existing neural network-based approaches typically require training a separate model for each case. To overcome these limitations, we introduce a novel multi-modal foundation model for large-scale simulations of differential equations: FMint-SDE (Foundation Model based on Initialization for stochastic differential equations). Based on a decoder-only transformer with in-context learning, FMint-SDE leverages numerical and textual modalities to learn a universal error-correction scheme. It is trained using prompted sequences of coarse solutions generated by conventional solvers, enabling broad generalization across diverse systems. We evaluate our models on a suite of challenging SDE benchmarks spanning applications in molecular dynamics, mechanical systems, finance, and biology. Experimental results show that our approach achieves a superior accuracy-efficiency tradeoff compared to classical solvers, underscoring the potential of FMint-SDE as a general-purpose simulation tool for dynamical systems.
Summary
FMint-SDE is a multimodal foundation model that accelerates the numerical simulation of stochastic differential equations (SDEs) through error correction. It uses a decoder-only transformer with in-context learning over numerical and textual inputs to learn a universal error-correction scheme. Trained on prompted sequences of coarse solutions from conventional solvers, FMint-SDE achieves a better accuracy-efficiency tradeoff than classical solvers on benchmarks from molecular dynamics, mechanical systems, finance, and biology.
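The accuracy-efficiency tradeoff of a conventional solver can be seen with the Euler-Maruyama scheme, a standard method for SDEs. The sketch below is only the classical coarse solver, not FMint-SDE; the test equation (an Ornstein-Uhlenbeck-type drift with the noise switched off so the exact answer is known) and the step counts are assumptions chosen for illustration.

```python
import numpy as np

def euler_maruyama(drift, diffusion, x0, t_end, n_steps, rng):
    """Simulate dX = drift(X) dt + diffusion(X) dW with the Euler-Maruyama scheme."""
    dt = t_end / n_steps
    x = float(x0)
    path = [x]
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt))  # Brownian increment over one step
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return np.array(path)

rng = np.random.default_rng(0)
drift = lambda x: -x          # mean-reverting drift
no_noise = lambda x: 0.0      # noise disabled so the exact solution is exp(-t)

coarse = euler_maruyama(drift, no_noise, 1.0, 1.0, 4, rng)
fine = euler_maruyama(drift, no_noise, 1.0, 1.0, 400, rng)
exact = np.exp(-1.0)

coarse_error = abs(coarse[-1] - exact)  # cheap but inaccurate
fine_error = abs(fine[-1] - exact)      # accurate but 100x more steps
```

The gap between the coarse and fine trajectories is the kind of error that a learned correction scheme would aim to remove while keeping the cost of the coarse solver.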
Thermodynamic Response Functions in Singular Bayesian Models
Authors: Sean Plummer
First: 2026-03-05T18:50:20+00:00 · Latest: 2026-03-05T18:50:20+00:00
Abstract
Singular statistical models, including mixtures, matrix factorization, and neural networks, violate regular asymptotics due to parameter non-identifiability and degenerate Fisher geometry. Although singular learning theory characterizes marginal likelihood behavior through invariants such as the real log canonical threshold and singular fluctuation, these quantities remain difficult to interpret operationally. At the same time, widely used criteria such as WAIC and WBIC appear disconnected from underlying singular geometry. We show that posterior tempering induces a one-parameter deformation of the posterior distribution whose associated observables generate a hierarchy of thermodynamic response functions. A universal covariance identity links derivatives of tempered expectations to posterior fluctuations, placing WAIC, WBIC, and singular fluctuation within a unified response framework. Within this framework, classical quantities from singular learning theory acquire natural thermodynamic interpretations: RLCT governs the leading free-energy slope, singular fluctuation corresponds to curvature of the tempered free energy, and WAIC measures predictive fluctuation. We formalize an observable algebra that quotients out non-identifiable directions, allowing structurally meaningful order parameters to be constructed in singular models. Across canonical singular examples, including symmetric Gaussian mixtures, reduced-rank regression, and overparameterized neural networks, we empirically demonstrate phase-transition-like behavior under tempering. Order parameters collapse, susceptibilities peak, and complexity measures align with structural reorganization in posterior geometry. Our results suggest that thermodynamic response theory provides a natural organizing framework for interpreting complexity, predictive variability, and structural reorganization in singular Bayesian learning.
Summary
This paper studies singular statistical models, which violate regular asymptotics because of parameter non-identifiability and degenerate Fisher geometry. The authors show that posterior tempering generates a hierarchy of thermodynamic response functions, and that a universal covariance identity links derivatives of tempered expectations to posterior fluctuations. This places WAIC, WBIC, the real log canonical threshold (RLCT), and singular fluctuation in one response framework with natural thermodynamic interpretations. Experiments on canonical singular models show phase-transition-like behavior under tempering, indicating structural reorganization in posterior geometry.
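The covariance identity mentioned in the abstract follows from differentiating a tempered expectation. The derivation below uses notation assumed here rather than taken from the paper: prior $\varphi(w)$, empirical loss $L_n(w)$ over $n$ samples, and inverse temperature $\beta$.

```latex
% Tempered posterior, partition function, and free energy:
p_\beta(w) = \frac{\varphi(w)\, e^{-\beta n L_n(w)}}{Z(\beta)}, \qquad
Z(\beta) = \int \varphi(w)\, e^{-\beta n L_n(w)}\, dw, \qquad
F(\beta) = -\log Z(\beta).

% First and second derivatives of the free energy:
F'(\beta) = \mathbb{E}_\beta\!\left[ n L_n \right], \qquad
F''(\beta) = -\operatorname{Var}_\beta\!\left( n L_n \right).

% Covariance identity for any observable f(w):
\frac{d}{d\beta}\, \mathbb{E}_\beta[f]
  = -\operatorname{Cov}_\beta\!\left( f,\; n L_n \right).
```

In this notation, WBIC is the tempered expectation $\mathbb{E}_\beta[n L_n]$ evaluated at $\beta = 1/\log n$, that is, the free-energy slope $F'(\beta)$ at that temperature, while the curvature $F''(\beta)$ is a fluctuation of the loss under the tempered posterior.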
Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval
Authors: Artem Vazhentsev, Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Seleznyov, Mikhail Salnikov, Elena Tutubalina, Vasily Konovalov, Irina Nikishina, Alexander Panchenko, Viktor Moskvoretskii
First: 2026-03-05T18:42:51+00:00 · Latest: 2026-03-05T18:42:51+00:00
Comments: Preprint
Abstract
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the models' intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.
Summary
The research aims to make agentic AI systems more trustworthy through fact-checking that does not rely on external knowledge retrieval and instead uses the intrinsic fact-verification capabilities of Large Language Models (LLMs). It introduces an evaluation framework that tests generalization to long-tail knowledge, varied claim sources, multiple languages, and long-form generation across 9 datasets, 18 methods, and 3 models. Logit-based approaches often underperform methods that use internal model representations, and the proposed method INTRA, which exploits interactions between internal representations, achieves state-of-the-art performance with strong generalization.
HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token
Authors: Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun, Jiawei Zhou
Venue: The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)
First: 2026-03-05T18:36:31+00:00 · Latest: 2026-03-05T18:36:31+00:00
Abstract
Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
中文标题/摘要
标题:HALP:无需生成单个词元即可检测视觉语言模型中的幻觉
幻觉仍然是视觉语言模型(VLMs)的一个持续性挑战,它们经常描述不存在的对象或编造事实。现有的检测方法通常在文本生成之后进行操作,这使得干预既昂贵又不及时。我们研究是否可以在生成任何词元之前通过探测模型的内部表示来预测幻觉风险,仅在一次前向传递中进行。在一系列视觉语言任务和八个现代VLMs(包括Llama-3.2-Vision、Gemma-3、Phi-4-VL和Qwen2.5-VL)中,我们检查了三种内部表示家族:(i)仅视觉特征而不进行多模态融合,(ii)文本解码器中的视觉词元表示,以及(iii)在生成之前整合视觉和文本信息的查询词元表示。基于这些表示训练的探测器在无需解码的情况下实现了强大的幻觉检测性能,最高达到0.93 AUROC在Gemma-3-12B、Phi-4-VL 5.6B和Molmo 7B上。大多数模型中,后期查询词元状态最具预测性,而在少数架构中,视觉或中间层特征占主导地位(例如,Qwen2.5-VL-7B使用仅视觉特征的AUROC约为0.79)。这些结果表明:(1)幻觉风险可以在生成之前检测到,(2)最具信息量的层和模态在不同架构中有所不同,(3)轻量级探测器有可能实现早期避免、选择性路由和自适应解码,以提高安全性和效率。
Summary / 总结
The paper introduces HALP, a method for detecting hallucinations in vision-language models without generating any tokens. By probing internal representations in a single forward pass, the method achieves strong performance, with AUROCs up to 0.93 on several models. The study finds that late query-token states are the most predictive for most models, while visual or mid-layer features are more informative for some architectures. This demonstrates that hallucination risk can be detected pre-generation and that lightweight probes can enable early intervention to improve safety and efficiency.
研究通过提出一种在任何文本生成之前预测幻觉风险的方法,来应对视觉语言模型中的幻觉挑战。它在八种模型上评估了三种内部表示,并发现基于后期查询令牌状态的探针能够实现强大的检测性能,最高AUROC可达0.93。结果表明,幻觉风险可以在生成前被检测到,不同架构中最信息丰富的层和模态各不相同,并且轻量级探针可以实现早期干预以提高安全性和效率。
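As a rough illustration of the AUROC figures quoted above, the metric can be read as the probability that a probe scores a hallucinated example higher than a non-hallucinated one. The sketch below is a generic pairwise computation, not the authors' evaluation code; the function name `auroc` and the toy scores are hypothetical.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(positive scored above negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical probe outputs: higher score = higher predicted hallucination risk.
probe_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
is_hallucination = [1, 1, 1, 0, 0, 0]
print(round(auroc(probe_scores, is_hallucination), 3))
```

The quadratic loop is fine for a toy check; a rank-based implementation would be the usual choice at dataset scale.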
NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance
Authors: Abrar Eyasir, Tahsin Ahmed, Muhammad Ibrahim
First: 2026-03-05T18:35:03+00:00 · Latest: 2026-03-05T18:35:03+00:00
Comments: 18 pages, 7 figures, 6 tables. Dataset contains 87,805 Bangla QA pairs from NCTB textbooks
Abstract
Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions. These systems tend to produce unreliable responses when correct answers are absent from context. To solve this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions. NCTB-QA also includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning. BERT achieves 313% relative improvement in F1 score (0.150 to 0.620). Semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering. This study demonstrates that domain-specific fine-tuning is critical for robust performance in low-resource settings.
中文标题/摘要
标题:NCTB-QA:大规模孟加拉语教育问答数据集及基准性能
针对低资源语言的阅读理解系统在处理无法回答的问题时面临重大挑战。当正确答案不在上下文中时,这些系统往往会生成不可靠的响应。为了解决这一问题,我们引入了NCTB-QA,这是一个包含87,805个问答对的大规模孟加拉语问答数据集,这些问答对是从孟加拉国国家课程和课本委员会出版的50本教科书中提取的。与现有的孟加拉语数据集不同,NCTB-QA保持了可回答问题(57.25%)和无法回答问题(42.75%)的平衡分布。NCTB-QA还包含对抗设计的实例,其中包含可能的干扰项。我们对三种基于变换器的模型(BERT、RoBERTa、ELECTRA)进行了基准测试,并通过微调展示了显著的改进。BERT在F1分数上实现了313%的相对改进(从0.150到0.620)。BERTScore衡量的语义答案质量在所有模型中也显著提高。我们的结果确立了NCTB-QA作为孟加拉语教育问答具有挑战性的基准。本研究证明,在低资源环境中,领域特定的微调对于稳健性能至关重要。
Summary / 总结
The research addresses the challenge of handling unanswerable questions in reading comprehension systems for low-resource languages by introducing NCTB-QA, a large-scale Bangla dataset with 87,805 question-answer pairs. The dataset includes both answerable and unanswerable questions, and adversarially designed instances. Three transformer-based models (BERT, RoBERTa, ELECTRA) were benchmarked, showing significant improvements through fine-tuning, with BERT achieving a 313% relative improvement in F1 score. The study highlights the importance of domain-specific fine-tuning for robust performance in low-resource settings.
研究通过引入包含87,805个问题-答案对的NCTB-QA数据集,解决低资源语言阅读理解系统处理无法回答的问题的挑战。该数据集来自50本孟加拉国国家课程和课本局出版的教科书,并保持了可回答和不可回答问题的平衡分布。对三种基于变换器的模型(BERT、RoBERTa、ELECTRA)进行了基准测试,显示了通过微调取得的重大改进,BERT的F1分数相对提高了313%。研究强调了在低资源环境中实现稳健性能的关键在于领域特定的微调。
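The 313% figure above follows directly from the two F1 values given in the abstract. A minimal check, assuming the usual definition of relative improvement (the helper name is hypothetical):

```python
def relative_improvement(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# F1 values for BERT as reported in the abstract (before and after fine-tuning).
f1_before, f1_after = 0.150, 0.620
print(f"{relative_improvement(f1_before, f1_after):.0f}%")
```

Note that the absolute gain is 0.47 F1 points; the large percentage reflects the low starting score.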
DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates
Authors: Klaywert Danillo Ferreira de Souza, David Eduardo Pereira, Cláudio E. C. Campelo, Larissa Lucena Vasconcelos
First: 2026-03-05T18:30:10+00:00 · Latest: 2026-03-05T18:30:10+00:00
Abstract
The process of debating is essential in our daily lives, whether in studying, work activities, simple everyday discussions, political debates on TV, or online discussions on social networks. The range of uses for debates is broad. Due to the diverse applications, structures, and formats of debates, developing corpora that account for these variations can be challenging, and the scarcity of debate corpora in the state of the art is notable. For this reason, the current research proposes the DEBISS corpus: a collection of spoken and individual debates with semi-structured features. The corpus carries annotations for a broad range of NLP tasks, such as speech-to-text, speaker diarization, argument mining, and debater quality assessment.
中文标题/摘要
标题:DEBISS:个体、半结构化和口语辩论语料库
辩论的过程在我们的日常生活中至关重要,无论是学习、工作活动、简单的日常讨论、电视上的政治辩论,还是社交媒体上的在线讨论。辩论的应用范围很广。由于辩论的多样应用、结构和格式,开发能够涵盖这些差异的语料库具有挑战性,目前该领域的辩论语料库稀缺性尤为明显。因此,当前研究提出了DEBISS语料库:一个包含口语和个体辩论的半结构化集合。该语料库包含广泛的自然语言处理任务注释,如语音转文本、说话人辨识、论点挖掘和辩手质量评估。
Summary / 总结
The research aims to address the scarcity of debate corpora by proposing DEBISS, a corpus of spoken and individual debates with semi-structured features. The corpus includes annotations for various NLP tasks like speech-to-text, speaker diarization, argument mining, and debater quality assessment. Key findings include the successful creation of a diverse and annotated debate corpus that can support a wide range of NLP applications.
研究旨在解决辩论语料库稀缺的问题,提出了DEBISS语料库,包含口头和个体辩论,并具有半结构化特征。该语料库包含如语音转文本、发言者识别、论点挖掘和辩论者质量评估等多种NLP任务的注释。主要发现包括成功创建了一个多样且注释丰富的辩论语料库,可用于支持多种NLP应用。
Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes
Authors: Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan
Venue: ICLR 2026
First: 2026-03-05T18:25:26+00:00 · Latest: 2026-03-05T18:25:26+00:00
Comments: Accepted at ICLR 2026
Abstract
Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance', that is, committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
中文标题/摘要
标题:超越零散接受:通过最长稳定前缀实现DLMs的快速和连贯推理
扩散语言模型(DLMs)承诺实现高度并行的文本生成,但在实际推理速度上往往受限于次优的解码调度器。标准方法依赖于“零散接受”——在序列中不连续的位置上提交高置信度的标记。这种方法无意中破坏了键值(KV)缓存,破坏了内存局部性,并迫使模型在不稳定的标记边界上进行昂贵的重复修复。为了解决这个问题,我们提出了最长稳定前缀(LSP)调度器,这是一种基于单一前缀吸收的无训练和模型无关的推理范式。在每次去噪步骤中,LSP 通过单向传递评估标记的稳定性,动态识别一个连续的左对齐的稳定预测块,并在原子提交前将其边界对齐到自然语言或结构分隔符。这种前缀优先的拓扑结构带来了双重好处:系统上,它将碎片化的KV缓存更新转换为高效的连续追加;算法上,它保留了对几何缩小的活动后缀的双向前瞻,大幅减少了标记翻转率和去噪器调用次数。在LLaDA-8B和Dream-7B上的广泛评估表明,LSP 在包括数学推理、代码生成、多语言(CJK)任务和创造性写作在内的严格基准测试中将推理加速了高达3.4倍,同时保持或略微提高了输出质量。通过从根本上重新结构化提交拓扑,LSP 桥接了DLMs的理论并行性和实际硬件效率之间的差距。
Summary / 总结
The paper addresses the issue of slow inference in Diffusion Language Models (DLMs) due to suboptimal decoding schedulers. It introduces the Longest Stable Prefix (LSP) scheduler, which evaluates token stability and commits to a contiguous block of predictions, thereby improving memory locality and reducing the need for costly repairs. Experiments on LLaDA-8B and Dream-7B show that LSP can accelerate inference by up to 3.4x across various tasks while maintaining or slightly improving output quality.
论文针对扩散语言模型(DLMs)因解码调度器不理想导致的推理速度慢问题,该调度器在序列中不连续位置承诺高置信度的标记,这会破坏KV缓存并增加计算成本。它提出了最长稳定前缀(LSP)调度器,该调度器通过单次前向传播评估标记稳定性,并承诺一个连续的稳定预测块,从而保持内存局部性并减少标记翻转率。在LLaDA-8B和Dream-7B上的评估表明,LSP可以加速推理多达3.4倍,同时保持或略微提高输出质量。
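The prefix-selection step described in the abstract can be sketched as follows. This is an illustrative reading only, not the authors' implementation: it assumes stability is a per-token confidence compared against a fixed threshold, that the delimiter set is a small tuple of punctuation tokens, and that the raw stable run is committed when it contains no delimiter. The function name and all parameter values are hypothetical.

```python
def longest_stable_prefix(tokens, confidences, threshold=0.9,
                          delimiters=(".", ",", ";", "\n")):
    """Return how many leading tokens to commit in this denoising step."""
    # Step 1: length of the contiguous, left-aligned run of stable tokens.
    stable = 0
    for c in confidences:
        if c < threshold:
            break
        stable += 1

    # Step 2: snap the boundary back to the last delimiter inside that run.
    for end in range(stable, 0, -1):
        if tokens[end - 1] in delimiters:
            return end

    # Assumption: with no delimiter in the run, commit the raw stable run.
    return stable


tokens = ["The", "cat", "sat", ".", "It", "was", "happy"]
confidences = [0.99, 0.97, 0.95, 0.96, 0.93, 0.50, 0.98]
commit = longest_stable_prefix(tokens, confidences)
print(tokens[:commit])
```

In this toy input the stable run covers five tokens, but the boundary snaps back to the sentence-ending period, so only the first four are committed. The high-confidence token at the last position is not committed, which is what distinguishes a prefix-first schedule from scattered acceptance.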
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Authors: Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao
First: 2026-03-05T18:24:49+00:00 · Latest: 2026-03-05T18:24:49+00:00
Abstract
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3× speedup over cuDNN 9.13 and 2.7× over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30× faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.
中文标题/摘要
标题:FlashAttention-4:针对异构硬件扩展的算法和内核流水线协同设计
注意力机制作为通用Transformer架构的核心层,是大型语言模型和长上下文应用的瓶颈。虽然FlashAttention-3通过异步执行和战位专业化优化了Hopper GPU上的注意力机制,但它主要针对H100架构。AI行业迅速转向部署基于Blackwell的系统,如B200和GB200,由于异构硬件扩展的不同性能特征:张量核吞吐量翻倍,而其他功能单元(共享内存带宽,指数单元)则扩展较慢或保持不变。我们开发了几种技术来应对Blackwell GPU上的这些变化瓶颈:(1)重新设计的流水线,利用完全异步的MMA操作和更大的瓦片大小,(2)软件模拟的指数和条件softmax缩放,减少非矩阵运算,以及(3)利用张量内存和2-CTA MMA模式减少反向传播中的共享内存流量和原子加操作。我们证明,我们的方法FlashAttention-4在B200 GPU上使用BF16时,相对于cuDNN 9.13实现了高达1.3倍的加速,相对于Triton实现了2.7倍的加速,达到最高1613 TFLOPs/s(71%利用率)。除了算法创新,我们完全在嵌入Python中的CuTe-DSL中实现FlashAttention-4,与传统的C++模板方法相比,编译时间快20-30倍,同时保持完全的表达能力。
Summary / 总结
FlashAttention-4 was developed to optimize attention mechanisms for Blackwell-based GPUs, such as B200 and GB200, which have asymmetric hardware scaling. The method includes redesigned pipelines, software-emulated exponential and conditional softmax rescaling, and leveraging tensor memory to reduce shared memory traffic. This approach achieves up to 1.3× speedup over cuDNN 9.13 and 2.7× over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization).
FlashAttention-4 旨在优化 Blackwell 基础架构的 GPU,如 B200 和 GB200 上的注意力机制,这些 GPU 具有不对称的硬件扩展。该方法包括重新设计的流水线、软件模拟的指数和条件 softmax 缩放,以及利用张量内存以减少共享内存流量。这种方法在 B200 GPU 上使用 BF16 达到了最高 1613 TFLOPs/s(71% 利用率),比 cuDNN 9.13 快 1.3×,比 Triton 快 2.7×。
Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry
Authors: Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy
First: 2026-03-05T18:22:55+00:00 · Latest: 2026-03-05T18:22:55+00:00
Comments: 10 pages, 4 figures
Abstract
Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.
中文标题/摘要
标题:分布式部分信息谜题:在知识不对称下的共同知识构建研究
建立共同知识,即共享的一组信念和相互认可的事实,是协作的基础,但在当前的AI系统中仍然是一个挑战,尤其是在多模态、多参与者的场景中,参与者带来不同的信息。我们引入了分布式部分信息谜题(DPIP),这是一种在知识不对称下引发丰富多模态交流的协作构建任务。我们提供了一个多模态的交互数据集,这些数据在语音、手势和动作模态上进行了注释和时间对齐,以支持对命题内容和信念动态的推理。然后,我们评估了两种建模共同知识(CG)的范式:(1)最先进的大型语言模型(LLMs),被提示从多模态更新中推断共享的信念,以及(2)一个基于动态知识逻辑(DEL)的公理化管道,逐步执行相同任务。对注释的DPIP数据的评估结果表明,这给现代LLMs跟踪任务进展和信念状态的能力带来了挑战。
Summary / 总结
The study aims to explore how AI systems can establish common ground in collaborative settings with epistemic asymmetry. It introduces the Distributed Partial Information Puzzle (DPIP) task, which involves multimodal communication among agents with different information. The research evaluates two approaches: state-of-the-art large language models (LLMs) and an axiomatic pipeline based on Dynamic Epistemic Logic (DEL). The results show that LLMs struggle to track both task progression and belief states, highlighting the challenge of common ground construction in such settings.
研究旨在探索AI系统如何在信息不对称的情况下建立共同知识,在协作环境中尤其具有挑战性。研究引入了分布式部分信息谜题(DPIP),并评估了两种方法:最先进的大型语言模型(LLMs)和基于动态演绎逻辑(DEL)的公理化管道。结果显示,现代LLMs在追踪任务进展和信念状态方面存在困难,突显了当前AI系统在这些场景下的挑战。
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Authors: Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover
First: 2026-01-26T17:56:50+00:00 · Latest: 2026-03-05T18:19:57+00:00
Comments: Code is released here: https://github.com/siyan-zhao/OPSD
Abstract
Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 8-12x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.
中文标题/摘要
标题:自我蒸馏推理器:大型语言模型的在线自我蒸馏
知识蒸馏通过压缩教师大型语言模型的知识来训练较小的大型语言模型,从而改善大型语言模型的推理能力。在线蒸馏通过让学生在教师大型语言模型提供密集的标记级监督的同时,自行采样其自身的轨迹,来推进这一方法,从而解决了离线蒸馏方法中训练与推理之间的分布不匹配问题。然而,在线蒸馏通常需要一个单独的、通常更大的教师大型语言模型,并且不明确利用推理数据集中可用的真实解决方案。受足够强大的大型语言模型能够合理化外部特权推理轨迹并教导其较弱版本(即没有访问特权信息的版本)这一直觉的启发,我们引入了在线自我蒸馏(OPSD)框架,其中单个模型同时作为教师和学生,通过不同的上下文进行条件化。教师策略通过条件化特权信息(例如,验证的推理轨迹)进行条件化,而学生策略仅看到问题;训练通过最小化学生自身演练过程中这些分布的每个标记差异来进行。我们通过多个数学推理基准展示了我们方法的有效性,与强化学习方法(如GRPO)相比,实现了8-12倍的标记效率,并且优于离线蒸馏方法。
Summary / 总结
The paper introduces On-Policy Self-Distillation (OPSD), a method that uses a single large language model as both teacher and student. The teacher conditions on privileged information like verified reasoning traces, while the student only sees the question. Training minimizes token-level divergence between the teacher and student policies over the student's own rollouts. Experiments show OPSD achieves 8-12x token efficiency compared to reinforcement learning methods and outperforms off-policy distillation methods on mathematical reasoning benchmarks.
研究旨在通过单模型作为教师和学生的自监督蒸馏方法提高大型语言模型的推理能力。该方法让教师基于特权信息进行条件化,而学生仅看到问题,在训练过程中最小化两者之间的令牌级差异。实验表明,OPSD 在数学推理基准测试中相较于强化学习方法具有 8-12 倍的令牌效率,并且优于脱政策度化方法。
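The training signal described above, a per-token divergence between the teacher and student distributions over the student's own rollout, can be sketched in plain Python. The abstract does not say which divergence is used, so the forward KL, KL(teacher || student), is an assumption here, as are the function names and the toy logits. In the actual method both sets of logits would come from the same model under two different contexts.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at each rollout position (assumed divergence)."""
    kls = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p, q = softmax(t_row), softmax(s_row)
        kls.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0))
    return kls


# Toy rollout of 2 tokens over a 3-word vocabulary (hypothetical values).
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]  # conditioned on privileged trace
student = [[1.0, 1.0, 0.1], [0.2, 1.5, 0.3]]  # conditioned on the question only
kls = per_token_kl(teacher, student)
loss = sum(kls) / len(kls)
print(kls, loss)
```

The second position contributes zero loss because the two distributions already agree there, which is the sense in which the supervision is dense and token-level.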
HydroGEM: A Self Supervised Zero Shot Hybrid TCN Transformer Foundation Model for Continental Scale Streamflow Quality Control
Authors: Ijaz Ul Haq, Byung Suk Lee, Julia N. Perdrial, David Baude
First: 2025-12-16T05:39:26+00:00 · Latest: 2026-03-05T18:19:22+00:00
Comments: Supplementary materials, datasets, and implementation code will be made publicly available upon acceptance for publication in a peer-reviewed journal
Abstract
Advances in sensor networks have enabled real-time stream discharge monitoring, yet persistent sensor malfunctions limit data utility. Manual quality control by expert hydrologists cannot scale with networks generating millions of measurements annually. We introduce HydroGEM, a foundation model for continental-scale streamflow quality control designed to support human expertise. HydroGEM uses self-supervised pretraining on 6.03 million clean sequences from 3,724 USGS stations to learn general hydrological representations, followed by fine-tuning with synthetic anomalies for detection and reconstruction. A hybrid TCN-Transformer architecture (14.2M parameters) captures both local and long-range temporal dependencies, while hierarchical normalization handles six orders of magnitude in discharge. On held-out observations from 799 stations with 18 synthetic anomaly types grounded in USGS standards, HydroGEM achieves F1=0.792 for detection and 68.7% reconstruction error reduction, outperforming the strongest baseline by 36.3%. For cross-national validation on 100 Environment and Climate Change Canada stations using tolerant evaluation with a ±24-hour buffer, HydroGEM achieves Tolerant F1=0.70 with 90.1% segment-level event detection, demonstrating cross-national generalization. The model maintains consistent detection across correction magnitudes and aligns with operational seasonal patterns, with peak flagging during winter ice-affected periods matching hydrologists' correction behavior. Architectural separation between simplified training anomalies and complex test anomalies confirms that performance reflects learned hydrometric principles rather than pattern memorization.
中文标题/摘要
标题:HydroGEM:一种用于大陆规模径流质量控制的自我监督零样本混合TCN变换器基础模型
传感器网络的进步使得实时径流监测成为可能,但持续的传感器故障限制了数据的实用性。由专家水文学家进行的手动质量控制无法与每年产生数百万测量值的网络规模相匹配。我们引入了HydroGEM,一种用于大陆规模径流质量控制的基础模型,旨在支持人类专业知识。HydroGEM通过在来自3,724个USGS站点的6,030,000个干净序列上进行自我监督预训练来学习通用水文表示,然后使用合成异常进行微调以进行检测和重建。混合TCN-变换器架构(14.2M参数)捕捉局部和长程时间依赖性,而分层规范化处理了六个数量级的径流。在799个站点的18种合成异常类型上进行的验证中,HydroGEM在18种合成异常类型上实现了检测F1=0.792,重建误差减少68.7%,优于最强基线36.3%。在使用正负24小时缓冲进行容忍评估的100个环境和气候变化加拿大站点上进行的跨国验证中,HydroGEM实现了容忍F1=0.70,段级事件检测率为90.1%,展示了跨国通用性。该模型在不同校正幅度下保持一致的检测,并与操作季节性模式一致,在冬季冰影响期间的峰值标记与水文学家的校正行为一致。简化训练异常与复杂测试异常之间的架构分离证实了性能反映了学习的水文原理而非模式记忆。
Summary / 总结
HydroGEM is a foundation model designed for continental-scale streamflow quality control, using self-supervised pretraining and fine-tuning with synthetic anomalies. It achieves an F1 score of 0.792 for detection and a 68.7% reduction in reconstruction error, outperforming existing methods by 36.3%. The model demonstrates cross-national generalization and aligns with hydrologists' correction behaviors, making it a valuable tool for quality control in large sensor networks.
HydroGEM 是一种自监督零样本混合 TCN-Transformer 模型,用于大陆规模的径流质量控制。它基于 3,724 个 USGS 站点的 6.03 百万条干净序列进行预训练,并通过合成异常进行微调。HydroGEM 在检测上的 F1 得分为 0.792,重建误差减少了 68.7%,优于现有基线。它还在 Environment and Climate Change Canada 的 100 个站点上展示了跨国家的一般化能力,Tolerant F1 为 0.70,与操作季节模式和水文学家的修正行为保持一致。
LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification
Authors: Md Akib Haider, Ahsan Bulbul, Nafis Fuad Shahid, Aimaan Ahmed, Mohammad Ishrak Abedin
First: 2026-03-04T11:36:32+00:00 · Latest: 2026-03-05T18:19:21+00:00
Abstract
Code comment classification is a critical task for automated software documentation and analysis. In the context of the NLBSE'26 Tool Competition, we present LoRA-MME, a Multi-Model Ensemble architecture utilizing Parameter-Efficient Fine-Tuning (PEFT). Our approach addresses the multi-label classification challenge across Java, Python, and Pharo by combining the strengths of four distinct transformer encoders: UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa. By independently fine-tuning these models using Low-Rank Adaptation (LoRA) and aggregating their predictions via a learned weighted ensemble strategy, we maximize classification performance without the memory overhead of full model fine-tuning. Our tool achieved an F1 Weighted score of 0.7906 and a Macro F1 of 0.6867 on the test set. However, the computational cost of the ensemble resulted in a final submission score of 41.20%, highlighting the trade-off between semantic accuracy and inference efficiency.
中文标题/摘要
标题:LoRA-MME:针对Java、Python和Pharo的LoRA调优编码器多模型集成代码注释分类
代码注释分类是自动软件文档和分析中的关键任务。在NLBSE'26工具竞赛中,我们提出了LoRA-MME,这是一种利用参数高效微调(PEFT)的多模型集成架构。我们的方法通过结合四种不同变压器编码器的优点——UniXcoder、CodeBERT、GraphCodeBERT和CodeBERTa,解决了跨Java、Python和Pharo的多标签分类挑战。通过独立使用低秩适应(LoRA)对这些模型进行微调,并通过学习加权集成策略聚合它们的预测,我们最大化了分类性能,而无需全模型微调的内存开销。我们的工具在测试集上获得了0.7906的F1加权分数和0.6867的宏F1分数。然而,集成的计算成本导致最终提交得分为41.20%,突显了语义准确性和推理效率之间的权衡。
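The aggregation step, combining per-model predictions with learned weights for a multi-label decision, might look roughly like the sketch below. The abstract does not specify how the weights are parameterized or learned, so one scalar weight per model and a fixed 0.5 decision threshold are assumptions, and every number is made up for illustration.

```python
def ensemble_predict(model_probs, weights, threshold=0.5):
    """Weighted average of per-label probabilities, then per-label thresholding.

    model_probs: one list of per-label probabilities per model.
    weights:     one non-negative weight per model (assumed already learned).
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    num_labels = len(model_probs[0])
    combined = [
        sum(w * probs[j] for w, probs in zip(norm, model_probs))
        for j in range(num_labels)
    ]
    return combined, [int(p >= threshold) for p in combined]


# Hypothetical outputs of four encoders for one comment and three labels.
probs = [
    [0.9, 0.2, 0.6],  # stand-in for UniXcoder
    [0.8, 0.1, 0.4],  # stand-in for CodeBERT
    [0.7, 0.3, 0.5],  # stand-in for GraphCodeBERT
    [0.6, 0.4, 0.3],  # stand-in for CodeBERTa
]
weights = [0.4, 0.3, 0.2, 0.1]
combined, labels = ensemble_predict(probs, weights)
print(combined, labels)
```

Because every encoder must run for each input, inference cost grows with ensemble size, which is consistent with the efficiency penalty the abstract reports in the final submission score.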
NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries
Authors: Kanon Amemiya, Daichi Yashima, Kei Katsumata, Takumi Komatsu, Ryosuke Korekata, Seitaro Otsuki, Komei Sugiura
Venue: CVPR 2026
First: 2026-03-05T18:12:29+00:00 · Latest: 2026-03-05T18:12:29+00:00
Comments: Accepted to CVPR 2026 Findings
Abstract
We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.
中文标题/摘要
标题:NaiLIA:基于密集意图描述和调色板查询的多模态指甲设计检索
我们专注于基于密集意图描述检索指甲设计图像的任务,这些描述代表了用户对指甲设计的多层意图。这具有挑战性,因为这些描述指定了未受约束的绘画元素和预制造装饰,以及视觉特征、主题和整体印象。除了这些描述之外,我们假设用户通过颜色拾取器指定零个或多个颜色来提供调色板查询,这使得微妙和连续的颜色细微差别得以表达。现有的视觉-语言基础模型往往难以结合这些描述和调色板。为了解决这个问题,我们提出了NaiLIA,一种针对指甲设计图像的多模态检索方法,在检索过程中全面对齐密集意图描述和调色板查询。我们的方法引入了一种基于未标注图像置信分数的宽松损失,可以与描述对齐。为了评估NaiLIA,我们构建了一个基准,包含来自不同文化背景的10,625张图像。这些图像由超过200名注释者标注了长且密集的意图描述。实验结果表明,NaiLIA优于标准方法。
Summary / 总结
The research aims to develop a method for retrieving nail design images based on detailed user intent descriptions and color palette queries. NaiLIA, a multimodal retrieval approach, aligns with both dense descriptions and color palettes, using a relaxed loss based on confidence scores for unlabeled images. Experiments show that NaiLIA outperforms existing methods on a benchmark of 10,625 images with diverse cultural backgrounds and detailed annotations.
研究旨在开发一种基于详细意图描述和色彩调色板查询的指甲设计图像检索方法。NaiLIA是一种多模态检索方法,能够同时与描述和调色板对齐以应对未约束的绘画元素和视觉特征的挑战。该方法使用基于未标注图像置信分数的松弛损失。实验表明,NaiLIA在包含10,625张具有不同文化背景图像的基准上优于现有方法,并且这些图像由超过200名注释者进行了详细的标注。
Agentic Very Long Video Understanding
Authors: Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, Hyo Jin Kim
First: 2026-01-26T05:20:47+00:00 · Latest: 2026-03-05T18:12:22+00:00
Comments: 27 pages, 7 figures, 8 tables
Abstract
The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks. Code is available at https://github.com/facebookresearch/egagent.
中文标题/摘要
标题:代理型非常长视频理解
随着全天候可穿戴设备如智能眼镜所支持的始终在线个人AI助手的出现,需要一种新的上下文理解水平,这种理解超越了短暂孤立的事件,涵盖了持续的、纵向的自我中心视频流。实现这一愿景需要在长时视频理解方面取得进展,其中系统必须解释和回忆跨越数天甚至数周的视觉和音频信息。现有的方法,包括大型语言模型和检索增强生成,受限于有限的上下文窗口,并且缺乏在非常长的视频流上进行组合式、多跳推理的能力。在这项工作中,我们通过EGAgent解决这些挑战,这是一种增强的代理框架,以实体场景图为中心,这些图表示随着时间推移的人、地点、物体及其关系。我们的系统为规划代理配备了在这些图上进行结构化搜索和推理的工具,以及混合视觉和音频搜索能力,使跨模态和时间上一致的推理成为可能。在EgoLifeQA和Video-MME(长)数据集上的实验表明,我们的方法在EgoLifeQA上达到了最先进的性能(57.5%),在Video-MME(长)上达到了竞争性的性能(74.1%),用于复杂的纵向视频理解任务。代码可在https://github.com/facebookresearch/egagent/获取。
Summary / 总结
This paper addresses the need for long-horizon video understanding in the context of always-on personal AI assistants, proposing EGAgent, an enhanced agentic framework. EGAgent uses entity scene graphs to represent people, places, objects, and their relationships over time, and equips a planning agent with structured search and reasoning capabilities. Experiments show that EGAgent outperforms existing methods on EgoLifeQA and achieves competitive performance on Video-MME (Long) for complex longitudinal video understanding tasks.
本文针对始终在线的个人AI助手在长时段视频理解方面的需求,提出了EGAgent,一种增强的代理框架,使用实体场景图来表示和推理长时间的视频流。该系统包括结构化搜索和推理的工具,以及混合视觉和音频搜索能力。实验表明,EGAgent在EgoLifeQA上达到了最先进的性能,在Video-MME(长)上也取得了竞争力的性能,用于复杂的长时段视频理解任务。
LLEMA: Evolutionary Search with LLMs for Multi-Objective Materials Discovery
Authors: Nikhil Abhyankar, Sanchit Kabra, Saaketh Desai, Chandan K. Reddy
Venue: ICLR 2026
First: 2025-10-26T02:47:15+00:00 · Latest: 2026-03-05T18:02:19+00:00
Comments: ICLR 2026
Abstract
Materials discovery requires navigating vast chemical and structural spaces while satisfying multiple, often conflicting, objectives. We present LLM-guided Evolution for MAterials discovery (LLEMA), a unified framework that couples the scientific knowledge embedded in large language models with chemistry-informed evolutionary rules and memory-based refinement. At each iteration, an LLM proposes crystallographically specified candidates under explicit property constraints; a surrogate-augmented oracle estimates physicochemical properties; and a multi-objective scorer updates success/failure memories to guide subsequent generations. Evaluated on 14 realistic tasks that span electronics, energy, coatings, optics, and aerospace, LLEMA discovers candidates that are chemically plausible, thermodynamically stable, and property-aligned, achieving higher hit rates and improved Pareto front quality relative to generative and LLM-only baselines. Ablation studies confirm the importance of rule-guided generation, memory-based refinement, and surrogate prediction. By enforcing synthesizability and multi-objective trade-offs, LLEMA provides a principled approach to accelerating practical materials discovery. Project website: https://scientific-discovery.github.io/llema-project/
中文标题/摘要
标题:LLEMA:使用大语言模型进行多目标材料发现的进化搜索
材料发现需要在广泛的化学和结构空间中导航,同时满足多个常常相互冲突的目标。我们提出了由大语言模型指导的材料发现进化(LLEMA),这是一种统一框架,将嵌入在大语言模型中的科学知识与化学启发的进化规则和基于记忆的改进相结合。在每次迭代中,大语言模型提出在明确属性约束下的晶体学指定候选物;一个代理增强的预言机估计物理化学属性;一个多目标评分器更新成功/失败记忆以指导后续代。在涵盖电子、能源、涂层、光学和航空航天的14个现实任务上进行评估,LLEMA 发现的候选物在化学上是合理的、热力学上是稳定的,并且属性是匹配的,相对于生成性和仅大语言模型的基线,其命中率更高,帕累托前沿质量更好。消融研究证实了规则指导生成、基于记忆的改进和代理预测的重要性。通过强制执行合成性和多目标权衡,LLEMA 提供了一种加速实际材料发现的原理性方法。项目网站:https://scientific-discovery.github.io/llema-project/
Summary / 总结
LLEMA is a unified framework that integrates large language models with chemistry-informed evolutionary rules and memory-based refinement for multi-objective materials discovery. At each iteration, an LLM proposes crystallographically specified candidates under property constraints, a surrogate-augmented oracle estimates physicochemical properties, and a multi-objective scorer updates success/failure memories. LLEMA outperforms generative and LLM-only baselines in hit rates and Pareto front quality across 14 tasks spanning electronics, energy, coatings, optics, and aerospace. Ablation studies highlight the importance of rule-guided generation, memory-based refinement, and surrogate prediction for the framework's success.
LLEMA 是一个将大型语言模型与化学导向的进化规则和基于记忆的改进相结合的统一框架。在每次迭代中,LLM 在属性约束下提出晶体学指定的候选材料,一个基于代理的预言机估计物理化学性质,一个多目标评分器更新成功/失败记忆以指导后续代。LLEMA 在涉及电子、能源、涂层、光学和航空航天的 14 个任务中表现出更高的命中率和改进的帕累托前沿质量,优于生成性和仅 LLM 基准。消融研究强调了规则导向生成、基于记忆的改进和代理预测的重要性。
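The Pareto-front criterion used above to compare methods can be illustrated with a small sketch. The helpers `dominates` and `pareto_front` are hypothetical names, not LLEMA's code, and the scores are made up; the sketch assumes every objective is to be maximized and returns the candidates that no other candidate dominates.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Made-up scores for four candidates on two objectives (higher is better).
candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.95)]
front = pareto_front(candidates)
print(front)
```

The third candidate is dropped because the second is better on both objectives; the other three each win on at least one trade-off and stay on the front.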
Latent Wasserstein Adversarial Imitation Learning
Authors: Siqi Yang, Kai Yan, Alexander G. Schwing, Yu-Xiong Wang
Venue: ICLR 2026
First: 2026-03-05T18:01:49+00:00 · Latest: 2026-03-05T18:01:49+00:00
Comments: 10 pages, accepted to ICLR 2026
Abstract
Imitation Learning (IL) enables agents to mimic expert behavior by learning from demonstrations. However, traditional IL methods require large amounts of medium-to-high-quality demonstrations as well as actions of expert demonstrations, both of which are often unavailable. To reduce this need, we propose Latent Wasserstein Adversarial Imitation Learning (LWAIL), a novel adversarial imitation learning framework that focuses on state-only distribution matching. It benefits from the Wasserstein distance computed in a dynamics-aware latent space. This dynamics-aware latent space differs from prior work and is obtained via a pre-training stage, where we train the Intention Conditioned Value Function (ICVF) to capture a dynamics-aware structure of the state space using a small set of randomly generated state-only data. We show that this enhances the policy's understanding of state transitions, enabling the learning process to use only one or a few state-only expert episodes to achieve expert-level performance. Through experiments on multiple MuJoCo environments, we demonstrate that our method outperforms prior Wasserstein-based IL methods and prior adversarial IL methods, achieving better results across various tasks.
中文标题/摘要
标题:潜隐 Wasserstein 对抗模仿学习
模仿学习(IL)使代理能够通过学习演示来模仿专家行为。然而,传统的IL方法需要大量的中等到高质量的演示以及专家演示的动作,这两种情况通常都不可用。为了减少这种需求,我们提出了潜隐 Wasserstein 对抗模仿学习(LWAIL),这是一种新颖的对抗模仿学习框架,专注于仅状态分布匹配。它得益于在动态感知潜隐空间中计算的 Wasserstein 距离。这种动态感知潜隐空间不同于先前的工作,并通过预训练阶段获得,其中我们使用少量随机生成的状态仅数据训练意图条件价值函数(ICVF),以捕捉状态空间的动态感知结构。我们表明,这增强了策略对状态转换的理解,使学习过程能够仅使用一个或几个状态仅专家演示来达到专家级表现。通过在多个 MuJoCo 环境中的实验,我们证明了我们的方法优于先前的基于 Wasserstein 的 IL 方法和先前的对抗 IL 方法,在各种任务上取得了更好的结果。
Summary / 总结
The research aims to address the challenge of requiring large amounts of expert demonstrations in Imitation Learning (IL) by proposing Latent Wasserstein Adversarial Imitation Learning (LWAIL). LWAIL focuses on state-only distribution matching using a dynamics-aware latent space, which is pre-trained using a small set of randomly generated state-only data. This method enhances the policy's understanding of state transitions, allowing it to achieve expert-level performance with just one or a few state-only expert episodes. Experiments on MuJoCo environments show that LWAIL outperforms previous Wasserstein-based and adversarial IL methods.
研究旨在通过提出潜空间 Wasserstein 对抗模仿学习(LWAIL)来解决获得足够高质量演示数据的挑战。LWAIL 在一个动态感知的潜空间中进行状态分布匹配,并通过少量随机生成的状态数据进行预训练。这种方法使得学习过程仅使用一两个状态专家演示就能达到专家水平的表现,优于之前的 Wasserstein 基准模仿学习和对抗模仿学习方法,在多个 MuJoCo 环境中取得了更好的结果。
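For reference, Wasserstein-based adversarial imitation is commonly written through the Kantorovich-Rubinstein dual of the 1-Wasserstein distance. The form below is a sketch of that standard identity applied to latent codes. The symbols $\phi$ (the pre-trained embedding), $\rho_\pi$ and $\rho_E$ (policy and expert state distributions), and the critic $f$ are notation assumed here; the abstract does not state the exact objective LWAIL optimizes.

```latex
W_1\big(\phi_{\#}\rho_\pi,\ \phi_{\#}\rho_E\big)
  \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}\;
  \mathbb{E}_{s \sim \rho_\pi}\big[f(\phi(s))\big]
  \;-\; \mathbb{E}_{s \sim \rho_E}\big[f(\phi(s))\big]
```

The Lipschitz constraint is taken with respect to distance in the latent space, which is where a dynamics-aware embedding would change which states count as close.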
OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
Authors: Yoonjin Oh, Yongjin Kim, Hyomin Kim, Donghwan Chi, Sungwoong Kim
First: 2025-05-28T03:45:42+00:00 · Latest: 2026-03-05T18:01:26+00:00
Comments: 11 pages, 6 figures
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text-image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relations. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting self-constructed preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object hallucination to persist. To tackle this problem, we propose Object-centric Self-improving Preference Optimization (OSPO), a self-improving framework designed to enhance object-level text-image alignment. OSPO explicitly constructs object-centric preference data without relying on any external data and external models. We also introduce a new approach that leverages attention-based object masks together with an object-weighted SimPO loss to enhance object-specific fidelity. Extensive experiments on three compositional image generation benchmarks demonstrate that OSPO significantly improves fine-grained alignment and reduces object hallucination, outperforming prior self-improving methods and even specialized diffusion-based text-to-image models.
中文标题/摘要
标题:OSPO:面向对象的自我改进偏好优化以实现文本到图像生成
近期多模态大型语言模型(MLLMs)的发展使统一的多模态理解和生成成为可能。然而,它们仍然难以实现精细的文本-图像对齐,经常无法准确描绘具有正确属性(如颜色、形状和空间关系)的对象。为解决这一问题,先前的研究探索了偏好优化方法,如DPO和GRPO,但这些方法在构建偏好数据和执行优化方面会带来巨大的计算成本。这促使了自我改进的偏好优化方法的发展,在这些方法中,MLLM自主生成自己的训练数据,自我估计偏好反馈,并使用由此产生的自我构建的偏好对进行自我优化。然而,现有的自我改进方法仍然忽视了细粒度的对象级语义,允许对象幻觉持续存在。为解决这一问题,我们提出了面向对象的自我改进偏好优化(OSPO),这是一种旨在增强对象级文本-图像对齐的自我改进框架。OSPO明确构建了面向对象的偏好数据,不依赖于任何外部数据和外部模型。我们还引入了一种新的方法,利用基于注意力的对象掩码与对象加权SimPO损失相结合,以增强对象特定的保真度。在三个组成图像生成基准上的广泛实验表明,OSPO显著提高了细粒度对齐并减少了对象幻觉,优于先前的自我改进方法,甚至优于专门的基于扩散的文本到图像模型。
Summary / 总结
The research aims to improve fine-grained text-image alignment by addressing the issue of object hallucination in text-to-image generation. The proposed OSPO framework autonomously generates preference data and optimizes using attention-based object masks and object-weighted SimPO loss. Experiments show that OSPO outperforms previous self-improving methods and specialized diffusion models in enhancing object-level text-image alignment and reducing object hallucination.
OSPO 是一个用于增强文本到图像生成中对象级文本-图像对齐的自主改进框架。它自主构建对象中心的偏好数据,并使用基于注意力的对象掩码与对象加权的 SimPO 损失来提高对象特定的保真度。实验表明,OSPO 显著提高了细粒度对齐并减少了对象幻觉,优于之前的自主改进方法和专门的扩散基文本到图像模型。
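To make the object-weighted SimPO loss more concrete, here is a minimal sketch. It starts from the published SimPO objective, a logistic loss on the difference of length-normalized log-likelihoods with a target margin, and replaces the plain token average with a weighted one. The weighting scheme, the function names, and the `beta` and `gamma` values are assumptions for illustration; the abstract does not specify how OSPO turns object masks into weights.

```python
import numpy as np


def weighted_avg_logprob(token_logps, weights):
    """Weighted mean of per-token log-probabilities (weights need not sum to 1)."""
    token_logps = np.asarray(token_logps, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float((weights * token_logps).sum() / weights.sum())


def simpo_style_loss(logps_w, logps_l, weights_w, weights_l, beta=2.0, gamma=0.5):
    """Logistic loss on the margin between preferred (w) and rejected (l) samples.

    With uniform weights this reduces to the length-normalized SimPO reward.
    """
    margin = beta * (
        weighted_avg_logprob(logps_w, weights_w)
        - weighted_avg_logprob(logps_l, weights_l)
    ) - gamma
    return float(np.log1p(np.exp(-margin)))  # equals -log(sigmoid(margin))


# Made-up per-token log-probs; hypothetical mask weights up-weight "object" tokens.
preferred = [-0.4, -0.6, -1.2]
rejected = [-0.9, -1.5, -1.1]
object_weights = [2.0, 2.0, 1.0]
print(simpo_style_loss(preferred, rejected, object_weights, object_weights))
```

Up-weighting object tokens makes the preference margin depend more on how well those tokens are modeled, which is one plausible reading of "object-weighted".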
SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
Authors: Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Minju Jeon, Hyungee Kim, Dong-Jin Kim
Venue: CVPR 2026
First: 2026-03-05T17:59:58+00:00 · Latest: 2026-03-05T17:59:58+00:00
Comments: Accepted to CVPR 2026
Abstract
Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos while training only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
中文标题/摘要
标题:SAIL:基于相似性感知的跨模态增强学习方法用于弱监督密集视频字幕生成
弱监督密集视频字幕生成旨在仅基于字幕注解训练时,定位并描述视频中的事件,而无需时间边界。先前的工作引入了基于高斯掩码和互补字幕的隐式监督范式。然而,现有方法仅关注生成不重叠的掩码,而未考虑其与相应事件的语义关系,导致生成简单且均匀分布的掩码,无法捕捉到语义上有意义的区域。此外,仅依赖真实字幕会导致性能不佳,因为现有数据集的稀疏性。在本工作中,我们提出了SAIL,通过跨模态对齐构建语义感知的掩码。我们的相似性感知训练目标引导掩码强调与相应事件字幕高度相似的视频区域。此外,为了在稀疏注释设置下引导更准确的掩码生成,我们引入了一种基于LLM的增强策略,生成合成字幕以提供额外的对齐信号。这些合成字幕通过跨掩码机制整合,为精确的时间定位提供辅助指导,而不损害主要目标。在ActivityNet Captions和YouCook2上的实验表明,SAIL在字幕生成和定位指标上均达到了最先进的性能。
Summary / 总结
SAIL aims to improve weakly-supervised dense video captioning by constructing semantically-aware masks through cross-modal alignment. It uses a similarity-aware training objective to emphasize regions with high similarity to event captions and introduces an LLM-based augmentation strategy to generate synthetic captions, which are used to provide additional alignment signals. Experiments show SAIL outperforms existing methods on both captioning and localization metrics on ActivityNet Captions and YouCook2 datasets.
该研究提出了一种名为SAIL的方法,通过跨模态对齐构建语义感知的掩码。SAIL使用相似性感知的训练目标强调与事件描述高度相似的视频区域,并通过LLM生成的合成描述提供额外的对齐信号。实验结果显示,SAIL在ActivityNet Captions和YouCook2数据集上的字幕生成和定位指标上均优于现有方法。
On-Policy Self-Distillation for Reasoning Compression
Authors: Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun
First: 2026-03-05T17:54:40+00:00 · Latest: 2026-03-05T17:54:40+00:00
Abstract
Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant: it is actively harmful, compounding errors with every unnecessary token.
中文标题/摘要
标题:基于策略自我蒸馏的推理压缩
推理模型会大声思考,但其中大部分内容都是噪音。我们引入了OPSDC(基于策略自我蒸馏的推理压缩),这是一种通过将模型自身的简洁行为蒸馏回自身来教会模型更简洁地推理的方法。整个方法归结为一个想法:用“要简洁”指令条件化同一个模型以获得教师概率分布,并在学生的自身展开中最小化每个词元的逆KL散度。无需真实答案,无需词元预算,无需难度估计器。只有自我蒸馏。然而,这种简单性背后隐藏着惊人的复杂性:OPSDC会自动对简单问题进行激进压缩,同时保留解决难题所需的推理。在Qwen3-8B和Qwen3-14B上,我们在MATH-500上实现了57-59%的词元减少,同时准确率提高了9-16个绝对点。在AIME 2024上,14B模型在41%压缩的情况下获得了10分的提升。秘诀在于,推理模型生成的内容不仅冗余,而且是具有破坏性的,每个不必要的词元都会加剧错误的累积。
Summary / 总结
OPSDC (On-Policy Self-Distillation for Reasoning Compression) teaches reasoning models to reason more concisely by distilling their own concise behavior back into themselves. The same model, conditioned on a "be concise" instruction, supplies the teacher logits, and training minimizes per-token reverse KL on the student's own rollouts, with no ground-truth answers, token budgets, or difficulty estimators. The method compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, OPSDC achieves 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The authors attribute the gains to much of the reasoning output being not only redundant but actively harmful, with unnecessary tokens compounding errors.
OPSDC(On-Policy Self-Distillation for Reasoning Compression)是一种通过将模型自己的简洁行为反向提炼回自身来教会模型更简洁推理的方法。该方法通过将模型条件化在'简洁'指令上以获得教师概率分布,并在学生自己的模拟中最小化每个标记的反向KL散度。在Qwen3-8B和Qwen3-14B上,OPSDC实现了57-59%的标记减少,同时提高了9-16个点的准确性。在AIME 2024上,14B模型在41%的压缩下获得了10个点的提升,表明模型生成的内容不仅冗余,而且有害,每个不必要的标记都会加剧错误的累积。
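The per-token reverse KL at the core of the method can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the logits are random stand-ins, the function names are hypothetical, and averaging over positions is an assumed reduction. "Reverse" here means KL(student || teacher), with the expectation taken under the student's distribution.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def per_token_reverse_kl(student_logits: np.ndarray, teacher_logits: np.ndarray) -> np.ndarray:
    """KL(student || teacher) at each position; inputs have shape (T, V)."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return (np.exp(log_p_s) * (log_p_s - log_p_t)).sum(axis=-1)


rng = np.random.default_rng(0)
# Stand-ins: logits at 4 positions of a student rollout over a 6-token vocabulary.
student = rng.normal(size=(4, 6))
# Stand-in for the same model's logits when prompted with a "be concise" instruction.
teacher = rng.normal(size=(4, 6))

kl = per_token_reverse_kl(student, teacher)
loss = kl.mean()  # assumed reduction: mean over rollout positions
print(kl.shape, loss)
```

In an actual training loop the teacher logits would presumably be treated as constants, so that gradients flow only through the student's distribution.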
Ensembling Language Models with Sequential Monte Carlo
Authors: Robin Shing Moon Chan, Tianyu Liu, Samuel Kiegeland, Clemente Pasti, Jacob Hoover Vigly, Timothy J. O'Donnell, Ryan Cotterell, Tim Vieira
First: 2026-03-05T17:54:31+00:00 · Latest: 2026-03-05T17:54:31+00:00
Abstract
Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.
中文标题/摘要
标题:使用序列蒙特卡洛集成语言模型
从业者可以访问大量语言模型和提示策略以解决许多语言建模任务;然而,先前的工作表明,模型性能对这些选择的高度敏感。经典的机器学习集成技术提供了一种原则性的方法:通过聚合多个来源的预测来实现比任何单一来源更好的性能。然而,在解码过程中将集成应用于语言模型是具有挑战性的:简单地聚合下一个标记的概率会产生一个局部归一化、有偏差的近似集成字符串分布。在本文中,我们介绍了一种统一框架,用于将 $K$ 个语言模型组成 $f$-集成分布,适用于一系列函数 $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$。为了从这些分布中采样,我们提出了一种字节级序列蒙特卡洛(SMC)算法,在共享字符空间中操作,使具有不同词汇表的模型集成成为可能,并在极限中实现一致采样。我们评估了不同提示和模型组合下的 $f$-集成族在各种结构化文本生成任务中的表现,突出了替代聚合策略相对于传统概率平均的优势,并展示了更好的后验近似可以提高集成性能。
Summary / 总结
This work addresses the challenge of ensembling language models during decoding, where naively aggregating next-token probabilities yields a locally normalized, biased approximation of the intended ensemble distribution over strings. It introduces a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of aggregation functions $f$. To sample from these distributions, the authors propose a byte-level sequential Monte Carlo algorithm that operates in a shared character space, which allows ensembles of models with mismatching vocabularies and gives consistent sampling in the limit. Experiments on structured text generation tasks highlight the benefits of alternative aggregation strategies over traditional probability averaging and show that better posterior approximations can yield better ensemble performance.
该研究通过引入一种统一框架,使用多种函数将多个语言模型组合成集成分布,解决了在解码过程中集成语言模型的挑战。作者提出了一种字节级别的顺序蒙特卡洛算法来从这些分布中采样,该算法能够处理具有不同词汇表的模型。实验表明,不同的聚合策略可以优于传统的概率平均,从而提高集成性能。
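The gap between local and global normalization can be seen on a toy example. The sketch below is not the paper's byte-level SMC algorithm; it enumerates all length-2 strings over a two-symbol alphabet for two made-up models and takes $f$ to be the product. The tables `LM1` and `LM2` and the function names are hypothetical.

```python
from itertools import product

ALPHABET = "ab"

# Two toy autoregressive models over length-2 strings. Keys are prefixes,
# values are next-symbol probabilities. All numbers are made up.
LM1 = {"": {"a": 0.5, "b": 0.5}, "a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}
LM2 = {"": {"a": 0.5, "b": 0.5}, "a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}

STRINGS = ["".join(t) for t in product(ALPHABET, repeat=2)]


def string_prob(lm, s):
    """Probability of the full string s under an autoregressive table."""
    p = 1.0
    for i, ch in enumerate(s):
        p *= lm[s[:i]][ch]
    return p


def global_product_ensemble():
    """Normalize p1(s) * p2(s) once, over whole strings."""
    unnorm = {s: string_prob(LM1, s) * string_prob(LM2, s) for s in STRINGS}
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}


def local_product_ensemble():
    """Multiply next-symbol probabilities and renormalize at every step."""
    out = {}
    for s in STRINGS:
        p = 1.0
        for i, ch in enumerate(s):
            scores = {c: LM1[s[:i]][c] * LM2[s[:i]][c] for c in ALPHABET}
            p *= scores[ch] / sum(scores.values())
        out[s] = p
    return out


glob = global_product_ensemble()
loc = local_product_ensemble()
p_a_global = glob["aa"] + glob["ab"]  # probability that the string starts with "a"
p_a_local = loc["aa"] + loc["ab"]
print(round(p_a_global, 3), round(p_a_local, 3))
```

The two models agree on the first symbol but disagree sharply about what follows "a". The global ensemble therefore puts less mass on strings starting with "a", while the locally normalized version cannot see that future disagreement and keeps the first symbol at one half.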
RA-QA: A Benchmarking System for Respiratory Audio Question Answering Under Real-World Heterogeneity
Authors: Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo
First: 2026-02-04T13:25:47+00:00 · Latest: 2026-03-05T17:54:01+00:00
Abstract
As conversational multimodal AI tools are increasingly adopted to process patient data for health assessment, robust benchmarks are needed to measure progress and expose failure modes under realistic conditions. Despite the importance of respiratory audio for mobile health screening, respiratory audio question answering remains underexplored, with existing studies evaluated narrowly and lacking real-world heterogeneity across modalities, devices, and question types. We hence introduce the Respiratory-Audio Question-Answering (RA-QA) benchmark, including a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol. RA-QA harmonizes public RA datasets into a collection of 9 million format-diverse QA pairs covering diagnostic and contextual attributes. We benchmark classical ML baselines alongside multimodal audio-language models, establishing reproducible reference points and showing how current approaches fail under heterogeneity.
中文标题/摘要
标题:RA-QA:在现实世界异质性条件下呼吸音频问答基准系统
随着对话型多模态AI工具被越来越多地用于处理患者数据以进行健康评估,需要稳健的基准来衡量在现实条件下的进步并揭示失败模式。尽管呼吸音频对于移动健康筛查至关重要,但呼吸音频问答仍被研究不足,现有研究评估狭窄且缺乏跨模态、设备和问题类型的现实世界异质性。因此,我们引入了呼吸音频问答(RA-QA)基准,包括标准化的数据生成管道、全面的多模态问答集合以及统一的评估协议。RA-QA 将公共呼吸音频数据集整合为包含900万对格式多样的问答对,涵盖诊断和上下文属性。我们基准测试了经典机器学习基线和多模态音频-语言模型,建立了可重复的参考点,并展示了当前方法在异质性条件下失效的情况。
Summary / 总结
The research aims to develop a robust benchmark for evaluating respiratory audio question answering under real-world conditions. The study introduces the RA-QA benchmark, which includes a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol, and which harmonizes public respiratory audio datasets into 9 million format-diverse QA pairs covering diagnostic and contextual attributes. Benchmarking classical ML baselines alongside multimodal audio-language models establishes reproducible reference points and shows how current approaches fail under heterogeneity across modalities, devices, and question types.
RA-QA基准旨在评估呼吸音频问答系统在真实世界条件下的性能。它包括标准化的数据生成管道、全面的多模态QA集合和统一的评估协议。该基准涵盖了900万种格式各异的QA对,并评估了经典ML基线和多模态音频-语言模型,揭示了它们在异质性下的局限性。
RelaxFlow: Text-Driven Amodal 3D Generation
Authors: Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao
First: 2026-03-05T17:45:47+00:00 · Latest: 2026-03-05T17:45:47+00:00
Comments: Code: https://github.com/viridityzhu/RelaxFlow
Abstract
Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.
中文标题/摘要
标题:RelaxFlow:文本驱动的无遮挡3D生成
从图像到3D的生成面临着在遮挡下固有的语义模糊性,仅凭部分观察往往不足以确定物体类别。在本文中,我们形式化了文本驱动的无遮挡3D生成,其中文本提示引导未见区域的完成,同时严格保留输入观察。关键的是,我们发现这些目标需要不同的控制粒度:对观察进行刚性控制,而对提示进行放松的结构控制。为此,我们提出了RelaxFlow,这是一种无需训练的双分支框架,通过多先验一致性模块和放松机制解耦控制粒度。理论上,我们证明我们的放松等同于在生成向量场中应用低通滤波器,这抑制了高频实例细节以隔离几何结构,以适应观察。为了便于评估,我们引入了两个诊断基准,ExtremeOcc-3D和AmbiSem-3D。广泛的实验表明,RelaxFlow成功地引导了未见区域的生成以匹配提示意图,而不牺牲视觉保真度。
Summary / 总结
The paper addresses the challenge of generating 3D models from partial observations, where text prompts guide the completion of unseen regions while preserving the observed parts. It introduces RelaxFlow, a training-free framework that uses a Multi-Prior Consensus Module and a Relaxation Mechanism to decouple control granularity. Theoretical analysis shows that relaxation suppresses high-frequency details, focusing on geometric structure. Experiments show that RelaxFlow successfully matches the prompt intent without losing visual fidelity.
论文解决了从部分观察生成3D模型的挑战,其中文本提示引导未见区域的完成,同时保留观察到的部分。它引入了RelaxFlow,这是一种无需训练的框架,使用多先验一致性模块和放松机制来解耦控制粒度。理论分析表明,放松机制抑制了高频细节,专注于几何结构。实验表明,RelaxFlow能够匹配提示意图而不损失视觉保真度。