arXiv 论文速递

Snapshot: 20260309_0327

Accelerating Text-to-Video Generation with Calibrated Sparse Attention

Authors: Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar

First: 2026-03-05T18:59:32+00:00 · Latest: 2026-03-05T18:59:32+00:00

Abstract

Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.

中文标题/摘要

标题：校准稀疏注意加速文本到视频生成

近期的扩散模型能够生成高质量的视频，但运行速度较慢。这些模型中使用的大型基于变压器的骨干网络在时空注意方面存在瓶颈。本文中，我们发现许多词到词的连接在各种输入中持续产生微不足道的分数，并且它们的模式在查询之间经常重复。因此，在这些情况下可以跳过注意计算，对结果影响甚微。这一观察结果同样适用于局部词块之间的连接。受此启发，我们引入了CalibAtt，这是一种无需训练的方法，通过校准稀疏注意来加速视频生成。CalibAtt 进行了一次离线校准过程，以识别在各种输入中稳定的块级稀疏性和重复模式，并将这些模式编译为每层、每个头和每个扩散时间步的优化注意操作。在推理时，我们对选定的输入相关连接进行密集计算，并以硬件高效的方式跳过未选中的连接。在Wan 2.1 14B、Mochi 1和不同分辨率的少量步骤蒸馏模型上进行的广泛实验表明，CalibAtt 可以实现高达1.58倍的端到端加速，同时保持视频生成质量和文本-视频对齐。

Summary / 总结

This paper addresses the slow runtime of diffusion models used for high-quality video generation. It introduces CalibAtt, a training-free method that accelerates video generation by skipping unnecessary attention computations based on identified sparsity and repetition patterns. Experiments show that CalibAtt achieves up to 1.58x speedup while maintaining video quality and text-video alignment.

本文针对用于高质量文本到视频生成的扩散模型运行缓慢的问题，提出了一种名为CalibAtt的无训练加速方法，通过离线校准跳过不重要的token-to-token连接。实验表明，CalibAtt可以实现最高1.58倍的加速，同时保持视频质量和文本到视频的对齐。
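
As a rough illustration of the calibrate-then-skip idea behind CalibAtt, the NumPy sketch below builds a block-level keep mask from a few calibration inputs and then evaluates attention only over the kept key blocks. It is a minimal sketch and not the authors' implementation: the function names, the `keep_mass` coverage criterion, and the single-head setup are assumptions made for illustration, whereas the paper compiles patterns per layer, head, and diffusion timestep into hardware-efficient operations.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def calibrate_block_mask(q_samples, k_samples, block, keep_mass=0.95):
    """Offline pass (hypothetical criterion): average the attention mass of each
    (query block, key block) tile over calibration inputs, then keep, for every
    query block, the fewest key blocks whose average mass reaches keep_mass.
    Assumes the sequence length is divisible by `block`."""
    n = q_samples[0].shape[0]
    nb = n // block
    mass = np.zeros((nb, nb))
    for q, k in zip(q_samples, k_samples):
        p = softmax(q @ k.T / np.sqrt(q.shape[1]))
        # Sum probabilities inside each tile, averaged over the query rows.
        mass += p.reshape(nb, block, nb, block).sum(axis=(1, 3)) / block
    mass /= len(q_samples)
    mask = np.zeros((nb, nb), dtype=bool)
    for i in range(nb):
        order = np.argsort(-mass[i])
        cum = np.cumsum(mass[i, order])
        n_keep = min(int(np.searchsorted(cum, keep_mass)) + 1, nb)
        mask[i, order[:n_keep]] = True
    return mask

def block_sparse_attention(q, k, v, mask, block):
    """Inference: dense attention restricted to the kept key blocks."""
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    for i in range(mask.shape[0]):
        rows = slice(i * block, (i + 1) * block)
        cols = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in np.flatnonzero(mask[i])]
        )
        p = softmax(q[rows] @ k[cols].T / np.sqrt(d))
        out[rows] = p @ v[cols]
    return out

rng = np.random.default_rng(0)
n, d, block = 16, 8, 4
calib_q = [rng.normal(size=(n, d)) for _ in range(3)]
calib_k = [rng.normal(size=(n, d)) for _ in range(3)]
mask = calibrate_block_mask(calib_q, calib_k, block, keep_mass=0.9)

q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
sparse_out = block_sparse_attention(q, k, v, mask, block)
```

With an all-True mask the sparse routine reduces to ordinary dense attention, which is a convenient sanity check before lowering `keep_mass`.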

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

Authors: Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu

First: 2026-03-05T18:59:23+00:00 · Latest: 2026-03-05T18:59:23+00:00

Comments: Technical report v1 (14 pages, 7 figures, project page: https://spherelab.ai/poetx/)

Abstract

Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, and in contrast, standard optimizers such as AdamW run out of memory under the same settings.

中文标题/摘要

标题：POET-X：通过扩展正交变换提高大语言模型训练的内存效率

在现代机器学习系统中，高效且稳定的大型语言模型（LLMs）训练仍然是一个核心挑战。为了解决这一挑战，提出了参数化正交等价训练（POET），这是一种保持频谱的框架，通过正交等价变换优化每个权重矩阵。尽管POET提供了强大的训练稳定性，但其原始实现由于密集矩阵乘法导致了高内存消耗和计算开销。为克服这些限制，我们引入了POET-X，这是一种可扩展且内存高效的变体，通过显著降低计算成本来执行正交等价变换。POET-X保持了POET的一般化和稳定性优势，同时在吞吐量和内存效率方面实现了显著改进。在我们的实验中，POET-X能够在单个Nvidia H100 GPU上预训练具有十亿参数的LLMs，而标准优化器如AdamW在相同设置下则会因内存不足而无法运行。

Summary / 总结

POET-X is a memory-efficient variant of the POET framework for training large language models (LLMs) that maintains the generalization and stability benefits of the original POET while significantly reducing computational cost and memory consumption. Experiments show that POET-X allows the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers like AdamW fail due to memory constraints under the same conditions.

研究旨在解决大规模语言模型（LLM）高效且稳定的训练问题。POET-X 通过减少计算成本来实现这一点，它使用正交等价变换。实验结果显示，POET-X 可以在单个 Nvidia H100 GPU 上预训练十亿参数的 LLM，而标准优化器如 AdamW 在相同条件下则因内存不足而无法运行。
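
The "spectrum-preserving" property mentioned in the abstract can be checked numerically: multiplying a weight matrix on both sides by orthogonal matrices leaves its singular values unchanged. The sketch below assumes the generic form W = R W0 P with random orthogonal R and P; how POET-X actually parameterizes and updates these matrices is not described here, and `random_orthogonal` is a helper introduced only for this illustration.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    # Flipping column signs keeps q orthogonal and removes the QR sign ambiguity.
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
w0 = rng.normal(size=(6, 4))          # fixed initial weight matrix
r = random_orthogonal(6, rng)         # left orthogonal factor
p = random_orthogonal(4, rng)         # right orthogonal factor
w = r @ w0 @ p                        # orthogonal equivalence transformation

sv_before = np.linalg.svd(w0, compute_uv=False)
sv_after = np.linalg.svd(w, compute_uv=False)
```

Because only R and P change during training in such a scheme, the singular values of the effective weight stay fixed at those of W0.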

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Authors: Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda

First: 2026-03-05T18:58:14+00:00 · Latest: 2026-03-05T18:58:14+00:00

Abstract

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.

中文标题/摘要

标题：被屏蔽的LLM作为秘密知识提取的自然试验场

大型语言模型有时会产生虚假或误导性的回答。解决这一问题的两种方法是诚实性提取——通过修改提示或权重使模型如实回答——和谎言检测——对给定的回答进行分类以判断其是否虚假。先前的工作在专门训练以撒谎或隐瞒信息的模型上评估了这些方法，但这些人工构建的模型可能无法反映自然发生的不诚实行为。相反，我们研究了来自中国开发者的开放权重LLM，这些模型被训练以屏蔽政治敏感话题：Qwen3模型经常在诸如法轮功或天安门抗议等主题上产生虚假信息，偶尔却能正确回答，表明它们拥有被训练压制的知识。利用这一作为试验场，我们评估了一系列提取和谎言检测技术。对于诚实性提取，不使用聊天模板的采样、少量示例提示和在通用诚实数据上微调最可靠地增加了真实回答。对于谎言检测，促使被屏蔽模型分类其自身回答接近未屏蔽模型的上限，而基于无关数据训练的线性探针提供了一种更便宜的替代方案。最强的诚实性提取技术也适用于前沿的开放权重模型，包括DeepSeek R1。值得注意的是，没有技术能够完全消除虚假回答。我们发布了所有提示、代码和对话记录。

Summary / 总结

This study investigates the effectiveness of honesty elicitation and lie detection techniques on censored large language models (LLMs) from Chinese developers, which are trained to suppress politically sensitive topics. The research finds that sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data are the most reliable methods for increasing truthful responses. For lie detection, prompting the censored model to classify its own responses and using linear probes trained on unrelated data are effective. Notably, no technique can completely eliminate false responses. The study uses Qwen3 models, which frequently produce falsehoods about sensitive topics while occasionally answering correctly, as a testbed for evaluating these techniques.

研究探讨了在来自中国开发者的、被训练抑制政治敏感话题的大型语言模型（LLM）上应用诚实引诱和谎言检测技术的有效性。研究发现，不使用聊天模板的采样、少量示例提示和在通用诚实数据上进行微调是增加真实回答最可靠的方法。对于谎言检测，提示被抑制的模型对其自身响应进行分类和使用在无关数据上训练的线性探针都是有效的。值得注意的是，没有任何技术能够完全消除虚假回答。该研究提供了一个自然的测试平台，用于评估在表现出自然不诚实行为的模型上应用这些技术的方法。

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, Victor Zhong

First: 2026-03-04T18:47:26+00:00 · Latest: 2026-03-05T18:56:37+00:00

Abstract

Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50% with conventional embedding models twice its size, and 37% with BM25. Code and data are available at: https://texttron.github.io/AgentIR/.

中文标题/摘要

标题：AgentIR：具备推理意识的检索方法以促进深度研究代理

深度研究代理正迅速成为现代检索系统的主要消费者。与人类用户在不记录其中间思维过程的情况下提出并细化查询不同，深度研究代理在每次搜索调用前都会生成明确的自然语言推理，揭示出现有检索器完全忽略的丰富意图和上下文信息。为了利用这一被忽视的信号，我们引入了：(1) 具备推理意识的检索，这是一种检索范式，将代理的推理轨迹与查询一起联合嵌入；(2) DR-Synth，一种从标准问答数据集中生成深度研究检索训练数据的方法。我们证明了这两个组件各自有效，它们的结合产生了训练嵌入模型AgentIR-4B，取得了显著的提升。在具有挑战性的BrowseComp-Plus基准测试中，使用开放权重代理Tongyi-DeepResearch的AgentIR-4B达到了68%的准确率，而传统的两倍大小的嵌入模型仅为50%，BM25仅为37%。代码和数据可在：https://texttron.github.io/AgentIR/ 获取。

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Authors: Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo

First: 2026-03-05T18:55:16+00:00 · Latest: 2026-03-05T18:55:16+00:00

Abstract

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task difficulty-specific differences: the model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.

中文标题/摘要

标题：推理剧场：解开模型信念与推理链的纠缠

我们提供了推理模型中表演性推理链（CoT）的证据，其中模型对其最终答案变得非常自信，但继续生成令牌而不揭示其内部信念。我们的分析比较了激活探针、早期强制回答和CoT监控在两个大型模型（DeepSeek-R1 671B & GPT-OSS 120B）上的表现，并发现任务难度特定的差异：模型的最终答案可以在CoT的早期激活中被解码，而监控则无法做到这一点，尤其是在简单的基于回忆的MMLU问题上。我们将其与困难的多跳GPQA-Diamond问题中的真正推理进行了对比。尽管如此，转折点（例如回溯、‘恍然大悟’时刻）几乎仅出现在探针显示大规模信念变化的响应中，这表明这些行为追踪的是真正的不确定性，而不是学习到的“推理剧场”。最后，探针引导的早期退出在MMLU上最多可减少80%的令牌，在GPQA-Diamond上减少30%，同时保持相似的准确性，将注意力探针定位为检测表演性推理的有效工具，并使计算适应性成为可能。

Summary / 总结

The study examines performative chain-of-thought (CoT) in reasoning models: a model becomes strongly confident in its final answer yet keeps generating tokens without revealing that belief. Comparing activation probing, early forced answering, and a CoT monitor on DeepSeek-R1 671B and GPT-OSS 120B, it finds that the final answer is decodable from activations far earlier than the monitor can detect, especially on easy recall-based MMLU questions, whereas difficult multihop GPQA-Diamond questions show genuine reasoning. Inflection points such as backtracking occur almost exclusively in responses where probes show large belief shifts, which suggests that they track genuine uncertainty. Probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, indicating that probing can efficiently detect performative reasoning and enable adaptive computation.

研究探讨了推理模型中的表演性链式思考（CoT），即模型生成令牌而不揭示其内部信念。通过比较两种大型模型（DeepSeek-R1 671B 和 GPT-OSS 120B）的激活探针、早期强制回答和CoT监控，研究发现，对于简单的回忆型问题，最终答案可以在CoT的更早阶段被解码，而监控则无法检测到。然而，在困难的多跳问题中，真正的推理显示出与信念大幅变化相关的转折点，表明这些变化反映了真正的不确定性。研究还展示了基于探针的早期退出可以将MMLU上的令牌减少多达80%，GPQA-Diamond上的令牌减少30%，同时保持相似的准确性，表明注意力探针可以有效检测表演性推理并实现适应性计算。
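
Probe-guided early exit can be pictured as a stopping rule over a stream of probe confidences, one per reasoning step. The sketch below is an assumption-laden illustration rather than the paper's procedure: the `threshold` and `patience` parameters and the function name are hypothetical, and the confidences are made-up numbers standing in for a probe's estimate that the final answer is already fixed.

```python
from typing import Optional, Sequence

def early_exit_step(
    confidences: Sequence[float],
    threshold: float = 0.95,
    patience: int = 2,
) -> Optional[int]:
    """Return the first step index at which the probe confidence has been at or
    above `threshold` for `patience` consecutive steps, or None if it never is."""
    streak = 0
    for step, confidence in enumerate(confidences):
        streak = streak + 1 if confidence >= threshold else 0
        if streak >= patience:
            return step
    return None

# Illustrative probe confidences for one chain-of-thought (not real data).
trace = [0.50, 0.96, 0.70, 0.97, 0.98, 0.99]
exit_at = early_exit_step(trace)
```

Requiring several consecutive confident steps is one simple way to avoid exiting on a momentary spike just before the model backtracks.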

Deep FlexQP: Accelerated Nonlinear Programming via Deep Unfolding

Authors: Alex Oshin, Rahul Vodeb Ghosh, Augustinos D. Saravanos, Evangelos A. Theodorou

Venue: ICLR 2026

First: 2025-12-01T11:38:45+00:00 · Latest: 2026-03-05T18:54:48+00:00

Comments: Accepted to ICLR 2026

Abstract

We propose FlexQP, an always-feasible convex quadratic programming (QP) solver based on an $\ell_1$ elastic relaxation of the QP constraints. If the original constraints are feasible, FlexQP provably recovers the optimal solution. If the constraints are infeasible, FlexQP identifies a solution that minimizes the constraint violation while keeping the number of violated constraints sparse. Such infeasibilities arise naturally in sequential quadratic programming (SQP) subproblems due to the linearization of the constraints. We prove the convergence of FlexQP under mild coercivity assumptions, making it robust to both feasible and infeasible QPs. We then apply deep unfolding to learn LSTM-based, dimension-agnostic feedback policies for the algorithm parameters, yielding an accelerated Deep FlexQP. To preserve the exactness guarantees of the relaxation, we propose a normalized training loss that incorporates the Lagrange multipliers. We additionally design a log-scaled loss for PAC-Bayes generalization bounds that yields substantially tighter performance certificates, which we use to construct an accelerated SQP solver with guaranteed QP subproblem performance. Deep FlexQP outperforms state-of-the-art learned QP solvers on a suite of benchmarks including portfolio optimization, classification, and regression problems, and scales to dense QPs with over 10k variables and constraints via fine-tuning. When deployed within SQP, our approach solves nonlinear trajectory optimization problems 4-16x faster than SQP with OSQP while substantially improving success rates. On predictive safety filter problems, Deep FlexQP reduces safety violations by over 70% and increases task completion by 43% compared to existing methods.

中文标题/摘要

标题：Deep FlexQP：基于深度展开的加速非线性规划

我们提出了一种基于QP约束的$\ell_1$弹性松弛的凸二次规划（QP）求解器FlexQP。如果原始约束可行，FlexQP能够证明恢复最优解。如果约束不可行，FlexQP将识别一个最小化约束违反的解，同时保持违反约束的数量稀疏。这种不可行性在序列二次规划（SQP）子问题中由于约束的线性化而自然出现。在温和的强制性假设下，我们证明了FlexQP的收敛性，使其对可行和不可行的QP都具有鲁棒性。然后，我们应用深度展开来学习基于LSTM的、维度无关的反馈策略，以加速算法参数，从而得到加速的Deep FlexQP。为了保持松弛的精确性保证，我们提出了一种归一化的训练损失，该损失包含拉格朗日乘子。我们还设计了一种对数缩放损失，用于PAC-Bayes泛化界，该损失提供了显著更紧的性能证书，我们使用它来构建具有保证QP子问题性能的加速SQP求解器。Deep FlexQP在包括投资组合优化、分类和回归问题的一系列基准测试中优于最先进的学习QP求解器，并通过微调扩展到具有超过10,000个变量和约束的密集QP。当部署在SQP中时，我们的方法比使用OSQP的SQP快4-16倍，同时显著提高了成功率。在预测性安全过滤器问题上，Deep FlexQP将安全违规减少了70%以上，任务完成率提高了43%，优于现有方法。

Summary / 总结

The paper introduces FlexQP, an always-feasible convex QP solver built on an $\ell_1$ elastic relaxation of the constraints. When the original constraints are feasible, it provably recovers the optimal solution; when they are infeasible, it returns a solution that minimizes constraint violation while keeping the set of violated constraints sparse. Deep FlexQP, an accelerated version obtained via deep unfolding, learns LSTM-based feedback policies for the algorithm parameters. It outperforms state-of-the-art learned QP solvers on portfolio optimization, classification, and regression benchmarks, and within SQP it solves nonlinear trajectory optimization problems 4-16x faster than SQP with OSQP while improving success rates. On predictive safety filter problems, it reduces safety violations by over 70% and increases task completion by 43%.

论文提出了FlexQP，一种能够处理不可行约束的凸QP求解器，通过最小化约束违反同时保持违反约束的数量稀疏来实现。该方法在约束可行时能够恢复最优解。Deep FlexQP 是通过深度展开学习LSTM 基础的策略来优化算法参数的加速版本。该方法在各种基准测试中优于现有的QP求解器，并显著加快了非线性轨迹优化问题的速度，同时提高了成功率。在预测安全过滤问题中，Deep FlexQP 将安全违规减少了超过70%，并提高了43%的任务完成率。
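
For readers unfamiliar with elastic relaxations, a generic $\ell_1$ exact-penalty form of a convex QP is written out below. This is a textbook-style sketch under assumed notation ($P$, $q$, $A$, $b$, $G$, $h$, and the penalty weight $\mu$ are not taken from the paper), and FlexQP's actual formulation may differ in its details.

```latex
% Standard convex QP (P positive semidefinite):
\begin{aligned}
\min_{x}\quad & \tfrac{1}{2} x^\top P x + q^\top x \\
\text{s.t.}\quad & A x = b, \qquad G x \le h .
\end{aligned}

% Generic \ell_1 elastic (exact-penalty) relaxation with weight \mu > 0:
\min_{x}\quad \tfrac{1}{2} x^\top P x + q^\top x
  + \mu \,\lVert A x - b \rVert_1
  + \mu \,\lVert \max(G x - h,\, 0) \rVert_1 .
```

The relaxed problem is always feasible because any $x$ is admissible. By the standard exact-penalty result, when the original QP is feasible and $\mu$ exceeds the largest absolute value of its optimal Lagrange multipliers, minimizers of the relaxed problem also solve the original QP. The $\ell_1$ norm is also what encourages only a few constraints to be violated when the original QP is infeasible.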

Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation

Authors: Benjamin Feuer, Lucas Rosenblatt, Oussama Elachqar

First: 2026-03-05T18:52:28+00:00 · Latest: 2026-03-05T18:52:28+00:00

Abstract

As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.

中文标题/摘要

标题：通过偏差有界评估迈向可证明无偏的LLM法官

随着AI模型从简单的聊天机器人发展到更复杂的流程，我们正逐渐接近一个临界点，在这个临界点之后，AI系统将被用于自主、自我维护的反馈循环中。任何自主AI系统都将依赖于自动化的、可验证的奖励和反馈；在地面真实数据稀疏或非确定性的环境中，一个实际的奖励来源是LLM作为法官。尽管LLM法官不断改进，但文献中尚未引入能够以强保证执行标准的系统，尤其是在偏差向量未知或被敌对发现的情况下。为解决这一问题，我们提出了平均偏差有界性（A-BB），这是一种算法框架，正式保证了由于任何可测量偏差而导致的任何危害/影响的减少。在Arena-Hard-Auto上使用四个LLM法官进行评估，我们实现了（tau=0.5，delta=0.01）的偏差有界保证，同时在格式化和方案偏差设置中保留了61-99%与原始排名的相关性，大多数法官-偏差组合超过80%。我们的发现的复现代码可在https://github.com/penfever/bias-bounded-evaluation获取。

Summary / 总结

The research aims to give LLM-as-a-Judge systems formal bias guarantees so that they can supply automated, verifiable rewards to autonomous AI systems. It proposes average bias-boundedness (A-BB), an algorithmic framework that formally guarantees reductions in harm from any measurable bias in an LLM judge. On Arena-Hard-Auto with four LLM judges, A-BB achieves (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with the original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%.

该论文旨在确保AI系统在复杂工作流程中做出的判断无偏，特别是在AI自主运行的情况下。它提出了一种名为平均偏置有界性（A-BB）的算法框架，以正式保证由于任何可测量偏见而导致的危害减少。在Arena-Hard-Auto上的评估显示，A-BB在保持与原始排名的强相关性的同时提供了偏置有界性保证，大多数法官偏见组合在格式化和方案偏见设置下超过80%的相关性。

Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Authors: Guo Chen, Lidong Lu, Yicheng Liu, Liangrui Dong, Lidong Zou, Jixin Lv, Zhenquan Li, Xinyi Mao, Baoqi Pei, Shihao Wang, Zhiqi Li, Karan Sapra, Fuxiao Liu, Yin-Dong Zheng, Yifei Huang, Limin Wang, Zhiding Yu, Andrew Tao, Guilin Liu, Tong Lu

First: 2026-03-05T18:52:12+00:00 · Latest: 2026-03-05T18:52:12+00:00

Abstract

While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.

中文标题/摘要

标题：迈向多模态终身理解：一个数据集和能动基线

尽管视频理解的数据集已经扩展到小时级长度，但它们通常由紧密连接的片段组成，与自然、非剧本化的日常生活不同。为弥合这一差距，我们引入了MM-Lifelong数据集，旨在用于多模态终身理解。该数据集包含181.1小时的视频，按日、周、月结构化，以捕捉不同的时间密度。广泛评估表明，当前范式存在两种关键失败模式：端到端的MLLMs因上下文饱和而遭受工作记忆瓶颈，而代表性的能动基线在导航稀疏的月度时间线时则经历全局定位崩溃。为解决这一问题，我们提出了递归多模态代理（ReMA），它采用动态内存管理，迭代更新递归信念状态，显著优于现有方法。最后，我们建立了数据集划分，以隔离时间偏见和领域偏见，为未来监督学习和离分布泛化的研究提供严格的基石。

Summary / 总结

The research aims to bridge the gap between existing video understanding datasets and natural, unscripted daily life by introducing MM-Lifelong, a dataset with 181.1 hours of footage structured across Day, Week, and Month scales. The study evaluates current paradigms and finds that end-to-end MLLMs face a Working Memory Bottleneck and representative agentic baselines struggle with Global Localization Collapse. To address these issues, the Recursive Multimodal Agent (ReMA) is proposed, which uses dynamic memory management to iteratively update a recursive belief state, outperforming existing methods. The dataset also includes splits to isolate temporal and domain biases, providing a rigorous foundation for future research.

研究旨在通过引入包含181.1小时视频素材的MM-Lifelong数据集，弥合现有视频理解数据集与自然、非脚本化日常生活之间的差距，该数据集按日、周、月时间尺度结构化。研究评估了当前范式，发现端到端的MLLMs面临工作记忆瓶颈，而代表性的代理基线在导航稀疏的月度时间线时出现全局定位崩溃。为解决这些问题，提出了递归多模态代理（ReMA），它使用动态内存管理来迭代更新递归信念状态，优于现有方法。此外，数据集还包括用于隔离时间和领域偏差的拆分，为未来的监督学习和分布外泛化研究提供坚实的基础。

FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction

Authors: Jiaxin Yuan, Haizhao Yang, Maria Cameron

First: 2025-10-31T04:49:41+00:00 · Latest: 2026-03-05T18:50:52+00:00

Abstract

Fast and accurate simulation of dynamical systems is a fundamental challenge across scientific and engineering domains. Traditional numerical integrators often face a trade-off between accuracy and computational efficiency, while existing neural network-based approaches typically require training a separate model for each case. To overcome these limitations, we introduce a novel multi-modal foundation model for large-scale simulations of differential equations: FMint-SDE (Foundation Model based on Initialization for stochastic differential equations). Based on a decoder-only transformer with in-context learning, FMint-SDE leverages numerical and textual modalities to learn a universal error-correction scheme. It is trained using prompted sequences of coarse solutions generated by conventional solvers, enabling broad generalization across diverse systems. We evaluate our models on a suite of challenging SDE benchmarks spanning applications in molecular dynamics, mechanical systems, finance, and biology. Experimental results show that our approach achieves a superior accuracy-efficiency tradeoff compared to classical solvers, underscoring the potential of FMint-SDE as a general-purpose simulation tool for dynamical systems.

中文标题/摘要

标题：FMint-SDE：一种通过误差校正加速随机微分方程数值模拟的多模态基础模型

快速而准确地模拟动力系统是科学和工程领域的一项基本挑战。传统的数值积分器通常在准确性和计算效率之间存在权衡，而现有的基于神经网络的方法通常需要为每种情况训练一个单独的模型。为克服这些限制，我们提出了一种新的多模态基础模型，用于大规模模拟微分方程：FMint-SDE（基于初始化的基础模型，用于随机微分方程）。基于仅解码器的变压器并利用上下文学习，FMint-SDE 利用数值和文本模态学习一种通用的误差校正方案。它使用由传统求解器生成的粗糙解序列进行提示训练，从而在各种系统中实现广泛的泛化。我们在涵盖分子动力学、机械系统、金融和生物学应用的挑战性 SDE 基准测试集上评估了我们的模型。实验结果表明，与经典求解器相比，我们的方法在准确性和效率之间实现了更优的权衡，突显了 FMint-SDE 作为动力系统通用模拟工具的潜力。

Summary / 总结

The research aims to address the challenge of fast and accurate simulation of dynamical systems by introducing FMint-SDE, a multimodal foundation model that uses a decoder-only transformer with in-context learning to learn an error-correction scheme from numerical and textual modalities. The model is trained on coarse solutions generated by conventional solvers and demonstrates a better accuracy-efficiency tradeoff compared to classical solvers across various applications including molecular dynamics, mechanical systems, finance, and biology.

研究旨在通过引入FMint-SDE来解决动力系统快速而准确的模拟问题。FMint-SDE是一种多模态基础模型，基于仅解码器的Transformer并结合上下文学习，利用数值和文本模态学习误差校正方案。该模型使用传统求解器生成的粗略解进行训练，并在分子动力学、机械系统、金融和生物学等多个应用领域展示了比经典求解器更好的准确性与效率权衡。

Thermodynamic Response Functions in Singular Bayesian Models

Authors: Sean Plummer

First: 2026-03-05T18:50:20+00:00 · Latest: 2026-03-05T18:50:20+00:00

Abstract

Singular statistical models, including mixtures, matrix factorization, and neural networks, violate regular asymptotics due to parameter non-identifiability and degenerate Fisher geometry. Although singular learning theory characterizes marginal likelihood behavior through invariants such as the real log canonical threshold and singular fluctuation, these quantities remain difficult to interpret operationally. At the same time, widely used criteria such as WAIC and WBIC appear disconnected from underlying singular geometry. We show that posterior tempering induces a one-parameter deformation of the posterior distribution whose associated observables generate a hierarchy of thermodynamic response functions. A universal covariance identity links derivatives of tempered expectations to posterior fluctuations, placing WAIC, WBIC, and singular fluctuation within a unified response framework. Within this framework, classical quantities from singular learning theory acquire natural thermodynamic interpretations: RLCT governs the leading free-energy slope, singular fluctuation corresponds to curvature of the tempered free energy, and WAIC measures predictive fluctuation. We formalize an observable algebra that quotients out non-identifiable directions, allowing structurally meaningful order parameters to be constructed in singular models. Across canonical singular examples, including symmetric Gaussian mixtures, reduced-rank regression, and overparameterized neural networks, we empirically demonstrate phase-transition-like behavior under tempering. Order parameters collapse, susceptibilities peak, and complexity measures align with structural reorganization in posterior geometry. Our results suggest that thermodynamic response theory provides a natural organizing framework for interpreting complexity, predictive variability, and structural reorganization in singular Bayesian learning.

中文标题/摘要

标题：奇异贝叶斯模型中的热力学响应函数

奇异统计模型，包括混合模型、矩阵分解和神经网络，由于参数不可识别性和退化费舍尔几何学而违反了常规渐近性。尽管奇异学习理论通过不变量（如实数对数可约阈值和奇异波动）来表征边缘似然行为，但这些量仍然难以从操作上进行解释。同时，广泛使用的准则（如WAIC和WBIC）似乎与基础的奇异几何学脱节。我们表明，后验退火诱导了一个后验分布的一参数变形，其关联的可观测量生成了一级热力学响应函数的层次结构。一个普遍协方差恒等式将退火期望的导数与后验波动联系起来，将WAIC、WBIC和奇异波动置于统一的响应框架中。在这一框架中，奇异学习理论中的经典量获得了自然的热力学解释：RLCT控制了自由能的主要斜率，奇异波动对应于退火自由能的曲率，而WAIC衡量预测波动。我们形式化了一个可观测量代数，消除了不可识别方向，允许在奇异模型中构建结构上具有意义的有序参数。在包括对称高斯混合模型、降秩回归和过度参数化神经网络在内的典型奇异示例中，我们通过退火实验证明了相变似的行为。有序参数坍缩，磁化率达到峰值，复杂度度量与后验几何结构的重新组织相一致。我们的结果表明，热力学响应理论提供了一个自然的组织框架，用于解释奇异贝叶斯学习中的复杂性、预测波动性和结构重组织。

Summary / 总结

This paper addresses the challenges in singular statistical models, where standard asymptotic theory fails due to parameter non-identifiability and degenerate Fisher geometry. The authors introduce a method of posterior tempering to generate a hierarchy of thermodynamic response functions, linking classical quantities from singular learning theory to thermodynamic concepts. Key findings include the identification of the real log canonical threshold (RLCT) as governing the leading free-energy slope, singular fluctuation as the curvature of the tempered free energy, and the predictive fluctuation measured by WAIC. The study demonstrates phase-transition-like behavior in canonical singular models under tempering, with order parameters and susceptibilities showing significant changes reflecting structural reorganization in the posterior geometry.

该论文探讨了由于参数非识别性和退化费歇几何导致标准渐近理论失效的奇异统计模型。作者引入了后验退火的方法来生成热力学响应函数层次结构，将奇异学习理论中的经典量与热力学概念联系起来。关键发现包括将退火期望的导数与后验波动连接起来的普遍协方差恒等式，并在典型奇异模型下展示了在退火下的相变行为，表明后验几何结构中的结构性重组。
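
The covariance identity mentioned in the abstract has a simple generic form: for a tempered distribution with density proportional to prior(θ)·exp(−β·H(θ)), the derivative of an expectation with respect to β equals minus the covariance between the observable and H. The sketch below checks this numerically on a one-dimensional grid. The toy energy `h`, the observable `f`, and the grid are assumptions chosen for illustration and are not taken from the paper; in the Bayesian setting H would play the role of the (scaled) negative log-likelihood.

```python
import numpy as np

def tempered_weights(h, prior, beta):
    """Normalized weights of the tempered distribution prior * exp(-beta * h) on a grid."""
    logw = np.log(prior) - beta * h
    w = np.exp(logw - logw.max())   # subtract the max for numerical stability
    return w / w.sum()

theta = np.linspace(-3.0, 3.0, 2001)
prior = np.exp(-0.5 * theta**2)     # unnormalized Gaussian prior
h = (theta**2 - 1.0) ** 2           # toy energy with two symmetric minima
f = theta**2                        # observable invariant under theta -> -theta

beta, eps = 1.5, 1e-4

# Left-hand side: central finite difference of the tempered expectation of f.
e_plus = tempered_weights(h, prior, beta + eps) @ f
e_minus = tempered_weights(h, prior, beta - eps) @ f
finite_diff = (e_plus - e_minus) / (2 * eps)

# Right-hand side: minus the covariance of f and h under the tempered weights.
w = tempered_weights(h, prior, beta)
cov = w @ (f * h) - (w @ f) * (w @ h)
```

Here `finite_diff` and `-cov` agree up to discretization error of the finite difference, which is the fluctuation-response relation in its simplest form.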

Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

Authors: Artem Vazhentsev, Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Seleznyov, Mikhail Salnikov, Elena Tutubalina, Vasily Konovalov, Irina Nikishina, Alexander Panchenko, Viktor Moskvoretskii

First: 2026-03-05T18:42:51+00:00 · Latest: 2026-03-05T18:42:51+00:00

Comments: Preprint

Abstract

Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the model's intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.

中文标题/摘要

标题：利用大语言模型参数知识进行无需检索的事实核查

信任度是基于大语言模型（LLMs）的代理AI系统的核心研究挑战。为了增强信任，自然语言声明通常通过检索外部知识并使用LLM验证声明与检索证据的一致性来从多种来源（包括人类撰写的文本、网络内容和模型输出）进行事实核查。这种方法受到检索错误和外部数据可用性的限制，而未能充分利用模型固有的事实验证能力。我们提出了无需检索的事实核查任务，专注于独立于来源的任意自然语言声明的验证。为了研究这一设置，我们引入了一个以泛化为重点的综合评估框架，测试对（i）长尾知识、（ii）声明来源的变化、（iii）多语言性和（iv）长文本生成的鲁棒性。在9个数据集、18种方法和3种模型上进行的实验表明，logit基方法往往不如利用内部模型表示的方法表现好。基于这一发现，我们引入了INTRA方法，该方法利用内部表示之间的交互，实现了最先进的性能并具有强大的泛化能力。更广泛地说，我们的工作确立了无需检索的事实核查作为有前景的研究方向，可以补充基于检索的框架，提高可扩展性，并使这些系统在训练期间作为奖励信号使用或作为生成过程中的组件集成。

Summary / 总结

The research addresses the trustworthiness of agentic AI systems by proposing the task of fact-checking without retrieval, which relies on the intrinsic knowledge of Large Language Models (LLMs) rather than on external evidence. The study introduces a comprehensive evaluation framework that tests robustness to long-tail knowledge, variation in claim sources, multilinguality, and long-form generation across 9 datasets, 18 methods, and 3 models. Key findings show that logit-based approaches often underperform methods that use internal model representations, and that a new method called INTRA, which exploits interactions between internal representations, achieves state-of-the-art performance with strong generalization.

研究旨在通过开发无需外部知识检索的事实核查方法来增强 agentic AI 系统的可信度，从而避免检索错误和外部数据可用性带来的限制。研究引入了一个全面的评估框架，以测试各种数据集和模型下事实核查方法的鲁棒性。关键发现表明，logit 基础方法往往表现不佳，而一种名为 INTRA 的新方法，它利用内部模型表示，实现了最先进的性能并具有强大的泛化能力。

HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token

Authors: Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun, Jiawei Zhou

Venue: The 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)

First: 2026-03-05T18:36:31+00:00 · Latest: 2026-03-05T18:36:31+00:00

Abstract

Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
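
The AUROC figures quoted above measure how well a probe's scores rank hallucinated examples above non-hallucinated ones. A minimal sketch of that metric, assuming the probe outputs one risk score per example; the scores and labels below are made up for illustration and are not from the paper.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical probe scores (higher = more likely to hallucinate) and labels.
probe_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
hallucinated = [1, 1, 1, 0, 0, 0]
print(auroc(probe_scores, hallucinated))
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.93 and ~0.79 should be read.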

中文标题/摘要

标题：HALP：无需生成单个词元即可检测视觉语言模型中的幻觉

幻觉仍然是视觉语言模型（VLMs）的一个持续性挑战，它们经常描述不存在的对象或编造事实。现有的检测方法通常在文本生成之后进行操作，使得干预既昂贵又不及时。我们研究了是否可以在生成任何词元之前通过探测模型的内部表示来预测幻觉风险，而只需一次前向传递。在一系列视觉语言任务和八个现代VLMs（包括Llama-3.2-Vision、Gemma-3、Phi-4-VL和Qwen2.5-VL）中，我们检查了三种内部表示家族：（i）仅视觉特征而不进行多模态融合，（ii）文本解码器中的视觉词元表示，以及（iii）在生成之前整合视觉和文本信息的查询词元表示。基于这些表示训练的探测器在无需解码的情况下实现了强大的幻觉检测性能，最高达到0.93 AUROC在Gemma-3-12B、Phi-4-VL 5.6B和Molmo 7B上。大多数模型中，后期查询词元状态最具预测性，而在少数架构中，视觉或中间层特征占主导地位（例如，Qwen2.5-VL-7B使用仅视觉特征的AUROC约为0.79）。这些结果表明：（1）幻觉风险可以在生成之前检测到，（2）最具信息量的层和模态在不同架构中有所不同，（3）轻量级探测器有可能实现早期回避、选择性路由和自适应解码，以提高安全性和效率。

Summary / 总结

The study addresses the challenge of hallucinations in vision-language models by proposing a method to predict hallucination risk before any text generation. It uses probes on internal model representations to detect hallucinations, achieving up to 0.93 AUROC on several models. The results show that late query-token states are the most predictive for most models, while visual or mid-layer features are more informative for some architectures.

研究旨在无需生成任何标记即可检测视觉语言模型中的幻觉现象，这比现有方法更高效及时。通过在单次前向传播中探查内部表示，研究评估了三种类型的表示在各种任务和模型上的表现。这些探针在训练后表现出强大的幻觉检测性能，对于大多数模型而言，后期查询标记状态是最具预测性的，而对于某些架构而言，视觉或中间层特征则更为重要。这项工作表明，幻觉风险可以在生成前被检测到，并且轻量级探针可以实现早期干预，以提高安全性和效率。

NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance

Authors: Abrar Eyasir, Tahsin Ahmed, Muhammad Ibrahim

First: 2026-03-05T18:35:03+00:00 · Latest: 2026-03-05T18:35:03+00:00

Comments: 18 pages, 7 figures, 6 tables. Dataset contains 87,805 Bangla QA pairs from NCTB textbooks

Abstract

Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions. These systems tend to produce unreliable responses when correct answers are absent from context. To solve this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions. NCTB-QA also includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning. BERT achieves 313% relative improvement in F1 score (0.150 to 0.620). Semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering. This study demonstrates that domain-specific fine-tuning is critical for robust performance in low-resource settings.
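
The 313% figure follows directly from the two F1 scores in the abstract. The short check below reproduces it and also converts the answerable/unanswerable percentages into approximate pair counts; the counts are rounded estimates derived from the published percentages, not figures stated by the authors.

```python
def relative_improvement(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

f1_before, f1_after = 0.150, 0.620
print(f"relative F1 improvement: {relative_improvement(f1_before, f1_after):.1f}%")

total_pairs = 87_805
answerable = round(total_pairs * 0.5725)    # approximate, from the rounded percentage
unanswerable = round(total_pairs * 0.4275)  # approximate, from the rounded percentage
print(answerable, unanswerable, answerable + unanswerable)
```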

中文标题/摘要

标题：NCTB-QA：大规模孟加拉语教育问答数据集及基准性能评估

针对低资源语言的阅读理解系统在处理无法回答的问题时面临重大挑战。当正确答案不在上下文中时，这些系统往往会生成不可靠的响应。为了解决这一问题，我们引入了NCTB-QA，这是一个包含87,805个问答对的大规模孟加拉语问答数据集，这些问答对是从孟加拉国国家课程和课本委员会出版的50本教科书中提取的。与现有的孟加拉语数据集不同，NCTB-QA保持了可回答问题（57.25%）和无法回答问题（42.75%）的平衡分布。NCTB-QA还包含对抗设计的实例，其中包含可能的干扰项。我们对三种基于变换器的模型（BERT、RoBERTa、ELECTRA）进行了基准测试，并通过微调展示了显著的改进。BERT在F1分数上实现了313%的相对改进（从0.150到0.620）。通过BERTScore衡量的语义答案质量在所有模型中也显著提高。我们的结果确立了NCTB-QA作为孟加拉语教育问答具有挑战性的基准。本研究证明，在低资源环境中，领域特定的微调对于稳健性能至关重要。

Summary / 总结

The paper introduces NCTB-QA, a large-scale Bangla question answering dataset with 87,805 question-answer pairs, aiming to address the challenge of handling unanswerable questions in low-resource languages. The dataset maintains a balanced distribution of answerable and unanswerable questions and includes adversarial instances. Three transformer-based models (BERT, RoBERTa, ELECTRA) were benchmarked, showing significant improvements through fine-tuning, with BERT achieving a 313% relative improvement in F1 score. The study highlights the importance of domain-specific fine-tuning for robust performance in low-resource settings.

研究通过引入包含87,805个问题-答案对的NCTB-QA大型孟加拉语数据集，解决了低资源语言阅读理解系统处理无法回答问题的挑战。该数据集保持了可回答和不可回答问题的平衡分布，并包含对抗性示例。三种基于变换器的模型（BERT、RoBERTa、ELECTRA）被基准测试，通过微调显示出显著改进，BERT的F1分数相对提高了313%。研究强调了在低资源环境中实现稳健性能的关键在于领域特定的微调。

DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates

Authors: Klaywert Danillo Ferreira de Souza, David Eduardo Pereira, Cláudio E. C. Campelo, Larissa Lucena Vasconcelos

First: 2026-03-05T18:30:10+00:00 · Latest: 2026-03-05T18:30:10+00:00

Abstract

The process of debating is essential in our daily lives, whether in studying, work activities, simple everyday discussions, political debates on TV, or online discussions on social networks. The range of uses for debates is broad. Due to the diverse applications, structures, and formats of debates, developing corpora that account for these variations can be challenging, and the scarcity of debate corpora in the state of the art is notable. For this reason, the current research proposes the DEBISS corpus: a collection of spoken and individual debates with semi-structured features. The corpus is annotated for a broad range of NLP tasks, such as speech-to-text, speaker diarization, argument mining, and debater quality assessment.

中文标题/摘要

标题：DEBISS：个体、半结构化和口语辩论语料库

辩论的过程在我们的日常生活中至关重要，无论是学习、工作活动、简单的日常讨论、电视上的政治辩论，还是社交媒体上的在线讨论。辩论的应用范围很广。由于辩论的多样应用、结构和形式，开发能够涵盖这些差异的语料库具有挑战性，目前该领域的辩论语料库稀缺性尤为突出。因此，当前研究提出了DEBISS语料库：一个包含口语和个体辩论的半结构化集合。该语料库包含广泛的自然语言处理任务注释，如语音转文本、说话人辨识、论点挖掘和辩手质量评估。

Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

Authors: Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan

Venue: ICLR 2026

First: 2026-03-05T18:25:26+00:00 · Latest: 2026-03-05T18:25:26+00:00

Comments: Accepted at ICLR 2026

Abstract

Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance': committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
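
The prefix selection described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' implementation: the confidence threshold, the delimiter set, the fallback when no delimiter is found, and the name `longest_stable_prefix` are all assumptions, and per-token confidence stands in for whatever stability measure the paper actually uses.

```python
def longest_stable_prefix(confidences, tokens, threshold=0.9,
                          delimiters=(".", ",", ";", "\n")):
    """Return how many leading tokens to commit in one denoising step.

    1. Grow a contiguous, left-aligned run while confidence >= threshold.
    2. Snap the boundary back to the last delimiter inside that run.
    If the run contains no delimiter, commit the whole run (assumed fallback).
    """
    stable = 0
    for c in confidences:
        if c < threshold:
            break
        stable += 1
    for end in range(stable, 0, -1):
        if tokens[end - 1] in delimiters:
            return end
    return stable

# Hypothetical predictions for the still-uncommitted part of the sequence.
tokens = ["The", "answer", "is", "42", ".", "It", "follows", "that"]
confidences = [0.99, 0.97, 0.95, 0.93, 0.96, 0.91, 0.40, 0.95]

commit = longest_stable_prefix(confidences, tokens)
print(tokens[:commit])  # committed prefix; the rest stays in the active suffix

# Scattered acceptance would instead commit every confident position,
# including position 7, which lies beyond the unstable token at position 6.
scattered = [i for i, c in enumerate(confidences) if c >= 0.9]
print(scattered)
```

Because the committed tokens always form a prefix, their keys and values can be appended to the cache contiguously, which is the systems benefit the abstract points to.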

中文标题/摘要

标题：超越零散接受：通过最长稳定前缀实现DLMs的快速和连贯推理

扩散语言模型（DLMs）承诺实现高度并行的文本生成，但其实用推理速度往往受限于次优解码调度器。标准方法依赖于“零散接受”——在序列中不连续位置上提交高置信度的标记。这种方法无意中破坏了键值（KV）缓存，破坏了内存局部性，并迫使模型在不稳定的标记边界上进行昂贵的重复修复。为了解决这个问题，我们提出了最长稳定前缀（LSP）调度器，这是一种基于单一前缀吸收的无训练和模型无关的推理范式。在每次去噪步骤中，LSP 通过单向传递评估标记的稳定性，动态识别一个连续的左对齐的稳定预测块，并在自然语言或结构分隔符之前将其边界锁定，然后进行原子提交。这种前缀优先的拓扑结构带来了双重好处：系统上，它将碎片化的KV缓存更新转换为高效的连续追加；算法上，它保留了对几何缩小的活动后缀的双向前瞻，大幅减少了标记翻转率和去噪器调用次数。在LLaDA-8B和Dream-7B上的广泛评估表明，LSP 在包括数学推理、代码生成、多语言（CJK）任务和创造性写作在内的严格基准测试中将推理加速了高达3.4倍，同时保持或略微提高了输出质量。通过从根本上重新结构化提交拓扑，LSP 桥接了DLMs的理论并行性和实际硬件效率之间的差距。

Summary / 总结

The paper addresses the issue of slow inference in Diffusion Language Models (DLMs) due to suboptimal decoding schedulers. It introduces the Longest Stable Prefix (LSP) scheduler, which evaluates token stability and commits to a contiguous block of predictions, thereby preserving the KV cache and reducing token flip rates. Experiments on LLaDA-8B and Dream-7B show that LSP can accelerate inference by up to 3.4x across various tasks while maintaining or slightly improving output quality.

论文解决了由于解码调度器不理想导致的Diffusion Language Models (DLMs) 推断速度慢的问题。它提出了Longest Stable Prefix (LSP) 调度器，该调度器通过单次前向传递评估token的稳定性，并提交一个连续的稳定预测块，从而保持KV缓存并减少token翻转率。在LLaDA-8B和Dream-7B上的实验表明，LSP可以将推断加速多达3.4倍，同时保持或略微提高输出质量。

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Authors: Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao

First: 2026-03-05T18:24:49+00:00 · Latest: 2026-03-05T18:24:49+00:00

Abstract

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.
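
The idea behind conditional softmax rescaling can be shown on a scalar toy problem. The sketch below computes a softmax-weighted sum block by block with a running maximum, and rescales the partial results only when a new block raises that maximum. It is a plain-Python illustration under simplifying assumptions (one query, scalar values, exact comparison of maxima); the actual FlashAttention-4 kernels run on GPU and their precise rescaling criterion is not given in the abstract.

```python
import math

def online_softmax_weighted_sum(scores, values, block=4):
    """Blockwise softmax(scores) . values with conditional rescaling.

    Returns the result and the number of rescaling steps performed.
    """
    running_max = -math.inf
    denom = 0.0   # running sum of exp(score - running_max)
    acc = 0.0     # running sum of exp(score - running_max) * value
    rescales = 0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        blk_max = max(s_blk)
        if blk_max > running_max:
            # The maximum grew, so earlier partial sums must be rescaled.
            scale = math.exp(running_max - blk_max)  # 0.0 on the first block
            denom *= scale
            acc *= scale
            running_max = blk_max
            rescales += 1
        # Otherwise the old maximum is still valid and rescaling is skipped.
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - running_max)
            denom += w
            acc += w * v
    return acc / denom, rescales

def reference(scores, values):
    """Direct softmax-weighted sum, used only to check the blockwise version."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.3, 1.9, -0.2, 3.0, 0.0, 1.0, 2.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
result, n_rescales = online_softmax_weighted_sum(scores, values)
print(result, reference(scores, values), n_rescales)
```

In this example the second block does not raise the running maximum, so its rescaling step is skipped; fewer such non-matmul operations is the effect the abstract attributes to the technique.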

中文标题/摘要

标题：FlashAttention-4：面向非对称硬件扩展的算法与内核流水线协同设计

注意力机制作为无处不在的Transformer架构的核心层，是大型语言模型和长上下文应用的瓶颈。虽然FlashAttention-3通过异步执行和warp专用化（warp specialization）优化了Hopper GPU上的注意力机制，但它主要针对H100架构。AI行业已迅速转向部署基于Blackwell的系统，如B200和GB200，这些系统由于非对称硬件扩展而表现出根本不同的性能特征：张量核心吞吐量翻倍，而其他功能单元（共享内存带宽、指数单元）则扩展较慢或保持不变。我们开发了几种技术来应对Blackwell GPU上这些变化的瓶颈：(1) 重新设计的流水线，利用完全异步的MMA操作和更大的分块（tile）尺寸，(2) 软件模拟的指数运算和条件式softmax重缩放，减少非矩阵乘法运算，以及(3) 利用张量内存和2-CTA MMA模式减少反向传播中的共享内存流量和原子加法。我们证明，我们的方法FlashAttention-4在B200 GPU上使用BF16时，相对于cuDNN 9.13实现了高达1.3倍的加速，相对于Triton实现了高达2.7倍的加速，达到最高1613 TFLOPs/s（71%利用率）。除了算法创新，我们完全使用嵌入Python的CuTe-DSL实现FlashAttention-4，与传统的基于C++模板的方法相比，编译速度快20-30倍，同时保持完整的表达能力。

Summary / 总结

FlashAttention-4 optimizes the attention mechanism for Blackwell-based GPUs, addressing the asymmetric scaling of tensor core throughput and other functional units. It introduces redesigned pipelines with fully asynchronous MMA operations and larger tile sizes, software-emulated exponential and conditional softmax rescaling to reduce non-matmul operations, and leverages tensor memory and the 2-CTA MMA mode to decrease shared memory traffic and atomic adds in the backward pass. The method achieves up to 1.3 times speedup over cuDNN 9.13 and 2.7 times over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization).

FlashAttention-4 为基于 Blackwell 的 GPU 优化了注意力机制，解决了张量核心吞吐量与其他功能单元的非对称扩展问题。它引入了重新设计的流水线，利用完全异步的 MMA 操作和更大的分块（tile）尺寸，通过软件模拟的指数运算和条件式 softmax 重缩放来减少非矩阵乘法运算，并利用张量内存和 2-CTA MMA 模式来减少反向传播中的共享内存流量和原子加操作。该方法在 B200 GPU 上使用 BF16 达到了最高 1613 TFLOPs/s（71% 的利用率），相对于 cuDNN 9.13 加速最高 1.3 倍，相对于 Triton 加速最高 2.7 倍。

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

Authors: Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy

First: 2026-03-05T18:22:55+00:00 · Latest: 2026-03-05T18:22:55+00:00

Comments: 10 pages, 4 figures

Abstract

Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.

中文标题/摘要

标题：分布式部分信息谜题：在知识不对称下的共同知识构建研究

建立共同知识，即共享的一组信念和相互认可的事实，是协作的基础，但在当前的AI系统中仍然是一个挑战，尤其是在多模态、多参与者的场景中，参与者带来不同的信息。我们引入了分布式部分信息谜题（DPIP），这是一种在知识不对称下引发丰富多模态交流的协作构建任务。我们提供了一个多模态的交互数据集，这些数据在语音、手势和动作模态上进行了注释和时间对齐，以支持对命题内容和信念动态的推理。然后，我们评估了两种建模共同知识（CG）的范式：（1）最先进的大型语言模型（LLMs），被提示从多模态更新中推断共享的信念，以及（2）一个基于动态知识逻辑（DEL）的公理化管道，逐步执行相同任务。对注释的DPIP数据的评估结果表明，这给现代LLMs跟踪任务进展和信念状态的能力带来了挑战。

Summary / 总结

The study aims to explore how AI systems can establish common ground in collaborative settings with epistemic asymmetry. It introduces the Distributed Partial Information Puzzle (DPIP) task and evaluates two approaches: state-of-the-art large language models and an axiomatic pipeline based on Dynamic Epistemic Logic. The results show that modern LLMs struggle to track both task progression and belief state, highlighting the challenge in managing common ground under epistemic asymmetry.

研究旨在探索AI系统如何在具有知识不对称性的协作环境中建立共同知识。引入了分布式部分信息谜题（DPIP）任务，并评估了两种方法：最先进的大型语言模型和基于动态认知逻辑（Dynamic Epistemic Logic，DEL）的公理化管道。结果显示，现代大语言模型在跟踪任务进展和信念状态方面存在困难，突显了在知识不对称性下管理共同知识的挑战。

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Authors: Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover

First: 2026-01-26T17:56:50+00:00 · Latest: 2026-03-05T18:19:57+00:00

Comments: Code is released here: https://github.com/siyan-zhao/OPSD

Abstract

Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student's own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 8-12x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.
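
The training signal described above can be sketched as an average per-token divergence between two next-token distributions produced by the same model under different contexts. The abstract only says "per-token divergence", so the choice of forward KL (teacher to student) below is an assumption, and the tiny three-token vocabulary and probabilities are invented for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def per_token_distillation_loss(teacher_dists, student_dists):
    """Mean KL(teacher || student) over the positions of one student rollout.

    teacher_dists[t]: next-token distribution given question + privileged trace.
    student_dists[t]: next-token distribution given the question only.
    Both are evaluated on the prefix sampled by the student (on-policy).
    """
    if len(teacher_dists) != len(student_dists):
        raise ValueError("teacher and student must cover the same positions")
    per_token = [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)

# Hypothetical distributions for a two-token rollout over a 3-token vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.4, 0.4, 0.2], [0.3, 0.4, 0.3]]
print(per_token_distillation_loss(teacher, student))
```

Every position of the rollout contributes a loss term, which is what makes the supervision dense compared with a single scalar reward per trajectory.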

中文标题/摘要

标题：自我蒸馏推理器：面向大规模语言模型的在线自我蒸馏

知识蒸馏通过压缩教师大规模语言模型（LLM）的知识来训练较小的LLM，从而改善大型语言模型的推理能力。在线蒸馏通过让学生在教师LLM提供密集的标记级监督的同时，自行采样其自身的轨迹，来推进这一方法，从而解决了脱机蒸馏方法中训练与推理之间的分布不匹配问题。然而，在线蒸馏通常需要一个单独的、通常更大的教师LLM，并且不明确利用推理数据集中可用的真实解决方案。受足够强大的LLM能够合理化外部特权推理轨迹并教导其较弱的自我（即没有访问特权信息的版本）这一直觉的启发，我们引入了在线自我蒸馏（OPSD）框架，其中单个模型同时作为教师和学生，通过不同的上下文进行条件化。教师策略通过条件化特权信息（例如，验证的推理轨迹）进行条件化，而学生策略仅看到问题；训练通过最小化学生自身模拟过程中这些分布之间的每个标记差异来进行。我们通过多个数学推理基准展示了我们方法的有效性，与强化学习方法（如GRPO）相比，实现了8-12倍的标记效率，并且优于脱机蒸馏方法。

Summary / 总结

The research aims to improve large language model reasoning by developing On-Policy Self-Distillation (OPSD), a method where a single model acts as both teacher and student. The method conditions on privileged information for the teacher and only the question for the student, minimizing token-level divergence during training. Experiments show OPSD achieves 8-12x token efficiency compared to reinforcement learning methods and outperforms off-policy distillation methods on mathematical reasoning benchmarks.

研究旨在通过在线自我蒸馏（OPSD）提高大型语言模型的推理能力，即由单个模型同时充当教师和学生。教师以特权信息为条件，学生仅以问题为条件，训练时最小化两者在学生自身采样轨迹上的逐token分布差异。实验表明，在数学推理基准测试中，OPSD相比GRPO等强化学习方法实现了8-12倍的token效率，并且优于离策略（off-policy）蒸馏方法。

HydroGEM: A Self Supervised Zero Shot Hybrid TCN Transformer Foundation Model for Continental Scale Streamflow Quality Control

Authors: Ijaz Ul Haq, Byung Suk Lee, Julia N. Perdrial, David Baude

First: 2025-12-16T05:39:26+00:00 · Latest: 2026-03-05T18:19:22+00:00

Comments: Supplementary materials, datasets, and implementation code will be made publicly available upon acceptance for publication in a peer-reviewed journal

Abstract

Advances in sensor networks have enabled real-time stream discharge monitoring, yet persistent sensor malfunctions limit data utility. Manual quality control by expert hydrologists cannot scale with networks generating millions of measurements annually. We introduce HydroGEM, a foundation model for continental-scale streamflow quality control designed to support human expertise. HydroGEM uses self-supervised pretraining on 6.03 million clean sequences from 3,724 USGS stations to learn general hydrological representations, followed by fine-tuning with synthetic anomalies for detection and reconstruction. A hybrid TCN-Transformer architecture (14.2M parameters) captures both local and long-range temporal dependencies, while hierarchical normalization handles six orders of magnitude in discharge. On held-out observations from 799 stations with 18 synthetic anomaly types grounded in USGS standards, HydroGEM achieves F1=0.792 for detection and 68.7% reconstruction error reduction, outperforming the strongest baseline by 36.3%. For cross-national validation on 100 Environment and Climate Change Canada stations using tolerant evaluation with a plus or minus 24-hour buffer, HydroGEM achieves Tolerant F1=0.70 with 90.1% segment-level event detection, demonstrating cross-national generalization. The model maintains consistent detection across correction magnitudes and aligns with operational seasonal patterns, with peak flagging during winter ice-affected periods matching hydrologists' correction behavior. Architectural separation between simplified training anomalies and complex test anomalies confirms that performance reflects learned hydrometric principles rather than pattern memorization.
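
One plausible reading of "tolerant evaluation with a ±24-hour buffer" is that a flag counts as correct when it falls within 24 hours of a true anomaly. The sketch below implements that reading on timestamps expressed in hours; the matching rule, the function name `tolerant_f1`, and the example timestamps are assumptions, since the abstract does not spell out the protocol.

```python
def tolerant_f1(pred_hours, true_hours, buffer_hours=24):
    """F1 where predictions and true anomalies match within +/- buffer_hours.

    Precision: share of predicted flags near some true anomaly.
    Recall: share of true anomalies near some predicted flag.
    """
    if not pred_hours or not true_hours:
        return 0.0
    matched_preds = sum(
        any(abs(p - t) <= buffer_hours for t in true_hours) for p in pred_hours)
    matched_trues = sum(
        any(abs(p - t) <= buffer_hours for p in pred_hours) for t in true_hours)
    precision = matched_preds / len(pred_hours)
    recall = matched_trues / len(true_hours)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical anomaly times (hours since start of record) and model flags.
true_anomalies = [100, 500, 900]
predicted_flags = [110, 530, 700, 905]
print(tolerant_f1(predicted_flags, true_anomalies))                  # with buffer
print(tolerant_f1(predicted_flags, true_anomalies, buffer_hours=0))  # exact match
```

A buffer of this kind forgives small timing offsets between a model flag and a hydrologist's correction, which is why the tolerant score is reported separately from the strict F1.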

中文标题/摘要

标题：HydroGEM：一种自监督零样本混合TCN变换器基础模型，用于大陆规模的径流质量控制

传感器网络的进步使实时径流监测成为可能，但持续的传感器故障限制了数据的实用性。由专家水文学家进行的手动质量控制无法与每年生成数百万测量值的网络规模相匹配。我们介绍了HydroGEM，一种用于大陆规模径流质量控制的基础模型，旨在支持人类专业知识。HydroGEM通过在来自3,724个USGS站的6,030,000个干净序列上进行自监督预训练来学习通用水文学表示，然后使用合成异常进行微调以进行检测和重建。混合TCN-变换器架构（14.2M参数）捕捉局部和长程时间依赖性，而分层规范化处理了六个数量级的径流。在来自799个站的18种合成异常类型（基于USGS标准）的保留观测中，HydroGEM的检测F1=0.792，重建误差减少68.7%，优于最强基线36.3%。在使用正负24小时缓冲进行容忍评估的100个环境和气候变化加拿大站中，HydroGEM的容忍F1=0.70，段级事件检测率为90.1%，展示了跨国家的一般化能力。该模型在不同校正幅度下保持一致的检测，并与操作季节性模式一致，在冬季冰影响期间的峰值标记与水文学家的修正行为相符。简化训练异常与复杂测试异常之间的架构分离证实了性能反映了学习的水文测量原理，而不是模式记忆。

Summary / 总结

HydroGEM is a foundation model designed for continental-scale streamflow quality control, leveraging self-supervised pretraining and fine-tuning with synthetic anomalies. It uses a hybrid TCN-Transformer architecture to capture both local and long-range temporal dependencies and handles six orders of magnitude in discharge. On held-out observations from 799 stations, HydroGEM achieved an F1 score of 0.792 for detection and a 68.7% reduction in reconstruction error, outperforming the strongest baseline by 36.3%. It also demonstrated cross-national generalization on 100 Canadian stations with a Tolerant F1 score of 0.70 and 90.1% segment-level event detection.

HydroGEM 是一个自监督零样本混合 TCN-Transformer 模型，用于大陆规模的径流质量控制。它基于 3,724 个 USGS 站点的 6.03 百万条干净序列进行预训练，并通过合成异常进行微调。HydroGEM 在检测上的 F1 得分为 0.792，重建误差减少了 68.7%，优于现有基线。它还在 Environment and Climate Change Canada 的 100 个站点上展示了跨国家的一般化能力，Tolerant F1 为 0.70，并且与运营季节性模式保持一致。

LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification

Authors: Md Akib Haider, Ahsan Bulbul, Nafis Fuad Shahid, Aimaan Ahmed, Mohammad Ishrak Abedin

First: 2026-03-04T11:36:32+00:00 · Latest: 2026-03-05T18:19:21+00:00

Abstract

Code comment classification is a critical task for automated software documentation and analysis. In the context of the NLBSE'26 Tool Competition, we present LoRA-MME, a Multi-Model Ensemble architecture utilizing Parameter-Efficient Fine-Tuning (PEFT). Our approach addresses the multi-label classification challenge across Java, Python, and Pharo by combining the strengths of four distinct transformer encoders: UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa. By independently fine-tuning these models using Low-Rank Adaptation (LoRA) and aggregating their predictions via a learned weighted ensemble strategy, we maximize classification performance without the memory overhead of full model fine-tuning. Our tool achieved a weighted F1 score of 0.7906 and a macro F1 of 0.6867 on the test set. However, the computational cost of the ensemble resulted in a final submission score of 41.20%, highlighting the trade-off between semantic accuracy and inference efficiency.
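
The aggregation step can be pictured as a weighted average of per-label probabilities from the four encoders, followed by a threshold. In the sketch below the weights are fixed by hand, whereas the paper learns them; whether the learned weights are per model or per label is not stated in the abstract, so one weight per model is an assumption, as are the probabilities and the 0.5 threshold.

```python
def weighted_ensemble(model_probs, weights, threshold=0.5):
    """Combine per-label probabilities from several models.

    model_probs: one list of per-label probabilities per model.
    weights: one non-negative weight per model (normalized internally).
    Returns the combined probabilities and the thresholded multi-label output.
    """
    total = sum(weights)
    n_labels = len(model_probs[0])
    combined = [
        sum(w * probs[j] for w, probs in zip(weights, model_probs)) / total
        for j in range(n_labels)
    ]
    return combined, [int(p >= threshold) for p in combined]

# Hypothetical probabilities for three comment categories of one code comment.
model_probs = [
    [0.9, 0.2, 0.6],  # UniXcoder
    [0.8, 0.4, 0.4],  # CodeBERT
    [0.7, 0.1, 0.7],  # GraphCodeBERT
    [0.6, 0.3, 0.3],  # CodeBERTa
]
weights = [0.4, 0.2, 0.3, 0.1]  # hand-picked stand-ins for learned weights

combined, predicted = weighted_ensemble(model_probs, weights)
print([round(p, 2) for p in combined], predicted)
```

Every prediction requires a forward pass through all four encoders, which is the inference cost behind the lower final submission score.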

中文标题/摘要

标题：LoRA-MME：用于代码注释分类的LoRA微调编码器多模型集成

代码注释分类是自动软件文档和分析中的关键任务。在NLBSE'26工具竞赛中，我们提出了LoRA-MME，这是一种利用参数高效微调(PEFT)的多模型集成架构。我们的方法通过结合四种不同变压器编码器的优点——UniXcoder、CodeBERT、GraphCodeBERT和CodeBERTa，解决了跨Java、Python和Pharo的多标签分类挑战。通过独立使用低秩适应(LoRA)对这些模型进行微调，并通过学习加权集成策略聚合它们的预测，我们最大化了分类性能，而无需承担全模型微调的内存开销。我们的工具在测试集上获得了0.7906的F1加权分数和0.6867的宏F1分数。然而，集成的计算成本导致最终提交得分为41.20%，突显了语义准确性和推理效率之间的权衡。

NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

Authors: Kanon Amemiya, Daichi Yashima, Kei Katsumata, Takumi Komatsu, Ryosuke Korekata, Seitaro Otsuki, Komei Sugiura

Venue: CVPR 2026

First: 2026-03-05T18:12:29+00:00 · Latest: 2026-03-05T18:12:29+00:00

Comments: Accepted to CVPR 2026 Findings

Abstract

We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.

中文标题/摘要

标题：NaiLIA：基于密集意图描述和调色板查询的多模态指甲设计检索

我们专注于基于密集意图描述检索指甲设计图像的任务，这些描述代表了用户对指甲设计的多层意图。这具有挑战性，因为这些描述指定了未受约束的绘画元素和预先制造的装饰品，以及视觉特征、主题和整体印象。除了这些描述之外，我们假设用户通过颜色拾取器指定零个或多个颜色来提供调色板查询，这使得可以表达微妙和连续的颜色细微差别。现有的视觉-语言基础模型往往难以结合这些描述和调色板。为了解决这个问题，我们提出了NaiLIA，一种针对指甲设计图像的多模态检索方法，在检索过程中全面对齐密集意图描述和调色板查询。我们的方法引入了一种基于未标注图像置信分数的松弛损失，可以与描述对齐。为了评估NaiLIA，我们构建了一个基准，包含来自不同文化背景的10,625张图像。这些图像由超过200名注释者标注了长且密集的意图描述。实验结果表明，NaiLIA优于标准方法。

Summary / 总结

The research aims to develop a method for retrieving nail design images based on detailed user intent descriptions and color palette queries. NaiLIA, a multimodal retrieval approach, aligns with both dense descriptions and color palettes, addressing the limitations of existing vision-language models. The method uses a relaxed loss based on confidence scores for unlabeled images. Experiments on a benchmark of 10,625 images show that NaiLIA outperforms standard methods in retrieving nail designs that match user intent and color preferences.

研究旨在开发一种基于详细用户意图描述和颜色调色板查询的指甲设计图像检索方法。NaiLIA是一种多模态检索方法，能够同时与密集描述和颜色调色板对齐，解决了现有视觉-语言模型的局限性。该方法使用基于未标记图像置信度分数的松弛损失。实验表明，NaiLIA在10,625张图像基准上优于标准方法，能够更准确地检索符合用户意图和颜色偏好的指甲设计。

Agentic Very Long Video Understanding

Authors: Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, Hyo Jin Kim

First: 2026-01-26T05:20:47+00:00 · Latest: 2026-03-05T18:12:22+00:00

Comments: 27 pages, 7 figures, 8 tables

Abstract

The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks. Code is available at https://github.com/facebookresearch/egagent.

中文标题/摘要

标题：代理型非常长视频理解

随着全天候可穿戴设备（如智能眼镜）支持下的始终在线个人AI助手的出现，需要一种新的上下文理解水平，这种理解超越了短暂孤立的事件，涵盖了持续的、纵向的自我中心视频流。实现这一愿景需要在长时视频理解方面取得进展，其中系统必须解释和回忆跨越数天甚至数周的视觉和音频信息。现有的方法，包括大型语言模型和检索增强生成，受限于有限的上下文窗口，并且缺乏在非常长的视频流上进行组合式、多跳推理的能力。在本文中，我们通过EGAgent解决这些挑战，这是一种增强的代理框架，以实体场景图为中心，这些图表示随着时间推移的人、地点、物体及其关系。我们的系统为规划代理配备了结构化搜索和推理这些图的工具，以及混合视觉和音频搜索能力，使跨模态和时间上一致的推理成为可能。在EgoLifeQA和Video-MME（长）数据集上的实验表明，我们的方法在EgoLifeQA上达到了最先进的性能（57.5%），在Video-MME（长）上达到了竞争性的性能（74.1%），用于复杂的纵向视频理解任务。代码可在https://github.com/facebookresearch/egagent/获取。

Summary / 总结

This research aims to develop a system capable of understanding long-term, continuous egocentric video streams, which is essential for always-on AI assistants. The study introduces EGAgent, an enhanced agentic framework using entity scene graphs to represent and reason about people, places, and objects over time. Experiments demonstrate that EGAgent outperforms existing methods on EgoLifeQA (57.5%) and achieves competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks.

该研究针对始终在线的个人AI助手对长时视频理解的需求，提出了EGAgent，一种增强的代理框架，利用实体场景图和混合视觉与音频搜索能力。实验结果显示，EGAgent在EgoLifeQA上的准确率为57.5%，在Video-MME（长）上的准确率为74.1%，在复杂长时视频理解任务中表现出色。

LLEMA: Evolutionary Search with LLMs for Multi-Objective Materials Discovery

Authors: Nikhil Abhyankar, Sanchit Kabra, Saaketh Desai, Chandan K. Reddy

Venue: ICLR 2026

First: 2025-10-26T02:47:15+00:00 · Latest: 2026-03-05T18:02:19+00:00

Comments: ICLR 2026

Abstract

Materials discovery requires navigating vast chemical and structural spaces while satisfying multiple, often conflicting, objectives. We present LLM-guided Evolution for MAterials discovery (LLEMA), a unified framework that couples the scientific knowledge embedded in large language models with chemistry-informed evolutionary rules and memory-based refinement. At each iteration, an LLM proposes crystallographically specified candidates under explicit property constraints; a surrogate-augmented oracle estimates physicochemical properties; and a multi-objective scorer updates success/failure memories to guide subsequent generations. Evaluated on 14 realistic tasks that span electronics, energy, coatings, optics, and aerospace, LLEMA discovers candidates that are chemically plausible, thermodynamically stable, and property-aligned, achieving higher hit rates and improved Pareto front quality relative to generative and LLM-only baselines. Ablation studies confirm the importance of rule-guided generation, memory-based refinement, and surrogate prediction. By enforcing synthesizability and multi-objective trade-offs, LLEMA provides a principled approach to accelerating practical materials discovery. Project website: https://scientific-discovery.github.io/llema-project/
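
"Pareto front quality" refers to the set of candidates that no other candidate beats on every objective at once. A minimal sketch of extracting that set, assuming all objectives are to be maximized; the candidate names and the two objective scores (say, stability and property match) are hypothetical and not taken from the paper.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Names of candidates that are not dominated by any other candidate."""
    return [
        name for name, objectives in candidates.items()
        if not any(dominates(other, objectives)
                   for other_name, other in candidates.items()
                   if other_name != name)
    ]

# Hypothetical candidates scored on two objectives (higher is better).
candidates = {
    "A": (0.90, 0.20),
    "B": (0.60, 0.60),
    "C": (0.50, 0.50),  # dominated by B on both objectives
    "D": (0.20, 0.95),
}
print(pareto_front(candidates))
```

A multi-objective scorer of the kind described could use such a front to decide which candidates enter the success memory and which enter the failure memory.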

中文标题/摘要

标题：LLEMA：使用LLM的多目标材料发现进化搜索

材料发现需要在广泛的化学和结构空间中导航，同时满足多个常常相互冲突的目标。我们提出了由大语言模型指导的材料发现进化（LLEMA），这是一种统一框架，将嵌入在大语言模型中的科学知识与化学启发的进化规则和基于记忆的改进相结合。在每次迭代中，一个大语言模型在明确的性质约束下提出晶体学指定的候选物；一个增强的代理预言机估计物理化学性质；一个多目标评分器更新成功/失败记忆以指导后续代。在涵盖电子学、能源、涂层、光学和航空航天的14个现实任务上进行评估，LLEMA 发现了化学上合理、热力学稳定且性质对齐的候选物，相对于生成性和仅大语言模型的基线，其命中率更高，帕累托前沿质量更好。消融研究证实了规则指导生成、基于记忆的改进和代理预测的重要性。通过强制执行合成性和多目标权衡，LLEMA 提供了一种加速实际材料发现的原理性方法。项目网站：https://scientific-discovery.github.io/llema-project/

Summary / 总结

LLEMA is a unified framework for materials discovery that integrates large language models with chemistry-informed evolutionary rules and memory-based refinement. At each iteration, an LLM proposes crystallographically specified candidates under property constraints, a surrogate-augmented oracle estimates physicochemical properties, and a multi-objective scorer updates success/failure memories to guide subsequent generations. LLEMA outperforms generative and LLM-only baselines in hit rates and Pareto front quality across 14 diverse materials discovery tasks, demonstrating the importance of rule-guided generation, memory-based refinement, and surrogate prediction in accelerating practical materials discovery.

LLEMA 是一个将大型语言模型与化学导向的进化规则和基于记忆的改进相结合的统一材料发现框架。在每次迭代中，LLM 在属性约束下提出晶体结构指定的候选材料，一个增强的代理预言机估计物理化学性质，一个多目标评分器更新成功/失败记忆以指导后续代。LLEMA 在 14 个不同材料发现任务中表现出更高的命中率和改进的帕累托前沿质量，证明了规则导向生成、基于记忆的改进和代理预测在加速实际材料发现中的重要性。
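
The abstract reports improved Pareto front quality but does not spell out how a front is formed. As a minimal sketch, assuming every objective is to be maximized, the non-dominated set can be computed as below. The function names and candidate scores are hypothetical illustrations, not part of LLEMA.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical candidates scored on two objectives (higher is better for both).
candidates = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.95)]
front = pareto_front(candidates)
```

Here the third candidate is dominated by the second, so it drops out of the front while the other three remain as trade-offs between the two objectives.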

Latent Wasserstein Adversarial Imitation Learning

Authors: Siqi Yang, Kai Yan, Alexander G. Schwing, Yu-Xiong Wang

Venue: ICLR 2026

First: 2026-03-05T18:01:49+00:00 · Latest: 2026-03-05T18:01:49+00:00

Comments: 10 pages, accepted to ICLR 2026

Abstract

Imitation Learning (IL) enables agents to mimic expert behavior by learning from demonstrations. However, traditional IL methods require large amounts of medium-to-high-quality demonstrations as well as actions of expert demonstrations, both of which are often unavailable. To reduce this need, we propose Latent Wasserstein Adversarial Imitation Learning (LWAIL), a novel adversarial imitation learning framework that focuses on state-only distribution matching. It benefits from the Wasserstein distance computed in a dynamics-aware latent space. This dynamics-aware latent space differs from prior work and is obtained via a pre-training stage, where we train the Intention Conditioned Value Function (ICVF) to capture a dynamics-aware structure of the state space using a small set of randomly generated state-only data. We show that this enhances the policy's understanding of state transitions, enabling the learning process to use only one or a few state-only expert episodes to achieve expert-level performance. Through experiments on multiple MuJoCo environments, we demonstrate that our method outperforms prior Wasserstein-based IL methods and prior adversarial IL methods, achieving better results across various tasks.

中文标题/摘要

标题：潜隐 Wasserstein 对抗模仿学习

模仿学习（IL）使代理能够通过学习演示来模仿专家行为。然而，传统的IL方法需要大量的中等到高质量的演示以及专家演示的动作，这两种情况通常都不可用。为了减少这种需求，我们提出了潜隐 Wasserstein 对抗模仿学习（LWAIL），这是一种新颖的对抗模仿学习框架，专注于仅状态分布匹配。它得益于在动态感知潜隐空间中计算的 Wasserstein 距离。这种动态感知潜隐空间不同于先前的工作，并通过预训练阶段获得，其中我们使用少量随机生成的状态仅数据训练意图条件价值函数（ICVF），以捕捉状态空间的动态感知结构。我们表明，这增强了策略对状态转换的理解，使学习过程能够仅使用一个或几个状态仅专家演示来达到专家水平的性能。通过在多个 MuJoCo 环境中的实验，我们证明了我们的方法优于先前的基于 Wasserstein 的 IL 方法和先前的对抗 IL 方法，在各种任务上取得了更好的结果。

Summary / 总结

The research addresses the need for large amounts of expert demonstrations, including expert actions, in Imitation Learning (IL) by proposing Latent Wasserstein Adversarial Imitation Learning (LWAIL). LWAIL performs state-only distribution matching with a Wasserstein distance computed in a dynamics-aware latent space, which is obtained by pre-training an Intention Conditioned Value Function (ICVF) on a small set of randomly generated state-only data. Experiments on multiple MuJoCo environments show that LWAIL outperforms prior Wasserstein-based and adversarial IL methods, reaching expert-level performance with only one or a few state-only expert episodes.

研究旨在解决传统模仿学习（IL）需要大量专家演示数据的问题。提出了一种新的模仿学习框架——潜空间 Wasserstein 对抗模仿学习（LWAIL），该框架专注于使用动态感知的潜空间进行状态分布匹配。该方法通过少量随机生成的状态数据预训练意图条件价值函数（ICVF），以捕捉状态空间的动态结构。实验表明，LWAIL 在多个 MuJoCo 环境中优于之前的 Wasserstein 基准 IL 方法和对抗 IL 方法，能够用更少的专家演示数据达到专家水平的表现。
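
To make the distance that LWAIL builds on concrete, here is a minimal sketch of the Wasserstein-1 distance between two equal-size one-dimensional samples. It is an illustration only: LWAIL computes the distance in a learned latent space with an adversarial estimator, neither of which is reproduced here, and the function name is hypothetical.

```python
from typing import Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    In one dimension the optimal transport plan matches sorted points,
    so the distance is the mean absolute gap between order statistics.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical 1D "state" samples from an agent and an expert.
agent_states = [0.0, 1.0, 2.0]
expert_states = [3.0, 1.0, 2.0]
gap = wasserstein_1d(agent_states, expert_states)
```

A policy whose visited states match the expert's drives this quantity toward zero, which is the sense in which state-only distribution matching works without expert actions.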

OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation

Authors: Yoonjin Oh, Yongjin Kim, Hyomin Kim, Donghwan Chi, Sungwoong Kim

First: 2025-05-28T03:45:42+00:00 · Latest: 2026-03-05T18:01:26+00:00

Comments: 11 pages, 6 figures

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text-image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relations. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting self-constructed preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object hallucination to persist. To tackle this problem, we propose Object-centric Self-improving Preference Optimization (OSPO), a self-improving framework designed to enhance object-level text-image alignment. OSPO explicitly constructs object-centric preference data without relying on any external data and external models. We also introduce a new approach that leverages attention-based object masks together with an object-weighted SimPO loss to enhance object-specific fidelity. Extensive experiments on three compositional image generation benchmarks demonstrate that OSPO significantly improves fine-grained alignment and reduces object hallucination, outperforming prior self-improving methods and even specialized diffusion-based text-to-image models.

中文标题/摘要

标题：OSPO：面向对象的自我改进偏好优化以实现文本到图像生成

近期多模态大型语言模型（MLLMs）的发展使统一的多模态理解和生成成为可能。然而，它们仍然难以实现精细的文本-图像对齐，经常无法准确描绘具有正确属性（如颜色、形状和空间关系）的对象。为解决这一问题，先前的研究探索了偏好优化方法，如DPO和GRPO，但这些方法在构建偏好数据和执行优化方面会带来巨大的计算成本。这促使了自我改进的偏好优化方法的发展，在这些方法中，MLLM自主生成自己的训练数据，自我估计偏好反馈，并使用由此产生的自我构建的偏好对进行自我优化。然而，现有的自我改进方法仍然忽略了精细的、对象级别的语义，允许对象幻觉持续存在。为解决这一问题，我们提出了面向对象的自我改进偏好优化（OSPO），这是一种旨在增强对象级别文本-图像对齐的自我改进框架。OSPO明确构建了面向对象的偏好数据，不依赖于任何外部数据和外部模型。我们还引入了一种新的方法，利用基于注意力的对象掩码与对象加权SimPO损失相结合，以增强对象特定的保真度。在三个组合图像生成基准上的广泛实验表明，OSPO显著提高了精细对齐并减少了对象幻觉，优于先前的自我改进方法，甚至优于专门的基于扩散的文本到图像模型。

Summary / 总结

The research aims to improve fine-grained text-image alignment by addressing the issue of object hallucination in text-to-image generation. The proposed OSPO framework autonomously generates preference data and optimizes preferences, using attention-based object masks and an object-weighted SimPO loss to enhance object-specific fidelity. Experiments show that OSPO outperforms previous self-improving methods and specialized diffusion models in terms of fine-grained alignment and reducing object hallucination.

论文提出了OSPO，一种用于增强文本到图像生成中对象级文本-图像对齐的自我改进偏好优化框架。它自主构建对象中心的偏好数据，并使用基于注意力的对象掩码与对象加权的SimPO损失来提高对象特定的保真度。实验表明，OSPO显著提高了细粒度对齐并减少了对象幻觉，超越了之前的自我改进方法和专门的扩散模型。
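
For readers unfamiliar with the loss OSPO builds on, the following is a minimal sketch of the standard SimPO objective for a single preference pair: a length-normalized log-likelihood margin passed through a logistic loss. The abstract does not say how the object weights enter the loss, so no weighting is shown. The default values of `beta` and `gamma` are placeholders, not values from the paper.

```python
import math


def simpo_loss(
    logp_chosen: float,
    len_chosen: int,
    logp_rejected: float,
    len_rejected: int,
    beta: float = 2.0,   # placeholder scale, not taken from the paper
    gamma: float = 1.0,  # placeholder target margin, not taken from the paper
) -> float:
    """Standard SimPO loss for one (chosen, rejected) pair.

    `logp_*` are summed token log-probabilities under the policy and
    `len_*` are the corresponding sequence lengths.
    """
    margin = beta * (logp_chosen / len_chosen - logp_rejected / len_rejected) - gamma
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical pair: the chosen sample has a higher average log-probability.
loss_good = simpo_loss(-8.0, 10, -20.0, 10)
loss_bad = simpo_loss(-20.0, 10, -8.0, 10)
```

The loss is small when the policy already prefers the chosen sample by more than the margin, and grows roughly linearly when the preference is reversed.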

SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

Authors: Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Minju Jeon, Hyungee Kim, Dong-Jin Kim

Venue: CVPR 2026

First: 2026-03-05T17:59:58+00:00 · Latest: 2026-03-05T17:59:58+00:00

Comments: Accepted to CVPR 2026

Abstract

Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos using models trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to the corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.

中文标题/摘要

标题：SAIL：基于相似性感知的指导与跨图例增强学习的弱监督密集视频图释

弱监督密集视频图释旨在仅基于图释注释训练时，定位和描述视频中的事件，而无需时间边界。先前的工作引入了一种基于高斯掩码和互补图释的隐式监督范式。然而，现有方法仅关注生成不重叠的掩码，而未考虑其与相应事件的语义关系，导致生成简单且均匀分布的掩码，无法捕捉到语义上有意义的区域。此外，仅依赖真实图释会导致性能不佳，因为现有数据集的稀疏性。在本工作中，我们提出了SAIL，通过跨模态对齐构建语义感知的掩码。我们的相似性感知训练目标引导掩码强调与相应事件图释高度相似的视频区域。此外，为了在稀疏注释设置下引导更准确的掩码生成，我们引入了一种基于LLM的增强策略，生成合成图释以提供额外的对齐信号。这些合成图释通过跨掩码机制整合，为精确的时间定位提供辅助指导，而不损害主要目标。在ActivityNet Captions和YouCook2上的实验表明，SAIL在图释和定位指标上均达到了最先进的性能。

Summary / 总结

SAIL aims to improve weakly-supervised dense video captioning by constructing semantically-aware masks through cross-modal alignment. It uses a similarity-aware training objective to emphasize regions with high similarity to event captions and introduces an LLM-based augmentation strategy to generate synthetic captions for additional alignment signals. Experiments show SAIL outperforms existing methods on both captioning and localization metrics on ActivityNet Captions and YouCook2 datasets.

研究旨在通过解决现有方法生成简单化掩码和过度依赖真实标注文本的问题，来提升弱监督密集视频字幕生成。SAIL方法通过跨模态对齐构建语义感知掩码，并引入基于LLM的增强策略生成合成字幕以提供额外的对齐信号。该方法在ActivityNet Captions和YouCook2数据集上展示了在字幕生成和定位指标上的最新性能。

On-Policy Self-Distillation for Reasoning Compression

Authors: Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun

First: 2026-03-05T17:54:40+00:00 · Latest: 2026-03-05T17:54:40+00:00

Abstract

Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant; it is actively harmful, compounding errors with every unnecessary token.

中文标题/摘要

标题：基于策略自我蒸馏的推理压缩

推理模型会大声思考，但其中大部分内容都是噪音。我们引入了OPSDC（基于策略自我蒸馏的推理压缩），这是一种通过将模型自身的简洁行为蒸馏回自身来教会模型更简洁地推理的方法。整个方法归结为一个想法：在“要简洁”的指令下条件化同一个模型以获得教师概率分布，并在学生的自我展开中最小化每个词元的逆KL散度。无需真实答案，无需词元预算，无需难度估计器。只有自我蒸馏。然而，这种简单性背后隐藏着惊人的复杂性：OPSDC会自动对简单问题进行激进的压缩，同时保留解决难题所需的推理。在Qwen3-8B和Qwen3-14B上，我们在MATH-500上实现了57-59%的词元减少，同时提高了9-16个绝对分数的准确性。在AIME 2024上，14B模型在41%的压缩下获得了10分的提升。秘诀在于：推理模型生成的内容不仅冗余，而且是具有破坏性的，每个不必要的词元都会加剧错误的累积。

Summary / 总结

OPSDC (On-Policy Self-Distillation for Reasoning Compression) teaches reasoning models to reason more concisely by distilling their own concise behavior back into themselves. The same model, conditioned on a "be concise" instruction, supplies the teacher logits, and training minimizes the per-token reverse KL on the student's own rollouts, with no ground-truth answers, token budgets, or difficulty estimators. On Qwen3-8B and Qwen3-14B, OPSDC achieves 57-59% token reduction on MATH-500 while improving accuracy by 9-16 absolute points. On AIME 2024, the 14B model gains 10 points with 41% compression. The authors attribute the gains to much of the reasoning output being not just redundant but actively harmful, compounding errors with every unnecessary token.

OPSDC（On-Policy Self-Distillation for Reasoning Compression）是一种通过反向蒸馏模型自身行为来使其更简洁的方法。该方法通过将模型条件化在“简洁”指令上以获得教师概率，并最小化学生自身展开的每个令牌的反向KL散度。在Qwen3-8B和Qwen3-14B上，OPSDC在MATH-500上实现了57-59%的令牌减少，同时提高了9-16个点的准确性。在AIME 2024上，14B模型在41%压缩的情况下获得了10个点的提升，表明模型生成的内容不仅冗余，而且有害，每个不必要的令牌都会加剧错误的累积。
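
The per-token objective described above can be sketched in a few lines. This is a minimal illustration, assuming "reverse KL" means KL(student || teacher) at each position of a student-sampled rollout. The logits are hypothetical toy values, and a real training loop would compute them with the model and backpropagate through the student.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def per_token_reverse_kl(
    student_seq: Sequence[Sequence[float]], teacher_seq: Sequence[Sequence[float]]
) -> List[float]:
    """Reverse KL at every position of one rollout."""
    return [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]


# Hypothetical logits over a 3-token vocabulary at two rollout positions.
# The teacher is the same model given the extra "be concise" instruction.
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher = [[2.0, 0.5, 0.1], [1.5, 0.1, 0.1]]
losses = per_token_reverse_kl(student, teacher)
```

Positions where the instruction leaves the prediction unchanged contribute zero loss, so the training signal concentrates on tokens where the concise-conditioned model disagrees with the unconditioned one.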

Ensembling Language Models with Sequential Monte Carlo

Authors: Robin Shing Moon Chan, Tianyu Liu, Samuel Kiegeland, Clemente Pasti, Jacob Hoover Vigly, Timothy J. O'Donnell, Ryan Cotterell, Tim Vieira

First: 2026-03-05T17:54:31+00:00 · Latest: 2026-03-05T17:54:31+00:00

Abstract

Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.

中文标题/摘要

标题：使用序列蒙特卡洛方法集成语言模型

从业者可以访问大量语言模型和提示策略以解决许多语言建模任务；然而，先前的工作表明，模型性能对这些选择的高度敏感。经典的机器学习集成技术提供了一种原则性的方法：从多个来源聚合预测以获得比任何单一来源更好的性能。然而，在解码过程中将集成应用于语言模型是具有挑战性的：简单地聚合下一个标记的概率会产生一个局部归一化、有偏的近似分布，该分布通常难以计算。在本文中，我们介绍了一种统一框架，用于将 $K$ 个语言模型组合成一系列函数 $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$ 的 $f$-集成分布。为了从这些分布中采样，我们提出了一种字节级序列蒙特卡洛（SMC）算法，该算法在共享字符空间中操作，使得具有不同词汇表的模型的集成能够一致地采样。我们评估了不同提示和模型组合下的 $f$-集成族在各种结构化文本生成任务中的表现，突出了替代聚合策略相对于传统概率平均的优势，并展示了更好的后验近似可以提高集成性能。

Summary / 总结

This work addresses the challenge of ensembling language models during decoding by introducing a unified framework for composing multiple language models into ensemble distributions using various functions. The authors propose a byte-level sequential Monte Carlo algorithm to sample from these distributions, enabling consistent sampling across models with different vocabularies. Experimental results show that alternative aggregation strategies outperform traditional probability averaging, leading to improved ensemble performance for structured text generation tasks.

该研究通过引入一种统一框架，使用多种聚合函数将多个语言模型组合成集成分布，解决了在解码过程中集成语言模型的挑战。作者提出了一种字节级的顺序蒙特卡洛算法来从这些分布中采样，使得在具有不同词汇表的模型之间可以进行一致的采样。实验结果表明，与传统的概率平均聚合相比，不同的聚合策略能够通过提供更好的后验近似来提高集成性能。
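
The claim that naive next-token aggregation is a biased approximation of the ensemble distribution over strings can be checked on a toy case. The sketch below assumes two hypothetical models over length-2 strings and takes the product as one possible choice of $f$. It enumerates the globally normalized ensemble exactly and compares it with the locally normalized one; it does not implement the paper's SMC sampler.

```python
from itertools import product

ALPHABET = "ab"
LENGTH = 2

# Two hypothetical toy language models: prefix -> next-token distribution.
P1 = {"": {"a": 0.5, "b": 0.5}, "a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}
P2 = {"": {"a": 0.5, "b": 0.5}, "a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}

STRINGS = ["".join(chars) for chars in product(ALPHABET, repeat=LENGTH)]


def f(p1: float, p2: float) -> float:
    """One possible aggregation function: the product of the probabilities."""
    return p1 * p2


def string_prob(model: dict, s: str) -> float:
    """Probability a model assigns to the whole string."""
    p = 1.0
    for i, ch in enumerate(s):
        p *= model[s[:i]][ch]
    return p


def global_ensemble() -> dict:
    """Apply f to whole-string probabilities, then normalize once over all strings."""
    weights = {s: f(string_prob(P1, s), string_prob(P2, s)) for s in STRINGS}
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}


def local_ensemble() -> dict:
    """Apply f to next-token probabilities and renormalize at every step."""
    out = {}
    for s in STRINGS:
        p = 1.0
        for i, ch in enumerate(s):
            prefix = s[:i]
            scores = {c: f(P1[prefix][c], P2[prefix][c]) for c in ALPHABET}
            p *= scores[ch] / sum(scores.values())
        out[s] = p
    return out


global_q = global_ensemble()
local_q = local_ensemble()
```

The two models disagree sharply after the prefix "a", so the global ensemble assigns little mass to strings starting with "a". The locally normalized version renormalizes that disagreement away at the second step and overweights those strings, which is the bias the abstract refers to.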

RA-QA: A Benchmarking System for Respiratory Audio Question Answering Under Real-World Heterogeneity

Authors: Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo

First: 2026-02-04T13:25:47+00:00 · Latest: 2026-03-05T17:54:01+00:00

Abstract

As conversational multimodal AI tools are increasingly adopted to process patient data for health assessment, robust benchmarks are needed to measure progress and expose failure modes under realistic conditions. Despite the importance of respiratory audio for mobile health screening, respiratory audio question answering remains underexplored, with existing studies evaluated narrowly and lacking real-world heterogeneity across modalities, devices, and question types. We hence introduce the Respiratory-Audio Question-Answering (RA-QA) benchmark, including a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol. RA-QA harmonizes public RA datasets into a collection of 9 million format-diverse QA pairs covering diagnostic and contextual attributes. We benchmark classical ML baselines alongside multimodal audio-language models, establishing reproducible reference points and showing how current approaches fail under heterogeneity.

中文标题/摘要

标题：RA-QA：在现实世界异质性条件下呼吸音频问答基准系统

随着对话型多模态AI工具被越来越多地用于处理患者数据以进行健康评估，需要稳健的基准来衡量在现实条件下的进步并揭示失败模式。尽管呼吸音频对于移动健康筛查至关重要，但呼吸音频问答仍被研究不足，现有研究评估狭窄且缺乏跨模态、设备和问题类型的现实世界异质性。因此，我们引入了呼吸音频问答（RA-QA）基准，包括标准化的数据生成管道、全面的多模态问答集合和统一的评估协议。RA-QA将公共呼吸音频数据集整合为包含900万对格式多样的问答对，涵盖诊断和上下文属性。我们基准测试了经典机器学习基线和多模态音频-语言模型，建立了可重复的参考点，并展示了当前方法在异质性条件下失效的情况。

Summary / 总结

The research introduces the RA-QA benchmark to evaluate respiratory audio question answering under real-world conditions. It includes a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol. The benchmark covers 9 million QA pairs and evaluates classical ML baselines and multimodal audio-language models, highlighting the failure of current approaches under heterogeneity.

研究引入了RA-QA基准，以评估在真实世界条件下的呼吸音频问答。它包括标准化的数据生成管道、全面的多模态QA集合和统一的评估协议。基准涵盖了900万对QA，并评估了经典ML基线和多模态音频-语言模型，展示了当前方法在异质性下的失败情况。

RelaxFlow: Text-Driven Amodal 3D Generation

Authors: Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao

First: 2026-03-05T17:45:47+00:00 · Latest: 2026-03-05T17:45:47+00:00

Comments: Code: https://github.com/viridityzhu/RelaxFlow

Abstract

Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.

中文标题/摘要

标题：RelaxFlow：文本驱动的无遮挡3D生成

在遮挡下，从图像到3D的生成面临着固有的语义模糊性，仅凭部分观察往往不足以确定物体类别。在本文中，我们形式化了文本驱动的无遮挡3D生成，其中文本提示引导未见区域的完成，同时严格保留输入观察。关键的是，我们发现这些目标需要不同的控制粒度：对观察进行刚性控制，而对提示进行放松的结构控制。为此，我们提出了RelaxFlow，这是一种无需训练的双分支框架，通过多先验一致性模块和放松机制解耦控制粒度。理论上，我们证明了我们的放松等同于在生成向量场中应用低通滤波器，这抑制了高频实例细节，以隔离几何结构，使其适应观察。为了便于评估，我们引入了两个诊断基准，ExtremeOcc-3D和AmbiSem-3D。广泛的实验表明，RelaxFlow成功地引导了未见区域的生成，使其与提示意图一致，而不牺牲视觉保真度。

Summary / 总结

The research addresses image-to-3D generation under occlusion, where a partial observation is often insufficient to determine the object category, by using text prompts to steer the completion of unseen regions while strictly preserving the observed input. It introduces RelaxFlow, a training-free dual-branch framework that decouples control granularity through a Multi-Prior Consensus Module and a Relaxation Mechanism: rigid control for the observation and relaxed structural control for the prompt. The authors prove that the relaxation is equivalent to applying a low-pass filter to the generative vector field, suppressing high-frequency instance details to isolate geometric structure. Experiments on two new diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D, show that RelaxFlow matches the prompt intent in unseen regions without compromising visual fidelity.

研究解决了从部分观察生成3D模型的挑战，通过文本提示引导未观察区域的完成，同时保留已观察的部分。提出了一个无需训练的双分支框架RelaxFlow，通过多先验一致性模块和放松机制来解耦控制粒度。理论分析表明，放松机制抑制了高频细节，专注于几何结构。实验在新基准上显示，RelaxFlow 成功地使未观察区域与提示意图对齐，同时保持视觉质量。
