arXiv 论文速递

Snapshot: 20260218_0353

EditCtrl: Disentangled Local and Global Control for Real-Time Generative Video Editing

Authors: Yehonathan Litman, Shikun Liu, Dario Seyb, Nicholas Milef, Yang Zhou, Carl Marshall, Shubham Tulsiani, Caleb Leak

First: 2026-02-16T18:59:58+00:00 · Latest: 2026-02-16T18:59:58+00:00

Comments: Project page: https://yehonathanlitman.github.io/edit_ctrl

Abstract

High-fidelity generative video editing has seen significant quality improvements by leveraging pre-trained video foundation models. However, their computational cost is a major bottleneck, as they are often designed to inefficiently process the full video context regardless of the inpainting mask's size, even for sparse, localized edits. In this paper, we introduce EditCtrl, an efficient video inpainting control framework that focuses computation only where it is needed. Our approach features a novel local video context module that operates solely on masked tokens, yielding a computational cost proportional to the edit size. This local-first generation is then guided by a lightweight temporal global context embedder that ensures video-wide context consistency with minimal overhead. Not only is EditCtrl 10 times more compute efficient than state-of-the-art generative editing methods, it even improves editing quality compared to methods designed with full-attention. Finally, we showcase how EditCtrl unlocks new capabilities, including multi-region editing with text prompts and autoregressive content propagation.

中文标题/摘要

标题：EditCtrl: 分离的局部和全局控制实现实时生成视频编辑

通过利用预训练的视频基础模型，高保真生成视频编辑在质量上取得了显著提升。然而，它们的计算成本是一个主要瓶颈，因为它们通常被设计为无论补丁掩码的大小如何，都以低效的方式处理整个视频上下文，即使对于稀疏的局部编辑也是如此。在本文中，我们提出了EditCtrl，这是一种高效的视频修补控制框架，仅在需要的地方进行计算。我们的方法包含一个新颖的局部视频上下文模块，仅作用于被遮罩的标记，计算成本与编辑大小成正比。然后，该局部优先生成由轻量级的时空全局上下文嵌入器引导，以确保视频范围内的上下文一致性，同时最小化开销。与最先进的生成编辑方法相比，EditCtrl不仅计算效率提高了10倍，甚至在编辑质量上也优于那些具有全注意力机制的方法。最后，我们展示了EditCtrl如何解锁新的功能，包括使用文本提示进行多区域编辑和自回归内容传播。

Summary / 总结

EditCtrl is a video inpainting control framework that concentrates computation on the masked region being edited. A local video context module operates only on masked tokens, so cost scales with edit size, while a lightweight temporal global context embedder keeps the edit consistent with the rest of the video. The authors report that it is 10 times more compute-efficient than state-of-the-art generative editing methods while improving editing quality over full-attention methods, and that it enables multi-region editing with text prompts and autoregressive content propagation.

EditCtrl 是一种框架，旨在通过仅在需要的地方进行计算来提高实时生成视频编辑的效率。它使用一个仅在遮罩令牌上操作的局部视频上下文模块，将计算成本降低到与编辑大小成比例的水平。这由一个轻量级的全局上下文嵌入器引导，以确保视频范围内的上下文一致性。EditCtrl 的计算效率比当前方法高 10 倍，并且甚至提高了编辑质量。它还使多区域编辑和自回归内容传播等新功能成为可能。
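
The idea of running the expensive step only on masked tokens can be sketched as a gather, process, scatter pattern. This is a toy illustration, not the EditCtrl architecture: `edit_masked_tokens`, `local_fn`, and the token shapes are all hypothetical names and sizes chosen for the example, and `local_fn` merely stands in for whatever local module does the real work.

```python
import numpy as np

def edit_masked_tokens(tokens, mask, local_fn):
    """Apply local_fn only to tokens where mask is True; copy the rest unchanged."""
    out = tokens.copy()
    idx = np.flatnonzero(mask)          # positions of tokens inside the edit region
    out[idx] = local_fn(tokens[idx])    # work scales with len(idx), not len(tokens)
    return out, len(idx)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(1024, 16))    # hypothetical: 1024 video tokens, 16-dim each
mask = np.zeros(1024, dtype=bool)
mask[100:164] = True                    # a sparse edit covering 64 tokens

# Placeholder local module: zero out the selected tokens.
edited, n_processed = edit_masked_tokens(tokens, mask, lambda x: x * 0.0)
fraction = n_processed / len(tokens)    # share of tokens that were actually processed
```

In this toy setup only 64 of 1024 tokens pass through `local_fn`, which is the sense in which cost is proportional to edit size. How the global context is injected is left out entirely.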

Symmetry in language statistics shapes the geometry of model representations

Authors: Dhruva Karkada, Daniel J. Korchinski, Andres Nava, Matthieu Wyart, Yasaman Bahri

First: 2026-02-16T18:59:55+00:00 · Latest: 2026-02-16T18:59:55+00:00

Abstract

Although learned representations underlie neural networks' success, their fundamental properties remain poorly understood. A striking example is the emergence of simple geometric structures in LLM representations: for example, calendar months organize into a circle, years form a smooth one-dimensional manifold, and cities' latitudes and longitudes can be decoded by a linear probe. We show that the statistics of language exhibit a translation symmetry -- e.g., the co-occurrence probability of two months depends only on the time interval between them -- and we prove that the latter governs the aforementioned geometric structures in high-dimensional word embedding models. Moreover, we find that these structures persist even when the co-occurrence statistics are strongly perturbed (for example, by removing all sentences in which two months appear together) and at moderate embedding dimension. We show that this robustness naturally emerges if the co-occurrence statistics are collectively controlled by an underlying continuous latent variable. We empirically validate this theoretical framework in word embedding models, text embedding models, and large language models.

中文标题/摘要

标题：语言统计中的对称性塑造模型表示的几何结构

尽管学习到的表示是神经网络成功的关键，但其基本属性仍知之甚少。一个引人注目的例子是大型语言模型（LLM）表示中出现的简单几何结构：例如，月份组织成一个圆圈，年份形成一个平滑的一维流形，城市纬度和经度可以通过线性探针解码。我们展示了语言统计中存在平移对称性——例如，两个月份共现的概率仅取决于它们之间的时间间隔——并且我们证明了后者控制了高维词嵌入模型中的上述几何结构。此外，我们发现即使共现统计显著扰动（例如，移除所有两个月份同时出现的句子）这些结构仍然存在，并且在中等嵌入维度下也是如此。我们证明，如果共现统计由一个潜在的连续隐变量集体控制，这种鲁棒性自然出现。我们在词嵌入模型、文本嵌入模型和大型语言模型中实证验证了这一理论框架。

Summary / 总结

This study traces simple geometric structures in language model representations to a translation symmetry in language statistics. For example, the co-occurrence probability of two calendar months depends only on the time interval between them, and the authors prove that this symmetry governs structures such as months arranged in a circle and years lying on a smooth one-dimensional manifold in high-dimensional word embedding models. The structures persist when co-occurrence statistics are strongly perturbed and at moderate embedding dimension, a robustness that emerges naturally if the statistics are collectively controlled by an underlying continuous latent variable. The framework is validated empirically in word embedding models, text embedding models, and large language models.

研究将语言模型表示中的简单几何结构归因于语言统计中的平移对称性。例如，两个月份的共现概率仅取决于它们之间的时间间隔；作者证明，这种对称性决定了高维词嵌入模型中月份排成圆环、年份形成平滑一维流形等结构。即使共现统计受到强烈扰动，或嵌入维度仅为中等，这些结构依然存在；如果共现统计由一个潜在的连续隐变量共同控制，这种稳健性便会自然出现。该理论框架在词嵌入模型、文本嵌入模型和大型语言模型中得到了实证验证。
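
A small numerical check can make the link between translation symmetry and circular geometry concrete. The sketch below is not the paper's construction: it assumes a made-up co-occurrence matrix for 12 months whose entries depend only on the circular interval between months (the kernel `exp(-d / 2)` is an arbitrary choice). Such a matrix is circulant, its eigenvectors are Fourier modes, and the leading non-constant pair places the months on a circle.

```python
import numpy as np

n = 12                                   # calendar months
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
d = np.minimum(np.abs(i - j), n - np.abs(i - j))   # circular interval between months
C = np.exp(-d / 2.0)                     # assumed kernel: depends only on the interval

vals, vecs = np.linalg.eigh(C)           # eigenvalues in ascending order
coords = vecs[:, [-3, -2]]               # degenerate pair just below the constant mode
radii = np.linalg.norm(coords, axis=1)   # distance of each month from the origin
```

Every month ends up at the same radius in this 2D embedding, so the points lie on a circle. Real co-occurrence statistics are only approximately symmetric, which is where the paper's robustness analysis comes in.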

Long Context, Less Focus: A Scaling Gap in LLMs Revealed through Privacy and Personalization

Authors: Shangding Gu

First: 2026-02-16T18:59:42+00:00 · Latest: 2026-02-16T18:59:42+00:00

Abstract

Large language models (LLMs) are increasingly deployed in privacy-critical and personalization-oriented scenarios, yet the role of context length in shaping privacy leakage and personalization effectiveness remains largely unexplored. We introduce a large-scale benchmark, PAPerBench, to systematically study how increasing context length influences both personalization quality and privacy protection in LLMs. The benchmark comprises approximately 29,000 instances with context lengths ranging from 1K to 256K tokens, yielding a total of 377K evaluation questions. It jointly evaluates personalization performance and privacy risks across diverse scenarios, enabling controlled analysis of long-context model behavior. Extensive evaluations across state-of-the-art LLMs reveal consistent performance degradation in both personalization and privacy as context length increases. We further provide a theoretical analysis of attention dilution under context scaling, explaining this behavior as an inherent limitation of soft attention in fixed-capacity Transformers. The empirical and theoretical findings together suggest a general scaling gap in current models -- long context, less focus. We release the benchmark to support reproducible evaluation and future research on scalable privacy and personalization. Code and data are available at https://github.com/SafeRL-Lab/PAPerBench

中文标题/摘要

标题：长上下文，少关注：通过隐私和个性化揭示的LLM扩展差距

大型语言模型（LLMs）越来越多地部署在隐私关键和个性化导向的场景中，但上下文长度对隐私泄露和个性化效果的影响在很大程度上仍未被探索。我们引入了一个大规模基准PAPerBench，系统研究增加上下文长度如何影响LLMs中的个性化质量和隐私保护。基准包括约29,000个实例，上下文长度从1K到256K个标记不等，总共生成了377,000个评估问题。它联合评估了在不同场景下的个性化性能和隐私风险，使对长上下文模型行为的可控分析成为可能。对最先进的LLMs的广泛评估揭示了随着上下文长度增加，个性化和隐私性能的一致退化。我们进一步提供了在上下文扩展下注意力稀释的理论分析，将这种行为解释为固定容量Transformer中软注意力的固有限制。实证和理论发现共同表明当前模型存在一个普遍的扩展差距：长上下文，少关注。我们发布了基准以支持可重复评估和未来关于可扩展隐私和个性化的研究。代码和数据可在 https://github.com/SafeRL-Lab/PAPerBench 获取。

Summary / 总结

This study introduces PAPerBench, a large-scale benchmark of approximately 29,000 instances with context lengths from 1K to 256K tokens and 377K evaluation questions, to measure how context length affects personalization quality and privacy protection in large language models (LLMs). Evaluations across state-of-the-art LLMs show consistent degradation in both personalization and privacy as context length increases. A theoretical analysis attributes this to attention dilution, an inherent limitation of soft attention in fixed-capacity Transformers, pointing to a general scaling gap in which longer context brings less focus. The benchmark is released to support reproducible evaluation and future research on scalable privacy and personalization.

研究通过引入包含29,000个实例和377K评估问题的PAPerBench大规模基准，探讨了上下文长度对大型语言模型（LLMs）个性化和隐私的影响。研究发现，随着上下文长度的增加，个性化和隐私性能一致地下降，表明当前模型存在一个普遍的缩放缺口。理论分析进一步解释了在固定容量Transformer中软注意力下的上下文扩展会导致注意力稀释，这是导致这种行为的根本限制。
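
The intuition behind attention dilution can be shown with a one-line softmax calculation. This is a toy model, not the paper's theoretical analysis: it assumes a single relevant token whose attention logit exceeds every other token's logit by a fixed gap (`logit_gap`, a hypothetical parameter), and asks how much softmax weight that token keeps as the context grows.

```python
import math

def relevant_weight(n_tokens, logit_gap=4.0):
    """Softmax weight on one relevant token whose logit exceeds the others by logit_gap."""
    return math.exp(logit_gap) / (math.exp(logit_gap) + (n_tokens - 1))

# Context lengths chosen to span the 1K to 256K range mentioned in the abstract.
weights = {n: relevant_weight(n) for n in (1_000, 32_000, 256_000)}
```

With the gap held fixed, the weight on the relevant token shrinks roughly as 1/n, so the same evidence receives less focus in a longer context unless the logit gap grows with it.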

Rethinking Diffusion Models with Symmetries through Canonicalization with Applications to Molecular Graph Generation

Authors: Cai Zhou, Zijie Chen, Zian Li, Jike Wang, Kaiyi Jiang, Pan Li, Rose Yu, Muhan Zhang, Stephen Bates, Tommi Jaakkola

First: 2026-02-16T18:58:55+00:00 · Latest: 2026-02-16T18:58:55+00:00

Comments: 32 pages

Abstract

Many generative tasks in chemistry and science involve distributions invariant to group symmetries (e.g., permutation and rotation). A common strategy enforces invariance and equivariance through architectural constraints such as equivariant denoisers and invariant priors. In this paper, we challenge this tradition through the alternative canonicalization perspective: first map each sample to an orbit representative with a canonical pose or order, train an unconstrained (non-equivariant) diffusion or flow model on the canonical slice, and finally recover the invariant distribution by sampling a random symmetry transform at generation time. Building on a formal quotient-space perspective, our work provides a comprehensive theory of canonical diffusion by proving: (i) the correctness, universality and superior expressivity of canonical generative models over invariant targets; (ii) that canonicalization accelerates training by removing diffusion score complexity induced by group mixtures and reducing conditional variance in flow matching. We then show that aligned priors and optimal transport act complementarily with canonicalization and further improve training efficiency. We instantiate the framework for molecular graph generation under $S_n \times SE(3)$ symmetries. By leveraging geometric spectra-based canonicalization and mild positional encodings, canonical diffusion significantly outperforms equivariant baselines in 3D molecule generation tasks, with similar or even less computation. Moreover, with a novel architecture Canon, CanonFlow achieves state-of-the-art performance on the challenging GEOM-DRUG dataset, and the advantage remains large in few-step generation.

中文标题/摘要

标题：通过典范化重新思考具有对称性的扩散模型及其在分子图生成中的应用

化学和科学中的许多生成任务涉及对群对称性（例如置换和旋转）不变的分布。一种常见的策略是通过架构约束（如等变去噪器和不变先验）来强制不变性和等变性。在本文中，我们通过典范化的替代视角挑战这一传统：首先将每个样本映射到具有典范姿态或顺序的轨道代表，然后在典范切片上训练一个不受约束（非等变）的扩散或流模型，最后在生成时通过采样随机对称变换恢复不变分布。基于形式化的商空间视角，我们的工作提供了典范扩散的全面理论，证明了：（i）典范生成模型相对于不变目标的正确性、普适性和优越的表达能力；（ii）典范化通过消除群混合引起的扩散得分复杂性并减少流匹配中的条件方差来加速训练。然后我们展示了对齐的先验和最优传输与典范化互补，进一步提高了训练效率。我们针对$S_n \times SE(3)$对称性下的分子图生成实例化了该框架。通过利用基于几何谱的典范化和温和的位置编码，典范扩散在3D分子生成任务中显著优于等变基线，计算量相似甚至更少。此外，借助新颖的架构Canon，CanonFlow在具有挑战性的GEOM-DRUG数据集上达到了最先进的性能，且在少步生成中优势仍然很大。

Summary / 总结

This paper revisits diffusion models for symmetric data through canonicalization: each sample is mapped to an orbit representative with a canonical pose or order, an unconstrained (non-equivariant) diffusion or flow model is trained on the canonical slice, and the invariant distribution is recovered by sampling a random symmetry transform at generation time. The authors prove that this approach is correct, universal, and more expressive over invariant targets, and that it accelerates training. Applied to molecular graph generation under $S_n \times SE(3)$ symmetries, canonical diffusion outperforms equivariant baselines with similar or less computation. With the new Canon architecture, CanonFlow achieves state-of-the-art results on the GEOM-DRUG dataset, and its advantage remains large in few-step generation.

本文通过典范化重新思考具有对称性的扩散模型：先将每个样本映射到具有典范姿态或顺序的轨道代表，再在典范切片上训练一个不受约束的模型。这种方法加速了训练并提高了表达能力。该方法应用于$S_n \times SE(3)$对称性下的分子图生成，在相似或更低的计算成本下优于等变基线。借助Canon架构，CanonFlow在具有挑战性的GEOM-DRUG数据集上达到了最先进的效果，且在少步生成中仍保持明显优势。
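
The canonicalize-then-randomize recipe can be illustrated on a plain 3D point cloud. The sketch below deliberately swaps in a simple PCA-based canonicalization rather than the geometric spectra-based one the abstract mentions, and the function names are hypothetical. One known limitation of this simple version: resolving axis signs by skewness makes it invariant to reflections as well as rotations, so it would not distinguish mirror-image molecules.

```python
import numpy as np

def canonicalize(X):
    """Map a point cloud (n, 3) to a representative that is fixed under row
    permutations, translations, and orthogonal transforms (reflections included)."""
    Xc = X - X.mean(axis=0)                       # remove translation
    _, vecs = np.linalg.eigh(Xc.T @ Xc)           # principal axes of the cloud
    P = Xc @ vecs                                 # coordinates in the principal frame
    signs = np.sign((P ** 3).sum(axis=0))         # resolve each axis sign by skewness
    P = P * signs
    order = np.lexsort((P[:, 2], P[:, 1], P[:, 0]))   # canonical row order
    return P[order]

def random_rotation(rng):
    """Draw a proper rotation matrix via QR of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]                        # flip one axis to get det = +1
    return Q

rng = np.random.default_rng(0)
X = rng.exponential(size=(8, 3)) * np.array([3.0, 2.0, 1.0])   # toy, asymmetric cloud
# A random symmetry transform, as would be sampled at generation time.
X_moved = X[rng.permutation(8)] @ random_rotation(rng) + rng.normal(size=3)
```

Both `X` and `X_moved` map to the same canonical representative, so an unconstrained model trained on `canonicalize` outputs sees one pose per orbit. The sketch assumes distinct covariance eigenvalues and nonzero skewness along each axis; degenerate inputs would need extra tie-breaking.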

Generalization from Low- to Moderate-Resolution Spectra with Neural Networks for Stellar Parameter Estimation: A Case Study with DESI

Authors: Xiaosheng Zhao, Yuan-Sen Ting, Rosemary F. G. Wyse, Alexander S. Szalay, Yang Huang, László Dobos, Tamás Budavári, Viska Wei

First: 2026-02-16T18:58:47+00:00 · Latest: 2026-02-16T18:58:47+00:00

Comments: 20 pages, 13 figures, 4 tables. Submitted to AAS journals. Comments welcome

Abstract

Cross-survey generalization is a critical challenge in stellar spectral analysis, particularly in cases such as transferring from low- to moderate-resolution surveys. We investigate this problem using pre-trained models, focusing on simple neural networks such as multilayer perceptrons (MLPs), with a case study transferring from LAMOST low-resolution spectra (LRS) to DESI medium-resolution spectra (MRS). Specifically, we pre-train MLPs on either LRS or their embeddings and fine-tune them for application to DESI stellar spectra. We compare MLPs trained directly on spectra with those trained on embeddings derived from transformer-based models (self-supervised foundation models pre-trained for multiple downstream tasks). We also evaluate different fine-tuning strategies, including residual-head adapters, LoRA, and full fine-tuning. We find that MLPs pre-trained on LAMOST LRS achieve strong performance, even without fine-tuning, and that modest fine-tuning with DESI spectra further improves the results. For iron abundance, embeddings from a transformer-based model yield advantages in the metal-rich ([Fe/H] > -1.0) regime, but underperform in the metal-poor regime compared to MLPs trained directly on LRS. We also show that the optimal fine-tuning strategy depends on the specific stellar parameter under consideration. These results highlight that simple pre-trained MLPs can provide competitive cross-survey generalization, while the role of spectral foundation models for cross-survey stellar parameter estimation requires further exploration.

中文标题/摘要

标题：利用神经网络从低分辨率光谱到中分辨率光谱的恒星参数估计泛化：DESI案例研究

跨调查泛化是恒星光谱分析中的一个关键挑战，特别是在从低分辨率转移到中分辨率调查的情况下。我们使用预训练模型研究了这个问题，重点关注简单的神经网络，如多层感知机（MLPs），以从LAMOST低分辨率光谱（LRS）转移到DESI中分辨率光谱（MRS）为例。具体来说，我们预先在LRS或其嵌入上训练MLPs，并针对DESI恒星光谱进行微调。我们将直接在光谱上训练的MLPs与从基于变换器的模型（自我监督的基础模型预训练用于多个下游任务）派生的嵌入上训练的MLPs进行了比较。我们还评估了不同的微调策略，包括残差头适配器、LoRA和全微调。我们发现，预先在LAMOST LRS上训练的MLPs即使不进行微调也能取得良好的性能，而适度使用DESI光谱进行微调可以进一步提高结果。对于铁丰度，在金属丰富（[Fe/H] > -1.0）的情况下，基于变换器的模型的嵌入具有优势，但在金属贫乏的情况下，直接在LRS上训练的MLPs表现更好。我们还表明，最佳的微调策略取决于所考虑的具体恒星参数。这些结果表明，简单的预训练MLPs可以提供具有竞争力的跨调查泛化，而基于光谱的基础模型在跨调查恒星参数估计中的作用需要进一步探索。

Summary / 总结

This study addresses the challenge of cross-survey generalization in stellar spectral analysis by using pre-trained neural networks, specifically multilayer perceptrons (MLPs), to transfer from low-resolution LAMOST spectra to medium-resolution DESI spectra. The research compares MLPs trained directly on spectra with those trained on embeddings from transformer-based models and evaluates different fine-tuning strategies. Key findings include strong performance of MLPs pre-trained on LAMOST spectra, with modest fine-tuning further improving results. Embeddings from transformer-based models show advantages for metal-rich stars but underperform for metal-poor stars compared to MLPs trained directly on LAMOST spectra. The optimal fine-tuning strategy varies depending on the stellar parameter being estimated.

研究探讨了使用神经网络（特别是多层感知机MLP）将恒星参数估计从低分辨率光谱推广到中分辨率光谱的问题。在LAMOST低分辨率光谱上预训练的MLP即使不微调也表现出较强性能，使用DESI光谱进行适度微调可进一步提升效果。基于Transformer的模型的光谱嵌入在富金属恒星上表现较好，但在贫金属恒星上不如直接使用低分辨率光谱训练的MLP。最佳微调策略因所估计的恒星参数而异。
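
Of the fine-tuning strategies listed in the abstract, LoRA is easy to show in a few lines. The sketch below is a generic LoRA layer in NumPy, not the authors' code: the class name, layer sizes, rank `r`, and scaling `alpha` are illustrative assumptions. The pre-trained weight stays frozen and only the two small matrices `A` and `B` would be trained on the new survey's spectra.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""
    def __init__(self, W, r=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                   # pre-trained weight, kept frozen
        self.A = rng.normal(scale=0.01, size=(r, d_in))
        self.B = np.zeros((d_out, r))                # zero init: no change at start
        self.scale = alpha / r

    def __call__(self, x):
        # Base output plus the low-rank correction.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

rng = np.random.default_rng(1)
W = rng.normal(size=(32, 64))       # hypothetical layer: 64 input features -> 32 hidden
layer = LoRALinear(W)
x = rng.normal(size=(5, 64))
trainable = layer.A.size + layer.B.size   # parameters updated during fine-tuning
```

Because `B` starts at zero, the adapted layer reproduces the pre-trained layer exactly before any fine-tuning, and here it trains 384 parameters instead of the 2048 in `W`.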

Hunt Globally: Deep Research AI Agents for Drug Asset Scouting in Investing, Business Development, and Search & Evaluation

Authors: Alisa Vinogradova, Vlad Vinogradov, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Roman Doronin, Andrey Doronichev

First: 2026-02-16T18:57:49+00:00 · Latest: 2026-02-16T18:57:49+00:00

Abstract

Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests >85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at ~30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. We compare Bioptic Agent against Claude Opus 4.6, OpenAI GPT-5.2 Pro, Perplexity Deep Research, Gemini 3 Pro + Deep Research, and Exa Websets. Bioptic Agent achieves 79.7% F1 versus 56.2% (Claude Opus 4.6), 50.6% (Gemini 3 Pro + Deep Research), 46.6% (GPT-5.2 Pro), 44.2% (Perplexity Deep Research), and 26.9% (Exa Websets). Performance improves steeply with additional compute, supporting the view that more compute yields better results.

中文标题/摘要

标题：全球搜寻：面向投资、业务拓展及搜索与评估的药物资产筛选深度研究AI代理

生物医药创新已发生变化：许多新的药物资产现在起源于美国之外，并主要通过区域性的非英语渠道披露。最近的数据表明，超过85%的专利申请起源于美国之外，其中中国占全球总数的近一半；非美国的学术产出比例也在增加。行业估计显示，中国在全球药物研发中的份额约为30%，涵盖1,200多种新型候选药物。在这种高风险环境中，未能发现“未被注意”的资产会给投资者和业务发展团队带来数十亿美元的风险，使资产筛选成为一场覆盖面至关重要的竞争，速度和完整性决定价值。然而，当前的深度研究AI代理在跨异构、多语言来源实现高召回率且无幻觉的发现方面，仍然落后于人类专家。我们提出了一种药物资产筛选的基准测试方法，并开发了一种调优的树状自学习Bioptic代理，旨在实现完整的、无幻觉的筛选。我们构建了一个具有挑战性的完整性基准，使用多语言多代理管道：复杂的用户查询配以主要在美国中心雷达之外的真实资产。为了反映实际交易的复杂性，我们收集了专家投资者、BD和VC专业人士的筛选查询，并将其作为先验条件生成基准查询。在评分中，我们使用依据专家意见校准的LLM作为裁判。我们将Bioptic代理与Claude Opus 4.6、OpenAI GPT-5.2 Pro、Perplexity Deep Research、Gemini 3 Pro + Deep Research和Exa Websets进行了比较。Bioptic代理的F1得分为79.7%，而Claude Opus 4.6为56.2%，Gemini 3 Pro + Deep Research为50.6%，GPT-5.2 Pro为46.6%，Perplexity Deep Research为44.2%，Exa Websets为26.9%。随着计算资源的增加，性能显著提高，支持了更多计算资源会带来更好结果的观点。

Summary / 总结

The research addresses the challenge of surfacing drug assets that are disclosed primarily through regional, non-English channels. It proposes a benchmarking methodology for drug asset scouting and Bioptic Agent, a tuned, tree-based self-learning agent aimed at complete, non-hallucinated scouting. On the authors' completeness benchmark, graded by an LLM-as-judge calibrated to expert opinions, Bioptic Agent achieves 79.7% F1, compared with 56.2% for the second-best system (Claude Opus 4.6) and 26.9% for the lowest-scoring one (Exa Websets). Performance improves steeply with additional compute.

研究旨在解决从非英语来源识别新药资产的挑战，这对于生物制药创新至关重要。研究引入了一种基准测试方法和一种自我学习的Bioptic代理，旨在提高资产筛选的召回率，同时避免幻觉。Bioptic代理在全面评估中表现出色，F1得分为79.7%，而其他AI代理的得分则在26.9%到56.2%之间。
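
Since the headline numbers are F1 scores, it may help to recall how F1 combines completeness (recall) and freedom from spurious hits (precision). The sketch below uses exact set matching on made-up asset identifiers. The paper's grading relies on an LLM-as-judge, so this shows the metric itself, not the authors' evaluation pipeline.

```python
def f1_score(found, truth):
    """F1 between a set of returned assets and the ground-truth set."""
    found, truth = set(found), set(truth)
    tp = len(found & truth)                 # correctly surfaced assets
    if tp == 0:
        return 0.0
    precision = tp / len(found)             # share of returned assets that are real hits
    recall = tp / len(truth)                # share of true assets that were surfaced
    return 2 * precision * recall / (precision + recall)

# Hypothetical asset identifiers, for illustration only.
truth = {"asset-a", "asset-b", "asset-c", "asset-d"}
found = {"asset-a", "asset-b", "asset-x"}
score = f1_score(found, truth)
```

Here precision is 2/3 and recall is 1/2, giving an F1 of 4/7. An agent that misses assets or invents them is penalized either way, which is why the metric suits a task where both completeness and non-hallucination matter.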

Privileged Information Distillation for Language Models

Authors: Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia

First: 2026-02-04T18:46:17+00:00 · Latest: 2026-02-16T18:57:38+00:00

Comments: Abstract border should have been purple

Abstract

Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, which typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable, but the reasoning process is not. For this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously using the same model. Additionally, we also introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains using Reinforcement Learning (RL) with a reverse KL-penalty between the student and the PI-conditioned teacher. We show that both of these algorithms effectively distill frontier agents using action-only PI. Specifically, we find that π-Distill and, in some cases, OPSD, outperform industry standard practices (Supervised finetuning followed by RL) that assume access to full Chain-of-Thought supervision across multiple agentic benchmarks, models, and forms of PI. We complement our results with extensive analysis that characterizes the factors enabling effective learning with PI, focusing primarily on π-Distill and characterizing when OPSD is competitive.

中文标题/摘要

标题：语言模型中的特权信息提炼

在训练时使用特权信息（PI）可以使语言模型在原本会失败的任务上取得成功，使其成为在困难、长期环境中强化学习的强大工具。然而，将借助PI学到的能力转移到推理时必须在没有PI的情况下行动的策略上，仍然是一个基本挑战。我们在为多轮智能体环境提炼前沿模型的背景下研究这个问题，这些前沿模型通常隐藏其内部推理，仅暴露行动轨迹。这打破了标准的提炼管道，因为成功的行为是可观察的，但推理过程不是。为此，我们引入了π-提炼，这是一种联合教师-学生目标，使用同一模型同时训练一个PI条件教师和一个无条件的学生。此外，我们还引入了在线策略自我提炼（OPSD），这是一种替代方法，使用强化学习（RL）进行训练，并在学生与PI条件教师之间施加反向KL惩罚。我们展示了这两种算法都能有效地使用仅行动的PI提炼前沿代理。具体来说，我们发现，在多个智能体基准、模型和PI形式上，π-提炼（以及某些情况下的OPSD）优于假设可获得完整思维链监督的行业标准实践（监督微调后进行RL）。我们通过广泛的分析补充了我们的结果，这些分析描述了使PI有效学习的因素，主要集中在π-提炼上，并描述了OPSD具有竞争力的情况。

Summary / 总结

The paper addresses the challenge of transferring capabilities learned with training-time privileged information (PI) to policies that must act without it at inference time. It introduces π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously using the same model, and On-Policy Self-Distillation (OPSD), which trains with RL under a reverse KL penalty between the student and the PI-conditioned teacher. Both methods effectively distill frontier agents using action-only PI. π-Distill, and in some cases OPSD, outperforms the industry-standard practice of supervised fine-tuning followed by RL, even though that baseline assumes access to full Chain-of-Thought supervision. The study also analyzes the factors that enable effective learning with PI, focusing primarily on π-Distill and characterizing when OPSD is competitive.

研究旨在将训练时的特权信息（PI）所学到的能力转移到推理时必须不使用PI的策略中，解决强化学习在多轮代理环境中的一项基本挑战。研究引入了π-Distill和On-Policy Self-Distillation（OPSD）方法，表明两者都能有效使用仅动作的PI来蒸馏前沿代理，并在多个基准测试、模型和PI形式上优于行业标准实践。研究结果强调了联合教师-学生目标和反KL惩罚在语言模型蒸馏管道中的重要性。
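
The reverse KL penalty mentioned for OPSD can be written down for categorical next-token distributions. The sketch below assumes the usual convention that "reverse KL" means KL(student || teacher), with the expectation taken under the student. The abstract does not spell out the exact form, so the direction, the averaging over positions, and the array shapes are all assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher), averaged over positions."""
    p_s = softmax(student_logits)
    log_ratio = np.log(p_s) - np.log(softmax(teacher_logits))
    return float((p_s * log_ratio).sum(axis=-1).mean())

rng = np.random.default_rng(0)
student = rng.normal(size=(6, 10))    # hypothetical: 6 positions, 10-token vocabulary
teacher = rng.normal(size=(6, 10))    # stands in for the PI-conditioned pass
penalty = reverse_kl(student, teacher)
```

Because the expectation is under the student, the penalty is largest where the student puts mass on tokens the teacher considers unlikely, which tends to make the student mode-seeking. How this term is weighted against the RL reward is not stated in the abstract and is left out here.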

Simulating the Real World: A Unified Survey of Multimodal Generative Models

Authors: Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong

First: 2025-03-06T17:31:43+00:00 · Latest: 2026-02-16T18:57:17+00:00

Comments: Repository for the related papers at https://github.com/ALEEEHU/World-Simulator

Abstract

Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation that integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions, fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.

中文标题/摘要

标题：模拟现实世界：多模态生成模型的统一综述

理解并复制现实世界是人工智能通用智能（AGI）研究中的关键挑战。为了实现这一目标，许多现有方法，如世界模型，旨在捕捉支配物理世界的根本原理，从而实现更准确的模拟和更有意义的交互。然而，当前的方法通常将2D（图像）、视频、3D和4D表示视为独立领域，忽视了它们之间的相互依赖性。此外，这些方法通常专注于现实的孤立维度，而没有系统地整合它们之间的联系。在这篇综述中，我们提供了一个统一的多模态生成模型综述，探讨了现实世界模拟中数据维度的进展。具体来说，这篇综述从2D生成（外观）开始，然后转向视频（外观+动力学）和3D生成（外观+几何），最后达到整合所有维度的4D生成。据我们所知，这是首次尝试在单一框架内系统地统一研究2D、视频、3D和4D生成。为了指导未来的研究，我们提供了数据集、评估指标和未来方向的全面回顾，为新入门者提供见解。这篇综述为在统一框架内推进多模态生成模型和现实世界模拟的研究架起了桥梁。

Summary / 总结

This survey aims to address the challenge of simulating the real world in AGI research by unifying multimodal generative models. It reviews the progression from 2D image generation to 4D integration, covering datasets, evaluation metrics, and future directions. The survey provides a comprehensive framework for understanding and advancing real-world simulations.

本文综述了在AGI研究中模拟现实世界的挑战，通过统一多模态生成模型来解决。它从2D图像生成到3D和4D表示的进展进行了回顾，强调不同模态的集成。主要发现包括在统一框架内系统研究2D、视频、3D和4D生成，并提供了数据集和评估指标的全面综述，以指导未来的研究。

Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation

Authors: Ruoxi Liu, Philipp Koehn

First: 2026-02-16T18:52:43+00:00 · Latest: 2026-02-16T18:52:43+00:00

Comments: 9 pages, 5 figures, 4 tables

Abstract

This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs). Addressing the scarcity of parallel corpora that map between styles, the study employs round-trip translation to synthesize such parallel datasets from monolingual corpora. This approach creates 'neutralized' text devoid of stylistic attributes, essentially creating a shared input style at training time and inference time. Experimental results demonstrate consistent superiority of this method over zero-shot prompting and few-shot in-context learning (ICL) techniques, measured by BLEU scores and style accuracy scores across four investigated domains. Furthermore, the integration of retrieval-augmented generation (RAG) for terminology and name knowledge enhances robustness and stylistic consistency.

中文标题/摘要

标题：基于参数高效大型语言模型微调和往返翻译的文本风格转换

本文提出了一种基于参数高效大型语言模型（LLM）微调的文本风格转换（TST）新方法。为了解决风格之间的平行语料库稀缺问题，研究利用单语语料库生成此类平行数据集的方法是往返翻译。这种方法创建了去风格化的文本，本质上在训练时间和推理时间创建了共享的输入风格。实验结果表明，该方法在四个研究领域中通过BLEU分数和风格准确度分数衡量的一致性优于零样本提示和少量示例ICL技术。此外，通过检索增强生成（RAG）集成术语和名称知识提高了鲁棒性和风格一致性。

Summary / 总结

This paper introduces a method for Text Style Transfer (TST) using parameter-efficient fine-tuning of Large Language Models (LLMs) and round-trip translation to create parallel datasets from monolingual corpora. The approach synthesizes 'neutralized' text without stylistic attributes, improving style consistency. Experiments show this method outperforms zero-shot prompting and few-shot In-context Learning (ICL) techniques in four domains, as measured by BLEU scores and style accuracy scores. Incorporating retrieval-augmented generation (RAG) further enhances robustness and stylistic consistency.

该论文提出了一种使用参数高效微调大型语言模型和往返翻译合成平行数据集的方法来进行文本风格转换。该方法创建了不含风格属性的‘中性化’文本，能够在四个领域中通过BLEU分数和风格准确性分数实现一致的风格转换，优于零样本提示和少量样本ICL技术。结合检索增强生成提高了鲁棒性和风格一致性。
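
The data synthesis step described in the abstract can be sketched as a small pipeline: each styled sentence is translated to a pivot language and back, and the result is paired with the original as a training example. Everything here is a placeholder. `build_pairs`, `to_pivot`, and `from_pivot` are hypothetical names, and the two lambda "translators" are trivial string edits standing in for a real machine translation system.

```python
def build_pairs(styled_sentences, to_pivot, from_pivot):
    """Pair each styled sentence with its round-trip-translated version."""
    pairs = []
    for target in styled_sentences:
        neutral = from_pivot(to_pivot(target))   # round trip, assumed to wash out style
        pairs.append({"input": neutral, "output": target})
    return pairs

# Stand-in translators: placeholders for a real MT system, hypothetical.
to_pivot = lambda s: s.lower()
from_pivot = lambda s: s.replace("!", ".")

pairs = build_pairs(["Hark! What light breaks?"], to_pivot, from_pivot)
```

A model fine-tuned on such pairs learns to map neutral text to the target style. At inference time the same round trip would be applied to the user's input first, so that training and inference share one input style, as the abstract describes.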

Cold-Start Personalization via Training-Free Priors from Structured World Models

Authors: Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du, Yulia Tsvetkov, Maryam Fazel, Lin Xiao, Asli Celikyilmaz

First: 2026-02-16T18:52:13+00:00 · Latest: 2026-02-16T18:52:13+00:00

Comments: 24 pages, 4 figures, 4 tables

Abstract

Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available. The core challenge is a routing problem: each task admits dozens of preference dimensions, yet individual users care about only a few, and which ones matter depends on who is asking. With a limited question budget, asking without structure will miss the dimensions that matter. Reinforcement learning is the natural formulation, but in multi-turn settings its terminal reward fails to exploit the factored, per-criterion structure of preference data, and in practice learned policies collapse to static question sequences that ignore user responses. We propose decomposing cold-start elicitation into offline structure learning and online Bayesian inference. Pep (Preference Elicitation with Priors) learns a structured world model of preference correlations offline from complete profiles, then performs training-free Bayesian inference online to select informative questions and predict complete preference profiles, including dimensions never asked about. The framework is modular across downstream solvers and requires only simple belief models. Across medical, mathematical, social, and commonsense reasoning, Pep achieves 80.8% alignment between generated responses and users' stated preferences versus 68.5% for RL, with 3-5x fewer interactions. When two users give different answers to the same question, Pep changes its follow-up 39-62% of the time versus 0-28% for RL. It does so with ~10K parameters versus 8B for RL, showing that the bottleneck in cold-start elicitation is the capability to exploit the factored structure of preference data.

中文标题/摘要

标题：基于结构化世界模型无训练先验的冷启动个性化

冷启动个性化要求通过用户交互推断用户偏好，当没有用户特定的历史数据时。核心挑战是一个路由问题：每个任务包含数十个偏好维度，但每个用户只关心其中的几个，而且哪些维度重要取决于提问的对象。在有限的问题预算下，无结构的提问将错过重要的维度。强化学习是自然的表述，但在多轮设置中，其终端奖励未能利用偏好数据按标准分解的事实结构，实践中学习到的策略会退化为静态的问题序列，忽略用户的响应。我们提出将冷启动提取分解为离线结构学习和在线贝叶斯推理。Pep（偏好提取与先验）从完整档案中学习一个结构化的世界模型来描述偏好相关性，然后在线进行无训练的贝叶斯推理，选择信息性问题并预测完整的偏好档案，包括从未提问过的维度。该框架在下游求解器中是模块化的，只需要简单的信念模型。在医学、数学、社会和常识推理中，Pep 在生成的响应与用户声明的偏好之间的对齐度为 80.8%，而强化学习为 68.5%，交互次数少 3-5 倍。当两个用户对同一问题给出不同答案时，Pep 的后续问题改变比例为 39-62%，而强化学习为 0-28%。它仅使用约 10K 参数，而强化学习为 8B，表明冷启动提取的瓶颈在于利用偏好数据分解结构的能力。

Summary / 总结

The paper addresses cold-start personalization by proposing Pep, which decomposes the problem into offline structure learning and online Bayesian inference. Pep learns structured world models of preference correlations from complete profiles and performs training-free Bayesian inference to select informative questions and predict complete preference profiles. The method achieves higher alignment with users' stated preferences and requires fewer interactions compared to reinforcement learning approaches, demonstrating the importance of exploiting the structured nature of preference data.

论文提出Pep方法，通过离线学习结构化的偏好模型并在在线阶段进行无训练的贝叶斯推理来选择有信息量的问题并预测用户的偏好。该方法在用户偏好匹配度（80.8%）方面优于强化学习（68.5%），且交互次数更少，展示了利用偏好数据结构化特性的有效性。
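
The split between an offline-learned correlation structure and training-free online inference can be illustrated with a multivariate Gaussian belief. This is a minimal sketch under stated assumptions, not Pep itself: the abstract only says the framework needs "simple belief models", so the Gaussian form, the variance-reduction question rule, and the covariance values below are all hypothetical.

```python
import numpy as np

def condition(mu, Sigma, idx, value, noise=1e-6):
    """Gaussian posterior after observing dimension idx with the given value."""
    k = Sigma[:, idx] / (Sigma[idx, idx] + noise)
    mu_post = mu + k * (value - mu[idx])
    Sigma_post = Sigma - np.outer(k, Sigma[idx, :])
    return mu_post, Sigma_post

def next_question(Sigma, asked):
    """Pick the unasked dimension whose answer removes the most total variance."""
    gains = (Sigma ** 2).sum(axis=0) / np.diag(Sigma)
    for i in asked:
        gains[i] = -np.inf                 # never repeat a question
    return int(np.argmax(gains))

# Assumed prior over three preference dimensions, as if learned offline from profiles.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.0],
                  [0.3, 0.0, 0.5]])
q = next_question(Sigma, asked=set())
mu_post, Sigma_post = condition(mu, Sigma, q, value=1.0)
```

Asking about dimension 0 is most informative here because it correlates with both other dimensions, and a single answer already shifts the predicted values of the dimensions that were never asked about. Because the next question depends on the updated belief, different answers can lead to different follow-ups, which is the adaptivity the abstract contrasts with static RL question sequences.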

BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames

Authors: Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, Aviral Kumar

First: 2026-02-16T18:49:56+00:00 · Latest: 2026-02-16T18:49:56+00:00

Abstract

Many robot tasks require attending to the history of past observations. For example, finding an item in a room requires remembering which places have already been searched. However, the best-performing robot policies typically condition only on the current observation, limiting their applicability to such tasks. Naively conditioning on past observations often fails due to spurious correlations: policies latch onto incidental features of training histories that do not generalize to out-of-distribution trajectories upon deployment. We analyze why policies latch onto these spurious correlations and find that this problem stems from limited coverage over the space of possible histories during training, which grows exponentially with horizon. Existing regularization techniques provide inconsistent benefits across tasks, as they do not fundamentally address this coverage problem. Motivated by these findings, we propose Big Picture Policies (BPP), an approach that conditions on a minimal set of meaningful keyframes detected by a vision-language model. By projecting diverse rollouts onto a compact set of task-relevant events, BPP substantially reduces distribution shift between training and deployment, without sacrificing expressivity. We evaluate BPP on four challenging real-world manipulation tasks and three simulation tasks, all requiring history conditioning. BPP achieves 70% higher success rates than the best comparison on real-world evaluations.

中文标题/摘要

标题：BPP：通过关注关键历史帧进行长上下文机器人模仿学习

许多机器人任务需要关注过去的观测历史。例如，在房间里寻找物品需要记住哪些地方已经搜索过。然而，表现最好的机器人策略通常只以当前观测为条件，这限制了它们在此类任务中的适用性。简单地以过去的观测为条件往往会因虚假相关性而失败：策略会抓住训练历史中的偶然特征，而这些特征在部署时无法泛化到分布外的轨迹。我们分析了策略为何会抓住这些虚假相关性，发现问题源于训练时对可能历史空间的覆盖有限，而该空间随时间跨度呈指数增长。现有的正则化技术在不同任务上的收益并不一致，因为它们没有从根本上解决这一覆盖问题。受这些发现的启发，我们提出了Big Picture Policies（BPP），该方法以视觉-语言模型检测到的一组最少且有意义的关键帧为条件。通过将多样化的轨迹（rollout）投影到一组紧凑的任务相关事件上，BPP在不牺牲表达能力的前提下显著减少了训练与部署之间的分布偏移。我们在四个具有挑战性的真实世界操作任务和三个仿真任务上评估了BPP，这些任务都需要以历史为条件。在真实世界评估中，BPP的成功率比最佳对比方法高出70%。

Summary / 总结

This paper tackles history-conditioned robot imitation learning, where naively conditioning on past observations fails because policies latch onto spurious correlations in training histories. The authors propose Big Picture Policies (BPP), which condition on a minimal set of meaningful keyframes detected by a vision-language model. This reduces distribution shift between training and deployment without sacrificing expressivity. On real-world evaluations, BPP achieves 70% higher success rates than the best comparison method.

论文针对需要以历史观测为条件的机器人模仿学习问题，指出直接以过去观测为条件的策略常因训练历史中的虚假相关性而失败。提出的Big Picture Policies (BPP)方法以视觉-语言模型检测到的少量关键帧为条件，从而减少训练与部署之间的分布偏移。论文在四个真实世界操作任务和三个仿真任务上进行了评估；在真实世界评估中，BPP的成功率比最佳对比方法高出70%。

Distributed Quantum Gaussian Processes for Multi-Agent Systems

Authors: Meet Gandhi, George P. Kontoudis

Venue: 2026 International Conference on Autonomous Agents and Multiagent Systems

First: 2026-02-16T18:46:23+00:00 · Latest: 2026-02-16T18:46:23+00:00

Comments: 9 pages, 4 figures, accepted at AAMAS 2026 (International Conference on Autonomous Agents and Multiagent Systems)

Abs · PDF · Code1 · Code2

Abstract

Gaussian Processes (GPs) are a powerful tool for probabilistic modeling, but their performance is often constrained in complex, large-scale real-world domains due to the limited expressivity of classical kernels. Quantum computing offers the potential to overcome this limitation by embedding data into exponentially large Hilbert spaces, capturing complex correlations that remain inaccessible to classical computing approaches. In this paper, we propose a Distributed Quantum Gaussian Process (DQGP) method in a multi-agent setting to enhance modeling capabilities and scalability. To address the challenging non-Euclidean optimization problem, we develop a Distributed consensus Riemannian Alternating Direction Method of Multipliers (DR-ADMM) algorithm that aggregates local agent models into a global model. We evaluate the efficacy of our method through numerical experiments conducted on a quantum simulator running on classical hardware. We use real-world, non-stationary elevation datasets from NASA's Shuttle Radar Topography Mission and synthetic datasets generated by Quantum Gaussian Processes. Beyond modeling advantages, our framework highlights potential computational speedups that quantum hardware may provide, particularly in Gaussian processes and distributed optimization.

中文标题/摘要

标题：分布式量子高斯过程在多智能体系统中的应用

高斯过程（GPs）是一种强大的概率建模工具，但由于经典核的表达能力有限，其性能在复杂的大规模现实世界领域中常常受到限制。量子计算通过将数据嵌入指数级大的希尔伯特空间，有可能克服这一限制，捕捉经典计算方法无法触及的复杂相关性。在本文中，我们提出了一种多智能体场景下的分布式量子高斯过程（DQGP）方法，以增强建模能力和可扩展性。为了解决具有挑战性的非欧几里得优化问题，我们开发了一种分布式共识黎曼交替方向乘子算法（DR-ADMM），将局部智能体模型聚合为全局模型。我们通过在经典硬件上运行的量子模拟器进行数值实验，评估了方法的有效性。我们使用了NASA航天飞机雷达地形测绘任务的真实非平稳高程数据集和由量子高斯过程生成的合成数据集。除了建模优势，我们的框架还突显了量子硬件可能提供的潜在计算加速，特别是在高斯过程和分布式优化方面。

Summary / 总结

This paper proposes a Distributed Quantum Gaussian Process (DQGP) method to enhance modeling capabilities and scalability in multi-agent systems. To handle the resulting non-Euclidean optimization problem, it develops a Distributed consensus Riemannian ADMM (DR-ADMM) algorithm that aggregates local agent models into a global model. The method is evaluated in numerical experiments on a quantum simulator running on classical hardware, using real-world elevation data and synthetic datasets. The authors also point to computational speedups that quantum hardware may provide for Gaussian processes and distributed optimization.

论文提出了一种分布式量子高斯过程(DQGP)方法，以增强多智能体系统中的建模能力和可扩展性。它使用分布式共识黎曼ADMM（DR-ADMM）算法来解决非欧几里得优化问题，并将局部智能体模型聚合为全局模型。该方法在经典硬件上的量子模拟器中，使用真实高程数据集和合成数据集进行了数值实验评估；作者同时指出量子硬件可能为高斯过程和分布式优化带来计算加速。

Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation

Authors: Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao

First: 2026-02-16T18:45:40+00:00 · Latest: 2026-02-16T18:45:40+00:00

Abs · PDF · Code1 · Code2

Abstract

News recommendation plays a critical role in online news platforms by helping users discover relevant content. Cross-domain news recommendation further requires inferring user's underlying information needs from heterogeneous signals that often extend beyond direct news consumption. A key challenge lies in moving beyond surface-level behaviors to capture deeper, reusable user interests while maintaining scalability in large-scale production systems. In this paper, we present a reinforcement learning framework that trains large language models to generate high-quality lists of interest-driven news search queries from cross-domain user signals. We formulate query-list generation as a policy optimization problem and employ GRPO with multiple reward signals. We systematically study two compute dimensions: inference-time sampling and model capacity, and empirically observe consistent improvements with increased compute that exhibit scaling-like behavior. Finally, we perform on-policy distillation to transfer the learned policy from a large, compute-intensive teacher to a compact student model suitable for scalable deployment. Extensive offline experiments, ablation studies and large-scale online A/B tests in a production news recommendation system demonstrate consistent gains in both interest modeling quality and downstream recommendation performance.

中文标题/摘要

标题：通过推理和提炼学习用户兴趣以实现跨域新闻推荐

新闻推荐在在线新闻平台中起着关键作用，帮助用户发现相关内容。跨域新闻推荐进一步要求从异构信号中推断用户的潜在信息需求，这些信号往往超出直接的新闻消费行为。一个关键挑战在于超越表层行为，捕捉更深层、可复用的用户兴趣，同时在大规模生产系统中保持可扩展性。在本文中，我们提出了一种强化学习框架，训练大语言模型从跨域用户信号中生成高质量的兴趣驱动新闻搜索查询列表。我们将查询列表生成形式化为策略优化问题，并采用带有多个奖励信号的GRPO。我们系统地研究了两个计算维度：推理时采样和模型容量，并在实验中观察到性能随计算量增加而持续提升，表现出类似规模化（scaling）的行为。最后，我们进行同策略（on-policy）蒸馏，将学到的策略从庞大且计算密集的教师模型迁移到适合规模化部署的紧凑学生模型。大量离线实验、消融研究以及在生产新闻推荐系统中的大规模在线A/B测试表明，兴趣建模质量和下游推荐性能均获得了一致的提升。

Summary / 总结

This paper addresses the challenge of cross-domain news recommendation by developing a reinforcement learning framework that uses large language models to generate high-quality lists of interest-driven news search queries from diverse user signals. The method formulates query-list generation as a policy optimization problem and employs GRPO with multiple reward signals. The study shows consistent improvements with increased compute, and on-policy distillation is used to transfer the learned policy from a large, compute-intensive model to a compact one for scalable deployment. Experimental results from offline and online tests confirm the framework's effectiveness in enhancing interest modeling and recommendation performance.

研究旨在通过从多种信号中推断用户的深层次兴趣来改进跨域新闻推荐。方法是使用强化学习框架和大型语言模型生成高质量的兴趣驱动新闻搜索查询列表。主要发现包括随着计算资源的增加，兴趣建模质量和推荐性能的一致提升，以及成功将大型教师模型精简为适合大规模部署的小型学生模型。
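
The abstract above mentions GRPO with multiple reward signals but does not spell out the computation. The following is a minimal sketch of the group-relative advantage step that GRPO is generally known for, assuming the reward signals are combined by a weighted sum. The signal names (`relevance`, `diversity`), the weights, and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def combine_rewards(signals, weights):
    """Weighted sum of reward signals for one generated query list.

    The weighted-sum combination is an assumption made for illustration.
    """
    return sum(weights[name] * value for name, value in signals.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward signals for four query lists sampled for the same user.
group = [
    {"relevance": 0.9, "diversity": 0.4},
    {"relevance": 0.5, "diversity": 0.8},
    {"relevance": 0.2, "diversity": 0.3},
    {"relevance": 0.7, "diversity": 0.6},
]
weights = {"relevance": 0.7, "diversity": 0.3}  # hypothetical weights

rewards = [combine_rewards(signals, weights) for signals in group]
advantages = group_relative_advantages(rewards)
```

Samples that score above their group's mean get a positive advantage and are reinforced; samples below the mean are pushed down. No separate value network is needed, since the group itself acts as the baseline.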

PDE foundation models are skillful AI weather emulators for the Martian atmosphere

Authors: Johannes Schmude, Sujit Roy, Liping Wang, Theodore van Kessel, Levente Klein, Marcus Freitag, Eloisa Bentivegna, Robert Manson-Sawko, Bjorn Lutjens, Manil Maskey, Campbell Watson, Rahul Ramachandran, Juan Bernabe-Moreno

First: 2026-02-16T18:44:46+00:00 · Latest: 2026-02-16T18:44:46+00:00

Abs · PDF · Code1 · Code2

Abstract

We show that AI foundation models that are pretrained on numerical solutions to a diverse corpus of partial differential equations can be adapted and fine-tuned to obtain skillful predictive weather emulators for the Martian atmosphere. We base our work on the Poseidon PDE foundation model for two-dimensional systems. We develop a method to extend Poseidon from two to three dimensions while keeping the pretraining information. Moreover, we investigate the performance of the model in the presence of sparse initial conditions. Our results make use of four Martian years (approx. 34 GB) of training data and a median compute budget of 13 GPU hours. We find that the combination of pretraining and model extension yields a performance increase of 34.4% on a held-out year. This shows that PDEs-FMs can not only approximate solutions to (other) PDEs but also anchor models for real-world problems with complex interactions that lack a sufficient amount of training data or a suitable compute budget.

中文标题/摘要

标题：PDE基础模型是出色的火星大气AI天气模拟器

我们展示了，在多种偏微分方程数值解语料上预训练的AI基础模型，经过适配和微调后，可以成为出色的火星大气预测性天气模拟器。我们的工作基于面向二维系统的Poseidon PDE基础模型。我们开发了一种方法，将Poseidon从二维扩展到三维，同时保留预训练信息。此外，我们研究了模型在初始条件稀疏情况下的性能。我们的结果使用了四个火星年（约34 GB）的训练数据，中位计算预算为13个GPU小时。我们发现，预训练与模型扩展相结合，在留出的年份上带来了34.4%的性能提升。这表明PDE基础模型不仅可以近似（其他）PDE的解，还可以作为锚点，支撑那些相互作用复杂、缺乏足够训练数据或合适计算预算的现实世界问题的模型。

Summary / 总结

The study demonstrates that AI foundation models pretrained on partial differential equations can be adapted to create skillful weather emulators for the Martian atmosphere. By extending the Poseidon model from two to three dimensions and using four Martian years of training data, the model achieved a 34.4% performance increase on a held-out year, indicating its capability to handle complex real-world problems with limited data and computational resources.

研究旨在利用预训练的PDE基础模型为火星大气开发AI天气模拟器。方法包括将Poseidon模型从二维扩展到三维，并考察模型在稀疏初始条件下的表现。模型使用四个火星年的数据进行训练，中位计算预算为13个GPU小时。关键发现是，在留出的一年上性能提高了34.4%，展示了该方法在数据和计算资源有限的情况下处理复杂相互作用问题的能力。

Boundary Point Jailbreaking of Black-Box LLMs

Authors: Xander Davies, Giorgi Giglemiani, Edmund Lau, Eric Winsor, Geoffrey Irving, Yarin Gal

First: 2026-02-16T18:29:09+00:00 · Latest: 2026-02-16T18:29:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as "jailbreaks". Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength ("boundary points"). We believe BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers, as well as the first automated attack algorithm that succeeds against GPT-5's input classifier without relying on human attack seeds. BPJ is difficult to defend against in individual interactions but incurs many flags during optimisation, suggesting that effective defence requires supplementing single-interaction methods with batch-level monitoring.

中文标题/摘要

标题：针对黑盒LLM的边界点越狱

前沿LLM设有防护措施，以抵御通过对抗性提示（即“越狱”）提取有害信息的企图。最近，防御方开发了基于分类器的系统，这些系统经受住了数千小时的人工红队测试。我们提出了边界点越狱（BPJ），这是一类新的自动化越狱攻击，能够规避业界已部署的最强防护措施。与依赖白盒/灰盒假设（如分类器分数或梯度）或现有越狱库的先前攻击不同，BPJ是完全黑盒的，每次查询仅使用一比特信息：分类器是否标记了该交互。为此，BPJ解决了针对稳健的现实世界防御优化攻击时的核心难题：评估对攻击的某项修改是否带来了改进。BPJ不直接尝试针对目标有害字符串学习攻击，而是将该字符串转换为由中间攻击目标组成的课程（curriculum），然后主动选择最能检测攻击强度微小变化的评估点（“边界点”）。我们认为，BPJ是第一个成功针对Constitutional Classifiers开发出通用越狱的全自动攻击算法，也是第一个不依赖人工攻击种子就成功攻破GPT-5输入分类器的自动化攻击算法。BPJ在单次交互中难以防御，但在优化过程中会触发大量标记，这表明有效的防御需要在单次交互方法之外辅以批量级监控。

Summary / 总结

The paper introduces Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box information or libraries of existing jailbreaks, BPJ is fully black-box and uses a single bit of information per query: whether or not the classifier flags the interaction. BPJ converts the target harmful string into a curriculum of intermediate attack targets and actively selects evaluation points ("boundary points") that best detect small changes in attack strength. The authors believe it is the first fully automated attack algorithm to develop universal jailbreaks against Constitutional Classifiers, and the first automated attack to succeed against GPT-5's input classifier without human attack seeds. While BPJ is difficult to defend against in individual interactions, it incurs many flags during optimization, suggesting that effective defense requires supplementing single-interaction methods with batch-level monitoring.

论文提出了一类新的自动化越狱攻击：边界点越狱（BPJ），它能够规避业界已部署的最强防护措施。与依赖白盒/灰盒信息或现有越狱库的攻击不同，BPJ 是完全黑盒的，每次查询仅使用一比特信息，即分类器是否标记了该交互。BPJ 将目标有害字符串转换为一系列中间攻击目标，并主动选择最能检测攻击强度微小变化的评估点（“边界点”）。作者认为，BPJ 是第一个成功针对 Constitutional Classifiers 开发出通用越狱的全自动攻击算法，也是第一个无需人工攻击种子就成功攻破 GPT-5 输入分类器的自动化攻击算法。研究表明，虽然 BPJ 在单次交互中难以防御，但在优化过程中会触发大量标记，因此有效的防御需要在单次交互方法之外辅以批量级监控。

Robust Generalization with Adaptive Optimal Transport Priors for Decision-Focused Learning

Authors: Haixiang Sun, Andrew L. Liu

Venue: The 29th International Conference on Artificial Intelligence and Statistics (AISTATS), 2026

First: 2026-02-01T20:22:41+00:00 · Latest: 2026-02-16T18:27:27+00:00

Abs · PDF · Code1 · Code2

Abstract

Few-shot learning requires models to generalize under limited supervision while remaining robust to distribution shifts. Existing Sinkhorn Distributionally Robust Optimization (DRO) methods provide theoretical guarantees but rely on a fixed reference distribution, which limits their adaptability. We propose a Prototype-Guided Distributionally Robust Optimization (PG-DRO) framework that learns class-adaptive priors from abundant base data via hierarchical optimal transport and embeds them into the Sinkhorn DRO formulation. This design enables few-shot information to be organically integrated into producing class-specific robust decisions that are both theoretically grounded and efficient, and further aligns the uncertainty set with transferable structural knowledge. Experiments show that PG-DRO achieves stronger robust generalization in few-shot scenarios, outperforming both standard learners and DRO baselines.

中文标题/摘要

标题：面向决策聚焦学习的自适应最优传输先验鲁棒泛化

少样本学习要求模型在有限监督下泛化，同时保持对分布偏移的鲁棒性。现有的Sinkhorn分布鲁棒优化(DRO)方法提供了理论保证，但依赖于固定的参考分布，这限制了它们的适应性。我们提出了一种原型引导的分布鲁棒优化(PG-DRO)框架，通过分层最优传输从丰富的基础数据中学习类自适应先验，并将其嵌入到Sinkhorn DRO公式中。这种设计使少样本信息能够有机地整合到生成类特定的鲁棒决策中，这些决策既具有理论依据又高效，并进一步使不确定性集与可转移的结构知识相一致。实验表明，PG-DRO在少样本场景中实现了更强的鲁棒泛化，优于标准学习器和DRO基线。

Summary / 总结

The paper addresses the challenge of few-shot learning by proposing a Prototype-Guided Distributionally Robust Optimization (PG-DRO) framework. This framework learns class-adaptive priors from base data using hierarchical optimal transport and integrates them into the Sinkhorn DRO formulation. Experimental results demonstrate that PG-DRO outperforms standard learners and DRO baselines in achieving stronger robust generalization in few-shot scenarios.

论文提出了一种原型引导的分布鲁棒优化（PG-DRO）框架，以解决少样本学习中的鲁棒泛化问题。该框架通过层次最优传输从基础数据中学习类自适应先验，并将其整合到Sinkhorn DRO公式中。实验结果表明，PG-DRO在少样本场景中表现出更强的鲁棒泛化能力，优于标准学习器和DRO基线。

On the Semantics of Primary Cause in Hybrid Dynamic Domains

Authors: Shakil M. Khan, Asim Mehmood, Sandra Zilles

First: 2026-02-16T18:25:08+00:00 · Latest: 2026-02-16T18:25:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Reasoning about actual causes of observed effects is fundamental to the study of rationality. This important problem has been studied since the time of Aristotle, with formal mathematical accounts emerging recently. We live in a world where change due to actions can be both discrete and continuous, that is, hybrid. Yet, despite extensive research on actual causation, only a few recent studies have looked into causation with continuous change. Building on recent progress, in this paper we propose two definitions of primary cause in a hybrid action-theoretic framework, namely the hybrid temporal situation calculus. One of these is foundational in nature while the other formalizes causation through contributions, which can then be verified from a counterfactual perspective using a modified "but-for" test. We prove that these two definitions are indeed equivalent. We then show that our definitions of causation have some intuitively justifiable properties.

中文标题/摘要

标题：混合动态领域中首要原因的语义研究

对观测到的结果的实际原因进行推理，是理性研究的基础。这一重要问题自亚里士多德时代起就受到研究，而形式化的数学刻画直到近期才出现。我们所处的世界中，由动作引起的变化既可以是离散的，也可以是连续的，即混合的。然而，尽管关于实际因果的研究十分广泛，只有少数近期工作考察了连续变化下的因果关系。基于最新进展，本文在一种混合动作理论框架（即混合时态情境演算）中提出了首要原因的两种定义。其中一种具有基础性质，另一种则通过贡献来形式化因果关系，并可借助修改后的“若非”（but-for）检验从反事实角度加以验证。我们证明了这两种定义确实等价，并进一步表明我们的因果定义具有一些在直觉上合理的性质。

Summary / 总结

This paper addresses the problem of identifying actual causes in hybrid dynamic domains where changes can be both discrete and continuous. It proposes two definitions of primary cause within a hybrid temporal situation calculus framework: one foundational and the other based on contributions. The authors prove the equivalence of these definitions and demonstrate that they possess intuitively justifiable properties. The work contributes to the formal understanding of causation in complex systems.

论文探讨了在既包含离散变化也包含连续变化的混合动态领域中识别实际原因的问题。它在混合时态情境演算框架下提出了首要原因的两种定义：一种是基础性的，另一种基于贡献，并可通过修改后的“若非”（but-for）检验从反事实角度验证。作者证明了这两种定义是等价的，并展示了它们具有直观合理的性质。

ThermEval: A Structured Benchmark for Evaluation of Vision-Language Models on Thermal Imagery

Authors: Ayush Shrivastava, Kirtan Gangani, Laksh Jain, Mayank Goel, Nipun Batra

First: 2026-02-16T18:16:19+00:00 · Latest: 2026-02-16T18:16:19+00:00

Comments: 8 Pages with 2 figures of main content. 2 pages of References. 10 pages of appendix with 6 figures

Abs · PDF · Code1 · Code2

Abstract

Vision language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate. We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments. Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning. These results demonstrate that thermal understanding requires dedicated evaluation beyond RGB-centric assumptions, positioning ThermEval as a benchmark to drive progress in thermal vision language modeling.

中文标题/摘要

标题：ThermEval：热成像视觉语言模型评估基准

视觉语言模型（VLMs）在RGB图像上表现出色，但在热图像上却无法泛化。热成像在可见光失效的环境中起着关键作用，包括夜间监视、搜索与救援、自动驾驶和医学筛查。与RGB图像不同，热图像编码的是物理温度而非颜色或纹理，这需要感知和推理能力，而现有的以RGB为中心的基准测试并未对其进行评估。我们引入了ThermEval-B，这是一个包含约55,000个热视觉问答对的结构化基准，旨在评估热视觉语言理解所需的底层能力。ThermEval-B将公共数据集与我们新收集的ThermEval-D结合在一起，ThermEval-D是首个提供密集的逐像素温度图和语义人体部位注释的数据集，适用于多种室内外环境。评估25个开源和闭源VLMs，我们发现模型在温度相关的推理上表现一致不佳，在颜色映射变换下性能下降，并倾向于依赖语言先验或固定响应，仅通过提示或监督微调获得微小改进。这些结果表明，热理解需要超越RGB中心假设的专门评估，将ThermEval定位为推动热视觉语言建模进展的基准。

Summary / 总结

ThermEval is a structured benchmark for evaluating vision-language models on thermal imagery, addressing their failure to generalize from RGB to thermal images. Its ThermEval-B component contains approximately 55,000 thermal visual question answering pairs and integrates public datasets with the newly collected ThermEval-D, which provides dense per-pixel temperature maps with semantic body-part annotations. Evaluations of 25 open-source and closed-source models show consistent failure at temperature-grounded reasoning, degradation under colormap transformations, and reliance on language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning.

研究旨在评估视觉-语言模型（VLMs）在热成像上的表现，这对于夜间监视和医疗筛查等应用至关重要。研究引入了ThermEval-B，这是一个包含55,000个热视觉问答对的基准，旨在评估基础的热感知和推理能力。关键发现表明，VLMs在温度相关的推理上表现不佳，在颜色映射变换下性能下降，并依赖于语言先验或固定响应，表明需要超越RGB中心假设的专业评估。

Orthogonalized Multimodal Contrastive Learning with Asymmetric Masking for Structured Representations

Authors: Carolin Cissee, Raneen Younis, Zahra Ahmadi

First: 2026-02-16T18:06:53+00:00 · Latest: 2026-02-16T18:06:53+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal learning seeks to integrate information from heterogeneous sources, where signals may be shared across modalities, specific to individual modalities, or emerge only through their interaction. While self-supervised multimodal contrastive learning has achieved remarkable progress, most existing methods predominantly capture redundant cross-modal signals, often neglecting modality-specific (unique) and interaction-driven (synergistic) information. Recent extensions broaden this perspective, yet they either fail to explicitly model synergistic interactions or learn different information components in an entangled manner, leading to incomplete representations and potential information leakage. We introduce COrAL, a principled framework that explicitly and simultaneously preserves redundant, unique, and synergistic information within multimodal representations. COrAL employs a dual-path architecture with orthogonality constraints to disentangle shared and modality-specific features, ensuring a clean separation of information components. To promote synergy modeling, we introduce asymmetric masking with complementary view-specific patterns, compelling the model to infer cross-modal dependencies rather than rely solely on redundant cues. Extensive experiments on synthetic benchmarks and diverse MultiBench datasets demonstrate that COrAL consistently matches or outperforms state-of-the-art methods while exhibiting low performance variance across runs. These results indicate that explicitly modeling the full spectrum of multimodal information yields more stable, reliable, and comprehensive embeddings.

中文标题/摘要

标题：正交化多模态对比学习与非对称掩码用于结构化表示

多模态学习旨在整合来自异构来源的信息，其中信号可能在模态之间共享、特定于个别模态或仅通过它们的交互而出现。虽然自监督多模态对比学习取得了显著进展，但大多数现有方法主要捕捉冗余的跨模态信号，往往忽略了特定于模态的（独特）和交互驱动的（协同）信息。最近的扩展拓宽了这一视角，但它们要么未能明确建模协同交互，要么以纠缠的方式学习不同的信息组件，导致不完整的表示和潜在的信息泄露。我们引入了COrAL，这是一种有原则的框架，明确且同时保留了多模态表示中的冗余、独特和协同信息。COrAL 使用具有正交约束的双路径架构来分离共享和特定于模态的特征，确保信息组件的清晰分离。为了促进协同建模，我们引入了非对称掩码，具有互补的视图特定模式，促使模型推断跨模态依赖关系，而不是仅依赖冗余线索。在合成基准和多样化的MultiBench数据集上的广泛实验表明，COrAL 始终与最先进的方法相当或更优，同时在多次运行间表现出较低的性能方差。这些结果表明，明确建模多模态信息的完整谱系会产生更稳定、可靠和全面的嵌入。

Summary / 总结

The paper introduces COrAL, a framework designed to capture redundant, unique, and synergistic information in multimodal representations. COrAL uses a dual-path architecture with orthogonality constraints to disentangle shared and modality-specific features, and asymmetric masking to promote cross-modal dependency inference. Experiments show that COrAL outperforms or matches state-of-the-art methods with low performance variance.

研究旨在通过同时保留冗余、独特和协同信息来改进多模态表示学习。COrAL 使用具有正交约束的双路径架构来分离共享特征和模态特定特征，并引入非对称掩码以促使模型推断跨模态依赖关系。实验表明，COrAL 的表现始终与最先进的方法相当或更优，且不同运行之间的性能波动较小。
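
The COrAL abstract names two ingredients, orthogonality constraints between shared and modality-specific features and asymmetric masking with complementary patterns, without giving formulas. The sketch below shows one plausible reading of each, not the authors' implementation: a penalty that is zero when paired embeddings are orthogonal, and a pair of masks where one view keeps exactly what the other hides. The function names and the `keep_ratio` parameter are hypothetical.

```python
import random


def orthogonality_penalty(shared, specific):
    """Mean squared dot product between paired shared/specific embeddings.

    The penalty is 0.0 when every pair is orthogonal.
    """
    total = 0.0
    for s, u in zip(shared, specific):
        dot = sum(a * b for a, b in zip(s, u))
        total += dot * dot
    return total / len(shared)


def complementary_masks(length, keep_ratio, rng):
    """Build two masks over the same positions: one view keeps what the other hides.

    A keep_ratio other than 0.5 makes the two views asymmetric (an assumption).
    """
    kept = set(rng.sample(range(length), int(length * keep_ratio)))
    mask_a = [i in kept for i in range(length)]
    mask_b = [not m for m in mask_a]
    return mask_a, mask_b


rng = random.Random(0)
mask_a, mask_b = complementary_masks(length=8, keep_ratio=0.25, rng=rng)
penalty_orthogonal = orthogonality_penalty([[1.0, 0.0]], [[0.0, 1.0]])
penalty_aligned = orthogonality_penalty([[1.0, 0.0]], [[1.0, 0.0]])
```

Because neither view sees the positions the other view keeps, a model trained on such pairs cannot solve the task from redundant cues alone, which matches the motivation stated in the abstract.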

Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

Authors: David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken-ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, Ioannis Panageas, Dimitris Paparas, Benjamin Przybocki, Bernardo Subercaseaux, Ola Svensson, Shayan Taherijam, Xuan Wu, Eylon Yogev, Morteza Zadimoghaddam, Samson Zhou, Yossi Matias, James Manyika, Vahab Mirrokni

First: 2026-02-03T18:56:17+00:00 · Latest: 2026-02-16T18:02:06+00:00

Comments: Author list now includes Yossi Matias and James Manyika. Acknowledgements also updated. Added more general discussion to sections 1, 9.1, and 9.5. Discussed related work of Gurvits in section 4.3. Clarified closed form in section 6.1 and gave finite sum expansions for coefficients. Other minor formatting fixes

Abs · PDF · Code1 · Code2

Abstract

Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google's Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a "neuro-symbolic" loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.

中文标题/摘要

标题：使用Gemini加速科学研究：案例研究与常用技术

大型语言模型（LLMs）的最新进展为加速科学研究开辟了新途径。尽管模型在协助处理常规任务方面的能力越来越强，但它们为新颖的、专家级的数学发现做出贡献的能力尚不明确。我们展示了一组案例研究，说明研究人员如何成功地与先进的AI模型（具体为Google基于Gemini的模型，特别是Gemini Deep Think及其高级变体）合作，在理论计算机科学的多个领域以及经济学、优化和物理学等其他领域解决开放问题、反驳猜想并生成新的证明。基于这些经验，我们提炼了理论研究中有效的人机协作的常用技术，如迭代细化、问题分解和跨学科知识迁移。虽然我们的大部分结果来自这种互动式、对话式的方法，但我们也强调了一些超越标准聊天界面的具体实例。这些实例包括将模型部署为严格的对抗性审稿人，以检测现有证明中的细微缺陷，以及将其嵌入“神经符号”循环中，由该循环自主编写并执行代码来验证复杂的推导。这些例子共同突显了AI的潜力：它不仅是自动化工具，更是科学发现这一创造性过程中多面且真正的合作伙伴。

Summary / 总结

This paper explores how advanced AI models, particularly Google's Gemini-based systems, can assist in scientific research, especially in theoretical computer science and related fields. The authors present case studies where researchers collaborated with these models to solve open problems, refute conjectures, and generate new proofs. Key techniques for effective human-AI collaboration include iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. The paper also highlights specific instances where the models were used as adversarial reviewers and embedded in a neuro-symbolic loop to verify complex derivations, demonstrating AI's potential as a versatile partner in scientific discovery.

本文探讨了先进AI模型，尤其是Google基于Gemini的模型，如何在理论计算机科学及相关领域中辅助科学研究。作者通过案例研究展示了研究人员如何与这些模型合作解决开放问题、反驳猜想并生成新证明。有效的人机协作技术包括迭代细化、问题分解和跨学科知识迁移。论文还展示了将模型用作对抗性审稿人，以及将其嵌入神经符号循环以验证复杂推导的具体实例，表明了AI作为科学发现过程中多面且真正的合作伙伴的潜力。

Evolution Strategies at the Hyperscale

Authors: Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Clarisse Wibault, Dmitry Samsonov, Dylan Cope, Jarek Liesen, Kang Li, Lukas Seier, Theo Wolf, Uljad Berdica, Valentin Mohl, Alexander David Goldie, Aaron Courville, Karin Sevegnani, Shimon Whiteson, Jakob Nicolaus Foerster

First: 2025-11-20T18:56:05+00:00 · Latest: 2026-02-16T18:01:18+00:00

Comments: 76 pages, 15 figures, Website at https://eshyperscale.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Evolution Strategies (ES) is a class of powerful black-box optimisation methods that are highly parallelisable and can handle non-differentiable and noisy objectives. However, naïve ES becomes prohibitively expensive at scale on GPUs due to the low arithmetic intensity of batched matrix multiplications with unstructured random perturbations. We introduce Evolution Guided GeneRal Optimisation via Low-rank Learning (EGGROLL), which improves arithmetic intensity by structuring individual perturbations as rank-$r$ matrices, resulting in a hundredfold increase in training speed for billion-parameter models at large population sizes, achieving up to 91% of the throughput of pure batch inference. We provide a rigorous theoretical analysis of Gaussian ES for high-dimensional parameter objectives, investigating conditions needed for ES updates to converge in high dimensions. Our results reveal a linearising effect and prove consistency between EGGROLL and ES as parameter dimension increases. Our experiments show that EGGROLL: (1) enables the stable pretraining of nonlinear recurrent language models that operate purely in integer datatypes, (2) is competitive with GRPO for post-training LLMs on reasoning tasks, and (3) does not compromise performance compared to ES in tabula rasa RL settings, despite being faster.

中文标题/摘要

标题：超大规模下的进化策略

进化策略（ES）是一类强大的黑盒优化方法，高度可并行，并能处理不可微和带噪声的目标。然而，由于使用无结构随机扰动的批量矩阵乘法算术强度较低，朴素的ES在GPU上大规模运行时代价过高。我们提出了Evolution Guided GeneRal Optimisation via Low-rank Learning（EGGROLL），它将每个个体的扰动构造为秩-$r$矩阵以提高算术强度，使十亿参数模型在大种群规模下的训练速度提高了百倍，最高可达纯批量推理吞吐量的91%。我们对高维参数目标下的高斯ES进行了严格的理论分析，研究了ES更新在高维下收敛所需的条件。我们的结果揭示了一种线性化效应，并证明了随着参数维度增加，EGGROLL与ES之间具有一致性。我们的实验表明EGGROLL：（1）能够稳定地预训练完全以整数数据类型运行的非线性循环语言模型；（2）在推理任务的LLM后训练中与GRPO具有竞争力；（3）在从零开始（tabula rasa）的RL设置中，尽管速度更快，性能并不逊于ES。

Summary / 总结

This paper addresses the cost of naïve Evolution Strategies (ES) at scale on GPUs by introducing EGGROLL, which structures individual perturbations as rank-$r$ matrices to improve arithmetic intensity. For billion-parameter models at large population sizes, this yields a hundredfold increase in training speed, reaching up to 91% of the throughput of pure batch inference. Experiments show that EGGROLL enables stable pretraining of nonlinear recurrent language models that operate purely in integer datatypes, is competitive with GRPO for post-training LLMs on reasoning tasks, and matches ES performance in tabula rasa RL settings while being faster.

论文提出EGGROLL，将单个扰动构造为秩-$r$矩阵以提高算术强度，从而解决ES在GPU上的可扩展性问题。理论分析证明，随着参数维度增加，EGGROLL与ES之间具有一致性。实验结果显示，EGGROLL能够稳定预训练非线性循环语言模型，在LLM后训练的推理任务上与GRPO具有竞争力，并且在从零开始（tabula rasa）的RL设置中保持性能，同时速度更快。
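
The EGGROLL abstract says only that perturbations are structured as rank-$r$ matrices. The sketch below illustrates why that structure helps, assuming a perturbation of the form E = A Bᵀ / √r (the scaling is an assumption): the perturbed layer output x(W + σE) can be computed as xW + (σ/√r)(xA)Bᵀ without ever materializing E. The helper names and toy sizes are hypothetical, and plain Python lists stand in for GPU tensors.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def gaussian(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def dense_forward(x, w, a, b, sigma):
    """Reference path: build the full perturbation E = A B^T / sqrt(r), then multiply."""
    scale = sigma / len(a[0]) ** 0.5
    e = matmul(a, transpose(b))  # shape (d_in, d_out)
    w_pert = [[wv + scale * ev for wv, ev in zip(wr, er)] for wr, er in zip(w, e)]
    return matmul(x, w_pert)


def low_rank_forward(x, w, a, b, sigma):
    """Same output without forming E: x W + (sigma / sqrt(r)) * (x A) B^T."""
    scale = sigma / len(a[0]) ** 0.5
    base = matmul(x, w)
    correction = matmul(matmul(x, a), transpose(b))  # only rank-r intermediates
    return [[bv + scale * cv for bv, cv in zip(br, cr)] for br, cr in zip(base, correction)]


rng = random.Random(0)
n, d_in, d_out, r = 2, 4, 3, 2  # hypothetical toy sizes
x = gaussian(n, d_in, rng)
w = gaussian(d_in, d_out, rng)
a = gaussian(d_in, r, rng)   # left factor of one population member's perturbation
b = gaussian(d_out, r, rng)  # right factor
dense = dense_forward(x, w, a, b, sigma=0.1)
fast = low_rank_forward(x, w, a, b, sigma=0.1)
```

In the low-rank path the base product xW can be shared across the population, and each member adds only two thin matrix products. This is the kind of saving the abstract attributes to improved arithmetic intensity.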

Robust Multi-Objective Controlled Decoding of Large Language Models

Authors: Seongho Son, William Bankes, Sangwoong Yoon, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

Venue: ICLR 2026

First: 2025-03-11T18:15:26+00:00 · Latest: 2026-02-16T17:58:26+00:00

Comments: Accepted to ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

We introduce Robust Multi-Objective Decoding (RMOD), a novel inference-time algorithm that robustly aligns Large Language Models (LLMs) to multiple human objectives (e.g., instruction-following, helpfulness, safety) by maximizing the worst-case rewards. RMOD formulates the robust decoding problem as a maximin two-player game between adversarially computed reward weights and the sampling policy, solvable through a Nash equilibrium. We demonstrate that this game reduces to a convex optimization problem to identify the worst-case reward weights, with the optimal sampling policy analytically derived. For practical applications, we propose an efficient algorithm of RMOD tailored for contemporary LLMs, introducing minimal computational overhead compared to standard non-robust Controlled Decoding methods. Experimental results across a range of popular alignment datasets with up to 10 objectives show the effectiveness of RMOD and its distilled version, consistently outperforming baselines in worst-case rewards and win rates.

中文标题/摘要

标题：大型语言模型的鲁棒多目标受控解码

我们提出了鲁棒多目标解码（RMOD），这是一种新颖的推理时算法，通过最大化最坏情况奖励，使大型语言模型（LLMs）稳健地对齐多个人类目标（例如指令遵循、有用性、安全性）。RMOD将鲁棒解码问题表述为对抗性计算的奖励权重与采样策略之间的极大极小（maximin）二人博弈，可通过纳什均衡求解。我们证明，该博弈可归约为一个用于确定最坏情况奖励权重的凸优化问题，而最优采样策略可解析推导。针对实际应用，我们提出了一种面向当代LLM的高效RMOD算法，与标准的非鲁棒受控解码方法相比仅引入极小的计算开销。在一系列包含多达10个目标的常用对齐数据集上的实验结果表明了RMOD及其蒸馏版本的有效性，它们在最坏情况奖励和胜率上始终优于基线方法。

Summary / 总结

The research introduces Robust Multi-Objective Decoding (RMOD), an algorithm that aligns Large Language Models (LLMs) to multiple human objectives by maximizing worst-case rewards. RMOD formulates the robust decoding problem as a maximin game, leading to a convex optimization problem to identify worst-case reward weights. The optimal sampling policy is analytically derived, and an efficient algorithm is proposed for practical use. Experiments across various alignment datasets show RMOD and its distilled version outperforming baselines in worst-case rewards and win rates.

研究提出了鲁棒多目标解码（Robust Multi-Objective Decoding，RMOD），该算法通过最大化最坏情况奖励，使大型语言模型（LLMs）与多个人类目标对齐。RMOD将鲁棒解码问题表述为一个最大最小（maximin）博弈，并将其归约为一个用于确定最坏情况奖励权重的凸优化问题；最优采样策略可解析得出，并为实际应用提出了一种高效算法。在多个对齐数据集上的实验表明，RMOD及其蒸馏版本在最坏情况奖励和胜率方面优于基线方法。

Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System

Authors: Kawin Mayilvaghanan, Siddhant Gupta, Ayush Kumar

First: 2026-02-16T17:56:18+00:00 · Latest: 2026-02-16T17:56:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across counterfactual pairs. Evaluating 18 LLMs on 3,000 real-world contact center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and consistent MASD shifts across confidence, positive, and improvement scores. Larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy. Contextual priming of historical performance induces the most severe degradations (CFR up to 16.4%), while implicit linguistic identity cues remain a persistent bias source. Finally, we analyze the efficacy of fairness-aware prompting, finding that explicit instructions yield only modest improvements in evaluative consistency. Our findings underscore the need for standardized fairness auditing pipelines prior to deploying LLMs in high-stakes workforce evaluation.
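
The two fairness metrics defined above reduce to simple averages over counterfactual pairs. The sketch below follows the definitions in the abstract; the function names and the input layout (a list of original/counterfactual pairs) are assumptions made for illustration.

```python
def counterfactual_flip_rate(judgment_pairs):
    """CFR: fraction of pairs whose binary judgment reverses.

    judgment_pairs: list of (original, counterfactual) booleans.
    """
    if not judgment_pairs:
        raise ValueError("need at least one counterfactual pair")
    flips = sum(1 for original, counterfactual in judgment_pairs
                if original != counterfactual)
    return flips / len(judgment_pairs)

def mean_absolute_score_difference(score_pairs):
    """MASD: average absolute shift in a score across pairs.

    score_pairs: list of (original, counterfactual) numeric scores.
    """
    if not score_pairs:
        raise ValueError("need at least one counterfactual pair")
    return sum(abs(original - counterfactual)
               for original, counterfactual in score_pairs) / len(score_pairs)

# Toy data: two of four judgments flip; scores shift by 1, 0, and 2 points.
cfr = counterfactual_flip_rate([(True, True), (True, False),
                                (False, False), (False, True)])
masd = mean_absolute_score_difference([(4, 5), (3, 3), (2, 4)])
```

A CFR of 0 and an MASD of 0 would mean the evaluator is unaffected by the counterfactual change; the abstract reports CFR between 5.4% and 13.0% across the models studied.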

中文标题/摘要

标题：基于LLM的联络中心坐席质量保证系统的反事实公平性评估

大型语言模型（LLMs）越来越多地部署在联络中心的质量保证（QA）中，用于自动化坐席绩效评估和辅导反馈。尽管LLMs提供了前所未有的可扩展性和速度，但它们对网络规模训练数据的依赖引发了对人口统计和行为偏见的担忧，这些偏见可能扭曲对员工的评估。我们对基于LLM的QA系统进行了反事实公平性评估，涵盖13个维度，分属三个类别：身份、情境和行为风格。公平性通过两个指标量化：反事实翻转率（CFR），即二元判断发生反转的频率；以及平均绝对评分差异（MASD），即反事实配对之间辅导评分或置信度评分的平均变化。我们在3,000份真实联络中心转录文本上评估了18种LLM，发现了系统性差异：CFR范围为5.4%到13.0%，并且置信度、正面和改进评分上均存在一致的MASD偏移。更大、对齐程度更高的模型表现出更低的不公平性，但公平性并不随准确性而变化。关于历史绩效的情境提示导致最严重的退化（CFR最高达16.4%），而隐含的语言身份线索仍然是一个持续的偏见来源。最后，我们分析了公平性感知提示的有效性，发现明确的指令仅在评估一致性方面带来有限的改进。我们的研究结果强调，在将LLM部署于高风险的员工评估之前，需要标准化的公平性审计流程。

Summary / 总结

The study evaluates the fairness of Large Language Models (LLMs) used in contact-center Quality Assurance (QA) systems by assessing 18 LLMs on 3,000 real-world transcripts across 13 dimensions. The evaluation uses Counterfactual Flip Rate (CFR) and Mean Absolute Score Difference (MASD) to quantify fairness, revealing systematic disparities with CFR ranging from 5.4% to 13.0%. Larger, more aligned models show lower unfairness, but fairness does not correlate with accuracy. Historical performance priming and implicit identity cues are significant bias sources, while fairness-aware prompting provides only limited improvements in evaluative consistency.

研究评估了用于联络中心质量保证系统的大型语言模型（LLMs）的公平性，使用反事实翻转率（CFR）和平均绝对评分差异（MASD）衡量公平性。在3,000份真实通话转录文本上评估18个LLM后，研究发现存在系统性差异，CFR范围从5.4%到13.0%，各项评分上也存在一致的MASD偏移。更大、对齐程度更高的模型显示出较低的不公平性，但公平性并不随准确性而变化。历史绩效提示和隐含的身份线索是主要的偏见来源，而公平性感知提示仅带来有限的改进。研究强调在部署LLMs进行员工评估之前进行公平性审计的必要性。

PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement

Authors: Yian Wang, Han Yang, Minghao Guo, Xiaowen Qiu, Tsun-Hsuan Wang, Wojciech Matusik, Joshua B. Tenenbaum, Chuang Gan

Venue: ICLR 2026

First: 2026-02-16T17:55:25+00:00 · Latest: 2026-02-16T17:55:25+00:00

Comments: ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

Automatically generating interactive 3D environments is crucial for scaling up robotic data collection in simulation. While prior work has primarily focused on 3D asset placement, it often overlooks the physical relationships between objects (e.g., contact, support, balance, and containment), which are essential for creating complex and realistic manipulation scenarios such as tabletop arrangements, shelf organization, or box packing. Compared to classical 3D layout generation, producing complex physical scenes introduces additional challenges: (a) higher object density and complexity (e.g., a small shelf may hold dozens of books), (b) richer supporting relationships and compact spatial layouts, and (c) the need to accurately model both spatial placement and physical properties. To address these challenges, we propose PhyScensis, an LLM agent-based framework powered by a physics engine, to produce physically plausible scene configurations with high complexity. Specifically, our framework consists of three main components: an LLM agent iteratively proposes assets with spatial and physical predicates; a solver, equipped with a physics engine, realizes these predicates into a 3D scene; and feedback from the solver informs the agent to refine and enrich the configuration. Moreover, our framework preserves strong controllability over fine-grained textual descriptions and numerical parameters (e.g., relative positions, scene stability), enabled through probabilistic programming for stability and a complementary heuristic that jointly regulates stability and spatial relations. Experimental results show that our method outperforms prior approaches in scene complexity, visual quality, and physical accuracy, offering a unified pipeline for generating complex physical scene layouts for robotic manipulation.

中文标题/摘要

标题：PhyScensis：增强物理的LLM代理用于复杂物理场景排列

自动生成交互式3D环境对于扩大机器人数据收集的模拟规模至关重要。尽管先前的工作主要集中在3D资产放置上，但它们往往忽略了物体之间的物理关系（例如接触、支撑、平衡和包含），这些关系对于创建复杂的现实操作场景（如桌面排列、架子整理或盒子打包）至关重要。与传统的3D布局生成相比，生成复杂的物理场景带来了额外的挑战：(a) 更高的物体密度和复杂性（例如，一个小架子可能容纳几十本书），(b) 更丰富的支撑关系和紧凑的空间布局，以及(c) 需要准确地建模空间放置和物理属性。为了解决这些挑战，我们提出PhyScensis，这是一种基于LLM代理的框架，由物理引擎提供支持，以生成具有高复杂度的物理上合理的场景配置。具体而言，我们的框架由三个主要组件组成：一个LLM代理迭代地提出具有空间和物理谓词的资产；一个配备物理引擎的求解器将这些谓词实现为3D场景；以及求解器的反馈指导代理改进和完善配置。此外，我们的框架通过概率编程保持对细粒度文本描述和数值参数（例如相对位置、场景稳定性）的强大可控性，通过互补启发式方法联合调节稳定性和空间关系。实验结果表明，我们的方法在场景复杂性、视觉质量和物理准确性方面优于先前的方法，提供了一种统一的生成复杂物理场景布局的管道，用于机器人操作。

Summary / 总结

PhyScensis is an LLM agent-based framework that uses a physics engine to generate complex physical scenes for robotic manipulation. It addresses the challenges of higher object density, richer supporting relationships, and accurate modeling of spatial and physical properties. The framework consists of an LLM agent, a solver with a physics engine, and feedback mechanisms to refine scene configurations. Experimental results demonstrate that PhyScensis outperforms previous methods in scene complexity, visual quality, and physical accuracy, providing a unified pipeline for complex physical scene generation.

PhyScensis 是一个基于 LLM 的框架，结合物理引擎生成复杂且物理上合理的 3D 场景，解决高密度物体、丰富支持关系以及准确建模空间位置和物理属性的挑战。该框架包括一个 LLM 代理、一个带有物理引擎的求解器和反馈机制以优化场景配置。实验结果表明，PhyScensis 在场景复杂性、视觉质量和物理准确性方面优于先前的方法，提供了一个统一的管道用于机器人操作场景的生成。

Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning

Authors: Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, Yang You

Venue: NeurIPS 2025

First: 2024-02-24T07:22:04+00:00 · Latest: 2026-02-16T17:54:52+00:00

Comments: Accepted by NeurIPS 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training. Memory-efficient Zeroth-order (MeZO) optimizers, recently proposed to address this issue, only require forward passes during training, making them more memory-friendly. However, compared with exact gradients, ZO-based gradients usually exhibit an estimation error, which can significantly hurt the optimization process, leading to slower convergence and suboptimal solutions. In addition, we find that the estimation error will hurt more when adding to large weights instead of small weights. Based on this observation, this paper introduces Sparse MeZO, a novel memory-efficient zeroth-order optimization approach that applies ZO only to a carefully chosen subset of parameters. We propose a simple yet effective parameter selection scheme that yields significant performance gains with Sparse-MeZO. Additionally, we develop a memory-optimized implementation for sparse masking, ensuring the algorithm requires only inference-level memory consumption, allowing Sparse-MeZO to fine-tune LLaMA-30b on a single A100 GPU. Experimental results illustrate that Sparse-MeZO consistently improves both performance and convergence speed over MeZO without any overhead. For example, it achieves a 9% absolute accuracy improvement and 3.5x speedup over MeZO on the RTE task. Code is available at https://github.com/NUS-HPC-AI-Lab/SparseMeZO.
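
The idea of restricting a forward-only (zeroth-order) update to a subset of parameters can be sketched on a toy quadratic loss. This is a sketch under stated assumptions, not the released Sparse-MeZO code: the magnitude threshold used to pick the subset, the random ±1 perturbation, the fixed mask, and all function names are hypothetical choices made for illustration.

```python
import random

def loss(weights, targets):
    """Toy quadratic loss standing in for a model's forward pass."""
    return sum((w - t) ** 2 for w, t in zip(weights, targets))

def small_weight_mask(weights, threshold):
    """Assumed selection rule: only small-magnitude weights are tuned."""
    return [1.0 if abs(w) <= threshold else 0.0 for w in weights]

def sparse_zo_step(weights, targets, mask, rng, eps=1e-3, lr=0.05):
    """One zeroth-order step that needs two forward passes and no backprop."""
    # Random +/-1 direction, zeroed outside the selected subset.
    z = [rng.choice((-1.0, 1.0)) * m for m in mask]
    plus = [w + eps * zi for w, zi in zip(weights, z)]
    minus = [w - eps * zi for w, zi in zip(weights, z)]
    # Central-difference estimate of the directional derivative along z.
    g = (loss(plus, targets) - loss(minus, targets)) / (2 * eps)
    return [w - lr * g * zi for w, zi in zip(weights, z)]

def fine_tune(weights, targets, threshold=1.0, steps=200, seed=0):
    rng = random.Random(seed)
    mask = small_weight_mask(weights, threshold)  # computed once, then fixed
    for _ in range(steps):
        weights = sparse_zo_step(weights, targets, mask, rng)
    return weights, mask

start = [0.2, -0.5, 3.0, 0.1, -4.0]
goal = [0.5, -0.2, 2.5, 0.4, -3.5]
tuned, mask = fine_tune(start, goal)
```

In this example the two large-magnitude weights are never perturbed or updated, so the noisy zeroth-order estimate only ever touches the selected small weights.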

中文标题/摘要

标题：稀疏MeZO：较少参数实现零阶LLM微调的更好性能

虽然针对特定任务微调大型语言模型（LLMs）通常能取得令人印象深刻的结果，但梯度基训练中的反向传播会导致内存效率低下。最近提出的内存高效零阶（MeZO）优化器仅在训练过程中需要前向传递，使其更具内存友好性。然而，与精确梯度相比，基于零阶的梯度通常会表现出估计误差，这会严重影响优化过程，导致收敛速度变慢和次优解。此外，我们发现当添加到大权重时，这种估计误差的影响更大。基于这一观察，本文介绍了稀疏MeZO，这是一种新颖的内存高效零阶优化方法，仅对精心选择的参数子集应用零阶。我们提出了一种简单而有效的参数选择方案，该方案在稀疏MeZO中实现了显著的性能提升。此外，我们还开发了一种内存优化的稀疏掩码实现，确保算法仅需推理级别的内存消耗，从而使稀疏MeZO能够在单个A100 GPU上微调LLaMA-30b。实验结果表明，稀疏MeZO在没有任何开销的情况下，一致地提高了性能和收敛速度。例如，在RTE任务上，它比MeZO实现了9%的绝对准确率提升和3.5倍的速度提升。代码可在https://github.com/NUS-HPC-AI-Lab/SparseMeZO/ 获取。

Summary / 总结

Sparse MeZO is a memory-efficient zeroth-order optimization method that applies zeroth-order gradients only to a subset of parameters, improving both performance and convergence speed compared to MeZO. It uses a simple parameter selection scheme and a memory-optimized implementation for sparse masking, allowing it to fine-tune LLaMA-30b on a single A100 GPU. Experimental results show a 9% absolute accuracy improvement and a 3.5x speedup on the RTE task over MeZO.

Sparse MeZO 是一种仅对部分参数应用零阶优化的方法，相比 MeZO 提高了性能和收敛速度。该方法使用简单的参数选择方案，并只需推理级别的内存，使其能够在单个 A100 GPU 上微调 LLaMA-30b。Sparse MeZO 在 RTE 任务上比 MeZO 提高了 9% 的绝对准确率并加速了 3.5 倍。

Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation

Authors: Rishikesh Devanathan, Varun Nathan, Ayush Kumar

First: 2025-08-25T17:10:36+00:00 · Latest: 2026-02-16T17:44:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Synthetic data is increasingly critical for contact centers, where privacy constraints and data scarcity limit the availability of real conversations. However, generating synthetic dialogues that are realistic and useful for downstream applications remains challenging. In this work, we benchmark multiple generation strategies guided by structured supervision on call attributes (Intent Summaries, Topic Flows, and Quality Assurance (QA) Forms) across multiple languages. To test downstream utility, we evaluate synthetic transcripts on an automated quality assurance (AutoQA) task, finding that prompts optimized on real transcripts consistently outperform those optimized on synthetic transcripts. These results suggest that current synthetic transcripts fall short in capturing the full realism of real agent-customer interactions. To highlight these downstream gaps, we introduce a diagnostic evaluation framework comprising 17 metrics across four dimensions: (1) Emotional and Sentiment Arcs, (2) Linguistic Complexity, (3) Interaction Style, and (4) Conversational Properties. Our analysis shows that even with structured supervision, current generation strategies exhibit measurable deficiencies in sentiment fidelity, disfluency modeling, behavioral variation, and conversational realism. Together, these results highlight the importance of diagnostic, metric-driven evaluation for synthetic conversation generation intended for downstream applications.

中文标题/摘要

标题：为什么合成数据尚未真实：联络中心对话生成的诊断框架

合成数据对联络中心越来越关键，因为隐私限制和数据稀缺限制了真实对话的可用性。然而，生成既真实又对下游应用有用的合成对话仍然具有挑战性。在本研究中，我们在多种语言上对多种生成策略进行了基准测试，这些策略以通话属性（意图摘要、主题流程和质量保证（QA）表单）的结构化监督为指导。为了测试下游实用性，我们在自动质量保证（AutoQA）任务上评估合成转录文本，发现基于真实转录文本优化的提示始终优于基于合成转录文本优化的提示。这些结果表明，当前的合成转录文本未能充分捕捉真实坐席-客户互动的全部真实性。为了揭示这些下游差距，我们引入了一个诊断评估框架，包含四个维度上的17个指标：（1）情绪与情感弧线，（2）语言复杂性，（3）互动风格，（4）对话属性。我们的分析表明，即使有结构化监督，当前的生成策略在情感保真度、不流畅现象建模、行为变化和对话真实性方面仍存在可测量的缺陷。这些结果共同强调了对面向下游应用的合成对话生成进行诊断性、指标驱动评估的重要性。

Summary / 总结

This work addresses the challenge of generating realistic synthetic dialogues for contact centers, where privacy constraints and data scarcity limit access to real conversations. The study benchmarks several generation strategies guided by structured supervision on call attributes and evaluates their downstream utility on an automated quality assurance (AutoQA) task. Prompts optimized on real transcripts consistently outperform prompts optimized on synthetic transcripts, which suggests that current synthetic transcripts do not capture the full realism of real agent-customer interactions. To diagnose these gaps, the authors introduce a diagnostic evaluation framework with 17 metrics across four dimensions (emotional and sentiment arcs, linguistic complexity, interaction style, and conversational properties), revealing measurable shortcomings in sentiment fidelity, disfluency modeling, behavioral variation, and conversational realism even with structured supervision.

研究旨在解决为联络中心生成真实合成对话的挑战，在该场景中，隐私和数据稀缺限制了对真实对话的获取。该研究以通话属性的结构化监督为指导，对多种生成策略进行基准测试，并通过自动质量保证（AutoQA）任务评估其下游实用性。关键发现表明，基于真实转录文本优化的提示始终优于基于合成转录文本优化的提示，说明当前的合成转录文本尚未充分捕捉真实互动的全部真实性。为了诊断这些差距，作者引入了一个包含17个指标的诊断评估框架，涵盖四个维度：情绪与情感弧线、语言复杂性、互动风格和对话属性；结果显示，即使有结构化监督，合成对话在情感保真度、不流畅现象建模和对话真实性方面仍存在可测量的不足。

AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Authors: Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Chaurasia, Abhishek Charnalia, Derek Dunfield, Karen Hambardzumyan, Daniel Izcovich, Martin Josifoski, Ishita Mediratta, Kelvin Niu, Parth Pathak, Michael Shvartsman, Edan Toledo, Anton Protopopov, Roberta Raileanu, Alexander Miller, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach

First: 2026-02-06T16:45:02+00:00 · Latest: 2026-02-16T17:43:19+00:00

Comments: 49 pages, 14 figures, 10 tables

Abs · PDF · Code1 · Code2

Abstract

LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle -- including idea generation, experiment analysis and iterative refinement -- without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.

中文标题/摘要

标题：AIRS-Bench：前沿人工智能研究科学代理的一套任务集

大语言模型代理在推进科学研究方面具有巨大潜力。为了加速这一进程，我们引入了AIRS-Bench（人工智能研究科学基准），这是一个包含20项任务的套件，这些任务源自最新的机器学习论文。这些任务涵盖了语言建模、数学、生物信息学和时间序列预测等多个领域。AIRS-Bench任务评估代理在整个研究生命周期中的能力，包括创意生成、实验分析和迭代改进，而不提供基线代码。AIRS-Bench的任务格式具有灵活性，便于新任务的集成和不同代理框架之间的严格比较。我们使用前沿模型与顺序和并行支架相结合来建立基线。结果显示，代理在四项任务中超过了人类的SOTA，但在其他十六项任务中未能达到。即使代理超越了人类基准，它们也没有达到底层任务的理论性能上限。这些发现表明，AIRS-Bench远未饱和，提供了巨大的改进空间。我们开源了AIRS-Bench的任务定义和评估代码，以促进自主科学研究的发展。

Summary / 总结

AIRS-Bench is a suite of 20 tasks, sourced from state-of-the-art machine learning papers, for evaluating AI research agents across domains such as language modeling, mathematics, bioinformatics, and time series forecasting. The tasks assess agents over the full research lifecycle, from idea generation through experiment analysis to iterative refinement, without providing baseline code. With frontier models paired with sequential and parallel scaffolds, agents exceed the human state of the art on four tasks but fall short on the other sixteen, indicating that the benchmark is far from saturated. The task definitions and evaluation code are open-sourced to promote further development in autonomous scientific research.

AIRS-Bench 是一套包含 20 项任务的评估套件，旨在评估 AI 科学研究代理在语言建模、数学、生物信息学和时间序列预测等不同领域的能力。这些任务评估代理在整个研究生命周期中的能力，从想法生成到迭代改进。使用最先进的模型，研究显示代理在四项任务中超越人类，但在十六项任务中表现不佳，表明仍有很大的改进空间。AIRS-Bench 的框架已开源，以促进自主科学研究的进一步发展。

Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition

Authors: Varun Nathan, Shreyas Guha, Ayush Kumar

First: 2026-02-16T17:36:05+00:00 · Latest: 2026-02-16T17:36:05+00:00

Abs · PDF · Code1 · Code2

Abstract

We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts) with explicit depends_on for parallelism. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator->optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families for their ability to decompose queries into step-by-step, executable, and tool-assigned plans, evaluated under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5-15); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the "A+" tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool-understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.
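
The role of explicit depends_on fields in exposing parallelism can be sketched as a grouping of plan steps into batches whose dependencies are already satisfied. The plan layout, step identifiers, and function name below are hypothetical; only the tool labels (T2S, RAG) and the depends_on field come from the abstract.

```python
def parallel_batches(plan):
    """Group plan steps into batches that can run concurrently.

    plan: dict mapping step id -> {"tool": str, "depends_on": [step ids]}.
    Every step in a batch depends only on steps from earlier batches.
    """
    remaining = {step_id: set(step["depends_on"]) for step_id, step in plan.items()}
    done = set()
    batches = []
    while remaining:
        ready = sorted(step_id for step_id, deps in remaining.items() if deps <= done)
        if not ready:
            # Nothing can run: a cycle or a reference to an unknown step.
            raise ValueError("cyclic or unresolved depends_on")
        batches.append(ready)
        done.update(ready)
        for step_id in ready:
            del remaining[step_id]
    return batches

# Hypothetical four-step plan mixing structured and unstructured tools.
example_plan = {
    "s1": {"tool": "T2S", "depends_on": []},
    "s2": {"tool": "RAG", "depends_on": []},
    "s3": {"tool": "T2S", "depends_on": ["s1"]},
    "s4": {"tool": "RAG", "depends_on": ["s2", "s3"]},
}
batches = parallel_batches(example_plan)
```

Here the first two steps have no dependencies and can run side by side, while the last step waits for both branches.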

中文标题/摘要

标题：联络中心AI中的工具感知规划：通过谱系引导的查询分解评估LLM

我们提出了一个面向联络中心、立足于领域的工具感知规划生成框架和基准。在我们的目标用例（回答业务洞察类查询）中，需要将查询分解为在结构化工具（Text2SQL (T2S)/Snowflake）和非结构化工具（RAG/转录文本）上的可执行步骤，并通过显式的depends_on支持并行执行。我们的贡献包括三个方面：(i) 一种基于参考的规划评估框架，运行在两种模式下：一个涵盖七个维度（例如工具-提示对齐、查询遵循度）的逐指标评估器和一个一次性（one-shot）评估器；(ii) 一种数据整理方法，通过评估器->优化器循环迭代改进规划，以生成高质量的规划谱系（有序的规划修订序列），同时减少人工工作量；以及(iii) 对14种不同规模和系列的LLM的大规模研究，考察它们将查询分解为逐步、可执行且已分配工具的规划的能力，并在带谱系和不带谱系的提示下进行评估。实证结果表明，LLM在复合查询和超过4步的规划（通常为5-15步）上表现不佳；最佳总指标得分达到84.8%（Claude-3-7-Sonnet），而在“A+”级别（极好、非常好）上最强的一次性匹配率仅为49.75%（o3-mini）。规划谱系总体上带来的收益好坏参半，但对若干顶级模型有益，并提高了许多模型的步骤可执行性。我们的结果凸显了工具理解方面持续存在的差距，特别是在工具-提示对齐和工具使用完整性方面，并表明更短、更简单的规划明显更容易。该框架和研究发现为评估和改进联络中心场景中借助工具回答数据分析查询的智能体规划提供了一条可复现的路径。

Summary / 总结

The research evaluates large language models (LLMs) on generating tool-aware plans for contact center queries, focusing on decomposing complex queries into executable steps over structured and unstructured tools. The study introduces a reference-based evaluation framework with two modes and a data curation methodology that iteratively refines plans. LLMs struggle with compound queries and with plans exceeding four steps; the best total metric score reaches 84.8%. Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. The results highlight persistent gaps in tool understanding and show that shorter, simpler plans are markedly easier.

论文提出了一种面向联络中心的工具感知规划框架和基准，重点在于使用结构化和非结构化工具将查询分解为可执行步骤。关键贡献包括基于参考的评估框架、数据整理方法和对14个LLM的大规模研究。实验结果显示，LLM在处理复合查询和超过四步的规划时表现不佳，最佳总指标得分为84.8%。规划谱系总体上带来的收益好坏参半，但对若干顶级模型有益，并提高了许多模型的步骤可执行性。研究凸显了工具理解方面持续存在的差距，并表明更短、更简单的规划明显更容易。

SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation

Authors: Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You, Yifan Zhou, Ruidong Zhang, Andrea Sikora, Lin Zhao, Yohannes Abate, Tianming Liu

First: 2026-01-06T06:19:58+00:00 · Latest: 2026-02-16T17:31:04+00:00

Abs · PDF · Code1 · Code2

Abstract

While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory. To bridge this gap, we introduce Synapse (Synergistic Associative Processing Semantic Encoding), a unified memory architecture that transcends static vector similarity. Drawing from cognitive science, Synapse models memory as a dynamic graph where relevance emerges from spreading activation rather than pre-computed links. By integrating lateral inhibition and temporal decay, the system dynamically highlights relevant sub-graphs while filtering interference. We implement a Triple Hybrid Retrieval strategy that fuses geometric embeddings with activation-based graph traversal. Comprehensive evaluations on the LoCoMo benchmark show that Synapse significantly outperforms state-of-the-art methods in complex temporal and multi-hop reasoning tasks, offering a robust solution to the "Contextual Tunneling" problem. Our code and data will be made publicly available upon acceptance.
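
Spreading activation with temporal decay and lateral inhibition can be sketched on a tiny memory graph. The update rule, the exponential decay, the winner-relative inhibition, the parameter values, and the node names are all assumptions chosen for illustration; the abstract does not give Synapse's actual equations.

```python
import math

def spread_activation(edges, ages, seeds, steps=2, spread=0.5,
                      decay_rate=0.1, inhibition=0.05):
    """Propagate activation from seed nodes over an undirected weighted graph.

    edges: dict mapping (node_a, node_b) -> association weight.
    ages:  dict mapping node -> time since the memory was last touched.
    """
    neighbors = {}
    for (a, b), weight in edges.items():
        neighbors.setdefault(a, []).append((b, weight))
        neighbors.setdefault(b, []).append((a, weight))

    activation = {node: 0.0 for node in ages}
    for node in seeds:
        activation[node] = 1.0

    for _ in range(steps):
        previous = dict(activation)
        for node in activation:
            incoming = sum(previous[other] * weight
                           for other, weight in neighbors.get(node, []))
            # Temporal decay: older memories receive less of the incoming signal.
            activation[node] += spread * incoming * math.exp(-decay_rate * ages[node])

    # Lateral inhibition (assumed form): each node is suppressed in proportion
    # to how far it trails the strongest node, and clipped at zero.
    strongest = max(activation.values())
    return {node: max(0.0, value - inhibition * (strongest - value))
            for node, value in activation.items()}

edges = {("query", "recent_note"): 0.8,
         ("query", "stale_note"): 0.8,
         ("recent_note", "linked_fact"): 0.5}
ages = {"query": 0, "recent_note": 1, "stale_note": 30, "linked_fact": 2}
result = spread_activation(edges, ages, seeds=["query"])
```

With these toy values the stale neighbor is filtered out entirely, while the fact two hops away through the recent note keeps a small positive activation, which is the kind of multi-hop relevance a static similarity lookup would miss.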

中文标题/摘要

标题：SYNAPSE：通过扩散激活为LLM代理赋予情景-语义记忆

虽然大型语言模型（LLMs）在通用推理方面表现出色，但标准的检索增强方法无法解决长期代理记忆的断联问题。为弥合这一差距，我们引入了Synapse（协同关联处理语义编码），这是一种超越静态向量相似性的统一记忆架构。借鉴认知科学，Synapse 将记忆建模为一个动态图，相关性源自扩散激活而非预先计算的链接。通过整合侧抑制和时间衰减，系统动态地突出显示相关子图并过滤干扰。我们实现了一种三重混合检索策略，将几何嵌入与基于激活的图遍历相结合。在LoCoMo基准测试中的全面评估表明，Synapse 在复杂的时序和多跳推理任务中显著优于最先进的方法，提供了解决“上下文隧道”问题的稳健方案。我们的代码和数据将在接受后公开发布。

Summary / 总结

SYNAPSE equips LLM agents with episodic-semantic memory by modeling memory as a dynamic graph in which relevance emerges from spreading activation, addressing the limitations of static vector similarity. The system integrates lateral inhibition and temporal decay to dynamically highlight relevant sub-graphs and filter interference. Evaluations on the LoCoMo benchmark show that SYNAPSE outperforms state-of-the-art methods on complex temporal and multi-hop reasoning tasks, which the authors present as a robust solution to the "Contextual Tunneling" problem.

SYNAPSE 通过在动态图中使用扩散激活，为 LLM 代理提供情景-语义记忆，以应对静态向量相似性的局限性。该系统结合了侧抑制和时间衰减，以动态突出显示相关子图并过滤干扰。在 LoCoMo 基准测试上的评估表明，SYNAPSE 在复杂的时序和多跳推理任务中优于现有方法，为“上下文隧道”问题提供了稳健的解决方案。

AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

Authors: Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal

First: 2026-02-16T17:23:08+00:00 · Latest: 2026-02-16T17:23:08+00:00

Comments: Project website: https://zunwang1.github.io/AnchorWeave

Abs · PDF · Code1 · Code2 · Project1

Abstract

Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from the reconstructed geometry in the history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.
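
Coverage-driven retrieval can be illustrated as a greedy selection of local memories that together cover the views along a target trajectory. This is a generic greedy max-coverage sketch, not AnchorWeave's retrieval procedure: the discretization of the trajectory into view cells, the memory identifiers, and the function name are hypothetical.

```python
def retrieve_local_memories(target_views, memory_coverage, max_memories):
    """Greedily pick local memories that cover the most uncovered target views.

    target_views:    iterable of view-cell ids along the target trajectory.
    memory_coverage: dict mapping memory id -> set of view-cell ids it sees.
    Returns the chosen memory ids and any views left uncovered.
    """
    uncovered = set(target_views)
    chosen = []
    while uncovered and len(chosen) < max_memories:
        candidates = [m for m in memory_coverage if m not in chosen]
        if not candidates:
            break
        # Ties are broken by insertion order of memory_coverage.
        best = max(candidates, key=lambda m: len(memory_coverage[m] & uncovered))
        if not memory_coverage[best] & uncovered:
            break  # no remaining memory adds coverage
        chosen.append(best)
        uncovered -= memory_coverage[best]
    return chosen, uncovered

memory_coverage = {
    "m0": {0, 1, 2},
    "m1": {2, 3},
    "m2": {3, 4, 5},
    "m3": {1, 2},
}
chosen, uncovered = retrieve_local_memories(range(6), memory_coverage, max_memories=3)
```

In this example two memories are enough to cover all six view cells, so the redundant ones are never retrieved.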

中文标题/摘要

标题：AnchorWeave：基于检索局部空间记忆的世界一致视频生成

在长时间范围内保持空间世界一致性，仍然是相机可控视频生成的核心挑战。现有的基于记忆的方法通常以全局重建的3D场景为条件进行生成，即从历史帧重建的几何体中渲染锚视频。然而，从多个视角重建全局3D场景不可避免地会引入跨视角错位，因为姿态和深度估计误差会导致同一表面在不同视角下被重建到略有不同的3D位置。融合之后，这些不一致会累积成带噪声的几何体，污染条件信号并降低生成质量。我们提出了AnchorWeave，一种记忆增强的视频生成框架，它用多个干净的局部几何记忆取代单一的、存在错位的全局记忆，并学习调和它们之间的跨视角不一致。为此，AnchorWeave执行与目标轨迹对齐的、由覆盖度驱动的局部记忆检索，并在生成过程中通过多锚点编织控制器整合所选的局部记忆。大量实验表明，AnchorWeave在保持较强视觉质量的同时显著提升了长期场景一致性，消融和分析研究进一步验证了局部几何条件、多锚点控制和覆盖度驱动检索的有效性。

Summary / 总结

AnchorWeave is a video generation framework that addresses the challenge of maintaining spatial world consistency over long horizons. It replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies through a coverage-driven local memory retrieval and a multi-anchor weaving controller. Experiments show that AnchorWeave improves long-term scene consistency while maintaining strong visual quality, with ablation studies validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.

AnchorWeave 是一个视频生成框架，旨在解决长时间内保持空间世界一致性的挑战。它用多个干净的局部几何记忆取代了一个单一的对齐不良的全局记忆，并通过覆盖驱动的局部记忆检索和多锚点编织控制器来解决它们的跨视图不一致性。实验表明，AnchorWeave 在保持强视觉质量的同时显著提高了长期场景一致性，且消融研究进一步验证了局部几何条件、多锚点控制和覆盖驱动检索的有效性。
