arXiv Paper Digest

Snapshot: 20260306_0356

UMA: A Family of Universal Models for Atoms

Authors: Brandon M. Wood, Misko Dzamba, Xiang Fu, Meng Gao, Muhammed Shuaibi, Luis Barroso-Luque, Kareem Abdelmaqsoud, Vahe Gharakhanyan, John R. Kitchin, Daniel S. Levine, Kyle Michel, Anuroop Sriram, Taco Cohen, Abhishek Das, Ammar Rizvi, Sushree Jagriti Sahoo, Zachary W. Ulissi, C. Lawrence Zitnick

First: 2025-06-30T15:38:13+00:00 · Latest: 2026-03-04T18:57:47+00:00

Comments: 33 pages, 8 figures

Abstract

The ability to quickly and accurately compute properties from atomic simulations is critical for advancing a large number of applications in chemistry and materials science including drug discovery, energy storage, and semiconductor manufacturing. To address this need, Meta FAIR presents a family of Universal Models for Atoms (UMA), designed to push the frontier of speed, accuracy, and generalization. UMA models are trained on half a billion unique 3D atomic structures (the largest training runs to date) by compiling data across multiple chemical domains, e.g. molecules, materials, and catalysts. We develop empirical scaling laws to help understand how to increase model capacity alongside dataset size to achieve the best accuracy. The UMA small and medium models utilize a novel architectural design we refer to as mixture of linear experts that enables increasing model capacity without sacrificing speed. For example, UMA-medium has 1.4B parameters but only ~50M active parameters per atomic structure. We evaluate UMA models on a diverse set of applications across multiple domains and find that, remarkably, a single model without any fine-tuning can perform similarly or better than specialized models. We are releasing the UMA code, weights, and associated data to accelerate computational workflows and enable the community to continue to build increasingly capable AI models.

中文标题/摘要

标题：UMA：原子的通用模型家族

从原子模拟快速准确地计算性质的能力对于推进化学和材料科学中的许多应用至关重要，包括药物发现、储能和半导体制造。为了解决这一需求，Meta FAIR 推出了原子的通用模型家族（UMA），旨在推动速度、准确性和泛化的前沿。UMA 模型在五亿个独特的三维原子结构上进行了训练（迄今为止最大的训练规模），通过跨多个化学领域（如分子、材料和催化剂）汇总数据。我们开发了经验性缩放定律来帮助理解如何随着数据集大小增加模型容量以获得最佳准确度。UMA 小型和中型模型采用了我们称之为线性专家混合的新型架构设计，这使得在不牺牲速度的情况下增加模型容量成为可能。例如，UMA 中型模型有 14 亿个参数，但每个原子结构只有约 5000 万个活跃参数。我们在多个领域的多种应用上评估了 UMA 模型，发现令人惊讶的是，一个未经任何微调的单一模型可以与专门模型表现得同样好或更好。我们正在发布 UMA 代码、权重及相关数据，以加速计算工作流并使社区能够继续构建越来越强大的 AI 模型。

Summary / 总结

UMA is a family of universal models for atoms designed to improve the speed, accuracy, and generalization of atomic simulations in chemistry and materials science. The models are trained on half a billion unique 3D atomic structures, which the authors describe as the largest training runs to date, and the small and medium variants use a novel architecture called mixture of linear experts to increase model capacity without sacrificing speed. Evaluations show that a single UMA model, without any fine-tuning, can perform similarly to or better than specialized models across a range of applications.

UMA 是一组用于原子的通用模型，旨在提高化学和材料科学中原子模拟的速度、准确性和泛化能力。这些模型在五亿个独特的三维原子结构上进行训练（作者称其为迄今为止规模最大的训练），并采用了一种称为线性专家混合的新型架构设计，可以在不牺牲速度的情况下增加模型容量。实验结果显示，单个 UMA 模型在各种应用中无需微调即可与专门的模型表现相当或更优。
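
The gap between total and active parameters in a mixture of linear experts can be illustrated with a small NumPy sketch. This is only one plausible reading of the idea, not the UMA implementation: it assumes that routing coefficients are computed once per structure from a global feature vector and that the expert weight matrices are merged into a single matrix before being applied to every atom. The class name, shapes, and router are all hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

class MixtureOfLinearExperts:
    """Toy layer: K expert matrices are merged into one matrix per structure."""

    def __init__(self, num_experts, d_in, d_out, d_global, seed=0):
        rng = np.random.default_rng(seed)
        # Total capacity grows with num_experts.
        self.experts = rng.normal(size=(num_experts, d_out, d_in)) / np.sqrt(d_in)
        # Hypothetical router: maps a per-structure feature to expert logits.
        self.router = rng.normal(size=(num_experts, d_global)) / np.sqrt(d_global)

    def merge(self, global_feat):
        """Return the single (d_out, d_in) matrix used for this structure."""
        alpha = softmax(self.router @ global_feat)        # shape (K,)
        return np.tensordot(alpha, self.experts, axes=1)  # weighted sum of experts

    def forward(self, atom_feats, global_feat):
        """Apply the merged matrix to every atom: (n_atoms, d_in) -> (n_atoms, d_out)."""
        return atom_feats @ self.merge(global_feat).T

layer = MixtureOfLinearExperts(num_experts=4, d_in=8, d_out=5, d_global=3)
total_params = layer.experts.size                 # 4 * 5 * 8 stored parameters
active_params = layer.merge(np.ones(3)).size      # 5 * 8 used per structure
```

Because the experts are linear, merging the weights first gives the same output as mixing the expert outputs, but the per-atom cost is that of a single dense layer regardless of how many experts are stored.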

A Dual-Helix Governance Approach Towards Reliable Agentic AI for WebGIS Development

Authors: Boyuan Guan, Wencong Cui, Levente Juhasz

First: 2026-03-04T18:53:25+00:00 · Latest: 2026-03-04T18:53:25+00:00

Comments: Paper submitted to Transactions in GIS

Abstract

WebGIS development requires rigor, yet agentic AI frequently fails due to five large language model (LLM) limitations: context constraints, cross-session forgetting, stochasticity, instruction failure, and adaptation rigidity. We propose a dual-helix governance framework reframing these challenges as structural governance problems that model capacity alone cannot resolve. We implement the framework as a 3-track architecture (Knowledge, Behavior, Skills) that uses a knowledge graph substrate to stabilize execution by externalizing domain facts and enforcing executable protocols, complemented by a self-learning cycle for autonomous knowledge growth. Applying this to the FutureShorelines WebGIS tool, a governed agent refactored a 2,265-line monolithic codebase into modular ES6 components. Results demonstrated a 51% reduction in cyclomatic complexity and a 7-point increase in maintainability index. A comparative experiment against a zero-shot LLM confirms that externalized governance, not just model capability, drives operational reliability in geospatial engineering. This approach is implemented in the open-source AgentLoom governance toolkit.

中文标题/摘要

标题：面向WebGIS开发的可靠代理AI的双螺旋治理方法

WebGIS开发需要严谨性，但代理AI由于五个大型语言模型（LLM）限制（上下文约束、跨会话遗忘、随机性、指令失败和适应性僵化）经常失败。我们提出了一种双螺旋治理框架，将这些挑战重新定义为仅靠模型能力无法解决的结构性治理问题。我们通过知识图谱底层实现了一个三轨架构（知识、行为、技能），通过外部化领域事实和强制执行可执行协议来稳定执行，同时结合自我学习循环以实现自主知识增长。将此方法应用于FutureShorelines WebGIS工具，一个治理代理将2,265行单一代码库重构为模块化的ES6组件。结果表明，圈复杂度降低了51%，可维护性指数提高了7分。与零样本LLM的对比实验表明，外部化治理，而非仅仅是模型能力，驱动了地理空间工程中的操作可靠性。该方法在开源AgentLoom治理工具包中实现。

Summary / 总结

The paper addresses the challenges of developing reliable agentic AI for WebGIS by proposing a dual-helix governance framework that reframes limitations such as context constraints and stochasticity as structural governance problems. The framework is implemented as a 3-track architecture (Knowledge, Behavior, Skills) using a knowledge graph to stabilize execution and a self-learning cycle for autonomous knowledge growth. The approach was applied to the FutureShorelines WebGIS tool, resulting in a 51% reduction in cyclomatic complexity and a 7-point increase in maintainability index. The study shows that externalized governance, rather than model capability alone, enhances operational reliability in geospatial engineering.

论文提出了一种双螺旋治理框架，以解决开发WebGIS中可靠的代理AI所面临的挑战，如上下文约束和随机性等问题，将其重新定义为结构治理问题。该框架采用知识、行为和技能三轨架构，利用知识图谱稳定执行，并结合自我学习循环实现自主知识增长。该方法应用于FutureShorelines WebGIS工具，实现了代码复杂度51%的降低和可维护性指数7分的提升。研究表明，外部化治理而非模型能力本身提升了地理空间工程的操作可靠性。

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, Victor Zhong

First: 2026-03-04T18:47:26+00:00 · Latest: 2026-03-04T18:47:26+00:00

Abstract

Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50% with conventional embedding models twice its size, and 37% with BM25. Code and data are available at: https://texttron.github.io/AgentIR/.

中文标题/摘要

标题：AgentIR：面向深度研究代理的推理感知检索

深度研究代理正迅速成为现代检索系统的首要消费者。与人类用户通过不断调整查询而不记录其中间思维过程不同，深度研究代理在每次搜索调用前都会生成明确的自然语言推理，揭示出现有检索器完全忽略的丰富意图和上下文信息。为了利用这一被忽视的信号，我们引入了：(1) 具备推理意识的检索，这是一种检索范式，将代理的推理轨迹与查询一起联合嵌入；(2) DR-Synth，一种从标准问答数据集中生成深度研究检索训练数据的方法。我们证明了这两个组件各自有效，结合使用后产生了训练嵌入模型AgentIR-4B，取得了显著的提升。在具有挑战性的BrowseComp-Plus基准测试中，使用开放权重代理Tongyi-DeepResearch的AgentIR-4B达到了68%的准确率，而传统的两倍大小的嵌入模型仅为50%，BM25仅为37%。代码和数据可在：https://texttron.github.io/AgentIR/ 获取。

Summary / 总结

The research aims to improve retrieval for Deep Research agents by exploiting the explicit reasoning they generate before each search call. The method combines Reasoning-Aware Retrieval, which jointly embeds the agent's reasoning trace alongside its query, with DR-Synth, a data synthesis technique that generates retriever training data from standard QA datasets. Both components are independently effective, and together they yield AgentIR-4B, which achieves 68% accuracy on the BrowseComp-Plus benchmark with the Tongyi-DeepResearch agent, compared to 50% for conventional embedding models twice its size and 37% for BM25.

研究旨在通过纳入深度研究代理的推理过程来提升检索系统的性能。方法引入了推理感知检索，该方法将代理的推理轨迹与查询一起嵌入，以及DR-Synth，一种从标准问答数据集中生成训练数据的方法。结果显示，推理感知检索和DR-Synth各自独立地提高了检索性能，它们的结合在挑战性的BrowseComp-Plus基准测试中显著优于传统模型，准确率达到68%，而传统模型和BM25的准确率分别为50%和37%。
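
A toy sketch can show why embedding the reasoning trace together with the query changes what is retrieved. It makes two loud assumptions: the "embedder" is a bag-of-words counter standing in for a trained model such as AgentIR-4B, and joint embedding is approximated by simple concatenation. Neither is taken from the paper, and the documents are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedder: bag-of-words counts (a real system would use a trained model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, reasoning=None, top_k=1):
    """Rank docs against the query, optionally prefixed by the agent's reasoning trace."""
    text = query if reasoning is None else f"{reasoning} {query}"
    q = embed(text)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]

docs = [
    "jaguar speed in the wild cat habitat",
    "jaguar car engine top speed specifications",
]
query = "jaguar top speed"
reasoning = "the user asks about the animal so I need wild cat habitat facts"

query_only = retrieve(query, docs)
with_reasoning = retrieve(query, docs, reasoning=reasoning)
```

In this example the bare query is ambiguous and matches the car document best, while the reasoning trace supplies the intent that pulls the animal document to the top.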

Composition-Grounded Data Synthesis for Visual Reasoning

Authors: Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He

First: 2025-10-16T18:00:48+00:00 · Latest: 2026-03-04T18:45:57+00:00

Comments: ICLR2026 camera-ready version. Project page: https://cogsynthesis.github.io

Abstract

Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded data Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.

中文标题/摘要

标题：基于组成驱动的数据合成以增强视觉推理能力

预训练的多模态大型语言模型（MLLMs）在多种多模态任务中表现出色，但在难以收集注释的领域中推理能力仍然有限。本文我们关注人工图像领域，如图表、渲染文档和网页，这些领域在实践中丰富但缺乏大规模的人工注释推理数据集。我们引入了COGS（基于组成的数据合成），这是一种数据高效框架，可以从少量种子问题中赋予MLLMs高级推理能力。核心思想是将每个种子问题分解为基本感知和推理因素，然后系统地重新组合新图像以生成大量合成的问答对。每个生成的问题都配以子问题和中间答案，这使得因子级过程奖励的强化学习成为可能。在图表推理实验中，COGS在未见过的问题上显著提高了性能，特别是在推理密集和组合性问题上取得了最大的改进。此外，使用不同种子数据的因子级混合进行训练在多个数据集上表现出更好的迁移性，表明COGS诱导了可泛化的功能而非数据集特定的过拟合。我们进一步证明了该框架不仅适用于图表，还可以扩展到其他领域，如网页。

Summary / 总结

This work addresses the limited reasoning ability of multi-modal large language models in visual domains where annotations are hard to obtain. It introduces COGS, a data-efficient framework that decomposes seed questions into primitive perception and reasoning factors and systematically recomposes them with new images to generate synthetic question-answer pairs. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, especially reasoning-heavy and compositional ones, and the authors further show that the framework extends beyond charts to other domains such as webpages.

该研究针对多模态大型语言模型在难以收集注解的任务中的视觉推理能力有限的问题，引入了COGS框架，该框架将种子问题分解为感知和推理因素，然后重新组合这些因素与新图像生成合成的问答对。实验结果显示，COGS在未见过的问题上显著提高了性能，尤其是在推理密集和组合性问题上，并表明其具备泛化的推理能力而非数据集特定的过拟合。

TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning

Authors: Maximilian von Klinski, Maximilian Schall

Venue: WACV 2026

First: 2026-03-04T18:45:35+00:00 · Latest: 2026-03-04T18:45:35+00:00

Comments: Accepted at WACV 2026

Abstract

Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar species within the same genus or family. We introduce TaxonRL, a reinforcement learning approach using Group Relative Policy Optimization with intermediate rewards that decomposes the reasoning process into hierarchical taxonomic predictions. Our method incentivizes models to explicitly reason about species-level, genus-level, and family-level features before making final classifications. This structured approach is designed not only to boost accuracy but also to yield a transparent, verifiable decision-making process. On the challenging Birds-to-Words dataset, TaxonRL achieves 91.7% average accuracy, exceeding human performance (77.3%) while generating interpretable reasoning traces. We demonstrate strong cross-domain generalization, showing substantial gains in primate and marine species verification. Our results establish that enforcing structured, hierarchical reasoning provides a powerful and transferable framework for fine-grained visual discrimination.

中文标题/摘要

标题：TaxonRL：使用中间奖励的强化学习进行可解释的细粒度视觉推理

传统的视觉-语言模型在对比细粒度分类谱系推理方面存在困难，尤其是在区分同一属或同一科中的视觉相似物种时。我们提出了TaxonRL，这是一种使用组相对策略优化的强化学习方法，并使用中间奖励将推理过程分解为层次分类预测。我们的方法激励模型在最终分类之前明确地推理物种级、属级和科级特征。这种结构化方法不仅旨在提高准确性，还旨在产生透明且可验证的决策过程。在具有挑战性的鸟类到词语数据集上，TaxonRL 达到了 91.7% 的平均准确率，超过了人类表现（77.3%），同时生成了可解释的推理轨迹。我们展示了强大的跨域泛化能力，在灵长类和海洋物种验证中取得了显著进步。我们的结果表明，强制执行结构化、分层推理为细粒度视觉区分提供了一个强大且可转移的框架。

Summary / 总结

TaxonRL is a reinforcement learning method that uses intermediate rewards to decompose fine-grained taxonomic reasoning into hierarchical steps, improving accuracy and interpretability. On the Birds-to-Words dataset, TaxonRL achieves 91.7% average accuracy, surpassing human performance. It also shows strong generalization to other species, demonstrating the effectiveness of structured hierarchical reasoning for fine-grained visual discrimination.

研究旨在提高视觉-语言模型在细粒度分类推理方面的表现，特别是区分视觉上相似的物种。TaxonRL 使用强化学习和中间奖励来将推理过程分解为层次分类预测。该方法在 Birds-to-Words 数据集上达到了91.7%的平均准确率，超过了人类的表现，并生成了可解释的推理轨迹。它还在灵长类和海洋物种验证中展示了强大的跨域泛化能力。
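
The idea of intermediate rewards over hierarchical predictions can be sketched as a reward function that gives partial credit for each taxonomic level the model gets right before its final answer. The level names follow the abstract, but the weights, the dictionary layout, and the function name are hypothetical; the paper's actual reward design may differ.

```python
# Hypothetical weights: partial credit per level, most credit for the final answer.
DEFAULT_WEIGHTS = {"family": 0.1, "genus": 0.2, "species": 0.2, "answer": 0.5}

def taxon_reward(pred, gold, weights=None):
    """Sum the weights of every level where the prediction matches the reference.

    pred and gold are dicts keyed by level name; a level missing from pred
    earns no credit.
    """
    weights = DEFAULT_WEIGHTS if weights is None else weights
    return sum(
        w for level, w in weights.items()
        if level in pred and pred[level] == gold[level]
    )

gold = {"family": "Corvidae", "genus": "Corvus", "species": "Corvus corax", "answer": "same"}
fully_correct = dict(gold)
partly_correct = {"family": "Corvidae", "genus": "Corvus",
                  "species": "Corvus corone", "answer": "different"}

full_reward = taxon_reward(fully_correct, gold)
partial_reward = taxon_reward(partly_correct, gold)
```

A response that reasons correctly down to the genus still earns some reward even when its final answer is wrong, which gives the policy a learning signal for the intermediate steps.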

Robustness of Agentic AI Systems via Adversarially-Aligned Jacobian Regularization

Authors: Furkan Mumcu, Yasin Yilmaz

First: 2026-03-04T18:41:45+00:00 · Latest: 2026-03-04T18:41:45+00:00

Abstract

As Large Language Models (LLMs) transition into autonomous multi-agent ecosystems, robust minimax training becomes essential yet remains prone to instability when highly non-linear policies induce extreme local curvature in the inner maximization. Standard remedies that enforce global Jacobian bounds are overly conservative, suppressing sensitivity in all directions and inducing a large Price of Robustness. We introduce Adversarially-Aligned Jacobian Regularization (AAJR), a trajectory-aligned approach that controls sensitivity strictly along adversarial ascent directions. We prove that AAJR yields a strictly larger admissible policy class than global constraints under mild conditions, implying a weakly smaller approximation gap and reduced nominal performance degradation. Furthermore, we derive step-size conditions under which AAJR controls effective smoothness along optimization trajectories and ensures inner-loop stability. These results provide a structural theory for agentic robustness that decouples minimax stability from global expressivity restrictions.

中文标题/摘要

标题：通过对抗对齐雅可比正则化实现代理型人工智能系统的鲁棒性

随着大型语言模型（LLMs）过渡到自主多智能体生态系统，鲁棒的最小最大训练变得至关重要，但在高度非线性的策略导致内部最大化出现极端局部曲率时，仍易受不稳定性的困扰。标准的补救措施通过施加全局雅可比边界过于保守，抑制了所有方向的敏感性，并导致了鲁棒性价格的大幅增加。我们引入了对抗对齐雅可比正则化（AAJR），这是一种轨迹对齐的方法，严格控制敏感性沿对抗上升方向。我们证明，在温和条件下，AAJR 比全局约束提供了更大的可接受策略类，意味着更小的近似差距和名义性能退化减少。此外，我们推导了步长条件，使得AAJR 控制优化轨迹的有效光滑度并确保内部循环的稳定性。这些结果为代理型鲁棒性提供了一种结构理论，将最小最大稳定性与全局表达性限制脱钩。

Summary / 总结

The research aims to enhance the robustness of agentic AI systems in autonomous multi-agent ecosystems by addressing the instability caused by highly non-linear policies. The study introduces Adversarially-Aligned Jacobian Regularization (AAJR), which controls sensitivity strictly along adversarial ascent directions. Key findings include a larger admissible policy class, a reduced approximation gap, and effective smoothness control, leading to improved inner-loop stability and minimax robustness without overly conservative global constraints.

研究旨在通过解决高度非线性策略引起的不稳定性，增强自主多智能体生态系统中代理AI系统的鲁棒性。方法引入了对抗对齐雅可比正则化（AAJR），严格沿对抗上升方向控制灵敏度。关键发现包括在温和条件下更大的可接受策略类，较小的近似间隙，以及在特定步长条件下确保内循环稳定性，从而提供了一个将最小最大稳定性与全局表达性限制解耦的结构理论。
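
The contrast between a global Jacobian bound and a direction-aligned one can be made concrete with a small numerical sketch. It is an illustration under stated assumptions rather than the AAJR objective from the paper: the policy is a hypothetical two-layer tanh network, the Jacobian is estimated by central finite differences, and the adversarial ascent direction is taken to be the input gradient of a squared-error loss.

```python
import numpy as np

def policy(x, W1, W2):
    """Hypothetical non-linear policy: a two-layer tanh network."""
    return W2 @ np.tanh(W1 @ x)

def jacobian(f, x, h=1e-5):
    """Central finite-difference Jacobian of f at x, shape (out_dim, in_dim)."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))
    return np.stack(cols, axis=1)

def global_penalty(J):
    """Squared Frobenius norm: penalizes sensitivity in every input direction."""
    return float(np.sum(J ** 2))

def aligned_penalty(J, direction):
    """Squared norm of J applied to one unit direction only."""
    v = direction / np.linalg.norm(direction)
    return float(np.sum((J @ v) ** 2))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(6, 4)), rng.normal(size=(3, 6))
x, target = rng.normal(size=4), rng.normal(size=3)

f = lambda z: policy(z, W1, W2)
J = jacobian(f, x)
# Assumed ascent direction: gradient of 0.5 * ||f(x) - target||^2 with respect to x.
ascent = J.T @ (f(x) - target)

g_pen = global_penalty(J)
a_pen = aligned_penalty(J, ascent)
```

For any unit direction the aligned penalty is bounded above by the global one, so constraining only the adversarial direction is never more restrictive than constraining the full Jacobian. This matches the abstract's claim of a larger admissible policy class in spirit, though the formal conditions are in the paper.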

Unsupervised Representation Learning - an Invariant Risk Minimization Perspective

Authors: Yotam Norman, Ron Meir

First: 2025-05-18T17:54:23+00:00 · Latest: 2026-03-04T18:35:33+00:00

Abstract

We propose a novel unsupervised framework for Invariant Risk Minimization (IRM), extending the concept of invariance to settings where labels are unavailable. Traditional IRM methods rely on labeled data to learn representations that are robust to distributional shifts across environments. In contrast, our approach redefines invariance through feature distribution alignment, enabling robust representation learning from unlabeled data. We introduce two methods within this framework: Principal Invariant Component Analysis (PICA), a linear method that extracts invariant directions under Gaussian assumptions, and Variational Invariant Autoencoder (VIAE), a deep generative model that separates environment-invariant and environment-dependent latent factors. Our approach is based on a novel "unsupervised" structural causal model and supports environment-conditioned sample-generation and intervention. Empirical evaluations on a synthetic dataset, modified versions of MNIST, and CelebA demonstrate the effectiveness of our methods in capturing invariant structure, preserving relevant information, and generalizing across environments without access to labels.

中文标题/摘要

标题：无监督表示学习——不变风险最小化视角

我们提出了一种新的无监督框架，用于不变风险最小化（IRM），将不变性的概念扩展到标签不可用的设置中。传统的IRM方法依赖于带标签的数据来学习在环境分布变化时鲁棒的表示。相比之下，我们的方法通过特征分布对齐重新定义不变性，从而能够从无标签数据中学习鲁棒的表示。我们在此框架中引入了两种方法：主不变成分分析（PICA），这是一种在高斯假设下提取不变方向的线性方法，以及变分不变自编码器（VIAE），这是一种分离环境不变和环境依赖潜在因子的深度生成模型。我们的方法基于一种新颖的“无监督”结构因果模型，并支持环境条件下的样本生成和干预。在合成数据集、修改后的MNIST版本和CelebA上的实证评估表明，我们的方法在捕获不变结构、保留相关信息以及在无标签情况下跨环境泛化方面具有有效性。

Summary / 总结

The paper proposes an unsupervised framework for Invariant Risk Minimization (IRM) that learns robust representations from unlabeled data. It introduces two methods: PICA, a linear method for extracting invariant directions under Gaussian assumptions, and VIAE, a deep generative model that separates environment-invariant and environment-dependent latent factors. Experiments on a synthetic dataset, modified versions of MNIST, and CelebA show that the proposed methods capture invariant structure, preserve relevant information, and generalize across environments without access to labels.

论文提出了一种无监督的不变风险最小化（IRM）框架，用于从无标签数据中学习稳健的表示。它引入了两种方法：PICA，一种在高斯假设下提取不变方向的线性方法，以及VIAE，一种将不变和环境特定的潜在因素分离的深度生成模型。实验结果表明，这些方法能够有效地捕捉不变结构，保留相关信息，并在不同环境中表现出良好的泛化能力，无需使用标签数据。

Detecting AI-Generated Essays in Writing Assessment: Responsible Use and Generalizability Across LLMs

Authors: Jiangang Hao

First: 2026-03-02T19:51:01+00:00 · Latest: 2026-03-04T18:35:26+00:00

Comments: 21 pages, 2 figures

Abstract

Writing is a foundational literacy skill that underpins effective communication, fosters critical thinking, facilitates learning across disciplines, and enables individuals to organize and articulate complex ideas. Consequently, writing assessment plays a vital role in evaluating language proficiency, communicative effectiveness, and analytical reasoning. The rapid advancement of large language models (LLMs) has made it increasingly easy to generate coherent, high-quality essays, raising significant concerns about the authenticity of student-submitted work. This chapter first provides an overview of the current landscape of detectors for AI-generated and AI-assisted essays, along with guidelines for their responsible use. It then presents empirical analyses to evaluate how well detectors trained on essays from one LLM generalize to identifying essays produced by other LLMs, based on essays generated in response to public GRE writing prompts. These findings provide guidance for developing and retraining detectors for practical applications.

中文标题/摘要

标题：写作评估中AI生成作文的检测：负责任的使用与跨LLM的泛化性

写作是基础的读写技能，支撑着有效的沟通，培养批判性思维，促进跨学科的学习，并使个人能够组织和表达复杂的思想。因此，写作评估在评估语言能力、沟通效果和分析推理方面发挥着重要作用。大型语言模型（LLMs）的迅速发展使得生成连贯、高质量的作文变得越来越容易，这引发了对学生提交作品真实性的重要关切。本章首先概述了当前用于检测AI生成和AI辅助作文的检测器的现状，以及它们的负责任使用指南。然后，通过基于公共GRE写作提示生成的作文进行实证分析，评估了在一种LLM上训练的检测器识别其他LLM生成的作文的能力。这些发现为开发和重新训练适用于实际应用的检测器提供了指导。

Summary / 总结

The chapter addresses the challenge of detecting AI-generated essays in writing assessment, now that large language models (LLMs) make it easy to produce coherent, high-quality essays. It first surveys current detectors for AI-generated and AI-assisted essays and gives guidelines for their responsible use. It then evaluates how well detectors trained on essays from one LLM generalize to essays produced by other LLMs, using essays written in response to public GRE writing prompts. The findings are intended to guide the development and retraining of detectors for practical applications.

研究关注检测AI生成的作文在写作评估中的挑战，强调写作技能的重要性以及AI可能对学生成绩的真实性带来的影响。研究评估了在一种大型语言模型（LLM）上训练的检测器，将其应用于识别其他LLM生成的作文的效果，使用了公共GRE写作提示。关键发现表明，针对一种LLM训练的检测器可能无法有效应用于其他LLM，强调了进一步开发和重新训练检测器以适应实际应用的必要性。

$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

Authors: Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, Victor Barres

First: 2026-03-04T18:34:47+00:00 · Latest: 2026-03-04T18:34:47+00:00

Comments: 29 pages (10 main + 19 appendix)

Abstract

Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $τ$-Knowledge, an extension of $τ$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $τ$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only ~25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, $τ$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.

中文标题/摘要

标题：$τ$-知识：评估对话代理在非结构化知识上的表现

对话代理在知识密集型环境中越来越被部署，正确的行为依赖于在与用户实时交互过程中从大型、专有且非结构化的语料库中检索和应用特定领域的知识。然而，现有的大多数基准测试独立地评估检索或工具使用，这在长时间交互中创建了一个现实的、全面的评估缺口。我们引入了$τ$-知识，这是$τ$-基准的扩展，用于评估代理在环境中表现，其中成功依赖于协调外部自然语言知识与工具输出，以产生可验证的、符合政策的状态变化。我们的新领域$τ$-银行业，模拟了现实的金融科技客户服务工作流程，在这些流程中，代理必须在执行工具介导的账户更新的同时导航大约700个相互关联的知识文档。即使在基于嵌入的检索和基于终端的搜索中，最先进的模型即使有较高的推理预算，也只能达到约25.5%的通过率，可靠性在多次试验中急剧下降。代理难以从紧密关联的知识库中检索正确的文档，并且难以准确地在复杂的内部政策上进行推理。总体而言，$τ$-知识为开发能够整合非结构化知识的人机交互代理提供了现实的测试平台。

Summary / 总结

The research aims to evaluate conversational agents in knowledge-intensive settings where they must retrieve and apply unstructured, domain-specific knowledge during live interactions. The method extends $τ$-Bench into $τ$-Knowledge with a new domain, $τ$-Banking, in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Even frontier models with high reasoning budgets achieve only about 25.5% pass^1, and reliability degrades sharply over repeated trials, indicating significant challenges in retrieving the correct documents and reasoning over complex policies.

论文提出了$τ$-Knowledge，这是$τ$-Bench的扩展，用于评估对话代理在知识密集型环境中的表现，特别是在金融科技客服支持方面。方法涉及一个新领域$τ$-Banking，模拟了代理需要导航和使用未结构化的知识文档和工具来更新账户的现实流程。关键发现表明，即使是最先进的模型也只能达到约25.5%的成功率，突显了在复杂且紧密相连的知识库中检索和推理的挑战。
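
The pass^1 figure and the remark that reliability degrades over repeated trials can be made concrete with a short sketch of a pass^k estimator. The abstract does not define the metric, so this assumes the definition commonly associated with the original $τ$-Bench: pass^k is the probability that k independent trials of a task all succeed, estimated from n trials with c successes as C(c, k) / C(n, k). The function names and the sample numbers are hypothetical.

```python
from math import comb

def pass_hat_k(num_trials, num_successes, k):
    """Estimate the chance that k trials drawn from the observed ones all succeed."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    return comb(num_successes, k) / comb(num_trials, k)

def benchmark_pass_hat_k(results, k):
    """Average pass^k over tasks; results is a list of (num_trials, num_successes)."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)

# One always-solved task and one task solved in 2 of 4 trials.
results = [(4, 4), (4, 2)]
pass_1 = benchmark_pass_hat_k(results, 1)   # 0.75
pass_2 = benchmark_pass_hat_k(results, 2)   # 7/12
```

Under this definition pass^1 is the plain average success rate, and pass^k can only stay the same or fall as k grows, which is why it is a natural way to report reliability across repeated trials.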

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Authors: Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng

First: 2026-03-04T18:29:54+00:00 · Latest: 2026-03-04T18:29:54+00:00

Abstract

Multimodal web agents that process both screenshots and accessibility trees are increasingly deployed to interact with web interfaces, yet their dual-stream architecture opens an underexplored attack surface: an adversary who injects content into the webpage DOM simultaneously corrupts both observation channels with a consistent deceptive narrative. Our vulnerability analysis on MiniWob++ reveals that attacks including a visual component far outperform text-only injections, exposing critical gaps in text-centric VLM safety training. Motivated by this finding, we propose Dual-Modality Multi-Stage Adversarial Safety Training (DMAST), a framework that formalizes the agent-attacker interaction as a two-player zero-sum Markov game and co-trains both players through a three-stage pipeline: (1) imitation learning from a strong teacher model, (2) oracle-guided supervised fine-tuning that uses a novel zero-acknowledgment strategy to instill task-focused reasoning under adversarial noise, and (3) adversarial reinforcement learning via Group Relative Policy Optimization (GRPO) self-play. On out-of-distribution tasks, DMAST substantially mitigates adversarial risks while simultaneously doubling task completion efficiency. Our approach significantly outperforms established training-based and prompt-based defenses, demonstrating genuine co-evolutionary progress and robust generalization to complex, unseen environments.

中文标题/摘要

标题：双模态多阶段对抗安全训练：增强多模态网络代理对抗跨模态攻击的鲁棒性

处理屏幕截图和无障碍树的多模态网络代理越来越多地被部署以与网页界面交互，但其双流架构打开了一个未被充分探索的攻击面：攻击者同时向网页DOM注入内容，会以一致的欺骗性叙述同时破坏两个观察通道。我们对MiniWob++的漏洞分析表明，包含视觉成分的攻击远优于仅包含文本的注入，暴露了以文本为中心的VLM安全训练中的关键漏洞。受此发现的启发，我们提出了双模态多阶段对抗安全训练（DMAST）框架，将代理-攻击者交互形式化为一个两玩家零和马尔可夫博弈，并通过三阶段流水线共同训练两个玩家：（1）从强大教师模型中学习模仿，（2）使用新颖的零确认策略的oracle引导监督微调，以在对抗噪声下培养任务导向的推理，（3）通过Group Relative Policy Optimization（GRPO）自博弈的对抗强化学习。在分布外任务中，DMAST显著减轻了对抗风险，同时将任务完成效率翻倍。我们的方法显著优于现有的基于训练和基于提示的防御，展示了真正的共生进步和对复杂、未见过的环境的强大泛化能力。

Summary / 总结

The paper addresses the vulnerability of multimodal web agents that process both screenshots and accessibility trees, which can be attacked by adversaries injecting content into the webpage DOM. It proposes DMAST, a framework that includes imitation learning, supervised fine-tuning with a zero-acknowledgment strategy, and adversarial reinforcement learning. DMAST effectively mitigates adversarial risks and improves task completion efficiency on out-of-distribution tasks, outperforming existing defenses.

论文针对处理屏幕截图和无障碍树的多模态网络代理易受网页DOM注入攻击的问题，提出了DMAST框架，该框架包括模仿学习、基于oracle的监督微调和对抗强化学习。该方法显著降低了对抗风险并提高了任务完成效率，在分布外任务中优于现有方法。
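
Stage (3) relies on Group Relative Policy Optimization, whose core step is to score each sampled rollout relative to the other rollouts in its group instead of using a learned value baseline. The sketch below shows that normalization in its commonly described form and adds a zero-sum twist by giving the attacker the negated agent reward. The reward values are invented, and how DMAST actually shapes rewards is not stated in the abstract.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for four agent rollouts on one task under attack.
agent_rewards = [1.0, 0.0, 0.5, 0.0]
# Zero-sum assumption: the attacker gains exactly what the agent loses.
attacker_rewards = [-r for r in agent_rewards]

agent_adv = group_relative_advantages(agent_rewards)
attacker_adv = group_relative_advantages(attacker_rewards)
```

Rollouts that beat the group average get positive advantages and the rest get negative ones; under the zero-sum assumption the attacker's advantages are the mirror image of the agent's.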

CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field

Authors: Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre

First: 2025-11-05T13:02:06+00:00 · Latest: 2026-03-04T18:27:25+00:00

Comments: Accepted at LREC 2026. To access the dataset, see https://github.com/bonzid/CareMedEval

Abstract

Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.

中文标题/摘要

标题：CareMedEval 数据集：评估生物医学领域的批判性评估与推理

批判性评估科学文献是生物医学领域的一项基本技能。虽然大型语言模型（LLMs）在这一任务中提供了有希望的支持，但它们的可靠性仍然有限，特别是在专门领域的批判性推理方面。我们介绍了CareMedEval，这是一个原创数据集，旨在评估LLMs在生物医学批判性评估和推理任务中的表现。该数据集源自法国医学生的真实考试，包含基于37篇科学文章的534个问题。与现有的基准不同，CareMedEval明确评估了基于科学论文的批判性阅读和推理。在不同上下文条件下对最先进的通用和生物医学专业化LLMs进行基准测试揭示了任务的难度：尽管生成中间推理令牌能显著改善结果，开源和商用模型的精确匹配率仍无法超过0.5。然而，模型在关于研究局限性和统计分析的问题上仍然面临挑战。CareMedEval为基于文献的推理提供了具有挑战性的基准，揭示了当前LLM的局限性，并为未来开发自动支持批判性评估铺平了道路。

Summary / 总结

The research aims to evaluate the critical appraisal and reasoning skills of large language models (LLMs) in the biomedical field. The study introduces CareMedEval, a dataset derived from authentic exams taken by French medical students, containing 534 questions based on 37 scientific articles. Benchmarking generalist and biomedical-specialized LLMs on this dataset shows that no open or commercial model exceeds an Exact Match Rate of 0.5, even though generating intermediate reasoning tokens considerably improves results; questions about study limitations and statistical analysis remain especially difficult. This highlights the need for further development of automated support for critical appraisal in specialized domains.

研究旨在评估大型语言模型（LLMs）在生物医学领域的批判性评估和推理能力。研究引入了CareMedEval数据集，该数据集来源于法国医学生的真实考试，包含基于37篇科学文章的534个问题。对通用和专门化LLM在该数据集上的基准测试显示，尽管中间推理令牌能显著改善结果，这些模型的精确匹配率仍未超过0.5，在关于研究局限性和统计分析的问题上尤其困难。这表明需要进一步开发自动化支持以提高批判性评估的能力。
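
To show why an Exact Match Rate is a strict metric, here is a minimal sketch. It assumes the questions are multiple-choice with possibly several correct options, so that a prediction counts only when its set of selected options equals the reference set exactly. The abstract does not spell out the scoring rule, so treat this as one common reading; the function name and sample answers are hypothetical.

```python
def exact_match_rate(predictions, references):
    """Fraction of questions whose predicted option set equals the reference set."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    hits = sum(set(p) == set(r) for p, r in zip(predictions, references))
    return hits / len(references)

references = [["A", "C"], ["B", "D"], ["A", "B"]]
predictions = [["C", "A"], ["B"], ["A", "B"]]   # the second answer misses option D

emr = exact_match_rate(predictions, references)
```

Option order does not matter, but a partially correct selection earns nothing, which helps explain how capable models can still score below 0.5.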

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Authors: Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos

First: 2026-02-28T12:10:58+00:00 · Latest: 2026-03-04T18:26:58+00:00

Abstract

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.

中文标题/摘要

标题：CMI-RewardBench：基于组合多模态指令评估音乐奖励模型

尽管音乐生成模型已经能够处理混合文本、歌词和参考音频的复杂多模态输入，但评估机制却落后了。本文通过建立基于组合多模态指令（CMI）的音乐奖励建模综合生态系统，填补了这一关键空白，其中生成的音乐可以基于文本描述、歌词和音频提示进行条件化。我们首先介绍了包含110,000个伪标签样本的CMI-Pref-Pseudo大规模偏好数据集，以及一个针对细粒度对齐任务的人类注释高质量语料库CMI-Pref。为了统一评估框架，我们提出了CMI-RewardBench统一基准，该基准在音乐性、文本-音乐对齐和组合指令对齐方面对音乐奖励模型进行评估。利用这些资源，我们开发了CMI奖励模型（CMI-RMs），这是一种参数高效的奖励模型家族，能够处理异构输入。我们评估了它们与人类判断得分在音乐性和对齐方面的相关性，以及与先前数据集的对齐情况。进一步的实验表明，CMI-RM 不仅与人类判断高度相关，还通过top-k过滤实现了有效的推理时缩放。训练数据、基准和奖励模型均已公开。

Summary / 总结

This paper addresses the lag in evaluating music generation models by introducing CMI-RewardBench, a unified benchmark for music reward models under Compositional Multimodal Instruction (CMI) that covers musicality, text-music alignment, and compositional instruction alignment. It is accompanied by CMI-Pref-Pseudo, a preference dataset of 110k pseudo-labeled samples, and CMI-Pref, a high-quality human-annotated corpus. The proposed CMI reward models (CMI-RMs) correlate strongly with human judgments of musicality and alignment and enable effective inference-time scaling via top-k filtering.

本文通过引入面向Compositional Multimodal Instruction (CMI)的统一基准CMI-RewardBench，解决了音乐生成模型评估的空白问题。该基准包括大规模偏好数据集CMI-Pref-Pseudo和高质量的人工标注语料CMI-Pref。CMI奖励模型（CMI-RMs）参数高效，能够处理异构输入，显示出与音乐性和对齐的人类判断高度相关。实验还表明，通过top-k过滤可以实现有效的推理时缩放。
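
The inference-time scaling via top-k filtering mentioned above can be sketched as follows: generate several candidates, score each with a reward model, and keep the k best. This is a minimal sketch and not the released CMI-RM interface; `reward_fn` is a hypothetical stand-in for whatever call scores a generated clip against its instruction.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def top_k_filter(
    candidates: Sequence[T],
    reward_fn: Callable[[T], float],  # hypothetical scorer, e.g. a reward model call
    k: int,
) -> List[Tuple[float, T]]:
    """Score every candidate and return the k highest-scoring (score, candidate) pairs."""
    if k <= 0:
        raise ValueError("k must be positive")
    scored = [(reward_fn(candidate), candidate) for candidate in candidates]
    # Sort by score only, so candidates themselves never need to be comparable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In practice the candidates would be generated audio clips and `reward_fn` would wrap a reward model; the toy usage in the test block uses strings and `len` purely to exercise the selection logic.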

Dissecting Quantization Error: A Concentration-Alignment Perspective

Authors: Marco Federici, Boris van Breugel, Paul Whatmough, Markus Nagel

First: 2026-03-04T18:26:24+00:00 · Latest: 2026-03-04T18:26:24+00:00

Abs · PDF · Code1 · Code2

Abstract

Quantization can drastically increase the efficiency of large language and vision models, but typically incurs an accuracy drop. Recently, function-preserving transforms (e.g. rotations, Hadamard transform, channel-wise scaling) have been successfully applied to reduce post-training quantization error, yet a principled explanation remains elusive. We analyze linear-layer quantization via the signal-to-quantization-noise ratio (SQNR), showing that for uniform integer quantization at a fixed bit width, SQNR decomposes into (i) the concentration of weights and activations (capturing spread and outliers), and (ii) the alignment of their dominant variation directions. This reveals an actionable insight: beyond concentration - the focus of most prior transforms (e.g. rotations or Hadamard) - improving alignment between weight and activation can further reduce quantization error. Motivated by this, we introduce block Concentration-Alignment Transforms (CAT), a lightweight linear transformation that uses a covariance estimate from a small calibration set to jointly improve concentration and alignment, approximately maximizing SQNR. Experiments across several LLMs show that CAT consistently matches or outperforms prior transform-based quantization methods at 4-bit precision, confirming the insights gained in our framework.

中文标题/摘要

标题：解析量化误差：集中度-对齐度视角

量化可以大幅提高大型语言和视觉模型的效率，但通常会带来准确率下降。最近，函数保持变换（例如旋转、哈达玛变换、逐通道缩放）已被成功应用于减少训练后量化误差，但仍缺乏有原则的解释。我们通过信号量化噪声比（SQNR）分析了线性层量化，表明对于固定位宽的均匀整数量化，SQNR可以分解为（i）权重和激活的集中度（刻画分布范围和异常值），以及（ii）它们主导变化方向的对齐度。这揭示了一个可操作的见解：除了集中度（这是大多数先前变换（例如旋转或哈达玛变换）的重点），改善权重和激活之间的对齐度可以进一步减少量化误差。受此启发，我们引入了块集中-对齐变换（CAT），这是一种轻量级线性变换，使用小校准集的协方差估计来同时改善集中度和对齐度，近似最大化SQNR。在多个LLM上的实验表明，CAT在4位精度下始终能够匹配或超越先前基于变换的量化方法，证实了我们框架中获得的见解。

Summary / 总结

This paper analyzes quantization error in the linear layers of large language and vision models through the signal-to-quantization-noise ratio (SQNR). For uniform integer quantization at a fixed bit width, it decomposes SQNR into a concentration factor and an alignment factor, suggesting that improving alignment between weights and activations can reduce quantization error beyond what concentration-focused transforms achieve. The authors propose block Concentration-Alignment Transforms (CAT), a lightweight linear transformation that jointly improves concentration and alignment, and report that it consistently matches or outperforms previous transform-based quantization methods at 4-bit precision across several LLMs.

本文通过分析信号到量化噪声比（SQNR）来研究大型语言和视觉模型中的量化误差，将SQNR分解为权重和激活的集中度以及它们主要变化方向的对齐度两个部分。基于这一分析，作者提出了Block Concentration-Alignment Transforms（CAT），该方法联合提高集中度和对齐度以最大化SQNR。实验结果显示，CAT在4位精度下优于或匹配了之前的基于变换的量化方法，验证了提出的框架。
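
To make the quantities in this entry concrete, the sketch below measures per-tensor SQNR under symmetric uniform integer quantization and applies an orthonormal Hadamard transform, one of the function-preserving transforms the abstract mentions. It is an illustration under stated assumptions, not the paper's CAT method or its output-level SQNR decomposition: the per-tensor scale, the helper names, and the choice to transform both a weight row and an activation vector are all assumptions made here. Whether the transform raises or lowers SQNR depends on the data.

```python
import math
from typing import List, Sequence


def quantize_uniform(values: Sequence[float], bits: int = 4) -> List[float]:
    """Symmetric uniform integer quantization with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / qmax
    # Round to the integer grid, clamp to the representable range, then dequantize.
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def sqnr_db(signal: Sequence[float], quantized: Sequence[float]) -> float:
    """Signal-to-quantization-noise ratio in decibels."""
    signal_power = sum(s * s for s in signal)
    noise_power = sum((s - q) ** 2 for s, q in zip(signal, quantized))
    if noise_power == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


def hadamard_transform(values: Sequence[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    out = list(values)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]
```

Because the normalized Hadamard matrix is orthonormal, transforming both a weight row `w` and an activation vector `x` leaves their inner product unchanged, which is what "function-preserving" means here. Comparing `sqnr_db(x, quantize_uniform(x))` with the same quantity computed on `hadamard_transform(x)` gives a quick per-tensor check on a calibration sample.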

Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Authors: Niamul Hassan Samin, Md Arifur Rahman, Abdullah Ibne Hanif Arean, Juena Ahmed Noshin, Md Ashikur Rahman

First: 2026-02-25T23:08:31+00:00 · Latest: 2026-03-04T18:21:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) often hallucinate objects that are not present in the input image. We identify a contributing cause of this behavior, which we term spatial credit collapse: in early transformer layers, hidden-state activation concentrates on a small number of visual patches, suppressing surrounding contextual evidence and increasing reliance on language priors. Across seven models we observe a strong correlation between visual attention entropy and hallucination rate (r = -0.65, p < 0.001), suggesting that reduced spatial credit diversity contributes to hallucination. To address this issue we propose Spatial Credit Redistribution (SCR), a training-free inference-time method. SCR uses a lightweight two-pass procedure. A diagnostic pass identifies the top-K high-attention source patches and their spatial neighbors. A redistribution pass then scales each source by 1/lambda (~0.91) and injects a (lambda - 1) weighted copy of its hidden state into neighboring patches, restoring suppressed visual context without modifying model weights. Because the diagnostic pass is performed once per image and reused across the output sequence, the added latency is negligible (<0.5 ms per token for 100-token responses). We evaluate SCR across seven model configurations from four VLM families (Chameleon, LLaVA-1.5, Qwen-VL/Qwen2-VL, and InternVL2) on five benchmarks: POPE, CHAIR, MME, HallusionBench, and AMBER. SCR reduces POPE-Adversarial hallucination by 4.6-6.0 percentage points and CHAIR-s by 41-51 percent while preserving caption quality (CIDEr drop <=0.8). Compared with prior inference-time methods including OPERA, VCD, OA-VCD, DoLa, VLI, SID, and CRoPS, SCR achieves a better trade-off between hallucination reduction, generation quality, and latency.

中文标题/摘要

标题：超越主导图像块：面向扎根视觉-语言模型的空间信用再分配

视觉-语言模型（VLMs）经常虚构输入图像中不存在的对象。我们识别出这种行为的一个促成因素，称为空间信用崩溃：在早期的Transformer层中，隐藏状态激活集中在少量的视觉图像块上，抑制了周围上下文证据，并增加了对语言先验的依赖。在七个模型中，我们观察到视觉注意力熵与虚构率之间存在强相关性（r = -0.65，p < 0.001），表明空间信用多样性减少会促进虚构。为解决这一问题，我们提出了空间信用再分配（SCR），这是一种无需训练的推理时方法。SCR 使用一种轻量级的两遍程序。诊断遍识别出注意力最高的前K个源图像块及其空间邻居。再分配遍随后将每个源图像块按1/λ（~0.91）缩放，并将其隐藏状态的（λ - 1）加权副本注入相邻图像块中，从而恢复被抑制的视觉上下文，而不修改模型权重。由于诊断遍对每张图像仅执行一次并在整个输出序列中重复使用，因此增加的延迟可以忽略不计（对于100个标记的响应，每个标记少于0.5毫秒）。我们在来自四个VLM家族（Chameleon、LLaVA-1.5、Qwen-VL/Qwen2-VL、InternVL2）的七个模型配置上，基于五个基准（POPE、CHAIR、MME、HallusionBench、AMBER）进行了评估。SCR 将POPE-Adversarial上的虚构减少了4.6-6.0个百分点，将CHAIR-s降低了41-51%，同时保持了描述质量（CIDEr下降<=0.8）。与先前的推理时方法（包括OPERA、VCD、OA-VCD、DoLa、VLI、SID和CRoPS）相比，SCR 在减少虚构、生成质量和延迟之间实现了更好的权衡。
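
The two-pass procedure described in the abstract can be sketched on a toy patch grid. This is a rough illustration under several assumptions that the abstract does not specify: neighbors are taken to be the 4-connected grid neighbors, the injected copy uses the source's original (unscaled) hidden state, and lambda is set to 1.1 so that 1/lambda is about 0.91. The function names are hypothetical.

```python
from typing import List, Sequence


def grid_neighbors(index: int, height: int, width: int) -> List[int]:
    """Return the 4-connected neighbors of a patch on a height x width grid (assumed)."""
    row, col = divmod(index, width)
    result = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < height and 0 <= c < width:
            result.append(r * width + c)
    return result


def redistribute_credit(
    hidden: Sequence[Sequence[float]],
    attention: Sequence[float],
    height: int,
    width: int,
    top_k: int = 2,
    lam: float = 1.1,  # assumed value; 1 / 1.1 is roughly 0.91
) -> List[List[float]]:
    """Scale top-k source patches by 1/lam and add (lam - 1) copies to their neighbors."""
    n = height * width
    if len(hidden) != n or len(attention) != n:
        raise ValueError("hidden and attention must have height * width entries")
    # Diagnostic pass: pick the top-k patches by attention score.
    sources = sorted(range(n), key=lambda i: attention[i], reverse=True)[:top_k]
    out = [list(vec) for vec in hidden]
    # Redistribution pass, step 1: damp every source patch.
    for s in sources:
        out[s] = [v / lam for v in hidden[s]]
    # Step 2: inject a weighted copy of each source's original state into its neighbors.
    for s in sources:
        for nb in grid_neighbors(s, height, width):
            out[nb] = [o + (lam - 1.0) * v for o, v in zip(out[nb], hidden[s])]
    return out
```

Model weights are never touched; only the hidden states of the selected patches and their neighbors change, which matches the training-free framing of the method.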

Efficient Refusal Ablation in LLM through Optimal Transport

Authors: Geraldin Nanfack, Eugene Belilovsky, Elvis Dohmatob

First: 2026-03-04T18:19:50+00:00 · Latest: 2026-03-04T18:19:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.

中文标题/摘要

标题：通过最优传输实现高效的大语言模型拒绝消融

安全对齐的语言模型通过其内部表示中编码的、经学习得到的拒绝行为来拒绝有害请求。最近的基于激活的越狱方法通过应用正交投影去除拒绝方向来绕过这些安全机制，但这些方法将拒绝视为一维现象，忽略了模型激活的丰富分布结构。我们提出了一种基于最优传输理论的有原则的框架，将有害激活的整个分布转换为与无害激活的分布相匹配。通过结合主成分分析（PCA）和闭式高斯最优传输，我们在高维表示空间中实现了高效的计算，同时保留了基本的几何结构。在六种模型（Llama-2、Llama-3.1、Qwen-2.5；7B-32B参数）上，我们的方法的攻击成功率比最先进的基线方法最多高出11%，同时保持了相当的困惑度，表明其能更好地保留模型能力。关键的是，我们发现层选择性干预（在大约40-60%网络深度处对1-2个精心选择的层应用最优传输）显著优于全网络干预，表明拒绝机制可能是局部化的而不是分布式的。我们的分析提供了关于安全表示几何结构的新见解，并表明当前的对齐方法可能容易受到超出简单方向去除的分布式攻击。

Summary / 总结

This paper proposes an optimal-transport-based method for circumventing the safety mechanisms of language models. Rather than removing a single refusal direction, it transforms the entire distribution of harmful activations to match that of harmless ones, combining PCA with closed-form Gaussian optimal transport to keep computation efficient while preserving geometric structure. Across six models, the approach achieves up to 11% higher attack success rates than state-of-the-art baselines at comparable perplexity. The study also finds that intervening on 1-2 selected layers at roughly 40-60% network depth outperforms full-network intervention, suggesting that refusal mechanisms may be localized rather than distributed.

论文提出了一种基于最优传输理论的新方法，通过将有害激活分布转换为无害激活分布来绕过语言模型的安全机制。该方法在保持模型困惑度的同时实现了更高的攻击成功率。研究发现，针对大约网络深度40-60%处的1-2层进行层选择性干预的效果优于全网络干预，这表明拒绝机制可能是局部化的而非分布式的。

Out-of-distribution transfer of PDE foundation models to material dynamics under extreme loading

Authors: Mahindra Rautela, Alexander Most, Siddharth Mansingh, Aleksandra Pachalieva, Bradley Love, Daniel O Malley, Alexander Scheinker, Kyle Hickmann, Diane Oyen, Nathan Debardeleben, Earl Lawrence, Ayan Biswas

First: 2026-03-04T18:19:35+00:00 · Latest: 2026-03-04T18:19:35+00:00

Abs · PDF · Code1 · Code2

Abstract

Most PDE foundation models are pretrained and fine-tuned on fluid-centric benchmarks. Their utility under extreme-loading material dynamics remains unclear. We benchmark out-of-distribution transfer on two discontinuity-dominated regimes in which shocks, evolving interfaces, and fracture produce highly non-smooth fields: shock-driven multi-material interface dynamics (perturbed layered interface or PLI) and dynamic fracture/failure evolution (FRAC). We formulate the downstream task as terminal-state prediction, i.e., learning a long-horizon map that predicts the final state directly from the first snapshot without intermediate supervision. Using a unified training and evaluation protocol, we evaluate two open-source pretrained PDE foundation models, POSEIDON and MORPH, and compare fine-tuning from pretrained weights against training from scratch across training-set sizes to quantify sample efficiency under distribution shift.

中文标题/摘要

标题：极端加载条件下材料动力学中PDE基础模型的离分布转移

大多数PDE基础模型在流体中心基准上进行预训练和微调。它们在极端加载条件下材料动力学中的实用性尚不清楚。我们对两种以不连续性为主导的领域进行了离分布转移基准测试，在这些领域中，冲击波、演化界面和断裂产生高度非光滑场：冲击波驱动的多材料界面动力学（扰动层状界面或PLI）和动态断裂/失效演化（FRAC）。我们将下游任务定义为终端状态预测，即学习一个长期预测映射，直接从初始快照预测最终状态，而无需中间监督。使用统一的训练和评估协议，我们评估了两个开源预训练PDE基础模型POSEIDON和MORPH，并比较了从预训练权重微调与从零开始训练在不同训练集大小下的样本效率，以量化分布转移下的样本效率。

Summary / 总结

The research aims to evaluate the applicability of PDE foundation models pretrained on fluid dynamics benchmarks for extreme loading material dynamics. The study benchmarks out-of-distribution transfer on two discontinuity-dominated regimes: shock-driven multi-material interface dynamics and dynamic fracture/failure evolution. The downstream task is formulated as terminal-state prediction. Two open-source pretrained PDE models, POSEIDON and MORPH, are evaluated, and the study compares fine-tuning from pretrained weights to training from scratch, quantifying sample efficiency under distribution shift.

研究旨在评估预训练于流体动力学基准上的PDE基础模型在极端加载材料动力学中的适用性。研究在两种不连续主导的领域中进行基准测试：冲击驱动的多材料界面动力学和动态断裂/失效演化。下游任务被定义为终端状态预测。研究评估了两个开源预训练PDE模型POSEIDON和MORPH，并比较了从预训练权重微调与从头训练的效果，量化了在分布偏移下的样本效率。

FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering

Authors: Tatiana Zemskova, Solomon Andryushenko, Ilya Obrubov, Viktoriia Khoruzhaia, Ekaterina Eroshenko, Ekaterina Derevyanka, Dmitry Yudin

First: 2026-03-04T18:14:00+00:00 · Latest: 2026-03-04T18:14:00+00:00

Abs · PDF · Code1 · Code2

Abstract

The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Recently, multimodal LLMs have been gaining popularity for solving the long video understanding task due to their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade, and inference time grows. Therefore, when using MLLMs for long video understanding, a crucial step is selecting key frames from the video to answer user queries. In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips based on their graph-based captions, and a training-free method for selecting keyframes from these clips. Unlike existing methods, the proposed Scene-Caption LLM Selector does not rely on the original sequence of low-resolution frames; instead, it operates on a compact textual representation of the scene. We then design a training-free Patch-wise Sparse-Flow Retention (PSFR) method to select keyframes from the resulting sequence of clips, which are fed into an MLLM to produce the final answer. Together, these components enable FocusGraph to achieve state-of-the-art results on challenging egocentric long-video question answering benchmarks, including FindingDory and HourVideo, while significantly reducing inference time relative to baseline approaches.

中文标题/摘要

标题：FocusGraph：面向具身长视频问答的图结构帧选择

理解长视频的能力对于具身智能体至关重要，因为它们的效果取决于能否有效地积累、组织和利用长时程感知记忆。最近，由于其理解自然语言和利用世界知识的通用能力，多模态LLM在解决长视频理解任务方面受到越来越多的关注。然而，随着提供给MLLM的帧数量增加，其响应质量往往会下降，推理时间也会增长。因此，在使用MLLM进行长视频理解时，关键步骤是从视频中选择关键帧以回答用户查询。在本文中，我们开发了FocusGraph，这是一种用于长第一人称视角视频问答的关键帧选择框架。它利用一种轻量级可训练的场景-描述LLM选择器，该选择器基于图结构描述选择与查询相关的片段，并且使用一种无需训练的方法从这些片段中选择关键帧。与现有方法不同，提出的场景-描述LLM选择器不依赖于原始的低分辨率帧序列，而是作用于场景的紧凑文本表示。然后，我们设计了一种无需训练的块级稀疏流保留（PSFR）方法，从所得的片段序列中选择关键帧，这些关键帧被输入到MLLM以生成最终答案。这些组件共同使FocusGraph在具有挑战性的第一人称视角长视频问答基准测试（包括FindingDory和HourVideo）中取得了最先进的结果，同时相对于基线方法显著减少了推理时间。

Summary / 总结

FocusGraph is a keyframe selection framework for question answering over long egocentric videos. A lightweight trainable Scene-Caption LLM Selector picks query-relevant clips from their graph-based captions, and a training-free Patch-wise Sparse-Flow Retention (PSFR) method then selects keyframes from those clips for the MLLM. The approach achieves state-of-the-art results on benchmarks such as FindingDory and HourVideo while significantly reducing inference time relative to baseline approaches.

FocusGraph 是一种用于长第一人称视角视频问答的关键帧选择框架，使用轻量级可训练的 Scene-Caption LLM 选择器基于图结构描述选出与查询相关的片段，并使用无需训练的 PSFR 方法从这些片段中选择关键帧。该方法在 FindingDory 和 HourVideo 等基准测试中取得了最先进的结果，同时相对于基线方法显著减少了推理时间。

RANGER: Sparsely-Gated Mixture-of-Experts with Adaptive Retrieval Re-ranking for Pathology Report Generation

Authors: Yixin Chen, Ziyu Su, Hikmat Khan, Muhammad Khalid Khan Niazi

First: 2026-03-04T18:12:31+00:00 · Latest: 2026-03-04T18:12:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Pathology report generation remains a relatively under-explored downstream task, primarily due to the gigapixel scale and complex morphological heterogeneity of Whole Slide Images (WSIs). Existing pathology report generation frameworks typically employ transformer architectures, relying on a homogeneous decoder architecture and static knowledge retrieval integration. Such architectures limit generative specialization and may introduce noisy external guidance during the report generation process. To address these limitations, we propose RANGER, a sparsely-gated Mixture-of-Experts (MoE) framework with adaptive retrieval re-ranking for pathology report generation. Specifically, we integrate a sparsely gated MoE into the decoder, along with noisy top-$k$ routing and load-balancing regularization, to enable dynamic expert specialization across various diagnostic patterns. Additionally, we introduce an adaptive retrieval re-ranking module that selectively refines retrieved memory from a knowledge base before integration, reducing noise and improving semantic alignment based on visual feature representations. We perform extensive experiments on the PathText-BRCA dataset and demonstrate consistent improvements over existing approaches across standard natural language generation metrics. Our full RANGER model achieves optimal performance on PathText dataset, reaching BLEU-1 to BLEU-4 scores of 0.4598, 0.3044, 0.2036, and 0.1435, respectively, with METEOR of 0.1883, and ROUGE-L of 0.3038, validating the effectiveness of dynamic expert routing and adaptive knowledge refinement for semantically grounded pathology report generation.

中文标题/摘要

标题：RANGER：稀疏门控混合专家体系结构与自适应检索重排序在病理报告生成中的应用

病理报告生成仍然是相对未被充分探索的下游任务，主要由于全切片图像（WSIs）的十亿像素级规模和复杂的形态异质性。现有的病理报告生成框架通常采用Transformer架构，依赖于同质解码器架构和静态知识检索集成。这些架构限制了生成的专业化，并可能在报告生成过程中引入带噪声的外部指导。为了解决这些限制，我们提出了RANGER，这是一种结合自适应检索重排序的稀疏门控混合专家（MoE）框架，用于病理报告生成。具体而言，我们在解码器中集成稀疏门控MoE，并采用带噪声的top-$k$路由和负载均衡正则化，以实现针对各种诊断模式的动态专家专业化。此外，我们引入了一个自适应检索重排序模块，在集成前选择性地细化从知识库检索到的记忆，减少噪声并基于视觉特征表示提高语义对齐。我们在PathText-BRCA数据集上进行了广泛的实验，并在标准自然语言生成指标上展示了相对于现有方法的一致改进。我们的完整RANGER模型在PathText数据集上达到了最优性能，BLEU-1到BLEU-4得分分别为0.4598、0.3044、0.2036和0.1435，METEOR得分为0.1883，ROUGE-L得分为0.3038，验证了动态专家路由和自适应知识细化在语义扎根的病理报告生成中的有效性。

Summary / 总结

RANGER is a sparsely-gated Mixture-of-Experts framework with adaptive retrieval re-ranking for pathology report generation. It integrates a sparsely gated MoE with noisy top-$k$ routing and load-balancing regularization into the decoder, and adds an adaptive re-ranking module that refines memory retrieved from a knowledge base before integration. Experiments on the PathText-BRCA dataset show consistent improvements over existing approaches on standard natural language generation metrics, with BLEU-1 to BLEU-4 scores of 0.4598, 0.3044, 0.2036, and 0.1435, METEOR of 0.1883, and ROUGE-L of 0.3038.

RANGER 是一种稀疏门控 Mixture-of-Experts 框架，结合了自适应检索重排序模块，用于病理报告生成。该框架将稀疏门控 MoE 集成到解码器中，并通过重排序模块在集成前细化检索到的知识，以减少噪声并提高语义对齐。在 PathText-BRCA 数据集上的实验显示，该方法相对于现有方法取得了一致的改进，BLEU-1 到 BLEU-4 得分分别为 0.4598、0.3044、0.2036 和 0.1435，METEOR 为 0.1883，ROUGE-L 为 0.3038。
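
Noisy top-$k$ routing with a load-balancing term can be sketched as below. This follows the widely used sparsely-gated MoE formulation (Gaussian noise scaled by a softplus, softmax over the k surviving experts, squared coefficient of variation of per-expert importance as the penalty). Whether RANGER uses exactly this form is an assumption, and the function names and the pre-computed logit inputs are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def noisy_top_k_gates(
    clean_logits: Sequence[float],
    noise_logits: Sequence[float],
    k: int,
    rng: random.Random,
) -> List[float]:
    """Return gate weights that are nonzero for k experts and sum to one."""
    noisy = [
        clean + rng.gauss(0.0, 1.0) * softplus(noise)
        for clean, noise in zip(clean_logits, noise_logits)
    ]
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    peak = max(noisy[i] for i in top)
    # Softmax restricted to the selected experts; all others get a zero gate.
    exps = {i: math.exp(noisy[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(noisy))]


def load_balance_penalty(batch_gates: Sequence[Sequence[float]]) -> float:
    """Squared coefficient of variation of per-expert importance over a batch."""
    num_experts = len(batch_gates[0])
    importance = [sum(gates[e] for gates in batch_gates) for e in range(num_experts)]
    mean = sum(importance) / num_experts
    if mean == 0.0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in importance) / num_experts
    return variance / (mean * mean)
```

The penalty is zero when every expert receives the same total gate mass and grows as routing collapses onto a few experts, which is the behavior a load-balancing regularizer is meant to discourage.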

Cognition Envelopes for Bounded Decision Making in Autonomous UAS Operations

Authors: Pedro Antonio Alarcon Granadeno, Arturo Miguel Bernal Russell, Sofia Nelson, Demetrius Hernandez, Maureen Petterson, Michael Murphy, Walter J. Scheirer, Jane Cleland-Huang

First: 2025-10-30T18:11:32+00:00 · Latest: 2026-03-04T18:07:51+00:00

Comments: 12 pages, 9 figures

Abs · PDF · Code1 · Code2

Abstract

Cyber-physical systems increasingly rely on foundational models, such as Large Language Models (LLMs) and Vision-Language Models (VLMs) to increase autonomy through enhanced perception, inference, and planning. However, these models also introduce new types of errors, such as hallucinations, over-generalizations, and context misalignments, resulting in incorrect and flawed decisions. To address this, we introduce the concept of Cognition Envelopes, designed to establish reasoning boundaries that constrain AI-generated decisions while complementing the use of meta-cognition and traditional safety envelopes. As with safety envelopes, Cognition Envelopes require practical guidelines and systematic processes for their definition, validation, and assurance. In this paper we describe an LLM/VLM-supported pipeline for dynamic clue analysis within the domain of small autonomous Uncrewed Aerial Systems deployed on Search and Rescue (SAR) missions, and a Cognition Envelope based on probabilistic reasoning and resource analysis. We evaluate the approach through assessing decisions made by our Clue Analysis Pipeline in a series of SAR missions. Finally, we identify key software engineering challenges for systematically designing, implementing, and validating Cognition Envelopes for AI-supported decisions in cyber-physical systems.

中文标题/摘要

标题：自主无人航空系统受限决策认知包

网络物理系统越来越多地依赖大型语言模型（LLMs）和视觉语言模型（VLMs）等基础模型，通过增强感知、推理和规划来提高自主性。然而，这些模型也会引入新的错误类型，如幻觉、过度概括和上下文错位，导致错误和有缺陷的决策。为了解决这一问题，我们提出了认知包的概念，旨在通过限制AI生成的决策来建立推理边界，同时补充元认知和传统安全包的使用。与安全包类似，认知包需要实用的指南和系统的过程来定义、验证和保证。在本文中，我们描述了一个由LLM/VLM支持的动态线索分析管道，用于小型自主无人航空系统在搜索和救援（SAR）任务中的领域，并基于概率推理和资源分析构建了认知包。我们通过评估在一系列SAR任务中由我们的线索分析管道做出的决策来评估该方法。最后，我们确定了系统设计、实现和验证支持AI决策的网络物理系统中认知包的关键软件工程挑战。

Summary / 总结

This paper addresses errors that foundation models introduce into autonomous decision-making by proposing Cognition Envelopes, which establish reasoning boundaries that constrain AI-generated decisions. The authors describe an LLM/VLM-supported pipeline for clue analysis by small autonomous Uncrewed Aerial Systems on Search and Rescue (SAR) missions, together with a Cognition Envelope based on probabilistic reasoning and resource analysis, and evaluate the approach by assessing the pipeline's decisions across a series of SAR missions. They also identify key software engineering challenges for systematically designing, implementing, and validating Cognition Envelopes.

本文旨在通过引入认知包络来解决自主决策中的错误问题，该认知包络划定推理边界以约束AI生成的决策。作者描述了一个使用LLM和VLM进行线索分析的管道，应用于小型自主无人航空系统在搜救任务中的场景，并通过评估该管道在一系列搜救任务中做出的决策来检验该方法。最后，作者指出了在网络物理系统中系统地设计、实现和验证认知包络所面临的关键软件工程挑战。

Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe

Authors: Chris Vorster, Mayug Maniparambil, Noel E. O'Connor, Noel Murphy, Derek Molloy

First: 2026-03-04T18:07:23+00:00 · Latest: 2026-03-04T18:07:23+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large-scale Vision-Language Foundation Models (VLFMs), such as CLIP, now underpin a wide range of computer vision research and applications. VLFMs are often adapted to various domain-specific tasks. However, VLFM performance on novel, specialised, or underrepresented domains remains inconsistent. Evaluating VLFMs typically requires labelled test sets, which are often unavailable for niche domains of interest, particularly those from the Global South. We address this gap by proposing a highly data-efficient method to predict a VLFM's zero-shot accuracy on a target domain using only a single labelled image per class. Our approach uses a Large Language Model to generate plausible counterfactual descriptions of a given image. By measuring the VLFM's ability to distinguish the correct description from these hard negatives, we engineer features that capture the VLFM's discriminative power in its shared embedding space. A linear regressor trained on these similarity scores estimates the VLFM's zero-shot test accuracy across various visual domains with a Pearson-r correlation of 0.96. We demonstrate our method's performance across five diverse datasets, including standard benchmark datasets and underrepresented datasets from Africa. Our work provides a low-cost, reliable tool for probing VLFMs, enabling researchers and practitioners to make informed decisions about data annotation efforts before committing significant resources. The model training code, generated captions and counterfactuals are released here: https://github.com/chris-vorster/PreLabellingProbe.

中文标题/摘要

标题：基础模型预训练数据中代表性不足？一种单次探针方法

大规模视觉-语言基础模型（VLFMs），如CLIP，现在支撑着广泛范围的计算机视觉研究和应用。VLFMs通常被调整以适应各种特定领域的任务。然而，VLFMs在新颖、专门或代表性不足领域的表现仍然不一致。评估VLFMs通常需要带有标签的测试集，而这些测试集对于特定领域的兴趣点，尤其是来自全球南方的领域来说往往不可用。我们通过提出一种仅使用每个类别一个带有标签的图像来预测VLFM在目标领域零样本准确性的高效方法来填补这一空白。我们的方法使用大型语言模型生成给定图像的合理反事实描述。通过测量VLFM区分正确描述与这些困难负样本的能力，我们设计了能够捕捉VLFM在共享嵌入空间中判别能力的特征。基于这些相似度分数训练的线性回归器在各种视觉领域中估计VLFM的零样本测试准确率，皮尔逊相关系数为0.96。我们在五个不同的数据集中展示了该方法的性能，包括标准基准数据集和来自非洲的代表性不足的数据集。我们的工作提供了一种低成本、可靠的VLFM探针工具，使研究人员和从业者能够在投入大量资源之前做出知情的数据注释决策。模型训练代码、生成的描述和反事实在此发布：https://github.com/chris-vorster/PreLabellingProbe。

Summary / 总结

The research aims to address the inconsistency in the performance of Vision-Language Foundation Models (VLFMs) on underrepresented domains by proposing a data-efficient method to predict zero-shot accuracy using a single labeled image per class. The method leverages a Large Language Model to generate counterfactual descriptions of images, which are then used to measure the VLFM's discriminative power. The linear regressor trained on these similarity scores achieves a Pearson-r correlation of 0.96, effectively estimating the VLFM's zero-shot test accuracy across various visual domains, including underrepresented datasets from Africa.

研究旨在通过提出一种高效方法，利用单张标记图像预测视觉语言基础模型（VLFM）在未充分代表领域的零样本准确性，该方法利用大型语言模型生成图像的反事实描述，进而衡量VLFM的区分能力。基于这些相似度分数训练的线性回归器实现了0.96的皮尔逊相关系数，有效估计了VLFM在各种视觉领域，包括来自非洲的未充分代表数据集的零样本测试准确性。
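
The similarity-based features that feed the linear regressor can be sketched as follows. This is only an illustration: the abstract does not list the exact features, so the three computed here (similarity to the correct description, margin over the hardest counterfactual, and whether the correct description ranks first) are assumptions, and the embeddings are taken as given rather than produced by any particular VLFM API.

```python
import math
from typing import Dict, Sequence, Union


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; zero vectors yield 0.0."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def probe_features(
    image_emb: Sequence[float],
    true_caption_emb: Sequence[float],
    counterfactual_embs: Sequence[Sequence[float]],
) -> Dict[str, Union[float, bool]]:
    """Compare the correct description against LLM-generated hard negatives."""
    if not counterfactual_embs:
        raise ValueError("at least one counterfactual embedding is required")
    positive = cosine(image_emb, true_caption_emb)
    hardest_negative = max(cosine(image_emb, emb) for emb in counterfactual_embs)
    return {
        "positive_similarity": positive,
        "margin": positive - hardest_negative,
        "ranks_first": positive > hardest_negative,
    }
```

With one labelled image per class, features like these could be aggregated over classes and passed to a linear regressor that predicts zero-shot accuracy for the target domain.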

Benchmarking ECG FMs: A Reality Check Across Clinical Tasks

Authors: M A Al-Masud, Juan Miguel Lopez Alcaraz, Nils Strodthoff

Venue: ICLR 2026

First: 2025-09-29T17:29:48+00:00 · Latest: 2026-03-04T18:06:32+00:00

Comments: Accepted at ICLR 2026. OpenReview: https://openreview.net/forum?id=xXRqWpt3Xr

Abs · PDF · Code1 · Code2

Abstract

The 12-lead electrocardiogram (ECG) is a long-standing diagnostic tool. Yet machine learning for ECG interpretation remains fragmented, often limited to narrow tasks or datasets. FMs promise broader adaptability, but fundamental questions remain: Which architectures generalize best? How do models scale with limited labels? What explains performance differences across model families? We benchmarked eight ECG FMs on 26 clinically relevant tasks using 12 public datasets comprising 1,650 regression and classification targets. Models were evaluated under fine-tuning and frozen settings, with scaling analyses across dataset sizes. Results show heterogeneous performance across domains: in adult ECG interpretation, three FMs consistently outperformed strong supervised baselines. In contrast, ECG-CPC, a compact structured state-space model, dominated 5 of 7 task categories, demonstrating that architecture matters more than scale. FMs improved label efficiency 3.3-9x over supervised baselines, though scaling behaviors varied across architectures. Representation analysis reveals that models with similar performance learn markedly different internal structures, suggesting multiple viable paths to effective ECG representation. Overall, while FMs show promise for adult ECG analysis, substantial gaps remain in cardiac structure, outcome prediction, and patient characterization. ECG-CPC's strong performance despite being orders of magnitude smaller challenges the assumption that FM quality requires massive scale, highlighting architectural inductive biases as an untapped opportunity.

中文标题/摘要

标题：心电图FMs基准测试：跨临床任务的现实检查

12导联心电图（ECG）是一种长期的诊断工具。然而，ECG解释的机器学习仍然支离破碎，通常局限于狭窄的任务或数据集。FMs承诺具有更广泛的适应性，但基本问题仍然存在：哪种架构泛化最好？模型在有限标签下如何扩展？模型家族之间性能差异的原因是什么？我们使用12个公共数据集中的1,650个回归和分类目标，对26个临床相关任务进行了8种ECG FMs的基准测试。模型在微调和冻结设置下进行了评估，并进行了跨数据集规模的扩展分析。结果显示，不同领域间性能异质性：在成人ECG解释中，三种FMs始终优于强大的监督基线。相反，ECG-CPC，一种紧凑的结构化状态空间模型，在7个任务类别中的5个中占主导地位，表明架构比规模更重要。FMs在标签效率上提高了3.3-9倍，尽管不同架构的扩展行为有所不同。表示分析表明，具有类似性能的模型学习了截然不同的内部结构，暗示了多种有效ECG表示的有效途径。总体而言，虽然FMs在成人ECG分析中显示出前景，但在心脏结构、结果预测和患者特征方面仍存在巨大差距。尽管ECG-CPC在规模小得多的情况下表现出色，其强大的性能挑战了FMs质量需要大规模的假设，突显了架构归纳偏见作为未开发的机会。

Summary / 总结

This study benchmarks eight ECG foundation models (FMs) on 26 clinically relevant tasks using 12 public datasets, evaluating them under fine-tuning and frozen settings. Performance is heterogeneous across domains: in adult ECG interpretation, three FMs consistently outperform strong supervised baselines, and FMs improve label efficiency 3.3-9x over supervised baselines. Notably, ECG-CPC, a compact structured state-space model, dominates 5 of 7 task categories despite being orders of magnitude smaller, suggesting that architecture matters more than scale for effective ECG representation.

该研究使用12个公开数据集对8种ECG基础模型（FMs）进行了26项临床任务的基准测试，评估了它们在微调和冻结设置下的性能。结果表明，不同领域的性能存在差异：在成人ECG解释中，三种FMs始终优于强监督基线。紧凑型模型ECG-CPC在7个任务类别中的5个中占据主导，表明架构比规模更为关键。相比监督基线，FMs的标签效率提高了3.3到9倍，但不同架构的缩放行为各异。研究还发现，具有相似性能的模型学习了不同的内部结构，表明存在多种有效的ECG表示路径。

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

Authors: Benjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos, James C. Davis, George K. Thiruvathukal, Kristen Yeon-Ji Yun, Yung-Hsiang Lu

Venue: ICLR 2026

First: 2025-09-16T02:15:06+00:00 · Latest: 2026-03-04T18:04:43+00:00

Comments: Accepted to ICLR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Music learners can greatly benefit from tools that accurately detect errors in their practice. Existing approaches typically compare audio recordings to music scores using heuristics or learnable models. This paper introduces LadderSym, a novel Transformer-based method for music error detection. LadderSym is guided by two key observations about the state-of-the-art approaches: (1) late fusion limits inter-stream alignment and cross-modality comparison capability; and (2) reliance on score audio introduces ambiguity in the frequency spectrum, degrading performance in music with concurrent notes. To address these limitations, LadderSym introduces (1) a two-stream encoder with inter-stream alignment modules to improve audio comparison capabilities and error detection F1 scores, and (2) a multimodal strategy that leverages both audio and symbolic scores by incorporating symbolic representations as decoder prompts, reducing ambiguity and improving F1 scores. We evaluate our method on the MAESTRO-E and CocoChorales-E datasets by measuring the F1 score for each note category. Compared to the previous state of the art, LadderSym more than doubles F1 for missed notes on MAESTRO-E (26.8% -> 56.3%) and improves extra note detection by 14.4 points (72.0% -> 86.4%). Similar gains are observed on CocoChorales-E. Furthermore, we also evaluate our models on real data we curated. This work introduces insights about comparison models that could inform sequence evaluation tasks for reinforcement learning, human skill assessment, and model evaluation. Code: https://github.com/ben2002chou/LadderSYM

中文标题/摘要

标题：LadderSym：一种多模态交织变换器，用于音乐练习错误检测

音乐学习者可以从能够准确检测练习中错误的工具中受益。现有方法通常使用启发式方法或可学习模型将音频录音与乐谱进行比较。本文介绍了一种新颖的基于Transformer的方法LadderSym，用于音乐错误检测。LadderSym基于对现有方法的两个关键观察：（1）晚期融合限制了跨流对齐和跨模态比较的能力；（2）依赖乐谱音频引入了频率谱中的模糊性，降低了同时音符音乐的性能。为了解决这些限制，LadderSym引入了（1）一种双流编码器，带有跨流对齐模块，以提高音频比较能力和错误检测F1分数，以及（2）一种多模态策略，通过将符号表示作为解码器提示来利用音频和符号乐谱，减少模糊性并提高F1分数。我们通过测量每个音符类别的F1分数，在MAESTRO-E和CocoChorales-E数据集上评估了该方法。与之前的最新技术相比，LadderSym在MAESTRO-E上将遗漏音符的F1分数提高了两倍多（26.8% -> 56.3%），并且在额外音符检测上提高了14.4个百分点（72.0% -> 86.4%）。在CocoChorales-E上也观察到类似收益。此外，我们还评估了我们的模型在我们整理的真实数据上。这项工作引入了关于比较模型的见解，这些见解可以指导序列评估任务，如强化学习、人类技能评估和模型评估。代码：https://github.com/ben2002chou/LadderSYM

Enhancing Authorship Attribution with Synthetic Paintings

Authors: Clarissa Loures, Caio Hosken, Luan Oliveira, Gianlucca Zuin, Adriano Veloso

First: 2026-03-04T18:00:42+00:00 · Latest: 2026-03-04T18:00:42+00:00

Comments: Accepted for publication at the 24th IEEE International Conference on Machine Learning and Applications (ICMLA 2025)

Abs · PDF · Code1 · Code2

Abstract

Attributing authorship to paintings is a historically complex task, and one of its main challenges is the limited availability of real artworks for training computational models. This study investigates whether synthetic images, generated through DreamBooth fine-tuning of Stable Diffusion, can improve the performance of classification models in this context. We propose a hybrid approach that combines real and synthetic data to enhance model accuracy and generalization across similar artistic styles. Experimental results show that adding synthetic images leads to higher ROC-AUC and accuracy compared to using only real paintings. By integrating generative and discriminative methods, this work contributes to the development of computer vision techniques for artwork authentication in data-scarce scenarios.

中文标题/摘要

标题：利用合成绘画提升作者归属识别

将绘画归因于作者是一个历史上复杂的工作，其主要挑战之一是可用于训练计算模型的真实艺术品数量有限。本研究探讨了通过DreamBooth对Stable Diffusion进行微调生成的合成图像是否能够改善此类情境下的分类模型性能。我们提出了一种结合真实和合成数据的混合方法，以提高模型在类似艺术风格下的准确性和泛化能力。实验结果表明，添加合成图像比仅使用真实绘画能获得更高的ROC-AUC和准确率。通过结合生成性和判别性方法，本研究为在数据稀缺场景下开发计算机视觉技术以进行艺术品鉴定做出了贡献。

Summary / 总结

This study addresses the challenge of limited real artworks for training models in authorship attribution for paintings. It proposes a hybrid approach using both real and synthetic images generated through DreamBooth fine-tuning of Stable Diffusion. The results indicate that incorporating synthetic images improves ROC-AUC and accuracy compared to using real paintings alone, contributing to more accurate artwork authentication in data-scarce scenarios.

该研究通过提出一种结合真实和通过DreamBooth微调Stable Diffusion生成的合成图像的混合方法，解决了因真实艺术品稀缺而导致的作者归属模型训练难题。实验结果表明，加入合成图像可以提高模型性能，表现为更高的ROC-AUC和准确性。这项工作为在数据稀缺场景下艺术品鉴定的计算机视觉技术的发展做出了贡献。

Human-Certified Module Repositories for the AI Age

Authors: Szilárd Enyedi

First: 2026-03-03T01:46:41+00:00 · Latest: 2026-03-04T17:58:26+00:00

Comments: v2: 12 pages, improved references; v1: 11 pages, 3 figures, 2 tables, prepared for AQTR 2026

Abstract

Human-Certified Module Repositories (HCMRs) are introduced in this work as a new architectural model for constructing trustworthy software in the era of AI-assisted development. As large language models increasingly participate in code generation, configuration synthesis, and multi-component integration, the reliability of AI-assembled systems will depend critically on the trustworthiness of the building blocks they use. Today's software supply-chain incidents and modular development ecosystems highlight the risks of relying on components with unclear provenance, insufficient review, or unpredictable composition behavior. We argue that future AI-driven development workflows require repositories of reusable modules that are curated, security-reviewed, provenance-rich, and equipped with explicit interface contracts. To this end, we propose HCMRs, a framework that blends human oversight with automated analysis to certify modules and support safe, predictable assembly by both humans and AI agents. We present a reference architecture for HCMRs, outline a certification and provenance workflow, analyze threat surfaces relevant to modular ecosystems, and extract lessons from recent failures. We further discuss implications for governance, scalability, and AI accountability, positioning HCMRs as a foundational substrate for reliable and auditable AI-constructed software systems.

中文标题/摘要

标题：人类认证模块仓库：AI时代的可信软件架构

本文介绍了人类认证模块仓库（HCMRs）作为AI辅助开发时代构建可信软件的新架构模型。随着大型语言模型越来越多地参与代码生成、配置合成和多组件集成，AI组装系统的可靠性将取决于它们所使用的构建块的可信度。当前的软件供应链事件和模块化开发生态系统突显了依赖来源不明、审查不足或组合行为不可预测的组件的风险。我们认为，未来的AI驱动开发工作流需要一个可重用模块的仓库，该仓库经过筛选、安全审查、具有丰富的来源信息，并配备了明确的接口合同。为此，我们提出了HCMRs框架，该框架结合了人工监督和自动化分析，以认证模块并支持人类和AI代理的安全、可预测的组装。我们提出了HCMRs的参考架构，概述了认证和来源工作流程，分析了模块化生态系统相关的威胁面，并从最近的失败中汲取教训。我们进一步讨论了治理、可扩展性和AI问责制的影响，将HCMRs定位为可靠和可审计的AI构建软件系统的基础结构。

Summary / 总结

This work introduces Human-Certified Module Repositories (HCMRs) to address the trustworthiness of AI-assisted software development. HCMRs blend human oversight with automated analysis to certify modules, ensuring they are reusable, security-reviewed, and provenance-rich. Key findings include the necessity of explicit interface contracts and a certification workflow that mitigates risks in modular ecosystems, positioning HCMRs as a foundational substrate for reliable AI-constructed software systems.

本文提出了人类认证模块仓库（HCMRs），以应对AI辅助软件开发中的可信度问题。HCMRs结合了人工监督和自动化分析，以认证模块，确保它们是可重用的、经过安全审查的和具有丰富来源信息的。研究强调了依赖不可信组件的风险，并提出了一种框架来支持人类和AI代理的安全组装。关键发现包括参考架构、认证和来源信息工作流以及从模块生态系统中的近期失败中吸取的教训。

ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors

Authors: Zihao Huang, Tianqi Liu, Zhaoxi Chen, Shaocong Xu, Saining Zhang, Lixing Xiao, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu

First: 2026-03-04T17:58:04+00:00 · Latest: 2026-03-04T17:58:04+00:00

Comments: Project Page: https://arthoi.github.io/

Abstract

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation through reconstruction-informed synthesis.

中文标题/摘要

标题：ArtHOI：基于视频先验的4D重建实现铰接式人-物交互合成

在没有3D/4D监督的情况下合成物理上合理的铰接式人-物交互（HOI）仍然是一个基本挑战。虽然最近的零样本方法利用视频扩散模型来合成人-物交互，但它们主要局限于刚体操作，并缺乏明确的4D几何推理。为了解决这一问题，我们将铰接式HOI合成建模为从单目视频先验进行4D重建的问题：仅给定由扩散模型生成的视频，我们无需任何3D监督即可重建完整的4D铰接场景。这种基于重建的方法将生成的2D视频视为逆渲染问题的监督，恢复出几何上一致且物理上合理的4D场景，这些场景自然地符合接触、铰接运动和时间连贯性。我们引入了ArtHOI，这是第一个通过基于视频先验的4D重建实现铰接式人-物交互合成的零样本框架。我们的关键设计包括：1) 基于光流的部件分割：利用光流作为几何线索来分离单目视频中的动态和静态区域；2) 解耦的重建流水线：在单目歧义下，人体运动与物体铰接运动的联合优化不稳定，因此我们首先恢复物体铰接运动，然后在重建的物体状态条件下合成人体运动。ArtHOI将基于视频的生成与几何感知的重建结合起来，产生既在语义上对齐又在物理上合理的交互。在各种铰接场景（例如，打开冰箱、橱柜、微波炉）中，ArtHOI在接触准确性、穿透减少和铰接保真度方面显著优于先前的方法，通过重建指导的合成将零样本交互合成扩展到刚体操作之外。

Summary / 总结

The research aims to synthesize physically plausible articulated human-object interactions without 3D/4D supervision. The method formulates the problem as a 4D reconstruction from monocular video priors, using a flow-based part segmentation and a decoupled reconstruction pipeline. Key experimental findings show that ArtHOI outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity, extending zero-shot interaction synthesis beyond rigid manipulation.

研究旨在通过单目视频先验，在无需3D/4D监督的情况下合成物理上合理的铰接式人-物交互（HOI）。方法将HOI合成表述为从单目视频先验进行4D重建的问题，使用基于光流的部件分割和解耦重建流水线来恢复几何上一致且物理上合理的4D场景。关键实验发现表明，在各种铰接场景（例如打开冰箱、橱柜、微波炉）中，ArtHOI在接触准确性、穿透减少和铰接保真度方面显著优于先前的方法，通过基于重建的合成将零样本交互合成扩展到刚体操作之外。
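
The flow-based part segmentation idea can be illustrated with a very small sketch. This is not the ArtHOI implementation: the per-pixel flow field, the function name `dynamic_mask`, and the fixed magnitude threshold are all assumptions made here for illustration. It only shows the general idea of using optical-flow magnitude as a cue to separate moving regions from static ones.

```python
import math

def dynamic_mask(flow, threshold=1.0):
    """Mark a pixel as dynamic (True) when its flow magnitude exceeds `threshold`.

    `flow` is a hypothetical H x W grid of (dx, dy) displacement vectors.
    """
    return [[math.hypot(dx, dy) > threshold for dx, dy in row] for row in flow]

# Toy 2 x 3 flow field: the right-hand pixels move, the left-hand ones barely do.
flow = [
    [(0.0, 0.1), (0.2, 0.0), (3.0, 4.0)],
    [(0.0, 0.0), (1.5, 1.5), (2.0, 0.0)],
]
mask = dynamic_mask(flow)
```

In a real pipeline the flow would come from an optical-flow estimator and the threshold would likely need tuning per video; both are left out of this sketch.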

Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection

Authors: Dacheng Qi, Chenyu Wang, Jingwei Xu, Tianzhe Chu, Zibo Zhao, Wen Liu, Wenrui Ding, Yi Ma, Shenghua Gao

First: 2026-03-04T17:55:01+00:00 · Latest: 2026-03-04T17:55:01+00:00

Comments: Accepted by CVPR2026

Abstract

Constructing computer-aided design (CAD) models is labor-intensive but essential for engineering and manufacturing. Recent advances in Large Language Models (LLMs) have inspired LLM-based CAD generation that represents CAD models as command sequences. However, these methods struggle in practical scenarios because the command sequence representation does not support entity selection (e.g., faces or edges), which limits its ability to support complex editing operations such as chamfer or fillet. Further, discretizing continuous variables during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. In particular, Pointer-CAD decomposes CAD model generation into steps, conditioning the generation of each subsequent step on both the textual description and the B-rep generated from previous steps. Whenever an operation requires the selection of a specific geometric entity, the LLM predicts a pointer that selects the most feature-consistent candidate from the available set. Such a selection operation also reduces the quantization error in the command sequence-based representation. To support the training of Pointer-CAD, we develop a data annotation pipeline that produces expert-level natural language descriptions and apply it to build a dataset of approximately 575K CAD models. Extensive experimental results demonstrate that Pointer-CAD effectively supports the generation of complex geometric structures and reduces segmentation error to an extremely low level, a significant improvement over prior command sequence methods that substantially mitigates the topological inaccuracies introduced by quantization error.

中文标题/摘要

标题：Pointer-CAD：通过基于指针的边与面选择统一B-Rep和命令序列

计算机辅助设计（CAD）模型的构建是劳动密集型但对工程和制造至关重要。近年来，大型语言模型（LLMs）的进步激发了基于LLM的CAD生成，通过将CAD表示为命令序列。但这些方法在实际场景中遇到困难，因为命令序列表示不支持实体选择（例如面或边），限制了其支持复杂编辑操作（如倒角或圆角）的能力。此外，在草图和拉伸操作中连续变量的离散化可能导致拓扑错误。为了解决这些限制，我们提出了Pointer-CAD，这是一种新颖的基于LLM的CAD生成框架，利用基于指针的命令序列表示将B-Rep模型的几何信息显式地纳入顺序建模中。特别是，Pointer-CAD将CAD模型生成分解为步骤，每个后续步骤的生成同时依赖于文本描述和前一步生成的B-Rep。每当操作需要选择特定的几何实体时，LLM会预测一个指针，从可用集中选择最符合特征的候选实体。这种选择操作还减少了基于命令序列表示的量化误差。为了支持Pointer-CAD的训练，我们开发了一种数据注释流水线，生成专家级的自然语言描述，并应用于构建约57.5万个CAD模型的数据集。广泛的实验结果表明，Pointer-CAD有效地支持了复杂几何结构的生成，并将分割误差降低到极低水平，显著优于先前的命令序列方法，从而显著减轻了量化误差引入的拓扑不准确性。

Summary / 总结

Pointer-CAD is a novel LLM-based CAD generation framework that addresses the limitations of command sequence representation by incorporating geometric information of B-rep models. It decomposes CAD model generation into steps, conditioning each step on both textual descriptions and B-rep generated from previous steps. The LLM predicts a pointer to select the most feature-consistent geometric entity, reducing quantization errors. Extensive experiments show that Pointer-CAD effectively generates complex geometric structures and significantly reduces segmentation errors compared to previous methods.

Pointer-CAD 是一种新型框架，通过将 B-rep 几何信息集成到顺序建模中来解决基于命令序列的 CAD 生成的局限性。它使用基于指针的命令序列来选择特定的几何实体，从而减少量化误差。Pointer-CAD 使用包含约 57.5 万个 CAD 模型的数据集进行训练，并在生成复杂几何结构和减少分割误差方面显著优于先前的方法。
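
The pointer-based selection step can be pictured with a minimal sketch. It assumes, purely for illustration, that each candidate edge or face has a feature vector and that the model emits a query vector; the dot-product scoring, the softmax, and the name `pointer_select` are generic pointer-network conventions chosen here, not details confirmed by the abstract.

```python
import math

def pointer_select(query, candidates):
    """Return (index, probabilities) for the candidate most consistent with `query`.

    Scores are dot products between the query and each candidate feature
    vector, normalized with a numerically stable softmax.
    """
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Hypothetical feature vectors for three B-rep edges, and a query vector.
edges = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
idx, probs = pointer_select([0.1, 0.9], edges)
```

Because the output is an index into entities that already exist in the B-rep, the selection itself involves no coordinate quantization, which is one way to read the abstract's claim about reduced quantization error.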

SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints

Authors: Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu

First: 2026-03-04T17:51:42+00:00 · Latest: 2026-03-04T17:51:42+00:00

Abstract

We present SpotIt+, an open-source tool for evaluating Text-to-SQL systems via bounded equivalence verification. Given a generated SQL query and the ground truth, SpotIt+ actively searches for database instances that differentiate the two queries. To ensure that the generated counterexamples reflect practically relevant discrepancies, we introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation. Experimental results on the BIRD dataset show that the mined constraints enable SpotIt+ to generate more realistic differentiating databases, while preserving its ability to efficiently uncover numerous discrepancies between generated and gold SQL queries that are missed by standard test-based evaluation.

中文标题/摘要

标题：SpotIt+: 基于验证的文本到SQL评估工具，包含数据库约束

我们介绍了SpotIt+，一个开源工具，通过有界等价验证来评估文本到SQL系统。给定生成的SQL查询和真实查询，SpotIt+积极搜索能够区分两个查询的数据库实例。为了确保生成的反例反映实际相关的差异，我们引入了一种约束挖掘流水线，该流水线结合了基于规则的示例数据库规范挖掘与基于LLM的验证。在BIRD数据集上的实验结果表明，挖掘的约束使SpotIt+能够生成更现实的区分数据库，同时保持其高效地发现生成SQL查询和黄金SQL查询之间大量未被标准测试评估发现的差异的能力。

Summary / 总结

SpotIt+ is an open-source tool for evaluating Text-to-SQL systems by verifying the equivalence of generated SQL queries against ground truth queries. It uses a constraint-mining pipeline combining rule-based specification mining and LLM-based validation to find realistic database instances that differentiate the queries. Experiments on the BIRD dataset demonstrate that SpotIt+ can generate more realistic counterexamples and uncover more discrepancies than standard test-based evaluation methods.

SpotIt+ 是一个开源工具，使用有界等价验证来评估 Text-to-SQL 系统。它会寻找数据库实例来区分生成的查询和真实查询。该工具包含一个约束挖掘管道，结合了基于规则的规格挖掘和基于大语言模型的验证，以确保反例具有实际相关性。实验结果表明，SpotIt+ 可以生成更现实的区分数据库，并发现比标准基于测试的评估方法更多的差异。
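
To make bounded equivalence checking concrete, here is a toy sketch. It is not SpotIt+ and does not use its API: it simply enumerates every small instance of a hypothetical table `t(a, b)` with Python's built-in `sqlite3`, and returns the first instance on which two queries disagree. The table schema, the value domain, the row bound, and the `constraint` callback (standing in for a mined database constraint) are all assumptions made for illustration.

```python
import itertools
import sqlite3

def run(query, rows):
    """Run `query` on a fresh in-memory table t(a, b) containing `rows`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = sorted(conn.execute(query).fetchall())
    conn.close()
    return result

def find_counterexample(q1, q2, domain=(0, 1, 2), max_rows=2,
                        constraint=lambda rows: True):
    """Return the first database (list of rows) within the bound on which
    q1 and q2 give different results, or None if none exists."""
    candidates = list(itertools.product(domain, repeat=2))
    for n in range(max_rows + 1):
        for rows in itertools.combinations_with_replacement(candidates, n):
            if not constraint(rows):
                continue  # skip instances that violate the assumed constraint
            if run(q1, rows) != run(q2, rows):
                return list(rows)
    return None

gold = "SELECT a FROM t WHERE b > 0"
generated = "SELECT a FROM t WHERE b >= 0"

# Without constraints, a row with b = 0 separates the two queries.
unconstrained = find_counterexample(gold, generated)

# With an assumed constraint "b is always positive", no instance within the
# bound separates them.
constrained = find_counterexample(
    gold, generated, constraint=lambda rows: all(b > 0 for _, b in rows)
)
```

The second call shows why constraints matter for this kind of evaluation: a counterexample that relies on data the real database can never contain is not a practically relevant discrepancy.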

Preference Leakage: A Contamination Problem in LLM-as-a-judge

Authors: Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu

Venue: ICLR 2026

First: 2025-02-03T17:13:03+00:00 · Latest: 2026-03-04T17:50:58+00:00

Comments: Accepted by ICLR 2026

Abstract

Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common types of relatedness between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive and real-world problem that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all code and data at: https://github.com/David-Li0406/Preference-Leakage.

中文标题/摘要

标题：偏好泄露：LLM作为法官中的污染问题

大型语言模型（LLMs）作为法官和基于LLM的数据合成已经成为了两种基本的LLM驱动的数据注释方法，在模型开发中得到了广泛应用。虽然它们的结合显著提高了模型训练和评估的效率，但对这种新模型开发范式带来的潜在污染却很少受到关注。在本文中，我们揭示了偏好泄露，这是一种由数据生成器LLM与法官LLM之间的相关性引起的污染问题。为了研究这一问题，我们首先定义了数据生成器LLM和法官LLM之间的三种常见相关性：同一模型、继承关系以及同一模型家族。通过广泛的实验，我们实证地确认了偏好泄露导致法官倾向于其相关的学生模型的偏差，这一现象在多个LLM基线和基准中得到了验证。进一步的分析表明，偏好泄露是一个普遍且难以检测的现实问题，比之前在LLM作为法官场景中识别出的偏差更为棘手。所有这些发现都表明，偏好泄露是LLM作为法官领域中一个普遍且具有挑战性的问题。我们已在以下链接发布了所有代码和数据：https://github.com/David-Li0406/Preference-Leakage。

Summary / 总结

This work addresses the issue of preference leakage in LLM-as-a-judge, where the relatedness between synthetic data generators and evaluators introduces bias. By defining three relatedness types and conducting extensive experiments, the study confirms the existence of this bias across various LLM baselines and benchmarks. The findings suggest that preference leakage is a widespread and challenging problem that is harder to detect than previously identified biases, indicating the need for more robust evaluation methods in LLM development.

研究探讨了LLM-as-a-judge中的偏好泄露问题，这种问题由于合成数据生成器和评估器之间的相关性引入了偏差。研究定义了三种相关性类型，并在多个LLM基线和基准上实证确认了这种偏差。研究结果表明，偏好泄露是一个广泛且难以检测的问题，比之前识别的偏差更具挑战性。

Scalable Evaluation of the Realism of Synthetic Environmental Augmentations in Images

Authors: Damian J. Ruck, Paul Vautravers, Oliver Chalkley, Jake Thomas

First: 2026-03-04T17:46:08+00:00 · Latest: 2026-03-04T17:46:08+00:00

Abstract

Evaluation of AI systems often requires synthetic test cases, particularly for rare or safety-critical conditions that are difficult to observe in operational data. Generative AI offers a promising approach for producing such data through controllable image editing, but its usefulness depends on whether the resulting images are sufficiently realistic to support meaningful evaluation. We present a scalable framework for assessing the realism of synthetic image-editing methods and apply it to the task of adding environmental conditions (fog, rain, snow, and nighttime) to car-mounted camera images. Using 40 clear-day images, we compare rule-based augmentation libraries with generative AI image-editing models. Realism is evaluated using two complementary automated metrics: a vision-language model (VLM) jury for perceptual realism assessment, and embedding-based distributional analysis to measure similarity to genuine adverse-condition imagery. Generative AI methods substantially outperform rule-based approaches, with the best generative method achieving approximately 3.6 times the acceptance rate of the best rule-based method. Performance varies across conditions: fog proves easiest to simulate, while nighttime transformations remain challenging. Notably, the VLM jury assigns imperfect acceptance even to real adverse-condition imagery, establishing practical ceilings against which synthetic methods can be judged. By this standard, leading generative methods match or exceed real-image performance for most conditions. These results suggest that modern generative image-editing models can enable scalable generation of realistic adverse-condition imagery for evaluation pipelines. Our framework therefore provides a practical approach for scalable realism evaluation, though validation against human studies remains an important direction for future work.

中文标题/摘要

标题：合成环境增强在图像中的现实性可扩展评估

AI系统的评估通常需要合成测试案例，尤其是对于在操作数据中难以观察到的罕见或安全关键条件。生成式AI通过可控的图像编辑提供了一种有前景的数据生成方法，但其有用性取决于生成的图像是否足够现实，以支持有意义的评估。我们提出了一种可扩展的框架来评估合成图像编辑方法的现实性，并将其应用于向车载摄像头图像添加环境条件（雾、雨、雪和夜间）的任务。使用40张晴天图像，我们将基于规则的增强库与生成式AI图像编辑模型进行了比较。现实性通过两种互补的自动化度量标准进行评估：基于视觉-语言模型（VLM）的陪审团进行感知现实性评估，以及基于嵌入的分布分析来衡量与真实恶劣条件图像的相似性。生成式AI方法显著优于基于规则的方法，最佳生成式方法的接受率大约是最佳基于规则方法的3.6倍。性能在不同条件下有所不同：雾是最容易模拟的，而夜间变换仍然具有挑战性。值得注意的是，VLM陪审团即使对真实恶劣条件图像也赋予了不完美的接受度，这为合成方法设定了实际的上限。按照这一标准，领先的生成式方法在大多数条件下与真实图像的性能相当或超过。这些结果表明，现代生成图像编辑模型可以实现恶劣条件图像的可扩展生成，用于评估管道。因此，我们的框架提供了一种实用的方法来进行可扩展的现实性评估，尽管未来工作仍需通过人类研究进行验证。

Summary / 总结

The research aims to evaluate the realism of synthetic environmental augmentations in images, particularly for rare or safety-critical conditions. A scalable framework was developed using two automated metrics: a vision-language model (VLM) jury for perceptual realism, and embedding-based distributional analysis for similarity to genuine adverse-condition imagery. The study compared rule-based augmentation libraries with generative AI models for adding fog, rain, snow, and nighttime to car-mounted camera images. Generative AI methods outperformed rule-based approaches, with the best generative method achieving about 3.6 times the acceptance rate of the best rule-based method. The study found that while fog was easiest to simulate, nighttime transformations remained challenging. Notably, the VLM jury assigned imperfect acceptance even to real adverse-condition imagery, indicating practical ceilings for synthetic methods. Leading generative methods matched or exceeded real-image performance for most conditions, suggesting that modern generative models can enable scalable generation of realistic adverse-condition imagery for evaluation pipelines.

研究旨在评估合成环境增强图像的真实性，特别是对于罕见或安全关键条件。开发了一个可扩展的框架，使用两种自动评估指标：视觉语言模型（VLM）进行感知真实性和嵌入式分布分析。研究将基于规则的增强库与生成AI模型进行了比较，用于向汽车车载摄像头图像添加雾、雨、雪和夜间效果。生成AI方法优于基于规则的方法，最佳生成方法的接受率约为最佳基于规则方法的3.6倍。研究发现，虽然雾最容易模拟，但夜间变换仍然具有挑战性。值得注意的是，VLM评委会对真实不良条件图像也给予不完美的接受，表明合成方法的实际上限。领先的生成方法在大多数条件下与真实图像性能相当或超过，表明现代生成模型可以实现对评估管道中现实不良条件图像的可扩展生成。

Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation

Authors: Egor Cherepanov, Nikita Kachaev, Artem Zholus, Alexey K. Kovalev, Aleksandr I. Panov

First: 2024-12-09T14:34:31+00:00 · Latest: 2026-03-04T17:39:28+00:00

Comments: 20 pages, 6 figures, 9 tables

Abstract

The incorporation of memory into agents is essential for numerous tasks within the domain of Reinforcement Learning (RL). In particular, memory is paramount for tasks that require the use of past information, adaptation to novel environments, and improved sample efficiency. However, the term "memory" encompasses a wide range of concepts, which, coupled with the lack of a unified methodology for validating an agent's memory, leads to erroneous judgments about agents' memory capabilities and prevents objective comparison with other memory-enhanced agents. This paper aims to streamline the concept of memory in RL by providing practical, precise definitions of agent memory types, such as long-term vs. short-term memory and declarative vs. procedural memory, inspired by cognitive science. Using these definitions, we categorize different classes of agent memory, propose a robust experimental methodology for evaluating the memory capabilities of RL agents, and standardize evaluations. Furthermore, through experiments with different RL agents, we empirically demonstrate the importance of adhering to the proposed methodology when evaluating different types of agent memory, and show what violating it leads to.

中文标题/摘要

标题：解析强化学习代理中的记忆复杂性：一种分类与评估方法

将记忆融入代理对于强化学习（RL）领域中的许多任务至关重要。特别是，记忆对于需要使用过去信息、适应新环境和提高样本效率的任务至关重要。然而，“记忆”这一术语涵盖了广泛的概念，加之缺乏统一的方法来验证代理的记忆能力，导致对代理记忆能力的错误判断，并阻碍了与其他增强记忆的代理进行客观比较。本文旨在通过提供基于认知科学的代理记忆类型的实际精确定义，简化RL中的记忆概念，从而提供不同的代理记忆类别，提出一种稳健的实验方法来评估RL代理的记忆能力，并标准化评估。此外，通过使用不同的RL代理进行实验，我们实证展示了在评估不同类型代理记忆时遵循提议方法的重要性，以及违反该方法会导致的问题。

Summary / 总结

This paper addresses the complexity of memory in Reinforcement Learning (RL) agents by defining different types of memory such as long-term vs. short-term and declarative vs. procedural memory, inspired by cognitive science. It proposes a robust experimental methodology for evaluating memory capabilities and standardizes evaluations. The study shows that adhering to the proposed methodology is crucial for accurate evaluations of different types of agent memory, while its violation can lead to incorrect judgments about agents' memory capabilities.

本文通过借鉴认知科学，定义了不同类型的记忆，如长期记忆与短期记忆、陈述性记忆与程序性记忆，以解决强化学习（RL）代理中记忆的复杂性问题。它提出了一种稳健的实验方法来评估记忆能力，并标准化了评估过程。研究表明，遵循提出的评估方法对于准确评估不同类型代理的记忆能力至关重要，而违反这种方法则会导致对代理记忆能力的错误判断。
