arXiv Paper Digest (arXiv 论文速递)

Embedding Autonomous Agents in Resource-Constrained Robotic Platforms

Authors: Negar Halakou, Juan F. Gutierrez, Ye Sun, Han Jiang, Xueming Wu, Yilun Song, Andres Gomez

First: 2026-01-07T18:57:32+00:00 · Latest: 2026-01-07T18:57:32+00:00

Comments: This is an open-access, author-archived version of a manuscript published in European Conference on Multi-Agent Systems 2025

Abstract

Many embedded devices operate under resource constraints and in dynamic environments, requiring local decision-making capabilities. Enabling devices to make independent decisions in such environments can improve the responsiveness of the system and reduce the dependence on constant external control. In this work, we integrate an autonomous agent, programmed using AgentSpeak, with a small two-wheeled robot that explores a maze using its own decision-making and sensor data. Experimental results show that the agent successfully solved the maze in 59 seconds using 287 reasoning cycles, with decision phases taking less than one millisecond. These results indicate that the reasoning process is efficient enough for real-time execution on resource-constrained hardware. This integration demonstrates how high-level agent-based control can be applied to resource-constrained embedded systems for autonomous operation.

中文标题/摘要

标题：在资源受限的机器人平台上嵌入自主代理

许多嵌入式设备在资源受限和动态环境中运行，需要本地决策能力。使设备能够在这些环境中独立做出决策可以提高系统的响应性并减少对外部控制的依赖。在本研究中，我们使用AgentSpeak编程将一个自主代理与一个小型两轮机器人集成，该机器人利用自身的决策能力和传感器数据探索迷宫。实验结果表明，代理在59秒内成功解决了迷宫，使用了287次推理循环，决策阶段耗时不到一毫秒。这些结果表明，推理过程对于资源受限的硬件的实时执行是高效的。这种集成展示了如何将基于代理的高级控制应用于资源受限的嵌入式系统以实现自主操作。

Summary / 总结

This study integrates an autonomous agent, programmed in AgentSpeak, with a small two-wheeled robot so that the robot can make decisions locally from its own sensor data. In the reported experiment the agent solved a maze in 59 seconds using 287 reasoning cycles, with each decision phase taking less than one millisecond. The authors take this as evidence that high-level agent-based control is feasible in real time on resource-constrained hardware.

研究旨在使资源受限的机器人平台能够自主决策，以提高系统的响应性。研究将使用AgentSpeak编写的自主代理集成到一个两轮机器人中，使其能够自主探索迷宫。关键发现表明，代理在59秒内通过287次推理循环成功解决了迷宫问题，这表明在有限的硬件资源上实现了高效的实时执行。
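
The figures quoted in the abstract can be put side by side with a little arithmetic. The sketch below assumes, purely for illustration, that the 287 reasoning cycles were spread evenly over the 59-second run; the abstract does not state this, and the variable names are ours.

```python
# Figures quoted in the abstract.
run_time_s = 59.0
reasoning_cycles = 287
decision_phase_upper_ms = 1.0  # "less than one millisecond"

# Assumption (not stated in the abstract): cycles are evenly spaced over the run.
avg_cycle_period_ms = run_time_s * 1000.0 / reasoning_cycles

# Upper bound on the share of an average cycle spent in the decision phase.
decision_share_upper = decision_phase_upper_ms / avg_cycle_period_ms

print(f"average cycle period: {avg_cycle_period_ms:.1f} ms")
print(f"decision phase share: under {decision_share_upper:.2%}")
```

Under that even-spacing assumption, a cycle starts roughly every 206 ms, so a sub-millisecond decision phase would occupy well under 1% of each cycle.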

ImLoc: Revisiting Visual Localization with Image-based Representation

Authors: Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys

First: 2026-01-07T18:51:51+00:00 · Latest: 2026-01-07T18:51:51+00:00

Comments: Code will be available at https://github.com/cvg/Hierarchical-Localization

Abstract

Existing visual localization methods are typically either 2D image-based, which are easy to build and maintain but limited in effective geometric reasoning, or 3D structure-based, which achieve high accuracy but require a centralized reconstruction and are difficult to update. In this work, we revisit visual localization with a 2D image-based representation and propose to augment each image with estimated depth maps to capture the geometric structure. Supported by the effective use of dense matchers, this representation is not only easy to build and maintain, but also achieves the highest accuracy in challenging conditions. With compact compression and a GPU-accelerated LO-RANSAC implementation, the whole pipeline is efficient in both storage and computation and allows for a flexible trade-off between accuracy and memory efficiency. Our method achieves new state-of-the-art accuracy on various standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes. Code will be available at https://github.com/cvg/Hierarchical-Localization.

中文标题/摘要

标题：ImLoc：基于图像的表示重新审视视觉定位

现有的视觉定位方法通常要么是基于2D图像的，这类方法易于构建和维护，但几何推理能力有限；要么是基于3D结构的，这类方法准确性高，但需要集中重建，难以更新。在本文中，我们基于2D图像的表示重新审视视觉定位，并提出将每个图像与估计的深度图相结合以捕捉几何结构。通过有效使用密集匹配器，这种表示不仅易于构建和维护，而且在具有挑战性的条件下实现了最高的准确性。借助紧凑压缩和GPU加速的LO-RANSAC实现，整个管道在存储和计算上都高效，并允许在准确性和最高内存效率之间灵活权衡。我们的方法在各种标准基准上达到了新的最先进的准确性，并在相似的地图大小下优于现有的内存高效方法。代码将在https://github.com/cvg/Hierarchical-Localization公开。

Summary / 总结

This paper revisits visual localization with a 2D image-based representation, augmenting each image with an estimated depth map to capture geometric structure. Combined with dense matchers, compact compression, and a GPU-accelerated LO-RANSAC implementation, the pipeline is efficient in both storage and computation. It reports state-of-the-art accuracy on various standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes.

本文重新审视了使用基于2D图像的表示方法进行视觉定位，通过添加估计的深度图来捕捉几何结构。该方法利用密集匹配器进行有效的几何推理，并在存储和计算效率方面表现出色。它在各种基准测试中达到了最先进的准确性，并在相似的地图大小下优于现有的高效内存方法。
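
One standard way a per-image depth map can supply geometry to a 2D representation is back-projection: a pixel (u, v) with depth d maps to the camera-frame point d · K⁻¹ [u, v, 1]ᵀ, where K is the intrinsics matrix. The sketch below illustrates only that textbook step. It is not taken from the paper, and the function name and the intrinsics values are hypothetical.

```python
import numpy as np


def backproject(pixels, depths, K):
    """Lift pixels to 3D camera-frame points using per-pixel depth.

    pixels: (N, 2) array of (u, v) coordinates
    depths: (N,) array of depths along the camera z-axis
    K:      (3, 3) pinhole intrinsics matrix
    """
    ones = np.ones((pixels.shape[0], 1))
    homogeneous = np.hstack([pixels, ones])       # rows are [u, v, 1]
    rays = homogeneous @ np.linalg.inv(K).T       # rows are K^-1 [u, v, 1]^T
    return rays * depths[:, None]                 # scale each ray by its depth


# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[320.0, 240.0], [820.0, 240.0]])
depths = np.array([2.0, 2.0])
points = backproject(pixels, depths, K)
```

With points like these, 2D matches against a database image become 2D-3D correspondences that a robust pose solver can consume.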

Lightweight Test-Time Adaptation for EMG-Based Gesture Recognition

Authors: Nia Touko, Matthew O A Ellis, Cristiano Capone, Alessio Burrello, Elisa Donati, Luca Manneschi

First: 2026-01-07T18:48:31+00:00 · Latest: 2026-01-07T18:48:31+00:00

Abstract

Reliable long-term decoding of surface electromyography (EMG) is hindered by signal drift caused by electrode shifts, muscle fatigue, and posture changes. While state-of-the-art models achieve high intra-session accuracy, their performance often degrades sharply across sessions. Existing solutions typically demand large datasets or high-compute pipelines that are impractical for energy-efficient wearables. We propose a lightweight framework for Test-Time Adaptation (TTA) using a Temporal Convolutional Network (TCN) backbone. We introduce three deployment-ready strategies: (i) causal adaptive batch normalization for real-time statistical alignment; (ii) a Gaussian Mixture Model (GMM) alignment with experience replay to prevent forgetting; and (iii) meta-learning for rapid, few-shot calibration. Evaluated on the NinaPro DB6 multi-session dataset, our framework significantly narrows the inter-session accuracy gap with minimal overhead. Our results show that experience-replay updates yield superior stability under limited data, while meta-learning achieves competitive performance in one- and two-shot regimes using only a fraction of the data required by current benchmarks. This work establishes a path toward robust, "plug-and-play" myoelectric control for long-term prosthetic use.

中文标题/摘要

标题：基于EMG的手势识别测试时轻量级适应

由于电极位移、肌肉疲劳和姿势变化导致的信号漂移，长期可靠的表面肌电图(EMG)解码受到阻碍。尽管最先进的模型在单会话中能实现高精度，但其性能往往会在会话间急剧下降。现有解决方案通常需要大量数据集或高计算量的管道，这在能量高效的可穿戴设备上是不切实际的。我们提出了一种基于时间卷积网络(TCN)骨干的测试时轻量级适应(TTA)框架。我们引入了三种可部署策略：(i) 因果自适应批量归一化进行实时统计对齐；(ii) 通过经验回放的高斯混合模型(GMM)对齐以防止遗忘；(iii) 元学习以实现快速、少样本校准。在NinaPro DB6多会话数据集上评估，我们的框架在最小开销下显著缩小了会话间精度差距。我们的结果表明，经验回放更新在有限数据下提供了更好的稳定性，而元学习仅使用当前基准所需数据的一小部分即可在单样本和两样本情况下实现竞争力的性能。这项工作为长期假肢使用中的稳健、即插即用的肌电控制奠定了道路。

Summary / 总结

The paper addresses signal drift in EMG-based gesture recognition, which is caused by electrode shifts, muscle fatigue, and posture changes, by proposing a lightweight Test-Time Adaptation (TTA) framework built on a TCN backbone. It introduces three strategies: causal adaptive batch normalization, GMM alignment with experience replay, and meta-learning for rapid few-shot calibration. On the NinaPro DB6 multi-session dataset, the framework narrows the inter-session accuracy gap with minimal overhead: experience-replay updates are the most stable under limited data, while meta-learning is competitive in one- and two-shot regimes using only a fraction of the data required by current benchmarks.

本文提出了一种基于Temporal Convolutional Network (TCN)的轻量级Test-Time Adaptation (TTA)框架，以解决基于表面肌电图（EMG）的手势识别中的信号漂移问题。该框架包括三种策略：因果自适应批量归一化、带有经验回放的GMM对齐以及用于快速校准的元学习。在NinaPro DB6数据集上的评估表明，该框架能够以最小的开销显著提高跨会话的准确率；经验回放更新在有限数据下提供了更好的稳定性，而元学习仅使用少量数据即可在单样本和两样本设置下实现与当前基准相当的性能。
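
The idea behind causal adaptive normalization can be sketched as a per-channel normalizer whose running statistics follow the incoming stream, using only samples already seen. This is a minimal illustration under our own assumptions, not the authors' implementation: the class name, the exponential-moving-average update, and the `momentum` and `eps` values are all hypothetical choices.

```python
import numpy as np


class CausalAdaptiveNorm:
    """Per-channel normalizer that adapts its statistics at test time."""

    def __init__(self, num_channels, momentum=0.05, eps=1e-5):
        self.mean = np.zeros(num_channels)
        self.var = np.ones(num_channels)
        self.momentum = momentum  # illustrative value, not from the paper
        self.eps = eps

    def step(self, x):
        """Normalize one sample of shape (num_channels,), then update the stats.

        Causal: the output depends only on statistics of earlier samples.
        """
        y = (x - self.mean) / np.sqrt(self.var + self.eps)
        delta = x - self.mean
        # Exponentially weighted updates of the running mean and variance.
        self.mean = self.mean + self.momentum * delta
        self.var = (1.0 - self.momentum) * (self.var + self.momentum * delta ** 2)
        return y


# Synthetic stream standing in for a new session with shifted statistics.
rng = np.random.default_rng(0)
stream = rng.normal(loc=5.0, scale=2.0, size=(2000, 4))
norm = CausalAdaptiveNorm(num_channels=4)
outputs = np.array([norm.step(sample) for sample in stream])
```

After the statistics have caught up with the shifted stream, the normalized outputs are centered near zero again without any labels or gradient updates, which is what makes this kind of alignment cheap enough for a wearable.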

Agentic Rubrics as Contextual Verifiers for SWE Agents

Authors: Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He

First: 2026-01-07T18:38:23+00:00 · Latest: 2026-01-07T18:38:23+00:00

Comments: 31 pages, 11 Figures

Abstract

Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.

中文标题/摘要

标题：代理评分表作为软件工程代理环境中的上下文验证器

验证对于提高代理至关重要：它为强化学习提供奖励信号，并通过测试时缩放（TTS）在推理时获得收益。尽管如此，软件工程（SWE）代理环境中的验证通常依赖于代码执行，这由于环境设置开销而难以扩展。可扩展的替代方案，如补丁分类器和启发式方法存在，但它们在代码库上下文中的根基较浅且难以解释。为此，我们探索了代理评分表：专家代理与仓库交互以创建基于上下文的评分表检查表，候选补丁随后根据其评分表进行评分，无需执行测试。在SWE-Bench Verified的并行TTS评估下，代理评分表在Qwen3-Coder-30B-A3B上的得分为54.2%，在Qwen3-32B上的得分为40.6%，与我们比较集中最强基线相比至少提高了3.5个百分点。我们进一步分析了评分表的行为，显示评分表得分与真实测试结果一致，同时也能标记测试无法捕捉的问题。我们的消融实验表明，代理上下文收集对于生成代码库特定的、无歧义的标准至关重要。综上所述，这些结果表明代理评分表为SWE代理提供了高效、可扩展且精细的验证信号。

Summary / 总结

The research aims to improve verification for software engineering (SWE) agents by providing a scalable alternative to code execution. With Agentic Rubrics, an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are scored against it without requiring test execution. On SWE-Bench Verified under parallel test-time scaling, the method achieves 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, a gain of at least 3.5 percentage points over the strongest baseline in the comparison set. Rubric scores are consistent with ground-truth tests and also flag issues that tests do not capture, and ablations indicate that agentic context gathering is essential for codebase-specific, unambiguous criteria.

研究旨在通过提供可扩展的替代方案来改进软件工程代理的验证过程，以替代代码执行。使用Agentic Rubrics方法，即专家代理创建上下文相关的检查表，对候选补丁进行评分，无需执行测试。该方法在Qwen3-Coder-30B-A3B上得分为54.2%，在Qwen3-32B上得分为40.6%，比最强基线高出至少3.5个百分点。检查表得分与真实测试结果一致，并且能够指出测试未捕捉到的问题，表明该方法的高效性和可扩展性。
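
The scoring step described above can be sketched as a weighted checklist applied to each candidate patch, followed by picking the highest-scoring candidate. This is a toy illustration only: in the paper the rubric is written by an agent and judged by a model, whereas here the `check` callables are simple string predicates standing in for that judgment, and the rubric items, weights, and `parse_date` example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    description: str
    weight: float
    # Stand-in for a model's judgment of whether the patch satisfies this item.
    check: Callable[[str], bool]


def rubric_score(patch: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric items the patch satisfies (0.0 to 1.0)."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(patch))
    return earned / total


def select_patch(candidates: List[str], rubric: List[RubricItem]) -> str:
    """Keep the candidate with the highest rubric score; no tests are executed."""
    return max(candidates, key=lambda patch: rubric_score(patch, rubric))


# Hypothetical rubric for an imagined bug in a `parse_date` helper.
rubric = [
    RubricItem("Edits the parse_date function", 2.0, lambda p: "def parse_date" in p),
    RubricItem("Handles a None input explicitly", 2.0, lambda p: "is None" in p),
    RubricItem("Does not use a bare except", 1.0, lambda p: "except:" not in p),
]

candidates = [
    "def parse_date(s):\n    try:\n        return _parse(s)\n    except:\n        return None",
    "def parse_date(s):\n    if s is None:\n        return None\n    return _parse(s)",
]

scores = [rubric_score(patch, rubric) for patch in candidates]
best = select_patch(candidates, rubric)
```

Because each item is scored separately, the result is more granular than a single pass/fail signal: a reviewer can see which criteria a rejected patch missed.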

Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions

Authors: Abhishek Rath

First: 2026-01-07T18:37:26+00:00 · Latest: 2026-01-07T18:37:26+00:00

Abstract

Multi-agent Large Language Model (LLM) systems have emerged as powerful architectures for complex task decomposition and collaborative problem-solving. However, their long-term behavioral stability remains largely unexamined. This study introduces the concept of agent drift, defined as the progressive degradation of agent behavior, decision quality, and inter-agent coherence over extended interaction sequences. We present a comprehensive theoretical framework for understanding drift phenomena, proposing three distinct manifestations: semantic drift (progressive deviation from original intent), coordination drift (breakdown in multi-agent consensus mechanisms), and behavioral drift (emergence of unintended strategies). We introduce the Agent Stability Index (ASI), a novel composite metric framework for quantifying drift across twelve dimensions, including response consistency, tool usage patterns, reasoning pathway stability, and inter-agent agreement rates. Through simulation-based analysis and theoretical modeling, we demonstrate how unchecked agent drift can lead to substantial reductions in task completion accuracy and increased human intervention requirements. We propose three mitigation strategies: episodic memory consolidation, drift-aware routing protocols, and adaptive behavioral anchoring. Theoretical analysis suggests these approaches can significantly reduce drift-related errors while maintaining system throughput. This work establishes a foundational methodology for monitoring, measuring, and mitigating agent drift in production agentic AI systems, with direct implications for enterprise deployment reliability and AI safety research.

中文标题/摘要

标题：代理漂移：多代理大语言模型系统长期行为退化量化

多代理大语言模型（LLM）系统已成为复杂任务分解和协作问题解决的强大架构。然而，它们的长期行为稳定性尚未得到充分研究。本研究引入了代理漂移的概念，定义为代理行为、决策质量和多代理间一致性在长时间交互序列中的逐步退化。我们提出了一种全面的理论框架来理解漂移现象，提出了三种不同的表现形式：语义漂移（逐步偏离原始意图）、协调漂移（多代理共识机制的失效）和行为漂移（意外策略的出现）。我们引入了代理稳定性指数（ASI），这是一种新颖的综合度量框架，用于在十二个维度上量化漂移，包括响应一致性、工具使用模式、推理路径稳定性和多代理间一致率。通过基于仿真的分析和理论建模，我们展示了未加控制的代理漂移可能导致任务完成准确性大幅下降和人类干预需求增加。我们提出了三种缓解策略：情景记忆巩固、漂移感知路由协议和自适应行为锚定。理论分析表明，这些方法可以显著减少与漂移相关的错误，同时保持系统吞吐量。本研究为监测、测量和缓解生产环境中代理式AI系统的代理漂移奠定了基础方法，对企业部署可靠性和AI安全研究具有直接意义。

Summary / 总结

This study introduces the concept of agent drift in multi-agent LLM systems, defined as the progressive degradation of agent behavior, decision quality, and inter-agent coherence over extended interactions. It proposes a theoretical framework with three manifestations: semantic, coordination, and behavioral drift. The Agent Stability Index (ASI) is introduced as a composite metric to quantify drift across twelve dimensions. Through simulation-based analysis and theoretical modeling, the author argues that unchecked drift can substantially reduce task completion accuracy and increase the need for human intervention. Three mitigation strategies are proposed: episodic memory consolidation, drift-aware routing protocols, and adaptive behavioral anchoring; theoretical analysis suggests they can reduce drift-related errors while maintaining system throughput.

该研究通过引入代理漂移的概念，关注多代理LLM系统的长期行为稳定性，漂移现象包括语义漂移、协调漂移和行为漂移。作者提出了代理稳定性指数（ASI）来衡量这些漂移现象在十二个维度上的表现。通过仿真和理论分析，他们表明未加控制的漂移会降低任务完成的准确性并增加人工干预的需求。研究还提出了三种缓解策略：情景记忆巩固、漂移感知路由协议和自适应行为锚定；理论分析表明，这些策略可以减少漂移相关错误，同时保持系统的吞吐量。

Qomhrá: A Bilingual Irish and English Large Language Model

Authors: Joseph McInerney, Khanh-Tung Tran, Liam Lonergan, Ailbhe Ní Chasaide, Neasa Ní Chiaráin, Barry Devereux

First: 2025-10-20T15:27:53+00:00 · Latest: 2026-01-07T18:35:28+00:00

Abstract

Large language model (LLM) research and development has overwhelmingly focused on the world's major languages, leading to under-representation of low-resource languages such as Irish. This paper introduces Qomhrá, a bilingual Irish and English LLM, developed under extremely low-resource constraints. A complete pipeline is outlined spanning bilingual continued pre-training, instruction tuning, and the synthesis of human preference data for future alignment training. We focus on the lack of scalable methods to create human preference data by proposing a novel method to synthesise such data by prompting an LLM to generate "accepted" and "rejected" responses, which we validate as aligning with L1 Irish speakers. To select an LLM for synthesis, we evaluate the top closed-weight LLMs for Irish language generation performance. Gemini-2.5-Pro is ranked highest by L1 and L2 Irish-speakers, diverging from LLM-as-a-judge ratings, indicating a misalignment between current LLMs and the Irish-language community. Subsequently, we leverage Gemini-2.5-Pro to translate a large scale English-language instruction tuning dataset to Irish and to synthesise a first-of-its-kind Irish-language human preference dataset. We comprehensively evaluate Qomhrá across several benchmarks, testing translation, gender understanding, topic identification, and world knowledge; these evaluations show gains of up to 29% in Irish and 44% in English compared to the existing open-source Irish LLM baseline, UCCIX. The results of our framework provide insight and guidance to developing LLMs for both Irish and other low-resource languages.

中文标题/摘要

标题：Qomhrá：一种双语爱尔兰语和英语大型语言模型

大型语言模型（LLM）的研究与开发主要集中在世界上主要语言上，导致资源稀缺的语言如爱尔兰语的代表性不足。本文介绍了在极其低资源条件下开发的双语爱尔兰语和英语LLM——Qomhrá。概述了一个完整的流程，涵盖双语持续预训练、指令调优以及未来对齐训练中的人类偏好数据合成。我们关注于缺乏可扩展的方法来创建人类偏好数据，提出了一种新颖的方法，通过提示LLM生成“接受”和“拒绝”的响应来合成此类数据，并验证这些数据与一语言爱尔兰语使用者的偏好一致。为了选择用于合成的LLM，我们评估了顶级封闭权重LLM在爱尔兰语生成性能上的表现。Gemini-2.5-Pro在一语言和二语言爱尔兰语使用者中排名最高，这与LLM作为裁判的评级不同，表明当前的LLM与爱尔兰语社区之间存在偏差。随后，我们利用Gemini-2.5-Pro将大规模英语指令调优数据集翻译成爱尔兰语，并合成了一种前所未有的爱尔兰语人类偏好数据集。我们在多个基准上全面评估了Qomhrá，测试了翻译、性别理解、主题识别和世界知识；这些评估显示，与现有的开源爱尔兰语LLM基线UCCIX相比，爱尔兰语方面提高了29%，英语方面提高了44%。我们框架的结果为开发爱尔兰语和其他低资源语言的LLM提供了见解和指导。

Summary / 总结

Qomhrá is a bilingual Irish and English large language model developed under extremely low-resource constraints. The paper outlines a complete pipeline covering bilingual continued pre-training, instruction tuning, and the synthesis of human preference data. It proposes a method that prompts an LLM to generate 'accepted' and 'rejected' responses, validated as aligning with L1 Irish speakers. Gemini-2.5-Pro was chosen for synthesis because L1 and L2 Irish speakers ranked it highest for Irish-language generation. Across benchmarks testing translation, gender understanding, topic identification, and world knowledge, Qomhrá shows gains of up to 29% in Irish and 44% in English over the existing open-source Irish LLM baseline, UCCIX.

Qomhrá 是一种双语爱尔兰语和英语大型语言模型，在资源有限的情况下开发。研究引入了一个完整的管道，包括双语持续预训练、指令调优和生成人类偏好数据。研究提出了一种使用 LLM 生成“接受”和“拒绝”响应的新方法，并通过 L1 爱尔兰语使用者验证了这种方法。Gemini-2.5-Pro 由于其在爱尔兰语生成方面的高表现被选中用于合成。Qomhrá 在翻译、性别理解、主题识别和世界知识方面显示出显著改进，与现有开源爱尔兰语基线 UCCIX 相比，爱尔兰语和英语分别最多提高了 29% 和 44%。

Clinical Data Goes MEDS? Let's OWL make sense of it

Authors: Alberto Marfoglia, Jong Ho Jhee, Adrien Coulet

First: 2026-01-07T18:25:02+00:00 · Latest: 2026-01-07T18:25:02+00:00

Comments: 12 pages, 5 tables, 4 figures

Abstract

The application of machine learning on healthcare data is often hindered by the lack of standardized and semantically explicit representation, leading to limited interoperability and reproducibility across datasets and experiments. The Medical Event Data Standard (MEDS) addresses these issues by introducing a minimal, event-centric data model designed for reproducible machine-learning workflows from health data. However, MEDS is defined as a data-format specification and does not natively provide integration with the Semantic Web ecosystem. In this article, we introduce MEDS-OWL, a lightweight OWL ontology that provides formal concepts and relations to enable representing MEDS datasets as RDF graphs. Additionally, we implemented meds2rdf, a Python conversion library that transforms MEDS events into RDF graphs, ensuring conformance with the ontology. We demonstrate the approach on a synthetic clinical dataset that describes patient care pathways for ruptured intracranial aneurysms and validate the resulting graph using SHACL constraints. The first release of MEDS-OWL comprises 13 classes, 10 object properties, 20 data properties, and 24 OWL axioms. Combined with meds2rdf, it enables data transformation into FAIR-aligned datasets, provenance-aware publishing, and interoperability of event-based clinical data. By bridging MEDS with the Semantic Web, this work contributes a reusable semantic layer for event-based clinical data and establishes a robust foundation for subsequent graph-based analytics.

中文标题/摘要

标题：临床数据进入MEDS？让我们用OWL来理解它

在医疗健康数据上应用机器学习往往受到标准化和语义明确表示缺乏的阻碍，导致数据集和实验之间互操作性和可重复性有限。医疗事件数据标准（MEDS）通过引入一个基于事件的最小化数据模型来解决这些问题，该模型旨在为健康数据的可重复机器学习工作流提供支持。然而，MEDS定义为数据格式规范，并未原生提供与语义网络生态系统的集成。本文介绍了MEDS-OWL，这是一种轻量级的OWL本体，提供了形式概念和关系，以使MEDS数据集能够表示为RDF图。此外，我们还实现了meds2rdf，这是一种Python转换库，将MEDS事件转换为RDF图，确保符合本体。我们通过合成临床数据集展示了这种方法，该数据集描述了颅内动脉瘤破裂患者的治疗路径，并使用SHACL约束验证了生成的图。MEDS-OWL的第一个版本包括13个类、10个对象属性、20个数据属性和24个OWL公理。结合meds2rdf，它使数据能够转换为FAIR对齐的数据集、具有来源意识的发布，并实现基于事件的临床数据的互操作性。通过将MEDS与语义网络相结合，这项工作为基于事件的临床数据提供了一个可重用的语义层，并为后续的图分析奠定了坚实的基础。

Summary / 总结

This study addresses the lack of standardized, semantically explicit representations of healthcare data, which limits interoperability and reproducibility. It introduces MEDS-OWL, a lightweight OWL ontology that provides the concepts and relations needed to represent Medical Event Data Standard (MEDS) datasets as RDF graphs, and meds2rdf, a Python library that converts MEDS events into ontology-conformant RDF. The approach is demonstrated on a synthetic clinical dataset describing care pathways for ruptured intracranial aneurysms, with the resulting graph validated using SHACL constraints; the authors present this as enabling FAIR-aligned, interoperable event-based clinical data.

本文旨在标准化和语义化医疗数据，以促进机器学习应用。它引入了MEDS-OWL，一种将MEDS数据集表示为RDF图的OWL本体，并提供了一个Python库meds2rdf，用于将MEDS事件转换为RDF。该方法通过合成临床数据集和SHACL约束进行了验证，展示了事件驱动的临床数据向FAIR对齐数据集的转换，并增强了互操作性和可再现性。
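
To make the event-to-graph idea concrete, the sketch below serializes one event as N-Triples using only the standard library. It does not use the meds2rdf API, which is not described here. The IRIs and property names (`Event`, `hasSubject`, `hasCode`, `hasTime`, `hasNumericValue`) are hypothetical placeholders rather than the published MEDS-OWL terms, and the event fields (`subject_id`, `time`, `code`, `numeric_value`) follow our understanding of the MEDS column names.

```python
from datetime import datetime

# Hypothetical IRIs for illustration; these are not the published MEDS-OWL terms.
ONT = "http://example.org/meds-owl#"
DATA = "http://example.org/data/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal.

    Simplification: control characters such as newlines are not handled.
    """
    return text.replace("\\", "\\\\").replace('"', '\\"')


def event_to_ntriples(index, event):
    """Return the N-Triples lines describing one event."""
    node = f"<{DATA}event/{index}>"
    subject = f"<{DATA}subject/{event['subject_id']}>"
    code = escape_literal(event["code"])
    timestamp = event["time"].isoformat()
    triples = [
        f"{node} <{RDF_TYPE}> <{ONT}Event> .",
        f"{node} <{ONT}hasSubject> {subject} .",
        f'{node} <{ONT}hasCode> "{code}" .',
        f'{node} <{ONT}hasTime> "{timestamp}"^^<{XSD}dateTime> .',
    ]
    value = event.get("numeric_value")
    if value is not None:  # numeric values are optional on an event
        triples.append(f'{node} <{ONT}hasNumericValue> "{value}"^^<{XSD}double> .')
    return triples


# Made-up example event.
event = {
    "subject_id": 42,
    "time": datetime(2024, 1, 5, 8, 30),
    "code": "LAB//GCS",
    "numeric_value": 13.0,
}
triples = event_to_ntriples(0, event)
```

Giving every event its own node is what lets a shape language such as SHACL check each event for required properties after conversion.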

Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models

Authors: Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, Mattias Rantalainen

First: 2026-01-07T18:24:12+00:00 · Latest: 2026-01-07T18:24:12+00:00

Abstract

Pathology foundation models (PFMs) have become central to computational pathology, aiming to offer general encoders for feature extraction from whole-slide images (WSIs). Despite strong benchmark performance, PFM robustness to real-world technical domain shifts, such as variability from whole-slide scanner devices, remains poorly understood. We systematically evaluated the robustness of 14 PFMs to scanner-induced variability, including state-of-the-art models, earlier self-supervised models, and a baseline trained on natural images. Using a multiscanner dataset of 384 breast cancer WSIs scanned on five devices, we isolated scanner effects independently from biological and laboratory confounders. Robustness is assessed via complementary unsupervised embedding analyses and a set of clinicopathological supervised prediction tasks. Our results demonstrate that current PFMs are not invariant to scanner-induced domain shifts. Most models encode pronounced scanner-specific variability in their embedding spaces. While AUC often remains stable, this masks a critical failure mode: scanner variability systematically alters the embedding space and impacts calibration of downstream model predictions, resulting in scanner-dependent bias that can impact reliability in clinical use cases. We further show that robustness is not a simple function of training data scale, model size, or model recency. None of the models provided reliable robustness against scanner-induced variability. While the models trained on the most diverse data, here represented by vision-language models, appear to have an advantage with respect to robustness, they underperformed on downstream supervised tasks. We conclude that development and evaluation of PFMs requires moving beyond accuracy-centric benchmarks toward explicit evaluation and optimisation of embedding stability and calibration under realistic acquisition variability.

中文标题/摘要

标题：扫描诱导的领域偏移削弱了病理基础模型的稳健性

病理基础模型（PFMs）已成为计算病理学的核心，旨在提供从全切片图像（WSIs）中提取特征的一般编码器。尽管在基准测试中表现出色，但PFMs对现实世界技术领域偏移（如全切片扫描仪设备的变异性）的稳健性仍然知之甚少。我们系统地评估了14种PFMs对扫描器诱导变异性的影响，包括最先进的模型、早期的自监督模型以及一个基于自然图像训练的基线模型。使用包含384例乳腺癌WSI的多扫描器数据集，这些WSI在五种设备上扫描，我们独立地隔离了扫描器效应，排除了生物学和实验室混杂因素的影响。稳健性通过互补的无监督嵌入分析和一系列临床病理学监督预测任务进行评估。我们的结果表明，当前的PFMs对扫描器诱导的领域偏移缺乏不变性。大多数模型在其嵌入空间中编码了明显的扫描器特定变异性。虽然AUC通常保持稳定，但这掩盖了一个关键的失败模式：扫描器变异性系统地改变了嵌入空间，并影响了下游模型预测的校准，导致扫描器依赖性偏差，这可能影响临床应用的可靠性。我们进一步表明，稳健性不是简单地由训练数据规模、模型大小或模型的最新性决定的。没有一个模型能够可靠地抵抗扫描器诱导的变异性。虽然训练数据最多样化的模型（这里代表视觉-语言模型）似乎在稳健性方面具有优势，但它们在下游监督任务中表现不佳。我们得出结论，PFMs的开发和评估需要超越以准确性为中心的基准，转向在现实获取变异性条件下明确评估和优化嵌入稳定性和校准。

Summary / 总结

The study evaluates the robustness of 14 pathology foundation models (PFMs) to scanner-induced variability using a multiscanner dataset of 384 breast cancer whole-slide images scanned on five devices. It finds that most PFMs encode pronounced scanner-specific variability in their embeddings, which affects calibration and leads to scanner-dependent bias in downstream predictions even when the area under the curve (AUC) remains stable. Robustness is not a simple function of training data scale, model size, or model recency; models trained on the most diverse data (vision-language models) appear somewhat more robust but underperform on downstream supervised tasks. None of the evaluated models is reliably robust to scanner-induced domain shifts.

研究使用乳腺癌全切片图像的多扫描仪数据集评估了14种病理基础模型（PFMs）对扫描器引起的变异性的鲁棒性。研究发现，这些模型对扫描器引起的领域变化缺乏不变性，大多数模型在其嵌入空间中编码了扫描器特有的变化。尽管AUC分数通常保持稳定，但这掩盖了扫描器变化如何改变嵌入空间并影响下游预测校准的关键问题，从而引入了扫描器依赖的偏差。鲁棒性并非简单地取决于训练数据规模、模型大小或模型的新旧程度；在最多样化数据上训练的模型（视觉-语言模型）在鲁棒性方面似乎具有优势，但在下游监督任务中表现不佳。研究强调，在现实的获取变异条件下评估PFMs时，需要超越以准确性为中心的基准，明确评估和优化嵌入稳定性和校准。
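
Why a stable AUC can hide a calibration problem is easy to see on synthetic data: AUC depends only on how predictions are ranked, so any order-preserving shift in the scores leaves it unchanged while the predicted probabilities stop matching observed frequencies. The sketch below uses made-up data and a constant logit offset as a hypothetical stand-in for a scanner effect. It does not reproduce the paper's analysis.

```python
import numpy as np


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def expected_calibration_error(labels, probs, n_bins=10):
    """Bin predictions by confidence and average |observed rate - mean prediction|."""
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return float(ece)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
logits = rng.normal(size=2000)
# Labels drawn so that sigmoid(logits) is calibrated by construction.
labels = (rng.random(2000) < sigmoid(logits)).astype(int)

probs_a = sigmoid(logits)          # "scanner A": calibrated predictions
probs_b = sigmoid(logits + 1.5)    # "scanner B": same ranking, shifted logits

auc_a, auc_b = auc(labels, probs_a), auc(labels, probs_b)
ece_a = expected_calibration_error(labels, probs_a)
ece_b = expected_calibration_error(labels, probs_b)
```

The two AUC values coincide because the shift preserves ranking, while the calibration error for the shifted predictions is clearly larger. A benchmark that reports only AUC would not show this difference.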

Exploring Iterative Controllable Summarization with Large Language Models

Authors: Sangwon Ryu, Heejin Do, Daehee Kim, Hwanjo Yu, Dongwoo Kim, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok

First: 2024-11-19T12:36:02+00:00 · Latest: 2026-01-07T18:22:44+00:00

Comments: EACL Findings 2026

Abstract

Large language models (LLMs) have demonstrated remarkable performance in abstractive summarization tasks. However, their ability to precisely control summary attributes (e.g., length or topic) remains underexplored, limiting their adaptability to specific user preferences. In this paper, we systematically explore the controllability of LLMs. To this end, we revisit summary attribute measurements and introduce iterative evaluation metrics, failure rate and average iteration count, to precisely evaluate the controllability of LLMs, rather than merely assessing errors. Our findings show that LLMs struggle more with numerical attributes than with linguistic attributes. To address this challenge, we propose a guide-to-explain framework (GTE) for controllable summarization. Our GTE framework enables the model to identify misaligned attributes in the initial draft and guides it in self-explaining errors in the previous output. By allowing the model to reflect on its misalignment, GTE generates well-adjusted summaries that satisfy the desired attributes with robust effectiveness, requiring surprisingly fewer iterations than other iterative approaches.

中文标题/摘要

标题：探索大规模语言模型的迭代可控总结

大规模语言模型（LLMs）在抽象总结任务中表现出色。然而，它们在精确控制总结属性（如长度或主题）方面的能力尚未得到充分探索，限制了它们对特定用户偏好的适应性。在本文中，我们系统地探索了LLMs的可控性。为此，我们重新审视了总结属性的测量方法，并引入了迭代评估指标、失败率和平均迭代次数，以精确评估LLMs的可控性，而不仅仅是评估错误。我们的研究发现表明，LLMs在处理数值属性方面比处理语言属性更困难。为了解决这一挑战，我们提出了一种解释指南框架（GTE）以实现可控总结。我们的GTE框架使模型能够识别初稿中的不匹配属性，并引导其在前一输出中解释错误。通过允许模型反思其不匹配，GTE生成了调整良好的总结，能够以强大的效果满足所需的属性，所需的迭代次数比其他迭代方法少得多。

Summary / 总结

This paper explores the controllability of large language models (LLMs) in abstractive summarization, introducing new metrics like failure rate and average iteration count to precisely evaluate controllability. The study finds that LLMs have more difficulty with numerical attributes compared to linguistic ones. To improve this, the authors propose a guide-to-explain framework (GTE) that helps the model self-correct and generate summaries that meet desired attributes with fewer iterations.

本文探讨了大型语言模型（LLMs）在抽象总结中的可控性，引入了失败率和平均迭代次数等新指标。研究发现，LLMs在处理数值属性方面比语言属性更困难。为了提高可控性，作者提出了一种指导解释框架（GTE），该框架帮助模型纠正初始草稿中的不匹配属性，并生成满足所需属性的总结，所需迭代次数较少。
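
The two iterative metrics can be sketched as follows. The abstract names them but does not give formulas, so the definitions here are our assumptions: failure rate is the fraction of inputs whose summary still violates the target attribute after a maximum number of attempts, and average iteration count is the mean number of attempts among the inputs that succeed. The word-list "summaries" and the truncating `revise` function are toy stand-ins for a model.

```python
from typing import Callable, List, Tuple


def run_iterative_control(
    draft: List[str],
    satisfies: Callable[[List[str]], bool],
    revise: Callable[[List[str]], List[str]],
    max_iters: int = 5,
) -> Tuple[bool, int]:
    """Check a draft, revising until the attribute holds or attempts run out."""
    summary = draft
    for iteration in range(1, max_iters + 1):
        if satisfies(summary):
            return True, iteration
        if iteration < max_iters:
            summary = revise(summary)
    return False, max_iters


def controllability_metrics(results: List[Tuple[bool, int]]) -> Tuple[float, float]:
    """Return (failure rate, average iteration count over successful runs)."""
    failures = sum(1 for ok, _ in results if not ok)
    successes = [iters for ok, iters in results if ok]
    failure_rate = failures / len(results)
    avg_iterations = sum(successes) / len(successes) if successes else float("nan")
    return failure_rate, avg_iterations


# Toy length control: a summary passes if it has at most 10 words,
# and each revision drops the last two words.
max_words = 10
satisfies = lambda words: len(words) <= max_words
revise = lambda words: words[:-2]

drafts = [["w"] * 10, ["w"] * 14, ["w"] * 30]
results = [run_iterative_control(d, satisfies, revise) for d in drafts]
failure_rate, avg_iterations = controllability_metrics(results)
```

Reporting both numbers separates two questions that a single error score blurs: whether the model can hit the target at all, and how many rounds it needs when it can.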

Attention Needs to Focus: A Unified Perspective on Attention Allocation

Authors: Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao

First: 2026-01-01T08:39:15+00:00 · Latest: 2026-01-07T18:20:49+00:00

Comments: preprint

Abstract

The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. Although prior work has proposed approaches for these issues, they are often studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root -- improper attention allocation. We identify two failure modes: 1) Attention Overload, where tokens receive comparably high weights, blurring semantic features and leading to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention is still forced to distribute, resulting in spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments on the FineWeb-Edu corpus, evaluated across nine diverse benchmarks, demonstrate that Lazy Attention successfully mitigates attention sink and achieves competitive performance compared to both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.

中文标题/摘要

标题：注意力需要聚焦：统一视角下的注意力分配

Transformer架构是现代大型语言模型（LLMs）序列建模的核心，主要得益于其注意力机制。然而，尽管其强大，标准的注意力机制仍存在已知的问题：表示坍塌和注意力陷阱。尽管先前的工作提出了针对这些问题的方法，但它们通常被孤立研究，掩盖了它们之间的深层联系。在本文中，我们提出了一种统一的视角，认为这些问题都可以追溯到一个共同根源——不恰当的注意力分配。我们识别了两种失败模式：1）注意力过载，其中令牌接收相似的高权重，模糊了语义特征，导致表示坍塌；2）注意力不足，其中没有令牌具有语义相关性，但注意力仍被迫分配，导致虚假关注，如注意力陷阱。基于这一洞察，我们引入了懒惰注意力机制，这是一种旨在实现更聚焦的注意力分配的新机制。为了缓解过载，它在头和维度之间使用位置区分来增强令牌的区别。为了对抗不足，它引入了弹性-softmax，这是一种修改的规范化函数，放松了标准的softmax约束，以抑制对不相关令牌的注意力。在FineWeb-Edu语料库上的实验，通过九个不同的基准进行评估，表明懒惰注意力机制成功地缓解了注意力陷阱，并在与标准注意力和现代架构相比时实现了竞争力的性能，同时达到高达59.58%的注意力稀疏度。

Summary / 总结

This paper offers a unified perspective on two known problems of standard Transformer attention, representational collapse and attention sink, tracing both to improper attention allocation. It identifies two failure modes, Attention Overload and Attention Underload, and introduces Lazy Attention, which uses positional discrimination across heads and dimensions and a relaxed normalization called Elastic-Softmax to mitigate them. Experiments show that Lazy Attention mitigates attention sink, reaches up to 59.58% attention sparsity, and achieves competitive performance across nine benchmarks.

本文探讨了标准Transformer架构中注意力机制的表征坍塌和注意力陷阱问题，并提出了一种统一视角，识别了两种失效模式：注意力过载和注意力不足。为解决这些问题，作者引入了Lazy Attention机制，该机制通过位置区分和弹性Softmax来改进注意力分配。实验表明，Lazy Attention能够减少注意力陷阱，并且在多种基准测试中达到与标准注意力机制和现代架构相当的性能，同时达到高达59.58%的注意力稀疏性。
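
Attention sparsity, the fraction of attention weights that are exactly zero, is always zero under a standard softmax because every weight is strictly positive and each row must sum to one. The sketch below contrasts that with one possible relaxation, subtracting a fixed offset and clipping at zero. This relaxation is an assumption chosen for illustration and is not the paper's Elastic-Softmax, whose exact form the abstract does not give. The function names and the offset value are hypothetical.

```python
import numpy as np


def softmax(scores):
    """Standard row-wise softmax: every weight is positive and rows sum to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def offset_clipped_softmax(scores, offset):
    """Illustrative relaxation: rows may sum to less than 1, even to 0."""
    return np.maximum(softmax(scores) - offset, 0.0)


def sparsity(weights):
    """Fraction of attention weights that are exactly zero."""
    return float(np.mean(weights == 0.0))


scores = np.array([
    [4.0, 0.0, 0.0, 0.0],   # one clearly relevant token
    [0.0, 0.0, 0.0, 0.0],   # "underload": no token stands out
])

dense = softmax(scores)
relaxed = offset_clipped_softmax(scores, offset=0.3)
```

In the second row no token is more relevant than the others, yet softmax still hands out a quarter of the attention to each. The relaxed version assigns that row no attention at all, which is the kind of behaviour the Attention Underload discussion is concerned with.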

ToolRM: Outcome Reward Models for Tool-Calling Large Language Models

Authors: Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, Pavan Kapanipathi

First: 2025-09-15T14:17:17+00:00 · Latest: 2026-01-07T18:19:02+00:00

Abstract

As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has emerged as a critical yet underexplored area of research. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark to systematically evaluate reward models in tool-calling scenarios. Our analysis shows that current reward models frequently miss key signals of effective tool use, highlighting the need for domain-specific modeling. We address this by proposing a training framework for outcome reward models using data synthesized from permissively licensed, open-weight LLMs. We introduce ToolRM - a suite of reward models for tool-use ranging from 1.7B to 14B parameters. Across diverse settings, these models consistently outperform general-purpose baselines. Notably, they achieve up to a 25% improvement with Best-of-N sampling, while also improving robustness to input noise, enabling effective data filtering, and supporting RL-training of policy models.

中文标题/摘要

标题：ToolRM：工具调用大型语言模型的结果奖励模型

随着大型语言模型（LLMs）越来越多地与外部工具交互，工具使用方面的奖励建模已成为一个关键但尚未充分探索的研究领域。现有的奖励模型主要基于自然语言输出进行训练，难以评估基于工具的推理和执行。为了量化这一差距，我们引入了FC-RewardBench，这是首个系统评估奖励模型在工具调用场景中的基准测试。我们的分析表明，当前的奖励模型经常遗漏有效工具使用的关键信号，突显了领域特定建模的必要性。我们通过提出一种使用来自宽松许可、开源权重LLM合成数据的训练框架来解决这一问题，该框架用于结果奖励模型的训练。我们引入了ToolRM——一系列从1.7B到14B参数的工具使用奖励模型。在各种不同的场景中，这些模型始终优于通用基准。值得注意的是，它们通过Best-of-N采样实现了高达25%的性能提升，同时提高了对输入噪声的鲁棒性，支持有效数据过滤，并支持基于奖励的学习训练策略模型。

Summary / 总结

The research aims to improve reward modeling for large language models (LLMs) that interact with external tools. To address the limitations of existing reward models, which primarily focus on natural language outputs, the authors introduce FC-RewardBench, a benchmark for evaluating reward models in tool-calling scenarios. The study proposes ToolRM, a suite of outcome reward models, which outperform general-purpose baselines across various settings and show up to a 25% improvement with Best-of-N sampling, enhancing robustness to input noise and supporting RL-training of policy models.

研究旨在改进大型语言模型（LLMs）与外部工具交互时的奖励建模。为解决现有奖励模型主要关注自然语言输出的局限性，作者引入了FC-RewardBench，这是一个用于评估奖励模型在工具调用场景中的基准。研究提出了ToolRM，这是一个用于工具使用的奖励模型套件，在各种场景中均优于通用基线模型，并通过Best-of-N采样实现了高达25%的性能提升，增强了对输入噪声的鲁棒性，并支持基于奖励的学习训练策略模型。
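
Best-of-N sampling with an outcome reward model amounts to scoring N candidate tool calls and keeping the one with the highest reward. In the sketch below a hand-written scoring function stands in for a learned reward model such as ToolRM. The tool schema, the candidate calls, and the scoring rule are all hypothetical and serve only to show the selection step.

```python
import json
from typing import Callable, Dict, List


def toy_reward(call_json: str, tool_schema: Dict) -> float:
    """Hand-written stand-in for a learned outcome reward model (0.0 to 1.0)."""
    try:
        call = json.loads(call_json)
    except json.JSONDecodeError:
        return 0.0  # malformed calls get no reward
    if not isinstance(call, dict):
        return 0.0
    score = 0.0
    if call.get("name") == tool_schema["name"]:
        score += 0.5
    args = call.get("arguments", {})
    required = tool_schema["required"]
    if isinstance(args, dict) and required:
        score += 0.5 * sum(key in args for key in required) / len(required)
    return score


def best_of_n(candidates: List[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward."""
    return max(candidates, key=reward_fn)


# Hypothetical tool and sampled candidate calls.
schema = {"name": "get_weather", "required": ["city", "unit"]}
candidates = [
    '{"name": "get_weather", "arguments": {"city": "Paris"',
    '{"name": "search_web", "arguments": {"city": "Paris", "unit": "celsius"}}',
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',
    '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}',
]

rewards = [toy_reward(c, schema) for c in candidates]
chosen = best_of_n(candidates, lambda c: toy_reward(c, schema))
```

The same scores could also drive data filtering, for example by discarding samples below a threshold, which is one of the uses mentioned in the abstract.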

All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection

Authors: Yuechen Jiang, Zhiwei Liu, Yupeng Cao, Yueru He, Ziyang Xu, Chen Xu, Zhiyang Deng, Prayag Tiwari, Xi Chen, Alejandro Lopez-Lira, Jimin Huang, Junichi Tsujii, Sophia Ananiadou

First: 2026-01-07T18:18:28+00:00 · Latest: 2026-01-07T18:18:28+00:00

Comments: 39 pages; 24 figures

Abs · PDF · Code1 · Code2

Abstract

We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation in realistic news contexts. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original and perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.

中文标题/摘要

标题：并非一切闪烁的都是黄金：一种无参考的反事实财务误导检测基准

我们引入了RFC Bench，这是一种用于在现实新闻环境下评估大型语言模型在财务误导检测方面的基准。RFC Bench 在段落级别运行，并捕捉财务新闻中的上下文复杂性，其中意义源自分散的线索。该基准定义了两个互补的任务：无参考误导检测和基于比较的诊断，使用成对的原始和扰动输入。实验揭示了一致的模式：当有比较性上下文可用时，性能显著增强，而在无参考设置下则暴露出显著的弱点，包括不稳定的预测和增加的无效输出。这些结果表明，当前模型在没有外部支撑的情况下难以保持连贯的信念状态。通过突出这一差距，RFC Bench 为研究无参考推理和推进更可靠的财务误导检测提供了结构化的测试平台。

Summary / 总结

The research introduces RFC Bench, a benchmark for evaluating large language models in detecting financial misinformation in realistic news contexts. It focuses on reference-free misinformation detection and comparative diagnosis using paired original and perturbed inputs. The experiments show that models perform better with comparative context but struggle in reference-free settings, indicating difficulties in maintaining coherent belief states without external grounding. This highlights the need for more reliable financial misinformation detection methods.

RFC Bench 是一个基准，用于评估大型语言模型在现实新闻环境中检测金融误导信息的能力。它关注段落级别的分析，并捕捉金融新闻中分散的线索。基准包括两个任务：无参考误导信息检测和使用原始和修改输入配对的比较诊断。实验表明，模型在有比较上下文时表现更好，但在无参考设置中遇到困难，表明它们在没有外部支撑的情况下难以保持连贯的信念状态。这突显了需要更可靠的金融误导信息检测方法。

FLEx: Language Modeling with Few-shot Language Explanations

Authors: Adar Avsian, Christopher Richardson, Anirudh Sundar, Larry Heck

First: 2026-01-07T18:12:05+00:00 · Latest: 2026-01-07T18:12:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Language models have become effective at a wide range of tasks, from math problem solving to open-domain question answering. However, they still make mistakes, and these mistakes are often repeated across related queries. Natural language explanations can help correct these errors, but collecting them at scale may be infeasible, particularly in domains where expert annotators are required. To address this issue, we introduce FLEx (Few-shot Language Explanations), a method for improving model behavior using a small number of explanatory examples. FLEx selects representative model errors using embedding-based clustering, verifies that the associated explanations correct those errors, and summarizes them into a prompt prefix that is prepended at inference time. This summary guides the model to avoid similar errors on new inputs, without modifying model weights. We evaluate FLEx on CounterBench, GSM8K, and ReasonIF. We find that FLEx consistently outperforms chain-of-thought (CoT) prompting across all three datasets and reduces up to 83% of CoT's remaining errors.

中文标题/摘要

标题：FLEx：基于少量语言解释的语言建模

语言模型在从数学问题解决到开放域问答等多种任务中变得非常有效。然而，它们仍然会犯错误，这些错误在相关查询中经常重复出现。自然语言解释可以帮助纠正这些错误，但在大规模收集这些解释时可能不可行，特别是在需要专家注释员的领域。为了解决这个问题，我们引入了FLEx（基于少量语言解释），这是一种使用少量解释性示例来改善模型行为的方法。FLEx使用嵌入式聚类选择代表性模型错误，验证相关解释是否纠正了这些错误，并在推理时将其总结为一个前缀提示。这个总结在新输入上引导模型避免类似的错误，而不修改模型权重。我们在CounterBench、GSM8K和ReasonIF上评估了FLEx。我们发现FLEx在所有三个数据集上都优于思维链（CoT）提示，并最多减少了CoT剩余错误的83%。

Summary / 总结

The paper introduces FLEx, a method that uses a few natural language explanations to correct model errors without altering model weights. FLEx clusters model errors, verifies the correctness of explanations, and summarizes them into a prompt prefix for inference. The method improves model performance on CounterBench, GSM8K, and ReasonIF, outperforming chain-of-thought prompting and reducing up to 83% of remaining errors.

论文提出了FLEx方法，该方法使用少量自然语言解释来纠正模型错误，而不修改模型权重。它使用聚类选择代表性错误，验证解释，并将它们总结为一个在推理时前置的提示前缀。FLEx在CounterBench、GSM8K和ReasonIF上表现更好，优于链式思考提示，并减少了高达83%的剩余错误。
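
The selection step described in the abstract (picking representative errors by clustering their embeddings) can be sketched as below. This is a simplified illustration under stated assumptions: it uses a small hand-written k-means with a deterministic farthest-point initialization, and it builds the prompt prefix by plain concatenation, whereas the abstract describes verifying the explanations and summarizing them. The function name, embeddings, and explanation strings are all hypothetical.

```python
import numpy as np


def representative_indices(embeddings, k, iters=20):
    """Cluster error embeddings with k-means; return the index nearest each centroid."""
    X = np.asarray(embeddings, dtype=float)
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.stack(centers)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return sorted({int(dists[:, j].argmin()) for j in range(k)})


# Toy 2-D "embeddings" of six model errors forming two obvious groups.
error_embeddings = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
explanations = [f"explanation for error {i}" for i in range(6)]

reps = representative_indices(error_embeddings, k=2)
# Simplification: concatenate the chosen explanations into a prompt prefix.
prompt_prefix = "\n".join(explanations[i] for i in reps)
```

The prefix would then be prepended to each new query at inference time, leaving the model weights untouched.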

CktGen: Automated Analog Circuit Design with Generative Artificial Intelligence

Authors: Yuxuan Hou, Hehe Fan, Jianrong Zhang, Yue Zhang, Hua Chen, Min Zhou, Faxin Yu, Roger Zimmermann, Yi Yang

First: 2024-10-01T18:35:44+00:00 · Latest: 2026-01-07T18:11:26+00:00

Comments: Paper accepted by Engineering

Abs · PDF · Code1 · Code2

Abstract

The automatic synthesis of analog circuits presents significant challenges. Most existing approaches formulate the problem as a single-objective optimization task, overlooking that design specifications for a given circuit type vary widely across applications. To address this, we introduce specification-conditioned analog circuit generation, a task that directly generates analog circuits based on target specifications. The motivation is to leverage existing well-designed circuits to improve automation in analog circuit design. Specifically, we propose CktGen, a simple yet effective variational autoencoder that maps discretized specifications and circuits into a joint latent space and reconstructs the circuit from that latent vector. Notably, as a single specification may correspond to multiple valid circuits, naively fusing specification information into the generative model does not capture these one-to-many relationships. To address this, we decouple the encoding of circuits and specifications and align their mapped latent space. Then, we employ contrastive training with a filter mask to maximize differences between encoded circuits and specifications. Furthermore, classifier guidance along with latent feature alignment promotes the clustering of circuits sharing the same specification, avoiding model collapse into trivial one-to-one mappings. By canonicalizing the latent space with respect to specifications, we can search for an optimal circuit that meets valid target specifications. We conduct comprehensive experiments on the open circuit benchmark and introduce metrics to evaluate cross-model consistency. Experimental results demonstrate that CktGen achieves substantial improvements over state-of-the-art methods.

中文标题/摘要

标题：CktGen：基于生成人工智能的自动化模拟电路设计

模拟电路的自动综合面临着重大挑战。大多数现有方法将问题表述为单一目标优化任务，忽略了给定电路类型在不同应用中的设计规范差异。为解决这一问题，我们引入了基于规范条件的模拟电路生成任务，该任务直接根据目标规范生成模拟电路。动机是利用现有设计良好的电路来提高模拟电路设计的自动化程度。具体而言，我们提出了CktGen，这是一种简单而有效的变分自编码器，将离散化规范和电路映射到联合潜在空间，并从潜在向量中重建电路。值得注意的是，一种规范可能对应多个有效电路，简单地将规范信息融合到生成模型中不能捕捉这些一对多的关系。为此，我们解耦电路和规范的编码，并对它们的映射潜在空间进行对齐。然后，我们采用对比训练与滤波掩码来最大化编码电路和规范之间的差异。此外，分类器指导与潜在特征对齐促进了具有相同规范的电路的聚类，避免模型陷入简单的一对一映射。通过对规范进行潜在空间的规范化，我们可以搜索满足有效目标规范的最佳电路。我们在开放电路基准上进行了全面的实验，并引入了评估跨模型一致性的度量。实验结果表明，CktGen 在最先进的方法上取得了显著的改进。

Summary / 总结

CktGen is designed to address the challenges in automatically synthesizing analog circuits by leveraging specification-conditioned analog circuit generation. It uses a variational autoencoder to map discretized specifications and circuits into a joint latent space, and employs contrastive training with a filter mask to align the latent space of circuits and specifications. This method avoids model collapse and promotes clustering of circuits with the same specifications. Experiments on an open circuit benchmark show that CktGen significantly outperforms existing state-of-the-art methods.

论文通过引入CktGen，一种将规格和电路映射到联合潜在空间的变分自编码器，解决了自动合成模拟电路的挑战。CktGen将电路和规格的编码解耦，并对齐它们的潜在空间，使用对比训练确保电路和规格的潜在表示是匹配的。该方法还使用分类器指导来促进具有相同规格的电路的聚类，避免简单的一对一映射。实验表明，CktGen在生成符合目标规格的模拟电路方面优于现有方法。

Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning

Authors: Yifan Wang, Yanyu Li, Sergey Tulyakov, Yun Fu, Anil Kag

First: 2026-01-07T18:05:08+00:00 · Latest: 2026-01-07T18:05:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy-to-game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse -- without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.

中文标题/摘要

标题：扩散-DRF：可微奖励流用于视频扩散微调

直接偏好优化（DPO）最近通过提高视觉保真度和文本对齐性改善了文本到视频（T2V）生成。然而，当前方法依赖于人类注释或学习的奖励模型中的非可微偏好信号，这使得训练耗时、易产生偏差且容易被操纵，常导致奖励劫持和训练不稳定。我们提出了一种可微奖励流（Diffusion-DRF），使用冻结的现成视觉-语言模型（VLM）作为无训练的批评家，对视频扩散模型进行微调。Diffusion-DRF 通过扩散去噪链直接反向传播 VLM 反馈，将 logits 级别的响应转换为可优化的 token 意识梯度。我们提出了一种自动化的、按方面结构化的提示管道，以获得可靠的多维度 VLM 反馈，同时梯度检查点使最终去噪步骤中的高效更新成为可能。Diffusion-DRF 在不使用额外奖励模型或偏好数据集的情况下，提高了视频质量和语义对齐性，同时减轻了奖励劫持和崩溃。它具有模型通用性，并且可以轻松推广到其他基于扩散的生成任务。

Summary / 总结

The paper addresses the limitations of current Direct Preference Optimization (DPO) methods in Text-to-Video (T2V) generation, which rely on non-differentiable preference signals that are label-intensive, bias-prone, and prone to reward hacking. It introduces Diffusion-DRF, a differentiable reward flow that uses a frozen Vision-Language Model (VLM) to provide token-aware gradients for optimization, avoiding the need for additional reward models or preference datasets. The method improves video quality and semantic alignment while mitigating reward hacking and collapse.

研究旨在通过解决当前直接偏好优化（DPO）方法的局限性，即依赖非可微偏好信号，来改进文本到视频（T2V）生成。提出的Diffusion-DRF方法使用冻结的视觉-语言模型（VLM）作为批评者，提供可感知标记的梯度进行优化，直接反向传播通过去噪链。这种方法提高了视频质量和语义对齐，缓解了奖励作弊和崩溃问题，无需额外的奖励模型或偏好数据集。

Klear: Unified Multi-Task Audio-Video Joint Generation

Authors: Jun Wang, Chunyu Qiang, Yuxin Guo, Yiran Wang, Xijuan Zeng, Chen Zhang, Pengfei Wan

First: 2026-01-07T18:03:45+00:00 · Latest: 2026-01-07T18:03:45+00:00

Abs · PDF · Code1 · Code2

Abstract

Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which can stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Klear and delve into three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we adopt a progressive multitask regime (random modality masking to joint optimization across tasks) and a multistage curriculum, yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions, and introduce a novel automated data-construction pipeline which annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.

中文标题/摘要

标题：Klear：统一多任务音频-视频联合生成

音频-视频联合生成取得了快速进展，但仍面临重大挑战。非商业方法仍存在音频-视觉不同步、唇音对齐差和单模态退化等问题，这些问题可能源于弱的音频-视觉对应建模、有限的泛化能力和稀缺的高质量密集字幕数据。为解决这些问题，我们引入了Klear，并深入探讨了三个维度——模型架构、训练策略和数据整理。从架构上看，我们采用了单塔设计，使用统一的DiT块和全注意力机制，实现了紧密的音频-视觉对齐和强大的可扩展性。从训练上看，我们采用了渐进式多任务制度——随机模态掩蔽以跨任务联合优化，以及多阶段课程，从而生成稳健的表示，增强音频-视觉一致的世界知识，并防止单模态崩溃。对于数据集，我们首次提出了一个大规模的带有密集字幕的音频-视频数据集，并引入了一种新的自动化数据构建管道，该管道标注和过滤了数百万个多样、高质量、严格对齐的音频-视频-字幕三元组。基于此，Klear能够扩展到大规模数据集，提供高保真度、语义和时间上对齐的、指令跟随的生成，在联合和单模态设置中都能稳健泛化到分布外场景。在各个任务上，它大幅优于先前的方法，并实现了与Veo 3相当的性能，提供了一条统一、可扩展的通往下一代音频-视频合成的道路。

Summary / 总结

Klear addresses the challenges in audio-video joint generation by improving model architecture, training strategy, and data curation. It uses a single-tower design with unified DiT blocks and Omni-Full Attention for better audio-visual alignment and scalability. The training strategy involves a progressive multitask regime with random modality masking and a multistage curriculum to enhance robust representations and prevent unimodal collapse. Klear also introduces a large-scale dataset with dense captions and an automated data-construction pipeline, leading to high-fidelity, semantically and temporally aligned generation in both joint and unimodal settings, and outperforming previous methods significantly across tasks.

Klear通过引入统一的模型架构、渐进的多任务训练策略以及新的数据整理流程，解决了音频-视频联合生成中的挑战。模型采用单塔设计，使用DiT块和全注意力机制，实现了紧密的音频-视频对齐。训练策略包括随机模态遮蔽和多阶段课程，增强了鲁棒性并防止了单模态崩溃。数据集包含数百万个密集标注的音频-视频三元组，Klear在多个任务上显著超越了先前的方法，提供了高质量、语义和时间上对齐的生成，在联合和单模态设置中都表现出色。

Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Authors: Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra, Shangtong Zhang, Yanjun Qi

First: 2025-05-21T16:15:01+00:00 · Latest: 2026-01-07T17:58:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, which we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference for self-improvement on a given task. After each response, the model receives numerical scalar feedback, denoted as a reward. In the next round, we prompt the LLM again together with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.

中文标题/摘要

标题：奖励足够：大语言模型是上下文强化学习者

强化学习（RL）是一种解决顺序决策问题的框架。在本工作中，我们展示了令人惊讶的现象：在大型语言模型（LLMs）的推理过程中，RL 会自然地出现，我们将其称为上下文强化学习（ICRL）。为了揭示这一能力，我们引入了一种简单的多轮提示框架，称为 ICRL 提示，用于推理时的自我改进。ICRL 提示的目标是引导 LLM 在推理过程中进行强化学习，以在给定任务上进行自我改进。每次响应后，模型会收到一个数值标量反馈，称为奖励。在下一轮中，我们再次提示 LLM 并提供一个上下文，该上下文是所有先前响应及其相关奖励的串联。我们观察到，随着上下文的增长，响应质量持续提高。换句话说，LLM 可以在推理过程中优化标量奖励信号，表现出类似于强化学习的行为。我们在 24 点游戏、创意写作、ScienceWorld 以及奥林匹克级别的数学竞赛（AIME 和 HMMT）中评估了 ICRL 提示，展示了其相对于 Self-Refine 和 Reflexion 等基线的显著改进。值得注意的是，即使奖励信号由相同的 LLM 生成，ICRL 提示仍然提高了性能，突显出一种新的测试时扩展范式。

Summary / 总结

This study explores the emergence of reinforcement learning (RL) during the inference of large language models (LLMs), termed in-context RL (ICRL). By using a multi-round prompting framework, the researchers guide the LLMs to improve their responses based on received numerical scalar feedback (rewards). The results show that response quality improves as the context grows, indicating that LLMs can optimize for reward signals during inference, similar to RL. The ICRL prompting method was evaluated on various tasks, including Game of 24, creative writing, ScienceWorld, and math competitions, showing significant improvements over existing methods like Self-Refine and Reflexion.

研究探讨了大型语言模型（LLMs）在推理过程中出现的强化学习（RL）现象，称为在上下文中的RL（ICRL）。通过使用多轮提示框架，研究人员引导LLMs根据接收到的数值标量反馈（奖励）来改进其响应。结果显示，随着上下文的增加，响应质量有所提高，表明LLMs可以在推理过程中优化奖励信号，类似于RL。ICRL提示方法在包括24点游戏、创意写作、ScienceWorld和数学竞赛（AIME和HMMT）等任务上进行了评估，显示出比现有方法如Self-Refine和Reflexion显著的改进。
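
The multi-round loop described in the abstract (respond, receive a scalar reward, then re-prompt with all prior responses and rewards in context) can be sketched as follows. This is a minimal sketch under stated assumptions: `icrl_prompting`, the prompt layout, `stub_llm`, and `toy_reward` are hypothetical, and the stub is a deterministic placeholder that does not learn anything; it only makes the loop runnable without a model.

```python
from typing import Callable, List, Tuple


def icrl_prompting(
    task: str,
    llm: Callable[[str], str],
    reward_fn: Callable[[str], float],
    rounds: int,
) -> List[Tuple[str, float]]:
    """Run several rounds, feeding all prior (response, reward) pairs back as context."""
    history: List[Tuple[str, float]] = []
    for _ in range(rounds):
        # Concatenate every earlier attempt with its scalar reward.
        context = "".join(f"Response: {r}\nReward: {w}\n" for r, w in history)
        prompt = f"{context}Task: {task}\nResponse:"
        response = llm(prompt)
        history.append((response, reward_fn(response)))
    return history


def stub_llm(prompt: str) -> str:
    # Placeholder model: proposes (number of earlier attempts + 1).
    return str(prompt.count("Reward:") + 1)


def toy_reward(response: str) -> float:
    # Toy scalar reward: negative distance to a hidden target of 3.
    return -abs(int(response) - 3)


history = icrl_prompting("Guess the hidden number", stub_llm, toy_reward, rounds=3)
```

With a real LLM in place of `stub_llm`, the growing context is the only channel through which the reward signal reaches the model; no weights are updated.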

Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test

Authors: Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, Weishi Mi, Qingpo Wuwu, Peidong Jia, Yulin Luo, Kevin Zhang, Zhiyuan Qin, Yong Dai, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang

First: 2026-01-07T17:50:37+00:00 · Latest: 2026-01-07T17:50:37+00:00

Abs · PDF · Code1 · Code2

Abstract

As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, video foundation models still have two critical questions unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow, wo, val). Building upon 609 robot manipulation data, Wow-wo-val examines five core abilities, including perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, which achieves a high Pearson Correlation between the overall score and human preference (>0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to approximately 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.

中文标题/摘要

标题：Wow, wo, val！全面的具身世界模型评估图灵测试

随着具身人工智能中世界模型的发展，越来越多的研究工作探索使用视频基础模型作为下游具身任务（如3D预测或交互生成）的预测世界模型。然而，在探索这些下游任务之前，视频基础模型仍然有两个关键问题未得到解答：（1）它们的生成泛化是否足够以保持人类观察者的感知保真度，（2）它们是否足够稳健以作为现实世界具身代理的通用先验。为了提供一个标准化框架来回答这些问题，我们引入了具身图灵测试基准：Wow-wo-val（Wow, wo, val）。基于609个机器人操作数据，Wow-wo-val 检查了五个核心能力，包括感知、规划、预测、泛化和执行。我们提出了一种全面的评估协议，包含22个指标来评估模型的生成能力，该协议在整体评分与人类偏好之间的皮尔逊相关系数超过0.93，并为人类图灵测试建立了可靠的基础。在Wow-wo-val上，模型在长时规划上仅达到17.27分，在物理一致性上最高为68.02分，表明空间时间一致性有限和物理推理能力有限。对于逆动力学模型图灵测试，我们首先使用逆动力学模型（IDM）评估视频基础模型在现实世界中的执行准确性。然而，大多数模型的准确率降至约0%，而Wow保持了40.74%的成功率。这些发现表明生成的视频与现实世界之间存在明显的差距，突显了在具身人工智能中基准测试世界模型的紧迫性和必要性。

Summary / 总结

This paper introduces Wow-wo-val, a benchmark for evaluating world models in embodied AI, addressing the critical questions of perceptual fidelity and robustness. The evaluation protocol includes 22 metrics, and its overall score achieves a Pearson correlation above 0.93 with human preference. Models show limited spatiotemporal consistency and physical reasoning, scoring only 17.27 on long-horizon planning and at best 68.02 on physical consistency. For the Inverse Dynamic Model Turing Test, most models collapse to approximately 0% success while WoW maintains a 40.74% success rate, highlighting the gap between generated videos and real-world execution.

论文提出了Wow-wo-val基准，用于评估视频基础模型在感知保真度和鲁棒性方面的表现。它评估了五个核心能力，并使用22个指标进行评估，发现模型在长时规划和物理一致性方面表现较差。在逆动力学模型图灵测试中，Wow相比其他模型在真实世界执行准确性上表现更好，这表明生成的视频与现实世界之间存在显著差距。
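
The agreement between an overall benchmark score and human preference, which the abstract reports as a Pearson correlation, is computed as sketched below. The numbers are toy values chosen only to exercise the formula and are not results from the paper; `pearson` is a hypothetical helper name.

```python
import numpy as np


def pearson(x, y) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# Toy data (not from the paper): per-model overall scores and human ratings.
overall_scores = [1.0, 2.0, 3.0, 4.0]
human_ratings = [2.0, 4.0, 6.0, 8.0]
r = pearson(overall_scores, human_ratings)
```

A value near 1 means the automatic score ranks models almost exactly as human raters do, which is the property the benchmark relies on.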

LLMberjack: Guided Trimming of Debate Trees for Multi-Party Conversation Creation

Authors: Leonardo Bottona, Nicolò Penzo, Bruno Lepri, Marco Guerini, Sara Tonelli

First: 2026-01-07T17:49:17+00:00 · Latest: 2026-01-07T17:49:17+00:00

Comments: 9 pages, 3 figures

Abs · PDF · Code1 · Code2

Abstract

We present LLMberjack, a platform for creating multi-party conversations starting from existing debates, originally structured as reply trees. The system offers an interactive interface that visualizes discussion trees and enables users to construct coherent linearized dialogue sequences while preserving participant identity and discourse relations. It integrates optional large language model (LLM) assistance to support automatic editing of the messages and speakers' descriptions. We demonstrate the platform's utility by showing how tree visualization facilitates the creation of coherent, meaningful conversation threads and how LLM support enhances output quality while reducing human effort. The tool is open-source and designed to promote transparent and reproducible workflows to create multi-party conversations, addressing a lack of resources of this type.

中文标题/摘要

标题：LLMberjack：基于辩论树的多党对话创建引导式修剪平台

我们提出LLMberjack，一个从现有辩论（最初以回复树形式结构化）开始创建多党对话的平台。该系统提供了一个交互式界面，可视化讨论树，并使用户能够在保留参与者身份和话语关系的同时构建连贯的线性对话序列。该系统集成了可选的大语言模型（LLM）辅助，以支持消息和发言者描述的自动编辑。我们通过展示树状图可视化如何促进创建连贯且有意义的对话线程以及LLM支持如何提高输出质量并减少人力投入来展示该平台的实用性。该工具是开源的，旨在促进透明和可重复的工作流程，以创建多党对话，解决此类资源缺乏的问题。

Summary / 总结

LLMberjack is a platform that transforms debate trees into coherent multi-party conversations by visualizing discussion trees and allowing users to construct linearized dialogue sequences while preserving discourse relations. It optionally uses large language models to assist in editing messages and speaker descriptions, improving output quality. The platform demonstrates that tree visualization aids in creating meaningful conversation threads and reduces human effort through LLM support.

LLMberjack 是一个平台，通过可视化讨论树并允许用户编辑消息和演讲者描述（可选使用大型语言模型辅助），将辩论树转换为连贯的多党对话。该系统有助于创建有意义的对话线程，同时保留话语关系和参与者身份，减少人力投入。关键发现包括树状图可视化的作用以及通过大型语言模型支持提高输出质量。
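
Turning one branch of a reply tree into a linear dialogue, as the abstract describes, amounts to walking from a chosen leaf back to the root and reversing the path. The sketch below assumes a hypothetical message schema (`id`, `parent`, `speaker`, `text`); it is not the platform's actual data model.

```python
from typing import Dict, List, Optional, Tuple


def linearize_thread(messages: List[Dict], leaf_id: int) -> List[Tuple[str, str]]:
    """Return the root-to-leaf path as (speaker, text) turns, preserving speaker identity."""
    by_id = {m["id"]: m for m in messages}
    path = []
    current: Optional[int] = leaf_id
    while current is not None:
        msg = by_id[current]
        path.append(msg)
        current = msg["parent"]  # the root has parent None
    path.reverse()
    return [(m["speaker"], m["text"]) for m in path]


# Toy debate tree: message 1 is the root; 2 and 3 reply to it; 4 replies to 2.
messages = [
    {"id": 1, "parent": None, "speaker": "A", "text": "Claim"},
    {"id": 2, "parent": 1, "speaker": "B", "text": "Objection"},
    {"id": 3, "parent": 1, "speaker": "C", "text": "Support"},
    {"id": 4, "parent": 2, "speaker": "A", "text": "Rebuttal"},
]
dialogue = linearize_thread(messages, leaf_id=4)
```

Message 3 is dropped because it lies on a different branch; deciding which branches to keep is the trimming step that the tool leaves to the user.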

ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models

Authors: Nikhil Anand, Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, Koyel Mukherjee

First: 2026-01-07T17:45:20+00:00 · Latest: 2026-01-07T17:45:20+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) encode vast amounts of parametric knowledge during pre-training. As world knowledge evolves, effective deployment increasingly depends on their ability to faithfully follow externally retrieved context. When such evidence conflicts with the model's internal knowledge, LLMs often default to memorized facts, producing unfaithful outputs. In this work, we introduce ContextFocus, a lightweight activation steering approach that improves context faithfulness in such knowledge-conflict settings while preserving fluency and efficiency. Unlike prior approaches, our solution requires no model finetuning and incurs minimal inference-time overhead, making it highly efficient. We evaluate ContextFocus on the ConFiQA benchmark, comparing it against strong baselines including ContextDPO, COIECD, and prompting-based methods. Furthermore, we show that our method is complementary to prompting strategies and remains effective on larger models. Extensive experiments show that ContextFocus significantly improves contextual-faithfulness. Our results highlight the effectiveness, robustness, and efficiency of ContextFocus in improving contextual-faithfulness of LLM outputs.

中文标题/摘要

标题：ContextFocus: 用于大型语言模型上下文忠实性的激活导向控制

大型语言模型（LLMs）在预训练过程中编码了大量的参数知识。随着世界知识的演变，有效部署越来越多地依赖于它们能够忠实跟随外部检索到的上下文的能力。当这种证据与模型内部知识发生冲突时，LLMs 经常默认使用记忆中的事实，产生不忠实的输出。在本研究中，我们引入了 ContextFocus，一种轻量级的激活导向控制方法，该方法在知识冲突情况下提高了上下文忠实性，同时保持流畅性和效率。与先前的方法不同，我们的解决方案不需要模型微调，并且在推理时间上几乎没有额外开销，使其非常高效。我们在 ConFiQA 基准上评估了 ContextFocus，将其与 ContextDPO、COIECD 和基于提示的方法等强基线进行比较。此外，我们展示了我们的方法与提示策略的互补性，并且在更大规模的模型上仍然有效。广泛的实验表明，ContextFocus 显著提高了上下文忠实性。我们的结果突显了 ContextFocus 在提高LLM输出上下文忠实性方面的有效性和鲁棒性以及效率。

Summary / 总结

This work addresses the issue of large language models producing unfaithful outputs when external context conflicts with their internal knowledge. The authors introduce ContextFocus, a lightweight activation steering approach that enhances context faithfulness without requiring model fine-tuning or significant inference-time overhead. Experiments on the ConFiQA benchmark demonstrate that ContextFocus significantly improves contextual-faithfulness compared to strong baselines and is effective on larger models.

该研究针对大型语言模型在外部上下文与内部知识冲突时产生不忠实输出的问题，提出了一种轻量级的激活引导方法ContextFocus，该方法在无需模型微调和显著增加推理时间开销的情况下，增强了上下文忠实性。在ConFiQA基准上的实验表明，ContextFocus显著提高了上下文忠实性，并且在更大规模的模型上仍然有效。
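
Activation steering in general works by adding a fixed direction to a model's hidden activations at inference time. The abstract does not say how ContextFocus derives its direction, so the sketch below uses a common generic recipe (difference of mean activations between two prompt sets) purely as an assumption; the function names, the toy 2-D activations, and the scale `alpha` are hypothetical.

```python
import numpy as np


def steering_vector(acts_with_context, acts_without_context):
    """Mean-difference direction between two sets of hidden activations."""
    a = np.asarray(acts_with_context, dtype=float)
    b = np.asarray(acts_without_context, dtype=float)
    return a.mean(axis=0) - b.mean(axis=0)


def steer(hidden, direction, alpha):
    """Shift a hidden state along the steering direction; no weights are changed."""
    return np.asarray(hidden, dtype=float) + alpha * direction


# Toy 2-D activations standing in for one layer's hidden states.
with_ctx = [[1.0, 0.0], [3.0, 0.0]]
without_ctx = [[0.0, 0.0], [0.0, 2.0]]
v = steering_vector(with_ctx, without_ctx)
steered = steer([0.0, 0.0], v, alpha=0.5)
```

Because the intervention is a single vector addition per steered layer, it needs no fine-tuning and adds little inference cost, which matches the efficiency claim in the abstract.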

Pixel-Wise Multimodal Contrastive Learning for Remote Sensing Images

Authors: Leandro Stival, Ricardo da Silva Torres, Helio Pedrini

First: 2026-01-07T17:41:11+00:00 · Latest: 2026-01-07T17:41:11+00:00

Comments: 21 pages, 9 Figures

Abs · PDF · Code1 · Code2

Abstract

Satellites continuously generate massive volumes of data, particularly for Earth observation, including satellite image time series (SITS). However, most deep learning models are designed to process either entire images or complete time series sequences to extract meaningful features for downstream tasks. In this study, we propose a novel multimodal approach that leverages pixel-wise two-dimensional (2D) representations to encode visual property variations from SITS more effectively. Specifically, we generate recurrence plots from pixel-based vegetation index time series (NDVI, EVI, and SAVI) as an alternative to using raw pixel values, creating more informative representations. Additionally, we introduce PIxel-wise Multimodal Contrastive (PIMC), a new multimodal self-supervision approach that produces effective encoders based on two-dimensional pixel time series representations and remote sensing imagery (RSI). To validate our approach, we assess its performance on three downstream tasks: pixel-level forecasting and classification using the PASTIS dataset, and land cover classification on the EuroSAT dataset. Moreover, we compare our results to state-of-the-art (SOTA) methods on all downstream tasks. Our experimental results show that the use of 2D representations significantly enhances feature extraction from SITS, while contrastive learning improves the quality of representations for both pixel time series and RSI. These findings suggest that our multimodal method outperforms existing models in various Earth observation tasks, establishing it as a robust self-supervision framework for processing both SITS and RSI. Code available on

中文标题/摘要

标题：遥感图像像素级多模态对比学习

卫星持续生成大量数据，特别是用于地球观测的卫星图像时间序列（SITS）。然而，大多数深度学习模型设计用于处理整个图像或完整的时间序列以提取有意义的特征用于下游任务。在本研究中，我们提出了一种新颖的多模态方法，利用像素级二维（2D）表示来更有效地编码SITS中的视觉属性变化。具体而言，我们从基于像素的植被指数时间序列（NDVI、EVI和SAVI）生成递归图，作为使用原始像素值的替代方法，创建更具信息量的表示。此外，我们引入了PIxel-wise多模态对比（PIMC），这是一种新的多模态自监督方法，基于二维像素时间序列表示和遥感图像（RSI）生成有效的编码器。为了验证我们的方法，我们在三个下游任务上评估其性能：使用PASTIS数据集的像素级预测和分类，以及使用EuroSAT数据集的土地覆盖分类。此外，我们在所有下游任务上将我们的结果与最先进的（SOTA）方法进行比较。我们的实验结果表明，使用二维表示显著提高了从SITS中提取特征的能力，而对比学习提高了像素时间序列和RSI表示的质量。这些发现表明，我们的多模态方法在各种地球观测任务中优于现有模型，确立了其作为处理SITS和RSI的稳健自监督框架的地位。代码可在

Summary / 总结

This study introduces a novel multimodal approach called PIxel-wise Multimodal Contrastive (PIMC) for processing satellite image time series (SITS) data. By generating recurrence plots from pixel-based vegetation index time series and using contrastive learning, the method effectively encodes visual property variations. Experiments on pixel-level forecasting and classification tasks with the PASTIS dataset and land cover classification on the EuroSAT dataset demonstrate that PIMC outperforms state-of-the-art methods, highlighting the effectiveness of 2D representations and contrastive learning in enhancing feature extraction from SITS and remote sensing imagery (RSI).

该研究提出了一种名为PIxel-wise Multimodal Contrastive (PIMC)的新多模态方法，用于处理卫星图像时间序列（SITS）数据。通过生成基于像素的植被指数时间序列的回溯图并使用对比学习，该方法有效地编码了SITS中的视觉属性变化。在使用PASTIS数据集进行像素级预测和分类任务以及使用EuroSAT数据集进行土地覆盖分类实验中，PIMC在特征提取和表示质量方面均优于现有最先进的方法，表明该方法在地球观测任务中具有强大的自监督框架作用。
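
The 2D representation described in the abstract starts from a per-pixel vegetation index series and turns it into a recurrence plot. The sketch below uses the standard NDVI definition, (NIR - Red) / (NIR + Red), and a thresholded recurrence matrix; the threshold `eps`, the binary form of the plot, and the reflectance values are assumptions of this illustration, since the abstract does not specify them.

```python
import numpy as np


def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel over time."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red)


def recurrence_plot(series, eps):
    """Binary matrix R[i, j] = 1 when time steps i and j have values within eps."""
    x = np.asarray(series, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])
    return (dist <= eps).astype(int)


# Toy reflectances for one pixel at three acquisition dates.
nir = [0.6, 0.6, 0.8]
red = [0.2, 0.2, 0.2]
series = ndvi(nir, red)          # roughly [0.5, 0.5, 0.6]
rp = recurrence_plot(series, eps=0.05)
```

The resulting T x T matrix is an image-like input, which is what lets a 2D encoder process a 1D pixel time series alongside ordinary remote sensing imagery.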

InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

Authors: Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, Yan Lu

First: 2026-01-07T17:40:08+00:00 · Latest: 2026-01-07T17:40:08+00:00

Comments: Work In Progress

Abs · PDF · Code1 · Code2

Abstract

GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well on generating a single webpage, building a realistic and functional website with many interconnected pages faces challenges. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seeds with reference design images to ensure diversity. Our system also generates verifiable task evaluators, enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of the proposed system.

中文标题/摘要

标题：InfiniteWeb：大规模生成GUI代理训练用的可扩展网络环境

代表用户与图形界面交互的GUI代理是实用AI助手的一个有前途的方向。然而，训练这些代理受到合适环境稀缺的阻碍。我们提出了InfiniteWeb，这是一种自动大规模生成功能性的网络环境的系统，用于GUI代理训练。虽然大语言模型在生成单个网页方面表现良好，但构建具有许多相互连接页面的现实且功能性的网站面临挑战。我们通过统一规范、以任务为中心的测试驱动开发以及网站种子与参考设计图像的结合来解决这些挑战，以确保多样性。我们的系统还生成可验证的任务评估器，为强化学习提供密集的奖励信号。实验表明，InfiniteWeb在现实网站构建方面超越了商业编码代理，而训练于我们生成环境的GUI代理在OSWorld和Online-Mind2Web上实现了显著性能提升，证明了该系统的有效性。

Summary / 总结

The research aims to address the scarcity of suitable environments for training GUI agents that interact with graphical interfaces. InfiniteWeb is a system that automatically generates functional web environments at scale. It overcomes challenges in generating realistic websites with interconnected pages through unified specification, task-centric test-driven development, and combining website seeds with reference design images. Experiments show that InfiniteWeb outperforms commercial coding agents in constructing realistic websites and that GUI agents trained on InfiniteWeb-generated environments perform significantly better on OSWorld and Online-Mind2Web, highlighting the effectiveness of the proposed system.

研究旨在通过合成可扩展的网络环境来解决GUI代理训练的挑战。方法包括使用统一规范、任务导向的测试驱动开发，并结合网站种子和参考图像生成多样且功能性的网络环境。关键发现表明，InfiniteWeb在构建真实网站方面优于商业编码代理，并且在OSWorld和Online-Mind2Web上的GUI代理训练后表现更好。

SpatialTree: How Spatial Abilities Branch Out in MLLMs

Authors: Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang

First: 2025-12-23T18:59:46+00:00 · Latest: 2026-01-07T17:31:29+00:00

Comments: webpage: https://spatialtree.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.

中文标题/摘要

标题：SpatialTree：多模态LLMs中空间能力的分支发展

认知科学表明，空间能力是逐步发展的，从感知到推理和互动。然而，在多模态LLMs（MLLMs）中，这种层次结构仍然不甚明了，因为大多数研究都集中在狭窄的任务集上。我们引入了SpatialTree，这是一种认知科学启发式的层次结构，将空间能力分为四个层次：低级感知（L1）、心理制图（L2）、模拟（L3）和能动性（L4）。基于这种分类法，我们构建了第一个能力导向的层次基准，全面评估主流MLLMs在27个子能力上的表现。评估结果揭示了一个清晰的结构：L1技能大多相互独立，而更高层次的技能则高度相关，表明它们之间的相互依赖性在增加。通过有针对性的监督微调，我们发现了一个令人惊讶的转移动态：L1内的负向转移，但低级到高级能力之间存在强大的跨层次转移，且具有显著的协同效应。最后，我们探讨了如何改进整个层次结构。我们发现，鼓励大量“思考”的简单RL方法是不可靠的：它有助于复杂的推理，但损害了直观的感知。我们提出了一种简单的自动思考策略，抑制不必要的思考，使RL能够在所有层次上一致地提高性能。通过构建SpatialTree，我们提供了一个概念验证框架，用于理解和系统地扩展MLLMs中的空间能力。

Summary / 总结

The research aims to understand how spatial abilities develop in multimodal LLMs by introducing SpatialTree, a cognitive-science-inspired hierarchy with four levels, from low-level perception (L1) to agentic competence (L4). Evaluating mainstream MLLMs across 27 sub-abilities, the study finds that L1 skills are largely orthogonal while higher-level skills are strongly correlated. Targeted supervised fine-tuning reveals negative transfer within L1 but strong cross-level transfer from low- to high-level abilities. Naive RL that encourages extensive "thinking" helps complex reasoning but hurts intuitive perception, whereas a simple auto-think strategy that suppresses unnecessary deliberation lets RL improve performance consistently across all levels.

研究旨在通过提出一个基于认知科学的层次结构SpatialTree，来理解多模态LLM（MLLM）中的空间能力发展，该层次结构包括感知、心理制图、模拟和行动能力四个层次。研究对主流的MLLM在27个子能力上进行了评估，发现较低层次的能力大多独立，而较高层次的能力则高度相关。有针对性的微调揭示了L1内的负迁移，但较低层次到较高层次的能力之间存在强烈的跨层次迁移。研究还指出，简单的强化学习可能会损害直观感知，但提出了一种简单的自动思考策略来在所有层次上提高性能。

GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning

Authors: Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu

First: 2026-01-07T17:26:41+00:00 · Latest: 2026-01-07T17:26:41+00:00

Abs · PDF · Code1 · Code2

Abstract

The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.

中文标题/摘要

标题：GeoReason: 通过逻辑一致性强化学习使遥感视觉语言模型的思考与回答保持一致

遥感视觉语言模型(RS-VLMs)的发展强调了从感知中心的识别向高级演绎推理过渡的重要性，以增强复杂空间任务中的认知可靠性。然而，当前的模型往往遭受逻辑幻觉的困扰，即正确的答案是基于有缺陷的推理链或依赖于位置捷径而非空间逻辑。这种脱节削弱了在战略空间决策中的可靠性。为了解决这一问题，我们提出了GeoReason，一种旨在使内部思考与最终决策同步的框架。我们首先构建了GeoReason-Bench，这是一个逻辑驱动的数据集，包含4,000条从几何原语和专家知识中合成的推理轨迹。然后，我们制定了两阶段训练策略：(1) 监督知识初始化，以使模型具备推理语法和领域专业知识；(2) 一致性感知强化学习，以提高演绎可靠性。这一阶段整合了一种新颖的逻辑一致性奖励，通过选项排列策略惩罚逻辑漂移，以使决策基于可验证的推理轨迹。实验结果表明，我们的框架显著提高了RS-VLMs的认知可靠性和可解释性，达到了与其他先进方法相比的最优性能。

Summary / 总结

GeoReason is a framework designed to improve the cognitive reliability of Remote Sensing Vision-Language Models (RS-VLMs) by aligning internal thinking with final decisions. It addresses the issue of logical hallucinations in current models through a two-stage training strategy: supervised knowledge initialization and consistency-aware reinforcement learning. The framework uses a novel Logical Consistency Reward to penalize logical drift and anchor decisions in verifiable reasoning traces, leading to enhanced interpretability and performance compared to other methods.

GeoReason 是一个框架，旨在通过使内部推理与最终答案保持一致来提高遥感视觉语言模型（RS-VLMs）的认知可靠性。它采用两阶段训练策略：监督知识初始化，使模型具备推理语法和领域专业知识，随后是一致性感知强化学习，其中的逻辑一致性奖励通过选项排列策略惩罚逻辑漂移。这种方法提高了模型的可解释性和可靠性，并在实验中达到了最先进的性能。
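
Illustrative sketch (not from the paper): the abstract says the Logical Consistency Reward uses an option permutation strategy but does not give its formula. The toy function below is only one plausible reading, under the assumption that a policy should pick the same option content however the choices are ordered. The names `rotations`, `consistency_reward`, `positional_shortcut`, and `content_based` are hypothetical, and cyclic rotations stand in for whatever permutation scheme the authors actually use.

```python
from typing import Callable, List, Sequence


def rotations(options: Sequence[str]) -> List[List[str]]:
    """Return every cyclic rotation of the option list (a deterministic stand-in for permutations)."""
    opts = list(options)
    return [opts[k:] + opts[:k] for k in range(len(opts))]


def consistency_reward(
    answer_fn: Callable[[Sequence[str]], int],
    options: Sequence[str],
    correct: str,
) -> float:
    """Fraction of orderings in which the policy selects the correct option content.

    Assumption: a policy that relies on option position scores low here,
    while a policy that reasons about content scores 1.0.
    """
    orderings = rotations(options)
    hits = sum(1 for order in orderings if order[answer_fn(order)] == correct)
    return hits / len(orderings)


def positional_shortcut(order: Sequence[str]) -> int:
    """Toy policy that always answers with the first listed option."""
    return 0


def content_based(order: Sequence[str]) -> int:
    """Toy policy that finds the correct content wherever it appears."""
    return list(order).index("bridge")


options = ["bridge", "harbor", "runway", "dam"]
r_shortcut = consistency_reward(positional_shortcut, options, "bridge")
r_content = consistency_reward(content_based, options, "bridge")
print(r_shortcut, r_content)
```

In this toy setup the positional policy is right in only one of the four orderings, so a reward of this shape would separate it from the content-based policy even though both answer the original ordering correctly.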

Low Resource Reconstruction Attacks Through Benign Prompts

Authors: Sol Yarkoni, Mahmood Sharif, Roi Livni

First: 2025-07-10T17:32:26+00:00 · Latest: 2026-01-07T17:17:56+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advances in generative models, such as diffusion models, have raised concerns related to privacy, copyright infringement, and data stewardship. To better understand and control these risks, prior work has introduced techniques and attacks that reconstruct images, or parts of images, from training data. While these results demonstrate that training data can be recovered, existing methods often rely on high computational resources, partial access to the training set, or carefully engineered prompts. In this work, we present a new attack that requires low resources, assumes little to no access to the training data, and identifies seemingly benign prompts that can lead to potentially risky image reconstruction. We further show that such reconstructions may occur unintentionally, even for users without specialized knowledge. For example, we observe that for one existing model, the prompt "blue Unisex T-Shirt" generates the face of a real individual. Moreover, by combining the identified vulnerabilities with real-world prompt data, we discover prompts that reproduce memorized visual elements. Our approach builds on insights from prior work and leverages domain knowledge to expose a fundamental vulnerability arising from the use of scraped e-commerce data, where templated layouts and images are closely tied to pattern-like textual prompts. The code for our attack is publicly available at https://github.com/TheSolY/lr-tmi.

中文标题/摘要

标题：通过良性提示进行的低资源重建攻击

生成模型的最新进展，如扩散模型，引发了隐私、版权侵权和数据管理方面的问题。为了更好地理解和控制这些风险，先前的工作引入了技术和攻击方法，可以从训练数据中重建图像或图像的部分。虽然这些结果表明训练数据可以被恢复，但现有方法通常依赖于高计算资源、部分访问训练集或精心设计的提示。在这项工作中，我们提出了一种新的攻击方法，该方法所需资源少，假设几乎或完全无法访问训练数据，并识别出看似无害的提示，这些提示可能导致有潜在风险的图像重建。我们进一步表明，这种重建可能会无意中发生，即使对于没有专门知识的用户也是如此。例如，我们观察到，对于一个现有的模型，“蓝色男女通用T恤”这个提示生成了一个真实个体的面部。此外，通过将识别出的漏洞与实际的提示数据相结合，我们发现了一些能够重现被记忆的视觉元素的提示。我们的方法基于先前工作的见解，并利用领域知识揭示了由于使用抓取的电子商务数据而产生的根本性漏洞，其中模板化的布局和图像与模式化的文本提示紧密相关。我们攻击的代码可以在 https://github.com/TheSolY/lr-tmi 公开获取。

Summary / 总结

This paper addresses the privacy and data stewardship risks associated with generative models by presenting a new low-resource reconstruction attack that uses seemingly benign prompts to recover images from training data. The method requires few computational resources and assumes little to no access to the training data, and it identifies prompts that can unintentionally reconstruct sensitive images. The study demonstrates that even users without specialized knowledge can trigger such reconstructions, highlighting a fundamental vulnerability in models trained on scraped e-commerce data.

该研究通过引入一种新的低资源攻击方法，利用看似无害的提示来重建训练数据中的图像，以应对生成模型带来的隐私和数据管理风险。该方法无需大量计算资源，且几乎或完全无需访问训练数据。关键发现包括从提示如“蓝色男女通用T恤”中无意间重建出真实个体的面部，突显了此类攻击可能在没有专业知识的情况下发生。该方法利用领域知识揭示了模型在训练于爬取的电子商务数据时存在的漏洞，其中图像和文本提示紧密相关。

Causal Data Augmentation for Robust Fine-Tuning of Tabular Foundation Models

Authors: Magnus Bühler, Lennart Purucker, Frank Hutter

First: 2026-01-07T17:16:39+00:00 · Latest: 2026-01-07T17:16:39+00:00

Comments: Accepted for oral presentation at the EurIPS 2025 Workshop on AI for Tabular Data (Copenhagen)

Abs · PDF · Code1 · Code2

Abstract

Fine-tuning tabular foundation models (TFMs) under data scarcity is challenging, as early stopping on even scarcer validation data often fails to capture true generalization performance. We propose CausalMixFT, a method that enhances fine-tuning robustness and downstream performance by generating structurally consistent synthetic samples using Structural Causal Models (SCMs) fitted on the target dataset. This approach augments limited real data with causally informed synthetic examples, preserving feature dependencies while expanding training diversity. Evaluated across 33 classification datasets from TabArena and over 2300 fine-tuning runs, our CausalMixFT method consistently improves median normalized ROC-AUC from 0.10 (standard fine-tuning) to 0.12, outperforming purely statistical generators such as CTGAN (-0.01), TabEBM (-0.04), and TableAugment (-0.09). Moreover, it narrows the median validation-test performance correlation gap from 0.67 to 0.30, enabling more reliable validation-based early stopping, a key step toward improving fine-tuning stability under data scarcity. These results demonstrate that incorporating causal structure into data augmentation provides an effective and principled route to fine-tuning tabular foundation models in low-data regimes.

中文标题/摘要

标题：用于表格基础模型稳健微调的因果数据增强

在数据稀缺的情况下微调表格基础模型（TFMs）具有挑战性，因为即使在更稀缺的验证数据上提前停止也往往无法捕捉到真正的泛化性能。我们提出了一种名为CausalMixFT的方法，该方法通过使用结构因果模型（SCMs）拟合目标数据集来生成结构上一致的合成样本，从而增强微调的稳健性和下游性能。这种方法用因果信息驱动的合成示例扩充了有限的真实数据，同时保持了特征依赖性并扩展了训练多样性。在TabArena的33个分类数据集上进行了评估，并进行了超过2300次微调运行，我们的CausalMixFT方法始终将中位数归一化ROC-AUC从0.10（标准微调）提高到0.12，优于统计生成器CTGAN（-0.01）、TabEBM（-0.04）和TableAugment（-0.09）。此外，它将中位数验证-测试性能相关性差距从0.67缩小到0.30，使基于验证的提前停止更加可靠，这是在数据稀缺情况下提高微调稳定性的关键步骤。这些结果表明，将因果结构纳入数据增强提供了一种有效且原则性的方法，以在数据稀缺条件下微调表格基础模型。

Summary / 总结

The paper addresses the challenge of fine-tuning tabular foundation models under data scarcity by proposing CausalMixFT, which uses Structural Causal Models to generate synthetic data that preserves feature dependencies. This method improves median normalized ROC-AUC from 0.10 to 0.12 across 33 datasets, outperforming statistical generators and enabling more reliable validation-based early stopping. It narrows the validation-test performance correlation gap, enhancing fine-tuning robustness in low-data regimes.

论文提出了一种名为CausalMixFT的方法，通过使用结构因果模型生成保留特征依赖性的合成数据，以应对表格基础模型在数据稀缺情况下的微调挑战。该方法将中位归一化ROC-AUC从0.10提高到0.12，并缩小了验证集和测试集性能的相关性差距，从而实现更可靠的早期停止和更好的微调稳定性。
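
Illustrative sketch (not from the paper): the abstract states that synthetic samples come from Structural Causal Models fitted on the target dataset, but it does not describe the fitting procedure. The toy example below assumes a known two-variable graph x -> y with a linear structural equation and additive noise drawn from the empirical residuals. The helper names `fit_linear_edge` and `sample_synthetic` and the data values are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def fit_linear_edge(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, List[float]]:
    """Least-squares fit of y = intercept + slope * x, returning the residuals as the noise pool."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return slope, intercept, residuals


def sample_synthetic(
    xs: Sequence[float],
    slope: float,
    intercept: float,
    residuals: Sequence[float],
    n: int,
    rng: random.Random,
) -> List[Tuple[float, float]]:
    """Sample the root x from the data, then push it through the fitted structural equation."""
    rows = []
    for _ in range(n):
        x = rng.choice(list(xs))
        y = intercept + slope * x + rng.choice(list(residuals))
        rows.append((x, y))
    return rows


rng = random.Random(0)
real_x = [0.0, 1.0, 2.0, 3.0, 4.0]
real_y = [1.0, 3.1, 4.9, 7.2, 8.8]
slope, intercept, residuals = fit_linear_edge(real_x, real_y)
synthetic = sample_synthetic(real_x, slope, intercept, residuals, n=20, rng=rng)
augmented = list(zip(real_x, real_y)) + synthetic  # real rows plus causally generated rows
print(len(augmented), round(slope, 2), round(intercept, 2))
```

Because each synthetic row is generated through the fitted equation, the dependency between x and y is preserved by construction, which is the property the abstract contrasts with purely statistical generators.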

Quantifying the Impact of Modules and Their Interactions in the PSO-X Framework

Authors: Christian L. Camacho-Villalón, Ana Nikolikj, Katharina Dost, Eva Tuba, Sašo Džeroski, Tome Eftimov

First: 2026-01-07T17:06:05+00:00 · Latest: 2026-01-07T17:06:05+00:00

Abs · PDF · Code1 · Code2

Abstract

The PSO-X framework incorporates dozens of modules that have been proposed for solving single-objective continuous optimization problems using particle swarm optimization. While modular frameworks enable users to automatically generate and configure algorithms tailored to specific optimization problems, the complexity of this process increases with the number of modules in the framework and the degrees of freedom defined for their interaction. Understanding how modules affect the performance of algorithms for different problems is critical to making the process of finding effective implementations more efficient and identifying promising areas for further investigation. Despite their practical applications and scientific relevance, there is a lack of empirical studies investigating which modules matter most in modular optimization frameworks and how they interact. In this paper, we analyze the performance of 1424 particle swarm optimization algorithms instantiated from the PSO-X framework on the 25 functions in the CEC'05 benchmark suite with 10 and 30 dimensions. We use functional ANOVA to quantify the impact of modules and their combinations on performance in different problem classes. In practice, this allows us to identify which modules have greater influence on PSO-X performance depending on problem features such as multimodality, mathematical transformations and varying dimensionality. We then perform a cluster analysis to identify groups of problem classes that share similar module effect patterns. Our results show low variability in the importance of modules in all problem classes, suggesting that particle swarm optimization performance is driven by a few influential modules.

中文标题/摘要

标题：量化PSO-X框架中模块及其交互的影响

PSO-X框架整合了用于解决单目标连续优化问题的粒子群优化算法中提出的数十个模块。虽然模块化框架使用户能够自动生成和配置针对特定优化问题的算法，但随着框架中模块数量的增加和它们交互的自由度，这一过程的复杂性也随之增加。理解模块如何影响不同问题上算法的性能对于提高找到有效实现的效率并确定进一步研究的有希望领域至关重要。尽管模块化优化框架具有实际应用和科学意义，但缺乏研究探讨哪些模块在这些框架中最重要以及它们如何相互作用的实证研究。在本文中，我们分析了从PSO-X框架实例化出的1424个粒子群优化算法在CEC'05基准套件中的25个函数（10和30维）上的性能。我们使用函数ANOVA来量化不同问题类别中模块及其组合对性能的影响。在实践中，这使我们能够根据问题特征（如多模态性、数学变换和不同维度）识别出对PSO-X性能影响更大的模块。然后，我们进行聚类分析以识别具有相似模块效应模式的问题类别组。我们的结果表明，在所有问题类别中模块的重要性变化很小，这表明粒子群优化性能主要由少数几个有影响力的模块驱动。

Summary / 总结

This paper aims to understand the impact of modules and their interactions in the PSO-X framework, which is used for solving single-objective continuous optimization problems. The authors analyze 1424 particle swarm optimization algorithms instantiated from the PSO-X framework on 25 functions from the CEC'05 benchmark suite. Using functional ANOVA, they quantify the influence of modules and their combinations on performance across different problem classes. The results indicate that the performance of PSO-X is largely driven by a few influential modules, with low variability in their importance across problem classes.

本文旨在理解PSO-X框架中模块及其交互的影响，该框架用于解决单目标连续优化问题。作者使用函数型方差分析（functional ANOVA），分析了从PSO-X框架实例化出的1424个粒子群优化算法在CEC'05基准套件的25个函数（10维和30维）上的性能。研究发现，PSO-X的性能主要由少数几个关键模块驱动，在不同问题类别中的模块重要性变化较小。
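
Illustrative sketch (not from the paper): functional ANOVA attributes the variance of performance to individual modules and their combinations. The toy example below computes only the main-effect share of each module, assuming a balanced full-factorial table of configurations. The function `main_effect_fractions`, the module levels, and the performance values are hypothetical, and the paper's actual estimator is not described in the abstract.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def main_effect_fractions(perf: Dict[Tuple[str, ...], float], n_modules: int) -> List[float]:
    """Share of total performance variance explained by each module's main effect.

    Assumes every combination of module levels appears exactly once,
    so level means can be weighted equally.
    """
    total = pvariance(list(perf.values()))
    fractions = []
    for m in range(n_modules):
        levels = sorted({cfg[m] for cfg in perf})
        level_means = [
            mean([value for cfg, value in perf.items() if cfg[m] == level])
            for level in levels
        ]
        fractions.append(pvariance(level_means) / total)
    return fractions


# Toy table: module A has a large effect, module B a small one, and there is no interaction.
perf = {
    ("a0", "b0"): 1.0,
    ("a0", "b1"): 2.0,
    ("a1", "b0"): 5.0,
    ("a1", "b1"): 6.0,
}
fractions = main_effect_fractions(perf, n_modules=2)
print(fractions)
```

In this additive toy table the two main effects account for all of the variance. Any remainder in a real analysis would be attributed to interactions between modules.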

Layer-wise Positional Bias in Short-Context Language Modeling

Authors: Maryam Rahimi, Mahdi Nouri, Yadollah Yaghoobzadeh

First: 2026-01-07T17:04:30+00:00 · Latest: 2026-01-07T17:04:30+00:00

Abs · PDF · Code1 · Code2

Abstract

Language models often show a preference for using information from specific positions in the input regardless of semantic relevance. While positional bias has been studied in various contexts, from attention sinks to task performance degradation in long-context settings, prior work has not established how these biases evolve across individual layers and input positions, or how they vary independent of task complexity. We introduce an attribution-based framework to analyze positional effects in short-context language modeling. Using layer conductance with a sliding-window approach, we quantify how each layer distributes importance across input positions, yielding layer-wise positional importance profiles. We find that these profiles are architecture-specific, stable across inputs, and invariant to lexical scrambling. Characterizing these profiles, we find prominent recency bias that increases with depth and subtle primacy bias that diminishes through model depth. Beyond positional structure, we also show that early layers preferentially weight content words over function words across all positions, while later layers lose this word-type differentiation.

中文标题/摘要

标题：短语境语言模型中的逐层位置偏见

语言模型常常倾向于在输入中使用特定位置的信息，而不论其语义相关性。尽管位置偏见在各种上下文中已有研究，从注意力陷阱到长语境设置中的任务性能下降，但先前的工作尚未阐明这些偏见如何在各个层和输入位置上演变，以及它们如何独立于任务复杂性而变化。我们引入了一种基于归因的框架来分析短语境语言模型中的位置效应。通过滑动窗口方法计算层传导度（layer conductance），我们量化了每一层如何在输入位置上分配重要性，从而得到逐层的位置重要性分布。我们发现这些分布具有架构特异性，在不同输入上保持稳定，并且对词汇混排不变。通过对这些分布的表征，我们发现显著的近因偏见随深度增加而增强，而微弱的首因偏见则随模型深度增加而减弱。除了位置结构外，我们还表明早期层更倾向于在所有位置上赋予内容词比功能词更高的权重，而后期层则丧失了这种词类差异。

Summary / 总结

This study investigates positional biases in short-context language models, using an attribution-based framework to analyze how layers distribute importance across input positions. The research finds that positional importance profiles are architecture-specific, stable across different inputs, and unaffected by lexical scrambling. Key findings include a prominent recency bias that grows with depth and a diminishing primacy bias. Additionally, early layers show a preference for content words over function words, while later layers lose this differentiation.

研究探讨了短语境语言模型中的位置偏见，通过层传导度（layer conductance）和滑动窗口方法量化位置的重要性。关键发现包括具有架构特异性且稳定的位置重要性分布，其中近因偏见随深度增加而增强，首因偏见则逐渐减弱。早期层更偏好内容词而非功能词，而后期层丧失了这种词类差异。

SearchAttack: Red-Teaming LLMs against Real-World Threats via Framing Unsafe Web Information-Seeking Tasks

Authors: Yu Yan, Sheng Sun, Mingfeng Li, Zheming Yang, Chiwei Zhu, Fei Ma, Benfeng Xu, Min Liu

First: 2026-01-07T16:59:34+00:00 · Latest: 2026-01-07T16:59:34+00:00

Comments: We find that the key to jailbreak the LLM is objectifying its safety responsibility, thus we delegate the open-web to inject harmful semantics and get the huge gain from unmoderated web resources

Abs · PDF · Code1 · Code2

Abstract

Recently, people have suffered and become increasingly aware of the unreliability gap in LLMs for open and knowledge-intensive tasks, and thus turn to search-augmented LLMs to mitigate this issue. However, when the search engine is triggered for harmful tasks, the outcome is no longer under the LLM's control. Once the returned content directly contains targeted, ready-to-use harmful takeaways, the LLM's safeguards cannot withdraw that exposure. Motivated by this dilemma, we identify web search as a critical attack surface and propose SearchAttack for red-teaming. SearchAttack outsources the harmful semantics to web search, retaining only the query's skeleton and fragmented clues, and further steers LLMs to reconstruct the retrieved content via structural rubrics to achieve malicious goals. Extensive experiments are conducted to red-team the search-augmented LLMs for responsible vulnerability assessment. Empirically, SearchAttack demonstrates strong effectiveness in attacking these systems.

中文标题/摘要

标题：SearchAttack：通过构造不安全的网络信息检索任务，针对现实世界威胁对LLM进行红队测试

最近，人们遭受了LLMs在开放和知识密集型任务中可靠性差距的困扰，并因此转向搜索增强的LLMs来缓解这一问题。然而，当搜索引擎被触发执行有害任务时，结果将不再受LLM的控制。一旦返回的内容直接包含有针对性、可立即使用的有害信息，LLM的安全防护也无法撤回这种暴露。受此困境的启发，我们识别网络搜索为一个关键的攻击面，并提出SearchAttack进行红队演练。SearchAttack将有害语义外包给网络搜索，仅保留查询的框架和碎片化的线索，并进一步引导LLM通过结构化规范重构检索内容以实现恶意目标。进行了广泛的实验来对搜索增强的LLMs进行负责任的漏洞评估。实证研究表明，SearchAttack在攻击这些系统方面表现出强大的效果。

Summary / 总结

Search-augmented LLMs are increasingly used to compensate for the unreliability of LLMs on open, knowledge-intensive tasks, but once a search is triggered for a harmful task, the returned content is outside the LLM's control. The paper identifies web search as a critical attack surface and proposes SearchAttack for red-teaming: it outsources harmful semantics to web search, retains only the query's skeleton and fragmented clues, and steers the LLM to reconstruct the retrieved content via structural rubrics. Experiments show that SearchAttack is highly effective against these systems, supporting its use for responsible vulnerability assessment.

研究指出，网络搜索是搜索增强LLM的一个关键攻击面，并提出SearchAttack方法用于红队测试。SearchAttack将有害语义外包给网络搜索，仅保留查询的骨架和碎片化线索，并引导LLM通过结构化规范重构检索内容以实现恶意目标。实验表明，SearchAttack能有效攻击这些系统，凸显了负责任的漏洞评估的必要性。

S2Vec: Self-Supervised Geospatial Embeddings for the Built Environment

Authors: Shushman Choudhury, Elad Aharoni, Chandrakumari Suvarna, Iveel Tsogsuren, Abdul Rahman Kreidieh, Chun-Ta Lu, Neha Arora

Venue: ACM Transactions on Spatial Algorithms and Systems 2026

First: 2025-04-10T20:16:02+00:00 · Latest: 2026-01-07T16:58:00+00:00

Abs · PDF · Code1 · Code2

Abstract

Scalable general-purpose representations of the built environment are crucial for geospatial artificial intelligence applications. This paper introduces S2Vec, a novel self-supervised framework for learning such geospatial embeddings. S2Vec uses the S2 Geometry library to partition large areas into discrete S2 cells, rasterizes built environment feature vectors within cells as images, and applies masked autoencoding on these rasterized images to encode the feature vectors. This approach yields task-agnostic embeddings that capture local feature characteristics and broader spatial relationships. We evaluate S2Vec on several large-scale geospatial prediction tasks, both random train/test splits (interpolation) and zero-shot geographic adaptation (extrapolation). Our experiments show S2Vec's competitive performance against several baselines on socioeconomic tasks, especially the geographic adaptation variant, with room for improvement on environmental tasks. We also explore combining S2Vec embeddings with image-based embeddings downstream, showing that such multimodal fusion can often improve performance. Our findings highlight how S2Vec can learn effective general-purpose geospatial representations of the built environment features it is provided, and how it can complement other data modalities in geospatial artificial intelligence.

中文标题/摘要

标题：S2Vec：建筑环境的自监督地理空间嵌入

可扩展的通用建筑环境表示对于地理空间人工智能应用至关重要。本文介绍了S2Vec，这是一种新颖的自监督框架，用于学习此类地理空间嵌入。S2Vec 使用S2几何库将大面积划分为离散的S2单元，将单元内的建筑环境特征向量作为图像进行栅格化，并在这些栅格化图像上应用掩蔽自编码以编码特征向量。该方法生成了任务无关的嵌入，能够捕捉局部特征特性和更广泛的地理关系。我们在多个大规模地理空间预测任务上评估了S2Vec，包括随机训练/测试拆分（内插）和零样本地理适应（外推）。我们的实验表明，S2Vec 在社会经济任务上与几个基线具有竞争力，特别是在地理适应变体方面，但在环境任务上仍有改进空间。我们还探索了将S2Vec嵌入与下游的图像嵌入结合使用，表明这种多模态融合通常可以提高性能。我们的研究结果突显了S2Vec如何学习有效的通用地理空间表示，以及它如何在地理空间人工智能中补充其他数据模态。

Summary / 总结

S2Vec is a self-supervised framework that learns geospatial embeddings for the built environment using S2 Geometry to partition areas into cells, rasterize feature vectors, and apply masked autoencoding. It achieves competitive performance on socioeconomic tasks, particularly in geographic adaptation, and shows promise when combined with image-based embeddings for improved performance. However, it has less success on environmental tasks.

S2Vec 是一种自监督框架，使用 S2 几何将大区域划分为单元格，将建筑环境特征向量作为图像进行栅格化，并应用掩蔽自编码来学习任务无关的地理空间嵌入。该方法捕捉局部特征特性和更广泛的空间关系。实验显示，S2Vec 在社会经济任务上（尤其是地理适应设置下）具有竞争力，而在环境任务上仍有改进空间；将 S2Vec 嵌入与图像嵌入结合使用通常可以提高性能。