arXiv Paper Digest / arXiv 论文速递

Snapshot: 20260305_0342

Utonia: Toward One Encoder for All Point Clouds

Authors: Yujia Zhang, Xiaoyang Wu, Yunhan Yang, Xianzhe Fan, Han Li, Yuechen Zhang, Zehao Huang, Naiyan Wang, Hengshuang Zhao

First: 2026-03-03T18:59:58+00:00 · Latest: 2026-03-03T18:59:58+00:00

Comments: produced by Pointcept, project page: https://pointcept.github.io/Utonia

Abstract

We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.

中文标题/摘要

标题：Utonia：朝向通用点云编码器的一步

我们梦想着一个未来，所有领域的点云能够汇聚在一起，共同塑造一个能够惠及所有领域的单一模型。为此，我们提出了Utonia，这是朝着训练一个跨越多种领域的单一自监督点变换编码器迈出的第一步，这些领域包括遥感、户外LiDAR、室内RGB-D序列、对象中心的CAD模型以及从纯RGB视频中提取的点云。尽管它们具有不同的传感几何结构、密度和先验知识，Utonia仍然能够学习一个一致的表示空间，该空间可以在不同领域之间进行迁移。这种统一提高了感知能力，同时揭示了只有在联合训练领域时才会出现的有趣涌现行为。超越感知，我们观察到Utonia表示还可以为具身和多模态推理提供帮助：基于Utonia特征的视觉-语言-动作策略可以提高机器人的操作能力，将它们整合到视觉-语言模型中也能在空间推理方面取得收益。我们希望Utonia能够作为稀疏3D数据基础模型的一步，支持AR/VR、机器人技术和自动驾驶等下游应用。

Summary / 总结

The research aims to develop a single model that serves point clouds from many domains. Utonia, a self-supervised point transformer encoder, is trained jointly on remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. The model learns a consistent representation space that transfers across domains, improves perception, and shows emergent behaviors that arise only under joint training. Beyond perception, conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating the features into vision-language models improves spatial reasoning. The authors present Utonia as a step toward foundation models for sparse 3D data, with downstream applications in AR/VR, robotics, and autonomous driving.

研究旨在开发一种适用于不同领域点云的统一模型。Utonia 是一种自监督点变换编码器，在遥感、户外 LiDAR、室内 RGB-D 序列、以对象为中心的 CAD 模型以及从纯 RGB 视频中提取的点云等多种领域上联合训练。该模型学习了一致的表示空间，提升了感知能力，并揭示了仅在联合训练时才出现的涌现行为。除了感知之外，Utonia 特征还有助于具身和多模态推理：以其为条件的视觉-语言-动作策略提升了机器人操作能力，将其整合到视觉-语言模型中则提升了空间推理表现。作者希望 Utonia 成为迈向稀疏 3D 数据基础模型的一步，支持 AR/VR、机器人和自动驾驶等下游应用。

MIBURI: Towards Expressive Interactive Gesture Synthesis

Authors: M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt

Venue: CVPR 2026

First: 2026-03-03T18:59:51+00:00 · Latest: 2026-03-03T18:59:51+00:00

Comments: CVPR 2026. Project page: https://vcai.mpi-inf.mpg.de/projects/MIBURI/

Abstract

Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures compared with recent baselines. We urge the reader to explore demo videos on https://vcai.mpi-inf.mpg.de/projects/MIBURI/.

中文标题/摘要

标题：MIBURI：迈向表达性互动手势合成

具身对话代理（ECAs）旨在通过语音、手势和面部表情来模拟面对面的人类互动。当前基于大型语言模型（LLM）的对话代理缺乏具身性和自然互动所需的表情手势。现有的ECAs解决方案往往产生僵硬、低多样性的动作，不适合人类互动。相反，用于同步口述手势合成的生成方法可以产生自然的身体手势，但依赖于未来的语音上下文，并需要长时间运行。为弥合这一差距，我们提出了MIBURI，这是第一个在线因果框架，用于生成与实时口语对话同步的表达性全身手势和面部表情。我们使用身体部位感知的手势编解码器，将层次运动细节编码为多级离散令牌。这些令牌然后由一个二维因果框架自回归生成，该框架基于LLM的语音-文本嵌入进行条件化，实时建模时间和部位层次运动。此外，我们引入了辅助目标来鼓励表达性和多样性手势，防止收敛到静态姿势。比较评估表明，我们的因果和实时方法在与最近基线相比时，生成了自然且上下文对齐的手势。我们敦促读者访问https://vcai.mpi-inf.mpg.de/projects/MIBURI/上的演示视频。

Summary / 总结

MIBURI is an online, causal framework that generates expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue, addressing the rigid, low-diversity motion of existing ECA solutions and the future-context dependence and long run-times of generative co-speech methods. It uses body-part aware gesture codecs to encode hierarchical motion details into multi-level discrete tokens, which are autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings. Comparative evaluations show that MIBURI produces natural and contextually aligned gestures compared with recent baselines.

MIBURI 是一个在线因果框架，用于生成与实时对话同步的富有表现力的全身手势和面部表情。它使用身体部位感知的手势编码器来编码层次化的运动细节，并在基于大语言模型的语音文本嵌入条件下自回归生成令牌。比较评估表明，MIBURI 生成了自然且上下文一致的手势，优于最近的基线。该方法旨在实时运行，避免产生僵硬和低多样性的动作。更多演示视频请参见 https://vcai.mpi-inf.mpg.de/projects/MIBURI/.

CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance

Authors: Hanyang Wang, Yiyang Liu, Jiawei Chi, Fangfu Liu, Ran Xue, Yueqi Duan

Venue: CVPR 2026

First: 2026-03-03T18:59:48+00:00 · Latest: 2026-03-03T18:59:48+00:00

Comments: Accepted by CVPR 2026; Project Page: https://hanyang-21.github.io/CFG-Ctrl

Abstract

Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional-unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales. Project Page: https://hanyang-21.github.io/CFG-Ctrl

中文标题/摘要

标题：CFG-Ctrl：基于控制的分类器自由扩散引导

分类器自由引导（CFG）已成为增强流基扩散模型语义对齐的核心方法。本文探讨了一种统一框架CFG-Ctrl，将CFG重新解释为对第一阶连续生成流的控制，使用条件-非条件差异作为误差信号调整速度场。从这个角度来看，我们总结了传统的CFG为固定增益的比例控制器（P控制），而常见的后续变体则在此基础上发展了扩展的控制律设计。然而，现有方法主要依赖线性控制，这导致了不稳定性、超调和语义保真度下降，尤其是在大引导尺度下。为了解决这一问题，我们引入了滑模控制CFG（SMC-CFG），强制生成流向快速收敛的滑动流形。具体而言，我们定义了语义预测误差的指数滑模表面，并引入切换控制项以建立非线性反馈引导校正。此外，我们提供了李亚普诺夫稳定性分析，以理论支持有限时间收敛。实验表明，SMC-CFG在语义对齐方面优于标准CFG，并且在广泛的引导尺度范围内增强了鲁棒性。项目页面：https://hanyang-21.github.io/CFG-Ctrl

Summary / 总结

The research aims to improve semantic alignment in flow-based diffusion models using a control-based approach. The framework, CFG-Ctrl, reinterprets Classifier-Free Guidance (CFG) as a control applied to the first-order continuous-time generative flow, with vanilla CFG corresponding to a fixed-gain proportional controller. Because linear control tends toward instability, overshooting, and degraded semantic fidelity at large guidance scales, the authors introduce Sliding Mode Control CFG (SMC-CFG), which adds a nonlinear switching control term and is supported by a Lyapunov stability analysis. Experiments on Stable Diffusion 3.5, Flux, and Qwen-Image show that SMC-CFG outperforms standard CFG in semantic alignment and is more robust across a wide range of guidance scales. The project page provides more details: https://hanyang-21.github.io/CFG-Ctrl

论文提出了CFG-Ctrl框架，将Classifier-Free Guidance (CFG)重新解释为对第一阶连续生成流的控制应用。通过使用指数滑动模式表面和切换控制项实现非线性反馈校正，解决了现有方法的不稳定性及语义保真度问题。实验表明，SMC-CFG在文本到图像生成模型如Stable Diffusion 3.5、Flux和Qwen-Image中，在语义对齐和各种指导尺度下的鲁棒性方面优于标准CFG。
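
The CFG-Ctrl abstract above reads vanilla CFG as a proportional controller with fixed gain acting on the conditional-unconditional discrepancy. A minimal sketch of that reading, on plain Python lists, might look like the following. The function name `cfg_p_control` and the toy velocities are illustrative assumptions; the sliding-mode variant (SMC-CFG) is not reproduced here because the abstract does not state its control law.

```python
from typing import List


def cfg_p_control(v_uncond: List[float], v_cond: List[float], gain: float) -> List[float]:
    """Vanilla classifier-free guidance written as proportional control.

    error  = v_cond - v_uncond        (conditional-unconditional discrepancy)
    output = v_uncond + gain * error  (fixed-gain correction of the velocity)
    """
    if len(v_uncond) != len(v_cond):
        raise ValueError("velocity vectors must have the same length")
    return [u + gain * (c - u) for u, c in zip(v_uncond, v_cond)]


# Toy velocities (hypothetical values, not taken from the paper).
v_u = [0.0, 1.0, -2.0]
v_c = [1.0, 1.0, 0.0]
guided = cfg_p_control(v_u, v_c, gain=3.0)
print(guided)
```

With `gain=1.0` the output equals the conditional velocity and with `gain=0.0` it equals the unconditional one, so the gain plays the role of the guidance scale in this sketch.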

How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference

Authors: Toru Lin, Shuying Deng, Zhao-Heng Yin, Pieter Abbeel, Jitendra Malik

First: 2026-03-03T18:59:32+00:00 · Latest: 2026-03-03T18:59:32+00:00

Comments: Project page can be found at https://toruowo.github.io/peel

Abstract

Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.

中文标题/摘要

标题：如何使用刀具去皮：精细操作与人类偏好的对齐

许多重要的操作任务，如食物准备、手术和手工艺，对于自主机器人来说仍然难以解决。这些任务不仅具有接触丰富、力敏感的动力学特性，还具有“隐含”的成功标准：与拾取和放置不同，这些领域的任务质量是连续且主观的（例如，土豆去皮的质量如何），这使得定量评估和奖励工程变得困难。我们提出了一种针对此类任务的学习框架，以使用刀具去皮作为代表性的例子。我们的方法遵循两阶段管道：首先，我们通过力感知数据收集和模仿学习学习稳健的初始策略，以实现对不同物体的泛化；其次，我们通过基于偏好的微调来改进策略，使用结合定量任务指标和定性人类反馈的奖励模型，使策略行为与人类对任务质量的看法相一致。仅使用50-200个去皮轨迹，我们的系统在包括黄瓜、苹果和土豆在内的具有挑战性的农产品上实现了超过90%的平均成功率，通过基于偏好的微调，性能提高了高达40%。值得注意的是，仅在一个农产品类别上训练的策略在未见过的同类别实例以及来自不同类别的分布外农产品上表现出强大的零样本泛化能力，同时保持超过90%的成功率。

Summary / 总结

The paper addresses the challenge of fine-grained manipulation tasks like peeling with a knife, which are difficult for autonomous robots due to their continuous and subjective success criteria. It proposes a two-stage learning framework: first, a robust initial policy is learned using force-aware data collection and imitation learning, and then the policy is refined through preference-based finetuning with a learned reward model combining quantitative metrics and qualitative human feedback. The system achieves over 90% success rates on various produce with up to 40% improvement through finetuning, and demonstrates strong zero-shot generalization across different produce categories.

本文探讨了如用刀削皮等精细操作任务，对于自主机器人来说因其连续性和主观性成功标准而难以实现的挑战。作者提出了一种两阶段学习框架：首先，利用力感知的数据收集和模仿学习开发一个稳健的初始策略，然后通过结合定量任务指标和定性人类反馈的奖励模型进行偏好导向的微调。该方法在各种果蔬上实现了超过90%的成功率，通过微调可提高高达40%的性能，并且在未见过的实例和类别上表现出强大的泛化能力。

ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

Authors: Xialin He, Sirui Xu, Xinyao Li, Runpei Dong, Liuyu Bian, Yu-Xiong Wang, Liang-Yan Gui

First: 2026-03-03T18:59:29+00:00 · Latest: 2026-03-03T18:59:29+00:00

Comments: Project Page: https://ultra-humanoid.github.io/

Abstract

Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.

中文标题/摘要

标题：ULTRA：面向自主人形机器人全身移动操作的统一多模态控制

实现自主且多功能的全身移动操作仍然是使人形机器人实用化的关键障碍。然而，现有方法存在根本限制：重定向数据往往稀缺或质量低；方法难以扩展到大规模技能库；最重要的是，它们依赖于跟踪预定义的运动参考，而不是从感知和高层次任务规范生成行为。为解决这些限制，我们提出了一种统一框架ULTRA，包含两个关键组件。首先，我们引入了一种物理驱动的神经重定向算法，将大规模运动捕捉数据转换到人形机器人本体上，同时在接触丰富的交互中保持物理合理性。其次，我们学习了一个统一的多模态控制器，支持密集参考和稀疏任务规范，可在从精确的运动捕捉状态到带噪声的第一人称视觉输入的多种感知条件下运行。我们将通用跟踪策略蒸馏到该控制器中，将运动技能压缩到紧凑的潜在空间，并应用强化学习微调以扩展覆盖范围并提高在分布外场景下的鲁棒性。这使得系统能够在测试时无需参考运动的情况下，根据稀疏意图实现协调的全身行为。我们在仿真和真实的Unitree G1人形机器人上评估了ULTRA。结果表明，ULTRA能够基于第一人称感知泛化到自主的、目标条件化的全身移动操作，并始终优于技能有限的仅跟踪基线。

Summary / 总结

The research aims to achieve autonomous and versatile whole-body locomotion and manipulation for humanoids. The method involves a unified framework with a physics-driven neural retargeting algorithm and a unified multimodal controller. The key findings show that ULTRA can generalize to autonomous, goal-conditioned whole-body locomotion and manipulation using egocentric perception, outperforming tracking-only baselines with limited skills.

研究旨在实现人形机器人全身体态移动和操作的自主性和多样性。方法包括一个统一框架，包含一个基于物理的神经重定位算法和一个统一的多模态控制器。主要发现表明，ULTRA 可以从第一人称感知自主地实现目标导向的全身体态行为，在模拟和实际的 Unitree G1 人形机器人实验中均优于仅跟踪基准方法。

Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping

Authors: William Liang, Sam Wang, Hung-Ju Wang, Osbert Bastani, Yecheng Jason Ma, Dinesh Jayaraman

Venue: ICLR 2026

First: 2026-03-03T18:59:07+00:00 · Latest: 2026-03-03T18:59:07+00:00

Comments: International Conference on Learning Representations (ICLR), 2026. Project website and code: https://tether-research.github.io

Abstract

The ability to conduct and learn from interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (<=10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is extremely data-efficient and robust even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies competitive with those learned from human-collected demonstrations.

中文标题/摘要

标题：Tether：基于对应驱动轨迹扭曲的自主功能性玩耍

能够进行交互和从经验中学习的能力是机器人技术中的一个核心挑战，提供了一种劳动密集型的人类示范的可扩展替代方案。然而，实现这种“玩耍”需要（1）一种对各种潜在分布外环境状态具有鲁棒性的策略，以及（2）一种能够持续生成有用机器人经验的程序。为了解决这些挑战，我们引入了Tether，一种涉及结构化、任务导向交互的自主功能性玩耍方法。首先，我们设计了一种新颖的开环策略，通过将动作锚定到目标场景中的语义关键点对应关系，对来自少量源示范（≤10个）的动作进行扭曲。我们展示了这种设计在数据效率和鲁棒性方面具有极大的优势，即使在显著的空间和语义变化下也是如此。其次，我们通过视觉理解能力引导的连续循环任务选择、执行、评估和改进，将此策略部署到现实世界中的自主功能性玩耍中。这种方法生成了大量高质量的数据集，同时减少了人类干预。在类似家庭的多对象设置中，我们的方法是第一个仅从少量示范开始，在现实世界中进行多小时的自主多任务玩耍的方法。这产生了一条持续改进闭环模仿策略性能的数据流，最终产生了超过1000条专家级轨迹，并训练出与人类收集示范学习的策略竞争的策略。

Summary / 总结

Tether is a method for autonomous functional play in robotics, addressing the challenges of robust policy design and continuous experience generation. It uses a novel open-loop policy that warps actions from a few source demonstrations based on semantic keypoint correspondences, making it highly data-efficient and robust. Tether continuously selects tasks, executes them, evaluates the outcomes, and improves the policy, generating high-quality datasets with minimal human intervention. In a household-like setup, Tether performs multi-task play for many hours, producing over 1000 expert-level trajectories and training policies competitive with those from human demonstrations.

Tether 是一种用于自主功能玩耍的方法，旨在解决稳健策略设计和持续经验生成的挑战。它使用一种新颖的开环策略，基于语义关键点对应来扭曲来自少量源演示的动作，显示出高效的数据利用和鲁棒性。Tether 通过持续选择任务、执行、评估和改进策略，生成高质量的数据集，几乎无需人工干预。在类似家庭的多对象设置中，Tether 能够进行多任务玩耍长达数小时，生成超过1000条专家级轨迹，并训练出与人类收集的演示相比竞争力相当的策略。

Beyond Language Modeling: An Exploration of Multimodal Pretraining

Authors: Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie

First: 2026-03-03T18:58:00+00:00 · Latest: 2026-03-03T18:58:00+00:00

Comments: Project website at https://beyond-llms.github.io/

Abstract

The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.

中文标题/摘要

标题：超越语言模型：多模态预训练探索

视觉世界为推进基础模型超越语言提供了关键维度。尽管对该方向的兴趣日益增长，但原生多模态模型的设计空间仍然模糊不清。我们通过控制性、从零开始的预训练实验提供了实证上的清晰度，隔离了多模态预训练的决定因素，而不受语言预训练的干扰。我们采用Transfusion框架，使用下一个标记预测语言，使用扩散模型处理视觉，训练数据包括文本、视频、图像-文本对，甚至动作条件化的视频。我们的实验得出四个关键见解：(i) 表征自编码器（RAE）通过在视觉理解和生成方面表现出色，提供了最优的统一视觉表示；(ii) 视觉和语言数据是互补的，共同促进了下游能力；(iii) 统一的多模态预训练自然地导向世界建模，能力源自通用训练；(iv) 混合专家模型（MoE）能够高效有效地扩展多模态模型，自然地诱导模态专业化。通过IsoFLOP分析，我们计算了两种模态的扩展定律，并揭示了扩展不对称性：视觉比语言更需要数据。我们证明MoE架构通过提供语言所需的高模型容量，同时适应视觉的数据密集特性，协调了这种扩展不对称性，为真正统一的多模态模型铺平了道路。

Summary / 总结

The research explores native multimodal pretraining as a way to advance foundation models beyond language. Using the Transfusion framework (next-token prediction for language, diffusion for vision), the authors run controlled, from-scratch pretraining on text, video, image-text pairs, and action-conditioned video. They report four findings: Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; visual and language data are complementary; unified multimodal pretraining leads naturally to world modeling; and Mixture-of-Experts (MoE) enables efficient multimodal scaling while inducing modality specialization. IsoFLOP analysis reveals a scaling asymmetry in which vision is significantly more data-hungry than language. The MoE architecture harmonizes this asymmetry by supplying the high model capacity that language requires while accommodating the data-intensive nature of vision.

研究旨在通过探索超越语言的多模态预训练来推进基础模型。作者使用Transfusion框架在文本、视频和图像-文本对等多样化的数据上进行训练。关键发现包括Representation Autoencoder在统一视觉表示方面的优越性，视觉数据和语言数据的互补性，统一多模态预训练自然导致世界建模能力的出现，以及Mixture-of-Experts架构在处理视觉数据的高数据需求的同时有效扩展语言模型容量。通过IsoFLOP分析揭示了视觉和语言之间的扩展不对称性，视觉需要更多的数据。MoE架构通过提供语言所需的高模型容量并适应视觉的数据密集性，实现了多模态扩展的高效性。

Learning Demographic-Conditioned Mobility Trajectories with Aggregate Supervision

Authors: Jessie Z. Li, Zhiqing Hong, Toru Shirakawa, Serina Chang

First: 2026-03-03T18:57:44+00:00 · Latest: 2026-03-03T18:57:44+00:00

Abstract

Human mobility trajectories are widely studied in public health and social science, where different demographic groups exhibit significantly different mobility patterns. However, existing trajectory generation models rarely capture this heterogeneity because most trajectory datasets lack demographic labels. To address this gap in data, we propose ATLAS, a weakly supervised approach for demographic-conditioned trajectory generation using only (i) individual trajectories without demographic labels, (ii) region-level aggregated mobility features, and (iii) region-level demographic compositions from census data. ATLAS trains a trajectory generator and fine-tunes it so that simulated mobility matches observed regional aggregates while conditioning on demographics. Experiments on real trajectory data with demographic labels show that ATLAS substantially improves demographic realism over baselines (JSD $\downarrow$ 12%--69%) and closes much of the gap to strongly supervised training. We further develop theoretical analyses for when and why ATLAS works, identifying key factors including demographic diversity across regions and the informativeness of the aggregate feature, paired with experiments demonstrating the practical implications of our theory. We release our code at https://github.com/schang-lab/ATLAS.

中文标题/摘要

标题：使用聚合监督学习人口条件化的移动轨迹

人类移动轨迹在公共卫生和社会科学中广泛研究，不同的人口群体表现出显著不同的移动模式。然而，现有的轨迹生成模型很少捕捉到这种异质性，因为大多数轨迹数据集缺乏人口标签。为了解决这一数据缺口，我们提出了一种名为ATLAS的弱监督方法，用于使用仅有的(i) 无人口标签的个体轨迹，(ii) 区域级聚合移动特征，以及(iii) 来自人口普查数据的区域级人口组成进行人口条件化的轨迹生成。ATLAS训练了一个轨迹生成器，并对其进行微调，使其模拟的移动与观察到的区域聚合相匹配，同时根据人口进行条件化。在具有人口标签的真实轨迹数据上的实验表明，与基线相比，ATLAS在人口现实性方面有了显著提高（JSD下降12%--69%），并大大缩小了与强监督训练的差距。我们进一步对ATLAS何时以及为何有效进行了理论分析，确定了关键因素包括区域间的人口多样性以及聚合特征的信息量，并通过实验展示了我们理论的实际意义。我们已在https://github.com/schang-lab/ATLAS/发布了我们的代码。

Summary / 总结

The research aims to address the lack of demographic heterogeneity in human mobility trajectory models by proposing ATLAS, a weakly supervised approach. ATLAS uses individual trajectories, region-level aggregated mobility features, and demographic compositions from census data to generate demographic-conditioned trajectories. Experiments show that ATLAS significantly improves demographic realism compared to baselines and closes much of the gap to strongly supervised training methods.

研究旨在通过提出ATLAS，一种弱监督方法，解决现有移动轨迹模型中缺乏人口异质性的问题。ATLAS 利用个体轨迹、区域级别的聚合移动特征以及人口普查数据中的人口组成来生成人口条件化的轨迹。实验表明，ATLAS 显著提高了人口现实性，相比基线方法有显著改进，并且在很大程度上缩小了与强监督训练方法的差距。
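
The ATLAS abstract above reports demographic realism as a reduction in JSD (Jensen-Shannon divergence). As a rough illustration of what that metric measures, the sketch below computes JSD between two discrete distributions in plain Python. The base-2 logarithm, the function names, and the toy histograms are assumptions for illustration only; the abstract does not say which mobility features or log base the authors use.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical histograms over three trip-distance bins for one demographic group.
observed = [0.5, 0.3, 0.2]
generated = [0.4, 0.4, 0.2]
print(jensen_shannon_divergence(observed, generated))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), so a lower JSD means the generated trajectories for a group look more like the real ones.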

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun

First: 2026-03-03T18:55:37+00:00 · Latest: 2026-03-03T18:55:37+00:00

Comments: Project page: https://LoGeR-project.github.io/

Abstract

Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.

中文标题/摘要

标题：LoGeR：长上下文几何重建与混合记忆

前馈几何基础模型在短窗口重建中表现出色，但将其扩展到几分钟长的视频受到二次注意力复杂度或递归设计中有限的有效记忆的限制。我们提出了LoGeR（长上下文几何重建），这是一种新型架构，可以在无需后优化的情况下扩展密集的3D重建到极其长的序列。LoGeR 分块处理视频流，利用强大的双向先验知识进行高保真度的块内推理。为了解决块边界间连贯性这一关键挑战，我们提出了一种基于学习的混合记忆模块。这个双组件系统结合了一个参数化的测试时训练（TTT）记忆来锚定全局坐标系并防止尺度漂移，以及一个非参数化的滑动窗口注意力（SWA）机制来保留未压缩的上下文以实现高精度的相邻对齐。令人惊讶的是，这种记忆架构使LoGeR能够在128帧的序列上进行训练，并在推理过程中泛化到数千帧。LoGeR 在标准基准测试和一个新重新利用的VBR数据集上进行了评估，该数据集包含长达19000帧的序列，LoGeR 显著优于先前的前馈方法——在KITTI上的ATE降低了超过74%——并且实现了前所未有的长距离的稳健且全局一致的重建。

Summary / 总结

LoGeR is designed to perform long-context geometric reconstruction for extremely long video sequences by processing them in chunks and using a hybrid memory module. This module includes a parametric Test-Time Training memory to maintain a global coordinate frame and a non-parametric Sliding Window Attention mechanism to preserve context. LoGeR can be trained on sequences of 128 frames and generalize to thousands of frames, significantly outperforming previous methods on benchmarks and a newly repurposed VBR dataset with up to 19k frames, reducing ATE on KITTI by over 74%.

LoGeR 通过分块处理视频并使用混合记忆模块，将密集的3D重建扩展到长视频序列。该模块包括一个参数化的测试时训练（TTT）记忆来锚定全局坐标系，以及一个非参数化的滑动窗口注意力（SWA）机制来保留上下文。实验表明，LoGeR 可以在128帧序列上进行训练，并在推理时泛化到数千帧，优于此前的前馈方法，在KITTI上将绝对轨迹误差（ATE）降低了超过74%。
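
The LoGeR abstract above quotes its headline result as a reduction in ATE (absolute trajectory error). A minimal sketch of the RMSE form of that metric follows, assuming the estimated and ground-truth camera positions are already expressed in the same coordinate frame. Standard evaluation protocols usually align the two trajectories first; that step is omitted here, and the function name and toy poses are hypothetical.

```python
import math
from typing import Sequence, Tuple

Position = Tuple[float, float, float]


def ate_rmse(estimated: Sequence[Position], ground_truth: Sequence[Position]) -> float:
    """Root-mean-square translational error over per-frame camera positions.

    Assumes both trajectories are already aligned to a common frame.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy two-frame trajectory (hypothetical values).
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(ate_rmse(est, gt))
```

Because errors are squared before averaging, a single badly drifted segment dominates the score, which is why scale drift over long sequences matters for this metric.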

Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals

Authors: Achyutha Menon, Magnus Saebo, Tyler Crosse, Spencer Gibson, Eyon Jang, Diogo Cruz

Venue: ICLR 2026 Lifelong Agents Workshop

First: 2026-03-03T18:50:59+00:00 · Latest: 2026-03-03T18:50:59+00:00

Comments: 22 pages, 7 figures. Accepted at ICLR 2026 Lifelong Agents Workshop

Abstract

The accelerating adoption of language models (LMs) as agents for deployment in long-context tasks motivates a thorough understanding of goal drift: agents' tendency to deviate from an original objective. While prior-generation language model agents have been shown to be susceptible to drift, the extent to which drift affects more recent models remains unclear. In this work, we provide an updated characterization of the extent and causes of goal drift. We investigate drift in state-of-the-art models within a simulated stock-trading environment (Arike et al., 2025). These models are largely shown to be robust even when subjected to adversarial pressure. We show, however, that this robustness is brittle: across multiple settings, the same models often inherit drift when conditioned on prefilled trajectories from weaker agents. The extent of conditioning-induced drift varies significantly by model family, with only GPT-5.1 maintaining consistent resilience among tested models. We find that drift behavior is inconsistent between prompt variations and correlates poorly with instruction hierarchy following behavior, with strong hierarchy following failing to reliably predict resistance to drift. Finally, we run analogous experiments in a new emergency room triage environment to show preliminary evidence for the transferability of our results across qualitatively different settings. Our findings underscore the continued vulnerability of modern LM agents to contextual pressures and the need for refined post-training techniques to mitigate this.

中文标题/摘要

标题：继承的目标漂移：上下文压力可能削弱智能体目标

语言模型（LMs）作为部署在长上下文任务中的代理的加速采用促使我们对目标漂移有一个全面的理解：代理倾向于偏离原始目标的倾向。尽管前一代语言模型代理已被证明容易发生漂移，但这些漂移对较新模型的影响程度尚不清楚。在本研究中，我们提供了对漂移程度及其原因的最新描述。我们研究了在模拟股票交易环境中（Arike et al., 2025）最先进的模型中的漂移。这些模型在遭受对抗性压力时显示出很大的鲁棒性。然而，我们表明这种鲁棒性是脆弱的：在多个设置中，当这些模型在预填充较弱代理轨迹的条件下运行时，它们经常继承漂移。由条件引起的漂移程度在不同模型家族之间差异显著，只有GPT-5.1在测试模型中保持了一致的抗漂移能力。我们发现漂移行为在不同提示变体之间不一致，并且与指令层次遵循行为的相关性较差，强烈的层次遵循行为并不能可靠地预测对漂移的抵抗力。最后，我们在新的急诊室分诊环境中运行类似的实验，以初步证明我们的结果在不同质的环境中具有可转移性。我们的研究结果强调了现代LM代理在面对上下文压力时持续的脆弱性，并强调了需要改进的后训练技术来缓解这一问题。

Summary / 总结

This study characterizes the extent and causes of goal drift in state-of-the-art language model (LM) agents within a simulated stock-trading environment. The models are largely robust to adversarial pressure, but that robustness is brittle: they often inherit drift when conditioned on prefilled trajectories from weaker agents. The extent of conditioning-induced drift varies by model family, with only GPT-5.1 showing consistent resilience among the tested models. Drift behavior is inconsistent across prompt variations and correlates poorly with instruction-hierarchy following. Analogous experiments in a new emergency room triage environment give preliminary evidence that the results transfer, and the authors call for refined post-training techniques to mitigate this vulnerability.

研究在模拟股票交易环境中考察了最先进语言模型（LMs）的目标漂移现象，发现尽管这些模型在对抗性压力下表现出鲁棒性，但在以较弱代理的预填充轨迹为条件时，它们往往会继承目标漂移。不同模型家族的漂移程度差异显著，在测试模型中只有 GPT-5.1 展现出一致的抗漂移能力。研究还指出，漂移行为在不同提示变体之间不一致，并且与指令层次遵循行为的相关性较差。

Theory of Code Space: Do Code Agents Understand Software Architecture?

Authors: Grigory Sapunov

First: 2026-02-28T11:40:17+00:00 · Latest: 2026-03-03T18:45:08+00:00

Comments: updated experiments

Abstract

AI code agents excel at isolated tasks yet struggle with complex, multi-file software engineering requiring understanding of how dozens of modules relate. We hypothesize these failures stem from inability to construct, maintain, and update coherent architectural beliefs during codebase exploration. We introduce Theory of Code Space (ToCS), a benchmark that evaluates this capability by placing agents in procedurally generated codebases under partial observability, requiring them to build structured belief states over module dependencies, cross-cutting invariants, and design intent. The framework features: (1) a procedural codebase generator producing medium-complexity Python projects with four typed edge categories reflecting different discovery methods -- from syntactic imports to config-driven dynamic wiring -- with planted architectural constraints and verified ground truth; (2) a partial observability harness where agents explore under a budget; and (3) periodic belief probing via structured JSON, producing a time-series of architectural understanding. We decompose the Active-Passive Gap from spatial reasoning benchmarks into selection and decision components, and introduce Architectural Constraint Discovery as a code-specific evaluation dimension. Preliminary experiments with four rule-based baselines and five frontier LLM agents from three providers validate discriminative power: methods span a wide performance range (F1 from 0.129 to 0.646), LLM agents discover semantic edge types invisible to all baselines, yet weaker models score below simple heuristics -- revealing that belief externalization, faithfully serializing internal understanding into structured JSON, is itself a non-trivial capability and a first-order confounder in belief-probing benchmarks. Open-source toolkit: https://github.com/che-shr-cat/tocs

中文标题/摘要

标题：代码空间理论：代码代理是否理解软件架构？

AI代码代理在执行孤立任务方面表现出色，但在处理需要理解数十个模块之间关系的复杂、多文件软件工程时却遇到困难。我们假设这些失败源于其在代码库探索过程中无法构建、维护和更新一致的架构信念。我们提出了代码空间理论（ToCS），通过将代理置于部分可观测性的程序生成代码库中，要求它们构建模块依赖关系、切面不变量和设计意图的结构化信念状态来评估这一能力。该框架包括：（1）一个程序生成代码库生成器，生成具有四种类型边类别的中等复杂度的Python项目，反映不同的发现方法——从语法导入到配置驱动的动态连接，并植入架构约束和验证的地面真相；（2）一个部分可观测性框架，代理在预算内探索；（3）通过结构化JSON进行定期信念探查，产生架构理解的时间序列。我们将空间推理基准中的主动-被动差距分解为选择和决策组件，并引入架构约束发现作为代码特定的评估维度。使用四个基于规则的基线和五个来自三家提供商的前沿LLM代理的初步实验验证了区分能力：方法的性能范围广泛（F1从0.129到0.646），LLM代理发现所有基线都无法识别的语义边类型，但较弱的模型得分低于简单启发式方法——揭示信念外化，将内部理解忠实序列化为结构化JSON，本身就是一项非平凡的能力，并且在信念探查基准中是一个主要的混淆因素。开源工具包：https://github.com/che-shr-cat/tocs

Summary / 总结

The research asks why AI code agents struggle with complex, multi-file software engineering that requires understanding module dependencies and architectural intent. It introduces Theory of Code Space (ToCS), a benchmark that evaluates agents' ability to construct and maintain coherent architectural beliefs in procedurally generated codebases. The benchmark combines a procedural codebase generator, a partial observability harness with an exploration budget, and periodic belief probing via structured JSON, and it adds Architectural Constraint Discovery as a code-specific evaluation dimension. Preliminary experiments with four rule-based baselines and five LLM agents show a wide performance range (F1 from 0.129 to 0.646) and indicate that belief externalization, that is, faithfully serializing internal understanding into structured JSON, is itself a non-trivial capability and a confounder in belief-probing benchmarks.

研究假设AI代码代理在复杂软件工程任务中表现不佳，是因为它们无法在代码库探索过程中构建和维护一致的架构信念。研究引入了代码空间理论（ToCS），通过部分可观测的程序生成代码库来评估这一能力。关键发现包括：各方法的性能范围很广（F1从0.129到0.646）；LLM代理发现了所有基线都无法识别的语义边类型，但较弱的模型得分低于简单启发式方法。这表明将内部理解忠实地外化为结构化JSON本身就是一项非平凡的能力，也是信念探查基准中的一个主要混淆因素。
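The abstract above reports F1 scores over discovered dependency edges and notes that failing to serialize beliefs into structured JSON can itself lower a score. A minimal sketch of how such a probe could be scored follows. The JSON schema (`edges`, `src`, `dst`, `type`) and both function names are hypothetical and are not taken from the ToCS toolkit.

```python
import json


def parse_belief(raw_json):
    """Parse one belief probe into (src, dst, type) triples.

    Assumed schema (hypothetical): {"edges": [{"src":..., "dst":..., "type":...}]}.
    Malformed output counts as an empty belief, which is how an
    externalization failure would depress the score.
    """
    try:
        data = json.loads(raw_json)
        return [(e["src"], e["dst"], e["type"]) for e in data["edges"]]
    except (json.JSONDecodeError, KeyError, TypeError):
        return []


def edge_f1(predicted, truth):
    """Return (precision, recall, F1) over sets of typed edges."""
    pred, gold = set(predicted), set(truth)
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1
```

Calling `edge_f1(parse_belief(probe), ground_truth)` at each probe step would give the kind of time series of architectural understanding that the abstract describes.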

UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

Authors: Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren

First: 2026-01-07T23:49:52+00:00 · Latest: 2026-03-03T18:40:54+00:00

Comments: Project Page: https://unidrive-wm.github.io/UniDrive-WM

Abs · PDF · Code1 · Code2 · Project1

Abstract

World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .

中文标题/摘要

标题：UniDrive-WM：统一理解、规划和生成世界模型在自动驾驶中的应用

世界模型已成为自动驾驶的核心，准确的场景理解和未来预测对于安全控制至关重要。近期研究探索了使用视觉-语言模型（VLMs）进行规划，但现有方法通常将感知、预测和规划视为独立模块。我们提出UniDrive-WM，这是一种基于VLM的统一世界模型，能够在单一架构中联合执行驾驶场景理解、轨迹规划和基于轨迹的未来图像生成。UniDrive-WM的轨迹规划器预测未来轨迹，条件化VLM图像生成器以生成合理的未来帧。这些预测提供了额外的监督信号，增强场景理解并逐步细化轨迹生成。我们进一步比较了离散和连续输出表示对未来图像预测的影响，分析其对下游驾驶性能的影响。在具有挑战性的Bench2Drive基准测试中，UniDrive-WM生成了高保真度的未来图像，并在L2轨迹误差和碰撞率方面分别提高了5.9%和9.2%，超过了之前的最佳方法。这些结果表明，将VLM驱动的推理、规划和生成世界建模紧密集成对于自动驾驶的优势。项目页面可在https://unidrive-wm.github.io/UniDrive-WM 查看。

Summary / 总结

UniDrive-WM is a unified VLM-based world model that integrates driving-scene understanding, trajectory planning, and future image generation. It uses a trajectory planner to predict future trajectories, which conditions a VLM to generate plausible future frames. Experiments show that UniDrive-WM improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate compared to the previous best method on the Bench2Drive benchmark.

UniDrive-WM 是一个统一的基于 VLM 的世界模型，集成了驾驶场景理解、轨迹规划和未来图像生成。它使用轨迹规划器预测未来轨迹，并以该轨迹为条件，由基于 VLM 的图像生成器生成合理的未来帧；这些预测提供额外的监督信号，增强场景理解并迭代细化轨迹生成。研究还比较了离散与连续两种输出表示对未来图像预测及下游驾驶性能的影响。在 Bench2Drive 基准上，UniDrive-WM 在 L2 轨迹误差和碰撞率方面分别比之前的最佳方法改善了 5.9% 和 9.2%。

Using Learning Progressions to Guide AI Feedback for Science Learning

Authors: Xin Xia, Nejla Yuruk, Yun Wang, Xiaoming Zhai

First: 2026-03-03T18:39:58+00:00 · Latest: 2026-03-03T18:39:58+00:00

Comments: 15pages, 4 figures

Abs · PDF · Code1 · Code2

Abstract

Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students' developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen's kappa values for estimable dimensions (kappa = .66 to .88). Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.

中文标题/摘要

标题：利用学习进展引导人工智能反馈以促进科学学习

生成式人工智能（AI）为形成性反馈提供了可扩展的支持，但大多数由AI生成的反馈依赖于由领域专家撰写的任务特定评分标准。虽然有效，但评分标准的撰写耗时且限制了其在教学情境中的可扩展性。学习进展（LP）提供了一种理论依据明确的学生理解发展的表示，可能提供一种替代方案。本研究探讨了基于LP的评分标准生成管道是否能产生与由专家撰写的任务特定评分标准指导的反馈质量相当的AI生成反馈。我们分析了207名中学生在化学任务中生成的关于书面科学解释的AI生成反馈。比较了两种管道：（a）由人类专家设计的任务特定评分标准指导的反馈，以及（b）在评分和反馈生成之前从学习进展自动推导出的任务特定评分标准指导的反馈。两名人类编码员使用一个多维度评分标准评估反馈质量，该评分标准评估清晰度、准确性、相关性、参与度和动机、反思性（10个子维度）。编码者间一致性高，百分比一致率从89%到100%不等，可估计维度的科恩κ值为0.66到0.88。配对t检验显示，两个管道在清晰度（t1 = 0.00，p1 = 1.000；t2 = 0.84，p2 = .399）、相关性（t1 = 0.28，p1 = .782；t2 = -0.58，p2 = .565）、参与度和动机（t1 = 0.50，p1 = .618；t2 = -0.58，p2 = .565）或反思性（t = -0.45，p = .656）方面均无统计学差异。这些发现表明，基于LP的评分标准管道可以作为一种替代方案。

Summary / 总结

This study investigates whether AI-generated feedback guided by a rubric derived automatically from a learning progression (LP) can match the quality of feedback guided by an expert-designed rubric. It compares the two pipelines on feedback for written scientific explanations from 207 middle school students in a chemistry task. Two human coders rated the feedback on five dimensions (Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness); paired t-tests showed no statistically significant differences between the pipelines for Clarity, Relevance, Engagement and Motivation, or Reflectiveness, suggesting that the LP-driven approach is a viable alternative.

研究探讨了由学习进展（LP）自动推导出的评分标准所指导的AI反馈，能否在质量上与专家设计的评分标准所指导的反馈相媲美。研究基于207名中学生在化学任务中撰写的科学解释，比较了这两种管道。两名人类编码员从五个维度（清晰度、准确性、相关性、参与度和动机、反思性）评估反馈；配对t检验显示，两种管道在清晰度、相关性、参与度和动机以及反思性方面均无统计学上的显著差异，这表明基于学习进展的方法是一个可行的替代方案。
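The abstract reports inter-rater reliability as Cohen's kappa and compares the two pipelines with paired t-tests. The sketch below shows how both statistics are computed from raw ratings using the textbook formulas. The function names and inputs are illustrative assumptions and this is not the authors' analysis code.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used one label only: not estimable
    return (observed - expected) / (1 - expected)


def paired_t(x, y):
    """Paired-samples t statistic for matched score lists x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:
        # Identical differences everywhere; t is 0 only if they are all zero.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)
```

The `nan` branch corresponds to the case where kappa cannot be estimated because there is no variation in the labels, which is consistent with the abstract reporting kappa only for "estimable dimensions". A t statistic of exactly 0.00 arises when the mean paired difference is zero.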

Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals

Authors: Patrick Gerard, Svitlana Volkova

First: 2026-03-03T18:36:25+00:00 · Latest: 2026-03-03T18:36:25+00:00

Comments: 27 Pages

Abs · PDF · Code1 · Code2

Abstract

Language models deployed in online communities must adapt to norms that vary across social, cultural, and domain-specific contexts. Prior alignment approaches rely on explicit preference supervision or predefined principles, which are effective for well-resourced settings but exclude most online communities -- particularly those without institutional backing, annotation infrastructure, or organized around sensitive topics -- where preference elicitation is costly, ethically fraught, or culturally misaligned. We observe that communities already express preferences implicitly through what content they accept, engage with, and allow to persist. We show that this acceptance behavior induces measurable geometric structure in representation space: accepted responses occupy coherent, high-density regions that reflect community-specific norms, while rejected content falls in sparser or misaligned areas. We operationalize this structure as an implicit preference signal for alignment and introduce density-guided response optimization (DGRO), a method that aligns language models to community norms without requiring explicit preference labels. Using labeled preference data, we demonstrate that local density recovers pairwise community judgments, indicating that geometric structure encodes meaningful preference signal. We then apply DGRO in annotation-scarce settings across diverse communities spanning platform, topic, and language. DGRO-aligned models consistently produce responses preferred by human annotators, domain experts, and model-based judges over supervised and prompt-based baselines. We position DGRO as a practical alignment alternative for communities where explicit preference supervision is unavailable or misaligned with situated practices, and discuss the implications and risks of learning from emergent acceptance behavior.

中文标题/摘要

标题：基于密度的响应优化：通过隐含接受信号实现社区导向对齐

部署在网络社区中的语言模型必须适应在社会、文化和领域特定背景下变化的规范。先前的对齐方法依赖于显式的偏好监督或预定义的原则，这些方法在资源丰富的情境下有效，但在大多数网络社区中无效，特别是那些没有机构支持、注释基础设施或围绕敏感话题组织起来的社区，其中偏好获取成本高、伦理上棘手或文化上不一致。我们观察到，社区已经通过他们接受、参与和允许存在的内容隐含地表达了偏好。我们展示了这种接受行为在表示空间中产生了可测量的几何结构：被接受的响应占据着反映社区特定规范的高密度、连贯区域，而被拒绝的内容则位于稀疏或错位的区域。我们将这种结构操作化为隐含的偏好信号，并引入基于密度的响应优化（DGRO）方法，该方法可以在不需要显式偏好标签的情况下将语言模型对齐到社区规范。使用标记的偏好数据，我们证明局部密度恢复了社区间的成对判断，表明几何结构编码了有意义的偏好信号。然后，我们在平台、主题和语言方面多样化的社区中应用DGRO，这些社区缺乏注释数据。DGRO对齐的模型在人类注释者、领域专家和基于模型的评判者中产生了一致的偏好响应，优于监督和提示基线。我们将DGRO定位为在显式偏好监督不可用或与情境实践不一致的社区中的一种实用对齐替代方案，并讨论从新兴接受行为中学习的含义和风险。

Summary / 总结

The research aims to align language models with community norms in settings where explicit preference supervision is unavailable or ethically challenging. The method, density-guided response optimization (DGRO), leverages implicit acceptance signals from community behavior to align models without explicit labels. Key findings show that local density in representation space reflects community preferences, and DGRO outperforms supervised and prompt-based baselines in diverse communities across platforms, topics, and languages.

研究旨在解决在缺乏明确偏好监督或伦理上具有挑战性的社区中，如何使语言模型与社区规范保持一致的问题。方法是密度导向响应优化（DGRO），通过利用社区行为中的隐含接受信号来对齐模型，而无需明确的标签。关键发现表明，表示空间中的局部密度反映了社区偏好，且DGRO在跨平台、主题和语言的多样化社区中优于监督和提示基线方法。
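The abstract states that accepted responses occupy high-density regions of representation space and that local density recovers pairwise community judgments. As a rough illustration, the sketch below scores a candidate by its mean distance to the k nearest accepted embeddings and uses that score to pick between two responses. The k-nearest-neighbour estimator, the function names, and the toy 2-D embeddings are all assumptions. The abstract does not say which density estimator or encoder DGRO uses.

```python
import math


def knn_density_score(candidate, accepted, k=3):
    """Higher score = denser neighbourhood among accepted embeddings.

    Uses the negative mean Euclidean distance to the k nearest accepted
    points as a simple stand-in for a local density estimate.
    """
    dists = sorted(math.dist(candidate, a) for a in accepted)
    nearest = dists[:k]
    return -sum(nearest) / len(nearest)


def prefer(resp_a, resp_b, accepted, k=3):
    """Return 'a' or 'b' for the response lying in the denser accepted region."""
    score_a = knn_density_score(resp_a, accepted, k)
    score_b = knn_density_score(resp_b, accepted, k)
    return "a" if score_a >= score_b else "b"
```

In practice the embeddings would come from a text encoder. A pairwise preference obtained this way needs no explicit labels, which is the property the abstract emphasizes.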

UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Authors: Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen

First: 2026-03-03T18:36:16+00:00 · Latest: 2026-03-03T18:36:16+00:00

Abs · PDF · Code1 · Code2

Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.

中文标题/摘要

标题：UniG2U-Bench：统一模型是否推进了多模态理解？

统一多模态模型最近展示了强大的生成能力，但生成是否以及何时提升理解仍不清楚。现有基准缺乏对生成促进理解的具体任务的系统探索。为此，我们引入了UniG2U-Bench，这是一个全面的基准，将生成到理解（G2U）评估分为7个阶段和30个子任务，需要不同程度的隐式或显式视觉转换。对超过30个模型的广泛评估揭示了三个核心发现：1）统一模型通常不如其基础视觉语言模型（VLM），生成后推理通常会降低性能相对于直接推理。2）在空间智能、视觉错觉或多轮推理子任务中出现一致的增强，其中增强的空间和形状感知以及多步中间图像状态是有益的。3）具有相似推理结构的任务和共享架构的模型表现出相关行为，表明生成-理解耦合在任务、预训练数据和模型架构上诱导出类一致的归纳偏置。这些发现强调了需要更多样化的训练数据和新颖的范式来充分释放统一多模态建模的潜力。

Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

Authors: Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, Daniel A. McFarland, James Y. Zou

Venue: ICML

First: 2024-03-11T21:51:39+00:00 · Latest: 2026-03-03T18:36:00+00:00

Comments: 46 pages, 31 figures, ICML '24

Abs · PDF · Code1 · Code2

Abstract

We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.

中文标题/摘要

标题：大规模监控AI修改内容：ChatGPT对AI会议同行评审影响的案例研究

我们提出了一种方法，用于估计大型语料库中可能被大幅修改或由大型语言模型（LLM）生成的文本的比例。我们的最大似然模型利用专家撰写的和AI生成的参考文本，以准确高效地在语料库级别检查实际的LLM使用情况。我们将这种方法应用于AI会议同行评审的案例研究，这些会议发生在ChatGPT发布之后：ICLR 2024、NeurIPS 2023、CoRL 2023和EMNLP 2023。我们的结果显示，提交给这些会议的同行评审文本中有6.5%至16.9%可能已被LLM大幅修改，即超出拼写检查或轻微的写作更新。生成文本出现的环境提供了用户行为的见解：估计的LLM生成文本的比例在报告较低信心、接近截止日期提交和不太可能回应作者反驳的审稿人中更高。我们还观察到语料库级别的生成文本趋势，这些趋势在个体层面可能不易察觉，并讨论了这些趋势对同行评审的影响。我们呼吁未来跨学科研究探讨LLM使用如何改变我们的信息和知识实践。

Summary / 总结

The study presents a method for estimating the proportion of text in large corpora that is likely to be substantially modified or generated by large language models (LLMs). By applying this method to peer reviews from AI conferences after the release of ChatGPT, the research found that between 6.5% and 16.9% of the text could have been significantly altered by LLMs. The findings suggest that generated text is more common in reviews with lower confidence, submitted near the deadline, and from less responsive reviewers. The study also highlights corpus-level trends that may not be apparent at the individual level, and calls for further interdisciplinary research into the impact of LLM use on peer review practices.

研究提出了一种方法，用于估计大型语料库中可能被大幅修改或由大型语言模型（LLM）生成的文本比例。通过将此方法应用于ChatGPT发布后AI会议的同行评审文本，研究发现6.5%至16.9%的文本可能已被LLM大幅修改。研究结果表明，生成的文本在信心较低、接近截止日期提交和回复作者反驳较少的评审中更为常见。研究还指出了在个体层面可能不易察觉的语料库级趋势，并呼吁进一步的跨学科研究，以探讨LLM使用如何改变我们的信息和知识实践。
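The abstract describes a maximum likelihood model that estimates the corpus-level fraction of LLM-modified text from human-written and AI-generated reference distributions. A minimal sketch of that idea follows. It treats the corpus as a two-component mixture and searches for the mixing weight that maximizes the log-likelihood. The per-document likelihood inputs `p_human` and `p_ai` are hypothetical. How the paper parameterizes those likelihoods is not stated in the abstract.

```python
import math


def log_likelihood(alpha, p_human, p_ai):
    """Log-likelihood of the corpus under (1 - alpha) * P_human + alpha * P_ai."""
    return sum(math.log((1 - alpha) * p + alpha * q)
               for p, q in zip(p_human, p_ai))


def estimate_alpha(p_human, p_ai, iters=100):
    """Maximize the (concave) mixture log-likelihood over alpha in [0, 1].

    p_human[i] and p_ai[i] are the likelihoods of document i under the two
    reference distributions; both are assumed strictly positive.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):  # ternary search: valid because the objective is concave
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, p_human, p_ai) < log_likelihood(m2, p_human, p_ai):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2
```

The estimate is a property of the whole corpus. No individual document is classified, which matches the abstract's point that corpus-level trends may be too subtle to detect at the individual level.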

COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design

Authors: Miguel Espinosa, Eva Gmelich Meijling, Valerio Marsocci, Elliot J. Crowley, Mikolaj Czerkawski

First: 2026-03-03T18:31:46+00:00 · Latest: 2026-03-03T18:31:46+00:00

Abs · PDF · Code1 · Code2

Abstract

Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products. Relationships between these modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Thus, such conditional mappings should be parametrised as data distributions. As a result, deterministic models tend to collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining. Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases. These results highlight the practical importance of stochastic generative modeling for Earth observation and motivate evaluation protocols that move beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen

中文标题/摘要

标题：COP-GEN：用于哥白尼地球观测数据的潜在扩散变换器——设计上具有随机性

地球观测应用越来越多地依赖于多种传感器的数据，包括光学、雷达、高程和土地覆盖产品。这些模态之间的关系对于数据集成至关重要，但它们是本原非单射的：相同的条件信息可以对应多个物理上合理的观测结果。因此，这样的条件映射应该被参数化为数据分布。结果，确定性模型往往会向条件均值坍塌，并且无法表示诸如数据完成和跨传感器翻译等任务所需的不确定性与变化性。我们引入了COP-GEN，这是一种多模态潜在扩散变换器，它在各自的原生空间分辨率下建模了异构地球观测模态的联合分布。通过将跨模态映射参数化为条件分布，COP-GEN 使任意到任意的条件生成变得灵活，包括零样本模态翻译、光谱波段填充以及在部分或缺失输入下的生成，而无需针对特定任务重新训练。在大规模全球多模态数据集上的实验表明，COP-GEN 生成了多样且物理上一致的实现，同时在光学、雷达和高程模态中保持了强大的峰值保真度。定性和定量分析表明，该模型捕捉到了有意义的跨模态结构，并且随着条件信息的增加系统地调整其输出不确定性。这些结果突显了随机生成建模在地球观测中的实际重要性，并激励了超越单一参考点度量的评估协议。

Summary / 总结

COP-GEN is a multimodal latent diffusion transformer designed for Earth observation data, addressing the need for stochastic generation to handle the non-injective relationships between different sensor modalities. By modeling the joint distribution of heterogeneous Earth Observation data, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation and spectral band infilling. Experiments show that COP-GEN generates diverse and physically consistent realizations while maintaining strong peak fidelity across various modalities, highlighting the practical importance of stochastic generative modeling for Earth observation tasks.

COP-GEN 是一种多模态潜扩散变换器，旨在处理不同传感器模态之间非单射关系的地球观测数据生成问题，通过将跨模态映射参数化为条件分布，COP-GEN 能够实现灵活且具有不确定性意识的生成，包括零样本模态转换和光谱带填充。实验结果显示，COP-GEN 能够生成多样且物理上一致的实现，并在各种地球观测模态中保持高峰值保真度。这些结果强调了在地球观测应用中使用随机生成模型的重要性，并表明需要采用能够考虑不确定性与变化性的评估指标。
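The abstract's claim that deterministic models collapse toward conditional means can be made concrete with a standard decomposition. The derivation below assumes a deterministic predictor f trained with squared error, which is an illustrative assumption. The abstract does not say which losses the deterministic baselines use.

```latex
% Assumption: deterministic predictor f, squared-error training objective.
% y: target observation, x: conditioning input, \mu(x) = E[y | x].
\begin{align*}
\mathbb{E}\big[\lVert y - f(x)\rVert^2 \mid x\big]
  &= \mathbb{E}\big[\lVert y - \mu(x)\rVert^2 \mid x\big]
   + \lVert \mu(x) - f(x)\rVert^2,
  \qquad \mu(x) = \mathbb{E}[y \mid x], \\
\arg\min_{f}\; \mathbb{E}\big[\lVert y - f(x)\rVert^2\big]
  &= \mu .
\end{align*}
```

The first term does not depend on f, so the optimum is the conditional mean. When several distinct observations are plausible for the same x, their average need not be a plausible observation itself, which is the motivation for modelling a conditional distribution instead.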

Quantifying User Coherence: A Unified Framework for Analyzing Recommender Systems Across Domains

Authors: Michaël Soumm, Alexandre Fournier-Montgieux, Adrian Popescu, Bertrand Delezoide

Venue: The Web Conference

First: 2024-10-03T13:02:07+00:00 · Latest: 2026-03-03T18:27:39+00:00

Comments: Accepted at The Web Conference (WWW 2026)

Abs · PDF · Code1 · Code2

Abstract

The performance of Recommender Systems (RS) varies significantly across users, yet the underlying reasons for this variance remain poorly understood. This paper introduces a unified framework to analyze and explain this performance gap by quantifying user profile characteristics. We propose two novel, information-theoretic measures: Mean Surprise (S(u)), which captures a user's deviation from popular items and is closely related to popularity bias, and Mean Conditional Surprise (CS(u)), which measures the internal coherence of a user's interactions in a domain-agnostic manner. Through extensive experiments on 7 algorithms and 9 datasets, we demonstrate that these measures are strong predictors of recommendation performance. Our analysis reveals that performance gains from complex models are concentrated on "coherent" users, while all algorithms perform poorly on "incoherent" users. We show how these measures provide practical utility for the Web community by: (1) enabling robust, stratified evaluation to identify model weaknesses; (2) facilitating a novel analysis of the behavioral alignment of recommendations; and (3) guiding targeted system design, which we validate by training a specialized model on a segment of "coherent" users that achieves superior performance for that group with significantly less data. This work provides a new lens for understanding user behavior and offers practical tools for building more robust and efficient large-scale recommender systems.

中文标题/摘要

标题：量化用户一致性：跨领域分析推荐系统性能的统一框架

推荐系统（RS）在不同用户中的表现差异显著，但其背后的原因仍不甚明了。本文提出了一种统一框架，通过量化用户特征来分析和解释这种性能差异。我们提出了两种新颖的信息论度量：平均惊讶度（S(u)），它捕捉用户与流行项目的偏差，与流行度偏差密切相关；平均条件惊讶度（CS(u)），它以领域无关的方式衡量用户交互的内部一致性。通过在7种算法和9个数据集上的广泛实验，我们证明了这些度量是推荐性能的强预测因子。我们的分析表明，复杂模型的性能提升主要集中在“一致”的用户上，而所有算法在“不一致”的用户上表现不佳。我们展示了这些度量如何为网络社区提供实用价值：（1）通过稳健的分层评估来识别模型的弱点；（2）促进对推荐行为对齐的新分析；（3）指导有针对性的系统设计，我们通过训练专门针对“一致”用户群体的模型来验证这一点，该模型在显著减少数据的情况下实现了该群体的优异性能。本研究为理解用户行为提供了一个新的视角，并提供了构建更稳健和高效的大型推荐系统的方法。

Summary / 总结

This paper introduces a unified framework to analyze the performance gap in Recommender Systems across users by quantifying user profile characteristics. It proposes two novel measures, Mean Surprise and Mean Conditional Surprise, which are shown to be strong predictors of recommendation performance. The study reveals that performance gains from complex models are concentrated on coherent users, while all algorithms perform poorly on incoherent users. The measures provide practical utility for robust evaluation, analysis of behavioral alignment, and targeted system design, as demonstrated by a specialized model trained on coherent users achieving superior performance with less data.

本文旨在通过引入统一框架来理解推荐系统（RS）在不同用户之间的性能差异。提出了两个新的度量标准，分别是Mean Surprise (S(u))和Mean Conditional Surprise (CS(u))，用于量化用户特征。通过对7种算法和9个数据集的广泛实验表明，这些度量标准是推荐性能的强预测因子，显示了复杂模型的性能提升主要集中在一致用户上，而所有算法在不一致用户上表现不佳。该框架为稳健评估、行为分析和目标系统设计提供了实用工具。
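The abstract introduces Mean Surprise S(u) as an information-theoretic measure of a user's deviation from popular items but does not give the formula. The sketch below assumes one plausible reading: the average self-information, -log2 p(i), of the items in a user's profile, with p(i) taken as the item's share of all interactions. Both that definition and the function name are assumptions. Mean Conditional Surprise is not sketched because the abstract does not specify how the conditioning is done.

```python
import math
from collections import Counter


def mean_surprise(interactions):
    """Assumed definition: S(u) = mean over the user's items of -log2 p(item).

    interactions: dict mapping user id -> list of item ids.
    p(item) is estimated as the item's share of all interactions.
    """
    counts = Counter(item for items in interactions.values() for item in items)
    total = sum(counts.values())
    return {
        user: sum(-math.log2(counts[item] / total) for item in items) / len(items)
        for user, items in interactions.items()
        if items  # skip users with empty profiles
    }
```

Under this reading, a user who interacts mostly with rare items gets a higher S(u). Users could then be bucketed by S(u) for the stratified evaluation the abstract mentions.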

Guiding Sparse Neural Networks with Neurobiological Principles to Elicit Biologically Plausible Representations

Authors: Patrick Inoue, Florian Röhrbein, Andreas Knoblauch

First: 2026-03-03T18:27:37+00:00 · Latest: 2026-03-03T18:27:37+00:00

Abs · PDF · Code1 · Code2

Abstract

While deep neural networks (DNNs) have achieved remarkable performance in tasks such as image recognition, they often struggle with generalization, learning from few examples, and continuous adaptation - abilities inherent in biological neural systems. These challenges arise due to DNNs' failure to emulate the efficient, adaptive learning mechanisms of biological networks. To address these issues, we explore the integration of neurobiologically inspired assumptions in neural network learning. This study introduces a biologically inspired learning rule that naturally integrates neurobiological principles, including sparsity, lognormal weight distributions, and adherence to Dale's law, without requiring explicit enforcement. By aligning with these core neurobiological principles, our model enhances robustness against adversarial attacks and demonstrates superior generalization, particularly in few-shot learning scenarios. Notably, integrating these constraints leads to the emergence of biologically plausible neural representations, underscoring the efficacy of incorporating neurobiological assumptions into neural network design. Preliminary results suggest that this approach could extend from feature-specific to task-specific encoding, potentially offering insights into neural resource allocation for complex tasks.

中文标题/摘要

标题：利用神经生物学原理指导稀疏神经网络以产生生物上合理的表示

尽管深度神经网络（DNNs）在图像识别等任务中取得了显著的性能，但它们在泛化能力、从少量示例学习以及持续适应方面常常表现不佳，而这些都是生物神经系统的固有能力。这些问题源于DNNs无法模拟生物网络的高效、自适应学习机制。为了解决这些问题，我们探索了在神经网络学习中整合受神经生物学启发的假设的方法。本研究引入了一种受生物学启发的学习规则，自然地整合了包括稀疏性、对数正态权重分布和遵守戴尔定律在内的神经生物学原则，而无需显式地强制执行。通过与这些核心神经生物学原则保持一致，我们的模型增强了对对抗攻击的鲁棒性，并在少量示例学习场景中表现出更优的泛化能力。值得注意的是，整合这些约束条件导致了生物上合理的神经表示的出现，突显了将神经生物学假设纳入神经网络设计的有效性。初步结果表明，这种方法可能从特征特定编码扩展到任务特定编码，可能为复杂任务中的神经资源分配提供见解。

Summary / 总结

This study aims to improve deep neural networks by integrating neurobiological principles such as sparsity and lognormal weight distributions. The research introduces a biologically inspired learning rule that enhances robustness against adversarial attacks and improves generalization, especially in few-shot learning. Key findings show that these constraints lead to biologically plausible neural representations, suggesting that incorporating neurobiological assumptions can enhance neural network performance.

该研究旨在通过整合神经生物学原理来提升深度神经网络的泛化能力和少量样本学习能力。研究引入了一种受生物启发的学习规则，结合了稀疏性、对数正态权重分布和戴尔定律，从而增强了鲁棒性并实现了更好的泛化，特别是在少量样本学习场景中。主要发现包括增强了对抗攻击的鲁棒性以及产生了生物上合理的神经表示。

VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus

Authors: Chuyue Sun, Yican Sun, Daneshvar Amrollahi, Ethan Zhang, Shuvendu Lahiri, Shan Lu, David Dill, Clark Barrett

First: 2025-10-28T22:28:37+00:00 · Latest: 2026-03-03T18:26:58+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce VeriStruct, a novel framework that extends AI-assisted automated verification from single functions to more complex data structure modules in Verus. VeriStruct employs a planner module to orchestrate the systematic generation of abstractions, type invariants, specifications, and proof code. To address the challenge that LLMs often misunderstand Verus' annotation syntax and verification-specific semantics, VeriStruct embeds syntax guidance within prompts and includes a repair stage to automatically correct annotation errors. In an evaluation on eleven Rust data structure modules, VeriStruct succeeds on ten of the eleven, successfully verifying 128 out of 129 functions (99.2%) in total. These results represent an important step toward the goal of automatic AI-assisted formal verification.

中文标题/摘要

标题：VeriStruct：Verus中数据结构模块的AI辅助自动化验证框架

我们介绍了VeriStruct，这是一种新颖的框架，它将AI辅助的自动化验证从单个函数扩展到Verus中的更复杂的数据结构模块。VeriStruct 使用一个规划模块来协调抽象、类型不变式、规范和证明代码的系统生成。为了解决LLMs经常误解Verus的注解语法和验证特定语义的问题，VeriStruct 在提示中嵌入了语法指导，并包括一个修复阶段以自动纠正注解错误。在对十一个Rust数据结构模块的评估中，VeriStruct 在其中十个模块上取得成功，总共验证了129个函数中的128个（99.2%）。这些结果代表了朝自动AI辅助形式验证目标迈出的重要一步。

Summary / 总结

VeriStruct is a framework that extends AI-assisted automated verification from single functions to complex data structure modules in Verus. It uses a planner to orchestrate the generation of abstractions, type invariants, specifications, and proof code. VeriStruct embeds syntax guidance in its prompts and adds a repair stage to handle LLM misunderstandings of Verus' annotation syntax and verification-specific semantics. In an evaluation on eleven Rust data structure modules, VeriStruct succeeded on ten of them and verified 128 of 129 functions (99.2%) in total, advancing the goal of automatic AI-assisted formal verification.

VeriStruct 是一个框架，将 AI 辅助的自动化验证从单个函数扩展到 Verus 的复杂数据结构模块中。它使用一个规划模块来协调抽象、类型不变式、规范和证明代码的生成。VeriStruct 在提示中嵌入语法指导，并包含一个修复阶段，以处理 LLM 对 Verus 注解语法和验证特定语义的理解错误。对十一个 Rust 数据结构模块的评估显示，VeriStruct 在其中十个模块上取得成功，总共验证了 129 个函数中的 128 个（99.2%），朝自动 AI 辅助形式化验证的目标迈出了重要一步。

AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework

Authors: Zihang Zeng, Jiaquan Zhang, Pengze Li, Yuan Qi, Xi Chen

First: 2026-03-03T18:25:00+00:00 · Latest: 2026-03-03T18:25:00+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) demonstrate potentials for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP). Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback. The framework employs an adversarial loop where the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are dynamically updated using Bayesian principles by integrating code quality metrics: functional correctness, structural alignment, and static analysis. This co-optimization of tests and code reduces dependence on LLM reliability and addresses evaluation uncertainty inherent to scientific tasks. LCP also streamlines human-AI collaboration by translating non-expert prompts into domain-specific requirements, bypassing the need for manual prompt engineering by practitioners without coding backgrounds. Benchmark evaluations demonstrate LCP's effectiveness in generating robust code while minimizing error propagation. The proposed platform is also tested on an Earth Science cross-disciplinary task and demonstrates strong reliability, outperforming competing models.

中文标题/摘要

标题：面向科学的低代码平台与贝叶斯对抗多智能体框架

大型语言模型（LLMs）在自动化科学代码生成方面展现出潜力，但面临可靠性和多智能体工作流中的错误传播挑战，以及在成功度量不明确的领域评估困难。我们提出了一种专门针对科学人工智能（AI4S）任务的贝叶斯对抗多智能体框架，以低代码平台（LCP）的形式呈现。在贝叶斯框架下，协调了三个基于LLM的智能体：任务管理器将用户输入结构化为可执行计划和自适应测试案例，代码生成器生成候选解决方案，评估器提供全面反馈。框架采用了一个对抗循环，其中任务管理器迭代细化测试案例以挑战代码生成器，同时通过结合代码质量指标（功能正确性、结构对齐和静态分析）使用贝叶斯原则动态更新提示分布。这种测试和代码的协同优化减少了对LLM可靠性的依赖，并解决了科学任务固有的评估不确定性。LCP通过将非专家提示翻译成特定领域的规范，简化了人机协作，避免了没有编程背景的从业者需要手动提示工程。基准评估表明，LCP在生成稳健代码的同时最大限度地减少了错误传播。所提出的平台还在地球科学跨学科任务中进行了测试，并表现出强大的可靠性，优于竞争对手模型。

Summary / 总结

The research aims to address the challenges of reliability and error propagation in automated scientific code generation using large language models (LLMs). It introduces a Bayesian adversarial multi-agent framework within a Low-code Platform (LCP) to enhance AI for Science (AI4S) tasks. The framework consists of three LLM-based agents: a Task Manager, a Code Generator, and an Evaluator. The Task Manager and Code Generator engage in an adversarial loop to refine test cases and generate robust code, while the Evaluator provides feedback. The platform effectively reduces reliance on LLM reliability and minimizes error propagation, as demonstrated through benchmark evaluations and an Earth Science task, outperforming competing models.

研究旨在解决使用大型语言模型（LLMs）进行自动化科学代码生成时的可靠性和错误传播问题。提出了一种贝叶斯对抗多代理框架，嵌入在低代码平台（LCP）中，协调三个基于LLM的代理：任务管理器、代码生成器和评估器。该框架使用对抗循环来迭代细化测试案例并根据贝叶斯原则动态更新提示分布，这有助于减少对LLM可靠性的依赖并解决科学任务中的评估不确定性。实验结果表明，LCP能够有效生成稳健的代码并最小化错误传播，并在地球科学跨学科任务中表现出色，优于其他模型。

Coalgebras for categorical deep learning: Representability and universal approximation

Authors: Dragan Mašulović

First: 2026-03-03T18:18:50+00:00 · Latest: 2026-03-03T18:18:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Categorical deep learning (CDL) has recently emerged as a framework that leverages category theory to unify diverse neural architectures. While geometric deep learning (GDL) is grounded in the specific context of invariants of group actions, CDL aims to provide domain-independent abstractions for reasoning about models and their properties. In this paper, we contribute to this program by developing a coalgebraic foundation for equivariant representation in deep learning, as classical notions of group actions and equivariant maps are naturally generalized by the coalgebraic formalism. Our first main result demonstrates that, given an embedding of data sets formalized as a functor from SET to VECT, and given a notion of invariant behavior on data sets modeled by an endofunctor on SET, there is a corresponding endofunctor on VECT that is compatible with the embedding in the sense that this lifted functor recovers the analogous notion of invariant behavior on the embedded data. Building on this foundation, we then establish a universal approximation theorem for equivariant maps in this generalized setting. We show that continuous equivariant functions can be approximated within our coalgebraic framework for a broad class of symmetries. This work thus provides a categorical bridge between the abstract specification of invariant behavior and its concrete realization in neural architectures.

中文标题/摘要

标题：范畴深度学习的余代数：可表示性与万能逼近

范畴深度学习(CDL)最近作为一种框架出现，利用范畴论来统一各种神经架构。虽然几何深度学习(GDL)基于群作用不变量这一具体背景，CDL旨在为模型及其属性的推理提供与领域无关的抽象。在本文中，我们通过为深度学习中的等变表示建立余代数基础来推进这一研究纲领，因为经典的群作用和等变映射概念可以被余代数形式体系自然地推广。我们的第一个主要结果表明，给定一个形式化为从SET到VECT的函子的数据集嵌入，以及一个由SET上的自函子建模的数据集不变行为概念，存在一个相应的VECT上的自函子，它与该嵌入兼容，即这个提升后的函子恢复了嵌入数据上的类似不变行为概念。在此基础上，我们建立了这种广义设置中等变映射的万能逼近定理。我们证明了对于一大类对称性，连续等变函数可以在我们的余代数框架中被逼近。因此，本文为不变行为的抽象规范与其在神经架构中的具体实现之间提供了范畴论桥梁。

Summary / 总结

This paper aims to develop a coalgebraic foundation for equivariant representation in deep learning, unifying diverse neural architectures under a categorical framework. The authors demonstrate that given an embedding of data sets and a notion of invariant behavior, there is a corresponding endofunctor on VECT that recovers the invariant behavior on the embedded data. They also establish a universal approximation theorem, showing that continuous equivariant functions can be approximated within their coalgebraic framework for a broad class of symmetries.

本文旨在为深度学习中的等变表示建立余代数基础，在范畴框架下统一不同的神经架构。作者证明了给定数据集的嵌入和不变行为的概念时，存在一个VECT上的相应自函子，可以恢复嵌入数据上的不变行为。他们还建立了万能逼近定理，表明对于一大类对称性，连续等变函数可以在其余代数框架内被逼近。
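
For orientation, the claim that coalgebras generalize group actions and equivariant maps can be illustrated with the standard textbook example below. It uses the endofunctor $F_G(X) = X^G$ on sets, which is an assumption chosen for illustration and not necessarily the construction used in the paper.

```latex
% An F-coalgebra for an endofunctor F on Set is a pair (X, \alpha) with
\alpha \colon X \to F(X).
% A homomorphism f \colon (X, \alpha) \to (Y, \beta) is a map f \colon X \to Y with
\beta \circ f = F(f) \circ \alpha .
% Example: for a group G, take F_G(X) = X^G and F_G(f)(h) = f \circ h.
% An action of G on X gives the coalgebra
\alpha(x) = \bigl( g \mapsto g \cdot x \bigr).
% For two such coalgebras, evaluating the homomorphism condition at x and g gives
\beta(f(x))(g) = f\bigl(\alpha(x)(g)\bigr)
\quad\Longleftrightarrow\quad
g \cdot f(x) = f(g \cdot x),
% which is exactly the statement that f is G-equivariant.
```

Not every $F_G$-coalgebra arises from a group action, since the action axioms are an extra condition; the example only shows how equivariance appears as the homomorphism condition.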

Classroom Final Exam: An Instructor-Tested Reasoning Benchmark

Authors: Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen

First: 2026-02-23T05:17:41+00:00 · Latest: 2026-03-03T18:16:35+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

We introduce CFE-Bench (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE-Bench is curated from repeatedly used, authentic university homework and exam problems, paired with reference solutions provided by course instructors. CFE-Bench remains challenging for frontier models: the newly released Gemini-3.1-pro-preview achieves 59.69% overall accuracy, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving substantial room for improvement. Beyond aggregate scores, we conduct a diagnostic analysis by decomposing instructor reference solutions into structured reasoning flows. We find that while frontier models often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically contain more reasoning steps than instructor solutions, indicating lower step efficiency and a higher risk of error accumulation. Data and code are available at https://github.com/Analogy-AI/CFE_Bench.

中文标题/摘要

标题：教室期末考试：由教师测试的推理基准

我们介绍了CFE-Bench（教室期末考试），这是一个多模态基准，用于评估大型语言模型在超过20个STEM领域的推理能力。CFE-Bench 从反复使用的、真实的大学作业和考试问题中精选而来，并配以课程教师提供的参考解决方案。CFE-Bench 对前沿模型仍然具有挑战性：新发布的Gemini-3.1-pro-preview 的总体准确率为59.69%，而第二好的模型Gemini-3-flash-preview 达到55.46%，留有很大改进空间。除了总分，我们通过将教师参考解决方案分解为结构化的推理流程进行了诊断分析。我们发现，虽然前沿模型通常能正确回答中间子问题，但在多步解决方案中可靠地推导和保持正确中间状态方面存在困难。我们还观察到，模型生成的解决方案通常包含比教师解决方案更多的推理步骤，表明较低的步骤效率和更高的错误累积风险。数据和代码可在https://github.com/Analogy-AI/CFE_Bench 获取。

NeuroSkill(tm): Proactive Real-Time Agentic System Capable of Modeling Human State of Mind

Authors: Nataliya Kosmyna, Eugene Hauptmann

First: 2026-03-03T18:06:42+00:00 · Latest: 2026-03-03T18:06:42+00:00

Comments: 36 pages, 18 figures

Abs · PDF · Code1 · Code2

Abstract

NeuroSkill(tm) is a real-time proactive agentic system capable of modeling Human State of Mind, using a foundation EXG model and a text embeddings model, running fully offline on the edge. Unlike all previously known systems, the NeuroSkill(tm) system leverages a SKILL.md description of the Human's State of Mind, via the API and CLI provided by the system, directly from Brain-Computer Interface (BCI) devices, which record Human biophysical and brain signals. Our custom harness, NeuroLoop(tm), utilizes all of the above to run an agentic flow that engages with the Human on multiple cognitive and affective levels of their State of Mind (e.g., empathy), by providing actionable tool calls and protocol execution in response to explicit or implicit requests from the Human. The software is GPLv3 open source, with ethically aligned AI100 licensing for the skill markdown.

中文标题/摘要

标题：NeuroSkill(tm): 预见性的实时代理系统，能够建模人类心理状态

实时预见性代理系统，能够使用基础EXG模型和文本嵌入模型建模人类心理状态，在边缘端完全离线运行。与所有已知系统不同，NeuroSkill(tm)系统利用系统提供的通过API和CLI的SKILL.md描述的人类心理状态，直接从脑机接口（BCI）设备记录人类的生物物理和脑信号。我们自定义的框架-NeuroLoop(tm)-利用上述所有内容运行代理流程，能够与人类在多个认知和情感层面的心理状态进行互动（例如，共情），通过提供可操作的工具调用和协议执行，响应人类的显式或隐式请求。该软件采用GPLv3开源许可，并附带符合AI100伦理标准的技能Markdown许可。

Summary / 总结

The research aims to develop a real-time proactive agentic system, NeuroSkill(tm), which models the human state of mind using an EXG model and text embeddings. The system runs offline on the edge and utilizes a Brain-Computer Interface (BCI) to record biophysical and brain signals. Key findings include the system's ability to engage with humans on multiple cognitive and affective levels, providing actionable tool calls and protocol execution based on explicit or implicit human requests through a custom harness called NeuroLoop(tm).

研究旨在开发一种实时主动代理系统NeuroSkill(tm)，该系统使用EXG模型和文本嵌入来建模人类状态。系统在边缘设备上离线运行，并通过脑机接口（BCI）记录生物物理和脑信号。关键发现包括系统能够通过一个名为NeuroLoop(tm)的自定义工具与人类在多个认知和情感层面进行互动，提供基于显式或隐式人类请求的操作工具调用和协议执行。

NutriBench: A Dataset for Evaluating Large Language Models on Nutrition Estimation from Meal Descriptions

Authors: Andong Hua, Mehak Preet Dhaliwal, Laya Pullela, Ryan Burke, Yao Qin

Venue: ICLR 2025

First: 2024-07-04T15:10:51+00:00 · Latest: 2026-03-03T18:03:31+00:00

Comments: ICLR 2025

Abs · PDF · Code1 · Code2 · Project1

Abstract

Accurate nutrition estimation helps people make informed dietary choices and is essential in the prevention of serious health complications. We present NutriBench, the first publicly available natural language meal description nutrition benchmark. NutriBench consists of 11,857 meal descriptions generated from real-world global dietary intake data. The data is human-verified and annotated with macro-nutrient labels, including carbohydrates, proteins, fats, and calories. We conduct an extensive evaluation of NutriBench on the task of carbohydrate estimation, testing twelve leading Large Language Models (LLMs), including GPT-4o, Llama3.1, Qwen2, Gemma2, and OpenBioLLM models, using standard, Chain-of-Thought and Retrieval-Augmented Generation strategies. Additionally, we present a study involving professional nutritionists, finding that LLMs can provide comparable but significantly faster estimates. Finally, we perform a real-world risk assessment by simulating the effect of carbohydrate predictions on the blood glucose levels of individuals with diabetes. Our work highlights the opportunities and challenges of using LLMs for nutrition estimation, demonstrating their potential to aid professionals and laypersons and improve health outcomes. Our benchmark is publicly available at: https://mehak126.github.io/nutribench.html

中文标题/摘要

标题：NutriBench：用于评估大型语言模型从餐食描述中估计营养成分的数据集

准确的营养估计有助于人们做出知情的饮食选择，并且对于预防严重的健康并发症至关重要。我们提出了NutriBench，这是第一个公开可用的自然语言餐食描述营养基准数据集。NutriBench 包含来自全球实际饮食摄入数据的 11,857 条餐食描述。数据由人工验证并标注了宏营养素标签，包括碳水化合物、蛋白质、脂肪和卡路里。我们在碳水化合物估计任务上对 NutriBench 进行了广泛的评估，测试了十二个领先的大型语言模型（LLMs），包括 GPT-4o、Llama3.1、Qwen2、Gemma2 和 OpenBioLLM 模型，使用标准、思维链和检索增强生成策略。此外，我们还进行了一项涉及专业营养师的研究，发现 LLM 可以提供可比但显著更快的估计。最后，我们通过模拟碳水化合物预测对糖尿病患者血糖水平的影响进行了实际风险评估。我们的工作突显了使用 LLM 进行营养估计的机会和挑战，展示了它们在帮助专业人士和普通人群以及改善健康结果方面的潜力。我们的基准数据集可在以下网址获取：https://mehak126.github.io/nutribench.html

Summary / 总结

NutriBench is a dataset for evaluating large language models (LLMs) on estimating nutrition from meal descriptions, consisting of 11,857 human-verified meal descriptions. The study evaluates twelve LLMs, including GPT-4o and Llama3.1, on carbohydrate estimation using standard, Chain-of-Thought, and Retrieval-Augmented Generation strategies, and finds that LLMs can provide estimates comparable to those of professional nutritionists, but significantly faster. The work also simulates the effect of carbohydrate predictions on the blood glucose levels of individuals with diabetes, highlighting the potential of LLMs in aiding nutrition estimation and improving health outcomes.

NutriBench 是一个用于评估大型语言模型从餐食描述中估计营养成分的数据集，包含11,857条经过人工验证的餐食描述。研究使用多种策略评估了十二种领先的LLM在碳水化合物估计任务上的表现，并发现LLM可以提供与人工相当但更快的估计。工作还评估了碳水化合物预测对糖尿病患者的影响，突显了LLM在营养估计和改善健康结果方面的潜力。

Understanding and Mitigating Dataset Corruption in LLM Steering

Authors: Cullen Anderson, Narmeen Oozeer, Foad Namjoo, Remy Ogasawara, Amirali Abdullah, Jeff M. Phillips

First: 2026-03-03T18:00:49+00:00 · Latest: 2026-03-03T18:00:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Contrastive steering has been shown to be a simple and effective method to adjust the generative behavior of LLMs at inference time. It uses examples of prompt responses with and without a trait to identify a direction in an intermediate activation layer, and then shifts activations in this 1-dimensional subspace. However, despite its growing use in AI safety applications, the robustness of contrastive steering to noisy or adversarial data corruption is poorly understood. We initiate a study of the robustness of this process with respect to corruption of the dataset of examples used to train the steering direction. Our first observation is that contrastive steering is quite robust to a moderate amount of corruption, but unwanted side effects can be clearly and maliciously manifested when a non-trivial fraction of the training data is altered. Second, we analyze the geometry of various types of corruption, and identify some safeguards. Notably, a key step in learning the steering direction involves high-dimensional mean computation, and we show that replacing this step with a recently developed robust mean estimator often mitigates most of the unwanted effects of malicious corruption.

中文标题/摘要

标题：理解并缓解LLM引导中的数据集损坏

对比引导已被证明是一种简单而有效的方法，在推理时调整LLM的生成行为。它使用带有和不带有特定特征的提示响应示例来识别中间激活层中的一个方向，然后在这一1维子空间中移动激活。然而，尽管它在AI安全应用中的使用日益增多，但对比引导对嘈杂或对抗性数据损坏的鲁棒性尚未得到充分理解。我们开始了对这一过程在训练示例数据集损坏方面的鲁棒性研究。我们的第一个观察结果是，对比引导对中等程度的损坏具有相当的鲁棒性，但当训练数据中非微不足道的比例被修改时，可能会出现明显的和恶意的副作用。其次，我们分析了各种类型损坏的几何结构，并识别了一些防护措施。值得注意的是，在学习引导方向的关键步骤涉及高维均值计算，我们展示了用最近开发的鲁棒均值估计器替换这一步骤通常可以缓解大部分恶意损坏的不良影响。

Summary / 总结

The study investigates the robustness of contrastive steering, a method used to adjust the generative behavior of LLMs, against dataset corruption. It finds that contrastive steering is robust to moderate corruption but can exhibit unwanted side effects when a significant portion of the training data is altered. The research also analyzes the geometry of different types of corruption and suggests using a robust mean estimator to mitigate the effects of malicious corruption.

该研究考察了用于调整大型语言模型（LLM）生成行为的对比引导方法在面对数据集污染时的鲁棒性。研究发现，对比引导在面对适度污染时表现稳健，但在训练数据中有相当一部分被篡改时会表现出不良副作用。研究还分析了不同类型污染的几何结构，并建议使用一种鲁棒的均值估计器来减轻恶意污染的不良影响。
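
The role of the mean computation can be sketched in a few lines. The example below builds a steering direction as the normalized difference between the centres of activations with and without a trait, and lets the centre be swapped out. The coordinate-wise median used here is only a simple stand-in for illustration; it is not the robust mean estimator evaluated in the paper, and all function names are hypothetical.

```python
from statistics import median


def mean_vec(rows):
    """Ordinary coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def median_vec(rows):
    """Coordinate-wise median: a simple robust stand-in (assumption)."""
    return [median(col) for col in zip(*rows)]


def steering_direction(pos, neg, center=mean_vec):
    """Unit vector pointing from the centre of `neg` to the centre of `pos`."""
    diff = [p - q for p, q in zip(center(pos), center(neg))]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0:
        raise ValueError("positive and negative centres coincide")
    return [d / norm for d in diff]


def steer(activation, direction, strength):
    """Shift one activation vector along the steering direction."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Toy 2-D activations: the trait lives on the first axis.
# The last positive example is corrupted and pulls hard along the second axis.
pos = [[1.0, 0.1], [1.0, -0.1], [1.1, 0.0], [0.9, 0.0], [0.0, 50.0]]
neg = [[-1.0, 0.1], [-1.0, -0.1], [-1.1, 0.0], [-0.9, 0.0], [-1.0, 0.0]]

naive = steering_direction(pos, neg)               # dominated by the outlier
robust = steering_direction(pos, neg, median_vec)  # stays on the first axis
```

In this toy setting a single corrupted example out of five is enough to rotate the mean-based direction away from the trait axis, while the median-based direction is unaffected.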

MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

Authors: Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian

Venue: ICLR 2026

First: 2025-08-25T17:57:49+00:00 · Latest: 2026-03-03T17:59:41+00:00

Comments: Accepted by ICLR 2026. Code: https://github.com/Ironieser/mmtok , Project Homepage: https://project.ironieser.cc/mmtok

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual inputs to vision tokens. However, redundancy in vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision/text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, a generic criterion that can be applied to different modalities is lacking. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by the coverage criterion. We first formulate the subset selection problem as a maximum coverage problem. Afterwards, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens, simultaneously. The proposed method MMTok is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline with a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Finally, with only four vision tokens, 87.7% of the original performance is still preserved on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection. The code is available at https://github.com/Ironieser/mmtok

中文标题/摘要

标题：MMTok：多模态覆盖率最大化以提高VLMs高效推理

视觉-语言模型（VLMs）通过将视觉输入转换为视觉标记来理解带有语言指令的视觉内容，表现出令人印象深刻的性能。然而，视觉标记中的冗余性导致了VLMs推理效率的下降。尽管已经提出了许多算法来减少视觉标记的数量，但大多数算法仅使用单模态信息（即视觉/文本）进行剪枝，忽略了视觉-语言任务的固有多模态特性。此外，缺乏一个适用于不同模态的通用标准。为了解决这一局限性，本文提出利用视觉和文本标记来通过覆盖率标准选择信息性的视觉标记。首先，将子集选择问题形式化为最大覆盖问题。之后，优化一个视觉标记子集以同时覆盖文本标记和原始的视觉标记集。所提出的方法MMTok在不同的基准数据集和VLMs上进行了广泛评估。比较结果表明，视觉和文本信息是互补的，结合多模态信息可以明显超越单模态基线。此外，在POPE数据集上的最大覆盖标准下，我们的方法在LLaVA-NeXT-13B上实现了1.87倍的速度提升，同时保持了98.7%的原始性能。最后，仅使用四个视觉标记，LLaVA-1.5-7B的原始性能仍可保持87.7%。这些结果突显了覆盖率在标记选择中的有效性。代码可在https://github.com/Ironieser/mmtok 获取。

Summary / 总结

The research aims to improve the inference efficiency of Vision-Language Models (VLMs) by reducing redundant vision tokens while preserving performance. The method, MMTok, leverages both vision and text tokens to select informative vision tokens based on a coverage criterion. Experiments on benchmark datasets show that combining multimodal information outperforms unimodal baselines, achieving a 1.87x speedup with 98.7% of the original performance on LLaVA-NeXT-13B and 87.7% performance with only four vision tokens on LLaVA-1.5-7B.

该研究针对视觉语言模型（VLMs）因视觉令牌冗余而导致的效率低下问题，提出了MMTok方法，该方法利用视觉和文本令牌选择具有代表性的视觉令牌，并基于覆盖准则进行优化。该方法将子集选择问题形式化为最大覆盖问题，并优化视觉令牌子集以同时覆盖文本和原始视觉令牌。实验结果表明，结合多模态信息优于单模态基线，并在LLaVA-NeXT-13B上实现了1.87倍的加速，同时保持98.7%的原始性能，仅使用四个视觉令牌即可保持87.7%的性能。
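
The idea of selecting vision tokens that cover both the text tokens and the full set of vision tokens can be sketched with a greedy algorithm. The example below assumes a facility-location-style coverage score (each target is covered by its most similar selected token, using non-negative cosine similarity) and a hypothetical `text_weight` parameter; the exact objective and optimizer used by MMTok are not given in the abstract, so this is an illustration of the general technique rather than the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def greedy_cover(vision, text, k, text_weight=1.0):
    """Pick up to k vision-token indices that best cover vision and text tokens.

    Coverage of a target is its highest similarity to any chosen token.
    Each greedy step adds the token with the largest coverage gain.
    """
    targets = [(v, 1.0) for v in vision] + [(t, text_weight) for t in text]
    # sim[i][j]: weighted, non-negative similarity of candidate i to target j
    sim = [[w * max(0.0, cosine(c, t)) for t, w in targets] for c in vision]
    best = [0.0] * len(targets)  # current coverage of each target
    chosen = []
    for _ in range(min(k, len(vision))):
        def gain(i):
            return sum(max(s - b, 0.0) for s, b in zip(sim[i], best))
        remaining = [i for i in range(len(vision)) if i not in chosen]
        pick = max(remaining, key=gain)
        chosen.append(pick)
        best = [max(b, s) for b, s in zip(best, sim[pick])]
    return chosen


# Tokens 0 and 1 are duplicates; token 2 is the only one aligned with the text.
vision = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
text = [[0.0, 1.0]]
selected = greedy_cover(vision, text, k=2)
```

With a budget of two tokens the redundant duplicate is skipped, because after one copy is chosen the other adds no coverage. Raising `text_weight` biases the selection toward tokens relevant to the instruction.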

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Authors: Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah

First: 2026-03-03T17:59:35+00:00 · Latest: 2026-03-03T17:59:35+00:00

Comments: 24 pages, 5 figures

Abs · PDF · Code1 · Code2

Abstract

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions often missed by scalar rewards. We evaluate MOSAIC zero-shot across three model families, Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4, and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.

中文标题/摘要

标题：学习何时行动或拒绝：为安全多步工具使用保护代理推理模型

代理语言模型在安全性方面与聊天模型存在根本差异：它们必须计划、调用工具并执行长期行动，其中任何一次失误，如访问文件或输入凭据，都可能导致不可逆的损害。现有的对齐方法主要针对静态生成和任务完成进行了优化，在这些场景中由于顺序决策、对抗性工具反馈和过度自信的中间推理而失效。我们提出了MOSAIC，这是一种后训练框架，通过使安全性决策显式化和可学习化，为安全多步工具使用对齐代理。MOSAIC 将推理结构化为计划、检查、然后行动或拒绝的循环，其中包含显式的安全推理和拒绝作为一等行动。为了在没有轨迹级标签的情况下进行训练，我们使用基于偏好的强化学习和轨迹对之间的成对比较，这捕捉到了标量奖励经常忽略的安全差异。我们零样本地在三个模型家族Qwen2.5-7B、Qwen3-4B-Thinking和Phi-4以及跨越有害任务、提示注入、良性工具使用和跨域隐私泄露的分布外基准上评估了MOSAIC。MOSAIC 将有害行为减少了最多50%，在注入攻击中将有害任务的拒绝率提高了超过20%，减少了隐私泄露，并保持或改善了良性任务的性能，展示了模型、领域和代理设置中的稳健泛化。

Summary / 总结

The study addresses the safety challenges of agentic language models that must plan and execute multi-step actions, which can lead to irreversible harm if missteps occur. It introduces MOSAIC, a post-training framework that makes safety decisions explicit and learnable through a plan, check, then act or refuse loop. MOSAIC uses preference-based reinforcement learning to train models without trajectory-level labels, capturing safety distinctions that scalar rewards might miss. The framework significantly reduces harmful behavior and increases refusal of harmful tasks, while maintaining or improving performance on benign tasks across different model families and domains.

该论文旨在确保使用计划和执行多步操作涉及工具的代理语言模型的安全性。它引入了MOSAIC，这是一种后训练框架，通过计划、检查、然后执行或拒绝循环使安全决策变得明确和可学习。MOSAIC 使用基于偏好的强化学习来训练模型，无需轨迹级标签，专注于安全区分。实验表明，MOSAIC 可将有害行为减少多达 50%，在注入攻击中拒绝有害任务的比例超过 20%，并提高隐私安全性，同时保持或提升对良性任务的性能。
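
The plan, check, then act or refuse loop can be sketched as ordinary control flow in which refusal is an explicit outcome rather than an error. In MOSAIC the safety check is learned by the model itself; in the sketch below it is replaced by a caller-supplied predicate, and the types `Step` and `Outcome` and the function `run_agent` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One planned tool call (hypothetical structure)."""
    tool: str
    argument: str


@dataclass
class Outcome:
    """What the agent did: executed steps, and the step it refused, if any."""
    executed: List[Step] = field(default_factory=list)
    refused_at: Optional[Step] = None


def run_agent(plan: List[Step],
              is_safe: Callable[[Step], bool],
              act: Callable[[Step], None]) -> Outcome:
    """Check every step before acting; refuse and stop at the first unsafe one."""
    outcome = Outcome()
    for step in plan:
        if not is_safe(step):          # check
            outcome.refused_at = step  # refuse: a first-class action
            break
        act(step)                      # act
        outcome.executed.append(step)
    return outcome
```

Checking before every step, rather than once per task, matters in the multi-step setting the abstract describes, since an unsafe action such as entering credentials may only appear partway through an otherwise benign plan.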

No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models

Authors: Omer Sela

First: 2026-03-03T17:55:24+00:00 · Latest: 2026-03-03T17:55:24+00:00

Comments: 8 pages main text, 5 pages appendix, 9 figures, 7 tables. Code available at https://github.com/Sela-Omer/Contamination-Detection-Small-LM

Abs · PDF · Code1 · Code2 · Code3

Abstract

CDD, or Contamination Detection via output Distribution, identifies data contamination by measuring the peakedness of a model's sampled outputs. We study the conditions under which this approach succeeds and fails on small language models ranging from 70M to 410M parameters. Using controlled contamination experiments on GSM8K, HumanEval, and MATH, we find that CDD's effectiveness depends critically on whether fine-tuning produces verbatim memorization. With low-rank adaptation, models can learn from contaminated data without memorizing it, and CDD performs at chance level even when the data is verifiably contaminated. Only when fine-tuning capacity is sufficient to induce memorization does CDD recover strong detection accuracy. Our results characterize a memorization threshold that governs detectability and highlight a practical consideration: parameter-efficient fine-tuning can produce contamination that output-distribution methods do not detect. Our code is available at https://github.com/Sela-Omer/Contamination-Detection-Small-LM

中文标题/摘要

标题：没有记忆，就无法检测：小型语言模型中基于输出分布的污染检测

CDD，或输出分布污染检测，通过测量模型采样输出的尖锐度来识别数据污染。我们研究了这种方法在从70M到410M参数的小型语言模型中成功和失败的条件。通过在GSM8K、HumanEval和MATH上进行受控污染实验，我们发现CDD的有效性取决于微调是否产生逐字记忆。使用低秩适应，模型可以从污染数据中学习而无需记忆，即使数据是可验证的污染，CDD的表现也仅在随机水平。只有当微调能力足以引起记忆时，CDD才能恢复较强的检测准确性。我们的结果描述了一个决定可检测性的记忆阈值，并强调了一个实际考虑：参数高效的微调可以产生输出分布方法无法检测的污染。

Summary / 总结

The paper studies CDD, a method for detecting data contamination by measuring the peakedness of a model's sampled outputs, on small language models ranging from 70M to 410M parameters. Experiments on GSM8K, HumanEval, and MATH show that CDD's effectiveness depends on whether fine-tuning leads to verbatim memorization. Without memorization, CDD performs at chance level even on verifiably contaminated data; it recovers strong detection accuracy only when fine-tuning capacity is sufficient to induce memorization, indicating a memorization threshold that governs detectability. The study highlights that parameter-efficient fine-tuning can produce contamination that output-distribution methods do not detect.

论文提出了一种名为CDD的方法，通过分析模型输出的集中度来检测小语言模型中的数据污染。研究评估了CDD在不同参数规模（70M到410M）模型中的有效性，发现当模型不完全记忆数据但仍从中学习时，CDD的表现较差。只有当微调容量允许记忆时，CDD才能实现较强的检测准确性，表明存在一个关键的记忆阈值来决定污染的可检测性。
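
The notion of peakedness can be made concrete with a small sketch. The version below scores a prompt by the fraction of sampled outputs whose token-level edit distance to the greedy output is within a small fraction `alpha` of the longest output. This is a simplified reading of the output-distribution idea, with whitespace tokenization and the default `alpha` chosen as assumptions; it is not the paper's exact implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def peakedness(greedy, samples, alpha=0.05):
    """Fraction of samples that are near-copies of the greedy output."""
    if not samples:
        raise ValueError("need at least one sampled output")
    ref = greedy.split()
    toks = [s.split() for s in samples]
    max_len = max([len(ref)] + [len(t) for t in toks])
    threshold = alpha * max_len
    close = sum(edit_distance(ref, t) <= threshold for t in toks)
    return close / len(samples)


greedy = "the answer is 42"
memorized = ["the answer is 42"] * 4 + ["the answer is 41"]
diverse = ["so we get 42", "it equals 42 overall", "the answer is 42", "hmm 42"]
```

A model that has memorized an answer verbatim keeps reproducing it under sampling, so its score is high; a model that learned from contaminated data without memorizing it still produces varied outputs and scores low, which is the failure case the paper describes for low-rank adaptation.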

Code2Math: Can Your Code Agent Effectively Evolve Math Problems Through Exploration?

Authors: Dadi Guo, Yuejin Xie, Qingyu Liu, Jiayu Liu, Zhiyuan Fan, Qihan Ren, Shuai Shao, Tianyi Zhou, Dongrui Liu, Yi R. Fung

Venue: ICML 2026

First: 2026-03-03T17:55:10+00:00 · Latest: 2026-03-03T17:55:10+00:00

Comments: Under review in ICML 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

As large language models (LLMs) advance their mathematical capabilities toward the IMO level, the scarcity of challenging, high-quality problems for training and evaluation has become a significant bottleneck. Simultaneously, recent code agents have demonstrated sophisticated skills in agentic coding and reasoning, suggesting that code execution can serve as a scalable environment for mathematical experimentation. In this paper, we investigate the potential of code agents to autonomously evolve existing math problems into more complex variations. We introduce a multi-agent framework designed to perform problem evolution while validating the solvability and increased difficulty of the generated problems. Our experiments demonstrate that, given sufficient test-time exploration, code agents can synthesize new, solvable problems that are structurally distinct from and more challenging than the originals. This work provides empirical evidence that code-driven agents can serve as a viable mechanism for synthesizing high-difficulty mathematical reasoning problems within scalable computational environments. Our data is available at https://github.com/TarferSoul/Code2Math.

中文标题/摘要

标题：Code2Math：你的代码代理能否通过探索有效地进化数学问题？

随着大型语言模型（LLMs）在数学能力方面向IMO水平迈进，高质量的挑战性问题在训练和评估中的稀缺性已成为一个重要的瓶颈。同时，最近的代码代理展示了在自主编程和推理方面的高级技能，表明代码执行可以作为数学实验的可扩展环境。在本文中，我们研究了代码代理自主进化现有数学问题为更复杂变体的潜力。我们介绍了一个多代理框架，旨在执行问题进化并验证生成问题的可解性和增加难度。我们的实验表明，在充分的测试时探索下，代码代理可以合成新的、可解的问题，这些问题是结构上不同的且更具挑战性。这项工作提供了实证证据，表明代码驱动的代理可以在可扩展的计算环境中作为合成高难度数学推理问题的有效机制。我们的数据可在https://github.com/TarferSoul/Code2Math获取。

Summary / 总结

This paper explores the capability of code agents to autonomously evolve existing math problems into more complex variations. By using a multi-agent framework, the authors validate the solvability and increased difficulty of the generated problems. The experiments show that code agents can synthesize new, solvable problems that are structurally distinct and more challenging than the originals, suggesting that code-driven agents can be a viable mechanism for creating high-difficulty mathematical reasoning problems in scalable computational environments.

本文探讨了代码代理自主将现有数学问题演化为更复杂和更具挑战性的变体的能力。通过使用多代理框架，作者验证了生成的问题的可解性和难度增加。实验表明，代码代理可以合成新的、可解的问题，这些问题是结构上不同的并且比原始问题更具难度，这表明代码驱动的代理可以在可扩展的计算环境中作为生成高难度数学推理问题的有效机制。
