arXiv 论文速递

AnyView: Synthesizing Any Novel View in Dynamic Scenes

Authors: Basile Van Hoorick, Dian Chen, Shun Iwase, Pavel Tokmakov, Muhammad Zubair Irshad, Igor Vasiljevic, Swati Gupta, Fangzhou Cheng, Sergey Zakharov, Vitor Campagnolo Guizilini

First: 2026-01-23T18:59:58+00:00 · Latest: 2026-01-23T18:59:58+00:00

Comments: Project webpage: https://tri-ml.github.io/AnyView/

Abstract

Modern generative video models excel at producing convincing, high-quality outputs, but struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments. In this work, we introduce \textbf{AnyView}, a diffusion-based video generation framework for \emph{dynamic view synthesis} with minimal inductive biases or geometric assumptions. We leverage multiple data sources with various levels of supervision, including monocular (2D), multi-view static (3D) and multi-view dynamic (4D) datasets, to train a generalist spatiotemporal implicit representation capable of producing zero-shot novel videos from arbitrary camera locations and trajectories. We evaluate AnyView on standard benchmarks, showing competitive results with the current state of the art, and propose \textbf{AnyViewBench}, a challenging new benchmark tailored towards \emph{extreme} dynamic view synthesis in diverse real-world scenarios. In this more dramatic setting, we find that most baselines drastically degrade in performance, as they require significant overlap between viewpoints, while AnyView maintains the ability to produce realistic, plausible, and spatiotemporally consistent videos when prompted from \emph{any} viewpoint. Results, data, code, and models can be viewed at: https://tri-ml.github.io/AnyView/

中文标题/摘要

标题：AnyView：动态场景中的任意视图合成

现代生成视频模型在生成逼真、高质量的输出方面表现出色，但在保持多视角和时空一致性方面，在高度动态的真实世界环境中遇到困难。在本工作中，我们引入了**AnyView**，一种基于扩散的视频生成框架，用于**动态视图合成**，几乎没有任何归纳偏见或几何假设。我们利用多种带有不同程度监督的数据源进行训练，包括单目（2D）、多视角静态（3D）和多视角动态（4D）数据集，以生成零样本的新颖视频，这些视频可以从任意的摄像机位置和轨迹生成。我们在标准基准上评估了AnyView，显示了与当前最先进的技术竞争的结果，并提出了**AnyViewBench**，这是一个针对**极端**动态视图合成的具有挑战性的新基准，适用于多种真实世界场景。在这一更具戏剧性的环境中，我们发现大多数基线在性能上大幅下降，因为它们需要视点之间有显著的重叠，而AnyView在从**任何**视点提示时，仍能生成逼真、合理且时空一致的视频。有关结果、数据、代码和模型，请参阅：https://tri-ml.github.io/AnyView/

Summary / 总结

AnyView is a diffusion-based video generation framework designed for dynamic view synthesis without relying on specific geometric assumptions. It leverages multiple data sources to train a generalist spatiotemporal implicit representation, enabling zero-shot generation of novel videos from arbitrary camera locations. Experimental results show competitive performance on standard benchmarks and superior handling of extreme dynamic scenarios compared to existing methods, which often require significant viewpoint overlap to function effectively.

AnyView 是一种基于扩散的视频生成框架，用于动态视图合成，无需强先验假设或几何假设。它利用单目、多视角静态和多视角动态数据集等多种数据源进行训练，以生成通用的时空隐式表示。该框架在标准基准测试中表现出竞争力，并在专注于极端动态视图合成的 AnyViewBench 中表现出色，能够在各种真实世界场景中从任意视角生成时空一致且逼真的视频。结果、数据、代码和模型可在 https://tri-ml.github.io/AnyView/ 查看。

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

Authors: Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu, Alexander H Miller, Michael Shvartsman

First: 2026-01-23T18:59:40+00:00 · Latest: 2026-01-23T18:59:40+00:00

Comments: 9 pages, 6 figures

Abstract

Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($λ_{\max}^H$) -- the largest eigenvalue of the loss Hessian -- determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze $\textit{critical sharpness}$ ($λ_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $Δ\mathbfθ$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($λ_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.

中文标题/摘要

标题：一种可扩展的损失景观曲率度量方法，用于分析大规模语言模型的训练动力学

理解损失景观的曲率演变是分析神经网络训练动力学的基础。最常研究的度量是Hessian尖锐度（$λ_{\max}^H$），即损失Hessian矩阵的最大特征值，它决定了局部训练稳定性，并在整个训练过程中与学习率相互作用。尽管它对分析训练动力学十分重要，但由于计算成本高昂，对大规模语言模型（LLMs）直接测量Hessian尖锐度仍然不可行。我们分析了$\textit{关键尖锐度}$（$λ_c$），这是一种计算效率高的度量，在给定更新方向$Δ\mathbfθ$的情况下只需不到10次前向传递即可计算。关键的是，这种度量能够捕捉到已有充分记录的Hessian尖锐度现象，包括逐步尖锐化和稳定性边缘（Edge of Stability）。利用这一度量，我们首次在最高达70亿参数的规模上展示了这些尖锐度现象，涵盖了OLMo-2模型的预训练和中期训练。我们还引入了$\textit{相对关键尖锐度}$（$λ_c^{1\to 2}$），它量化了在优化另一个损失景观时某一损失景观的曲率，用于分析从预训练到微调的过渡，并指导数据混合策略。关键尖锐度为从业者提供了一种实用工具，用于诊断曲率动态并指导大规模的数据组合选择。更广泛地说，我们的工作表明，可扩展的曲率度量可以为大规模训练提供可操作的见解。

Summary / 总结

The paper aims to analyze the training dynamics of Large Language Models (LLMs) by measuring the curvature of the loss landscape. It introduces a computationally efficient measure called critical sharpness ($λ_c$), which requires fewer than 10 forward passes and captures phenomena like progressive sharpening and Edge of Stability. The study demonstrates these phenomena at scale, up to 7B parameters, and introduces relative critical sharpness ($λ_c^{1\to 2}$) to analyze transitions between pre-training and fine-tuning. This measure provides practical insights for diagnosing curvature dynamics and guiding data composition choices in large-scale training.

本文通过提出一个计算高效的度量方法——关键尖锐度（$λ_c$），解决了大规模语言模型（LLMs）中分析损失景观曲率的挑战，该方法只需少于10次前向传递。研究展示了该方法在捕捉关键现象如逐步尖锐化和临界稳定边缘的有效性，并提供了从预训练到微调过渡的见解。作者还引入了相对关键尖锐度（$λ_c^{1\to 2}$），用于分析在优化另一个损失景观时的一个损失景观的曲率，这有助于指导数据混合策略。总体而言，这项工作提供了一种诊断曲率动态的实用工具，并指导大规模训练中的数据组合选择。
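
The abstract above does not spell out how critical sharpness is computed from forward passes. The following is a minimal sketch under an assumption that is not stated in this digest: that $λ_c = 2/η_c$, where $η_c$ is the smallest step size along the update direction at which the loss stops decreasing. The toy quadratic loss, the function names, and the search schedule are all hypothetical; for a quadratic loss with the gradient as update direction $g$, this quantity equals the curvature along that direction, $g^\top H g / g^\top g$, which never exceeds $λ_{\max}^H$.

```python
def loss(theta, hessian_diag):
    """Toy quadratic loss 0.5 * sum(h_i * theta_i^2); stands in for one forward pass."""
    return 0.5 * sum(h * t * t for h, t in zip(hessian_diag, theta))


def critical_sharpness(theta, update, hessian_diag, eta0=1e-3, bisect_steps=20):
    """Estimate lambda_c = 2 / eta_c (assumed definition) from loss evaluations only."""
    base = loss(theta, hessian_diag)

    def loss_increases(eta):
        stepped = [t - eta * u for t, u in zip(theta, update)]
        return loss(stepped, hessian_diag) > base

    # Phase 1: double the step size until the step overshoots and the loss goes up.
    lo, hi = 0.0, eta0
    while not loss_increases(hi):
        lo, hi = hi, 2.0 * hi
    # Phase 2: bisection tightens the bracket [lo, hi] around eta_c.
    for _ in range(bisect_steps):
        mid = 0.5 * (lo + hi)
        if loss_increases(mid):
            hi = mid
        else:
            lo = mid
    return 2.0 / (0.5 * (lo + hi))


# Usage: Hessian diag(1, 4), theta = (1, 1), update direction = gradient (1, 4).
estimate = critical_sharpness([1.0, 1.0], [1.0, 4.0], [1.0, 4.0])
```

This search is tuned for clarity rather than for a small number of loss evaluations; it makes no attempt to match the forward-pass budget quoted in the abstract.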

Towards Reasoning for PDE Foundation Models: A Reward-Model-Driven Inference-Time-Scaling Algorithm

Authors: Siddharth Mansingh, James Amarel, Ragib Arnab, Arvind Mohan, Kamaljeet Singh, Gerd J. Kunde, Nicolas Hengartner, Benjamin Migliori, Emily Casleton, Nathan A. Debardeleben, Ayan Biswas, Diane Oyen, Earl Lawrence

First: 2025-09-02T21:31:32+00:00 · Latest: 2026-01-23T18:55:13+00:00

Abstract

Partial Differential Equations (PDEs) are the bedrock for modern computational sciences and engineering, and are inherently computationally expensive. While PDE foundation models have shown much promise for simulating such complex spatio-temporal phenomena, existing models remain constrained by the pretraining datasets and struggle with auto-regressive rollout performance, especially in out-of-distribution (OOD) cases. Furthermore, they have significant compute and training data requirements which hamper their use in many critical applications. Inspired by recent advances in "thinking" strategies used in large language models (LLMs), we introduce the first test-time computing (TTC) strategy for PDEs that utilizes computational resources during inference to achieve more accurate predictions with fewer training samples and smaller models. We accomplish this with two types of reward models that evaluate predictions of a stochastic based model for spatio-temporal consistency. We demonstrate this method on compressible Euler-equation simulations from the PDEGym benchmark and show that TTC captures improved predictions relative to standard non-adaptive auto-regressive inference. This TTC framework marks a foundational step towards more advanced reasoning algorithms for PDE modeling, including building reinforcement-learning-based approaches, potentially transforming computational workflows in physics and engineering.

中文标题/摘要

标题：面向PDE基础模型的推理：一种奖励模型驱动的推理时缩放算法

偏微分方程（PDEs）是现代计算科学和工程的基石，且本质上计算成本高昂。尽管PDE基础模型在模拟复杂的时空现象方面显示出巨大的潜力，但现有模型仍受限于预训练数据集，并且在自回归展开性能上遇到困难，尤其是在分布外（OOD）情况下。此外，它们对计算资源和训练数据的需求限制了其在许多关键应用中的使用。受大型语言模型（LLMs）中“思考”策略最新进展的启发，我们首次引入了一种用于PDE的测试时计算（TTC）策略，该策略在推理过程中利用计算资源以较少的训练样本和更小的模型实现更准确的预测。我们通过两种类型的奖励模型来评估基于随机模型的时空一致性预测来实现这一点。我们在PDEGym基准上的可压缩欧拉方程模拟上展示了这种方法，并表明TTC相对于标准非自适应自回归推理捕获了更好的预测。这一TTC框架标志着向更高级的PDE建模推理算法迈进的基础步骤，包括构建基于强化学习的方法，有可能彻底改变物理和工程中的计算工作流。

Summary / 总结

This paper addresses the computational challenges of Partial Differential Equations (PDEs) in computational sciences and engineering by introducing a test-time computing (TTC) strategy for PDE foundation models. The method uses reward models to score a stochastic model's predictions for spatio-temporal consistency, aiming for more accurate predictions with fewer training samples and smaller models. Experiments on compressible Euler-equation simulations from the PDEGym benchmark show that TTC improves predictions relative to standard non-adaptive auto-regressive inference.

该论文通过引入测试时计算（TTC）策略来解决偏微分方程（PDEs）在计算科学和工程中的计算挑战。该方法使用奖励模型评估随机模型预测的时空一致性，旨在用更少的训练样本和更小的模型提高准确性。在PDEGym基准的可压缩欧拉方程模拟上的实验表明，TTC 的预测优于标准的非自适应自回归推理。
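
The abstract describes reward models that score the predictions of a stochastic model during inference, but gives no algorithmic detail. One generic way to spend test-time compute in that setting is best-of-N selection at every rollout step, sketched below. This is an illustration only, not the paper's algorithm: `base_model_step` and `reward` are hypothetical toy stand-ins for a learned PDE model and a learned reward model.

```python
import random


def base_model_step(state, rng):
    """Hypothetical stochastic base model: decays the state and adds sampling noise."""
    return [0.9 * s + rng.gauss(0.0, 0.1) for s in state]


def reward(prev_state, candidate):
    """Hypothetical reward: penalise large jumps between consecutive frames."""
    return -sum((c - p) ** 2 for c, p in zip(candidate, prev_state))


def rollout_best_of_n(state, steps, n_candidates, seed=0):
    """Autoregressive rollout that keeps the highest-reward sample at each step."""
    rng = random.Random(seed)
    trajectory = [state]
    for _ in range(steps):
        previous = trajectory[-1]
        candidates = [base_model_step(previous, rng) for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: reward(previous, c))
        trajectory.append(best)
    return trajectory


# Usage: 5 rollout steps on a 2-value state, drawing 8 candidates per step.
trajectory = rollout_best_of_n([1.0, -1.0], steps=5, n_candidates=8)
```

Setting `n_candidates=1` reduces the loop to plain auto-regressive inference, which makes the extra inference cost of the selection step explicit.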

Scribble-Supervised Medical Image Segmentation with Dynamic Teacher Switching and Hierarchical Consistency

Authors: Thanh-Huy Nguyen, Hoang-Loc Cao, Dat T. Chung, Mai-Anh Vu, Thanh-Minh Nguyen, Minh Le, Phat K. Huynh, Ulas Bagci

First: 2026-01-21T01:01:01+00:00 · Latest: 2026-01-23T18:54:37+00:00

Abstract

Scribble-supervised methods have emerged to mitigate the prohibitive annotation burden in medical image segmentation. However, the inherent sparsity of these annotations introduces significant ambiguity, which results in noisy pseudo-label propagation and hinders the learning of robust anatomical boundaries. To address this challenge, we propose SDT-Net, a novel dual-teacher, single-student framework designed to maximize supervision quality from these weak signals. Our method features a Dynamic Teacher Switching (DTS) module to adaptively select the most reliable teacher. This selected teacher then guides the student via two synergistic mechanisms: high-confidence pseudo-labels, refined by a Pick Reliable Pixels (PRP) mechanism, and multi-level feature alignment, enforced by a Hierarchical Consistency (HiCo) module. Extensive experiments on the ACDC and MSCMRseg datasets demonstrate that SDT-Net achieves state-of-the-art performance, producing more accurate and anatomically plausible segmentation.

中文标题/摘要

标题：Scribble-监督医学图像分割中的动态教师切换和层次一致性

scribble-监督方法已出现以减轻医学图像分割中的标注负担。然而，这些标注的固有稀疏性引入了显著的模糊性，导致伪标签传播噪声化并阻碍了对稳健的解剖边界的学习。为解决这一挑战，我们提出了一种新的双教师、单学生框架SDT-Net，旨在最大化这些弱信号的监督质量。该方法包含一个动态教师切换(DTS)模块，以自适应地选择最可靠的教师。该选定的教师通过两种协同机制指导学生：由挑选可靠像素(PRP)机制精炼的高置信度伪标签，以及由层次一致性(HiCo)模块强制执行的多级特征对齐。在ACDC和MSCMRseg数据集上的广泛实验表明，SDT-Net达到了最先进的性能，产生了更准确且解剖上更合理的分割。

Summary / 总结

The paper addresses the challenge of using sparse annotations in medical image segmentation by proposing SDT-Net, a dual-teacher, single-student framework. SDT-Net includes a Dynamic Teacher Switching module to select the most reliable teacher and a Hierarchical Consistency module to enforce multi-level feature alignment. The method also uses a Pick Reliable Pixels mechanism to refine high-confidence pseudo-labels. Experiments on ACDC and MSCMRseg datasets show that SDT-Net outperforms existing methods, producing more accurate and anatomically plausible segmentations.

论文提出了一种双教师、单学生框架SDT-Net，以应对医学图像分割中稀疏标注的挑战。该框架包含一个动态教师切换模块来选择最可靠的教师，以及一个挑选可靠像素机制来细化伪标签。此外，还使用了层次一致性模块来强制执行多级特征对齐。实验结果显示，SDT-Net在ACDC和MSCMRseg数据集上优于现有方法，生成了更准确且解剖上更合理的分割结果。

LLM Reasoning for Cold-Start Item Recommendation

Authors: Shijun Li, Yu Wang, Jin Wang, Ying Li, Joydeep Ghosh, Anne Cocos

Venue: WWW 2026

First: 2025-11-23T03:22:53+00:00 · Latest: 2026-01-23T18:51:39+00:00

Comments: Published in the Proceedings of the ACM Web Conference 2026 (WWW 2026)

Abstract

Large Language Models (LLMs) have shown significant potential for improving recommendation systems through their inherent reasoning capabilities and extensive knowledge base. Yet, existing studies predominantly address warm-start scenarios with abundant user-item interaction data, leaving the more challenging cold-start scenarios, where sparse interactions hinder traditional collaborative filtering methods, underexplored. To address this limitation, we propose novel reasoning strategies designed for cold-start item recommendations within the Netflix domain. Our method utilizes the advanced reasoning capabilities of LLMs to effectively infer user preferences, particularly for newly introduced or rarely interacted items. We systematically evaluate supervised fine-tuning, reinforcement learning-based fine-tuning, and hybrid approaches that combine both methods to optimize recommendation performance. Extensive experiments on real-world data demonstrate significant improvements in both methodological efficacy and practical performance in cold-start recommendation contexts. Remarkably, our reasoning-based fine-tuned models outperform Netflix's production ranking model by up to 8% in certain cases.

中文标题/摘要

标题：LLM在冷启动项目推荐中的推理应用

大型语言模型（LLMs）凭借其固有的推理能力和广泛的知识库，在改进推荐系统方面展现出巨大潜力。然而，现有研究主要针对用户-项目交互数据丰富的热启动场景，而对更具挑战性的冷启动场景探索不足；在冷启动场景中，稀疏的交互使传统协同过滤方法难以奏效。为解决这一局限，我们提出了针对Netflix领域冷启动项目推荐的新型推理策略。该方法利用LLMs的高级推理能力，有效推断用户偏好，特别是对于新引入或很少被交互的项目。我们系统地评估了监督微调、基于强化学习的微调以及结合两种方法的混合方法，以优化推荐性能。在真实数据上的广泛实验表明，在冷启动推荐场景中，该方法在方法论有效性和实际性能方面均取得了显著改进。值得注意的是，在某些情况下，我们基于推理的微调模型比Netflix的生产排序模型最多高出8%。

Summary / 总结

The research aims to enhance cold-start item recommendation by leveraging the reasoning capabilities of Large Language Models (LLMs). It proposes novel reasoning strategies and evaluates supervised fine-tuning, reinforcement learning-based fine-tuning, and hybrid approaches. The experiments on real-world data show that reasoning-based fine-tuned models significantly improve recommendation performance, outperforming Netflix's production ranking model by up to 8% in certain cases.

该研究通过利用大型语言模型（LLMs）的推理能力来解决冷启动项目推荐的挑战。研究提出了新的推理策略，并评估了监督微调、基于强化学习的微调以及结合两种方法的混合方法。实验结果表明，基于推理的微调模型在实际数据上的推荐性能显著提升，在某些情况下比Netflix的生产排名模型最多高出8%。

VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Authors: Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez

First: 2026-01-23T18:43:34+00:00 · Latest: 2026-01-23T18:43:34+00:00

Comments: Project page: https://visgym.github.io/

Abstract

Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.

中文标题/摘要

标题：VisGym：多模态代理的多样化、可定制、可扩展环境

现代视觉-语言模型（VLMs）在多步骤视觉交互中仍然缺乏充分的表征，特别是在它们如何在长时程内整合感知、记忆和行动方面。我们引入了VisGym，这是一个包含17个环境的测试场，用于评估和训练VLMs。该套件涵盖了符号谜题、真实图像理解、导航和操作，并提供了对难度、输入表示、规划时程和反馈的灵活控制。我们还提供了多步骤求解器，生成结构化的演示，以实现监督微调。我们的评估表明，所有前沿模型在交互设置中都面临挑战，在简单（46.6%）和困难（26.0%）配置中成功率都很低。我们的实验揭示了一些显著的局限性：模型难以有效利用长上下文，在无界历史记录情况下表现不如在截断窗口情况下。此外，我们发现，一旦以视觉形式呈现，几种基于文本的符号任务变得显著更难。然而，在部分可观测或未知动力学设置中，通过显式目标观察、文本反馈和探索性演示进行监督微调可以带来一致的收益，突显了具体的失败模式和改进多步骤视觉决策的途径。代码、数据和模型可在：https://visgym.github.io/ 获取。

Summary / 总结

VisGym is designed to evaluate and train Vision-Language Models (VLMs) in multi-step visual interactions, covering various tasks such as symbolic puzzles, real-image understanding, navigation, and manipulation. It offers flexible controls over difficulty, input representation, planning horizon, and feedback. Experiments show that current VLMs perform poorly in interactive settings, with low success rates even in easy configurations. The study highlights that models struggle with long context and that explicit goal observations and textual feedback improve performance in partially observable or unknown-dynamics settings.

VisGym 旨在评估和训练视觉-语言模型（VLMs）在多步视觉交互中的表现，涵盖符号谜题、真实图像理解、导航和操作等多种任务。它提供了对难度、输入表示、规划时间范围和反馈的灵活控制。实验表明，当前的 VLMs 在交互式设置中表现不佳，在简单配置中成功率也很低。研究还指出，模型在处理长上下文时存在困难，而明确的目标观察、文本反馈和在部分可观测或未知动力学设置中的探索性演示可以提高性能。

BONO-Bench: A Comprehensive Test Suite for Bi-objective Numerical Optimization with Traceable Pareto Sets

Authors: Lennart Schäpermeier, Pascal Kerschke

First: 2026-01-23T18:42:20+00:00 · Latest: 2026-01-23T18:42:20+00:00

Comments: Accepted for publication in the Special Issue on Benchmarking in Multi-Criteria Optimization at ACM TELO

Abstract

The evaluation of heuristic optimizers on test problems, better known as \emph{benchmarking}, is a cornerstone of research in multi-objective optimization. However, most test problems used in benchmarking numerical multi-objective black-box optimizers come from one of two flawed approaches: On the one hand, problems are constructed manually, which results in problems with well-understood optimal solutions, but unrealistic properties and biases. On the other hand, more realistic and complex single-objective problems are composited into multi-objective problems, but with a lack of control and understanding of problem properties. This paper proposes an extensive problem generation approach for bi-objective numerical optimization problems consisting of the combination of theoretically well-understood convex-quadratic functions into unimodal and multimodal landscapes with and without global structure. It supports configuration of test problem properties, such as the number of decision variables, local optima, Pareto front shape, plateaus in the objective space, or degree of conditioning, while maintaining theoretical tractability: The optimal front can be approximated to an arbitrary degree of precision regarding Pareto-compliant performance indicators such as the hypervolume or the exact R2 indicator. To demonstrate the generator's capabilities, a test suite of 20 problem categories, called \emph{BONO-Bench}, is created and subsequently used as a basis of an illustrative benchmark study. Finally, the general approach underlying our proposed generator, together with the associated test suite, is publicly released in the Python package \texttt{bonobench} to facilitate reproducible benchmarking.

中文标题/摘要

标题：BONO-Bench：用于具有可追溯帕累托集的双目标数值优化的综合测试套件

在多目标优化研究中，启发式优化器在测试问题上的评估，即所谓的基准测试，是其基石。然而，用于基准测试数值多目标黑盒优化器的大多数测试问题来自两种有缺陷的方法之一：一方面，问题是手工构建的，导致具有已知最优解但不现实的属性和偏差的问题。另一方面，更现实和复杂的单目标问题被组合成多目标问题，但缺乏对问题属性的控制和理解。本文提出了一种广泛的问题生成方法，用于双目标数值优化问题，该方法将理论理解良好的凸二次函数组合成单模态和多模态景观，有和没有全局结构。该方法支持测试问题属性的配置，如决策变量的数量、局部最优解的数量、帕累托前沿的形状、目标空间中的平台或条件程度，同时保持理论可处理性：帕累托合规性能指标（如超体积或精确R2指标）的最优前沿可以任意精度近似。为了展示生成器的能力，创建了一个包含20个问题类别的测试套件，称为BONO-Bench，并随后用作说明性基准研究的基础。最后，我们提出的方法的通用方法及其相关的测试套件在Python包bonobench中公开发布，以促进可重复的基准测试。

Summary / 总结

This paper addresses the limitations of existing benchmark problems in multi-objective optimization by proposing BONO-Bench, a comprehensive test suite. The method combines theoretically well-understood convex-quadratic functions to create bi-objective numerical optimization problems with controllable properties such as the number of decision variables and Pareto front shape. Key findings include the ability to generate problems with precise control over Pareto-optimal solutions, enabling accurate benchmarking of heuristic optimizers. The BONO-Bench suite, along with the Python package bonobench, is publicly released to support reproducible research.

本文通过提出BONO-Bench综合测试套件，解决了现有多目标优化基准问题的局限性。该方法通过组合凸二次函数来创建具有可控特性的双目标问题，确保理论可解性和现实特性。主要发现包括能够配置决策变量数量和帕累托前沿形状等参数，并通过包含20个问题类别的基准研究展示了生成器的能力。
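
Why convex-quadratic building blocks keep the Pareto set traceable can be seen in the simplest case. For two strictly convex quadratics $f_i(x) = (x - c_i)^\top A_i (x - c_i)$, every Pareto-optimal point minimizes a weighted sum $(1-t) f_1 + t f_2$ with $t \in [0, 1]$, which gives the closed form $x(t) = ((1-t)A_1 + tA_2)^{-1}((1-t)A_1 c_1 + tA_2 c_2)$. The sketch below evaluates that formula for a two-variable problem. It is a standard textbook result used for illustration, not the `bonobench` API (which is not shown in this digest); all function names and the example matrices are hypothetical.

```python
def matvec(a, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]


def solve2(a, b):
    """Solve the 2x2 linear system a @ x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]


def pareto_point(t, a1, c1, a2, c2):
    """Minimizer of (1 - t) * f1 + t * f2 for f_i(x) = (x - c_i)^T A_i (x - c_i)."""
    mixed = [[(1 - t) * a1[i][j] + t * a2[i][j] for j in range(2)] for i in range(2)]
    r1, r2 = matvec(a1, c1), matvec(a2, c2)
    rhs = [(1 - t) * r1[k] + t * r2[k] for k in range(2)]
    return solve2(mixed, rhs)


# Usage: trace the Pareto set between the two optima c1 = (0, 0) and c2 = (1, 1).
A1, C1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
A2, C2 = [[4.0, 0.0], [0.0, 1.0]], [1.0, 1.0]
pareto_set = [pareto_point(k / 10, A1, C1, A2, C2) for k in range(11)]
```

Because the two matrices differ, the traced set is a curve rather than the straight segment between the optima; the midpoint weight lands at (0.8, 0.5) instead of (0.5, 0.5).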

On Fine-Grained I/O Complexity of Attention Backward Passes

Authors: Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Song Yue, Jiahao Zhang

First: 2024-10-12T07:01:30+00:00 · Latest: 2026-01-23T18:42:16+00:00

Abstract

Large Language Models (LLMs) exhibit exceptional proficiency in handling extensive context windows in natural language. Nevertheless, the quadratic scaling of attention computation relative to sequence length creates substantial efficiency bottlenecks, necessitating the development of I/O-optimized algorithms. In this work, we conduct a systematic examination of the I/O complexity inherent in attention mechanisms, with a specific emphasis on the backward pass under both small and large cache settings. By leveraging the red-blue pebble game framework, we derive tight bounds for I/O complexity across the full spectrum of cache sizes. We validate that FlashAttention, one of the current industry standards, achieves optimality in the large-cache scenario for both forward and backward passes. Conversely, for small-cache environments, we introduce a novel algorithm that outperforms contemporary methods and successfully attains theoretical tight bounds. Furthermore, we expand our investigation to include sparse attention by establishing granular lower bounds for both forward and backward passes across all cache configurations. Ultimately, our results solidify the theoretical framework regarding I/O complexity in attention mechanisms, providing critical guidance for the development of efficient LLM training and inference systems.

中文标题/摘要

标题：细粒度I/O复杂性在注意力反向传递中的研究

大型语言模型（LLMs）在处理自然语言中的广泛上下文方面表现出色。然而，注意力计算相对于序列长度的二次缩放导致了显著的效率瓶颈，需要开发I/O优化算法。在本文中，我们系统地研究了注意力机制内在的I/O复杂性，特别关注在小缓存和大缓存设置下的反向传递。通过利用红蓝石子游戏框架，我们推导出了从缓存大小全谱范围内的紧界。我们验证了FlashAttention，当前工业标准之一，在大缓存场景下的前向和反向传递中实现了最优性。相反，在小缓存环境中，我们提出了一种新颖的算法，优于现有方法，并成功达到了理论紧界。此外，我们将研究扩展到稀疏注意力，建立了所有缓存配置下前向和反向传递的精细下界。最终，我们的结果巩固了关于注意力机制中I/O复杂性的理论框架，为高效LLM训练和推理系统的开发提供了关键指导。

Summary / 总结

This work examines the I/O complexity in attention mechanisms, particularly focusing on the backward pass under different cache settings. By using the red-blue pebble game framework, the authors derive tight bounds for I/O complexity. They find that FlashAttention is optimal for large-cache scenarios, while they introduce a new algorithm that outperforms existing methods for small-cache environments. Additionally, they establish lower bounds for sparse attention, contributing to a comprehensive theoretical framework for I/O complexity in attention mechanisms.

该研究探讨了大型语言模型中注意力机制的I/O复杂性，重点关注不同缓存设置下的反向传递。通过使用红蓝石子游戏框架，作者推导出了I/O复杂性的紧界。他们验证了FlashAttention在大缓存场景下的前向和反向传递均是最优的，而他们提出的新算法在小缓存环境中优于现有方法。此外，该研究还为稀疏注意力机制建立了下界，为理解注意力机制中的I/O复杂性提供了理论基础。
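
The I/O question studied above arises because attention can be computed block by block without ever storing the full score matrix. The sketch below shows the well-known online-softmax trick behind tiled attention kernels for a single query row, forward pass only. It is an illustration of the general idea, not the paper's small-cache algorithm or a FlashAttention implementation; the block size and function names are hypothetical.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: softmax(q . K^T) @ V with all scores held at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention_row(q, keys, values, block=2):
    """Same result, but only one block of keys/values is touched at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new running maximum.
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Only the running maximum, the running normalizer, and one output accumulator persist across blocks, which is why the working set is governed by the block size rather than the sequence length.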

Empowering Medical Equipment Sustainability in Low-Resource Settings: An AI-Powered Diagnostic and Support Platform for Biomedical Technicians

Authors: Bernes Lorier Atabonfack, Ahmed Tahiru Issah, Mohammed Hardi Abdul Baaki, Clemence Ingabire, Tolulope Olusuyi, Maruf Adewole, Udunna C. Anazodo, Timothy X Brown

Venue: MICCAI 2025

First: 2026-01-23T18:39:55+00:00 · Latest: 2026-01-23T18:39:55+00:00

Comments: Accepted at the MIRASOL Workshop at MICCAI 2025. To appear in Lecture Notes in Computer Science (LNCS)

Abstract

In low- and middle-income countries (LMICs), a significant proportion of medical diagnostic equipment remains underutilized or non-functional due to a lack of timely maintenance, limited access to technical expertise, and minimal support from manufacturers, particularly for devices acquired through third-party vendors or donations. This challenge contributes to increased equipment downtime, delayed diagnoses, and compromised patient care. This research explores the development and validation of an AI-powered support platform designed to assist biomedical technicians in diagnosing and repairing medical devices in real-time. The system integrates a large language model (LLM) with a user-friendly web interface, enabling imaging technologists/radiographers and biomedical technicians to input error codes or device symptoms and receive accurate, step-by-step troubleshooting guidance. The platform also includes a global peer-to-peer discussion forum to support knowledge exchange and provide additional context for rare or undocumented issues. A proof of concept was developed using the Philips HDI 5000 ultrasound machine, achieving 100% precision in error code interpretation and 80% accuracy in suggesting corrective actions. This study demonstrates the feasibility and potential of AI-driven systems to support medical device maintenance, with the aim of reducing equipment downtime to improve healthcare delivery in resource-constrained environments.

中文标题/摘要

标题：在低资源环境中赋能医疗设备可持续性：一种基于AI的诊断和支持平台，助力生物医学技术人员

在低收入和中等收入国家（LMICs），大量医疗诊断设备因缺乏及时维护、技术专家有限以及制造商支持不足而未充分利用或无法正常运行，尤其是对于通过第三方供应商或捐赠获得的设备。这一挑战导致设备停机时间增加、诊断延迟和患者护理质量下降。本研究探讨了开发和验证一种基于AI的支持平台，旨在协助生物医学技术人员实时诊断和修复医疗设备。该系统结合了大型语言模型（LLM）和用户友好的网页界面，使影像技师/放射技士和生物医学技术人员能够输入错误代码或设备症状并获得准确的故障排除指导。该平台还包括一个全球性的点对点讨论论坛，以支持知识交流并为罕见或未记录的问题提供额外背景信息。使用飞利浦HDI 5000超声机进行了概念验证，错误代码解释的精确度达到100%，建议纠正措施的准确性为80%。本研究展示了基于AI的系统支持医疗设备维护的可行性和潜力，旨在减少设备停机时间，改善资源受限环境中的医疗服务。

Summary / 总结

This research addresses the issue of underutilized medical diagnostic equipment in low- and middle-income countries by developing an AI-powered support platform. The platform uses a large language model to provide real-time troubleshooting guidance and includes a global discussion forum for knowledge exchange. Using the Philips HDI 5000 ultrasound machine, the system achieved 100% precision in error code interpretation and 80% accuracy in suggesting corrective actions, demonstrating its potential to reduce equipment downtime and improve healthcare delivery.

该研究针对低收入和中等收入国家医疗诊断设备利用率低的问题，开发了一个基于AI的支持平台。该平台使用大型语言模型提供实时故障排除指导，并包含一个全球讨论论坛。使用飞利浦HDI 5000超声波机器进行测试，系统在错误代码解释方面达到了100%的精度，并在建议纠正措施方面达到了80%的准确性，展示了其减少设备停机时间和改善医疗服务的潜力。

Provable Differentially Private Computation of the Cross-Attention Mechanism

Authors: Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song, Jiahao Zhang

First: 2024-07-20T01:02:27+00:00 · Latest: 2026-01-23T18:38:40+00:00

Abstract

Cross-attention has emerged as a cornerstone module in modern artificial intelligence, underpinning critical applications such as retrieval-augmented generation (RAG), system prompting, and guided stable diffusion. However, there is a rising concern about securing the privacy of cross-attention, as the underlying key and value matrices frequently encode sensitive data or private user information. In this work, we introduce a novel data structure designed to enforce differential privacy (DP) for cross-attention mechanisms, accompanied by provable theoretical guarantees. Specifically, letting $n$ denote the input sequence length, $d$ the feature dimension, $R$ the maximum magnitude of query and key matrices, $R_w$ the maximum magnitude of the value matrix, and $r, s, ε_s$ the parameters for polynomial kernel methods, our proposed structure achieves $\widetilde{O}(ndr^2)$ space and initialization complexity, with a query time of $\widetilde{O}(d r^2)$ per token. Moreover, we demonstrate that our mechanism satisfies $(ε, δ)$-DP, incurring an additive error of $\widetilde{O}((1-ε_s)^{-1} n^{-1} ε^{-1} R^{2s} R_w r^2)$ and a relative error of $2ε_s/(1-ε_s)$ with respect to the ground truth. Crucially, our framework maintains robustness against adaptive queries, ensuring security even in adversarial settings. To the best of our knowledge, this constitutes the first approach providing provable differential privacy for cross-attention, establishing a foundation for future privacy-preserving algorithms in large generative models (LGMs).

中文标题/摘要

标题：交叉注意力机制的可证明差分隐私计算

交叉注意力已成为现代人工智能的核心模块，支撑着诸如检索增强生成（RAG）、系统提示和引导稳定扩散等关键应用。然而，如何保护交叉注意力的隐私正日益受到关注，因为底层的键矩阵和值矩阵经常编码敏感数据或私人用户信息。在本文中，我们引入了一种新颖的数据结构，旨在为交叉注意力机制强制执行差分隐私（DP），并附带可证明的理论保证。具体而言，令$n$表示输入序列长度，$d$表示特征维度，$R$表示查询和键矩阵的最大幅度，$R_w$表示值矩阵的最大幅度，$r, s, ε_s$表示多项式核方法的参数，我们提出的数据结构实现了$\widetilde{O}(ndr^2)$的空间和初始化复杂度，每个token的查询时间为$\widetilde{O}(d r^2)$。此外，我们证明我们的机制满足$(ε, δ)$-DP，相对于真实值的加性误差为$\widetilde{O}((1-ε_s)^{-1} n^{-1} ε^{-1} R^{2s} R_w r^2)$，相对误差为$2ε_s/(1-ε_s)$。至关重要的是，我们的框架对自适应查询保持鲁棒，即使在对抗性环境中也能确保安全性。据我们所知，这是第一个为交叉注意力提供可证明差分隐私的方法，为未来大型生成模型（LGM）中的隐私保护算法奠定了基础。

Spatial-Agent: Agentic Geo-spatial Reasoning with Scientific Core Concepts

Authors: Riyang Bao, Cheng Yang, Dazhou Yu, Zhexiang Tang, Gengchen Mai, Liang Zhao

First: 2026-01-23T18:33:45+00:00 · Latest: 2026-01-23T18:33:45+00:00

Comments: 15 pages, 4 figures

Abstract

Geospatial reasoning is essential for real-world applications such as urban analytics, transportation planning, and disaster response. However, existing LLM-based agents often fail at genuine geospatial computation, relying instead on web search or pattern matching while hallucinating spatial relationships. We present Spatial-Agent, an AI agent grounded in foundational theories of spatial information science. Our approach formalizes geo-analytical question answering as a concept transformation problem, where natural-language questions are parsed into executable workflows represented as GeoFlow Graphs -- directed acyclic graphs with nodes corresponding to spatial concepts and edges representing transformations. Drawing on spatial information theory, Spatial-Agent extracts spatial concepts, assigns functional roles with principled ordering constraints, and composes transformation sequences through template-based generation. Extensive experiments on MapEval-API and MapQA benchmarks demonstrate that Spatial-Agent significantly outperforms existing baselines including ReAct and Reflexion, while producing interpretable and executable geospatial workflows.

中文标题/摘要

标题：Spatial-Agent：基于科学核心概念的智能体式地理空间推理

地理空间推理对于城市分析、交通规划和灾害响应等实际应用至关重要。然而，现有的基于LLM的智能体往往无法进行真正的地理空间计算，而是依赖网络搜索或模式匹配，同时臆造空间关系。我们提出了Spatial-Agent，这是一种以空间信息科学基础理论为依据的AI智能体。我们的方法将地理分析问答形式化为概念转换问题：自然语言问题被解析为以GeoFlow图表示的可执行工作流，GeoFlow图是有向无环图，其节点对应空间概念，边表示转换。借助空间信息理论，Spatial-Agent提取空间概念，在有原则的顺序约束下分配功能角色，并通过基于模板的生成来组合转换序列。在MapEval-API和MapQA基准上的大量实验表明，Spatial-Agent显著优于包括ReAct和Reflexion在内的现有基线，同时生成可解释、可执行的地理空间工作流。

Summary / 总结

The research aims to improve geospatial reasoning in AI agents for applications like urban analytics and disaster response. Spatial-Agent formalizes geospatial question answering as concept transformation using GeoFlow Graphs, which are directed acyclic graphs representing spatial concepts and their transformations. Experiments show that Spatial-Agent outperforms existing methods like ReAct and Reflexion, generating interpretable and executable geospatial workflows.

研究旨在通过改进AI代理的地理空间推理能力，应用于城市分析和灾害响应等领域。Spatial-Agent将地理空间问题解答形式化为概念转换问题，使用GeoFlow图。方法包括提取空间概念、分配功能角色和组成转换序列。实验表明，Spatial-Agent优于现有基线，并生成可解释的地理空间工作流。
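
The GeoFlow Graph described above is a directed acyclic graph whose nodes are spatial concepts and whose edges are transformations. A minimal sketch of executing such a workflow in dependency order follows, using the standard-library `graphlib` module (Python 3.9+). The node names, the toy coordinates, and the transformation functions are hypothetical; the paper's actual concept vocabulary and templates are not given in this digest.

```python
from graphlib import TopologicalSorter


def load_schools():
    """Hypothetical source concept: point locations."""
    return [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]


def load_station():
    """Hypothetical source concept: a single reference location."""
    return (0.0, 0.0)


def distances_to(schools, station):
    """Transformation: points + reference location -> Euclidean distances."""
    return [((x - station[0]) ** 2 + (y - station[1]) ** 2) ** 0.5 for x, y in schools]


def count_within(distances, radius=6.0):
    """Transformation: distances -> count of points inside the radius."""
    return sum(d <= radius for d in distances)


# Each node maps to (transformation, names of the upstream concepts it consumes).
workflow = {
    "schools": (load_schools, []),
    "station": (load_station, []),
    "distances": (distances_to, ["schools", "station"]),
    "nearby_count": (count_within, ["distances"]),
}


def run_workflow(graph):
    """Evaluate every node after all of its upstream concepts are available."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in graph.items()})
    results = {}
    for name in sorter.static_order():  # raises CycleError if the graph is not a DAG
        fn, deps = graph[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


results = run_workflow(workflow)
```

Keeping the workflow as plain data makes each intermediate result inspectable, which is one way a graph representation can support the interpretability the abstract mentions.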

AgentDrive: An Open Benchmark Dataset for Agentic AI Reasoning with LLM-Generated Scenarios in Autonomous Systems

Authors: Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

First: 2026-01-23T18:33:41+00:00 · Latest: 2026-01-23T18:33:41+00:00

Comments: 16 pages

Abstract

The rapid advancement of large language models (LLMs) has sparked growing interest in their integration into autonomous systems for reasoning-driven perception, planning, and decision-making. However, evaluating and training such agentic AI models remains challenging due to the lack of large-scale, structured, and safety-critical benchmarks. This paper introduces AgentDrive, an open benchmark dataset containing 300,000 LLM-generated driving scenarios designed for training, fine-tuning, and evaluating autonomous agents under diverse conditions. AgentDrive formalizes a factorized scenario space across seven orthogonal axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density. An LLM-driven prompt-to-JSON pipeline generates semantically rich, simulation-ready specifications that are validated against physical and schema constraints. Each scenario undergoes simulation rollouts, surrogate safety metric computation, and rule-based outcome labeling. To complement simulation-based evaluation, we introduce AgentDrive-MCQ, a 100,000-question multiple-choice benchmark spanning five reasoning dimensions: physics, policy, hybrid, scenario, and comparative reasoning. We conduct a large-scale evaluation of fifty leading LLMs on AgentDrive-MCQ. Results show that while proprietary frontier models perform best in contextual and policy reasoning, advanced open models are rapidly closing the gap in structured and physics-grounded reasoning. We release the AgentDrive dataset, AgentDrive-MCQ benchmark, evaluation code, and related materials at https://github.com/maferrag/AgentDrive

中文标题/摘要

标题：AgentDrive：一种基于LLM生成场景的自主系统中代理型AI推理的开放基准数据集

大型语言模型（LLMs）的迅速发展激发了将其集成到自主系统中进行推理驱动的感知、规划和决策的兴趣。然而，由于缺乏大规模、结构化和安全性关键的基准，评估和训练此类代理型AI模型仍然具有挑战性。本文介绍了AgentDrive，一个包含300,000个LLM生成的驾驶场景的开放基准数据集，旨在在各种条件下训练、微调和评估自主代理。AgentDrive 在七个正交轴上形式化了一个因素化的场景空间：场景类型、驾驶员行为、环境、道路布局、目标、难度和交通密度。基于LLM的提示到JSON流水线生成了语义丰富、可模拟的规范，并通过物理和模式约束进行了验证。每个场景都经历了模拟滚动、代理安全度量计算和基于规则的结果标记。为了补充基于模拟的评估，我们引入了AgentDrive-MCQ，一个涵盖五个推理维度的100,000道选择题基准：物理、策略、混合、场景和比较推理。我们对50个领先的LLM在AgentDrive-MCQ上进行了大规模评估。结果显示，尽管专有前沿模型在上下文和策略推理方面表现最佳，但先进的开源模型在结构化和物理基础推理方面正在迅速缩小差距。我们发布了AgentDrive数据集、AgentDrive-MCQ基准、评估代码及相关材料，网址为https://github.com/maferrag/AgentDrive

Summary / 总结

AgentDrive is an open benchmark dataset of 300,000 LLM-generated driving scenarios for training, fine-tuning, and evaluating agentic AI models in autonomous systems. Its scenario space is factorized along seven orthogonal axes: scenario type, driver behavior, environment, road layout, objective, difficulty, and traffic density. Generated specifications are validated against physical and schema constraints, and each scenario then undergoes simulation rollouts, surrogate safety metric computation, and rule-based outcome labeling. A companion benchmark, AgentDrive-MCQ, poses 100,000 multiple-choice questions spanning five reasoning dimensions: physics, policy, hybrid, scenario, and comparative reasoning. In an evaluation of fifty LLMs, proprietary frontier models perform best in contextual and policy reasoning, while advanced open models are rapidly closing the gap in structured and physics-grounded reasoning. The dataset and materials are publicly available at https://github.com/maferrag/AgentDrive.

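AgentDrive is an open benchmark dataset of 300,000 LLM-generated driving scenarios for training and evaluating agentic AI models in autonomous systems. It formalizes a factorized scenario space along seven axes and uses an LLM-driven pipeline to generate semantically rich specifications. Each scenario goes through simulation rollouts and safety metric computation. AgentDrive-MCQ is a multiple-choice benchmark of 100,000 questions that evaluates reasoning along five dimensions. Results show that proprietary models perform best in contextual and policy reasoning, while open models are improving rapidly in structured and physics-grounded reasoning.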

Domain-invariant Mixed-domain Semi-supervised Medical Image Segmentation with Clustered Maximum Mean Discrepancy Alignment

Authors: Ba-Thinh Lam, Thanh-Huy Nguyen, Hoang-Thien Nguyen, Quang-Khai Bui-Tran, Nguyen Lan Vi Vu, Phat K. Huynh, Ulas Bagci, Min Xu

Venue: ICASSP 2026

First: 2026-01-23T18:23:03+00:00 · Latest: 2026-01-23T18:23:03+00:00

Comments: accepted in ICASSP 2026

Abs · PDF · Code1 · Code2

Abstract

Deep learning has shown remarkable progress in medical image semantic segmentation, yet its success heavily depends on large-scale expert annotations and consistent data distributions. In practice, annotations are scarce, and images are collected from multiple scanners or centers, leading to mixed-domain settings with unknown domain labels and severe domain gaps. Existing semi-supervised or domain adaptation approaches typically assume either a single domain shift or access to explicit domain indices, which rarely hold in real-world deployment. In this paper, we propose a domain-invariant mixed-domain semi-supervised segmentation framework that jointly enhances data diversity and mitigates domain bias. A Copy-Paste Mechanism (CPM) augments the training set by transferring informative regions across domains, while a Cluster Maximum Mean Discrepancy (CMMD) block clusters unlabeled features and aligns them with labeled anchors via an MMD objective, encouraging domain-invariant representations. Integrated within a teacher-student framework, our method achieves robust and precise segmentation even with very few labeled examples and multiple unknown domain discrepancies. Experiments on Fundus and M&Ms benchmarks demonstrate that our approach consistently surpasses semi-supervised and domain adaptation methods, establishing a potential solution for mixed-domain semi-supervised medical image segmentation.
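
To make the MMD objective concrete, here is a minimal sketch of a squared Maximum Mean Discrepancy estimate between two sets of feature vectors. The RBF kernel, the bandwidth, the biased estimator, and the toy feature values are all assumptions; the abstract does not specify how the CMMD block computes its loss or how it clusters features.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples of feature vectors."""
    def mean_kernel(a_set, b_set):
        total = sum(rbf_kernel(a, b, bandwidth) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy 2-D features (hypothetical): labeled anchors and two unlabeled clusters.
labeled_anchors = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
aligned_cluster = [[0.05, 0.0], [0.0, 0.05], [-0.05, -0.05]]
shifted_cluster = [[3.0, 3.0], [3.1, 2.9], [2.9, 3.1]]

mmd_aligned = mmd_squared(labeled_anchors, aligned_cluster)
mmd_shifted = mmd_squared(labeled_anchors, shifted_cluster)
```

A cluster whose features sit far from the labeled anchors yields a larger value, so minimizing this quantity pulls the two feature distributions together.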

中文标题/摘要

标题：域不变混合域半监督医学图像分割与聚类最大均值偏差对齐

深度学习在医学图像语义分割方面取得了显著进展，但其成功高度依赖于大规模专家注释和一致的数据分布。实践中，注释稀缺，图像来自多个扫描器或中心，导致存在未知域标签和严重域差距的混合域设置。现有半监督或域适应方法通常假设单一域转移或可访问显式域索引，这在实际部署中很少成立。本文提出了一种域不变混合域半监督分割框架，该框架联合增强数据多样性并减轻域偏差。复制粘贴机制（CPM）通过在域间转移信息性区域来扩充训练集，而聚类最大均值偏差（CMMD）块通过MMD目标聚类未标记特征并将其与标记锚点对齐，鼓励域不变表示。该方法集成在教师-学生框架中，即使有少量标记示例和多个未知域差距，也能实现稳健和精确的分割。在视网膜和M&Ms基准测试上进行的实验表明，我们的方法在半监督和域适应方法中表现优异，为混合域半监督医学图像分割提供了一种潜在解决方案。

Summary / 总结

This paper addresses the challenge of medical image segmentation in mixed-domain settings with limited labeled data and unknown domain labels. It proposes a domain-invariant mixed-domain semi-supervised segmentation framework that includes a Copy-Paste Mechanism to augment the training set and a Cluster Maximum Mean Discrepancy block to align unlabeled features with labeled anchors. Experiments on Fundus and M&Ms benchmarks show that the proposed method outperforms existing semi-supervised and domain adaptation approaches, achieving robust and precise segmentation even with few labeled examples and multiple domain discrepancies.

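This paper targets medical image segmentation in mixed-domain settings where labeled data are limited and domain labels are unknown. It proposes a domain-invariant mixed-domain semi-supervised segmentation framework that combines a Copy-Paste Mechanism with a Cluster Maximum Mean Discrepancy block to increase data diversity and mitigate domain bias. The method achieves robust and precise segmentation even with few labeled samples and multiple domain discrepancies, and it outperforms existing semi-supervised and domain adaptation methods on the Fundus and M&Ms benchmarks.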

Evaluating the Effect of Retrieval Augmentation on Social Biases

Authors: Tianhui Zhang, Yi Zhou, Danushka Bollegala

First: 2025-02-24T19:58:23+00:00 · Latest: 2026-01-23T18:20:11+00:00

Comments: EACL26 main

Abs · PDF · Code1 · Code2

Abstract

Retrieval Augmented Generation (RAG) has gained popularity as a method for conveniently incorporating novel facts that were not seen during the pre-training stage in Large Language Model (LLM)-based Natural Language Generation (NLG) systems. However, LLMs are known to encode significant levels of unfair social biases. The modulation of these biases by RAG in NLG systems is not well understood. In this paper, we systematically study the relationship between the different components of a RAG system and the social biases presented in the text generated across three languages (i.e. English, Japanese and Chinese) and four social bias types (i.e. gender, race, age and religion). Specifically, using the Bias Benchmark for Question Answering (BBQ) datasets, we evaluate the social biases in RAG responses from document collections with varying levels of stereotypical biases, employing multiple LLMs used as generators. We find that the biases in document collections are often amplified in the generated responses, even when the generating LLM exhibits a low level of bias. Our findings raise concerns about the use of RAG as a technique for injecting novel facts into NLG systems and call for careful evaluation of potential social biases in RAG applications before their real-world deployment.

中文标题/摘要

标题：评估检索增强对社会偏见的影响

检索增强生成（RAG）因其方便地将预训练阶段未见过的新颖事实纳入基于大型语言模型（LLM）的自然语言生成（NLG）系统而受到关注。然而，LLM已知会编码大量的不公平社会偏见。RAG在NLG系统中调节这些偏见的程度尚不明确。在本文中，我们系统地研究了RAG系统不同组件与生成文本中呈现的社会偏见之间的关系，跨越三种语言（即英语、日语和中文）和四种社会偏见类型（即性别、种族、年龄和宗教）。具体而言，我们使用偏见问答（BBQ）基准数据集，评估来自具有不同水平刻板印象偏见的文档集合的RAG响应中的社会偏见，使用多种LLM作为生成器。我们发现，即使生成LLM表现出较低水平的偏见，文档集合中的偏见也往往会在生成的响应中被放大。我们的研究结果对将RAG作为向NLG系统注入新颖事实的技术提出了担忧，并在RAG应用的实际部署之前呼吁对其潜在的社会偏见进行仔细评估。

Summary / 总结

This paper investigates how Retrieval Augmented Generation (RAG) affects social biases in text generated by Large Language Models (LLMs) across three languages and four bias types. Using the BBQ benchmark datasets, the study evaluates biases in RAG responses from document collections with varying levels of stereotypical biases, employing multiple LLMs. The findings indicate that biases in document collections are often amplified in the generated responses, even when the generating LLM has low bias levels, raising concerns about the use of RAG in NLG systems.

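The study examines how Retrieval Augmented Generation (RAG) affects social biases in text generated by Large Language Models (LLMs). It evaluates bias with the BBQ benchmark datasets across three languages and four bias types (gender, race, age, and religion). Biases in the document collection are often amplified in RAG responses even when the LLM itself shows a low level of bias, which motivates careful evaluation of potential social biases before RAG applications are deployed in natural language generation systems.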

Strategies for Span Labeling with Large Language Models

Authors: Danil Semin, Ondřej Dušek, Zdeněk Kasner

First: 2026-01-23T18:03:10+00:00 · Latest: 2026-01-23T18:03:10+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) are increasingly used for text analysis tasks, such as named entity recognition or error detection. Unlike encoder-based models, however, generative architectures lack an explicit mechanism to refer to specific parts of their input. This leads to a variety of ad-hoc prompting strategies for span labeling, often with inconsistent results. In this paper, we categorize these strategies into three families: tagging the input text, indexing numerical positions of spans, and matching span content. To address the limitations of content matching, we introduce LogitMatch, a new constrained decoding method that forces the model's output to align with valid input spans. We evaluate all methods across four diverse tasks. We find that while tagging remains a robust baseline, LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms other strategies in some setups.
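
The three strategy families can be contrasted on a single example. The sketch below shows what a model would be asked to emit under each family and why content matching can be ambiguous. The output formats and helper names are assumptions for illustration; this is not LogitMatch, whose constrained decoding is not reproduced here.

```python
def tag_output(text, start, end, label):
    """Tagging: the model echoes the input with inline tags around the span."""
    return f"{text[:start]}<{label}>{text[start:end]}</{label}>{text[end:]}"


def index_output(start, end, label):
    """Indexing: the model reports numerical character offsets of the span."""
    return {"start": start, "end": end, "label": label}


def match_output(text, start, end, label):
    """Matching: the model quotes the span content."""
    return {"text": text[start:end], "label": label}


def resolve_match(text, quoted):
    """Map quoted content back to every offset where it occurs in the input."""
    hits = []
    pos = text.find(quoted)
    while pos != -1:
        hits.append((pos, pos + len(quoted)))
        pos = text.find(quoted, pos + 1)
    return hits


text = "Paris Hilton visited Paris in May."
start, end = 21, 26  # the second "Paris", a location

tagged = tag_output(text, start, end, "LOC")
indexed = index_output(start, end, "LOC")
matched = match_output(text, start, end, "LOC")
candidates = resolve_match(text, matched["text"])   # two occurrences: ambiguous
misquoted = resolve_match(text, "Pariss")           # no occurrence: unmatched
```

The quoted string "Paris" resolves to two positions, and a slightly misquoted string resolves to none; these are the kinds of span matching issues that a decoder constrained to valid input spans is meant to rule out.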

中文标题/摘要

标题：大规模语言模型的区间标注策略

大规模语言模型（LLMs）越来越多地用于文本分析任务，如命名实体识别或错误检测。然而，与编码器模型不同，生成架构缺乏明确机制来引用输入的特定部分。这导致了各种各样的区间标注的非正式提示策略，结果往往不一致。在本文中，我们将这些策略归类为三类：标记输入文本、索引区间的位置和匹配区间内容。为了解决内容匹配的局限性，我们引入了LogitMatch，这是一种新的约束解码方法，强制模型的输出与有效的输入区间对齐。我们在四个不同的任务上评估了所有方法。我们发现，虽然标记仍然是一个稳健的基础方法，但LogitMatch通过消除区间匹配问题并优于其他竞争性的匹配方法，在某些设置中表现更优。

Summary / 总结

This paper addresses the challenge of span labeling with large language models (LLMs) by categorizing existing prompting strategies into three families: tagging the input text, indexing numerical positions of spans, and matching span content. To overcome the limitations of content matching, the authors introduce LogitMatch, a constrained decoding method that forces the model's output to align with valid input spans. Evaluations across four tasks show that tagging remains a robust baseline, while LogitMatch improves upon competitive matching-based methods by eliminating span matching issues and outperforms the other strategies in some setups.

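This paper examines the challenges of using large language models for span labeling tasks such as named entity recognition. The authors group existing strategies into three families: tagging the input text, indexing the numerical positions of spans, and matching span content. They introduce LogitMatch, a constrained decoding method that forces the model's output to align with valid input spans. Evaluation on four different tasks shows that tagging remains a robust baseline, while LogitMatch resolves span matching issues and outperforms other strategies in some setups.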

Efficient semantic uncertainty quantification in language models via diversity-steered sampling

Authors: Ji Won Park, Kyunghyun Cho

Venue: NeurIPS 2025

First: 2025-10-24T10:06:21+00:00 · Latest: 2026-01-23T18:02:21+00:00

Comments: 10 pages (+7 appendix), 7 figures. Accepted at NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Accurately estimating semantic aleatoric and epistemic uncertainties in large language models (LLMs) is particularly challenging in free-form question answering (QA), where obtaining stable estimates often requires many expensive generations. We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding, covers both autoregressive and masked diffusion paradigms, and yields substantial sample-efficiency gains. The key idea is to inject a continuous semantic-similarity penalty into the model's proposal distribution using a natural language inference (NLI) model lightly finetuned on partial prefixes or intermediate diffusion states. We debias downstream uncertainty estimates with importance reweighting and shrink their variance with control variates. Across four QA benchmarks, our method matches or surpasses baselines while covering more semantic clusters with the same number of samples. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation in risk-sensitive model deployments.
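
The role of importance reweighting can be illustrated with a minimal sketch: samples drawn from a flatter, diversity-steered proposal overrepresent rare semantic clusters, and weighting each sample by target/proposal recovers the target probabilities. The two distributions, the cluster names, and the self-normalized estimator are assumptions for illustration; the NLI-based penalty and the control variates from the abstract are not modeled.

```python
import random

# Hypothetical distributions over three semantic clusters (not from the paper).
target = {"A": 0.7, "B": 0.2, "C": 0.1}    # what the base model would produce
proposal = {"A": 0.4, "B": 0.3, "C": 0.3}  # flatter, diversity-steered sampler


def draw(dist, n, seed=0):
    """Draw n cluster labels from a discrete distribution."""
    rng = random.Random(seed)
    return rng.choices(list(dist), weights=list(dist.values()), k=n)


def naive_estimate(samples, cluster):
    """Plain frequency, which reflects the proposal rather than the target."""
    return sum(s == cluster for s in samples) / len(samples)


def reweighted_estimate(samples, cluster):
    """Self-normalized importance-weighted frequency under the target."""
    weights = [target[s] / proposal[s] for s in samples]
    hits = sum(w for w, s in zip(weights, samples) if s == cluster)
    return hits / sum(weights)


samples = draw(proposal, 20000)
naive = naive_estimate(samples, "A")
corrected = reweighted_estimate(samples, "A")
```

With enough samples the naive frequency of cluster A sits near the proposal value of 0.4, whereas the reweighted estimate moves toward the target value of 0.7.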

中文标题/摘要

标题：通过多样性引导采样在语言模型中高效量化语义不确定性

在大型语言模型（LLMs）中准确估计自由形式问答（QA）中的语义 aleatoric 和 epistemic 不确定性特别具有挑战性，通常需要许多昂贵的生成才能获得稳定估计。我们引入了一种多样性引导的采样器，该采样器在解码过程中避免产生语义冗余输出，适用于自回归和掩码扩散范式，并且能够显著提高样本效率。核心思想是使用一个轻量级微调的自然语言推理（NLI）模型，将连续的语义相似性惩罚注入模型的提议分布中。我们通过重要性加权消除下游不确定性估计的偏差，并通过控制变量减少其方差。在四个问答基准测试中，我们的方法在相同数量的样本下覆盖了更多的语义簇，且能够匹配或超越基线方法。该框架具有模块化且无需访问基础LLM的梯度，有望作为风险敏感模型部署中不确定性估计的即插即用增强。

Summary / 总结

This paper addresses the cost of quantifying semantic aleatoric and epistemic uncertainty in large language models (LLMs) during free-form question answering. It proposes a diversity-steered sampler that discourages semantically redundant outputs during decoding and applies to both autoregressive and masked diffusion models. The sampler injects a continuous semantic-similarity penalty into the proposal distribution using a lightly finetuned natural language inference model; importance reweighting then debiases the downstream uncertainty estimates, and control variates reduce their variance. Experiments on four QA benchmarks show that the method matches or outperforms baselines while covering more semantic clusters with the same number of samples.

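The research aims to estimate semantic uncertainty in free-form question answering more efficiently by reducing semantically redundant outputs. The method uses a diversity-steered sampler that introduces a semantic-similarity penalty from a natural language inference model lightly finetuned on partial prefixes or intermediate diffusion states, and it applies to both autoregressive and masked diffusion models. Experiments on four QA benchmarks show comparable or better performance than baselines while covering more semantic clusters with the same number of samples, which improves sample efficiency and makes the framework a modular enhancement for risk-sensitive model deployments.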

AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports

Authors: Edward Ajayi

First: 2026-01-06T00:02:11+00:00 · Latest: 2026-01-23T18:00:27+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a comprehensive corpus of 236 World Bank reports. The task of AfriEconQA is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation from specialized institutional documents. The dataset consists of 8,937 curated QA instances, rigorously filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance is composed of: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, providing a unique challenge for Information Retrieval (IR) systems, as the data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize this dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap, where zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.
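
The four-part instance described in the abstract maps naturally onto a small record type. The sketch below is one possible layout; the class name, the field names, and the placeholder values are assumptions, since the released schema is not given in the abstract. It also works out the share of synthetic questions that survived filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAInstance:
    """One benchmark instance; field names are hypothetical."""
    question: str          # (1) question over economic indicators
    evidence: str          # (2) evidence retrieved from the corpus
    answer: str            # (3) verified ground-truth answer
    source_url: str        # (4) source metadata: URL
    publication_date: str  # (4) source metadata: publication date


def retention_rate(kept, pool):
    """Fraction of the synthetic question pool that survived filtering."""
    return kept / pool


# Placeholder values only; no real dataset content is shown here.
example = QAInstance(
    question="<question text>",
    evidence="<retrieved passage>",
    answer="<verified answer>",
    source_url="<report URL>",
    publication_date="<YYYY-MM-DD>",
)

# Counts taken from the abstract: 8,937 instances kept out of 10,018 questions.
rate = retention_rate(8937, 10018)
```

Using the counts stated in the abstract, roughly 89.2% of the synthetic questions were retained after filtering.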

中文标题/摘要

标题：AfriEconQA：基于世界银行报告的非洲经济分析基准数据集

我们介绍了AfriEconQA，这是一个基于236份世界银行报告的专门基准数据集，用于非洲经济分析。AfriEconQA的任务是从专门的机构文件中回答复杂的经济查询，这些查询需要高精度的数值推理和时间消歧。该数据集包含8,937个精心筛选的问答实例，从10018个合成问题中严格筛选出来，以确保高质量的证据-答案对齐。每个实例包括：(1) 需要对经济指标进行推理的问题，(2) 从语料库中检索到的相应证据，(3) 验证过的正确答案，以及(4) 来源元数据（例如，URL和出版日期），以确保时间来源。AfriEconQA是第一个专注于非洲经济分析的基准数据集，为信息检索（IR）系统提供了一个独特的挑战，因为这些数据在当前大型语言模型（LLMs）的预训练语料库中几乎不存在。我们通过一个11实验矩阵操作化该数据集，将零样本基线（GPT-5 Mini）与使用GPT-4o和Qwen 32B的RAG配置进行基准测试，采用五种不同的嵌入和排名策略。我们的结果表明，零样本模型无法回答超过90%的查询，即使是最先进的RAG管道也难以实现高精度。这证实了AfriEconQA作为下一代领域特定IR和RAG系统的稳健且具有挑战性的基准。AfriEconQA数据集和代码将在发表后公开。

Summary / 总结

AfriEconQA is a benchmark dataset for African economic analysis based on 236 World Bank reports, containing 8,937 curated QA instances. Each instance includes a question requiring economic reasoning, corresponding evidence, a verified answer, and source metadata. The dataset challenges Information Retrieval systems because its content is largely absent from the pretraining corpora of current Large Language Models. Experiments show that zero-shot models fail to answer over 90 percent of queries and that even state-of-the-art RAG pipelines struggle to achieve high precision, which supports AfriEconQA as a challenging benchmark for domain-specific IR and RAG systems.

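AfriEconQA is a specialized benchmark dataset for African economic analysis based on 236 World Bank reports, containing 8,937 curated QA instances. Each instance includes a question, evidence retrieved from the corpus, a verified answer, and source metadata. The dataset is designed to test high-precision numerical reasoning and temporal disambiguation. Experiments show that zero-shot models fail to answer over 90% of queries and that even state-of-the-art RAG pipelines struggle to reach high precision, highlighting a severe parametric knowledge gap. The dataset will be released upon publication.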

Failures of Contingent Thinking

Authors: Evan Piermont, Peio Zuazo-Garin

First: 2020-07-15T14:21:16+00:00 · Latest: 2026-01-23T17:59:32+00:00

Abs · PDF · Code1 · Code2

Abstract

We present a behavioral definition of an agent's perceived implication that uniquely identifies a subjective state-space representing her view of a decision problem, and which may differ from the modeler's. By examining belief updating within this model, we formalize the recent empirical consensus that reducing uncertainty improves contingent thinking, and propose a novel form of updating corresponding to the agent 'realizing' a flaw in her own thinking. Finally, we clarify the sense in which contingent thinking makes state-by-state dominance more cognitively demanding than obvious dominance.

中文标题/摘要

标题：条件性思维的失败

我们提供了一个行为定义，该定义独特地识别出代理感知到的推论，代表了她对决策问题的看法，这可能与建模者的观点不同。通过在该模型中研究信念更新，我们形式化了最近的实证共识，即减少不确定性可以改善条件性思维，并提出了一种新的更新形式，对应于代理意识到自己思维中的缺陷。最后，我们澄清了条件性思维如何使状态-状态支配比明显支配更具认知要求。

Summary / 总结

The paper defines a behavioral criterion for an agent's perceived implications, distinguishing between the agent's subjective state-space and the modeler's perspective. It examines belief updating within this framework and formalizes the idea that reducing uncertainty enhances contingent thinking. The study introduces a new form of updating where the agent recognizes a flaw in their own thinking. Additionally, it clarifies that contingent thinking increases the cognitive demand of state-by-state dominance compared to obvious dominance.

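The study gives a behavioral definition of an agent's perceived implication that identifies her subjective state-space, which may differ from the modeler's. By analyzing belief updating, it formalizes the consensus that reducing uncertainty improves contingent thinking and introduces a new form of updating in which the agent realizes a flaw in her own thinking. It also clarifies why contingent thinking makes state-by-state dominance more cognitively demanding than obvious dominance.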

A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media

Authors: Edward Ajayi, Martha Kachweka, Mawuli Deku, Emily Aiken

Venue: AAAI Oral Presentation

First: 2025-11-25T07:12:09+00:00 · Latest: 2026-01-23T17:53:12+00:00

Comments: Oral Presentation at the AAAI-26 Bridge Program on AI for Medicine and Healthcare. To appear in Proceedings of Machine Learning Research (PMLR)

Abs · PDF · Code1 · Code2

Abstract

Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous "split-then-balance" pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAPLLM explainability framework and present a prototype dashboard ("Social Media Screener") designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically-validated datasets at the critical intersection of online safety and computational mental health.
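
The ordering in "split-then-balance" matters: the held-out set is carved off first, and only the training portion is rebalanced, so the test set keeps its realistic class imbalance and shares no items with training. The sketch below assumes random oversampling as the balancing step and uses two made-up labels; the abstract does not say which balancing technique the authors used.

```python
import random
from collections import Counter


def split_then_balance(samples, test_fraction=0.2, seed=0):
    """Split first, then oversample minority classes in the training split only."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # Group the training split by label.
    by_label = {}
    for text, label in train:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())

    # Oversample each class up to the size of the largest training class.
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced, test


# Hypothetical imbalanced corpus: 90 "neutral" posts and 10 "cyberbullying" posts.
samples = [(f"post {i}", "neutral") for i in range(90)]
samples += [(f"post {i}", "cyberbullying") for i in range(90, 100)]

train_set, test_set = split_then_balance(samples)
train_counts = Counter(label for _, label in train_set)
```

Balancing before splitting would let duplicated minority examples leak into the test set and inflate scores, which is the failure this ordering avoids.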

中文标题/摘要

标题：一种从社交媒体中检测心理健康状况和网络欺凌的机器学习方法

心理健康挑战和网络欺凌在数字空间中日益普遍，需要可扩展且可解释的检测系统。本文介绍了一种统一的多类分类框架，用于从社交媒体数据中检测十个不同的心理健康和网络欺凌类别。我们从Twitter和Reddit收集数据集，采用严格的“先分割后平衡”管道进行训练，同时在现实的、保留的不平衡测试集上进行评估。我们进行了全面的评估，比较了传统词汇模型、混合方法以及几种端到端微调的变压器。结果表明，端到端微调对于性能至关重要，领域适应的MentalBERT模型脱颖而出，准确率为0.92，宏F1得分为0.76，超过了其通用版本和零样本LLM基线。基于全面的伦理分析，我们将系统定位为人工在环筛查辅助工具，而非诊断工具。为此，我们引入了一种混合SHAPLLM可解释性框架，并展示了一个原型仪表板（“社交媒体筛查器”），旨在将模型预测及其解释整合到管理员的实际工作流程中。我们的工作提供了一个稳健的基线，突显了未来在在线安全和计算心理健康交叉领域的多标签、临床验证数据集的需求。

Summary / 总结

This paper addresses the detection of mental health conditions and cyberbullying on social media using a unified multiclass classification framework covering ten categories. The authors curate datasets from Twitter and Reddit and use a "split-then-balance" pipeline, training on balanced data while evaluating on a held-out imbalanced test set. They compare traditional lexical models, hybrid approaches, and fine-tuned transformers, finding that end-to-end fine-tuning is critical: the domain-adapted MentalBERT performs best, with an accuracy of 0.92 and a Macro F1 score of 0.76. The system is framed as a human-in-the-loop screening aid rather than a diagnostic tool, with an explainability framework and a prototype dashboard to support moderators in their work.

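The paper uses a unified multiclass classification framework to detect mental health conditions and cyberbullying on social media. The authors collect data from Twitter and Reddit, training on balanced data and testing on a realistic imbalanced set. They compare several models and find that end-to-end fine-tuned transformers, in particular the domain-adapted MentalBERT, outperform traditional and hybrid approaches, with an accuracy of 0.92 and a Macro F1 score of 0.76. The study also introduces an explainability framework and a prototype dashboard to support human moderators' workflow when screening social media content.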

Pretraining Frame Preservation in Autoregressive Video Memory Compression

Authors: Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala

First: 2025-12-29T20:29:21+00:00 · Latest: 2026-01-23T17:47:41+00:00

Comments: Additional Results: https://lllyasviel.github.io/pfp_gitpage/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.

中文标题/摘要

标题：自回归视频记忆压缩中的预训练框架保存

我们提出了一种名为PFP的神经网络结构，用于将长视频压缩为短上下文，并具有明确的预训练目标，以在任意时间位置上保留单帧的高频细节。基线模型可以将20秒的视频压缩到大约5k长度的上下文中，其中可以随机检索具有感知保真度外观的帧。此类预训练模型可以直接微调为自回归视频模型的记忆编码器，从而实现具有较低上下文成本和相对较低保真度损失的长历史记忆。我们通过消融设置评估了该框架，并讨论了可能的神经架构设计的权衡。

Summary / 总结

The research presents PFP, a neural network structure for compressing long videos into short contexts while preserving the high-frequency details of individual frames. The method uses an explicit pretraining objective to preserve frame details at arbitrary temporal positions. The baseline model compresses a 20-second video into a context of about 5k length, from which random frames can be retrieved with perceptually preserved appearance. Fine-tuning these pretrained models as memory encoders for autoregressive video models enables long history memory with low context cost and relatively low fidelity loss.

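The research aims to develop a neural network structure called PFP that compresses long videos while preserving the high-frequency details of frames. The method includes an explicit pretraining objective that preserves frame details at arbitrary temporal positions. Experiments show that a 20-second video can be compressed into a context of about 5k length from which frames can be retrieved with perceptually preserved appearance. Fine-tuning these pretrained models as memory encoders for autoregressive video models enables long history memory with low context cost and relatively low fidelity loss.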

CASE -- Condition-Aware Sentence Embeddings for Conditional Semantic Textual Similarity Measurement

Authors: Gaifan Zhang, Yi Zhou, Danushka Bollegala

First: 2025-03-21T16:27:12+00:00 · Latest: 2026-01-23T17:43:27+00:00

Comments: Accepted to EACL2026

Abs · PDF · Code1 · Code2

Abstract

The meaning conveyed by a sentence often depends on the context in which it appears. Despite the progress of sentence embedding methods, it remains unclear how to best modify a sentence embedding conditioned on its context. To address this problem, we propose Condition-Aware Sentence Embeddings (CASE), an efficient and accurate method to create an embedding for a sentence under a given condition. First, CASE creates an embedding for the condition using a Large Language Model (LLM), where the sentence influences the attention scores computed for the tokens in the condition during pooling. Next, a supervised nonlinear projection is learned to reduce the dimensionality of the LLM-based text embeddings. We show that CASE significantly outperforms previously proposed Conditional Semantic Textual Similarity (C-STS) methods on an existing standard benchmark dataset. We find that subtracting the condition embedding consistently improves the C-STS performance of LLM-based text embeddings. Moreover, we propose a supervised dimensionality reduction method that not only reduces the dimensionality of LLM-based embeddings but also significantly improves their performance.
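
The finding about subtracting the condition embedding can be pictured with toy vectors. In the sketch below, two sentence embeddings share a large component along the condition direction; removing it exposes how the sentences differ under that condition. The three-dimensional vectors are made up for illustration, and the LLM-based pooling and the supervised projection from the abstract are not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


# Hypothetical embeddings: both sentences carry the condition direction [2, 0, 0].
condition = [2.0, 0.0, 0.0]
sentence_1 = [2.0, 1.0, 0.0]
sentence_2 = [2.0, 0.0, 1.0]

raw_similarity = cosine(sentence_1, sentence_2)
conditioned_similarity = cosine(
    subtract(sentence_1, condition), subtract(sentence_2, condition)
)
```

Here the raw similarity is dominated by the shared condition component, while the residual vectors are orthogonal, so the conditioned similarity drops to zero.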

中文标题/摘要

标题：CASE -- 基于条件的句子嵌入方法用于条件语义文本相似度测量

一个句子的意义往往取决于它出现的上下文。尽管句子嵌入方法取得了进展，但如何最好地根据其上下文修改句子嵌入仍然不清楚。为了解决这个问题，我们提出了条件感知句子嵌入(CASE)，这是一种高效且准确的方法，用于在给定条件下创建句子的嵌入。首先，CASE 使用大型语言模型(LLM)为条件创建嵌入，在池化过程中，句子会影响计算出的条件中各个标记的注意力分数。接下来，学习一个监督非线性投影来降低基于LLM的文本嵌入的维度。我们展示了CASE在现有标准基准数据集上显著优于之前提出的条件语义文本相似度(C-STS)方法。我们发现，从LLM基于的文本嵌入中减去条件嵌入可以一致地提高C-STS性能。此外，我们提出了一种监督降维方法，不仅可以降低LLM基于的嵌入的维度，还能显著提高其性能。

Summary / 总结

The research aims to improve the accuracy of semantic textual similarity measurement by considering the context of a sentence. CASE, a Condition-Aware Sentence Embeddings method, is proposed to create an embedding for a sentence under a given condition. It uses a Large Language Model to create an embedding for the condition, where the sentence influences the attention scores during pooling. A supervised nonlinear projection is then applied to reduce the dimensionality of the embeddings. Experiments show that CASE outperforms previous methods on a standard benchmark, and subtracting the condition embedding consistently improves C-STS performance. Additionally, a supervised dimensionality reduction method is proposed to further enhance performance.

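The research aims to improve the accuracy of semantic textual similarity measurement by taking a sentence's context into account. The proposed Condition-Aware Sentence Embeddings (CASE) method uses a large language model to create an embedding for the condition, with the sentence influencing the attention scores during pooling. A supervised nonlinear projection is then applied to reduce the dimensionality of the embeddings. Experiments show that CASE outperforms previous methods on a standard benchmark dataset and that subtracting the condition embedding improves the C-STS performance of LLM-based text embeddings. The supervised dimensionality reduction not only reduces the dimensionality of the embeddings but also significantly improves their performance.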

Nishpaksh: TEC Standard-Compliant Framework for Fairness Auditing and Certification of AI Models

Authors: Shashank Prakash, Ranjitha Prasad, Avinash Agarwal

First: 2026-01-23T17:35:05+00:00 · Latest: 2026-01-23T17:35:05+00:00

Comments: Accepted and presented at 2026 18th International Conference on COMmunication Systems and NETworks (COMSNETS)

Abs · PDF · Code1 · Code2

Abstract

The growing reliance on Artificial Intelligence (AI) models in high-stakes decision-making systems, particularly within emerging telecom and 6G applications, underscores the urgent need for transparent and standardized fairness assessment frameworks. While global toolkits such as IBM AI Fairness 360 and Microsoft Fairlearn have advanced bias detection, they often lack alignment with region-specific regulatory requirements and national priorities. To address this gap, we propose Nishpaksh, an indigenous fairness evaluation tool that operationalizes the Telecommunication Engineering Centre (TEC) Standard for the Evaluation and Rating of Artificial Intelligence Systems. Nishpaksh integrates survey-based risk quantification, contextual threshold determination, and quantitative fairness evaluation into a unified, web-based dashboard. The tool employs vectorized computation, reactive state management, and certification-ready reporting to enable reproducible, audit-grade assessments, thereby addressing a critical post-standardization implementation need. Experimental validation on the COMPAS dataset demonstrates Nishpaksh's effectiveness in identifying attribute-specific bias and generating standardized fairness scores compliant with the TEC framework. The system bridges the gap between research-oriented fairness methodologies and regulatory AI governance in India, marking a significant step toward responsible and auditable AI deployment within critical infrastructure like telecommunications.
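
As an illustration of quantitative fairness evaluation against a contextual threshold, the sketch below computes a statistical parity difference between two groups and checks it against a tolerance. The choice of metric, the threshold value, the group names, and the toy predictions are all assumptions; the abstract does not list which metrics or thresholds Nishpaksh or the TEC Standard prescribe.

```python
def positive_rate(predictions, groups, group):
    """Share of positive predictions within one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)


def statistical_parity_difference(predictions, groups, unprivileged, privileged):
    """Positive rate of the unprivileged group minus that of the privileged group."""
    return (positive_rate(predictions, groups, unprivileged)
            - positive_rate(predictions, groups, privileged))


def within_threshold(value, threshold):
    """True when the absolute disparity does not exceed the tolerance."""
    return abs(value) <= threshold


# Toy binary predictions (1 = favourable outcome) for two hypothetical groups.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

spd = statistical_parity_difference(predictions, groups, "b", "a")
passes = within_threshold(spd, threshold=0.1)  # 0.1 is an illustrative tolerance
```

Group "a" receives the favourable outcome 75% of the time and group "b" 25% of the time, so the disparity of -0.5 falls well outside the illustrative tolerance.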

中文标题/摘要

标题：Nishpaksh：符合TEC标准的公平性审计和AI模型认证框架

随着人工智能（AI）模型在高风险决策系统中的广泛应用，特别是在新兴电信和6G应用领域，透明和标准化的公平性评估框架的需求变得尤为迫切。尽管IBM AI Fairness 360和Microsoft Fairlearn等全球工具包在偏见检测方面取得了进展，但它们往往缺乏与地区特定监管要求和国家优先事项的对齐。为了解决这一差距，我们提出了Nishpaksh，这是一种本地化的公平性评估工具，它将电信工程中心（TEC）的人工智能系统评估和评级标准操作化。Nishpaksh将基于调查的风险量化、情境阈值确定和定量公平性评估整合到一个统一的基于Web的仪表板中。该工具采用向量化计算、响应式状态管理和认证准备的报告，以实现可重复的、审计级别的评估，从而满足标准化实施后的重要需求。在COMPAS数据集上的实验验证表明，Nishpaksh在识别属性特定偏见并生成符合TEC框架的标准化公平性评分方面具有有效性。该系统在研究导向的公平性方法与印度的监管AI治理之间架起了一座桥梁，标志着在关键基础设施如电信领域负责任和可审计的AI部署方面迈出的重要一步。

Summary / 总结

Nishpaksh is a fairness auditing and certification framework for AI models, designed to align with the Telecommunication Engineering Centre (TEC) Standard. It integrates risk quantification, threshold determination, and fairness evaluation into a web-based dashboard, using vectorized computation and certification-ready reporting. Experimental validation on the COMPAS dataset shows Nishpaksh’s effectiveness in identifying attribute-specific bias and generating TEC-compliant fairness scores.

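Nishpaksh is a fairness auditing and certification framework aimed at emerging telecom and 6G applications and designed to meet region-specific regulatory requirements. It integrates risk quantification, threshold determination, and fairness evaluation into a web-based dashboard and complies with the TEC Standard. Experiments on the COMPAS dataset show that Nishpaksh can identify attribute-specific bias and generate standardized fairness scores, supporting responsible AI deployment in critical infrastructure such as telecommunications.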

HEIGHT: Heterogeneous Interaction Graph Transformer for Robot Navigation in Crowded and Constrained Environments

Authors: Shuijing Liu, Haochen Xia, Fatemeh Cheraghi Pouria, Kaiwen Hong, Neeloy Chakraborty, Zichao Hu, Joydeep Biswas, Katherine Driggs-Campbell

First: 2024-11-19T00:56:35+00:00 · Latest: 2026-01-23T17:24:19+00:00

Comments: Accepted to IEEE Transactions on Automation Science and Engineering (T-ASE)

Abs · PDF · Code1 · Code2 · Project1

Abstract

We study the problem of robot navigation in dense and interactive crowds with static constraints such as corridors and furniture. Previous methods fail to consider all types of spatial and temporal interactions among agents and obstacles, leading to unsafe and inefficient robot paths. In this article, we leverage a graph-based representation of crowded and constrained scenarios and propose a structured framework to learn robot navigation policies with deep reinforcement learning. We first split the representations of different inputs and propose a heterogeneous spatio-temporal graph to model distinct interactions among humans, robots, and obstacles. Based on the heterogeneous spatio-temporal graph, we propose HEIGHT, a novel navigation policy network architecture with different components to capture heterogeneous interactions through space and time. HEIGHT utilizes attention mechanisms to prioritize important interactions and a recurrent network to track changes in the dynamic scene over time, encouraging the robot to avoid collisions adaptively. Through extensive simulation and real-world experiments, we demonstrate that HEIGHT outperforms state-of-the-art baselines in terms of success, navigation time, and generalization to domain shifts in challenging navigation scenarios. More information is available at https://sites.google.com/view/crowdnav-height/home.

中文标题/摘要

标题：高度：拥挤和受限环境中的异质交互图变换器用于机器人导航

我们研究了在密集且互动性强的人群中进行机器人导航的问题，其中包含静态约束，如走廊和家具。以往的方法未能考虑所有类型的时空交互，导致机器人路径不安全且效率低下。在本文中，我们利用拥挤和受限场景的图表示，并提出了一种结构化框架，利用深度强化学习学习机器人导航策略。我们首先将不同输入的表示进行拆分，并提出了一种异质时空图来建模人类、机器人和障碍物之间的不同交互。基于异质时空图，我们提出了一种新颖的导航策略网络架构HEIGHT，通过空间和时间捕捉异质交互。HEIGHT利用注意力机制优先处理重要交互，并使用循环网络跟踪动态场景随时间的变化，促使机器人适应性地避免碰撞。通过广泛的仿真和实地实验，我们证明了在具有挑战性的导航场景中，HEIGHT在成功率、导航时间和对领域转移的泛化能力方面优于最先进的基线方法。更多信息请参见https://sites.google.com/view/crowdnav-height/home。

Summary / 总结

The research addresses robot navigation in crowded and constrained environments. It models the scene as a heterogeneous spatio-temporal graph that captures distinct interactions among humans, robots, and obstacles, and builds on that graph a navigation policy network called HEIGHT, trained with deep reinforcement learning. HEIGHT uses attention mechanisms to prioritize important interactions and a recurrent network to track changes in the dynamic scene over time. Simulation and real-world experiments show that HEIGHT outperforms state-of-the-art baselines in success, navigation time, and generalization to domain shifts.

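The research focuses on robot navigation in crowded and constrained environments, where earlier methods often fail because they do not fully account for spatial and temporal interactions. The authors propose HEIGHT, a graph-based navigation policy network that models heterogeneous interactions with a spatio-temporal graph and uses attention mechanisms and a recurrent network to adapt to dynamic scenes. Experiments show that HEIGHT outperforms existing methods in navigation success rate, efficiency, and adaptability to new scenarios.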

T-LoRA: Single Image Diffusion Model Customization Without Overfitting

Authors: Vera Soboleva, Aibek Alanov, Andrey Kuznetsov, Konstantin Sobolev

Venue: AAAI 2026

First: 2025-07-08T13:14:10+00:00 · Latest: 2026-01-23T17:14:49+00:00

Comments: AAAI 2026

Abs · PDF · Code1 · Code2 · Project1

Abstract

While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework specifically designed for diffusion model personalization. We show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. T-LoRA incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments on SD-XL and FLUX-1.dev show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques, achieving a superior balance between concept fidelity and text alignment. Project page is available at https://controlgenai.github.io/T-LoRA/.
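
One way to picture a timestep-dependent rank constraint is a schedule that leaves fewer adapter rank components active at higher (noisier) timesteps, where the abstract reports more overfitting. The linear schedule, the default values, and the function names below are assumptions for illustration and may differ from the schedule T-LoRA actually uses; the orthogonal initialization is not shown.

```python
def active_rank(t, num_timesteps=1000, max_rank=64, min_rank=16):
    """Linearly shrink the usable adapter rank as the timestep t grows."""
    if not 0 <= t <= num_timesteps:
        raise ValueError("t must lie in [0, num_timesteps]")
    fraction = (num_timesteps - t) / num_timesteps
    return min_rank + int((max_rank - min_rank) * fraction)


def rank_mask(t, num_timesteps=1000, max_rank=64, min_rank=16):
    """Mask over the adapter's inner dimension: 1.0 = active, 0.0 = disabled."""
    r = active_rank(t, num_timesteps, max_rank, min_rank)
    return [1.0] * r + [0.0] * (max_rank - r)


low_noise_rank = active_rank(0)      # full rank at the lowest timestep
mid_rank = active_rank(500)
high_noise_rank = active_rank(1000)  # minimum rank at the highest timestep
mask = rank_mask(500)
```

In a low-rank adapter of the form B·diag(mask)·A, such a mask would zero out the trailing rank components for a given timestep, limiting the capacity available where overfitting is most likely.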

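Title and abstract (translated from Chinese)

Title: T-LoRA: Single Image Diffusion Model Customization Without Overfitting

Although fine-tuning diffusion models is a powerful way to customize pre-trained models to generate specific objects, it often overfits when training samples are limited, which harms both generalization and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model from a single concept image, since single-image customization has the greatest practical potential. We propose T-LoRA, a timestep-dependent low-rank adaptation framework designed for diffusion model personalization. We show that higher diffusion timesteps are more prone to overfitting than lower ones, which calls for a timestep-sensitive fine-tuning strategy. T-LoRA includes two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates according to the diffusion timestep, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments on SD-XL and FLUX-1.dev show that T-LoRA and its individual components outperform standard LoRA and other diffusion model personalization techniques, achieving a better balance between concept fidelity and text alignment. The project page is available at https://controlgenai.github.io/T-LoRA/.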

Summary

This paper addresses the challenge of customizing diffusion models using a single concept image without overfitting. It introduces T-LoRA, a Timestep-Dependent Low-Rank Adaptation framework that dynamically adjusts fine-tuning strategies based on diffusion timesteps and uses orthogonal initialization to ensure component independence. Experiments on SD-XL and FLUX-1.dev demonstrate that T-LoRA outperforms standard LoRA and other techniques, achieving better balance between concept fidelity and text alignment.

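This paper addresses the challenge of customizing a diffusion model from a single image without overfitting. It proposes the T-LoRA framework, which uses a timestep-based low-rank adaptation strategy. T-LoRA combines a dynamic fine-tuning strategy with orthogonal initialization to prevent overfitting. Experiments on SD-XL and FLUX-1.dev show that T-LoRA achieves a better balance between concept fidelity and text alignment than standard LoRA and other techniques.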
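
The timestep-dependent idea behind T-LoRA can be illustrated with a toy rank schedule. The sketch below is an illustration under stated assumptions, not the authors' implementation: the abstract says only that rank-constrained updates are adjusted based on the diffusion timestep and that higher timesteps overfit more easily, so the linear schedule, the function names (`active_rank`, `rank_mask`, `masked_lora_delta`), and the default values are all hypothetical.

```python
def active_rank(t, num_timesteps=1000, max_rank=64, min_rank=8):
    """Hypothetical linear schedule: fewer adapter ranks at higher (noisier) timesteps."""
    if num_timesteps < 2:
        raise ValueError("num_timesteps must be at least 2")
    if not 0 <= t < num_timesteps:
        raise ValueError("t must lie in [0, num_timesteps)")
    frac = t / (num_timesteps - 1)  # 0.0 at t=0, 1.0 at the last timestep
    return min_rank + round((max_rank - min_rank) * (1.0 - frac))


def rank_mask(t, num_timesteps=1000, max_rank=64, min_rank=8):
    """0/1 mask over the adapter's rank dimension for timestep t."""
    r = active_rank(t, num_timesteps, max_rank, min_rank)
    return [1.0] * r + [0.0] * (max_rank - r)


def masked_lora_delta(x, down, up, mask):
    """Compute up @ (mask * (down @ x)) with plain lists.

    down has shape (rank, d_in); up has shape (d_out, rank).
    """
    hidden = [m * sum(w * xi for w, xi in zip(row, x)) for m, row in zip(mask, down)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in up]
```

With this schedule the adapter uses its full rank at `t = 0` and only `min_rank` components at the last timestep, so the update has less capacity exactly where the abstract reports overfitting to be most likely.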

Calibrated Similarity for Reliable Geometric Analysis of Embedding Spaces

Authors: Nicolas Tacheny

First: 2026-01-23T17:14:44+00:00 · Latest: 2026-01-23T17:14:44+00:00

Comments: arXiv admin note: substantial text overlap with arXiv:2512.10350

Abs · PDF · Code1 · Code2

Abstract

While raw cosine similarity in pretrained embedding spaces exhibits strong rank correlation with human judgments, anisotropy induces systematic miscalibration of absolute values: scores concentrate in a narrow high-similarity band regardless of actual semantic relatedness, limiting interpretability as a quantitative measure. Prior work addresses this by modifying the embedding space (whitening, contrastive fine-tuning), but such transformations alter geometric structure and require recomputing all embeddings. Using isotonic regression trained on human similarity judgments, we construct a monotonic transformation that achieves near-perfect calibration while preserving rank correlation and local stability (98% across seven perturbation types). Our contribution is not to replace cosine similarity, but to restore interpretability of its absolute values through monotone calibration, without altering its ranking properties. We characterize isotonic calibration as an order-preserving reparameterization and prove that all order-based constructions (angular ordering, nearest neighbors, threshold graphs and quantile-based decisions) are invariant under this transformation.

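Title and abstract (translated from Chinese)

Title: Calibrated Similarity for Reliable Geometric Analysis of Embedding Spaces

Although raw cosine similarity in pretrained embedding spaces correlates strongly in rank with human judgments, anisotropy causes systematic miscalibration of its absolute values: scores cluster in a narrow high-similarity band regardless of actual semantic relatedness, which limits their interpretability as a quantitative measure. Prior work addresses this by modifying the embedding space (whitening, contrastive fine-tuning), but such transformations change the geometric structure and require recomputing all embeddings. Using isotonic regression trained on human similarity judgments, we construct a monotonic transformation that achieves near-perfect calibration while preserving rank correlation and local stability (98% across seven perturbation types). Our contribution is not to replace cosine similarity but to restore the interpretability of its absolute values through monotone calibration, without changing its ranking properties. We characterize isotonic calibration as an order-preserving reparameterization and prove that all order-based constructions (angular ordering, nearest neighbors, threshold graphs, and quantile-based decisions) are invariant under this transformation.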
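
A small worked example may help make the calibration step concrete. The sketch below fits a non-decreasing map from cosine scores to human ratings using the pool-adjacent-violators (PAV) algorithm, a standard solver for isotonic regression. The abstract does not say which solver the author uses, so treat PAV, the step-function lookup, and the names `pool_adjacent_violators` and `fit_calibrator` as assumptions chosen for illustration.

```python
from bisect import bisect_right


def pool_adjacent_violators(values):
    """Least-squares non-decreasing fit to a sequence (PAV algorithm)."""
    blocks = []  # each block is [sum, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge while the previous block's mean exceeds the current block's mean
        # (compared by cross-multiplication; counts are always positive).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def fit_calibrator(cosines, human_scores):
    """Return a monotone step function mapping a cosine score to a calibrated score."""
    pairs = sorted(zip(cosines, human_scores))
    xs = [p[0] for p in pairs]
    ys = pool_adjacent_violators([p[1] for p in pairs])

    def calibrate(cosine):
        # Use the fitted value of the last training point with x <= cosine;
        # queries below the smallest training score fall back to the first value.
        i = bisect_right(xs, cosine) - 1
        return ys[max(i, 0)]

    return calibrate
```

Because the fitted map is non-decreasing, it never reverses the order of two cosine scores, although in this simple step-function form it can map distinct scores to the same calibrated value.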

The Trajectory Alignment Coefficient in Two Acts: From Reward Tuning to Reward Learning

Authors: Calarina Muslimani, Yunshu Du, Kenta Kawamoto, Kaushik Subramanian, Peter Stone, Peter Wurman

First: 2026-01-23T17:13:54+00:00 · Latest: 2026-01-23T17:13:54+00:00

Abs · PDF · Code1 · Code2

Abstract

The success of reinforcement learning (RL) is fundamentally tied to having a reward function that accurately reflects the task objective. Yet, designing reward functions is notoriously time-consuming and prone to misspecification. To address this issue, our first goal is to understand how to support RL practitioners in specifying appropriate weights for a reward function. We leverage the Trajectory Alignment Coefficient (TAC), a metric that evaluates how closely a reward function's induced preferences match those of a domain expert. To evaluate whether TAC provides effective support in practice, we conducted a human-subject study in which RL practitioners tuned reward weights for Lunar Lander. We found that providing TAC during reward tuning led participants to produce more performant reward functions and report lower cognitive workload relative to standard tuning without TAC. However, the study also underscored that manual reward design, even with TAC, remains labor-intensive. This limitation motivated our second goal: to learn a reward model that maximizes TAC directly. Specifically, we propose Soft-TAC, a differentiable approximation of TAC that can be used as a loss function to train reward models from human preference data. Validated in the racing simulator Gran Turismo 7, reward models trained using Soft-TAC successfully captured preference-specific objectives, resulting in policies with qualitatively more distinct behaviors than models trained with standard Cross-Entropy loss. This work demonstrates that TAC can serve as both a practical tool for guiding reward tuning and a reward learning objective in complex domains.

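Title and abstract (translated from Chinese)

Title: The Trajectory Alignment Coefficient in Two Acts: From Reward Tuning to Reward Learning

The success of reinforcement learning (RL) fundamentally depends on a reward function that accurately reflects the task objective. Designing reward functions, however, is time-consuming and prone to misspecification. To address this, our first goal is to understand how to support RL practitioners in choosing appropriate weights for a reward function. We use the Trajectory Alignment Coefficient (TAC), a metric that evaluates how closely the preferences induced by a reward function match those of a domain expert. To assess whether TAC helps in practice, we ran a human-subject study in which RL practitioners tuned reward weights for Lunar Lander. We found that providing TAC during reward tuning led participants to produce more performant reward functions and to report lower cognitive workload than standard tuning without TAC. The study also showed that manual reward design remains labor-intensive even with TAC. This limitation motivated our second goal: to learn a reward model that maximizes TAC directly. Specifically, we propose Soft-TAC, a differentiable approximation of TAC that can serve as a loss function for training reward models from human preference data. Validated in the racing simulator Gran Turismo 7, reward models trained with Soft-TAC successfully captured preference-specific objectives, yielding policies with qualitatively more distinct behaviors than models trained with the standard cross-entropy loss. This work shows that TAC can serve both as a practical tool for guiding reward tuning and as a reward learning objective in complex domains.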

Summary

The paper aims to improve the process of reward function design in reinforcement learning by leveraging the Trajectory Alignment Coefficient (TAC). In a human-subject study, TAC was used to guide the tuning of reward weights for Lunar Lander, leading to more performant reward functions and reduced cognitive workload. However, manual reward design remains labor-intensive. To address this, the authors propose Soft-TAC, a differentiable approximation of TAC, which was used to train reward models in Gran Turismo 7, resulting in policies with more distinct behaviors compared to standard Cross-Entropy loss models.

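The paper aims to improve reward function design in reinforcement learning by using the Trajectory Alignment Coefficient (TAC). In a human-subject study, TAC guided the tuning of reward weights for Lunar Lander, which led to more performant reward functions and lower cognitive workload. Manual reward design nevertheless remains time-consuming. To address this, the authors propose Soft-TAC, a differentiable approximation of TAC for training reward models from human preference data. When validated in Gran Turismo 7, reward models trained with Soft-TAC produced policies with more distinct behaviors than models trained with the standard cross-entropy loss.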
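
The abstract describes TAC only as a measure of how closely a reward function's induced preferences match an expert's, so the following is a stand-in rather than the paper's definition. It uses a Kendall-tau-style pairwise agreement score over trajectories, plus a smooth variant that replaces the sign of the return difference with `tanh`, as one plausible way to obtain a differentiable surrogate. The function names, the `temperature` parameter, and the linear reward form are all assumptions.

```python
import math
from itertools import combinations


def weighted_return(weights, features):
    """Return of one trajectory under a linear reward: sum_i w_i * f_i (assumed form)."""
    return sum(w * f for w, f in zip(weights, features))


def _sign(x):
    return (x > 0) - (x < 0)


def pairwise_alignment(returns, expert_scores):
    """Concordant minus discordant trajectory pairs, divided by the number of
    pairs the expert ranks strictly. Lies in [-1, 1]."""
    total, n = 0, 0
    for i, j in combinations(range(len(returns)), 2):
        e = expert_scores[i] - expert_scores[j]
        if e == 0:
            continue  # the expert has no preference on this pair
        total += _sign(e) * _sign(returns[i] - returns[j])
        n += 1
    return total / n if n else 0.0


def soft_pairwise_alignment(returns, expert_scores, temperature=1.0):
    """Smooth surrogate: tanh of the scaled return gap replaces its sign."""
    total, n = 0.0, 0
    for i, j in combinations(range(len(returns)), 2):
        e = expert_scores[i] - expert_scores[j]
        if e == 0:
            continue
        total += _sign(e) * math.tanh((returns[i] - returns[j]) / temperature)
        n += 1
    return total / n if n else 0.0
```

A practitioner tuning weights by hand would recompute `pairwise_alignment` after each change to `weights`; a learned reward model could instead minimize the negative of the smooth variant, which approaches the hard score as `temperature` shrinks.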

GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints

Authors: Andy Zhu, Rongzhe Wei, Yupu Gu, Pan Li

First: 2026-01-23T17:13:54+00:00 · Latest: 2026-01-23T17:13:54+00:00

Abs · PDF · Code1 · Code2

Abstract

Machine unlearning (MU) for large language models has become critical for AI safety, yet existing methods fail to generalize to Mixture-of-Experts (MoE) architectures. We identify that traditional unlearning methods exploit MoE's architectural vulnerability: they manipulate routers to redirect queries away from knowledgeable experts rather than erasing knowledge, causing a loss of model utility and superficial forgetting. We propose Geometric Routing Invariance Preservation (GRIP), an algorithm-agnostic framework for unlearning for MoE. Our core contribution is a geometric constraint, implemented by projecting router gradient updates into an expert-specific null-space. Crucially, this decouples routing stability from parameter rigidity: while discrete expert selections remain stable for retained knowledge, the continuous router parameters remain plastic within the null space, allowing the model to undergo necessary internal reconfiguration to satisfy unlearning objectives. This forces the unlearning optimization to erase knowledge directly from expert parameters rather than exploiting the superficial router manipulation shortcut. GRIP functions as an adapter, constraining router parameter updates without modifying the underlying unlearning algorithm. Extensive experiments on large-scale MoE models demonstrate that our adapter eliminates expert selection shift (achieving over 95% routing stability) across all tested unlearning methods while preserving their utility. By preventing existing algorithms from exploiting MoE model's router vulnerability, GRIP adapts existing unlearning research from dense architectures to MoEs.

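Title and abstract (translated from Chinese)

Title: GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints

Machine unlearning (MU) for large language models has become critical for AI safety, yet existing methods do not carry over to Mixture-of-Experts (MoE) architectures. We find that traditional unlearning methods exploit an architectural weakness of MoE: they manipulate routers to redirect queries away from knowledgeable experts instead of erasing the knowledge, which reduces model utility and yields only superficial forgetting. We propose Geometric Routing Invariance Preservation (GRIP), an algorithm-agnostic unlearning framework for MoE. Our core contribution is a geometric constraint, implemented by projecting router gradient updates into an expert-specific null space. Crucially, this decouples routing stability from parameter rigidity: discrete expert selections stay stable for retained knowledge, while the continuous router parameters remain plastic within the null space, so the model can still undergo the internal reconfiguration needed to meet the unlearning objective. This forces the unlearning optimization to erase knowledge directly from expert parameters rather than take the superficial router-manipulation shortcut. GRIP acts as an adapter that constrains router parameter updates without modifying the underlying unlearning algorithm. Extensive experiments on large-scale MoE models show that the adapter eliminates expert selection shift (achieving over 95% routing stability) across all tested unlearning methods while preserving their utility. By preventing existing algorithms from exploiting the router vulnerability of MoE models, GRIP extends existing unlearning research from dense architectures to MoEs.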

Summary

The research addresses the challenge of machine unlearning (MU) for Mixture-of-Experts (MoE) architectures, where traditional methods manipulate routers to redirect queries, leading to superficial forgetting. The proposed GRIP framework introduces a geometric constraint that projects router gradient updates into an expert-specific null-space, decoupling routing stability from parameter rigidity. This allows the model to reconfigure internally to satisfy unlearning objectives, directly erasing knowledge from expert parameters. Experiments show that GRIP maintains over 95% routing stability while preserving model utility across various unlearning methods.

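The research addresses machine unlearning (MU) for Mixture-of-Experts (MoE) architectures, which existing methods do not handle effectively. The proposed Geometric Routing Invariance Preservation (GRIP) method introduces a geometric constraint that projects router gradient updates into an expert-specific null space, decoupling routing stability from parameter rigidity. This lets the model reconfigure internally while meeting the unlearning objective, erasing knowledge directly from expert parameters instead of manipulating the router. Experiments show that GRIP maintains over 95% routing stability across various unlearning methods while preserving model utility.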
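
The null-space projection at the heart of GRIP can be shown on a single router weight vector. The sketch below makes several assumptions that the abstract does not confirm: the router logit is a plain dot product `w · x`, the constraint set is the span of a few retained inputs, and the projection is computed with Gram-Schmidt. All function names are hypothetical. The point it illustrates is generic linear algebra: an update orthogonal to every retained input leaves the logits of those inputs unchanged.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(grad, retained_inputs):
    """Remove from grad every component lying in the span of retained_inputs."""
    out = list(grad)
    for b in orthonormal_basis(retained_inputs):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


def constrained_step(weights, grad, retained_inputs, lr=0.1):
    """Gradient step on one router row using only the projected gradient."""
    g = project_to_null_space(grad, retained_inputs)
    return [w - lr * gi for w, gi in zip(weights, g)]
```

For example, with retained inputs `[1, 0, 0]` and `[1, 1, 0]`, only the third coordinate of the gradient survives the projection, so the router row can still move while its logits on both retained inputs stay fixed.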

Evaluating Large Vision-language Models for Surgical Tool Detection

Authors: Nakul Poudel, Richard Simon, Cristian A. Linte

First: 2026-01-23T17:00:46+00:00 · Latest: 2026-01-23T17:00:46+00:00

Abs · PDF · Code1 · Code2

Abstract

Surgery is a highly complex process, and artificial intelligence has emerged as a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to achieve a holistic understanding of surgical workflows. This highlights the need for general-purpose surgical AI systems capable of comprehensively modeling the interrelated components of surgical scenes. Recent advances in large vision-language models that integrate multimodal data processing offer strong potential for modeling surgical tasks and providing human-like scene reasoning and understanding. Despite their promise, systematic investigations of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs for the fundamental surgical vision task of detecting surgical tools. Specifically, we investigate three state-of-the-art VLMs, Qwen2.5, LLaVA1.5, and InternVL3.5, on the GraSP robotic surgery dataset under both zero-shot and parameter-efficient LoRA fine-tuning settings. Our results demonstrate that Qwen2.5 consistently achieves superior detection performance in both configurations among the evaluated VLMs. Furthermore, compared with the open-set detection baseline Grounding DINO, Qwen2.5 exhibits stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 shows superior instrument recognition, while Grounding DINO demonstrates stronger localization.

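Title and abstract (translated from Chinese)

Title: Evaluating Large Vision-language Models for Surgical Tool Detection

Surgery is a highly complex process, and artificial intelligence has become a transformative force in supporting surgical guidance and decision-making. However, the unimodal nature of most current AI systems limits their ability to understand surgical workflows as a whole. This highlights the need for general-purpose surgical AI systems that can comprehensively model the interrelated components of surgical scenes. Recent advances in large vision-language models (VLMs) that integrate multimodal data processing offer strong potential for modeling surgical tasks and for providing human-like scene reasoning and understanding. Despite this promise, systematic studies of VLMs in surgical applications remain limited. In this study, we evaluate the effectiveness of large VLMs on a fundamental surgical vision task, detecting surgical tools. Specifically, we study three state-of-the-art VLMs, Qwen2.5, LLaVA1.5, and InternVL3.5, on the GraSP robotic surgery dataset under both zero-shot and parameter-efficient LoRA fine-tuning settings. Our results show that Qwen2.5 consistently achieves the best detection performance among the evaluated VLMs in both configurations. In addition, compared with the open-set detection baseline Grounding DINO, Qwen2.5 shows stronger zero-shot generalization and comparable fine-tuned performance. Notably, Qwen2.5 is better at instrument recognition, while Grounding DINO is stronger at localization.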

Summary

This study evaluates the effectiveness of large vision-language models (VLMs) for detecting surgical tools, focusing on Qwen2.5, LLaVA1.5, and InternVL3.5. The research uses both zero-shot and parameter-efficient LoRA fine-tuning settings on the GraSP robotic surgery dataset. Qwen2.5 is found to consistently outperform the other models, showing superior detection performance in both configurations and stronger zero-shot generalization compared to the open-set detection baseline Grounding DINO.

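The study evaluates large vision-language models (VLMs) for surgical tool detection, focusing on Qwen2.5, LLaVA1.5, and InternVL3.5. It uses zero-shot and parameter-efficient LoRA fine-tuning settings on the GraSP robotic surgery dataset. Qwen2.5 achieves the best detection performance in both settings and generalizes better in the zero-shot setting than the open-set detection baseline Grounding DINO, although Grounding DINO localizes tools more accurately.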

LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems

Authors: João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton

First: 2026-01-23T16:57:16+00:00 · Latest: 2026-01-23T16:57:16+00:00

Abs · PDF · Code1 · Code2

Abstract

Automated fact-checking (AFC) systems are susceptible to adversarial attacks, enabling false claims to evade detection. Existing adversarial frameworks typically rely on injecting noise or altering semantics, yet no existing framework exploits the adversarial potential of persuasion techniques, which are widely used in disinformation campaigns to manipulate audiences. In this paper, we introduce a novel class of persuasive adversarial attacks on AFCs by employing a generative LLM to rephrase claims using persuasion techniques. Considering 15 techniques grouped into 6 categories, we study the effects of persuasion on both claim verification and evidence retrieval using a decoupled evaluation strategy. Experiments on the FEVER and FEVEROUS benchmarks show that persuasion attacks can substantially degrade both verification performance and evidence retrieval. Our analysis identifies persuasion techniques as a potent class of adversarial attacks, highlighting the need for more robust AFC systems.

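Title and abstract (translated from Chinese)

Title: LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems

Automated fact-checking (AFC) systems are vulnerable to adversarial attacks that let false claims evade detection. Existing adversarial frameworks typically rely on injecting noise or altering semantics, but none exploits the adversarial potential of persuasion techniques, which are widely used in disinformation campaigns to manipulate audiences. In this paper, we introduce a new class of persuasive adversarial attacks on AFC systems that uses a generative LLM to rephrase claims with persuasion techniques. Considering 15 techniques grouped into 6 categories, we study the effect of persuasion on both claim verification and evidence retrieval using a decoupled evaluation strategy. Experiments on the FEVER and FEVEROUS benchmarks show that persuasion attacks can substantially degrade both verification performance and evidence retrieval. Our analysis identifies persuasion techniques as a potent class of adversarial attacks and highlights the need for more robust AFC systems.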

Summary

This paper addresses the vulnerability of automated fact-checking systems to adversarial attacks, particularly those using persuasion techniques. By employing a generative language model to rephrase claims with persuasion techniques, the authors demonstrate that such attacks can significantly reduce the performance of fact-checking systems in both claim verification and evidence retrieval. Experiments on FEVER and FEVEROUS benchmarks show that persuasion attacks can substantially degrade the effectiveness of these systems, underscoring the need for more robust fact-checking mechanisms.

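This paper examines how vulnerable automated fact-checking systems are to adversarial attacks driven by persuasion techniques. The attack uses a generative language model to rephrase false claims so that they evade detection. Experiments on the FEVER and FEVEROUS benchmarks show that these persuasion attacks significantly reduce the performance of both claim verification and evidence retrieval, underscoring the need for more robust fact-checking systems.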

Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

Authors: Fleur Hendriks, Ondřej Rokoš, Martin Doškář, Marc G. D. Geers, Vlado Menkovski

Venue: NeurIPS 2025

First: 2025-09-03T14:18:05+00:00 · Latest: 2026-01-23T16:51:17+00:00

Comments: 9 pages, 7 figures including appendices. Accepted to Machine Learning and the Physical Sciences Workshop, NeurIPS 2025 (https://ml4physicalsciences.github.io/2025/). Repository with corresponding code: https://github.com/FHendriks11/bifurcationML/. Video explanation: https://www.youtube.com/watch?v=wsL3h17KtjY

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of symmetry breaking. Deterministic machine learning models are unable to capture this multiplicity, averaging over solutions and failing to represent lower-symmetry outcomes. In this work, we formalize the use of generative AI, specifically flow matching, as a principled way to model the full probability distribution over bifurcation outcomes. Our approach builds on existing techniques by combining flow matching with equivariant architectures and an optimal-transport-based coupling mechanism. We generalize equivariant flow matching to a symmetric coupling strategy that aligns predicted and target outputs under group actions, allowing accurate learning in equivariant settings. We validate our approach on a range of systems, from simple conceptual systems to physical problems such as buckling beams and the Allen-Cahn equation. The results demonstrate that the approach accurately captures multimodal distributions and symmetry-breaking bifurcations. Moreover, our results demonstrate that flow matching significantly outperforms non-probabilistic and variational methods. This offers a principled and scalable solution for modeling multistability in high-dimensional systems.

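Title and abstract (translated from Chinese)

Title: Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

Bifurcation phenomena in nonlinear dynamical systems often produce multiple coexisting stable solutions, especially when symmetry is broken. Deterministic machine learning models cannot capture this multiplicity: they average over solutions and fail to represent lower-symmetry outcomes. In this work, we formalize the use of generative AI, specifically flow matching, as a principled way to model the full probability distribution over bifurcation outcomes. Our approach builds on existing techniques by combining flow matching with equivariant architectures and an optimal-transport-based coupling mechanism. We generalize equivariant flow matching to a symmetric coupling strategy that aligns predicted and target outputs under group actions, which enables accurate learning in equivariant settings. We validate the approach on a range of systems, from simple conceptual systems to physical problems such as buckling beams and the Allen-Cahn equation. The results show that the approach accurately captures multimodal distributions and symmetry-breaking bifurcations. They also show that flow matching significantly outperforms non-probabilistic and variational methods. This provides a principled and scalable solution for modeling multistability in high-dimensional systems.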

Summary

This work addresses the challenge of modeling symmetry-breaking bifurcations in nonlinear dynamical systems using generative AI, specifically flow matching. The method combines flow matching with equivariant architectures and an optimal-transport-based coupling mechanism to accurately capture the full probability distribution over bifurcation outcomes. Experiments on various systems, including conceptual and physical problems, show that the approach effectively models multimodal distributions and symmetry-breaking bifurcations, outperforming non-probabilistic and variational methods.

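This study uses flow matching, a generative AI method, to model symmetry-breaking bifurcations in nonlinear dynamical systems. The method combines flow matching with equivariant architectures and an optimal-transport-based coupling mechanism to capture the full probability distribution over bifurcation outcomes accurately. Experiments show that the method outperforms non-probabilistic and variational methods at modeling multistability and symmetry-breaking bifurcations.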
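
To make the idea of coupling under group actions more tangible, here is a minimal sketch for a two-element symmetry group (identity and sign flip), loosely in the spirit of a beam that can buckle left or right. It is an assumption-laden illustration, not the authors' method: the abstract speaks of aligning predicted and target outputs, whereas this toy version aligns a target sample with a source sample before building the usual linear flow-matching interpolation. All names are hypothetical.

```python
def identity(x):
    return list(x)


def negate(x):
    """Sign flip, standing in for a left/right reflection symmetry."""
    return [-v for v in x]


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def align_to_source(x0, x1, group_actions):
    """Pick the symmetric copy g(x1) that lies closest to the source sample x0."""
    candidates = [g(x1) for g in group_actions]
    return min(candidates, key=lambda y: squared_distance(x0, y))


def flow_matching_pair(x0, x1, t, group_actions):
    """Return the interpolated point x_t and the target velocity for training.

    Uses the straight-line path x_t = (1 - t) * x0 + t * g(x1), whose velocity
    g(x1) - x0 is constant in t.
    """
    x1_aligned = align_to_source(x0, x1, group_actions)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1_aligned)]
    velocity = [b - a for a, b in zip(x0, x1_aligned)]
    return x_t, velocity
```

Aligning before interpolating keeps the paths short: a source sample on the negative side is paired with the negative copy of the target, so the regression target does not average two symmetric branches toward zero.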