arXiv Paper Digest

UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

Authors: Ruiheng Zhang, Jingfeng Yao, Huangxuan Zhao, Hao Yan, Xiao He, Lei Chen, Zhou Wei, Yong Luo, Zengmao Wang, Lefei Zhang, Dacheng Tao, Bo Du

First: 2026-01-16T18:59:58+00:00 · Latest: 2026-01-16T18:59:58+00:00

Comments: Codes and models are available at https://github.com/ZrH42/UniX

Abs · PDF · Code1 · Code2 · Code3

Abstract

Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently lead to compromised performance in one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Codes and models are available at https://github.com/ZrH42/UniX.

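Title / Abstract (translated from Chinese)

Title: UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation

Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, because the two tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing methods are usually built on parameter-shared autoregressive architectures and often compromise performance on one or both tasks. To address this, we propose UniX, a new-generation unified medical foundation model for chest X-ray understanding and generation. UniX splits the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced so that understanding features dynamically guide the generation process. Combined with a rigorous data cleaning pipeline and a multi-stage training strategy, the architecture lets the tasks collaborate while exploiting the strengths of diffusion models for better generation. On two representative benchmarks, UniX improves understanding performance (Micro-F1) by 46.1% and generation quality (FD-RadDino) by 24.2% while using only a quarter of the parameters of LLM-CXR. By reaching performance comparable to task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Code and models are available at https://github.com/ZrH42/UniX.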

Summary

UniX is designed to unify the tasks of understanding and generating chest X-rays by decoupling them into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. It introduces a cross-modal self-attention mechanism to dynamically guide the generation process with understanding features. On benchmarks, UniX improves understanding performance by 46.1% (Micro-F1) and generation quality by 24.2% (FD-RadDino) while using only a quarter of the parameters of LLM-CXR, demonstrating a scalable paradigm for medical image understanding and generation. Codes and models are available at https://github.com/ZrH42/UniX.

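UniX unifies visual understanding and generation of chest X-rays by splitting the tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. It introduces a cross-modal self-attention mechanism so that understanding features dynamically guide generation. On benchmarks, UniX improves understanding performance by 46.1% (Micro-F1) and generation quality by 24.2% (FD-RadDino) while using only a quarter of the parameters of LLM-CXR, demonstrating a scalable paradigm for medical image understanding and generation.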
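The understanding result above is reported as Micro-F1. As a reminder of what that metric aggregates, here is a minimal sketch for multi-label findings. The label names and the three example studies are made up for illustration; the benchmark's actual label set and evaluation protocol are not given in the abstract.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all studies and labels first."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-study finding sets (illustrative only).
gold = [{"edema", "effusion"}, {"pneumonia"}, set()]
pred = [{"edema"}, {"pneumonia", "effusion"}, set()]
score = micro_f1(gold, pred)
```

Because counts are pooled before the ratio is taken, frequent findings weigh more than rare ones, which is the usual reason to report Micro-F1 alongside a macro average.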

How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers

Authors: Jonathan Roberts, Kai Han, Samuel Albanie

First: 2026-01-16T18:58:29+00:00 · Latest: 2026-01-16T18:58:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and estimating inference pricing is the token. In general, tokens are used as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation by providing a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition toward tokenization in contemporary LLMs.

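Title / Abstract (translated from Chinese)

Title: How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers

Frontier large language models (LLMs) are increasingly used across academia, society, and industry. The token is a common unit for comparing models, their inputs and outputs, and for estimating inference cost. Tokens are usually treated as a stable currency, assumed to be roughly consistent across tokenizers and contexts, which allows direct comparison. However, tokenization differs significantly across models and text domains, so naive interpretation of token counts is problematic. We quantify this variation with a comprehensive empirical analysis, exploring how sequences compress into tokens under different distributions of text data. Our analysis challenges common heuristics about token length and finds them overly simplistic. We hope the insights from our study bring clarity and intuition to tokenization in contemporary LLMs.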

Summary

The study aims to address the variability in tokenization across different models and text domains, which can affect the interpretation of token counts. The researchers employ an empirical analysis of tokenization, examining how sequences are compressed into tokens across various types of textual data. Key findings suggest that token lengths vary more than previously thought, challenging existing heuristics and highlighting the need for more nuanced understanding of tokenization in large language models.

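The study addresses the variability of tokenization across language models and text domains, which can lead to misreading token counts. Through an empirical analysis of tokenization, the researchers examine how sequences compress into tokens under different text distributions. The main finding is that token lengths vary significantly and do not follow simple rules of thumb, underscoring the need for a more nuanced understanding of tokenization in modern language models.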
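The quantity at issue, characters per token, is easy to measure. The sketch below uses two toy tokenizers (a regex word splitter and a fixed-width chunker) rather than any real LLM tokenizer, and two invented text samples; it only illustrates how the ratio depends on both the tokenizer and the kind of text.

```python
import re
from typing import Callable, List


def word_tokenizer(text: str) -> List[str]:
    # Toy rule: runs of word characters, or single punctuation symbols.
    return re.findall(r"\w+|[^\w\s]", text)


def chunk_tokenizer(text: str, width: int = 4) -> List[str]:
    # Toy rule: cut the string into fixed-width character chunks.
    return [text[i:i + width] for i in range(0, len(text), width)]


def chars_per_token(text: str, tokenize: Callable[[str], List[str]]) -> float:
    tokens = tokenize(text)
    return len(text) / len(tokens) if tokens else 0.0


# Invented samples from two text "domains".
samples = {
    "prose": "The quick brown fox jumps over the lazy dog.",
    "code": "for(i=0;i<n;i++){x[i]+=1;}",
}

for name, text in samples.items():
    for tok in (word_tokenizer, chunk_tokenizer):
        print(f"{name:5s} {tok.__name__:15s} {chars_per_token(text, tok):.2f}")
```

With the word splitter, the symbol-dense code sample yields far fewer characters per token than the prose sample, while the chunker is nearly constant; a single "characters per token" rule of thumb cannot describe both.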

Do explanations generalize across large reasoning models?

Authors: Koyena Pal, David Bau, Chandan Singh

First: 2026-01-16T18:55:29+00:00 · Latest: 2026-01-16T18:55:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference rankings and post-training with reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to yield new insights and outline a framework for characterizing LRM explanation generalization.

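Title / Abstract (translated from Chinese)

Title: Do explanations generalize across large reasoning models?

Large reasoning models (LRMs) produce a textual chain of thought (CoT) while solving a problem, which can serve as a powerful tool for understanding the problem by providing a readable, natural-language explanation. However, it is unclear whether these explanations generalize, that is, whether they capture general patterns of the underlying problem rather than patterns specific to the LRM. This is a key question for understanding or discovering new concepts, for example in AI for science. We study it by evaluating one specific notion of generalizability: whether an explanation produced by one LRM induces the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (they increase consistency between LRMs), and that the increase correlates with human preference rankings and with reinforcement-learning post-training. We further analyze the conditions under which explanations yield consistent answers and propose a simple sentence-level ensembling strategy that improves consistency. Taken together, the results advise caution when using LRM explanations to derive new insights and outline a framework for characterizing how well LRM explanations generalize.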

Summary

The study investigates whether textual chain of thought (CoT) explanations generated by large reasoning models (LRMs) generalize across different models, meaning they capture general patterns rather than model-specific ones. The research evaluates this by seeing if explanations from one LRM can induce similar behavior in another LRM. Key findings include that CoT explanations often do generalize, increasing consistency between LRMs, and this generalization is linked to human preference and reinforcement learning. The study also suggests that under certain conditions, explanations can yield consistent answers and proposes a simple ensembling strategy to enhance consistency.

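The study examines whether textual chain-of-thought (CoT) explanations generated by large reasoning models (LRMs) generalize across models, that is, whether they capture general patterns of the problem rather than model-specific ones. By measuring consistency when one LRM receives another LRM's CoT explanation, it finds that CoT explanations often do generalize and raise consistency between models. This generalization also correlates with human preference rankings and reinforcement-learning post-training. The study advises caution when using LRM explanations and proposes a sentence-level ensembling strategy to improve consistency.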
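One simple way to read "consistency between LRMs" is the fraction of (model pair, problem) combinations on which two models give the same answer. The sketch below computes that pairwise agreement; it is an assumed formulation for illustration, not necessarily the paper's exact measure, and the model names and answers are invented.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_consistency(answers: Dict[str, List[str]]) -> float:
    """Share of (model pair, problem) combinations with identical answers."""
    pairs = list(combinations(answers, 2))
    n_problems = len(next(iter(answers.values())))
    agree = sum(
        answers[a][i] == answers[b][i]
        for a, b in pairs
        for i in range(n_problems)
    )
    return agree / (len(pairs) * n_problems)


# Invented answers of three hypothetical models on four problems.
without_explanation = {
    "m1": ["A", "B", "C", "D"],
    "m2": ["A", "C", "C", "A"],
    "m3": ["B", "B", "C", "D"],
}
# Same models after each is given the same explanation (again invented).
with_explanation = {
    "m1": ["A", "B", "C", "D"],
    "m2": ["A", "B", "C", "A"],
    "m3": ["A", "B", "C", "D"],
}

baseline = pairwise_consistency(without_explanation)
shared = pairwise_consistency(with_explanation)
```

An explanation that generalizes in the abstract's sense would push `shared` above `baseline`. Agreement alone says nothing about correctness, which is one reason the authors advise caution.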

Building Production-Ready Probes For Gemini

Authors: János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, Arthur Conmy

First: 2026-01-16T18:54:29+00:00 · Latest: 2026-01-16T18:54:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant shifts, including multi-turn conversations, static jailbreaks, and adaptive red teaming. Our results demonstrate that while multimax addresses context length, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at a low cost due to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.

Summary

The research aims to develop effective misuse mitigation techniques for language models by addressing the challenge of probe generalization under production distribution shifts, particularly from short-context to long-context inputs. The authors propose new probe architectures and evaluate their robustness in the cyber-offensive domain, finding that a combination of architecture choice and diverse training is necessary for broad generalization. They also demonstrate that pairing probes with prompted classifiers improves accuracy while maintaining computational efficiency. These findings have led to the successful deployment of these probes in user-facing instances of Gemini, Google's advanced language model.

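The paper aims to mitigate misuse of advanced language models by developing production-grade activation probes. The authors find that existing probes generalize poorly under production distribution shifts, especially the shift from short to long contexts. They propose new probe architectures and evaluate their robustness in the cyber-offensive domain, showing that both architecture choice and training on diverse distributions are needed for broad generalization. Pairing probes with prompted classifiers yields optimal accuracy at low cost. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model.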

ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Authors: Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, Jakob Engel

First: 2026-01-16T18:51:24+00:00 · Latest: 2026-01-16T18:51:24+00:00

Comments: Project Page: http://facebookresearch.github.io/ShapeR Video: https://www.youtube.com/watch?v=EbY30KAA55I

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.

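Title / Abstract (translated from Chinese)

Title: ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Recent advances in 3D shape generation have achieved impressive results, but most existing methods depend on clean, unoccluded, well-segmented inputs. Such conditions are rarely met in real-world scenes. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we use off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to condition effectively on these modalities then generates high-fidelity metric 3D shapes. To be robust to the challenges of casually captured data, we employ techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies for handling background clutter. In addition, we introduce a new evaluation benchmark of 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, improving Chamfer distance by 2.7x over the state of the art.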

Summary

ShapeR is a novel method for generating 3D object shapes from casually captured sequences. It uses visual-inertial SLAM, 3D detection, and vision-language models to extract sparse SLAM points, multi-view images, and machine-generated captions. A rectified flow transformer then generates high-fidelity 3D shapes. ShapeR demonstrates robustness to real-world challenges through techniques like on-the-fly augmentations and a curriculum training scheme. Experiments show ShapeR significantly outperforms existing methods, reducing the Chamfer distance by 2.7 times.

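ShapeR is a new method for generating 3D shapes from casually captured sequences, removing the requirement of existing methods for clean, well-segmented inputs. It uses visual-inertial SLAM, 3D detection, and vision-language models to extract sparse SLAM points, multi-view images, and generated captions, which a rectified flow transformer then processes to produce high-fidelity 3D shapes. ShapeR shows strong robustness and accuracy in challenging real-world scenes, improving Chamfer distance by 2.7x over existing methods.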
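The headline number is stated in Chamfer distance. For readers unfamiliar with it, here is a minimal sketch of one common convention: the sum of the two directed mean squared nearest-neighbor distances between point sets. Conventions vary (squared or unsquared, summed or averaged), and the abstract does not say which variant the benchmark uses, so treat this as an assumption.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance with squared Euclidean nearest-neighbor terms."""

    def one_way(src: List[Point], dst: List[Point]) -> float:
        # Mean over src of the squared distance to the closest point in dst.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Tiny illustrative point sets (not from the benchmark).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
d = chamfer_distance(predicted, reference)
```

This brute-force version is quadratic in the number of points; real evaluations on dense samples typically use a spatial index for the nearest-neighbor queries.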

From Aggregation to Selection: User-Validated Distributed Social Recommendation

Authors: Jingyuan Huang, Dan Luo, Zihe Ye, Weixin Chen, Minghao Guo, Yongfeng Zhang

Venue: WWW 2026

First: 2025-05-27T16:17:06+00:00 · Latest: 2026-01-16T18:45:34+00:00

Comments: Accepted by HCRS@WWW 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Social recommender systems facilitate social connections by identifying potential friends for users. Each user maintains a local social network centered around themselves, resulting in a naturally distributed social structure. Recent research on distributed modeling for social recommender systems has gained increasing attention, as it naturally aligns with the user-centric structure of user interactions. Current distributed social recommender systems rely on automatically combining predictions from multiple models, often overlooking the user's active role in validating whether suggested connections are appropriate. Moreover, recommendation decisions are validated by individual users rather than derived from a single global ordering of candidates. As a result, standard ranking-based evaluation metrics make it difficult to evaluate whether a user-confirmed recommendation decision is actually correct. To address these limitations, we propose DeSocial, a distributed social recommendation framework with user-validation. DeSocial enables users to select recommendation algorithms to validate their potential connections, and the verification is processed through majority consensus among multiple independent user validators. To evaluate the distributed recommender system with user validator, we formulate this setting as a link prediction and verification task and introduce Acc@K, a consensus-based evaluation metric that measures whether user-approved recommendations are correct. Experiments on 4 real-world social networks show that DeSocial improves decision correctness and robustness compared to single-point and distributed baselines. These findings highlight the potential of user-validated distributed recommender systems as a practical approach to social recommendation, with broader applicability to distributed and decentralized recommendations. Code: https://github.com/agiresearch/DeSocial.

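Title / Abstract (translated from Chinese)

Title: From Aggregation to Selection: User-Validated Distributed Social Recommendation

Social recommender systems foster social connections by identifying potential friends for users. Each user maintains a local social network centered on themselves, which yields a naturally distributed social structure. Research on distributed modeling for social recommender systems has attracted growing attention because it aligns naturally with the user-centric structure of user interactions. Current distributed social recommender systems rely on automatically combining the predictions of multiple models and often overlook the user's active role in validating whether suggested connections are appropriate. Moreover, recommendation decisions are validated by individual users rather than derived from a single global ordering of candidates. As a result, standard ranking-based evaluation metrics struggle to assess whether a user-confirmed recommendation decision is correct. To address these limitations, we propose DeSocial, a distributed social recommendation framework with user validation. DeSocial lets users select recommendation algorithms to validate their potential connections, with verification carried out by majority consensus among multiple independent user validators. To evaluate a distributed recommender system with user validators, we formulate the setting as a link prediction and verification task and introduce Acc@K, a consensus-based evaluation metric that measures whether user-approved recommendations are correct. Experiments on 4 real-world social networks show that DeSocial improves decision correctness and robustness compared with single-point and distributed baselines. These findings highlight the potential of user-validated distributed recommender systems as a practical approach to social recommendation, with broader applicability to distributed and decentralized recommendation.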

Summary

The paper proposes DeSocial, a user-validated distributed social recommendation framework that allows users to select and validate potential connections through majority consensus. This approach addresses the limitations of existing systems by incorporating user validation, which is evaluated using a consensus-based metric Acc@K. Experiments on four real-world social networks demonstrate that DeSocial outperforms single-point and distributed baselines in terms of decision correctness and robustness.

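The study proposes DeSocial, a user-validated distributed social recommendation framework that lets users select and validate potential connections through majority consensus. Experiments on four real-world social networks show that it improves decision correctness and robustness over single-point and distributed baselines. The accuracy of user-approved recommendations is measured with the consensus-based metric Acc@K, which shows significant improvement.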
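The verification step rests on majority consensus among independent validators. The sketch below shows that idea in its simplest form, plus a consensus-accuracy score over labeled cases. It is a simplified stand-in, not the paper's definition of Acc@K (which is not spelled out in the abstract); the tie-breaking rule and the example votes are assumptions.

```python
from typing import List, Sequence, Tuple


def majority_vote(votes: Sequence[bool]) -> bool:
    """Approve a candidate link only if a strict majority of validators approve."""
    return sum(votes) * 2 > len(votes)  # assumption: ties count as rejection


def consensus_accuracy(cases: List[Tuple[Sequence[bool], bool]]) -> float:
    """Fraction of cases where the consensus decision matches the ground truth."""
    correct = sum(majority_vote(votes) == truth for votes, truth in cases)
    return correct / len(cases)


# Invented cases: (validator votes, whether the link truly exists).
cases = [
    ([True, True, False], True),
    ([False, False, True], False),
    ([True, False, False], True),   # consensus rejects a true link
    ([True, True, True], True),
]
acc = consensus_accuracy(cases)
```

Using an odd number of validators avoids ties altogether, which keeps the decision rule independent of the tie-breaking assumption.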

ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes

Authors: Emily Steiner, Jianhao Zheng, Henry Howard-Jenkins, Chris Xie, Iro Armeni

First: 2026-01-16T18:45:19+00:00 · Latest: 2026-01-16T18:45:19+00:00

Abs · PDF · Code1 · Code2

Abstract

Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.

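Title / Abstract (translated from Chinese)

Title: ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes

Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires keeping instance identities temporally consistent across intermittently captured 3D scans, even when the changes themselves are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. The setting challenges existing 3DSIS methods, which need a discrete matching step because they lack temporal reasoning, and 4D LiDAR approaches, which perform poorly because they rely on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a new method that adapts 3DSIS architectures to 4DSIS without requiring dense observations. It explores strategies for sharing information across observations and shows that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate the task, we define a new metric, t-mAP, which extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.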

Summary

The research aims to capture the dynamic changes in indoor environments by maintaining consistent instance identities across temporally sparse 3D scans. The method, ReScene4D, adapts 3D semantic instance segmentation (3DSIS) architectures to handle 4D semantic instance segmentation (4DSIS) without requiring dense observations. It introduces a new metric, t-mAP, to evaluate temporal identity consistency and demonstrates superior performance on the 3RScan dataset, setting a new benchmark for evolving indoor scene understanding.

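The research tackles the challenge of segmenting and tracking object instances in changing indoor 3D scenes by introducing the task of 4D indoor semantic instance segmentation (SIS). The proposed ReScene4D method adapts 3DSIS architectures to temporally sparse data and shows improved instance tracking and standard 3DSIS quality. It uses the new t-mAP metric to evaluate temporal identity consistency and achieves state-of-the-art performance on the 3RScan dataset.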

Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them

Authors: Jiahe Jin, Abhijay Paladugu, Chenyan Xiong

First: 2025-10-08T00:20:35+00:00 · Latest: 2026-01-16T18:30:29+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Agentic search requires large language models (LLMs) to perform multi-step search to solve complex information-seeking tasks, imposing unique challenges on their reasoning capabilities. However, what constitutes effective reasoning for agentic search and how it can be learned remains unclear. In this work, we first investigate the reasoning behaviors that enable success in agentic search. By comparing successful and failed trajectories via an LLM-based analysis pipeline, we identify four beneficial behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Building on this, we propose Behavior Priming, a training approach that equips agentic search models with these reasoning behaviors before reinforcement learning (RL). Specifically, it first performs supervised fine-tuning (SFT) on collected trajectories exhibiting the identified behaviors to cultivate these behaviors, and then applies standard RL to further improve task performance. Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct show that Behavior Priming yields relative improvements over direct RL by 37.2% on three web benchmarks and 6.2% on seven multi-hop QA benchmarks, and outperforms the SFT-then-RL baseline using outcome-correct trajectories for fine-tuning. Crucially, we show that these reasoning behaviors matter more than outcome correctness in the priming stage prior to RL. Further analysis reveals that Behavior Priming enhances exploration (pass@8) and test-time scaling (search step number), providing a robust foundation for RL. Our code is available at https://github.com/cxcscmu/Behavior-Priming-for-Agentic-Search.

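Title / Abstract (translated from Chinese)

Title: Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them

Agentic search requires large language models (LLMs) to perform multi-step search to solve complex information-seeking tasks, which places unique demands on their reasoning capabilities. However, what constitutes effective reasoning for agentic search, and how it can be learned, remains unclear. In this work, we first study the reasoning behaviors that make agentic search succeed. By comparing successful and failed trajectories with an LLM-based analysis pipeline, we identify four beneficial behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Building on this, we propose Behavior Priming, a training approach that equips agentic search models with these reasoning behaviors before reinforcement learning (RL). Specifically, it first performs supervised fine-tuning (SFT) on trajectories that exhibit the identified behaviors in order to cultivate them, and then applies standard RL to further improve task performance. Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct show that Behavior Priming improves over direct RL by 37.2% on three web benchmarks and 6.2% on seven multi-hop QA benchmarks, and outperforms the SFT-then-RL baseline that fine-tunes on outcome-correct trajectories. Crucially, we show that in the priming stage before RL these reasoning behaviors matter more than outcome correctness. Further analysis shows that Behavior Priming enhances exploration (pass@8) and test-time scaling (number of search steps), providing a solid foundation for RL. Code: https://github.com/cxcscmu/Behavior-Priming-for-Agentic-Search/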

Summary

This study investigates the reasoning behaviors necessary for successful agentic search, identifying four beneficial behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. It proposes a training approach called Behavior Priming, which involves supervised fine-tuning on trajectories exhibiting these behaviors followed by reinforcement learning. Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct show that Behavior Priming yields relative improvements over direct reinforcement learning of 37.2% on three web benchmarks and 6.2% on seven multi-hop QA benchmarks, and outperforms the SFT-then-RL baseline that uses outcome-correct trajectories for fine-tuning. The study highlights that, in the priming stage before reinforcement learning, these reasoning behaviors matter more than outcome correctness.

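The study examines the reasoning behaviors needed for successful agentic search and identifies four beneficial ones: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. It proposes a training approach called Behavior Priming, which performs supervised fine-tuning on trajectories that exhibit these behaviors and then applies reinforcement learning. Experiments show that, relative to direct reinforcement learning, the approach improves task performance by 37.2% on web benchmarks and 6.2% on multi-hop QA benchmarks, and that it outperforms the SFT-then-RL baseline fine-tuned on outcome-correct trajectories. The study stresses that, before reinforcement learning, these reasoning behaviors matter more than outcome correctness.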
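Exploration is reported as pass@8. The abstract does not say how that number is estimated; a widely used choice is the unbiased pass@k estimator introduced with the HumanEval benchmark, sketched below under that assumption. The sample counts in the example are invented.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n attempts with c successes, is a success."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k-subset contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical: 16 trajectories sampled for one question, 2 of them successful.
estimate = pass_at_k(n=16, c=2, k=8)
```

A model can raise pass@8 without raising single-sample accuracy simply by producing more diverse trajectories, which is why it is read as an exploration signal before RL.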

The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents

Authors: Eilam Shapira, Roi Reichart, Moshe Tennenholtz

First: 2026-01-16T18:18:03+00:00 · Latest: 2026-01-16T18:18:03+00:00

Abs · PDF · Code1 · Code2

Abstract

The integration of AI agents into economic markets fundamentally alters the landscape of strategic interaction. We investigate the economic implications of expanding the set of available technologies in three canonical game-theoretic settings: bargaining (resource division), negotiation (asymmetric information trade), and persuasion (strategic information transmission). We find that simply increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes, often creating incentives for regulators to proactively develop and release technologies. Conversely, we identify a strategic phenomenon termed the "Poisoned Apple" effect: an agent may release a new technology, which neither they nor their opponent ultimately uses, solely to manipulate the regulator's choice of market design in their favor. This strategic release improves the releaser's welfare at the expense of their opponent and the regulator's fairness objectives. Our findings demonstrate that static regulatory frameworks are vulnerable to manipulation via technology expansion, necessitating dynamic market designs that adapt to the evolving landscape of AI capabilities.

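Title / Abstract (translated from Chinese)

Title: The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents

Integrating AI agents into economic markets fundamentally changes the landscape of strategic interaction. We study the economic implications of expanding the set of available technologies in three canonical game-theoretic settings: bargaining (resource division), negotiation (asymmetric-information trade), and persuasion (strategic information transmission). We find that merely increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes, often giving regulators an incentive to proactively develop and release technologies. Conversely, we identify a strategic phenomenon we call the "Poisoned Apple" effect: an agent may release a new technology that neither they nor their opponent ultimately uses, solely to manipulate the regulator's choice of market design in their favor. This strategic release raises the releaser's welfare at the expense of the opponent and of the regulator's fairness objectives. Our findings show that static regulatory frameworks are vulnerable to manipulation through technology expansion, and that dynamic market designs are needed to adapt to evolving AI capabilities.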

Summary

The paper explores how the expansion of AI technologies affects strategic interactions in economic markets. By examining bargaining, negotiation, and persuasion scenarios, the study reveals that increasing the number of AI options can significantly alter equilibrium outcomes and regulatory decisions. A key finding is the 'Poisoned Apple' effect, where an agent releases a new technology to influence the regulator's market design choices, benefiting themselves at the expense of their opponent and regulatory fairness. This highlights the need for dynamic regulatory frameworks to counteract such manipulations.

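The paper examines how the expansion of AI technologies in economic markets can be used to strategically manipulate market outcomes. Analyzing bargaining, negotiation, and persuasion settings, it finds that adding AI options can significantly change equilibrium payoffs. It introduces the "Poisoned Apple" effect, in which an agent releases a new technology that neither they nor their opponent uses but that influences the regulator's market design, raising the releaser's welfare at the expense of the opponent and the regulator's fairness objectives. This shows that static regulatory frameworks are vulnerable to manipulation through technology expansion and that dynamic market designs are needed to adapt to evolving AI capabilities.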

CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation

Authors: Vanshali Sharma, Andrea Mia Bejar, Gorkem Durak, Ulas Bagci

Venue: ISBI 2026

First: 2026-01-16T18:09:19+00:00 · Latest: 2026-01-16T18:09:19+00:00

Comments: Accepted at ISBI 2026

Abs · PDF · Code1 · Code2

Abstract

In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework to assess their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, a first unified metric assessment framework with three modules determining the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using our novel framework, we found that lexical NLG metrics are highly sensitive to stylistic variations; GREEN Score aligns best with expert judgments (Spearman~0.70), while CRG shows negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, code, and allowable portion of the anonymized evaluation data (rephrased/error-injected CT reports), to facilitate reproducible benchmarking and future metric development.

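Title / Abstract (translated from Chinese)

Title: CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation

In the generative AI era, even as critical medical tasks are increasingly automated, radiology report generation (RRG) still relies on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active research area, yet it remains challenging because there is no unified, well-defined framework for assessing their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, a unified metric assessment framework with three modules that determine the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using our framework, we find that lexical NLG metrics are highly sensitive to stylistic variation; GREEN Score aligns best with expert judgments (a Spearman correlation of about 0.70), whereas CRG shows negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, code, and a portion of the anonymized evaluation data (rephrased and error-injected CT reports) to facilitate reproducible benchmarking and future metric development.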

Summary

CTest-Metric is a unified framework designed to assess the clinical validity of metrics for CT report generation. It includes three modules: Writing Style Generalizability, Synthetic Error Injection, and Metrics-vs-Expert correlation. The study evaluated eight metrics across seven LLMs and found that lexical NLG metrics are sensitive to stylistic variations, GREEN Score best aligns with expert judgments, CRG shows a negative correlation, and BERTScore-F1 is least sensitive to factual errors.

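CTest-Metric is a unified framework for assessing the clinical validity of metrics for CT report generation. It evaluates metrics through three modules: Writing Style Generalizability, Synthetic Error Injection, and Metrics-vs-Expert correlation. The study finds that lexical NLG metrics are highly sensitive to stylistic variation, that GREEN Score aligns best with expert judgments, and that BERTScore-F1 is least sensitive to factual error injection.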
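The Metrics-vs-Expert module reports Spearman correlation between automatic metric scores and clinician ratings. The following dependency-free sketch computes Spearman's rho as the Pearson correlation of average ranks, which handles ties. The scores and ratings are invented, and nothing here reproduces the paper's data or pipeline.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: one metric's scores and clinician ratings for five reports.
metric_scores = [0.31, 0.55, 0.42, 0.78, 0.60]
expert_ratings = [2, 3, 3, 5, 4]
rho = spearman(metric_scores, expert_ratings)
```

Rank correlation is a natural fit here because clinician ratings are ordinal and a metric only needs to order reports the way experts do, not match their scale.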

Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

Authors: Shuo Cheng, Liqian Ma, Zhenyang Chen, Ajay Mandlekar, Caelan Garrett, Danfei Xu

Venue: NeurIPS 2025

First: 2025-09-23T04:32:53+00:00 · Latest: 2026-01-16T18:05:09+00:00

Comments: Accepted to NeurIPS 2025

Abs · PDF · Code1 · Code2 · Project1

Abstract

Behavior cloning has shown promise for robot manipulation, but real-world demonstrations are costly to acquire at scale. While simulated data offers a scalable alternative, particularly with advances in automated demonstration generation, transferring policies to the real world is hampered by various simulation and real domain gaps. In this work, we propose a unified sim-and-real co-training framework for learning generalizable manipulation policies that primarily leverages simulation and only requires a few real-world demonstrations. Central to our approach is learning a domain-invariant, task-relevant feature space. Our key insight is that aligning the joint distributions of observations and their corresponding actions across domains provides a richer signal than aligning observations (marginals) alone. We achieve this by embedding an Optimal Transport (OT)-inspired loss within the co-training framework, and extend this to an Unbalanced OT framework to handle the imbalance between abundant simulation data and limited real-world examples. We validate our method on challenging manipulation tasks, showing it can leverage abundant simulation data to achieve up to a 30% improvement in the real-world success rate and even generalize to scenarios seen only in simulation. Project webpage: https://ot-sim2real.github.io/.

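Title / Abstract (translated from Chinese)

Title: Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

Behavior cloning has shown promise for robot manipulation, but real-world demonstrations are costly to collect at scale. Simulated data offers a scalable alternative, especially with advances in automated demonstration generation, yet transferring policies to the real world is hampered by various gaps between the simulation and real domains. In this work, we propose a unified sim-and-real co-training framework for learning generalizable manipulation policies that relies mainly on simulation and needs only a few real-world demonstrations. At the core of our approach is learning a domain-invariant, task-relevant feature space. Our key insight is that aligning the joint distributions of observations and their corresponding actions across domains provides a richer signal than aligning observations (marginals) alone. We achieve this by embedding an Optimal Transport (OT)-inspired loss in the co-training framework and extend it to an Unbalanced OT framework to handle the imbalance between abundant simulation data and limited real-world examples. We validate the method on challenging manipulation tasks, showing that it can exploit abundant simulation data to improve the real-world success rate by up to 30% and can even generalize to scenarios seen only in simulation. Project webpage: https://ot-sim2real.github.io/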

Summary

This paper addresses the challenge of transferring robot manipulation policies from simulation to the real world by proposing a unified sim-and-real co-training framework. The method focuses on learning a domain-invariant feature space and uses an Optimal Transport (OT)-inspired loss to align the joint distributions of observations and actions across domains. Experiments show that the proposed approach can significantly improve real-world success rates, achieving up to a 30% improvement and even generalizing to scenarios seen only in simulation.

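The paper proposes a unified sim-and-real co-training framework to address the transfer of robot manipulation policies from simulation to the real world. The method focuses on learning a domain-invariant feature space and uses an Optimal Transport (OT)-inspired loss to align the joint distributions of observations and actions across the simulation and real domains. Experiments show that the approach can exploit abundant simulation data to improve the real-world success rate by up to 30% and can generalize to scenarios seen only in simulation.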
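To make the "joint versus marginal" point concrete, here is a small sketch that measures a transport cost between simulated and real samples, where each sample concatenates observation features with the action. It uses balanced entropic OT solved by Sinkhorn iterations, a standard technique swapped in for illustration; the paper's actual loss and its unbalanced extension are not reproduced. All sample values, `eps`, and `iters` are assumptions.

```python
import math
from typing import List, Sequence


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sinkhorn_plan(cost: List[List[float]], eps: float = 0.5, iters: int = 200) -> List[List[float]]:
    """Entropy-regularized OT plan between two uniform distributions."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def joint_ot_cost(sim: List[Sequence[float]], real: List[Sequence[float]]) -> float:
    cost = [[sq_dist(s, r) for r in real] for s in sim]
    plan = sinkhorn_plan(cost)
    return sum(plan[i][j] * cost[i][j] for i in range(len(sim)) for j in range(len(real)))


# Each sample is (obs_feature_1, obs_feature_2, action); values are invented.
sim = [(0.0, 0.0, 1.0), (1.0, 1.0, -1.0)]
real = [(0.1, 0.0, 1.0), (0.9, 1.0, -1.0)]
# Same observations as `real`, but with the actions swapped.
real_swapped = [(0.1, 0.0, -1.0), (0.9, 1.0, 1.0)]

matched = joint_ot_cost(sim, real)
swapped = joint_ot_cost(sim, real_swapped)
```

The observation features of `real` and `real_swapped` are identical, so an observation-only alignment could not tell them apart; the joint cost rises sharply once the actions disagree, which is the extra signal the abstract refers to.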

Health Facility Location in Ethiopia: Leveraging LLMs to Integrate Expert Knowledge into Algorithmic Planning

Authors: Yohai Trabelsi, Guojun Xiong, Fentabil Getnet, Stéphane Verguet, Milind Tambe

First: 2026-01-16T18:02:09+00:00 · Latest: 2026-01-16T18:02:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Ethiopia's Ministry of Health is upgrading health posts to improve access to essential services, particularly in rural areas. Limited resources, however, require careful prioritization of which facilities to upgrade to maximize population coverage while accounting for diverse expert and stakeholder preferences. In collaboration with the Ethiopian Public Health Institute and Ministry of Health, we propose a hybrid framework that systematically integrates expert knowledge with optimization techniques. Classical optimization methods provide theoretical guarantees but require explicit, quantitative objectives, whereas stakeholder criteria are often articulated in natural language and difficult to formalize. To bridge these domains, we develop the Large language model and Extended Greedy (LEG) framework. Our framework combines a provable approximation algorithm for population coverage optimization with LLM-driven iterative refinement that incorporates human-AI alignment to ensure solutions reflect expert qualitative guidance while preserving coverage guarantees. Experiments on real-world data from three Ethiopian regions demonstrate the framework's effectiveness and its potential to inform equitable, data-driven health system planning.

中文标题/摘要

标题：埃塞俄比亚卫生设施位置优化：利用大语言模型整合专家知识

埃塞俄比亚卫生部正在升级卫生站以改善基本服务的可及性，特别是在农村地区。然而，有限的资源要求在升级哪些设施时进行仔细优先排序，以最大化人口覆盖率并考虑多样化的专家和利益相关者偏好。与埃塞俄比亚公共卫生研究所和卫生部合作，我们提出了一种混合框架，系统地将专家知识与优化技术相结合。经典优化方法提供了理论保证，但需要明确的、量化的目标，而利益相关者的标准通常用自然语言表达且难以形式化。为了弥合这些领域之间的差距，我们开发了大型语言模型和扩展贪婪（LEG）框架。该框架结合了用于人口覆盖率优化的可证明近似算法以及由大语言模型驱动的迭代改进，确保解决方案反映专家的定性指导，同时保持覆盖率保证。在三个埃塞俄比亚地区的实际数据上进行的实验表明，该框架的有效性及其对公平、数据驱动的卫生系统规划的潜在影响。

Summary / 总结

The research aims to improve access to essential health services in Ethiopia by optimizing the upgrade of health facilities, especially in rural areas. The proposed LEG framework integrates expert knowledge with optimization techniques to address the challenge of limited resources. The framework combines a provable algorithm for population coverage optimization with iterative refinement using a large language model to ensure solutions align with expert qualitative guidance while maintaining theoretical guarantees. Experiments on real-world data from three Ethiopian regions show the framework's effectiveness in prioritizing facility upgrades to maximize population coverage and support equitable health system planning.

研究旨在通过优化卫生设施升级来改善埃塞俄比亚的医疗服务，特别是在农村地区。提出了一种名为大型语言模型和扩展贪婪（LEG）的混合框架，该框架将专家知识与优化技术相结合。该框架结合了一个可证明的算法来优化人口覆盖率，并通过LLM驱动的迭代改进来确保解决方案与专家的定性指导保持一致，同时保持理论上的保证。在三个埃塞俄比亚地区的实际数据上进行的实验表明，该框架在实现公平和数据驱动的卫生系统规划方面的有效性。
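
The coverage-optimization side of such a framework can be illustrated with the classic greedy heuristic for weighted maximum coverage, which has a well-known (1 - 1/e) approximation guarantee. This is only a sketch under assumptions: the abstract does not specify the exact algorithm used in LEG, and the facility names, village names, and population figures below are invented for illustration.

```python
def greedy_upgrade(coverage, population, budget):
    """Pick up to `budget` facilities, each time taking the one that adds
    the most not-yet-covered population (standard greedy max coverage)."""
    chosen, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0
        for facility, villages in coverage.items():
            if facility in chosen:
                continue
            gain = sum(population[v] for v in villages - covered)
            if gain > best_gain:
                best, best_gain = facility, gain
        if best is None:       # nothing left that adds coverage
            break
        chosen.append(best)
        covered |= coverage[best]
    return chosen, sum(population[v] for v in covered)

# Hypothetical data: which villages each health post would serve if upgraded.
coverage = {"A": {"v1", "v2"}, "B": {"v2", "v3"}, "C": {"v4"}}
population = {"v1": 100, "v2": 300, "v3": 250, "v4": 50}

chosen, total = greedy_upgrade(coverage, population, budget=2)
print(chosen, total)  # ['B', 'A'] 650
```

An LLM-driven refinement step, as described in the abstract, would then adjust a selection like this one in response to expert guidance stated in natural language, while checking that coverage does not fall below an acceptable level.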

What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study

Authors: Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui

First: 2025-06-14T15:26:31+00:00 · Latest: 2026-01-16T17:59:34+00:00

Abs · PDF · Code1 · Code2

Abstract

Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12$\times$ faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.

中文标题/摘要

标题：什么使得好的语音分词器适用于以LLM为中心的语音生成？一项系统研究

语音语言模型（SLMs）为统一语音和文本的理解与生成提供了有希望的途径。然而，在实现有效的跨模态对齐和高质量的语音生成方面仍存在挑战。在本工作中，我们系统地研究了以LLM为中心的SLMs中的语音分词器设计的作用，这些设计通过语音头和说话人建模进行增强。我们在公平的SLM框架下比较了耦合、半解耦和完全解耦的语音分词器，并发现解耦分词显著提高了对齐和合成质量。为了解决语音和文本之间信息密度的不匹配，我们引入了多令牌预测（MTP）到SLMs中，使每个隐藏状态能够解码多个语音令牌。这导致了高达12倍的解码速度提升，并且词错误率大幅下降（从6.07降至3.01）。此外，我们提出了一种基于说话人的生成范式，并引入了RoleTriviaQA，这是一个包含多种说话人身份的大规模角色扮演知识问答基准。实验表明，我们的方法增强了知识理解和说话人一致性。

Summary / 总结

This study investigates the impact of different speech tokenizer designs on LLM-centric SLMs, showing that decoupled tokenization improves alignment and synthesis quality. The introduction of multi-token prediction (MTP) enables faster decoding and reduces word error rate. Additionally, a speaker-aware generation paradigm and RoleTriviaQA benchmark are proposed to enhance knowledge understanding and speaker consistency in speech generation models.

研究探讨了不同语音分词设计对LLM为中心的SLM的影响，发现解耦分词可以提高对齐和合成质量。引入多令牌预测（MTP）提高了解码速度并降低了词错误率，而说话人感知的生成范式和RoleTriviaQA基准进一步提高了知识理解和说话人一致性。

UCB-type Algorithm for Budget-Constrained Expert Learning

Authors: Ilgam Latypov, Alexandra Suvorikova, Alexey Kroshnin, Alexander Gasnikov, Yuriy Dorn

First: 2025-10-26T12:36:17+00:00 · Latest: 2026-01-16T17:59:33+00:00

Abs · PDF · Code1 · Code2

Abstract

In many modern applications, a system must dynamically choose between several adaptive learning algorithms that are trained online. Examples include model selection in streaming environments, switching between trading strategies in finance, and orchestrating multiple contextual bandit or reinforcement learning agents. At each round, a learner must select one predictor among $K$ adaptive experts to make a prediction, while being able to update at most $M \le K$ of them under a fixed training budget. We address this problem in the stochastic setting and introduce M-LCB, a computationally efficient UCB-style meta-algorithm that provides anytime regret guarantees. Its confidence intervals are built directly from realized losses, require no additional optimization, and seamlessly reflect the convergence properties of the underlying experts. If each expert achieves internal regret $\tilde O(T^{\alpha})$, then M-LCB ensures overall regret bounded by $\tilde O\bigl(\sqrt{KT/M} + (K/M)^{1-\alpha}\,T^{\alpha}\bigr)$. To our knowledge, this is the first result establishing regret guarantees when multiple adaptive experts are trained simultaneously under per-round budget constraints. We illustrate the framework with two representative cases: (i) parametric models trained online with stochastic losses, and (ii) experts that are themselves multi-armed bandit algorithms. These examples highlight how M-LCB extends the classical bandit paradigm to the more realistic scenario of coordinating stateful, self-learning experts under limited resources.
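
As a quick sanity check on the regret bound quoted in the abstract, two special cases follow by direct substitution. These are illustrative simplifications of the stated expression, not additional results claimed by the authors.

```latex
% Case 1: full budget, M = K (every expert can be updated each round).
\tilde O\!\Bigl(\sqrt{\tfrac{KT}{M}} + (K/M)^{1-\alpha}\,T^{\alpha}\Bigr)\Big|_{M=K}
  = \tilde O\!\bigl(\sqrt{T} + T^{\alpha}\bigr)

% Case 2: experts with internal regret exponent \alpha = 1/2.
\tilde O\!\Bigl(\sqrt{\tfrac{KT}{M}} + (K/M)^{1/2}\,T^{1/2}\Bigr)
  = \tilde O\!\Bigl(2\sqrt{\tfrac{KT}{M}}\Bigr)
  = \tilde O\!\Bigl(\sqrt{\tfrac{KT}{M}}\Bigr)
```

In the second case the two terms coincide, so the budget enters only through the ratio $K/M$.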

Generative Scenario Rollouts for End-to-End Autonomous Driving

Authors: Rajeev Yasarla, Deepti Hegde, Shizhong Han, Hsin-Pai Cheng, Yunxiao Shi, Meysam Sadeghigooghari, Shweta Mahajan, Apratim Bhattacharyya, Litian Liu, Risheek Garrepalli, Thomas Svantesson, Fatih Porikli, Hong Cai

First: 2026-01-16T17:59:28+00:00 · Latest: 2026-01-16T17:59:28+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness. These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.

中文标题/摘要

标题：生成场景展开在端到端自动驾驶中的应用

视觉-语言-动作（VLA）模型正在成为端到端自动驾驶系统中高度有效的规划模型。然而，当前的工作主要依赖于稀疏轨迹注释的模仿学习，并且未能充分利用其作为生成模型的潜力。我们提出了生成场景展开（GeRo），这是一种即插即用的框架，通过自回归展开策略联合执行基于语言的未来交通场景的规划和生成。首先，训练一个VLA模型将自我车辆和代理的动力学编码为在规划、运动和语言任务监督下的潜在标记，促进文本对齐的生成。接下来，GeRo执行基于语言的自回归生成。给定多视角图像、场景描述和自我动作问题，它生成未来潜在标记和文本响应以引导长期展开。展开一致性损失使用真实值或伪标签稳定预测，减轻漂移并保持文本-动作对齐。这种设计使GeRo能够执行时间一致、基于语言的展开，支持长期推理和多智能体规划。在Bench2Drive上，GeRo的驾驶得分和成功率分别提高了15.7和26.2。通过将强化学习与生成展开相结合，GeRo实现了最先进的闭环和开环性能，展示了强大的零样本鲁棒性。这些结果突显了生成、基于语言推理作为端到端自动驾驶安全性和可解释性基础的潜力。

Summary / 总结

The research aims to enhance end-to-end autonomous driving systems by using Vision-Language-Action (VLA) models as generative models. GeRo, a plug-and-play framework, jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. It trains VLA models to encode ego vehicle and agent dynamics into latent tokens, performs language-conditioned autoregressive generation, and stabilizes predictions with a rollout-consistency loss. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively, and achieves state-of-the-art closed-loop and open-loop performance with strong zero-shot robustness.

研究旨在通过利用Vision-Language-Action (VLA) 模型作为生成模型来提升端到端的自动驾驶系统。GeRo 是一个即插即用的框架，通过自回归展开策略结合规划和生成。它训练VLA模型将动态编码为潜在令牌，并执行基于语言的自回归生成。GeRo 在Bench2Drive 上将驾驶得分和成功率分别提高了15.7和26.2，并且实现了最先进的性能，具有强大的零样本鲁棒性。

Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity

Authors: Hauke Licht

First: 2025-12-11T18:11:46+00:00 · Latest: 2026-01-16T17:56:16+00:00

Abs · PDF · Code1 · Code2

Abstract

Research increasingly leverages audio-visual materials to analyze emotions in political communication. Multimodal large language models (mLLMs) promise to enable such analyses through in-context learning. However, we lack systematic evidence on whether these models can reliably measure emotions in real-world political settings. This paper evaluates leading mLLMs for video-based emotional arousal measurement using two complementary human-labeled video datasets: recordings created under laboratory conditions and real-world parliamentary debates. I find a critical lab-vs-field performance gap. In video created under laboratory conditions, mLLMs arousal scores approach human-level reliability with little to no demographic bias. However, in parliamentary debate recordings, all examined models' arousal scores correlate at best moderately with average human ratings and exhibit systematic bias by speaker gender and age. Neither relying on leading closed-source mLLMs nor computational noise mitigation strategies change this finding. Further, mLLMs underperform even in sentiment analysis when using video recordings instead of text transcripts of the same speeches. These findings reveal important limitations of current mLLMs for real-world political video analysis and establish a rigorous evaluation framework for tracking future developments.

中文标题/摘要

标题：使用多模态大语言模型进行计算情感分析：新兴方法学机会的现有证据

研究越来越多地利用音频-视觉材料来分析政治沟通中的情感。多模态大语言模型（mLLMs）有望通过上下文学习来实现此类分析。然而，我们缺乏系统证据表明这些模型是否能在现实世界的政治环境中可靠地测量情感。本文使用两个互补的人标注视频数据集评估了领先mLLMs在基于视频的情感唤醒测量：实验室条件下创建的视频记录和实际议会辩论记录。我发现实验室与现场表现存在关键差异。在实验室条件下创建的视频中，mLLMs的情感唤醒评分接近人类水平的可靠性，几乎没有人口统计学偏差。但在议会辩论记录中，所有检查的模型的情感唤醒评分与平均人类评分的相关性最多为中等，并且表现出系统性偏差，按发言者性别和年龄。无论依赖领先的闭源mLLMs还是计算噪声缓解策略，这一发现都不会改变。此外，当使用视频记录而非相同演讲的文字转录时，mLLMs在情感分析中的表现甚至不如在文本转录中。这些发现揭示了当前mLLMs在现实世界政治视频分析中的重要局限性，并建立了跟踪未来发展的严格评估框架。

Summary / 总结

This paper evaluates multimodal large language models (mLLMs) for measuring emotional arousal in video recordings of political communication. It uses two human-labeled datasets: videos created under laboratory conditions and real-world parliamentary debates. In the laboratory videos, mLLM arousal scores approach human-level reliability with little to no demographic bias; in the parliamentary recordings, they correlate at best moderately with average human ratings and show systematic bias by speaker gender and age. The models also underperform in sentiment analysis when given video rather than text transcripts of the same speeches. The findings highlight important limitations of current mLLMs for real-world political video analysis and provide an evaluation framework for tracking future developments.

本文评估了多模态大型语言模型（mLLMs）在测量政治沟通视频中情感唤醒方面的表现。研究使用了两个数据集：实验室创建的视频和实际议会辩论。研究发现，mLLMs在实验室环境中表现良好，但在实际世界环境中表现较差，显示出较低的可靠性和与演讲者性别和年龄相关的系统性偏差。研究结果揭示了当前mLLMs在实际政治视频分析中的重要局限性，并建议需要进一步发展。

Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs

Authors: Alessandro Padella, Massimiliano de Leoni, Marlon Dumas

First: 2026-01-16T17:54:55+00:00 · Latest: 2026-01-16T17:54:55+00:00

Comments: 19 pages, 4 figures, TMIS journal submission

Abs · PDF · Code1 · Code2

Abstract

Predictive Process Monitoring is a branch of process mining that aims to predict the outcome of an ongoing process. Recently, it leveraged machine-and-deep learning architectures. In this paper, we extend our prior LLM-based Predictive Process Monitoring framework, which was initially focused on total time prediction via prompting. The extension consists of comprehensively evaluating its generality, semantic leverage, and reasoning mechanisms, also across multiple Key Performance Indicators. Empirical evaluations conducted on three distinct event logs and across the Key Performance Indicators of Total Time and Activity Occurrence prediction indicate that, in data-scarce settings with only 100 traces, the LLM surpasses the benchmark methods. Furthermore, the experiments also show that the LLM exploits both its embodied prior knowledge and the internal correlations among training traces. Finally, we examine the reasoning strategies employed by the model, demonstrating that the LLM does not merely replicate existing predictive methods but performs higher-order reasoning to generate the predictions.

中文标题/摘要

标题：探索基于LLM的预测过程监控在小型事件日志中的特征

预测过程监控是过程挖掘的一个分支，旨在预测正在进行过程的结果。最近，它利用了机器学习和深度学习架构。在本文中，我们扩展了我们之前基于LLM的预测过程监控框架，最初专注于通过提示进行总时间预测。扩展包括全面评估其通用性、语义利用和推理机制，跨越多个关键绩效指标。在三个不同的事件日志和总时间和活动发生预测的关键绩效指标上进行的实证评估表明，在只有100条轨迹的数据稀缺设置中，LLM超过了基准方法。此外，实验还表明，LLM 利用了其内在的知识和训练轨迹之间的内部关联。最后，我们研究了模型采用的推理策略，证明LLM 不仅复制现有的预测方法，还进行高层次的推理以生成预测。

Summary / 总结

This paper extends a prior LLM-based Predictive Process Monitoring framework initially focused on total time prediction. It evaluates the framework's generality, semantic leverage, and reasoning mechanisms across multiple Key Performance Indicators. Experiments on three event logs show that the LLM outperforms benchmark methods in data-scarce settings with only 100 traces, leveraging both prior knowledge and internal trace correlations. The LLM demonstrates higher-order reasoning for prediction generation rather than merely replicating existing methods.

本文扩展了先前基于LLM的预测过程监控框架，最初专注于总时间预测。它在多个关键绩效指标上评估了该框架的通用性、语义利用和推理机制。实验表明，在只有100条轨迹的数据稀缺环境中，LLM在三个事件日志上优于基准方法，利用了先验知识和训练轨迹间的内部关联。LLM展示了更高层次的推理能力来进行预测生成，而不仅仅是复制现有的预测方法。

MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models

Authors: Xiaoran Fan, Zhichao Sun, Tao Ji, Lixing Shen, Tao Gui

First: 2026-01-16T17:45:34+00:00 · Latest: 2026-01-16T17:45:34+00:00

Abs · PDF · Code1 · Code2

Abstract

As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.

中文标题/摘要

标题：MHA2MLA-VLM：使DeepSeek的经济型多头潜在注意力适用于视觉-语言模型

随着视觉-语言模型（VLMs）处理越来越复杂和多模态的任务，关键-值（KV）缓存的快速增长在推理过程中产生了显著的内存和计算瓶颈。虽然多头潜在注意力（MLA）提供了一种有效的压缩KV缓存和加速推理的方法，但将现有的VLMs适应到MLA架构中而不进行昂贵的预训练仍然鲜有探索。在本文中，我们提出了MHA2MLA-VLM，这是一种参数高效且多模态感知的框架，用于将现成的VLMs转换为MLA。我们的方法包括两个核心技术：（1）一种适应模态的部分-RoPE策略，该策略通过选择性地屏蔽非必要维度支持传统的和多模态设置，（2）一种模态解耦的低秩近似方法，该方法独立地压缩了视觉和文本的KV空间。此外，我们引入了参数高效的微调以最小化适应成本，并证明了最小化输出激活误差而不是参数距离可以显著减少性能损失。在三个代表性VLMs上的广泛实验表明，MHA2MLA-VLM在最少的监督数据下恢复了原始模型性能，显著减少了KV缓存的占用空间，并与KV量化无缝集成。

Summary / 总结

The research aims to address the memory and computational challenges posed by the Key-Value (KV) cache in vision-language models (VLMs) by introducing MHA2MLA-VLM, a parameter-efficient framework for converting existing VLMs to Multi-Head Latent Attention (MLA). The method employs a modality-adaptive partial-RoPE strategy and a modality-decoupled low-rank approximation to compress the KV cache, and uses parameter-efficient fine-tuning to minimize adaptation cost. Experiments on three VLMs show that MHA2MLA-VLM can restore original model performance with minimal supervised data, reduce KV cache size, and integrate well with KV quantization.

研究旨在通过引入MHA2MLA-VLM框架解决视觉-语言模型（VLMs）中关键值（KV）缓存带来的内存和计算挑战，该框架能够将现有的VLMs转换为多头潜在注意力（MLA）。方法采用模态自适应部分-RoPE策略和模态解耦低秩近似来压缩视觉和文本的KV空间，并包含参数高效的微调以最小化适应成本。实验表明，MHA2MLA-VLM可以在少量监督数据下恢复原始性能，显著减小KV缓存大小，并与KV量化无缝集成。
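
To give a rough sense of why converting to MLA shrinks the KV cache, the sketch below compares per-sequence cache sizes for standard multi-head attention and for an MLA-style cache that stores one compressed latent plus a small RoPE key per token. The model dimensions are hypothetical and are not taken from the paper; the MLA cache layout follows the general design described for DeepSeek's MLA, stated here as an assumption.

```python
def kv_cache_bytes_mha(layers, heads, head_dim, seq_len, bytes_per_value=2):
    """Standard MHA: a key and a value vector per head, per layer, per token."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value

def kv_cache_bytes_mla(layers, latent_dim, rope_dim, seq_len, bytes_per_value=2):
    """MLA-style cache (assumed layout): one shared latent plus one RoPE key
    per layer, per token, instead of full per-head keys and values."""
    return layers * (latent_dim + rope_dim) * seq_len * bytes_per_value

# Hypothetical configuration, 16-bit values.
mha = kv_cache_bytes_mha(layers=32, heads=32, head_dim=128, seq_len=4096)
mla = kv_cache_bytes_mla(layers=32, latent_dim=512, rope_dim=64, seq_len=4096)

print(mha, mla, round(mha / mla, 2))  # 2147483648 150994944 14.22
```

The ratio depends only on `2 * heads * head_dim` versus `latent_dim + rope_dim`, which is why the choice of latent rank in the low-rank approximation step drives the memory saving.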

Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations

Authors: Franziska Herbert, Vignesh Prasad, Han Liu, Dorothea Koert, Georgia Chalvatzaki

First: 2026-01-16T17:35:00+00:00 · Latest: 2026-01-16T17:35:00+00:00

Comments: 9 pages, 7 figures, preprint

Abs · PDF · Code1 · Code2

Abstract

Learning structured task representations from human demonstrations is essential for understanding long-horizon manipulation behaviors, particularly in bimanual settings where action ordering, object involvement, and interaction geometry can vary significantly. A key challenge lies in jointly capturing the discrete semantic structure of tasks and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. In this work, we introduce a semantic-geometric task graph-representation that encodes object identities, inter-object relations, and their temporal geometric evolution from human demonstrations. Building on this formulation, we propose a learning framework that combines a Message Passing Neural Network (MPNN) encoder with a Transformer-based decoder, decoupling scene representation learning from action-conditioned reasoning about task progression. The encoder operates solely on temporal scene graphs to learn structured representations, while the decoder conditions on action-context to predict future action sequences, associated objects, and object motions over extended time horizons. Through extensive evaluation on human demonstration datasets, we show that semantic-geometric task graph-representations are particularly beneficial for tasks with high action and object variability, where simpler sequence-based models struggle to capture task progression. Finally, we demonstrate that task graph representations can be transferred to a physical bimanual robot and used for online action selection, highlighting their potential as reusable task abstractions for downstream decision-making in manipulation systems.

中文标题/摘要

标题：从人类示范中学习语义-几何任务图表示

从人类示范中学习结构化任务表示对于理解长时间段操作行为至关重要，特别是在双臂操作环境中，操作顺序、物体参与和交互几何可以显著变化。一个关键挑战在于如何联合捕捉任务的离散语义结构和物体为中心的几何关系随时间的演变，以支持任务进展的推理。在本文中，我们提出了一种语义-几何任务图表示，该表示从人类示范中编码物体身份、物体间关系及其随时间的几何演变。基于此表示，我们提出了一种学习框架，该框架结合了消息传递神经网络（MPNN）编码器和基于变换器的解码器，将场景表示学习与基于动作条件的任务进展推理解耦。编码器仅在时间场景图上操作以学习结构化表示，而解码器根据动作上下文预测未来动作序列、相关物体及其在长时间段内的运动。通过在人类示范数据集上的广泛评估，我们展示了语义-几何任务图表示特别适用于具有高动作和物体变异性任务，其中基于序列的简单模型难以捕捉任务进展。最后，我们展示了任务图表示可以转移到物理双臂机器人并用于在线动作选择，突显了它们作为下游操作系统决策中可重用任务抽象的潜力。

Summary / 总结

This paper addresses the challenge of learning structured task representations from human demonstrations, especially in bimanual manipulation tasks. It introduces a semantic-geometric task graph-representation that captures object identities, inter-object relations, and their temporal evolution. The proposed learning framework uses a Message Passing Neural Network encoder and a Transformer-based decoder to encode scene graphs and predict future actions. Experiments show that this approach is effective in tasks with high action and object variability, outperforming simpler sequence-based models. The task graph representations can also be transferred to a physical robot for online action selection, demonstrating their potential for manipulation systems.

本文解决了从人类演示中学习结构化任务表示的问题，特别是在双臂操作任务中。提出了一种语义-几何任务图表示，能够捕捉物体身份、物体间关系及其随时间的演变。所提出的学习框架使用消息传递神经网络编码器和基于变换器的解码器来编码场景图并预测未来动作。实验表明，该方法在具有高动作和物体变异性任务中表现出色，优于简单的序列模型。这些任务图表示还可以被转移到物理机器人中进行在线动作选择，展示了其在操作系统中作为可重用任务抽象的潜力。

Probabilistic Mission Design for Neuro-Symbolic Unmanned Aircraft Systems

Authors: Simon Kohaut, Benedict Flade, Daniel Ochs, Devendra Singh Dhami, Julian Eggert, Kristian Kersting

First: 2024-12-25T11:04:00+00:00 · Latest: 2026-01-16T17:27:13+00:00

Comments: arXiv admin note: text overlap with arXiv:2406.03454

Abs · PDF · Code1 · Code2

Abstract

Advanced Air Mobility (AAM) is a growing field that demands accurate and trustworthy models of legal concepts and restrictions for navigating Unmanned Aircraft Systems (UAS). In addition, any implementation of AAM needs to face the challenges posed by inherently dynamic and uncertain human-inhabited spaces robustly. Nevertheless, the employment of UAS beyond visual line of sight (BVLOS) is an endearing task that promises to significantly enhance today's logistics and emergency response capabilities. Hence, we propose Probabilistic Mission Design (ProMis), a novel neuro-symbolic approach to navigating UAS within legal frameworks. ProMis is an interpretable and adaptable system architecture that links uncertain geospatial data and noisy perception with declarative, Hybrid Probabilistic Logic Programs (HPLP) to reason over the agent's state space and its legality. To inform planning with legal restrictions and uncertainty in mind, ProMis yields Probabilistic Mission Landscapes (PML). These scalar fields quantify the belief that the HPLP is satisfied across the agent's state space. Extending prior work on ProMis' reasoning capabilities and computational characteristics, we show its integration with potent machine learning models such as Large Language Models (LLM) and Transformer-based vision models. Hence, our experiments underpin the application of ProMis with multi-modal input data and how our method applies to many AAM scenarios.

中文标题/摘要

标题：神经符号无人驾驶航空系统中的概率任务设计

先进空中交通（AAM）是一个快速增长的领域，需要准确可靠的法律概念和限制模型来导航无人驾驶航空系统（UAS）。此外，任何AAM的实现都需要面对动态和不确定的人类居住空间带来的挑战。然而，超越视距（BVLOS）的UAS应用是一个令人向往的任务，有望显著提升当今的物流和应急响应能力。因此，我们提出了概率任务设计（ProMis），这是一种新颖的神经符号方法，用于在法律框架内导航UAS。ProMis是一种可解释且可适应的系统架构，将不确定的地理空间数据和嘈杂的感知与声明性混合概率逻辑程序（HPLP）连接起来，以推理代理的状态空间及其合法性。为了在规划中考虑法律限制和不确定性，ProMis生成了概率任务景观（PML）。这些标量场量化了HPLP在代理状态空间中得到满足的信念。通过扩展ProMis推理能力和计算特性的先前工作，我们展示了其与强大的机器学习模型（如大型语言模型LLM和基于变换器的视觉模型）的集成。因此，我们的实验证明了ProMis在多模态输入数据下的应用及其方法如何应用于许多AAM场景。

Summary / 总结

The research aims to develop a robust navigation system for Unmanned Aircraft Systems (UAS) in complex, dynamic environments, particularly for Beyond Visual Line of Sight (BVLOS) operations. ProMis, a neuro-symbolic approach, integrates uncertain geospatial data and noisy perception with Hybrid Probabilistic Logic Programs to reason about the legality of the UAS's state space. Key findings include the creation of Probabilistic Mission Landscapes that quantify the likelihood of compliance with legal restrictions, demonstrating the system's effectiveness in various AAM scenarios with multi-modal input data.

研究旨在解决在动态和不确定环境中导航无人机系统（UAS）并遵守法律框架的挑战。提出的Probabilistic Mission Design（ProMis）采用神经符号方法，将不确定的地理空间数据和嘈杂的感知与混合概率逻辑程序（HPLP）结合，以推理代理的状态空间。关键实验结果表明，ProMis可以生成概率任务景观（PML），量化法律限制得到满足的信念，并能够与机器学习模型结合处理多模态输入数据。

Predict the Retrieval! Test time adaptation for Retrieval Augmented Generation

Authors: Xin Sun, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang

First: 2026-01-16T17:07:01+00:00 · Latest: 2026-01-16T17:07:01+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing large language models' question-answering capabilities through the integration of external knowledge. However, when adapting RAG systems to specialized domains, challenges arise from distribution shifts, resulting in suboptimal generalization performance. In this work, we propose TTARAG, a test-time adaptation method that dynamically updates the language model's parameters during inference to improve RAG system performance in specialized domains. Our method introduces a simple yet effective approach where the model learns to predict retrieved content, enabling automatic parameter adjustment to the target domain. Through extensive experiments across six specialized domains, we demonstrate that TTARAG achieves substantial performance improvements over baseline RAG systems. Code available at https://github.com/sunxin000/TTARAG.

中文标题/摘要

标题：预测检索！检索增强生成的测试时适应

检索增强生成（RAG）已成为通过集成外部知识增强大型语言模型问答能力的强大方法。然而，当将RAG系统适应到特定领域时，由于分布偏移，会出现挑战，导致性能不佳。在本文中，我们提出了一种测试时适应方法TTARAG，在推理过程中动态更新语言模型参数，以提高RAG系统在特定领域的性能。该方法引入了一种简单而有效的方法，使模型学会预测检索到的内容，从而实现自动参数调整以适应目标领域。通过在六个特定领域的广泛实验，我们证明TTARAG在基线RAG系统上实现了显著的性能提升。代码可在https://github.com/sunxin000/TTARAG获取。

Summary / 总结

The research aims to address the challenges of adapting Retrieval-Augmented Generation (RAG) systems to specialized domains, where distribution shifts can lead to poor performance. The proposed TTARAG method dynamically updates the language model's parameters during inference to predict retrieved content, allowing for automatic parameter adjustments to the target domain. Experiments across six specialized domains show that TTARAG significantly improves RAG system performance compared to baseline methods.

研究旨在解决检索增强生成（RAG）系统在专业化领域中的泛化性能不佳问题。TTARAG 是一种测试时自适应方法，在推理过程中动态更新语言模型的参数以提高 RAG 性能。在六个专业化领域的实验中，TTARAG 较基线 RAG 系统实现了显著的性能提升。

Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps

Authors: Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, Songcen Xu, Long Quan, Eduardo Pérez-Pellitero, Youngkyoon Jang

First: 2026-01-16T17:02:46+00:00 · Latest: 2026-01-16T17:02:46+00:00

Abs · PDF · Code1 · Code2

Abstract

We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on the VSI-Bench.

中文标题/摘要

标题：Map2Thought：通过度量认知图进行明确的三维空间推理

我们提出了Map2Thought框架，该框架使3D VLM能够进行明确且可解释的空间推理。该框架基于两个关键组件：度量认知图（Metric-CogMap）和认知思维链（Cog-CoT）。度量认知图通过将离散网格用于关系推理与连续的度量尺度表示用于精确的几何理解，提供了一种统一的空间表示。基于度量认知图，认知思维链通过确定性操作（包括向量操作、边界框距离以及遮挡感知的外观顺序提示）进行明确的几何推理，生成基于三维结构的可解释推理轨迹。实验结果表明，Map2Thought能够实现可解释的三维理解，仅使用一半的监督数据即可达到59.9%的准确率，接近使用完整数据集训练的基线60.9%。在10%、25%和50%训练子集上，它分别比最先进的方法高出5.3%、4.8%和4.0%的准确率，在VSI-Bench上表现优异。

Summary / 总结

Map2Thought is a framework that adds explicit, interpretable spatial reasoning to 3D vision-language models through a Metric Cognitive Map (Metric-CogMap) and a Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap combines a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometry, and Cog-CoT reasons over it with deterministic operations. Experiments show that Map2Thought achieves 59.9% accuracy with half the supervision, close to the 60.9% baseline trained on the full dataset, and outperforms state-of-the-art methods on VSI-Bench by 5.3%, 4.8%, and 4.0% with 10%, 25%, and 50% training subsets, respectively.

Map2Thought 是一个框架，通过 Metric Cognitive Maps 和 Cognitive Chain-of-Thought 提升 3D 视觉和语言模型的空间推理能力，结合离散和连续的空间表示以实现精确的几何理解和可解释的推理。实验表明，Map2Thought 在仅使用一半监督的情况下达到 59.9% 的准确率，并在不同训练子集大小下分别超越最新方法 5.3%、4.8% 和 4.0%。
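
The abstract lists vector operations and bounding-box distances among the deterministic operations used by Cog-CoT. A minimal sketch of two such operations on axis-aligned 3D boxes is shown below. The function names and the axis-aligned box representation are assumptions made for illustration; the paper's actual operators may differ.

```python
import math

def aabb_distance(a_min, a_max, b_min, b_max):
    """Minimum Euclidean distance between two axis-aligned boxes
    (0.0 if they touch or overlap)."""
    gaps = [max(bn - ax, an - bx, 0.0)
            for an, ax, bn, bx in zip(a_min, a_max, b_min, b_max)]
    return math.sqrt(sum(g * g for g in gaps))

def center_offset(a_min, a_max, b_min, b_max):
    """Vector from the center of box A to the center of box B."""
    return [((bn + bx) - (an + ax)) / 2.0
            for an, ax, bn, bx in zip(a_min, a_max, b_min, b_max)]

# Hypothetical objects in metres: a unit cube at the origin and a shifted one.
chair = ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
table = ([4.0, 5.0, 0.0], [5.0, 6.0, 1.0])

print(aabb_distance(*chair, *table))   # 5.0
print(center_offset(*chair, *table))   # [4.0, 5.0, 0.0]
```

Because each step is a closed-form computation, the intermediate values can be written into a reasoning trace and checked, which is the kind of interpretability the abstract emphasizes.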

Hierarchical Orthogonal Residual Spread for Precise Massive Editing in Large Language Models

Authors: Xiaojie Gu, Guangxu Chen, Yuheng Yang, Jingxin Han, Andi Zhang

Venue: ICASSP 2026

First: 2026-01-16T17:02:19+00:00 · Latest: 2026-01-16T17:02:19+00:00

Comments: ICASSP 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language models (LLMs) exhibit exceptional performance across various domains, yet they face critical safety concerns. Model editing has emerged as an effective approach to mitigate these issues. Existing model editing methods often focus on optimizing an information matrix that blends new and old knowledge. While effective, these approaches can be computationally expensive and may cause conflicts. In contrast, we shift our attention to Hierarchical Orthogonal Residual SprEad of the information matrix, which reduces noisy gradients and enables more stable edits from a different perspective. We demonstrate the effectiveness of our method HORSE through a clear theoretical comparison with several popular methods and extensive experiments conducted on two datasets across multiple LLMs. The results show that HORSE maintains precise massive editing across diverse scenarios. The code is available at https://github.com/XiaojieGu/HORSE

中文标题/摘要

标题：层次正交残差扩展在大型语言模型精确大规模编辑中的应用

大型语言模型（LLMs）在各个领域表现出色，但面临关键的安全问题。模型编辑已成为缓解这些问题的有效方法。现有模型编辑方法通常侧重于优化融合新旧知识的信息矩阵。虽然有效，但这些方法可能计算成本高且可能导致冲突。相比之下，我们关注信息矩阵的层次正交残差扩展，从不同角度减少噪声梯度并实现更稳定的编辑。我们通过与几种流行方法的清晰理论比较和在两个数据集上对多个LLM进行的广泛实验，展示了HORSE方法的有效性。结果显示，HORSE在多种场景下保持了精确的大规模编辑。代码可在https://github.com/XiaojieGu/HORSE获取

Summary / 总结

The research addresses safety concerns in large language models (LLMs) by proposing Hierarchical Orthogonal Residual SprEad (HORSE), a model editing method for precise massive editing. Rather than optimizing an information matrix that blends new and old knowledge, which can be computationally expensive and cause conflicts, HORSE applies a hierarchical orthogonal residual spread to that matrix, reducing noisy gradients and enabling more stable edits. Experiments on two datasets across multiple LLMs show that HORSE maintains precise massive editing across diverse scenarios.

研究旨在通过提出一种名为Hierarchical Orthogonal Residual SprEad (HORSE)的方法来解决大型语言模型（LLMs）的安全问题，该方法侧重于减少信息矩阵中的噪声梯度，提供一种高效且稳定的编辑方式。在两个数据集上对多个LLMs进行的实验表明，HORSE能够在各种场景中有效保持精确的大规模编辑。

From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP

Authors: Shanshan Xu, Santosh T. Y. S. S, Barbara Plank

First: 2025-10-09T17:48:29+00:00 · Latest: 2026-01-16T17:00:35+00:00

Abs · PDF · Code1 · Code2

Abstract

Human Label Variation (HLV) refers to legitimate disagreement in annotation that reflects the diversity of human perspectives rather than mere error. Long treated in NLP as noise to be eliminated, HLV has only recently been reframed as a signal for improving model robustness. With the rise of large language models (LLMs) and post-training methods such as human feedback-based alignment, the role of HLV has become increasingly consequential. Yet current preference-learning datasets routinely collapse multiple annotations into a single label, flattening diverse perspectives into artificial consensus. Preserving HLV is necessary not only for pluralistic alignment but also for sociotechnical safety evaluation, where model behavior must be assessed in relation to human interaction and societal context. This position paper argues that preserving HLV as an embodiment of human pluralism must be treated as a Selbstzweck, an intrinsic value in itself. We analyze the limitations of existing preference datasets and propose actionable strategies for incorporating HLV into dataset construction to better preserve pluralistic human values.

中文标题/摘要

标题：从噪声到信号再到自我目的：在NLP后训练时代重新定义人类标签变异

人类标签变异（HLV）指的是注释中的合法分歧，反映了人类视角的多样性而非单纯的错误。在NLP领域长期被视为需要消除的噪声，HLV仅在最近被重新定义为提高模型鲁棒性的信号。随着大型语言模型（LLMs）和后训练方法如基于人类反馈的对齐的兴起，HLV的作用变得越来越重要。然而，当前的偏好学习数据集通常将多个注释合并为单一标签，人为地抹平了多样化的视角。保留HLV不仅对于多元主义对齐至关重要，也对于社会技术安全性评估至关重要，其中模型行为必须与人类互动和社会背景相关联进行评估。本文认为，保留HLV作为人类多元主义的体现，必须被视为一种自我目的，即内在价值本身。我们分析了现有偏好数据集的局限性，并提出了将HLV纳入数据集构建以更好地保留多元人类价值观的可操作策略。

Summary / 总结

This paper addresses the treatment of Human Label Variation (HLV) in Natural Language Processing (NLP), which is the legitimate disagreement among human annotators. Traditionally seen as noise, HLV is now recognized as a signal for improving model robustness. With the advent of large language models and post-training methods like human feedback-based alignment, HLV's role has become more significant. However, current datasets often collapse multiple annotations into a single label, reducing diverse perspectives. The paper argues that preserving HLV is essential for pluralistic alignment and sociotechnical safety evaluation. It identifies the limitations of existing datasets and suggests strategies to better incorporate HLV in dataset construction to preserve human pluralism as an intrinsic value.

本文探讨了自然语言处理（NLP）中的人类标注变异（HLV），即人类标注者之间合法的分歧，反映了不同的视角。传统上被视为噪声，HLV现在被认作提高模型鲁棒性的信号。随着大型语言模型和后训练方法如基于人类反馈的对齐的出现，HLV的作用变得更加重要。然而，当前的数据集通常将多个标注合并为一个标签，失去了人类视角的多样性。文章认为，保留HLV对于多元对齐和社会技术安全性评估至关重要。它指出了现有数据集的局限性，并提出了策略以更好地保留人类价值观。
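
To illustrate what is lost when multiple annotations are collapsed into a single label, here is a minimal Python sketch. The annotation values and the helper names `majority_label` and `label_distribution` are hypothetical and do not come from the paper; the sketch only contrasts a single majority label with the full distribution of annotator choices.

```python
from collections import Counter


def majority_label(annotations):
    """Collapse all annotations into one label (ties go to the label seen first)."""
    counts = Counter(annotations)
    return counts.most_common(1)[0][0]


def label_distribution(annotations):
    """Preserve the variation: relative frequency of each label."""
    counts = Counter(annotations)
    total = len(annotations)
    return {label: n / total for label, n in counts.items()}


# Hypothetical preference annotations from five annotators for one response pair.
annotations = ["A", "A", "B", "A", "B"]
print(majority_label(annotations))      # the 2 of 5 annotators who chose "B" disappear
print(label_distribution(annotations))  # the disagreement stays visible
```

A dataset that stores the distribution (or the raw annotations) can still produce the majority label later, whereas the reverse is not possible.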

The unreasonable effectiveness of pattern matching

Authors: Gary Lupyan, Blaise Agüera y Arcas

First: 2026-01-16T16:53:08+00:00 · Latest: 2026-01-16T16:53:08+00:00

Abs · PDF · Code1 · Code2

Abstract

We report on an astonishing ability of large language models (LLMs) to make sense of "Jabberwocky" language in which most or all content words have been randomly replaced by nonsense strings, e.g., translating "He dwushed a ghanc zawk" to "He dragged a spare chair". This result addresses ongoing controversies regarding how to best think of what LLMs are doing: are they a language mimic, a database, a blurry version of the Web? The ability of LLMs to recover meaning from structural patterns speaks to the unreasonable effectiveness of pattern-matching. Pattern-matching is not an alternative to "real" intelligence, but rather a key ingredient.

中文标题/摘要

标题：模式匹配的不可思议的有效性

我们报告了大型语言模型（LLMs）对“Jabberwocky”语言的惊人理解能力，在这种语言中，大部分或全部内容词都被随机替换为无意义的字符串，例如将“He dwushed a ghanc zawk”翻译为“He dragged a spare chair”。这一结果解决了关于如何最好地理解LLMs在做什么的持续争议：它们是语言模仿、数据库还是网络的模糊版本？LLMs从结构模式中恢复意义的能力表明了模式匹配的不可思议的有效性。模式匹配不是替代“真实”智能的选择，而是关键组成部分。

Summary / 总结

The study reports that large language models can make sense of "Jabberwocky" language, in which most or all content words are replaced by nonsense strings, for example translating "He dwushed a ghanc zawk" to "He dragged a spare chair". The finding bears on the ongoing debate over what LLMs are doing (language mimic, database, or blurry version of the Web): their ability to recover meaning from structural patterns shows how effective pattern-matching is. The authors argue that pattern-matching is not an alternative to "real" intelligence but a key ingredient of it.

研究报告了大型语言模型理解“Jabberwocky”语言的能力，这种语言中大部分或全部实词被替换为无意义的字符串。这一发现回应了关于LLM本质的持续争论：LLM能够从结构模式中恢复意义，体现了模式匹配的“不可思议的有效性”。作者认为，模式匹配并非“真实”智能的替代品，而是其关键组成部分。
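
The Jabberwocky manipulation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' procedure: the tiny `FUNCTION_WORDS` list, the helper names, and the choice to keep word length are all hypothetical. Unlike the paper's example ("dwushed"), this sketch does not preserve suffixes such as "-ed".

```python
import random
import string

# Hypothetical, deliberately tiny list of function words that are left intact.
FUNCTION_WORDS = {"he", "she", "it", "a", "an", "the", "to", "of", "and", "in", "on"}


def nonsense_word(length, rng):
    """Return a random lowercase string of the given length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def jabberwockify(sentence, seed=0):
    """Replace every word outside FUNCTION_WORDS with a same-length nonsense string."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    out = []
    for word in sentence.split():
        if word.lower() in FUNCTION_WORDS:
            out.append(word)
        else:
            out.append(nonsense_word(len(word), rng))
    return " ".join(out)


print(jabberwockify("He dragged a spare chair"))
```

Only the sentence frame ("He ... a ... ...") survives, which is the structural signal a model would have to rely on to recover a plausible meaning.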

DecoupledESC: Enhancing Emotional Support Generation via Strategy-Response Decoupled Preference Optimization

Authors: Chao Zhang, Xin Shi, Xueqiao Zhang, Yifan Zhu, Yi Yang, Yawei Luo

First: 2025-05-22T17:56:21+00:00 · Latest: 2026-01-16T16:50:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: Existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building upon this data, we propose a Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each was trained via SFT and subsequently enhanced by DPO to align with the psychological preference. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint optimization baselines, reducing preference bias and improving response quality.

中文标题/摘要

标题：DecoupledESC：通过策略-响应解耦偏好优化提升情感支持生成

情感支持对话（ESC）的最新进展通过监督微调（SFT）对大型语言模型（LLMs）进行微调，从而提高了情感支持生成的效果。然而，常见的心理错误仍然存在。虽然直接偏好优化（DPO）通过成对偏好学习显示出减少这些错误的潜力，但在ESC任务中的有效性受到两个关键挑战的限制：（1）纠缠的数据结构：现有的ESC数据本质上将心理策略和响应内容纠缠在一起，使得难以构建高质量的偏好成对；（2）优化模糊性：将传统的DPO应用于这种纠缠的成对数据会导致训练目标模糊。为了解决这些问题，我们引入了推断偏好挖掘（IPM）来构建高质量的偏好数据，形成了IPM-PrefDial数据集。在此数据集的基础上，我们借鉴格罗斯的情绪调节扩展过程模型，提出了一个解耦的ESC框架，将ESC任务分解为两个顺序子任务：策略规划和共情响应生成。每个任务都通过SFT进行训练，并随后通过DPO增强，以与心理偏好对齐。广泛的实验表明，我们的解耦ESC框架优于联合优化基线，减少了偏好偏差并提高了响应质量。

Summary / 总结

The research aims to enhance emotional support generation by addressing common psychological errors in Emotional Support Conversation (ESC) tasks. It introduces a Decoupled ESC framework that decomposes the task into strategy planning and empathic response generation, using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align with psychological preferences. Experiments show that this framework outperforms joint optimization methods, reducing preference bias and improving response quality.

研究旨在通过解决情感支持对话(ESC)任务中的常见心理错误来提升情感支持生成。它引入了一个拆分的情感支持框架，将任务分解为策略规划和同理心回应生成，使用监督微调(SFT)和直接偏好优化(DPO)来与心理偏好对齐。实验表明，该框架优于联合优化方法，减少了偏好偏差并提高了回应质量。
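
For reference, the preference objective mentioned above can be sketched as follows. This is the standard DPO loss for a single preference pair, written in plain Python; it is not code from the paper, and the log-probability values and the default `beta=0.1` are hypothetical. In a decoupled setup, the same loss would be applied separately to strategy pairs and to response pairs.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favors the chosen output
    # over the rejected one, measured relative to the frozen reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)), computed in a numerically stable way for either sign of x.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical log-probabilities for one strategy pair and one response pair.
strategy_loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
response_loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
print(strategy_loss, response_loss)
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy shifts probability toward the chosen output.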

Relational Linearity is a Predictor of Hallucinations

Authors: Yuetian Lu, Yihong Liu, Hinrich Schütze

First: 2026-01-16T16:47:49+00:00 · Latest: 2026-01-16T16:47:49+00:00

Comments: 11 pages, 4 figures, 8 tables

Abs · PDF · Code1 · Code2

Abstract

Hallucination is a central failure mode in large language models (LLMs). We focus on hallucinations of answers to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities that are unknown to the model. Surprisingly, we find that medium-size models like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. We hypothesize that an important factor in causing these hallucinations is the linearity of the relation: linear relations tend to be stored more abstractly, making it difficult for the LLM to assess its knowledge; the facts of nonlinear relations tend to be stored more directly, making knowledge assessment easier. To investigate this hypothesis, we create SyntHal, a dataset of 6000 synthetic entities for six relations. In our experiments with four models, we determine, for each relation, the hallucination rate on SyntHal and also measure its linearity, using $Δ\cos$. We find a strong correlation ($r \in [.78,.82]$) between relational linearity and hallucination rate, providing evidence for our hypothesis that the underlying storage of triples of a relation is a factor in how well a model can self-assess its knowledge. This finding has implications for how to manage hallucination behavior and suggests new research directions for improving the representation of factual knowledge in LLMs.

中文标题/摘要

标题：关系线性是幻觉的预测因子

幻觉是大型语言模型（LLMs）中的一个核心失败模式。我们关注的是对诸如“格伦·古德演奏的是哪种乐器？”这类问题的回答幻觉，但我们使用模型未知的合成实体提出这些问题。令人惊讶的是，我们发现像Gemma-7B-IT这样的中型模型经常产生幻觉，即它们难以识别幻觉事实不属于其知识。我们假设导致这些幻觉的一个重要因素是关系的线性：线性关系往往以更抽象的方式存储，使得LLM难以评估其知识；非线性关系的事实通常以更直接的方式存储，使得知识评估更容易。为了检验这一假设，我们创建了SyntHal数据集，包含6000个六种关系的合成实体。在对四种模型的实验中，我们确定了每种关系在SyntHal上的幻觉率，并使用$Δ\cos$测量其线性。我们发现关系线性与幻觉率之间存在强烈的相关性（$r \in [.78,.82]$），这为我们的假设提供了证据，即关系三元组的底层存储是模型能否自我评估其知识能力的一个因素。这一发现对管理幻觉行为具有重要意义，并为改进LLMs中事实知识的表示提出了新的研究方向。

Summary / 总结

The study investigates hallucinations in large language models (LLMs) by focusing on synthetic entities unknown to the models. It hypothesizes that the linearity of relations affects the models' ability to recognize their own knowledge gaps. Using SyntHal, a dataset of 6000 synthetic entities, the researchers found a strong correlation between relational linearity and hallucination rates, supporting the hypothesis that the underlying storage of relation triples impacts the model's self-assessment capability. This finding suggests new avenues for managing hallucination behavior and improving LLMs' factual knowledge representation.

研究探讨了关系线性与大型语言模型（LLMs）幻觉之间的关系。通过创建包含6000个合成实体的SyntHal数据集，研究人员发现关系线性与幻觉率之间存在较强的相关性（r在[.78, .82]之间），表明线性关系的抽象存储导致模型难以自我评估知识，从而引发幻觉。这一发现为管理幻觉行为提供了见解，并提出了改进LLMs事实知识表示的新研究方向。
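
The reported statistic is a Pearson correlation computed across relations. A minimal sketch of that computation is shown below; the per-relation numbers are placeholders invented for illustration and are not the paper's measurements, and the way linearity itself is scored is not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for six hypothetical relations (NOT results from the paper).
linearity = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
hallucination_rate = [0.20, 0.15, 0.45, 0.40, 0.75, 0.70]
print(pearson_r(linearity, hallucination_rate))
```

With only six relations per model, each correlation rests on six points, so it is best read as supporting evidence rather than a precise estimate.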

Isotropy-Optimized Contrastive Learning for Semantic Course Recommendation

Authors: Ali Khreis, Anthony Nasr, Yusuf Hilal

First: 2026-01-16T16:47:29+00:00 · Latest: 2026-01-16T16:47:29+00:00

Comments: 7 pages, 7 figures

Abs · PDF · Code1 · Code2

Abstract

This paper presents a semantic course recommendation system for students using a self-supervised contrastive learning approach built upon BERT (Bidirectional Encoder Representations from Transformers). Traditional BERT embeddings suffer from anisotropic representation spaces, where course descriptions exhibit high cosine similarities regardless of semantic relevance. To address this limitation, we propose a contrastive learning framework with data augmentation and isotropy regularization that produces more discriminative embeddings. Our system processes student text queries and recommends Top-N relevant courses from a curated dataset of over 500 engineering courses across multiple faculties. Experimental results demonstrate that our fine-tuned model achieves improved embedding separation and more accurate course recommendations compared to vanilla BERT baselines.

中文标题/摘要

标题：面向语义课程推荐的各向同性优化对比学习

本文提出了一种基于 BERT（双向编码器表示变换器）的自监督对比学习方法的语义课程推荐系统。传统的 BERT 表示空间具有各向异性，导致课程描述在语义相关性不同时仍表现出高余弦相似度。为解决这一局限，我们提出了一种带有数据增强和各向同性正则化的对比学习框架，以生成更具区分性的嵌入。该系统处理学生文本查询，并从涵盖多个学院超过 500 门工程课程的精选数据集中推荐 Top-N 相关课程。实验结果表明，与 vanilla BERT 基线相比，我们的微调模型在嵌入分离和课程推荐准确性方面均有所提高。

Summary / 总结

This paper introduces a semantic course recommendation system using a contrastive learning approach based on BERT to address the anisotropic representation issue in traditional BERT embeddings. The system employs data augmentation and isotropy regularization to generate more discriminative embeddings. Experiments show that the proposed model outperforms vanilla BERT in embedding separation and course recommendation accuracy.

该论文提出了一种使用具有各向同性正则化的对比学习方法的语义课程推荐系统，解决了BERT嵌入空间的各向异性问题。通过数据增强和各向同性正则化，该系统生成了更具区分性的嵌入，从而提高了课程推荐的准确性。实验结果显示，微调后的模型在嵌入分离和课程推荐准确性方面优于vanilla BERT。
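
The anisotropy problem can be made concrete with a small sketch. It measures the mean pairwise cosine similarity of a set of embeddings, which is one common way to quantify anisotropy, and shows how a shared offset inflates it. The toy vectors are hypothetical, and the mean-centering step is a simple stand-in chosen for illustration; it is not the paper's isotropy regularization.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs (high = anisotropic)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def center(vectors):
    """Subtract the mean vector, removing the direction shared by all embeddings."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in vectors]


# Toy embeddings: a large shared component plus small distinguishing components.
embeddings = [
    [10.0, 1.0, 0.0],
    [10.0, 0.0, 1.0],
    [10.0, -1.0, 0.0],
    [10.0, 0.0, -1.0],
]
print(mean_pairwise_cosine(embeddings))          # close to 1: everything looks similar
print(mean_pairwise_cosine(center(embeddings)))  # much lower once the offset is removed
```

When every pair scores near 1, ranking courses by cosine similarity to a query carries little information, which is the limitation the contrastive fine-tuning is meant to address.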

The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents

Authors: Ziyu Wang, Chenyuan Liu, Yushun Xiang, Runhao Zhang, Qingbo Hao, Hongliang Lu, Houyu Chen, Zhizhong Feng, Kaiyue Zheng, Dehao Ye, Xianchao Zeng, Xinyu Zhou, Boran Wen, Jiaxin Li, Mingyu Zhang, Kecheng Zheng, Qian Zhu, Ran Cheng, Yong-Lu Li

First: 2026-01-16T16:42:05+00:00 · Latest: 2026-01-16T16:42:05+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recently, with the rapid development of robot learning and imitation learning, numerous datasets and methods have emerged. However, these datasets and their task designs often lack systematic consideration and principles. This raises important questions: Do the current datasets and task designs truly advance the capabilities of robotic agents? Do evaluations on a few common tasks accurately reflect the differentiated performance of various methods proposed by different teams and evaluated on different tasks? To address these issues, we introduce the Great March 100 (GM-100) as the first step towards a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors, aiming to provide a diverse and challenging set of tasks to comprehensively evaluate the capabilities of robotic agents and promote diversity and complexity in robot dataset task designs. These tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human-object interaction primitives and object affordances. We collect a large amount of trajectory data on different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of current VLA models. Our data and code are available at https://rhos.ai/research/gm-100.

中文标题/摘要

标题：Great March 100：评估具身AI代理的100项注重细节的任务

近年来，随着机器人学习和模仿学习的快速发展，出现了大量数据集和方法。然而，这些数据集及其任务设计往往缺乏系统的考虑和原则。这提出了重要问题：当前的数据集和任务设计是否真正推动了机器人代理的能力？在少数几个常见任务上的评估是否能准确反映不同团队提出的不同方法在不同任务上的差异化表现？为了解决这些问题，我们引入了Great March 100（GM-100）作为迈向机器人学习奥运会的第一步。GM-100 包含100个精心设计的任务，涵盖了广泛的交互和长尾行为，旨在提供一个多样且具有挑战性的任务集，全面评估机器人代理的能力，并促进机器人数据集任务设计的多样性和复杂性。这些任务通过对现有任务设计的系统分析与扩展，并结合人类物体交互基本原理和物体功能的见解而开发。我们在不同的机器人平台上收集了大量的轨迹数据，并评估了几种基线模型。实验结果表明，GM-100 任务是1）可执行的，2）足够具有挑战性，能够有效区分当前VLA模型的性能。我们的数据和代码可在 https://rhos.ai/research/gm-100 获取。

Summary / 总结

The research addresses the lack of systematic design principles in current robot learning datasets and tasks. It introduces the Great March 100 (GM-100), a set of 100 detail-oriented tasks for evaluating embodied AI agents. The tasks cover a wide range of interactions and long-tail behaviors; the authors collect trajectory data on different robotic platforms and evaluate several baseline models. The results show that GM-100 tasks are feasible to execute and sufficiently challenging to differentiate the performance of current vision-language-action (VLA) models.

研究旨在解决当前机器人数据集和任务设计缺乏系统考虑的问题。引入了Great March 100 (GM-100)，包含100个注重细节的任务，用于评估具身AI代理。这些任务涵盖了广泛的交互和长尾行为，并通过轨迹数据收集和基线模型测试进行评估。结果显示，GM-100任务既可行又具有足够的挑战性，能够区分当前视觉-语言-动作（VLA）模型的性能差异。

Zero-Shot Detection of Elastic Transient Morphology Across Physical Systems

Authors: Jose Sánchez Andreu

First: 2026-01-16T16:35:07+00:00 · Latest: 2026-01-16T16:35:07+00:00

Comments: 17 pages, 6 figures. Supplemental material included

Abs · PDF · Code1 · Code2

Abstract

We test whether a representation learned from interferometric strain transients in gravitational-wave observatories can act as a frozen morphology-sensitive operator for unseen sensors, provided the target signals preserve coherent elastic transient structure. Using a neural encoder trained exclusively on non-Gaussian instrumental glitches, we perform strict zero-shot anomaly analysis on rolling-element bearings without retraining, fine-tuning, or target-domain labels. On the IMS-NASA run-to-failure dataset, the operator yields a monotonic health index HI(t) = s0.99(t)/tau normalized to an early-life reference distribution, enabling fixed false-alarm monitoring at 1-q = 1e-3 with tau = Q0.999(P0). In discrete fault regimes (CWRU), it achieves strong window-level discrimination (AUC_win about 0.90) and file-level separability approaching unity (AUC_file about 0.99). Electrically dominated vibration signals (VSB) show weak, non-selective behavior, delineating a physical boundary for transfer. Under a matched IMS controlled-split protocol, a generic EfficientNet-B0 encoder pretrained on ImageNet collapses in the intermittent regime (Lambda_tail about 2), while the interferometric operator retains strong extreme-event selectivity (Lambda_tail about 860), indicating that the effect is not a generic property of CNN features. Controlled morphology-destruction transformations selectively degrade performance despite per-window normalization, consistent with sensitivity to coherent time-frequency organization rather than marginal amplitude statistics.

中文标题/摘要

标题：弹性瞬态形态在物理系统中的零样本检测

我们测试从引力波观测站的干涉仪应变瞬态中学习到的表示，是否可以在未见过的传感器上作为冻结的形态敏感操作符发挥作用，前提是目标信号保留了一致的弹性瞬态结构。使用仅在非高斯仪器瞬态上训练的神经编码器，我们对滚动轴承进行严格的零样本异常分析，无需重新训练、微调或目标域标签。在IMS-NASA运行至失效数据集中，该操作符产生一个归一化到早期寿命参考分布的单调健康指数HI(t) = s0.99(t)/tau，使其在1-q = 1e-3的固定误报率下运行，其中tau = Q0.999(P0)。在离散故障区间（CWRU），它实现了强大的窗口级区分（AUC_win约0.90）和文件级可分性接近1（AUC_file约0.99）。电主导振动信号（VSB）表现出弱的、非选择性行为，划定了一种物理边界，限制了转移。在匹配的IMS控制分割协议下，通用的预训练在ImageNet上的EfficientNet-B0编码器在间歇区间（Lambda_tail约2）失效，而干涉仪操作符保持了强大的极端事件选择性（Lambda_tail约860），表明该效果不是CNN特征的通用属性。控制形态破坏变换选择性地降级性能，尽管进行了窗口归一化，这与对一致的时间-频率组织的敏感性一致，而不是边缘幅度统计。

Summary / 总结

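The study tests whether a representation learned from gravitational-wave interferometer strain transients can serve as a frozen, morphology-sensitive operator on unseen sensors without retraining. Using a neural encoder trained only on non-Gaussian instrumental glitches, it performs zero-shot anomaly analysis on rolling-element bearings, yielding a monotonic health index on the IMS-NASA dataset and strong discrimination on CWRU (window-level AUC about 0.90, file-level AUC about 0.99). Electrically dominated vibration signals (VSB) show weak, non-selective behavior, which marks a physical boundary for transfer. An ImageNet-pretrained EfficientNet-B0 collapses in the intermittent regime while the interferometric operator stays selective, so the effect is not a generic property of CNN features; morphology-destruction tests further indicate sensitivity to coherent time-frequency structure rather than marginal amplitude statistics.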

研究旨在确定是否可以从引力波观测站学到的表示可以在未见过的传感器上用于无重新训练的异常检测。使用一个仅在非高斯瞬态信号上训练的神经编码器，该方法对滚动轴承进行零样本异常分析，实现了单调健康指数和在故障区域中的强区分能力。干涉仪操作器表现出稳健的性能，而一个通用的CNN编码器在间歇性区域失效，表明其对相干时频结构而不是边缘幅度统计的敏感性。
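
The health index HI(t) = s0.99(t)/tau with tau = Q0.999(P0) can be sketched as below. This is one plausible reading of the abstract, not the author's implementation: it assumes s0.99(t) is the 0.99 quantile of per-sample anomaly scores in the window at time t, and that P0 is the set of scores from an early-life reference period. The helper names, the interpolation rule, and the toy scores are hypothetical.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def health_index(window_scores, reference_scores, q_window=0.99, q_ref=0.999):
    """HI = (q_window quantile of the current window) / tau, with tau from the reference."""
    tau = quantile(reference_scores, q_ref)  # threshold fixed once from early-life data
    return quantile(window_scores, q_window) / tau


# Toy anomaly scores (assumed values, for illustration only).
reference = [float(i) for i in range(1, 1002)]  # early-life scores
healthy_window = [500.0] * 100
degraded_window = [2000.0] * 100

print(health_index(healthy_window, reference) > 1.0)   # no alarm expected
print(health_index(degraded_window, reference) > 1.0)  # alarm expected
```

Under this reading, raising an alarm when HI exceeds 1 ties the false-alarm rate to the reference quantile, which matches the fixed false-alarm setting 1-q = 1e-3 described in the abstract.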