UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation
Authors: Ruiheng Zhang, Jingfeng Yao, Huangxuan Zhao, Hao Yan, Xiao He, Lei Chen, Zhou Wei, Yong Luo, Zengmao Wang, Lefei Zhang, Dacheng Tao, Bo Du
First: 2026-01-16T18:59:58+00:00 · Latest: 2026-01-16T18:59:58+00:00
Comments: Codes and models are available at https://github.com/ZrH42/UniX
Abstract
Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently lead to compromised performance in one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Codes and models are available at https://github.com/ZrH42/UniX.
中文标题/摘要
标题:UniX:统一自回归和扩散模型以理解与生成胸部X光片
尽管取得了进展,但医疗基础模型仍然难以统一视觉理解和生成,因为这两个任务具有固有的冲突目标:语义抽象与像素级重建。现有方法通常基于参数共享的自回归架构,经常导致在其中一个或两个任务上的性能妥协。为了解决这一问题,我们提出了UniX,这是一种用于胸部X光片理解和生成的新一代统一医疗基础模型。UniX 将两个任务分别拆分为一个自回归分支用于理解,一个扩散分支用于高保真生成。关键地,引入了一种跨模态自注意力机制,以动态地用理解特征引导生成过程。结合严格的数据清洗流程和多阶段训练策略,该架构能够使任务之间协同合作,同时利用扩散模型的优势以实现更出色的生成效果。在两个代表性基准上,UniX 在理解性能(Micro-F1)上提高了46.1%,在生成质量(FD-RadDino)上提高了24.2%,仅使用LLM-CXR参数的四分之一。通过达到与任务特定模型相当的性能,我们的工作确立了一种可扩展的医疗图像理解和生成协同范式。代码和模型可在 https://github.com/ZrH42/UniX 获取。
Summary / 总结
UniX is designed to unify the tasks of understanding and generating chest X-rays by decoupling them into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. It introduces a cross-modal self-attention mechanism to dynamically guide the generation process with understanding features. On two benchmarks, UniX shows a 46.1% improvement in understanding performance and a 24.2% gain in generation quality, using only a quarter of the parameters of LLM-CXR. This work establishes a scalable paradigm for synergistic medical image understanding and generation. Codes and models are available at https://github.com/ZrH42/UniX.
UniX 通过将理解任务和生成任务分别拆分为自回归分支和扩散分支来统一胸部X光的理解和生成。它引入了一种跨模态自注意力机制,以动态地使用理解特征来引导生成过程。在基准测试中,UniX 将理解性能提高了46.1%(Micro-F1)和生成质量提高了24.2%(FD-RadDino),仅使用LLM-CXR四分之一的参数,展示了医疗图像理解和生成的可扩展范式。代码和模型可在 https://github.com/ZrH42/UniX 获取。
How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers
Authors: Jonathan Roberts, Kai Han, Samuel Albanie
First: 2026-01-16T18:58:29+00:00 · Latest: 2026-01-16T18:58:29+00:00
Abstract
Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and estimating inference pricing is the token. In general, tokens are used as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation by providing a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition toward tokenization in contemporary LLMs.
中文标题/摘要
标题:一段绳子有多长?关于分词器的简要实证分析
前沿大语言模型(LLM)在学术界、社会和工业中越来越广泛地被使用。一个常用于比较模型、输入和输出以及估算推理成本的单位是“令牌”。通常,令牌被视为一种稳定的货币,假设在不同分词器和上下文中大致一致,从而能够进行直接比较。然而,分词在不同模型和文本领域之间差异显著,使得对令牌数量的简单解释变得复杂。我们通过提供全面的实证分析来量化这种差异,探索不同文本数据分布下序列到令牌的压缩。我们的分析挑战了关于令牌长度的常用启发式方法,发现它们过于简单化。我们希望本研究的见解能为当代大语言模型中的分词提供清晰性和直觉。
Summary / 总结
The study aims to address the variability in tokenization across different models and text domains, which can affect the interpretation of token counts. The researchers employ an empirical analysis of tokenizers, examining how sequences are compressed into tokens across various textual data distributions. Key findings suggest that token lengths are not as consistent as previously assumed, challenging existing heuristics and highlighting the need for more nuanced understanding in the context of contemporary language models.
研究旨在解决不同模型和文本领域之间标记化差异的问题,这影响了对标记数量的解释。研究人员通过分析标记化,考察了序列在不同文本数据分布下的压缩情况。主要发现包括对常见标记长度假设的挑战,表明它们不像以前认为的那样一致,从而使得基于标记数量的直接模型比较变得复杂。
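To make the idea of "compression of sequences to tokens" concrete, here is a minimal sketch that measures characters per token for two text domains. The two tokenizers are toy stand-ins invented for illustration (whitespace splitting and fixed-width chunks), not the tokenizers studied in the paper, and the sample strings are hypothetical.

```python
def whitespace_tokenize(text):
    """Toy tokenizer: split on whitespace (a stand-in, not a real LLM tokenizer)."""
    return text.split()


def chunk_tokenize(text, width=4):
    """Toy tokenizer: fixed-width character chunks (also a stand-in)."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def chars_per_token(text, tokenize):
    """Average number of characters represented by one token."""
    tokens = tokenize(text)
    return len(text) / len(tokens) if tokens else 0.0


# Hypothetical samples from two different text domains.
samples = {
    "prose": "The quick brown fox jumps over the lazy dog.",
    "code": "for(i=0;i<n;i++){x[i]=y[i]*2;}",
}
tokenizers = [("whitespace", whitespace_tokenize), ("chunk4", chunk_tokenize)]

# Compression ratio for every (domain, tokenizer) pair.
ratios = {
    (domain, name): chars_per_token(text, tok)
    for domain, text in samples.items()
    for name, tok in tokenizers
}

for key, value in sorted(ratios.items()):
    print(key, round(value, 2))
```

Even with these toy tokenizers, the ratio for the same tokenizer differs sharply between prose and code, which is the kind of variation that makes a single "characters per token" rule of thumb unreliable.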
Do explanations generalize across large reasoning models?
Authors: Koyena Pal, David Bau, Chandan Singh
First: 2026-01-16T18:55:29+00:00 · Latest: 2026-01-16T18:55:29+00:00
Abstract
Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference rankings and post-training with reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to yield new insights and outline a framework for characterizing LRM explanation generalization.
中文标题/摘要
标题:大型推理模型的解释是否具有普适性?
大型推理模型(LRMs)在解决问题的过程中产生了一种文本形式的推理链(CoT),这可能成为理解问题的强大工具,因为它提供了易于理解的自然语言解释。然而,尚不清楚这些解释是否具有普适性,即它们是否捕捉到了问题的普遍模式,而不是仅限于LRM的特殊模式。这是理解或发现新概念的关键问题,例如在科学中的AI。我们通过评估一种特定的普适性概念来研究这个问题:一种LRM生成的解释是否会在提供给其他LRM时产生相同的行为。我们发现CoT解释通常表现出这种形式的普适性(即它们增加了LRM之间的一致性),并且这种增加的普适性与人类的偏好排名以及基于强化学习的后训练相关。我们进一步分析了解释产生一致答案的条件,并提出了一种简单的句子级集成策略,以提高一致性。综上所述,这些结果建议在使用LRM解释以获得新见解时要谨慎,并概述了表征LRM解释普适性的框架。
Summary / 总结
The study investigates whether textual chain of thought (CoT) explanations generated by large reasoning models (LRMs) generalize across different models. By evaluating the consistency of behavior when CoT explanations from one LRM are given to another, the research finds that these explanations often lead to increased consistency between LRMs. This generalization is also linked to human preference rankings and improvements after reinforcement learning. The study suggests that while CoT explanations can be useful, caution is needed when using them to derive new insights, and proposes a sentence-level ensembling strategy to enhance consistency.
研究探讨了大型推理模型(LRM)生成的文本链式思考(CoT)解释是否能在不同模型之间泛化。通过评估当一个LRM的CoT解释被提供给另一个LRM时的行为一致性,研究发现这些解释通常会导致LRM之间的一致性增加。这种泛化还与人类的偏好排名以及基于强化学习的后训练相关联。研究建议,在使用CoT解释以获得新见解时应持谨慎态度,因为它们可能并不总是捕捉到一般模式。此外,还提出了一种简单的句子级集成策略来提高解释的一致性。
Building Production-Ready Probes For Gemini
Authors: János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, Arthur Conmy
First: 2026-01-16T18:54:29+00:00 · Latest: 2026-01-16T18:54:29+00:00
Abstract
Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift.
We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant shifts, including multi-turn conversations, static jailbreaks, and adaptive red teaming. Our results demonstrate that while multimax addresses context length, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at a low cost due to the computational efficiency of probes.
These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
Summary / 总结
This research aims to improve the robustness of language model probes to prevent misuse by bad actors. The study identifies that existing probes struggle with long-context inputs and proposes new architectures to address this issue. Evaluations in the cyber-offensive domain show that a combination of architecture choice and diverse training data is necessary for broad generalization. The research also demonstrates that pairing probes with prompted classifiers can achieve high accuracy efficiently. These findings have enabled the successful deployment of probes in Google's Gemini model and suggest that automating some AI safety research is feasible.
论文旨在通过开发生产级的激活探针来缓解高级语言模型被滥用的问题。研究发现现有探针在处理长上下文输入变化时难以泛化。作者提出了新的探针架构并将其在网络安全领域进行了评估,结果显示架构选择和多样化的训练数据对于广泛泛化是必要的。他们还展示了将探针与提示分类器结合使用可以提高准确率同时保持计算效率。这些发现使得这些探针在Google的先进语言模型Gemini中成功部署,并表明自动化某些AI安全性研究是可行的。
ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Authors: Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, Jakob Engel
First: 2026-01-16T18:51:24+00:00 · Latest: 2026-01-16T18:51:24+00:00
Comments: Project Page: http://facebookresearch.github.io/ShapeR Video: https://www.youtube.com/watch?v=EbY30KAA55I
Abstract
Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.
中文标题/摘要
标题:ShapeR:基于随意捕捉的稳健条件3D形状生成
近期在3D形状生成方面的进展取得了令人印象深刻的成果,但大多数现有方法依赖于干净、未遮挡和良好分割的输入。在现实世界场景中,这些条件很少被满足。我们提出了ShapeR,一种新颖的方法,用于从随意捕捉的序列中生成条件3D物体形状。给定一个图像序列,我们利用现成的视觉-惯性SLAM、3D检测算法和视觉-语言模型,为每个物体提取一组稀疏的SLAM点、带位姿的多视角图像和机器生成的描述。随后,一个经过训练、能够有效以这些模态为条件的整流流(rectified flow)Transformer 生成高保真度的度量3D形状。为了确保对随意捕捉数据挑战的鲁棒性,我们采用了包括即时组合增强、跨越物体和场景级别的数据集的课程训练方案以及处理背景杂乱的策略。此外,我们引入了一个新的评估基准,包括7个真实世界场景中的178个野外物体,带有几何注释。实验表明,在这种具有挑战性的设置中,ShapeR 显著优于现有方法,与最先进的方法相比,Chamfer 距离改善了2.7倍。
Summary / 总结
ShapeR is a novel approach for generating 3D object shapes from casually captured sequences. It uses visual-inertial SLAM, 3D detection, and vision-language models to extract sparse SLAM points, multi-view images, and captions. A rectified flow transformer then generates high-fidelity 3D shapes. ShapeR demonstrates robustness to casually captured data through techniques like on-the-fly augmentations and a curriculum training scheme. Experiments show ShapeR significantly outperforms existing methods, reducing Chamfer distance by 2.7 times.
ShapeR 是一种从随意拍摄的序列中生成 3D 物体形状的新方法。它使用视觉惯性 SLAM、3D 检测和视觉语言模型来提取稀疏 SLAM 点、多视角图像和机器生成的描述。然后,一个整流流(rectified flow)Transformer 生成高保真的 3D 形状。ShapeR 通过使用即时组合增强和课程训练方案等技术来应对随意拍摄数据的挑战。实验表明,ShapeR 在这个具有挑战性的设置中显著优于现有方法,将 Chamfer 距离降低了 2.7 倍。
From Aggregation to Selection: User-Validated Distributed Social Recommendation
Authors: Jingyuan Huang, Dan Luo, Zihe Ye, Weixin Chen, Minghao Guo, Yongfeng Zhang
Venue: WWW 2026
First: 2025-05-27T16:17:06+00:00 · Latest: 2026-01-16T18:45:34+00:00
Comments: Accepted by HCRS@WWW 2026
Abstract
Social recommender systems facilitate social connections by identifying potential friends for users. Each user maintains a local social network centered around themselves, resulting in a naturally distributed social structure. Recent research on distributed modeling for social recommender systems has gained increasing attention, as it naturally aligns with the user-centric structure of user interactions. Current distributed social recommender systems rely on automatically combining predictions from multiple models, often overlooking the user's active role in validating whether suggested connections are appropriate. Moreover, recommendation decisions are validated by individual users rather than derived from a single global ordering of candidates. As a result, standard ranking-based evaluation metrics make it difficult to evaluate whether a user-confirmed recommendation decision is actually correct. To address these limitations, we propose DeSocial, a distributed social recommendation framework with user validation. DeSocial enables users to select recommendation algorithms to validate their potential connections, and the verification is processed through majority consensus among multiple independent user validators. To evaluate the distributed recommender system with user validators, we formulate this setting as a link prediction and verification task and introduce Acc@K, a consensus-based evaluation metric that measures whether user-approved recommendations are correct. Experiments on 4 real-world social networks show that DeSocial improves decision correctness and robustness compared to single-point and distributed baselines. These findings highlight the potential of user-validated distributed recommender systems as a practical approach to social recommendation, with broader applicability to distributed and decentralized recommendations. Code: https://github.com/agiresearch/DeSocial.
中文标题/摘要
标题:从聚合到选择:用户验证分布式社会推荐
社会推荐系统通过识别潜在朋友来促进社交连接。每个用户维护一个以自己为中心的本地社交网络,形成自然分布的社会结构。社会推荐系统的分布式建模研究近年来引起了越来越多的关注,因为它自然地与用户交互的用户中心结构相吻合。当前的分布式社会推荐系统依赖于自动组合多个模型的预测,往往忽略了用户在验证建议连接是否合适中的积极作用。此外,推荐决策是由个别用户验证而不是从单一的全局候选排序中得出的。因此,标准的排名评价指标难以评估用户确认的推荐决策是否正确。为了解决这些局限性,我们提出了DeSocial,一种具有用户验证的分布式社会推荐框架。DeSocial使用户能够选择推荐算法来验证其潜在连接,并通过多个独立用户验证者的多数共识来处理验证。为了评估具有用户验证者的分布式推荐系统,我们将此设置形式化为链接预测和验证任务,并引入基于共识的评价指标Acc@K,衡量用户批准的推荐是否正确。在4个真实世界的社交网络上的实验表明,与单点和分布式基线相比,DeSocial在决策正确性和鲁棒性方面有所提高。这些发现突显了用户验证的分布式推荐系统作为社会推荐的实用方法的潜力,具有更广泛的分布式和去中心化推荐应用。
Summary / 总结
This paper addresses the limitations of current distributed social recommender systems by proposing DeSocial, a framework that incorporates user-validation. DeSocial allows users to select and validate potential connections through majority consensus among independent validators. The evaluation metric Acc@K measures the correctness of user-approved recommendations. Experiments on four real-world social networks demonstrate that DeSocial outperforms single-point and distributed baselines in terms of decision correctness and robustness.
论文提出了一种名为DeSocial的用户验证分布式社会推荐框架,允许用户通过多数共识选择和验证潜在联系人。该方法在决策正确性和稳健性方面优于单点和分布式基线。通过Acc@K共识评价指标衡量用户批准的推荐正确性,展示了用户验证在分布式社会推荐系统中的有效性。
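The majority-consensus step described above can be sketched as follows. This is an illustrative toy under stated assumptions, not DeSocial's implementation: the function names and the vote data are hypothetical, and the score computed here is plain accuracy of consensus decisions against ground-truth links, which is not claimed to match the paper's exact definition of Acc@K.

```python
def majority_consensus(votes):
    """Approve a candidate link only if strictly more than half of the validators approve."""
    approvals = sum(1 for v in votes if v)
    return approvals * 2 > len(votes)


def consensus_accuracy(votes_per_link, true_links):
    """Fraction of candidate links whose consensus decision matches the ground truth."""
    correct = sum(
        1
        for votes, is_true in zip(votes_per_link, true_links)
        if majority_consensus(votes) == is_true
    )
    return correct / len(true_links)


# Hypothetical votes from independent validators for four candidate links.
votes_per_link = [
    [True, True, False],        # approved 2 of 3
    [False, False, True],       # rejected 1 of 3
    [True, False, True, True],  # approved 3 of 4
    [True, False],              # tie, treated as a rejection
]
# Hypothetical ground truth: whether each link really exists.
true_links = [True, False, False, False]

accuracy = consensus_accuracy(votes_per_link, true_links)
print(accuracy)
```

Treating a tie as a rejection is a design choice made for this sketch; an odd number of validators avoids the question entirely.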
ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
Authors: Emily Steiner, Jianhao Zheng, Henry Howard-Jenkins, Chris Xie, Iro Armeni
First: 2026-01-16T18:45:19+00:00 · Latest: 2026-01-16T18:45:19+00:00
Abstract
Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
中文标题/摘要
标题:ReScene4D:演化的室内三维场景的时间一致语义实例分割
室内环境随着物体的移动、出现或消失而演变。捕捉这些动态需要在间歇性捕获的3D扫描中保持实例身份的一致性,即使在未观察到变化时也是如此。我们引入并形式化了4D室内语义实例分割(SIS)任务,该任务联合分割、识别和时间关联物体实例。这一设置对现有的3DSIS方法构成了挑战,因为它们由于缺乏时间推理需要进行离散匹配步骤,同时也对依赖于高频率时间测量的4D LiDAR方法构成了挑战,因为这些方法在室内环境长时间演变中表现不佳。我们提出了一种名为ReScene4D的新方法,该方法无需密集观测即可适应3DSIS架构进行4DSIS。它探索了在观测之间共享信息的策略,证明这种共享上下文不仅能够实现一致的实例跟踪,还能提高标准3DSIS的质量。为了评估这一任务,我们定义了一个新的度量标准t-mAP,该标准扩展了mAP以奖励时间身份一致性。ReScene4D在3RScan数据集上达到了最先进的性能,为理解演化的室内场景建立了新的基准。
Summary / 总结
The research aims to capture the dynamics of evolving indoor environments by maintaining consistent instance identities across temporally sparse 3D scans. The method, ReScene4D, adapts 3D semantic instance segmentation (3DSIS) architectures to the 4D setting without requiring dense observations, sharing information across observations to track instances consistently. Key findings are that this shared context also improves standard 3DSIS quality and that ReScene4D achieves state-of-the-art performance on the 3RScan dataset. The work also introduces a new metric, t-mAP, which extends mAP to reward temporal identity consistency.
研究旨在通过引入时空稀疏的室内语义实例分割(SIS)任务,解决在不断变化的室内3D场景中保持实例身份一致性的挑战。提出的ReScene4D方法通过适应3DSIS架构来处理稀疏的时空数据,而无需密集观测,展示了改进的实例跟踪和标准3DSIS质量。研究引入了t-mAP新度量来评估时间身份一致性,并在3RScan数据集上达到了最先进的性能。
Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them
Authors: Jiahe Jin, Abhijay Paladugu, Chenyan Xiong
First: 2025-10-08T00:20:35+00:00 · Latest: 2026-01-16T18:30:29+00:00
Abstract
Agentic search requires large language models (LLMs) to perform multi-step search to solve complex information-seeking tasks, imposing unique challenges on their reasoning capabilities. However, what constitutes effective reasoning for agentic search and how it can be learned remains unclear. In this work, we first investigate the reasoning behaviors that enable success in agentic search. By comparing successful and failed trajectories via an LLM-based analysis pipeline, we identify four beneficial behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Building on this, we propose Behavior Priming, a training approach that equips agentic search models with these reasoning behaviors before reinforcement learning (RL). Specifically, it first performs supervised fine-tuning (SFT) on collected trajectories exhibiting the identified behaviors to cultivate these behaviors, and then applies standard RL to further improve task performance. Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct show that Behavior Priming yields relative improvements over direct RL of 37.2% on three web benchmarks and 6.2% on seven multi-hop QA benchmarks, and outperforms the SFT-then-RL baseline using outcome-correct trajectories for fine-tuning. Crucially, we show that these reasoning behaviors matter more than outcome correctness in the priming stage prior to RL. Further analysis reveals that Behavior Priming enhances exploration (pass@8) and test-time scaling (search step number), providing a robust foundation for RL. Our code is available at https://github.com/cxcscmu/Behavior-Priming-for-Agentic-Search.
中文标题/摘要
标题:代理搜索中的有益推理行为及其有效后训练获取方法
代理搜索要求大型语言模型(LLMs)执行多步搜索以解决复杂的信息检索任务,对它们的推理能力提出了独特的挑战。然而,有效的代理搜索推理构成要素及其如何学习仍然不清楚。在本工作中,我们首先研究使代理搜索成功的推理行为。通过基于LLM的分析管道比较成功的和失败的轨迹,我们确定了四种有益的行为:信息验证、权威评估、适应性搜索和错误恢复。在此基础上,我们提出了一种行为引导的训练方法,该方法在强化学习(RL)之前为代理搜索模型配备了这些推理行为。具体而言,它首先对表现出所识别行为的轨迹进行监督微调(SFT),以培养这些行为,然后应用标准RL进一步提高任务性能。在Qwen3-1.7B和Llama3.2-3B-Instruct上的实验表明,行为引导相较于直接RL在三个网页基准上相对提高了37.2%,在七个多跳问答基准上相对提高了6.2%,并且优于使用结果正确的轨迹进行微调的SFT-then-RL基线。至关重要的是,我们证明了在RL之前的引导阶段,这些推理行为比结果正确性更为重要。进一步的分析表明,行为引导增强了探索(pass@8)和测试时的扩展(搜索步骤数),为RL提供了坚实的基础。我们的代码可在https://github.com/cxcscmu/Behavior-Priming-for-Agentic-Search/ 获取。
Summary / 总结
This study investigates effective reasoning behaviors for agentic search, identifying four beneficial behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. A training approach called Behavior Priming is proposed, which first performs supervised fine-tuning on trajectories exhibiting these behaviors and then applies reinforcement learning. Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct show that Behavior Priming yields relative improvements over direct reinforcement learning of 37.2% on three web benchmarks and 6.2% on seven multi-hop QA benchmarks, and outperforms an SFT-then-RL baseline that fine-tunes on outcome-correct trajectories. The study finds that these reasoning behaviors matter more than outcome correctness in the priming stage before reinforcement learning.
研究探讨了有效的代理搜索推理行为,识别出四种有益的行为:信息验证、权威评估、适应性搜索和错误恢复。提出了一种名为行为引导的训练方法,该方法结合了对这些行为的监督微调和强化学习。实验表明,与直接强化学习相比,该方法在网页基准测试中提高了37.2%的任务性能,在多跳问答基准测试中提高了6.2%,并且优于使用正确结果轨迹进行微调的SFT-然后-RL基线。研究强调了这些推理行为在强化学习之前的重要性,超过了结果正确性。
The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents
Authors: Eilam Shapira, Roi Reichart, Moshe Tennenholtz
First: 2026-01-16T18:18:03+00:00 · Latest: 2026-01-16T18:18:03+00:00
Abstract
The integration of AI agents into economic markets fundamentally alters the landscape of strategic interaction. We investigate the economic implications of expanding the set of available technologies in three canonical game-theoretic settings: bargaining (resource division), negotiation (asymmetric information trade), and persuasion (strategic information transmission). We find that simply increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes, often creating incentives for regulators to proactively develop and release technologies. Conversely, we identify a strategic phenomenon termed the "Poisoned Apple" effect: an agent may release a new technology, which neither they nor their opponent ultimately uses, solely to manipulate the regulator's choice of market design in their favor. This strategic release improves the releaser's welfare at the expense of their opponent and the regulator's fairness objectives. Our findings demonstrate that static regulatory frameworks are vulnerable to manipulation via technology expansion, necessitating dynamic market designs that adapt to the evolving landscape of AI capabilities.
中文标题/摘要
标题:毒苹果效应:通过AI代理技术扩展对媒介市场的战略操控
将AI代理融入经济市场从根本上改变了战略互动的格局。我们研究了在三种经典的博弈论框架下扩展可用技术集的经济影响:讨价还价(资源分配)、谈判(不对称信息交易)和说服(战略信息传递)。我们发现,仅仅增加AI代理的选择就能大幅改变均衡收益和监管结果,经常促使监管者主动开发和发布技术。相反,我们发现了一种战略现象,称为“毒苹果”效应:一个代理可能会发布一种新技术,这种技术他们和对手最终都不使用,只是为了操纵监管者对市场设计的选择以利于自己。这种战略发布提高了发布者的福利,却损害了对手和监管者的公平目标。我们的研究结果表明,静态的监管框架容易受到技术扩展的操控,需要动态的市场设计以适应AI能力的不断变化。
Summary / 总结
This study explores how the expansion of AI technologies in economic markets affects strategic interactions. By examining bargaining, negotiation, and persuasion scenarios, the research reveals that increasing AI choices can significantly alter equilibrium outcomes and regulatory decisions. A key finding is the 'Poisoned Apple' effect, where an agent releases a new technology that neither they nor their opponent ultimately uses, solely to influence the regulator's market design, benefiting themselves at the expense of their opponent and regulatory fairness. This shows that static regulatory frameworks are vulnerable to such manipulation and motivates dynamic market designs that adapt as AI capabilities evolve.
研究探讨了AI技术扩展如何影响经济市场的战略互动。通过分析讨价还价、谈判和说服场景,研究发现增加AI选择可以显著改变均衡结果和监管决策。一个关键发现是‘毒苹果’效应,即一方发布新技术以影响监管者的市场设计,从而自身受益但损害对手和监管公平性。这表明静态的监管框架容易被操纵,需要动态的市场设计来应对AI能力的演变。
CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation
Authors: Vanshali Sharma, Andrea Mia Bejar, Gorkem Durak, Ulas Bagci
Venue: ISBI 2026
First: 2026-01-16T18:09:19+00:00 · Latest: 2026-01-16T18:09:19+00:00
Comments: Accepted at ISBI 2026
Abstract
In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework to assess their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, a first unified metric assessment framework with three modules determining the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using our novel framework, we found that lexical NLG metrics are highly sensitive to stylistic variations; GREEN Score aligns best with expert judgments (Spearman~0.70), while CRG shows negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, code, and allowable portion of the anonymized evaluation data (rephrased/error-injected CT reports), to facilitate reproducible benchmarking and future metric development.
中文标题/摘要
标题:CTest-Metric:一种统一框架以评估CT报告生成中临床有效性的度量标准
在生成式AI时代,即使关键医疗任务正变得越来越自动化,放射学报告生成(RRG)仍然依赖于次优化的度量标准来进行质量评估。因此,开发特定领域的度量标准一直是研究的活跃领域,但由于缺乏一个统一且定义良好的框架来评估其在临床环境中的稳健性和适用性,这仍然是一个挑战。为了解决这个问题,我们提出了CTest-Metric,这是一种统一的度量标准评估框架,包含三个模块来确定度量标准在CT RRG中的临床可行性。这些模块测试:(i) 通过基于LLM的重写测试写作风格的一般性(WSG);(ii) 在不同严重程度上注入合成错误(SEI);(iii) 使用临床医生对175个“分歧”案例的评级测试度量标准与专家判断的相关性(MvE)。八个广泛使用的度量标准(BLEU、ROUGE、METEOR、BERTScore-F1、F1-RadGraph、RaTEScore、GREEN评分、CRG)在七个基于CT-CLIP编码器构建的LLM上进行了研究。使用我们新颖的框架,我们发现词汇NLG度量标准对风格变化非常敏感;GREEN评分与专家判断最一致(斯皮尔曼相关系数约为0.70),而CRG显示出负相关;BERTScore-F1对事实错误注入的敏感性最低。我们将发布该框架、代码以及匿名评估数据的部分(重写/错误注入的CT报告),以促进可重复基准测试和未来度量标准的发展。
Summary / 总结
CTest-Metric is a unified framework designed to assess the clinical validity of metrics for CT report generation. It includes three modules: Writing Style Generalizability (WSG), Synthetic Error Injection (SEI), and Metrics-vs-Expert correlation (MvE). The study evaluated eight metrics across seven large language models and found that lexical NLG metrics are sensitive to stylistic variations, GREEN Score best aligns with expert judgments, CRG shows negative correlation, and BERTScore-F1 is least sensitive to factual errors.
CTest-Metric 是一个统一的框架,用于评估用于 CT 报告生成的指标的临床有效性。它通过三个模块进行评估:写作风格的一致性、合成错误注入和指标与专家判断的相关性。研究发现,词汇型 NLG 指标对风格变化非常敏感,GREEN Score 最好地与专家判断一致,CRG 显示出负相关性,而 BERTScore-F1 对事实错误注入最不敏感。
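The Metrics-vs-Expert module rests on rank correlation between metric scores and clinician ratings. A minimal sketch of Spearman correlation (Pearson correlation of average ranks, with ties sharing their mean rank) is given below. The scores and ratings are invented for illustration and are not taken from the paper; the helper names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical automatic metric scores and clinician ratings for five reports.
metric_scores = [0.20, 0.40, 0.60, 0.80, 0.90]
expert_ratings = [1, 2, 3, 4, 5]

rho_aligned = spearman(metric_scores, expert_ratings)
rho_inverted = spearman(metric_scores, expert_ratings[::-1])
print(rho_aligned, rho_inverted)
```

A metric whose ranking of reports agrees with the experts yields a correlation near 1, while one that ranks them in the opposite order yields a value near -1, which is the sense in which a metric can show negative correlation with expert judgment.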
Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training
Authors: Shuo Cheng, Liqian Ma, Zhenyang Chen, Ajay Mandlekar, Caelan Garrett, Danfei Xu
Venue: NeurIPS 2025
First: 2025-09-23T04:32:53+00:00 · Latest: 2026-01-16T18:05:09+00:00
Comments: Accepted to NeurIPS 2025
Abstract
Behavior cloning has shown promise for robot manipulation, but real-world demonstrations are costly to acquire at scale. While simulated data offers a scalable alternative, particularly with advances in automated demonstration generation, transferring policies to the real world is hampered by various simulation and real domain gaps. In this work, we propose a unified sim-and-real co-training framework for learning generalizable manipulation policies that primarily leverages simulation and only requires a few real-world demonstrations. Central to our approach is learning a domain-invariant, task-relevant feature space. Our key insight is that aligning the joint distributions of observations and their corresponding actions across domains provides a richer signal than aligning observations (marginals) alone. We achieve this by embedding an Optimal Transport (OT)-inspired loss within the co-training framework, and extend this to an Unbalanced OT framework to handle the imbalance between abundant simulation data and limited real-world examples. We validate our method on challenging manipulation tasks, showing it can leverage abundant simulation data to achieve up to a 30% improvement in the real-world success rate and even generalize to scenarios seen only in simulation. Project webpage: https://ot-sim2real.github.io/.
中文标题/摘要
标题:面向仿真与真实策略协同训练的可泛化领域自适应
行为克隆在机器人操作中显示出潜力,但大规模获取真实世界的演示数据成本高昂。虽然模拟数据提供了可扩展的替代方案,特别是随着自动化演示生成技术的进步,将策略转移到现实世界受到各种模拟与现实领域差距的阻碍。在本文中,我们提出了一种统一的模拟与现实共训练框架,用于学习通用的操作策略,主要依赖于模拟数据,仅需少量真实世界的演示数据。我们方法的核心在于学习一个领域不变的任务相关特征空间。我们的关键见解是,跨领域对观测和相应动作联合分布的对齐提供了比仅对观测(边缘分布)对齐更丰富的信号。我们通过在共训练框架中嵌入一种基于最优传输(OT)的损失来实现这一点,并将其扩展为不平衡OT框架,以处理模拟数据丰富而现实世界示例有限的不平衡问题。我们在具有挑战性的操作任务上验证了该方法,表明它可以利用丰富的模拟数据在现实世界成功率上提高多达30%,甚至可以泛化到仅在模拟中出现的场景。项目网页:https://ot-sim2real.github.io/
Summary / 总结
This work addresses the challenge of transferring robot manipulation policies from simulation to the real world by proposing a unified sim-and-real co-training framework. The method focuses on learning a domain-invariant feature space and uses an Optimal Transport-inspired loss to align the joint distributions of observations and actions across domains. Experiments show that the approach can leverage abundant simulation data to improve real-world success rates by up to 30% and generalize to scenarios seen only in simulation.
本文提出了一种统一的模拟与现实联合训练框架,以解决将机器人操作策略从模拟环境转移到现实世界的问题。该方法侧重于学习一个域不变的特征空间,并使用最优传输启发式的损失来对观测和动作的联合分布进行对齐。实验结果表明,该方法可以通过利用丰富的模拟数据显著提高现实世界的成功率,最高可提高30%,甚至可以泛化到仅在模拟中出现的场景。
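The idea of aligning joint (observation, action) distributions with an OT-style loss can be illustrated with a small entropic-regularized Sinkhorn sketch. This is an assumption-laden toy rather than the paper's loss: it uses balanced OT with uniform marginals (the unbalanced variant the abstract mentions relaxes the marginal constraints and is not implemented here), works on raw low-dimensional vectors instead of learned features, and all names and data are hypothetical.

```python
import math


def joint_cost(sim, real):
    """Squared Euclidean cost between concatenated (observation, action) vectors."""
    return [[sum((p - q) ** 2 for p, q in zip(s, r)) for r in real] for s in sim]


def sinkhorn_plan(cost, eps=0.5, iters=200):
    """Entropic-regularized OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on simulation samples
    b = [1.0 / m] * m  # uniform mass on real samples
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_loss(sim, real, eps=0.5):
    """Transport cost under the Sinkhorn plan; smaller means better-aligned joints."""
    cost = joint_cost(sim, real)
    plan = sinkhorn_plan(cost, eps)
    return sum(plan[i][j] * cost[i][j] for i in range(len(sim)) for j in range(len(real)))


# Hypothetical samples: each tuple is (obs_x, obs_y, action).
sim = [(0.0, 0.0, 1.0), (1.0, 1.0, -1.0), (0.5, 0.5, 0.0)]
real = [(0.1, 0.0, 1.0), (0.9, 1.0, -1.0)]

plan = sinkhorn_plan(joint_cost(sim, real))
loss_sim_real = ot_loss(sim, real)
loss_real_real = ot_loss(real, real)
print(loss_sim_real, loss_real_real)
```

Because the cost includes the action coordinate, two samples with similar observations but different actions are treated as far apart, which is the extra signal that joint alignment offers over aligning observations alone.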
Health Facility Location in Ethiopia: Leveraging LLMs to Integrate Expert Knowledge into Algorithmic Planning
Authors: Yohai Trabelsi, Guojun Xiong, Fentabil Getnet, Stéphane Verguet, Milind Tambe
First: 2026-01-16T18:02:09+00:00 · Latest: 2026-01-16T18:02:09+00:00
Abstract
Ethiopia's Ministry of Health is upgrading health posts to improve access to essential services, particularly in rural areas. Limited resources, however, require careful prioritization of which facilities to upgrade to maximize population coverage while accounting for diverse expert and stakeholder preferences. In collaboration with the Ethiopian Public Health Institute and Ministry of Health, we propose a hybrid framework that systematically integrates expert knowledge with optimization techniques. Classical optimization methods provide theoretical guarantees but require explicit, quantitative objectives, whereas stakeholder criteria are often articulated in natural language and difficult to formalize. To bridge these domains, we develop the Large language model and Extended Greedy (LEG) framework. Our framework combines a provable approximation algorithm for population coverage optimization with LLM-driven iterative refinement that incorporates human-AI alignment to ensure solutions reflect expert qualitative guidance while preserving coverage guarantees. Experiments on real-world data from three Ethiopian regions demonstrate the framework's effectiveness and its potential to inform equitable, data-driven health system planning.
中文标题/摘要
标题:埃塞俄比亚卫生设施选址:利用大语言模型将专家知识融入算法规划
埃塞俄比亚卫生部正在升级卫生站以改善基本服务的可及性,特别是在农村地区。然而,有限的资源要求在升级哪些设施时进行仔细优先排序,以最大化人口覆盖率并考虑多样化的专家和利益相关者偏好。与埃塞俄比亚公共卫生研究所和卫生部合作,我们提出了一种混合框架,系统地将专家知识与优化技术相结合。经典优化方法提供了理论保证,但需要明确的、量化的目标,而利益相关者的标准通常用自然语言表达且难以形式化。为了弥合这些领域之间的差距,我们开发了大语言模型和扩展贪婪(LEG)框架。该框架将用于人口覆盖率优化的可证明近似算法与LLM驱动的迭代改进相结合,后者引入人机对齐,以确保解决方案反映专家的定性指导,同时保持覆盖率保证。在三个埃塞俄比亚地区的真实数据上进行的实验表明了该框架的有效性,及其为公平、数据驱动的卫生系统规划提供参考的潜力。
Summary / 总结
The research aims to improve access to essential health services in Ethiopia by prioritizing which health posts to upgrade, especially in rural areas, under limited resources. The hybrid LEG framework combines a provable approximation algorithm for population coverage optimization with LLM-driven iterative refinement, so that solutions align with experts' qualitative guidance while preserving coverage guarantees. Experiments on real-world data from three Ethiopian regions show the framework's effectiveness and its potential to inform equitable health system planning.
研究旨在通过优先升级卫生设施,特别是在农村地区,提高埃塞俄比亚的基本医疗服务可及性。提出的LEG框架结合了专家知识和优化技术来应对资源有限的挑战。该框架使用了用于人口覆盖率优化的可证明近似算法,并通过大型语言模型驱动的迭代改进过程确保解决方案与专家的定性指导相一致,同时保持理论上的保证。在三个埃塞俄比亚地区的实际数据上进行的实验表明,该框架在指导公平的卫生系统规划方面具有有效性。
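The coverage-optimization step mentioned above can be illustrated with the textbook greedy algorithm for maximum coverage, which carries a (1 - 1/e) approximation guarantee. This is a minimal sketch only: the abstract does not spell out the paper's "Extended Greedy" procedure, and the function name, the facility labels, and the toy coverage sets below are all hypothetical.

```python
def greedy_max_coverage(candidates, budget):
    """Pick up to `budget` facilities, each time taking the one that
    covers the most people who are not covered yet."""
    covered = set()
    chosen = []
    remaining = dict(candidates)
    for _ in range(budget):
        # Marginal gain of a facility = number of newly covered IDs.
        best = max(remaining, key=lambda f: len(remaining[f] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left to pick, or no facility adds coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical toy data: facility -> set of household IDs it would cover.
posts = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
    "D": {1, 6},
}
chosen, covered = greedy_max_coverage(posts, budget=2)
print(chosen, sorted(covered))
```

In this toy instance the greedy rule first takes the facility with the largest coverage and then the one with the largest marginal gain. An LLM-driven refinement step, as described in the abstract, would sit on top of a routine like this one.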
What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study
Authors: Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui
First: 2025-06-14T15:26:31+00:00 · Latest: 2026-01-16T17:59:34+00:00
Abstract
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12$\times$ faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.
中文标题/摘要
标题:什么造就了适合LLM中心语音生成的优质语音分词器?一项系统研究
语音语言模型(SLMs)为统一语音和文本的理解与生成提供了有希望的途径。然而,在实现有效的跨模态对齐和高质量的语音生成方面仍存在挑战。在本工作中,我们系统地研究了在LLM中心的SLMs中语音分词器设计的作用,这些模型通过语音头和说话人建模进行增强。我们在公平的SLM框架下比较了耦合、半解耦和完全解耦的语音分词器,并发现解耦分词显著提高了对齐和合成质量。为了解决语音和文本之间信息密度的不匹配,我们引入了多令牌预测(MTP)到SLMs中,使每个隐藏状态能够解码多个语音令牌。这导致了高达12倍的解码速度提升,并且词错误率大幅下降(从6.07降至3.01)。此外,我们提出了一种基于说话人的生成范式,并引入了RoleTriviaQA,这是一个包含多种说话人身份的大规模角色扮演知识问答基准。实验表明,我们的方法提高了知识理解和说话人一致性。
Summary / 总结
This study investigates the impact of different speech tokenizer designs on LLM-centric speech generation models. By comparing coupled, semi-decoupled, and fully decoupled tokenizers, the research finds that decoupled tokenization improves alignment and synthesis quality. The introduction of multi-token prediction (MTP) further enhances decoding speed and reduces word error rate. Additionally, a speaker-aware generation paradigm and RoleTriviaQA benchmark are proposed to improve knowledge understanding and speaker consistency in speech generation models.
该研究系统地探讨了不同语音分词设计对LLM为中心的语音生成模型的影响。通过比较耦合、半解耦和完全解耦的分词器,研究发现解耦分词可以提高对齐和合成质量。引入多令牌预测(MTP)进一步提高了解码速度并降低了词错误率。此外,提出了一个说话人感知的生成范式和RoleTriviaQA大规模角色扮演知识问答基准,以提高语音生成模型中的知识理解和说话人一致性。
UCB-type Algorithm for Budget-Constrained Expert Learning
Authors: Ilgam Latypov, Alexandra Suvorikova, Alexey Kroshnin, Alexander Gasnikov, Yuriy Dorn
First: 2025-10-26T12:36:17+00:00 · Latest: 2026-01-16T17:59:33+00:00
Abstract
In many modern applications, a system must dynamically choose between several adaptive learning algorithms that are trained online. Examples include model selection in streaming environments, switching between trading strategies in finance, and orchestrating multiple contextual bandit or reinforcement learning agents. At each round, a learner must select one predictor among $K$ adaptive experts to make a prediction, while being able to update at most $M \le K$ of them under a fixed training budget.
We address this problem in the stochastic setting and introduce M-LCB, a computationally efficient UCB-style meta-algorithm that provides anytime regret guarantees. Its confidence intervals are built directly from realized losses, require no additional optimization, and seamlessly reflect the convergence properties of the underlying experts. If each expert achieves internal regret $\tilde O(T^\alpha)$, then M-LCB ensures overall regret bounded by $\tilde O\!\Bigl(\sqrt{\tfrac{KT}{M}} \;+\; (K/M)^{1-\alpha}\,T^\alpha\Bigr)$.
To our knowledge, this is the first result establishing regret guarantees when multiple adaptive experts are trained simultaneously under per-round budget constraints. We illustrate the framework with two representative cases: (i) parametric models trained online with stochastic losses, and (ii) experts that are themselves multi-armed bandit algorithms. These examples highlight how M-LCB extends the classical bandit paradigm to the more realistic scenario of coordinating stateful, self-learning experts under limited resources.
Summary / 总结
The paper addresses the problem of dynamically selecting among $K$ adaptive learning algorithms when at most $M \le K$ of them can be updated per round under a fixed training budget. It introduces M-LCB, a computationally efficient UCB-style meta-algorithm that provides anytime regret guarantees in the stochastic setting. M-LCB ensures overall regret bounded by $\tilde O\!\Bigl(\sqrt{\tfrac{KT}{M}} \;+\; (K/M)^{1-\alpha}\,T^\alpha\Bigr)$ when each expert achieves internal regret $\tilde O(T^\alpha)$. To the authors' knowledge, this is the first result establishing regret guarantees for multiple adaptive experts trained simultaneously under per-round budget constraints.
论文解决了在固定训练预算下动态选择多个自适应学习算法的问题。它引入了M-LCB,这是一种高效计算的UCB风格元算法,提供了任意时间的遗憾保证。M-LCB在每个专家达到内部遗憾$\tilde O(T^\alpha)$的情况下,确保总体遗憾界为$\tilde O\!\Bigl(\sqrt{\tfrac{KT}{M}} \;+\; (K/M)^{1-\alpha}\,T^\alpha\Bigr)$。据作者所知,这是首次在每轮预算约束下为多个自适应专家建立遗憾保证的结果。
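As a quick sanity check on the regret bound stated above, two special cases can be worked out by hand. This is only arithmetic on the expression given in the abstract, not an additional result from the paper.

```latex
% Case 1: experts with internal regret \tilde O(\sqrt{T}), i.e. \alpha = 1/2.
% Both terms of the bound coincide.
\[
\sqrt{\tfrac{KT}{M}} + \Bigl(\tfrac{K}{M}\Bigr)^{1-\frac{1}{2}} T^{\frac{1}{2}}
  = \sqrt{\tfrac{KT}{M}} + \sqrt{\tfrac{KT}{M}}
  = 2\sqrt{\tfrac{KT}{M}} .
\]

% Case 2: the budget allows updating every expert in each round, i.e. M = K.
% The dependence on K disappears from the bound.
\[
\sqrt{\tfrac{KT}{K}} + \Bigl(\tfrac{K}{K}\Bigr)^{1-\alpha} T^{\alpha}
  = \sqrt{T} + T^{\alpha} .
\]
```

In the first case the bound is of order $\sqrt{KT/M}$, so a smaller update budget $M$ enters only through the ratio $K/M$.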
Generative Scenario Rollouts for End-to-End Autonomous Driving
Authors: Rajeev Yasarla, Deepti Hegde, Shizhong Han, Hsin-Pai Cheng, Yunxiao Shi, Meysam Sadeghigooghari, Shweta Mahajan, Apratim Bhattacharyya, Litian Liu, Risheek Garrepalli, Thomas Svantesson, Fatih Porikli, Hong Cai
First: 2026-01-16T17:59:28+00:00 · Latest: 2026-01-16T17:59:28+00:00
Abstract
Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness. These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.
中文标题/摘要
标题:生成场景展开在端到端自动驾驶中的应用
视觉-语言-动作(VLA)模型正在成为端到端自动驾驶系统中高度有效的规划模型。然而,当前的工作主要依赖于稀疏轨迹注释的模仿学习,并且未能充分利用其作为生成模型的潜力。我们提出了生成场景展开(GeRo),这是一种插件式框架,通过自回归展开策略联合执行基于语言的未来交通场景的规划和生成。首先,训练一个VLA模型将自我车辆和代理的动力学编码为在规划、运动和语言任务监督下的潜在标记,促进文本对齐的生成。接下来,GeRo执行基于语言的自回归生成。给定多视角图像、场景描述和自我动作问题,它生成未来潜在标记和文本响应以引导长期展开。展开一致性损失使用真实值或伪标签稳定预测,减轻漂移并保持文本-动作对齐。这种设计使GeRo能够执行时间一致、基于语言的展开,支持长期推理和多智能体规划。在Bench2Drive上,GeRo的驾驶得分和成功率分别提高了15.7和26.2。通过将强化学习与生成展开相结合,GeRo实现了最先进的闭环和开环性能,展示了强大的零样本鲁棒性。这些结果突显了生成、基于语言推理作为端到端自动驾驶安全性和可解释性基础的潜力。
Summary / 总结
The paper proposes Generative Scenario Rollouts (GeRo), a plug-and-play framework that lets Vision-Language-Action models jointly plan and generate language-grounded future traffic scenes. GeRo uses an autoregressive rollout strategy to generate future latent tokens and textual responses, with a rollout-consistency loss to stabilize predictions. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively, and achieves state-of-the-art closed-loop and open-loop performance, highlighting the potential of generative, language-conditioned reasoning for safer and more interpretable autonomous driving.
研究旨在通过利用Vision-Language-Action (VLA) 模型作为生成模型来提升端到端自动驾驶系统。提出的Generative Scenario Rollouts (GeRo) 框架训练VLA模型将动态编码为潜在标记,并进行未来交通场景的自回归生成。GeRo 将驾驶得分和成功率分别提高了15.7和26.2,并在闭环和开环场景中均实现了最先进的性能,展示了强大的零样本鲁棒性。
Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity
Authors: Hauke Licht
First: 2025-12-11T18:11:46+00:00 · Latest: 2026-01-16T17:56:16+00:00
Abstract
Research increasingly leverages audio-visual materials to analyze emotions in political communication. Multimodal large language models (mLLMs) promise to enable such analyses through in-context learning. However, we lack systematic evidence on whether these models can reliably measure emotions in real-world political settings. This paper evaluates leading mLLMs for video-based emotional arousal measurement using two complementary human-labeled video datasets: recordings created under laboratory conditions and real-world parliamentary debates. I find a critical lab-vs-field performance gap. In video created under laboratory conditions, mLLMs arousal scores approach human-level reliability with little to no demographic bias. However, in parliamentary debate recordings, all examined models' arousal scores correlate at best moderately with average human ratings and exhibit systematic bias by speaker gender and age. Neither relying on leading closed-source mLLMs nor computational noise mitigation strategies change this finding. Further, mLLMs underperform even in sentiment analysis when using video recordings instead of text transcripts of the same speeches. These findings reveal important limitations of current mLLMs for real-world political video analysis and establish a rigorous evaluation framework for tracking future developments.
中文标题/摘要
标题:使用多模态大语言模型进行计算情感分析:新兴方法学机会的现有证据
研究越来越多地利用音频-视觉材料来分析政治沟通中的情感。多模态大语言模型(mLLMs)有望通过上下文学习来实现此类分析。然而,我们缺乏系统证据表明这些模型是否能在现实世界的政治环境中可靠地测量情感。本文使用两个互补的人标注视频数据集——实验室条件下创建的录制和实际议会辩论录制,评估了领先mLLMs在基于视频的情感唤醒测量方面的表现。我发现实验室与现场之间存在关键性能差距。在实验室条件下创建的视频中,mLLMs的唤醒评分接近人类水平的可靠性,几乎没有人口统计学偏差。然而,在议会辩论录制中,所有检查的模型的唤醒评分与平均人类评分的相关性最多为中等,并且表现出系统性偏差,按发言者性别和年龄划分。无论是依赖领先的闭源mLLMs,还是计算噪声缓解策略,都无法改变这一发现。此外,当使用视频录制而不是相同演讲的文字转录时,mLLMs在情感分析中的表现甚至不如在文本转录中。这些发现揭示了当前mLLMs在现实世界政治视频分析中的重要局限性,并建立了跟踪未来发展的严格评估框架。
Summary / 总结
This paper evaluates multimodal large language models (mLLMs) for measuring emotional arousal in video recordings of political communication, using two human-labeled datasets: videos recorded under laboratory conditions and real-world parliamentary debates. On the laboratory videos, mLLM arousal scores approach human-level reliability with little to no demographic bias. On the parliamentary recordings, the scores correlate at best moderately with average human ratings and show systematic bias by speaker gender and age. The models also perform worse in sentiment analysis on video than on text transcripts of the same speeches. The findings reveal important limitations of current mLLMs for real-world political video analysis and provide an evaluation framework for tracking future developments.
本文评估了多模态大型语言模型(mLLMs)在测量政治沟通视频中情感唤醒方面的表现。研究使用了两个数据集:实验室创建的视频和实际议会辩论。研究发现,mLLMs在实验室环境中表现良好,但在实际环境中表现出显著的性能差距和性别、年龄偏差。此外,当使用视频录制而不是同一演讲的文字转录进行情感分析时,模型表现不佳。这些发现揭示了当前mLLMs在实际政治视频分析中的重要局限性,并建议需要进一步发展。
Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs
Authors: Alessandro Padella, Massimiliano de Leoni, Marlon Dumas
First: 2026-01-16T17:54:55+00:00 · Latest: 2026-01-16T17:54:55+00:00
Comments: 19 pages, 4 figures, TMIS journal submission
Abstract
Predictive Process Monitoring is a branch of process mining that aims to predict the outcome of an ongoing process. Recently, it leveraged machine-and-deep learning architectures. In this paper, we extend our prior LLM-based Predictive Process Monitoring framework, which was initially focused on total time prediction via prompting. The extension consists of comprehensively evaluating its generality, semantic leverage, and reasoning mechanisms, also across multiple Key Performance Indicators. Empirical evaluations conducted on three distinct event logs and across the Key Performance Indicators of Total Time and Activity Occurrence prediction indicate that, in data-scarce settings with only 100 traces, the LLM surpasses the benchmark methods. Furthermore, the experiments also show that the LLM exploits both its embodied prior knowledge and the internal correlations among training traces. Finally, we examine the reasoning strategies employed by the model, demonstrating that the LLM does not merely replicate existing predictive methods but performs higher-order reasoning to generate the predictions.
中文标题/摘要
标题:探索基于LLM的预测过程监控在小型事件日志中的特征
预测过程监控是过程挖掘的一个分支,旨在预测正在进行的过程的结果。最近,它利用了机器学习和深度学习架构。在本文中,我们扩展了我们之前基于LLM的预测过程监控框架,该框架最初专注于通过提示进行总时间预测。扩展内容包括在多个关键绩效指标上全面评估其通用性、语义利用和推理机制。在三个不同的事件日志和总时间和活动发生预测的关键绩效指标上进行的实证评估表明,在只有100条轨迹的数据稀缺环境中,LLM超过了基准方法。此外,实验还表明,LLM既利用了其自身的先验知识,也利用了训练轨迹之间的内部关联。最后,我们研究了模型采用的推理策略,证明LLM并非仅仅复制现有的预测方法,而是进行更高层次的推理以生成预测。
Summary / 总结
This paper extends a prior LLM-based Predictive Process Monitoring framework to evaluate its generality and reasoning mechanisms across different Key Performance Indicators. Empirical evaluations on three event logs show that the LLM outperforms benchmark methods in data-scarce settings with only 100 traces. The LLM leverages its prior knowledge and internal trace correlations to generate predictions, demonstrating higher-order reasoning capabilities.
本文扩展了先前基于LLM的预测过程监控框架,最初专注于总时间预测。它在多个关键绩效指标上评估了该框架的通用性、语义利用和推理机制。实验表明,在只有100条轨迹的数据稀缺环境中,LLM在三个事件日志上优于基准方法,利用了先验知识和训练轨迹之间的内部关联。LLM展示了生成预测时的高层次推理,而不仅仅是复制现有方法。
MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
Authors: Xiaoran Fan, Zhichao Sun, Tao Ji, Lixing Shen, Tao Gui
First: 2026-01-16T17:45:34+00:00 · Latest: 2026-01-16T17:45:34+00:00
Abstract
As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
中文标题/摘要
标题:MHA2MLA-VLM:使DeepSeek的经济型多头潜在注意力适用于视觉-语言模型
随着视觉-语言模型(VLMs)处理越来越复杂和多模态的任务,关键-值(KV)缓存的快速增长在推理过程中造成了显著的内存和计算瓶颈。虽然多头潜在注意力(MLA)提供了一种有效的压缩KV缓存和加速推理的方法,但如何在不进行昂贵的预训练的情况下将现有的VLMs适应到MLA架构中仍然鲜有探索。在本文中,我们提出了MHA2MLA-VLM,这是一种参数高效且多模态感知的框架,用于将现成的VLMs转换为MLA。我们的方法包含两个核心技术:(1)一种适应模态的部分-RoPE策略,该策略通过选择性地屏蔽非必要维度支持传统的和多模态设置,(2)一种模态解耦的低秩近似方法,该方法独立地压缩了视觉和文本的KV空间。此外,我们引入了参数高效的微调以最小化适应成本,并证明了最小化输出激活误差而非参数距离可以显著减少性能损失。在三个代表性VLMs上的广泛实验表明,MHA2MLA-VLM在最少的监督数据下恢复了原始模型性能,显著减少了KV缓存的占用空间,并与KV量化无缝集成。
Summary / 总结
The research aims to address the memory and computational challenges posed by the Key-Value (KV) cache in vision-language models (VLMs) by introducing MHA2MLA-VLM, a parameter-efficient framework for converting existing VLMs to Multi-Head Latent Attention (MLA). The method employs a modality-adaptive partial-RoPE strategy and a modality-decoupled low-rank approximation to compress the KV cache, and it includes parameter-efficient fine-tuning to minimize adaptation cost. Experimental results on three VLMs show that MHA2MLA-VLM can restore original model performance with minimal supervised data, reduce KV cache size, and integrate well with KV quantization.
该研究针对视觉语言模型(VLMs)中关键值缓存带来的内存和计算瓶颈,提出了MHA2MLA-VLM,一种参数高效的框架,将现有VLMs转换为多头潜注意力(MLA)。方法包括模态自适应部分RoPE策略和模态解耦低秩近似,以支持传统和多模态设置,并引入参数高效的微调以最小化适应成本。实验表明,MHA2MLA-VLM 可以在最少的监督数据下恢复原始模型性能,减少关键值缓存的占用,并与关键值量化无缝集成。
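To give a feel for why a low-rank latent shrinks the KV cache, the per-token, per-layer cache size can be compared by hand. This is a generic sketch of low-rank KV compression, not the paper's exact construction. The symbols $n_h$, $d_h$, $r$, and $d_{\mathrm{rope}}$ and the numbers plugged in are illustrative assumptions, and $d_{\mathrm{rope}}$ simply stands for whatever position-encoded key dimensions are cached uncompressed.

```latex
% Standard multi-head attention caches one key and one value vector per head:
\[
C_{\mathrm{MHA}} = 2\, n_h\, d_h .
\]

% A low-rank scheme caches an r-dimensional latent c = h W^{\downarrow}
% and reconstructs k = c W^{\uparrow K}, v = c W^{\uparrow V} on the fly:
\[
C_{\mathrm{latent}} = r + d_{\mathrm{rope}} .
\]

% Hypothetical numbers: n_h = 32, d_h = 128, r = 512, d_rope = 64.
\[
\frac{C_{\mathrm{latent}}}{C_{\mathrm{MHA}}}
  = \frac{512 + 64}{2 \cdot 32 \cdot 128}
  = \frac{576}{8192}
  \approx 0.070 .
\]
```

Under these assumed sizes the cache holds roughly 7% of the original entries per token. The actual savings reported for MHA2MLA-VLM depend on the ranks chosen for the visual and textual KV spaces, which the abstract does not state.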
Learning Semantic-Geometric Task Graph-Representations from Human Demonstrations
Authors: Franziska Herbert, Vignesh Prasad, Han Liu, Dorothea Koert, Georgia Chalvatzaki
First: 2026-01-16T17:35:00+00:00 · Latest: 2026-01-16T17:35:00+00:00
Comments: 9 pages, 7 figures, preprint
Abstract
Learning structured task representations from human demonstrations is essential for understanding long-horizon manipulation behaviors, particularly in bimanual settings where action ordering, object involvement, and interaction geometry can vary significantly. A key challenge lies in jointly capturing the discrete semantic structure of tasks and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. In this work, we introduce a semantic-geometric task graph-representation that encodes object identities, inter-object relations, and their temporal geometric evolution from human demonstrations. Building on this formulation, we propose a learning framework that combines a Message Passing Neural Network (MPNN) encoder with a Transformer-based decoder, decoupling scene representation learning from action-conditioned reasoning about task progression. The encoder operates solely on temporal scene graphs to learn structured representations, while the decoder conditions on action-context to predict future action sequences, associated objects, and object motions over extended time horizons. Through extensive evaluation on human demonstration datasets, we show that semantic-geometric task graph-representations are particularly beneficial for tasks with high action and object variability, where simpler sequence-based models struggle to capture task progression. Finally, we demonstrate that task graph representations can be transferred to a physical bimanual robot and used for online action selection, highlighting their potential as reusable task abstractions for downstream decision-making in manipulation systems.
中文标题/摘要
标题:从人类示范中学习语义-几何任务图表示
从人类示范中学习结构化任务表示对于理解长时间段操作行为至关重要,特别是在双臂操作环境中,操作顺序、物体参与和交互几何可以显著变化。一个关键挑战在于如何联合捕捉任务的离散语义结构和物体为中心的几何关系随时间的演变,以支持任务进展的推理。在本文中,我们提出了一种语义-几何任务图表示,该表示从人类示范中编码物体身份、物体间关系及其随时间的几何演变。基于此表示,我们提出了一种学习框架,该框架结合了消息传递神经网络(MPNN)编码器和基于变换器的解码器,将场景表示学习与基于动作条件的任务进展推理解耦。编码器仅在时间场景图上操作以学习结构化表示,而解码器根据动作上下文预测未来动作序列、相关物体及其在长时间段内的运动。通过在人类示范数据集上的广泛评估,我们表明语义-几何任务图表示特别适用于具有高动作和物体变异性任务,其中基于序列的简单模型难以捕捉任务进展。最后,我们展示了任务图表示可以转移到物理双臂机器人并用于在线动作选择,突显了它们作为下游操作系统决策中可重用任务抽象的潜力。
Summary / 总结
This paper addresses the challenge of learning structured task representations from human demonstrations, especially in bimanual manipulation tasks. It introduces a semantic-geometric task graph-representation that captures object identities, inter-object relations, and their temporal evolution. The proposed learning framework uses a Message Passing Neural Network encoder and a Transformer-based decoder to encode scene graphs and predict future actions. Experiments show that this approach is effective for tasks with high action and object variability, outperforming simpler sequence-based models. The task graph representations can also be transferred to a physical robot for online action selection, demonstrating their potential for manipulation systems.
该研究旨在从人类演示中学习结构化的任务表示,特别是在双臂操作任务中。它提出了一种语义-几何任务图表示,能够捕捉物体身份、物体间关系及其时间上的演变。所提出的学习框架使用了消息传递神经网络编码器和基于变换器的解码器来预测未来动作和物体运动。实验表明,该方法在高动作和物体变异性任务中优于简单的序列模型,并且可以应用于物理机器人进行在线动作选择。
Probabilistic Mission Design for Neuro-Symbolic Unmanned Aircraft Systems
Authors: Simon Kohaut, Benedict Flade, Daniel Ochs, Devendra Singh Dhami, Julian Eggert, Kristian Kersting
First: 2024-12-25T11:04:00+00:00 · Latest: 2026-01-16T17:27:13+00:00
Comments: arXiv admin note: text overlap with arXiv:2406.03454
Abstract
Advanced Air Mobility (AAM) is a growing field that demands accurate and trustworthy models of legal concepts and restrictions for navigating Unmanned Aircraft Systems (UAS). In addition, any implementation of AAM needs to face the challenges posed by inherently dynamic and uncertain human-inhabited spaces robustly. Nevertheless, the employment of UAS beyond visual line of sight (BVLOS) is an endearing task that promises to significantly enhance today's logistics and emergency response capabilities. Hence, we propose Probabilistic Mission Design (ProMis), a novel neuro-symbolic approach to navigating UAS within legal frameworks. ProMis is an interpretable and adaptable system architecture that links uncertain geospatial data and noisy perception with declarative, Hybrid Probabilistic Logic Programs (HPLP) to reason over the agent's state space and its legality. To inform planning with legal restrictions and uncertainty in mind, ProMis yields Probabilistic Mission Landscapes (PML). These scalar fields quantify the belief that the HPLP is satisfied across the agent's state space. Extending prior work on ProMis' reasoning capabilities and computational characteristics, we show its integration with potent machine learning models such as Large Language Models (LLM) and Transformer-based vision models. Hence, our experiments underpin the application of ProMis with multi-modal input data and how our method applies to many AAM scenarios.
中文标题/摘要
标题:神经符号无人驾驶航空系统中的概率任务设计
先进空中交通(AAM)是一个快速增长的领域,需要准确可靠的法律概念和限制模型来导航无人驾驶航空系统(UAS)。此外,任何AAM的实现都需要面对动态和不确定的人类居住空间带来的挑战。然而,超越视距(BVLOS)的UAS应用是一个令人向往的任务,有望显著提升当今的物流和应急响应能力。因此,我们提出了概率任务设计(ProMis),这是一种新颖的神经符号方法,用于在法律框架内导航UAS。ProMis是一种可解释且适应性强的系统架构,将不确定的地理空间数据和嘈杂的感知与声明性混合概率逻辑程序(HPLP)连接起来,以推理代理的状态空间及其合法性。为了在规划中考虑法律限制和不确定性,ProMis生成了概率任务景观(PML)。这些标量场量化了HPLP在代理状态空间中得到满足的信念。通过扩展ProMis推理能力和计算特性的先前工作,我们展示了其与强大的机器学习模型(如大型语言模型LLM和基于变换器的视觉模型)的集成。因此,我们的实验证明了ProMis在多模态输入数据下的应用及其方法如何应用于许多AAM场景。
Summary / 总结
The paper proposes Probabilistic Mission Design (ProMis), a neuro-symbolic approach for navigating Unmanned Aircraft Systems (UAS) within legal frameworks. ProMis links uncertain geospatial data and noisy perception with Hybrid Probabilistic Logic Programs (HPLP) to reason over the agent's state space and its legality, producing Probabilistic Mission Landscapes (PML), scalar fields that quantify the belief that the HPLP is satisfied. Extending prior work, the experiments show ProMis integrated with Large Language Models and Transformer-based vision models, handling multi-modal input data across a range of Advanced Air Mobility (AAM) scenarios.
论文提出了Probabilistic Mission Design (ProMis),这是一种神经符号方法,用于在法律框架内导航无人驾驶航空系统 (UAS),以应对动态和不确定环境的挑战。ProMis 使用混合概率逻辑程序 (HPLP) 对代理的状态空间和合法性进行推理,生成概率任务景观 (PML),量化HPLP 满足的信念。实验展示了 ProMis 与机器学习模型的集成,证明了其在各种先进空中移动 (AAM) 场景中的适用性,使用多模态输入数据。
Predict the Retrieval! Test time adaptation for Retrieval Augmented Generation
Authors: Xin Sun, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang
First: 2026-01-16T17:07:01+00:00 · Latest: 2026-01-16T17:07:01+00:00
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for enhancing large language models' question-answering capabilities through the integration of external knowledge. However, when adapting RAG systems to specialized domains, challenges arise from distribution shifts, resulting in suboptimal generalization performance. In this work, we propose TTARAG, a test-time adaptation method that dynamically updates the language model's parameters during inference to improve RAG system performance in specialized domains. Our method introduces a simple yet effective approach where the model learns to predict retrieved content, enabling automatic parameter adjustment to the target domain. Through extensive experiments across six specialized domains, we demonstrate that TTARAG achieves substantial performance improvements over baseline RAG systems. Code available at https://github.com/sunxin000/TTARAG.
中文标题/摘要
标题:预测检索!面向检索增强生成的测试时自适应
检索增强生成(RAG)已成为通过集成外部知识来增强大型语言模型问答能力的强大方法。然而,当将RAG系统适应到特定领域时,由于分布偏移,会出现挑战,导致性能不佳。在本文中,我们提出了一种测试时自适应方法TTARAG,在推理过程中动态更新语言模型的参数,以提高RAG系统在特定领域的性能。该方法通过使模型学习预测检索内容,实现自动参数调整以适应目标领域。通过在六个特定领域的广泛实验,我们证明TTARAG在基线RAG系统上实现了显著的性能提升。代码可在https://github.com/sunxin000/TTARAG获取。
Summary / 总结
The research addresses the distribution shifts that arise when adapting Retrieval-Augmented Generation (RAG) systems to specialized domains by proposing TTARAG, a test-time adaptation method. During inference, TTARAG updates the language model's parameters by having the model learn to predict the retrieved content, which adjusts the parameters to the target domain automatically. Experiments across six specialized domains show that TTARAG substantially improves performance over baseline RAG systems.
研究旨在解决将检索增强生成(RAG)系统适应特定领域时遇到的挑战,其中分布偏移可能导致性能不佳。提出的TTARAG方法在推理过程中动态更新语言模型的参数,以预测检索到的内容,从而在特定领域中提高性能。在六个领域的实验中,TTARAG显著优于基线RAG系统。
Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps
Authors: Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, Songcen Xu, Long Quan, Eduardo Pérez-Pellitero, Youngkyoon Jang
First: 2026-01-16T17:02:46+00:00 · Latest: 2026-01-16T17:02:46+00:00
Abstract
We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on the VSI-Bench.
中文标题/摘要
标题:Map2Thought:通过度量认知图进行明确的三维空间推理
我们提出了Map2Thought框架,该框架使3D VLM能够进行明确且可解释的空间推理。该框架基于两个关键组件:度量认知图(Metric-CogMap)和认知思维链(Cog-CoT)。度量认知图通过将离散网格用于关系推理与连续的度量尺度表示用于精确的几何理解,提供了一种统一的空间表示。基于度量认知图,认知思维链通过确定性操作(包括向量操作、边界框距离以及遮挡感知的外观顺序提示)进行明确的几何推理,生成基于三维结构的可解释推理轨迹。实验结果表明,Map2Thought能够实现可解释的三维理解,仅使用一半的监督数据即可达到59.9%的准确率,接近使用完整数据集训练的基线60.9%。在10%、25%和50%训练子集上,它分别比最先进的方法高出5.3%、4.8%和4.0%的准确率,在VSI-Bench上表现优异。
Summary / 总结
Map2Thought is a framework that adds explicit, interpretable spatial reasoning to 3D vision-language models (VLMs). It uses a Metric Cognitive Map (Metric-CogMap) as a unified spatial representation and Cognitive Chain-of-Thought (Cog-CoT) for explicit geometric reasoning. With only half the supervision it reaches 59.9% accuracy, close to the 60.9% baseline trained on the full dataset, and on VSI-Bench it outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively.
Map2Thought 是一个框架,通过 Metric Cognitive Maps 和 Cognitive Chain-of-Thought 提升 3D 视觉和语言模型的空间推理能力。它将离散网格用于关系推理与连续的度量表示用于精确几何理解相结合。该框架在 VSI-Bench 上使用一半的监督信息实现了 59.9% 的准确率,并在 10%、25% 和 50% 的训练子集上分别比最先进的方法高出 5.3%、4.8% 和 4.0%。
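The deterministic operations named in the abstract (vector operations, bounding-box distances) can be pictured with a small geometric sketch. The functions below are hypothetical illustrations for axis-aligned 3D boxes; the paper's own definitions of these operations are not given in the abstract and may differ.

```python
import math


def aabb_distance(box_a, box_b):
    """Minimum Euclidean distance between two axis-aligned 3D boxes.

    Each box is (min_corner, max_corner). Returns 0.0 when the boxes
    touch or overlap.
    """
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    # Per-axis gap: positive only when the boxes are separated on that axis.
    gaps = [max(0.0, b_min[i] - a_max[i], a_min[i] - b_max[i]) for i in range(3)]
    return math.sqrt(sum(g * g for g in gaps))


def center_offset(box_a, box_b):
    """Vector from the center of box_a to the center of box_b."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    return tuple(
        (b_min[i] + b_max[i]) / 2 - (a_min[i] + a_max[i]) / 2 for i in range(3)
    )


# Hypothetical scene in metric units: a chair and a table.
chair = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
table = ((4.0, 5.0, 0.0), (6.0, 7.0, 1.0))
print(aabb_distance(chair, table))  # gaps of 3 and 4 on x and y give 5.0
print(center_offset(chair, table))
```

Because both quantities are computed in closed form from box coordinates, each reasoning step can be logged and checked, which is the kind of interpretable trace the abstract describes.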
Hierarchical Orthogonal Residual Spread for Precise Massive Editing in Large Language Models
Authors: Xiaojie Gu, Guangxu Chen, Yuheng Yang, Jingxin Han, Andi Zhang
Venue: ICASSP 2026
First: 2026-01-16T17:02:19+00:00 · Latest: 2026-01-16T17:02:19+00:00
Comments: ICASSP 2026
Abstract
Large language models (LLMs) exhibit exceptional performance across various domains, yet they face critical safety concerns. Model editing has emerged as an effective approach to mitigate these issues. Existing model editing methods often focus on optimizing an information matrix that blends new and old knowledge. While effective, these approaches can be computationally expensive and may cause conflicts. In contrast, we shift our attention to Hierarchical Orthogonal Residual SprEad of the information matrix, which reduces noisy gradients and enables more stable edits from a different perspective. We demonstrate the effectiveness of our method HORSE through a clear theoretical comparison with several popular methods and extensive experiments conducted on two datasets across multiple LLMs. The results show that HORSE maintains precise massive editing across diverse scenarios. The code is available at https://github.com/XiaojieGu/HORSE
中文标题/摘要
标题:层次正交残差扩展在大型语言模型精确大规模编辑中的应用
大型语言模型(LLMs)在各个领域表现出色,但面临关键的安全问题。模型编辑已成为缓解这些问题的有效方法。现有模型编辑方法通常侧重于优化融合新旧知识的信息矩阵。虽然有效,但这些方法可能计算成本高且可能导致冲突。相比之下,我们关注信息矩阵的层次正交残差扩展,从不同角度减少噪声梯度并实现更稳定的编辑。我们通过与几种流行方法的清晰理论比较和在两个数据集上对多个LLM进行的大量实验,展示了HORSE方法的有效性。结果显示,HORSE在多种场景下保持了精确的大规模编辑。代码可在https://github.com/XiaojieGu/HORSE获取
Summary / 总结
This paper addresses the safety concerns of large language models by proposing a method called Hierarchical Orthogonal Residual SprEad (HORSE) for precise massive editing. Unlike existing methods that optimize an information matrix, HORSE focuses on reducing noisy gradients through a hierarchical orthogonal approach, which leads to more stable edits. The effectiveness of HORSE is demonstrated through theoretical comparisons and experiments on two datasets across multiple LLMs, showing its capability to maintain precise massive editing in various scenarios.
本文提出了一种名为Hierarchical Orthogonal Residual SprEad (HORSE)的方法,用于大型语言模型的精确大规模编辑,以解决其安全性问题。HORSE不同于现有方法通过优化信息矩阵来编辑模型,而是通过分层正交的方法减少噪声梯度,从而实现更稳定的编辑。通过理论比较和在多个LLM上的两个数据集上的实验,证明了HORSE在各种场景下能够保持精确的大规模编辑能力。
From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP
Authors: Shanshan Xu, Santosh T. Y. S. S, Barbara Plank
First: 2025-10-09T17:48:29+00:00 · Latest: 2026-01-16T17:00:35+00:00
Abstract
Human Label Variation (HLV) refers to legitimate disagreement in annotation that reflects the diversity of human perspectives rather than mere error. Long treated in NLP as noise to be eliminated, HLV has only recently been reframed as a signal for improving model robustness. With the rise of large language models (LLMs) and post-training methods such as human feedback-based alignment, the role of HLV has become increasingly consequential. Yet current preference-learning datasets routinely collapse multiple annotations into a single label, flattening diverse perspectives into artificial consensus. Preserving HLV is necessary not only for pluralistic alignment but also for sociotechnical safety evaluation, where model behavior must be assessed in relation to human interaction and societal context. This position paper argues that preserving HLV as an embodiment of human pluralism must be treated as a Selbstzweck, an intrinsic value in itself. We analyze the limitations of existing preference datasets and propose actionable strategies for incorporating HLV into dataset construction to better preserve pluralistic human values.
中文标题/摘要
标题:从噪声到信号再到自我目的:在NLP后训练时代重新定义人类标签变异
人类标签变异(HLV)指的是注释中的合法分歧,反映了人类视角的多样性而非单纯的错误。在NLP中长期被视为需要消除的噪声,HLV仅在最近被重新定义为提高模型鲁棒性的信号。随着大型语言模型(LLMs)和后训练方法如基于人类反馈的对齐的兴起,HLV的作用变得越来越重要。然而,当前的偏好学习数据集通常将多个注释合并为单一标签,人为地抹平了多样性的视角。保留HLV不仅对于多元主义对齐至关重要,也对于社会技术安全性评估至关重要,其中模型行为必须与人类互动和社会背景相关联进行评估。本文认为,保留HLV作为人类多元主义的体现必须被视为一种自我目的,即内在价值本身。我们分析了现有偏好数据集的局限性,并提出了将HLV纳入数据集构建的可操作策略,以更好地保留多元的人类价值观。
Summary / 总结
The paper addresses the treatment of Human Label Variation (HLV) in NLP, which is the legitimate disagreement among human annotators. Traditionally seen as noise, HLV is now recognized as a signal for improving model robustness. With the advent of large language models and post-training methods, the importance of HLV has grown. However, current datasets often collapse multiple annotations into a single label, losing diverse perspectives. The authors argue that preserving HLV is essential for pluralistic alignment and sociotechnical safety evaluation. They propose strategies to incorporate HLV into dataset construction to better reflect human values.
论文探讨了自然语言处理(NLP)中的人类标注变异(HLV),即人类标注者之间的合法分歧。传统上被视为噪声,HLV现在被视作提高模型稳健性的信号。随着大型语言模型和后训练方法的发展,HLV的重要性日益增加。然而,当前的数据集通常将多个标注合并为一个标签,从而丧失了多样化的视角。作者认为,保留HLV对于多元主义对齐和社会技术安全性评估至关重要。他们提出了将HLV纳入数据集构建的策略,以更好地反映人类价值观。
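The contrast between collapsing annotations and preserving them can be illustrated with a minimal Python sketch. The function names and the five-annotator example are hypothetical and do not come from the paper; the sketch only shows what information a majority vote discards.

```python
from collections import Counter


def majority_label(annotations):
    """Collapse all annotations into one label (ties go to the label seen first)."""
    return Counter(annotations).most_common(1)[0][0]


def soft_label(annotations):
    """Keep the full distribution over labels instead of a single winner."""
    counts = Counter(annotations)
    total = len(annotations)
    return {label: count / total for label, count in counts.items()}


# Hypothetical preference item judged by five annotators.
votes = ["A", "A", "B", "A", "B"]

print(majority_label(votes))  # A
print(soft_label(votes))      # {'A': 0.6, 'B': 0.4}
```

The majority label hides the fact that two of five annotators preferred the other response, while the soft label keeps that disagreement available to downstream training or evaluation.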
The unreasonable effectiveness of pattern matching
Authors: Gary Lupyan, Blaise Agüera y Arcas
First: 2026-01-16T16:53:08+00:00 · Latest: 2026-01-16T16:53:08+00:00
Abstract
We report on an astonishing ability of large language models (LLMs) to make sense of "Jabberwocky" language in which most or all content words have been randomly replaced by nonsense strings, e.g., translating "He dwushed a ghanc zawk" to "He dragged a spare chair". This result addresses ongoing controversies regarding how to best think of what LLMs are doing: are they a language mimic, a database, a blurry version of the Web? The ability of LLMs to recover meaning from structural patterns speaks to the unreasonable effectiveness of pattern-matching. Pattern-matching is not an alternative to "real" intelligence, but rather a key ingredient.
中文标题/摘要
标题:模式匹配的不可思议的有效性
我们报告了大型语言模型(LLMs)对“Jabberwocky”语言的惊人理解能力,在这种语言中,大多数或所有内容词都被随机替换为无意义的字符串,例如将“He dwushed a ghanc zawk”翻译为“He dragged a spare chair”。这一结果解决了关于如何最好地理解LLMs在做什么的持续争议:它们是语言模仿、数据库还是网络的模糊版本?LLMs从结构模式中恢复意义的能力表明了模式匹配的不可思议的有效性。模式匹配不是“真实”智能的替代品,而是关键组成部分。
Summary / 总结
The study reports that large language models can recover the meaning of "Jabberwocky" sentences in which most or all content words have been replaced by nonsense strings. The authors use this result to address the debate over whether LLMs are best described as language mimics, databases, or blurry versions of the Web, and argue that pattern matching is a key ingredient of intelligence rather than an alternative to it.
研究探讨了大型语言模型在‘ Jabberwocky ’语言中理解并翻译句子的惊人能力,其中大多数单词被随机字符串替换。这一发现挑战了对 LLMs 的现有解释,表明它们的有效性源于模式匹配而非其他形式的智能。结果表明,模式匹配是 LLMs 能够理解语言的关键组成部分,而不是替代真正的智能。
DecoupledESC: Enhancing Emotional Support Generation via Strategy-Response Decoupled Preference Optimization
Authors: Chao Zhang, Xin Shi, Xueqiao Zhang, Yifan Zhu, Yi Yang, Yawei Luo
First: 2025-05-22T17:56:21+00:00 · Latest: 2026-01-16T16:50:05+00:00
Abstract
Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: Existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building upon this data, we propose a Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each was trained via SFT and subsequently enhanced by DPO to align with the psychological preference. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint optimization baselines, reducing preference bias and improving response quality.
中文标题/摘要
标题:DecoupledESC:通过策略-响应解耦偏好优化提升情感支持生成
情感支持对话(ESC)的最新进展通过监督微调(SFT)对大型语言模型(LLMs)进行微调,从而提高了情感支持生成的效果。然而,常见的心理错误仍然存在。虽然直接偏好优化(DPO)通过成对偏好学习显示出减少这些错误的潜力,但在ESC任务中的有效性受到两个关键挑战的限制:(1)纠缠的数据结构:现有的ESC数据本质上将心理策略和响应内容纠缠在一起,使得难以构建高质量的偏好成对;(2)优化模糊性:将传统的DPO应用于这种纠缠的成对数据会导致训练目标模糊。为了解决这些问题,我们引入了推断偏好挖掘(IPM)来构建高质量的偏好数据,形成了IPM-PrefDial数据集。在此数据集的基础上,我们借鉴格罗斯的情绪调节扩展过程模型,提出了一个解耦的ESC框架,将ESC任务分解为两个顺序子任务:策略规划和共情响应生成。每个任务都通过SFT进行训练,并随后通过DPO增强,以与心理偏好对齐。广泛的实验表明,我们的解耦ESC框架优于联合优化基线,减少了偏好偏差并提高了响应质量。
Summary / 总结
The research aims to enhance emotional support generation by addressing common psychological errors in Emotional Support Conversation tasks. It introduces a Decoupled ESC framework that decouples strategy planning and empathic response generation, using Inferential Preference Mining to construct high-quality preference data. Experiments show that this approach outperforms joint optimization methods, reducing preference bias and improving response quality.
研究旨在通过解决现有方法的限制,特别是数据中心理策略和回应内容的纠缠以及训练目标的模糊性,来提升情感支持生成。作者提出了一种分解式ESC框架,将任务分解为策略规划和共情回应生成,并使用监督微调(SFT)和直接偏好优化(DPO)来更好地与心理偏好对齐。实验表明,这种方法优于联合优化基线,减少了偏好偏差并提高了回应质量。
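For readers unfamiliar with DPO, the following sketch computes the standard DPO loss for a single preference pair. It is not the authors' code: the function name, the default `beta`, and the example log-probabilities are assumptions. In a decoupled setup like the one described, such a loss would presumably be applied separately to strategy pairs and to response pairs.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under the frozen reference model. beta is a hypothetical default.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    z = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Made-up log-probabilities: the policy already prefers the chosen response
# more strongly than the reference model does.
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
                ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
print(loss)
```

When the policy and the reference agree exactly, the loss equals log 2; it falls below that value as the policy shifts probability toward the chosen response.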
Relational Linearity is a Predictor of Hallucinations
Authors: Yuetian Lu, Yihong Liu, Hinrich Schütze
First: 2026-01-16T16:47:49+00:00 · Latest: 2026-01-16T16:47:49+00:00
Comments: 11 pages, 4 figures, 8 tables
Abstract
Hallucination is a central failure mode in large language models (LLMs). We focus on hallucinations of answers to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities that are unknown to the model. Surprisingly, we find that medium-size models like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. We hypothesize that an important factor in causing these hallucinations is the linearity of the relation: linear relations tend to be stored more abstractly, making it difficult for the LLM to assess its knowledge; the facts of nonlinear relations tend to be stored more directly, making knowledge assessment easier. To investigate this hypothesis, we create SyntHal, a dataset of 6000 synthetic entities for six relations. In our experiments with four models, we determine, for each relation, the hallucination rate on SyntHal and also measure its linearity, using $\Delta\cos$. We find a strong correlation ($r \in [.78,.82]$) between relational linearity and hallucination rate, providing evidence for our hypothesis that the underlying storage of triples of a relation is a factor in how well a model can self-assess its knowledge. This finding has implications for how to manage hallucination behavior and suggests new research directions for improving the representation of factual knowledge in LLMs.
Summary / 总结
The study investigates the relationship between relational linearity and hallucinations in large language models (LLMs). By creating a dataset called SyntHal with 6000 synthetic entities for six relations, the researchers found a strong correlation (r in [.78, .82]) between the linearity of relations and the hallucination rate. This suggests that the abstract storage of linear relations makes it harder for LLMs to recognize their own knowledge gaps, while nonlinear relations are stored more directly, facilitating better self-assessment. The findings imply that managing hallucination behavior may require addressing how LLMs store and represent factual knowledge.
研究探讨了关系线性与大型语言模型(LLMs)幻觉之间的关系。通过创建包含6000个合成实体的SyntHal数据集,研究人员发现关系线性与幻觉率之间存在较强的相关性(r在[.78, .82]之间)。这表明线性关系的抽象存储使得LLMs更难识别自己的知识空白,而非线性关系则存储得更直接,有助于更好的自我评估。研究结果表明,管理幻觉行为可能需要解决LLMs如何存储和表示事实知识的问题。
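The reported correlation is a per-relation statistic: one linearity score and one hallucination rate per relation. A minimal sketch of that computation follows. The six value pairs are made up for illustration and are not the paper's measurements, and the sketch assumes a Pearson coefficient; it does not show how the linearity score itself is defined.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical values, one entry per relation (six relations).
linearity = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
hallucination_rate = [0.20, 0.22, 0.45, 0.40, 0.70, 0.75]

r = pearson_r(linearity, hallucination_rate)
print(f"r = {r:.2f}")
```

With only six relations per model, a single outlying relation can move the coefficient noticeably.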
Isotropy-Optimized Contrastive Learning for Semantic Course Recommendation
Authors: Ali Khreis, Anthony Nasr, Yusuf Hilal
First: 2026-01-16T16:47:29+00:00 · Latest: 2026-01-16T16:47:29+00:00
Comments: 7 pages, 7 figures
Abstract
This paper presents a semantic course recommendation system for students using a self-supervised contrastive learning approach built upon BERT (Bidirectional Encoder Representations from Transformers). Traditional BERT embeddings suffer from anisotropic representation spaces, where course descriptions exhibit high cosine similarities regardless of semantic relevance. To address this limitation, we propose a contrastive learning framework with data augmentation and isotropy regularization that produces more discriminative embeddings. Our system processes student text queries and recommends Top-N relevant courses from a curated dataset of over 500 engineering courses across multiple faculties. Experimental results demonstrate that our fine-tuned model achieves improved embedding separation and more accurate course recommendations compared to vanilla BERT baselines.
中文标题/摘要
标题:基于自监督对比学习的 isotropy-优化语义课程推荐
本文提出了一种基于 BERT(双向编码器表示变换器)的自监督对比学习方法的语义课程推荐系统。传统的 BERT 表示空间具有各向异性,导致课程描述在语义相关性不同时仍表现出高余弦相似度。为解决这一局限,我们提出了一种带有数据增强和 isotropy 正则化的对比学习框架,以生成更具区分性的嵌入。该系统处理学生文本查询,并从涵盖多个学院超过 500 门工程课程的精选数据集中推荐 Top-N 相关课程。实验结果表明,与 vanilla BERT 基线相比,我们的微调模型在嵌入分离和课程推荐准确性方面均有所提高。
Summary / 总结
This paper introduces a semantic course recommendation system using a contrastive learning approach based on BERT to address the anisotropic representation issue in traditional BERT embeddings. The system employs data augmentation and isotropy regularization to generate more discriminative embeddings. Experiments show that the fine-tuned model outperforms vanilla BERT in embedding separation and course recommendation accuracy.
该论文提出了一种基于BERT的对比学习方法来解决传统BERT嵌入中的各向异性表示问题,引入了数据增强和各向同性正则化来生成更具区分性的嵌入。实验结果表明,微调后的模型在嵌入分离和课程推荐准确性方面优于vanilla BERT。
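The anisotropy problem described in the abstract can be made concrete with a small diagnostic: the mean pairwise cosine similarity of a set of embeddings. In the sketch below, the toy vectors share a large common offset, which makes unrelated items look nearly identical. Mean-centering is used here only as a simple stand-in to show the effect; it is not the paper's contrastive objective or its isotropy regularizer. All names and vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs (high = anisotropic)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def center(embeddings):
    """Subtract the mean vector, removing the shared offset."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return [[e[i] - mean[i] for i in range(dim)] for e in embeddings]


# Toy "course embeddings" dominated by a common direction.
embeddings = [
    [10.0, 1.0, 0.0],
    [10.0, 0.0, 1.0],
    [10.0, -1.0, 0.0],
    [10.0, 0.0, -1.0],
]

before = mean_pairwise_cosine(embeddings)
after = mean_pairwise_cosine(center(embeddings))
print(f"before centering: {before:.3f}, after centering: {after:.3f}")
```

A value close to 1 before any correction means cosine similarity carries little ranking information, which is the failure mode the recommendation system sets out to fix.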
The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents
Authors: Ziyu Wang, Chenyuan Liu, Yushun Xiang, Runhao Zhang, Qingbo Hao, Hongliang Lu, Houyu Chen, Zhizhong Feng, Kaiyue Zheng, Dehao Ye, Xianchao Zeng, Xinyu Zhou, Boran Wen, Jiaxin Li, Mingyu Zhang, Kecheng Zheng, Qian Zhu, Ran Cheng, Yong-Lu Li
First: 2026-01-16T16:42:05+00:00 · Latest: 2026-01-16T16:42:05+00:00
Abstract
Recently, with the rapid development of robot learning and imitation learning, numerous datasets and methods have emerged. However, these datasets and their task designs often lack systematic consideration and principles. This raises important questions: Do the current datasets and task designs truly advance the capabilities of robotic agents? Do evaluations on a few common tasks accurately reflect the differentiated performance of various methods proposed by different teams and evaluated on different tasks? To address these issues, we introduce the Great March 100 (GM-100) as the first step towards a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors, aiming to provide a diverse and challenging set of tasks to comprehensively evaluate the capabilities of robotic agents and promote diversity and complexity in robot dataset task designs. These tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human-object interaction primitives and object affordances. We collect a large amount of trajectory data on different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of current VLA models. Our data and code are available at https://rhos.ai/research/gm-100.
中文标题/摘要
标题:伟大的3月100:100项细致任务评估具身AI代理
近年来,随着机器人学习和模仿学习的快速发展,出现了大量数据集和方法。然而,这些数据集及其任务设计往往缺乏系统的考虑和原则。这提出了重要问题:当前的数据集和任务设计是否真正推动了机器人代理的能力?在少数几个常见任务上的评估是否能准确反映不同团队提出的不同方法在不同任务上的差异化表现?为了解决这些问题,我们引入了伟大的3月100(GM-100)作为迈向机器人学习奥运会的第一步。GM-100 包含100个精心设计的任务,涵盖了广泛的交互和长尾行为,旨在提供一个多样且具有挑战性的任务集,全面评估机器人代理的能力,并促进机器人数据集任务设计的多样性和复杂性。这些任务通过系统分析和扩展现有任务设计,并结合人类-物体交互基本原理和物体功能的见解而开发。我们在不同的机器人平台上收集了大量的轨迹数据,并评估了几种基线模型。实验结果表明,GM-100 任务是1)可执行的,2)足够具有挑战性,能够有效区分当前VLA模型的性能。我们的数据和代码可在https://rhos.ai/research/gm-100/获取。
Summary / 总结
The research introduces the Great March 100 (GM-100) as a comprehensive evaluation framework for embodied AI agents, consisting of 100 carefully designed tasks that cover various interactions and long-tail behaviors. The tasks are systematically expanded from existing designs and incorporate insights from human-object interactions. Experimental results show that GM-100 is feasible to execute and sufficiently challenging to differentiate the performance of current VLA models. The dataset and code are publicly available.
研究旨在解决当前机器人代理数据集和任务设计缺乏系统考虑的问题。引入了Great March 100 (GM-100),包含100个详细任务,用于评估具身AI代理的能力。这些任务涵盖了广泛的交互和长尾行为,并通过现有设计的系统分析和扩展进行开发。实验结果表明,这些任务是可执行的,并且能够有效区分不同模型的性能。数据和代码已公开可用。
Zero-Shot Detection of Elastic Transient Morphology Across Physical Systems
Authors: Jose Sánchez Andreu
First: 2026-01-16T16:35:07+00:00 · Latest: 2026-01-16T16:35:07+00:00
Comments: 17 pages, 6 figures. Supplemental material included
Abstract
We test whether a representation learned from interferometric strain transients in gravitational-wave observatories can act as a frozen morphology-sensitive operator for unseen sensors, provided the target signals preserve coherent elastic transient structure. Using a neural encoder trained exclusively on non-Gaussian instrumental glitches, we perform strict zero-shot anomaly analysis on rolling-element bearings without retraining, fine-tuning, or target-domain labels.
On the IMS-NASA run-to-failure dataset, the operator yields a monotonic health index HI(t) = s0.99(t)/tau normalized to an early-life reference distribution, enabling fixed false-alarm monitoring at 1-q = 1e-3 with tau = Q0.999(P0). In discrete fault regimes (CWRU), it achieves strong window-level discrimination (AUC_win about 0.90) and file-level separability approaching unity (AUC_file about 0.99). Electrically dominated vibration signals (VSB) show weak, non-selective behavior, delineating a physical boundary for transfer.
Under a matched IMS controlled-split protocol, a generic EfficientNet-B0 encoder pretrained on ImageNet collapses in the intermittent regime (Lambda_tail about 2), while the interferometric operator retains strong extreme-event selectivity (Lambda_tail about 860), indicating that the effect is not a generic property of CNN features. Controlled morphology-destruction transformations selectively degrade performance despite per-window normalization, consistent with sensitivity to coherent time-frequency organization rather than marginal amplitude statistics.
中文标题/摘要
标题:弹性瞬态形态在物理系统中的零样本检测
我们测试从引力波观测站的干涉仪应变瞬态中学习到的表示,是否可以在未见过的传感器上作为冻结的形态敏感操作符发挥作用,前提是目标信号保留了一致的弹性瞬态结构。使用仅在非高斯仪器瞬态上训练的神经编码器,我们对滚动轴承进行严格的零样本异常分析,无需重新训练、微调或目标域标签。
在IMS-NASA运行至失效数据集中,该操作符产生一个归一化到早期寿命参考分布的单调健康指数HI(t) = s0.99(t)/tau,使固定误报监测在1-q = 1e-3时tau = Q0.999(P0)。在离散故障区间(CWRU),它实现了强大的窗口级区分(AUC_win约0.90)和文件级可分性接近1(AUC_file约0.99)。电主导振动信号(VSB)表现出弱的非选择性行为,划定了一种物理边界,限制了转移。
在匹配的IMS控制分割协议下,通用的预训练在ImageNet上的EfficientNet-B0编码器在间歇区间(Lambda_tail约2)失效,而干涉仪操作符保持了强大的极端事件选择性(Lambda_tail约860),表明该效果不是CNN特征的通用属性。控制形态破坏变换选择性地降级性能,尽管进行了窗口归一化,这与对一致的时间-频率组织的敏感性一致,而不是边缘幅度统计。
Summary / 总结
This study investigates whether a representation learned from gravitational-wave observatory data can be used for zero-shot anomaly detection in rolling-element bearings. A neural encoder trained only on non-Gaussian instrumental glitches is applied without retraining, fine-tuning, or target-domain labels. On the IMS-NASA run-to-failure data, the operator yields a monotonic health index that supports fixed false-alarm monitoring; on the CWRU discrete-fault data, it reaches window-level and file-level AUC values of about 0.90 and 0.99, respectively. Electrically dominated vibration signals (VSB) show only weak, non-selective behavior, which marks a physical boundary for transfer. Under a matched IMS protocol, an ImageNet-pretrained EfficientNet-B0 encoder collapses in the intermittent regime (Lambda_tail about 2) while the interferometric operator retains strong extreme-event selectivity (Lambda_tail about 860), so the effect is not a generic property of CNN features. Morphology-destruction transformations degrade performance, consistent with sensitivity to coherent time-frequency organization rather than marginal amplitude statistics.
该研究探讨了是否可以从引力波观测站数据中学习到的表示可以用于滚动轴承的零样本异常检测。研究人员使用一个仅在非高斯仪器故障上训练的神经编码器进行了严格的零样本分析,无需重新训练或微调。该操作符生成了一个健康指数,使其能够实现固定误报监控,并在离散故障区域实现了约0.90和0.99的AUC值,分别实现了强区分能力和文件级可分性。干涉仪操作符在间歇性区域显示出了强大的极端事件选择性,而预训练在ImageNet上的通用CNN编码器则在该区域崩溃,表明这种效果是特定于信号的相干时频组织的。
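The health index HI(t) = s0.99(t)/tau can be sketched under one possible reading of the abstract. The reading assumed here: s0.99(t) is the 0.99 quantile of per-window anomaly scores in snapshot t, P0 is the distribution of those values over an early-life reference period, and tau = Q0.999(P0) is its 0.999 quantile, so that HI(t) > 1 flags an exceedance. The function names, the quantile interpolation, and the synthetic scores are all assumptions, not the author's implementation.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def health_index(scores_by_time, n_reference, q_window=0.99, q_ref=0.999):
    """Return HI(t) for every snapshot.

    scores_by_time: one list of anomaly scores per snapshot (assumed positive).
    n_reference: number of leading snapshots treated as the early-life period.
    """
    s = [quantile(window_scores, q_window) for window_scores in scores_by_time]
    tau = quantile(s[:n_reference], q_ref)  # threshold from the reference period
    return [value / tau for value in s]


# Synthetic run: 20 healthy snapshots, then 10 with a growing upward drift.
scores_by_time = []
for t in range(30):
    drift = 0.0 if t < 20 else 0.5 * (t - 19)
    scores_by_time.append([1.0 + 0.01 * k + drift for k in range(100)])

hi_values = health_index(scores_by_time, n_reference=20)
alarms = [t for t, value in enumerate(hi_values) if value > 1.0]
print(alarms)
```

Normalizing by a high quantile of the reference period is what ties the alarm threshold to a target false-alarm rate. The rate is only as reliable as the number of reference snapshots allows.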