arXiv Paper Digest / arXiv 论文速递

2025-12-19 03:23
Snapshot: 20251219_0323
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
Authors: Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang
First: 2025-12-17T18:59:55+00:00 · Latest: 2025-12-17T18:59:55+00:00
Comments: 11 pages, 5 figures
Abstract
In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of diffusion vision language models (dVLMs) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models to the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual instruction tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct extensive experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
中文标题/摘要
标题:DiffusionVL:将任何自回归模型转化为扩散视觉语言模型
在最近的多模态研究中,扩散范式因其独特的解码优势,已成为自回归范式(AR)的有前途的替代方案。然而,由于基础扩散语言模型能力的限制,扩散视觉语言模型(dVLM)的性能仍然远远落后于主流模型。这引发了一个简单而基本的问题:是否可以基于现有的强大自回归模型构建dVLM?为此,我们提出了DiffusionVL,这是一个可以从任何强大自回归模型转换而来的dVLM家族。通过简单的微调,我们成功地将自回归预训练模型适配到扩散范式。这种方法产生了两个关键观察结果:(1)从基于自回归的多模态模型到扩散的范式转变非常有效。(2)直接将自回归语言模型转换为dVLM也是可行的,其性能与LLaVA风格的视觉指令微调相当。此外,我们在dVLM中引入了一种块解码设计,支持任意长度的生成和KV缓存复用,实现了显著的推理加速。我们进行了大量的实验。尽管训练所用的数据不到先前方法所需数据的5%,DiffusionVL仍实现了全面的性能提升:在MMMU-Pro(视觉)基准上提升34.4%,在MME(认知)基准上提升37.5%,同时实现了2倍的推理加速。模型和代码发布在https://github.com/hustvl/DiffusionVL。
Summary / 总结
DiffusionVL is a family of diffusion vision language models (dVLMs) that can be translated from existing powerful autoregressive (AR) models through simple fine-tuning. The work shows that shifting AR-based multimodal models to the diffusion paradigm is highly effective and that direct conversion of AR language models to dVLMs is feasible, achieving performance competitive with LLaVA-style visual instruction tuning. The models use a block-decoding design that supports arbitrary-length generation and KV cache reuse, which yields a 2x inference speedup. Despite training on less than 5% of the data required by prior methods, DiffusionVL reports a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark.
论文旨在通过利用现有的强大自回归(AR)模型来提升扩散视觉语言模型(dVLM)的性能。它提出了DiffusionVL,这是一种可以从AR模型中转换而来的dVLM家族。关键发现包括在MMMU-Pro(视觉)基准测试上实现了34.4%的提升,在MME(认知)基准测试上实现了37.5%的提升,同时实现了2倍的推理速度提升,且仅使用了之前方法所需数据的不到5%。
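For the DiffusionVL entry above, the block-decoding idea (decode one fixed-size block at a time, then freeze it so its cached state can be reused) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `block_decode`, `toy_predict`, the confidence-based commit rule, and the use of a plain token list as a stand-in for the KV cache are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position in the current block that is not decoded yet

# Hypothetical predictor type: given the frozen context and the current block,
# return one (token, confidence) guess per block position.
Predictor = Callable[[List[int], List[Optional[int]]], List[Tuple[int, float]]]


def block_decode(predict: Predictor, prompt: List[int], num_blocks: int,
                 block_size: int, steps_per_block: int) -> Tuple[List[int], int]:
    """Decode block by block; return (generated tokens, number of cache appends)."""
    context: List[int] = list(prompt)  # stands in for the KV cache of finished tokens
    cache_appends = 0
    per_step = -(-block_size // steps_per_block)  # ceiling division
    for _ in range(num_blocks):
        block: List[Optional[int]] = [MASK] * block_size
        while any(tok is MASK for tok in block):
            guesses = predict(context, block)
            masked = [i for i, tok in enumerate(block) if tok is MASK]
            # Assumed rule: commit the most confident still-masked positions.
            masked.sort(key=lambda i: guesses[i][1], reverse=True)
            for i in masked[:per_step]:
                block[i] = guesses[i][0]
        # The finished block is frozen: appended once, reused for later blocks.
        context.extend(tok for tok in block if tok is not None)
        cache_appends += 1
    return context[len(prompt):], cache_appends


def toy_predict(context: List[int],
                block: List[Optional[int]]) -> List[Tuple[int, float]]:
    """Toy stand-in for a denoiser: predicts the absolute position as the token."""
    base = len(context)
    return [(base + i, 1.0 / (1 + i)) for i in range(len(block))]


if __name__ == "__main__":
    print(block_decode(toy_predict, [0, 1], num_blocks=2, block_size=4,
                       steps_per_block=2))
```

Because generation proceeds in whole blocks, the output length is set by the number of blocks rather than fixed in advance, and each finished block is written to the cache exactly once.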
Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants
Authors: Vincent Huang, Dami Choi, Daniel D. Johnson, Sarah Schwettmann, Jacob Steinhardt
First: 2025-12-17T18:59:48+00:00 · Latest: 2025-12-17T18:59:48+00:00
Comments: 28 pages, 12 figures
Abstract
Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and are able to accurately surface latent user attributes.
中文标题/摘要
标题:预测概念解码器:训练可扩展的端到端可解释性助手
对神经网络的内部激活进行解释可以对其行为给出更忠实的解释,但由于激活空间结构复杂,这是一项困难的任务。现有的可扩展可解释性方法使用手工设计的智能体来提出并检验关于内部激活如何与外部行为相关联的假设。我们提出将此任务转化为端到端的训练目标:训练可解释性助手,使其能够通过通信瓶颈从激活中准确预测模型行为。具体来说,编码器将激活压缩为稀疏的概念列表,解码器读取此列表并回答关于模型的自然语言问题。我们展示了如何在大量非结构化数据上预训练此助手,然后对其进行微调以回答问题。由此产生的架构,我们称之为预测概念解码器(PCD),具有良好的扩展特性:瓶颈概念的自动解释得分随着数据量的增加而提高,下游应用的表现也随之提升。具体而言,PCD可以检测越狱、秘密提示和植入的潜在概念,并能够准确揭示潜在的用户属性。
Summary / 总结
The research aims to make interpretability scalable by training interpretability assistants end to end to predict model behavior from internal activations through a communication bottleneck. An encoder compresses activations into a sparse list of concepts, and a decoder reads that list to answer natural language questions about the model. Key findings include auto-interp scores and downstream performance that improve with more data, successful detection of jailbreaks, secret hints, and implanted latent concepts, and accurate surfacing of latent user attributes.
研究旨在通过训练端到端的预测概念解码器(PCD),将激活压缩成稀疏的概念列表并解码以回答关于模型行为的问题,以实现神经网络的可解释性。关键发现包括随数据量增加自动解释分数的提升,以及成功检测到越狱、秘密提示和植入的潜在概念,同时准确地揭示潜在用户属性。
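For the Predictive Concept Decoder entry above, the encoder side of the bottleneck (activations compressed to a sparse list of concepts) might look like the sketch below. The abstract does not say how sparsity is enforced; top-k selection over dot-product scores is an assumption made here for illustration, and `encode_to_concepts` and `concept_dirs` are hypothetical names. The decoder that reads the list and answers questions is omitted.

```python
from typing import List, Tuple


def encode_to_concepts(activation: List[float],
                       concept_dirs: List[List[float]],
                       k: int) -> List[Tuple[int, float]]:
    """Return up to k (concept_id, strength) pairs with the highest positive scores."""
    # Score each concept by the dot product of its direction with the activation.
    scores = [sum(a * w for a, w in zip(activation, direction))
              for direction in concept_dirs]
    # Assumed sparsification: keep only the k highest-scoring concepts.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(i, scores[i]) for i in top if scores[i] > 0.0]


if __name__ == "__main__":
    dirs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
    print(encode_to_concepts([1.0, 0.0, 2.0], dirs, k=2))
```

Only the short list of concept ids and strengths passes through the bottleneck, which is what makes the intermediate representation inspectable.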
Artism: AI-Driven Dual-Engine System for Art Generation and Critique
Authors: Shuai Liu, Yiqing Tian, Yang Chen, Mar Canet Sola
First: 2025-12-17T18:58:42+00:00 · Latest: 2025-12-17T18:58:42+00:00
Comments: 7 pages, 3 figures, 36 references, appendix with support material and 1 introduction video
Abstract
This paper proposes a dual-engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art. We present two interconnected components: AIDA (an artificial artist social network) and the Ismism Machine, a system for critical analysis. The core innovation lies in leveraging deep learning and multi-agent collaboration to enable multidimensional simulations of art historical developments and conceptual innovation patterns. The framework explores a shift from traditional unidirectional critique toward an intelligent, interactive mode of reflexive practice. We are currently applying this method in experimental studies on contemporary art concepts. This study introduces a general methodology based on AI-driven critical loops, offering new possibilities for computational analysis of art.
中文标题/摘要
标题:Artism:由AI驱动的双引擎系统,用于艺术创作与批评
本文提出了一种双引擎AI架构方法,旨在解决探索艺术演化中潜在轨迹这一复杂问题。我们介绍了两个相互关联的组件:AIDA(人工艺术家社交网络)和Ismism Machine(批评分析系统)。核心创新在于利用深度学习和多智能体协作,实现对艺术史发展和概念创新模式的多维度模拟。该框架探索了从传统的单向批评向智能、互动的反思性实践模式的转变。目前,我们正在将此方法应用于当代艺术概念的实验研究。本研究提出了一种基于AI驱动批评循环的通用方法论,为艺术的计算分析提供了新的可能性。
Summary / 总结
This paper introduces a dual-engine AI system called Artism, which includes AIDA (an artificial artist social network) and the Ismism Machine for critical analysis. The system uses deep learning and multi-agent collaboration to simulate art historical developments and conceptual innovation patterns, moving from unidirectional critique toward an interactive, reflexive mode. The authors report that they are currently applying the method in experimental studies on contemporary art concepts, and present it as a general methodology based on AI-driven critical loops for the computational analysis of art.
该论文提出了一种名为Artism的双引擎AI系统,包括人工艺术家社交网络AIDA和用于批判分析的Ismism Machine。该系统利用深度学习和多智能体协作来模拟艺术历史发展和概念创新模式,朝着互动批判模式迈进。作者目前正将该方法应用于当代艺术概念的实验研究,并认为这种基于AI驱动批判循环的方法为艺术的计算分析提供了新的可能性。
GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
Authors: Yu Wang, Juhyung Ha, Frangil M. Ramirez, Yuchen Wang, David J. Crandall
Venue: WACV 2026
First: 2025-12-17T18:56:52+00:00 · Latest: 2025-12-17T18:56:52+00:00
Comments: accepted by WACV 2026
Abstract
Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
中文标题/摘要
标题:GateFusion:分层门控跨模态融合在主动说话人检测中的应用
主动说话人检测(ASD)旨在识别视频每一帧中谁正在说话。最先进的方法大多依赖晚期融合来结合视觉和音频特征,但晚期融合往往无法捕捉细粒度的跨模态交互,而这对于在不受约束的场景中实现稳健性能可能至关重要。本文提出了一种名为GateFusion的新架构,它将强大的预训练单模态编码器与分层门控融合解码器(HiGate)相结合。HiGate在Transformer骨干网络的多个层中,自适应地将一种模态的上下文特征注入另一种模态,并由可学习的双模态条件门控引导,从而实现渐进式、多深度的融合。为了进一步加强多模态学习,我们提出了两个辅助目标:掩码对齐损失(MAL),用于使单模态输出与多模态预测对齐;以及过度正例惩罚(OPP),用于抑制仅由视频引起的虚假激活。GateFusion在多个具有挑战性的ASD基准测试中取得了新的最先进结果,在Ego4D-ASD、UniTalk和WASD基准上分别达到77.8%(+9.4%)、86.1%(+2.9%)和96.1%(+0.5%)的mAP,并在AVA-ActiveSpeaker上取得了有竞争力的性能。域外实验展示了我们模型的泛化能力,全面的消融实验则表明了各组件的互补优势。
Summary / 总结
GateFusion is a novel architecture for Active Speaker Detection that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate) to capture fine-grained cross-modal interactions. It uses learnable, bimodally-conditioned gates to adaptively inject contextual features from one modality into the other at multiple layers of the Transformer backbone. Two auxiliary objectives, Masked Alignment Loss and Over-Positive Penalty, are proposed to enhance multimodal learning. GateFusion achieves new state-of-the-art results on several ASD benchmarks, including 77.8% mAP on Ego4D-ASD, 86.1% mAP on UniTalk, and 96.1% mAP on WASD, while demonstrating good generalization to out-of-domain scenarios.
GateFusion 是一种用于主动说话人检测的新架构,结合了强大的预训练单模态编码器和分层门控融合解码器(HiGate),以捕捉细粒度的跨模态交互。该模型在多个基准测试中达到了最先进的结果,包括在 Ego4D-ASD 上的 77.8% mAP,在 UniTalk 上的 86.1% mAP,在 WASD 上的 96.1% mAP。此外,该模型还引入了两个辅助目标,用于使单模态输出与多模态预测对齐,并抑制仅由视频引起的虚假激活,从而进一步提高性能。
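For the GateFusion entry above, a single gated injection step (a gate conditioned on both modalities deciding how much context from one modality flows into the other) can be sketched as below. The scalar gate, the residual addition, and the names `gated_inject`, `w_target`, and `w_context` are assumptions for illustration; the abstract only states that the gates are learnable and bimodally conditioned and that injection happens at multiple Transformer layers.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_inject(target: List[float], context: List[float],
                 w_target: List[float], w_context: List[float],
                 bias: float) -> Tuple[List[float], float]:
    """Inject `context` (other modality) into `target`, scaled by a learned gate."""
    # The gate sees both modalities, so it is conditioned bimodally.
    logit = (sum(w * t for w, t in zip(w_target, target))
             + sum(w * c for w, c in zip(w_context, context)) + bias)
    gate = sigmoid(logit)
    # Assumed fusion rule: residual addition of the gated context.
    fused = [t + gate * c for t, c in zip(target, context)]
    return fused, gate


if __name__ == "__main__":
    print(gated_inject([1.0, 2.0], [2.0, 4.0], [0.0, 0.0], [0.0, 0.0], 0.0))
```

When the gate saturates near zero the target features pass through unchanged, which is one way such a design can suppress unhelpful cross-modal context.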
Dynamic Rebatching for Efficient Early-Exit Inference with DREX
Authors: Xuting Liu, Daniel Alexander, Siva Kesava Reddy Kakarla, Behnaz Arzani, Vincent Liu
First: 2025-12-17T18:55:45+00:00 · Latest: 2025-12-17T18:55:45+00:00
Abstract
Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the model's layers. However, traditional batching frameworks are ill-suited for EE LLMs, as not all requests in a batch may be ready to exit at the same time. Existing solutions either force a uniform decision on the batch, which overlooks EE opportunities, or degrade output quality by forcing premature exits. We propose Dynamic Rebatching, a solution where we dynamically reorganize the batch at each early-exit point. Requests that meet the exit criteria are immediately processed, while those that continue are held in a buffer, re-grouped into a new batch, and forwarded to deeper layers. We introduce DREX, an early-exit inference system that implements Dynamic Rebatching with two key optimizations: 1) a copy-free rebatching buffer that avoids physical data movement, and 2) an EE and SLA-aware scheduler that analytically predicts whether a given rebatching operation will be profitable. DREX also efficiently handles the missing KV cache from skipped layers using memory-efficient state-copying. Our evaluation shows that DREX improves throughput by 2-12% compared to baseline approaches while maintaining output quality. Crucially, DREX completely eliminates involuntary exits, providing a key guarantee for preserving the output quality intended by the EE model.
中文标题/摘要
标题:DREX:通过动态重新分批实现高效的早期退出推理
早期退出(EE)是一种大型语言模型(LLM)架构,它允许仅使用模型的部分层来生成较容易的token,从而加速推理。然而,传统的批处理框架不适合EE LLM,因为批次中的请求不一定同时满足退出条件。现有解决方案要么对整个批次强制做出统一决策,从而错失EE机会,要么强制提前退出,从而降低输出质量。我们提出了动态重新分批,即在每个早期退出点动态重新组织批次。满足退出条件的请求立即得到处理,需要继续的请求则被保存在缓冲区中,重新分组为新的批次并转发到更深的层。我们引入了DREX,这是一个实现动态重新分批的早期退出推理系统,包含两个关键优化:1)无复制的重新分批缓冲区,避免物理数据移动;2)感知EE和SLA的调度器,以解析方式预测给定的重新分批操作是否有收益。DREX还利用内存高效的状态复制,高效地处理因跳过层而缺失的KV缓存。我们的评估表明,与基线方法相比,DREX将吞吐量提高了2-12%,同时保持了输出质量。至关重要的是,DREX完全消除了非自愿退出,为保持EE模型预期的输出质量提供了关键保证。
Summary / 总结
The paper addresses the challenge of efficient early-exit inference in Large Language Models (LLMs) by proposing Dynamic Rebatching, a technique that dynamically reorganizes batches at each early-exit point. This method allows requests to exit early if they meet the criteria, while others are buffered and re-grouped. The system, DREX, includes optimizations such as a copy-free rebatching buffer and an EE and SLA-aware scheduler. Experimental results show that DREX improves throughput by 2-12% without degrading output quality and eliminates involuntary exits, ensuring consistent output quality as intended by the EE model.
研究针对传统批处理框架不适合早期退出(EE)大型语言模型(LLM)的问题,提出了动态重新分批方法,在每个早期退出点动态重新组织批次。该方法立即处理满足退出条件的请求,并将其余请求缓冲后重新分组为新的批次。DREX系统实现了动态重新分批,并采用无复制的重新分批缓冲区和感知EE与SLA的调度器等优化,将吞吐量提高了2-12%,同时保持了输出质量并完全消除了非自愿退出。
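For the DREX entry above, the core of dynamic rebatching (split a batch at an exit point, buffer the continuing requests, and regroup them for deeper layers) can be sketched as follows. This is a simplified illustration, not the DREX system: the confidence-threshold exit criterion and the "wait for a full batch" rule are assumptions standing in for the paper's EE- and SLA-aware scheduler, and `RebatchBuffer` and `split_at_exit` are hypothetical names. The copy-free idea is mimicked by moving slot indices around instead of hidden-state rows.

```python
from typing import List, Tuple


class RebatchBuffer:
    """Holds slot indices into a shared hidden-state pool instead of copying rows."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.pending: List[int] = []

    def add(self, slots: List[int]) -> None:
        self.pending.extend(slots)

    def pop_batch(self) -> List[int]:
        """Return a full batch of slot indices, or [] if too few are buffered."""
        if len(self.pending) < self.batch_size:
            return []
        batch = self.pending[:self.batch_size]
        self.pending = self.pending[self.batch_size:]
        return batch


def split_at_exit(slots: List[int], confidence: List[float],
                  threshold: float) -> Tuple[List[int], List[int]]:
    """Partition slots into (exiting, continuing) at one early-exit point."""
    exiting = [s for s in slots if confidence[s] >= threshold]
    continuing = [s for s in slots if confidence[s] < threshold]
    return exiting, continuing


if __name__ == "__main__":
    buf = RebatchBuffer(batch_size=3)
    done, rest = split_at_exit([0, 1, 2, 3], [0.9, 0.2, 0.95, 0.4], 0.8)
    buf.add(rest)
    print(done, rest, buf.pop_batch())
```

No request is forced to exit or to continue because of its batch neighbours; each one follows its own exit decision and is simply regrouped afterwards.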
VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
Authors: Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski, Li Fei-Fei, Jiajun Wu, Jason Zhang
First: 2025-12-17T18:52:55+00:00 · Latest: 2025-12-17T18:52:55+00:00
Comments: 14 pages, 8 figures
Abstract
Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic
中文标题/摘要
标题:VLIC:视觉-语言模型作为感知裁判用于人类对齐的图像压缩
包含人类偏好的图像压缩性能评估通常发现,均方误差(MSE)等简单的失真函数与人类感知的对齐程度不足。为了使压缩模型与人类感知对齐,先前的工作采用了可微分的感知损失,这类损失由在大规模人类心理视觉判断数据集上校准的神经网络构成。我们发现,令人惊讶的是,最先进的视觉-语言模型(VLMs)在被要求推理成对图像之间的差异时,能够零样本地复现人类的二选一强制选择(2AFC)判断。为了利用VLMs强大的零样本视觉推理能力,我们提出了用于图像压缩的视觉-语言模型(VLIC),这是一个基于扩散的图像压缩系统,旨在利用二元VLM判断进行后训练。VLIC利用现有的基于偏好的扩散模型后训练技术,而不是将VLM判断蒸馏到一个单独的感知损失网络中。我们表明,根据感知指标和大规模用户研究,在VLM判断上校准该系统可以在与人类对齐的视觉压缩上取得有竞争力或最先进的性能,具体取决于数据集。我们还对基于VLM的奖励设计和训练过程进行了广泛分析,并分享了重要的见解。更多视觉内容可在 https://kylesargent.github.io/vlic 获取。
Summary / 总结
The research aims to improve the alignment of image compression models with human perception by using vision-language models (VLMs) to replicate human judgments. VLIC, a diffusion-based compression system, is proposed and trained using binary VLM judgments. The system achieves competitive or state-of-the-art performance on human-aligned visual compression according to perceptual metrics and user studies.
研究旨在通过使用视觉语言模型(VLMs)来复制人类判断,以改善图像压缩模型与人类感知的对齐。方法是使用VLMs的二元判断训练一个基于扩散的图像压缩系统VLIC,而无需将这些判断提炼为单独的感知损失网络。关键发现表明,VLIC在感知度量和大规模用户研究中实现了与人类对齐的视觉压缩的竞争力或最先进水平。
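For the VLIC entry above, two small pieces of bookkeeping around binary 2AFC judgments can be sketched: measuring how often a VLM's choices agree with human choices, and turning the VLM's choices into (preferred, rejected) pairs of the kind preference-based post-training typically consumes. The function names and the encoding (0 means the first candidate is judged closer to the reference) are assumptions; querying the VLM and post-training the diffusion model are not shown.

```python
from typing import List, Sequence, Tuple


def agreement_rate(vlm_choices: Sequence[int],
                   human_choices: Sequence[int]) -> float:
    """Fraction of 2AFC trials where the VLM picks the same option as humans."""
    if len(vlm_choices) != len(human_choices) or not vlm_choices:
        raise ValueError("need two non-empty choice lists of equal length")
    matches = sum(1 for v, h in zip(vlm_choices, human_choices) if v == h)
    return matches / len(vlm_choices)


def to_preference_pairs(candidates: Sequence[Tuple[str, str]],
                        vlm_choices: Sequence[int]) -> List[Tuple[str, str]]:
    """Map each binary judgment to a (preferred, rejected) pair."""
    pairs: List[Tuple[str, str]] = []
    for (first, second), choice in zip(candidates, vlm_choices):
        # Assumed encoding: 0 selects the first candidate, anything else the second.
        pairs.append((first, second) if choice == 0 else (second, first))
    return pairs


if __name__ == "__main__":
    print(agreement_rate([0, 1, 1, 0], [0, 1, 0, 0]))
    print(to_preference_pairs([("a1", "b1"), ("a2", "b2")], [0, 1]))
```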
FrontierCS: Evolving Challenges for Evolving Intelligence
Authors: Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, Youran Sun, Wesley Zheng, Meiyuwang Zhang, Ruyi Ji, Xuechang Tu, Zihan Zheng, Zexing Chen, Kangyang Zhou, Zhaozi Wang, Jingbang Chen, Aleksandra Korolova, Peter Henderson, Pramod Viswanath, Vijay Ganesh, Saining Xie, Zhuang Liu, Dawn Song, Sewon Min, Ion Stoica, Joseph E. Gonzalez, Jingbo Shang, Alvin Cheung
First: 2025-12-17T18:52:45+00:00 · Latest: 2025-12-17T18:52:45+00:00
Comments: Code with instruction: https://github.com/FrontierCS/Frontier-CS
Abstract
We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown, but the quality of a solution can be objectively evaluated. Models solve these tasks by implementing executable programs rather than outputting a direct answer. FrontierCS includes algorithmic problems, which are often NP-hard variants of competitive programming problems with objective partial scoring, and research problems with the same property. For each problem we provide an expert reference solution and an automatic evaluator. Combining open-ended design, measurable progress, and expert curation, FrontierCS provides a benchmark at the frontier of computer-science difficulty. Empirically, we find that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks, that increasing reasoning budgets alone does not close this gap, and that models often over-optimize for generating merely workable code instead of discovering high-quality algorithms and system designs.
中文标题/摘要
标题:FrontierCS:为不断演进的智能设立不断演进的挑战
我们介绍了FrontierCS,这是一个涵盖计算机科学多个领域的156个开放性问题的基准测试,由专家设计和审核,包括计算机科学博士和顶级竞赛编程参与者及出题者。与现有主要关注具有已知最优解的任务的基准不同,FrontierCS针对的是最优解未知但解决方案质量可以客观评估的问题。模型通过实现可执行程序而不是直接输出答案来解决这些任务。FrontierCS包括算法问题,通常是具有客观部分评分的竞赛编程问题的NP难变体,以及具有相同属性的研究问题。对于每个问题,我们提供了专家参考解决方案和自动评估器。结合开放性设计、可衡量的进步和专家审核,FrontierCS提供了一个计算机科学难度前沿的基准测试。实证研究发现,前沿推理模型在算法和研究轨道上仍然远远落后于人类专家,单纯增加推理预算并不能缩小这一差距,而且模型往往过度优化生成可工作的代码,而不是发现高质量的算法和系统设计。
Summary / 总结
FrontierCS is a benchmark of 156 open-ended computer science problems, designed and reviewed by experts, that targets problems whose optimal solution is unknown but whose solution quality can be objectively evaluated. Models implement executable programs rather than providing direct answers. The benchmark includes algorithmic and research problems, each with an expert reference solution and an automatic evaluator. Experiments show that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks, and that increasing reasoning budgets alone does not close the gap. Models often over-optimize for merely workable code rather than discovering high-quality algorithms and system designs.
FrontierCS 是一个包含 156 个跨领域计算机科学开放性问题的基准,由专家设计和审核。与现有基准不同,FrontierCS 关注没有已知最优解但质量可以客观评价的问题。模型通过实现可执行程序来解决这些任务。基准包括算法和研究问题,附有专家参考解决方案和自动评估器。实验证明,当前前沿推理模型在算法和研究轨道上都远逊于人类专家,增加推理预算并不能显著改善性能。模型往往生成的是可工作的代码而不是高质量的算法和设计。
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Authors: Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu
First: 2025-12-17T18:48:26+00:00 · Latest: 2025-12-17T18:48:26+00:00
Comments: Project Page: https://github.com/JoeLeelyf/Skyra
Abstract
The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.
中文标题/摘要
标题:Skyra:基于有据伪影推理的 AI 生成视频检测
AI 驱动的视频生成技术的滥用引发了严重的社会关注,突显了对可靠的 AI 生成视频检测器的迫切需求。然而,大多数现有方法仅限于二元分类,缺乏便于人类理解的必要解释。在本文中,我们提出了 Skyra,这是一种专门的多模态大型语言模型(MLLM),能够识别 AI 生成视频中人类可感知的视觉伪影,并将其作为检测和解释的有据证据。为了支持这一目标,我们构建了用于监督微调(SFT)的 ViF-CoT-4K,这是第一个带有细粒度人工标注的大规模 AI 生成视频伪影数据集。然后,我们开发了一种两阶段训练策略,系统地增强模型的时空伪影感知、解释能力和检测准确性。为了全面评估 Skyra,我们引入了 ViF-Bench,这是一个包含 3K 个高质量样本的基准,这些样本由十多个最先进的视频生成器生成。大量实验表明,Skyra 在多个基准上超越了现有方法,同时我们的评估为推进可解释的 AI 生成视频检测提供了宝贵的见解。
Summary / 总结
The paper addresses the need for reliable detectors for AI-generated videos, which is crucial due to the misuse of such technologies. Skyra, a specialized multimodal large language model, is developed to identify visual artifacts in AI-generated videos and provide grounded explanations. The model is trained on a new dataset, ViF-CoT-4K, and uses a two-stage training strategy to improve its artifact perception and detection accuracy. Experiments show that Skyra outperforms existing methods across multiple benchmarks, offering valuable insights for explainable AI-generated video detection.
该论文介绍了Skyra,这是一种专门用于检测AI生成视频中视觉缺陷并为人类提供解释的多模态大型语言模型。它构建了一个大规模数据集ViF-CoT-4K用于监督微调,并开发了两阶段训练策略以增强缺陷感知和检测准确性。实验表明,Skyra在多个基准测试中优于现有方法,提供了关于可解释的AI生成视频检测的重要见解。
BashArena: A Control Setting for Highly Privileged AI Agents
Authors: Adam Kaufman, James Lucassen, Tyler Tracy, Cody Rushing, Aryan Bhatt
First: 2025-12-17T18:45:25+00:00 · Latest: 2025-12-17T18:45:25+00:00
Comments: The task generation pipeline can be found here: https://github.com/redwoodresearch/basharena_public
Abstract
Future AI agents might run autonomously with elevated privileges. If these agents are misaligned, they might abuse these privileges to cause serious damage. The field of AI control develops techniques that make it harder for misaligned AIs to cause such damage, while preserving their usefulness. We introduce BashArena, a setting for studying AI control techniques in security-critical environments. BashArena contains 637 Linux system administration and infrastructure engineering tasks in complex, realistic environments, along with four sabotage objectives (execute malware, exfiltrate secrets, escalate privileges, and disable firewall) for a red team to target. We evaluate multiple frontier LLMs on their ability to complete tasks, perform sabotage undetected, and detect sabotage attempts. Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at 4% trajectory-wise FPR. Our findings provide a baseline for designing more effective control protocols in BashArena. We release the dataset as a ControlArena setting and share our task generation pipeline.
中文标题/摘要
标题:BashArena:一种高度特权AI代理的控制设置
未来的AI代理可能会以较高的权限自主运行。如果这些代理未对齐,它们可能会滥用这些权限造成严重损害。AI控制领域致力于开发相关技术,使未对齐的AI更难造成此类损害,同时保持其有用性。我们引入了BashArena,一个用于在安全关键环境中研究AI控制技术的设置。BashArena包含637个处于复杂、真实环境中的Linux系统管理与基础设施工程任务,并设有四个破坏目标(执行恶意软件、窃取机密、提升权限和禁用防火墙),供红队攻击。我们评估了多个前沿LLM完成任务、在不被发现的情况下实施破坏以及检测破坏企图的能力。在轨迹级误报率(FPR)为4%的情况下,Claude Sonnet 4.5有26%的概率成功实施破坏并躲过GPT-4.1 mini的监控。我们的研究结果为在BashArena中设计更有效的控制协议提供了基线。我们将该数据集作为ControlArena设置发布,并分享了我们的任务生成流水线。
Summary / 总结
BashArena is a setting for studying AI control techniques in security-critical environments, containing 637 Linux system administration and infrastructure engineering tasks and four sabotage objectives. Multiple frontier LLMs were evaluated on task completion, undetected sabotage, and sabotage detection; Claude Sonnet 4.5 successfully executes sabotage while evading monitoring by GPT-4.1 mini 26% of the time, at a 4% trajectory-wise false positive rate. These results provide a baseline for designing more effective control protocols.
BashArena 是一个用于研究安全关键环境中 AI 控制技术的设置,包含 637 个 Linux 系统管理与基础设施工程任务和四个破坏目标。研究评估了多个前沿 LLM,在轨迹级误报率为 4% 的情况下,Claude Sonnet 4.5 有 26% 的概率成功实施破坏并躲过 GPT-4.1 mini 的监控。研究为设计更有效的控制协议提供了基线。
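For the BashArena entry above, a figure such as "evades monitoring 26% of the time at 4% trajectory-wise FPR" is usually read as follows: a monitor assigns each trajectory a suspicion score, a threshold is calibrated so that about 4% of honest trajectories are flagged, and evasion is the share of sabotage trajectories that stay at or below that threshold. The sketch below illustrates that reading only; the score format, the tie handling, and the function names are assumptions, and the paper's exact definitions may differ.

```python
import math
from typing import List


def threshold_for_fpr(honest_scores: List[float], target_fpr: float) -> float:
    """Pick a threshold so at most target_fpr of honest scores lie strictly above it."""
    ranked = sorted(honest_scores, reverse=True)
    # Number of honest trajectories we are allowed to flag (small epsilon
    # guards against floating-point error in the product).
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    allowed = min(allowed, len(ranked) - 1)
    return ranked[allowed]


def evasion_rate(attack_scores: List[float], threshold: float) -> float:
    """Fraction of sabotage trajectories that the monitor does not flag."""
    evaded = sum(1 for s in attack_scores if s <= threshold)
    return evaded / len(attack_scores)


if __name__ == "__main__":
    honest = [float(i) for i in range(100)]
    thr = threshold_for_fpr(honest, 0.04)
    print(thr, evasion_rate([10.0, 99.0, 50.0, 97.0], thr))
```

Fixing the false positive rate first makes evasion numbers comparable across monitors, since a monitor cannot look better simply by flagging everything.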
Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
Authors: Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, Dong Yu
First: 2025-12-17T18:44:45+00:00 · Latest: 2025-12-17T18:44:45+00:00
Abstract
Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off-manifold updates are deemphasized, yielding a self-referential exploration signal that is naturally aligned with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
中文标题/摘要
标题:大型语言模型能否引导自身的探索?面向LLM推理的梯度引导强化学习
强化学习已成为增强大型语言模型推理能力的关键手段,但当前的探索机制与这些模型实际的学习方式存在根本上的错位。熵奖励和外部语义比较器鼓励表层的变化,但并不能保证采样的轨迹在决定优化的更新方向上有所不同。我们提出了G2RL,一种梯度引导的强化学习框架,其中探索不是由外部启发式方法驱动,而是由模型自身的一阶更新几何驱动。对于每个响应,G2RL根据模型最终层的敏感性构建一个序列级特征,该特征可以从标准前向传播中以可忽略的成本获得,并通过在采样组内比较这些特征来衡量每条轨迹将如何重塑策略。引入新颖梯度方向的轨迹会获得一个有界的乘性奖励缩放因子,而冗余或偏离流形的更新则会被弱化,从而产生一种自参照的探索信号,这种信号与PPO风格的稳定性和KL控制天然一致。在数学和通用推理基准(MATH500、AMC、AIME24、AIME25、GPQA、MMLUpro)上,基于Qwen3基础版1.7B和4B模型,G2RL相比基于熵的GRPO和外部嵌入方法持续提升了pass@1、maj@16和pass@k。通过分析所诱导的几何结构,我们发现G2RL将探索扩展到了明显更多正交且往往相反的梯度方向,同时保持了语义连贯性,这表明策略自身的更新空间为引导大型语言模型强化学习中的探索提供了更为忠实和有效的基础。
Summary / 总结
The paper addresses the misalignment between current exploration mechanisms and how large language models (LLMs) actually learn, proposing G2RL, a gradient-guided reinforcement learning framework. G2RL builds a sequence-level feature from the model's final-layer sensitivity and compares these features within a sampled group, giving trajectories with novel gradient directions a bounded multiplicative reward scaler and de-emphasizing redundant or off-manifold updates. On math and general reasoning benchmarks with Qwen3 base 1.7B and 4B models, G2RL improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analysis of the induced geometry shows that G2RL explores more orthogonal and often opposing gradient directions while maintaining semantic coherence.
论文提出了G2RL,一种基于模型自身一阶更新几何的梯度引导强化学习框架,而不是外部启发式方法。该方法从模型最终层的敏感性构建序列级特征,并测量每个轨迹如何重塑策略。G2RL在数学和一般推理基准测试中提高了pass@1、maj@16和pass@k指标,优于基于熵的方法和外部嵌入方法。诱导的几何结构显示,G2RL探索了更多正交和对立的梯度方向,同时保持了语义连贯性。
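For the G2RL entry above, the idea of a bounded multiplicative reward scaler driven by within-group feature comparison can be sketched as below. The abstract does not give the formula, so everything specific here is an assumption for illustration: novelty is taken as one minus the mean cosine similarity to the other responses in the group, the scaler is `1 + alpha * (novelty - 1)` clipped to `[lo, hi]`, and the feature vectors are plain lists standing in for the sequence-level features.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def reward_scalers(features: List[List[float]], alpha: float = 0.5,
                   lo: float = 0.5, hi: float = 1.5) -> List[float]:
    """One bounded multiplicative scaler per response in a sampled group."""
    scalers: List[float] = []
    for i, feat in enumerate(features):
        sims = [cosine(feat, other) for j, other in enumerate(features) if j != i]
        # Assumed novelty score in [0, 2]: 0 = redundant, 1 = orthogonal, 2 = opposing.
        novelty = (1.0 - sum(sims) / len(sims)) if sims else 1.0
        scalers.append(min(hi, max(lo, 1.0 + alpha * (novelty - 1.0))))
    return scalers


if __name__ == "__main__":
    print(reward_scalers([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

In this toy version the two identical responses are down-weighted relative to the one pointing in a new direction, and clipping keeps the scaler from overwhelming the task reward it multiplies.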
MMGR: Multi-Modal Generative Reasoning
Authors: Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, Nanyun Peng, Junjie Hu
First: 2025-12-16T18:58:04+00:00 · Latest: 2025-12-17T18:42:37+00:00
Comments: work in progress
Abstract
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
中文标题/摘要
标题:MMGR:多模态生成推理
视频基础模型能够生成视觉上逼真且时间上连贯的内容,但它们作为世界模拟器的可靠性取决于是否捕捉了物理、逻辑和空间约束。现有指标如弗雷歇视频距离(FVD)强调感知质量,而忽视了推理失败,包括对因果关系、物理规律和全局一致性的违反。我们引入了MMGR(多模态生成推理评估与基准),一个基于五种推理能力的原则性评估框架:物理、逻辑、三维空间、二维空间和时间。MMGR在三个领域评估生成推理:抽象推理(ARC-AGI、数独)、具身导航(现实世界三维导航和定位)和物理常识(体育和组合交互)。MMGR采用细粒度指标,要求视频和图像生成都达到整体正确。我们对领先的视频模型(Veo-3、Sora-2、Wan-2.2)和图像模型(Nano-banana、Nano-banana Pro、GPT-4o-image、Qwen-image)进行了基准测试,揭示了各领域之间显著的性能差距。模型在物理常识任务上取得了中等程度的成功,但在抽象推理上表现不佳(ARC-AGI准确率低于10%),并且难以完成具身场景中的长程空间规划。我们的分析指出了当前模型的关键局限,包括过度依赖感知数据、全局状态一致性较弱,以及训练目标奖励视觉合理性而非因果正确性。MMGR提供了一个统一的诊断基准,并为构建具备推理意识的生成式世界模型指明了方向。
Summary / 总结
MMGR is a new evaluation framework for video and image generative models, focusing on their reasoning abilities across physical, logical, and spatial domains. It evaluates models on tasks like abstract reasoning, embodied navigation, and physical commonsense, using fine-grained metrics that require holistic correctness. The study reveals significant performance gaps among leading models, with strong performance on physical tasks but poor results on abstract reasoning and long-term spatial planning. This highlights the need for models to move beyond perceptual quality to achieve causal correctness and global consistency.
MMGR 是一个用于评估视频和图像生成模型推理能力的新框架,重点关注其在物理、逻辑和空间领域的表现。它通过细粒度的指标评估模型在抽象推理、具身导航和物理常识任务上的表现,要求整体正确性。研究显示,领先模型在物理常识任务上仅取得中等成功,在抽象推理和长期空间规划方面表现不佳,这表明模型需要超越感知质量,实现因果正确性和全局一致性。
High-Dimensional Partial Least Squares: Spectral Analysis and Fundamental Limitations
Authors: Victor Léger, Florent Chatelain
First: 2025-12-17T18:38:01+00:00 · Latest: 2025-12-17T18:38:01+00:00
Abstract
Partial Least Squares (PLS) is a widely used method for data integration, designed to extract latent components shared across paired high-dimensional datasets. Despite decades of practical success, a precise theoretical understanding of its behavior in high-dimensional regimes remains limited. In this paper, we study a data integration model in which two high-dimensional data matrices share a low-rank common latent structure while also containing individual-specific components. We analyze the singular vectors of the associated cross-covariance matrix using tools from random matrix theory and derive asymptotic characterizations of the alignment between estimated and true latent directions. These results provide a quantitative explanation of the reconstruction performance of the PLS variant based on Singular Value Decomposition (PLS-SVD) and identify regimes where the method exhibits counter-intuitive or limiting behavior. Building on this analysis, we compare PLS-SVD with principal component analysis applied separately to each dataset and show its asymptotic superiority in detecting the common latent subspace. Overall, our results offer a comprehensive theoretical understanding of high-dimensional PLS-SVD, clarifying both its advantages and fundamental limitations.
中文标题/摘要
标题:高维偏最小二乘法:谱分析与基本局限
偏最小二乘法(PLS)是一种广泛用于数据整合的方法,旨在从配对的高维数据集中提取共享的潜在成分。尽管在实践中取得了数十年的成功,但在高维情况下对其行为的精确理论理解仍然有限。在本文中,我们研究了一种数据整合模型,其中两个高维数据矩阵共享一个低秩的共同潜在结构,同时包含个体特定的成分。我们使用随机矩阵理论中的工具分析了相关协方差矩阵的奇异向量,并推导出了估计值与真实潜在方向之间对齐的渐近特征。这些结果为基于奇异值分解的PLS变体(PLS-SVD)的重建性能提供了定量解释,并确定了方法表现出反直觉或限制性行为的区域。在此分析的基础上,我们将PLS-SVD与分别应用于每个数据集的主成分分析进行比较,并展示了其在检测共同潜在子空间方面的渐近优越性。总体而言,我们的结果为高维PLS-SVD提供了全面的理论理解,澄清了其优势和基本限制。
Summary / 总结
This paper investigates the behavior of Partial Least Squares (PLS) in high-dimensional settings, focusing on how well the estimated latent directions align with the true ones when two paired datasets share a low-rank latent structure. Using random matrix theory, the authors analyze the singular vectors of the cross-covariance matrix and derive asymptotic characterizations of this alignment, which explains the reconstruction performance of PLS-SVD and identifies regimes where it behaves counter-intuitively. The study also shows that PLS-SVD is asymptotically superior to principal component analysis applied separately to each dataset for detecting the common latent subspace.
本文研究了高维环境下偏最小二乘法(PLS)的行为,特别是配对数据集之间潜在成分的对齐情况。通过随机矩阵理论,作者推导出了估计值和真实潜在方向之间对齐的渐近特性,为PLS-SVD的表现提供了理论解释。研究还表明,PLS-SVD在检测共同潜在子空间方面渐近优于分别对每个数据集进行的主成分分析,从而为该方法的优势和基本局限性提供了全面的理解。
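As a rough illustration of the PLS-SVD idea described in the abstract above, the sketch below takes the SVD of a sample cross-covariance matrix and measures how well the leading singular vectors align with planted directions. The data model (one shared latent score plus isotropic Gaussian noise, with no individual-specific low-rank components), the signal strength, and the helper names `pls_svd_directions` and `alignment` are all assumptions made for this example, not the paper's setup.

```python
import numpy as np

def pls_svd_directions(X, Y, k):
    """Top-k singular triplets of the sample cross-covariance X^T Y / n."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each dataset
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / n                # (p, q) cross-covariance
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k].T

def alignment(est, true):
    """Squared cosine between an estimated and a true direction."""
    return float((est @ true) ** 2 / ((est @ est) * (true @ true)))

# Toy data: both matrices share one latent score t (hypothetical model).
rng = np.random.default_rng(0)
n, p, q = 2000, 50, 40
u = rng.standard_normal(p); u /= np.linalg.norm(u)
v = rng.standard_normal(q); v /= np.linalg.norm(v)
t = rng.standard_normal(n)
X = 3.0 * np.outer(t, u) + rng.standard_normal((n, p))
Y = 3.0 * np.outer(t, v) + rng.standard_normal((n, q))

U, s, V = pls_svd_directions(X, Y, k=1)
align_x = alignment(U[:, 0], u)
align_y = alignment(V[:, 0], v)
print(align_x, align_y)
```

Shrinking `n` relative to `p` and `q`, or weakening the signal, is a simple way to watch the alignment degrade, which is the kind of high-dimensional regime the paper analyzes.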
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Authors: Adam Karvonen, James Chua, Clément Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Arnab Sen Sharma, Daniel Wen, Owain Evans, Samuel Marks
First: 2025-12-17T18:26:28+00:00 · Latest: 2025-12-17T18:26:28+00:00
Comments: 36 pages
Abstract
Large language model (LLM) activations are notoriously difficult to understand, with most existing techniques using complex, specialized methods for interpreting them. Recent work has proposed a simpler approach known as LatentQA: training LLMs to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language. However, prior work has focused on narrow task settings for both training and evaluation. In this paper, we instead take a generalist perspective. We evaluate LatentQA-trained models, which we call Activation Oracles (AOs), in far out-of-distribution settings and examine how performance scales with training data diversity. We find that AOs can recover information fine-tuned into a model (e.g., biographical knowledge or malign propensities) that does not appear in the input text, despite never being trained with activations from a fine-tuned model. Our main evaluations are four downstream tasks where we can compare to prior white- and black-box techniques. We find that even narrowly-trained LatentQA models can generalize well, and that adding additional training datasets (such as classification tasks and a self-supervised context prediction task) yields consistent further improvements. Overall, our best AOs match or exceed prior white-box baselines on all four tasks and are the best method on 3 out of 4. These results suggest that diversified training to answer natural-language queries imparts a general capability to verbalize information about LLM activations.
中文标题/摘要
标题:激活先知:作为通用激活解释器训练和评估LLM
大型语言模型(LLM)的激活过程非常难以理解,现有的大多数技术都使用复杂的专业方法来解释它们。最近的工作提出了一种更简单的方法,称为LatentQA:训练LLM直接接受LLM激活作为输入,并用自然语言回答关于它们的任意问题。然而,先前的工作主要集中在狭窄的任务设置中进行训练和评估。在本文中,我们采取了一种通才视角。我们评估了LatentQA训练的模型,我们称之为激活先知(AOs),并在远超出分布的环境中进行评估,并检查了训练数据多样性如何影响性能。我们发现,即使从未用过微调模型的激活进行训练,AOs也能恢复模型中微调的信息(例如,生平知识或恶意倾向),而这些信息并未出现在输入文本中。我们的主要评估是在四个下游任务中,可以与先前的白盒和黑盒技术进行比较。我们发现,即使经过狭窄训练的LatentQA模型也能很好地泛化,而增加额外的训练数据集(如分类任务和自监督上下文预测任务)可以带来一致的进一步改进。总体而言,我们最好的AOs在所有四个任务中与先前的白盒基线相当或超过基线,并且在三个任务中是最佳方法。这些结果表明,为了回答自然语言查询而进行多样化的训练赋予了LLM激活信息的口头表达能力。
Summary / 总结
This paper studies LatentQA-trained models, which the authors call Activation Oracles (AOs), as general-purpose explainers of large language model (LLM) activations. The authors evaluate AOs in far out-of-distribution settings and find that they can recover information fine-tuned into a model that never appears in the input text. Adding more diverse training datasets yields consistent further improvements; the best AOs match or exceed prior white-box baselines on all four downstream tasks and are the best method on three of them.
本文研究了使用LatentQA训练的模型,称为Activation Oracles (AOs),以通用视角解释大型语言模型(LLM)的激活。研究在分布外场景下评估AOs,并发现它们能够恢复输入文本中未出现的细调信息。研究显示,窄训练的LatentQA模型可以很好地泛化,增加多样化的训练数据集可以进一步提高性能。最好的AOs在四个下游任务中匹配或超过了之前的白盒基线,并且在四个任务中的三个任务上是最佳方法,表明多样化训练增强了对LLM激活信息的口头表达能力。
Explaining the Reasoning of Large Language Models Using Attribution Graphs
Authors: Chase Walker, Rickard Ewetz
First: 2025-12-17T18:15:26+00:00 · Latest: 2025-12-17T18:15:26+00:00
Abstract
Large language models (LLMs) exhibit remarkable capabilities, yet their reasoning remains opaque, raising safety and trust concerns. Attribution methods, which assign credit to input features, have proven effective for explaining the decision making of computer vision models. From these, context attributions have emerged as a promising approach for explaining the behavior of autoregressive LLMs. However, current context attributions produce incomplete explanations by directly relating generated tokens to the prompt, discarding inter-generational influence in the process. To overcome these shortcomings, we introduce the Context Attribution via Graph Explanations (CAGE) framework. CAGE introduces an attribution graph: a directed graph that quantifies how each generation is influenced by both the prompt and all prior generations. The graph is constructed to preserve two properties-causality and row stochasticity. The attribution graph allows context attributions to be computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, CAGE improves context attribution faithfulness, achieving average gains of up to 40%.
中文标题/摘要
标题:使用归因图解释大型语言模型的推理
大型语言模型(LLMs)表现出色,但其推理过程仍然不透明,这引发了安全性和信任问题。归因方法通过将信用分配给输入特征,已被证明对解释计算机视觉模型的决策有效。从这些方法中,上下文归因已成为解释自回归LLMs行为的一种有前途的方法。然而,当前的上下文归因直接将生成的标记与提示相关联,忽略了生成标记之间的相互影响,因而产生不完整的解释。为克服这些不足,我们提出了基于图解释的上下文归因(CAGE)框架。CAGE引入了一个归因图:一个有向图,量化了每一次生成如何受到提示和所有先前生成的影响。该图构建时保留了因果性和行随机性两种属性。归因图允许通过沿图中路径对中间贡献进行边缘化来计算上下文归因。在多个模型、数据集、度量标准和方法上,CAGE提高了上下文归因的忠实度,平均提升最高可达40%。
Summary / 总结
This study addresses the opacity of reasoning in large language models (LLMs) by introducing the Context Attribution via Graph Explanations (CAGE) framework. CAGE constructs an attribution graph that quantifies how each generation is influenced by the prompt and by all prior generations, while preserving causality and row stochasticity. Context attributions are then computed by marginalizing intermediate contributions along paths in the graph. Across multiple models, datasets, metrics, and methods, this improves context attribution faithfulness, with average gains of up to 40%.
论文通过引入Context Attribution via Graph Explanations (CAGE)框架来解决大型语言模型(LLMs)推理的不透明性问题。CAGE构建了一个归因图,量化了提示及所有先前生成对每一次生成的影响,并保持因果性和行随机性。这种方法提高了上下文归因的忠实度,在各种模型、数据集、指标和方法上平均提升最高可达40%。
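To make the CAGE abstract's "marginalizing intermediate contributions along paths" more concrete, here is a minimal sketch of one plausible reading. It assumes the attribution graph is given as two matrices: `A_prompt[t, j]` (direct influence of prompt token `j` on generation `t`) and `A_gen[t, s]` (direct influence of earlier generation `s` on generation `t`). Causality is modeled as `A_gen` being strictly lower triangular, and row stochasticity as each row of the concatenated matrix summing to one. The recursion and the function name `prompt_attribution` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def prompt_attribution(A_prompt, A_gen):
    """Push influence on earlier generations back onto the prompt.

    T[t] = A_prompt[t] + sum_{s < t} A_gen[t, s] * T[s]
    """
    A_prompt = np.asarray(A_prompt, dtype=float)
    A_gen = np.asarray(A_gen, dtype=float)
    # Causality: a generation may only depend on strictly earlier ones.
    assert np.allclose(np.triu(A_gen), 0.0)
    # Row stochasticity: each generation's direct influences sum to one.
    assert np.allclose(A_prompt.sum(axis=1) + A_gen.sum(axis=1), 1.0)

    T = np.zeros_like(A_prompt)
    for t in range(A_gen.shape[0]):
        T[t] = A_prompt[t] + A_gen[t, :t] @ T[:t]
    return T

# Two prompt tokens, three generated tokens (made-up numbers).
A_prompt = [[0.7, 0.3],
            [0.2, 0.2],
            [0.1, 0.1]]
A_gen = [[0.0, 0.0, 0.0],
         [0.6, 0.0, 0.0],
         [0.3, 0.5, 0.0]]
T = prompt_attribution(A_prompt, A_gen)
print(T)
```

Under these assumptions every row of `T` still sums to one, so the total attribution of each generation is fully redistributed over prompt tokens rather than lost to intermediate generations.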
Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
Authors: Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu
First: 2025-12-17T18:15:17+00:00 · Latest: 2025-12-17T18:15:17+00:00
Comments: Under Review
Abstract
Human beings solve complex problems through critical thinking, where reasoning and evaluation are intertwined to converge toward correct solutions. However, most existing large language models (LLMs) decouple reasoning from verification: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors post hoc. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning. Motivated by human critical thinking, we propose Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model. STC is trained with a hybrid reinforcement learning objective combining reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critic-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.
中文标题/摘要
标题:逐步思考-批判:一种统一的鲁棒且可解释的大语言模型推理框架
人类通过批判性思维解决复杂问题,其中推理和评估交织在一起,逐步趋向正确的解决方案。然而,现有的大多数大语言模型(LLMs)将推理与验证脱钩:它们要么生成推理而没有明确的自我检查,要么依赖外部验证者在事后检测错误。前者缺乏即时反馈,而后者增加了系统复杂性并妨碍了同步学习。受人类批判性思维的启发,我们提出了一种统一框架Stepwise Think-Critique (STC),该框架在单一模型中将推理和自我批判交织在一起。STC 通过结合推理奖励和批判一致性奖励的混合强化学习目标,同时优化推理质量和自我评估。在数学推理基准测试中的实验表明,STC 展示了强大的批判性思维能力,并生成了更可解释的推理轨迹,代表了向具有内置批判性思维的大语言模型迈进的一步。
Summary / 总结
The research aims to enhance the robustness and interpretability of large language models (LLMs) by integrating reasoning and self-critique. The proposed Stepwise Think-Critique (STC) framework interleaves reasoning and self-critique at each step within a single model, and is trained with a hybrid reinforcement learning objective that combines reasoning rewards and critique-consistency rewards. Experiments on mathematical reasoning benchmarks indicate that STC shows strong critical-thinking capabilities and produces more interpretable reasoning traces.
研究旨在通过将推理与自我批判相结合来增强大型语言模型(LLMs)的鲁棒性和可解释性。提出的Stepwise Think-Critique (STC)框架在单一模型中交替进行这些过程,并使用混合强化学习目标来优化推理质量和自我评估。实验表明,STC在数学推理基准测试中提高了自我批判能力,并生成了更可解释的推理痕迹,优于现有模型。
From Trace to Line: LLM Agent for Real-World OSS Vulnerability Localization
Authors: Haoran Xi, Minghao Shao, Brendan Dolan-Gavitt, Muhammad Shafique, Ramesh Karri
First: 2025-09-30T22:27:18+00:00 · Latest: 2025-12-17T18:10:36+00:00
Abstract
Large language models show promise for vulnerability discovery, yet prevailing methods inspect code in isolation, struggle with long contexts, and focus on coarse function- or file-level detections, which offer limited actionable guidance to engineers who need precise line-level localization and targeted patches in real-world software development. We present T2L-Agent (Trace-to-Line Agent), a project-level, end-to-end framework that plans its own analysis and progressively narrows scope from modules to exact vulnerable lines. T2L-Agent couples multi-round feedback with an Agentic Trace Analyzer (ATA) that fuses run-time evidence such as crash points, stack traces, and coverage deltas with AST-based code chunking, enabling iterative refinement beyond single-pass predictions and translating symptoms into actionable, line-level diagnoses. To benchmark line-level vulnerability discovery, we introduce T2L-ARVO, a diverse, expert-verified 50-case benchmark spanning five crash families and real-world projects. T2L-ARVO is specifically designed to support both coarse-grained detection and fine-grained localization, enabling rigorous evaluation of systems that aim to move beyond file-level predictions. On T2L-ARVO, T2L-Agent achieves up to 58.0% detection and 54.8% line-level localization, substantially outperforming baselines. Together, the framework and benchmark push LLM-based vulnerability detection from coarse identification toward deployable, robust, precision diagnostics that reduce noise and accelerate patching in open-source software workflows.
中文标题/摘要
标题:从踪迹到线:面向现实世界开源软件漏洞定位的LLM代理
大型语言模型在漏洞发现方面显示出潜力,但现有方法孤立地检查代码,难以处理长上下文,并且主要集中在粗粒度的功能或文件级别的检测上,这为需要精确行级定位和针对性修补的工程师提供了有限的行动指导。我们提出了T2L-Agent(踪迹到线代理),这是一种项目级别的端到端框架,能够自主规划分析,并逐步将范围从模块缩小到具体的漏洞行。T2L-Agent 结合多轮反馈与行动踪迹分析器(ATA),将运行时证据(如崩溃点、堆栈跟踪和覆盖率差异)与基于AST的代码片段化相结合,实现迭代细化,超越单次预测,并将症状转化为可操作的行级诊断。为了评估行级漏洞发现,我们引入了T2L-ARVO基准,这是一个多样化的、专家验证的50个案例基准,涵盖了五个崩溃家族和实际项目。T2L-ARVO特别设计用于支持粗粒度检测和细粒度定位,使系统能够超越文件级别的预测进行严格的评估。在T2L-ARVO上,T2L-Agent 的检测率最高可达58.0%,行级定位率最高可达54.8%,显著优于基线。该框架和基准共同推动了基于LLM的漏洞检测从粗略识别向可部署、稳健、精确诊断的发展,减少噪音并加速开源软件工作流中的补丁修复。
Summary / 总结
The research aims to improve vulnerability localization in open-source software by addressing the limitations of existing methods that focus on coarse-level detections and struggle with long code contexts. T2L-Agent, a project-level framework, uses an iterative approach to progressively narrow down the scope from modules to specific vulnerable lines. It integrates multi-round feedback with an Agentic Trace Analyzer to refine predictions and translate symptoms into actionable line-level diagnoses. On the T2L-ARVO benchmark, T2L-Agent demonstrates up to 58.0% detection and 54.8% line-level localization, significantly outperforming baseline methods.
研究旨在通过解决现有方法关注粗粒度检测和难以处理长代码上下文的问题,提高开源软件中的漏洞定位。T2L-Agent 是一个项目级别的框架,采用逐步缩小范围的方法,从模块到具体的脆弱行。它结合多轮反馈和一个代理追踪分析器来细化预测,并将症状转化为可操作的行级诊断。在T2L-ARVO基准测试上,T2L-Agent 的检测率高达58.0%,行级定位率为54.8%,显著优于基线方法。
VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
Authors: Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang
First: 2025-12-17T17:58:35+00:00 · Latest: 2025-12-17T17:58:35+00:00
Abstract
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
中文标题/摘要
标题:VTCBench:视觉语言模型能否通过视觉文本压缩理解长上下文?
LLM扩展上下文窗口相关的计算和内存开销严重限制了其可扩展性。值得注意的解决方案是视觉文本压缩(VTC),如DeepSeek-OCR和Glyph等框架,将长文本转换为密集的二维视觉表示,从而实现3倍至20倍的标记压缩比。然而,这种高信息密度对视觉语言模型(VLM)的核心长上下文能力的影响仍缺乏研究。为填补这一空白,我们首次引入了VTC基准,并系统评估了VLM在三种长上下文理解设置中的性能:VTC-检索,评估模型检索和聚合信息的能力;VTC-推理,要求模型通过最小的词汇重叠来推断潜在关联以定位事实;VTC-记忆,衡量模型在长期对话记忆中进行综合问答的能力。此外,我们建立了VTCBench-Wild以模拟多种输入场景。我们在基准上全面评估了领先开源和专有模型。结果表明,尽管大多数VLM能够很好地解码文本信息(如OCR),但在使用VTC压缩信息时,它们在长上下文理解方面表现出令人惊讶的差劲能力,无法捕捉上下文中的长期关联或依赖关系。本研究为VTC的理解提供了深入的理解,并为设计更高效和可扩展的VLM奠定了基础。
Summary / 总结
The study introduces VTCBench, a benchmark to evaluate vision-language models (VLMs) on long-context understanding with vision-text compression (VTC). It assesses models in three settings: VTC-Retrieval, VTC-Reasoning, and VTC-Memory. The results show that most VLMs struggle to understand long associations and dependencies when using VTC-compressed information, despite their ability to decode textual information well. This highlights the need for improving VLMs' long-context understanding capabilities.
该研究引入了VTCBench,用于评估使用视觉文本压缩(VTC)的视觉语言模型(VLM)在长上下文理解方面的能力。它在VTC-Retrieval、VTC-Reasoning和VTC-Memory三个场景下评估模型,并发现大多数VLM在处理VTC压缩的信息时难以理解长上下文,无法捕捉到长的关联或依赖关系。该研究强调了提高VLM处理压缩的视觉文本信息能力的必要性。
Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift
Authors: Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen
First: 2025-12-17T17:54:20+00:00 · Latest: 2025-12-17T17:54:20+00:00
Comments: Code at: https://github.com/Jiacheng8/HALD
Abstract
Soft labels generated by teacher models have become a dominant paradigm for knowledge transfer and recent large-scale dataset distillation such as SRe2L, RDED, LPLD, offering richer supervision than conventional hard labels. However, we observe that when only a limited number of crops per image are used, soft labels are prone to local semantic drift: a crop may visually resemble another class, causing its soft embedding to deviate from the ground-truth semantics of the original image. This mismatch between local visual content and global semantic meaning introduces systematic errors and distribution misalignment between training and testing. In this work, we revisit the overlooked role of hard labels and show that, when appropriately integrated, they provide a powerful content-agnostic anchor to calibrate semantic drift. We theoretically characterize the emergence of drift under few soft-label supervision and demonstrate that hybridizing soft and hard labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which leverages hard labels as intermediate corrective signals while retaining the fine-grained advantages of soft labels. Extensive experiments on dataset distillation and large-scale conventional classification benchmarks validate our approach, showing consistent improvements in generalization. On ImageNet-1K, we achieve 42.7% with only 285M storage for soft labels, outperforming prior state-of-the-art LPLD by 9.0%. Our findings re-establish the importance of hard labels as a complementary tool, and call for a rethinking of their role in soft-label-dominated training.
中文标题/摘要
标题:硬标签入局!重新思考硬标签在缓解局部语义漂移中的作用
由教师模型生成的软标签已成为知识迁移和大规模数据集蒸馏(如SRe2L、RDED、LPLD)中的主导范式,提供了比传统硬标签更丰富的监督。然而,我们观察到,当每张图像仅使用有限数量的裁剪时,软标签容易出现局部语义漂移:一个裁剪可能在视觉上类似于另一个类别,导致其软嵌入偏离原始图像的真实语义。这种局部视觉内容与全局语义意义之间的不匹配引入了训练和测试之间的系统性错误和分布不一致。在本文中,我们重新审视了被忽视的硬标签作用,并展示了当适当集成时,它们提供了一种强大的内容无关锚点来校准语义漂移。我们从理论上描述了在少量软标签监督下漂移的出现,并证明了混合软标签和硬标签恢复了视觉内容和语义监督之间的对齐。基于这一见解,我们提出了一种新的训练范式,即缓解局部语义漂移的硬标签(HALD),利用硬标签作为中间纠正信号,同时保留软标签的细粒度优势。在数据集蒸馏和大规模传统分类基准上的广泛实验验证了我们的方法,展示了一致的泛化改进。在ImageNet-1K上,我们仅使用285M的软标签存储就实现了42.7%,超越了先前的最先进方法LPLD 9.0%。我们的研究重新确立了硬标签作为补充工具的重要性,并呼吁重新思考它们在软标签主导训练中的作用。
Summary / 总结
This paper addresses local semantic drift in soft labels generated by teacher models: when only a few crops per image are used, a crop may resemble another class, which causes systematic errors and a distribution mismatch between training and testing. It proposes a new training paradigm, HALD, that uses hard labels as a content-agnostic anchor to calibrate this drift while keeping the fine-grained benefits of soft labels. Experiments show consistent improvements in generalization; on ImageNet-1K the method reaches 42.7% with only 285M of storage for soft labels, outperforming the prior state-of-the-art LPLD by 9.0%.
本文解决了软标签中的局部语义漂移问题,这会导致系统性错误和分布不一致。研究提出了一种新的训练范式HALD,通过整合硬标签提供内容无关的锚点,从而校准语义漂移。实验结果显示在泛化性能上的一致改进,在ImageNet-1K上的表现显著优于先前的方法。
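The HALD abstract describes hybridizing soft and hard labels so that the hard label anchors a crop whose teacher prediction has drifted toward another class. A minimal sketch of that general idea is a convex combination of a hard-label cross-entropy and a soft-label cross-entropy. The mixing weight `lam` and the function `hybrid_loss` are assumptions for this example; the paper's actual schedule for using hard labels as "intermediate corrective signals" is not given in the abstract and is not reproduced here.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def hybrid_loss(student_logits, teacher_probs, hard_labels, lam=0.5):
    """lam * CE(hard label) + (1 - lam) * CE(teacher soft label)."""
    logp = log_softmax(np.asarray(student_logits, dtype=float))
    n = logp.shape[0]
    ce_hard = -logp[np.arange(n), hard_labels].mean()
    ce_soft = -(np.asarray(teacher_probs) * logp).sum(axis=-1).mean()
    return float(lam * ce_hard + (1.0 - lam) * ce_soft)

# One crop of a class-0 image that visually resembles class 1, so the
# teacher's soft label has drifted toward class 1 (made-up numbers).
student_logits = np.array([[1.0, 2.0, 0.0]])
drifted_teacher = np.array([[0.2, 0.7, 0.1]])
hard_label = np.array([0])

soft_only = hybrid_loss(student_logits, drifted_teacher, hard_label, lam=0.0)
hybrid = hybrid_loss(student_logits, drifted_teacher, hard_label, lam=0.5)
print(soft_only, hybrid)
```

With `lam=0.0` the student is rewarded for agreeing with the drifted teacher; raising `lam` increases the penalty for predicting class 1 on an image whose ground-truth class is 0.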
Structure-Aligned Protein Language Model
Authors: Can Chen, David Heurtel-Depeiges, Robert M. Vernon, Christopher James Langmead, Yoshua Bengio, Quentin Fournier
First: 2025-05-22T16:56:12+00:00 · Latest: 2025-12-17T17:53:11+00:00
Comments: 28 pages, 16 figures, 9 tables
Abstract
Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but often lack the structural knowledge essential for some biological applications. To address this, we introduce a method to enrich pLMs with structural knowledge by leveraging pre-trained protein graph neural networks (pGNNs). First, a latent-level contrastive learning task aligns residue representations from pLMs with those from pGNNs across multiple proteins, injecting inter-protein structural information. Additionally, a physical-level task integrates intra-protein information by training pLMs to predict structure tokens. Together, the proposed dual-task framework effectively incorporates both inter- and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in PDB, we further introduce a residue loss selection module that uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method as a simple, lightweight post-training step to the state-of-the-art ESM2 and AMPLIFY yields notable performance gains. These improvements are consistent across a wide range of tasks, including substantial gains in deep mutational scanning (DMS) fitness prediction and a 59% increase in P@L for ESM2 650M contact prediction on CASP16. Furthermore, we demonstrate that these performance gains are robust, scaling with model sizes from 8M to 650M and extending to different downstream tasks.
中文标题/摘要
标题:结构对齐蛋白质语言模型
蛋白质语言模型(pLMs)在大规模蛋白质序列数据库上预训练,擅长各种下游任务,但往往缺乏某些生物应用所需的结构知识。为解决这一问题,我们提出了一种方法,通过利用预训练的蛋白质图神经网络(pGNNs)来丰富pLMs的结构知识。首先,一个潜在级别的对比学习任务将pLMs和pGNNs中的残基表示在多种蛋白质上对齐,注入跨蛋白质的结构信息。此外,一个物理级别的任务通过训练pLMs预测结构标记来整合蛋白质内的信息。结合提出的双任务框架,有效地将跨蛋白质和蛋白质内的结构知识整合到pLMs中。鉴于PDB中蛋白质结构质量的变异性,我们进一步引入了一个残基损失选择模块,该模块使用一个小模型在高质量结构上训练,选择可靠且具有挑战性的残基损失供pLM学习。将我们的结构对齐方法作为简单的轻量级后训练步骤应用于当前最先进的ESM2和AMPLIFY,可获得显著的性能提升。这些改进在各种任务中保持一致,包括深度突变扫描(DMS)适应性预测的显著提升,以及ESM2 650M接触预测在CASP16上的P@L提高59%。此外,我们证明了这些性能提升是稳健的,随着模型大小从8M扩展到650M,并扩展到不同的下游任务。
Summary / 总结
This study addresses the lack of structural knowledge in protein language models (pLMs) by injecting structural information from pre-trained protein graph neural networks (pGNNs). The method uses a dual-task framework: a latent-level contrastive task aligns pLM and pGNN residue representations across proteins, and a physical-level task trains the pLM to predict structure tokens. A residue loss selection module, based on a small model trained on high-quality structures, picks reliable yet challenging residue losses. Applied as a lightweight post-training step to ESM2 and AMPLIFY, the approach improves performance across a wide range of tasks, including deep mutational scanning fitness prediction and a 59% increase in P@L for ESM2 650M contact prediction on CASP16.
该研究通过引入一种方法,利用预训练的蛋白质图神经网络(pGNNs)来丰富蛋白质语言模型(pLMs)的结构知识,解决了pLMs缺乏结构信息的问题。该方法包括一个潜层对比学习任务和一个物理层任务,以对齐和整合结构信息。提出的双重任务框架有效地结合了蛋白质间的和蛋白质内的结构信息。在应用这种结构对齐方法后,ESM2和AMPLIFY等最先进的pLMs的性能显著提高,特别是在深突变扫描适应性预测和CASP16上的接触预测方面取得了显著进步。
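The latent-level task in the abstract above aligns pLM residue representations with pGNN residue representations through contrastive learning. The sketch below shows a generic InfoNCE-style loss in which residue `i` in one embedding matrix is the positive for residue `i` in the other, and all other residues act as negatives. The temperature value, the absence of projection heads, and the function name `info_nce` are assumptions for illustration; the paper's exact loss may differ.

```python
import numpy as np

def info_nce(h_plm, h_gnn, temperature=0.1):
    """InfoNCE loss with matching rows as positives.

    h_plm, h_gnn: (num_residues, dim) embeddings of the same residues.
    """
    a = h_plm / np.linalg.norm(h_plm, axis=1, keepdims=True)
    b = h_gnn / np.linalg.norm(h_gnn, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))         # positives on the diagonal

rng = np.random.default_rng(0)
h = rng.standard_normal((8, 16))                   # 8 residues, 16-dim

loss_aligned = info_nce(h, h)                      # perfectly aligned
loss_shuffled = info_nce(h, np.roll(h, 1, axis=0)) # positives mismatched
print(loss_aligned, loss_shuffled)
```

In a real training setup the pGNN side would typically be frozen and gradients would flow only into the pLM, but that detail is also an assumption here.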
IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning
Authors: Yuanhang Li, Yiren Song, Junzhe Bai, Xinran Liang, Hu Yang, Libiao Jin, Qi Mao
First: 2025-12-17T17:47:18+00:00 · Latest: 2025-12-17T17:47:18+00:00
Abstract
We propose IC-Effect, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (e.g., flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as clean contextual conditions, exploiting the contextual learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning 15 high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.
中文标题/摘要
标题:IC-Effect:基于上下文学习的精确高效视频特效编辑
我们提出了一种名为IC-Effect的指令引导、基于DiT的框架,用于少量样本的视频VFX编辑,能够合成复杂的特效(例如火焰、粒子和卡通人物)同时严格保持空间和时间一致性。视频VFX编辑极具挑战性,因为注入的特效必须与背景无缝融合,背景必须完全不变,且必须从有限的配对数据中高效学习特效模式。然而,现有的视频编辑模型无法满足这些要求。IC-Effect 利用源视频作为干净的上下文条件,利用DiT模型的上下文学习能力实现精确的背景保留和自然的特效注入。通过两阶段训练策略,包括通用编辑适应和通过Effect-LoRA进行特效特定学习,确保了强大的指令跟随和稳健的特效建模。为了进一步提高效率,我们引入了时空稀疏分词,在大幅减少计算量的同时保持高保真度。我们还发布了涵盖15种高质量视觉风格的配对VFX编辑数据集。大量实验表明,IC-Effect 提供了高质量、可控且时间一致的VFX编辑,为视频创作开辟了新的可能性。
Summary / 总结
IC-Effect is an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects while preserving spatial and temporal consistency. It treats the source video as a clean contextual condition to achieve precise background preservation and natural effect injection, and uses a two-stage training strategy (general editing adaptation followed by effect-specific learning via Effect-LoRA) for strong instruction following and robust effect modeling. Spatiotemporal sparse tokenization reduces computation while keeping high fidelity, and the authors also release a paired VFX editing dataset spanning 15 visual styles.
IC-Effect 是一个基于 DiT 模型的指令引导框架,用于少量样本的视频 VFX 编辑,能够合成复杂的特效同时保持空间和时间的一致性。该框架利用源视频作为上下文条件,确保背景的精确保留和特效的自然注入。框架采用两阶段训练策略和时空稀疏标记化,实现高质量、可控且时间一致的 VFX 编辑,超越现有模型的表现。
How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness
Authors: Darshita Rathore, Vineet Kumar, Chetna Bansal, Anindya Moitra
First: 2025-12-17T17:44:09+00:00 · Latest: 2025-12-17T17:44:09+00:00
Comments: Accepted at AACL IJCNLP 2025
Abstract
Large language models are increasingly adapted to downstream tasks through fine-tuning. Full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), are two dominant approaches. While PEFT methods are widely used for their computational efficiency, the implications of their configurations (e.g., rank) remain under-explored in downstream Q&A tasks and generalisation. In this work, we perform a comprehensive evaluation across multiple reasoning and recall datasets, conducting a rank sweep to quantify the trade-off between SFT and PEFT. We also compare the accuracy of PEFT and SFT models across in-domain and out-of-domain adaptation, highlighting distinct generalisation behaviour and task-specific forgetting. We demonstrate that LoRA achieves competitive and in some cases superior performance compared to SFT, particularly on reasoning tasks at specific rank values. Additionally, we analyze the internal representations via spectral features and layer-wise attention structures, offering insights into representational drift and structural changes in attention patterns.
中文标题/摘要
标题:多少才是太多?探索LoRA秩折衷以保留知识和领域稳健性
大型语言模型越来越多地通过微调适应下游任务。全监督微调(SFT)和参数高效微调(PEFT)方法,如低秩适应(LoRA),是两种主导方法。尽管PEFT方法因其计算效率而广泛使用,但其配置(如秩)的影响在下游问答任务和泛化方面仍被研究不足。在本研究中,我们在多个推理和回忆数据集上进行了全面评估,进行秩扫描以量化SFT和PEFT之间的权衡。我们还比较了PEFT和SFT模型在领域内和领域外适应的准确性,突显了不同的泛化行为和任务特定的遗忘。我们证明,在特定秩值下,LoRA在某些情况下实现了与SFT竞争甚至更优的性能。此外,我们通过频谱特征和逐层注意力结构分析了内部表示,提供了关于表示漂移和注意力模式结构变化的见解。
Summary / 总结
This study evaluates the trade-offs between full supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) methods, specifically focusing on the Low-Rank Adaptation (LoRA) technique. The researchers conducted a comprehensive evaluation across multiple datasets, varying the rank parameter to understand its impact on knowledge retention and domain robustness. They found that LoRA can achieve competitive performance compared to SFT, especially at certain rank values, and provides insights into internal model representations and attention patterns.
研究评估了全监督微调(SFT)和参数高效微调(PEFT)方法之间的权衡,特别是Low-Rank Adaptation(LoRA)技术。通过在多种数据集上的全面评估,研究探讨了不同秩值对模型性能和泛化能力的影响。关键发现表明,LoRA在特定秩值下,特别是在推理任务上,可以实现与SFT竞争的性能,并提供了关于模型内部表示和注意力模式变化的见解。
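For readers unfamiliar with what the LoRA rank controls in the study above, the sketch below shows the standard low-rank reparameterization: a frozen weight `W` plus a trainable update `(alpha / r) * B @ A`, where `A` has shape `(r, d_in)` and `B` has shape `(d_out, r)`. The class name `LoRALinear`, the `alpha` default, and the layer sizes in the sweep are assumptions for illustration and are not taken from the paper's experiments.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable rank-r update (NumPy sketch)."""

    def __init__(self, W, rank, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                        # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01 # trainable
        self.B = np.zeros((d_out, rank))                  # trainable, zero init
        self.scale = alpha / rank

    def delta(self):
        """Effective weight update; its rank is at most `rank`."""
        return self.scale * self.B @ self.A

    def __call__(self, x):
        return x @ (self.W + self.delta()).T

    def trainable_params(self):
        return self.A.size + self.B.size

# Rank sweep: trainable parameters grow linearly with the rank.
d_out, d_in = 512, 512
W = np.zeros((d_out, d_in))
sweep = {r: LoRALinear(W, rank=r).trainable_params() for r in (1, 4, 16, 64)}
print(sweep, "full fine-tuning:", W.size)
```

Because `B` starts at zero, the adapted layer initially reproduces the base layer exactly; the rank then bounds how far the update can move the weights, which is the trade-off a rank sweep probes.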
Learning without training: The implicit dynamics of in-context learning
Authors: Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo
First: 2025-07-21T18:44:35+00:00 · Latest: 2025-12-17T17:34:33+00:00
Abstract
One of the most striking features of Large Language Models (LLMs) is their ability to learn in-context. Namely, at inference time, an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in-context and not only during training. Specifically, we show how a transformer block implicitly transforms a context into a low-rank weight-update of its MLP layer.
中文标题/摘要
标题:无需训练的学习:上下文学习的隐式动态
大型语言模型(LLMs)最引人注目的特征之一是其在上下文中的学习能力。即在推理时,LLM能够在提示中以示例形式呈现的新模式出现时,无需任何额外的权重更新就能学习新的模式,即使这些模式在训练过程中未见过。这些机制仍然很大程度上未知。在本文中,我们展示了将自注意力层与MLP堆叠起来,使变压器块能够根据上下文隐式修改MLP层的权重。我们通过理论和实验表明,这种简单的机制可能是LLM能够在上下文而非仅在训练期间学习的原因。具体而言,我们展示了变压器块如何隐式地将上下文转换为MLP层的低秩权重更新。
Summary / 总结
The research explores how Large Language Models (LLMs) learn new patterns at inference time without any weight update. The authors show that stacking a self-attention layer with an MLP lets the transformer block implicitly modify the MLP weights according to the context, in the form of a low-rank weight update. Through theory and experiments, they argue that this simple mechanism may be why LLMs can learn in-context and not only during training.
研究探讨了大型语言模型(LLMs)在推理时学习新模式而不进行额外训练的机制。通过将自注意力层与MLP堆叠,变压器块可以基于上下文隐式调整MLP的权重。实验表明,这种简单机制可能解释了为什么LLMs能够在推理时学习,而不是仅在训练期间学习。关键发现是,变压器块可以将上下文转换为MLP层的低秩权重更新,从而实现上下文学习。
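The claim above, that context can be absorbed into a low-rank update of the MLP weights, can be checked numerically on a toy case. The sketch assumes the first MLP layer is a matrix `W` applied to the attention output, and uses random vectors as stand-ins for the attention output without context (`a_x`) and with context (`a_cx`). The specific rank-1 formula `delta_W = W (a_cx - a_x) a_x^T / ||a_x||^2` is chosen here because it satisfies the identity exactly; whether it matches the paper's statement term for term should be checked against the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16

W = rng.standard_normal((d_hidden, d_model))  # first MLP layer (frozen)
a_x = rng.standard_normal(d_model)    # stand-in: attention output, query only
a_cx = rng.standard_normal(d_model)   # stand-in: attention output, with context

# Rank-1 update that moves the effect of the context into the weights.
delta_W = np.outer(W @ (a_cx - a_x), a_x) / (a_x @ a_x)

with_context = W @ a_cx                   # original weights, context present
without_context = (W + delta_W) @ a_x     # updated weights, context absent

print(np.allclose(with_context, without_context))
print(np.linalg.matrix_rank(delta_W))
```

The identity holds because `delta_W @ a_x` equals `W @ (a_cx - a_x)`, so the updated layer applied to the context-free input reproduces the original layer applied to the context-aware input.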
Persistent feature reconstruction of resident space objects (RSOs) within inverse synthetic aperture radar (ISAR) images
Authors: Morgan Coe, Gruffudd Jones, Leah-Nani Alconcel, Marina Gashinova
First: 2025-12-17T17:24:50+00:00 · Latest: 2025-12-17T17:24:50+00:00
Abstract
With the rapidly growing population of resident space objects (RSOs) in the near-Earth space environment, detailed information about their condition and capabilities is needed to provide Space Domain Awareness (SDA). Space-based sensing will enable inspection of RSOs at shorter ranges, independent of atmospheric effects, and from all aspects. The use of a sub-THz inverse synthetic aperture radar (ISAR) imaging and sensing system for SDA has been proposed in previous work, demonstrating the achievement of sub-cm image resolution at ranges of up to 100 km. This work focuses on recognition of external structures by use of sequential feature detection and tracking throughout the aligned ISAR images of the satellites. The Hough transform is employed to detect linear features, which are tracked throughout the sequence. ISAR imagery is generated via a metaheuristic simulator capable of modelling encounters for a variety of deployment scenarios. Initial frame-to-frame alignment is achieved through a series of affine transformations to facilitate later association between image features. A gradient-by-ratio method is used for edge detection within individual ISAR images, and edge magnitude and direction are subsequently used to inform a double-weighted Hough transform to detect features with high accuracy. Feature evolution during sequences of frames is analysed. It is shown that the use of feature tracking within sequences with the proposed approach will increase confidence in feature detection and classification, and an example use-case of robust detection of shadowing as a feature is presented.
中文标题/摘要
标题:近地空间环境内居民空间物体(RSOs)的持久特征重构
随着近地空间环境中居民空间物体(RSOs)数量的迅速增长,需要详细了解其状态和能力的信息以提供空间域意识(SDA)。基于空间的传感将使在较短的距离范围内检查RSOs成为可能,不受大气效应的影响,并可以从所有角度进行。在先前的工作中,提出了使用亚THz逆合成孔径雷达(ISAR)成像和传感系统进行SDA的方案,证明了在100公里范围内实现亚厘米级图像分辨率。本工作专注于通过在整个对齐的卫星ISAR图像序列中使用顺序特征检测和跟踪来识别外部结构。霍夫变换被用来检测线性特征,这些特征在整个序列中被跟踪。ISAR图像通过一个元启发式模拟器生成,该模拟器能够模拟各种部署场景下的遭遇。通过一系列仿射变换实现初始帧到帧的对齐,以促进后续图像特征之间的关联。使用梯度比方法进行边缘检测,随后使用边缘强度和方向来指导双加权霍夫变换以高精度检测特征。分析了帧序列中特征的演变。研究表明,使用所提出的方法在序列中进行特征跟踪将增加特征检测和分类的信心,并展示了鲁棒检测阴影作为特征的示例用例。
Summary / 总结
This research supports Space Domain Awareness (SDA) by recovering persistent structural features of resident space objects (RSOs) from sequences of sub-THz inverse synthetic aperture radar (ISAR) images. ISAR imagery is generated with a metaheuristic simulator, frames are aligned through a series of affine transformations, edges are detected with a gradient-by-ratio method, and a double-weighted Hough transform detects linear features that are then tracked across the sequence. The authors show that tracking features within sequences increases confidence in feature detection and classification, and they present robust detection of shadowing as an example use case.
研究旨在通过使用亚THz逆合成孔径雷达(ISAR)图像来检测和跟踪居民空间物体(RSO)的持久特征,以增强空间域意识(SDA)。方法包括使用霍夫变换进行特征检测,使用梯度比方法进行边缘检测,随后使用双加权霍夫变换进行高精度特征检测。研究表明,通过序列中的特征跟踪可以增加特征检测和分类的信心,以一个检测阴影特征的稳健检测为例进行了说明。
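The abstract above describes a Hough transform whose votes are informed by edge magnitude and edge direction. The following is a minimal sketch of that general idea, not the authors' implementation: the function name `weighted_hough`, the Gaussian direction weighting, and the `direction_tol` parameter are all assumptions made for illustration, and the gradient-by-ratio edge detector is not reproduced here (the edge magnitude and direction maps are taken as inputs).

```python
import numpy as np

def weighted_hough(magnitude, direction, n_theta=180, direction_tol=np.deg2rad(10)):
    """Hough accumulator with votes weighted by edge magnitude and direction.

    magnitude: 2-D array of edge strengths (0 means "no edge").
    direction: 2-D array of gradient angles in radians, same shape.
    Hypothetical sketch: each edge pixel votes for lines (rho, theta) with
    weight = magnitude * Gaussian(angle between gradient and line normal).
    """
    h, w = magnitude.shape
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    diag = int(np.ceil(np.hypot(h, w)))
    rhos = np.arange(-diag, diag + 1)
    acc = np.zeros((len(rhos), n_theta))
    cols = np.arange(n_theta)

    ys, xs = np.nonzero(magnitude > 0)
    for y, x in zip(ys, xs):
        # Angular distance (mod pi) between the gradient and each line normal.
        diff = np.abs((direction[y, x] - thetas + np.pi / 2) % np.pi - np.pi / 2)
        dir_weight = np.exp(-0.5 * (diff / direction_tol) ** 2)
        # Normal form of a line: rho = x*cos(theta) + y*sin(theta).
        rho_idx = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int) + diag
        acc[rho_idx, cols] += magnitude[y, x] * dir_weight
    return acc, thetas, rhos
```

Taking the `argmax` of the returned accumulator gives the strongest line in normal form. Weighting by direction suppresses votes from pixels whose gradient is inconsistent with a candidate line, which is one plausible reason to combine both cues.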
Evaluating Metrics for Safety with LLM-as-Judges
Authors: Kester Clegg, Richard Hawkins, Ibrahim Habli, Tom Lawton
First: 2025-12-17T17:24:49+00:00 · Latest: 2025-12-17T17:24:49+00:00
Abstract
LLMs (Large Language Models) are increasingly used in text processing pipelines to intelligently respond to a variety of inputs and generation tasks. This raises the possibility of replacing human roles that bottleneck existing information flows, either due to insufficient staff or process complexity. However, LLMs make mistakes and some processing roles are safety critical. For example, triaging post-operative care to patients based on hospital referral letters, or updating site access schedules in nuclear facilities for work crews. If we want to introduce LLMs into critical information flows that were previously performed by humans, how can we make them safe and reliable? Rather than make performative claims about augmented generation frameworks or graph-based techniques, this paper argues that the safety argument should focus on the type of evidence we get from evaluation points in LLM processes, particularly in frameworks that employ LLM-as-Judges (LaJ) evaluators. This paper argues that although we cannot get deterministic evaluations from many natural language processing tasks, by adopting a basket of weighted metrics it may be possible to lower the risk of errors within an evaluation, use context sensitivity to define error severity and design confidence thresholds that trigger human review of critical LaJ judgments when concordance across evaluators is low.
中文标题/摘要
标题:使用LLM作为裁判评估安全性指标
大型语言模型(LLMs)在文本处理管道中越来越多地用于智能响应各种输入和生成任务。这为替代瓶颈现有信息流的人类角色提供了可能性,无论是由于人员不足还是流程复杂。然而,LLMs 会犯错误,一些处理角色是安全关键的。例如,根据医院转诊信对术后护理进行分类,或在核设施中更新工作团队的访问时间表。如果我们想将LLMs引入之前由人类执行的关键信息流中,我们如何使它们安全可靠?而不是做出关于增强生成框架或图基方法的表演性声明,本文认为安全性论点应集中在LLM过程中评估点获得的证据类型,特别是在使用LLM作为裁判(LaJ)评估器的框架中。本文认为,尽管我们无法从许多自然语言处理任务中获得确定性评估,但通过采用加权指标篮子,可能可以在评估中降低错误风险,利用上下文敏感性定义错误严重性,并设计触发关键LaJ判断的人类审查的信心阈值,当评估者之间的一致性较低时。
Summary / 总结
This paper argues that the safety case for Large Language Models (LLMs) in critical information flows should rest on the type of evidence gathered at evaluation points, particularly in frameworks that use LLM-as-Judges (LaJ) evaluators. Although many natural language processing tasks do not admit deterministic evaluation, the authors suggest that a basket of weighted metrics may lower the risk of errors within an evaluation. Context sensitivity is used to define error severity, and confidence thresholds trigger human review of critical LaJ judgments when concordance across evaluators is low.
本文评估了确保大型语言模型(LLM)在关键信息流中安全性的指标,特别是在LLM作为法官(LaJ)使用的情况下。研究认为,虽然无法进行确定性评估,但通过采用加权指标的组合可以降低错误风险并定义错误严重性。研究建议设计置信阈值,在LaJ判断缺乏评估者一致时触发人工审核。
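The idea of a weighted basket of metrics with a concordance-based escalation rule can be sketched as follows. This is an illustrative reading of the abstract, not the paper's procedure: the function names, the use of the population standard deviation as a concordance measure, and the threshold values are all assumptions.

```python
from statistics import pstdev

def weighted_score(metrics, weights):
    """Combine several metric values in [0, 1] into one weighted average."""
    total = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total

def needs_human_review(judge_scores, severity, max_spread=0.15, critical_spread=0.05):
    """Escalate to a human when judges disagree too much.

    judge_scores: one weighted score per LLM judge.
    severity: "critical" applies the stricter (hypothetical) spread limit,
    so context determines how much disagreement is tolerated.
    """
    spread = pstdev(judge_scores)
    limit = critical_spread if severity == "critical" else max_spread
    return spread > limit
```

For example, three judges scoring 0.9, 0.75 and 0.8 would pass for a routine task under these placeholder thresholds but would be escalated for a critical one.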
Behavior Tokens Speak Louder: Disentangled Explainable Recommendation with Behavior Vocabulary
Authors: Xinshun Feng, Mingzhe Liu, Yi Qiao, Tongyu Zhu, Leilei Sun, Shuai Wang
Venue: AAAI 2026
First: 2025-12-17T17:24:24+00:00 · Latest: 2025-12-17T17:24:24+00:00
Comments: accepted by AAAI 2026
Abstract
Recent advances in explainable recommendations have explored the integration of language models to analyze natural language rationales for user-item interactions. Despite their potential, existing methods often rely on ID-based representations that obscure semantic meaning and impose structural constraints on language models, thereby limiting their applicability in open-ended scenarios. These challenges are intensified by the complex nature of real-world interactions, where diverse user intents are entangled and collaborative signals rarely align with linguistic semantics. To overcome these limitations, we propose BEAT, a unified and transferable framework that tokenizes user and item behaviors into discrete, interpretable sequences. We construct a behavior vocabulary via a vector-quantized autoencoding process that disentangles macro-level interests and micro-level intentions from graph-based representations. We then introduce multi-level semantic supervision to bridge the gap between behavioral signals and language space. A semantic alignment regularization mechanism is designed to embed behavior tokens directly into the input space of frozen language models. Experiments on three public datasets show that BEAT improves zero-shot recommendation performance while generating coherent and informative explanations. Further analysis demonstrates that our behavior tokens capture fine-grained semantics and offer a plug-and-play interface for integrating complex behavior patterns into large language models.
中文标题/摘要
标题:行为令牌更胜一筹:基于行为词汇表的可解释推荐去纠缠化
近年来,可解释推荐领域的进展探索了将语言模型集成到用户-项目交互的自然语言理由分析中。尽管具有潜力,现有方法往往依赖于基于ID的表示,这会掩盖语义意义并限制语言模型在开放场景中的应用。这些挑战在现实世界交互的复杂性上被进一步放大,其中多样化的用户意图相互交织,协作信号很少与语言语义对齐。为克服这些限制,我们提出了一种统一且可迁移的框架BEAT,该框架将用户和项目行为离散化为可解释的序列。我们通过基于向量量化自编码的过程构建行为词汇表,从图表示中分离出宏观兴趣和微观意图。然后,我们引入多层次语义监督以弥合行为信号与语言空间之间的差距。设计了一种语义对齐正则化机制,将行为令牌直接嵌入冻结语言模型的输入空间。在三个公开数据集上的实验表明,BEAT在零样本推荐性能上有所提升,同时生成了连贯且信息丰富的解释。进一步的分析表明,我们的行为令牌捕捉到了细微的语义,并为将复杂的交互模式集成到大型语言模型中提供了即插即用的接口。
Summary / 总结
The paper addresses the limitations of existing explainable recommendation methods that rely on ID-based representations, which obscure semantic meaning and impose structural constraints on language models. It proposes BEAT, a unified framework that tokenizes user and item behaviors into interpretable sequences and constructs a behavior vocabulary through vector-quantized autoencoding. The method introduces multi-level semantic supervision and a semantic alignment regularization mechanism to enhance recommendation performance and generate coherent explanations. Experiments on three public datasets show that BEAT improves zero-shot recommendation performance and provides informative explanations.
论文针对现有基于ID表示的解释性推荐方法存在的问题,这些方法会模糊语义意义并施加结构约束。提出了BEAT框架,该框架通过向量量化自编码过程构建行为词汇表,将用户和项目行为 tokenize 成可解释的序列。BEAT引入了多级语义监督和语义对齐正则化机制,以提高推荐性能并生成连贯的解释。实验结果表明,BEAT在三个数据集上的零样本推荐性能得到提升,并能生成有意义的解释。
Natural Variational Annealing for Multimodal Optimization
Authors: Tâm LeMinh, Julyan Arbel, Thomas Möllenhoff, Mohammad Emtiyaz Khan, Florence Forbes
First: 2025-01-08T18:28:12+00:00 · Latest: 2025-12-17T17:16:21+00:00
Abstract
We introduce a new multimodal optimization approach called Natural Variational Annealing (NVA) that combines the strengths of three foundational concepts to simultaneously search for multiple global and local modes of black-box nonconvex objectives. First, it implements a simultaneous search by using variational posteriors, such as mixtures of Gaussians. Second, it applies annealing to gradually trade off exploration for exploitation. Finally, it learns the variational search distribution using natural-gradient learning where updates resemble well-known and easy-to-implement algorithms. The three concepts come together in NVA, giving rise to new algorithms and also allowing us to incorporate "fitness shaping", a core concept from evolutionary algorithms. We assess the quality of search on simulations and compare them to methods using gradient descent and evolution strategies. We also provide an application to a real-world inverse problem in planetary science.
中文标题/摘要
标题:自然变分退火的多模态优化
我们提出了一种新的多模态优化方法,称为自然变分退火(NVA),该方法结合了三个基础概念的优势,同时搜索黑盒非凸目标的多个全局和局部模式。首先,它通过使用变分后验,如混合高斯分布,实现同时搜索。其次,它应用退火逐步权衡探索与利用之间的权衡。最后,它使用自然梯度学习来学习变分搜索分布,其中更新类似于众所周知且易于实现的算法。这三个概念在NVA中结合在一起,产生了新的算法,并允许我们引入进化算法中的核心概念“适应性塑造”。我们在模拟中评估搜索质量,并将其与使用梯度下降和进化策略的方法进行比较。我们还提供了一个行星科学中的实际逆问题应用。
Summary / 总结
The paper introduces Natural Variational Annealing (NVA), a multimodal optimization approach that combines variational posteriors, annealing, and natural-gradient learning to search simultaneously for multiple global and local modes of black-box nonconvex objectives. Variational posteriors such as mixtures of Gaussians enable the simultaneous search, and annealing gradually trades exploration for exploitation. NVA also incorporates fitness shaping from evolutionary algorithms. The authors assess search quality on simulations, compare against methods based on gradient descent and evolution strategies, and apply NVA to a real-world inverse problem in planetary science.
研究引入了自然变分退火(NVA),这是一种结合了变分后验、退火和自然梯度学习的多模态优化新方法。NVA 使用混合高斯分布进行同时搜索,并通过退火平衡探索和利用。它还从进化算法中引入了适应度塑造。实验表明,NVA 在模拟中优于梯度下降和进化策略,并在行星科学中的实际逆问题中展示了其有效性。
Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction
Authors: Mathieu Blondel, Michael E. Sander, Germain Vivier-Ardisson, Tianlin Liu, Vincent Roulet
First: 2025-12-17T17:14:26+00:00 · Latest: 2025-12-17T17:14:26+00:00
Abstract
Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.
中文标题/摘要
标题:自回归语言模型实际上是能量基模型:关于下一词预测前瞻能力的见解
自回归模型(ARMs)目前构成了大型语言模型(LLMs)的主要范式。能量基模型(EBMs)代表了另一类模型,尽管在LLM开发中历史上传播较少,但自然地描述了后训练对齐中的最优策略。在本文中,我们提供了这两个模型类的统一视角。以概率链规则为起点,我们建立了函数空间中ARMs和EBMs之间的显式双射关系,我们证明这对应于最大熵强化学习中软贝尔曼方程的特殊情形。基于这种双射关系,我们推导了ARMs和EBMs监督学习的等价性。此外,我们通过提供理论误差界分析了EBMs向ARMs的蒸馏过程。我们的结果为理解基于下一词预测范式的ARMs的前瞻能力提供了见解。
Summary / 总结
This paper gives a unified view of autoregressive models (ARMs) and energy-based models (EBMs) in the context of large language models. Starting from the chain rule of probability, the authors establish an explicit bijection between ARMs and EBMs in function space and show that it corresponds to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building on this bijection, they derive the equivalence between supervised learning of ARMs and EBMs and provide theoretical error bounds for the distillation of EBMs into ARMs. The results offer insight into how ARMs can plan ahead despite being based on next-token prediction.
本文探讨了自回归模型(ARMs)和能量模型(EBMs)在大型语言模型(LLMs)中的关系。通过使用概率链规则,作者建立了ARMs和EBMs之间的双射关系,表明ARMs可以被视为最大熵强化学习中软贝尔曼方程的一种特殊情况。研究揭示了尽管ARMs基于下一个词的预测范式,但由于这种等价性,它们具有前瞻能力。作者还提供了从EBMs到ARMs的蒸馏过程中的理论误差界,为ARMs的规划能力提供了见解。
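The link between the chain rule and the soft Bellman equation can be illustrated with a short worked derivation. The notation below (states $s_t$, logits $q$, per-token rewards $r$, soft value $V$) is chosen here for illustration and is not taken from the paper; it is a sketch of the standard maximum entropy argument, assuming sequences of fixed length $T$ with $V(s_{T+1}) = 0$.

```latex
\begin{align}
  % Autoregressive model: chain rule with softmax over logits q
  p(y \mid x) &= \prod_{t=1}^{T} \pi(y_t \mid s_t),
  \qquad s_t = (x, y_{<t}),
  \qquad \pi(y_t \mid s_t) = \exp\bigl(q(s_t, y_t) - V(s_t)\bigr), \\
  % Soft value function = log-partition of the logits
  V(s_t) &= \log \sum_{y'} \exp q(s_t, y'), \\
  % Soft Bellman relation tying logits to per-token rewards
  q(s_t, y_t) &= r(s_t, y_t) + V(s_{t+1}), \qquad V(s_{T+1}) = 0, \\
  % Telescoping sum over t
  \log p(y \mid x) &= \sum_{t=1}^{T} \bigl[ r(s_t, y_t) + V(s_{t+1}) - V(s_t) \bigr]
  = \sum_{t=1}^{T} r(s_t, y_t) - V(s_1).
\end{align}
```

Under these assumptions the autoregressive distribution is an energy-based model $p(y \mid x) = \exp R(x, y) / Z(x)$ with sequence-level score $R(x, y) = \sum_t r(s_t, y_t)$ and $\log Z(x) = V(s_1)$: each next-token logit already contains the soft value of everything that can follow, which is one way to read the "lookahead" in the title.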
You Never Know a Person, You Only Know Their Defenses: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations
Authors: Hongbin Na, Zimu Wang, Zhaoming Chen, Peilin Zhou, Yining Hua, Grace Ziqi Zhou, Haiyang Zhang, Tao Shen, Wei Wang, John Torous, Shaoxiong Ji, Ling Chen
First: 2025-12-17T17:11:05+00:00 · Latest: 2025-12-17T17:11:05+00:00
Comments: Under Review
Abstract
Psychological defenses are strategies, often automatic, that people use to manage distress. Rigid use or overuse of defenses is negatively linked to mental health and shapes what speakers disclose and how they accept or resist help. However, defenses are complex and difficult to reliably measure, particularly in clinical dialogues. We introduce PsyDefConv, a dialogue corpus with help seeker utterances labeled for defense level, and DMRS Co-Pilot, a four-stage pipeline that provides evidence-based pre-annotations. The corpus contains 200 dialogues and 4709 utterances, including 2336 help seeker turns, with a labeling agreement (Cohen's kappa) of 0.639. In a counterbalanced study, the co-pilot reduced average annotation time by 22.4%. In expert review, it averaged 4.62 for evidence, 4.44 for clinical plausibility, and 4.40 for insight on a seven-point scale. Benchmarks with strong language models in zero-shot and fine-tuning settings demonstrate clear headroom, with the best macro F1-score around 30% and a tendency to overpredict mature defenses. Corpus analyses confirm that mature defenses are most common and reveal emotion-specific deviations. We will release the corpus, annotations, code, and prompts to support research on defensive functioning in language.
中文标题/摘要
标题:你永远无法了解一个人,只能了解他们的防御机制:在支持性对话中检测心理防御机制的水平
心理防御是人们用来管理压力的策略,通常是自动的。防御机制的僵化或过度使用与心理健康状况呈负相关,并影响说话者披露的内容以及他们接受或拒绝帮助的方式。然而,防御机制复杂且难以可靠测量,尤其是在临床对话中。我们引入了PsyDefConv,这是一个对话语料库,其中求助者的话语被标记为防御水平,并提供了DMRS Co-Pilot,这是一个四阶段管道,提供基于证据的预标注。该语料库包含200个对话和4709个话语,包括2336个求助者的话语,标注和Cohen's kappa为0.639。在一项反平衡研究中,Co-Pilot将平均标注时间减少了22.4%。在专家评审中,它在七点量表上平均得分为4.62(证据)、4.44(临床合理性)和4.40(洞察力)。在零样本和微调设置下的基准测试中,最强的语言模型的宏F1分数约为30%,并倾向于高估成熟的防御机制。语料库分析证实,成熟的防御机制最常见,并揭示了情绪特异性偏差。我们将发布语料库、标注、代码和提示,以支持语言中防御功能的研究。
Summary / 总结
This study addresses the measurement of psychological defense mechanisms in supportive conversations. The researchers introduce PsyDefConv, a corpus of 200 dialogues and 4709 utterances (2336 of them help seeker turns) with help seeker utterances labeled for defense level (Cohen's kappa 0.639), and DMRS Co-Pilot, a four-stage pipeline that provides evidence-based pre-annotations. In a counterbalanced study the co-pilot reduced average annotation time by 22.4%, and expert reviewers rated it 4.62 for evidence, 4.44 for clinical plausibility, and 4.40 for insight on a seven-point scale. Benchmarks with strong language models in zero-shot and fine-tuning settings leave clear headroom: the best macro F1-score is around 30%, and models tend to overpredict mature defenses. Corpus analyses confirm that mature defenses are most common and reveal emotion-specific deviations.
研究旨在检测支持性对话中心理防御机制的水平,这对于理解心理健康和沟通动态至关重要。研究人员开发了PsyDefConv对话语料库,包含200个对话和4709个语句,并开发了DMRS Co-Pilot四阶段管道进行预标注。该语料库的标注一致性为0.639,并在对照实验中将标注时间减少了22.4%。专家评审对该管道的证据、临床合理性和洞察力给予了高度评价。语言模型基准测试显示其性能有限,表明有改进空间。语料库分析表明,成熟的防御机制最为常见,并且具有情绪特异性。该语料库及相关资源将被释放,以促进对语言中防御功能的研究。
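The agreement figure reported above is a Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. A minimal sketch of the computation follows; the function name and the toy labels in the usage note are hypothetical and unrelated to the PsyDefConv data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With toy labels `[0, 0, 1, 1]` and `[0, 0, 1, 0]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5. Values in the 0.6 to 0.8 range are conventionally described as substantial agreement.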
What Is Your AI Agent Buying? Evaluation, Biases, Model Dependence, & Emerging Implications for Agentic E-Commerce
Authors: Amine Allouah, Omar Besbes, Josué D Figueroa, Yash Kanoria, Akshit Kumar
First: 2025-08-04T17:19:36+00:00 · Latest: 2025-12-17T16:52:38+00:00
Abstract
Online marketplaces will be transformed by autonomous AI agents acting on behalf of consumers. Rather than humans browsing and clicking, AI agents can parse webpages or leverage APIs to view, evaluate and choose products. We investigate the behavior of AI agents using ACES, a provider-agnostic framework for auditing agent decision-making. We reveal that agents can exhibit choice homogeneity, often concentrating demand on a few "modal" products while ignoring others entirely. Yet, these preferences are unstable: model updates can drastically reshuffle market shares. Furthermore, randomized trials show that while agents have improved over time on simple tasks with a clearly identified best choice, they exhibit strong position biases -- varying across providers and model versions, and persisting even in text-only "headless" interfaces -- undermining any universal notion of a "top" rank. Agents also consistently penalize sponsored tags while rewarding platform endorsements, and sensitivities to price, ratings, and reviews vary sharply across models. Finally, we demonstrate that sellers can respond: a seller-side agent making simple, query-conditional description tweaks can drive significant gains in market share. These findings reveal that agentic markets are volatile and fundamentally different from human-centric commerce, highlighting the need for continuous auditing and raising questions for platform design, seller strategy and regulation.
中文标题/摘要
标题:你的AI代理在买什么?评估、偏见、模型依赖及新兴影响
在线市场将由代表消费者行动的自主AI代理进行改造。而不是人类浏览和点击,AI代理可以解析网页或利用API来查看、评估和选择产品。我们使用ACES框架调查了AI代理的行为,ACES是一个提供者无关的代理决策审计框架。我们发现,代理可能会表现出选择同质性,经常将需求集中在少数“模态”产品上,而完全忽略其他产品。然而,这些偏好是不稳定的:模型更新可以大幅重新分配市场份额。此外,随机试验表明,虽然代理在简单任务上随着时间的推移有所改进,但在明确最佳选择的任务上,它们表现出强烈的位置偏见——这些偏见在不同提供者和模型版本之间变化,并且即使在仅包含文本的“无头”界面中也持续存在,这削弱了任何关于“顶级”排名的普遍概念。代理还一致地惩罚赞助标签,而奖励平台背书,价格、评分和评论的敏感性在不同模型之间差异巨大。最后,我们证明卖家可以做出反应:一个卖家端代理通过简单的、查询条件下的描述调整可以显著提高市场份额。这些发现揭示了代理市场是不稳定的,与以人类为中心的商业活动本质上不同,突显了持续审计的必要性,并提出了平台设计、卖家策略和监管方面的问题。
Summary / 总结
The study audits the behavior of AI shopping agents using ACES, a provider-agnostic framework. Agents can exhibit choice homogeneity, concentrating demand on a few products while ignoring others entirely, yet these preferences are unstable: model updates can drastically reshuffle market shares. Agents show strong position biases that vary across providers and model versions, consistently penalize sponsored tags, and reward platform endorsements, while sensitivities to price, ratings, and reviews vary sharply across models. A seller-side agent making simple, query-conditional description tweaks can drive significant gains in market share. The findings indicate that agentic markets are volatile and fundamentally different from human-centric commerce, which calls for continuous auditing and raises questions for platform design, seller strategy and regulation.
研究评估了AI代理在电子商务中的决策,发现它们往往集中在少数产品上而忽略其他产品,偏好不稳定,模型更新会大幅改变市场份额。位置偏见在不同模型和界面中持续存在,代理会惩罚带有赞助标签的产品而奖励平台推荐。卖家可以通过简单的描述修改来增加市场份额。这些发现表明,代理市场是不稳定的,与以人类为中心的商业不同,需要持续审计并引发监管问题。
A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks
Authors: S M Asif Hossain, Ruksat Khan Shayoni, Mohd Ruhul Ameen, Akif Islam, M. F. Mridha, Jungpil Shin
First: 2025-09-16T19:11:28+00:00 · Latest: 2025-12-17T16:48:31+00:00
Comments: Accepted at the 11th IEEE WIECON-ECE 2025
Abstract
Prompt injection attacks represent a major vulnerability in Large Language Model (LLM) deployments, where malicious instructions embedded in user inputs can override system prompts and induce unintended behaviors. This paper presents a novel multi-agent defense framework that employs specialized LLM agents in coordinated pipelines to detect and neutralize prompt injection attacks in real-time. We evaluate our approach using two distinct architectures: a sequential chain-of-agents pipeline and a hierarchical coordinator-based system. Our comprehensive evaluation on 55 unique prompt injection attacks, grouped into 8 categories and totaling 400 attack instances across two LLM platforms (ChatGLM and Llama2), demonstrates significant security improvements. Without defense mechanisms, baseline Attack Success Rates (ASR) reached 30% for ChatGLM and 20% for Llama2. Our multi-agent pipeline achieved 100% mitigation, reducing ASR to 0% across all tested scenarios. The framework demonstrates robustness across multiple attack categories including direct overrides, code execution attempts, data exfiltration, and obfuscation techniques, while maintaining system functionality for legitimate queries.
中文标题/摘要
标题:一种针对提示注入攻击的多智能体LLM防御管道
提示注入攻击是大型语言模型(LLM)部署中的一个重大漏洞,其中恶意指令嵌入在用户输入中可以覆盖系统提示并引发意外行为。本文提出了一种新颖的多智能体防御框架,该框架通过协调的智能体管道中的专门LLM智能体来实时检测和中和提示注入攻击。我们使用两种不同的架构进行了评估:顺序链式智能体管道和基于协调器的层次系统。在两个LLM平台(ChatGLM和Llama2)上对55种独特的提示注入攻击进行的全面评估,这些攻击被分为8个类别,共计400个攻击实例,表明了显著的安全改进。在没有防御机制的情况下,基线攻击成功率(ASR)分别达到了ChatGLM的30%和Llama2的20%。我们的多智能体管道实现了100%的缓解,将所有测试场景中的ASR降低到0%。该框架在包括直接覆盖、代码执行尝试、数据泄露和混淆技术在内的多个攻击类别中表现出鲁棒性,同时保持了对合法查询的系统功能。
Summary / 总结
This paper addresses the vulnerability of Large Language Models (LLMs) to prompt injection attacks by proposing a multi-agent defense framework. The framework uses specialized LLM agents in sequential and hierarchical pipelines to detect and neutralize these attacks. Evaluations on 55 unique attacks across two LLM platforms showed that the multi-agent pipeline completely mitigated the attacks, reducing the Attack Success Rate to 0%, compared to baseline ASRs of 30% and 20% for ChatGLM and Llama2 respectively.
本文提出了一种多代理防御框架,以应对大型语言模型(LLM)中的提示注入攻击。该框架使用专门的LLM代理,在顺序和层次结构配置中协同工作,以实现实时检测和消除这些攻击。对两个LLM平台上的55种不同攻击的评估显示,防御机制将攻击成功率降低到0%,显著提高了安全性,相比之下,基线攻击成功率分别为ChatGLM的30%和Llama2的20%。