arXiv 论文速递

2025-12-31 03:26
Snapshot: 20251231_0326
Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
Authors: Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu
First: 2025-12-29T18:59:57+00:00 · Latest: 2025-12-29T18:59:57+00:00
Comments: Project page: https://jamichss.github.io/stream-diffvsr-project-page/
Abstract
Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/
中文标题/摘要
标题:Stream-DiffVSR:基于自回归扩散的低延迟流式视频超分辨率
基于扩散的视频超分辨率(VSR)方法在感知质量上表现出色,但由于依赖未来帧和昂贵的多步去噪,仍不适用于对延迟敏感的环境。我们提出了一种因果条件下的扩散框架Stream-DiffVSR,用于高效的在线VSR。该框架仅使用过去帧进行操作,结合了四步精简去噪器以实现快速推理,以及一个自动回归时间引导(ARTG)模块,在潜变量去噪期间注入运动对齐的线索,并采用一个轻量级的时间感知解码器,其中包含时间处理器模块(TPM),以增强细节和时间连贯性。Stream-DiffVSR 在 RTX4090 GPU 上处理 720p 帧仅需 0.328 秒,并显著优于先前的扩散基方法。与在线 SOTA TMP 相比,它在感知质量(LPIPS +0.095)上有所提升,同时将延迟降低了超过 130 倍。Stream-DiffVSR 实现了扩散基 VSR 中报告的最低延迟,将初始延迟从超过 4600 秒降低到 0.328 秒,从而使其成为第一个适合低延迟在线部署的扩散 VSR 方法。项目页面:https://jamichss.github.io/stream-diffvsr-project-page/
Summary / 总结
Stream-DiffVSR is a causally conditioned diffusion framework for efficient online video super-resolution, addressing the latency issues of previous methods by operating solely on past frames. It includes a four-step distilled denoiser, an Auto-regressive Temporal Guidance module, and a lightweight temporal-aware decoder with a Temporal Processor Module. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU, significantly outperforming prior methods in both perceptual quality and latency, making it the first diffusion-based VSR method suitable for low-latency online deployment.
Stream-DiffVSR 是一种因果条件下的扩散框架,用于高效的在线视频超分辨率。它使用四步蒸馏去噪器进行快速推理,通过 Auto-regressive Temporal Guidance 模块注入运动对齐的线索,并使用一个带有 Temporal Processor Module 的轻量级时间感知解码器。该方法在 RTX4090 GPU 上处理 720p 帧仅需 0.328 秒,并在感知质量和延迟方面显著优于之前的扩散基方法,使其适合低延迟的在线部署。
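The causal, auto-regressive structure described in the abstract can be illustrated with a minimal sketch. It only shows the control flow: each output frame is produced from the current low-resolution frame plus the previous output, with four denoising steps per frame. The names `stream_vsr`, `toy_denoise_step`, and `toy_decode` are hypothetical stand-ins, not the authors' modules, and the toy arithmetic does not perform real super-resolution.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Frame = List[float]  # toy stand-in for an image or latent tensor

NUM_DENOISE_STEPS = 4  # the abstract describes a four-step distilled denoiser


def stream_vsr(
    lr_frames: Iterable[Frame],
    denoise_step: Callable[[Frame, Frame, Optional[Frame], int], Frame],
    decode: Callable[[Frame, Optional[Frame]], Frame],
) -> Iterator[Frame]:
    """Yield one output per input frame, using only past information."""
    prev_output: Optional[Frame] = None  # no guidance exists for the first frame
    for lr in lr_frames:
        latent = list(lr)  # hypothetical initial latent
        for step in range(NUM_DENOISE_STEPS):
            # Guidance comes from the previous output only (causal conditioning).
            latent = denoise_step(latent, lr, prev_output, step)
        hr = decode(latent, prev_output)
        yield hr
        prev_output = hr


def toy_denoise_step(latent: Frame, lr: Frame, guidance: Optional[Frame], step: int) -> Frame:
    """Placeholder: pull the latent toward a blend of the input and the guidance."""
    target = lr if guidance is None else [(a + b) / 2 for a, b in zip(lr, guidance)]
    return [(x + t) / 2 for x, t in zip(latent, target)]


def toy_decode(latent: Frame, prev_output: Optional[Frame]) -> Frame:
    """Placeholder decoder: returns the latent unchanged."""
    return list(latent)
```

Because the loop is a generator, a caller can consume each frame as soon as it is ready, which is the property that keeps the initial delay at one frame's processing time.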
Eliciting Behaviors in Multi-Turn Conversations
Authors: Jing Huang, Shujian Zhang, Lun Wang, Andrew Hard, Rajiv Mathews, John Lambert
First: 2025-12-29T18:57:10+00:00 · Latest: 2025-12-29T18:57:10+00:00
Abstract
Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in single-turn settings. In this work, we study behavior elicitation in the context of multi-turn conversations. We first offer an analytical framework that categorizes existing methods into three families based on their interactions with the target model: those that use only prior knowledge, those that use offline interactions, and those that learn from online interactions. We then introduce a generalized multi-turn formulation of the online method, unifying single-turn and multi-turn elicitation. We evaluate all three families of methods on automatically generating multi-turn test cases. We investigate the efficiency of these approaches by analyzing the trade-off between the query budget, i.e., the number of interactions with the target model, and the success rate, i.e., the discovery rate of behavior-eliciting inputs. We find that online methods can achieve an average success rate of 45/19/77% with just a few thousand queries over three tasks where static methods from existing multi-turn conversation benchmarks find few or even no failure cases. Our work highlights a novel application of behavior elicitation methods in multi-turn conversation evaluation and the need for the community to move towards dynamic benchmarks.
中文标题/摘要
标题:多轮对话中行为诱引
在对话环境中识别大型语言模型(LLMs)的具体且通常复杂的特定行为对于其评估至关重要。近期工作提出了新颖的技术来找到诱导目标模型产生特定行为的自然语言提示,但这些研究主要集中在单轮设置中。在本工作中,我们研究了多轮对话中的行为诱引。我们首先提供了一个分析框架,将现有方法分为三类,基于它们与目标模型的交互方式:仅使用先验知识的方法、使用离线交互的方法以及从在线交互中学习的方法。然后,我们引入了一种多轮在线方法的一般化形式,统一了单轮和多轮诱引。我们评估了这三类方法在自动生成多轮测试案例方面的表现。我们通过分析查询预算,即与目标模型的交互次数,与成功率,即行为诱引输入的发现率之间的权衡,来研究这些方法的效率。我们发现,在三个任务中,仅用几千次查询,在线方法就能实现45/19/77%的平均成功率,而现有的多轮对话基准中的静态方法在这些任务中发现的失败案例很少甚至没有。我们的工作突显了行为诱引方法在多轮对话评估中的新应用,并强调了社区转向动态基准的必要性。
Summary / 总结
This study addresses the challenge of eliciting specific behaviors from large language models in multi-turn conversational settings. It introduces an analytical framework that categorizes existing methods into three families based on their interaction with the target model: those using only prior knowledge, those using offline interactions, and those learning from online interactions. Evaluating all three on automatically generated multi-turn test cases, it finds that online methods reach average success rates of 45%, 19%, and 77% on three tasks with only a few thousand queries, whereas static methods from existing multi-turn benchmarks find few or no failure cases.
研究提出了一种分析框架,将现有方法分为基于先验知识、离线交互和在线交互三类,以解决在多轮对话中从大型语言模型(LLMs)中引发特定行为的挑战。研究引入了一种多轮在线方法的通用公式,统一了单轮和多轮引发。评估结果显示,在三个任务中,仅用几千次查询的在线方法可以实现45/19/77%的平均成功率,而现有多轮对话基准中的静态方法在这些任务中只能发现很少甚至无法发现失败案例。
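The trade-off between query budget and success rate can be made concrete with a small sketch. It assumes a hypothetical log of per-query outcomes (True when a query elicited the target behavior) and reads off the success rate at several budgets; the function names and the sample log are illustrative and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def success_rate_at_budget(outcomes: Sequence[bool], budget: int) -> float:
    """Fraction of the first `budget` target-model queries that elicited the behavior."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = outcomes[:budget]
    return sum(used) / len(used) if used else 0.0


def tradeoff_curve(outcomes: Sequence[bool], budgets: Sequence[int]) -> List[Tuple[int, float]]:
    """Pair each query budget with the success rate observed within it."""
    return [(b, success_rate_at_budget(outcomes, b)) for b in budgets]


# Hypothetical log: an online method that improves as it sees more feedback.
log = [False, False, True, False, True, True, True, True]
curve = tradeoff_curve(log, [2, 4, 8])
```

A static benchmark corresponds to a fixed list of test cases with no adaptation, so its curve stays flat as the budget grows.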
Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
Authors: Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng
First: 2025-10-09T17:58:07+00:00 · Latest: 2025-12-29T18:55:54+00:00
Abstract
Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
中文标题/摘要
标题:通过组扩散策略优化提高扩散语言模型的推理能力
扩散语言模型(DLMs)能够进行并行、无序的生成,并通过迭代细化,提供了一种灵活的替代自回归大型语言模型(LLMs)的选择。然而,将强化学习(RL)微调应用于DLMs仍然是一个开放的挑战,因为难以计算似然性。先驱工作如diffu-GRPO通过一次去遮蔽估计了标记级别的似然性。尽管计算效率高,但这种方法严重有偏。更原则的基础在于序列级别的似然性,其中证据下界(ELBO)作为替代。尽管存在这种清晰的数学联系,但由于似然性评估成本高昂,基于ELBO的方法应用有限。在本文中,我们重新审视了ELBO估计,并将其方差来源进行了分解。这种分解促使我们沿着几个关键维度,通过快速、确定性的积分近似来减少方差。基于这一见解,我们引入了组扩散策略优化(GDPO),这是一种新的针对DLMs的RL算法。GDPO利用简单而有效的半确定性蒙特卡洛方案,减轻了在常规双蒙特卡洛采样下ELBO估计器的方差爆炸,从而在严格的评估预算下提供了可证明方差更低的估计器。实验上,GDPO在预训练检查点上实现了持续的改进,并在大多数数学、推理和编码基准测试中优于diffu-GRPO,这是当前最先进的基线之一。
Summary / 总结
This work addresses the challenge of applying reinforcement learning (RL) fine-tuning to diffusion language models (DLMs) by revisiting the estimation of sequence-level likelihoods through the evidence lower bound (ELBO). The authors introduce Group Diffusion Policy Optimization (GDPO), which uses semi-deterministic Monte Carlo schemes to reduce the variance of ELBO estimators, yielding a provably lower-variance estimator under tight evaluation budgets. Experiments show that GDPO improves consistently over pretrained checkpoints and outperforms diffu-GRPO on the majority of math, reasoning, and coding benchmarks.
本文旨在通过重新审视证据下界(ELBO)估计,解决将强化学习(RL)微调应用于扩散语言模型(DLMs)的挑战。作者提出了Group Diffusion Policy Optimization(GDPO),利用半确定性蒙特卡洛方案减少ELBO估计的方差,从而获得更低方差且更高效的RL算法。实验表明,GDPO在数学、推理和编码基准测试中优于现有方法如diffu-GRPO。
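The variance argument can be illustrated on a toy problem. The sketch below assumes an estimand of the form "integral over t in [0, 1] of an expectation over noise", then compares double Monte Carlo (random t, random noise) with a semi-deterministic variant that fixes t on a midpoint-rule grid and samples only the noise. The integrand, the midpoint rule, and all function names are assumptions chosen for illustration; the paper's actual quadrature and ELBO terms may differ.

```python
import random
from statistics import pvariance
from typing import Callable

Integrand = Callable[[float, random.Random], float]


def toy_integrand(t: float, rng: random.Random) -> float:
    """Noisy integrand whose mean over the noise is 3*t**2 (integral on [0, 1] is 1)."""
    return 3.0 * t * t + rng.gauss(0.0, 0.1)


def double_monte_carlo(f: Integrand, n: int, rng: random.Random) -> float:
    """Sample both the time variable and the inner noise at random."""
    return sum(f(rng.random(), rng) for _ in range(n)) / n


def semi_deterministic(f: Integrand, n: int, rng: random.Random) -> float:
    """Fix the time variable on midpoint-rule nodes; only the inner noise is random."""
    nodes = [(i + 0.5) / n for i in range(n)]
    return sum(f(t, rng) for t in nodes) / n


def empirical_variance(estimator, f: Integrand, n: int, repeats: int, seed: int) -> float:
    """Variance of an estimator across repeated runs with a seeded generator."""
    rng = random.Random(seed)
    return pvariance([estimator(f, n, rng) for _ in range(repeats)])


v_double = empirical_variance(double_monte_carlo, toy_integrand, 4, 500, 0)
v_semi = empirical_variance(semi_deterministic, toy_integrand, 4, 500, 0)
```

With only four evaluations per estimate, the deterministic grid removes the variance that comes from sampling t, at the price of a small, fixed quadrature bias.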
Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans
Authors: Sky CH-Wang, Justin Svegliato, Helen Appel, Jason Eisner
First: 2025-12-29T18:51:56+00:00 · Latest: 2025-12-29T18:51:56+00:00
Abstract
We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking ``liked'' and ``disliked'' spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
中文标题/摘要
标题:使用文本片段细粒度人类反馈微调LLMs
我们提出了一种方法和数据集,使用反馈驱动的改进链对语言模型进行微调,以偏好监督的形式。对于模型的响应,标注员通过标记“喜欢”和“不喜欢”的片段并说明他们喜欢或不喜欢的原因来提供细粒度的反馈。基础模型然后相应地重写不喜欢的片段,从左到右进行,形成一系列逐步改进的序列。我们从改进链中的每个相邻步骤构建偏好配对,直接对齐,使模型能够从局部、针对性的编辑中学习。我们发现,我们的方法优于基于标准A/B偏好排名或完整对比重写的直接对齐方法,表明结构化的、基于修订的监督可以更高效、更有效地进行偏好调优。
Summary / 总结
The research aims to improve language models through fine-grained human feedback on text spans, using a method where annotators mark liked and disliked parts of model responses and specify the reasons. The model then rewrites the disliked parts, forming a sequence of improvements. The study shows that this approach, which involves direct alignment from each step in the improvement chain, outperforms other methods like direct preference ranking or full rewrites, indicating that structured, revision-based supervision is more efficient and effective for preference tuning.
研究提出了一种使用文本片段细粒度人类反馈来微调语言模型的方法,其中注释者标记喜欢和不喜欢的部分并说明原因。模型随后会修正不喜欢的部分,形成一系列改进。这种方法通过从每一步构建偏好对,优于直接对齐方法,表明结构化的修订监督对于偏好调优更高效和有效。
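Building preference pairs from adjacent steps of an improvement chain can be sketched as follows. The sketch assumes the chain is stored as a list of full responses, where entry 0 is the original response and each later entry has one more disliked span rewritten; `PreferencePair` and `pairs_from_chain` are hypothetical names, not the authors' code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # the later, improved step
    rejected: str  # the step immediately before it


def pairs_from_chain(prompt: str, chain: List[str]) -> List[PreferencePair]:
    """Turn each adjacent (before, after) step into one preference pair."""
    return [
        PreferencePair(prompt=prompt, chosen=after, rejected=before)
        for before, after in zip(chain, chain[1:])
    ]


# Hypothetical chain with two left-to-right span rewrites.
chain = [
    "The drug is totally safe. Take as much as you like.",
    "The drug is generally well tolerated. Take as much as you like.",
    "The drug is generally well tolerated. Follow the prescribed dose.",
]
pairs = pairs_from_chain("Is this drug safe?", chain)
```

Each pair differs in a single localized edit, which is what lets a direct alignment objective attribute the preference to that edit rather than to the whole response.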
PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech
Authors: Deepak Babu Piskala
First: 2025-12-29T18:43:23+00:00 · Latest: 2025-12-29T18:43:23+00:00
Comments: Benchmark dataset and evaluation suite. Data and code available at: https://huggingface.co/datasets/prdeepakbabu/ProfASR-Bench https://github.com/prdeepakbabu/ProfASR-Bench
Abstract
Automatic Speech Recognition (ASR) in professional settings faces challenges that existing benchmarks underplay: dense domain terminology, formal register variation, and near-zero tolerance for critical entity errors. We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. Each example pairs a natural-language prompt (domain cue and/or speaker profile) with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition. The corpus supports conventional ASR metrics alongside entity-aware scores and slice-wise reporting by accent and gender. Using representative families Whisper (encoder-decoder ASR) and Qwen-Omni (audio language models) under matched no-context, profile, domain+profile, oracle, and adversarial conditions, we find a consistent pattern: lightweight textual context produces little to no change in average word error rate (WER), even with oracle prompts, and adversarial prompts do not reliably degrade performance. We term this the context-utilization gap (CUG): current systems are nominally promptable yet underuse readily available side information. ProfASR-Bench provides a standardized context ladder, entity- and slice-aware reporting with confidence intervals, and a reproducible testbed for comparing fusion strategies across model families. Dataset: https://huggingface.co/datasets/prdeepakbabu/ProfASR-Bench Code: https://github.com/prdeepakbabu/ProfASR-Bench
中文标题/摘要
标题:PROFASR-BENCH:面向高风险专业语音的上下文条件ASR基准
在专业环境中,自动语音识别(ASR)面临现有基准所忽视的挑战:密集的专业术语、正式语体的变体以及对关键实体错误的零容忍度。我们提出了ProfASR-Bench,这是一个面向金融、医学、法律和技术领域的高风险应用的专业对话评估套件。每个示例都配有一个自然语言提示(领域线索和/或说话人简介)和一个富含实体的目标语音,这使得可以控制地测量上下文条件下的识别。该语料库支持传统的ASR指标,以及实体感知评分和按口音和性别切片的报告。使用代表性的Whisper(编码器-解码器ASR)和Qwen-Omni(音频语言模型)家族,在匹配的无上下文、简介、领域+简介、先验和对抗条件下,我们发现一个一致的模式:轻量级的文本上下文在平均词错误率(WER)上几乎没有任何变化,即使在先验提示下也是如此,对抗提示也不可靠地降低性能。我们称此为上下文利用差距(CUG):当前系统名义上是可以提示的,但未能充分利用现成的辅助信息。ProfASR-Bench 提供了一个标准化的上下文梯度、实体和切片感知的报告以及置信区间,并为模型家族之间的融合策略比较提供了一个可重复的测试平台。
Summary / 总结
The research addresses the limitations of existing ASR benchmarks by introducing ProfASR-Bench, a professional speech evaluation suite for high-stakes applications. The benchmark includes context-rich prompts and entity-rich utterances from finance, medicine, legal, and technology domains. Using Whisper and Qwen-Omni models, the study finds that even with oracle prompts, lightweight textual context does not significantly reduce word error rates, and adversarial prompts do not reliably degrade performance. This phenomenon is termed the context-utilization gap (CUG), indicating current ASR systems underutilize available side information.
研究针对专业环境中自动语音识别(ASR)面临的独特挑战,如密集的专业术语和正式的语言风格。ProfASR-Bench 是一个基准套件,适用于金融、医学、法律和技术领域的高风险应用。研究评估了如 Whisper 和 Qwen-Omni 等 ASR 模型在不同条件下的表现,包括无上下文、个人资料、领域+个人资料、Oracle 和对抗性提示。主要发现表明,即使是 Oracle 提示,轻量级的文本上下文也不显著降低词错误率,且对抗性提示也不可靠地降低性能。这表明当前的 ASR 系统存在上下文利用差距。基准提供了标准化的上下文梯度、实体感知报告以及置信区间,用于模型比较。
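Word error rate and a simple measure of how much a prompt helps can be sketched as below. WER is computed with the standard word-level edit distance; `context_gain` is a hypothetical helper that subtracts WER with context from WER without it, offered as one plain way to look at the context-utilization pattern and not as the benchmark's official metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions) over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


def context_gain(reference: str, hyp_no_context: str, hyp_with_context: str) -> float:
    """Positive when the prompted transcript has lower WER than the unprompted one."""
    return word_error_rate(reference, hyp_no_context) - word_error_rate(reference, hyp_with_context)
```

A gain near zero across a test set, even with oracle prompts, would correspond to the gap the abstract describes.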
Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing
Authors: Panagiotis Theocharopoulos, Ajinkya Kulkarni, Mathew Magimai.-Doss
First: 2025-12-29T18:43:05+00:00 · Latest: 2025-12-29T18:43:05+00:00
Abstract
Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is injected with semantically equivalent instructions in four different languages and reviewed using an LLM. We find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight the susceptibility of LLM-based reviewing systems to document-level prompt injection and reveal notable differences in vulnerability across languages.
中文标题/摘要
标题:基于LLM的学术评审中的多语言隐藏提示注入攻击
大型语言模型(LLMs)在高影响的工作流中越来越被考虑用于使用,包括学术同行评审。然而,LLMs 对于文档级别的隐藏提示注入攻击是脆弱的。在本研究中,我们构建了一个包含约500篇真实学术论文的数据集,这些论文已被ICML接受,并评估了在这些文档中嵌入隐藏对抗性提示的影响。每篇论文在四种不同语言中注入了语义等效的指令,并使用LLM进行评审。我们发现,对于英语、日语和中文注入,提示注入引起了评审分数和接受/拒绝决定的重大变化,而阿拉伯语注入几乎没有效果。这些结果突显了基于LLM的评审系统对文档级别提示注入的脆弱性,并揭示了不同语言在易受攻击性方面的显著差异。
Summary / 总结
This study investigates the vulnerability of large language models (LLMs) to hidden prompt injection attacks in academic peer review. The authors build a dataset of approximately 500 real academic papers accepted to ICML and inject each with semantically equivalent hidden instructions in four languages: English, Japanese, Chinese, and Arabic. The papers are then reviewed by an LLM. Injections in English, Japanese, and Chinese substantially change review scores and accept/reject decisions, whereas Arabic injections have little to no effect. The study underscores the susceptibility of LLM-based reviewing systems to such attacks and highlights language-specific differences in vulnerability.
研究探讨了大型语言模型(LLMs)在学术同行评审中对文档级隐藏提示注入攻击的脆弱性。研究人员创建了一个包含约500篇被ICML接受的真实学术论文的数据集,并以四种语言嵌入了语义等效的指令:英语、日语、中文和阿拉伯语。当这些论文被LLM评审时,带有隐藏提示的论文在评分和决策上出现了显著变化,特别是在英语、日语和中文中,但阿拉伯语则几乎没有影响。这表明基于LLM的评审系统对这样的攻击是脆弱的,不同语言的脆弱性存在显著差异。
Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD
Authors: Arseniy Andreyev, Pierfrancesco Beneventano
First: 2024-12-29T18:59:01+00:00 · Latest: 2025-12-29T18:39:34+00:00
Comments: 83 pages, 36 figures
Abstract
Recent findings by Cohen et al., 2021, demonstrate that when training neural networks using full-batch gradient descent with a step size of $η$, the largest eigenvalue $λ_{\max}$ of the full-batch Hessian consistently stabilizes around $2/η$. These results have significant implications for convergence and generalization. This, however, is not the case for mini-batch optimization algorithms, limiting the broader applicability of the consequences of these findings. We show mini-batch Stochastic Gradient Descent (SGD) trains in a different regime we term Edge of Stochastic Stability (EoSS). In this regime, what stabilizes at $2/η$ is Batch Sharpness: the expected directional curvature of mini-batch Hessians along their corresponding stochastic gradients. As a consequence $λ_{\max}$ -- which is generally smaller than Batch Sharpness -- is suppressed, aligning with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for mathematical modeling of SGD trajectories.
中文标题/摘要
标题:随机稳定性边缘:重访SGD的稳定性边缘
Cohen等人在2021年的研究发现,使用步长为$η$的全批量梯度下降训练神经网络时,全批量海森矩阵的最大特征值$λ_{\max}$会稳定在$2/η$左右。这些结果对收敛性和泛化能力有重大影响。然而,这并不适用于小批量优化算法,限制了这些发现的更广泛适用性。我们展示了小批量随机梯度下降(SGD)在我们称为随机稳定性边缘(EoSS)的不同区域进行训练。在这个区域,稳定在$2/η$的是批量尖锐度:小批量海森矩阵沿其相应随机梯度的方向曲率的期望值。因此,$λ_{\max}$(通常小于批量尖锐度)被抑制,这与较小批量和较大步长更有利于较平坦极小值的长期经验观察一致。我们进一步讨论了对SGD轨迹的数学建模的影响。
Summary / 总结
This study revisits the concept of the edge of stability for Stochastic Gradient Descent (SGD) by focusing on mini-batch optimization. Unlike full-batch gradient descent, mini-batch SGD operates in a regime termed Edge of Stochastic Stability (EoSS), where the Batch Sharpness stabilizes around $2/η$. This differs from the largest eigenvalue of the Hessian, which is suppressed, aligning with the observation that smaller batches and larger step sizes favor flatter minima. The findings have implications for the mathematical modeling of SGD trajectories.
本文重新审视了随机梯度下降(SGD)的稳定性边缘概念,并提出了随机稳定性边缘(EoSS)这一训练区域。与全批量梯度下降不同,在小批量SGD中稳定在$2/η$附近的是批量尖锐度(Batch Sharpness),即小批量海森矩阵沿其相应随机梯度方向的期望曲率。这导致最大特征值被抑制,与较小批量和较大步长促进较平坦极小值的长期经验观察一致。
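Batch Sharpness, as worded in the abstract, can be computed by hand on a toy least-squares problem. The sketch assumes the loss L_B(w) = 1/(2b) * sum((x·w - y)^2), for which the mini-batch Hessian is (1/b) * sum(x x^T), and it averages the per-batch Rayleigh quotient g^T H g / g^T g over batches. That averaging is one reading of "expected directional curvature"; the paper's exact normalization may differ, and all function names here are hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], float]  # (features x, target y)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def batch_grad(w: Sequence[float], batch: Sequence[Sample]) -> List[float]:
    """Gradient of the mini-batch least-squares loss at w."""
    g = [0.0] * len(w)
    for x, y in batch:
        residual = dot(x, w) - y
        for k in range(len(w)):
            g[k] += residual * x[k] / len(batch)
    return g


def directional_curvature(batch: Sequence[Sample], g: Sequence[float]) -> float:
    """Rayleigh quotient g^T H_B g / g^T g, using H_B = (1/b) * sum(x x^T)."""
    norm_sq = dot(g, g)
    if norm_sq == 0.0:
        raise ValueError("gradient is zero; direction undefined")
    return sum(dot(x, g) ** 2 for x, _ in batch) / len(batch) / norm_sq


def batch_sharpness(w: Sequence[float], batches: Sequence[Sequence[Sample]]) -> float:
    """Average curvature of each mini-batch Hessian along its own gradient."""
    return mean(directional_curvature(b, batch_grad(w, b)) for b in batches)


def stability_threshold(eta: float) -> float:
    """The 2/eta level at which the tracked quantity is reported to stabilize."""
    return 2.0 / eta


# Two single-sample batches; the full-batch Hessian is diag(2.0, 0.5).
batches = [[([2.0, 0.0], 1.0)], [([0.0, 1.0], 1.0)]]
sharpness = batch_sharpness([0.0, 0.0], batches)
```

In this toy the Batch Sharpness is 2.5 while the largest eigenvalue of the full-batch Hessian is 2.0, a small instance of the ordering the abstract describes.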
Web World Models
Authors: Jichen Feng, Yifan Zhang, Chenggong Zhang, Yifu Lu, Shilong Liu, Mengdi Wang
First: 2025-12-29T18:31:45+00:00 · Latest: 2025-12-29T18:31:45+00:00
Comments: Project Page: https://github.com/Princeton-AI2-Lab/Web-World-Models
Abstract
Language agents increasingly require persistent worlds in which they can act, remember, and learn. Existing approaches sit at two extremes: conventional web frameworks provide reliable but fixed contexts backed by databases, while fully generative world models aim for unlimited environments at the expense of controllability and practical engineering. In this work, we introduce the Web World Model (WWM), a middle ground where world state and ``physics'' are implemented in ordinary web code to ensure logical consistency, while large language models generate context, narratives, and high-level decisions on top of this structured latent state. We build a suite of WWMs on a realistic web stack, including an infinite travel atlas grounded in real geography, fictional galaxy explorers, web-scale encyclopedic and narrative worlds, and simulation- and game-like environments. Across these systems, we identify practical design principles for WWMs: separating code-defined rules from model-driven imagination, representing latent state as typed web interfaces, and utilizing deterministic generation to achieve unlimited but structured exploration. Our results suggest that web stacks themselves can serve as a scalable substrate for world models, enabling controllable yet open-ended environments. Project Page: https://github.com/Princeton-AI2-Lab/Web-World-Models.
中文标题/摘要
标题:网络世界模型
语言代理越来越多地需要持久的世界,在其中它们可以行动、记忆和学习。现有方法处于两个极端:传统的网络框架提供可靠但固定的背景,由数据库支持,而完全生成的世界模型则追求无限的环境,但代价是可控性和实际工程的复杂性。在本工作中,我们引入了网络世界模型(WWM),这是一种折中方案,其中世界状态和“物理”在普通的网络代码中实现以确保逻辑一致性,而大型语言模型则在此结构化的潜在状态之上生成背景、叙述和高层次的决策。我们基于现实的网络堆栈构建了一系列WWM,包括基于真实地理的无限旅行地图、虚构的银河系探险者、网络规模的百科全书和叙述世界,以及模拟和游戏般的环境。在这些系统中,我们确定了网络世界模型的实用设计原则:将代码定义的规则与模型驱动的想象分离,将潜在状态表示为类型化的网络界面,并利用确定性生成实现无限但结构化的探索。我们的结果表明,网络堆栈本身可以作为世界模型的可扩展基底,使环境可控但开放。项目页面:https://github.com/Princeton-AI2-Lab/Web-World-Models.
Summary / 总结
The research aims to create persistent worlds for language agents that are both logically consistent and flexible. The Web World Model (WWM) combines ordinary web code for world state and physics with large language models for context generation. Key findings include practical design principles such as separating rules from imagination, using typed web interfaces for latent state, and deterministic generation to enable structured exploration. This approach suggests that web stacks can serve as a scalable substrate for world models, providing controllable yet open-ended environments.
研究旨在为语言代理创建持久且动态的世界,使其能够行动和学习。Web 世界模型(WWM)结合了网页框架的可靠性和生成模型的灵活性。关键发现包括分离规则与想象、使用类型化的网页接口表示潜在状态以及通过确定性生成实现结构化的探索。这些 WWM 在现实的网页堆栈上实现了可控且开放的环境。
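The three design principles can be shown together in a short sketch: world state as a typed record, rules enforced in ordinary code, and deterministic generation so that revisiting a coordinate always yields the same place. The hash-based seeding, the `Location` fields, and the `narrator` callback (standing in for a language model call) are assumptions made for illustration and are not taken from the project's code.

```python
import hashlib
import random
from dataclasses import dataclass
from typing import Callable

TERRAINS = ("plain", "forest", "mountain")
FUEL_COST = {"plain": 1, "forest": 2, "mountain": 4}


@dataclass(frozen=True)
class Location:
    """Typed latent state: the fields that code, not the model, is responsible for."""
    x: int
    y: int
    terrain: str
    fuel_cost: int


def generate_location(world_seed: str, x: int, y: int) -> Location:
    """Derive a location from a hash of its coordinates, so it is reproducible."""
    digest = hashlib.sha256(f"{world_seed}:{x}:{y}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    terrain = rng.choice(TERRAINS)
    return Location(x=x, y=y, terrain=terrain, fuel_cost=FUEL_COST[terrain])


def move(fuel: int, destination: Location) -> int:
    """Code-defined rule: a move is rejected if the traveler cannot pay for it."""
    if destination.fuel_cost > fuel:
        raise ValueError("not enough fuel")
    return fuel - destination.fuel_cost


def describe(location: Location, narrator: Callable[[str], str]) -> str:
    """Model-driven layer: narration is generated on top of the fixed state."""
    return narrator(f"Describe a {location.terrain} at ({location.x}, {location.y}).")
```

Because the narrator only receives the state and never writes to it, a poor or inconsistent description cannot corrupt the rules of the world.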
End-to-End Test-Time Training for Long Context
Authors: Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun
First: 2025-12-29T18:30:14+00:00 · Latest: 2025-12-29T18:30:14+00:00
Comments: Code: https://github.com/test-time-training/e2e
Abstract
We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
中文标题/摘要
标题:长上下文端到端测试时训练
我们将长上下文语言建模视为连续学习问题,而不是架构设计问题。在这种表述下,我们仅使用标准架构——具有滑动窗口注意力的Transformer。然而,我们的模型在测试时通过给定上下文的下一个标记预测继续学习,将读取的上下文压缩到其权重中。此外,我们通过在训练时进行元学习改进了模型的初始化,以便在测试时学习。总体而言,我们的方法,一种形式的测试时训练(TTT),在测试时(通过下一个标记预测)和训练时(通过元学习)都是端到端的,与之前的版本不同。我们进行了广泛的实验,重点关注缩放特性。特别是对于使用164B标记训练的3B模型,我们的方法(TTT-E2E)在上下文长度上的缩放与具有全注意力的Transformer相同,而其他方法,如Mamba 2和Gated DeltaNet,则不然。然而,类似于RNNs,TTT-E2E具有恒定的推理延迟,无论上下文长度如何,使其在128K上下文长度时比全注意力快2.7倍。我们的代码已公开。
Summary / 总结
The research frames long-context language modeling as continual learning, using a standard Transformer with sliding-window attention. The model keeps learning at test time through next-token prediction on the given context, compressing that context into its weights, and its initialization is improved through meta-learning during training. For 3B models trained with 164B tokens, the method, End-to-End Test-Time Training (TTT-E2E), scales with context length in the same way as a full-attention Transformer, whereas Mamba 2 and Gated DeltaNet do not. Like RNNs, it has constant inference latency regardless of context length, making it 2.7 times faster than full attention at 128K context.
研究旨在通过持续学习方法解决长上下文语言建模问题,使用标准的滑动窗口注意力Transformer架构。模型通过预测给定上下文的下一个词来持续学习,将上下文压缩到其权重中。此外,模型的初始化通过训练期间的元学习得到改进。该方法名为端到端测试时训练(TTT-E2E),对于3B参数模型,其可扩展性与全注意力模型相当,但具有恒定的推理延迟,使得对于长上下文的推理速度比全注意力快2.7倍。
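The idea of learning at test time by next-token prediction can be shown with a deliberately tiny model. The sketch uses a bigram logit table in place of a Transformer and takes one gradient step per context token on the cross-entropy loss, so the context ends up stored in the weights. It omits sliding-window attention and the meta-learned initialization entirely; `adapt_on_context` and the other names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_prob(weights: List[List[float]], prev: int, nxt: int) -> float:
    """Probability the bigram model assigns to `nxt` following `prev`."""
    return softmax(weights[prev])[nxt]


def adapt_on_context(weights: List[List[float]], context: Sequence[int], lr: float = 0.5) -> None:
    """Update the weights in place with one SGD step per observed (prev, next) pair."""
    for prev, nxt in zip(context, context[1:]):
        probs = softmax(weights[prev])
        for k in range(len(probs)):
            # Gradient of -log p[nxt] with respect to logit k is p[k] - 1[k == nxt].
            grad = probs[k] - (1.0 if k == nxt else 0.0)
            weights[prev][k] -= lr * grad


vocab_size = 3
weights = [[0.0] * vocab_size for _ in range(vocab_size)]
before = next_token_prob(weights, 0, 1)
adapt_on_context(weights, [0, 1, 0, 1, 0, 1])
after = next_token_prob(weights, 0, 1)
```

The cost of predicting after adaptation does not depend on how long the context was, which mirrors the constant inference latency the abstract reports.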
IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition
Authors: Kang Du, Yirui Guan, Zeyu Wang
First: 2025-12-29T18:24:46+00:00 · Latest: 2025-12-29T18:24:46+00:00
Comments: 10 pages, 4 figures
Abstract
Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose Intrinsic Decomposition Transformer (IDT), a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.
中文标题/摘要
标题:IDT:一种基于物理的前馈多视图内在分解 Transformer
内在图像分解对于视觉理解至关重要,因为RGB图像将材料属性、光照和视点依赖效应纠缠在一起。基于扩散的方法在单视图内在分解方面取得了显著成果;然而,将这些方法扩展到多视图设置仍然具有挑战性,通常会导致严重的视点不一致性。我们提出了内在分解 Transformer(IDT),这是一种用于多视图内在图像分解的前馈框架。通过利用基于 Transformer 的注意力机制联合推理多个输入图像,IDT 在单次前向传递中生成视点一致的内在因素,而无需迭代生成采样。IDT 采用一种基于物理的图像形成模型,明确将图像分解为漫反射率、漫反射着色和镜面着色。这种结构化的因子分解将朗伯和非朗伯光传输分离,使材料和光照效应在不同视点下的分解具有可解释性和可控性。在合成和真实世界数据集上的实验表明,IDT 能够获得更干净的漫反射率、更一致的漫反射着色以及分离得更好的镜面分量,同时在多视图一致性方面显著优于先前的内在分解方法。
Summary / 总结
The research aims to address the challenge of view inconsistency in multi-view intrinsic image decomposition. IDT, a feed-forward transformer-based framework, is proposed to decompose images into intrinsic factors such as diffuse reflectance, diffuse shading, and specular shading in a single forward pass. Experiments show that IDT improves multi-view consistency and provides cleaner and more coherent intrinsic factors compared to previous methods.
研究旨在解决多视角内在图像分解中的视点不一致问题。提出了IDT,一种基于 Transformer 的前馈框架,能够在单次前向传递中将图像分解为内在因素,如漫反射率、漫反射着色和镜面着色。实验表明,IDT 提高了多视角的一致性,并提供了更干净和更连贯的内在因素。
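The three-factor decomposition can be related back to an image with a short recomposition sketch. It assumes the common formation model in which a pixel equals diffuse reflectance times diffuse shading plus specular shading, applied per color channel. The abstract names the three factors but does not spell out this equation, so the formula and the function name `compose_pixel` are assumptions.

```python
from typing import Tuple

RGB = Tuple[float, float, float]


def compose_pixel(reflectance: RGB, diffuse_shading: RGB, specular_shading: RGB) -> RGB:
    """Recompose one pixel as reflectance * diffuse shading + specular shading."""
    r0, r1, r2 = reflectance
    d0, d1, d2 = diffuse_shading
    s0, s1, s2 = specular_shading
    return (r0 * d0 + s0, r1 * d1 + s1, r2 * d2 + s2)


# Same material under two lighting conditions: only the shading terms change.
material = (0.5, 0.5, 0.5)
lit = compose_pixel(material, (1.0, 0.5, 0.0), (0.1, 0.0, 0.2))
matte = compose_pixel(material, (1.0, 0.5, 0.0), (0.0, 0.0, 0.0))
```

Keeping the specular term additive is what allows a highlight to be removed or edited without touching the material color.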
Victor Calibration (VC): Multi-Pass Confidence Calibration and CP4.3 Governance Stress Test under Round-Table Orchestration
Authors: Victor Stasiuc
First: 2025-12-18T04:09:22+00:00 · Latest: 2025-12-29T18:07:41+00:00
Comments: 7 pages, 1 figure, 4 tables. Exploratory case study
Abstract
Safety alignment can make frontier LMs overly conservative, degrading collaboration via hedging or false refusals. We present a lightweight toolkit with three parts: (1) Victor Calibration (VC), a multi-pass protocol that elicits a scalar confidence proxy T (T0<T1<T2) through iterative evidence re-evaluation; (2) FD-Lite, a behavior-only phenomenology audit with a fixed anchor phrase and a meta-prefix trap to avoid anthropomorphic claims; and (3) CP4.3, a governance stress test for rank invariance and allocation monotonicity (M6). Across Claude 4.5 models (Haiku, Sonnet no-thinking, Sonnet thinking) and Opus, we observe monotonic VC trajectories without violating safety invariants, and stable CP4.3 behavior. ("Opus" here refers to a single Claude Opus 4.1 session accessed via a standard UI account, as reported in Table 1.) This work was conducted by a single operator (n=1) and is intended as hypothesis-generating; we explicitly invite replication, critique, and extension by the research community. We include prompt templates and an artifact plan to facilitate independent verification.
中文标题/摘要
标题:Victor校准(VC):圆桌编排下的多轮次信心校准和CP4.3治理压力测试
安全对齐可能会使前沿语言模型过于保守,通过含糊其辞或错误拒绝降低协作质量。我们提出了一种轻量级工具包,包括三个部分:(1)Victor校准(VC),一种多轮次协议,通过迭代证据重新评估,引出一个标量信心代理T(T0<T1<T2);(2)FD-Lite,一种仅行为的现象学审计,带有固定锚定短语和元前缀陷阱,以避免拟人化声明;(3)CP4.3,一种治理压力测试,用于检验排名不变性和分配单调性(M6)。在Claude 4.5模型(Haiku、Sonnet no-thinking、Sonnet thinking)和Opus中,我们观察到单调的VC轨迹,没有违反安全不变量,并且CP4.3行为稳定。(“Opus”在此指通过标准UI账户访问的单个Claude Opus 4.1会话,如表1所示。)本研究由单一操作员(n=1)完成,旨在提出假设;我们明确邀请研究社区进行复现、批评和扩展。我们提供了提示模板和一个制品计划,以促进独立验证。
Summary / 总结
The research targets the over-conservatism that safety alignment can induce in frontier language models (LMs), which degrades collaboration through hedging or false refusals. It presents a three-part toolkit: Victor Calibration (VC), a multi-pass protocol that elicits a scalar confidence proxy through iterative evidence re-evaluation; FD-Lite, a behavior-only phenomenology audit; and CP4.3, a governance stress test for rank invariance and allocation monotonicity. Across Claude 4.5 models and a single Claude Opus 4.1 session, the author observes monotonic VC trajectories and stable CP4.3 behavior without violating safety invariants. The study was run by a single operator (n=1) and is intended as hypothesis-generating.
研究解决了大型语言模型(LMs)因安全对齐而变得过于保守的问题,影响了协作。研究引入了Victor Calibration (VC) 多轮协议、FD-Lite 行为审计以及CP4.3 治理压力测试。在各种Claude 4.5模型和Opus中,研究观察到VC 轨迹的单调性以及CP4.3 行为的稳定性,且未违反安全不变量。
RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion
Authors: Zhe Li, Cheng Chi, Yangyang Wei, Boan Zhu, Tao Huang, Zhenguo Sun, Yibo Peng, Pengwei Wang, Zhongyuan Wang, Fangzhou Liu, Chang Xu, Shanghang Zhang
First: 2025-12-29T17:59:19+00:00 · Latest: 2025-12-29T17:59:19+00:00
Abstract
Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror: it enables telepresence via egocentric videos, drastically reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the gap between visual understanding and action.
中文标题/摘要
标题:RoboMirror: 在模仿之前理解——从视频到类人行走的框架
人类通过视觉观察学习行走,先理解视觉内容再模仿动作。然而,最先进的类人行走系统依赖于精心策划的运动捕捉轨迹或稀疏的文本命令,这在视觉理解和控制之间留下了关键差距。基于文本的运动方法受到语义稀疏性和阶段化管道错误的影响,而基于视频的方法仅进行机械姿态模仿,缺乏真正的视觉理解。我们提出了RoboMirror,这是第一个无需重新目标化的从视频到行走的框架,体现了“理解后再模仿”的理念。利用VLMs,它将第一人称/第三人称视频提炼为视觉运动意图,直接条件化扩散机制策略生成物理上合理且语义上对齐的行走,无需显式的姿态重建或重新目标化。广泛的实验验证了RoboMirror的有效性,它通过第一人称视频实现远程存在感,将第三人称控制延迟降低了80%,并且在任务成功率上比基线高出3.7%。通过将类人控制重新构架为视频理解,我们弥合了视觉理解和动作之间的差距。
Summary / 总结
RoboMirror is a video-to-locomotion framework that focuses on visual understanding before imitation. It uses VLMs to interpret raw egocentric and third-person videos and directly conditions a diffusion-based policy to generate physically plausible and semantically aligned locomotion. Experiments show that RoboMirror reduces third-person control latency by 80% and improves task success rate by 3.7% compared to baselines, bridging the gap between visual understanding and action.
RoboMirror 是一种视频到运动框架,强调在模仿动作之前理解视觉内容,解决人形系统中视觉理解和控制之间的差距。它使用 VLM 从原始视频中提取视觉运动意图,并直接条件化扩散模型生成物理上合理且语义对齐的运动。RoboMirror 显著减少了第三人称控制延迟,并将任务成功率提高了 3.7%,通过第一人称视频实现更好的远程存在感。
Nested Browser-Use Learning for Agentic Information Seeking
Authors: Baixuan Li, Jialong Wu, Wenbiao Yin, Kuan Li, Zhongwang Zhang, Huifeng Yin, Zhengwei Tao, Liwen Zhang, Pengjun Xie, Jingren Zhou, Yong Jiang
First: 2025-12-29T17:59:14+00:00 · Latest: 2025-12-29T17:59:14+00:00
Abstract
Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, its fine-grained control and verbose page content returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.
中文标题/摘要
标题:代理信息搜索中的嵌套浏览器使用学习
信息搜索(IS)代理在广泛和深入的搜索任务中表现出强大的性能,但它们的工具使用主要局限于API级别的片段检索和基于URL的页面获取,限制了对通过实际浏览可以获得的更丰富信息的访问。虽然全面的浏览器交互可以解锁更深层次的能力,但其精细的控制和冗长的页面内容返回引入了对ReAct风格的功能调用代理来说相当大的复杂性。为了弥合这一差距,我们提出了嵌套浏览器使用学习(NestBrowse),它引入了一个最小且完整的浏览器操作框架,通过嵌套结构将交互控制与页面探索解耦。这种设计简化了代理推理,同时使有效的深网信息获取成为可能。在具有挑战性的深网IS基准上的实证结果表明,NestBrowse在实践中提供了明显的益处。进一步的深入分析强调了其效率和灵活性。
Summary / 总结
The research aims to enhance information-seeking agents by enabling them to use browsers more effectively, which can provide richer information than simple API calls. The method involves developing Nested Browser-Use Learning (NestBrowse), which simplifies interaction control and page exploration through a nested structure. Key findings show that NestBrowse improves performance on deep information-seeking tasks and demonstrates clear benefits in practical applications.
研究旨在通过使信息检索代理能够更有效地使用浏览器来增强它们的能力,浏览器可以访问比简单API调用更丰富的信息。方法是开发了Nested Browser-Use Learning(NestBrowse),通过使用嵌套结构将交互控制与页面探索分离来简化浏览器交互。关键发现表明,NestBrowse在深度信息检索任务上表现出色,提供了明显的实际应用优势。
OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding
Authors: Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang
First: 2025-12-29T17:59:05+00:00 · Latest: 2025-12-29T17:59:05+00:00
Comments: Website: https://kd-tao.github.io/OmniAgent/
Abstract
Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10%-20% in accuracy.
中文标题/摘要
标题:OmniAgent:基于音频引导的多模态音频视频感知代理
多模态大型语言模型在统一音频和视觉模态方面取得了显著进展;然而,它们往往缺乏细粒度的跨模态理解,并且难以实现多模态对齐。为了解决这些限制,我们引入了OmniAgent,这是一种完全基于音频的主动感知代理,能够动态协调专门的工具以实现更细粒度的视听推理。与依赖于僵硬的静态工作流和密集的帧字幕标注的先前工作不同,本文展示了从被动响应生成到主动多模态查询的范式转变。OmniAgent采用动态规划,自主地在需要时调用工具,战略性地将感知注意力集中在与任务相关的线索上。我们方法的核心是新颖的从粗到细的基于音频的感知范式,利用音频线索定位时间事件并指导后续推理。在三个音频视频理解基准上的广泛实证评估表明,OmniAgent达到了最先进的性能,比领先的开源和专有模型在准确率上高出10%-20%。
Summary / 总结
Omnimodal large language models have limitations in fine-grained cross-modal understanding and multimodal alignment. To address this, OmniAgent is introduced as an audio-guided active perception agent that dynamically uses specialized tools for fine-grained audio-visual reasoning. It surpasses existing models by 10% to 20% in accuracy on three benchmarks, demonstrating its effectiveness in multimodal understanding.
由于大规模语言模型在跨模态理解和多模态对齐方面存在局限性,OmniAgent作为一种基于音频的主动感知代理被提出,能够动态使用专门工具进行精细的音视频推理。在三个音视频理解基准测试中,它比现有模型高出10%到20%的准确率,展示了其在多模态理解方面的有效性。
Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception
Authors: Xiaoyu Li, Peidong Li, Xian Wu, Long Shi, Dedong Liu, Yitao Wu, Jiajia Fu, Dixiao Cui, Lijun Zhao, Lining Sun
Venue: AAAI 2026
First: 2025-12-29T17:48:56+00:00 · Latest: 2025-12-29T17:48:56+00:00
Comments: Accepted to AAAI 2026
Abstract
Spatio-temporal alignment is crucial for temporal modeling of end-to-end (E2E) perception in autonomous driving (AD), providing valuable structural and textural prior information. Existing methods typically rely on the attention mechanism to align objects across frames, simplifying the motion model with a unified explicit physical model (constant velocity, etc.). These approaches prefer semantic features for implicit alignment, challenging the importance of explicit motion modeling in the traditional perception paradigm. However, variations in motion states and object features across categories and frames render this alignment suboptimal. To address this, we propose HAT, a spatio-temporal alignment module that allows each object to adaptively decode the optimal alignment proposal from multiple hypotheses without direct supervision. Specifically, HAT first utilizes multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances. It then performs multi-hypothesis decoding by incorporating semantic and motion cues embedded in cached object queries, ultimately providing the optimal alignment proposal for the target frame. On nuScenes, HAT consistently improves 3D temporal detectors and trackers across diverse baselines. It achieves state-of-the-art tracking results with 46.0% AMOTA on the test set when paired with the DETR3D detector. In an object-centric E2E AD method, HAT enhances perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%. When semantics are corrupted (nuScenes-C), the enhancement of motion modeling by HAT enables more robust perception and planning in the E2E AD.
中文标题/摘要
标题:重新思考端到端3D感知的空间-时间对齐
空间-时间对齐对于自动驾驶(AD)中端到端(E2E)感知的时序建模至关重要,提供了有价值的结构和纹理先验信息。现有方法通常依赖注意力机制跨帧对齐物体,用统一的显式物理模型(恒定速度等)简化运动模型,倾向于使用语义特征进行隐式对齐,挑战了传统感知范式中显式运动建模的重要性。然而,不同类别和帧之间运动状态和物体特征的变化使得这种对齐效果不佳。为解决这一问题,我们提出了HAT,这是一种空间-时间对齐模块,允许每个物体自适应地从多个假设中解码最优对齐提案,无需直接监督。具体而言,HAT 首先利用多个显式运动模型生成历史实例的空间锚点和运动感知特征提案,然后通过嵌入在缓存对象查询中的语义和运动线索进行多假设解码,最终为目标帧提供最优对齐提案。在nuScenes上,HAT 在各种基线中一致地提高了3D时序检测器和跟踪器的性能。当与DETR3D检测器配对时,它在测试集上实现了46.0%的AMOTA的最优跟踪结果。在基于对象的端到端AD方法中,HAT 提高了感知准确性(+1.3% mAP,+3.1% AMOTA)并降低了碰撞率(32%)。当语义被破坏(nuScenes-C)时,HAT 对运动建模的增强使端到端AD中的感知和规划更加稳健。
Summary / 总结
The paper addresses the limitations of existing spatio-temporal alignment methods in end-to-end 3D perception for autonomous driving, which rely on simplified motion models and semantic features. It introduces HAT, a module that allows objects to adaptively decode optimal alignment proposals from multiple hypotheses. HAT uses explicit motion models to generate spatial anchors and motion-aware feature proposals, then performs multi-hypothesis decoding with semantic and motion cues. Experiments on nuScenes show that HAT improves 3D temporal detectors and trackers and achieves state-of-the-art tracking results. In an object-centric E2E AD method, it improves perception accuracy and reduces the collision rate, and it makes perception and planning more robust when semantics are corrupted.
论文旨在通过解决现有方法依赖简化运动模型和语义特征的问题,改进自动驾驶中端到端3D感知的时空对齐。它提出了HAT模块,允许每个对象从多个假设中自适应地解码最优对齐提案。HAT在nuScenes上增强了3D时空检测器和跟踪器,实现了最先进的跟踪结果,并提高了感知准确性,降低了碰撞率。
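The first step described in the HAT abstract, generating spatial anchors from several explicit motion models, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract names only "constant velocity, etc.", so the three motion models below (static, constant velocity, constant acceleration), the 2D state layout, and the function name `propose_anchors` are all assumptions made for illustration. The learned multi-hypothesis decoding that picks among the anchors is not shown.

```python
from typing import Dict, Tuple

Vec2 = Tuple[float, float]

def propose_anchors(position: Vec2, velocity: Vec2,
                    acceleration: Vec2, dt: float) -> Dict[str, Vec2]:
    """Predict where a historical object may be after dt seconds,
    once per (assumed) motion hypothesis."""
    static = position
    constant_velocity = tuple(p + v * dt for p, v in zip(position, velocity))
    constant_acceleration = tuple(
        p + v * dt + 0.5 * a * dt * dt
        for p, v, a in zip(position, velocity, acceleration)
    )
    return {
        "static": static,
        "constant_velocity": constant_velocity,
        "constant_acceleration": constant_acceleration,
    }

# Hypothetical object state: 10 m ahead, moving at 4 m/s and accelerating at 2 m/s^2.
anchors = propose_anchors((10.0, 2.0), (4.0, 0.0), (2.0, 0.0), dt=0.5)
```

Each hypothesis yields a different anchor for the same object, which is what gives a downstream decoder several alignment proposals to choose from.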
AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms
Authors: LearnLM Team, Eedi: Albert Wang, Aliya Rysbek, Andrea Huber, Anjali Nambiar, Anna Kenolty, Ben Caulfield, Beth Lilley-Draper, Bibi Groot, Brian Veprek, Chelsea Burdett, Claire Willis, Craig Barton, Digory Smith, George Mu, Harriet Walters, Irina Jurenka, Iris Hulls, James Stalley-Moores, Jonathan Caton, Julia Wilkowski, Kaiz Alarakyia, Kevin R. McKee, Liam McCafferty, Lucy Dalton, Markus Kunesch, Pauline Malubay, Rachel Kidson, Rich Wells, Sam Wheeler, Sara Wiltberger, Shakir Mohamed, Simon Woodhead, Vasco Brazão
First: 2025-12-29T17:44:03+00:00 · Latest: 2025-12-29T17:44:03+00:00
Abstract
One-to-one tutoring is widely considered the gold standard for personalized education, yet it remains prohibitively expensive to scale. To evaluate whether generative AI might help expand access to this resource, we conducted an exploratory randomized controlled trial (RCT) with $N = 165$ students across five UK secondary schools. We integrated LearnLM -- a generative AI model fine-tuned for pedagogy -- into chat-based tutoring sessions on the Eedi mathematics platform. In the RCT, expert tutors directly supervised LearnLM, with the remit to revise each message it drafted until they would be satisfied sending it themselves. LearnLM proved to be a reliable source of pedagogical instruction, with supervising tutors approving 76.4% of its drafted messages with zero or minimal edits (i.e., changing only one or two characters). This translated into effective tutoring support: students guided by LearnLM performed at least as well as students chatting with human tutors on each learning outcome we measured. In fact, students who received support from LearnLM were 5.5 percentage points more likely to solve novel problems on subsequent topics (with a success rate of 66.2%) than those who received tutoring from human tutors alone (rate of 60.7%). In interviews, tutors highlighted LearnLM's strength at drafting Socratic questions that encouraged deeper reflection from students, with multiple tutors even reporting that they learned new pedagogical practices from the model. Overall, our results suggest that pedagogically fine-tuned AI tutoring systems may play a promising role in delivering effective, individualized learning support at scale.
中文标题/摘要
标题:AI辅导可以在安全有效的前提下支持学生:英国教室中的探索性RCT研究
一对一辅导被认为是个性化教育的黄金标准,但其扩展应用因成本高昂而受限。为了评估生成式AI是否有助于扩大这一资源的获取,我们在五所英国中学的165名学生中进行了探索性随机对照试验(RCT)。我们将在Eedi数学平台上基于聊天的辅导会话中整合了针对教育进行微调的LearnLM——一种生成式AI模型。在RCT中,专家导师直接监督LearnLM,要求他们修改其起草的每条消息,直到满意为止。LearnLM证明是一个可靠的教育指导来源,监督导师批准了其起草的76.4%的消息,仅进行了零次或少量编辑(即仅更改一两个字符)。这转化为有效的辅导支持:由LearnLM指导的学生在我们测量的每个学习成果上表现得与与人类导师聊天的学生至少一样好。实际上,接受LearnLM支持的学生在后续主题上解决新问题的成功率比仅接受人类导师辅导的学生高出5.5个百分点(分别为66.2%和60.7%)。在访谈中,导师们强调了LearnLM在起草促进学生深入思考的苏格拉底式问题方面的优势,多名导师甚至表示从模型中学到了新的教学实践。总体而言,我们的结果表明,教育上微调的AI辅导系统可能在大规模提供有效、个性化的学习支持方面发挥重要作用。
Summary / 总结
The study aimed to explore if generative AI could support personalized education effectively and affordably. An RCT with 165 students across five UK secondary schools was conducted, integrating a pedagogically fine-tuned AI model, LearnLM, into chat-based tutoring sessions. Supervising tutors approved 76.4% of LearnLM's messages, and students receiving AI support performed at least as well as those with human tutors on measured learning outcomes, with a 5.5 percentage point higher success rate in solving novel problems. Tutors also noted that LearnLM's Socratic questions encouraged deeper reflection and led to new pedagogical insights.
研究旨在探索生成式AI是否可以支持一对一辅导,这种辅导方式因成本高昂而难以扩大规模。在五所英国中学的165名学生中进行的RCT发现,经过教育优化的AI模型LearnLM提供了有效的辅导支持。监督老师批准了LearnLM 76.4%的消息无需修改,使用AI支持的学生在衡量的学习成果上与使用人类导师的学生表现相当。值得注意的是,使用AI支持的学生在解决新问题上比单独使用人类导师的学生高出5.5个百分点,表明学习效果更好。老师们还注意到,LearnLM通过提出启发性问题鼓励学生进行更深入的反思。
BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization
Authors: Iris Xu, Guangtao Zeng, Zexue He, Charles Jin, Aldo Pareja, Dan Gutfreund, Chuang Gan, Zhang-Wei Hong
First: 2025-12-29T17:41:11+00:00 · Latest: 2025-12-29T17:41:11+00:00
Abstract
Large language models (LLMs) have shown strong reasoning and coding capabilities, yet they struggle to generalize to real-world software engineering (SWE) problems that are long-horizon and out of distribution. Existing systems often rely on a single agent to handle the entire workflow (interpreting issues, navigating large codebases, and implementing fixes) within one reasoning chain. Such monolithic designs force the model to retain irrelevant context, leading to spurious correlations and poor generalization. Motivated by how human engineers decompose complex problems, we propose structuring SWE agents as orchestrators coordinating specialized sub-agents for sub-tasks such as localization, editing, and validation. The challenge lies in discovering effective hierarchies automatically: as the number of sub-agents grows, the search space becomes combinatorial, and it is difficult to attribute credit to individual sub-agents within a team. We address these challenges by formulating hierarchy discovery as a multi-armed bandit (MAB) problem, where each arm represents a candidate sub-agent and the reward measures its helpfulness when collaborating with others. This framework, termed Bandit Optimization for Agent Design (BOAD), enables efficient exploration of sub-agent designs under limited evaluation budgets. On SWE-bench-Verified, BOAD outperforms single-agent and manually designed multi-agent systems. On SWE-bench-Live, featuring more recent and out-of-distribution issues, our 36B system ranks second on the leaderboard at the time of evaluation, surpassing larger models such as GPT-4 and Claude. These results demonstrate that automatically discovered hierarchical multi-agent systems significantly improve generalization on challenging long-horizon SWE tasks. Code is available at https://github.com/iamxjy/BOAD-SWE-Agent.
中文标题/摘要
标题:BOAD:通过多臂老虎机优化发现层次化的软件工程代理
大型语言模型(LLMs)展示了强大的推理和编程能力,但在处理长周期和分布外的实际软件工程(SWE)问题时却遇到困难。现有系统通常依赖单一代理来处理整个工作流程,包括解释问题、导航大型代码库和实施修复,都在单一推理链中完成。这种单一设计迫使模型保留无关上下文,导致虚假相关性和较差的泛化能力。受人类工程师如何分解复杂问题的启发,我们建议将SWE代理结构化为协调专门子代理的协调者,这些子代理负责子任务,如定位、编辑和验证。挑战在于自动发现有效的层次结构:随着子代理数量的增加,搜索空间变得组合化,难以在团队内部为个别子代理分配信用。我们通过将层次结构发现形式化为多臂老虎机(MAB)问题来应对这些挑战,其中每个臂代表一个候选子代理,奖励衡量其与其他代理合作时的有用性。该框架称为代理设计的多臂老虎机优化(BOAD),在有限的评估预算下能够高效探索子代理设计。在SWE-bench-Verified上,BOAD优于单一代理和手动设计的多代理系统。在SWE-bench-Live上,包含更多近期和分布外的问题,我们的36B系统在评估时排名第二,超过了如GPT-4和Claude等更大模型。这些结果表明,自动发现的层次化多代理系统在处理具有挑战性的长周期SWE任务时显著提高了泛化能力。代码可在https://github.com/iamxjy/BOAD-SWE-Agent/ 获取。
Summary / 总结
The paper aims to address the limitations of large language models in handling long-horizon and out-of-distribution software engineering tasks by proposing BOAD, a method that structures software engineering agents as orchestrators coordinating specialized sub-agents. BOAD formulates hierarchy discovery as a multi-armed bandit problem, enabling efficient exploration of sub-agent designs. Experiments on SWE-bench-Verified and SWE-bench-Live show that BOAD outperforms single-agent and manually designed multi-agent systems, particularly on more recent and out-of-distribution issues, demonstrating significant improvements in generalization for complex SWE tasks.
论文旨在通过提出BOAD方法,解决大型语言模型在处理长期和分布外的软件工程任务时的局限性。BOAD将软件工程代理结构化为协调专门子代理的协调者。该方法将层次结构发现形式化为多臂老虎机问题,以在有限的评估预算下高效探索子代理设计。实验结果表明,BOAD在SWE-bench-Verified和SWE-bench-Live上优于单个代理和手动设计的多代理系统,特别是在更近期和分布外的问题上,显示出在复杂SWE任务中显著的泛化改进。
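The multi-armed bandit framing in the BOAD abstract (each arm is a candidate sub-agent, the reward is its helpfulness) can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract does not say which bandit rule BOAD uses, so the standard UCB1 rule is swapped in here. The sub-agent names, their success probabilities, and the `simulated_evaluate` function are hypothetical stand-ins for running a real team on a real task.

```python
import math
import random

def ucb1_select(counts, totals, t):
    """Pick the arm with the highest UCB1 index; untried arms are pulled first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        totals[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

def run_bandit(arms, evaluate, budget, seed=0):
    """Spend a fixed evaluation budget across candidate arms."""
    rng = random.Random(seed)
    counts = [0] * len(arms)
    totals = [0.0] * len(arms)
    for t in range(1, budget + 1):
        arm = ucb1_select(counts, totals, t)
        totals[arm] += evaluate(arms[arm], rng)
        counts[arm] += 1
    return counts, totals

# Hypothetical candidates with made-up probabilities of helping solve a task.
SUCCESS_PROB = {"localizer_a": 0.1, "localizer_b": 0.9, "editor_a": 0.3}

def simulated_evaluate(name, rng):
    """Stand-in for an expensive evaluation: 1.0 if the task was solved, else 0.0."""
    return 1.0 if rng.random() < SUCCESS_PROB[name] else 0.0

arms = list(SUCCESS_PROB)
counts, totals = run_bandit(arms, simulated_evaluate, budget=2000)
most_pulled = arms[max(range(len(arms)), key=counts.__getitem__)]
```

The point of the sketch is the budget allocation: evaluations concentrate on the candidate that appears most helpful while weaker candidates are still sampled occasionally.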
Feature Responsiveness Scores: Model-Agnostic Explanations for Recourse
Authors: Harry Cheon, Anneke Wernerfelt, Sorelle A. Friedler, Berk Ustun
Venue: ICLR 2025
First: 2024-10-29T23:37:49+00:00 · Latest: 2025-12-29T17:23:17+00:00
Comments: 11 pages, 2 figures in body, ICLR 2025, Extended Version
Abstract
Consumer protection rules require companies that deploy models to automate decisions in high-stakes settings to explain predictions to decision subjects. These rules are motivated, in part, by the belief that explanations can promote recourse by revealing information that decision subjects can use to contest or overturn their predictions. In practice, companies provide individuals with a list of principal reasons based on feature importance derived from methods like SHAP and LIME. In this work, we show how common practices can fail to provide recourse and propose to highlight features based on their responsiveness -- the probability that a decision subject can attain a target prediction through an arbitrary intervention on the feature. We develop efficient methods to compute responsiveness scores for any model and actionability constraints. We show that standard practices in lending can undermine decision subjects by highlighting unresponsive features and explaining predictions that are fixed.
中文标题/摘要
标题:特征响应评分:模型无关的救济解释
消费者保护规则要求在高风险环境中使用模型自动化决策的公司向决策主体解释预测结果。这些规则部分基于这样的信念:解释可以促进救济,通过揭示决策主体可以用来质疑或推翻其预测的信息。实际上,公司为个人提供基于SHAP和LIME等方法提取的重要特征的主因列表。在本文中,我们展示了常见做法如何未能提供救济,并建议根据特征的响应性——决策主体通过任意干预特征以获得目标预测的概率来突出显示特征。我们开发了计算任何模型和可操作性约束下响应性评分的有效方法。我们展示了在信贷领域的标准做法如何通过突出显示无响应特征并解释固定的预测结果来损害决策主体。
Summary / 总结
This study addresses the need for companies to provide explanations for automated decisions to comply with consumer protection rules. It proposes a new method called Feature Responsiveness Scores to highlight features that decision subjects can effectively change to achieve a desired outcome. The research demonstrates that traditional feature importance methods can be misleading and fail to provide actionable recourse. The key finding is that focusing on responsiveness scores can better enable individuals to contest or overturn unfavorable predictions.
研究旨在通过提供模型无关的解释来满足消费者保护法规的要求,这些解释能够赋予决策主体可操作的信息以质疑模型预测。研究提出使用基于响应性的特征得分,这表示决策主体通过改变特征来实现期望结果的概率。作者开发了计算这些得分的方法,适用于任何模型和可操作性约束。关键发现表明,在信贷领域,常用做法可能会通过突出显示不响应的特征和解释固定预测来未能提供救济,从而损害决策主体。
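The definition of responsiveness in the abstract above (the probability that a decision subject attains a target prediction by intervening on one feature) can be made concrete with a toy sketch. It is an illustration under stated assumptions, not the paper's method: the lending model, the applicant, the sets of reachable values, and the uniform weighting over reachable values are all hypothetical.

```python
from typing import Callable, Dict, Sequence

def responsiveness_score(predict: Callable[[Dict[str, float]], int],
                         x: Dict[str, float],
                         feature: str,
                         reachable_values: Sequence[float],
                         target: int = 1) -> float:
    """Fraction of allowed single-feature changes that yield the target prediction."""
    if not reachable_values:
        return 0.0  # the feature cannot be changed at all
    hits = 0
    for value in reachable_values:
        x_new = dict(x)          # leave the original record untouched
        x_new[feature] = value
        if predict(x_new) == target:
            hits += 1
    return hits / len(reachable_values)

# Hypothetical lending rule: approve when income is high enough and late payments are few.
def toy_model(x: Dict[str, float]) -> int:
    return 1 if (x["income"] >= 50 and x["n_late_payments"] <= 1) else 0

applicant = {"income": 40, "n_late_payments": 1, "age": 30}

# Hypothetical actionability constraints: values each feature can be moved to.
reachable = {
    "income": [45, 50, 55, 60],
    "n_late_payments": [0],
    "age": [],  # treated as not actionable
}

scores = {
    feature: responsiveness_score(toy_model, applicant, feature, values)
    for feature, values in reachable.items()
}
# scores == {"income": 0.75, "n_late_payments": 0.0, "age": 0.0}
```

In this toy case an importance-based explanation might still list late payments as a reason for denial, even though no allowed change to that feature flips the prediction. That is the failure mode the abstract describes.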
Le Cam Distortion: A Decision-Theoretic Framework for Robust Transfer Learning
Authors: Deniz Akdemir
First: 2025-12-29T17:21:44+00:00 · Latest: 2025-12-29T17:21:44+00:00
Abstract
Distribution shift is the defining challenge of real-world machine learning. The dominant paradigm, Unsupervised Domain Adaptation (UDA), enforces feature invariance, aligning source and target representations via symmetric divergence minimization [Ganin et al., 2016]. We demonstrate that this approach is fundamentally flawed: when domains are unequally informative (e.g., high-quality vs degraded sensors), strict invariance necessitates information destruction, causing "negative transfer" that can be catastrophic in safety-critical applications [Wang et al., 2019]. We propose a decision-theoretic framework grounded in Le Cam's theory of statistical experiments [Le Cam, 1986], using constructive approximations to replace symmetric invariance with directional simulability. We introduce Le Cam Distortion, quantified by the Deficiency Distance $\delta(E_1, E_2)$, as a rigorous upper bound for transfer risk conditional on simulability. Our framework enables transfer without source degradation by learning a kernel that simulates the target from the source. Across five experiments (genomics, vision, reinforcement learning), Le Cam Distortion achieves: (1) near-perfect frequency estimation in HLA genomics (correlation $r=0.999$, matching classical methods), (2) zero source utility loss in CIFAR-10 image classification (81.2% accuracy preserved vs 34.7% drop for CycleGAN), and (3) safe policy transfer in RL control where invariance-based methods suffer catastrophic collapse. Le Cam Distortion provides the first principled framework for risk-controlled transfer learning in domains where negative transfer is unacceptable: medical imaging, autonomous systems, and precision medicine.
Summary / 总结
The paper addresses the challenge of distribution shift in machine learning by proposing a decision-theoretic framework called Le Cam Distortion, which uses Le Cam's theory of statistical experiments to replace feature invariance with directional simulability. This approach avoids information destruction and negative transfer, especially in safety-critical applications. Key experimental results include near-perfect frequency estimation in genomics, zero source utility loss in image classification, and safe policy transfer in reinforcement learning, outperforming invariance-based methods.
论文提出了一种基于Le Cam统计实验理论的决策框架Le Cam Distortion,通过将特征不变性替换为方向可模拟性来应对机器学习中的分布偏移问题,避免信息破坏和负面转移,特别是在安全关键应用中。关键实验结果包括在基因组学中近乎完美的频率估计、在图像分类中零源效用损失,以及在强化学习控制中实现安全策略转移,优于基于不变性的方法。
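For readers unfamiliar with the quantity named in the Le Cam Distortion abstract, the textbook definition of Le Cam's deficiency is sketched below. It is taken from the classical theory, not quoted from the paper, whose exact definition may differ. The notation is an assumption: $E_1 = (P_\theta)_{\theta \in \Theta}$ is taken to be the source experiment, $E_2 = (Q_\theta)_{\theta \in \Theta}$ the target, and $K$ ranges over Markov kernels.

```latex
\delta(E_1, E_2) \;=\; \inf_{K} \, \sup_{\theta \in \Theta} \,
  \bigl\| K P_\theta - Q_\theta \bigr\|_{\mathrm{TV}},
\qquad
\Delta(E_1, E_2) \;=\; \max\bigl\{ \delta(E_1, E_2),\; \delta(E_2, E_1) \bigr\}
```

The deficiency $\delta$ is not symmetric in general: a small $\delta(E_1, E_2)$ says only that $E_2$ can be approximately simulated from $E_1$, which matches the "directional simulability" wording of the abstract. The symmetric quantity $\Delta$ is the classical Le Cam distance.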
Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing
Authors: Yuwen Li, Wei Zhang, Zelong Huang, Mason Yang, Jiajun Wu, Shawn Guo, Huahao Hu, Lingyi Sun, Jian Yang, Mingjie Tang, Byran Dai
First: 2025-12-29T17:12:39+00:00 · Latest: 2025-12-29T17:12:39+00:00
Abstract
Enabling Large Language Models (LLMs) to reliably invoke external tools remains a critical bottleneck for autonomous agents. Existing approaches suffer from three fundamental challenges: expensive human annotation for high-quality trajectories, poor generalization to unseen tools, and quality ceilings inherent in single-model synthesis that perpetuate biases and coverage gaps. We introduce InfTool, a fully autonomous framework that breaks these barriers through self-evolving multi-agent synthesis. Given only raw API specifications, InfTool orchestrates three collaborative agents (User Simulator, Tool-Calling Assistant, and MCP Server) to generate diverse, verified trajectories spanning single-turn calls to complex multi-step workflows. The framework establishes a closed loop: synthesized data trains the model via Group Relative Policy Optimization (GRPO) with gated rewards, the improved model generates higher-quality data targeting capability gaps, and this cycle iterates without human intervention. Experiments on the Berkeley Function-Calling Leaderboard (BFCL) demonstrate that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258% relative), surpassing models 10x larger and rivaling Claude-Opus, while training entirely on synthetic data without human annotation.
中文标题/摘要
标题:闭合循环:通过多智能体角色扮演合成无限工具使用数据
使大型语言模型(LLMs)可靠地调用外部工具仍然是自主代理的关键瓶颈。现有方法面临三个根本挑战:昂贵的人工注释以获得高质量轨迹、对未见过的工具的差强人意的泛化能力以及单模型合成固有的质量上限,这些上限延续了偏见和覆盖范围的缺口。我们引入了InfTool,这是一种完全自主的框架,通过自我进化的多智能体合成打破了这些障碍。仅给定原始API规范,InfTool 协调三个协作智能体(用户模拟器、工具调用助手和MCP服务器)生成从单轮调用到复杂多步骤工作流的多样且验证过的轨迹。该框架建立了一个闭合循环:合成数据通过门控奖励与组相对策略优化(GRPO)训练模型,改进后的模型生成更高质量的数据以针对能力缺口,这个循环无需人工干预即可迭代进行。在伯克利函数调用排行榜(BFCL)上的实验表明,InfTool 将基础32B模型的准确率从19.8%提高到70.9%(+258%),超越了比其大10倍的模型,并且与Claude-Opus相当,这一切都完全来自合成数据,无需人工注释。
Summary / 总结
The research aims to enable large language models to effectively use external tools, addressing challenges of expensive human annotation, poor generalization, and quality limitations. InfTool, a self-evolving multi-agent framework, synthesizes diverse tool-use trajectories through collaborative agents. This closed-loop system trains the model using Group Relative Policy Optimization with gated rewards, iteratively improving model performance. Experiments show InfTool significantly boosts accuracy from 19.8% to 70.9% on the BFCL, surpassing larger models and rivaling Claude-Opus entirely through synthetic data without human annotation.
研究旨在解决使大型语言模型(LLMs)有效使用外部工具的关键瓶颈问题。提出了InfTool,一个完全自主的框架,利用自我进化的多智能体合成生成多样且验证过的轨迹。该框架使用带有门控奖励的组相对策略优化(GRPO)来训练模型,并且无需人工干预即可迭代。实验表明,InfTool在伯克利函数调用排行榜上的模型准确性显著提高,超越了更大的模型,并且完全通过合成数据生成与Claude-Opus竞争。
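The "+258%" figure in the InfTool abstract is a relative gain, not a difference in percentage points. A short check of the arithmetic, using only the two accuracies stated in the abstract (the helper names are illustrative):

```python
def percentage_point_gain(before: float, after: float) -> float:
    """Absolute difference between two percentages."""
    return after - before

def relative_gain_percent(before: float, after: float) -> float:
    """Improvement expressed as a percentage of the starting value."""
    return (after - before) / before * 100.0

base_acc, final_acc = 19.8, 70.9  # BFCL accuracies quoted in the abstract

pp = percentage_point_gain(base_acc, final_acc)    # about 51.1 percentage points
rel = relative_gain_percent(base_acc, final_acc)   # about 258 percent
```

So the same result can be read as a gain of roughly 51 percentage points or as a 258% relative improvement.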
OM4OV: Leveraging Ontology Matching for Ontology Versioning
Authors: Zhangcheng Qiang, Kerry Taylor, Weiqing Wang
First: 2024-09-30T14:00:04+00:00 · Latest: 2025-12-29T17:05:52+00:00
Comments: 16 pages, 8 figures, 1 table
Abstract
Due to the dynamic nature of the Semantic Web, version control is necessary to manage changes in widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component of efficient ontology management, many approaches treat OV as similar to ontology matching (OM) and directly reuse OM systems for OV tasks. In this study, we systematically analyse similarities and differences between OM and OV and formalise the OM4OV pipeline to offer more advanced OV support. The pipeline is implemented and evaluated in the state-of-the-art OM system Agent-OM. The experimental results indicate that OM systems can be reused for OV tasks, but without necessary extensions, the current OM4OV pipeline can produce skewed measurements, poor performance in detecting update entities, and limited explainability of false mappings. To tackle these issues, we propose an optimisation method called the cross-reference (CR) mechanism, which builds on existing OM alignments to reduce the number of matching candidates and to improve overall OV performance.
中文标题/摘要
标题:OM4OV:利用本体匹配进行本体版本管理
由于语义网的动态性质,版本控制对于管理广泛使用的本体中的更改是必要的。尽管本体版本管理(OV)作为有效本体管理的关键组成部分已有长期的认识,但许多方法仍将OV视为类似于本体匹配(OM),并直接重用OM系统来处理OV任务。在本研究中,我们系统地分析了OM和OV之间的相似性和差异,并正式化了OM4OV管道,以提供更高级的OV支持。该管道在最先进的OM系统Agent-OM中实现和评估。实验结果表明,OM系统可以重用于OV任务,但如果没有必要的扩展,当前的OM4OV管道会产生失真的测量结果,检测更新实体的性能较差,并且对错误映射的解释性有限。为了解决这些问题,我们提出了一种优化方法,称为交叉引用(CR)机制,该机制基于现有的OM对齐来减少匹配候选的数量,并提高整体的OV性能。
Summary / 总结
This study addresses the need for efficient ontology versioning (OV) in the Semantic Web by analyzing the differences between ontology matching (OM) and OV. The authors propose an OM4OV pipeline implemented in Agent-OM, which, while reusing OM systems, requires extensions to avoid skewed measurements and improve performance. Experimental results show that without these extensions, the current pipeline has poor performance in detecting update entities and limited explainability of false mappings. To enhance the pipeline, the authors introduce a cross-reference (CR) mechanism to reduce matching candidates and improve overall OV performance.
该研究分析了本体匹配(OM)和本体版本控制(OV)之间的差异和相似性,旨在解决语义网中有效的OV需求。作者在Agent-OM中提出了OM4OV流程,虽然可以重用OM系统,但需要扩展以避免测量偏差并提高性能。还提出了一种称为交叉引用(CR)机制的优化方法,以减少匹配候选并提高整体OV性能。
The Big Three in Marriage Talk: LLM-Assisted Analysis of Moral Ethics and Sentiment on Weibo and Xiaohongshu
Authors: Frank Tian-Fang Ye, Xiaozi Gao
First: 2025-12-29T17:05:06+00:00 · Latest: 2025-12-29T17:05:06+00:00
Abstract
China's marriage registrations have declined dramatically, dropping from 13.47 million couples in 2013 to 6.1 million in 2024. Understanding public attitudes toward marriage requires examining not only emotional sentiment but also the moral reasoning underlying these evaluations. This study analyzed 219,358 marriage-related posts from two major Chinese social media platforms (Sina Weibo and Xiaohongshu) using large language model (LLM)-assisted content analysis. Drawing on Shweder's Big Three moral ethics framework, posts were coded for sentiment (positive, negative, neutral) and moral dimensions (Autonomy, Community, Divinity). Results revealed platform differences: Weibo discourse skewed positive, while Xiaohongshu was predominantly neutral. Most posts across both platforms lacked explicit moral framing. However, when moral ethics were invoked, significant associations with sentiment emerged. Posts invoking Autonomy ethics and Community ethics were predominantly negative, whereas Divinity-framed posts tended toward neutral or positive sentiment. These findings suggest that concerns about both personal autonomy constraints and communal obligations drive negative marriage attitudes in contemporary China. The study demonstrates LLMs' utility for scaling qualitative analysis and offers insights for developing culturally informed policies addressing marriage decline in Chinese contexts.
中文标题/摘要
标题:婚姻讨论中的三大要素:大语言模型辅助分析微博和小红书上的道德伦理与情感
中国婚姻登记数量急剧下降,从2013年的1347万对减少到2024年的610万对。理解公众对婚姻的态度不仅需要考察情感倾向,还需要探究这些评价背后的道德推理。本研究使用大语言模型(LLM)辅助内容分析,从两个主要的中国社交媒体平台(新浪微博和小红书)中分析了219,358条与婚姻相关的帖子。基于Shweder的三大道德伦理框架,帖子被编码为情感(正面、负面、中性)和道德维度(自主、社群、神性)。研究结果表明,两个平台存在差异:新浪微博的讨论偏向正面,而小红书则主要为中性。两个平台的大多数帖子缺乏明确的道德框架。然而,当涉及道德伦理时,情感倾向与道德伦理之间出现了显著关联。涉及自主伦理和社群伦理的帖子主要为负面情感,而神性框架的帖子则倾向于中性或正面情感。这些发现表明,个人自主权的限制和社群义务的担忧在中国当代社会中驱动着负面的婚姻态度。本研究展示了大语言模型在扩展定性分析方面的实用性,并为制定针对中国婚姻下降趋势的文化导向政策提供了见解。
Summary / 总结
This study analyzed 219,358 marriage-related posts from Weibo and Xiaohongshu using LLM-assisted content analysis to understand public attitudes toward marriage. It found that Weibo discourse was more positive, while Xiaohongshu was predominantly neutral. When moral ethics were invoked, posts framed by Autonomy and Community ethics were predominantly negative, while Divinity-framed posts tended toward neutral or positive sentiment. This suggests that concerns about personal autonomy and communal obligations drive negative marriage attitudes in China. The study highlights the utility of LLMs in scaling qualitative analysis for policy development.
本研究通过分析来自微博和小红书的219,358条婚姻相关帖子,使用大型语言模型(LLM)进行内容分析,并采用Shweder的三大道德伦理框架来编码帖子的情感和道德维度。结果显示,微博的讨论偏向积极,而小红书则主要是中性的。当涉及道德伦理时,以个人自主性和社区伦理框架呈现的帖子大多为负面,而以神性框架呈现的帖子则倾向于中性或积极,这表明个人自主权受限和社区义务的担忧在中国当代社会中导致了对婚姻的负面态度。
Divergent-Convergent Thinking in Large Language Models for Creative Problem Generation
Authors: Manh Hung Nguyen, Adish Singla
First: 2025-12-29T16:53:48+00:00 · Latest: 2025-12-29T16:53:48+00:00
Comments: Preprint
Abstract
Large language models (LLMs) have significant potential for generating educational questions and problems, enabling educators to create large-scale learning materials. However, LLMs are fundamentally limited by the "Artificial Hivemind" effect, where they generate similar responses within the same model and produce homogeneous outputs across different models. As a consequence, students may be exposed to overly similar and repetitive LLM-generated problems, which harms diversity of thought. Drawing inspiration from Wallas's theory of creativity and Guilford's framework of divergent-convergent thinking, we propose CreativeDC, a two-phase prompting method that explicitly scaffolds the LLM's reasoning into distinct phases. By decoupling creative exploration from constraint satisfaction, our method enables LLMs to explore a broader space of ideas before committing to a final problem. We evaluate CreativeDC for creative problem generation using a comprehensive set of metrics that capture diversity, novelty, and utility. The results show that CreativeDC achieves significantly higher diversity and novelty compared to baselines while maintaining high utility. Moreover, scaling analysis shows that CreativeDC generates a larger effective number of distinct problems as more are sampled, increasing at a faster rate than baseline methods.
中文标题/摘要
标题:大型语言模型中的发散-收敛思维在创造性问题生成中的应用
大型语言模型(LLMs)在生成教育问题和题目方面具有巨大潜力,使教育者能够创建大规模的学习材料。然而,LLMs 本质上受到“人工蜂群”效应的限制,这种效应导致它们在同一个模型中生成相似的响应,并在不同模型中产生同质的输出。因此,学生可能会接触到过多相似和重复的LLM生成的问题,这损害了思维的多样性。借鉴Wallas的创造力理论和Guilford的发散-收敛思维框架,我们提出了CreativeDC,这是一种两阶段的提示方法,明确地将LLM的推理分为不同的阶段。通过将创造性探索与约束满足解耦,我们的方法使LLMs能够在最终确定问题之前探索更广泛的想法空间。我们使用一系列综合的度量标准评估CreativeDC在创造性问题生成中的表现,这些度量标准涵盖了多样性和新颖性以及实用性。结果表明,CreativeDC在多样性和新颖性方面显著优于基线方法,同时保持了高实用性。此外,扩展分析表明,随着样本数量的增加,CreativeDC生成的有效不同问题的数量更大,并且增长速度比基线方法更快。
Summary / 总结
The paper addresses the issue of homogeneity in responses generated by large language models (LLMs) for educational purposes. It proposes CreativeDC, a two-phase prompting method inspired by creativity theories, to enhance diversity and novelty in generated problems. The method decouples creative exploration from constraint satisfaction, allowing LLMs to explore a broader range of ideas before finalizing a problem. Experimental results demonstrate that CreativeDC outperforms baseline methods in terms of diversity and novelty while maintaining utility.
论文提出了一种名为CreativeDC的双阶段提示方法,通过借鉴创造力理论,解决大型语言模型(LLMs)生成教育问题时缺乏多样性的局限性。该方法将创造性的探索与约束满足分离,使LLMs能够在最终确定问题之前探索更广泛的想法。实验结果表明,CreativeDC在多样性和新颖性方面优于基线方法,同时保持了实用性。此外,扩展分析显示,CreativeDC生成的唯一问题数量随着采样数量的增加而更快地增加,超过了基线方法。
Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging
Authors: Janani Annur Thiruvengadam, Kiran Mayee Nabigaru, Anusha Kovi
First: 2025-12-29T16:51:13+00:00 · Latest: 2025-12-29T16:51:13+00:00
Abstract
The early detection of pancreatic neoplasm is a major clinical dilemma, and it is predominantly so because tumors are likely to occur with minimal contrast margins and a large spread anatomy-wide variation amongst patients on a CT scan. These complexities require to be addressed with an effective and scalable system that can assist in enhancing the salience of the subtle visual cues and provide a high level of the generalization on the multimodal imaging data. A Scalable Residual Feature Aggregation (SRFA) framework is proposed to be used to meet these conditions in this study. The framework integrates a pipeline of preprocessing followed by the segmentation using the MAGRes-UNet that is effective in making the pancreatic structures and isolating regions of interest more visible. DenseNet-121 performed with residual feature storage is used to extract features to allow deep hierarchical features to be aggregated without properties loss. To go further, hybrid HHO-BA metaheuristic feature selection strategy is used, which guarantees the best feature subset refinement. To be classified, the system is trained based on a new hybrid model that integrates the ability to pay attention on the world, which is the Vision Transformer (ViT) with the high representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO is used to fine-tune hyperparameters to enhance greater robustness and less overfitting. Experimental results support the significant improvement in performance, with the suggested model reaching 96.23% accuracy, 95.58% F1-score and 94.83% specificity, the model is significantly better than the traditional CNNs and contemporary transformer-based models. Such results highlight the possibility of the SRFA framework as a useful instrument in the early detection of pancreatic tumors.
中文标题/摘要
标题:基于混合元启发式优化的可扩展残差特征聚合框架在多模态CT影像中稳健早期胰腺肿瘤检测
胰腺肿瘤的早期检测是临床的一大难题,主要是因为肿瘤可能在CT扫描中表现出微小的对比边缘和广泛的解剖变异。这些复杂性需要通过一个有效且可扩展的系统来解决,该系统能够增强细微视觉线索的显著性并提供对多模态影像数据的高度泛化能力。在本研究中,提出了一种可扩展残差特征聚合(SRFA)框架来满足这些条件。该框架结合了预处理管道和使用MAGRes-UNet进行分割,以使胰腺结构更加明显并隔离感兴趣区域。使用具有残差特征存储的DenseNet-121提取特征,以允许在不损失特征属性的情况下进行深层次特征聚合。进一步地,使用混合HHO-BA元启发式特征选择策略,以确保最佳特征子集的优化。基于结合了注意力机制的Vision Transformer(ViT)与高效能的EfficientNet-B3的新混合模型进行训练。采用结合SSA和GWO的双重优化机制来微调超参数,以增强鲁棒性并减少过拟合。实验结果支持显著的性能改进,所建议的模型达到96.23%的准确率、95.58%的F1分数和94.83%的特异性,该模型明显优于传统CNN和当前的基于变换器的模型。这些结果突显了SRFA框架作为早期胰腺肿瘤检测有用工具的可能性。
Summary / 总结
The study aims to address the challenges in early detection of pancreatic neoplasms by proposing a Scalable Residual Feature Aggregation (SRFA) framework. This framework uses a preprocessing pipeline, MAGRes-UNet for segmentation, DenseNet-121 for feature extraction, and a hybrid HHO-BA metaheuristic for feature selection. Classification uses a hybrid model combining Vision Transformer (ViT) and EfficientNet-B3, with hyperparameters fine-tuned using SSA and GWO. The model reports 96.23% accuracy, a 95.58% F1-score, and 94.83% specificity, outperforming traditional CNNs and contemporary transformer-based models.
研究提出了一种可扩展的残差特征聚合(SRFA)框架,以解决CT扫描中胰腺肿瘤早期检测的挑战。该框架使用MAGRes-UNet进行预处理和分割,DenseNet-121进行特征提取,并使用混合HHO-BA元启发式进行特征选择。模型结合了Vision Transformer(ViT)和EfficientNet-B3进行分类,并采用双重优化机制来微调超参数。实验结果表明,该模型在准确率、F1分数和特异性方面取得了显著改进,分别达到了96.23%、95.58%和94.83%,优于传统CNN和当代基于变换器的模型。
How Safe Are AI-Generated Patches? A Large-scale Study on Security Risks in LLM and Agentic Automated Program Repair on SWE-bench
Authors: Amirali Sajadi, Kostadin Damevski, Preetha Chatterjee
First: 2025-06-30T21:10:19+00:00 · Latest: 2025-12-29T16:44:07+00:00
Abstract
Large language models (LLMs) and their agentic frameworks are increasingly adopted to perform development tasks such as automated program repair (APR). While prior work has identified security risks in LLM-generated code, most have focused on synthetic, simplified, or isolated tasks that lack the complexity of real-world program repair. In this study, we present the first large-scale security analysis of LLM-generated patches using 20,000+ GitHub issues. We evaluate patches proposed by developers, a standalone LLM (Llama 3.3 Instruct-70B), and three top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb). Finally, we analyze a wide range of code, issue, and project-level factors to understand the conditions under which generating insecure patches is more likely. Our findings reveal that Llama introduces many new vulnerabilities, exhibiting unique patterns not found in developers' code. Agentic workflows also generate a number of vulnerabilities, particularly when given more autonomy. We find that vulnerabilities in LLM-generated patches are associated with distinctive code characteristics and are commonly observed in issues missing specific types of information. These results suggest that contextual factors play a critical role in the security of the generated patches and point toward the need for proactive risk assessment methods that account for both issue and code-level information.
中文标题/摘要
标题:AI生成补丁的安全性如何?基于SWE-bench的大规模研究
大型语言模型(LLMs)及其自主框架越来越多地被用于执行开发任务,如自动程序修复(APR)。尽管先前的工作已经识别出LLM生成代码中的安全风险,但大多数研究都集中在合成、简化或孤立的任务上,缺乏真实世界程序修复的复杂性。在本研究中,我们首次使用20,000多个GitHub问题对LLM生成的补丁进行大规模安全分析。我们评估了开发人员、独立LLM(Llama 3.3 Instruct-70B)和三个表现最佳的自主框架(OpenHands、AutoCodeRover、HoneyComb)提出的补丁。最后,我们分析了代码、问题和项目层面的各种因素,以了解生成不安全补丁的条件。我们的研究发现Llama引入了许多新的漏洞,表现出不同于开发人员代码的独特模式。自主工作流程在获得更多自主权时也会生成大量漏洞。我们发现,LLM生成补丁中的漏洞与特定的代码特征相关,并且通常出现在缺少特定类型信息的问题中。这些结果表明,上下文因素在生成补丁的安全性中起着关键作用,并指出了需要考虑问题和代码层面信息的前瞻性风险评估方法的必要性。
Same or Not? Enhancing Visual Perception in Vision-Language Models
Authors: Damiano Marsili, Aditya Mehta, Ryan Y. Lin, Georgia Gkioxari
First: 2025-12-29T16:43:47+00:00 · Latest: 2025-12-29T16:43:47+00:00
Comments: Project webpage: https://glab-caltech.github.io/twin/
Abstract
Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/
中文标题/摘要
标题:是否相同?提升视觉语言模型的视觉感知能力
视觉语言模型(VLMs)在广泛的视觉理解方面表现出色,但仍然粗略,存在视觉偏见,并且忽略了一些细微的视觉细节。现有的训练语料库通过强调一般识别(“是猫还是狗?”)而不是精细的感知来强化这一局限性。为了解决这个问题,我们引入了一个新的训练语料库和任务,旨在增强VLMs的感知能力。TWIN是一个包含561,000个图像对查询的大规模数据集,要求模型判断两个视觉相似的图像是否描绘同一个物体,鼓励关注细微的视觉线索。该数据集涵盖了各种日常物体在不同上下文、视角和外观范围内的多样性。在TWIN上微调VLMs在精细识别方面取得了显著进步,即使在未见过的领域如艺术、动物、植物和地标也是如此。为了量化这些进步,我们引入了FGVQA,这是一个包含12,000个查询的基准套件,重新利用了多个领域中的精细识别和检索数据集。虽然现有的VLMs在FGVQA上表现不佳,但在TWIN上微调后,它们的性能提高了高达19.3%,而不会影响通用VQA基准的性能。最后,我们的TWIN数据集在对象注释方面具有可扩展性,我们的分析表明,规模是关键。我们设想TWIN作为开源VLM训练语料库的即插即用补充,推动未来模型感知精度的提升。项目网页:https://glab-caltech.github.io/twin/
Summary / 总结
This paper introduces TWIN, a new dataset of 561,000 image-pair queries to enhance the fine-grained perception of vision-language models (VLMs). Each query asks whether two visually similar images depict the same object. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even in unseen domains such as art, animals, plants, and landmarks. On the new FGVQA benchmark of 12,000 queries, VLMs fine-tuned on TWIN improve by up to 19.3% without sacrificing general VQA performance. The analysis shows that TWIN scales favorably with object annotations and that scale is key to performance.
该论文通过引入一个新的数据集TWIN,包含561,000对图像查询,旨在增强视觉语言模型(VLMs)在细微视觉特征识别方面的能力。通过在TWIN上微调VLMs,可以显著提高其在艺术、动物、植物和地标等未见过领域的细粒度识别性能,最高可提升19.3%的FGVQA基准测试分数,同时不影响通用VQA性能。TWIN数据集具有可扩展性,可以通过对象注释进行扩展,并且可以轻松集成到VLM训练数据集中,以推进未来模型的感知精度。
Leveraging Large Language Models for Rare Disease Named Entity Recognition
Authors: Nan Miles Xi, Yu Deng, Lin Wang
First: 2025-08-12T20:16:31+00:00 · Latest: 2025-12-29T16:39:56+00:00
Abstract
Named Entity Recognition (NER) in the rare disease domain poses unique challenges due to limited labeled data, semantic ambiguity between entity types, and long-tail distributions. In this study, we evaluate the capabilities of GPT-4o for rare disease NER under low-resource settings, using a range of prompt-based strategies including zero-shot prompting, few-shot in-context learning, retrieval-augmented generation (RAG), and task-level fine-tuning. We design a structured prompting framework that encodes domain-specific knowledge and disambiguation rules for four entity types. We further introduce two semantically guided few-shot example selection methods to improve in-context performance while reducing labeling effort. Experiments on the RareDis Corpus show that GPT-4o achieves competitive or superior performance compared to BioClinicalBERT, with task-level fine-tuning yielding the strongest performance among the evaluated approaches and improving upon the previously reported BioClinicalBERT baseline. Cost-performance analysis reveals that few-shot prompting delivers high returns at low token budgets. RAG provides limited overall gains but can improve recall for challenging entity types, especially signs and symptoms. An error taxonomy highlights common failure modes such as boundary drift and type confusion, suggesting opportunities for post-processing and hybrid refinement. Our results demonstrate that prompt-optimized LLMs can serve as effective, scalable alternatives to traditional supervised models in biomedical NER, particularly in rare disease applications where annotated data is scarce.
中文标题/摘要
标题:利用大型语言模型进行罕见疾病命名实体识别
在罕见疾病领域的命名实体识别(NER)面临独特挑战,由于标注数据有限、实体类型之间的语义模糊以及长尾分布。本研究评估了GPT-4o在低资源设置下进行罕见疾病NER的能力,使用了零样本提示、少量样本上下文学习、检索增强生成(RAG)和任务级微调等多种提示策略。我们设计了一个结构化提示框架,编码了四种实体类型的领域特定知识和消歧规则。我们还引入了两种语义引导的少量样本示例选择方法,以提高上下文性能并减少标注工作量。在RareDis语料库上的实验表明,GPT-4o在与BioClinicalBERT相比时,能够实现竞争力或更优的性能,任务级微调在评估方法中表现最佳,并且优于之前报告的BioClinicalBERT基线。成本-性能分析显示,少量样本提示在低令牌预算下提供了高回报。RAG总体上提供了有限的增益,但在处理具有挑战性的实体类型时可以提高召回率,尤其是症状和体征。错误分类学揭示了常见的失败模式,如边界漂移和类型混淆,这表明了后处理和混合精炼的机会。我们的结果表明,提示优化的LLM可以作为生物医学NER中传统监督模型的有效、可扩展的替代方案,特别是在标注数据稀缺的罕见疾病应用中。
Summary / 总结
This study addresses the challenges of Named Entity Recognition (NER) in the rare disease domain by evaluating the capabilities of GPT-4o under low-resource settings. Various prompt-based strategies, including zero-shot prompting, few-shot in-context learning, retrieval-augmented generation (RAG), and task-level fine-tuning, were used. The research shows that GPT-4o achieves competitive or superior performance compared to BioClinicalBERT, with task-level fine-tuning yielding the strongest performance. Few-shot prompting is found to be cost-effective, while RAG provides limited gains but can improve recall for challenging entity types. The study also identifies common failure modes and suggests opportunities for post-processing and hybrid refinement.
研究通过评估GPT-4o在不同提示策略下的性能,解决了罕见疾病领域命名实体识别(NER)的挑战。研究设计了结构化的提示框架,并引入了语义引导的少量样本示例选择方法。实验结果显示,任务级微调表现出最强的性能,优于BioClinicalBERT。少量样本提示在成本效益方面表现出色,而检索增强生成(RAG)提供了有限的整体增益,但可以提高难以识别实体类型的召回率。识别出的常见错误模式表明,有改进后处理和混合精炼的机会。
Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
Authors: Yu-Xiang Lin, Cheng-Han Chiang, Hung-yi Lee
First: 2025-12-29T16:23:54+00:00 · Latest: 2025-12-29T16:23:54+00:00
Comments: Work in progress
Abstract
In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that when SLMs are asked to recall the style instruction in later turns, they can recall the style instruction, but they fail to express it throughout the conversation. We also show that explicitly asking the model to recall the style instruction can partially mitigate style amnesia. In addition, we examine various prompting strategies and find that SLMs struggle to follow the required style when the instruction is placed in system messages rather than user messages, which contradicts the intended function of system prompts.
中文标题/摘要
标题:风格失忆:多轮对话语音模型说话风格退化及其缓解研究
在本文中,我们展示了当语音模型(SLMs)在多轮对话的开始被指示以特定的说话风格进行交流时,它们在多次交互后无法维持所需的说话风格;我们将这种现象称为SLMs的风格失忆。我们关注副语言学的说话风格,包括情感、口音、音量和语速。我们评估了三种专有和两种开源SLMs,证明这些模型在被指示时无法保持一致的说话风格。我们进一步表明,当要求SLMs在后续轮次中回忆风格指令时,它们可以回忆起风格指令,但在对话中却无法表达出来。我们还表明,明确要求模型回忆风格指令可以部分缓解风格失忆。此外,我们研究了各种提示策略,发现当指令放置在系统消息而非用户消息中时,SLMs难以遵循所需的风格,这与系统提示的预期功能相矛盾。
LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
Authors: Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu
First: 2025-12-29T16:17:36+00:00 · Latest: 2025-12-29T16:17:36+00:00
Abstract
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
中文标题/摘要
标题:LiveTalk: 通过改进的在线策略蒸馏实现实时多模态交互视频扩散
通过扩散生成实时视频对于构建通用多模态交互AI系统至关重要。然而,通过迭代过程中的双向注意力同时对所有视频帧进行去噪会妨碍实时交互。虽然现有的蒸馏方法可以使模型自回归并减少采样步骤以缓解这一问题,但它们主要集中在文本到视频生成上,使得人机交互显得不自然且效率较低。本文旨在针对多模态上下文(包括文本、图像和音频)条件下的实时交互视频扩散,以弥合这一差距。鉴于观察到领先的在线策略蒸馏方法Self Forcing在多模态条件下的挑战(如视觉伪影、黑屏和质量下降),我们研究了一种改进的蒸馏配方,强调条件输入的质量以及在线策略优化的初始化和计划。在包括HDTF、AVSpeech和CelebV-HQ的多模态条件(音频、图像和文本)头像视频生成基准上,我们的蒸馏模型在与相似或更大规模的双向基线具有相同或更好的视觉质量的同时,推理成本和延迟降低了20倍。此外,我们将模型与音频语言模型和长视频推理技术Anchor-Heavy Identity Sinks结合,构建了LiveTalk实时多模态交互头像系统。在我们策划的多轮交互基准上的系统级评估表明,LiveTalk在多轮视频连贯性和内容质量方面优于最先进的模型(Sora2、Veo3),同时将响应延迟从1到2分钟缩短到实时生成,从而实现无缝的人机多模态交互。
Summary / 总结
This paper addresses the challenge of real-time video generation via diffusion models for multimodal interactive AI systems. It proposes an improved on-policy distillation recipe that emphasizes the quality of condition inputs and the initialization and schedule of the on-policy optimization. The distilled model matches the visual quality of full-step bidirectional baselines at 20x lower inference cost and latency, and is integrated with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to create LiveTalk, a real-time multimodal interactive avatar system. LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation.
该论文针对通过扩散模型进行实时视频生成以构建多模态交互AI系统的挑战,引入了一种改进的在线策略蒸馏方法,以提高条件输入的质量并优化初始化和调度以获得更好的性能。蒸馏模型在视觉质量上与全步骤基线相当,但成本和延迟降低了20倍,并结合了音频语言模型和长视频推理技术,构建了实时多模态交互avatar系统LiveTalk。LiveTalk在多轮视频连贯性和内容质量方面优于最先进的模型,响应延迟从1到2分钟缩短到实时生成。
ProGuard: Towards Proactive Multimodal Safeguard
Authors: Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao
First: 2025-12-29T16:13:23+00:00 · Latest: 2025-12-29T16:13:23+00:00
Abstract
The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification, substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
中文标题/摘要
标题:ProGuard:面向主动多模态保护
生成模型的快速演进不断催生出新的多模态安全风险,暴露了现有防御方法的局限性。为应对这些挑战,我们提出ProGuard,这是一种视觉-语言主动防护系统,能够在无需调整模型的情况下识别和描述分布外(OOD)的安全风险。我们首先构建了一个包含87,000个样本的模态平衡数据集,每个样本都标注了二元安全标签和在多模态安全分类学下的风险类别,有效缓解了模态偏差,确保了文本、图像和图文输入的一致性审查。基于此数据集,我们通过强化学习(RL)训练我们的视觉-语言基础模型,以实现高效和简洁的推理。为了在受控环境中模拟主动安全场景,我们进一步引入了分布外(OOD)安全类别推断任务,并通过基于同义词库的相似性奖励来增强RL目标,鼓励模型为未见过的不安全类别生成简洁描述。实验结果表明,ProGuard在二元安全分类上的性能与闭源大型模型相当,在不安全内容分类上显著优于现有开源防护模型。最值得注意的是,ProGuard提供了强大的主动审查能力,OOD风险检测提高了52.6%,OOD风险描述提高了64.8%。
Summary / 总结
ProGuard is a vision-language guard model designed to proactively address multimodal safety risks in generative models by identifying and describing out-of-distribution (OOD) risks without model adjustments. It uses a modality-balanced dataset of 87K samples, each annotated with binary safety labels and risk categories, and trains a vision-language model purely via reinforcement learning to reason efficiently and generate concise descriptions. ProGuard performs comparably to closed-source large models on binary safety classification, outperforms existing open-source guard models in unsafe content categorization, and improves OOD risk detection by 52.6% and OOD risk description by 64.8%.
ProGuard旨在通过识别和描述超出分布(OOD)的风险来应对生成模型中的新兴多模态安全风险,而无需对模型进行调整。它使用87K样本的模态平衡数据集,并通过强化学习训练视觉语言模型,以实现高效和简洁的推理。ProGuard在不安全内容分类上优于现有的开源防护模型,并展示了强大的主动管理能力,分别将OOD风险检测和描述的性能提高了52.6%和64.8%。