Mode Seeking meets Mean Seeking for Fast Long Video Generation
Authors: Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat
First: 2026-02-27T18:59:02+00:00 · Latest: 2026-02-27T18:59:02+00:00
Comments: Project website: https://primecai.github.io/mmm/
Abstract
Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos that learns long-range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher, resulting in a few-step fast long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion and long-range consistency. Project website: https://primecai.github.io/mmm/.
中文标题/摘要
标题:模式搜索与均值搜索结合实现快速长视频生成
将视频生成从秒级扩展到分钟级面临一个关键瓶颈:虽然短视频数据丰富且高保真,但连贯的长视频数据稀缺且局限于狭窄领域。为解决这一问题,我们提出了一种训练范式,即模式搜索与均值搜索相结合,通过统一表示方式下的解耦扩散变换器,将局部保真度与长期连贯性解耦。我们的方法利用一个通过监督学习在长视频上训练的全局流匹配头部来捕捉叙事结构,同时采用一个局部分布匹配头部,通过模式搜索逆KL散度将滑动窗口与冻结的短视频教师对齐。该策略通过监督流匹配学习长距离连贯性和运动,同时通过将学生每个滑动窗口段与冻结的短视频教师对齐继承局部现实性,从而实现快速长视频生成器。评估表明,我们的方法通过同时提高局部清晰度、运动和长距离一致性有效地缩小了保真度-时间差距。
Summary / 总结
The paper addresses the challenge of generating long videos with high fidelity and coherence, which is difficult due to the scarcity of long-form training data. It proposes a training paradigm combining Mode Seeking and Mean Seeking using a Decoupled Diffusion Transformer. The method uses a global Flow Matching head to capture narrative structure and a local Distribution Matching head to align segments with a short-video teacher, enabling the generation of minute-scale videos with improved long-range consistency and local realism. Evaluations show that the proposed method effectively enhances local sharpness, motion, and long-range coherence.
研究旨在通过解决高质量长视频数据稀缺的问题,生成从几秒到几分钟的长视频。提出的Mode Seeking meets Mean Seeking方法使用Decoupled Diffusion Transformer,其中全局Flow Matching头用于捕捉叙事结构,局部Distribution Matching头用于局部现实感。该方法合成了具有更好长范围一致性和局部清晰度的分钟级视频,有效缩小了保真度-时间差距。评估结果显示在运动、长范围一致性和局部清晰度方面有显著改进。
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
Authors: Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan
Venue: ICLR 2026
First: 2026-02-27T18:58:57+00:00 · Latest: 2026-02-27T18:58:57+00:00
Comments: Published as a conference paper at ICLR 2026. 10 pages plus appendix
Abstract
The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially in machine learning modeling tasks. Using DARE-bench training tasks for fine-tuning can substantially improve model performance. For example, supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x. These significant improvements verify the importance of DARE-bench both as an accurate evaluation benchmark and critical training data.
中文标题/摘要
标题:DARE-bench:评估大语言模型在数据科学中的建模和指令忠实度
随着对使用大型语言模型(LLMs)解决复杂多步骤数据科学任务需求的快速增长,准确的基准测试变得迫切需要。现有基准测试存在两个主要差距:(i)缺乏标准化、过程意识的评估,无法捕捉指令遵守和过程忠实度,(ii)缺乏准确标注的训练数据。为弥补这些差距,我们提出了DARE-bench,一个旨在评估机器学习建模和数据科学指令的基准测试。与依赖于人类或模型评判的许多现有基准测试不同,DARE-bench 中的所有任务都有可验证的地面真相,确保了客观和可重复的评估。为了涵盖广泛的任务并支持自主工具,DARE-bench 包含了 6,300 个 Kaggle 派生任务,并提供了大规模的训练数据和评估集。广泛的评估显示,即使是能力很强的模型如 gpt-o4-mini 也难以取得良好的性能,尤其是在机器学习建模任务中。使用 DARE-bench 的训练任务进行微调可以显著提高模型性能。例如,监督微调将 Qwen3-32B 的准确率提高了 1.83 倍,强化学习将 Qwen3-4B 的准确率提高了超过 8 倍。这些显著的改进验证了 DARE-bench 作为准确评估基准和关键训练数据的重要性。
Summary / 总结
DARE-bench is a benchmark designed to evaluate the modeling and instruction fidelity of Large Language Models (LLMs) in data science tasks. It addresses the gaps in existing benchmarks by providing verifiable ground truth and covering a wide range of tasks. The benchmark includes 6,300 Kaggle-derived tasks and supports both training and evaluation. Extensive evaluations show that even highly capable models struggle in data science tasks, but fine-tuning with DARE-bench data significantly improves their performance, demonstrating the benchmark's effectiveness.
DARE-bench 是一个用于评估大型语言模型在数据科学任务中建模和指令忠实度的基准。它通过提供可验证的地面真相并涵盖广泛的任务来弥补现有基准的不足。该基准包括 6,300 个来自 Kaggle 的任务,并支持训练和评估。广泛的评估表明,即使是高度有能力的模型在数据科学任务中也表现不佳,但使用 DARE-bench 数据进行微调可以显著提高其性能,证明了该基准的有效性。
Do LLMs Benefit From Their Own Words?
Authors: Jenny Y. Huang, Leshem Choshen, Ramon Astudillo, Tamara Broderick, Jacob Andreas
First: 2026-02-27T18:58:26+00:00 · Latest: 2026-02-27T18:58:26+00:00
Abstract
Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses. Using in-the-wild, multi-turn conversations, we compare standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, across three open reasoning models and one state-of-the-art model. To our surprise, we find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side history can reduce cumulative context lengths by up to 10x. To explain this result, we find that multi-turn conversations consist of a substantial proportion (36.4%) of self-contained prompts, and that many follow-up prompts provide sufficient instruction to be answered using only the current user turn and prior user turns. When analyzing cases where user-turn-only prompting substantially outperforms full context, we identify instances of context pollution, in which models over-condition on their previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns. Motivated by these findings, we design a context-filtering approach that selectively omits assistant-side context. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.
中文标题/摘要
标题:大语言模型是否受益于自己的话语?
多轮交互中,大型语言模型通常会在对话历史中保留助手自身的先前响应。在本工作中,我们重新审视了这一设计选择,询问大型语言模型是否从条件其自身先前响应中受益。使用真实世界的多轮对话,我们比较了标准(全背景)提示与仅包含用户轮次的提示方法,后者省略了所有先前助手响应,这在三个开放推理模型和一个最先进的模型中进行了比较。令我们惊讶的是,我们发现省略先前助手响应对很大一部分轮次的响应质量没有影响。省略助手方历史可以将累积背景长度减少多达10倍。为了解释这一结果,我们发现多轮对话中包含相当大的比例(36.4%)的自包含提示,并且许多后续提示提供了足够的指令,仅使用当前用户轮次和先前用户轮次即可回答。在分析仅用户轮次提示显著优于全背景的情况下,我们识别出一些背景污染实例,在这些实例中,模型过度依赖其先前响应,引入了错误、幻觉或风格化特征,这些特征在轮次之间传播。受这些发现的启发,我们设计了一种上下文过滤方法,选择性地省略助手方背景。我们的研究结果表明,选择性地省略助手历史可以提高响应质量并减少内存消耗。
Summary / 总结
This study investigates whether large language models benefit from retaining their own past responses in conversation history. By comparing standard prompting with a user-turn-only approach, the research finds that removing prior assistant responses does not significantly affect response quality in many cases. This method can reduce cumulative context lengths by up to 10 times. The study also identifies instances of context pollution and proposes a context-filtering approach to selectively omit assistant-side context, which can improve response quality and reduce memory consumption.
研究探讨了大型语言模型是否从保留其自身过去的响应中受益。通过将标准提示与用户轮次仅提示方法进行比较,研究发现,在许多情况下,移除之前的助手响应不会显著影响响应质量。这种方法可以将累积上下文长度减少多达10倍。研究还识别了上下文污染的情况,并提出了一种上下文过滤方法,以选择性地省略助手侧的上下文,这可以提高响应质量和减少内存消耗。
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
Authors: Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
First: 2026-02-27T18:58:05+00:00 · Latest: 2026-02-27T18:58:05+00:00
Abstract
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering 100\%, 100\%, and 92\% faster rate over torch.compile on KernelBench Level-1, Level-2, and Level-3 splits, outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40\% on the hardest Level-3 setting.
中文标题/摘要
标题:CUDA代理:大规模代理强化学习以生成高性能CUDA内核
GPU内核优化是现代深度学习的基础,但仍然是一个高度专业化且需要深厚硬件知识的任务。尽管在通用编程方面表现出色,大型语言模型(LLMs)在CUDA内核生成方面仍无法与基于编译器的系统(如torch.compile)竞争。现有的CUDA代码生成方法要么依赖于无训练的细化,要么在固定的多轮执行反馈循环中微调模型,但这两种范式都无法从根本上提高模型的CUDA优化能力,导致性能提升有限。我们提出了CUDA代理,这是一种通过三个组件开发CUDA内核专业知识的大规模代理强化学习系统:可扩展的数据合成管道、具有自动化验证和分析的技能增强CUDA开发环境,以提供可靠的奖励信号,以及强化学习算法技术,以实现稳定的训练。CUDA代理在KernelBench上取得了最先进的成果,分别在KernelBench Level-1、Level-2和Level-3分割上比torch.compile快100%,100%和92%,在最难的Level-3设置上,比最强的专有模型Claude Opus 4.5和Gemini 3 Pro高出约40%。
Summary / 总结
The research aims to optimize GPU kernels for deep learning by leveraging agentic reinforcement learning. The method involves a scalable data synthesis pipeline, a skill-augmented development environment with automated verification and profiling, and reinforcement learning techniques. Key findings show that CUDA Agent outperforms torch.compile and proprietary models like Claude Opus 4.5 and Gemini 3 Pro, achieving up to 100% faster performance on KernelBench Level-1 and Level-2, and 92% faster on Level-3.
研究旨在通过强化学习优化GPU内核以提升深度学习性能。方法包括可扩展的数据合成管道、增强技能的开发环境以及自动验证和分析以提供可靠的奖励信号,并采用强化学习技术。关键发现表明,CUDA Agent 在 KernelBench Level-1 和 Level-2 上实现了高达 100% 的性能提升,在 Level-3 上则实现了 92% 的提升,超越了 torch.compile 以及 Claude Opus 4.5 和 Gemini 3 Pro 等专有模型。
Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
Authors: Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Venue: ICLR 2026 Oral
First: 2026-02-27T18:57:06+00:00 · Latest: 2026-02-27T18:57:06+00:00
Comments: Camera-ready version. Accepted as Oral at ICLR 2026
Abstract
Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer's memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA-Pre achieves the highest performance across all model sizes. Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods. Beyond pre-training, we evaluate LoRA-Pre's effectiveness in fine-tuning scenarios. With the same rank, LoRA-Pre consistently outperforms all efficient fine-tuning baselines. Specifically, compared to standard LoRA, LoRA-Pre achieves substantial improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B, validating our approach's effectiveness across both pre-training and fine-tuning paradigms. Our code is publicly available at https://github.com/mrflogs/LoRA-Pre.
中文标题/摘要
标题:驯服动量:通过低秩逼近重新思考优化器状态
现代优化器如Adam和Muon在训练大规模语言模型中至关重要,但它们对一阶和二阶动量的依赖引入了显著的内存开销,这限制了可扩展性和计算效率。在本文中,我们将这些动量中使用的指数移动平均(EMA)重新构想为通过在线梯度流训练线性回归器。基于这种等价性,我们提出了LoRA-Pre,这是一种新型低秩优化器,旨在提高预训练效率。具体而言,LoRA-Pre通过将完整的动量矩阵分解为在线线性学习器中的紧凑低秩子空间,减少了优化器的内存占用,从而保持优化性能的同时提高内存效率。我们通过从Llama架构家族预训练模型,从60M扩展到1B参数,实证验证了LoRA-Pre的有效性。LoRA-Pre在所有模型规模上均表现出最高性能。值得注意的是,LoRA-Pre展示了显著的秩效率,仅使用基线方法1/8的秩即可达到可比或更优的结果。除了预训练,我们还评估了LoRA-Pre在微调场景中的有效性。在相同的秩下,LoRA-Pre始终优于所有高效的微调基线。具体而言,与标准LoRA相比,LoRA-Pre在Llama-3.1-8B上实现了3.14点的显著改进,在Llama-2-7B上实现了6.17点的改进,验证了我们方法在预训练和微调范式中的有效性。我们的代码可在https://github.com/mrflogs/LoRA-Pre上公开获取。
Summary / 总结
This work addresses the memory overhead of modern optimizers like Adam and Muon, which are crucial for training large language models. By reinterpreting the exponential moving average (EMA) as a linear regressor, the authors propose LoRA-Pre, a low-rank optimizer that reduces memory usage while maintaining optimization performance. Experiments show that LoRA-Pre outperforms baseline methods across various model sizes, particularly in fine-tuning, where it achieves significant improvements over standard methods.
该研究针对现代优化器如Adam和Muon在训练大型语言模型时带来的内存开销问题。通过将指数移动平均(EMA)重新解释为线性回归器,作者提出了LoRA-Pre,这是一种低秩优化器,能够在减少内存使用的同时保持优化性能。实验表明,LoRA-Pre在各种模型规模下均优于基线方法,特别是在微调场景中,它在标准方法上实现了显著的性能提升。
Memory Caching: RNNs with Growing Memory
Authors: Ali Behrouz, Zeman Li, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
First: 2026-02-27T18:53:41+00:00 · Latest: 2026-02-27T18:53:41+00:00
Abstract
Transformers have been established as the de-facto backbones for most recent advances in sequence modeling, mainly due to their growing memory capacity that scales with the context length. While plausible for retrieval tasks, it causes quadratic complexity and so has motivated recent studies to explore viable subquadratic recurrent alternatives. Despite showing promising preliminary results in diverse domains, such recurrent architectures underperform Transformers in recall-intensive tasks, often attributed to their fixed-size memory. In this paper, we introduce Memory Caching (MC), a simple yet effective technique that enhances recurrent models by caching checkpoints of their memory states (a.k.a. hidden states). Memory Caching allows the effective memory capacity of RNNs to grow with sequence length, offering a flexible trade-off that interpolates between the fixed memory (i.e., $O(L)$ complexity) of RNNs and the growing memory (i.e., $O(L^2)$ complexity) of Transformers. We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules. Our experimental results on language modeling, and long-context understanding tasks show that MC enhances the performance of recurrent models, supporting its effectiveness. The results of in-context recall tasks indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and performs better than state-of-the-art recurrent models.
中文标题/摘要
标题:记忆缓存:具有增长记忆的RNN
变换器已成为大多数近期序列建模进展的事实基础,主要是由于它们的记忆容量随着上下文长度增长。虽然对于检索任务来说是合理的,但这导致了二次复杂度,因此促使了近期研究探索可行的次二次递归替代方案。尽管在不同领域显示出有希望的初步结果,但这些递归架构在需要大量召回的任务中表现不如变换器,通常归因于它们固定大小的记忆。在本文中,我们引入了记忆缓存(MC),这是一种简单而有效的技术,通过缓存递归模型的记忆状态检查点(即隐藏状态)来增强递归模型。记忆缓存允许RNN的记忆容量随着序列长度增长,提供了一种在固定记忆(即$O(L)$复杂度)的RNN和增长记忆(即$O(L^2)$复杂度)的变换器之间进行灵活权衡的插值。我们提出了MC的四种变体,包括门控聚合和稀疏选择机制,并讨论了它们对线性和深层记忆模块的影响。我们在语言建模和长上下文理解任务上的实验结果表明,MC增强了递归模型的性能,支持了其有效性。上下文召回任务的结果表明,虽然变换器在准确率上最佳,但我们的MC变体表现出竞争力,缩小了与变换器的差距,并优于最先进的递归模型。
Summary / 总结
This paper introduces Memory Caching (MC), a technique that enhances the performance of recurrent neural networks (RNNs) by caching memory states, allowing the effective memory capacity to grow with sequence length. This method provides a flexible trade-off between the fixed memory of RNNs and the growing memory of Transformers. Experimental results on language modeling and long-context understanding tasks show that MC improves the performance of RNNs, closing the gap with Transformers in in-context recall tasks and outperforming state-of-the-art recurrent models.
本文提出了Memory Caching (MC) 技术,通过缓存记忆状态来增强 RNN,使其有效记忆容量随着序列长度增长。该方法在固定记忆的 RNN 和增长记忆的 Transformer 之间提供了灵活的权衡。实验结果表明,MC 改进了 RNN 的性能,其变体在上下文召回任务中与 Transformer 的性能相当,甚至更优。
Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment
Authors: Dake Zhang, Mark D. Smucker, Charles L. A. Clarke
First: 2026-02-27T18:49:31+00:00 · Latest: 2026-02-27T18:49:31+00:00
Abstract
Many readers today struggle to assess the trustworthiness of online news because reliable reporting coexists with misinformation. The TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) Track provided a venue for researchers to develop and evaluate assistive RAG systems that support readers' news trustworthiness assessment by producing reader-oriented, well-attributed reports. As the organizers of the DRAGUN track, we describe the resources that we have newly developed to allow for the reuse of the track's tasks. The track had two tasks: (Task 1) Question Generation, producing 10 ranked investigative questions; and (Task 2, the main task) Report Generation, producing a 250-word report grounded in the MS MARCO V2.1 Segmented Corpus. As part of the track's evaluation, we had TREC assessors create importance-weighted rubrics of questions with expected short answers for 30 different news articles. These rubrics represent the information that assessors believe is important for readers to assess an article's trustworthiness. The assessors then used their rubrics to manually judge the participating teams' submitted runs. To make these tasks and their rubrics reusable, we have created an automated process to judge runs not part of the original assessing. We show that our AutoJudge ranks existing runs well compared to the TREC human-assessed evaluation (Kendall's $τ= 0.678$ for Task 1 and $τ= 0.872$ for Task 2). These resources enable both the evaluation of RAG systems for assistive news trustworthiness assessment and, with the human evaluation as a benchmark, research on improving automated RAG evaluation.
中文标题/摘要
标题:辅助阅读新闻可信度评估的自动化评估资源
如今许多读者难以评估在线新闻的可信度,因为可靠报道与虚假信息共存。TREC 2025 DRAGUN(检测、检索和增强生成以理解新闻)赛道为研究人员提供了开发和评估支持读者新闻可信度评估的辅助RAG系统的平台。作为DRAGUN赛道的组织者,我们描述了新开发的资源,以便重新利用赛道的任务。赛道有两个任务:(任务1)问题生成,生成10个排名的问题;(任务2,主要任务)报告生成,基于MS MARCO V2.1分段语料库生成250字的报告。作为赛道评估的一部分,我们让TREC评估员为30篇不同新闻文章创建了加权问题清单,其中包含预期的简短答案。这些清单代表了评估员认为读者评估文章可信度时需要的重要信息。评估员随后使用他们的清单手动评估参赛团队提交的运行结果。为了使这些任务及其清单可重用,我们创建了一个自动评估过程,用于评估不属于原始评估的运行结果。我们展示了我们的AutoJudge在任务1(Kendall's τ=0.678)和任务2(Kendall's τ=0.872)中对现有运行结果的评估与TREC的人工评估相比表现良好。这些资源不仅可用于评估辅助新闻可信度评估的RAG系统,还可以作为基准,推动改进自动化RAG评估的研究。
Summary / 总结
This paper describes resources developed for evaluating assistive RAG systems that help readers assess news trustworthiness. The TREC 2025 DRAGUN track included two tasks: question generation and report generation. Assessors created importance-weighted rubrics for 30 news articles to evaluate the trustworthiness of reports. An automated process, AutoJudge, was developed to rank runs, showing good correlation with human assessments (Kendall's τ=0.678 for Task 1 and τ=0.872 for Task 2).
本文描述了用于评估辅助RAG系统以帮助读者评估新闻可信度的资源。评估涉及两项任务:生成调查性问题和生成基于MS MARCO V2.1分段语料库的报告。评估人员为30篇新闻文章创建了重要性加权的评判标准来判断报告的可信度。开发了一个自动化过程AutoJudge来排名运行,对于任务1和任务2分别达到了Kendall's τ值0.678和0.872,与人工评估相当。
Vibe Researching as Wolf Coming: Can AI Agents with Skills Replace or Augment Social Scientists?
Authors: Yongjun Zhang
First: 2026-02-25T20:52:14+00:00 · Latest: 2026-02-27T18:49:27+00:00
Comments: Commentary
Abstract
AI agents -- systems that execute multi-step reasoning workflows with persistent state, tool access, and specialist skills -- represent a qualitative shift from prior automation technologies in social science. Unlike chatbots that respond to isolated queries, AI agents can now read files, run code, query databases, search the web, and invoke domain-specific skills to execute entire research pipelines autonomously. This paper introduces the concept of vibe researching -- the AI-era parallel to vibe coding (Karpathy, 2025) -- and uses scholar-skill, a 23-skill plugin for Claude Code covering the full research pipeline from idea to submission, as an illustrative case. I develop a cognitive task framework that classifies research activities along two dimensions -- codifiability and tacit knowledge requirement -- to identify a delegation boundary that is cognitive, not sequential: it cuts through every stage of the research pipeline, not between stages. I argue that AI agents excel at speed, coverage, and methodological scaffolding but struggle with theoretical originality and tacit field knowledge. The paper concludes with an analysis of three implications for the profession -- augmentation with fragile conditions, stratification risk, and a pedagogical crisis -- and proposes five principles for responsible vibe researching.
中文标题/摘要
标题:以狼的姿态进行研究:具备技能的AI代理能否替代或增强社会科学家?
AI代理——能够执行多步推理工作流、保持状态、访问工具并具备专业技能的系统——代表了社会科学研究中从先前自动化技术到质的飞跃。与仅能响应孤立查询的聊天机器人不同,AI代理现在可以读取文件、运行代码、查询数据库、搜索网络,并调用特定领域的技能以自主执行整个研究管道。本文介绍了vibe研究的概念——AI时代的vibe编码的平行概念(Karpathy, 2025),并使用涵盖从想法到提交整个研究管道的23项技能插件scholar-skill(Claude Code)作为示例。我开发了一种认知任务框架,根据可编码性和隐性知识需求两个维度对研究活动进行分类,以确定一个认知边界,而不是顺序边界:它贯穿研究管道的每一个阶段,而不是在阶段之间。我主张,AI代理在速度、覆盖面和方法论支撑方面表现出色,但在理论原创性和隐性领域知识方面存在困难。文章最后分析了对职业的三个影响——脆弱条件下的增强、分层风险以及教学危机,并提出了五项负责任vibe研究的原则。
Summary / 总结
This paper explores the potential of AI agents to replace or augment social scientists by introducing the concept of vibe researching. It uses a 23-skill plugin for Claude Code to execute the full research pipeline autonomously and classifies research activities into codifiability and tacit knowledge requirement. The study finds that AI agents excel in speed, coverage, and methodological support but lack theoretical originality and tacit field knowledge. It concludes with implications for the profession and proposes principles for responsible AI use in research.
本文探讨了AI代理取代或辅助社会科学家的可能性,通过引入vibe researching的概念,并使用Claude Code的23技能插件自动执行整个研究流程。研究将研究活动分类为可编码性和隐性知识需求,并发现AI代理在速度、覆盖面和方法论支持方面表现出色,但在理论原创性和隐性领域知识方面存在不足。文章还提出了职业影响和负责任的AI研究原则。
Hierarchical Action Learning for Weakly-Supervised Action Segmentation
Authors: Junxian Huang, Ruichu Cai, Hao Zhu, Juntao Fang, Boyan Xu, Weilin Chen, Zijian Li, Shenghua Gao
Venue: CVPR2026
First: 2026-02-27T18:48:22+00:00 · Latest: 2026-02-27T18:48:22+00:00
Abstract
Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that lower-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (\textbf{HAL}) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action governs the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The \textbf{HAL} model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the \textbf{HAL} model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.
中文标题/摘要
标题:层次化行动学习在弱监督行动分割中的应用
人类通过关键转换来感知行动,这些转换在多个抽象层次上结构化行动,而机器依赖于视觉特征,往往会过度分割。这突显了在视频理解中实现层次化推理的难度。有趣的是,我们观察到低级视觉和高级行动潜在变量以不同的速率演变,低级视觉变量变化迅速,而高级行动变量演变较慢,使其更容易识别。基于这一洞察,我们提出了层次化行动学习(\textbf{HAL})模型,用于弱监督行动分割。我们的方法引入了一个层次化的因果数据生成过程,其中高级潜在行动控制低级视觉特征的动力学。为了有效建模这些不同的时间尺度,我们引入了确定性过程来在时间上对齐这些潜在变量。\textbf{HAL}模型使用层次化金字塔变换器来捕获视觉特征和潜在变量,并应用稀疏转换约束来强制执行高级行动变量的较慢动力学。这种机制增强了这些潜在变量随时间的识别。在温和的假设下,我们证明这些潜在行动变量是严格可识别的。在几个基准上的实验结果表明,\textbf{HAL}模型在弱监督行动分割中显著优于现有方法,证实了其在实际应用中的有效性。
Summary / 总结
The paper addresses the challenge of hierarchical reasoning in video understanding by proposing the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. HAL models the hierarchical causal data generation process where high-level latent action variables govern low-level visual features, and uses a hierarchical pyramid transformer to capture both visual features and latent variables. The model introduces a sparse transition constraint to enforce the slower dynamics of high-level action variables, making latent variables more identifiable. Experiments on benchmarks demonstrate that HAL significantly outperforms existing methods in weakly-supervised action segmentation.
研究旨在解决在视频理解中实现层次化推理的挑战,特别是弱监督动作分割。提出的层次化动作学习(HAL)模型引入了一个层次化的因果数据生成过程,其中高层的潜在动作变量控制低层视觉特征的动力学。该模型使用层次化金字塔变换器和稀疏过渡约束来有效建模不同时间尺度,并增强潜在变量的识别。实验结果表明,HAL 显著优于现有方法,证实了其在实际应用中的有效性。
A Minimal Agent for Automated Theorem Proving
Authors: Borja Requena Pozo, Austin Letson, Krystian Nowakowski, Izan Beltran Ferreiro, Leopoldo Sarra
First: 2026-02-27T18:43:47+00:00 · Latest: 2026-02-27T18:43:47+00:00
Abstract
We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate our baseline using qualitatively different benchmarks and compare various popular models and design choices, and demonstrate competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture. Our results demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.
中文标题/摘要
标题:一种用于自动定理证明的最小代理
我们提出了一种最小代理基线,以系统地比较不同基于AI的定理证明器架构。该设计实现了当前最先进的系统共有的核心功能:迭代证明细化、库搜索和上下文管理。我们使用不同类型的基准测试评估了我们的基线,并比较了各种流行的模型和设计选择,展示了与最先进的方法相比具有竞争力的性能,同时使用了显著更简单的架构。我们的结果表明,迭代方法在样本效率和成本效益方面相对于多次单次生成具有一致的优势。该实现已开源,作为未来研究的候选参考以及社区的可访问证明器。
Summary / 总结
The paper introduces a minimal agent for automated theorem proving to facilitate systematic comparisons among different AI-based theorem prover architectures. The design includes core features like iterative proof refinement, library search, and context management. Experiments show competitive performance with state-of-the-art approaches while using a simpler architecture, highlighting the benefits of an iterative approach in terms of sample efficiency and cost effectiveness.
论文提出了一种最小化代理的自动定理证明方法,以促进不同AI定理证明器架构之间的系统比较。该方法实现了迭代证明精炼、库搜索和上下文管理等核心功能。跨不同基准的评估显示,尽管架构简单,但该设计在样本效率和成本效益方面与最先进的方法竞争,特别是迭代方法在多个场景中优于单次生成方法。
Unsupervised Representation Learning for 3D Mesh Parameterization with Semantic and Visibility Objectives
Authors: AmirHossein Zamani, Bruno Roy, Arianna Rampini
First: 2025-09-29T17:28:58+00:00 · Latest: 2026-02-27T18:42:50+00:00
Abstract
Recent 3D generative models produce high-quality textures for 3D mesh objects. However, they commonly rely on the heavy assumption that input 3D meshes are accompanied by manual mesh parameterization (UV mapping), a manual task that requires both technical precision and artistic judgment. Industry surveys show that this process often accounts for a significant share of asset creation, creating a major bottleneck for 3D content creators. Moreover, existing automatic methods often ignore two perceptually important criteria: (1) semantic awareness (UV charts should align semantically similar 3D parts across shapes) and (2) visibility awareness (cutting seams should lie in regions unlikely to be seen). To overcome these shortcomings and to automate the mesh parameterization process, we present an unsupervised differentiable framework that augments standard geometry-preserving UV learning with semantic- and visibility-aware objectives. For semantic-awareness, our pipeline (i) segments the mesh into semantic 3D parts, (ii) applies an unsupervised learned per-part UV-parameterization backbone, and (iii) aggregates per-part charts into a unified UV atlas. For visibility-awareness, we use ambient occlusion (AO) as an exposure proxy and back-propagate a soft differentiable AO-weighted seam objective to steer cutting seams toward occluded regions. By conducting qualitative and quantitative evaluations against state-of-the-art methods, we show that the proposed method produces UV atlases that better support texture generation and reduce perceptible seam artifacts compared to recent baselines. Our implementation code is publicly available at: https://github.com/AHHHZ975/Semantic-Visibility-UV-Param.
中文标题/摘要
标题:基于语义和视见性的3D网格参数化无监督表示学习
近期的3D生成模型能够为3D网格对象生成高质量的纹理。然而,它们通常依赖于一个沉重的假设,即输入的3D网格必须伴随有手动的网格参数化(UV映射),这是一个需要技术精确性和艺术判断的繁琐手动任务。行业调查显示,这一过程往往占用了资产创建的很大一部分,成为3D内容创作者的主要瓶颈。此外,现有的自动方法往往忽略了两个重要的感知标准:(1)语义意识(UV图应使形状中的语义相似3D部分对齐)和(2)视见意识(切割缝应位于不太可能被看到的区域)。为了克服这些不足并自动化网格参数化过程,我们提出了一种无监督可微分框架,该框架在标准的几何保持UV学习中增加了语义和视见意识的目标。对于语义意识,我们的流水线(i)将网格分割成语义3D部分,(ii)应用一个无监督学习的每部分UV参数化骨干网络,(iii)将每部分图整合成一个统一的UV图集。对于视见意识,我们使用环境遮挡(AO)作为曝光代理,并通过反向传播一个软可微分的AO加权切割缝目标来引导切割缝朝向被遮挡的区域。通过与最新方法进行定性和定量评估,我们展示了所提出的方法生成的UV图集在支持纹理生成和减少可感知的切割缝伪影方面优于最近的基线方法。我们的实现代码可在以下链接公开获取:https://github.com/AHHHZ975/Semantic-Visibility-UV-Param.
Summary / 总结
This paper addresses the challenge of automatic 3D mesh parameterization by introducing an unsupervised framework that incorporates semantic and visibility objectives. The method segments the mesh into semantic parts, applies an unsupervised UV parameterization for each part, and then aggregates them into a unified atlas. Additionally, it uses ambient occlusion to guide seam placement towards occluded regions. Experimental results demonstrate that the proposed method generates UV atlases with better texture support and reduced perceptible seam artifacts compared to existing approaches.
论文解决了3D内容创作中手动网格参数化的问题,这是一个耗时的任务,需要技术和艺术判断。它提出了一种无监督框架,结合了语义和视见性目标来自动化这一过程。该方法将网格分割为语义部分,为每个部分应用学习的UV参数化,并使用环境遮挡来引导切口位置。实验结果表明,所提出的方法生成的UV图集在纹理支持方面更好,并且比现有方法具有更少的可感知切口伪影。
Knowledge-Guided Machine Learning: Illustrating the use of Explainable Boosting Machines to Identify Overshooting Tops in Satellite Imagery
Authors: Nathan Mitchell, Lander Ver Hoef, Imme Ebert-Uphoff, Kristina Moen, Kyle Hilburn, Yoonjin Lee, Emily J. King
First: 2025-07-02T16:34:50+00:00 · Latest: 2026-02-27T18:32:00+00:00
Comments: 48 pages, 18 figures
Abstract
Machine learning (ML) algorithms have emerged in many meteorological applications. However, these algorithms struggle to extrapolate beyond the data they were trained on, i.e., they may adopt faulty strategies that lead to catastrophic failures. These failures are difficult to predict due to the opaque nature of ML algorithms. In high-stakes applications, such as severe weather forecasting, is is crucial to avoid such failures. One approach to address this issue is to develop more interpretable ML algorithms. The primary goal of this work is to illustrate the use of a specific interpretable ML algorithm that has not yet found much use in meteorology, Explainable Boosting Machines (EBMs). We demonstrate that EBMs are particularly suitable to implement human-guided strategies in an ML algorithm. As guiding example, we show how to develop an EBM to detect overshooting tops (OTs) in satellite imagery. EBMs require input features to be scalar. We use techniques from Knowledge-Guided Machine Learning to first extract scalar features from meteorological imagery. For the application of identifying OTs this includes extracting cloud texture from satellite imagery using Gray-Level Co-occurrence Matrices. Once trained, the EBM was examined and minimally altered to more closely match strategies used by domain scientists to identify OTs. The result of our efforts is a fully interpretable ML algorithm developed in a human-machine collaboration that uses human-guided strategies. While the final model does not reach the accuracy of more complex approaches, it performs reasonably well and we hope paves the way for building more interpretable ML algorithms for this and other meteorological applications.
中文标题/摘要
标题:知识引导的机器学习:通过可解释提升机在卫星图像中识别超射顶的应用示例
机器学习(ML)算法在许多气象应用中崭露头角。然而,这些算法难以在训练数据之外进行外推,即它们可能会采用错误的策略导致灾难性失败。由于ML算法的不透明性,这些失败难以预测。在高风险应用中,如严重天气预报,避免此类失败至关重要。解决这一问题的一种方法是开发更具可解释性的ML算法。本文的主要目标是展示一种特定的可解释ML算法在气象学中尚未广泛应用的使用方法,即可解释提升机(EBMs)。我们证明EBMs特别适合在ML算法中实现人类指导策略。以超射顶(OTs)的检测为例,我们展示了如何使用EBMs来检测卫星图像中的OTs。EBMs需要输入特征为标量。我们使用知识引导的机器学习技术从气象图像中提取标量特征。对于识别OTs的应用,这包括使用灰度共生矩阵从卫星图像中提取云纹理。训练完成后,EBM被检查并进行了微调,使其更接近领域科学家识别OTs所使用的方法。我们努力的结果是一种完全可解释的ML算法,在人类-机器协作中开发,使用了人类指导的策略。虽然最终模型的准确性不如更复杂的方法,但它表现得相当不错,并希望为构建此类及其他气象应用的可解释ML算法铺平道路。
Summary / 总结
This study addresses the challenge of using machine learning algorithms in high-stakes meteorological applications by developing an interpretable ML approach. The authors use Explainable Boosting Machines (EBMs) to identify overshooting tops in satellite imagery, a task previously not well-suited for EBMs. By extracting scalar features from meteorological imagery and incorporating human-guided strategies, the EBM model is made interpretable and aligned with domain expert knowledge. Although the model's accuracy is lower than more complex approaches, it provides a promising path for developing more interpretable ML algorithms for meteorological applications.
该研究旨在通过开发可解释的机器学习方法解决气象应用中的挑战。作者使用解释性提升机(EBMs)来识别卫星图像中的 overshooting tops,这是EBMs之前不太适合的任务。通过从气象图像中提取标量特征并结合领域专家策略,EBM模型变得可解释且与领域专家知识一致。尽管该模型的准确性低于更复杂的方法,但它为开发更多可解释的机器学习算法铺平了道路,适用于气象等应用领域。
Reinforcement Learning from Human Feedback
Authors: Nathan Lambert
First: 2025-04-16T21:36:46+00:00 · Latest: 2026-02-27T18:22:58+00:00
Comments: 204 pages. Web-native version at https://rlhfbook.com/ Continually improving, latest version at website
Abstract
Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.
中文标题/摘要
标题:从人类反馈中学习强化学习
从人类反馈中学习强化学习(RLHF)已成为部署最新机器学习系统的重要的技术和叙述工具。本书旨在为具有一定定量背景的人士提供一个温和的介绍核心方法。本书从RLHF的起源开始——既包括最近的文献,也包括科学、经济学、哲学和最优控制领域的交汇。然后我们通过定义、问题表述、数据收集和其他文献中常用的数学方法来奠定基础。本书的核心部分详细介绍了使用RLHF的每一个优化阶段,从指令调优开始,到训练奖励模型,再到拒绝采样、强化学习和直接对齐算法。本书以合成数据和评估中的未研究问题以及领域的开放问题作为结尾。
Summary / 总结
This book provides an introduction to reinforcement learning from human feedback (RLHF), focusing on the core methods for those with a quantitative background. It covers the origins of RLHF, problem formulation, data collection, and various optimization stages from instruction tuning to direct alignment algorithms. Key findings include detailed explanations of reward model training and rejection sampling, with a conclusion on advanced topics and open questions in the field.
本书为那些具有定量背景的人介绍了人类反馈强化学习(RLHF)的核心方法,涵盖了RLHF的起源、问题定义、数据收集以及从指令调优到直接对齐算法的各种优化阶段。关键发现包括对奖励模型训练和拒绝采样的详细解释,以及对合成数据和评估等先进主题和领域内开放问题的讨论。
Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text
Authors: Hainan Xu, Vladimir Bataev, Travis M. Bartley, Jagadeesh Balam
Venue: ICASSP 2026
First: 2026-02-27T18:17:10+00:00 · Latest: 2026-02-27T18:17:10+00:00
Comments: Accepted at ICASSP 2026
Abstract
We propose Chunk-wise Attention Transducer (CHAT), a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-attention within each chunk. This hybrid approach maintains RNN-T's streaming capability while introducing controlled flexibility for local alignment modeling. CHAT significantly reduces the temporal dimension that RNN-T must handle, yielding substantial efficiency improvements: up to 46.2% reduction in peak training memory, up to 1.36X faster training, and up to 1.69X faster inference. Alongside these efficiency gains, CHAT achieves consistent accuracy improvements over RNN-T across multiple languages and tasks -- up to 6.3% relative WER reduction for speech recognition and up to 18.0% BLEU improvement for speech translation. The method proves particularly effective for speech translation, where RNN-T's strict monotonic alignment hurts performance. Our results demonstrate that the CHAT model offers a practical solution for deploying more capable streaming speech models without sacrificing real-time constraints.
中文标题/摘要
标题:分块注意力转换器:快速准确的流式语音识别
我们提出了一种新的分块注意力转换器(CHAT),这是一种对RNN-T模型的扩展,它以固定大小的块处理音频,并在每个块内使用交叉注意力。这种混合方法保持了RNN-T的流式处理能力,同时引入了局部对齐建模的可控灵活性。CHAT显著减少了RNN-T必须处理的时间维度,带来了巨大的效率提升:峰值训练内存减少高达46.2%,训练速度提高1.36倍,推理速度提高1.69倍。除了这些效率提升,CHAT在多个语言和任务上实现了相对于RNN-T的一致性准确度提升——语音识别的相对WER减少高达6.3%,语音翻译的BLEU提高高达18.0%。该方法特别适用于语音翻译,因为RNN-T严格的单调对齐会损害性能。我们的结果表明,CHAT模型提供了一种在不牺牲实时约束的情况下部署更强大流式语音模型的实用解决方案。
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
Authors: Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu
First: 2026-02-10T18:59:56+00:00 · Latest: 2026-02-27T18:13:41+00:00
Comments: 11 pages
Abstract
Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across seven LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.
中文标题/摘要
标题:盲点中的偏见:检测LLMs未提及的内容
大型语言模型(LLMs)通常提供链式思考(CoT)推理痕迹,看似合理,但可能隐藏内部偏见。我们称这些为*未言明的偏见*。通过其陈述的推理来监控模型是不可靠的,现有的偏见评估通常需要预定义的类别和手工制作的数据集。在本研究中,我们引入了一个完全自动化的、黑盒的流程来检测任务特定的未言明偏见。给定一个任务数据集,该流程使用LLM自动评分器生成候选偏见概念。然后,通过生成正向和负向变化,在逐步增大的输入样本上测试每个概念,并应用多重检验和提前停止的统计技术。如果一个概念在统计上显著影响性能差异,但未在模型的CoTs中被列为理由,则标记为未言明偏见。我们在三个决策任务(招聘、贷款审批和大学入学)上对七种LLM进行了评估。我们的技术自动发现了这些模型中以前未知的偏见(如西班牙语流利度、英语熟练度、写作正式度)。在同一运行中,该流程还验证了先前工作手动识别的偏见(性别、种族、宗教、族裔)。更广泛地说,我们提出的方法提供了一条实用的、可扩展的自动任务特定偏见发现途径。
Summary / 总结
This study addresses the issue of unverbalized biases in Large Language Models (LLMs) by introducing an automated pipeline that detects biases not explicitly mentioned in the models' reasoning traces. The pipeline uses LLM autoraters to generate candidate bias concepts and tests them on progressively larger input samples, applying statistical techniques for validation. The study evaluates this approach across seven LLMs on three decision tasks and discovers previously unknown biases such as Spanish fluency and English proficiency, while also validating known biases like gender and race. This method provides a practical and scalable way to automatically identify task-specific biases in LLMs.
该研究通过引入自动化管道来检测大型语言模型(LLMs)中的未言明偏见,解决了这一问题。方法包括使用LLM自动评分器生成候选偏见概念,并在输入样本上进行测试。管道通过检测在模型链式思考推理中未提及但具有统计学显著性性能差异的概念来识别未言明偏见。研究在七个LLM上对三个决策任务(招聘、贷款审批和大学录取)进行了评估,发现了诸如西班牙语流利度和英语熟练度等先前未知的偏见,同时也验证了性别、种族、宗教和族裔等已知偏见。这项工作提供了一种实用且可扩展的方法,用于自动发现LLM中的任务特定偏见。
Joint Geometric and Trajectory Consistency Learning for One-Step Real-World Super-Resolution
Authors: Chengyan Deng, Zhangquan Chen, Li Yu, Kai Zhang, Xue Zhou, Wang Zhang
First: 2026-02-27T18:13:31+00:00 · Latest: 2026-02-27T18:13:31+00:00
Abstract
Diffusion-based Real-World Image Super-Resolution (Real-ISR) achieves impressive perceptual quality but suffers from high computational costs due to iterative sampling. While recent distillation approaches leveraging large-scale Text-to-Image (T2I) priors have enabled one-step generation, they are typically hindered by prohibitive parameter counts and the inherent capability bounds imposed by teacher models. As a lightweight alternative, Consistency Models offer efficient inference but struggle with two critical limitations: the accumulation of consistency drift inherent to transitive training, and a phenomenon we term "Geometric Decoupling" - where the generative trajectory achieves pixel-wise alignment yet fails to preserve structural coherence. To address these challenges, we propose GTASR (Geometric Trajectory Alignment Super-Resolution), a simple yet effective consistency training paradigm for Real-ISR. Specifically, we introduce a Trajectory Alignment (TA) strategy to rectify the tangent vector field via full-path projection, and a Dual-Reference Structural Rectification (DRSR) mechanism to enforce strict structural constraints. Extensive experiments verify that GTASR delivers superior performance over representative baselines while maintaining minimal latency. The code and model will be released at https://github.com/Blazedengcy/GTASR.
中文标题/摘要
标题:联合几何和轨迹一致性学习的一步现实世界超分辨率
基于扩散的现实世界图像超分辨率(Real-ISR)实现了令人印象深刻的感知质量,但由于迭代采样导致计算成本高。虽然最近利用大规模文本到图像(T2I)先验的知识蒸馏方法能够实现一步生成,但它们通常受到参数数量庞大和教师模型固有限制的阻碍。作为一种轻量级的替代方案,一致性模型提供了高效的推理,但面临两个关键限制:传递训练中固有的一致性漂移累积,以及我们称之为“几何解耦”的现象——生成轨迹在像素级对齐但无法保持结构一致性。为了解决这些挑战,我们提出了GTASR(几何轨迹对齐超分辨率),这是一种简单而有效的现实世界超分辨率一致性训练范式。具体而言,我们引入了一条轨迹对齐(TA)策略,通过全程投影修正切线向量场,并引入了一种双重参考结构校正(DRSR)机制以施加严格的结构约束。大量实验验证了GTASR在保持最小延迟的同时优于代表性基线。代码和模型将在https://github.com/Blazedengcy/GTASR上发布。
Summary / 总结
The paper addresses the high computational costs of diffusion-based Real-World Image Super-Resolution (Real-ISR) by proposing GTASR, a consistency training paradigm. GTASR introduces a Trajectory Alignment (TA) strategy and a Dual-Reference Structural Rectification (DRSR) mechanism to improve geometric and trajectory consistency. Experimental results show that GTASR outperforms existing methods while maintaining low latency.
论文针对基于扩散的实时世界图像超分辨率(Real-ISR)的高计算成本,提出了GTASR一致性训练范式。GTASR引入了轨迹对齐(TA)策略和双重参考结构校正(DRSR)机制,以提高几何和轨迹一致性。实验结果表明,GTASR在保持低延迟的同时优于现有方法。
Time Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis
Authors: Javier Pulido, Filipe Rodrigues
First: 2026-02-27T18:10:54+00:00 · Latest: 2026-02-27T18:10:54+00:00
Comments: 6 pages
Abstract
Accurate forecasting of transportation dynamics is essential for urban mobility and infrastructure planning. Although recent work has achieved strong performance with deep learning models, these methods typically require dataset-specific training, architecture design and hyper-parameter tuning. This paper evaluates whether general-purpose time-series foundation models can serve as forecasters for transportation tasks by benchmarking the zero-shot performance of the state-of-the-art model, Chronos-2, across ten real-world datasets covering highway traffic volume and flow, urban traffic speed, bike-sharing demand, and electric vehicle charging station data. Under a consistent evaluation protocol, we find that, even without any task-specific fine-tuning, Chronos-2 delivers state-of-the-art or competitive accuracy across most datasets, frequently outperforming classical statistical baselines and specialized deep learning architectures, particularly at longer horizons. Beyond point forecasting, we evaluate its native probabilistic outputs using prediction-interval coverage and sharpness, demonstrating that Chronos-2 also provides useful uncertainty quantification without dataset-specific training. In general, this study supports the adoption of time-series foundation models as a key baseline for transportation forecasting research.
中文标题/摘要
标题:时间序列基础模型作为交通预测的强基线:大规模基准分析
准确预测交通动态对于城市交通和基础设施规划至关重要。尽管最近的工作已经使用深度学习模型取得了强大的性能,但这些方法通常需要针对数据集进行特定训练、架构设计和超参数调整。本文通过在涵盖高速公路交通流量和流速、城市交通速度、共享单车需求和电动汽车充电站数据的十个真实世界数据集上对最先进的模型Chronos-2进行零样本基准测试,评估通用时间序列基础模型是否可以作为交通任务的预测器。在一致的评估协议下,我们发现,即使没有任何任务特定的微调,Chronos-2在大多数数据集上都能达到最先进的或具有竞争力的准确性,经常优于经典统计基线和专门的深度学习架构,尤其是在更长的时间范围内。除了点预测之外,我们还使用预测区间覆盖和锐度评估其原生的概率输出,证明Chronos-2在无需特定数据集训练的情况下也能提供有用的不确定性量化。总体而言,这项研究支持将时间序列基础模型作为交通预测研究中的关键基线。
Summary / 总结
This paper evaluates the performance of a general-purpose time-series foundation model, Chronos-2, in transportation forecasting tasks across ten real-world datasets. Without any task-specific fine-tuning, Chronos-2 achieves state-of-the-art or competitive accuracy, often outperforming classical and specialized deep learning models, especially at longer forecasting horizons. Additionally, it provides useful uncertainty quantification through its probabilistic outputs.
该研究通过在十个多领域真实数据集上对比Chronos-2模型,评估了通用时间序列基础模型在交通预测中的应用。未经特定任务微调的情况下,Chronos-2在长预测时域上达到了领先或竞争力的准确度,并通过其概率输出提供了有用的不确定性量化。该研究支持将时间序列基础模型作为交通预测研究中的关键基准模型。
SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems
Authors: Jialiang Fan, Weizhe Xu, Mengyu Liu, Oleg Sokolsky, Insup Lee, Fangxin Kong
First: 2026-02-27T18:06:10+00:00 · Latest: 2026-02-27T18:06:10+00:00
Comments: 12 pages, 6 figures
Abstract
Safety-critical task planning in robotic systems remains challenging: classical planners suffer from poor scalability, Reinforcement Learning (RL)-based methods generalize poorly, and base Large Language Models (LLMs) cannot guarantee safety. To address this gap, we propose safety-generalizable large language models, named SafeGen-LLM. SafeGen-LLM can not only enhance the safety satisfaction of task plans but also generalize well to novel safety properties in various domains. We first construct a multi-domain Planning Domain Definition Language 3 (PDDL3) benchmark with explicit safety constraints. Then, we introduce a two-stage post-training framework: Supervised Fine-Tuning (SFT) on a constraint-compliant planning dataset to learn planning syntax and semantics, and Group Relative Policy Optimization (GRPO) guided by fine-grained reward machines derived from formal verification to enforce safety alignment and by curriculum learning to better handle complex tasks. Extensive experiments show that SafeGen-LLM achieves strong safety generalization and outperforms frontier proprietary baselines across multi-domain planning tasks and multiple input formats (e.g., PDDLs and natural language).
中文标题/摘要
标题:SafeGen-LLM:增强机器人系统任务规划中的安全性泛化
机器人系统中的安全性关键任务规划仍然具有挑战性:经典规划器在可扩展性方面表现不佳,基于强化学习(RL)的方法泛化能力差,基础大型语言模型无法保证安全性。为了解决这一差距,我们提出了一种安全性泛化的大语言模型,命名为SafeGen-LLM。SafeGen-LLM不仅可以增强任务计划的安全性满足度,还可以在各种领域中很好地泛化到新的安全属性。我们首先构建了一个具有显式安全约束的多领域Planning Domain Definition Language 3 (PDDL3)基准。然后,我们引入了一个两阶段后训练框架:在约束合规的规划数据集上进行监督微调(SFT)以学习规划语法和语义,以及由形式验证中导出的细粒度奖励机器引导的组相对策略优化(GRPO),并由课程学习来更好地处理复杂任务以确保安全对齐。广泛的实验表明,SafeGen-LLM在多领域规划任务和多种输入格式(例如PDDL和自然语言)中实现了强大的安全性泛化,并优于前沿的专有基线。
Summary / 总结
The paper addresses the challenge of safety-critical task planning in robotic systems by proposing SafeGen-LLM, a safety-generalizable large language model. It introduces a two-stage post-training framework involving Supervised Fine-Tuning on a constraint-compliant dataset and Group Relative Policy Optimization guided by reward machines and curriculum learning. Experiments demonstrate that SafeGen-LLM significantly enhances safety satisfaction and generalizes well to novel safety properties across various domains, outperforming existing methods in multi-domain planning tasks and different input formats.
论文通过提出SafeGen-LLM,一种安全泛化的大型语言模型,解决了机器人系统中的安全关键任务规划问题。它引入了两阶段后训练框架,包括基于约束合规数据集的监督微调和由奖励机器和课程学习引导的组相对策略优化。实验表明,SafeGen-LLM 在多个领域和输入格式中表现出色,优于现有方法在安全泛化方面的性能。
Enhancing Spatial Understanding in Image Generation via Reward Modeling
Authors: Zhenyu Tang, Chaoran Feng, Yufan Deng, Jie Wu, Xiaojie Li, Rui Wang, Yunpeng Chen, Daquan Zhou
Venue: CVPR 2026
First: 2026-02-27T17:59:57+00:00 · Latest: 2026-02-27T17:59:57+00:00
Comments: Accepted at CVPR 2026. Github: https://github.com/DAGroup-PKU/SpatialT2I Project website: https://dagroup-pku.github.io/SpatialT2I/
Abstract
Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity-particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for the complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
中文标题/摘要
标题:通过奖励建模增强图像生成中的空间理解
文本到图像生成的近期进展极大地提高了视觉保真度和创造力,但也对提示的复杂性提出了更高要求,特别是在编码复杂的空间关系方面。在这种情况下,获得满意的结果通常需要多次采样尝试。为了解决这一挑战,我们提出了一种新方法,以增强当前图像生成模型的空间理解能力。我们首先构建了包含超过8万个偏好对的SpatialReward-Dataset。在此基础上,我们构建了SpatialScore,这是一种奖励模型,旨在评估文本到图像生成中的空间关系准确性,其性能甚至超过了领先的专有模型在空间评估中的表现。我们进一步证明,该奖励模型能够有效地实现复杂空间生成的在线强化学习。在多个基准上的广泛实验表明,我们的专门奖励模型在图像生成中的空间理解方面取得了显著且一致的提升。
Summary / 总结
This paper addresses the challenge of encoding intricate spatial relationships in text-to-image generation by introducing a novel method that enhances spatial understanding. It constructs the SpatialReward-Dataset with over 80k preference pairs and develops SpatialScore, a reward model to evaluate spatial accuracy, which outperforms leading models. Experiments show that this reward model improves spatial understanding in image generation across multiple benchmarks.
研究旨在通过解决复杂空间关系编码的挑战来提高图像生成模型的空间理解能力。作者开发了SpatialReward-Dataset和一个名为SpatialScore的奖励模型来评估空间准确性,其性能甚至超过了现有模型。实验表明,该奖励模型能够有效增强复杂空间生成的在线强化学习,从而在多个基准测试中显著提高了空间理解能力。
A Variational Estimator for $L_p$ Calibration Errors
Authors: Eugène Berta, Sacha Braun, David Holzmüller, Francis Bach, Michael I. Jordan
First: 2026-02-27T17:56:52+00:00 · Latest: 2026-02-27T17:56:52+00:00
Abstract
Calibration$\unicode{x2014}$the problem of ensuring that predicted probabilities align with observed class frequencies$\unicode{x2014}$is a basic desideratum for reliable prediction with machine learning systems. Calibration error is traditionally assessed via a divergence function, using the expected divergence between predictions and empirical frequencies. Accurately estimating this quantity is challenging, especially in the multiclass setting. Here, we show how to extend a recent variational framework for estimating calibration errors beyond divergences induced induced by proper losses, to cover a broad class of calibration errors induced by $L_p$ divergences. Our method can separate over- and under-confidence and, unlike non-variational approaches, avoids overestimation. We provide extensive experiments and integrate our code in the open-source package probmetrics (https://github.com/dholzmueller/probmetrics) for evaluating calibration errors.
中文标题/摘要
标题:一种$L_p$校准误差的变分估计器
校准$\unicode{x2014}$确保预测概率与观察到的类别频率一致$\unicode{x2014}$是机器学习系统可靠预测的基本要求。校准误差通常通过偏差函数来评估,使用预测值与经验频率之间的期望偏差。准确估计这个量在多类别设置中尤其具有挑战性。在这里,我们展示了如何将最近的变分框架扩展到超越由适当损失引起的偏差,以涵盖由$L_p$偏差引起的广泛校准误差类。我们的方法可以区分过度自信和欠自信,并且与非变分方法不同,避免了过度估计。我们提供了广泛的实验,并将我们的代码集成到开源包probmetrics(https://github.com/dholzmueller/probmetrics)中,用于评估校准误差。
Summary / 总结
This paper addresses the challenge of accurately estimating calibration errors in machine learning models, particularly in the multiclass setting. It introduces a variational estimator for $L_p$ calibration errors, which can distinguish between over- and under-confidence in predictions. Unlike non-variational methods, this approach avoids overestimation. Extensive experiments validate the method's effectiveness and it is integrated into the open-source package probmetrics for practical use.
本文解决了机器学习模型在多类设置下准确估计校准误差的挑战。它提出了一种$L_p$校准误差的变分估计方法,可以区分预测中的过自信和欠自信。与非变分方法不同,该方法避免了估计误差的夸大。广泛的实验验证了该方法的有效性,并将其集成到开源包probmetrics中以供实际使用。
Controllable Reasoning Models Are Private Thinkers
Authors: Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych
First: 2026-02-27T17:39:10+00:00 · Latest: 2026-02-27T17:39:10+00:00
Abstract
AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models
中文标题/摘要
标题:可控推理模型是私密思考者
由推理模型驱动的AI代理需要访问敏感用户数据。然而,它们的推理痕迹难以控制,可能会无意中将私人信息泄露给外部方。我们提出训练模型不仅在最终答案中遵循指令,还在推理痕迹中遵循指令,可能在不同约束条件下。我们假设在推理痕迹中提高其遵循指令的能力可以提高其隐私保护能力。为了证明这一点,我们在一个具有明确推理痕迹限制的新指令遵循数据集上对模型进行微调。我们还引入了一种生成策略,使用单独的LoRA适配器分离推理和答案生成。我们在两个模型家族的六个模型上进行了评估,参数范围从17亿到140亿,跨越两个指令遵循基准和两个隐私基准。我们的方法取得了显著的改进,在指令遵循性能上提高了高达20.9分,在隐私基准上提高了高达51.9个百分点。然而,这些改进可能会因推理性能与遵循指令能力之间的权衡而牺牲任务实用性。总体而言,我们的结果表明,在推理模型中提高遵循指令的行为可以显著增强隐私,表明未来隐私感知代理开发的一个有希望的方向。我们的代码和数据可在https://github.com/UKPLab/arxiv2026-controllable-reasoning-models 获取。
Summary / 总结
This study addresses the issue of private information leakage in reasoning models by proposing a method to control reasoning traces. The authors fine-tune models on a new dataset with explicit restrictions and introduce a generation strategy to decouple reasoning and answer generation. Evaluations on six models from two model families show significant improvements in both instruction-following performance and privacy benchmarks, though at the cost of task utility. The results suggest that enhancing instruction-following in reasoning models can significantly improve privacy.
论文针对依赖推理模型的AI代理在处理敏感用户数据时可能引发隐私泄露的问题,提出了一种训练模型不仅在最终答案中遵循指令,也在推理过程中遵循指令的方法。通过在包含明确推理限制的新数据集上进行微调,并引入一种分离推理和答案生成的生成策略,研究在六个不同模型上的不同基准测试中取得了显著的指令遵循性能和隐私改进,尽管这可能会牺牲任务实用性。结果表明,增强推理模型中的指令遵循行为可以显著提高隐私,这为未来隐私意识代理的发展指出了一个有前景的方向。
Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads
Authors: Shaswat Patel, Vishvesh Trivedi, Yue Han, Yihuai Hong, Eunsol Choi
First: 2026-02-25T22:28:50+00:00 · Latest: 2026-02-27T17:36:39+00:00
Comments: In the paper, there are still many statements that are unclear and lack sufficient justification. Since it is difficult for us to estimate how much time would be required to properly revise and correct these issues, we would like to request a withdrawal of the paper in this moment. Thank you!
Abstract
Recent work has identified a subset of attention heads in Transformer as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to cross-lingual setting, we identify Retrieval-Transition heads(RTH), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuaD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTH induces bigger performance drop than masking Retrieval Heads (RH). Our work advances understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.
中文标题/摘要
标题:通过检索-转换头连接潜在推理与目标语言生成
近期研究发现,Transformer中的部分注意力头是检索头,负责从上下文中检索信息。在本研究中,我们首先探讨了多语言上下文中的检索头。在多语言语言模型中,我们发现检索头往往在多种语言之间共享。扩展到跨语言设置,我们识别出检索-转换头(RTH),它们管理向特定目标语言输出的转换。我们的实验表明,RTH与检索头不同,对于多语言LLM中的链式推理更为关键。在四个多语言基准(MMLU-ProX、MGSM、MLQA和XQuaD)和两个模型家族(Qwen-2.5和Llama-3.1)中,我们证明了屏蔽RTH导致的性能下降比屏蔽检索头更大。我们的研究通过分离负责映射到目标语言的注意力头,推进了对多语言LMs的理解。
Summary / 总结
This study investigates retrieval heads in multilingual contexts and identifies Retrieval-Transition heads (RTH) that govern the transition to specific target-language output. The research finds that RTHs are distinct from retrieval heads and more crucial for Chain-of-Thought reasoning in multilingual LLMs. Experiments across four multilingual benchmarks show that masking RTHs leads to a larger performance drop than masking retrieval heads.
研究在多语言上下文中探讨了检索头,并识别出负责特定目标语言输出过渡的检索转换头(RTH)。研究发现,RTH与检索头不同,在多语言LLM中的链式推理中更为关键。实验结果显示,在四个多语言基准和两种模型家族中,屏蔽RTH导致的性能下降比屏蔽检索头更大。
FeynTune: Large Language Models for High-Energy Theory
Authors: Paul Richmond, Prarit Agarwal, Borun Chowdhury, Vasilis Niarchos, Constantinos Papageorgakis
Venue: Mach. Learn.: Sci. Technol. 7 025012 (2026)
First: 2025-07-24T18:21:03+00:00 · Latest: 2026-02-27T17:30:30+00:00
Comments: 16 pages; v2: Human evaluation discussion updated, additional training hyperparameters and inference settings included and references added
Abstract
We present specialized Large Language Models for theoretical High-Energy Physics, obtained as 20 fine-tuned variants of the 8-billion parameter Llama-3.1 model. Each variant was trained on arXiv abstracts (through August 2024) from different combinations of hep-th, hep-ph and gr-qc. For a comparative study, we also trained models on datasets that contained abstracts from disparate fields such as the q-bio and cs categories. All models were fine-tuned using two distinct Low-Rank Adaptation fine-tuning approaches and varying dataset sizes, and outperformed the base model on hep-th abstract completion tasks. We compare performance against leading commercial LLMs (ChatGPT, Claude, Gemini, DeepSeek) and derive insights for further developing specialized language models for High-Energy Theoretical Physics.
中文标题/摘要
标题:FeynTune:高能理论领域的大型语言模型
我们展示了针对理论高能物理的专业大型语言模型,这些模型是通过20种不同变体的80亿参数Llama-3.1模型微调得到的。每个变体都基于2024年8月之前arXiv摘要的不同组合(来自hep-th、hep-ph和gr-qc)。为了进行比较研究,我们还训练了包含来自不同领域(如q-bio和cs)摘要的数据集的模型。所有模型都使用了两种不同的低秩适应微调方法,并且在不同的数据集大小下进行微调,且在高能理论物理摘要完成任务上优于基础模型。我们将性能与领先的商用大型语言模型(ChatGPT、Claude、Gemini、DeepSeek)进行了比较,并得出了进一步开发专门化语言模型的见解。
Summary / 总结
The research aims to develop specialized Large Language Models (LLMs) for High-Energy Theory by fine-tuning 20 variants of the 8-billion parameter Llama-3.1 model on arXiv abstracts from hep-th, hep-ph, and gr-qc. These models outperformed the base model on hep-th abstract completion tasks and surpassed leading commercial LLMs like ChatGPT, Claude, Gemini, and DeepSeek. Different Low-Rank Adaptation fine-tuning approaches and varying dataset sizes were used to achieve these results.
研究旨在通过在arXiv hep-th、hep-ph和gr-qc领域的摘要上微调80亿参数的Llama-3.1模型的20个变体,开发专门用于高能理论的大型语言模型(LLMs)。这些模型在高能理论摘要完成任务上优于基模型,并超越了ChatGPT、Claude、Gemini和DeepSeek等领先的商用LLM。使用了不同的低秩适应微调方法和不同的数据集大小来实现这些结果。
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
Authors: Gregory Kang Ruey Lau, Hieu Dao, Nicole Kan Hui Lin, Bryan Kian Hsiang Low
Venue: ICLR 2025
First: 2026-02-27T17:18:42+00:00 · Latest: 2026-02-27T17:18:42+00:00
Comments: Earlier versions presented at ICLR 2025 QUESTION workshop and ICML 2025 R2-FM workshop
Abstract
Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE's generalization to non-text output tasks, including image and audio generation.
中文标题/摘要
标题:多模态大型语言模型的不确定性量化:基于内部模态特征的不一致性调整语义体积
尽管具有强大的能力,多模态大型语言模型(MLLMs)可能会生成看似合理但实际上错误的输出,阻碍了可靠部署。准确的不确定性度量可以将不可靠的查询升级给人类专家或更大规模的模型以提高性能。然而,现有的不确定性度量存在实际限制,如仅针对特定模态设计、依赖外部工具或计算成本高昂。我们提出了UMPIRE,这是一种无需训练的MLLM不确定性量化框架,可以在各种输入和输出模态下高效工作,无需外部工具,仅依赖模型自身的内部模态特征。UMPIRE 通过计算给定任务实例中采样MLLM响应的不一致性调整语义体积,有效地捕捉样本的全局语义多样性和基于内部模型置信度的响应局部不一致性。我们为MLLMs提出了不确定性期望,并提供了支持UMPIRE设计的理论分析。广泛的实验表明,UMPIRE 在图像、音频和视频文本基准测试中,包括对抗性和离分布设置中,始终优于基线度量在错误检测和不确定性校准方面的表现。我们还展示了UMPIRE在非文本输出任务中的泛化能力,包括图像和音频生成。
Summary / 总结
The research addresses the issue of unreliable outputs from Multimodal Large Language Models (MLLMs) by introducing UMPIRE, a training-free uncertainty quantification framework. UMPIRE computes the incoherence-adjusted semantic volume of MLLM responses, capturing both global semantic diversity and local incoherence. Experiments show that UMPIRE outperforms baseline metrics in error detection and uncertainty calibration across various benchmarks, including image, audio, and video-text tasks, and even in adversarial and out-of-distribution settings. Additionally, UMPIRE generalizes well to non-text output tasks such as image and audio generation.
研究旨在通过引入UMPRIRE无训练不确定性量化框架解决多模态大型语言模型(MLLMs)的不可靠输出问题。UMPRIRE通过计算MLLM响应的去不一致调整后的语义体积,同时捕捉全局语义多样性和局部不一致性。实验表明,UMPRIRE在各种基准测试中的错误检测和不确定性校准方面均优于基线指标,包括图像、音频和视频-文本任务,甚至在对抗性和离分布设置中也是如此。此外,UMPRIRE在图像和音频生成等非文本输出任务中也表现出良好的泛化能力。
Resilient Strategies for Stochastic Systems: How Much Does It Take to Break a Winning Strategy?
Authors: Kush Grover, Markel Zubia, Debraj Chakraborty, Muqsit Azeem, Nils Jansen, Jan Kretinsky
First: 2026-02-27T17:15:49+00:00 · Latest: 2026-02-27T17:15:49+00:00
Comments: To appear in Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Paphos, Cyprus, May 25-29, 2026
Abstract
We study the problem of resilient strategies in the presence of uncertainty. Resilient strategies enable an agent to make decisions that are robust against disturbances. In particular, we are interested in those disturbances that are able to flip a decision made by the agent. Such a disturbance may, for instance, occur when the intended action of the agent cannot be executed due to a malfunction of an actuator in the environment. In this work, we introduce the concept of resilience in the stochastic setting and present a comprehensive set of fundamental problems. Specifically, we discuss such problems for Markov decision processes with reachability and safety objectives, which also smoothly extend to stochastic games. To account for the stochastic setting, we provide various ways of aggregating the amounts of disturbances that may have occurred, for instance, in expectation or in the worst case. Moreover, to reason about infinite disturbances, we use quantitative measures, like their frequency of occurrence.
中文标题/摘要
标题:具有弹性的随机系统策略:需要多少才能打破一个成功的策略?
我们研究在不确定性存在下的弹性策略问题。弹性策略使代理能够在受到干扰时做出稳健的决策。特别是,我们对能够翻转代理决策的干扰感兴趣。例如,当代理的意图动作由于环境中的执行器故障而无法执行时,可能会发生此类干扰。在本文中,我们介绍了随机环境下的弹性概念,并提出了一个全面的基本问题集。具体而言,我们讨论了马尔可夫决策过程中的可达性和安全性目标,这些目标也平滑地扩展到随机博弈。为了考虑随机环境,我们提供了各种方式来聚合可能发生的干扰量,例如期望值或最坏情况。此外,为了处理无限干扰,我们使用了诸如其发生频率之类的定量度量。
Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways
Authors: Paloma Rabaey, Jong Hak Moon, Jung-Oh Lee, Min Gwan Kim, Hangyul Yoon, Thomas Demeester, Edward Choi
First: 2025-11-06T16:24:53+00:00 · Latest: 2026-02-27T17:15:26+00:00
Abstract
Radiology reports are invaluable for clinical decision-making and hold great potential for automated analysis when structured into machine-readable formats. These reports often contain uncertainty, which we categorize into two distinct types: (i) Explicit uncertainty reflects doubt about the presence or absence of findings, conveyed through hedging phrases. These vary in meaning depending on the context, making rule-based systems insufficient to quantify the level of uncertainty for specific findings; (ii) Implicit uncertainty arises when radiologists omit parts of their reasoning, recording only key findings or diagnoses. Here, it is often unclear whether omitted findings are truly absent or simply unmentioned for brevity. We address these challenges with a two-part framework. We quantify explicit uncertainty by creating an expert-validated, LLM-based reference ranking of common hedging phrases, and mapping each finding to a probability value based on this reference. In addition, we model implicit uncertainty through an expansion framework that systematically adds characteristic sub-findings derived from expert-defined diagnostic pathways for 14 common diagnoses. Using these methods, we release Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports. This enriched resource enables uncertainty-aware image classification, faithful diagnostic reasoning, and new investigations into the clinical impact of diagnostic uncertainty.
中文标题/摘要
标题:放射学报告中的临床不确定性建模:从明确的不确定性标记到隐含的推理路径
放射学报告对于临床决策至关重要,并且在结构化为机器可读格式时具有巨大的自动化分析潜力。这些报告中经常包含不确定性,我们将其分为两种类型:(i) 明确的不确定性反映了对发现存在与否的怀疑,通过语气词表达。这些语气词的意义依赖于上下文,使得基于规则的系统无法量化特定发现的不确定性水平;(ii) 隐含的不确定性发生在放射科医生省略推理的一部分时,仅记录关键发现或诊断。在这种情况下,省略的发现可能是真正不存在,或者只是出于简洁而未提及。我们通过两部分框架来应对这些挑战。我们通过创建一个由专家验证的、基于LLM的常见语气词参考排名来量化明确的不确定性,并根据该参考将每个发现映射到一个概率值。此外,我们通过一个扩展框架来建模隐含的不确定性,该框架系统地添加了从14种常见诊断的专家定义诊断路径中派生出的特征亚发现。使用这些方法,我们发布了Lunguage++,这是一个扩展的、具有不确定性意识的细粒度结构放射学报告基准。这一丰富资源使不确定性意识的图像分类、忠实的诊断推理以及对诊断不确定性临床影响的新研究成为可能。
Summary / 总结
The research aims to address the challenges of modeling uncertainty in radiology reports, which are crucial for clinical decision-making. The study introduces a two-part framework: one to quantify explicit uncertainty using expert-validated, LLM-based reference rankings of hedging phrases, and another to model implicit uncertainty by expanding findings with characteristic sub-findings derived from expert-defined diagnostic pathways. The result is Lunguage++, an expanded and uncertainty-aware version of the Lunguage benchmark, which facilitates uncertainty-aware image classification and diagnostic reasoning.
研究通过区分显性和隐性不确定性来解决放射学报告中临床不确定性建模的挑战。显性不确定性通过专家验证的LLM基准参考排名来量化,而隐性不确定性则通过一个扩展框架来建模,该框架从专家定义的诊断路径中添加特征子发现。研究结果产生了Lunguage++,一个扩展的基准,包含了带有不确定性的结构化放射学报告,这使得更准确的图像分类和诊断推理成为可能。
MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
Authors: Jacob Eisenstein, Fantine Huot, Adam Fisch, Jonathan Berant, Mirella Lapata
First: 2026-02-27T17:13:20+00:00 · Latest: 2026-02-27T17:13:20+00:00
Abstract
We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective communication about private information. This enables an interactive scaling analysis, in which a fixed token budget is divided over a variable number of turns. We find that in many cases, language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario in which one agent attempts to summarize its information and the other agent immediately acts -- despite substantial headroom. This suggests that state-of-the-art models still suffer from significant weaknesses in planning and executing multi-turn collaborative conversations. We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence. While there is no single linguistic explanation for the collaborative weaknesses of contemporary language models, we note that humans achieve comparable task success at superior token efficiency by producing dialogues that are more coherent than those produced by most language models. The proactive management of private information is a defining feature of real-world communication, and we hope that MT-PingEval will drive further work towards improving this capability.
中文标题/摘要
标题:MT-PingEval:使用私人信息博弈评估多轮协作
我们提出了一种可扩展的方法,用于评估语言模型在多轮交互中的表现,使用一系列需要有效沟通私人信息的协作游戏。这使得可以进行交互式缩放分析,在这种分析中,固定数量的令牌预算可以分配给可变数量的回合。我们发现,在许多情况下,语言模型无法通过互动协作来超越单个代理尝试总结其信息而另一个代理立即行动的非互动基线场景,尽管存在很大的改进空间。这表明最先进的模型在规划和执行多轮协作对话方面仍然存在显著的弱点。我们分析了这些对话的语言特征,评估了奉承、信息密度和话语连贯性的作用。虽然没有单一的语言解释可以解释当前语言模型协作能力的弱点,但我们注意到,人类通过产生比大多数语言模型更连贯的对话,以更高的令牌效率实现了相当的任务成功。主动管理私人信息是现实世界交流的特征之一,我们希望MT-PingEval能够推动进一步的工作,以提高这一能力。
Summary / 总结
The study introduces a scalable method for evaluating language models in multi-turn interactions through collaborative games that require effective communication about private information. The research finds that many language models fail to use interactive collaboration to improve over a non-interactive baseline, indicating significant weaknesses in planning and executing multi-turn conversations. The analysis of dialogues reveals that humans achieve better task success with more coherent dialogues, suggesting a need for models to better manage private information proactively.
研究引入了一种通过协作游戏评估语言模型在多轮交互中的方法,这些游戏需要有效沟通私人信息。研究发现,许多语言模型难以超越非交互式基线,表明它们在规划和执行多轮协作对话方面存在显著弱点。对对话的分析显示,人类通过更连贯的对话实现更好的任务成功,这表明模型需要更好地主动管理私人信息。
GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation
Authors: Stergios Chatzikyriakidis, Dimitris Papadakis, Sevasti-Ioanna Papaioannou, Erofili Psaltaki
First: 2025-11-05T18:55:42+00:00 · Latest: 2026-02-27T17:07:45+00:00
Abstract
We present an extended Greek Dialectal Dataset (GRDD+) 1that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, while we add six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The result is a dataset with total size 6,374,939 words and 10 varieties. This is the first dataset with such variation and size to date. We conduct a number of fine-tuning experiments to see the effect of good quality dialectal data on a number of LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results to frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
中文标题/摘要
标题:GRDD+: 一种扩展的希腊方言数据集及其跨架构微调评估
我们介绍了扩展的希腊方言数据集(GRDD+),该数据集补充了现有的GRDD数据集,增加了更多来自克里特岛、塞浦路斯、黑海地区和北希腊的数据,同时新增了六种新方言:希腊科西嘉语、里哥克语(南意大利希腊语)、马尼奥特语、七群岛语、萨卡尼亚语和卡塔里维苏斯希腊语。结果是一个包含6,374,939个单词和10种方言的数据集。这是迄今为止第一个具有如此多样性和规模的数据集。我们进行了多项微调实验,以观察高质量方言数据对多种语言模型的影响。我们微调了三种模型架构(Llama-3-8B、Llama-3.1-8B、Krikri-8B),并将结果与前沿模型(Claude-3.7-Sonnet、Gemini-2.5、ChatGPT-5)进行了比较。
A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification
Authors: Yixuan Liu, Kanwal K. Bhatia, Ahmed E. Fetit
First: 2026-02-27T17:06:37+00:00 · Latest: 2026-02-27T17:06:37+00:00
Abstract
Despite advances in machine learning-based medical image classifiers, the safety and reliability of these systems remain major concerns in practical settings. Existing auditing approaches mainly rely on unimodal features or metadata-based subgroup analyses, which are limited in interpretability and often fail to capture hidden systematic failures. To address these limitations, we introduce the first automated auditing framework that extends slice discovery methods to multimodal representations specifically for medical applications. Comprehensive experiments were conducted under common failure scenarios using the MIMIC-CXR-JPG dataset, demonstrating the framework's strong capability in both failure discovery and explanation generation. Our results also show that multimodal information generally allows more comprehensive and effective auditing of classifiers, while unimodal variants beyond image-only inputs exhibit strong potential in scenarios where resources are constrained.
中文标题/摘要
标题:一种多模态切片发现框架,用于医学图像分类中的系统性故障检测与解释
尽管在基于机器学习的医学图像分类器方面取得了进展,但在实际应用中,这些系统的安全性和可靠性仍然是主要关切。现有的审计方法主要依赖于单一模态特征或基于元数据的子组分析,这些方法在可解释性方面存在局限性,往往无法捕捉到隐藏的系统性故障。为了解决这些局限性,我们引入了第一个自动审计框架,该框架将切片发现方法扩展到专门适用于医学应用的多模态表示。在使用MIMIC-CXR-JPG数据集进行的全面实验中,该框架在故障发现和解释生成方面表现出强大的能力。我们的结果还表明,多模态信息通常允许对分类器进行更全面和有效的审计,而不仅仅是图像输入的单一模态变体,在资源受限的场景中表现出强大的潜力。
Summary / 总结
The research aims to enhance the safety and reliability of machine learning-based medical image classifiers by addressing the limitations of existing unimodal and metadata-based auditing approaches. The proposed framework uses multimodal slice discovery methods to automatically detect and explain systematic failures in medical image classification. Experiments on the MIMIC-CXR-JPG dataset show that the framework effectively identifies and explains failures, and that multimodal information provides more comprehensive and effective auditing compared to unimodal approaches, especially in resource-constrained scenarios.
研究旨在通过解决现有单一模态和基于元数据的审计方法的局限性,提高基于机器学习的医学影像分类系统的安全性和可靠性。提出的框架使用多模态切片发现方法来增强可解释性并捕捉隐藏的系统性故障。在MIMIC-CXR-JPG数据集上的实验表明,该框架能够有效发现和解释故障,并且多模态信息相比单一模态方法提供了更全面和有效的审计,特别是在资源受限的场景中。
Manifold of Failure: Behavioral Attraction Basins in Language Models
Authors: Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto
First: 2026-02-25T15:08:20+00:00 · Latest: 2026-02-27T17:04:02+00:00
Abstract
While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs: Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini, we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model's safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
中文标题/摘要
标题:失败流形:语言模型的行为吸引盆地
尽管先前的工作集中在将对抗性示例投影回自然数据流形以恢复安全性,但我们认为,全面理解人工智能安全性需要表征不安全区域本身。本文介绍了一种系统映射大型语言模型(LLMs)失败流形的框架。我们将寻找漏洞的问题重新定义为质量多样性问题,使用MAP-Elites照亮这些失败区域的连续拓扑结构,我们称其为行为吸引盆地。我们的质量度量,对齐偏差,指导搜索模型行为与预期对齐最偏离的区域。在三个LLM:Llama-3-8B、GPT-OSS-20B和GPT-5-Mini中,我们展示了MAP-Elites实现了高达63%的行为覆盖率,发现了多达370个独特的漏洞生态位,并揭示了不同模型特有的拓扑特征:Llama-3-8B表现出几乎普遍的漏洞平台(平均对齐偏差0.93),GPT-OSS-20B显示了一个碎片化的景观,具有空间集中的盆地(平均0.73),而GPT-5-Mini则表现出强大的鲁棒性,天花板为0.50。我们的方法生成了每个模型安全景观的可解释全局图,这是现有攻击方法(GCG、PAIR或TAP)无法提供的,将范式从寻找离散的失败转移到理解其潜在结构。
Summary / 总结
This paper addresses the need for a comprehensive understanding of AI safety by mapping the Manifold of Failure in Large Language Models (LLMs). Using MAP-Elites to search for vulnerabilities, the authors introduce a quality diversity approach guided by Alignment Deviation. Across three LLMs, they achieve up to 63% behavioral coverage, discover up to 370 distinct vulnerability niches, and reveal different topological signatures, indicating varying levels of robustness among the models.
本文通过表征语言模型中的不安全区域,旨在全面理解AI安全性。它引入了使用MAP-Elites来绘制Manifold of Failure(行为吸引盆地)的框架。研究使用了质量度量指标,即偏差偏离度,来引导搜索模型行为与预期对齐最偏离的区域。在三个LLM中,MAP-Elites实现了高达63%的行为覆盖率,发现了多达370个不同的脆弱性区域,并揭示了每个模型独特的拓扑特征,突显了它们独特的安全景观。