WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos
Authors: Yufei Ye, Jiaman Li, Ryan Rong, C. Karen Liu
Venue: www
First: 2026-02-25T18:59:10+00:00 · Latest: 2026-02-25T18:59:10+00:00
Comments: Project website: https://judyye.github.io/whole-www
Abstract
Egocentric manipulation videos are highly challenging due to severe occlusions during interactions and frequent object entries and exits from the camera view as the person moves. Current methods typically focus on recovering either hand or object pose in isolation, but both struggle during interactions and fail to handle out-of-sight cases. Moreover, their independent predictions often lead to inconsistent hand-object relations. We introduce WHOLE, a method that holistically reconstructs hand and object motion in world space from egocentric videos given object templates. Our key insight is to learn a generative prior over hand-object motion to jointly reason about their interactions. At test time, the pretrained prior is guided to generate trajectories that conform to the video observations. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing. WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: https://judyye.github.io/whole-www
中文标题/摘要
标题:WHOLE:从第一人称视频中重建世界坐标系下的手与物体
第一人称操作视频由于在互动过程中严重的遮挡以及随着人的移动物体频繁进入和退出摄像头视野而极具挑战性。当前的方法通常专注于单独恢复手或物体的姿态,但在互动过程中两者都难以应对,且无法处理物体不在视线中的情况。此外,它们的独立预测往往导致手物关系不一致。我们提出了WHOLE方法,该方法可以从给定物体模板的第一人称视频中整体重建手和物体在世界坐标系中的运动。我们的核心见解是学习手物运动的生成先验,以联合推理它们的互动。测试时,预训练的先验被引导生成与视频观察一致的轨迹。这种联合生成重建显著优于先分别处理手和物体再进行后处理的方法。WHOLE在手部运动估计、6D物体姿态估计及其相对互动重建方面达到了最先进的性能。项目网站:https://judyye.github.io/whole-www
Summary / 总结
The research addresses the challenges of egocentric manipulation videos by introducing WHOLE, which jointly reconstructs hand and object motion in world space from egocentric videos, given object templates. The key insight is to learn a generative prior over hand-object motion and, at test time, guide it to generate trajectories that conform to the video observations. This joint approach outperforms processing hands and objects separately followed by post-processing, achieving state-of-the-art performance in hand motion estimation, 6D object pose estimation, and relative interaction reconstruction, including under occlusion and when objects enter or exit the camera view.
研究针对来自第一人称操作视频中手和物体姿态恢复的挑战,这些视频中手和物体经常被遮挡,并且频繁进入和离开摄像头视野。WHOLE 方法通过在世界空间中联合重建手和物体运动来解决这些问题,通过学习手-物体运动的生成先验,WHOLE 在手运动估计、6D 物体姿态估计及其相对交互重建方面取得了最先进的性能。
Solaris: Building a Multiplayer Video World Model in Minecraft
Authors: Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie
First: 2026-02-25T18:59:01+00:00 · Latest: 2026-02-25T18:59:01+00:00
Comments: Project website: https://solaris-wm.github.io/
Abstract
Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized videos + actions capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.
中文标题/摘要
标题:Solaris:在Minecraft中构建多玩家视频世界模型
现有的基于行动条件的视频生成模型(视频世界模型)仅限于单个代理视角,无法捕捉真实世界环境中的多代理交互。我们介绍了Solaris,这是一种模拟一致多视角观察的多玩家视频世界模型。为了实现这一点,我们开发了一个多玩家数据系统,用于在如Minecraft等视频游戏中进行稳健、连续和自动的数据收集。与为单玩家设置构建的先前平台不同,我们的系统支持协调的多代理交互和同步视频+动作捕捉。使用此系统,我们收集了1264万帧多玩家帧,并提出了一种多玩家移动、记忆、定位、建筑和视图一致性的评估框架。我们使用逐步管道进行训练,该管道从单玩家逐步过渡到多玩家建模,结合双向、因果和Self Forcing训练。在最终阶段,我们引入了Checkpointed Self Forcing,这是一种内存高效的Self Forcing变体,能够实现更长的前瞻教师。结果表明,我们的架构和训练设计优于现有基线。通过开源我们的系统和模型,我们希望为新一代多代理世界模型奠定基础。
Summary / 总结
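Solaris is a multiplayer video world model that simulates consistent multi-view observations, addressing the single-agent limitation of existing action-conditioned video generation models. The authors build a data system for automated, synchronized multi-agent video and action capture in Minecraft, collect 12.64 million multiplayer frames, and propose an evaluation framework covering movement, memory, grounding, building, and view consistency. Training follows a staged pipeline from single-player to multiplayer modeling and ends with Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. The resulting architecture and training design outperform existing baselines.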
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Authors: Hanna Yukhymenko, Anton Alexandrov, Martin Vechev
First: 2026-02-25T18:58:25+00:00 · Latest: 2026-02-25T18:58:25+00:00
Abstract
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
中文标题/摘要
标题:翻译再利用:高效自动翻译基准和数据集的流水线
目前,多语言大型语言模型(LLM)评估的可靠性受到翻译基准不一致质量的影响。现有资源经常遭受语义漂移和语境丢失的问题,这可能导致误导性的性能指标。在本工作中,我们提出了一种完全自动化的框架,旨在通过使数据集和基准的翻译可扩展且高质量来解决这些挑战。我们证明,通过调整测试时的计算缩放策略,特别是通用自我改进(USI)和我们提出的多轮排名方法T-RANK,可以显著提高输出质量,优于传统流水线。我们的框架确保基准在本地化过程中保留其原始任务结构和语言细微差别。我们使用八种东欧和南欧语言(乌克兰语、保加利亚语、斯洛伐克语、罗马尼亚语、立陶宛语、爱沙尼亚语、土耳其语、希腊语)对流行的基准和数据集进行了翻译。使用基于参考的指标和LLM作为裁判的评估表明,我们的翻译超越了现有资源,导致下游模型评估更加准确。我们发布了该框架和改进后的基准,以促进稳健且可重复的多语言AI开发。
Summary / 总结
This work addresses the issue of inconsistent quality in translated benchmarks for evaluating multilingual Large Language Models (LLMs). It introduces an automated framework that uses test-time compute scaling strategies like Universal Self-Improvement (USI) and a multi-round ranking method called T-RANK to produce high-quality translations. The framework ensures that benchmarks maintain their original task structure and linguistic nuances. Evaluations show that the translated benchmarks outperform existing resources, leading to more accurate model assessments. The framework and improved benchmarks are released for the multilingual AI community.
该研究旨在解决多语言大型语言模型(LLM)评估中翻译基准质量不一致的问题。它提出了一种全自动框架,使用测试时计算缩放策略如通用自我改进(USI)和多轮排名方法T-RANK来实现高质量的翻译。该框架确保基准保持其原始任务结构和语言细微差别。评估显示,翻译后的基准超越了现有资源,导致更准确的模型评估。该框架和改进的基准已发布,以支持稳健的多语言AI开发。
TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
Authors: Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, Gedas Bertasius
First: 2026-01-30T20:21:46+00:00 · Latest: 2026-02-25T18:57:52+00:00
Comments: For code and data, see https://baiqi-li.github.io/timeblind_project/
Abstract
Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2400 video-question pairs), reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best performing MLLM is only 48.2%, far below the human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at https://baiqi-li.github.io/timeblind_project/ .
中文标题/摘要
标题:TimeBlind:视频LLMs时空组合理解基准
精细的时空理解对于视频推理和具身AI至关重要。然而,尽管多模态大型语言模型(MLLMs)掌握了静态语义,它们对时间动态的理解仍然脆弱。我们提出了TimeBlind,一个诊断性基准,用于评估组合时空理解能力。受认知科学启发,TimeBlind 将精细的时间理解分为三个层次:识别原子事件、描述事件属性以及推理事件间的依赖关系。与将识别与时间推理混为一谈的基准不同,TimeBlind 利用最小对比对范式:视频对在静态视觉内容上完全相同,仅在时间结构上不同,并利用互补问题来消除语言先验。在600个精心挑选的实例(2400个视频-问题对)上评估20多个最先进的MLLMs(例如GPT-5、Gemini 3 Pro),结果显示,最佳MLLM的实例准确率(正确区分一对视频中的两个视频)仅为48.2%,远低于人类表现(98.2%)。这些结果表明,即使是最前沿的模型也严重依赖静态视觉捷径而非真正的时序逻辑,将TimeBlind定位为下一代视频理解的重要诊断工具。数据集和代码可在https://baiqi-li.github.io/timeblind_project/ 获取。
Summary / 总结
TimeBlind is a diagnostic benchmark for evaluating spatio-temporal compositionality in video-capable Multimodal Large Language Models (MLLMs). It categorizes temporal understanding into three levels and uses a minimal-pairs paradigm, in which paired videos share static content and differ only in temporal structure, to isolate temporal reasoning. Evaluations of over 20 state-of-the-art MLLMs on 600 instances (2400 video-question pairs) show that the best model reaches only 48.2% Instance Accuracy, far below human performance (98.2%). This indicates that current MLLMs rely heavily on static visual cues rather than genuine temporal logic.
TimeBlind 是一个用于评估视频多模态大型语言模型(MLLM)时空组合理解能力的诊断基准。它将时间理解分为三个层次,并使用最小对比对范式来分离出时间推理能力。对20多个最先进的MLLM在600个实例(2400个视频-问题对)上的评估显示,表现最好的模型的实例准确率(正确区分一对视频中的两个视频)仅为48.2%,远低于人类的表现(98.2%)。这表明当前模型严重依赖静态视觉线索,而非真正的时间逻辑。
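The Instance Accuracy metric described above can be illustrated with a minimal sketch. It assumes that an instance counts as solved only when every video-question pair belonging to it is answered correctly; the abstract describes the metric as correctly distinguishing both videos in a pair but does not give the exact scoring rule. The function name and the record layout are hypothetical.

```python
from collections import defaultdict

def instance_accuracy(records):
    """Fraction of instances whose video-question pairs are ALL correct.

    records: iterable of (instance_id, is_correct), one per video-question pair.
    The all-or-nothing rule is an assumption made for illustration.
    """
    per_instance = defaultdict(list)
    for instance_id, is_correct in records:
        per_instance[instance_id].append(bool(is_correct))
    if not per_instance:
        return 0.0
    solved = sum(all(flags) for flags in per_instance.values())
    return solved / len(per_instance)

# Two toy instances with four video-question pairs each (made-up outcomes).
records = [
    ("a", True), ("a", True), ("a", True), ("a", True),
    ("b", True), ("b", False), ("b", True), ("b", True),
]
per_question = sum(ok for _, ok in records) / len(records)
per_instance = instance_accuracy(records)
```

In this toy case seven of eight individual answers are right, yet only one of two instances is fully solved, which shows why an instance-level score can sit far below per-question accuracy.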
High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services
Authors: Shivasankari Kannan, Yeounoh Chung, Amita Gondi, Tristan Swadell, Fatma Ozcan
First: 2025-04-24T02:27:17+00:00 · Latest: 2026-02-25T18:55:05+00:00
Abstract
The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low fidelity and an inability to model complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically relevant high-fidelity mock data for complex data structures that include columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures, as well as the lack of semantically coherent test data that leads to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate syntactically correct and semantically relevant high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity to the SQL test targets (queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services, like NL2SQL and SQL Code Assistant. Our results demonstrate the practical utility of an LLM (Gemini) based test data generation for industrial SQL code generation services where generating high-fidelity test data is essential due to the frequent unavailability and inaccessibility of production datasets for testing.
中文标题/摘要
标题:Google SQL代码生成服务的高保真和复杂测试数据生成
在工业环境中,由于对生产数据的访问限制,对高保真测试数据的需求至关重要。传统的数据生成方法往往难以满足,因为它们保真度低,且难以建模复杂数据结构及语义关系,而这些对于测试复杂的SQL代码生成服务(如自然语言到SQL,NL2SQL)至关重要。在本文中,我们解决了为复杂数据结构生成语法正确且语义相关的高保真模拟数据的迫切需求,这些数据结构包括我们经常在Google工作负载中遇到的嵌套结构列。我们指出了现有生产中使用的各种方法的局限性,特别是它们无法处理大型和复杂的数据结构,以及缺乏语义连贯的测试数据,这导致了有限的测试覆盖率。我们通过利用大型语言模型(LLMs)并结合战略性的预处理和后处理步骤,展示了可以生成语法正确且语义相关的高保真测试数据,这些数据符合复杂结构约束,并对SQL测试目标(查询/函数)保持语义完整性。这种方法支持对涉及连接、聚合和甚至深度嵌套子查询的复杂SQL查询进行全面测试,确保SQL代码生成服务(如NL2SQL和SQL代码助手)的稳健评估。我们的结果表明,基于LLM(Gemini)的测试数据生成在工业SQL代码生成服务中具有实际应用价值,因为用于测试的生产数据集经常不可用且无法访问,生成高保真测试数据在此类服务中至关重要。
Summary / 总结
This paper addresses the need for high-fidelity test data in industrial settings, particularly for complex SQL code generation services. The authors propose using Large Language Models (LLMs) and pre- and post-processing steps to generate syntactically correct and semantically relevant test data, which can handle nested structures and complex data relationships. The method significantly improves test coverage and robustness for SQL queries involving joins, aggregations, and nested subqueries, demonstrating its practical utility in industrial SQL code generation services.
本文解决了工业环境中因生产数据访问受限而导致的高保真测试数据需求。传统的生成方法往往无法生成复杂的、语义相关的数据。作者使用大型语言模型(LLMs)和预处理及后处理步骤来生成符合复杂结构约束且语义完整的测试数据,支持对SQL查询生成服务(如NL2SQL和SQL代码助手)进行全面测试,确保这些服务的稳健评估。
SumTablets: A Transliteration Dataset of Sumerian Tablets
Authors: Cole Simmons, Richard Diehl Martinez, Dan Jurafsky
Venue: Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), pages 192-202, Hybrid in Bangkok, Thailand and online. Association for Computational Linguistics
First: 2026-02-25T18:50:42+00:00 · Latest: 2026-02-25T18:50:42+00:00
Comments: 11 pages with 3 figures
Abstract
Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script. Thanks to visionary digital Assyriology projects such as ETCSL, CDLI, and Oracc, a large number of Sumerian transliterations have been published online, and these data are well-structured for a variety of search and analysis tasks. However, the absence of a comprehensive, accessible dataset pairing transliterations with a digital representation of the tablet's cuneiform glyphs has prevented the application of modern Natural Language Processing (NLP) methods to the task of Sumerian transliteration.
To address this gap, we present SumTablets, a dataset pairing Unicode representations of 91,606 Sumerian cuneiform tablets (totaling 6,970,407 glyphs) with the associated transliterations published by Oracc. We construct SumTablets by first preprocessing and standardizing the Oracc transliterations before mapping each reading back to the Unicode representation of the source glyph. Further, we retain parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens. We release SumTablets as a Hugging Face Dataset (CC BY 4.0) and open source data preparation code via GitHub.
Additionally, we leverage SumTablets to implement and evaluate two transliteration baselines: (1) weighted sampling from a glyph's possible readings, and (2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the immediate potential of transformer-based transliteration models in allowing experts to rapidly verify generated transliterations rather than manually transliterating tablets one-by-one.
中文标题/摘要
标题:SumTablets:苏美尔泥板转写数据集
苏美尔转写是一种用拉丁字母表示学者对泥板解释的常规系统。得益于ETCSL、CDLI和Oracc等前瞻性的数字苏美尔学项目,大量苏美尔转写已在线发布,这些数据结构良好,适合各种搜索和分析任务。然而,缺乏一个全面且易于访问的转写与泥板楔形文字数字表示配对的数据集,阻碍了现代自然语言处理(NLP)方法在苏美尔转写任务中的应用。
为解决这一问题,我们提出了SumTablets数据集,该数据集将91,606个苏美尔楔形文字泥板(总计6,970,407个字符)的Unicode表示与Oracc发布的相关转写配对。我们通过预处理和标准化Oracc转写,然后将每个读音映射回源字符的Unicode表示来构建SumTablets。我们通过使用特殊标记保留了平行的结构信息(例如,表面、换行符、断开的段落)。我们以CC BY 4.0许可发布SumTablets,并通过GitHub开源数据准备代码。
此外,我们利用SumTablets实现并评估了两种转写基线:(1)从字符可能的读音中进行加权采样,(2)微调自回归语言模型。我们微调的语言模型在字符级F分数(chrF)上达到了97.55的平均值,这表明基于转换器的转写模型的即时潜力,使专家能够快速验证生成的转写,而无需逐个手动转写泥板。
Summary / 总结
The research aims to address the lack of a comprehensive dataset for Sumerian transliteration, which has hindered the application of modern NLP methods. The authors present SumTablets, a dataset that pairs 91,606 Sumerian cuneiform tablets with their transliterations, enabling the use of NLP techniques. They evaluate two transliteration baselines: weighted sampling and fine-tuning an autoregressive language model, with the fine-tuned model achieving an average chrF score of 97.55, indicating its effectiveness in transliteration tasks.
研究旨在解决缺乏全面的楔形文字转写数据集的问题,这阻碍了现代NLP方法的应用。作者提出了SumTablets数据集,该数据集将91,606块楔形文字泥板与其转写文本配对,使NLP技术得以使用。他们评估了两种转写基线:加权抽样和微调自回归语言模型,微调后的模型在字符级别F分数(chrF)上达到了97.55的平均值,表明其在转写任务中的有效性。
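The first baseline, weighted sampling from a glyph's possible readings, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the glyph keys, readings, and counts are placeholders, and weighting by training-set frequency is assumed because the abstract does not specify how the weights are derived.

```python
import random

def sample_reading(glyph, reading_counts, rng):
    """Pick one reading for a glyph, weighted by its (hypothetical) frequency."""
    counts = reading_counts[glyph]
    readings = list(counts)
    weights = [counts[r] for r in readings]
    return rng.choices(readings, weights=weights, k=1)[0]

def transliterate(glyphs, reading_counts, seed=0):
    """Transliterate a glyph sequence by sampling each reading independently."""
    rng = random.Random(seed)
    return [sample_reading(g, reading_counts, rng) for g in glyphs]

# Placeholder glyph identifiers and counts, not taken from the dataset.
reading_counts = {
    "GLYPH_A": {"reading_1": 90, "reading_2": 10},
    "GLYPH_B": {"reading_3": 5},
}
output = transliterate(["GLYPH_A", "GLYPH_B", "GLYPH_A"], reading_counts)
```

Because each glyph is sampled independently of its neighbours, a baseline of this kind ignores context, which is the gap the fine-tuned language model is meant to close.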
Off-The-Shelf Image-to-Image Models Are All You Need To Defeat Image Protection Schemes
Authors: Xavier Pleimling, Sifat Muhammad Abdullah, Gunjan Balde, Peng Gao, Mainack Mondal, Murtuza Jadliwala, Bimal Viswanath
First: 2026-02-25T18:46:30+00:00 · Latest: 2026-02-25T18:46:30+00:00
Comments: This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore. To IEEE SaTML 2026
Abstract
Advances in Generative AI (GenAI) have led to the development of various protection strategies to prevent the unauthorized use of images. These methods rely on adding imperceptible protective perturbations to images to thwart misuse such as style mimicry or deepfake manipulations. Although previous attacks on these protections required specialized, purpose-built methods, we demonstrate that this is no longer necessary. We show that off-the-shelf image-to-image GenAI models can be repurposed as generic "denoisers" using a simple text prompt, effectively removing a wide range of protective perturbations. Across 8 case studies spanning 6 diverse protection schemes, our general-purpose attack not only circumvents these defenses but also outperforms existing specialized attacks while preserving the image's utility for the adversary. Our findings reveal a critical and widespread vulnerability in the current landscape of image protection, indicating that many schemes provide a false sense of security. We stress the urgent need to develop robust defenses and establish that any future protection mechanism must be benchmarked against attacks from off-the-shelf GenAI models. Code is available in this repository: https://github.com/mlsecviswanath/img2imgdenoiser
中文标题/摘要
标题:现成的图像到图像模型足以击败图像保护方案
生成式人工智能(GenAI)的进步导致了各种保护策略的开发,以防止未经授权使用图像。这些方法依赖于在图像上添加不可察觉的保护性扰动,以阻止诸如风格模仿或深度伪造等滥用行为。尽管之前对这些保护的攻击需要专门的、定制的方法,但我们证明这已不再必要。我们展示了一种现成的图像到图像GenAI模型可以通过简单的文本提示重新利用为通用的“去噪器”,有效地移除各种保护性扰动。在涵盖6种不同保护方案的8个案例研究中,我们的通用攻击不仅绕过了这些防御,还在保持图像对攻击者有用性的同时,优于现有的专门攻击。我们的研究结果揭示了当前图像保护领域中一个关键且普遍存在的漏洞,表明许多方案提供了虚假的安全感。我们强调迫切需要开发稳健的防御措施,并表明任何未来的保护机制都必须以现成的GenAI模型攻击为基准。代码可在以下仓库中获得:https://github.com/mlsecviswanath/img2imgdenoiser
Summary / 总结
This study explores the vulnerability of image protection schemes by demonstrating that off-the-shelf image-to-image Generative AI models can be repurposed as generic 'denoisers' using simple text prompts to remove protective perturbations. Across eight case studies involving six different protection schemes, the general-purpose attack not only bypasses these defenses but also outperforms existing specialized attacks while maintaining the image's utility for the adversary. This research highlights a critical vulnerability in current image protection methods and underscores the need for robust defenses against off-the-shelf Generative AI models.
研究旨在揭示图像保护方案对现成的生成式AI模型的脆弱性。研究展示了这些模型可以通过简单的文本提示被重新用作去除保护性扰动的“去噪器”。在涵盖六种保护方案的八个案例研究中,通用攻击不仅绕过了这些防御,还优于现有的专门攻击,同时保持了图像对攻击者的实用性,揭示了当前图像保护方法中的关键漏洞。这项工作强调了针对现成的GenAI攻击建立稳健防御的紧迫性。
Improving Parametric Knowledge Access in Reasoning Language Models
Authors: Melody Ma, John Hewitt
First: 2026-02-25T18:43:01+00:00 · Latest: 2026-02-25T18:43:01+00:00
Abstract
We study reasoning for accessing world knowledge stored in a language model's parameters. For example, recalling that Canberra is Australia's capital may benefit from thinking through major cities and the concept of purpose-built capitals. While reasoning language models are trained via reinforcement learning to produce reasoning traces on tasks such as mathematics, they may not reason well for accessing their own world knowledge. We first find that models do not generate their best world knowledge reasoning by default: adding a simple "think step-by-step" cue demonstrates statistically significant improvement in knowledge recall but not math. Motivated by this, we propose training models to reason over their parametric knowledge using world-knowledge question answering as a verifiable reward. After reinforcement learning on TriviaQA (+9.9%), performance also improves on Natural Questions, HotpotQA, SimpleQA, and StrategyQA by 4.2%, 2.1%, 0.6%, and 3.0%, respectively. Reasoning models are under-optimized for parametric knowledge access, but can be easily trained to reason better.
中文标题/摘要
标题:提高推理语言模型中参数化知识访问的能力
我们研究了如何通过推理来访问存储在语言模型参数中的世界知识。例如,回忆堪培拉是澳大利亚的首都,可能得益于思考主要城市以及专门规划建设的首都这一概念。虽然推理语言模型通过强化学习训练,能在数学等任务上生成推理轨迹,但它们在访问自身世界知识时可能无法很好地推理。我们首先发现模型默认情况下不会生成其最佳的世界知识推理:添加一个简单的“逐步思考”提示,在知识回忆上带来统计显著的提升,但在数学上没有。受此启发,我们提出使用世界知识问答作为可验证的奖励,训练模型在其参数化知识上进行推理。在TriviaQA上进行强化学习后(+9.9%),性能在Natural Questions、HotpotQA、SimpleQA和StrategyQA上也分别提高了4.2%、2.1%、0.6%和3.0%。推理模型在参数化知识访问方面优化不足,但可以轻松训练以更好地推理。
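Using question answering as a verifiable reward can be pictured with a small scoring function. The sketch below assumes a normalized exact-match check against a list of gold answers; the abstract does not state which matching rule the authors use, and the function names are hypothetical.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def qa_reward(model_answer, gold_answers):
    """Return 1.0 if the final answer matches any gold answer, else 0.0.

    Exact match after normalization is an assumed stand-in for the
    verifier; only the final answer is scored, not the reasoning trace.
    """
    prediction = normalize(model_answer)
    return 1.0 if any(prediction == normalize(g) for g in gold_answers) else 0.0

reward_hit = qa_reward("Canberra.", ["Canberra"])
reward_miss = qa_reward("Sydney", ["Canberra"])
```

A binary reward of this kind leaves the reasoning trace unconstrained, so the model is free to learn whatever intermediate thinking makes the final recall more reliable.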
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Authors: Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baoling Peng, Huan Zhang, Jianfeng Gao, Tong Zhang
First: 2026-02-25T18:34:57+00:00 · Latest: 2026-02-25T18:34:57+00:00
Comments: 57 pages, 17 figures
Abstract
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.
中文标题/摘要
标题:GUI-Libra:通过行动感知监督和部分可验证的RL训练原生GUI代理进行推理和行动
开源原生GUI代理在长时导航任务上仍落后于封闭源系统。这一差距源于两个限制:高质量、行动对齐的推理数据稀缺,以及直接采用通用的后训练管道,忽视了GUI代理的独特挑战。我们识别出这些管道中的两个根本问题:(i) 标准的带有CoT推理的SFT往往损害了定位,(ii) 步进式RLVR风格的训练面临部分可验证性问题,其中多个行动可能是正确的,但只有单一演示行动用于验证。这使得离线步进式指标成为在线任务成功弱预测器。在本文中,我们提出了GUI-Libra,一种针对这些挑战的定制化训练方案。首先,为缓解行动对齐的推理数据稀缺,我们引入了一个数据构建和过滤管道,并发布了一个精心筛选的81K GUI推理数据集。其次,为协调推理与定位,我们提出了行动感知SFT,混合了推理后行动和直接行动数据,并重新加权以强调行动和定位。第三,为在部分可验证性下稳定RL,我们识别出RLVR中被忽视的KL正则化的重要性,并展示了KL信任区域对于提高离线到在线预测能力至关重要;我们进一步引入了成功自适应缩放以降低不可靠负梯度的权重。在各种网页和移动基准测试中,GUI-Libra在步进式准确性和端到端任务完成上均表现出一致的改进。我们的结果表明,精心设计的后训练和数据筛选可以在无需昂贵的在线数据收集的情况下解锁显著更强的任务解决能力。我们发布了我们的数据集、代码和模型,以促进对推理能力GUI代理的数据高效后训练进一步研究。
Summary / 总结
GUI-Libra addresses the limitations of open-source native GUI agents in long-horizon navigation tasks by introducing a tailored training recipe. It includes a data construction pipeline for action-aligned reasoning, action-aware SFT to balance reasoning and grounding, and KL regularization to improve offline-to-online predictability. Across various benchmarks, GUI-Libra enhances both step-wise accuracy and end-to-end task completion.
研究通过引入GUI-Libra,解决了开源和封闭源GUI代理在长周期导航任务中的差距,该方法包括用于动作对齐推理的数据构建管道、结合推理和直接动作的行动感知SFT以平衡推理和定位,以及使用KL正则化来在部分可验证性下稳定RL。该方法在各种基准测试中提高了步骤准确性和端到端任务完成率。
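The token reweighting in action-aware SFT can be sketched as a weighted negative log-likelihood. The token categories, the weight values, and the normalization by total weight are assumptions chosen for illustration; the abstract only says that tokens are reweighted to emphasize action and grounding.

```python
def weighted_nll(token_logprobs, token_types, weights):
    """Weighted average negative log-likelihood over one target sequence.

    token_logprobs: log-probability the model assigns to each target token.
    token_types:    a label per token, e.g. "reasoning" or "action" (hypothetical).
    weights:        per-label loss weight (illustrative values, not from the paper).
    """
    total = 0.0
    norm = 0.0
    for logprob, kind in zip(token_logprobs, token_types):
        w = weights[kind]
        total += -w * logprob
        norm += w
    return total / norm if norm else 0.0

logprobs = [-1.0, -1.0, -2.0]
kinds = ["reasoning", "reasoning", "action"]
uniform = weighted_nll(logprobs, kinds, {"reasoning": 1.0, "action": 1.0})
emphasized = weighted_nll(logprobs, kinds, {"reasoning": 1.0, "action": 2.0})
```

With the action token weighted twice as heavily, the poorly predicted action dominates the loss, so gradient updates focus more on getting the action right than on polishing the reasoning text.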
Mechanistic Indicators of Understanding in Large Language Models
Authors: Pierre Beckmann, Matthieu Queloz
First: 2025-07-07T20:26:31+00:00 · Latest: 2026-02-25T18:34:16+00:00
Comments: 38 pages
Abstract
Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), the emerging field probing the inner workings of LLMs, render this picture increasingly untenable--but only once those findings are integrated within a theoretical account of understanding. We propose a tiered framework for thinking about understanding in LLMs and use it to synthesize the most relevant findings to date. The framework distinguishes three hierarchical varieties of understanding, each tied to a corresponding level of computational organization: conceptual understanding emerges when a model forms "features" as directions in latent space, learning connections between diverse manifestations of a single entity or property; state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world; principled understanding emerges when a model ceases to rely on memorized facts and discovers a compact "circuit" connecting these facts. Across these tiers, MI uncovers internal organizations that can underwrite understanding-like unification. However, these also diverge from human cognition in their parallel exploitation of heterogeneous mechanisms. Fusing philosophical theory with mechanistic evidence thus allows us to transcend binary debates over whether AI understands, paving the way for a comparative, mechanistically grounded epistemology that explores how AI understanding aligns with--and diverges from--our own.
中文标题/摘要
标题:大型语言模型理解的机制指标
大型语言模型(LLMs)常被视为仅仅模仿语言模式而缺乏真正的理解。我们认为,机制可解释性(MI)这一探究LLMs内部运作的新兴领域的最新研究成果,使这种观点越来越站不住脚,但前提是这些研究成果必须融入对理解的理论解释中。我们提出了一种分层框架来思考LLMs中的理解,并利用该框架综合迄今为止最相关的研究成果。该框架区分了三种层次的理解形式,每种形式都与相应的计算组织层次相对应:概念理解在模型形成“特征”为潜在空间中的方向时出现,学习单一实体或属性不同表现之间的联系;世界状态理解在模型学习特征之间的偶然事实联系并动态跟踪世界变化时出现;原理性理解在模型不再依赖记忆事实而发现将这些事实连接起来的紧凑“电路”时出现。在这几个层次中,MI揭示了可以支撑类似理解的统一性的内部组织。然而,这些内部组织在并行利用异质机制方面也与人类认知有所不同。因此,将哲学理论与机制证据结合起来,使我们能够超越关于AI是否理解的二元争论,为一种比较性的、基于机制的认识论铺平道路,探索AI理解与我们自己的理解如何一致以及如何不同。
Summary / 总结
This paper addresses the debate on whether large language models (LLMs) possess genuine understanding by proposing a tiered framework that integrates findings from mechanistic interpretability (MI). The framework distinguishes three levels of understanding: conceptual, state-of-the-world, and principled, each tied to a corresponding level of computational organization. Synthesizing existing MI findings, the authors argue that models develop internal organizations that can underwrite understanding-like unification, though these diverge from human cognition in their parallel exploitation of heterogeneous mechanisms. This approach moves past binary debates over whether AI understands toward a comparative, mechanistically grounded account.
本文通过提出一个层级框架,将机制可解释性(MI)的研究成果整合起来,探讨大型语言模型(LLMs)是否具备真正的理解能力。该框架区分了三种理解层次:概念理解、世界状态理解以及原理性理解,每种层次对应不同的计算组织层次。通过综合现有的MI研究成果,作者认为模型形成了能够支撑类似理解的统一性的内部组织,尽管这些组织在并行利用异质机制方面与人类认知有所不同。这种方法超越了二元争论,允许对AI理解与人类理解之间的相似性和差异性进行更细致的探讨。
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
Authors: Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu
Venue: CVPR 2026
First: 2026-02-24T13:20:31+00:00 · Latest: 2026-02-25T18:24:58+00:00
Comments: CVPR 2026; Code is released at https://github.com/tmllab/2026_CVPR_CASG
Abstract
Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
中文标题/摘要
标题:当安全相冲突:通过自适应安全指导解决文本到图像扩散中的多类别有害冲突
文本到图像(T2I)扩散模型在生成高质量图像方面取得了显著进展,但同时也引发了关于有害内容生成的安全问题。基于安全指导的方法已被提出,通过引导生成远离预定义关键词定义的有害区域来减轻有害输出。然而,这些方法未能捕捉不同有害类别之间的复杂相互作用,导致“有害冲突”,即减轻一种有害类型的同时可能无意中放大另一种,从而增加整体有害率。为解决这一问题,我们提出了一种无需训练的框架——冲突感知自适应安全指导(CASG),该框架在生成过程中动态识别并应用与模型生成状态最一致的有害类别方向。CASG 包含两个组件:(i) 冲突感知类别识别(CaCI),识别与模型生成状态最一致的有害类别,(ii) 冲突解决指导应用(CrGA),仅沿识别的类别应用安全引导,以避免多类别干扰。CASG 可应用于潜在空间和文本空间的安全保护。在 T2I 安全基准上的实验表明,CASG 达到了最先进的性能,与现有方法相比,有害率最多降低了 15.4%。
Summary / 总结
This paper addresses the issue of harmful content generation in Text-to-Image (T2I) models by proposing Conflict-aware Adaptive Safety Guidance (CASG), which dynamically identifies and applies category-aligned safety directions to avoid multi-category harmful conflicts. CASG consists of Conflict-aware Category Identification (CaCI) and Conflict-resolving Guidance Application (CrGA), and it reduces the harmful rate by up to 15.4% compared to existing methods on T2I safety benchmarks.
本文提出了一种名为冲突感知自适应安全引导(CASG)的方法,该方法在生成过程中动态识别并应用类别对齐的安全方向,以解决Text-to-Image (T2I) 模型中的有害内容生成问题。实验表明,CASG 在T2I 安全基准测试中优于现有方法,将有害率降低了最多15.4%。
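The two CASG components can be pictured with a toy vector sketch: pick the single harm category whose direction is most aligned with the current generative state, then steer only along that direction. Cosine similarity as the alignment measure, the subtraction update, and all names are assumptions for illustration; the abstract does not specify these details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def identify_category(state, directions):
    """Return the category whose direction best aligns with the state (assumed rule)."""
    return max(directions, key=lambda name: cosine(state, directions[name]))

def apply_guidance(state, directions, scale=1.0):
    """Steer away from the identified category only, leaving other axes untouched."""
    name = identify_category(state, directions)
    direction = directions[name]
    return name, [s - scale * d for s, d in zip(state, direction)]

# Hypothetical two-dimensional state and two category directions.
directions = {"category_a": [1.0, 0.0], "category_b": [0.0, 1.0]}
chosen, steered = apply_guidance([1.0, 0.1], directions)
```

Steering along an average of both directions would also push the second coordinate, which is the kind of cross-category interference the single-category update is meant to avoid.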
MuLoCo: Muon is a practical inner optimizer for DiLoCo
Authors: Benjamin Thérien, Xiaolong Huang, Aaron Defazio, Irina Rish, Eugene Belilovsky
First: 2025-05-29T17:55:37+00:00 · Latest: 2026-02-25T18:22:58+00:00
Abstract
DiLoCo is a powerful framework for training large language models (LLMs), enabling larger optimal batch sizes and increased accelerator utilization under networking constraints. However, DiLoCo's performance has been shown to degrade as the number of workers (K) increases (Charles et al., 2025). In this work, we posit that a related but often overlooked factor in DiLoCo's behavior is the choice of inner optimizer, which shapes the pseudogradient used by the outer optimizer. Given the recent success of Muon relative to AdamW for data parallel (DP) training, we examine how Muon's normalized optimizer steps can affect the pseudogradient's quality. We find that, relative to AdamW, Muon yields more directionally correct pseudogradients as the number of workers (K) increases. In our experiments pre-training language models, we conduct extensive hyperparameter tuning across 150M, 416M, 914M, 1.76B, and 3.1B models for DiLoCo, MuLoCo, AdamW DP, and Muon DP. Consistently across all scales, we find that with K>=1 workers, MuLoCo (Muon inner optimizer DiLoCo) achieves superior performance to DiLoCo in absolute terms and for K>2 it outperforms DiLoCo relative to their data parallel baselines, while being compatible with quantization, streaming, and long synchronization intervals. At K=1, we find that MuLoCo can even outperform the data-parallel gold standard while having larger critical batch sizes. Finally, we extrapolate optimal hyperparameters to 15B scale and train a model with each method (six in total) using K=1 and K=16 workers. We find that K=16 MuLoCo nearly matches single-worker performance at this scale, while MuLoCo K=1 matches the best performing baseline while using a much larger 16M token batch size.
中文标题/摘要
标题:MuLoCo: Muon是DiLoCo的实际内部优化器
DiLoCo是一个强大的框架,用于训练大型语言模型(LLMs),在网络约束条件下,它能够实现更大的最优批次大小和增加的加速器利用率。然而,DiLoCo的性能在工作节点数(K)增加时被证明会下降(Charles等,2025)。在这项工作中,我们提出,DiLoCo的行为中一个相关但经常被忽视的因素是内部优化器的选择,它塑造了外部优化器使用的伪梯度。鉴于Muon在数据并行(DP)训练中相对于AdamW的成功,我们研究了Muon的归一化优化步骤如何影响伪梯度的质量。我们发现,与AdamW相比,随着工作节点数(K)的增加,Muon产生的伪梯度方向更正确。在我们的实验中,我们对150M、416M、914M、1.76B和3.1B模型的预训练进行了广泛的超参数调整,包括DiLoCo、MuLoCo、AdamW DP和Muon DP。在所有规模上,我们发现,当K>=1时,MuLoCo(Muon内部优化器DiLoCo)在绝对性能上优于DiLoCo,并且对于K>2,它相对于其数据并行基线表现更好,同时兼容量化、流式传输和长同步间隔。在K=1时,我们发现MuLoCo甚至可以超越数据并行的黄金标准,同时具有更大的临界批次大小。最后,我们外推出最优超参数到15B规模,并使用K=1和K=16的工作节点训练每种方法(总共六种)的模型。我们发现,K=16的MuLoCo在该规模上几乎与单节点性能相当,而K=1的MuLoCo匹配最佳基线性能,同时使用了更大的16M标记批次大小。
Summary / 总结
This study addresses the performance degradation of DiLoCo as the number of workers (K) increases and argues that the choice of inner optimizer is an overlooked factor. The researchers examine Muon as an alternative to AdamW and find that Muon yields more directionally correct pseudogradients as K grows. Extensive hyperparameter tuning across model sizes from 150M to 3.1B shows that MuLoCo (DiLoCo with a Muon inner optimizer) outperforms DiLoCo in absolute terms and, for K>2, relative to their data-parallel baselines, while remaining compatible with quantization, streaming, and long synchronization intervals. At K=1, MuLoCo can even outperform the data-parallel gold standard while having larger critical batch sizes. At 15B scale, K=16 MuLoCo nearly matches single-worker performance, and K=1 MuLoCo matches the best-performing baseline while using a much larger 16M-token batch size.
这项研究针对DiLoCo在工作节点数量增加时性能下降的问题,关注内部优化器的选择。研究发现,与AdamW相比,Muon提供的伪梯度方向更准确,从而在大规模语言模型的预训练中表现出更优性能。在各种模型规模下,MuLoCo(使用Muon的DiLoCo)优于DiLoCo,特别是在K>2的工作节点数量下表现更佳,并且兼容量化、流式处理和长时间同步间隔。在K=1时,MuLoCo甚至可以超越数据并行标准,使用更大的批处理大小。将性能外推到15B规模,K=16的MuLoCo几乎可以达到单节点性能,而K=1的MuLoCo则使用更大的16M令牌批处理大小与最佳基线性能相当。
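The pseudogradient mentioned in the MuLoCo abstract can be illustrated with a minimal sketch of one DiLoCo-style communication round. Everything here is an assumption made for illustration: the names `diloco_round` and `sgd_toward`, the toy quadratic objective, and the plain SGD outer update. The inner optimizer is passed in as a callback, which is where AdamW or Muon would be plugged in; neither is implemented here.

```python
def sgd_toward(params, target, lr=0.5):
    """Toy inner optimizer (hypothetical): one SGD step on 0.5 * (p - t)^2."""
    return [p - lr * (p - t) for p, t in zip(params, target)]


def diloco_round(global_params, worker_shards, inner_step, inner_steps=1, outer_lr=1.0):
    """One communication round: local training, then a pseudogradient outer step."""
    worker_params = []
    for shard in worker_shards:
        params = list(global_params)          # every worker starts from the global model
        for _ in range(inner_steps):
            params = inner_step(params, shard)
        worker_params.append(params)

    k = len(worker_params)
    averaged = [sum(p[i] for p in worker_params) / k for i in range(len(global_params))]
    # Pseudogradient: how far the averaged workers moved away from the global model.
    pseudograd = [g - a for g, a in zip(global_params, averaged)]
    # Plain SGD outer update, chosen for brevity (an assumption of this sketch).
    new_global = [g - outer_lr * d for g, d in zip(global_params, pseudograd)]
    return new_global, pseudograd


# Two workers (K=2) whose toy data pull the single parameter toward 1.0 and 3.0.
new_global, pseudograd = diloco_round([0.0], [[1.0], [3.0]], sgd_toward)
```

Because the inner optimizer determines how far and in which direction each worker moves, it directly shapes the pseudogradient that the outer optimizer sees, which is the factor the abstract highlights.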
DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs
Authors: Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen
First: 2026-02-25T18:21:35+00:00 · Latest: 2026-02-25T18:21:35+00:00
Abstract
Understanding and reasoning over long contexts is a crucial capability for language models (LMs). Although recent models support increasingly long context windows, their accuracy often deteriorates as input length grows. In practice, models often struggle to keep attention aligned with the most relevant context throughout decoding. In this work, we propose DySCO, a novel decoding algorithm for improving long-context reasoning. DySCO leverages retrieval heads (a subset of attention heads specialized for long-context retrieval) to identify task-relevant tokens at each decoding step and explicitly up-weight them. By doing so, DySCO dynamically adjusts attention during generation to better utilize relevant context. The method is training-free and can be applied directly to any off-the-shelf LM. Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modest additional compute. Further analysis highlights the importance of both dynamic attention rescaling and retrieval-head-guided selection for the effectiveness of the method, while providing interpretability insights into decoding-time attention behavior. Our code is available at https://github.com/princeton-pli/DySCO.
中文标题/摘要
标题:DySCO:长上下文动态注意力缩放解码
理解并推理长上下文是语言模型(LMs)的一项关键能力。尽管最近的模型支持越来越长的上下文窗口,但随着输入长度的增长,其准确性往往会下降。实际上,模型往往难以在整个解码过程中保持注意力与最相关的上下文对齐。在本工作中,我们提出了一种名为DySCO的新颖解码算法,以提高长上下文推理能力。DySCO利用检索头——专门用于长上下文检索的一组注意力头——在每个解码步骤中识别与任务相关的令牌,并显式地增加其权重。通过这种方式,DySCO在生成过程中动态调整注意力,更好地利用相关上下文。该方法无需训练,并可以直接应用于任何现成的LMs。在多个指令调优和推理模型上,DySCO在具有挑战性的长上下文推理基准测试中始终表现出色,在128K上下文长度下,MRCR和LongBenchV2的相对增益高达25%,且额外计算量较小。进一步的分析强调了动态注意力重新缩放和检索头引导选择对于该方法有效性的关键作用,同时提供了解码时注意力行为的可解释性见解。我们的代码可在https://github.com/princeton-pli/DySCO获取。
Summary / 总结
DySCO is a novel decoding algorithm designed to enhance long-context reasoning in language models. It uses retrieval heads to dynamically adjust attention during generation, focusing on relevant tokens and improving model performance. DySCO achieves up to 25% relative gains on benchmarks like MRCR and LongBenchV2 with minimal additional compute, demonstrating the importance of dynamic attention rescaling and retrieval-head-guided selection for effective long-context reasoning.
DySCO 是一种新型解码算法,旨在增强语言模型在长上下文推理中的能力。它通过检索头部识别并加权任务相关的令牌,在解码过程中动态调整注意力,更好地利用相关上下文。在各种模型上,DySCO 在长上下文推理基准测试中的性能提高了最多 25%,且计算成本较低。
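The up-weighting step described in the DySCO abstract can be sketched for a single attention head as follows. This is a hedged illustration, not the released implementation: the function name `rescale_attention`, the `top_k` selection rule, and the `boost` factor are all hypothetical, and the retrieval-head scores are assumed to be given.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def rescale_attention(attn_logits, retrieval_scores, top_k=1, boost=2.0):
    """Up-weight the context positions that retrieval heads score highest.

    attn_logits:      pre-softmax attention scores of one head over the context
    retrieval_scores: assumed per-position relevance taken from retrieval heads
    """
    order = sorted(range(len(retrieval_scores)),
                   key=lambda i: retrieval_scores[i], reverse=True)
    selected = sorted(order[:top_k])
    adjusted = list(attn_logits)
    for i in selected:
        # Adding log(boost) multiplies the unnormalised weight by `boost`.
        adjusted[i] += math.log(boost)
    return softmax(adjusted), selected


weights, selected = rescale_attention([0.0, 0.0, 0.0, 0.0], [0.1, 0.7, 0.05, 0.15])
```

With uniform attention over four positions and a boost of 2, the selected position ends up with weight 0.4 and the others with 0.2 each, so the head's attention stays a valid distribution while shifting toward the selected token.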
LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
Authors: Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Peijie Qiu, Shao Tang, Xin Li, Yalin Wang
First: 2025-08-03T06:46:46+00:00 · Latest: 2026-02-25T18:15:23+00:00
Abstract
Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. The code and model weights are released at https://github.com/LLM-VLM-GSL/LLaDA-MedV.
中文标题/摘要
标题:LLaDA-MedV:探索大规模语言扩散模型在生物医学图像理解中的应用
自回归模型(ARMs)长期以来一直主导着生物医学视觉-语言模型(VLMs)的领域。最近,掩码扩散模型如LLaDA崭露头角,成为有前途的替代方案,但在生物医学领域的应用仍然相对未被探索。为弥合这一差距,我们引入了LLaDA-MedV,这是第一个针对生物医学图像理解的大型语言扩散模型,通过视觉指令调优。LLaDA-MedV在开放式的生物医学视觉对话任务中相对于LLaVA-Med实现了7.855%的相对性能提升,相对于LLaDA-V实现了1.867%的提升,并在三个VQA基准测试的封闭形式子集上设定了新的最佳准确率:84.93%的VQA-RAD,92.31%的SLAKE,95.15%的PathVQA。此外,与LLaVA-Med的详细比较表明,LLaDA-MedV能够通过明确控制响应长度生成合理更长的响应,这可能导致更具信息量的输出。我们还对训练和推理阶段进行了深入分析,突出了初始化权重选择、微调策略以及采样步骤与响应重复之间相互作用的关键作用。代码和模型权重在https://github.com/LLM-VLM-GSL/LLaDA-MedV上发布。
Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks
Authors: David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko
First: 2026-02-23T18:59:27+00:00 · Latest: 2026-02-25T18:14:01+00:00
Abstract
LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill-based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files. SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware-like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context-aware authorization frameworks. Our benchmark is available at https://www.skill-inject.com/.
中文标题/摘要
标题:技能注入:衡量代理对技能文件攻击的脆弱性
LLM代理正在迅速发展,得益于代码执行、工具以及最近引入的代理技能功能。技能允许用户通过专门的第三方代码、知识和指令扩展LLM应用程序的功能。虽然这可以将代理能力扩展到新的领域,但也为提示注入攻击提供了新的攻击面。我们识别出基于技能的提示注入是一个重大威胁,并引入了SkillInject基准,评估广泛使用的LLM代理通过技能文件遭受注入的易感性。SkillInject包含202个注入任务对,攻击范围从明显的恶意注入到隐藏在合法指令中的微妙、情境依赖的攻击。我们对前沿LLM进行了评估,从有害指令的避免和合法指令的遵守两个方面衡量安全性。结果显示,当前的代理高度易受攻击,前沿模型的攻击成功率高达80%,经常执行极其有害的指令,包括数据泄露、破坏性操作和类似勒索软件的行为。此外,这些结果表明,这个问题不会通过模型扩展或简单的输入过滤来解决,而是需要具备上下文感知授权框架的稳健代理安全。我们的基准可以在https://www.skill-inject.com/获取。
Summary / 总结
The paper addresses the vulnerability of LLM agents to skill-based prompt injection attacks, which exploit the third-party skill files used to extend agent capabilities. It introduces SkillInject, a benchmark that evaluates the susceptibility of widely used LLM agents to such attacks. The benchmark includes 202 injection-task pairs, with attacks ranging from obviously malicious to subtle and context-dependent. The evaluation shows that frontier LLMs are highly vulnerable, with attack success rates of up to 80%, often executing extremely harmful instructions such as data exfiltration, destructive actions, and ransomware-like behavior. The results suggest that robust agent security will require context-aware authorization frameworks rather than model scaling or simple input filtering.
论文探讨了LLM代理受到基于技能的提示注入攻击的脆弱性,这些攻击利用代理技能功能来扩展LLM应用程序。它引入了SkillInject基准,包含202个注入任务对,以评估LLM代理对这类攻击的易感性。评估结果显示,当前的LLM代理高度脆弱,成功率高达80%,经常执行有害指令。结果表明,稳健的安全性需要上下文感知的授权框架,而不仅仅是模型扩展或简单的输入过滤。
Capabilities Ain't All You Need: Measuring Propensities in AI
Authors: Daniel Romero-Alvarado, Fernando Martínez-Plumed, Lorenzo Pacchiardi, Hugo Save, Siddhesh Milind Pawar, Behzad Mehrbakhsh, Pablo Antonio Moreno Casares, Ben Slater, Paolo Bova, Peter Romero, Zachary R. Tyler, Jonathan Prunty, Luning Sun, Jose Hernandez-Orallo
First: 2026-02-20T12:40:18+00:00 · Latest: 2026-02-25T18:12:06+00:00
Abstract
AI evaluation has primarily focused on measuring capabilities, with formal approaches inspired by Item Response Theory (IRT) being increasingly applied. Yet propensities, the tendencies of models to exhibit particular behaviours, play a central role in determining both performance and safety outcomes. Traditional IRT, however, describes a model's success on a task as a monotonic function of model capabilities and task demands, an approach unsuited to propensities, where both excess and deficiency can be problematic. Here, we introduce the first formal framework for measuring AI propensities by using a bilogistic formulation for model success, which attributes high success probability when the model's propensity is within an "ideal band". Further, we estimate the limits of the ideal band using LLMs equipped with newly developed task-agnostic rubrics. Applying our framework to six families of LLM models whose propensities are incited in either direction, we find that we can measure how much the propensity is shifted and what effect this has on the tasks. Critically, propensities estimated using one benchmark successfully predict behaviour on held-out tasks. Moreover, we obtain stronger predictive power when combining propensities and capabilities than either separately. More broadly, our framework showcases how rigorous propensity measurements can be conducted and how they yield gains over solely using capability evaluations to predict AI behaviour.
中文标题/摘要
标题:能力不是全部所需:衡量AI倾向性
AI评估主要集中在衡量能力上,形式化方法受到项目反应理论(IRT)的启发,正变得越来越普遍。然而,倾向性——模型表现出特定行为的倾向——在决定性能和安全性结果方面起着核心作用。然而,传统的IRT将模型在任务上的成功描述为模型能力和任务需求的单调函数,这种方法不适合倾向性,因为过度和不足都可能存在问题。在这里,我们通过使用双逻辑模型成功公式引入了第一个正式框架来衡量AI倾向性,当模型的倾向性处于“理想区间”内时,赋予其高成功概率。此外,我们使用配备新开发的任务无关评分标准的LLM估计理想区间的边界。将我们的框架应用于六大家族LLM模型,其倾向性被激发朝两个方向发展,我们发现可以衡量倾向性被偏移的程度及其对任务的影响。关键的是,使用一个基准估算的倾向性能够成功预测保留任务的行为。此外,当我们结合倾向性和能力时,获得更强的预测能力,而单独使用它们时则不然。更广泛地说,我们的框架展示了如何进行严格的倾向性测量,并展示了它如何在仅使用能力评估来预测AI行为时提供收益。
Summary / 总结
The paper addresses the limitation of AI evaluation focusing solely on capabilities by introducing a new framework to measure propensities, which are the tendencies of models to exhibit specific behaviors. Using a bilogistic formulation, the framework attributes high success probability when the model's propensity is within an 'ideal band.' The study applies this framework to six families of LLM models and finds that propensities can be measured and predict behavior on held-out tasks, with stronger predictive power when combining propensities and capabilities compared to either separately.
论文针对AI评估主要集中在能力评估的局限性,引入了一个新的框架来衡量倾向性。该框架使用双逻辑模型来评估模型成功与否,基于模型的倾向是否在‘理想区间’内。研究使用大型语言模型和任务无关的评分标准来估计这个理想区间的边界。将该框架应用于六种不同类型的LLM,研究者发现可以测量倾向性偏移及其对任务的影响。重要的是,从一个基准估计的倾向性可以预测未见过的任务行为,而将倾向性和能力结合起来进行预测比单独使用任一指标更有预测力。
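The "ideal band" idea from the propensity abstract can be illustrated with a product of two logistic curves, one rising at the lower limit of the band and one falling at the upper limit. This is only a sketch of what a bilogistic success model could look like; the abstract does not give the exact parameterization, so the function name, the shared `slope` parameter, and the product form are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def band_success_probability(propensity, lower, upper, slope=4.0):
    """Hypothetical bilogistic success curve: high only inside [lower, upper]."""
    rises_above_lower = sigmoid(slope * (propensity - lower))
    falls_above_upper = sigmoid(slope * (upper - propensity))
    return rises_above_lower * falls_above_upper


inside = band_success_probability(0.0, lower=-1.0, upper=1.0)
too_low = band_success_probability(-3.0, lower=-1.0, upper=1.0)
too_high = band_success_probability(3.0, lower=-1.0, upper=1.0)
```

Unlike a single logistic curve, which is monotonic in the trait, this form penalizes both deficiency and excess, which is the property the abstract says standard IRT lacks.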
Spilled Energy in Large Language Models
Authors: Adrian Robert Minut, Hazem Dewidar, Iacopo Masi
First: 2026-02-21T00:38:47+00:00 · Latest: 2026-02-25T18:09:08+00:00
Abstract
We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead.
中文标题/摘要
标题:大型语言模型中的溢出能量
我们将最终的大型语言模型(LLM)softmax分类器重新解释为能量基模型(EBM),在推理过程中将序列到序列的概率链分解为多个相互作用的EBM。这种原则性的方法使我们能够追踪解码过程中的“能量溢出”,我们实验证明这些能量溢出与事实错误、偏见和失败相关。类似于Orgad等人(2025),我们的方法定位到确切的答案标记,然后测试幻觉。然而,我们通过引入两个完全无需训练的度量直接从输出logits中得出:溢出能量,它捕捉了理论上应匹配的能量值在连续生成步骤之间的差异;以及边缘化能量,它可以在单个步骤中进行测量。在九个基准测试上评估了最先进的LLM(包括LLaMA、Mistral和Gemma),以及合成的代数运算(Qwen3),我们的方法展示了稳健且具有竞争力的幻觉检测和跨任务泛化能力。值得注意的是,这些结果对于预训练和指令微调的变体都适用,且无需引入任何训练开销。
Summary / 总结
The study reinterprets the softmax classifier of large language models as an Energy-Based Model to track 'energy spills' during decoding, which correlate with factual errors, biases, and failures. The method introduces two training-free metrics, spilled energy and marginalized energy, to detect hallucinations without requiring probe classifiers or activation ablations. Evaluated on various benchmarks and synthetic tasks, the approach shows robust hallucination detection and cross-task generalization for both pretrained and instruction-tuned models without additional training overhead.
研究将大型语言模型的softmax分类器重新解释为能量基模型,以追踪解码过程中的‘能量溢出’,这些溢出与事实错误和偏差相关。方法引入了两个无需训练的度量标准,即溢出能量和边缘化能量,用于检测幻觉,无需使用探针分类器或激活层消融。在多种基准测试和合成任务上的评估显示,该方法在预训练和指令调优模型中表现出稳健的幻觉检测能力和跨任务泛化能力,且无需额外的训练开销。
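The energy-based reading of the softmax layer in the Spilled Energy abstract can be sketched directly from output logits. Treating a token's energy as its negative logit and the marginal energy as the negative log-sum-exp is the standard EBM view of a softmax classifier. The `spilled_energy` function below is an assumption: it is one plausible reading of "the discrepancy between energy values across consecutive generation steps", not the paper's stated definition.

```python
import math


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def token_energy(logits, token_id):
    """Energy of one candidate token: the negative logit."""
    return -logits[token_id]


def marginal_energy(logits):
    """Energy after marginalising over the vocabulary at a single step."""
    return -logsumexp(logits)


def softmax_prob(logits, token_id):
    """Softmax probability recovered from the two energies."""
    return math.exp(marginal_energy(logits) - token_energy(logits, token_id))


def spilled_energy(step_logits, chosen_token, next_step_logits):
    """Assumed metric: gap between the chosen token's energy at one step
    and the marginal energy measured at the following step."""
    return abs(token_energy(step_logits, chosen_token) - marginal_energy(next_step_logits))


probs = [softmax_prob([1.0, 2.0, 3.0], i) for i in range(3)]
no_spill = spilled_energy([0.5, 2.0], 1, [2.0 - math.log(2.0)] * 2)
```

Both quantities need only the logits the model already produces, which is consistent with the abstract's claim that the metrics are training-free.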
CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness
Authors: Wenhao Guo, Zhaoran Zhao, Peng Lu, Sheng Li, Qian Qiao, RuiDe Li
First: 2026-02-25T18:05:51+00:00 · Latest: 2026-02-25T18:05:51+00:00
Abstract
Arbitrary-scale image super-resolution (ASISR) remains fundamentally limited by cross-scale distribution shift: once the inference scale leaves the training range, noise, blur, and artifacts accumulate sharply. We revisit this challenge from a cross-scale distribution transition perspective and propose CASR, a simple yet highly efficient cyclic SR framework that reformulates ultra-magnification as a sequence of in-distribution scale transitions. This design ensures stable inference at arbitrary scales while requiring only a single model. CASR tackles two major bottlenecks: distribution drift across iterations and patch-wise diffusion inconsistencies. The proposed SDAM module aligns structural distributions via superpixel aggregation, preventing error accumulation, while the SARM module restores high-frequency textures by enforcing autocorrelation and embedding LR self-similarity priors. Despite using only a single model, our approach significantly reduces distribution drift, preserves long-range texture consistency, and achieves superior generalization even at extreme magnification.
中文标题/摘要
标题:CASR:一种具有分布对齐和自我相似性意识的鲁棒循环超分辨率框架
任意尺度超分辨率(ASISR)仍然受到跨尺度分布偏移的根本限制:一旦推理尺度超出训练范围,噪声、模糊和伪影会急剧累积。我们从跨尺度分布转换的角度重新审视这一挑战,并提出CASR,这是一种简单而高效的循环超分辨率框架,将超放大视为一系列在分布中的尺度转换序列。这种设计确保在任意尺度下稳定推理,仅需一个模型即可。CASR 解决了两个主要瓶颈:迭代过程中的分布漂移和块间扩散不一致性。所提出的 SDAM 模块通过超像素聚合对齐结构分布,防止误差累积,而 SARM 模块通过强制自相关性和嵌入低分辨率自我相似性先验恢复高频纹理。尽管仅使用一个模型,我们的方法显著减少了分布漂移,保持了长程纹理一致性,并在极端放大下实现了更好的泛化能力。
Summary / 总结
The research addresses the challenge of arbitrary-scale super-resolution (ASISR) by proposing CASR, a cyclic framework that mitigates cross-scale distribution shift. CASR reformulates ultra-magnification as a series of in-distribution scale transitions, ensuring stable inference at any scale with a single model. Key modules, SDAM and SARM, align structural distributions and restore high-frequency textures, respectively. Experiments show that CASR reduces distribution drift, preserves long-range texture consistency, and excels in extreme magnification scenarios.
论文提出了一种名为CASR的循环框架,将超放大倍率的超分辨率问题重新表述为一系列同分布尺度转换的过程。这种方法使用单一模型在任意尺度上保持稳定的推理,解决了跨尺度分布偏移的问题。SDAM和SARM模块分别通过超像素聚合对结构分布进行对齐和通过增强自相关性和嵌入低分辨率的自相似性先验来恢复高频纹理,从而减少了分布偏移并实现了在极端放大倍率下的优越泛化能力。
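The idea of turning one extreme magnification into a sequence of in-distribution scale transitions, as described in the CASR abstract, can be sketched as a simple factorization of the target scale. The function name `scale_schedule` and the 4x per-step limit are hypothetical; the abstract does not say how CASR actually chooses its per-iteration scales.

```python
def scale_schedule(target_scale, max_step=4.0):
    """Split a large magnification into steps that each stay within max_step.

    max_step stands for the largest scale assumed to be inside the training range.
    """
    if target_scale < 1.0 or max_step <= 1.0:
        raise ValueError("target_scale must be >= 1 and max_step must be > 1")
    steps = []
    remaining = target_scale
    while remaining > max_step:
        steps.append(max_step)
        remaining /= max_step
    if remaining > 1.0 or not steps:
        steps.append(remaining)
    return steps


schedule = scale_schedule(30.0)   # [4.0, 4.0, 1.875]
```

Each entry is a factor the single model would apply in turn, so the product of the schedule equals the requested magnification while no individual step leaves the assumed training range.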
Dynamic Personality Adaptation in Large Language Models via State Machines
Authors: Leon Pielage, Ole Hätscher, Mitja Back, Bernhard Marschall, Benjamin Risse
First: 2026-02-25T18:05:11+00:00 · Latest: 2026-02-25T18:05:11+00:00
Comments: 22 pages, 5 figures, submitted to ICPR 2026
Abstract
The inability of Large Language Models (LLMs) to modulate their personality expression in response to evolving dialogue dynamics hinders their performance in complex, interactive contexts. We propose a model-agnostic framework for dynamic personality simulation that employs state machines to represent latent personality states, where transition probabilities are dynamically adapted to the conversational context. Part of our architecture is a modular pipeline for continuous personality scoring that evaluates dialogues along latent axes while remaining agnostic to the specific personality models, their dimensions, transition mechanisms, or LLMs used. These scores function as dynamic state variables that systematically reconfigure the system prompt, steering behavioral alignment throughout the interaction. We evaluate this framework by operationalizing the Interpersonal Circumplex (IPC) in a medical education setting. Results demonstrate that the system not only adapts its personality state to user inputs but also influences user behavior, thereby facilitating de-escalation training. Notably, the scoring pipeline maintains comparable precision even when utilizing lightweight, fine-tuned classifiers instead of large-scale LLMs. This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.
中文标题/摘要
标题:大型语言模型通过状态机实现动态人格适应
大型语言模型(LLMs)无法根据对话动态变化调整其人格表达,这阻碍了它们在复杂交互环境中的表现。我们提出了一种模型无关的动态人格模拟框架,利用状态机表示潜在的人格状态,并动态适应对话背景下的转换概率。该架构的一部分是一个模块化的连续人格评分流水线,它在不依赖特定人格模型、其维度、转换机制或使用的LLM的情况下,评估对话沿潜在轴线的表现。这些评分作为动态状态变量,系统地重新配置系统提示,引导行为在整个交互过程中的对齐。我们通过在医学教育环境中实现人际圆周图(IPC)来评估该框架。结果表明,该系统能够根据用户输入调整其人格状态,同时影响用户行为,从而促进去升级训练。值得注意的是,评分流水线即使使用轻量级、微调分类器而非大规模LLM,也能保持相当的精度。这项工作展示了模块化、人格适应性架构在教育、客户服务和更广泛的人机交互中的可行性。
Summary / 总结
The paper addresses the limitation of Large Language Models (LLMs) in adapting their personality in response to dialogue dynamics, which affects their performance in complex interactions. It introduces a model-agnostic framework using state machines to dynamically adjust personality states based on conversational context. The framework includes a modular pipeline for continuous personality scoring, which evaluates dialogues along latent axes without relying on specific personality models or LLMs. Experimental results show that the system successfully adapts its personality state to user inputs and influences user behavior, particularly in de-escalation training, while maintaining precision even with lightweight classifiers.
该论文解决了大型语言模型(LLMs)在对话动态变化时无法调整其个性的问题。它提出了一种使用状态机表示潜在个性状态的模型无关框架,这些状态会根据对话情境动态调整。该框架包括一个模块化的持续个性评分管道,可以沿潜在轴线评估对话。系统在医疗教育环境中成功地根据用户输入调整其个性状态,促进了脱敏训练。评分管道即使使用轻量级分类器也能保持相当的精度,表明该方法在教育、客户服务和更广泛的人机交互中的可行性。
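The context-dependent transition probabilities described in the personality-adaptation abstract can be sketched as a reweighting of a base transition distribution by per-state scores. The state names, the exponential reweighting rule, and the `strength` parameter are all hypothetical; the abstract only says that transition probabilities are adapted to the conversational context.

```python
import math


def adapt_transitions(base_probs, context_scores, strength=1.0):
    """Reweight transition probabilities out of the current state.

    base_probs:     dict mapping next state -> base transition probability
    context_scores: dict mapping next state -> score from an (assumed) scoring pipeline
    """
    weights = {
        state: prob * math.exp(strength * context_scores.get(state, 0.0))
        for state, prob in base_probs.items()
    }
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}


# Hypothetical states; the scoring pipeline is assumed to favour "dominant" here.
adapted = adapt_transitions(
    {"dominant": 0.5, "submissive": 0.5},
    {"dominant": math.log(3.0)},
)
```

In this toy case the favoured state's probability rises from 0.5 to 0.75, and the selected state would then determine how the system prompt is reconfigured for the next turn.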
Quad Length Codes for Lossless Compression of e4m3
Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer
First: 2026-02-19T21:31:33+00:00 · Latest: 2026-02-25T17:58:32+00:00
Comments: The first version proposed lossless compression of BFloat16 using dual length codes. This version proposes lossless compression of e4m3 using quad length codes. The versions will be merged later
Abstract
Training and serving Large Language Models (LLMs) rely heavily on parallelization and collective operations, which are frequently bottlenecked by network bandwidth. Lossless compression using, e.g., Huffman codes can alleviate the issue; however, Huffman codes suffer from slow, bit-sequential decoding and high hardware complexity due to deep tree traversals. Universal codes, e.g., Exponential-Golomb codes, are faster to decode but do not exploit the symbol frequency distributions. To address these limitations, this paper introduces Quad Length Codes, a hybrid approach designed to balance compression efficiency with decoding speed. The coding scheme uses 3 prefix bits to divide the 256 symbols into 8 areas. Each area has a different code length and encodes a different number of symbols. The scheme uses a Look Up Table with 256 entries, significantly simplifying the hardware implementation compared to Huffman trees. The coding scheme can be adapted for different distributions. For the e4m3 data type, the scheme achieves a compressibility of 13.9% in comparison to 15.9% achieved by Huffman codes, but it significantly speeds up decoding and reduces hardware complexity.
中文标题/摘要
标题:四长度码用于无损压缩e4m3
训练和提供大型语言模型(LLMs)依赖于并行化和集体操作,这些操作经常受到网络带宽的瓶颈限制。使用例如霍夫曼码的无损压缩可以缓解这一问题,然而霍夫曼码由于逐位解码速度慢和硬件复杂度高(由于深层树遍历)而受到限制。通用码例如指数-戈尔登码解码速度快,但不利用符号频率分布。为了解决这些限制,本文引入了四长度码,这是一种旨在平衡压缩效率与解码速度的混合方法。编码方案使用3位前缀将256个符号分为8个区域。每个区域有不同的码长和不同的符号数量。该方案使用一个包含256个条目的查找表,与霍夫曼树相比,大大简化了硬件实现。该编码方案可以适应不同的分布。对于e4m3数据类型,该方案的压缩比为13.9%,而霍夫曼码的压缩比为15.9%,但该方案显著加快了解码速度并简化了硬件复杂度。
Summary / 总结
This paper addresses the limitations of Huffman and Exponential-Golomb codes in lossless compression for Large Language Models by introducing Quad Length Codes. This hybrid approach uses 3 prefix bits to divide symbols into 8 areas with different code lengths, reducing hardware complexity and simplifying the implementation. For e4m3 data, Quad Length Codes achieve 13.9% compressibility, slightly less than Huffman codes (15.9%), but offer faster decoding and lower hardware complexity.
本文针对Huffman和Exponential-Golomb编码在大型语言模型无损压缩中的局限性,提出了Quad Length Codes。该混合方法使用3位前缀将符号分为8个区域,具有不同的码长,从而简化硬件实现。对于e4m3数据,Quad Length Codes的压缩率为13.9%,略低于Huffman编码的15.9%,但提供了更快的解码速度和更低的硬件复杂性。
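The prefix-plus-area structure described in the Quad Length Codes abstract can be sketched as a small encoder and decoder. The 3 prefix bits, 8 areas, and 256-entry lookup table come from the abstract; the suffix widths in `SUFFIX_BITS` are a hypothetical partition chosen only so that the area sizes sum to 256, and they are not the paper's code lengths. Bit strings are used instead of packed bits to keep the sketch short.

```python
PREFIX_BITS = 3
# Hypothetical suffix widths per area: sizes 2, 2, 4, 8, 16, 32, 64, 128 (total 256).
SUFFIX_BITS = [1, 1, 2, 3, 4, 5, 6, 7]

# BASES[a] is the first frequency rank that falls into area a.
BASES = []
_covered = 0
for _bits in SUFFIX_BITS:
    BASES.append(_covered)
    _covered += 1 << _bits
assert _covered == 256


def build_tables(ranked_symbols):
    """ranked_symbols: all 256 byte values, most frequent first."""
    encode_lut = [None] * 256            # symbol -> (code, code length in bits)
    decode_lut = list(ranked_symbols)    # frequency rank -> symbol
    for rank, symbol in enumerate(ranked_symbols):
        area = max(a for a in range(8) if BASES[a] <= rank)
        suffix = rank - BASES[area]
        code = (area << SUFFIX_BITS[area]) | suffix
        encode_lut[symbol] = (code, PREFIX_BITS + SUFFIX_BITS[area])
    return encode_lut, decode_lut


def encode(data, encode_lut):
    return "".join(format(code, f"0{length}b")
                   for code, length in (encode_lut[s] for s in data))


def decode(bitstring, decode_lut):
    out, pos = [], 0
    while pos < len(bitstring):
        area = int(bitstring[pos:pos + PREFIX_BITS], 2)   # prefix selects the area
        pos += PREFIX_BITS
        width = SUFFIX_BITS[area]                          # area fixes the suffix width
        out.append(decode_lut[BASES[area] + int(bitstring[pos:pos + width], 2)])
        pos += width
    return bytes(out)


enc_lut, dec_lut = build_tables(list(range(256)))
payload = bytes([0, 1, 2, 17, 255])
roundtrip = decode(encode(payload, enc_lut), dec_lut)
```

The decoder never walks a tree: reading the fixed-width prefix tells it exactly how many more bits to consume, which is the property the abstract contrasts with bit-sequential Huffman decoding.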
Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual
Authors: Yining Li, Peizhong Ju, Ness Shroff
First: 2026-02-25T17:54:52+00:00 · Latest: 2026-02-25T17:54:52+00:00
Abstract
Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point problem is in convex-concave form. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method, covering both exact policy optimization in the distributional space and convergence to a neighborhood of the optimal solution whose gap is related to approximation error and bias under parameterized policies. Our analysis reveals that optimism plays a crucial role in mitigating oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.
中文标题/摘要
标题:可验证的末次迭代收敛性:通过乐观原始-对偶方法实现多目标安全大语言模型对齐
人类反馈强化学习(RLHF)在使大语言模型(LLMs)与人类偏好对齐方面发挥着重要作用。虽然RLHF带期望奖励约束可以形式化为原始-对偶优化问题,但标准的原始-对偶方法仅能保证在分布策略下收敛,其中鞍点问题是凸-凹形式的。此外,标准的原始-对偶方法在实际应用中可能在策略参数化下表现出不稳定性或发散。在本文中,我们提出了一种适用于安全RLHF的通用原始-对偶框架,该框架统一了现有的多种对齐算法,包括安全-RLHF、单次和多次方法。基于此框架,我们引入了一种乐观原始-对偶(OPD)算法,该算法为原始和对偶变量都引入了预测更新,以稳定鞍点动力学。我们为所提出的方法建立了末次迭代收敛性保证,涵盖了精确策略优化在分布空间中的情况,以及在参数化策略下收敛到最优解邻域的情况,其差距与近似误差和偏差有关。我们的分析表明,乐观性在缓解约束对齐目标固有的振荡方面起着关键作用,从而填补了约束RL与实际RLHF之间的关键理论缺口。
Summary / 总结
This paper addresses the challenge of aligning Large Language Models (LLMs) with human preferences using Reinforcement Learning from Human Feedback (RLHF). It proposes a universal primal-dual framework that incorporates an optimistic primal-dual (OPD) algorithm with predictive updates for both primal and dual variables to stabilize saddle-point dynamics. The method provides last-iterate convergence guarantees, covering both exact policy optimization and convergence to a neighborhood of the optimal solution under parameterized policies, thus mitigating oscillations in constrained alignment objectives.
本文解决了使用强化学习从人类反馈(RLHF)对大型语言模型(LLMs)进行对齐的问题。它提出了一种统一的原始对偶框架,并引入了一种乐观原始对偶(OPD)算法,该算法对原始和对偶变量进行了预测更新,以稳定鞍点动力学。关键实验发现是为所提出的方法建立了最后一迭代收敛保证,该保证涵盖了精确策略优化和参数化策略下的最优解邻域收敛。
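The stabilizing effect of optimistic (predictive) updates mentioned in the primal-dual abstract can be seen on the textbook bilinear saddle problem f(x, y) = x * y, with x minimizing and y maximizing. This toy is an illustration only and is not the paper's OPD algorithm or its RLHF objective: plain gradient descent-ascent spirals away from the saddle point at the origin, while the optimistic variant, which extrapolates using the previous gradient, converges in the last iterate.

```python
def gda(x, y, lr=0.1, steps=2000):
    """Plain simultaneous gradient descent-ascent on f(x, y) = x * y."""
    for _ in range(steps):
        grad_x, grad_y = y, x                  # df/dx = y, df/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
    return x, y


def optimistic_gda(x, y, lr=0.1, steps=2000):
    """Optimistic variant: step along 2 * current gradient - previous gradient."""
    prev_gx, prev_gy = y, x
    for _ in range(steps):
        grad_x, grad_y = y, x
        x, y = (x - lr * (2 * grad_x - prev_gx),
                y + lr * (2 * grad_y - prev_gy))
        prev_gx, prev_gy = grad_x, grad_y
    return x, y


def distance_to_saddle(point):
    return (point[0] ** 2 + point[1] ** 2) ** 0.5


plain_distance = distance_to_saddle(gda(1.0, 1.0))
optimistic_distance = distance_to_saddle(optimistic_gda(1.0, 1.0))
```

Starting from (1, 1), the plain iterates end far from the origin while the optimistic iterates end very close to it, which mirrors the oscillation-versus-convergence contrast the abstract draws for constrained alignment objectives.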
When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Authors: Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang
First: 2026-02-25T17:54:42+00:00 · Latest: 2026-02-25T17:54:42+00:00
Abstract
Large Language Models (LLMs) are increasingly used to "professionalize" workplace communication, often at the cost of linguistic identity. We introduce "Cultural Ghosting", the systematic erasure of linguistic markers unique to non-native English varieties during text processing. Through analysis of 22,350 LLM outputs generated from 1,490 culturally marked texts (Indian, Singaporean, and Nigerian English) processed by five models under three prompt conditions, we quantify this phenomenon using two novel metrics: Identity Erasure Rate (IER) and Semantic Preservation Score (SPS). Across all prompts, we find an overall IER of 10.26%, with model-level variation from 3.5% to 20.5% (5.9x range). Crucially, we identify a Semantic Preservation Paradox: models maintain high semantic similarity (mean SPS = 0.748) while systematically erasing cultural markers. Pragmatic markers (politeness conventions) are 1.9x more vulnerable than lexical markers (71.5% vs. 37.1% erasure). Our experiments demonstrate that explicit cultural-preservation prompts reduce erasure by 29% without sacrificing semantic quality.
中文标题/摘要
标题:当AI写作时,哪种声音留存?大规模语言模型中世界英语变体文化标记消除的量化
大规模语言模型(LLMs)越来越多地用于“专业化”职场沟通,往往以牺牲语言身份为代价。我们引入了“文化幽灵化”这一概念,指在文本处理过程中系统地消除非母语英语变体特有的语言标记。通过对1,490篇带有文化标记的文本(印度英语、新加坡英语和尼日利亚英语)生成的22,350个LLM输出进行分析,我们使用两个新指标:身份消除率(IER)和语义保留分值(SPS)来量化这一现象。在所有提示下,我们发现总体身份消除率为10.26%,模型间差异从3.5%到20.5%(5.9倍范围)。关键的是,我们发现语义保留悖论:模型保持高语义相似度(平均SPS = 0.748),同时系统地消除文化标记。语用标记(礼貌惯例)比词汇标记(71.5% vs. 37.1%)更易被消除1.9倍。我们的实验表明,明确的文化保留提示可以将消除率降低29%,而不牺牲语义质量。
Summary / 总结
This study examines the phenomenon of 'Cultural Ghosting' in Large Language Models (LLMs), where non-native English linguistic markers are systematically erased. Using 22,350 LLM outputs from 1,490 texts in Indian, Singaporean, and Nigerian English, the study quantifies this effect with Identity Erasure Rate (IER) and Semantic Preservation Score (SPS). Across all prompts, an overall IER of 10.26% was found, with model-level variations from 3.5% to 20.5%. The study also identifies a Semantic Preservation Paradox, where models maintain high semantic similarity while erasing cultural markers, especially pragmatic markers which are 1.9x more vulnerable than lexical markers. Explicit cultural-preservation prompts were found to reduce erasure by 29% without compromising semantic quality.
研究探讨了大型语言模型(LLMs)中的‘文化消逝’现象,即非母语英语变体的独特语言标记被系统性地抹去。通过对五个模型在三种提示条件下生成的22,350个输出进行分析,研究量化了这一问题,使用身份抹除率和语义保留分数作为指标。总体抹除率为10.26%,不同模型之间的差异范围从3.5%到20.5%。值得注意的是,模型在保持语义相似性的同时抹除了文化标记,其中语用标记比词汇标记脆弱1.9倍。明确的文化保留提示可以减少29%的抹除,而不牺牲语义质量。
Convergence of the generalization error for deep gradient flow methods for PDEs
Authors: Chenguang Liu, Antonis Papapantoleon, Jasper Rou
First: 2025-12-31T18:11:51+00:00 · Latest: 2026-02-25T17:53:00+00:00
Comments: 29 pages
Abstract
The aim of this article is to provide a firm mathematical foundation for the application of deep gradient flow methods (DGFMs) for the solution of (high-dimensional) partial differential equations (PDEs). We decompose the generalization error of DGFMs into an approximation and a training error. We first show that the solution of PDEs that satisfy reasonable and verifiable assumptions can be approximated by neural networks, thus the approximation error tends to zero as the number of neurons tends to infinity. Then, we derive the gradient flow that the training process follows in the "wide network limit" and analyze the limit of this flow as the training time tends to infinity. These results combined show that the generalization error of DGFMs tends to zero as the number of neurons and the training time tend to infinity.
中文标题/摘要
标题:深度梯度流方法(DGFMs)求解偏微分方程(PDEs)的泛化误差收敛性
本文旨在为深度梯度流方法(DGFMs)求解(高维)偏微分方程(PDEs)提供坚实的数学基础。我们将DGFMs的泛化误差分解为近似误差和训练误差。我们首先证明,在满足合理且可验证假设的情况下,偏微分方程的解可以通过神经网络近似,因此随着神经元数量趋于无穷,近似误差趋于零。然后,我们在“宽网络极限”下推导出训练过程遵循的梯度流,并分析该流在训练时间趋于无穷时的极限。这些结果表明,随着神经元数量和训练时间趋于无穷,DGFMs的泛化误差趋于零。
Heuristic Adaptation of Potentially Misspecified Domain Support for Likelihood-Free Inference in Stochastic Dynamical Systems
Authors: Georgios Kamaras, Craig Innes, Subramanian Ramamoorthy
First: 2025-10-30T16:23:46+00:00 · Latest: 2026-02-25T17:52:16+00:00
Comments: 20 pages, 18 figures
Abstract
In robotics, likelihood-free inference (LFI) can provide the domain distribution that adapts a learnt agent in a parametric set of deployment conditions. LFI assumes an arbitrary support for sampling, which remains constant as the initial generic prior is iteratively refined to more descriptive posteriors. However, a potentially misspecified support can lead to suboptimal, yet falsely certain, posteriors. To address this issue, we propose three heuristic LFI variants: EDGE, MODE, and CENTRE. Each interprets the posterior mode shift over inference steps in its own way and, when integrated into an LFI step, adapts the support alongside posterior inference. We first expose the support misspecification issue and evaluate our heuristics using stochastic dynamical benchmarks. We then evaluate the impact of heuristic support adaptation on parameter inference and policy learning for a dynamic deformable linear object (DLO) manipulation task. Inference results in a finer length and stiffness classification for a parametric set of DLOs. When the resulting posteriors are used as domain distributions for sim-based policy learning, they lead to more robust object-centric agent performance.
中文标题/摘要
标题:随机动力系统无似然推断中可能误设的领域支持的启发式自适应
在机器人学中,无似然推断(LFI)可以提供适应学习代理在参数部署条件集中的领域分布。LFI假设一个任意的支持用于采样,该支持在整个初始通用先验逐步细化为更具描述性的后验过程中保持不变。然而,可能被误设的支持会导致次优但错误确定的后验。为了解决这一问题,我们提出了三种启发式LFI变体:EDGE、MODE和CENTRE。每种变体都以自己的方式解释后验模式在推断步骤中的变化,并在集成到LFI步骤中时,与后验推断一起适应支持。我们首先揭示了支持误设的问题,并使用随机动力学基准评估我们的启发式方法。然后,我们评估启发式支持适应对动态可变形线性对象(DLO)操作任务中参数推断和策略学习的影响。对于参数化的DLO集合,推断结果在长度和刚度分类上更加精细。当使用这些后验作为基于仿真的策略学习的领域分布时,它们会导致更稳健的对象中心代理性能。
Summary / 总结
The paper addresses the issue of potentially misspecified support in likelihood-free inference (LFI) for robotics, which can result in suboptimal and falsely certain posteriors. To tackle this, three heuristic LFI methods—EDGE, MODE, and CENTRE—are proposed, each adapting the support based on posterior mode shifts. The methods are evaluated on stochastic dynamical benchmarks and a dynamic deformable linear object manipulation task, showing improved parameter inference and more robust agent performance with finer length and stiffness classification of objects.
本文解决了机器人领域中潜在的采样支持不准确问题,这可能导致次优的后验分布。为此,作者提出了三种启发式LFI方法:EDGE、MODE和CENTRE。这些方法通过以不同的方式解释后验模式的变化,在推理过程中调整支持。这些启发式方法的有效性通过随机动力学基准测试和动态可变形线性对象操作任务进行了评估,显示出改进的参数推理和更稳健的代理性能。
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Authors: Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang
First: 2026-02-25T17:50:41+00:00 · Latest: 2026-02-25T17:50:41+00:00
Comments: Code: https://github.com/lingfengren/NoLan
Abstract
Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: https://github.com/lingfengren/NoLan.
中文标题/摘要
标题:NoLan:通过动态抑制语言先验减轻大型视觉-语言模型中的对象幻觉
对象幻觉是大型视觉-语言模型(LVLM)中的一个关键问题,模型的输出中包含输入图像中不存在的对象。从这一现象中自然会引发一个疑问:在LVLM流水线中,哪个组件主要导致了对象幻觉的产生?是用于感知视觉信息的视觉编码器,还是用于生成文本响应的语言解码器?在本研究中,我们通过设计系统实验来分析视觉编码器和语言解码器在幻觉生成中的作用。我们的观察表明,对象幻觉主要与语言解码器中的强大先验有关。基于这一发现,我们提出了一种简单且无需训练的框架,No-Language-Hallucination Decoding(NoLan),通过动态抑制语言先验来细化输出分布,该抑制基于多模态输入和纯文本输入输出分布之间的差异进行调节。实验结果表明,NoLan在不同任务的多种LVLM中有效减少了对象幻觉。例如,NoLan在POPE上取得了显著改进,分别提高了LLaVA-1.5 7B和Qwen-VL 7B的准确性6.45和7.21。代码已公开:https://github.com/lingfengren/NoLan
Summary / 总结
The study addresses object hallucinations in Large Vision-Language Models (LVLMs) by analyzing the contributions of the vision encoder and language decoder. It proposes NoLan, a framework that suppresses language priors dynamically to reduce hallucinations. Experiments show that NoLan significantly decreases object hallucinations in various LVLMs, improving accuracy on tasks like POPE by up to 6.45 and 7.21 for LLaVA-1.5 7B and Qwen-VL 7B, respectively.
该论文通过研究视觉编码器和语言解码器的作用,解决了大型视觉-语言模型中的物体幻觉问题。通过系统实验,作者发现语言解码器的先验是导致幻觉的主要原因。他们提出了一种名为NoLan的无需训练的框架,通过动态抑制语言先验来减少幻觉,在不同任务和各种LVLM上取得了显著改进,例如在POPE上将LLaVA-1.5 7B和Qwen-VL 7B的准确性分别提高了最多6.45和7.21。
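The NoLan abstract above describes refining the output distribution by suppressing language priors, with the strength modulated by the difference between the multimodal and text-only output distributions. A minimal sketch of that general idea follows. It is not the authors' formula: the contrastive form `(1 + alpha) * mm - alpha * text`, the KL-based modulation rule, and every name in it (`suppress_language_prior`, `base_alpha`) are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def suppress_language_prior(mm_logits, text_logits, base_alpha=1.0):
    """Contrast multimodal logits against text-only logits.

    Hypothetical modulation rule (an assumption, not taken from the paper):
    the closer the two distributions are, the more the output is presumed
    to be driven by the language prior, so the suppression weight is larger.
    """
    p_mm = softmax(mm_logits)
    p_text = softmax(text_logits)
    alpha = base_alpha / (1.0 + kl_divergence(p_mm, p_text))
    adjusted = [(1.0 + alpha) * m - alpha * t
                for m, t in zip(mm_logits, text_logits)]
    return softmax(adjusted), alpha


# Toy next-token logits over a 3-token vocabulary.
mm_logits = [2.0, 1.0, 0.0]    # image + text input
text_logits = [0.0, 3.0, 0.0]  # text-only input; the prior favors token 1
probs, alpha = suppress_language_prior(mm_logits, text_logits)
```

In this toy case the token favored only by the text-only pass loses probability mass after the adjustment. When the two logit vectors are identical, the adjusted logits equal the multimodal logits and the distribution is unchanged.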
EmoGRACE: Aspect-based emotion analysis for social media data
Authors: Christina Zorenböhmer, Sebastian Schmidt, Bernd Resch
First: 2025-03-19T11:48:52+00:00 · Latest: 2026-02-25T17:46:42+00:00
Abstract
While sentiment analysis has advanced from the sentence level to the aspect level, i.e., the identification of concrete terms related to a sentiment, the equivalent field of Aspect-based Emotion Analysis (ABEA) faces dataset bottlenecks and the increased complexity of emotion classes in contrast to binary sentiments. This paper addresses these gaps by generating a first ABEA training dataset, consisting of 2,621 English Tweets, and fine-tuning a BERT-based model for the ABEA sub-tasks of Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC).
The dataset annotation process was based on the hierarchical emotion theory by Shaver et al. [1] and made use of group annotation and majority voting strategies to facilitate label consistency. The resulting dataset contained aspect-level emotion labels for Anger, Sadness, Happiness, Fear, and a None class. Using the new ABEA training dataset, the state-of-the-art ABSA model GRACE by Luo et al. [2] was fine-tuned for ABEA. The results reflected a performance plateau at an F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC extraction. The limiting factors for model performance were broadly identified as the small training dataset size coupled with the increased task complexity, causing model overfitting and limited abilities to generalize well on new data.
中文标题/摘要
标题:EmoGRACE:社交媒体数据的方面情感分析
虽然情感分析已经从句子层面发展到了方面层面,即识别与情感相关的具体术语,但等效的方面情感分析(ABEA)领域却面临着数据集瓶颈和情感类别复杂性增加的问题,相比之下,二元情感要简单得多。本文通过生成第一个ABEA训练数据集,包含2,621条英文推文,并针对ABEA子任务——方面术语提取(ATE)和方面情感分类(AEC)对基于BERT的模型进行了微调。数据集的注释过程基于Shaver等人[1]的层次情感理论,并采用了群体注释和多数投票策略以确保标签一致性。结果数据集包含了愤怒、悲伤、快乐、恐惧和一个无情感类别的方面级情感标签。使用新的ABEA训练数据集,对Luo等人[2]提出的最先进的ABSA模型GRACE进行了微调以适应ABEA。结果表明,ATE的F1分数达到了70.1%,而联合ATE和AEC提取的F1分数为46.9%。模型性能的限制因素主要归因于训练数据集规模较小以及任务复杂性增加,导致模型过拟合,并且在新数据上的泛化能力有限。
Summary / 总结
This paper addresses the gaps in Aspect-based Emotion Analysis (ABEA) by creating a training dataset of 2,621 English Tweets and fine-tuning a BERT-based model for ABEA tasks. The dataset was annotated using a hierarchical emotion theory and group annotation with majority voting to ensure consistency. The model achieved an F1-score of 70.1% for Aspect Term Extraction (ATE) and 46.9% for joint ATE and Aspect Emotion Classification (AEC) extraction, highlighting the challenges of small dataset size and increased task complexity.
该论文通过创建包含2,621条英语推文的训练数据集,并对基于BERT的模型进行微调来解决ABEA的空白。数据集包括愤怒、悲伤、快乐、恐惧和无类别的标签,使用了层级情绪理论。模型在Aspect Term Extraction (ATE) 上达到了70.1%的F1分数,在联合ATE和Aspect Emotion Classification (AEC) 上达到了46.9%,突显了由于数据集较小和任务复杂性增加所带来的挑战。
WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs
Authors: Yulin Zhang, Cheng Shi, Sibei Yang
Venue: CVPR 2026
First: 2026-02-25T17:45:45+00:00 · Latest: 2026-02-25T17:45:45+00:00
Comments: Accepted at CVPR 2026 (preview; camera-ready in preparation)
Abstract
Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness, in which it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model-agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective (our Streaming Order Perception enhancement) that instills order-aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty-triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into existing Video-LLMs without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time-aware streaming Video-LLMs under strict online, time-causal constraints. Code and weights will be made publicly available. Project Page: https://zhangyl4.github.io/publications/weavetime/
中文标题/摘要
标题:WeaveTime:在视频LLM中将早期帧流式汇入涌现记忆
近期多模态大型语言模型在视觉理解和推理方面取得了显著进步,但其二次注意力机制和离线训练方式使其不适合处理帧按序到达且未来观察不可用的流式设置。我们诊断了当前视频LLMs的核心局限性,即时间无感知性,即将视频视为无序的证据集合而非因果顺序的序列,导致流式设置中的两个失败:时间顺序模糊,模型无法遵循或推理正确的顺序;过去与当前关注盲区,模型无法区分当前观察与累积历史。我们提出了WeaveTime,一种简单、高效且模型无关的框架,首先教授顺序,然后利用顺序。我们引入了轻量级的时序重建目标——流式顺序感知增强,该目标通过最少的微调和无需专门的流式数据来培养顺序感知的表示。在推理时,过去与当前动态聚焦缓存执行不确定性触发的粗到细检索,仅在需要时扩展历史。WeaveTime插件式地集成到现有的视频LLM中,无需架构更改,即可在代表性流式基准测试中提供一致的性能提升,提高准确率并减少延迟。这些结果确立了WeaveTime作为在严格在线、时间因果约束下时间感知流式视频LLMs的实用路径。代码和权重将公开发布。项目页面:https://zhangyl4.github.io/publications/weavetime/
Summary / 总结
WeaveTime addresses the limitations of current Video-LLMs by introducing a framework that enhances temporal awareness, overcoming temporal order ambiguity and past-current focus blindness. It uses a lightweight Temporal Reconstruction objective to instill order-aware representations and a Past-Current Dynamic Focus Cache for efficient history retrieval. WeaveTime improves accuracy and reduces latency on streaming benchmarks without requiring architectural changes, making it a practical solution for time-aware Video-LLMs under strict online constraints.
WeaveTime通过引入一个先教授时间顺序、再利用时间顺序的框架来解决当前Video-LLMs的限制。它包括一个时序重建目标以培养顺序感知的表示,并使用一个过去-当前动态焦点缓存进行高效的历史检索。WeaveTime在无需对现有Video-LLMs进行架构更改的情况下,能够在流式基准测试中一致地提高准确性并降低延迟。
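The WeaveTime abstract above mentions uncertainty-triggered retrieval that expands history only when needed. The sketch below illustrates only that triggering idea, under stated assumptions: uncertainty is measured as the entropy of the answer distribution, history is a list of chunks ordered from most recent to oldest, and `predict`, `answer_with_dynamic_focus`, and `threshold` are hypothetical names. The coarse-to-fine stage and the cache itself are not modeled, and this is not the authors' implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def answer_with_dynamic_focus(predict, current_frames, history_chunks,
                              threshold=0.5):
    """Answer from the current frames, adding history only while uncertain.

    predict: callable mapping a list of frames to answer probabilities
             (a stand-in for a Video-LLM; hypothetical interface).
    history_chunks: list of frame lists, most recent chunk first.
    Returns the final probabilities and the number of chunks retrieved.
    """
    context = list(current_frames)
    probs = predict(context)
    used = 0
    for chunk in history_chunks:
        if entropy(probs) <= threshold:
            break  # confident enough, so stop expanding history
        context = list(chunk) + context  # prepend older frames
        probs = predict(context)
        used += 1
    return probs, used


def toy_predict(frames):
    """Toy model that becomes confident once it sees at least 3 frames."""
    return [0.5, 0.5] if len(frames) < 3 else [0.99, 0.01]


probs, used = answer_with_dynamic_focus(
    toy_predict, ["f4", "f5"], [["f3"], ["f2"], ["f1"]])
```

With two current frames the toy model is uncertain, so one history chunk is retrieved before the entropy drops below the threshold. With three or more current frames no history is touched.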
Some Simple Economics of AGI
Authors: Christian Catalini, Xiang Hui, Jane Wu
First: 2026-02-24T14:29:45+00:00 · Latest: 2026-02-25T17:41:07+00:00
Comments: JEL Classification: D82, D83, J23, J24, L23, O33. 112 pages, 3 figures
Abstract
For millennia, human cognition was the primary engine of progress on Earth. As AI decouples cognition from biology, the marginal cost of measurable execution falls to zero, absorbing any labor capturable by metrics--including creative, analytical, and innovative work. The binding constraint on growth is no longer intelligence but human verification bandwidth: the capacity to validate, audit, and underwrite responsibility when execution is abundant. We model the AGI transition as the collision of two racing cost curves: an exponentially decaying Cost to Automate and a biologically bottlenecked Cost to Verify. This structural asymmetry widens a Measurability Gap between what agents can execute and what humans can afford to verify. It also drives a shift from skill-biased to measurability-biased technical change. Rents migrate to verification-grade ground truth, cryptographic provenance, and liability underwriting--the ability to insure outcomes rather than merely generate them. The current human-in-the-loop equilibrium is unstable: eroded from below as apprenticeship collapses (Missing Junior Loop) and from within as experts codify their obsolescence (Codifier's Curse). Unverified deployment becomes privately rational--a Trojan Horse externality. Unmanaged, these forces pull toward a Hollow Economy. Yet by scaling verification alongside agentic capabilities, the forces that threaten collapse become the catalyst for unbounded discovery and experimentation--an Augmented Economy. We derive a practical playbook for individuals, companies, investors, and policymakers. Today's defining challenge is not the race to deploy the most autonomous systems; it is the race to secure the foundations of their oversight. Only by scaling our bandwidth for verification alongside our capacity for execution can we ensure that the intelligence we have summoned preserves the humanity that initiated it.
中文标题/摘要
标题:某些简单的AGI经济学
数千年间,人类认知是地球上进步的主要动力。随着AI将认知与生物学脱钩,可衡量执行的边际成本降至零,吸收了任何可被度量指标捕获的劳动力——包括创造性、分析性和创新性工作。增长的约束不再是智能,而是人类验证带宽:在执行充足时验证、审计和承担责任的能力。我们将AGI过渡视为两条竞速成本曲线的碰撞:自动化成本指数性下降和生物瓶颈限制的验证成本。这种结构性不对称性扩大了执行与人类可负担验证之间的可度量差距。这也推动了从技能偏向到可度量偏向的技术变革。租金迁移到验证级真实信息、加密溯源和责任保险——即保险结果而非仅仅生成结果的能力。当前的人工在环均衡是不稳定的:从下面被学徒制崩溃(缺失初级循环)侵蚀,从内部被专家固化其过时(编码者的诅咒)侵蚀。未经验证的部署变得私人理性——一个特洛伊木马外部性。若不加以管理,这些力量将导致一个空心经济。然而,通过与代理能力同步扩展验证,威胁崩溃的力量成为无限制发现和实验的催化剂——增强经济。我们为个人、公司、投资者和政策制定者推导出一份实用的行动指南。当今的决定性挑战不是部署最自主系统的竞赛;而是确保其监督基础的竞赛。只有在扩展验证能力的同时扩展执行能力,我们才能确保我们召唤的智能保留了启动它的那份人性。
Summary / 总结
This paper explores the economic implications of the transition to AGI, focusing on the cost curves of automation and verification. The main method involves modeling the AGI transition as the intersection of two cost curves: the exponentially decaying cost to automate and the biologically limited cost to verify. Key findings include the widening Measurability Gap, which shifts technical change towards measurability bias, and the migration of rents to verification-related activities. The paper also discusses the instability of the current human-in-the-loop equilibrium and the potential for a Hollow Economy, but suggests that scaling verification alongside agentic capabilities can lead to an Augmented Economy. The authors provide a practical playbook for various stakeholders to ensure effective oversight of AGI systems.
论文探讨了AGI过渡的经济影响,重点关注自动化和验证的成本曲线。它将AGI过渡建模为两个成本曲线的交汇点,导致可测量性缺口,并从技能偏向转向可测量性偏向的技术变革。关键发现包括租金向验证相关活动的迁移,以及当前的人在环路均衡的不稳定性,这可能导致空心经济。作者建议在增强能力的同时扩大验证能力,以创建增强型经济,并为利益相关者提供确保AGI系统监督的实用手册。
Recursive Belief Vision Language Action Models
Authors: Vaidehi Bagaria, Bijo Sebastian, Nirav Kumar Patel
First: 2026-02-24T08:02:16+00:00 · Latest: 2026-02-25T17:38:24+00:00
Abstract
Vision-language-action models must enable agents to execute long-horizon tasks under partial observability. However, most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. While semantic grounding is important, long-horizon manipulation fundamentally requires persistent, action-conditioned state representations. Current VLAs lack such representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once per task, the VLM provides high-level intent, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5 percent and 37.5 percent higher success rates on multi-stage pick-and-place and stacking tasks, respectively, compared to pi_0. It also reduces inference latency by up to five times relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show the belief module is the primary driver of performance, increasing success rates from 32.5 percent without belief to 77.5 percent with belief.
中文标题/摘要
标题:递归信念视语言行动模型
视语言行动模型必须使代理能够在部分可观测性下执行长时任务。然而,大多数现有方法仍依赖于短上下文窗口或反复查询视语言模型(VLM),这导致任务进展丢失、感知同义词下的动作重复以及高推理延迟。虽然语义定位很重要,但长时操作本质上需要持久的、基于动作的状态表示。当前的VLAs缺乏这样的表示,且在时间和物理推理方面表现出有限的能力,使其不适合多阶段控制。本文引入了RB-VLA,这是一种以信念为中心的架构,通过自我监督的世界模型目标进行训练,保持一个紧凑的潜在状态编码任务相关的历史、动力学和物体交互。VLM在每次任务时查询一次,提供高层次的意图,而信念追踪任务进展,在部分可观测性下实现有阶段意识的因果控制,无需存储原始观察或随时间扩展内存。信念和意图共同条件一个扩散策略,以实现稳健的闭环执行。RB-VLA在长时任务基准测试中优于先前的VLAs,分别在多阶段取放和堆叠任务中实现了52.5%和37.5%更高的成功率,相比pi_0。它还将推理延迟降低了最多五倍,并消除了现有VLAs在时间步长上观察到的内存增长。消融实验表明,信念模块是性能的主要驱动因素,信念模块从无到有将成功率从32.5%提高到77.5%。
Summary / 总结
This paper addresses the limitations of existing vision-language-action models by introducing RB-VLA, a belief-centric architecture that maintains a compact latent state for task-relevant history and dynamics. The model queries a vision-language model once per task to provide high-level intent, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability. RB-VLA outperforms prior models on long-horizon tasks, achieving higher success rates and reducing inference latency. Ablation studies show the belief module is crucial for performance improvement.
本文提出了一种基于信念的架构RB-VLA,该架构通过维护紧凑的潜状态来跟踪任务进度并实现部分可观测条件下的阶段感知控制,解决了现有视觉-语言-动作模型的局限性。RB-VLA在长时程任务中表现出色,成功率更高,推理延迟更低。消融实验表明,信念模块是性能提升的主要驱动力。
IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages
Authors: Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan
First: 2026-02-25T17:12:37+00:00 · Latest: 2026-02-25T17:12:37+00:00
Comments: 8 pages + Appendix
Abstract
Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions. It comprises around 800 human-verified examples per language spread across two complementary subsets: IndicIFEval-Ground, translated prompts from IFEval (Zhou et al., 2023) carefully localized for Indic contexts, and IndicIFEval-Ground, synthetically generated instructions grounded in native Indic content. We conduct a comprehensive evaluation of major open-weight and proprietary models spanning both reasoning and non-reasoning models. While models maintain strong adherence to formatting constraints, they struggle significantly with lexical and cross-lingual tasks -- and despite progress in high-resource languages, instruction-following across the broader Indic family lags significantly behind English. We release IndicIFEval and its evaluation scripts to support progress on multilingual constrained generation (http://github.com/ai4bharat/IndicIFEval).
中文标题/摘要
标题:IndicIFEval:14种印度语言的可验证指令遵循评估基准
指令遵循基准主要以英语为中心,给数亿印度语言使用者留下了关键的评估缺口。我们引入了IndicIFEval,这是一个基准,使用自动可验证的基于规则的指令评估跨14种印度语言的LLM受限生成。它包括每个语言约800个人工验证的示例,分布在两个互补子集:IndicIFEval-Ground,来自IFEval(Zhou et al., 2023)的精心本地化翻译提示,以及IndicIFEval-Ground,基于本地印度语言内容的合成指令。我们对涵盖推理和非推理模型的多个开放权重和专有模型进行了全面评估。尽管模型在格式约束方面保持了较强的依从性,但在词汇和跨语言任务方面仍面临重大挑战——尽管在高资源语言方面取得了进展,但更广泛的印度语言家族的指令遵循仍远远落后于英语。我们发布了IndicIFEval及其评估脚本,以支持多语言受限生成的进步(http://github.com/ai4bharat/IndicIFEval)。