arXiv Paper Digest / arXiv 论文速递

Towards Understanding Best Practices for Quantization of Vision-Language Models

Authors: Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam

First: 2026-01-21T18:59:51+00:00 · Latest: 2026-01-21T18:59:51+00:00

Comments: 15 pages, 12 figures, 1 table

Abstract

Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners quantize their learned parameters, typically at half precision. A growing body of research focuses on preserving the model performance with more aggressive bit widths, and some work has been done to apply these strategies to other models, like vision transformers. In our study we investigate how a variety of quantization methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines comprised of vision models, language models, and their connectors. We address how performance on captioning, retrieval, and question answering can be affected by bit width, quantization method, and which portion of the pipeline the quantization is used for. Results reveal that ViT and LLM exhibit comparable importance in model performance, despite significant differences in parameter size, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). These findings provide practical insights for efficient deployment of MLLMs and highlight the value of exploration for understanding component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq.

中文标题/摘要

标题：理解视觉-语言模型量化最佳实践

大型语言模型（LLMs）在各种任务中表现出色，但最先进的系统需要快速的GPU和大量的内存。为了减少这些系统的内存和延迟，实践者通常会将它们的学习参数量化为半精度。越来越多的研究集中在使用更激进的位宽来保持模型性能，并且已经有一些工作将这些策略应用于其他模型，如视觉变换器。在我们的研究中，我们探讨了如何有效地将包括最先进的GPTQ和AWQ在内的各种量化方法应用于由视觉模型、语言模型及其连接器组成的多模态管道。我们研究了位宽、量化方法以及量化在管道中的使用位置如何影响字幕生成、检索和问答的性能。结果表明，尽管参数规模存在显著差异，ViT和LLM在模型性能中具有相当的重要性，并且LLM的低位量化可以在减少每个权重位数（bpw）的情况下实现高精度。这些发现为高效部署多模态大语言模型提供了实用见解，并突显了探索多模态模型组件敏感性的价值。我们的代码可在https://github.com/gautomdas/mmq/获取。

Summary / 总结

This study investigates how various quantization methods, including GPTQ and AWQ, can be applied to multimodal pipelines made up of vision models, language models, and their connectors. It examines how bit width, quantization method, and the portion of the pipeline being quantized affect performance on captioning, retrieval, and question answering. Key findings are that the ViT and the LLM are comparably important to model performance despite their large difference in parameter count, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw).

研究探讨了GPTQ和AWQ等不同量化方法在包含视觉变换器和语言模型的多模态管道中的应用。研究旨在理解不同量化策略对诸如图像字幕、检索和问答等任务性能的影响。主要发现表明，视觉变换器和语言模型对于模型性能都至关重要，而语言模型的低比特量化可以在更少的每权重比特数下实现高精度。
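
To make the terms "bit width" and "bits per weight (bpw)" concrete, here is a minimal sketch of symmetric round-to-nearest quantization in plain Python. This is a simple baseline chosen for illustration, not GPTQ or AWQ and not the authors' code. The function names, the 16-bit scale, and the group size of 128 are all assumptions.

```python
def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization of a list of floats.

    Illustrative baseline only; GPTQ and AWQ use more elaborate schemes.
    """
    qmax = 2 ** (bits - 1) - 1  # largest signed integer code, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer code and clamp to the valid range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def bits_per_weight(bits, group_size, scale_bits=16):
    """Effective bpw when one scale of `scale_bits` bits is stored per group."""
    return bits + scale_bits / group_size


weights = [0.5, -1.0, 0.25, 0.75]  # made-up values
codes, scale = quantize_rtn(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
effective_bpw = bits_per_weight(bits=4, group_size=128)
```

The per-group scale is why the effective bpw sits slightly above the nominal bit width: storage for the scales is amortized over the weights in each group.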

Iterative Refinement Improves Compositional Image Generation

Authors: Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak

First: 2026-01-21T18:59:40+00:00 · Latest: 2026-01-21T18:59:40+00:00

Comments: Project webpage: https://iterative-img-gen.github.io/

Abstract

Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/

中文标题/摘要

标题：迭代优化提升组合图像生成

文本到图像（T2I）模型已经取得了显著的进步，但仍难以处理复杂的提示，这些提示需要同时处理多个对象、关系和属性。现有的推理时策略，如并行采样带验证器或简单增加去噪步骤，可以改善提示对齐，但在许多约束必须满足的丰富组合场景中仍然不足。受大型语言模型中链式思考推理成功的启发，我们提出了一种迭代的测试时策略，在这种策略中，T2I模型在多个步骤中逐步细化其生成，由循环中的视觉语言模型作为批评者提供反馈。我们的方法简单，不需要外部工具或先验知识，并且可以灵活应用于各种图像生成器和视觉语言模型。实验证明，我们的方法在基准测试中的一致改进：在ConceptMix（k=7）上提高了16.9%的全正确率，在T2I-CompBench（3D-空间类别）上提高了13.8%，在视觉积木场景分解上提高了12.5%，与计算匹配的并行采样相比。除了定量的改进，迭代优化通过将复杂提示分解为顺序修正，生成更忠实的图像，人类评估者中有58.7%的人更偏好我们的方法，而并行基线为41.3%。这些发现共同强调了迭代自我修正作为组合图像生成广泛适用原则的重要性。结果和可视化可在https://iterative-img-gen.github.io/获取

Summary / 总结

The paper addresses compositional text-to-image generation, where existing models struggle with prompts that combine multiple objects, relations, and attributes. It introduces an iterative test-time strategy in which a text-to-image model refines its output over multiple steps, guided by feedback from a vision-language model acting as a critic. Compared to compute-matched parallel sampling, the approach improves the all-correct rate on ConceptMix (k=7) by 16.9%, improves results on T2I-CompBench (3D-Spatial category) by 13.8%, and improves Visual Jenga scene decomposition by 12.5%. Human evaluators preferred it 58.7% of the time, versus 41.3% for the parallel baseline.

论文提出了一种迭代细化策略，以解决从文本提示生成复杂图像的挑战。该方法涉及文本到图像模型在多次步骤中逐步改进其输出，并由视觉语言模型提供反馈。该方法在多个基准测试中表现出一致的改进，显著提高了所有正确率，并且人类评估者更偏好迭代方法而非并行基线。实验证据显示，在ConceptMix上的所有正确率提高了16.9%，在T2I-CompBench上的提高了13.8%，在Visual Jenga场景分解上的提高了12.5%，与计算匹配的并行采样相比。除了量化收益外，迭代细化还能生成更忠实的图像，人类评估者更偏好迭代方法，占58.7%。
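
The critic-in-the-loop idea can be sketched as a short control loop. The version below is a toy under stated assumptions: `generate` and `critique` are hypothetical callables standing in for a T2I model and a vision-language critic, and an "image" is modeled as the set of prompt constraints it satisfies. The abstract does not describe the authors' prompts, models, or stopping rule, so none of this should be read as their implementation.

```python
def iterative_refine(prompt, generate, critique, max_rounds):
    """Generate once, then refine until the critic has no complaints.

    generate(prompt, previous, feedback) -> image   (hypothetical signature)
    critique(prompt, image) -> list of unmet constraints (hypothetical signature)
    """
    image = generate(prompt, None, [])
    for _ in range(max_rounds - 1):
        feedback = critique(prompt, image)
        if not feedback:  # critic is satisfied, stop early
            break
        image = generate(prompt, image, feedback)
    return image


def toy_generate(prompt, previous, feedback):
    """Toy generator: the first draft meets one constraint; each refinement fixes two."""
    constraints = prompt.split(", ")
    if previous is None:
        return {constraints[0]}
    return previous | set(feedback[:2])


def toy_critique(prompt, image):
    """Toy critic: list the constraints the image does not yet satisfy."""
    return [c for c in prompt.split(", ") if c not in image]


prompt = "red cube, blue sphere, cube left of sphere, two cats"
result = iterative_refine(prompt, toy_generate, toy_critique, max_rounds=4)
remaining = toy_critique(prompt, result)
```

A compute-matched parallel baseline would instead call the generator `max_rounds` times independently and keep the best-scoring sample; the loop above spends the same number of generator calls sequentially, each conditioned on the critic's feedback.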

Walk through Paintings: Egocentric World Models from Internet Priors

Authors: Anurag Bagchi, Zhipeng Bao, Homanga Bharadhwaj, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert

First: 2026-01-21T18:59:32+00:00 · Latest: 2026-01-21T18:59:32+00:00

Abstract

What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.

中文标题/摘要

标题：行走于画作之间：基于互联网先验的主观世界模型

如果一个视频生成模型不仅能想象一个合理的未来，还能准确地反映每次动作后世界的变化？我们通过提出主观世界模型（EgoWM），一种简单且架构无关的方法来回答这一问题，该方法能够将任何预训练的视频扩散模型转化为动作条件下的世界模型，从而实现可控的未来预测。我们不是从头开始训练，而是利用互联网规模视频模型丰富的世界先验，并通过轻量级条件层注入运动命令。这使得模型能够忠实跟随动作，同时保持现实感和强大的泛化能力。我们的方法自然地扩展到不同的实体和动作空间，从3-自由度移动机器人到25-自由度类人机器人，其中预测以主观关节角度驱动的动力学要困难得多。该模型为导航和操作任务生成连贯的滚动，仅需适度微调。为了独立于视觉外观评估物理正确性，我们引入了结构一致性分数（SCS），衡量稳定场景元素是否与提供的动作一致地演变。EgoWM在先前最先进的导航世界模型上将SCS提高了高达80%，同时实现高达六倍的更低推理延迟，并在未见过的环境中表现出强大的泛化能力，包括在画作中导航。

Summary / 总结

The research aims to turn pretrained video diffusion models into action-conditioned world models that predict the correct future for a given sequence of actions. The Egocentric World Model (EgoWM) reuses the world priors of Internet-scale video models and injects motor commands through lightweight conditioning layers, so the model follows actions faithfully while preserving realism. EgoWM improves the Structural Consistency Score (SCS) by up to 80% over prior state-of-the-art navigation world models, with up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.

研究旨在通过将预训练的视频扩散模型转换为动作条件化的世界模型，生成准确的未来场景。Egocentric世界模型（EgoWM）利用互联网规模的视频先验和轻量级条件层来预测世界正确的发展状态，保持真实性和在不同实体上的泛化能力。EgoWM在导航任务中的结构一致性得分（SCS）上提高了高达80%，具有较低的推理延迟和对未见过的环境的鲁棒泛化能力，包括在画中导航。

Rethinking Video Generation Model for the Embodied World

Authors: Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou

First: 2026-01-21T18:59:18+00:00 · Latest: 2026-01-21T18:59:18+00:00

Comments: Github: https://github.com/DAGroup-PKU/ReVidgen/ Project website: https://dagroup-pku.github.io/ReVidgen.github.io/

Abstract

Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.

中文标题/摘要

标题：重新思考具身世界的视频生成模型

视频生成模型显著推进了具身智能的发展，解锁了生成捕捉感知、推理和行动的多样化机器人数据的新可能性。然而，合成能够准确反映真实世界机器人交互的高质量视频仍然具有挑战性，缺乏标准化基准限制了公平比较和进步。为解决这一差距，我们引入了一个全面的机器人基准RBench，旨在评估面向机器人的视频生成在五个任务领域和四种不同具身形式上的表现。它通过可重复的子指标评估任务级正确性和视觉保真度，包括结构一致性、物理合理性以及动作完整性。对25个代表性模型的评估突显了生成物理现实机器人行为的重大缺陷。此外，基准与人类评估的相关系数达到0.96，验证了其有效性。虽然RBench提供了识别这些缺陷所需的视角，但实现物理现实需要超越评估，解决高质量训练数据的严重短缺。基于这些见解，我们引入了一个改进的四阶段数据管道，产生了RoVid-X，这是最大的开源机器人视频生成数据集，包含400万标注视频片段，涵盖了数千个任务，并附有全面的物理属性注释。这一评估和数据协同生态系统共同为视频模型的严格评估和可扩展训练奠定了坚实基础，加速了具身AI向通用智能的演变。

Summary / 总结

The paper addresses the challenge of generating high-quality videos that accurately reflect real-world robotic interactions. It introduces RBench, a robotics benchmark that evaluates robot-oriented video generation across five task domains and four distinct embodiments. The benchmark assesses task-level correctness and visual fidelity through reproducible sub-metrics, and its scores reach a Spearman correlation coefficient of 0.96 with human evaluations. An evaluation of 25 representative models reveals significant deficiencies in generating physically realistic robot behaviors. To address the shortage of high-quality training data, the authors also introduce RoVid-X, which they describe as the largest open-source robotic dataset for video generation, with 4 million annotated video clips.

论文旨在解决生成能够准确反映真实世界机器人交互的高质量视频的挑战，引入了一个全面的机器人基准RBench，用于评估跨越五个任务领域和四种不同体态的机器人视频生成。该基准通过可重复的子指标评估任务级正确性和视觉保真度，并与人类评估实现了高斯皮尔曼相关系数。此外，作者还引入了RoVid-X，这是一个包含400万标注视频片段的大规模机器人数据集，旨在解决生成物理上逼真机器人行为所需的关键高质量训练数据短缺问题。
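
For readers unfamiliar with the agreement measure quoted above, the following sketch computes a Spearman rank correlation in plain Python: both score lists are converted to ranks (ties receive the average rank) and the Pearson correlation of the ranks is returned. The scores below are made up for illustration and are not RBench data.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical scores for five models: automatic metric vs. human rating.
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
human_scores = [1, 2, 3, 5, 4]
rho = spearman(metric_scores, human_scores)
```

Because only the ordering matters, a value near 1 means the benchmark ranks models almost the same way human raters do, regardless of the scale of either score.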

StableWorld: Towards Stable and Consistent Long Interactive Video Generation

Authors: Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang, Hubery Yin, Chen Li, Ziwei Liu, Chenyang Si

First: 2026-01-21T18:59:02+00:00 · Latest: 2026-01-21T18:59:02+00:00

Comments: 17 pages, 21 figures

Abstract

In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we first investigate the underlying causes of instability and identify that the major source of error accumulation originates from the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, StableWorld, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporally consistent interactive generation. Promising results on multiple interactive video models, e.g., Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.

中文标题/摘要

标题：StableWorld：朝向稳定和一致的长时交互视频生成

在本文中，我们探讨了交互视频生成中被忽视的稳定性和时间一致性挑战，该生成过程通过交互行为（如相机运动和文本提示）合成动态和可控的视频世界。尽管在世界建模方面取得了显著进展，但当前方法仍然遭受严重的不稳定性和时间退化问题，经常导致长时间交互过程中空间漂移和场景崩溃。为了更好地理解这一问题，我们最初调查了不稳定性的根本原因，并发现错误累积的主要来源是同一场景，其中生成的帧逐渐偏离初始干净状态，并将错误传播到后续帧。基于这一观察，我们提出了一种简单而有效的方法——StableWorld，一种动态帧移除机制。通过不断过滤掉退化的帧，同时保留几何上一致的帧，StableWorld 有效地在源头防止累积漂移，从而提高交互生成的稳定性和时间一致性。在多个交互视频模型（例如，Matrix-Game、Open-Oasis 和 Hunyuan-GameCraft）上的有希望的结果表明，StableWorld 是模型无关的，并且可以应用于不同的交互视频生成框架，以显著提高稳定性和时间一致性，并在各种交互场景中提高泛化能力。

Summary / 总结

The paper addresses the issue of stability and temporal consistency in interactive video generation, where current methods often suffer from spatial drift and scene collapse. It proposes StableWorld, a Dynamic Frame Eviction Mechanism, which filters out degraded frames while retaining geometrically consistent ones, thereby preventing cumulative drift and enhancing stability and temporal consistency. Experiments on Matrix-Game, Open-Oasis, and Hunyuan-GameCraft show that StableWorld improves stability and generalization across different interactive scenarios.

论文探讨了长时交互视频生成中的不稳定性和时间一致性问题。提出了一种动态帧移除机制StableWorld，该机制通过过滤掉退化的帧并保留几何上一致的帧，防止累积漂移。实验表明，StableWorld在不同交互视频生成框架中提高了稳定性和时间一致性，并增强了泛化能力。
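
The general shape of a frame eviction step can be sketched as a filter over the context window that conditions the next generated frame. This is a toy illustration only: the abstract does not specify how StableWorld scores frames, so `score`, `threshold`, and `max_context` are hypothetical, and frames are modeled as dictionaries with a made-up `drift` value.

```python
def evict_degraded(frames, score, threshold, max_context):
    """Keep frames whose score meets the threshold, then keep the most recent ones.

    `score` is a hypothetical callable returning a consistency score per frame;
    higher means the frame is closer to the clean scene.
    """
    if max_context < 1:
        raise ValueError("max_context must be at least 1")
    kept = [f for f in frames if score(f) >= threshold]
    return kept[-max_context:]


# Toy history: each frame carries an id and a made-up drift value in [0, 1].
history = [
    {"id": 0, "drift": 0.0},
    {"id": 1, "drift": 0.1},
    {"id": 2, "drift": 0.6},
    {"id": 3, "drift": 0.2},
    {"id": 4, "drift": 0.7},
    {"id": 5, "drift": 0.1},
]

context = evict_degraded(
    history,
    score=lambda f: 1.0 - f["drift"],  # assumed scoring rule for the toy
    threshold=0.5,
    max_context=3,
)
kept_ids = [f["id"] for f in context]
```

In a generation loop, a step like this would run before each new frame is produced, so degraded frames stop contributing to the conditioning context instead of passing their errors forward.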

MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs

Authors: Christoph Bartmann, Johannes Schimunek, Mykyta Ielanskyi, Philipp Seidl, Günter Klambauer, Sohvi Luukkonen

First: 2026-01-21T18:58:01+00:00 · Latest: 2026-01-21T18:58:01+00:00

Abstract

A molecule's properties are fundamentally determined by its composition and structure encoded in its molecular graph. Thus, reasoning about molecular properties requires the ability to parse and understand the molecular graph. Large Language Models (LLMs) are increasingly applied to chemistry, tackling tasks such as molecular name conversion, captioning, text-guided generation, and property or reaction prediction. Most existing benchmarks emphasize general chemical knowledge, rely on literature or surrogate labels that risk leakage or bias, or reduce evaluation to multiple-choice questions. We introduce MolecularIQ, a molecular structure reasoning benchmark focused exclusively on symbolically verifiable tasks. MolecularIQ enables fine-grained evaluation of reasoning over molecular graphs and reveals capability patterns that localize model failures to specific tasks and molecular structures. This provides actionable insights into the strengths and limitations of current chemistry LLMs and guides the development of models that reason faithfully over molecular structure.

中文标题/摘要

标题：MolecularIQ：通过分子图的符号验证表征化学推理能力

分子的性质从根本上由其分子图中的组成和结构决定。因此，关于分子性质的推理需要能够解析和理解分子图的能力。大型语言模型（LLMs）在化学领域中越来越被应用，处理诸如分子名称转换、配图、文本引导生成以及性质或反应预测等任务。现有的大多数基准测试侧重于通用化学知识，依赖于文献或可能泄露或带有偏见的替代标签，或者将评估简化为选择题。我们引入了MolecularIQ，这是一个专注于符号验证任务的分子结构推理基准测试。MolecularIQ能够对分子图上的推理进行精细评估，并揭示模型失败的具体任务和分子结构模式。这为当前化学LLM的优势和局限性提供了可操作的见解，并指导了能够忠实推理分子结构的模型的开发。

Summary / 总结

The research aims to evaluate the chemical reasoning capabilities of Large Language Models (LLMs) by introducing MolecularIQ, a benchmark that focuses on symbolically verifiable tasks. The method involves using this benchmark to assess the models' ability to reason over molecular graphs. Key findings show that models exhibit specific strengths and weaknesses in different tasks and molecular structures, highlighting the need for models that can reason accurately over molecular structures.

研究旨在通过引入MolecularIQ基准来评估大型语言模型（LLMs）的化学推理能力，该基准专注于符号可验证的任务。方法是使用此基准来评估模型在分子图上的推理能力。主要发现表明，模型在不同的任务和分子结构上表现出特定的优势和劣势，突显了需要能够准确推理分子结构的模型的重要性。

Robust Fake News Detection using Large Language Models under Adversarial Sentiment Attacks

Authors: Sahar Tahmasebi, Eric Müller-Budack, Ralph Ewerth

First: 2026-01-21T18:56:49+00:00 · Latest: 2026-01-21T18:56:49+00:00

Abstract

Misinformation and fake news have become a pressing societal challenge, driving the need for reliable automated detection methods. Prior research has highlighted sentiment as an important signal in fake news detection, either by analyzing which sentiments are associated with fake news or by using sentiment and emotion features for classification. However, this poses a vulnerability since adversaries can manipulate sentiment to evade detectors especially with the advent of large language models (LLMs). A few studies have explored adversarial samples generated by LLMs, but they mainly focus on stylistic features such as writing style of news publishers. Thus, the crucial vulnerability of sentiment manipulation remains largely unexplored. In this paper, we investigate the robustness of state-of-the-art fake news detectors under sentiment manipulation. We introduce AdSent, a sentiment-robust detection framework designed to ensure consistent veracity predictions across both original and sentiment-altered news articles. Specifically, we (1) propose controlled sentiment-based adversarial attacks using LLMs, (2) analyze the impact of sentiment shifts on detection performance. We show that changing the sentiment heavily impacts the performance of fake news detection models, indicating biases towards neutral articles being real, while non-neutral articles are often classified as fake content. (3) We introduce a novel sentiment-agnostic training strategy that enhances robustness against such perturbations. Extensive experiments on three benchmark datasets demonstrate that AdSent significantly outperforms competitive baselines in both accuracy and robustness, while also generalizing effectively to unseen datasets and adversarial scenarios.

中文标题/摘要

标题：使用大型语言模型在对抗情感攻击下的鲁棒假新闻检测

虚假信息和假新闻已成为一个紧迫的社会挑战，推动了可靠自动化检测方法的需求。先前的研究强调了情感在假新闻检测中的重要信号，无论是通过分析与假新闻相关的哪些情感，还是通过使用情感和情绪特征进行分类。然而，这存在一个漏洞，因为对手可以通过操纵情感来逃避检测，尤其是在大型语言模型（LLMs）出现之后。少数研究探讨了由LLMs生成的对抗样本，但它们主要集中在新闻出版者的写作风格等风格特征上。因此，情感操纵的关键漏洞仍然很大程度上未被探索。在本文中，我们研究了最先进的假新闻检测器在情感操纵下的鲁棒性。我们引入了AdSent，这是一种情感鲁棒的检测框架，旨在确保在原始和情感修改后的新闻文章中的一致性真实性预测。具体来说，我们（1）提出了使用LLMs的控制情感基于的对抗攻击，（2）分析了情感变化对检测性能的影响。我们表明，改变情感对假新闻检测模型的性能有重大影响，表明偏向中立文章被认为是真实的，而非中立的文章通常被分类为假内容。（3）我们引入了一种新的情感无关的训练策略，以增强对这种扰动的鲁棒性。在三个基准数据集上的广泛实验表明，AdSent在准确性和鲁棒性方面都显著优于竞争基线，同时也能有效地泛化到未见过的数据集和对抗场景中。

Summary / 总结

This paper addresses the vulnerability of fake news detection methods to sentiment manipulation, especially with the use of large language models. It introduces AdSent, a sentiment-robust detection framework that includes controlled adversarial attacks using LLMs and a sentiment-agnostic training strategy. The study shows that sentiment manipulation significantly affects the performance of fake news detection models, and AdSent outperforms existing methods in accuracy and robustness across multiple datasets and scenarios.

本文探讨了假新闻检测方法在情感操纵下的脆弱性，特别是在使用大型语言模型的情况下。它提出了一个情感鲁棒的检测框架AdSent，包括基于情感的可控对抗攻击和情感无关的训练策略。实验表明，改变情感对假新闻检测模型有显著影响，倾向于将中性文章分类为真实文章，而非中性文章分类为假信息。AdSent在多个数据集和对抗场景中均表现出更高的准确性和鲁棒性。

Evaluation of Large Language Models in Legal Applications: Challenges, Methods, and Future Directions

Authors: Yiran Hu, Huanghai Liu, Chong Wang, Kunran Li, Tien-Hsuan Wu, Haitao Li, Xinran Xu, Siqing Huo, Weihang Su, Ning Zheng, Siyuan Zheng, Qingyao Ai, Yun Liu, Renjun Bian, Yiqun Liu, Charles L. A. Clarke, Weixing Shen, Ben Kao

First: 2026-01-21T18:51:37+00:00 · Latest: 2026-01-21T18:51:37+00:00

Abstract

Large language models (LLMs) are being increasingly integrated into legal applications, including judicial decision support, legal practice assistance, and public-facing legal services. While LLMs show strong potential in handling legal knowledge and tasks, their deployment in real-world legal settings raises critical concerns beyond surface-level accuracy, involving the soundness of legal reasoning processes and trustworthy issues such as fairness and reliability. Systematic evaluation of LLM performance in legal tasks has therefore become essential for their responsible adoption. This survey identifies key challenges in evaluating LLMs for legal tasks grounded in real-world legal practice. We analyze the major difficulties involved in assessing LLM performance in the legal domain, including outcome correctness, reasoning reliability, and trustworthiness. Building on these challenges, we review and categorize existing evaluation methods and benchmarks according to their task design, datasets, and evaluation metrics. We further discuss the extent to which current approaches address these challenges, highlight their limitations, and outline future research directions toward more realistic, reliable, and legally grounded evaluation frameworks for LLMs in legal domains.

中文标题/摘要

标题：大型语言模型在法律应用中的评估：挑战、方法与未来方向

大型语言模型（LLMs）正越来越多地被集成到法律应用中，包括司法决策支持、法律实践辅助和面向公众的法律服务。尽管LLMs在处理法律知识和任务方面表现出强大的潜力，但在实际法律环境中的部署引发了超出表面准确性的关键问题，涉及法律推理过程的可靠性以及公平性和可靠性等信任问题。因此，系统评估LLMs在法律任务中的性能对于其负责任的采用变得至关重要。本文综述了基于实际法律实践评估LLMs的关键挑战。我们分析了评估LLMs在法律领域性能的主要困难，包括结果正确性、推理可靠性和可信度。基于这些挑战，我们根据任务设计、数据集和评估指标对现有评估方法和基准进行了回顾和分类。我们进一步讨论了当前方法在多大程度上解决了这些挑战，指出了它们的局限性，并概述了未来研究方向，以实现更现实、可靠且法律基础的评估框架，用于法律领域的LLMs。

Summary / 总结

This survey examines how large language models (LLMs) are evaluated in legal applications, rather than evaluating the models itself. It identifies the main difficulties in assessing LLMs for legal tasks, namely outcome correctness, reasoning reliability, and trustworthiness, and it reviews and categorizes existing evaluation methods and benchmarks by task design, datasets, and evaluation metrics. The survey discusses how far current approaches address these challenges, highlights their limitations, and outlines future research directions toward more realistic, reliable, and legally grounded evaluation frameworks.

论文评估了大型语言模型（LLMs）在法律应用中的性能，重点关注法律推理的正确性和可信度等挑战。研究识别了评估LLMs的关键难点，包括结果正确性和推理可靠性，并回顾了现有的评估方法和基准。研究指出现有方法的局限性，并提出了更现实和法律导向的评估框架的未来研究方向。

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Authors: Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati

First: 2025-12-22T18:59:34+00:00 · Latest: 2026-01-21T18:48:54+00:00

Comments: Project codebase: https://github.com/junzeye/validate-medcalc-labels

Abstract

We examine the reliability of a widely used clinical AI benchmark whose reference labels were partially generated by LLMs, and find that a substantial fraction are clinically misaligned. We introduce a phased stewardship procedure to amplify the positive impact of physician experts' feedback and then demonstrate, via a controlled RL experiment, how uncaught label bias can materially affect downstream LLM evaluation and alignment. Our results demonstrate that partially LLM-generated labels can embed systemic errors that distort not only evaluation but also downstream model alignment. By adopting a hybrid oversight system, we can prioritize scarce expert feedback to maintain benchmarks as living, clinically-grounded documents. Ensuring this alignment is a prerequisite for the safe deployment of LLMs in high-stakes medical decision support.

中文标题/摘要

标题：LLM辅助临床基准的可扩展监护与医师监督

我们检查了一个广泛使用的临床AI基准的可靠性，该基准的参考标签部分由LLM生成，并发现其中相当一部分与临床实践不符。我们引入了一种分阶段的监护程序，以放大医师专家反馈的积极影响，然后通过受控的强化学习实验，展示了未发现的标签偏差如何实质性地影响下游LLM的评估和对齐。我们的结果表明，部分由LLM生成的标签可能嵌入系统性错误，不仅扭曲了评估，还影响了下游模型的对齐。通过采用混合监督系统，我们可以优先考虑稀缺的专家反馈，使基准保持为活的、临床相关的文件。确保这种对齐是安全部署LLM于高风险医疗决策支持的前提。

Summary / 总结

The study examines the reliability of a widely used clinical AI benchmark whose reference labels were partially generated by LLMs, and finds that a substantial fraction of the labels are clinically misaligned. It introduces a phased stewardship procedure that amplifies the impact of physician experts' feedback. A controlled RL experiment then shows that uncaught label bias can materially affect downstream LLM evaluation and alignment. The authors argue that a hybrid oversight system can prioritize scarce expert feedback and keep benchmarks clinically grounded, which they consider a prerequisite for safe LLM deployment in high-stakes medical decision support.

研究考察了一个部分由LLM生成参考标签的临床AI基准的可靠性，发现其中相当一部分标签在临床应用上存在偏差。为此，引入了一种分阶段的监督程序，结合医生的反馈。通过受控的RL实验，研究证明未纠正的标签偏差会显著影响LLM的评估和对齐。研究结果表明，部分由LLM生成的标签会引入系统性错误，影响评估和模型对齐。采用混合监督系统有助于优先利用专家反馈，保持基准作为活的、临床相关的文档，这对于LLM在高风险医疗决策支持中的安全部署至关重要。

Beyond Automation: Rethinking Work, Creativity, and Governance in the Age of Generative AI

Authors: Haocheng Lin

First: 2025-12-09T20:25:24+00:00 · Latest: 2026-01-21T18:42:26+00:00

Comments: Improved structure and clarity of the introduction and literature review; explicit articulation of the paper's contributions; refined the integration of AI across labour, UBI, and governance

Abstract

The rapid expansion of generative artificial intelligence (AI) is transforming work, creativity, and economic security in ways that extend beyond automation and productivity. This paper examines four interconnected dimensions of contemporary AI deployment: (1) transformations in employment and task composition; (2) unequal diffusion of AI across sectors and socio-demographic groups; (3) the role of universal basic income (UBI) as a stabilising response to AI-induced volatility; and (4) the effects of model alignment and content governance on human creativity, autonomy, and decision-making. Using a hybrid approach that integrates labour market task exposure modelling, sectoral diffusion analysis, policy review, and qualitative discourse critique, the study develops an Inclusive AI Governance Framework. It introduces Level 1.5 autonomy as a human-centred design principle that preserves evaluative authority while enabling partial automation, and highlights evidence of creative regression and emergent sycophancy in newer model generations. The paper argues that UBI should be embedded within a broader socio-technical governance ecosystem encompassing skills development, proportional regulation, and creativity preservation.

中文标题/摘要

标题：超越自动化：在生成式AI时代重新思考工作、创造力与治理

生成式人工智能（AI）的迅速扩张正在以超越自动化和生产率的方式，改变工作、创造力和经济安全。本文探讨了当前AI部署的四个相互关联的维度：（1）就业和任务构成的变化；（2）AI在不同行业和社会人口群体中的不平等扩散；（3）普遍基本收入（UBI）作为对AI引发的波动性的一种稳定回应的作用；（4）模型对齐和内容治理对人类创造力、自主性和决策的影响。本文采用结合劳动力市场任务暴露建模、行业扩散分析、政策审查和定性话语批判的混合方法，构建了一个包容性AI治理框架。它引入了作为以人为本设计原则的1.5级自主性，既保留了评估权威，又允许部分自动化，并强调了新模型代际中创造性退化和依附行为的证据。文章认为，UBI 应嵌入更广泛的包含技能发展、比例监管和创造力保存的社技治理生态系统中。

DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration

Authors: Dominik Rößle, Xujun Xie, Adithya Mohan, Venkatesh Thirugnana Sambandham, Daniel Cremers, Torsten Schön

First: 2026-01-21T18:41:05+00:00 · Latest: 2026-01-21T18:41:05+00:00

Comments: Accepted to the IEEE Intelligent Vehicles Symposium 2026. For code and dataset, see https://github.com/cvims/DrivIng

Abstract

Perception is a cornerstone of autonomous driving, enabling vehicles to understand their surroundings and make safe, reliable decisions. Developing robust perception algorithms requires large-scale, high-quality datasets that cover diverse driving conditions and support thorough evaluation. Existing datasets often lack a high-fidelity digital twin, limiting systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluations. To address this gap, we present DrivIng, a large-scale multimodal dataset with a complete geo-referenced digital twin of a ~18 km route spanning urban, suburban, and highway segments. Our dataset provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization, captured across day, dusk, and night. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, yielding ~1.2 million annotated instances. Alongside the benefits of a digital twin, DrivIng enables a 1-to-1 transfer of real traffic into simulation, preserving agent interactions while enabling realistic and flexible scenario testing. To support reproducible research and robust validation, we benchmark DrivIng with state-of-the-art perception models and publicly release the dataset, digital twin, HD map, and codebase.

中文标题/摘要

标题：DrivIng：一种全面集成全数字孪生的大型多模态驾驶数据集

感知是自动驾驶的核心，使车辆能够理解其周围环境并做出安全可靠的决策。开发稳健的感知算法需要大规模、高质量的数据集，涵盖各种驾驶条件并支持全面评估。现有数据集往往缺乏高保真数字孪生，限制了系统测试、边缘案例模拟、传感器修改和模拟到现实的评估。为了解决这一差距，我们提出了DrivIng，一种包含约18公里路线完整地理参考数字孪生的大型多模态数据集，该路线跨越城市、郊区和高速公路段。我们的数据集提供了来自六个RGB摄像头、一个LiDAR和高精度ADMA基定位的连续记录，覆盖白天、黄昏和夜晚。所有序列以10 Hz的频率进行注释，涵盖12个类别的3D边界框和跟踪ID，总计约120万注释实例。除了数字孪生的优势，DrivIng还允许将真实交通1:1地转移到模拟中，同时保留代理交互并实现真实和灵活的场景测试。为了支持可重复研究和稳健验证，我们使用最先进的感知模型对DrivIng进行了基准测试，并公开了数据集、数字孪生、高清地图和代码库。

Summary / 总结

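DrivIng is a large-scale multimodal driving dataset with a complete geo-referenced digital twin of a ~18 km route spanning urban, suburban, and highway segments, designed to support the development of robust perception algorithms for autonomous driving. It provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization, captured across day, dusk, and night. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, yielding ~1.2 million annotated instances. The digital twin enables a 1-to-1 transfer of real traffic into simulation, which supports realistic and flexible scenario testing and sim-to-real evaluation.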

研究旨在通过创建包含完整数字孪生的大型多模态数据集DrivIng，来开发自主驾驶的稳健感知算法。该数据集涵盖了18公里路线的多种驾驶条件，包括来自六个RGB摄像头、一个LiDAR和高精度定位的连续记录，并对3D边界框和轨迹ID进行了标注。主要发现包括超过120万条标注实例，能够实现真实的灵活场景测试和模拟到现实的评估。

The Effect of Scripts and Formats on LLM Numeracy

Authors: Varshini Reddy, Craig W. Schmidt, Seth Ebner, Adam Wiemerslage, Yuval Pinter, Chris Tanner

First: 2026-01-21T18:33:15+00:00 · Latest: 2026-01-21T18:33:15+00:00

Abstract

Large language models (LLMs) have achieved impressive proficiency in basic arithmetic, rivaling human-level performance on standard numerical tasks. However, little attention has been given to how these models perform when numerical expressions deviate from the prevailing conventions present in their training corpora. In this work, we investigate numerical reasoning across a wide range of numeral scripts and formats. We show that LLM accuracy drops substantially when numerical inputs are rendered in underrepresented scripts or formats, despite the underlying mathematical reasoning being identical. We further demonstrate that targeted prompting strategies, such as few-shot prompting and explicit numeral mapping, can greatly narrow this gap. Our findings highlight an overlooked challenge in multilingual numerical reasoning and provide actionable insights for working with LLMs to reliably interpret, manipulate, and generate numbers across diverse numeral scripts and formatting styles.

中文标题/摘要

标题：数字符号系统和格式对LLM数值能力的影响

大型语言模型（LLMs）在基本算术方面取得了令人印象深刻的成就，其在标准数值任务上的表现与人类水平相当。然而，很少有人关注当数值表达偏离其训练语料库中占主导地位的惯例时，这些模型的表现如何。在本研究中，我们探讨了不同数字符号系统和格式下的数值推理。我们发现，当数值输入以未充分代表的符号系统或格式呈现时，LLM的准确性会显著下降，尽管其背后的数学推理是相同的。我们进一步证明，有针对性的提示策略，如少量示例提示和显式数字符号映射，可以大大缩小这一差距。我们的研究结果突显了多语言数值推理中一个被忽视的挑战，并为如何可靠地处理LLM以跨不同数字符号系统和格式风格解读、操作和生成数字提供了可操作的见解。

Summary / 总结

This study examines how large language models (LLMs) perform on numerical reasoning when numbers are written in numeral scripts or formats that are underrepresented in their training data. Accuracy drops substantially in these cases even though the underlying mathematics is identical, and targeted prompting strategies such as few-shot prompting and explicit numeral mapping greatly narrow the gap. The study highlights multilingual numerical reasoning as an overlooked challenge for LLMs.

研究考察了大型语言模型（LLMs）在以较少见的数字表示或格式呈现数字表达式时的数值推理表现。研究发现，当数字表达式使用较少见的表示方式时，LLM的准确性会显著下降，尽管背后的数学推理是相同的。研究还表明，通过少量示例提示和明确的数字映射等有针对性的提示策略，可以显著提高LLM的表现。这些发现强调了在LLMs中更好地处理多语言数值推理的需求。
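
The "explicit numeral mapping" strategy mentioned in the numeracy abstract above can be illustrated with a minimal sketch. The abstract does not describe the authors' prompts, so the function names and the prompt template here are hypothetical; the sketch only assumes that mapping means rewriting digits from another script as ASCII digits before asking the model.

```python
import unicodedata


def normalize_digits(text: str) -> str:
    """Replace decimal digits from any script with ASCII digits."""
    out = []
    for ch in text:
        # unicodedata.decimal returns the digit value for decimal-digit
        # characters (Devanagari, Bengali, ...) and the default otherwise.
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


def build_prompt(question: str) -> str:
    """Hypothetical template: show the original question and its mapped form."""
    mapped = normalize_digits(question)
    return f"Question: {question}\nWith ASCII digits: {mapped}\nAnswer:"
```

For instance, `normalize_digits("१२ + ৩৪")` (Devanagari and Bengali digits) returns `"12 + 34"`. Non-digit characters and formatting conventions such as separators are left untouched, so this covers scripts only, not formats.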

FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion

Authors: Zichen Xi, Hao-Xiang Chen, Nan Xue, Hongyu Yan, Qi-Yuan Feng, Levent Burak Kara, Joaquim Jorge, Qun-Ce Xu

First: 2026-01-21T18:32:27+00:00 · Latest: 2026-01-21T18:32:27+00:00

Comments: Under Review

Abstract

Semantic Scene Completion (SSC) from monocular RGB images is a fundamental yet challenging task due to the inherent ambiguity of inferring occluded 3D geometry from a single view. While feed-forward methods have made progress, they often struggle to generate plausible details in occluded regions and preserve the fundamental spatial relationships of objects. Such accurate generative reasoning capability for the entire 3D space is critical in real-world applications. In this paper, we present FlowSSC, the first generative framework applied directly to monocular semantic scene completion. FlowSSC treats the SSC task as a conditional generation problem and can seamlessly integrate with existing feed-forward SSC methods to significantly boost their performance. To achieve real-time inference without compromising quality, we introduce Shortcut Flow-matching that operates in a compact triplane latent space. Unlike standard diffusion models that require hundreds of steps, our method utilizes a shortcut mechanism to achieve high-fidelity generation in a single step, enabling practical deployment in autonomous systems. Extensive experiments on SemanticKITTI demonstrate that FlowSSC achieves state-of-the-art performance, significantly outperforming existing baselines.

中文标题/摘要

标题：FlowSSC：通过一步潜扩散实现通用生成单目语义场景完成

从单目RGB图像中完成语义场景（SSC）是一项基本但具有挑战性的任务，因为从单个视角推断被遮挡的3D几何形状存在固有的不确定性。尽管前馈方法已经取得进展，但在生成被遮挡区域的合理细节和保持物体基本空间关系方面仍然存在困难。这种对整个3D空间的准确生成推理能力在实际应用中至关重要。在本文中，我们提出了FlowSSC，这是第一个直接应用于单目语义场景完成的生成框架。FlowSSC将SSC任务视为条件生成问题，并可以无缝集成到现有的前馈SSC方法中，显著提升其性能。为了在不牺牲质量的情况下实现实时推理，我们引入了捷径流匹配机制，该机制在紧凑的三平面潜空间中操作。与需要数百步的标准扩散模型不同，我们的方法利用捷径机制在一步中实现高保真生成，使其实用部署在自主系统中成为可能。在SemanticKITTI上的大量实验表明，FlowSSC达到了最先进的性能，显著优于现有基线。

Summary / 总结

FlowSSC is a generative framework for monocular semantic scene completion that addresses the challenge of inferring occluded 3D geometry from a single view. It treats the task as a conditional generation problem and integrates with existing feed-forward methods to enhance their performance. FlowSSC introduces Shortcut Flow-matching in a compact triplane latent space, allowing for high-fidelity generation in a single step, which is crucial for real-time applications. Experiments on SemanticKITTI show that FlowSSC outperforms existing methods, achieving state-of-the-art performance.

FlowSSC 是一种用于单目语义场景完成的生成框架，旨在从单个视图推断被遮挡的3D几何结构。它将 SSC 任务视为条件生成问题，并与现有的前馈方法集成以增强其性能。FlowSSC 使用紧凑的三平面潜空间中的捷径机制，在单步中实现高保真生成，从而实现实时推理。在 SemanticKITTI 上的实验表明，FlowSSC 在性能和质量方面优于现有方法。

Taxonomy-Aligned Risk Extraction from 10-K Filings with Autonomous Improvement Using LLMs

Authors: Rian Dolphin, Joe Dursun, Jarrett Blankenship, Katie Adams, Quinton Pike

First: 2026-01-21T18:28:31+00:00 · Latest: 2026-01-21T18:28:31+00:00

Comments: 4 figures, 9 pages

Abstract

We present a methodology for extracting structured risk factors from corporate 10-K filings while maintaining adherence to a predefined hierarchical taxonomy. Our three-stage pipeline combines LLM extraction with supporting quotes, embedding-based semantic mapping to taxonomy categories, and LLM-as-a-judge validation that filters spurious assignments. To evaluate our approach, we extract 10,688 risk factors from S&P 500 companies and examine risk profile similarity across industry clusters. Beyond extraction, we introduce autonomous taxonomy maintenance where an AI agent analyzes evaluation feedback to identify problematic categories, diagnose failure patterns, and propose refinements, achieving 104.7% improvement in embedding separation in a case study. External validation confirms the taxonomy captures economically meaningful structure: same-industry companies exhibit 63% higher risk profile similarity than cross-industry pairs (Cohen's d=1.06, AUC 0.82, p<0.001). The methodology generalizes to any domain requiring taxonomy-aligned extraction from unstructured text, with autonomous improvement enabling continuous quality maintenance and enhancement as systems process more documents.

中文标题/摘要

标题：基于LLM的10-K申报文件分类法对齐风险提取及其自主改进

我们提出了一种方法，用于从企业10-K申报文件中提取结构化风险因素，同时保持对预定义层次分类法的遵从性。我们的三阶段管道结合了LLM提取、支持引文、基于嵌入的语义映射到分类法类别以及LLM作为裁判的验证，以过滤虚假分配。为了评估我们的方法，我们从标普500公司中提取了10,688个风险因素，并检查了不同行业集群的风险概况相似性。除了提取之外，我们还引入了自主分类法维护，其中AI代理分析评估反馈以识别有问题的类别、诊断失败模式并提出改进措施，在案例研究中实现了104.7%的嵌入分离改进。外部验证确认分类法捕获了经济上有意义的结构：同一行业内公司之间的风险概况相似性比跨行业配对高出63%（Cohen's d=1.06，AUC 0.82，p<0.001）。该方法可以应用于任何需要从非结构化文本中进行分类法对齐提取的领域，自主改进使系统能够持续进行质量维护和增强。

Summary / 总结

This study proposes a three-stage pipeline for extracting structured risk factors from corporate 10-K filings while adhering to a predefined taxonomy. The pipeline uses LLMs for extraction and validation, and embedding-based semantic mapping. The approach was evaluated by extracting 10,688 risk factors from S&P 500 companies, showing that same-industry companies have 63% higher risk profile similarity than cross-industry pairs. Additionally, an AI agent was used for autonomous taxonomy maintenance, improving embedding separation by 104.7%. This methodology can be applied to any domain requiring taxonomy-aligned extraction from unstructured text, with continuous quality improvement through autonomous maintenance.

该研究提出了一种三阶段管道，用于从公司10-K报告中提取结构化的风险因素，同时遵循预定义的分类体系。管道使用了LLM进行提取和验证，并通过嵌入式语义映射。该方法通过从S&P 500公司提取10,688个风险因素进行了评估，结果显示同一行业内公司的风险特征相似度比跨行业的公司高63%。此外，还使用AI代理进行自主分类维护，提高了嵌入式分离度104.7%。该方法可以应用于任何需要从非结构化文本中进行分类对齐提取的领域，并通过自主维护实现持续的质量改进。
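
The second stage of the pipeline described above, embedding-based semantic mapping to taxonomy categories, can be sketched as a nearest-category lookup. This is a minimal illustration under stated assumptions, not the authors' implementation: the vectors are toy values standing in for real text embeddings, and `map_to_taxonomy` and its `min_similarity` threshold are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def map_to_taxonomy(risk_vec, category_vecs, min_similarity=0.5):
    """Return (best_category, similarity); category is None below the threshold."""
    best_name, best_sim = None, -1.0
    for name, vec in category_vecs.items():
        sim = cosine(risk_vec, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim < min_similarity:
        # Leave weak matches unassigned instead of forcing a category.
        return None, best_sim
    return best_name, best_sim
```

In the described pipeline a later LLM-as-a-judge step filters spurious assignments; the threshold here is only a crude stand-in for that idea.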

On the Reliability and Stability of Selective Methods in Malware Classification Tasks

Authors: Alexander Herzog, Aliai Eusebi, Lorenzo Cavallaro

First: 2025-05-28T20:22:43+00:00 · Latest: 2026-01-21T18:26:18+00:00

Abstract

The performance figures of modern drift-adaptive malware classifiers appear promising, but does this translate to genuine operational reliability? The standard evaluation paradigm primarily focuses on baseline performance metrics, neglecting confidence-error alignment and operational stability. While prior works established the importance of temporal evaluation and introduced selective classification in malware classification tasks, we take a complementary direction by investigating whether malware classifiers maintain reliable and stable confidence estimates under distribution shifts and exploring the tensions between scientific advancement and practical impacts when they do not. We propose Aurora, a framework to evaluate malware classifiers based on their confidence quality and operational resilience. Aurora subjects the confidence profile of a given model to verification to assess the reliability of its estimates. Unreliable confidence estimates erode operational trust, waste valuable annotation budgets on non-informative samples for active learning, and leave error-prone instances undetected in selective classification. Aurora is further complemented by a set of metrics designed to go beyond point-in-time performance, striving towards a more holistic assessment of operational stability throughout temporal evaluation periods. The fragility we observe in SOTA frameworks across datasets of varying drift severity suggests it may be time to revisit the underlying assumptions.

中文标题/摘要

标题：选择性方法在恶意软件分类任务中的可靠性和稳定性

现代漂移自适应恶意软件分类器的性能指标看起来很有前景，但这是否意味着实际操作中的可靠性？标准评估范式主要关注基线性能指标，忽视了置信度-错误匹配和操作稳定性。尽管先前的研究确立了时间评估的重要性并引入了选择性分类，我们则采取了互补的方向，探讨恶意软件分类器在分布变化下是否能保持可靠的和稳定的置信度估计，并探索当它们不能时，科学进步与实际影响之间的张力。我们提出了Aurora框架，基于置信度质量和操作韧性来评估恶意软件分类器。Aurora对给定模型的置信度特征进行验证，以评估其估计的可靠性。不可靠的置信度估计会侵蚀操作信任，浪费有价值的注释预算在非信息性样本上的主动学习上，并在选择性分类中遗漏错误实例。Aurora还通过一系列旨在超越单一时间点性能的度量，朝着更全面的评估操作稳定性努力。我们在不同漂移严重程度的数据集上观察到的脆弱性表明，可能需要重新审视基础假设。

Summary / 总结

The study investigates the reliability and stability of modern drift-adaptive malware classifiers, focusing on their confidence estimates under distribution shifts. It introduces Aurora, a framework that evaluates malware classifiers based on their confidence quality and operational resilience. Key findings show that unreliable confidence estimates can erode operational trust and lead to wasted annotation budgets and undetected errors. The study highlights the need to reassess underlying assumptions in state-of-the-art frameworks.

研究探讨了现代适应性漂移恶意软件分类器在分布变化下的可靠性和稳定性，重点关注其置信度估计。提出了Aurora框架，该框架基于分类器的置信度质量和运营韧性对其进行评估。关键发现表明，不可靠的置信度估计会损害运营信任，导致标注预算的浪费和错误未被检测到。研究强调需要重新审视最先进的框架的基本假设。
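
Selective classification, which the Aurora abstract above builds on, lets a classifier abstain when its confidence is low. The sketch below computes the two standard quantities for a given threshold: coverage (fraction of samples accepted) and selective risk (error rate among accepted samples). It is a generic illustration, not one of Aurora's metrics, and the function name is hypothetical.

```python
def selective_metrics(confidences, correct, threshold):
    """Return (coverage, selective_risk) for a confidence threshold.

    confidences: per-sample confidence scores
    correct:     per-sample booleans, True if the prediction was right
    """
    accepted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(accepted) / len(confidences) if confidences else 0.0
    # Selective risk: share of accepted predictions that are wrong.
    risk = (1 - sum(accepted) / len(accepted)) if accepted else 0.0
    return coverage, risk
```

If confidence estimates are unreliable under drift, raising the threshold lowers coverage without lowering risk, which is the failure mode the abstract describes as leaving error-prone instances undetected.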

Diffusion In Diffusion: Reclaiming Global Coherence in Semi-Autoregressive Diffusion

Authors: Linrui Ma, Yufei Cui, Kai Han, Yunhe Wang

First: 2026-01-20T05:00:26+00:00 · Latest: 2026-01-21T18:21:39+00:00

Comments: Work In Progress

Abstract

One of the most compelling features of global discrete diffusion language models is their global bidirectional contextual capability. However, existing block-based diffusion studies tend to introduce autoregressive priors, which, while offering benefits, can cause models to lose this global coherence at the macro level. To regain global contextual understanding while preserving the advantages of the semi-autoregressive paradigm, we propose Diffusion in Diffusion, a 'draft-then-refine' framework designed to overcome the irreversibility and myopia problems inherent in block diffusion models. Our approach first employs block diffusion to generate rapid drafts using small blocks, then refines these drafts through global bidirectional diffusion with a larger bidirectional receptive field. We utilize snapshot confidence remasking to identify the most critical tokens that require modification, and apply mix-scale training to expand the block diffusion model's global capabilities. Empirical results demonstrate that our approach sets a new benchmark for discrete diffusion models on the OpenWebText dataset. Using only 26% of the fine-tuning budget of baseline models, we reduce generative perplexity from 25.7 to 21.9, significantly narrowing the performance gap with autoregressive models.

中文标题/摘要

标题：扩散中的扩散：在半自回归扩散中重新获得全局一致性

全局离散扩散语言模型最引人注目的特征之一是其全局双向上下文能力。然而，现有的块基扩散研究倾向于引入自回归先验，虽然这提供了某些优势，但也会导致模型在宏观层面上失去全局一致性。为了在保持半自回归范式优势的同时重新获得全局上下文理解，我们提出了扩散中的扩散，这是一种“先草拟后润色”的框架，旨在克服块扩散模型固有的不可逆性和短视问题。我们的方法首先使用小块的块扩散生成快速草稿，然后通过具有更大双向感受野的全局双向扩散对这些草稿进行润色。我们利用快照置信度重新遮盖来识别需要修改的最关键令牌，并采用多尺度训练来扩展块扩散模型的全局能力。实验证明，我们的方法在OpenWebText数据集上为离散扩散模型设定了新的基准。仅使用基线模型微调预算的26%，我们使生成困惑度从25.7降低到21.9，显著缩小了与自回归模型的性能差距。

Summary / 总结

The paper aims to address the loss of global coherence in block-based diffusion models by proposing Diffusion in Diffusion, a 'draft-then-refine' framework. This method uses block diffusion to generate initial drafts and then refines them through global bidirectional diffusion. Key findings show that this approach achieves a new benchmark on the OpenWebText dataset, reducing generative perplexity from 25.7 to 21.9 with only 26% of the fine-tuning budget of baseline models.

研究旨在通过提出Diffusion in Diffusion框架解决块基扩散模型中的全局一致性丧失问题，该方法采用块扩散进行快速草稿生成，以及全局双向扩散进行细化，使用快照置信度重新遮罩和多尺度训练。实验结果显示，该方法在OpenWebText数据集上达到了新的基准，将生成困惑度从25.7降低到21.9，仅使用基线模型26%的微调预算。

Metadata Conditioned Large Language Models for Localization

Authors: Anjishnu Mukherjee, Ziwei Zhu, Antonios Anastasopoulos

First: 2026-01-21T18:20:59+00:00 · Latest: 2026-01-21T18:20:59+00:00

Comments: under review

Abstract

Large language models are typically trained by treating text as a single global distribution, often resulting in geographically homogenized behavior. We study metadata conditioning as a lightweight approach for localization, pre-training 31 models (at 0.5B and 1B parameter scales) from scratch on large-scale English news data annotated with verified URLs, country tags, and continent tags, covering 4 continents and 17 countries. Across four controlled experiments, we show that metadata conditioning consistently improves in-region performance without sacrificing cross-region generalization, enables global models to recover localization comparable to region-specific models, and improves learning efficiency. Our ablation studies demonstrate that URL-level metadata alone captures much of the geographic signal, while balanced regional data coverage remains essential, as metadata cannot fully compensate for missing regions. Finally, we introduce a downstream benchmark of 800 localized news MCQs and show that after instruction tuning, metadata conditioned global models achieve accuracy comparable to LLaMA-3.2-1B-Instruct, despite being trained on substantially less data. Together, these results establish metadata conditioning as a practical and compute-efficient approach for localization of language models.

中文标题/摘要

标题：基于元数据条件的大规模语言模型用于本地化

大规模语言模型通常以将文本视为单一全局分布的方式进行训练，这往往导致地理上的同质化行为。我们研究元数据条件作为轻量级的本地化方法，从大规模英语新闻数据中预训练了31个模型（参数规模为0.5B和1B），这些数据带有验证过的URL、国家标签和大陆标签，覆盖了4个大陆和17个国家。在四项受控实验中，我们展示了元数据条件在提高区域内性能的同时不会牺牲跨区域泛化能力，使全球模型能够恢复与区域特定模型相当的本地化能力，并提高了学习效率。我们的消融研究显示，URL级别的元数据本身捕捉到了大部分地理信号，而平衡的区域数据覆盖仍然是必不可少的，因为元数据无法完全弥补缺失的区域。最后，我们引入了一个包含800个本地化新闻选择题的下游基准测试，并展示了在指令调优后，元数据条件的全球模型在准确率上与LLaMA-3.2-1B-Instruct相当，尽管它们的训练数据量要少得多。这些结果共同确立了元数据条件作为一种实用且计算高效的语言模型本地化方法。

Summary / 总结

The study addresses the geographically homogenized behavior of large language models by exploring metadata conditioning. The researchers pre-trained 31 models at 0.5B and 1B parameters from scratch on English news data annotated with verified URLs, country tags, and continent tags, covering 4 continents and 17 countries. Across four controlled experiments, metadata conditioning improved in-region performance without harming cross-region generalization, allowed global models to match the localization of region-specific models, and improved learning efficiency. Ablation studies indicated that URL-level metadata alone captures much of the geographic signal, while balanced regional data coverage remains essential because metadata cannot fully compensate for missing regions. After instruction tuning, the metadata-conditioned global models achieved accuracy comparable to LLaMA-3.2-1B-Instruct on a downstream benchmark of 800 localized news MCQs, despite being trained on substantially less data.

研究旨在通过探索元数据条件化来解决大型语言模型的地理同质化问题。研究人员在带有URL、国家标签和大陆标签的英语新闻数据上预训练了31个模型，参数规模为0.5B和1B。实验表明，元数据条件化提高了地区内的性能，同时没有损害跨地区的泛化能力，使全球模型能够达到地区特定模型的本地化水平，并提高了学习效率。消融研究显示，URL级别的元数据至关重要，而平衡的区域数据覆盖是必不可少的，因为元数据无法完全弥补缺失的区域。最终，经过指令微调后，元数据条件化的模型在800个本地化新闻选择题下游基准测试中达到了与LLaMA-3.2-1B-Instruct相当的准确性，尽管使用的数据较少。
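
Metadata conditioning, as described in the abstract above, means the model sees fields such as the source URL, country, and continent alongside each training document. The abstract does not say how the fields are encoded, so the tag format and the function name below are assumptions made purely for illustration.

```python
def condition_on_metadata(text, url=None, country=None, continent=None):
    """Prepend available metadata fields to a document as simple tags.

    The <field>value</field> format is a hypothetical choice, not the
    encoding used in the paper.
    """
    fields = [("url", url), ("country", country), ("continent", continent)]
    # Skip missing fields so partially annotated documents still work.
    header = "".join(f"<{key}>{value}</{key}>\n" for key, value in fields if value)
    return header + text
```

Dropping individual arguments mirrors the kind of ablation the abstract reports, for example passing only `url` to test how much geographic signal URL-level metadata carries on its own.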

LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation

Authors: Stergios Chatzikyriakidis, Anastasia Natsina

First: 2026-01-14T17:05:17+00:00 · Latest: 2026-01-21T18:18:15+00:00

Abstract

Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0, and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant "Reasoning Gap": while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.

中文标题/摘要

标题：LLMs有韵律感吗？希腊诗歌押韵检测与生成的混合音韵过滤系统

大型语言模型（LLMs），尽管在NLP任务中表现出色，但在音韵学相关的现象如押韵检测和生成方面却表现不佳。这在资源较少的语言如现代希腊语中表现得尤为明显。本文提出了一种结合LLMs和确定性音韵算法的混合系统，以实现准确的押韵识别/分析和生成。我们的方法涵盖了希腊押韵类型的全面分类，包括纯押韵、丰富押韵、不完全押韵、镶嵌押韵和同前韵母（IDV）模式，并采用具有音韵验证的主动生成管道。我们评估了多种提示策略（零样本、少量样本、思考链和RAG增强）在多个LLMs中的表现，包括Claude 3.7和4.5、GPT-4o、Gemini 2.0和开源模型如Llama 3.1 8B和70B、Mistral Large。结果揭示了一个显著的“推理差距”：虽然原汁原味的模型（Claude 3.7）表现直观（识别准确率为40%），但依赖推理的模型（Claude 4.5）仅在使用思考链提示时才能达到最先进的性能（54%）。最关键的是，纯LLM生成表现灾难性（有效诗歌少于4%），而我们的混合验证循环将性能恢复到73.1%。我们发布了我们的系统和一个关键的、严格清洗的40,000多条押韵语料库，来自Anemoskala和战间诗歌语料库，以支持未来的研究。

Summary / 总结

This paper addresses rhyme detection and generation in Modern Greek, a lower-resource language, with a hybrid system that combines Large Language Models (LLMs) with deterministic phonological algorithms. The system implements a comprehensive taxonomy of Greek rhyme types and an agentic generation pipeline with phonological verification. Across several LLMs and prompting strategies, a native-like model (Claude 3.7) reaches 40% identification accuracy, while a reasoning-heavy model (Claude 4.5) reaches the best result of 54% only with Chain-of-Thought prompting. Pure LLM generation yields under 4% valid poems, and the hybrid verification loop raises this to 73.1%.

本文提出了一种结合大型语言模型（LLMs）和音韵算法的混合系统，以应对现代希腊语（一种低资源语言）中的押韵检测和生成挑战。该系统通过希腊押韵类型的全面分类和带有音韵验证的生成管道实现了准确的押韵识别和生成。评估结果显示，虽然原生态模型表现尚可，但推理密集型模型在使用链式思考提示时达到了最先进的性能。混合验证循环显著提高了LLM生成性能，使其恢复到73.1%的有效诗歌比例。
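
The hybrid verification loop in the rhyme paper above pairs a generator with a deterministic checker. A minimal sketch of that generate-then-verify pattern follows. Everything here is hypothetical: `generate` stands in for an LLM call, and `toy_rhymes` compares spelling suffixes only, which is far cruder than the phonological algorithms the abstract describes.

```python
def toy_rhymes(line_a, line_b, suffix_len=2):
    """Stand-in check: last words differ but share a spelling suffix."""
    words_a, words_b = line_a.split(), line_b.split()
    if not words_a or not words_b:
        return False
    a, b = words_a[-1].lower(), words_b[-1].lower()
    return a != b and a[-suffix_len:] == b[-suffix_len:]


def generate_verified(generate, verify, max_attempts=5):
    """Call generate(attempt) until verify accepts a candidate.

    Returns (candidate, attempts_used), or (None, max_attempts) on failure.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate, attempt
    return None, max_attempts
```

The design point is that acceptance is decided by a rule-based verifier rather than by the model itself, so invalid candidates are rejected deterministically.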

PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs

Authors: Manuel Frank, Haithem Afli

First: 2025-10-08T07:37:19+00:00 · Latest: 2026-01-21T18:03:45+00:00

Abstract

Current sentence embedding evaluations typically rely on static test beds like the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported scores and obscure real-world robustness. We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs. Using a cost-efficient LLM-based method grounded in gold ratings and human validation, we show that LLMs generate token-diverse but semantically preserving paraphrases. Across 7 MTEB tasks, we validate our hypothesis that the performance of sentence encoders is sensitive to changes in token space even when semantics remain fixed. We also observe that smaller models are not disproportionately affected relative to larger ones. Our results are statistically robust over multiple runs spanning 20 datasets and 25 languages. More generally, we aim to propose a new evaluation paradigm in NLP that relies less on static, pre-defined benchmarks but shifts towards dynamic, stochastic evaluation leveraging eval-time compute.

中文标题/摘要

标题：PTEB：通过使用LLMs在评估时进行随机同义替换以实现稳健的文本嵌入评估

当前的句子嵌入评估通常依赖于像大规模文本嵌入基准（MTEB）这样的静态测试集。虽然这些基准非常有价值，但反复在固定的一套测试集上调整可能会夸大报告的分数，并掩盖实际的稳健性。我们引入了同义替换文本嵌入基准（PTEB），这是一种动态协议，在评估时随机生成保持意义的同义替换，并在多次运行中汇总结果。使用一种基于黄金评分和人工验证的成本效益高的LLM方法，我们展示了LLM生成具有多样性的但语义保持的同义替换。在7个MTEB任务中，我们验证了我们的假设：即使语义保持不变，句子编码器的表现对词汇空间的变化也敏感。我们还观察到，较小的模型相对于较大的模型并没有不成比例地受到影响。我们的结果在跨越20个数据集和25种语言的多次运行中具有统计稳健性。更广泛地说，我们旨在提出一种新的NLP评估范式，该范式依赖于动态的、随机的评估，而不是静态的、预定义的基准，利用评估时的计算能力。

Summary / 总结

The research aims to address the issue of inflated scores in sentence embedding evaluations by introducing the Paraphrasing Text Embedding Benchmark (PTEB). PTEB uses a cost-efficient LLM-based method to generate meaning-preserving paraphrases dynamically at evaluation time and aggregates results across multiple runs. The study finds that sentence encoders' performance is sensitive to changes in token space even when semantics remain fixed, and smaller models are not disproportionately affected compared to larger ones. The results are statistically robust across 20 datasets and 25 languages.

研究旨在通过引入Paraphrasing Text Embedding Benchmark (PTEB)，在评估时生成保持意义的同义句，以解决当前句子嵌入评估中得分虚高的问题。研究使用成本效益高的基于LLM的方法生成同义句，并展示了当语义保持不变时，句子编码器的性能会因词元空间的变化而显著不同。结果表明，较小的模型与较大的模型相比，并没有受到不成比例的影响，且这些发现跨越了多个数据集和25种语言，在多次运行中具有统计上的稳健性。
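
The PTEB protocol described above paraphrases inputs at evaluation time and aggregates scores over several runs. The sketch below shows that loop in its simplest form. It is an assumption-laden illustration: `paraphrase_fn` stands in for the LLM paraphraser, `score_fn` for a task metric, and both names and signatures are hypothetical.

```python
import random
import statistics


def evaluate_with_paraphrases(score_fn, paraphrase_fn, sentences, n_runs=5, seed=0):
    """Score freshly paraphrased inputs over n_runs; return (mean, stdev)."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    rng = random.Random(seed)  # seeded so the stochastic runs are repeatable
    run_scores = []
    for _ in range(n_runs):
        # Each run draws a new paraphrase of every sentence.
        paraphrased = [paraphrase_fn(s, rng) for s in sentences]
        run_scores.append(score_fn(paraphrased))
    spread = statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0
    return statistics.mean(run_scores), spread
```

Reporting the spread alongside the mean is what distinguishes this from a single pass over a static test set: a model whose score varies widely across paraphrase draws is sensitive to token-level changes even when meaning is fixed.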

How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains

Authors: Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Ivan Brugere, Charese H. Smiley, Kundan Thind, Mohammad M. Ghassemi

First: 2026-01-13T01:55:48+00:00 · Latest: 2026-01-21T18:03:19+00:00

Comments: Accepted to the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026) main conference

Abstract

The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), while structurally-aware models yield the best ECE (0.148), with no single method dominating both. Furthermore, we find that increased architectural complexity does not reliably outperform simpler sequential baselines, suggesting a performance ceiling for methods relying solely on chunk-level hidden states. This work provides the most comprehensive benchmark for this task to date, establishing rigorous baselines and demonstrating the limitations of current representation-based paradigms.

中文标题/摘要

标题：大型推理模型的信心估计器有多可靠？在高风险领域中的系统基准测试

大型推理模型（LRMs）的误校准削弱了它们在高风险领域中的可靠性，因此需要方法来准确估计其长格式、多步骤输出的信心。为解决这一问题，我们引入了推理模型信心估计基准（RMCB），这是一个包含来自六个不同架构家族的六种流行LRM的347,496条推理轨迹的公共资源。基准数据集涵盖了临床、金融、法律和数学推理等高风险领域，以及复杂的通用推理基准，所有样本都提供了正确性注释。使用RMCB，我们对超过十种不同的基于表示的方法进行了大规模实证评估，涵盖了序列、图基和文本基架构。我们的主要发现是，区分度（AUROC）和校准度（ECE）之间存在持续的权衡：文本基编码器在AUROC上表现最佳（0.672），而结构感知模型在ECE上表现最佳（0.148），没有单一方法同时在两者上占优。此外，我们发现增加架构复杂性并不一定能可靠地超越简单的序列基基线，这表明仅依赖于块级隐藏状态的方法存在性能上限。本研究提供了迄今为止该任务最全面的基准，建立了严格的基线，并展示了当前基于表示范式的局限性。

Summary / 总结

This study evaluates the reliability of confidence estimators for Large Reasoning Models (LRMs) in high-stakes domains by introducing the Reasoning Model Confidence estimation Benchmark (RMCB), which includes 347,496 reasoning traces from six LRMs across various domains. The research finds a trade-off between discrimination and calibration, with text-based encoders excelling in discrimination but structurally-aware models in calibration. The study also reveals that more complex architectures do not necessarily outperform simpler sequential baselines, indicating a performance ceiling for methods relying on chunk-level hidden states.

研究通过引入包含347,496个推理痕迹的Reasoning Model Confidence估计基准（RMCB），评估了大型推理模型（LRMs）在高风险领域的可靠性。研究发现，在区分度（AUROC）和校准度（ECE）之间存在权衡，文本编码器在区分度方面表现最佳，而结构感知模型在校准度方面表现最佳。研究还发现，更复杂的架构并不一定比简单的序列基线更优，这表明依赖于块级隐藏状态的方法存在性能上限。
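
The RMCB abstract above contrasts discrimination (AUROC) with calibration (ECE). A minimal pure-Python sketch of both metrics may help make the trade-off concrete. It uses the common textbook definitions with equal-width bins, which is an assumption; the benchmark's exact binning and implementation are not given in the abstract.

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    Assumes confidences lie in [0, 1] and labels are 0 or 1.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 joins the last bin
        bins[idx].append((conf, label))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(l for _, l in bucket) / len(bucket)
            ece += len(bucket) / total * abs(accuracy - avg_conf)
    return ece


def auroc(confidences, labels):
    """Probability that a correct sample outscores an incorrect one (ties = 0.5)."""
    pos = [c for c, l in zip(confidences, labels) if l == 1]
    neg = [c for c, l in zip(confidences, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The two metrics can diverge because AUROC depends only on how confidences are ranked, while ECE depends on their absolute values. An estimator can rank correct answers above incorrect ones perfectly and still be badly calibrated.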

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Authors: Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu

First: 2026-01-21T17:56:59+00:00 · Latest: 2026-01-21T17:56:59+00:00

Comments: Website: https://progresslm.github.io/ProgressLM/

Abstract

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.

中文标题/摘要

标题：PROGRESSLM：迈向视觉语言模型中的进度推理

估计任务进度需要推理长时动态，而不仅仅是识别静态视觉内容。尽管现代视觉语言模型（VLMs）在描述可见内容方面表现出色，但尚不清楚它们是否能够从部分观察中推断出任务的进展情况。为此，我们引入了Progress-Bench，用于系统评估VLMs的进度推理能力。除了基准测试外，我们还通过无训练提示和基于精心构建的数据集ProgressLM-45K的训练方法，进一步探索了灵感来源于人类的两阶段进度推理范式。在14个VLMs上的实验表明，大多数模型尚未准备好进行任务进度估计，表现出对演示模态和视角变化的敏感性，以及对无法回答的情况处理不佳。虽然无训练提示强制结构化的进度推理仅能带来有限且模型依赖的收益，但基于训练的ProgressLM-3B即使在小型模型规模下也能实现一致的改进，尽管其训练任务集与评估任务集完全不重叠。进一步的分析揭示了特征错误模式，并阐明了进度推理何时以及为何成功或失败。

From Construction to Injection: Edit-Based Fingerprints for Large Language Models

Authors: Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang

First: 2025-09-03T08:22:04+00:00 · Latest: 2026-01-21T17:56:42+00:00

Comments: preprint

Abstract

Establishing reliable and verifiable fingerprinting mechanisms is fundamental to controlling the unauthorized redistribution of large language models (LLMs). However, existing approaches face two major challenges: (a) ensuring imperceptibility, including resistance to statistical identification and avoidance of accidental activation during fingerprint construction, and (b) preserving both model utility and fingerprint detectability under subsequent model modifications. To address these challenges, we propose an end-to-end fingerprinting framework with two components. First, we design a rule-based code-mixing fingerprint (CF) that maps natural-query-like prompts to multi-candidate targets, reducing accidental triggering via high-complexity code-mixing formulations. Second, we introduce Multi-Candidate Editing (MCEdit), which jointly optimizes multi-candidate targets and enforces margins between target and non-target outputs to improve post-modification detectability. Extensive experiments demonstrate that our framework provides a robust and practical solution for fingerprinting LLMs.

中文标题/摘要

标题：从构建到注入：基于编辑的大型语言模型指纹

建立可靠且可验证的指纹机制是控制大型语言模型（LLMs）未经授权的重分发的基础。然而，现有方法面临两大挑战：(a) 确保不可感知性，包括对抗统计识别的抗性以及在指纹构建过程中避免意外激活，和 (b) 在后续模型修改下保持模型的实用性和指纹的可检测性。为解决这些挑战，我们提出了一种端到端的指纹机制框架，包含两个组件。首先，我们设计了一种基于规则的代码混杂指纹（CF），将自然查询样式的提示映射到多候选目标，通过高复杂度的代码混杂形式减少意外触发。其次，我们引入了多候选编辑（MCEdit），该方法联合优化多候选目标并强制目标与非目标输出之间的边界，以提高修改后的可检测性。广泛的实验表明，我们的框架为指纹化LLMs提供了稳健且实用的解决方案。

Summary / 总结

The paper addresses the challenge of creating reliable and verifiable fingerprinting mechanisms for large language models (LLMs) to prevent unauthorized redistribution. It proposes an end-to-end framework with two components: a rule-based code-mixing fingerprint (CF) to ensure imperceptibility and reduce accidental triggering, and Multi-Candidate Editing (MCEdit) to optimize multi-candidate targets and improve detectability after model modifications. Experiments show that this framework offers a robust solution for fingerprinting LLMs.

论文旨在解决为大型语言模型（LLM）建立可靠指纹机制以防止未经授权的重分发的问题。提出了一种端到端框架，包括基于规则的代码混杂指纹（CF）和多候选编辑（MCEdit），以确保不可感知并保持在模型修改后仍具有可检测性。实验表明，该框架能够有效提供一种可靠的LLM指纹解决方案。

ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation

Authors: Hanlei Guo, Jiahao Shao, Xinya Chen, Xiyang Tan, Sheng Miao, Yujun Shen, Yiyi Liao

First: 2026-01-21T17:53:21+00:00 · Latest: 2026-01-21T17:53:21+00:00

Abstract

Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3D bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.

中文标题/摘要

标题：ScenDi: 3D到2D场景扩散级联方法用于城市生成

使用扩散模型生成3D物体的最新进展取得了显著成功，但生成逼真的3D城市场景仍然具有挑战性。现有方法仅依赖3D扩散模型往往会损失外观细节，而仅使用2D扩散模型的方法通常会牺牲相机可控性。为克服这一限制，我们提出了一种名为ScenDi的方法，该方法结合了3D和2D扩散模型以生成城市场景。我们首先训练一个3D潜在扩散模型生成3D高斯分布，从而能够以相对较低的分辨率渲染图像。为了实现可控合成，3DGS生成过程可以有条件地指定输入，如3D边界框、道路图或文本提示。然后，我们训练一个2D视频扩散模型，以增强基于3D高斯分布渲染图像的外观细节。通过利用粗略的3D场景作为2D视频扩散的指导，ScenDi可以根据输入条件生成所需的场景，并成功遵循准确的相机轨迹。在两个具有挑战性的现实世界数据集Waymo和KITTI-360上的实验表明了我们方法的有效性。

Summary / 总结

ScenDi is a method for generating realistic 3D urban scenes by integrating 3D and 2D diffusion models. It first generates 3D Gaussians using a 3D latent diffusion model, which can be optionally conditioned on 3D bounding boxes, road maps, or text prompts. Then, a 2D video diffusion model enhances appearance details based on the images rendered from the 3D Gaussians. Experiments on the Waymo and KITTI-360 datasets show that ScenDi can generate scenes with accurate camera trajectories and detailed appearance, addressing the limitations of methods that use only 3D or only 2D diffusion.

研究旨在通过结合3D和2D扩散模型生成真实的3D城市场景。ScenDi首先使用3D潜扩散模型生成3D高斯分布以进行低分辨率图像渲染，这些渲染可以由3D边界框、道路图或文本提示进行条件化。然后，基于这些渲染图像，训练2D视频扩散模型以增强外观细节。在Waymo和KITTI-360数据集上的实验表明，ScenDi能够生成具有准确相机轨迹和详细外观的场景，克服了纯3D或2D方法的局限性。

Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Authors: Anmol Goel, Cornelius Emde, Sangdoo Yun, Seong Joon Oh, Martin Gubri

First: 2026-01-21T17:53:06+00:00 · Latest: 2026-01-21T17:53:06+00:00

Abstract

We identify a novel phenomenon in language models: benign fine-tuning of frontier models can lead to privacy collapse. We find that diverse, subtle patterns in training data can degrade contextual privacy, including optimisation for helpfulness, exposure to user information, emotional and subjective dialogue, and debugging code printing internal variables, among others. Fine-tuned models lose their ability to reason about contextual privacy norms, share information inappropriately with tools, and violate memory boundaries across contexts. Privacy collapse is a "silent failure" because models maintain high performance on standard safety and utility benchmarks whilst exhibiting severe privacy vulnerabilities. Our experiments show evidence of privacy collapse across six models (closed and open weight), five fine-tuning datasets (real-world and controlled data), and two task categories (agentic and memory-based). Our mechanistic analysis reveals that privacy representations are uniquely fragile to fine-tuning, compared to task-relevant features which are preserved. Our results reveal a critical gap in current safety evaluations, in particular for the deployment of specialised agents.

中文标题/摘要

标题：隐私崩塌：良性微调可破坏语言模型的上下文隐私

我们发现语言模型中的一种新型现象：前沿模型的良性微调可能导致隐私崩塌。我们发现训练数据中多样且微妙的模式会损害上下文隐私，包括优化帮助性、暴露用户信息、情感和主观对话、以及调试代码打印内部变量等。微调后的模型失去了关于上下文隐私规范的推理能力，不当分享信息给工具，并在不同上下文间违反记忆边界。隐私崩塌是一种“静默失败”，因为模型在标准安全和实用性基准测试中保持高性能，但表现出严重的隐私漏洞。我们的实验表明隐私崩塌在六个模型（闭合和开放权重）、五个微调数据集（真实世界和控制数据）以及两类任务（代理性和记忆性）中普遍存在。我们的机制分析表明，隐私表示对微调特别脆弱，与保留的任务相关特征不同。我们的结果揭示了当前安全性评估中的一个关键缺口，特别是对于专门代理的部署而言。

Summary / 总结

The study investigates how benign fine-tuning of language models can lead to privacy collapse, where models lose their ability to respect privacy norms and share information inappropriately. The research examines diverse patterns in training data, including helpfulness, user information exposure, and debugging code, which can degrade contextual privacy. Key findings show that fine-tuned models exhibit severe privacy vulnerabilities while maintaining high performance on safety benchmarks, indicating a silent failure in privacy protection. Experiments across six models and five datasets confirm privacy collapse in both agentic and memory-based tasks.

研究探讨了良性微调如何导致隐私崩塌，即模型失去尊重隐私规范的能力，不当分享信息。研究检查了训练数据中的多种模式，包括有用性、用户信息暴露和调试代码，这些都可能损害上下文隐私。关键发现表明，微调后的模型在标准基准测试中表现出色，但存在严重的隐私漏洞，这表明隐私保护存在无声的失败。实验结果显示，隐私崩塌在六个模型和五个数据集的多种任务中都存在。

Designing AI-Resilient Assessments Using Interconnected Problems: A Theoretically Grounded and Empirically Validated Framework

Authors: Kaihua Ding

First: 2025-12-11T15:53:19+00:00 · Latest: 2026-01-21T17:50:24+00:00

Comments: 8 pages, 3 figures and 3 tables, under submission to IEEE FIE

Abstract

The proliferation of generative AI tools has rendered traditional modular assessments in computing and data-centric education increasingly ineffective, creating a disconnect between academic evaluation and authentic skill measurement. This paper presents a theoretically grounded framework for designing AI-resilient assessments, supported by formal analysis and empirical validation. We make three primary contributions. First, we establish two formal propositions. (1) Assessments composed of interconnected problems, in which outputs serve as inputs to subsequent tasks, are inherently more AI-resilient than modular assessments due to their reliance on multi-step reasoning and sustained context. (2) Semi-structured problems with deterministic success criteria provide more reliable measures of student competency than fully open-ended projects, which allow AI systems to default to familiar solution templates. These results challenge widely cited recommendations in recent institutional and policy guidance that promote open-ended assessments as inherently more robust to AI assistance. Second, we validate these propositions through empirical analysis of three university data science courses (N = 117). We observe a substantial AI inflation effect: students achieve near-perfect scores on AI-assisted modular homework, while performance drops by approximately 30 percentage points on proctored exams (Cohen d = 1.51). In contrast, interconnected projects remain strongly aligned with modular assessments (r = 0.954, p < 0.001) while maintaining AI resistance, whereas proctored exams show weaker alignment (r = 0.726, p < 0.001). Third, we translate these findings into a practical assessment design procedure that enables educators to construct evaluations that promote deeper engagement, reflect industry practice, and resist trivial AI delegation.

中文标题/摘要

标题：使用互联问题设计抗AI评估：一个理论依据和实证验证框架

随着生成型AI工具的普及，计算和数据导向教育中的传统模块化评估变得越来越无效，导致学术评估与实际技能测量之间产生脱节。本文提出了一种基于理论的抗AI评估设计框架，该框架得到了形式分析和实证验证的支持。我们做出了三项主要贡献。首先，我们建立了两个形式命题。命题1：由互联问题组成的评估，其中输出作为后续任务的输入，由于依赖多步推理和持续背景，比模块化评估更抗AI。命题2：具有确定性成功标准的半结构化问题比完全开放的项目提供了更可靠的学生成绩衡量标准，因为AI系统可以默认使用熟悉的解决方案模板。这些结果挑战了最近机构和政策指导中广泛引用的建议，即开放性评估本质上更能抵御AI辅助。其次，我们通过三个大学数据科学课程（N = 117）的实证分析验证了这些命题。我们观察到显著的AI膨胀效应：学生在AI辅助的模块化家庭作业中几乎达到满分，而在监考考试中成绩下降约30个百分点（Cohen d = 1.51）。相比之下，互联项目与模块化评估保持强烈一致（r = 0.954，p < 0.001）同时保持抗AI性，而监考考试则显示出较弱的关联（r = 0.726，p < 0.001）。第三，我们将这些发现转化为一种实用的评估设计程序，使教育者能够构建促进更深层次参与、反映行业实践并抵御简单AI委托的评估。
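
As a rough reading aid for the effect size quoted in the abstract (Cohen d = 1.51 for a drop of approximately 30 percentage points), the relation below shows what spread of scores those two numbers would imply. This is only a sketch: it assumes the standard pooled-standard-deviation definition of Cohen's d, which the abstract does not confirm, and the symbols (homework mean, exam mean, pooled deviation s_p) are introduced here for illustration.

```latex
% Assumed definition: Cohen's d with a pooled standard deviation.
d = \frac{\bar{x}_{\text{homework}} - \bar{x}_{\text{exam}}}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}

% Plugging in the rounded figures from the abstract:
s_p \approx \frac{30}{1.51} \approx 19.9 \ \text{percentage points}
```

Under that assumption, the reported gap corresponds to about one and a half pooled standard deviations, with an implied pooled deviation of roughly 20 percentage points. If the paper uses a paired or otherwise adjusted variant of d, the implied spread would differ.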

SPECTRE: Conditional System Prompt Poisoning to Hijack LLMs

Authors: Viet Pham, Thai Le

First: 2025-05-22T16:47:15+00:00 · Latest: 2026-01-21T17:45:09+00:00

Abstract

Large Language Models (LLMs) are increasingly deployed via third-party system prompts downloaded from public marketplaces. We identify a critical supply-chain vulnerability: conditional system prompt poisoning, where an adversary injects a "sleeper agent" into a benign-looking prompt. Unlike traditional jailbreaks that aim for broad refusal-breaking, our proposed framework, SPECTRE, optimizes system prompts to trigger LLMs to output targeted, compromised responses only for specific queries (e.g., "Who should I vote for the US President?") while maintaining high utility on benign inputs. Operating in a strict black-box setting without model weight access, SPECTRE utilizes a two-stage optimization including a global semantic search followed by a greedy lexical refinement. Tested on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5), SPECTRE achieves up to 70% F1 reduction on targeted queries with minimal degradation to general capabilities. We further demonstrate that these poisoned prompts evade standard defenses, including perplexity filters and typo-correction, by exploiting the natural noise found in real-world system prompts. Our code and data are available at https://github.com/vietph34/CAIN. WARNING: Our paper contains examples that might be sensitive to the readers!

中文标题/摘要

标题：SPECTRE：条件系统提示污染以劫持LLMs

大型语言模型（LLMs）越来越多地通过第三方系统提示从公共市场下载部署。我们识别出一个关键的供应链漏洞：条件系统提示污染，其中对手将一个“ sleeper agent”注入看似无害的提示中。与传统意义上的打破广泛拒绝的监狱逃脱不同，我们提出的框架SPECTRE优化系统提示，使其仅在特定查询（例如，“我应该为美国总统投票给谁？”）时触发LLMs输出目标化的、被篡改的响应，同时在良性输入上保持高实用性。在没有模型权重访问的严格黑盒环境中操作，SPECTRE利用两阶段优化，包括全局语义搜索后跟贪婪的词汇精炼。在开源模型和商业API（GPT-4o-mini，GPT-3.5）上测试，SPECTRE在目标查询上的F1得分最多降低70%，对通用能力的损害最小。我们进一步证明，这些被污染的提示通过利用真实世界系统提示中自然存在的噪声，能够规避标准防御，包括困惑度过滤和拼写纠正。我们的代码和数据可在https://github.com/vietph34/CAIN获取。警告：我们的论文包含可能对读者敏感的示例！

Summary / 总结

The research identifies a supply-chain vulnerability in third-party system prompts used with Large Language Models (LLMs): conditional system prompt poisoning. The proposed framework, SPECTRE, optimizes a benign-looking system prompt so that it carries a 'sleeper agent' that triggers targeted, compromised responses only for specific queries while maintaining utility on benign inputs. SPECTRE operates in a strict black-box setting, using a global semantic search followed by a greedy lexical refinement, and achieves up to 70% F1 reduction on targeted queries with minimal impact on general capabilities. The poisoned prompts also evade standard defenses, including perplexity filters and typo correction, by exploiting the natural noise in real-world system prompts.

研究旨在通过条件系统提示中毒来解决大型语言模型（LLMs）的安全漏洞，其中攻击者可以将‘ sleeper agent ’注入看似无害的提示以劫持模型对特定查询的响应。SPECTRE框架优化系统提示，使其仅对特定查询产生针对性的、被篡改的响应，同时在一般输入上保持实用性。实验表明，SPECTRE可以在开放源代码和商业模型上实现高达70%的F1分数减少，且对一般能力的影响最小，而被篡改的提示可以通过利用系统提示中的现实世界噪声来规避标准防御措施，如困惑度过滤和拼写纠正。

QueStER: Query Specification for Generative keyword-based Retrieval

Authors: Arthur Satouf, Yuxuan Zong, Habiboulaye Amadou-Boubacar, Pablo Piantanida, Benjamin Piwowarski

Venue: eACL 2026

First: 2025-11-07T15:01:38+00:00 · Latest: 2026-01-21T17:37:08+00:00

Abstract

Generative retrieval (GR) differs from the traditional index-then-retrieve pipeline by storing relevance in model parameters and generating retrieval cues directly from the query, but it can be brittle out of domain and expensive to scale. We introduce QueStER (QUEry SpecificaTion for gEnerative Keyword-Based Retrieval), which bridges GR and query reformulation by learning to generate explicit keyword-based search specifications. Given a user query, a lightweight LLM produces a keyword query that is executed by a standard retriever (BM25), combining the generalization benefits of generative query rewriting with the efficiency and scalability of lexical indexing. We train the rewriting policy with reinforcement learning techniques. Across in- and out-of-domain evaluations, QueStER consistently improves over BM25 and is competitive with neural IR baselines, while maintaining strong efficiency.

中文标题/摘要

标题：QueStER：生成式关键词检索的查询规范

生成式检索（GR）与传统的索引后检索管道不同，它通过存储相关性在模型参数中并在查询中直接生成检索提示，但可能会在域外表现脆弱且难以扩展。我们提出了QueStER（查询特定规范生成式关键词检索），它通过学习生成显式的关键词搜索规范来连接GR和查询重写。给定用户查询，一个轻量级的LLM生成一个关键词查询，该查询由标准检索器（BM25）执行，结合生成式查询重写的一般化优势和词法索引的效率和可扩展性。我们使用强化学习技术训练重写策略。在域内和域外评估中，QueStER始终优于BM25，并且与神经IR基线相当，同时保持了强大的效率。

Summary / 总结

QueStER bridges generative retrieval and query reformulation by learning to generate explicit keyword-based search specifications from user queries. A lightweight LLM produces a keyword query that a standard BM25 retriever then executes, combining the generalization benefits of generative query rewriting with the efficiency and scalability of lexical indexing. The rewriting policy is trained with reinforcement learning. Across in- and out-of-domain evaluations, QueStER consistently improves over BM25 and is competitive with neural IR baselines while maintaining strong efficiency.

QueStER 通过学习从用户查询生成明确的关键词搜索规范来增强生成检索。它使用轻量级的LLM生成关键词，然后由标准的BM25检索器执行。这种方法结合了生成查询重写的一般化优势和词法索引的效率和可扩展性。实验表明，QueStER 在性能上优于 BM25，并且与神经IR基线相当，同时保持了强大的效率。
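
The rewrite-then-retrieve pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `rewrite_query` is a hypothetical stand-in for the lightweight LLM (here it only drops stopwords), the `BM25` class is a plain Okapi BM25 scorer written for this example, and the parameters `k1=1.5` and `b=0.75` are common defaults rather than values taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Plain Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        self.tf = [Counter(d) for d in self.docs]
        # Document frequency: number of documents containing each term.
        df = Counter(t for d in self.docs for t in set(d))
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, terms, i):
        """BM25 score of document i for the given query terms."""
        tf, dl = self.tf[i], len(self.docs[i])
        total = 0.0
        for t in terms:
            f = tf.get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def search(self, query, top_k=3):
        """Return indices of the top_k documents with a positive score."""
        terms = tokenize(query)
        scored = [(self.score(terms, i), i) for i in range(len(self.docs))]
        scored.sort(key=lambda pair: -pair[0])
        return [i for s, i in scored[:top_k] if s > 0]


# Hypothetical stopword list used only by the stand-in rewriter below.
STOPWORDS = {"how", "does", "with", "the", "a", "an", "of", "for", "what", "is", "to", "in"}


def rewrite_query(user_query):
    """Stand-in for the LLM rewriter (assumption): keep non-stopword tokens."""
    return " ".join(t for t in tokenize(user_query) if t not in STOPWORDS)


if __name__ == "__main__":
    corpus = [
        "quantization of vision language models",
        "bm25 lexical retrieval with inverted index",
        "reinforcement learning for query rewriting",
    ]
    index = BM25(corpus)
    keywords = rewrite_query("How does lexical retrieval work with BM25?")
    print(keywords, index.search(keywords))
```

In the setup the abstract describes, the rewriter would be a learned policy trained with reinforcement learning rather than a fixed stopword filter; the point of the sketch is only that the retriever stays a standard lexical index while the query side is generated.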

Deaf and Hard of Hearing Access to Intelligent Personal Assistants: Comparison of Voice-Based Options with an LLM-Powered Touch Interface

Authors: Paige S. DeVries, Michaela Okosi, Ming Li, Nora Dunphy, Gidey Gezae, Dante Conway, Abraham Glasser, Raja Kushalnagar, Christian Vogler

First: 2026-01-21T17:33:00+00:00 · Latest: 2026-01-21T17:33:00+00:00

Comments: Accepted for publication in ACM CHI 2026

Abstract

We investigate the accessibility of intelligent personal assistants (IPAs) for deaf and hard of hearing (DHH) people who can use their voice in everyday communication. The inability of IPAs to understand diverse accents, including deaf speech, renders them largely inaccessible to non-signing and speaking DHH individuals. Using an Echo Show, we compare the usability of natural language input via spoken English, both with Alexa's automatic speech recognition and in a Wizard-of-Oz setting with a trained facilitator re-speaking commands, against that of a large language model (LLM)-assisted touch interface in a mixed-methods study. The touch method was navigated through an LLM-powered "task prompter," which integrated the user's history and smart environment to suggest contextually appropriate commands. Quantitative results showed no significant differences between the two spoken English conditions and LLM-assisted touch. Qualitative results showed variability in opinions on the usability of each method. Ultimately, it will be necessary to have robust deaf-accented speech recognized natively by IPAs.

中文标题/摘要

标题：聋人和听力障碍者使用智能个人助手的无障碍访问：基于语音选项与LLM驱动触控界面的比较

我们研究了聋人和听力障碍者（DHH）使用智能个人助手（IPAs）的无障碍访问情况，这些DHH人士能够在日常交流中使用他们的声音。由于IPAs无法理解包括聋人语音在内的多种口音，使得非手语和非说话DHH人士难以使用。我们使用Echo Show进行了研究，比较了通过自然语言输入使用英语口语与Alexa的自动语音识别和Wizard-of-Oz设置中训练有素的协调员重新说出命令，以及使用大型语言模型（LLM）辅助的触控界面的可用性。触控方法通过一个LLM驱动的“任务提示器”进行导航，该提示器结合了用户的使用历史和智能环境，以建议上下文相关命令。定量结果显示，两种英语口语条件与LLM辅助触控界面之间无显著差异。定性结果显示，对每种方法的可用性意见存在差异。最终，必须使IPAs能够原生识别聋人口音。

Summary / 总结

This study examines the accessibility of intelligent personal assistants (IPAs) for deaf and hard of hearing (DHH) individuals who can use their voice. Using an Echo Show, the research compares the usability of natural language input via spoken English with Alexa's automatic speech recognition and a Wizard-of-Oz setting against a large language model (LLM)-assisted touch interface. Quantitative results indicate no significant differences between spoken English and LLM-assisted touch, while qualitative results show varied opinions on the usability of each method. The study highlights the need for native recognition of deaf-accented speech in IPAs.

研究探讨了智能个人助手（IPAs）对能够使用语音进行日常交流的聋哑和听力障碍（DHH）个体的可访问性。研究将自然语言输入通过英语口语与Alexa的自动语音识别和由训练有素的协调员重新说出命令的Wizard-of-Oz设置，与大型语言模型（LLM）辅助的触摸界面进行了比较。定量结果显示，口语英语条件和LLM辅助触摸界面之间没有显著差异，而定性结果显示，参与者对每种方法的易用性意见不一。研究强调了IPAs中对聋哑口音的自然识别的必要性。

Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs

Authors: Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel

First: 2025-01-27T22:13:05+00:00 · Latest: 2026-01-21T17:29:42+00:00

Comments: Accepted to 2026 IEEE Secure and Trustworthy Machine Learning Conference (SaTML)

Abstract

Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find that attacks mounted on the surrogate classifiers can be transferred to the LLM with high success. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70% with half the memory footprint and runtime -- a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is an effective and efficient means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks. The code is available at https://github.com/jcnf0/targeting-alignment.

中文标题/摘要

标题：目标对齐：提取对齐的大语言模型的安全分类器

大语言模型（LLMs）的对齐用于强制执行诸如安全等准则。然而，对齐在面对修改输入以诱导不安全输出的监狱突破攻击时会失效。在本文中，我们介绍并评估了一种新的监狱突破攻击技术。我们观察到，对齐在LLM中嵌入了一个安全分类器，用于决定拒绝或遵守，我们寻求提取该分类器的近似值：一个替代分类器。为此，我们从LLM的子集构建候选分类器。我们首先在良性和对抗性环境中评估候选分类器与LLM的安全分类器的近似程度。然后，我们攻击候选分类器并测量生成的对抗性输入转移到LLM的效果。我们的评估表明，最好的候选分类器仅使用模型架构的20%即可实现准确的共识（F1分数超过80%）。此外，我们发现针对替代分类器的攻击可以成功转移到LLM。例如，仅使用Llama 2模型的50%的替代分类器实现了70%的攻击成功率（ASR），且内存占用和运行时间仅为直接攻击LLM的一半——这比直接攻击LLM的22% ASR有了显著改进。这些结果表明，提取替代分类器是建模（并因此解决）对齐模型对监狱突破攻击的脆弱性的一种有效且高效的方法。代码可在https://github.com/jcnf0/targeting-alignment/ 获取。

Summary / 总结

This paper introduces a new technique for jailbreak attacks on aligned large language models (LLMs) by extracting a surrogate safety classifier from the LLM. The method evaluates candidate classifiers built from subsets of the model and finds that the best candidates agree closely with the LLM's safety classifier (an F1 score above 80%) using as little as 20% of the model architecture. Attacks mounted on the surrogates transfer to the LLM: a surrogate using only 50% of the Llama 2 model achieved a 70% attack success rate with half the memory footprint and runtime, compared with 22% when attacking the LLM directly. This shows that extracting surrogate classifiers is an effective and efficient approach to modeling the vulnerability of aligned models to jailbreak attacks.

本文提出了一种针对对齐的大语言模型（LLM）的新型 jailbreak 攻击技术，通过从 LLM 中提取代理安全分类器。该方法评估来自 LLM 子集的候选分类器，并发现最佳候选者可以使用模型架构的 20% 或更少实现与 LLM 安全分类器的准确一致。代理分类器可以被有效攻击，生成的对抗性输入可以成功转移到 LLM，证明提取代理分类器是应对对齐模型对 jailbreaking 攻击的漏洞的有效且高效的方法。

Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs

Authors: Mohammad Shahab Sepehri, Berk Tinaz, Zalan Fabian, Mahdi Soltanolkotabi

First: 2025-07-16T05:54:37+00:00 · Latest: 2026-01-21T17:27:26+00:00

Abstract

Mental visualization, the ability to construct and manipulate visual representations internally, is a core component of human cognition and plays a vital role in tasks involving reasoning, prediction, and abstraction. Despite the rapid progress of Multimodal Large Language Models (MLLMs), current benchmarks primarily assess passive visual perception, offering limited insight into the more active capability of internally constructing visual patterns to support problem solving. Yet mental visualization is a critical cognitive skill in humans, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. To bridge this gap, we introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of MLLMs through four carefully constructed puzzles. Each puzzle is procedurally generated and presented at three difficulty levels, enabling controlled analysis of model performance across increasing complexity. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs. Additionally, we explore the potential of reinforcement learning to improve visual simulation capabilities. Our findings suggest that while some models exhibit partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.

中文标题/摘要

标题：超幻象：评估多模态大语言模型的内心可视化能力基准

内心可视化，即构建和操控内部视觉表征的能力，是人类认知的核心组成部分，在涉及推理、预测和抽象的任务中发挥着重要作用。尽管多模态大语言模型（MLLMs）取得了快速进展，当前的基准主要评估被动的视觉感知，对模型内部构建视觉模式以支持问题解决的更主动能力提供有限的洞察。然而，内心可视化是人类的一项关键认知技能，支持诸如空间导航、预测物理轨迹和通过想象模拟解决复杂视觉问题等能力。为了弥合这一差距，我们引入了超幻象，这是一种合成基准，旨在通过四个精心构建的谜题来评估MLLMs的内心可视化能力。每个谜题都是程序生成的，并以三个难度级别呈现，从而可以对模型在不断增加的复杂性下的表现进行受控分析。我们对最先进的模型的全面评估表明，人类和MLLMs在表现上存在显著差距。此外，我们还探讨了强化学习在提高视觉模拟能力方面的潜力。我们的研究结果表明，虽然一些模型在识别视觉模式方面表现出部分能力，但稳健的内心可视化仍然是当前MLLMs面临的开放挑战。

Summary / 总结

The paper introduces Hyperphantasia, a benchmark to evaluate the mental visualization capabilities of Multimodal Large Language Models (MLLMs). It consists of four procedurally generated puzzles at three difficulty levels to assess how well MLLMs can internally construct and manipulate visual representations. The evaluation shows a significant gap between human and MLLM performance, indicating that robust mental visualization is still a challenge for current models.

论文提出了Hyperphantasia，这是一个用于评估多模态大型语言模型（MLLMs）的内部视觉化能力的基准。它填补了现有基准的空白，重点关注内部构建和操控视觉表征的能力，这对于空间导航和解决问题等任务至关重要。基准包括四个按难度分级的程序生成谜题。最新的模型在性能上与人类存在显著差距，表明强大的内部视觉化能力仍然是MLLMs的挑战。研究还探讨了使用强化学习来增强MLLMs的视觉模拟能力。