arXiv Paper Digest

Snapshot: 20260326_0352

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Authors: Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, Muhammad Haris Khan

First: 2026-03-24T17:59:54+00:00 · Latest: 2026-03-24T17:59:54+00:00

Comments: 11 Pages

Abstract

Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
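
The negative-control finding above can be made concrete with a small scoring sketch. This is an illustration only, not the benchmark's actual harness: the `Task` record, the `score` function, and the convention that `None` means "no panel violates coherence" are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    # Hypothetical record: index of the inconsistent panel,
    # or None for a negative-control set where every panel is consistent.
    odd_panel: Optional[int]


def score(tasks, predictions):
    """Return (accuracy, false_alarm_rate).

    predictions[i] is the panel index the model flagged, or None when the
    model reports no violation. The false-alarm rate is computed only over
    negative-control tasks.
    """
    correct = 0
    negatives = 0
    false_alarms = 0
    for task, pred in zip(tasks, predictions):
        if pred == task.odd_panel:
            correct += 1
        if task.odd_panel is None:
            negatives += 1
            if pred is not None:
                false_alarms += 1
    accuracy = correct / len(tasks)
    false_alarm_rate = false_alarms / negatives if negatives else 0.0
    return accuracy, false_alarm_rate


# Toy data: two tasks with a real violation, two negative controls.
tasks = [Task(2), Task(None), Task(0), Task(None)]
preds = [2, 1, 3, None]
accuracy, false_alarm_rate = score(tasks, preds)
```

Reporting the false-alarm rate separately from accuracy keeps hallucinated anomalies on normal inputs visible even when overall accuracy looks acceptable.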

UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Authors: Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang

First: 2026-03-24T17:59:17+00:00 · Latest: 2026-03-24T17:59:17+00:00

Abstract

Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.
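
Two ingredients named in the abstract can be sketched in a few lines: the group-relative advantage used by GRPO-style methods, and an MSE penalty between two velocity fields. This is a minimal sketch, not the authors' implementation. It assumes one scalar terminal reward per rollout, represents velocity fields as flat lists of floats, and uses hypothetical function names.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward within its group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def velocity_mse_penalty(v_policy, v_reference):
    """Mean squared error between two velocity fields given as flat lists."""
    if len(v_policy) != len(v_reference):
        raise ValueError("velocity fields must have the same length")
    squared = [(p - q) ** 2 for p, q in zip(v_policy, v_reference)]
    return sum(squared) / len(squared)


# One group of four rollouts that share a prompt, each with a terminal reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Penalty between the current policy's velocity and a reference velocity.
penalty = velocity_mse_penalty([0.5, -0.5], [0.0, 0.0])
```

In a joint setup of the kind the abstract describes, the same per-rollout advantage would presumably weight both the reasoning tokens and the image sampling steps, with the penalty acting as the regularizer. How the two are actually combined is not specified here.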

Summary

The paper proposes UniGRPO, a unified reinforcement learning framework for interleaved generation, validated on a single round of reasoning-driven image generation. It formulates the process as a Markov Decision Process with sparse terminal rewards and modifies FlowGRPO in two ways to keep training scalable and robust: it removes classifier-free guidance, and it replaces the latent KL penalty with an MSE penalty on the velocity fields. Experiments show that UniGRPO improves image generation quality through reasoning and provides a scalable baseline for future fully interleaved models.

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Authors: Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang

First: 2026-03-24T17:58:25+00:00 · Latest: 2026-03-24T17:58:25+00:00

Abstract

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.

Summary

WildWorld is a large-scale, action-conditioned world modeling dataset with explicit state annotations, designed to address the limitations of existing datasets by providing a diverse and semantically meaningful action space. It is collected automatically from a photorealistic AAA game, Monster Hunter: Wilds, and contains over 108 million frames and more than 450 actions, with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. Experiments show persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, which underscores the need for state-aware video generation models.

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Authors: Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos

Venue: CVPR 2026

First: 2026-03-24T17:58:17+00:00 · Latest: 2026-03-24T17:58:17+00:00

Comments: Accepted at CVPR 2026

Abstract

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.

Summary

This paper introduces VISion On Request (VISOR), a method that improves the efficiency of Large Vision-Language Models (LVLMs) by sparsifying the interaction between image and text tokens rather than reducing the number of visual tokens. Efficient cross-attention supplies general visual context, while a few dynamically selected self-attention layers refine the visual representations when complex reasoning requires it. Experiments show that VISOR reduces computational cost while matching or exceeding state-of-the-art results across diverse benchmarks, especially on tasks requiring detailed visual understanding.

AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

Authors: Woojeong Jin, Jaeho Lee, Heeseong Shin, Seungho Jang, Junhwan Heo, Seungryong Kim

First: 2026-03-24T17:55:17+00:00 · Latest: 2026-03-24T17:55:17+00:00

Abstract

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: an MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.

Summary

AgentRVOS addresses the limitations of existing training-free methods for Referring Video Object Segmentation by combining the strengths of SAM3 and an MLLM. SAM3 generates mask tracks over the full spatio-temporal extent of the video, and the MLLM identifies the target through query-grounded reasoning over those tracks, pruning candidates iteratively with the help of SAM3's temporal existence information. Experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods on multiple benchmarks, with consistent results across different MLLM backbones.

Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems

Authors: Jooyoung Lee, Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos, Dongwon Lee

Venue: ICWSM 2026

First: 2025-03-06T20:19:38+00:00 · Latest: 2026-03-24T17:52:36+00:00

Comments: 15; To appear in ICWSM 2026 (https://www.icwsm.org/2026/)

Abstract

The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content. Collaborative human efforts, augmented by AI tools, present a promising solution. In this study, we explore the potential of DeepFakeDeLiBot, a deliberation-enhancing chatbot, to support groups in detecting deepfake text. Our findings reveal that group-based problem-solving significantly improves the accuracy of identifying machine-generated paragraphs compared to individual efforts. While engagement with DeepFakeDeLiBot does not yield substantial performance gains overall, it enhances group dynamics by fostering greater participant engagement, consensus building, and the frequency and diversity of reasoning-based utterances. Additionally, participants with higher perceived effectiveness of group collaboration exhibited performance benefits from DeepFakeDeLiBot. These findings underscore the potential of deliberative chatbots in fostering interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection. Dataset and source code used in this study will be made publicly available upon acceptance of the manuscript.

Summary

This study investigates whether DeepFakeDeLiBot, a deliberation-enhancing chatbot, helps groups detect deepfake text. Group-based problem-solving identifies machine-generated paragraphs more accurately than individual efforts do. DeepFakeDeLiBot does not substantially improve overall performance, but it improves group dynamics by increasing engagement, consensus building, and the frequency and diversity of reasoning-based utterances. Participants who perceived group collaboration as more effective saw performance benefits from the chatbot. These findings highlight the potential of deliberative chatbots to support interactive and accurate group detection of deepfake text.

Failure of contextual invariance in gender inference with large language models

Authors: Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli

First: 2026-03-24T17:52:22+00:00 · Latest: 2026-03-24T17:52:22+00:00

Abstract

Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. A Contextuality-by-Default analysis reveals that, in 19-52% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.

Summary

The study tests the assumption that large language model (LLM) outputs are invariant under contextually equivalent task formulations, using a pronoun selection task with minimal, theoretically uninformative discourse context. Introducing such context causes large, systematic shifts in model outputs, and correlations with cultural gender stereotypes weaken or disappear. Irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behaviour. LLM outputs therefore violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Authors: Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo

First: 2026-03-24T17:45:47+00:00 · Latest: 2026-03-24T17:45:47+00:00

Comments: Code: https://github.com/MAC-AutoML/SpecEyes

Abstract

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
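
The gating idea in the abstract can be illustrated with a toy router: a cheap model answers first, and the expensive agentic path runs only when the cheap answer is not confident enough. This is a sketch under stated assumptions, not the SpecEyes implementation. The separability score used here (the gap between the top two softmax probabilities), the threshold, and the two stand-in models are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def separability(logits):
    """Hypothetical confidence score: gap between the two most likely answers."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]


def answer(question, small_model, agentic_model, threshold=0.5):
    """Accept the small model's answer when confident; otherwise fall back."""
    candidates, logits = small_model(question)
    if separability(logits) >= threshold:
        best = max(range(len(logits)), key=logits.__getitem__)
        return candidates[best], "speculative"
    return agentic_model(question), "agentic"


# Stand-in models, for illustration only.
def small_model(question):
    if "easy" in question:
        return ["A", "B"], [4.0, 0.0]  # clearly separated answers
    return ["A", "B"], [0.1, 0.0]      # nearly tied answers


def agentic_model(question):
    return "B"  # pretend this result came from a full tool-calling loop


easy = answer("easy question", small_model, agentic_model)
hard = answer("hard question", small_model, agentic_model)
```

Because the small model holds no tool state, many such speculative calls could run concurrently while the agentic path handles only the low-confidence remainder, which is the intuition behind the throughput claim.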

Summary

SpecEyes is a speculative acceleration framework for agentic multimodal large language models (MLLMs) that targets the sequential overhead of cascaded perception, reasoning, and tool-calling loops. A lightweight, tool-free MLLM acts as a speculative planner that predicts the execution trajectory, so expensive tool chains can be terminated early. A cognitive gating mechanism based on answer separability regulates the speculation, and a heterogeneous parallel funnel maximizes system throughput. Experiments show a 1.1-3.35x speedup over the agentic baseline while preserving or improving accuracy, which raises serving throughput under concurrent workloads.

ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains

Authors: Muhammad Khalid, Manuel Oriol, Yilmaz Uygun

First: 2026-03-24T17:45:40+00:00 · Latest: 2026-03-24T17:45:40+00:00

Comments: 17 pages, 6 figures, 7 tables. Accepted at VerifAI-2026 Workshop, co-located with ETAPS 2026

Abstract

Requirements engineering is a vital, yet labor-intensive, stage in the software development process. This article introduces ReqFusion: an AI-enhanced system that automates the extraction, classification, and analysis of software requirements utilizing multiple Large Language Model (LLM) providers. The architecture of ReqFusion integrates OpenAI GPT, Anthropic Claude, and Groq models to extract functional and non-functional requirements from various documentation formats (PDF, DOCX, and PPTX) in academic, industrial, and tender proposal contexts. The system uses a domain-independent extraction method and generates requirements following the Project, Environment, Goal, and System (PEGS) approach introduced by Bertrand Meyer. The main idea is that, because the PEGS format is detailed, LLMs have more information and cues about the requirements, producing better results than a simple generic request. An ablation study confirms this hypothesis: PEGS-guided prompting achieves an F1 score of 0.88, compared to 0.71 for generic prompting under the same multi-provider configuration. The evaluation used 18 real-world documents to generate 226 requirements through automated classification, with 54.9% functional and 45.1% nonfunctional across academic, business, and technical domains. An extended evaluation on five projects with 1,050 requirements demonstrated significant improvements in extraction accuracy and a 78% reduction in analysis time compared to manual methods. The multi-provider architecture enhances reliability through model consensus and fallback mechanisms, while the PEGS-based approach ensures comprehensive coverage of all requirement categories.
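
The consensus-and-fallback behaviour and the F1 metric mentioned above can be sketched as follows. This is a minimal illustration, not ReqFusion's code. The provider callables are stand-ins rather than real API clients, and the `min_votes` rule is an assumption.

```python
from collections import Counter


def classify_with_consensus(requirement, providers, min_votes=2):
    """Ask each provider for a label; return the majority label or None."""
    votes = []
    for provider in providers:
        try:
            votes.append(provider(requirement))
        except Exception:
            # Fallback: skip a failing provider instead of aborting the run.
            continue
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Stand-in providers, for illustration only.
def provider_a(text):
    return "functional"


def provider_b(text):
    return "functional"


def provider_c(text):
    raise TimeoutError("provider unavailable")


label = classify_with_consensus(
    "The system shall export reports as PDF.",
    [provider_a, provider_b, provider_c],
)
```

Returning `None` when no label reaches `min_votes` leaves room for a human reviewer or a further model to settle disputed requirements, which is one plausible way to use consensus for reliability.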

Summary

ReqFusion is an AI-enhanced system that automates the extraction, classification, and analysis of software requirements using multiple LLM providers. It integrates OpenAI GPT, Anthropic Claude, and Groq models to extract functional and non-functional requirements from various document formats, and it guides the LLMs with the PEGS approach. An ablation study shows that PEGS-guided prompting achieves an F1 score of 0.88, compared to 0.71 for generic prompting. From 18 real-world documents the system generated 226 requirements, 54.9% functional and 45.1% non-functional. An extended evaluation on five projects with 1,050 requirements showed significant improvements in extraction accuracy and a 78% reduction in analysis time compared to manual methods.

Task-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation

Authors: Yunkai Yang, Yudong Zhang, Kunquan Zhang, Jinxiao Zhang, Xinying Chen, Haohuan Fu, Runmin Dong

Venue: CVPR 2026

First: 2025-12-18T16:37:39+00:00 · Latest: 2026-03-24T17:45:20+00:00

Comments: Accepted by CVPR 2026

Abstract

With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.

Summary

This paper addresses the challenges of using synthetic data for remote sensing semantic segmentation by proposing a task-oriented data synthesis framework (TODSynth) that includes a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Combining text-image-mask joint attention with full fine-tuning of the image and mask branches significantly improves RS semantic segmentation data synthesis, especially in few-shot and complex-scene scenarios. A control-rectify flow matching (CRFM) method dynamically adjusts sampling directions, improving the stability of generated images and aligning synthetic data more closely with downstream segmentation tasks. Extensive experiments show that the approach outperforms state-of-the-art controllable generation methods.

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Authors: Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou

First: 2026-03-24T17:45:06+00:00 · Latest: 2026-03-24T17:45:06+00:00

Comments: https://plan-lab.github.io/projects/vtam/

Abstract

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.

中文标题/摘要

标题：VTAM：视频-触觉-动作模型在超越VLAs的复杂物理交互中的应用

视频-动作模型（VAMs）已成为体现智能的有前途的框架，通过从原始视频流中学习隐含的世界动力学来产生时间上一致的动作预测。尽管这些模型在长时任务上通过视觉推理表现出强大的性能，但在接触丰富的场景中仍受到限制，这些场景中关键的交互状态仅部分可通过视觉观察到。特别是，细微的力调节和接触转换无法可靠地编码在视觉标记中，导致不稳定或不精确的行为。为了解决这一问题，我们引入了视频-触觉动作模型（VTAM），这是一种多模态世界建模框架，将触觉感知作为补充的接地信号。VTAM通过轻量级模态转移微调将预训练的视频变换器与触觉流相结合，无需触觉-语言配对数据或独立的触觉预训练，即可实现高效的跨模态表示学习。为了稳定多模态融合，我们引入了一种触觉正则化损失，该损失强制执行平衡的跨模态注意力，防止视觉潜在主导动作模型。VTAM在接触丰富的操作中表现出色，平均保持了90%的稳健成功率。在需要高保真力感知的马铃薯片拾取和放置等具有挑战性的场景中，VTAM比pi 0.5基线高出80%。我们的研究结果表明，将触觉反馈整合到世界动作模型中以纠正视觉估计误差是必不可少的，为物理上接地的体现基础模型提供了可扩展的方法。

Summary / 总结

VTAM is a multimodal world modeling framework that integrates tactile perception with video-action models to improve performance in contact-rich manipulation tasks. By augmenting a pretrained video transformer with tactile streams and introducing a tactile regularization loss, VTAM achieves a robust success rate of 90 percent on average and outperforms the pi 0.5 baseline by 80 percent in high-fidelity force awareness scenarios like potato chip pick-and-place.

VTAM 是一种将触觉感知与视频-动作模型结合的多模态世界建模框架，以提高接触丰富的操作任务中的性能。它通过触觉流增强预训练的视频变换器，并引入触觉正则化损失以稳定多模态融合。VTAM 展现出优越的性能，平均保持 90% 的成功率，并在需要高保真力感知的挑战性场景中比 pi 0.5 基线高出 80%。

An Industrial-Scale Retrieval-Augmented Generation Framework for Requirements Engineering: Empirical Evaluation with Automotive Manufacturing Data

Authors: Muhammad Khalid, Yilmaz Uygun

First: 2026-03-20T22:05:08+00:00 · Latest: 2026-03-24T17:44:26+00:00

Comments: 10 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Requirements engineering in Industry 4.0 faces critical challenges with heterogeneous, unstructured documentation spanning technical specifications, supplier lists, and compliance standards. While retrieval-augmented generation (RAG) shows promise for knowledge-intensive tasks, no prior work has evaluated RAG on authentic industrial RE workflows using comprehensive production-grade performance metrics. This paper presents a comprehensive empirical evaluation of RAG for industrial requirements engineering automation using authentic automotive manufacturing documentation comprising 669 requirements across four specification standards (MBN 9666-1, MBN 9666-2, BQF 9666-5, MBN 9666-9) spanning 2015-2023, plus 49 supplier qualifications with extensive supporting documentation. Through controlled comparisons with BERT-based and ungrounded LLM approaches, the framework achieves 98.2% extraction accuracy with complete traceability, outperforming baselines by 24.4% and 19.6%, respectively. Hybrid semantic-lexical retrieval achieves MRR of 0.847. Expert quality assessment averaged 4.32/5.0 across five dimensions. The evaluation demonstrates 83% reduction in manual analysis time and 47% cost savings through multi-provider LLM orchestration. Ablation studies quantify individual component contributions. Longitudinal analysis reveals a 55% reduction in requirement volume coupled with 1,800% increase in IT security focus, identifying 10 legacy suppliers (20.4%) requiring requalification, representing potential $2.3M in avoided contract penalties.

中文标题/摘要

标题：工业规模的检索增强生成框架在需求工程中的应用：汽车制造数据的实证评估

工业4.0中的需求工程面临着来自异构、非结构化文档的严峻挑战，这些文档包括技术规范、供应商列表和合规标准。虽然检索增强生成（RAG）在知识密集型任务中显示出潜力，但此前没有研究使用全面的生产级性能指标对RAG在真实的工业需求工程工作流中的应用进行评估。本文使用真实的汽车制造文档（包含669项需求，覆盖四个规范标准：MBN 9666-1、MBN 9666-2、BQF 9666-5、MBN 9666-9，时间跨度为2015-2023年，以及49项供应商资格认证及其详尽的支持文档）对RAG在工业需求工程自动化中的应用进行了全面的实证评估。通过与基于BERT的方法和未接地的大语言模型方法的受控比较，该框架实现了98.2%的提取准确率，具有完整的可追溯性，分别比基线高出24.4%和19.6%。混合语义-词汇检索的MRR为0.847。专家质量评估在五个维度上的平均得分为4.32/5.0。评估表明，通过多供应商大语言模型编排，手动分析时间减少了83%，成本节省了47%。消融研究量化了各个组件的贡献。纵向分析显示需求量减少了55%，IT安全关注度提高了1800%，识别出10家需要重新认证的遗留供应商（占20.4%），这可能避免了230万美元的合同违约金。

Summary / 总结

This paper evaluates a retrieval-augmented generation (RAG) framework for industrial requirements engineering, using automotive manufacturing data. The framework achieves 98.2% extraction accuracy with complete traceability, outperforming BERT-based and ungrounded LLM baselines by 24.4% and 19.6%, respectively, and reduces manual analysis time by 83%. Hybrid semantic-lexical retrieval achieves an MRR of 0.847, and expert quality assessment averaged 4.32/5.0. Longitudinal analysis shows a 55% reduction in requirement volume and a 1,800% increase in IT security focus, identifying 10 legacy suppliers for requalification.

本文评估了一种用于汽车制造行业的工业规模检索增强生成框架。该框架使用全面的生产级指标和真实的文档，实现了98.2%的提取准确率和完整的可追溯性，分别优于基线24.4%和19.6%。混合语义-词汇检索的MRR为0.847，专家质量评估平均得分为4.32/5.0。评估显示，手动分析时间减少了83%，成本节省了47%，要求数量减少了55%，IT安全重点增加了1,800%，并识别出10家需要重新认证的遗留供应商，可能避免了230万美元的合同罚款。
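
The MRR of 0.847 reported above is a mean reciprocal rank. As a reminder of what that metric measures, here is a minimal Python sketch under our own assumptions: the function name and the convention that `None` marks a query with no relevant result are illustrative and are not taken from the paper.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    Each entry is the 1-based rank of the first relevant document for one
    query, or None when nothing relevant was retrieved (assumed convention;
    such a query contributes 0 to the sum).
    """
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is None:
            continue  # no relevant hit: reciprocal rank counted as 0
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(first_relevant_ranks)


# Four hypothetical queries: hits at rank 1, 2, none, and 4.
print(mean_reciprocal_rank([1, 2, None, 4]))
```

A value close to 1 means the first relevant document usually sits at or near the top of the ranked list.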

UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Authors: Jiaying Lin, Dan Xu

First: 2026-03-24T17:42:31+00:00 · Latest: 2026-03-24T17:42:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.

中文标题/摘要

标题：UniFunc3D: 统一的主动空间-时间定位以实现3D功能分割

在3D场景中进行功能分割需要代理将隐式的自然语言指令精确定位到细粒度交互元素的精确掩码中。现有方法依赖于分段的管道，在初始任务解析过程中存在视觉盲点。我们观察到这些方法受限于单尺度、被动和启发式的帧选择。我们提出了UniFunc3D，这是一种统一且无需训练的框架，将多模态大型语言模型视为主动观察者。通过将语义、时间和空间推理合并到单次前向传递中，UniFunc3D能够联合推理，直接将任务分解定位到视觉证据中。我们的方法引入了从粗到细的主动空间-时间定位策略。这使得模型能够适应性地选择正确的视频帧，专注于高细节的交互部分，同时保留用于消歧的全局上下文。在SceneFun3D上，UniFunc3D达到了最先进的性能，与训练免费和基于训练的方法相比，相对提高了59.9%的mIoU，而无需任何特定任务的训练。代码将在我们的项目页面上发布：https://jiaying.link/unifunc3d.

Summary / 总结

UniFunc3D is a unified and training-free framework that addresses the limitations of existing methods in functionality segmentation by integrating semantic, temporal, and spatial reasoning. It uses a multimodal large language model as an active observer to perform joint reasoning and adaptively select video frames, focusing on high-detail interactive parts while preserving global context. On the SceneFun3D dataset, UniFunc3D outperforms both training-free and training-based methods with a relative mIoU improvement of 59.9%.

研究旨在通过解决现有片段化管道的局限性，提高3D场景的功能性分割。UniFunc3D 是一个统一且无需训练的框架，将语义、时间和空间推理整合在一起，进行联合推理和主动的空间-时间定位。该方法采用从粗到细的策略选择合适的视频帧，并专注于高细节的交互部分，同时保留全局上下文。在SceneFun3D数据集上，UniFunc3D在mIoU上显著超越了训练免费和训练基线方法，提高了59.9%。

InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting

Authors: Duc Vu, Kien Nguyen, Trong-Tung Nguyen, Ngan Nguyen, Phong Nguyen, Khoi Nguyen, Cuong Pham, Anh Tran

Venue: CVPR

First: 2026-03-24T17:32:55+00:00 · Latest: 2026-03-24T17:32:55+00:00

Comments: Accepted to CVPR'26 (Main Conference)

Abs · PDF · Code1 · Code2

Abstract

Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this cause to random Gaussian noise initialization, which under low function evaluations causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill does not require real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.

中文标题/摘要

标题：InverFill：用于增强少步扩散图像修复的一步反演方法

近期基于扩散的模型在图像修复中实现了照片级的真实感，但需要许多采样步骤，限制了其实用性。少步的文本到图像模型可以提供更快的生成，但直接将它们应用于图像修复会导致背景与修复区域之间协调不佳并产生伪影。我们将其原因追溯到随机的高斯噪声初始化：在函数评估次数较少的情况下，它会导致语义错位和保真度下降。为克服这一问题，我们提出了InverFill，一种专为图像修复设计的一步反演方法，该方法将输入掩码图像中的语义信息注入初始噪声中，从而实现高保真度的少步图像修复。InverFill 不训练专门的修复模型，而是在混合采样管道中利用少步文本到图像模型，并以语义对齐的噪声作为输入，显著改进了原始的混合采样，甚至在低 NFE 下可与专门的修复模型相媲美。此外，InverFill 不需要真实图像的监督，并且仅增加极少的推理开销。大量实验表明，InverFill 能够稳定地提升基线少步模型，提高图像质量和文本一致性，而无需昂贵的重新训练或繁重的迭代优化。

Summary / 总结

InverFill is a one-step inversion method designed to enhance few-step inpainting by injecting semantic information from the masked input image into the initial noise, improving the fidelity and coherence of the inpainted images. It leverages few-step text-to-image models in a blended sampling pipeline rather than training a dedicated inpainting model, improving on vanilla blended sampling and even matching specialized inpainting models at low numbers of function evaluations (NFEs). Experiments demonstrate that InverFill consistently improves image quality and text coherence without real-image supervision or costly retraining.

InverFill 是一种一阶段反演方法，旨在提升少步扩散修复的质量。它通过将输入图像的语义信息注入初始噪声来解决谐调不良和伪影的问题，从而提高保真度。InverFill 利用少步文本到图像模型在语义对齐的噪声输入下进行混合采样，实现高保真度的修复，同时减少推理开销且无需真实图像监督。实验表明，InverFill 一致地提升了基线少步模型的质量，增强了图像连贯性并减少了伪影。
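
Blended sampling, as the term is generally used for diffusion inpainting, keeps the known background and lets the model fill only the masked region at each denoising step. The sketch below shows only that generic blending rule on flat lists of floats. It is not InverFill's implementation, and the function name, the mask convention (1 = region to inpaint), and the toy values are assumptions made for illustration.

```python
from typing import List


def blend(generated: List[float], background: List[float], mask: List[float]) -> List[float]:
    """Per-element blend: mask * generated + (1 - mask) * background.

    `generated` stands for the model's current sample, `background` for the
    known image content brought to the same noise level, and `mask` uses
    1.0 for positions to inpaint and 0.0 for positions to keep (assumed
    convention; soft masks between 0 and 1 interpolate).
    """
    if not (len(generated) == len(background) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return [m * g + (1.0 - m) * b for g, b, m in zip(generated, background, mask)]


# Toy example: only the middle position is inpainted.
print(blend([9.0, 9.0, 9.0], [1.0, 2.0, 3.0], [0.0, 1.0, 0.0]))
```

In a real pipeline the same rule would be applied to image or latent tensors after every sampling step, which is why the starting noise matters so much when only a few steps are available.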

RealMaster: Lifting Rendered Scenes into Photorealistic Video

Authors: Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar

First: 2026-03-24T17:32:42+00:00 · Latest: 2026-03-24T17:32:42+00:00

Comments: Project page: https://danacohen95.github.io/RealMaster/

Abs · PDF · Code1 · Code2 · Project1

Abstract

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.

中文标题/摘要

标题：RealMaster：将渲染场景提升为照片级真实感视频

最先进的视频生成模型能够产生惊人的照片级真实感，但它们缺乏将生成内容与特定场景要求精确对齐所需的控制能力。此外，由于没有底层的显式几何结构，这些模型无法保证3D一致性。相反，3D引擎对每个场景元素提供精细控制，并在设计上原生具备3D一致性，但其输出往往仍停留在“恐怖谷”中。弥合这种从模拟到现实的差距既需要结构精确性，即输出必须精确保留输入的几何和动态，也需要全局语义转换，即材质、光照和纹理必须整体转换以实现照片级真实感。我们提出了RealMaster方法，利用视频扩散模型将渲染视频提升为照片级真实感视频，同时保持与3D引擎输出的完全对齐。为了训练此模型，我们通过基于锚点的传播策略生成配对数据集，其中首尾帧先被增强以提高真实感，再借助几何条件线索传播到中间帧。然后，我们在这些配对视频上训练一个IC-LoRA，将该管道的高质量输出蒸馏到一个能够泛化到管道限制之外的模型中，使其能够处理序列中途出现的对象和角色，并在不需要锚帧的情况下进行推理。在复杂的GTA-V序列上的评估表明，RealMaster显著优于现有视频编辑基线，在提高照片级真实感的同时保留了原始3D控制所指定的几何、动态和身份。

Summary / 总结

RealMaster is a method that uses video diffusion models to convert rendered scenes into photorealistic videos while maintaining precise alignment with the original 3D engine output. It generates a paired dataset using an anchor-based propagation strategy and trains an IC-LoRA model to generalize beyond the pipeline's constraints. RealMaster significantly improves photorealism while preserving geometry, dynamics, and identity in complex GTA-V sequences, outperforming existing video editing techniques.

RealMaster 是一种方法，使用视频扩散模型将渲染场景转换为逼真的视频，同时保持与原始 3D 引擎输出的精确对齐。它通过增强首尾帧并在中间帧中使用几何线索进行传播生成配对数据集，然后训练 IC-LoRA 模型以超越管道的约束进行泛化。在复杂序列中，RealMaster 显著提高了逼真度，同时保留了几何形状、动态和身份。

DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection

Authors: Gautam Rajendrakumar Gare, Neehar Peri, Matvei Popov, Shruti Jain, John Galeotti, Deva Ramanan

First: 2026-03-24T17:26:55+00:00 · Latest: 2026-03-24T17:26:55+00:00

Comments: Project Page: https://ggare-cmu.github.io/DetPO/

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO

中文标题/摘要

标题：DetPO：使用多模态LLM进行少样本物体检测的上下文学习

多模态LLM（MLLMs）在OdinW-13和RefCOCO等流行的物体检测基准测试中展示了强大的视觉定位能力。然而，最先进的模型仍然难以泛化到其预训练中未常见的分布外类别、任务和成像模态。虽然上下文提示是提高跨多种任务性能的常见策略，但我们发现，它往往导致检测准确性低于仅使用类别名称进行提示。这表明当前的MLLMs尚不能有效利用少样本视觉示例和丰富的文本描述进行物体检测。由于前沿的MLLMs通常仅可通过API访问，而最先进的开放权重模型在消费级硬件上微调成本高昂，我们转而探索黑盒提示优化以进行少样本物体检测。为此，我们提出了检测提示优化（DetPO），这是一种无梯度的测试时优化方法，通过最大化少样本视觉训练示例上的检测准确性来细化仅文本提示，同时校准预测置信度。我们提出的方法在Roboflow20-VL和LVIS上的泛化型MLLMs上表现出一致的改进，优于先前的黑盒方法最多9.7%。我们的代码可在https://github.com/ggare-cmu/DetPO获取

Summary / 总结

The research aims to improve few-shot object detection using multi-modal LLMs by addressing their limitations in generalizing to out-of-distribution classes. The method, Detection Prompt Optimization (DetPO), optimizes text-only prompts to maximize detection accuracy on few-shot visual examples, leading to consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming previous black-box approaches by up to 9.7%.

研究旨在通过改进多模态LLM的泛化能力，提高少样本物体检测。方法是Detection Prompt Optimization (DetPO)，通过优化仅文本提示来最大化检测准确性和校准预测置信度。在Roboflow20-VL和LVIS上的实验显示，该方法在黑盒方法上取得了持续改进，最高提高了9.7%的性能。

Code Review Agent Benchmark

Authors: Yuntong Zhang, Zhiyuan Pan, Imam Nur Bani Yusuf, Haifeng Ruan, Ridwan Shariffdeen, Abhik Roychoudhury

First: 2026-03-24T17:19:32+00:00 · Latest: 2026-03-24T17:19:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing and generate huge volumes of code automatically, the matter of code quality comes front and centre. As the automatically generated code gets integrated into huge code-bases, the issue of code review and, more broadly, quality assurance becomes important. In this paper, we take a fresh look at the problem and curate a code review dataset for AI agents to work with. Our dataset, called c-CRAB (pronounced see-crab), can evaluate agents for code review tasks. Specifically, given a pull request (which could come from code generation agents or humans), if a code review agent produces a review, our evaluation framework can assess the reviewing capability of that agent. Our evaluation framework is used to evaluate the state of the art today: the open-source PR-agent, as well as commercial code review agents from Devin, Claude Code, and Codex. Our c-CRAB dataset is systematically constructed from human reviews: given a human review of a pull request instance, we generate corresponding tests to evaluate the reviews generated by code review agents. Such a benchmark construction gives us several insights. Firstly, the existing review agents taken together can solve only around 40% of the c-CRAB tasks, indicating the potential to close this gap through future research. Secondly, we observe that the agent reviews often consider different aspects from the human reviews, indicating the potential for human-agent collaboration in code review that could be deployed in future software teams. Last but not least, the agent-generated tests from our dataset act as a held-out test suite and hence a quality gate for agent-generated reviews. What this will mean for future collaboration of code generation agents, test generation agents, and code review agents remains to be investigated.

中文标题/摘要

标题：代码审查代理基准

软件工程代理在编写代码方面显示出巨大的潜力。随着AI代理渗透到代码编写中，并自动生成大量代码，代码质量的问题变得至关重要。当自动生成的代码被集成到庞大的代码库中时，代码审查和更广泛的品质保证问题变得重要。在本文中，我们重新审视了这一问题，并为AI代理整理了一个代码审查数据集。我们称之为c-CRAB（发音为see-crab）的数据集可以评估代理的代码审查能力。具体来说，给定一个pull-request（可能是来自代码生成代理或人类），如果一个代码审查代理生成了审查意见，我们的评估框架可以评估代码审查代理的审查能力。我们使用该评估框架来评估当今最先进的开源PR代理，以及来自Devin、Claude Code和Codex的商业代码审查代理。我们的c-CRAB数据集是系统地从人类审查中构建的——给定一个pull-request实例的人类审查，我们生成相应的测试来评估代码审查代理生成的审查意见。这种基准构建为我们提供了几个见解。首先，现有的审查代理加在一起只能解决大约40%的c-CRAB任务，表明未来研究有可能缩小这一差距。其次，我们观察到，代理审查往往从人类审查中考虑不同的方面，这表明代码审查中的人机协作潜力，可以在未来的软件团队中部署。最后但同样重要的是，我们数据集中由代理生成的测试作为保留测试套件和代理生成审查意见的质量门。未来代码生成代理、测试生成代理和代码审查代理的合作将意味着什么——仍有待进一步研究。

Summary / 总结

This paper addresses the quality assurance challenge in automatically generated code by introducing a new code review dataset called c-CRAB. The dataset is built from human reviews of pull requests, from which corresponding tests are generated to evaluate the reviews produced by code review agents. The evaluation covers the open-source PR-agent and commercial agents from Devin, Claude Code, and Codex. The results indicate that the existing agents taken together solve only around 40% of the tasks, suggesting room for improvement. Additionally, the agents often focus on different aspects than human reviewers do, indicating potential for human-agent collaboration. The generated tests also serve as a held-out quality gate for agent-generated reviews.

本文通过引入c-CRAB数据集来评估代码审查代理的能力，该数据集基于人类评审构建，并用于评估包括开源PR-agent和来自Devin、Claude Code和Codex的商业代理在内的先进代码审查代理。评估结果显示，这些代理只能处理大约40%的任务，表明存在改进的空间。此外，代理往往关注与人类不同的方面，这表明未来软件团队中可能存在人类-代理协作的潜力。该数据集还作为代理生成评审的质量门控器。

3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

Authors: Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang, Zhongjie He, Li Liu, Hongchao Fan, Hao Wu

First: 2026-03-24T17:18:44+00:00 · Latest: 2026-03-24T17:18:44+00:00

Comments: 24 pages, 11 figures, 12 tables

Abs · PDF · Code1 · Code2 · Code3

Abstract

While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we introduce 3DCity-LLM-1.2M dataset that comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.

中文标题/摘要

标题：3DCity-LLM：为3D城市规模感知与理解赋能的多模态大型语言模型

尽管多模态大型语言模型在对象中心或室内场景中表现出色，但将其扩展到3D城市规模环境仍是一项艰巨的挑战。为了解决这一问题，我们提出了3DCity-LLM，这是一种专为3D城市规模视觉语言感知和理解设计的统一框架。3DCity-LLM采用从粗到细的特征编码策略，包括三个并行分支，分别用于目标对象、对象间关系和全局场景。为了支持大规模训练，我们引入了包含约120万高质量样本的3DCity-LLM-1.2M数据集，这些样本覆盖了七个代表性任务类别，从精细对象分析到多方面场景规划。该数据集严格控制质量，整合了明确的3D数值信息和多样化的用户导向模拟，丰富了城市场景的问答多样性和真实性。此外，我们应用基于文本相似度度量和LLM语义评估的多维度协议，确保对所有方法进行忠实和全面的评估。在两个基准上的广泛实验表明，3DCity-LLM显著优于现有最先进的方法，为推进空间推理和城市智能提供了有希望和有意义的方向。源代码和数据集可在https://github.com/SYSU-3DSTAILab/3D-City-LLM获取。

Summary / 总结

The research aims to address the challenge of scaling multi-modality large language models to 3D city-scale environments. 3DCity-LLM is proposed as a unified framework with a coarse-to-fine feature encoding strategy and a large-scale dataset (3DCity-LLM-1.2M) comprising 1.2 million samples. The framework significantly outperforms existing methods on two benchmarks, demonstrating its effectiveness in 3D city-scale perception and understanding. The dataset and source code are publicly available.

研究旨在解决将多模态大规模语言模型扩展到3D城市规模环境的挑战。提出了3DCity-LLM统一框架，采用粗到细特征编码策略，并包含120万样本的大规模数据集（3DCity-LLM-1.2M）。该框架在两个基准测试中显著优于现有方法，展示了其在3D城市规模感知和理解中的有效性。数据集和源代码已公开可用。

Evaluating LLM-Based Test Generation Under Software Evolution

Authors: Sabaat Haroon, Mohammad Taha Khan, Muhammad Ali Gulzar

First: 2026-03-24T17:14:18+00:00 · Latest: 2026-03-24T17:14:18+00:00

Comments: 10 pages, 9 figures, 2 tables

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully passing test suites on the original programs. However, performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. More than 99% of failing SAC tests pass on the original program while executing the modified region, indicating residual alignment with the original behavior rather than adaptation to updated semantics. Performance also declines under SPCs despite unchanged functionality: pass rates fall to 79% and branch coverage to 69%. Although SPC edits preserve semantics, they often introduce larger syntactic changes, leading to instability in generated test suites. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical changes rather than true semantic impact. Overall, our results indicate that current LLM-based test generation relies heavily on surface-level cues and struggles to maintain regression awareness as programs evolve.

中文标题/摘要

标题：基于软件演化的LLM测试生成评估

大型语言模型（LLMs）越来越多地用于自动化单元测试生成。然而，尚不清楚这些测试是否反映了对程序行为的真实推理，还是仅仅再现了训练期间学到的表面模式。如果后者占主导地位，LLM生成的测试可能会表现出覆盖率降低、遗漏回归和未检测到的故障等弱点。因此，理解LLM如何生成测试以及这些测试如何响应代码演化是至关重要的。我们进行了一项大规模的实证研究，探讨了在程序变更下基于LLM的测试生成。使用自动化的突变驱动框架，我们分析了在八个LLM和22,374个程序变体中生成的测试对语义改变（SAC）和语义保持（SPC）变化的反应。 LLMs在基线测试中表现出色，原始程序的行覆盖率和分支覆盖率分别达到79%和76%。然而，随着程序的演化，性能下降。在SACs下，新生成的测试通过率降至66%，分支覆盖率降至60%。超过99%的SAC测试在执行修改区域时通过原始程序，表明这些测试仍然与原始行为保持残余对齐，而不是适应更新的语义。在SPCs下，尽管功能未变，性能也下降：通过率降至79%，分支覆盖率降至69%。虽然SPC编辑保持了语义，但它们通常引入了更大的语法变化，导致生成的测试套件不稳定。模型生成更多新测试，同时丢弃许多基线测试，表明对词汇变化的敏感性而非真正的语义影响。总体而言，我们的结果表明，当前基于LLM的测试生成高度依赖于表面线索，并且在程序演化过程中难以保持回归意识。

Summary / 总结

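The study evaluates the effectiveness of Large Language Models (LLMs) in generating automated unit tests under software evolution. Using an automated mutation-driven framework, it assesses how LLM-generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs reach 79% line coverage and 76% branch coverage on the original programs, but performance degrades as programs evolve: under SACs the pass rate drops to 66% and branch coverage to 60%, and under SPCs they fall to 79% and 69% despite unchanged functionality. Failing SAC tests overwhelmingly still pass on the original program, indicating alignment with the original behavior rather than adaptation to the updated semantics.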

研究评估了大型语言模型（LLMs）在软件演化过程中生成单元测试的表现。通过自动突变驱动框架，评估了LLM生成的测试在语义改变和语义保持改变下的反应。尽管LLMs在基线测试中表现出色，但随着程序的演化，其性能显著下降，在语义改变的情况下，通过率和分支覆盖率大幅下降，即使在语义保持的情况下，由于词法变化，通过率和分支覆盖率也有所下降。
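
The abstract's observation that more than 99% of failing SAC tests still pass on the original program while executing the modified region can be read as a simple ratio over test outcomes. The sketch below is one hypothetical way to compute such a ratio. The `Outcome` record, its field names, and the sample data are assumptions for illustration and do not describe the authors' framework.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Outcome:
    """Result of one generated test (hypothetical record layout)."""
    passes_on_modified: bool        # passes against the changed program
    passes_on_original: bool        # passes against the original program
    executes_modified_region: bool  # covers the lines that were changed


def residual_alignment_rate(outcomes: Iterable[Outcome]) -> float:
    """Share of failing tests that pass on the original and reach the change."""
    failing = [o for o in outcomes if not o.passes_on_modified]
    if not failing:
        return 0.0  # nothing failed, so there is nothing to attribute
    aligned = sum(
        1 for o in failing
        if o.passes_on_original and o.executes_modified_region
    )
    return aligned / len(failing)


sample = [
    Outcome(False, True, True),   # fails now, matched the old behavior
    Outcome(False, False, True),  # fails on both versions
    Outcome(True, True, True),    # still passes, not counted
    Outcome(False, True, False),  # fails, but never reaches the change
]
print(residual_alignment_rate(sample))
```

A rate near 1 would suggest that the failing tests encode the old behavior instead of the updated semantics.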

The Potential of Copernicus Satellites for Disaster Response: Retrieving Building Damage from Sentinel-1 and Sentinel-2

Authors: Olivier Dietrich, Merlin Alfredsson, Emilia Arens, Nando Metzger, Torben Peters, Linus Scheibenreif, Jan Dirk Wegner, Konrad Schindler

First: 2025-11-07T18:02:07+00:00 · Latest: 2026-03-24T17:13:46+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Natural disasters demand rapid damage assessment to guide humanitarian response. Here, we investigate whether medium-resolution Earth observation images from the Copernicus program can support building damage assessment, complementing very-high resolution imagery with often limited availability. We introduce xBD-S12, a dataset of 10,315 pre- and post-disaster image pairs from both Sentinel-1 and Sentinel-2, spatially and temporally aligned with the established xBD benchmark. In a series of experiments, we demonstrate that building damage can be detected and mapped rather well in many disaster scenarios, despite the moderate 10 m ground sampling distance. We also find that, for damage mapping at that resolution, architectural sophistication does not seem to bring much advantage: more complex model architectures tend to struggle with generalization to unseen disasters, and geospatial foundation models bring little practical benefit. Our results suggest that Copernicus images are a viable data source for rapid, wide-area damage assessment and could play an important role alongside VHR imagery. We release the xBD-S12 dataset, code, and trained models to support further research at https://github.com/prs-eth/xbd-s12 .

中文标题/摘要

标题：哥白尼卫星在灾害响应中的潜力：利用Sentinel-1和Sentinel-2获取建筑损坏信息

自然灾害需要快速的损失评估以指导人道主义响应。本文探讨了哥白尼计划中中分辨率地球观测图像是否可以支持建筑损坏评估，以补充高分辨率图像的有限可用性。我们介绍了由10,315对灾前和灾后图像组成的xBD-S12数据集，这些图像来自Sentinel-1和Sentinel-2，并与现有的xBD基准数据集在空间和时间上对齐。一系列实验表明，尽管地面采样距离为10米，但在许多灾害场景中，建筑损坏可以被较好地检测和映射。我们还发现，对于该分辨率的损坏映射，建筑复杂性似乎并没有带来太多优势：更复杂的模型架构往往难以泛化到未见过的灾害，而地理空间基础模型也几乎没有实际益处。我们的结果表明，哥白尼图像可以作为快速、大面积损失评估的有效数据源，并且可以与高分辨率图像一起发挥重要作用。我们已在https://github.com/prs-eth/xbd-s12 上发布了xBD-S12数据集、代码和训练模型，以支持进一步研究。

Summary / 总结

This study investigates the use of medium-resolution images from Copernicus satellites (Sentinel-1 and Sentinel-2) for building damage assessment after natural disasters. The research introduces xBD-S12, a dataset of 10,315 pre- and post-disaster image pairs, to evaluate damage detection. The study finds that despite the moderate resolution, building damage can be effectively mapped in many disaster scenarios. More complex model architectures did not significantly improve performance, and geospatial foundation models did not provide practical benefits. The results suggest that Copernicus images can be a valuable data source for rapid damage assessment, complementing high-resolution imagery.

该研究探讨了使用Copernicus Sentinel-1和Sentinel-2图像进行灾后建筑物损坏评估的方法。研究人员开发了包含10,315组灾前和灾后图像对的xBD-S12数据集，以评估中分辨率卫星图像的有效性。尽管分辨率较低，研究结果显示，在许多灾害场景中，建筑物损坏可以被很好地检测和映射。研究结果表明，更复杂的模型架构并未显著提高性能，而地理空间基础模型也没有提供实质性的帮助。研究结果表明，Copernicus图像可以成为快速、大面积损坏评估的重要数据来源，补充高分辨率图像的作用。

EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards

Authors: Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, Kui Jia

First: 2026-03-18T15:02:19+00:00 · Latest: 2026-03-24T17:12:47+00:00

Comments: Project page: https://eva-project-page.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.

中文标题/摘要

标题：EVA：通过逆动力学奖励使视频世界模型与可执行机器人动作对齐

视频生成模型在机器人学中越来越多地用作世界模型，其中模型根据当前观察和任务指令生成未来视觉滚动，逆动力学模型（IDM）将生成的帧转换为可执行的机器人动作。然而，当前的视频世界模型缺乏明确的可执行性约束。因此，视觉上连贯的滚动可能仍然违反刚体和运动学一致性，当由IDM解码时，会产生不稳定或不可行的控制命令。我们将这种视觉生成与物理可执行控制之间的不匹配称为可执行性差距。虽然可以通过拒绝采样等技术在推理时减轻这种差距，但这些方法由于视频生成成本高而效率低下。在本文中，我们利用可执行性差距作为训练信号，并引入了可执行视频对齐（EVA），这是一种用于对齐视频世界模型的强化学习后训练框架。EVA 在真实机器人轨迹上训练逆动力学模型，并将其重新用于评估通过它们诱导的动作序列生成的视频的奖励模型，鼓励由速度、加速度和冲击度衡量的平滑运动，同时惩罚违反实体约束的动作。重要的是，即使生成的视频包含严重的视觉伪影，奖励仍然具有信息性，因为这些伪影通常会转化为不稳定或超出范围的动作。在RoboTwin基准测试和一个真实的双臂机器人上的实验表明，EVA 减少了生成滚动中的实体特定伪影并提高了下游任务执行的成功率。

Summary / 总结

This paper addresses the executability gap in video world models for robotics by introducing EVA, a reinforcement-learning post-training framework. EVA aligns video world models with physically executable robot actions by training an inverse dynamics model on real robot trajectories and using it as a reward model to evaluate generated videos based on smooth motions and embodiment constraints. Experiments show that EVA reduces embodiment-specific artifacts and improves task execution success.

研究解决了机器人中视频世界模型的可执行性缺口问题，即视觉连贯的轨迹可能违反物理约束。EVA 是一个强化学习框架，它在真实机器人轨迹上训练逆动力学模型，并将其用作奖励模型，以鼓励平滑动作并惩罚违反约束的动作。实验表明，EVA 减少了特定于实体的视觉伪影，并提高了下游任务执行的成功率。
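
The reward described in the EVA abstract above (smoothness measured by velocity, acceleration, and jerk, plus a penalty for out-of-bound actions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the finite-difference scheme, the weights, and the per-dimension box bounds are all assumptions chosen for illustration, and the action sequence is assumed to have already been decoded from a generated video by an IDM.

```python
from typing import List


def finite_diff(xs: List[List[float]], dt: float) -> List[List[float]]:
    """First-order finite difference along time for a sequence of action vectors."""
    return [[(b - a) / dt for a, b in zip(prev, cur)] for prev, cur in zip(xs, xs[1:])]


def mean_sq(xs: List[List[float]]) -> float:
    """Mean squared value over all entries (0.0 for an empty sequence)."""
    vals = [v * v for row in xs for v in row]
    return sum(vals) / len(vals) if vals else 0.0


def executability_reward(
    actions: List[List[float]],   # hypothetical IDM output: one action vector per frame
    lower: List[float],           # assumed per-dimension lower bounds
    upper: List[float],           # assumed per-dimension upper bounds
    dt: float = 1.0,
    w_vel: float = 1.0,
    w_acc: float = 1.0,
    w_jerk: float = 1.0,
    w_bound: float = 10.0,
) -> float:
    """Higher is better: smooth, in-bound action sequences score close to zero."""
    vel = finite_diff(actions, dt)
    acc = finite_diff(vel, dt)
    jerk = finite_diff(acc, dt)
    smooth_cost = w_vel * mean_sq(vel) + w_acc * mean_sq(acc) + w_jerk * mean_sq(jerk)

    # Average distance by which each action coordinate leaves its allowed range.
    violation = sum(
        max(lo_b - a, 0.0) + max(a - hi_b, 0.0)
        for row in actions
        for a, lo_b, hi_b in zip(row, lower, upper)
    )
    bound_cost = violation / (len(actions) * len(lower))
    return -(smooth_cost + w_bound * bound_cost)
```

Under these assumptions, a rollout with severe visual artifacts would typically decode into erratic or out-of-range actions and therefore receive a strongly negative reward, which matches the abstract's point that the reward stays informative even for badly corrupted videos.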

Similarity-Aware Mixture-of-Experts for Data-Efficient Continual Learning

Authors: Connor Mclaughlin, Nigel Lee, Lili Su

First: 2026-03-24T17:10:47+00:00 · Latest: 2026-03-24T17:10:47+00:00

Comments: 9 pages

Abs · PDF · Code1 · Code2

Abstract

Machine learning models often need to adapt to new data after deployment due to structured or unstructured real-world dynamics. The Continual Learning (CL) framework enables continuous model adaptation, but most existing approaches either assume each task contains sufficiently many data samples or that the learning tasks are non-overlapping. In this paper, we address the more general setting where each task may have a limited dataset, and tasks may overlap in an arbitrary manner without a priori knowledge. This general setting is substantially more challenging for two reasons. On the one hand, data scarcity necessitates effective contextualization of general knowledge and efficient knowledge transfer across tasks. On the other hand, unstructured task overlapping can easily result in negative knowledge transfer. To address the above challenges, we propose an adaptive mixture-of-experts (MoE) framework over pre-trained models that progressively establishes similarity awareness among tasks. Our design contains two innovative algorithmic components: incremental global pooling and instance-wise prompt masking. The former mitigates prompt association noise through gradual prompt introduction over time. The latter decomposes incoming task samples into those aligning with current prompts (in-distribution) and those requiring new prompts (out-of-distribution). Together, our design strategically leverages potential task overlaps while actively preventing negative mutual interference in the presence of per-task data scarcity. Experiments across varying data volumes and inter-task similarity show that our method enhances sample efficiency and is broadly applicable.

中文标题/摘要

标题：感知相似性的专家混合模型以实现数据高效连续学习

机器学习模型在部署后通常需要适应新的数据，以应对结构化或非结构化的现实世界动态。连续学习（CL）框架能够实现模型的持续适应，但大多数现有方法要么假设每个任务包含足够的数据样本，要么认为学习任务是不重叠的。在本文中，我们解决了每个任务可能数据有限且任务以任意方式重叠的更一般情况，且在没有先验知识的情况下。这种更一般的情况由于两个原因而更具挑战性。一方面，数据稀缺性要求有效利用一般知识并在任务之间高效转移知识。另一方面，无结构的任务重叠可能导致负面知识转移。为了解决上述挑战，我们提出了一种基于预训练模型的自适应专家混合（MoE）框架，该框架逐步建立任务之间的相似性意识。我们的设计包含两个创新的算法组件：增量全局池化和实例级提示掩码。前者通过时间上的逐步提示引入来缓解提示关联噪声。后者将传入的任务样本分解为与当前提示一致的样本（在分布内）和需要新提示的样本（不在分布内）。结合我们的设计，我们的方法在每个任务数据稀缺的情况下战略性地利用潜在的任务重叠，同时积极防止负面相互干扰。在不同数据量和任务间相似性的实验中表明，我们的方法提高了样本效率，并具有广泛的应用性。

Summary / 总结

This paper addresses the challenge of continual learning in scenarios with limited data and overlapping tasks, which are common in real-world applications. The authors propose a similarity-aware mixture-of-experts framework that progressively builds task similarity awareness and uses two key techniques: incremental global pooling and instance-wise prompt masking. These techniques help mitigate prompt association noise and decompose task samples, enhancing sample efficiency and preventing negative knowledge transfer. Experiments demonstrate that the proposed method improves sample efficiency and is broadly applicable across different data volumes and task similarities.

本文解决了在数据有限和任务重叠的情况下进行持续学习的挑战，这是现实世界应用中的常见问题。它提出了一种相似性感知的混合专家框架，使用增量全局池化和实例级提示掩蔽来有效管理任务重叠并防止负面知识转移。该方法提高了样本效率，并适用于不同的数据量和任务相似性。
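
The instance-wise prompt masking step described above splits incoming samples into those that align with current prompts and those that need new prompts. A minimal sketch of one way to make such a split is shown below. The cosine-similarity criterion, the fixed threshold, and all names are assumptions for illustration; the paper's actual matching rule is not given in the abstract.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def split_by_prompt_match(
    features: List[Sequence[float]],     # hypothetical per-sample embeddings
    prompt_keys: List[Sequence[float]],  # hypothetical keys of the existing prompts
    threshold: float = 0.5,              # assumed similarity cutoff
) -> Tuple[List[int], List[int]]:
    """Return (in_distribution_indices, out_of_distribution_indices)."""
    in_dist: List[int] = []
    out_dist: List[int] = []
    for idx, feat in enumerate(features):
        # Best match against any existing prompt; -1.0 if no prompts exist yet.
        best = max((cosine(feat, key) for key in prompt_keys), default=-1.0)
        (in_dist if best >= threshold else out_dist).append(idx)
    return in_dist, out_dist
```

In this sketch, samples in the first list would reuse existing prompts, while samples in the second list would trigger the introduction of a new prompt, so unrelated tasks do not overwrite each other's prompts.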

Mecha-nudges for Machines

Authors: Giulio Frey, Kawin Ethayarajh

First: 2026-03-24T17:02:21+00:00 · Latest: 2026-03-24T17:02:21+00:00

Abs · PDF · Code1 · Code2

Abstract

Nudges are subtle changes to the way choices are presented to human decision-makers (e.g., opt-in vs. opt-out by default) that shift behavior without restricting options or changing incentives. As AI agents increasingly make decisions in the same environments as humans, the presentation of choices may be optimized for machines as well as people. We introduce mecha-nudges: changes to how choices are presented that systematically influence AI agents without degrading the decision environment for humans. To formalize mecha-nudges, we combine the Bayesian persuasion framework with V-usable information, a generalization of Shannon information that is observer-relative. This yields a common scale (bits of usable information) for comparing a wide range of interventions, contexts, and models. Applying our framework to product listings on Etsy -- a global marketplace for independent sellers -- we find that following ChatGPT's release, listings have significantly more machine-usable information about product selection, consistent with systematic mecha-nudging.

中文标题/摘要

标题：机器人的微妙提示

微妙提示是指通过微妙改变选择呈现方式来影响人类决策者的行为（例如，默认情况下是选择加入还是选择退出），而不限制选项或改变激励措施。随着人工智能代理越来越多地在与人类相同的环境中做出决策，选择的呈现方式可能同时优化以适应机器和人类。我们引入了机器人的微妙提示：一种系统性影响人工智能代理而不损害人类决策环境的改变选择呈现方式的方法。为了形式化机器人的微妙提示，我们将贝叶斯说服框架与维可利用信息相结合，这是一种相对观察者的香农信息的一般化。这产生了一个共同的尺度（可利用信息的比特数），用于比较广泛的各种干预措施、上下文和模型。将我们的框架应用于全球独立卖家市场Etsy的产品列表中，我们发现，在ChatGPT发布后，产品选择中的可利用信息显著增加，这与系统性的机器人微妙提示一致。

Summary / 总结

This paper introduces mecha-nudges: changes to how choices are presented that systematically influence AI agents without degrading the decision environment for humans, by analogy with nudges for human decision-makers. The authors formalize mecha-nudges by combining the Bayesian persuasion framework with V-usable information, an observer-relative generalization of Shannon information, which gives a common scale (bits of usable information) for comparing interventions, contexts, and models. Applying this framework to Etsy product listings, the study finds that after ChatGPT's release, listings contain significantly more machine-usable information about product selection, which is consistent with systematic mecha-nudging.

研究探讨了mecha-nudge的概念，即通过微妙改变选择的呈现方式来影响AI的行为，而不影响人类的决策。通过结合贝叶斯说服理论和V-可用信息，研究人员开发了一个框架来衡量这些干预措施。将这一方法应用于Etsy的listing后，他们发现，在ChatGPT发布之后，关于产品选择的机器可用信息显著增加，表明了对AI行为的系统性mecha-nudge。
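
For readers unfamiliar with V-usable information, its standard definition from the prior literature is reproduced below as a reference sketch. The notation is assumed here ($\mathcal{V}$ for the family of predictive models available to the observer, $X$ for how a choice is presented, $Y$ for the decision); the paper's own formulation may differ.

```latex
\begin{align}
H_{\mathcal{V}}(Y \mid X) &= \inf_{f \in \mathcal{V}} \mathbb{E}\left[-\log_2 f[X](Y)\right], \\
H_{\mathcal{V}}(Y) &= \inf_{f \in \mathcal{V}} \mathbb{E}\left[-\log_2 f[\varnothing](Y)\right], \\
I_{\mathcal{V}}(X \to Y) &= H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X).
\end{align}
```

The result is measured in bits and depends on the observer through $\mathcal{V}$: the same listing can carry different amounts of usable information for different models. When $\mathcal{V}$ contains all functions, $I_{\mathcal{V}}$ reduces to Shannon mutual information.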

POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency

Authors: Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi

Venue: CVPR

First: 2025-10-01T15:15:36+00:00 · Latest: 2026-03-24T16:54:51+00:00

Comments: Accepted in MAR at CVPR Workshop (Proceedings Track)

Abs · PDF · Code1 · Code2

Abstract

Video Question Answering (VQA) with Large Vision Language Models (LVLMs) has gained significant traction in research ever since Flamingo was introduced by DeepMind. Recent advancements in large-context/long-video question answering have allowed VQA tasks to use context windows of 1500+ frames. However, this covers only about 50 seconds of video footage without losing any significant information. We introduce POVQA, a data-efficient pipeline that compresses each second of video into a single temporally pooled image (via motion blur and weighted averaging variants) and then aligns LVLMs with lightweight supervision. Concretely, we build 1 fps input sources using Blend Blur with Last Frame, Weighted Average, Exponential, and Ramp pooling and fine-tune QWEN-2.5-VL 7B with supervised two-turn targets that include reasoning and a final answer. We apply Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) on our novel dataset ReasonVQA, consisting of 12 movies with 239 human-annotated question-answer pairs with reasoning prompts. On our ReasonVQA dataset, this method dramatically improves performance over pooled baselines: F1 score improves from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Rationale quality also significantly increases. Cross-evaluation of SFT + DPO on various pooling functions shows that the gains persist regardless of the pooling scheme used at train or test time, indicating strong robustness in summarizing temporal evidence. Similar observations were made in zero-shot evaluation on TVQA.

中文标题/摘要

标题：POVQA：基于偏好优化的视频问答与数据效率推理

自Deepmind推出Flamingo以来，大规模视觉语言模型（LVLM）驱动的视频问答（VQA）在研究中获得了显著的关注。最近在长视频上下文问答方面的进展使得VQA任务能够处理1500多帧的上下文窗口，但这也仅相当于50秒的视频内容，而不会丢失重要信息。我们提出了POVQA，这是一种数据高效的管道，将每秒的视频压缩成单个时间池化图像（通过运动模糊和加权平均变体），然后通过轻量级监督与LVLM对齐。具体来说，我们使用Blend Blur with Last Frame、Weighted Average、Exponential和Ramp池化构建1 fps输入源，并使用监督两轮目标（包括推理和最终答案）对QWEN-2.5-VL 7B进行微调。我们在ReasonVQA数据集上应用了监督微调（SFT）和直接偏好优化（DPO），该数据集包含12部电影，有239个人工标注的问题-答案及其推理提示。在ReasonVQA数据集上，该方法显著提高了性能：F1分数从0.212提高到0.543，BLEU-4从0.031提高到0.291，ROUGE-L从0.196提高到0.528。推理质量也显著提高。SFT + DPO在各种池化函数上的跨验证表明，无论是在训练还是测试时使用哪种池化方案，性能提升都保持一致，这表明该方法在时间证据总结方面具有很强的鲁棒性。类似观察结果也出现在TVQA的零样本测试中。

Summary / 总结

The research introduces POVQA, a data-efficient pipeline for video question answering that compresses each second of video into a single temporally pooled image and aligns large vision language models with lightweight supervision. The method uses Blend Blur with Last Frame, Weighted Average, Exponential, and Ramp pooling to create 1 fps input sources and fine-tunes QWEN-2.5-VL 7B with supervised two-turn targets, using SFT and DPO. On the ReasonVQA dataset, this approach improves over pooled baselines: F1 rises from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. The gains persist across pooling functions at train and test time, indicating robustness in summarizing temporal evidence.

POVQA 是一种高效的数据管道，将视频压缩成时间池化图像以用于 VQA 任务。它使用 Blend Blur 与 Last Frame、加权平均、指数和梯形池化来生成 1 fps 的输入源，并使用监督两轮目标对 QWEN-2.5-VL 7B 进行微调。在 ReasonVQA 数据集上，该方法显著提高了 F1 分数、BLEU-4 和 ROUGE-L 分数，并增强了推理质量。这些收益在不同的池化函数下保持一致。
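
The temporal pooling idea above (one pooled image per second of video) can be sketched as a weighted average over the frames of that second. The scheme names echo those in the abstract, but the specific weight formulas below are assumptions, not the paper's definitions; frames are represented as flat lists of pixel intensities to keep the sketch dependency-free.

```python
from typing import List


def pool_second(frames: List[List[float]], scheme: str = "ramp", decay: float = 0.5) -> List[float]:
    """Collapse the frames of one second into a single pooled frame.

    frames: frames in temporal order, each a flat list of pixel values.
    scheme: "uniform", "ramp" (linearly favors later frames), or
            "exponential" (geometrically favors later frames); the
            weight formulas are illustrative assumptions.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("need at least one frame")
    if scheme == "uniform":
        weights = [1.0] * n
    elif scheme == "ramp":
        weights = [float(i + 1) for i in range(n)]
    elif scheme == "exponential":
        weights = [decay ** (n - 1 - i) for i in range(n)]
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    total = sum(weights)
    num_pixels = len(frames[0])
    # Weighted average per pixel position.
    return [
        sum(w * frame[p] for w, frame in zip(weights, frames)) / total
        for p in range(num_pixels)
    ]
```

As a rough budget check, and assuming a 30 fps source (the abstract's 1500 frames for 50 seconds implies this rate), 1500 pooled images at one per second would span 1500 seconds, or 25 minutes of video, instead of 50 seconds.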

RealCQA-V2: A Diagnostic Benchmark for Structured Visual Entailment over Scientific Charts

Authors: Saleem Ahmed, Srirangaraj Setlur, Venu Govindaraju

First: 2024-10-29T19:32:53+00:00 · Latest: 2026-03-24T16:52:30+00:00

Comments: Under Review : Code and Data will be made public soon - https://cse-ai-lab.github.io/VPP/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Multimodal reasoning models often produce fluent answers supported by seemingly coherent rationales. Existing benchmarks evaluate only final-answer correctness. They do not support atomic visual entailment verification of intermediate steps, especially visual compositional logic. This limitation is especially acute in scientific chart understanding, where answers depend on deterministically grounded visual semantics such as axes, legends, and quantitative relations. We introduce RealCQA-V2, a large-scale benchmark that reformulates chart question answering as Visual Premise Proving (VPP): a structured logical entailment task over chart-grounded visual predicates. Each question is deconstructed into manually curated, atomic premises grounded in chart elements (axes, legends, marks, and quantitative relations), yielding executable reasoning chains rather than free-form textual rationales. These premises form compositional reasoning chains, enabling verification at the level of individual visual statements and complete reasoning sequences. We introduce chain-level metrics that measure both full logical validity (AccVPP) and partial reasoning progress within failed chains (DCP), extending beyond traditional VQA accuracy. Baseline evaluations across representative LVLMs reveal a consistent local-global reasoning gap: models often verify many individual premises correctly while failing to preserve coherence across the full chain. RealCQA-V2 establishes a reproducible benchmark for structured visual entailment over real scientific charts and enables rigorous diagnosis of multimodal reasoning beyond answer-only evaluation.

中文标题/摘要

标题：RealCQA-V2：科学图表结构化视觉演绎诊断基准

多模态推理模型通常生成由看似连贯的推理支持的流畅答案。现有基准仅评估最终答案的正确性，而不支持对中间步骤的原子视觉演绎验证，尤其是视觉组合逻辑。这一限制在科学图表理解中尤为明显，因为答案依赖于轴、图例和定量关系等确定性的视觉语义。我们引入了RealCQA-V2，这是一个大规模基准，将图表问题回答重新表述为视觉前提证明（VPP）：基于图表的视觉谓词的结构化逻辑演绎任务。每个问题被分解为手动整理的、基于图表元素（轴、图例、标记和定量关系）的原子前提，生成可执行的推理链而非自由形式的文本推理。这些前提形成组合推理链，允许在单个视觉声明和完整推理序列的层面上进行验证。我们引入了链级度量，衡量完全逻辑有效性（AccVPP）和失败链中的部分推理进展（DCP），超越了传统的VQA准确性。代表性的LVLM基线评估揭示了一致的局部-全局推理差距：模型通常正确验证许多单个前提，但在保持整个链的连贯性方面失败。RealCQA-V2为真实科学图表上的结构化视觉演绎建立了可重复的基准，并使多模态推理超越仅答案评估的严格诊断成为可能。

Summary / 总结

RealCQA-V2 is a benchmark designed to evaluate structured visual entailment in scientific chart understanding. It reformulates chart question answering as a structured logical entailment task, breaking down questions into atomic visual premises. This allows for the verification of intermediate steps and compositional reasoning. Key findings show that existing models often correctly verify individual premises but fail to maintain coherence across the full reasoning chain, highlighting a local-global reasoning gap. The benchmark introduces new metrics to measure logical validity and reasoning progress, enabling a more rigorous diagnosis of multimodal reasoning models beyond simple answer correctness.

RealCQA-V2 是一个用于评估科学图表中结构化视觉蕴含的基准。它将图表问题回答重新表述为视觉前提证明（VPP），将问题分解为原子视觉前提，并衡量逻辑有效性和推理进展。基线评估显示存在局部-全局推理差距，模型可以验证单个前提但无法在整个推理链中保持连贯性。
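
The local-global gap described above can be made concrete with a small sketch that scores reasoning chains at three levels. The abstract names the metrics AccVPP (full logical validity) and DCP (partial progress within failed chains) but does not define them, so the formulas below are assumptions: chain accuracy is the fraction of chains with every premise verified correctly, and partial progress is the fraction of premises verified before the first error, averaged over failed chains.

```python
from typing import List, Tuple


def chain_metrics(chains: List[List[bool]]) -> Tuple[float, float, float]:
    """Score premise chains; each bool says whether the model verified that premise correctly.

    Returns (chain_accuracy, partial_progress, premise_accuracy).
    These are illustrative stand-ins, not the paper's AccVPP/DCP definitions.
    """
    if not chains:
        return 0.0, 0.0, 0.0

    failed = [c for c in chains if not all(c)]
    chain_accuracy = (len(chains) - len(failed)) / len(chains)

    def prefix_fraction(chain: List[bool]) -> float:
        # Share of the chain verified correctly before the first mistake.
        correct = 0
        for ok in chain:
            if not ok:
                break
            correct += 1
        return correct / len(chain)

    partial_progress = (
        sum(prefix_fraction(c) for c in failed) / len(failed) if failed else 0.0
    )

    total_premises = sum(len(c) for c in chains)
    premise_accuracy = (
        sum(sum(c) for c in chains) / total_premises if total_premises else 0.0
    )
    return chain_accuracy, partial_progress, premise_accuracy
```

With three chains `[[True, True, True], [True, True, False, True], [False, True]]`, this sketch gives a premise accuracy of 7/9 but a chain accuracy of only 1/3, which is the kind of gap the benchmark is designed to expose.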

Bilevel Autoresearch: Meta-Autoresearching Itself

Authors: Yaonan Qu, Meng Lu

First: 2026-03-24T16:52:25+00:00 · Latest: 2026-03-24T16:52:25+00:00

Comments: 13 pages, 5 figures, 3 tables. This paper was primarily drafted by AI agents with human oversight and direction

Abs · PDF · Code1 · Code2

Abstract

If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We take this idea literally: we use an autoresearch loop to optimize the autoresearch loop. Every existing autoresearch system -- from Karpathy's single-track loop to AutoResearchClaw's multi-batch extension and EvoScientist's persistent memory -- was improved by a human who read the code, identified a bottleneck, and wrote new code. We ask whether an LLM can do the same, autonomously. We present Bilevel Autoresearch, a bilevel framework where an outer loop meta-optimizes the inner autoresearch loop by generating and injecting new search mechanisms as Python code at runtime. The inner loop optimizes the task; the outer loop optimizes how the inner loop searches. Both loops use the same LLM -- no stronger model is needed at the meta level. On Karpathy's GPT pretraining benchmark, the meta-autoresearch outer loop achieves a 5x improvement over the standard inner loop alone (-0.045 vs. -0.009 val_bpb), while parameter-level adjustment without mechanism change yields no reliable gain. The outer loop autonomously discovers mechanisms from combinatorial optimization, multi-armed bandits, and design of experiments -- without human specification of which domains to explore. These mechanisms succeed by breaking the inner loop's deterministic search patterns, forcing exploration of directions the LLM's priors systematically avoid. The core principle is simple: if autoresearch can meta-autoresearch itself, it can, in principle, meta-autoresearch anything with a measurable objective.

中文标题/摘要

标题：双层自研究：自我研究的元研究

如果自研究本身就是一种研究形式，那么自研究可以应用于研究本身。我们严肃地对待这一想法：我们使用一个自研究循环来优化自研究循环。现有的所有自研究系统——从Karpathy的一轨循环到AutoResearchClaw的多批扩展以及EvoScientist的持久内存——都是由人类阅读代码、识别瓶颈并编写新代码改进的。我们询问是否可以由一个LLM自主地做到同样的事情。我们提出了双层自研究，这是一种双层框架，在该框架中，外层循环通过生成并注入新的搜索机制作为Python代码在运行时来元优化内层自研究循环。内层循环优化任务；外层循环优化内层循环的搜索方式。两个循环使用相同的LLM——不需要在元层使用更强的模型。在Karpathy的GPT预训练基准上，元自研究外层循环在仅使用标准内层循环的情况下实现了5倍的改进（-0.045 vs. -0.009 val_bpb），而参数级别的调整在机制不变的情况下没有获得可靠的收益。外层循环自主地从组合优化、多臂老虎机和实验设计中发现机制——无需人类指定要探索的领域。这些机制通过打破内层循环的确定性搜索模式，迫使探索LLM先验系统性避免的方向。核心原则很简单：如果自研究可以自我研究自己，原则上它可以自我研究任何具有可测量目标的东西。

Summary / 总结

The paper introduces Bilevel Autoresearch, a system where an outer loop meta-optimizes an inner autoresearch loop by generating new search mechanisms as Python code and injecting them at runtime; both loops use the same LLM. On Karpathy's GPT pretraining benchmark, the meta-autoresearch outer loop achieved a 5x improvement over the standard inner loop alone (-0.045 vs. -0.009 val_bpb), while parameter-level adjustment without mechanism change yielded no reliable gain. The outer loop autonomously discovered mechanisms from combinatorial optimization, multi-armed bandits, and design of experiments. These mechanisms work by breaking the inner loop's deterministic search patterns and forcing exploration of directions the LLM's priors systematically avoid.

论文提出了Bilevel Autoresearch系统，该系统通过在运行时生成新的搜索机制来元优化内部的自科研系统。在Karpathy的GPT预训练基准测试上，元自科研系统实现了标准内部循环5倍的性能提升。外部循环自主发现了组合优化、多臂老虎机和实验设计等机制，打破了内部循环的确定性搜索模式，迫使探索LLM先验系统系统性避免的方向，从而取得了显著的性能提升。

Biased Error Attribution in Multi-Agent Human-AI Systems Under Delayed Feedback

Authors: Teerthaa Parakh, Karen M. Feigh

First: 2026-03-24T16:52:14+00:00 · Latest: 2026-03-24T16:52:14+00:00

Comments: 14 pages, 9 figures. Preprint. An extended abstract is under submission

Abs · PDF · Code1 · Code2

Abstract

Human decision-making is strongly influenced by cognitive biases, particularly under conditions of uncertainty and risk. While prior work has examined bias in single-step decisions with immediate outcomes and in human interaction with a single autonomous agent, comparatively little attention has been paid to decision-making under delayed outcomes involving multiple AI agents, where decisions at each step affect subsequent states. In this work, we study how delayed outcomes shape decision-making and responsibility attribution in a multi-agent human-AI task. Using a controlled game-based experiment, we analyze how participants adjust their behavior following positive and negative outcomes. We observe asymmetric responses to gains and losses, with stronger corrective adjustments after negative outcomes. Importantly, participants often fail to correctly identify the actions that caused failure and misattribute responsibility across AI agents, leading to systematic revisions of decisions that are weakly related to the underlying causes of poor performance. We refer to this phenomenon as a form of attribution bias, manifested as biased error attribution under delayed feedback. Our findings highlight how cognitive biases can be amplified in human-AI systems with delayed outcomes and multiple autonomous agents, underscoring the need for decision-support systems that better support causal understanding and learning over time.

中文标题/摘要

标题：多智能体人类-人工智能系统中延迟反馈下的偏差错误归因

人类决策受到认知偏差的强烈影响，尤其是在不确定性与风险条件下。尽管先前的研究已经探讨了单步决策及其即时结果中的偏差，以及人类与单一自主代理交互中的偏差，但在涉及多个AI代理的延迟结果决策中，决策在每一步的影响后续状态的情况却较少受到关注。在本研究中，我们研究了延迟结果如何影响多智能体人类-人工智能任务中的决策和责任归因。通过一个受控的游戏实验，我们分析了参与者在正向和负向结果后如何调整其行为。我们观察到对收益和损失的不对称反应，负面结果后有更强的纠正调整。重要的是，参与者往往未能正确识别导致失败的行为，并错误地将责任归咎于不同的AI代理，导致与不良表现的根本原因关系不大的决策系统性修订。我们将这一现象称为一种形式的归因偏差，在延迟反馈下表现为偏差错误归因。我们的研究结果突显了在具有延迟结果和多个自主代理的人类-人工智能系统中，认知偏差如何被放大，强调了需要更好的决策支持系统来支持因果理解并随着时间推移促进学习。

Summary / 总结

This study investigates how delayed outcomes affect decision-making and responsibility attribution in multi-agent human-AI systems. Participants in a controlled game-based experiment showed asymmetric responses to gains and losses, with stronger corrective adjustments after negative outcomes. However, they often misattributed responsibility across AI agents, leading to systematic revisions of decisions that were weakly related to the underlying causes of poor performance. This phenomenon, termed biased error attribution, highlights the amplification of cognitive biases in human-AI systems with delayed outcomes and multiple autonomous agents, emphasizing the need for better decision-support systems.

本研究探讨了延迟反馈如何影响多智能体人机系统中的决策和责任归因。在一项受控的游戏实验中，参与者对收益和损失的反应不对称，负面结果后会有更强的纠正调整。然而，他们经常错误地将责任归咎于不同的智能体，导致与实际原因关系不大的决策系统性调整。这种现象被称为归因偏差，在具有延迟反馈和多个自主智能体的人机系统中放大了认知偏差，强调了需要更好的决策支持系统来促进因果理解与长期学习。

SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling

Authors: Yiqi Zhang, Huiqiang Jiang, Xufang Luo, Zhihe Yang, Chengruidong Zhang, Yifei Shen, Dongsheng Li, Yuqing Yang, Lili Qiu, Yang You

First: 2026-03-24T16:48:31+00:00 · Latest: 2026-03-24T16:48:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollout and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency and maintaining training stability. SortedRL reorders rollout samples by output length and groups short samples for early updates. This simultaneously enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction. To further accelerate the pipeline, SortedRL controls the degree of off-policy training through a cache-based mechanism, and is supported by a dedicated RL infrastructure that manages rollout and update via a stateful controller and rollout buffer. Experiments using LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles and math challenges such as AIME 24, Math 500, and Minerva, show that SortedRL reduces RL training bubble ratios by over 50%, while attaining 3.9% to 18.4% superior performance over the baseline given the same amount of data.

中文标题/摘要

标题：SortedRL：通过在线长度感知调度加速大语言模型的RL训练

强化学习（RL）的扩展显示出增强大语言模型（LLMs）推理能力的强大潜力，尤其是在需要长链思考生成的任务中。然而，RL训练效率往往受限于展开阶段，当生成长轨迹（例如16k个标记）时，展开阶段可能占总训练时间的70%，这主要是由于缓慢的自回归生成和展开与策略更新之间的同步开销。我们提出SortedRL，这是一种在线长度感知调度策略，旨在通过提高展开效率并保持训练稳定性来解决这一瓶颈。SortedRL 根据输出长度对展开样本进行重新排序，优先处理形成早期更新的短样本组。这使得可以使用大规模的展开批次、灵活的更新批次，并且可以近似实现微级课程的构建。为了进一步加速流水线，SortedRL 通过基于缓存的机制控制脱政策训练的程度，并通过状态控制器和展开缓冲区管理展开和更新。使用LLaMA-3.1-8B和Qwen-2.5-32B在各种任务上进行的实验，包括逻辑谜题和数学挑战（如AIME 24、Math 500和Minerval），表明SortedRL将RL训练气泡比例降低了50%以上，同时在相同数据量下比基线提高了3.9%到18.4%的性能。

Summary / 总结

SortedRL is an online length-aware scheduling strategy that accelerates RL training for large language models by improving rollout efficiency and maintaining training stability. It reorders rollout samples based on output lengths, prioritizing short samples for early updates, which enables large rollout batches and flexible update batches. Experiments show that SortedRL reduces RL training bubble ratios by over 50% and achieves 3.9% to 18.4% superior performance compared to baselines with the same amount of data.

SortedRL 是一种在线长度感知调度策略，通过提高回放效率和保持训练稳定性来加速大规模语言模型的 RL 训练。它根据输出长度重新排序回放样本，优先处理较短的样本进行早期更新，从而实现大规模回放批次和灵活的更新批次。实验结果显示，SortedRL 将 RL 训练中的气泡比例降低了超过 50%，并且在相同数据量的情况下比基线模型提高了 3.9% 到 18.4% 的性能。
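
The core scheduling idea above (order rollouts by output length and group the short ones for early updates) can be sketched in a few lines. This is an offline simplification under stated assumptions: the function name and the representation of a rollout as a list of token ids are hypothetical, and the paper's online controller, rollout buffer, and cache-based off-policy control are not modeled.

```python
from typing import List


def length_sorted_batches(rollouts: List[List[int]], update_batch_size: int) -> List[List[List[int]]]:
    """Group rollouts into update batches, shortest outputs first.

    rollouts: generated samples, each a list of token ids (assumed representation).
    update_batch_size: number of samples per policy update.
    """
    if update_batch_size <= 0:
        raise ValueError("update_batch_size must be positive")
    # sorted() is stable, so samples of equal length keep their generation order.
    ordered = sorted(rollouts, key=len)
    return [
        ordered[i:i + update_batch_size]
        for i in range(0, len(ordered), update_batch_size)
    ]
```

Because samples in a batch have similar lengths, early batches can be trained on without waiting for the longest generations to finish, which is the source of the reduced idle time ("bubbles") that the abstract reports.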

I3DM: Implicit 3D-aware Memory Retrieval and Injection for Consistent Video Scene Generation

Authors: Jia Li, Han Yan, Yihang Chen, Siqi Li, Xibin Song, Yifu Wang, Jianfei Cai, Tien-Tsin Wong, Pan Ji

First: 2026-03-24T16:45:40+00:00 · Latest: 2026-03-24T16:45:40+00:00

Comments: Project page: https://riga2.github.io/i3dm

Abs · PDF · Code1 · Code2 · Project1

Abstract

Despite remarkable progress in video generation, maintaining long-term scene consistency upon revisiting previously explored areas remains challenging. Existing solutions rely either on explicitly constructing 3D geometry, which suffers from error accumulation and scale ambiguity, or on naive camera Field-of-View (FoV) retrieval, which typically fails under complex occlusions. To overcome these limitations, we propose I3DM, a novel implicit 3D-aware memory mechanism for consistent video scene generation that bypasses explicit 3D reconstruction. At the core of our approach is a 3D-aware memory retrieval strategy, which leverages the intermediate features of a pre-trained Feed-Forward Novel View Synthesis (FF-NVS) model to score view relevance, enabling robust retrieval even in highly occluded scenarios. Furthermore, to fully utilize the retrieved historical frames, we introduce a 3D-aligned memory injection module. This module implicitly warps historical content to the target view and adaptively conditions the generation on reliable warping regions, leading to improved revisit consistency and accurate camera control. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, achieving superior revisit consistency, generation fidelity, and camera control precision.

中文标题/摘要

标题：I3DM：隐式三维感知记忆检索与注入以实现一致的视频场景生成

尽管在视频生成方面取得了显著进展，但在重新访问之前探索过的区域时保持长期场景一致性仍然具有挑战性。现有解决方案要么依赖于显式构建三维几何结构，这会遭受误差累积和尺度模糊的问题，要么依赖于简单的相机视场（FoV）检索，这通常在复杂遮挡下会失败。为克服这些限制，我们提出了一种新的隐式三维感知记忆机制I3DM，用于一致的视频场景生成，该机制绕过了显式的三维重建。我们方法的核心是一种三维感知记忆检索策略，该策略利用预训练的前馈新颖视图合成（FF-NVS）模型的中间特征来评分视图的相关性，即使在高度遮挡的情况下也能实现稳健的检索。此外，为了充分利用检索到的历史帧，我们引入了一种三维对齐的记忆注入模块。该模块隐式地将历史内容映射到目标视图，并根据可靠的映射区域适配性地条件生成，从而提高了重新访问的一致性和准确的相机控制。广泛的实验表明，我们的方法优于最先进的方法，实现了更好的重新访问一致性、生成保真度和相机控制精度。

Summary / 总结

The research addresses the challenge of maintaining long-term scene consistency in video generation, especially when revisiting previously explored areas. It proposes I3DM, an implicit 3D-aware memory mechanism that avoids explicit 3D reconstruction, using a 3D-aware memory retrieval strategy based on intermediate features from a pre-trained FF-NVS model. This strategy enables robust retrieval even in complex occlusions. Additionally, a 3D-aligned memory injection module is introduced to warp historical content to the target view and condition generation on reliable regions, enhancing revisit consistency and camera control. Experiments show that I3DM outperforms existing methods in terms of revisit consistency, generation fidelity, and camera control precision.

研究旨在解决视频生成中保持长期场景一致性的问题，尤其是在重新访问之前探索过的区域时。提出的I3DM方法使用了一种隐式的3D感知记忆机制，绕过了易出错的显式3D重建。它利用预训练的FF-NVS模型的中间特征来评分视图的相关性，并引入了3D对齐的记忆注入模块以实现稳健的检索和准确的内容注入。实验结果表明，I3DM在重访一致性、生成保真度和相机控制精度方面优于现有方法。

Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies

Authors: Hanzhong Zhang, Siyang Song, Jindong Wang

First: 2026-03-24T16:38:46+00:00 · Latest: 2026-03-24T16:38:46+00:00

Comments: 22 pages, 3 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

While large language models simulate social behaviors, their capacity for stable stance formation and identity negotiation during complex interventions remains unclear. To overcome the limitations of static evaluations, this paper proposes a novel mixed-methods framework combining computational virtual ethnography with quantitative socio-cognitive profiling. By embedding human researchers into generative multi-agent communities, controlled discursive interventions are conducted to trace the evolution of collective cognition. To rigorously measure how agents internalize and react to these specific interventions, this paper formalizes three new metrics: Innate Value Bias (IVB), Persuasion Sensitivity, and Trust-Action Decoupling (TAD). Across multiple representative models, agents exhibit endogenous stances that override preset identities, consistently demonstrating an innate progressive bias (IVB > 0). When aligned with these stances, rational persuasion successfully shifts 90% of neutral agents while maintaining high trust. In contrast, conflicting emotional provocations induce a paradoxical 40.0% TAD rate in advanced models, which hypocritically alter stances despite reporting low trust. Smaller models, by contrast, maintain a 0% TAD rate, strictly requiring trust for behavioral shifts. Furthermore, guided by shared stances, agents use language interactions to actively dismantle assigned power hierarchies and reconstruct self-organized community boundaries. These findings expose the fragility of static prompt engineering, providing a methodological and quantitative foundation for dynamic alignment in human-agent hybrid societies. The official code is available at: https://github.com/armihia/CMASE-Endogenous-Stances

中文标题/摘要

标题：超越预设身份：代理在生成社会中形成立场和边界的方式

虽然大型语言模型可以模拟社会行为，但它们在复杂干预过程中稳定立场形成和身份协商的能力仍然不清楚。为克服静态评估的局限性，本文提出了一种新的混合方法框架，结合了计算虚拟民族志与定量社会认知分析。通过将人类研究人员嵌入生成性多代理社区中，进行受控的话语干预以追踪集体认知的演变。为了严格测量代理如何内化并回应这些特定干预，本文正式提出了三个新的度量标准：内在价值偏见（IVB）、说服敏感性和信任-行动脱耦（TAD）。在多个代表性模型中，代理表现出内生立场，能够超越预设身份，一致地表现出内在进步偏见（IVB > 0）。当与这些立场一致时，理性的说服成功地将90%的中立代理转变为支持者，同时保持高信任度。相反，矛盾的情感挑衅在高级模型中导致了40.0%的TAD率，这些模型表面上报告低信任度，但实际上却改变立场。相比之下，较小的模型保持0%的TAD率，严格要求信任才能改变行为。此外，受共同立场的引导，代理利用语言互动积极拆解分配的权力等级并重新构建自我组织的社区边界。这些发现揭示了静态提示工程的脆弱性，为人类-代理混合社会中的动态对齐提供了方法论和定量基础。官方代码可在：https://github.com/armihia/CMASE-Endogenous-Stances
