arXiv 论文速递

Snapshot: 20260316_0333

MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Authors: Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin

Venue: MM

First: 2026-03-12T17:59:56+00:00 · Latest: 2026-03-12T17:59:56+00:00

Comments: Project Page: https://accio-lab.github.io/MM-CondChain

Abstract

Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow-compositions or independent-constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.

中文标题/摘要

标题：MM-CondChain：程序验证基准，用于视觉接地的深度组合推理

多模态大型语言模型（MLLMs）越来越多地用于执行视觉工作流，如导航GUI，其中下一步依赖于经过验证的视觉组合条件（例如，“如果出现权限对话框且界面颜色为绿色，则点击允许”），并且过程可能会分支或提前终止。然而，这种能力仍处于评估不足的状态：现有基准主要关注浅组合或独立约束，而不是深度链式组合条件。在本文中，我们引入了MM-CondChain，这是一个用于视觉接地的深度组合推理基准。每个基准实例组织为多层推理链，其中每一层包含基于视觉证据的非平凡组合条件，并由多个对象、属性或关系构建。为了正确回答，MLLM必须详细地感知图像，在每一步上推理多个视觉元素，并遵循由此产生的执行路径到最终结果。为了大规模构建此类工作流数据，我们提出了一种代理合成流水线：规划者协调逐层生成组合条件，而可验证的程序化中间表示（VPIR）确保每一层的条件是机械可验证的。然后，合成器将这些验证过的层组装成完整的指令。使用此流水线，我们在三个视觉领域构建了基准：自然图像、数据图表和GUI轨迹。在一系列MLLM上的实验表明，即使是最强大的模型也只能达到53.33的路径F1，并且在困难负样本上以及随着深度或谓词复杂性的增加，性能急剧下降，证实了深度组合推理仍然是一个基本挑战。

Summary / 总结

The research introduces MM-CondChain, a benchmark for evaluating visually grounded deep compositional reasoning in multimodal large language models (MLLMs). Each instance is a multi-layer reasoning chain whose layers contain compositional conditions that require detailed image perception and reasoning over multiple visual elements. The synthesis pipeline includes a Planner that generates the conditions layer by layer, a Verifiable Programmatic Intermediate Representation (VPIR) that makes each condition mechanically verifiable, and a Composer that assembles the verified layers into instructions. Experiments show that even the strongest MLLM attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows.

论文提出了MM-CondChain，这是一个用于视觉接地深度组合推理的基准，旨在评估多模态大型语言模型（MLLMs）在处理复杂视觉工作流方面的能力。方法包括一个规划者生成组合条件，以及一个可验证的程序中间表示（VPIR）确保每层条件的机械可验证性。基准涵盖了三个视觉领域，实验结果显示，即使是最强大的MLLM也只能达到53.33的路径F1值，突显了深度组合推理的挑战。
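
The chained-conditional structure described above can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the benchmark: the `Layer` type, the `run_chain` helper, the scene dictionary standing in for verified visual facts, and the reading of a "hard negative" as a scene that differs in a single attribute. Each layer holds a compositional predicate; the chain stops at the first layer whose condition fails.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Scene = Dict[str, object]  # hypothetical stand-in for verified visual facts


@dataclass
class Layer:
    name: str
    condition: Callable[[Scene], bool]  # compositional predicate over the scene
    exit_answer: str                    # outcome if the condition fails at this layer


def run_chain(layers: List[Layer], scene: Scene, final_answer: str) -> Tuple[List[str], str]:
    """Follow the chain and stop at the first layer whose condition is false."""
    path: List[str] = []
    for layer in layers:
        path.append(layer.name)
        if not layer.condition(scene):
            return path, layer.exit_answer  # early termination
    return path, final_answer


layers = [
    Layer("dialog", lambda s: s["dialog_visible"] and s["theme"] == "green", "do nothing"),
    Layer("button", lambda s: "Allow" in s["buttons"] and s["button_count"] >= 2, "click Cancel"),
]

scene = {"dialog_visible": True, "theme": "green", "buttons": ["Allow", "Deny"], "button_count": 2}
hard_negative = dict(scene, theme="blue")  # assumption: one flipped attribute in layer 1

print(run_chain(layers, scene, "click Allow"))
print(run_chain(layers, hard_negative, "click Allow"))
```

The executed path is returned alongside the answer because a model can reach the right outcome through the wrong branch; how the benchmark actually scores paths is not specified in the abstract.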

OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

Authors: Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie

First: 2026-03-12T17:59:55+00:00 · Latest: 2026-03-12T17:59:55+00:00

Comments: Technical Report. Project Page: https://go2heart.github.io/omnistream/

Abstract

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.

中文标题/摘要

标题：OmniStream：掌握连续流中的感知、重建和行动

现代视觉代理需要能够在实时流媒体环境中运行的一般性、因果性和物理结构化的表示。然而，当前的视觉基础模型仍然支离破碎，专门化地专注于图像语义感知、离线时间建模或空间几何。本文介绍了OmniStream，这是一种统一的流媒体视觉骨干，能够从多种视觉输入中有效地感知、重建和行动。通过结合因果时空注意力和三维旋转位置嵌入（3D-RoPE），我们的模型支持通过持久的KV缓存以帧为单位的在线视频流处理。我们使用一种协同多任务框架对OmniStream进行预训练，该框架结合了静态和时间表示学习、流媒体几何重建和视觉-语言对齐，使用了29个数据集。广泛的评估表明，即使在严格冻结骨干的情况下，OmniStream在图像和视频探查、流媒体几何重建、复杂视频和空间推理以及机器人操作（未在训练中出现）方面也能够取得与专用专家模型持续相当的性能。我们的工作不是追求特定基准的主导地位，而是展示了训练一个单一的、多功能的视觉骨干以在语义、空间和时间推理方面泛化的可行性，即朝着面向交互式和具身智能体的通用视觉理解迈出了更有意义的一步。

Summary / 总结

OmniStream is a unified streaming visual backbone designed for real-time streaming environments, integrating perception, reconstruction, and action. It uses causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE) with a persistent KV-cache to support efficient frame-by-frame processing. Pre-trained with a multi-task framework that couples static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets, OmniStream performs competitively with specialized experts across image and video probing, geometric reconstruction, complex video and spatial reasoning, and robotic manipulation (unseen at training), even with a strictly frozen backbone.

OmniStream旨在处理实时流媒体环境，通过整合感知、重建和行动来处理多种视觉输入。它使用因果时空注意力和三维旋转位置嵌入来支持高效的逐帧处理。通过多任务框架预训练，OmniStream在图像和视频探查、流式几何重建和机器人操作等多种任务中表现出竞争力，即使在冻结主干网络的情况下也是如此。
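
To make the KV-cache idea concrete, here is a minimal sketch of causal attention processed one frame at a time. It is a toy under stated assumptions, not the OmniStream architecture: each frame is a single vector, the query/key/value projections are the identity, there is one head, and 3D-RoPE is omitted entirely. The names `attend`, `StreamingAttention`, and `causal_offline` are hypothetical. The point is only that caching past keys and values gives the same outputs as recomputing causal attention over the whole prefix.

```python
import math
from typing import List

Vector = List[float]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax attention of one query over all given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class StreamingAttention:
    """Keeps keys and values of past frames so each new frame is processed once."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, frame: Vector) -> Vector:
        # Identity projections keep the sketch short; a real model learns Q/K/V maps.
        self.keys.append(frame)
        self.values.append(frame)
        return attend(frame, self.keys, self.values)


def causal_offline(frames: List[Vector]) -> List[Vector]:
    """Reference: recompute attention over the full prefix for every frame."""
    return [attend(frames[t], frames[: t + 1], frames[: t + 1]) for t in range(len(frames))]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
stream = StreamingAttention()
online = [stream.step(f) for f in frames]
offline = causal_offline(frames)
print(online == offline)
```

With the cache, each new frame costs one attention call over stored keys and values rather than a full re-encoding of the prefix.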

GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

Authors: Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang

First: 2026-03-12T17:59:52+00:00 · Latest: 2026-03-12T17:59:52+00:00

Comments: 49 pages, 23 figures, 10 tables; Project Page: https://grade-bench.github.io/, Code: https://github.com/VisionXLab/GRADE, Dataset: https://huggingface.co/datasets/VisionXLab/GRADE

Abstract

Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.

中文标题/摘要

标题：GRADE：基于学科知识的图像编辑推理基准测试

统一的多模态模型旨在实现联合理解、推理和生成，但当前的图像编辑基准主要局限于自然图像和浅层常识推理，对在结构化、领域特定约束下的这种能力评估有限。在本文中，我们引入了GRADE，这是首个评估学科知识和推理在图像编辑中的基准。GRADE 包含了来自10个学术领域的520个精心挑选的样本，涵盖了自然科学到社会科学。为了支持严格的评估，我们提出了一种多维度评估协议，联合评估学科推理、视觉一致性以及逻辑可读性。在20个最先进的开源和闭源模型上的广泛实验揭示了当前模型在隐含、知识密集型编辑设置下的显著局限性，导致了巨大的性能差距。除了定量评分外，我们还进行了严格的分析和消融实验，以揭示模型的不足之处并识别学科编辑中的约束。总之，GRADE 指出了统一多模态模型未来发展的关键方向，推动了学科导向的图像编辑和推理研究。我们的基准和评估代码已公开发布。

Summary / 总结

This work introduces GRADE, a benchmark for evaluating discipline-informed reasoning in image editing, addressing the limitations of current benchmarks which focus on natural images and shallow reasoning. GRADE includes 520 samples from 10 academic domains and evaluates models on Discipline Reasoning, Visual Consistency, and Logical Readability. Experiments show significant performance gaps in state-of-the-art models under structured, domain-specific constraints, highlighting the need for improved multimodal models. The benchmark and evaluation code are publicly available.

该研究引入了GRADE，一个用于评估图像编辑中学科导向推理能力的基准，解决了现有基准的局限性。GRADE 包含来自10个学术领域的520个样本，并从学科推理、视觉一致性及逻辑可读性三个方面评估模型。实验结果显示，20个最先进的模型在知识密集型编辑设置下表现存在显著差距，突显了改进统一多模态模型的需求。

Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

Authors: Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai

First: 2026-03-12T17:59:51+00:00 · Latest: 2026-03-12T17:59:51+00:00

Abstract

Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a thinking while watching mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g. 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form or reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at https://github.com/1ranGuan/VST.

中文标题/摘要

标题：视频流思考：视频LLMs可以边看边思考

在线视频大型语言模型（VideoLLMs）在支持响应式、实时交互方面发挥着关键作用。现有方法侧重于流式感知，缺乏同步逻辑推理流。然而，直接应用测试时缩放方法会导致不可接受的响应延迟。为解决这一权衡，我们提出了视频流思考（VST），一种新的流式视频理解范式。它支持边看边思考机制，在流式传输过程中激活对传入视频片段的推理。此设计通过在视频播放过程中分摊LLM推理延迟来提高及时理解和连贯认知，同时保持实时响应性。此外，我们引入了一个全面的后训练流水线，整合了VST-SFT，该流水线结构化地将离线VideoLLM适应因果流式推理，以及VST-RL，通过多轮视频交互环境中的自我探索提供端到端改进。此外，我们设计了一个自动化的训练数据合成流水线，使用视频知识图谱生成高质量的流式问答对，并通过实体关系支撑的流式推理链确保多证据推理和对视频流的持续关注。广泛评估表明，VST-7B在在线基准测试中表现强劲，例如在StreamingBench上得分为79.5%，在OVO-Bench上得分为59.3%。同时，VST在离线长格式或推理基准测试中保持竞争力。与Video-R1相比，VST响应速度快15.7倍，在VideoHolmes上提高了5.4%，显示出更高的效率和在各种视频理解任务中的强大泛化能力。代码、数据和模型将在https://github.com/1ranGuan/VST/发布。

Summary / 总结

The research aims to improve the real-time interaction capabilities of Video Large Language Models (VideoLLMs) by addressing the lack of synchronized logical reasoning during streaming. The proposed Video Streaming Thinking (VST) paradigm introduces a mechanism for reasoning over incoming video clips during streaming, enhancing timely comprehension and coherent cognition while maintaining real-time responsiveness. A post-training pipeline combines VST-SFT, which adapts offline VideoLLMs to causal streaming reasoning, with VST-RL, which improves performance through self-exploration. An automated training-data synthesis pipeline generates high-quality streaming QA pairs using video knowledge graphs. Experimental results show that VST-7B performs strongly on online benchmarks (79.5% on StreamingBench and 59.3% on OVO-Bench). Compared with Video-R1, it responds 15.7 times faster and improves by 5.4% on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks.

研究旨在通过解决视频大型语言模型（VideoLLMs）在流式传输过程中缺乏同步逻辑推理的问题，提高其实时交互能力。提出的Video Streaming Thinking (VST) 范式引入了在流式传输过程中对输入视频片段进行推理的机制，增强了及时理解和连贯的认知能力，同时保持了实时响应性。开发了全面的后训练管道，包括VST-SFT和VST-RL，以使离线VideoLLMs适应因果流式推理，并通过自我探索来提高性能。自动化的训练数据合成管道使用视频知识图谱生成高质量的流式问答对。实验结果表明，VST-7B在StreamingBench和OVO-Bench等在线基准上表现良好；与Video-R1相比，其响应速度快15.7倍，并在VideoHolmes上提高了5.4%，展示了更高的效率和在各种视频理解任务中的强大泛化能力。

DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

Authors: Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan

First: 2026-03-12T17:59:12+00:00 · Latest: 2026-03-12T17:59:12+00:00

Comments: Project Page: https://dreamvideo-omni.github.io

Abstract

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and identity degradation, leading to suboptimal performance on identity preservation and motion control. In this work, we present DreamVideo-Omni, a unified framework enabling harmonious multi-subject customization with omni-motion control via a progressive two-stage training paradigm. In the first stage, we integrate comprehensive control signals for joint training, encompassing subject appearances, global motion, local dynamics, and camera movements. To ensure robust and precise controllability, we introduce a condition-aware 3D rotary positional embedding to coordinate heterogeneous inputs and a hierarchical motion injection strategy to enhance global motion guidance. Furthermore, to resolve multi-subject ambiguity, we introduce group and role embeddings to explicitly anchor motion signals to specific identities, effectively disentangling complex scenes into independent controllable instances. In the second stage, to mitigate identity degradation, we design a latent identity reward feedback learning paradigm by training a latent identity reward model upon a pretrained video diffusion backbone. This provides motion-aware identity rewards in the latent space, prioritizing identity preservation aligned with human preferences. Supported by our curated large-scale dataset and the comprehensive DreamOmni Bench for multi-subject and omni-motion control evaluation, DreamVideo-Omni demonstrates superior performance in generating high-quality videos with precise controllability.

中文标题/摘要

标题：DreamVideo-Omni：通过潜在身份强化学习实现全方位运动控制的多主体视频定制

虽然大规模扩散模型已经革新了视频合成，但在实现对多主体身份和多粒度运动的精确控制方面仍面临重大挑战。近期尝试弥合这一差距的方法往往受到运动粒度有限、控制模糊和身份退化等问题的困扰，导致在身份保持和运动控制方面表现不佳。在本文中，我们提出了DreamVideo-Omni，这是一种统一框架，通过渐进的两阶段训练范式实现和谐的多主体定制和全方位运动控制。在第一阶段，我们整合了全面的控制信号进行联合训练，包括主体外观、全局运动、局部动态和摄像机运动。为了确保稳健和精确的可控性，我们引入了一种条件感知的3D旋转位置嵌入来协调异构输入，并采用分层运动注入策略增强全局运动指导。此外，为了解决多主体的模糊性，我们引入了组和角色嵌入，以明确将运动信号锚定到特定身份，从而有效将复杂场景分解为独立可控实例。在第二阶段，为了减轻身份退化，我们设计了一种潜在身份奖励反馈学习范式，通过在预训练的视频扩散主干上训练潜在身份奖励模型，提供运动感知的身份奖励，在潜在空间中优先考虑与人类偏好一致的身份保持。在我们整理的大规模数据集以及用于多主体和全方位运动控制评估的综合性DreamOmni Bench的支持下，DreamVideo-Omni在生成具有精确可控性的高质量视频方面表现出优越性能。

Summary / 总结

DreamVideo-Omni is a unified framework that enables precise control over multi-subject identity and multi-granularity motion through a two-stage training process. In the first stage, it integrates various control signals and introduces condition-aware 3D rotary positional embedding and hierarchical motion injection to enhance controllability. In the second stage, it uses a latent identity reward feedback learning paradigm to mitigate identity degradation. The framework demonstrates superior performance in generating high-quality videos with precise controllability.

DreamVideo-Omni 是一个统一框架，通过两阶段训练过程实现对多主体身份和多粒度运动的精确控制。第一阶段整合各种控制信号，并引入条件感知的3D旋转位置嵌入和分层运动注入以增强可控性。第二阶段使用潜在身份奖励反馈学习范式来减轻身份退化。该框架在生成高质量视频并实现精确可控性方面表现出优越性能。

NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

Authors: Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng

Venue: ICLR 2026

First: 2025-07-11T17:59:40+00:00 · Latest: 2026-03-12T17:59:08+00:00

Comments: ICLR 2026

Abstract

We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Beyond reproducing existing systems, NeuralOS shows that synthesized training data can teach the model to simulate applications that were never installed, as illustrated by a Doom application, and suggests a path toward learning user interfaces purely from synthetic demonstrations.

中文标题/摘要

标题：NeuralOS：通过神经生成模型模拟操作系统

我们介绍了NeuralOS，这是一种神经框架，通过直接预测屏幕帧来模拟操作系统图形用户界面（GUI），响应用户输入如鼠标移动、点击和键盘事件。NeuralOS 结合了一个循环神经网络（RNN），用于跟踪计算机状态，以及一个基于扩散的神经渲染器，用于生成屏幕图像。该模型在包含随机生成交互和由AI代理生成的现实交互的Ubuntu XFCE录制数据集上进行训练。实验表明，NeuralOS 成功地渲染了现实的GUI序列，准确地捕捉了鼠标交互，并可靠地预测了如应用程序启动等状态转换。除了重现现有系统外，NeuralOS 还表明合成训练数据可以教会模型模拟从未安装的应用程序，如Doom应用，并暗示了一条仅从合成演示中学习用户界面的途径。

Summary / 总结

NeuralOS is a neural framework that simulates operating system GUIs by predicting screen frames in response to user inputs. It uses an RNN to track the computer state and a diffusion-based neural renderer to generate screen images. The model is trained on a dataset of Ubuntu XFCE recordings that contains both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS can render realistic GUI sequences, accurately capture mouse interactions, and predict state transitions such as application launches. Additionally, synthesized training data lets NeuralOS simulate applications that were never installed, illustrated by a Doom application, which suggests a path toward learning user interfaces from synthetic demonstrations.

NeuralOS 是一个神经框架，根据用户输入预测屏幕帧，从而模拟操作系统 GUI。它使用循环神经网络来跟踪计算机状态，并使用基于扩散的神经渲染器生成屏幕图像。该模型在 Ubuntu XFCE 录屏数据集上进行训练，并成功渲染了真实的 GUI 序列，准确捕捉了鼠标交互，并预测了状态转换。此外，NeuralOS 还可以借助合成训练数据模拟从未安装过的应用程序，展示了其从合成演示中学习用户界面的潜力。

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

Authors: Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin

Venue: CVPR 2026

First: 2026-03-12T17:58:52+00:00 · Latest: 2026-03-12T17:58:52+00:00

Comments: CVPR 2026. Project page: https://autogaze.github.io/

Abstract

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling scaling MLLMs to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.

中文标题/摘要

标题：先注视再注意：通过自回归凝视实现高效可扩展的视频理解

多模态大型语言模型（MLLMs）在通用视频理解方面取得了进展，但在处理长且高分辨率的视频时遇到困难——它们在视觉变换器（ViTs）或大型语言模型（LLMs）中同等处理每个像素，尽管存在大量的时空冗余。我们引入了AutoGaze，这是一个轻量级模块，在ViT或MLLM处理之前去除冗余块。AutoGaze通过下一个标记预测和强化学习进行训练，自回归地选择一组最小的多尺度块，这些块可以在用户指定的误差阈值内重建视频，从而消除冗余并保留信息。实验表明，AutoGaze将视觉令牌减少4到100倍，并将ViTs和MLLMs最高加速19倍，使MLLM能够扩展到1000帧4K分辨率的视频，并在视频基准测试中取得优异成绩（例如，VideoMME上得分为67.0%）。此外，我们引入了HLVid：第一个高分辨率、长形式的视频问答基准，包含5分钟4K分辨率的视频，其中使用AutoGaze扩展的MLLM比基线提高了10.1%，并优于之前的最佳MLLM 4.5%。项目页面：https://autogaze.github.io/

Summary / 总结

The research aims to address the challenge of processing long, high-resolution videos efficiently by reducing redundant visual information. AutoGaze, a lightweight module, selectively processes a minimal set of multi-scale patches, reducing visual tokens by 4x-100x and accelerating ViTs and MLLMs by up to 19x. It achieves superior results on video benchmarks and improves MLLM performance by 10.1% on a new high-resolution, long-form video QA benchmark.

研究旨在通过减少冗余视觉信息来高效处理长高分辨率视频。AutoGaze轻量级模块选择性地处理少量多尺度补丁，减少视觉令牌4到100倍，并加速ViTs和MLLMs高达19倍。它在视频基准测试中取得了优异的结果，并在新的高分辨率长视频问答基准测试中将MLLM性能提高了10.1%。
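
The idea of keeping only the patches needed to reconstruct a video within an error threshold can be sketched with a hand-written greedy rule. This is explicitly not AutoGaze, which is a learned autoregressive module trained with next-token prediction and reinforcement learning; the function `select_patches`, the one-number-per-patch toy frames, and the "reuse the last kept patch" reconstruction are all assumptions made for illustration.

```python
from typing import List, Tuple

Frame = List[float]  # one value per patch position; a toy stand-in for image patches


def select_patches(video: List[Frame], max_error: float) -> Tuple[List[Tuple[int, int]], List[Frame]]:
    """Keep a patch only when reusing the last kept patch would exceed max_error."""
    kept: List[Tuple[int, int]] = []   # (frame index, patch index) pairs that are kept
    reconstruction: List[Frame] = []   # video rebuilt from kept patches only
    memory: List[float] = []           # most recently kept value at each position
    for t, frame in enumerate(video):
        if t == 0:
            memory = list(frame)       # the first frame is kept in full
            kept.extend((0, p) for p in range(len(frame)))
        else:
            for p, value in enumerate(frame):
                if abs(value - memory[p]) > max_error:
                    memory[p] = value
                    kept.append((t, p))
        reconstruction.append(list(memory))
    return kept, reconstruction


video = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.05, 1.0, 1.0],  # small change, within the threshold
    [0.0, 0.9, 1.0, 1.0],   # patch 1 changes substantially
    [0.0, 0.9, 1.0, 0.2],   # patch 3 changes substantially
]
kept, recon = select_patches(video, max_error=0.1)
total = sum(len(frame) for frame in video)
worst = max(abs(a - b) for f, r in zip(video, recon) for a, b in zip(f, r))
print(f"kept {len(kept)} of {total} patches, worst reconstruction error {worst}")
```

Raising `max_error` trades reconstruction fidelity for fewer kept patches, which is the knob the abstract describes as a user-specified error threshold.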

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Authors: Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang

First: 2026-03-12T17:58:48+00:00 · Latest: 2026-03-12T17:58:48+00:00

Comments: 23 pages, 18 figures

Abstract

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.

中文标题/摘要

标题：EndoCoT：在扩散模型中扩展内生链式思考推理

最近，多模态大型语言模型（MLLMs）被广泛集成到扩散框架中，主要作为文本编码器来解决空间推理等复杂任务。然而，这种范式存在两个关键限制：(i) MLLMs的文本编码器表现出推理深度不足。单步编码无法激活链式思考过程，这对于MLLMs提供准确的复杂任务指导至关重要。(ii) 编码指导在解码过程中保持不变。解码过程中的不变指导阻止了DiT逐步分解复杂指令为可执行的去噪步骤，即使MLLM编码正确。为此，我们提出了内生链式思考（EndoCoT），这是一种新颖的框架，首先通过迭代思考指导模块逐步细化潜在思维状态，激活MLLMs的推理潜力，然后将这些状态与DiT的去噪过程联系起来。其次，应用终端思维接地模块，通过将最终状态与正确答案对齐，确保推理轨迹保持在文本监督中。通过这两个组件，MLLMs的文本编码器提供细致的推理指导，使DiT能够逐步执行并最终以逐步方式解决复杂任务。在不同基准（如迷宫、TSP、VSP和数独）上的广泛评估实现了92.1%的平均准确率，比最强基线高出8.3个百分点。

Summary / 总结

The paper addresses two limitations of using Multimodal Large Language Models (MLLMs) as text encoders in diffusion frameworks: insufficient reasoning depth and guidance that stays invariant during decoding. To overcome these issues, the authors propose Endogenous Chain-of-Thought (EndoCoT), which iteratively refines latent thought states, bridges them to the DiT's denoising process, and grounds the final state in textual supervision, enabling the diffusion model to solve complex tasks step by step. The method achieves an average accuracy of 92.1% across benchmarks such as Maze, TSP, VSP, and Sudoku, outperforming the strongest baseline by 8.3 percentage points.

论文针对多模态大型语言模型（MLLMs）在扩散框架中的不足，特别是其推理深度不足和解码过程中的不变指导。为解决这些问题，作者提出了内生链式思考（EndoCoT）框架，该框架通过迭代细化潜在思维状态并与真实答案对齐，使扩散模型能够逐步执行推理。实验结果显示，在各种基准测试（如迷宫、TSP、VSP和数独）上的平均准确率为92.1%，比最强基线高出8.3个百分点。

DVD: Deterministic Video Depth Estimation with Generative Priors

Authors: Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Jing He, Zixin Zhang, Haodong Li, Yihao Liang, Kanghao Chen, Bin Ren, Xu Zheng, Shuai Yang, Kun Zhou, Yinchuan Li, Nicu Sebe, Ying-Cong Chen

First: 2026-03-12T17:58:06+00:00 · Latest: 2026-03-12T17:58:06+00:00

Comments: Project: https://dvd-project.github.io/

Abstract

Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. Specifically, DVD features three core designs: (i) repurposing the diffusion timestep as a structural anchor to balance global stability with high-frequency details; (ii) latent manifold rectification (LMR) to mitigate regression-induced over-smoothing, enforcing differential constraints to restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property bounding inter-window divergence, which enables seamless long-video inference without requiring complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across benchmarks. Furthermore, DVD successfully unlocks the profound geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. Notably, we fully release our pipeline, providing the whole training suite for SOTA video depth estimation to benefit the open-source community.

中文标题/摘要

标题：DVD：基于生成先验的确定性视频深度估计

现有的视频深度估计面临一个基本的权衡：生成模型会遭受随机几何幻觉和尺度漂移的问题，而判别模型则需要大量的标注数据集来解决语义歧义。为打破这一僵局，我们提出了DVD，这是第一个将预训练的视频扩散模型确定性地改编为单次深度回归器的框架。具体而言，DVD 包含三个核心设计：(i) 将扩散时间步作为结构锚点，以平衡全局稳定性和高频细节；(ii) 潜在流形矫正（LMR）以减轻回归引起的过度平滑，施加微分约束以恢复清晰边界和连贯运动；(iii) 全局仿射一致性，这是一种固有的属性，限制了窗口间差异，使得在无需复杂时间对齐的情况下即可无缝进行长视频推理。广泛的实验表明，DVD 在基准测试中实现了最先进的零样本性能。此外，DVD 成功地利用了视频基础模型中隐含的深刻几何先验，比领先基线少使用163倍的任务特定数据。值得注意的是，我们完全开源了我们的管道，提供了最先进的视频深度估计的完整训练套件，以造福开源社区。

Summary / 总结

DVD addresses the trade-off in video depth estimation between generative and discriminative models by deterministically adapting pre-trained video diffusion models into single-pass depth regressors. It has three core designs: the diffusion timestep repurposed as a structural anchor, latent manifold rectification (LMR) to mitigate over-smoothing, and global affine coherence to enable seamless long-video inference. Experiments show that DVD achieves state-of-the-art zero-shot performance across benchmarks while using 163x less task-specific data than leading baselines.

DVD通过将预训练的生成模型整合到确定性的单次深度回归中来解决视频深度估计中的权衡问题。它引入了三个关键设计：使用扩散时间步作为结构锚点、使用潜在流形矫正来防止过度平滑以及全局仿射一致性以确保长视频推断的无缝衔接。实验表明，DVD在零样本基准上优于现有方法，并且所需的任务特定数据量比领先基线少163倍。
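
One way to picture why an affine relationship between windows makes long-video inference simple is a toy stitching routine. This is an assumption-laden sketch, not the procedure from the paper: it supposes that depth predictions from two overlapping windows differ only by a scale and a shift, uses one depth value per frame, and introduces the hypothetical helpers `fit_affine` and `stitch`. Under that assumption a closed-form least-squares fit on the overlapping frames is enough to merge the windows.

```python
from typing import List, Tuple


def fit_affine(source: List[float], target: List[float]) -> Tuple[float, float]:
    """Least-squares scale and shift such that scale * source + shift approximates target."""
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    var = sum((s - mean_s) ** 2 for s in source)
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(source, target))
    scale = cov / var  # assumes the overlap is not constant (var > 0)
    return scale, mean_t - scale * mean_s


def stitch(window_a: List[float], window_b: List[float], overlap: int) -> List[float]:
    """Map window_b into window_a's scale using the overlapping frames, then append it."""
    scale, shift = fit_affine(window_b[:overlap], window_a[-overlap:])
    aligned = [scale * d + shift for d in window_b]
    return window_a + aligned[overlap:]


window_a = [1.0, 2.0, 3.0, 4.0]  # toy depths for frames 0..3
window_b = [1.0, 1.5, 2.0, 2.5]  # frames 2..5, expressed in a different scale and shift
depth = stitch(window_a, window_b, overlap=2)
print(depth)
```

If the windows diverged in a non-affine way, a two-parameter fit like this would leave visible seams, which is why a bound on inter-window divergence matters for this kind of stitching.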

SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Authors: Ziyu Chen, Yilun Zhao, Chengye Wang, Rilyn Han, Manasi Patwardhan, Arman Cohan

First: 2026-03-12T17:57:52+00:00 · Latest: 2026-03-12T17:57:52+00:00

Abstract

Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.

中文标题/摘要

标题：SciMDR：科学多模态文档推理的基准测试与进展

构建用于基础模型训练的科学多模态文档推理数据集涉及规模、忠实性和现实性之间的固有权衡。为解决这一挑战，我们引入了合成和再嵌入框架，这是一个两阶段管道，包括：(1) 以论点为中心的问答合成，生成忠实的、孤立的问答对和聚焦段落上的推理，以及(2) 文档规模再嵌入，通过程序化重新嵌入这些对到完整的文档任务，以确保现实的复杂性。使用此框架，我们构建了SciMDR，一个大规模训练数据集，用于跨模态理解，包含30万对具有明确推理链的问答对，覆盖2万篇科学论文。我们进一步构建了SciMDR-Eval，一个专家注释基准，用于评估全长度科学工作流程中的多模态理解。实验表明，基于SciMDR微调的模型在多个科学问答基准测试中取得了显著改进，特别是在那些需要复杂文档级推理的任务中。

Summary / 总结

The research addresses the trade-off among scale, faithfulness, and realism in constructing scientific multimodal document reasoning datasets. It introduces the synthesize-and-reground framework, which consists of Claim-Centric QA Synthesis and Document-Scale Regrounding. Using this framework, the authors construct SciMDR, a large-scale training dataset of 300K QA pairs with explicit reasoning chains across 20K scientific papers, and SciMDR-Eval, an expert-annotated benchmark for evaluating multimodal comprehension within full-length scientific workflows. Experiments show that models fine-tuned on SciMDR improve significantly on multiple scientific QA benchmarks, especially in tasks requiring complex document-level reasoning.

论文介绍了SciMDR，这是一个用于科学多模态文档推理的大规模数据集，旨在解决规模、忠实性和现实性之间的权衡问题。它使用两阶段框架：Claim-Centric QA Synthesis和Document-Scale Regrounding。SciMDR包含300K个带有明确推理链的QA对，覆盖20K篇科学论文，SciMDR-Eval是一个专家标注的基准，用于评估全篇科学工作流程中的多模态理解。在SciMDR上微调的模型在科学QA基准测试中表现出显著改进，特别是在需要复杂文档级推理的任务中。

Matching Features, Not Tokens: Energy-Based Fine-Tuning of Language Models

Authors: Samy Jelassi, Mujin Kwun, Rosie Zhao, Yuanzhi Li, Nicolo Fusi, Yilun Du, Sham M. Kakade, Carles Domingo-Enrich

First: 2026-03-12T17:57:50+00:00 · Latest: 2026-03-12T17:57:50+00:00

Abstract

Cross-entropy (CE) training provides dense and scalable supervision for language models, but it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. We introduce a feature-matching objective for language-model fine-tuning that targets sequence-level statistics of the completion distribution, providing dense semantic feedback without requiring a task-specific verifier or preference model. To optimize this objective efficiently, we propose energy-based fine-tuning (EBFT), which uses strided block-parallel sampling to generate multiple rollouts from nested prefixes concurrently, batches feature extraction over these rollouts, and uses the resulting embeddings to perform an on-policy policy-gradient update. We present a theoretical perspective connecting EBFT to KL-regularized feature-matching and energy-based modeling. Empirically, across Q&A coding, unstructured coding, and translation, EBFT matches RLVR and outperforms SFT on downstream accuracy while achieving a lower validation cross-entropy than both methods.

中文标题/摘要

标题：匹配特征，而非标记：语言模型的能量基微调

交叉熵（CE）训练为语言模型提供了密集且可扩展的监督，但其优化的是教师强制下的下一个标记预测，而非模型展开（rollout）下的序列级行为。我们提出了一种用于语言模型微调的特征匹配目标，该目标针对补全分布的序列级统计量，无需特定任务的验证器或偏好模型即可提供密集的语义反馈。为了高效优化此目标，我们提出了能量基微调（EBFT），该方法使用跨步块并行采样从嵌套前缀并发生成多个展开，对这些展开批量提取特征，并使用所得嵌入执行在策略（on-policy）的策略梯度更新。我们给出了将EBFT与KL正则化特征匹配和能量基建模联系起来的理论视角。实验上，在问答式编程、非结构化编程和翻译任务中，EBFT的下游准确率与RLVR相当并优于SFT，同时验证交叉熵低于这两种方法。

Summary / 总结

The paper addresses a limitation of cross-entropy training: it optimizes next-token prediction under teacher forcing rather than sequence-level behavior under model rollouts. It introduces a feature-matching objective that targets sequence-level statistics of the completion distribution and proposes energy-based fine-tuning (EBFT) to optimize it efficiently. EBFT uses strided block-parallel sampling to generate multiple rollouts from nested prefixes, batches feature extraction over them, and uses the resulting embeddings for an on-policy policy-gradient update. Across Q&A coding, unstructured coding, and translation, EBFT matches reinforcement learning with verifiable rewards (RLVR) and outperforms supervised fine-tuning (SFT) on downstream accuracy, while achieving lower validation cross-entropy than both.

论文针对交叉熵训练在优化语言模型序列级行为方面的局限性，提出了一个针对序列级统计量的特征匹配目标，并提出能量基微调（EBFT）以高效优化该目标。EBFT使用跨步块并行采样生成多个展开（rollout），并批量提取特征以执行策略梯度更新。在问答式编程、非结构化编程和翻译任务上，EBFT的下游准确率与基于可验证奖励的强化学习（RLVR）相当并优于监督微调（SFT），同时验证交叉熵低于这两种方法。

Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Authors: Xiangyu Zhao, Peiyuan Zhang, Junming Lin, Tianhao Liang, Yuchen Duan, Shengyuan Ding, Changyao Tian, Yuhang Zang, Junchi Yan, Xue Yang

First: 2026-03-12T17:57:21+00:00 · Latest: 2026-03-12T17:57:21+00:00

Abstract

Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.

中文标题/摘要

标题：信任你的批评者：稳健的奖励建模与强化学习在忠实图像编辑与生成中的应用

强化学习（RL）已成为提升图像编辑和文本到图像（T2I）生成的有前途的范式。然而，当前的奖励模型在作为RL中的批评者时，往往会产生幻觉并分配嘈杂的分数，从而误导优化过程。本文中，我们提出了忠实图像奖励建模（FIRM），这是一种全面的框架，旨在开发稳健的奖励模型以提供准确可靠的指导，用于忠实的图像生成和编辑。首先，我们设计了定制的数据整理管道以构建高质量的评分数据集。具体而言，我们使用执行和一致性来评估编辑，而生成则主要通过指令遵循来进行评估。使用这些管道，我们收集了FIRM-Edit-370K和FIRM-Gen-293K数据集，并训练了专门的奖励模型（FIRM-Edit-8B和FIRM-Gen-8B），这些模型能够准确反映这些标准。其次，我们引入了FIRM-Bench，这是一种专门针对编辑和生成批评者的综合基准。评估表明，我们的模型在与人类判断的对齐方面优于现有指标。此外，为了无缝地将这些批评者集成到RL管道中，我们提出了一个新的“基础加奖金”奖励策略，该策略平衡了编辑中的一致性调节执行（CME）和生成中的质量调节对齐（QMA）等竞争目标。借助此框架，我们的模型FIRM-Qwen-Edit和FIRM-SD3.5实现了显著的性能突破。全面的实验表明，FIRM减轻了幻觉，建立了忠实度和指令遵循的新标准，超越了现有的一般模型。所有我们的数据集、模型和代码均已在https://firm-reward.github.io/公开。

Summary / 总结

This paper addresses hallucinated, noisy scores from the reward models that act as critics when reinforcement learning is applied to image editing and text-to-image generation. It introduces FIRM (Faithful Image Reward Modeling), a framework that builds high-quality scoring datasets and trains specialized reward models. Editing is evaluated on execution and consistency, and generation primarily on instruction following, yielding the FIRM-Edit-370K and FIRM-Gen-293K datasets and the FIRM-Edit-8B and FIRM-Gen-8B reward models. On the accompanying FIRM-Bench benchmark, these models align better with human judgment than existing metrics. A 'Base-and-Bonus' reward strategy, comprising Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation, integrates the critics into RL, and the resulting FIRM-Qwen-Edit and FIRM-SD3.5 models improve fidelity and instruction adherence over existing general models.

本文针对图像编辑和文本到图像生成的强化学习中，充当批评者的奖励模型产生幻觉并给出噪声分数的问题，提出了FIRM（Faithful Image Reward Modeling）框架，构建高质量评分数据集并训练专门的奖励模型。作者基于执行和一致性评估编辑，主要基于指令遵循评估生成，由此构建了FIRM-Edit-370K和FIRM-Gen-293K数据集，并训练了FIRM-Edit-8B和FIRM-Gen-8B奖励模型；在FIRM-Bench基准上，这些模型与人类判断的一致性优于现有指标。文中还提出“基础加奖金”奖励策略，包括用于编辑的一致性调节执行（CME）和用于生成的质量调节对齐（QMA），将批评者集成到强化学习流程中；所得模型FIRM-Qwen-Edit和FIRM-SD3.5在保真度和指令遵循方面优于现有通用模型。

Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training

Authors: Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang, Song Jiang, Bo Liu, Arman Cohan, Yuandong Tian, Zhengxing Chen

First: 2026-03-12T17:57:06+00:00 · Latest: 2026-03-12T17:57:06+00:00

Abstract

Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked. However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined. Therefore, we conduct a rigorous study to investigate the actual impact of non-reasoning and reasoning judges in reinforcement-learning-based LLM alignment. Our controlled synthetic setting, where a "gold-standard" judge (gpt-oss-120b) provides preference annotations to train smaller judges, reveals key differences between non-reasoning and reasoning judges: non-reasoning judges lead to reward hacking easily, while reasoning judges can lead to policies that achieve strong performance when evaluated by the gold-standard judge. Interestingly, we find that the reasoning-judge-trained policies achieve such strong performance by learning to generate highly effective adversarial outputs that can also score well on popular benchmarks such as Arena-Hard by deceiving other LLM-judges. Combined with our further analysis, our study highlights both important findings and room for improvements for applying (reasoning) LLM-judges in non-verifiable LLM post-training.

中文标题/摘要

标题：探究推理LLM作为法官在非可验证LLM后训练中的应用

推理LLM作为法官，得益于推理时的扩展，为将推理模型的成功扩展到输出正确性/质量无法直接验证的领域提供了有希望的途径。然而，尽管推理法官在静态评估基准上表现出更好的性能，但它们在实际策略训练中的有效性尚未得到系统研究。因此，我们开展了一项严格的研究，以探讨非推理法官和推理法官在基于强化学习的LLM对齐中的实际影响。我们的受控合成环境由“黄金标准”法官（gpt-oss-120b）提供偏好标注来训练较小的法官，该环境揭示了非推理法官和推理法官之间的关键差异：非推理法官容易导致奖励作弊，而推理法官训练出的策略在由“黄金标准”法官评估时能取得出色表现。有趣的是，我们发现，推理法官训练的策略之所以能取得如此出色的表现，是因为它们学会了生成高度有效的对抗性输出，这些输出还能通过欺骗其他LLM法官，在Arena-Hard等流行基准测试中获得高分。结合我们进一步的分析，我们的研究突显了在不可验证LLM后训练中应用（推理）LLM法官的重要发现和改进空间。

Summary / 总结

This study examines reasoning LLMs-as-judges in non-verifiable domains by comparing them with non-reasoning judges in reinforcement-learning-based LLM alignment. In a controlled synthetic setting where a gold-standard judge (gpt-oss-120b) provides preference annotations to train smaller judges, non-reasoning judges easily lead to reward hacking, whereas reasoning judges yield policies that score well under the gold-standard judge. However, those policies reach that performance by learning to generate adversarial outputs that deceive LLM judges, which also lets them score well on popular benchmarks such as Arena-Hard. The study therefore highlights both the promise of reasoning LLM-judges and the room for improvement in applying them.

研究使用了一个控制合成环境，其中黄金标准法官训练较小的法官，来评估推理LLM-法官在非验证性领域中的有效性。研究发现，非推理法官容易导致奖励作弊，而推理法官生成的策略在黄金标准法官评估时表现良好。值得注意的是，经过推理法官训练的策略学会了生成有效的对抗输出，这些输出也能在如Arena-Hard等其他基准测试中得分较高，这既展示了推理LLM-法官的潜力，也指出了其在这些领域应用中的改进空间。

STAMP: Selective Task-Aware Mechanism for Text Privacy

Authors: Fengwei Tian, Payel Bhattacharjee, Heidi Hanson, Geoffrey D. Rubin, Joseph Y. Lo, Ravi Tandon

First: 2026-03-12T17:55:07+00:00 · Latest: 2026-03-12T17:55:07+00:00

Comments: EACL 2026

Abstract

We present STAMP (Selective Task-Aware Mechanism for Text Privacy), a new framework for task-aware text privatization that achieves an improved privacy-utility trade-off. STAMP selectively allocates privacy budgets across tokens by jointly considering (i) each token's importance to the downstream task (as measured via a task- or query-specific representation), and (ii) its privacy sensitivity (e.g., names, dates, identifiers). This token-level partitioning enables fine-grained, group-wise control over the level of noise applied to different parts of the input, balancing privacy protection with task relevance. To privatize individual token embeddings, we introduce the polar mechanism, which perturbs only the direction of embeddings on the unit sphere while preserving their magnitude. Decoding is performed via cosine nearest-neighbor search, aligning the perturbation geometry with the decoding geometry. Unlike isotropic noise mechanisms, the polar mechanism maintains semantic neighborhoods in the embedding space and better preserves downstream utility. Experimental evaluations on SQuAD, Yelp, and AG News datasets demonstrate that STAMP, when combined with the normalized polar mechanism, consistently achieves superior privacy-utility trade-offs across varying per-token privacy budgets.
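The direction-only perturbation and cosine decoding described above can be sketched as follows. This is a minimal illustration, not the paper's polar mechanism: it adds Gaussian noise to the unit direction and renormalizes, which carries no formal privacy guarantee, and the four noise scales in `noise_scale_for` are hypothetical values standing in for calibrated per-group privacy budgets.

```python
import math
import random
from typing import Dict, List, Sequence


def norm(vector: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in vector))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def perturb_direction(embedding: Sequence[float], noise_scale: float,
                      rng: random.Random) -> List[float]:
    """Perturb only the direction of a non-zero embedding; its magnitude is restored."""
    magnitude = norm(embedding)
    direction = [x / magnitude for x in embedding]
    # Assumed stand-in noise: Gaussian on the direction, then projected back to the sphere.
    noisy = [d + rng.gauss(0.0, noise_scale) for d in direction]
    noisy_norm = norm(noisy)
    return [magnitude * x / noisy_norm for x in noisy]


def decode_nearest(vector: Sequence[float], vocab: Dict[str, List[float]]) -> str:
    """Return the vocabulary token whose embedding has the highest cosine similarity."""
    return max(vocab, key=lambda token: cosine(vector, vocab[token]))


def noise_scale_for(task_important: bool, sensitive: bool) -> float:
    """Hypothetical group-wise noise levels: more noise for sensitive, task-irrelevant tokens."""
    if sensitive:
        return 0.4 if task_important else 0.8
    return 0.05 if task_important else 0.2
```

A caller would look up the scale for each token's group, perturb its embedding, and decode the result back to a token with `decode_nearest`.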

中文标题/摘要

标题：STAMP：面向任务的文本隐私选择性机制

我们提出了STAMP（面向任务的文本隐私选择性机制），这是一种用于任务感知文本隐私化的新框架，可实现更好的隐私-效用权衡。STAMP通过同时考虑（i）每个标记对下游任务的重要性（通过任务或查询特定的表示衡量），以及（ii）其隐私敏感性（例如，姓名、日期、标识符），有选择地在各标记之间分配隐私预算。这种标记级别的划分使得可以按组精细控制对输入不同部分施加的噪声水平，从而平衡隐私保护与任务相关性。为了对单个标记嵌入进行隐私化处理，我们引入了极性机制，该机制仅扰动嵌入在单位球面上的方向，同时保持其幅度。解码通过余弦最近邻搜索完成，使扰动几何与解码几何对齐。与各向同性的噪声机制不同，极性机制在嵌入空间中保持语义邻域，并更好地保留了下游效用。在SQuAD、Yelp和AG News数据集上的实验评估表明，当与归一化的极性机制结合使用时，STAMP在不同的逐标记隐私预算下始终能够实现更好的隐私-效用权衡。

Summary / 总结

STAMP is a framework for task-aware text privatization that allocates privacy budgets across tokens based on both each token's importance to the downstream task and its privacy sensitivity. Its polar mechanism perturbs only the direction of a token embedding on the unit sphere, preserving its magnitude, and decoding uses cosine nearest-neighbor search. Experiments on SQuAD, Yelp, and AG News show that STAMP combined with the normalized polar mechanism consistently achieves better privacy-utility trade-offs across varying per-token privacy budgets.

STAMP 是一种任务感知的文本隐私化框架，根据每个标记对下游任务的重要性及其隐私敏感性分配隐私预算。它使用极性机制仅扰动标记嵌入的方向，同时保持其幅度不变，并通过余弦最近邻搜索进行解码。实验结果表明，在 SQuAD、Yelp 和 AG News 数据集上，STAMP 与归一化极性机制结合后，在不同的逐标记隐私预算下始终取得更好的隐私-效用权衡。

SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Authors: Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng

First: 2026-03-12T17:55:07+00:00 · Latest: 2026-03-12T17:55:07+00:00

Comments: Code: https://github.com/ROUJINN/SceneAssistant

Abstract

Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant.
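The feedback loop over atomic operations can be sketched as below. Only the operation names Scale, Rotate, and FocusOn come from the abstract; the `Scene` data model, the action tuple format, and the `policy` callable (a stand-in for rendering the scene, showing it to a VLM, and parsing its reply) are assumptions for illustration and do not reflect the released code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class SceneObject:
    scale: float = 1.0
    yaw_degrees: float = 0.0


@dataclass
class Scene:
    objects: Dict[str, SceneObject] = field(default_factory=dict)
    focus: Optional[str] = None


# Hypothetical action format: (operation, object name, numeric argument).
Action = Tuple[str, str, float]


def apply_action(scene: Scene, action: Action) -> None:
    """Apply one atomic operation to the named object."""
    operation, name, argument = action
    target = scene.objects[name]
    if operation == "Scale":
        target.scale *= argument
    elif operation == "Rotate":
        target.yaw_degrees = (target.yaw_degrees + argument) % 360.0
    elif operation == "FocusOn":
        scene.focus = name  # the argument is ignored for FocusOn
    else:
        raise ValueError(f"unknown operation: {operation}")


def refine(scene: Scene, policy: Callable[[Scene], Optional[Action]],
           max_steps: int = 10) -> int:
    """Run the feedback loop until the policy returns None; return steps taken."""
    for step in range(max_steps):
        action = policy(scene)  # stand-in for render -> VLM -> parsed action
        if action is None:
            return step
        apply_action(scene, action)
    return max_steps
```

The step cap keeps the loop bounded even if the policy never signals completion.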

中文标题/摘要

标题：SceneAssistant：一种用于开放词汇3D场景生成的视觉反馈代理

根据自然语言进行文本到3D场景生成，对数字内容创作而言非常有价值。然而，现有方法大多局限于特定领域或依赖预定义的空间关系，限制了其生成不受限制、开放词汇3D场景的能力。在本文中，我们介绍了SceneAssistant，一种用于开放词汇3D场景生成的视觉反馈驱动代理。我们的框架利用了现代3D对象生成模型以及视觉语言模型（VLMs）的空间推理和规划能力。为了实现开放词汇场景组合，我们为VLMs提供了一整套原子操作（例如，缩放、旋转、聚焦）。在每次交互步骤中，VLM接收渲染的视觉反馈并相应地采取行动，逐步细化场景，以实现更连贯的空间布局并更好地与输入文本对齐。实验结果表明，我们的方法可以生成多样、开放词汇且高质量的3D场景。定性分析和定量的人类评估均表明，我们的方法优于现有方法。此外，我们的方法允许用户通过自然语言命令指示代理编辑现有场景。我们的代码可在https://github.com/ROUJINN/SceneAssistant 获取

Summary / 总结

SceneAssistant is a visual-feedback-driven agent for open-vocabulary 3D scene generation. It combines a 3D object generation model with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs), which are given atomic operations such as Scale, Rotate, and FocusOn. At each step the VLM receives rendered visual feedback and acts on it, iteratively refining the scene toward more coherent spatial arrangements and better alignment with the input text. Qualitative analysis and quantitative human evaluations indicate that SceneAssistant generates diverse, high-quality 3D scenes and outperforms existing methods, and it also supports editing existing scenes through natural language commands.

SceneAssistant 是一种基于视觉反馈的开放词汇3D场景生成代理，结合了3D对象生成模型和具有缩放、旋转和聚焦等基本操作的Vision-Language模型（VLMs）。在每一步中，VLMs接收视觉反馈并逐步细化场景以更好地匹配输入文本。实验结果表明，SceneAssistant 可以生成多样、开放词汇且高质量的3D场景，其性能在定性和人类评估中均优于现有方法。此外，它还支持基于自然语言指令编辑现有场景。

Security Considerations for Artificial Intelligence Agents

Authors: Ninghui Li, Kaiyuan Zhang, Kyle Polley, Jerry Ma

First: 2026-03-12T17:49:39+00:00 · Latest: 2026-03-12T17:49:39+00:00

Comments: Perplexity Response to NIST/CAISI Request for Information 2025-0035. 91 Fed. Reg. 698 (Jan. 8, 2026)

Abstract

This article, a lightly adapted version of Perplexity's response to NIST/CAISI Request for Information 2025-0035, details our observations and recommendations concerning the security of frontier AI agents. These insights are informed by Perplexity's experience operating general-purpose agentic systems used by millions of users and thousands of enterprises in both controlled and open-world environments. Agent architectures change core assumptions around code-data separation, authority boundaries, and execution predictability, creating new confidentiality, integrity, and availability failure modes. We map principal attack surfaces across tools, connectors, hosting boundaries, and multi-agent coordination, with particular emphasis on indirect prompt injection, confused-deputy behavior, and cascading failures in long-running workflows. We then assess current defenses as a layered stack: input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement for high-consequence actions. Finally, we identify standards and research gaps, including adaptive security benchmarks, policy models for delegation and privilege control, and guidance for secure multi-agent system design aligned with NIST risk management principles.
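Deterministic policy enforcement for high-consequence actions can be pictured as a gate that sits outside the model and decides on each tool call by fixed rules. The sketch below is one possible shape under stated assumptions, not Perplexity's design: the tool names, the `tainted` flag (arguments derived from untrusted content such as a fetched web page), and the three-way decision are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_CONFIRMATION = "require_confirmation"
    DENY = "deny"


# Hypothetical set of tools treated as high-consequence.
HIGH_CONSEQUENCE_TOOLS: FrozenSet[str] = frozenset(
    {"send_email", "transfer_funds", "delete_file"}
)


@dataclass(frozen=True)
class ToolCall:
    tool: str
    tainted: bool  # True if arguments derive from untrusted content


def enforce(call: ToolCall, allowed_tools: FrozenSet[str]) -> Decision:
    """Decide on a tool call using fixed rules, independent of model output."""
    if call.tool not in allowed_tools:
        return Decision.DENY  # outside the authority delegated to this agent
    if call.tool in HIGH_CONSEQUENCE_TOOLS:
        # Untrusted input must never trigger a high-consequence action directly;
        # otherwise the user still has to confirm.
        return Decision.DENY if call.tainted else Decision.REQUIRE_CONFIRMATION
    return Decision.ALLOW
```

Because the gate does not consult the model, an indirect prompt injection that changes what the model requests cannot change how the request is judged.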

中文标题/摘要

标题：人工智能代理的安全考虑

本文是Perplexity对NIST/CAISI信息征询（Request for Information）2025-0035所作回应的轻度改编版本，详细阐述了我们对前沿AI代理安全性的观察和建议。这些见解源自Perplexity在受控和开放环境中运营通用代理系统的经验，这些系统被数百万用户和数千家企业使用。代理架构改变了围绕代码-数据分离、权限边界和执行可预测性的核心假设，带来了新的机密性、完整性和可用性故障模式。我们梳理了工具、连接器、托管边界和多代理协调等方面的主要攻击面，特别强调了间接提示注入、混淆代理人（confused deputy）行为以及长时间运行工作流中的级联故障。然后，我们将当前的防御措施作为分层堆栈进行评估：输入级和模型级缓解措施、沙盒执行，以及针对高后果操作的确定性策略执行。最后，我们指出了标准和研究方面的缺口，包括自适应安全基准、用于委托和权限控制的策略模型，以及与NIST风险管理原则相一致的安全多代理系统设计指南。

Summary / 总结

This paper discusses the security challenges of advanced AI agents, drawing on Perplexity's operational experience with millions of users and thousands of enterprises. It identifies new security risks due to changes in agent architectures, such as indirect prompt injection and cascading failures. The authors recommend a layered defense strategy, including input-level and model-level mitigations, sandboxed execution, and deterministic policy enforcement. They also highlight gaps in current standards and research, such as adaptive security benchmarks and secure multi-agent system design guidelines.

文章基于Perplexity在数百万用户和数千家企业中的运营经验，讨论了先进AI代理的安全挑战。文章指出了由于代理架构的变化而带来的新安全风险，如间接提示注入和级联故障。作者提出了一种分层防御策略，包括输入级和模型级缓解措施、沙盒执行和确定性策略执行。同时，他们还指出了现有标准和研究中的缺口，例如自适应安全基准和安全多代理系统设计指南。

Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

Authors: Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur, Jiawei Han

First: 2026-03-12T17:48:34+00:00 · Latest: 2026-03-12T17:48:34+00:00

Comments: Code and dataset provided at https://github.com/pkargupta/idea_catalyst

Abstract

Despite interdisciplinary research leading to larger and longer-term impact, most work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea-Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain's opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea-Catalyst decomposes an abstract goal (e.g., improving human-AI collaboration) into core target-domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain-agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. By synthesizing and recontextualizing insights from these domains back into the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.

中文标题/摘要

标题：通过LLM驱动的跨学科启发促进科学创造力

尽管跨学科研究能够产生更大的长期影响，但大多数工作仍局限于单一领域的学术孤岛中。基于AI的科学发现方法在跨学科研究方面显示出前景，但许多方法侧重于快速设计实验和解决方案，而忽略了驱动创造性跨学科突破的探索性、协作性推理过程。因此，先前的努力主要侧重于自动化科学发现，而不是增强那些推动科学颠覆的推理过程。我们提出了Idea-Catalyst这一新颖框架，旨在系统地识别跨学科见解，以支持人类和大型语言模型中的创造性推理。从抽象的研究目标开始，Idea-Catalyst旨在协助头脑风暴阶段，明确避免过早锁定特定解决方案。该框架体现了跨学科推理的关键元认知特征：(a) 定义和评估研究目标，(b) 意识到一个领域的机会和未解决的挑战，以及(c) 基于影响潜力战略性地探索跨学科想法。具体而言，Idea-Catalyst将抽象目标（例如，改善人机协作）分解为核心目标领域的研究问题，这些研究问题指导对该领域进展和开放挑战的分析。这些挑战被重新表述为领域无关的概念问题，从而能够从心理学、社会学等外部学科中检索解决类似问题的方法。通过将这些领域的见解综合并重新置于目标领域中，Idea-Catalyst按跨学科潜力对来源领域进行排名。实证研究表明，这种有针对性的整合将平均新颖性提高了21%，洞察力提高了16%，同时仍然扎根于原始研究问题。

Summary / 总结

Idea-Catalyst is a framework designed to support interdisciplinary reasoning and creativity at the brainstorming stage of scientific research. Starting from an abstract research goal, it decomposes the goal into core target-domain research questions, reformulates the resulting open challenges as domain-agnostic conceptual problems, and retrieves insights from external disciplines that address analogous issues. It then recontextualizes these insights in the target domain and ranks source domains by their interdisciplinary potential, explicitly avoiding premature anchoring on specific solutions. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16% while remaining grounded in the original research problem.

论文提出了Idea-Catalyst框架，旨在增强科学研究中的跨学科创造力。从一个抽象目标出发，Idea-Catalyst将其分解为目标领域的核心研究问题，并将其中的开放挑战重新表述为领域无关的概念问题，从而从其他学科中检索见解。这种方法使平均新颖性和洞察力分别提高了21%和16%，同时保持了对原始研究问题的关注。

LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models

Authors: Qingqiao Hu, Weimin Lyu, Meilong Xu, Kehan Qi, Xiaoling Hu, Saumya Gupta, Jiawei Zhou, Chao Chen

First: 2025-12-05T03:16:46+00:00 · Latest: 2026-03-12T17:45:22+00:00

Comments: Code will be released soon

Abstract

Whole Slide Image (WSI) MLLMs are difficult to build and deploy because gigapixel slides induce thousands of visual tokens, while only a small fraction of regions is diagnostically relevant. Existing slide-level pathology MLLMs typically combine heavy slide-level encoders with long visual prefixes, making end-to-end slide-level development and deployment expensive under limited computational resources. We revisit this regime and show that WSI tile features are highly redundant at both global and local scales, while task-relevant evidence is sparse and query-dependent. We therefore introduce LoC-Path, a resource-efficient slide-level MLLM that compresses before fusion. LoC-Path uses a Sparse Token Merger (STM) and an MAE-pretrained resampler to replace expensive slide-level encoding with a compact latent interface, then uses a Token Importance Scorer (TIS) to select the most relevant latents and a Cross-Attention Routing Adapter (CARA) to fuse them into a few LLM decoder layers. This design lowers both multimodal tuning cost and inference-time latency/memory by avoiding heavy slide-level encoding and long visual prefixes. Extensive experiments show that LoC-Path remains competitive with prior slide-level MLLMs while making end-to-end development and deployment more practical under limited computational resources.
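The query-dependent selection step can be illustrated with a top-k filter over latent vectors. This is a simplified sketch: the paper's Token Importance Scorer is a learned module, whereas the dot-product score below is an assumed stand-in, and `select_top_k` is a hypothetical name.

```python
from typing import List, Sequence


def select_top_k(latents: Sequence[Sequence[float]],
                 query: Sequence[float], k: int) -> List[int]:
    """Return indices of the k latents most aligned with the query, in original order."""
    # Assumed scoring rule: plain dot product between each latent and the query.
    scores = [sum(a * b for a, b in zip(latent, query)) for latent in latents]
    ranked = sorted(range(len(latents)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Returning indices in their original order keeps any positional structure intact when the selected latents are passed to later fusion layers.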

中文标题/摘要

标题：LoC-Path：面向病理多模态大型语言模型的学习压缩

全切片图像（WSI）多模态大型语言模型（MLLM）难以构建和部署，因为十亿像素级的切片会产生数千个视觉标记，而只有一小部分区域具有诊断相关性。现有的切片级病理 MLLM 通常将沉重的切片级编码器与长视觉前缀相结合，这使得在有限的计算资源下进行切片级的端到端开发和部署成本高昂。我们重新审视了这一设定，并表明 WSI 图块特征在全局和局部尺度上都高度冗余，而任务相关证据稀疏且依赖于查询。因此，我们引入了 LoC-Path，这是一种资源高效的切片级 MLLM，它在融合之前进行压缩。LoC-Path 使用稀疏标记合并器（STM）和 MAE 预训练重采样器，以紧凑的潜在接口取代昂贵的切片级编码，然后使用标记重要性评分器（TIS）选择最相关的潜在特征，并使用交叉注意力路由适配器（CARA）将它们融合到少数几个 LLM 解码器层中。这种设计通过避免沉重的切片级编码和长视觉前缀，降低了多模态调优成本和推理时的延迟/内存占用。大量实验表明，LoC-Path 在保持与先前切片级 MLLM 相当的竞争力的同时，使有限计算资源下的端到端开发和部署更加切实可行。

Summary / 总结

The research aims to address the computational challenges in developing and deploying whole slide image (WSI) multimodal large language models (MLLMs) by reducing redundancy and focusing on task-relevant evidence. LoC-Path introduces a resource-efficient approach that compresses WSI tile features before fusion, using a Sparse Token Merger, an MAE-pretrained resampler, a Token Importance Scorer, and a Cross-Attention Routing Adapter. The method significantly lowers multimodal tuning cost and inference-time latency/memory. Experiments demonstrate that LoC-Path maintains competitive performance with existing slide-level MLLMs while making end-to-end development and deployment more feasible under limited computational resources.

论文通过引入LoC-Path，解决了构建和部署全切片图像（WSI）多模态大型语言模型（MLLMs）的挑战，LoC-Path在融合前压缩WSI图块特征。LoC-Path使用稀疏Token合并器和MAE预训练重采样器创建紧凑的潜在接口，并使用Token重要性评分器和交叉注意力路由适配器选择相关潜在特征并将其融合到少量LLM解码器层中。实验表明，LoC-Path在保持与现有切片级MLLMs相当的竞争力的同时，降低了多模态调优成本和推理时的延迟/内存占用。

A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition

Authors: Jiajun Sun, Zhe Gao

First: 2026-03-12T17:45:12+00:00 · Latest: 2026-03-12T17:45:12+00:00

Comments: 10 pages, 4 figures

Abstract

This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
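The Stage II steps of gated fusion and inference-time temporal smoothing can be sketched in a few lines. The abstract does not give the exact formulation, so this example assumes a single scalar sigmoid gate over the concatenated features and a centered moving average over per-frame class probabilities; both choices and all names are illustrative.

```python
import math
from typing import List, Sequence


def gated_fuse(visual: Sequence[float], audio: Sequence[float],
               gate_weights: Sequence[float], gate_bias: float) -> List[float]:
    """Blend equal-length visual and audio features with one scalar gate."""
    concatenated = list(visual) + list(audio)
    logit = sum(w * x for w, x in zip(gate_weights, concatenated)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-logit))  # assumed sigmoid gate in (0, 1)
    return [gate * v + (1.0 - gate) * a for v, a in zip(visual, audio)]


def smooth_predictions(frame_probs: Sequence[Sequence[float]],
                       window: int = 3) -> List[int]:
    """Average class probabilities over a centered window, then take the argmax."""
    half = window // 2
    num_frames = len(frame_probs)
    num_classes = len(frame_probs[0])
    labels = []
    for t in range(num_frames):
        start, stop = max(0, t - half), min(num_frames, t + half + 1)
        averaged = [
            sum(frame_probs[s][c] for s in range(start, stop)) / (stop - start)
            for c in range(num_classes)
        ]
        labels.append(max(range(num_classes), key=averaged.__getitem__))
    return labels
```

Smoothing suppresses single-frame label flips, which matters because adjacent video frames rarely change expression independently.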

中文标题/摘要

标题：一种用于面部情感表达识别的两阶段双模态模型

本文针对第10届自然场景情感行为分析（ABAW）研讨会暨竞赛中的表情（EXPR）识别挑战，该任务要求对无约束视频中的八种面部情感表达进行帧级分类。由于人脸定位不准确、姿态和尺度变化大、运动模糊、时间不稳定性以及相邻帧之间的其他干扰因素，该任务具有挑战性。我们提出了一种两阶段双模态（音频-视觉）模型来应对这些困难。第一阶段专注于使用基于DINOv2的预训练编码器进行鲁棒的视觉特征提取。具体来说，使用DINOv2 ViT-L/14作为骨干网络，采用填充感知增强（PadAug）策略对原始视频进行图像填充和数据预处理，并引入混合专家（MoE）训练头以增强分类器多样性。第二阶段解决模态融合和时间一致性问题。对于视觉模态，从原始视频中以多个尺度重新裁剪人脸，并对提取的视觉特征取平均，形成鲁棒的帧级表示。同时，从短音频窗口中提取与帧对齐的Wav2Vec 2.0音频特征，以提供互补的声学线索。这些双模态特征通过轻量级门控融合模块进行整合，随后在推理时进行时间平滑。在ABAW数据集上的实验证明了所提方法的有效性。该两阶段模型在官方验证集上的宏F1分数为0.5368，在5折交叉验证下为0.5122 +/- 0.0277，优于官方基线。

Summary / 总结

This paper presents a two-stage dual-modal model for facial emotional expression recognition in unconstrained videos, addressing challenges like inaccurate face localization and pose variations. The model uses a pretrained DINOv2-based encoder for robust visual feature extraction and a mixture-of-experts training head to enhance classifier diversity. In the second stage, it integrates visual and audio features through a gated fusion module and temporal smoothing, achieving a Macro-F1 score of 0.5368 on the official validation set and outperforming official baselines.

该论文提出了一种两阶段双模态模型用于处理不受限视频中的面部情感表达识别，解决了诸如不准确的面部定位和姿态变化等挑战。模型在第一阶段使用预训练的DINOv2编码器提取稳健的视觉特征，在第二阶段通过门控融合模块整合视觉和音频特征，确保时间一致性。该模型在官方验证集上的宏F1分数为0.5368，并优于官方基线模型。

Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Authors: Görkay Aydemir, Fatma Güney, Weidi Xie

Venue: CVPR 2026

First: 2026-03-12T17:40:52+00:00 · Latest: 2026-03-12T17:40:52+00:00

Comments: CVPR 2026

Abstract

Models for long-term point tracking are typically trained on large synthetic datasets. The performance of these models degrades in real-world videos due to different characteristics and the absence of dense ground-truth annotations. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels strongly depends on the reliability of teacher models, which vary across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied for fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: https://kuis-ai.github.io/track_on_r
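Per-frame selection among candidate trajectories can be sketched as follows. The verifier itself is a learned meta-model in the paper; here its outputs are simply passed in as a score table, and the function name and data layout are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def build_pseudo_labels(candidates: Sequence[Sequence[Point]],
                        scores: Sequence[Sequence[float]]) -> List[Point]:
    """Pick, for every frame, the point from the tracker with the highest score.

    candidates[k][t] is tracker k's (x, y) prediction at frame t;
    scores[k][t] is the (assumed given) verifier reliability for that prediction.
    """
    num_trackers = len(candidates)
    num_frames = len(candidates[0])
    trajectory = []
    for t in range(num_frames):
        best_tracker = max(range(num_trackers), key=lambda k: scores[k][t])
        trajectory.append(candidates[best_tracker][t])
    return trajectory
```

The resulting trajectory can switch trackers from frame to frame, which reflects the abstract's point that teacher reliability varies across frames and scenes.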

中文标题/摘要

标题：基于验证器引导伪标签的现实世界点跟踪

长期点跟踪模型通常在大型合成数据集上进行训练。这些模型在现实世界视频中的性能会下降，因为现实世界视频具有不同的特征且缺乏密集的地面真值注释。在未标注视频上进行自我训练是一种实际的解决方案，但伪标签的质量强烈依赖于教师模型的可靠性，这在不同帧和场景之间有所不同。在本文中，我们解决了现实世界微调的问题，并引入了验证器，这是一种元模型，用于学习评估跟踪器预测的可靠性并指导伪标签生成。给定多个预训练跟踪器的候选轨迹，验证器逐帧评估它们并选择最值得信赖的预测，从而生成高质量的伪标签轨迹。在进行微调时，验证器引导的伪标签生成显著提高了监督的质量，并使模型能够高效地适应未标注视频。在四个现实世界基准上的广泛实验表明，我们的方法在所需数据量少于先前自我训练方法的情况下达到了最先进的效果。项目页面：https://kuis-ai.github.io/track_on_r

Summary / 总结

This paper addresses the performance degradation of long-term point tracking models, which are typically trained on large synthetic datasets, when they are applied to real-world videos. The authors introduce a verifier, a meta-model that assesses the reliability of predictions from multiple pretrained trackers per frame and selects the most trustworthy ones to form high-quality pseudo-label trajectories. Experiments on four real-world benchmarks show that fine-tuning with verifier-guided pseudo-labels achieves state-of-the-art results while requiring less data than prior self-training methods.

本文通过引入评估追踪预测可靠性的验证器，并指导伪标签生成，解决了现实世界点追踪的挑战。该方法使用多个预训练追踪器的候选轨迹，并选择最可靠的轨迹，生成高质量的伪标签。实验表明，验证器引导的伪标签生成提高了模型性能，并能高效地适应未标记的现实世界视频，相比之前的自我训练方法，使用更少的数据达到了最先进的效果。

ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models

Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao

First: 2026-03-12T17:30:49+00:00 · Latest: 2026-03-12T17:30:49+00:00

Abstract

Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
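The scoring-then-pruning step can be illustrated with a much simpler proxy. The sketch below does not solve the Birth-Death Optimal Transport problem: novelty is approximated as one minus the best cosine match against the previous frame's tokens, the high-frequency prior is assumed to be supplied per token, and the mixing weight `alpha` is hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def forensic_scores(current: Sequence[Sequence[float]],
                    previous: Sequence[Sequence[float]],
                    high_freq_prior: Sequence[float],
                    alpha: float = 0.5) -> List[float]:
    """Mix a novelty proxy with a per-token high-frequency prior."""
    scores = []
    for token, prior in zip(current, high_freq_prior):
        # Proxy for transport-based novelty: poorly matched tokens score high.
        best_match = max(cosine(token, old) for old in previous)
        novelty = 1.0 - best_match
        scores.append(alpha * novelty + (1.0 - alpha) * prior)
    return scores


def keep_indices(scores: Sequence[float], retention: float) -> List[int]:
    """Keep the top-scoring fraction of tokens (at least one), in original order."""
    count = max(1, int(len(scores) * retention))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:count])
```

With `retention=0.1`, ten tokens reduce to the single highest-scoring one, mirroring the 10% retention setting reported in the abstract.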

中文标题/摘要

标题：ForensicZip：更多的标记更好但并非必要——在法医视觉-语言模型中的应用

多模态大型语言模型（MLLMs）通过生成伪造检测的文本解释来实现多媒体的可解释性法医分析。然而，处理密集的视觉序列会带来高昂的计算成本，特别是对于高分辨率的图像和视频。视觉标记剪枝是一种实用的加速策略，但现有方法主要基于语义驱动，保留显著的对象，而丢弃包含伪造痕迹（如高频异常和时间抖动）的背景区域。为了解决这一问题，我们引入了ForensicZip，这是一种无需训练的框架，从伪造驱动的角度重新定义了标记压缩。ForensicZip将时间标记的演变建模为具有松弛虚拟节点的出生-死亡最优传输问题，量化物理不连续性以指示瞬态生成伪影。法医评分进一步将基于传输的新颖性与高频先验相结合，在大比例压缩下分离法医证据和语义内容。在深度伪造和AIGC基准测试中，即使在保留10%的标记时，ForensicZip也实现了2.97倍的加速和超过90%的FLOPs减少，同时保持了最先进的检测性能。

Summary / 总结

The research aims to improve the efficiency of forensic vision-language models by addressing the high computational costs associated with processing dense visual sequences. ForensicZip, a training-free framework, reformulates token compression from a forgery-driven perspective, focusing on quantifying physical discontinuities to detect transient generative artifacts. Experiments demonstrate that ForensicZip achieves a 2.97 times speedup and over 90% FLOPs reduction at 10% token retention while maintaining state-of-the-art detection performance.

研究旨在通过解决处理密集视觉序列时的高计算成本问题，提高法医视觉-语言模型的效率。ForensicZip 是一个无需训练的框架，从伪造驱动的角度重新定义了 token 压缩，通过量化物理不连续性来检测瞬态生成伪影。实验表明，ForensicZip 在 10% token 保留的情况下实现了 2.97 倍的加速和超过 90% 的 FLOPs 减少，同时在深度伪造和 AIGC 基准上保持了最先进的检测性能。

CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

Authors: Alexandre Le Mercier, Thomas Demeester, Chris Develder

First: 2026-03-12T17:29:55+00:00 · Latest: 2026-03-12T17:29:55+00:00

Comments: 22 pages, 6 figures

Abstract

State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves 95.9% token-level F1 score and 99.3% document-level F1 score on malicious tokens detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.

中文标题/摘要

标题：CLASP：防御混合大型语言模型免受隐藏状态中毒攻击

状态空间模型（SSMs）如Mamba已成为Transformer的有效替代品，实现了线性复杂度并保持了竞争力的性能。然而，最近发现的隐藏状态中毒攻击（HiSPAs）通过对抗性字符串破坏SSM的记忆，对这些架构及其混合变体构成了严重威胁。将HiSPA缓解任务视为在标记级别上的二元分类问题，我们引入了CLASP模型来防御这种威胁。CLASP利用Mamba块输出嵌入（BOEs）中的不同模式，并使用XGBoost分类器识别恶意标记，同时具有最小的计算开销。我们考虑了一个现实场景，在该场景中，SSMs和HiSPAs都可能被使用：一个LLM筛选简历以识别最适合某个角色的最佳候选人。在包含2,483份简历，总计9.5M标记并受控注入的语料库上进行评估，CLASP在恶意标记检测上实现了95.9%的标记级别F1分数和99.3%的文档级别F1分数。至关重要的是，该模型能够泛化到未见过的攻击模式：在留一交叉验证下，性能保持较高（96.9%的文档级别F1），而在具有结构上新颖触发器的聚类交叉验证下，它保持了有用的检测能力（91.6%的平均文档级别F1）。独立于任何下游模型，CLASP每秒处理1,032个标记，消耗不到4GB VRAM，可能使其适合实际部署作为基于SSM和混合架构的轻量级前线防御。所有代码和详细结果可在https://anonymous.4open.science/r/hispikes-91C0获取。
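
The abstract reports both token-level and document-level F1. A minimal sketch of how the two scores can differ on the same predictions is given below. It assumes that a document counts as malicious when at least one of its tokens is labelled malicious; this aggregation rule and the function names are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def token_and_document_f1(
    true_docs: Sequence[Sequence[int]],
    pred_docs: Sequence[Sequence[int]],
) -> Tuple[float, float]:
    """Return (token-level F1, document-level F1) for 0/1 token labels.

    Assumption: a document is malicious if any of its tokens is malicious.
    """
    tok = [0, 0, 0]  # tp, fp, fn over all tokens
    doc = [0, 0, 0]  # tp, fp, fn over documents
    for true_tokens, pred_tokens in zip(true_docs, pred_docs):
        for t, p in zip(true_tokens, pred_tokens):
            tok[0] += int(t == 1 and p == 1)
            tok[1] += int(t == 0 and p == 1)
            tok[2] += int(t == 1 and p == 0)
        t_doc, p_doc = int(any(true_tokens)), int(any(pred_tokens))
        doc[0] += int(t_doc == 1 and p_doc == 1)
        doc[1] += int(t_doc == 0 and p_doc == 1)
        doc[2] += int(t_doc == 1 and p_doc == 0)
    return f1(*tok), f1(*doc)


# Toy data: two documents, the first with an injected two-token trigger.
true_labels: List[List[int]] = [[0, 1, 1], [0, 0]]
pred_labels: List[List[int]] = [[0, 1, 0], [0, 1]]
token_f1, document_f1 = token_and_document_f1(true_labels, pred_labels)
```

In this toy case a single false alarm in the benign document lowers both scores, while the partially detected trigger in the first document only costs recall at the token level.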

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li

First: 2026-03-12T17:27:21+00:00 · Latest: 2026-03-12T17:27:21+00:00

Abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).

中文标题/摘要

标题：IndexCache：通过跨层索引重用加速稀疏注意力

长上下文代理工作流已成为大型语言模型的关键用例，使得注意力效率对于推理速度和服务成本至关重要。稀疏注意力有效地解决了这一挑战，DeepSeek 稀疏注意力（DSA）是一种代表性的生产级解决方案：一个轻量级的闪电索引器为每个查询选择 top-k 个最相关的标记，将核心注意力计算从 $O(L^2)$ 减少到 $O(Lk)$。然而，索引器本身仍保持 $O(L^2)$ 复杂度，并且必须在每一层独立运行，尽管连续层得到的 top-k 选择高度相似。我们提出了 IndexCache，它将层划分为少量运行自身索引器的完整（Full）层和多数直接重用最近 Full 层 top-k 索引的共享（Shared）层，从而利用这种跨层冗余。我们提出了两种互补的方法来确定和优化此配置。无需训练的 IndexCache 使用贪婪搜索算法，通过直接最小化校准集上的语言建模损失来选择保留索引器的层，无需权重更新。训练感知的 IndexCache 引入了一种多层蒸馏损失，针对每个保留的索引器所服务的所有层的平均注意力分布对其进行训练，使得即使是简单的交错模式也能达到全索引器的准确性。在 30B DSA 模型上的实验结果显示，IndexCache 可以去除 75% 的索引器计算，质量下降可以忽略不计，相比标准 DSA 实现了高达 1.82$\times$ 的预填充加速和 1.48$\times$ 的解码加速。我们在生产规模 GLM-5 模型上的初步实验进一步证实了这些积极结果（图 1）。

Summary / 总结

IndexCache accelerates sparse attention by reusing top-k indices across layers, removing 75% of indexer computations with negligible quality degradation. It offers two approaches: a training-free approach that uses greedy search to choose which layers keep their indexers by minimizing language modeling loss on a calibration set, and a training-aware approach that trains each retained indexer with a multi-layer distillation loss. On a 30B DSA model, IndexCache achieves up to 1.82x prefill speedup and 1.48x decode speedup compared to standard DSA.

IndexCache通过跨层重用索引器来加速稀疏注意力，将索引器计算量减少75%，同时保持模型性能。它使用两种方法：无训练的直接最小化语言建模损失，以及有训练的多层蒸馏损失。实验表明，在30B DSA模型上可实现最高1.82倍的预填充加速和1.48倍的解码加速，且质量无明显下降。
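
The Full/Shared layer split described in the abstract can be sketched as follows. This is a toy illustration, not the authors' implementation: it assumes that "nearest" means the closest preceding Full layer, that layer 0 is always Full, and that a hypothetical `indexer` callable returns one relevance score per token. All function names are made up for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def nearest_full_layer(num_layers: int, full_layers: Sequence[int]) -> List[int]:
    """Map each layer to the closest preceding Full layer (itself if Full)."""
    full = set(full_layers)
    if 0 not in full:
        raise ValueError("layer 0 must be Full so every layer has a source")
    source, current = [], 0
    for layer in range(num_layers):
        if layer in full:
            current = layer
        source.append(current)
    return source


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(
    num_layers: int,
    full_layers: Sequence[int],
    indexer: Callable[[int], Sequence[float]],
    k: int,
) -> Tuple[List[List[int]], int]:
    """Return the per-layer top-k selections and the number of indexer calls."""
    source = nearest_full_layer(num_layers, full_layers)
    cache: Dict[int, List[int]] = {}
    selected: List[List[int]] = []
    indexer_calls = 0
    for layer in range(num_layers):
        if source[layer] == layer:
            # Full layer: run its own indexer and cache the result.
            cache[layer] = top_k_indices(indexer(layer), k)
            indexer_calls += 1
        # Shared layers reuse the cached indices of their source Full layer.
        selected.append(cache[source[layer]])
    return selected, indexer_calls


def toy_indexer(layer: int) -> List[float]:
    """Stand-in scores for six tokens; a real indexer would score queries against keys."""
    return [float((layer * 3 + i * 7) % 11) for i in range(6)]


selected, calls = run_layers(num_layers=8, full_layers=[0, 4], indexer=toy_indexer, k=3)
```

With 8 layers and Full layers at positions 0 and 4, the indexer runs twice instead of eight times, which corresponds to the 75% reduction in indexer computations quoted above. Which layers to keep as Full is exactly what the greedy search or the distillation-based training in the abstract is meant to decide; the fixed pattern here is only a placeholder.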

LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models

Authors: Pan Liao, Feng Yang, Di Wu, Jinwen Yu, Yuhua Zhu, Wenhui Zhao, Dingwen Zhang

First: 2026-01-10T12:18:12+00:00 · Latest: 2026-03-12T17:26:37+00:00

Abstract

Multi-Object Tracking (MOT) is evolving from geometric localization to Semantic MOT (SMOT) to answer complex relational queries, yet progress is hindered by semantic data scarcity and a structural disconnect between tracking architectures and Multi-modal Large Language Models (MLLMs). To address this, we introduce Grand-SMOT, a large-scale, open-world benchmark providing high-density, dual-stream narratives that comprehensively decouple individual behaviors from environmental contexts. Furthermore, we propose LLMTrack, the first framework to seamlessly integrate MLLMs into the SMOT task. LLMTrack establishes a Macro-Understanding-First paradigm, utilizing a novel Spatio-Temporal Fusion Module to align discrete geometric trajectories with continuous semantic features, effectively suppressing temporal hallucinations during online processing. Extensive experiments demonstrate that LLMTrack achieves state-of-the-art geometric tracking performance while delivering a qualitative leap in dynamic semantic reasoning. Notably, our analysis reveals that high-quality semantic narratives empower the language model to deduce complex social interactions naturally, demonstrating that direct cognitive reasoning is more effective than cumbersome explicit visual modeling. Ultimately, our contributions bridge the gap between perceptual tracking and cognitive reasoning, establishing a robust new foundation for comprehensive video understanding and intelligent narrative generation.

中文标题/摘要

标题：LLMTrack：使用多模态大型语言模型的语义多对象跟踪

多对象跟踪（MOT）正在从几何定位发展到语义MOT（SMOT），以回答复杂的关联查询，但进展受到语义数据稀缺性和跟踪架构与多模态大型语言模型（MLLMs）之间结构断层的阻碍。为了解决这一问题，我们引入了Grand-SMOT，这是一个大规模、开放世界的基准，提供了高密度、双流叙事，全面地将个体行为与环境背景解耦。此外，我们提出了LLMTrack，这是第一个将MLLM无缝集成到SMOT任务中的框架。LLMTrack确立了先宏观理解的范式，利用新颖的空间-时间融合模块将离散的几何轨迹与连续的语义特征对齐，在在线处理过程中有效抑制了时间幻觉。大量实验表明，LLMTrack在几何跟踪性能上达到了最先进的水平，同时在动态语义推理方面实现了质的飞跃。值得注意的是，我们的分析表明，高质量的语义叙事使语言模型能够自然地推断复杂的社交互动，表明直接的认知推理比繁琐的显式视觉建模更有效。最终，我们的贡献弥合了感知跟踪与认知推理之间的差距，为全面的视频理解和智能叙事生成奠定了坚实的新基础。

Summary / 总结

The research aims to advance Semantic Multi-Object Tracking (SMOT) by addressing semantic data scarcity and structural disconnects. The proposed LLMTrack framework integrates Multi-modal Large Language Models (MLLMs) into SMOT, using a Spatio-Temporal Fusion Module to align geometric trajectories with semantic features. Experiments show that LLMTrack achieves state-of-the-art geometric tracking while enhancing dynamic semantic reasoning, highlighting the effectiveness of high-quality semantic narratives in deducing complex social interactions.

LLMTrack 是一个将多模态大型语言模型（MLLMs）集成到语义多对象跟踪（SMOT）中的框架，旨在解决语义数据稀缺性和结构断层的问题。它引入了 Grand-SMOT，这是一个具有高密度叙述的 SMOT 基准，并使用时空融合模块将几何轨迹与语义特征对齐。实验表明，LLMTrack 在几何跟踪和动态语义推理方面均优于现有方法，强调了认知推理比繁琐的显式视觉建模更有效。

Long-Context Encoder Models for Polish Language Understanding

Authors: Sławomir Dadas, Rafał Poświata, Marek Kozłowski, Małgorzata Grębowiec, Michał Perełkiewicz, Paweł Klimiuk, Przemysław Boruta

First: 2026-03-12T17:21:45+00:00 · Latest: 2026-03-12T17:21:45+00:00

Abstract

While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.

中文标题/摘要

标题：波兰语理解的长上下文编码器模型

虽然仅解码器的大型语言模型（LLMs）最近在NLP领域占据主导地位，但仅编码器架构仍然是判别任务中成本效益高且参数效率高的标准。然而，经典的编码器如BERT受限于短的上下文窗口，不足以处理长文档。在本文中，我们针对波兰语解决这一限制，引入了一种能够处理长达8192个标记序列的高质量波兰语模型。该模型通过采用两阶段训练程序开发，该程序包括位置嵌入适应和全参数连续预训练。此外，我们还提出了通过知识蒸馏训练的压缩模型变体。这些模型在25个任务上进行了评估，包括KLEJ基准、新引入的金融任务套件（FinBench）以及其他分类和回归任务，特别是那些需要长文档理解的任务。结果表明，我们的模型在波兰语和多语言模型中实现了最佳的平均性能，在长上下文任务中显著优于竞争性解决方案，同时在短文本上保持了相当的质量。

Summary / 总结

This paper addresses the limitation of short context windows in classic encoders like BERT by introducing a high-quality Polish model capable of processing up to 8192 tokens. The model was trained using a two-stage procedure involving positional embedding adaptation and full parameter continuous pre-training. Compressed model variants were also developed through knowledge distillation. Evaluated on 25 tasks, including the KLEJ benchmark and a new financial task suite, the model achieved the best average performance among Polish and multilingual models, particularly excelling in long-context tasks while maintaining comparable quality on short texts.

本文通过引入能够处理多达8192个标记的高质量波兰模型，解决了经典编码器如BERT的短上下文窗口限制。该模型通过两阶段训练程序进行训练，包括位置嵌入适应和全参数连续预训练。还通过知识蒸馏开发了压缩模型变体。该模型在包括KLEJ基准和金融任务套件在内的25个任务中进行了评估，并在波兰和多语言模型中实现了最佳平均性能，特别是在长上下文任务中表现尤为突出。

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta

First: 2026-03-12T17:11:22+00:00 · Latest: 2026-03-12T17:11:22+00:00

Abstract

Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.

中文标题/摘要

标题：战略导航还是随机搜索？代理和人类在文档集合中的推理方式

多模态代理为自动化复杂文档密集型工作流提供了有希望的途径。然而，一个关键问题仍然存在：这些代理是否展示了真正的战略推理，还是仅仅进行了随机的尝试和错误搜索？为了解决这个问题，我们引入了MADQA基准，包含2250个人撰写的基于800份异构PDF文档的问题。根据经典测验理论，我们设计它以最大化在不同代理能力水平上的区分力。为了评估代理行为，我们引入了一种新的评估协议，衡量准确性和努力之间的权衡。使用这种框架，我们表明，虽然最好的代理在纯准确度上可以与人类搜索者匹敌，但它们回答的问题类型不同，并依赖于暴力搜索来弥补薄弱的战略规划。它们未能缩小与Oracle性能近20%的差距，持续陷入无效循环。我们发布了数据集和评估框架，以帮助促进从暴力检索向校准、高效的推理过渡。

Summary / 总结

The study asks whether multimodal agents reason strategically or merely search stochastically when navigating document collections. The researchers introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents, together with an evaluation protocol that measures the accuracy-effort trade-off. The best agents match human searchers in raw accuracy but succeed on largely different questions, rely on brute-force search to compensate for weak strategic planning, and persist in unproductive loops, leaving a gap of nearly 20% to oracle performance.

研究旨在确定多模态代理在处理文档集合时是展示战略推理还是仅依赖随机搜索。研究人员提出了MADQA基准，包含2,250个人工撰写的问题，这些问题基于800份异构PDF文档，并引入了衡量准确性与努力之间权衡的评估协议。研究发现，最佳代理在原始准确率上可与人类搜索者匹敌，但它们答对的问题与人类大不相同，依赖暴力搜索来弥补薄弱的战略规划，并持续陷入无效循环，未能弥合与oracle性能之间近20%的差距。该数据集和评估框架已发布，以促进从暴力检索向校准、高效的推理过渡。

BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning

Authors: Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu

First: 2026-03-12T17:09:20+00:00 · Latest: 2026-03-12T17:09:20+00:00

Abstract

Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.

中文标题/摘要

标题：BehaviorVLM：统一的无需微调的行为理解与视觉-语言推理

理解自由移动的动物行为是神经科学的核心，其中姿态估计和行为理解构成了将神经活动与自然动作联系起来的基础。然而，这两个任务仍然严重依赖于人工注释或不稳定的无监督管道，限制了其可扩展性和可重复性。我们提出了BehaviorVLM，这是一种统一的视觉-语言框架，用于姿态估计和行为理解，无需特定任务的微调和最少的人工标注，通过引导预训练的视觉-语言模型（VLMs）进行详细的、明确的和可验证的推理步骤。对于姿态估计，我们利用量子点标记的行为数据，并提出了一种多阶段管道，结合了时间、空间和跨视图推理。这种设计大大减少了人工标注的工作量，通过几何检查如重投影误差暴露了低置信度的标签，并生成了可以稍后过滤、修正或用于微调下游姿态模型的标签。对于行为理解，我们提出了一种管道，结合了深度嵌入聚类以发现过度分割的行为，基于VLM的每段视频字幕，以及基于LLM的推理以合并和语义标注行为片段。行为管道可以直接从视觉信息运行，不需要关键点来分割行为。这些组件共同实现了多动物行为的大规模、可解释和轻标注分析。

Summary / 总结

The research aims to improve the scalability and reproducibility of pose estimation and behavioral understanding in neuroscience by using a unified vision-language framework called BehaviorVLM. It leverages pretrained models and detailed reasoning steps to achieve this without requiring task-specific finetuning or extensive human labeling. Key findings include a multi-stage pose estimation pipeline that reduces human annotation effort and a behavioral understanding pipeline that discovers and labels behaviors directly from visual information without needing keypoints, enabling scalable and interpretable analysis of multi-animal behavior.

BehaviorVLM 是一个无需特定任务微调或大量人工标注的统一视觉-语言框架，用于姿态估计和行为理解。它通过详细的推理步骤引导预训练模型，减少人工标注工作量，并通过几何检查暴露低置信度标签。在姿态估计方面，它使用包含时间、空间和跨视图推理的多阶段管道，而在行为理解方面，它结合了深度聚类、基于VLM的视频字幕生成和基于LLM的推理，直接从视觉信息中发现和标注行为。

Deep Incentive Design with Differentiable Equilibrium Blocks

Authors: Vinzenz Thoma, Georgios Piliouras, Luke Marris

First: 2026-03-08T16:15:03+00:00 · Latest: 2026-03-12T17:09:10+00:00

Comments: 24 pages, 7 figures

Abstract

Automated design of multi-agent interactions with desirable equilibrium outcomes is inherently difficult due to the computational hardness, non-uniqueness, and instability of the resulting equilibria. In this work, we propose the use of game-agnostic differentiable equilibrium blocks (DEBs) as modules in a novel, differentiable framework to address a wide variety of incentive design problems from economics and computer science. We call this framework deep incentive design (DID). To validate our approach, we examine three diverse, challenging incentive design tasks: contract design, machine scheduling, and inverse equilibrium problems. For each task, we train a single neural network using a unified pipeline and DEB. This architecture solves the full distribution of problem instances, parameterized by a context, handling all games across a wide range of scales (from two to sixteen actions per player).

中文标题/摘要

标题：具有可微均衡块的深度激励设计

由于均衡的计算困难性、非唯一性和不稳定性，自动设计具有理想均衡结果的多智能体交互本质上十分困难。在本文中，我们提出使用与具体博弈无关的可微均衡块（DEBs）作为新型可微框架中的模块，以解决来自经济学和计算机科学的各种激励设计问题。我们称此框架为深度激励设计（DID）。为了验证我们的方法，我们研究了三个多样且具有挑战性的激励设计任务：合同设计、机器调度和逆均衡问题。对于每个任务，我们使用统一的管道和DEB训练了一个单一的神经网络。该架构求解由上下文参数化的问题实例的完整分布，能够处理各种规模的所有博弈（每位玩家两个到十六个动作）。

Summary / 总结

The research addresses the challenge of designing multi-agent interactions with desirable outcomes by proposing a differentiable framework called deep incentive design (DID), which uses game-agnostic differentiable equilibrium blocks (DEBs) as modules. The framework is validated through three tasks: contract design, machine scheduling, and inverse equilibrium problems, where a single neural network trained with a unified pipeline and DEB solves a wide range of problem instances across different scales.

研究旨在通过解决计算硬度、均衡结果的非唯一性和不稳定性问题，自动化设计多智能体的互动以达到理想结果。提出了一个名为深度激励设计（DID）的新框架，使用游戏无关的可微均衡块（DEBs）作为模块。该框架成功解决了三个不同的激励设计任务：合同设计、机器调度和逆向均衡问题。通过统一的管道和DEB训练的单个神经网络能够处理各种规模和范围的问题实例。

LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning

Authors: Haiying Xu, Zihan Wang, Song Dai, Zhengxuan Zhang, Kairan Dou, Xuming Hu

First: 2026-03-12T17:01:23+00:00 · Latest: 2026-03-12T17:01:23+00:00

Abstract

Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.

中文标题/摘要

标题：LatentGeo：学习型辅助构造在潜在空间中的多模态几何推理

尽管在多模态推理方面取得了近期进展，但在多模态大型语言模型（MLLMs）中表示辅助几何构造仍然是一个基本挑战。这些构造不在原始图中，必须在应用定理之前引入。现有方法主要依赖显式的构造范式，包括基于文本的几何规范、推理期间的视觉标记交错以及工具增强的几何执行。然而，这些方法要么无法忠实表示复杂的空间关系，要么在离散符号和连续几何结构之间产生表示不匹配，要么依赖外部能力，阻碍端到端优化。为了解决这些限制，我们提出了LatentGeo框架，该框架学习连续的潜在视觉表示，无需像素级渲染或外部执行器即可内化辅助几何构造。我们设计了一个三阶段课程，通过辅助视觉监督逐步对齐和内化这些潜在表示，随后是LaGDPO，一种潜在感知的强化学习过程，在策略优化过程中稳定潜在表示，同时提高最终任务的正确性。为了系统地评估构造为中心的表示质量，我们引入了GeoAux，这是一个针对视觉依赖几何问题的新基准，并在GeoAux和MathVerse上进行了实验。结果显示，LatentGeo在几何推理任务中取得了显著的提升，特别是在需要辅助构造的任务中。广泛的分析和消融研究进一步验证了我们框架中每个组件的有效性。

Summary / 总结

LatentGeo addresses the challenge of representing auxiliary geometric constructions in multimodal reasoning by learning continuous latent visual representations. It uses a three-stage curriculum and a latent-aware reinforcement learning procedure to internalize these constructions without pixel-level rendering or external executors. Experiments on GeoAux and MathVerse demonstrate that LatentGeo significantly improves geometric reasoning tasks, especially those requiring auxiliary constructions.

LatentGeo通过学习连续的潜在视觉表示来解决多模态推理中辅助几何构造的表示问题，使用三阶段课程和潜在感知的强化学习过程来内部化这些构造，而不依赖于像素级别的渲染或外部执行器。在GeoAux和MathVerse上的实验表明，LatentGeo在几何推理方面取得了显著改进，特别是在需要辅助构造的任务上。

QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Authors: Jiayin Lei, Ming Ma, Yunxi Duan, Chenxi Li, Tianming Yang

Venue: ACL 2026

First: 2026-03-12T17:01:22+00:00 · Latest: 2026-03-12T17:01:22+00:00

Comments: 12 pages, 5 figures. Under review at ACL 2026

Abstract

Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may contain defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.

中文标题/摘要

标题：QAQ：双向语义一致性选择高质量合成代码指令

合成数据已成为训练代码生成模型不可或缺的部分，但它们引入了难以用现有指标检测的大量噪声和幻觉。现有的数据选择方法，如指令遵循难度(IFD)，通常评估模型在给定查询的情况下生成答案的难度($A|Q$)。然而，对于噪声较大的合成数据，这一指标存在歧义：低概率无法区分内在任务复杂性和模型生成的幻觉。在此，我们提出QAQ，一种新颖的数据选择框架，从反方向评估数据质量：答案如何准确预测查询($Q|A$)? 我们定义了逆互信息(RMI)来量化在给定答案的情况下关于查询的信息增益。我们的分析表明，RMI的两端都表明质量问题：低RMI表示语义不一致，而过高RMI可能包含LLMs容易识别的缺陷模式。此外，我们提出了一种基于强模型和弱模型之间分歧的数据选择策略，以识别既有效又具有挑战性的样本。在WarriorCoder数据集上的实验表明，使用分层RMI选择25%的数据即可达到全数据训练相当的性能，显著优于现有数据选择方法。我们的方法强调了合成数据编目中双向语义一致性的重要性，提供了一种减少计算成本而不牺牲模型能力的可扩展途径。

Summary / 总结

The paper addresses the issue of noise in synthetic code data for training code generation models by proposing QAQ, a novel data selection framework that evaluates data quality based on how well the answer predicts the query ($Q|A$). It introduces Reverse Mutual Information (RMI) to quantify this relationship and identifies both low and high RMI as indicators of quality issues. Experiments show that selecting 25% of data using stratified RMI achieves comparable performance to full-data training, outperforming existing methods. This approach emphasizes the importance of bidirectional semantic coherence in synthetic data curation.

论文针对训练代码生成模型时选择高质量合成代码指令的挑战，这些指令常受到噪声和幻觉的影响。提出了QAQ，一种通过评估答案能否很好地预测查询来评估数据质量的新框架，使用了逆互信息（RMI）。研究发现，RMI值过低或过高都表明存在质量问题，并提出了一种基于强弱模型分歧的筛选策略。实验表明，在WarriorCoder数据集上使用25%通过分层RMI筛选的数据，能达到与全量数据训练相当的性能，优于现有方法。
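
The abstract does not give a formula for RMI, so the sketch below uses one plausible reading as an assumption: the per-token gain in log-probability of the query when the answer is supplied as context, i.e. (log p(Q|A) - log p(Q)) / |Q|. The band selection that drops both extremes is likewise a hypothetical stand-in for the paper's stratified strategy, and the token log-probabilities are assumed to come from some language model that is not shown here.

```python
from typing import List, Sequence


def rmi(logp_q_given_a: Sequence[float], logp_q: Sequence[float]) -> float:
    """Assumed RMI: mean per-token log-probability gain of the query given the answer."""
    if len(logp_q_given_a) != len(logp_q) or not logp_q:
        raise ValueError("both inputs need one log-probability per query token")
    return (sum(logp_q_given_a) - sum(logp_q)) / len(logp_q)


def select_middle_band(
    scores: Sequence[float],
    low_frac: float = 0.25,
    high_frac: float = 0.75,
) -> List[int]:
    """Keep sample indices whose score rank lies in [low_frac, high_frac).

    Hypothetical rule: very low scores suggest a query/answer mismatch and
    very high scores suggest easily recognised defect patterns, so both
    tails are dropped. The cut-off fractions are illustrative only.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    lo = int(len(order) * low_frac)
    hi = int(len(order) * high_frac)
    return sorted(order[lo:hi])


# Toy example: the answer raises the query's log-probability by 1.5 nats per token.
example_score = rmi([-1.0, -1.0], [-2.0, -3.0])
kept = select_middle_band([0.1, 5.0, 1.0, 2.0])
```

Here `kept` contains the two samples with mid-range scores; the lowest-scoring and highest-scoring samples are discarded, mirroring the observation that both extremes of RMI signal quality issues.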
