VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
Authors: Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu
First: 2026-03-23T17:59:51+00:00 · Latest: 2026-03-23T17:59:51+00:00
Abstract
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
中文标题/摘要
标题:VideoDetective:通过外部查询和内在相关性进行长视频线索搜索
由于上下文窗口有限,多模态大型语言模型(MLLMs)在理解长视频方面仍然具有挑战性,因此需要识别稀疏的查询相关视频片段。然而,现有方法主要基于查询定位线索,忽视了视频的内在结构以及片段之间的变化相关性。为了解决这个问题,我们提出了一种VideoDetective框架,该框架结合了查询到片段的相关性和片段之间的亲和力,以有效地在长视频问答中进行线索搜索。具体而言,我们将视频划分为多个片段,并通过视觉相似性和时间临近性构建视觉-时间亲和力图来表示它们。然后,我们执行假设-验证-精炼循环来估计观察片段与查询的相关性得分,并将其传播到未观察到的片段,从而生成一个全局相关性分布,该分布指导最终回答所需的最关键片段的定位,基于稀疏观察。实验表明,我们的方法在主流MLLMs上的一系列代表性基准测试中实现了显著的性能提升,VideoMME-long上的准确率提高高达7.5%。我们的代码可在https://videodetective.github.io/获取
Summary / 总结
The research aims to improve long video understanding by addressing a limitation of existing methods, which localize clues based solely on the query. VideoDetective integrates query-to-segment relevance with inter-segment affinity to identify critical video segments. The method divides a video into segments, builds a visual-temporal affinity graph from visual similarity and temporal proximity, and runs a Hypothesis-Verification-Refinement loop that estimates relevance scores for observed segments and propagates them to unseen ones. Experiments show consistent gains across a range of mainstream MLLMs, with accuracy improvements of up to 7.5% on VideoMME-long.
研究旨在解决现有方法仅依据查询定位线索的局限,以提高长视频理解能力。VideoDetective框架结合了查询到片段的相关性和片段间的亲和力,有效识别关键视频片段。该方法将视频划分为片段,基于视觉相似性和时间邻近性构建视觉-时间亲和图,并使用假设-验证-精炼循环估计已观察片段的相关性分数,再将其传播到未观察片段,以指导关键片段的定位。实验结果显示,该方法在多种主流MLLMs上均取得稳定的性能提升,在VideoMME-long上的准确率提升最高可达7.5%。
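A minimal sketch of how relevance scores from a few observed segments might be spread over a visual-temporal affinity graph. The abstract does not give the formulas, so everything here is an assumption for illustration: the affinity (cosine similarity multiplied by a Gaussian temporal kernel), the label-propagation-style update, and the names build_affinity, propagate_relevance, sigma, and alpha are all hypothetical.

```python
import numpy as np

def build_affinity(features, sigma=2.0):
    """Affinity between segments: visual similarity x temporal proximity (assumed form)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    visual = np.clip(f @ f.T, 0.0, None)              # non-negative cosine similarity
    idx = np.arange(len(features))
    temporal = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2 * sigma ** 2))
    w = visual * temporal
    np.fill_diagonal(w, 0.0)                           # no self-loops
    return w

def propagate_relevance(w, observed, alpha=0.8, iters=50):
    """Spread the scores of observed segments to unseen ones (assumed update rule)."""
    row_sum = w.sum(axis=1, keepdims=True)
    p = w / np.maximum(row_sum, 1e-12)                 # row-normalised affinity
    seed = np.zeros(w.shape[0])
    for i, s in observed.items():                      # observed: {segment index: score}
        seed[i] = s
    scores = seed.copy()
    for _ in range(iters):
        scores = alpha * (p @ scores) + (1 - alpha) * seed
    return scores
```

With identical features for every segment, the graph reduces to pure temporal proximity, so a single observed segment raises the scores of its temporal neighbours more than those of distant segments. In a full loop, the highest-scoring unseen segments would presumably be the next candidates to verify.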
End-to-End Training for Unified Tokenization and Latent Denoising
Authors: Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba, Phillip Isola, William T. Freeman
First: 2026-03-23T17:59:49+00:00 · Latest: 2026-03-23T17:59:49+00:00
Comments: First two authors contributed equally. Project: https://xingjianbai.com/unite-tokenization-generation/ Code: https://github.com/ShivamDuggal4/UNITE-tokenization-generation
Abstract
Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE, an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near-state-of-the-art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single-stage joint training of tokenization and generation from scratch is feasible.
中文标题/摘要
标题:统一标记化与潜在去噪端到端训练
潜在扩散模型(LDMs)通过在学习的潜在空间中操作,能够实现高保真合成。然而,训练最先进的LDMs需要复杂的分阶段训练:首先必须训练一个标记化器,然后才能在冻结的潜在空间中训练扩散模型。我们提出UNITE - 一种统一标记化和潜在扩散的自动编码器架构。UNITE 包含一个生成编码器,该编码器通过权重共享作为图像标记化器和潜在生成器。我们的关键见解是,标记化和生成可以被视为在不同条件下的相同潜在推理问题:标记化从完全观察到的图像中推断潜在变量,而生成则从噪声中推断潜在变量,同时结合文本或类别条件。受此启发,我们引入了一种单阶段训练程序,通过两次通过相同的生成编码器的前向传递同时优化两个任务。共享参数使梯度能够共同塑造潜在空间,鼓励“共同潜在语言”。在图像和分子模态中,UNITE 在没有对抗损失或预训练编码器(例如 DINO)的情况下达到了接近最先进的性能,Base 模型和 Large 模型在 ImageNet 256x256 上的 FID 分别为 2.12 和 1.73。我们还通过表示对齐和压缩的视角分析了生成编码器。这些结果表明,从零开始联合训练标记化和生成是可行的。
Summary / 总结
The paper proposes UNITE, an autoencoder architecture that unifies tokenization and latent diffusion for high-fidelity synthesis. It introduces a single-stage training procedure where a Generative Encoder serves both as an image tokenizer and latent generator. This approach jointly optimizes tokenization and generation, leading to near state-of-the-art performance on ImageNet with FID scores of 2.12 and 1.73 for Base and Large models, respectively, without adversarial losses or pretrained encoders.
该论文提出了UNITE,一种统一了标记化和潜在扩散的自编码器架构,以实现高保真度的合成。通过在生成编码器和标记器之间共享参数,UNITE在单阶段训练过程中同时优化两个任务。这种方法在ImageNet上达到了接近最先进的性能,Base模型和Large模型的FID分数分别为2.12和1.73,且不使用对抗损失或预训练编码器。共享参数鼓励形成一个共同的潜在语言,并且该模型在图像和分子模态上表现良好。
UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation
Authors: Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu
First: 2026-03-23T17:59:48+00:00 · Latest: 2026-03-23T17:59:48+00:00
Comments: 42 pages, 16 figures
Abstract
We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
中文标题/摘要
标题:UniMotion:统一的运动-文本-视觉理解与生成框架
我们提出了UniMotion,据我们所知,这是首个在同一架构中同时理解和生成人类运动、自然语言和RGB图像的统一框架。现有的统一模型仅处理有限的模态子集(例如,运动-文本或静态姿态-图像),并且主要依赖于离散标记化,这引入了量化误差并破坏了时间连续性。UniMotion通过一个核心原则克服了这两个限制:将运动视为与RGB同等重要的第一级连续模态。一种新颖的跨模态对齐运动VAE(CMA-VAE)和对称的双路径嵌入器构建了运动和RGB的并行连续路径,共享一个LLM主干。为了在不依赖图像的情况下将视觉语义先验注入运动表示,我们提出了双后验KL对齐(DPA),该方法将视觉融合编码器的更丰富的后验信息提炼到仅运动编码器中。为了解决冷启动问题——其中仅文本监督太稀疏,不足以校准新引入的运动路径——我们进一步提出了潜在重建对齐(LRA),这是一种自监督预训练策略,使用密集的运动潜在变量作为明确的条件,共同校准嵌入器、主干和流头,为所有下游任务建立一个稳定的运动感知基础。UniMotion在三个模态之间实现从任何到任何的理解、生成和编辑的七项任务中达到了最先进的性能,特别是在跨模态组合任务上具有明显优势。
Summary / 总结
UniMotion is a unified framework that simultaneously understands and generates human motion, natural language, and RGB images. It addresses the limitations of existing models by treating motion as a continuous modality and using a Cross-Modal Aligned Motion VAE and dual-path embedders within a shared LLM backbone. Key findings include superior performance across seven tasks, especially in cross-modal compositional tasks, and the introduction of Dual-Posterior KL Alignment and Latent Reconstruction Alignment to enhance motion representation and pre-training stability, respectively.
UniMotion 是一个统一框架,能够同时理解和生成人类动作、自然语言和 RGB 图像。它通过将动作视为连续模态并使用交叉模态对齐动作 VAE 和双路径嵌入器在共享的 LLM 主干内来解决现有模型的限制。关键发现包括在七个任务中表现出色,特别是在跨模态组合任务中的优势,并引入了双后验 KL 对齐和潜在重建对齐来增强动作表示和预训练稳定性。
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
Authors: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu
First: 2026-03-23T17:59:42+00:00 · Latest: 2026-03-23T17:59:42+00:00
Comments: 10 pages, 5 figures
Abstract
Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
中文标题/摘要
标题:ThinkJEPA:利用大型视觉-语言推理模型增强潜在世界模型
潜在世界模型(例如V-JEPA2)的最新进展显示了从视频观察中预测未来世界状态的良好能力。然而,基于短观察窗口的密集预测限制了时间上下文,并可能使预测器偏向局部、低层次的外推,难以捕捉长时序语义,从而降低下游实用性。相比之下,视觉-语言模型(VLMs)通过在均匀采样的帧上进行推理,提供了强大的语义基础和通用知识,但它们并不适合作为独立的密集预测器,原因有三:计算成本导致的稀疏采样;语言输出瓶颈将细粒度的交互状态压缩成面向文本的表示;以及在适应小型动作条件数据集时的数据条件不匹配。我们提出了一种由VLM引导的JEPA风格潜在世界建模框架,通过双时间路径将密集帧动力学建模与长时序语义指导结合起来:密集JEPA分支负责细粒度的运动和交互线索,均匀采样的VLM“思考者”分支采用更大的时间步长,提供知识丰富的指导。为了有效地传递VLM的渐进推理信号,我们引入了一种分层金字塔表示提取模块,将多层VLM表示聚合为与潜在预测兼容的指导特征。在手部操作轨迹预测实验中,我们的方法优于强大的纯VLM基线和JEPA预测器基线,并且表现出更稳健的长时序展开行为。
Summary / 总结
The research aims to enhance latent world models by integrating a vision-language model (VLM) for better long-horizon predictions. The method uses a dual-temporal pathway, combining a dense JEPA branch for fine-grained motion and interaction cues with a VLM thinker branch for semantic guidance. Experiments show that this approach outperforms both a VLM-only baseline and a JEPA-predictor baseline, providing more robust long-term predictions.
研究旨在通过整合视觉-语言模型(VLM)来增强潜世界模型对未来状态预测的能力。方法采用双时间路径,结合密集的JEPA分支用于精细的运动和交互线索,以及均匀采样的VLM‘思考者’分支用于语义指导。实验结果表明,所提出的方法在手操作轨迹预测中优于仅使用VLM的基线和JEPA预测器基线,提供了更稳健的长期预测行为。
DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models
Authors: Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li
First: 2026-03-23T17:59:25+00:00 · Latest: 2026-03-23T17:59:25+00:00
Abstract
Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.
中文标题/摘要
标题:DualCoT-VLA:面向视觉-语言-动作模型的并行推理视觉-语言思维链
视觉-语言-动作(VLA)模型直接将视觉观察和语言指令映射到机器人动作。虽然对于简单任务有效,但标准VLA模型往往难以应对需要逻辑规划的复杂多步骤任务,以及需要精细空间感知的精确操作。最近的工作将思维链(CoT)推理引入VLA模型,赋予它们“先思考后行动”的能力。然而,当前基于CoT的VLA模型面临两个关键限制:1)由于依赖于孤立的单模态CoT,无法同时捕捉低层视觉细节和高层逻辑规划;2)逐步自回归解码导致推理延迟高且错误不断累积。为了解决这些限制,我们提出了DualCoT-VLA,一种面向VLA模型、具有并行推理机制的视觉-语言CoT方法。为了实现全面的多模态推理,我们的方法结合了用于低层空间理解的视觉CoT和用于高层任务规划的语言CoT。此外,为了克服延迟瓶颈,我们引入了一种并行CoT机制,其中包含两组可学习的查询标记,将自回归推理转变为单步前向推理。大量实验表明,我们的DualCoT-VLA在LIBERO和RoboCasa GR1基准测试以及真实世界平台上均实现了最先进的性能。
Summary / 总结
The research aims to enhance Vision-Language-Action (VLA) models to handle complex, multi-step tasks by integrating a parallel reasoning mechanism. DualCoT-VLA incorporates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning, while also introducing a parallel CoT mechanism to reduce inference latency. Experimental results show that DualCoT-VLA outperforms existing models on the LIBERO and RoboCasa GR1 benchmarks and in real-world platforms.
研究旨在通过解决当前基于CoT的VLA模型的限制,增强VLA模型处理复杂多步骤任务的能力。DualCoT-VLA引入了一种并行推理机制,将视觉和语言CoT结合起来进行综合多模态推理,并将推理从逐步骤自回归推理转变为单步前向推理以减少延迟。实验结果表明,DualCoT-VLA在LIBERO、RoboCasa GR1基准测试以及实际平台上的表现优于现有模型。
3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing
Authors: Haoyu Zhen, Xiaolong Li, Yilin Zhao, Han Zhang, Sifei Liu, Kaichun Mo, Chuang Gan, Subhashree Radhakrishnan
First: 2026-03-23T17:59:14+00:00 · Latest: 2026-03-23T17:59:14+00:00
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
Summary / 总结
The research addresses the challenge of spatial understanding and layout consistency in fine-grained visual editing, where large language models and vision language models often fall short. It introduces a Structured Reasoning framework that reasons over an input scene graph and a natural-language instruction to generate an updated scene graph that satisfies the instruction while maintaining spatial coherence. On a new text-guided layout editing benchmark covering sorting, spatial alignment, and room editing, the training paradigm yields an average 15% improvement in IoU and a 25% reduction in center-distance error over CoT-SFT and vanilla GRPO baselines, and the best models achieve up to 20% higher mIoU than state-of-the-art zero-shot LLMs.
研究针对大型语言模型和视觉语言模型在精细视觉编辑任务中难以实现空间理解和布局一致性的问题,提出了一种结构化推理框架,利用场景图推理,根据自然语言指令生成更新后的场景图,并保持空间连贯性。与CoT-SFT和原始GRPO基线相比,该方法的IoU平均提高15%,中心距离误差降低25%;与最先进的零样本LLMs相比,最佳模型的mIoU最多高出20%,显示出其在空间精度和控制方面的显著提升。
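The two layout metrics named in this entry, IoU and center-distance error, can be illustrated with a short sketch. The abstract does not say how the benchmark defines them, so the box format (axis-aligned, given as center plus size) and the function names box_iou_3d and center_distance are assumptions made for illustration only.

```python
import numpy as np

def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes, each given as (cx, cy, cz, w, h, d)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    # Per-axis overlap length, clipped at zero when the boxes do not intersect.
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0.0, None)
    inter = overlap.prod()
    union = a[3:].prod() + b[3:].prod() - inter
    return float(inter / union) if union > 0 else 0.0

def center_distance(a, b):
    """Euclidean distance between the two box centers."""
    return float(np.linalg.norm(np.asarray(a[:3], dtype=float) - np.asarray(b[:3], dtype=float)))
```

Under these definitions, a predicted box that has the right size but is shifted keeps a small center distance while its IoU drops quickly, which is one reason to report both numbers.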
The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
Authors: Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham
First: 2026-03-23T17:58:02+00:00 · Latest: 2026-03-23T17:58:02+00:00
Comments: 26 pages, 35 figures
Abstract
Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.
中文标题/摘要
标题:视觉语言模型中空间推理的双重机制
许多多模态任务,如图像字幕和视觉问答,要求视觉语言模型(VLMs)将物体与其属性和空间关系联系起来。然而,尚不清楚这些联系在VLMs中的何处和如何进行计算。在本研究中,我们展示了VLMs依赖于两种并行机制来表示这些联系。在语言模型骨干中,中间层在视觉标记(对应于物体)之上表示内容无关的空间关系。然而,这种机制在塑造模型预测方面只起次要作用。相反,空间信息的主要来源在于视觉编码器,其表示编码了物体的布局,并直接被语言模型骨干利用。值得注意的是,这种空间信号是全局分布在视觉标记中的,延伸到物体区域之外的背景区域。我们展示了在整个图像标记中增强这些视觉衍生的空间表示可以提高自然图像的空间推理性能。综上所述,我们的结果阐明了空间联系在VLMs中的计算方式,并突显了视觉编码器在实现空间推理中的核心作用。
Summary / 总结
This study investigates where and how vision-language models (VLMs) compute associations between objects and their spatial relations, as required by tasks like image captioning and visual question answering. It shows that VLMs rely on two concurrent mechanisms. In the language model backbone, intermediate layers represent content-independent spatial relations on top of the visual tokens corresponding to objects, but this mechanism plays only a secondary role in shaping predictions. The dominant source of spatial information is the vision encoder, whose representations encode object layout and are directly exploited by the backbone; this signal is distributed globally across visual tokens, extending beyond object regions into the background. Enhancing these vision-derived representations across all image tokens improves spatial reasoning on naturalistic images, underscoring the central role of the vision encoder.
该研究探讨了视觉语言模型(VLMs)在图像字幕和视觉问答等任务中在何处以及如何计算物体与其空间关系之间的联系。研究表明,VLMs依赖两种并行机制:一种位于语言模型主干中,其中间层在对应于物体的视觉标记之上表示内容无关的空间关系,但只起次要作用;另一种位于视觉编码器中,其表示编码了物体布局,是空间信息的主要来源。视觉编码器的空间信号全局分布在各个视觉标记中,并延伸到物体区域之外的背景区域。在所有图像标记上全局增强这些空间表示可以提高自然图像上的空间推理性能,突显了视觉编码器在VLMs空间推理中的核心作用。
Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
Authors: Alexandra Zelenin, Alexandra Zhuravlyova
First: 2026-03-23T17:57:24+00:00 · Latest: 2026-03-23T17:57:24+00:00
Comments: 30 pages, 15 figures, 15 tables, including appendices. Code and data at https://github.com/sockeye44/dorafactors
Abstract
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved.
We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice.
Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
中文标题/摘要
标题:DoRA的扩展:通过因子化范数和融合核函数实现高秩适应
权重分解低秩适应(DoRA)通过将权重幅度与方向解耦来扩展LoRA,但其前向传播需要计算W + sBA的逐行范数,而我们调查的每个主要框架都通过显式生成稠密的[d_out, d_in]乘积BA来实现这一计算。在d_in = 8192和秩r = 384的情况下,单个模块的范数计算在bf16下需要大约512 MB的临时工作内存,一旦涉及数百个适配模块和检查点,高秩DoRA在常见的单GPU设置上就变得代价高昂且往往不可行。
我们提出了两项系统层面的贡献。因子化范数将平方范数分解为基项、交叉项和Gram项,这些项可以通过O(d_out r + r^2)规模的中间量计算,从而无需生成稠密乘积。融合Triton内核将由四个内核组成的DoRA组合运算合并为单次遍历,使内存流量减少约4倍,并采用数值稳定的形式,避免在缩放因子接近1的区间(实践中幅度缩放值集中于此)出现灾难性抵消。
在六个8-32B视觉-语言模型(VLMs)和三种NVIDIA GPU(RTX 6000 PRO、H200、B200)上,在r = 384、bf16的设置下,融合实现的推理速度比Hugging Face PEFT的DoRA实现快1.5-2.0倍,梯度计算(不含优化器步骤)快1.5-1.9倍,峰值VRAM最多降低7 GB。在跨越四代架构的六种GPU(L40S、A100、RTX 6000 PRO、H200、B200、B300)上进行的微基准测试确认了1.5-2.7倍的组合内核加速。最终logit的余弦相似度在所有模型/GPU组合上均超过0.9999,多种子训练曲线在2000步内的平均每步损失差值不超过7.1 x 10^-4。
Summary / 总结
This paper addresses the memory and compute cost of high-rank Weight-Decomposed Low-Rank Adaptation (DoRA) by introducing a factored norm and fused Triton kernels. The factored norm decomposes the squared row-wise norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, which avoids materializing the dense product BA (about 512 MB of transient working memory per module at d_in = 8192 and r = 384 in bf16). The fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and improving numerical stability. Experiments on six 8-32B vision-language models across three NVIDIA GPUs show that the fused implementation is 1.5-2.0x faster for inference and 1.5-1.9x faster for gradient computation than Hugging Face PEFT's DoRA implementation, with up to 7 GB lower peak VRAM.
本文旨在解决高秩Weight-Decomposed Low-Rank Adaptation (DoRA)的计算挑战,通过引入分解范数和融合Triton内核来优化。分解范数将平方范数分解为基、交叉和格朗项,减少了密集矩阵乘法的需求,并消除了临时工作内存瓶颈。融合Triton内核进一步优化了DoRA的组合,减少了内存流量并避免了数值不稳定。在三个NVIDIA GPU上对六个视觉-语言模型的实验显示,融合实现比标准DoRA实现快1.5-2.0倍的推理速度,1.5-1.9倍的梯度计算速度,并且峰值VRAM使用量最多可减少7 GB。
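The base, cross, and Gram decomposition follows from expanding the squared norm of row i of W + sBA: ||w_i||^2 + 2s w_i·(b_i A) + s^2 b_i (A A^T) b_i^T. Below is a NumPy sketch of that identity, checked against the dense computation. It is not the authors' Triton implementation and does not reproduce their numerically stable form; the function name factored_row_norm and the shapes W [d_out, d_in], B [d_out, r], A [r, d_in] are assumptions chosen to match the notation in the abstract.

```python
import numpy as np

def factored_row_norm(W, A, B, s):
    """Row-wise L2 norm of W + s * (B @ A) without forming the dense B @ A."""
    base = np.einsum("ij,ij->i", W, W)          # base term: ||w_i||^2
    WA = W @ A.T                                 # [d_out, r] intermediate
    cross = np.einsum("ij,ij->i", WA, B)         # cross term: w_i . (b_i A)
    gram = A @ A.T                               # [r, r] Gram matrix of A
    quad = ((B @ gram) * B).sum(axis=1)          # Gram term: b_i G b_i^T
    sq = base + 2.0 * s * cross + (s ** 2) * quad
    return np.sqrt(np.maximum(sq, 0.0))          # clamp tiny negatives from rounding
```

The largest intermediates are [d_out, r] and [r, r], in line with the O(d_out r + r^2) figure quoted in the abstract. Because W is frozen during adapter training, the base term could also be computed once and cached, although whether the paper does so is not stated here.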
Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration
Authors: Zakaria Mhammedi, James Cohan
First: 2026-03-23T17:56:52+00:00 · Latest: 2026-03-23T17:56:52+00:00
Abstract
The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before.
中文标题/摘要
标题:解耦探索与策略优化:基于不确定性引导的树搜索方法在困难探索中的应用
发现的过程需要主动探索,即收集新的、有信息量的数据。然而,高效的自主探索仍然是一个尚未解决的重大问题。主流范式通过使用强化学习(RL)训练具有内在动机的智能体,最大化外在奖励和内在奖励的复合目标来应对这一挑战。我们认为这种方法带来了不必要的开销:虽然策略优化对于精确执行任务是必要的,但仅为了扩展状态覆盖范围而使用这种机制可能是低效的。在本文中,我们提出了一种新的范式,明确地将探索与利用分离,并在探索阶段绕过RL。我们的方法使用受Go-With-The-Winner算法启发的树搜索策略,并结合认知不确定性度量来系统地驱动探索。通过去除策略优化的开销,我们的方法在高难度Atari基准测试中的探索效率比标准的内在动机基线高出一个数量级。此外,我们证明了发现的轨迹可以使用现有的监督式反向学习算法提炼为可部署的策略,在不依赖领域特定知识的情况下,在Montezuma's Revenge、Pitfall!和Venture上以大幅优势取得了最先进的得分。最后,我们在稀疏奖励设置下,直接从图像观察出发,在没有专家演示或离线数据集的情况下解决了MuJoCo Adroit灵巧操作和AntMaze任务,展示了该框架在高维连续动作空间中的通用性。据我们所知,这是首次实现这一目标。
Summary / 总结
This paper addresses efficient autonomous exploration by proposing a paradigm that separates exploration from exploitation and bypasses RL during the exploration phase. The method pairs a tree-search strategy inspired by the Go-With-The-Winner algorithm with a measure of epistemic uncertainty to drive exploration. Without the overhead of policy optimization, it explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. The discovered trajectories can be distilled into deployable policies with existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without domain-specific knowledge. The framework also solves the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets.
本文提出了一种新的范式,将探索与策略优化分离,以解决强化学习中的高效自主探索问题。该方法使用由认知不确定性引导的树搜索策略,在探索阶段避免了策略优化的开销。与标准的内在动机方法相比,这种方法在高难度Atari基准测试中显著提高了探索效率。此外,发现的轨迹可以被提炼成可部署的策略,在Montezuma's Revenge、Pitfall!和Venture上实现了最先进的得分。该框架还能直接从图像观察出发,在高维连续动作空间中完成灵巧操作和迷宫导航任务,无需专家数据。
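A toy sketch of uncertainty-guided population search in the spirit of the approach described in this entry. It is not the paper's algorithm: the 1D chain environment, the random rollouts, the count-based novelty score standing in for epistemic uncertainty, and the name explore with all its parameters are assumptions made purely for illustration.

```python
import random
from collections import Counter

def explore(n_particles=16, rounds=30, horizon=5, seed=0):
    """Explore a toy chain of states 0, 1, 2, ... without any policy optimisation."""
    rng = random.Random(seed)
    visits = Counter()                      # visit counts per state
    particles = [0] * n_particles           # every particle starts at state 0
    for _ in range(rounds):
        # 1) Advance each particle with random actions (left or right on the chain).
        moved = []
        for state in particles:
            for _ in range(horizon):
                state = max(0, state + rng.choice((-1, 1)))
                visits[state] += 1
            moved.append(state)
        # 2) Score end states: rarely visited states count as more uncertain
        #    (a count-based proxy, assumed here for illustration).
        weights = [1.0 / (1 + visits[s]) ** 0.5 for s in moved]
        # 3) Resample the population: uncertain states are cloned more often,
        #    well-explored ones tend to be dropped.
        particles = rng.choices(moved, weights=weights, k=n_particles)
    return visits
```

Cloning a particle here is just copying an integer. In a real environment this step would need a simulator whose state can be saved and restored, which is an extra assumption of this sketch.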
DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution
Authors: Zhengyao Lv, Menghan Xia, Xintao Wang, Kwan-Yee K. Wong
Venue: CVPR 2026
First: 2026-03-23T17:56:17+00:00 · Latest: 2026-03-23T17:56:17+00:00
Comments: Accepted to CVPR 2026
Abstract
Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.
中文标题/摘要
标题:DUO-VSR:双流蒸馏的一步视频超分辨率
基于扩散的视频超分辨率(VSR)最近取得了显著的保真度,但仍面临高昂的采样成本。虽然分布匹配蒸馏(DMD)可以将扩散模型加速为一步生成,但将其直接应用于VSR往往会导致训练不稳定,且监督信号退化、不足。为了解决这些问题,我们提出DUO-VSR,这是一种基于双流蒸馏策略的三阶段框架,将分布匹配和对抗监督统一起来以实现一步VSR。首先,采用渐进引导蒸馏初始化,通过轨迹保持蒸馏来稳定后续训练。其次,双流蒸馏联合优化DMD流和真-假分数特征GAN(RFS-GAN)流,后者利用来自真实和虚假分数模型的判别特征提供互补的对抗监督。最后,偏好引导精炼阶段进一步使学生模型与感知质量偏好对齐。大量实验表明,DUO-VSR在视觉质量和效率方面优于之前的一步VSR方法。
Summary / 总结
DUO-VSR is a three-stage framework for one-step video super-resolution that addresses the training instability and insufficient supervision issues of existing methods. It uses a Dual-Stream Distillation strategy combining distribution matching and adversarial supervision. The framework includes a Progressive Guided Distillation Initialization, Dual-Stream Distillation, and a Preference-Guided Refinement stage. Experimental results show that DUO-VSR outperforms previous one-step VSR approaches in both visual quality and efficiency.
DUO-VSR 是一种用于一步视频超分辨率的三阶段框架,解决了现有方法中的训练不稳定及监督不足问题。该框架采用结合分布匹配和对抗监督的双流蒸馏策略。首先通过轨迹保持蒸馏稳定训练,然后联合优化 DMD 和 RFS-GAN 流以提供互补的监督,最后通过感知质量偏好进一步精炼学生模型。DUO-VSR 在视觉质量和效率上均优于之前的一步 VSR 方法。
GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning
Authors: Yixuan Luo, Feng Qiao, Zhexiao Xiong, Yanjing Li, Nathan Jacobs
First: 2026-03-23T17:55:52+00:00 · Latest: 2026-03-23T17:55:52+00:00
Abstract
Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame-flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.
中文标题/摘要
标题:GenOpticalFlow:一种用于无监督光流学习的生成式方法
光流估计是计算机视觉中的一个基本问题,但对昂贵的真值标注的依赖限制了监督方法的可扩展性。尽管无监督和半监督方法可以缓解这一问题,但它们所依赖的亮度恒定和平滑性假设往往带来不可靠的监督信号,导致在复杂现实场景中运动估计不准确。为克服这些限制,我们引入了GenOpticalFlow,一种新颖的框架,能够在无需人工标注的情况下合成大规模、完美对齐的帧-光流数据对,用于监督式光流训练。具体而言,我们的方法利用预训练的深度估计网络生成伪光流,并将其作为下一帧生成模型的条件输入,该模型经过训练可生成高保真、像素对齐的后续帧。这一过程能够创建大量具有精确运动对应关系的高质量合成数据。此外,我们提出了一种不一致像素过滤策略,用于识别并移除生成帧中的不可靠像素,有效提升在真实数据集上的微调性能。在KITTI2012、KITTI2015和Sintel上的大量实验表明,GenOpticalFlow取得了与现有无监督和半监督方法相当或更优的结果,突显了其作为可扩展且无需标注的光流学习解决方案的潜力。我们将在论文被接收后发布代码。
Summary / 总结
GenOpticalFlow is a generative framework that synthesizes large-scale, perfectly aligned frame-flow data pairs for supervised optical flow training without human annotations. It uses a pre-trained depth estimation network to generate pseudo optical flows, which condition a next-frame generation model that produces high-fidelity, pixel-aligned subsequent frames. An inconsistent pixel filtering strategy removes unreliable pixels from the generated frames, which improves fine-tuning performance on real-world datasets. On KITTI2012, KITTI2015, and Sintel, GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, making it a scalable and annotation-free solution for optical flow learning.
GenOpticalFlow 是一种生成式框架,利用预训练的深度估计网络生成伪光流,并将其作为下一帧生成模型的条件输入,从而合成大规模、完美对齐的帧-流数据对,用于有监督的光流训练,无需人工标注。此外,还提出了一种不一致像素过滤策略,用于移除生成帧中的不可靠像素,以提升在真实数据集上的微调性能。在 KITTI2012、KITTI2015 和 Sintel 上的实验表明,GenOpticalFlow 达到了与现有无监督和半监督方法相当或更优的结果,使其成为光流学习的可扩展且无需标注的解决方案。
TiCo: Time-Controllable Training for Spoken Dialogue Models
Authors: Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass
First: 2026-03-23T17:51:40+00:00 · Latest: 2026-03-23T17:51:40+00:00
Abstract
We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
中文标题/摘要
标题:TiCo:时间可控训练方法在口语对话模型中的应用
我们提出了一种名为TiCo的简单后训练方法,使口语对话模型(SDMs)能够遵循时间限制指令并生成可控时长的响应。这一能力对于语音助手和交互代理等现实世界口语语言系统来说非常有价值,因为控制响应时长可以提高交互质量。然而,尽管现有模型能够生成自然的口语响应,但它们缺乏时间意识,难以遵循与时长相关的指令(例如,“请生成一个大约持续15秒的响应”)。通过对开源和商用SDMs的实证评估,我们发现它们经常无法满足此类时间控制要求。TiCo通过使模型在生成过程中通过口语时间标记(STM,例如<10.6秒>)估计已用时间来解决这一限制,这些标记有助于模型保持时间意识并调整剩余内容以满足目标时长。TiCo简单且高效:它只需要少量数据,无需额外的问题-答案对,而是依赖自我生成和强化学习。实验结果表明,TiCo在保持响应质量的同时显著提高了对时长约束的遵守程度。
Summary / 总结
TiCo is a post-training method that enables spoken dialogue models to follow time-constrained instructions and generate responses with controllable duration. This is valuable for improving interaction quality in voice assistants and interactive agents. TiCo uses Spoken Time Markers to help models maintain time awareness and adjust content to meet target durations. Experiments show TiCo significantly improves adherence to duration constraints without compromising response quality.
TiCo 是一种后训练方法,使对话模型能够生成具有可控时长的响应,从而在语音助手和交互代理中提高交互质量。它使用口语时间标记来帮助模型估计已用时间并相应调整内容。实验结果显示,TiCo 显著提高了对时长约束的遵守程度,同时保持了响应质量。
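The Spoken Time Marker format quoted in the abstract (e.g., <10.6 seconds>) can be illustrated with a small parsing sketch. This is not the authors' code: the function names and the assumption that markers appear inline in a text transcript are hypothetical, chosen only to show how elapsed-time markers could be read back out of a generated response.

```python
import re

# Assumed marker shape, taken from the abstract's example "<10.6 seconds>".
MARKER = re.compile(r"<(\d+(?:\.\d+)?) seconds>")

def extract_time_markers(transcript: str) -> list[float]:
    """Return elapsed-time values (in seconds) in order of appearance."""
    return [float(m) for m in MARKER.findall(transcript)]

def strip_time_markers(transcript: str) -> str:
    """Remove markers and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", MARKER.sub("", transcript)).strip()

sample = "Sure, here is a quick tip. <4.2 seconds> Drink water often. <10.6 seconds>"
print(extract_time_markers(sample))
print(strip_time_markers(sample))
```

The last extracted value gives the model's own estimate of total speaking time, which could be compared against the requested duration.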
The Price of Progress: Price Performance and the Future of AI
Authors: Hans Gundlach, Jayson Lynch, Matthias Mertens, Neil Thompson
First: 2025-11-28T18:47:33+00:00 · Latest: 2026-03-23T17:48:22+00:00
Abstract
Language models have seen enormous progress on advanced benchmarks in recent years, but much of this progress has only been possible by using more costly models. Benchmarks may therefore present a warped picture of progress in practical capabilities *per dollar*. To remedy this, we use data from Artificial Analysis and Epoch AI to form the largest dataset of current and historical prices to run benchmarks to date. We find that the price for a given level of benchmark performance has decreased remarkably fast, around $5\times$ to $10\times$ per year, for frontier models on knowledge, reasoning, math, and software engineering benchmarks. These reductions in the cost of AI inference are due to economic forces, hardware efficiency improvements, and algorithmic efficiency improvements. Isolating out open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency progress is around $3\times$ per year. However, at the same time, the price of running frontier models is rising between $3\times$ to $18\times$ per year due to bigger models and larger reasoning demands. Finally, we recommend that evaluators both publicize and take into account the price of benchmarking as an essential part of measuring the real-world impact of AI.
中文标题/摘要
标题:进步的代价:价格性能与AI的未来
近年来,语言模型在高级基准测试中取得了巨大的进步,但这些进步中的许多都只能通过使用更昂贵的模型来实现。因此,基准测试可能无法准确反映每美元的实际能力进步。为了纠正这一点,我们使用来自Artificial Analysis和Epoch AI的数据,形成了迄今为止最大的运行基准测试的当前和历史价格数据集。我们发现,对于知识、推理、数学和软件工程基准测试中的前沿模型,达到给定水平的基准性能的价格下降速度非常快,大约每年5到10倍。这些AI推理成本的降低是由于经济力量、硬件效率改进和算法效率改进所致。通过隔离开放模型以控制竞争效应,并除以硬件价格下降,我们估计算法效率的进步约为每年3倍。然而,同时,运行前沿模型的价格正在以每年3到18倍的速度上升,这主要是由于更大的模型和更大的推理需求。最后,我们建议评估者既要公布也要考虑基准测试的价格,将其作为衡量AI实际影响的重要部分。
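The "times per year" rates in this abstract can be made concrete with a short arithmetic sketch. It assumes a constant exponential trend between two price observations and reads "dividing by hardware price declines" as a simple ratio of annual factors. Both function names and all numbers are illustrative assumptions, not values or methods taken from the paper.

```python
import math

def annual_price_decline(price_start: float, price_end: float, years: float) -> float:
    """Factor by which price falls per year, assuming a constant exponential trend."""
    if price_start <= 0 or price_end <= 0 or years <= 0:
        raise ValueError("prices and years must be positive")
    return (price_start / price_end) ** (1.0 / years)

def residual_factor(total_factor: float, hardware_factor: float) -> float:
    """Simplified reading of 'dividing by hardware price declines'."""
    return total_factor / hardware_factor

# Illustrative numbers only: a price of $100 falling to $4 over two years.
factor = annual_price_decline(100.0, 4.0, 2.0)
print(f"{factor:.1f}x cheaper per year")
```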
Greater accessibility can amplify discrimination in generative AI
Authors: Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou, Valentin Hofmann, Katharina von der Wense, Anne Lauscher
First: 2026-03-23T17:47:44+00:00 · Latest: 2026-03-23T17:47:44+00:00
Comments: Preprint
Abstract
Hundreds of millions of people rely on large language models (LLMs) for education, work, and even healthcare. Yet these models are known to reproduce and amplify social biases present in their training data. Moreover, text-based interfaces remain a barrier for many, for example, users with limited literacy, motor impairments, or mobile-only devices. Voice interaction promises to expand accessibility, but unlike text, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment. Here we show that audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, and amplifying bias beyond that observed in text-based interaction. Thus, voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues. Complementary survey evidence ($n=1,000$) shows that infrequent chatbot users are most hesitant to undisclosed attribute inference and most likely to disengage when such practices are revealed. To demonstrate a potential mitigation strategy, we show that pitch manipulation can systematically regulate gender-discriminatory outputs. Overall, our findings reveal a critical tension in AI development: efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.
中文标题/摘要
标题:更大的可访问性可能会在生成式AI中放大歧视
数亿人依赖大型语言模型(LLMs)进行教育、工作甚至医疗保健。然而,这些模型在训练数据中存在并放大了社会偏见。此外,基于文本的界面仍然是许多人的障碍,例如,识字能力有限、运动障碍或仅使用移动设备的用户。语音交互有望扩大可访问性,但与文本不同,语音携带身份线索,用户难以掩盖,这引发了担忧,即可访问性提升是否可能以牺牲公平对待为代价。我们显示,配备音频功能的LLMs表现出系统性性别歧视,仅基于说话者的声音,就倾向于使用性别刻板印象的形容词和职业,并且放大了比基于文本交互中观察到的更多的偏见。因此,语音界面不仅将文本模型扩展到新的模态,还引入了与副语言线索相关的独特偏见机制。补充调查证据(n=1,000)显示,不经常使用聊天机器人的用户最不愿意进行未披露属性推断,并且最有可能在这些做法被揭示时退出。为了展示一种潜在的缓解策略,我们展示了音高操纵可以系统地调节性别歧视输出。总体而言,我们的研究揭示了AI发展中一个关键的紧张关系:通过语音界面扩展可访问性的同时,也创造了新的歧视途径,要求公平性和可访问性必须同时得到解决。
Summary / 总结
The study investigates how voice interfaces for large language models (LLMs) can amplify gender discrimination despite their promise of increased accessibility. Audio-enabled LLMs exhibit systematic gender bias, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of the speaker's voice, and amplifying bias beyond that observed in text-based interaction. A complementary survey (n=1,000) finds that infrequent chatbot users are the most hesitant about undisclosed attribute inference and the most likely to disengage when such practices are revealed. Pitch manipulation is shown to systematically regulate gender-discriminatory outputs, highlighting the need to address fairness and accessibility together in AI development.
研究发现,语音启用的大语言模型(LLMs)表现出性别歧视,基于说话人的声音,回应倾向于性别刻板印象的形容词和职业。研究指出,虽然语音界面可以增强无障碍性,但也引入了新的偏见机制。一项针对1,000名参与者的调查显示,不经常使用聊天机器人的用户对属性推断更为犹豫,并且在这些做法被揭示时更有可能退出。研究建议,通过音高操纵可以作为一种潜在的缓解策略,以调节性别歧视的输出。
Scalable Prompt Routing via Fine-Grained Latent Task Discovery
Authors: Yunyi Zhang, Soji Adeshina, Sheng Guan, Ashwin Ganesh, Zhen Han, Vassilis N. Ioannidis, Huzefa Rangwala, George Karypis
First: 2026-03-19T19:15:51+00:00 · Latest: 2026-03-23T17:46:56+00:00
Abstract
Prompt routing dynamically selects the most appropriate large language model from a pool of candidates for each query, optimizing performance while managing costs. As model pools scale to include dozens of frontier models with narrow performance gaps, existing approaches face significant challenges: manually defined task taxonomies cannot capture fine-grained capability distinctions, while monolithic routers struggle to differentiate subtle differences across diverse tasks. We propose a two-stage routing architecture that addresses these limitations through automated fine-grained task discovery and task-aware quality estimation. Our first stage employs graph-based clustering to discover latent task types and trains a classifier to assign prompts to discovered tasks. The second stage uses a mixture-of-experts architecture with task-specific prediction heads for specialized quality estimates. At inference, we aggregate predictions from both stages to balance task-level stability with prompt-specific adaptability. Evaluated on 10 benchmarks with 11 frontier models, our method consistently outperforms existing baselines and surpasses the strongest individual model while incurring less than half its cost.
中文标题/摘要
标题:通过细粒度潜在任务发现实现可扩展的提示路由
提示路由动态地从候选模型池中选择最适合每个查询的大语言模型,优化性能同时管理成本。随着模型池扩展到包括数十个前沿模型且性能差距缩小,现有方法面临重大挑战:手动定义的任务分类无法捕捉细粒度的能力差异,而单一的路由器难以区分多样任务中的细微差异。我们提出了一种两阶段路由架构,通过自动化的细粒度任务发现和任务感知的质量评估来解决这些限制。第一阶段使用图聚类发现潜在任务类型并训练分类器将提示分配给发现的任务。第二阶段使用专家混合架构,带有特定任务的预测头,进行专门的质量估计。在推理时,我们从两个阶段聚合预测以平衡任务级别的稳定性与提示特定的适应性。在10个基准测试和11个前沿模型上评估,我们的方法在所有基准测试中都优于现有基线,并且成本不到最强单个模型的一半,同时性能更优。
Summary / 总结
The research aims to improve prompt routing for large language models as model pools scale to dozens of frontier models with narrow performance gaps. It proposes a two-stage routing architecture: the first stage uses graph-based clustering to discover fine-grained latent task types and trains a classifier to assign prompts to them, and the second stage uses a mixture-of-experts architecture with task-specific prediction heads for quality estimation. Evaluated on 10 benchmarks with 11 frontier models, the method outperforms existing baselines and surpasses the strongest individual model at less than half its cost.
研究旨在通过解决现有方法的局限性来改进大型语言模型中的提示路由。提出了一种两阶段路由架构,使用基于图的聚类来发现潜在的任务类型,并训练分类器将提示分配给这些任务。第二阶段采用具有任务特定预测头的混合专家架构来提供专门的质量估计。在10个基准测试、11个前沿模型的评估中,该方法优于现有基线,并超越最强的单个模型,同时成本不到其一半。
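The abstract states that predictions from both stages are aggregated at inference but does not give the formula. The sketch below shows one possible aggregation, a weighted blend of a task-level and a prompt-level quality estimate with an optional cost penalty. The `Candidate` fields, the `alpha` and `cost_weight` parameters, and all numbers are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    task_quality: float    # stage 1: estimate for the prompt's discovered task type
    prompt_quality: float  # stage 2: prompt-specific estimate
    cost: float            # relative cost per query

def route(candidates: list[Candidate], alpha: float = 0.5, cost_weight: float = 0.0) -> str:
    """Pick the model with the best blended quality minus a cost penalty."""
    def score(c: Candidate) -> float:
        blended = alpha * c.task_quality + (1.0 - alpha) * c.prompt_quality
        return blended - cost_weight * c.cost
    return max(candidates, key=score).name

pool = [
    Candidate("large", task_quality=0.90, prompt_quality=0.80, cost=1.0),
    Candidate("small", task_quality=0.80, prompt_quality=0.85, cost=0.2),
]
print(route(pool))                   # quality only
print(route(pool, cost_weight=0.2))  # cost-aware
```

A larger `alpha` leans on the stable task-level estimate, while a smaller one leans on the prompt-specific estimate.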
Measuring Iterative Temporal Reasoning with Time Puzzles
Authors: Zhengxiang Wang, Zeyu Dong
First: 2026-01-12T02:39:26+00:00 · Latest: 2026-03-23T17:44:47+00:00
Comments: 11 pages, 4 tables, 3 figures
Abstract
Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations and may admit one or multiple valid dates. The puzzles are algorithmically generated, enabling controlled and continual evaluation. Across 13 LLMs, even the best model (GPT-5) achieves only 55.3% accuracy without tools, despite using easily searchable facts. While web search improves performance, models perform substantially better when constraints are rewritten with explicit dates, removing the need for factual lookup. These results reveal a gap in reliable tool use for iterative temporal reasoning.
中文标题/摘要
标题:使用时间谜题衡量迭代时间推理
工具使用,如网络搜索,已成为大型语言模型(LLM)中的一项标准能力。然而,现有的基准主要在静态、无工具使用的环境中评估时间推理,这与LLM在实际中如何进行时间推理相差甚远。我们引入了时间谜题,这是一种基于约束的日期推断任务,用于评估使用工具的迭代时间推理。每个谜题结合了事实性的时间锚点与(跨文化)日历关系,并可能允许一个或多个有效日期。谜题是通过算法生成的,这使得评估可以受控且持续进行。在13个LLM中,即使最好的模型(GPT-5)在没有工具的情况下也仅能达到55.3%的准确率,尽管使用了易于搜索的事实。虽然网络搜索可以提高性能,但当约束被重写为明确的日期时,模型的表现显著提高,从而消除了事实查找的需要。这些结果揭示了在迭代时间推理中可靠工具使用的能力差距。
Summary / 总结
The study introduces Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools, addressing the gap in existing benchmarks that evaluate temporal reasoning mainly in static, non-tool-using settings. Across 13 LLMs, the best model (GPT-5) achieves only 55.3% accuracy without tools. Web search improves performance, but models do substantially better when constraints are rewritten with explicit dates so that no factual lookup is needed. This reveals a gap in reliable tool use for iterative temporal reasoning.
研究旨在评估大型语言模型(LLMs)在使用工具的实际场景中的迭代时间推理能力。引入了基于约束的日期推断任务Time Puzzles,结合了事实时间锚点和历法关系。在13个LLM中,最佳模型(GPT-5)在没有工具的情况下仅达到55.3%的准确率。网络搜索可以提高性能,但当约束被重写为明确日期、无需事实查找时,模型的表现显著更好。这表明LLMs在迭代时间推理中的可靠工具使用存在差距。
EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild
Authors: Jeffri Murrugarra-Llerena, Pranav Chitale, Zicheng Liu, Kai Ao, Yujin Ham, Guha Balakrishnan, Paola Cascante-Bonilla
First: 2026-03-23T17:43:49+00:00 · Latest: 2026-03-23T17:43:49+00:00
Comments: Project Page: https://lab-spell.github.io/EgoGroups/
Abstract
Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for person and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLM/LLMs and supervised models on their group detection capabilities. We found several interesting findings, including VLMs and LLMs can outperform supervised baselines in a zero-shot setting, while crowd density and cultural regions clearly influence model performance.
中文标题/摘要
标题:EgoGroups:真实场景中人群社交群体检测基准
人群社交群体检测,即识别参与互惠人际互动的人类(例如,家庭成员、朋友、顾客和商家),是社交智能的关键组成部分,对于在世界中进行交易的代理至关重要。现有的少数社交群体检测基准受到场景多样性低和依赖第三人视角摄像机来源(例如,监控录像)的限制。因此,这些基准通常缺乏在不同文化背景和不受限制的环境中对群体如何形成和演变的现实评估。为解决这一问题,我们引入了EgoGroups,这是一个第一人称视角的数据集,用于捕捉世界各地城市的社交动态。EgoGroups覆盖了65个国家,包括低、中、高人群密度设置,在四种天气/时间条件下的场景。我们包括了密集的人类注释,用于个人和社交群体,以及丰富的地理和场景元数据。使用此数据集,我们对最先进的VLM/LLMs和监督模型进行了广泛的评估,以测试它们的群体检测能力。我们发现了一些有趣的结果,包括在零样本设置下,VLMs和LLMs可以超越监督基线,而人群密度和文化区域明显影响模型性能。
Summary / 总结
EgoGroups is a first-person view dataset designed to evaluate social group detection in diverse real-world settings. It captures social dynamics in cities across 65 countries, covering low, medium, and high-crowd settings under four weather/time-of-day conditions. The dataset includes dense human annotations for persons and social groups, along with rich geographic and scene metadata. Experiments show that vision-language models and large language models can outperform supervised baselines in a zero-shot setting, and that crowd density and cultural region clearly influence model performance.
EgoGroups 是一个第一人称视角的数据集,旨在评估在多样化的现实环境中的人群组检测能力。它在65个国家捕捉社会动态,并在不同条件下提供密集的注释和丰富的元数据。研究发现,视觉-语言模型和大型语言模型在零样本设置下优于监督模型,而人群密度和文化区域明显影响模型性能。
MemDLM: Memory-Enhanced DLM Training
Authors: Zehua Pei, Hui-Ling Zhen, Weizhe Lin, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu
First: 2026-03-23T17:39:56+00:00 · Latest: 2026-03-23T17:39:56+00:00
Abstract
Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: https://github.com/JarvisPei/MemDLM.
中文标题/摘要
标题:MemDLM:增强记忆的DLM训练
扩散语言模型(DLMs)相对于自回归(AR)模型具有吸引力的优势,例如全注意并行解码和灵活生成。然而,它们遭受训练-推理不匹配的问题:DLMs在静态的单步掩码预测目标下进行训练,但在部署时通过多步渐进去噪轨迹进行。我们提出了MemDLM(增强记忆的DLM),通过双层优化将模拟的去噪过程嵌入到训练中来缩小这一差距。内层循环更新一组快速权重,形成参数记忆,捕捉每个样本的局部轨迹经验,而外层循环在基于此记忆的条件下更新基础模型。通过将记忆压力从令牌表示转移到参数上,MemDLM实现了更快的收敛和更低的训练损失。此外,内层循环可以在推理时重新启用作为适应步骤,从而在长上下文理解上获得额外收益。我们发现,当在推理时激活时,这种参数记忆充当了一种涌现的权重内检索机制,帮助MemDLM进一步减少在具有挑战性的大海捞针(Needle-in-a-Haystack)检索任务中的令牌级注意力瓶颈。代码:https://github.com/JarvisPei/MemDLM.
Summary / 总结
MemDLM is proposed to address the train-inference mismatch in DLMs by embedding a simulated denoising process into training through Bi-level Optimization. An inner loop updates fast weights to form a Parametric Memory, capturing the local trajectory experience of each sample, while an outer loop updates the base model based on this memory. This method leads to faster convergence and lower training loss. Additionally, the Parametric Memory can be re-enabled at inference time to improve long-context understanding and reduce token-level attention bottlenecks on retrieval tasks.
MemDLM 通过在训练过程中通过双层优化嵌入模拟去噪过程来解决 DLM 的训练推理不匹配问题。内层循环更新快速权重形成参数记忆,捕捉每个样本的局部轨迹经验,而外层循环基于此记忆更新基础模型。这种方法导致更快的收敛和更低的训练损失,并且在推理时可以重新启用参数记忆以提高长上下文理解能力并减少检索任务中的标记级注意力瓶颈。
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Authors: Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Zhou Zhao
First: 2026-03-23T17:26:35+00:00 · Latest: 2026-03-23T17:26:35+00:00
Abstract
Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
中文标题/摘要
标题:SpatialReward:可验证的空间奖励建模以实现文本到图像生成中的细粒度空间一致性
通过强化学习(RL)实现的文本到图像(T2I)生成的最新进展得益于评估语义对齐和视觉质量的奖励模型。然而,大多数现有的奖励模型对细粒度的空间关系关注有限,经常生成整体上看似合理的图像,但包含物体定位的不准确之处。在本文中,我们提出了**SpatialReward**,一种明确设计用于评估生成图像的空间布局的可验证奖励模型。SpatialReward 采用多阶段流水线:一个**提示分解器**从自由形式的提示中提取实体、属性和空间元数据;专家检测器提供准确的视觉定位,以确定物体的位置和属性;视觉语言模型在已定位的观察上进行链式推理,以评估基于规则的方法难以处理的复杂空间关系。为了更全面地评估生成图像中的空间关系,我们引入了**SpatRelBench**,该基准涵盖了物体属性、方向、物体间关系以及渲染文本的位置。在Stable Diffusion和FLUX上的实验表明,将SpatialReward纳入RL训练中可以一致地提高空间一致性和整体生成质量,结果与人类判断更为一致。这些发现表明,可验证的奖励模型在实现文本到图像生成模型中更准确和可控的优化方面具有巨大潜力。
Summary / 总结
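SpatialReward is a verifiable reward model for evaluating spatial layouts in text-to-image generation, motivated by existing reward models that pay limited attention to fine-grained spatial relationships. Its multi-stage pipeline combines a Prompt Decomposer that extracts entities, attributes, and spatial metadata, expert detectors that ground object positions and attributes, and a vision-language model that applies chain-of-thought reasoning to assess complex spatial relations. The authors also introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments.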
研究旨在通过解决现有奖励模型忽视精细空间关系的问题,提高文本生成图像的空间一致性。方法包括多阶段管道:Prompt Decomposer提取空间元数据,专家检测器提供准确的视觉定位,视觉语言模型评估复杂的空间关系。实验表明,将SpatialReward集成到RL训练中可以提高空间一致性和整体生成质量,更接近人类判断。
VL-Nav: A Neuro-Symbolic Approach for Reasoning-based Vision-Language Navigation
Authors: Yi Du, Taimeng Fu, Zhipeng Zhao, Shaoshu Su, Zitong Zhan, Qiwei Du, Zhuoqun Chen, Bowen Li, Chen Wang
First: 2025-02-02T21:44:15+00:00 · Latest: 2026-03-23T17:26:04+00:00
Abstract
Navigating unseen, large-scale environments based on complex and abstract human instructions remains a formidable challenge for autonomous mobile robots. Addressing this requires robots to infer implicit semantics and efficiently explore large-scale task spaces. However, existing methods, ranging from end-to-end learning to foundation model-based modular architectures, often lack the capability to decompose complex tasks or employ efficient exploration strategies, leading to robot aimless wandering or target recognition failures. To address these limitations, we propose VL-Nav, a neuro-symbolic (NeSy) vision-language navigation system. The proposed system intertwines neural reasoning with symbolic guidance through two core components: (1) a NeSy task planner that leverages a symbolic 3D scene graph and image memory system to enhance the vision language models' (VLMs) neural reasoning capabilities for task decomposition and replanning; and (2) a NeSy exploration system that couples neural semantic cues with the symbolic heuristic function to efficiently gather the task-related information while minimizing unnecessary repeat travel during exploration. Validated on the DARPA TIAMAT Challenge navigation tasks, our system achieved an 83.4% success rate (SR) in indoor environments and 75% in outdoor scenarios. VL-Nav achieved an 86.3% SR in real-world experiments, including a challenging 483-meter run. Finally, we validate the system with complex instructions in a 3D multi-floor scenario.
中文标题/摘要
标题:VL-Nav:一种基于神经符号的方法进行基于推理的视觉语言导航
基于复杂的抽象人类指令自主导航未知的大规模环境仍然是自主移动机器人的一大挑战。解决这一问题需要机器人推断隐含的语义并高效探索大规模的任务空间。然而,现有的方法,从端到端学习到基于基础模型的模块化架构,往往缺乏分解复杂任务或采用高效探索策略的能力,导致机器人盲目游荡或目标识别失败。为了解决这些限制,我们提出了VL-Nav,一种神经符号(NeSy)视觉语言导航系统。该系统通过两个核心组件将神经推理与符号指导相结合:(1)一个NeSy任务规划器,利用符号3D场景图和图像记忆系统增强视觉语言模型(VLMs)的神经推理能力,以实现任务分解和重新规划;(2)一个NeSy探索系统,将神经语义线索与符号启发式函数耦合,以高效地收集任务相关信息,同时在探索过程中尽量减少不必要的重复行进。在DARPA TIAMAT挑战导航任务中,该系统在室内环境中的成功率(SR)为83.4%,在室外场景中的成功率为75%。在真实世界实验中,VL-Nav实现了86.3%的成功率,包括一次具有挑战性的483米行程。最后,我们使用复杂的指令在3D多层场景中验证了该系统。
Summary / 总结
VL-Nav is a neuro-symbolic approach designed to help robots navigate based on complex human instructions in large-scale environments. It integrates neural reasoning with symbolic guidance through a task planner and an exploration system. The task planner uses a symbolic 3D scene graph and image memory to improve the neural reasoning of vision-language models for task decomposition and replanning, while the exploration system efficiently gathers task-related information using neural semantic cues and a symbolic heuristic function. Experimental results show that VL-Nav achieved an 83.4% success rate in indoor environments, 75% in outdoor scenarios, and 86.3% in real-world experiments, including a 483-meter run.
VL-Nav 是一种神经符号方法,旨在帮助自主机器人根据复杂的指令进行导航。该方法结合了神经推理和符号指导,通过任务规划器和探索系统实现。任务规划器利用符号3D场景图和图像记忆来增强视觉语言模型分解任务和重新规划的能力,而探索系统则高效地收集任务相关信息,减少不必要的重复行进。实验结果显示,VL-Nav 在室内环境中的成功率为 83.4%,在户外环境中的成功率为 75%,在真实世界实验(包括一次 483 米的行程)中成功率为 86.3%。
Adapting Self-Supervised Speech Representations for Cross-lingual Dysarthria Detection in Parkinson's Disease
Authors: Abner Hernandez, Eunjung Yeo, Kwanghee Choi, Chin-Jou Li, Zhengjun Yue, Rohan Kumar Das, Jan Rusz, Mathew Magimai Doss, Juan Rafael Orozco-Arroyave, Tomás Arias-Vergara, Andreas Maier, Elmar Nöth, David R. Mortensen, David Harwath, Paula Andrea Perez-Toro
First: 2026-03-23T17:23:39+00:00 · Latest: 2026-03-23T17:23:39+00:00
Comments: Submitted to Interspeech 2026
Abstract
The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson's disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.
中文标题/摘要
标题:自监督语音表示的适应性在帕金森病跨语言构音障碍检测中的应用
由于构音障碍语音数据的有限性,跨语言检测成为一个重要但具有挑战性的问题。一个主要困难是语音表示往往编码了语言相关的结构,这会干扰构音障碍的检测。我们提出了一种表示级的语言转移(LS),通过从健康对照语音中估计的基于质心的向量适应,将源语言的自监督语音表示与目标语言的分布对齐。我们在捷克语、德语和西班牙语的帕金森病语音数据集中的口述DDK录音下,分别在跨语言和多语言设置下评估了该方法。LS在跨语言设置中显著提高了灵敏度和F1值,在多语言设置中则带来了较小但一致的收益。进一步的表示分析表明,LS减少了嵌入空间中的语言身份,支持了LS消除了语言相关结构的解释。
Summary / 总结
The research aims to address the challenge of cross-lingual dysarthria detection in Parkinson's disease due to limited dysarthric speech data. The method involves a representation-level language shift (LS) that aligns self-supervised speech representations from a source language with the target language using centroid-based vector adaptation from healthy-control speech. The approach was evaluated on Parkinson's disease speech datasets in Czech, German, and Spanish, showing significant improvements in sensitivity and F1 score in cross-lingual settings, with smaller but consistent gains in multilingual settings. Analysis of the representations indicates that LS reduces language identity in the embedding space, suggesting it removes language-dependent structure.
论文旨在解决由于构音障碍语音数据有限而难以跨语言检测帕金森病患者构音障碍的挑战。提出了一种称为表示级语言转移(LS)的方法,该方法使用健康对照的语音数据,通过基于质心的向量适应,将源语言的自监督语音表示与目标语言的分布对齐。该方法在捷克语、德语和西班牙语的口述DDK录音上进行了评估,结果显示在跨语言设置中显著提高了灵敏度和F1分数,在多语言设置中则有较小但一致的提升。进一步的表示分析表明,LS减少了嵌入空间中的语言身份信息,支持了其去除语言相关结构的解释。
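The centroid-based adaptation described above can be sketched as a simple translation of embeddings. This assumes the shift is the difference between the healthy-control centroids of the target and source languages, which is one plausible reading of the abstract and not a confirmed detail of the paper. The function names and toy vectors are hypothetical.

```python
def centroid(vectors: list[list[float]]) -> list[float]:
    """Mean vector of a non-empty list of equal-length embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def language_shift(source_feats, source_healthy, target_healthy):
    """Translate source-language embeddings by the gap between healthy-control centroids."""
    mu_src = centroid(source_healthy)
    mu_tgt = centroid(target_healthy)
    shift = [t - s for s, t in zip(mu_src, mu_tgt)]
    return [[x + d for x, d in zip(vec, shift)] for vec in source_feats]

# Toy 2-D embeddings standing in for self-supervised speech representations.
src_hc = [[1.0, 2.0], [3.0, 4.0]]
tgt_hc = [[0.0, 0.0], [2.0, 2.0]]
shifted = language_shift(src_hc, src_hc, tgt_hc)
print(centroid(shifted))
```

After the shift, the source healthy-control centroid coincides with the target one, so a classifier trained on shifted source data sees features centered like the target language.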
Agnostics: Learning to Code in Any Programming Language via Reinforcement with a Universal Learning Environment
Authors: Aleksander Boruch-Gruszecki, Yangtian Zi, Zixuan Wu, Tejas Oberoi, Carolyn Jane Anderson, Joydeep Biswas, Arjun Guha
Venue: ICLR 2026
First: 2025-08-06T20:30:55+00:00 · Latest: 2026-03-23T17:23:32+00:00
Comments: 30 pages, 19 figures. Accepted at ICLR 2026. For data, code, artifacts, see https://agnostics.abgru.me
Abstract
Large language models (LLMs) already excel at writing code in high-resource languages such as Python and JavaScript, yet stumble on low-resource languages that remain essential to science and engineering. Besides the obvious shortage of pre-training data, post-training itself is a bottleneck: every new language seems to require new datasets, test harnesses, and reinforcement-learning (RL) infrastructure.
We introduce Agnostics, a language-agnostic post-training pipeline that eliminates this per-language engineering. The key idea is to judge code solely by its externally observable behavior, so a single verifier can test solutions written in any language. Concretely, we (i) use an LLM to rewrite existing unit-test datasets into an I/O format, (ii) supply a short configuration that tells the verifier how to compile and run a target language, and (iii) apply reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment.
Applied to five low-resource languages--Lua, Julia, R, OCaml, and Fortran--Agnostics (1) improves Qwen-3 4B to performance that rivals other 16B-70B open-weight models; (2) scales cleanly to larger and diverse model families (Qwen-3 8B, DeepSeek Coder 6.7B Instruct, Phi 4 Mini); and (3) for ${\le} 16$B parameter models, sets new state-of-the-art pass@1 results on MultiPL-E and a new multi-language version of LiveCodeBench that we introduce.
We release the language-agnostic training datasets (Ag-MBPP-X, Ag-Codeforces-X, Ag-LiveCodeBench-X), training code, and ready-to-use configurations, making RL post-training in any programming language as simple as editing a short YAML file.
中文标题/摘要
标题:Agnostics:通过通用学习环境强化学习任何编程语言
大型语言模型(LLMs)已经在高资源语言如Python和JavaScript的代码编写方面表现出色,但在低资源语言方面却遇到困难,这些语言对于科学和工程仍然至关重要。除了预训练数据的明显短缺,后训练本身也是一个瓶颈:每种新语言似乎都需要新的数据集、测试框架和强化学习(RL)基础设施。
我们引入了Agnostics,这是一种语言无关的后训练管道,消除了每种语言的工程需求。关键思想是仅通过外部可观察的行为来评判代码,因此单一验证器可以测试用任何语言编写的解决方案。具体来说,我们(i)使用LLM将现有的单元测试数据集重写为I/O格式,(ii)提供一个简短的配置来告诉验证器如何编译和运行目标语言,(iii)在稳健的代码执行环境中应用可验证奖励的强化学习(RLVR)。
应用于五种低资源语言——Lua、Julia、R、OCaml和Fortran——Agnostics(1)将Qwen-3 4B提升到与16B-70B开放权重模型相当的性能;(2)可以干净地扩展到更大的多样化模型家族(Qwen-3 8B、DeepSeek Coder 6.7B Instruct、Phi 4 Mini);(3)对于≤16B参数模型,在MultiPL-E和我们引入的新多语言版本LiveCodeBench上设置了新的最佳结果。
我们发布了语言无关的训练数据集(Ag-MBPP-X、Ag-Codeforces-X、Ag-LiveCodeBench-X)、训练代码和即用型配置,使得任何编程语言的RL后训练简单到只需编辑一个简短的YAML文件。
Summary / 总结
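Agnostics is a language-agnostic post-training pipeline that removes per-language engineering by judging code solely on its externally observable behavior, so a single verifier can test solutions in any language. It uses an LLM to rewrite existing unit-test datasets into an I/O format, takes a short configuration describing how to compile and run the target language, and applies reinforcement learning with verifiable rewards (RLVR) in a robust code execution environment. On five low-resource languages (Lua, Julia, R, OCaml, and Fortran), it improves Qwen-3 4B to performance that rivals 16B-70B open-weight models, scales to other model families, and sets new state-of-the-art pass@1 results for models with at most 16B parameters on MultiPL-E and a new multi-language version of LiveCodeBench.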
Agnostics 是一种语言无关的后训练管道,通过可验证奖励的强化学习(RLVR)在五种低资源语言上提升代码性能。它将现有单元测试数据集重写为 I/O 格式,配置目标语言的验证器,并在稳健的代码执行环境中应用 RLVR。结果表明,Agnostics 提升了 Qwen-3 4B 的性能,使其与 16B-70B 开源模型相当,能够扩展到更大模型,并在 MultiPL-E 和新推出的多语言 LiveCodeBench 版本上设置了新的最佳 pass@1 结果。
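The idea of judging code only by its observable I/O behavior can be sketched with a small verifier driven by a per-language configuration. This is a minimal illustration and not the Agnostics implementation: the `CONFIG` layout, the `verify` function, and the use of the local Python interpreter as the target language are all assumptions, and a compile step and sandboxing are omitted.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical per-language configuration; field names are illustrative.
CONFIG = {
    "python": {"ext": ".py", "run": [sys.executable, "{src}"]},
    # A Lua entry might look like: {"ext": ".lua", "run": ["lua", "{src}"]}
}

def verify(code: str, language: str, tests: list[tuple[str, str]], timeout: float = 5.0) -> bool:
    """Return True if the program maps every stdin to the expected stdout."""
    cfg = CONFIG[language]
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / f"solution{cfg['ext']}"
        src.write_text(code)
        cmd = [str(src) if part == "{src}" else part for part in cfg["run"]]
        for stdin_text, expected in tests:
            try:
                result = subprocess.run(cmd, input=stdin_text, capture_output=True,
                                        text=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return False
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return False
    return True

solution = "print(sum(int(x) for x in input().split()))"
print(verify(solution, "python", [("1 2 3\n", "6"), ("10 -4\n", "6")]))
```

Because the verifier only sees stdin and stdout, supporting another language in this sketch means adding one entry to `CONFIG`, and the boolean result can serve as a verifiable reward.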
Instructional Text Across Disciplines: A Survey of Representations, Downstream Tasks, and Open Challenges Toward Capable AI Agents
Authors: Abdulfattah Safa, Tamta Kapanadze, Arda Uzunoğlu, Gözde Gül Şahin
First: 2024-10-24T08:22:59+00:00 · Latest: 2026-03-23T17:18:20+00:00
Comments: Pre-CoLI print. Accepted for publication in Computational Linguistics (MIT Press). Advance online publication. March 2026
Abstract
Recent advances in large language models have demonstrated promising capabilities in following simple instructions through instruction tuning. However, real-world tasks often involve complex, multi-step instructions that remain challenging for current NLP systems. Robust understanding of such instructions is essential for deploying LLMs as general-purpose agents that can be programmed in natural language to perform complex, real-world tasks across domains like robotics, business automation, and interactive systems. Despite growing interest in this area, there is a lack of a comprehensive survey that systematically analyzes the landscape of complex instruction understanding and processing. Through a systematic review of the literature, we analyze available resources, representation schemes, and downstream tasks related to instructional text. Our study examines 181 papers, identifying trends, challenges, and opportunities in this emerging field. We provide AI/NLP researchers with essential background knowledge and a unified view of various approaches to complex instruction understanding, bridging gaps between different research directions and highlighting future research opportunities.
中文标题/摘要
标题:跨学科的指令文本:表示方法、下游任务及面向能干AI代理的开放挑战综述
大型语言模型的最新进展表明,通过指令调优,这些模型在遵循简单指令方面表现出有希望的能力。然而,现实世界中的任务往往涉及复杂的多步指令,这对当前的NLP系统来说仍然是一个挑战。对这类指令的稳健理解对于部署LLM作为通用代理至关重要,这些代理可以用自然语言编程来执行跨领域如机器人技术、业务自动化和交互系统等复杂、现实世界中的任务。尽管对该领域越来越感兴趣,但缺乏一个全面的综述来系统地分析复杂指令理解和处理的现状。通过系统地回顾文献,我们分析了与指令文本相关的可用资源、表示方案和下游任务。我们的研究分析了181篇论文,识别了该新兴领域中的趋势、挑战和机遇。我们为AI/NLP研究人员提供了必要的背景知识,并提供了一个统一的复杂指令理解各种方法的视角,填补了不同研究方向之间的差距,并突显了未来的研究机会。
Summary / 总结
This study addresses the gap in understanding complex instructions for AI agents by surveying 181 papers. It analyzes representation schemes and downstream tasks for instructional text, highlighting trends and challenges in this field. The research provides essential background for AI/NLP researchers and aims to bridge gaps between different research directions, emphasizing future opportunities for improving AI capabilities in handling multi-step instructions across various domains.
该研究通过调研181篇论文,分析了指令文本的表示方案和下游任务,指出了该领域的发展趋势、挑战和机会。研究为AI/NLP研究人员提供了必要的背景知识,并旨在弥合不同研究方向之间的差距,强调了提高AI处理跨领域多步指令能力的未来研究机会。
Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models
Authors: Tom Biskupski, Stephan Kleber
First: 2026-03-23T17:12:29+00:00 · Latest: 2026-03-23T17:12:29+00:00
Abstract
A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation of the analysis scales up the complex evaluation of the victim models' free-form text outputs by faster and more consistent judgments compared to human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparably new technique, LLMs as judges lack a thorough investigation for their reliability and agreement to human judgment.
Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments, when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.
中文标题/摘要
标题:评估大型语言模型自动化判断系统的可靠性和真实性
作为法官的大型语言模型(LLM)通过分析受害机器学习(ML)模型,特别是LLM的输出来评估这些模型的质量。作为法官的LLM是模型与一个专门设计的法官提示词的结合,该提示词包含了分析的标准。这种分析的自动化可以比人类审查者更快、更一致地扩展对受害模型自由形式文本输出的复杂评估。因此,对LLM的质量和安全评估可以覆盖受害模型的广泛用途案例。作为一种相对较新的技术,作为法官的LLM缺乏对其可靠性和与人类判断的一致性的深入研究。
我们的工作评估了LLM作为受害LLM自动质量评估器的适用性。我们测试了37个不同规模的对话LLM与5种不同法官提示词的组合效果,以及二级法官的概念和5个为任务微调的模型作为评估者。作为评估目标,我们根据人类评估创建了八个不同判断任务类别的数据集及其相应的地面真实标签。我们的实证结果表明,当与合适的提示词结合时,作为法官的LLM与人类评估高度相关,特别是在GPT-4o、多个开源参数量≥32B的模型以及少数较小的模型如Qwen2.5 14B方面。
Summary / 总结
This study evaluates the reliability and agreement of Large Language Models (LLMs) as judges in assessing the quality of other LLMs. The research tests 37 differently sized LLMs with 5 judge prompts, a second-level judge concept, and 5 fine-tuned models. Datasets for eight judgment tasks and their human-assessed ground-truth labels were curated. Results indicate high correlation between LLM judges and human assessments, especially for GPT-4o, several open-source models with at least 32B parameters, and smaller models like Qwen2.5 14B when appropriate prompts are used.
本研究评估了大型语言模型(LLM)作为裁判在评估其他LLM质量方面的可靠性和一致性。研究测试了37种不同大小的对话LLM与5种裁判提示、二级裁判概念以及5种为任务微调的模型。基于人类评估,构建了八个判断任务的数据集及其地面真实标签。结果表明,当使用合适的提示时,LLM作为裁判与人类评估之间存在高度相关性,特别是对于GPT-4o、至少拥有32B参数的开源模型以及较小的模型如Qwen2.5 14B。
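The abstract reports agreement between LLM judges and human labels without naming the statistic here. As a minimal sketch, assuming binary verdicts and assuming Cohen's kappa as the agreement measure (both are illustrative choices, not details confirmed by the abstract), judge-versus-human agreement could be computed as follows:

```python
from collections import Counter


def cohen_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two label sequences.

    Assumption: kappa is used here purely for illustration; the paper
    may report a different correlation or agreement statistic.
    """
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(judge_labels)
    # Observed agreement: fraction of items where judge and human match.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement if both raters labelled independently at their own base rates.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(
        (judge_freq[label] / n) * (human_freq[label] / n)
        for label in set(judge_freq) | set(human_freq)
    )
    if expected == 1.0:
        return 1.0  # both raters always give the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts: 1 = "output acceptable", 0 = "output not acceptable".
judge = [1, 1, 0, 0]
human = [1, 0, 0, 0]
kappa = cohen_kappa(judge, human)
```

Kappa discounts agreement that would occur by chance, which matters when one verdict dominates the dataset.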
SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection
Authors: Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu
First: 2026-03-23T17:11:43+00:00 · Latest: 2026-03-23T17:11:43+00:00
Abstract
While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.
中文标题/摘要
标题:SPA:一种简单但难以超越的知识注入基线
虽然大型语言模型(LLMs)在大量数据上进行预训练,但在专门的数据稀缺领域,其知识覆盖仍然不完整,这推动了大量研究致力于通过合成数据生成进行知识注入。我们提出了SPA(Scaling Prompt-engineered Augmentation),这是一种简单但难以超越的基线,使用少量精心设计的提示生成大规模合成数据以进行知识注入。通过系统比较,我们发现SPA优于几种强基线。此外,我们确定了先前方法的两个关键局限性:(1) 虽然基于RL的方法在小规模下可能提高LLM数据增强的令牌效率,但随着数据规模扩大,它们会遭受多样性崩溃,导致收益递减;(2) 虽然多阶段提示可能优于简单的增强方法,但在仔细调整提示后,它们的优势可能会消失。我们的结果表明,对于知识注入,结合精心设计的提示与直接的大规模增强可以非常有效,并希望SPA可以作为未来研究的强基线。我们的代码可在https://github.com/Tangkexian/SPA获取。
Summary / 总结
The research aims to address the limitations of large language models in specialized domains by proposing SPA, a simple prompt-engineered augmentation method. SPA uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. The study finds that SPA outperforms several strong baselines and highlights the limitations of RL-based methods and multi-stage prompting approaches, suggesting that careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective for knowledge injection.
研究旨在通过提出SPA(Scaling Prompt-engineered Augmentation)方法来解决大型语言模型在专业领域知识覆盖不全的问题。SPA 使用少量精心设计的提示来生成大规模合成数据以进行知识注入。研究发现,SPA 优于几个强基线,并指出了基于 RL 的方法和多阶段提示方法的局限性,表明精心设计的提示结合直接的大规模数据增强可以非常有效。
Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
Authors: Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang
First: 2026-03-23T17:10:29+00:00 · Latest: 2026-03-23T17:10:29+00:00
Abstract
Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.
中文标题/摘要
标题:Omni-WorldBench:面向世界模型的全面交互中心评估
基于视频的世界模型已经沿着两种主要范式发展:视频生成和3D重建。然而,现有的评估基准要么仅专注于生成模型的视觉保真度和文本-视频对齐,要么依赖于静态的3D重建指标,这些指标根本忽视了时间动态。我们认为,世界建模的未来在于4D生成,即同时建模空间结构和时间演变。在这个范式中,核心能力是交互响应:能够准确反映交互动作如何驱动空间和时间中的状态转换。然而,目前没有基准能够系统地评估这一关键维度。为了解决这一差距,我们提出了Omni-WorldBench,这是一个全面的基准,专门设计用于评估世界模型在4D设置中的交互响应能力。Omni-WorldBench 包含两个关键组成部分:Omni-WorldSuite,一个涵盖多种交互级别和场景类型的系统提示套件;以及Omni-Metrics,一个基于代理的评估框架,通过测量交互动作对最终结果和中间状态演变轨迹的因果影响来量化世界建模能力。我们在多个范式中对18个代表性世界模型进行了广泛的评估。我们的分析揭示了当前世界模型在交互响应方面的关键局限性,为未来研究提供了可操作的见解。Omni-WorldBench 将公开发布,以促进交互4D世界建模的进步。
Summary / 总结
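The paper proposes Omni-WorldBench, a benchmark for evaluating the interactive response capabilities of world models in 4D settings. It addresses a gap in existing benchmarks by focusing on how interaction actions drive state transitions across space and time. The benchmark comprises a systematic prompt suite (Omni-WorldSuite) and an agent-based evaluation framework (Omni-Metrics). Evaluations of 18 world models across multiple paradigms reveal critical limitations in interactive response and provide insights for future research.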
论文提出了Omni-WorldBench,这是一个新的基准,旨在评估4D世界模型的交互响应能力。该基准通过关注交互动作如何驱动空间和时间中的状态转换来弥补现有基准的不足。该基准包括一个广泛的提示套件和一个基于代理的评估框架。对18个不同范式的世界模型的评估揭示了交互响应中的显著局限性,为未来研究提供了见解。
Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs
Authors: Kangqi Ni, Wenyue Hua, Xiaoxiang Shi, Jiang Guo, Shiyu Chang, Tianlong Chen
First: 2026-03-23T17:01:42+00:00 · Latest: 2026-03-23T17:01:42+00:00
Abstract
Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling across models with diverse throughput and performance. We present Chimera, a predictive scheduling system for multi-agent workflow serving on heterogeneous LLM clusters that jointly improves end-to-end latency and task performance. Chimera applies semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates per-model congestion using in-flight predicted token volumes for load balancing. We evaluate Chimera on representative agentic workflows for code generation and math reasoning using multiple heterogeneous LLM configurations. Across comparable settings, Chimera traces the best latency-performance frontier, reducing end-to-end latency by 1.2--2.4$\times$ and improving task performance by 8.0-9.5 percentage points on average over competitive baselines including vLLM.
中文标题/摘要
标题:Chimera:针对异构大语言模型的延迟和性能感知多代理服务
多代理应用通常执行复杂的任务作为多阶段工作流,其中每个阶段是LLM调用,其输出成为后续步骤的上下文的一部分。现有的LLM服务系统大多假设具有相同模型副本的同质集群。这种设计忽视了异构部署的潜力,其中不同大小和能力的模型能够实现更精细的延迟和性能权衡。然而,异构性引入了在具有不同吞吐量和性能的模型之间进行调度的新挑战。我们提出了Chimera,这是一种针对异构大语言模型集群上的多代理工作流服务的预测调度系统,可以同时提高端到端延迟和任务性能。Chimera 应用语义路由来为每个请求估算每个模型的信心分数,预测工作流的剩余总输出长度,并使用在飞预测的标记体积来估算每个模型的拥塞情况以实现负载均衡。我们使用多种异构大语言模型配置对代表性的代理工作流(代码生成和数学推理)进行了Chimera的评估。在可比的设置中,Chimera 跟踪了最佳的延迟-性能前沿,将端到端延迟降低了1.2-2.4倍,并且在平均任务性能上比竞争基线(包括vLLM)提高了8.0-9.5个百分点。
Summary / 总结
Chimera is a predictive scheduling system designed for multi-agent workflow serving on heterogeneous large language model (LLM) clusters. It addresses the challenges of heterogeneity by using semantic routing to estimate model confidence scores, predicting workflow output length, and balancing load using in-flight token volumes. Evaluations on code generation and math reasoning workflows show that Chimera reduces end-to-end latency by 1.2-2.4 times and improves task performance by 8.0-9.5 percentage points on average compared to competitive baselines like vLLM.
Chimera 是一种预测调度系统,用于在异构大型语言模型 (LLM) 集群上执行多代理工作流服务。它通过使用语义路由来估计模型的信心分数、预测工作流的输出长度,并使用在途的令牌体积进行负载均衡来应对异构性带来的挑战。在代码生成和数学推理工作流上的评估表明,Chimera 可以将端到端的延迟降低 1.2-2.4 倍,并将任务性能平均提高 8.0-9.5 个百分点,优于竞争性基线如 vLLM。
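The abstract names three signals (per-model confidence, predicted remaining output length, and congestion from in-flight predicted token volumes) but does not say how they are combined. The following is a minimal sketch of one possible combination, assuming a linear trade-off between confidence and estimated wait time. `ModelState`, `route`, and `latency_weight` are hypothetical names introduced for illustration and are not taken from Chimera.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    name: str
    tokens_per_second: float        # assumed decode throughput of this model
    inflight_predicted_tokens: int  # predicted tokens still owed to queued requests


def estimated_latency(model, predicted_remaining_tokens):
    """Seconds until the new request would finish if queued behind in-flight work."""
    backlog = model.inflight_predicted_tokens + predicted_remaining_tokens
    return backlog / model.tokens_per_second


def route(models, confidence, predicted_remaining_tokens, latency_weight=0.01):
    """Pick the model with the best confidence-minus-latency score.

    Assumption: the linear scoring rule is illustrative only; the paper's
    actual scheduling policy may differ.
    """
    def score(model):
        wait = estimated_latency(model, predicted_remaining_tokens)
        return confidence[model.name] - latency_weight * wait

    best = max(models, key=score)
    # Record the newly admitted work so later requests see the added congestion.
    best.inflight_predicted_tokens += predicted_remaining_tokens
    return best.name


# Hypothetical cluster: a fast small model and a slower, busier large model.
cluster = [ModelState("small", 100.0, 0), ModelState("large", 20.0, 1000)]
scores = {"small": 0.6, "large": 0.9}
chosen = route(cluster, scores, predicted_remaining_tokens=200)
```

In this toy setting the large model has higher confidence, but its backlog makes the small model the better choice until the backlog drains.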
Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning
Authors: Seyedarmin Azizi, Erfan Baghaei Potraghloo, Minoo Ahmadi, Souvik Kundu, Massoud Pedram
First: 2026-02-10T20:31:40+00:00 · Latest: 2026-03-23T17:01:31+00:00
Abstract
Many recent reasoning gains in large language models can be explained as distribution sharpening: biasing generation toward high-likelihood trajectories already supported by the pretrained model, rather than modifying its weights. A natural formalization is the sequence-level power distribution $π_α(y\mid x)\propto p_θ(y\mid x)^α$ ($α>1$), which concentrates mass on whole sequences instead of adjusting token-level temperature. Prior work shows that Metropolis--Hastings (MH) sampling from this distribution recovers strong reasoning performance, but at order-of-magnitude inference slowdowns. We introduce Power-SMC, a training-free Sequential Monte Carlo scheme that targets the same objective while remaining close to standard decoding latency. Power-SMC advances a small particle set in parallel, corrects importance weights token-by-token, and resamples when necessary, all within a single GPU-friendly batched decode. We prove that temperature $τ=1/α$ is the unique prefix-only proposal minimizing incremental weight variance, interpret residual instability via prefix-conditioned Rényi entropies, and introduce an exponent-bridging schedule that improves particle stability without altering the target. On MATH500, Power-SMC matches or exceeds MH power sampling while reducing latency from $16$--$28\times$ to $1.4$--$3.3\times$ over baseline decoding. The code is available at https://github.com/ArminAzizi98/Power-SMC.
中文标题/摘要
标题:Power-SMC:低延迟序列级功率采样以实现无需训练的LLM推理
许多大型语言模型最近的推理改进可以解释为分布锐化:将生成偏向于预训练模型已支持的高概率轨迹,而不是修改其权重。一个自然的形式化是序列级功率分布$π_α(y|x)\propto p_θ(y|x)^α$($α>1$),它将质量集中在整个序列上,而不是调整令牌级温度。先前的工作表明,从该分布进行的Metropolis--Hastings(MH)采样可以恢复强大的推理性能,但会带来数量级的推理减速。我们引入了Power-SMC,这是一种无需训练的顺序蒙特卡洛方案,旨在实现相同的目标,同时保持接近标准解码延迟。Power-SMC 并行推进一小组粒子,逐个令牌校正重要性权重,并在必要时重新采样,全部在单个GPU友好的批量解码中完成。我们证明了温度$τ=1/α$是唯一能最小化增量权重方差的仅依赖前缀的提议分布,通过前缀条件的Rényi熵解释残余不稳定性,并引入了一种指数桥接调度,在不改变目标分布的情况下提高粒子稳定性。在MATH500上,Power-SMC 达到或超过了MH功率采样的性能,同时将相对于基线解码的延迟从$16$--$28\times$降低到$1.4$--$3.3\times$。代码可在 https://github.com/ArminAzizi98/Power-SMC 获取。
Summary / 总结
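Power-SMC is a training-free Sequential Monte Carlo scheme for sequence-level power sampling in large language models, designed to recover strong reasoning performance without order-of-magnitude inference slowdowns. It advances a small set of particles in parallel, corrects importance weights token by token, and resamples when necessary, all within a single batched decode. On MATH500, it matches or exceeds Metropolis-Hastings power sampling while reducing latency overhead from 16-28× to 1.4-3.3× over baseline decoding.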
Power-SMC 是一种无需训练的方法,用于在大型语言模型中进行序列级功率采样,旨在在不显著增加推理延迟的情况下恢复强大的推理性能。它使用顺序蒙特卡洛方案并行推进粒子并逐个令牌校正重要性权重,将相对于基线解码的延迟开销从16-28倍降低到1.4-3.3倍,同时在 MATH500 数据集上达到或超过了 Metropolis-Hastings(MH)功率采样的性能。
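The mechanics described in the abstract can be illustrated on a toy model. The sketch below is not the authors' implementation: `toy_next_token_probs` is a hypothetical stand-in for the language model, sequences have a fixed length, and resampling uses a simple effective-sample-size rule chosen for illustration. It does show one property of the temperature-$1/α$ proposal: with $q \propto p^α$, the incremental importance weight $p^α/q$ reduces to the normaliser $Z = \sum_v p(v)^α$, which depends only on the prefix.

```python
import math
import random


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for p_theta(. | x, prefix) over a 3-token vocabulary."""
    last = prefix[-1] if prefix else 0
    table = {0: [0.6, 0.3, 0.1], 1: [0.2, 0.5, 0.3], 2: [0.1, 0.1, 0.8]}
    return table[last]


def tempered_proposal(probs, alpha):
    """Return q proportional to p^alpha (temperature 1/alpha) and Z = sum_v p(v)^alpha."""
    powered = [p ** alpha for p in probs]
    z = sum(powered)
    return [w / z for w in powered], z


def normalise(log_weights):
    """Convert log-weights to normalised weights in a numerically stable way."""
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def power_smc(num_particles, length, alpha=2.0, ess_fraction=0.5, seed=0):
    """Approximate samples from pi_alpha(y) proportional to p(y)^alpha on the toy model."""
    rng = random.Random(seed)
    particles = [[] for _ in range(num_particles)]
    log_w = [0.0] * num_particles
    for _ in range(length):
        for i, prefix in enumerate(particles):
            q, z = tempered_proposal(toy_next_token_probs(prefix), alpha)
            token = rng.choices(range(len(q)), weights=q)[0]
            particles[i] = prefix + [token]
            # Incremental weight p(token)^alpha / q(token) equals Z for every token.
            log_w[i] += math.log(z)
        weights = normalise(log_w)
        ess = 1.0 / sum(w * w for w in weights)
        if ess < ess_fraction * num_particles:
            # Multinomial resampling, then reset to uniform weights.
            picks = rng.choices(range(num_particles), weights=weights, k=num_particles)
            particles = [list(particles[j]) for j in picks]
            log_w = [0.0] * num_particles
    return particles, normalise(log_w)


samples, sample_weights = power_smc(num_particles=8, length=5)
```

In a real decoder the inner loop over particles would be a single batched forward pass, which is what keeps latency close to standard decoding.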
DisPatch: Disarming Adversarial Patches in Object Detection with Diffusion Models
Authors: Jin Ma, Mohammed Aldeen, Christopher Salas, Feng Luo, Mashrur Chowdhury, Mert Pesé, Long Cheng
First: 2025-09-04T18:20:36+00:00 · Latest: 2026-03-23T17:00:09+00:00
Abstract
Object detection is fundamental to various real-world applications, such as security monitoring and surveillance video analysis. Despite their advancements, state-of-the-art object detectors are still vulnerable to adversarial patch attacks, which can be easily applied to real-world objects to either conceal actual items or create non-existent ones, leading to severe consequences. In this work, we introduce DisPatch, the first diffusion-based defense framework for object detection. Unlike previous works that aim to "detect and remove" adversarial patches, DisPatch adopts a "regenerate and rectify" strategy, leveraging generative models to disarm attack effects while preserving the integrity of the input image. Specifically, we utilize the in-distribution generative power of diffusion models to regenerate the entire image, aligning it with benign data. A rectification process is then employed to identify and replace adversarial regions with their regenerated benign counterparts. DisPatch is attack-agnostic and requires no prior knowledge of the existing patches. Extensive experiments across multiple detectors demonstrate that DisPatch consistently outperforms state-of-the-art defenses on both hiding attacks and creating attacks, achieving the best overall mAP@0.5 score of 89.3% on hiding attacks, and lowering the attack success rate to 24.8% on untargeted creating attacks. Moreover, it strikes the balance between effectiveness and efficiency, and maintains strong robustness against adaptive attacks, making it a practical and reliable defense method.
中文标题/摘要
标题:DisPatch:使用扩散模型解除对象检测中对抗补丁的武装
对象检测是各种实际应用的基础,如安全监控和视频分析。尽管取得了进展,最先进的对象检测器仍然容易受到对抗补丁攻击的影响,这些攻击可以轻松应用于现实中的物体,以隐藏实际物品或创造不存在的物品,导致严重后果。在本文中,我们介绍了DisPatch,这是首个基于扩散模型的对象检测防御框架。与之前旨在“检测和移除”对抗补丁的工作不同,DisPatch 采用“再生和校正”的策略,利用生成模型解除攻击效果,同时保持输入图像的完整性。具体而言,我们利用扩散模型的同分布生成能力再生整个图像,使其与良性数据对齐。然后进行校正过程,以识别并用再生的良性区域替换对抗区域。DisPatch 对抗无特定知识,无需了解现有补丁。在多个检测器上的广泛实验表明,DisPatch 在隐藏攻击和创造攻击方面均优于最先进的防御方法,在隐藏攻击中实现最佳的整体 mAP@0.5 分数为 89.3%,在非目标创造攻击中的攻击成功率降低到 24.8%。此外,它在有效性与效率之间取得了平衡,并对适应性攻击保持了强大的鲁棒性,使其成为一种实用可靠的防御方法。
Summary / 总结
DisPatch is a diffusion-based defense framework for object detection that aims to disarm adversarial patch attacks by regenerating the entire image and replacing adversarial regions with their regenerated benign counterparts. It outperforms existing defenses on both hiding and creating attacks, achieving the best overall mAP@0.5 score of 89.3% on hiding attacks and lowering the attack success rate to 24.8% on untargeted creating attacks. It is attack-agnostic and does not require prior knowledge of the patches, balancing effectiveness and efficiency while maintaining robustness against adaptive attacks.
DisPatch 是一种新颖的基于扩散模型的防御框架,用于解决对象检测中的对抗补丁攻击问题。不同于之前专注于检测和移除补丁的方法,DisPatch 使用“再生和校正”的策略。它使用生成模型再生整个图像,并用再生的良性区域替换对抗区域。DisPatch 在多个检测器上表现出优越的性能,其在隐藏攻击中的 mAP@0.5 得分为 89.3%,在非目标创造攻击中的攻击成功率降低到 24.8%,同时保持了对适应性攻击的鲁棒性。
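The abstract does not specify how the rectification step identifies adversarial regions. As a minimal sketch, assuming a per-pixel difference threshold between the input and its regenerated version (an illustrative rule, not the paper's method), the "replace adversarial regions with regenerated counterparts" step could look like this. The diffusion-based regeneration itself is not shown; `regenerated` is taken as given.

```python
def rectify(original, regenerated, threshold=0.25):
    """Replace pixels whose regenerated value differs strongly from the input.

    Assumption: large input-vs-regeneration differences mark adversarial
    pixels. Images are nested lists of grayscale values in [0, 1].
    """
    rectified, mask = [], []
    for row_orig, row_regen in zip(original, regenerated):
        # True where the regenerated pixel disagrees with the input beyond the threshold.
        mask_row = [abs(o - r) > threshold for o, r in zip(row_orig, row_regen)]
        mask.append(mask_row)
        # Keep the original pixel unless it was flagged as adversarial.
        rectified.append(
            [r if flagged else o for o, r, flagged in zip(row_orig, row_regen, mask_row)]
        )
    return rectified, mask


# Hypothetical 2x2 image: the top-right pixel stands in for a patch.
image = [[0.1, 0.9], [0.2, 0.2]]
regen = [[0.1, 0.2], [0.25, 0.2]]
clean, patch_mask = rectify(image, regen)
```

Keeping unflagged pixels from the original input is what preserves the integrity of the benign parts of the image.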
Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents
Authors: Rachmad Vidya Wicaksana Putra, Avaneesh Devkota, Muhammad Shafique
First: 2025-04-18T08:12:59+00:00 · Latest: 2026-03-23T16:51:57+00:00
Comments: Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC), July 26-29, 2026 in Long Beach, CA, USA. [Codes: https://github.com/rachmadvwp/SwitchMT]
Abstract
Training resource-constrained autonomous agents on multiple tasks simultaneously is crucial for adapting to diverse real-world environments. Recent works employ reinforcement learning (RL) approaches, but they still suffer from sub-optimal multi-task performance due to task interference. State-of-the-art works employ Spiking Neural Networks (SNNs) to improve RL-based multi-task learning and enable low-power/energy operations through network enhancements and spike-driven data stream processing. However, they rely on fixed task-switching intervals during training, thus limiting their performance and scalability. To address this, we propose SwitchMT, a novel methodology that employs adaptive task-switching for effective, scalable, and simultaneous multi-task learning. SwitchMT employs the following key ideas: (1) leveraging a Deep Spiking Q-Network with active dendrites and a dueling structure, which utilizes task-specific context signals to create specialized sub-networks; and (2) devising an adaptive task-switching policy that leverages both rewards and internal dynamics of the network parameters. Experimental results demonstrate that SwitchMT achieves competitive scores in multiple Atari games (i.e., Pong: -8.8, Breakout: 5.6, and Enduro: 355.2) and longer game episodes as compared to the state-of-the-art. These results also highlight the effectiveness of the SwitchMT methodology in addressing task interference without increasing the network complexity, enabling intelligent autonomous agents with scalable multi-task learning capabilities.
中文标题/摘要
标题:通过具有自适应任务切换策略的脉冲神经网络实现面向智能自主代理的可扩展多任务学习
同时在多个任务上对资源受限的自主代理进行训练对于适应多变的现实环境至关重要。最近的研究采用了强化学习(RL)方法,但由于任务干扰,其多任务性能仍然不尽如人意。最先进的研究利用脉冲神经网络(SNNs)来提高基于RL的多任务学习,并通过网络增强和基于脉冲的数据流处理实现低功耗/低能耗操作。然而,它们在训练过程中依赖于固定的任务切换间隔,从而限制了其性能和可扩展性。为了解决这个问题,我们提出了SwitchMT,这是一种新颖的方法,采用自适应任务切换策略实现有效的、可扩展的和同时的多任务学习。SwitchMT采用了以下关键思想:(1)利用具有活跃树突和对分结构的深度脉冲Q网络,利用任务特定的上下文信号创建专门的子网络;(2)设计一种基于奖励和网络参数内部动力学的自适应任务切换策略。实验结果表明,SwitchMT在多个Atari游戏中(例如,Pong:-8.8,Breakout:5.6,Enduro:355.2)和更长的游戏回合中实现了与最先进的方法相当的分数。这些结果还突显了SwitchMT方法在不增加网络复杂性的情况下解决任务干扰的有效性,使智能自主代理具备可扩展的多任务学习能力。
Summary / 总结
The research aims to improve multi-task learning for autonomous agents by addressing task interference through a scalable approach. SwitchMT uses a Deep Spiking Q-Network with adaptive task-switching and specialized sub-networks to enhance performance in Atari games. The method achieves competitive scores in Pong, Breakout, and Enduro, demonstrating effective multi-task learning without increasing network complexity.
研究旨在通过自适应任务切换来改善自主代理的多任务学习,以解决任务干扰问题。提出了一种SwitchMT方法,该方法使用具有专门子网络的深度脉冲Q网络,并基于奖励和网络参数内部动态设计自适应任务切换策略。实验结果表明,SwitchMT在多个Atari游戏中取得了有竞争力的分数,且游戏回合长于现有最先进方法,展示了有效的可扩展多任务学习,且未增加网络复杂性。