arXiv 论文速递

Snapshot: 20260310_0345

Multimodal Large Language Models as Image Classifiers

Authors: Nikita Kisel, Illia Volkov, Klara Janouskova, Jiri Matas

First: 2026-03-06T18:59:58+00:00 · Latest: 2026-03-06T18:59:58+00:00

Abstract

The classification performance of Multimodal Large Language Models (MLLMs) depends critically on the evaluation protocol and on ground-truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices - batch size, image ordering, and text encoder selection - showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocols rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.
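
The abstract notes that predictions falling outside the class list are often discarded, and that open-world results depend on how free-form outputs are mapped back to classes. A minimal sketch of such a mapping step follows. It is not the authors' method: the function name `map_to_class` and the example class list are hypothetical, and character-level similarity from Python's `difflib` is used as a stand-in for the text-encoder similarity the abstract alludes to.

```python
from difflib import SequenceMatcher


def map_to_class(prediction: str, class_names: list[str]) -> str:
    """Map a free-form model output to the most similar class name.

    Hypothetical helper: instead of discarding an out-of-list answer,
    it is assigned to the closest entry of the provided class list.
    """
    cleaned = prediction.strip().lower()
    # An exact (case-insensitive) match needs no further mapping.
    for name in class_names:
        if cleaned == name.lower():
            return name
    # Otherwise pick the closest name by character-level similarity.
    # Assumption: this stands in for embedding similarity from a text encoder.
    return max(
        class_names,
        key=lambda name: SequenceMatcher(None, cleaned, name.lower()).ratio(),
    )


# Made-up class list and model output, for illustration only.
classes = ["tabby cat", "golden retriever", "sports car"]
mapped = map_to_class("a golden retriever dog", classes)
```

Swapping the similarity function changes which class an ambiguous answer lands in, which is one way the choice of text encoder can affect measured accuracy.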

中文标题/摘要

标题：多模态大型语言模型作为图像分类器

多模态大型语言模型（MLLM）的分类性能在很大程度上取决于评估协议和真实标签的质量。比较MLLM、监督模型和视觉-语言模型的研究报告得出相互矛盾的结论，我们表明这些矛盾源于要么夸大要么低估性能的评估协议。在最常见的评估协议中，我们识别并解决了关键问题：模型输出超出提供的类别列表并被丢弃、由于弱的选择题干扰项导致的夸大结果以及在开放世界设置中由于输出映射不佳而表现不佳。我们还量化了通常被忽视的设计选择——批量大小、图像排序和文本编码器选择的影响，表明它们显著影响准确性。在我们的多标签重注释的625个ImageNet-1k类别上进行评估显示，MLLM最受益于修正的标签（最多+10.8%），显著缩小了与监督模型之间的感知差距。因此，报告的MLLM在分类上的表现不佳主要是由于嘈杂的真实标签和有缺陷的评估协议，而不是真正的模型缺陷。对监督训练信号依赖较少的模型对注释质量最为敏感。最后，我们展示了MLLM可以辅助人类注释员：在受控案例研究中，注释员在大约50%的困难案例中确认或整合了MLLM的预测，证明了它们在大规模数据集整理中的潜力。

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

Authors: Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu

First: 2026-03-06T18:59:57+00:00 · Latest: 2026-03-06T18:59:57+00:00

Comments: Project page: https://omni-diffusion.github.io

Abstract

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
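
To make "mask-based discrete diffusion" concrete, the sketch below shows the forward corruption step in its usual generic formulation: each token in a sequence is independently replaced by a mask token with some probability, and a model is then trained to recover the original tokens at the masked positions. This is an illustration under stated assumptions, not Omni-Diffusion's implementation; the `MASK` id, the token ids, and the function name are all made up.

```python
import random

MASK = -1  # hypothetical id reserved for the [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Forward corruption: mask each token independently with probability mask_ratio."""
    return [MASK if rng.random() < mask_ratio else tok for tok in tokens]


rng = random.Random(0)
# Assumption: text, speech, and image tokens share one discrete sequence.
sequence = [101, 102, 9001, 9002, 5001, 5002]

fully_masked = mask_tokens(sequence, 1.0, rng)   # end of the forward process
untouched = mask_tokens(sequence, 0.0, rng)      # start of the forward process
partly_masked = mask_tokens(sequence, 0.5, rng)  # an intermediate noise level
```

Generation runs this in reverse: starting from a fully masked sequence, the model fills in a subset of positions at each step until no masks remain.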

中文标题/摘要

标题：Omni-Diffusion：基于掩码离散扩散模型的统一多模态理解和生成

尽管近期的多模态大型语言模型（MLLMs）取得了显著进展，但它们主要采用传统的自回归架构作为基础，这在架构设计方面留下了探索更有效和更高效替代方案的巨大空间。同时，最近的研究成功将离散扩散模型应用于视觉理解、图像生成等多个领域，揭示了其作为多模态系统潜在的有前途基础架构的巨大潜力。受到这些开创性研究的启发，我们提出了Omni-Diffusion，这是首个完全基于掩码离散扩散模型的任何到任何的多模态语言模型，它统一了文本、语音和图像的理解与生成。Omni-Diffusion采用统一的掩码离散扩散模型直接捕捉离散多模态标记的联合分布。这种方法不仅支持二模态任务，还支持涉及多个模态的更复杂场景。在一系列多样的基准测试中，我们的方法在性能上优于或与现有处理两个或多个模态的多模态系统相当，突显了扩散模型在推动下一代多模态基础模型方面的巨大潜力。项目网页：https://omni-diffusion.github.io

Summary / 总结

Omni-Diffusion is a unified multimodal language model that uses a mask-based discrete diffusion model to understand and generate across text, speech, and images. It outperforms or matches existing multimodal systems on various benchmarks, demonstrating the potential of diffusion models in multimodal tasks.

Omni-Diffusion 是一种使用基于掩码的离散扩散模型来跨文本、语音和图像进行理解和生成的统一多模态语言模型。它在多种基准测试中表现优于或与现有系统持平，展示了扩散模型在多模态任务中的潜力。这项工作解决了传统自回归架构的局限性，并为多模态基础模型开辟了新途径。

BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations

Authors: Thomas Monninger, Shaoyuan Xie, Qi Alfred Chen, Sihao Ding

First: 2026-03-06T18:59:55+00:00 · Latest: 2026-03-06T18:59:55+00:00

Comments: 4 figures, 6 tables in the main paper, 32 pages in total

Abstract

The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46%, by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.

中文标题/摘要

标题：BEVLM：将大型语言模型的语义知识提炼为鸟瞰图表示

将大型语言模型（LLMs）集成到自动驾驶中引起了广泛关注，因为它们强大的推理和语义理解能力对于处理复杂的决策和长尾场景至关重要。然而，现有方法通常将LLMs与多视图和多帧图像的标记独立输入，导致冗余计算和空间一致性有限。这种视觉处理的分离阻碍了准确的三维空间推理，并且无法在不同视图之间保持几何一致性。另一方面，从几何标注任务（例如物体检测）中学习的鸟瞰图（BEV）表示提供了空间结构，但缺乏基础视觉编码器的语义丰富性。为了弥合这一差距，我们提出了一种BEVLM框架，该框架将空间一致且语义提炼的BEV表示与LLMs连接起来。通过广泛的实验，我们展示了BEVLM使LLMs在跨视图驾驶场景中推理更加有效，通过利用BEV特征作为统一输入，准确率提高了46%。此外，通过将LLMs中的语义知识提炼到BEV表示中，BEVLM在安全关键场景中的闭环端到端驾驶性能显著提高了29%。

Summary / 总结

BEVLM integrates semantic knowledge from LLMs into BEV representations to enhance autonomous driving. It addresses the limitations of existing methods by providing spatially consistent and semantically rich inputs, improving cross-view reasoning accuracy by 46% and closed-loop driving performance by 29% in safety-critical scenarios.

BEVLM 将 LLM 的语义知识融入到 BEV 表示中以提升自动驾驶性能。它解决了现有方法的空间不一致性和语义贫乏的问题，通过提供空间一致且语义丰富的输入，提高了跨视图推理准确率 46% 和安全关键场景下的端到端驾驶性能 29%。

SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

Authors: Alejandra Perez, Anita Rau, Lee White, Busisiwe Mlambo, Chinedu Nwoye, Muhammad Abdullah Jamal, Omid Mohareri

First: 2026-03-06T18:58:36+00:00 · Latest: 2026-03-06T18:58:36+00:00

Abstract

Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.

中文标题/摘要

标题：SUREON：一个手术推理基准和视觉语言模型

外科医生不只是看，而是进行解释。当专家观察手术场景时，他们不仅理解正在使用的器械是什么，还理解为什么选择这种器械，它带来的风险是什么，接下来会发生什么。当前的手术AI无法回答这些问题，主要是因为大规模标注明确包含手术推理的训练数据极其困难。然而，手术视频讲座中已经恰好包含了这些内容：由专家出于教学目的讲述的意图、理由和预判。尽管这些讲解本身含有噪声且缺乏结构，但它们编码了当前手术AI所缺乏的推理。我们引入了SUREON，一个大规模的视频问答数据集，系统地从手术学术视频中收集这种训练信号。SUREON定义了12个问题类别，涵盖安全评估、决策理由和预测，并使用多智能体流水线大规模地提取和结构化监督信号。在13.47万个视频片段和170种手术类型中，SUREON产生了20.68万个问答对和一个包含354个样例、经专家验证的基准。为了评估这种监督在多大程度上转化为手术推理能力，我们引入了两个模型：SureonVLM，通过监督微调适应的视觉语言模型，以及SureonVLM-R1，使用组相对策略优化训练的推理模型。这两个模型都能回答复杂的手术问题，并显著优于规模更大的通用领域模型，在SUREON基准测试中超过84%的准确率，同时在标准的手术感知任务中也优于通用领域模型。对SureonVLM-R1的定性分析显示了明确的推理行为，例如从视觉上下文推断手术意图。

Summary / 总结

SUREON is a large-scale video QA dataset that captures surgical reasoning from expert narrations in surgical academic videos. It includes 12 question categories and 206,800 QA pairs, providing a benchmark for evaluating surgical reasoning. Two models, SureonVLM and SureonVLM-R1, were trained on this dataset and outperformed general-domain models, achieving over 84% accuracy on the SUREON benchmark and demonstrating explicit reasoning behavior.

SUREON 是一个用于外科推理的基准和视觉-语言模型，旨在解决当前AI系统中缺乏明确的外科推理问题。它利用外科视频讲座创建了一个大规模的问答数据集，涵盖12个问题类别，生成了206,800个问答对。模型SureonVLM和SureonVLM-R1取得了高精度，优于通用领域的模型，在SUREON基准和标准外科感知任务上均表现出色。

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Authors: Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang

First: 2026-03-06T18:58:04+00:00 · Latest: 2026-03-06T18:58:04+00:00

Comments: Penguin-VL Technical Report; Code: https://github.com/tencent-ailab/Penguin-VL

Abstract

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL

中文标题/摘要

标题：Penguin-VL：基于LLM的视觉编码器探索VLM的效率极限

视觉语言模型（VLM）的发展主要依赖于扩大模型规模，这阻碍了在计算受限的移动和边缘设备（如智能手机和机器人）上的部署。在本研究中，我们探索了紧凑型（例如，2B和8B）VLM的性能极限。我们挑战了当前VLM必须依赖通过大规模对比预训练（例如，CLIP/SigLIP）初始化的视觉编码器的主流做法。我们发现对比学习的目标不匹配：这种优化用于区分的对比学习会强制执行粗略的和类别级别的不变性，抑制了密集描述和复杂VLM推理所需的细粒度视觉线索。为了解决这一问题，我们提出了Penguin-VL，其视觉编码器从纯文本的LLM初始化。我们的实验表明，Penguin-Encoder比传统的对比预训练更优越，能够为多模态理解提供更高的视觉保真度和数据效率。在各种图像和视频基准测试中，Penguin-VL在数学推理方面与领先VLM（如Qwen3-VL）表现相当，在文档理解、视觉知识和多视角视频理解等任务上则超越了它们。值得注意的是，这些改进是通过轻量级架构实现的，表明改进的视觉表示而非模型规模是性能提升的主要驱动力。我们的消融实验表明，Penguin-Encoder始终优于对比预训练的编码器，保留了对密集感知和复杂推理至关重要的细粒度空间和时间线索。这使其成为计算高效的VLM的强有力替代品，并在资源受限的环境中实现高性能。代码：https://github.com/tencent-ailab/Penguin-VL

Neural Signals Generate Clinical Notes in the Wild

Authors: Jathurshan Pradeepkumar, Zheng Chen, Jimeng Sun

First: 2026-01-29T13:07:30+00:00 · Latest: 2026-03-06T18:57:14+00:00

Abstract

Generating clinical reports that summarize abnormal patterns, diagnostic findings, and clinical interpretations from long-term EEG recordings remains labor-intensive. We curate a large-scale clinical EEG dataset with $9{,}922$ reports paired with approximately $11{,}000$ hours of EEG recordings from $9{,}048$ patients. We therefore develop CELM, the first clinical EEG-to-Language foundation model capable of summarizing long-duration, variable-length EEG recordings and performing end-to-end clinical report generation at multiple scales, including recording description, background activity, epileptiform abnormalities, events/seizures, and impressions. Experimental results show that, with patient history supervision, our method achieves $70\%$-$95\%$ average relative improvements in standard generation metrics (e.g., ROUGE-1 and METEOR) from $0.2$-$0.3$ to $0.4$-$0.6$. In the zero-shot setting without patient history, CELM attains generation scores in the range of $0.43$-$0.52$, compared to baselines of $0.17$-$0.26$. CELM integrates pretrained EEG foundation models with language models to enable scalable multimodal learning. We release our model and benchmark construction pipeline at https://github.com/Jathurshan0330/CELM.

中文标题/摘要

标题：真实场景下由神经信号生成临床笔记

从长时间的EEG记录中生成总结异常模式、诊断发现和临床解释的临床报告仍然是一项劳动密集型工作。我们整理了一个大规模的临床EEG数据集，包含9,922份报告和大约11,000小时的EEG记录，来自9,048名患者。因此，我们开发了CELM，这是第一个能够总结长时间、变长的EEG记录并进行多尺度临床报告端到端生成的临床EEG到语言基础模型，包括记录描述、背景活动、癫痫样异常、事件/癫痫发作和印象。实验结果表明，在患者历史监督下，我们的方法在标准生成指标（如ROUGE-1和METEOR）上实现了70%-95%的平均相对改进，从0.2-0.3提高到0.4-0.6。在没有患者历史的零样本设置下，CELM的生成得分为0.43-0.52，而基线得分为0.17-0.26。CELM将预训练的EEG基础模型与语言模型结合，以实现可扩展的多模态学习。我们在https://github.com/Jathurshan0330/CELM上发布了我们的模型和基准构建管道。

Summary / 总结

This study addresses the labor-intensive task of generating clinical reports from long-term EEG recordings by developing CELM, a clinical EEG-to-Language foundation model. The model summarizes long-duration, variable-length EEG recordings and generates clinical reports at multiple scales. With patient history supervision, CELM achieves 70%-95% average relative improvements in standard generation metrics such as ROUGE-1 and METEOR (from 0.2-0.3 to 0.4-0.6). In the zero-shot setting without patient history, CELM attains generation scores of 0.43-0.52, compared with 0.17-0.26 for baselines.

论文通过开发CELM，一种临床EEG到语言的基础模型，解决了从EEG记录生成临床报告的劳动密集型任务。CELM能够总结长时间的EEG记录并在多个尺度上生成临床报告。在有患者历史监督的情况下，CELM在生成指标上取得了显著改进，ROUGE-1和METEOR得分提高了70%-95%。在零样本设置下，CELM仍然优于基线模型。

Boosting deep Reinforcement Learning using pretraining with Logical Options

Authors: Zihan Ye, Phil Chau, Raban Emunds, Jannis Blüml, Cedric Derstroff, Quentin Delfosse, Oleg Arenz, Kristian Kersting

First: 2026-03-06T18:55:15+00:00 · Latest: 2026-03-06T18:55:15+00:00

Abstract

Deep reinforcement learning agents are often misaligned, as they over-exploit early reward signals. Recently, several symbolic approaches have addressed these challenges by encoding sparse objectives along with aligned plans. However, purely symbolic architectures are complex to scale and difficult to apply to continuous settings. Hence, we propose a hybrid approach, inspired by humans' ability to acquire new skills. We use a two-stage framework that injects symbolic structure into neural-based reinforcement learning agents without sacrificing the expressivity of deep policies. Our method, called Hybrid Hierarchical RL (H^2RL), introduces a logical option-based pretraining strategy to steer the learning policy away from short-term reward loops and toward goal-directed behavior while allowing the final policy to be refined via standard environment interaction. Empirically, we show that this approach consistently improves long-horizon decision-making and yields agents that outperform strong neural, symbolic, and neuro-symbolic baselines.

中文标题/摘要

标题：使用逻辑选项预训练提升深度强化学习

深度强化学习代理往往存在偏差，因为它们会过度利用早期的奖励信号。最近，一些符号方法通过编码稀疏目标和对齐的计划来解决这些挑战。然而，纯粹的符号架构难以扩展，并且难以应用于连续环境。因此，我们提出了一种混合方法，灵感来源于人类获取新技能的能力。我们使用两阶段框架，在基于神经网络的强化学习代理中注入符号结构，而不牺牲深度策略的表达能力。我们的方法称为混合层次化强化学习（H^2RL），它引入了一种基于逻辑选项的预训练策略，引导学习策略远离短期奖励循环，转向目标导向行为，同时允许最终策略通过标准环境交互进行细化。实验上，我们展示了这种方法在长期决策制定方面的一致改进，并产生了优于强大神经、符号和神经符号基线的代理。

Summary / 总结

The research aims to address the misalignment issue in deep reinforcement learning agents by proposing a hybrid approach that combines symbolic and neural methods. The method, Hybrid Hierarchical RL (H^2RL), uses a two-stage framework with logical option-based pretraining to guide the learning policy towards goal-directed behavior, then refines the final policy through standard environment interaction. Experiments show that this approach consistently improves long-horizon decision-making and outperforms strong neural, symbolic, and neuro-symbolic baselines.

研究旨在通过结合符号和神经方法解决深度强化学习代理的对齐问题，提出了一种名为Hybrid Hierarchical RL (H^2RL)的混合方法。该方法采用两阶段框架，通过逻辑选项预训练来引导学习策略向目标导向行为发展，同时保持深度策略的灵活性。实验表明，这种方法在长期决策制定方面表现出色，并且优于神经、符号和神经-符号基线方法。

EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

Authors: Fangrui Zhu, Yunfeng Xi, Jianmo Ni, Mu Cai, Boqing Gong, Long Zhao, Chen Qu, Ian Miao, Yi Li, Cheng Zhong, Huaizu Jiang, Shwetak Patel

First: 2026-03-06T18:49:04+00:00 · Latest: 2026-03-06T18:49:04+00:00

Comments: preprint

Abstract

Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task's cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
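
The second stage relies on GRPO (Group Relative Policy Optimization), which scores each sampled response relative to the other responses drawn for the same prompt rather than against a learned value function. The sketch below shows only that normalization step in its commonly described form, `(reward - group mean) / group std`. The reward values are made up, and whether an implementation uses the population or sample standard deviation varies, so treat the choice of `pstdev` here as an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical task-aware rewards for four sampled answers to one question,
# e.g. 1.0 if entity grounding and temporal alignment checks pass, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group average receive positive advantages and are reinforced; the rest are pushed down. Task-aware rewards, as described in the abstract, would change what earns a high reward for each task type, not this normalization.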

中文标题/摘要

标题：EgoReasoner：通过任务自适应结构化思考学习自中心4D推理

自中心视频理解由于环境的动态4D特性而固有地复杂，其中摄像机运动和物体位移需要不断重新评估空间关系。在本工作中，我们针对一系列尚未充分探索的自中心4D推理任务，包括固定装置交互计数、视角相对固定装置位置、物体运动行程跟踪和静止物体定位，这些任务需要根本不同的认知操作：空间锚定、时间跟踪和持续时间推理。我们观察到，这些结构差异使得任务无关的方法不足：通用的链式思考方法缺乏与任务相适应的推理原语，而统一的强化学习会主动破坏空间任务的性能。为了解决这个问题，我们提出了EgoReasoner，这是一种两阶段框架，将推理框架和奖励信号与每个任务的认知结构对齐。在第一阶段，任务自适应思考模板指导结构化CoT轨迹的合成，通过监督微调使模型能够适应性地推理不同类型的任务。在第二阶段，任务感知的奖励函数验证实体定位、时间对齐和任务自适应逻辑一致性，通过基于GRPO的强化微调选择性地加强每条推理路径。我们的30亿参数模型，在仅使用16000个样本训练后，在具有挑战性的HD-EPIC基准上达到了37.5%的平均准确率，比Qwen2.5-VL-7B（25.7%）高出10个百分点以上。

Summary / 总结

EgoReasoner tackles egocentric 4D reasoning tasks by aligning both the reasoning scaffold and the reward signal to each task's cognitive structure. It uses Task-Adaptive Thinking Templates to synthesize structured CoT traces for supervised fine-tuning, then task-aware reward functions for reinforcement fine-tuning with GRPO. The 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the HD-EPIC benchmark, 11.8 points above Qwen2.5-VL-7B (25.7%).

EgoReasoner 是一个两阶段框架，旨在解决如固定装置交互计数和物体运动跟踪等自中心4D推理任务的挑战。它使用任务自适应思考模板来引导结构化CoT轨迹的合成以进行监督微调，并使用任务感知奖励函数进行基于GRPO的强化微调。该30亿参数模型仅在16K样本上训练，在HD-EPIC基准上的平均准确率达到37.5%，比Qwen2.5-VL-7B（25.7%）高出11.8个百分点。

Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

Authors: David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken-ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, Ioannis Panageas, Dimitris Paparas, Benjamin Przybocki, Bernardo Subercaseaux, Ola Svensson, Shayan Taherijam, Xuan Wu, Eylon Yogev, Morteza Zadimoghaddam, Samson Zhou, Yossi Matias, James Manyika, Vahab Mirrokni

First: 2026-02-03T18:56:17+00:00 · Latest: 2026-03-06T18:48:45+00:00

Comments: The changes over version 2 are that we cleaned up the last paragraph on color-coding at the end of section 2. Also, for section 6.1 we added a reference to followup work of the authors, and other minor edits in that section

Abstract

Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google's Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a "neuro-symbolic" loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.

中文标题/摘要

标题：使用Gemini加速科学研究：案例研究与常用技术

大型语言模型（LLMs）的最新进展为加速科学研究开辟了新途径。虽然这些模型在协助处理常规任务方面越来越强大，但它们在贡献新颖、专家级的数学发现方面的能力尚不明确。我们展示了一系列案例研究，说明研究人员如何成功与基于Google Gemini的高级AI模型（特别是Gemini Deep Think及其高级变体）合作，解决开放问题、反驳猜想并生成新的证明，涵盖理论计算机科学的多个领域，以及经济学、优化和物理学等其他领域。基于这些经验，我们提取了理论研究中有效的人机协作的常用技术，如迭代细化、问题分解和跨学科知识转移。虽然我们的大部分结果来自这种互动、对话的方法，但我们还强调了一些超越标准聊天界面的具体实例。这些包括将模型部署为严格的对抗性审查员以检测现有证明中的细微缺陷，以及将其嵌入“神经符号”循环中，该循环自主编写和执行代码以验证复杂的推导。这些例子共同突显了人工智能不仅作为自动化工具的潜力，而且作为科学发现的创造性过程中多才多艺的真正伙伴的潜力。

Summary / 总结

This paper explores how advanced AI models, particularly Gemini-based models, have been used to assist in solving open problems and generating new proofs in theoretical computer science and other fields. Through case studies, the authors identify common techniques for effective human-AI collaboration, such as iterative refinement and problem decomposition. Key findings include the model's ability to act as a rigorous adversarial reviewer and to be embedded in a neuro-symbolic loop for autonomous code execution and verification, demonstrating AI's potential as a creative partner in scientific discovery rather than just an automation tool.

本文探讨了高级AI模型，特别是基于Gemini的模型，如何被用于解决理论计算机科学及其他领域中的开放问题和生成新的证明。通过案例研究，作者确定了有效的人机协作技术，如迭代细化和问题分解。主要发现包括模型能够作为严格的对抗性审查员发挥作用，并被嵌入到神经-符号循环中以实现自主代码执行和验证，展示了AI作为科学发现过程中创造性的伙伴而非仅仅自动化工具的潜力。

CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion

Authors: Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez

First: 2025-12-22T16:21:39+00:00 · Latest: 2026-03-06T18:46:27+00:00

Comments: updated with improved CA results

Abstract

Vision-language models (VLMs) are commonly trained by directly inserting image tokens from a pretrained vision encoder into the text stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes rapidly costly for long multi-image conversations or streaming video applications, both in terms of memory and compute. VLMs leveraging cross-attention (CA) are an efficient alternative to token insertion as image tokens are not added to the KV cache. Despite being introduced early on, multimodal CA models are scarce in the current VLM literature and often underperform their token insertion counterparts. In this work, we reinvestigate the effectiveness of cross-attention for vision-language modeling: (i) We analyze the core differences between the cross-attention and self-attention mechanisms, (ii) we train cross-attention VLMs both from a text-only LLM and by adapting a pretrained insertion-based VLM, showing that simple cross-attention is far more competitive with token insertion than previously reported, and (iii) we demonstrate the practical advantages of cross-attention on real-time video captioning, where it naturally maintains low latency and near-constant memory cost. For samples and code, please see our project page at https://kyutai.org/casa .
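
The memory argument in this abstract can be illustrated with a back-of-the-envelope count of how many token positions end up in the language model's KV cache. The sketch below is a simplified accounting, not CASA's implementation: the function name and all numbers are hypothetical, and it ignores whatever memory the cross-attention layers need for the image features of the current frame.

```python
def kv_cache_tokens(text_tokens, image_tokens_per_frame, num_frames, insertion):
    """Count token positions held in the language model's self-attention KV cache."""
    if insertion:
        # Token insertion: every image token joins the text stream and is cached.
        return text_tokens + image_tokens_per_frame * num_frames
    # Cross-attention: image tokens are attended to separately and are not
    # added to the KV cache, so only text positions accumulate.
    return text_tokens


# Made-up streaming-video scenario: 200 text tokens, 256 image tokens per frame.
insertion_cost = kv_cache_tokens(200, 256, num_frames=100, insertion=True)
cross_attn_cost = kv_cache_tokens(200, 256, num_frames=100, insertion=False)
```

Under these assumptions the insertion cache grows linearly with the number of frames, while the cross-attention cache depends only on the text length, which matches the near-constant memory cost the abstract reports for real-time captioning.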

中文标题/摘要

标题：CASA：自注意力上的交叉注意力高效视觉-语言融合

视觉-语言模型（VLMs）通常通过将预训练视觉编码器中的图像令牌直接插入语言模型的文字流中进行训练。这使得文本和图像信息能够在模型内部完全相互注意，但在长多图像对话或流式视频应用中，这变得迅速昂贵，从内存和计算资源上都是如此。利用交叉注意力（CA）的VLMs是令牌插入的高效替代方案，因为图像令牌不会被添加到KV缓存中。尽管早在引入，多模态CA模型在当前的VLM文献中仍然很少见，并且通常不如其令牌插入的对应物表现好。在本文中，我们重新调查了交叉注意力在视觉-语言建模中的有效性：（i）我们分析了交叉注意力和自注意力机制的核心差异，（ii）我们从仅文本的大语言模型和通过调整预训练的插入式VLM训练交叉注意力VLMs，表明简单的交叉注意力比之前报告的更具有竞争力，（iii）我们展示了交叉注意力在实时视频字幕中的实际优势，它自然地保持了低延迟和近似恒定的内存成本。有关样本和代码，请参见我们的项目页面 https://kyutai.org/casa 。

Summary / 总结

The research revisits cross-attention (CA) as an efficient alternative to token insertion in vision-language models: because image tokens are not added to the KV cache, CA avoids much of the memory and compute overhead of insertion. The study analyzes the core differences between cross-attention and self-attention and trains CA models both from a text-only LLM and by adapting a pretrained insertion-based VLM, showing that simple cross-attention is far more competitive with token insertion than previously reported. On real-time video captioning, cross-attention maintains low latency and near-constant memory cost.

该论文重新审视了交叉注意力（CA）在视觉语言模型（VLM）中的有效性，并分析了它与自注意力机制的核心差异。作者分别从纯文本语言模型出发以及通过改造预训练的插入式VLM来训练交叉注意力VLM，结果表明简单的交叉注意力与令牌插入方法相比的竞争力远高于此前的报告。在实时视频字幕生成中，交叉注意力能保持低延迟和近似恒定的内存成本。

ContextBench: Modifying Contexts for Targeted Latent Activation

Authors: Robert Graham, Edward Stevinson, Leo Richter, Alexander Chia, Joseph Miller, Joseph Isaac Bloom

Venue: ICLR 2026

First: 2025-06-15T16:54:09+00:00 · Latest: 2026-03-06T18:37:24+00:00

Comments: Published at ICLR 2026

Abstract

Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases. We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as context modification and present ContextBench -- a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives. We enhance Evolutionary Prompt Optimisation (EPO) with LLM-assistance and diffusion model inpainting, and demonstrate that these variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.

中文标题/摘要

标题：ContextBench：修改上下文以针对激活目标潜在特征

识别能够触发语言模型特定行为或潜在特征的输入可能具有广泛的安全应用案例。我们研究了一类能够生成目标明确、语言流畅的输入的方法，这些输入可以激活特定的潜在特征或引发模型行为。我们将这种方法形式化为上下文修改，并介绍了ContextBench——一个基准测试，评估核心方法能力和潜在的安全应用。我们的评估框架衡量了引发强度（激活潜在特征或行为）和语言流畅性，突显了当前最先进的方法在平衡这些目标方面存在的困难。我们通过LLM辅助和扩散模型补全改进了进化提示优化（EPO），并证明这些变体在平衡引发效果和流畅性方面达到了最先进的性能。

Summary / 总结

The research aims to identify inputs that can trigger specific behaviors or latent features in language models for safety purposes. The study introduces ContextBench, a benchmark that evaluates the ability of methods to generate linguistically fluent and targeted inputs. The evaluation shows that current state-of-the-art methods struggle to balance elicitation strength and linguistic fluency. The researchers enhance Evolutionary Prompt Optimisation with LLM-assistance and diffusion model inpainting, achieving state-of-the-art performance in balancing these objectives.

研究旨在识别能够触发语言模型特定行为或潜在特征的输入，以服务于安全用途。研究引入了ContextBench基准，评估方法生成有针对性且语言流畅的输入的能力。评估结果显示，当前最先进的方法难以平衡触发强度和语言流畅性。研究人员通过LLM辅助和扩散模型补全（inpainting）增强进化提示优化，在平衡这些目标方面达到了最先进的性能。

LiveSense: A Real-Time Wi-Fi Sensing Platform for Range-Doppler on COTS Laptop

Authors: Jessica Sanson, Rahul C. Shah, Maximilian Pinaroc, Cagri Tanriover, Valerio Frascolla

First: 2026-03-06T18:33:14+00:00 · Latest: 2026-03-06T18:33:14+00:00

Abstract

We present LiveSense - a cross-platform system that transforms a commercial off-the-shelf (COTS) Wi-Fi Network Interface Card (NIC) on a laptop into a centimeter-level Range-Doppler sensor while preserving simultaneous communication capability. The laptops are equipped with COTS Intel AX211 (Wi-Fi 6E) or Intel BE201 (Wi-Fi 7) NICs. LiveSense can (i) extract fully synchronized channel state information (CSI) at >= 40 Hz, (ii) perform time-phase alignment and self-interference cancellation on-device, and (iii) provide a real-time stream of range, Doppler, subcarrier magnitude/phase, and annotated video frames to a Python/Qt Graphical User Interface (GUI). The demo will showcase the ability to detect (i) the distance and radial velocity of attendees within a few meters of the device, (ii) micro-motion (respiration), and (iii) hand-gesture ranging. To the best of our knowledge, this is the first-ever demo to obtain accurate range information of targets from commercial Wi-Fi, despite the limited 160 MHz bandwidth.
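
The phrase "despite the limited 160 MHz bandwidth" is easier to appreciate with the textbook radar range-resolution relation, ΔR = c / (2B), which gives the minimum spacing at which two targets can be separated in range for a signal of bandwidth B. The sketch below only evaluates that standard formula; it says nothing about how LiveSense itself reaches centimeter-level results, and the function name is an assumption.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def range_resolution_m(bandwidth_hz):
    """Classical range resolution c / (2B) for a signal of bandwidth B."""
    return SPEED_OF_LIGHT / (2.0 * bandwidth_hz)


# 160 MHz is the channel bandwidth quoted in the abstract.
resolution = range_resolution_m(160e6)  # roughly 0.94 m
```

Resolution (separating two nearby targets) is a different quantity from the accuracy with which a single target's distance can be estimated, which is why a resolution of about 0.94 m does not by itself contradict the centimeter-level claim.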

中文标题/摘要

标题：LiveSense：基于商用现成笔记本电脑的实时Wi-Fi距离-多普勒感知平台

我们介绍了LiveSense - 一个跨平台系统，能够将笔记本电脑上的商用现成（COTS）Wi-Fi网络接口卡（NIC）转变为厘米级距离-多普勒（Range-Doppler）传感器，同时保留并行通信能力。笔记本电脑配备了COTS Intel AX211（Wi-Fi 6E）或Intel BE201（Wi-Fi 7）NIC。LiveSense可以（i）以>=40 Hz的频率提取完全同步的信道状态信息（CSI），（ii）在设备上执行时间-相位对齐和自干扰消除，以及（iii）向Python/Qt图形用户界面（GUI）提供实时的距离、多普勒、子载波幅度/相位和标注视频帧流。演示将展示LiveSense能够检测（i）设备几米范围内参会者的距离和径向速度，（ii）微运动（呼吸），以及（iii）手势测距。据我们所知，这是首次通过商用Wi-Fi获得目标准确距离信息的演示，尽管其带宽仅为160 MHz。

Summary / 总结

LiveSense is a platform that converts a COTS Wi-Fi NIC on a laptop into a centimeter-level Range-Doppler sensor while preserving simultaneous communication. It extracts fully synchronized CSI at >= 40 Hz, performs time-phase alignment and self-interference cancellation on-device, and streams real-time range, Doppler, subcarrier magnitude/phase, and annotated video frames to a Python/Qt GUI. The demo showcases detection of attendees' distance and radial velocity within a few meters of the device, micro-motion (respiration), and hand-gesture ranging. The authors describe it as, to the best of their knowledge, the first demo to obtain accurate target range information from commercial Wi-Fi despite the limited 160 MHz bandwidth.

LiveSense 将笔记本电脑上的商用现成 Wi-Fi 网络接口卡转换为厘米级的距离-多普勒（Range-Doppler）传感器，同时保留通信能力。它以不低于 40 Hz 的速率提取完全同步的信道状态信息 (CSI)，在设备上执行时间-相位对齐和自干扰消除，并向 Python/Qt GUI 提供实时的距离、多普勒、子载波幅度/相位和标注视频帧数据。演示内容包括检测设备几米范围内参会者的距离和径向速度、微运动（呼吸）以及手势测距；据作者所知，这是首个在仅 160 MHz 带宽下利用商用 Wi-Fi 获得准确目标距离信息的演示。
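
For context on the "limited 160 MHz bandwidth" remark, the textbook radar relation delta_R = c / (2B) gives the nominal two-way range resolution for a signal of bandwidth B. The sketch below only evaluates that general formula. The function name is hypothetical, and nothing here is taken from the LiveSense implementation, whose processing chain is not described in the abstract.

```python
# Nominal range resolution from signal bandwidth (textbook relation, not LiveSense code).
SPEED_OF_LIGHT_M_S = 299_792_458.0


def nominal_range_resolution_m(bandwidth_hz: float) -> float:
    """Return c / (2 * B), the nominal two-way range resolution in meters."""
    if bandwidth_hz <= 0:
        raise ValueError("bandwidth_hz must be positive")
    return SPEED_OF_LIGHT_M_S / (2.0 * bandwidth_hz)


# 160 MHz is the bandwidth quoted in the abstract.
resolution_160mhz = nominal_range_resolution_m(160e6)
print(f"Nominal range resolution at 160 MHz: {resolution_160mhz:.3f} m")
```

The result is a little under one meter, which is why the abstract presents accurate ranging at this bandwidth as notable.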

Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving

Authors: Yuhan Zhou, Mehri Sattari, Haihua Chen, Kewei Sha

First: 2026-03-06T18:31:10+00:00 · Latest: 2026-03-06T18:31:10+00:00

Comments: This paper has been accepted by the Fourth IEEE International Conference on Mobility: Operations, Services, and Technologies (MOST) 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal ($M^2$) data to support real-time decision-making. In practice, data quality (DQ) varies across sources and modalities due to environmental conditions and sensor limitations, yet AV research has largely prioritized algorithm design over DQ analysis. This work focuses on redundancy as a fundamental but underexplored DQ issue in AV datasets. Using the nuScenes and Argoverse 2 (AV2) datasets, we model and measure redundancy in multisource camera data and multimodal image-LiDAR data, and evaluate how removing redundant labels affects the YOLOv8 object detection task. Experimental results show that selectively removing redundant multisource image object labels from cameras with shared fields of view improves detection. In nuScenes, mAP$_{50}$ gains from $0.66$ to $0.70$, $0.64$ to $0.67$, and from $0.53$ to $0.55$, on three representative overlap regions, while detection on other overlapping camera pairs remains at the baseline even under stronger pruning. In AV2, $4.1$-$8.6\%$ of labels are removed, and mAP$_{50}$ stays near the $0.64$ baseline. Multimodal analysis also reveals substantial redundancy between image and LiDAR data. These findings demonstrate that redundancy is a measurable and actionable DQ factor with direct implications for AV performance. This work highlights the role of redundancy as a data quality factor in AV perception and motivates a data-centric perspective for evaluating and improving AV datasets. Code, data, and implementation details are publicly available at: https://github.com/yhZHOU515/RedundancyAD

中文标题/摘要

标题：面向自动驾驶的多源多模态数据冗余建模与测量

下一代自动驾驶车辆（AV）依赖大量多源和多模态（$M^2$）数据以支持实时决策。实践中，由于环境条件和传感器限制，数据质量（DQ）在不同来源和模态之间存在差异，然而AV研究主要侧重于算法设计而非DQ分析。本研究聚焦于冗余这一AV数据集中基础但未被充分探索的DQ问题。使用nuScenes和Argoverse 2（AV2）数据集，我们对多源相机数据和多模态图像-LiDAR数据进行了冗余建模与测量，并评估了去除冗余标签对YOLOv8目标检测任务的影响。实验结果显示，从具有共享视场的相机中选择性去除冗余的多源图像目标标签可以提高检测效果。在nuScenes中，三个代表性重叠区域的mAP$_{50}$分别从$0.66$提高到$0.70$，从$0.64$提高到$0.67$，从$0.53$提高到$0.55$，而在其他重叠相机对上，即使在更强的剪枝下，检测效果仍保持在基线水平。在AV2中，去除了$4.1$-$8.6\%$的标签，mAP$_{50}$保持在$0.64$基线附近。多模态分析还揭示了图像和LiDAR数据之间存在大量冗余。这些发现表明，冗余是一个可测量、可操作的DQ因素，对AV性能有直接影响。本研究强调了冗余作为AV感知中数据质量因素的作用，并推动以数据为中心的视角来评估和改进AV数据集。代码、数据和实现细节公开于：https://github.com/yhZHOU515/RedundancyAD

Summary / 总结

This work addresses redundancy as a fundamental but underexplored data quality issue in autonomous driving datasets. Using the nuScenes and Argoverse 2 (AV2) datasets, the authors model and measure redundancy in multisource camera data and multimodal image-LiDAR data. Selectively removing redundant labels from cameras with shared fields of view improves YOLOv8 object detection: in nuScenes, mAP$_{50}$ rises from 0.66 to 0.70, from 0.64 to 0.67, and from 0.53 to 0.55 on three representative overlap regions, while other overlapping camera pairs stay at the baseline even under stronger pruning. In AV2, 4.1-8.6% of labels are removed and mAP$_{50}$ stays near the 0.64 baseline, which indicates that redundancy is a measurable and actionable data quality factor.

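该研究将冗余视为自动驾驶数据集中一个基础但尚未充分探索的数据质量问题。作者使用nuScenes和AV2数据集，对多源相机数据和图像-激光雷达数据中的冗余进行了建模和测量，并评估了去除冗余标签对YOLOv8目标检测任务的影响。结果显示，选择性地去除共享视场相机中的冗余标签可以提高检测效果：在nuScenes的三个代表性重叠区域，mAP$_{50}$分别从0.66提高到0.70、从0.64提高到0.67、从0.53提高到0.55，其他重叠相机对即使在更强的剪枝下也保持在基线水平；在AV2中去除4.1-8.6%的标签后，mAP$_{50}$仍保持在0.64基线附近。这表明冗余是可测量、可操作的数据质量因素，对自动驾驶性能有直接影响。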

RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering

Authors: Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo

First: 2026-03-06T18:29:15+00:00 · Latest: 2026-03-06T18:29:15+00:00

Abs · PDF · Code1 · Code2

Abstract

Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and questions span multiple intents and question formats. Existing biomedical audio-language QA systems are typically monolithic, without any specialization mechanisms for tackling diverse respiratory corpora and query intents. They are also only validated in limited settings, leaving it unclear how reliably they handle the shifts encountered in real-world settings. To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts.

中文标题/摘要

标题：RAMoEA-QA：呼吸音频问答的分层专业化

对话生成型AI正迅速进入医疗保健领域，在此领域中，通用模型必须整合异质患者信号并支持多种交互方式，同时生成临床意义的结果。在呼吸护理中，非侵入性音频，如通过移动麦克风捕获的录音，可实现可扩展的筛查和纵向监测，但异质性挑战尤为严峻：录音在设备、环境和采集协议方面差异巨大，问题涉及多种意图和问题格式。现有的生物医学音频-语言问答系统通常为单一结构，缺乏针对多样化的呼吸数据集和查询意图的专业化机制。它们也仅在有限的环境中进行了验证，因此在实际场景中如何可靠地处理这些变化尚不清楚。为解决这些局限性，我们提出了RAMoEA-QA，这是一种用于呼吸音频问答的分层路由生成模型，能够统一多种问题类型，并在单一多模态系统中支持离散和连续目标。RAMoEA-QA 应用了两阶段条件专业化：音频混合专家路由每段录音到合适的预训练音频编码器，语言混合适配器在共享冻结的大语言模型上选择一个LoRA适配器以匹配查询意图和答案格式。通过针对每个示例的专业化声学表示和生成行为，RAMoEA-QA 在参数量最小的情况下始终优于强大的基线和路由消融，提高领域内测试准确率至0.72（优于最先进的基线的0.61和0.67），并在领域、模态和任务转移下表现出最强的泛化能力。

Summary / 总结

RAMoEA-QA is a hierarchically routed generative model for respiratory audio question answering. It applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. The model consistently outperforms strong baselines and routing ablations with minimal parameter overhead, reaching an in-domain test accuracy of 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and showing the strongest generalization for diagnosis under domain, modality, and task shifts.

RAMoEA-QA 是一种用于呼吸音频问题回答的分层生成模型，采用两阶段专业化机制，其中音频混合专家将录音路由到合适的编码器，语言混合适配器选择适配器以匹配查询意图。该模型显著优于强基线，实现领域内测试准确率为0.72，而最先进的基线分别为0.61和0.67，并在各种变化下表现出强大的泛化能力。

Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering

Authors: Julius Gun, Timo Oksanen

Venue: Technical University of Munich. 2026. ISBN 978-3-911430-11-1. https://mediatum.ub.tum.de/1845092

First: 2025-08-25T14:54:46+00:00 · Latest: 2026-03-06T18:23:23+00:00

Abs · PDF · Code1 · Code2

Abstract

We present a case study evaluating large language models (LLMs) with 128K-token context windows on a technical question answering (QA) task. Our benchmark is built on a user manual for an agricultural machine, available in English, French, and German. It simulates a cross-lingual information retrieval scenario where questions are posed in English against all three language versions of the manual. The evaluation focuses on realistic "needle-in-a-haystack" challenges and includes unanswerable questions to test for hallucinations. We compare nine long-context LLMs using direct prompting against three Retrieval-Augmented Generation (RAG) strategies (keyword, semantic, hybrid), with an LLM-as-a-judge for evaluation. Our findings for this specific manual show that Hybrid RAG consistently outperforms direct long-context prompting. Models like Gemini 2.5 Flash and the smaller Qwen 2.5 7B achieve high accuracy (over 85%) across all languages with RAG. This paper contributes a detailed analysis of LLM performance in a specialized industrial domain and an open framework for similar evaluations, highlighting practical trade-offs and challenges.

中文标题/摘要

标题：Agri-Query：跨语言技术问答中RAG与长上下文LLM的案例研究

我们开展了一项案例研究，评估具有128K token上下文窗口的大语言模型（LLM）在技术问答（QA）任务上的表现。基准测试基于一份农业机械用户手册，该手册有英文、法文和德文三种语言版本。该测试模拟了跨语言信息检索场景：问题用英文提出，针对手册的全部三种语言版本作答。评估重点在于现实中的“大海捞针”挑战，并包括无法回答的问题以测试幻觉。我们将九种采用直接提示的长上下文LLM与三种检索增强生成（RAG）策略（关键词、语义、混合）进行对比，并使用LLM作为评判者进行评估。对于这份特定的手册，我们发现混合RAG始终优于直接长上下文提示。Gemini 2.5 Flash和较小的Qwen 2.5 7B等模型在使用RAG时，所有语言的准确率均较高（超过85%）。本文贡献了对专业工业领域中LLM性能的详细分析，以及一个可用于类似评估的开放框架，并突出了实际的权衡和挑战。

Summary / 总结

This case study evaluates large language models with 128K-token context windows on a cross-lingual technical QA task built on an agricultural machine user manual available in English, French, and German, with questions posed in English. It compares nine long-context LLMs under direct prompting against three RAG strategies (keyword, semantic, hybrid), using an LLM-as-a-judge for evaluation. For this specific manual, Hybrid RAG consistently outperforms direct long-context prompting, and models such as Gemini 2.5 Flash and the smaller Qwen 2.5 7B achieve over 85% accuracy across all languages with RAG. The study also contributes an open framework for similar evaluations and highlights practical trade-offs and challenges in a specialized industrial domain.

该研究评估了具有128K token上下文窗口的大语言模型在基于农业机械用户手册（英文、法文、德文）的跨语言技术问答任务上的表现。它将九种长上下文LLM的直接提示与三种RAG策略进行了比较，发现对于这份手册，混合RAG始终优于直接长上下文提示；Gemini 2.5 Flash和Qwen 2.5 7B等模型在使用RAG时在所有语言上的准确率均超过85%。研究强调了在专业工业领域中LLM性能的实际挑战和权衡。

Spatial Calibration of Diffuse LiDARs

Authors: Nikhil Behari, Ramesh Raskar

First: 2026-03-06T18:18:07+00:00 · Latest: 2026-03-06T18:18:07+00:00

Abs · PDF · Code1 · Code2

Abstract

Diffuse direct time-of-flight LiDARs report per-pixel depth histograms formed by aggregating photon returns over a wide instantaneous field of view, violating the single-ray assumption behind standard LiDAR-RGB calibration. We present a simple spatial calibration procedure that estimates, for each diffuse LiDAR pixel, its footprint (effective support region) and relative spatial sensitivity in a co-located RGB image plane. Using a scanned retroreflective patch with background subtraction, we recover per-pixel response maps that provide an explicit LiDAR-to-RGB correspondence for cross-modal alignment and fusion. We demonstrate the method on the ams OSRAM TMF8828.

中文标题/摘要

标题：漫反射LiDAR的空间校准

漫反射直接飞行时间（dToF）LiDAR报告的逐像素深度直方图由宽瞬时视场内的光子回波聚合而成，这违背了标准LiDAR-RGB校准所依赖的单射线假设。我们提出了一种简单的空间校准方法，为每个漫反射LiDAR像素估计其在共置RGB图像平面中的足迹（有效支撑区域）和相对空间灵敏度。通过扫描回射贴片并结合背景减除，我们恢复了逐像素响应图，为跨模态对齐和融合提供了显式的LiDAR到RGB对应关系。我们在ams OSRAM TMF8828上演示了该方法。

AV-Unified: A Unified Framework for Audio-visual Scene Understanding

Authors: Guangyao Li, Xin Wang, Wenwu Zhu

First: 2026-03-06T18:16:30+00:00 · Latest: 2026-03-06T18:16:30+00:00

Comments: Accepted by IEEE Transactions on Multimedia (TMM)

Abs · PDF · Code1 · Code2

Abstract

When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current works such as event localization, parsing, segmentation and question answering are mostly explored individually, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose AV-Unified, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets. Considering the varying temporal granularity of audio-visual events, a multi-scale temporal perception module is designed to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model's adaptability and task-awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.

中文标题/摘要

标题：AV-Unified：视听场景理解的统一框架

当人类感知世界时，他们自然会在动态的真实世界场景中整合多种视听任务。然而，当前的工作（如事件定位、解析、分割和问答）大多单独探索，这使得全面理解复杂的视听场景和探索任务间的关系变得困难。因此，我们提出了**AV-Unified**，一种统一框架，能够跨多种视听场景理解任务进行联合学习。AV-Unified 标准化了每个任务的多样输入输出格式，并结合多尺度时空感知网络，有效捕捉视听关联。具体来说，我们通过将所有支持任务的输入和输出统一为离散标记序列，建立共享表示，使得单一架构能够在异构数据集上联合训练。考虑到视听事件的时间粒度差异，我们设计了多尺度时间感知模块来捕捉关键线索。同时，为克服视觉领域缺乏听觉监督的问题，我们设计了一种跨模态引导的空间感知模块，建模空间视听关联。此外，使用任务特定的文本提示来增强模型的适应性和任务意识。在基准数据集（如AVE、LLP、MUSIC-AVQA、VGG-SS和AVS）上的广泛实验表明，AV-Unified 在时间、空间和时空任务上均表现出有效性。

Summary / 总结

AV-Unified is a unified framework for joint learning across audio-visual scene understanding tasks such as event localization, parsing, segmentation, and question answering. It converts the inputs and outputs of all supported tasks into sequences of discrete tokens, uses a multi-scale temporal perception module and a cross-modal guidance-based spatial perception module to capture audio-visual associations, and employs task-specific text prompts. Experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS, and AVS) demonstrate its effectiveness across temporal, spatial, and spatiotemporal tasks.

论文提出了AV-Unified统一框架，用于跨事件定位、解析、分割和问答等多种音频-视觉场景理解任务的联合学习。该框架标准化了输入输出格式，并使用多尺度时空感知网络来捕捉音频-视觉关联。该框架在不同任务上的基准数据集上展示了有效性。

CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning

Authors: Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, Ji-Rong Wen, Rui Yan

First: 2026-02-27T16:19:45+00:00 · Latest: 2026-03-06T18:07:28+00:00

Abs · PDF · Code1 · Code2

Abstract

Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, action decision and action function. However, existing agents struggle to achieve both decoupled enhancement and balanced integration of these capabilities. To address these challenges, we propose Channel-of-Mobile-Experts (CoME), a novel agent architecture consisting of four distinct experts, each aligned with a specific reasoning stage. CoME activates the corresponding expert to generate output tokens in each reasoning stage via output-oriented activation. To empower CoME with hybrid-capabilities reasoning, we introduce a progressive training strategy: Expert-FT enables decoupling and enhancement of different experts' capabilities; Router-FT aligns expert activation with the different reasoning stages; CoT-FT facilitates seamless collaboration and balanced optimization across multiple capabilities. To mitigate error propagation in hybrid-capabilities reasoning, we propose InfoGain-Driven DPO (Info-DPO), which uses information gain to evaluate the contribution of each intermediate step, thereby guiding CoME toward more informative reasoning. Comprehensive experiments show that CoME outperforms dense mobile agents and MoE methods on both AITZ and AMEX datasets.

中文标题/摘要

标题：CoME：赋能移动专家的信息混合能力推理

移动代理可以自主执行用户指令，这需要混合能力推理，包括屏幕摘要、子任务规划、动作决策和动作函数。然而，现有的代理在实现这些能力的解耦增强和平衡集成方面存在困难。为了解决这些挑战，我们提出了移动专家通道（CoME），这是一种新型的代理架构，由四个不同的专家组成，每个专家都与特定的推理阶段对齐。CoME在每个推理阶段通过输出导向激活激活相应的专家以生成输出令牌。为了赋予CoME混合能力推理能力，我们引入了一种渐进式训练策略：Expert-FT使不同专家的能力解耦和增强；Router-FT将专家激活与不同的推理阶段对齐；CoT-FT促进多个能力之间的无缝协作和平衡优化。为了减轻混合能力推理中的错误传播，我们提出了基于信息增益的DPO（Info-DPO），它使用信息增益来评估每个中间步骤的贡献，从而引导CoME进行更具信息量的推理。全面的实验表明，CoME在AITZ和AMEX数据集上均优于密集移动代理和MoE方法。

Summary / 总结

The research aims to improve mobile agents' ability to autonomously execute user instructions by addressing the decoupled enhancement and balanced integration of hybrid-capabilities reasoning (screen summary, subtask planning, action decision, and action function). The proposed Channel-of-Mobile-Experts (CoME) architecture consists of four experts, each aligned with a specific reasoning stage, and is trained progressively with Expert-FT, Router-FT, and CoT-FT. InfoGain-Driven DPO (Info-DPO) uses information gain to evaluate the contribution of each intermediate step and mitigate error propagation. CoME outperforms dense mobile agents and MoE methods on both the AITZ and AMEX datasets.

研究旨在通过解决混合能力推理中的解耦增强和平衡集成问题，提高移动代理自主执行用户指令的能力。提出的CoME架构包括四个与特定推理阶段对齐的专家，并引入了渐进式训练策略来增强和对齐这些专家。实验结果表明，CoME在AITZ和AMEX数据集上优于密集移动代理和MoE方法。

The Limits of Long-Context Reasoning in Automated Bug Fixing

Authors: Ravi Raju, Mengmeng Ji, Shubhangi Upasani, Bo Li, Urmish Thakker

Venue: ICLR 2026

First: 2026-02-17T22:51:40+00:00 · Latest: 2026-03-06T18:01:03+00:00

Comments: Accepted to ICLR 2026 ICBINB workshop

Abs · PDF · Code1 · Code2

Abstract

Rapidly increasing context lengths have led to the assumption that large language models (LLMs) can directly reason over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled experimental setting, we first evaluate state-of-the-art models within an agentic harness (mini-SWE-agent), where performance improves substantially: GPT-5-nano achieves up to a 31% resolve rate on 100 samples, and open-source models such as Deepseek-R1-0528 obtain competitive results. However, token-level analysis shows that successful agentic trajectories typically remain under 20k-30k tokens, and that longer accumulated contexts correlate with lower success rates, indicating that agentic success primarily arises from task decomposition into short-context steps rather than effective long-context reasoning. To directly test long-context capability, we construct a data pipeline where we artificially inflate the context length of the input by placing the relevant files into the context (ensuring perfect retrieval recall); we then study single-shot patch generation under genuinely long contexts (64k tokens). Despite this setup, performance degrades sharply: Qwen3-Coder-30B-A3B achieves only a 7% resolve rate at 64k context, while GPT-5-nano solves none of the tasks. Qualitative analysis reveals systematic failure modes, including hallucinated diffs, incorrect file targets, and malformed patch headers. Overall, our findings highlight a significant gap between nominal context length and usable context capacity in current LLMs, and suggest that existing agentic coding benchmarks do not meaningfully evaluate long-context reasoning.

中文标题/摘要

标题：自动化错误修复中长上下文推理的局限性

上下文长度的迅速增长使人们认为大型语言模型（LLMs）能够直接对整个代码库进行推理。同时，LLMs 的最新进展使其在软件工程基准测试中表现出色，尤其是在与代理型工作流结合使用时。在本研究中，我们系统地评估当前的LLMs是否能够可靠地进行长上下文代码调试和补丁生成。使用SWE-bench Verified作为受控实验环境，我们首先在代理型框架（mini-SWE-agent）中评估最先进的模型，结果显示性能显著提升：GPT-5-nano在100个样本上的解决率最高达到31%，开源模型如Deepseek-R1-0528也取得了有竞争力的结果。然而，token级分析表明，成功的代理型轨迹通常保持在20k-30k token以下，而更长的累积上下文与较低的成功率相关，表明代理型成功主要来自于将任务分解为短上下文步骤，而不是有效的长上下文推理。为了直接测试长上下文能力，我们构建了一个数据管道，通过将相关文件放入上下文（确保完美检索召回率）来人为增加输入的上下文长度；然后在真正长的上下文中（64k token）研究单次补丁生成。尽管如此，性能急剧下降：Qwen3-Coder-30B-A3B在64k上下文中仅解决7%的任务，而GPT-5-nano没有解决任何任务。定性分析揭示了系统性的失败模式，包括幻觉生成的diff、错误的文件目标和格式错误的补丁头。总体而言，我们的研究结果突显了当前LLMs名义上下文长度与可用上下文容量之间的显著差距，并表明现有的代理型编程基准未能实质性地评估长上下文推理。

Summary / 总结

This study evaluates the capability of large language models (LLMs) to perform long-context code debugging and patch generation. Using SWE-bench Verified, the research finds that while performance improves with agentic workflows, successful trajectories typically remain under 20k-30k tokens, and longer contexts correlate with lower success rates. Directly testing long-context capability by inflating input context length, the study shows that performance degrades sharply, with models like Qwen3-Coder-30B-A3B achieving only 7% resolve rate at 64k tokens. The findings indicate a significant gap between nominal and usable context capacity in current LLMs, and suggest that existing benchmarks do not adequately evaluate long-context reasoning.

研究评估了大型语言模型（LLMs）在长上下文代码调试和补丁生成中的能力。使用SWE-bench Verified，研究发现虽然通过代理工作流可以提高性能，但成功的轨迹通常保持在20k-30k个标记以内，而更长的上下文与较低的成功率相关。通过增加输入上下文长度直接测试长上下文能力，研究显示性能急剧下降，例如Qwen3-Coder-30B-A3B在64k标记上下文下的解决率为7%。研究结果表明，当前LLMs在名义上下文长度和可用上下文容量之间存在显著差距，并暗示现有的基准测试未能充分评估长上下文推理能力。

Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

Authors: Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum

Venue: ICLR 2026

First: 2025-10-23T17:57:28+00:00 · Latest: 2026-03-06T18:00:11+00:00

Comments: ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

Many emerging applications of AI--from scientific discovery to medical diagnosis--require agents to seek information strategically: forming hypotheses, asking targeted questions, and making decisions under uncertainty. In high-stakes settings with limited resources, do language models (LMs) behave like rational agents? Drawing on insights from human cognition, we develop methods to evaluate and enhance agentic information-seeking. First, we introduce a decision-oriented dialogue task called Collaborative Battleship, in which a Captain must balance exploration (asking questions) and action (taking shots), while a Spotter must supply accurate, contextually-grounded answers. Compared to human players (N=42), we find that many LM agents struggle to ask informative questions, produce accurate answers, and identify high-utility actions. To address these gaps, we develop novel Monte Carlo inference strategies for LMs inspired by Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303-0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% -> 82% win rate) and frontier models (0% -> 67% win rate vs. GPT-5) at ~1% of GPT-5's cost. We replicate these findings on Guess Who?, where our methods significantly boost accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for building information-seeking agents.

中文标题/摘要

标题：先射击，后提问？构建像人一样探索与行动的理性智能体

许多新兴的AI应用（从科学发现到医学诊断）要求智能体有策略地寻求信息：形成假设、提出有针对性的问题，并在不确定性下做出决策。在资源有限的高风险环境中，语言模型（LM）的行为是否像理性智能体？借鉴人类认知的见解，我们开发了评估和增强智能体信息寻求能力的方法。首先，我们引入了一种以决策为导向的对话任务“协作战舰”（Collaborative Battleship），其中船长（Captain）必须在探索（提问）和行动（射击）之间取得平衡，而观察员（Spotter）必须提供准确且基于上下文的答案。与人类玩家（N=42）相比，我们发现许多LM智能体难以提出有信息量的问题、给出准确的答案并识别高效用的行动。为弥补这些差距，我们受贝叶斯实验设计（BED）启发，为LM开发了新型蒙特卡洛推理策略。对于观察员智能体，我们的方法相对仅用LM的基线将准确率绝对提升最多14.7%；对于船长智能体，它将预期信息增益（EIG）最多提高0.227比特（达到可实现噪声上限的94.2%）。这些组件结合后带来更精准的目标选择（+0.303-0.374 F1），并使较弱的LM（如Llama-4-Scout）以约为GPT-5成本1%的代价，同时超越人类（胜率8% -> 82%）和前沿模型（对阵GPT-5的胜率0% -> 67%）。我们在Guess Who?上复现了这些发现，我们的方法显著提高了准确率（+28.3-42.4个百分点），证明了其在构建信息寻求智能体方面的普遍适用性。

Summary / 总结

This study evaluates and enhances the information-seeking behavior of language models (LMs) in strategic decision-making tasks. It introduces a decision-oriented dialogue task called Collaborative Battleship, in which a Captain balances exploration (asking questions) and action (taking shots) while a Spotter supplies accurate, contextually grounded answers. Compared to human players (N=42), many LM agents struggle to ask informative questions, produce accurate answers, and identify high-utility actions. The authors develop Monte Carlo inference strategies inspired by Bayesian Experimental Design (BED), which boost Spotter accuracy by up to 14.7% absolute over LM-only baselines and raise the Captain's expected information gain (EIG) by up to 0.227 bits. Combined, these methods enable weaker LMs such as Llama-4-Scout to outperform both humans and frontier models at about 1% of GPT-5's cost, and the findings replicate on Guess Who? (+28.3-42.4 p.p. accuracy).

该研究评估并提升了语言模型（LMs）在战略决策任务中的信息寻求行为。研究引入了一个名为“协作战舰”（Collaborative Battleship）的对话任务，其中船长需要在探索（提问）和行动（射击）之间取得平衡，观察员则需提供准确的答案。与人类玩家相比，LMs在提出有信息量的问题和提供准确答案方面常常表现不佳。研究开发了受贝叶斯实验设计启发的蒙特卡洛推理策略来改进LMs的表现：观察员智能体的准确率最多绝对提升14.7%，船长智能体的预期信息增益最多提高0.227比特。这些方法还使较弱的LMs在协作战舰中超越人类和前沿模型，成本约为GPT-5的1%，并且相关发现在Guess Who?上得到复现。
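
To make the expected information gain (EIG) quantity concrete, the following is a minimal sketch of how the EIG of a yes/no question can be estimated from a set of sampled hypotheses. It is an illustration under stated assumptions, not the authors' method: the samples are treated as equally weighted, the answerer is modeled as flipping the true answer with a fixed probability `noise`, and all names (`expected_information_gain`, the toy ship-column hypotheses) are hypothetical.

```python
import math
from typing import Callable, Sequence, TypeVar

H = TypeVar("H")


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def expected_information_gain(
    hypotheses: Sequence[H],
    question: Callable[[H], bool],
    noise: float = 0.0,
) -> float:
    """Estimate EIG (bits) of a yes/no question from equally weighted samples.

    Assumes the observed answer is the true answer flipped with probability
    `noise` (a binary symmetric channel), so that
    EIG = H_b(P(observed yes)) - H_b(noise).
    """
    if not hypotheses:
        raise ValueError("need at least one sampled hypothesis")
    if not 0.0 <= noise <= 0.5:
        raise ValueError("noise must lie in [0, 0.5]")
    p_yes_true = sum(1 for h in hypotheses if question(h)) / len(hypotheses)
    p_yes_observed = noise + (1.0 - 2.0 * noise) * p_yes_true
    return binary_entropy(p_yes_observed) - binary_entropy(noise)


# Toy example: eight sampled hypotheses about which column hides a ship.
sampled_columns = list(range(8))
eig_half = expected_information_gain(sampled_columns, lambda col: col < 4)
eig_single = expected_information_gain(sampled_columns, lambda col: col == 0)
eig_noisy = expected_information_gain(sampled_columns, lambda col: col < 4, noise=0.1)
print(eig_half, eig_single, eig_noisy)
```

In this toy setting, a question that splits the samples evenly is worth more than one that isolates a single hypothesis, and answer noise lowers the achievable gain, which is the intuition behind reporting EIG relative to a noise ceiling.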

Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts

Authors: Kartik Sharma, Yiqiao Jin, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, Srijan Kumar

Venue: ICLR 2026

First: 2025-06-18T05:48:05+00:00 · Latest: 2026-03-06T17:55:20+00:00

Comments: ICLR 2026. Code available at https://github.com/Ksartik/sysformer

Abs · PDF · Code1 · Code2 · Code3

Abstract

As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behaviors, resulting in either unjustified refusals to harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguard LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate the impact of tailoring the system prompt to each specific user input on the safety of the responses. To this end, we propose Sysformer, a transformer model that updates an initial system prompt to a more robust system prompt in the LLM input embedding space while attending to the user prompt. While keeping the LLM parameters frozen, the Sysformer is trained to refuse to respond to a set of harmful prompts while responding ideally to a set of safe ones. Through extensive experiments on 5 LLMs from different families and 2 recent benchmarks, we demonstrate that Sysformer can significantly enhance the robustness of LLMs, leading to up to 80% gain in the refusal rate on harmful prompts while enhancing the compliance with the safe prompts by up to 90%. Results also generalize well to sophisticated jailbreaking attacks, making LLMs up to 100% more robust against different attack strategies. We hope our findings lead to cheaper safeguarding of LLMs and motivate future investigations into designing variable system prompts.

中文标题/摘要

标题：Sysformer：通过自适应系统提示保护冻结的大语言模型

随着大语言模型（LLMs）在安全关键环境中部署，确保其响应符合安全标准变得至关重要。先前的研究表明，LLMs往往无法理解安全行为的概念，导致对无害提示的不合理拒绝或生成有害内容。尽管已经做出了大量努力来提高其鲁棒性，但现有的防御措施往往依赖于昂贵的模型参数微调或采用次优的启发式技术。在本工作中，我们采取一种新颖的方法来保护LLMs：学习调整指令调优LLMs中的系统提示。虽然LLMs通常被预训练为遵循固定的系统提示，但我们研究了针对每个特定用户输入定制系统提示对响应安全性的影响。为此，我们提出了Sysformer，这是一种Transformer模型，它在关注用户提示的同时，在LLM输入嵌入空间中将初始系统提示更新为更鲁棒的系统提示。在冻结LLM参数的情况下，Sysformer被训练为拒绝一组有害提示，同时对一组安全提示作出理想响应。通过在来自不同家族的5种LLM和2个最新基准上进行广泛的实验，我们证明Sysformer可以显著增强LLMs的鲁棒性，使其在有害提示上的拒绝率提高多达80%，同时在安全提示上的合规性提高多达90%。结果还能很好地推广到复杂的越狱攻击，使LLMs对不同攻击策略的鲁棒性提高多达100%。我们希望我们的发现能够降低LLMs的防护成本，并激励未来关于设计可变系统提示的研究。

Summary / 总结

This work addresses the need to ensure that large language models (LLMs) generate safe responses in safety-critical applications. It introduces Sysformer, a transformer model that adapts the system prompt to each user input in the LLM input embedding space, enhancing safety while keeping the LLM parameters frozen. Experiments on five LLMs and two benchmarks show that Sysformer can significantly improve robustness, with up to an 80% gain in refusal rate on harmful prompts and up to a 90% gain in compliance with safe prompts. The results also generalize to sophisticated jailbreaking attacks, making LLMs up to 100% more robust against different attack strategies.

研究旨在通过调整指令调优的大语言模型中的系统提示来提升其安全性。方法是训练一个名为Sysformer的模型，在保持大语言模型参数不变的情况下，将初始系统提示更新为更鲁棒的系统提示。实验结果显示，Sysformer能够显著提高大语言模型的鲁棒性，对有害提示的拒绝率最多可提升80%，对安全提示的合规性最多可提升90%。该方法还能有效抵御复杂的越狱攻击，使大语言模型对各种攻击策略的鲁棒性最多提高100%。

SG-DOR: Learning Scene Graphs with Direction-Conditioned Occlusion Reasoning for Pepper Plants

Authors: Rohit Menon, Niklas Mueller-Goldingen, Sicong Pan, Gokul Krishna Chenchani, Maren Bennewitz

First: 2026-03-06T17:52:51+00:00 · Latest: 2026-03-06T17:52:51+00:00

Abs · PDF · Code1 · Code2

Abstract

Robotic harvesting in dense crop canopies requires effective interventions that depend not only on geometry, but also on explicit, direction-conditioned relations identifying which organs obstruct a target fruit. We present SG-DOR (Scene Graphs with Direction-Conditioned Occlusion Reasoning), a relational framework that, given instance-segmented organ point clouds, infers a scene graph encoding physical attachments and direction-conditioned occlusion. We introduce an occlusion ranking task for retrieving and ranking candidate leaves for a target fruit and approach direction, and propose a direction-aware graph neural architecture with per-fruit leaf-set attention and union-level aggregation. Experiments on a multi-plant synthetic pepper dataset show improved occlusion prediction (F1=0.73, NDCG@3=0.85) and attachment inference (edge F1=0.83) over strong ablations, yielding a structured relational signal for downstream intervention planning.

中文标题/摘要

标题：SG-DOR：基于方向条件遮挡推理的辣椒植物场景图学习

在密集作物冠层中进行机器人采收需要有效的干预措施，这些措施不仅取决于几何结构，还取决于明确的、以方向为条件的关系，以识别哪些器官遮挡了目标果实。我们提出了SG-DOR（基于方向条件遮挡推理的场景图），这是一种关系框架，给定实例分割的器官点云，可以推断出编码物理连接和方向条件遮挡的场景图。我们引入了一项遮挡排序任务，针对给定的目标果实和接近方向检索并排序候选叶片，并提出了一种具有逐果实叶片集注意力和并集级聚合的方向感知图神经架构。在多株合成辣椒数据集上的实验表明，与强消融对照相比，该方法在遮挡预测（F1=0.73，NDCG@3=0.85）和连接推理（边F1=0.83）方面有所改进，从而为下游干预规划提供了结构化的关系信号。

Summary / 总结

The research aims to develop effective interventions for robotic harvesting in dense crop canopies by considering both geometry and direction-conditioned relations. SG-DOR, a relational framework, infers a scene graph from segmented organ point clouds to encode physical attachments and direction-conditioned occlusions. The method uses a direction-aware graph neural network with per-fruit leaf-set attention and union-level aggregation, showing improved occlusion prediction and attachment inference compared to strong ablations on a multi-plant synthetic pepper dataset.

研究旨在通过同时考虑几何结构和方向条件关系，为密集作物冠层中的机器人采收开发有效的干预措施。所提出的SG-DOR从实例分割的器官点云中推断场景图，以编码物理连接和方向条件遮挡。在多株合成辣椒数据集上，该方法相比强消融对照改进了遮挡预测（F1=0.73，NDCG@3=0.85）和连接推断（边F1=0.83），为后续干预规划提供了结构化关系信号。

CMRAG: Co-modality-based visual document retrieval and question answering

Authors: Wang Chen, Wenhan Yu, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, Jizhou Huang

Venue: ICLR 2026

First: 2025-09-02T09:17:57+00:00 · Latest: 2026-03-06T17:51:55+00:00

Comments: Published at ICLR 2026 Workshop on Multimodal Intelligence

Abs · PDF · Code1 · Code2

Abstract

Retrieval-Augmented Generation (RAG) has become a core paradigm in document question answering tasks. However, existing methods have limitations when dealing with multimodal documents: one category of methods relies on layout analysis and text extraction, which can only utilize explicit text information and struggle to capture images or unstructured content; the other category treats document segmentation as visual input and directly passes it to visual language models (VLMs) for processing, yet it ignores the semantic advantages of text, leading to suboptimal retrieval and generation results. To address these research gaps, we propose the Co-Modality-based RAG (CMRAG) framework, which can simultaneously leverage texts and images for more accurate retrieval and generation. Our framework includes two key components: (1) a Unified Encoding Model (UEM) that projects queries, parsed text, and images into a shared embedding space via triplet-based training, and (2) a Unified Co-Modality-informed Retrieval (UCMR) method that statistically normalizes similarity scores to effectively fuse cross-modal signals. To support research in this direction, we further construct and release a large-scale triplet dataset of (query, text, image) examples. Experiments demonstrate that our proposed framework consistently outperforms single-modality-based RAG in multiple visual document question-answering (VDQA) benchmarks. The findings of this paper show that integrating co-modality information into the RAG framework in a unified manner is an effective approach to improving the performance of complex VDQA systems.

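Title and abstract (translated from Chinese)

Title: CMRAG: Co-modality-based visual document retrieval and question answering

Retrieval-Augmented Generation (RAG) has become a core paradigm for document question answering. Existing methods, however, are limited on multimodal documents: one family relies on layout analysis and text extraction, so it can use only explicit text and struggles to capture images or unstructured content; the other feeds document segments as visual input directly to vision-language models (VLMs), ignoring the semantic advantages of text and yielding suboptimal retrieval and generation. To close these gaps, we propose the Co-Modality-based RAG (CMRAG) framework, which uses text and images together for more accurate retrieval and generation. The framework has two key components: (1) a Unified Encoding Model (UEM) that projects queries, parsed text, and images into a shared embedding space via triplet-based training, and (2) a Unified Co-Modality-informed Retrieval (UCMR) method that statistically normalizes similarity scores to fuse cross-modal signals effectively. To support research in this direction, we also construct and release a large-scale dataset of (query, text, image) triplets. Experiments show that the framework consistently outperforms single-modality-based RAG on multiple visual document question-answering (VDQA) benchmarks. The results indicate that integrating co-modality information into RAG in a unified manner is an effective way to improve complex VDQA systems.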

Summary

The paper proposes CMRAG, a framework that integrates text and image information to enhance document question answering. It introduces a Unified Encoding Model and a Unified Co-Modality-informed Retrieval method to improve the accuracy of retrieval and generation. Experiments show that CMRAG outperforms single-modality-based RAG in various visual document question-answering benchmarks.

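The paper proposes the CMRAG framework, which combines text and images to improve the accuracy of document retrieval and question answering. It introduces a Unified Encoding Model (UEM) trained on triplets and a Unified Co-Modality-informed Retrieval (UCMR) method that fuses cross-modal signals. Experiments show that CMRAG outperforms single-modality RAG on multiple visual document question-answering (VDQA) benchmarks, supporting the value of integrating co-modality information into the RAG framework in a unified way.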
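
The CMRAG abstract says UCMR "statistically normalizes similarity scores" before fusing text and image signals, but the digest does not give the exact normalization. The sketch below is one plausible reading, assuming z-score standardization per modality followed by a weighted sum. The function names and the `text_weight` parameter are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Standardize scores to zero mean and unit variance (assumed normalization)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All candidates score the same, so this modality carries no ranking signal.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(text_scores, image_scores, text_weight=0.5):
    """Fuse per-candidate text and image similarities after normalization.

    text_weight is a hypothetical mixing parameter, not a value from the paper.
    """
    if len(text_scores) != len(image_scores):
        raise ValueError("both modalities must score the same candidates")
    z_text = zscore(text_scores)
    z_image = zscore(image_scores)
    return [text_weight * t + (1.0 - text_weight) * i
            for t, i in zip(z_text, z_image)]


def rank(text_scores, image_scores, text_weight=0.5):
    """Return candidate indices ordered from best to worst fused score."""
    fused = fuse_scores(text_scores, image_scores, text_weight)
    return sorted(range(len(fused)), key=lambda k: fused[k], reverse=True)


# Toy example: text similarities live on a much larger scale than image ones.
text_sims = [10.0, 20.0, 30.0]
image_sims = [0.9, 0.1, 0.2]
print(rank(text_sims, image_sims))  # prints [2, 0, 1]
```

Summing the raw scores would rank the candidates `[2, 1, 0]`, because the text scale swamps the image scores. After standardization the strong image match of candidate 0 moves it ahead of candidate 1, which is the kind of effect a normalization step is meant to enable.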

Culture in Action: Evaluating Text-to-Image Models through Social Activities

Authors: Sina Malakouti, Boqing Gong, Adriana Kovashka

First: 2025-11-07T19:51:11+00:00 · Latest: 2026-03-06T17:45:17+00:00

Abstract

Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but models inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture), overlooking the social and daily activities that more clearly reflect cultural norms. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate spans 16 countries with 576 prompts and more than 19,000 images, and provides an explainable descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity. Our findings reveal systematic disparities: models perform better for global north countries than for the global south, with distinct failure modes across T2I systems. Human studies confirm that our metrics correlate more strongly with human judgments than existing text-image metrics.

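Title and abstract (translated from Chinese)

Title: Culture in Action: Evaluating Text-to-Image Models through Social Activities

Text-to-image (T2I) diffusion models achieve impressive photorealism by training on large-scale web data, but they inherit cultural biases and fail to depict underrepresented regions faithfully. Existing cultural benchmarks focus mainly on object-centric categories (e.g., food, attire, and architecture) and overlook the social and daily activities that reflect cultural norms more clearly. Few metrics exist for measuring cultural faithfulness. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities (e.g., greetings, dining, games, traditional dances, and cultural celebrations). CULTIVate covers 16 countries with 576 prompts and more than 19,000 images, and provides an explainable, descriptor-based evaluation framework across multiple cultural dimensions, including background, attire, objects, and interactions. We propose four metrics that measure cultural alignment, hallucination, exaggerated elements, and diversity. Our findings reveal systematic disparities: models perform better for Global North countries than for the Global South, with distinct failure modes across T2I systems. Human studies confirm that our metrics correlate more strongly with human judgments than existing text-image metrics do.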

Summary

The research aims to evaluate text-to-image models by focusing on social activities that better reflect cultural norms, addressing the limitations of existing benchmarks. The study introduces CULTIVate, a benchmark with 576 prompts and over 19,000 images spanning 16 countries, using a descriptor-based evaluation framework. Four metrics are proposed to measure cultural alignment, hallucination, exaggerated elements, and diversity. The findings show that models perform better for global north countries and exhibit distinct failure modes across different systems. Human studies confirm that the proposed metrics correlate more strongly with human judgments than existing text-image metrics do.

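The research evaluates text-to-image models through social activities, which reflect cultural norms better than object-centric categories do. It introduces the CULTIVate benchmark, with 576 prompts and more than 19,000 images spanning 16 countries, and uses four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity. Results show that models perform better for Global North countries and exhibit distinct failure modes across systems.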

How Well Does Agent Development Reflect Real-World Work?

Authors: Zora Zhiruo Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang, Venu Arvind Arangarajan, Jett Chen, Valerie Chen, Diyi Yang, Daniel Fried, Graham Neubig

First: 2026-03-01T17:55:49+00:00 · Latest: 2026-03-06T17:43:36+00:00

Abstract

AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In this work, we systematically study the relationship between agent development efforts and the distribution of real-world human work by mapping benchmark instances to work domains and skills. We first analyze 43 benchmarks and 72,342 tasks, measuring their alignment with human employment and capital allocation across all 1,016 real-world occupations in the U.S. labor market. We reveal substantial mismatches between agent development, which tends to be programming-centric, and the categories in which human labor and economic value are concentrated. Within work areas that agents currently target, we further characterize current agent utility by measuring their autonomy levels, providing practical guidance for agent interaction strategies across work scenarios. Building on these findings, we propose three measurable principles for designing benchmarks that better capture socially important and technically challenging forms of work: coverage, realism, and granular evaluation.

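Title and abstract (translated from Chinese)

Title: How Well Does Agent Development Reflect Real-World Work?

AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how well these benchmarking efforts represent the labor market as a whole. In this work, we systematically study the relationship between agent development efforts and the distribution of real-world human work by mapping benchmark instances to work domains and skills. We first analyze 43 benchmarks and 72,342 tasks, measuring their alignment with human employment and capital allocation across all 1,016 real-world occupations in the U.S. labor market. We reveal substantial mismatches between agent development, which tends to be programming-centric, and the categories where human labor and economic value are concentrated. Within the work areas that agents currently target, we further characterize current agent utility by measuring autonomy levels, which yields practical guidance for agent interaction strategies across work scenarios. Building on these findings, we propose three measurable principles for designing benchmarks that better capture socially important and technically challenging forms of work: coverage, realism, and granular evaluation.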

Summary

This study investigates the alignment between AI agent development and real-world human work by analyzing 43 benchmarks and 72,342 tasks against 1,016 U.S. occupations. The research reveals significant mismatches, with agent development focusing more on programming skills while human labor and economic value are concentrated in other areas. The study proposes three principles—coverage, realism, and granular evaluation—for designing more representative benchmarks.

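The study examines how well AI agent development matches real-world human work by analyzing 43 benchmarks and 72,342 tasks against 1,016 U.S. occupations. It reveals substantial mismatches: agent development concentrates on programming skills, while human labor and economic value are concentrated in other areas. The study proposes three principles (coverage, realism, and granular evaluation) for designing more representative benchmarks.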

When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models

Authors: Qitong Wang, Haoran Dai, Haotian Zhang, Christopher Rasmussen, Binghui Wang

Venue: ICLR 2026

First: 2026-03-06T17:42:08+00:00 · Latest: 2026-03-06T17:42:08+00:00

Comments: Accepted to the ICLR 2026 Workshop on Principled Design for Trustworthy AI. The first two authors contributed equally

Abstract

While diffusion models have revolutionized visual content generation, their rapid adoption has underscored the critical need to investigate vulnerabilities, e.g., to backdoor attacks. In multimodal diffusion models, it is natural to expect that attacking multiple modalities simultaneously (e.g., text and image) would yield complementary effects and strengthen the overall backdoor. In this paper, we challenge this assumption by investigating the phenomenon of Backdoor Modality Collapse, a scenario where the backdoor mechanism degenerates to rely predominantly on a subset of modalities, rendering others redundant. To rigorously quantify this behavior, we introduce two novel metrics: Trigger Modality Attribution (TMA) and Cross-Trigger Interaction (CTI). Through extensive experiments across diverse training configurations in multimodal conditional diffusion, we consistently observe a "winner-takes-all" dynamic in backdoor behavior. Our results reveal that (1) attacks often collapse into subset-modality dominance, and (2) cross-modal interaction is negligible or even negative, contradicting the intuition of synergistic vulnerability. These findings highlight a critical blind spot in current assessments, suggesting that high attack success rates often mask a fundamental reliance on a subset of modalities. This establishes a principled foundation for mechanistic analysis and future defense development.

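Title and abstract (translated from Chinese)

Title: When One Modality Rules Them All: Backdoor Modality Collapse in Multimodal Diffusion Models

Although diffusion models have revolutionized visual content generation, their rapid adoption underscores the need to investigate their vulnerabilities, such as backdoor attacks. In multimodal diffusion models, it is natural to expect that attacking several modalities at once (e.g., text and image) yields complementary effects and strengthens the overall backdoor. In this paper, we challenge this assumption by studying Backdoor Modality Collapse, a scenario in which the backdoor mechanism degenerates to rely predominantly on a subset of modalities, rendering the others redundant. To quantify this behavior rigorously, we introduce two new metrics: Trigger Modality Attribution (TMA) and Cross-Trigger Interaction (CTI). Through extensive experiments across diverse training configurations in multimodal conditional diffusion, we consistently observe a "winner-takes-all" dynamic in backdoor behavior. Our results show that (1) attacks often collapse into subset-modality dominance, and (2) cross-modal interaction is negligible or even negative, contradicting the intuition of synergistic vulnerability. These findings expose a critical blind spot in current assessments: high attack success rates often mask a fundamental reliance on a subset of modalities. This establishes a principled foundation for mechanistic analysis and future defense development.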

Summary

This paper investigates the phenomenon of Backdoor Modality Collapse in multimodal diffusion models, where the backdoor mechanism relies primarily on a subset of modalities, making others redundant. The authors introduce two metrics, Trigger Modality Attribution (TMA) and Cross-Trigger Interaction (CTI), to quantify this behavior. Experiments show that attacks often collapse into subset-modality dominance and that cross-modal interaction is negligible, challenging the assumption of synergistic vulnerability. This work highlights a critical blind spot in current assessments and suggests that high attack success rates may mask a fundamental reliance on a subset of modalities.

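This paper studies Backdoor Modality Collapse in multimodal diffusion models, where the backdoor mechanism relies mainly on a subset of modalities and the others become redundant. To quantify the phenomenon, the authors introduce the TMA and CTI metrics. Experiments across various training settings show that attacks often concentrate on a subset of modalities and that cross-modal interaction is negligible or even negative. This challenges the assumption that attacking several modalities at once strengthens the backdoor, and exposes a critical blind spot in current assessments.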

Semantics-Aware Caching for Concept Learning

Authors: Louis Mozart Kamdem Teyou, Caglar Demir, Axel-Cyrille Ngonga Ngomo

First: 2026-03-06T17:40:13+00:00 · Latest: 2026-03-06T17:40:13+00:00

Abstract

Concept learning is a form of supervised machine learning that operates on knowledge bases in description logics. State-of-the-art concept learners often rely on an iterative search through a countably infinite concept space. In each iteration, they retrieve instances of candidate solutions to select the best concept for the next iteration. While simple learning problems might require a few dozen instance retrieval calls to find a fitting solution, complex learning problems might necessitate thousands of calls. We alleviate the resulting runtime challenge by presenting a semantics-aware caching approach. Our cache is essentially a subsumption-aware map that links concepts to a set of instances via crisp set operations. Our experiments on 5 datasets with 4 symbolic reasoners, a neuro-symbolic reasoner, and 5 popular pagination policies demonstrate that our cache can reduce the runtime of concept retrieval and concept learning by an order of magnitude while being effective for both symbolic and neuro-symbolic reasoners.

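Title and abstract (translated from Chinese)

Title: Semantics-Aware Caching for Concept Learning

Concept learning is a form of supervised machine learning that operates on knowledge bases in description logics. State-of-the-art concept learners often rely on an iterative search through a countably infinite concept space. In each iteration, they retrieve the instances of candidate solutions to select the best concept for the next iteration. Simple learning problems may need only a few dozen instance retrieval calls to find a fitting solution, whereas complex ones may need thousands. We alleviate the resulting runtime challenge with a semantics-aware caching approach. Our cache is essentially a subsumption-aware map that links concepts to sets of instances via crisp set operations. Experiments on 5 datasets with 4 symbolic reasoners, a neuro-symbolic reasoner, and 5 popular pagination policies show that the cache can reduce the runtime of concept retrieval and concept learning by an order of magnitude, and that it is effective for both symbolic and neuro-symbolic reasoners.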

Summary

The paper addresses the challenge of efficient concept learning in description logics by proposing a semantics-aware caching approach. This method uses a subsumption-aware map to link concepts to sets of instances, reducing the number of instance retrieval calls needed for complex learning problems. Experiments show that this approach can reduce the runtime of concept retrieval and concept learning by an order of magnitude, for both symbolic and neuro-symbolic reasoners.

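The paper proposes a semantics-aware caching approach to make concept learning over knowledge bases more efficient. The approach uses a subsumption-aware map to associate concepts with sets of instances, reducing the number of instance retrieval calls required. Experiments show that it can cut the runtime of concept retrieval and concept learning by an order of magnitude across several datasets and pagination policies, for both symbolic and neuro-symbolic reasoners.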
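
The caching abstract describes a map that links concepts to instance sets "via crisp set operations". The sketch below illustrates only that part of the idea: instance sets of compound concepts are composed from cached sets instead of calling the reasoner again. It is a simplification under stated assumptions and is not the authors' implementation. Concepts are encoded as strings and tuples, negation is computed as a closed-world complement, subsumption handling and eviction policies are left out, and `retrieve_atomic` is a hypothetical stand-in for a reasoner call.

```python
class SemanticCache:
    """Cache instance sets and compose compound concepts with set operations."""

    def __init__(self, retrieve_atomic, all_individuals):
        self._retrieve_atomic = retrieve_atomic  # hypothetical reasoner call
        self._all = frozenset(all_individuals)
        self._store = {}
        self.reasoner_calls = 0

    def instances(self, concept):
        """Return the instances of a concept, reusing cached results when possible."""
        if concept in self._store:
            return self._store[concept]
        if isinstance(concept, str):
            # Atomic concept: the only case that reaches the reasoner.
            self.reasoner_calls += 1
            result = frozenset(self._retrieve_atomic(concept))
        else:
            op, *args = concept
            parts = [self.instances(arg) for arg in args]
            if op == "and":
                result = frozenset.intersection(*parts)
            elif op == "or":
                result = frozenset.union(*parts)
            elif op == "not":
                # Closed-world complement: an assumption of this sketch.
                result = self._all - parts[0]
            else:
                raise ValueError(f"unsupported constructor: {op}")
        self._store[concept] = result
        return result


# Toy knowledge base; names and data are made up for illustration.
data = {"Female": {"anna"}, "Parent": {"anna", "bob"}}
cache = SemanticCache(lambda name: data[name], {"anna", "bob", "carl"})

mothers = cache.instances(("and", "Female", "Parent"))
calls_after_first = cache.reasoner_calls
fathers = cache.instances(("and", ("not", "Female"), "Parent"))
print(sorted(mothers), sorted(fathers), cache.reasoner_calls)
```

In this toy run the second query is answered entirely from cached sets, so the reasoner call count does not increase after the first query.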

Localizing and Correcting Errors for LLM-based Planners

Authors: Aditya Kumar, William W. Cohen

First: 2026-01-30T19:56:15+00:00 · Latest: 2026-03-06T17:37:41+00:00

Abstract

Large language models (LLMs) have demonstrated strong reasoning capabilities on math and coding, but frequently fail on symbolic classical planning tasks. Our studies, as well as prior work, show that LLM-generated plans routinely violate domain constraints given in their instructions (e.g., walking through walls). To address this failure, we propose iteratively augmenting instructions with Localized In-Context Learning (L-ICL) demonstrations: targeted corrections for specific failing steps. Specifically, L-ICL identifies the first constraint violation in a trace and injects a minimal input-output example giving the correct behavior for the failing step. Our proposed technique of L-ICL is much more effective than explicit instructions, than traditional ICL (which adds complete problem-solving trajectories), and than many other baselines. For example, on an 8x8 gridworld, L-ICL produces valid plans 89% of the time with only 60 training examples, compared to 59% for the best baseline, an increase of 30 percentage points. L-ICL also shows dramatic improvements in other domains (gridworld navigation, mazes, Sokoban, and BlocksWorld), and on several LLM architectures.

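Title and abstract (translated from Chinese)

Title: Localizing and Correcting Errors for LLM-based Planners

Large language models (LLMs) show strong reasoning ability in math and coding, but often fail on symbolic classical planning tasks. Our studies and prior work show that LLM-generated plans routinely violate the domain constraints stated in their instructions (e.g., walking through walls). To address this, we propose iteratively augmenting the instructions with Localized In-Context Learning (L-ICL) demonstrations, which are targeted corrections for specific failing steps. Concretely, L-ICL identifies the first constraint violation in a trace and injects a minimal input-output example that gives the correct behavior for the failing step. L-ICL is more effective than explicit instructions, than traditional ICL (which adds complete problem-solving trajectories), and than many other baselines. For example, on an 8x8 gridworld, L-ICL produces valid plans 89% of the time with only 60 training examples, compared to 59% for the best baseline, a gain of 30 percentage points. L-ICL also shows marked improvements in other domains (gridworld navigation, mazes, Sokoban, and BlocksWorld) and on several LLM architectures.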

Summary

The paper addresses the issue of large language models (LLMs) frequently violating domain constraints in symbolic classical planning tasks. It proposes Localized In-Context Learning (L-ICL), which iteratively corrects specific failing steps by providing targeted minimal input-output examples. L-ICL outperforms explicit instructions, traditional In-Context Learning (ICL), and other baselines, achieving 89% valid plans on an 8x8 gridworld with only 60 training examples, compared to 59% for the best baseline, an improvement of 30 percentage points. L-ICL also shows significant improvements in other domains such as gridworld navigation, mazes, Sokoban, and BlocksWorld across various LLM architectures.

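To address the failures of large language models (LLMs) on symbolic classical planning tasks, the paper proposes Localized In-Context Learning (L-ICL), which iteratively corrects errors with targeted demonstrations for specific failing steps. L-ICL identifies the first constraint violation and injects a minimal input-output example that corrects the failing step. On an 8x8 gridworld, it reaches 89% valid plans with only 60 training examples, compared to 59% for the best baseline, a gain of 30 percentage points. L-ICL also improves markedly in other domains, including gridworld navigation, mazes, Sokoban, and BlocksWorld, and is effective across different LLM architectures.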
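
The L-ICL abstract describes two steps: locate the first constraint violation in a plan trace, then inject a minimal input-output example for that step into the instructions. The sketch below illustrates both steps on a toy gridworld. The grid encoding, the wording of the demonstration, and all function names are assumptions made for illustration. The digest does not specify the format the authors use.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def is_free(cell, walls, size):
    """A cell is usable if it lies on the grid and is not a wall."""
    row, col = cell
    return 0 <= row < size and 0 <= col < size and cell not in walls


def first_violation(start, actions, walls, size):
    """Return (step_index, position, action) for the first illegal move, or None."""
    row, col = start
    for index, action in enumerate(actions):
        d_row, d_col = MOVES[action]
        target = (row + d_row, col + d_col)
        if not is_free(target, walls, size):
            return index, (row, col), action
        row, col = target
    return None


def legal_actions(position, walls, size):
    """List the moves that respect the grid constraints at a position."""
    row, col = position
    return [name for name, (d_row, d_col) in MOVES.items()
            if is_free((row + d_row, col + d_col), walls, size)]


def localized_demo(start, actions, walls, size):
    """Build a minimal input-output example for the first failing step only."""
    found = first_violation(start, actions, walls, size)
    if found is None:
        return None
    _, position, action = found
    legal = legal_actions(position, walls, size)
    return (f"Input: agent at {position}, proposed move '{action}'\n"
            f"Output: invalid; legal moves here are {legal}")


def augment(instructions, demo):
    """Append the localized demonstration to the planner instructions."""
    return instructions if demo is None else instructions + "\n\n" + demo


# Toy 3x3 grid with one wall directly to the right of the start cell.
walls = {(0, 1)}
bad_plan = ["right", "down"]
demo = localized_demo((0, 0), bad_plan, walls, size=3)
print(augment("Do not walk through walls.", demo))
```

In the iterative setting the abstract describes, the augmented instructions would be fed back to the planner and the check repeated on the new trace. That loop is omitted here because it depends on the LLM call.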

Better Late Than Never: Meta-Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

Authors: Peter Polák, Sara Papi, Luisa Bentivogli, Ondřej Bojar

First: 2025-09-22T04:21:19+00:00 · Latest: 2026-03-06T17:37:14+00:00

Comments: Changes: - small change in the name (Evaluation -> Meta-Evaluation); - added reference to the implementation; - excluded two test sets (IWSLT22 En-Zh, En-Ja) because of incorrect and missing segmentation; - main results unchanged; - added Degenerate Policy Test; - added sensitivity of the metrics to change in the metric value

Abstract

Simultaneous speech-to-text translation systems must balance translation quality with latency. Although quality evaluation is well established, latency measurement remains a challenge. Existing metrics produce inconsistent results, especially in short-form settings with artificial presegmentation. We present the first comprehensive meta-evaluation of latency metrics across language pairs and systems. We uncover a structural bias in current metrics related to segmentation. We introduce YAAL (Yet Another Average Lagging) for a more accurate short-form evaluation and LongYAAL for unsegmented audio. We propose SoftSegmenter, a resegmentation tool based on soft word-level alignment. We show that YAAL and LongYAAL, together with SoftSegmenter, outperform popular latency metrics, enabling more reliable assessments of short- and long-form simultaneous speech translation systems. We implement all artifacts within the OmniSTEval toolkit: https://github.com/pe-trik/OmniSTEval.

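Title and abstract (translated from Chinese)

Title: Better Late Than Never: Meta-Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

Simultaneous speech-to-text translation systems must balance translation quality with latency. Quality evaluation is well established, but latency measurement remains a challenge. Existing metrics produce inconsistent results, especially in short-form settings with artificial presegmentation. We present the first comprehensive meta-evaluation of latency metrics across language pairs and systems, and uncover a structural bias in current metrics that is related to segmentation. We introduce YAAL (Yet Another Average Lagging) for more accurate short-form evaluation and LongYAAL for unsegmented audio. We also propose SoftSegmenter, a resegmentation tool based on soft word-level alignment. We show that YAAL and LongYAAL, together with SoftSegmenter, outperform popular latency metrics, enabling more reliable assessment of short- and long-form simultaneous speech translation systems. All artifacts are implemented in the OmniSTEval toolkit: https://github.com/pe-trik/OmniSTEval.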

Summary

The paper addresses the challenge of evaluating latency in simultaneous speech-to-text translation systems, which must balance translation quality and latency. It presents a comprehensive meta-evaluation of existing latency metrics, identifies a segmentation bias, and introduces new metrics like YAAL and LongYAAL, along with a resegmentation tool called SoftSegmenter. The study shows that these new metrics outperform traditional ones, providing more reliable assessments for both short- and long-form systems.

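The paper addresses the challenge of evaluating latency in simultaneous speech-to-text translation systems, which must balance translation quality against latency. It conducts a comprehensive meta-evaluation of existing latency metrics and introduces the new metrics YAAL and LongYAAL, along with the resegmentation tool SoftSegmenter, to improve accuracy. The study finds that these new methods outperform traditional ones and provide more reliable evaluation for both short-form and long-form systems.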

Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning

Authors: Yuchen Zhang, Haralambos Mouratidis, Ravi Shekhar

First: 2026-03-06T17:37:06+00:00 · Latest: 2026-03-06T17:37:06+00:00

Comments: Accepted at LREC 2026

Abstract

Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment provides additional gains when applied to different context types, with an overall performance gain of over 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.

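Title and abstract (translated from Chinese)

Title: Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning

Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain limited to monolingual settings and short, isolated utterances. Recent work on context-aware ASR shows promise, but two key challenges persist: limited multilingual support and the lack of a principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder with a decoder-only language model through a lightweight projection module, so that structured context prompts, including dialogue history and biasing words, can guide transcription. To improve the interaction between speech and context, we use a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on more than 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment brings additional gains across different context types, with an overall performance gain of more than 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.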
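
The abstract above mentions a contrastive learning objective that aligns speech and context representations in a shared embedding space, without stating the loss. The sketch below shows a generic InfoNCE-style loss in one direction (speech to context) as a plausible instance of such an objective. It is an assumption, not the authors' formulation. The `temperature` value and the function names are hypothetical, and embeddings must be non-zero vectors.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(speech, context, temperature=0.1):
    """Mean cross-entropy of matching speech[i] to context[i] within the batch.

    Every other context in the batch acts as a negative for speech[i].
    """
    s = [normalize(v) for v in speech]
    c = [normalize(v) for v in context]
    total = 0.0
    for i, s_i in enumerate(s):
        logits = [dot(s_i, c_j) / temperature for c_j in c]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(s)


# Toy batch of two utterances with 2-dimensional embeddings.
speech_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_context = [[1.0, 0.0], [0.0, 1.0]]
shuffled_context = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(speech_batch, aligned_context)
loss_shuffled = info_nce(speech_batch, shuffled_context)
print(loss_aligned < loss_shuffled)  # prints True
```

The loss is near zero when each utterance is closest to its own context and grows when the pairing is shuffled, which is the pressure that pulls matching speech and context embeddings together during training.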
