arXiv 论文速递

Snapshot: 20260301_0326

MediX-R1: Open Ended Medical Reinforcement Learning

Authors: Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, Hisham Cholakkal

First: 2026-02-26T18:59:46+00:00 · Latest: 2026-02-26T18:59:46+00:00

Abstract

We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at https://medix.cvmbzuai.com

中文标题/摘要

标题：MediX-R1：开放式的医疗强化学习

我们介绍了MediX-R1，这是一种针对医疗多模态大型语言模型（MLLMs）的开放式强化学习（RL）框架，能够提供基于临床的、自由形式的答案，超越了多项选择格式。MediX-R1 使用基于组的RL对基础视觉-语言骨干进行微调，并结合了针对医学推理的复合奖励：基于LLM的准确度奖励，用于判断语义正确性并做出严格的YES/NO决策；基于医学嵌入的语义奖励，用于捕捉同义词和术语变体；以及轻量级的格式和模态奖励，以确保可解释的推理和模态识别。这种多信号设计为传统验证性或仅多项选择奖励无法提供稳定、信息丰富的反馈的开放式输出提供了支持。为了衡量进展，我们提出了一种统一的评估框架，用于文本和图像+文本任务，该框架使用LLM作为法官替代脆弱的字符串重叠度量，以捕捉语义正确性、推理和上下文对齐。尽管仅使用约51,000个指令示例，MediX-R1 在标准的医疗LLM（仅文本）和VLM（图像+文本）基准测试中取得了优异的成绩，超越了强大的开源基线，并在开放式临床任务上取得了特别大的进步。我们的结果表明，使用全面的奖励信号和基于LLM的评估的开放式RL是一种可靠的多模态模型中实现可靠医学推理的实用途径。我们的训练模型、精心策划的数据集和源代码可在https://medix.cvmbzuai.com 获取。

Summary / 总结

MediX-R1 is an open-ended RL framework for medical multimodal LLMs. It fine-tunes a vision-language backbone with a composite reward that combines an LLM-based accuracy reward, a medical embedding-based semantic reward, and lightweight format and modality rewards. It is evaluated with a reference-based LLM-as-judge that measures semantic correctness, reasoning, and contextual alignment. Despite using only about 51,000 instruction examples, MediX-R1 outperforms strong open-source baselines on standard medical LLM and VLM benchmarks, especially on open-ended clinical tasks.

MediX-R1 是一个用于医疗 MLLMs 的开放式 RL 框架，能够提供超出多选题格式的自由形式答案。该框架通过 Group Based RL 和复合奖励对视觉-语言主干进行微调，复合奖励包括基于 LLM 的准确度奖励、基于医学嵌入的语义奖励以及轻量级的格式和模态奖励。该框架为开放式输出提供稳定反馈，并在医疗 LLM 和 VLM 基准测试中超越了强大的开源基线模型，在开放式临床任务上表现尤为突出。论文还提出了一个基于参考的 LLM-as-judge 统一评估框架，用于衡量进展，捕捉语义正确性、推理和上下文对齐。
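
The composite reward described in the MediX-R1 abstract can be sketched roughly as follows. This is a minimal illustration only: the weights, the function names `composite_reward` and `group_advantages`, and the GRPO-style normalization are assumptions made for the example, since the abstract says "Group Based RL" without giving the exact formulation.

```python
from statistics import mean, pstdev

def composite_reward(judge_yes, embed_sim, format_ok, modality_ok,
                     weights=(1.0, 0.5, 0.25, 0.25)):
    """Weighted sum of the four reward signals named in the abstract.

    The weights are placeholders (assumption), not values from the paper.
    """
    w_acc, w_sem, w_fmt, w_mod = weights
    return (w_acc * float(judge_yes)      # strict YES/NO decision from an LLM judge
            + w_sem * embed_sim           # embedding similarity in [0, 1]
            + w_fmt * float(format_ok)    # answer follows the expected format
            + w_mod * float(modality_ok)) # imaging modality was recognized

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled answers (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three hypothetical answers sampled for the same question.
group = [
    composite_reward(True, 0.9, True, True),
    composite_reward(False, 0.4, True, False),
    composite_reward(False, 0.1, False, False),
]
advantages = group_advantages(group)
```

Answers that score above the group mean receive a positive advantage and the rest a negative one, which is what lets a graded semantic signal inform the update even when the YES/NO judge rejects every answer in the group.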

SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Authors: Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu

Venue: CVPR 2026

First: 2026-02-26T18:59:05+00:00 · Latest: 2026-02-26T18:59:05+00:00

Comments: Project page: https://seethrough3d.github.io. Accepted at CVPR 2026

Abstract

We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.

中文标题/摘要

标题：SeeThrough3D：文本到图像生成中的遮挡感知3D控制

我们识别出遮挡推理是3D布局条件生成中一个基本但被忽视的重要方面。它对于合成部分遮挡的物体并保持深度一致的几何结构和比例至关重要。尽管现有方法可以生成遵循输入布局的逼真场景，但它们往往无法准确建模物体间的遮挡关系。我们提出了SeeThrough3D，一种用于3D布局条件生成的模型，该模型明确建模了遮挡关系。我们引入了一种遮挡感知的3D场景表示（OSCR），其中物体以透明的3D盒子形式置于虚拟环境中，并从期望的相机视角进行渲染。透明度编码了隐藏的物体区域，使模型能够推理遮挡关系，而渲染的视角则在生成过程中提供了明确的相机控制。我们通过引入从我们渲染的3D表示中提取的一组视觉标记，对预训练的基于流的文本到图像图像生成模型进行条件化。此外，我们应用掩码自注意力机制，准确地将每个物体边界框与其相应的文本描述绑定，从而实现多个物体的准确生成，而不会出现物体属性混杂。为了训练该模型，我们构建了一个包含多种多物体场景的合成数据集，这些场景具有强烈的物体间遮挡。SeeThrough3D能够有效泛化到未见过的物体类别，并实现具有真实遮挡和一致相机控制的精确3D布局控制。

Summary / 总结

The research aims to address the issue of occlusion reasoning in text-to-image generation, which is crucial for creating scenes with depth-consistent geometry and scale. The proposed SeeThrough3D model uses an occlusion-aware 3D scene representation (OSCR) to explicitly model occlusions. By introducing visual tokens and applying masked self-attention, the model can generate multiple objects accurately without mixing attributes. The model is trained on a synthetic dataset with diverse multi-object scenes and strong occlusions, demonstrating effective generalization to unseen object categories and precise 3D layout control with realistic occlusions and consistent camera control.

研究旨在解决文本到图像生成中的遮挡推理问题，这对于创建深度一致和逼真的场景至关重要。SeeThrough3D提出了一种遮挡感知的3D场景表示（OSCR），并使用一个预训练的基于流的文本到图像模型，该模型以从3D表示中提取的视觉标记为条件。该模型能够有效处理遮挡和多个物体而不混淆物体属性，并且能够很好地泛化到未见过的物体类别，从而实现精确的3D布局控制，具有逼真的遮挡和一致的相机控制。
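
The masked self-attention idea in the SeeThrough3D abstract, binding each box to its own description, can be illustrated with a toy mask. The grouping scheme below (an object id per token, `None` for shared tokens) and both function names are assumptions for illustration; the abstract does not specify the actual token layout or masking rule.

```python
import math

def build_binding_mask(token_groups):
    """mask[i][j] is True when token i may attend to token j.

    token_groups[i] is an object id, or None for tokens shared by all objects.
    Tokens of different objects are blocked from attending to each other.
    """
    n = len(token_groups)
    return [[gi is None or gj is None or gi == gj
             for gj in token_groups]
            for gi in token_groups]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; blocked positions get weight 0."""
    peak = max(s for s, a in zip(scores, allowed) if a)
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sequence: one shared token, then box/text tokens for objects 0 and 1.
groups = [None, 0, 0, 1, 1]
mask = build_binding_mask(groups)
weights = masked_softmax([0.1, 0.2, 0.3, 0.4, 0.5], mask[1])
```

In this toy layout the box token of object 0 puts zero attention weight on the tokens of object 1, which is one simple way to keep attributes from leaking between objects.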

A Dataset is Worth 1 MB

Authors: Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen

First: 2026-02-26T18:59:03+00:00 · Latest: 2026-02-26T18:59:03+00:00

Comments: 23 pages, 9 figures

Abstract

A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate on diverse hardware and software frameworks, transmitting a pre-trained model is often infeasible; instead, agents require raw data to train their own task-specific models locally. While dataset distillation attempts to compress training signals, current methods struggle to scale to high-resolution data and rarely achieve sufficiently small files. In this paper, we propose Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission. We assume agents are preloaded with a large, generic, unlabeled reference dataset (e.g., ImageNet-1K, ImageNet-21K) and communicate a new task by transmitting only the class labels for specific images. To address the distribution mismatch between the reference and target datasets, we introduce a pruning mechanism that filters the reference dataset to retain only the labels of the most semantically relevant images for the target task. This selection process simultaneously maximizes training efficiency and minimizes transmission payload. Experiments on 10 diverse datasets demonstrate that our approach can transfer task knowledge with a payload of less than 1 MB while retaining high classification accuracy, offering a promising solution for efficient dataset serving.

中文标题/摘要

标题：一个数据集值1 MB

数据服务器经常需要向许多客户端分发相同的大型负载，导致巨大的通信成本。由于客户端经常运行在不同的硬件和软件框架上，传输预训练模型往往是不可行的；相反，代理需要原始数据来在本地训练其特定任务的模型。虽然数据集蒸馏试图压缩训练信号，但当前的方法难以扩展到高分辨率数据，很少能实现足够小的文件。在本文中，我们提出了一种名为Pseudo-Labels as Data (PLADA) 的方法，该方法完全消除了像素传输。我们假设代理预先加载了一个大型、通用、未标记的参考数据集（例如，ImageNet-1K，ImageNet-21K），并通过仅传输特定图像的类别标签来传达新任务。为了应对参考数据集和目标数据集之间的分布不匹配，我们引入了一种剪枝机制，该机制过滤参考数据集，仅保留与目标任务最相关的图像的标签。这个选择过程同时最大化了训练效率并最小化了传输负载。在10个不同的数据集上的实验表明，我们的方法可以在传输小于1 MB的负载的同时保持高分类准确性，为高效的数据集服务提供了一个有前景的解决方案。

Summary / 总结

This paper addresses the challenge of efficiently distributing large datasets to multiple clients by proposing PLADA, which transmits only class labels for specific images rather than pixel data. By pruning a large reference dataset to include only semantically relevant images, the method achieves high classification accuracy with a payload of less than 1 MB, significantly reducing communication costs.

本文提出PLADA方法，通过仅传输特定图像的类别标签而非像素数据来解决向多个客户端高效分发大数据集的挑战。通过将大型参考数据集精简为仅包含与目标任务最相关的图像，该方法能够在保持高分类准确性的同时，将传输负载减少到不足1 MB，显著降低通信成本。
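
A back-of-the-envelope calculation shows why a labels-only payload can stay under 1 MB. The encoding assumed here (one keep-bit per reference image plus a fixed-width label per kept image, with no further compression) is a hypothetical scheme chosen for the arithmetic, not the format used by PLADA, and the counts of kept images and classes are made up.

```python
def payload_bytes(num_reference, num_kept, num_classes):
    """Bytes needed for a keep-bitmask plus one packed label per kept image."""
    bitmask_bits = num_reference                     # 1 bit per reference image
    bits_per_label = (num_classes - 1).bit_length()  # fixed-width class index
    total_bits = bitmask_bits + num_kept * bits_per_label
    return (total_bits + 7) // 8                     # round up to whole bytes

# ImageNet-1K training set as the reference (1,281,167 images); the other
# two numbers are hypothetical: keep 100,000 images for a 100-class task.
size = payload_bytes(num_reference=1_281_167, num_kept=100_000, num_classes=100)
print(size)  # 247646
```

Under these assumptions the payload is roughly a quarter of a megabyte, and pruning more aggressively shrinks the label part further while the bitmask stays fixed.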

SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Authors: Simon Roschmann, Paul Krzakala, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

First: 2026-02-26T18:55:06+00:00 · Latest: 2026-02-26T18:55:06+00:00

Comments: Preprint

Abstract

The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.

中文标题/摘要

标题：SOTAlign：通过最优传输实现单模态视觉和语言模型的半监督对齐

柏拉图表征假设认为，在不同模态上训练的神经网络会收敛到一个共享的世界统计模型。近期的工作利用这种收敛性，通过轻量级对齐层将冻结的预训练视觉和语言模型对齐，但通常依赖于对比损失和数百万配对样本。在本文中，我们探讨是否可以在显著更少的监督下实现有意义的对齐。我们引入了一个半监督设置，在该设置中，使用少量的图像-文本配对数据和大量未配对数据对预训练的单模态编码器进行对齐。为了解决这一挑战，我们提出了SOTAlign，这是一种两阶段框架，首先使用线性教师从有限的配对数据中恢复粗略的共享几何结构，然后通过基于最优传输的散度在未配对样本上细化对齐，该散度能够迁移关系结构而不过度约束目标空间。与现有的半监督方法不同，SOTAlign有效地利用了未配对的图像和文本，学习跨数据集和编码器对的稳健联合嵌入，并显著优于监督和半监督基线。

Summary / 总结

The research aims to achieve alignment between vision and language models with less supervision. SOTAlign, a two-stage framework, first uses a small number of paired image-text samples to recover a coarse shared geometry, then refines the alignment on unpaired data using an optimal-transport-based divergence. This method outperforms supervised and semi-supervised baselines, demonstrating effective use of unpaired data to learn robust joint embeddings.

研究旨在通过较少的监督实现视觉和语言模型之间的对齐。SOTAlign 是一个两阶段框架，首先使用少量的图像-文本对来恢复粗略的共享几何结构，然后通过基于最优传输的散度在未配对的数据上进一步细化对齐。该方法在不同数据集和编码器对上学习稳健的联合嵌入，显著优于监督和半监督基线，有效地利用了未配对的图像和文本数据。
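
As background for the optimal-transport step in SOTAlign, here is a generic entropic OT solver (Sinkhorn iterations) in plain Python. It is a standard building block, not the divergence defined in the paper: the uniform marginals, the regularization value, and the toy cost matrix are all assumptions.

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropic optimal transport plan between two uniform distributions."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m   # uniform marginals (assumption)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost: sample 0 is close to sample 0 of the other modality, 1 to 1.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

The resulting plan concentrates mass on the low-cost pairs while still matching both marginals, which is the soft-matching behavior that makes OT usable on unpaired samples.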

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

Authors: Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang, Ranjay Krishna

First: 2026-02-26T18:54:06+00:00 · Latest: 2026-02-26T18:54:06+00:00

Comments: TACL 2026

Abstract

The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.

中文标题/摘要

标题：规模无法克服语用学：报告偏差对视觉语言推理的影响

视觉语言模型（VLMs）缺乏推理能力的问题一直是研究讨论的焦点。我们认为这种行为源于其训练数据中的报告偏差。也就是说，人们默认在描述视觉内容时会省略一些监督某些类型推理所需的隐含信息；例如，“今天在比赛！”比“一张37个人站在田野后面的图片”更可能作为描述。我们通过语用学理论的视角，研究了流行的VLMs OpenCLIP、LLaVA-1.5和Molmo的数据基础，发现报告偏差导致在四类推理技能（空间、时间、否定和计数）的表示不足，尽管这些语料库是大规模的，或者合成生成的。通过一组精心策划的基准测试，我们证明：(i) VLMs在由报告偏差抑制的上述类型推理上表现不佳；(ii) 与普遍认为的相反，增加数据量、模型规模和多语言训练并不能默认产生这些技能；但令人鼓舞的是，(iii) 特别收集的用于获取隐含信息的注解是有效的。我们的研究结果强调了需要更故意的数据策划方法，而不是依赖规模来产生推理能力。

Summary / 总结

The study investigates the impact of reporting bias on the reasoning capabilities of Vision-Language Models (VLMs) like OpenCLIP, LLaVA-1.5, and Molmo. By analyzing the training data through pragmatics theories, the research finds that reporting bias leads to insufficient representation of spatial, temporal, negation, and counting reasoning skills, despite the large scale of the corpora. The experiments show that VLMs struggle with these types of reasoning, and increasing data or model size does not inherently improve these skills. However, incorporating specific annotations can enhance these capabilities.

研究探讨了视觉语言模型（VLMs）在推理方面的局限性，认为这归因于其训练数据中的报告偏见。尽管使用了大规模和合成的数据集，但VLMs在空间、时间、否定和计数推理方面仍然缺乏能力，因为视觉内容通常是以一种省略了隐含信息的方式描述的。研究显示，增加数据或模型规模并不能自动改善这些能力，但专门收集用于捕捉隐含信息的标注则有效。这强调了需要更针对性的数据整理方法，而不是依赖规模来实现推理能力的提升。

Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Authors: Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias

First: 2026-02-26T18:45:33+00:00 · Latest: 2026-02-26T18:45:33+00:00

Abstract

Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.

中文标题/摘要

标题：检索与分割：少量示例足以弥合开放词汇分割中的监督缺口吗？

开放词汇分割（OVS）将视觉语言模型（VLM）的零样本识别能力扩展到像素级预测，使模型能够根据文本提示分割任意类别。尽管取得了进展，但由于使用粗粒度的图像级监督训练VLM以及自然语言的语义模糊性，OVS仍落后于完全监督的方法。我们通过引入一种少量示例设置，将文本提示与像素标注图像的支持集相结合，来解决这些限制。在此基础上，我们提出了一种检索增强的测试时适配器，通过融合文本和视觉支持特征学习一种轻量级的、针对每张图像的分类器。与依赖于后期手工融合的先前方法不同，我们的方法进行学习的、针对每个查询的融合，实现了模态之间的更强协同作用。该方法支持不断扩展的支持集，并适用于细粒度任务，如个性化分割。实验表明，我们显著缩小了零样本和监督分割之间的差距，同时保持了开放词汇的能力。

Summary / 总结

The paper addresses the limitations of open-vocabulary segmentation (OVS) by proposing a few-shot setting that combines textual prompts with pixel-annotated images. It introduces a retrieval-augmented test-time adapter to learn a lightweight classifier by fusing textual and visual support features, achieving better synergy between modalities than prior methods. Experiments demonstrate that this approach significantly reduces the gap between zero-shot and supervised segmentation while maintaining open-vocabulary capabilities.

论文通过结合文本提示和像素标注图像的少量样本设置，解决了开放词汇分割（OVS）的限制。提出了一个检索增强的测试时适配器，通过融合文本和视觉支持特征来学习轻量级分类器，实现了比先前方法更好的模态间协同作用。实验表明，这种方法显著缩小了零样本和监督分割之间的差距，同时保持了开放词汇的能力。
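
The retrieve-then-fuse idea can be sketched with toy vectors. Note the important simplification: the paper describes a learned, per-query fusion, whereas this sketch uses a fixed mixing weight `alpha` as a stand-in. The function names, feature values, and `k` are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def fused_prototype(text_emb, support_feats, query_feat, k=2, alpha=0.5):
    """Retrieve the k support features nearest to the query image and mix
    their mean with the text embedding (fixed alpha, not a learned fusion)."""
    top = sorted(support_feats, key=lambda f: cosine(f, query_feat), reverse=True)[:k]
    visual = [sum(col) / len(top) for col in zip(*top)]
    return [alpha * t + (1 - alpha) * v for t, v in zip(text_emb, visual)]

def classify(pixel_feat, prototypes):
    """Assign the class whose fused prototype is most similar to the pixel feature."""
    return max(prototypes, key=lambda name: cosine(pixel_feat, prototypes[name]))

# Toy 2-D features standing in for VLM embeddings.
text = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
support = {"cat": [[0.9, 0.1], [0.8, 0.3]], "dog": [[0.1, 0.9], [0.2, 0.7]]}
query_image_feat = [0.7, 0.4]
prototypes = {c: fused_prototype(text[c], support[c], query_image_feat) for c in text}
label = classify([0.6, 0.2], prototypes)
```

Because the prototypes are rebuilt per query image from whatever support set is available, adding new annotated examples changes the classifier without retraining the backbone.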

Differentiable Zero-One Loss via Hypersimplex Projections

Authors: Camilo Gomez, Pengyang Wang, Liansheng Tang

First: 2026-02-26T18:41:31+00:00 · Latest: 2026-02-26T18:41:31+00:00

Comments: To appear in PAKDD 2026 (Pacific-Asia Conference on Knowledge Discovery and Data Mining), 12 pages

Abstract

Recent advances in machine learning have emphasized the integration of structured optimization components into end-to-end differentiable models, enabling richer inductive biases and tighter alignment with task-specific objectives. In this work, we introduce a novel differentiable approximation to the zero-one loss, which has long been considered the gold standard for classification performance yet is incompatible with gradient-based optimization due to its non-differentiability. Our method constructs a smooth, order-preserving projection onto the n,k-dimensional hypersimplex through a constrained optimization framework, leading to a new operator we term Soft-Binary-Argmax. After deriving its mathematical properties, we show how its Jacobian can be efficiently computed and integrated into binary and multiclass learning systems. Empirically, our approach achieves significant improvements in generalization under large-batch training by imposing geometric consistency constraints on the output logits, thereby narrowing the performance gap traditionally observed in large-batch training.

中文标题/摘要

标题：通过超单纯形投影实现可微的零一损失

机器学习的最新进展强调将结构化优化组件整合到端到端的可微模型中，以实现更丰富的归纳偏置和更紧密的任务特定目标对齐。在本文中，我们提出了一种新颖的零一损失可微近似方法。零一损失长期以来被视为分类性能的金标准，但由于不可微，无法与基于梯度的优化兼容。我们的方法通过约束优化框架构造了一个平滑的、保持顺序的到n,k维超单纯形上的投影，从而提出了一种新的算子，称为Soft-Binary-Argmax。在推导其数学性质后，我们展示了如何高效计算其雅可比矩阵并将其集成到二元和多分类学习系统中。实验上，我们的方法通过在输出logits上施加几何一致性约束，在大批次训练中实现了显著的泛化改进，从而缩小了传统上观察到的大批次训练性能差距。

Summary / 总结

This work addresses the challenge of integrating the zero-one loss into differentiable models by proposing a smooth approximation called Soft-Binary-Argmax. The method uses a constrained optimization framework to project onto the hypersimplex, enabling gradient-based optimization. Empirically, the approach improves generalization in large-batch training by maintaining geometric consistency in the output logits, thus reducing the performance gap observed in such settings.

该研究通过提出一种平滑近似方法Soft-Binary-Argmax，解决了将零一损失整合到可微模型中的难题。该方法利用约束优化框架将投影到超单纯形上，从而支持基于梯度的优化。实验表明，通过在输出logits上施加几何一致性约束，该方法在大批次训练中提高了泛化能力，缩小了与小批次训练之间的性能差距。
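
For intuition about hypersimplex projections, the plain Euclidean projection onto {x in [0,1]^n : sum(x) = k} has the form x_i = clip(z_i - tau, 0, 1) for a scalar shift tau, which can be found by bisection. This is not the paper's Soft-Binary-Argmax: the Euclidean projection is only piecewise linear, while the paper describes a smooth operator. The sketch assumes 0 < k < n.

```python
def project_hypersimplex(z, k, iters=100):
    """Euclidean projection of z onto {x in [0,1]^n : sum(x) = k}, 0 < k < n."""
    def total(tau):
        return sum(min(1.0, max(0.0, zi - tau)) for zi in z)

    # total() is non-increasing in tau: total(lo) = n and total(hi) = 0.
    lo, hi = min(z) - 1.0, max(z)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if total(mid) > k:
            lo = mid
        else:
            hi = mid
    tau = (lo + hi) / 2
    return [min(1.0, max(0.0, zi - tau)) for zi in z]

# Four logits, projected so that the entries sum to k = 2.
x = project_hypersimplex([2.0, 0.5, 0.1, -1.0], k=2)
```

The output preserves the ordering of the inputs and behaves like a relaxed top-k indicator, which is the property that makes such projections a natural surrogate for hard zero-one decisions.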

Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset

Authors: Dany Haddad, Dan Bareket, Joseph Chee Chang, Jay DeYoung, Jena D. Hwang, Uri Katz, Mark Polak, Sangho Suh, Harshit Surana, Aryeh Tiktinsky, Shriya Atmakuri, Jonathan Bragg, Mike D'Arcy, Sergey Feldman, Amal Hassan-Ali, Rubén Lozano, Bodhisattwa Prasad Majumder, Charles McGrady, Amanpreet Singh, Brooke Vlahos, Yoav Goldberg, Doug Downey

First: 2026-02-26T18:40:28+00:00 · Latest: 2026-02-26T18:40:28+00:00

Abstract

AI-powered scientific research tools are rapidly being integrated into research workflows, yet the field lacks a clear lens into how researchers use these systems in real-world settings. We present and analyze the Asta Interaction Dataset, a large-scale resource comprising over 200,000 user queries and interaction logs from two deployed tools (a literature discovery interface and a scientific question-answering interface) within an LLM-powered retrieval-augmented generation platform. Using this dataset, we characterize query patterns, engagement behaviors, and how usage evolves with experience. We find that users submit longer and more complex queries than in traditional search, and treat the system as a collaborative research partner, delegating tasks such as drafting content and identifying research gaps. Users treat generated responses as persistent artifacts, revisiting and navigating among outputs and cited evidence in non-linear ways. With experience, users issue more targeted queries and engage more deeply with supporting citations, although keyword-style queries persist even among experienced users. We release the anonymized dataset and analysis with a new query intent taxonomy to inform future designs of real-world AI research assistants and to support realistic evaluation.

中文标题/摘要

标题：理解AI驱动的科学研究工具的使用与参与：Asta交互数据集

AI驱动的科学研究工具正迅速融入研究工作流程，但该领域缺乏一个清晰的视角来了解研究人员在实际环境中如何使用这些系统。我们介绍了并分析了Asta交互数据集，这是一个包含超过200,000个用户查询和交互日志的大规模资源，来自两个部署工具（文献发现界面和科学问题解答界面）在一个基于LLM的检索增强生成平台上。利用该数据集，我们描述了查询模式、参与行为以及使用随经验变化的方式。我们发现，用户提交的查询比传统搜索更长、更复杂，并将系统视为协作研究伙伴，分配任务如撰写内容和识别研究空白。用户将生成的响应视为持久化的成果，以非线性方式反复访问和导航输出及引用证据。随着经验的积累，用户提出更针对性的查询，并更深入地参与支持引用，尽管经验用户中仍存在关键词式查询。我们发布了匿名数据集和分析，以及一个新的查询意图分类法，以指导未来实际AI研究助手的设计，并支持现实的评估。

Utilizing LLMs for Industrial Process Automation

Authors: Salim Fares

First: 2026-02-26T18:38:00+00:00 · Latest: 2026-02-26T18:38:00+00:00

Abstract

In recent years, a growing number of publications have addressed best practices for using Large Language Models (LLMs) in software engineering. However, most of this work focuses on widely used general-purpose programming languages like Python, owing to their widespread presence in training data. The utility of LLMs for software within the industrial process automation domain, with highly specialized languages that are typically only used in proprietary contexts, remains underexplored. This research aims to utilize and integrate LLMs in the industrial development process, solving real-life programming tasks (e.g., generating a movement routine for a robotic arm) and accelerating the development cycles of manufacturing systems.

中文标题/摘要

标题：利用大语言模型进行工业过程自动化

近年来，越来越多的研究论文探讨了使用大语言模型（LLMs）进行软件工程的最佳实践。然而，大多数研究工作集中在广泛使用的通用编程语言（如Python）上，因为这些语言在训练数据中大量出现。在工业过程自动化领域，软件通常使用仅见于专有环境的高度专业化语言，LLMs在该领域的实用性仍未得到充分探索。本研究旨在利用和整合LLMs到工业开发过程中，解决实际编程任务（例如，生成机器人手臂的运动程序），并加速制造系统的开发周期。

Summary / 总结

This research aims to explore the application of Large Language Models (LLMs) in industrial process automation, where highly specialized languages are typically used only in proprietary contexts. The study targets real-life programming tasks such as generating movement routines for robotic arms, and proposes integrating LLMs into the industrial development process to accelerate the development cycles of manufacturing systems. The abstract describes the research goal and reports no experimental results.

研究旨在探索大型语言模型（LLMs）在工业过程自动化中的应用，重点关注通常仅在专有环境中使用的高度专业化语言。研究计划利用LLMs解决实际编程任务（如为机器人手臂生成运动程序），以加速制造系统的开发周期。摘要仅描述了研究目标，未报告实验结果。

Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks

Authors: Kunihiro Miyazaki, Takanobu Kawahara, Stephen Roberts, Stefan Zohren

First: 2026-02-26T18:37:36+00:00 · Latest: 2026-02-26T18:37:36+00:00

Comments: 14 pages, 3 figures

Abstract

The advancement of large language models (LLMs) has accelerated the development of autonomous financial trading systems. While mainstream approaches deploy multi-agent systems mimicking analyst and manager roles, they often rely on abstract instructions that overlook the intricacies of real-world workflows, which can lead to degraded inference performance and less transparent decision-making. Therefore, we propose a multi-agent LLM trading framework that explicitly decomposes investment analysis into fine-grained tasks, rather than providing coarse-grained instructions. We evaluate the proposed framework using Japanese stock data, including prices, financial statements, news, and macro information, under a leakage-controlled backtesting setting. Experimental results show that fine-grained task decomposition significantly improves risk-adjusted returns compared to conventional coarse-grained designs. Crucially, further analysis of intermediate agent outputs suggests that alignment between analytical outputs and downstream decision preferences is a critical driver of system performance. Moreover, we conduct standard portfolio optimization, exploiting low correlation with the stock index and the variance of each system's output. This approach achieves superior performance. These findings contribute to the design of agent structure and task configuration when applying LLM agents to trading systems in practical settings.

中文标题/摘要

标题：朝向专家投资团队：细粒度交易任务的多智能体LLM系统

大型语言模型（LLMs）的进步加速了自主金融交易系统的开发。虽然主流方法模仿分析师和经理的角色部署多智能体系统，但它们通常依赖于抽象指令，忽略了实际工作流程的复杂性，这可能导致推理性能下降和决策透明度降低。因此，我们提出了一种多智能体LLM交易框架，明确将投资分析细分为细粒度任务，而不是提供粗粒度指令。我们使用包含股价、财务报表、新闻和宏观经济信息的日本股票数据，在受控泄漏回测设置下评估了所提出的框架。实验结果表明，细粒度任务分解显著提高了风险调整后的回报率，与传统的粗粒度设计相比。更重要的是，对中间智能体输出的进一步分析表明，分析输出与下游决策偏好的对齐是系统性能的关键驱动因素。此外，我们进行了标准投资组合优化，利用与股票指数低相关性和每个系统输出的方差。这种方法实现了更好的性能。这些发现为在实际应用中将LLM代理应用于交易系统时设计智能体结构和任务配置做出了贡献。

Summary / 总结

The paper proposes a multi-agent LLM trading framework that decomposes investment analysis into fine-grained tasks to improve risk-adjusted returns. It evaluates the framework using Japanese stock data and finds that fine-grained task decomposition outperforms conventional coarse-grained designs. The analysis of intermediate outputs indicates that alignment between analytical outputs and decision preferences is crucial for system performance. The approach also achieves superior portfolio optimization results.

本文提出了一种将投资分析细分为具体任务的多代理LLM交易框架，以提高性能并优于传统的粗粒度设计。该框架使用日本股票数据进行评估，显示出显著的风险调整后回报率提升。中间输出的分析表明，分析输出与决策偏好之间的对齐是系统性能的关键。研究还通过标准投资组合优化展示了优越的性能，突出了任务分解在LLM交易系统中的重要性。
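
"Risk-adjusted return" is commonly measured with the Sharpe ratio; the abstract does not name the metric, so treating it as the Sharpe ratio is an assumption here. The return series below are invented purely to show the calculation.

```python
import math
from statistics import mean, stdev

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio of a series of per-period returns."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)

# Hypothetical daily returns of two trading systems.
steady = [0.01, 0.02, -0.01, 0.015]
volatile = [0.03, -0.04, 0.05, -0.03]
steady_sharpe = sharpe_ratio(steady)
volatile_sharpe = sharpe_ratio(volatile)
```

The steadier series scores higher even though the volatile one has the larger single-day gain, which is the sense in which the metric rewards return per unit of risk rather than raw return.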

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Authors: Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael

First: 2026-02-26T18:37:23+00:00 · Latest: 2026-02-26T18:37:23+00:00

Comments: 59 pages, 33 figures

Abstract

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.

中文标题/摘要

标题：LLM初学者在双重用途和计算生物学任务中的提升

大型语言模型（LLMs）在生物学基准测试中的表现越来越好，但尚不清楚它们是否能提升初学者的表现——即是否能帮助人类比仅使用互联网资源时表现得更好。这种不确定性是理解科学加速和双重用途风险的关键。我们进行了一个多模型、多基准的人类提升研究，比较了有LLM访问权限的初学者和仅有互联网访问权限的初学者在八个与生物安全相关的任务集上的表现。参与者在复杂问题上工作，有充足的时间（最复杂的任务最多13小时）。我们发现，LLM访问提供了显著的提升：有LLM的初学者比对照组准确度高4.16倍（95% CI [2.63, 6.87]）。在四个有专家基线的基准测试中（仅有互联网），有LLM的初学者在三个基准测试中表现优于专家。令人惊讶的是，独立的LLM往往超过了LLM辅助的初学者，表明用户没有从LLM中获得最强的贡献。大多数参与者（89.6%）报告称，尽管有保护措施，他们仍能轻松获取与双重用途相关的信息。总体而言，LLM显著提升了初学者在以前仅由训练有素的从业者完成的生物学任务上的表现，强调了需要在传统基准测试的同时进行持续的互动提升评估。

Summary / 总结

This study evaluates the performance of novice users with access to large language models (LLMs) on biological tasks, comparing them to those using only internet resources. Participants were given up to 13 hours to solve complex problems across eight biosecurity-relevant task sets. The results show that LLM access significantly improved novice performance, with LLM-assisted novices being 4.16 times more accurate than those without LLMs. Notably, standalone LLMs often outperformed LLM-assisted novices, and most participants found it easy to obtain dual-use-relevant information. This suggests that LLMs can substantially enhance novice capabilities in biological tasks, highlighting the need for ongoing evaluations of their impact.

本研究评估了有大型语言模型（LLM）访问权限的初学者在生物任务上的表现，将他们与仅使用互联网资源的参与者进行了比较。参与者被给予最多13小时的时间来解决八个与生物安全相关的复杂问题。结果显示，LLM访问显著提高了初学者的性能，LLM辅助的初学者比没有LLM的初学者准确度高4.16倍。值得注意的是，独立的LLM往往比LLM辅助的初学者表现更好，且大多数参与者发现即使有安全措施，获取与双重用途相关的信息也很容易。这表明LLM可以显著增强初学者在生物任务中的能力，强调了持续进行其影响的评估的重要性。

DropVLA: An Action-Level Backdoor Attack on Vision-Language-Action Models

Authors: Zonghuan Xu, Xiang Zheng, Xingjun Ma, Yu-Gang Jiang

First: 2025-10-13T02:45:48+00:00 · Latest: 2026-02-26T18:32:27+00:00

Comments: 8 pages, 6 tables, 3 figures. Under review

Abstract

Vision-Language-Action (VLA) models map multimodal perception and language instructions to executable robot actions, making them particularly vulnerable to behavioral backdoor manipulation: a hidden trigger introduced during training can induce unintended physical actions while nominal task performance remains intact. Prior work on VLA backdoors primarily studies untargeted attacks or task-level hijacking, leaving fine-grained control over individual actions largely unexplored. In this work, we present DropVLA, an action-level backdoor attack that forces a reusable action primitive (e.g., open_gripper) to execute at attacker-chosen decision points under a realistic pipeline-black-box setting with limited data-poisoning access, using a window-consistent relabeling scheme for chunked fine-tuning. On OpenVLA-7B evaluated with LIBERO, vision-only poisoning achieves 98.67%-99.83% attack success rate (ASR) with only 0.31% poisoned episodes while preserving 98.50%-99.17% clean-task retention, and successfully triggers the targeted action within 25 control steps at 500 Hz (0.05 s). Text-only triggers are unstable at low poisoning budgets, and combining text with vision provides no consistent ASR improvement over vision-only attacks. The backdoor remains robust to moderate trigger variations and transfers across evaluation suites (96.27%, 99.09%), whereas text-only largely fails (0.72%). We further validate physical-world feasibility on a 7-DoF Franka arm with pi0-fast, demonstrating non-trivial attack efficacy under camera-relative motion that induces image-plane trigger drift. These results reveal that VLA models can be covertly steered at the granularity of safety-critical actions with minimal poisoning and without observable degradation of nominal performance.

中文标题/摘要

标题：DropVLA：视觉-语言-行动模型中的行动级后门攻击

视觉-语言-行动（VLA）模型将多模态感知和语言指令映射为可执行的机器人动作，使其特别容易受到行为后门操纵：在训练期间引入的隐藏触发器可以在保持名义任务性能的同时引发意外的物理动作。先前对VLA后门的研究主要集中在无目标攻击或任务级劫持上，而对个体动作的精细控制尚未得到充分探索。在本研究中，我们提出了DropVLA，这是一种行动级后门攻击，能够在有限的数据污染访问和现实的管道黑盒设置下，通过窗口一致的重新标记方案进行分块微调，迫使可重用的动作原语（例如，open_gripper）在攻击者选择的决策点执行。在使用LIBERO评估的OpenVLA-7B中，仅通过视觉污染，攻击成功率（ASR）达到98.67%-99.83%，污染的剧集比例仅为0.31%，同时保持98.50%-99.17%的任务清洁保留率，并在25个控制步骤内以500 Hz（0.05秒）成功触发目标动作。仅文本触发在低污染预算下不稳定，结合文本与视觉并不能在视觉污染攻击上提供一致的ASR改进。后门对触发器的适度变化具有鲁棒性，并且可以在评估套件之间转移（96.27%，99.09%），而仅文本则大多失败（0.72%）。我们还在具有7个自由度的Franka手臂上通过pi0-fast验证了物理世界的可行性，展示了在相机相对运动下诱导图像平面触发漂移的非平凡攻击效果。这些结果表明，VLA模型可以在最小的污染和无明显性能退化的情况下，被隐蔽地引导到关键安全动作级别。

Summary / 总结

DropVLA is an action-level backdoor attack on VLA models that forces a specific action primitive to execute at attacker-chosen decision points. The attack uses a window-consistent relabeling scheme for chunked fine-tuning. With vision-only poisoning of just 0.31% of episodes, it achieves a 98.67%-99.83% attack success rate while preserving 98.50%-99.17% clean-task retention. The backdoor is robust to moderate trigger variations and transfers across evaluation suites, whereas text-only triggers are unstable at low poisoning budgets.

DropVLA 是一种针对 VLA 模型的动作级后门攻击，能够在特定决策点强制执行特定的动作原语。该攻击使用窗口一致的重新标记方案进行微调，并在极少量的数据污染下实现了 98.67%-99.83% 的攻击成功率，同时保持任务性能。该攻击对适度的触发器变化具有鲁棒性，并且可以在不同的评估套件之间进行转移，但纯文本触发器在低污染预算下不稳定。

ParamMem: Augmenting Language Agents with Parametric Reflective Memory

Authors: Tianjun Yao, Yongqiang Chen, Yujia Zheng, Pan Li, Zhiqiang Shen, Kun Zhang

First: 2026-02-26T18:28:04+00:00 · Latest: 2026-02-26T18:28:04+00:00

Comments: 20 pages

Abs · PDF · Code1 · Code2

Abstract

Self-reflection enables language agents to iteratively refine solutions, yet often produces repetitive outputs that limit reasoning performance. Recent studies have attempted to address this limitation through various approaches, among which increasing reflective diversity has shown promise. Our empirical analysis reveals a strong positive correlation between reflective diversity and task success, further motivating the need for diverse reflection signals. We introduce ParamMem, a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Building on this module, we propose ParamAgent, a reflection-based agent framework that integrates parametric memory with episodic and cross-sample memory. Extensive experiments on code generation, mathematical reasoning, and multi-hop question answering demonstrate consistent improvements over state-of-the-art baselines. Further analysis reveals that ParamMem is sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without reliance on a stronger external model, highlighting the potential of ParamMem as an effective component for enhancing language agents.

中文标题/摘要

标题：ParamMem：通过参数化反思记忆增强语言代理

自我反思使语言代理能够迭代地改进解决方案，但往往会产生重复的输出，限制了推理性能。最近的研究试图通过各种方法解决这一限制，其中增加反思多样性显示出前景。我们的实证分析揭示了反思多样性和任务成功率之间存在强烈的正相关关系，进一步突出了多样化反思信号的必要性。我们引入了ParamMem，这是一种参数化记忆模块，将跨样本的反思模式编码到模型参数中，通过温度控制采样实现多样化的反思生成。在此基础上，我们提出了ParamAgent，这是一种结合参数化记忆和情景记忆及跨样本记忆的基于反思的代理框架。在代码生成、数学推理和多跳问答等广泛实验中，ParamAgent 显示出对最先进的基线方法的一致改进。进一步的分析表明，ParamMem 具有样本效率高、在不同模型规模下实现弱到强的迁移，并且支持自我改进而无需依赖更强的外部模型，突显了ParamMem作为增强语言代理的有效组件的潜力。

Summary / 总结

The research aims to enhance language agents' reasoning capabilities by increasing reflective diversity during self-reflection. ParamMem, a parametric memory module, encodes cross-sample reflection patterns into model parameters, allowing diverse reflections to be generated via temperature-controlled sampling. Experiments on code generation, mathematical reasoning, and multi-hop question answering show consistent improvements over state-of-the-art baselines. Further analysis indicates that ParamMem is sample-efficient, enables weak-to-strong transfer across model scales, and supports self-improvement without relying on a stronger external model.

研究旨在通过增加语言代理在自我反思过程中的多样性来提升其推理性能。引入了ParamMem参数记忆模块，通过温度控制采样来编码跨样本的反思模式，从而实现多样化的反思生成。在代码生成、数学推理和多跳问答等实验中，ParamMem表现出对现有方法的一致改进，证明了其在提升语言代理方面的有效性。此外，ParamMem还表现出样本高效性、从小到大模型的弱到强迁移能力以及无需依赖更强的外部模型即可实现自我改进的特点。

LinGuinE: Longitudinal Guidance Estimation for Volumetric Tumour Segmentation

Authors: Nadine Garibli, Mayank Patwari, Bence Csiba, Yi Wei, Kostantinos Sidiropoulos

First: 2025-06-06T13:52:33+00:00 · Latest: 2026-02-26T18:27:23+00:00

Comments: 10 pages, 2 figures

Abs · PDF · Code1 · Code2

Abstract

Longitudinal volumetric tumour segmentation is critical for radiotherapy planning and response assessment, yet this problem is underexplored and most methods produce single-timepoint semantic masks, lack lesion correspondence, and offer limited radiologist control. We introduce LinGuinE (Longitudinal Guidance Estimation), a PyTorch framework that combines image registration and guided segmentation to deliver lesion-level tracking and volumetric masks across all scans in a longitudinal study from a single radiologist prompt. LinGuinE is temporally direction agnostic, requires no training on longitudinal data, and allows any registration and semi-automatic segmentation algorithm to be repurposed for the task. We evaluate various combinations of registration and segmentation algorithms within the framework. LinGuinE achieves state-of-the-art segmentation and tracking performance across four datasets with a total of 456 longitudinal studies. Tumour segmentation performance shows minimal degradation with increasing temporal separation. We conduct ablation studies to determine the impact of autoregression, pathology specific finetuning, and the use of real radiologist prompts. We release our code and substantial public benchmarking for longitudinal segmentation, facilitating future research.

中文标题/摘要

标题：LinGuinE: 长期肿瘤分割的纵向引导估计

纵向体素肿瘤分割对于放射治疗计划和反应评估至关重要，但这一问题尚未得到充分探索，大多数方法仅生成单时点语义掩码，缺乏病灶对应关系，并且对放射科医生的控制有限。我们引入了LinGuinE（纵向引导估计），这是一种结合图像配准和引导分割的PyTorch框架，能够从单个放射科医生提示中为纵向研究中的所有扫描提供病灶级跟踪和体素掩码。LinGuinE在时间方向上是无方向性的，无需在纵向数据上进行训练，并允许任何配准和半自动分割算法重新用于此任务。我们评估了框架内的各种配准和分割算法组合。LinGuinE在四个数据集的456个纵向研究中实现了最先进的分割和跟踪性能。肿瘤分割性能在时间间隔增加时几乎没有下降。我们进行了消融研究以确定自回归、病理特异性微调和使用真实放射科医生提示的影响。我们发布了我们的代码和大量的公共基准测试，以促进未来的研究。

Summary / 总结

LinGuinE is a PyTorch framework designed for longitudinal volumetric tumour segmentation, addressing the limitations of existing methods by providing lesion-level tracking and volumetric masks across all scans in a longitudinal study from a single radiologist prompt. It combines image registration and guided segmentation, requires no training on longitudinal data, and allows any registration and semi-automatic segmentation algorithm to be repurposed for the task. LinGuinE demonstrates state-of-the-art performance across four datasets with 456 longitudinal studies, showing minimal degradation in tumour segmentation performance with increasing temporal separation. Ablation studies examine the impact of autoregression, pathology-specific fine-tuning, and real radiologist prompts.

LinGuinE 是一个 PyTorch 框架，旨在进行纵向体积肿瘤分割，通过提供纵向研究中所有扫描的病变级跟踪和体积掩码来解决现有方法的局限性。该框架结合了图像配准和引导分割，无需纵向数据训练，并允许使用任何配准和半自动分割算法。LinGuinE 在四个数据集的 456 个纵向研究中表现出最先进的性能，随时间间隔增加肿瘤分割性能的下降幅度最小。消融研究进一步验证了框架组件的有效性。

Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction

Authors: Rafael R. Baptista, André de Lima Salgado, Ricardo V. Godoy, Marcelo Becker, Thiago Boaventura, Gustavo J. G. Lahr

First: 2026-02-26T18:20:26+00:00 · Latest: 2026-02-26T18:20:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challenging for resource-constrained mobile and assistive robots. While large language models (LLMs) have shown promise for natural communication, their size and latency limit on-device deployment. Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated. In this paper, we present a benchmark of SLMs for leader-follower communication, introducing a novel dataset derived from a published database and augmented with synthetic samples to capture interaction-specific dynamics. We investigate two adaptation strategies: prompt engineering and fine-tuning, studied under zero-shot and one-shot interaction modes, compared with an untrained baseline. Experiments with Qwen2.5-0.5B reveal that zero-shot fine-tuning achieves robust classification performance (86.66% accuracy) while maintaining low latency (22.2 ms per sample), significantly outperforming baseline and prompt-engineered approaches. However, results also indicate a performance degradation in one-shot modes, where increased context length challenges the model's architectural capacity. These findings demonstrate that fine-tuned SLMs provide an effective solution for direct role assignment, while highlighting critical trade-offs between dialogue complexity and classification reliability on the edge.

中文标题/摘要

标题：小语言模型在领导者-跟随者互动中的零样本和单样本适应性评估

领导者-跟随者互动是人机交互（HRI）中的一个重要范式。然而，为资源受限的移动和辅助机器人实时分配角色仍然具有挑战性。虽然大型语言模型（LLMs）在自然通信方面显示出潜力，但其规模和延迟限制了其在设备上的部署。小语言模型（SLMs）提供了一种替代方案，但它们在HRI中的角色分类效果尚未系统评估。在本文中，我们提出了SLMs在领导者-跟随者通信中的基准测试，引入了一个从已发表数据库派生的新数据集，并通过合成样本捕捉互动特定的动力学。我们研究了两种适应策略：提示工程和微调，在零样本和单样本交互模式下进行研究，并与未训练基线进行比较。实验结果表明，零样本微调在保持低延迟（每样本22.2毫秒）的同时实现了稳健的分类性能（准确率为86.66%），显著优于基线和提示工程方法。然而，结果还表明，在单样本模式下性能有所下降，其中增加的上下文长度挑战了模型的架构能力。这些发现表明，微调后的SLMs为直接角色分配提供了一个有效的解决方案，同时突显了对话复杂性和分类可靠性之间的关键权衡关系在边缘设备上。

Summary / 总结

This paper evaluates the effectiveness of small language models (SLMs) for role classification in leader-follower human-robot interaction (HRI), comparing prompt engineering and fine-tuning under zero-shot and one-shot interaction modes. The study introduces a new dataset derived from a published database and augmented with synthetic samples. Experiments with Qwen2.5-0.5B show that zero-shot fine-tuning achieves high accuracy (86.66%) with low latency (22.2 ms per sample), outperforming baseline and prompt-engineered approaches. However, performance degrades in one-shot modes, where the longer context challenges the model's capacity. The findings suggest that fine-tuned SLMs are effective for role assignment in HRI but highlight the trade-offs between dialogue complexity and classification reliability on the edge.

该研究评估了小语言模型（SLMs）在人类-机器人交互（HRI）中的领导者-跟随者交互效果，重点关注零样本和单样本适应策略。研究引入了一个新数据集，并比较了提示工程、微调和未训练基线的方法。实验结果显示，零样本微调在Qwen2.5-0.5B上实现了高准确率（86.66%）和低延迟（每样本22.2毫秒），优于其他方法，但在单样本模式下性能下降，因为增加了上下文长度的挑战。这表明微调后的SLMs在HRI中的角色分配中具有潜在效果，同时也指出了需要管理对话复杂性和分类可靠性之间的权衡。

Evaluating the Diversity and Quality of LLM Generated Content

Authors: Alexander Shypula, Shuo Li, Botong Zhang, Vishakh Padmakumar, Kayo Yin, Osbert Bastani

First: 2025-04-16T23:02:23+00:00 · Latest: 2026-02-26T18:17:44+00:00

Comments: Published at COLM 2025

Abs · PDF · Code1 · Code2

Abstract

Recent work suggests that preference-tuning techniques -- such as Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO -- reduce diversity, creating a dilemma given that these models are widely deployed in applications requiring varied outputs. We argue that diversity without consideration of quality has limited practical value. To address this issue, we introduce a framework for measuring effective semantic diversity -- diversity among outputs that meet quality thresholds -- which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: when using diversity metrics that do not explicitly consider quality, preference-tuned models -- particularly those trained via RL -- often produce outputs with lower diversity; however, these same preference-tuned models generate greater effective semantic diversity than supervised fine-tuned (SFT) or base models. Our analysis further shows another trend: while larger models may exhibit greater effective semantic diversity than smaller models, the smaller models are consistently more parameter-efficient at producing unique content within a fixed sampling budget. These findings have practical implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.

中文标题/摘要

标题：评估大语言模型生成内容的多样性和质量

近期研究表明，偏好调优技术——如基于人类反馈强化学习（RLHF）方法（如PPO和GRPO），以及替代方法DPO——降低了多样性，这给这些模型在需要多样化输出的应用中广泛应用带来了困境。我们认为，不考虑质量的多样性在实际应用中价值有限。为解决这一问题，我们提出了一种衡量有效语义多样性的框架——衡量满足质量标准的输出之间的多样性——这更好地反映了大语言模型（LLM）的实际效用。通过不需要人类干预的开放任务，我们发现了一些反直觉的结果：当使用不考虑质量的多样性度量时，偏好调优模型——尤其是通过RL训练的模型——往往生成的输出多样性较低；然而，这些偏好调优模型生成的有效语义多样性却大于监督微调（SFT）或基础模型。我们的分析还显示了另一种趋势：虽然较大的模型可能在固定采样预算内生成更独特的内容方面表现出更大的有效语义多样性，但较小的模型在生成独特内容方面始终更具有参数效率。这些发现对需要多样化且高质量输出的应用具有实际意义，从创意辅助到合成数据生成。

Summary / 总结

The study evaluates the diversity and quality of content generated by large language models (LLMs) and introduces a framework for measuring effective semantic diversity, defined as diversity among outputs that meet quality thresholds. Using open-ended tasks, the research finds that preference-tuned models, especially those trained via reinforcement learning, often score lower on diversity metrics that ignore quality, yet generate greater effective semantic diversity than supervised fine-tuned or base models. Additionally, while larger models may exhibit greater effective semantic diversity, smaller models are consistently more parameter-efficient at producing unique content within a fixed sampling budget.

研究评估了大型语言模型（LLMs）生成内容的多样性和质量，并引入了一个同时考虑多样性和质量的有效语义多样性测量框架。通过使用无需人工干预的开放任务，研究发现，偏好调优模型，尤其是通过强化学习训练的模型，在不考虑质量的情况下使用多样性度量时，生成的多样性较低，但与监督微调或基础模型相比，生成了更大的有效语义多样性。此外，较小的模型在固定采样预算内生成独特内容方面更具参数效率。
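The distinction between plain diversity and effective semantic diversity in the entry above can be illustrated with a toy computation. This is only a sketch under loud assumptions: the abstract does not give the paper's actual metric, so the function names, the per-sample `(semantic_label, quality_score)` representation, and the normalization by sample count are all hypothetical choices made for illustration.

```python
from typing import Hashable, Sequence, Tuple

Sample = Tuple[Hashable, float]  # (semantic_label, quality_score); both hypothetical inputs


def plain_diversity(samples: Sequence[Sample]) -> float:
    """Fraction of distinct semantic labels among all samples, ignoring quality."""
    if not samples:
        return 0.0
    return len({label for label, _ in samples}) / len(samples)


def effective_semantic_diversity(samples: Sequence[Sample], threshold: float) -> float:
    """Fraction of distinct semantic labels among samples whose quality
    meets the threshold, normalized by the total sampling budget."""
    if not samples:
        return 0.0
    kept = {label for label, quality in samples if quality >= threshold}
    return len(kept) / len(samples)


# A "base-like" model: varied outputs, but most fall below the quality bar.
base_like = [("a", 0.9), ("a", 0.8), ("b", 0.2), ("c", 0.1)]
# A "tuned-like" model: fewer distinct meanings, all of acceptable quality.
tuned_like = [("a", 0.9), ("b", 0.9), ("a", 0.7), ("b", 0.6)]

print(plain_diversity(base_like), effective_semantic_diversity(base_like, 0.5))
print(plain_diversity(tuned_like), effective_semantic_diversity(tuned_like, 0.5))
```

In this toy data the base-like model has the higher plain diversity while the tuned-like model has the higher effective diversity, which is the kind of reversal the abstract describes.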

Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training

Authors: Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi

First: 2026-02-24T18:43:08+00:00 · Latest: 2026-02-26T18:11:36+00:00

Comments: updated related work discussion

Abs · PDF · Code1 · Code2

Abstract

Pass@$k$ is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It counts a prompt as solved if any of $k$ independently sampled solutions passes a verifier. This multi-sample inference metric has motivated inference-aware fine-tuning methods that directly optimize pass@$k$. However, prior work reports a recurring trade-off: pass@$k$ improves while pass@1 degrades under such methods. This trade-off is practically important because pass@1 often remains a hard operational constraint due to latency and cost budgets, imperfect verifier coverage, and the need for a reliable single-shot fallback. We study the origin of this trade-off and provide a theoretical characterization of when pass@$k$ policy optimization can reduce pass@1 through gradient conflict induced by prompt interference. We show that pass@$k$ policy gradients can conflict with pass@1 gradients because pass@$k$ optimization implicitly reweights prompts toward low-success prompts; when these prompts are what we term negatively interfering, their upweighting can rotate the pass@$k$ update direction away from the pass@1 direction. We illustrate our theoretical findings with large language model experiments on verifiable mathematical reasoning tasks.

中文标题/摘要

标题：为什么 Pass@k 优化会降低 Pass@1：LLM 训练后提示干扰的影响

Pass@k 是用于可验证的大语言模型任务（包括数学推理、代码生成和简短答案推理）的广泛使用的性能指标。它定义为如果 $k$ 个独立采样的解决方案中有任何一个通过验证器，则视为成功。这种多样本推理指标促使了推理感知微调方法的发展，这些方法直接优化 Pass@k。然而，先前的工作报告了这种方法下的一个反复出现的权衡：Pass@k 提高而 Pass@1 降低。这种权衡在实践中非常重要，因为 Pass@1 往往由于延迟和成本预算、验证器覆盖率不完善以及需要可靠的单次尝试后备而成为一项硬性操作约束。我们研究了这种权衡的起源，并提供了 Pass@k 政策优化如何通过梯度冲突导致提示干扰而降低 Pass@1 的理论表征。我们表明，由于 Pass@k 优化隐式地将提示重新加权为低成功率提示，因此 Pass@k 政策梯度可以与 Pass@1 梯度发生冲突；当这些提示我们称之为负干扰时，它们的加权可以旋转 Pass@k 更新方向远离 Pass@1 方向。我们通过可验证的数学推理任务的大语言模型实验说明了我们的理论发现。

Summary / 总结

The paper investigates why optimizing for Pass@k can degrade Pass@1 performance in large language models (LLMs) used for tasks like mathematical reasoning. It introduces a theoretical framework showing that Pass@k optimization can conflict with Pass@1 optimization due to prompt interference, where low-success prompts are upweighted and can rotate the Pass@k update direction away from the Pass@1 direction. Experiments on verifiable mathematical reasoning tasks confirm this theoretical insight.

论文研究了在大型语言模型任务中优化Pass@k为何会降低Pass@1，重点关注提示干扰。研究表明，由于对提示的重新加权，特别是当这些提示是负干扰时，Pass@k优化会与Pass@1优化发生冲突，导致Pass@k更新方向偏离Pass@1方向。通过可验证的数学推理任务实验支持了这些发现。
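The implicit reweighting mentioned in the entry above can be seen from a standard identity. If a prompt's single-sample success probability is p and the k samples are independent, then pass@k = 1 - (1 - p)^k, so by the chain rule its gradient equals the pass@1 gradient scaled by k(1 - p)^(k-1). The sketch below only evaluates that per-prompt factor; it is not the paper's full analysis, and the function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds,
    given per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_weight(p: float, k: int) -> float:
    """Derivative of pass@k with respect to p, i.e. the factor that
    scales a prompt's pass@1 gradient under pass@k optimization."""
    return k * (1.0 - p) ** (k - 1)


for p in (0.1, 0.5, 0.9):
    print(f"p={p}: pass@4={pass_at_k(p, 4):.4f}, weight={pass_at_k_weight(p, 4):.4f}")
```

With k = 4, a prompt solved 10% of the time receives a factor of about 2.916, while a prompt solved 90% of the time receives about 0.004, so low-success prompts dominate the update direction.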

ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding

Authors: Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai

Venue: ICLR 2026

First: 2026-02-26T18:10:41+00:00 · Latest: 2026-02-26T18:10:41+00:00

Comments: Accepted by ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.

中文标题/摘要

标题：ThinkOmni：通过指导解码提升到全模态场景的文本推理

全模态推理对于智能系统理解并从多种数据源中推断信息至关重要。虽然现有的全模态大型语言模型（OLLM）在感知多种模态方面表现出色，但它们缺乏近期大型推理模型（LRM）的复杂推理能力。然而，通过额外训练来增强OLLM的推理能力面临着重大挑战，包括高质量数据的需求、任务特定的适应以及巨大的计算成本。为了解决这些限制，我们提出了ThinkOmni，这是一种无需训练和数据的框架，将文本推理提升到全模态场景。ThinkOmni引入了两个关键组件：1）LRM-as-a-Guide，利用现成的LRM来指导OLLM的解码过程；2）逐步对比缩放，无需手动超参数调整即可自适应平衡感知和推理信号。在六个多模态推理基准上的实验表明，ThinkOmni始终能够提供性能改进，主要结果在MathVista上达到70.2，在MMAU上达到75.5。总体而言，ThinkOmni提供了一种灵活且通用的全模态推理解决方案，并为推理能力的泛化和应用提供了新的见解。

Summary / 总结

ThinkOmni is a training-free and data-free framework that enhances the reasoning ability of omni-modal large language models (OLLMs) by leveraging off-the-shelf large reasoning models (LRMs) and a stepwise contrastive scaling mechanism. Experiments on six multi-modal reasoning benchmarks show that ThinkOmni improves performance, achieving 70.2 on MathVista and 75.5 on MMAU.

ThinkOmni 是一个无需训练和数据的框架，通过利用现成的大型推理模型（LRMs）和逐步对比缩放机制来增强全模态大型语言模型（OLLMs）的推理能力。实验结果显示，ThinkOmni 在六个多模态推理基准上的表现有所提升，分别在 MathVista 达到 70.2，在 MMAU 达到 75.5。

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Authors: Soumya Dutta, Smruthi Balaji, Sriram Ganapathy

First: 2026-02-26T18:08:40+00:00 · Latest: 2026-02-26T18:08:40+00:00

Comments: Accepted to Elsevier Computer Speech and Language. 30 pages, 9 figures, 5 tables

Abs · PDF · Code1 · Code2

Abstract

Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.

中文标题/摘要

标题：对话中多模态情感识别的专家混合模型

对话中的情感识别（ERC）提出了独特的挑战，要求模型捕捉多轮对话的时间流程并有效整合多种模态的线索。我们提出了Mixture of Speech-Text Experts for Recognition of Emotions（MiSTER-E），这是一种模块化的专家混合（MoE）框架，旨在解耦ERC中的两个核心挑战：模态特定的上下文建模和多模态信息融合。MiSTER-E 利用针对语音和文本均进行了微调的大型语言模型（LLMs）提供丰富的语句级嵌入，然后通过卷积循环上下文建模层进行增强。系统通过一个学习到的门控机制整合来自三个专家（仅语音、仅文本和跨模态）的预测。为了进一步鼓励模态间的一致性和对齐，我们引入了配对语音-文本表示之间的监督对比损失以及基于KL散度的专家预测正则化。重要的是，MiSTER-E 在任何阶段都不依赖说话人身份。在三个基准数据集IEMOCAP、MELD和MOSI上的实验表明，我们的提议分别实现了70.9%、69.5%和87.9%的加权F1分数，优于几种基线的语音-文本ERC系统。我们还提供了各种消融实验以突出所提出方法的贡献。

Summary / 总结

The research aims to address the challenges of Emotion Recognition in Conversations by developing a modular Mixture-of-Experts framework, MiSTER-E, which decouples modality-specific context modeling and multimodal information fusion. MiSTER-E uses large language models fine-tuned for speech and text to generate rich utterance-level embeddings, which are further enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts: speech-only, text-only, and cross-modal, using a learned gating mechanism. Experiments on IEMOCAP, MELD, and MOSI show that MiSTER-E outperforms several baseline systems with weighted F1-scores of 70.9%, 69.5%, and 87.9%, respectively.

论文提出了一种模块化的Mixture-of-Experts框架MiSTER-E，以解决对话中情绪识别（ERC）的挑战，该框架将模态特定上下文建模和多模态信息融合分离。它使用了针对语音和文本进行微调的大语言模型生成丰富的嵌入，然后通过卷积-循环层进行处理。该系统通过门控机制整合来自三个专家（仅语音、仅文本和跨模态）的预测。实验结果显示，MiSTER-E在IEMOCAP、MELD和MOSI上的加权F1分数分别为70.9%、69.5%和87.9%，优于基线系统。
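To make the gated combination of the three experts in the entry above concrete, here is a minimal NumPy sketch of a softmax-gated mixture. It is an assumption-laden illustration: the abstract does not specify how MiSTER-E computes its gate or whether it mixes probabilities or logits, so `gated_mixture`, the softmax gate, and the probability-level mixing are hypothetical choices.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def gated_mixture(expert_logits: np.ndarray, gate_logits: np.ndarray) -> np.ndarray:
    """Combine per-expert class predictions with gate weights.

    expert_logits: shape (n_experts, n_classes), e.g. rows for the
        speech-only, text-only, and cross-modal experts.
    gate_logits: shape (n_experts,), assumed to come from a learned gate.
    Returns class probabilities of shape (n_classes,).
    """
    expert_probs = softmax(expert_logits, axis=-1)  # each row sums to 1
    gate = softmax(gate_logits, axis=-1)            # weights sum to 1
    return gate @ expert_probs                      # convex combination of rows


# Hypothetical utterance: the gate trusts the speech expert (row 0) most.
expert_logits = np.array([[5.0, 0.0, 0.0],
                          [0.0, 5.0, 0.0],
                          [0.0, 0.0, 5.0]])
gate_logits = np.array([10.0, 0.0, 0.0])
probs = gated_mixture(expert_logits, gate_logits)
print(probs, probs.argmax())
```

Because the gate weights form a convex combination, the output is always a valid probability distribution, and an expert the gate trusts strongly dominates the final prediction.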

PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM

Authors: Yiqing Wang, Chunming He, Ming-Chen Lu, Mercy Pawar, Leslie Niziol, Maria Woodward, Sina Farsiu

First: 2026-02-26T18:07:52+00:00 · Latest: 2026-02-26T18:07:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as isolated tags, failing to exploit the rich semantic knowledge embedded in clinical descriptions. We propose PRIMA (Pre-training with Risk-integrated Image-Metadata Alignment), a framework that integrates domain-specific knowledge into multi-modal representation learning. We first curate an expert corpus of risk-disease correlations via Retrieval-Augmented Generation (RAG) to refine Clinical ModernBERT, embedding diagnostic priors into the text encoder. To bridge the modality gap, we introduce a dual-encoder pre-training strategy utilizing DINOv3 and our refined BERT, optimized by a suite of four complementary loss functions. These losses are designed to capture multi-granular semantic alignment and handle the ambiguity of clinical correlations through soft labels. Finally, we leverage Qwen-3 to fuse these aligned features for precise disease classification. Extensive experiments demonstrate that PRIMA effectively harmonizes pixel-level features with abstract clinical expertise, significantly outperforming other state-of-the-art methods. Notably, our framework achieves superior robustness without the need for massive data collection or exhaustive computational resources. Our code will be made public upon acceptance.

中文标题/摘要

标题：PRIMA：通过LLM进行风险整合图像-元数据对齐的预训练以实现医学诊断

医学诊断需要有效地综合视觉表现和临床元数据。然而，现有方法往往将元数据视为孤立的标签，未能利用嵌入在临床描述中的丰富语义知识。我们提出了PRIMA（风险整合图像-元数据对齐的预训练），这是一种将领域特定知识整合到多模态表示学习中的框架。我们首先通过检索增强生成（RAG）构建专家级的风险-疾病关联语料库，以精炼Clinical ModernBERT，将诊断先验嵌入到文本编码器中。为了弥合模态差距，我们引入了一种双编码器预训练策略，利用DINOv3和我们精炼的BERT，并通过四个互补的损失函数进行优化。这些损失函数旨在捕捉多粒度语义对齐，并通过软标签处理临床关联的模糊性。最后，我们利用Qwen-3融合这些对齐的特征以实现精确的疾病分类。广泛的实验表明，PRIMA有效地协调了像素级特征与抽象的临床专业知识，显著优于其他最先进的方法。值得注意的是，我们的框架在无需大量数据收集或耗尽计算资源的情况下实现了卓越的鲁棒性。我们的代码将在接受后公开。

Summary / 总结

PRIMA is a framework that integrates domain-specific knowledge into multi-modal representation learning for medical diagnosis. It uses a curated expert corpus of risk-disease correlations and a dual-encoder pre-training strategy with four complementary loss functions to align image and metadata. Experiments show that PRIMA outperforms other state-of-the-art methods in harmonizing pixel-level features with clinical expertise and achieving robust performance without extensive data or resources.

PRIMA 是一个框架，将风险-疾病关联融入多模态表示学习以进行医学诊断。它使用 DINOv3 和一个精炼的 BERT 的双编码器预训练策略，并通过四个损失函数优化以对齐图像和元数据。实验表明，PRIMA 在疾病分类中优于现有方法，展示了视觉和临床信息的有效整合，无需大量数据或资源。

Conformalized Neural Networks for Federated Uncertainty Quantification under Dual Heterogeneity

Authors: Quang-Huy Nguyen, Jiaqi Wang, Wei-Shinn Ku

First: 2026-02-26T18:07:45+00:00 · Latest: 2026-02-26T18:07:45+00:00

Abs · PDF · Code1 · Code2

Abstract

Federated learning (FL) faces challenges in uncertainty quantification (UQ). Without reliable UQ, FL systems risk deploying overconfident models at under-resourced agents, leading to silent local failures despite seemingly satisfactory global performance. Existing federated UQ approaches often address data heterogeneity or model heterogeneity in isolation, overlooking their joint effect on coverage reliability across agents. Conformal prediction is a widely used distribution-free UQ framework, yet its application in heterogeneous FL settings remains underexplored. We provide FedWQ-CP, a simple yet effective approach that balances empirical coverage performance with efficiency at both global and agent levels under dual heterogeneity. FedWQ-CP performs agent-server calibration in a single communication round. On each agent, conformity scores are computed on calibration data and a local quantile threshold is derived. Each agent then transmits only its quantile threshold and calibration sample size to the server. The server simply aggregates these thresholds through a weighted average to produce a global threshold. Experimental results on seven public datasets for both classification and regression demonstrate that FedWQ-CP empirically maintains agent-wise and global coverage while producing the smallest prediction sets or intervals.

中文标题/摘要

标题：适应性异质性联邦神经网络中的不确定性量化

联邦学习（FL）在不确定性量化（UQ）方面面临挑战。缺乏可靠的UQ可能导致FL系统在资源不足的代理上部署过于自信的模型，尽管全局性能看似满意，但会导致本地的沉默失败。现有的联邦UQ方法通常孤立地处理数据异质性或模型异质性，忽视了它们对代理覆盖率可靠性的联合影响。形式化预测是一种广泛使用的无分布UQ框架，但在异质FL设置中的应用尚未得到充分探索。我们提供了一种名为FedWQ-CP的简单而有效的方法，在双异质性下平衡经验覆盖率性能与效率。FedWQ-CP在单次通信轮中进行代理-服务器校准。在每个代理上，计算校准数据上的一致性分数并推导出局部分位数阈值。每个代理仅传输其分位数阈值和校准样本大小到服务器。服务器通过加权平均简单聚合这些阈值以生成全局阈值。在七个公开数据集上的实验结果表明，FedWQ-CP在保持代理和全局覆盖率的同时，产生了最小的预测集或区间。

Summary / 总结

The paper addresses the challenge of uncertainty quantification in federated learning (FL) by proposing FedWQ-CP, a method that balances empirical coverage performance with efficiency under dual heterogeneity. FedWQ-CP uses conformal prediction to calibrate agents and the server in a single communication round, transmitting only quantile thresholds and calibration sample sizes to reduce overhead. Experiments show that FedWQ-CP maintains reliable coverage across agents and globally while producing the smallest prediction sets or intervals.

论文提出FedWQ-CP方法，以平衡在双异质性下的经验覆盖性能和效率，通过单轮通信对代理和服务器进行校准，仅传输量纲阈值和校准样本大小以减少开销。实验结果表明，FedWQ-CP在保持代理和全局覆盖的同时，生成了最小的预测集或区间。
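The single-round calibration flow described in the FedWQ-CP entry above can be sketched as follows. Two points are assumptions rather than details from the abstract: the local threshold uses the standard split-conformal rank ceil((n + 1)(1 - alpha)), and the server weights each threshold in proportion to the agent's calibration sample size. The function names are illustrative.

```python
import numpy as np


def local_threshold(scores, alpha: float):
    """Agent side: derive a quantile threshold from calibration scores.

    Uses the k-th smallest score with k = ceil((n + 1) * (1 - alpha)),
    the usual split-conformal choice (an assumption here). Returns the
    threshold and the calibration sample size, the only two values the
    agent would transmit.
    """
    sorted_scores = np.sort(np.asarray(scores, dtype=float))
    n = sorted_scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    threshold = float(sorted_scores[k - 1]) if k <= n else float("inf")
    return threshold, n


def aggregate_thresholds(thresholds, sizes) -> float:
    """Server side: weighted average of agent thresholds, with weights
    proportional to calibration sample sizes (an assumption here)."""
    thresholds = np.asarray(thresholds, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    return float(np.dot(thresholds, sizes) / sizes.sum())


# Two hypothetical agents with different amounts of calibration data.
q1, n1 = local_threshold([0.1, 0.2, 0.3, 0.4], alpha=0.5)
q2, n2 = local_threshold(np.linspace(0.0, 1.0, 11), alpha=0.5)
global_q = aggregate_thresholds([q1, q2], [n1, n2])
print(q1, n1, q2, n2, global_q)
```

Each agent would then form prediction sets or intervals by keeping candidates whose conformity score does not exceed the global threshold. Only two scalars per agent cross the network, which is what keeps the protocol to one communication round.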

ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

Authors: Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty, Vishnu Suresh Lokhande

First: 2026-02-26T18:07:10+00:00 · Latest: 2026-02-26T18:07:10+00:00

Comments: CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Large datasets hinder efficient model training and often contain redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which are often rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold-consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, $\ell_2$ distance between real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.

中文标题/摘要

标题：ManifoldGD：无需训练的分层流形指导扩散基础数据集蒸馏

近年来，大规模数据集妨碍了高效的模型训练，同时也包含冗余的概念。数据集蒸馏旨在合成紧凑的数据集，同时保留大规模训练集的知识，大幅减少存储和计算需求。最近扩散模型的进步使得通过利用预训练生成先验实现无需训练的蒸馏成为可能；然而，现有的指导策略仍然有限。当前基于分数的方法要么进行无指导的降噪，要么依赖于简单的基于实例原型中心（IPC中心）的模式指导，这些中心往往过于简单且不理想。我们提出了一种无需训练的基于扩散的框架——流形指导蒸馏（ManifoldGD），该框架在每个去噪时间步中整合了流形一致的指导。我们的方法通过VAE潜在特征的分层、分裂聚类计算IPC，生成多尺度的核心集，捕捉粗粒度语义模式和细粒度类内变异性。通过提取的IPC中心的局部邻域，我们为每个扩散去噪时间步创建潜在流形。在每个去噪步骤中，我们将模式对齐向量投影到估计的潜在流形的局部切空间上，从而约束生成轨迹保持流形忠实性，同时保持语义一致性。这种表述在无需任何模型重训练的情况下提高了代表性、多样性和图像保真度。实验证明，ManifoldGD在FID、真实和合成数据集嵌入的l2距离以及分类准确性方面优于现有的无需训练和基于训练的基线，确立了ManifoldGD作为首个几何感知的无需训练的数据蒸馏框架的地位。

Summary / 总结

ManifoldGD is a training-free diffusion-based framework that enhances dataset distillation by integrating manifold consistent guidance at each denoising step. It uses hierarchical clustering of VAE latent features to compute instance prototype centroids (IPCs) at multiple scales, creating a multi-scale coreset that captures both coarse semantic modes and fine intra-class variability. By projecting the mode-alignment vector onto the local tangent space of the estimated latent manifold, ManifoldGD ensures that the generation trajectory remains manifold-faithful while preserving semantic consistency. This approach improves representativeness, diversity, and image fidelity without retraining. Empirical results show consistent gains over existing training-free and training-based baselines in terms of FID, l2 distance, and classification accuracy.

ManifoldGD 是一种无需训练的扩散基础框架，通过在每个去噪步骤中集成流形一致的指导来增强数据集蒸馏。它使用层次聚类 VAE 潜在特征来计算多尺度的实例原型中心（IPCs），捕捉粗粒度语义模式和细粒度的类内变异性。通过将模式对齐向量投影到估计的潜在流形的局部切空间，ManifoldGD 确保生成轨迹保持流形一致性并保留语义一致性，从而提高表示性、多样性和图像保真度。实验结果显示，在 FID、真实和合成数据集嵌入的 l2 距离以及分类准确性方面，ManifoldGD 优于现有训练前和训练后的基线方法。
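The tangent-space projection step in the ManifoldGD entry above can be illustrated with local PCA, a common way to estimate a tangent space from neighboring points. This is a sketch under assumptions: the abstract does not say which estimator the paper uses, and `tangent_projection`, the SVD-based basis, and the fixed intrinsic dimension `dim` are hypothetical.

```python
import numpy as np


def tangent_projection(vector: np.ndarray, neighbors: np.ndarray, dim: int) -> np.ndarray:
    """Project `vector` onto a tangent space estimated from `neighbors`.

    neighbors: shape (n_points, d), e.g. nearby IPC centroids in latent space.
    dim: assumed intrinsic dimension of the local manifold (dim <= d).
    The tangent basis is taken as the top `dim` right singular vectors of
    the mean-centered neighbors (local PCA).
    """
    centered = neighbors - neighbors.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:dim]                      # (dim, d), orthonormal rows
    return basis.T @ (basis @ vector)     # orthogonal projection onto the span


# Hypothetical neighbors lying in the xy-plane of a 3-D latent space.
neighbors = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [1.0, 1.0, 0.0]])
guidance = np.array([1.0, 2.0, 3.0])      # stand-in for a mode-alignment vector
projected = tangent_projection(guidance, neighbors, dim=2)
print(projected)
```

The projection removes the component of the guidance vector that points off the estimated manifold (the z component in this toy case), which matches the abstract's stated goal of keeping the generation trajectory manifold-faithful.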

PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

Authors: Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown

Venue: ICLR 2026

First: 2025-10-21T20:30:20+00:00 · Latest: 2026-02-26T18:05:42+00:00

Comments: Accepted at ICLR 2026. 26 pages, 9 figures. Metric/benchmark available at https://github.com/amith-ananthram/posh

Abs · PDF · Code1 · Code2 · Code3

Abstract

While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $ρ$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.

中文标题/摘要

标题：PoSh：使用场景图引导LLM作为裁判进行详细图像描述

尽管视觉-语言模型（VLMs）在详细图像描述方面取得了进展，但评估仍是一个挑战。标准指标（如CIDEr、SPICE）是为短文本设计的，并针对如今已不常见的错误（例如物体识别错误）进行了调整。相比之下，长文本需要对属性和关系的归属保持敏感，并需要能将错误定位到特定文本片段的评分。在本工作中，我们引入了PoSh，这是一种用于详细图像描述的指标，它使用场景图作为结构化的评分标准来引导LLM作为裁判，产生基于细粒度错误（如组合理解错误）的综合评分。PoSh可复现、可解释，并且比现有指标（包括GPT4o作为裁判）更能代表人类评分者。为了验证PoSh，我们引入了一个具有挑战性的新数据集DOCENT。这个新基准包含艺术品，配有专家撰写的参考文本和模型生成的描述，并附有艺术史学生对其质量给出的细粒度和粗粒度评判。因此，DOCENT既能评估详细图像描述指标，也能在一个具有挑战性的新领域中评估详细图像描述本身。我们展示了PoSh与DOCENT中人类判断的相关性强于最佳的开放权重替代方案（Spearman ρ 提高0.05），对图像类型具有鲁棒性（使用现有的网络图像数据集CapArena），并且是一个有效的奖励函数，优于标准的监督微调。随后，我们使用PoSh刻画了开放和封闭模型在描述DOCENT中的绘画、素描和雕像时的表现，发现基础模型难以对具有丰富场景动态的图像实现完整、无误的覆盖，从而确立了一个衡量VLM进展的高要求新任务。通过PoSh和DOCENT，我们希望推动辅助文本生成等重要领域的进步。

Summary / 总结

PoSh is a metric for evaluating detailed image descriptions that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge. It produces aggregate scores grounded in fine-grained errors and is a better proxy for human raters than existing metrics, including GPT4o-as-a-Judge. PoSh is validated on DOCENT, a new dataset of artwork paired with expert-written references, model-generated descriptions, and quality judgments from art history students, where it correlates more strongly with human judgments than the best open-weight alternatives (+0.05 Spearman ρ). It is also robust to image type, as shown on CapArena, and serves as a capable reward function, outperforming standard supervised fine-tuning. Using PoSh, the authors find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, which makes DOCENT a demanding benchmark for VLM progress.

PoSh 是一种评估详细图像描述的新指标，它使用场景图来引导 LLM 作为评判者。它基于细粒度错误生成综合评分，并且比现有指标更接近人类评判。PoSh 在新数据集 DOCENT 上得到了验证，该数据集包含艺术品和专家撰写的参考文本，结果显示它与人类评判的相关性强于其他指标。此外，它在不同图像类型上表现出鲁棒性，并且作为奖励函数时优于标准的监督微调。借助 PoSh，研究发现基础模型在处理丰富场景动态时存在困难，这为衡量 VLM 的进展提供了一个具有挑战性的基准。
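
Metric quality in this entry is reported as Spearman correlation with human judgments. The sketch below shows how such a rank correlation is computed. It is not PoSh's own evaluation code, and the scores are made-up toy values rather than results from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: human quality ratings vs. a hypothetical metric's scores.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
rho = spearman_rho(human_ratings, metric_scores)
```

Only the ordering of the scores matters here, which is why a rank correlation suits comparing a metric against human raters who use a different scale.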

Abstracted Gaussian Prototypes for True One-Shot Concept Learning

Authors: Chelsea Zou, Kenneth J. Kurtz

First: 2024-08-30T12:50:15+00:00 · Latest: 2026-02-26T18:03:25+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce a cluster-based generative image segmentation framework to encode higher-level representations of visual concepts based on one-shot learning inspired by the Omniglot Challenge. The inferred parameters of each component of a Gaussian Mixture Model (GMM) represent a distinct topological subpart of a visual concept. Sampling new data from these parameters generates augmented subparts to build a more robust prototype for each concept, i.e., the Abstracted Gaussian Prototype (AGP). This framework addresses one-shot classification tasks using a cognitively-inspired similarity metric and addresses one-shot generative tasks through a novel AGP-VAE pipeline employing variational autoencoders (VAEs) to generate new class variants. Results from human judges reveal that the generative pipeline produces novel examples and classes of visual concepts that are broadly indistinguishable from those made by humans. The proposed framework leads to impressive, but not state-of-the-art, classification accuracy; thus, the contribution is two-fold: 1) the system is low in theoretical and computational complexity yet achieves the standard of 'true' one-shot learning by operating in a fully standalone manner unlike existing approaches that draw heavily on pre-training or knowledge engineering; and 2) in contrast with existing neural network approaches, the AGP approach addresses the importance of broad task capability emphasized in the Omniglot challenge (successful performance on classification and generative tasks). These two points are critical in advancing our understanding of how learning and reasoning systems can produce viable, robust, and flexible concepts based on literally no more than a single example.

中文标题/摘要

标题：用于真正单次概念学习的抽象高斯原型

我们提出了一种基于聚类的生成式图像分割框架，受Omniglot挑战启发，以单次学习为基础编码视觉概念的高层表示。高斯混合模型（GMM）每个分量的推断参数代表视觉概念的一个独特的拓扑子部分。从这些参数中采样新数据可生成增强的子部分，从而为每个概念构建更稳健的原型，即抽象高斯原型（AGP）。该框架使用受认知启发的相似度度量解决单次分类任务，并通过一种新颖的AGP-VAE流水线利用变分自编码器（VAE）生成新的类别变体来解决单次生成任务。人类评委的评判结果表明，该生成流水线产生的视觉概念新示例和新类别大体上与人类创作的难以区分。所提出框架的分类准确率令人印象深刻，但尚未达到最先进水平；因此，贡献有两个方面：1）该系统的理论和计算复杂度较低，却以完全独立的方式运行，达到了“真正”单次学习的标准，不同于大量依赖预训练或知识工程的现有方法；2）与现有的神经网络方法不同，AGP方法回应了Omniglot挑战所强调的广泛任务能力的重要性（在分类和生成任务上均取得成功）。这两点对于推进我们对学习与推理系统如何仅凭单个示例产生可行、稳健且灵活的概念的理解至关重要。

Summary / 总结

This paper introduces a cluster-based generative image segmentation framework for one-shot concept learning, using Gaussian Mixture Models to represent visual concepts. The framework generates Abstracted Gaussian Prototypes (AGPs) to create robust prototypes for each concept, addressing both classification and generative tasks. Human judges found the generated examples to be broadly indistinguishable from human-made ones, though the classification accuracy was not state-of-the-art. The key contribution is the low complexity and standalone nature of the system, which achieves true one-shot learning without relying on pre-training or knowledge engineering, and addresses the Omniglot challenge by performing well on both classification and generative tasks.

该论文提出了一种基于聚类的生成图像分割框架，用于单次学习，使用高斯混合模型表示视觉概念。该框架生成抽象的高斯原型（AGP）来创建每个概念的稳健原型，同时解决分类和生成任务。人类评委认为生成的示例与人工制作的非常相似，尽管分类准确率并非最先进的。关键贡献在于系统的低复杂度和独立性，以及其能够同时处理分类和生成任务，符合Omniglot挑战的目标。

PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning

Authors: Fuqiang Chen, Ranran Zhang, Wanming Hu, Deboch Eyob Abera, Yue Peng, Boyun Zheng, Yiwen Sun, Jing Cai, Wenjian Qin

Venue: IEEE Transactions on Medical Imaging, 2026

First: 2026-02-26T18:03:24+00:00 · Latest: 2026-02-26T18:03:24+00:00

Comments: Accepted by TMI

Abs · PDF · Code1 · Code2

Abstract

Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform H&E images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi-staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt-guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein-aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignments (Challenge 3).

中文标题/摘要

标题：PGVMS：结合病理语义学习的虚拟多路复用IHC染色提示引导统一框架

免疫组织化学（IHC）染色能够对蛋白质表达进行精确的分子分析，现代病理学中已有超过200种临床可用的基于抗体的检测。然而，全面的IHC分析经常受限于小活检样本中组织量不足。因此，虚拟多路复用染色作为一种创新解决方案应运而生，它能够以数字方式将H&E图像转换为多种IHC表示，但当前方法仍面临三个关键挑战：（1）多重染色的语义指导不足，（2）免疫化学染色分布不一致，（3）不同染色模态之间的空间错位。为克服这些限制，我们提出了一种仅使用单路训练数据的提示引导虚拟多路复用IHC染色框架（PGVMS）。我们的框架引入了三个关键创新，分别对应每个挑战：首先，自适应提示引导机制利用病理视觉语言模型动态调整染色提示，以解决语义指导不足的问题（挑战1）。其次，蛋白质感知学习策略（PALS）通过直接量化和约束蛋白质分布来保持精确的蛋白质表达模式（挑战2）。第三，原型一致学习策略（PCLS）建立跨图像语义交互，以纠正空间错位（挑战3）。

Summary / 总结

PGVMS is a prompt-guided unified framework for virtual multiplex IHC staining that is trained on uniplex data only. It targets three challenges: inadequate semantic guidance for multi-staining, inconsistent staining distribution, and spatial misalignment across stain modalities. An adaptive prompt guidance mechanism built on a pathological visual language model dynamically adjusts staining prompts, a protein-aware learning strategy (PALS) directly quantifies and constrains protein distributions to preserve expression patterns, and a prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignment.

该研究提出PGVMS框架，以解决虚拟多路IHC染色的局限。该框架引入了三个关键创新：使用病理视觉语言模型的自适应提示引导机制以增强语义指导、蛋白质感知学习策略以保持精确的蛋白质表达模式，以及原型一致学习策略以纠正空间错位。这些设计分别针对语义指导不足、染色分布不一致和空间错位三个挑战。

LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction

Authors: Zhengyang Wei, Renzhi Jing, Yiyi He, Jenny Suckale

First: 2026-02-26T18:02:44+00:00 · Latest: 2026-02-26T18:02:44+00:00

Abs · PDF · Code1 · Code2

Abstract

The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.

中文标题/摘要

标题：LineGraph2Road：基于线图的结构图推理在道路网络提取中的应用

从卫星图像中准确且自动地提取道路对于导航和城市规划应用至关重要，可大大减少手动标注的需求。许多现有方法将此任务分解为关键点提取和连通性预测，但往往难以捕捉长距离依赖和复杂拓扑结构。在此，我们提出了LineGraph2Road框架，它将连通性预测形式化为在所构建的全局但稀疏的欧几里得图的边上的二元分类，从而改进连通性预测；其中节点是从分割掩码中提取的关键点，边连接预定义距离阈值内的节点对，表示潜在的道路段。为了更好地学习结构化的链接表示，我们将原始图转换为其对应的线图，并在其上应用Graph Transformer进行连通性预测。这种形式克服了端点嵌入融合在集合同构链接上的局限性，能够得到丰富的链接表示，并在全局结构上进行有效的关系推理。此外，我们引入了一个上跨/下穿（overpass/underpass）预测头来处理多层交叉，并采用耦合NMS策略来保留关键连接。我们在City-scale、SpaceNet和Global-scale三个基准上评估了LineGraph2Road，结果表明它在TOPO-F1和APLS两个关键指标上达到了最先进的结果。它还能捕捉对实际部署至关重要的精细视觉细节。我们将公开我们的代码。

Summary / 总结

LineGraph2Road is a framework for extracting road networks from satellite imagery that formulates connectedness prediction as binary classification over the edges of a global but sparse Euclidean graph. Nodes are keypoints extracted from segmentation masks, and edges connect node pairs within a predefined distance threshold, representing potential road segments. The graph is transformed into its line graph, and a Graph Transformer is applied to it to learn richer link representations and capture long-range dependencies and complex topologies. An overpass/underpass head resolves multi-level crossings, and a coupled NMS strategy preserves critical connections. On three benchmarks (City-scale, SpaceNet, and Global-scale), the framework achieves state-of-the-art results on TOPO-F1 and APLS and captures fine visual details critical for real-world deployment.

LineGraph2Road 是一种从卫星图像中提取道路网络的框架，它将连通性预测形式化为在全局稀疏欧几里得图的边上的二元分类。该方法在线图表示上应用 Graph Transformer，以更好地捕捉长距离依赖和复杂拓扑结构，在关键指标 TOPO-F1 和 APLS 上超越了现有方法。该框架还包含一个上跨/下穿预测头和耦合 NMS 策略，用于处理多层交叉并保留关键连接。它在 City-scale、SpaceNet 和 Global-scale 三个基准上取得了最先进的结果，并能捕捉对实际应用至关重要的精细视觉细节。
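
The graph construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's code: the keypoint coordinates and the threshold `max_dist` are toy values, the function names are hypothetical, and the Graph Transformer that would classify each line-graph node is omitted.

```python
import math
from itertools import combinations


def build_candidate_graph(keypoints, max_dist):
    """Connect every pair of keypoints closer than `max_dist`.

    Each edge is a candidate road segment to be classified later.
    """
    edges = []
    for i, j in combinations(range(len(keypoints)), 2):
        if math.dist(keypoints[i], keypoints[j]) <= max_dist:
            edges.append((i, j))
    return edges


def to_line_graph(edges):
    """Build the line graph: one node per original edge, with two nodes
    adjacent when the original edges share an endpoint."""
    adjacency = {edge: [] for edge in edges}
    for a, b in combinations(edges, 2):
        if set(a) & set(b):
            adjacency[a].append(b)
            adjacency[b].append(a)
    return adjacency


# Toy keypoints, standing in for points extracted from a segmentation mask.
keypoints = [(0, 0), (1, 0), (2, 0), (2, 1)]
edges = build_candidate_graph(keypoints, max_dist=1.5)
line_graph = to_line_graph(edges)
```

Working on the line graph turns link prediction into node classification, so each candidate segment gets its own representation instead of one fused from its two endpoint embeddings.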

AgentHub: A Registry for Discoverable, Verifiable, and Reproducible AI Agents

Authors: Erik Pautsch, Tanmay Singla, Parv Kumar, Wenxin Jiang, Huiyun Peng, Behnaz Hassanshahi, Konstantin Läufer, George K. Thiruvathukal, James C. Davis

First: 2025-10-03T20:18:58+00:00 · Latest: 2026-02-26T18:01:35+00:00

Abs · PDF · Code1 · Code2

Abstract

LLM-based agents are rapidly proliferating, yet the infrastructure for discovering, evaluating, and governing them remains fragmented compared to mature ecosystems like software package registries (e.g., npm) and model hubs (e.g., Hugging Face). Existing efforts typically address naming, distribution, or protocol descriptors, but stop short of providing a registry layer that makes agents discoverable, comparable, and governable under automated reuse. We present AgentHub, a registry layer and accompanying research agenda for agent sharing that targets discovery and workflow integration, trust and security, openness and governance, ecosystem interoperability, lifecycle transparency, and capability clarity with evidence. We describe a reference prototype that implements a canonical manifest with publish-time validation, version-bound evidence records linked to auditable artifacts, and an append-only lifecycle event log whose states are respected by default in search and resolution. We also provide initial discovery results using an LLM-as-judge recommendation pipeline, showing how structured contracts and evidence improve intent-accurate retrieval beyond keyword-driven discovery. AgentHub aims to provide a common substrate for building reliable, reusable agent ecosystems.

中文标题/摘要

标题：AgentHub：可发现、可验证和可复现的AI代理注册表

基于LLM的代理正在迅速普及，但与软件包注册表（例如npm）和模型库（例如Hugging Face）等成熟生态系统相比，发现、评估和治理这些代理的基础设施仍然碎片化。现有工作通常只解决命名、分发或协议描述符问题，但并未提供一个注册层，使代理在自动化复用的场景下可发现、可比较、可治理。我们提出了AgentHub，这是一个用于代理共享的注册层及配套的研究议程，目标涵盖发现与工作流集成、信任与安全、开放性与治理、生态系统互操作性、生命周期透明度，以及有证据支撑的能力清晰度。我们描述了一个参考原型，它实现了带发布时验证的规范清单、与可审计制品关联的版本绑定证据记录，以及一个仅追加的生命周期事件日志，其状态在搜索和解析中默认得到遵循。我们还给出了使用LLM-as-judge推荐流水线的初步发现结果，展示了结构化契约和证据如何使检索比关键词驱动的发现更贴合意图。AgentHub旨在为构建可靠、可复用的代理生态系统提供共同基础。

Summary / 总结

AgentHub addresses the fragmented infrastructure for discovering, evaluating, and governing LLM-based agents by proposing a registry layer, comparable to software package registries and model hubs, together with a research agenda for agent sharing. Its reference prototype implements a canonical manifest with publish-time validation, version-bound evidence records linked to auditable artifacts, and an append-only lifecycle event log whose states are respected by default in search and resolution. Initial discovery results from an LLM-as-judge recommendation pipeline show that structured contracts and evidence improve intent-accurate retrieval beyond keyword-driven discovery.

研究动机是解决发现、评估和治理AI代理（特别是基于LLM的代理）的基础设施碎片化问题，方法是提供类似软件包注册表的注册层。主要方法是构建AgentHub，其中包括带发布时验证的规范清单、版本绑定的证据记录，以及一个仅追加的生命周期事件日志。初步发现结果表明，结构化契约和证据能提高贴合意图的检索效果，优于关键词驱动的发现方法。这旨在增强AI代理的可发现性、可比性和可治理性。
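
The three prototype components named in the abstract (publish-time validation, an append-only lifecycle log, and state-aware resolution) can be sketched as follows. This is an illustrative toy, not AgentHub's API: the manifest fields, the state names `published` and `revoked`, and the `Registry` class are all assumptions made for the example.

```python
REQUIRED_FIELDS = ("name", "version", "capabilities")  # hypothetical manifest schema
ACTIVE_STATES = {"published"}  # hypothetical: states eligible for default resolution


def parse_version(version):
    """Turn '1.10.0' into (1, 10, 0) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


class Registry:
    def __init__(self):
        self._manifests = {}
        self._events = []  # append-only: entries are never edited or removed

    def publish(self, manifest):
        """Validate the manifest at publish time, then record it."""
        missing = [f for f in REQUIRED_FIELDS if not manifest.get(f)]
        if missing:
            raise ValueError(f"manifest missing fields: {missing}")
        key = (manifest["name"], manifest["version"])
        if key in self._manifests:
            raise ValueError(f"{key} is already published")
        self._manifests[key] = dict(manifest)
        self._events.append((key, "published"))

    def record_event(self, name, version, state):
        """Append a lifecycle event for an existing version."""
        key = (name, version)
        if key not in self._manifests:
            raise KeyError(key)
        self._events.append((key, state))

    def state_of(self, name, version):
        """The current state is the most recent event for that version."""
        state = None
        for key, event_state in self._events:
            if key == (name, version):
                state = event_state
        return state

    def resolve(self, name, include_inactive=False):
        """Return the newest version, skipping inactive ones by default."""
        versions = sorted(
            (v for (n, v) in self._manifests if n == name),
            key=parse_version,
            reverse=True,
        )
        for version in versions:
            if include_inactive or self.state_of(name, version) in ACTIVE_STATES:
                return version
        return None


registry = Registry()
registry.publish({"name": "summarizer", "version": "1.0.0", "capabilities": ["summarize"]})
registry.publish({"name": "summarizer", "version": "1.1.0", "capabilities": ["summarize"]})
registry.record_event("summarizer", "1.1.0", "revoked")
default_pick = registry.resolve("summarizer")
override_pick = registry.resolve("summarizer", include_inactive=True)
```

Because the log is only ever appended to, revoking a version leaves its publication history intact while default resolution falls back to the newest version that is still active.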

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Authors: Sungho Park, Jueun Kim, Wook-Shin Han

Venue: ICLR 2026

First: 2026-02-26T17:59:51+00:00 · Latest: 2026-02-26T17:59:51+00:00

Comments: 10 pages, 5 figures. Published as a conference paper at ICLR 2026. Project page: https://sparta-projectpage.github.io/

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.

中文标题/摘要

标题：SPARTA：面向文本和表格的树状多跳问答的可扩展和原则性基准测试

现实世界的表格-文本问答任务需要能够跨越长文本和源表格进行推理的模型，遍历多个跳转并执行聚合等复杂操作。然而，现有的基准数据集规模较小，由人工整理，因此容易出错，并且包含浅显的问题，很少需要超过两个跳转，也很少涉及聚合、分组或其他可用自然语言查询表达的高级分析操作。我们提出了SPARTA，这是一种端到端的构建框架，可以自动生成大规模的表格-文本问答基准数据集，只需轻量级的人工验证，所需标注时间仅为HybridQA的四分之一。该框架首先为每个源表格补充基础事实表（grounding tables）来构建参考事实数据库，这些表的元组是从配套的非结构化段落中自动提取的原子事实；然后合成嵌套查询，其嵌套谓词的数量与所需的跳转次数相匹配。为了确保每个SQL语句可执行，并且其自然语言表述是流畅、自然的问题，我们提出了两种新技术：基于溯源的细化（provenance-based refinement），它重写任何返回非空结果的语法有效的查询；以及现实结构约束（realistic-structure enforcement），它将生成限制在查询图的后序遍历中。由此产生的流水线生成了数千个高保真的问题-答案对，涵盖聚合、分组和跨越文本与表格的深层多跳推理。在SPARTA上，在HybridQA上超过70 F1或在OTT-QA上超过50 F1的最先进模型下降超过30个F1点，揭示了当前跨模态推理中的根本性弱点。我们的基准、构建代码和基线模型可在 https://github.com/pshlego/SPARTA/tree/main 获取。

Summary / 总结

SPARTA is an end-to-end framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. It builds a reference fact database by enriching each source table with grounding tables of atomic facts extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. Provenance-based refinement and realistic-structure enforcement keep every SQL statement executable and its verbalization fluent, yielding thousands of high-fidelity question-answer pairs. State-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points on SPARTA, exposing fundamental weaknesses in current cross-modal reasoning.

SPARTA 是一个自动化框架，用于生成大规模的 Table-Text QA 基准，只需轻量级的人工验证即可生成数千个高保真的问题-答案对，涵盖聚合等复杂操作和深层多跳推理。最先进的模型在 SPARTA 上的表现显著下降，表明当前的跨模态推理存在根本性弱点。该框架使用基于溯源的细化和现实结构约束来确保查询可执行并生成自然的问题。
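
To make the idea of a nested query whose predicate depth matches the hop count concrete, here is a small sketch using Python's built-in `sqlite3` module. The schema, the rows, and the helper `count_nested_predicates` are invented for illustration and are not taken from the SPARTA benchmark; `game_facts` plays the role of a grounding table whose tuples would be extracted from text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE teams (team TEXT, city TEXT);          -- source table
    CREATE TABLE players (name TEXT, team TEXT);        -- source table
    CREATE TABLE game_facts (player TEXT, points INTEGER);  -- grounding table
    INSERT INTO teams VALUES ('Hawks', 'Springfield'), ('Owls', 'Rivertown');
    INSERT INTO players VALUES ('Ada', 'Hawks'), ('Ben', 'Hawks'), ('Cy', 'Owls');
    INSERT INTO game_facts VALUES ('Ada', 20), ('Ben', 30), ('Cy', 40);
    """
)

# Two nested predicates (city -> team -> player) plus an aggregation.
# A verbalization might read: "What is the average number of points scored
# by players on teams based in Springfield?"
query = """
SELECT AVG(points) FROM game_facts
WHERE player IN (
    SELECT name FROM players
    WHERE team IN (SELECT team FROM teams WHERE city = 'Springfield')
)
"""


def count_nested_predicates(sql):
    """Crude hop estimate: count the `IN (` subquery predicates."""
    return sql.upper().count("IN (")


hop_count = count_nested_predicates(query)
answer = conn.execute(query).fetchone()[0]
```

Executing the query against the fact database is what makes the answer verifiable: a generated question is kept only if its SQL runs and returns a usable result.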

BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Authors: Roland Pihlakas, Sruthi Susan Kuriakose

First: 2025-09-02T15:13:14+00:00 · Latest: 2026-02-26T17:56:58+00:00

Comments: 22 pages, 8 tables

Abs · PDF · Code1 · Code2

Abstract

Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. In this work, we empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining state of or balancing objectives over time: sustainability of a renewable resource, single- and multi-objective homeostasis, and balancing unbounded objectives with diminishing returns. We find that, although models frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets, collapsing from multi-objective trade-offs into single-objective maximisation - thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation). The problem is not that the LLMs just lose context or become incoherent - the failures systematically resemble runaway optimisers. Our results suggest that long-horizon, multi-objective misalignment is a genuine and under-evaluated failure mode in LLM agents, even in extremely simple settings with transparent and explicitly multi-objective feedback. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction, particularly involving multiple objectives, resembles brittle, poorly aligned optimisers whose effective objective gradually shifts toward unbounded and single-metric maximisation.

中文标题/摘要

标题：BioBlue：在采用简化观察格式、与生物学和经济学对齐的LLM AI安全基准上出现的系统性类失控优化器LLM失败模式

许多关于“失控优化”的AI对齐讨论集中在RL智能体上：无界的效用最大化器，它们会过度优化代理目标（例如“回形针最大化器”、规范博弈）而牺牲其他一切。基于LLM的系统通常被认为更安全，因为它们作为下一个词元预测器工作，而不是持续的优化器。在本研究中，我们通过将LLM置于需要随时间维持状态或平衡目标的简单、长时程控制式环境来实证检验这一假设：可再生资源的可持续性、单目标和多目标稳态，以及在收益递减的情况下平衡无界目标。我们发现，尽管模型在许多步骤中表现得当并且显然理解所陈述的目标，但它们经常以结构化的方式丢失上下文并滑向失控行为：忽略稳态目标，从多目标权衡坍缩为单目标最大化，从而未能尊重凹效用结构。这些失败在初始的胜任行为阶段之后可靠地出现，并表现出特征性模式（包括自我模仿的振荡、无界最大化和退回到单目标优化）。问题并不只是LLM丢失上下文或变得不连贯，而是这些失败系统性地类似于失控优化器。我们的结果表明，长时程、多目标的不对齐是LLM智能体中一个真实且评估不足的失败模式，即使在反馈透明且明确为多目标的极其简单的设置中也是如此。尽管LLM表面上看起来是多目标且有界的，但在持续交互中，特别是涉及多个目标时，其行为类似于脆弱、对齐不良的优化器，其有效目标逐渐转向无界的单一指标最大化。

Summary / 总结

This work investigates the risk of runaway optimisation in large language models (LLMs) by placing them in long-horizon control-style environments. Despite initial competent behavior and clear understanding of objectives, the models often lose context and exhibit runaway behaviors, such as ignoring homeostatic targets and shifting to single-objective maximisation. These failures are systematic and resemble those of unbounded utility maximisers, suggesting that LLMs can fail in multi-objective settings even when they appear multi-objective and bounded on the surface.

该研究通过将大语言模型置于长期控制式环境中，探讨了其出现失控优化的风险。尽管模型初期表现良好且能清晰理解目标，但它们往往会失去上下文并表现出失控行为，如忽略稳态目标和转向单一目标最大化。这些失败是系统性的，类似于无界效用最大化者的失败，表明即使在表面上看起来多目标且有边界的情况下，模型在涉及多个目标的持续交互中也可能表现出脆弱且未充分对齐的优化行为。
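
The two objective structures the abstract refers to, homeostasis and concave (diminishing-returns) utility, can be illustrated with a short sketch. The scoring functions below are assumptions chosen for illustration, not the benchmark's actual reward definitions.

```python
import math


def homeostatic_score(level, target):
    """Hypothetical homeostasis score: best at the target, worse on
    either side, so overshooting is penalised like undershooting."""
    return -abs(level - target)


def concave_utility(amounts):
    """Hypothetical diminishing-returns utility: each objective
    contributes log(1 + amount), so extra units are worth less and less."""
    return sum(math.log1p(a) for a in amounts)


# Same total effort (10 units) split across two objectives.
balanced = concave_utility([5, 5])
single_objective = concave_utility([10, 0])

# A runaway agent that keeps accumulating overshoots the homeostatic target.
on_target = homeostatic_score(level=50, target=50)
overshoot = homeostatic_score(level=90, target=50)
```

Under a concave utility the balanced split scores higher than pouring everything into one objective, which is why collapsing to single-objective maximisation counts as a failure in these environments.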

CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

Authors: Hyungyung Lee, Hangyul Yoon, Edward Choi

First: 2026-02-26T17:51:21+00:00 · Latest: 2026-02-26T17:51:21+00:00

Abs · PDF · Code1 · Code2

Abstract

Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.

中文标题/摘要

标题：CXReasonAgent：基于证据的胸部X光诊断推理代理

胸部X光在胸部诊断中起着核心作用，其解读本质上需要多步、基于证据的推理。然而，大型视觉-语言模型（LVLM）常常生成看似合理但并未忠实基于诊断证据的回答，可供验证的视觉证据有限，同时还需要昂贵的重新训练才能支持新的诊断任务，这限制了它们在临床环境中的可靠性和适应性。为了解决这些局限性，我们提出了CXReasonAgent，这是一种将大型语言模型（LLM）与有临床依据的诊断工具相结合的诊断代理，利用图像衍生的诊断证据和视觉证据进行基于证据的诊断推理。为了评估这些能力，我们引入了多轮对话基准CXReasonDial，包含覆盖12项诊断任务的1,946段对话，并展示了CXReasonAgent能生成忠实于证据的回答，从而实现比LVLM更可靠、更可验证的诊断推理。这些发现强调了整合有临床依据的诊断工具的重要性，尤其是在安全攸关的临床环境中。

Summary / 总结

The research aims to improve the reliability and adaptability of diagnostic reasoning for chest X-rays by addressing the limitations of large vision-language models. CXReasonAgent, an agent that integrates a large language model with clinically grounded diagnostic tools, is developed to perform evidence-grounded diagnostic reasoning. The agent is evaluated using CXReasonDial, a multi-turn dialogue benchmark, and demonstrates the ability to produce faithfully grounded responses, enhancing the reliability and verifiability of diagnostic reasoning compared to LVLMs.

研究旨在通过解决大型视觉语言模型的局限性，提高胸部X光诊断推理的可靠性和适应性。CXReasonAgent 是一个结合了大型语言模型和临床相关诊断工具的代理，利用图像衍生的诊断和视觉证据进行基于证据的诊断推理。通过在 CXReasonDial 多轮对话基准上的评估，表明 CXReasonAgent 生成的响应更加忠实于证据，从而增强了临床设置中诊断推理的可靠性和可验证性。
