arXiv 论文速递

2026-03-12 03:46
Snapshot: 20260312_0346
CREATE: Testing LLMs for Associative Creativity
Authors: Manya Wadhwa, Tiasa Singha Roy, Harvey Lederman, Junyi Jessy Li, Greg Durrett
First: 2026-03-10T17:58:44+00:00 · Latest: 2026-03-10T17:58:44+00:00
Abstract
A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.
中文标题/摘要
标题:CREATE:测试LLMs的联想创造力
创造力的一个关键组成部分是联想推理:能够在概念之间建立新颖而有意义的联系的能力。我们引入了CREATE,这是一个旨在评估模型进行创造性联想推理能力的基准。CREATE 要求模型生成连接模型参数知识中概念的路径集。路径应具有高度的特异性(概念连接的独特性和接近性)和高度的多样性(与其他路径的差异性),并且如果模型产生更多的强大且多样化的路径集,其得分更高。此任务与真实的创造力任务(如假设生成)的需求相似,包括一个极其庞大的搜索空间,但能够收集一个具有客观答案评分的大规模基准。对前沿模型的评估表明,最强的模型在创造性效用方面高于其他模型,由于答案的高多样性以及搜索的复杂性,使得基准饱和难以实现。此外,我们的结果表明,在我们的任务中,思考模型并不总是更有效,即使具有高标记预算。最近的创造性提示方法仅提供了一些但有限的额外改进。CREATE 为开发新方法以提高模型的联想创造力能力提供了一个试验场。
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning
Authors: Jannis Becktepe, Julian Dierkes, Carolin Benjamins, Aditya Mohan, David Salinas, Raghu Rajan, Frank Hutter, Holger Hoos, Marius Lindauer, Theresa Eimer
Venue: Journal of Data-centric Machine Learning Research; 2026
First: 2024-09-27T15:22:28+00:00 · Latest: 2026-03-10T17:58:13+00:00
Abstract
Hyperparameters are a critical factor in reliably training well-performing reinforcement learning (RL) agents. Unfortunately, developing and evaluating automated approaches for tuning such hyperparameters is both costly and time-consuming. As a result, such approaches are often only evaluated on a single domain or algorithm, making comparisons difficult and limiting insights into their generalizability. We propose ARLBench, a benchmark for hyperparameter optimization (HPO) in RL that allows comparisons of diverse HPO approaches while being highly efficient in evaluation. To enable research into HPO in RL, even in settings with low compute resources, we select a representative subset of HPO tasks spanning a variety of algorithm and environment combinations. This selection allows for generating a performance profile of an automated RL (AutoRL) method using only a fraction of the compute previously necessary, enabling a broader range of researchers to work on HPO in RL. With the extensive and large-scale dataset on hyperparameter landscapes that our selection is based on, ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL. Both the benchmark and the dataset are available at https://github.com/automl/arlbench.
中文标题/摘要
标题:ARLBench:面向强化学习超参数优化的灵活高效基准测试
超参数是可靠训练高性能强化学习(RL)代理的关键因素。不幸的是,开发和评估用于调整这些超参数的自动化方法既昂贵又耗时。因此,这些方法通常仅在单一领域或算法上进行评估,这使得比较变得困难,并限制了对其普适性的见解。我们提出了ARLBench,这是一种用于强化学习中超参数优化(HPO)的基准测试,它允许比较各种HPO方法,同时在评估方面非常高效。为了在低计算资源设置中进行强化学习中的HPO研究,我们选择了一个代表性的HPO任务子集,涵盖了多种算法和环境组合。这一选择使得仅使用以前所需计算资源的一小部分即可生成自动化RL(AutoRL)方法的性能概况,从而让更广泛的研究人员能够从事HPO研究。基于我们选择的广泛且大规模的超参数景观数据集,ARLBench是一个高效、灵活且面向未来的AutoRL研究基础。基准测试和数据集可在https://github.com/automl/arlbench获取。
Summary / 总结
ARLBench is a benchmark for evaluating hyperparameter optimization (HPO) methods in reinforcement learning (RL) flexibly and efficiently. It selects a representative subset of HPO tasks spanning a variety of RL algorithm and environment combinations, so that a performance profile of an automated RL (AutoRL) method can be generated with only a fraction of the compute previously required. The selection is based on a large-scale dataset of hyperparameter landscapes, and both the benchmark and the dataset are available at https://github.com/automl/arlbench.
ARLBench 旨在通过提供一个灵活且高效的基准来促进强化学习(RL)中超参数优化(HPO)方法的评估。它选择了跨越多种RL算法和环境的代表性HPO任务子集,以实现不同HPO方法的比较。这使得研究人员能够使用较少的计算资源生成自动化RL方法的性能概况,从而让HPO研究更加普及。基准和数据集已公开,支持该领域的更广泛研究。
Understanding the Use of a Large Language Model-Powered Guide to Make Virtual Reality Accessible for Blind and Low Vision People
Authors: Jazmin Collins, Sharon Y Lin, Tianqi Liu, Andrea Stevenson Won, Shiri Azenkot
First: 2026-03-10T17:56:57+00:00 · Latest: 2026-03-10T17:56:57+00:00
Comments: 16 pages, 5 figures, 3 tables, Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), April 13-17, 2026, Barcelona, Spain. ACM
Abstract
As social virtual reality (VR) grows more popular, addressing accessibility for blind and low vision (BLV) users is increasingly critical. Researchers have proposed an AI "sighted guide" to help users navigate VR and answer their questions, but it has not been studied with users. To address this gap, we developed a large language model (LLM)-powered guide and studied its use with 16 BLV participants in virtual environments with confederates posing as other users. We found that when alone, participants treated the guide as a tool, but treated it companionably around others, giving it nicknames, rationalizing its mistakes with its appearance, and encouraging confederate-guide interaction. Our work furthers understanding of guides as a versatile method for VR accessibility and presents design recommendations for future guides.
中文标题/摘要
标题:理解大型语言模型赋能的向导如何使盲人和低视力人士在虚拟现实中的访问更加无障碍
随着社交虚拟现实(VR)越来越受欢迎,为盲人和低视力(BLV)用户解决无障碍问题变得越来越关键。研究人员提出了一个AI“明眼向导”来帮助用户在VR中导航并回答问题,但尚未经过用户研究。为了填补这一空白,我们开发了一个大型语言模型(LLM)赋能的向导,并在虚拟环境中研究了16名BLV参与者对它的使用情况,环境中有实验同谋(confederates)扮演其他用户。我们发现,当独自一人时,参与者将向导视为一种工具,但在其他人周围时,他们将向导当作同伴对待,给它起昵称,以其外观来合理化它的错误,并鼓励实验同谋与向导之间的互动。我们的工作进一步加深了对向导作为一种多功能VR无障碍方法的理解,并提出了未来向导设计的建议。
Summary / 总结
This study investigates the use of an AI 'sighted guide' to enhance the accessibility of virtual reality for blind and low vision users. The researchers developed an LLM-powered guide and tested it with 16 participants in virtual environments. The findings show that participants treated the guide as a tool when alone but interacted with it companionably around others, giving it nicknames and rationalizing its mistakes. This work advances understanding of guides as a versatile VR accessibility method and provides design recommendations for future guides.
研究旨在通过开发AI‘明眼向导’来增强盲和低视力用户在虚拟现实中的可访问性。使用16名盲和低视力参与者在虚拟环境中测试了一个基于大型语言模型(LLM)的向导。研究发现,参与者在独自使用时将向导视为工具,但在他人面前则以同伴的方式与其互动,给它起昵称并为其错误进行辩解。这项工作推进了对向导作为VR可访问性的一种多功能方法的理解,并为未来向导的设计提供了建议。
Emotional Modulation in Swarm Decision Dynamics
Authors: David Freire-Obregón
First: 2026-03-10T17:56:42+00:00 · Latest: 2026-03-10T17:56:42+00:00
Comments: Accepted for presentation at the International Conference on Agents and Artificial Intelligence (ICAART 2026)
Abstract
Collective decision-making in biological and human groups often emerges from simple interaction rules that amplify minor differences into consensus. The bee equation, developed initially to describe nest-site selection in honeybee swarms, captures this dynamic through recruitment and inhibition processes. Here, we extend the bee equation into an agent-based model in which emotional valence (positive-negative) and arousal (low-high) act as modulators of interaction rates, effectively altering the recruitment and cross-inhibition parameters. Agents display simulated facial expressions mapped from their valence-arousal states, allowing the study of emotional contagion in consensus formation. Three scenarios are explored: (1) the joint effect of valence and arousal on consensus outcomes and speed, (2) the role of arousal in breaking ties when valence is matched, and (3) the "snowball effect" in which consensus accelerates after surpassing intermediate support thresholds. Results show that emotional modulation can bias decision outcomes and alter convergence times by shifting effective recruitment and inhibition rates. At the same time, intrinsic non-linear amplification can produce decisive wins even in fully symmetric emotional conditions. These findings link classical swarm decision theory with affective and social modelling, highlighting how both emotional asymmetries and structural tipping points shape collective outcomes. The proposed framework offers a flexible tool for studying the emotional dimensions of collective choice in both natural and artificial systems.
中文标题/摘要
标题:群体决策动力学中的情绪调节
生物群体和人类群体的集体决策往往源自简单的交互规则,这些规则将微小差异放大为共识。蜂群方程最初用于描述蜜蜂群体选择巢址的动力学,通过招募和抑制过程捕捉这一动态。在此,我们扩展了蜂群方程,构建了一个基于代理的模型,其中情绪的正负值和低高唤醒度作为交互速率的调节器,有效改变招募和交叉抑制参数。代理展示模拟面部表情,映射自其正负值-唤醒度状态,允许研究情绪在达成共识中的传播。 探索了三种情景:(1)正负值和唤醒度对共识结果和速度的联合影响,(2)唤醒度在正负值匹配时打破僵局的作用,以及(3)“滚雪球效应”,即在超过中间支持阈值后,共识加速。结果表明,情绪调节可以偏倚决策结果,并通过改变有效的招募和抑制速率来改变收敛时间。同时,内在的非线性放大可以产生决定性的胜利,即使在完全对称的情绪条件下也是如此。 这些发现将经典的群体决策理论与情感和社会建模联系起来,强调了情绪不对称性和结构临界点如何塑造集体结果。提出的框架为研究自然和人工系统中的集体选择的情绪维度提供了一个灵活的工具。
Summary / 总结
This study extends the bee equation to incorporate emotional valence and arousal as modulators of interaction rates in an agent-based model, exploring how these emotions affect consensus formation. Key findings include the biasing of decision outcomes and changes in convergence times due to emotional modulation, as well as the 'snowball effect' where consensus accelerates after surpassing intermediate support thresholds. The research links classical swarm decision theory with affective and social modeling, emphasizing the role of emotional asymmetries and structural tipping points in shaping collective outcomes.
该研究将情绪的正负性和唤醒度引入蜜蜂方程的扩展模型中,作为交互速率的调节器,探讨这些情绪如何影响共识形成。主要发现包括情绪调节对决策结果的偏导和收敛时间的变化,以及在达到中间支持阈值后共识加速的“滚雪球效应”。研究将经典的群体决策理论与情感和社会建模相结合,强调情绪不对称性和结构临界点在塑造集体结果中的作用。
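The modulation idea in this abstract can be illustrated with a small deterministic sketch. It assumes a common two-option form of the nest-site model (discovery, recruitment, abandonment, cross-inhibition), which is not necessarily the paper's exact parameterization. The rate constants and the linear scaling of recruitment by valence and of cross-inhibition by arousal are hypothetical choices made only for illustration.

```python
def simulate(valence_a=0.0, valence_b=0.0, arousal_a=0.0, arousal_b=0.0,
             steps=20000, dt=0.01):
    """Euler integration of a two-option commitment model.

    a, b are the fractions of agents committed to options A and B;
    u = 1 - a - b is the uncommitted fraction. All constants below are
    hypothetical illustration values, not taken from the paper.
    """
    gamma, rho, alpha, sigma = 0.05, 1.0, 0.1, 2.0
    # Assumed modulation: valence scales recruitment, arousal scales
    # the cross-inhibition that a population exerts on its rival.
    rho_a = rho * (1.0 + 0.5 * valence_a)
    rho_b = rho * (1.0 + 0.5 * valence_b)
    sigma_a = sigma * (1.0 + 0.5 * arousal_a)  # inhibition exerted by A on B
    sigma_b = sigma * (1.0 + 0.5 * arousal_b)  # inhibition exerted by B on A
    a, b = 0.0, 0.0
    for _ in range(steps):
        u = 1.0 - a - b
        da = gamma * u + rho_a * a * u - alpha * a - sigma_b * a * b
        db = gamma * u + rho_b * b * u - alpha * b - sigma_a * a * b
        a += dt * da
        b += dt * db
    return a, b
```

Calling `simulate(valence_a=1.0)` gives option A a recruitment advantage, so A ends with the larger committed fraction. A deterministic sketch like this stays tied under fully symmetric settings; in an agent-based model, stochastic fluctuations are what would let non-linear amplification pick a winner.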
BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
Authors: Xinyu Gao, Gang Chen, Javier Alonso-Mora
First: 2026-03-10T17:56:16+00:00 · Latest: 2026-03-10T17:56:16+00:00
Comments: 8 pages. Project page: https://xin-yu-gao.github.io/beacon
Abstract
Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth-derived BEV features. Using an occlusion-aware dataset built in the Habitat simulator, we conduct detailed experimental analysis to validate both our BEV space formulation and the design choices of each module. Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline on the validation subset with occluded target locations. Our project page is: https://xin-yu-gao.github.io/beacon.
中文标题/摘要
标题:BEACON:基于语言条件的遮挡下导航可用性预测
基于语言条件的局部导航要求机器人从其当前观察和开放词汇关系指令中推断出附近的可通行目标位置。现有的视觉-语言空间定位方法通常依赖视觉-语言模型(VLM)在图像空间中进行推理,产生与可见像素相关的二维预测。因此,它们在遮挡区域(通常由家具或移动的人类引起)推断目标位置时遇到困难。为了解决这个问题,我们提出了BEACON,它预测了一个以自我为中心的鸟瞰图(BEV)可用性热力图,覆盖了一个包括遮挡区域的局部区域。给定一个指令和来自机器人周围四个方向的环绕视图RGB-D观察结果,BEACON通过将空间线索注入VLM并将VLM的输出与深度衍生的BEV特征融合来预测BEV热力图。使用在Habitat模拟器中构建的具有遮挡感知的数据集,我们进行了详细的实验分析,以验证我们的BEV空间表示和每个模块的设计选择。我们的方法在验证子集上遮挡目标位置的平均测地距离阈值精度上比最先进的图像空间基线提高了22.74个百分点。我们的项目页面是:https://xin-yu-gao.github.io/beacon.
Summary / 总结
BEACON targets language-conditioned local navigation when the target location lies in an occluded region. It predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region, including occluded areas, by injecting spatial cues into a vision-language model (VLM) and fusing the VLM's output with depth-derived BEV features. On the validation subset with occluded target locations, BEACON improves accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline.
BEACON旨在解决语言引导的局部导航中在遮挡区域预测目标位置的挑战。它通过视觉语言模型和深度衍生的BEV特征预测一个局部区域的Bird's-Eye View (BEV) 可操作性热图。实验结果显示,在验证子集中的遮挡目标位置上,其准确率比现有方法提高了22.74个百分点。
Think Before You Lie: How Reasoning Improves Honesty
Authors: Ann Yuan, Asma Ghandeharioun, Carter Blum, Alicia Machado, Jessica Hoffmann, Daphne Ippolito, Martin Wattenberg, Lucas Dixon, Katja Filippova
First: 2026-03-10T17:52:49+00:00 · Latest: 2026-03-10T17:52:49+00:00
Abstract
While existing evaluations of large language models (LLMs) measure deception rates, the underlying conditions that give rise to deceptive behavior are poorly understood. We investigate this question using a novel dataset of realistic moral trade-offs where honesty incurs variable costs. Contrary to humans, who tend to become less honest given time to deliberate (Capraro, 2017; Capraro et al., 2019), we find that reasoning consistently increases honesty across scales and for several LLM families. This effect is not only a function of the reasoning content, as reasoning traces are often poor predictors of final behaviors. Rather, we show that the underlying geometry of the representational space itself contributes to the effect. Namely, we observe that deceptive regions within this space are metastable: deceptive answers are more easily destabilized by input paraphrasing, output resampling, and activation noise than honest ones. We interpret the effect of reasoning in this vein: generating deliberative tokens as part of moral reasoning entails the traversal of a biased representational space, ultimately nudging the model toward its more stable, honest defaults.
中文标题/摘要
标题:说谎前三思:推理如何提升诚实
虽然现有的大型语言模型(LLMs)评估测量了欺骗率,但导致欺骗行为的潜在条件却知之甚少。我们使用一个新颖的数据集来研究这一问题,该数据集包含现实中的道德权衡,其中诚实会带来可变的成本。与人类不同,人类在有时间深思熟虑的情况下往往会变得不那么诚实(Capraro, 2017; Capraro et al., 2019),我们发现推理在不同规模上一致地增加了多个LLM家族的诚实度。这一效果不仅取决于推理内容,因为推理痕迹往往不是最终行为的良好预测器。相反,我们表明,表示空间本身的几何结构也对效果有所贡献。具体来说,我们观察到,该空间中的欺骗区域是亚稳态:欺骗性答案比诚实性答案更容易通过输入重述、输出重采样和激活噪声被破坏。我们从这一角度解释推理的效果:在道德推理过程中生成审慎的令牌意味着穿越一个有偏见的表示空间,最终促使模型向其更稳定、更诚实的默认状态靠拢。
Towards a Neural Debugger for Python
Authors: Maximilian Beck, Jonas Gehring, Jannik Kossen, Gabriel Synnaeve
First: 2026-03-10T17:47:05+00:00 · Latest: 2026-03-10T17:47:05+00:00
Comments: 22 pages
Abstract
Training large language models (LLMs) on Python execution traces grounds them in code execution and enables the line-by-line execution prediction of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team et al., 2025). However, developers rarely execute programs step by step; instead, they use debuggers to stop execution at certain breakpoints and step through relevant portions only while inspecting or modifying program variables. Existing neural interpreter approaches lack such interactive control. To address this limitation, we introduce neural debuggers: language models that emulate traditional debuggers, supporting operations such as stepping into, over, or out of functions, as well as setting breakpoints at specific source lines. We show that neural debuggers -- obtained via fine-tuning large LLMs or pre-training smaller models from scratch -- can reliably model both forward execution (predicting future states and outputs) and inverse execution (inferring prior states or inputs) conditioned on debugger actions. Evaluated on CruxEval, our models achieve strong performance on both output and input prediction tasks, demonstrating robust conditional execution modeling. Our work takes first steps towards future agentic coding systems in which neural debuggers serve as a world model for simulated debugging environments, providing execution feedback or enabling agents to interact with real debugging tools. This capability lays the foundation for more powerful code generation, program understanding, and automated debugging.
中文标题/摘要
标题:朝向Python的神经调试器
通过在Python执行跟踪上训练大型语言模型(LLMs),可以将它们扎根于代码执行,并使它们能够预测整个Python程序的逐行执行,有效地将它们转变为神经解释器(FAIR CodeGen团队等,2025)。然而,开发人员很少逐行执行程序;相反,他们使用调试器在特定断点处停止执行,并仅在检查或修改程序变量时逐步通过相关部分。现有的神经解释器方法缺乏这种交互控制。为了解决这一局限性,我们引入了神经调试器:模拟传统调试器的语言模型,支持进入、越过或跳出函数的操作,以及在特定源代码行设置断点。我们展示了通过微调大型LLMs或从零开始预训练较小模型获得的神经调试器可以可靠地模拟基于调试器操作的正向执行(预测未来状态和输出)和逆向执行(推断先前状态或输入)。在CruxEval上评估,我们的模型在输出和输入预测任务上表现出色,展示了强大的条件执行建模能力。我们的工作朝着未来自主编码系统迈出了第一步,在这些系统中,神经调试器作为模拟调试环境的世界模型,提供执行反馈或使代理能够与实际调试工具交互。这种能力为更强大的代码生成、程序理解和自动化调试奠定了基础。
Summary / 总结
This paper addresses the limitation of existing neural interpreters by introducing neural debuggers, which emulate traditional debuggers to support interactive control over program execution. By fine-tuning large language models or pre-training smaller models, the researchers enable these debuggers to predict both forward and inverse execution conditioned on debugger actions. The models achieve strong performance on CruxEval, indicating reliable execution modeling and setting the stage for future agentic coding systems.
研究旨在通过开发能够模拟传统调试功能的神经调试器来增强神经解释器。方法是通过微调大型语言模型或从零开始预训练较小的模型来支持进入、越过或跳出函数以及设置断点等功能。关键发现表明,这些神经调试器能够准确预测正向和逆向执行,实现强大的输出和输入预测任务,从而提供稳健的条件执行建模。
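For readers unfamiliar with execution traces, the sketch below shows one way line-by-line program states can be recorded in Python using the standard `sys.settrace` hook. It only illustrates the kind of data a neural interpreter or debugger would be asked to predict. The helper `record_trace` and the trace format are hypothetical and are not the authors' data pipeline.

```python
import sys


def record_trace(func, *args):
    """Run func(*args) and record (relative_line, locals) at each line event.

    Line numbers are relative to the `def` line of func. The final entry is
    ("return", value). Hypothetical trace format, for illustration only.
    """
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames of other functions
        if event == "line":
            rel = frame.f_lineno - code.co_firstlineno
            events.append((rel, dict(frame.f_locals)))
        elif event == "return":
            events.append(("return", arg))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous hook
    return result, events


def states_at(events, rel_line):
    """Emulate a breakpoint: keep only the states recorded at one line."""
    return [state for line, state in events if line == rel_line]


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total
```

`record_trace(running_sum, 3)` returns the result together with one recorded state per executed line, and `states_at(events, 3)` keeps only the states seen just before `total += i` runs, which is roughly what stopping at a breakpoint on that line would expose.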
GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics
Authors: Arsham Gholamzadeh Khoee, Shuai Wang, Robert Feldt, Dhasarathy Parthasarathy, Yinan Yu
First: 2025-03-27T17:48:32+00:00 · Latest: 2026-03-10T17:44:31+00:00
Abstract
Ensuring reliable data-driven decisions is crucial in domains where analytical accuracy directly impacts safety, compliance, or operational outcomes. Decision support in such domains relies on large tabular datasets, where manual analysis is slow, costly, and error-prone. While Large Language Models (LLMs) offer promising automation potential, they face challenges in analytical reasoning, structured data handling, and ambiguity resolution. This paper introduces GateLens, an LLM-based architecture for reliable analysis of complex tabular data. Its key innovation is the use of Relational Algebra (RA) as a formal intermediate representation between natural-language reasoning and executable code, addressing the reasoning-to-code gap that can arise in direct generation approaches. In our automotive instantiation, GateLens translates natural language queries into RA expressions and generates optimized Python code. Unlike traditional multi-agent or planning-based systems that can be slow, opaque, and costly to maintain, GateLens emphasizes speed, transparency, and reliability. We validate the architecture in automotive software release analytics, where experimental results show that GateLens outperforms the existing Chain-of-Thought (CoT) + Self-Consistency (SC) based system on real-world datasets, particularly in handling complex and ambiguous queries. Ablation studies confirm the essential role of the RA layer. Industrial deployment demonstrates over 80% reduction in analysis time while maintaining high accuracy across domain-specific tasks. GateLens operates effectively in zero-shot settings without requiring few-shot examples or agent orchestration. This work advances deployable LLM system design by identifying key architectural features--intermediate formal representations, execution efficiency, and low configuration overhead--crucial for domain-specific analytical applications.
中文标题/摘要
标题:GateLens:一种增强推理的LLM代理,用于汽车软件发布分析
在直接影响安全、合规或运营结果的领域中,确保可靠的数据驱动决策至关重要。此类领域的决策支持依赖于大型表格数据集,而人工分析则缓慢、昂贵且容易出错。虽然大型语言模型(LLMs)提供了自动化的潜力,但它们在分析推理、结构化数据处理和歧义解决方面面临挑战。本文介绍了GateLens,这是一种基于LLM的架构,用于可靠地分析复杂表格数据。其关键创新之处在于使用关系代数(RA)作为自然语言推理与可执行代码之间的正式中间表示,解决了直接生成方法中可能出现的推理到代码的差距。在我们的汽车实例中,GateLens 将自然语言查询翻译成RA表达式,并生成优化的Python代码。与传统的多代理或基于规划的系统相比,GateLens 强调速度、透明性和可靠性,无需维护缓慢、不透明且昂贵的系统。我们在汽车软件发布分析中验证了该架构,实验结果表明,GateLens 在处理复杂和模糊查询方面优于现有的基于思维链(CoT)+ 自我一致性(SC)的系统。消融研究证实了RA层的必要性。工业部署表明,在保持特定领域任务高准确性的同时,分析时间减少了超过80%。GateLens 在零样本设置中有效运行,无需少量示例或代理协调。本文通过识别关键架构特征——中间形式化表示、执行效率和低配置开销,推进了可部署的LLM系统设计,这些特征对于特定领域的分析应用至关重要。
Summary / 总结
GateLens is an LLM-based architecture for reliable analysis of complex tabular data, validated in automotive software release analytics. It uses Relational Algebra (RA) as a formal intermediate representation between natural-language reasoning and executable code, addressing the reasoning-to-code gap. In experiments on real-world datasets, GateLens outperforms the existing Chain-of-Thought + Self-Consistency system, particularly on complex and ambiguous queries, and ablation studies confirm the essential role of the RA layer. Industrial deployment shows an over 80% reduction in analysis time while maintaining high accuracy, and GateLens operates in zero-shot settings without few-shot examples or agent orchestration.
GateLens 是一种基于LLM的架构,用于汽车软件发布分析中的复杂表格数据分析。它使用关系代数作为中间表示,连接自然语言推理和可执行代码,解决推理到代码的差距。实验结果表明,GateLens 在处理复杂和模糊查询方面优于现有系统,分析时间减少超过80%,同时保持高准确性。消融研究证实了关系代数层的重要性,工业部署展示了其在零样本设置中的有效性。
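The role of a relational-algebra intermediate representation can be made concrete with a toy example. The sketch below assumes a tiny RA with only table, selection, and projection nodes, evaluated over in-memory rows. The node classes, the `evaluate` function, and the sample release data are all hypothetical and do not reproduce GateLens's actual representation or code generator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class Select:
    source: object
    predicate: Callable[[dict], bool]


@dataclass(frozen=True)
class Project:
    source: object
    columns: tuple


def evaluate(expr, db):
    """Evaluate a toy RA expression against db, a dict of row lists."""
    if isinstance(expr, Table):
        return list(db[expr.name])
    if isinstance(expr, Select):
        return [row for row in evaluate(expr.source, db) if expr.predicate(row)]
    if isinstance(expr, Project):
        seen, out = set(), []
        for row in evaluate(expr.source, db):
            key = tuple(row[c] for c in expr.columns)
            if key not in seen:  # projection has set semantics
                seen.add(key)
                out.append(dict(zip(expr.columns, key)))
        return out
    raise TypeError(f"unknown RA node: {expr!r}")


# Hypothetical sample data for a release gate.
DB = {
    "results": [
        {"release": "R1", "test": "brake_ecu", "status": "pass"},
        {"release": "R2", "test": "brake_ecu", "status": "fail"},
        {"release": "R2", "test": "infotainment", "status": "pass"},
    ]
}

# "Which tests failed in release R2?" expressed as an RA tree.
QUERY = Project(
    Select(Table("results"),
           lambda r: r["release"] == "R2" and r["status"] == "fail"),
    ("test",),
)
```

Because the query is an explicit tree rather than free-form generated code, each step (which table, which filter, which columns) can be inspected before anything runs, which is the transparency argument the abstract makes for a formal intermediate layer.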
The Confidence Gate Theorem: When Should Ranked Decision Systems Abstain?
Authors: Ronald Doku
First: 2026-03-10T17:44:10+00:00 · Latest: 2026-03-10T17:44:10+00:00
Abstract
Ranked decision systems -- recommenders, ad auctions, clinical triage queues -- must decide when to intervene in ranked outputs and when to abstain. We study when confidence-based abstention monotonically improves decision quality, and when it fails. The formal conditions are simple: rank-alignment and no inversion zones. The substantive contribution is identifying why these conditions hold or fail: the distinction between structural uncertainty (missing data, e.g., cold-start) and contextual uncertainty (missing context, e.g., temporal drift). Empirically, we validate this distinction across three domains: collaborative filtering (MovieLens, 3 distribution shifts), e-commerce intent detection (RetailRocket, Criteo, Yoochoose), and clinical pathway triage (MIMIC-IV). Structural uncertainty produces near-monotonic abstention gains in all domains; structurally grounded confidence signals (observation counts) fail under contextual drift, producing as many monotonicity violations as random abstention on our MovieLens temporal split. Context-aware alternatives -- ensemble disagreement and recency features -- substantially narrow the gap (reducing violations from 3 to 1--2) but do not fully restore monotonicity, suggesting that contextual uncertainty poses qualitatively different challenges. Exception labels defined from residuals degrade substantially under distribution shift (AUC drops from 0.71 to 0.61--0.62 across three splits), providing a clean negative result against the common practice of exception-based intervention. The results provide a practical deployment diagnostic: check C1 and C2 on held-out data before deploying a confidence gate, and match the confidence signal to the dominant uncertainty type.
中文标题/摘要
标题:信心门槛定理:排序决策系统何时应弃权?
排序决策系统——推荐系统、广告拍卖、临床分诊队列——必须决定何时干预排序输出,何时弃权。我们研究了基于信心的弃权何时单调地提高决策质量,何时失败。形式条件很简单:排序对齐和无反转区。实质性贡献在于识别这些条件为何成立或失败:结构性不确定性(缺失数据,例如冷启动)与情境不确定性(缺失上下文,例如时间漂移)之间的区别。实验上,我们在三个领域验证了这种区别:协同过滤(MovieLens,3种分布变化),电子商务意图检测(RetailRocket、Criteo、Yoochoose),以及临床路径分诊(MIMIC-IV)。结构性不确定性在所有领域都产生了接近单调的弃权收益;结构化的信心信号(观测计数)在情境漂移下失效,导致在我们的MovieLens时间分割上与随机弃权一样多的单调性违反。情境感知的替代方案——集成分歧和近期特征——显著缩小了差距(将违反次数从3减少到1-2),但并未完全恢复单调性,表明情境不确定性提出了质的不同挑战。从残差定义的异常标签在分布变化下显著下降(AUC从0.71降至0.61-0.62),提供了对基于异常干预的常见做法的干净否定结果。结果提供了一个实际部署诊断:在部署信心门槛之前,在保留数据上检查C1和C2,并将信心信号与主导的不确定性类型匹配。
Summary / 总结
The paper studies when confidence-based abstention in ranked decision systems monotonically improves decision quality and when it fails. The formal conditions are rank-alignment and the absence of inversion zones; whether they hold depends on the kind of uncertainty, structural (missing data, e.g., cold-start) or contextual (missing context, e.g., temporal drift). Across collaborative filtering, e-commerce intent detection, and clinical pathway triage, structural uncertainty yields near-monotonic abstention gains, whereas count-based confidence signals fail under contextual drift; context-aware alternatives (ensemble disagreement, recency features) reduce violations from 3 to 1-2 without fully restoring monotonicity. The practical diagnostic is to check conditions C1 and C2 on held-out data before deploying a confidence gate and to match the confidence signal to the dominant uncertainty type.
论文研究了在排序决策系统中基于信心的回避何时能提高决策质量以及何时会失败。它确定了如排序对齐和无反转区等条件,并区分了结构性不确定性(如冷启动)和情境不确定性(如时间漂移)。在协同过滤、电子商务意图检测和临床路径分诊等领域的实证验证表明,结构性不确定性会导致一致的回避增益,而情境不确定性会导致单调性违反。情境感知的替代方案可以减少但不能完全恢复单调性,表明情境不确定性提出了独特的挑战。研究提供了实用的部署诊断:在部署信心门之前检查这些条件,并将信心信号与主要的不确定性类型匹配。
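A held-out monotonicity check of the kind this diagnostic suggests might look like the sketch below. It assumes per-item confidences and losses are available, abstains on the lowest-confidence items at a few abstention levels, and counts how often the mean loss on retained items gets worse as abstention increases. The function names and the abstention levels are hypothetical, and this does not implement the paper's formal conditions C1 and C2.

```python
def abstention_curve(confidences, losses, levels=(0.0, 0.2, 0.4, 0.6)):
    """Mean loss on retained items when abstaining on the least confident.

    levels are fractions of items to abstain on; at least one item is kept.
    """
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    curve = []
    for frac in levels:
        keep = max(1, round(len(order) * (1.0 - frac)))
        kept = order[:keep]
        curve.append(sum(losses[i] for i in kept) / keep)
    return curve


def count_violations(curve, tol=1e-12):
    """Count steps where abstaining more makes retained-item loss worse."""
    return sum(1 for prev, nxt in zip(curve, curve[1:]) if nxt > prev + tol)


# Hypothetical held-out data: confidences shared by both scenarios.
CONF = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ALIGNED_LOSS = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]   # errors sit at low confidence
DRIFTED_LOSS = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # errors sit at high confidence
```

With the aligned losses the curve falls at every level, so there are no violations; with the drifted losses every extra abstention step raises the retained-item loss. A signal that behaves like the second case on held-out data would be a poor basis for a confidence gate.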
No Image, No Problem: End-to-End Multi-Task Cardiac Analysis from Undersampled k-Space
Authors: Yundi Zhang, Sevgi Gokce Kafali, Niklas Bubeck, Daniel Rueckert, Jiazhen Pan
First: 2026-03-10T17:38:38+00:00 · Latest: 2026-03-10T17:38:38+00:00
Abstract
Conventional clinical CMR pipelines rely on a sequential "reconstruct-then-analyze" paradigm, forcing an ill-posed intermediate step that introduces avoidable artifacts and information bottlenecks. This creates a fundamental mathematical paradox: it attempts to recover high-dimensional pixel arrays (i.e., images) from undersampled k-space, rather than directly extracting the low-dimensional physiological labels actually required for diagnosis. To unlock the direct diagnostic potential of k-space, we propose k-MTR (k-space Multi-Task Representation), a k-space representation learning framework that aligns undersampled k-space data and fully-sampled images into a shared semantic manifold. Leveraging a large-scale controlled simulation of 42,000 subjects, k-MTR forces the k-space encoder to restore anatomical information lost to undersampling directly within the latent space, bypassing the explicit inverse problem for downstream analysis. We demonstrate that this latent alignment enables the dense latent space embedded with high-level physiological semantics directly from undersampled frequencies. Across continuous phenotype regression, disease classification, and anatomical segmentation, k-MTR achieves highly competitive performance against state-of-the-art image-domain baselines. By showcasing that precise spatial geometries and multi-task features can be successfully recovered directly from the k-space representations, k-MTR provides a robust architectural blueprint for task-aware cardiac MRI workflows.
中文标题/摘要
标题:无图像,无问题:从欠采样k-空间进行端到端多任务心脏分析
传统的临床CMR流程依赖于“重建-分析”串行范式,导致一个病态的中间步骤,引入了可避免的伪影和信息瓶颈。这创造了一个基本的数学悖论:试图从欠采样的k-空间中恢复高维像素阵列(即图像),而不是直接提取诊断所需的低维生理标签。为了直接解锁k-空间的诊断潜力,我们提出了k-MTR(k-空间多任务表示),这是一种k-空间表示学习框架,将欠采样的k-空间数据和完全采样的图像对齐到共享的语义流形中。利用42,000个受控模拟的大规模数据集,k-MTR迫使k-空间编码器在潜在空间中直接恢复由于欠采样而丢失的解剖信息,绕过了下游分析中的显式逆问题。我们证明这种潜在对齐使高阶生理语义可以直接嵌入欠采样频率的密集潜在空间中。在连续表型回归、疾病分类和解剖分割中,k-MTR在最先进的图像域基线中表现出高度竞争力。通过展示可以从k-空间表示中直接恢复精确的空间几何和多任务特征,k-MTR为任务感知的心脏MRI工作流程提供了稳健的架构蓝图。
Summary / 总结
The paper addresses the limitations of conventional clinical CMR pipelines by proposing k-MTR (k-space Multi-Task Representation), which directly extracts physiological labels from undersampled k-space data without reconstructing images first. Using a large-scale simulation of 42,000 subjects, k-MTR aligns undersampled k-space data and fully-sampled images into a shared semantic manifold, enabling direct extraction of high-level physiological semantics. The method achieves competitive performance in continuous phenotype regression, disease classification, and anatomical segmentation tasks, demonstrating the potential of k-space data for direct diagnostic use without image reconstruction.
论文通过提出k-MTR(k空间多任务表示)框架,直接从欠采样的k空间数据中提取生理标签,而不先重建图像。利用42,000个受试者的大型模拟,k-MTR将欠采样的k空间数据和完全采样的图像对齐到共享语义流形,从而可以直接提取高级生理语义。该方法在连续表型回归、疾病分类和解剖分割等任务中表现出色,展示了k空间数据直接用于诊断而无需图像重建的潜力。
PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs
Authors: Jinyue Li, Yuci Liang, Qiankun Li, Xinheng Lyu, Jiayu Qian, Huabao Chen, Kun Wang, Zhigang Zeng, Anil Anthony Bharath, Yang Liu
First: 2026-03-10T17:35:49+00:00 · Latest: 2026-03-10T17:35:49+00:00
Abstract
Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.
中文标题/摘要
标题:PathMem:病理MLLMs的认知对齐记忆转换
计算病理学既需要视觉模式识别,也需要动态整合结构化的领域知识,包括分类学、分级标准和临床证据。实践中,诊断推理需要将形态学证据与正式的诊断和分级标准联系起来。尽管多模态大型语言模型(MLLMs)展示了强大的视觉语言推理能力,但它们缺乏结构化知识整合和可解释的记忆控制的明确机制。因此,现有模型在推理过程中难以一致地整合病理学特定的诊断标准。受人类病理学家分层记忆过程的启发,我们提出PathMem,这是一种以记忆为中心的多模态框架,用于病理学MLLMs。PathMem 将结构化的病理学知识组织成长期记忆(LTM),并引入了一个记忆变换器,通过多模态记忆激活和上下文感知的知识接地,建模从LTM到工作记忆(WM)的动态过渡,从而实现上下文感知的记忆细化以支持下游推理。PathMem 在基准测试中实现了SOTA性能,WSI-Bench报告生成的WSI-精确度提高了12.8%,WSI-相关性提高了10.1%,开放式诊断分别提高了9.7%和8.9%,优于之前的基于WSI的模型。
Summary / 总结
PathMem is a memory-centric multimodal framework that strengthens the reasoning of pathology multimodal large language models (MLLMs) by integrating structured domain knowledge. It organizes pathology knowledge as long-term memory (LTM) and introduces a Memory Transformer that models the transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding. Over prior WSI-based models, PathMem improves WSI-Bench report generation by 12.8% in WSI-Precision and 10.1% in WSI-Relevance, and improves open-ended diagnosis by 9.7% and 8.9%.
PathMem 是一种以记忆为中心的多模态框架,旨在通过整合结构化领域知识来增强病理学多模态大型语言模型(MLLMs)的推理能力。它将病理知识组织成长期记忆(LTM),并使用记忆变换器建模从长期记忆到工作记忆(WM)的动态转换,从而实现上下文感知的记忆细化。与之前基于WSI的模型相比,PathMem 在 WSI-Bench 报告生成上将 WSI-精确度提高了 12.8%、WSI-相关性提高了 10.1%,在开放式诊断上分别提高了 9.7% 和 8.9%。
MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers
Authors: Arash Ahmadi, Sarah Sharif, Yaser M. Banad
First: 2025-04-11T22:19:48+00:00 · Latest: 2026-03-10T17:34:59+00:00
Comments: 42 pages, 28 figures
Abstract
Large Language Models (LLMs) are increasingly augmented with external tools through standardized interfaces like the Model Context Protocol (MCP). However, current MCP implementations face critical limitations: they typically require local process execution through STDIO transports, making them impractical for resource-constrained environments like mobile devices, web browsers, and edge computing. We present MCP Bridge, a lightweight RESTful proxy that connects to multiple MCP servers and exposes their capabilities through a unified API. Unlike existing solutions, MCP Bridge is fully LLM-agnostic, supporting any backend regardless of vendor. The system implements a risk-based execution model with three security levels (standard execution, confirmation workflow, and Docker isolation) while maintaining backward compatibility with standard MCP clients. However, reliable execution within this framework requires models that can strictly adhere to protocol schemas. To this end, we also fine-tuned the Qwen3 4B and 8B model family on the Agent-Ark/Toucan-1.5M dataset using four Reinforcement Learning techniques: Group Relative Policy Optimization (GRPO), Dr. GRPO, Beta Normalization Policy Optimization (BNPO), and Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO). Evaluated on the MCPToolBench++ benchmark, our optimized model achieves an F1 score of 73.0% that outperforms GPT-OSS-120B (62.17%) and remains competitive with the 70B+ parameter baselines. Evaluation demonstrates that MCP Bridge successfully addresses the constraints of direct MCP connections while providing enhanced security controls and cross-platform compatibility, enabling sophisticated LLM-powered applications in previously inaccessible environments.
中文标题/摘要
标题:MCP桥接器:一种轻量级、LLM无关的RESTful代理,用于模型上下文协议服务器
大型语言模型(LLMs)通过标准化接口(如模型上下文协议MCP)与外部工具进行增强。然而,当前的MCP实现面临关键限制:它们通常需要通过STDIO传输进行本地进程执行,这使得它们在资源受限的环境中(如移动设备、网页浏览器和边缘计算)不切实际。我们提出了MCP桥接器,这是一种轻量级的RESTful代理,可以连接到多个MCP服务器并通过统一的API暴露其功能。与现有解决方案不同,MCP桥接器完全不依赖于特定的LLM,支持任何后端,无论供应商如何。该系统实现了一种基于风险的执行模型,具有三个安全级别:标准执行、确认工作流和Docker隔离,同时保持与标准MCP客户端的向后兼容性。然而,在此框架内可靠执行需要严格遵守协议模式的模型。为此,我们还使用四种强化学习技术(组相对策略优化GRPO、Dr. GRPO、贝塔归一化策略优化BNPO和解耦剪辑和动态采样策略优化DAPO),在Agent-Ark/Toucan-1.5M数据集上对Qwen3 4B和8B模型家族进行了微调。在MCPToolBench++基准测试中,我们的优化模型实现了73.0%的F1分数,优于GPT-OSS-120B(62.17%),并且与70B+参数基线保持竞争力。评估表明,MCP桥接器成功解决了直接MCP连接的限制,同时提供了增强的安全控制和跨平台兼容性,使复杂的LLM驱动应用程序能够在以前无法访问的环境中运行。
Summary / 总结
MCP Bridge is a lightweight, LLM-agnostic RESTful proxy that connects to multiple MCP servers and exposes their capabilities through a unified API. It implements a risk-based execution model with three security levels and maintains backward compatibility with standard MCP clients. The authors fine-tuned the Qwen3 4B and 8B model family using four RL techniques and achieved an F1 score of 73.0% on the MCPToolBench++ benchmark, outperforming GPT-OSS-120B (62.17%) and remaining competitive with larger models.
MCP Bridge 是一种轻量级的 RESTful 代理,旨在连接多个 Model Context Protocol (MCP) 服务器并通过统一的 API 展示其功能,解决了现有 MCP 实现的限制。它支持任何后端 LLM,并实现了一种基于风险的执行模型,包含三个安全级别。作者使用四种强化学习技术对 Qwen3 4B 和 8B 模型进行了微调,并在 MCPToolBench++ 基准测试中取得了 73.0% 的 F1 分数,超过了 GPT-OSS-120B,且与更大规模的模型保持竞争力。
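The three security levels named in the MCP Bridge abstract can be pictured as a small dispatch routine. The following is only a minimal sketch under stated assumptions: the names `RiskLevel`, `execute_tool`, `run`, and `run_isolated` are hypothetical and are not taken from the paper or from any MCP library, and the response shape is invented for illustration.

```python
from enum import Enum


class RiskLevel(Enum):
    """Hypothetical labels for the three levels described in the abstract."""
    STANDARD = 1      # execute immediately
    CONFIRMATION = 2  # execute only after explicit confirmation
    ISOLATED = 3      # execute inside an isolated sandbox (e.g. a container)


def execute_tool(name, args, risk, run, run_isolated, confirmed=False):
    """Dispatch a tool call according to its risk level.

    `run` and `run_isolated` are caller-supplied callables (assumptions):
    the first runs the tool directly, the second runs it in a sandbox.
    """
    if risk is RiskLevel.STANDARD:
        return {"status": "done", "result": run(name, args)}
    if risk is RiskLevel.CONFIRMATION:
        if not confirmed:
            # Nothing is executed until the client confirms the call.
            return {"status": "confirmation_required", "tool": name}
        return {"status": "done", "result": run(name, args)}
    if risk is RiskLevel.ISOLATED:
        return {"status": "done", "result": run_isolated(name, args)}
    raise ValueError(f"unknown risk level: {risk!r}")
```

In this sketch a medium-risk call returns a pending response first and runs only when the caller repeats it with `confirmed=True`; how the actual system tracks pending confirmations is not described in the abstract.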
SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG
Authors: Fredrik K. Gustafsson, Xiao Gu, Mattia Carletti, Patitapaban Palo, David W. Eyre, David A. Clifton
First: 2026-03-10T17:32:28+00:00 · Latest: 2026-03-10T17:32:28+00:00
Comments: Code is available at https://github.com/fregu856/SignalMC-MED
Abstract
Recent biosignal foundation models (FMs) have demonstrated promising performance across diverse clinical prediction tasks, yet systematic evaluation on long-duration multimodal data remains limited. We introduce SignalMC-MED, a benchmark for evaluating biosignal FMs on synchronized single-lead electrocardiogram (ECG) and photoplethysmogram (PPG) data. Derived from the MC-MED dataset, SignalMC-MED comprises 22,256 visits with 10-minute overlapping ECG and PPG signals, and includes 20 clinically relevant tasks spanning prediction of demographics, emergency department disposition, laboratory value regression, and detection of prior ICD-10 diagnoses. Using this benchmark, we perform a systematic evaluation of representative time-series and biosignal FMs across ECG-only, PPG-only, and ECG + PPG settings. We find that domain-specific biosignal FMs consistently outperform general time-series models, and that multimodal ECG + PPG fusion yields robust improvements over unimodal inputs. Moreover, using the full 10-minute signal consistently outperforms shorter segments, and larger model variants do not reliably outperform smaller ones. Hand-crafted ECG domain features provide a strong baseline and offer complementary value when combined with learned FM representations. Together, these results establish SignalMC-MED as a standardized benchmark and provide practical guidance for evaluating and deploying biosignal FMs.
中文标题/摘要
标题:SignalMC-MED:一种用于评估生物信号基础模型的多模态基准,针对单导联ECG和PPG
近年来,生物信号基础模型(FMs)在多种临床预测任务中表现出有希望的性能,但在长时程多模态数据上的系统评估仍然有限。我们引入了SignalMC-MED基准,用于评估生物信号FMs在同步单导联心电图(ECG)和光电容积描记图(PPG)数据上的表现。该基准数据集源自MC-MED数据集,包含22,256次就诊,每次就诊含有10分钟时间重叠的ECG和PPG信号,并包括20项临床相关任务,涵盖人口统计学预测、急诊科处置、实验室值回归以及既往ICD-10诊断的检测。使用该基准,我们对代表性的时序和生物信号FMs在仅ECG、仅PPG和ECG + PPG设置下进行了系统评估。我们发现,针对特定领域的生物信号FMs始终优于通用时序模型,而多模态ECG + PPG融合相比单模态输入带来了稳健的改进。此外,使用完整的10分钟信号始终优于较短的片段,而较大的模型变体并不总是优于较小的模型。手工构建的ECG领域特征提供了强大的基线,并且当与学习到的FM表示结合时,提供了补充价值。这些结果共同确立了SignalMC-MED作为标准化基准,并为评估和部署生物信号FMs提供了实用指导。
Summary / 总结
SignalMC-MED is a benchmark for evaluating biosignal foundation models on synchronized single-lead ECG and PPG data, comprising 22,256 visits with 10-minute overlapping signals and 20 clinically relevant tasks. The study finds that domain-specific biosignal models outperform general time-series models, and multimodal ECG + PPG fusion improves performance over unimodal inputs. Using the full 10-minute signal consistently outperforms shorter segments, whereas larger model variants do not reliably outperform smaller ones. Hand-crafted ECG features provide a strong baseline and complement learned FM representations. This benchmark provides practical guidance for evaluating and deploying biosignal FMs.
SignalMC-MED 是一个用于评估同步单导联 ECG 和 PPG 数据上的生物信号基础模型的基准,涵盖了 20 个临床相关任务。研究发现,领域特定的生物信号模型优于通用时间序列模型,而多模态 ECG + PPG 融合相比单一模态输入表现更优。使用完整的 10 分钟信号始终优于较短的片段,而更大的模型并不总是优于较小的模型;手工构建的 ECG 特征与学习到的 FM 表征相结合时提供了补充价值。这些结果确立了 SignalMC-MED 作为生物信号 FM 标准化基准的地位。
Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions
Authors: Mingyang Song, Mao Zheng
First: 2026-03-10T17:31:55+00:00 · Latest: 2026-03-10T17:31:55+00:00
Abstract
Model merging has emerged as a transformative paradigm for combining the capabilities of multiple neural networks into a single unified model without additional training. With the rapid proliferation of fine-tuned large language models (LLMs), merging techniques offer a computationally efficient alternative to ensembles and full retraining, enabling practitioners to compose specialized capabilities at minimal cost. This survey presents a comprehensive and structured examination of model merging in the LLM era through the FUSE taxonomy, a four-dimensional framework organized along Foundations, Unification Strategies, Scenarios, and Ecosystem. We first establish the theoretical underpinnings of merging, including loss landscape geometry, mode connectivity, and the linear mode connectivity hypothesis. We then systematically review the algorithmic landscape, spanning weight averaging, task vector arithmetic, sparsification-enhanced methods, mixture-of-experts architectures, and evolutionary optimization approaches. For each method family, we analyze the core formulation, highlight representative works, and discuss practical trade-offs. We further examine downstream applications across multi-task learning, safety alignment, domain specialization, multilingual transfer, and federated learning. Finally, we survey the supporting ecosystem of open-source tools, community platforms, and evaluation benchmarks, and identify key open challenges including theoretical gaps, scalability barriers, and standardization needs. This survey aims to equip researchers and practitioners with a structured foundation for advancing model merging.
中文标题/摘要
标题:大型语言模型时代的模型合并:方法、应用与未来方向
模型合并已成为一种变革性的范式,用于将多个神经网络的能力合并到一个统一模型中,而无需额外训练。随着微调的大型语言模型(LLMs)的迅速普及,合并技术为模型集成和完全重新训练提供了一种计算上高效的替代方案,使从业者能够以最低成本组合专门的能力。本文综述了通过FUSE分类法在LLM时代对模型合并进行全面和结构化的考察,这是一种四维框架,按基础、统一策略、场景和生态系统组织。我们首先建立了合并的理论基础,包括损失景观几何、模式连通性和线性模式连通性假设。然后我们系统地回顾了算法景观,涵盖了权重平均、任务向量算术、稀疏化增强方法、专家混合架构和进化优化方法。对于每种方法家族,我们分析了核心公式,突出了代表性作品,并讨论了实际权衡。我们进一步探讨了跨多任务学习、安全性对齐、领域专业化、多语言迁移和联邦学习的下游应用。最后,我们综述了支持生态系统的开源工具、社区平台和评估基准,并确定了关键的开放挑战,包括理论空白、可扩展性障碍和标准化需求。本文旨在为研究人员和从业者提供一个结构化的基础,以推进模型合并。
Summary / 总结
This survey examines model merging, which combines the capabilities of multiple neural networks into a single unified model without additional training, in the context of large language models (LLMs). It introduces the FUSE taxonomy, a four-dimensional framework organized along Foundations, Unification Strategies, Scenarios, and Ecosystem. It first covers the theoretical underpinnings of merging, such as loss landscape geometry and mode connectivity, then reviews merging methods including weight averaging, task vector arithmetic, sparsification-enhanced methods, mixture-of-experts architectures, and evolutionary optimization approaches, along with their practical trade-offs. The paper also discusses applications in multi-task learning, safety alignment, domain specialization, multilingual transfer, and federated learning, surveys the supporting ecosystem of open-source tools, community platforms, and evaluation benchmarks, and identifies open challenges including theoretical gaps, scalability barriers, and standardization needs.
本文探讨了在大规模语言模型(LLMs)背景下,将多个神经网络的能力合并为单一统一模型的模型合并技术。它引入了FUSE分类框架,该框架从基础、统一策略、应用场景和生态系统四个维度对合并方法进行分类。研究回顾了各种合并方法,包括权重平均、任务向量算术、稀疏化增强方法、专家混合架构和进化优化方法。关键发现包括合并的理论基础,如损失景观几何和模式连通性,以及不同方法的实用权衡。文章还讨论了在多任务学习、安全性对齐、领域专业化、多语言迁移和联邦学习等领域的应用,并强调了该领域需要开源工具和标准化的需求。
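Two of the method families listed in the model merging survey, weight averaging and task vector arithmetic, are simple enough to sketch directly. The example below is illustrative only: it represents each model as a plain dict of parameter lists, and the function names and the `scale` parameter are assumptions made for this sketch rather than anything specified in the survey.

```python
def average_weights(models):
    """Uniform weight averaging over models with identical parameter names and shapes."""
    n = len(models)
    return {
        name: [sum(m[name][i] for m in models) / n for i in range(len(values))]
        for name, values in models[0].items()
    }


def task_vector(finetuned, base):
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])] for name in base}


def apply_task_vectors(base, vectors, scale=1.0):
    """Add scaled task vectors to the base weights (task vector arithmetic)."""
    merged = {name: list(values) for name, values in base.items()}
    for vector in vectors:
        for name in merged:
            merged[name] = [w + scale * d for w, d in zip(merged[name], vector[name])]
    return merged
```

With a base model and two fine-tuned variants, `apply_task_vectors(base, [task_vector(a, base), task_vector(b, base)], scale=0.5)` gives the same result as averaging the two variants, which shows how the two families relate in the simplest case.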
Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation
Authors: Rong Zhou, Houliang Zhou, Yao Su, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer's Disease Neuroimaging Initiative
First: 2026-03-10T17:26:45+00:00 · Latest: 2026-03-10T17:26:45+00:00
Abstract
Multimodal neuroimaging provides complementary insights for Alzheimer's disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at https://github.com/rongzhou7/ACADiff
中文标题/摘要
标题:自适应临床感知潜在扩散在多模态脑影像生成及缺失模态插补中的应用
多模态神经影像学为阿尔茨海默病诊断提供了互补的见解,但临床数据集经常存在缺失模态的问题。我们提出了一种名为ACADiff的框架,通过自适应临床感知扩散来合成缺失的脑影像模态。ACADiff通过逐步去噪潜在表示并关注可用的影像数据和临床元数据,学习不完整多模态观察与目标模态之间的映射关系。该框架采用自适应融合,根据输入可用性动态重新配置,并通过GPT-4o编码的提示提供语义临床指导。三个专门的生成器使sMRI、FDG-PET和AV45-PET之间实现双向合成。在ADNI受试者上进行评估,ACADiff在生成质量上表现出色,并且即使在极端80%缺失场景下仍能保持稳健的诊断性能,优于所有现有基线。为了促进可重复性,代码可在https://github.com/rongzhou7/ACADiff获取
Summary / 总结
The research aims to address the issue of missing modalities in clinical neuroimaging datasets for Alzheimer's disease diagnosis. ACADiff, a framework that uses adaptive clinical-aware latent diffusion, is proposed to synthesize missing brain imaging modalities. By progressively denoising latent representations and dynamically fusing available imaging data and clinical metadata, ACADiff achieves high-quality multimodal brain image generation and maintains diagnostic performance even with 80% missing data, outperforming existing methods.
ACADiff 是一种框架,通过适应性临床感知扩散来合成缺失的脑成像模态。它通过逐级去噪潜在表示并考虑可用的成像数据和临床元数据来学习不完整多模态观察与目标模态之间的映射关系。ACADiff 在生成高质量图像和保持诊断性能方面优于现有方法,即使在 80% 数据缺失的情况下也能保持良好的性能,已在 ADNI 受试者上进行了评估。
Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
Authors: Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao
First: 2026-03-10T17:26:42+00:00 · Latest: 2026-03-10T17:26:42+00:00
Abstract
Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
中文标题/摘要
标题:基于关节角度运动图像与词元-图像块后期交互的细粒度运动检索
文本-运动检索旨在学习自然语言描述与3D人体运动骨架序列之间的语义对齐的潜在空间,从而在两种模态之间实现双向搜索。现有大多数方法使用双编码器框架将运动和文本压缩为全局嵌入,丢弃了细粒度的局部对应关系,从而降低了准确性。此外,这些全局嵌入方法对检索结果的解释性有限。为克服这些限制,我们提出了一种可解释的基于关节角度的运动表示,将关节级局部特征映射到与预训练的视觉变换器兼容的结构化伪图像。对于文本到运动检索,我们采用MaxSim,这是一种标记级晚交互机制,并通过掩码语言建模正则化增强它,以促进稳健且可解释的文本-运动对齐。在HumanML3D和KIT-ML上的广泛实验表明,我们的方法在可解释的细粒度文本-运动对应关系方面优于最先进的文本-运动检索方法。代码可在附录中获取。
Summary / 总结
The research aims to improve the accuracy and interpretability of text-motion retrieval by addressing the limitations of existing dual-encoder methods, which compress motion and text into global embeddings and discard fine-grained local correspondences. The proposed method maps joint-angle features into a structured pseudo-image compatible with pre-trained Vision Transformers and uses MaxSim, a token-wise late interaction mechanism, enhanced with Masked Language Modeling regularization. Experiments on HumanML3D and KIT-ML show that this approach outperforms state-of-the-art methods and provides interpretable fine-grained correspondences between text and motion.
研究旨在通过解决现有方法关注全局嵌入而忽略精细局部对应关系的问题,提高文本-动作检索的准确性和可解释性。提出的方案采用基于关节角度的动作表示,并使用一种基于标记的后期交互机制MaxSim,该机制通过掩码语言建模正则化增强。在HumanML3D和KIT-ML上的实验表明,该方法优于最先进的方法,能够提供文本和动作描述之间的可解释的精细对应关系。
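The MaxSim late interaction mentioned in the motion retrieval abstract scores a text-motion pair by matching each text token to its best motion patch and summing those maxima. The following is a minimal sketch of that general scoring rule in plain Python. The function names are hypothetical, embeddings are assumed to be precomputed lists of floats, and the paper's actual encoders and training objective are not reproduced here.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(text_tokens, motion_patches):
    """Sum over text tokens of the best similarity to any motion patch."""
    return sum(max(dot(t, p) for p in motion_patches) for t in text_tokens)


def best_patches(text_tokens, motion_patches):
    """Index of the best-matching patch for each token (a simple alignment view)."""
    return [
        max(range(len(motion_patches)), key=lambda j: dot(t, motion_patches[j]))
        for t in text_tokens
    ]


def rank(text_tokens, candidates):
    """Rank candidate motions (name -> patch embeddings) by MaxSim score, best first."""
    scores = {name: maxsim(text_tokens, patches) for name, patches in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The per-token argmax returned by `best_patches` is what makes this kind of scoring inspectable: each word can be traced to the patch that contributed its score, which a single global embedding cannot offer.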
A Distributional Treatment of Real2Sim2Real for Object-Centric Agent Adaptation in Vision-Driven Deformable Linear Object Manipulation
Authors: Georgios Kamaras, Subramanian Ramamoorthy
Venue: In IEEE Robotics and Automation Letters, Volume 10, Issue 8, August 2025, Pages 8075-8082
First: 2025-02-25T20:01:06+00:00 · Latest: 2026-03-10T17:25:35+00:00
Abstract
We present an integrated (or end-to-end) framework for the Real2Sim2Real problem of manipulating deformable linear objects (DLOs) based on visual perception. Working with a parameterised set of DLOs, we use likelihood-free inference (LFI) to compute the posterior distributions for the physical parameters using which we can approximately simulate the behaviour of each specific DLO. We use these posteriors for domain randomisation while training, in simulation, object-specific visuomotor policies (i.e. assuming only visual and proprioceptive sensory) for a DLO reaching task, using model-free reinforcement learning. We demonstrate the utility of this approach by deploying sim-trained DLO manipulation policies in the real world in a zero-shot manner, i.e. without any further fine-tuning. In this context, we evaluate the capacity of a prominent LFI method to perform fine classification over the parametric set of DLOs, using only visual and proprioceptive data obtained in a dynamic manipulation trajectory. We then study the implications of the resulting domain distributions in sim-based policy learning and real-world performance.
中文标题/摘要
标题:基于视觉驱动的可变形线性物体操作中物体中心代理适应的实2仿2实分布处理方法
我们提出了一种集成(或端到端)框架,用于基于视觉感知操纵可变形线性物体(DLOs)的实2仿2实问题。使用参数化的DLO集合,我们使用无似然推断(LFI)来计算物理参数的后验分布,从而可以近似模拟每个特定DLO的行为。在训练过程中,我们使用这些后验分布进行领域随机化,在仿真中使用无模型强化学习为DLO抓取任务训练特定于物体的视知觉运动策略(即,假设只有视觉和本体感觉感知)。我们通过零样本方式部署仿真实训的DLO操作策略于现实世界中,即无需任何进一步微调。在此背景下,我们评估了一种流行的LFI方法在仅使用动态操作轨迹中获得的视觉和本体感觉数据对参数化DLO集合进行精细分类的能力。然后我们研究了基于仿真的策略学习和现实世界性能中结果领域分布的影响。
Summary / 总结
The paper presents an integrated framework for manipulating deformable linear objects (DLOs) in the real world using visual perception. It uses likelihood-free inference (LFI) to compute the posterior distributions of physical parameters for each DLO, enabling domain randomisation in simulation for training object-specific visuomotor policies. The trained policies are deployed in the real world zero-shot, i.e. without further fine-tuning, demonstrating the approach's utility. The paper also evaluates how well a prominent LFI method distinguishes between DLOs in the parametric set from visual and proprioceptive data, and studies how the resulting domain distributions affect sim-based policy learning and real-world performance.
论文提出了一种集成框架,利用视觉感知来操纵变形线性物体(DLOs)。它使用无似然推断来计算每个DLO的物理参数后验分布,从而在模拟中进行领域随机化以训练物体特定的视觉-运动策略。训练好的策略无需进一步微调即可在现实世界中部署,展示了该方法的实用性和无似然推断方法在模拟到现实世界转移中的有效性。
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
Authors: Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He
First: 2026-03-10T17:18:53+00:00 · Latest: 2026-03-10T17:18:53+00:00
Comments: Accepted by CVPR26, codes and weights are publicly available
Abstract
Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
中文标题/摘要
标题:WikiCLIP:开放领域视觉实体识别的一种高效对比基准
开放领域视觉实体识别(VER)旨在将图像与维基百科等百科知识库中的实体关联起来。最近为VER量身定制的生成方法表现出强大的性能,但计算成本高昂,限制了其可扩展性和实际部署。在本文中,我们重新审视了VER中的对比范式,并引入了WikiCLIP,这是一种简单而有效的框架,为开放领域VER建立了强大的高效基准。WikiCLIP利用大型语言模型嵌入作为知识丰富的实体表示,并通过视觉引导知识适配器(VGKA)在图像块(patch)级别对文本语义与视觉线索进行对齐。为了进一步促进细粒度的区分,一种硬负样本合成机制在训练过程中生成视觉相似但语义不同的负样本。在流行的开放领域VER基准测试,如OVEN上,实验结果表明,WikiCLIP显著优于强大的基准。具体而言,WikiCLIP在具有挑战性的OVEN未见过的集合上实现了16%的改进,而与领先的生成模型AutoVER相比,推理延迟降低了近100倍。项目页面可在https://artanic30.github.io/project_pages/WikiCLIP/获取。
Summary / 总结
WikiCLIP is a simple yet effective framework for open-domain visual entity recognition that uses large language model embeddings and a Vision-Guided Knowledge Adaptor to align textual and visual information. It outperforms strong baselines on the OVEN benchmark, achieving a 16% improvement on the unseen set and reducing inference latency by nearly 100 times compared to the leading generative model, AutoVER.
WikiCLIP 是一个用于开放领域视觉实体识别的对比框架,旨在高效地将图像与维基百科等知识库中的实体关联起来。它使用大型语言模型嵌入和视觉引导知识适配器来对齐文本和视觉信息,并通过困难负样本合成机制增强区分能力。在 OVEN 等基准测试上,WikiCLIP 出色地超过了强基线,实现了在未见过的集合上 16% 的改进,并将推理延迟降低了近 100 倍,相比领先的生成模型 AutoVER。
The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
Authors: Alper Yıldırım
First: 2026-03-05T14:41:01+00:00 · Latest: 2026-03-10T17:16:04+00:00
Comments: 19 pages, 2 figures, 3 tables. Code available at https://github.com/AlperYildirim1/geometric-grokking
Abstract
Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology to observe training dynamics. We study grokking - delayed generalization in Transformers trained on cyclic modular addition (Zp) - investigating if specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing. First, we introduce a fully bounded spherical topology enforcing L2 normalization throughout the residual stream and an unembedding matrix with a fixed temperature scale. This removes magnitude-based degrees of freedom, reducing grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides data-dependent query-key routing with a uniform distribution, reducing the attention layer to a Continuous Bag-of-Words (CBOW) aggregator. Despite removing adaptive routing, these models achieve 100% generalization across all seeds and bypass the grokking delay entirely. To evaluate whether this acceleration is a task-specific geometric alignment rather than a generic optimization stabilizer, we use non-commutative S5 permutation composition as a negative control. Enforcing spherical constraints on S5 does not accelerate generalization. This suggests eliminating the memorization phase depends strongly on aligning architectural priors with the task's intrinsic symmetries. Together, these findings provide interventional evidence that architectural degrees of freedom substantially influence grokking, suggesting a predictive structural perspective on training dynamics.
中文标题/摘要
标题:理解“grokking”的几何归纳偏见:通过架构拓扑绕过相变
机制可解释性通常依赖于对训练网络的后验分析。我们采用干预性方法:通过修改架构拓扑来测试假设,观察训练动态。我们研究了Transformer在循环模块加法(Zp)训练中的“grokking”现象,即延迟泛化,探讨特定的架构自由度是否延长了记忆阶段。 我们确定了标准Transformer中的两个独立结构因素:无界的表示幅度和数据依赖的注意力路由。首先,我们引入了一个完全有界的球形拓扑,通过在整个残差流中实施L2归一化和一个固定温度比例的嵌入矩阵,消除了基于幅度的自由度,不使用权重衰减的情况下将“grokking”起始时间减少了20多倍。其次,均匀注意力消融用均匀分布覆盖了数据依赖的查询-键路由,将注意力层简化为连续的词袋(CBOW)聚合器。尽管消除了自适应路由,这些模型在所有种子上实现了100%的泛化,并完全绕过了“grokking”延迟。 为了评估这种加速是否是特定任务的几何对齐,而不是通用的优化稳定器,我们使用非交换的S5置换组合作为负面控制。在S5上施加球形约束不会加速泛化。这表明消除记忆阶段强烈依赖于使架构先验与任务固有的对称性对齐。这些发现共同提供了干预性证据,表明架构自由度显著影响“grokking”,暗示了对训练动态的预测性结构视角。
Summary / 总结
The study investigates the geometric inductive bias of grokking in Transformers by modifying architectural topology. It identifies two factors: unbounded representational magnitude and data-dependent attention routing. Enforcing a bounded spherical topology reduces grokking onset time by over 20x without weight decay, and replacing query-key routing with uniform attention yields 100% generalization across all seeds while bypassing the grokking delay entirely. A negative control with non-commutative S5 permutation composition shows no acceleration under spherical constraints, suggesting that the effect depends on aligning architectural priors with the task's intrinsic symmetries rather than on generic optimization stabilization.
研究通过修改架构拓扑来探索Transformer中grokking的几何归纳偏差。通过引入完全有界的球形拓扑和均匀注意力,研究将grokking的起始时间减少了超过20倍,并完全绕过了记忆阶段。研究结果表明,架构的自由度显著影响grokking,与任务的内在对称性对齐以加速泛化。
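The two interventions described in the grokking abstract, bounding representational magnitude and replacing query-key routing with uniform attention, can be illustrated with a few lines of plain Python. This is a sketch under assumptions, not the authors' implementation: the function names are hypothetical, and normalizing the unembedding rows is an assumption made here so that the logits are bounded by 1/temperature.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Project a vector onto the unit sphere (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def uniform_attention(values):
    """Uniform attention: every position gets equal weight, so the output is
    the mean of the value vectors (a bag-of-words style aggregate)."""
    n = len(values)
    return [sum(column) / n for column in zip(*values)]


def bounded_logits(hidden, unembed_rows, temperature):
    """Cosine similarity between the normalized hidden state and each
    normalized unembedding row, divided by a fixed temperature."""
    h = l2_normalize(hidden)
    return [
        sum(a * b for a, b in zip(h, l2_normalize(row))) / temperature
        for row in unembed_rows
    ]
```

Because both vectors are unit-norm, every logit in this sketch lies in the range from -1/temperature to 1/temperature, so the model cannot lower the loss simply by growing weight magnitudes.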
MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems
Authors: Yunhang Qian, Xiaobin Hu, Jiaquan Yu, Siyang Xin, Xiaokun Chen, Jiangning Zhang, Peng-Tao Jiang, Jiawei Liu, Hongwei Bran Li
First: 2026-03-10T17:03:11+00:00 · Latest: 2026-03-10T17:03:11+00:00
Abstract
While Multi-Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non-uniform data ingestion pipelines, inconsistent visual-reasoning evaluation, and a lack of cross-specialty benchmarking. To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi-agent systems. MedMASLab introduces: (1) A standardized multimodal agent communication protocol that enables seamless integration of 11 heterogeneous MAS architectures across 24 medical modalities. (2) An automated clinical reasoning evaluator, a zero-shot semantic evaluation paradigm that overcomes the limitations of lexical string-matching by leveraging large vision-language models to verify diagnostic logic and visual grounding. (3) The most extensive benchmark to date, spanning 11 organ systems and 473 diseases, standardizing data from 11 clinical benchmarks. Our systematic evaluation reveals a critical domain-specific performance gap: while MAS improves reasoning depth, current architectures exhibit significant fragility when transitioning between specialized medical sub-domains. We provide a rigorous ablation of interaction mechanisms and cost-performance trade-offs, establishing a new technical baseline for future autonomous clinical systems. The source code and data is publicly available at: https://github.com/NUS-Project/MedMASLab/
中文标题/摘要
标题:MedMASLab:统一编排框架,用于评估多模态医疗多智能体系统
尽管多智能体系统(MAS)在复杂临床决策支持方面显示出潜力,但该领域仍受到架构碎片化和缺乏标准化多模态集成的阻碍。当前的医疗MAS研究遭受非统一数据摄入管道、不一致的视觉推理评估和跨专科基准测试不足的困扰。为了解决这些挑战,我们提出了MedMASLab,这是一种用于多模态医疗多智能体系统的统一框架和基准平台。MedMASLab 引入了:(1)一种标准化的多模态智能体通信协议,使11种异构MAS架构在24种医疗模态下无缝集成。(2)一种自动临床推理评估器,这是一种零样本语义评估范式,通过利用大型视觉语言模型来验证诊断逻辑和视觉定位,克服了词法字符串匹配的局限性。(3)迄今为止最广泛的基准测试,涵盖了11个器官系统和473种疾病,标准化了11个临床基准的数据。我们的系统评估揭示了一个关键的专业领域性能差距:尽管MAS提高了推理深度,但当前架构在从专业医学子领域过渡时表现出显著的脆弱性。我们提供了交互机制和成本-性能权衡的严格分析,为未来的自主临床系统建立了新的技术基准。源代码和数据可在:https://github.com/NUS-Project/MedMASLab/ 公开获取。
Summary / 总结
MedMASLab is a unified framework designed to benchmark multimodal medical multi-agent systems, addressing the fragmented architecture and lack of standardization in current research. It introduces a standardized communication protocol for integrating 11 heterogeneous MAS architectures across 24 medical modalities, an automated clinical reasoning evaluator, and the largest benchmark to date, covering 11 organ systems and 473 diseases. Key findings show a significant domain-specific performance gap, with current architectures showing fragility when transitioning between medical sub-domains, highlighting the need for further research in this area.
MedMASLab 是一个统一框架,旨在评估多模态医疗多智能体系统,通过引入标准化通信协议、自动化临床推理评估器以及迄今为止最大的基准测试,解决当前研究中的碎片化问题。评估结果显示了显著的领域特定性能差距,并提供了关于交互机制和成本性能权衡的见解,为未来的临床系统设定了新的技术基准。
EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering
Authors: Runnan Lu, Yuxuan Zhang, Jiaming Liu, Haofan Wang, Yiren Song
First: 2025-05-30T09:55:39+00:00 · Latest: 2026-03-10T16:59:39+00:00
Abstract
Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still an unexplored area. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with multilingual character tokens encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations as well as a high-quality dataset of 20K annotated images, which are used for pretraining and fine-tuning respectively. Extensive experiments and evaluations demonstrate the effectiveness and advancement of our approach in multilingual text rendering, visual quality, and layout-aware text integration.
中文标题/摘要
标题:EasyText:可控扩散变换器多语言文本渲染框架
使用扩散模型生成准确的多语言文本长期以来一直被渴望但仍然具有挑战性。最近的方法在单语言文本渲染方面取得了进展,但渲染任意语言仍然是一个未探索的领域。本文介绍了基于DiT(扩散变换器)的文本渲染框架EasyText,该框架将去噪潜在变量与多语言字符标记连接起来。我们提出了字符定位编码和位置编码插值技术以实现可控和精确的文本渲染。此外,我们构建了一个包含100万个多语言图像-文本注释的大规模合成文本图像数据集以及一个包含2万张标注图像的高质量数据集,分别用于预训练和微调。广泛的实验和评估证明了我们方法在多语言文本渲染、视觉质量和布局感知文本集成方面的有效性和进步。
Summary / 总结
The paper addresses the challenge of generating accurate multilingual text using diffusion models. It introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with multilingual character tokens. The authors propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. They also created a large dataset with 1 million multilingual image-text annotations for pretraining and a high-quality dataset of 20K annotated images for fine-tuning. The experiments show that EasyText improves multilingual text rendering, visual quality, and layout-aware text integration.
研究旨在使用扩散模型生成准确的多语言文本,但由于需要渲染任意语言,这一直具有挑战性。EasyText 是一种基于 DiT 的文本渲染框架,用于将去噪潜变量与多语言字符标记连接起来。该方法使用字符定位编码和位置编码插值实现可控且精确的文本渲染。实验表明,EasyText 在多语言文本渲染、视觉质量和布局感知文本集成方面优于现有方法。
Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs
Authors: Zorik Gekhman, Roee Aharoni, Eran Ofek, Mor Geva, Roi Reichart, Jonathan Herzig
First: 2026-03-10T16:59:20+00:00 · Latest: 2026-03-10T16:59:20+00:00
Abstract
While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Nevertheless, we find that enabling reasoning substantially expands the capability boundary of the model's parametric knowledge recall, unlocking correct answers that are otherwise effectively unreachable. Why does reasoning aid parametric knowledge recall when there are no complex reasoning steps to be done? To answer this, we design a series of hypothesis-driven controlled experiments, and identify two key driving mechanisms: (1) a computational buffer effect, where the model uses the generated reasoning tokens to perform latent computation independent of their semantic content; and (2) factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval. Importantly, this latter generative self-retrieval mechanism carries inherent risks: we demonstrate that hallucinating intermediate facts during reasoning increases the likelihood of hallucinations in the final answer. Finally, we show that our insights can be harnessed to directly improve model accuracy by prioritizing reasoning trajectories that contain hallucination-free factual statements.
中文标题/摘要
标题:思考以回忆:推理如何解锁LLMs中的参数知识
虽然在LLMs中推理在数学、代码生成和多跳事实问题中自然地发挥作用,但其对简单的一跳事实问题的影响仍然不清楚。这类问题不需要逐步的逻辑分解,使得推理的实用性显得非常反直觉。然而,我们发现,启用推理显著扩展了模型参数知识回忆的能力边界,解锁了原本几乎无法达到的正确答案。为什么在没有复杂推理步骤的情况下,推理仍能帮助参数知识回忆?为了解答这个问题,我们设计了一系列假设驱动的控制实验,并识别出两个关键驱动机制:(1)计算缓冲效应,模型使用生成的推理标记进行与语义内容无关的潜在计算;(2)事实预热,生成相关主题的事实作为语义桥梁,促进正确答案的检索。重要的是,这种生成性的自我检索机制存在固有的风险:我们证明,在推理过程中虚构中间事实会增加最终答案虚构的可能性。最后,我们展示了我们的见解可以直接通过优先选择包含无虚构事实陈述的推理路径来提高模型的准确性。
Summary / 总结
This study explores how reasoning enhances the recall of parametric knowledge in LLMs, particularly for simple, single-hop factual questions. Through controlled experiments, the research identifies two mechanisms: a computational buffer effect and factual priming. The computational buffer effect involves the model using reasoning tokens for latent computation independent of their semantic content, while factual priming uses generated, topically related facts as a semantic bridge to retrieve correct answers. However, hallucinating intermediate facts during reasoning increases the likelihood of hallucinations in the final answer. The findings suggest that prioritizing reasoning trajectories that contain hallucination-free factual statements can improve model accuracy.
该研究探讨了推理如何增强LLMs中参数知识的回忆能力,特别是对于简单的单跳事实问题。研究人员设计了控制实验来识别两种机制:计算缓冲效应和事实提示。计算缓冲效应涉及模型使用推理令牌进行与其语义内容无关的潜在计算,而事实提示指生成主题相关的事实,作为语义桥梁促进正确答案的检索。然而,在推理过程中幻觉出错误的中间事实会增加最终答案出现幻觉的可能性。研究结果表明,优先选择包含无幻觉事实陈述的推理路径可以提高模型的准确性。
Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale
Authors: Tobias Jülg, Pierre Krack, Seongjin Bien, Yannik Blei, Khaled Gamal, Ken Nakahara, Johannes Hechtl, Roberto Calandra, Wolfram Burgard, Florian Walter
Venue: ICRA 2026
First: 2025-09-18T13:12:16+00:00 · Latest: 2026-03-10T16:58:47+00:00
Comments: Accepted at ICRA 2026
Abstract
Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine learning-focused workflow that is centered around models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning from and to real-world experiments. In this work, we close this gap by introducing Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies. At its core, RCS features a modular and easily extensible layered architecture with a unified interface for simulated and physical robots, facilitating sim-to-real transfer. Despite its minimal footprint and dependencies, it offers a complete feature set, enabling both real-world experiments and large-scale training in simulation. Our contribution is twofold: First, we introduce the architecture of RCS and explain its design principles. Second, we evaluate its usability and performance along the development cycle of VLA and RL policies. Our experiments also provide an extensive evaluation of Octo, OpenVLA, and Pi Zero on multiple robots and shed light on how simulation data can improve real-world policy performance. Our code, datasets, weights, and videos are available at: https://robotcontrolstack.github.io/
中文标题/摘要
标题:机器人控制堆栈:大规模机器人学习的精简生态系统
视觉-语言-行动模型(VLAs)标志着机器人学习的重大转变。它们用大规模数据收集和特定场景的微调取代了专家策略的专业架构和任务定制组件。在以模型为中心、注重大规模训练的机器学习工作流程中,传统的机器人软件框架成为瓶颈,而机器人模拟仅提供有限的支持,用于从模拟到现实世界的实验过渡。在这项工作中,我们通过引入机器人控制堆栈(RCS),一个从头开始设计的精简生态系统,来弥合这一差距,该生态系统旨在支持大规模通用策略下的机器人学习研究。RCS的核心是一个模块化且易于扩展的分层架构,具有统一的接口,适用于模拟和物理机器人,促进从模拟到现实的过渡。尽管其占用空间和依赖性很小,但它提供了完整的功能集,支持现实世界的实验和大规模的模拟训练。我们的贡献有两个方面:首先,我们介绍了RCS的架构及其设计原则;其次,我们评估了其在VLAs和RL策略开发周期中的可用性和性能。我们的实验还对Octo、OpenVLA和Pi Zero在多种机器人上的表现进行了广泛的评估,并揭示了模拟数据如何提高现实世界策略性能。我们的代码、数据集、权重和视频可在:https://robotcontrolstack.github.io/ 获取。
Summary / 总结
The research motivation is to address the limitations of traditional robotics software frameworks and robot simulations in supporting large-scale robot learning. The main method is Robot Control Stack (RCS), a lean, modular, and extensible ecosystem with a unified interface for simulated and physical robots that facilitates sim-to-real transfer. The experiments evaluate RCS along the development cycle of VLA and RL policies, benchmark Octo, OpenVLA, and Pi Zero on multiple robots, and shed light on how simulation data can improve real-world policy performance.
本文介绍了Robot Control Stack (RCS),这是一个面向大规模通用策略(如Vision-Language-Action模型)机器人学习研究的轻量级生态系统。RCS采用模块化分层架构,并为模拟和物理机器人提供统一接口,便于从模拟到现实的迁移。评估表明,RCS既支持现实世界的实验,也支持大规模的模拟训练;实验还在多种机器人上评估了Octo、OpenVLA和Pi Zero,并探讨了模拟数据如何提升现实世界中的策略性能。
Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
Authors: Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong
First: 2026-03-10T16:50:32+00:00 · Latest: 2026-03-10T16:50:32+00:00
Abstract
Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
中文标题/摘要
标题:让VLMs走上球场:体育场景中的空间智能基准测试
体育运动长期以来一直吸引着广泛的关注,因为它们推动了人类身体和认知能力的极限。随着对视觉语言模型(VLMs)的空间智能兴趣日益增长,体育运动为理解高强度的人体运动和动态物体交互提供了一个自然的测试平台。为此,我们提出了CourtSI,这是首个针对体育场景的空间智能大规模数据集。CourtSI 包含超过100万个问答对,按照一个整体性的分类体系组织,系统地涵盖了空间计数、距离测量、定位和关系推理,覆盖了包括羽毛球、网球和乒乓球在内的代表性隔网运动。利用定义明确的场地几何作为度量锚点,我们开发了一种半自动数据引擎来重建体育场景,从而实现CourtSI的可扩展构建。此外,我们引入了CourtSI-Bench,这是一个高质量的评估基准,包含3,686个经过严格人工验证的问答对。我们在CourtSI-Bench上评估了25个专有和开源的VLMs,揭示了人类与AI之间仍然存在的性能差距,以及从现有空间智能基准向体育场景泛化的能力有限。这些发现表明,体育场景暴露了现有基准所刻画的空间智能能力的局限性。进一步地,在CourtSI上对Qwen3-VL-8B进行微调后,其在CourtSI-Bench上的准确率提高了23.5个百分点。调整后的模型还能够有效泛化到基于类似但未见过的运动构建的CourtSI-Ext评估集,并展示了增强的空间感知解说生成能力。总之,这些发现表明,CourtSI为推动VLMs在体育中的空间智能提供了可扩展的途径。
Summary / 总结
The research aims to evaluate the spatial intelligence of vision-language models (VLMs) using sports as a testbed. The study introduces CourtSI, a large-scale dataset with over 1 million QA pairs covering spatial counting, distance measurement, localization, and relational reasoning in badminton, tennis, and table tennis. CourtSI-Bench, a human-verified benchmark of 3,686 QA pairs, is used to assess 25 VLMs, revealing a remaining human-AI performance gap and limited generalization from existing benchmarks. Fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points; the adapted model also generalizes to CourtSI-Ext, built on a similar but unseen sport, and produces better spatial-aware commentary. This indicates that CourtSI is a valuable resource for advancing VLMs' spatial intelligence in sports scenarios.
论文介绍了CourtSI,这是一个面向体育场景空间智能的大规模数据集,包含超过100万个问答对,涵盖空间计数、距离测量、定位和关系推理。论文在CourtSI-Bench上评估了25个VLM,揭示了人类与AI之间仍然存在的性能差距,以及现有基准向体育场景泛化的能力有限。在CourtSI上微调Qwen3-VL-8B后,其在CourtSI-Bench上的准确率提高了23.5个百分点;微调后的模型还能泛化到基于未见过的运动构建的CourtSI-Ext,并增强了空间感知的解说生成能力。
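The abstract mentions using well-defined court geometry as metric anchors. One common way to do this, shown here only as a minimal sketch and not as the authors' data engine, is to fit a planar homography from four court corners in the image to their known positions in metres, then measure distances on the court plane. The pixel coordinates and all function names below are hypothetical; the court size is the standard badminton doubles court (6.1 m by 13.4 m).

```python
from math import hypot

# Standard badminton doubles court, in metres (width x length).
COURT_W, COURT_L = 6.1, 13.4


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_homography(image_pts, court_pts):
    """Estimate a 3x3 homography (last entry fixed to 1) from four point pairs."""
    a, b = [], []
    for (u, v), (x, y) in zip(image_pts, court_pts):
        a.append([u, v, 1, 0, 0, 0, -u * x, -v * x])
        b.append(x)
        a.append([0, 0, 0, u, v, 1, -u * y, -v * y])
        b.append(y)
    return solve_linear(a, b) + [1.0]


def to_court(h, u, v):
    """Map an image pixel (u, v) to court-plane coordinates in metres."""
    w = h[6] * u + h[7] * v + h[8]
    return ((h[0] * u + h[1] * v + h[2]) / w,
            (h[3] * u + h[4] * v + h[5]) / w)


def metric_distance(h, p, q):
    """Distance in metres between two pixels that lie on the court plane."""
    (x1, y1), (x2, y2) = to_court(h, *p), to_court(h, *q)
    return hypot(x2 - x1, y2 - y1)


# Hypothetical pixel coordinates of the four court corners in one frame.
image_corners = [(100, 400), (540, 400), (420, 100), (220, 100)]
court_corners = [(0.0, 0.0), (COURT_W, 0.0), (COURT_W, COURT_L), (0.0, COURT_L)]
h = fit_homography(image_corners, court_corners)
print(metric_distance(h, (100, 400), (540, 400)))  # distance along the near baseline
```

The mapping is only valid for points on the court plane, such as a player's feet. Heights above the ground (a shuttlecock in flight, for example) need additional 3D information.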
MSSR: Memory-Aware Adaptive Replay for Continual LLM Fine-Tuning
Authors: Yiyang Lu, Yu He, Jianlong Chen, Hongyuan Zha
First: 2026-03-10T16:49:44+00:00 · Latest: 2026-03-10T16:49:44+00:00
Abstract
Continual fine-tuning of large language models (LLMs) is becoming increasingly crucial as these models are deployed in dynamic environments where tasks and data distributions evolve over time. While strong adaptability enables rapid acquisition of new knowledge, it also exposes LLMs to catastrophic forgetting, where previously learned skills degrade during sequential training. Existing replay-based strategies, such as fixed interleaved replay, accuracy-supervised, and loss-driven scheduling, remain limited: some depend on heuristic rules and provide only partial mitigation of forgetting, while others improve performance but incur substantial computational overhead. Motivated by retention dynamics under sequential fine-tuning, we propose Memory-Inspired Sampler and Scheduler Replay (MSSR), an experience replay framework that estimates sample-level memory strength and schedules rehearsal at adaptive intervals to mitigate catastrophic forgetting while maintaining fast adaptation. Extensive experiments across three backbone models and 11 sequential tasks show that MSSR consistently outperforms state-of-the-art replay baselines, with particularly strong gains on reasoning-intensive and multiple-choice benchmarks.
中文标题/摘要
标题:MSSR:面向持续LLM微调的记忆感知自适应重放
随着大型语言模型(LLMs)部署在动态环境中,任务和数据分布随时间演变,持续微调LLMs变得越来越重要。虽然强大的适应性能够快速获取新知识,但也使LLMs面临灾难性遗忘的问题,即在顺序训练过程中之前学习的技能会退化。现有的基于重放的策略,如固定交错重放、准确度监督和损失驱动调度,仍然有限:一些依赖于启发式规则,只能部分缓解遗忘,而另一些则提高了性能但带来了巨大的计算开销。受顺序微调下的保留动态启发,我们提出了记忆启发式采样和调度重放(MSSR),这是一种经验重放框架,通过估计样本级别的记忆强度并在自适应间隔内安排复习来缓解灾难性遗忘,同时保持快速适应。在三个骨干模型和11个顺序任务上的广泛实验表明,MSSR在所有基准测试中都优于最先进的重放基线,特别是在推理密集型和多项选择基准测试中表现尤为突出。
Summary / 总结
The research aims to address catastrophic forgetting in the continual fine-tuning of large language models (LLMs) by proposing MSSR, a memory-aware adaptive replay framework. MSSR estimates sample-level memory strength and schedules rehearsal adaptively to mitigate forgetting while ensuring fast adaptation. Experiments across three backbone models and 11 sequential tasks demonstrate that MSSR outperforms existing replay baselines, especially on reasoning-intensive and multiple-choice benchmarks.
论文提出了一种记忆导向的自适应回放框架MSSR,以解决大规模语言模型(LLMs)连续微调中的灾难性遗忘问题。MSSR通过估计样本级别的记忆强度并调度回放间隔来减轻遗忘,同时实现快速适应。实验结果显示,MSSR在三个骨干模型和11个连续任务上优于现有回放基线,特别是在推理密集型和多项选择基准上表现出显著优势。
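To make "sample-level memory strength" and "adaptive intervals" concrete, here is a minimal sketch. It is not MSSR itself: the abstract does not give the estimator or the scheduler. The sketch assumes an exponential forgetting curve, replays a sample once its predicted retention drops below a threshold, and lengthens the interval after each rehearsal. All names and constants are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    sample_id: str
    strength: float = 1.0  # assumed memory strength, measured in training steps
    last_seen: int = 0     # step at which the sample was last trained or replayed


def retention(item, step):
    """Predicted retention under an assumed exponential forgetting curve."""
    return math.exp(-(step - item.last_seen) / item.strength)


def select_for_replay(buffer, step, threshold=0.5, budget=2):
    """Pick up to `budget` samples whose predicted retention fell below threshold."""
    due = [item for item in buffer if retention(item, step) < threshold]
    due.sort(key=lambda item: retention(item, step))  # weakest memories first
    return due[:budget]


def rehearse(item, step, growth=2.0):
    """After a replay, reset the clock and lengthen the next replay interval."""
    item.last_seen = step
    item.strength *= growth


# Two buffered samples: one fragile, one well retained.
buffer = [MemoryItem("a", strength=1.0), MemoryItem("b", strength=10.0)]
for step in range(1, 4):
    for item in select_for_replay(buffer, step):
        rehearse(item, step)
        print(step, "replayed", item.sample_id)
```

Because strength grows with each rehearsal, well-retained samples are replayed less and less often, which is how a scheduler of this kind avoids the cost of replaying at a fixed rate.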
Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts
Authors: Hongbo Bo, Jingyu Hu, Weiru Liu
First: 2026-03-10T16:47:25+00:00 · Latest: 2026-03-10T16:47:25+00:00
Abstract
Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems. However, existing research on the behaviour of LLM-based multi-agents relies on ad hoc prompts and lacks a principled policy perspective. Different from reinforcement learning, we investigate whether prompt-as-action can be parameterized so as to construct a lightweight policy which consists of a sequence of state-action pairs to influence conversational behaviours without training. Our framework regards prompts as actions executed by LLMs, and dynamically constructs prompts through five components based on the current state of the agent. To test the effectiveness of parameterized control, we evaluated the dialogue flow based on five indicators: responsiveness, rebuttal, evidence usage, non-repetition, and stance shift. We conduct experiments using different LLM-driven agents in two discussion scenarios related to the general public and show that prompt parameterization can influence the dialogue dynamics. This result shows that policy-parameterised prompts offer a simple and effective mechanism to influence the dialogue process, which will help the research of multi-agent systems in the direction of social simulation.
中文标题/摘要
标题:通过策略参数化提示影响LLM多智能体对话
大型语言模型(LLMs)已成为多智能体系统的新型范式。然而,现有基于LLM的多智能体行为研究依赖于临时拼凑的提示,缺乏原则性的策略视角。不同于强化学习,我们研究能否将“提示即动作”参数化,从而构建一个由状态-动作对序列组成的轻量级策略,在无需训练的情况下影响对话行为。我们的框架将提示视为由LLM执行的动作,并根据智能体的当前状态,通过五个组件动态构建提示。为了测试参数化控制的有效性,我们根据响应性、反驳、证据使用、不重复和立场转变五个指标评估了对话流程。我们在两个与公众相关的讨论场景中使用不同的LLM驱动智能体进行实验,表明提示参数化可以影响对话动态。这一结果表明,策略参数化提示提供了一种简单而有效的机制来影响对话过程,这将有助于多智能体系统研究朝着社会模拟的方向发展。
Summary / 总结
This study addresses the limitations of existing research on LLM-based multi-agent systems by proposing a framework that parameterizes prompts to influence conversational behaviors. The method involves dynamically constructing prompts based on the current state of the agent, treating prompts as actions. Experiments evaluated dialogue flow using five indicators and demonstrated that parameterized prompts can effectively influence dialogue dynamics in two discussion scenarios, highlighting the potential of this approach for social simulation in multi-agent systems.
该研究针对现有基于LLM的多智能体系统研究的局限性,提出了一种通过参数化提示来影响对话行为的框架。方法包括基于当前智能体的状态动态构建提示,将提示视为动作。实验使用五个指标评估对话流程,并展示了参数化提示能够有效地影响两个讨论场景中的对话动态,突显了该方法在多智能体系统社会模拟方向的应用潜力。
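The idea of treating a prompt as an action chosen from the agent's state can be sketched as follows. The abstract says prompts are built from five components but does not name them. The five parts below, along with the state fields, the numeric parameters, and the policy rule, are hypothetical placeholders rather than the paper's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    turn: int
    opponent_made_claim: bool


@dataclass(frozen=True)
class Action:
    rebuttal: float  # 0..1, how strongly to rebut the previous speaker
    evidence: float  # 0..1, how much to rely on evidence


def policy(state):
    """Hand-written state -> action mapping; no training is involved."""
    if state.opponent_made_claim:
        return Action(rebuttal=0.9, evidence=0.7)
    return Action(rebuttal=0.1, evidence=0.4)


def level(x):
    """Turn a numeric parameter into an instruction adverb."""
    return "strongly" if x >= 0.5 else "lightly"


def build_prompt(state, action, topic):
    """Assemble the prompt from five (hypothetical) components."""
    parts = [
        f"Topic: {topic}.",
        f"This is turn {state.turn} of the discussion.",
        f"Rebut the previous speaker {level(action.rebuttal)}.",
        f"Rely on evidence {level(action.evidence)}.",
        "Do not repeat points you have already made.",
    ]
    return " ".join(parts)


state = State(turn=3, opponent_made_claim=True)
print(build_prompt(state, policy(state), "public transport funding"))
```

Changing the numbers returned by `policy` changes the prompt text, and through it the agent's behaviour. That is the sense in which the prompt is parameterized without any model training.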
LCA: Local Classifier Alignment for Continual Learning
Authors: Tung Tran, Danilo Vasconcellos Vargas, Khoat Than
First: 2026-03-10T16:46:09+00:00 · Latest: 2026-03-10T16:46:09+00:00
Abstract
A fundamental requirement for intelligent systems is the ability to learn continuously under changing environments. However, models trained in this regime often suffer from catastrophic forgetting. Leveraging pre-trained models has recently emerged as a promising solution, since their generalized feature extractors enable faster and more robust adaptation. While some earlier works mitigate forgetting by fine-tuning only on the first task, this approach quickly deteriorates as the number of tasks grows and the data distributions diverge. More recent research instead seeks to consolidate task knowledge into a unified backbone, or to adapt the backbone as new tasks arrive. However, such approaches may create a (potential) mismatch between task-specific classifiers and the adapted backbone. To address this issue, we propose a novel Local Classifier Alignment (LCA) loss to better align the classifier with the backbone. Theoretically, we show that this LCA loss can enable the classifier not only to generalize well across all observed tasks, but also to improve robustness. Furthermore, we develop a complete solution for continual learning, following the model merging approach and using LCA. Extensive experiments on several standard benchmarks demonstrate that our method often achieves leading performance, sometimes surpassing state-of-the-art methods by a large margin.
中文标题/摘要
标题:LCA:连续学习中的局部分类器对齐
智能系统的基本要求是在不断变化的环境中持续学习的能力。然而,在这种模式下训练的模型往往会出现灾难性遗忘。利用预训练模型最近被认为是一种有希望的解决方案,因为它们泛化的特征提取器能够实现更快和更稳健的适应。虽然一些早期的工作通过仅在第一个任务上进行微调来减轻遗忘,但随着任务数量的增长和数据分布的差异,这种方法很快就会失效。更近期的研究则试图将任务知识整合到一个统一的骨干网络中,或者在新任务到来时适应骨干网络。然而,这些方法可能会在任务特定分类器和适应后的骨干网络之间造成(潜在的)不匹配。为了解决这个问题,我们提出了一种新的“局部分类器对齐”(LCA)损失,以更好地使分类器与骨干网络对齐。理论上,我们证明这种LCA损失可以使分类器不仅能够很好地泛化到所有已观察到的任务,还能提高鲁棒性。此外,我们还开发了一个完整的连续学习解决方案,遵循模型合并方法并使用LCA。在多个标准基准上的广泛实验表明,我们的方法通常能够实现领先性能,有时甚至在较大差距上超越了最先进的方法。
Summary / 总结
The paper addresses the challenge of catastrophic forgetting in continual learning by proposing a Local Classifier Alignment (LCA) loss to better align task-specific classifiers with the backbone. Theoretical analysis shows that LCA can enhance generalization and robustness. Experiments on standard benchmarks show that the proposed method often outperforms existing approaches, sometimes by a significant margin.
论文提出了一种局部分类器对齐(LCA)损失,以更好地使任务特定的分类器与已适应的主干网络对齐,解决连续学习中的灾难性遗忘问题。理论分析表明,LCA可以提高泛化能力和鲁棒性。在标准基准上的实验结果显示,所提出的方法通常优于现有方法,有时差距很大。
Benchmarking Political Persuasion Risks Across Frontier Large Language Models
Authors: Zhongren Chen, Joshua Kalla, Quan Le
First: 2026-03-10T16:42:05+00:00 · Latest: 2026-03-10T16:42:05+00:00
Abstract
Concerns persist regarding the capacity of Large Language Models (LLMs) to sway political views. Although prior research has claimed that LLMs are not more persuasive than standard political campaign practices, the recent rise of frontier models warrants further study. In two survey experiments (N=19,145) across bipartisan issues and stances, we evaluate seven state-of-the-art LLMs developed by Anthropic, OpenAI, Google, and xAI. We find that LLMs outperform standard campaign advertisements, with heterogeneity in performance across models. Specifically, Claude models exhibit the highest persuasiveness, while Grok exhibits the lowest. The results are robust across issues and stances. Moreover, in contrast to the findings in Hackenburg et al. (2025b) and Lin et al. (2025) that information-based prompts boost persuasiveness, we find that the effectiveness of information-based prompts is model-dependent: they increase the persuasiveness of Claude and Grok while substantially reducing that of GPT. We introduce a data-driven and strategy-agnostic LLM-assisted conversation analysis approach to identify and assess underlying persuasive strategies. Our work benchmarks the persuasive risks of frontier models and provides a framework for cross-model comparative risk assessment.
中文标题/摘要
标题:跨前沿大型语言模型的政治说服风险基准测试
关于大型语言模型(LLMs)能否影响政治观点的能力,仍存在担忧。尽管先前的研究声称LLMs并不比标准的政治竞选活动更具说服力,但最近前沿模型的兴起需要进一步研究。在涉及两党问题和立场的两项调查实验(N=19,145)中,我们评估了由Anthropic、OpenAI、Google和xAI开发的七种最先进的LLMs。我们发现,LLMs在说服力上优于标准竞选广告,但不同模型之间存在差异。具体来说,Claude模型表现出最高的说服力,而Grok表现出最低的说服力。结果在不同问题和立场上具有稳健性。此外,与Hackenburg等人(2025b)和Lin等人(2025)的研究结果相反,信息提示的说服力增强效果取决于模型:它们增加了Claude和Grok的说服力,但显著降低了GPT的说服力。我们引入了一种数据驱动且策略无关的LLM辅助对话分析方法,以识别和评估潜在的说服策略。我们的研究基准测试了前沿模型的说服风险,并提供了一种跨模型比较风险评估的框架。
Summary / 总结
This study investigates the political persuasion risks of frontier Large Language Models (LLMs) by comparing their performance with standard political campaign practices in two survey experiments involving 19,145 participants. The research finds that LLMs are more persuasive than traditional campaign ads, with Claude models showing the highest persuasiveness and Grok the lowest. The effectiveness of information-based prompts varies across models, enhancing persuasiveness for Claude and Grok but reducing it for GPT. The study introduces a data-driven approach to analyze and assess persuasive strategies, providing a benchmark for evaluating the risks of these models.
研究通过两项涉及19,145名参与者的调查实验,评估了七个最先进的大型语言模型(LLMs)在跨派别议题和立场上的政治说服风险。研究发现,LLMs在说服力上优于标准竞选广告,Claude模型表现出最高的说服力,而Grok则最低。信息导向的提示在不同模型中的效果不同,可以增加Claude和Grok的说服力,但会显著降低GPT的说服力。研究引入了一种数据驱动的方法来分析模型间的说服策略,提供了一种跨模型风险评估的框架。
DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary
Authors: Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang
First: 2026-03-10T16:40:41+00:00 · Latest: 2026-03-10T16:40:41+00:00
Abstract
Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/.
中文标题/摘要
标题:DISPLAY: 通过稀疏运动指导和多任务辅助实现可操控的人-物交互视频生成
以人为中心的视频生成技术取得了快速进展,但现有方法难以生成可控且物理上一致的人-物交互(HOI)视频。现有工作依赖密集的控制信号、模板视频或精心设计的文字提示,这限制了其灵活性和对新物体的泛化能力。我们提出了一种名为DISPLAY的框架,该框架由稀疏运动指导驱动,仅包含手腕关节坐标和与形状无关的物体边界框。这种轻量级的指导缓解了人与物体表示之间的不平衡,并使用户能够直观地控制。为了在如此稀疏的条件下提高保真度,我们提出了一种物体强调注意力机制,以提高物体的鲁棒性。为了解决高质量HOI数据稀缺的问题,我们进一步开发了一种多任务辅助训练策略,并配以专门的数据整理流程,使模型能够从可靠的HOI样本和辅助任务中受益。全面的实验表明,我们的方法在多种任务中实现了高保真、可控的HOI生成。项目页面可访问:https://mumuwei.github.io/DISPLAY/
Summary / 总结
The motivation for this work is to generate controllable and physically consistent Human-Object Interaction (HOI) videos, which existing methods struggle with due to their reliance on dense control signals, template videos, or text prompts. The method involves a framework called DISPLAY, which uses sparse motion guidance consisting of wrist joint coordinates and object bounding boxes, along with an Object-Stressed Attention mechanism to improve object robustness. The model also benefits from a Multi-Task Auxiliary Training strategy. The experiments show that DISPLAY can generate high-fidelity, controllable HOI videos across various tasks.
研究旨在通过引入名为DISPLAY的框架生成可控且物理上一致的人体-物体交互(HOI)视频,该框架使用稀疏运动指导,仅包括手腕关节坐标和与形状无关的物体边界框。这种轻量级指导缓解了人与物体表示之间的不平衡,并支持直观的用户控制。方法还包括一种物体强调注意力机制以提高物体的鲁棒性,以及配有专门数据整理流程的多任务辅助训练策略,以应对高质量HOI数据稀缺的问题。实验表明,DISPLAY能够在各种任务中实现高保真度和可控的HOI生成。
Do What I Say: A Spoken Prompt Dataset for Instruction-Following
Authors: Maike Züfle, Sara Papi, Fabian Retkowski, Szymon Mazurek, Marek Kasztelnik, Alexander Waibel, Luisa Bentivogli, Jan Niehues
First: 2026-03-10T16:39:46+00:00 · Latest: 2026-03-10T16:39:46+00:00
Abstract
Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact with speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output, spoken prompts do close the gap, highlighting the need for speech-based prompting in SLLM evaluation.
中文标题/摘要
标题:照我说的做:面向指令遵循的语音提示数据集
语音大型语言模型(SLLMs)迅速扩展,支持多种任务。这些模型通常通过文本提示进行评估,这可能无法反映用户在实际场景中与语音交互的情况。为解决这一差距,我们引入了DoWhatISay(DOWIS),这是一个多语言数据集,包含人类录制的语音和书面提示,旨在与任何现有基准搭配使用,以在语音指令条件下对SLLMs进行现实评估。该数据集覆盖了9项任务和11种语言,每项任务-语言对提供10种提示变体,涵盖五种风格。使用DOWIS,我们对最先进的SLLMs进行了基准测试,分析了提示模态、风格、语言和任务类型之间的相互作用。结果显示,文本提示在所有情况下都优于语音提示,特别是在低资源和跨语言设置中。只有在具有语音输出的任务中,语音提示才能缩小差距,突显了在SLLM评估中使用语音提示的必要性。
Summary / 总结
The research aims to evaluate Speech Large Language Models (SLLMs) under realistic spoken instruction conditions by introducing DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts. The dataset spans 9 tasks and 11 languages, with 10 prompt variants per task-language pair across five styles. Experiments show that text prompts consistently outperform spoken prompts, especially in low-resource and cross-lingual settings; spoken prompts close the gap only for tasks with speech output. This highlights the importance of speech-based prompting in evaluating SLLMs.
研究旨在通过引入DoWhatISay (DOWIS)数据集,评估语音大型语言模型(SLLMs)在真实语音指令条件下的表现。该数据集由人类录制的语音和书面提示组成,覆盖9项任务和11种语言,每个任务-语言对提供10种提示变体,涵盖五种风格。实验结果显示,文本提示始终优于语音提示,在低资源和跨语言设置中尤为明显;只有在具有语音输出的任务中,语音提示才能缩小这一差距,这强调了在SLLM评估中使用语音提示的重要性。