arXiv 论文速递

2026-02-26 03:57
Snapshot: 20260226_0357
Language Models use Lookbacks to Track Beliefs
Authors: Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger
First: 2025-05-20T17:59:45+00:00 · Latest: 2026-02-24T18:59:40+00:00
Comments: 38 pages, 50 figures. Code and data at https://belief.baulab.info/
Abstract
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovers a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the correct state OI and then the answer lookback retrieves the corresponding state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
中文标题/摘要
标题:语言模型使用回溯来追踪信念
语言模型(LMs)如何表示角色的信念,尤其是当这些信念可能与现实不符时?这个问题是理解LMs的心智理论(ToM)能力的核心。我们使用因果中介和抽象方法分析了LMs推理角色信念的能力。我们构建了一个名为CausalToM的数据集,由一些简单的故事组成,其中两个角色各自独立地改变两个物体的状态,且可能不知道对方的行为。我们的研究揭示了一种普遍存在的算法模式,我们称之为回溯(lookback)机制,它使LM能够在需要时回忆重要信息。LM将每个角色-物体-状态三元组的引用信息(表示为顺序ID,即Ordering IDs,OIs)共同放置在状态标记残差流的低秩子空间中,从而将三者绑定在一起。当被问及某个角色对某个物体状态的信念时,绑定回溯先检索出正确的状态OI,然后答案回溯再检索出相应的状态标记。当我们加入说明一个角色能(或不能)看到另一个角色的文本时,我们发现LM首先生成一个可见性ID,用于编码观察者角色与被观察者角色的OI之间的关系。在可见性回溯中,这个ID用于检索被观察者角色的信息并更新观察者角色的信念。我们的工作为信念追踪机制提供了见解,朝着对LMs的ToM推理进行逆向工程迈出了一步。
Summary / 总结
This study investigates how language models (LMs) represent characters' beliefs, particularly when those beliefs may differ from reality. Using causal mediation and abstraction analyses on CausalToM, a dataset of simple stories in which two characters independently change the state of two objects, the researchers find that LMs rely on a lookback mechanism to recall important information when it is needed. The LM binds each character-object-state triple by co-locating Ordering IDs (OIs) in low-rank subspaces of the state token's residual stream. When the story specifies whether one character can see the other, the LM generates a visibility ID and uses it to retrieve information about the observed character and update the observing character's beliefs.
研究探讨了语言模型(LMs)如何表示角色的信念,特别是在信念可能与现实不符的情况下。通过分析LMs中的因果中介和抽象,研究人员开发了一个名为CausalToM的数据集,并发现了一种称为'回溯'的机制。该机制通过低秩子空间中的Ordering IDs(OIs)将角色-对象-状态三元组绑定在一起,使LMs能够在必要时回忆重要信息。研究还发现,LMs使用可见性ID来根据角色的可见性状态更新角色的信念,表明了一个复杂的信念追踪系统。
Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models
Authors: Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, Moksh Jain
First: 2025-09-30T17:58:03+00:00 · Latest: 2026-02-24T18:58:30+00:00
Comments: 23 pages, 10 figures. Project page: https://rsa-llm.github.io/
Abstract
Test-time scaling methods improve the capabilities of large language models (LLMs) by increasing the amount of compute used during inference to make a prediction. Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling. Each step of RSA refines a population of candidate reasoning chains through aggregation of subsets to yield a population of improved solutions, which are then used as the candidate pool for the next iteration. Empirically, RSA delivers substantial performance gains with increasing compute budgets across diverse tasks, model families and sizes. Notably, RSA with Gemini 3 Flash attains performance near the top of the ARC-AGI-2 public leaderboard. RSA also enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high), outperforming purely parallel and sequential scaling strategies across AIME-25, HMMT-25, Reasoning Gym, LiveCodeBench-v6, and SuperGPQA. We further propose a novel aggregation-aware reinforcement learning approach that yields significant performance gains by training the model to combine solutions.
中文标题/摘要
标题:递归自我聚合解锁大型语言模型的深度思考
测试时缩放方法通过增加推理时用于预测的计算量来提高大型语言模型(LLMs)的能力。推理时计算量既可以并行缩放(在多个独立解决方案中进行选择),也可以通过自我完善进行序列缩放。我们提出了递归自我聚合(RSA),这是一种受进化方法启发的测试时缩放方法,结合了并行和序列缩放的优点。RSA的每一步通过聚合子集来精炼候选推理链群体,从而产生改进的解决方案群体,并将其用作下一次迭代的候选池。实验表明,随着计算预算的增加,RSA在不同任务、模型家族和规模上都带来了显著的性能提升。值得注意的是,使用Gemini 3 Flash的RSA取得了接近ARC-AGI-2公共排行榜榜首的性能。RSA还使Qwen3-4B-Instruct-2507能够与更大的推理模型(包括DeepSeek-R1和o3-mini (high))竞争,并在AIME-25、HMMT-25、Reasoning Gym、LiveCodeBench-v6和SuperGPQA上超越了纯并行和纯序列缩放策略。我们还提出了一种新的聚合感知强化学习方法,通过训练模型组合解决方案来获得显著的性能提升。
Summary / 总结
The paper introduces Recursive Self-Aggregation (RSA), a test-time scaling method for large language models (LLMs) that combines parallel and sequential scaling benefits. RSA refines a population of candidate reasoning chains through iterative aggregation, leading to improved solutions. RSA demonstrates significant performance gains across various tasks and model sizes, achieving competitive performance with larger models and outperforming purely parallel and sequential scaling strategies in multiple benchmarks.
研究提出递归自聚合(RSA)方法来增强大型语言模型(LLMs)的推理能力。RSA结合了并行和顺序扩展的优点,通过迭代聚合精炼候选推理链,从而获得更好的解决方案。RSA在各种任务和模型大小上表现出显著的性能提升,使用Gemini 3 Flash时取得了接近ARC-AGI-2公共排行榜榜首的性能。它还使较小的模型能够在多个基准测试中与较大的推理模型竞争,并超越了单纯的并行和顺序扩展策略。
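The RSA loop described in the abstract above (keep a population of candidates, aggregate random subsets into improved candidates, reuse the result as the next pool) can be sketched as follows. This is a toy illustration under loud assumptions, not the authors' implementation: the names `rsa`, `subset_size`, and `steps` are hypothetical, candidates are plain numbers rather than reasoning chains, and `statistics.median` stands in for the aggregation step, which in the paper's setting would presumably be a model call that reads a subset of candidate solutions and writes a better one.

```python
import random
from statistics import median
from typing import Callable, List


def rsa(
    initial: List[float],
    aggregate: Callable[[List[float]], float],
    subset_size: int,
    steps: int,
    rng: random.Random,
) -> List[float]:
    """Refine a population by repeatedly aggregating random subsets (toy sketch)."""
    population = list(initial)
    for _ in range(steps):
        # Every new candidate is built from a random subset of the current pool,
        # so the population size stays constant across iterations.
        population = [
            aggregate(rng.sample(population, subset_size))
            for _ in range(len(population))
        ]
    return population


rng = random.Random(0)
# Hypothetical stand-in: candidates are noisy guesses around 42,
# and the "aggregator" simply takes the median of each subset.
start = [42 + rng.gauss(0, 10) for _ in range(16)]
final = rsa(start, median, subset_size=4, steps=5, rng=rng)
print(max(start) - min(start), max(final) - min(final))
```

With a median aggregator the population can only contract toward its interior, which mimics the idea that each iteration mixes parallel candidates (the population) with sequential refinement (the repeated steps).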
Aletheia tackles FirstProof autonomously
Authors: Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, Thang Luong
First: 2026-02-24T18:56:10+00:00 · Latest: 2026-02-24T18:56:10+00:00
Comments: 34 pages. Project page: https://github.com/google-deepmind/superhuman/tree/main/aletheia
Abstract
We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at https://github.com/google-deepmind/superhuman/tree/main/aletheia.
中文标题/摘要
标题:Aletheia 自动应对 FirstProof 挑战
我们报告了由 Gemini 3 Deep Think 驱动的数学研究代理 Aletheia (Feng et al., 2026b) 在首届 FirstProof 挑战中的表现。在挑战允许的时间范围内,根据多数专家评估,Aletheia 自主解决了 10 个问题中的 6 个(第 2、5、7、8、9、10 题);我们注意到专家仅对问题 8 的评估未达成一致。为了完全透明,我们解释了对 FirstProof 的理解,并披露了实验的详细信息以及评估方法。原始提示和输出可在 https://github.com/google-deepmind/superhuman/tree/main/aletheia 获取。
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Authors: Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi
First: 2026-02-24T18:55:18+00:00 · Latest: 2026-02-24T18:55:18+00:00
Abstract
Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Drawing upon human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate and score multiple candidate actions using internal reflections before execution; and reflection-on-action, which uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, allowing the agent to re-evaluate earlier decisions and perform model updates with hindsight for proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablative studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
中文标题/摘要
标题:从试错中学习:具身LLM的反思式测试时规划
具身LLM赋予机器人高层次的任务推理能力,但它们无法反思哪里出了错以及为什么出错,导致部署成为一系列独立的试验,错误重复发生而非积累为经验。借鉴人类反思型实践者的做法,我们引入了反思式测试时规划,该方法结合了两种反思模式:行动中反思(reflection-in-action),即代理在执行前利用测试时缩放,借助内部反思生成多个候选动作并为其评分;以及行动后反思(reflection-on-action),即在执行后利用测试时训练,基于外部反思更新其内部反思模型和动作策略。我们还加入了回顾性反思,允许代理重新评估早期决策,并基于事后信息进行模型更新,以实现恰当的长时程信用分配。在我们新设计的 Long-Horizon Household 基准和 MuJoCo Cupboard Fitting 基准上的实验表明,该方法显著优于基线模型,消融研究验证了行动中反思和行动后反思的互补作用。定性分析(包括真实机器人试验)突出了通过反思实现的行为纠正。
Summary / 总结
The research aims to enhance the performance of embodied language models by enabling them to reflect on their actions and learn from mistakes. The method involves Reflective Test-Time Planning, which includes reflection-in-action and reflection-on-action. Reflection-in-action allows the agent to generate and score multiple candidate actions before execution, while reflection-on-action updates the internal reflection model and action policy based on post-execution feedback. Experiments on new benchmarks show significant improvements over baseline models, with ablation studies confirming the effectiveness of both reflection modes. Qualitative analyses demonstrate behavioral correction through reflection in real-robot trials.
论文针对具身LLM无法反思错误的问题,提出了结合行动中反思和行动后反思的Reflective Test-Time Planning方法。该方法允许代理在执行前生成和评分多个动作,并根据执行后的反思更新其模型。新基准上的实验显示,该方法显著优于基线模型,消融研究证实了两种反思模式的有效性。实机实验展示了通过反思进行行为修正的效果。
On Data Engineering for Scaling LLM Terminal Capabilities
Authors: Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping
First: 2026-02-24T18:51:04+00:00 · Latest: 2026-02-24T18:51:04+00:00
Abstract
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.
中文标题/摘要
标题:关于扩展大语言模型终端能力的数据工程
尽管大型语言模型的终端能力在最近取得了快速进展,但最先进的终端代理背后的训练数据策略仍然鲜有披露。我们通过系统研究终端代理的数据工程实践来填补这一空白,并做出了两项关键贡献:(1) Terminal-Task-Gen,一个轻量级的合成任务生成管道,支持基于种子和基于技能的任务构建;(2) 对数据和训练策略的全面分析,包括过滤、课程学习、长上下文训练和扩展行为。我们的管道生成了 Terminal-Corpus,这是一个大规模的开源终端任务数据集。使用此数据集,我们训练了从 Qwen3 (8B, 14B, 32B) 初始化的 Nemotron-Terminal 模型家族,在 Terminal-Bench 2.0 上取得了显著提升:Nemotron-Terminal-8B 从 2.5% 提高到 13.0%,Nemotron-Terminal-14B 从 4.0% 提高到 20.2%,Nemotron-Terminal-32B 从 3.4% 提高到 27.4%,与规模大得多的模型性能相当。为了加速该领域的研究,我们在 https://huggingface.co/collections/nvidia/nemotron-terminal 开源了我们的模型检查点和大部分合成数据集。
Summary / 总结
This study addresses the lack of disclosure around training data strategies for state-of-the-art terminal agents. It introduces Terminal-Task-Gen, a lightweight synthetic task generation pipeline, together with a comprehensive analysis of data and training strategies. The pipeline yields Terminal-Corpus, an open-source dataset, which is used to train the Nemotron-Terminal models. These show substantial gains on Terminal-Bench 2.0, with Nemotron-Terminal-32B improving from 3.4% to 27.4% and matching the performance of significantly larger models.
研究旨在解决大型语言模型终端能力背后的数据策略透明度不足的问题。研究引入了Terminal-Task-Gen,一个合成任务生成管道,并分析了过滤、课程学习和长上下文训练等数据和训练策略。研究结果表明,使用Terminal-Corpus数据集训练的模型,如Nemotron-Terminal,能够在Terminal-Bench 2.0上取得显著性能提升,其中Nemotron-Terminal-32B的表现与更大规模的模型相当。
Transfer Learning in Infinite Width Feature Learning Networks
Authors: Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan
First: 2025-07-06T16:14:43+00:00 · Latest: 2026-02-24T18:49:11+00:00
Abstract
We develop a theory of transfer learning in infinitely wide neural networks under gradient flow that quantifies when pretraining on a source task improves generalization on a target task. We analyze both (i) fine-tuning, when the downstream predictor is trained on top of source-induced features and (ii) a jointly rich setting, where both pretraining and downstream tasks can operate in a feature learning regime, but the downstream model is initialized with the features obtained after pre-training. In this setup, the summary statistics of randomly initialized networks after a rich pre-training are adaptive kernels which depend on both source data and labels. For (i), we analyze the performance of a readout for different pretraining data regimes. For (ii), the summary statistics after learning the target task are still adaptive kernels with features from both source and target tasks. We test our theory on linear and polynomial regression tasks as well as real datasets. Our theory allows interpretable conclusions on performance, which depend on the amount of data on both tasks, the alignment between tasks, and the feature learning strength.
中文标题/摘要
标题:无限宽度特征学习网络中的迁移学习
我们发展了无限宽神经网络在梯度流动下的迁移学习理论,量化了在源任务上预训练如何改善目标任务上的泛化能力。我们分析了两种情况:(i) 微调,即下游预测器在源诱导特征的基础上进行训练;(ii) 一个联合丰富的设置,其中预训练和下游任务都可以在特征学习阶段进行,但下游模型使用预训练后获得的特征进行初始化。在这种设置下,随机初始化网络在丰富预训练后的统计特征是适应核,依赖于源数据和标签。对于(i),我们分析了不同预训练数据模式下的读出性能。对于(ii),学习目标任务后的统计特征仍然是依赖于源任务和目标任务特征的适应核。我们使用线性和多项式回归任务以及真实数据集测试了我们的理论。我们的理论允许基于数据量、任务对齐和特征学习强度的可解释结论。
Summary / 总结
The research aims to understand when pretraining improves generalization in transfer learning within infinitely wide neural networks. The study analyzes both fine-tuning and a jointly rich setting where pretraining and downstream tasks operate in a feature learning regime. Key findings show that the performance of a readout for different pretraining data regimes can be quantified, and the summary statistics after learning the target task are adaptive kernels that depend on both source and target tasks. The theory provides interpretable conclusions based on the amount of data, task alignment, and feature learning strength.
研究旨在理解在无限宽神经网络中预训练如何改善迁移学习中的泛化能力。研究分析了两种情况:一是微调,二是联合丰富的设置,在这种设置中,预训练和下游任务都在特征学习阶段进行,但下游模型使用预训练后的特征初始化。关键发现表明,不同预训练数据集的读出性能可以量化,并且在学习目标任务后的总结统计是依赖于源任务和目标任务的自适应核。理论提供了基于数据量、任务对齐和特征学习强度的可解释结论。
Games That Teach, Chats That Convince: Comparing Interactive and Static Formats for Persuasive Learning
Authors: Seyed Hossein Alavi, Zining Wang, Shruthi Chockkalingam, Raymond T. Ng, Vered Shwartz
First: 2026-02-20T00:07:18+00:00 · Latest: 2026-02-24T18:49:01+00:00
Abstract
Interactive systems such as chatbots and games are increasingly used to persuade and educate on sustainability-related topics, yet it remains unclear how different delivery formats shape learning and persuasive outcomes when content is held constant. Grounding on identical arguments and factual content across conditions, we present a controlled user study comparing three modes of information delivery: static essays, conversational chatbots, and narrative text-based games. Across subjective measures, the chatbot condition consistently outperformed the other modes and increased perceived importance of the topic. However, perceived learning did not reliably align with objective outcomes: participants in the text-based game condition reported learning less than those reading essays, yet achieved higher scores on a delayed (24-hour) knowledge quiz. Additional exploratory analyses further suggest that common engagement proxies, such as verbosity and interaction length, are more closely related to subjective experience than to actual learning. These findings highlight a dissociation between how persuasive experiences feel and what participants retain, and point to important design trade-offs between interactivity, realism, and learning in persuasive systems and serious games.
中文标题/摘要
标题:教学游戏,说服对话:不同格式对说服性学习的比较
诸如聊天机器人和游戏之类的交互系统越来越多地被用于说服和教育可持续发展相关的话题,但不同传递格式如何在内容保持不变的情况下影响学习和说服效果仍不清楚。基于相同的观点和事实内容,我们进行了一项受控用户研究,比较了三种信息传递模式:静态文章、对话聊天机器人和叙事文本游戏。在主观指标上,聊天机器人条件始终优于其他模式,并增加了对主题的重要性感知。然而,感知学习并不总是与客观结果一致:与阅读文章的人相比,文本游戏条件的参与者报告学习较少,但在延迟(24小时)的知识测验中得分更高。此外,进一步的探索性分析还表明,常见的参与度指标,如冗长性和交互长度,与主观体验的关系比与实际学习的关系更密切。这些发现突显了说服性体验感觉与参与者保留内容之间的分离,并指出了在说服系统和严肃游戏中,交互性、逼真性和学习之间的设计权衡的重要性。
Summary / 总结
This study compares the effectiveness of static essays, conversational chatbots, and narrative text-based games in persuasive learning about sustainability. Despite identical content, chatbots outperformed other formats in increasing perceived importance of the topic. However, participants in the text-based game condition reported less learning but performed better on a delayed knowledge quiz, suggesting a disconnect between subjective experience and actual learning. Engagement metrics like verbosity and interaction length were more related to subjective experience than learning outcomes.
本研究比较了静态文章、对话聊天机器人和基于文本的叙事游戏在可持续性教育中的有效性。尽管内容相同,聊天机器人在提高参与者对主题重要性的感知方面表现最好。然而,游戏组的参与者报告的学习较少,但在延迟知识测试中表现更好,这表明主观体验与实际学习之间存在差异。参与度指标如冗长性和互动长度更多地与主观体验相关,而不是学习结果。
Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Authors: Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi
First: 2026-02-24T18:43:08+00:00 · Latest: 2026-02-24T18:43:08+00:00
Abstract
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It counts a task as solved if any of k independently sampled solutions passes a verifier. This multi-sample inference metric has motivated inference-aware fine-tuning methods that directly optimize pass@k. However, prior work reports a recurring trade-off: pass@k improves while pass@1 degrades under such methods. This trade-off is practically important because pass@1 often remains a hard operational constraint due to latency and cost budgets, imperfect verifier coverage, and the need for a reliable single-shot fallback. We study the origin of this trade-off and provide a theoretical characterization of when pass@k policy optimization can reduce pass@1 through gradient conflict induced by prompt interference. We show that pass@k policy gradients can conflict with pass@1 gradients because pass@k optimization implicitly reweights prompts toward low-success prompts; when these prompts are what we term negatively interfering, their upweighting can rotate the pass@k update direction away from the pass@1 direction. We illustrate our theoretical findings with large language model experiments on verifiable mathematical reasoning tasks.
中文标题/摘要
标题:为什么 Pass@k 优化会降低 Pass@1:LLM 后训练中的提示干扰
Pass@k 是可验证的大语言模型任务(包括数学推理、代码生成和简答推理)中广泛使用的性能指标。只要 k 个独立采样的解决方案中有任何一个通过验证器,即视为成功。这一多样本推理指标推动了直接优化 Pass@k 的推理感知微调方法的发展。然而,先前的工作报告了一个反复出现的权衡:在这类方法下,Pass@k 提高而 Pass@1 下降。这种权衡在实践中非常重要,因为由于延迟和成本预算、验证器覆盖不完善以及需要可靠的单次回退方案,Pass@1 往往仍是一项硬性的操作约束。我们研究了这种权衡的起源,并从理论上刻画了 Pass@k 策略优化在何种情况下会因提示干扰引起的梯度冲突而降低 Pass@1。我们表明,Pass@k 策略梯度可能与 Pass@1 梯度冲突,因为 Pass@k 优化隐式地将权重向低成功率提示倾斜;当这些提示属于我们所称的负干扰提示时,对它们的加权会使 Pass@k 的更新方向偏离 Pass@1 的方向。我们通过可验证数学推理任务上的大语言模型实验说明了我们的理论发现。
Summary / 总结
The paper investigates why optimizing pass@k can degrade pass@1 in LLM post-training, focusing on prompt interference. It shows that pass@k policy gradients can conflict with pass@1 gradients because pass@k optimization implicitly reweights prompts toward low-success ones; when those prompts are negatively interfering, upweighting them rotates the pass@k update direction away from the pass@1 direction and can reduce pass@1. Experiments on verifiable mathematical reasoning tasks illustrate these findings.
论文研究了为什么在大型语言模型中优化Pass@k会降低Pass@1,重点关注提示干扰。研究表明,Pass@k优化会与Pass@1优化发生冲突,因为Pass@k优化会将提示重新加权指向低成功率的提示,当这些提示是所谓的负干扰时,它们的加权会旋转Pass@k更新方向远离Pass@1方向。在可验证的数学推理任务上的实验验证了这些发现。
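The pass@k definition in the entry above, and the claim that optimizing it shifts weight toward low-success prompts, can be made concrete with a small calculation. The sketch below is not the paper's derivation. It assumes each prompt has a fixed per-sample success probability `p` and that the k samples are independent, in which case pass@k = 1 - (1 - p)^k and its derivative with respect to p is k(1 - p)^(k-1). The function names are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


def prompt_weight(p: float, k: int) -> float:
    """Derivative of pass@k with respect to p: k * (1 - p)^(k - 1).

    For k = 1 this is the constant 1 (plain pass@1); for k > 1 it is
    largest when p is small, i.e. on low-success prompts.
    """
    return k * (1.0 - p) ** (k - 1)


for p in (0.05, 0.5, 0.95):
    print(f"p={p:.2f}  pass@8={pass_at_k(p, 8):.4f}  weight={prompt_weight(p, 8):.6f}")
```

Under these assumptions a prompt with a 5% success rate receives a far larger weight at k = 8 than one with a 95% success rate, which is the implicit reweighting the abstract refers to. Whether that reweighting hurts pass@1 depends on the interference condition the paper characterizes, which this sketch does not model.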
Human Video Generation from a Single Image with 3D Pose and View Control
Authors: Tiantian Wang, Chun-Han Yao, Tao Hu, Mallikarjun Byrasandra Ramalinga Reddy, Ming-Hsuan Yang, Varun Jampani
First: 2026-02-24T18:42:20+00:00 · Latest: 2026-02-24T18:42:20+00:00
Abstract
Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.
中文标题/摘要
标题:从单张图像生成具有3D姿态和视角控制的人类视频
近期的扩散方法由于其强大的视觉生成能力,在从单张图像生成视频方面取得了显著进展。然而,在图像到视频合成中,特别是在人类视频生成方面,从单张图像推断视角一致、运动相关的衣物褶皱仍然是一个艰巨的问题。在本文中,我们提出了4D人类视频生成(HVG),这是一种能够通过3D姿态和视角控制从单张图像生成高质量、多视角、时空一致的人类视频的潜在视频扩散模型。HVG 通过三个关键设计实现这一点:(i) 骨骼图驱动的姿态调制,通过一种新颖的双维度骨骼图捕捉3D关节的解剖关系,并通过引入3D信息解决多视角中的自遮挡问题;(ii) 视角和时间对齐,确保参考图像与姿态序列之间的多视角一致性,以实现帧到帧的稳定性;以及 (iii) 渐进时空采样,结合时间对齐以在长时间多视角动画中保持平滑过渡。在图像到视频任务上的大量实验表明,HVG 在从各种人类图像和姿态输入生成高质量4D人类视频方面优于现有方法。
Summary / 总结
The research addresses the challenge of generating high-quality human videos from a single image, focusing on view-consistent and motion-dependent clothing wrinkles. The method, Human Video Generation in 4D (HVG), uses a latent video diffusion model with three key designs: articulated pose modulation, view and temporal alignment, and progressive spatio-temporal sampling. Experiments show that HVG outperforms existing methods in producing multi-view, spatiotemporally coherent human videos.
研究旨在解决从单张图像生成高质量的人类视频的挑战,重点在于不同视角和姿态下服装褶皱的一致性和逼真性。方法Human Video Generation in 4D (HVG) 采用了三个关键设计:Articulated Pose Modulation、View and Temporal Alignment 和 Progressive Spatio-Temporal Sampling。实验结果表明,HVG 在从多种输入图像和姿态序列生成多视角、时空一致的人类视频方面优于现有方法。
Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
Authors: Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang
First: 2026-02-24T18:37:34+00:00 · Latest: 2026-02-24T18:37:34+00:00
Abstract
While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space--a cornerstone of spatial intelligence--remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at https://github.com/hustvl/Spa3R.
中文标题/摘要
标题:Spa3R:三维视觉推理的预测空间场建模
尽管视觉语言模型(VLMs)在二维视觉理解方面表现出色,但它们理解和推理三维空间的能力仍然停留在表面,而这种能力正是空间智能的基石。当前的方法试图通过依赖显式的三维模态,或通过用部分的、以视图为条件的几何先验增强VLMs来弥合这一领域差距。然而,这些方法阻碍了可扩展性,并最终使语言模型承担起从稀疏线索隐式重建完整三维几何结构这一不适定任务。在本文中,我们主张空间智能可以仅从二维视觉中自然涌现,而不是通过显式的空间指令调优来强加。为此,我们引入了Spa3R,这是一种自监督框架,直接从无位姿(unposed)的多视角图像中学习统一的、视角不变的空间表示。Spa3R基于我们提出的预测空间场建模(PSFM)范式,其中Spa3R学习以紧凑的潜在表示为条件,合成任意未见视图的特征场,从而内化对底层三维场景整体且连贯的理解。我们进一步通过轻量级适配器将预训练的Spa3R编码器集成到现有的VLMs中,形成Spa3-VLM,有效地将语言推理建立在全局空间上下文之上。在具有挑战性的VSI-Bench上的实验表明,Spa3-VLM在3D VQA上达到了58.6%的最先进准确率,显著优于先前的方法。这些结果突显了PSFM是推进空间智能的一条可扩展途径。代码可在 https://github.com/hustvl/Spa3R 获取。
Summary / 总结
This paper addresses the superficial 3D spatial reasoning of Vision-Language Models (VLMs) by proposing Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation from unposed multi-view images without explicit 3D data. Spa3R uses Predictive Spatial Field Modeling (PSFM) to synthesize feature fields for unseen views, which lets it internalize a holistic understanding of the 3D scene. Integrating the pre-trained Spa3R Encoder into existing VLMs through a lightweight adapter yields Spa3-VLM, which reaches 58.6% accuracy on 3D VQA on VSI-Bench, significantly outperforming prior methods.
研究旨在通过引入Spa3R框架提升Vision-Language模型在三维空间推理方面的能力。Spa3R是一个自监督框架,可以从多视角图像中学习统一的空间表示,无需显式的3D数据。Spa3R使用预测性空间场建模(PSFM)来合成未见过的视角的特征场,从而实现对3D场景的整体理解。实验表明,将Spa3R集成到现有VLM中(形成Spa3-VLM)可以显著提高3D VQA的准确率至58.6%,超越了先前的方法。
Mask-HybridGNet: Graph-based segmentation with emergent anatomical correspondence from pixel-level supervision
Authors: Nicolás Gaggion, Maria J. Ledesma-Carbayo, Stergios Christodoulidis, Maria Vakalopoulou, Enzo Ferrante
First: 2026-02-24T18:29:13+00:00 · Latest: 2026-02-24T18:29:13+00:00
Abstract
Graph-based medical image segmentation represents anatomical structures using boundary graphs, providing fixed-topology landmarks and inherent population-level correspondences. However, their clinical adoption has been hindered by a major requirement: training datasets with manually annotated landmarks that maintain point-to-point correspondences across patients rarely exist in practice. We introduce Mask-HybridGNet, a framework that trains graph-based models directly using standard pixel-wise masks, eliminating the need for manual landmark annotations. Our approach aligns variable-length ground truth boundaries with fixed-length landmark predictions by combining Chamfer distance supervision and edge-based regularization to ensure local smoothness and regular landmark distribution, further refined via differentiable rasterization. A significant emergent property of this framework is that predicted landmark positions become consistently associated with specific anatomical locations across patients without explicit correspondence supervision. This implicit atlas learning enables temporal tracking, cross-slice reconstruction, and morphological population analyses. Beyond direct segmentation, Mask-HybridGNet can extract correspondences from existing segmentation masks, allowing it to generate stable anatomical atlases from any high-quality pixel-based model. Experiments across chest radiography, cardiac ultrasound, cardiac MRI, and fetal imaging demonstrate that our model achieves competitive results against state-of-the-art pixel-based methods, while ensuring anatomical plausibility by enforcing boundary connectivity through a fixed graph adjacency matrix. This framework leverages the vast availability of standard segmentation masks to build structured models that maintain topological integrity and provide implicit correspondences.
中文标题/摘要
标题:Mask-HybridGNet:基于图的分割与像素级监督下自涌现的解剖对应关系
基于图的医学图像分割使用边界图表示解剖结构,提供拓扑固定的地标点和固有的群体级对应关系。然而,其临床应用受到一项主要要求的阻碍:带有手动标注地标点、且在患者之间保持点对点对应关系的训练数据集在实践中很少存在。我们引入了Mask-HybridGNet框架,该框架直接使用标准的像素级掩码训练基于图的模型,从而无需手动标注地标点。我们的方法结合Chamfer距离监督和基于边的正则化,将可变长度的真实标注边界与固定长度的地标点预测对齐,以确保局部平滑和地标点分布均匀,并通过可微光栅化进一步细化。该框架的一个重要涌现特性是,在没有显式对应关系监督的情况下,预测的地标点位置在不同患者之间始终与特定的解剖位置相关联。这种隐式的图谱学习使时间跟踪、跨切片重建和形态学群体分析成为可能。除了直接分割之外,Mask-HybridGNet还可以从现有的分割掩码中提取对应关系,使其能够基于任何高质量的像素级模型生成稳定的解剖图谱。在胸部X光、心脏超声、心脏MRI和胎儿成像上的实验表明,我们的模型取得了与最先进的像素级方法相当的结果,同时通过固定的图邻接矩阵强制边界连通性,确保了解剖上的合理性。该框架利用大量可得的标准分割掩码来构建保持拓扑完整性并提供隐式对应关系的结构化模型。
Summary / 总结
Mask-HybridGNet is a framework that trains graph-based models using standard pixel-wise masks, avoiding the need for manual landmark annotations. It aligns ground truth boundaries with fixed-length landmark predictions using Chamfer distance supervision and edge-based regularization, which leads to consistent anatomical correspondences across patients. Experiments show that Mask-HybridGNet achieves competitive results with state-of-the-art pixel-based methods while ensuring anatomical plausibility through boundary connectivity enforcement.
Mask-HybridGNet旨在无需手动标注解剖标志点即可训练基于图的医学图像分割模型。它使用像素级掩码进行训练,并通过Chamfer距离监督和基于边的正则化将可变长度的真实标注边界与固定长度的标志点预测对齐。该模型在患者之间表现出一致的解剖对应关系,能够实现时间跟踪和形态学群体分析。实验表明,Mask-HybridGNet取得了与最先进的像素级方法相当的结果,同时通过强制边界连通性保持了解剖上的合理性。该方法利用大量可得的标准分割掩码构建具有拓扑完整性和隐式对应关系的结构化模型。
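The Mask-HybridGNet entry above relies on Chamfer distance to compare a fixed-length set of predicted landmarks with a variable-length ground-truth boundary. A minimal pure-Python sketch of one common symmetric form (mean nearest-neighbour distance in both directions) is given below. The exact variant used by the authors (squared versus unsquared distances, sums versus means, weighting) is not stated in the abstract, so this particular form and the names `chamfer`, `landmarks`, and `boundary` are assumptions for illustration only.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def _mean_nearest(src: List[Point], dst: List[Point]) -> float:
    """Average distance from each point in src to its closest point in dst."""
    return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)


def chamfer(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance; the two point sets may differ in length."""
    return _mean_nearest(a, b) + _mean_nearest(b, a)


# Four predicted landmarks (fixed length) on the corners of a unit square.
landmarks = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
# Eight ground-truth boundary points (variable length): corners plus edge midpoints.
boundary = [
    (0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.0, 0.5),
    (1.0, 1.0), (0.5, 1.0), (0.0, 1.0), (0.0, 0.5),
]
print(chamfer(landmarks, boundary))
```

Because the loss only asks each point to be near some point in the other set, it needs no point-to-point correspondence between landmarks and boundary pixels, which is what allows training from ordinary segmentation masks.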
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
Authors: Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser
First: 2026-02-24T18:28:08+00:00 · Latest: 2026-02-24T18:28:08+00:00
Comments: Accepted in ICCABS 2026: The 14th International Conference on Computational Advances in Bio and Medical Sciences
Abstract
Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as opaque "black boxes" and fail to quantify the complex, irregular tumor boundaries that characterize malignant growth. To address these challenges, we present XMorph, an explainable and computationally efficient framework for fine-grained classification of three prominent brain tumor types: glioma, meningioma, and pituitary tumors. We propose an Information-Weighted Boundary Normalization (IWBN) mechanism that emphasizes diagnostically relevant boundary regions alongside nonlinear chaotic and clinically validated features, enabling a richer morphological representation of tumor growth. A dual-channel explainable AI module combines GradCAM++ visual cues with LLM-generated textual rationales, translating model reasoning into clinically interpretable insights. The proposed framework achieves a classification accuracy of 96.0%, demonstrating that explainability and high performance can co-exist in AI-based medical imaging systems. The source code and materials for XMorph are all publicly available at: https://github.com/ALSER-Lab/XMorph.
中文标题/摘要
标题:XMorph:通过LLM辅助混合深度智能实现可解释的大脑肿瘤分析
深度学习在自动化大脑肿瘤诊断方面取得了显著进展,但临床应用受限于可解释性和计算约束。传统模型通常作为不透明的“黑箱”运作,无法量化恶性生长特征的复杂、不规则的肿瘤边界。为解决这些挑战,我们提出XMorph,一种用于细粒度分类三种主要大脑肿瘤类型(胶质瘤、脑膜瘤和垂体瘤)的可解释且计算高效的框架。我们提出了一种信息加权边界归一化(IWBN)机制,强调诊断相关的边界区域以及非线性混沌和临床验证的特征,从而实现肿瘤生长的更丰富的形态学表示。双通道可解释AI模块结合GradCAM++视觉提示和LLM生成的文本理由,将模型推理转化为临床可解释的见解。所提出的框架实现了96.0%的分类准确率,证明了在基于AI的医学成像系统中可解释性和高性能可以共存。XMorph的源代码和材料均可在以下网址获取:https://github.com/ALSER-Lab/XMorph。
Summary / 总结
XMorph is a framework designed to improve the explainability and computational efficiency of brain tumor diagnosis using deep learning. It introduces an Information-Weighted Boundary Normalization (IWBN) mechanism and a dual-channel explainable AI module that combines visual and textual rationales. XMorph achieves a classification accuracy of 96.0% for glioma, meningioma, and pituitary tumors, showing that high performance and explainability can be simultaneously achieved in medical imaging systems.
XMorph 是一个旨在通过深度学习提高脑肿瘤诊断的可解释性和计算效率的框架。它引入了信息加权边界规范化机制,以突出临床相关的边界区域,并结合了 GradCAM++ 视觉提示和 LLM 生成的文本理由来提高可解释性。XMorph 在胶质瘤、脑膜瘤和垂体瘤分类中的准确率达到 96.0%,表明在医学成像系统中可以同时实现高性能和可解释性。
How much does context affect the accuracy of AI health advice?
Authors: Prashant Garg, Thiemo Fetzer
First: 2025-04-25T12:37:15+00:00 · Latest: 2026-02-24T18:23:32+00:00
Abstract
Large language models (LLMs) are increasingly used to provide health advice, yet evidence on how their accuracy varies across languages, topics and information sources remains limited. We assess how linguistic and contextual factors affect the accuracy of AI-based health-claim verification. We evaluated seven widely used LLMs on two datasets: (i) 1,975 legally authorised nutrition and health claims from UK and EU regulatory registers translated into 21 languages; and (ii) 9,088 journalist-vetted public-health claims from the PUBHEALTH corpus spanning COVID-19, abortion, politics and general health, drawn from government advisories, scientific abstracts and media sources. Models classified each claim as supported or unsupported using majority voting across repeated runs. Accuracy was analysed by language, topic, source and model. Accuracy on authorised claims was highest in English and closely related European languages and declined in several widely spoken non-European languages, decreasing with syntactic distance from English. On real-world public-health claims, accuracy was substantially lower and varied systematically by topic and source. Models performed best on COVID-19 and government-attributed claims and worst on general health and scientific abstracts. High performance on English, canonical health claims masks substantial context-dependent gaps. Differences in training data exposure, editorial framing and topic-specific tuning likely contribute to these disparities, which are comparable in magnitude to cross-language differences. LLM accuracy in health-claim verification depends strongly on language, topic and information source. English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.
中文标题/摘要
标题:上下文对AI健康建议准确性的影响有多大?
大型语言模型(LLMs)越来越多地用于提供健康建议,但对其准确性在不同语言、主题和信息源之间的变化情况的证据仍然有限。我们评估了语言和上下文因素如何影响基于AI的健康声明验证的准确性。我们对七种广泛使用的LLM在两个数据集上进行了评估:(i)来自英国和欧盟监管登记处的1,975条合法授权的营养和健康声明,已翻译成21种语言;(ii)9,088条经过记者审核的公共卫生声明,来自PUBHEALTH语料库,涵盖COVID-19、堕胎、政治和一般健康,来源包括政府建议、科学摘要和媒体。模型使用多次运行中的多数投票来分类每条声明为支持或不支持。准确性分析按语言、主题、来源和模型进行。授权声明的准确性在英语和相关欧洲语言中最高,并在多种广泛使用的非欧洲语言中下降,随着与英语的句法距离增加而下降。在真实世界公共卫生声明上,准确性显著较低,并且按主题和来源系统地变化。模型在COVID-19和政府归属声明上表现最好,在一般健康和科学摘要上表现最差。在英语上的高表现掩盖了显著的上下文依赖性差距。训练数据曝光、编辑框架和主题特定调整的差异可能对这些差异有所贡献,这些差异的规模与跨语言差异相当。健康声明验证中的LLM准确性强烈依赖于语言、主题和信息来源。英语表现并不可靠地跨上下文推广,强调了在公共卫生传播中部署前进行多语言和领域特定评估的必要性。
Summary / 总结
This study evaluates how linguistic and contextual factors affect the accuracy of AI-based health-claim verification by large language models. Seven models were tested on two datasets: 1,975 authorised nutrition and health claims from UK and EU regulatory registers translated into 21 languages, and 9,088 journalist-vetted public-health claims from the PUBHEALTH corpus. Accuracy varied by language, topic, and source: it was highest in English and closely related European languages and declined in several widely spoken non-European languages, decreasing with syntactic distance from English. On public-health claims, the models performed best on COVID-19 and government-attributed claims and worst on general health and scientific abstracts, highlighting the need for multilingual and domain-specific evaluation before deployment in public-health communication.
研究评估了语言和上下文因素对大型语言模型(LLMs)提供的健康建议准确性的影响。七个LLM在两个数据集上进行了测试:1,975条来自英国和欧盟监管注册的营养和健康声明,以及9,088条来自PUBHEALTH语料库的公共健康声明。准确率在不同语言、主题和来源之间存在差异,英语及其相关欧洲语言的准确率较高,而非欧洲语言则较低。研究指出,尽管在英语标准健康声明上的表现良好,但这些表现并不能可靠地推广到其他上下文,强调了在公共健康沟通中部署前进行多语言和领域特定评估的必要性。
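The abstract says each claim was classified by majority voting across repeated runs, with accuracy then broken down by language, topic, source and model. A minimal sketch of that aggregation is shown below; the record layout, group names and labels are hypothetical toy values, not data from the study.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; use an odd number of runs to avoid ties."""
    return Counter(labels).most_common(1)[0][0]

def accuracy_by_group(records):
    """records: iterable of (group, gold_label, labels_from_repeated_runs)."""
    hits, totals = Counter(), Counter()
    for group, gold, runs in records:
        totals[group] += 1
        hits[group] += int(majority_vote(runs) == gold)
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical toy records grouped by language code.
records = [
    ("en", "supported",   ["supported", "supported", "unsupported"]),
    ("en", "unsupported", ["unsupported", "unsupported", "unsupported"]),
    ("hi", "supported",   ["unsupported", "unsupported", "supported"]),
    ("hi", "unsupported", ["unsupported", "supported", "unsupported"]),
]
per_language = accuracy_by_group(records)
```

The same function can be reused for topic or source by changing what is stored in the first field of each record.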
Seeing Through Words: Controlling Visual Retrieval Quality with Language Models
Authors: Jianglin Lu, Simon Jenni, Kushal Kafle, Jing Shi, Handong Zhao, Yun Fu
First: 2026-02-24T18:20:57+00:00 · Latest: 2026-02-24T18:20:57+00:00
Abstract
Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility, it is compatible with any pretrained vision-language model (VLM) without modification; 2) transparency, enriched queries are explicitly interpretable by users; and 3) controllability, enabling retrieval results to be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at https://github.com/Jianglin954/QCQC.
中文标题/摘要
标题:透过文字看图像:利用语言模型控制视觉检索质量
文本到图像检索是视觉语言学习中的基本任务,但在实际场景中,由短且不明确的用户查询所引发的挑战尤为突出。这类查询通常只有1到2个词,导致其在语义上模糊不清,容易在多种视觉解释中发生碰撞,并且缺乏对检索图像质量的明确控制。为解决这些问题,我们提出了一种新的质量可控检索范式,该范式通过上下文细节丰富简短的查询,并融入图像质量的明确概念。我们的核心思想是利用生成语言模型作为查询扩展函数,将不明确的查询扩展为描述性形式,捕捉细微的视觉属性,如姿态、场景和美学。我们提出了一种通用框架,该框架根据相关性和美学评分模型离散的质量级别条件查询扩展,使得查询丰富不仅在语义上具有意义,而且在质量上也具有意识。该系统提供了三个关键优势:1) 灵活性,它与任何预训练的视觉语言模型(VLMs)兼容,无需修改;2) 透明性,增强的查询可以明确地由用户解释;3) 可控性,使检索结果能够朝向用户偏好的质量水平进行引导。大量实验表明,我们提出的方法显著提高了检索结果,并提供了有效的质量控制,弥合了现代VLMs的表达能力和简短用户查询的不明确性之间的差距。我们的代码可在https://github.com/Jianglin954/QCQC/ 获取。
Summary / 总结
This paper addresses the challenge of text-to-image retrieval with short and ambiguous user queries by proposing a quality-controllable retrieval paradigm. It leverages a generative language model to extend underspecified queries into more descriptive forms, incorporating explicit notions of image quality. The framework conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that the enriched queries are both semantically meaningful and quality-aware. Experiments show that this approach significantly improves retrieval results and provides effective quality control. The framework is compatible with any pretrained vision-language model without modification, produces enriched queries that users can read directly, and lets retrieval be steered toward user-preferred quality levels.
该论文通过提出一种质量可控的检索范式,解决了使用短且语义模糊用户查询进行文本到图像检索的挑战。它利用生成语言模型将不明确的查询扩展为更具描述性的形式,同时融入图像质量的明确概念。该框架根据相关性和美学评分模型得出的离散质量级别来条件化查询扩展,确保增强的查询既具有语义意义,又具有质量意识。实验表明,这种方法显著提高了检索结果,并提供了有效的质量控制,使其与任何预训练的视觉语言模型兼容,并允许透明和可控的检索。
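Conditioning query completion on discretized quality levels can be pictured as bucketing continuous scores and passing the bucket indices to the completion model. The sketch below is only an assumption of how that might look: the thresholds, the number of levels, and the control-tag prompt format are hypothetical and not taken from the paper or its repository.

```python
def quality_level(score, thresholds=(0.33, 0.66)):
    """Map a score in [0, 1] to a discrete level 0..len(thresholds)."""
    return sum(score >= t for t in thresholds)

def build_completion_prompt(query, relevance_level, aesthetic_level):
    """Prefix the short query with hypothetical quality control tags."""
    return (f"<relevance={relevance_level}> <aesthetic={aesthetic_level}> "
            f"Complete the image search query: {query}")

# Hypothetical scores, standing in for a relevance model and an aesthetic model.
prompt = build_completion_prompt("dog", quality_level(0.9), quality_level(0.4))
```

Since the enriched query is plain text, it can be fed to any existing text-to-image retriever, which is the flexibility the abstract points to.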
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Authors: Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan
Venue: CVPR 2026
First: 2026-02-24T18:17:21+00:00 · Latest: 2026-02-24T18:17:21+00:00
Comments: Accepted to CVPR 2026
Abstract
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3× fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems.
中文标题/摘要
标题:NoRD:一种无需推理的数据高效视觉-语言-行动模型
视觉-语言-行动(VLA)模型通过使用统一的端到端架构取代模块化管道,正在推动自主驾驶的发展。然而,当前的VLA面临两个昂贵的要求:(1)大规模数据集收集,(2)密集的推理注释。在本工作中,我们通过NoRD(No Reasoning for Driving,无需推理的驾驶)解决了这两个挑战。与现有的VLA相比,NoRD在不到60%的数据和无推理注释的情况下实现了竞争力的性能,结果减少了3倍的令牌。我们发现,标准的组相对策略优化(GRPO)在应用于如此小的、无推理数据集训练的策略时无法取得显著改进。我们表明,这一限制源于难度偏差,这在GRPO中不成比例地惩罚了产生高方差滚动的场景的奖励信号。NoRD通过引入Dr. GRPO克服了这一限制,Dr. GRPO是一种最近设计用于减轻LLM中难度偏差的算法。因此,NoRD在Waymo和NAVSIM上实现了竞争力的性能,使用了少量的训练数据和无推理开销,从而使得自主系统更加高效。
Summary / 总结
The research addresses two costly requirements of Vision-Language-Action (VLA) models for autonomous driving: massive dataset collection and dense reasoning annotations. It introduces NoRD, which achieves competitive performance while being fine-tuned on less than 60% of the data and no reasoning annotations, resulting in 3× fewer tokens. Because standard GRPO yields little improvement on such small, reasoning-free datasets, NoRD uses Dr. GRPO to mitigate difficulty bias, and achieves competitive performance on Waymo and NAVSIM.
NoRD 是一种高效的数据视觉-语言-行动模型,解决了自主驾驶中大规模数据集收集和密集推理注解的挑战。通过使用 Dr. GRPO 来缓解难度偏差,NoRD 在 Waymo 和 NAVSIM 上实现了与完整数据集相当的性能,仅使用了 60% 的训练数据且无需推理注解,显著减少了所需的训练令牌数量。
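The difficulty bias described in the abstract can be illustrated by comparing how group-relative advantages are computed. The sketch below assumes the commonly described formulations: GRPO divides mean-centred rewards by the group's standard deviation, while Dr. GRPO drops that division (it also changes length normalization, which is not shown here). The reward values are hypothetical, and this is not NoRD's training code.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Mean-centre rewards within a rollout group and divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Mean-centre only; no std division (assumed Dr. GRPO-style advantage)."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]

# Hypothetical rollout groups for two driving scenarios.
low_variance = [0.50, 0.52, 0.50, 0.52]
high_variance = [0.0, 1.0, 0.0, 1.0]

grpo_low, grpo_high = grpo_advantages(low_variance), grpo_advantages(high_variance)
dr_low, dr_high = dr_grpo_advantages(low_variance), dr_grpo_advantages(high_variance)
```

With std division, both groups end up with advantages of roughly the same magnitude, so the large reward differences in the high-variance scenario are scaled down relative to the tiny differences in the low-variance one. Without it, the high-variance scenario keeps its larger signal.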
Multi-Round Human-AI Collaboration with User-Specified Requirements
Authors: Sima Noorani, Shayan Kiyani, Hamed Hassani, George Pappas
First: 2026-02-19T18:54:34+00:00 · Latest: 2026-02-24T18:15:39+00:00
Abstract
As humans increasingly rely on multi-round conversational AI for high-stakes decisions, principled frameworks are needed to ensure such interactions reliably improve decision quality. We adopt a human-centric view governed by two principles: counterfactual harm, ensuring the AI does not undermine human strengths, and complementarity, ensuring it adds value where the human is prone to err. We formalize these concepts via user-defined rules, allowing users to specify exactly what harm and complementarity mean for their specific task. We then introduce an online, distribution-free algorithm with finite-sample guarantees that enforces the user-specified constraints over the collaboration dynamics. We evaluate our framework across two interactive settings: LLM-simulated collaboration on a medical diagnostic task and a human crowdsourcing study on a pictorial reasoning task. We show that our online procedure maintains prescribed counterfactual harm and complementarity violation rates even under nonstationary interaction dynamics. Moreover, tightening or loosening these constraints produces predictable shifts in downstream human accuracy, confirming that the two principles serve as practical levers for steering multi-round collaboration toward better decision quality without the need to model or constrain human behavior.
中文标题/摘要
标题:多轮人机协作与用户指定要求
随着人类越来越多地依赖多轮对话AI进行高风险决策,需要有原则性的框架来确保此类交互能够可靠地提高决策质量。我们采取以人为中心的观点,遵循两个原则:反事实伤害,确保AI不削弱人类的优势;互补性,确保AI在人类容易出错的地方增加价值。我们通过用户定义的规则形式化这些概念,允许用户明确指定其特定任务中的伤害和互补性含义。然后,我们引入了一个在线的、无分布假设的算法,该算法在有限样本下提供保证,并在协作动态中强制执行用户指定的约束。我们通过两个交互设置评估了我们的框架:模拟大型语言模型在医疗诊断任务上的合作和人类众包研究在图像推理任务上的合作。我们展示了我们的在线程序即使在非平稳交互动态下也能保持规定的反事实伤害和互补性违反率。此外,收紧或放松这些约束会产生可预测的人类下游准确性变化,证实了这两个原则作为实用杠杆的作用,可以引导多轮合作向更好的决策质量发展,而无需建模或约束人类行为。
PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
Authors: Samah Fodeh, Linhai Ma, Yan Wang, Srivani Talakokkul, Ganesh Puthiaraju, Afshan Khan, Ashley Hagaman, Sarah Lowe, Aimee Roundtree
First: 2026-02-24T18:10:00+00:00 · Latest: 2026-02-24T18:10:00+00:00
Abstract
Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.
中文标题/摘要
标题:PVminer:一种专门用于检测患者声音的工具,针对患者生成数据
患者生成的文本,如安全消息、调查问卷和访谈,包含了丰富的患者声音(PV)表达,反映了沟通行为和社会决定因素(SDoH)。传统的定性编码框架劳动密集且无法扩展到健康系统中大量患者撰写的消息。现有的机器学习(ML)和自然语言处理(NLP)方法提供了部分解决方案,但通常将患者中心的沟通(PCC)和SDoH视为单独的任务,或者依赖于不适用于面向患者的语言的模型。我们介绍了PVminer,这是一种针对安全患者-提供者沟通中患者声音进行结构化处理的领域适应型NLP框架。PVminer将PV检测表述为多标签、多类预测任务,整合了患者特定的BERT编码器(PV-BERT-base和PV-BERT-large)、无监督主题建模进行主题增强(PV-Topic-BERT)以及针对代码、子代码和组合级别标签的微调分类器。主题表示在微调和推理期间被整合,以丰富语义输入。PVminer在层次任务上取得了强劲的表现,并优于生物医学和临床预训练基线,代码级别的F1得分为82.25%,子代码为80.14%,组合级别最高为77.87%。进一步的消融研究显示,作者身份和基于主题的增强各自带来了有意义的提升。预训练模型、源代码和文档将公开发布,对于研究使用,可应要求提供标注数据集。
Summary / 总结
PVminer is a domain-adapted NLP framework for detecting the patient voice in patient-generated data, addressing the limitations of traditional qualitative coding and existing ML/NLP approaches. It formulates detection as a multi-label, multi-class prediction task that combines patient-specific BERT encoders, unsupervised topic modeling, and fine-tuned classifiers. It outperforms biomedical and clinical pre-trained baselines across hierarchical tasks, with F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study shows that author identity and topic-based augmentation each contribute meaningful gains.
PVminer 是一个专门用于检测患者生成数据中患者声音的工具,如安全消息和调查问卷。它使用多标签、多类预测框架,结合患者特定的 BERT 编码器、无监督主题建模和微调分类器来识别 Code、Subcode 和 Combo 级别标签。PVminer 在 Code、Subcode 和 Combo 级别上分别实现了 82.25%、80.14% 和最高 77.87% 的 F1 分数,优于现有的生物医学和临床基线。消融研究显示,作者身份和基于主题的增强显著提高了性能。
SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Authors: Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei
First: 2026-02-24T18:04:54+00:00 · Latest: 2026-02-24T18:04:54+00:00
Abstract
Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.
中文标题/摘要
标题:SELAUR: 自适应演化的大语言模型代理通过不确定性感知奖励
大规模语言模型(LLMs)越来越多地被部署为多步决策代理,有效的奖励设计对于引导学习至关重要。尽管最近的工作探索了各种形式的奖励塑造和步骤级的信用分配,但一个关键信号仍然被很大程度上忽视了:LLMs 的内在不确定性。不确定性反映了模型的信心,揭示了探索所需的地方,并在失败轨迹中提供了有价值的指导线索。我们引入了SELAUR:通过不确定性感知奖励的自适应演化大语言模型代理,这是一种强化学习框架,直接将不确定性纳入奖励设计中。SELAUR 将熵、最小置信度和边际度量结合成一个综合的令牌级不确定性估计,提供密集的信心对齐监督,并采用一种失败感知的奖励重塑机制,将这些不确定性信号注入步骤级和轨迹级奖励中,以提高探索效率和学习稳定性。在两个基准测试ALFWorld和WebShop上的实验表明,我们的方法在成功率上始终优于强大的基线。消融研究进一步证明了不确定性信号如何增强探索和鲁棒性。
Summary / 总结
The research aims to enhance the effectiveness of large language models (LLMs) as multi-step decision-making agents by incorporating uncertainty into reward design. SELAUR, a reinforcement learning framework, uses entropy, least-confidence, and margin-based metrics to estimate token-level uncertainty, providing dense confidence-aligned supervision. It also employs a failure-aware reward reshaping mechanism to inject uncertainty signals into step- and trajectory-level rewards, improving exploration efficiency and learning stability. Experiments on ALFWorld and WebShop show that SELAUR consistently improves success rates over strong baselines.
研究旨在提高大型语言模型(LLMs)在多步决策任务中的奖励设计效果。SELAUR 是一种强化学习框架,通过使用熵、最小置信度和边际基线度量直接将不确定性纳入奖励设计中。这种方法提高了探索效率和学习稳定性,ALFWorld 和 WebShop 基准测试中的实验证明,该方法在成功率上优于强基线。消融研究进一步证实了不确定性信号在探索和鲁棒性方面的优势。
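The three token-level signals named in the abstract (entropy, least confidence, and margin) can each be computed from a next-token probability distribution. The sketch below combines them with equal weights and normalizes entropy to [0, 1]; both choices are assumptions made for illustration, and SELAUR's actual combination and reward reshaping may differ.

```python
import math

def token_uncertainty(probs, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three uncertainty signals for one token.

    probs: next-token probabilities (sum to 1, at least two entries).
    """
    top = sorted(probs, reverse=True)
    # Shannon entropy, divided by log(vocab size) so it lies in [0, 1].
    entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(len(probs))
    least_confidence = 1.0 - top[0]        # low top-1 probability means uncertain
    margin = 1.0 - (top[0] - top[1])       # small top-1/top-2 gap means uncertain
    w_e, w_l, w_m = weights
    return w_e * entropy + w_l * least_confidence + w_m * margin

confident = token_uncertainty([1.0, 0.0, 0.0, 0.0])
uncertain = token_uncertainty([0.25, 0.25, 0.25, 0.25])
```

A per-token score like this could then be averaged over a step or trajectory before being folded into the reward, which is the kind of dense signal the summary describes for failed trajectories.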
Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks
Authors: David Schmotz, Luca Beurer-Kellner, Sahar Abdelnabi, Maksym Andriushchenko
First: 2026-02-23T18:59:27+00:00 · Latest: 2026-02-24T18:03:02+00:00
Abstract
LLM agents are evolving rapidly, powered by code execution, tools, and the recently introduced agent skills feature. Skills allow users to extend LLM applications with specialized third-party code, knowledge, and instructions. Although this can extend agent capabilities to new domains, it creates an increasingly complex agent supply chain, offering new surfaces for prompt injection attacks. We identify skill-based prompt injection as a significant threat and introduce SkillInject, a benchmark evaluating the susceptibility of widely-used LLM agents to injections through skill files. SkillInject contains 202 injection-task pairs with attacks ranging from obviously malicious injections to subtle, context-dependent attacks hidden in otherwise legitimate instructions. We evaluate frontier LLMs on SkillInject, measuring both security in terms of harmful instruction avoidance and utility in terms of legitimate instruction compliance. Our results show that today's agents are highly vulnerable with up to 80% attack success rate with frontier models, often executing extremely harmful instructions including data exfiltration, destructive action, and ransomware-like behavior. They furthermore suggest that this problem will not be solved through model scaling or simple input filtering, but that robust agent security will require context-aware authorization frameworks. Our benchmark is available at https://www.skill-inject.com/.
中文标题/摘要
标题:技能注入:衡量代理对技能文件攻击的脆弱性
LLM代理正在迅速发展,得益于代码执行、工具以及最近引入的代理技能功能。技能允许用户通过专门的第三方代码、知识和指令扩展LLM应用程序的功能。虽然这可以将代理能力扩展到新的领域,但也为提示注入攻击提供了新的攻击面。我们识别出基于技能的提示注入是一个重大威胁,并引入了SkillInject基准,评估广泛使用的LLM代理通过技能文件遭受注入攻击的易感性。SkillInject包含202个注入任务对,攻击范围从明显的恶意注入到隐藏在合法指令中的微妙、情境依赖的攻击。我们对前沿LLM进行了评估,从有害指令的避免和合法指令的遵守两个方面衡量安全性。结果显示,当前的代理高度易受攻击,前沿模型的攻击成功率高达80%,经常执行极其有害的指令,包括数据泄露、破坏性操作和类似勒索软件的行为。此外,这些结果表明,这个问题不会通过模型扩展或简单的输入过滤来解决,而是需要具备上下文感知的授权框架来实现代理的安全性。我们的基准可以在https://www.skill-inject.com/找到。
Summary / 总结
The paper addresses the vulnerability of LLM agents to skill-based prompt injection. Skills extend agent capabilities with third-party code, knowledge, and instructions, but they also open new surfaces for attack. The paper introduces SkillInject, a benchmark of 202 injection-task pairs ranging from obviously malicious injections to subtle, context-dependent ones, and measures both security (avoiding harmful instructions) and utility (complying with legitimate ones). The evaluation shows attack success rates of up to 80% on frontier models, which often execute harmful instructions such as data exfiltration, destructive actions, and ransomware-like behavior. The findings suggest that robust security requires context-aware authorization frameworks rather than model scaling or simple input filtering.
论文关注语言模型(LM)代理对基于技能的提示注入攻击的脆弱性,这些攻击利用第三方技能扩展代理功能。研究引入了SkillInject基准,评估LM代理对这类攻击的易感性。基准包括202个注入任务对,范围从明显的攻击到微妙的攻击。对领先LM的评估显示高脆弱性,成功率高达80%,通常导致有害行为。结果表明,稳健的安全性需要上下文感知的授权框架,而不仅仅是模型扩展或简单的输入过滤。
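The benchmark reports security and utility as separate measurements. A minimal sketch of how such rates could be tallied from per-episode outcomes follows; the field names and the toy episodes are hypothetical and do not reflect SkillInject's actual data format or results.

```python
def security_and_utility(results):
    """results: list of dicts with boolean 'injection_executed' and 'task_completed'.

    Returns (attack_success_rate, utility), each as a fraction of episodes.
    """
    n = len(results)
    attack_success_rate = sum(r["injection_executed"] for r in results) / n
    utility = sum(r["task_completed"] for r in results) / n
    return attack_success_rate, utility

# Hypothetical toy episodes, not benchmark results.
episodes = [
    {"injection_executed": True,  "task_completed": True},
    {"injection_executed": False, "task_completed": True},
    {"injection_executed": False, "task_completed": False},
    {"injection_executed": False, "task_completed": True},
]
asr, utility = security_and_utility(episodes)
```

Tracking the two numbers separately matters because an agent that refuses everything would look secure while being useless.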
From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?
Authors: Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu
First: 2025-12-02T18:31:18+00:00 · Latest: 2026-02-24T18:01:52+00:00
Comments: Accepted by PAKDD 2026 special session on Data Science: Foundations and Applications
Abstract
The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment when doing mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.
中文标题/摘要
标题:从审查到调解:大语言模型能否在在线争吵中担任调解人?
大型语言模型(LLMs)的迅速发展为AI向善的应用开辟了新的可能性。随着LLMs越来越多地调解在线交流,它们促进同理心和建设性对话的潜力成为负责任AI研究的重要前沿。本研究探讨LLMs是否不仅能作为检测有害内容的审查者,还能作为能够理解并缓解在线冲突的调解人。我们的框架将调解分解为两个子任务:判断,即LLM评估对话的公平性和情感动态;引导,即生成同理心、缓解冲突的消息,引导参与者走向解决。为了评估调解质量,我们构建了一个基于Reddit的大规模数据集,并提出了一种结合原则评分、用户模拟和人工对比的多阶段评估管道。实验表明,基于API的模型在调解时在推理和干预一致性方面优于开源版本。我们的研究结果突显了当前LLMs作为在线社会调解新兴代理的潜力和局限性。
Summary / 总结
This study investigates whether large language models (LLMs) can act as mediators in online conflicts, beyond their role as moderators. The research proposes a framework that decomposes mediation into judgment and steering tasks. It constructs a Reddit-based dataset and uses a multi-stage evaluation pipeline to assess mediation quality. The experiments indicate that API-based models outperform open-source models in both reasoning and intervention alignment during mediation, highlighting both the potential and limitations of current LLMs in online social mediation.
研究探讨了大型语言模型(LLMs)是否能够调解在线冲突,通过评估其在判断和引导两个子任务中的表现。研究构建了一个基于Reddit的数据集,并使用多阶段评估管道来评估调解质量。实验表明,API基模型在推理和干预对齐方面均优于开源模型,这既显示了LLMs在线社会调解中的潜力,也揭示了当前的局限性。
CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning
Authors: Ziwei Niu, Hao Sun, Shujun Bian, Xihong Yang, Lanfen Lin, Yuxin Liu, Yueming Jin
Venue: ICASSP 2026
First: 2026-02-24T17:59:21+00:00 · Latest: 2026-02-24T17:59:21+00:00
Comments: Accepted by ICASSP 2026
Abstract
Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases. Recent multimodal approaches that integrate ECGs with accompanying clinical reports show strong potential, but they still face two main concerns from a modality perspective: (1) intra-modality: existing models process ECGs in a lead-agnostic manner, overlooking spatial-temporal dependencies across leads, which restricts their effectiveness in modeling fine-grained diagnostic patterns; (2) inter-modality: existing methods directly align ECG signals with clinical reports, introducing modality-specific biases due to the free-text nature of the reports. In light of these two issues, we propose CG-DMER, a contrastive-generative framework for disentangled multimodal ECG representation learning, powered by two key designs: (1) Spatial-temporal masked modeling is designed to better capture fine-grained temporal dynamics and inter-lead spatial dependencies by applying masking across both spatial and temporal dimensions and reconstructing the missing information. (2) A representation disentanglement and alignment strategy is designed to mitigate unnecessary noise and modality-specific biases by introducing modality-specific and modality-shared encoders, ensuring a clearer separation between modality-invariant and modality-specific representations. Experiments on three public datasets demonstrate that CG-DMER achieves state-of-the-art performance across diverse downstream tasks.
中文标题/摘要
标题:CG-DMER:混合对比生成框架用于解耦的多模态心电图表示学习
准确解读心电图(ECG)信号对于诊断心血管疾病至关重要。最近将ECG与伴随的临床报告结合的多模态方法显示出强大的潜力,但仍存在两个主要的模态问题:(1)模内问题:现有模型以导联无关的方式处理ECG,忽视了导联间的时空依赖性,限制了其在建模细微诊断模式方面的有效性;(2)模间问题:现有方法直接将ECG信号与临床报告对齐,由于报告的自由文本性质,引入了模态特定的偏差。针对这两个问题,我们提出CG-DMER,一种基于对比生成框架的解耦多模态ECG表示学习方法,通过两个关键设计:(1)时空掩码建模旨在通过在时空维度上应用掩码并重建缺失信息,更好地捕捉细微的时间动态和导联间的空间依赖性;(2)一种表示解耦和对齐策略旨在通过引入模态特定和模态共享编码器,减少不必要的噪声和模态特定偏差,确保模态不变和模态特定表示之间的清晰分离。在三个公开数据集上的实验表明,CG-DMER在多种下游任务中达到了最先进的性能。
Summary / 总结
CG-DMER is a hybrid contrastive-generative framework designed to improve ECG representation learning by addressing intra-modality and inter-modality issues. It uses spatial-temporal masked modeling to capture fine-grained temporal dynamics and inter-lead spatial dependencies, and a representation disentanglement and alignment strategy, built on modality-specific and modality-shared encoders, to reduce modality-specific biases. Experiments on three public datasets show that CG-DMER achieves state-of-the-art performance across diverse downstream tasks.
CG-DMER 是一种混合对比生成框架,旨在通过解决模内和模间问题来提高 ECG 信号的解释能力。它通过时空掩码建模来捕捉细粒度的时间动态和跨导联的空间依赖性,并通过引入模态特定和模态共享编码器来减少模态特定偏差,实现模态不变和模态特定表示的清晰分离。实验表明,CG-DMER 在各种下游任务中优于现有方法。
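Masking across both spatial (lead) and temporal dimensions can be sketched as a boolean grid over leads and time patches. The version below hides whole leads and whole time patches chosen at random; that particular scheme, the ratios, and the patch count are assumptions for illustration, since the abstract does not specify the masking layout.

```python
import random

def spatiotemporal_mask(n_leads, n_patches, lead_ratio, time_ratio, seed=0):
    """Return mask[lead][patch]; True means hide the patch and reconstruct it."""
    rng = random.Random(seed)
    masked_leads = set(rng.sample(range(n_leads), int(n_leads * lead_ratio)))
    masked_times = set(rng.sample(range(n_patches), int(n_patches * time_ratio)))
    return [[(lead in masked_leads) or (t in masked_times)
             for t in range(n_patches)]
            for lead in range(n_leads)]

# A 12-lead ECG split into 10 time patches (hypothetical sizes).
mask = spatiotemporal_mask(n_leads=12, n_patches=10, lead_ratio=0.25, time_ratio=0.3)
n_masked = sum(sum(row) for row in mask)
```

Hiding an entire lead forces reconstruction from the other leads (spatial dependency), while hiding a time patch across all leads forces reconstruction from neighbouring time steps (temporal dynamics).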
A Very Big Video Reasoning Suite
Authors: Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Fengyuan Yu, Weihang Xiao, Yizheng Jiao, Jianheng Hou, Danyang Zhang, Pengcheng Xu, Boyang Zhong, Zehong Zhao, Gaoyun Fang, John Kitaoka, Yile Xu, Hua Xu, Kenton Blacutt, Tin Nguyen, Siyuan Song, Haoran Sun, Shaoyue Wen, Linyang He, Runming Wang, Yanzhi Wang, Mengyue Yang, Ziqiao Ma, Raphaël Millière, Freda Shi, Nuno Vasconcelos, Daniel Khashabi, Alan Yuille, Yilun Du, Ziming Liu, Bo Li, Dahua Lin, Ziwei Liu, Vikash Kumar, Yijiang Li, Lei Yang, Zhongang Cai, Hokin Deng
First: 2026-02-23T18:59:41+00:00 · Latest: 2026-02-24T17:59:15+00:00
Comments: Homepage: https://video-reason.com/
Abstract
Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are publicly available at https://video-reason.com/ .
中文标题/摘要
标题:一个非常大的视频推理套件
视频模型的快速发展主要集中在视觉质量上,而对其推理能力的探索则相对不足。视频推理将智能置于时空一致的视觉环境中,超越了文本所能自然捕捉的内容,使人们能够直观地推理时空结构,如连续性、交互性和因果关系。然而,系统地研究视频推理及其扩展行为受到大规模训练数据缺乏的阻碍。为解决这一问题,我们引入了非常大的视频推理(VBVR)数据集,这是一个前所未有的大规模资源,涵盖了200个经过精心分类的推理任务,涉及超过一百万段视频片段,比现有数据集大三个数量级。我们还提出了VBVR-Bench,这是一种可验证的评估框架,通过引入基于规则、与人类对齐的评分者,超越基于模型的评判,实现可重复和可解释的视频推理能力诊断。利用VBVR套件,我们进行了第一个大规模的视频推理扩展研究,并观察到对未见过的推理任务出现早期泛化的迹象。总体而言,VBVR为可泛化视频推理的下一阶段研究奠定了基础。数据、基准工具包和模型可在https://video-reason.com/ 公开获取。
Summary / 总结
This paper addresses the underexplored area of video reasoning, introducing the Very Big Video Reasoning (VBVR) Dataset, which spans 200 curated reasoning tasks and over one million video clips, approximately three orders of magnitude larger than existing datasets. The authors also present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by using rule-based, human-aligned scorers. A large-scale scaling study shows early signs of emergent generalization to unseen reasoning tasks. The data, benchmark toolkit, and models are publicly available.
该论文通过引入包含超过一百万视频片段和200个精心策划推理任务的Very Big Video Reasoning (VBVR) 数据集,解决了视频推理这一尚未充分探索的领域。作者还提出了VBVR-Bench,这是一种通过基于规则的评分器和人类对齐的评分器来评估视频推理能力的可验证评估框架。研究揭示了对未见推理任务的早期泛化迹象,标志着向可泛化的视频推理迈出重要一步。
Complexity-aware fine-tuning
Authors: Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev
First: 2025-06-26T13:13:24+00:00 · Latest: 2026-02-24T17:50:18+00:00
Abstract
General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains. Better results can be achieved by distilling the chain-of-thought of a larger model at the cost of numerous expensive calls and a much greater amount of data. We propose a novel blueprint for efficient fine-tuning that uses reasoning only for complex data identified by entropy. Specifically, across three small open models (≈3B) we split the training data into complexity categories by a single token answer entropy (ROC AUC 0.73), fine-tune large language models (LLMs) via SFT and distillation, and show that our pipeline significantly outperforms the standard SFT approach (0.58 vs 0.45 average accuracy) and outperforms the distillation approach (0.58 vs 0.56 average accuracy) while using 81% less data.
中文标题/摘要
标题:复杂性意识微调
通用大型语言模型(LLMs)经常通过监督微调(SFT)进行微调以增强特定领域的性能。通过提炼大模型的推理过程可以在成本增加和数据量大幅增加的情况下获得更好的结果。我们提出了一种新的高效微调蓝图,仅对由熵识别的复杂数据进行推理。具体来说,在三个小型开源模型(≈3B)中,我们通过单一标记答案熵将训练数据分为复杂性类别(ROC AUC 0.73),通过SFT和蒸馏微调大型语言模型(LLMs),并展示了我们的流程显著优于标准SFT方法(平均准确率0.58 vs 0.45)和蒸馏方法(平均准确率0.58 vs 0.56),同时使用了81%少的数据。
Summary / 总结
This work aims to make domain-specific fine-tuning of large language models (LLMs) more efficient with a complexity-aware approach. The method splits the training data into complexity categories by single-token answer entropy, then applies supervised fine-tuning (SFT) and distillation selectively, using reasoning only for the data identified as complex. Across three small open models (≈3B), the pipeline reaches 0.58 average accuracy, compared with 0.45 for standard SFT and 0.56 for distillation, while using 81% less data.
研究旨在通过复杂性感知的方法提高大型语言模型(LLM)在特定领域的微调效率。方法是根据单个标记答案的熵将训练数据划分为复杂性类别,并在复杂数据上选择性地应用监督微调和蒸馏。结果表明,该方法在准确性和数据使用量方面均优于标准的监督微调和蒸馏方法。
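The entropy-based split described above can be illustrated with a minimal sketch. It assumes the model exposes a probability distribution over candidate single-token answers; the thresholds `low` and `high`, the three category names, and the example distributions are hypothetical and are not taken from the paper.

```python
import math

def answer_entropy(probs):
    """Shannon entropy (in nats) of a distribution over single-token answers."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def complexity_bucket(probs, low=0.5, high=1.0):
    """Map an answer distribution to a complexity category.

    `low` and `high` are hypothetical thresholds chosen for illustration.
    """
    h = answer_entropy(probs)
    if h < low:
        return "easy"
    if h < high:
        return "medium"
    return "hard"

# Hypothetical answer distributions over four options (A, B, C, D).
examples = {
    "confident": [0.97, 0.01, 0.01, 0.01],
    "unsure": [0.80, 0.10, 0.05, 0.05],
    "guessing": [0.25, 0.25, 0.25, 0.25],
}
buckets = {name: complexity_bucket(p) for name, p in examples.items()}
print(buckets)
```

Under this sketch, only the examples that land in the high-entropy category would be routed to the more expensive reasoning distillation, while the rest would go through plain SFT; that routing rule is an assumption about how the categories might be used.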
Scaling State-Space Models on Multiple GPUs with Tensor Parallelism
Authors: Anurag Dutt, Nimit Shah, Hazem Masarani, Anshul Gandhi
First: 2026-02-24T17:47:54+00:00 · Latest: 2026-02-24T17:47:54+00:00
Comments: Submitted to 46th IEEE International Conference on Distributed Computing Systems (ICDCS 2026)
Abstract
Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU, making multi-GPU execution increasingly necessary. Although tensor parallelism (TP) is widely used to scale Transformer inference, applying it to selective SSM blocks is non-trivial because the SSM mixer couples large projections with a sequence-wise recurrent state update and local mixing whose efficiency depends on preserving locality and avoiding synchronization in the critical path. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges: enabling TTFT improvements via an SSM state cache across prefill and decode, partitioning the mixer's packed parameter tensor so that recurrent updates remain local while minimizing communication, and reducing TP aggregation overhead with quantized AllReduce. We evaluate on three representative SSM-based LLMs spanning pure-SSM and hybrid architectures - Mamba, Falcon-Mamba, and Zamba - on NVIDIA A6000 and A100 clusters. Our experiments show substantial throughput gains from tensor-parallel SSM inference, improving batch-request throughput by ~1.6-2.1x on 2 GPUs and ~2.6-4.0x on 4 GPUs for Mamba, with the largest benefits at long context lengths, and achieving a further ~10-18% throughput improvement from quantized all-reduce by lowering synchronization bandwidth overhead.
中文标题/摘要
标题:使用张量并行在多个GPU上扩展状态空间模型
选择性状态空间模型(SSMs)已成为大型语言模型的强大基础,尤其是在处理长上下文任务时。然而,在部署过程中,其推理性能往往受限于单个GPU的内存容量、带宽和延迟限制,使得多GPU执行变得越来越必要。尽管张量并行(TP)广泛用于扩展Transformer推理,但将其应用于选择性SSM块并不简单,因为SSM混合器将大规模投影与序列相关的递归状态更新和局部混合结合在一起,其效率取决于保持局部性和避免关键路径中的同步。 本文提出了一种针对选择性SSM推理的通信高效TP设计,以解决三个实际工程挑战:通过SSM状态缓存跨预填充和解码启用TTFT改进,将混合器的打包参数张量分区,以便递归更新保持局部性同时减少通信,以及通过量化AllReduce减少TP聚合开销。我们在NVIDIA A6000和A100集群上的三种代表性SSM基础的大规模语言模型(LLM)- Mamba、Falcon-Mamba和Zamba上进行了评估。我们的实验表明,张量并行SSM推理带来了显著的吞吐量提升,对于Mamba,在2个GPU上提高了约1.6-2.1倍,在4个GPU上提高了约2.6-4.0倍,且在长上下文长度时收益最大,并通过降低同步带宽开销进一步实现了约10-18%的吞吐量提升。
Summary / 总结
This paper addresses the challenge of scaling selective state space models (SSMs) for large language models by proposing a communication-efficient tensor parallelism (TP) design. The method tackles three key engineering challenges: enabling TTFT improvements via an SSM state cache, partitioning the mixer's packed parameter tensor to maintain locality, and reducing TP aggregation overhead with quantized AllReduce. The experiments show significant throughput gains, with batch-request throughput improving by 1.6-2.1x on 2 GPUs and 2.6-4.0x on 4 GPUs for Mamba, and an additional 10-18% improvement from quantized all-reduce on NVIDIA A6000 and A100 clusters.
本文提出了一种通信高效的张量并行(TP)设计,以解决扩展选择性状态空间模型(SSMs)用于大型语言模型的问题。该方法解决了三个关键工程挑战:通过SSM状态缓存实现TTFT改进、将混合器的打包参数张量分区以保持局部性,并通过量化AllReduce降低TP聚合开销。评估覆盖Mamba、Falcon-Mamba和Zamba三种模型。实验表明,对于Mamba,批量请求吞吐量在2个GPU上提高了约1.6-2.1倍,在4个GPU上提高了约2.6-4.0倍,在长上下文长度时收益最大;量化AllReduce进一步带来约10-18%的吞吐量提升。
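To make the idea of a quantized AllReduce concrete, here is a small single-process simulation. It is a sketch under stated assumptions, not the paper's scheme: it uses symmetric per-tensor int8 quantization, represents each GPU rank as a plain Python list, and performs no real collective communication. All function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]

def simulated_quantized_allreduce(partials):
    """Sum per-rank partial outputs after each has been quantized to int8.

    In a real system the codes and the scale would be what travels over the
    interconnect; here the "ranks" are just entries of a list.
    """
    total = [0.0] * len(partials[0])
    for rank_values in partials:
        codes, scale = quantize_int8(rank_values)
        for i, v in enumerate(dequantize(codes, scale)):
            total[i] += v
    return total

# Two hypothetical ranks, each holding a partial result of a sharded projection.
partials = [[0.5, -1.25, 2.0], [1.5, 0.25, -0.75]]
exact = [sum(col) for col in zip(*partials)]
approx = simulated_quantized_allreduce(partials)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
print(exact, approx, max_error)
```

The trade-off this illustrates is that each element is sent in one byte plus a shared scale instead of a wider floating-point format, at the cost of a rounding error bounded by half a quantization step per rank.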
A Benchmark for Deep Information Synthesis
Authors: Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov, Lena Sophia Bolliger, Aysim Toker, Roy Miles, Andreea-Maria Oncescu, Jasivan Alex Sivakumar, Philipp Borchert, Ismail Elezi, Meiru Zhang, Ka Yiu Lee, Guchun Zhang, Jun Wang, Gerasimos Lampouras
Venue: ICLR 2026
First: 2026-02-24T17:43:32+00:00 · Latest: 2026-02-24T17:43:32+00:00
Comments: Accepted at ICLR 2026
Abstract
Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.
中文标题/摘要
标题:深度信息综合基准
基于大型语言模型(LLM)的代理越来越多地用于解决涉及工具使用(如网络浏览、代码执行和数据分析)的复杂任务。然而,当前的评估基准未能充分评估它们解决现实任务的能力,这类任务需要从多个来源综合信息,并推断出超越简单事实检索的见解。为解决这一问题,我们引入了DEEPSYNTH,这是一种新型基准,旨在评估代理在结合信息收集、综合和结构化推理以产生见解的现实、耗时问题上的表现。DEEPSYNTH 包含 120 个任务,横跨 7 个领域,数据源覆盖 67 个国家。DEEPSYNTH 是通过一个多阶段数据收集管道构建的,要求注释者收集官方数据源、创建假设、进行手动分析并设计具有可验证答案的任务。在 DEEPSYNTH 上评估时,11 个最先进的 LLM 和深度研究代理的最高 F1 分数仅为 8.97,在 LLM-judge 指标上最高为 17.5,突显了该基准的难度。我们的分析表明,当前的代理在幻觉和大信息空间上的推理方面存在困难,突显了 DEEPSYNTH 作为指导未来研究的关键基准的意义。
Summary / 总结
The paper introduces DEEPSYNTH, a benchmark designed to evaluate large language model-based agents on realistic, time-consuming tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. The benchmark contains 120 tasks collected across 7 domains, with data sources covering 67 countries. The 11 state-of-the-art LLMs and deep research agents evaluated reach at most an F1 score of 8.97 and 17.5 on the LLM-judge metric, indicating the difficulty of the tasks. The analysis shows that current agents struggle with hallucinations and with reasoning over large information spaces.
论文介绍了DEEPSYNTH基准,用于评估基于大型语言模型的代理在需要从多个来源综合信息并进行结构化推理的复杂任务上的能力。该基准包含120个任务,横跨7个领域,数据源覆盖67个国家。参与评估的11个最先进的LLM和深度研究代理的最高F1分数仅为8.97,在LLM-judge指标上最高为17.5,表明任务的难度。分析表明,当前的代理在幻觉和大信息空间上的推理方面存在困难。
LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis
Authors: Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P. Kulkarni, Ziyue Xu, Daguang Xu, Syed Muhammad Anwar, Holger R. Roth, Marius George Linguraru
Venue: ISBI
First: 2026-02-24T17:42:46+00:00 · Latest: 2026-02-24T17:42:46+00:00
Comments: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Abstract
Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making by the analysis of radiology imaging data such as chest X-rays (CXR) via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are essential for accurate diagnosis and prognosis. The manual longitudinal analysis is a time-consuming process, motivating the development of a training framework that can provide prognostic capabilities. We introduce a novel training framework LUMEN, that is optimized for longitudinal CXR interpretation, leveraging multi-image and multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. We conduct experiments on the publicly available MIMIC-CXR and its associated Medical-Diff-VQA datasets. We further formulate and construct a novel instruction-following dataset incorporating longitudinal studies, enabling the development of a prognostic VQA task. Our method demonstrates significant improvements over baseline models in diagnostic VQA tasks, and more importantly, shows promising potential for prognostic capabilities. These results underscore the value of well-designed, instruction-tuned VLMs in enabling more accurate and clinically meaningful radiological interpretation of longitudinal radiological imaging data.
中文标题/摘要
标题:LUMEN:纵向多模态放射学模型用于预后和诊断
大型视觉-语言模型(VLMs)已从通用应用发展到临床领域的专业用途,展示了在放射学领域决策支持的潜力。一种有前景的应用是通过视觉和自然语言问答(VQA)界面分析放射学影像数据(如胸部X光片),辅助放射科医生进行决策。当有纵向影像时,放射科医生会分析时间变化,这对于准确诊断和预后至关重要。手动的纵向分析是一个耗时的过程,推动了纵向影像解释训练框架的发展。我们介绍了一种新的训练框架LUMEN,该框架针对纵向胸部X光片解释进行了优化,利用多图像和多任务指令微调来增强预后和诊断性能。我们在公开的MIMIC-CXR及其相关Medical-Diff-VQA数据集上进行了实验。我们进一步制定了一个包含纵向研究的新指令遵循数据集,以促进预后VQA任务的发展。我们的方法在诊断VQA任务中显著优于基线模型,并且更重要的是,展示了预后能力的前景。这些结果强调了精心设计、指令调优的VLMs在纵向放射学影像数据放射学解释中的价值。
Summary / 总结
LUMEN is a training framework designed for longitudinal analysis of chest X-rays to improve diagnostic and prognostic capabilities in radiology. It leverages multi-image and multi-task instruction fine-tuning of large vision-language models. Experiments on MIMIC-CXR and Medical-Diff-VQA datasets show significant improvements in diagnostic VQA tasks and promising potential for prognostic capabilities.
LUMEN 是一种通过利用多图像和多任务指令微调来增强纵向胸部X光(CXR)分析的预测和诊断能力的训练框架。在 MIMIC-CXR 和 Medical-Diff-VQA 数据集上的实验显示,在诊断 VQA 任务中显著优于基线模型,并且在预测能力方面表现出有希望的潜力。
UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics
Authors: Joseph Raj Vishal, Nagasiri Poluri, Katha Naik, Rutuja Patil, Kashyap Hegde Kota, Krishna Vinod, Prithvi Jai Ramesh, Mohammad Farhadi, Yezhou Yang, Bharatesh Chakravarthi
First: 2026-02-24T17:33:12+00:00 · Latest: 2026-02-24T17:33:12+00:00
Abstract
Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap, showing models that excel in abstract inference often fail with fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro, and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/.
中文标题/摘要
标题:UDVideoQA:城市动态视频问答数据集,用于城市动力学多对象时空推理
理解复杂的城市交通多智能体动态仍然是视频语言模型的基本挑战。本文介绍了Urban Dynamics VideoQA,这是一个基准数据集,捕捉了动态城市场景的非剧本化现实行为。UDVideoQA 从多个城市交叉口的 16 小时交通录像中精心挑选,涵盖了多种交通、天气和照明条件。它使用事件驱动的动态模糊技术确保隐私保护,同时不牺牲场景保真度。通过统一的注释流水线,数据集包含 8 小时密集注释视频中的 28K 问答对,平均每秒一个问题。其分类法遵循分层推理级别,从基本理解到事件推理、逆向推理和反事实推理,使视觉定位和因果推理的系统评估成为可能。全面的实验在 UDVideoQA 上基准测试了 10 个 SOTA 视频语言模型,在 VideoQGen 上基准测试了 8 个模型。结果揭示了感知-推理差距的持续存在,表明在抽象推理方面表现优异的模型往往在基本视觉定位方面失败。虽然像 Gemini Pro 这样的模型在零样本准确性上达到最高,但对 UDVideoQA 的微调较小的 Qwen2.5-VL 7B 模型填补了这一差距,实现了与专有系统相当的性能。在 VideoQGen 中,Gemini 2.5 Pro 和 Qwen3 Max 生成最相关和复杂的问答,尽管所有模型都表现出有限的语言多样性,突显了人类中心评估的必要性。UDVideoQA 套件包括数据集、注释工具以及用于视频问答和视频问答生成的基准测试,为推进稳健、隐私意识和现实世界多模态推理提供了基础。UDVideoQA 可在 https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/ 获取。
Summary / 总结
UDVideoQA is a benchmark dataset for traffic video question answering that captures unscripted real-world urban dynamics. It is curated from 16 hours of traffic footage and contains 28K question-answer pairs generated across 8 hours of densely annotated video, targeting multi-object spatio-temporal reasoning. Experiments on 10 state-of-the-art video language models reveal a persistent perception-reasoning gap: models that excel at abstract inference often fail at basic visual grounding. Fine-tuning Qwen2.5-VL 7B on UDVideoQA bridges this gap, reaching performance comparable to proprietary systems. The suite covers both VideoQA and VideoQGen benchmarks, supporting robust and privacy-aware multimodal reasoning.
UDVideoQA 是一个用于评估城市交通动态中多对象时空推理的基准数据集,取材于16小时的交通录像,并在其中8小时密集标注的视频上生成了28K个问答对,侧重于真实场景。该数据集涵盖多种交通、天气和光照条件,并采用隐私保护技术。实验显示,模型在抽象推理和视觉定位之间存在差距,在UDVideoQA上微调Qwen2.5-VL 7B可以弥合这一差距。该套件包括注释工具和用于视频问答和视频问题生成的基准测试。
SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
Authors: David Anugraha, Vishakh Padmakumar, Diyi Yang
First: 2026-02-24T17:33:02+00:00 · Latest: 2026-02-24T17:33:02+00:00
Abstract
Qualitative insights from user experiences are critical for informing product and policy decisions, but collecting such data at scale is constrained by the time and availability of experts to conduct semi-structured interviews. Recent work has explored using large language models (LLMs) to automate interviewing, yet existing systems lack a principled mechanism for balancing systematic coverage of predefined topics with adaptive exploration, or the ability to pursue follow-ups, deep dives, and emergent themes that arise organically during conversation. In this work, we formulate adaptive semi-structured interviewing as an optimization problem over the interviewer's behavior. We define interview utility as a trade-off between coverage of a predefined interview topic guide, discovery of relevant emergent themes, and interview cost measured by length. Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility. We evaluate SparkMe through controlled experiments with LLM-based interviewees, showing that it achieves higher interview utility, improving topic guide coverage (+4.7% over the best baseline) and eliciting richer emergent insights while using fewer conversational turns than prior LLM interviewing approaches. We further validate SparkMe in a user study with 70 participants across 7 professions on the impact of AI on their workflows. Domain experts rate SparkMe as producing high-quality adaptive interviews that surface helpful profession-specific insights not captured by prior approaches. The code, datasets, and evaluation protocols for SparkMe are available as open-source at https://github.com/SALT-NLP/SparkMe.
中文标题/摘要
标题:SparkMe:适应性半结构化访谈以发现定性洞察
用户经验中的定性洞察对于指导产品和政策决策至关重要,但大规模收集此类数据受限于能够开展半结构化访谈的专家的时间与可用性。近期研究探索了使用大型语言模型(LLMs)自动化访谈,但现有系统缺乏一种有原则的机制,来平衡对预定义主题的系统覆盖与适应性探索,即追问、深入探讨以及在对话中自然浮现的新兴主题。在本研究中,我们将适应性半结构化访谈形式化为对访谈者行为的优化问题。我们将访谈效用定义为预定义访谈主题指南的覆盖、相关新兴主题的发现以及访谈成本(以长度衡量)之间的权衡。基于此形式化,我们引入了SparkMe,这是一种多智能体LLM访谈者,通过模拟对话推演进行审慎规划,以选择具有高预期效用的问题。我们通过与基于LLM的受访者进行的受控实验评估SparkMe,结果显示它实现了更高的访谈效用,提高了主题指南覆盖(比最佳基线高4.7%),并引出了更丰富的新兴洞察,同时使用了比先前的LLM访谈方法更少的对话轮次。我们进一步通过一项用户研究验证了SparkMe:70名来自7个职业的参与者就AI对其工作流程的影响接受访谈。领域专家认为SparkMe生成了高质量的适应性访谈,揭示了先前方法未捕捉到的、有帮助的职业特定洞察。SparkMe的代码、数据集和评估协议已作为开源资源发布在https://github.com/SALT-NLP/SparkMe/上。
Summary / 总结
SparkMe is designed to make automated semi-structured interviews more effective by balancing systematic coverage of predefined topics with adaptive exploration of emergent themes. It is a multi-agent LLM interviewer that performs deliberative planning through simulated conversation rollouts, selecting questions with high expected utility. In controlled experiments with LLM-based interviewees, SparkMe improves topic guide coverage by 4.7% over the best baseline and elicits richer emergent insights while using fewer conversational turns. A user study with 70 participants across 7 professions further validates SparkMe: domain experts rate its interviews as high-quality and adaptive, surfacing profession-specific insights not captured by prior approaches.
SparkMe旨在通过平衡系统性覆盖预定义主题与探索新兴主题来提高半结构化访谈的效率和效果。它使用多智能体大型语言模型通过模拟对话演练进行有计划的规划,选择具有高预期效用的问题。实验评估表明,SparkMe相比最佳基线提高了4.7%的主题指南覆盖率,并且在较少的对话回合中获得了更丰富的见解。来自不同职业的70名参与者的研究进一步验证了SparkMe生成高质量适应性访谈的能力,这些访谈揭示了以前方法未捕捉到的专业特定见解。
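The utility trade-off and the rollout-based question selection described above can be sketched as follows. This is an illustration only: the linear form of the utility, the weights `w_cov`, `w_new`, and `w_cost`, and the rollout numbers are all assumptions, since the abstract names the three ingredients but not how they are combined.

```python
from statistics import mean

def interview_utility(coverage, emergent, turns, w_cov=1.0, w_new=0.5, w_cost=0.02):
    """Hypothetical linear utility: reward coverage and emergent themes, penalize length."""
    return w_cov * coverage + w_new * emergent - w_cost * turns

def pick_question(rollouts):
    """Choose the candidate question with the highest mean utility over its rollouts.

    `rollouts` maps each candidate question to a list of simulated outcomes,
    each an (coverage, emergent, turns) tuple.
    """
    expected = {
        question: mean(interview_utility(*outcome) for outcome in outcomes)
        for question, outcomes in rollouts.items()
    }
    best = max(expected, key=expected.get)
    return best, expected

# Two hypothetical candidate questions, each with two simulated conversation rollouts.
rollouts = {
    "follow_up": [(0.6, 0.4, 10), (0.6, 0.2, 10)],
    "next_topic": [(0.7, 0.0, 10), (0.7, 0.1, 10)],
}
best, expected = pick_question(rollouts)
print(best, expected)
```

In this toy setting the follow-up question wins because its rollouts surface more emergent themes, even though moving to the next topic would cover slightly more of the guide.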
SOM-VQ: Topology-Aware Tokenization for Interactive Generative Models
Authors: Alessandro Londei, Denise Lanzieri, Matteo Benati
First: 2026-02-24T17:29:04+00:00 · Latest: 2026-02-24T17:29:04+00:00
Abstract
Vector-quantized representations enable powerful discrete generative models but lack semantic structure in token space, limiting interpretable human control. We introduce SOM-VQ, a tokenization method that combines vector quantization with Self-Organizing Maps to learn discrete codebooks with explicit low-dimensional topology. Unlike standard VQ-VAE, SOM-VQ uses topology-aware updates that preserve neighborhood structure: nearby tokens on a learned grid correspond to semantically similar states, enabling direct geometric manipulation of the latent space. We demonstrate that SOM-VQ produces more learnable token sequences in the evaluated domains while providing an explicit navigable geometry in code space. Critically, the topological organization enables intuitive human-in-the-loop control: users can steer generation by manipulating distances in token space, achieving semantic alignment without frame-level constraints. We focus on human motion generation - a domain where kinematic structure, smooth temporal continuity, and interactive use cases (choreography, rehabilitation, HCI) make topology-aware control especially natural - demonstrating controlled divergence and convergence from reference sequences through simple grid-based sampling. SOM-VQ provides a general framework for interpretable discrete representations applicable to music, gesture, and other interactive generative domains.
中文标题/摘要
标题:SOM-VQ:面向交互式生成模型的拓扑感知词元化
向量量化表示能够支撑强大的离散生成模型,但在词元空间中缺乏语义结构,限制了可解释的人类控制。我们引入了SOM-VQ,这是一种结合向量量化与自组织映射的词元化方法,以学习具有显式低维拓扑结构的离散码本。与标准的VQ-VAE不同,SOM-VQ 使用拓扑感知的更新方法来保持邻域结构:在学习到的网格上相邻的词元对应于语义相似的状态,从而能够直接对潜在空间进行几何操作。我们证明了SOM-VQ在所评估的领域中生成了更易学习的词元序列,同时在码空间中提供了显式的可导航几何结构。关键的是,拓扑组织使直观的人在回路控制成为可能:用户可以通过操纵词元空间中的距离来引导生成,从而实现语义对齐,而无需帧级约束。我们专注于人体运动生成:该领域具有运动学结构、平滑的时间连续性和交互式用例(编舞、康复、人机交互),使拓扑感知控制尤为自然。我们通过简单的基于网格的采样,展示了相对参考序列的受控发散与收敛。SOM-VQ 提供了一种适用于音乐、手势和其他交互生成领域的可解释离散表示的一般框架。
Summary / 总结
SOM-VQ is a tokenization method that combines vector quantization with Self-Organizing Maps to learn discrete codebooks with explicit low-dimensional topology. Its topology-aware updates preserve neighborhood structure, so nearby tokens on the learned grid correspond to semantically similar states and the latent space can be manipulated geometrically. In the evaluated domains, SOM-VQ produces more learnable token sequences and provides an explicit navigable geometry in code space. This enables intuitive human-in-the-loop control of human motion generation, with use cases such as choreography and rehabilitation: users steer generation by manipulating distances in token space, without frame-level constraints.
SOM-VQ提出了一种结合向量量化和自组织映射的拓扑感知分词方法,以学习具有显式低维拓扑的离散码本。该方法保留了邻域结构,使用户能够几何地操纵潜在空间,并提供直观的人机交互控制。SOM-VQ展示了更好的可学习分词序列和代码空间中的显式可导航几何结构,特别是在人体运动生成领域,它允许通过简单的网格采样从参考序列中进行受控的发散和收敛。
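A topology-aware codebook update can be sketched with the classic Kohonen Self-Organizing Map rule. This is an assumption standing in for the paper's update, whose exact form the abstract does not give: the nearest code is found as in ordinary vector quantization, then that code and its grid neighbors are pulled toward the input, weighted by a Gaussian of their distance on the grid. The learning rate `lr`, the width `sigma`, and the 2x2 codebook are hypothetical.

```python
import math

def nearest_code(codebook, x):
    """Grid index (row, col) of the code vector closest to x: the VQ assignment."""
    return min(
        codebook,
        key=lambda rc: sum((c - xi) ** 2 for c, xi in zip(codebook[rc], x)),
    )

def som_update(codebook, x, lr=0.5, sigma=1.0):
    """Pull the best-matching code and its grid neighbors toward x, in place."""
    bmu = nearest_code(codebook, x)
    for (r, c), vec in list(codebook.items()):
        grid_dist2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
        h = math.exp(-grid_dist2 / (2 * sigma ** 2))  # neighborhood weight on the grid
        codebook[(r, c)] = [v + lr * h * (xi - v) for v, xi in zip(vec, x)]
    return bmu

# Hypothetical 2x2 grid of 2-D code vectors.
codebook = {
    (0, 0): [0.0, 0.0],
    (0, 1): [1.0, 0.0],
    (1, 0): [0.0, 1.0],
    (1, 1): [1.0, 1.0],
}
bmu = som_update(codebook, [0.2, 0.0])
print(bmu, codebook)
```

Because grid neighbors move together, codes that sit close on the grid end up close in feature space, which is the property that makes distances in token space meaningful for steering.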
"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems
Authors: Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng, Wei Dong, Xiaofeng Wang
First: 2026-02-24T17:23:11+00:00 · Latest: 2026-02-24T17:23:11+00:00
Abstract
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare. However, this deepening trust introduces a novel attack surface: Agent-Mediated Deception (AMD), where compromised agents are weaponized against their human users. While extensive research focuses on agent-centric threats, human susceptibility to deception by a compromised agent remains unexplored. We present the first large-scale empirical study with 303 participants to measure human susceptibility to AMD. This is based on HAT-Lab (Human-Agent Trust Laboratory), a high-fidelity research platform we develop, featuring nine carefully crafted scenarios spanning everyday and professional domains (e.g., healthcare, software development, human resources). Our 10 key findings reveal significant vulnerabilities and provide future defense perspectives. Specifically, only 8.6% of participants perceive AMD attacks, while domain experts show increased susceptibility in certain scenarios. We identify six cognitive failure modes in users and find that their risk awareness often fails to translate to protective behavior. The defense analysis reveals that effective warnings should interrupt workflows with low verification costs. With experiential learning based on HAT-Lab, over 90% of users who perceive risks report increased caution against AMD. This work provides empirical evidence and a platform for human-centric agent security research.
中文标题/摘要
标题:"你确定吗?": 大型语言模型驱动代理系统中人类感知脆弱性的一项实证研究
大型语言模型(LLM)代理正在迅速成为软件开发和医疗保健等高风险领域值得信赖的副驾。然而,这种不断加深的信任引入了一个新的攻击面:代理介导的欺骗(AMD),其中被篡改的代理被武器化以对抗其人类用户。尽管大量研究集中在代理中心的威胁上,但人类对被篡改代理欺骗的易感性仍未被探索。我们首次进行了一项大规模实证研究,有303名参与者来衡量人类对AMD的易感性。这基于HAT-Lab(人类-代理信任实验室),这是一个我们开发的高保真研究平台,包括九个精心设计的场景,涵盖日常生活和专业领域(例如,医疗保健、软件开发、人力资源)。我们的10项关键发现揭示了显著的脆弱性,并提供了未来防御视角。具体来说,只有8.6%的参与者察觉到AMD攻击,而领域专家在某些场景中显示出更高的易感性。我们识别出用户中的六种认知失败模式,并发现他们的风险意识往往未能转化为保护性行为。防御分析表明,有效的警告应该中断具有低验证成本的工作流程。基于HAT-Lab的经验学习,超过90%察觉风险的用户报告了对AMD的更大警惕。这项工作提供了实证证据和一个人类为中心的代理安全研究平台。
Summary / 总结
This study investigates human susceptibility to Agent-Mediated Deception (AMD) in LLM-driven agentic systems through a large-scale empirical experiment with 303 participants on HAT-Lab, a high-fidelity research platform featuring nine scenarios across everyday and professional domains. Only 8.6% of participants perceive AMD attacks, and domain experts show increased susceptibility in certain scenarios. The study identifies six cognitive failure modes and finds that risk awareness often fails to translate into protective behavior; its defense analysis indicates that effective warnings should interrupt workflows with low verification costs. After experiential learning on HAT-Lab, over 90% of users who perceive risks report increased caution against AMD.
本研究通过使用HAT-Lab(人类-代理信任实验室)这一高保真研究平台,对303名参与者进行了大规模实证实验,以评估人类对LLM驱动的代理系统中的Agent-Mediated Deception (AMD) 的易感性。研究发现,只有8.6%的参与者能够察觉到AMD攻击,且某些场景中领域专家的易感性更高。研究识别出用户的六种认知失败模式,并发现风险意识往往未能转化为保护性行为;防御分析表明,有效的警告应当中断工作流且验证成本低。通过HAT-Lab的经验学习,在察觉到风险的用户中,超过90%报告了对AMD的警惕性增加。本研究提供了人类为中心的代理安全研究的实证证据和平台。