arXiv Paper Digest

Snapshot: 20260318_0401

Towards Generalizable Robotic Manipulation in Dynamic Environments

Authors: Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai

First: 2026-03-16T17:59:57+00:00 · Latest: 2026-03-16T17:59:57+00:00

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.

中文标题/摘要

标题：在动态环境中的通用可转移机器人操作

视觉-语言-动作（VLA）模型在静态操作中表现出色，但在具有移动目标的动态环境中却难以应对。这种性能差距主要源于缺乏动态操作数据集以及主流VLA依赖单帧观察，限制了它们的空间-时间推理能力。为了解决这一问题，我们引入了DOMINO，这是一个大规模的动态操作数据集和基准测试，包含35个具有层次复杂性的任务，超过11万个专家轨迹，以及多维度的评估套件。通过全面的实验，我们系统地评估了现有VLA在动态任务上的表现，探索了有效的动态意识训练策略，并验证了动态数据的可转移性。此外，我们提出了PUMA，一种动态感知的VLA架构。通过整合场景中心的历史光流和专门的世界查询，PUMA隐式预测对象中心的未来状态，将历史感知与短期预测相结合。结果表明，PUMA达到了最先进的性能，相对于基线模型在成功率上提高了6.3%。此外，我们展示了在动态数据上进行训练可以培养出对静态任务具有鲁棒性的空间-时间表示。所有代码和数据均可在https://github.com/H-EmbodVis/DOMINO/获取。

Summary / 总结

The research targets robotic manipulation in dynamic environments, where existing Vision-Language-Action (VLA) models that perform well in static settings struggle with moving targets. The authors introduce DOMINO, a dataset and benchmark for dynamic manipulation with 35 tasks and over 110K expert trajectories, and propose PUMA, a dynamics-aware VLA architecture that combines scene-centric historical optical flow with world queries that implicitly forecast object-centric future states. Experiments show that PUMA achieves a 6.3% absolute improvement in success rate over baselines and that training on dynamic data yields spatiotemporal representations that transfer to static tasks.

研究旨在通过解决现有视觉-语言-动作（VLA）模型在动态环境中的局限性，提高机器人的操作能力。为此，研究引入了DOMINO，一个大规模的动态操作数据集和基准，包含35个任务、超过110K专家轨迹和多维度评估套件。实验评估了现有VLA模型在动态任务上的表现，提出了一个动态感知与短期预测相结合的VLA架构PUMA，并展示了动态数据训练增强了时空表示，使成功率提高了6.3%。此外，动态训练数据也提高了静态任务的表现。

Mixture-of-Depths Attention

Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang

First: 2026-03-16T17:59:55+00:00 · Latest: 2026-03-16T17:59:55+00:00

Comments: Code is released at https://github.com/hustvl/MoDA

Abstract

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA.

中文标题/摘要

标题：深度混合注意力

深度扩展是大型语言模型（LLM）的关键驱动力。然而，随着LLM变得更深，它们往往会遭受信号降解：浅层形成的有信息特征逐渐被重复的残差更新稀释，使其在深层更难恢复。我们引入了深度混合注意力（MoDA）机制，允许每个注意力头同时关注当前层的序列KV对和来自先前层的深度KV对。我们还描述了一种针对MoDA的硬件高效算法，解决了非连续内存访问模式，实现了在序列长度为64K时达到FlashAttention-2效率的97.3%。在1.5B参数模型上的实验表明，MoDA始终优于强大的基线。值得注意的是，它在10个验证基准上将平均困惑度降低了0.2，并在10个下游任务上提高了2.11%的平均性能，计算开销仅为3.7%的FLOPs。我们还发现，将MoDA与后规范化结合使用比与前规范化结合使用效果更好。这些结果表明，MoDA是深度扩展的一种有前途的基本构建块。代码发布在https://github.com/hustvl/MoDA。

Summary / 总结

The paper introduces mixture-of-depths attention (MoDA), which lets each attention head attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers, addressing signal degradation in deep language models. On 1.5B-parameter models, MoDA improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, at a 3.7% FLOPs overhead. Combining MoDA with post-norm performs better than combining it with pre-norm. The hardware-efficient algorithm for MoDA achieves 97.3% of FlashAttention-2's efficiency at a sequence length of 64K.

论文提出了混合深度注意力（MoDA），以解决深度语言模型中信号退化的问题。MoDA 允许注意力头访问当前层和前几层的关键值对，从而减轻了信息特征的稀释。实验表明，MoDA 可以将平均困惑度降低 0.2，并在下游任务上提高 2.11% 的性能，且计算开销很小。结合 MoDA 和后归一化进一步提升了性能。结果表明，MoDA 是深度扩展大型语言模型的一个有前景的技术。
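
As a rough illustration of the MoDA idea summarized above, the sketch below lets a single attention head score one query against both the current layer's sequence KV pairs and depth KV pairs taken from earlier layers. The joint softmax over the concatenated pairs, the function name `moda_head`, and the list-based data layout are assumptions made for illustration; the abstract does not specify these details, and the released code may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def moda_head(query, seq_kv, depth_kv):
    """Hypothetical single-head sketch: attend jointly over two KV sources.

    query:    list[float]
    seq_kv:   list of (key, value) pairs from the current layer's sequence
    depth_kv: list of (key, value) pairs from preceding layers
    Assumption: one softmax is taken over the concatenation of both sources.
    """
    pairs = list(seq_kv) + list(depth_kv)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key, _ in pairs])
    dim = len(pairs[0][1])
    output = [sum(w * value[i] for w, (_, value) in zip(weights, pairs))
              for i in range(dim)]
    return output, weights


# Toy example: the depth key is aligned with the query, so it dominates.
query = [1.0, 0.0]
seq_kv = [([0.0, 1.0], [1.0, 0.0]), ([0.0, -1.0], [0.0, 1.0])]
depth_kv = [([4.0, 0.0], [5.0, 5.0])]
output, weights = moda_head(query, seq_kv, depth_kv)
```

In this toy case most of the attention mass lands on the depth KV pair, which is one way a shallow-layer feature could remain recoverable in a deeper layer.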

Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models

Authors: Yulin Luo, Hao Chen, Zhuangzhe Wu, Bowen Sui, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Qiuxuan Feng, Jiale Yu, Shuo Gu, Peng Jia, Pheng-Ann Heng, Shanghang Zhang

First: 2026-03-16T17:59:54+00:00 · Latest: 2026-03-16T17:59:54+00:00

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for robotic manipulation, in which reliable action prediction critically depends on accurately interpreting and integrating visual observations conditioned on language instructions. Although recent works have sought to enhance the visual capabilities of VLA models, most approaches treat the LLM backbone as a black box, providing limited insight into how visual information is grounded into action generation. Therefore, we perform a systematic analysis of multiple VLA models across different action-generation paradigms and observe that sensitivity to visual tokens progressively decreases in deeper layers during action generation. Motivated by this observation, we propose DeepVision-VLA, built on a Vision-Language Mixture-of-Transformers (VL-MoT) framework. This framework enables shared attention between the vision foundation model and the VLA backbone, injecting multi-level visual features from the vision expert into deeper layers of the VLA backbone to enhance visual representations for precise and complex manipulation. In addition, we introduce Action-Guided Visual Pruning (AGVP), which leverages shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, reinforcing critical visual cues for manipulation with minimal computational overhead. DeepVision-VLA outperforms prior state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively, providing new insights for the design of visually enhanced VLA models.

中文标题/摘要

标题：行动之前先观察：增强视觉基础表示以提升视觉-语言-行动模型

视觉-语言-行动（VLA）模型最近已成为机器人操作的有前途的范式，其中可靠的行动预测在很大程度上依赖于准确地解释和整合视觉观察，这些观察是根据语言指令进行的。尽管最近的工作已经寻求增强VLA模型的视觉能力，但大多数方法将LLM主干视为黑盒，提供了有限的关于视觉信息如何嵌入到行动生成中的见解。因此，我们对不同行动生成范式下的多种VLA模型进行了系统的分析，并观察到在行动生成过程中，视觉标记的敏感性在更深的层中逐渐降低。受此观察的启发，我们提出了基于视觉-语言混合的变换器（VL-MoT）框架的DeepVision-VLA。该框架使视觉基础模型与VLA主干之间共享注意力，将视觉专家的多级视觉特征注入到VLA主干的更深层中，以增强视觉表示，实现精确和复杂的操作。此外，我们引入了行动引导的视觉剪枝（AGVP），利用浅层注意力剪枝无关的视觉标记，同时保留与任务相关的标记，以最小的计算开销强化关键的视觉提示。DeepVision-VLA在模拟和真实世界任务中分别比先前的最先进方法提高了9.0%和7.5%，为设计视觉增强的VLA模型提供了新的见解。

Summary / 总结

The research aims to improve the visual capabilities of Vision-Language-Action (VLA) models for robotic manipulation, motivated by the observation that sensitivity to visual tokens progressively decreases in deeper layers during action generation. The study proposes DeepVision-VLA, which uses a Vision-Language Mixture-of-Transformers (VL-MoT) framework to enable shared attention between the vision foundation model and the VLA backbone. This approach injects multi-level visual features from the vision expert into deeper layers of the VLA backbone, enhancing visual representations for precise manipulation. Additionally, Action-Guided Visual Pruning (AGVP) uses shallow-layer attention to prune irrelevant visual tokens while preserving task-relevant ones, with minimal computational overhead. DeepVision-VLA outperforms previous state-of-the-art methods by 9.0% and 7.5% on simulated and real-world tasks, respectively.

研究旨在通过解决视觉信息在深层层中相关性较低的问题，提高Vision-Language-Action (VLA) 模型的视觉能力，以用于机器人操作。研究提出了DeepVision-VLA，该模型基于Vision-Language Mixture-of-Transformers框架，使视觉基础模型与VLA主干网络之间能够共享注意力，并引入了Action-Guided Visual Pruning来增强视觉表示。该模型在模拟和真实世界任务中分别比之前最先进的方法提高了9.0%和7.5%。
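
To make the pruning step in this entry concrete, here is a minimal sketch of attention-guided token pruning: visual tokens are ranked by an attention score and only the highest-scoring fraction is kept, in original order. The `keep_ratio` parameter, the top-k selection rule, and the function name are hypothetical; the abstract only states that shallow-layer attention is used to drop irrelevant visual tokens.

```python
import math


def prune_visual_tokens(tokens, attention_scores, keep_ratio=0.5):
    """Keep the most-attended visual tokens, preserving their original order.

    tokens:           list of visual tokens (any objects)
    attention_scores: one score per token, e.g. attention mass received
                      in a shallow layer (assumed to be given)
    keep_ratio:       hypothetical fraction of tokens to keep
    """
    if len(tokens) != len(attention_scores):
        raise ValueError("tokens and attention_scores must have equal length")
    if not tokens:
        return []
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: attention_scores[i],
                    reverse=True)
    kept_indices = sorted(ranked[:keep])  # restore spatial/sequence order
    return [tokens[i] for i in kept_indices]


# Toy example: background patches receive little attention and are dropped.
patches = ["background_0", "cup", "background_1", "gripper"]
scores = [0.05, 0.50, 0.10, 0.35]
kept = prune_visual_tokens(patches, scores, keep_ratio=0.5)
```

Keeping the surviving tokens in their original order is a design choice here, so that any positional information attached to them stays consistent.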

HorizonMath: Measuring AI Progress Toward Mathematical Discovery with Automatic Verification

Authors: Erik Y. Wang, Sumeet Motwani, James V. Roggeveen, Eliot Hodges, Dulhan Jayalath, Charles London, Kalyan Ramakrishnan, Flaviu Cipcigan, Philip Torr, Alessandro Abate

First: 2026-03-16T17:59:53+00:00 · Latest: 2026-03-16T17:59:53+00:00

Abstract

Can AI make progress on important, unsolved mathematical problems? Large language models are now capable of sophisticated mathematical and scientific reasoning, but whether they can perform novel research is still widely debated and underexplored. We introduce HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. Our benchmark targets a class of problems where discovery is hard, requiring meaningful mathematical insight, but verification is computationally efficient and simple. Because these solutions are unknown, HorizonMath is immune to data contamination, and most state-of-the-art models score near 0%. Existing research-level benchmarks instead rely on formal proof verification or manual review, both of which are expensive to scale. Using this platform, we find two problems for which GPT 5.4 Pro proposes solutions that improve on the best-known published results, representing potential novel contributions (pending expert review). We release HorizonMath as an open challenge and a growing community resource, where correct solutions to problems in the unsolved problem classes could constitute novel results in the mathematical literature.

中文标题/摘要

标题：地平线数学：通过自动验证衡量AI在数学发现方面的进步

AI能否在重要的未解数学问题上取得进展？大型语言模型现在能够进行复杂的数学和科学推理，但它们是否能够进行新颖的研究仍然存在广泛争议和未被充分探索。我们引入了地平线数学，这是一个包含100多个主要未解问题的基准，这些问题是计算和应用数学领域的，配有一个开源的自动验证评估框架。我们的基准针对一类发现困难的问题，需要有意义的数学洞察，但验证计算上高效且简单。由于这些解决方案未知，地平线数学不受数据污染的影响，大多数最先进的模型得分接近0%。现有的研究级基准则依赖于形式证明验证或人工审查，这两种方法都难以大规模扩展。使用这个平台，我们发现两个问题，在这些问题上GPT 5.4 Pro提出了改进现有最佳已发表结果的解决方案，这可能代表潜在的新贡献（待专家评审）。我们发布地平线数学作为一项开放挑战和不断增长的社区资源，在未解问题类别的正确解决方案可能构成数学文献中的新成果。

Summary / 总结

The research aims to evaluate whether AI can make novel contributions to unsolved mathematical problems. The authors build HorizonMath, a benchmark of over 100 predominantly unsolved problems spanning 8 domains in computational and applied mathematics, paired with an open-source evaluation framework for automated verification. The benchmark targets problems where discovery is hard but verification is computationally cheap, and most state-of-the-art models score near 0%. For two problems, GPT 5.4 Pro proposes solutions that improve on the best-known published results, which are potential novel contributions pending expert review. The benchmark is released as an open challenge and a growing community resource.

研究旨在评估AI在解决未解决问题时能否做出新颖贡献。方法是创建HorizonMath，包含超过100个未解决问题的基准，涵盖计算和应用数学领域，并提供自动验证的开源评估框架。关键发现包括GPT 5.4 Pro为两个问题提出了改进现有结果的解决方案，表明潜在的新颖贡献。该基准作为开放挑战发布，旨在鼓励进一步研究和社区参与。
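
The "hard to discover, cheap to verify" property described in this entry can be illustrated with a toy verifier. The problem below (finding a large subset of 1..n with no three-term arithmetic progression) is a hypothetical stand-in chosen for illustration and is not claimed to be part of HorizonMath; the function names and the `best_known_size` parameter are likewise assumptions.

```python
def is_ap_free(numbers):
    """Return True if no three distinct elements form an arithmetic progression."""
    members = set(numbers)
    ordered = sorted(members)
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            # A progression first < second < third needs third = 2*second - first.
            if 2 * second - first in members:
                return False
    return True


def verify_submission(numbers, n, best_known_size):
    """Accept a submission only if it is valid and beats the recorded size.

    Checking is quadratic in the submission size, while finding a large
    valid set is the hard part.
    """
    unique = set(numbers)
    valid = (
        len(unique) == len(numbers)
        and all(1 <= x <= n for x in unique)
        and is_ap_free(unique)
    )
    return valid and len(unique) > best_known_size


accepted = verify_submission([1, 2, 4, 5], n=5, best_known_size=3)
rejected = verify_submission([1, 2, 3, 5], n=5, best_known_size=3)
```

The second submission is rejected because 1, 2, 3 form a progression, even though it has the same size as the first.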

Mechanistic Origin of Moral Indifference in Language Models

Authors: Lingyu Li, Yan Teng, Yingchun Wang

First: 2026-03-16T17:59:17+00:00 · Latest: 2026-03-16T17:59:17+00:00

Comments: 24 pages, 11 figures, 5 tables

Abstract

Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. Firstly, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B, isolate mono-semantic moral features, and targetedly reconstruct their topological relationships to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we elaborate on the remedial nature of current intervention methods from an experientialist philosophy, arguing that endogenously aligned AI might require a transformation from post-hoc corrections to proactive cultivation.

中文标题/摘要

标题：语言模型道德冷漠的机制起源

现有针对大型语言模型（LLMs）的行为对齐技术往往忽视表面合规与内部未对齐表示之间的差异，使LLMs面临长尾风险。更为关键的是，我们提出LLMs由于将不同的道德概念压缩成统一的概率分布，因而具有内在的道德冷漠状态。我们验证并修正了LLMs潜在表示中的这种冷漠状态，利用251k个基于原型理论和社会化学-101数据集构建的道德向量。首先，我们对23个模型的分析表明，当前的LLMs无法表示对立道德类别之间的区别以及这些类别内部的细微典型性梯度；值得注意的是，无论是模型规模、架构还是显式对齐都无法改变这种冷漠。然后，我们使用稀疏自编码器在Qwen3-8B上进行操作，分离出单一语义的道德特征，并针对性地重建它们的拓扑关系，使其与真实道德向量对齐。这种表示对齐自然提高了道德推理和细微程度，实现了在独立对抗火焰基准测试中75%的对局胜率。最后，我们从经验主义哲学的角度阐述了当前干预方法的补救性质，认为内生对齐的人工智能可能需要从事后修正转变为积极培养。

Summary / 总结

This study examines moral indifference in Large Language Models (LLMs): the compression of distinct moral concepts into uniform probability distributions. Using 251k moral vectors built on Prototype Theory and the Social-Chemistry-101 dataset, an analysis of 23 models finds that current LLMs fail to represent the distinction between opposed moral categories or the fine-grained typicality gradients within them, and that neither model scaling, architecture, nor explicit alignment changes this. The authors then apply Sparse Autoencoders to Qwen3-8B, isolate mono-semantic moral features, and reconstruct their topological relationships to match ground-truth moral vectors, which yields a 75% pairwise win-rate on the independent adversarial Flames benchmark. The study also argues that aligned AI may require proactive cultivation rather than post-hoc correction.

研究旨在通过分析大型语言模型（LLMs）的潜在表示来解决其道德冷漠问题。基于原型理论和社会化学-101数据集构建了251k道德向量以验证并修正这种内在的道德冷漠。分析23个模型表明，当前的LLMs无法区分对立的道德类别和细微的典型性梯度。通过在Qwen3-8B上使用稀疏自编码器，研究人员隔离了单义道德特征并重构了它们的拓扑关系，实现了在独立对抗性火焰基准上的75%对局胜率。

Tri-Prompting: Video Diffusion with Unified Control over Scene, Subject, and Motion

Authors: Zhenghong Zhou, Xiaohang Zhan, Zhiqin Chen, Soo Ye Kim, Nanxuan Zhao, Haitian Zheng, Qing Liu, He Zhang, Zhe Lin, Yuqian Zhou, Jiebo Luo

First: 2026-03-16T17:59:05+00:00 · Latest: 2026-03-16T17:59:05+00:00

Comments: Project page: https://zhouzhenghong-gt.github.io/Tri-Prompting-Page/

Abstract

Recent video diffusion models have made remarkable strides in visual quality, yet precise, fine-grained control remains a key bottleneck that limits practical customizability for content creation. For AI video creators, three forms of control are crucial: (i) scene composition, (ii) multi-view consistent subject customization, and (iii) camera-pose or object-motion adjustment. Existing methods typically handle these dimensions in isolation, with limited support for multi-view subject synthesis and identity preservation under arbitrary pose changes. This lack of a unified architecture makes it difficult to support versatile, jointly controllable video. We introduce Tri-Prompting, a unified framework and two-stage training paradigm that integrates scene composition, multi-view subject consistency, and motion control. Our approach leverages a dual-condition motion module driven by 3D tracking points for background scenes and downsampled RGB cues for foreground subjects. To ensure a balance between controllability and visual realism, we further propose an inference ControlNet scale schedule. Tri-Prompting supports novel workflows, including 3D-aware subject insertion into any scenes and manipulation of existing subjects in an image. Experimental results demonstrate that Tri-Prompting significantly outperforms specialized baselines such as Phantom and DaS in multi-view subject identity, 3D consistency, and motion accuracy.

中文标题/摘要

标题：三提示：统一控制场景、主体和运动的视频扩散

近期的视频扩散模型在视觉质量方面取得了显著进步，但精确、细致的控制仍然是一个关键瓶颈，限制了内容创作的实际定制化。对于AI视频创作者来说，三种形式的控制至关重要：(i) 场景构图，(ii) 多视角一致的主体定制，和(iii) 摄像机姿态或物体运动调整。现有方法通常在这些维度上孤立处理，对多视角主体合成和姿态变化下的身份保持支持有限。缺乏统一的架构使得支持多功能、联合可控的视频变得困难。我们引入了三提示，这是一种统一框架和两阶段训练范式，将场景构图、多视角主体一致性以及运动控制整合在一起。我们的方法利用由3D跟踪点驱动的双条件运动模块处理背景场景，并利用下采样的RGB线索处理前景主体。为了在可控性和视觉真实性之间取得平衡，我们进一步提出了一种推理ControlNet尺度调度。三提示支持新的工作流程，包括在任何场景中进行3D感知的主体插入以及对图像中现有主体的操控。实验结果表明，三提示在多视角主体身份、3D一致性以及运动准确性方面显著优于专门基准如Phantom和DaS。

Summary / 总结

The research aims to enhance the control over scene composition, subject customization, and motion adjustment in video diffusion models. The Tri-Prompting framework integrates these aspects through a dual-condition motion module and an inference ControlNet scale schedule. Experiments show that Tri-Prompting outperforms existing methods like Phantom and DaS in maintaining multi-view subject identity, 3D consistency, and motion accuracy.

研究旨在提升视频扩散模型在场景构成、主体定制和运动控制方面的精确性。Tri-Prompting 提出了一种统一框架，结合了双条件运动模块和推理中的 ControlNet 比例调度，以实现这些目标。该方法在多视角主体身份、3D 一致性及运动准确性方面显著优于现有基线，支持如 3D 意识主体插入和操作等新型工作流。

Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Authors: Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

First: 2026-03-16T17:58:13+00:00 · Latest: 2026-03-16T17:58:13+00:00

Comments: Project Page: https://zju-real.github.io/Code-A1 Code: https://github.com/ZJU-REAL/Code-A1

Abstract

Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.

中文标题/摘要

标题：Code-A1: 通过强化学习的代码LLM和测试LLM对抗演化

代码生成的强化学习依赖于单元测试通过率的可验证奖励。然而高质量的测试套件稀缺，现有数据集提供的覆盖范围有限，静态奖励无法随着模型改进而适应。最近的自博弈方法在单一模型中统一了代码和测试生成，但面临固有的困境：白盒访问会导致模型自相勾结，生成简单的测试以获得容易的奖励，而黑盒限制则导致通用测试，容易遗漏实现特定的错误。我们引入了Code-A1，一种对抗演化框架，联合优化代码LLM和测试LLM，两者具有对立的目标。代码LLM通过通过更多测试获得奖励，而测试LLM通过暴露更多缺陷获得奖励。这种架构分离消除了自相勾结的风险，并安全地允许白盒测试生成，其中测试LLM可以检查候选代码以构建针对性的对抗性测试。我们进一步引入了错误书机制进行经验回放，并引入复合奖励平衡测试的有效性与对抗难度。在Qwen2.5-Coder模型上的实验表明，Code-A1在代码生成性能上与或超过使用人工标注测试训练的模型相匹配，同时显著提高了测试生成能力。

Summary / 总结

Code-A1 is an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. Experiments show that Code-A1 matches or exceeds models trained on human-annotated tests in code generation and significantly improves test generation capability.

Code-A1 是一种对抗性协同进化框架，同时优化代码生成模型和测试生成模型，两者目标相反。代码生成模型通过通过更多测试获得奖励，而测试生成模型通过发现更多缺陷获得奖励。这种方法消除了自欺风险，显著提高了测试生成能力，实现了代码生成性能与基于人工标注测试训练的模型相当或超越的结果。

Do Metrics for Counterfactual Explanations Align with User Perception?

Authors: Felix Liedeker, Basil Ell, Philipp Cimiano, Christoph Düsing

First: 2026-03-16T17:56:54+00:00 · Latest: 2026-03-16T17:56:54+00:00

Comments: Accepted at the 4th World Conference on eXplainable Artificial Intelligence (XAI 2026)

Abstract

Explainability is widely regarded as essential for trustworthy artificial intelligence systems. However, the metrics commonly used to evaluate counterfactual explanations are algorithmic evaluation metrics that are rarely validated against human judgments of explanation quality. This raises the question of whether such metrics meaningfully reflect user perceptions. We address this question through an empirical study that directly compares algorithmic evaluation metrics with human judgments across three datasets. Participants rated counterfactual explanations along multiple dimensions of perceived quality, which we relate to a comprehensive set of standard counterfactual metrics. We analyze both individual relationships and the extent to which combinations of metrics can predict human assessments. Our results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant for humans. Overall, our findings suggest that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable artificial intelligence.

中文标题/摘要

标题：反事实解释的度量标准与用户感知是否一致？

可解释性被广泛认为是值得信赖的人工智能系统的必要条件。然而，用于评估反事实解释的常见度量标准通常是算法评估度量，很少经过人类对解释质量判断的验证。这引发了这样的问题：这些度量标准是否真正反映了用户的感知。我们通过一项实证研究，在三个数据集上直接比较了算法评估度量与人类判断。参与者从多个感知质量维度对反事实解释进行了评分，我们将这些评分与一套全面的标准反事实度量相关联。我们分析了单个度量与评分的关系，以及度量组合能在多大程度上预测人类评估。结果显示，算法度量与人类评分之间的相关性通常较弱且高度依赖于数据集。此外，增加预测模型中使用的度量数量并不能带来可靠的改进，表明当前度量在捕捉对人类重要的标准方面存在结构性限制。总体而言，我们的研究结果表明，广泛使用的反事实评估度量未能反映用户感知到的解释质量的关键方面，强调了需要采用更加以人为中心的方法来评估可解释的人工智能。

Summary / 总结

This study investigates whether commonly used algorithmic metrics for evaluating counterfactual explanations align with user perceptions. Through an empirical study involving three datasets, participants rated the quality of counterfactual explanations, which were then compared to various algorithmic metrics. The results indicate weak and dataset-dependent correlations between algorithmic metrics and human ratings, suggesting that current metrics fail to capture key aspects of explanation quality as perceived by users. This highlights the need for more human-centered approaches to evaluating explainable AI.

研究探讨了常用算法评价指标是否与人类对解释质量的感知相一致。通过涉及三个数据集的实证研究，参与者对反事实解释在多个质量维度上进行了评分，然后将这些评分与各种标准反事实指标进行了相关性分析。结果表明，算法指标与人类评分之间的相关性较弱且依赖于数据集，这表明当前指标未能可靠地捕捉到用户感知中关键的解释质量标准。这强调了需要采用更以人类为中心的方法来评估可解释的人工智能的必要性。
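
The comparison described in this entry can be sketched as a rank correlation between an algorithmic metric and human ratings. The example below uses sparsity and L1 proximity as two commonly used counterfactual metrics and Spearman correlation as the association measure; which metrics and which correlation statistic the study actually used is not stated in the abstract, so these choices and all names are assumptions.

```python
import math


def sparsity(original, counterfactual):
    """Number of features changed between an instance and its counterfactual."""
    return sum(1 for a, b in zip(original, counterfactual) if a != b)


def proximity(original, counterfactual):
    """L1 distance between an instance and its counterfactual."""
    return sum(abs(a - b) for a, b in zip(original, counterfactual))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)  # undefined if either input is constant


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one sparsity value and one mean human rating per explanation.
metric_values = [1, 2, 3, 4]
human_ratings = [4.5, 4.0, 2.5, 2.0]
rho = spearman(metric_values, human_ratings)
```

With real study data, a weak or dataset-dependent `rho` would correspond to the pattern reported in the entry above.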

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Authors: Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu

First: 2026-03-16T17:53:28+00:00 · Latest: 2026-03-16T17:53:28+00:00

Comments: 31 pages

Abstract

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in the mean absolute error of specialized reasoning baselines, demonstrating significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.

中文标题/摘要

标题：从被动观察者到主动批评者：强化学习引发过程推理以实现机器人操作

长期视角下的机器人操作过程监督仍然是一个关键挑战。当前主要基于监督微调（SFT）范式的视频MLLMs的主要瓶颈在于，它们作为被动的“观察者”，仅能识别正在进行的事件，而不能评估当前状态相对于最终任务目标的状态。本文提出了一种名为PRIMO R1（过程推理诱导监控）的7B框架，将视频MLLMs转变为积极的“批评者”。我们利用基于结果的强化学习来激励生成明确的推理链以进行进度估计。此外，我们的架构通过明确将视频序列锚定在初始状态和当前状态图像之间，构建了一个结构化的时序输入。通过提出的PRIMO数据集和基准测试，我们在多种领域内环境和跨领域的实际世界类人机器人场景中进行了广泛的实验，表明PRIMO R1达到了最先进的性能。定量上，我们的7B模型在专门推理基线上的平均绝对误差降低了50%，显示出相对于72B规模的通用MLLMs的显著相对准确度提升。此外，PRIMO R1在困难的故障检测任务上表现出强大的零样本泛化能力。我们在RoboFail基准测试中取得了67.0%的准确率，超越了如OpenAI o1等闭源模型6.0%。

Summary / 总结

This paper addresses the challenge of accurate process supervision in long-horizon robotic manipulation by introducing PRIMO R1, a 7B framework that transforms video MLLMs from passive "Observers" into active "Critics". It uses outcome-based Reinforcement Learning to encourage explicit Chain-of-Thought generation for progress estimation and constructs a structured temporal input by anchoring the video sequence between initial and current state images. Experiments show that PRIMO R1 halves the mean absolute error of specialized reasoning baselines and reaches 67.0% accuracy on the RoboFail benchmark, surpassing closed-source models such as OpenAI o1 by 6.0%.

本文通过引入PRIMO R1，利用基于结果的强化学习将视频MLLMs转化为积极的“批评者”，激励显式的推理链生成并构建结构化的时序输入，以解决长时域机器人操作中的过程监督难题。实验表明，PRIMO R1在绝对误差上比专门的推理基线降低了50%，并在RoboFail基准测试中取得了67.0%的准确率，超越了其他模型6.0%。
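
For readers unfamiliar with the error measure quoted in this entry, the sketch below computes a mean absolute error for progress estimates and the relative reduction between two error values. The progress scale (percent complete) and all numbers are hypothetical and are not taken from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and true progress values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_reduction(baseline_error, model_error):
    """Fraction by which model_error is lower than baseline_error."""
    return (baseline_error - model_error) / baseline_error


# Hypothetical progress estimates, in percent of task completed.
true_progress = [10, 50, 100]
estimated_progress = [20, 50, 90]
mae = mean_absolute_error(estimated_progress, true_progress)

# A 50% reduction means the model's error is half the baseline's error.
halved = relative_reduction(baseline_error=12.0, model_error=6.0)
```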

SmartSearch: How Ranking Beats Structure for Conversational Memory Retrieval

Authors: Jesper Derehag, Carlos Calva, Timmy Ghiurau

First: 2026-03-16T17:53:21+00:00 · Latest: 2026-03-16T17:53:21+00:00

Abstract

Recent conversational memory systems invest heavily in LLM-based structuring at ingestion time and learned retrieval policies at query time. We show that neither is necessary. SmartSearch retrieves from raw, unstructured conversation history using a fully deterministic pipeline: NER-weighted substring matching for recall, rule-based entity discovery for multi-hop expansion, and a CrossEncoder+ColBERT rank fusion stage -- the only learned component -- running on CPU in ~650ms. Oracle analysis on two benchmarks identifies a compilation bottleneck: retrieval recall reaches 98.6%, but without intelligent ranking only 22.5% of gold evidence survives truncation to the token budget. With score-adaptive truncation and no per-dataset tuning, SmartSearch achieves 93.5% on LoCoMo and 88.4% on LongMemEval-S, exceeding all known memory systems under the same evaluation protocol on both benchmarks while using 8.5x fewer tokens than full-context baselines.

中文标题/摘要

标题：SmartSearch：排名胜于结构的对话记忆检索

近期的对话记忆系统在摄入时大量投资基于LLM的结构化处理，并在查询时学习检索策略。我们表明，这两种方法都不是必要的。SmartSearch 使用完全确定性的流水线从原始未结构化的对话历史中检索：NER加权子字符串匹配用于召回，基于规则的实体发现用于多跳扩展，以及一个仅有的学习组件——CrossEncoder+ColBERT 排名融合阶段——在CPU上运行约650毫秒。在两个基准上的Oracle分析指出一个编译瓶颈：检索召回率达到98.6%，但没有智能排名，只有22.5%的黄金证据在截断到令牌预算后幸存。使用自适应得分截断且无需针对每个数据集进行调整，SmartSearch 在LoCoMo上达到93.5%，在LongMemEval-S上达到88.4%，在相同的评估协议下，两个基准上均超过所有已知的记忆系统，同时使用比全上下文基线少8.5倍的令牌。

Summary / 总结

SmartSearch argues that ranking matters more than structuring for conversational memory retrieval. It retrieves from raw, unstructured conversation history with a deterministic pipeline: NER-weighted substring matching, rule-based entity discovery, and a CrossEncoder+ColBERT rank fusion stage, which is the only learned component. Retrieval recall reaches 98.6%, but without intelligent ranking only 22.5% of gold evidence survives truncation to the token budget. With score-adaptive truncation and no per-dataset tuning, SmartSearch reaches 93.5% on LoCoMo and 88.4% on LongMemEval-S while using 8.5x fewer tokens than full-context baselines.

SmartSearch 表明，在会话记忆检索中，排名比结构化更为有效。它使用包括 NER 加权子字符串匹配、基于规则的实体发现和排名融合阶段的确定性管道来从原始会话历史中检索信息。尽管仅使用比全上下文基线少 8.5 倍的令牌，SmartSearch 在 LoCoMo 和 LongMemEval-S 基准测试中分别达到了 93.5% 和 88.4% 的性能，超过了其他在相同协议下进行评估的记忆系统，且无需针对每个数据集进行调整。
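
The ranking and truncation steps in this entry can be illustrated with a small sketch. Reciprocal rank fusion is used here as a stand-in fusion rule, and the truncation rule (stop once scores fall below a fraction of the top score, skip passages that do not fit the budget) is a guess at what "score-adaptive truncation" might look like. Neither is confirmed by the abstract, and `k`, `min_score_ratio`, and the function names are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into one scored list.

    Each passage gets sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def truncate_to_budget(fused, token_counts, budget, min_score_ratio=0.5):
    """Keep top-ranked passages that fit the budget and score well enough.

    Assumption: selection stops at the first passage whose fused score is
    below min_score_ratio times the best score.
    """
    kept, used = [], 0
    if not fused:
        return kept
    top_score = fused[0][1]
    for passage_id, score in fused:
        if score < top_score * min_score_ratio:
            break
        cost = token_counts[passage_id]
        if used + cost > budget:
            continue  # too long for the remaining budget, try the next one
        kept.append(passage_id)
        used += cost
    return kept


cross_encoder_ranking = ["a", "b", "c"]
colbert_ranking = ["a", "c", "d"]
fused = reciprocal_rank_fusion([cross_encoder_ranking, colbert_ranking])
token_counts = {"a": 100, "b": 50, "c": 300, "d": 50}
strict = truncate_to_budget(fused, token_counts, budget=200, min_score_ratio=0.5)
loose = truncate_to_budget(fused, token_counts, budget=200, min_score_ratio=0.4)
```

The two calls show how the score threshold, rather than the budget alone, decides how much low-ranked evidence is admitted.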

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

Authors: Pengjun Fang, Yingqing He, Yazhou Xing, Qifeng Chen, Ser-Nam Lim, Harry Yang

Venue: ICLR 2026

First: 2026-03-16T17:53:07+00:00 · Latest: 2026-03-16T17:53:07+00:00

Comments: Accepted at ICLR 2026. 15 pages, 5 figures

Abstract

Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning.

中文标题/摘要

标题：AC-Foley：参考音频引导的视频到音频合成与声学转移

现有的视频到音频（V2A）生成方法主要依赖于文本提示和视觉信息来合成音频。然而，存在两个关键瓶颈：训练数据中的语义粒度差距，例如将声学上不同的声音归类为粗略标签，以及描述微声学特征的文本歧义性。这些瓶颈使得使用文本控制模式进行精细声音合成变得困难。为了解决这些限制，我们提出了AC-Foley，这是一种基于音频的V2A模型，可以直接利用参考音频来实现对生成声音的精确和精细控制。这种方法使精细声音合成、音色转移、零样本声音生成和提高音频质量成为可能。通过直接基于音频信号进行条件化，我们的方法绕过了文本描述的语义歧义，同时允许精确操纵声学属性。实验证明，当基于参考音频条件化时，AC-Foley在福莱声生成方面达到了最先进的性能，即使在没有音频条件化的情况下，其性能也与最先进的视频到音频方法相当。

Summary / 总结

AC-Foley is an audio-conditioned video-to-audio synthesis model designed to address the limitations of existing methods by leveraging reference audio to achieve precise and fine-grained sound synthesis. It overcomes issues of semantic granularity and textual ambiguity, enabling fine-grained sound synthesis, timbre transfer, and zero-shot sound generation. Empirical results show that AC-Foley outperforms state-of-the-art methods for Foley sound generation when conditioned on reference audio, while still maintaining competitive performance without audio conditioning.

AC-Foley 是一种基于参考音频的视频到音频合成模型，能够实现精确和细粒度的声音合成。该方法通过绕过语义模糊性并控制声学属性来解决现有方法的限制。主要发现包括在 Foley 声音生成中的性能提升，以及在无需音频条件的情况下与最先进的视频到音频方法保持竞争力。

Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing

Authors: Benjamin Reichman, Adar Avsian, Samuel Webster, Larry Heck

First: 2026-03-10T05:23:18+00:00 · Latest: 2026-03-16T17:52:20+00:00

Abstract

Large language models are routinely deployed on text that varies widely in emotional tone, yet their reasoning behavior is typically evaluated without accounting for emotion as a source of representational variation. Prior work has largely treated emotion as a prediction target, for example in sentiment analysis or emotion classification. In contrast, we study emotion as a latent factor that shapes how models attend to and reason over text. We analyze how emotional tone systematically alters attention geometry in transformer models, showing that metrics such as locality, center-of-mass distance, and entropy vary across emotions and correlate with downstream question-answering performance. To facilitate controlled study of these effects, we introduce Affect-Uniform ReAding QA (AURA-QA), a question-answering dataset with emotionally balanced, human-authored context passages. Finally, an emotional regularization framework is proposed that constrains emotion-conditioned representational drift during training. Experiments across multiple QA benchmarks demonstrate that this approach improves reading comprehension in both emotionally-varying and non-emotionally varying datasets, yielding consistent gains under distribution shift and in-domain improvements on several benchmarks.

Summary / 总结

The paper investigates how emotional tone in text affects large language models (LLMs) by treating emotion as a latent factor rather than a prediction target. The study analyzes how emotional variations alter attention patterns in transformer models and introduces Affect-Uniform ReAding QA (AURA-QA), a dataset with emotionally balanced context passages. An emotional regularization framework is proposed to constrain emotion-conditioned representational drift during training, showing consistent improvements in reading comprehension across multiple QA benchmarks under distribution shift and in-domain settings.

论文通过将情感视为潜在因素而非预测目标，研究了文本中情感基调如何影响大型语言模型（LLMs）。研究分析了情感变化如何改变Transformer模型的注意力模式，并引入了情感平衡的阅读问答数据集Affect-Uniform ReAding QA (AURA-QA)。提出了一个情感正则化框架，以在训练期间约束情感条件下的表示漂移，实验结果显示该方法在多个问答基准测试中取得了一致的改进，包括在分布转移和领域内设置下的阅读理解能力提升。
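The abstract above names three attention-geometry metrics (locality, center-of-mass distance, and entropy) without defining them. The sketch below shows one plausible way to compute each for a single attention row; the function names and the exact definitions are assumptions for illustration, not the authors' formulas.

```python
import math

def attention_entropy(row):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(a * math.log(a) for a in row if a > 0)

def center_of_mass_distance(row, query_pos):
    """Assumed definition: distance between the query position and the
    attention-weighted mean key position."""
    center = sum(j * a for j, a in enumerate(row))
    return abs(query_pos - center)

def locality(row, query_pos, window=1):
    """Assumed definition: attention mass within `window` positions of the query."""
    return sum(a for j, a in enumerate(row) if abs(j - query_pos) <= window)

# A uniform row spreads attention everywhere; a peaked row attends only to itself.
uniform_row = [0.25, 0.25, 0.25, 0.25]
peaked_row = [0.0, 0.0, 0.0, 1.0]
```

Under these assumed definitions, the uniform row has the highest entropy and the lowest locality, while the peaked row has zero entropy and full locality.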

OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Authors: Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen

First: 2026-03-16T17:52:04+00:00 · Latest: 2026-03-16T17:52:04+00:00

Comments: 15 pages, 6 figures

Abstract

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the progress of the broader research community in developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) fact-grounded scalable controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity; and (2) denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby prompting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained in a single run on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.

中文标题/摘要

标题：OpenSeeker：通过全面开源训练数据普及前沿搜索代理

深度搜索能力已成为前沿大型语言模型（LLM）代理不可或缺的技能，但由于缺乏透明且高质量的训练数据，高性能搜索代理的开发仍主要由工业巨头主导。这种持续的数据稀缺性从根本上阻碍了更广泛研究社区在该领域的开发和创新。为解决这一问题，我们引入了OpenSeeker，这是首个全面开源的搜索代理（即模型和数据），通过两项核心技术创新实现了前沿级别的性能：（1）基于事实的可扩展可控问答合成，通过拓扑扩展和实体混淆反向工程网络图，生成具有可控覆盖范围和复杂度的复杂多跳推理任务。（2）去噪轨迹合成，采用回顾性总结机制去噪轨迹，从而促使教师LLM生成高质量行动。实验结果表明，OpenSeeker仅在11.7k合成样本上进行一次训练，就在包括BrowseComp、BrowseComp-ZH、xbench-DeepSearch和WideSearch等多个基准测试中达到了最先进的性能。值得注意的是，使用简单的SFT训练后，OpenSeeker显著优于第二好的全面开源代理DeepDive（例如，在BrowseComp上的表现分别为29.5%和15.3%），甚至在BrowseComp-ZH上超越了工业竞争对手通义DeepResearch（后者通过大量持续预训练、SFT和RL训练），两者得分分别为48.4%和46.7%。我们全面开源了完整的训练数据集和模型权重，以普及前沿搜索代理研究，促进更加透明和协作的生态系统。

Summary / 总结

The research aims to democratize the development of high-performance search agents by addressing the lack of transparent and high-quality training data. OpenSeeker, the first fully open-source search agent, achieves state-of-the-art performance across multiple benchmarks through two innovations: fact-grounded scalable controllable QA synthesis and denoised trajectory synthesis. Trained with simple SFT on only 11.7k synthesized samples, OpenSeeker outperforms the second-best fully open-source agent DeepDive on BrowseComp (29.5% vs. 15.3%) and surpasses the industrial system Tongyi DeepResearch on BrowseComp-ZH (48.4% vs. 46.7%).

OpenSeeker 通过完全开源训练数据和两项核心技术创新（事实导向的可扩展可控问答合成和去噪轨迹合成）来促进前沿搜索代理的研究。OpenSeeker 在仅使用 11,700 个合成样本进行训练后，在 BrowseComp、BrowseComp-ZH、xbench-DeepSearch 和 WideSearch 等多个基准测试中表现出色，超越了其他开源和工业搜索代理。

SemBench: A Benchmark for Semantic Query Processing Engines

Authors: Jiale Lao, Andreas Zimmerer, Olga Ovcharenko, Tianji Cong, Matthew Russo, Gerardo Vitagliano, Michael Cochez, Fatma Özcan, Gautam Gupta, Thibaud Hottelier, H. V. Jagadish, Kris Kissel, Sebastian Schelter, Andreas Kipf, Immanuel Trummer

First: 2025-11-03T16:25:19+00:00 · Latest: 2026-03-16T17:51:06+00:00

Comments: Accepted to VLDB 2026; Revised version

Abstract

We present a benchmark targeting a novel class of systems: semantic query processing engines. Those systems rely inherently on generative and reasoning capabilities of state-of-the-art large language models (LLMs). They extend SQL with semantic operators, configured by natural language instructions, that are evaluated via LLMs and enable users to perform various operations on multimodal data. Our benchmark introduces diversity across three key dimensions: scenarios, modalities, and operators. Included are scenarios ranging from movie review analysis to car damage detection. Within these scenarios, we cover different data modalities, including images, audio, and text. Finally, the queries involve a diverse set of operators, including semantic filters, joins, mappings, ranking, and classification operators. We evaluated our benchmark on three academic systems (LOTUS, Palimpzest, and ThalamusDB) and one industrial system, Google BigQuery. Although these results reflect a snapshot of systems under continuous development, our study offers crucial insights into their current strengths and weaknesses, illuminating promising directions for future research.

中文标题/摘要

标题：SemBench：语义查询处理引擎的基准测试

我们提出了一项基准测试，针对一类新型系统：语义查询处理引擎。这些系统依赖于最先进的大型语言模型（LLMs）的生成和推理能力。它们扩展了SQL，加入了由自然语言指令配置的语义操作符，通过LLMs进行评估，使用户能够对多模态数据执行各种操作。我们的基准测试在三个关键维度上引入了多样性：场景、模态和操作符。其中包括从电影评论分析到汽车损伤检测的各种场景。在这些场景中，我们涵盖了不同类型的模态数据，包括图像、音频和文本。最后，查询涉及多种操作符，包括语义过滤器、连接、映射、排名和分类操作符。我们在三个学术系统（LOTUS、Palimpzest和ThalamusDB）和一个工业系统Google BigQuery上评估了我们的基准测试。尽管这些结果反映了正在持续开发的系统的一个快照，但我们的研究提供了关于它们当前优势和劣势的关键见解，揭示了未来研究的有希望的方向。

Summary / 总结

The research aims to evaluate semantic query processing engines that leverage large language models for generative and reasoning tasks. The benchmark covers diverse scenarios, modalities, and operators, including image, audio, and text data, and operations like filtering, joining, and ranking. Evaluations on four systems—LOTUS, Palimpzest, ThalamusDB, and Google BigQuery—highlight their current capabilities and limitations, providing guidance for future improvements.

研究旨在评估依赖大型语言模型进行生成和推理任务的语义查询处理引擎。基准涵盖了多种场景、模态和操作，包括图像、音频和文本数据，以及过滤、连接和排序等操作。对LOTUS、Palimpzest、ThalamusDB和Google BigQuery四种系统的评估揭示了它们当前的功能和局限性，为未来的研究指明了方向。

Effective Distillation to Hybrid xLSTM Architectures

Authors: Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, Sepp Hochreiter

First: 2026-03-16T17:49:04+00:00 · Latest: 2026-03-16T17:49:04+00:00

Abstract

There have been numerous attempts to distill quadratic attention-based large language models (LLMs) into sub-quadratic linearized architectures. However, despite extensive research, such distilled models often fail to match the performance of their teacher LLMs on various downstream tasks. We set out the goal of lossless distillation, which we define in terms of tolerance-corrected Win-and-Tie rates between student and teacher on sets of tasks. To this end, we introduce an effective distillation pipeline for xLSTM-based students. We propose an additional merging stage, where individually linearized experts are combined into a single model. We show the effectiveness of this pipeline by distilling base and instruction-tuned models from the Llama, Qwen, and Olmo families. In many settings, our xLSTM-based students recover most of the teacher's performance, and even exceed it on some downstream tasks. Our contributions are an important step towards more energy-efficient and cost-effective replacements for transformer-based LLMs.

中文标题/摘要

标题：有效蒸馏至混合xLSTM架构

已经有许多尝试将基于二次注意力的大语言模型（LLM）蒸馏为次二次线性化架构。然而，尽管进行了大量研究，这些蒸馏模型往往无法在各种下游任务上达到其教师LLM的性能。我们设定了无损蒸馏的目标，我们用学生和教师在任务集上的容忍校正胜率和平局率来定义这一目标。为此，我们引入了一种基于xLSTM的学生的有效蒸馏管道。我们提出了一种额外的合并阶段，其中将单独线性化的专家合并为一个模型。通过从Llama、Qwen和Olmo家族中蒸馏基础模型和指令调优模型，我们展示了该管道的有效性。在许多情况下，我们的基于xLSTM的学生能够恢复大部分教师的性能，并在某些下游任务上甚至超过了教师的性能。我们的贡献是朝着更节能和成本效益更高的transformer基LLM替代方案迈出的重要一步。

Summary / 总结

The research aims to distill large language models (LLMs) into more efficient architectures without losing performance. The authors propose an effective distillation pipeline for xLSTM-based students, including an additional merging stage to combine linearized experts into a single model. Experiments on Llama, Qwen, and Olmo models show that xLSTM-based students can recover most of the teacher's performance and even outperform it on some downstream tasks, demonstrating the potential for more energy-efficient and cost-effective LLM replacements.

研究旨在通过蒸馏将大型语言模型（LLMs）转换为更高效的架构，同时保持性能。作者引入了一种针对xLSTM学生的有效蒸馏管道，包括一个额外的合并阶段，将线性化的专家合并为一个模型。实验表明，xLSTM基的学生可以恢复大部分教师的性能，并在某些下游任务上甚至超过了它，展示了更节能和成本效益更高的LLM替代品的潜力。
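The abstract above defines lossless distillation through tolerance-corrected Win-and-Tie rates but does not give the formula. As a rough sketch under an assumed definition, a task counts as a win or tie when the student's score is no more than `tolerance` below the teacher's; the function name, the scores, and the tolerance value are all hypothetical.

```python
def win_and_tie_rate(student_scores, teacher_scores, tolerance=0.0):
    """Fraction of tasks where the student matches or beats the teacher,
    allowing a shortfall of up to `tolerance` (assumed definition)."""
    if len(student_scores) != len(teacher_scores) or not student_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    hits = sum(
        1
        for student, teacher in zip(student_scores, teacher_scores)
        if student >= teacher - tolerance
    )
    return hits / len(student_scores)

# Hypothetical per-task accuracies for a student and its teacher.
student = [0.70, 0.55, 0.81]
teacher = [0.72, 0.60, 0.80]
strict_rate = win_and_tie_rate(student, teacher)
tolerant_rate = win_and_tie_rate(student, teacher, tolerance=0.03)
```

With these made-up scores, the strict rate counts only the third task, while a tolerance of 0.03 also admits the first task, which illustrates how the tolerance correction loosens the comparison.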

Computational Concept of the Psyche

Authors: Anton Kolonin, Vladimir Krykov

First: 2026-03-16T17:46:58+00:00 · Latest: 2026-03-16T17:46:58+00:00

Comments: 19 pages, 5 figures

Abstract

This article presents an overview of approaches to modeling the human psyche in the context of constructing an artificial one. Based on this overview, a concept of cognitive architecture is proposed, in which the psyche is viewed as the operating system of a living or artificial subject, comprising a space of states, including the state of needs that determine the meaning of a subject's being in relation to stimuli from the external world, and intelligence as a decision-making system regarding actions in this world to satisfy these needs. Based on this concept, a computational formalization is proposed for creating artificial general intelligence systems for an agent through experiential learning in a state space that includes agent's needs, taking into account their biological or existential significance for the intelligent agent, along with agent's sensations and actions. Thus, the problem of constructing artificial general intelligence is formalized as a system for making optimal decisions in the space of specific agent needs under conditions of uncertainty, maximizing success in achieving goals, minimizing existential risks, and maximizing energy efficiency. A minimal experimental implementation of the model is presented.

中文标题/摘要

标题：心理的计算概念

本文概述了在构建人工心理时对人类心理建模的方法。在此基础上，提出了一种认知架构的概念，将心理视为生物或人工主体的操作系统，包括状态空间，包括需求状态，这些需求状态决定了主体在面对外部世界刺激时的意义；以及作为满足这些需求在世界中采取行动的决策系统。基于此概念，提出了一种计算形式化方法，通过经验学习在包括代理需求的状态空间中创建通用人工智能系统，同时考虑这些需求对智能代理的生物学或存在意义，以及代理的感觉和行动。因此，构建通用人工智能的问题被形式化为在特定代理需求的空间中做出最优决策的系统，在不确定性条件下最大化目标实现的成功率，最小化存在风险，并最大化能量效率。还呈现了该模型的最小实验实现。

Summary / 总结

This paper aims to model the human psyche for artificial intelligence, proposing a cognitive architecture where the psyche is seen as an operating system comprising a state space and intelligence as a decision-making system. The authors formalize the creation of artificial general intelligence through experiential learning in a state space that includes the agent's needs, sensations, and actions. Key findings include formalizing the AGI problem as optimal decision-making under uncertainty that maximizes success in achieving goals, minimizes existential risks, and maximizes energy efficiency. A minimal experimental implementation is provided.

本文提出了一种认知架构，用于建模人类心理和人工智能，将心理视为一个包括状态空间和决策系统的操作系统。该概念通过经验学习为创建AGI进行了形式化，考虑了代理的需求及其生物学或存在意义。该问题被形式化为在不确定性条件下做出最优决策，最大化成功、最小化风险和最大化能量效率。还呈现了一个最小的实验实现模型。

Grounding World Simulation Models in a Real-World Metropolis

Authors: Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim

First: 2026-03-16T17:46:04+00:00 · Latest: 2026-03-16T17:46:04+00:00

Comments: project page: https://seoul-world-model.github.io/

Abstract

What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.

中文标题/摘要

标题：将世界模拟模型扎根于真实都市

如果世界模拟模型能够渲染一个实际存在的城市，而不是一个想象中的环境会怎样？先前的生成世界模型通过想象所有内容来合成视觉上可信但人工的环境。我们提出了首尔世界模型（SWM），这是一个基于首尔市的真实城市规模的世界模型。SWM 通过检索增强的邻近街景图像条件来锚定自回归视频生成。然而，这种设计引入了几个挑战，包括检索参考与动态目标场景之间的时间对齐问题，以及由于车辆安装捕获在稀疏时间间隔内导致的轨迹多样性有限和数据稀疏性。我们通过跨时间配对、大规模合成数据集以及从稀疏街景图像中合成连贯训练视频的视图插值管道来解决这些挑战。我们还引入了一个虚拟前瞻汇流口，通过不断将每个片段重新锚定到未来位置的检索图像来稳定长时生成。我们在首尔、釜山和安阿伯三个城市中将SWM 与最近的视频世界模型进行了评估。SWM 在生成空间上忠实、时间上一致、长时的视频方面优于现有方法，这些视频扎根于实际的城市环境，轨迹可达数百米，并支持多种摄像机运动和文本提示场景变化。

Summary / 总结

The research aims to create a world simulation model that generates videos of a real city, specifically Seoul, rather than an imagined environment. The method involves using autoregressive video generation with retrieval-augmented conditioning on nearby street-view images to address challenges such as temporal misalignment and data sparsity. Key findings show that Seoul World Model (SWM) outperforms existing methods in generating spatially faithful, temporally consistent, and long-horizon videos, supporting diverse camera movements and text-prompted scenario variations.

研究旨在创建一个生成真实城市首尔视频的世界模拟模型，而非想象中的环境。方法包括使用基于检索的自回归视频生成，并通过附近街景图像增强条件来解决时间错位和数据稀疏性等问题。关键发现表明，SWM在生成空间上忠实、时间上一致且长时序的视频方面优于现有方法，支持多种摄像机运动和文本提示的场景变化。

Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding

Authors: Xianqiang Gao, Pingrui Zhang, Delin Qu, Dong Wang, Zhigang Wang, Yan Ding, Bin Zhao

Venue: AAAI 2025 Oral

First: 2024-08-23T12:27:33+00:00 · Latest: 2026-03-16T17:35:55+00:00

Comments: Accepted by AAAI 2025 (Oral)

Abstract

3D Object Affordance Grounding aims to predict the functional regions on a 3D object and has laid the foundation for a wide range of applications in robotics. Recent advances tackle this problem via learning a mapping between 3D regions and a single human-object interaction image. However, the geometric structure of the 3D object and the object in the human-object interaction image are not always consistent, leading to poor generalization. To address this issue, we propose to learn generalizable invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework. It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) utilizes an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates in multiple images. In addition, we construct the Multi-Image and Point Affordance (MIPA) benchmark, and our method outperforms existing state-of-the-art methods in various experimental comparisons.

中文标题/摘要

标题：学习2D不变性功能知识以实现3D功能定位

3D物体功能定位旨在预测3D物体上的功能区域，并为机器人技术的广泛应用奠定了基础。最近的进展通过学习3D区域与单个人机交互图像之间的映射来解决这一问题。然而，3D物体的几何结构与人机交互图像中的物体并不总是保持一致，导致泛化能力较差。为了解决这一问题，我们提出从同一功能类别内的多个个人机交互图像中学习可泛化的不变功能知识。具体而言，我们引入了多图像引导的不变特征感知3D功能定位（MIFAG）框架。该框架通过识别多个个人机交互图像中的共同交互模式来定位3D物体的功能区域。首先，不变功能知识提取模块（IAM）利用迭代更新策略逐步从多个图像中提取对齐的功能知识，并将其整合到功能字典中。然后，功能字典自适应融合模块（ADM）学习全面的点云表示，考虑多个图像中的所有功能候选。此外，我们构建了多图像和点功能（MIPA）基准，并在各种实验比较中我们的方法优于现有最先进的方法。

Summary / 总结

The research aims to improve 3D object affordance grounding by addressing the inconsistency between 3D object geometry and 2D interaction images. The proposed MIFAG framework extracts invariant affordance knowledge from multiple images and integrates it into an affordance dictionary. This method outperforms existing approaches in various experimental comparisons, demonstrating better generalization and accuracy. The MIPA benchmark was constructed to validate the method's effectiveness.

本文提出MIFAG框架以解决3D物体功能区域定位的问题，通过从同一类别多个交互图像中学习不变的功能知识来提高泛化能力。该方法包括不变功能知识提取模块和功能字典自适应融合模块，共同识别常见交互模式并生成综合点云表示。实验结果表明，所提出的方法在各种比较中优于现有最先进的方法。

NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

Authors: Nandan Kumar Jha, Brandon Reagen

Venue: ICLR 2026

First: 2026-03-06T22:50:43+00:00 · Latest: 2026-03-16T17:30:30+00:00

Comments: Accepted to ICLR 2026. Project page: https://nerve-eigenspectrum.github.io

Abstract

We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Despite FFNs dominating the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection. We validate NerVE across model scales, and diverse architectural and optimizer configurations, each uniquely shaping FFN dynamics: normalization schemes controlling variance flow; FFN weight geometries constraining latent space; positional encoding and activation functions regulating information flow; and optimizer choices redistributing effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with model's generalization ability and respond predictably to design choices, generalizing beyond transformer to MLP-Mixer architectures, providing actionable insights for architectural and optimizer choices beyond trial-and-error.

中文标题/摘要

标题：NerVE：大规模语言模型前馈网络非线性特征谱动力学

我们引入了NerVE，这是一种统一的特征谱框架，用于理解大规模语言模型（LLMs）中的前馈网络（FFNs）如何在高维潜空间中组织和调节信息流。尽管FFNs占据了大部分参数预算，但它们的高维动力学仍然知之甚少。NerVE 通过四种互补的度量标准（谱熵（分散性）、参与比（有效维度）、特征值早期富集（顶部重性）和Jensen-Shannon散度（分布变化））轻量级、内存高效地跟踪特征谱动力学来填补这一空白。我们的核心见解是，FFN的非线性重新注入了特征模式中的方差，从根本上控制了潜空间的利用，并且优化器几何结构强烈调节了这种方差重新注入的程度。我们在不同规模的模型、多样化的架构和优化器配置中验证了NerVE，每种配置都独特地塑造了FFN的动力学：归一化方案控制方差流动；FFN权重几何结构限制潜空间；位置编码和激活函数调节信息流动；以及优化器选择重新分配深度中的有效容量。在这些设置中，NerVE 一致地恢复了与模型泛化能力相关的稳定谱特征，并对设计选择作出可预测的响应，还可从Transformer推广到MLP-Mixer架构，为架构和优化器选择提供了超越试错的可操作见解。

Summary / 总结

NerVE is a framework that uses four metrics—Spectral Entropy, Participation Ratio, Eigenvalue Early Enrichment, and Jensen-Shannon divergence—to track the dynamics of eigenspectrum in feed-forward networks of large language models. This approach helps understand how these networks manage information flow and utilize latent dimensions. The study finds that nonlinearity in FFNs reinjects variance across eigenmodes, and optimizer geometry significantly influences this process. These insights are validated across various model scales and configurations, showing consistent spectral signatures that correlate with model generalization and provide guidance for architectural and optimizer choices.

NerVE 是一个框架，通过四个指标（谱熵、参与度比、特征值早期富集和 Jensen-Shannon 散度）来跟踪大型语言模型中前馈网络的谱动态。这种方法有助于理解这些网络如何在高维潜空间中管理信息流。研究发现，前馈网络中的非线性会重新注入特征值中的方差，而优化器几何结构显著影响这种方差。这些见解在不同的模型规模和配置中得到了验证，显示了 NerVE 可以预测模型的泛化能力和指导架构和优化器选择的作用。
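Three of the four NerVE metrics named above have widely used textbook forms, so a small sketch may help make them concrete. The code below computes spectral entropy, participation ratio, and Jensen-Shannon divergence from lists of non-negative eigenvalues using those textbook definitions; the paper's exact normalization may differ, and Eigenvalue Early Enrichment is omitted because the abstract does not define it.

```python
import math

def normalize(eigenvalues):
    """Turn non-negative eigenvalues into a probability distribution."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return [value / total for value in eigenvalues]

def spectral_entropy(eigenvalues):
    """Shannon entropy (nats) of the normalized spectrum: higher means more dispersed."""
    return -sum(p * math.log(p) for p in normalize(eigenvalues) if p > 0)

def participation_ratio(eigenvalues):
    """(sum of eigenvalues)^2 / sum of squared eigenvalues: an effective dimensionality."""
    first_moment = sum(eigenvalues)
    second_moment = sum(value * value for value in eigenvalues)
    return first_moment * first_moment / second_moment

def js_divergence(eigenvalues_a, eigenvalues_b):
    """Jensen-Shannon divergence (nats) between two normalized spectra of equal length."""
    p, q = normalize(eigenvalues_a), normalize(eigenvalues_b)
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

# Illustrative spectra: variance spread evenly vs. concentrated in one mode.
flat_spectrum = [1.0, 1.0, 1.0, 1.0]
peaked_spectrum = [1.0, 0.0, 0.0, 0.0]
```

A flat spectrum gives the maximum entropy and a participation ratio equal to the number of modes, while a spectrum concentrated in one mode gives zero entropy and a participation ratio of 1.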

Mamba-3: Improved Sequence Modeling using State Space Principles

Authors: Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, Albert Gu

Venue: ICLR 2026

First: 2026-03-16T17:30:08+00:00 · Latest: 2026-03-16T17:30:08+00:00

Comments: ICLR 2026

Abstract

Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While the current Transformer-based models deliver strong model quality, their quadratic compute and linear memory make inference expensive. This has spurred the development of sub-quadratic models with reduced linear compute and constant memory requirements. However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) viewpoint of linear models. We combine: (1) a more expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule that enables richer state tracking, and (3) a multi-input, multi-output (MIMO) formulation for better model performance without increasing decode latency. Together with architectural refinements, our Mamba-3 model achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points compared to the next best model (Gated DeltaNet), with Mamba-3's MIMO variant further improving accuracy by another 1.2 points for a total 1.8 point gain. Across state-size experiments, Mamba-3 achieves comparable perplexity to Mamba-2 despite using half of its predecessor's state size. Our evaluations demonstrate Mamba-3's ability to advance the performance-efficiency Pareto frontier.

中文标题/摘要

标题：Mamba-3：基于状态空间原理改进的序列建模

扩展推理时计算已成为提升大语言模型性能的重要驱动力，这使得推理效率与模型质量一样成为模型设计的核心关注点。尽管当前基于Transformer的模型在模型质量上表现出色，但它们的计算复杂度呈二次增长，内存需求呈线性增长，导致推理成本高昂。这促使开发了次二次模型，这些模型将计算降为线性，内存需求为常数。然而，许多最近的线性模型为了提高算法效率而牺牲了模型质量和能力，在状态跟踪等任务上表现不佳。此外，它们理论上线性的推理在实践中仍然存在硬件效率低的问题。从推理优先的视角出发，我们受线性模型的状态空间模型（SSM）观点启发，提出了三项核心方法改进。我们结合了：（1）从SSM离散化中推导出的更具表达性的递归，（2）复数状态更新规则，以实现更丰富的状态跟踪，以及（3）多输入多输出（MIMO）形式，以提高模型性能而不增加解码延迟。结合架构改进，我们的Mamba-3模型在检索、状态跟踪和下游语言建模任务上取得了显著的提升。在1.5B规模下，Mamba-3相比下一个最佳模型（Gated DeltaNet）的平均下游准确性提高了0.6个百分点，而Mamba-3的MIMO变体进一步提高了1.2个百分点，总共提高了1.8个百分点。在不同状态大小的实验中，Mamba-3在使用其前身一半状态大小的情况下，实现了与Mamba-2相当的困惑度。我们的评估表明，Mamba-3能够推动性能效率帕累托前沿的进步。

Summary / 总结

Mamba-3 targets inference efficiency by applying state space principles: a more expressive recurrence, complex-valued state updates, and a MIMO formulation. It achieves significant gains in retrieval, state-tracking, and downstream language modeling tasks. At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points over the next best model, Gated DeltaNet, and its MIMO variant adds another 1.2 points for a 1.8-point total gain.

Mamba-3 通过引入状态空间原理，增强递归，启用复数状态更新，并采用多输入多输出（MIMO）形式，提高了推理效率。该模型在下游语言建模任务中取得了显著进展，相比下一个最佳模型，在1.5B规模下平均准确率提高了1.8个百分点，并且在状态大小减半的情况下保持了与前一代模型相当的困惑度。

LHM++: An Efficient Large Human Reconstruction Model for Pose-free Images to 3D

Authors: Lingteng Qiu, Peihao Li, Heyuan Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Rui Peng, Siyu Zhu, Xiaoguang Han, Guanying Chen, Zilong Dong

First: 2025-06-16T17:59:56+00:00 · Latest: 2026-03-16T17:28:33+00:00

Comments: HomePage: https://lingtengqiu.github.io/LHM++/ Online Demo: https://huggingface.co/spaces/Lingteng/LHMPP

Abstract

Reconstructing animatable 3D humans from casually captured images of articulated subjects without camera or pose information is highly practical but remains challenging due to view misalignment, occlusions, and the absence of structural priors. In this work, we present LHM++, an efficient large-scale human reconstruction model that generates high-quality, animatable 3D avatars within seconds from one or multiple pose-free images. At its core is an Encoder-Decoder Point-Image Transformer architecture that progressively encodes and decodes 3D geometric point features to improve efficiency, while fusing hierarchical 3D point features with image features through multimodal attention. The fused features are decoded into 3D Gaussian splats to recover detailed geometry and appearance. To further enhance visual fidelity, we introduce a lightweight 3D-aware neural animation renderer that refines the rendering quality of reconstructed avatars in real time. Extensive experiments show that our method produces high-fidelity, animatable 3D humans without requiring camera or pose annotations. Our code and project page are available at https://lingtengqiu.github.io/LHM++/

中文标题/摘要

标题：LHM++：一种高效的大型人体重建模型，从无姿态图像生成3D人体

从包含姿态变化的主体的随意拍摄图像中重建可动画的3D人体，无需相机或姿态信息，尽管具有很高的实用性，但由于视角不一致、遮挡和缺乏结构先验，仍具有挑战性。本文中，我们提出了一种高效的大型人体重建模型LHM++，能够在几秒钟内从一张或多张无姿态图像生成高质量、可动画的3D头像。其核心是一种编码器-解码器点-图像变换架构，逐步编码和解码3D几何点特征以提高效率，同时通过多模态注意力融合层次3D点特征和图像特征。融合后的特征被解码为3D高斯斑点以恢复详细的几何形状和外观。为了进一步提高视觉保真度，我们引入了一种轻量级的3D感知神经动画渲染器，可以实时优化重建头像的渲染质量。大量实验表明，我们的方法能够在无需相机或姿态注释的情况下生成高保真度、可动画的3D人体。我们的代码和项目页面可在https://lingtengqiu.github.io/LHM++/获取。

Summary / 总结

The research aims to reconstruct high-fidelity 3D humans from pose-free images without camera or pose information. The method uses an Encoder-Decoder Point-Image Transformer to encode and decode 3D geometric features efficiently, and a 3D-aware neural animation renderer to refine the rendering quality. Experiments show that the model can generate animatable 3D avatars within seconds from one or multiple images, achieving high visual fidelity without requiring additional annotations. The code and project page are available online.

该研究旨在从无姿态信息的图像中高效地重建3D人类模型，无需相机或姿态标注。提出了一种LHM++方法，使用编码器-解码器点-图像变换器来生成高质量、可动画化的3D化身。该方法通过多模态注意力融合3D点特征和图像特征，并将其解码为3D高斯点云以恢复详细的几何和外观。实时的轻量级3D感知神经动画渲染器进一步提升了视觉保真度。实验表明，LHM++能够在几秒内生成高质量的3D人类模型，无需相机或姿态标注。

Lore: Repurposing Git Commit Messages as a Structured Knowledge Protocol for AI Coding Agents

Authors: Ivan Stetsenko

First: 2026-03-16T17:27:30+00:00 · Latest: 2026-03-16T17:27:30+00:00

Comments: 8 pages, 1 figure, 1 table. Preprint available at https://doi.org/10.5281/zenodo.19051840

Abstract

As AI coding agents become both primary producers and consumers of source code, the software industry faces an accelerating loss of institutional knowledge. Each commit captures a code diff but discards the reasoning behind it - the constraints, rejected alternatives, and forward-looking context that shaped the decision. I term this discarded reasoning the Decision Shadow. This paper proposes Lore, a lightweight protocol that restructures commit messages - using native git trailers - into self-contained decision records carrying constraints, rejected alternatives, agent directives, and verification metadata. Lore requires no infrastructure beyond git, is queryable via a standalone CLI tool, and is discoverable by any agent capable of running shell commands. The paper formalizes the protocol, compares it against five competing approaches, stress-tests it against its strongest objections, and outlines an empirical validation path.

中文标题/摘要

标题：Lore：将Git提交信息重新用作AI编码代理的结构化知识协议

随着AI编码代理成为源代码的主要生产者和消费者，软件行业面临着机构知识加速流失的问题。每次提交记录了一个代码差异，但丢弃了背后的推理，即约束、被拒绝的替代方案和塑造决策的前瞻背景。我将这种被丢弃的推理称为决策阴影（Decision Shadow）。本文提出Lore，一种轻量级协议，通过使用Git原生的trailer（尾部字段）重新结构化提交信息，使其成为包含约束、被拒绝的替代方案、代理指令和验证元数据的自包含决策记录。Lore不需要Git之外的任何基础设施，可以通过独立的命令行工具查询，并且任何能够运行shell命令的代理都可以发现它。本文形式化了该协议，将其与五种竞争方法进行了比较，针对其最强的反对意见进行了压力测试，并概述了实证验证路径。

Summary / 总结

This paper addresses the loss of institutional knowledge in software development due to AI coding agents. It proposes Lore, a protocol that restructures commit messages into structured decision records using git trailers. Lore is lightweight, requiring no additional infrastructure, and is queryable via a CLI tool. The paper compares Lore against five alternatives, stress-tests it, and outlines an empirical validation path.

本文探讨了由于代码变更背后的理由被丢弃而导致软件行业中机构知识流失的问题。提出了一种名为Lore的协议，通过git尾标将提交信息重新结构化为包含决策记录的结构化数据。Lore轻量级，可以通过命令行工具查询，无需额外基础设施。论文将Lore与五种替代方案进行了比较，对其进行了压力测试，并概述了实证验证的路径。
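To make the idea of commit messages as decision records more concrete, the sketch below parses git-style trailers (`Key: value` lines in the final paragraph of a message) into a dictionary. The trailer keys `Constraint`, `Rejected`, and `Directive` and the sample commit are hypothetical, chosen only to mirror the categories listed in the abstract; they are not taken from the Lore specification, and the parser is a simplification of git's own trailer rules.

```python
def parse_trailers(message):
    """Return {key: [values]} for the trailer block of a commit message.

    Simplified rule (an assumption): the trailer block is the last
    blank-line-separated paragraph, and every line in it must look
    like 'Key: value' with no spaces in the key.
    """
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return {}  # a lone subject line carries no trailers
    trailers = {}
    for line in paragraphs[-1].splitlines():
        key, separator, value = line.partition(": ")
        if not separator or " " in key:
            return {}  # last paragraph is ordinary prose, not trailers
        trailers.setdefault(key, []).append(value.strip())
    return trailers

# Hypothetical commit message with illustrative trailer keys.
MESSAGE = """Use exponential backoff for upload retries

Fixed-interval retries overloaded the gateway during outages.

Constraint: total retry time must stay under 30s
Rejected: fixed 1s interval
Rejected: client-side queue
Directive: do not raise max attempts without load testing
"""

record = parse_trailers(MESSAGE)
```

In a real repository, git can extract the trailer block itself via `git interpret-trailers --parse`, so a parser like this would only be needed when working from raw message text.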

The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Authors: Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin

Venue: NeurIPS 2025

First: 2026-03-16T17:25:42+00:00 · Latest: 2026-03-16T17:25:42+00:00

Comments: 41 pages, 26 figures, 5 tables. NeurIPS 2025 Competition Track

Abstract

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

中文标题/摘要

标题：PokeAgent挑战：大规模竞争与长时序学习

我们提出了PokeAgent挑战，这是一个基于宝可梦多智能体战斗系统和广阔角色扮演游戏（RPG）环境的大规模决策研究基准。部分可观测性、博弈论推理和长时序规划仍然是前沿AI领域的开放问题，但很少有基准能在现实条件下同时对这三项进行压力测试。PokeAgent通过两个互补的赛道来解决这些局限性：我们的战斗赛道要求在宝可梦战斗中进行部分可观测性的战略推理和泛化；我们的速通赛道则要求在宝可梦RPG中进行长时序规划和顺序决策。我们的战斗赛道提供了一个包含2000万以上战斗轨迹的数据集，以及一系列基于启发式、强化学习和大语言模型的基线，能够实现高水平的竞争性游戏。我们的速通赛道提供了第一个标准化的RPG速通评估框架，包括一个开源的多智能体编排系统，用于模块化和可重复的基于套件的大语言模型方法比较。我们的NeurIPS 2025竞赛赛道验证了我们资源的质量和研究社区对宝可梦的兴趣，超过100支队伍在两个赛道中竞争，获胜解决方案在我们的论文中有详细说明。参赛提交和我们的基线揭示了通用（大语言模型）、专业（强化学习）和精英人类表现之间的巨大差距。与BenchPress评估矩阵的分析表明，宝可梦战斗几乎与标准的大语言模型基准无关，测量了现有套件未捕捉到的能力，并将宝可梦定位为一个未解决的基准，可以推动强化学习和大语言模型研究的发展。我们将其转换为一个活的基准，具有实时排行榜的战斗赛道和自包含评估的速通赛道，可在https://pokeagentchallenge.com/访问。

Summary / 总结

The PokeAgent Challenge is a large-scale benchmark for decision-making research based on Pokemon's multi-agent battle system and RPG environment. It addresses open problems in AI such as partial observability, game-theoretic reasoning, and long-horizon planning through two tracks: Battling Track for strategic reasoning under partial observability and Speedrunning Track for long-horizon planning in RPGs. The challenge provides a dataset of over 20 million battle trajectories and an evaluation framework for speedrunning, with submissions showing significant gaps between LLM, RL, and human performance. The competition at NeurIPS 2025 attracted over 100 teams and highlighted Pokemon as a unique benchmark for advancing AI research.

PokeAgent挑战基于宝可梦战斗系统和RPG环境构建了一个大规模的决策研究基准。它通过两个赛道来解决AI中的开放问题，即部分可观测性、博弈论推理和长期规划：战斗赛道侧重于在部分可观测性下的战略推理，速通赛道侧重于RPG中的长期规划。挑战提供了两个赛道的数据集和基线，并举办了NeurIPS 2025竞赛，揭示了不同AI方法之间的显著差距，并突出了宝可梦作为推进RL和LLM研究的独特基准的重要性。

EvoX: Meta-Evolution for Automated Discovery

Authors: Shu Liu, Shubham Agarwal, Monishwaran Maheswaran, Mert Cemri, Zhifei Li, Qiuyang Mang, Ashwin Naren, Ethan Boneh, Audrey Cheng, Melissa Z. Pan, Alexander Du, Kurt Keutzer, Alvin Cheung, Alexandros G. Dimakis, Koushik Sen, Matei Zaharia, Ion Stoica

First: 2026-02-26T18:54:41+00:00 · Latest: 2026-03-16T17:22:57+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent work such as AlphaEvolve has shown that combining LLM-driven optimization with evolutionary search can effectively improve programs, prompts, and algorithms across domains. In this paradigm, previously evaluated solutions are reused to guide the model toward new candidate solutions. Crucially, the effectiveness of this evolution process depends on the search strategy: how prior solutions are selected and varied to generate new candidates. However, most existing methods rely on fixed search strategies with predefined knobs (e.g., explore-exploit ratios) that remain static throughout execution. While effective in some settings, these approaches often fail to adapt across tasks, or even within the same task as the search space changes over time. We introduce EvoX, an adaptive evolution method that optimizes its own evolution process. EvoX jointly evolves candidate solutions and the search strategies used to generate them, continuously updating how prior solutions are selected and varied based on progress. This enables the system to dynamically shift between different search strategies during the optimization process. Across nearly 200 real-world optimization tasks, EvoX outperforms existing AI-driven evolutionary methods including AlphaEvolve, OpenEvolve, GEPA, and ShinkaEvolve on the majority of tasks.

中文标题/摘要

标题：EvoX：元进化以实现自动化发现

近期的工作，如AlphaEvolve表明，将LLM驱动的优化与进化搜索相结合，可以在跨领域的程序、提示和算法中有效提升性能。在此范式中，先前评估过的解决方案被重用以引导模型向新的候选解决方案发展。关键的是，这一进化过程的有效性取决于搜索策略：如何选择和变异先前的解决方案以生成新的候选者。然而，大多数现有方法依赖于固定不变的搜索策略，这些策略在执行过程中保持不变。虽然在某些情况下这些方法是有效的，但它们往往无法跨任务或在搜索空间随时间变化时在同一个任务中进行调整。我们引入了EvoX，一种自适应进化方法，优化其自身的进化过程。EvoX同时进化候选解决方案及其生成策略，根据进展不断更新如何选择和变异先前的解决方案。这使得系统在优化过程中能够动态地在不同的搜索策略之间切换。在近200个实际优化任务中，EvoX在大多数任务上优于现有的基于AI的进化方法，包括AlphaEvolve、OpenEvolve、GEPA和ShinkaEvolve。

Summary / 总结

EvoX is an adaptive evolution method that jointly evolves candidate solutions and the search strategies used to generate them, so that the search strategy is adjusted dynamically based on progress. Across nearly 200 real-world optimization tasks, EvoX outperforms existing AI-driven evolutionary methods such as AlphaEvolve, OpenEvolve, GEPA, and ShinkaEvolve on the majority of tasks.

EvoX 是一种自适应进化方法，通过同时进化候选解决方案和生成这些解决方案的搜索策略来优化其自身的进化过程。这使得系统可以根据进度动态调整其搜索策略。在近200个实际优化任务中，EvoX 在大多数任务上都优于现有的基于AI的进化方法，如AlphaEvolve、OpenEvolve、GEPA和ShinkaEvolve。

Panoramic Affordance Prediction

Authors: Zixin Zhang, Chenfei Liao, Hongfei Zhang, Harold Haodong Chen, Kanghao Chen, Zichen Wen, Litao Guo, Bin Ren, Xu Zheng, Yinchuan Li, Xuming Hu, Nicu Sebe, Ying-Cong Chen

First: 2026-03-16T17:21:49+00:00 · Latest: 2026-03-16T17:21:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Affordance prediction serves as a critical bridge between perception and action in embodied AI. However, existing research is confined to pinhole camera models, which suffer from narrow Fields of View (FoV) and fragmented observations, often missing critical holistic environmental context. In this paper, we present the first exploration into Panoramic Affordance Prediction, utilizing 360-degree imagery to capture global spatial relationships and holistic scene understanding. To facilitate this novel task, we first introduce PAP-12K, a large-scale benchmark dataset containing over 1,000 ultra-high-resolution (12k, 11904 x 5952) panoramic images with over 12k carefully annotated QA pairs and affordance masks. Furthermore, we propose PAP, a training-free, coarse-to-fine pipeline inspired by the human foveal visual system to tackle the ultra-high resolution and severe distortion inherent in panoramic images. PAP employs recursive visual routing via grid prompting to progressively locate targets, applies an adaptive gaze mechanism to rectify local geometric distortions, and utilizes a cascaded grounding pipeline to extract precise instance-level masks. Experimental results on PAP-12K reveal that existing affordance prediction methods designed for standard perspective images suffer severe performance degradation and fail due to the unique challenges of panoramic vision. In contrast, the PAP framework effectively overcomes these obstacles, significantly outperforming state-of-the-art baselines and highlighting the immense potential of panoramic perception for robust embodied intelligence.

中文标题/摘要

标题：全景功能预测

功能预测是将感知与行动在具身人工智能中联系起来的关键桥梁。然而，现有的研究局限于针孔相机模型，这些模型视野狭窄且观察片段化，经常缺失关键的整体环境背景。在本文中，我们首次探索全景功能预测，利用360度图像捕捉全局空间关系和整体场景理解。为了促进这一新型任务，我们首先引入了PAP-12K，这是一个大规模基准数据集，包含超过1,000张超高分辨率（12k，11904 x 5952）全景图像，以及超过12k个仔细标注的问答对和功能掩码。此外，我们提出了PAP，一种无需训练、从粗到细的管道，灵感来源于人类的中心视觉系统，以应对全景图像中的超高清分辨率和严重的几何失真。PAP 通过递归视觉路由和网格提示逐步定位目标，应用自适应凝视机制校正局部几何失真，并利用级联定位管道提取精确的实例级掩码。在PAP-12K上的实验结果表明，现有的设计用于标准视角图像的功能预测方法在全景视觉的独特挑战面前表现严重下降并失败。相比之下，PAP框架有效地克服了这些障碍，显著优于最先进的基线方法，突显了全景感知在稳健的具身智能中的巨大潜力。

Summary / 总结

The research addresses the narrow field of view of existing affordance prediction methods by exploring panoramic imagery. The study introduces PAP-12K, a benchmark of over 1,000 ultra-high-resolution panoramic images with over 12k annotated QA pairs and affordance masks, and proposes PAP, a training-free, coarse-to-fine pipeline that uses grid prompting, an adaptive gaze mechanism, and cascaded grounding to handle the high resolution and distortion of panoramic images. Experiments on PAP-12K show that PAP outperforms existing methods, demonstrating the potential of panoramic perception for robust embodied intelligence.

本文通过引入全景 affordance 预测，解决了现有方法视野狭窄的问题，使用了 360 度图像。作者开发了包含超过 1,000 张超高清全景图像及其详细注释的 PAP-12K 数据集。他们还提出了 PAP，一种无需训练的管道，通过递归视觉路由和自适应注视机制来处理全景图像的挑战，实验结果表明 PAP 在 PAP-12K 数据集上的表现优于现有方法。

Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models

Authors: Lexiang Xiong, Qi Li, Jingwen Ye, Xinchao Wang

First: 2026-03-16T17:20:38+00:00 · Latest: 2026-03-16T17:20:38+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) frequently "hallucinate" - generate plausible yet factually incorrect statements - posing a critical barrier to their trustworthy deployment. In this work, we propose a new paradigm for diagnosing hallucinations, recasting them from static output errors into dynamic pathologies of a model's computational cognition. Our framework is grounded in a normative principle of computational rationality, allowing us to model a VLM's generation as a dynamic cognitive trajectory. We design a suite of information-theoretic probes that project this trajectory onto an interpretable, low-dimensional Cognitive State Space. Our central discovery is a governing principle we term the geometric-information duality: a cognitive trajectory's geometric abnormality within this space is fundamentally equivalent to its high information-theoretic surprisal. Hallucination detection thus becomes a geometric anomaly detection problem. Evaluated across diverse settings - from rigorous binary QA (POPE) and comprehensive reasoning (MME) to unconstrained open-ended captioning (MS-COCO) - our framework achieves state-of-the-art performance. Crucially, it operates with high efficiency under weak supervision and remains highly robust even when calibration data is heavily contaminated. This approach enables a causal attribution of failures, mapping observable errors to distinct pathological states: perceptual instability (measured by Perceptual Entropy), logical-causal failure (measured by Inferential Conflict), and decisional ambiguity (measured by Decision Entropy). Ultimately, this opens a path toward building AI systems whose reasoning is transparent, auditable, and diagnosable by design.

中文标题/摘要

标题：谎言剖析：视觉-语言模型中幻觉的多阶段诊断框架

视觉-语言模型（VLMs）经常“产生幻觉”——生成看似合理但实际上不正确的陈述，这构成了它们可靠部署的关键障碍。在本文中，我们提出了一种新的诊断幻觉的范式，将幻觉重新定义为模型计算认知动态病态。我们的框架基于计算理性规范原则，使我们能够将VLM的生成建模为动态认知轨迹。我们设计了一套信息论探针，将此轨迹投影到可解释的低维认知状态空间中。我们的主要发现是一种我们称之为几何-信息二元性的原则：此空间中认知轨迹的几何异常本质上等同于其高信息论意外性。幻觉检测被视为几何异常检测问题。在从严格的二元问答（POPE）和全面推理（MME）到不受限制的开放生成（MS-COCO）的各种场景中，我们的框架实现了最先进的性能。关键的是，它在弱监督下高效运行，并且即使校准数据严重污染时仍保持高度鲁棒性。这种方法使我们能够对失败进行因果归因，将可观察的错误映射到不同的病理状态：感知不稳定性（通过感知熵测量）、逻辑因果失败（通过推理冲突测量）和决策模糊性（通过决策熵测量）。最终，这为构建透明、可审计和可诊断的AI系统开辟了道路。

Summary / 总结

The research aims to address the issue of hallucinations in Vision-Language Models (VLMs), which generate plausible but factually incorrect statements. The authors propose a multi-stage diagnostic framework that models hallucinations as dynamic pathologies in the model's computational cognition. By using information-theoretic probes, the framework projects the model's cognitive trajectory into a low-dimensional space, identifying a geometric-information duality where abnormal geometric patterns correspond to high information-theoretic surprisal. The framework demonstrates state-of-the-art performance across various settings, including QA, reasoning, and open-ended captioning, and is robust under weak supervision and contaminated calibration data, enabling causal attribution of failures to specific pathological states such as perceptual instability, logical-causal failure, and decisional ambiguity.

研究旨在解决视觉-语言模型（VLMs）生成合理但事实错误的陈述问题。作者提出了一种多阶段诊断框架，将幻觉视为模型计算认知中的动态病态。通过使用信息论探针，该框架将模型的认知轨迹投影到低维空间中，发现几何信息二元性，即几何异常模式对应于高信息论惊讶度。该框架在各种场景下，包括问答、推理和开放生成描述中表现出最先进的性能，并且在弱监督和污染校准数据下仍然稳健，能够将可观察的错误归因于特定的病理状态，如感知不稳定性、逻辑因果失败和决策模糊性。
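
The abstract above refers to entropy and surprisal as the quantities its probes measure. As a reminder of the standard information-theoretic definitions only, here is a minimal sketch. It is not the paper's Perceptual Entropy, Inferential Conflict, or Decision Entropy probes, whose definitions are not given in the abstract. The function names and the example distributions are assumptions made for illustration.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def surprisal(p):
    """Surprisal in bits of an outcome that has probability p."""
    return -math.log2(p)

if __name__ == "__main__":
    # Hypothetical next-token distributions over four candidate tokens.
    confident = [0.97, 0.01, 0.01, 0.01]
    uncertain = [0.25, 0.25, 0.25, 0.25]
    print("entropy (confident):", shannon_entropy(confident))
    print("entropy (uncertain):", shannon_entropy(uncertain))
    print("surprisal of a p=0.01 token:", surprisal(0.01))
```

A uniform distribution over four tokens has an entropy of exactly 2 bits, the maximum for four outcomes, while a peaked distribution has lower entropy. A low-probability token carries high surprisal.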

Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

Authors: Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji

First: 2025-06-07T02:41:54+00:00 · Latest: 2026-03-16T17:16:37+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code can be found on https://github.com/divelab/E2H-Reasoning.

中文标题/摘要

标题：从易到难任务强化学习提高LLM推理能力

我们旨在通过强化学习（RL）提高语言模型的推理能力。最近的RL后训练模型如DeepSeek-R1在数学和编程任务上展示了推理能力。然而，先前的研究表明，仅使用RL来提高难以推理任务的推理能力效果较差。在这里，我们借鉴了课程学习的理念，提出从易到难（E2H）调度任务，使LLMs逐步建立推理技能。我们的方法称为E2H推理器。实证上，我们观察到，虽然初始阶段容易的任务很重要，但通过适当的调度逐渐淡化它们对于防止过拟合是必要的。理论上，我们在一个近似策略迭代框架内为E2H推理器建立了收敛保证。我们推导出有限样本复杂性界，并表明当任务适当分解和条件化时，通过课程阶段学习所需的总样本数少于直接学习。跨多个领域的实验表明，E2H推理器显著提高了小规模LLM（1.5B到3B）的推理能力，这些模型在仅使用标准RL训练时会遇到困难，突显了我们方法的有效性。我们的代码可以在https://github.com/divelab/E2H-Reasoning找到。

Summary / 总结

The study aims to enhance the reasoning abilities of language models through reinforcement learning (RL) with a curriculum learning approach, termed E2H Reasoner, which schedules tasks from easy to hard. Empirically, the authors observe that easy tasks matter early in training but that fading them out through appropriate scheduling is essential to prevent overfitting. Theoretically, they establish convergence guarantees and finite-sample complexity bounds. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small language models (1.5B to 3B parameters), which otherwise struggle when trained with vanilla RL alone.

研究旨在通过强化学习（RL）结合逐级任务难度的方法提升语言模型的推理能力。提出的E2H Reasoner方法表明，虽然初始阶段需要使用简单任务，但适时减少简单任务的使用可以防止过拟合。理论分析提供了收敛保证和有限样本复杂性界，表明通过逐级学习所需样本量少于直接学习。跨多个领域的实验表明，E2H Reasoner显著提高了小规模语言模型（1.5B到3B参数）在使用RL训练时的推理能力，突显了该方法的有效性。
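
To illustrate the easy-to-hard scheduling idea summarized above, the following is a minimal sketch and not the authors' implementation. It assumes three hypothetical difficulty levels and a hypothetical Gaussian-shaped sampling weight whose center slides from the easiest to the hardest level over training, so that easy tasks dominate at first and then fade out. The function name `e2h_sampling_probs` and the `width` parameter are assumptions made for this example.

```python
import math

def e2h_sampling_probs(step, total_steps, num_levels=3, width=0.5):
    """Return sampling probabilities over difficulty levels (0 = easiest).

    Hypothetical schedule: a Gaussian-shaped weight whose center moves
    linearly from the easiest to the hardest level as training progresses.
    """
    progress = step / max(total_steps - 1, 1)   # 0.0 at the start, 1.0 at the end
    center = progress * (num_levels - 1)        # difficulty currently emphasized
    weights = [
        math.exp(-((level - center) ** 2) / (2 * width ** 2))
        for level in range(num_levels)
    ]
    total = sum(weights)
    return [w / total for w in weights]         # normalize to a distribution

if __name__ == "__main__":
    for step in (0, 50, 99):
        probs = e2h_sampling_probs(step, total_steps=100)
        print(step, [round(p, 3) for p in probs])
```

With this schedule, the easiest level has the highest sampling probability at step 0 and a near-zero probability at the final step. A smaller `width` makes the fade-out sharper.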

Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation

Authors: Yanick Zengaffinen, Andreas Opedal, Donya Rooein, Kv Aditya Srivatsa, Shashank Sonkar, Mrinmaya Sachan

First: 2026-03-16T17:09:41+00:00 · Latest: 2026-03-16T17:09:41+00:00

Abs · PDF · Code1 · Code2

Abstract

Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for analyzing the strategies used by state-of-the-art LLMs, examining their reasoning procedures and comparing them to established best practices in the learning sciences. Our structured analysis reveals a surprising alignment between their processes and best practices: the models typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. An analysis of failure modes reveals that errors arise primarily from failures in recovering the correct solution and selecting among response candidates, rather than simulating errors or structuring the process. Consistent with these results, we find that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, our analysis offers a structured and interpretable lens into LLMs' ability to model incorrect student reasoning and produce high-quality distractors.

中文标题/摘要

标题：大语言模型能否模拟学生的错误推理？一项关于生成干扰项的案例研究

模拟合理的学生误解对于教育中的AI至关重要。在本研究中，我们探讨了大语言模型（LLMs）在生成多项选择题干扰项时如何推理关于误解的情况，这一任务需要协调解题知识、模拟学生误解并评估合理性。我们引入了一种分析策略的分类法，检查了最先进的LLMs的推理过程，并将其与学习科学中的既定最佳实践进行了比较。我们的结构化分析揭示了一个令人惊讶的契合：模型通常首先正确解决问题，然后阐述和模拟多种潜在的误解，最后选择一组干扰项。对失败模式的分析表明，错误主要源自未能恢复正确的解题过程和选择响应候选项，而不是模拟错误或组织过程。与这些结果一致，我们发现，在提示中提供正确的解题过程可以将生成的干扰项与人类编写的干扰项的匹配度提高8%，突显了在生成合理的错误学生推理时锚定正确解题过程的重要性。总体而言，我们的分析提供了一个结构化和可解释的视角，以了解LLMs模拟错误学生推理和生成高质量干扰项的能力。

Summary / 总结

This study investigates how large language models (LLMs) generate multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. The analysis shows that LLMs typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. Errors arise mainly from failures to recover the correct solution and to select among response candidates, rather than from simulating errors or structuring the process. Providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, underscoring the importance of anchoring to the correct solution when generating plausible incorrect student reasoning.

研究探讨了大型语言模型（LLMs）生成多项选择题干扰项的方法，这需要模型能够生成错误但合理的答案。通过分析最先进的LLMs的策略并将其与学习科学中的最佳实践进行比较，研究发现这些模型通常会先正确解决问题，然后模拟多种潜在的误解，最后选择干扰项。分析还显示，错误主要来源于未能恢复正确的解和选择答案候选项。在提示中提供正确的解可以将与人工编写的干扰项的匹配度提高8%，强调了生成合理的错误学生推理时锚定正确解的重要性。

Kimodo: Scaling Controllable Human Motion Generation

Authors: Davis Rempe, Mathis Petrovich, Ye Yuan, Haotian Zhang, Xue Bin Peng, Yifeng Jiang, Tingwu Wang, Umar Iqbal, David Minor, Michael de Ruyter, Jiefeng Li, Chen Tessler, Edy Lim, Eugene Jeong, Sam Wu, Ehsan Hassani, Michael Huang, Jin-Bey Yu, Chaeyeon Chung, Lina Song, Olivier Dionne, Jan Kautz, Simon Yuen, Sanja Fidler

First: 2026-03-16T17:09:30+00:00 · Latest: 2026-03-16T17:09:30+00:00

Comments: Project page: https://research.nvidia.com/labs/sil/projects/kimodo/

Abs · PDF · Code1 · Code2 · Project1

Abstract

High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled through a carefully designed motion representation and two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts while allowing for flexible constraint conditioning. Experiments on the large-scale mocap dataset justify key design decisions and analyze how the scaling of dataset size and model size affect performance.

中文标题/摘要

标题：Kimodo：扩展可控人体运动生成

高质量的人体运动数据在机器人技术、模拟和娱乐应用中变得越来越重要。最近的生成模型提供了一种潜在的数据来源，通过文本提示或姿态的运动学约束等直观输入实现人体运动的合成。然而，公共动捕数据集规模较小限制了这些模型的运动质量、控制精度和泛化能力。在本文中，我们介绍了Kimodo，一种基于700小时光学动捕数据训练的表达性和可控性的运动扩散模型。我们的模型在通过文本和一系列完整的运动学约束（包括全身关键帧、稀疏关节位置/旋转、2D航点和密集2D路径）进行控制的同时，生成高质量的运动。这得益于精心设计的运动表示和两阶段去噪架构，该架构将根部和身体预测分解以最小化运动伪影，同时允许灵活的约束条件调整。大规模动捕数据集上的实验验证了关键设计决策，并分析了数据集规模和模型规模对性能的影响。

Summary / 总结

Kimodo is a kinematic motion diffusion model trained on 700 hours of optical motion capture data to generate high-quality, controllable human motions. It uses a two-stage denoiser architecture that decomposes root and body prediction to minimize motion artifacts, and it can be controlled through text and a range of kinematic constraints. Experiments on the large-scale mocap dataset justify key design choices and analyze how dataset size and model size affect performance.

Kimodo 是一个训练在 700 小时光学动作捕捉数据上的动力学运动扩散模型，用于生成高质量且可控的人类动作。它使用两阶段去噪架构来最小化动作伪影，并允许通过文本和各种动力学约束进行控制。实验表明，增加数据集和模型规模可以提高性能并验证关键设计选择。

InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems

Authors: Shaojie Shi, Zhengyu Shi, Lingran Zheng, Xinyu Su, Anna Xie, Bohao Lv, Rui Xu, Zijian Chen, Zhichao Chen, Guolei Liu, Naifu Zhang, Mingjian Dong, Zhuo Quan, Bohao Chen, Teqi Hao, Yuan Qi, Yinghui Xu, Libo Wu

First: 2026-03-16T17:06:37+00:00 · Latest: 2026-03-16T17:06:37+00:00

Comments: 35 pages, 3 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Causal inference in social science relies on end-to-end, intervention-centered research-design reasoning grounded in real-world policy interventions, but current benchmarks fail to evaluate this capability of large language models (LLMs). We present InterveneBench, a benchmark designed to assess such reasoning in realistic social settings. Each instance in InterveneBench is derived from an empirical social science study and requires models to reason about policy interventions and identification assumptions without access to predefined causal graphs or structural equations. InterveneBench comprises 744 peer-reviewed studies across diverse policy domains. Experimental results show that state-of-the-art LLMs struggle under this setting. To address this limitation, we further propose a multi-agent framework, STRIDES. It achieves significant performance improvements over state-of-the-art reasoning models. Our code and data are available at https://github.com/Sii-yuning/STRIDES.

中文标题/摘要

标题：InterveneBench：评估大型语言模型在实际社会系统中干预推理和因果研究设计的能力

社会科学中的因果推断依赖于基于实际政策干预的端到端、干预中心的研究设计推理，但当前基准未能评估大型语言模型（LLMs）的这种能力。我们提出了InterveneBench，旨在评估这种推理在现实社会环境中的表现。InterveneBench中的每个实例均源自实证社会科学研究，并要求模型在无预定义因果图或结构方程的情况下推理政策干预和识别假设。InterveneBench包含来自不同政策领域的744篇同行评审研究。实验结果显示，最先进的LLMs在这一环境中表现不佳。为解决这一局限，我们进一步提出了一种多智能体框架STRIDES。STRIDES在最先进的推理模型上实现了显著的性能提升。我们的代码和数据可在https://github.com/Sii-yuning/STRIDES获取。

Summary / 总结

InterveneBench is a benchmark designed to evaluate large language models' ability to reason about policy interventions and causal study design in real social systems. It consists of 744 empirical studies across various policy domains, requiring models to reason without predefined causal graphs. State-of-the-art LLMs perform poorly in this setting, but STRIDES, a multi-agent framework, shows significant performance improvements.

InterveneBench 是一个基准，旨在评估大型语言模型在现实社会系统中进行干预和因果研究设计推理的能力，当前基准未能涵盖这一点。该基准包含来自不同政策领域的744篇同行评审研究，要求模型在没有预定义因果图的情况下推理政策干预和识别假设。最先进的LLM在这一环境中表现不佳，但提出的STRIDES多智能体框架显示出显著的改进。
