arXiv 论文速递

Snapshot: 20260211_0405

WorldCompass: Reinforcement Learning for Long-Horizon World Models

Authors: Zehan Wang, Tengfei Wang, Haiyu Zhang, Xuhui Zuo, Junta Wu, Haoyuan Wang, Wenqiang Sun, Zhenwei Wang, Chenjie Cao, Hengshuang Zhao, Chunchao Guo, Zhou Zhao

First: 2026-02-09T18:59:47+00:00 · Latest: 2026-02-09T18:59:47+00:00

Comments: Project page: https://3d-models.hunyuan.tencent.com/world/

Abs · PDF · Code1 · Code2

Abstract

This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for the long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-level rollout Strategy: We generate and evaluate multiple samples at a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward functions for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ the negative-aware fine-tuning strategy coupled with various efficiency optimizations to efficiently and effectively enhance model capacity. Evaluations on the SoTA open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.

中文标题/摘要

标题：WorldCompass：长时程交互视频世界模型的强化学习后训练框架

本文介绍了WorldCompass，这是一种新颖的强化学习（RL）后训练框架，用于长时程、交互式的视频基世界模型，使它们能够基于交互信号更准确且一致地探索世界。为了有效“引导”世界模型的探索，我们针对自回归视频生成范式引入了三项核心创新：1）片段级回放策略：我们在单个目标片段上生成并评估多个样本，这显著提高了回放效率并提供了精细的奖励信号。2）互补奖励函数：我们设计了奖励函数，既考虑交互跟随的准确性也考虑视觉质量，这提供了直接监督并有效抑制了奖励作弊行为。3）高效的RL算法：我们采用了负向意识微调策略并结合了各种效率优化，以高效且有效地增强模型能力。在当前最先进的开源世界模型WorldPlay上的评估表明，WorldCompass在各种场景中显著提高了交互准确性和视觉保真度。

Summary / 总结

WorldCompass is a reinforcement learning (RL) post-training framework for long-horizon, interactive video-based world models. It introduces a clip-level rollout strategy, complementary reward functions for interaction-following accuracy and visual quality, and an efficient RL algorithm based on negative-aware fine-tuning. Evaluations on the state-of-the-art open-source world model, WorldPlay, show significant improvements in interaction accuracy and visual fidelity across different scenarios.

WorldCompass 是一种面向长时程交互式视频世界模型的强化学习后训练框架，通过引入片段级 rollout 策略、互补奖励函数和高效 RL 算法来提升探索准确性和视觉保真度。在最先进的开源世界模型 WorldPlay 上的评估显示，它在不同场景中显著提高了交互准确性和视觉质量。
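The clip-level rollout idea can be illustrated with a toy sketch: several candidates are sampled for one target clip from a shared prefix, each is scored by two complementary rewards, and the scores are normalized within the group. Everything below is an assumption made for illustration: the `sample_clip` stand-in for the world model, the two reward functions, summing the rewards, and the group normalization are not taken from the paper, and the negative-aware fine-tuning update itself is not shown.

```python
import random
from statistics import mean, pstdev
from typing import Callable, Dict, List, Sequence, Tuple

Clip = Dict[str, float]
Reward = Callable[[Clip, float], float]


def sample_clip(prefix: List[Clip], action: float, rng: random.Random) -> Clip:
    """Hypothetical stand-in for the world model: draw one candidate next clip."""
    return {"motion": action + rng.gauss(0.0, 0.3), "sharpness": rng.random()}


def interaction_reward(clip: Clip, action: float) -> float:
    """Toy interaction-following reward: closer to the commanded motion is better."""
    return -abs(clip["motion"] - action)


def quality_reward(clip: Clip, action: float) -> float:
    """Toy visual-quality reward."""
    return clip["sharpness"]


def clip_level_rollout(
    prefix: List[Clip],
    action: float,
    rewards: Sequence[Reward],
    num_samples: int,
    rng: random.Random,
) -> Tuple[List[Clip], List[float]]:
    """Sample several candidates for ONE target clip and score them as a group."""
    candidates = [sample_clip(prefix, action, rng) for _ in range(num_samples)]
    # Assumption: the complementary rewards are simply summed.
    scores = [sum(r(c, action) for r in rewards) for c in candidates]
    mu, sigma = mean(scores), pstdev(scores)
    # Assumption: group-normalized advantages, a common choice in RL post-training.
    advantages = [(s - mu) / (sigma + 1e-8) for s in scores]
    return candidates, advantages


rng = random.Random(0)
candidates, advantages = clip_level_rollout(
    [], 1.0, [interaction_reward, quality_reward], 8, rng
)
best = max(range(len(candidates)), key=advantages.__getitem__)
print(f"best candidate: {best}, advantage: {advantages[best]:.2f}")
```

Because all candidates share the same prefix, only the target clip is regenerated per sample, which is the efficiency argument the abstract makes for clip-level rollouts.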

$χ_{0}$: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies

Authors: Checheng Yu, Chonghao Sima, Gangcheng Jiang, Hai Zhang, Haoguang Mai, Hongyang Li, Huijie Wang, Jin Chen, Kaiyang Wu, Li Chen, Lirui Zhao, Modi Shi, Ping Luo, Qingwen Bu, Shijia Peng, Tianyu Li, Yibo Yuan

First: 2026-02-09T18:59:45+00:00 · Latest: 2026-02-09T18:59:45+00:00

Abs · PDF · Code1 · Code2

Abstract

High-reliability long-horizon robotic manipulation has traditionally relied on large-scale data and compute to understand complex real-world dynamics. However, we identify that the primary bottleneck to real-world robustness is not resource scale alone, but the distributional shift among the human demonstration distribution, the inductive bias learned by the policy, and the test-time execution distribution -- a systematic inconsistency that causes compounding errors in multi-stage tasks. To mitigate these inconsistencies, we propose $χ_{0}$, a resource-efficient framework with effective modules designated to achieve production-level robustness in robotic manipulation. Our approach builds off three technical pillars: (i) Model Arithmetic, a weight-space merging strategy that efficiently soaks up diverse distributions of different demonstrations, varying from object appearance to state variations; (ii) Stage Advantage, a stage-aware advantage estimator that provides stable, dense progress signals, overcoming the numerical instability of prior non-stage approaches; and (iii) Train-Deploy Alignment, which bridges the distribution gap via spatio-temporal augmentation, heuristic DAgger corrections, and temporal chunk-wise smoothing. $χ_{0}$ enables two sets of dual-arm robots to collaboratively orchestrate long-horizon garment manipulation, spanning tasks from flattening, folding, to hanging different clothes. Our method exhibits high-reliability autonomy; we are able to run the system from arbitrary initial state for consecutive 24 hours non-stop. Experiments validate that $χ_{0}$ surpasses the state-of-the-art $π_{0.5}$ in success rate by nearly 250%, with only 20-hour data and 8 A100 GPUs. Code, data and models will be released to facilitate the community.

中文标题/摘要

标题：$χ_{0}$：通过驯服分布不一致性实现资源感知的鲁棒操作

高可靠性的长期机器人操作传统上依赖大规模数据和计算来理解复杂的现实世界动力学。然而，我们发现现实世界鲁棒性的主要瓶颈不仅在于资源规模，还在于人类演示分布、策略学习的归纳偏见和测试时执行分布之间的分布偏移——这是一种系统性不一致性，导致多阶段任务中的累积错误。为了缓解这些不一致性，我们提出了$χ_{0}$，一种资源高效框架，具有专门设计的有效模块，以实现机器人操作的生产级鲁棒性。我们的方法基于三个技术支柱：(i) 模型算术，一种权重空间合并策略，能够高效地吸收不同演示的多样化分布，从物体外观到状态变化；(ii) 阶段优势，一种阶段感知的优势估计器，提供稳定、密集的进步信号，克服了之前非阶段方法的数值不稳定性；(iii) 训练部署对齐，通过时空增强、启发式DAgger修正和时间片段平滑来弥合分布差距。$χ_{0}$ 使两台双臂机器人能够协作执行长期的服装操作，涵盖从平整、折叠到挂不同衣物的任务。我们的方法展示了高可靠性的自主性；我们能够从任意初始状态连续运行系统24小时不间断。实验验证了$χ_{0}$ 在成功率上比最先进的$π_{0.5}$ 高出近250%，仅使用20小时数据和8个A100 GPU。代码、数据和模型将被发布以促进社区的发展。

Summary / 总结

The paper argues that the main obstacle to reliable long-horizon robotic manipulation is the distributional inconsistency among human demonstrations, the policy's learned inductive bias, and test-time execution. The proposed $χ_{0}$ framework addresses it with three techniques: Model Arithmetic (weight-space merging), Stage Advantage (a stage-aware advantage estimator), and Train-Deploy Alignment. Experiments show that $χ_{0}$ surpasses the state-of-the-art $π_{0.5}$ in success rate by nearly 250%, using only 20 hours of data and 8 A100 GPUs.

论文针对高可靠性长时程机器人操作面临的分布不一致问题，提出了$χ_{0}$框架，该框架通过Model Arithmetic、Stage Advantage和Train-Deploy Alignment三个技术支柱来缓解这些不一致。该方法使两组双臂机器人能够协作完成长时程服装操作任务，展示了高可靠性的自主运行能力，并在仅使用20小时数据和8块A100 GPU的条件下，成功率比最先进的$π_{0.5}$高出近250%。
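The abstract describes Model Arithmetic only as a weight-space merging strategy and does not give the merging rule. As a minimal sketch of what weight-space merging means in general, the example below takes a coefficient-weighted sum of checkpoints that share one architecture. The function name `merge_checkpoints`, the parameter names, and the equal coefficients are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]


def merge_checkpoints(checkpoints: Sequence[Weights], coeffs: Sequence[float]) -> Weights:
    """Return the coefficient-weighted sum of checkpoints with identical parameter shapes."""
    if not checkpoints or len(checkpoints) != len(coeffs):
        raise ValueError("need at least one checkpoint and one coefficient per checkpoint")
    merged: Weights = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(c * ckpt[name][i] for c, ckpt in zip(coeffs, checkpoints))
            for i in range(len(reference))
        ]
    return merged


# Two hypothetical policies fine-tuned on different demonstration subsets.
policy_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
policy_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = merge_checkpoints([policy_a, policy_b], [0.5, 0.5])
print(merged)  # {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```

Merging happens once, offline, so the merged policy costs no more at inference time than a single checkpoint.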

Robustness Is a Function, Not a Number: A Factorized Comprehensive Study of OOD Robustness in Vision-Based Driving

Authors: Amir Mallak, Alaa Maalouf

First: 2026-02-09T18:59:03+00:00 · Latest: 2026-02-09T18:59:03+00:00

Abs · PDF · Code1 · Code2

Abstract

Out of distribution (OOD) robustness in autonomous driving is often reduced to a single number, hiding what breaks a policy. We decompose environments along five axes: scene (rural/urban), season, weather, time (day/night), and agent mix; and measure performance under controlled $k$-factor perturbations ($k \in \{0,1,2,3\}$). Using closed loop control in VISTA, we benchmark FC, CNN, and ViT policies, train compact ViT heads on frozen foundation-model (FM) features, and vary ID support in scale, diversity, and temporal context. (1) ViT policies are markedly more OOD-robust than comparably sized CNN/FC, and FM features yield state-of-the-art success at a latency cost. (2) Naive temporal inputs (multi-frame) do not beat the best single-frame baseline. (3) The largest single factor drops are rural $\rightarrow$ urban and day $\rightarrow$ night ($\sim 31\%$ each); actor swaps $\sim 10\%$, moderate rain $\sim 7\%$; season shifts can be drastic, and combining a time flip with other changes further degrades performance. (4) FM-feature policies stay above $85\%$ under three simultaneous changes; non-FM single-frame policies take a large first-shift hit, and all no-FM models fall below $50\%$ by three changes. (5) Interactions are non-additive: some pairings partially offset, whereas season-time combinations are especially harmful. (6) Training on winter/snow is most robust to single-factor shifts, while a rural+summer baseline gives the best overall OOD performance. (7) Scaling traces/views improves robustness ($+11.8$ points from $5$ to $14$ traces), yet targeted exposure to hard conditions can substitute for scale. (8) Using multiple ID environments broadens coverage and strengthens weak cases (urban OOD $60.6\% \rightarrow 70.1\%$) with a small ID drop; single-ID preserves peak performance but in a narrow domain. These results yield actionable design rules for OOD-robust driving policies.

Summary / 总结

The study examines out-of-distribution (OOD) robustness in autonomous driving by decomposing environments along five axes: scene, season, weather, time, and agent mix. Using closed-loop control in VISTA, it benchmarks FC, CNN, and ViT policies and finds that ViT policies are markedly more OOD-robust than comparably sized CNN/FC policies, with foundation-model features giving the best success rates at a latency cost. Rural-to-urban and day-to-night shifts cause the largest single-factor drops (about 31% each), while training on winter/snow is most robust to single-factor shifts. Naive multi-frame temporal inputs do not beat the best single-frame baseline, and training on multiple ID environments broadens coverage (urban OOD 60.6% to 70.1%) at a small ID cost.

研究旨在通过将环境分解为五个因素（场景、季节、天气、时间、交通参与者组合）来理解自动驾驶中的分布外（OOD）鲁棒性。研究人员在VISTA中以闭环控制对FC、CNN和ViT策略进行了基准测试，发现ViT策略的OOD鲁棒性明显优于规模相当的CNN/FC策略。最大的单因素性能下降出现在从乡村到城市以及从白天到夜晚的变化中（各约31%）。研究还指出，增加轨迹/视角数量可以提高鲁棒性，但有针对性地接触困难条件可以替代规模扩展。总体而言，研究为开发OOD鲁棒的驾驶策略提供了可操作的设计规则。
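The controlled $k$-factor perturbation design can be made concrete by enumerating which subsets of the five axes are shifted away from the in-distribution (ID) environment. The sketch below assumes, for simplicity, one alternative value per axis. The concrete values (`"dry"`, `"cars"`, `"animals"`, and so on) are hypothetical placeholders, not the paper's actual settings.

```python
from itertools import combinations
from typing import Dict, List

AXES = ("scene", "season", "weather", "time", "agent_mix")


def k_factor_shifts(
    id_env: Dict[str, str], ood_values: Dict[str, str], k: int
) -> List[Dict[str, str]]:
    """Return every environment that differs from id_env on exactly k axes."""
    envs = []
    for shifted_axes in combinations(AXES, k):
        env = dict(id_env)
        for axis in shifted_axes:
            env[axis] = ood_values[axis]
        envs.append(env)
    return envs


# Hypothetical ID environment and one OOD alternative per axis.
id_env = {"scene": "rural", "season": "summer", "weather": "dry",
          "time": "day", "agent_mix": "cars"}
ood_values = {"scene": "urban", "season": "winter", "weather": "rain",
              "time": "night", "agent_mix": "animals"}

counts = {k: len(k_factor_shifts(id_env, ood_values, k)) for k in range(4)}
print(counts)  # {0: 1, 1: 5, 2: 10, 3: 10}
```

With five axes, $k \in \{0,1,2,3\}$ gives 1, 5, 10, and 10 axis subsets respectively, which is why reporting robustness as a function of the shifted factors is more informative than a single averaged number.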

Contact-Anchored Policies: Contact Conditioning Creates Strong Robot Utility Models

Authors: Zichen Jeff Cui, Omar Rayyan, Haritheja Etukuru, Bowen Tan, Zavier Andrianarivo, Zicheng Teng, Yihang Zhou, Krish Mehta, Nicholas Wojno, Kevin Yuanbo Wu, Manan H Anjaria, Ziyuan Wu, Manrong Mao, Guangxun Zhang, Binit Shah, Yejin Kim, Soumith Chintala, Lerrel Pinto, Nur Muhammad Mahi Shafiullah

First: 2026-02-09T18:58:50+00:00 · Latest: 2026-02-09T18:58:50+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

The prevalent paradigm in robot learning attempts to generalize across environments, embodiments, and tasks with language prompts at runtime. A fundamental tension limits this approach: language is often too abstract to guide the concrete physical understanding required for robust manipulation. In this work, we introduce Contact-Anchored Policies (CAP), which replace language conditioning with points of physical contact in space. Simultaneously, we structure CAP as a library of modular utility models rather than a monolithic generalist policy. This factorization allows us to implement a real-to-sim iteration cycle: we build EgoGym, a lightweight simulation benchmark, to rapidly identify failure modes and refine our models and datasets prior to real-world deployment. We show that by conditioning on contact and iterating via simulation, CAP generalizes to novel environments and embodiments out of the box on three fundamental manipulation skills while using only 23 hours of demonstration data, and outperforms large, state-of-the-art VLAs in zero-shot evaluations by 56%. All model checkpoints, codebase, hardware, simulation, and datasets will be open-sourced. Project page: https://cap-policy.github.io/

Summary / 总结

This work argues that language is often too abstract to guide robust manipulation and introduces Contact-Anchored Policies (CAP), which condition on points of physical contact in space instead of language prompts. CAP is structured as a library of modular utility models and refined through a real-to-sim iteration cycle built on EgoGym, a lightweight simulation benchmark. Using only 23 hours of demonstration data, CAP generalizes to novel environments and embodiments on three fundamental manipulation skills and outperforms large state-of-the-art vision-language-action models (VLAs) by 56% in zero-shot evaluations.

该研究指出语言往往过于抽象，难以指导鲁棒的操作，因此提出了接触锚定策略（CAP），以空间中的物理接触点代替语言提示作为条件。该方法采用模块化的效用模型库结构，并借助轻量级仿真基准EgoGym进行“真实到仿真”的迭代循环。CAP仅使用23小时的演示数据，就能在三项基本操作技能上泛化到新环境和新的机器人形态，并在零样本评估中比大型的最先进视觉-语言-动作模型（VLA）高出56%。

ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation

Authors: Zihan Yang, Shuyuan Tu, Licheng Zhang, Qi Dai, Yu-Gang Jiang, Zuxuan Wu

First: 2026-02-09T18:56:14+00:00 · Latest: 2026-02-09T18:56:14+00:00

Abs · PDF · Code1 · Code2

Abstract

Diffusion models have achieved remarkable generation quality, but they suffer from significant inference cost due to their reliance on multiple sequential denoising steps, motivating recent efforts to distill this inference process into a few-step regime. However, existing distillation methods typically approximate the teacher trajectory by using linear shortcuts, which makes it difficult to match its constantly changing tangent directions as velocities evolve across timesteps, thereby leading to quality degradation. To address this limitation, we propose ArcFlow, a few-step distillation framework that explicitly employs non-linear flow trajectories to approximate pre-trained teacher trajectories. Concretely, ArcFlow parameterizes the velocity field underlying the inference trajectory as a mixture of continuous momentum processes. This enables ArcFlow to capture velocity evolution and extrapolate coherent velocities to form a continuous non-linear trajectory within each denoising step. Importantly, this parameterization admits an analytical integration of this non-linear trajectory, which circumvents numerical discretization errors and results in high-precision approximation of the teacher trajectory. To train this parameterization into a few-step generator, we implement ArcFlow via trajectory distillation on pre-trained teacher models using lightweight adapters. This strategy ensures fast, stable convergence while preserving generative diversity and quality. Built on large-scale models (Qwen-Image-20B and FLUX.1-dev), ArcFlow only fine-tunes on less than 5% of original parameters and achieves a 40x speedup with 2 NFEs over the original multi-step teachers without significant quality degradation. Experiments on benchmarks show the effectiveness of ArcFlow both qualitatively and quantitatively.

中文标题/摘要

标题：ArcFlow：通过高精度非线性流蒸馏实现两步文本到图像生成

扩散模型在生成质量方面取得了显著成就，但由于其依赖于多个顺序去噪步骤，导致了显著的推理成本，促使最近的研究努力将此推理过程简化为几步。然而，现有的蒸馏方法通常通过使用线性捷径来近似教师轨迹，这使得难以匹配其随时间步变化不断变化的切线方向，从而导致质量下降。为了解决这一局限性，我们提出了ArcFlow，这是一种显式采用非线性流轨迹来近似预训练教师轨迹的几步蒸馏框架。具体而言，ArcFlow 将推理轨迹下的速度场参数化为连续动量过程的混合。这使ArcFlow能够捕捉速度演变并外推一致的速度，以在每个去噪步骤内形成连续的非线性轨迹。重要的是，这种参数化允许对这种非线性轨迹进行解析积分，从而避免了数值离散化误差，并实现了对教师轨迹的高精度近似。为了将此参数化训练成几步生成器，我们通过预训练教师模型使用轻量级适配器实现ArcFlow的轨迹蒸馏。这种策略确保了快速、稳定的收敛，同时保持了生成多样性和质量。基于大规模模型（Qwen-Image-20B 和 FLUX.1-dev），ArcFlow 只微调了原始参数的不到5%，实现了比原始多步教师40倍的速度提升，且NFEs为2，而没有显著的质量下降。基准实验表明，ArcFlow 在定性和定量上都表现出有效性。

Summary / 总结

ArcFlow is a few-step distillation framework for text-to-image generation that uses non-linear flow trajectories to approximate the trajectories of pre-trained diffusion teachers, addressing the quality degradation caused by linear shortcuts. It parameterizes the velocity field as a mixture of continuous momentum processes, which captures velocity evolution within each denoising step and admits analytical integration. Fine-tuning less than 5% of the original parameters, ArcFlow achieves a 40x speedup over the multi-step teachers with 2 NFEs (number of function evaluations) and no significant quality loss on benchmarks.

ArcFlow 是一种用于文本到图像生成的少步蒸馏框架，通过非线性流轨迹来近似预训练扩散教师模型的轨迹，解决了由线性捷径引起的质量下降问题。它将速度场参数化为连续动量过程的混合，能够捕捉每个去噪步骤内的速度演变，并支持解析积分。ArcFlow 仅微调不到 5% 的原始参数，以 2 次函数评估（NFE）实现了相对多步教师模型 40 倍的加速，且在基准实验中没有显著的质量损失。

A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?

Authors: Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy

First: 2025-11-07T18:38:54+00:00 · Latest: 2026-02-09T18:56:14+00:00

Comments: This paper is a revised version of a manuscript currently under revision at the Journal of Systems and Software

Abs · PDF · Code1 · Code2

Abstract

Transformer-based language models of code have achieved state-of-the-art performance across a wide range of software analytics tasks, but their practical deployment remains limited due to high computational costs, slow inference speeds, and significant environmental impact. To address these challenges, recent research has increasingly explored knowledge distillation as a method for compressing a large language model of code (the teacher) into a smaller model (the student) while maintaining performance. However, the degree to which a student model deeply mimics the predictive behavior and internal representations of its teacher remains largely unexplored, as current accuracy-based evaluation provides only a surface-level view of model quality and often fails to capture more profound discrepancies in behavioral fidelity between the teacher and student models. To address this gap, we empirically show that the student model often fails to deeply mimic the teacher model, resulting in up to 285% greater performance drop under adversarial attacks, which is not captured by traditional accuracy-based evaluation. Therefore, we propose MetaCompress, a metamorphic testing framework that systematically evaluates behavioral fidelity by comparing the outputs of teacher and student models under a set of behavior-preserving metamorphic relations. We evaluate MetaCompress on two widely studied tasks, using compressed versions of popular language models of code, obtained via three different knowledge distillation techniques: Compressor, AVATAR, and MORPH. The results show that MetaCompress identifies up to 62% behavioral discrepancies in student models, underscoring the need for behavioral fidelity evaluation within the knowledge distillation pipeline and establishing MetaCompress as a practical framework for testing compressed language models of code derived through knowledge distillation.

中文标题/摘要

标题：蜕变测试视角下的代码语言模型知识蒸馏：学生模型是否深度模仿了教师模型？

基于 Transformer 的代码语言模型在各类软件分析任务中取得了最先进的性能，但由于计算成本高、推理速度慢和环境影响大，其实际部署仍然受到限制。为了解决这些挑战，最近的研究越来越多地探索知识蒸馏，将大型代码语言模型（教师）压缩为较小模型（学生），同时保持性能。然而，学生模型在多大程度上深度模仿了教师模型的预测行为和内部表示，仍然鲜有研究：当前基于准确率的评估只能提供模型质量的表层视图，往往无法捕捉教师和学生模型之间行为忠实度的更深层差异。为了弥补这一空白，我们通过实验表明，学生模型往往未能深度模仿教师模型，导致其在对抗攻击下的性能下降幅度最多高出285%，而这在传统的基于准确率的评估中无法体现。因此，我们提出了MetaCompress，这是一种蜕变测试框架，通过在一组保持行为不变的蜕变关系下比较教师和学生模型的输出来系统地评估行为忠实度。我们在两个被广泛研究的任务上评估了MetaCompress，使用了通过三种不同的知识蒸馏技术（Compressor、AVATAR和MORPH）获得的流行代码语言模型的压缩版本。结果表明，MetaCompress在学生模型中识别出高达62%的行为差异，突显了在知识蒸馏流程中进行行为忠实度评估的必要性，并将MetaCompress确立为测试经知识蒸馏压缩的代码语言模型的实用框架。

Summary / 总结

This study investigates how closely a smaller student model mimics the predictive behavior and internal representations of a larger teacher model in knowledge distillation for language models of code. It finds that the student often fails to deeply mimic the teacher, suffering up to a 285% greater performance drop under adversarial attacks, which accuracy-based evaluation does not reveal. To address this, the authors propose MetaCompress, a metamorphic testing framework that evaluates behavioral fidelity by comparing teacher and student outputs under behavior-preserving metamorphic relations. Across three distillation techniques (Compressor, AVATAR, and MORPH), the framework identifies up to 62% behavioral discrepancies in student models, highlighting the need for behavioral fidelity evaluation in knowledge distillation pipelines.

该论文使用蜕变测试研究代码语言模型中经压缩的学生模型是否深度模仿了较大教师模型的行为和内部表示。研究发现，学生模型往往未能深度模仿教师模型，在对抗攻击下的性能下降幅度最多高出285%。提出的MetaCompress框架通过在保持行为不变的蜕变关系下比较教师和学生模型的输出来评估行为忠实度，并在学生模型中识别出高达62%的行为差异，说明在知识蒸馏流程中有必要进行行为忠实度评估。
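The idea of checking behavioral fidelity under a behavior-preserving metamorphic relation can be sketched as follows. This is not MetaCompress itself: the relation (identifier renaming), the discrepancy definition (the teacher's prediction survives the transformation but the student's does not), and both toy classifiers are assumptions chosen for illustration.

```python
import re
from typing import Callable, List

Model = Callable[[str], int]


def rename_identifier(code: str, old: str, new: str) -> str:
    """Behavior-preserving transformation: rename one identifier consistently."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def discrepancy_rate(
    teacher: Model, student: Model, programs: List[str], transform: Callable[[str], str]
) -> float:
    """Fraction of programs where the teacher is stable under the transform but the student is not."""
    broken = 0
    for src in programs:
        mutated = transform(src)
        teacher_stable = teacher(src) == teacher(mutated)
        student_stable = student(src) == student(mutated)
        if teacher_stable and not student_stable:
            broken += 1
    return broken / len(programs)


def toy_teacher(code: str) -> int:
    """Hypothetical teacher: flags any call to strcpy as vulnerable."""
    return int("strcpy(" in code)


def toy_student(code: str) -> int:
    """Hypothetical student that latched onto a variable name as a shortcut."""
    return int("strcpy(" in code and "buf" in code)


programs = ["strcpy(buf, src);", "strncpy(buf, src, n);", "strcpy(buf, name);"]
rate = discrepancy_rate(
    toy_teacher, toy_student, programs, lambda s: rename_identifier(s, "buf", "dst")
)
print(f"behavioral discrepancy rate: {rate:.2f}")
```

On the untransformed programs the two toy models agree on every prediction, so an accuracy comparison alone would not reveal the student's shortcut, which is the gap the paper targets.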

Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense

Authors: Jiacheng Liu, Yaxin Luo, Jiacheng Cui, Xinyi Shang, Xiaohan Zhao, Zhiqiang Shen

First: 2026-02-09T18:55:33+00:00 · Latest: 2026-02-09T18:55:33+00:00

Comments: Project page at https://greenoso.github.io/NextGen-CAPTCHAs_webpage/

Abs · PDF · Code1 · Code2 · Project1

Abstract

The rapid evolution of GUI-enabled agents has rendered traditional CAPTCHAs obsolete. While previous benchmarks like OpenCaptchaWorld established a baseline for evaluating multimodal agents, recent advancements in reasoning-heavy models, such as Gemini3-Pro-High and GPT-5.2-Xhigh, have effectively collapsed this security barrier, achieving pass rates as high as 90% on complex logic puzzles like "Bingo". In response, we introduce Next-Gen CAPTCHAs, a scalable defense framework designed to secure the next-generation web against advanced agents. Unlike static datasets, our benchmark is built upon a robust data generation pipeline, allowing for large-scale and easily scalable evaluations. Notably, for backend-supported types, our system is capable of generating effectively unbounded CAPTCHA instances. We exploit the persistent human-agent "Cognitive Gap" in interactive perception, memory, decision-making, and action. By engineering dynamic tasks that require adaptive intuition rather than granular planning, we re-establish a robust distinction between biological users and artificial agents, offering a scalable and diverse defense mechanism for the agentic era.

中文标题/摘要

标题：下一代CAPTCHA：利用认知差距实现可扩展和多样的GUI-代理防御

GUI使能代理的快速发展使传统CAPTCHA过时。虽然像OpenCaptchaWorld这样的基准测试为评估多模态代理奠定了基础，但最近的推理型模型，如Gemini3-Pro-High和GPT-5.2-Xhigh已经有效地消除了这一安全障碍，在复杂的逻辑谜题如“宾果”上达到了高达90%的通过率。为应对这一挑战，我们提出了下一代CAPTCHA，这是一种可扩展的防御框架，旨在保护下一代网络免受高级代理的攻击。与静态数据集不同，我们的基准测试基于强大的数据生成管道，允许大规模和易于扩展的评估，特别是对于后端支持的类型，我们的系统能够生成几乎无限的CAPTCHA实例。我们利用持续的人机“认知差距”在交互感知、记忆、决策和行动中的差异。通过设计需要适应性直觉而非细粒度规划的动态任务，我们重新建立了生物用户和人工代理之间的坚实区别，为代理时代提供了可扩展和多样的防御机制。

Summary / 总结

The paper addresses the need for new CAPTCHA systems due to the advancement of GUI-enabled agents that can bypass traditional CAPTCHAs. It introduces Next-Gen CAPTCHAs, a scalable defense framework that leverages a 'Cognitive Gap' between humans and artificial agents. The system generates dynamic tasks requiring adaptive intuition, which cannot be easily replicated by reasoning-heavy models like Gemini3-Pro-High and GPT-5.2-Xhigh, thus re-establishing security for the next-generation web.

论文针对GUI启用代理的进步导致传统CAPTCHA失效的问题，提出了Next-Gen CAPTCHAs，一种可扩展的防御框架，利用人类与人工代理之间的‘认知差距’。该系统生成需要适应性直觉的动态任务，这些任务难以让推理型模型解决，从而重新建立下一代网络的安全性。

ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling

Authors: Yilang Zhang, Bingcong Li, Niao He, Georgios B. Giannakis

First: 2026-02-09T18:54:18+00:00 · Latest: 2026-02-09T18:54:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Scaling network depth has been a central driver behind the success of modern foundation models, yet recent investigations suggest that deep layers are often underutilized. This paper revisits the default mechanism for deepening neural networks, namely residual connections, from an optimization perspective. Rigorous analysis proves that the layout of residual connections can fundamentally shape convergence behavior, and even induces an exponential gap in convergence rates. Prompted by this insight, we introduce adaptive neural connection reassignment (ANCRe), a principled and lightweight framework that parameterizes and learns residual connectivities from the data. ANCRe adaptively reassigns residual connections with negligible computational and memory overhead ($<1\%$), while enabling more effective utilization of network depth. Extensive numerical tests across pre-training of large language models, diffusion models, and deep ResNets demonstrate consistently accelerated convergence, boosted performance, and enhanced depth efficiency over conventional residual connections.

中文标题/摘要

标题：ANCRe: 自适应神经连接重分配以实现高效的深度缩放

网络深度的扩展一直是现代基础模型取得成功的核心驱动力，然而最近的研究表明，深层往往未被充分利用。本文从优化的角度重新审视了加深神经网络的默认机制，即残差连接。严格的分析证明，残差连接的布局可以从根本上影响收敛行为，甚至会导致收敛速率的指数级差距。基于这一洞察，我们提出了自适应神经连接重分配（ANCRe），这是一种有理论依据且轻量级的框架，能够对残差连接方式进行参数化并从数据中学习。ANCRe 以可忽略的计算和内存开销（<1%）自适应地重新分配残差连接，从而更有效地利用网络深度。在大语言模型预训练、扩散模型和深度 ResNet 上的大量数值实验表明，与传统残差连接相比，ANCRe 能够一致地加速收敛、提升性能并增强深度效率。

Summary / 总结

This paper addresses the underutilization of deep layers in neural networks by revisiting residual connections from an optimization perspective. It introduces ANCRe, a framework that adaptively reassigns residual connections to improve network efficiency. Experiments show that ANCRe accelerates convergence, enhances performance, and utilizes network depth more effectively compared to traditional residual connections in various models including large language models, diffusion models, and deep ResNets.

本文从优化角度重新审视了残差连接，旨在解决神经网络中深层利用率低的问题。提出了ANCRe框架，该框架能够自适应地重新分配残差连接以提高网络效率。实验结果表明，ANCRe在大语言模型、扩散模型和深度ResNet等模型中能够加速收敛、提升性能并提高深度利用率，同时计算和内存开销极小（<1%）。

Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey

Authors: Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, Shanglin Wu, Ruiyao Xu, Liangwei Yang, Rui Yang, Wooseong Yang, Chin-Yuan Yeh, Hanrong Zhang, Haozhen Zhang, Siqi Zhu, Henry Peng Zou, Wanjia Zhao, Song Wang, Wujiang Xu, Zixuan Ke, Zheng Hui, Dawei Li, Yaozu Wu, Langzhou He, Chen Wang, Xiongxiao Xu, Baixiang Huang, Juntao Tan, Shelby Heinecke, Huan Wang, Caiming Xiong, Ahmed A. Metwally, Jun Yan, Chen-Yu Lee, Hanqing Zeng, Yinglong Xia, Xiaokai Wei, Ali Payani, Yu Wang, Haitong Ma, Wenya Wang, Chengguang Wang, Yu Zhang, Xin Wang, Yongfeng Zhang, Jiaxuan You, Hanghang Tong, Xiao Luo, Xue Liu, Yizhou Sun, Wei Wang, Julian McAuley, James Zou, Jiawei Han, Philip S. Yu, Kai Shu

First: 2026-01-14T07:38:38+00:00 · Latest: 2026-02-09T18:53:33+00:00

Abs · PDF · Code1 · Code2

Abstract

The research of artificial intelligence is undergoing a paradigm shift from prioritizing model innovations over benchmark scores towards emphasizing problem definition and rigorous real-world evaluation. As the field enters the "second half," the central challenge becomes real utility in long-horizon, dynamic, and user-dependent environments, where agents face context explosion and must continuously accumulate, manage, and selectively reuse large volumes of information across extended interactions. Memory, with hundreds of papers released this year, therefore emerges as the critical solution to fill the utility gap. In this survey, we provide a unified view of foundation agent memory along three dimensions: memory substrate (internal and external), cognitive mechanism (episodic, semantic, sensory, working, and procedural), and memory subject (agent- and user-centric). We then analyze how memory is instantiated and operated under different agent topologies and highlight learning policies over memory operations. Finally, we review evaluation benchmarks and metrics for assessing memory utility, and outline various open challenges and future directions.

中文标题/摘要

标题：重新思考基础代理在第二阶段的记忆机制：一项综述

人工智能的研究正在经历从优先考虑模型创新而非基准得分向强调问题定义和严格的现实世界评估的范式转变。随着领域进入“第二阶段”，中心挑战在于在长期、动态和用户依赖的环境中实现实际效用，其中代理面临上下文爆炸，必须在长时间交互中不断积累、管理和选择性重用大量信息。记忆，今年有数百篇论文发布，因此成为填补效用缺口的关键解决方案。在这项综述中，我们从三个维度提供了一致的基础代理记忆视图：记忆载体（内部和外部）、认知机制（事件、语义、感觉、工作和程序），以及记忆主体（代理中心和用户中心）。然后我们分析了在不同代理拓扑结构下记忆是如何实现和操作的，并强调了对记忆操作的学习策略。最后，我们回顾了评估记忆效用的基准和指标，并概述了各种开放挑战和未来方向。

Summary / 总结

This survey addresses the shift in AI research toward problem definition and rigorous real-world evaluation, and argues that memory is central to the utility of foundation agents in long-horizon, dynamic, user-dependent environments. It organizes agent memory along three dimensions: memory substrate (internal and external), cognitive mechanism (episodic, semantic, sensory, working, and procedural), and memory subject (agent- and user-centric). It then analyzes how memory is instantiated and operated under different agent topologies, reviews learning policies over memory operations along with evaluation benchmarks and metrics, and outlines open challenges.

本文探讨了AI研究向注重问题定义和严格的现实世界评估的转变，指出记忆是基础代理在长时程、动态、依赖用户的环境中实现实际效用的关键。综述从记忆载体（内部和外部）、认知机制（情景、语义、感觉、工作和程序性记忆）和记忆主体（以代理为中心和以用户为中心）三个维度梳理了代理记忆，分析了不同代理拓扑结构下记忆的实现与操作方式，回顾了针对记忆操作的学习策略以及评估基准和指标，并概述了开放挑战。

ARO: A New Lens On Matrix Optimization For Large Models

Authors: Wenbo Gong, Javier Zazo, Qijun Luo, Puqian Wang, James Hensman, Chao Ma

First: 2026-02-09T18:51:22+00:00 · Latest: 2026-02-09T18:51:22+00:00

Abs · PDF · Code1 · Code2

Abstract

Matrix-based optimizers have attracted growing interest for improving LLM training efficiency, with significant progress centered on orthogonalization/whitening based methods. While yielding substantial performance gains, a fundamental question arises: can we develop new paradigms beyond orthogonalization, pushing the efficiency frontier further? We present Adaptively Rotated Optimization (ARO), a new matrix optimization framework that treats gradient rotation as a first-class design principle. ARO accelerates LLM training by performing normed steepest descent in a rotated coordinate system, where the rotation is determined by a novel norm-informed policy. This perspective yields update rules that go beyond existing orthogonalization and whitening optimizers, improving sample efficiency in practice. To make comparisons reliable, we propose a rigorously controlled benchmarking protocol that reduces confounding and bias. Under this protocol, ARO consistently outperforms AdamW (by 1.3$\sim$1.35$\times$) and orthogonalization methods (by 1.1$\sim$1.15$\times$) in LLM pretraining at up to 8B activated parameters, and up to $8\times$ overtrain budget, without evidence of diminishing returns. Finally, we discuss how ARO can be reformulated as a symmetry-aware optimizer grounded in rotational symmetries of residual streams, motivating advanced designs that enable computationally efficient exploitation of cross-layer/cross-module couplings.

中文标题/摘要

标题：ARO：大模型矩阵优化的新视角

基于矩阵的优化器因能提高大语言模型（LLM）训练效率而受到越来越多的关注，其中显著进展集中在基于正交化/白化的方法上。尽管这些方法带来了可观的性能提升，但一个基本问题随之出现：我们能否开发出超越正交化的新范式，进一步推进效率前沿？我们提出了自适应旋转优化（ARO），这是一种新的矩阵优化框架，将梯度旋转视为首要设计原则。ARO 通过在旋转坐标系中执行范数最速下降来加速 LLM 训练，其中旋转由一种新颖的范数导向策略确定。这一视角产生的更新规则超越了现有的正交化和白化优化器，在实践中提高了样本效率。为了使比较可靠，我们提出了一种严格受控的基准测试协议，以减少混淆因素和偏差。在该协议下，ARO 在激活参数量最高 80 亿、过训练预算最高 8 倍的 LLM 预训练中始终优于 AdamW（1.3 至 1.35 倍）和正交化方法（1.1 至 1.15 倍），且没有出现收益递减的迹象。最后，我们讨论了如何将 ARO 重新表述为一种基于残差流旋转对称性的对称感知优化器，这启发了更先进的设计，使跨层/跨模块耦合能够被高效地利用。

Summary / 总结

The paper introduces Adaptively Rotated Optimization (ARO), a new matrix optimization framework that improves LLM training efficiency by treating gradient rotation as a first-class design principle. ARO performs normed steepest descent in a rotated coordinate system, with the rotation chosen by a norm-informed policy, which improves sample efficiency. Under a rigorously controlled benchmarking protocol proposed by the authors, ARO outperforms AdamW by 1.3 to 1.35 times and orthogonalization methods by 1.1 to 1.15 times in LLM pretraining at up to 8 billion activated parameters and up to 8 times overtrain budget, without evidence of diminishing returns. The authors also discuss ARO's reformulation as a symmetry-aware optimizer.

论文提出了自适应旋转优化（ARO），这是一种新的矩阵优化框架，通过将梯度旋转视为首要设计原则来提升LLM训练效率。ARO在旋转坐标系中执行范数最速下降，旋转由一种新颖的范数导向策略确定。实验结果显示，在激活参数量最高80亿的LLM预训练中，ARO相对AdamW和正交化方法分别取得1.3到1.35倍和1.1到1.15倍的提升，并且在最高8倍过训练预算下没有出现收益递减的迹象。
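To make "normed steepest descent in a rotated coordinate system" concrete, here is a small two-dimensional sketch. It assumes the $\ell_\infty$ norm, under which the steepest descent direction is the negative sign of the gradient, and it uses a fixed rotation angle as a placeholder. ARO's actual norm-informed rotation policy and its matrix-valued updates are not described in the abstract and are not reproduced here; all function names are hypothetical.

```python
import math
from typing import List


def rotation(theta: float) -> List[List[float]]:
    """2x2 rotation matrix (placeholder for a learned or norm-informed rotation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matvec(m: List[List[float]], v: List[float]) -> List[float]:
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def transpose(m: List[List[float]]) -> List[List[float]]:
    return [list(col) for col in zip(*m)]


def sign(x: float) -> float:
    return float((x > 0) - (x < 0))


def rotated_sign_step(
    params: List[float], grad: List[float], rot: List[List[float]], lr: float
) -> List[float]:
    """One step of l_inf steepest descent taken in the coordinates defined by rot."""
    g_rot = matvec(transpose(rot), grad)   # gradient expressed in rotated coordinates
    d_rot = [-sign(g) for g in g_rot]      # steepest descent direction under the l_inf norm
    d = matvec(rot, d_rot)                 # map the direction back to original coordinates
    return [p + lr * di for p, di in zip(params, d)]


params, grad = [1.0, -2.0], [0.3, -0.7]
plain = rotated_sign_step(params, grad, rotation(0.0), lr=0.1)    # identity rotation: sign descent
rotated = rotated_sign_step(params, grad, rotation(0.6), lr=0.1)  # arbitrary fixed rotation
print(plain, rotated)
```

With the identity rotation the update reduces to ordinary sign descent. For any orthogonal rotation $R$, the direction $-R\,\mathrm{sign}(R^\top g)$ has inner product $-\lVert R^\top g\rVert_1$ with the gradient $g$, so it remains a descent direction whenever $g \neq 0$.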

Data Science and Technology Towards AGI Part I: Tiered Data Management

Authors: Yudong Wang, Zixuan Fu, Hengyu Zhao, Chen Zhao, Chuyue Zhou, Xinle Lin, Hongya Lyu, Shuaikang Xue, Yi Yi, Yingjiao Wang, Zhi Zheng, Yuzhou Zhang, Jie Zhou, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun

First: 2026-02-09T18:47:51+00:00 · Latest: 2026-02-09T18:47:51+00:00

Comments: 16 pages, 3 figures, 7 tables

Abs · PDF · Code1 · Code2

Abstract

The development of artificial intelligence can be viewed as an evolution of data-driven learning paradigms, with successive shifts in data organization and utilization continuously driving advances in model capability. Current LLM research is dominated by a paradigm that relies heavily on unidirectional scaling of data size, increasingly encountering bottlenecks in data availability, acquisition cost, and training efficiency. In this work, we argue that the development of AGI is entering a new phase of data-model co-evolution, in which models actively guide data management while high-quality data, in turn, amplifies model capabilities. To implement this vision, we propose a tiered data management framework, designed to support the full LLM training lifecycle across heterogeneous learning objectives and cost constraints. Specifically, we introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. Importantly, LLMs are fully used in data management processes, such as quality scoring and content editing, to refine data across tiers. Each tier is characterized by distinct data properties, management strategies, and training roles, enabling data to be strategically allocated across LLM training stages, including pre-training, mid-training, and alignment. The framework balances data quality, acquisition cost, and marginal training benefit, providing a systematic approach to scalable and sustainable data management. We validate the effectiveness of the proposed framework through empirical studies, in which tiered datasets are constructed from raw corpora and used across multiple training phases. Experimental results demonstrate that tier-aware data utilization significantly improves training efficiency and model performance. To facilitate further research, we release our tiered datasets and processing tools to the community.

中文标题/摘要

标题：数据科学与技术向AGI迈进第一部分：分层数据管理

人工智能的发展可以视为数据驱动学习范式的进化，数据组织和利用方式的连续转变不断推动着模型能力的进步。当前的LLM研究主要依赖于数据规模的单向扩展，越来越遇到数据可用性、获取成本和训练效率的瓶颈。本文认为，AGI的发展正进入数据-模型协同进化的新阶段，在这个阶段中，模型主动引导数据管理，高质量数据反过来增强模型能力。为了实现这一愿景，我们提出了一种分层数据管理框架，旨在支持跨异构学习目标和成本约束的整个LLM训练生命周期。具体而言，我们引入了一个从原始未整理资源到组织和可验证知识的L0-L4分层数据管理框架。重要的是，在数据管理过程中，LLM完全用于质量评分和内容编辑，以在各层中精炼数据。每一层都具有独特的数据属性、管理策略和训练角色，使数据能够战略性地分配到LLM训练阶段，包括预训练、中期训练和对齐。该框架平衡了数据质量、获取成本和边际训练收益，提供了一种系统的方法来实现可扩展和可持续的数据管理。我们通过实证研究验证了所提框架的有效性，在这些研究中，从原始语料库构建分层数据集，并在多个训练阶段中使用。实验结果表明，分层数据利用显著提高了训练效率和模型性能。为了促进进一步研究，我们向社区发布了分层数据集和处理工具。

Summary / 总结

This paper addresses the limitations of current large language model (LLM) research, which primarily focuses on unidirectional data scaling. It proposes a tiered data management framework (L0-L4) to support the full LLM training lifecycle, where LLMs actively guide data management and refine data quality. Experimental results show that tier-aware data utilization enhances training efficiency and model performance.

本文认为AGI的发展正进入数据-模型协同进化阶段，模型指导数据管理。提出了一种分层数据管理框架（L0-L4），支持LLM训练生命周期，LLM参与数据质量评分和内容编辑。该框架平衡了数据质量、获取成本和训练效益。实验证明，分层数据利用显著提高了训练效率和模型性能。

From Obstacles to Etiquette: Robot Social Navigation with VLM-Informed Path Selection

Authors: Zilin Fang, Anxing Xiao, David Hsu, Gim Hee Lee

First: 2026-02-09T18:46:12+00:00 · Latest: 2026-02-09T18:46:12+00:00

Comments: Accepted to IEEE Robotics and Automation Letters (RA-L)

Abs · PDF · Code1 · Code2 · Project1

Abstract

Navigating socially in human environments requires more than satisfying geometric constraints, as collision-free paths may still interfere with ongoing activities or conflict with social norms. Addressing this challenge calls for analyzing interactions between agents and incorporating common-sense reasoning into planning. This paper presents a social robot navigation framework that integrates geometric planning with contextual social reasoning. The system first extracts obstacles and human dynamics to generate geometrically feasible candidate paths, then leverages a fine-tuned vision-language model (VLM) to evaluate these paths, informed by contextually grounded social expectations, selecting a socially optimized path for the controller. This task-specific VLM distills social reasoning from large foundation models into a smaller and efficient model, allowing the framework to perform real-time adaptation in diverse human-robot interaction contexts. Experiments in four social navigation contexts demonstrate that our method achieves the best overall performance with the lowest personal space violation duration, the minimal pedestrian-facing time, and no social zone intrusions. Project page: https://path-etiquette.github.io

中文标题/摘要

标题：从障碍到礼仪：基于VLM的路径选择社会导航

在人类环境中进行社会导航不仅需要满足几何约束，碰撞自由路径仍可能干扰正在进行的活动或违反社会规范。解决这一挑战需要分析代理之间的交互，并将常识推理纳入规划中。本文提出了一种结合几何规划与情境社会推理的社会机器人导航框架。系统首先提取障碍和人类动态以生成几何上可行的候选路径，然后利用微调后的视觉语言模型（VLM）根据情境化的社会期望评估这些路径，选择一个社会优化路径供控制器使用。这种任务特定的VLM将大型基础模型中的社会推理提炼到一个更小且高效的模型中，使框架能够在多种人机交互场景中进行实时适应。在四个社会导航场景中的实验表明，我们的方法在个人空间侵犯时间最短、行人面对时间最少且无社会区域侵入方面表现最佳。项目页面：https://path-etiquette.github.io

Summary / 总结

This paper addresses the challenge of social navigation for robots by integrating geometric planning with social reasoning. It proposes a framework that first generates geometrically feasible paths and then uses a fine-tuned vision-language model to evaluate these paths based on social expectations, selecting the most socially optimized path. Experiments show that the method performs best in four social navigation contexts, with minimal personal space violations and no social zone intrusions.

本文通过将几何规划与社会推理相结合，解决了机器人的社会导航问题。系统首先生成几何上可行的路径，然后使用微调后的视觉语言模型根据社会期望评估这些路径，选择最优化的社会路径。实验在四个社会导航场景中表明，所提出的方法在最小个人空间侵犯和无社会区域侵入方面优于其他方法。
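
The two-step pipeline in the abstract above (generate geometrically feasible candidates, then let a social scorer pick one) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Path`, `min_clearance`, `select_path`, the `clearance` threshold, and the `social_score` callable (a stand-in for the fine-tuned VLM) are all hypothetical names introduced here.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class Path:
    waypoints: List[Point]


def min_clearance(path: Path, obstacles: Sequence[Point]) -> float:
    """Smallest waypoint-to-obstacle distance (inf if there are no obstacles)."""
    return min(
        (math.dist(w, o) for w in path.waypoints for o in obstacles),
        default=float("inf"),
    )


def select_path(
    candidates: Sequence[Path],
    obstacles: Sequence[Point],
    social_score: Callable[[Path], float],  # hypothetical stand-in for the VLM
    clearance: float = 0.5,                 # assumed safety margin in meters
) -> Optional[Path]:
    # Step 1: keep only candidates that satisfy the geometric constraint.
    feasible = [p for p in candidates if min_clearance(p, obstacles) >= clearance]
    if not feasible:
        return None
    # Step 2: among feasible paths, pick the one the social scorer prefers.
    return max(feasible, key=social_score)
```

Separating the geometric filter from the social scorer means the scorer only ever ranks collision-free paths, so a poor social judgment cannot produce an unsafe path in this sketch.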

Semantics-Aware Generative Latent Data Augmentation for Learning in Low-Resource Domains

Authors: Jaesung Bae, Minje Kim

First: 2026-02-02T21:43:54+00:00 · Latest: 2026-02-09T18:46:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Despite strong performance in data-rich regimes, deep learning often underperforms in the data-scarce settings common in practice. While foundation models (FMs) trained on massive datasets demonstrate strong generalization by extracting general-purpose features, they can still suffer from scarce labeled data during downstream fine-tuning. To address this, we propose GeLDA, a semantics-aware generative latent data augmentation framework that leverages conditional diffusion models to synthesize samples in an FM-induced latent space. Because this space is low-dimensional and concentrates task-relevant information compared to the input space, GeLDA enables efficient, high-quality data generation. GeLDA conditions generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains, facilitating data augmentation in low-resource domains. We validate GeLDA in two large-scale recognition tasks: (a) in zero-shot language-specific speech emotion recognition, GeLDA improves the Whisper-large baseline's unweighted average recall by 6.13%; and (b) in long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result.

中文标题/摘要

标题：面向低资源领域学习的语义感知生成式潜在数据增强

尽管在数据丰富的环境中表现出色，深度学习在实践中常见的数据稀缺设置中往往表现不佳。虽然在大规模数据集上训练的基础模型（FMs）能够通过提取通用特征来表现出强大的泛化能力，但在下游微调过程中仍可能受到缺乏标注数据的影响。为了解决这一问题，我们提出了一种语义感知生成潜在数据增强框架GeLDA，该框架利用条件扩散模型在基础模型诱导的潜在空间中合成样本。由于该空间是低维度的，并且与输入空间相比集中了与任务相关的信息，GeLDA 能够实现高效、高质量的数据生成。GeLDA 以捕捉类别或子域之间语义关系的辅助特征向量作为生成条件，从而在低资源领域促进数据增强。我们在两个大规模识别任务中验证了GeLDA：(a) 在零样本语言特定语音情感识别中，GeLDA 将Whisper-large基线的未加权平均召回率提高了6.13%；(b) 在长尾图像分类中，它在ImageNet-LT上实现了74.7%的尾部类准确率，创下了新的最佳结果。

Summary / 总结

The paper addresses deep learning's underperformance in low-resource settings by proposing GeLDA, a semantics-aware generative latent data augmentation framework. GeLDA uses conditional diffusion models to synthesize samples in a low-dimensional latent space induced by foundation models, conditioning generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains; this enables efficient, high-quality data generation. The method improves the Whisper-large baseline's unweighted average recall by 6.13% in zero-shot language-specific speech emotion recognition and achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result in long-tailed image classification.

论文提出了一种名为GeLDA的语义感知生成潜在数据增强框架，利用条件扩散模型在基础模型诱导的低维潜在空间中生成样本。该方法在资源稀缺环境下提高了性能，例如在零样本特定语言语音情感识别任务中，GeLDA使Whisper-large基线的未加权平均召回率提高了6.13%，并在ImageNet-LT的长尾图像分类任务中达到了74.7%的尾部类准确率，创下了新的最佳结果。

iGRPO: Self-Feedback-Driven LLM Reasoning

Authors: Ali Hatamizadeh, Shrimai Prabhumoye, Igor Gitman, Ximing Lu, Seungju Han, Wei Ping, Yejin Choi, Jan Kautz

First: 2026-02-09T18:45:11+00:00 · Latest: 2026-02-09T18:45:11+00:00

Comments: Tech report

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.

中文标题/摘要

标题：iGRPO：自我反馈驱动的LLM推理

大型语言模型（LLMs）在解决复杂数学问题方面显示出潜力，但仍难以提供准确且一致的解决方案。强化学习（RL）是一种使这些模型与特定任务奖励相一致的框架，从而提高整体质量和可靠性。组相对策略优化（GRPO）是Proximal Policy Optimization（PPO）的一种高效、无价值函数的替代方案，利用组相对奖励归一化。我们引入了迭代组相对策略优化（iGRPO），这是一种GRPO的两阶段扩展，通过模型生成的草稿添加动态自我调节。在第一阶段，iGRPO采样多个探索性草稿，并使用与优化相同的标量奖励信号选出奖励最高的草稿。在第二阶段，它将此最佳草稿附加到原始提示，并对以草稿为条件的改进结果应用GRPO风格的更新，训练策略超越其最强的先前尝试。在匹配的展开预算下，iGRPO在不同基础模型（例如Nemotron-H-8B-Base-8K和DeepSeek-R1 Distilled）上始终优于GRPO，验证了其在各种推理基准上的有效性。此外，将iGRPO应用于在AceReason-Math上训练的OpenReasoning-Nemotron-7B，分别在AIME24和AIME25上取得了新的最佳结果85.62%和79.64%。消融实验进一步表明，该改进封装可推广到GRPO以外的变体，受益于生成式评判，并通过延迟熵崩溃改变了学习动态。这些结果强调了迭代、基于自我反馈的RL在推进可验证数学推理方面的潜力。

Summary / 总结

iGRPO is a two-stage reinforcement learning method that enhances the reasoning capabilities of Large Language Models (LLMs) by incorporating self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward one; in Stage 2, it appends that draft to the original prompt and applies a GRPO-style update on the draft-conditioned refinements, training the policy to improve on its strongest prior attempt. Under matched rollout budgets, iGRPO outperforms GRPO across various base models, and applying it to OpenReasoning-Nemotron-7B yields new state-of-the-art results of 85.62% on AIME24 and 79.64% on AIME25.

iGRPO 是一种两阶段强化学习方法，通过模型生成的草稿进行自我条件化来增强大型语言模型（LLM）的推理能力。在第一阶段，iGRPO 从多个探索性草稿中选择最高奖励的草稿；在第二阶段，它将此草稿附加到原始提示中，并对以草稿为条件的改进结果应用 GRPO 风格的更新。实验表明，在相同的展开预算下，iGRPO 在各种基础模型上优于 GRPO；将其应用于 OpenReasoning-Nemotron-7B 后，在 AIME24 和 AIME25 基准测试中分别取得了 85.62% 和 79.64% 的新的最佳结果，验证了其在数学推理任务中的有效性。
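
The two stages described above can be sketched in a few lines. This is only an illustration under assumptions: the abstract says GRPO uses "group-relative reward normalization" but gives no formula, so the `(r - mean) / (std + eps)` form below is the commonly used GRPO normalization rather than something confirmed by this entry. The `sample` and `reward` callables, the prompt template, and the function names are hypothetical, and the actual policy-gradient update that would consume the advantages is omitted.

```python
import statistics
from typing import Callable, List, Tuple


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def igrpo_step(
    prompt: str,
    sample: Callable[[str], str],    # hypothetical policy sampler
    reward: Callable[[str], float],  # hypothetical scalar reward
    num_drafts: int = 4,
    num_refinements: int = 4,
) -> Tuple[str, List[str], List[float]]:
    # Stage 1: sample exploratory drafts and keep the highest-reward one.
    drafts = [sample(prompt) for _ in range(num_drafts)]
    best_draft = max(drafts, key=reward)

    # Stage 2: condition on the best draft (assumed template) and compute
    # group-relative advantages over the refinements.
    conditioned = f"{prompt}\n\nPrevious best attempt:\n{best_draft}"
    refinements = [sample(conditioned) for _ in range(num_refinements)]
    advantages = group_relative_advantages([reward(r) for r in refinements])
    return best_draft, refinements, advantages
```

Note that the same `reward` function both selects the Stage 1 draft and scores the Stage 2 refinements, mirroring the abstract's statement that draft selection reuses the optimization reward signal.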

Paradox of De-identification: A Critique of HIPAA Safe Harbour in the Age of LLMs

Authors: Lavender Y. Jiang, Xujin Chris Liu, Kyunghyun Cho, Eric K. Oermann

First: 2026-02-09T18:43:19+00:00 · Latest: 2026-02-09T18:43:19+00:00

Abs · PDF · Code1 · Code2

Abstract

Privacy is a human right that sustains patient-provider trust. Clinical notes capture a patient's private vulnerability and individuality, which are used for care coordination and research. Under HIPAA Safe Harbor, these notes are de-identified to protect patient privacy. However, Safe Harbor was designed for an era of categorical tabular data, focusing on the removal of explicit identifiers while ignoring the latent information found in correlations between identity and quasi-identifiers, which can be captured by modern LLMs. We first formalize these correlations using a causal graph, then validate it empirically through individual re-identification of patients from scrubbed notes. The paradox of de-identification is further shown through a diagnosis ablation: even when all other information is removed, the model can predict the patient's neighborhood based on diagnosis alone. This position paper raises the question of how we can act as a community to uphold patient-provider trust when de-identification is inherently imperfect. We aim to raise awareness and discuss actionable recommendations.

中文标题/摘要

标题：去标识化悖论：在大语言模型时代对HIPAA安全港的批判

隐私是一项人权，维持着患者与提供者之间的信任。临床记录捕捉到患者的私人脆弱性和个体性，用于护理协调和研究。在HIPAA安全港下，这些记录被去标识化以保护患者隐私。然而，安全港是为分类表格数据时代设计的，侧重于删除显式标识符，而忽略了身份与准标识符之间相关性的潜在信息，这些信息可以被现代大语言模型捕获。我们首先使用因果图形式化这些相关性，然后通过从清洗后的记录中重新识别患者来实证验证。去标识化的悖论进一步通过诊断消除实验得到展示：即使移除了所有其他信息，模型仅凭诊断就能预测患者的居住地。本文立场文件提出了一个问题：在去标识化本质上不完美的情况下，我们作为社区如何行动以维持患者与提供者之间的信任。我们旨在提高意识并讨论可操作的建议。

Summary / 总结

The paper critiques the HIPAA Safe Harbor de-identification standard in the era of large language models (LLMs), which can capture latent correlations between identity and quasi-identifiers in clinical notes. It formalizes these correlations using a causal graph and validates them by re-identifying individual patients from scrubbed notes. A diagnosis ablation shows that even when all other information is removed, a model can predict a patient's neighborhood from the diagnosis alone, highlighting the paradox of de-identification. The authors call for community action to address this issue and maintain patient-provider trust.

论文在大语言模型（LLMs）时代的背景下批评了HIPAA安全港去标识化标准，这些模型可以从临床笔记中捕获隐含信息。研究使用因果图形式化这些关联，并通过清洗后的笔记重新识别患者来验证它们。研究显示，即使移除了其他所有信息，仅凭诊断信息仍能预测患者的居住地，揭示了去标识化的悖论。作者呼吁社区采取行动，以维护患者-提供者信任。

Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study

Authors: Arushi Rai, Adriana Kovashka

Venue: WACV 2026

First: 2026-02-09T18:41:43+00:00 · Latest: 2026-02-09T18:41:43+00:00

Comments: To appear at WACV 2026

Abs · PDF · Code1 · Code2

Abstract

While there is rapid progress in video-LLMs with advanced reasoning capabilities, prior work shows that these models struggle on the challenging task of sports feedback generation and require expensive and difficult-to-collect finetuning feedback data for each sport. This limitation is evident from the poor generalization to sports unseen during finetuning. Furthermore, traditional text generation evaluation metrics (e.g., BLEU-4, METEOR, ROUGE-L, BERTScore), originally developed for machine translation and summarization, fail to capture the unique aspects of sports feedback quality. To address the first problem, using rock climbing as our case study, we propose using auxiliary freely-available web data from the target domain, such as competition videos and coaching manuals, in addition to existing sports feedback from a disjoint, source domain to improve sports feedback generation performance on the target domain. To improve evaluation, we propose two evaluation metrics: (1) specificity and (2) actionability. Together, our approach enables more meaningful and practical generation of sports feedback under limited annotations.

中文标题/摘要

标题：通过观看比赛和阅读书籍泛化体育反馈生成：以攀岩为例的研究

尽管视频-LLM在高级推理能力方面取得了快速进展，但现有研究表明，这些模型在体育反馈生成这一具有挑战性的任务上表现不佳，需要为每项运动收集昂贵且难以获取的微调反馈数据。这一限制体现在模型在未见过的体育项目上的泛化能力较差。此外，传统的文本生成评估指标（如BLEU-4、METEOR、ROUGE-L、BERTScore），最初是为机器翻译和摘要开发的，无法捕捉体育反馈质量的独特方面。为了解决第一个问题，以攀岩为例，我们提出使用目标领域中的辅助免费网络数据，如比赛视频和教练手册，以及来自不同源领域的现有体育反馈，以提高目标领域体育反馈生成性能。为了改进评估，我们提出了两个评估指标：（1）具体性；（2）可操作性。结合我们的方法，即使在有限注释的情况下，也能实现更有意义和实用的体育反馈生成。

Summary / 总结

The research aims to enhance sports feedback generation for unseen sports by leveraging auxiliary data from the target domain, such as competition videos and coaching manuals, alongside existing sports feedback from a different domain. The proposed method introduces two new evaluation metrics: specificity and actionability, to better assess the quality of sports feedback. Key findings show improved generalization and practicality of generated feedback with limited annotations.

研究旨在通过利用目标领域的辅助数据，如比赛视频和教练手册，以及不同领域的现有反馈来提高运动反馈生成。方法引入了两个新的评估指标：具体性和可操作性，以更好地评估运动反馈的质量。关键发现表明，这种方法可以增强运动反馈生成的泛化能力，并在有限标注的情况下提供更具意义和实用性的反馈。

When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

Authors: Yuting Ning, Jaylen Jones, Zhehao Zhang, Chentao Ye, Weitong Ruan, Junyi Li, Rahul Gupta, Huan Sun

First: 2026-02-09T18:41:15+00:00 · Latest: 2026-02-09T18:41:15+00:00

Comments: Project Homepage: https://osu-nlp-group.github.io/Misaligned-Action-Detection/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Computer-use agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user's original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs to safety risks, but also degrade task efficiency and reliability. This work makes the first effort to define and study misaligned action detection in CUAs, with comprehensive coverage of both externally induced and internally arising misaligned actions. We further identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. Moreover, we propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback. DeAction outperforms all existing baselines across offline and online evaluations with moderate latency overhead: (1) On MisActBench, it outperforms baselines by over 15% absolute in F1 score; (2) In online evaluation, it reduces attack success rate by over 90% under adversarial settings while preserving or even improving task success rate in benign environments.

中文标题/摘要

标题：当行为偏离任务：检测和纠正计算机使用代理中的对齐偏差

计算机使用代理（CUAs）在过去一年中取得了巨大进展，但仍经常产生与用户原始意图不符的行为。这些对齐偏差可能源自外部攻击（例如，间接提示注入）或内部限制（例如，错误的推理）。它们不仅使CUAs面临安全风险，还降低了任务效率和可靠性。本研究首次尝试定义和研究CUAs中的对齐偏差检测，涵盖了外部诱导和内部产生的对齐偏差的全面覆盖。我们进一步识别了现实世界CUA部署中的三种常见类别，并构建了MisActBench，这是一个包含人类标注的行为级对齐标签的基准数据集。此外，我们提出了DeAction，这是一种实用且通用的护栏，可以在执行前检测对齐偏差，并通过结构化反馈迭代纠正它们。DeAction在离线和在线评估中均优于所有现有基线，具有适度的延迟开销：（1）在MisActBench上，其F1分数绝对值比基线高出15%以上；（2）在在线评估中，在对抗性设置下将攻击成功率降低超过90%，同时在良性环境中保持或甚至提高了任务成功率。

Summary / 总结

This work addresses the issue of misaligned actions in computer-use agents (CUAs), which can arise from external attacks or internal limitations. It introduces MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels, and proposes DeAction, a guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback. DeAction outperforms existing baselines by over 15% absolute F1 on MisActBench and reduces attack success rate by over 90% in adversarial settings while preserving or even improving task success in benign environments.

该研究针对计算机使用代理（CUA）中的错配动作问题，这些问题可能源自外部攻击或内部限制。为此，作者提出了MisActBench，一个包含现实轨迹和人工标注动作对齐标签的基准，以及DeAction，一种在执行前检测并纠正错配动作的护栏。DeAction在MisActBench上的F1分数比基线高出15个百分点以上，并在对抗性环境中将攻击成功率降低了90%以上，同时在良性环境中保持或提高了任务成功率。
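
The detect-before-execute loop with iterative correction described above can be sketched as follows. This is a hedged illustration, not DeAction itself: `Verdict`, `guarded_action`, the `propose` and `detect` callables, and the `max_rounds` cap are hypothetical names and parameters chosen here, and the abstract does not say what happens when correction fails, so returning `None` (block the action) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    aligned: bool
    feedback: str = ""  # structured feedback would go here; a string is assumed


def guarded_action(
    task: str,
    propose: Callable[[str, str], str],      # hypothetical agent: (task, feedback) -> action
    detect: Callable[[str, str], Verdict],   # hypothetical detector: (task, action) -> verdict
    max_rounds: int = 3,                     # assumed cap on correction rounds
) -> Optional[str]:
    feedback = ""
    for _ in range(max_rounds):
        action = propose(task, feedback)
        verdict = detect(task, action)
        if verdict.aligned:
            return action                    # safe to execute
        feedback = verdict.feedback          # feed the critique back to the agent
    return None                              # assumption: block if never aligned
```

Because the check runs before execution, a misaligned action is never carried out in this sketch; the cost is the extra detector call per proposed action, which is consistent with the "moderate latency overhead" the abstract reports.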

InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery

Authors: Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, Zijie Guo, Zhijie Zhong, Shangheng Du, Weida Wang, Jinxin Shi, Yuhao Zhou, Xiaohan He, Zhiyin Yu, Fangchen Yu, Qihao Zheng, Jiamin Wu, Mianxin Liu, Chi Zhang, Shaowei Hou, Shuya Li, Yankai Jiang, Wenjie Lou, Lilong Wang, Zifu Wang, Jiong Wang, Wanghan Xu, Yue Deng, Dongrui Liu, Yiheng Wang, Wenlong Zhang, Fenghua Ling, Shufei Zhang, Xiaosong Wang, Shuangjia Zheng, Xun Huang, Siqi Sun, Shuyue Hu, Peng Ye, Chunfeng Song, Bin Wang, Conghui He, Yihao Liu, Xin Li, Qibin Hou, Tao Chen, Xiangyu Yue, Bin Wang, Liang He, Dahua Lin, Bowen Zhou, Bo Zhang, Lei Bai

First: 2026-02-09T18:36:06+00:00 · Latest: 2026-02-09T18:36:06+00:00

Comments: Code and project page: https://github.com/InternScience/InternAgent

Abs · PDF · Code1 · Code2 · Code3

Abstract

We introduce InternAgent-1.5, a unified system designed for end-to-end scientific discovery across computational and empirical domains. The system is built on a structured architecture composed of three coordinated subsystems for generation, verification, and evolution. These subsystems are supported by foundational capabilities for deep research, solution optimization, and long horizon memory. The architecture allows InternAgent-1.5 to operate continuously across extended discovery cycles while maintaining coherent and improving behavior. It also enables the system to coordinate computational modeling and laboratory experimentation within a single unified system. We evaluate InternAgent-1.5 on scientific reasoning benchmarks such as GAIA, HLE, GPQA, and FrontierScience, and the system achieves leading performance that demonstrates strong foundational capabilities. Beyond these benchmarks, we further assess two categories of discovery tasks. In algorithm discovery tasks, InternAgent-1.5 autonomously designs competitive methods for core machine learning problems. In empirical discovery tasks, it executes complete computational or wet lab experiments and produces scientific findings in earth, life, biological, and physical domains. Overall, these results show that InternAgent-1.5 provides a general and scalable framework for autonomous scientific discovery.

中文标题/摘要

标题：InternAgent-1.5：统一代理框架，用于长期自主科学发现

我们介绍了InternAgent-1.5，这是一个用于跨计算和经验领域端到端科学发现的统一系统。该系统基于一个结构化的架构，由三个协调的子系统组成，用于生成、验证和进化。这些子系统由深度研究的基础能力、解决方案优化和长期记忆支持。该架构使InternAgent-1.5能够在扩展的发现周期中连续运行，同时保持一致并改进行为。它还使系统能够在单一统一系统中协调计算建模和实验室实验。我们在GAIA、HLE、GPQA和FrontierScience等科学推理基准上评估了InternAgent-1.5，该系统展示了强大的基础能力，取得了领先性能。除了这些基准，我们还进一步评估了两类发现任务。在算法发现任务中，InternAgent-1.5自主设计了针对核心机器学习问题的竞争性方法。在经验发现任务中，它执行完整的计算或湿实验，并在地球、生命、生物和物理领域产生科学发现。总体而言，这些结果表明，InternAgent-1.5提供了一个通用且可扩展的自主科学发现框架。

Summary / 总结

InternAgent-1.5 is a unified system for long-horizon scientific discovery, featuring a structured architecture with generation, verification, and evolution subsystems. It demonstrates strong foundational capabilities and achieves leading performance on scientific reasoning benchmarks. The system autonomously designs competitive methods for machine learning and conducts experiments in various scientific domains, showcasing its general and scalable framework for autonomous discovery.

InternAgent-1.5 是一个用于长期自主科学研究的统一系统，其结构化架构包括生成、验证和进化子系统。该系统在科学推理基准测试中表现出色，并自主设计了适用于机器学习问题的竞争性方法，并执行了涵盖地球、生命、生物和物理领域的实验，展示了强大的基础能力和可扩展性。

f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song

First: 2026-02-05T18:01:52+00:00 · Latest: 2026-02-09T18:34:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning objectives, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.

中文标题/摘要

标题：f-GRPO及其扩展：基于散度的强化学习算法在通用LLM对齐中的应用

近期研究表明，偏好对齐(PA)目标可以作为对齐(选择)和未对齐(拒绝)响应分布之间散度的估计器。在此项工作中，我们将这种基于散度的观点扩展到一般的对齐设置中，例如仅具有环境奖励的可验证奖励强化学习(RLVR)。在这一统一框架中，我们提出了f-组相对策略优化(f-GRPO)，这是一类在线策略强化学习目标，以及f-混合对齐损失(f-HAL)，这是一类混合在线/离线策略目标，基于f-散度的变分表示，用于通用LLM对齐。我们提供了这些类目标在对齐后提高平均奖励的理论保证。实验上，我们在RLVR(数学推理)和PA任务(安全对齐)上验证了我们的框架，展示了与当前方法相比的优越性能和灵活性。

Summary / 总结

This research aims to enhance the alignment of large language models (LLMs) using divergence-based reinforcement learning (RL) methods. The study proposes f-GRPO and f-HAL, which extend the divergence-based perspective to general alignment settings, including reinforcement learning with verifiable rewards (RLVR). The key findings show that these methods improve the average reward after alignment and outperform existing methods in both RLVR and safety alignment tasks.

研究旨在通过基于散度的强化学习方法提升大型语言模型（LLM）的对齐。研究提出了f-GRPO和f-HAL，将散度视角扩展到包括可验证奖励的强化学习（RLVR）等一般对齐设置中。理论分析保证这些方法在对齐后提高平均奖励，实验结果表明它们在RLVR和安全对齐任务中表现出优于现有方法的性能和灵活性。
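
The "variational representation of f-divergences" mentioned in the abstract is the standard identity D_f(P || Q) = sup_T { E_P[T(x)] - E_Q[f*(T(x))] }, where f* is the convex conjugate of f. The numerical check below illustrates it for the KL divergence only, where f(u) = u log u, f*(t) = exp(t - 1), and the optimal critic is T*(x) = 1 + log(p(x)/q(x)). It is a sketch of the underlying identity on a toy discrete distribution, not of the f-GRPO or f-HAL objectives, whose exact form the abstract does not give; the function names and example distributions are assumptions made for illustration.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Exact KL(P || Q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def variational_kl(p: Sequence[float], q: Sequence[float], critic: Sequence[float]) -> float:
    """Lower bound E_P[T] - E_Q[f*(T)] with f*(t) = exp(t - 1) (the KL case)."""
    e_p = sum(pi * t for pi, t in zip(p, critic))
    e_q = sum(qi * math.exp(t - 1.0) for qi, t in zip(q, critic))
    return e_p - e_q


p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]

# The bound is tight at the optimal critic T*(x) = 1 + log(p(x) / q(x)).
optimal_critic = [1.0 + math.log(pi / qi) for pi, qi in zip(p, q)]
tight = variational_kl(p, q, optimal_critic)

# Any other critic, such as the all-zero one, gives a value no larger than KL.
loose = variational_kl(p, q, [0.0, 0.0, 0.0])
```

In learned settings the critic is a parameterized function optimized to tighten the bound, which is what lets an objective of this shape act as a divergence estimator.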

Safety Subspaces are Not Linearly Distinct: A Fine-Tuning Case Study

Authors: Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma

Venue: ICLR 2026

First: 2025-05-20T10:41:49+00:00 · Latest: 2026-02-09T18:32:52+00:00

Comments: ICLR 2026. Kaustubh Ponkshe, Shaan Shah, and Raghav Singhal contributed equally to this work

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. However, this behavior is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this perspective. We examine whether safety-relevant behavior is concentrated in specific linear subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in activations. Across both weight and activation spaces, our findings are consistent: subspaces that amplify safe behaviors also amplify useful ones, and prompts with different safety implications activate overlapping representations. Rather than residing in distinct directions, we show that safety is highly entangled with the general learning components of the model. This suggests that subspace-based defenses face fundamental limitations and underscores the need for alternative strategies to preserve safety under continued training. We corroborate these findings with multiple experiments on five open-source LLMs from the Llama and Qwen families. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.

中文标题/摘要

标题：安全子空间并非线性独立：一项微调案例研究

大型语言模型（LLMs）依赖于安全对齐以生成社会上可接受的响应。然而，这种行为已知是脆弱的：即使在良性或轻微污染的数据上进行进一步微调，也可能损害安全并重新引入有害行为。越来越多的研究表明，对齐可能对应于权重空间中的可识别方向，形成可以隔离或保留以防御对齐失效的子空间。在本研究中，我们进行了全面的经验研究，探讨安全相关行为是否集中在特定的线性子空间中，是否可以与通用学习分离，以及有害性是否源自激活中的可区分模式。在权重和激活空间中，我们的发现是一致的：放大安全行为的子空间也放大了有用的行为，具有不同安全含义的提示激活了重叠的表示。我们证明，安全性与模型的一般学习组件高度交织，而不是存在于不同的方向中。这表明基于子空间的防御面临根本局限，强调了在持续训练下保护安全的替代策略的必要性。我们通过Llama和Qwen家族的五个开源LLM进行了多项实验来验证这些发现。我们的代码可在以下网址获取：https://github.com/CERT-Lab/safety-subspaces。

Summary / 总结

This study investigates the effectiveness of safety subspaces in large language models (LLMs) for maintaining safety under fine-tuning. The research finds that safety-relevant behaviors are not confined to distinct linear subspaces but are instead intertwined with general learning components. This suggests that subspace-based safety defenses may be limited and that alternative strategies are needed to preserve safety during continued training. Experiments on five open-source LLMs from the Llama and Qwen families support these findings.

研究探讨了安全性子空间在大型语言模型（LLMs）中维持安全行为的有效性。研究发现，与安全相关的行为并不是集中在特定的线性子空间中，而是与通用学习组件高度交织。进一步微调会降低安全性，不同安全含义的提示会激活重叠的表示。这些发现表明，基于子空间的防御可能在持续训练中难以根本性地保持安全性，需要寻找替代策略。

The Refutability Gap: Challenges in Validating Reasoning by Large Language Models

Authors: Elchanan Mossel

First: 2025-12-18T14:42:03+00:00 · Latest: 2026-02-09T18:32:44+00:00

Comments: The authors explicitly reserve all rights in this work. No permission is granted for the reproduction, storage, or use of this document for the purpose of training artificial intelligence systems or for text and data mining (TDM), including but not limited to the generation of embeddings, summaries, or synthetic derivatives

Abs · PDF · Code1 · Code2

Abstract

Recent reports claim that Large Language Models (LLMs) have achieved the ability to derive new science and exhibit human-level general intelligence. We argue that such claims are not rigorous scientific claims, as they do not satisfy Popper's refutability principle (often termed falsifiability), which requires that scientific statements be capable of being disproven. We identify several methodological pitfalls in current AI research on reasoning, including the inability to verify the novelty of findings due to opaque and non-searchable training data, the lack of reproducibility caused by continuous model updates, and the omission of human-interaction transcripts, which obscures the true source of scientific discovery. Additionally, the absence of counterfactuals and data on failed attempts creates a selection bias that may exaggerate LLM capabilities. To address these challenges, we propose guidelines for scientific transparency and reproducibility for research on reasoning by LLMs. Establishing such guidelines is crucial for both scientific integrity and the ongoing societal debates regarding fair data usage.

中文标题/摘要

标题：反驳缺口：大型语言模型验证推理能力的挑战

近期报告称，大型语言模型（LLMs）已具备推导新科学知识和展现人类级通用智能的能力。我们认为，此类声明并非严谨的科学声明，因为它们未能满足波普尔的可反驳性原则（通常称为可证伪性），该原则要求科学陈述能够被证伪。我们指出了当前AI研究中推理方法论上的几个陷阱，包括由于不透明且不可搜索的训练数据导致无法验证发现的新颖性，由于模型持续更新导致的不可再现性，以及省略了人类互动记录，这掩盖了科学发现的真实来源。此外，缺乏反事实和失败尝试的数据造成了选择偏差，可能夸大了LLM的能力。为应对这些挑战，我们提出了关于LLM推理研究的科学透明性和可再现性的指导原则。建立此类指导原则对于科学诚信以及持续的社会讨论关于公平数据使用都至关重要。

Summary / 总结

The paper argues that claims about Large Language Models (LLMs) deriving new science or achieving human-level general intelligence are not scientifically rigorous because they fail Popper's refutability principle. It identifies methodological issues such as opaque training data, continuous model updates, the omission of human-interaction transcripts, and selection bias from unreported failed attempts. The authors propose guidelines for scientific transparency and reproducibility to address these challenges, emphasizing their importance for both scientific integrity and societal debates on fair data usage.

论文认为关于大型语言模型（LLMs）达到人类级通用智能的声明缺乏科学严谨性，因为缺乏可反驳性。它指出了方法论问题，如不透明的训练数据、模型的持续更新以及省略了人类互动记录。作者提出了科学透明性和可重复性的指导方针来解决这些问题，强调这对科学诚信和社会关于公平数据使用的辩论的重要性。

Beyond Transcripts: A Renewed Perspective on Audio Chaptering

Authors: Fabian Retkowski, Maike Züfle, Thai Binh Nguyen, Jan Niehues, Alexander Waibel

First: 2026-02-09T18:28:10+00:00 · Latest: 2026-02-09T18:28:10+00:00

Abs · PDF · Code1 · Code2

Abstract

Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and transcript-free evaluation. We address these gaps through three contributions: (1) a systematic comparison between text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs; (2) empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following, yet MLLMs are promising on shorter audio.

中文标题/摘要

标题：超越脚本：音频章节化的新视角

音频章节化，即自动将长音频分割成连贯部分的任务，对于导航播客、讲座和视频越来越重要。尽管其重要性，研究仍局限于基于文本的方法，留下了关于利用音频信息、处理ASR错误以及无脚本评估的关键问题未解。我们通过三个贡献来填补这些空白：（1）系统比较基于文本的模型与声学特征、一种新颖的仅基于音频的架构（AudioSeg）以及多模态LLM；（2）分析影响性能的因素，包括脚本质量、声学特征、时长和演讲者组成；（3）正式化的评估协议，对比依赖脚本的文本空间协议与无脚本的时间空间协议。我们在YTSeg上的实验表明，AudioSeg显著优于基于文本的方法，停顿提供了最大的声学增益，而MLLMs受限于上下文长度和指令遵循能力，但在较短音频上仍具潜力。

Summary / 总结

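The paper addresses the limitations of text-based approaches in audio chaptering by introducing a novel audio-only architecture (AudioSeg) and comparing it against text-based models with acoustic features and multimodal LLMs (MLLMs). It also formalizes evaluation protocols that contrast transcript-dependent text-space scoring with transcript-invariant time-space scoring. On YTSeg, AudioSeg substantially outperforms text-based approaches, pauses provide the largest acoustic gains, and MLLMs remain limited by context length and weak instruction following, though they are promising on shorter audio.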

研究通过引入新型的仅音频模型（AudioSeg）并将其与基于文本的方法和多模态LLM进行比较，解决了音频章节化中基于文本方法的局限性。研究评估了转录质量、声学特征和说话人组成等因素，并提出了新的评估协议。实验表明，AudioSeg在性能上优于基于文本的方法，音频中的停顿提供了显著的性能增益。虽然多模态LLM有潜力，但它们仍然受限于上下文长度和指令遵循能力，但在较短的音频片段上表现出潜力。
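
A transcript-invariant, time-space evaluation of the kind the abstract describes compares predicted chapter boundaries with reference boundaries directly in seconds, so no transcript is needed. The abstract does not specify the paper's metrics; the boundary F1 with a tolerance window below is a common segmentation measure used here as an assumed example, and `boundary_f1`, the 5-second default tolerance, and the greedy one-to-one matching are illustrative choices rather than the paper's protocol.

```python
from typing import List, Sequence


def boundary_f1(
    predicted: Sequence[float],
    reference: Sequence[float],
    tolerance: float = 5.0,  # assumed matching window in seconds
) -> float:
    """F1 over chapter boundaries (in seconds), greedy one-to-one matching."""
    if not predicted or not reference:
        return 0.0
    unmatched: List[float] = sorted(reference)
    hits = 0
    for t in sorted(predicted):
        # Match each prediction to the earliest unused reference within tolerance.
        match = next((r for r in unmatched if abs(r - t) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Greedy matching is simple but not guaranteed optimal when boundaries are closer together than the tolerance; an optimal assignment would be needed for densely chaptered audio.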

Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs

Authors: Junyi Jessy Li, Yang Janet Liu, Kanishka Misra, Valentina Pyatkin, William Sheffield

First: 2026-02-02T22:35:29+00:00 · Latest: 2026-02-09T18:21:16+00:00

Comments: accepted to the TeachNLP 2026 workshop (co-located with EACL 2026), camera-ready, 14 pages; aclpubcheck fixed and ref updated

Abs · PDF · Code1 · Code2

Abstract

The field of NLP has undergone vast, continuous transformations over the past few years, sparking debates going beyond discipline boundaries. This begs important questions in education: how do we design courses that bridge sub-disciplines in this shifting landscape? This paper explores this question from the angle of discourse processing, an area with rich linguistic insights and computational models for the intentional, attentional, and coherence structure of language. Discourse is highly relevant for open-ended or long-form text generation, yet this connection is under-explored in existing undergraduate curricula. We present a new course, "Computational Discourse and Natural Language Generation". The course is collaboratively designed by a team with complementary expertise and was offered for the first time in Fall 2025 as an upper-level undergraduate course, cross-listed between Linguistics and Computer Science. Our philosophy is to deeply integrate the theoretical and empirical aspects, and create an exploratory mindset inside the classroom and in the assignments. This paper describes the course in detail and concludes with takeaways from an independent survey as well as our vision for future directions.

中文标题/摘要

标题：哪门课程？话语！在大语言模型时代教授话语与生成

过去几年，自然语言处理领域经历了广泛而持续的变革，引发了跨学科的讨论。这提出了重要的教育问题：在这一不断变化的背景下，我们如何设计能够跨越子学科的课程？本文从话语处理的角度探讨了这一问题，这是一个富含语言学洞察和计算模型的领域，用于语言的意图、注意力和连贯结构。话语对于开放生成或长文本生成至关重要，但在现有的本科课程中，这一联系尚未得到充分探索。我们提出了一门新的课程“计算话语与自然语言生成”。该课程由具有互补专长的团队共同设计，并于2025年秋季首次作为高年级本科课程开设，跨列在语言学和计算机科学之间。我们的理念是深入整合理论和实证方面，并在课堂内外培养探索性思维。本文详细描述了该课程，并以独立调查的结果和对未来方向的展望作为结论。

Summary / 总结

This paper asks how to design courses that bridge sub-disciplines in the fast-changing field of NLP. It presents a new course, 'Computational Discourse and Natural Language Generation', which connects discourse processing to open-ended and long-form text generation. The course was designed collaboratively by a team with complementary expertise and first offered in Fall 2025 as an upper-level undergraduate course cross-listed between Linguistics and Computer Science. Its philosophy is to integrate theoretical and empirical aspects deeply and to foster an exploratory mindset in the classroom and in assignments. The paper closes with takeaways from an independent survey and a vision for future directions.

本文探讨了在不断变化的NLP领域中如何设计跨越子学科的课程。它介绍了一门新课程“计算话语与自然语言生成”，重点在于话语处理及其与开放式和长文本生成的关联。该课程由具有互补专长的团队共同设计，于2025年秋季首次作为高年级本科课程开设，由语言学和计算机科学交叉开设。课程理念是深度融合理论与实证，并在课堂和作业中培养探索性思维。文章最后给出了一项独立调查的要点以及对未来方向的展望。

ABBA-Adapters: Efficient and Expressive Fine-Tuning of Foundation Models

Authors: Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma

Venue: ICLR 2026

First: 2025-05-20T11:43:25+00:00 · Latest: 2026-02-09T18:14:19+00:00

Comments: ICLR 2026. Raghav Singhal, Kaustubh Ponkshe, and Rohit Vartak contributed equally to this work

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget, a property we validate through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: https://github.com/CERT-Lab/abba.

中文标题/摘要

标题：ABBA-适配器：高效且富有表现力的基础模型微调

大型语言模型在广泛的任务中表现出强大的性能，但如何高效地将它们适应到新的领域仍然是一个关键挑战。参数高效微调（PEFT）方法通过引入轻量级、可训练的模块来解决这一问题，同时保持大部分预训练权重不变。目前占主导地位的方法LoRA使用低秩分解来建模更新，但其表现力受到秩的固有限制。最近的方法HiRA通过结合与冻结权重的哈达玛积来增加表现力，但仍依赖于预训练模型的结构。我们提出了ABBA，一种新的PEFT架构，将更新重新参数化为两个独立可学习低秩矩阵的哈达玛积。与先前工作不同，ABBA完全解耦了更新与预训练权重，使得两个组件可以自由优化。这在相同参数预算下实现了显著更高的表现力，我们通过矩阵重构实验验证了这一点。实证上，ABBA在算术和常识推理基准测试中取得了最先进的结果，相对于现有PEFT方法在多个模型上表现出显著的优越性。我们的代码已公开：https://github.com/CERT-Lab/abba.

Summary / 总结

The research aims to improve the efficiency and expressiveness of fine-tuning large language models for new domains. ABBA-Adapters reparameterize the update as a Hadamard product of two independently learnable low-rank matrices, decoupling the update from the pre-trained weights. This leads to higher expressivity under the same parameter budget and achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, outperforming existing methods by a significant margin.

研究旨在提高大型语言模型在新领域中的微调效率和表达能力。ABBA-Adapters将更新重新参数化为两个独立可学习的低秩矩阵的哈达玛积，使更新与预训练权重脱钩。这在相同参数预算下实现了更高的表达能力，并在算术和常识推理基准测试中取得了最先进的结果，显著优于现有方法。
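
The ABBA update described above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: it assumes the update has the form (B1·A1) ⊙ (B2·A2) with each factor pair of rank r, ignores any scaling or initialization, and uses small random integer matrices. It relies on the standard fact that rank(X ⊙ Y) ≤ rank(X)·rank(Y), so the Hadamard product of two rank-r matrices can reach rank r², whereas a single low-rank product with the same number of parameters (rank 2r under this sketch's accounting) is capped at 2r. All function names are hypothetical.

```python
import random
from fractions import Fraction


def matmul(X, Y):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def hadamard(X, Y):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rank(M):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in M]
    rk = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            if rows[i][c] != 0:
                factor = rows[i][c] / rows[rk][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


def rand_matrix(rng, m, n):
    return [[rng.randint(-3, 3) for _ in range(n)] for _ in range(m)]


def abba_update(B1, A1, B2, A2):
    """Assumed form of the update: Hadamard product of two low-rank products."""
    return hadamard(matmul(B1, A1), matmul(B2, A2))


rng = random.Random(0)
m = n = 12   # toy weight-matrix shape
r = 3        # rank of each ABBA factor pair

B1, A1 = rand_matrix(rng, m, r), rand_matrix(rng, r, n)
B2, A2 = rand_matrix(rng, m, r), rand_matrix(rng, r, n)
delta_abba = abba_update(B1, A1, B2, A2)   # 2 * r * (m + n) parameters

# A single low-rank product with the same parameter count has rank at most 2r.
B, A = rand_matrix(rng, m, 2 * r), rand_matrix(rng, 2 * r, n)
delta_lora = matmul(B, A)

print("ABBA-style update rank:", rank(delta_abba), "(upper bound", r * r, ")")
print("Single low-rank update rank:", rank(delta_lora), "(upper bound", 2 * r, ")")
```

For r = 3 the bounds are 9 versus 6, and the gap widens quadratically as r grows, which is one way to read the abstract's claim of higher expressivity under the same parameter budget.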

Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models

Authors: Sercan Karakaş

First: 2026-01-30T23:00:04+00:00 · Latest: 2026-02-09T18:11:06+00:00

Abs · PDF · Code1 · Code2

Abstract

This study evaluates whether state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns. We construct a balanced evaluation set of 100 Turkish sentences that systematically pit local against non-local antecedents for the reflexives kendi and kendisi. We compare two contrasting systems: an OpenAI chain-of-thought model optimized for multi-step reasoning and Trendyol-LLM-7B-base-v0.1, a LLaMA 2 derived model extensively fine-tuned on Turkish data. Antecedent choice is assessed using a combined paradigm that integrates sentence-level perplexity with a forced-choice comparison between minimally differing continuations. Overall, Trendyol-LLM favors local bindings in approximately 70 percent of trials, exhibiting a robust locality bias consistent with a preference for structurally proximate antecedents. By contrast, the OpenAI model (o1 Mini) distributes its choices nearly evenly between local and long-distance readings, suggesting weaker or less consistent sensitivity to locality in this binding configuration. Taken together, these results reveal a marked contrast in binding behavior across the two systems and motivate closer analysis of how model architecture, training data, and inference-time reasoning strategies shape the representation of Turkish anaphoric dependencies.

中文标题/摘要

标题：句内还是句外？测试适应型与思维链大型语言模型中的土耳其语反身代词绑定关系

本研究评估了最先进的大型语言模型是否捕捉到了土耳其语反身代词的绑定关系。我们构建了一个平衡的评估集，包含100个土耳其句子，系统地将本地先行词与非本地先行词对反身代词kendi和kendisi进行对比。我们比较了两种截然不同的系统：一个由OpenAI优化的思维链模型，适用于多步推理，以及一个名为Trendyol-LLM-7B-base-v0.1的LLaMA 2衍生模型，该模型在土耳其数据上进行了广泛微调。先行词的选择通过结合句子级困惑度与对最小差异延续的强制选择比较的综合范式进行评估。总体而言，Trendyol-LLM在约70%的试验中倾向于本地绑定，表现出对结构上接近的先行词的稳健偏好。相比之下，OpenAI模型（o1 Mini）在本地和长距离解读之间几乎均匀地分配其选择，表明在这一绑定配置中对局部性的敏感性较弱或不一致。综上所述，这些结果揭示了两种系统在绑定行为上的显著差异，并促使我们更深入地分析模型架构、训练数据和推理时间的推理策略如何塑造土耳其语指代依赖关系的表示。

Summary / 总结

This study tests whether large language models capture the binding relations of the Turkish reflexives kendi and kendisi by comparing an OpenAI chain-of-thought model (o1 Mini) with Trendyol-LLM-7B-base-v0.1, a LLaMA 2 derived model fine-tuned on Turkish data. On a balanced set of 100 sentences that pit local against non-local antecedents, antecedent choice is assessed by combining sentence-level perplexity with a forced-choice comparison between minimally differing continuations. Trendyol-LLM favors local bindings in approximately 70 percent of trials, while the OpenAI model splits its choices nearly evenly between local and long-distance readings. The contrast motivates closer analysis of how model architecture, training data, and inference-time reasoning shape the representation of Turkish anaphoric dependencies.

研究构建了包含100个句子的平衡评估集，评估大型语言模型对土耳其语反身代词kendi和kendisi绑定关系的处理。研究对比了OpenAI思维链模型（o1 Mini）和Trendyol-LLM-7B-base-v0.1，发现Trendyol-LLM在约70%的试验中倾向于局部绑定，而OpenAI模型的选择在局部和长距离解读之间几乎均匀分布。这一差异促使人们进一步分析模型架构、训练数据和推理时的推理策略如何影响土耳其语指代依赖的表示。
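
A perplexity-based forced choice of the kind described above can be sketched as follows. This is a simplified illustration under stated assumptions: the per-token log-probabilities are made-up numbers rather than model outputs, the function names are hypothetical, and the abstract does not say exactly how the paper combines perplexity with the forced-choice comparison. Here the continuation with the lower perplexity simply wins each trial.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def forced_choice(local_logprobs, nonlocal_logprobs):
    """Pick the reading whose continuation the model finds less surprising."""
    if perplexity(local_logprobs) <= perplexity(nonlocal_logprobs):
        return "local"
    return "non-local"


def local_preference_rate(trials):
    """Fraction of trials in which the local-antecedent continuation wins."""
    choices = [forced_choice(local, nonlocal_) for local, nonlocal_ in trials]
    return choices.count("local") / len(choices)


# Each trial pairs the log-probs of two minimally differing continuations.
# The numbers are invented for illustration only.
toy_trials = [
    ([-0.4, -0.9, -0.3], [-0.6, -1.5, -0.8]),
    ([-1.2, -0.7, -1.1], [-0.5, -0.6, -0.4]),
    ([-0.2, -0.3, -0.9], [-0.9, -1.0, -1.4]),
]
print(local_preference_rate(toy_trials))
```

In a real evaluation the log-probabilities would come from scoring each full sentence with the model under test, and the resulting rate is the kind of quantity behind a statement such as "favors local bindings in approximately 70 percent of trials".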

WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

Authors: Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Xin Zhang, Yinzhou Tang, Chen Gao, Wei Wu, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, Yong Li

First: 2026-02-09T18:09:20+00:00 · Latest: 2026-02-09T18:09:20+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

While world models have emerged as a cornerstone of embodied intelligence by enabling agents to reason about environmental dynamics through action-conditioned prediction, their evaluation remains fragmented. Current evaluation of embodied world models has largely focused on perceptual fidelity (e.g., video generation quality), overlooking the functional utility of these models in downstream decision-making tasks. In this work, we introduce WorldArena, a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners integrating with subjective human evaluation. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability. WorldArena benchmark with the public leaderboard is released at https://worldarena.ai, providing a framework for tracking progress toward truly functional world models in embodied AI.

中文标题/摘要

标题：WorldArena：评估具身世界模型感知与功能实用性的统一基准

尽管世界模型已成为具身智能的基石，通过动作条件下的预测使代理能够推理环境动力学，但其评估仍然支离破碎。当前对具身世界模型的评估主要集中在感知保真度（例如，视频生成质量）上，忽视了这些模型在下游决策任务中的功能实用性。在本文中，我们引入了WorldArena，这是一种统一基准，旨在系统地从感知和功能两个维度评估具身世界模型。WorldArena 通过三个维度评估模型：视频感知质量，通过六个子维度下的 16 个指标进行测量；具身任务功能，评估世界模型作为数据引擎、策略评估器和动作规划器的能力，并结合主观的人类评估。此外，我们提出了EWMScore，这是一种将多维度性能整合为单一可解释指数的综合指标。通过对 14 个代表性模型的广泛实验，我们揭示了显著的感知-功能差距，表明高视觉质量并不一定转化为强大的具身任务能力。WorldArena 基准及其公开排行榜已发布于 https://worldarena.ai，为跟踪具身人工智能中真正具备功能性的世界模型的进展提供了框架。

Summary / 总结

WorldArena is a unified benchmark to evaluate the perceptual and functional utility of embodied world models. It assesses models through video perception quality using 16 metrics and embodied task functionality, including data engine, policy evaluator, and action planner capabilities, with subjective human evaluation. The benchmark reveals a significant gap between high visual quality and strong embodied task capability, indicating that perceptual fidelity does not always correlate with functional utility. The EWMScore metric combines these dimensions into a single interpretable index. WorldArena is publicly available at https://worldarena.ai to track progress in embodied AI.

WorldArena 是一个统一基准，用于评估具身世界模型的感知和功能实用性。它通过 16 个指标评估视频感知质量，并通过数据引擎、策略评估器和行动规划器能力等维度评估具身任务功能，同时包含主观的人类评估。基准揭示了高视觉质量与强大具身任务能力之间存在显著差距，表明感知保真度并不总是与功能性实用性相关。EWMScore 指标将这些维度综合成一个可解释的指数。WorldArena 公开发布在 https://worldarena.ai，以跟踪具身 AI 中真正功能性世界模型的进步。

stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation

Authors: Lucas Maes, Quentin Le Lidec, Dan Haramati, Nassim Massaudi, Damien Scieur, Yann LeCun, Randall Balestriero

First: 2026-02-09T18:04:22+00:00 · Latest: 2026-02-09T18:04:22+00:00

Abs · PDF · Code1 · Code2

Abstract

World Models have emerged as a powerful paradigm for learning compact, predictive representations of environment dynamics, enabling agents to reason, plan, and generalize beyond direct experience. Despite recent interest in World Models, most available implementations remain publication-specific, severely limiting their reusability, increasing the risk of bugs, and reducing evaluation standardization. To mitigate these issues, we introduce stable-worldmodel (SWM), a modular, tested, and documented world-model research ecosystem that provides efficient data-collection tools, standardized environments, planning algorithms, and baseline implementations. In addition, each environment in SWM enables controllable factors of variation, including visual and physical properties, to support robustness and continual learning research. Finally, we demonstrate the utility of SWM by using it to study zero-shot robustness in DINO-WM.

中文标题/摘要

标题：稳定的世界模型v1：可重复的世界建模研究与评估

世界模型已成为一种强大的范式，用于学习环境动力学的紧凑、预测性表示，使智能体能够推理、规划并超越直接经验进行泛化。尽管最近对世界模型的兴趣增加，但大多数可用实现仍具有出版物特定性，严重限制了其可重用性，增加了错误的风险，并降低了评估标准化。为缓解这些问题，我们引入了稳定的世界模型（SWM），这是一个模块化、经过测试和文档化的世界建模研究生态系统，提供了高效的数据收集工具、标准化环境、规划算法和基线实现。此外，SWM 中的每个环境都支持鲁棒性和持续学习研究，允许控制变化因素，包括视觉和物理属性。最后，我们通过使用SWM 研究DINO-WM 的零样本鲁棒性来展示SWM 的实用性。

Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning

Authors: John Gardiner, Orlando Romero, Brendan Tivnan, Nicolò Dal Fabbro, George J. Pappas

First: 2026-02-09T18:01:40+00:00 · Latest: 2026-02-09T18:01:40+00:00

Abs · PDF · Code1 · Code2

Abstract

The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics which posit that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those that only use shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games that are treated as black box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).

中文标题/摘要

标题：通过量子纠缠学习协调在多智能体强化学习中的应用

在多智能体强化学习（MARL）中，无法通信是协调的主要挑战。先前的工作探索了通过共享随机性关联局部策略的方法，有时以相关设备的形式，作为辅助去中心化决策机制。与此相反，本文引入了第一个利用共享量子纠缠作为协调资源的MARL代理训练框架，这使得可以使用比仅共享随机性更多的无通信关联策略。这受到量子物理学中已知结果的启发，即对于某些无通信的单轮合作博弈，共享量子纠缠能够实现优于仅使用共享随机性的策略。在这种情况下，我们说存在量子优势。我们的框架基于一种新颖的可微分策略参数化，能够优化量子测量，以及一种新颖的策略架构，将联合策略分解为量子协调器和分散的本地执行者。为了说明我们提出的方法的有效性，我们首先展示了如何仅从经验中学习在单轮博弈中实现量子优势的策略，这些博弈被视为黑盒预言机（black box oracles）。然后，我们展示了我们的机制如何在作为去中心化部分可观测马尔可夫决策过程（Dec-POMDP）提出的示例多智能体顺序决策问题中学习具有量子优势的策略。

Summary / 总结

This work addresses the challenge of coordination in multi-agent reinforcement learning (MARL) without communication by introducing a framework that leverages shared quantum entanglement. Motivated by quantum physics results showing quantum advantage in certain cooperative games, the authors propose a differentiable policy parameterization and a policy architecture that decomposes joint policies into a quantum coordinator and local actors. Key findings include learning strategies that achieve quantum advantage in single-round games and in a multi-agent sequential decision-making problem formulated as a Dec-POMDP.

本文通过引入一种新的框架，利用量子纠缠来促进多智能体强化学习（MARL）中的无通信协调，解决了多智能体协调的挑战。该方法使用可微分的策略参数化和一种新的策略架构来优化量子测量和局部动作。关键发现包括能够在单轮游戏中以及在作为去中心化部分可观测马尔可夫决策过程（Dec-POMDP）提出的序列决策问题中学习实现量子优势的策略。
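
The notion of quantum advantage in a communication-free, single-round cooperative game can be illustrated with the CHSH game, a standard textbook example. The choice of CHSH is an assumption made here for illustration; the abstract does not name the games the paper uses, and this sketch evaluates fixed strategies rather than learning them. In CHSH, two players receive bits x and y and win if their output bits satisfy a XOR b = x AND y. The quantum strategy below assumes a shared Bell state (|00⟩ + |11⟩)/√2 measured in real rotated bases, for which the outputs agree with probability cos²(θ_A − θ_B).

```python
import math
from itertools import product


def classical_best():
    """Best win rate over all deterministic strategies a = f(x), b = g(y).

    Shared randomness only mixes deterministic strategies, so it cannot
    exceed this maximum.
    """
    best = 0.0
    for a0, a1, b0, b1 in product((0, 1), repeat=4):
        wins = sum(
            ((a0, a1)[x] ^ (b0, b1)[y]) == (x & y)
            for x, y in product((0, 1), repeat=2)
        )
        best = max(best, wins / 4)
    return best


def quantum_win_rate(alice_angles, bob_angles):
    """Win rate when both players measure a shared Bell state.

    alice_angles[x] and bob_angles[y] are measurement-basis angles in radians.
    """
    total = 0.0
    for x, y in product((0, 1), repeat=2):
        p_same = math.cos(alice_angles[x] - bob_angles[y]) ** 2
        # Outputs must differ when x = y = 1 and agree otherwise.
        total += (1.0 - p_same) if (x & y) else p_same
    return total / 4


classical = classical_best()
quantum = quantum_win_rate(
    alice_angles=(0.0, math.pi / 4),
    bob_angles=(math.pi / 8, -math.pi / 8),
)
print(f"classical: {classical:.4f}, quantum: {quantum:.4f}")
```

With these angles every input pair is won with probability cos²(π/8), roughly 0.854, against a classical ceiling of 0.75. A learning framework of the kind the abstract describes would have to discover such measurement angles from reward alone.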

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Authors: Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli

First: 2026-02-09T18:00:28+00:00 · Latest: 2026-02-09T18:00:28+00:00

Abs · PDF · Code1 · Code2

Abstract

Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world toward a goal state. Behaviourally, we evaluate the agent against an optimal policy across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and complex goal structures. We then use probing methods to decode the agent's internal representations of the environment state and its multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map of the environment, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from broader environment structural cues toward information supporting immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.

Summary / 总结

The study proposes a framework for evaluating goal-directedness that combines behavioural evaluation with interpretability-based analyses of a model's internal representations. As a case study, an LLM agent navigates a 2D grid world toward a goal state and is compared against an optimal policy across grid sizes, obstacle densities, and goal structures; its performance scales with task difficulty while remaining robust to difficulty-preserving transformations and complex goal structures. Probing shows that the agent non-linearly encodes a coarse spatial map of the environment, that its actions are broadly consistent with these representations, and that reasoning reorganises them, shifting from broader structural cues toward information supporting immediate action selection. The authors conclude that behavioural evaluation alone is insufficient and that introspective examination is needed to characterise how agents represent and pursue their objectives.

研究旨在通过结合行为评估和基于可解释性的内部表示分析，开发一种评估语言模型代理目标导向性的框架。研究考察了一个在2D网格世界中导航的LLM代理，发现其性能随任务难度而变化，但对保持难度不变的变换和复杂的目标结构具有鲁棒性。探针方法表明，代理以非线性方式编码了环境的粗略空间地图，其行动大致与这些内部表示一致，而推理会重组这些表示，使其从更广泛的环境结构线索转向支持即时行动选择的信息。
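
Evaluating an agent against an optimal policy in a grid world requires a reference for the optimal behaviour. One common way to obtain it, sketched below, is breadth-first search for the shortest path length. The grid encoding (`#` for obstacles), the 4-connected move set, and the `optimality_ratio` score are assumptions for illustration; the abstract does not describe the paper's environment or metric in this detail.

```python
from collections import deque


def shortest_path_length(grid, start, goal):
    """Fewest moves from start to goal on a 4-connected grid.

    grid is a list of equal-length strings where '#' marks an obstacle.
    Returns None if the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (
                0 <= nr < rows
                and 0 <= nc < cols
                and grid[nr][nc] != "#"
                and (nr, nc) not in seen
            ):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def optimality_ratio(agent_steps, optimal_steps):
    """1.0 means the agent matched the optimal path; lower is worse."""
    return optimal_steps / agent_steps


grid = [
    "S.#",
    ".#.",
    "..G",
]
optimal = shortest_path_length(grid, start=(0, 0), goal=(2, 2))
# Suppose a hypothetical agent needed 6 moves on this grid.
print(optimal, optimality_ratio(agent_steps=6, optimal_steps=optimal))
```

Sweeping grid size and obstacle density while recording such a ratio is one simple way to obtain behavioural curves of the kind the summary refers to.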

Modeling 3D Pedestrian-Vehicle Interactions for Vehicle-Conditioned Pose Forecasting

Authors: Guangxun Zhu, Xuan Liu, Nicolas Pugeault, Chongfeng Wei, Edmond S. L. Ho

Venue: ICRA

First: 2026-02-09T17:58:53+00:00 · Latest: 2026-02-09T17:58:53+00:00

Comments: Accepted for IEEE International Conference on Robotics and Automation (ICRA) 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Accurately predicting pedestrian motion is crucial for safe and reliable autonomous driving in complex urban environments. In this work, we present a 3D vehicle-conditioned pedestrian pose forecasting framework that explicitly incorporates surrounding vehicle information. To support this, we enhance the Waymo-3DSkelMo dataset with aligned 3D vehicle bounding boxes, enabling realistic modeling of multi-agent pedestrian-vehicle interactions. We introduce a sampling scheme to categorize scenes by pedestrian and vehicle count, facilitating training across varying interaction complexities. Our proposed network adapts the TBIFormer architecture with a dedicated vehicle encoder and pedestrian-vehicle interaction cross-attention module to fuse pedestrian and vehicle features, allowing predictions to be conditioned on both historical pedestrian motion and surrounding vehicles. Extensive experiments demonstrate substantial improvements in forecasting accuracy and validate different approaches for modeling pedestrian-vehicle interactions, highlighting the importance of vehicle-aware 3D pose prediction for autonomous driving. Code is available at: https://github.com/GuangxunZhu/VehCondPose3D

中文标题/摘要

标题：基于车辆条件的3D行人姿态预测模型：3D行人-车辆交互建模

准确预测行人运动对于在复杂城市环境中实现安全可靠的自动驾驶至关重要。本文提出了一种基于车辆条件的3D行人姿态预测框架，明确地将周围车辆信息纳入其中。为此，我们扩展了Waymo-3DSkelMo数据集，添加了对齐的3D车辆边界框，使多智能体行人-车辆交互的现实建模成为可能。我们引入了一种采样方案，根据行人和车辆数量对场景进行分类，便于在不同交互复杂性下进行训练。我们提出的网络在TBIFormer架构的基础上添加了专门的车辆编码器和行人-车辆交互交叉注意力模块，以融合行人和车辆特征，使预测能够同时基于历史行人运动和周围车辆。大量实验表明，预测准确性有了显著提高，并验证了不同行人-车辆交互建模方法的有效性，突显了车辆感知3D姿态预测对自动驾驶的重要性。代码可在：https://github.com/GuangxunZhu/VehCondPose3D 获取。

Summary / 总结

This work addresses the challenge of accurately predicting pedestrian motion in urban environments, which is essential for autonomous driving. The authors propose a 3D vehicle-conditioned pedestrian pose forecasting framework that integrates vehicle information to improve prediction accuracy. They enhance the Waymo-3DSkelMo dataset with 3D vehicle bounding boxes and introduce a scene categorization scheme based on pedestrian and vehicle counts. The proposed network, which includes a vehicle encoder and interaction cross-attention module, effectively fuses pedestrian and vehicle features. Experiments show significant improvements in forecasting accuracy, emphasizing the importance of vehicle-aware 3D pose prediction for autonomous driving.

该研究旨在通过纳入车辆信息来提高城市环境中自主驾驶的行人运动预测准确性。方法包括增强Waymo-3DSkelMo数据集以包含3D车辆边界框，并使用采样方案来分类场景。提出的网络修改了TBIFormer架构，包括车辆编码器和交互交叉注意力模块，以根据行人历史和周围车辆来条件化预测，从而在复杂交互中获得更好的预测准确性。
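
The pedestrian-vehicle cross-attention idea above can be sketched in plain Python. This is a bare single-head scaled dot-product attention in which pedestrian features act as queries and vehicle features act as keys and values. It omits the learned projections, multiple heads, and temporal structure that a real model such as the one described would have, and the feature vectors and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attention(ped_queries, veh_keys, veh_values):
    """For each pedestrian query, return a weighted mix of vehicle values."""
    scale = math.sqrt(len(veh_keys[0]))
    fused = []
    for query in ped_queries:
        weights = softmax([dot(query, key) / scale for key in veh_keys])
        fused.append([
            sum(w * value[d] for w, value in zip(weights, veh_values))
            for d in range(len(veh_values[0]))
        ])
    return fused


# Two pedestrians attend over two vehicles (toy 2-D features).
pedestrians = [[1.0, 0.0], [0.0, 1.0]]
vehicle_keys = [[1.0, 0.0], [0.0, 1.0]]
vehicle_values = [[10.0, 0.0], [0.0, 10.0]]
for row in cross_attention(pedestrians, vehicle_keys, vehicle_values):
    print([round(v, 3) for v in row])
```

Each pedestrian's output leans toward the vehicle whose key best matches its query, which is the mechanism that lets a forecast be conditioned on the surrounding vehicles.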
