arXiv 论文速递

Snapshot: 20260205_0342

Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing

Authors: Tong Zheng, Chengsong Huang, Runpeng Dai, Yun He, Rui Liu, Xin Ni, Huiwen Bao, Kaishen Wang, Hongtu Zhu, Jiaxin Huang, Furong Huang, Heng Huang

First: 2026-02-03T18:59:41+00:00 · Latest: 2026-02-03T18:59:41+00:00

Comments: 14 pages

Abs · PDF · Code1 · Code2

Abstract

Parallel thinking has emerged as a promising paradigm for reasoning, yet it imposes significant computational burdens. Existing efficiency methods primarily rely on local, per-trajectory signals and lack principled mechanisms to exploit global dynamics across parallel branches. We introduce 2D probing, an interface that exposes the width-depth dynamics of parallel thinking by periodically eliciting intermediate answers from all branches. Our analysis reveals three key insights: non-monotonic scaling across width-depth allocations, heterogeneous reasoning branch lengths, and early stabilization of global consensus. Guided by these insights, we introduce Parallel-Probe, a training-free controller designed to optimize online parallel thinking. Parallel-Probe employs consensus-based early stopping to regulate reasoning depth and deviation-based branch pruning to dynamically adjust width. Extensive experiments across three benchmarks and multiple models demonstrate that Parallel-Probe establishes a superior Pareto frontier for test-time scaling. Compared to standard majority voting, it reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.

中文标题/摘要

标题：平行探针：通过二维探针实现高效的并行思考

并行思考已成为一种有前景的推理范式，但同时也带来了显著的计算负担。现有提高效率的方法主要依赖于局部的、单条轨迹的信号，缺乏利用并行分支间全局动态的原理性机制。我们引入了二维探针，这是一种接口，通过周期性地从所有分支中提取中间答案来暴露并行思考的宽度-深度动态。我们的分析揭示了三个关键见解：宽度-深度分配的非单调扩展、推理分支长度的异质性以及全局共识的早期稳定。根据这些见解，我们提出了无需训练的控制器平行探针（Parallel-Probe），旨在优化在线并行思考。平行探针利用基于共识的提前停止来调节推理深度，并利用基于偏差的分支剪枝来动态调整宽度。在三个基准和多个模型上的广泛实验表明，平行探针为测试时扩展建立了更优的帕累托前沿。与标准多数投票相比，它将顺序令牌减少了最多35.8%，总令牌成本降低了超过25.8%，同时保持了具有竞争力的准确率。

Summary / 总结

The paper addresses the computational cost of parallel thinking by introducing 2D probing, which periodically elicits intermediate answers from all parallel branches to expose their width-depth dynamics. The analysis reveals non-monotonic scaling across width-depth allocations, heterogeneous branch lengths, and early stabilization of the global consensus. Based on these insights, Parallel-Probe, a training-free controller, optimizes online parallel thinking through consensus-based early stopping and deviation-based branch pruning. Across three benchmarks and multiple models, Parallel-Probe reduces sequential tokens by up to 35.8% and total token cost by over 25.8% compared to standard majority voting, while maintaining competitive accuracy.

论文通过引入2D探针来解决并行思考的计算挑战，该探针定期从所有分支收集中间答案以理解宽度-深度动态。关键洞察包括非单调缩放、异质分支长度以及早期共识稳定化。基于这些洞察，提出了Parallel-Probe来优化在线并行思考，通过共识驱动的早期停止和偏差驱动的分支修剪。实验表明，Parallel-Probe可将顺序令牌减少最多35.8%，总令牌成本降低超过25.8%，同时保持竞争力的准确性。
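
The two controls named in the Parallel-Probe abstract (consensus-based early stopping and deviation-based branch pruning) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `patience` parameter, and the rule "stop once the majority answer is unchanged for `patience` consecutive probes" are assumptions made only for illustration.

```python
from collections import Counter


def majority(answers):
    """Return the most common non-empty answer, or None if no branch has answered."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def probe_until_stable(probe_rounds, patience=2):
    """Stop once the majority answer is unchanged for `patience` consecutive probes.

    probe_rounds: list of rounds; each round holds one intermediate answer
    per branch (None if a branch has no answer yet).
    Returns (consensus answer, index of the round where probing stopped).
    """
    last, streak = None, 0
    for step, answers in enumerate(probe_rounds):
        current = majority(answers)
        if current is not None and current == last:
            streak += 1
        else:
            streak = 1 if current is not None else 0
        last = current
        if streak >= patience:
            return current, step
    return last, len(probe_rounds) - 1


def prune_deviating(answers):
    """Hypothetical pruning rule: keep undecided branches and those matching the majority."""
    consensus = majority(answers)
    return [i for i, a in enumerate(answers) if a is None or a == consensus]
```

In this sketch, stopping early bounds reasoning depth, while pruning shrinks width; how the paper measures deviation and stability may differ.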

Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes

Authors: Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, Sang Michael Xie

First: 2026-01-26T18:57:00+00:00 · Latest: 2026-02-03T18:58:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Typical reinforcement learning (RL) methods for LLM reasoning waste compute on hard problems, where correct on-policy traces are rare, policy gradients vanish, and learning stalls. To bootstrap more efficient RL, we consider reusing old sampling FLOPs (from prior inference or RL training) in the form of off-policy traces. Standard off-policy methods supervise against off-policy data, causing instabilities during RL optimization. We introduce PrefixRL, where we condition on the prefix of successful off-policy traces and run on-policy RL to complete them, side-stepping off-policy instabilities. PrefixRL boosts the learning signal on hard problems by modulating the difficulty of the problem through the off-policy prefix length. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more sample efficient. Empirically, we discover back-generalization: training only on prefixed problems generalizes to out-of-distribution unprefixed performance, with learned strategies often differing from those in the prefix. In our experiments, we source the off-policy traces by rejection sampling with the base model, creating a self-improvement loop. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data then RL), even after accounting for the compute spent on the initial rejection sampling, and increases the final reward by 3x. The gains transfer to held-out benchmarks, and PrefixRL is still effective when off-policy traces are derived from a different model family, validating its flexibility in practical settings.

中文标题/摘要

标题：重用你的FLOPs：通过以高度离策的前缀为条件在难题上扩展强化学习

典型的强化学习（RL）方法在LLM推理中会在难题上浪费计算资源：在这些难题上，正确的同策略（on-policy）轨迹罕见，策略梯度消失，学习停滞。为了更高效地启动RL，我们考虑以离策轨迹的形式重用旧的采样FLOPs（来自先前的推理或RL训练）。标准的离策方法直接用离策数据进行监督，导致RL优化过程中出现不稳定性。我们引入了PrefixRL：以成功的离策轨迹的前缀为条件，再运行同策略RL来补全这些轨迹，从而绕过离策不稳定性。PrefixRL通过调整离策前缀长度来调节问题的难度，从而增强在难题上的学习信号。我们证明PrefixRL目标不仅与标准RL目标一致，而且更具样本效率。在实验上，我们发现了反向泛化：仅在带前缀的问题上训练，就能泛化到分布外的无前缀问题，且学到的策略往往与前缀中的策略不同。在我们的实验中，我们通过对基础模型进行拒绝采样来获得离策轨迹，形成了一个自我改进循环。在困难的推理问题上，即使计入初始拒绝采样的计算成本，PrefixRL达到相同训练奖励的速度也比最强基线（先在离策数据上进行SFT再进行RL）快2倍，且最终奖励提高了3倍。这些收益可迁移到留出的基准上，并且当离策轨迹来自不同的模型家族时，PrefixRL仍然有效，验证了其在实际应用中的灵活性。

Summary / 总结

The paper addresses the inefficiency of reinforcement learning (RL) on hard problems, where correct on-policy traces are rare. It proposes PrefixRL, which conditions on prefixes of successful off-policy traces and runs on-policy RL to complete them, avoiding the instabilities of supervising directly on off-policy data. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data followed by RL) and increases the final reward by 3x. It also exhibits back-generalization: training only on prefixed problems generalizes to unprefixed problems. The off-policy traces are sourced by rejection sampling with the base model, which creates a self-improvement loop, and the method remains effective when the traces come from a different model family.

论文提出PrefixRL方法，以成功的离策略轨迹的前缀为条件进行同策略RL，解决典型RL方法在处理难题时的低效问题。该方法避免了离策略不稳定性，并提高了样本效率。实验表明，在难题上，PrefixRL达到相同训练奖励的速度比最强基线快2倍，并将最终奖励提高了3倍。此外，该方法还展示了反向泛化能力，即仅在带前缀的问题上训练所学到的策略可以泛化到无前缀的问题上。
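
As a rough illustration of how the prefix length can modulate problem difficulty in PrefixRL, the sketch below builds a prompt from a problem and the first part of a successful off-policy trace. The function name, the step-list representation of a trace, and the `keep_fraction` parameter are all hypothetical; the paper may choose and format prefixes differently.

```python
def prefixed_prompt(problem, trace_steps, keep_fraction):
    """Condition on the first `keep_fraction` of a successful off-policy trace.

    A longer prefix leaves less for the on-policy model to complete, so the
    prefixed problem is easier; keep_fraction=0.0 recovers the original problem.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    cut = int(len(trace_steps) * keep_fraction)
    # Only the prefix is shown to the policy; the remainder is discarded,
    # not used as a supervised target.
    return "\n".join([problem, *trace_steps[:cut]])
```

The policy would then be sampled from this prompt and rewarded on its own completion, which is what keeps the RL update on-policy in the sense the abstract describes.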

Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

Authors: Erfan Miahi, Eugene Belilovsky

First: 2026-02-03T18:56:48+00:00 · Latest: 2026-02-03T18:56:48+00:00

Comments: 32 pages, 14 figures

Abs · PDF · Code1 · Code2

Abstract

Reinforcement learning (RL) is a critical component for post-training large language models (LLMs). However, in bandwidth-constrained distributed RL, scalability is often bottlenecked by the synchronization of policy weights from trainers to inference workers, particularly over commodity networks or in decentralized settings. While recent studies suggest that RL updates modify only a small fraction of model parameters, these observations are typically based on coarse checkpoint differences. We present a systematic empirical study of weight-update sparsity at both step-level and multi-step granularities, examining its evolution across training dynamics, off-policy delay, and model scale. We find that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Leveraging this structure, we propose PULSE (Patch Updates via Lossless Sparse Encoding), a simple yet highly efficient lossless weight synchronization method that transmits only the indices and values of modified parameters. PULSE is robust to transmission errors and avoids floating-point drift inherent in additive delta schemes. In bandwidth-constrained decentralized environments, our approach achieves over 100x (14 GB to ~108 MB) communication reduction while maintaining bit-identical training dynamics and performance compared to full weight synchronization. By exploiting this structure, PULSE enables decentralized RL training to approach centralized throughput, reducing the bandwidth required for weight synchronization from 20 Gbit/s to 0.2 Gbit/s to maintain high GPU utilization.

中文标题/摘要

标题：理解并利用权重更新稀疏性以实现通信高效的分布式强化学习

强化学习（RL）是后训练大型语言模型（LLMs）的关键组成部分。然而，在带宽受限的分布式RL中，可扩展性通常受限于策略权重从训练器同步到推理工作者的过程，特别是在普通网络或去中心化设置中。虽然最近的研究表明，RL更新仅修改了模型参数的一小部分，但这些观察通常是基于粗略的检查点差异。我们对权重更新稀疏性进行了系统性的经验研究，包括步长级和多步级粒度，考察了其在训练动态、离策略延迟和模型规模变化中的演变。我们发现，更新稀疏性在实际相关设置中始终很高，经常超过99%。利用这种结构，我们提出了PULSE（基于无损稀疏编码的补丁更新）方法，该方法仅传输修改参数的索引和值，从而实现无损权重同步。PULSE对传输错误具有鲁棒性，并避免了累加差分方案固有的浮点漂移。在带宽受限的去中心化环境中，我们的方法实现了超过100倍（14 GB到~108 MB）的通信减少，同时保持与完整权重同步相同的位级一致的训练动态和性能。通过利用这种结构，PULSE使去中心化RL训练接近集中式吞吐量，将权重同步所需的带宽从20 Gbit/s降低到0.2 Gbit/s，以保持高GPU利用率。

Summary / 总结

This paper addresses communication efficiency in bandwidth-constrained distributed reinforcement learning (RL) by studying the sparsity of weight updates at step-level and multi-step granularities. It finds that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Based on this observation, the authors propose PULSE, a lossless synchronization method that transmits only the indices and values of modified parameters, achieving over 100x communication reduction (14 GB to ~108 MB) while keeping training dynamics and performance bit-identical to full weight synchronization. This lets decentralized RL training approach centralized throughput, cutting the bandwidth needed for weight synchronization from 20 Gbit/s to 0.2 Gbit/s.

该论文通过研究权重更新的稀疏性来解决分布式强化学习中的通信效率问题，发现更新通常只修改模型参数的一小部分，稀疏性经常超过99%。基于此，作者提出了PULSE方法，仅传输修改参数的索引和值，实现了超过100倍的通信减少，同时保持与完整权重同步相同的训练动态和性能。
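
The core idea behind PULSE (send only the indices and values of parameters that changed, then overwrite them on the receiver) can be sketched in a few lines of Python. This is a toy illustration over flat lists of floats, not the PULSE wire format; the function names are hypothetical, and comparing 64-bit patterns via `struct` is an assumption chosen so that "changed" means bit-level change.

```python
import struct


def bits(x):
    """Raw 64-bit pattern of a Python float, so comparisons are bit-exact."""
    return struct.pack("<d", x)


def encode_patch(old, new):
    """Return (indices, values) for parameters whose bit pattern changed."""
    if len(old) != len(new):
        raise ValueError("weight vectors must have the same length")
    indices = [i for i, (a, b) in enumerate(zip(old, new)) if bits(a) != bits(b)]
    return indices, [new[i] for i in indices]


def apply_patch(weights, indices, values):
    """Overwrite changed entries in place and return the updated weights."""
    for i, v in zip(indices, values):
        weights[i] = v
    return weights


def sparsity(indices, num_params):
    """Fraction of parameters left untouched by the update."""
    return 1.0 - len(indices) / num_params
```

Because the receiver overwrites values rather than adding deltas, repeated patches cannot accumulate rounding error, which is the property the abstract contrasts with additive delta schemes.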

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Authors: Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong

First: 2025-07-29T13:40:09+00:00 · Latest: 2026-02-03T18:56:25+00:00

Abs · PDF · Code1 · Code2

Abstract

Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO and DanceGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose MixGRPO, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for faster sampling. So we present a faster variant, termed MixGRPO-Flash, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%.

中文标题/摘要

标题：MixGRPO：通过混合ODE-SDE提升基于流的GRPO效率

尽管GRPO在图像生成中的人类偏好对齐方面显著增强了基于流的模型，但诸如FlowGRPO和DanceGRPO的方法仍然由于马尔可夫决策过程（MDP）中所有去噪步骤的采样和优化需求而表现出低效率。本文提出了一种名为MixGRPO的新框架，通过结合随机微分方程（SDE）和常微分方程（ODE）的混合采样策略，简化了MDP中的优化过程，提高了效率并提升了性能。具体而言，MixGRPO引入了一种滑动窗口机制，在窗口内使用SDE采样和GRPO引导的优化，而在窗口外使用ODE采样。这种设计将采样随机性限制在窗口内的时间步，从而减少了优化开销，并允许更集中的梯度更新以加速收敛。此外，由于滑动窗口外的时间步不参与优化，因此支持更高阶的求解器以加快采样速度。因此，我们提出了一种更快的变体MixGRPO-Flash，进一步提高了训练效率，同时保持了相当的性能。MixGRPO在多个维度上的人类偏好对齐方面表现出显著的改进，不仅在效果上优于DanceGRPO，而且训练时间减少了近50%。值得注意的是，MixGRPO-Flash将训练时间进一步减少了71%。

Summary / 总结

MixGRPO is a framework that combines SDE and ODE sampling to improve the efficiency of flow-based GRPO for human preference alignment in image generation. It introduces a sliding window mechanism that uses SDE sampling and GRPO-guided optimization only within the window and ODE sampling outside it, which reduces optimization overhead and accelerates convergence. MixGRPO outperforms DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. The faster variant, MixGRPO-Flash, uses higher-order solvers outside the window and further reduces training time by 71% while achieving comparable performance.

MixGRPO 是一种通过结合 SDE 和 ODE 采样来提高基于流的 GRPO 训练效率的框架，避免了对所有去噪步骤进行采样和优化。它引入了一个滑动窗口机制，在窗口内使用 SDE 采样和 GRPO 进行优化，而在窗口外使用 ODE 采样。这种方法减少了优化开销，并支持在窗口外使用更高阶的求解器，从而加快了训练速度并保持了相当的性能。MixGRPO-Flash 是 MixGRPO 的一个变体，进一步将训练时间减少了 71%。与 DanceGRPO 相比，MixGRPO 在有效性和效率方面都取得了显著的提升，训练时间减少了近 50%。
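
The sliding window in MixGRPO amounts to a per-step choice of sampler, which the following sketch makes concrete. The function name and the integer window parameters are hypothetical, and how the window moves during training is not specified in the abstract, so this only shows one fixed window position.

```python
def sampler_schedule(num_steps, window_start, window_size):
    """Label each denoising step 'sde' inside the window and 'ode' outside.

    Only 'sde' steps would carry sampling randomness and receive
    GRPO-guided optimization; 'ode' steps are deterministic.
    """
    window_end = min(window_start + window_size, num_steps)
    return [
        "sde" if window_start <= t < window_end else "ode"
        for t in range(num_steps)
    ]
```

With 6 denoising steps and a window of size 2 starting at step 2, only steps 2 and 3 are labeled `sde`, so optimization touches a third of the trajectory instead of all of it.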

Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

Authors: David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken-ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, Ioannis Panageas, Dimitris Paparas, Benjamin Przybocki, Bernardo Subercaseaux, Ola Svensson, Shayan Taherijam, Xuan Wu, Eylon Yogev, Morteza Zadimoghaddam, Samson Zhou, Vahab Mirrokni

First: 2026-02-03T18:56:17+00:00 · Latest: 2026-02-03T18:56:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google's Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a "neuro-symbolic" loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.

中文标题/摘要

标题：使用Gemini加速科学研究：案例研究与常用技术

大型语言模型（LLMs）的最新进展为加速科学研究开辟了新途径。尽管模型在协助处理常规任务方面的能力越来越强，但它们在贡献新颖、专家级的数学发现方面的潜力尚不明确。我们通过一系列案例研究展示了研究人员如何成功与基于Google Gemini的高级AI模型（特别是Gemini Deep Think及其高级变体）合作，解决开放问题、反驳猜想并生成新的证明，涵盖理论计算机科学的多个领域，以及经济学、优化和物理学等其他领域。基于这些经验，我们提取了理论研究中有效的人机协作技术，如迭代细化、问题分解和跨学科知识转移。虽然我们的大部分结果来自这种互动、对话的方法，但我们还强调了一些超越标准聊天界面的具体实例。这些包括将模型部署为严格的对抗性审稿人来检测现有证明中的细微缺陷，以及将其嵌入“神经符号”循环中，该循环自主编写和执行代码以验证复杂的推导。这些例子共同突显了AI不仅作为自动化工具的潜力，而且作为科学研究创造性过程中的多功能、真正的合作伙伴的潜力。

Summary / 总结

The paper explores how advanced AI models, particularly Google's Gemini-based models, have been used to assist in solving open problems and generating new proofs across various scientific fields. Researchers have employed techniques such as iterative refinement and problem decomposition to effectively collaborate with these models. Key findings include the successful use of Gemini as a rigorous adversarial reviewer to detect flaws in proofs and as part of a neuro-symbolic loop to autonomously write and execute code for complex derivations, showcasing AI's potential as a creative partner in scientific discovery.

论文探讨了高级AI模型，尤其是基于Gemini的模型，如何被用于解决理论计算机科学及其他领域如经济学、优化和物理学中的开放问题和生成新证明。作者通过案例研究展示了人类与AI的合作，强调了迭代改进和问题分解等技术。他们还讨论了AI作为严格审查者和神经符号循环中验证复杂推导的应用，展示了AI在科学发现过程中的创造性和多功能性，而不仅仅是自动化工具。

AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

Authors: Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, Yue Zhang

Venue: ICLR 2026

First: 2026-02-03T18:41:43+00:00 · Latest: 2026-02-03T18:41:43+00:00

Comments: Accepted at ICLR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations. The code, dataset, and Hugging Face Space are released at https://github.com/ResearAI/AutoFigure.

中文标题/摘要

标题：AutoFigure：生成和优化出版级科学插图

高质量的科学插图对于有效传达复杂的科学和技术概念至关重要，但其手动创建仍然是学术界和工业界公认的瓶颈。我们提出了FigureBench，这是首个用于从长篇科学文本生成科学插图的大规模基准。它包含3,300个高质量的科学文本-插图对，涵盖了来自科学论文、综述、博客和教科书的多种文本到插图任务。此外，我们提出了AutoFigure，这是首个基于长篇科学文本自动生成高质量科学插图的代理框架。具体而言，在最终呈现结果之前，AutoFigure 进行了广泛的思考、重组和验证，以生成一个既结构合理又美观的布局，输出的科学插图兼具结构完整性和美学吸引力。利用FigureBench中的高质量数据，我们进行了广泛的实验，测试AutoFigure相对于各种基线方法的性能。结果表明，AutoFigure 一贯超越所有基线方法，生成了出版级的科学插图。代码、数据集和huggingface空间在https://github.com/ResearAI/AutoFigure发布。

Summary / 总结

The research aims to address the bottleneck of manually creating high-quality scientific illustrations. AutoFigure, an agentic framework, is proposed to automatically generate such illustrations from long-form scientific texts. It includes a thorough thinking, recombination, and validation process to ensure structural soundness and aesthetic appeal. Experiments show that AutoFigure outperforms existing methods, producing publication-ready illustrations. The dataset and code are publicly available.

研究解决了手动创建高质量科学插图的瓶颈问题，提出了包含3,300个文本-插图对的FigureBench基准，并开发了AutoFigure框架，该框架能够自动生成结构合理且美观的插图。实验表明，AutoFigure在所有基线方法中表现最佳，能够生成适合出版的科学插图。

Multi-Agent Pathfinding Under Team-Connected Communication Constraint via Adaptive Path Expansion and Dynamic Leading

Authors: Hoang-Dung Bui, Erion Plaku, Gregory J. Stein

First: 2025-01-06T05:21:18+00:00 · Latest: 2026-02-03T18:36:02+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper proposes a novel planning framework to handle a multi-agent pathfinding problem under team-connected communication constraint, where all agents must have a connected communication channel to the rest of the team during their entire movements. Standard multi-agent path finding approaches (e.g., priority-based search) have potential in this domain but fail when neighboring configurations at start and goal differ. Their single-expansion approach -- computing each agent's path from the start to the goal in just a single expansion -- cannot reliably handle planning under communication constraints for agents as their neighbors change during navigating. Similarly, leader-follower approaches (e.g., platooning) are effective at maintaining team communication, but fixing the leader at the outset of planning can cause planning to become stuck in dense-clutter environments, limiting their practical utility. To overcome this limitation, we propose a novel two-level multi-agent pathfinding framework that integrates two techniques: adaptive path expansion to expand agent paths to their goals in multiple stages; and dynamic leading technique that enables the reselection of the leading agent during each agent path expansion whenever progress cannot be made. Simulation experiments show the efficiency of our planners, which can handle up to 25 agents across five environment types under a limited communication range constraint and up to 11-12 agents on three environment types under line-of-sight communication constraint, exceeding 90% success-rate where baselines routinely fail.

中文标题/摘要

标题：基于团队连接通信约束的自适应路径扩展与动态领航的多智能体路径规划

本文提出了一种新的规划框架，用于处理团队连接通信约束下的多智能体路径规划问题，其中所有智能体在整个移动过程中必须与团队的其余部分保持连通的通信通道。标准的多智能体路径规划方法（例如基于优先级的搜索）在此领域具有潜力，但当起点处与目标处的邻居配置不同时会失效。它们的单次扩展方法（仅通过一次扩展计算每个智能体从起点到目标的路径）无法可靠地处理通信约束下的规划，因为智能体的邻居在导航过程中会发生变化。同样，领航跟随方法（例如编队）在保持团队通信方面非常有效，但在规划初期固定领航者会导致规划在密集障碍环境中陷入停滞，限制了其实用性。为克服这一限制，我们提出了一种新的两层多智能体路径规划框架，该框架结合了两种技术：自适应路径扩展，分多个阶段将智能体路径扩展到其目标；以及动态领航技术，允许在每次智能体路径扩展过程中，当无法取得进展时重新选择领航智能体。仿真实验表明了我们规划器的效率：在有限通信范围约束下，可在五种环境类型中处理多达25个智能体；在视线通信约束下，可在三种环境类型中处理多达11-12个智能体，成功率超过90%，而基线方法通常会失败。

Summary / 总结

This paper addresses the multi-agent pathfinding problem under a team-connected communication constraint, where all agents must maintain a connected communication channel to the rest of the team. It proposes a two-level framework combining adaptive path expansion and dynamic leading to overcome the limitations of standard multi-agent pathfinding and leader-follower approaches. The experiments demonstrate that the proposed planners can handle up to 25 agents in various environments under limited communication range and up to 11-12 agents under line-of-sight communication, achieving success rates above 90% where baseline methods fail.

该论文解决了团队连接通信约束下的多智能体路径规划问题，要求所有智能体在整个移动过程中保持通信连接。提出了一种结合自适应路径扩展和动态领航的两级框架。仿真实验表明，所提出的规划器在各种环境下最多可处理25个智能体在有限通信范围内的问题，以及最多11-12个智能体在视线通信范围内的问题，成功率超过90%，超越了基线方法。
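
The team-connected constraint under a limited communication range can be checked at a single time step with a breadth-first search over the communication graph, as in the sketch below. This only illustrates the constraint, not the planner in the paper; the function name and the use of Euclidean distance between 2D positions are assumptions.

```python
from collections import deque
from math import dist


def team_connected(positions, comm_range):
    """Return True if every agent can reach every other agent through
    a chain of neighbors that are within `comm_range` of each other."""
    if not positions:
        return True
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j, other in enumerate(positions):
            if j not in seen and dist(positions[i], other) <= comm_range:
                seen.add(j)
                queue.append(j)
    return len(seen) == len(positions)
```

A planner would need this to hold at every step of the joint motion, which is why an agent's neighbors changing during navigation makes single-expansion planning unreliable.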

ME-IGM: Individual-Global-Max in Maximum Entropy Multi-Agent Reinforcement Learning

Authors: Wen-Tse Chen, Yuxuan Li, Shiyu Huang, Jiayu Chen, Jeff Schneider

Venue: Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Paphos, Cyprus, May 25 - 29, 2026, IFAAMAS, 19 pages

First: 2024-06-20T01:55:08+00:00 · Latest: 2026-02-03T18:35:29+00:00

Comments: Published in the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abs · PDF · Code1 · Code2

Abstract

Multi-agent credit assignment is a fundamental challenge for cooperative multi-agent reinforcement learning (MARL), where a team of agents learn from shared reward signals. The Individual-Global-Max (IGM) condition is a widely used principle for multi-agent credit assignment, requiring that the joint action determined by individual Q-functions maximizes the global Q-value. Meanwhile, the principle of maximum entropy has been leveraged to enhance exploration in MARL. However, we identify a critical limitation in existing maximum entropy MARL methods: a misalignment arises between local policies and the joint policy that maximizes the global Q-value, leading to violations of the IGM condition. To address this misalignment, we propose an order-preserving transformation. Building on it, we introduce ME-IGM, a novel maximum entropy MARL algorithm compatible with any credit assignment mechanism that satisfies the IGM condition while enjoying the benefits of maximum entropy exploration. We empirically evaluate two variants of ME-IGM: ME-QMIX and ME-QPLEX, in non-monotonic matrix games, and demonstrate their state-of-the-art performance across 17 scenarios in SMAC-v2 and Overcooked.

中文标题/摘要

标题：ME-IGM：个体-全局-最大值在最大熵多智能体强化学习中的应用

多智能体的信用分配是合作多智能体强化学习（MARL）中的一个基本挑战，其中一组智能体通过共享奖励信号进行学习。个体-全局-最大值（IGM）条件是多智能体信用分配中广泛使用的原则，要求由个体Q函数确定的联合动作最大化全局Q值。同时，最大熵原则已被用于增强MARL中的探索。然而，我们发现现有最大熵MARL方法的一个关键局限性：局部策略与最大化全局Q值的联合策略之间存在偏差，导致违反了IGM条件。为了解决这种偏差，我们提出了一种保持顺序的变换。在此基础上，我们引入了ME-IGM，这是一种与任何满足IGM条件的信用分配机制兼容的新颖最大熵MARL算法，同时享受最大熵探索的好处。我们通过非单调矩阵游戏评估了ME-IGM的两种变体：ME-QMIX和ME-QPLEX，并在SMAC-v2和Overcooked中展示了其在17个场景中的先进性能。

Summary / 总结

The paper addresses the challenge of multi-agent credit assignment in cooperative MARL by proposing ME-IGM, a maximum entropy MARL algorithm that aligns local policies with the global Q-value maximization. It introduces an order-preserving transformation to ensure the IGM condition is met, and evaluates ME-IGM variants ME-QMIX and ME-QPLEX in non-monotonic matrix games, showing superior performance in 17 scenarios of SMAC-v2 and Overcooked compared to existing methods.

论文通过提出ME-IGM，一种满足IGM条件的最大熵MARL算法，来解决合作MARL中的多智能体信用分配问题。它引入了一种保序变换，使局部策略与最大化全局Q值的联合策略对齐，并在非单调矩阵博弈中评估了ME-IGM的变体ME-QMIX和ME-QPLEX，结果显示其在SMAC-v2和Overcooked的17个场景中性能优于现有方法。
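
The IGM condition described in the abstract can be checked directly in a small two-agent matrix game: the joint action formed by each agent's greedy individual action must attain the maximum of the joint Q-table. The sketch below is a toy check with hypothetical function names and made-up Q-values; it is not part of ME-IGM itself.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def satisfies_igm(q1, q2, q_joint):
    """Check IGM for two agents in a single state.

    q1, q2: individual Q-values per action for agent 1 and agent 2.
    q_joint: table where q_joint[a1][a2] is the global Q-value.
    """
    a1, a2 = argmax(q1), argmax(q2)
    best = max(max(row) for row in q_joint)
    return q_joint[a1][a2] == best
```

For a non-monotonic payoff table such as `[[8, -12], [-12, 0]]`, individual Q-values that both prefer action 0 satisfy the condition, while ones that both prefer action 1 select the joint value 0 and violate it.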

Continuous Control of Editing Models via Adaptive-Origin Guidance

Authors: Alon Wolf, Chen Katzir, Kfir Aberman, Or Patashnik

First: 2026-02-03T18:33:39+00:00 · Latest: 2026-02-03T18:33:39+00:00

Comments: Project page at https://adaor-paper.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly controlling the intensity of text-guided edits. In standard text-conditioned generation, Classifier-Free Guidance (CFG) impacts prompt adherence, suggesting it as a potential control for edit intensity in editing models. However, we show that scaling CFG in these models does not produce a smooth transition between the input and the edited result. We attribute this behavior to the unconditional prediction, which serves as the guidance origin and dominates the generation at low guidance scales, while representing an arbitrary manipulation of the input content. To enable continuous control, we introduce Adaptive-Origin Guidance (AdaOr), a method that adjusts this standard guidance origin with an identity-conditioned adaptive origin, using an identity instruction corresponding to the identity manipulation. By interpolating this identity prediction with the standard unconditional prediction according to the edit strength, we ensure a continuous transition from the input to the edited result. We evaluate our method on image and video editing tasks, demonstrating that it provides smoother and more consistent control compared to current slider-based editing approaches. Our method incorporates an identity instruction into the standard training framework, enabling fine-grained control at inference time without per-edit procedure or reliance on specialized datasets.

中文标题/摘要

标题：通过自适应起始点引导实现编辑模型的连续控制

基于扩散的编辑模型已成为语义图像和视频操作的强大工具。然而，现有模型缺乏一种机制来平滑控制文本引导编辑的强度。在标准文本条件生成中，无分类器引导（CFG）影响对提示的遵循程度，这表明它可能成为编辑模型中控制编辑强度的手段。然而，我们表明，在这些模型中调节CFG的尺度并不能在输入和编辑结果之间产生平滑过渡。我们将这种行为归因于无条件预测：它作为引导起始点，在低引导尺度下主导生成，却代表了对输入内容的一种任意操作。为了实现连续控制，我们引入了自适应起始点引导（AdaOr）方法，该方法用以身份指令为条件的自适应起始点来调整标准引导起始点，其中身份指令对应于恒等操作（即保持输入不变）。通过根据编辑强度在该身份预测与标准无条件预测之间进行插值，我们确保从输入到编辑结果的连续过渡。我们在图像和视频编辑任务上评估了我们的方法，证明它提供了比当前基于滑块的编辑方法更平滑、更一致的控制。我们的方法将身份指令整合到标准训练框架中，在推理时实现细粒度控制，无需针对每个编辑的单独流程，也不依赖专门的数据集。

Summary / 总结

The paper addresses the challenge of smoothly controlling the intensity of text-guided edits in diffusion-based editing models. It introduces Adaptive-Origin Guidance (AdaOr), which adjusts the standard guidance origin with an identity-conditioned adaptive origin to enable continuous control. Experiments on image and video editing tasks show that AdaOr provides smoother and more consistent control compared to existing slider-based methods.

论文解决了在基于扩散的编辑模型中平滑控制文本引导编辑强度的挑战。它引入了自适应起始点引导（AdaOr），通过将标准引导起始点调整为以身份指令为条件的自适应起始点来实现连续控制。实验结果表明，AdaOr 在图像和视频编辑任务中提供了比现有基于滑块的编辑方法更平滑、更一致的控制。
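
One possible reading of the adaptive guidance origin in AdaOr is sketched below on plain lists of numbers standing in for model predictions. Standard classifier-free guidance is the well-known form `origin + scale * (cond - origin)` with the unconditional prediction as origin. The linear blend of identity and unconditional predictions and the coupling of the guidance scale to the edit strength are assumptions made here for illustration; the paper's exact schedule may differ.

```python
def cfg(uncond, cond, scale):
    """Standard classifier-free guidance with the unconditional prediction as origin."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def adaptive_origin(identity, uncond, strength):
    """Blend identity and unconditional predictions by edit strength in [0, 1]."""
    return [(1 - strength) * i + strength * u for i, u in zip(identity, uncond)]


def guided(identity, uncond, cond, strength, scale):
    """Guide from the adaptive origin; the effective scale grows with strength
    (an assumed coupling, not taken from the paper)."""
    origin = adaptive_origin(identity, uncond, strength)
    return [o + strength * scale * (c - o) for o, c in zip(origin, cond)]
```

Under these assumptions, strength 0 returns the identity prediction (the input is preserved) and strength 1 reduces to standard CFG, so intermediate strengths move continuously between the two.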

Robust Intervention Learning from Emergency Stop Interventions

Authors: Ethan Pronovost, Khimya Khetarpal, Siddhartha Srinivasa

First: 2026-02-03T18:33:21+00:00 · Latest: 2026-02-03T18:33:21+00:00

Abs · PDF · Code1 · Code2

Abstract

Human interventions are a common source of data in autonomous systems during testing. These interventions provide an important signal about where the current policy needs improvement, but are often noisy and incomplete. We define Robust Intervention Learning (RIL) as the problem of learning from intervention data while remaining robust to the quality and informativeness of the intervention signal. In the best case, interventions are precise and avoiding them is sufficient to solve the task, but in many realistic settings avoiding interventions is necessary but not sufficient for achieving good performance. We study robust intervention learning in the context of emergency stop interventions and propose Residual Intervention Fine-Tuning (RIFT), a residual fine-tuning algorithm that treats intervention feedback as an incomplete learning signal and explicitly combines it with a prior policy. By framing intervention learning as a fine-tuning problem, our approach leverages structure encoded in the prior policy to resolve ambiguity when intervention signals under-specify the task. We provide theoretical analysis characterizing conditions under which this formulation yields principled policy improvement, and identify regimes where intervention learning is expected to fail. Our experiments reveal that residual fine-tuning enables robust and consistent policy improvement across a range of intervention strategies and prior policy qualities, and highlight robust intervention learning as a promising direction for future work.

中文标题/摘要

标题：基于紧急停止干预的鲁棒干预学习

人类干预是自主系统测试期间数据的一个常见来源。这些干预提供了关于当前策略需要改进之处的重要信号，但通常噪声较大且不完整。我们定义鲁棒干预学习（RIL）为在保持对干预信号质量及信息量鲁棒性的前提下从干预数据中学习的问题。在最佳情况下，干预是精确的，避免干预足以解决问题，但在许多现实场景中，避免干预是必要的但不足以实现良好性能。我们研究了在紧急停止干预的背景下进行鲁棒干预学习，并提出了一种残差微调算法——残差干预微调（RIFT），该算法将干预反馈视为不完整的学习信号，并显式地将其与先验策略结合。通过将干预学习视为微调问题，我们的方法利用先验策略中编码的结构来解决当干预信号对任务描述不足时的歧义。我们提供了理论分析，描述了在这种表述下产生合理策略改进的条件，并指出了干预学习可能失败的领域。我们的实验表明，残差微调能够在各种干预策略和先验策略质量下实现鲁棒且一致的策略改进，并突显了鲁棒干预学习作为未来研究方向的潜力。

Summary / 总结

The paper addresses the challenge of learning from human interventions in autonomous systems, which are often noisy and incomplete. It defines Robust Intervention Learning (RIL) as the problem of learning from intervention data while remaining robust to the quality and informativeness of the intervention signal, and proposes Residual Intervention Fine-Tuning (RIFT), which treats emergency stop feedback as an incomplete learning signal and combines it with a prior policy. Experiments show that RIFT improves policy performance consistently across a range of intervention strategies and prior policy qualities, and the authors highlight robust intervention learning as a promising direction for future work.

论文探讨了在自主系统中从人类干预中学习的挑战，这些干预往往带有噪声且不完整。作者将鲁棒干预学习（RIL）定义为这样一个问题：从干预数据中学习，同时对干预信号的质量和信息量保持鲁棒。他们提出了残差干预微调（RIFT）算法，该算法将干预反馈与先验策略结合起来，以稳健地改进策略。实验表明，RIFT能够在各种干预策略和先验策略质量下实现一致的策略改进。

Closing the Loop: Universal Repository Representation with RPG-Encoder

Authors: Jane Luo, Chengyu Yin, Xin Zhang, Qingtao Li, Steven Liu, Yiming Huang, Jie Wu, Hao Liu, Yangyu Huang, Yu Kang, Fangkai Yang, Ying Xin, Scarlett Li

First: 2026-02-02T13:30:00+00:00 · Latest: 2026-02-03T18:33:03+00:00

Abs · PDF · Code1 · Code2

Abstract

Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: generation expands intent into implementation, while comprehension compresses implementation back into intent. To address this, we propose RPG-Encoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity representation. RPG-Encoder closes the reasoning loop through three mechanisms: (1) Encoding raw code into the RPG that combines lifted semantic features with code dependencies; (2) Evolving the topology incrementally to decouple maintenance costs from repository scale, reducing overhead by 95.7%; and (3) Operating as a unified interface for structure-aware navigation. In evaluations, RPG-Encoder establishes state-of-the-art localization performance on SWE-bench Verified with 93.7% Acc@5 and exceeds the best baseline by over 10% in localization accuracy on SWE-bench Live Lite. These results highlight our superior fine-grained precision in complex codebases. Furthermore, it achieves 98.5% reconstruction coverage on RepoCraft, confirming RPG's high-fidelity capacity to mirror the original codebase and closing the loop between intent and implementation.

中文标题/摘要

标题：闭合循环：使用RPG-编码器的通用仓库表示

当前的仓库代理因表示碎片化而遇到推理断层，现有方法依赖于孤立的API文档或缺乏语义深度的依赖图。我们认为仓库理解和生成是统一循环中的逆过程：生成将意图扩展为实现，而理解则将实现压缩回意图。为了解决这个问题，我们提出了RPG-编码器框架，该框架将仓库规划图（RPG）从静态生成蓝图推广为统一的高保真表示。RPG-编码器通过三种机制闭合推理循环：（1）将原始代码编码到RPG中，结合提升的语义特征与代码依赖性；（2）逐步演化拓扑结构以解耦维护成本与仓库规模，减少95.7%的开销；（3）作为结构感知导航的统一接口。在评估中，RPG-编码器在SWE-bench Verified上实现了最先进的定位性能，Acc@5达到93.7%，并在SWE-bench Live Lite上的定位准确率比最佳基线高出10%以上。这些结果突显了我们在复杂代码库中优越的细粒度精度。此外，它在RepoCraft上的重建覆盖率达到了98.5%，证实了RPG具有高度保真的能力来反映原始代码库，并在意图与实现之间闭合了循环。

Summary / 总结

The research aims to address the reasoning disconnect in repository agents by proposing RPG-Encoder, which generalizes the Repository Planning Graph (RPG) into a unified, high-fidelity representation. RPG-Encoder achieves this through three mechanisms: encoding raw code with lifted semantic features and dependencies, incrementally evolving the topology to reduce maintenance costs, and providing a unified interface for structure-aware navigation. The evaluation shows that RPG-Encoder outperforms existing methods with 93.7% Acc@5 in localization accuracy on SWE-bench Verified and 98.5% reconstruction coverage on RepoCraft, demonstrating its superior precision and fidelity in complex codebases.

研究旨在通过提出RPG-Encoder来解决仓库代理中的推理断层问题，RPG-Encoder将仓库规划图（RPG）泛化为统一的高保真表示。它依靠三种机制实现这一点：将原始代码编码为结合提升语义特征与代码依赖的RPG，增量演化拓扑结构以降低维护成本，并作为结构感知导航的统一接口。评估结果显示，RPG-Encoder在SWE-bench Verified上的定位性能达到93.7%的Acc@5，并在RepoCraft上实现98.5%的重建覆盖率，展示了其在复杂代码库中的高精确度和保真度。

Deep-learning-based pan-phenomic data reveals the explosive evolution of avian visual disparity

Authors: Jiao Sun

First: 2026-02-03T18:32:15+00:00 · Latest: 2026-02-03T18:32:15+00:00

Comments: Readers from the field of computer science may be interested in section 2.1, 2.2, 3.1, 4.1, 4.2. These sections discussed the interpretability and representation learning, especially the texture vs shape problem, highlighting our model's ability of overcoming the texture biases and capturing overall shape features. (Although they're put here to prove the biological validity of the model.)

Abs · PDF · Code1 · Code2

Abstract

The evolution of biological morphology is critical for understanding the diversity of the natural world, yet traditional analyses often involve subjective biases in the selection and coding of morphological traits. This study employs deep learning techniques, utilising a ResNet34 model capable of recognising over 10,000 bird species, to explore avian morphological evolution. We extract weights from the model's final fully connected (fc) layer and investigate the semantic alignment between the high-dimensional embedding space learned by the model and biological phenotypes. The results demonstrate that the high-dimensional embedding space encodes phenotypic convergence. Subsequently, we assess the morphological disparity among various taxa and evaluate the association between morphological disparity and species richness, demonstrating that species richness is the primary driver of morphospace expansion. Moreover, the disparity-through-time analysis reveals a visual "early burst" after the K-Pg extinction. While mainly aimed at evolutionary analysis, this study also provides insights into the interpretability of Deep Neural Networks. We demonstrate that hierarchical semantic structures (biological taxonomy) emerged in the high-dimensional embedding space despite being trained on flat labels. Furthermore, through adversarial examples, we provide evidence that our model in this task can overcome texture bias and learn holistic shape representations (body plans), challenging the prevailing view that CNNs rely primarily on local textures.

中文标题/摘要

标题：基于深度学习的泛表型数据揭示了鸟类视觉差异的爆炸性进化

生物形态的进化对于理解自然界的多样性至关重要，但传统分析往往在形态特征的选择和编码上存在主观偏见。本研究采用深度学习技术，利用能够识别超过10,000种鸟类的ResNet34模型，探索鸟类形态的进化。我们从模型最终的全连接层提取权重，并研究模型学习到的高维嵌入空间与生物表型之间的语义对齐。结果表明，高维嵌入空间编码了表型趋同。随后，我们评估了不同类群之间的形态差异，并评估了形态差异与物种丰富度之间的关联，表明物种丰富度是形态空间扩张的主要驱动力。此外，通过时间上的形态差异分析揭示了K-Pg灭绝后的视觉“早期爆发”。尽管主要针对进化分析，本研究还为深度神经网络的可解释性提供了见解。我们证明，尽管模型仅在平面标签上进行训练，但在高维嵌入空间中仍出现了分层语义结构（生物学分类）。此外，通过对抗样本，我们提供了证据表明，我们的模型在该任务中能够克服纹理偏见并学习整体形状表示（体型），挑战了CNN主要依赖局部纹理的观点。

Summary / 总结

This study uses deep learning to analyze avian morphological evolution, employing a ResNet34 model capable of recognising over 10,000 bird species in order to reduce the subjective bias of manual trait selection and coding. The high-dimensional embedding space captures phenotypic convergence, and species richness emerges as the primary driver of morphospace expansion. The disparity-through-time analysis shows a visual 'early burst' after the K-Pg extinction. The study also bears on the interpretability of deep neural networks: hierarchical taxonomic structure emerges in the embedding space despite training on flat labels, and adversarial examples indicate that the model learns holistic shape representations rather than relying on local texture.

该研究利用深度学习技术分析鸟类形态演化，采用能够识别超过10,000种鸟类的ResNet34模型。结果显示，模型的高维嵌入空间捕捉了表型趋同，物种丰富度是形态空间扩张的主要驱动力。随时间变化的形态差异分析揭示了白垩纪-古近纪灭绝事件后视觉差异的“早期爆发”。研究还为深度神经网络的可解释性提供了见解：尽管训练时使用的是平面标签，嵌入空间中仍出现了层级分类结构，且模型能够克服纹理偏见、学习整体形状表征。

They Said Memes Were Harmless-We Found the Ones That Hurt: Decoding Jokes, Symbols, and Cultural References

Authors: Sahil Tripathi, Gautam Siddharth Kashyap, Mehwish Nasim, Jian Yang, Jiechao Gao, Usman Naseem

Venue: The Web Conference 2026

First: 2026-02-03T18:29:46+00:00 · Latest: 2026-02-03T18:29:46+00:00

Comments: Accepted at The Web Conference 2026 (Research Track)

Abs · PDF · Code1 · Code2

Abstract

Meme-based social abuse detection is challenging because harmful intent often relies on implicit cultural symbolism and subtle cross-modal incongruence. Prior approaches, from fusion-based methods to in-context learning with Large Vision-Language Models (LVLMs), have made progress but remain limited by three factors: i) cultural blindness (missing symbolic context), ii) boundary ambiguity (satire vs. abuse confusion), and iii) lack of interpretability (opaque model reasoning). We introduce CROSS-ALIGN+, a three-stage framework that systematically addresses these limitations: (1) Stage I mitigates cultural blindness by enriching multimodal representations with structured knowledge from ConceptNet, Wikidata, and Hatebase; (2) Stage II reduces boundary ambiguity through parameter-efficient LoRA adapters that sharpen decision boundaries; and (3) Stage III enhances interpretability by generating cascaded explanations. Extensive experiments on five benchmarks and eight LVLMs demonstrate that CROSS-ALIGN+ consistently outperforms state-of-the-art methods, achieving up to 17% relative F1 improvement while providing interpretable justifications for each decision.

中文标题/摘要

标题：他们说梗无害——我们发现了那些伤人的：解码笑话、符号和文化引用

基于梗的社会虐待检测具有挑战性，因为有害意图往往依赖于隐含的文化象征和微妙的跨模态不一致。先前的方法，从融合方法到带有大型视觉-语言模型（LVLM）的上下文学习，已经取得了进展，但仍然受到三个因素的限制：i) 文化盲视（缺少象征性背景），ii) 边界模糊（讽刺与虐待混淆），iii) 缺乏可解释性（不透明的模型推理）。我们引入了CROSS-ALIGN+，这是一种三阶段框架，系统地解决了这些限制：(1) 第一阶段通过从ConceptNet、Wikidata和Hatebase中丰富多模态表示来缓解文化盲视；(2) 第二阶段通过参数高效的LoRA适配器减少边界模糊；(3) 第三阶段通过生成级联解释增强可解释性。在五个基准和八个LVLM上的广泛实验表明，CROSS-ALIGN+始终优于最先进的方法，相对F1改进高达17%，并且为每个决策提供了可解释的依据。

Summary / 总结

The research aims to detect harmful memes by addressing the challenges of implicit cultural symbolism and subtle cross-modal incongruence. The proposed CROSS-ALIGN+ framework consists of three stages: enriching multimodal representations with structured knowledge, sharpening decision boundaries through parameter-efficient adapters, and generating interpretable explanations. Experiments show that CROSS-ALIGN+ outperforms existing methods, achieving up to 17% relative F1 improvement and providing justifications for each decision.

研究旨在通过解决文化盲点、边界模糊和缺乏解释性的问题来检测有害的 meme。引入了 CROSS-ALIGN+，这是一个三阶段框架，通过结构化知识丰富多模态表示，使用 LoRA 适配器细化决策边界，并生成级联解释以增强解释性。实验表明，CROSS-ALIGN+ 在五个基准测试上优于现有方法，相对 F1 提高了高达 17%，并为每个决策提供了可解释的依据。

Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning

Authors: Dingkun Zhang, Shuhan Qi, Yulin Wu, Xinyu Xiao, Xuan Wang, Long Chen

First: 2026-02-03T18:18:11+00:00 · Latest: 2026-02-03T18:18:11+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Multimodal Large Language Models (MLLMs) suffer from severe training inefficiency issue, which is associated with their massive model sizes and visual token numbers. Existing efforts in efficient training focus on reducing model sizes or trainable parameters. Inspired by the success of Visual Token Pruning (VTP) in improving inference efficiency, we are exploring another substantial research direction for efficient training by reducing visual tokens. However, applying VTP at the training stage results in a training-inference mismatch: pruning-trained models perform poorly when inferring on non-pruned full visual token sequences. To close this gap, we propose DualSpeed, a fast-slow framework for efficient training of MLLMs. The fast-mode is the primary mode, which incorporates existing VTP methods as plugins to reduce visual tokens, along with a mode isolator to isolate the model's behaviors. The slow-mode is the auxiliary mode, where the model is trained on full visual sequences to retain training-inference consistency. To boost its training, it further leverages self-distillation to learn from the sufficiently trained fast-mode. Together, DualSpeed can achieve both training efficiency and non-degraded performance. Experiments show DualSpeed accelerates the training of LLaVA-1.5 by 2.1$\times$ and LLaVA-NeXT by 4.0$\times$, retaining over 99% performance. Code: https://github.com/dingkun-zhang/DualSpeed

中文标题/摘要

标题：通过视觉标记剪枝实现快速-缓慢高效训练的多模态大型语言模型

多模态大型语言模型（MLLMs）遭受严重的训练效率问题，这与它们庞大的模型规模和视觉标记数量有关。现有的高效训练努力主要集中在减少模型规模或可训练参数上。受视觉标记剪枝（VTP）在提高推理效率方面取得成功的启发，我们探索了通过减少视觉标记来实现高效训练的另一个重要研究方向。然而，在训练阶段应用VTP会导致训练-推理不匹配：剪枝训练的模型在对完整的视觉标记序列进行推理时表现不佳。为了解决这一问题，我们提出了DualSpeed，这是一种用于MLLMs高效训练的快速-缓慢框架。快速模式是主要模式，它将现有的VTP方法作为插件来减少视觉标记，并包含一个模式隔离器来隔离模型的行为。慢速模式是辅助模式，在此模式下，模型在完整的视觉序列上进行训练以保持训练-推理一致性。为了提高其训练效率，它进一步利用自我蒸馏从充分训练的快速模式中学习。综上所述，DualSpeed可以同时实现训练效率和非退化性能。实验表明，DualSpeed将LLaVA-1.5的训练加速了2.1倍，将LLaVA-NeXT的训练加速了4.0倍，保留了超过99%的性能。代码：https://github.com/dingkun-zhang/DualSpeed

Summary / 总结

The research aims to address the training inefficiency of Multimodal Large Language Models (MLLMs) by reducing visual tokens through a fast-slow training framework called DualSpeed. The framework uses a fast-mode, the primary mode, to prune visual tokens with existing VTP methods, and an auxiliary slow-mode that trains on full visual sequences to maintain training-inference consistency. Self-distillation is employed to boost the slow-mode's training by learning from the sufficiently trained fast-mode. Experiments demonstrate that DualSpeed accelerates the training of LLaVA-1.5 by 2.1 times and LLaVA-NeXT by 4.0 times while retaining over 99% performance.

论文提出了一种名为DualSpeed的快慢模式框架：快模式使用视觉令牌剪枝（VTP）减少视觉令牌数量，慢模式在完整视觉序列上训练以保持训练-推理一致性。慢模式还通过自我蒸馏向已充分训练的快模式学习，以加速自身训练。实验结果显示，DualSpeed可以将LLaVA-1.5的训练速度提升2.1倍，将LLaVA-NeXT的训练速度提升4.0倍，同时保持超过99%的性能。
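The fast/slow split described above can be illustrated with a minimal sketch. It is not the DualSpeed implementation: the function names, the `keep_ratio` parameter, and the idea that each visual token already has an importance score are all assumptions made for illustration, and the mode isolator and self-distillation loss are omitted.

```python
import math


def prune_visual_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of visual tokens, preserving order.

    `scores` is a hypothetical per-token importance score; a real VTP
    plugin would compute it (for example from attention weights).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Indices of the top-`keep` scores, then restored to sequence order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]


def visual_tokens_for_step(tokens, scores, mode, keep_ratio=0.25):
    """Fast mode trains on pruned tokens; slow mode sees the full sequence."""
    if mode == "fast":
        return prune_visual_tokens(tokens, scores, keep_ratio)
    if mode == "slow":
        return list(tokens)
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch the slow mode exists only so that the model also encounters unpruned sequences during training, which is the training-inference mismatch the abstract describes.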

OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment

Authors: Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, Haoyu Wang

First: 2025-10-09T03:31:26+00:00 · Latest: 2026-02-03T18:17:52+00:00

Comments: The first two authors contributed equally. Updated OpenRubrics dataset, RMs, and results

Abs · PDF · Code1 · Code2

Abstract

Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR) that uses structured criteria to capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further remove noisy rubrics via preserving preference-label consistency. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 8.4%. These gains transfer to policy models on instruction-following and biomedical benchmarks.

中文标题/摘要

标题：OpenRubrics：面向奖励建模和LLM对齐的大规模合成评分明细生成

奖励建模是人类反馈强化学习（RLHF）的核心，但现有的大多数奖励模型依赖于标量或成对判断，无法捕捉人类偏好中的多维性质。最近的研究探索了使用结构化标准的评分明细作为奖励（RaR），以捕捉响应质量的多个维度。然而，生成既可靠又可扩展的评分明细仍然是一个关键挑战。在本文中，我们介绍了OpenRubrics，这是一个多样化的大型（提示，评分明细）对集合，用于训练评分明细生成和基于评分明细的奖励模型。为了提取区分性和全面的评估信号，我们引入了对比评分明细生成（CRG），通过对比优选和拒绝的响应来推导出硬规则（显式约束）和原则（隐含品质）。我们进一步通过保持偏好标签一致性来去除噪声评分明细。在多个奖励建模基准测试中，我们的基于评分明细的奖励模型Rubric-RM超越了大小匹配的基线模型8.4%。这些收益在指令遵循和生物医学基准测试中的策略模型中得到了转移。

Summary / 总结

This paper addresses the challenge of generating reliable and scalable rubrics for reward modeling in reinforcement learning from human feedback (RLHF). It introduces OpenRubrics, a large dataset of (prompt, rubric) pairs, and a method called Contrastive Rubric Generation (CRG) to derive discriminative and comprehensive evaluation signals. The rubric-based reward model, Rubric-RM, outperforms strong baselines by 8.4% across multiple reward-modeling benchmarks and shows transferability to policy models on instruction-following and biomedical tasks.

该研究旨在生成可靠且可扩展的评分标准以用于强化学习中的人类反馈奖励建模。作者引入了OpenRubrics数据集，包含大量（提示，评分标准）对，并提出对比评分标准生成（CRG）方法来提取显性和隐性评价标准。基于评分标准的奖励模型Rubric-RM在多个奖励建模基准测试中比强基线模型高出8.4%，并在指令遵循和生物医药基准测试中的策略模型中表现出良好的迁移性。
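The noise-removal step (keeping only rubrics that preserve preference-label consistency) can be sketched as a simple filter. This is an illustrative reading of the abstract, not the authors' pipeline: the record fields and the `judge` callable are hypothetical, and `keyword_judge` is a toy stand-in for what would in practice be an LLM-based rubric grader.

```python
def filter_consistent_rubrics(examples, judge):
    """Keep examples whose rubric reproduces the human preference label.

    Each example is a dict with hypothetical keys:
    'rubric', 'response_a', 'response_b', and 'preferred' ('a' or 'b').
    `judge(rubric, response_a, response_b)` must return 'a' or 'b'.
    """
    kept = []
    for ex in examples:
        predicted = judge(ex["rubric"], ex["response_a"], ex["response_b"])
        if predicted == ex["preferred"]:
            kept.append(ex)
    return kept


def keyword_judge(rubric, response_a, response_b):
    """Toy judge: prefer the response that mentions more rubric criteria."""
    def score(response):
        return sum(1 for criterion in rubric if criterion.lower() in response.lower())
    return "a" if score(response_a) >= score(response_b) else "b"
```

A rubric that leads the judge to pick the rejected response is treated as noisy and dropped, which is the consistency idea the abstract names.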

Conformal Thinking: Risk Control for Reasoning on a Compute Budget

Authors: Xi Wang, Anushri Suresh, Alvin Zhang, Rishi More, William Jurayj, Benjamin Van Durme, Mehrdad Farajtabar, Daniel Khashabi, Eric Nalisnick

First: 2026-02-03T18:17:22+00:00 · Latest: 2026-02-03T18:17:22+00:00

Abs · PDF · Code1 · Code2

Abstract

Reasoning Large Language Models (LLMs) enable test-time scaling, with dataset-level accuracy improving as the token budget increases, motivating adaptive reasoning -- spending tokens when they improve reliability and stopping early when additional computation is unlikely to help. However, setting the token budget, as well as the threshold for adaptive reasoning, is a practical challenge that entails a fundamental risk-accuracy trade-off. We re-frame the budget setting problem as risk control, limiting the error rate while minimizing compute. Our framework introduces an upper threshold that stops reasoning when the model is confident (risking incorrect output) and a novel parametric lower threshold that preemptively stops unsolvable instances (risking premature stoppage). Given a target risk and a validation set, we use distribution-free risk control to optimally specify these stopping mechanisms. For scenarios with multiple budget controlling criteria, we incorporate an efficiency loss to select the most computationally efficient exiting mechanism. Empirical results across diverse reasoning tasks and models demonstrate the effectiveness of our risk control approach, demonstrating computational efficiency gains from the lower threshold and ensemble stopping mechanisms while adhering to the user-specified risk target.

中文标题/摘要

标题：共形思考：在计算预算内进行推理的风险控制

推理型大型语言模型（LLMs）支持测试时扩展：随着令牌预算的增加，数据集级别的准确率会提高，这促使了自适应推理，即在令牌能提高可靠性时继续花费，并在额外计算不太可能有帮助时提前停止。然而，设置令牌预算以及自适应推理的阈值是一个实际挑战，涉及风险与准确性之间的根本权衡。我们将预算设置问题重新定义为风险控制，在限制错误率的同时最小化计算量。我们的框架引入了一个上限阈值，在模型自信时停止推理（有输出错误的风险），以及一个新颖的参数化下限阈值，提前停止无法解决的实例（有过早停止的风险）。给定目标风险和验证集，我们使用无分布假设的风险控制来最优地设定这些停止机制。对于存在多个预算控制标准的场景，我们引入效率损失来选择计算效率最高的退出机制。跨多种推理任务和模型的实证结果表明了我们风险控制方法的有效性，展示了下限阈值和集成停止机制带来的计算效率提升，同时遵守用户指定的风险目标。

Summary / 总结

The paper addresses the challenge of setting a token budget for Large Language Models (LLMs) to balance accuracy and computational efficiency. It introduces a risk control framework that uses an upper threshold to stop reasoning when the model is confident and a lower threshold to preemptively stop unsolvable instances. The framework optimally specifies these thresholds to meet a user-specified risk target, demonstrating computational efficiency gains across various reasoning tasks and models.

论文解决了为推理型大型语言模型（LLMs）设置令牌预算以平衡准确性和计算成本的挑战。它引入了一个风险控制框架，包括一个上限阈值，在模型自信时停止推理，以及一个下限阈值，提前停止无法解决的实例。该框架利用验证集设定这些阈值，在满足用户指定的风险目标的同时最小化计算成本。实验结果显示，该方法在多种推理任务和模型上有效控制了风险并提高了计算效率。
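The idea of choosing a stopping threshold from a validation set under a target risk can be sketched as follows. This is a simplified illustration, not the paper's procedure: it calibrates only the upper (confidence) threshold, uses a Hoeffding bound with a top-down scan over a candidate grid, and the function name, `delta`, and the grid are assumptions introduced here.

```python
import math


def calibrate_stop_threshold(confidences, correct, target_risk, delta=0.1, grid=None):
    """Pick the lowest confidence threshold whose risk bound meets the target.

    Loss for threshold t on one example: 1 if the model would stop early
    (confidence >= t) with a wrong answer, else 0. This loss can only grow
    as t decreases, so thresholds are scanned from most to least
    conservative and the scan stops at the first one that fails.
    Returns None when no threshold in the grid can be certified.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need equally sized, non-empty validation lists")
    if grid is None:
        grid = [i / 100 for i in range(101)]
    # Hoeffding slack: with probability >= 1 - delta, true risk <= empirical + slack.
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    chosen = None
    for t in sorted(grid, reverse=True):
        errors = sum(1 for c, ok in zip(confidences, correct) if c >= t and not ok)
        if errors / n + slack <= target_risk:
            chosen = t
        else:
            break
    return chosen
```

A lower certified threshold lets the model stop earlier on more inputs, which is where the compute savings come from; with a very small validation set the slack term dominates and nothing can be certified.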

Antidistillation Fingerprinting

Authors: Yixuan Even Xu, John Kirchenbauer, Yash Savani, Asher Trockman, Alexander Robey, Tom Goldstein, Fei Fang, J. Zico Kolter

First: 2026-02-03T18:15:50+00:00 · Latest: 2026-02-03T18:15:50+00:00

Comments: 26 pages, 11 figures

Abs · PDF · Code1 · Code2

Abstract

Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model's architecture is unknown.

中文标题/摘要

标题：抗蒸馏指纹识别

模型蒸馏能够高效地模拟前沿的大语言模型（LLMs），因此需要稳健的机制来检测第三方学生模型是否在教师模型的输出上进行了训练。然而，现有的用于检测此类蒸馏的指纹识别技术依赖于启发式的扰动，这在生成质量和指纹识别强度之间造成了陡峭的权衡，通常需要显著降低实用性以确保指纹被学生模型有效地内化。我们提出了抗蒸馏指纹识别（ADFP），这是一种原理性的方法，将指纹识别目标与学生的学习动态相一致。基于抗蒸馏采样的梯度框架，ADFP 使用代理模型来识别并采样那些在微调后能够直接最大化学生模型中指纹可检测性的令牌，而不是依赖于对更简单的水印的非目标偏见的偶然吸收。在GSM8K和OASST1基准上的实验表明，ADFP 在保持最小实用性影响的情况下，显著优于最先进的基线方法，提供了更强的检测置信度，即使学生模型的架构未知。

Summary / 总结

The research introduces antidistillation fingerprinting (ADFP), a method designed to detect when a student model has been trained on a teacher model's outputs. ADFP aligns the fingerprinting objective with the student's learning dynamics, using a proxy model to sample tokens that maximize the fingerprint's detectability. Experiments show ADFP outperforms existing techniques, providing stronger detection with minimal impact on utility, even without knowing the student model's architecture.

研究旨在开发稳健的机制来检测模型蒸馏，即学生模型是否在教师模型的输出上进行了训练。论文引入了抗蒸馏指纹识别（ADFP），它将指纹识别目标与学生的学习动态对齐。ADFP 使用代理模型采样能最大化指纹可检测性的令牌，从而在对模型实用性影响最小的情况下提供更强的检测置信度。实验结果表明，ADFP 在 GSM8K 和 OASST1 基准上优于现有方法，即使学生模型的架构未知也是如此。

Enhancing Imbalanced Node Classification via Curriculum-Guided Feature Learning and Three-Stage Attention Network

Authors: Abdul Joseph Fofanah, Lian Wen, David Chen, Shaoyang Zhang

First: 2026-02-03T18:10:40+00:00 · Latest: 2026-02-03T18:10:40+00:00

Abs · PDF · Code1 · Code2

Abstract

Imbalanced node classification in graph neural networks (GNNs) happens when some labels are much more common than others, which causes the model to learn unfairly and perform badly on the less common classes. To solve this problem, we propose a Curriculum-Guided Feature Learning and Three-Stage Attention Network (CL3AN-GNN), a learning network that uses a three-step attention system (Engage, Enact, Embed) similar to how humans learn. The model begins by engaging with structurally simpler features, defined as (1) local neighbourhood patterns (1-hop), (2) low-degree node attributes, and (3) class-separable node pairs identified via initial graph convolutional networks and graph attention networks (GCN and GAT) embeddings. This foundation enables stable early learning despite label skew. The Enact stage then addresses complicated aspects: (1) connections that require multiple steps, (2) edges that connect different types of nodes, and (3) nodes at the edges of minority classes by using adjustable attention weights. Finally, Embed consolidates these features via iterative message passing and curriculum-aligned loss weighting. We evaluate CL3AN-GNN on eight Open Graph Benchmark datasets spanning social, biological, and citation networks. Experiments show consistent improvements across all datasets in accuracy, F1-score, and AUC over recent state-of-the-art methods. The model's step-by-step method works well with different types of graph datasets, showing quicker results than training everything at once, better performance on new, imbalanced graphs, and clear explanations of each step using gradient stability and attention correlation learning curves. This work provides both a theoretically grounded framework for curriculum learning in GNNs and practical evidence of its effectiveness against imbalances, validated through metrics, convergence speeds, and generalisation tests.

中文标题/摘要

标题：通过课程引导特征学习和三阶段注意力网络增强图神经网络中的不平衡节点分类

图神经网络（GNN）中的不平衡节点分类发生在某些标签远比其他标签常见时，这会导致模型学习不公平且在较少出现的类别上表现不佳。为了解决这一问题，我们提出了一种课程引导特征学习和三阶段注意力网络（CL3AN-GNN），这是一种使用三步注意力系统（参与、执行、嵌入）的学习网络，类似于人类的学习过程。模型首先通过结构上简单的特征进行学习，这些特征定义为（1）局部邻域模式（1-跳），（2）低度节点属性，以及（3）通过初始图卷积网络和图注意力网络（GCN和GAT）嵌入识别的可分节点对。这一基础使模型能够在标签分布不均的情况下稳定地早期学习。执行阶段则处理复杂方面：（1）需要多步的连接，（2）连接不同节点类型的边，以及（3）少数类节点，通过可调注意力权重。最后，嵌入阶段通过迭代消息传递和课程对齐的损失加权来整合这些特征。我们在八个跨越社交、生物和引用网络的开放图基准数据集上评估了CL3AN-GNN。实验结果显示，与最近的先进方法相比，该模型在所有数据集上的准确率、F1分数和AUC上都表现出一致的改进。该模型的逐步方法适用于不同类型的图数据集，比一次性训练所有内容更快，对新出现的不平衡图表现更好，并且通过梯度稳定性和注意力相关性学习曲线清晰地解释了每一步。这项工作为GNN中的课程学习提供了理论基础框架，并通过指标、收敛速度和泛化测试验证了其有效性。

Summary / 总结

The paper addresses the issue of imbalanced node classification in graph neural networks by proposing CL3AN-GNN, which uses a three-stage attention network (Engage, Enact, Embed) to handle different types of graph features. The Engage stage focuses on simpler features like local patterns and node attributes, while the Enact stage tackles more complex connections. The Embed stage consolidates these features using iterative message passing and curriculum-aligned loss weighting. Experiments on eight datasets show consistent improvements in accuracy, F1-score, and AUC over recent state-of-the-art methods, highlighting the effectiveness of the step-by-step approach in handling imbalanced data.

论文提出了一种名为CL3AN-GNN的方法，通过三阶段注意力系统（Engage、Enact、Embed）以类似课程学习的方式学习特征。Engage阶段处理简单特征，Enact阶段处理复杂特征，Embed阶段整合这些特征。实验结果显示，在八个数据集上，该方法在准确率、F1分数和AUC方面均优于最新方法，表明其在处理不平衡节点分类问题上的有效性。

Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

Authors: Ziru Chen, Dongdong Chen, Ruinan Jin, Yingbin Liang, Yujia Xie, Huan Sun

First: 2026-02-03T18:08:41+00:00 · Latest: 2026-02-03T18:08:41+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recently, there have been significant research interests in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinders wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. Also, we analyze LLMs' in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at https://github.com/OSU-NLP-Group/cobalt.

中文标题/摘要

标题：连接在线与离线RL：面向多轮代码生成的上下文赌博机学习

近年来，研究人员对使用强化学习（RL）训练大规模语言模型（LLMs）以完成实际任务（如多轮代码生成）产生了浓厚兴趣。尽管在线RL通常优于离线RL，但其较高的训练成本和不稳定性阻碍了其广泛应用。本文基于多轮代码生成可以被表述为一步可恢复的马尔可夫决策过程这一观察，提出了基于离线轨迹的上下文赌博机学习（Cobalt），这是一种结合在线和离线RL优点的新方法。Cobalt 首先使用参考LLM收集代码生成轨迹，并将其划分为部分轨迹作为上下文提示。然后，在在线赌博机学习过程中，LLM被训练为通过单步代码生成来完成每个部分轨迹提示。Cobalt 优于基于GRPO和VeRPO的两个多轮在线RL基线，并在LiveCodeBench上将R1-Distill 8B和Qwen3 8B的绝对Pass@1分数分别最多提高了9.0和6.2。此外，我们分析了LLMs的上下文内奖励作弊行为，并通过扰动轨迹来增强Cobalt的训练，以减轻这一问题。总体而言，我们的结果表明Cobalt 是多轮代码生成等迭代决策任务的一种有前景的解决方案。我们的代码和数据可在https://github.com/OSU-NLP-Group/cobalt 获取。

Summary / 总结

This paper addresses the challenge of training large language models (LLMs) with reinforcement learning (RL) for multi-turn code generation tasks. It proposes Cobalt, a method that combines the benefits of online and offline RL. Cobalt first collects trajectories using a reference LLM and divides them into partial prompts. During online bandit learning, the LLM is trained to complete these prompts. Cobalt outperforms two online RL baselines and improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. Additionally, the paper analyzes reward hacking behaviors and mitigates them through perturbed trajectories.

本文针对大型语言模型（LLMs）在多轮代码生成任务中使用强化学习（RL）的挑战，提出了一种结合在线和离线RL优点的方法Cobalt，即基于离线轨迹的上下文赌博机学习。Cobalt 先用参考LLM收集轨迹并将其划分为部分轨迹提示，再在在线赌博机学习中训练LLM通过单步代码生成完成这些提示。Cobalt 在 LiveCodeBench 上将 R1-Distill 8B 和 Qwen3 8B 的绝对 Pass@1 分数分别最多提升了 9.0 和 6.2。作者还分析了模型的上下文内奖励作弊行为，并通过在训练中加入扰动轨迹来缓解这一问题。
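The step of dividing offline trajectories into partial-trajectory prompts can be sketched in a few lines. This is an illustrative guess at the data layout, not code from the Cobalt repository: the `(code, feedback)` turn format, the prompt wording, and the choice to include the empty prefix are all assumptions.

```python
def partial_trajectory_prompts(task, turns):
    """Turn one multi-turn trajectory into single-step contextual prompts.

    `turns` is a list of (code, feedback) pairs produced by a reference LLM.
    Prompt k contains the task plus the first k turns, so a trajectory with
    T turns yields T prompts (prefix lengths 0 .. T-1). The policy is then
    asked for exactly one more attempt per prompt, which is what makes each
    training sample a one-step (bandit-style) decision.
    """
    prompts = []
    for k in range(len(turns)):
        lines = [f"Task: {task}"]
        for i, (code, feedback) in enumerate(turns[:k], start=1):
            lines.append(f"Attempt {i}:\n{code}")
            lines.append(f"Feedback {i}:\n{feedback}")
        lines.append("Write the next attempt:")
        prompts.append("\n".join(lines))
    return prompts
```

Because every prompt is completed with a single generation, the online phase never has to roll out a full multi-turn episode, which is one plausible source of the cost savings the abstract attributes to combining offline trajectories with online learning.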

Measuring Agents in Production

Authors: Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Joseph E. Gonzalez, Koushik Sen, Dawn Song, Ion Stoica, Matei Zaharia, Marquita Ellis

First: 2025-12-02T16:45:10+00:00 · Latest: 2026-02-03T18:06:26+00:00

Abs · PDF · Code1 · Code2

Abstract

LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployments successful. We present the first systematic study of Measuring Agents in Production, MAP, using first-hand data from agent developers. We conducted 20 case studies via in-depth interviews and surveyed 306 practitioners across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and their top development challenges. Our study finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability (consistent correct behavior over time) remains the top development challenge, which practitioners currently address through systems-level design. MAP documents the current state of production agents, providing the research community with visibility into deployment realities and under-explored research avenues.

中文标题/摘要

标题：生产中的代理测量

基于LLM的代理已经在许多行业中投入生产使用，但我们缺乏理解哪些技术方法使部署成功。我们首次通过代理开发者的第一手数据，系统研究了生产中的代理测量（MAP）。我们进行了20个案例研究，通过深入访谈，并对26个领域中的306名从业者进行了调查。我们探讨了组织为何构建代理、如何构建代理、如何评估代理以及他们面临的最大开发挑战。我们的研究发现，生产中的代理主要采用简单可控的方法构建：68%的代理在人类干预前最多执行10步，70%的代理依赖于调用现成模型而非权重调整，74%的代理主要依赖于人工评估。可靠性（时间上的一致正确行为）仍然是最大的开发挑战，从业者目前通过系统级设计来解决这一问题。MAP记录了生产代理的现状，为研究界提供了部署现实情况的可见性以及未被充分探索的研究方向。

Summary / 总结

This study investigates what makes LLM-based agents succeed in production by analyzing first-hand data from 20 case studies and a survey of 306 practitioners across 26 domains. It finds that production agents are built using simple, controllable approaches: 68% execute at most 10 steps before human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. The top development challenge is reliability, which practitioners address through systems-level design. The study documents the current state of production agents and highlights under-explored research areas.

本研究通过分析来自20个案例研究和306名从业者（覆盖26个领域）的第一手数据，探讨基于LLM的代理在生产中取得成功的因素。研究发现，生产代理采用简单可控的方法构建：68%的代理在人类干预前最多执行10步，70%依赖于提示现成模型而非权重调整，74%主要依赖人工评估。最大的开发挑战是确保可靠性，从业者通过系统级设计来解决这一问题。本研究记录了生产代理的现状，并指出了未被充分探索的研究领域。

FullStack-Agent: Enhancing Agentic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation

Authors: Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Mingjie Zhan, Hongsheng Li

First: 2026-02-03T18:01:34+00:00 · Latest: 2026-02-03T18:01:34+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Assisting non-expert users to develop complex interactive websites has become a popular task for LLM-powered code agents. However, existing code agents tend to only generate frontend web pages, masking the lack of real full-stack data processing and storage with fancy visual effects. Notably, constructing production-level full-stack web applications is far more challenging than only generating frontend web pages, demanding careful control of data flow, comprehensive understanding of constantly updating packages and dependencies, and accurate localization of obscure bugs in the codebase. To address these difficulties, we introduce FullStack-Agent, a unified agent system for full-stack agentic coding that consists of three parts: (1) FullStack-Dev, a multi-agent framework with strong planning, code editing, codebase navigation, and bug localization abilities. (2) FullStack-Learn, an innovative data-scaling and self-improving method that back-translates crawled and synthesized website repositories to improve the backbone LLM of FullStack-Dev. (3) FullStack-Bench, a comprehensive benchmark that systematically tests the frontend, backend and database functionalities of the generated website. Our FullStack-Dev outperforms the previous state-of-the-art method by 8.7%, 38.2%, and 15.9% on the frontend, backend, and database test cases respectively. Additionally, FullStack-Learn raises the performance of a 30B model by 9.7%, 9.5%, and 2.8% on the three sets of test cases through self-improvement, demonstrating the effectiveness of our approach. The code is released at https://github.com/mnluzimu/FullStack-Agent.

中文标题/摘要

标题：FullStack-Agent：通过开发导向的测试和仓库反向翻译提升代理全栈网页编码能力

利用大语言模型（LLM）的代码代理帮助非专家用户开发复杂的交互式网站已成为一项流行的任务。然而，现有的代码代理大多仅生成前端网页，通过花哨的视觉效果掩盖了实际全栈数据处理和存储的缺失。值得注意的是，构建生产级别的全栈网页应用程序远比仅生成前端网页更具挑战性，需要对数据流进行精细控制、全面理解不断更新的软件包和依赖关系，并准确定位代码库中的隐秘错误。为了解决这些困难，我们引入了FullStack-Agent，这是一个由三个部分组成的统一代理系统，用于全栈代理编码：（1）FullStack-Dev，一个具有强大规划、代码编辑、代码库导航和错误定位能力的多代理框架。（2）FullStack-Learn，一种创新的数据扩展和自我改进方法，通过反向翻译爬取和合成的网站仓库来提高FullStack-Dev的骨干LLM。（3）FullStack-Bench，一个全面的基准测试，系统地测试生成网站的前端、后端和数据库功能。我们的FullStack-Dev在前端、后端和数据库测试案例中分别优于之前最先进的方法8.7%、38.2%和15.9%。此外，FullStack-Learn通过自我改进在三个测试案例集上分别提高了30B模型的性能9.7%、9.5%和2.8%，证明了我们方法的有效性。代码发布在https://github.com/mnluzimu/FullStack-Agent。

Summary / 总结

FullStack-Agent is designed to enhance agentic full-stack web coding by integrating development-oriented testing and repository back-translation. It consists of FullStack-Dev, which has strong planning, code editing, and bug localization capabilities; FullStack-Learn, which improves the backbone LLM through back-translating crawled and synthesized website repositories; and FullStack-Bench, which tests the frontend, backend, and database functionalities. FullStack-Dev outperforms previous methods by 8.7%, 38.2%, and 15.9% on frontend, backend, and database test cases, respectively. FullStack-Learn also improves the performance of a 30B model by 9.7%, 9.5%, and 2.8% on the three sets of test cases, showing the effectiveness of the approach.

FullStack-Agent旨在通过集成开发导向的测试和仓库反向翻译来增强代理式全栈网页编程。它包括具有强大规划、代码编辑和错误定位能力的FullStack-Dev；通过反向翻译爬取和合成的网站仓库来提高骨干LLM的FullStack-Learn；以及全面的基准测试FullStack-Bench，用于测试生成网站的前端、后端和数据库功能。FullStack-Dev在前端、后端和数据库测试案例中的表现分别优于先前的方法8.7%、38.2%和15.9%。FullStack-Learn还通过自我改进将30B模型在三个测试案例集上的性能分别提高了9.7%、9.5%和2.8%，展示了该方法的有效性。

3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Authors: Zhixue Fang, Xu He, Songlin Tang, Haoxian Zhang, Qingfeng Li, Xiaoqiang Liu, Pengfei Wan, Kun Gai

First: 2026-02-03T17:59:09+00:00 · Latest: 2026-02-03T17:59:09+00:00

Comments: Project Page: https://hjrphoebus.github.io/3DiMo/

Abstract

Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.

中文标题/摘要

标题：基于3D感知的视点自适应人体视频生成运动控制

现有的人体运动控制方法在视频生成中通常依赖于2D姿态或显式的3D参数模型（例如SMPL）作为控制信号。然而，2D姿态刚性地将运动绑定到驱动视点，限制了新视点合成。尽管显式的3D模型在结构上具有信息性，但由于深度模糊和不准确的动力学等固有不准确性，在作为强约束使用时，会覆盖大型视频生成器的强大内在3D感知。在本文中，我们从3D感知的角度重新审视运动控制，提倡一种视点无关的隐式运动表示，这种表示自然地与生成器的空间先验对齐，而不是依赖于外部重建的约束。我们引入了3DiMo，它联合训练了一个运动编码器和一个预训练的视频生成器，将驱动帧提炼为紧凑的、视点无关的运动令牌，并通过交叉注意力语义地注入。为了促进3D感知，我们使用视点丰富的监督（即单视点、多视点和移动摄像机视频）进行训练，迫使不同视角下的运动一致性。此外，我们使用辅助几何监督，仅在早期初始化时利用SMPL，并逐渐减少到零，使模型能够从外部3D指导过渡到从数据和生成器的先验中学习真正的3D空间运动理解。实验结果证实，3DiMo能够灵活地根据文本驱动的相机控制准确地再现驱动运动，显著超越现有方法在运动保真度和视觉质量方面的表现。

Summary / 总结

This work addresses the limitations of existing methods for human motion control in video generation by proposing 3DiMo, which uses an implicit, view-agnostic motion representation. It jointly trains a motion encoder with a pretrained video generator to produce compact motion tokens that are semantically injected via cross-attention. The model is trained with view-rich supervision and auxiliary geometric guidance, which helps in developing 3D awareness. Experiments show that 3DiMo outperforms existing methods in motion fidelity and visual quality, allowing for flexible, text-driven camera control.

该研究针对现有2D姿态和显式3D参数模型在视频生成中的人体运动控制的局限性，提出了一种隐式的、视角无关的运动表示方法3DiMo，使其与生成器的空间先验对齐。该方法联合训练了一个运动编码器和一个预训练的视频生成器，使用丰富的视角监督和辅助几何指导来确保在不同视角下的运动一致性。实验表明，3DiMo在运动保真度和视觉质量方面显著优于现有方法，能够实现灵活的、基于文本的相机控制。
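
The abstract says the SMPL-based geometric supervision is used only for early initialization and is annealed to zero. A minimal sketch of one way such a schedule could look is given below. The linear shape, the function names `aux_weight` and `total_loss`, and the parameters `anneal_steps` and `w0` are all assumptions for illustration; the abstract does not specify the actual schedule.

```python
def aux_weight(step: int, anneal_steps: int, w0: float = 1.0) -> float:
    """Hypothetical linear schedule: weight falls from w0 to 0 over anneal_steps."""
    if anneal_steps <= 0:
        return 0.0
    # Clamp the step so the weight stays at 0 once annealing has finished.
    frac = min(max(step, 0), anneal_steps) / anneal_steps
    return w0 * (1.0 - frac)


def total_loss(gen_loss: float, geo_loss: float, step: int, anneal_steps: int) -> float:
    """Generator loss plus the annealed auxiliary geometric term (illustrative only)."""
    return gen_loss + aux_weight(step, anneal_steps) * geo_loss
```

With this kind of schedule, the auxiliary term shapes early training and then drops out entirely, leaving only the main generation objective.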

Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity

Authors: Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, Shangding Gu

First: 2026-02-03T17:58:10+00:00 · Latest: 2026-02-03T17:58:10+00:00

Abstract

LLM-based multi-agent systems (MAS) have emerged as a promising approach to tackle complex tasks that are difficult for individual LLMs. A natural strategy is to scale performance by increasing the number of agents; however, we find that such scaling exhibits strong diminishing returns in homogeneous settings, while introducing heterogeneity (e.g., different models, prompts, or tools) continues to yield substantial gains. This raises a fundamental question: what limits scaling, and why does diversity help? We present an information-theoretic framework showing that MAS performance is bounded by the intrinsic task uncertainty, not by agent count. We derive architecture-agnostic bounds demonstrating that improvements depend on how many effective channels the system accesses. Homogeneous agents saturate early because their outputs are strongly correlated, whereas heterogeneous agents contribute complementary evidence. We further introduce $K^*$, an effective channel count that quantifies the number of effective channels without ground-truth labels. Empirically, we show that heterogeneous configurations consistently outperform homogeneous scaling: 2 diverse agents can match or exceed the performance of 16 homogeneous agents. Our results provide principled guidelines for building efficient and robust MAS through diversity-aware design. Code and Dataset are available at the link: https://github.com/SafeRL-Lab/Agent-Scaling.

中文标题/摘要

标题：通过多样性理解基于LLM的多智能体系统中的智能体扩展

基于LLM的多智能体系统（MAS）已成为解决个体LLM难以处理的复杂任务的一种有前途的方法。一种自然策略是通过增加智能体的数量来扩展性能；然而，我们发现，在同质设置中，这种扩展表现出明显的递减回报，而引入异质性（例如，不同的模型、提示或工具）仍然能带来显著的收益。这提出了一个基本问题：什么限制了扩展，为什么多样性有助于此？我们提出了一种信息论框架，表明MAS的性能受到内在任务不确定性限制，而不是智能体数量的限制。我们推导出一种架构无关的上限，表明改进取决于系统访问的有效通道数量。同质智能体由于其输出高度相关而早期饱和，而异质智能体则提供互补的证据。我们进一步引入了$K^*$，这是一种有效通道计数，量化了在没有真实标签的情况下有效通道的数量。实证研究表明，异质配置始终优于同质扩展：2个不同的智能体可以匹配甚至超过16个同质智能体的性能。我们的结果为通过多样性感知设计构建高效和稳健的MAS提供了原则性的指导。代码和数据集可在以下链接获取：https://github.com/SafeRL-Lab/Agent-Scaling.

Summary / 总结

The paper explores the scaling of performance in LLM-based multi-agent systems (MAS) by increasing the number of agents or introducing diversity. It finds that homogeneous scaling shows diminishing returns, while diversity continues to improve performance. An information-theoretic framework is presented, showing that MAS performance is limited by task uncertainty rather than agent count. The study introduces $K^*$, an effective channel count, to quantify the number of effective channels. Experiments demonstrate that heterogeneous configurations outperform homogeneous scaling, with 2 diverse agents matching or exceeding the performance of 16 homogeneous agents.

该论文研究了基于LLM的多智能体系统（MAS）的性能扩展，并发现同质扩展显示出递减的回报，而多样性则提高性能。提出了信息论框架，表明MAS的性能受限于任务不确定性而非智能体数量。异质智能体提供互补的证据，导致更好的性能。实验表明，2个异质智能体可以匹配或超越16个同质智能体的性能，突显了在MAS设计中多样性的重要性。

BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks

Authors: Yixiang Chen, Peiyan Li, Jiabing Yang, Keji He, Xiangnan Wu, Yuan Xu, Kai Wang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

First: 2026-02-03T17:56:28+00:00 · Latest: 2026-02-03T17:56:28+00:00

Abstract

Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses on learning dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at https://BridgeV2W.github.io.

中文标题/摘要

标题：BridgeV2W：通过具身掩码将视频生成模型桥接到具身世界模型

具身世界模型已成为机器人领域的一个有前途的范式，其中大多数利用大规模互联网视频或预训练的视频生成模型来丰富视觉和运动先验。然而，它们仍然面临关键挑战：坐标空间动作与像素空间视频之间的不对齐、对摄像机视角的敏感性，以及不同具身形态之间架构不统一。为此，我们提出了BridgeV2W，它将坐标空间动作转换为根据URDF和摄像机参数渲染的像素对齐具身掩码。然后，这些掩码通过ControlNet风格的通路注入到预训练的视频生成模型中，从而使动作控制信号与预测的视频对齐，增加视角特定的条件以适应摄像机视角，并得到跨具身形态统一的世界模型架构。为了减轻对静态背景的过拟合，BridgeV2W进一步引入了一种基于流（flow）的运动损失，专注于学习动态且与任务相关的区域。在单臂（DROID）和双臂（AgiBot-G1）数据集上的实验涵盖了包含未见视角和场景的多样且具有挑战性的条件，结果表明BridgeV2W的视频生成质量优于先前的最先进方法。我们进一步展示了BridgeV2W在下游真实世界任务中的潜力，包括策略评估和目标条件规划。更多结果可以在我们的项目网站 https://BridgeV2W.github.io 上找到。

Summary / 总结

BridgeV2W addresses key challenges in embodied world models by converting coordinate-space actions into pixel-aligned embodiment masks, rendered from the URDF and camera parameters, which are then injected into a pretrained video generation model via a ControlNet-style pathway. This aligns action control signals with predicted videos, accommodates camera viewpoints, and yields a unified architecture across embodiments. Experiments on DROID and AgiBot-G1 show that BridgeV2W improves video generation quality over prior state-of-the-art methods, and the authors further demonstrate its potential on downstream tasks such as policy evaluation and goal-conditioned planning.

BridgeV2W通过将坐标空间的动作转换为像素对齐的具身掩码，然后通过ControlNet风格的路径注入到预训练的视频生成模型中，解决了具身世界模型中的挑战。该方法对齐了动作控制信号与预测视频，并添加了视角特定的条件，从而获得了一个统一的架构。实验表明，BridgeV2W在多样且具有挑战性的条件下（包括未见过的视角和场景）的视频生成质量优于先前的最先进方法，并展示了在策略评估和目标条件规划等下游任务中的潜力。

WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents

Authors: Xilong Wang, Yinuo Liu, Zhun Wang, Dawn Song, Neil Gong

First: 2026-02-03T17:55:04+00:00 · Latest: 2026-02-03T17:55:04+00:00

Abstract

Prompt injection attacks manipulate webpage content to cause web agents to execute attacker-specified tasks instead of the user's intended ones. Existing methods for detecting and localizing such attacks achieve limited effectiveness, as their underlying assumptions often do not hold in the web-agent setting. In this work, we propose WebSentinel, a two-step approach for detecting and localizing prompt injection attacks in webpages. Given a webpage, Step I extracts segments of interest that may be contaminated, and Step II evaluates each segment by checking its consistency with the webpage content as context. We show that WebSentinel is highly effective, substantially outperforming baseline methods across multiple datasets of both contaminated and clean webpages that we collected. Our code is available at: https://github.com/wxl-lxw/WebSentinel.

中文标题/摘要

标题：WebSentinel：检测和定位网页代理中的提示注入攻击

提示注入攻击操纵网页内容，使网页代理执行攻击者指定的任务而非用户的意图任务。现有检测和定位此类攻击的方法效果有限，因为它们的基本假设在网页代理环境中往往不成立。在本工作中，我们提出了一种名为WebSentinel的两步方法，用于检测和定位网页中的提示注入攻击。给定一个网页，第一步提取可能被污染的“感兴趣段落”，第二步通过检查这些段落与网页内容的一致性来评估每个段落。我们展示了WebSentinel的高度有效性，其在我们收集的受污染和未受污染网页数据集上显著优于基线方法。我们的代码可在：https://github.com/wxl-lxw/WebSentinel 获取。

Summary / 总结

WebSentinel is a two-step approach designed to detect and localize prompt injection attacks in webpages. The first step identifies segments of interest that may be contaminated, while the second step evaluates each segment for consistency with the webpage content. WebSentinel significantly outperforms baseline methods across various datasets, both contaminated and clean, demonstrating its effectiveness in detecting and localizing prompt injection attacks.

WebSentinel 是一个两步方法，用于检测和定位网页中的提示注入攻击。第一步提取可能被污染的感兴趣段落，第二步评估每个段落是否与网页内容一致。WebSentinel 在各种数据集中表现优异，无论是受污染的还是干净的网页。

Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

Authors: Aadi Palnitkar, Mingyang Mao, Nicholas Waytowich, Vinicius G. Goecks, Xiaomin Lin

First: 2026-01-29T15:11:31+00:00 · Latest: 2026-02-03T17:50:45+00:00

Abstract

As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs' ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.

中文标题/摘要

标题：Mil-SCORE：大型语言模型在地理空间推理和规划中的基准测试

随着大型语言模型（LLMs）被应用于越来越长且复杂的任务，对现实的长上下文基准测试的需求日益增长，这些基准测试需要选择性地阅读和整合异构的多模态信息源。对于地理空间规划问题，如大规模军事行动规划，这种需求尤为迫切，因为这些问题要求快速准确地在地图、命令、情报报告和其他分布式数据上进行推理。为了解决这一缺口，我们提出了MilSCORE（军事场景上下文推理），据我们所知，这是首个用于训练的专家撰写的、基于复杂模拟军事规划场景的多跳问题数据集。MilSCORE旨在评估高风险决策和规划，测试LLMs在多个来源上结合战术和空间推理以及在长时间尺度上进行地理空间推理的能力。基准测试包括七个类别中多种问题类型，涵盖事实回忆和关于约束、策略和空间分析的多步推理。我们提供了一套评估协议，并报告了多种当代视觉语言模型的基线结果。我们的研究结果突显了MilSCORE上的巨大改进空间，表明当前系统在现实的、场景级别的长上下文规划方面存在困难，并将MilSCORE定位为未来工作的具有挑战性的测试平台。

Summary / 总结

The research aims to evaluate large language models (LLMs) in handling long-context geospatial reasoning and planning tasks, particularly in military scenarios. The method involves creating MilSCORE, a dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario. Key findings show that current LLMs struggle with realistic, scenario-level long-context planning, indicating significant room for improvement.

研究旨在评估大型语言模型（LLMs）在处理长上下文地理空间推理和规划任务方面的能力，特别是针对军事操作。方法是创建MilSCORE，一个包含专家撰写的、基于复杂模拟军事规划场景的多跳问题的数据集。主要发现表明，当前的LLMs在处理现实的、场景级别的长上下文规划任务时存在显著困难，表明存在很大的改进空间。

Model Optimization for Multi-Camera 3D Detection and Tracking

Authors: Ethan Anderson, Justin Silva, Kyle Zheng, Sameer Pusegaonkar, Yizhou Wang, Zheng Tang, Sujit Biswas

First: 2026-01-31T01:51:30+00:00 · Latest: 2026-02-03T17:47:07+00:00

Abstract

Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.

中文标题/摘要

标题：多相机3D检测与跟踪模型优化

由外向内（outside-in）的多相机感知在室内环境中日益重要，其中由静态相机组成的网络必须在遮挡和异构视角下支持多目标跟踪。我们评估了Sparse4D，这是一种基于查询的时空3D检测与跟踪框架，它在共享世界坐标系中融合多视角特征，并通过实例记忆传播稀疏对象查询。我们研究了降低输入帧率、训练后量化（INT8和FP8）、向WILDTRACK基准的迁移以及Transformer Engine混合精度微调。为了更好地刻画身份稳定性，我们报告了平均跟踪持续时间（AvgTrackDur），它以秒为单位衡量身份的持续性。Sparse4D在适度降低FPS时保持稳定，但低于2 FPS时，即使检测稳定，身份关联也会崩溃。对主干和颈部进行选择性量化提供了最佳的速度-精度权衡，而与注意力相关的模块始终对低精度敏感。在WILDTRACK上，低FPS预训练相对于基础检查点带来了显著的零样本增益，而小规模微调提供的额外收益有限。Transformer Engine混合精度降低了延迟并提高了相机数量的可扩展性，但可能导致身份传播不稳定，因此需要进行稳定性感知的验证。

Summary / 总结

The research aims to optimize multi-camera 3D detection and tracking in indoor environments, focusing on Sparse4D, a query-based spatiotemporal framework. It evaluates the framework's performance under reduced frame rates, post-training quantization, and fine-tuning with Transformer Engine. Key findings include Sparse4D's stability under moderate FPS reductions but identity association collapse below 2 FPS. Selective quantization of backbone and neck modules provides the best speed-accuracy trade-off, while attention modules are sensitive to low precision. On WILDTRACK, low-FPS pretraining offers significant zero-shot gains, while fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency but can destabilize identity propagation, highlighting the need for stability-aware validation.

研究旨在优化室内环境下的多摄像头3D检测与跟踪，重点是Sparse4D，一个基于查询的时空框架。研究评估了该框架在降低输入帧率、后训练量化和Transformer Engine混合精度微调下的表现。关键发现包括Sparse4D在中等帧率减少下保持稳定，但低于2帧/秒时身份关联会崩溃。选择性量化骨干和颈部模块提供了最佳的速度-准确度权衡，而注意力相关的模块对低精度敏感。在WILDTRACK上，低帧率预训练提供了显著的零样本增益，而小规模微调提供的额外益处有限。Transformer Engine混合精度降低了延迟，但可能使身份传播不稳定，强调了需要稳定性意识验证的重要性。
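
The abstract describes Average Track Duration (AvgTrackDur) only as a measure of identity persistence in seconds. The sketch below shows one plausible way to compute such a quantity from per-track frame indices. The function name `avg_track_duration`, the input layout, and the duration formula are assumptions for illustration and may differ from the paper's exact definition.

```python
from typing import Dict, List


def avg_track_duration(tracks: Dict[int, List[int]], fps: float) -> float:
    """Mean lifetime of predicted track IDs, in seconds.

    `tracks` maps a track ID to the frame indices where that ID appears.
    Assumed duration per track: (last_frame - first_frame + 1) / fps.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Skip empty tracks so max()/min() never see an empty list.
    durations = [(max(frames) - min(frames) + 1) / fps
                 for frames in tracks.values() if frames]
    return sum(durations) / len(durations) if durations else 0.0
```

Under this assumed definition, an identity switch splits one long track into several short ones and lowers the average. That matches the abstract's motivation for reporting the metric alongside detection quality.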

AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration

Authors: Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Bang Liu, Chenglin Wu, Yuyu Luo, Jiayi Zhang

First: 2026-02-03T17:46:16+00:00 · Latest: 2026-02-03T17:46:16+00:00

Abstract

Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long-horizon tasks has driven the rise of a sub-agent-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic abstraction view of sub-agents, thereby hurting adaptability. We address this challenge with a unified, framework-agnostic agent abstraction that models any agent as a tuple (Instruction, Context, Tools, Model). This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system AOrchestra, where the central orchestrator concretizes the tuple at each step: it curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation. This design reduces human engineering effort and remains framework-agnostic, with plug-and-play support for diverse agents as task executors. It also enables a controllable performance-cost trade-off, allowing the system to approach Pareto efficiency. Across three challenging benchmarks (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini-3-Flash. The code is available at: https://github.com/FoundationAgents/AOrchestra

中文标题/摘要

标题：AOrchestra: 自动化子代理创建以实现自主编排

语言代理在任务自动化方面展现了强大的潜力。为了在日益复杂的长程任务中实现这一潜力，“子代理即工具”范式应运而生，用于多轮次任务解决。然而，现有的设计仍然缺乏对子代理的动态抽象视图，从而影响了适应性。我们通过一个统一的、框架无关的代理抽象来应对这一挑战，将任何代理建模为（指令、上下文、工具、模型）的元组。这个元组充当了能力组合的配方，使系统能够按需为每个任务生成专门的执行器。基于这一抽象，我们引入了代理式系统AOrchestra，其中中央协调器在每一步具体化该元组：它精选任务相关的上下文，选择工具和模型，并通过即时的自动代理创建来委派执行。这样的设计能够减少人工工程投入，并且保持框架无关性，以即插即用的方式支持多种代理作为任务执行器。它还能够实现可控的性能-成本权衡，使系统能够接近帕累托最优。在三个具有挑战性的基准测试（GAIA、SWE-Bench、Terminal-Bench）中，AOrchestra在与Gemini-3-Flash配对时，相对最强基线实现了16.28%的相对提升。代码可在：https://github.com/FoundationAgents/AOrchestra 获取

Summary / 总结

AOrchestra automates the creation of sub-agents for complex, long-horizon tasks by modeling any agent as a tuple of Instruction, Context, Tools, and Model. At each step, its central orchestrator instantiates this tuple by curating task-relevant context, selecting tools and models, and delegating execution to a specialized executor created on the fly, which reduces human engineering effort. Across three benchmarks (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves a 16.28% relative improvement over the strongest baseline when paired with Gemini-3-Flash.

AOrchestra通过将代理建模为指令、上下文、工具和模型的元组来自动化子代理的创建，以应对复杂的任务解决需求。这使得系统能够为每个任务动态生成专门的执行器。在三个基准测试中，AOrchestra与Gemini-3-Flash配对时，相对最强基线取得了16.28%的相对提升，并支持可控的性能-成本权衡。
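
The agent abstraction described in the abstract, a tuple of Instruction, Context, Tools, and Model that an orchestrator fills in at each step, can be pictured with a small data structure. The sketch below is only an illustration under stated assumptions: `AgentSpec`, `spawn_executor`, `orchestrate`, the keyword-based context filter, and the `placeholder-model` name are all hypothetical and are not taken from the AOrchestra codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical container for the (Instruction, Context, Tools, Model) tuple."""
    instruction: str
    context: List[str]
    tools: Dict[str, Callable[[str], str]]
    model: str


def spawn_executor(spec: AgentSpec) -> Callable[[str], str]:
    """Build a throwaway executor from a spec (a stub standing in for an LLM call)."""
    def run(task: str) -> str:
        # A real system would invoke `spec.model` here; this stub only reports
        # which parts of the tuple the executor was built from.
        tool_names = ",".join(sorted(spec.tools))
        return (f"[{spec.model}] {spec.instruction}: {task} "
                f"(tools={tool_names}, ctx={len(spec.context)})")
    return run


def orchestrate(task: str, notes: List[str]) -> str:
    """Toy orchestrator step: curate context, pick tools and a model, delegate."""
    words = task.lower().split()
    # Naive relevance filter (an assumption): keep notes sharing a word with the task.
    context = [n for n in notes if any(w in n.lower() for w in words)]
    tools = {"search": lambda q: f"results for {q}"}
    spec = AgentSpec("Solve the sub-task", context, tools, model="placeholder-model")
    return spawn_executor(spec)(task)
```

Treating the specification as plain data makes the executor a function of the tuple alone, which is one way to read the abstract's claim that the abstraction is framework-agnostic.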

Moonworks Lunara Aesthetic II: An Image Variation Dataset

Authors: Yan Wang, Partho Hassan, Samiha Sadeka, Nada Soliman, M M Sayeef Abdullah, Sabit Hassan

First: 2026-02-02T05:37:28+00:00 · Latest: 2026-02-03T17:45:15+00:00

Abstract

We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood, while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara's signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: https://huggingface.co/datasets/moonworks/lunara-aesthetic-image-variations.

中文标题/摘要

标题：Moonworks Lunara美学II：图像变体数据集

我们介绍了Lunara Aesthetic II，这是一个公开发布、伦理来源的图像数据集，旨在支持现代图像生成和编辑系统中上下文一致性控制评估和学习。该数据集包含2,854个锚链接变体对，源自Moonworks创作的原始艺术和摄影作品。每个变体对应用了如光照、天气、视角、场景构图、色彩基调或情绪等上下文变换，同时保持稳定的底层身份。Lunara Aesthetic II将身份保留的上下文变体作为监督信号进行操作，同时保留Lunara的高美学评分。结果显示，身份稳定性高，目标属性实现能力强，美学特征稳健，超过大规模网络数据集。Lunara Aesthetic II在Apache 2.0许可证下发布，旨在用于图像生成和图像到图像系统的基准测试、微调和分析，这些系统具有可解释的、关系型的监督信号，以评估上下文泛化、身份保留和编辑稳健性。数据集可在以下网址获取：https://huggingface.co/datasets/moonworks/lunara-aesthetic-image-variations。

Summary / 总结

Lunara Aesthetic II is a publicly available image dataset designed to evaluate and improve contextual consistency in image generation and editing systems. It consists of 2,854 pairs of images that apply various contextual transformations while maintaining a consistent identity. The dataset shows high identity stability and strong realization of target attributes, with an aesthetic profile surpassing large web datasets. It is intended for benchmarking and fine-tuning of image generation systems and is released under the Apache 2.0 license.

Lunara Aesthetic II 是一个公开可用的图像数据集，旨在评估和学习图像生成和编辑中的上下文一致性。它包含 2,854 对图像，每对图像应用上下文变换的同时保持基础身份不变。该数据集展示了高度的身份稳定性以及目标属性的强大实现能力，其美学特征超越了大规模网络数据集。它旨在用于评估上下文泛化、身份保持和编辑鲁棒性等图像生成系统的性能。

Context Compression via Explicit Information Transmission

Authors: Jiangnan Ye, Hanqi Yan, Zhenyi Shen, Heng Chang, Ye Mao, Yulan He

First: 2026-02-03T17:44:12+00:00 · Latest: 2026-02-03T17:44:12+00:00

Abstract

Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches, motivating context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically re-purpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to iteratively aggregate information. We argue that this paradigm suffers from two structural limitations: (i) progressive representation overwriting across layers and (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that formulates soft compression into a new paradigm: explicit information transmission over frozen LLM hidden states. This decouples compression from the model's internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission to selectively transmit multi-layer information into token anchors, mitigating progressive overwriting, and (ii) width-wise transmission to aggregate anchors into a small number of slots via a globally optimized transmission plan, ensuring coordinated allocation of information. Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters, demonstrating that explicit and coordinated information transmission enables more effective and robust long-context compression.

中文标题/摘要

标题：通过显式信息传输进行上下文压缩

由于二次注意力和不断增长的关键值缓存，大型语言模型（LLMs）处理长上下文推理成本高昂，因此推动了上下文压缩的需求。在本文中，我们研究了软上下文压缩，即将长上下文压缩为一组连续表示。现有方法通常将LLM本身重新用于可训练的压缩器，依赖逐层自注意力逐步聚合信息。我们认为这种方法存在两个结构限制：（i）逐层的表示覆盖（ii）令牌间压缩能力的不协调分配。我们提出了ComprExIT（通过显式信息传输进行上下文压缩），这是一种轻量级框架，将软压缩重新定义为新的范式：在冻结的LLM隐藏状态上进行显式信息传输。这将压缩与模型内部的自注意力动态解耦。ComprExIT执行（i）深度传输以选择性地将多层信息传输到令牌锚点，减轻了逐层覆盖，（ii）宽度传输以通过全局优化的传输计划将锚点聚合为少量槽，确保信息的协调分配。在六个问答基准测试中，ComprExIT在引入仅约1%额外参数的情况下始终优于最先进的上下文压缩方法，证明了显式和协调的信息传输能够实现更有效和稳健的长上下文压缩。

Summary / 总结

This work addresses the computational challenges of long-context inference with Large Language Models (LLMs) by proposing ComprExIT, a framework for soft context compression. ComprExIT overcomes the limitations of existing methods by explicitly transmitting information over frozen LLM hidden states, which mitigates progressive overwriting and ensures coordinated information allocation. Experiments on six question-answering benchmarks show that ComprExIT outperforms state-of-the-art methods with minimal additional parameters.

本文提出ComprExIT框架以解决大型语言模型（LLM）在长上下文推理中的计算成本问题。ComprExIT通过显式地在冻结的LLM隐藏状态之间传输信息，克服了现有方法的局限性，使用深度传输和宽度传输来缓解逐步覆盖并确保信息的协调分配。在六个问答基准测试上的实验表明，ComprExIT在少量额外参数的情况下优于最先进的方法。
