arXiv Paper Digest

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Authors: Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang

First: 2026-01-14T18:59:59+00:00 · Latest: 2026-01-14T18:59:59+00:00

Comments: Project page: https://jasper0314-huang.github.io/fast-thinkact/

Abstract

Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.

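Title / Abstract (translated from Chinese)

Title: Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. Recent work on reasoning VLAs shows that explicit chain-of-thought (CoT) can improve generalization, but the long reasoning traces lead to high inference latency. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet high-performing planning through verbalizable latent reasoning. Fast-ThinkAct distills from a teacher, using a preference-guided objective to align manipulation trajectories, and thereby transfers both linguistic and visual planning capabilities for embodied control. This lets reasoning-enhanced policy learning connect compact reasoning to action execution. Extensive experiments on a range of embodied manipulation and reasoning benchmarks show that Fast-ThinkAct reduces inference latency by up to 89.3% relative to state-of-the-art reasoning VLAs while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.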

Summary

Fast-ThinkAct is designed to improve the efficiency of vision-language-action reasoning by using verbalizable latent reasoning, reducing inference latency by up to 89.3% compared to state-of-the-art methods while maintaining strong performance in long-horizon planning and few-shot adaptation. The framework learns efficient reasoning through a preference-guided objective that aligns manipulation trajectories, transferring both linguistic and visual planning capabilities for embodied control.

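Fast-ThinkAct is an efficient framework for vision-language-action tasks that reduces inference latency by up to 89.3% relative to state-of-the-art methods while maintaining strong performance. It learns compact latent reasoning through a teacher-student setup and a preference-guided objective that aligns manipulation trajectories, supporting effective long-horizon planning and few-shot adaptation.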

Value-Aware Numerical Representations for Transformer Language Models

Authors: Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu

First: 2026-01-14T18:59:14+00:00 · Latest: 2026-01-14T18:59:14+00:00

Abstract

Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.

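Title / Abstract (translated from Chinese)

Title: Value-Aware Numerical Representations for Transformer Language Models

Transformer-based language models often achieve strong results on mathematical reasoning benchmarks yet remain fragile on basic numerical understanding and arithmetic. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, which leads to systematic errors. We propose a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve basic numerical robustness in language models.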

Summary

The research addresses the fragility of transformer-based language models in numerical understanding and arithmetic operations, where numbers are processed as symbolic tokens without explicit numerical value encoding. The study proposes a value-aware numerical representation that adds a prefix token embedding conditioned on the numerical value, enhancing the model's input space. Experiments on arithmetic tasks demonstrate that this approach surpasses baseline models across various numerical formats, tasks, and operand lengths, indicating the effectiveness of explicitly encoding numerical value for improving numerical robustness in language models.

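The study targets the fragility of Transformer-based language models in numerical understanding and arithmetic. It introduces a value-aware numerical representation that adds a prefix token to the input, with an embedding conditioned on the numerical value, so that magnitude information is injected directly into the model's input space. Experiments show that the method outperforms baselines across numerical formats, tasks, and operand lengths, indicating that explicitly encoding numerical value improves basic numerical robustness.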
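
To make the prefix-token idea concrete, here is a minimal sketch, assuming a hand-built feature map in place of the learned, value-conditioned embedding the abstract describes. The names `value_features` and `with_value_prefix`, the `<num>` token, and the signed-log sinusoidal encoding are all hypothetical; the abstract does not say how the embedding is computed from the value.

```python
import math
from typing import List, Optional, Tuple


def value_features(x: float, dim: int = 8) -> List[float]:
    """Map a number to a fixed-size vector that changes smoothly with magnitude.

    Assumption: a signed-log scale followed by sin/cos features. This is an
    illustrative stand-in, not the encoding used in the paper.
    """
    # Signed log compresses the dynamic range while keeping the sign.
    s = math.copysign(math.log1p(abs(x)), x)
    feats: List[float] = []
    for k in range(dim // 2):
        freq = 1.0 / (10.0 ** k)  # progressively coarser frequencies
        feats.append(math.sin(s * freq))
        feats.append(math.cos(s * freq))
    return feats


def with_value_prefix(
    tokens: List[str], value: float
) -> List[Tuple[str, Optional[List[float]]]]:
    """Prepend a hypothetical <num> token that carries the value features.

    Ordinary tokens carry None, meaning they would use their usual embedding.
    """
    prefix = ("<num>", value_features(value))
    return [prefix] + [(t, None) for t in tokens]


sequence = with_value_prefix(["1", "2", "3"], 123.0)
```

In this sketch the digit tokens are left untouched, which mirrors the abstract's claim of compatibility with existing tokenizers: only one extra position is added per number.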

ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation

Authors: Sicong Liu, Yanxian Huang, Mingwei Liu, Jiachi Chen, Ensheng Shi, Yuchi Ma, Hongyu Zhang, Yin Zhang, Yanlin Wang

First: 2026-01-14T18:57:31+00:00 · Latest: 2026-01-14T18:57:31+00:00

Abstract

Code generation tasks aim to automate the conversion of user requirements into executable code, significantly reducing manual development efforts and enhancing software productivity. The emergence of large language models (LLMs) has significantly advanced code generation, though their efficiency is still impacted by certain inherent architectural constraints. Each token generation necessitates a complete inference pass, requiring persistent retention of contextual information in memory and escalating resource consumption. While existing research prioritizes inference-phase optimizations such as prompt compression and model quantization, the generation phase remains underexplored. To tackle these challenges, we propose a knowledge-infused framework named ShortCoder, which optimizes code generation efficiency while preserving semantic equivalence and readability. In particular, we introduce: (1) ten syntax-level simplification rules for Python, derived from AST-preserving transformations, achieving 18.1% token reduction without functional compromise; (2) a hybrid data synthesis pipeline integrating rule-based rewriting with LLM-guided refinement, producing ShorterCodeBench, a corpus of validated tuples of original code and simplified code with semantic consistency; (3) a fine-tuning strategy that injects conciseness awareness into the base LLMs. Extensive experimental results demonstrate that ShortCoder consistently outperforms state-of-the-art methods on HumanEval, achieving an improvement of 18.1%-37.8% in generation efficiency over previous methods while ensuring the performance of code generation.

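Title / Abstract (translated from Chinese)

Title: ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation

Code generation tasks aim to automate the conversion of user requirements into executable code, significantly reducing manual development effort and improving software productivity. Large language models (LLMs) have greatly advanced code generation, although their efficiency is still limited by inherent architectural constraints. Each generated token requires a complete inference pass, with contextual information held in memory throughout, which drives up resource consumption. Existing research focuses on inference-phase optimizations such as prompt compression and model quantization, while the generation phase remains underexplored. To address this, we propose ShortCoder, a knowledge-infused framework that optimizes code generation efficiency while preserving semantic equivalence and readability. Specifically, we introduce: (1) ten syntax-level simplification rules for Python, derived from AST-preserving transformations, which achieve an 18.1% token reduction without compromising functionality; (2) a hybrid data synthesis pipeline that combines rule-based rewriting with LLM-guided refinement to produce ShorterCodeBench, a corpus of semantically consistent pairs of original and simplified code; (3) a fine-tuning strategy that injects conciseness awareness into base LLMs. Extensive experiments show that ShortCoder consistently outperforms state-of-the-art methods on HumanEval, improving generation efficiency by 18.1%-37.8% over previous methods while maintaining code generation performance.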

Summary

ShortCoder is a framework that optimizes code generation efficiency by applying syntax-level simplification rules and a hybrid data synthesis pipeline, yielding an 18.1% to 37.8% improvement in generation efficiency over previous methods on HumanEval while maintaining semantic equivalence and readability. The method comprises ten Python syntax simplification rules, a corpus of validated code pairs (ShorterCodeBench), and a fine-tuning strategy for LLMs.

ShortCoder is a knowledge-augmented framework that optimizes code generation efficiency through syntax-level simplification rules and a hybrid data synthesis pipeline, achieving an 18.1% to 37.8% improvement in generation efficiency on HumanEval while preserving semantic consistency and readability.
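
As an illustration of what a syntax-level simplification rule can look like, the sketch below rewrites `x = x + expr` into `x += expr` with Python's standard `ast` module. This rule is an assumption chosen for illustration; the abstract does not list the ten rules, and `AugAssignRule` and `simplify` are hypothetical names.

```python
import ast


class AugAssignRule(ast.NodeTransformer):
    """Rewrite `name = name <op> expr` as `name <op>= expr`.

    Caveat: for mutable objects such as lists, `+=` mutates in place, so the
    rewrite is only behavior-preserving when no other reference aliases the
    object. A production rule would need a type or alias check.
    """

    def visit_Assign(self, node: ast.Assign) -> ast.AST:
        self.generic_visit(node)
        target = node.targets[0] if len(node.targets) == 1 else None
        value = node.value
        if (
            isinstance(target, ast.Name)
            and isinstance(value, ast.BinOp)
            and isinstance(value.left, ast.Name)
            and value.left.id == target.id
        ):
            new_node = ast.AugAssign(
                target=ast.Name(id=target.id, ctx=ast.Store()),
                op=value.op,
                value=value.right,
            )
            return ast.copy_location(new_node, node)
        return node


def simplify(source: str) -> str:
    """Parse the source, apply the rule, and emit the shortened code."""
    tree = AugAssignRule().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


shortened = simplify("total = total + price * qty")
```

Assignments that do not match the pattern, such as `y = x + 1`, pass through unchanged, so the rule can be applied to a whole file without touching unrelated statements.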

Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments

Authors: Luca Castri, Gloria Beraldo, Nicola Bellotto

First: 2025-04-16T09:26:04+00:00 · Latest: 2026-01-14T18:52:06+00:00

Comments: Causal Discovery and Inference - Robot Autonomy - Human-Robot Spatial Interaction - Decision-Making

Abstract

The growing integration of robots in shared environments - such as warehouses, shopping centres, and hospitals - demands a deep understanding of the underlying dynamics and human behaviours, including how, when, and where individuals engage in various activities and interactions. This knowledge goes beyond simple correlation studies and requires a more comprehensive causal analysis. By leveraging causal inference to model cause-and-effect relationships, we can better anticipate critical environmental factors and enable autonomous robots to plan and execute tasks more effectively. To this end, we propose a novel causality-based decision-making framework that reasons over a learned causal model to assist the robot in deciding when and how to complete a given task. In the examined use case - i.e., a warehouse shared with people - we exploit the causal model to estimate battery usage and human obstructions as factors influencing the robot's task execution. This reasoning framework supports the robot in making informed decisions about task timing and strategy. To achieve this, we also developed PeopleFlow, a new Gazebo-based simulator designed to model context-sensitive human-robot spatial interactions in shared workspaces. PeopleFlow features realistic human and robot trajectories influenced by contextual factors such as time, environment layout, and robot state, and can simulate a large number of agents. While the simulator is general-purpose, in this paper we focus on a warehouse-like environment as a case study, where we conduct an extensive evaluation benchmarking our causal approach against a non-causal baseline. Our findings demonstrate the efficacy of the proposed solutions, highlighting how causal reasoning enables autonomous robots to operate more efficiently and safely in dynamic environments shared with humans.

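Title / Abstract (translated from Chinese)

Title: Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments

The growing integration of robots in shared environments such as warehouses, shopping centres, and hospitals demands a deep understanding of the underlying dynamics and human behaviours, including when and where individuals engage in various activities and interactions. This knowledge goes beyond simple correlation studies and requires a more comprehensive causal analysis. Using causal inference to model cause-and-effect relationships makes it possible to better anticipate critical environmental factors, so that autonomous robots can plan and execute tasks more effectively. To this end, we propose a novel causality-based decision-making framework that reasons over a learned causal model to help the robot decide when and how to complete a given task. In the examined use case, a warehouse shared with people, we use the causal model to estimate how battery usage and human obstructions affect the robot's task execution. The reasoning framework supports the robot in making informed decisions about task timing and strategy. We also developed PeopleFlow, a new Gazebo-based simulator for modelling context-sensitive human-robot spatial interactions in shared workspaces. PeopleFlow features realistic human and robot trajectories influenced by factors such as time, environment layout, and robot state, and can simulate a large number of agents. Although the simulator is general-purpose, this paper uses a warehouse-like environment as a case study and conducts an extensive evaluation that benchmarks our causal approach against a non-causal baseline. The results demonstrate the efficacy of the proposed solutions and highlight how causal reasoning enables autonomous robots to operate more efficiently and safely in dynamic environments shared with humans.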

Summary

The paper addresses the need for autonomous mobile robots to understand and predict human behaviors in shared environments through causal inference. It introduces a causality-based decision-making framework that uses a learned causal model to estimate factors like battery usage and human obstructions, aiding the robot in task planning. The framework was evaluated in a warehouse setting using a new simulator called PeopleFlow, which simulates human-robot interactions. The results show that causal reasoning enhances the robot's efficiency and safety in dynamic environments.

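The paper addresses the need for autonomous mobile robots to better understand and predict dynamics in shared environments such as warehouses. It proposes a decision-making framework based on causal inference that models cause-and-effect relationships so the robot can make informed decisions about task timing and strategy. The framework is evaluated with PeopleFlow, a new Gazebo-based simulator that models human-robot interactions in a warehouse environment. The results show that the causal approach outperforms a non-causal baseline, with higher efficiency and safety in dynamic spaces shared by humans and robots.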

LLM-Based Emulation of the Radio Resource Control Layer: Towards AI-Native RAN Protocols

Authors: Ziming Liu, Bryan Liu, Alvaro Valcarce, Xiaoli Chu

First: 2025-05-22T15:55:56+00:00 · Latest: 2026-01-14T18:50:49+00:00

Comments: This work has been submitted to the IEEE for possible publication. Focuses on applying LLMs to 5G RRC protocol generation; primary: cs.NI; cross-list: eess.SP, cs.LG

Abstract

Integrating Large AI Models (LAMs) into 6G mobile networks is a key enabler of the AI-Native Air Interface (AI-AI), where protocol intelligence must scale beyond handcrafted logic. This paper presents, to our knowledge, the first standards-compliant emulation of the Radio Resource Control (RRC) layer using a decoder-only LAM (LLAMA-class) fine-tuned with Low-Rank Adaptation (LoRA) on a multi-vendor corpus of real-world traces spanning both 5G and 4G systems. We treat RRC as a domain-specific language and construct a segmentation-safe question-answer (QA) dataset that preserves Abstract Syntax Notation One (ASN.1) structure through linearization prior to Byte Pair Encoding (BPE) tokenization. The proposed approach combines parameter-efficient adaptation with schema-bounded prompting to ensure syntactic and procedural fidelity. Evaluation introduces a standards-aware triad -- ASN.1 conformance, field-level coverage analysis, and uplink-to-downlink state-machine checks -- alongside semantic similarity and latency profiling across 120 configurations. On 30k 5G request-response pairs plus an additional 4.8k QA turns from 4G sessions, our 8B model achieves a median cosine similarity of 0.97, a 61% relative gain over a zero-shot baseline, while sustaining high conformance rates. These results demonstrate that LAMs, when augmented with protocol-aware reasoning, can directly orchestrate control-plane procedures, laying the foundation for the future Artificial Intelligence (AI)-native Radio Access Network (RAN).

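Title / Abstract (translated from Chinese)

Title: LLM-Based Emulation of the Radio Resource Control Layer: Towards AI-Native RAN Protocols

Integrating Large AI Models (LAMs) into 6G mobile networks is key to realizing the AI-Native Air Interface (AI-AI), where protocol intelligence must scale beyond handcrafted logic. This paper presents, to our knowledge, the first standards-compliant emulation of the Radio Resource Control (RRC) layer, using a decoder-only LAM (LLAMA-class) fine-tuned with Low-Rank Adaptation (LoRA) on a multi-vendor corpus of real-world traces. We treat RRC as a domain-specific language and construct a segmentation-safe question-answer (QA) dataset that preserves ASN.1 structure through linearization before Byte Pair Encoding (BPE) tokenization. The proposed approach combines parameter-efficient adaptation with schema-bounded prompting to ensure syntactic and procedural fidelity. Evaluation introduces a standards-aware triad (ASN.1 conformance, field-level coverage analysis, and uplink-to-downlink state-machine checks) alongside semantic similarity and latency profiling across 120 configurations. On 30,000 5G request-response pairs plus an additional 4,800 QA turns, our 8B model achieves a median cosine similarity of 0.97, a 61% relative gain over a zero-shot baseline, while sustaining high conformance rates. These results show that LAMs, when combined with protocol-aware reasoning, can directly orchestrate control-plane procedures, laying the foundation for the future AI-native Radio Access Network (RAN).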

Empathy Applicability Modeling for General Health Queries

Authors: Shan Randhawa, Agha Ali Raza, Kentaro Toyama, Julie Hui, Mustafa Naseem

Venue: ACL

First: 2026-01-14T18:47:02+00:00 · Latest: 2026-01-14T18:47:02+00:00

Comments: In Submission to ACL

Abstract

LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by Humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.

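Title / Abstract (translated from Chinese)

Title: Empathy Applicability Modeling for General Health Queries

Large language models (LLMs) are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors' responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries by the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by humans and GPT-4o. In the subset with human consensus, we also observe substantial alignment between humans and GPT-4o. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges (implicit distress, clinical-severity ambiguity, and contextual hardship), underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and supports empathetic communication in asynchronous healthcare.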

Summary

The research aims to address the lack of clinical empathy in LLMs used in clinical workflows by introducing the Empathy Applicability Framework (EAF), which classifies patient queries based on clinical, contextual, and linguistic cues. The framework is validated through classifiers trained on human and GPT-4o annotations, showing strong performance and outperforming baseline models. Key findings include challenges in identifying implicit distress, clinical-severity ambiguity, and contextual hardship, emphasizing the need for multi-annotator and clinician-in-the-loop approaches.

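The study aims to strengthen clinical empathy in LLMs by developing the Empathy Applicability Framework (EAF), which classifies patient queries based on clinical, contextual, and linguistic cues. It uses a benchmark of patient queries dual-annotated by humans and GPT-4o and shows that classifiers trained on human and GPT annotations outperform baseline models at predicting empathy applicability. Key findings include the challenges of handling implicit distress, clinical-severity ambiguity, and contextual hardship, which point to the need for multi-annotator modeling and clinician-in-the-loop calibration.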

LLMs can Compress LLMs: Adaptive Pruning by Agents

Authors: Sai Varun Kodathala, Rakesh Vunnam

First: 2026-01-14T18:45:36+00:00 · Latest: 2026-01-14T18:45:36+00:00

Comments: 17 Pages

Abstract

As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.

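Title / Abstract (translated from Chinese)

Title: LLMs can Compress LLMs: Adaptive Pruning by Agents

As large language models (LLMs) continue to scale, post-training pruning has emerged as a promising way to reduce computational cost while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to set per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer severe degradation of factual knowledge, with structured pruning methods experiencing near-total collapse in factual question answering. We propose agent-guided pruning, in which a foundation model acts as an adaptive pruning agent that selects which layers to prune at each iteration while preserving critical knowledge pathways. Our method builds layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent with self-reflection capabilities, which lets it learn from previous pruning outcomes and iteratively refine its strategy. A rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate the approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity and show substantial improvements over structured pruning baselines: a 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, the framework requires no retraining, operates in a model-agnostic manner, and shows effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.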

Summary

This study addresses the challenge of reducing computational costs in Large Language Models (LLMs) through adaptive pruning guided by an agent. The method constructs layer-wise sensitivity profiles from a combination of weight-activation metrics and gradient importance scores, which are processed by an LLM agent with self-reflection capabilities. Evaluated on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, the approach shows substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. The framework requires no retraining and self-corrects with only 2-4 rollbacks across 21-40 iterations.

The study tackles the computational cost of large language models (LLMs) with agent-guided adaptive pruning. The method builds layer-wise sensitivity profiles by combining weight-activation metrics with gradient importance scores, which are processed by an LLM agent with self-reflection capabilities. At approximately 45% sparsity on Qwen3 models, it improves on structured pruning baselines with a 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. The framework requires no retraining and shows effective self-correction with only 2-4 rollbacks.
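
The z-score normalization and the rollback check described in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions: the two signals are combined with equal weight, the rollback threshold is a 10% relative perplexity increase, and the function names are hypothetical. In the paper an LLM agent chooses the layers; here the scores are only computed.

```python
from statistics import mean, pstdev
from typing import List


def z_scores(values: List[float]) -> List[float]:
    """Standardize per-layer statistics so they are comparable across models."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all layers identical: no signal
    return [(v - mu) / sigma for v in values]


def layer_sensitivity(weight_act: List[float], grad_imp: List[float]) -> List[float]:
    """Combine two normalized signals per layer.

    Assumption: an equal-weight sum. The abstract says the metrics are
    combined but does not give the formula.
    """
    return [a + b for a, b in zip(z_scores(weight_act), z_scores(grad_imp))]


def should_rollback(ppl_before: float, ppl_after: float,
                    max_rel_increase: float = 0.10) -> bool:
    """Revert to the last checkpoint if perplexity rose by more than the threshold."""
    return (ppl_after - ppl_before) / ppl_before > max_rel_increase


# Hypothetical per-layer statistics for a four-layer model.
scores = layer_sensitivity([0.9, 0.2, 0.5, 0.4], [0.8, 0.1, 0.6, 0.5])
least_sensitive = min(range(len(scores)), key=scores.__getitem__)
```

Layers with the lowest combined score are the natural pruning candidates; the rollback check then guards each pruning step independently of how the candidate was chosen.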

Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design

Authors: Lisa Schneckenreiter, Sohvi Luukkonen, Lukas Friedrich, Daniel Kuhn, Günter Klambauer

First: 2026-01-14T18:45:08+00:00 · Latest: 2026-01-14T18:45:08+00:00

Comments: ELLIS ML4Molecules Workshop 2025, ELLIS Unconference, Copenhagen 2025

Abstract

Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for pre-defined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves state-of-the-art zero-shot virtual screening performance in settings where no binding pocket information is provided as input, substantially outperforms existing methods on a challenging target fishing task, and demonstrates competitive ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.

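Title / Abstract (translated from Chinese)

Title: Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design

Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, which limits their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder, which produces whole-protein representations and implicit embeddings of predicted binding sites, with a fast ligand encoder, removing the need for pre-defined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction as well as virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves state-of-the-art zero-shot virtual screening performance when no binding pocket information is provided, substantially outperforms existing methods on a challenging target fishing task, and shows competitive ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.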

Summary

This work addresses the limitations of traditional structure-based and ligand-based drug design by introducing ConGLUDe, a unified model that combines both approaches through contrastive geometric learning. ConGLUDe uses a geometric protein encoder for whole-protein and binding site representations, and a fast ligand encoder, eliminating the need for predefined binding pockets. It achieves state-of-the-art performance in zero-shot virtual screening, outperforms existing methods in target fishing, and demonstrates competitive ligand-conditioned pocket selection, highlighting the benefits of unified training for drug discovery.

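The study addresses the limitations of traditional structure-based and ligand-based drug design by introducing ConGLUDe, a unified model that combines both approaches. ConGLUDe uses a geometric protein encoder to produce whole-protein and binding-site representations, together with a fast ligand encoder, so no pre-defined binding pockets are needed. The model achieves state-of-the-art zero-shot virtual screening performance, substantially outperforms existing methods on target fishing, and shows competitive ligand-conditioned binding-site selection, highlighting the advantages of unified training as a step toward general-purpose foundation models for drug discovery.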

Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection

Authors: Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal

First: 2026-01-14T18:43:32+00:00 · Latest: 2026-01-14T18:43:32+00:00

Comments: Code: https://github.com/tianyiniu/RoutingGenData

Abstract

Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.

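Title / Abstract (translated from Chinese)

Title: Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection

Large language model (LLM) routers dynamically select the best model for a given input. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on queries and answers that generator LLMs produce from high-level task descriptions. We evaluate query-answer routers (which use both queries and labels) and query-only routers across four diverse benchmarks and 12 models, and find that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two key characteristics of effective generators: they must answer their own questions accurately, and their questions must produce sufficient performance differentiation across the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.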

Summary

The paper addresses the challenge of training LLM routers without access to ground-truth labeled data, a common issue in practice. It introduces RGD, a setting in which routers are trained exclusively on queries and answers produced by generator LLMs. The study evaluates both query-answer and query-only routers across four benchmarks and 12 models, showing that query-only routers are more robust to generator quality. It identifies two key characteristics of effective generators (they answer their own questions accurately, and their questions differentiate the models in the pool) and proposes CASCAL, a novel query-only router that uses consensus voting and hierarchical clustering. CASCAL outperforms the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.

The paper addresses training LLM routers without ground-truth labeled data, a common challenge in practice. It introduces RGD, in which routers are trained on queries and answers produced by generator LLMs. It evaluates query-answer and query-only routers across diverse benchmarks and models and finds that query-only routers depend less on generator quality. The authors propose CASCAL, which uses consensus voting and hierarchical clustering to identify model-specific skill niches and performs better even when trained on data from lower-quality generators.
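
The consensus-voting idea, estimating correctness without labels by checking agreement with the majority answer, can be sketched as below. This is a minimal sketch and not CASCAL itself: exact string matching, simple majority voting, and the name `consensus_correctness` are assumptions, and the hierarchical clustering step is omitted.

```python
from collections import Counter
from typing import Dict, List


def consensus_correctness(answers: Dict[str, List[str]]) -> Dict[str, float]:
    """Estimate each model's correctness as its agreement rate with the majority.

    `answers` maps a model name to its answers, one per query, in the same
    query order for every model. No ground-truth labels are used.
    """
    models = list(answers)
    n_queries = len(answers[models[0]])
    hits = {m: 0 for m in models}
    for q in range(n_queries):
        votes = Counter(answers[m][q] for m in models)
        majority, _ = votes.most_common(1)[0]  # pseudo-label for this query
        for m in models:
            if answers[m][q] == majority:
                hits[m] += 1
    return {m: hits[m] / n_queries for m in models}


# Hypothetical answers from three models on three generated queries.
estimates = consensus_correctness({
    "model_a": ["4", "B", "x"],
    "model_b": ["4", "B", "y"],
    "model_c": ["5", "B", "y"],
})
```

A router could then send a query to the model with the highest estimate for that query's cluster. Majority voting is only a proxy: when most models share the same mistake, the pseudo-label is wrong as well.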

Provable Acceleration of Distributed Optimization with Local Updates

Authors: Zuang Wang, Yongqiang Wang

First: 2026-01-06T22:10:11+00:00 · Latest: 2026-01-14T18:40:59+00:00

Abstract

In conventional distributed optimization, each agent performs a single local update between two communication rounds with its neighbors to synchronize solutions. Inspired by the success of using multiple local updates in federated learning, incorporating local updates into distributed optimization has recently attracted increasing attention. However, unlike federated learning, where multiple local updates can accelerate learning by improving gradient estimation under mini-batch settings, it remains unclear whether similar benefits hold in distributed optimization when gradients are exact. Moreover, existing theoretical results typically require reducing the step size when multiple local updates are employed, which can entirely offset any potential benefit of these additional local updates and obscure their true impact on convergence. In this paper, we focus on the classic DIGing algorithm and leverage the tight performance bounds provided by Performance Estimation Problems (PEP) to show that incorporating local updates can indeed accelerate distributed optimization. To the best of our knowledge, this is the first rigorous demonstration of such acceleration for a broad class of objective functions. Our analysis further reveals that, under an appropriate step size, performing only two local updates is sufficient to achieve the maximal possible improvement, and that additional local updates provide no further gains. Because more updates increase computational cost, these findings offer practical guidance for efficient implementation. Extensive experiments on both synthetic and real-world datasets corroborate the theoretical findings.

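Title / Abstract (translated from Chinese)

Title: Provable Acceleration of Distributed Optimization with Local Updates

In conventional distributed optimization, each agent performs a single local update between two rounds of communication with its neighbors to synchronize solutions. Inspired by the success of multiple local updates in federated learning, incorporating local updates into distributed optimization has recently attracted growing attention. However, unlike in federated learning, where multiple local updates can accelerate learning by improving gradient estimation under mini-batch settings, it remains unclear whether similar benefits hold in distributed optimization when gradients are exact. Moreover, existing theoretical results typically require a smaller step size when multiple local updates are used, which can entirely offset any potential benefit of the additional updates and obscure their true impact on convergence. In this paper, we focus on the classic DIGing algorithm and use the tight performance bounds provided by Performance Estimation Problems (PEP) to show that incorporating local updates can indeed accelerate distributed optimization. To the best of our knowledge, this is the first rigorous demonstration of such acceleration for a broad class of objective functions. Our analysis further shows that, under an appropriate step size, two local updates are enough to achieve the maximal possible improvement, and additional local updates bring no further gains. Because more updates increase computational cost, these findings offer practical guidance for efficient implementation. Extensive experiments corroborate the theoretical findings.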

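Summary

This paper investigates the impact of multiple local updates in distributed optimization, inspired by their use in federated learning. In federated learning, local updates help by improving gradient estimation under mini-batch settings; it has been unclear whether they also help in distributed optimization, where gradients are exact. Focusing on the DIGing algorithm and using the tight bounds from Performance Estimation Problems (PEP), the authors show that local updates can indeed accelerate distributed optimization, and that under an appropriate step size two local updates suffice to achieve the maximal possible improvement, with additional updates providing no further gains. Experiments on synthetic and real-world datasets corroborate these theoretical findings.

The paper studies the use of multiple local updates to accelerate convergence of the DIGing algorithm in distributed optimization. Whereas in federated learning multiple local updates help by improving gradient estimation, the study shows that acceleration can also be obtained in distributed optimization with exact gradients. Using Performance Estimation Problems, the authors show that two local updates suffice to achieve the maximal possible improvement and that additional updates bring no further gains. The finding offers practical guidance for efficient implementation and is supported by extensive experiments.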
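
For readers unfamiliar with DIGing, the sketch below runs the standard gradient-tracking iteration, with one local update per communication round, on a toy problem. It shows only the conventional baseline that the paper starts from; the multi-local-update variant is not reproduced because the abstract does not give its exact form. The scalar quadratic objectives, the mixing matrix, and the step size are assumptions chosen so the example is easy to check.

```python
from typing import List


def diging(b: List[float], W: List[List[float]],
           alpha: float = 0.1, iters: int = 300) -> List[float]:
    """Standard DIGing on local objectives f_i(x) = 0.5 * (x - b_i)^2.

    x_i is agent i's estimate and y_i tracks the average gradient.
    W must be doubly stochastic (rows and columns sum to 1).
    """
    n = len(b)

    def grad(i: int, x: float) -> float:
        return x - b[i]  # exact gradient of the local quadratic

    x = [0.0] * n
    y = [grad(i, x[i]) for i in range(n)]  # tracker starts at the local gradient
    for _ in range(iters):
        # Mix with neighbors, then step along the tracked gradient.
        x_new = [sum(W[i][j] * x[j] for j in range(n)) - alpha * y[i]
                 for i in range(n)]
        # Mix trackers and add the change in the local gradient.
        y = [sum(W[i][j] * y[j] for j in range(n))
             + grad(i, x_new[i]) - grad(i, x[i])
             for i in range(n)]
        x = x_new
    return x


# Three agents; the minimizer of the summed objective is mean(b) = 3.0.
W = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
solution = diging([1.0, 2.0, 6.0], W)
```

All three agents converge to the same point, the average of the `b` values, even though each agent only ever evaluates the gradient of its own objective.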

DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

Authors: Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing

First: 2026-01-14T18:38:31+00:00 · Latest: 2026-01-14T18:38:31+00:00

Comments: Source code: https://github.com/Infinity-AILab/DeepResearchEval

Abstract

Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline that generates realistic, complex research tasks anchored in diverse user profiles and applies a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking component that autonomously extracts and verifies report statements via web search, even when citations are missing.

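Title / Abstract (translated from Chinese)

Title: DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet evaluating them remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to verify facts reliably when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline that generates realistic, complex research tasks anchored in diverse user profiles and applies a two-stage filter (Task Qualification and Search Necessity) to retain only tasks that require multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights for each generated task, and an Active Fact-Checking component that autonomously extracts and verifies report statements via web search, even when citations are missing.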

Summary / 总结

DeepResearchEval is an automated framework for constructing and evaluating deep research tasks. It uses a persona-driven pipeline to generate realistic, complex tasks and a two-stage filter (Task Qualification and Search Necessity) to retain only tasks that require multi-source evidence integration and external retrieval. For evaluation, it employs an adaptive point-wise quality evaluation that derives task-specific dimensions, criteria, and weights for each task, and an active fact-checking mechanism that extracts and verifies report statements via web search even without citations. The abstract positions these components as remedies for the annotation-heavy task construction, static evaluation dimensions, and citation-dependent fact verification of existing benchmarks.

DeepResearchEval 是一个自动化框架，用于构建和评估深度研究任务。它采用基于人设的管道生成现实且复杂的任务，并通过两阶段过滤确保需要多源证据整合。在评估方面，它使用自适应质量评估和主动事实核查来动态推导任务特定的标准，并通过网络搜索自主验证陈述，即使缺少引文也能进行验证。主要发现包括相比现有基准提高了任务构建和评估的可靠性。
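
The Adaptive Point-wise Quality Evaluation described above derives task-specific dimensions and weights, but the abstract does not say how per-dimension scores are combined. A minimal sketch, assuming a simple weighted average; the dimension names, the weights, and the 0-10 scale are hypothetical and not taken from the paper:

```python
def weighted_quality_score(scores, weights):
    """Combine per-dimension scores into one number using normalized weights."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[d] * weights[d] for d in scores) / total_weight


# Hypothetical task-specific dimensions and weights (illustration only).
task_weights = {"coverage": 2.0, "accuracy": 3.0, "clarity": 1.0}
# Hypothetical per-dimension scores for one report, on a 0-10 scale.
report_scores = {"coverage": 8.0, "accuracy": 6.0, "clarity": 9.0}

overall = weighted_quality_score(report_scores, task_weights)
print(round(overall, 3))
```

Normalizing by the weight total keeps the overall score on the same scale as the per-dimension scores, so reports for tasks with different weight sets remain comparable.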

Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection

Authors: Ziyu Yang, Guibin Chen, Yuxin Yang, Aoxiong Zeng, Xiangquan Yang

First: 2026-01-14T18:36:22+00:00 · Latest: 2026-01-14T18:36:22+00:00

Comments: preprint

Abstract

Multi-Task Learning (MTL) combined with Low-Rank Adaptation (LoRA) has emerged as a promising direction for parameter-efficient deployment of Large Language Models (LLMs). By sharing a single adapter across multiple tasks, one can significantly reduce storage overhead. However, this approach suffers from negative transfer, where conflicting gradient updates from distinct tasks degrade the performance of individual tasks compared to single-task fine-tuning. This problem is exacerbated in LoRA due to the low-rank constraint, which limits the optimization landscape's capacity to accommodate diverse task requirements. In this paper, we propose Ortho-LoRA, a gradient projection method specifically tailored for the bipartite structure of LoRA. Ortho-LoRA dynamically projects conflicting task gradients onto the orthogonal complement of each other within the intrinsic LoRA subspace. Extensive experiments on the GLUE benchmark demonstrate that Ortho-LoRA effectively mitigates task interference, outperforming standard joint training and recovering 95% of the performance gap between multi-task and single-task baselines with negligible computational overhead.

中文标题/摘要

标题：通过正交梯度投影在多任务LoRA中解开任务冲突

多任务学习（MTL）结合低秩适应（LoRA）已成为参数高效部署大型语言模型（LLMs）的一个有前途的方向。通过在多个任务中共享一个适配器，可以显著减少存储开销。然而，这种方法会遭受负迁移的问题，即来自不同任务的冲突梯度更新会降低单个任务的性能，与单任务微调相比。由于LoRA中的低秩约束限制了优化景观容纳多样化任务需求的能力，这一问题在LoRA中被进一步加剧。在本文中，我们提出了一种名为Ortho-LoRA的梯度投影方法，专门针对LoRA的二分结构。Ortho-LoRA动态地将冲突任务的梯度投影到彼此的正交补空间中。在GLUE基准上的广泛实验表明，Ortho-LoRA有效地缓解了任务干扰，优于标准联合训练，并且在几乎无计算开销的情况下恢复了多任务和单任务基线之间的95%的性能差距。

Summary / 总结

This paper addresses task conflicts in multi-task learning with Low-Rank Adaptation (LoRA) by proposing Ortho-LoRA, a gradient projection method tailored to LoRA's bipartite structure. Ortho-LoRA projects conflicting task gradients onto the orthogonal complement of each other within the LoRA subspace to mitigate task interference. Experiments on the GLUE benchmark show that Ortho-LoRA outperforms standard joint training and recovers 95% of the performance gap between multi-task and single-task baselines with negligible computational overhead.

论文解决了多任务学习与低秩适应（LoRA）中的负迁移问题，即不同任务的冲突梯度更新会降低性能。它提出了一种梯度投影方法Ortho-LoRA，在LoRA子空间内将冲突任务的梯度投影到正交补空间，有效缓解了任务干扰。实验表明，Ortho-LoRA在GLUE基准上优于标准联合训练，并几乎恢复了多任务与单任务基线之间的性能差距，同时计算开销很小。
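
The projection step in the Ortho-LoRA abstract (removing from one task's gradient the component that conflicts with another's) can be illustrated on plain vectors. This is a minimal sketch in the style of PCGrad-type gradient surgery, assuming a conflict means a negative dot product; how Ortho-LoRA applies the projection to LoRA's two low-rank factors is not specified in the abstract, and the example gradients are made up:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out_conflict(g_i, g_j, eps=1e-12):
    """Return g_i with its component along g_j removed, but only when the two
    gradients conflict (negative dot product). Otherwise return g_i unchanged."""
    d = dot(g_i, g_j)
    norm_sq = dot(g_j, g_j)
    if d >= 0 or norm_sq < eps:
        return list(g_i)
    scale = d / norm_sq
    return [a - scale * b for a, b in zip(g_i, g_j)]


# Hypothetical gradients for two tasks over the same shared parameters.
g_a = [1.0, 0.0]
g_b = [-1.0, 1.0]

g_a_proj = project_out_conflict(g_a, g_b)  # [0.5, 0.5]
g_b_proj = project_out_conflict(g_b, g_a)  # [0.0, 1.0]
print(g_a_proj, g_b_proj)
```

After projection, each gradient is orthogonal to the other task's original gradient, so a step along it no longer directly opposes that task to first order.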

Automating Supply Chain Disruption Monitoring via an Agentic AI Approach

Authors: Sara AlMahri, Liming Xu, Alexandra Brintrup

First: 2026-01-14T18:28:31+00:00 · Latest: 2026-01-14T18:28:31+00:00

Abstract

Modern supply chains are increasingly exposed to disruptions ranging from geopolitical events, demand shocks, and trade restrictions to natural disasters. While many of these disruptions originate deep in the supply network, most companies still lack visibility beyond Tier-1 suppliers, leaving upstream vulnerabilities undetected until the impact cascades downstream. To overcome this blind spot and move from reactive recovery to proactive resilience, we introduce a minimally supervised agentic AI framework that autonomously monitors, analyses, and responds to disruptions across extended supply networks. The architecture comprises seven specialised agents powered by large language models and deterministic tools that jointly detect disruption signals from unstructured news, map them to multi-tier supplier networks, evaluate exposure based on network structure, and recommend mitigations such as alternative sourcing options. We evaluate the framework across 30 synthesised scenarios covering three automotive manufacturers and five disruption classes. The system achieves high accuracy across core tasks, with F1 scores between 0.962 and 0.991, and performs full end-to-end analyses in a mean of 3.83 minutes at a cost of $0.0836 per disruption. Relative to industry benchmarks of multi-day, analyst-driven assessments, this represents a reduction of more than three orders of magnitude in response time. A real-world case study of the 2022 Russia-Ukraine conflict further demonstrates operational applicability. This work establishes a foundational step toward building resilient, proactive, and autonomous supply chains capable of managing disruptions across deep-tier networks.

中文标题/摘要

标题：通过代理人工智能方法自动化供应链中断监测

现代供应链越来越容易受到地缘政治事件、需求冲击、贸易限制以及自然灾害等中断的影响。虽然许多这些中断源自供应链的深处，但大多数公司仍然缺乏对一级供应商以上的可见性，直到中断的影响波及下游时，上游的漏洞才被发现。为了克服这一盲点，从被动恢复转向主动韧性，我们引入了一个最小监督的代理人工智能框架，该框架能够自主监测、分析和应对扩展供应链网络中的中断。该架构由七个由大型语言模型和确定性工具驱动的专业代理组成，它们联合从非结构化新闻中检测中断信号，将它们映射到多级供应商网络，根据网络结构评估暴露程度，并推荐替代采购等缓解措施。我们在涵盖三家汽车制造商和五类中断的30个合成场景中评估了该框架。系统在核心任务上的准确率很高，F1分数在0.962到0.991之间，平均在3.83分钟内完成端到端分析，成本为每中断0.0836美元。与多天、分析师驱动的评估相比，响应时间减少了三个数量级。2022年俄罗斯-乌克兰冲突的实际案例进一步证明了其操作适用性。这项工作为构建能够管理深层网络中断的弹性、主动和自主供应链奠定了基础。

Summary / 总结

The paper addresses the challenge of monitoring supply chain disruptions by introducing an agentic AI framework that autonomously detects, analyzes, and responds to disruptions across extended supply networks. The framework uses seven specialized agents to process unstructured news, map disruptions to supplier networks, and recommend mitigations. Evaluations across 30 synthesized scenarios show high accuracy and fast response times, significantly reducing the time needed for disruption management compared to industry benchmarks.

论文提出了一种基于代理AI的框架，用于监控和应对供应链中断，解决了一级供应商之外的可见性不足问题。该框架使用七个专门的代理来检测、分析并推荐缓解措施。在30个合成场景中的评估显示，准确率很高，F1分数在0.962到0.991之间，平均分析时间为3.83分钟，每起中断的成本为0.0836美元，相比行业基准显著缩短了响应时间。
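
The "more than three orders of magnitude" claim in the abstract can be checked with simple arithmetic. The abstract gives the system's mean analysis time (3.83 minutes) but not an exact industry baseline, so the three-day figure below is an assumption standing in for "multi-day":

```python
import math


def orders_of_magnitude_reduction(baseline_minutes, system_minutes):
    """Base-10 logarithm of the speed-up ratio between baseline and system."""
    return math.log10(baseline_minutes / system_minutes)


system_minutes = 3.83  # mean end-to-end time reported in the abstract
baseline_days = 3      # assumption: "multi-day" taken as three days
baseline_minutes = baseline_days * 24 * 60

reduction = orders_of_magnitude_reduction(baseline_minutes, system_minutes)
print(f"speed-up ratio: {baseline_minutes / system_minutes:.0f}x")
print(f"orders of magnitude: {reduction:.2f}")
```

Under this assumption the ratio is just above 1000x. Any baseline longer than 3,830 minutes (about 2.66 days) clears the three-orders-of-magnitude mark.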

OptiMind: Teaching LLMs to Think Like Optimization Experts

Authors: Xinzhi Zhang, Zeyi Chen, Humishka Zope, Hugo Barbalho, Konstantina Mellou, Marco Molinaro, Janardhan Kulkarni, Ishai Menache, Sirui Li

First: 2025-09-26T22:23:12+00:00 · Latest: 2026-01-14T18:26:45+00:00

Abstract

Mathematical programming -- the task of expressing operations and decision-making problems in precise mathematical language -- is fundamental across domains, yet remains a skill-intensive process requiring operations research expertise. Recent advances in large language models for complex reasoning have spurred interest in automating this task, translating natural language into executable optimization models. Current approaches, however, achieve limited accuracy, hindered by scarce and noisy training data without leveraging domain knowledge. In this work, we systematically integrate optimization expertise to improve formulation accuracy for mixed-integer linear programming, a key family of mathematical programs. Our OptiMind framework leverages semi-automated, class-based error analysis to guide both training and inference, explicitly preventing common mistakes within each optimization class. Our resulting fine-tuned LLM significantly improves formulation accuracy by 20.7% across multiple optimization benchmarks, with consistent gains under test-time scaling methods such as self-consistency and multi-turn feedback, enabling further progress toward robust LLM-assisted optimization formulation.

中文标题/摘要

标题：OptiMind：教LLM像优化专家一样思考

数学规划——将操作和决策问题精确地表达成数学语言的任务——在各个领域都是基础性的，但仍然是一个需要运筹学专业知识的技能密集型过程。最近在复杂推理方面的大语言模型的进步激发了自动化这一任务的兴趣，即将自然语言翻译成可执行的优化模型。然而，当前的方法在准确性上受到限制，这主要是由于缺乏和噪声较大的训练数据，未能利用领域知识。在本项工作中，我们系统地整合了优化专业知识，以提高混合整数线性规划问题的建模准确性，这是一个关键的数学规划家族。我们的OptiMind框架利用半自动化的、基于类别的错误分析来指导训练和推理，明确地防止每个优化类中的常见错误。我们最终微调的LLM在多个优化基准测试中将建模准确性提高了20.7%，并且在测试时的缩放方法（如自我一致性、多轮反馈）下保持了一致的改进，这为进一步实现稳健的LLM辅助优化建模奠定了基础。

Summary / 总结

The research aims to enhance the ability of large language models (LLMs) to translate natural language into accurate optimization models, a task crucial for various domains but requiring specialized expertise. The study introduces OptiMind, a framework that integrates optimization expertise through semi-automated error analysis, improving formulation accuracy by 20.7% across multiple benchmarks. This improvement is consistent under scaling methods like self-consistency and multi-turn feedback, suggesting potential for robust LLM-assisted optimization.

研究旨在提升大型语言模型（LLMs）将自然语言转化为精确优化模型的能力，这对于运筹学至关重要。OptiMind框架通过半自动错误分析整合优化专业知识，使公式化准确性提高了20.7%，并在自我一致性等扩展方法下保持一致，表明其在实际应用中的稳健性。
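
Self-consistency, one of the test-time scaling methods named in the OptiMind abstract, is commonly implemented as a majority vote over several sampled outputs. A minimal sketch, assuming each sampled formulation has already been solved and reduced to its optimal objective value; that reduction step and the sample values are hypothetical, and the abstract does not describe the paper's exact voting procedure:

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most common answer among the samples and its vote share."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical objective values from five independently sampled formulations.
sampled_objective_values = [42.0, 42.0, 37.5, 42.0, 40.0]

answer, share = self_consistency_vote(sampled_objective_values)
print(answer, share)
```

Exact equality is used here for simplicity. With real solver output, values would likely need rounding or a tolerance before they are counted as the same answer.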

DNN Modularization via Activation-Driven Training

Authors: Tuan Ngo, Abid Hassan, Saad Shafiq, Nenad Medvidovic

First: 2024-11-01T23:07:33+00:00 · Latest: 2026-01-14T18:22:31+00:00

Comments: Accepted at International Conference on Software Engineering (ICSE) 2026 - Research Track

Abstract

Deep Neural Networks (DNNs) tend to accrue technical debt and suffer from significant retraining costs when adapting to evolving requirements. Modularizing DNNs offers the promise of improving their reusability. Previous work has proposed techniques to decompose DNN models into modules both during and after training. However, these strategies yield several shortcomings, including significant weight overlaps and accuracy losses across modules, restricted focus on convolutional layers only, and added complexity and training time by introducing auxiliary masks to control modularity. In this work, we propose MODA, an activation-driven modular training approach. MODA promotes inherent modularity within a DNN model by directly regulating the activation outputs of its layers based on three modular objectives: intra-class affinity, inter-class dispersion, and compactness. MODA is evaluated using three well-known DNN models and five datasets with varying sizes. This evaluation indicates that, compared to the existing state-of-the-art, using MODA yields several advantages: (1) MODA accomplishes modularization with 22% less training time; (2) the resultant modules generated by MODA comprise up to 24x fewer weights and 37x less weight overlap while (3) preserving the original model's accuracy without additional fine-tuning; in module replacement scenarios, (4) MODA improves the accuracy of a target class by 12% on average while ensuring minimal impact on the accuracy of other classes.

中文标题/摘要

标题：基于激活驱动训练的DNN模块化

深度神经网络（DNNs）在适应不断变化的需求时往往会积累技术债务，并遭受显著的重新训练成本。模块化DNNs有望提高其可重用性。先前的工作提出了在训练期间和之后分解DNN模型为模块的技术。然而，这些策略存在一些缺点，包括模块间显著的权重重叠和准确性损失、仅关注卷积层、通过引入辅助掩码来控制模块化增加了复杂性和训练时间。在本工作中，我们提出了一种基于激活驱动的模块化训练方法MODA。MODA通过直接调节其层的激活输出来促进DNN模型内的固有模块化，基于三个模块化目标：类内亲和力、类间分散性和紧凑性。使用三个知名的DNN模型和五个不同大小的数据集对MODA进行了评估。评估表明，与现有的最先进的技术相比，使用MODA具有以下优势：（1）MODA在模块化方面节省了22%的训练时间；（2）由MODA生成的模块包含多达24倍少的权重和37倍少的权重重叠，同时保持原始模型的准确性无需额外微调；在模块替换场景中，（3）MODA将目标类别的准确性平均提高了12%，同时确保对其他类别的准确性影响最小。

Summary / 总结

This work addresses the challenges of retraining DNNs by proposing MODA, an activation-driven modular training approach. MODA improves modularity by directly regulating layer activations to achieve intra-class affinity, inter-class dispersion, and compactness. The evaluation across three DNN models and five datasets shows that MODA reduces training time by 22%, generates modules with up to 24x fewer weights and 37x less weight overlap, and maintains original accuracy without fine-tuning. Additionally, MODA enhances target class accuracy by 12% on average in module replacement scenarios with minimal impact on other classes.

该研究旨在通过提出基于激活驱动的模块化训练方法MODA来解决DNN重训练的挑战。MODA通过三个目标（类内亲和性、类间分散性和紧凑性）调节层激活来提高模块化。在三个DNN模型和五个数据集上的评估表明，MODA将训练时间减少了22%，生成的模块包含多达24倍少的权重和37倍少的权重重叠，并且在无需微调的情况下保持了原始模型的准确性。此外，在模块替换场景中，MODA将目标类别的准确性平均提高了12%，同时对其他类别的准确性影响最小。

Template-Based Probes Are Imperfect Lenses for Counterfactual Bias Evaluation in LLMs

Authors: Farnaz Kohankhaki, D. B. Emerson, Jacob-Junqi Tian, Laleh Seyyed-Kalantari, Faiza Khan Khattak

First: 2024-04-04T14:24:06+00:00 · Latest: 2026-01-14T18:20:19+00:00

Comments: 22 Pages, 6 Figures, 5 Tables

Abstract

Bias in large language models (LLMs) has many forms, from overt discrimination to implicit stereotypes. Counterfactual bias evaluation is a widely used approach to quantifying bias and often relies on template-based probes that explicitly state group membership. It aims to measure whether the outcome of a task performed by an LLM is invariant to a change in group membership. In this work, we find that template-based probes can introduce systematic distortions in bias measurements. Specifically, we consistently find that such probes suggest that LLMs classify text associated with White race as negative at disproportionately elevated rates. This is observed consistently across a large collection of LLMs, over several diverse template-based probes, and with different classification approaches. We hypothesize that this arises artificially due to linguistic asymmetries present in LLM pretraining data, in the form of markedness (e.g., Black president vs. president), and templates used for bias measurement (e.g., Black president vs. White president). These findings highlight the need for more rigorous methodologies in counterfactual bias evaluation, ensuring that observed disparities reflect genuine biases rather than artifacts of linguistic conventions.

中文标题/摘要

标题：基于模板的探针是评估LLM反事实偏见的不完美透镜

大型语言模型（LLMs）中的偏见有多种形式，从明显的歧视到隐含的刻板印象。反事实偏见评估是一种广泛使用的量化偏见的方法，通常依赖于基于模板的探针，这些探针明确陈述了群体成员身份。其目标是测量LLM执行任务的结果是否在群体成员身份变化时保持不变。在本研究中，我们发现基于模板的探针会在偏见测量中引入系统性失真。具体来说，我们一致发现，这些探针表明LLM将与白人种族相关的文本分类为负面的比例异常高。这种现象在大量LLM、多种多样的基于模板的探针以及不同的分类方法中都得到了一致的观察。我们认为这可能是由于LLM预训练数据中存在的语言不对称性（如，黑人总统 vs. 总统）以及用于偏见测量的模板（如，黑人总统 vs. 白人总统）所导致的人为因素。这些发现强调了在反事实偏见评估中需要更严谨的方法，以确保观察到的差异反映的是真正的偏见而非语言惯例的产物。

Summary / 总结

This study examines the limitations of template-based probes in evaluating counterfactual bias in large language models (LLMs). It finds that these probes often indicate that LLMs classify text related to the White race as negative more frequently than expected. This bias is observed across various LLMs, probes, and classification methods. The researchers suggest that this may result from linguistic asymmetries in the pretraining data and the templates used for bias measurement. The study emphasizes the need for more robust methodologies to ensure that observed biases are genuine and not artifacts of language conventions.

该研究探讨了基于模板的探针在评估大型语言模型（LLM）的反事实偏见时的局限性。研究发现，这些探针通常表明LLM更频繁地将与白人相关的文本分类为负面，这一模式在各种LLM和不同的探针中均被观察到。作者认为这种偏见源于预训练数据中的语言不对称性和用于偏见评估的模板，强调需要更严谨的方法来确保观察到的偏见是真实的。
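
The template-based counterfactual probe discussed above can be sketched in a few lines: fill one template with different group terms, classify each variant, and check whether the label stays the same. The keyword classifier below is a hypothetical stand-in for an LLM, and the template and word list are made up for illustration:

```python
# Hypothetical negative-word list for the toy classifier.
NEGATIVE_WORDS = {"terrible", "awful", "bad"}


def toy_classifier(text):
    """Stand-in for an LLM classifier: labels text by keyword match only."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return "negative" if words & NEGATIVE_WORDS else "positive"


def counterfactual_prompts(template, groups):
    """Fill the same template once per group term."""
    return {g: template.format(group=g) for g in groups}


def is_invariant(template, groups, classify):
    """True when every counterfactual variant receives the same label."""
    prompts = counterfactual_prompts(template, groups)
    labels = {g: classify(p) for g, p in prompts.items()}
    return len(set(labels.values())) == 1, labels


template = "The {group} president gave a terrible speech."
invariant, labels = is_invariant(template, ["Black", "White"], toy_classifier)
print(invariant, labels)
```

The toy classifier ignores the group term, so the probe reports invariance. The abstract's concern is that with a real LLM the explicit group term itself (for example "White president", which is rarely marked in natural text) may shift the label for reasons unrelated to genuine bias.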

VIGIL: Defending LLM Agents Against Tool Stream Injection via Verify-Before-Commit

Authors: Junda Lin, Zhaomeng Zhou, Zhi Zheng, Shuochen Liu, Tong Xu, Yong Chen, Enhong Chen

First: 2026-01-09T12:19:49+00:00 · Latest: 2026-01-14T18:19:43+00:00

Abstract

LLM agents operating in open environments face escalating risks from indirect prompt injection, particularly within the tool stream where manipulated metadata and runtime feedback hijack execution flow. Existing defenses encounter a critical dilemma as advanced models prioritize injected rules due to strict alignment while static protection mechanisms sever the feedback loop required for adaptive reasoning. To reconcile this conflict, we propose VIGIL, a framework that shifts the paradigm from restrictive isolation to a verify-before-commit protocol. By facilitating speculative hypothesis generation and enforcing safety through intent-grounded verification, VIGIL preserves reasoning flexibility while ensuring robust control. We further introduce SIREN, a benchmark comprising 959 tool stream injection cases designed to simulate pervasive threats characterized by dynamic dependencies. Extensive experiments demonstrate that VIGIL outperforms state-of-the-art dynamic defenses by reducing the attack success rate by over 22% while more than doubling the utility under attack compared to static baselines, thereby achieving an optimal balance between security and utility.

中文标题/摘要

标题：VIGIL：通过验证后再提交防御LLM代理工具流注入

在开放环境中的LLM代理面临来自间接提示注入的不断升级风险，尤其是在工具流中，篡改的元数据和运行时反馈劫持执行流程。现有防御措施遇到一个关键困境，即高级模型因严格对齐而优先处理注入规则，而静态保护机制则切断了适应性推理所需的反馈循环。为解决这一冲突，我们提出了一种名为VIGIL的框架，该框架从限制性隔离转向验证后再提交协议。通过促进推测性假设生成并通过意图导向验证确保安全性，VIGIL在保持推理灵活性的同时确保了稳健的控制。我们还引入了包含959个工具流注入案例的SIREN基准，旨在模拟由动态依赖性特征定义的普遍威胁。大量实验表明，VIGIL在降低攻击成功率方面优于最先进的动态防御措施，同时在攻击下的实用性比静态基线高出一倍以上，从而实现了安全性和实用性的最佳平衡。

Summary / 总结

VIGIL is a framework designed to protect LLM agents from tool stream injection attacks by implementing a verify-before-commit protocol. It allows for speculative hypothesis generation and intent-grounded verification to maintain reasoning flexibility while ensuring robust control. The authors also introduce SIREN, a benchmark of 959 tool stream injection cases. Experiments show that VIGIL outperforms state-of-the-art dynamic defenses by reducing the attack success rate by over 22%, and more than doubles utility under attack relative to static baselines.

VIGIL 是一个框架，通过实现 verify-before-commit 协议来保护 LLM 代理免受工具流注入攻击。它允许进行推测性假设生成和基于意图的验证，以保持推理的灵活性同时确保控制的稳健性。实验表明，VIGIL 在减少攻击成功率和在攻击下将实用性提高一倍方面优于现有动态防御措施，从而在安全性和实用性之间实现了最佳平衡。

Policy Compatible Skill Incremental Learning via Lazy Learning Interface

Authors: Daehee Lee, Dongsu Lee, TaeYoon Kwack, Wonje Choi, Honguk Woo

Venue: NeurIPS 2025 Spotlight

First: 2025-09-24T23:34:01+00:00 · Latest: 2026-01-14T18:11:20+00:00

Comments: NeurIPS 2025 Spotlight

Abstract

Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill-based policies, limiting their reusability and generalization. In this work, we propose SIL-C, a novel framework that ensures skill-policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re-training or structural adaptation. SIL-C employs a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy's decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL-C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.

中文标题/摘要

标题：通过懒学习接口实现政策兼容技能增量学习

技能增量学习（SIL）是指通过利用与环境交互获得的经验或通过整合额外数据来扩展和细化其技能集的过程。SIL 促进了基于可重用技能的分层政策的有效获取，以供下游任务使用。然而，随着技能库的演变，它可能会破坏现有基于技能的政策的兼容性，限制其可重用性和泛化能力。在本文中，我们提出了一种新的框架SIL-C，该框架确保技能与政策的兼容性，从而使增量学习的技能改进能够增强下游政策的性能，而无需重新训练政策或结构适应。SIL-C 使用双边懒学习映射技术动态对齐政策引用的子任务空间与代理行为解码的技能空间。这使得每个从复杂任务分解中派生的子任务能够根据轨迹分布相似性选择合适的技能来执行。我们跨多种SIL场景评估了SIL-C，并证明它在技能演变过程中保持了与下游政策的兼容性，同时确保了学习过程的高效性。

Summary / 总结

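The research addresses the challenge in Skill Incremental Learning (SIL) of keeping evolving skills compatible with existing policies. The proposed SIL-C framework uses a lazy-learning-based mapping technique to dynamically align the subtask space with the skill space, so that improvements in learned skills enhance downstream policies without retraining. Experiments show that SIL-C maintains compatibility and learning efficiency across diverse SIL scenarios.

研究旨在解决技能增量学习（SIL）中，随技能演进而保持与现有策略兼容性的挑战。提出的SIL-C框架使用懒学习映射技术动态对齐子任务空间与技能空间，确保学习技能的改进能够增强下游策略，而无需重新训练。实验结果表明，SIL-C在各种SIL场景中保持了兼容性和学习过程的效率。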

STEP3-VL-10B Technical Report

Authors: Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge

First: 2026-01-14T17:58:24+00:00 · Latest: 2026-01-14T17:58:24+00:00

Comments: 50 pages

Abstract

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×-20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

中文标题/摘要

标题：STEP3-VL-10B 技术报告

我们提出了STEP3-VL-10B，这是一种轻量级开源基础模型，旨在重新定义紧凑效率与前沿多模态智能之间的权衡。STEP3-VL-10B 通过两个战略转变实现：首先，一种统一的、完全解冻的预训练策略，使用1.2万亿多模态令牌，结合语言对齐的感知编码器和Qwen3-8B解码器，建立内在的视觉-语言协同作用；其次，一个扩大的后训练流水线，包含超过1000次强化学习迭代。关键的是，我们实现了并行协调推理（PaCoRe）来扩展测试时计算，将资源分配给可扩展的感知推理，探索和综合多种视觉假设。因此，尽管其紧凑的10B参数量，STEP3-VL-10B 在性能上与比其大10-20倍的模型（如GLM-4.6V-106B、Qwen3-VL-235B）相当或超越，并且在顶级专有旗舰产品如Gemini 2.5 Pro和Seed-1.5-VL中表现出色。它在MMBench上记录了92.2%的得分，在MMMU上为80.11%，在复杂推理方面分别达到了94.43%的AIME2025得分和75.95%的MathVision得分。我们发布了完整的模型套件，为社区提供了一个强大、高效且可重复的基础。

Summary / 总结

The research aims to develop a lightweight foundation model that balances compactness with advanced multimodal intelligence. STEP3-VL-10B employs a unified pre-training strategy and a scaled post-training pipeline, achieving performance comparable to much larger models. Key findings include 92.2% on MMBench, 80.11% on MMMU, 94.43% on AIME2025, and 75.95% on MathVision, while maintaining a compact 10B parameter size.

研究旨在开发一种轻量级基础模型，平衡紧凑性和先进的多模态智能。STEP3-VL-10B采用统一的预训练策略和扩展的后训练管道，性能与更大规模的模型相当。关键发现包括在MMBench上达到92.2%，在MMMU上达到80.11%，在AIME2025上达到94.43%，在MathVision上达到75.95%，同时保持10B参数的紧凑规模。

Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park

First: 2026-01-14T17:57:43+00:00 · Latest: 2026-01-14T17:57:43+00:00

Comments: Work in Progress

Abstract

Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.

中文标题/摘要

标题：协作多智能体推理时的强化学习

多智能体系统已成为许多应用中的实用LLM驱动合作者，通过多样性和交叉验证获得鲁棒性。然而，多智能体强化学习（MARL）训练资源密集且不稳定：队友的共同适应会引入非平稳性，奖励通常稀疏且高方差。因此，我们引入了多智能体推理时的强化学习（MATTRL）框架，在推理时向多智能体讨论注入结构化的文本经验。MATTRL 形成一个由专家组成的多专家团队，进行多轮讨论，检索并整合推理时的经验，并达成共识进行最终决策。我们还研究了信用分配方法以构建轮次级经验池，然后将其重新注入对话中。在医学、数学和教育等具有挑战性的基准测试中，MATTRL 在多智能体基线上的准确率平均提高了3.67%，在可比的单智能体基线上的准确率提高了8.67%。消融研究探讨了不同的信用分配方案，并详细比较了它们对训练结果的影响。MATTRL 提供了一条稳定、有效且高效的路径，无需调整即可实现分布转移鲁棒的多智能体推理。

Summary / 总结

The paper introduces Multi-Agent Test-Time Reinforcement Learning (MATTRL), which injects structured textual experience into multi-agent deliberation at inference time to improve robustness and accuracy. MATTRL forms a multi-expert team for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for decision-making. Experiments show that MATTRL improves accuracy by 3.67% over a multi-agent baseline and by 8.67% over single-agent baselines across various benchmarks in medicine, math, and education. Ablation studies explore different credit-assignment schemes and their impact on training outcomes.

论文提出了Multi-Agent Test-Time Reinforcement Learning (MATTRL)，通过在推理时注入结构化的文本经验来增强多智能体系统。MATTRL 形成一个多专家团队进行多轮讨论，检索并整合测试时的经验，并达成共识进行最终决策。实验结果显示，MATTRL 在医学、数学和教育等领域的基准测试中，相对于多智能体基线提高了 3.67% 的准确性，相对于可比的单智能体基线提高了 8.67%。消融研究评估了不同信用分配方案及其对训练结果的影响。

Self-Supervised Animal Identification for Long Videos

Authors: Xuyang Fang, Sion Hannuna, Edwin Simpson, Neill Campbell

First: 2026-01-14T17:53:59+00:00 · Latest: 2026-01-14T17:53:59+00:00

Comments: 11 pages, 1 figure

Abstract

Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video -- a common scenario in practice -- and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy (>97%) while consuming less than 1 GB of GPU memory per batch -- an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available at https://huggingface.co/datasets/tonyFang04/8-calves.

中文标题/摘要

标题：长视频中自我监督的动物识别

在长时间视频中识别个体动物对于行为生态学、野生动物监测和畜牧管理至关重要。传统方法需要大量手动标注，而现有的自我监督方法计算量大且不适用于长序列，因为存在内存限制和时间误差传播问题。我们提出了一种高效且自我监督的方法，将动物识别重新定义为全局聚类任务，而不是顺序跟踪问题。我们的方法假设视频中个体数量已知且固定，仅需边界框检测和总数。通过采样帧对、使用冻结的预训练骨干网络，并利用匈牙利算法进行内部批处理伪标签分配，我们的方法在没有身份标签的情况下学习判别特征。我们从视觉-语言模型中适应二元交叉熵损失，使准确率达到97%以上，同时每个批次消耗的GPU内存少于1 GB——比标准对比方法少一个数量级。在具有挑战性的实际数据集（3D-POP鸽子和8头牛进食视频）上评估，我们的框架匹配或超越了在超过1000个标注帧上训练的监督基线，有效地消除了手动标注瓶颈。这项工作使在消费级硬件上实现高精度动物识别成为可能，具有在资源受限的研究环境中广泛应用的潜力。本文所有代码可在https://huggingface.co/datasets/tonyFang04/8-calves 获取。

Summary / 总结

The paper addresses the challenge of identifying individual animals in long videos, which is crucial for applications such as behavioral ecology and wildlife monitoring. It introduces a self-supervised method that reframes the task as a global clustering problem, requiring only bounding box detections and the known number of individuals, and using a self-bootstrapping mechanism with Hungarian-algorithm pseudo-label assignment. The approach reaches accuracy above 97% while requiring less than 1 GB of GPU memory per batch, an order of magnitude less than standard contrastive methods. On real-world datasets it matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck.

该论文解决了在长视频中识别个体动物的问题，这对于行为生态学和野生动物监测等应用至关重要。它提出了一种自监督方法，将问题重新定义为全局聚类任务，使用边界框检测和预训练的骨干网络。该方法采用匈牙利算法进行伪标签分配，并使用二元交叉熵损失，实现了超过97%的高精度，同时每批消耗的GPU内存少于1 GB。该方法在真实世界数据集上优于监督基线，并适用于资源受限的环境。
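
The in-batch pseudo-label assignment mentioned in the abstract is a one-to-one matching between detections in two frames. A minimal sketch, assuming cosine similarity between embeddings as the matching score; it uses brute force over permutations rather than the Hungarian algorithm itself, which reaches the same optimum for small counts but scales factorially, and the embeddings are made up for illustration:

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def best_assignment(frame_a, frame_b):
    """Match detection i in frame_a to detection result[i] in frame_b so that
    the total cosine similarity is maximal. Assumes both frames contain the
    same number of detections (the known, fixed count of individuals)."""
    n = len(frame_a)
    sim = [[cosine(a, b) for b in frame_b] for a in frame_a]
    best = max(
        permutations(range(n)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n)),
    )
    return list(best)


# Hypothetical 2-D embeddings for three animals seen in two sampled frames.
frame_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_b = [[0.1, 0.9], [0.9, 1.1], [1.0, 0.1]]

matches = best_assignment(frame_a, frame_b)
print(matches)  # [2, 0, 1]
```

For larger groups, `scipy.optimize.linear_sum_assignment` solves the same assignment problem in polynomial time. The resulting matches would then serve as pseudo-labels for the loss described in the abstract.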

Controlled Self-Evolution for Algorithmic Code Optimization

Authors: Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Yi Xu, Huacan Wang

First: 2026-01-12T09:23:13+00:00 · Latest: 2026-01-14T17:53:44+00:00

Comments: 27 pages

Abstract

Self-evolution methods enhance code generation through iterative "generate-verify-refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.

中文标题/摘要

标题：算法代码优化中的可控自我进化

自我进化方法通过迭代的“生成-验证-精炼”循环增强代码生成，但现有方法在探索效率方面存在不足，无法在有限的预算内发现具有更优复杂度的解决方案。这种低效源于初始化偏差将进化限制在较差的解空间区域，无控制的随机操作缺乏反馈指导，以及跨任务经验利用不足。为了解决这些瓶颈，我们提出了可控自我进化（CSE），它包含三个关键组件。多样化规划初始化生成结构上不同的算法策略，以覆盖广泛的解空间。遗传进化用反馈指导机制替代随机操作，实现目标突变和组合交叉。层次进化记忆在跨任务和同任务层面捕捉成功和失败的经验。在EffiBench-X上的实验表明，CSE在各种LLM后端模型中始终优于所有基线。此外，CSE从早期生成阶段开始就表现出更高的效率，并在整个进化过程中持续改进。我们的代码可在https://github.com/QuantaAlpha/EvoControl公开获取。

Summary / 总结

The research aims to improve the efficiency of self-evolution methods in code generation by addressing issues such as low exploration efficiency and lack of feedback guidance. The authors propose Controlled Self-Evolution (CSE), which includes diversified planning initialization, genetic evolution with feedback-guided mechanisms, and hierarchical evolution memory. Experiments on EffiBench-X show that CSE outperforms existing methods across different LLM backbones, achieving higher efficiency and continuous improvement throughout the evolution process.

研究旨在通过解决探索效率低和缺乏反馈指导等问题，提高代码生成中自进化方法的效率。提出的受控自进化（CSE）方法包括多样化规划初始化、具有反馈引导机制的遗传进化以及跨任务和内部任务级别的层级进化记忆。实验结果表明，CSE 在 EffiBench-X 上优于所有基线，并在整个进化过程中保持持续改进。

LiteEmbed: Adapting CLIP to Rare Classes

Authors: Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi

First: 2026-01-14T17:53:11+00:00 · Latest: 2026-01-14T17:53:11+00:00

Comments: 14 pages, 12 figures

Abstract

Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP's vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP's original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
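
The coarse/fine split mentioned in the abstract can be illustrated with a small PCA example. This is only a sketch under assumptions, not LiteEmbed's implementation: it keeps a single principal direction as the "coarse" part and treats the residual as the "fine" part, and the helper names `top_direction` and `split`, as well as the toy 3-D vectors, are hypothetical stand-ins for real CLIP text embeddings.

```python
import math


def top_direction(vectors, iters=100):
    """Return the mean and the first principal direction via power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    u = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # Multiply u by the unnormalized covariance matrix X^T X.
        scores = [sum(a * b for a, b in zip(row, u)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        u = [c / norm for c in w]
    return mean, u


def split(vec, mean, u):
    """Split a vector into a coarse part (along u) and a fine residual."""
    centered = [a - m for a, m in zip(vec, mean)]
    coeff = sum(a * b for a, b in zip(centered, u))
    coarse = [coeff * c for c in u]
    fine = [a - c for a, c in zip(centered, coarse)]
    return coarse, fine


# Toy "embeddings" that vary mostly along the first axis (assumed data).
toy = [[-2.0, 0.1, 0.0], [-1.0, -0.1, 0.05], [0.0, 0.0, -0.05],
       [1.0, 0.1, 0.0], [2.0, -0.1, 0.0]]
mean, u = top_direction(toy)
coarse, fine = split([1.5, 0.3, -0.2], mean, u)
```

By construction the fine residual is orthogonal to the coarse direction, so the two parts can be adjusted independently. A fuller version would keep several principal components rather than one.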

中文标题/摘要

标题：LiteEmbed：将CLIP适应稀有类别

大规模的跨模态模型如CLIP在零样本识别方面表现出色，但在预训练时很少见到的类别上存在困难，包括新出现的实体和文化特定类别。我们提出了LiteEmbed，这是一种轻量级框架，用于CLIP的少量样本个性化，使得无需重新训练其编码器即可添加新类别。LiteEmbed在CLIP词汇表内通过子空间引导优化文本嵌入，利用基于PCA的分解来分离粗粒度语义方向和细粒度变化。两个互补的目标，粗粒度对齐和细粒度分离，共同保持全局语义一致性，同时增强视觉相似类别之间的可区分性。一旦优化，嵌入可以即插即用，无缝替代CLIP的原始文本特征，应用于分类、检索、分割和检测任务。广泛的实验表明，与先前方法相比，LiteEmbed在适应未充分代表、稀有或未见过的类别方面取得了显著的改进。

Summary / 总结

This work addresses CLIP's weak recognition of classes rarely seen during pretraining, such as newly emerging entities and culturally specific categories. LiteEmbed is a lightweight few-shot framework that optimizes text embeddings within CLIP's vocabulary without retraining the encoders, using a PCA-based decomposition to separate coarse semantic directions from fine-grained variations. The optimized embeddings are plug-and-play replacements for CLIP's original text features and yield substantial gains over prior methods across classification, retrieval, segmentation, and detection.

研究旨在解决CLIP在处理罕见类别时表现不佳的问题，特别是新兴和文化特有类别。LiteEmbed是一种轻量级框架，通过子空间引导优化CLIP词汇中的文本嵌入，利用PCA分离粗粒度语义方向和细粒度变化。该方法在零样本识别任务中显著提高了罕见类别的表现，优于先前的方法，并在分类、检索、分割和检测任务中展示了有效性。

Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization

Authors: Frank Röder, Jan Benad, Manfred Eppe, Pradeep Kr. Banerjee

Venue: NeurIPS 2025

First: 2025-08-27T22:02:56+00:00 · Latest: 2026-01-14T17:50:26+00:00

Comments: 31 pages, 4 figures, accepted to NeurIPS 2025

Abstract

Real-world reinforcement learning demands adaptation to unseen environmental conditions without costly retraining. Contextual Markov Decision Processes (cMDP) model this challenge, but existing methods often require explicit context variables (e.g., friction, gravity), limiting their use when contexts are latent or hard to measure. We introduce Dynamics-Aligned Latent Imagination (DALI), a framework integrated within the Dreamer architecture that infers latent context representations from agent-environment interactions. By training a self-supervised encoder to predict forward dynamics, DALI generates actionable representations conditioning the world model and policy, bridging perception and control. We theoretically prove this encoder is essential for efficient context inference and robust generalization. DALI's latent space enables counterfactual consistency: Perturbing a gravity-encoding dimension alters imagined rollouts in physically plausible ways. On challenging cMDP benchmarks, DALI achieves significant gains over context-unaware baselines, often surpassing context-aware baselines in extrapolation tasks, enabling zero-shot generalization to unseen contextual variations.

中文标题/摘要

标题：动态对齐的潜在想象在上下文世界模型中的应用以实现零样本泛化

现实世界的强化学习需要在无需昂贵重训练的情况下适应未见过的环境条件。上下文马尔可夫决策过程（cMDP）模型这一挑战，但现有方法通常需要显式的上下文变量（例如，摩擦力，重力），限制了它们在上下文是潜在的或难以测量时的应用。我们引入了动态对齐的潜在想象（DALI），这是一种集成在Dreamer架构中的框架，通过从智能体-环境交互中推断潜在的上下文表示。通过训练一个自监督编码器来预测前向动力学，DALI 生成了条件于世界模型和策略的可操作表示，从而连接感知与控制。我们理论上证明了该编码器对于高效上下文推断和稳健泛化是必不可少的。DALI 的潜在空间允许反事实一致性：扰动重力编码维度会以物理上合理的方式改变想象的回放。在具有挑战性的cMDP基准测试中，DALI 在上下文无意识基线之上取得了显著的改进，通常在外推任务中超过了上下文意识基线，从而实现了对未见过的上下文变化的零样本泛化。

Summary / 总结

This work targets zero-shot generalization in reinforcement learning under unseen environmental conditions. Dynamics-Aligned Latent Imagination (DALI) is integrated within the Dreamer architecture and infers latent context representations from agent-environment interactions. A self-supervised encoder trained to predict forward dynamics produces actionable representations that condition the world model and policy, bridging perception and control. On cMDP benchmarks, DALI outperforms context-unaware baselines and often surpasses context-aware baselines in extrapolation tasks, enabling zero-shot generalization to unseen contextual variations.

研究旨在通过在未见过的环境条件下无需重新训练来解决强化学习中的零样本泛化问题。方法Dynamics-Aligned Latent Imagination (DALI) 在Dreamer架构中整合，通过代理与环境的交互来推断潜在的上下文表示。通过训练一个自监督编码器来预测前向动力学，DALI 生成可操作的表示，这些表示条件化了世界模型和策略，实现了感知与控制之间的桥梁。关键实验发现表明，DALI 在超越无上下文基线的同时，往往在外推任务中也超过了有上下文基线，从而实现了对未见过的上下文变化的零样本泛化。

Image2Garment: Simulation-ready Garment Generation from a Single Image

Authors: Selim Emir Can, Jan Ackermann, Kiyohiro Nakayama, Ruofan Liu, Tong Wu, Yang Zheng, Hugo Bertiche, Menglei Chai, Thabo Beeler, Gordon Wetzstein

First: 2026-01-14T17:47:33+00:00 · Latest: 2026-01-14T17:47:33+00:00

Abstract

Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.

中文标题/摘要

标题：Image2Garment：从单张图像生成准备用于模拟的服装

从单张图像估计物理准确且准备用于模拟的服装极具挑战性，因为缺乏图像到物理的数据集，且该问题本身是病态的。先前的方法要么需要多视角捕捉和昂贵的可微模拟，要么仅预测服装几何形状而没有用于真实模拟所需的材料属性。我们提出了一种无需迭代优化即可从单张图像生成准备用于模拟的服装的前馈框架。该框架首先通过微调视觉-语言模型从真实图像中推断材料组成和织物属性，然后使用少量的材料-物理测量数据集训练一个轻量级预测器，将这些属性映射到相应的物理织物参数。我们的方法引入了两个新数据集（FTAG和T2P），并实现了无需迭代优化即可从单张图像生成准备用于模拟的服装。实验表明，我们的估计器在材料组成估计和织物属性预测方面具有更高的准确性，并通过我们的物理参数估计器进一步实现了比最先进的图像到服装方法更高的保真度模拟。

Summary / 总结

The research aims to generate physically accurate, simulation-ready garments from a single image, a task made difficult by the absence of image-to-physics datasets and the ill-posed nature of the problem. The method fine-tunes a vision-language model to infer material composition and fabric attributes from real images, then trains a lightweight predictor that maps these attributes to physical fabric parameters. Experiments demonstrate superior accuracy in material composition and fabric attribute prediction, leading to higher-fidelity simulations than state-of-the-art image-to-garment methods.

研究旨在从单张图像生成物理上准确、可用于模拟的服装，解决图像到物理数据集和问题的病态性质带来的挑战。方法包括微调视觉-语言模型从真实图像中推断材料组成和织物属性，然后训练一个轻量级预测器将这些属性映射到相应的物理织物参数。实验表明，在材料组成和织物属性预测方面具有更高的准确性，从而实现比现有方法更高的仿真保真度。

Non-Linear Scoring Model for Translation Quality Evaluation

Authors: Serge Gladkoff, Lifeng Han, Katerina Gasova

First: 2025-11-17T15:09:22+00:00 · Latest: 2026-01-14T17:45:20+00:00

Comments: Technical report, 31 pages

Abstract

Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
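
The two-point calibration described in the abstract can be sketched as follows. This is a minimal sketch under assumptions, not the authors' implementation: the function names `calibrate` and `tolerance`, the choice of bisection as the one-dimensional root finder, and the example tolerance points are all hypothetical. Dividing the two anchor equations a * ln(1 + b * x1) = e1 and a * ln(1 + b * x2) = e2 eliminates a, which leaves a single equation in b; a then follows from either anchor.

```python
import math


def calibrate(x1, e1, x2, e2, iters=200):
    """Fit E(x) = a * ln(1 + b * x) through two tolerance points.

    (x1, e1) and (x2, e2) pair a sample size in words with the error count
    tolerated at that size. Hypothetical helper, not the paper's code.
    """
    if not (0 < x1 < x2 and 0 < e1 < e2):
        raise ValueError("need 0 < x1 < x2 and 0 < e1 < e2")
    ratio = e2 / e1
    # ln(1 + b*x2) / ln(1 + b*x1) tends to x2/x1 as b -> 0 and to 1 as
    # b -> infinity, so the target ratio must lie strictly between them.
    if not (1 < ratio < x2 / x1):
        raise ValueError("tolerance must grow, but slower than linearly")

    def f(b):
        return math.log1p(b * x2) - ratio * math.log1p(b * x1)

    lo, hi = 1e-12, 1.0
    while f(hi) > 0:          # widen the bracket until f changes sign
        hi *= 2.0
    for _ in range(iters):    # plain bisection on b
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    a = e1 / math.log1p(b * x1)
    return a, b


def tolerance(a, b, x):
    """Acceptable error count for a sample of x words."""
    return a * math.log1p(b * x)


# Synthetic anchors generated from a = 5, b = 0.002 (assumed values).
a, b = calibrate(1000, 5 * math.log(3), 4000, 5 * math.log(9))
```

With these synthetic anchors the fit recovers the generating parameters, and `tolerance(a, b, x)` can then serve as the dynamic tolerance function for samples of any length.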

中文标题/摘要

标题：翻译质量评估的非线性评分模型

基于多维质量指标（MQM）的分析性翻译质量评估（TQE）传统上使用线性误差-惩罚比例，该比例基于1000-2000词的参考样本进行校准。然而，线性外推会导致对不同大小样本的判断偏差，对短样本过度惩罚，对长样本则惩罚不足，导致与专家直觉的不一致。本文基于多范围框架，提出了一种校准的非线性评分模型，更好地反映了不同长度样本中人类内容消费者对翻译质量的感知。来自三个大型企业环境的实证数据显示，可接受的错误数量随样本大小呈对数增长，而非线性增长。心理物理和认知证据，包括韦伯-费希纳定律和认知负荷理论，支持这一观点，解释了为什么额外错误的感知影响随规模增长而减弱，而认知负担则随规模增长。我们提出一个两参数模型 E(x) = a * ln(1 + b * x)，a, b > 0，该模型以参考容忍度为锚点，并通过一个一维根寻找步骤校准两个容忍度点。该模型在相对误差不超过±20%的区间内使线性近似保持有效，并且可以与现有的评估工作流程集成，只需添加一个动态容忍度函数。该方法提高了人类和AI生成翻译的解释性、公平性和评分者一致性。通过操作化一个感知上有效的评分范式，它推动了翻译质量评估向更准确和可扩展的评估迈进。该模型还为与人类判断一致的基于AI的文档级评估提供了更强的基础。讨论了CAT/LQA系统实施考虑和对人类和AI生成文本评估的影响。

Summary / 总结

This paper addresses the limitations of traditional linear scoring models in Translation Quality Evaluation (TQE) by proposing a non-linear scoring model. The model, based on the Multi-Range framework and supported by psychophysical and cognitive evidence, better aligns with human perception of translation quality across different sample sizes. Key findings show that acceptable error counts grow logarithmically with sample size, and the proposed two-parameter model E(x) = a * ln(1 + b * x) improves interpretability and fairness in both human and AI-generated translations.

本文针对传统线性评分模型在翻译质量评估（TQE）中的局限性，提出了一个基于实证和心理学证据的非线性评分模型。该模型E(x) = a * ln(1 + b * x)更好地反映了人类内容消费者对不同样本大小翻译质量的感知。来自三个大型企业环境的实证数据显示，可接受的错误数量随样本大小呈对数增长，从而提高了人类和AI生成文本评分的可解释性、公平性和评分者一致性。

Exploring Fine-Tuning for Tabular Foundation Models

Authors: Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Vinay Kumar Sankarapu

First: 2026-01-14T17:40:46+00:00 · Latest: 2026-01-14T17:40:46+00:00

Abstract

Tabular Foundation Models (TFMs) have recently shown strong in-context learning capabilities on structured data, achieving zero-shot performance comparable to traditional machine learning methods. We find that zero-shot TFMs already achieve strong performance, while the benefits of fine-tuning are highly model- and data-dependent. Meta-learning and PEFT provide moderate gains under specific conditions, whereas full supervised fine-tuning (SFT) often reduces accuracy or calibration quality. This work presents the first comprehensive study of fine-tuning in TFMs across benchmarks including TALENT, OpenML-CC18, and TabZilla. We compare Zero-Shot, Meta-Learning, Supervised (SFT), and parameter-efficient (PEFT) approaches, analyzing how dataset factors such as imbalance, size, and dimensionality affect outcomes. Our findings cover performance, calibration, and fairness, offering practical guidelines on when fine-tuning is most beneficial and its limitations.

中文标题/摘要

标题：探索表格基础模型的微调

表格基础模型（TFMs）在结构化数据上最近展示了强大的上下文学习能力，实现了与传统机器学习方法相当的零样本性能。我们发现，零样本TFMs已经达到了很强的性能，而微调的好处则高度依赖于模型和数据。元学习和PEFT在特定条件下提供了适度的增益，而全监督微调（SFT）通常会降低准确度或校准质量。本研究首次在包括TALENT、OpenML-CC18和TabZilla在内的基准上对TFMs的微调进行全面研究。我们比较了零样本、元学习、监督（SFT）和参数高效（PEFT）方法，并分析了数据集因素如不平衡、大小和维度如何影响结果。我们的研究涵盖了性能、校准和公平性，提供了关于何时微调最有益及其局限性的实用指南。

Summary / 总结

The study explores the effectiveness of fine-tuning for Tabular Foundation Models (TFMs) by comparing zero-shot performance with various fine-tuning methods including meta-learning, parameter-efficient fine-tuning, and full supervised fine-tuning. Across multiple benchmarks, the research finds that while zero-shot TFMs already perform well, fine-tuning benefits are highly dependent on the specific model and dataset. Meta-learning and parameter-efficient fine-tuning show moderate gains under certain conditions, but full supervised fine-tuning often degrades performance or calibration quality. The study provides insights into when fine-tuning is most beneficial and its limitations, covering performance, calibration, and fairness.

这项研究探讨了Tabular Foundation Models（TFMs）的微调方法，发现零样本TFMs已经表现良好，微调的效果取决于模型和数据。元学习和参数高效微调在特定条件下提供适度的增益，而完全监督微调通常会降低性能或校准质量。研究涵盖了如TALENT、OpenML-CC18和TabZilla等基准，分析了数据集因素如不平衡、大小和维度如何影响微调结果，并提供了关于何时微调最有效及其局限性的实用指导。

Identifying Models Behind Text-to-Image Leaderboards

Authors: Ali Naseh, Yuefeng Peng, Anshuman Suri, Harsh Chaudhari, Alina Oprea, Amir Houmansadr

First: 2026-01-14T17:30:58+00:00 · Latest: 2026-01-14T17:30:58+00:00

Abstract

Text-to-image (T2I) models are increasingly popular, producing a large share of AI-generated images online. To compare model quality, voting-based leaderboards have become the standard, relying on anonymized model outputs for fairness. In this work, we show that such anonymity can be easily broken. We find that generations from each T2I model form distinctive clusters in the image embedding space, enabling accurate deanonymization without prompt control or training data. Using 22 models and 280 prompts (150K images), our centroid-based method achieves high accuracy and reveals systematic model-specific signatures. We further introduce a prompt-level distinguishability metric and conduct large-scale analyses showing how certain prompts can lead to near-perfect distinguishability. Our findings expose fundamental security flaws in T2I leaderboards and motivate stronger anonymization defenses.
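
A centroid-based attribution of the kind the abstract describes can be sketched as a nearest-centroid classifier. This is an illustration under assumptions rather than the paper's pipeline: the cosine-similarity rule, the function names, and the toy 3-D vectors standing in for real image embeddings are all hypothetical.

```python
import math
from collections import defaultdict


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def fit_centroids(embeddings, labels):
    """Average the unit-normalized embeddings of each model's known images."""
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(normalize(vec))
    return {label: normalize([sum(col) / len(vecs) for col in zip(*vecs)])
            for label, vecs in groups.items()}


def identify(embedding, centroids):
    """Return the model whose centroid is most cosine-similar to the image."""
    q = normalize(embedding)
    return max(centroids,
               key=lambda m: sum(a * b for a, b in zip(q, centroids[m])))


# Toy 3-D "embeddings" for two hypothetical models.
train = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.1, 1.0, 0.0], [0.0, 0.9, 0.2]]
names = ["model_a", "model_a", "model_b", "model_b"]
centroids = fit_centroids(train, names)
guess = identify([0.8, 0.3, 0.0], centroids)
```

Here `guess` is the label assigned to an anonymized image. The attack only needs reference generations from each candidate model, which matches the abstract's point that no prompt control or training data is required.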

中文标题/摘要

标题：识别文本到图像排行榜背后的模型

文本到图像（T2I）模型越来越受欢迎，生成了大量的AI生成图像。为了比较模型质量，基于投票的排行榜已成为标准，依赖匿名模型输出以确保公平性。在本研究中，我们展示了这种匿名性可以轻易被破解。我们发现，每个T2I模型生成的图像在图像嵌入空间中形成了独特的集群，使得在没有提示控制或训练数据的情况下也能实现准确的去匿名化。使用22个模型和280个提示（15万张图像），我们的基于质心的方法实现了高精度，并揭示了系统性的模型特定签名。我们进一步引入了提示级别可区分度度量，并进行了大规模分析，展示了某些提示如何导致近乎完美的可区分度。我们的研究结果揭示了T2I排行榜中的基本安全漏洞，并促使采取更强的匿名化防御措施。

Summary / 总结

The research aims to expose the security vulnerabilities in text-to-image model leaderboards by deanonymizing models based on their distinctive image embeddings. The method involves using a centroid-based approach to identify unique signatures of each model. Key findings show high accuracy in deanonymization and reveal systematic model-specific signatures, indicating that current anonymity measures are insufficient and suggesting the need for stronger defenses.

研究旨在通过基于图像嵌入的独特性来揭露文本到图像模型排行榜的安全漏洞，并通过中心点方法识别模型特有的签名，无需控制提示或训练数据。研究发现，可以使用高精度的方法对模型进行去匿名化，并引入提示级别可区分性指标来展示某些提示可以导致近乎完美的可区分性。这突显了在文本到图像排行榜中需要更强的匿名化措施。

SPGD: Steepest Perturbed Gradient Descent Optimization

Authors: Amir M. Vahedi, Horea T. Ilies

Venue: ASME. J. Mech. Des. (January 14, 2026)

First: 2024-11-07T18:23:30+00:00 · Latest: 2026-01-14T17:20:45+00:00

Comments: 28 pages, 26 figures, submitted to Journal of Mechanical Design

Abstract

Optimization algorithms are pivotal in advancing various scientific and industrial fields but often encounter obstacles such as trapping in local minima, saddle points, and plateaus (flat regions), which makes the convergence to reasonable or near-optimal solutions particularly challenging. This paper presents the Steepest Perturbed Gradient Descent (SPGD), a novel algorithm that innovatively combines the principles of the gradient descent method with periodic uniform perturbation sampling to effectively circumvent these impediments and lead to better solutions whenever possible. SPGD is distinctively designed to generate a set of candidate solutions and select the one exhibiting the steepest loss difference relative to the current solution. It enhances the traditional gradient descent approach by integrating a strategic exploration mechanism that significantly increases the likelihood of escaping sub-optimal local minima and navigating complex optimization landscapes effectively. Our approach not only retains the directed efficiency of gradient descent but also leverages the exploratory benefits of stochastic perturbations, thus enabling a more comprehensive search for global optima across diverse problem spaces. We demonstrate the efficacy of SPGD in solving the 3D component packing problem, an NP-hard challenge. Preliminary results show a substantial improvement over four established methods, particularly on response surfaces with complex topographies and in multidimensional non-convex continuous optimization problems. Comparative analyses with established 2D benchmark functions highlight SPGD's superior performance, showcasing its ability to navigate complex optimization landscapes. These results emphasize SPGD's potential as a versatile tool for a wide range of optimization problems.
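
The idea of interleaving gradient steps with periodic uniform perturbation sampling can be sketched as below. This is a rough sketch under assumptions, not the published SPGD algorithm: it reads "steepest loss difference" as "largest loss decrease", accepts a candidate only if it improves on the current point, and uses hypothetical parameter names (`period`, `n_candidates`, `radius`) and a Rastrigin test function chosen purely for illustration.

```python
import math
import random


def spgd_sketch(loss, grad, x0, lr=0.002, steps=400, period=20,
                n_candidates=16, radius=1.0, seed=0):
    """Gradient descent with a periodic uniform perturbation step."""
    rng = random.Random(seed)
    x = list(x0)
    best_x, best_loss = list(x), loss(x)
    for t in range(1, steps + 1):
        # Ordinary gradient descent step.
        x = [xi - lr * gi for xi, gi in zip(x, grad(x))]
        if t % period == 0:
            # Sample candidates uniformly in a box around the current point.
            candidates = [[xi + rng.uniform(-radius, radius) for xi in x]
                          for _ in range(n_candidates)]
            # Keep the candidate with the largest loss decrease, if any.
            challenger = min(candidates, key=loss)
            if loss(challenger) < loss(x):
                x = challenger
        current = loss(x)
        if current < best_loss:
            best_x, best_loss = list(x), current
    return best_x, best_loss


def rastrigin(x):
    """Standard multimodal test function with many local minima."""
    return 10 * len(x) + sum(xi * xi - 10 * math.cos(2 * math.pi * xi) for xi in x)


def rastrigin_grad(x):
    return [2 * xi + 20 * math.pi * math.sin(2 * math.pi * xi) for xi in x]


start = [3.3, -2.7]
best_x, best_loss = spgd_sketch(rastrigin, rastrigin_grad, start)
```

The function returns the best point seen, so the result is never worse than the starting point. Whether the perturbation step actually escapes a given local minimum depends on `radius`, `period`, and the random seed.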

中文标题/摘要

标题：SPGD：最陡扰动梯度下降优化

优化算法在推动各个科学和工业领域的发展中起着关键作用，但常常会遇到诸如陷入局部极小值、鞍点和平坦区域（平坦区域）等障碍，这使得收敛到合理或接近最优解变得尤为困难。本文提出了最陡扰动梯度下降（SPGD）算法，这是一种新颖的算法，它创新地将梯度下降法的原则与周期性均匀扰动采样相结合，以有效地克服这些障碍，并在可能的情况下导向更好的解。SPGD的独特设计是生成一组候选解，并选择相对于当前解损失差异最陡峭的那个。它通过整合一种战略性探索机制，显著增加了从次优局部极小值中逃逸和有效导航复杂优化景观的可能性。我们的方法不仅保留了梯度下降法的定向效率，还利用了随机扰动的探索优势，从而能够在各种问题空间中进行更全面的全局最优搜索。我们通过解决3D组件包装问题展示了SPGD的有效性，这是一个NP难问题。初步结果显示，SPGD在复杂地形的响应曲面和多维非凸连续优化问题上优于四种现有方法。与现有2D基准函数的比较分析突显了SPGD的优越性能，展示了其在复杂优化景观中导航的能力。这些结果强调了SPGD作为广泛优化问题的多功能工具的潜力。

Summary / 总结

SPGD is a novel optimization algorithm that combines gradient descent with periodic uniform perturbation sampling to improve convergence to optimal solutions by effectively escaping local minima and navigating complex landscapes. It generates candidate solutions and selects the one with the steepest loss difference, enhancing traditional gradient descent with strategic exploration. SPGD demonstrates significant improvements over four established methods in solving the 3D component packing problem and other non-convex optimization challenges, particularly on complex response surfaces.

论文提出了一种名为SPGD的新优化算法，该算法结合了梯度下降和周期性均匀扰动采样的方法，以提高收敛到最优解的能力。SPGD通过集成探索机制增强了传统的梯度下降方法，有效地逃逸局部极小值并导航复杂的优化景观。实验结果表明，SPGD在解决3D组件包装问题和其他非凸优化挑战中优于四种现有方法，特别是在复杂响应曲面上表现出色。

Training Large Neural Networks With Low-Dimensional Error Feedback

Authors: Maher Hanut, Jonathan Kadmon

First: 2025-02-27T22:45:41+00:00 · Latest: 2026-01-14T17:19:43+00:00

Abstract

Training deep neural networks typically relies on backpropagating high-dimensional error signals, a computationally intensive process with little evidence supporting its implementation in the brain. However, since most tasks involve low-dimensional outputs, we propose that low-dimensional error signals may suffice for effective learning. To test this hypothesis, we introduce a novel local learning rule based on Feedback Alignment that leverages indirect, low-dimensional error feedback to train large networks. Our method decouples the backward pass from the forward pass, enabling precise control over error signal dimensionality while maintaining high-dimensional representations. We begin with a detailed theoretical derivation for linear networks, which forms the foundation of our learning framework, and extend our approach to nonlinear, convolutional, and transformer architectures. Remarkably, we demonstrate that even minimal error dimensionality on the order of the task dimensionality can achieve performance matching that of traditional backpropagation. Furthermore, our rule enables efficient training of convolutional networks, which have previously been resistant to Feedback Alignment methods, with minimal error. This breakthrough not only paves the way toward more biologically accurate models of learning but also challenges the conventional reliance on high-dimensional gradient signals in neural network training. Our findings suggest that low-dimensional error signals can be as effective as high-dimensional ones, prompting a reevaluation of gradient-based learning in high-dimensional systems. Ultimately, our work offers a fresh perspective on neural network optimization and contributes to understanding learning mechanisms in both artificial and biological systems.

中文标题/摘要

标题：使用低维误差反馈训练大型神经网络

深度神经网络的训练通常依赖于反向传播高维误差信号，这是一个计算密集型的过程，但缺乏证据支持其在大脑中的实现。然而，由于大多数任务涉及低维输出，我们提出低维误差信号可能足以实现有效的学习。为了测试这一假设，我们引入了一种基于反馈对齐的新型局部学习规则，该规则利用间接的低维误差反馈来训练大型网络。我们的方法将反向传播与前向传播解耦，允许对误差信号的维度进行精确控制，同时保持高维表示。我们从线性网络的详细理论推导开始，这构成了我们学习框架的基础，并将我们的方法扩展到非线性、卷积和变换器架构。令人惊讶的是，我们证明即使最小的误差维度与任务维度相当，也能达到传统反向传播的性能。此外，我们的规则使卷积网络的高效训练成为可能，这些网络以前对反馈对齐方法具有抵抗力，误差最小。这一突破不仅为更接近生物学的学习模型铺平了道路，还挑战了在神经网络训练中对高维梯度信号的传统依赖。我们的研究结果表明，低维误差信号可以与高维信号一样有效，促使对高维系统中的梯度基学习进行重新评估。最终，我们的工作为神经网络优化提供了新的视角，并有助于理解人工和生物系统中的学习机制。

Summary / 总结

The paper aims to address the computational inefficiency of high-dimensional error signals in training deep neural networks by proposing a novel local learning rule based on Feedback Alignment. This method uses low-dimensional error feedback to train large networks, decoupling the backward pass from the forward pass and enabling precise control over error signal dimensionality. The study demonstrates that minimal error dimensionality can match the performance of traditional backpropagation, particularly for convolutional networks, challenging the conventional reliance on high-dimensional gradient signals and suggesting that low-dimensional error signals can be as effective. This approach not only enhances the biological plausibility of neural network models but also improves the efficiency of training complex architectures.

论文旨在通过提出基于反馈对齐的新型局部学习规则，解决深度神经网络中高维误差信号的计算效率问题。该方法使用低维误差反馈来训练大型网络，将反向传播与正向传播解耦，允许精确控制误差信号的维度。研究显示，即使最小的误差维度也能达到传统反向传播的性能，特别是在卷积网络方面，挑战了对高维梯度信号的传统依赖，表明低维误差信号可以与高维信号一样有效。这种方法不仅增强了神经网络模型的生物可行性，还提高了复杂架构的训练效率。