arXiv Paper Digest

Snapshot: 20260204_0351

Reward-free Alignment for Conflicting Objectives

Authors: Peter Chen, Xiaopeng Li, Xi Chen, Tianyi Lin

First: 2026-02-02T18:59:52+00:00 · Latest: 2026-02-02T18:59:52+00:00

Comments: 27 pages

Abstract

Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.
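
The gradient conflict the abstract refers to can be illustrated on two objectives. The sketch below is not RACO's clipped conflict-averse gradient descent, whose update rule the abstract does not spell out. It swaps in the simpler, well-known two-gradient min-norm combination to show how a fixed weighted sum can move against one objective while a conflict-aware direction does not. The function names and the example gradients are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(g1: Vector, g2: Vector, w1: float) -> Vector:
    """Fixed-weight aggregation: w1 * g1 + (1 - w1) * g2."""
    return [w1 * a + (1.0 - w1) * b for a, b in zip(g1, g2)]


def min_norm_direction(g1: Vector, g2: Vector) -> Tuple[Vector, float]:
    """Min-norm point on the segment between g1 and g2.

    Returns the combined direction and the weight gamma placed on g1.
    This is an illustrative stand-in, not the paper's method.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any convex combination gives the same vector.
        return list(g1), 0.5
    # Unconstrained minimiser of ||gamma * g1 + (1 - gamma) * g2||^2 ...
    gamma = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    # ... restricted to the segment by clamping gamma to [0, 1].
    gamma = min(1.0, max(0.0, gamma))
    return weighted_sum(g1, g2, gamma), gamma


if __name__ == "__main__":
    # Hypothetical loss gradients for two objectives that conflict (g1 . g2 < 0).
    g1 = [1.0, 0.0]
    g2 = [-0.5, 1.0]
    naive = weighted_sum(g1, g2, 0.9)
    d, gamma = min_norm_direction(g1, g2)
    print("weighted sum vs g2:", dot(naive, g2))
    print("min-norm vs g1, g2:", dot(d, g1), dot(d, g2), "gamma:", gamma)
```

With these hypothetical gradients, the 0.9/0.1 weighted sum has a negative inner product with the second gradient, so a descent step along it would increase the second loss to first order. The min-norm direction has a positive inner product with both gradients. Unlike the method described in the abstract, this simple rule ignores user-specified objective weights.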

Summary

This paper addresses the challenge of aligning large language models with multiple conflicting objectives by proposing a Reward-free Alignment framework for Conflicted Objectives (RACO). RACO works directly on pairwise preference data and resolves gradient conflicts with a novel clipped variant of conflict-averse gradient descent. The authors provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and show that clipping can strictly improve the convergence rate in the two-objective setting. Experiments on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that RACO achieves better Pareto trade-offs than existing multi-objective alignment baselines.

MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training

Authors: Dulhan Jayalath, Oiwi Parker Jones

First: 2026-02-02T18:59:50+00:00 · Latest: 2026-02-02T18:59:50+00:00

Comments: 19 pages, 8 figures, 5 tables

Abstract

Clinical brain-to-text interfaces are designed for paralysed patients who cannot provide extensive training recordings. Pre-training improves data-efficient generalisation by learning statistical priors across subjects, but these priors critically depend on context. While natural speech might unfold gradually over minutes, most methods pre-train with only a few seconds of context. Thus, we propose MEG-XL, a model pre-trained with 2.5 minutes of MEG context per sample, 5-300x longer than prior work, and equivalent to 191k tokens, capturing extended neural context. Fine-tuning on the task of word decoding from brain data, MEG-XL matches supervised performance with a fraction of the data (e.g. 1hr vs 50hrs) and outperforms brain foundation models. We find that models pre-trained with longer contexts learn representations that transfer better to word decoding. Our results indicate that long-context pre-training helps exploit extended neural context that other methods unnecessarily discard. Code, model weights, and instructions are available at https://github.com/neural-processing-lab/MEG-XL .
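
The context figures quoted in the abstract imply a few simple numbers. The short calculation below derives them, assuming that "5-300x longer" is a plain ratio of context durations and that the 191k tokens are spread evenly over the 2.5 minutes. The abstract states neither explicitly, 191k is itself a rounded figure, and the constant and function names are illustrative.

```python
# Figures quoted in the MEG-XL abstract.
CONTEXT_SECONDS = 2.5 * 60          # 2.5 minutes of MEG context per sample
TOKENS_PER_SAMPLE = 191_000         # "equivalent to 191k tokens" (rounded)
LENGTH_RATIO_RANGE = (5, 300)       # "5-300x longer than prior work"


def implied_prior_context_seconds(context_seconds: float, ratio: float) -> float:
    """Context length of prior work if ours is `ratio` times longer (assumed duration ratio)."""
    return context_seconds / ratio


def average_tokens_per_second(tokens: int, context_seconds: float) -> float:
    """Average token rate, assuming tokens are spread evenly over the context."""
    return tokens / context_seconds


longest_prior = implied_prior_context_seconds(CONTEXT_SECONDS, LENGTH_RATIO_RANGE[0])
shortest_prior = implied_prior_context_seconds(CONTEXT_SECONDS, LENGTH_RATIO_RANGE[1])
token_rate = average_tokens_per_second(TOKENS_PER_SAMPLE, CONTEXT_SECONDS)

if __name__ == "__main__":
    print(f"implied prior context: {shortest_prior} s to {longest_prior} s")
    print(f"average token rate: about {token_rate:.0f} tokens per second")
```

Under these assumptions, prior work would use between 0.5 and 30 seconds of context, which matches the abstract's remark that most methods pre-train with only a few seconds, and the average rate comes to roughly 1,273 tokens per second.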

Summary

MEG-XL targets clinical brain-to-text interfaces for paralysed patients who cannot provide extensive training recordings. It is pre-trained with 2.5 minutes of MEG context per sample, 5-300 times longer than prior work, to capture extended neural context. After fine-tuning on word decoding from brain data, MEG-XL matches supervised performance with a fraction of the data (e.g. 1 hour vs 50 hours) and outperforms brain foundation models. Models pre-trained with longer contexts learn representations that transfer better to word decoding.

DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

Authors: Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Zhiqi Bai, Yuchi Xu, Wenbo Su, Bo Zheng

Venue: ICLR 2026

First: 2025-08-18T08:49:29+00:00 · Latest: 2026-02-02T18:59:08+00:00

Comments: Accepted to ICLR 2026. Project page: https://attention-is-all-i-need.github.io/Design-Logic-Reasoning

Abstract

Large language models (LLMs) perform strongly on many language tasks but still struggle with complex multi-step reasoning across disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, as well as guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents to generate multidisciplinary questions. The central insight is the notion of Design Logic, a form of reusable meta-knowledge that encapsulates the structured process human experts use to transform knowledge into complex exam questions, enabling LLMs to generate new questions with the same complex reasoning patterns from entirely different source texts with explicit control over difficulty, diversity, and question types. We use LLMs to reverse-engineer and abstract over 120,000 Design Logics from existing questions across various disciplines. By designing a two-stage retrieve-and-generate mechanism to match these Design Logics with raw corpus, we synthesized two large-scale reasoning datasets that span 75 disciplines: DLR-Book (3.04 million questions from the book corpus) and DLR-Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity compared to those in the baseline datasets. Supervised fine-tuning (SFT) on Qwen3 and Llama3 with our data substantially improves multidisciplinary reasoning and outperforms baseline datasets. Notably, by applying SFT on the base versions of these models using only our data, we even surpass their official final models that have undergone the full post-training.

Summary

The paper proposes DESIGNER, a data synthesis pipeline that uses Design Logic, a form of reusable meta-knowledge about how human experts turn knowledge into complex exam questions, to generate multidisciplinary reasoning questions from raw documents. The authors reverse-engineer over 120,000 Design Logics from existing questions and match them to raw corpora with a two-stage retrieve-and-generate mechanism, yielding two datasets that span 75 disciplines: DLR-Book (3.04 million questions) and DLR-Web (1.66 million questions). The synthesized questions are more difficult and more diverse than those in the baseline datasets. Supervised fine-tuning of Qwen3 and Llama3 on this data substantially improves multidisciplinary reasoning, and SFT on the base models with only this data even surpasses their official final post-trained models.

RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Authors: Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang

First: 2026-02-02T18:59:04+00:00 · Latest: 2026-02-02T18:59:04+00:00

Comments: Code: https://github.com/Gen-Verse/Open-AgentRL

Abstract

We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also find that optimized reward-model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL

Summary

RLAnything is a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals for LLM and agentic scenarios. The policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn improves policy training. Automatic environment adaptation uses critic feedback from both models to improve their training. Empirically, each added component improves the overall system: RLAnything boosts Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. Optimized reward-model signals also outperform outcomes that rely on human labels.

RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents

Authors: Jialiang Zhu, Gongrui Zhang, Xiaolong Ma, Lin Xu, Miaosen Zhang, Ruiqi Yang, Song Wang, Kai Qiu, Zhirong Wu, Qi Dai, Ruichun Ma, Bei Liu, Yifan Yang, Chong Luo, Zhengyuan Yang, Linjie Li, Lijuan Wang, Weizhu Chen, Xin Geng, Baining Guo

First: 2026-02-02T18:58:07+00:00 · Latest: 2026-02-02T18:58:07+00:00