arXiv Paper Digest

Snapshot: 20260213_0358

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Authors: Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano

First: 2026-02-11T18:58:54+00:00 · Latest: 2026-02-11T18:58:54+00:00

Abstract

Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.

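Chinese title/abstract (English translation)

Title: Data Repetition Outperforms Data Scaling in Long-CoT Fine-Tuning

Supervised fine-tuning (SFT) on chain-of-thought data is an important post-training step for reasoning language models. Standard machine learning intuition holds that using more unique training samples yields better generalization. Unexpectedly, we show the benefit of repeated training: under a fixed update budget, training multiple times on a smaller dataset performs better than a single pass over a larger dataset. On the AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples scores 12-26 percentage points higher than a single pass over 51200 samples, with no additional catastrophic forgetting. We find that training token accuracy reliably indicates when repetition has saturated; the gains from additional epochs level off at full memorization, a pattern that is consistent across all settings. These findings offer a practical approach for reasoning fine-tuning, in which scaling epochs with token accuracy as the stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, in which full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.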

Summary

The study compares data repetition with data scaling in supervised fine-tuning for reasoning language models. It shows that, under a fixed update budget, training for more epochs on a smaller dataset outperforms single-epoch training on a larger one: on AIME'24/25 and GPQA, Olmo3-7B trained for 128 epochs on 400 samples beats 1 epoch on 51200 samples by 12-26 percentage points. Training token accuracy indicates when repetition has saturated, so scaling epochs with token accuracy as the stopping criterion is a practical alternative to expensive data scaling.

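The study challenges the conventional view that more unique training samples lead to better generalization. Instead, it shows that, under a fixed update budget, repeating a smaller dataset beats training for a single epoch on a larger one. On the AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. The study also shows that training token accuracy can indicate when repetition has saturated, which gives a practical way to set the number of SFT epochs using token accuracy as the stopping criterion.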
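The fixed-update-budget comparison in this entry comes down to simple arithmetic: 128 epochs over 400 samples and 1 epoch over 51200 samples both amount to 51200 sample passes. The sketch below illustrates that bookkeeping and one possible token-accuracy stopping rule. The batch size, the 0.99 threshold, the patience value, and all function names are illustrative assumptions, not values or code from the paper.

```python
from typing import List, Sequence, Tuple


def update_steps(num_samples: int, epochs: int, batch_size: int) -> int:
    """Optimizer updates for a run; assumes num_samples is divisible by batch_size."""
    return (num_samples // batch_size) * epochs


def equal_budget_configs(total_passes: int, epoch_options: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (epochs, num_samples) pairs that all consume the same number of sample passes."""
    return [(e, total_passes // e) for e in epoch_options if total_passes % e == 0]


def should_stop(token_acc_history: Sequence[float],
                threshold: float = 0.99,   # hypothetical "near full memorization" level
                patience: int = 2) -> bool:  # hypothetical number of epochs to confirm
    """Stop once training token accuracy stays at or above threshold for `patience` epochs."""
    if len(token_acc_history) < patience:
        return False
    return all(acc >= threshold for acc in token_acc_history[-patience:])


# The two runs compared in the abstract use the same update budget
# (batch size 16 is an assumption made only for this illustration).
small_repeated = update_steps(num_samples=400, epochs=128, batch_size=16)
large_single = update_steps(num_samples=51200, epochs=1, batch_size=16)
configs = equal_budget_configs(51200, [1, 2, 128])
```

Under these assumptions both runs take the same number of optimizer steps, so any difference in benchmark score comes from how the budget is split between unique samples and repetitions, not from extra compute.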

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Authors: Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo

First: 2026-02-11T18:57:29+00:00 · Latest: 2026-02-11T18:57:29+00:00

Comments: Code: https://github.com/HKUST-C4G/diffusion-rm

Abstract

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.

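Chinese title/abstract (English translation)

Title: Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have become the primary reward provider, using their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that performs preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM uses a pretrained latent diffusion backbone with a timestep-conditioned reward head and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. On image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of their computational cost. In preference optimization, we show that DiNa-LRM improves optimization dynamics, enabling faster and more resource-efficient model alignment.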

Summary

This paper addresses the challenge of preference optimization for diffusion and flow-matching models by proposing DiNa-LRM, a diffusion-native latent reward model. It formulates preference learning directly on noisy diffusion states, using a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a lower computational cost. In preference optimization, DiNa-LRM improves the dynamics, enabling faster and more resource-efficient model alignment.

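This paper proposes DiNa-LRM, a diffusion-native latent reward model, to address preference optimization for diffusion models. The model performs preference learning directly on noisy diffusion states and introduces a noise-calibrated Thurstone likelihood. Experiments show that DiNa-LRM outperforms existing diffusion-based reward baselines at lower computational cost, with performance comparable to state-of-the-art vision-language models. It also improves preference optimization dynamics, making model alignment faster and more resource-efficient.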
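To make the idea of a Thurstone likelihood with noise-dependent uncertainty more concrete, here is a minimal sketch of a generic Thurstone-style preference probability whose standard deviation grows with the diffusion noise level, plus a simple average over noise levels as one reading of noise ensembling. The linear sigma schedule, the parameter values, and every function name are assumptions made for illustration; the abstract does not give the exact form used by DiNa-LRM.

```python
import math
from typing import Callable, Sequence


def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def noise_dependent_sigma(t: float, sigma_min: float = 0.5, sigma_max: float = 2.0) -> float:
    """Hypothetical schedule: uncertainty grows linearly with noise level t in [0, 1]."""
    return sigma_min + (sigma_max - sigma_min) * t


def thurstone_pref_prob(r_win: float, r_lose: float, t: float) -> float:
    """P(winner preferred) when both rewards carry Gaussian noise with std sigma(t)."""
    sigma = noise_dependent_sigma(t)
    # The difference of two independent N(r, sigma^2) scores has std sqrt(2) * sigma.
    return normal_cdf((r_win - r_lose) / (math.sqrt(2.0) * sigma))


def thurstone_nll(r_win: float, r_lose: float, t: float) -> float:
    """Negative log-likelihood of one observed preference pair."""
    return -math.log(max(thurstone_pref_prob(r_win, r_lose, t), 1e-12))


def ensemble_reward(reward_fn: Callable[[float, float], float],
                    latent: float,
                    noise_levels: Sequence[float]) -> float:
    """Average a reward function over several noise levels (assumed ensembling scheme)."""
    return sum(reward_fn(latent, t) for t in noise_levels) / len(noise_levels)


# Toy stand-in for a timestep-conditioned reward head (not the real model).
toy_reward = lambda latent, t: latent * (1.0 - t)
avg_reward = ensemble_reward(toy_reward, 2.0, [0.0, 0.5])
```

In this sketch, the same reward margin produces a less confident preference at high noise than at low noise, which is the qualitative behavior a noise-calibrated likelihood is meant to capture.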

GENIUS: Generative Fluid Intelligence Evaluation Suite

Authors: Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, Wentao Zhang

First: 2026-02-11T18:55:54+00:00 · Latest: 2026-02-11T18:55:54+00:00

Abstract

Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce GENIUS (GEN Fluid Intelligence EvalUation Suite). We formalize GFI as a synthesis of three primitives. These include Inducing Implicit Patterns (e.g., inferring personalized visual preferences), Executing Ad-hoc Constraints (e.g., visualizing abstract metaphors), and Adapting to Contextual Knowledge (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, GENIUS establishes a rigorous standard for GFI, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: https://github.com/arctanxarc/GENIUS.

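Chinese title/abstract (English translation)

Title: GENIUS: Generative Fluid Intelligence Evaluation Suite

Unified Multimodal Models (UMMs) have made notable progress in visual generation. However, existing benchmarks mainly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the ability to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this ability, we introduce GENIUS (GEN Fluid Intelligence EvalUation Suite). We formalize GFI as a synthesis of three primitives: Inducing Implicit Patterns (e.g., inferring personalized visual preferences), Executing Ad-hoc Constraints (e.g., visualizing abstract metaphors), and Adapting to Contextual Knowledge (e.g., simulating counter-intuitive physics). Together, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits on these tasks. Crucially, our diagnostic analysis separates these failure modes and shows that the deficits stem from limited context comprehension rather than insufficient generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, GENIUS sets a rigorous standard for GFI, guiding the field from knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: https://github.com/arctanxarc/GENIUS.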

Summary

The research introduces GENIUS, a suite for evaluating Generative Fluid Intelligence (GFI) in unified multimodal models: the ability to induce patterns, reason through constraints, and adapt to novel scenarios. It formalizes GFI as three primitives (inducing implicit patterns, executing ad-hoc constraints, and adapting to contextual knowledge), systematically evaluates 12 models, and finds significant performance deficits. The analysis attributes these deficits to limited context comprehension rather than insufficient intrinsic generative capability. The study proposes a training-free attention intervention strategy to address the gap and positions GENIUS as a rigorous standard for GFI evaluation, guiding future research toward dynamic, general-purpose reasoning.

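The study introduces the GENIUS suite to evaluate the Generative Fluid Intelligence (GFI) of models, focusing on their ability to induce patterns, reason through constraints, and adapt to novel scenarios. It formalizes GFI as three primitives and evaluates 12 models, revealing significant performance deficits caused by limited context comprehension. It proposes a training-free attention intervention strategy to address these deficits.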

Data-Efficient Hierarchical Goal-Conditioned Reinforcement Learning via Normalizing Flows

Authors: Shaswat Garg, Matin Moezzi, Brandon Da Silva

First: 2026-02-11T18:54:48+00:00 · Latest: 2026-02-11T18:54:48+00:00

Comments: 9 pages, 3 figures, IEEE International Conference on Robotics and Automation 2026