arXiv 论文速递

I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners

Authors: Lu Ling, Yunhao Ge, Yichen Sheng, Aniket Bera

First: 2025-12-15T18:59:13+00:00 · Latest: 2025-12-15T18:59:13+00:00

Abstract

Generalization remains the central challenge for interactive 3D scene generation. Existing learning-based approaches ground spatial understanding in limited scene datasets, restricting generalization to new layouts. We instead reprogram a pre-trained 3D instance generator to act as a scene-level learner, replacing dataset-bounded supervision with model-centric spatial supervision. This reprogramming unlocks the generator's transferable spatial knowledge, enabling generalization to unseen layouts and novel object compositions. Remarkably, spatial reasoning still emerges even when the training scenes are randomly composed objects. This demonstrates that the generator's transferable scene prior provides a rich learning signal for inferring proximity, support, and symmetry from purely geometric cues. Replacing the widely used canonical space, we instantiate this insight with a view-centric formulation of the scene space, yielding a fully feed-forward, generalizable scene generator that learns spatial relations directly from the instance model. Quantitative and qualitative results show that a 3D instance generator is an implicit spatial learner and reasoner, pointing toward foundation models for interactive 3D scene understanding and generation. Project page: https://luling06.github.io/I-Scene-project/

中文标题/摘要

标题：I-场景：3D实例模型是隐式的通用空间学习者

泛化仍然是交互式3D场景生成的核心挑战。现有的基于学习的方法将空间理解局限于有限的场景数据集，限制了对新布局的泛化能力。相反，我们重新编程了一个预训练的3D实例生成器，使其作为场景级别的学习者发挥作用，用模型为中心的空间监督取代数据集限定的监督。这种重新编程解锁了生成器可转移的空间知识，使其能够泛化到未见过的布局和新的物体组合。令人惊讶的是，即使训练场景是由随机组合的物体构成的，空间推理仍然会出现。这表明生成器的可转移场景先验为从纯几何线索中推断出邻近性、支撑和对称性提供了丰富的学习信号。我们用基于视点的场景空间公式替代了广泛使用的标准空间，从而实现了一个完全前馈、可泛化的场景生成器，它直接从实例模型中学习空间关系。定量和定性的结果表明，3D实例生成器是一个隐式的空间学习者和推理者，指出了交互式3D场景理解和生成的基础模型。项目页面：https://luling06.github.io/I-Scene-project/

Summary / 总结

The research addresses the challenge of generalization in 3D scene generation by reprogramming a pre-trained 3D instance generator to act as a scene-level learner. This approach replaces dataset-bounded supervision with model-centric spatial supervision, enabling the generator to learn transferable spatial knowledge and generalize to unseen layouts and novel object compositions. The study demonstrates that the generator can infer spatial relationships such as proximity, support, and symmetry from geometric cues alone, even when training scenes are randomly composed. This leads to a fully feed-forward, generalizable scene generator that learns spatial relations directly from instance models, suggesting potential for foundation models in interactive 3D scene understanding and generation.

研究通过重新编程一个预训练的3D实例生成器作为场景级学习者，来解决3D场景生成中的泛化问题。这种方法用模型为中心的空间监督取代了数据集限制的监督，使生成器能够学习可转移的空间知识，并泛化到未见过的布局和新型物体组合。研究显示，生成器可以从几何线索中推断出空间关系，如接近性、支撑和对称性，即使训练场景是随机组合的。这导致了一个完全前馈、可泛化的场景生成器，可以直接从实例模型中学习空间关系，表明其在交互式3D场景理解和生成中的潜在应用。

LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

Authors: Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo, Huaizu Jiang

First: 2025-12-15T18:59:04+00:00 · Latest: 2025-12-15T18:59:04+00:00

Comments: 16 pages

Abstract

Recent feed-forward reconstruction models like VGGT and $π^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: https://neu-vi.github.io/LASER/
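
The layer-wise idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the quantile-based layer split, the median-ratio scale estimate, and all function names are assumptions made for illustration, and the propagation across windows and timestamps is omitted.

```python
from statistics import median


def assign_layers(depths, num_layers):
    """Assign each depth to a layer by rank (equal-count quantile bins, an assumption)."""
    order = sorted(range(len(depths)), key=lambda i: depths[i])
    layers = [0] * len(depths)
    for rank, idx in enumerate(order):
        layers[idx] = min(rank * num_layers // len(depths), num_layers - 1)
    return layers


def layerwise_scales(prev_depths, curr_depths, num_layers=2):
    """Estimate one scale per layer that maps the current window onto the previous one.

    Both inputs are depths of the same pixels in a frame shared by two windows.
    """
    layers = assign_layers(prev_depths, num_layers)
    scales = []
    for k in range(num_layers):
        ratios = [p / c
                  for p, c, layer in zip(prev_depths, curr_depths, layers)
                  if layer == k and c > 0]
        # Median ratio is a robust choice here; fall back to 1.0 for empty layers.
        scales.append(median(ratios) if ratios else 1.0)
    return layers, scales


def align(curr_depths, layers, scales):
    """Rescale each depth by the scale factor of its layer."""
    return [d * scales[layer] for d, layer in zip(curr_depths, layers)]


# Toy data: the near layer is off by a factor of 2, the far layer by a factor of 4,
# so no single global scale can align both.
prev = [1.0, 1.2, 1.1, 10.0, 12.0, 11.0]
curr = [0.5, 0.6, 0.55, 2.5, 3.0, 2.75]
layers, scales = layerwise_scales(prev, curr)
aligned = align(curr, layers, scales)
```

In this toy case a single Sim(3)-style global scale would leave one of the two layers misaligned, which is the failure mode the abstract describes.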

Summary / 总结

LASER aims to enable training-free streaming 4D reconstruction with existing offline models, which cannot process streaming video because of their quadratic memory complexity. It converts an offline model into a streaming system by aligning predictions across consecutive temporal windows with layer-wise scale alignment. Experiments show that LASER achieves state-of-the-art camera pose estimation and point map reconstruction while running at 14 FPS with 6 GB peak memory, making deployment on large-scale streaming videos practical.

研究旨在解决VGGT和$π^3$等前馈重建模型的内存复杂性问题，这阻碍了它们的实际流式应用。LASER 提出了一种无需训练的方法，通过层间尺度对齐将离线模型转换为流式系统，对连续时间窗口中的预测进行对齐。关键发现是，LASER 在相机姿态估计和点云重建方面达到了最先进的性能，同时在 RTX A6000 GPU 上以每秒 14 帧的速度运行，峰值内存使用量为 6 GB，能够实现实时处理大规模流式视频。

JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

Authors: Xiaohu Huang, Hao Zhou, Qiangpeng Yang, Shilei Wen, Kai Han

First: 2025-12-15T18:58:18+00:00 · Latest: 2025-12-15T18:58:18+00:00

Comments: Project page: https://visual-ai.github.io/jova

Abstract

In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which introduce additional architecture design and weaken the model simplicity of the original transformers. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without the need for additional alignment modules. Furthermore, to enable high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which enhances supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. Our results establish JoVA as an elegant framework for high-quality multimodal generation.
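
A minimal sketch of joint self-attention over concatenated video and audio tokens follows. It is only an illustration of the mechanism named in the abstract: it uses a single head with identity query/key/value projections, and the function names and toy tokens are assumptions, not JoVA's architecture.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(video_tokens, audio_tokens):
    """Single-head attention over the concatenated token sequence.

    Every token attends to tokens of both modalities, so cross-modal
    interaction needs no separate alignment module. Identity Q/K/V
    projections are used to keep the sketch short.
    """
    tokens = video_tokens + audio_tokens
    dim = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[d] for wj, v in zip(w, tokens)) for d in range(dim)])
    n_video = len(video_tokens)
    # Split the joint output back into per-modality streams.
    return outputs[:n_video], outputs[n_video:], weights


video = [[1.0, 0.0], [0.0, 1.0]]  # two toy video tokens
audio = [[1.0, 1.0]]              # one toy audio token
video_out, audio_out, attn = joint_self_attention(video, audio)
```

Each row of `attn` spans all three tokens, so the weight a video token places on the audio token is nonzero by construction.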

中文标题/摘要

标题：JoVA：统一多模态学习的联合视频-音频生成

在本文中，我们提出了JoVA，一种联合视频-音频生成的统一框架。尽管最近取得了令人鼓舞的进步，但现有方法面临两个关键限制。首先，大多数现有方法只能生成环境声，缺乏生成与唇部动作同步的人声的能力。其次，最近的联合人类视频-音频生成尝试通常依赖于显式融合或模态特定对齐模块，这引入了额外的架构设计，削弱了原始变压器的模型简洁性。为了解决这些问题，JoVA 在每个变压器层中采用视频和音频标记之间的联合自注意力，使跨模态交互直接且高效，无需额外的对齐模块。此外，为了实现高质量的唇部语音同步，我们引入了一种基于面部关键点检测的简单而有效的口部区域损失，这在训练过程中增强了对关键口部区域的监督，而不牺牲架构的简洁性。在基准测试上的广泛实验表明，JoVA 在唇同步准确性、语音质量和整体视频-音频生成保真度方面优于或与联合和音频驱动的最新方法竞争。我们的结果将JoVA确立为高质量多模态生成的优雅框架。

Summary / 总结

JoVA is a unified framework for joint video-audio generation that addresses the limitations of existing methods by enabling synchronized human speech and lip movements through joint self-attention across video and audio tokens. It introduces a simple mouth-area loss for better lip-sync accuracy without additional complexity. Experiments show JoVA outperforms or matches state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity.

JoVA 是一种统一的联合视频-音频生成框架，通过在视频和音频标记之间进行联合自注意力，解决了现有方法中无法同步人类语音和唇部动作的问题。它引入了基于面部关键点检测的口部区域损失，以提高唇同步准确性。实验表明，JoVA 在唇同步准确性、语音质量和整体视频-音频生成保真度方面优于或可与最先进的方法竞争。

Towards Effective Model Editing for LLM Personalization

Authors: Baixiang Huang, Limeng Cui, Jiapeng Liu, Haoran Wang, Jiawei Xu, Zhuiyue Tan, Yutong Chen, Chen Luo, Yi Liu, Kai Shu

First: 2025-12-15T18:58:15+00:00 · Latest: 2025-12-15T18:58:15+00:00

Comments: 15 pages (including appendix), 7 figures. Code, data, results, and additional resources are available at: https://model-editing.github.io

Abstract

Personalization is becoming indispensable for LLMs to align with individual user preferences and needs. Yet current approaches are often computationally expensive, data-intensive, susceptible to catastrophic forgetting, and prone to performance degradation in multi-turn interactions or when handling implicit queries. To address these challenges, we conceptualize personalization as a model editing task and introduce Personalization Editing, a framework that applies localized edits guided by clustered preference representations. This design enables precise preference-aligned updates while preserving overall model capabilities. In addition, existing personalization benchmarks frequently rely on persona-based dialogs between LLMs rather than user-LLM interactions, or focus primarily on stylistic imitation while neglecting information-seeking tasks that require accurate recall of user-specific preferences. We introduce User Preference Question Answering (UPQA), a short-answer QA dataset constructed from in-situ user queries with varying levels of difficulty. Unlike prior benchmarks, UPQA directly evaluates a model's ability to recall and apply specific user preferences. Across experimental settings, Personalization Editing achieves higher editing accuracy and greater computational efficiency than fine-tuning, while outperforming prompting-based baselines in multi-turn conversations and implicit preference questions settings.

中文标题/摘要

标题：面向LLM个性化有效模型编辑

个性化对于使LLM与个体用户偏好和需求相一致变得不可或缺。然而，当前的方法往往计算成本高、数据密集、容易发生灾难性遗忘，并且在多轮交互或处理隐式查询时容易导致性能下降。为了解决这些挑战，我们将个性化概念化为模型编辑任务，并引入了个性化编辑框架，该框架通过聚类偏好表示进行局部编辑指导。这种设计能够实现精确的偏好对齐更新，同时保留整体模型能力。此外，现有的个性化基准测试通常依赖于基于人设的LLM对话，而不是用户-LLM交互，或者主要集中在风格模仿，而忽视了需要准确回忆用户特定偏好信息寻求任务。我们引入了用户偏好问答(UPQA)，这是一个从现场用户查询构建的简短答案问答数据集，具有不同难度级别。与之前的基准测试不同，UPQA直接评估模型回忆和应用特定用户偏好的能力。在各种实验设置中，个性化编辑在编辑准确性、计算效率以及多轮对话和隐式偏好问题设置中均优于微调，同时优于基于提示的基线。

Summary / 总结

This paper addresses the challenges of personalizing large language models (LLMs) by proposing Personalization Editing, a framework that applies localized model edits guided by clustered preference representations. This approach enables precise preference-aligned updates while preserving overall model capabilities, offering higher editing accuracy and greater computational efficiency compared to fine-tuning. Additionally, the authors introduce User Preference Question Answering (UPQA), a dataset that evaluates models' ability to recall and apply specific user preferences, unlike existing benchmarks that focus on persona-based dialogs or stylistic imitation. Personalization Editing outperforms prompting-based baselines in multi-turn conversations and implicit preference questions settings.

本文提出了一种Personalization Editing框架，通过聚类偏好表示进行局部模型编辑，以实现精确的偏好对齐更新，同时保持模型的整体能力。该方法在编辑准确性和计算效率方面优于微调，并在多轮对话和隐式偏好问题设置中优于基于提示的基线。此外，作者引入了User Preference Question Answering (UPQA)数据集，直接评估模型召回和应用特定用户偏好的能力，不同于现有的基准，后者主要关注基于人设的对话或风格模仿。

Towards Interactive Intelligence for Digital Humans

Authors: Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang, Caixin Kang, Kunhang Li, Haiyang Liu, Ruicong Liu, Yun Liu, Dianwen Ng, Zixiong Su, Erwin Wu, Yuhan Wu, Dingkun Yan, Tianyu Yan, Chang Zeng, Bo Zheng, You Zhou

First: 2025-12-15T18:57:35+00:00 · Latest: 2025-12-15T18:57:35+00:00

Abstract

We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.

中文标题/摘要

标题：迈向数字人类的交互智能

我们介绍了交互智能，这是一种新型的数字人类，能够实现个性化的表达、适应性的交互和自我进化。为了实现这一目标，我们提出了Mio（多模态交互全息化身），这是一个由五个专门模块组成的端到端框架：Thinker、Talker、Face Animator、Body Animator和Renderer。这一统一架构将认知推理与实时多模态实体化相结合，以实现流畅、一致的交互。此外，我们还建立了一个新的基准，以严格评估交互智能的能力。广泛的实验表明，我们的框架在所有评估维度上都优于最先进的方法。这些贡献共同推动了数字人类从表面模仿向智能交互的转变。

Summary / 总结

The research introduces Interactive Intelligence, a new paradigm for digital humans that can exhibit personality-aligned behavior, adapt to interactions, and evolve. The study presents Mio, an end-to-end framework with specialized modules for cognitive reasoning, multimodal expression, and real-time rendering. Experiments show that Mio outperforms existing methods in all evaluated dimensions, advancing digital humans towards intelligent interaction rather than mere imitation.

研究旨在开发具备互动智能的数字人类，能够实现个性化的表达、适应性交互和自我进化。为此，作者提出了Mio框架，包含五个专门模块：Thinker、Talker、Face Animator、Body Animator和Renderer。该统一架构将认知推理与实时多模态表现结合，实现流畅且一致的交互。实验表明，Mio在所有评估维度上均优于现有方法，推动了数字人类从简单的模仿向智能交互发展。

Directional Textual Inversion for Personalized Text-to-Image Generation

Authors: Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung Shim

First: 2025-12-15T18:57:07+00:00 · Latest: 2025-12-15T18:57:07+00:00

Comments: Project page: https://kunheek.github.io/dti

Abstract

Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
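
The direction-only optimization described above can be sketched as follows. This is a generic illustration of Riemannian SGD on the unit sphere with a von Mises-Fisher prior term and spherical interpolation; the learning rate, the concentration `kappa`, the fixed magnitude `0.4`, and the function names are hypothetical, and the task loss from the diffusion model is left out.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def riemannian_sgd_step(u, euclid_grad, lr):
    """One step on the unit sphere: project the gradient onto the tangent
    space at u, take a step, then retract by renormalizing."""
    radial = dot(euclid_grad, u)
    tangent = [g - radial * x for g, x in zip(euclid_grad, u)]
    return normalize([x - lr * t for x, t in zip(u, tangent)])


def vmf_prior_grad(mu, kappa):
    """Gradient of the negative log vMF density with respect to the direction.

    log p(u) = kappa * mu.u + const, so the gradient is -kappa * mu,
    a constant vector that does not depend on u.
    """
    return [-kappa * m for m in mu]


def slerp(u, v, t):
    """Spherical linear interpolation between two unit vectors."""
    omega = math.acos(max(-1.0, min(1.0, dot(u, v))))
    if omega < 1e-8:
        return list(u)
    s = math.sin(omega)
    a, b = math.sin((1.0 - t) * omega) / s, math.sin(t * omega) / s
    return [a * x + b * y for x, y in zip(u, v)]


# Prior-only optimization: the direction drifts toward the prior mean mu.
u = normalize([1.0, 0.0, 0.0])
mu = normalize([0.0, 1.0, 0.0])
for _ in range(200):
    u = riemannian_sgd_step(u, vmf_prior_grad(mu, kappa=1.0), lr=0.1)

embedding = [0.4 * x for x in u]  # magnitude stays fixed; only direction was learned
mid = slerp([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5)
```

Because every iterate is renormalized, the embedding norm cannot inflate, and `slerp` keeps interpolated concepts on the same sphere.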

中文标题/摘要

标题：方向性文本反转用于个性化文本到图像生成

文本反转(TI)是一种高效的文本到图像个性化方法，但在处理复杂提示时经常失败。我们追踪这些失败的原因是嵌入范数膨胀：学习到的令牌漂移到分布外的幅度，损害了预归一化Transformer的提示条件。实证上，我们表明语义主要由CLIP令牌空间中的方向编码，而膨胀的范数损害了上下文化；理论上，我们分析了大幅度如何减弱位置信息并妨碍预归一化块中的残差更新。我们提出了方向性文本反转(DTI)，它将嵌入幅度固定在分布内尺度，并通过黎曼SGD仅在单位超球面上优化方向。我们将方向学习视为具有von Mises-Fisher先验的MAP，产生一个恒定方向的先验梯度，该梯度易于集成且高效。在个性化任务中，DTI在保持主题相似性的同时，文本保真度优于TI及其变体。最关键的是，DTI的超球面参数化使学习概念之间的平滑、语义连贯的插值（slerp）成为可能，这是标准TI所不具备的能力。我们的研究结果表明，仅方向优化是实现提示忠实个性化的一种稳健且可扩展的途径。

Summary / 总结

The research aims to improve the personalization of text-to-image generation by addressing the issue of embedding norm inflation in Textual Inversion (TI). The method, Directional Textual Inversion (DTI), fixes the embedding magnitude to an in-distribution scale and optimizes only the direction on the unit hypersphere, improving text fidelity over TI and its variants while maintaining subject similarity. DTI also enables smooth, semantically coherent interpolation between learned concepts, a capability not present in standard TI.

研究旨在通过解决Textual Inversion (TI)中嵌入范数膨胀的问题来提高文本到图像的个性化能力，这通常会降低提示的条件性。方法是Directional Textual Inversion (DTI)，它将嵌入的范数固定在一个分布内尺度，并仅在单位超球面上优化方向，从而在保持主题相似性的同时，文本保真度优于TI及其变体。值得注意的是，DTI能够实现平滑、语义连贯的概念间插值，这是标准TI所不具备的功能。

AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection

Authors: Junwen Miao, Penghui Du, Yi Liu, Yu Wang, Yan Wang

First: 2025-12-15T18:57:04+00:00 · Latest: 2025-12-15T18:57:04+00:00

Abstract

Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.

中文标题/摘要

标题：AgentIAD：工具增强的单智能体工业异常检测

工业异常检测（IAD）由于正常参考样本稀缺和许多缺陷的细微、局部性质而具有挑战性。单次视图-语言模型（VLMs）往往忽视小异常，缺乏与典型正常模式进行对比的明确机制。我们提出了一种工具驱动的代理框架AgentIAD，以实现多阶段视觉检查。该代理配备了感知放大器（PZ）进行局部精细分析，以及比较检索器（CR）在证据模糊时查询正常示例。为了教授这些检查行为，我们从MMAD数据集中构建了结构化的感知和比较轨迹，并分两阶段训练模型：监督微调后进行强化学习。两部分奖励设计驱动这一过程：感知奖励监督分类准确性、空间对齐和类型正确性，行为奖励鼓励高效使用工具。这些组件共同使模型能够通过逐步观察、放大和验证来完善其判断。AgentIAD在MMAD上实现了新的97.62%分类准确率，超越了基于MLLM的先前方法，同时生成透明且可解释的检查轨迹。

Summary / 总结

AgentIAD is a tool-driven framework for industrial anomaly detection that uses a Perceptive Zoomer for localized analysis and a Comparative Retriever to query normal patterns. It is trained in two stages: supervised fine-tuning and reinforcement learning. The model achieves 97.62% classification accuracy, surpassing previous methods and providing transparent inspection traces.

AgentIAD 是一种工具驱动的工业异常检测框架，采用多阶段视觉检查过程。它包括一个感知放大器进行详细分析，以及一个比较检索器查询正常模式。模型在两个阶段进行训练：监督微调和强化学习，并使用奖励系统专注于分类准确性及工具使用效率。AgentIAD 达到了 97.62% 的分类准确率，超越了先前的方法，并提供了透明的检查轨迹。

A Scientific Reasoning Model for Organic Synthesis Procedure Generation

Authors: Guoqing Liu, Junren Li, Zihan Zhao, Eray Inanc, Krzysztof Maziarz, Jose Garrido Torres, Victor Garcia Satorras, Shoko Ueda, Christopher M. Bishop, Marwin Segler

First: 2025-12-15T18:55:39+00:00 · Latest: 2025-12-15T18:55:39+00:00

Abstract

Solving computer-aided synthesis planning is essential for enabling fully automated, robot-assisted synthesis workflows and improving the efficiency of drug discovery. A key challenge, however, is bridging the gap between computational route design and practical laboratory execution, particularly the accurate prediction of viable experimental procedures for each synthesis step. In this work, we present QFANG, a scientific reasoning language model capable of generating precise, structured experimental procedures directly from reaction equations, with explicit chain-of-thought reasoning. To develop QFANG, we curated a high-quality dataset comprising 905,990 chemical reactions paired with structured action sequences, extracted and processed from patent literature using large language models. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. The model subsequently undergoes supervised fine-tuning to elicit complex chemistry reasoning. Finally, we apply Reinforcement Learning from Verifiable Rewards (RLVR) to further enhance procedural accuracy. Experimental results demonstrate that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, measured by traditional NLP similarity metrics and a chemically aware evaluator using an LLM-as-a-judge. Moreover, QFANG generalizes to certain out-of-domain reaction classes and adapts to variations in laboratory conditions and user-specific constraints. We believe that QFANG's ability to generate high-quality synthesis procedures represents an important step toward bridging the gap between computational synthesis planning and fully automated laboratory synthesis.

中文标题/摘要

标题：有机合成程序生成的科学推理模型

计算机辅助合成规划对于实现完全自动化的、机器人辅助的合成工作流程以及提高药物发现的效率至关重要。然而，一个关键挑战是弥合计算路线设计与实际实验室执行之间的差距，特别是准确预测每个合成步骤的可行实验程序。在本工作中，我们提出了QFANG，这是一种能够直接从反应方程式生成精确、结构化实验程序的科学推理语言模型，并且具有显式的链式推理。为了开发QFANG，我们整理了一个高质量的数据集，包含905,990个化学反应及其结构化的操作序列，这些数据是从专利文献中提取并使用大型语言模型进行处理的。我们引入了一种化学指导推理（CGR）框架，该框架能够大规模生成基于化学知识的链式推理数据。随后，模型通过监督微调来激发复杂的化学推理。最后，我们应用可验证奖励的强化学习（RLVR）进一步提高程序的准确性。实验结果表明，QFANG在传统NLP相似度指标和使用LLM作为法官的化学意识评估器中均优于先进的通用推理模型和最近邻检索基线。此外，QFANG能够泛化到某些新的反应类别，并适应实验室条件和用户特定约束的变化。我们认为，QFANG生成高质量合成程序的能力是弥合计算合成规划与完全自动化实验室合成之间差距的重要一步。

Summary / 总结

The research aims to bridge the gap between computational synthesis planning and practical laboratory execution by generating precise experimental procedures from reaction equations. QFANG, a scientific reasoning language model, is developed using a curated dataset of 905,990 chemical reactions and a Chemistry-Guided Reasoning framework. The model is fine-tuned with supervised learning and enhanced with Reinforcement Learning from Verifiable Rewards. Experimental results show that QFANG outperforms advanced general-purpose reasoning models and nearest-neighbor retrieval baselines, generalizes to certain out-of-domain reaction classes, and adapts to variations in laboratory conditions and user-specific constraints.

研究旨在通过生成精确的实验步骤来弥合计算路线设计与实验室执行之间的差距。QFANG是一种科学推理模型，使用包含905,990个化学反应的定制数据集和化学引导推理框架开发。该模型通过监督学习进行微调，并使用可验证奖励的强化学习进一步增强。实验结果表明，QFANG在性能上优于通用推理模型和基线，并且能够很好地应用于新的反应类别和实验室条件。

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

Authors: Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang

First: 2025-12-15T18:52:43+00:00 · Latest: 2025-12-15T18:52:43+00:00

Comments: Project page: https://zhoues.github.io/RoboTracer

Abstract

Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.

中文标题/摘要

标题：RoboTracer：在视觉语言模型中通过推理掌握空间跟踪

空间跟踪是机器人基本的具身交互能力之一，由于需要多步度量导向的推理和复杂的空间指代以及现实世界的度量测量，因此本质上具有挑战性。然而，现有方法在处理这种组合任务时存在困难。为此，我们提出RoboTracer，这是一种3D感知的VLM，首先通过通用空间编码器和回归监督解码器实现3D空间指代和测量，增强监督微调（SFT）期间的尺度意识。此外，RoboTracer通过度量敏感的过程奖励进行强化微调（RFT），监督关键中间感知线索，以准确生成空间跟踪。为了支持SFT和RFT训练，我们引入了TraceSpatial，这是一个包含3000万QA对的大规模数据集，涵盖了户外/室内/桌面场景，并支持复杂的推理过程（多达9步）。我们还提出了TraceSpatial-Bench，这是一个具有挑战性的基准，填补了空间跟踪评估的空白。实验结果表明，RoboTracer在空间理解、测量和指代方面超越了基线，平均成功率达到了79.1%，并且在TraceSpatial-Bench上也以显著优势超越了Gemini-2.5-Pro，准确率高出36%。值得注意的是，RoboTracer可以与各种控制策略结合，执行跨不同机器人（UR5，G1人形机器人）的复杂场景中的长期动态任务。

Summary / 总结

RoboTracer is designed to address the challenges of spatial tracing in robotics by combining 3D spatial referring and metric measurement. It uses a universal spatial encoder and regression-supervised decoder for supervised fine-tuning, and reinforcement fine-tuning with metric-sensitive process rewards to enhance multi-step reasoning. The model is trained on TraceSpatial, a dataset of 30M QA pairs. It surpasses baselines in spatial understanding, measuring, and referring with an average success rate of 79.1%, and achieves state-of-the-art performance on TraceSpatial-Bench, exceeding Gemini-2.5-Pro by 36% accuracy. RoboTracer can be integrated with various control policies for dynamic tasks in real-world environments.

RoboTracer 通过利用 3D 意识的视觉语言模型来解决机器人中的空间跟踪挑战。它使用通用的空间编码器和回归监督解码器进行空间引用和测量，并采用强化微调来增强多步度量导向的推理。该模型在包含 3000 万个问答对的 TraceSpatial 数据集上进行训练，以支持复杂的推理任务。RoboTracer 的平均成功率达到了 79.1%，并在 TraceSpatial-Bench 基准测试中实现了最先进的性能，超越 Gemini-2.5-Pro 36% 的准确率。它可以与各种控制策略结合使用，以处理不同机器人（UR5、G1 人形机器人）在复杂环境中的长期任务。

Embedding-Based Rankings of Educational Resources based on Learning Outcome Alignment: Benchmarking, Expert Validation, and Learner Performance

Authors: Mohammadreza Molavi, Mohammad Moein, Mohammadreza Tavakoli, Abdolali Faraji, Stefan T. Mol, Gábor Kismihók

First: 2025-12-15T18:51:00+00:00 · Latest: 2025-12-15T18:51:00+00:00

Comments: Accepted for publication at the 16th International Conference on Learning Analytics & Knowledge (LAK 2026)

Abstract

As the online learning landscape evolves, the need for personalization is increasingly evident. Although educational resources are burgeoning, educators face challenges selecting materials that both align with intended learning outcomes and address diverse learner needs. Large Language Models (LLMs) are attracting growing interest for their potential to create learning resources that better support personalization, but verifying coverage of intended outcomes still requires human alignment review, which is costly and limits scalability. We propose a framework that supports the cost-effective automation of evaluating alignment between educational resources and intended learning outcomes. Using human-generated materials, we benchmarked LLM-based text-embedding models and found that the most accurate model (Voyage) achieved 79% accuracy in detecting alignment. We then applied the optimal model to LLM-generated resources and, via expert evaluation, confirmed that it reliably assessed correspondence to intended outcomes (83% accuracy). Finally, in a three-group experiment with 360 learners, higher alignment scores were positively related to greater learning performance, chi-squared(2, N = 360) = 15.39, p < 0.001. These findings show that embedding-based alignment scores can facilitate scalable personalization by confirming alignment with learning outcomes, which allows teachers to focus on tailoring content to diverse learner needs.
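
As a rough illustration of the two quantitative pieces above, the sketch below ranks resources by cosine similarity to a learning-outcome embedding and recomputes the p-value for the reported chi-squared statistic. The toy vectors and names are hypothetical and do not come from the Voyage model or the study's data; the only value taken from the abstract is the statistic 15.39 with 2 degrees of freedom.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def rank_by_alignment(outcome_vec, resource_vecs):
    """Sort resources from most to least aligned with the learning outcome."""
    scores = {name: cosine(outcome_vec, vec) for name, vec in resource_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def chi2_sf_df2(x):
    """Survival function of a chi-squared variable with 2 degrees of freedom.

    For df = 2 the closed form is exp(-x / 2).
    """
    return math.exp(-x / 2.0)


# Toy embeddings standing in for real model outputs (assumption).
outcome = [0.9, 0.1, 0.0]
resources = {
    "lesson_a": [0.8, 0.2, 0.1],
    "lesson_b": [0.0, 0.1, 0.9],
}
ranking = rank_by_alignment(outcome, resources)

# Reported test: chi-squared(2, N = 360) = 15.39.
p_value = chi2_sf_df2(15.39)
```

The recomputed p-value comes out near 4.6e-4, consistent with the reported p < 0.001.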

中文标题/摘要

标题：基于嵌入的教育资源排名：基于学习成果对齐的基准测试、专家验证和学习者表现

随着在线学习环境的发展，个性化的需求越来越明显。尽管教育资源日益丰富，教育工作者在选择既符合预期学习成果又能满足多样化学习者需求的材料时仍面临挑战。大型语言模型（LLMs）因其潜在能力而受到越来越多的关注，可以创建更支持个性化的学习资源，但验证预期成果的覆盖范围仍需人工对齐审查，这既昂贵又限制了可扩展性。我们提出了一种框架，以支持以较低成本自动化评估教育资源与预期学习成果之间的对齐。使用人类生成的材料，我们对基于LLM的文本嵌入模型进行了基准测试，发现最准确的模型（Voyage）在检测对齐方面的准确率为79%。然后，我们应用了最优模型到LLM生成的资源上，并通过专家评估确认它可靠地评估了与预期成果的对应性（准确率为83%）。最后，在包含360名学习者的三组实验中，更高的对齐分数与更好的学习表现正相关，χ²(2, N = 360) = 15.39, p < 0.001。这些发现表明，基于嵌入的对齐分数可以通过确认与学习成果的对齐来促进可扩展的个性化，从而使教师能够专注于根据多样化学习者的需求定制内容。

Summary / 总结

The research aims to address the challenge of personalizing educational resources by aligning them with intended learning outcomes. It benchmarks text-embedding models using human-generated materials and finds that the most accurate model (Voyage) achieves 79% accuracy in detecting alignment. Applying this model to LLM-generated resources, it confirms 83% accuracy through expert evaluation. The study also shows that higher alignment scores are positively related to better learning performance, as evidenced by a chi-squared test with a p-value less than 0.001.

研究旨在解决选择符合预期学习成果和满足多样化学习者需求的教育资源的挑战。通过使用人类生成的材料对文本嵌入模型进行基准测试，发现最准确的模型（Voyage）在检测对齐方面的准确率为79%。然后将最优模型应用于生成的教育资源，并通过专家评估确认其准确率为83%。在一项涉及360名学习者的实验中，更高的对齐分数与更好的学习成果正相关，表明基于嵌入的对齐分数可以支持教育中的可扩展个性化。

Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation

Authors: Richard J. Young

First: 2025-12-15T18:48:42+00:00 · Latest: 2025-12-15T18:48:42+00:00

Comments: 25 pages, 6 figures, 8 tables

Abstract

Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.
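
The two metrics quoted above, percentage-point versus relative change and KL divergence, are easy to confuse, so a short arithmetic sketch may help. The implied baseline accuracy is inferred from the two reported numbers and is not stated in the abstract; the toy distributions are hypothetical.

```python
import math


def implied_baseline(pp_change, relative_change):
    """Baseline score implied by an absolute change (in pp) and a relative change.

    relative_change = pp_change / baseline, so baseline = pp_change / relative_change.
    """
    return pp_change / relative_change


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Reported worst case: -18.81 pp, which is -26.5% relative.
baseline = implied_baseline(-18.81, -0.265)  # inferred, roughly 71% GSM8K accuracy
after = baseline - 18.81

# Toy next-token distributions before and after an edit (assumption).
original = [0.5, 0.5]
edited = [0.9, 0.1]
shift = kl_divergence(original, edited)
```

A KL divergence of zero means the edited model's output distribution is unchanged, so the reported range of 0.043 to 1.646 spans near-identical to substantially shifted behavior.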

中文标题/摘要

标题：大规模语言模型消融方法的比较分析：跨架构评估

大规模语言模型中的安全性对齐机制通过学习拒绝行为来防止对有害查询的响应，但这些机制同时阻碍了包括认知建模、对抗性测试和安全分析在内的合法研究应用。虽然消融技术可以通过定向正交化实现拒绝表示的外科移除，但现有实现的有效性尚未得到充分描述。本研究评估了四种消融工具（Heretic、DECCP、ErisForge、FailSpy）在16种指令调优模型（7B-14B参数）上的表现，报告了所有16个模型的工具兼容性，并在工具支持的子集上报告了定量指标。单次处理方法在基准子集上展示了更好的能力保留（平均GSM8K变化：ErisForge -0.28个百分点；DECCP -0.13个百分点），而贝叶斯优化消融产生了可变的分布偏移（KL散度：0.043-1.646），且模型依赖的能力影响。这些发现为研究人员提供了基于证据的选择标准，以在不同模型架构中部署消融工具。主要发现表明，数学推理能力对消融干预最为敏感，GSM8K变化范围从+1.51个百分点到-18.81个百分点（相对变化-26.5%），具体取决于工具选择和模型架构。

Summary / 总结

This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) on sixteen instruction-tuned models (7B-14B parameters). Single-pass methods showed better capability preservation on the benchmarked subset, while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646). The research highlights that mathematical reasoning capabilities are most sensitive to abliteration interventions, with GSM8K changes ranging from +1.51 pp to -18.81 pp depending on the tool and model architecture.

该研究评估了四种消融工具（Heretic、DECCP、ErisForge、FailSpy）在16个指令调优模型上的表现，以解决大型语言模型中安全对齐机制的限制。研究发现，单次处理方法（ErisForge和DECCP）在保持模型能力方面表现更好，而贝叶斯优化消融则产生了可变的分布偏移。研究还指出，数学推理能力对消融干预最为敏感，GSM8K变化范围从+1.51个百分点到-18.81个百分点，具体取决于所使用的工具和模型架构。
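
The "directional orthogonalization" mentioned in the abstract can be illustrated with a small sketch. This is not the implementation of any of the four tools; it only shows the underlying linear algebra, assuming a hypothetical unit "refusal direction" r and a weight matrix W whose outputs live in the same space as r. The update W' = W - r rᵀ W removes whatever the layer writes along r. The names `orthogonalize_out` and `output_along` are illustrative.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def orthogonalize_out(weight, direction):
    """Return W - r r^T W for a d_out x d_in matrix given as a list of rows.

    `direction` has length d_out. After the update, the layer can no longer
    write any component along `direction`.
    """
    r = normalize(direction)
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input coordinate is written along r.
    proj = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[weight[i][j] - r[i] * proj[j] for j in range(d_in)]
            for i in range(d_out)]


def output_along(weight, direction, x):
    """Component of the layer output W x along `direction`."""
    r = normalize(direction)
    y = [sum(row[j] * x[j] for j in range(len(x))) for row in weight]
    return sum(ri * yi for ri, yi in zip(r, y))


# Hypothetical toy values, chosen only for illustration.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [1.0, 0.0, 1.0]
W_abl = orthogonalize_out(W, r)
print(output_along(W, r, [0.3, -0.7]), output_along(W_abl, r, [0.3, -0.7]))
```

In practice the direction is estimated from model activations and the update is applied to several matrices; how each tool does that is outside what the abstract states.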

Large-Language Memorization During the Classification of United States Supreme Court Cases

Authors: John E. Ortega, Dhruv D. Joshi, Matt P. Borkowski

First: 2025-12-15T18:47:48+00:00 · Latest: 2025-12-15T18:47:48+00:00

Comments: 7 pages, 1 figure, Appendix of Prompts

Abs · PDF · Code1 · Code2

Abstract

Large-language models (LLMs) have been shown to respond in a variety of ways for classification tasks outside of question-answering. LLM responses are sometimes called "hallucinations" since the output is not what is expected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal classification task to study for LLM memory accuracy because it presents significant challenges due to extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. Experimentation is performed with the latest LLM fine-tuning and retrieval-based approaches, such as parameter-efficient fine-tuning, auto-modeling, and others, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models on both tasks, scoring about 2 points better than previous models not based on prompting.

中文标题/摘要

标题：大型语言模型在分类美国最高法院案例过程中的记忆化

大型语言模型（LLMs）在分类任务中表现出多种响应方式，而不仅仅是问答任务。LLMs的输出有时被称为“幻觉”，因为其输出不符合预期。正在详细研究LLMs的记忆策略，以理解其响应机制。我们深入探讨了一个基于美国最高法院（SCOTUS）判决的分类任务。SCOTUS语料库是研究LLMs记忆准确性的一个理想分类任务，因为它带来了显著挑战，包括长句子、复杂的法律术语、非标准结构和领域特定词汇。我们使用最新的LLM微调和检索方法，如参数高效微调、自模型化等，对两个传统的SCOTUS分类任务进行了实验：一个是包含15个标记主题的任务，另一个是包含279个主题的任务。我们展示了基于提示的具有记忆的模型，如DeepSeek，在两个任务上都比以往基于BERT的模型更为稳健，得分比之前不基于提示的模型高出约2分。

Summary / 总结

This study investigates how large-language models (LLMs) perform in classifying United States Supreme Court (SCOTUS) cases, focusing on memorization strategies. The research uses parameter-efficient fine-tuning and auto-modeling approaches on a corpus with complex legal terminology and extensive sentence length. Key findings show that prompt-based models with memories, such as DeepSeek, outperform previous BERT-based models, achieving about a 2-point improvement on both 15-labeled and 279-labeled SCOTUS classification tasks.

研究探讨了大型语言模型（LLMs）在分类美国最高法院（SCOTUS）案件中的表现，重点关注记忆策略。研究使用了参数高效微调和自模型方法处理具有复杂法律术语和长句子的语料库。关键发现表明，带有记忆的提示基模型，如DeepSeek，在15个标签和279个标签的SCOTUS分类任务上均优于之前的BERT基模型，分别提高了约2个点的成绩。

MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning

Authors: Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, Xiang Bai

First: 2025-12-15T18:31:32+00:00 · Latest: 2025-12-15T18:31:32+00:00

Comments: 16 pages, 12 figures, 6 tables; Project Page: https://xiaomi-mlab.github.io/MindDrive/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. One set serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for the VLA model in autonomous driving.

中文标题/摘要

标题：MindDrive：一种基于在线强化学习的视觉-语言-行动模型在自动驾驶中的应用

当前自动驾驶中的视觉-语言-行动（VLA）范式主要依赖于模仿学习（IL），这引入了分布偏移和因果混淆等固有挑战。在线强化学习通过试错学习提供了一条有希望的途径来解决这些问题。然而，将在线强化学习应用于自动驾驶中的VLA模型受到在连续动作空间中探索效率低下的阻碍。为克服这一限制，我们提出了MindDrive，一种包含两个不同LoRA参数集的大语言模型（LLM）的VLA框架。一个LLM作为决策专家进行场景推理和驾驶决策，另一个作为行动专家动态将语言决策映射为可行轨迹。通过将轨迹级奖励反馈到推理空间，MindDrive能够在有限的离散语言驾驶决策集上实现试错学习，而不是直接在连续动作空间中操作。这种方法有效地平衡了在复杂场景中的最优决策、类人驾驶行为和在线强化学习中的高效探索。MindDrive在具有挑战性的Bench2Drive基准测试中实现了强闭环性能，驾驶得分为78.04，成功率55.09%。据我们所知，这是首次证明在线强化学习在自动驾驶中对VLA模型的有效性。

Summary / 总结

MindDrive is a Vision-Language-Action framework that uses Online Reinforcement Learning to address the challenges of Imitation Learning in autonomous driving, such as distribution shift and causal confusion. By employing two sets of LoRA parameters in a large language model, MindDrive distinguishes between a Decision Expert for scenario reasoning and a separate Action Expert for mapping linguistic decisions into trajectories. This approach enables efficient exploration through trajectory-level rewards, achieving a Driving Score of 78.04 and a Success Rate of 55.09% on the Bench2Drive benchmark, marking the first demonstration of online reinforcement learning for VLA models in autonomous driving.

MindDrive 是一个用于自动驾驶的 Vision-Language-Action 框架，采用在线强化学习来解决模仿学习中的分布偏移和因果混淆等问题。它使用一个大型语言模型和两组 LoRA 参数：一组用于场景推理和驾驶决策，另一组用于将语言决策映射为可行的轨迹。通过将轨迹级奖励反馈到推理空间，MindDrive 实现了高效的探索和试错学习，Bench2Drive 基准上的驾驶得分为 78.04，成功率 55.09%。

Universality of high-dimensional scaling limits of stochastic gradient descent

Authors: Reza Gheissari, Aukosh Jagannath

First: 2025-12-15T18:30:26+00:00 · Latest: 2025-12-15T18:30:26+00:00

Comments: 30 pages

Abs · PDF · Code1 · Code2

Abstract

We consider statistical tasks in high dimensions whose loss depends on the data only through its projection into a fixed-dimensional subspace spanned by the parameter vectors and certain ground truth vectors. This includes classifying mixture distributions with cross-entropy loss with one and two-layer networks, and learning single and multi-index models with one and two-layer networks. When the data is drawn from an isotropic Gaussian mixture distribution, it is known that the evolution of a finite family of summary statistics under stochastic gradient descent converges to an autonomous ordinary differential equation (ODE), as the dimension and sample size go to $\infty$ and the step size goes to $0$ commensurately. Our main result is that these ODE limits are universal in that this convergence occurs even when the data is drawn from mixtures of product measures provided the first two moments match the corresponding Gaussian distribution and the initialization and ground truth vectors are sufficiently coordinate-delocalized. We complement this by proving two corresponding non-universality results. We provide a simple example where the ODE limits are non-universal if the initialization is coordinate aligned. We also show that the stochastic differential equation limits arising as fluctuations of the summary statistics around their ODE's fixed points are not universal.

中文标题/摘要

标题：高维随机梯度下降的标度极限普遍性

我们考虑高维统计任务，其损失仅依赖于数据在由参数向量和某些真实向量张成的固定维子空间上的投影。这包括使用交叉熵损失的一层和两层网络分类混合分布，以及使用一两层网络学习单指数模型和多指数模型。当数据来自各向同性的高斯混合分布时，已知在随机梯度下降下，有限家族的汇总统计量的演化收敛到一个自治常微分方程（ODE），当维度和样本量趋向无穷大，步长趋向于0时。我们的主要结果是这些ODE极限是普遍的，即使数据来自混合的乘积测度，只要其前两矩与相应的高斯分布匹配，初始化和真实向量充分地坐标非局域化，这种收敛仍然会发生。我们通过证明两个相应的非普遍性结果来补充这一点。我们提供了一个简单的例子，说明如果初始化是坐标对齐的，ODE极限是非普遍的。我们还表明，作为汇总统计量围绕其ODE不动点的波动的随机微分方程极限不是普遍的。

Summary / 总结

The study investigates the universality of high-dimensional scaling limits of stochastic gradient descent (SGD) in statistical tasks where the loss depends on data projections into a fixed-dimensional subspace. Using one and two-layer networks for tasks like mixture distribution classification and learning single and multi-index models, the research shows that the evolution of summary statistics under SGD converges to an ODE as dimension and sample size increase, and step size decreases. This convergence is universal for data drawn from mixtures of product measures with matching first and second moments and sufficiently delocalized initialization and ground truth vectors. However, the study also demonstrates non-universality in cases where initialization is coordinate aligned and in the stochastic differential equation limits around ODE fixed points.

研究考察了高维空间中损失依赖于数据投影到固定维度子空间的统计任务。使用随机梯度下降，总结统计量的演化收敛到一个常微分方程（ODE）。主要发现是这些ODE极限是通用的，即当数据来自具有匹配的一阶和二阶矩的混合测度时，这些极限仍然适用，前提是初始化和真实向量在坐标上足够分散。研究还展示了当初始化是坐标对齐时ODE极限的非通用性，以及在ODE固定点周围的随机微分方程极限的非通用性。

StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion

Authors: Guransh Singh, Md Shah Fahad

First: 2025-12-15T18:28:39+00:00 · Latest: 2025-12-15T18:28:39+00:00

Comments: 13 pages, 10 figures

Abs · PDF · Code1 · Code2

Abstract

Stuttering detection breaks down when disfluencies overlap. Existing parametric models struggle to distinguish complex, simultaneous disfluencies (e.g., a 'block' with a 'prolongation') due to the scarcity of these specific combinations in training data. While Retrieval-Augmented Generation (RAG) has revolutionized NLP by grounding models in external knowledge, this paradigm remains unexplored in pathological speech processing. To bridge this gap, we introduce StutterFuse, the first Retrieval-Augmented Classifier (RAC) for multi-label stuttering detection. By conditioning a Conformer encoder on a non-parametric memory bank of clinical examples, we allow the model to classify by reference rather than memorization. We further identify and solve "Modality Collapse", an "Echo Chamber" effect where naive retrieval boosts recall but degrades precision. We mitigate this using: (1) SetCon, a Jaccard-Weighted Metric Learning objective that optimizes for multi-label set similarity, and (2) a Gated Mixture-of-Experts fusion strategy that dynamically arbitrates between acoustic evidence and retrieved context. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, outperforming strong baselines and demonstrating remarkable zero-shot cross-lingual generalization.

中文标题/摘要

标题：StutterFuse：通过Jaccard加权度量学习和门控融合减轻口吃检测中的模态坍塌

当不流畅现象重叠时，口吃检测会失效。现有的参数模型难以区分复杂的、同时发生的不流畅现象（例如，一个“阻塞”伴随一个“延长”），因为训练数据中这些特定组合极为稀缺。虽然检索增强生成（RAG）已通过将模型与外部知识相结合而彻底改变了NLP，但这一范式在病理语音处理中尚未被探索。为弥合这一差距，我们引入了StutterFuse，这是第一个用于多标签口吃检测的检索增强分类器（RAC）。通过将Conformer编码器条件化于临床示例的非参数化记忆库上，我们允许模型通过参考而非记忆来进行分类。我们还识别并解决了“模态坍塌”这一问题，即“回声室”效应，其中简单的检索虽然提高了召回率但降低了精确率。我们通过以下方式缓解了这一问题：(1) SetCon，一种Jaccard加权度量学习目标，旨在优化多标签集合相似性；(2) 一种门控混合专家融合策略，该策略动态地在声学证据和检索上下文之间进行仲裁。在SEP-28k数据集上，StutterFuse实现了加权F1分数0.65，超越了强大的基线模型，并展示了显著的零样本跨语言泛化能力。

Summary / 总结

Stuttering detection models often fail when disfluencies overlap. StutterFuse addresses this by introducing a Retrieval-Augmented Classifier (RAC) that uses a Conformer encoder conditioned on a non-parametric memory bank of clinical examples. It also mitigates 'Modality Collapse' through SetCon, a Jaccard-Weighted Metric Learning objective, and a Gated Mixture-of-Experts fusion strategy. On the SEP-28k dataset, StutterFuse achieves a weighted F1-score of 0.65, surpassing strong baselines and showing zero-shot cross-lingual generalization capabilities.

StutterFuse通过引入一种基于临床示例非参数记忆库的检索增强分类器（RAC）来解决区分复杂失流畅的问题。它通过Jaccard加权度量学习目标和门控混合专家融合策略来缓解模态崩溃，实现在SEP-28k数据集上的加权F1分数为0.65，超越了强基线，并展示了零样本跨语言泛化能力。
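
As a rough illustration of two ingredients named in the abstract, the sketch below computes the Jaccard similarity between two multi-label sets and blends two probability vectors with a scalar gate. It is not the SetCon objective or the paper's fusion module, whose exact forms the abstract does not give; `jaccard`, `gated_fusion`, and the single scalar `gate_logit` are assumptions made for the example.

```python
import math


def jaccard(labels_a, labels_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two label sets."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        return 1.0  # convention: two empty label sets count as identical
    return len(a & b) / len(a | b)


def gated_fusion(acoustic_probs, retrieved_probs, gate_logit):
    """Convex blend of two per-label probability vectors.

    The gate g = sigmoid(gate_logit) weights the acoustic branch and
    (1 - g) weights the retrieved-context branch.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return [g * a + (1.0 - g) * r
            for a, r in zip(acoustic_probs, retrieved_probs)]


# A clip labelled with two overlapping disfluencies vs. a neighbour with one.
print(jaccard({"block", "prolongation"}, {"block"}))  # 0.5
print(gated_fusion([1.0, 0.0], [0.0, 1.0], 0.0))      # [0.5, 0.5]
```

A set-level similarity like this gives partial credit to retrieved neighbours that share some but not all labels, which is the property a multi-label metric-learning objective would need.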

Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

Authors: Zefang Liu, Nam Nguyen, Yinzhu Quan, Austin Zhang

First: 2025-12-15T18:10:51+00:00 · Latest: 2025-12-15T18:10:51+00:00

Abs · PDF · Code1 · Code2

Abstract

Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.

中文标题/摘要

标题：事件序列建模中大型语言模型的时间分词策略

在使用大型语言模型（LLMs）建模时间事件序列时，连续时间的表示是一个关键但尚未充分探索的挑战。已经提出了各种策略，如字节级表示或日历标记，但最优方法仍不清楚，尤其是在现实世界事件数据的多样统计分布面前，这些分布从平滑的对数正态分布到离散的尖峰模式不等。本文首次对事件序列的时间分词进行了实证研究，比较了不同的编码策略：原始数字字符串、高精度字节级表示、基于人类语义的日历标记、经典均匀分箱以及自适应残差标量量化。我们通过在代表这些多样分布的现实世界数据集上微调LLMs来评估这些策略。我们的分析表明，并不存在一种普遍更优的策略；相反，预测性能高度依赖于分词器与数据统计属性的对齐，对数基策略在偏斜分布上表现出色，而以人类为中心的格式则对混合模态表现出稳健性。

Summary / 总结

This paper addresses the challenge of representing continuous time in event sequence modeling with large language models. It compares various temporal tokenization strategies, including numeric strings, byte-level representations, calendar tokens, uniform binning, and adaptive quantization, by evaluating their performance on real-world datasets with different statistical properties. The study finds that no single strategy is universally optimal, and the best approach depends on the data's statistical characteristics, with log-based strategies performing well on skewed distributions and human-centric formats proving robust for mixed data types.

该论文探讨了在使用大型语言模型进行事件序列建模时连续时间的表示挑战。它比较了包括数值字符串、字节级表示、日历标记、均匀分箱和自适应量化在内的各种时间标记策略，并通过在具有不同统计分布的真实世界数据集上进行评估来衡量它们的表现。研究发现，没有一种策略是普遍最优的，对偏斜分布而言，基于对数的策略表现更好，而以人类为中心的格式对于混合数据类型则更为稳健。
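
To make the contrast between uniform and log-based strategies concrete, here is a minimal sketch that maps inter-event time gaps to discrete bin tokens. It is an illustration under assumed settings (gaps in seconds, a one-day range, 16 bins, and a hypothetical `<dt_k>` token format), not the tokenizers evaluated in the paper.

```python
import math


def uniform_bin(dt, max_dt, n_bins):
    """Index of the equal-width bin containing dt, clipped to [0, max_dt]."""
    dt = min(max(dt, 0.0), max_dt)
    return min(int(dt / max_dt * n_bins), n_bins - 1)


def log_bin(dt, min_dt, max_dt, n_bins):
    """Index of the bin containing dt when bins are equal-width in log(dt)."""
    dt = min(max(dt, min_dt), max_dt)
    frac = math.log(dt / min_dt) / math.log(max_dt / min_dt)
    return min(int(frac * n_bins), n_bins - 1)


def to_token(bin_index):
    """Render a bin index as a vocabulary token (hypothetical format)."""
    return f"<dt_{bin_index}>"


gaps = [1.0, 10.0, 100.0]  # seconds between consecutive events
day = 86400.0
print([to_token(uniform_bin(g, day, 16)) for g in gaps])
print([to_token(log_bin(g, 1.0, day, 16)) for g in gaps])
```

With these settings the uniform scheme places all three gaps in the first bin, while the log scheme separates them, which is one way to see why log-based encodings suit heavily skewed gap distributions.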

Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models

Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Munawar Hayat, Fatih Porikli

First: 2025-12-15T18:03:42+00:00 · Latest: 2025-12-15T18:03:42+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.

中文标题/摘要

标题：执行与撤销：生成和逆转视觉语言模型中的物理动作

我们引入了执行与撤销任务及基准测试，以解决视觉语言模型中的关键问题：理解并生成由真实世界动作驱动的物理上合理的场景变换。与以往专注于对象级编辑的工作不同，执行与撤销要求模型模拟物理动作的结果，然后准确地逆转它，反映视觉世界中的真正因果关系。我们从真实世界的视频中收集了一个大规模的可逆动作数据集，并设计了一种训练策略，以确保动作的稳健定位。我们的实验表明，当前的模型在物理可逆性方面存在困难，突显了该任务对于具身人工智能、机器人技术和物理感知生成建模的重要性。执行与撤销为评估和推进多模态系统中的物理推理提供了一个直观的测试平台。

Summary / 总结

The Do-Undo task and benchmark were introduced to address the gap in vision-language models' ability to understand and generate physically plausible scene transformations. Unlike previous work focusing on object-level edits, Do-Undo requires models to simulate and reverse physical actions, reflecting true cause-and-effect. Experiments show that current models struggle with physical reversibility, highlighting the need for better physical reasoning in multimodal systems. The dataset consists of reversible actions from real-world videos, and a training strategy is designed to enforce consistency for robust action grounding.

提出了Do-Undo任务和基准，旨在解决视觉-语言模型在理解并生成物理上合理的场景变换方面的不足。与以往专注于对象级编辑的工作不同，Do-Undo要求模型模拟并逆转物理动作，反映真实的因果关系。实验表明，当前模型在物理可逆性方面存在困难，突显了在多模态系统中提高物理推理能力的重要性。数据集包含来自真实视频的可逆动作，并设计了一种训练策略以确保一致性，从而实现稳健的动作定位。

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Authors: Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

First: 2025-12-15T18:02:35+00:00 · Latest: 2025-12-15T18:02:35+00:00

Comments: We publicly release the Nemotron-Cascade models and the full collection of training data at: https://huggingface.co/collections/nvidia/nemotron-cascade

Abs · PDF · Code1 · Code2 · Code3

Abstract

Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.

中文标题/摘要

标题：Nemotron-Cascade：扩展级联强化学习以适应通用推理模型

使用强化学习（RL）构建通用推理模型涉及跨域异质性，包括推理时响应长度和验证延迟的大幅变化。这种变化性使RL基础设施复杂化，减慢了训练速度，并使训练课程（例如，响应长度扩展）和超参数选择变得困难。在本文中，我们提出了一种级联领域强化学习（Cascade RL）方法，以开发能够在指令模式和深度思考模式下运行的通用推理模型Nemotron-Cascade。与传统的混合不同领域异质提示的方法不同，Cascade RL协调顺序的领域级强化学习，减少了工程复杂性，并在一系列基准测试中实现了最先进的性能。值得注意的是，当使用RLHF进行对齐作为预步骤时，它显著提升了模型的推理能力，远超简单的偏好优化，而后续的领域级RLVR阶段通常不会损害先前领域中获得的基准性能，甚至可能还会提升它（见图1中的示例）。经过RL训练后的14B模型，在LiveCodeBench v5/v6/Pro上优于其SFT教师DeepSeek-R1-0528，并在2025年国际信息学奥林匹克竞赛（IOI）中获得银牌。我们透明地分享了我们的训练和数据配方。

IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning

Authors: Aayush Mishra, Daniel Khashabi, Anqi Liu

First: 2025-09-26T17:46:32+00:00 · Latest: 2025-12-15T18:01:10+00:00

Abs · PDF · Code1 · Code2

Abstract

Supervised Fine-Tuning (SFT) is used to specialize model behavior by training weights to produce intended target responses for queries. In contrast, In-Context Learning (ICL) adapts models during inference with instructions or demonstrations in the prompt. ICL can offer better generalizability and more calibrated responses compared to SFT in data scarce settings, at the cost of more inference compute. In this work, we ask the question: Can ICL's internal computations be used to improve the qualities of SFT? We first show that ICL and SFT produce distinct activation patterns, indicating that the two methods achieve adaptation through different functional mechanisms. Motivated by this observation and to use ICL's rich functionality, we introduce ICL Activation Alignment (IA2), a self-distillation technique which aims to replicate ICL's activation patterns in SFT models and incentivizes ICL-like internal reasoning. Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families. This finding is not only practically useful, but also offers a conceptual window into the inner mechanics of model adaptation.

中文标题/摘要

标题：IA2：ICL激活的一致性改进监督微调

监督微调（SFT）用于通过训练权重以产生查询的预期目标响应来使模型行为专业化。相比之下，上下文学习（ICL）在推理过程中通过提示中的指令或示范来适应模型。在数据稀缺的情况下，ICL相比SFT可以提供更好的泛化能力和更校准的响应，但代价是更多的推理计算。在本研究中，我们提出的问题是：ICL的内部计算能否用于改进SFT的质量？我们首先展示了ICL和SFT产生不同的激活模式，表明这两种方法通过不同的功能机制实现适应。受此观察的启发，为了利用ICL的丰富功能，我们引入了ICL激活一致性（IA2），这是一种自我蒸馏技术，旨在复制ICL的激活模式并在SFT模型中激励ICL似的内部推理。在SFT之前作为预热步骤执行IA2，显著提高了模型输出的准确性和校准性，如我们在12个流行基准和两个模型家族上的广泛实验证明。这一发现不仅在实践中具有实用性，还为模型适应的内部机制提供了一个概念性的窗口。

Summary / 总结

This work explores whether In-Context Learning (ICL) activations can improve Supervised Fine-Tuning (SFT) by introducing ICL Activation Alignment (IA2), a self-distillation technique. The study shows that ICL and SFT produce different activation patterns, suggesting they adapt through distinct mechanisms. By aligning SFT with ICL activations, IA2 enhances model accuracy and calibration, as demonstrated through extensive experiments on 12 benchmarks and two model families.

研究探讨了是否可以通过使SFT模型与ICL的激活模式对齐来增强SFT，从而利用ICL的优势。作者提出了一种名为IA2的自蒸馏技术，旨在将ICL的内部计算模式复制到SFT模型中。在12个基准测试上的广泛实验表明，IA2可以显著提高SFT模型输出的准确性和校准性，这不仅具有实际应用价值，还为模型适应机制提供了新的见解。

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Authors: Jianxiong Gao, Zhaoxi Chen, Xian Liu, Junhao Zhuang, Chengming Xu, Jianfeng Feng, Yu Qiao, Yanwei Fu, Chenyang Si, Ziwei Liu

First: 2025-12-15T17:59:58+00:00 · Latest: 2025-12-15T17:59:58+00:00

Comments: Project Page: https://vchitect.github.io/LongVie2-project/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.

中文标题/摘要

标题：LongVie 2：多模态可控超长视频世界模型

基于预训练视频生成系统的视频世界模型构建是实现通用时空智能的重要但具有挑战性的一步。一个世界模型应具备三个基本属性：可控性、长期视觉质量和时间一致性。为此，我们采取渐进的方法——首先增强可控性，然后扩展到长期高质量生成。我们提出了LongVie 2，这是一种端到端的自回归框架，分三个阶段训练：（1）多模态引导，将密集和稀疏控制信号结合，提供隐式的全局监督，提高可控性；（2）输入帧的降级感知训练，缩小训练与长期推理之间的差距，保持高质量视觉效果；（3）历史上下文引导，对相邻片段中的上下文信息进行对齐，以确保时间一致性。我们还引入了LongVGenBench，这是一个包含100个高分辨率一分钟视频的基准，涵盖了多种真实世界和合成环境。广泛的实验表明，LongVie 2 在长距离可控性、时间连贯性和视觉保真度方面达到了最先进的性能，并支持长达五分钟的连续视频生成，标志着统一视频世界建模的重要一步。

Summary / 总结

The research aims to develop a multimodal controllable ultra-long video world model with long-term visual quality and temporal consistency. LongVie 2 is an end-to-end autoregressive framework trained in three stages: multi-modal guidance for controllability, degradation-aware training for high visual quality, and history-context guidance for temporal consistency. Experiments show that LongVie 2 outperforms existing methods in long-range controllability, temporal coherence, and visual fidelity, supporting continuous video generation up to five minutes.

研究旨在开发一个具备长时视觉质量和时间连贯性的多模态可控超长视频世界模型。LongVie 2 是一个端到端的自回归框架，通过三个阶段训练：多模态引导提高可控性、退化感知训练保持高视觉质量、历史上下文引导确保时间连贯性。实验表明，LongVie 2 在长距离可控性、时间连贯性和视觉保真度方面优于现有方法，并支持长达五分钟的连续视频生成。

DA-SSL: self-supervised domain adaptor to leverage foundational models in turbt histopathology slides

Authors: Haoyue Zhang, Meera Chappidi, Erolcan Sayar, Helen Richards, Zhijun Chen, Lucas Liu, Roxanne Wadia, Peter A Humphrey, Fady Ghali, Alberto Contreras-Sanz, Peter Black, Jonathan Wright, Stephanie Harmon, Michael Haffner

First: 2025-12-15T17:53:18+00:00 · Latest: 2025-12-15T17:53:18+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent deep learning frameworks in histopathology, particularly multiple instance learning (MIL) combined with pathology foundational models (PFMs), have shown strong performance. However, PFMs exhibit limitations on certain cancer or specimen types due to domain shifts: either the cancer types were rarely used for pretraining, or the specimens contain tissue-based artifacts rarely seen within the pretraining population. Such is the case for transurethral resection of bladder tumor (TURBT) specimens, which are essential for diagnosing muscle-invasive bladder cancer (MIBC) but contain fragmented tissue chips and electrocautery artifacts and were not widely used in publicly available PFMs. To address this, we propose a simple yet effective domain-adaptive self-supervised adaptor (DA-SSL) that realigns pretrained PFM features to the TURBT domain without fine-tuning the foundational model itself. We pilot this framework for predicting treatment response in TURBT, where histomorphological features are currently underutilized and identifying patients who will benefit from neoadjuvant chemotherapy (NAC) is challenging. In our multi-center study, DA-SSL achieved an AUC of 0.77+/-0.04 in five-fold cross-validation and an external test accuracy of 0.84, sensitivity of 0.71, and specificity of 0.91 using majority voting. Our results demonstrate that lightweight domain adaptation with self-supervision can effectively enhance PFM-based MIL pipelines for clinically challenging histopathology tasks. Code is available at https://github.com/zhanghaoyue/DA_SSL_TURBT.

中文标题/摘要

标题：DA-SSL：用于利用基础模型在经尿道膀胱肿瘤切除术病理切片中的自我监督领域适配器

在病理学中，特别是结合病理基础模型（PFMs）的多实例学习（MIL）的近期深度学习框架已经显示出强大的性能。然而，PFMs由于领域偏移而在某些癌症或标本类型上存在局限性——这些癌症类型在预训练中很少使用，或标本包含在预训练群体中很少见的组织基质伪影。这种情况适用于经尿道膀胱肿瘤切除术（TURBT），这是诊断浸润性膀胱癌（MIBC）所必需的，但包含碎片化的组织芯片和电凝伪影，并且在公开可用的PFMs中未广泛使用。为了解决这个问题，我们提出了一种简单而有效的领域适配自我监督适配器（DA-SSL），它无需微调基础模型本身即可重新对齐预训练的PFM特征到TURBT领域。我们为预测TURBT的治疗反应试点了这一框架，在TURBT中，组织形态学特征目前被严重低估，识别将从新辅助化疗（NAC）中受益的患者具有挑战性。在我们的多中心研究中，DA-SSL在五折交叉验证中实现了0.77±0.04的AUC，并在外部测试中实现了0.84的准确率、0.71的敏感性和0.91的特异性，使用多数投票。我们的结果表明，轻量级的自我监督领域适应可以有效地增强基于PFM的MIL流水线，以应对临床挑战性的病理学任务。代码可在https://github.com/zhanghaoyue/DA_SSL_TURBT/ 获取。

Summary / 总结

The research aims to address the limitations of pathology foundational models (PFMs) in diagnosing certain cancer types, particularly TURBT, by proposing a domain-adaptive self-supervised adaptor (DA-SSL) that realigns pretrained PFM features to the TURBT domain without fine-tuning the foundational model. The method was evaluated in a multi-center study for predicting treatment response in TURBT, achieving an AUC of 0.77±0.04 and external test accuracy of 0.84, sensitivity of 0.71, and specificity of 0.91 using majority voting. This demonstrates that lightweight domain adaptation with self-supervision can effectively enhance PFM-based MIL pipelines for clinically challenging histopathology tasks.

该研究提出了一种自监督领域适配器DA-SSL，旨在提高病理基础模型（PFM）在膀胱肿瘤经尿道切除（TURBT）组织病理学切片上的性能。该方法在不微调PFM本身的情况下，重新对齐预训练的PFM特征以适应TURBT领域。在多中心研究中，DA-SSL在五折交叉验证中的AUC达到0.77±0.04，外部测试准确率为0.84，敏感性为0.71，特异性为0.91，使用多数投票法，证明了其在临床挑战性病理学任务中增强PFM基于的多重实例学习管道的有效性。

Textual Gradients are a Flawed Metaphor for Automatic Prompt Optimization

Authors: Daniel Melcer, Qi Chen, Wen-Hao Chiang, Shweta Garg, Pranav Garg, Christian Bock

First: 2025-12-15T17:52:16+00:00 · Latest: 2025-12-15T17:52:16+00:00

Abs · PDF · Code1 · Code2

Abstract

A well-engineered prompt can increase the performance of large language models; automatic prompt optimization techniques aim to increase performance without requiring human effort to tune the prompts. One leading class of prompt optimization techniques introduces the analogy of textual gradients. We investigate the behavior of these textual gradient methods through a series of experiments and case studies. While such methods often result in a performance improvement, our experiments suggest that the gradient analogy does not accurately explain their behavior. Our insights may inform the selection of prompt optimization strategies, and development of new approaches.

中文标题/摘要

标题：文本梯度是自动提示优化的不恰当比喻

精心设计的提示可以提高大型语言模型的性能；自动提示优化技术旨在提高性能而不需人工调整提示。一类领先的提示优化技术引入了文本梯度的类比。我们通过一系列实验和案例研究调查了这些文本梯度方法的行为。尽管这些方法通常会导致性能提升，但我们的实验表明，梯度类比并不能准确解释其行为。我们的见解可能有助于提示优化策略的选择，并促进新方法的发展。

Summary / 总结

The paper investigates the effectiveness of textual gradient methods in automatic prompt optimization for large language models. Despite often improving performance, the experiments show that these methods do not behave as expected based on the gradient analogy. This research provides insights for choosing prompt optimization strategies and developing new approaches.

论文研究了文本梯度方法在大型语言模型自动提示优化中的应用。尽管这些方法通常能提高性能，但实验表明，它们的行为并不符合梯度类比的预期。这项研究旨在指导有效的提示优化策略的选择，并激发新的方法的发展。

Lighting in Motion: Spatiotemporal HDR Lighting Estimation

Authors: Christophe Bolduc, Julien Philip, Li Ma, Mingming He, Paul Debevec, Jean-François Lalonde

First: 2025-12-15T17:49:22+00:00 · Latest: 2025-12-15T17:49:22+00:00

Abs · PDF · Code1 · Code2

Abstract

We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.

中文标题/摘要

标题：运动中的照明：时空高动态范围照明估计

我们提出了运动中的照明（LiMo），这是一种基于扩散的方法，用于时空照明估计。LiMo旨在预测真实的高频细节并准确估计照度。为了兼顾这两点，我们提出根据输入中的3D位置生成不同曝光的镜面和漫反射球体。利用扩散先验，我们在大规模定制的室内和室外场景数据集上对强大的现有扩散模型进行微调，该数据集与时空光探针配对。为了准确的空间条件，我们证明仅使用深度是不够的，并引入了一种新的几何条件来提供场景相对于目标3D位置的相对位置。最后，我们利用可微渲染将不同曝光的漫反射和镜面预测合并为一个高动态范围图像（HDRI）图。我们全面评估了我们的方法和设计选择，以确立LiMo在空间控制和预测准确性方面的最新技术水平。

Summary / 总结

Lighting in Motion (LiMo) is a diffusion-based approach for spatiotemporal lighting estimation that aims to predict realistic high-frequency details and accurate illuminance. It generates mirrored and diffuse spheres at various exposures based on 3D positions and uses diffusion priors to fine-tune existing models on a large customized dataset. LiMo introduces a new geometric condition for spatial conditioning and combines diffuse and mirror predictions into a single HDRI map. Thorough evaluations show that LiMo outperforms existing methods in both spatial control and prediction accuracy.

Lighting in Motion (LiMo) 是一种基于扩散的方法，用于时空照明估计，旨在预测真实的高频细节和准确的照度。它根据3D位置生成不同曝光的镜面和漫反射球体，并使用扩散先验对现有模型进行大规模数据集的微调。LiMo 引入了一种新的几何条件来进行空间条件，并将漫反射和镜面预测结合成一个 HDRI 图。详尽的评估表明，LiMo 在空间控制和预测准确性方面优于现有方法。

Bilevel ZOFO: Efficient LLM Fine-Tuning and Meta-Training

Authors: Reza Shirkavand, Peiran Yu, Qi He, Heng Huang

First: 2025-02-05T20:47:44+00:00 · Latest: 2025-12-15T17:47:03+00:00

Abstract

Fine-tuning pre-trained Large Language Models (LLMs) for downstream tasks using First-Order (FO) optimizers presents significant computational challenges. Parameter-Efficient Fine-Tuning (PEFT) methods address these by freezing most model parameters and training only a small subset. However, PEFT often underperforms compared to full fine-tuning when high task-specific accuracy is required. Zeroth-Order (ZO) methods fine-tune the entire pre-trained model without back-propagation, estimating gradients through forward passes only. While memory-efficient, ZO methods suffer from slow convergence and high sensitivity to prompt selection. We bridge these two worlds with Bilevel-ZOFO, a bilevel optimization method that couples fast, local FO-PEFT adaptation at the inner level with stable, memory-efficient ZO updates of the full backbone at the outer level. The FO-PEFT inner loop performs fast, low-memory local adaptation that reduces the variance of ZO estimates and stabilizes the search, guiding the outer ZO updates of the full backbone and reducing prompt sensitivity. Meanwhile, the outer ZO provides better generalization ability for PEFT. We provide theoretical convergence guarantees and empirically demonstrate that Bilevel-ZOFO significantly outperforms existing ZO and FO-PEFT methods, achieving 2-4 times faster training while maintaining similar memory efficiency. Additionally, we show that by updating the backbone with ZO and adapting only a tiny FO-PEFT block per task, Bilevel-ZOFO combines full-model capacity with few-shot efficiency, making it a very efficient meta-learning algorithm that quickly adapts to new tasks.
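To make "estimating gradients through forward passes only" concrete, the sketch below shows a generic two-point zeroth-order (SPSA-style) estimator on a toy quadratic loss. It is not the Bilevel-ZOFO algorithm: the bilevel coupling with an FO-PEFT inner loop is omitted, and the function names, step size, and toy loss are assumptions made for illustration.

```python
import random


def zo_gradient(loss, params, eps=1e-3, rng=random):
    """Two-point zeroth-order gradient estimate from two forward passes.

    A random direction z is sampled, the loss is evaluated at params + eps*z
    and params - eps*z, and the finite difference along z is projected back.
    """
    z = [rng.gauss(0.0, 1.0) for _ in params]
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [scale * zi for zi in z]


def zo_sgd(loss, params, lr=0.05, steps=500, seed=0):
    """Plain SGD driven by the zeroth-order estimate (no back-propagation)."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(steps):
        grad = zo_gradient(loss, params, rng=rng)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


# Toy loss (an assumption for illustration): squared distance to a target.
target = [1.0, -2.0, 0.5]


def quadratic(params):
    return sum((p - t) ** 2 for p, t in zip(params, target))


fitted = zo_sgd(quadratic, [0.0, 0.0, 0.0])
```

Each step costs two loss evaluations and stores no activations, which is where the memory saving comes from. The variance of the estimate grows with the number of parameters, which is consistent with the slow convergence the abstract attributes to ZO methods.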

中文标题/摘要

标题：双层ZOFO：高效的大语言模型微调与元训练

使用一阶（FO）优化器对预训练的大语言模型（LLMs）进行下游任务的微调面临着显著的计算挑战。参数高效微调（PEFT）方法通过冻结大部分模型参数并仅训练一个小子集来解决这些问题。然而，当需要高任务特定准确性时，PEFT往往不如全微调表现得好。零阶（ZO）方法通过仅通过前向传递估计梯度来对整个预训练模型进行微调，而不进行反向传播，这虽然节省了内存，但收敛速度慢且对提示选择高度敏感。我们通过结合内层快速局部FO-PEFT适应和外层稳定、节省内存的ZO更新，用双层优化方法Bilevel-ZOFO来弥合这两者之间的差距。FO-PEFT内层循环执行快速、低内存的局部适应，减少了ZO估计的方差并稳定了搜索，指导外层ZO更新整个主干，减少提示敏感性。同时，外层ZO为PEFT提供了更好的泛化能力。我们提供了理论收敛保证，并通过实验表明Bilevel-ZOFO显著优于现有的ZO和FO-PEFT方法，在保持类似内存效率的同时，训练速度提高了2-4倍。此外，我们通过用ZO更新主干并在每次任务中仅适应一个小型FO-PEFT块，展示了Bilevel-ZOFO结合了全模型容量和少样本效率，使其成为一个非常高效的元学习算法，能够快速适应新任务。

Summary / 总结

Bilevel-ZOFO is a bilevel optimization method that combines fast, low-memory local adaptation using First-Order Parameter-Efficient Fine-Tuning (FO-PEFT) with stable, memory-efficient Zeroth-Order (ZO) updates of the full model. This approach reduces the variance of ZO estimates and stabilizes the search, leading to faster training and better generalization compared to existing ZO and FO-PEFT methods. Bilevel-ZOFO achieves 2-4 times faster training while maintaining similar memory efficiency, and it efficiently adapts to new tasks with minimal parameter updates.

Bilevel-ZOFO 是一种结合了快速局部一阶（FO）参数高效微调（PEFT）和稳定的零阶（ZO）全模型更新的 bilevel 优化方法。这种方法减少了 ZO 估计的方差并稳定了搜索，从而比现有的 ZO 和 FO-PEFT 方法更快地训练并具有更好的泛化能力。Bilevel-ZOFO 在保持类似内存效率的同时，显著优于其他方法，特别是在任务特定准确性方面。

Explainable Quantum Machine Learning for Multispectral Images Segmentation: Case Study

Authors: Tomasz Rybotycki, Manish K. Gupta, Piotr Gawron

First: 2025-03-11T23:55:10+00:00 · Latest: 2025-12-15T17:46:00+00:00

Comments: 18 pages, 5 figures, 3 tables. Extended and refined version of "On the status of current quantum machine learning software"

Abstract

The emergence of Big Data changed how we approach information systems engineering. Nowadays, when we can use remote sensing techniques for Big Data acquisition, the issues such data introduce are as important as ever. One of those concerns is the processing of the data. Classical methods often fail to address that problem or are incapable of processing the data in a reasonable time. With that in mind, information system engineers are required to investigate different approaches to data processing. Recent advancements in the implementation of noisy intermediate-scale quantum (NISQ) devices allow us to investigate their application to real-life computational problems. This field of study is called quantum (information) systems engineering and usually focuses on technical problems with contemporary devices. However, hardware challenges are not the only ones that hinder our quantum computation capabilities. Software limitations are the other, less explored side of this coin. Using multispectral image segmentation as an example task, we investigated how difficult it is to run a hybrid quantum-classical model on a real, publicly available quantum device. To quantify how and explain why the performance of our model changed when run on a real device, we propose new explainability metrics. These metrics give new meaning to explainable quantum machine learning: the explanation of the performance issue comes from the quantum device behavior. We also analyzed the expected monetary costs of running a similar experiment on contemporary quantum devices using standard market prices.

中文标题/摘要

标题：可解释的量子机器学习在多光谱图像分割中的应用：案例研究

大数据的出现改变了我们对信息系统工程的看法。如今，当我们可以使用遥感技术进行大数据获取时，数据带来的问题仍然非常重要。其中一个问题是数据的处理。经典方法往往无法解决这个问题，或者无法在合理的时间内处理数据。因此，信息系统工程师需要研究不同的数据处理方法。最近在嘈杂的中等规模量子（NISQ）设备实现方面的进展使我们能够研究其在实际计算问题中的应用。这一领域的研究称为量子（信息）系统工程，通常关注当代设备的技术问题。然而，硬件挑战并不是阻碍我们量子计算能力的唯一问题。软件限制是另一个较少探索的方面。以多光谱图像分割为例，我们研究了如何在实际的、公开可用的量子设备上运行混合量子-经典模型。为了量化并解释我们的模型在实际设备上运行时性能的变化，我们提出了新的可解释性度量。这些度量为可解释的量子机器学习引入了新的含义；解释性能问题来自于量子设备的行为。我们还分析了在当前量子设备上运行类似实验的预期成本，使用了标准市场价格。

Summary / 总结

The paper explores the application of explainable quantum machine learning for multispectral image segmentation, addressing the limitations of classical methods in processing big data. Using a hybrid quantum-classical model, the authors propose new explainability metrics to quantify and explain performance issues when running the model on real quantum devices. Key findings include the identification of performance degradation and the introduction of metrics that link these issues to the behavior of quantum devices. The study also evaluates the cost implications of running similar experiments on contemporary quantum devices.

该研究解决了使用经典方法处理来自遥感技术的大数据时遇到的挑战，这些方法往往无法高效处理数据。通过利用噪声中等规模量子（NISQ）设备，研究人员提出了一种用于多光谱图像分割的混合量子-经典模型。他们引入了新的可解释性度量来量化并解释在实际量子设备上运行模型时性能的变化，突显了量子计算中的软件限制。研究还评估了在当前量子设备上运行类似实验的预期成本。

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Authors: Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li

First: 2025-12-15T17:41:19+00:00 · Latest: 2025-12-15T17:41:19+00:00

Abstract

Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.
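The control flow of the plan-and-infill loop described above can be sketched with stub functions. This is a toy illustration, not the ReFusion implementation: `plan_fn` and `infill_fn` are hypothetical stand-ins for the diffusion-based planner and the autoregressive infiller, and `slots_per_step` is an assumed parameter.

```python
MASK = None  # placeholder for a position that has not been decoded yet


def split_into_slots(length, slot_len):
    """Return (start, end) index pairs for fixed-length contiguous slots."""
    return [(s, min(s + slot_len, length)) for s in range(0, length, slot_len)]


def plan_and_infill(length, slot_len, plan_fn, infill_fn, slots_per_step=2):
    """Decode a sequence slot by slot.

    plan_fn(seq, pending)  -> pending slots ordered by preference (hypothetical planner)
    infill_fn(context, i)  -> token for position i given the context (hypothetical infiller)
    """
    seq = [MASK] * length
    pending = split_into_slots(length, slot_len)
    steps = 0
    while pending:
        chosen = plan_fn(seq, pending)[:slots_per_step]
        if not chosen:
            raise ValueError("planner returned no slots")
        snapshot = list(seq)  # all chosen slots see the same context
        for start, end in chosen:
            local = list(snapshot)  # slots in one step do not see each other
            for i in range(start, end):  # left to right inside the slot
                local[i] = infill_fn(local, i)
                seq[i] = local[i]
        pending = [slot for slot in pending if slot not in chosen]
        steps += 1
    return seq, steps


# Toy run: 8 positions, slots of length 2, two slots decoded per step.
decoded, num_steps = plan_and_infill(
    8, 2,
    plan_fn=lambda seq, pending: pending,
    infill_fn=lambda context, i: i,
)
```

In this toy run the sequence is completed in two planning steps instead of eight token-by-token steps. The snapshot copy models the independence of slots decoded in the same step, which is why the planner is described as selecting weakly dependent slots.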

中文标题/摘要

标题：ReFusion：一种具有并行自回归解码的扩散大型语言模型

自回归模型（ARMs）受到缓慢的顺序推理限制。虽然掩码扩散模型（MDMs）提供了并行的替代方案，但它们遭受着关键的缺点：由于禁止使用键值（KV）缓存而导致的高计算开销，以及由于在难以处理的令牌组合空间中学习依赖关系而导致的不连贯生成。为了解决这些限制，我们引入了ReFusion，这是一种新颖的掩码扩散模型，通过将并行解码从令牌级别提升到更高的一级插槽级别，实现了更优的性能和效率，其中每个插槽是一个固定长度的连续子序列。这通过迭代的“计划和填充”解码过程实现：基于扩散的计划步骤首先识别一组弱依赖插槽，然后自回归填充步骤并行解码这些选定的插槽。基于插槽的设计同时解锁了完整的KV缓存重用，并通过统一的因果框架减少了从令牌组合空间到可管理的插槽级排列空间的学习复杂性。在七个不同的基准上的广泛实验表明，ReFusion不仅在性能上大幅超越了先前的MDMs，平均性能提高了34%，速度提高了18倍以上，而且还缩小了与强大的ARMs之间的性能差距，同时保持了2.33倍的平均加速。

Summary / 总结

ReFusion is a novel masked diffusion model that lifts parallel decoding from the token level to the slot level, targeting both the slow sequential inference of autoregressive models and the KV-cache and coherence problems of masked diffusion models. Through an iterative 'plan-and-infill' process, ReFusion identifies weakly dependent slots and decodes them in parallel, enabling full Key-Value cache reuse and reducing learning complexity. Experiments on seven benchmarks show that ReFusion improves on previous masked diffusion models by 34% in performance with an average speedup of over 18 times, and narrows the performance gap to strong autoregressive models while running 2.33 times faster on average.

ReFusion 是一种新型的掩码扩散模型，通过在更高层次的槽级别而非标记级别实现并行解码，改进了自回归模型。这种方法解决了传统掩码扩散模型的计算开销和生成不连贯的问题。在七个基准上的实验表明，ReFusion 在性能上比之前的掩码扩散模型高出 34%，平均快了 18 倍，同时还能与强大的自回归模型缩小性能差距，并保持 2.33 倍的平均加速。

WakeupUrban: Unsupervised Semantic Segmentation of Mid-20th Century Urban Landscapes with Satellite Imagery

Authors: Tianxiang Hao, Lixian Zhang, Yingjia Zhang, Mengxuan Chen, Jinxiao Zhang, Runmin Dong, Haohuan Fu

First: 2025-06-11T07:41:30+00:00 · Latest: 2025-12-15T17:33:00+00:00

Comments: 11 pages, 4 figures, 3 tables

Abstract

Historical satellite imagery archives, such as Keyhole satellite data, offer rare insights into early urban development and long-term transformation. However, severe quality degradation (e.g., distortion, misalignment, and spectral scarcity) and the absence of annotations have long hindered their analysis. To bridge this gap and enhance understanding of urban development, we introduce WakeupUrbanBench, an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing remote sensing (RS) datasets, along with a framework for unsupervised segmentation tasks, WakeupUSM. First, WakeupUrbanBench serves as a pioneering, expertly annotated dataset built on mid-20th century RS imagery, involving four key urban classes and spanning 4 cities across 2 continents with nearly 1000 km² of diverse urban morphologies, and additionally introducing one present-day city. Second, WakeupUSM is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence-aware alignment mechanism and focal-confidence loss based on a self-supervised learning architecture, which generates robust pseudo-labels and adaptively prioritizes prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision. Comprehensive experiments demonstrate that WakeupUSM significantly outperforms existing unsupervised segmentation methods on both WakeupUrbanBench and public datasets, promising to pave the way for quantitative studies of long-term urban change using modern computer vision. Our benchmark and codes will be released at https://github.com/Tianxiang-Hao/WakeupUrban.
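The abstract does not give the formula for the focal-confidence loss. As orientation only, the sketch below combines the standard focal loss, which down-weights easy predictions, with a per-pixel pseudo-label confidence weight. The combination rule, the function names, and the example values are assumptions for illustration and should not be read as the WakeupUSM definition.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Standard focal loss for one pixel.

    p_true is the predicted probability of the (pseudo-)labelled class.
    Confident correct predictions (p_true near 1) contribute almost nothing.
    """
    p = min(max(p_true, 1e-12), 1.0)
    return -((1.0 - p) ** gamma) * math.log(p)


def confidence_weighted_focal(p_true_list, confidences, gamma=2.0):
    """Average focal loss, weighted by pseudo-label confidence.

    The confidence weighting is a hypothetical choice: pixels whose
    pseudo-labels are considered unreliable contribute less to the loss.
    """
    total = sum(confidences)
    if total == 0.0:
        return 0.0
    weighted = sum(c * focal_loss(p, gamma) for p, c in zip(p_true_list, confidences))
    return weighted / total


# Hypothetical pixels: predicted probability of the pseudo-label, and label confidence.
probs = [0.9, 0.5, 0.2]
confs = [1.0, 0.8, 0.1]
loss = confidence_weighted_focal(probs, confs)
```

The two factors play different roles: the focal term emphasizes hard pixels, while the confidence weight limits the influence of pixels whose pseudo-labels are likely wrong, matching the abstract's description of balancing prediction difficulty against label reliability.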

中文标题/摘要

标题：WakeupUrban：基于卫星影像的20世纪中期城市景观无监督语义分割

历史卫星影像档案，如关键孔卫星数据，提供了理解早期城市发展和长期转变的罕见见解。然而，严重的质量退化（例如，失真、错位和光谱稀缺）以及缺乏注释长期阻碍了其分析。为弥合这一差距并增强对城市发展的理解，我们引入了WakeupUrbanBench，这是一个基于历史卫星影像的标注分割数据集，是所有现有遥感（RS）数据集中最早观测时间的数据集，以及一个用于无监督分割任务的框架WakeupUSM。首先，WakeupUrbanBench作为先驱，是一个基于20世纪中期遥感影像的专家标注数据集，涉及四个关键城市类别，跨越两个大洲的四座城市，总面积近1000平方公里，包含多种城市形态，并引入了一个现代城市。其次，WakeupUSM是一个用于历史遥感影像的新型无监督语义分割框架。它采用一种基于自我监督学习架构的置信度感知对齐机制和焦点-置信度损失，生成稳健的伪标签，并自适应地优先考虑预测难度和标签可靠性，以提高在嘈杂的历史数据上的无监督分割，而无需人工监督。全面的实验表明，WakeupUSM在WakeupUrbanBench和公共数据集上的无监督分割方法中表现显著优于现有方法，有望为使用现代计算机视觉定量研究长期城市变化铺平道路。我们的基准和代码将在https://github.com/Tianxiang-Hao/WakeupUrban发布。

Summary / 总结

The research aims to leverage historical satellite imagery to understand urban development and long-term transformation. To address the challenges of quality degradation and lack of annotations, the authors introduce WakeupUrbanBench, an annotated dataset, and WakeupUSM, an unsupervised semantic segmentation framework. WakeupUSM uses a confidence-aware alignment mechanism and focal-confidence loss to generate robust pseudo-labels and improve segmentation accuracy on noisy historical data. Experiments show that WakeupUSM outperforms existing methods on both WakeupUrbanBench and public datasets, facilitating quantitative studies of urban change over time.

研究旨在利用历史卫星图像来理解城市发展的长期变化，解决质量退化和缺乏注释的问题。研究引入了WakeupUrbanBench，这是一个基于20世纪中期遥感图像的标注数据集，以及WakeupUSM，这是一个用于历史遥感图像的无监督语义分割框架。WakeupUSM采用了一种基于自监督学习的置信度感知对齐机制和焦点置信损失，生成稳健的伪标签，从而在嘈杂的历史数据上提高无监督分割的性能。实验表明，WakeupUSM在WakeupUrbanBench和公开数据集上均优于现有方法，有助于使用现代计算机视觉技术进行长期城市变化的定量研究。

From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

Authors: Dawei Li, Abdullah Alnaibari, Muhammad Arslan, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu

First: 2025-12-02T18:31:18+00:00 · Latest: 2025-12-15T17:31:26+00:00

Comments: Under review

Abstract

The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment when doing mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

中文标题/摘要

标题：从审查到调解：大语言模型能否作为在线争吵中的调解人？

大型语言模型（LLMs）的迅速发展为AI向善的应用开辟了新的可能性。随着LLMs越来越多地调解在线交流，它们促进同理心和建设性对话的潜力成为负责任AI研究的重要前沿。本文探讨LLMs是否不仅能作为检测有害内容的审查者，还能作为理解并缓解在线冲突的调解者。我们的框架将调解分解为两个子任务：判断，即LLM评估对话的公平性和情感动态；引导，即生成同理心、缓解冲突的消息，引导参与者走向解决。为了评估调解质量，我们构建了一个基于Reddit的大规模数据集，并提出了一种结合原则评分、用户模拟和人工对比的多阶段评估管道。实验表明，基于API的模型在进行调解时，在推理和干预一致性方面优于开源版本。我们的研究结果突显了当前LLMs作为在线社会调解新兴代理的潜力和局限性。

Summary / 总结

This study investigates whether large language models (LLMs) can function as mediators in online conflicts, beyond their role as moderators. The research proposes a framework that decomposes mediation into judgment and steering tasks. Using a large Reddit dataset, the study evaluates LLMs through a multi-stage pipeline, showing that API-based models outperform open-source models in both reasoning and intervention alignment during mediation. The findings suggest that while LLMs show promise, they still have limitations in online social mediation.

这项研究探讨了大型语言模型（LLMs）是否可以在在线冲突中作为调解者发挥作用，而不仅仅是作为内容过滤器。研究提出了一个将调解分解为判断和引导两个子任务的框架。通过一个大型的Reddit数据集，研究使用多阶段评估管道来评估LLMs，结果显示API基模型在推理和干预一致性方面优于开源模型。研究结果表明，尽管LLMs显示出潜力，但在在线社会调解方面仍存在局限性。

MMhops-R1: Multimodal Multi-hop Reasoning

Authors: Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Bing Li, Chunfeng Yuan, Guangting Wang, Fengyun Rao, Ying Shan, Weiming Hu

Venue: AAAI 2026

First: 2025-12-15T17:29:02+00:00 · Latest: 2025-12-15T17:29:02+00:00

Comments: Accepted by AAAI 2026

Abstract

The ability to perform multi-modal multi-hop reasoning by iteratively integrating information across various modalities and external knowledge is critical for addressing complex real-world challenges. However, existing Multi-modal Large Language Models (MLLMs) are predominantly limited to single-step reasoning, as existing benchmarks lack the complexity needed to evaluate and drive multi-hop abilities. To bridge this gap, we introduce MMhops, a novel, large-scale benchmark designed to systematically evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison, which require models to dynamically construct complex reasoning chains by integrating external knowledge. To tackle the challenges posed by MMhops, we propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation (mRAG) framework for dynamic reasoning. Our framework utilizes reinforcement learning to optimize the model for autonomously planning reasoning paths, formulating targeted queries, and synthesizing multi-level information. Comprehensive experiments demonstrate that MMhops-R1 significantly outperforms strong baselines on MMhops, highlighting that dynamic planning and multi-modal knowledge integration are crucial for complex reasoning. Moreover, MMhops-R1 demonstrates strong generalization to tasks requiring fixed-hop reasoning, underscoring the robustness of our dynamic planning approach. In conclusion, our work contributes a challenging new benchmark and a powerful baseline model, and we will release the associated code, data, and weights to catalyze future research in this critical area.
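The idea of dynamically planning a retrieval chain, as opposed to issuing a fixed number of queries, can be sketched as a small loop. This is a toy illustration under stated assumptions: `plan_next` and `retrieve` are hypothetical stand-ins for the learned policy and the retriever, the knowledge base is a plain dictionary, and the reinforcement learning used to train MMhops-R1 is not shown.

```python
def run_reasoning(start, plan_next, retrieve, max_hops=4):
    """Build a reasoning trace one hop at a time.

    plan_next(trace)         -> next query, or None when the policy decides to stop
    retrieve(entity, query)  -> retrieved fact, or None if nothing is found
    """
    trace = [start]
    for _ in range(max_hops):
        query = plan_next(trace)
        if query is None:  # the policy chooses the number of hops itself
            break
        result = retrieve(trace[-1], query)
        if result is None:  # retrieval failed, stop with what we have
            break
        trace.append(result)
    return trace


# Toy knowledge base and policy (both hypothetical).
knowledge = {
    ("Eiffel Tower", "located_in"): "Paris",
    ("Paris", "country"): "France",
}
relations = ["located_in", "country"]


def toy_policy(trace):
    hop = len(trace) - 1
    return relations[hop] if hop < len(relations) else None


def toy_retrieve(entity, query):
    return knowledge.get((entity, query))


chain = run_reasoning("Eiffel Tower", toy_policy, toy_retrieve)
```

Here the chain ends after two hops because the policy returns `None`, not because a hop count was fixed in advance. In a multi-modal setting the starting entity would typically be recognized from an image, which this text-only sketch leaves out.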

中文标题/摘要

标题：MMhops-R1：多模态多跳推理

在迭代整合各种模态和外部知识信息以进行多模态多跳推理方面的能力对于解决复杂的现实世界挑战至关重要。然而，现有的多模态大型语言模型（MLLMs）主要局限于单步推理，因为现有的基准测试缺乏评估和推动多跳能力所需的复杂性。为了弥合这一差距，我们引入了MMhops，这是一种新型的大规模基准测试，旨在系统地评估和促进多模态多跳推理。MMhops数据集包含两种具有挑战性的任务格式，即桥接和比较，这要求模型动态构建复杂的推理链并整合外部知识。为应对MMhops带来的挑战，我们提出了MMhops-R1，这是一种新颖的多模态检索增强生成（mRAG）框架，用于动态推理。我们的框架利用强化学习优化模型，使其能够自主规划推理路径、提出针对性查询并综合多层次信息。全面的实验表明，MMhops-R1在MMhops上显著优于强大的基线模型，突显了动态规划和多模态知识整合对于复杂推理的重要性。此外，MMhops-R1在需要固定跳数推理的任务中表现出强大的泛化能力，强调了我们动态规划方法的稳健性。总之，我们的工作贡献了一个具有挑战性的新基准和一个强大的基线模型，并将发布相关的代码、数据和权重，以促进该关键领域的未来研究。

Summary / 总结

The paper introduces MMhops, a new benchmark for evaluating multi-modal multi-hop reasoning, addressing the limitations of existing benchmarks. MMhops-R1, a novel Retrieval-Augmented Generation framework, is proposed to tackle the challenges posed by MMhops through dynamic planning and multi-modal knowledge integration. Experiments show that MMhops-R1 outperforms strong baselines and demonstrates strong generalization to fixed-hop reasoning tasks, emphasizing the importance of dynamic planning and multi-modal integration for complex reasoning.

论文提出了MMhops，这是一个新的基准，用于评估和促进多模态多跳推理，这对于解决复杂的现实世界挑战至关重要。提出了一个名为MMhops-R1的新检索增强生成框架，通过强化学习进行动态推理路径规划和多模态知识整合来应对MMhops的挑战。实验表明，MMhops-R1在基准测试中优于强大的基线，并且在固定跳推理任务中表现出强大的泛化能力。

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Authors: Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son Vu, Jenia Jitsev

First: 2025-09-29T21:40:10+00:00 · Latest: 2025-12-15T17:24:36+00:00

Comments: Code: https://github.com/ontocord/mixturevitae

Abstract

We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong downstream performance. MixtureVitae follows a permissive-first, risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources). MixtureVitae adopts a simple, single-stage pretraining recipe that integrates a large proportion of permissive synthetic instruction and reasoning data, signals typically introduced during post-training and generally scarce in permissive web corpora. We categorize all sources into a three-tier scheme that reflects varying risk levels and provide shard-level provenance metadata to enable risk-aware usage. In controlled experiments using the open-sci-ref training protocol (fixed architectures and hyperparameters; 50B and 300B token budgets across 130M-1.7B parameters), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B-parameters/300B-tokens setting, they surpass FineWeb-Edu and approach DCLM late in training. Performance is particularly strong on MMLU and on math and code benchmarks: a 1.7B model pretrained on 300B MixtureVitae tokens matches or exceeds a strong 1.7B instruction-tuned baseline on GSM8K, HumanEval, and MBPP, despite using over 36 times fewer tokens (300B vs. ~11T). Supported by a thorough decontamination analysis, these results show that permissive-first data with high instruction and reasoning density, tiered by licensing and provenance-related risk, can provide a practical and risk-mitigated foundation for training capable LLMs, reducing reliance on broad web scrapes without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
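Two points in the abstract lend themselves to a short worked example: risk-aware selection using shard-level tier metadata, and the "over 36 times fewer tokens" figure. The sketch below is illustrative only. The `Shard` fields, the tier numbering (1 as lowest risk), and the shard names and sizes are assumptions, not the actual MixtureVitae metadata schema; only the 300B and ~11T token counts come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    name: str    # hypothetical shard identifier
    tier: int    # assumed numbering: 1 = lowest risk, 3 = highest risk
    tokens: int  # number of tokens in the shard


def select_shards(shards, max_tier):
    """Keep only shards whose risk tier is at or below the accepted level."""
    return [s for s in shards if s.tier <= max_tier]


def token_ratio(baseline_tokens, corpus_tokens):
    """How many times more tokens the baseline used than the corpus budget."""
    return baseline_tokens / corpus_tokens


# Made-up shards, for illustration only.
shards = [
    Shard("public_domain_books", tier=1, tokens=40_000_000_000),
    Shard("permissive_code", tier=1, tokens=60_000_000_000),
    Shard("government_works", tier=2, tokens=25_000_000_000),
    Shard("tdm_eligible_web", tier=3, tokens=90_000_000_000),
]

conservative = select_shards(shards, max_tier=2)
conservative_tokens = sum(s.tokens for s in conservative)

# Token budgets quoted in the abstract: 300B versus roughly 11T.
ratio = token_ratio(11e12, 300e9)  # about 36.7
```

A user with a low risk tolerance would filter on the tier field before training, at the cost of a smaller token budget. The ratio of roughly 36.7 is consistent with the abstract's "over 36 times fewer tokens".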

中文标题/摘要

标题：MixtureVitae：开放的网络规模预训练数据集，基于宽松许可的文本源构建，包含高质量的指令和推理数据

我们介绍了MixtureVitae，一个开放访问的预训练语料库，旨在最小化法律风险同时提供强大的下游性能。MixtureVitae 采用了一种宽松许可优先、风险缓解的采样策略，结合了公共领域和宽松许可的文本（例如，CC-BY/Apache）以及经过仔细验证的低风险添加（例如，政府作品和欧盟TDM合格来源）。MixtureVitae 采用了一种简单的单阶段预训练配方，整合了大量的宽松许可合成指令和推理数据信号，这些信号通常在后训练阶段引入，并且在宽松许可的网络语料库中通常较为稀缺。我们将所有来源分为三级方案，反映不同的风险级别，并提供分片级别的来源元数据以支持风险意识使用。在使用开放科学参考训练协议（固定架构和超参数；130M-1.7B参数，50B和300B令牌预算）的受控实验中，使用MixtureVitae 训练的模型在一系列标准基准测试中始终优于其他宽松许可的数据集，在1.7B参数/300B令牌设置下，它们超越了FineWeb-Edu，并接近DCLM的后期训练表现。特别是在MMLU和数学、代码基准测试中，表现尤为突出：一个使用300B MixtureVitae令牌预训练的1.7B模型在GSM8K、HumanEval和MBPP基准测试中与一个强大的1.7B指令调优基线相当或超过，尽管使用了超过36倍少的令牌（300B vs. ~11T）。通过彻底的去污分析支持，这些结果表明，基于许可优先的数据，按许可证和来源相关风险分级，具有高指令和推理密度，可以为训练强大的语言模型提供实用且风险缓解的基础，减少对广泛网络抓取的依赖，而不牺牲竞争力。代码：https://github.com/ontocord/mixturevitae

Summary / 总结

MixtureVitae is an open-access pretraining corpus designed to minimize legal risk while maintaining strong downstream performance. It combines public-domain and permissively licensed text with carefully selected low-risk additions, using a simple pretraining recipe that integrates synthetic instruction and reasoning data. Experiments show that models trained on MixtureVitae outperform other permissive datasets across various benchmarks, particularly on MMLU, math, and code tasks, with a 1.7B model pretrained on 300B tokens matching or exceeding a strong instruction-tuned baseline despite using significantly fewer tokens.

MixtureVitae 是一个开放访问的预训练数据集，旨在最小化法律风险同时保持强大的下游性能。它结合了公共领域和许可许可的文本，并加入了经过仔细选择的低风险内容。使用简单的预训练配方，MixtureVitae 整合了大量的合成指令和推理数据。实验结果显示，使用 MixtureVitae 训练的模型在各种基准测试中优于其他许可数据集，特别是在 MMLU、数学和代码基准测试中，一个 1.7B 参数模型在 300B 令牌的预训练下，其表现与一个强大的 1.7B 参数指令调优基线相当或更好，尽管使用了显著较少的令牌。这表明，具有高指令和推理密度的许可许可数据可以提供一个实用且风险可控的基础，用于训练强大的语言模型，而不依赖于广泛的网页抓取。