arXiv 论文速递

CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback

Authors: Wenhang Ge, Guibao Shen, Jiawei Feng, Luozhou Wang, Hao Lu, Xingye Tian, Xin Tao, Ying-Cong Chen

First: 2026-01-22T18:59:56+00:00 · Latest: 2026-01-22T18:59:56+00:00

Abstract

Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, the camera controllability still remains limited. In this work, we build upon Reward Feedback Learning and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latent into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latent into 3D representations for reward quantization. Specifically, video latent along with the camera pose are decoded into 3D Gaussians. In this process, the camera pose not only acts as input, but also serves as a projection parameter. Misalignment between the video latent and camera pose will cause geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between the rendered novel views and ground-truth ones as reward. To accommodate the stochastic nature, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments conducted on RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: https://a-bigbao.github.io/CamPilot/
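The visibility-masked pixel reward described above can be illustrated with a small sketch. The abstract does not give the exact loss, so the L1 distance, the negative sign, and the function name `masked_pixel_reward` are all assumptions made for illustration; images are flattened to plain lists to keep the example self-contained.

```python
def masked_pixel_reward(rendered, ground_truth, visibility):
    """Negative mean absolute error over pixels marked visible.

    rendered, ground_truth: flat lists of pixel intensities (hypothetical layout).
    visibility: flat list of 0/1 flags; 1 marks a region treated as deterministic
    (in the abstract, such regions are derived via geometric warping).
    """
    if not (len(rendered) == len(ground_truth) == len(visibility)):
        raise ValueError("all inputs must have the same length")
    visible_count = sum(visibility)
    if visible_count == 0:
        return 0.0  # nothing to supervise, so no reward signal
    total_error = sum(
        v * abs(r - g) for r, g, v in zip(rendered, ground_truth, visibility)
    )
    # Higher reward means the rendering agrees better with the ground truth.
    return -total_error / visible_count


# Second pixel disagrees, but it is masked out in the first call.
print(masked_pixel_reward([0.0, 1.0], [0.0, 0.0], [1, 0]))
print(masked_pixel_reward([0.0, 1.0], [0.0, 0.0], [1, 1]))
```

Masking keeps regions that the model is free to hallucinate from penalizing an otherwise well-aligned video.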

中文标题/摘要

标题：CamPilot：通过高效相机奖励反馈提高视频扩散模型中的相机控制

近期在相机控制的视频扩散模型方面的进展显著提高了视频与相机的对齐。然而，相机的可控性仍然有限。在本工作中，我们基于奖励反馈学习，旨在进一步提高相机的可控性。然而，直接借用现有的奖励反馈学习（ReFL）方法面临几个挑战。首先，当前的奖励模型缺乏评估视频与相机对齐的能力。其次，将潜在变量解码为RGB视频以进行奖励计算引入了大量计算开销。第三，视频解码过程中通常忽略了3D几何信息。为了解决这些限制，我们引入了一种高效的相机感知3D解码器，将视频潜在变量解码为3D表示以进行奖励量化。具体来说，视频潜在变量连同相机姿态一起被解码为3D高斯分布。在这个过程中，相机姿态不仅作为输入，还作为投影参数。视频潜在变量与相机姿态之间的对齐不良会导致3D结构中的几何失真，从而产生模糊的渲染结果。基于这一特性，我们显式地优化渲染的新视角与真实视角之间的像素级一致性作为奖励。为了适应随机性，我们进一步引入了一个可见性项，仅监督通过几何变形得到的确定性区域。在RealEstate10K和WorldScore基准上的广泛实验表明了我们提出方法的有效性。项目页面：https://a-bigbao.github.io/CamPilot/

Summary / 总结

The research aims to enhance camera controllability in video diffusion models by addressing limitations in existing reward feedback learning approaches. It introduces an efficient camera-aware 3D decoder that decodes video latent into 3D representations for reward quantization, optimizing pixel-level consistency between rendered novel views and ground-truth ones. Experiments on RealEstate10K and WorldScore benchmarks show the proposed method's effectiveness in improving camera controllability and video-camera alignment.

研究旨在通过解决现有奖励反馈学习方法的局限性，提升视频扩散模型中的相机可控性。方法引入了一个高效的3D解码器，将视频潜变量解码为3D表示，并将相机姿态作为输入和投影参数。这使得可以显式优化渲染视图与真实视图之间的像素级一致性，从而提高视频-相机对齐。在RealEstate10K和WorldScore基准上的实验表明，所提出的方法在提升相机可控性方面是有效的。

PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

Authors: Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, Ismini Lourentzou

First: 2026-01-22T18:58:55+00:00 · Latest: 2026-01-22T18:58:55+00:00

Abstract

Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.

中文标题/摘要

标题：PyraTok：语言对齐的分层分词器用于视频理解和生成

离散视频VAEs是现代文本到视频生成和视频理解系统的基石，但现有的分词器通常在单尺度上学习视觉码本，词汇量有限且语言监督浅薄，导致跨模态对齐差且零样本迁移效果不佳。我们提出了PyraTok，一种语言对齐的分层分词器，能够在多个时空分辨率上学习语义结构化的离散潜在变量。PyraTok 基于一个预训练的视频VAE和一个新颖的语言对齐分层量化（LaPQ）模块，该模块使用共享的大二进制码本在多个深度上离散化编码特征，从而产生紧凑且富有表现力的视频分词序列。为了紧密耦合视觉分词与语言，PyraTok 联合优化多尺度文本引导量化和分词层次上的全局自回归目标。在十个基准测试中，PyraTok 在视频重建方面达到最先进的（SOTA）性能，一致地提高了文本到视频的质量，并在视频分割、动作定位和视频理解方面设立了新的SOTA零样本性能，能够稳健地扩展到4K/8K分辨率。

Summary / 总结

PyraTok is designed to improve the alignment between language and visual representations in video understanding and generation systems. It introduces a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. By discretizing encoder features at several depths with a shared large binary codebook, and by jointly optimizing multi-scale text-guided quantization with a global autoregressive objective, PyraTok reports state-of-the-art video reconstruction across ten benchmarks, consistent gains in text-to-video quality, and new state-of-the-art zero-shot results on video segmentation, temporal action localization, and video understanding, while scaling to 4K/8K resolutions.

PyraTok旨在通过在多个时空分辨率上学习语义结构化的离散潜变量来改善视频理解和生成中的跨模态对齐和零样本迁移。它使用一种名为Language aligned Pyramidal Quantization (LaPQ)的模块，以共享的大二进制码本在不同深度对编码特征进行离散化，并联合优化多尺度文本引导量化和全局自回归目标。PyraTok在视频重建、文本到视频生成以及各种视频理解任务中均达到最先进的性能，并且在高分辨率下表现出稳健的扩展性。

GutenOCR: A Grounded Vision-Language Front-End for Documents

Authors: Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew

First: 2026-01-20T21:26:15+00:00 · Latest: 2026-01-22T18:58:24+00:00

Abstract

GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.

中文标题/摘要

标题：GutenOCR：面向文档的具备定位能力的视觉语言前端

GutenOCR 是通过微调 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 获得的一系列具备定位能力（grounded）的 OCR 前端。由此得到的单检查点视觉语言模型通过统一的、基于提示的接口提供阅读、检测和定位功能。模型在商业文档、科学文章和合成定位数据上进行训练，支持全页和局部阅读，可输出行级和段落级的边界框，并支持条件式“x 在哪里？”查询。我们引入了一种定位式 OCR 评估协议，并展示了 GutenOCR-7B 在 10.5K 个留出的商业和科学页面上，将其 Qwen2.5-VL-7B 主干的综合定位式 OCR 分数提高了一倍以上（从 0.40 到 0.82）。在 Fox 和 OmniDocBench v1.5 上，我们的方法显著提高了区域级和行级 OCR 以及文本检测召回率，但也揭示了在页面级线性化、颜色引导 OCR 和公式密集布局方面的权衡。

Summary / 总结

GutenOCR is a family of vision-language models fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B that provide a unified, prompt-based interface for reading, detection, and grounding. Trained on business documents, scientific articles, and synthetic grounding data, GutenOCR-7B reaches a composite grounded OCR score of 0.82 on 10.5K held-out pages, compared with 0.40 for its backbone model. The models support full-page and localized reading with line- and paragraph-level bounding boxes and can answer "where is x?" queries. However, they show trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.

GutenOCR 是从 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 微调而来的一组视觉-语言前端模型，通过提示式接口提供统一的阅读、检测和定位功能。通过在商务文档、科学文章和合成定位数据上训练，GutenOCR-7B 在 10,500 个留出页面上的综合定位式 OCR 评分达到 0.82，而其基础模型的评分为 0.40。此外，它还提升了区域级和行级 OCR 以及文本检测召回率，但在页面级线性化、颜色引导 OCR 和公式密集型布局方面存在一些权衡。

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Authors: Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie

First: 2026-01-22T18:58:16+00:00 · Latest: 2026-01-22T18:58:16+00:00

Comments: website: https://rae-dit.github.io/scale-rae/

Abstract

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.

中文标题/摘要

标题：使用表示自编码器扩展文本到图像扩散变换器

表示自编码器（RAEs）在ImageNet上的扩散建模中通过在高维语义潜在空间中训练显示出明显的优势。在本文中，我们研究了这种框架是否可以扩展到大规模、自由形式的文本到图像（T2I）生成。我们首先在冻结的表示编码器（SigLIP-2）之上，将RAE解码器的训练扩展到ImageNet之外，在网页、合成和文本渲染数据上进行训练，发现虽然规模提高了通用保真度，但对于文本等特定领域，有针对性的数据组合是必不可少的。然后，我们严格检验了最初为ImageNet提出的RAE设计选择。我们的分析表明，规模化简化了框架：虽然维度相关的噪声调度仍然是关键，但诸如宽扩散头和噪声增强解码等架构复杂性在大规模下几乎没有益处。在此简化框架的基础上，我们对RAE与当前最先进的FLUX VAE在从0.5B到9.8B参数的扩散变换器规模下进行了受控比较。在所有模型规模下，RAE在预训练期间始终优于VAE。此外，在高质量数据集上的微调过程中，基于VAE的模型在64个周期后灾难性过拟合，而RAE模型在256个周期内保持稳定并取得持续更好的性能。在所有实验中，基于RAE的扩散模型展示了更快的收敛速度和更好的生成质量，确立了RAE作为比VAE更简单且更强的基础，适用于大规模T2I生成。此外，由于视觉理解和生成可以在共享表示空间中进行，多模态模型可以直接对生成的潜在变量进行推理，为统一模型开辟了新的可能性。

Summary / 总结

This work explores whether Representation Autoencoders (RAEs), which were originally shown to work well on ImageNet, scale to large-scale text-to-image (T2I) generation. By scaling RAE decoders on a frozen representation encoder (SigLIP-2) and training on web, synthetic, and text-rendering data, the study finds that while scale improves general image fidelity, targeted data composition is crucial for specific domains like text. It also finds that scale simplifies the framework: dimension-dependent noise scheduling remains critical, whereas wide diffusion heads and noise-augmented decoding offer negligible benefits. In a controlled comparison against the FLUX VAE at 0.5B to 9.8B parameters, RAEs outperform VAEs during both pretraining and finetuning, converge faster, and generate higher-quality images, establishing RAEs as a simpler and stronger foundation for T2I generation.

该研究探讨了Representation Autoencoders (RAEs)在文本到图像(T2I)生成中的可扩展性，最初在ImageNet上取得成功。通过在大规模、多样化的数据集上扩展RAE解码器，研究发现虽然规模提高了图像的一般质量，但特定领域如文本的数据组成至关重要。研究还表明，RAEs在预训练和微调过程中均优于变分自编码器（VAEs），展示了更快的收敛速度和更好的生成质量。这确立了RAEs作为大规模T2I生成的更简单且更强的基础。

LLM-in-Sandbox Elicits General Agentic Intelligence

Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei

First: 2026-01-22T18:57:09+00:00 · Latest: 2026-01-22T18:57:09+00:00

Comments: Project Page: https://llm-in-sandbox.github.io

Abstract

We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.

中文标题/摘要

标题：LLM-in-Sandbox 激发通用代理智能

我们介绍了 LLM-in-Sandbox，使大语言模型能够在代码沙箱（即虚拟计算机）中探索，以激发非代码领域的通用智能。我们首先展示了强大的大语言模型在无需额外训练的情况下，能够利用代码沙箱来完成非代码任务，表现出泛化能力。例如，大语言模型自发地访问外部资源以获取新知识，利用文件系统处理长上下文，并执行脚本以满足格式要求。我们进一步表明，LLM-in-Sandbox 强化学习（LLM-in-Sandbox-RL）仅使用非代理数据来训练模型进行沙箱探索，即可增强这些代理能力。实验表明，无论是在免训练设置还是后训练设置下，LLM-in-Sandbox 都能够在数学、物理、化学、生物医学、长上下文理解以及指令遵循等多个领域实现稳健的泛化。最后，我们从计算和系统角度分析了 LLM-in-Sandbox 的效率，并将其开源为 Python 包，以促进实际部署。

Summary / 总结

The study introduces LLM-in-Sandbox, which lets language models explore a code sandbox to elicit general intelligence in non-code domains. It demonstrates that strong language models, without additional training, generalize to using the sandbox for non-code tasks, such as accessing external resources, using the file system to handle long contexts, and executing scripts. These capabilities are further enhanced through LLM-in-Sandbox Reinforcement Learning, which trains models using only non-agentic data. Experiments show robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. The study also analyzes the efficiency of LLM-in-Sandbox from computational and system perspectives and open-sources it as a Python package for deployment.

研究旨在通过让大型语言模型（LLMs）探索代码沙箱来在非代码领域发展一般智能。研究显示，强大的LLMs可以泛化并在非代码任务中利用沙箱，例如访问外部资源和执行脚本。此外，LLM-in-Sandbox强化学习进一步增强了这些能力。实验表明，LLM-in-Sandbox在数学、物理和生物医学等多个领域表现出稳健的泛化能力。研究还从计算和系统角度分析了LLM-in-Sandbox的效率，并将其开源为Python包以促进实际部署。

Provable Robustness in Multimodal Large Language Models via Feature Space Smoothing

Authors: Song Xia, Meiwen Ding, Chenqi Kong, Wenhan Yang, Xudong Jiang

First: 2026-01-22T18:52:21+00:00 · Latest: 2026-01-22T18:52:21+00:00

Comments: Under review

Abstract

Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we indicate that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the defined Gaussian robustness score on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining on MLLMs. We demonstrate that the FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks indicate the effectiveness of the FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90% to about 1%.
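The idea of smoothing a feature encoder and then comparing clean and perturbed features by cosine similarity can be sketched as follows. This is only an illustration under stated assumptions: it uses the generic randomized-smoothing recipe (averaging encoder outputs over Gaussian input noise), which the abstract does not spell out, and `toy_encoder`, `smoothed_features`, and all numeric values are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_encoder(x):
    """Hypothetical stand-in for a feature encoder: L2-normalize the input."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def smoothed_features(encoder, x, sigma, n_samples, rng):
    """Monte Carlo estimate of the mean feature under Gaussian input noise."""
    acc = [0.0] * len(encoder(x))
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        acc = [a + f for a, f in zip(acc, encoder(noisy))]
    return [a / n_samples for a in acc]


clean = [1.0, 2.0, 3.0]
delta = [0.05, -0.05, 0.0]  # small l2-bounded perturbation (assumed values)
adversarial = [c + d for c, d in zip(clean, delta)]

f_clean = smoothed_features(toy_encoder, clean, 0.25, 2000, random.Random(0))
f_adv = smoothed_features(toy_encoder, adversarial, 0.25, 2000, random.Random(0))
similarity = cosine_similarity(f_clean, f_adv)
print(f"cosine similarity between smoothed features: {similarity:.4f}")
```

The certified bound in the paper is a theoretical guarantee; the sketch only measures the empirical similarity for one perturbation.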

中文标题/摘要

标题：多模态大型语言模型特征空间平滑的可验证鲁棒性

多模态大型语言模型（MLLMs）在多种应用中表现出强大的能力，但仍然容易受到对抗性扰动的影响，这些扰动会扭曲其特征表示并导致错误预测。为了解决这一脆弱性，我们提出了特征空间平滑（FS）方法，并从理论上证明了FS能够为MLLMs的特征表示提供认证鲁棒性。具体而言，FS将任何特征编码器转换为一种平滑变体，该变体在$\ell_2$有界攻击下能够保证干净表示与对抗表示之间的特征余弦相似度具有经认证的下界。此外，我们指出，由FS导出的特征余弦相似度界（FCSB）的值可以通过提高原始编码器上定义的高斯鲁棒性得分来改善。在此基础上，我们引入了净化器和平滑映射器（PSM），这是一种即插即用模块，可以提高MLLMs的高斯鲁棒性得分，从而在无需重新训练MLLMs的情况下增强其在FS下的认证鲁棒性。我们证明，带有PSM的FS不仅提供了强大的理论鲁棒性保证，而且相比对抗训练表现出更优的实证性能。在多种MLLMs和下游任务上的大量实验表明了FS-PSM的有效性，它将各种白盒攻击的攻击成功率（ASR）从近90%降低到约1%。

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Authors: Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han

First: 2025-12-01T18:59:45+00:00 · Latest: 2026-01-22T18:49:14+00:00

Comments: 10 pages, 4 figures

Abstract

As large language models have grown larger, interest has grown in low-precision numerical formats such as NVFP4 as a way to improve speed and reduce memory usage. However, quantizing models to NVFP4 remains difficult as the lack of precision generally degrades model performance. In this work, we address this issue with Four Over Six (4/6), a modification to the block-scaled NVFP4 quantization algorithm that yields reduced quantization error. Unlike integer formats, floating point formats have non-uniform step sizes which create larger quantization error on larger values. 4/6 takes advantage of this by adaptively scaling some blocks to smaller FP4 values, making the distribution of representable values more uniform and reducing quantization error for near-maximal values. We show that 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, resulting in performance gains during both pre-training and inference with minimal computational overhead. In pre-training experiments with the Nemotron 3 Nano 30B-A3B model architecture, we find that 4/6 brings training loss closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. Our code is available at http://github.com/mit-han-lab/fouroversix.
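The adaptive block scaling idea can be illustrated with a toy sketch. It assumes the standard FP4 (E2M1) magnitude grid of 0, 0.5, 1, 1.5, 2, 3, 4, 6, picks per block between mapping the block maximum to 6 or to 4 by comparing mean squared error, and ignores the quantization of the block scale itself. The selection rule and every function name here are assumptions for illustration, not the released implementation.

```python
import math

# Non-negative magnitudes representable in FP4 (E2M1).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, target_max):
    """Scale so max|x| maps to target_max, round to the FP4 grid, rescale back."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / target_max
    out = []
    for x in block:
        magnitude = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(magnitude * scale, x))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def four_over_six(block):
    """Keep whichever of the two scalings reconstructs the block better."""
    q6 = quantize_block(block, 6.0)  # usual choice: block max maps to 6
    q4 = quantize_block(block, 4.0)  # alternative: block max maps to 4
    if mean_squared_error(block, q4) < mean_squared_error(block, q6):
        return q4
    return q6


print(four_over_six([6.0, 5.0]))  # a near-maximal value sits in the 4-to-6 gap
print(four_over_six([6.0, 0.5]))  # small values are already on the grid
```

In the first block, scaling to 6 leaves 5.0 a full unit away from the nearest representable value, while scaling to 4 places representable values at multiples of 1.5 times the grid, so 5.0 lands within 0.5 of one. That is the non-uniform step-size effect the abstract describes.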

中文标题/摘要

标题：Four Over Six：采用自适应块缩放的更准确NVFP4量化

随着大型语言模型变得越来越大，人们越来越关注低精度数值格式（如NVFP4），以提高速度并减少内存使用。然而，将模型量化为NVFP4仍然很困难，因为精度不足通常会降低模型性能。在本文中，我们通过Four Over Six（4/6）解决了这一问题，4/6是对块缩放NVFP4量化算法的修改，可以减少量化误差。与整数格式不同，浮点格式具有非均匀的步长，这在较大值上会产生更大的量化误差。4/6利用这一点，自适应地将某些块缩放到较小的FP4值，使可表示值的分布更加均匀，从而减少接近最大值时的量化误差。我们展示了4/6可以在NVIDIA Blackwell GPU上高效实现，从而在预训练和推理过程中获得性能提升，同时计算开销极小。在使用Nemotron 3 Nano 30B-A3B模型架构的预训练实验中，我们发现与使用当前最先进的NVFP4训练方案训练的模型相比，4/6使训练损失更接近BF16。我们的代码可在http://github.com/mit-han-lab/fouroversix获取。

Summary / 总结

This paper addresses the challenge of quantizing large language models to NVFP4 by introducing Four Over Six (4/6), an adaptive block scaling method that reduces quantization error. The method leverages the non-uniform step sizes of floating point formats to scale some blocks to smaller FP4 values, making the distribution of representable values more uniform. Experiments show that 4/6 can be efficiently implemented on NVIDIA Blackwell GPUs, leading to performance gains in both pre-training and inference with minimal computational overhead. The 4/6 method brings training loss closer to BF16 compared to existing NVFP4 training recipes in pre-training experiments with the Nemotron 3 Nano 30B-A3B model architecture.

本文通过引入Four Over Six (4/6) 方法，解决将大型语言模型量化到NVFP4时遇到的挑战。4/6 方法利用浮点格式的非均匀步长，将某些块缩放到较小的FP4值，从而使表示值的分布更加均匀，从而减少接近最大值时的量化误差。实验表明，4/6 可以在NVIDIA Blackwell GPU上高效实现，在预训练和推理过程中都能带来性能提升，且计算开销较小。具体来说，在Nemotron 3 Nano 30B-A3B模型架构的预训练实验中，4/6 使训练损失更接近BF16。

360Anything: Geometry-Free Lifting of Images and Videos to 360°

Authors: Ziyi Wu, Daniel Watson, Andrea Tagliasacchi, David J. Fleet, Marcus A. Brubaker, Saurabh Saxena

First: 2026-01-22T18:45:59+00:00 · Latest: 2026-01-22T18:45:59+00:00

Comments: Project page: https://360anything.github.io/

Abstract

Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, obscuring the application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.
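The seam diagnosis above rests on the fact that an equirectangular panorama wraps around horizontally, so its left and right edges are neighbours. A minimal sketch of the difference between zero padding and wrap-around padding along the width axis is shown below. The abstract does not describe how Circular Latent Encoding is implemented, so this is an assumption about the general idea only, and both function names are hypothetical.

```python
def zero_pad_width(row, pad):
    """Pad a row with zeros on both sides, as a zero-padded convolution would."""
    return [0.0] * pad + list(row) + [0.0] * pad


def circular_pad_width(row, pad):
    """Pad a row with its own opposite edge, so longitude wraps around."""
    if pad < 0 or pad > len(row):
        raise ValueError("pad must be between 0 and len(row)")
    if pad == 0:
        return list(row)
    return list(row[-pad:]) + list(row) + list(row[:pad])


row = [1.0, 2.0, 3.0, 4.0]  # one row of an ERP image or latent (toy values)
print(zero_pad_width(row, 1))
print(circular_pad_width(row, 1))
```

With zero padding, the pixels at both borders see an artificial dark neighbour; with wrap-around padding, each border sees the content that is physically adjacent to it on the sphere.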

中文标题/摘要

标题：360Anything：无需几何的图像和视频到360°提升

将视角图像和视频提升为360°全景图可以实现沉浸式的3D世界生成。现有方法通常依赖于视角和等效圆柱投影（ERP）空间之间的显式几何对齐。然而，这需要已知的相机元数据，这在野外数据中通常是缺失或噪声较大的。我们提出了360Anything，一个基于预训练扩散变换器的几何无关框架。通过将视角输入和全景目标简单地视为标记序列，360Anything以完全数据驱动的方式学习视角到等效圆柱投影的映射，消除了对相机信息的需求。我们的方法在图像和视频视角到360°生成方面均达到了最先进的性能，优于使用真实相机信息的先前工作。我们还追踪了ERP边界处接缝伪影的根本原因，归因于VAE编码器中的零填充，并引入了循环潜编码以促进无缝生成。最后，我们在零样本相机视场和方向估计基准测试中展示了竞争力的结果，证明了360Anything在计算机视觉任务中的深刻几何理解和更广泛的应用。更多结果请参见https://360anything.github.io/

Summary / 总结

360Anything is a geometry-free framework that uses pre-trained diffusion transformers to lift perspective images and videos to 360° panoramas. It learns the mapping between perspective and equirectangular projection without requiring camera metadata, making it suitable for in-the-wild data. The approach outperforms previous methods that rely on ground-truth camera information and introduces Circular Latent Encoding to address seam artifacts, achieving state-of-the-art performance in both image and video generation. Additionally, it shows strong performance in zero-shot camera field of view and orientation estimation benchmarks, indicating its deep geometric understanding and broader utility in computer vision tasks.

360Anything 是一个无需几何信息的框架，利用预训练的扩散变换器将视角图像和视频提升为360°全景图。它无需相机元数据即可学习视角到等效圆柱投影的映射，适用于野外数据。该方法优于依赖真实相机信息的先前方法，并引入了循环潜编码以消除接缝伪影，在图像和视频生成上均达到了最先进的性能。此外，它在零样本相机视场和方向估计基准测试中也表现出竞争力，表明其在计算机视觉任务中的深度几何理解和更广泛的应用价值。

Paramanu: Compact and Competitive Monolingual Language Models for Low-Resource Morphologically Rich Indian Languages

Authors: Mitodru Niyogi, Eric Gaussier, Arnab Bhattacharya

First: 2024-01-31T17:58:10+00:00 · Latest: 2026-01-22T18:28:42+00:00

Abstract

Multilingual large language models (LLMs) are expensive to pretrain and often suffer from imbalances across languages and datasets, English-centric bias, tokenizer oversegmentation for morphologically rich low-resource languages, and the curse of multilinguality. We introduce PARAMANU, the first family of Indian-only autoregressive language models trained from scratch on open-source language-specific data for the five most spoken Indian languages: Bengali, Hindi, Marathi, Tamil, and Telugu. All models are designed for affordability and are trained on a single GPU with a budget under $1,000, allowing under-resourced researchers to build competitive language models. To address low-resource challenges, we develop morphology-aligned, low-fertility tokenizers and propose an interpolation-based method for token position indices in RoPE-based scaling to train longer sequences efficiently. We also create instruction-tuning datasets in Bangla that are translated to the other four languages. Despite their small size (108M-367M parameters), Paramanu models achieve a strong performance-efficiency tradeoff and outperform most larger multilingual models across all five languages. Our collection is available at https://huggingface.co/collections/mitodru/paramanu.
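To make the idea of interpolating token position indices for RoPE concrete, here is a sketch of generic linear position interpolation: positions in a longer target sequence are rescaled into the position range seen at the shorter length before the rotary angles are computed. The abstract gives no formula, so this is a well-known stand-in rather than the paper's method, and the function names and lengths are hypothetical.

```python
def interpolate_position(position, trained_len, target_len):
    """Map a position in a target_len sequence into the trained position range."""
    if target_len <= trained_len:
        return float(position)  # no rescaling needed for short sequences
    return position * trained_len / target_len


def rope_angles(position, dim, base=10000.0):
    """Rotary angles for one position: position * base^(-2i/dim) per channel pair."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


trained_len, target_len, head_dim = 1024, 2048, 8  # assumed toy values
last = target_len - 1
scaled = interpolate_position(last, trained_len, target_len)
print(f"raw position {last} is treated as position {scaled}")
print(rope_angles(scaled, head_dim))
```

Because the rescaled index never exceeds the trained range, the rotary angles stay within the values the model has already seen.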

中文标题/摘要

标题：Paramanu：面向低资源、形态丰富的印度语言的紧凑且有竞争力的单语语言模型

多语言大型语言模型（LLMs）的预训练成本高昂，且常在语言和数据集之间存在不平衡，具有英语中心偏见，以及对形态丰富且低资源语言的分词过度分割问题，并且面临多语言性的诅咒。我们引入了PARAMANU，这是首个仅针对印度语族的自回归语言模型系列，从头开始在开源语言特定数据上训练，针对五种最常用的印度语：孟加拉语、印地语、马拉地语、泰米尔语和泰卢固语。所有模型都设计为经济实惠，并在单个GPU上训练，预算低于1000美元，使资源不足的研究人员能够构建具有竞争力的语言模型。为应对低资源挑战，我们开发了形态对齐、低丰度的分词器，并提出了一种基于插值的方法来调整RoPE基于位置的缩放，以高效地训练更长的序列。我们还为孟加拉语创建了指令调优数据集，并将其翻译成其他四种语言。尽管参数量较小（1.08亿-3.67亿），Paramanu仍实现了性能与效率的良好权衡，并在所有五种语言中均优于大多数更大规模的多语言模型。我们的集合可在https://huggingface.co/collections/mitodru/paramanu 获取。

Summary / 总结

Paramanu is a family of Indian-only autoregressive language models trained on open-source language-specific data for five Indian languages: Bengali, Hindi, Marathi, Tamil, and Telugu. These models are designed to be affordable, requiring only a single GPU and a budget under $1,000. Paramanu addresses low-resource challenges through morphology-aligned tokenizers and an interpolation-based method for token position indices in RoPE scaling. Despite their small size (108M-367M parameters), Paramanu outperforms most larger multilingual models across all five languages, achieving a strong performance-efficiency tradeoff.

Paramanu 是为五种印度语言（孟加拉语、印地语、马拉地语、泰米尔语和泰卢固语）设计的一系列单语言语言模型。这些模型使用开源语言特定数据进行训练，并且成本低廉，仅需一个GPU和不到1000美元。Paramanu 通过使用形态学对齐的分词器和基于插值的方法来解决低资源挑战，该方法用于 RoPE 缩放中的 token 位置索引。尽管参数量较小（108M-367M），但 Paramanu 在所有五种语言上均优于大多数大型多语言模型，实现了性能和效率的良好平衡。

Learning to Discover at Test Time

Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun

First: 2026-01-22T18:24:00+00:00 · Latest: 2026-01-22T18:24:00+00:00

Comments: Code: https://github.com/test-time-training/discover

Abstract

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

中文标题/摘要

标题：在测试时学习发现

我们如何使用AI在科学问题上发现新的前沿？先前的测试时缩放工作，如AlphaEvolve，通过提示冻结的LLM进行搜索。我们在测试时进行强化学习，因此LLM可以继续训练，但现在使用的是特定于测试问题的经验。这种持续学习的形式非常特殊，因为它的目标是产生一个极好的解，而不是平均意义上的许多较好的解，并且是要解决眼前这个问题，而不是泛化到其他问题。因此，我们的学习目标和搜索子程序被设计为优先考虑最有前途的解决方案。我们称这种方法为测试时训练以发现（TTT-Discover）。我们遵循先前的工作，专注于具有连续奖励的问题。我们报告了我们尝试的每个问题的结果，涵盖数学、GPU内核工程、算法设计和生物学。TTT-Discover在几乎所有问题上都设定了新的前沿：(i) 埃尔德什最小重叠问题和一个自相关不等式；(ii) GPUMode内核竞赛（比先前的最佳结果快至2倍）；(iii) 过去的AtCoder算法竞赛；和(iv) 单细胞分析中的去噪问题。我们的解决方案由专家或组织者审核。所有结果均使用开放模型OpenAI gpt-oss-120b取得，并可通过我们公开的代码复现，而此前的最佳结果需要封闭的前沿模型。我们的测试时训练运行使用Thinking Machines的Tinker API，每个问题的成本仅为几百美元。

Summary / 总结

The research aims to use AI to discover new state-of-the-art solutions for scientific problems by performing reinforcement learning at test time. The method, Test-Time Training to Discover (TTT-Discover), lets the LLM continue training on problem-specific experience and prioritizes the most promising solutions, since the goal is one great solution rather than many good ones on average. Across mathematics, GPU kernel engineering, algorithm design, and biology, TTT-Discover sets a new state of the art on almost all attempted problems, using the open model OpenAI gpt-oss-120b at a cost of a few hundred dollars per problem.

研究旨在通过在测试时进行强化学习来使用AI发现科学问题的新前沿解决方案，使LLM能够继续使用特定于测试问题的经验进行训练。该方法称为测试时训练以发现（TTT-Discover），优先考虑有前途的解决方案，并在数学、GPU内核工程、算法设计和生物学等多个领域设立了新的基准。结果使用的是开源模型OpenAI gpt-oss-120b，并且可以通过公开的代码进行复现，展示了成本效益和透明的结果，而之前的方法依赖于封闭的前沿模型。

Is this chart lying to me? Automating the detection of misleading visualizations

Authors: Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych

First: 2025-08-29T14:36:45+00:00 · Latest: 2026-01-22T18:23:24+00:00

Comments: Preprint under review. Code and data available at: https://github.com/UKPLab/arxiv2025-misviz

Abstract

Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also create Misviz-synth, a synthetic dataset of 57,665 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and image-axis classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.

中文标题/摘要

标题：这张图表在欺骗我吗？自动化误导性可视化检测

误导性可视化是社交媒体和网络上信息误导的强大驱动因素。通过违反图表设计原则，它们扭曲数据并引导读者得出不准确的结论。先前的研究表明，无论是人类还是多模态大型语言模型（MLLMs）都经常被这些可视化所欺骗。自动检测误导性可视化并识别它们违反的具体设计规则可以帮助保护读者并减少信息误导的传播。然而，由于缺乏大型、多样且公开可用的数据集，AI模型的训练和评估受到了限制。在本研究中，我们引入了Misviz，这是一个包含2,604个真实世界可视化并标注了12种误导类型的基准数据集。为了支持模型训练，我们还创建了Misviz-synth，这是一个基于真实数据表生成的57,665个可视化数据集，使用Matplotlib生成。我们使用最先进的MLLMs、基于规则的系统和图像轴分类器对两个数据集进行了全面评估。我们的结果表明，该任务仍然极具挑战性。我们发布了Misviz、Misviz-synth及其配套代码。

Summary / 总结

This paper addresses the issue of misleading visualizations that can spread misinformation. It introduces Misviz, a benchmark dataset of 2,604 real-world visualizations annotated with 12 types of misleaders, and Misviz-synth, a synthetic dataset of 57,665 visualizations. The authors evaluate state-of-the-art models, rule-based systems, and image-axis classifiers on these datasets and find that the task is still highly challenging. The work aims to help protect readers by automating the detection of misleading visualizations and identifying the specific design rules they violate.

研究旨在解决误导性可视化可能传播虚假信息的问题，引入了包含2,604个真实世界可视化并标注了12种误导类型的Misviz基准数据集，以及基于真实数据表生成的57,665个可视化实例的Misviz-synth合成数据集。研究使用最先进的多模态大型语言模型、基于规则的系统和图像轴分类器对这些数据集进行了全面评估，发现检测误导性可视化任务仍然具有挑战性。数据集和代码已公开发布。

Structured Hints for Sample-Efficient Lean Theorem Proving

Authors: Zachary Burton

First: 2026-01-22T18:16:46+00:00 · Latest: 2026-01-22T18:16:46+00:00

Comments: 9 pages, 1 figure

Abs · PDF · Code1 · Code2

Abstract

State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention -- a fixed prompt schedule over 15 common tactic skeletons -- on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.

中文标题/摘要

标题：面向样本高效Lean定理证明的结构化提示

当前最先进的神经定理证明器（如DeepSeek-Prover-V1.5）结合了大型语言模型和强化学习，通过复杂的训练取得了令人印象深刻的成果。我们提出的问题是：这些高度训练的模型在推理时是否仍然受益于简单的结构指导？我们在miniF2F基准测试上评估了一种轻量级干预措施：覆盖15种常见策略（tactic）骨架的固定提示调度表。这种简单的方法取得了21.7%的pass@16，而同一模型的标准采样为15.2%，在样本数（k=16）和最大生成长度（1024个标记）相同的条件下相对提升43%。我们的结果表明，即使是能力较强的RL训练证明器也未能充分利用策略语言中可用的结构先验，而简单的推理时指导仍然是一种廉价的互补性提升。

Summary / 总结

The study investigates whether state-of-the-art neural theorem provers, despite being highly trained, still benefit from simple structural guidance during inference. Using a fixed prompt schedule over 15 common tactic skeletons, the approach reaches 21.7% pass@16 on miniF2F, against 15.2% for standard sampling from the same model: a 43% relative improvement with the same number of samples (k=16) and the same maximum generation length. This suggests that even RL-trained provers underuse structural priors and can be improved with minimal structural hints.

研究探讨了即使经过高度训练的神经定理证明器，在推理过程中是否仍能从简单的结构指导中受益。通过使用覆盖15种常见策略骨架的固定提示调度，该方法在miniF2F上取得了21.7%的pass@16，而同一模型的标准采样为15.2%，在样本数和生成长度相同的条件下相对提升43%。这表明即使是先进的模型也可以通过极少的结构提示进一步增强。
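The reported 43% figure follows directly from the two pass@16 numbers in the abstract. The sketch below checks that arithmetic and shows one possible way to spread k=16 samples over 15 skeletons. The round-robin assignment and the placeholder skeleton names are assumptions for illustration; the abstract does not say how the schedule is built or which skeletons it uses.

```python
from typing import List


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


def prompt_schedule(skeletons: List[str], k: int) -> List[str]:
    """Assign one skeleton per sample in fixed round-robin order (assumed policy)."""
    return [skeletons[i % len(skeletons)] for i in range(k)]


# Hypothetical placeholder names; the abstract does not list the 15 skeletons.
SKELETONS = [f"skeleton_{i}" for i in range(15)]

schedule = prompt_schedule(SKELETONS, k=16)
gain = relative_improvement(21.7, 15.2)
print(f"relative improvement: {gain:.1%}")  # about 42.8%, i.e. the reported 43%
```

Under this assumed round-robin policy, every skeleton is used once and the sixteenth sample reuses the first skeleton.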

Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes

Authors: Steven Kolawole, Lucio Dery, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar

First: 2024-02-08T04:48:26+00:00 · Latest: 2026-01-22T18:13:50+00:00

Comments: 19 pages, 6 figures, 16 tables

Abs · PDF · Code1 · Code2

Abstract

Structured pruning is a promising approach to create smaller, faster large language models. However, existing methods typically rely on computing the gradient via backward passes, which can inflate memory requirements and compute costs. In this work we introduce Bonsai, a gradient-free structured pruning method that eliminates the need for backpropagation, significantly reducing memory requirements and compute costs while achieving state-of-the-art pruning performance. Bonsai uses forward-pass-only perturbative pruning to enable efficient compression of large models on a broader range of hardware configurations. Unlike existing structured pruning approaches, Bonsai not only achieves better compression with fewer resources but also produces models that are twice as fast as those generated by semi-structured pruning. As a concrete demonstration, we use Bonsai to prune 7B and 8B models to 50% sparsity on a single A6000 GPU -- a task challenging for backprop-based methods in memory-constrained settings, as they require 2-3x the memory. Our results show that removing backprop as a requirement not only enables pruning larger models on constrained hardware but can also lead to state-of-the-art efficiency and performance.

中文标题/摘要

标题：大家一起来剪枝：仅用前向传播对LLM进行结构化剪枝

结构化剪枝是一种有前途的方法，可以创建更小、更快的大语言模型。然而，现有方法通常依赖于通过反向传播计算梯度，这会增加内存需求和计算成本。在本工作中，我们引入了Bonsai，这是一种无需梯度的结构化剪枝方法，消除了对反向传播的需要，显著减少了内存需求和计算成本，同时实现了最先进的剪枝性能。Bonsai使用仅前向传播的扰动剪枝，使大型模型能够在更广泛的硬件配置上高效压缩。与现有的结构化剪枝方法不同，Bonsai不仅以更少的资源实现了更好的压缩，其生成的模型速度还是半结构化剪枝所得模型的两倍。作为具体演示，我们使用Bonsai在单个A6000 GPU上将7B和8B模型剪枝到50%的稀疏度，这在内存受限的环境中对基于反向传播的方法来说颇具挑战，因为它们需要2-3倍的内存。我们的结果表明，去除对反向传播的依赖不仅使在受限硬件上剪枝更大的模型成为可能，还可以带来最先进的效率和性能。

Summary / 总结

This work introduces Bonsai, a gradient-free structured pruning method that uses forward-pass-only perturbative pruning to compress large language models efficiently. Unlike existing methods that require backpropagation, Bonsai significantly reduces memory and compute costs while achieving state-of-the-art pruning performance. It prunes 7B and 8B models to 50% sparsity on a single A6000 GPU, a setting where backprop-based methods need 2-3x the memory, and yields models twice as fast as those produced by semi-structured pruning.

该研究引入了Bonsai，一种无需反向传播的结构化剪枝方法，通过仅使用前向传播的扰动剪枝来降低大型语言模型剪枝的内存和计算成本。Bonsai实现了最先进的剪枝性能，其生成的模型速度是半结构化剪枝所得模型的两倍。它在单个A6000 GPU上将7B和8B模型剪枝到50%的稀疏度，展示了其在内存受限环境中的高效性，而基于反向传播的方法在这些环境中需要2-3倍的内存。
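To make "forward-pass-only perturbative pruning" concrete, here is a minimal toy sketch of one gradient-free way to rank prunable units: evaluate randomly pruned sub-models with forward passes only, then fit a linear regression from "unit dropped" indicators to the observed loss. The toy linear model, the mask sampling, and the regression step are all assumptions for illustration and are not taken from the Bonsai code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": the output is a weighted sum of 4 prunable units.
unit_weights = np.array([3.0, 0.1, 1.5, 0.5])
inputs = rng.standard_normal((2000, 4))
full_output = inputs @ unit_weights


def pruned_loss(keep_mask: np.ndarray) -> float:
    """Forward pass only: squared error of the pruned model against the full model."""
    pruned_output = inputs @ (unit_weights * keep_mask)
    return float(np.mean((full_output - pruned_output) ** 2))


# Sample random sub-models (1 = keep unit, 0 = drop unit) and record their loss.
keep_masks = rng.integers(0, 2, size=(200, 4)).astype(float)
losses = np.array([pruned_loss(mask) for mask in keep_masks])

# Regress loss on "unit dropped" indicators plus an intercept column.
dropped = 1.0 - keep_masks
design = np.hstack([dropped, np.ones((len(dropped), 1))])
coef, *_ = np.linalg.lstsq(design, losses, rcond=None)
importance = coef[:4]  # estimated loss increase caused by dropping each unit

ranking = np.argsort(importance)[::-1]  # most important unit first
print(ranking)
```

No gradients or activations are stored at any point, which is the property the abstract credits for the lower memory footprint; units with the smallest estimated importance would be the first candidates for removal.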

GRITHopper: Decomposition-Free Multi-Hop Dense Retrieval

Authors: Justus-Jonas Erker, Nils Reimers, Iryna Gurevych

First: 2025-03-10T16:42:48+00:00 · Latest: 2026-01-22T18:12:25+00:00

Comments: Accepted at EACL 2026 Main Conference

Abs · PDF · Code1 · Code2

Abstract

Decomposition-based multi-hop retrieval methods rely on many autoregressive steps to break down complex queries, which breaks end-to-end differentiability and is computationally expensive. Decomposition-free methods tackle this, but current decomposition-free approaches struggle with longer multi-hop problems and generalization to out-of-distribution data. To address these challenges, we introduce GRITHopper-7B, a novel multi-hop dense retrieval model that achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks. GRITHopper combines generative and representational instruction tuning by integrating causal language modeling with dense retrieval training. Through controlled studies, we find that incorporating additional context after the retrieval process, referred to as post-retrieval language modeling, enhances dense retrieval performance. By including elements such as final answers during training, the model learns to better contextualize and retrieve relevant information. GRITHopper-7B offers a robust, scalable, and generalizable solution for multi-hop dense retrieval, and we release it to the community for future research and applications requiring multi-hop reasoning and retrieval capabilities.

中文标题/摘要

标题：GRITHopper：无需分解的多跳密集检索

基于分解的多跳检索方法依赖于许多自回归步骤来分解复杂的查询，这破坏了端到端的可微性并导致计算成本高昂。无需分解的方法解决了这一问题，但当前的无需分解方法在处理较长的多跳问题和泛化到未见过的数据方面存在困难。为了解决这些挑战，我们引入了GRITHopper-7B，这是一种新型的多跳密集检索模型，它在分布内和分布外基准测试中均实现了最先进的性能。GRITHopper结合了生成性和表征性指令微调，通过将因果语言建模与密集检索训练相结合。通过受控研究，我们发现检索过程后的额外上下文建模，称为检索后语言建模，可以增强密集检索性能。通过在训练中包含最终答案等元素，模型学会了更好地上下文化和检索相关信息。GRITHopper-7B提供了一种稳健、可扩展且通用的多跳密集检索解决方案，并将其发布给社区，以供未来的研究和需要多跳推理和检索能力的应用使用。

Summary / 总结

GRITHopper is designed to address the limitations of decomposition-based multi-hop retrieval methods by introducing a decomposition-free approach. It combines generative and representational instruction tuning, integrating causal language modeling with dense retrieval training. Experimental results show that GRITHopper-7B outperforms existing methods on both in-distribution and out-of-distribution benchmarks, demonstrating its robustness and generalizability for multi-hop retrieval tasks.

GRITHopper 通过引入无分解的多跳检索方法来解决基于分解的多跳检索方法的限制。它结合了生成性和表示性指令微调，将因果语言建模与密集检索训练相结合。实验结果表明，GRITHopper-7B 在分布内和分布外基准测试中均表现出色，展示了其在多跳检索任务中的鲁棒性和泛化能力。

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Authors: Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, Jinwei Gu

First: 2026-01-22T18:09:30+00:00 · Latest: 2026-01-22T18:09:30+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/

中文标题/摘要

标题：Cosmos Policy：微调视频模型以实现视觉运动控制与规划

近期的视频生成模型展示了捕捉复杂物理交互和场景随时间演变的非凡能力。为了利用其时空先验，已有机器人研究将视频模型用于策略学习，但需要多阶段的后训练以及用于动作生成的新架构组件，从而引入了复杂性。在本工作中，我们提出Cosmos Policy，这是一种简单的方法：在目标平台上收集的机器人演示数据上进行单阶段后训练，即可将大型预训练视频模型（Cosmos-Predict2）适配为有效的机器人策略，且无需修改架构。Cosmos Policy学习直接生成机器人动作，这些动作被编码为视频模型潜在扩散过程中的潜在帧，从而利用模型的预训练先验和核心学习算法来捕捉复杂的动作分布。此外，Cosmos Policy还生成未来状态图像和价值（预期累积奖励），它们同样被编码为潜在帧，使得在测试时能够规划成功概率更高的动作轨迹。在我们的评估中，Cosmos Policy在LIBERO和RoboCasa仿真基准上达到了最先进的性能（平均成功率分别为98.5%和67.1%），并在具有挑战性的真实世界双臂操作任务中获得了最高的平均分数，优于从头训练的强扩散策略、基于视频模型的策略，以及在相同机器人演示上微调的最先进视觉-语言-动作模型。此外，给定策略rollout数据，Cosmos Policy可以从经验中学习以改进其世界模型和价值函数，并利用基于模型的规划在具有挑战性的任务中取得更高的成功率。我们在 https://research.nvidia.com/labs/dir/cosmos-policy/ 发布代码、模型和训练数据。

Summary / 总结

Cosmos Policy is a method for adapting a large pretrained video model into an effective robot policy through a single stage of post-training on robot demonstration data, without architectural modifications. It learns to generate robot actions and future state images as latent frames, leveraging the pretrained model's priors and learning algorithm. In evaluations, Cosmos Policy outperforms other approaches on simulation benchmarks and real-world bimanual manipulation tasks, achieving state-of-the-art success rates and the highest average score in challenging tasks.

Cosmos Policy 是一种方法，通过在机器人演示数据上进行单阶段后训练将大型预训练视频模型转换为有效的机器人策略，无需修改架构。它学习将机器人动作和未来状态图像作为潜在帧生成，利用预训练模型的先验知识和学习算法。Cosmos Policy 在仿真基准测试和真实世界的双臂操作任务中表现出色，实现了最先进的成功率，并能够通过基于模型的规划在具有挑战性的任务中实现更高的成功率。

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change

Authors: Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger

First: 2025-05-25T21:29:00+00:00 · Latest: 2026-01-22T18:06:39+00:00

Comments: 45 pages, 21 figures, under review

Abs · PDF · Code1 · Code2

Abstract

Ambivalence and hesitancy (A/H), two closely related constructs, are the primary reasons why individuals delay, avoid, or abandon health behaviour changes. A/H is a subtle and conflicting emotion that sets a person in a state between positive and negative orientations, or between acceptance and refusal to do something. It manifests as a discord in affect between multiple modalities or within a modality, such as facial and vocal expressions, and body language. Although experts can be trained to recognize A/H as done for in-person interactions, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital behaviour change interventions. However, no dataset currently exists for the design of machine learning models to recognize A/H. This paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset collected for multimodal recognition of A/H in videos. It contains 1,427 videos with a total duration of 10.60 hours captured from 300 participants across Canada answering predefined questions to elicit A/H. It is intended to mirror real-world online personalized behaviour change interventions. BAH is annotated by three experts to provide timestamps that indicate where A/H occurs, and frame- and video-level annotations with A/H cues. Video transcripts, cropped and aligned faces, and participants' meta-data are also provided. Since A and H manifest similarly in practice, we provide a binary annotation indicating the presence or absence of A/H. Additionally, this paper includes benchmarking results using baseline models on BAH for frame- and video-level recognition, zero-shot prediction, and personalization using source-free domain adaptation. The data, code, and pretrained weights are available.

中文标题/摘要

标题：BAH数据集：面向数字行为改变的视频矛盾/犹豫识别

矛盾和犹豫（A/H）是两个密切相关的概念，是个人推迟、回避或放弃健康行为改变的主要原因。这是一种微妙且矛盾的情绪，使人处于正向和负向态度之间，或接受与拒绝某事之间。它表现为情感在多种模态之间或同一模态内的不一致，如面部和语音表达以及肢体语言。尽管可以像面对面互动中那样训练专家来识别A/H，但将专家整合到数字健康干预措施中既昂贵又效果不佳。因此，自动识别A/H对于数字行为改变干预措施的个性化和成本效益至关重要。然而，目前尚无可用于设计A/H识别机器学习模型的数据集。本文介绍了为视频中多模态识别A/H而收集的Behavioural Ambivalence/Hesitancy (BAH)数据集。该数据集包含1,427个视频，总时长10.60小时，来自加拿大各地的300名参与者，他们回答预定义的问题以引发A/H。它旨在模拟现实世界的在线个性化行为改变干预措施。BAH由三位专家标注，提供A/H发生位置的时间戳，以及带有A/H线索的帧级和视频级标注。还提供了视频转录文本、裁剪并对齐的人脸以及参与者元数据。由于A和H在实践中表现相似，我们提供了表示A/H存在与否的二元标注。此外，本文还给出了基线模型在BAH上的基准结果，涵盖帧级和视频级识别、零样本预测，以及使用无源域适应（source-free domain adaptation）进行的个性化。数据、代码和预训练权重均已公开。

Summary / 总结

The paper introduces the BAH dataset for recognizing ambivalence and hesitancy (A/H) in videos, which is crucial for personalizing digital health interventions. The dataset includes 1,427 videos (10.60 hours in total) from 300 participants across Canada answering questions designed to elicit A/H, annotated by three experts with timestamps and A/H cues at the frame and video levels. The paper also reports benchmarking results for baseline models on frame- and video-level recognition, zero-shot prediction, and personalization using source-free domain adaptation.

该论文介绍了用于识别视频中矛盾和犹豫（A/H）的BAH数据集，这对于个性化数字健康干预至关重要。数据集包含来自加拿大各地300名参与者的1,427个视频（总时长10.60小时），参与者回答旨在引发A/H的问题，并由三位专家标注A/H的发生时间和线索。论文还给出了基线模型在帧级和视频级识别、零样本预测以及使用无源域适应进行个性化方面的基准结果。

HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval

Authors: Zequn Xie, Xin Liu, Boyun Zhang, Yuxiao Lin, Sihang Cai, Tao Jin

Venue: ICASSP 2026

First: 2026-01-22T17:57:42+00:00 · Latest: 2026-01-22T17:57:42+00:00

Comments: Accepted by ICASSP 2026

Abs · PDF · Code1 · Code2

Abstract

The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.

中文标题/摘要

标题：HVD：基于人类视觉的视频表示学习方法在文本-视频检索中的应用

CLIP的成功推动了文本-视频检索领域的显著进步。然而，当前的方法往往受到“盲视”特征交互的困扰，模型难以从背景噪声中辨识出关键的视觉信息，这主要是由于文本查询的稀疏性。为了解决这一问题，我们借鉴了人类的认知行为，提出了基于人类视觉驱动（HVD）模型。我们的框架建立了一种从粗到细的对齐机制，包括两个关键组件：帧特征选择模块（FFSM）和补丁特征压缩模块（PFCM）。FFSM通过选择关键帧来模拟人类的宏观感知能力，从而消除时间冗余。随后，PFCM通过先进的注意力机制聚合补丁特征，形成显著的视觉实体，实现精确的实体级匹配。在五个基准上的广泛实验表明，HVD不仅捕捉到了类似人类的视觉焦点，还实现了最先进的性能。

Summary / 总结

The research aims to improve text-video retrieval by addressing the issue of 'blind' feature interaction where models struggle to distinguish key visual information from background noise. The Human Vision-Driven (HVD) model is proposed, which includes a Frame Features Selection Module (FFSM) and a Patch Features Compression Module (PFCM). FFSM selects key frames to reduce temporal redundancy, while PFCM aggregates patch features into salient visual entities using an advanced attention mechanism for precise entity-level matching. Experiments on five benchmarks show that HVD captures human-like visual focus and achieves state-of-the-art performance.

论文旨在通过解决模型难以区分关键视觉信息和背景噪声的问题，提高文本-视频检索的效果。提出了Human Vision-Driven (HVD)模型，包含Frame Features Selection Module (FFSM)和Patch Features Compression Module (PFCM)。FFSM通过选择关键帧减少时间冗余，而PFCM使用高级注意力机制聚合补丁特征形成显著的视觉实体。实验表明，HVD能够捕捉人类的视觉焦点，并在五个基准上达到了最先进的性能。

Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization

Authors: Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Soiledis, Konstantinos-Theodoros Tsamis, Vassilis Katsouros, Emilios Cambouropoulos

First: 2026-01-22T17:46:31+00:00 · Latest: 2026-01-22T17:46:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single-encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single-encoder harmonization.

中文标题/摘要

标题：（交叉）关注旋律：单编码器旋律和声化中的课程掩码

旋律和声化，即为给定旋律生成和声伴奏的任务，在计算音乐生成中仍然是一个核心挑战。最近的单编码器Transformer方法将和声化视为一个掩码序列建模问题，但现有的受离散扩散启发的训练课程往往导致旋律与和声之间的（交叉）注意力较弱。这导致对旋律线索的利用有限，尤其是在领域外的情境中。在本文中，我们引入了一种训练课程FF（全到全），该课程在若干训练步内保持所有和声标记被掩码，然后在训练过程中逐步解除整个序列的掩码，以加强旋律与和声之间的交互。我们从多个实验维度系统地将该方法与先前的课程进行比较，包括时间量化（四分音符 vs. 十六分音符）、小节级 vs. 拍号条件、旋律表示（全音域 vs. 音级）以及推理时的解掩码策略。模型在HookTheory数据集上进行训练，并在领域内和精心挑选的爵士标准曲集上进行评估，使用一套全面的指标来评估和弦进行结构、和声-旋律对齐和节奏连贯性。结果表明，提出的FF课程在几乎所有指标上都持续优于基线，在领域外评估中的提升尤为明显，因为此时和声对新旋律线索的适应能力至关重要。我们还发现，四分音符量化、小节标记的交织以及音级旋律表示在FF设置中是有利的。我们的研究结果强调了训练课程对于实现有效旋律条件化的重要性，并表明全到全的解掩码是一种稳健的单编码器和声化策略。

Summary / 总结

This paper addresses the challenge of melodic harmonization by proposing a new training curriculum called FF, which keeps harmony tokens masked for several steps before unmasking entire sequences. This approach enhances the interaction between melody and harmony, leading to better performance across various metrics, especially in out-of-domain contexts. The study evaluates the FF curriculum against existing methods on the HookTheory dataset and jazz standards, showing consistent improvements and highlighting the benefits of quarter-note quantization and pitch-class melody representations.

本文通过引入新的训练课程FF，该课程在训练初期保持和声令牌被遮盖，之后逐步解遮盖整个序列，以增强旋律与和声的互动。研究在多种实验条件下评估了这种方法，并发现它在现有方法中表现更优，尤其是在处理新旋律序列时表现出更强的适应性。训练使用FF课程的模型在和弦进行结构、和声与旋律对齐以及节奏连贯性等方面表现出更好的效果。

ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Authors: Remy Sabathier, David Novotny, Niloy J. Mitra, Tom Monnier

First: 2026-01-22T17:41:13+00:00 · Latest: 2026-01-22T17:41:13+00:00

Abs · PDF · Code1 · Code2

Abstract

Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.

中文标题/摘要

标题：ActionMesh：基于时间3D扩散的动画3D网格生成

生成动画3D对象是许多应用的核心，但大多数先进的工作由于其有限的设置、长时间的运行或有限的质量，通常难以在实践中应用。我们介绍了ActionMesh，这是一种生成模型，能够以前馈方式预测“在行动”中的生产级3D网格。受到早期视频模型的启发，我们的关键见解是修改现有的3D扩散模型，加入时间轴，从而形成我们称之为“时间3D扩散”的框架。具体来说，我们首先将3D扩散阶段适应为生成表示时间变化和独立3D形状的同步潜在变量序列。其次，我们设计了一个时间3D自编码器，将一系列独立形状转换为预定义参考形状的相应变形，使我们能够构建动画。结合这两个组件，ActionMesh可以从单目视频、文本描述甚至带有动画描述的3D网格等不同输入生成动画3D网格。此外，与以前的方法相比，我们的方法速度快，生成的结果无骨架且拓扑一致，因此能够实现快速迭代和无缝应用，如纹理化和目标变换。我们在标准视频到4D基准（Consistent4D，Objaverse）上评估了我们的模型，并在几何准确性和时间一致性方面报告了最先进的性能，证明了我们的模型能够以前所未有的速度和质量生成动画3D网格。

Summary / 总结

ActionMesh is a generative model that predicts animated 3D meshes in a feed-forward manner by incorporating a temporal axis into existing 3D diffusion models. It first generates synchronized latents for time-varying 3D shapes and then uses a temporal 3D autoencoder to deform a reference shape into the corresponding animated mesh. This method allows for rapid generation of animated 3D meshes from various inputs, such as videos, text descriptions, or 3D meshes, and achieves state-of-the-art performance in geometric accuracy and temporal consistency on standard benchmarks.

ActionMesh 是一种生成模型，通过将时间轴引入现有的 3D 扩散模型中，以前馈方式预测动画 3D 网格。它生成表示时间变化形状的同步潜在变量，并使用 3D 时序自编码器将参考形状变形为相应的动画。该模型可以从单目视频、文本描述或带有文本提示的 3D 网格等多种输入生成动画。ActionMesh 快速且生成的网格无骨架约束、拓扑一致，实现了在标准基准测试中几何准确性和时间一致性方面的最新性能。

Beat-ssl: Capturing Local ECG Morphology through Heartbeat-level Contrastive Learning with Soft Targets

Authors: Muhammad Ilham Rizqyawan, Peter Macfarlane, Stathis Hadjidemetriou, Fani Deligianni

Venue: ISBI 2026

First: 2026-01-22T17:40:23+00:00 · Latest: 2026-01-22T17:40:23+00:00

Comments: Accepted at ISBI 2026

Abs · PDF · Code1 · Code2

Abstract

Obtaining labelled ECG data for developing supervised models is challenging. Contrastive learning (CL) has emerged as a promising pretraining approach that enables effective transfer learning with limited labelled data. However, existing CL frameworks either focus solely on global context or fail to exploit ECG-specific characteristics. Furthermore, these methods rely on hard contrastive targets, which may not adequately capture the continuous nature of feature similarity in ECG signals. In this paper, we propose Beat-SSL, a contrastive learning framework that performs dual-context learning through both rhythm-level and heartbeat-level contrasting with soft targets. We evaluated our pretrained model on two downstream tasks: 1) multilabel classification for global rhythm assessment, and 2) ECG segmentation to assess its capacity to learn representations across both contexts. We conducted an ablation study and compared the best configuration with three other methods, including one ECG foundation model. Despite the foundation model's broader pretraining, Beat-SSL reached 93% of its performance in multilabel classification task and surpassed all other methods in the segmentation task by 4%.

中文标题/摘要

标题：Beat-ssl：通过心跳级对比学习软目标捕获心电图局部形态

获取带有标签的心电图数据以开发监督模型具有挑战性。对比学习（CL）已成为一种有前景的预训练方法，能够有效利用有限的标签数据进行迁移学习。然而，现有的CL框架要么仅关注全局上下文，要么未能利用心电图的特定特征。此外，这些方法依赖于硬对比目标，这可能无法充分捕捉心电图信号中特征相似性的连续性。在本文中，我们提出了一种名为Beat-SSL的对比学习框架，该框架通过心跳级和节律级对比学习并使用软目标进行双重上下文学习。我们对预训练模型进行了两项下游任务的评估：1）全局节律评估的多标签分类，2）心电图分割以评估其在两种上下文中的表示学习能力。我们进行了消融研究，并将最佳配置与三种其他方法进行了比较，包括一种心电图基础模型。尽管基础模型的预训练范围更广，但Beat-SSL在多标签分类任务中的性能达到了基础模型的93%，并且在分割任务中超越了所有其他方法4%。

Summary / 总结

The research aims to address the challenge of obtaining labeled ECG data for supervised models by proposing Beat-SSL, a contrastive learning framework that performs dual-context learning through rhythm-level and heartbeat-level contrasting with soft targets. The model was evaluated on two tasks: multilabel classification for global rhythm assessment and ECG segmentation. Beat-SSL achieved 93% of the performance of a foundation model in multilabel classification and outperformed other methods by 4% in segmentation.

论文提出了Beat-SSL框架，该框架通过节奏级和心搏级对比学习以及软目标来解决标注ECG数据获取的难题。该模型在两个任务上进行了评估：全局节律分类和ECG分割。Beat-SSL在分类任务上达到了基础模型性能的93%，并在分割任务上比其他方法高出4%。
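The difference between hard and soft contrastive targets can be shown in a few lines. The sketch below computes a cross-entropy contrastive loss where the target for each anchor is a probability distribution over candidates, so a one-hot target recovers the usual hard (InfoNCE-style) loss. How Beat-SSL actually derives its soft targets is not stated in the abstract; the fixed 0.7/0.1 target matrix, the temperature, and the random embeddings here are hypothetical stand-ins.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def contrastive_loss(anchors, candidates, targets, temperature=0.1) -> float:
    """Cross-entropy between target rows and the softmax over cosine similarities."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = a @ c.T / temperature
    return float(-(targets * log_softmax(logits)).sum(axis=1).mean())


rng = np.random.default_rng(0)
anchors = rng.standard_normal((4, 8))                       # e.g. heartbeat embeddings
candidates = anchors + 0.1 * rng.standard_normal((4, 8))    # slightly perturbed views

hard_targets = np.eye(4)  # only the matching pair counts as positive
# Hypothetical soft targets: most mass on the matching pair, the rest spread evenly.
soft_targets = 0.7 * np.eye(4) + 0.1 * (1.0 - np.eye(4))

hard_loss = contrastive_loss(anchors, candidates, hard_targets)
soft_loss = contrastive_loss(anchors, candidates, soft_targets)
print(hard_loss, soft_loss)
```

With soft targets, non-matching pairs that are genuinely similar are no longer pushed apart as strongly as unrelated ones, which is the motivation the abstract gives for moving away from hard targets.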

Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data

Authors: Paul Quinlan, Qingguo Li, Xiaodan Zhu

First: 2025-03-13T21:05:11+00:00 · Latest: 2026-01-22T17:37:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models are being rapidly deployed across many fields such as healthcare, finance, transportation, and energy, where time-series data are fundamental components. The current works are still limited in their ability to perform reasoning that involves both time-series and the corresponding textual content. We address this gap by introducing Chat-TS, a large language model (LLM) based framework designed to support reasoning over time series and textual data. Unlike traditional models, Chat-TS integrates time-series tokens into LLMs' vocabulary, enhancing its reasoning ability over both modalities without compromising core natural language capabilities. To support learning and evaluation, we contribute new datasets: the TS Instruct Training Dataset (pairing diverse time-series data with relevant text instructions and responses for instruction tuning), the TS Instruct Question and Answer (QA) Gold Dataset (multiple-choice questions to evaluate multimodal reasoning), and a TS Instruct Quantitative Probing Set (a small subset of TS Instruct QA reasoning tasks alongside math and decision-making questions for LLM evaluation). We design a training strategy to preserve the inherent reasoning capabilities of LLMs while augmenting them for time-series reasoning. Experiments show that Chat-TS achieves state-of-the-art performance in multimodal reasoning tasks by maintaining strong natural language proficiency while improving time-series reasoning.

中文标题/摘要

标题：Chat-TS：增强对时间序列和自然语言数据的多模态推理

大型语言模型正在被迅速部署到医疗保健、金融、交通和能源等多个领域，其中时间序列数据是基本组成部分。当前的工作仍然在处理涉及时间序列和相应文本内容的推理方面能力有限。我们通过引入Chat-TS，一种基于大型语言模型（LLM）的框架来解决这一差距，该框架旨在支持时间序列和文本数据的推理。与传统模型不同，Chat-TS 将时间序列标记整合到LLM的词汇表中，增强了其在两种模态上的推理能力，同时不牺牲核心自然语言能力。为了支持学习和评估，我们贡献了新的数据集：TS Instruct 训练数据集（将多样化的时序数据与相关的文本指令和响应配对，用于指令调优），TS Instruct 问题和答案黄金数据集（多项选择题，用于评估多模态推理），以及TS Instruct 定量探测集（TS Instruct QA推理任务的小型子集，以及数学和决策问题，用于LLM评估）。我们设计了一种训练策略，以保持LLM固有的推理能力，同时增强其时间序列推理能力。实验表明，Chat-TS 在多模态推理任务中达到了最先进的性能，同时保持了强大的自然语言能力并提高了时间序列推理能力。

Summary / 总结

The research aims to enhance the reasoning capabilities of large language models over both time-series and natural language data, which are crucial in fields like healthcare and finance. Chat-TS, a new framework, integrates time-series tokens into LLMs to support reasoning over both modalities. Key findings show that Chat-TS outperforms existing models in multimodal reasoning tasks while maintaining strong natural language proficiency.

研究旨在增强大型语言模型在时间序列和自然语言数据上的推理能力，这对于医疗保健和金融等领域至关重要。Chat-TS 是一种新型框架，将时间序列标记集成到 LLM 中以支持两种模态的推理。实验结果表明，Chat-TS 在多模态推理任务中表现出色，同时保持了自然语言的专业能力。
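One simple way to picture "time-series tokens in the LLM vocabulary" is to discretize each value into a bin and map each bin to a new token id placed after the existing text vocabulary. The sketch below does exactly that with min-max scaling and uniform bins. The binning scheme, the function name, and the vocabulary size are assumptions for illustration; the abstract does not describe Chat-TS's actual tokenizer.

```python
from typing import List


def series_to_token_ids(values: List[float], n_bins: int, text_vocab_size: int) -> List[int]:
    """Map each value to a token id in [text_vocab_size, text_vocab_size + n_bins)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant series
    ids = []
    for v in values:
        scaled = (v - lo) / span                           # min-max scale to [0, 1]
        bin_index = min(int(scaled * n_bins), n_bins - 1)  # clamp the maximum into the last bin
        ids.append(text_vocab_size + bin_index)            # offset past the text vocabulary
    return ids


print(series_to_token_ids([0.0, 0.5, 1.0], n_bins=4, text_vocab_size=100))  # [100, 102, 103]
```

Because the new ids sit above the text vocabulary, existing text token ids are untouched, which fits the abstract's goal of adding time-series reasoning without compromising natural language ability.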

LLM Prompt Evaluation for Educational Applications

Authors: Langdon Holmes, Adam Coscia, Scott Crossley, Joon Suh Choi, Wesley Morris

First: 2026-01-22T17:31:25+00:00 · Latest: 2026-01-22T17:31:25+00:00

Abs · PDF · Code1 · Code2

Abstract

As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.

中文标题/摘要

标题：大型语言模型在教育应用中的提示评估

随着大型语言模型（LLMs）在教育应用中的日益普及，需要基于证据的方法来设计和评估LLM提示，以产生个性化和教育目标一致的输出。本研究提出了一种可推广的系统评估方法，通过结构化对话活动中的LLM生成的后续问题分析来展示。设计并测试了六种提示模板。这些模板结合了已有的提示工程模式，每个提示强调不同的教育策略。通过一种类似淘汰赛的评估框架来比较提示模板，该框架可以适应其他教育应用。该淘汰赛采用了Glicko2评分系统，八名评委在三个维度上评估问题对：格式、对话支持和对学习者的适宜性。数据来自120次真实的用户交互，分布在三个不同的教育部署中。结果显示，一个与策略性阅读相关的提示在一对一比较中胜出的概率从81%到100%不等。该提示结合了角色和上下文管理模式，旨在支持元认知学习策略，如自我导向学习。该方法展示了教育技术研究人员如何系统地评估和改进提示设计，从经验性的提示工程转向基于证据的提示开发，以应用于教育应用。

Summary / 总结

This study evaluates LLM prompts for educational applications by designing six templates that emphasize different pedagogical strategies. A tournament-style evaluation using the Glicko2 rating system with eight judges assessed the prompts across format, dialogue support, and learner appropriateness. The strategic reading prompt, which incorporated persona and context manager patterns, outperformed others with win probabilities ranging from 81% to 100% in pairwise comparisons, demonstrating its effectiveness in supporting metacognitive learning.

本研究旨在开发评估大型语言模型（LLM）提示在教育应用中的证据基础方法。设计并测试了六种提示模板，每种模板强调不同的教学策略。采用Glicko2评分系统进行赛制评估框架，八位评委从格式、对话支持和学习者适宜性三个维度评估问题。战略阅读提示结合了角色和上下文管理模式，在一对一比较中表现出色，胜率从81%到100%不等。这展示了教育技术研究人员如何系统地评估和改进提示设计，从经验性提示工程转向基于证据的提示开发。
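
The pairwise win probabilities quoted above come from a Glicko2 tournament. As a rough illustration of how such a probability relates to ratings, the sketch below implements the expected-score function from the public Glicko-2 description. The abstract does not say exactly how the reported 81% to 100% figures were computed, so this is only an assumed reading; the function names and the example ratings are hypothetical.

```python
import math

# Conversion constant between the Glicko rating scale and the internal Glicko-2 scale.
GLICKO2_SCALE = 173.7178


def g(phi: float) -> float:
    """Damping factor that shrinks the rating gap when the opponent's rating is uncertain."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi * phi / (math.pi ** 2))


def expected_score(rating_a: float, rating_b: float, rd_b: float) -> float:
    """Expected score of A against B, given B's rating deviation (RD)."""
    mu_a = (rating_a - 1500.0) / GLICKO2_SCALE
    mu_b = (rating_b - 1500.0) / GLICKO2_SCALE
    phi_b = rd_b / GLICKO2_SCALE
    return 1.0 / (1.0 + math.exp(-g(phi_b) * (mu_a - mu_b)))


if __name__ == "__main__":
    # Hypothetical ratings for two prompt templates (not taken from the paper).
    print(round(expected_score(1500.0, 1400.0, 30.0), 3))
```

A larger rating gap or a smaller opponent RD pushes the expected score further from 0.5, which is why a clearly dominant template can reach win probabilities close to 100%.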

ViSymRe: Vision-guided Multimodal Symbolic Regression

Authors: Da Li, Junping Yin, Jin Xu, Xinxin Li, Juan Zhang

First: 2024-12-15T10:05:31+00:00 · Latest: 2026-01-22T17:29:53+00:00

Abstract

Extracting a simple mathematical expression from an observational dataset to describe complex natural phenomena is one of the core objectives of artificial intelligence (AI). This field is known as symbolic regression (SR). Traditional SR models are based on genetic programming (GP) or reinforcement learning (RL) and face well-known challenges, such as low efficiency and overfitting. Recent studies have integrated SR with large language models (LLMs), enabling fast zero-shot inference by learning mappings from millions of dataset-expression pairs. However, since the input and output are inherently different modalities, such models often struggle to converge effectively. In this paper, we introduce ViSymRe, a vision-guided multimodal SR model that incorporates a third resource, the expression graph, to bridge the modality gap. Unlike traditional multimodal models, ViSymRe is trained to extract vision, termed virtual vision, from datasets, without relying on the global availability of expression graphs. This addresses the essential challenge of visual SR, namely that expression graphs are not available during inference. Evaluation results on multiple mainstream benchmarks show that ViSymRe achieves more competitive performance than the state-of-the-art dataset-only baselines. The expressions predicted by ViSymRe not only fit the dataset well but are also simple and structurally accurate, goals that SR models strive to achieve.

中文标题/摘要

标题：ViSymRe：视觉引导的多模态符号回归

从观测数据集中提取简单的数学表达式以描述复杂的自然现象是人工智能（AI）的核心目标之一。这一领域被称为符号回归（SR）。传统的SR模型基于遗传编程（GP）或强化学习（RL），面临着低效率和过拟合等众所周知的挑战。最近的研究将SR与大型语言模型（LLMs）结合，通过学习数百万数据集-表达式对之间的映射，实现了快速的零样本推理。然而，由于输入和输出是固有的不同模态，这些模型往往难以有效收敛。在本文中，我们介绍了ViSymRe，这是一种视觉引导的多模态SR模型，它结合了表达图这一资源来弥合模态差距。与传统的多模态模型不同，ViSymRe被训练从数据集中提取所谓的虚拟视觉，而无需依赖全局可用的表达图，这解决了视觉SR的基本挑战，即在推理过程中表达图不可用。在多个主流基准上的评估结果表明，ViSymRe在与数据集仅基线相比时，实现了更具有竞争力的性能。ViSymRe预测的表达式不仅很好地拟合了数据集，而且简单且结构准确，这是SR模型努力实现的目标。

Summary / 总结

The paper introduces ViSymRe, a vision-guided multimodal symbolic regression model that addresses the challenge of expressing complex natural phenomena from observational data. Unlike traditional models, ViSymRe incorporates an expression graph to bridge the modality gap between input and output. It trains the model to extract 'virtual vision' from datasets without requiring global expression graphs, which is crucial for visual symbolic regression. Experimental results on multiple benchmarks demonstrate that ViSymRe outperforms state-of-the-art dataset-only baselines, producing simple and structurally accurate expressions that fit the datasets well.

论文提出了ViSymRe，一种基于视觉的多模态符号回归模型，旨在从观测数据集中提取简单的数学表达式。与依赖全局表达图的先前模型不同，ViSymRe 通过直接从数据集中提取虚拟视觉来进行训练，从而实现快速零样本推理。实验结果表明，ViSymRe 在多个主流基准上优于最先进的数据集仅基线模型，提供的表达式不仅拟合数据集良好，而且简单且结构准确。

Replicating Human Motivated Reasoning Studies with LLMs

Authors: Neeley Pate, Adiba Mahbub Proma, Hangfeng He, James N. Druckman, Daniel Molden, Gourab Ghoshal, Ehsan Hoque

First: 2026-01-22T17:29:07+00:00 · Latest: 2026-01-22T17:29:07+00:00

Abstract

Motivated reasoning -- the idea that individuals processing information may be motivated to reach a certain conclusion, whether it be accurate or predetermined -- has been well-explored as a human phenomenon. However, it is unclear whether base LLMs mimic these motivational changes. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as smaller standard deviations and inaccurate argument strength assessments. We emphasize the importance of these findings for researchers using LLMs to automate tasks such as survey data collection and argument assessment.

中文标题/摘要

标题：使用大语言模型复制人类动机性推理研究

动机性推理——个体在处理信息时可能被动机驱使以达到某种结论，无论结论是否准确或预先确定——作为人类现象已经被广泛研究。然而，尚不清楚基础大语言模型是否会模仿这些动机性变化。通过复制4项先前的政治动机性推理研究，我们发现基础大语言模型的行为与预期的人类行为不一致。此外，不同模型的基础大语言模型行为在某些方面存在相似性，如较小的标准差和不准确的论点强度评估。我们强调这些发现对于使用大语言模型自动化如调查数据收集和论点评估等任务的研究人员的重要性。

Summary / 总结

This study investigates whether base large language models (LLMs) exhibit motivated reasoning, a human tendency to process information in a way that supports a desired conclusion. By replicating four previous studies on political motivated reasoning, the researchers found that base LLMs do not mimic expected human behavior. Instead, these models show smaller standard deviations and inaccurate assessments of argument strength, highlighting the need for caution when using LLMs for tasks like survey data collection and argument evaluation.

本研究探讨了基础语言模型（LLMs）是否表现出动机推理，即人类倾向于以支持其预设结论的方式处理信息的现象。通过复制四个关于政治动机推理的先前研究，研究人员发现，基础语言模型并未表现出预期的人类行为。相反，这些模型显示出较小的标准差和对论点强度评估不准确的特点，强调了在使用LLMs进行如调查数据收集和论点评估等任务时需要谨慎。

Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging

Authors: Alphaeus Dmonte, Vidhi Gupta, Daniel J Perry, Mark Arehart

First: 2026-01-22T17:28:24+00:00 · Latest: 2026-01-22T17:28:24+00:00

Abstract

Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets, confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.

中文标题/摘要

标题：通过语言特定模型合并提高训练效率并降低维护成本

针对特定任务的多语言大型语言模型（LLM）微调涉及使用包含所需所有语言示例的多语言数据集对模型进行训练。更新一个或多个支持的语言或添加对新语言的支持需要重新训练模型，这在计算上效率低下并形成严重的维护瓶颈。最近关于合并多语言多任务模型的研究显示出改进质量的前景，但其计算和维护效率尚未研究。在本工作中，我们首次从效率角度对这种合并策略进行了集中分析，评估了其在三个独立任务上的表现。我们展示了在保持质量一致性的前提下取得了显著的效率提升：这种合并方法将初始训练时间减少了最多50%。我们还展示了在模型维护过程中，更新个别语言并重新合并可以将训练成本降低超过60%，与重新训练整个多语言模型相比。我们在公共数据集和专有行业数据集上都进行了验证，证明该方法不仅适用于之前研究的学术场景，也适用于工业应用案例。

Summary / 总结

The research aims to improve the efficiency of training and reduce maintenance costs for multilingual large language models by merging language-specific models. The study evaluates the merging strategy across three tasks and finds that it reduces initial training time by up to 50% while maintaining quality. Additionally, updating an individual language and re-merging reduces training costs by more than 60% compared to re-training the full multilingual model. These results hold on both public and proprietary industry datasets.

该研究旨在通过提出模型合并策略来解决更新和维护多语言大型语言模型的效率问题。研究在三个任务上评估了这种方法，并发现它可以使初始训练时间减少高达50%，同时保持质量不变。此外，更新个别语言并重新合并模型的成本比重新训练整个多语言模型低60%以上，使其在学术和工业应用中都更为高效。

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Authors: Tingyu Song, Yanzhao Zhang, Mingxin Li, Zhuoning Guo, Dingkun Long, Pengjun Xie, Siyue Zhang, Yilun Zhao, Shu Wu

First: 2026-01-22T17:26:52+00:00 · Latest: 2026-01-22T17:26:52+00:00

Comments: Under review

Abstract

Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.

中文标题/摘要

标题：重新思考组合图像检索评估：来自图像编辑的细粒度基准

组合图像检索（CIR）是多模态理解中的一个关键且复杂的任务。当前的CIR基准通常包含有限的查询类别，无法捕捉到现实场景中的多样化需求。为了弥合这一评估差距，我们利用图像编辑实现对修改类型和内容的精确控制，从而构建了一个涵盖广泛类别的查询合成管道。利用该管道，我们构建了EDIR，这是一个新颖的细粒度CIR基准。EDIR包含5000个高质量的查询，分布在五个主要类别和十五个子类别中。我们对13种多模态嵌入模型的全面评估揭示了显著的能力差距；即使是最先进的模型（如RzenEmbed和GME）也无法在所有子类别中保持一致表现，突显了我们基准的严格性。通过对比分析，我们进一步揭示了现有基准的内在局限性，如模态偏差和类别覆盖不足。此外，一个领域内训练实验证明了我们基准的可行性。该实验通过区分可以用目标数据解决的类别和暴露当前模型架构固有限制的类别，阐明了任务挑战。

Summary / 总结

The paper addresses the limitations of current Composed Image Retrieval (CIR) benchmarks by introducing EDIR, a fine-grained benchmark created through image editing. Image editing allows precise control over modification types and content, yielding 5,000 high-quality queries across five main categories and fifteen subcategories. An evaluation of 13 multimodal embedding models on EDIR finds a significant capability gap: even state-of-the-art models such as RzenEmbed and GME fail to perform consistently across all subcategories. The research also highlights inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage, and an in-domain training experiment separates categories that are solvable with targeted data from those that expose intrinsic limitations of current model architectures.

论文通过引入基于图像编辑的细粒度基准EDIR，解决了当前Composed Image Retrieval (CIR)基准的局限性。这种方法允许对查询类别和内容进行精确控制，最终生成了5,000个高质量查询，覆盖五个主要类别和十五个子类别。对13种多模态嵌入模型在EDIR上的评估发现，即使是RzenEmbed和GME等最先进的模型，在所有子类别上也表现出显著的能力差距。研究还指出了现有基准的内在局限性，并通过领域内训练实验展示了使用EDIR来更好地理解任务挑战的可行性。

Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add

Authors: Zhengchi Ma, Anru R. Zhang

First: 2026-01-22T17:15:26+00:00 · Latest: 2026-01-22T17:15:26+00:00

Abstract

Imbalanced classification, where one class is observed far less frequently than the other, often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic examples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples and evaluated under the balanced population risk. Our theory shows that synthetic data is not always beneficial. In a "local symmetry" regime, imbalance is not the dominant source of error near the balanced optimum, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help (a "local asymmetry" regime), the optimal synthetic size depends on generator accuracy and on whether the generator's residual mismatch is directionally aligned with the intrinsic majority-minority shift. This structure can make the best synthetic size deviate from naive full balancing, sometimes by a small refinement and sometimes substantially when generator bias is systematic. Practically, we recommend Validation-Tuned Synthetic Size (VTSS): select the synthetic size by minimizing balanced validation loss over a range centered near the fully balanced baseline, while allowing meaningful departures when the data indicate them. Simulations and a real sepsis prediction study support the theory and illustrate when synthetic augmentation helps, when it cannot, and how to tune its quantity effectively.

中文标题/摘要

标题：不平衡学习中的合成增强：何时有益，何时有害，以及应添加多少

不平衡分类中，一个类别比另一个类别出现的频率低得多，这通常会导致标准训练程序优先处理多数类，而对稀有的但重要的情况表现不佳。经典的广泛使用的解决方法是通过合成样本增强少数类，但两个基本问题仍然没有解决：合成增强何时真正有益，以及应生成多少合成样本？我们为不平衡学习中的合成增强开发了一个统一的统计框架，研究在平衡人口风险下使用不平衡数据和合成少数类样本训练的模型。我们的理论表明，合成数据并不总是有益的。在“局部对称”状态下，不平衡不是接近平衡最优解附近的主要误差来源，因此添加合成样本不能提高学习速率，甚至可能通过放大生成器不匹配而降低性能。当增强可以提供帮助（“局部不对称”状态），最佳合成样本大小取决于生成器的准确性以及生成器的残差不匹配是否与固有的多数类-少数类转移方向一致。这种结构可以使最佳合成样本大小偏离简单的完全平衡，有时仅需细微调整，有时在生成器偏差系统时会显著不同。实践中，我们推荐验证调优合成样本大小（VTSS）：通过在接近完全平衡基线的范围内最小化平衡验证损失来选择合成样本大小，同时允许数据表明有意义的偏离。模拟和实际的脓毒症预测研究支持该理论，并说明了合成增强何时有效，何时无效，以及如何有效调整其数量。

Summary / 总结

The paper addresses the issue of synthetic augmentation in imbalanced learning, where the minority class is underrepresented. It develops a statistical framework to determine when synthetic augmentation helps and when it hurts, and how many synthetic samples should be generated. The study finds that in a 'local symmetry' regime, synthetic data can degrade performance, while in a 'local asymmetry' regime, the optimal synthetic size depends on the generator's accuracy and the direction of mismatch. The authors recommend Validation-Tuned Synthetic Size (VTSS) to effectively tune the synthetic augmentation quantity based on balanced validation loss.

论文研究了合成增强在不平衡学习中何时有效以及如何确定合成样本的最佳数量。它建立了一个统计框架，表明合成数据在‘局部对称’状态下可能会降低性能，但在‘局部不对称’状态下可以提高性能，其中最佳合成样本数量取决于生成器的准确性及其剩余不匹配的方向。研究推荐使用验证调优合成大小（VTSS）来有效调整合成增强的数量。
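
The VTSS recommendation described above amounts to a small search loop. The sketch below is one possible reading of it, not the authors' implementation: the candidate grid (its width and number of steps), the function names, and the toy loss in the usage example are all assumptions made for illustration. The caller is expected to supply `balanced_val_loss`, a function that trains with a given number of synthetic minority samples and returns the balanced validation loss.

```python
from typing import Callable, List, Tuple


def candidate_grid(n_majority: int, n_minority: int,
                   rel_width: float = 0.5, steps: int = 5) -> List[int]:
    """Candidate synthetic sizes centered on the fully balanced baseline.

    The fully balanced baseline adds (n_majority - n_minority) synthetic
    minority samples. rel_width and steps are illustrative knobs (assumed).
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    full = n_majority - n_minority
    low = max(0.0, full * (1.0 - rel_width))
    high = full * (1.0 + rel_width)
    grid = {round(low + i * (high - low) / (steps - 1)) for i in range(steps)}
    return sorted(grid)


def select_synthetic_size(candidates: List[int],
                          balanced_val_loss: Callable[[int], float]) -> Tuple[int, float]:
    """Return the candidate size with the lowest balanced validation loss."""
    if not candidates:
        raise ValueError("no candidate sizes given")
    scored = [(balanced_val_loss(size), size) for size in candidates]
    best_loss, best_size = min(scored)
    return best_size, best_loss


if __name__ == "__main__":
    sizes = candidate_grid(n_majority=1000, n_minority=100)
    # Toy stand-in for "train with m synthetic samples, then evaluate".
    toy_loss = lambda m: (m - 1125) ** 2
    print(select_synthetic_size(sizes, toy_loss))
```

In the toy run the selected size departs from the fully balanced baseline of 900, which mirrors the abstract's point that the best synthetic size need not coincide with naive full balancing.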

AudioMotionBench: Evaluating Auditory Motion Perception in Audio LLMs

Authors: Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang

First: 2025-11-17T11:45:41+00:00 · Latest: 2026-01-22T17:11:50+00:00

Abstract

Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current LALMs. To investigate this issue, we introduce AudioMotionBench, the first benchmark explicitly designed to evaluate auditory motion understanding. It provides a controlled question-answering setting that tests whether LALMs can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.

中文标题/摘要

标题：AudioMotionBench：评估音频LLMs的听觉运动感知

大型音频-语言模型（LALMs）在语音识别、音频描述和听觉问答方面最近取得了令人印象深刻的进展。然而，这些模型是否能够感知空间动态，特别是声源的运动，仍然不清楚。在本文中，我们揭示了当前LALMs在运动感知方面存在系统性的缺陷。为了研究这一问题，我们引入了AudioMotionBench，这是第一个明确设计用于评估听觉运动理解的基准。AudioMotionBench提供了一个受控的问答基准，旨在评估LALMs是否能够从双耳音频中推断出移动声源的方向和轨迹。全面的定量和定性分析表明，当前的模型在可靠地识别运动线索或区分方向模式方面存在困难。平均准确率低于50%，突显了听觉空间推理的基本局限性。我们的研究突显了人类和模型在听觉空间推理方面的根本差距，为未来音频-语言模型的空间认知增强提供了诊断工具和新的见解。

Summary / 总结

The research aims to evaluate the ability of Large Audio-Language Models (LALMs) to perceive spatial dynamics, particularly the motion of sound sources. To address this, the study introduces AudioMotionBench, a benchmark for assessing auditory motion understanding. The results show that current models have significant difficulties in recognizing motion cues and distinguishing directional patterns, with average accuracy below 50%. This indicates a fundamental limitation in auditory spatial reasoning for these models.

该研究通过引入AudioMotionBench基准，评估大型音频语言模型（LALMs）的听觉运动理解能力，该基准使用控制下的问答任务和双耳音频来评估模型识别移动声源方向和轨迹的能力。关键发现表明，当前模型表现不佳，平均准确率低于50%，表明在听觉空间推理方面存在显著局限性。

Enhanced Climbing Image Nudged Elastic Band method with Hessian Eigenmode Alignment

Authors: Rohit Goswami, Miha Gunde, Hannes Jónsson

First: 2026-01-19T00:21:52+00:00 · Latest: 2026-01-22T17:11:23+00:00

Comments: 25 pages, 11 figures

Abstract

Accurate determination of transition states is central to an understanding of reaction kinetics. Double-endpoint methods where both initial and final states are specified, such as the climbing image nudged elastic band (CI-NEB), identify the minimum energy path between the two and thereby the saddle point on the energy surface that is relevant for the given transition, thus providing an estimate of the transition state within the harmonic approximation of transition state theory. Such calculations can, however, incur high computational costs and may stagnate on exceptionally flat or rough energy surfaces. Conversely, methods that only require specification of an initial set of atomic coordinates, such as the minimum mode following (MMF) method, offer efficiency but can converge on saddle points that are not relevant for the transition of interest. Here, we present an adaptive hybrid algorithm that integrates the CI-NEB with the MMF method so as to get faster convergence to the relevant saddle point. The method is benchmarked for the Baker-Chan (BC) saddle point test set using the PET-MAD machine-learned potential as well as 59 transitions of a heptamer island on Pt(111) from the OptBench benchmark set. A Bayesian analysis of the performance shows a median reduction in energy and force calculations of 46% [95% CrI: -55%, -37%] relative to CI-NEB for the BC set, while a 28% reduction is found for the transitions of the heptamer island. These results establish this hybrid method as a highly effective tool for high-throughput automated chemical discovery of atomic rearrangements.

中文标题/摘要

标题：增强的攀爬图像拉伸带方法与Hessian特征模式对齐

准确确定过渡态是理解反应动力学的关键。双端点方法，如攀爬图像拉伸带（CI-NEB）方法，通过指定初始和最终状态来识别两者之间的最低能量路径，从而确定与给定过渡相关的鞍点，提供过渡态的谐振子近似估计。然而，此类计算可能会产生高昂的计算成本，并可能在异常平坦或粗糙的能量表面上停滞不前。相反，仅需指定一组原子坐标的方法，如最小模式跟随（MMF）方法，虽然效率更高，但可能会收敛到与所需过渡无关的鞍点。在此，我们提出了一种自适应混合算法，将CI-NEB方法与MMF方法结合，以更快地收敛到相关鞍点。该方法使用PET-MAD机器学习势能对Baker-Chan（BC）鞍点测试集进行了基准测试，并对Pt(111)上七聚岛的59个过渡进行了基准测试。贝叶斯分析表明，对于BC集，相对于CI-NEB，能量和力的计算中位数减少46% [95% CrI: -55%，-37%]，而对于七聚岛的过渡，减少28%。这些结果确立了该混合方法作为高效工具，用于高通量自动化原子重排的化学发现。

Summary / 总结

The research aims to enhance the accuracy and efficiency of determining transition states in chemical reactions, which are crucial for understanding reaction kinetics. The method combines the climbing image nudged elastic band (CI-NEB) with the minimum mode following (MMF) method to achieve faster convergence to the relevant saddle point. The hybrid algorithm is benchmarked on the Baker-Chan (BC) saddle point test set and 59 transitions of a heptamer island on Pt(111), showing a median reduction of 46% in energy and force calculations for the BC set and 28% for the heptamer transitions, demonstrating its effectiveness for high-throughput chemical discovery.

研究旨在提高化学反应中过渡态确定的准确性和效率。作者开发了一种结合爬升图像拉伸带（CI-NEB）和最小模式跟随（MMF）方法的自适应混合方法。该方法在Baker-Chan（BC）鞍点测试集和Pt(111)上的七聚岛59个转变中进行了基准测试，结果显示BC集中的能量和力计算减少了46%，七聚岛转变中的减少为28%。
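
For readers unfamiliar with CI-NEB, the sketch below shows the standard climbing-image force from the original CI-NEB formulation: the true force on the highest-energy image has its component along the path tangent inverted, so the image climbs along the path while relaxing perpendicular to it. This only illustrates the baseline method named in the abstract; it is not the hybrid CI-NEB/MMF algorithm of this paper, and the function name and example vectors are hypothetical.

```python
import math
from typing import List, Sequence


def climbing_image_force(true_force: Sequence[float],
                         tangent: Sequence[float]) -> List[float]:
    """Climbing-image force: F_ci = F - 2 (F . t) t, with t the unit path tangent.

    true_force is the negative energy gradient at the climbing image;
    tangent is the (not necessarily normalized) local path tangent.
    """
    if len(true_force) != len(tangent):
        raise ValueError("force and tangent must have the same length")
    norm = math.sqrt(sum(t * t for t in tangent))
    if norm == 0.0:
        raise ValueError("tangent must be non-zero")
    unit = [t / norm for t in tangent]
    parallel = sum(f * u for f, u in zip(true_force, unit))
    return [f - 2.0 * parallel * u for f, u in zip(true_force, unit)]


if __name__ == "__main__":
    # Two-dimensional toy example: the tangent points along the first axis.
    print(climbing_image_force([1.0, 2.0], [2.0, 0.0]))
```

Only the component along the tangent changes sign; the perpendicular component is left untouched, which is what drives the image toward the saddle point rather than toward a minimum.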

GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

Authors: Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi

First: 2025-05-24T15:57:07+00:00 · Latest: 2026-01-22T17:10:05+00:00

Comments: Accepted by NeurIPS2025

Abstract

Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies, which is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that enables invertibility via alternating updates, resolving log-likelihood computation barriers. Furthermore, we also use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.

中文标题/摘要

标题：GenPO：生成扩散模型与在线策略（on-policy）强化学习的结合

强化学习（RL）的最新进展展示了基于生成扩散的策略的强大探索能力和多模态性。虽然在离线RL和离策RL设置中取得了显著进展，但将扩散策略整合到像PPO这样的在线策略（on-policy）框架中仍然鲜有探索。鉴于大规模并行GPU加速模拟器（如IsaacLab）的广泛应用，这些模拟器针对在线策略RL算法进行了优化，能够快速训练复杂的机器人任务，这一差距尤为重要。一个关键挑战在于在扩散策略下计算状态-动作对数似然，对于高斯策略来说很简单，但对于基于流的模型来说，由于不可逆的正向-反向过程和离散化误差（例如欧拉-丸山（Euler-Maruyama）近似），这一计算难以处理。为了解决这一问题，我们提出了GenPO，这是一种利用精确扩散反演构建可逆动作映射的生成策略优化框架。GenPO引入了一种新颖的双虚拟动作机制，通过交替更新实现可逆性，解决了对数似然计算障碍。此外，我们还使用动作对数似然进行无偏熵和KL散度估计，使KL自适应学习率和熵正则化能够在在线策略更新中实现。在八个IsaacLab基准测试上的广泛实验，包括腿足运动（Ant、Humanoid、Anymal-D、Unitree H1、Go2）、灵巧操作（Shadow Hand）、空中控制（Quadcopter）和机器人臂任务（Franka），证明了GenPO优于现有RL基线。值得注意的是，GenPO是第一个成功将扩散策略整合到在线策略RL中的方法，为大规模并行化训练和实际机器人部署打开了大门。

Summary / 总结

GenPO is a generative policy optimization framework that integrates generative diffusion models into on-policy reinforcement learning (RL) frameworks like PPO. It addresses the challenge of computing state-action log-likelihoods for diffusion policies by proposing a novel doubled dummy action mechanism, enabling invertibility and exact diffusion inversion. Extensive experiments on various IsaacLab benchmarks show that GenPO outperforms existing RL baselines, making diffusion policies suitable for large-scale parallelized training and real-world robotic applications.

GenPO 是一种生成性策略优化框架，将生成扩散模型集成到在线策略强化学习（RL）中，以解决在扩散策略下计算似然度的问题。它使用精确的扩散反演和一种新颖的双虚拟动作机制来实现可逆性并解决不可解问题。在八个 IsaacLab 基准测试上的实验表明，GenPO 在性能上优于现有 RL 基线，并且是第一个将扩散策略成功集成到在线策略 RL 中的方法，从而实现了大规模并行化训练和实际机器人部署。
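
The abstract attributes invertibility to alternating updates on a doubled action. The sketch below illustrates only the general idea behind such schemes, using a generic additive coupling step on a pair of scalars. It is not GenPO's actual update rule: the `drift` function, the step size `h`, and both function names are hypothetical stand-ins. Because each half-step shifts one variable by a function of the other, the step can be undone exactly by reversing the order, and the Jacobian determinant of the pair map is 1.

```python
import math
from typing import Tuple


def drift(z: float) -> float:
    """Hypothetical stand-in for a learned network output."""
    return math.tanh(z)


def forward_step(x: float, y: float, h: float = 0.1) -> Tuple[float, float]:
    """Alternating update: move x using y, then move y using the new x."""
    x_new = x + h * drift(y)
    y_new = y + h * drift(x_new)
    return x_new, y_new


def inverse_step(x_new: float, y_new: float, h: float = 0.1) -> Tuple[float, float]:
    """Undo forward_step by reversing the two half-steps in opposite order."""
    y = y_new - h * drift(x_new)
    x = x_new - h * drift(y)
    return x, y


if __name__ == "__main__":
    x1, y1 = forward_step(0.3, -1.2)
    x0, y0 = inverse_step(x1, y1)
    # Round trip recovers the starting pair up to floating-point error.
    print(abs(x0 - 0.3) < 1e-12, abs(y0 + 1.2) < 1e-12)
```

An invertible map of this kind lets a log-likelihood be tracked through the change-of-variables formula, which is the property the abstract says is missing from ordinary discretized diffusion sampling.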