arXiv Paper Digest

CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback

Authors: Wenhang Ge, Guibao Shen, Jiawei Feng, Luozhou Wang, Hao Lu, Xingye Tian, Xin Tao, Ying-Cong Chen

First: 2026-01-22T18:59:56+00:00 · Latest: 2026-01-22T18:59:56+00:00

Abstract

Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, the camera controllability still remains limited. In this work, we build upon Reward Feedback Learning and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latent into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latent into 3D representations for reward quantization. Specifically, video latent along with the camera pose are decoded into 3D Gaussians. In this process, the camera pose not only acts as input, but also serves as a projection parameter. Misalignment between the video latent and camera pose will cause geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between the rendered novel views and ground-truth ones as reward. To accommodate the stochastic nature, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments conducted on RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: https://a-bigbao.github.io/CamPilot/.

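Translated Title / Abstract

Title: CamPilot: Improving Camera Control in Video Diffusion Models with Efficient Camera Reward Feedback

Recent progress in camera-controlled video diffusion models has markedly improved video-camera alignment, yet camera controllability remains limited. In this work, we build on Reward Feedback Learning (ReFL) to improve it further. Applying existing ReFL methods directly runs into several challenges. First, current reward models cannot assess video-camera alignment. Second, decoding latents into RGB video to compute rewards adds substantial computational overhead. Third, video decoding typically ignores 3D geometric information. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latents into 3D representations for reward quantification. Specifically, the video latents and the camera pose are decoded together into 3D Gaussians. In this process the camera pose serves both as an input and as a projection parameter. Misalignment between the video latents and the camera pose causes geometric distortion in the 3D structure and therefore blurry renderings. Building on this property, we explicitly optimize pixel-level consistency between rendered novel views and ground-truth views as the reward. To accommodate the stochastic nature of generation, we further introduce a visibility term that supervises only the deterministic regions obtained through geometric warping. Extensive experiments on the RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of the proposed method. Project page: https://a-bigbao.github.io/CamPilot/.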

Summary

The research aims to enhance camera control in video diffusion models by addressing limitations in current reward feedback learning approaches. The method introduces an efficient 3D decoder that decodes video latent into 3D representations for reward quantization, using camera pose as both input and projection parameter. Experiments on RealEstate10K and WorldScore benchmarks show improved camera controllability and alignment in generated videos.

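The research aims to improve camera controllability in video diffusion models by addressing the limitations of existing reward feedback learning methods. It introduces an efficient 3D decoder that decodes video latents into 3D representations to quantify the reward, optimizing pixel-level consistency between rendered novel views and ground-truth views. Experiments on the RealEstate10K and WorldScore benchmarks show that the method improves camera controllability and video-camera alignment.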
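
As a rough illustration of the visibility-masked pixel-consistency reward described in the CamPilot abstract, the sketch below scores a rendered view against a ground-truth view only where a visibility flag is set. The function name `masked_reward`, the negative-MSE form, and the flat-list image layout are all assumptions made for illustration; the abstract does not give the exact reward formula.

```python
def masked_reward(rendered, target, visible):
    """Negative mean squared error over pixels flagged as visible.

    rendered, target: flat lists of pixel intensities.
    visible: flat list of 0/1 flags; 1 marks a pixel treated as deterministic
    (hypothetically obtained by geometric warping) and therefore supervised.
    """
    if not (len(rendered) == len(target) == len(visible)):
        raise ValueError("inputs must have the same length")
    count = sum(visible)
    if count == 0:
        return 0.0  # no supervised pixels, so no reward signal
    sq_err = sum(v * (r - t) ** 2 for r, t, v in zip(rendered, target, visible))
    return -sq_err / count
```

Pixels with a flag of 0 contribute nothing, so a region the model is free to hallucinate cannot lower the reward in this sketch.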

PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

Authors: Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, Ismini Lourentzou

First: 2026-01-22T18:58:55+00:00 · Latest: 2026-01-22T18:58:55+00:00

Abstract

Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.

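Translated Title / Abstract

Title: PyraTok: A Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

Discrete video VAEs are the foundation of modern text-to-video generation and video understanding systems, but existing tokenizers usually learn visual codebooks at a single scale, with limited vocabularies and shallow language supervision, which leads to poor cross-modal alignment and weak zero-shot transfer. We propose PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module, which discretizes encoder features at several depths using a shared large binary codebook and yields compact yet expressive video token sequences. To couple visual tokens tightly with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok achieves state-of-the-art video reconstruction, consistently improves text-to-video quality, and sets new state-of-the-art zero-shot results on video segmentation, temporal action localization, and video understanding, scaling robustly up to 4K/8K resolution.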

Summary

PyraTok is a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions, improving cross-modal alignment and zero-shot transfer in video understanding and generation. It uses a Language-aligned Pyramidal Quantization (LaPQ) module to discretize encoder features at various depths with a shared large binary codebook, and jointly optimizes multi-scale text-guided quantization and a global autoregressive objective. PyraTok achieves state-of-the-art performance in video reconstruction, text-to-video generation, and various video understanding tasks, scaling well to high resolutions.

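PyraTok is a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions, improving cross-modal alignment and zero-shot transfer. It uses a Language-aligned Pyramidal Quantization (LaPQ) module to discretize encoder features at several depths with one shared large binary codebook, and jointly optimizes multi-scale text-guided quantization and a global autoregressive objective. PyraTok achieves the best performance in video reconstruction, text-to-video generation, and a range of video understanding tasks, and scales well to high resolutions.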

GutenOCR: A Grounded Vision-Language Front-End for Documents

Authors: Hunter Heidenreich, Ben Elliott, Olivia Dinica, Yosheb Getachew

First: 2026-01-20T21:26:15+00:00 · Latest: 2026-01-22T18:58:24+00:00

Abstract

GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone on 10.5K held-out business and scientific pages (0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR as well as text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.

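Translated Title / Abstract

Title: GutenOCR: A Grounded Vision-Language Front-End for Documents

GutenOCR is a family of grounded OCR front-ends obtained by fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B. The resulting single-checkpoint vision-language models expose reading, detection, and grounding through a unified prompt-based interface. The models are trained on business documents, scientific articles, and synthetic grounding data, and support full-page and localized reading with line- and paragraph-level bounding boxes as well as conditional "where is x?" queries. We introduce a grounded OCR evaluation protocol and show that, on 10.5K held-out business and scientific pages, GutenOCR-7B more than doubles the composite grounded OCR score of its Qwen2.5-VL-7B backbone (from 0.40 to 0.82). On Fox and OmniDocBench v1.5, our approach substantially improves region- and line-level OCR and text-detection recall, but reveals trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.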

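Summary

GutenOCR is a family of vision-language models fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B that expose reading, detection, and grounding through a unified, prompt-based interface. Trained on business documents, scientific articles, and synthetic grounding data, the models support full-page and localized reading with line- and paragraph-level bounding boxes and conditional queries. GutenOCR-7B raises the composite grounded OCR score of its backbone from 0.40 to 0.82 on 10.5K held-out pages. On Fox and OmniDocBench v1.5 it improves region- and line-level OCR and text-detection recall, but shows trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.

GutenOCR is a set of vision-language models fine-tuned from Qwen2.5-VL-3B and Qwen2.5-VL-7B that provide a unified interface for reading, detection, and grounding through prompts. Trained on business documents and scientific articles, the models support full-page and localized reading with bounding boxes and conditional queries. On business and scientific pages, GutenOCR-7B reaches a composite grounded OCR score of 0.82, up from 0.40 for its base model. On Fox and OmniDocBench, GutenOCR improves region- and line-level OCR and text-detection recall, but shows some trade-offs in page-level linearization, color-guided OCR, and formula-heavy layouts.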

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Authors: Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie

First: 2026-01-22T18:58:16+00:00 · Latest: 2026-01-22T18:58:16+00:00

Comments: website: https://rae-dit.github.io/scale-rae/

Abstract

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.

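Translated Title / Abstract

Title: Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Representation Autoencoders (RAEs) have shown clear advantages for diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether the framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on top of the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, and find that while scale improves general fidelity, targeted data composition is essential for specific domains such as text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis shows that scaling simplifies the framework: dimension-dependent noise scheduling remains critical, but architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefit at scale. Building on this simplified framework, we run a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining at every model scale. Moreover, during finetuning on high-quality datasets, VAE-based models overfit catastrophically after 64 epochs, whereas RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models show faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. In addition, because visual understanding and generation can operate in a shared representation space, the multimodal model can reason directly over generated latents, opening new possibilities for unified models.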

Summary

This work explores the scalability of Representation Autoencoders (RAEs) for large-scale text-to-image (T2I) generation. By scaling RAE decoders on a frozen representation encoder (SigLIP-2) and training on diverse datasets, the study finds that while scale improves general image fidelity, targeted data composition is crucial for specific domains like text. The research also shows that RAEs outperform Variational Autoencoders (VAEs) during pretraining and finetuning, with RAE models demonstrating faster convergence and better generation quality, even at large model scales.

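This study examines how well Representation Autoencoders (RAEs) scale for text-to-image (T2I) generation, scaling RAE decoders beyond ImageNet by training on web, synthetic, and text-rendering data. It finds that while scaling improves general image fidelity, data composition is crucial for specific domains such as text. RAEs outperform VAEs in pretraining at all model scales and remain stable with better performance during finetuning. The results show that RAEs offer faster convergence and better generation quality for large-scale T2I generation, making them a stronger foundation than VAEs.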

LLM-in-Sandbox Elicits General Agentic Intelligence

Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei

First: 2026-01-22T18:57:09+00:00 · Latest: 2026-01-22T18:57:09+00:00

Comments: Project Page: https://llm-in-sandbox.github.io

Abstract

We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.

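Translated Title / Abstract

Title: LLM-in-Sandbox Elicits General Agentic Intelligence

We introduce LLM-in-Sandbox, which lets LLMs explore within a code sandbox (that is, a virtual computer) to elicit general intelligence in non-code domains. We first show that strong LLMs, without additional training, can generalize to using the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, use the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments show that LLM-in-Sandbox, in both training-free and post-trained settings, generalizes robustly across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze the efficiency of LLM-in-Sandbox from computational and system perspectives and open-source it as a Python package to facilitate real-world deployment.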

Summary

The study introduces LLM-in-Sandbox, which lets large language models (LLMs) explore a code sandbox to elicit general intelligence in non-code domains. It shows that strong LLMs, without additional training, can use the sandbox for non-code tasks by accessing external resources, handling long contexts through the file system, and executing scripts. These capabilities can be further enhanced through LLM-in-Sandbox Reinforcement Learning, which trains on non-agentic data only. Experiments show robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. The study also analyzes the efficiency of LLM-in-Sandbox from computational and system perspectives and open-sources it as a Python package for real-world deployment.

The research introduces the LLM-in-Sandbox approach, which lets large language models (LLMs) explore in a code sandbox to elicit general intelligence in non-code domains. It shows that strong LLMs can generalize to using the sandbox for accessing external resources, handling long contexts, and executing scripts. It also shows that LLM-in-Sandbox reinforcement learning on non-agentic data can enhance these capabilities. Experiments show that LLM-in-Sandbox generalizes robustly across several fields, including mathematics, physics, and biomedicine. The research also evaluates the efficiency of LLM-in-Sandbox and open-sources it as a Python package to facilitate real-world deployment.

Provable Robustness in Multimodal Large Language Models via Feature Space Smoothing

Authors: Song Xia, Meiwen Ding, Chenqi Kong, Wenhan Yang, Xudong Jiang

First: 2026-01-22T18:52:21+00:00 · Latest: 2026-01-22T18:52:21+00:00

Comments: Under review

Abstract

Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we indicate that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the defined Gaussian robustness score on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining on MLLMs. We demonstrate that the FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks indicate the effectiveness of the FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90% to about 1%.

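Translated Title / Abstract

Title: Provable Robustness in Multimodal Large Language Models via Feature Space Smoothing

Multimodal large language models (MLLMs) show strong capabilities across many applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and lead to wrong predictions. To address this vulnerability, we propose Feature-space Smoothing (FS) and prove theoretically that FS provides certified robustness for the feature representations of MLLMs. Specifically, FS turns any feature encoder into a smoothed variant that is guaranteed to keep a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. We further show that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the Gaussian robustness score defined on the vanilla encoder. Building on this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that raises the Gaussian robustness score of MLLMs and thereby strengthens their certified robustness under FS without retraining the MLLMs. We show that FS with PSM not only offers a strong theoretical robustness guarantee but also outperforms adversarial training empirically. Extensive experiments across diverse MLLMs and downstream tasks show that FS-PSM reduces the Attack Success Rate (ASR) of various white-box attacks from nearly 90% to about 1%.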
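
To make the idea of a smoothed encoder more concrete, here is a minimal sketch that assumes the smoothed features are estimated by averaging encoder outputs over Gaussian-perturbed copies of the input, in the style of randomized smoothing. The abstract does not describe the estimator, so `smoothed_features`, its parameters, and the Monte Carlo averaging are assumptions; the sketch does not compute the certified bound (FCSB) and does not model PSM.

```python
import math
import random


def smoothed_features(encoder, x, sigma, n_samples, seed=0):
    """Monte Carlo estimate of the mean of encoder(x + Gaussian noise).

    encoder: any function mapping a list of floats to a list of floats.
    sigma: standard deviation of the isotropic Gaussian noise (assumed form).
    """
    rng = random.Random(seed)
    total = [0.0] * len(encoder(x))
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        for i, f in enumerate(encoder(noisy)):
            total[i] += f
    return [t / n_samples for t in total]


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Comparing `smoothed_features` on a clean input and on a slightly perturbed input with `cosine_similarity` gives the empirical quantity that the paper's bound is stated to control from below.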

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Authors: Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han

First: 2025-12-01T18:59:45+00:00 · Latest: 2026-01-22T18:49:14+00:00

Comments: 10 pages, 4 figures

Abstract

As large language models have grown larger, interest has grown in low-precision numerical formats such as NVFP4 as a way to improve speed and reduce memory usage. However, quantizing models to NVFP4 remains difficult as the lack of precision generally degrades model performance. In this work, we address this issue with Four Over Six (4/6), a modification to the block-scaled NVFP4 quantization algorithm that yields reduced quantization error. Unlike integer formats, floating point formats have non-uniform step sizes which create larger quantization error on larger values. 4/6 takes advantage of this by adaptively scaling some blocks to smaller FP4 values, making the distribution of representable values more uniform and reducing quantization error for near-maximal values. We show that 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, resulting in performance gains during both pre-training and inference with minimal computational overhead. In pre-training experiments with the Nemotron 3 Nano 30B-A3B model architecture, we find that 4/6 brings training loss closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. Our code is available at http://github.com/mit-han-lab/fouroversix.

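Translated Title / Abstract

Title: Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

As large language models have grown, interest has increased in low-precision numerical formats such as NVFP4 as a way to improve speed and reduce memory usage. Quantizing models to NVFP4 remains difficult, however, because the lack of precision generally degrades model performance. In this work, we address the problem with Four Over Six (4/6), a modification to the block-scaled NVFP4 quantization algorithm that reduces quantization error. Unlike integer formats, floating-point formats have non-uniform step sizes, which produce larger quantization error on larger values. 4/6 exploits this by adaptively scaling some blocks to smaller FP4 values, making the distribution of representable values more uniform and reducing quantization error for near-maximal values. We show that 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, yielding performance gains in both pre-training and inference with minimal computational overhead. In pre-training experiments with the Nemotron 3 Nano 30B-A3B model architecture, we find that 4/6 brings training loss closer to BF16 than models trained with current state-of-the-art NVFP4 training recipes. Our code is available at http://github.com/mit-han-lab/fouroversix.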

Summary

This work addresses the challenge of quantizing large language models to NVFP4 by introducing Four Over Six (4/6), an adaptive block scaling method that reduces quantization error. By scaling some blocks to smaller FP4 values, 4/6 makes the distribution of representable values more uniform, thereby decreasing quantization error for larger values. The method is efficiently implemented on NVIDIA Blackwell GPUs, leading to performance gains in both pre-training and inference with minimal computational overhead. Experiments show that 4/6 brings training loss closer to BF16 compared to current NVFP4 training methods.

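This paper proposes Four Over Six (4/6), a method that reduces NVFP4 quantization error through adaptive block scaling, addressing the difficulty of quantizing large language models to NVFP4. 4/6 exploits the non-uniform step sizes of floating-point formats by scaling some blocks to smaller FP4 values, which makes the distribution of representable values more uniform and reduces quantization error for near-maximal values. Experiments show that 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs and brings performance gains in both pre-training and inference with little computational overhead, with training loss closer to BF16 than current state-of-the-art NVFP4 training methods.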
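
The adaptive scaling idea in the 4/6 abstract can be sketched as follows: quantize a block once with its largest magnitude mapped to 6 (the largest FP4 E2M1 magnitude) and once with it mapped to 4, then keep whichever gives lower error. This is only an illustration under stated assumptions: the per-block mean-squared-error selection rule, the function names, and the tie-breaking in rounding are hypothetical, and the block scale is kept as an ordinary float rather than the FP8 scale that NVFP4 actually stores.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4(x):
    """Round x to the nearest FP4 value (ties go to the smaller magnitude here)."""
    mag = min(FP4_MAGNITUDES, key=lambda m: abs(m - abs(x)))
    return mag if x >= 0 else -mag


def quantize_block(block, target_max):
    """Scale so the block's largest magnitude maps to target_max, then round.

    Returns (dequantized_values, mean_squared_error). The scale stays a Python
    float; FP8 rounding of the block scale is deliberately ignored.
    """
    absmax = max(abs(v) for v in block)
    if absmax == 0.0:
        return [0.0] * len(block), 0.0
    scale = absmax / target_max
    deq = [round_to_fp4(v / scale) * scale for v in block]
    mse = sum((a - b) ** 2 for a, b in zip(block, deq)) / len(block)
    return deq, mse


def four_over_six(block):
    """Try block maximum -> 6 and block maximum -> 4; keep the lower-error one."""
    deq6, mse6 = quantize_block(block, 6.0)
    deq4, mse4 = quantize_block(block, 4.0)
    if mse4 < mse6:
        return 4.0, deq4, mse4
    return 6.0, deq6, mse6
```

For a block such as `[6.0, 5.0]`, mapping the maximum to 6 leaves 5.0 in the wide gap between 4 and 6, while mapping it to 4 places the second value much closer to a representable point, so the sketch selects 4.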

360Anything: Geometry-Free Lifting of Images and Videos to 360°

Authors: Ziyi Wu, Daniel Watson, Andrea Tagliasacchi, David J. Fleet, Marcus A. Brubaker, Saurabh Saxena

First: 2026-01-22T18:45:59+00:00 · Latest: 2026-01-22T18:45:59+00:00

Comments: Project page: https://360anything.github.io/

Abstract

Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, obscuring the application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.

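Translated Title / Abstract

Title: 360Anything: Geometry-Free Lifting of Images and Videos to 360°

Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing methods usually rely on explicit geometric alignment between the perspective view and the equirectangular projection (ERP) space. This requires known camera metadata, which is typically missing or noisy in in-the-wild data. We propose 360Anything, a geometry-free framework built on pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, removing the need for camera information. Our method achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior work that uses ground-truth camera information. We also trace the root cause of seam artifacts at ERP boundaries to zero-padding in the VAE encoder and introduce Circular Latent Encoding to enable seamless generation. Finally, we show competitive results on zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. More results are available at https://360anything.github.io/.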

Summary

360Anything is a geometry-free framework that uses pre-trained diffusion transformers to lift perspective images and videos to 360° panoramas. By treating the inputs and targets as token sequences, it learns the mapping without needing camera metadata, thus enabling applications to in-the-wild data. The approach outperforms previous methods that rely on ground-truth camera information and introduces Circular Latent Encoding to address seam artifacts. It also demonstrates strong performance in zero-shot camera FoV and orientation estimation benchmarks, indicating its deep geometric understanding and broader utility in computer vision tasks.

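360Anything is a geometry-free framework that uses pre-trained diffusion transformers to convert perspective images and videos into 360° panoramas. It learns the perspective-to-equirectangular mapping without camera metadata, which makes it applicable to in-the-wild data. The method outperforms prior methods that rely on ground-truth camera information and introduces Circular Latent Encoding to address seam artifacts. It also performs well on zero-shot camera FoV and orientation estimation benchmarks, indicating deep geometric understanding and broader utility in computer vision tasks.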
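
The seam diagnosis in the 360Anything abstract (zero-padding at ERP boundaries) can be illustrated with a one-dimensional toy: the left and right edges of an equirectangular row are neighbours on the sphere, so wrapping the row around keeps a smoothing filter consistent at the boundary, while zero-padding does not. This is a sketch of circular versus zero padding only; `pad_row` and `smooth_row` are hypothetical helpers, and the abstract does not say how Circular Latent Encoding is actually implemented.

```python
def pad_row(row, pad, mode):
    """Pad a row on both sides with zeros or with wrapped-around values."""
    if pad == 0:
        return list(row)
    if pad > len(row):
        raise ValueError("pad must not exceed the row length")
    if mode == "zero":
        return [0] * pad + list(row) + [0] * pad
    if mode == "circular":
        return list(row[-pad:]) + list(row) + list(row[:pad])
    raise ValueError("mode must be 'zero' or 'circular'")


def smooth_row(row, mode):
    """Apply a 3-tap mean filter after padding by one element on each side."""
    padded = pad_row(row, 1, mode)
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(row))]
```

On a constant row, the circular variant returns the same constant everywhere, whereas the zero-padded variant darkens the first and last elements, which is the kind of boundary discontinuity that shows up as a seam.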

Paramanu: Compact and Competitive Monolingual Language Models for Low-Resource Morphologically Rich Indian Languages

Authors: Mitodru Niyogi, Eric Gaussier, Arnab Bhattacharya

First: 2024-01-31T17:58:10+00:00 · Latest: 2026-01-22T18:28:42+00:00

Abstract

Multilingual large language models (LLMs) are expensive to pretrain and often suffer from imbalances across languages and datasets, English-centric bias, tokenizer oversegmentation for morphologically rich low-resource languages, and the curse of multilinguality. We introduce PARAMANU, the first family of Indian-only autoregressive language models trained from scratch on open-source language-specific data for the five most spoken Indian languages: Bengali, Hindi, Marathi, Tamil, and Telugu. All models are designed for affordability and are trained on a single GPU with a budget under $1,000, allowing under-resourced researchers to build competitive language models. To address low-resource challenges, we develop morphology-aligned, low-fertility tokenizers and propose an interpolation-based method for token position indices in RoPE-based scaling to train longer sequences efficiently. We also create instruction-tuning datasets in Bangla that are translated to the other four languages. Despite their small size (108M-367M parameters), Paramanu achieves a strong performance-efficiency tradeoff and outperforms most larger multilingual models across all five languages. Our collection is available at https://huggingface.co/collections/mitodru/paramanu.

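Translated Title / Abstract

Title: Paramanu: Compact and Competitive Monolingual Language Models for Low-Resource Morphologically Rich Indian Languages

Multilingual large language models (LLMs) are expensive to pretrain and often suffer from imbalances across languages and datasets, English-centric bias, tokenizer oversegmentation of morphologically rich low-resource languages, and the curse of multilinguality. We introduce Paramanu, the first family of Indian-only autoregressive language models, trained from scratch on open-source language-specific data for the five most spoken Indian languages: Bengali, Hindi, Marathi, Tamil, and Telugu. All models are designed to be affordable and are trained on a single GPU with a budget under $1,000, so that under-resourced researchers can build competitive language models. To address low-resource challenges, we develop morphology-aligned, low-fertility tokenizers and propose an interpolation-based method for token position indices in RoPE-based scaling to train longer sequences efficiently. We also create instruction-tuning datasets in Bangla and translate them into the other four languages. Despite its small size (108M-367M parameters), Paramanu achieves a strong performance-efficiency tradeoff and outperforms most larger multilingual models across all five languages. Our collection is available at https://huggingface.co/collections/mitodru/paramanu.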

Summary

Paramanu is a family of Indian-only autoregressive language models trained on open-source language-specific data for five major Indian languages. These models are designed to be affordable, using a single GPU and a budget under $1,000. Paramanu addresses low-resource challenges through morphology-aligned tokenizers and an interpolation-based method for token position indices in RoPE scaling. Despite their small size (108M-367M parameters), Paramanu outperforms most larger multilingual models across all five languages, achieving a strong performance-efficiency tradeoff.

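Paramanu is a family of monolingual language models built specifically for five Indian languages (Bengali, Hindi, Marathi, Tamil, and Telugu). The models are trained on open-source language-specific data, are small and inexpensive, and can be trained on a single GPU with a budget under $1,000. Paramanu addresses low-resource challenges with morphology-aligned tokenizers and an interpolation-based method for token position indices in RoPE scaling. Despite its small parameter count (108M-367M), Paramanu performs well in all five languages and offers a better performance-efficiency tradeoff than most larger multilingual models.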
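
For readers unfamiliar with interpolating position indices under RoPE, the sketch below shows generic linear position interpolation: positions of a longer sequence are rescaled so they stay inside the range seen during training before the rotary angles are computed. This is the generic technique, not necessarily the method proposed in the Paramanu paper, whose details the abstract does not give; the function names and the default base of 10000 are assumptions.

```python
def interpolated_positions(seq_len, trained_len):
    """Return fractional position indices that never exceed the trained range.

    Sequences no longer than trained_len keep their integer positions;
    longer ones are compressed linearly by trained_len / seq_len.
    """
    if seq_len <= trained_len:
        return [float(i) for i in range(seq_len)]
    scale = trained_len / seq_len
    return [i * scale for i in range(seq_len)]


def rope_angles(position, head_dim, base=10000.0):
    """Rotary angles for one position: position * base^(-2k / head_dim)."""
    return [position / base ** (2 * k / head_dim) for k in range(head_dim // 2)]
```

With `trained_len=4` and `seq_len=8`, every interpolated position stays below 4, so the rotary angles remain within the range the model saw during training.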

Learning to Discover at Test Time

Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun

First: 2026-01-22T18:24:00+00:00 · Latest: 2026-01-22T18:24:00+00:00

Comments: Code: https://github.com/test-time-training/discover

Abstract

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

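Translated Title / Abstract

Title: Learning to Discover at Test Time

How can we use AI to discover a new state of the art for a scientific problem? Prior work on test-time scaling, such as AlphaEvolve, searches by prompting a frozen LLM. We instead perform reinforcement learning at test time, so the LLM keeps training, now on experience specific to the test problem. This form of continual learning is unusual because its goal is to produce one great solution rather than many good ones on average, and to solve this particular problem rather than generalize to others. Our learning objective and search subroutine are therefore designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets a new state of the art on almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results, which required closed frontier models. Our test-time training runs use Tinker, an API by Thinking Machines, at a cost of only a few hundred dollars per problem.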

Summary

The research aims to use AI to discover new state-of-the-art solutions for scientific problems by performing reinforcement learning at test time. The method, called Test-Time Training to Discover (TTT-Discover), allows the LLM to continue training with problem-specific experience, prioritizing the most promising solutions. The method sets new state-of-the-art results in various domains including mathematics, GPU kernel engineering, algorithm design, and biology, with solutions reviewed by experts. All results are achieved using an open model, OpenAI gpt-oss-120b, and can be reproduced with publicly available code.

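The research aims to use AI to discover new state-of-the-art solutions to scientific problems by performing reinforcement learning at test time. The method, Test-Time Training to Discover (TTT-Discover), lets the LLM continue training on experience specific to the test problem and prioritizes the most promising solutions. It sets new state-of-the-art results in several domains, including mathematics, GPU kernel engineering, algorithm design, and biology, with solutions reviewed by experts. All results use the open model OpenAI gpt-oss-120b and can be reproduced with the publicly available code.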

Is this chart lying to me? Automating the detection of misleading visualizations

Authors: Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych

First: 2025-08-29T14:36:45+00:00 · Latest: 2026-01-22T18:23:24+00:00

Comments: Preprint under review. Code and data available at: https://github.com/UKPLab/arxiv2025-misviz

Abstract

Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also create Misviz-synth, a synthetic dataset of 57,665 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and image-axis classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.

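Translated Title / Abstract

Title: Is this chart lying to me? Automating the detection of misleading visualizations

Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the lack of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also create Misviz-synth, a synthetic dataset of 57,665 visualizations generated with Matplotlib from real-world data tables. We run a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and image-axis classifiers. Our results show that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.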

Summary

This paper addresses the issue of misleading visualizations that can spread misinformation. It introduces Misviz, a benchmark dataset of 2,604 real-world visualizations annotated with 12 types of misleaders, and Misviz-synth, a synthetic dataset of 57,665 visualizations. The authors evaluate state-of-the-art models, rule-based systems, and image-axis classifiers on both datasets and find that the task is still highly challenging. The work aims to help protect readers from inaccurate conclusions drawn from such visualizations. Code and data are available at https://github.com/UKPLab/arxiv2025-misviz.

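This paper addresses the problem of misleading visualizations spreading misinformation. It introduces Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders, and Misviz-synth, a synthetic dataset of 57,665 visualizations generated from real-world data tables. The authors run a comprehensive evaluation on both datasets using state-of-the-art multimodal large language models, rule-based systems, and image-axis classifiers. The results show that detecting misleading visualizations remains a highly challenging task. The datasets and code are publicly released.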

Structured Hints for Sample-Efficient Lean Theorem Proving

Authors: Zachary Burton

First: 2026-01-22T18:16:46+00:00 · Latest: 2026-01-22T18:16:46+00:00

Comments: 9 pages, 1 figure

Abstract

State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention -- a fixed prompt schedule over 15 common tactic skeletons -- on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.

中文标题/摘要

标题：面向样本高效Lean定理证明的结构化提示

当前最先进的神经定理证明器如DeepSeek-Prover-V1.5结合了大型语言模型和强化学习，通过复杂的训练取得了令人印象深刻的成果。我们提出的问题是：这些高度训练的模型在推理时是否仍然受益于简单的结构指导？我们在miniF2F基准上评估了一种轻量级干预措施（覆盖15种常见策略骨架的固定提示调度）。这种简单的方法取得了21.7%的pass@16，而同一模型的标准采样为15.2%，在样本数量（k=16）和最大生成长度（1024个标记）相同的条件下相对提升43%。我们的结果表明，即使能力较强的RL训练证明器也未能充分利用策略语言中可用的结构先验，并且简单的推理时指导仍然是一个廉价的补充提升。

Summary / 总结

The research asks whether highly trained neural theorem provers still benefit from simple structural guidance at inference time. It evaluates a lightweight intervention (a fixed prompt schedule over 15 common tactic skeletons) on the miniF2F benchmark. This simple approach raises pass@16 from 15.2% with standard sampling to 21.7%, a 43% relative improvement with the same number of samples and the same maximum generation length.

研究旨在探讨经过高度训练的最先进神经定理证明器在推理过程中是否仍能从简单的结构指导中受益。研究在miniF2F基准上评估了一种轻量级干预措施（覆盖15种常见策略骨架的固定提示调度）。这种方法将pass@16从标准采样的15.2%提高到21.7%，相对提升43%，且使用相同数量的样本和最大生成长度。研究结果表明，即使训练良好的模型也未能充分利用结构先验，而简单的推理时指导可以提供显著的提升。
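The reported figures can be checked with a few lines of arithmetic, and the intervention itself is easy to picture. The sketch below is illustrative only: the skeleton strings and the `build_prompts` helper are hypothetical (the abstract does not list the 15 skeletons), and cycling through them round-robin is one assumed reading of "fixed prompt schedule".

```python
# Hypothetical tactic skeletons; the paper uses 15, which the abstract does not list.
SKELETONS = ["simp", "linarith", "induction n with n ih", "norm_num"]

def build_prompts(theorem: str, k: int, skeletons=SKELETONS) -> list[str]:
    """Attach one skeleton hint to each of the k samples, cycling in a fixed order."""
    return [
        f"{theorem}\n-- hint: start with `{skeletons[i % len(skeletons)]}`"
        for i in range(k)
    ]

def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline

prompts = build_prompts("theorem foo : 1 + 1 = 2 := by", k=16)
gain = relative_improvement(21.7, 15.2)  # pass@16 figures from the abstract
print(len(prompts), f"{gain:.1%}")       # 16 prompts, gain of about 43%
```

Because the schedule is fixed, the sample budget stays at k=16; only the prompt attached to each sample changes.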

Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes

Authors: Steven Kolawole, Lucio Dery, Jean-François Kagy, Virginia Smith, Graham Neubig, Ameet Talwalkar

First: 2024-02-08T04:48:26+00:00 · Latest: 2026-01-22T18:13:50+00:00

Comments: 19 pages, 6 figures, 16 tables

Abstract

Structured pruning is a promising approach to create smaller, faster large language models. However, existing methods typically rely on computing the gradient via backward passes, which can inflate memory requirements and compute costs. In this work we introduce Bonsai, a gradient-free structured pruning method that eliminates the need for backpropagation, significantly reducing memory requirements and compute costs while achieving state-of-the-art pruning performance. Bonsai uses forward-pass-only perturbative pruning to enable efficient compression of large models on a broader range of hardware configurations. Unlike existing structured pruning approaches, Bonsai not only achieves better compression with fewer resources but also produces models that are twice as fast as those generated by semi-structured pruning. As a concrete demonstration, we use Bonsai to prune 7B and 8B models to 50% sparsity on a single A6000 GPU -- a task challenging for backprop-based methods in memory-constrained settings, as they require 2-3x the memory. Our results show that removing backprop as a requirement not only enables pruning larger models on constrained hardware but can also lead to state-of-the-art efficiency and performance.

中文标题/摘要

标题：现在修剪：仅使用前向传递修剪LLMs的结构化修剪

结构化修剪是一种有前途的方法，可以创建更小、更快的大语言模型。然而，现有方法通常依赖于通过反向传递计算梯度，这会增加内存需求和计算成本。在本工作中，我们引入了Bonsai，这是一种无需反向传播的梯度自由结构化修剪方法，显著减少了内存需求和计算成本，同时实现了最先进的修剪性能。Bonsai 使用仅前向传递的扰动修剪来实现大型模型在更广泛的硬件配置上的高效压缩。与现有的结构化修剪方法不同，Bonsai 不仅在更少的资源下实现了更好的压缩，还生成了比半结构化修剪方法生成的模型快两倍的模型。作为具体的演示，我们使用Bonsai将7B和8B模型修剪到50%的稀疏性，这在内存受限的环境中对基于反向传播的方法来说是一项具有挑战性的任务，因为它们需要2-3倍的内存。我们的结果表明，去除反向传播的要求不仅使在受限硬件上修剪更大规模的模型成为可能，还可以实现最先进的效率和性能。

Summary / 总结

This work introduces Bonsai, a gradient-free structured pruning method that uses forward-pass-only perturbative pruning to compress large language models efficiently. Bonsai reduces memory and compute costs while achieving state-of-the-art pruning performance, and it produces models that are twice as fast as those generated by semi-structured pruning. Experiments demonstrate that Bonsai can prune 7B and 8B models to 50% sparsity on a single A6000 GPU, a task challenging for backprop-based methods due to memory constraints.

该研究引入了Bonsai，一种无需反向传播的结构化剪枝方法，通过仅使用前向传播的扰动剪枝来减少内存和计算成本，同时达到最先进的剪枝性能。Bonsai能够在单个A6000 GPU上将7B和8B模型压缩到50%的稀疏性，展示了比半结构化剪枝方法更好的压缩效果和两倍的运行速度，即使在内存受限的环境中也是如此。
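To make "forward-pass-only perturbative pruning" concrete, here is a toy sketch of the general idea: randomly mask modules, measure the loss with forward passes alone, and score each module by how much the loss rises when it is dropped. This is an assumed simplification, not Bonsai's actual algorithm; the additive toy model and the names `forward_loss` and `estimate_importance` are hypothetical.

```python
import random

def forward_loss(mask, contributions, target):
    """Toy forward pass: squared error of the masked sum of module outputs."""
    pred = sum(c for c, keep in zip(contributions, mask) if keep)
    return (pred - target) ** 2

def estimate_importance(contributions, target, n_trials=2000, keep_prob=0.5, seed=0):
    """Score each module by the average loss increase observed when it is dropped."""
    rng = random.Random(seed)
    n = len(contributions)
    dropped = [[] for _ in range(n)]  # losses seen with module i removed
    kept = [[] for _ in range(n)]     # losses seen with module i present
    for _ in range(n_trials):
        mask = [rng.random() < keep_prob for _ in range(n)]
        loss = forward_loss(mask, contributions, target)  # no gradients needed
        for i, keep in enumerate(mask):
            (kept if keep else dropped)[i].append(loss)
    return [sum(d) / len(d) - sum(k) / len(k) for d, k in zip(dropped, kept)]

contributions = [4.0, 0.1, 3.0, 0.05]  # toy per-module outputs
scores = estimate_importance(contributions, target=sum(contributions))
# Prune the two modules whose removal hurts the least (50% structured sparsity).
to_prune = sorted(range(len(scores)), key=scores.__getitem__)[:2]
print(to_prune)
```

Since only forward evaluations are used, no activations or gradients need to be stored for a backward pass, which is the memory saving the abstract points to.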

GRITHopper: Decomposition-Free Multi-Hop Dense Retrieval

Authors: Justus-Jonas Erker, Nils Reimers, Iryna Gurevych

First: 2025-03-10T16:42:48+00:00 · Latest: 2026-01-22T18:12:25+00:00

Comments: Accepted at EACL 2026 Main Conference

Abstract

Decomposition-based multi-hop retrieval methods rely on many autoregressive steps to break down complex queries, which breaks end-to-end differentiability and is computationally expensive. Decomposition-free methods tackle this, but current decomposition-free approaches struggle with longer multi-hop problems and generalization to out-of-distribution data. To address these challenges, we introduce GRITHopper-7B, a novel multi-hop dense retrieval model that achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks. GRITHopper combines generative and representational instruction tuning by integrating causal language modeling with dense retrieval training. Through controlled studies, we find that incorporating additional context after the retrieval process, referred to as post-retrieval language modeling, enhances dense retrieval performance. By including elements such as final answers during training, the model learns to better contextualize and retrieve relevant information. GRITHopper-7B offers a robust, scalable, and generalizable solution for multi-hop dense retrieval, and we release it to the community for future research and applications requiring multi-hop reasoning and retrieval capabilities.

中文标题/摘要

标题：GRITHopper：无需分解的多跳密集检索

基于分解的多跳检索方法依赖于许多自回归步骤来分解复杂的查询，这破坏了端到端的可微性并导致计算成本高昂。无需分解的方法解决了这一问题，但当前的无需分解方法在处理较长的多跳问题和泛化到未见过的数据方面存在困难。为了解决这些挑战，我们引入了GRITHopper-7B，这是一种新型的多跳密集检索模型，它在分布内和分布外基准测试中均实现了最先进的性能。GRITHopper结合了生成性和表征性指令微调，通过将因果语言建模与密集检索训练相结合。通过受控研究，我们发现检索过程后的额外上下文建模，称为检索后语言建模，可以增强密集检索性能。通过在训练中包含最终答案等元素，模型学会了更好地上下文化和检索相关信息。GRITHopper-7B提供了一种稳健、可扩展且通用的多跳密集检索解决方案，并将其发布给社区，以供未来的研究和需要多跳推理和检索能力的应用使用。

Summary / 总结

The research aims to improve multi-hop retrieval by addressing the limitations of decomposition-based methods, which are computationally expensive and lack end-to-end differentiability. GRITHopper-7B, a novel decomposition-free model, combines generative and representational instruction tuning to enhance dense retrieval. Key findings show that post-retrieval language modeling and including final answers in training improve performance, leading to state-of-the-art results on both in-distribution and out-of-distribution benchmarks.

研究旨在通过解决分解式方法的计算昂贵和缺乏端到端可微性问题来改进多跳检索。GRITHopper-7B 是一种新颖的无分解方法，结合生成性和表示性指令微调，以增强密集检索。关键发现表明，检索后的语言建模和在训练中包含最终答案可以提高性能，从而在分布内和分布外基准测试中达到最先进的结果。

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Authors: Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, Jinwei Gu

First: 2026-01-22T18:09:30+00:00 · Latest: 2026-01-22T18:09:30+00:00

Abstract

Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/

中文标题/摘要

标题：Cosmos Policy：面向视觉运动控制与规划的视频模型微调

近期的视频生成模型展示了捕捉复杂物理交互和场景随时间演变的非凡能力。为了利用其时空先验知识，已有机器人学工作将视频模型用于策略学习，但需要多阶段的后训练和用于动作生成的新架构组件，从而引入了复杂性。在本工作中，我们提出了Cosmos Policy，这是一种简单的方法，通过在目标平台收集的机器人演示数据上进行单一阶段的后训练，将大型预训练视频模型（Cosmos-Predict2）适配为有效的机器人策略，无需架构修改。Cosmos Policy学习直接生成机器人动作，这些动作被编码为视频模型潜在扩散过程中的潜在帧，从而利用模型的预训练先验和核心学习算法捕捉复杂的动作分布。此外，Cosmos Policy还生成未来状态图像和价值（预期累积奖励），它们同样被编码为潜在帧，使测试时能够规划成功概率更高的动作轨迹。在我们的评估中，Cosmos Policy在LIBERO和RoboCasa模拟基准测试中达到了最先进的性能（平均成功率分别为98.5%和67.1%），并在具有挑战性的真实世界双臂操作任务中获得了最高的平均分数，优于从头开始训练的强大扩散策略、基于视频模型的策略，以及在相同机器人演示上微调的最先进的视觉-语言-动作模型。此外，给定策略执行（rollout）数据，Cosmos Policy可以从经验中学习以改进其世界模型和价值函数，并利用基于模型的规划在具有挑战性的任务中实现更高的成功率。我们在 https://research.nvidia.com/labs/dir/cosmos-policy/ 发布了代码、模型和训练数据。

Summary / 总结

Cosmos Policy aims to simplify the adaptation of large pretrained video models for robotics tasks by requiring only a single stage of post-training on robot demonstration data, without architectural modifications. It leverages the pretrained model's priors and learning algorithm to generate robot actions and future state images, enabling test-time planning. Experiments show Cosmos Policy outperforms other methods on simulation benchmarks and real-world bimanual manipulation tasks, achieving high success rates and demonstrating the ability to refine its model from experience.

Cosmos Policy 是一种方法，通过在机器人演示数据上进行单阶段后训练，将一个大型预训练视频模型（Cosmos-Predict2）转化为有效的机器人策略，无需修改架构。它学习将机器人动作和未来状态图像作为潜变量帧生成，利用模型的预训练先验。Cosmos Policy 在仿真基准测试和复杂的双臂操作任务中表现出色，超越了其他方法。
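The test-time planning described above can be pictured as a best-of-N search: sample several candidate actions, score each with the predicted value, and execute the highest-scoring one. The sketch below assumes this simple selection rule, which the abstract does not spell out; `sample_action` and `predict_value` are hypothetical toy stand-ins for the model's action and value heads.

```python
import random

def plan_best_of_n(sample_action, predict_value, n=8, seed=0):
    """Sample n candidate actions and return the one with the highest predicted value."""
    rng = random.Random(seed)
    candidates = [sample_action(rng) for _ in range(n)]
    return max(candidates, key=predict_value)

def sample_action(rng):
    """Toy policy: a one-dimensional action drawn uniformly from [-1, 1]."""
    return rng.uniform(-1.0, 1.0)

def predict_value(action):
    """Toy value function that peaks at action = 0.5."""
    return -(action - 0.5) ** 2

best = plan_best_of_n(sample_action, predict_value)
print(best)
```

In the paper's setting the candidates would be action chunks decoded from latent frames rather than scalars, but the selection step has the same shape.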

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change

Authors: Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger

First: 2025-05-25T21:29:00+00:00 · Latest: 2026-01-22T18:06:39+00:00

Comments: 45 pages, 21 figures, under review

Abstract

Ambivalence and hesitancy (A/H), two closely related constructs, are the primary reasons why individuals delay, avoid, or abandon health behaviour changes. A/H is a subtle and conflicting emotion that sets a person in a state between positive and negative orientations, or between acceptance and refusal to do something. It manifests as a discord in affect between multiple modalities or within a modality, such as facial and vocal expressions, and body language. Although experts can be trained to recognize A/H, as is done for in-person interactions, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital behaviour change interventions. However, no dataset currently exists for the design of machine learning models to recognize A/H. This paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset collected for multimodal recognition of A/H in videos. It contains 1,427 videos with a total duration of 10.60 hours captured from 300 participants across Canada answering predefined questions to elicit A/H. It is intended to mirror real-world online personalized behaviour change interventions. BAH is annotated by three experts to provide timestamps that indicate where A/H occurs, and frame- and video-level annotations with A/H cues. Video transcripts, cropped and aligned faces, and participants' meta-data are also provided. Since A and H manifest similarly in practice, we provide a binary annotation indicating the presence or absence of A/H. Additionally, this paper includes benchmarking results using baseline models on BAH for frame- and video-level recognition, zero-shot prediction, and personalization using source-free domain adaptation. The data, code, and pretrained weights are available.

中文标题/摘要

标题：BAH数据集：视频中数字行为改变中犹豫/矛盾识别

犹豫和矛盾（A/H）是个人推迟、避免或放弃健康行为改变的主要原因。这是一种微妙且矛盾的情绪，使人处于积极和消极、接受和拒绝之间的状态。它表现为不同模态或同一模态中的情感不一致，如面部和语音表达以及肢体语言。尽管专家可以被训练来识别A/H，就像在面对面互动中那样，将其整合到数字健康干预措施中既昂贵又不那么有效。因此，自动识别A/H对于数字行为改变干预措施的个性化和成本效益至关重要。然而，目前没有用于设计机器学习模型识别A/H的数据集。本文介绍了为视频中多模态识别A/H收集的Behavioral Ambivalence/Hesitancy (BAH)数据集。该数据集包含来自加拿大300名参与者回答预定义问题以引发A/H的1,427个视频，总时长为10.60小时。它旨在模拟现实世界的在线个性化行为改变干预措施。BAH由三位专家注释，提供A/H发生的时间戳，以及帧级和视频级带有A/H线索的注释。还提供了视频转录、裁剪和对齐的脸部以及参与者的元数据。由于A和H在实践中表现相似，我们提供了二元注释，表明A/H的存在或不存在。此外，本文还包括在BAH上使用基线模型进行帧级和视频级识别、零样本预测和个性化（使用无源域适应）的基准测试结果。数据、代码和预训练权重均可用。

HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval

Authors: Zequn Xie, Xin Liu, Boyun Zhang, Yuxiao Lin, Sihang Cai, Tao Jin

Venue: ICASSP 2026

First: 2026-01-22T17:57:42+00:00 · Latest: 2026-01-22T17:57:42+00:00

Comments: Accepted by ICASSP 2026

Abstract

The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.

中文标题/摘要

标题：HVD：基于人类视觉的视频表示学习方法在文本-视频检索中的应用

CLIP的成功推动了文本-视频检索领域的重要进展。然而，当前的方法往往受到“盲视”特征交互的困扰，模型难以从背景噪声中区分关键视觉信息，这主要是由于文本查询的稀疏性。为了解决这一问题，我们借鉴了人类的认知行为，提出了基于人类视觉驱动（HVD）的模型。我们的框架建立了一种从粗到细的对齐机制，包括两个关键组件：帧特征选择模块（FFSM）和补丁特征压缩模块（PFCM）。FFSM通过选择关键帧来模拟人类的宏观感知能力，从而消除时间冗余。随后，PFCM通过先进的注意力机制将补丁特征聚合为显著的视觉实体，模拟微观感知，实现精确的实体级匹配。在五个基准上的广泛实验表明，HVD不仅捕捉到了类似人类的视觉焦点，还实现了最先进的性能。

Summary / 总结

The research aims to improve text-video retrieval by addressing the issue of 'blind' feature interaction in current models. It proposes the Human Vision-Driven (HVD) model, which includes a Frame Features Selection Module (FFSM) and a Patch Features Compression Module (PFCM). FFSM selects key frames to reduce temporal redundancy, while PFCM aggregates patch features into salient visual entities for precise matching. Experiments on five benchmarks show that HVD captures human-like visual focus and achieves state-of-the-art performance.

研究旨在通过解决模型依赖文本查询而导致忽视重要视觉信息的问题，来提升文本-视频检索的效果。提出了Human Vision-Driven (HVD)模型，包括Frame Features Selection Module (FFSM)和Patch Features Compression Module (PFCM)。FFSM通过选择关键帧来减少时间冗余，而PFCM通过先进的注意力机制聚合片段特征以突出显示显著的视觉实体。在五个基准上的实验表明，HVD不仅超越了现有方法，还能更有效地捕捉人类的视觉焦点。
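As a rough illustration of the coarse stage (key-frame selection), the sketch below keeps the k frames whose features are most similar to the text query. The abstract does not say how FFSM scores frames, so ranking by cosine similarity to the text feature is an assumption, and `select_key_frames` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_key_frames(frame_feats, text_feat, k):
    """Return indices of the k frames most similar to the text, in temporal order."""
    ranked = sorted(
        range(len(frame_feats)),
        key=lambda i: cosine(frame_feats[i], text_feat),
        reverse=True,
    )
    return sorted(ranked[:k])

frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]  # toy frame features
query = [1.0, 0.0]                                         # toy text feature
print(select_key_frames(frames, query, k=2))  # -> [0, 1]
```

Returning the indices in temporal order keeps the selected frames usable by any later module that depends on frame ordering.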

Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization

Authors: Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Soiledis, Konstantinos-Theodoros Tsamis, Vassilis Katsouros, Emilios Cambouropoulos

First: 2026-01-22T17:46:31+00:00 · Latest: 2026-01-22T17:46:31+00:00

Abstract

Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single-encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross) attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences during training to strengthen melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines in nearly all metrics, with particularly strong gains in out-of-domain evaluations where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, intertwining of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single-encoder harmonization.

中文标题/摘要

标题：关注旋律：单编码器旋律和声化中的课程掩码

旋律和声化，即为给定旋律生成和声伴奏的任务，在计算音乐生成中仍然是一个核心挑战。最近的单编码器Transformer方法将和声化视为一个掩码序列建模问题，但现有的受离散扩散启发的训练课程往往导致旋律与和声之间的（交叉）注意力较弱。这导致对旋律线索的利用有限，尤其是在领域外情境中。在本文中，我们引入了一种训练课程FF（full-to-full），它在若干训练步内保持所有和声标记被掩码，然后在训练过程中逐步取消整个序列的掩码，以加强旋律与和声之间的交互。我们沿多个实验维度系统地将这种方法与先前的课程进行了比较，包括时间量化（四分音符 vs. 十六分音符）、小节级 vs. 拍号条件、旋律表示（全音域 vs. 音级）以及推理时的取消掩码策略。模型在HookTheory数据集上进行训练，并在领域内和一个精心挑选的爵士标准曲集上进行评估，使用一套全面的指标来评估和弦进行结构、和声-旋律对齐和节奏连贯性。结果表明，提出的FF课程在几乎所有指标上都持续优于基线，在领域外评估中提升尤为显著，因为在这类评估中，和声对新颖旋律线索的适应性至关重要。我们还发现，在FF设置中，四分音符量化、小节标记的交织以及音级旋律表示是有利的。我们的研究结果强调了训练课程对实现有效旋律条件化的重要性，并表明full-to-full取消掩码是一种稳健的单编码器和声化策略。

Summary / 总结

This study addresses the challenge of melodic harmonization by introducing a novel training curriculum, FF, which keeps harmony tokens masked for several steps before unmasking entire sequences. This approach enhances the interaction between melody and harmony, particularly in out-of-domain contexts. Experiments across various metrics show that the FF curriculum outperforms existing methods, especially in handling novel melodic inputs, and that quarter-note quantization and pitch-class melody representations are beneficial.

本文提出了一种新的训练课程FF（full-to-full），该课程在训练初期保持和弦标记隐藏，之后逐步解码整个序列，以增强旋律与和声之间的互动。实验表明，FF 方法在域外评估中表现尤为出色，特别是在处理新颖旋律线索时。研究使用多种度量标准和数据集评估模型，证明了该方法在生成与旋律线索更好地对齐且保持节奏连贯的和声方面更为有效。

ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Authors: Remy Sabathier, David Novotny, Niloy J. Mitra, Tom Monnier

First: 2026-01-22T17:41:13+00:00 · Latest: 2026-01-22T17:41:13+00:00

Abstract

Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.

中文标题/摘要

标题：ActionMesh：基于时间3D扩散的动画3D网格生成

生成动画3D对象是许多应用的核心，但大多数先进的工作由于其有限的设置、长时间的运行或有限的质量，通常难以在实践中应用。我们介绍了ActionMesh，这是一种生成模型，能够以前馈方式预测“在行动”的生产级3D网格。受到早期视频模型的启发，我们的关键见解是修改现有的3D扩散模型，加入时间轴，从而形成我们称之为“时间3D扩散”的框架。具体来说，我们首先将3D扩散阶段适应为生成表示时间变化和独立3D形状的同步潜在变量序列。其次，我们设计了一个时间3D自编码器，将一系列独立形状转换为预定义参考形状的相应变形，使我们能够构建动画。结合这两个组件，ActionMesh可以从单目视频、文本描述甚至带有描述其动画的文本提示的3D网格等不同输入生成动画3D网格。此外，与以前的方法相比，我们的方法速度快，生成的结果是无骨架的且拓扑一致，因此能够实现快速迭代和无缝应用，如纹理化和目标变换。我们在标准视频到4D基准（Consistent4D，Objaverse）上评估了我们的模型，并在几何准确性和时间一致性方面报告了最先进的性能，证明了我们的模型可以以前所未有的速度和质量生成动画3D网格。

Summary / 总结

ActionMesh is a generative model that predicts animated 3D meshes in a feed-forward manner by incorporating a temporal axis into existing 3D diffusion models, enabling the generation of synchronized latents and deformations of a reference shape. The model can produce animated 3D meshes from various inputs such as monocular videos, text descriptions, or 3D meshes with text prompts. ActionMesh outperforms previous methods in terms of speed and quality, achieving state-of-the-art performance on geometric accuracy and temporal consistency in benchmarks like Consistent4D and Objaverse.

ActionMesh 是一种生成模型，通过将时间轴引入 3D 扩散模型中，以前馈方式预测动画 3D 网格。它首先生成时间变化的 3D 形状的同步潜在变量，然后使用 3D 时序自编码器将参考形状变形为相应的动画。这种方法可以从多种输入快速生成无骨架动画，并在标准基准测试中实现几何精度和时间一致性方面的最佳性能。

Beat-ssl: Capturing Local ECG Morphology through Heartbeat-level Contrastive Learning with Soft Targets

Authors: Muhammad Ilham Rizqyawan, Peter Macfarlane, Stathis Hadjidemetriou, Fani Deligianni

Venue: ISBI 2026

First: 2026-01-22T17:40:23+00:00 · Latest: 2026-01-22T17:40:23+00:00

Comments: Accepted at ISBI 2026

Abstract

Obtaining labelled ECG data for developing supervised models is challenging. Contrastive learning (CL) has emerged as a promising pretraining approach that enables effective transfer learning with limited labelled data. However, existing CL frameworks either focus solely on global context or fail to exploit ECG-specific characteristics. Furthermore, these methods rely on hard contrastive targets, which may not adequately capture the continuous nature of feature similarity in ECG signals. In this paper, we propose Beat-SSL, a contrastive learning framework that performs dual-context learning through both rhythm-level and heartbeat-level contrasting with soft targets. We evaluated our pretrained model on two downstream tasks: 1) multilabel classification for global rhythm assessment, and 2) ECG segmentation to assess its capacity to learn representations across both contexts. We conducted an ablation study and compared the best configuration with three other methods, including one ECG foundation model. Despite the foundation model's broader pretraining, Beat-SSL reached 93% of its performance in the multilabel classification task and surpassed all other methods in the segmentation task by 4%.

中文标题/摘要

标题：Beat-ssl：通过心跳级对比学习软目标捕获心电图局部形态

获得带有标签的心电图数据以开发监督模型具有挑战性。对比学习(CL)已成为一种有前景的预训练方法，能够有效进行有限标签数据的迁移学习。然而，现有的CL框架要么仅关注全局上下文，要么未能利用心电图的特定特征。此外，这些方法依赖于硬对比目标，这可能无法充分捕捉心电图信号中特征相似性的连续性。在本文中，我们提出了一种对比学习框架Beat-SSL，该框架通过心跳级和节律级对比学习使用软目标进行双重上下文学习。我们在两个下游任务上评估了我们的预训练模型：1) 全局节律评估的多标签分类，2) 心电图分割以评估其在两种上下文中的表示学习能力。我们进行了消融研究，并将最佳配置与三种其他方法进行了比较，包括一种心电图基础模型。尽管基础模型的预训练范围更广，但Beat-SSL在多标签分类任务中的性能达到了基础模型的93%，并且在分割任务中超过了所有其他方法4%。

Summary / 总结

The paper proposes Beat-SSL, a contrastive learning framework that performs dual-context learning through rhythm-level and heartbeat-level contrasting with soft targets to address the challenge of obtaining labeled ECG data. The model was evaluated on two tasks: multilabel classification for global rhythm assessment and ECG segmentation. Beat-SSL outperformed other methods, achieving 93% of the performance of a foundation model in the classification task and surpassing all other methods by 4% in the segmentation task.

研究旨在通过提出Beat-SSL框架解决获得标注ECG数据以开发监督模型的挑战，该框架通过节奏级和心搏级对比学习以及软目标进行双重上下文学习。该模型在两个任务上进行了评估：全局节律评估的多标签分类和ECG分割。Beat-SSL在多标签分类任务上的性能达到了基础模型的93%，并在分割任务上比其他方法高出4%。
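The difference between hard and soft contrastive targets can be shown with a small loss function. The sketch below is a generic soft-target contrastive loss (cross-entropy between a target distribution and a temperature-scaled softmax over similarities); it is an assumed formulation for illustration, not necessarily the loss used in Beat-SSL, and the similarity and target values are invented.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def soft_contrastive_loss(sim_row, target_row, temperature=0.1):
    """Cross-entropy between a target distribution and softmax(sim / temperature)."""
    probs = softmax([s / temperature for s in sim_row])
    return -sum(t * math.log(p) for t, p in zip(target_row, probs) if t > 0)

sims = [0.9, 0.7, 0.1]  # anchor heartbeat vs. three candidate heartbeats
hard = [1.0, 0.0, 0.0]  # one-hot target, as in standard InfoNCE-style losses
soft = [0.7, 0.3, 0.0]  # assumed soft target: the second beat is partly similar
print(soft_contrastive_loss(sims, hard), soft_contrastive_loss(sims, soft))
```

With the one-hot target, the second candidate is treated purely as a negative; the soft target instead gives it partial credit, which is the kind of graded similarity the abstract argues hard targets miss.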

Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data

Authors: Paul Quinlan, Qingguo Li, Xiaodan Zhu

First: 2025-03-13T21:05:11+00:00 · Latest: 2026-01-22T17:37:12+00:00

Abstract

Large language models are being rapidly deployed across many fields such as healthcare, finance, transportation, and energy, where time-series data are fundamental components. The current works are still limited in their ability to perform reasoning that involves both time-series and the corresponding textual content. We address this gap by introducing Chat-TS, a large language model (LLM) based framework designed to support reasoning over time series and textual data. Unlike traditional models, Chat-TS integrates time-series tokens into LLMs' vocabulary, enhancing its reasoning ability over both modalities without compromising core natural language capabilities. To support learning and evaluation, we contribute new datasets: the TS Instruct Training Dataset (pairing diverse time-series data with relevant text instructions and responses for instruction tuning), the TS Instruct Question and Answer (QA) Gold Dataset (multiple-choice questions to evaluate multimodal reasoning), and a TS Instruct Quantitative Probing Set (a small subset of TS Instruct QA reasoning tasks alongside math and decision-making questions for LLM evaluation). We design a training strategy to preserve the inherent reasoning capabilities of LLMs while augmenting them for time-series reasoning. Experiments show that Chat-TS achieves state-of-the-art performance in multimodal reasoning tasks by maintaining strong natural language proficiency while improving time-series reasoning.

中文标题/摘要

标题：Chat-TS：增强对时间序列和自然语言数据的多模态推理

大型语言模型正在被迅速部署到医疗、金融、交通和能源等多个领域，其中时间序列数据是基本组成部分。当前的工作仍然在处理涉及时间序列和相应文本内容的推理方面能力有限。我们通过引入Chat-TS，一种基于大型语言模型（LLM）的框架来解决这一问题，该框架旨在支持时间序列和文本数据的推理。与传统模型不同，Chat-TS 将时间序列标记整合到LLM的词汇中，增强了其在两种模态上的推理能力，同时不牺牲核心自然语言能力。为了支持学习和评估，我们贡献了新的数据集：TS Instruct 训练数据集（将多样化的时序数据与相关的文本指令和响应配对，用于指令调优），TS Instruct 问题和答案黄金数据集（多项选择题，用于评估多模态推理），以及TS Instruct 定量探测集（TS Instruct QA推理任务的小子集，以及数学和决策问题，用于LLM评估）。我们设计了一种训练策略，以保持LLM固有的推理能力，同时增强其时间序列推理能力。实验表明，Chat-TS 在多模态推理任务中达到了最先进的性能，同时保持了强大的自然语言能力并提高了时间序列推理能力。

Summary / 总结

The research aims to enhance the ability of large language models to reason over both time-series and textual data, which is crucial for applications in healthcare, finance, transportation, and energy. Chat-TS, a novel framework, integrates time-series tokens into the vocabulary of large language models to improve their reasoning capabilities over both modalities. Key experimental results show that Chat-TS outperforms existing models in multimodal reasoning tasks while maintaining strong natural language proficiency.

研究旨在增强大型语言模型在处理时间序列数据和自然语言方面的推理能力。Chat-TS 是一种新框架，将时间序列标记整合到大型语言模型的词汇表中，以提高其在两种模态上的推理能力。实验结果表明，Chat-TS 在多模态推理任务中表现出色，同时保持了强大的自然语言能力。

LLM Prompt Evaluation for Educational Applications

Authors: Langdon Holmes, Adam Coscia, Scott Crossley, Joon Suh Choi, Wesley Morris

First: 2026-01-22T17:31:25+00:00 · Latest: 2026-01-22T17:31:25+00:00

Abstract

As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.

中文标题/摘要

标题：大型语言模型在教育应用中的提示评估

随着大型语言模型（LLMs）在教育应用中的日益普及，需要基于证据的方法来设计和评估LLM提示，以产生个性化和教育目标一致的输出。本研究提出了一种可推广的系统评估方法，通过结构化对话活动中的LLM生成的后续问题分析来展示。设计并测试了六种提示模板。这些模板结合了已有的提示工程模式，每个提示强调不同的教育策略。通过一种类似淘汰赛的评估框架来比较提示模板，该框架可以适应其他教育应用。该淘汰赛采用了Glicko2评分系统，八名评委在三个维度上评估问题对：格式、对话支持和对学习者的适宜性。数据来自120次真实的用户交互，分布在三个不同的教育部署中。结果显示，一个与策略性阅读相关的提示在一对一比较中胜出的概率从81%到100%不等。该提示结合了角色和上下文管理模式，旨在支持元认知学习策略，如自我导向学习。该方法展示了教育技术研究人员如何系统地评估和改进提示设计，从经验性的提示工程转向基于证据的提示开发，以应用于教育应用。

Summary / 总结

This study aims to develop evidence-based methods for evaluating LLM prompts in educational applications. Six prompt templates were designed and tested, each emphasizing different pedagogical strategies. A tournament-style evaluation framework using the Glicko2 rating system was employed by eight judges to assess the prompts across format, dialogue support, and learner appropriateness. The strategic reading prompt, which incorporated persona and context manager patterns, outperformed other templates with win probabilities ranging from 81% to 100%. This method demonstrates a systematic approach for improving prompt designs in educational technology research.

本研究旨在为教育应用中的LLM提示评估开发基于证据的方法。设计并测试了六种提示模板，每种模板强调不同的教学策略。采用使用Glicko2评分系统的锦标赛式评估框架，八位评委从格式、对话支持和学习者适宜性三个方面评估问题。战略性阅读提示结合了角色和上下文管理模式，在一对一比较中表现最佳，胜率从81%到100%不等。这展示了教育技术研究人员如何系统地评估和改进提示设计，从经验性的提示工程转向基于证据的提示开发。
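For readers unfamiliar with Glicko-2, a pairwise win probability can be derived from two ratings with the system's expected-outcome formula. The sketch below implements only that formula (not the full rating update), and the example ratings are hypothetical rather than values from the paper; the abstract does not state how its 81% to 100% win probabilities were computed.

```python
import math

SCALE = 173.7178  # Glicko-2 conversion factor between the two rating scales

def g(phi: float) -> float:
    """Dampening factor that shrinks the rating gap when the opponent is uncertain."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def expected_score(rating: float, opp_rating: float, opp_rd: float) -> float:
    """Glicko-2 expected outcome for a player against a single opponent."""
    mu = (rating - 1500.0) / SCALE
    mu_j = (opp_rating - 1500.0) / SCALE
    phi_j = opp_rd / SCALE
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))

# Hypothetical ratings for two prompt templates, not values from the paper.
print(round(expected_score(1700, 1500, 50), 3))
```

A larger opponent rating deviation pulls the expected score toward 0.5, so comparisons against templates with few judged pairs are treated as less decisive.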

ViSymRe: Vision-guided Multimodal Symbolic Regression

Authors: Da Li, Junping Yin, Jin Xu, Xinxin Li, Juan Zhang

First: 2024-12-15T10:05:31+00:00 · Latest: 2026-01-22T17:29:53+00:00

Abstract

Extracting simple mathematical expressions from an observational dataset to describe complex natural phenomena is one of the core objectives of artificial intelligence (AI). This field is known as symbolic regression (SR). Traditional SR models are based on genetic programming (GP) or reinforcement learning (RL) and face well-known challenges, such as low efficiency and overfitting. Recent studies have integrated SR with large language models (LLMs), enabling fast zero-shot inference by learning mappings from millions of dataset-expression pairs. However, since the input and output are inherently different modalities, such models often struggle to converge effectively. In this paper, we introduce ViSymRe, a vision-guided multimodal SR model that incorporates a third resource, the expression graph, to bridge the modality gap. Different from traditional multimodal models, ViSymRe is trained to extract vision, termed virtual vision, from datasets, without relying on the global availability of expression graphs, which addresses the essential challenge of visual SR, i.e., expression graphs are not available during inference. Evaluation results on multiple mainstream benchmarks show that ViSymRe achieves more competitive performance than the state-of-the-art dataset-only baselines. The expressions predicted by ViSymRe not only fit the dataset well but are also simple and structurally accurate, goals that SR models strive to achieve.

中文标题/摘要

标题：ViSymRe：视觉引导的多模态符号回归

从观测数据集中提取简单的数学表达式以描述复杂的自然现象是人工智能（AI）的核心目标之一。这一领域被称为符号回归（SR）。传统的SR模型基于遗传编程（GP）或强化学习（RL），面临着低效率和过拟合等已知挑战。最近的研究将SR与大型语言模型（LLMs）结合，通过学习数百万数据集-表达式对之间的映射，实现快速零样本推理。然而，由于输入和输出是固有的不同模态，这类模型往往难以有效收敛。在本文中，我们引入了ViSymRe，这是一种视觉引导的多模态SR模型，它结合了表达图这一资源来弥合模态差距。与传统的多模态模型不同，ViSymRe被训练从数据集中提取所谓的虚拟视觉，而无需依赖全局可用的表达图，这解决了视觉SR的基本挑战，即在推理过程中表达图不可用。在多个主流基准上的评估结果表明，ViSymRe在与数据集基线相比时，实现了更优的性能。ViSymRe预测的表达式不仅很好地拟合了数据集，而且简单且结构准确，这是SR模型追求的目标。

Summary / 总结

This paper introduces ViSymRe, a vision-guided multimodal symbolic regression model that addresses the challenges of traditional symbolic regression methods by incorporating expression graphs. Unlike previous multimodal models, ViSymRe is trained to extract 'virtual vision' from datasets without requiring globally available expression graphs, so it can still be used when no expression graph exists at inference time. Experimental results on multiple benchmarks show that ViSymRe outperforms state-of-the-art dataset-only baselines, producing simple and structurally accurate expressions that fit the datasets well.

该论文提出了ViSymRe，一种基于视觉的多模态符号回归模型，通过从数据集中提取虚拟视觉来解决推理过程中表达图不可用的问题，而无需在训练时依赖全局表达图。与传统多模态模型不同，ViSymRe 不需要全局表达图进行训练。实验结果表明，ViSymRe 在多个基准测试中优于最先进的基于数据集的基线模型，提供了既符合数据集又简单且结构准确的表达式。

Replicating Human Motivated Reasoning Studies with LLMs

Authors: Neeley Pate, Adiba Mahbub Proma, Hangfeng He, James N. Druckman, Daniel Molden, Gourab Ghoshal, Ehsan Hoque

First: 2026-01-22T17:29:07+00:00 · Latest: 2026-01-22T17:29:07+00:00

Abstract

Motivated reasoning -- the idea that individuals processing information may be motivated to reach a certain conclusion, whether it be accurate or predetermined -- has been well-explored as a human phenomenon. However, it is unclear whether base LLMs mimic these motivational changes. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as smaller standard deviations and inaccurate argument strength assessments. We emphasize the importance of these findings for researchers using LLMs to automate tasks such as survey data collection and argument assessment.

中文标题/摘要

标题：使用大语言模型复制人类动机性推理研究

动机性推理——个体在处理信息时可能被动机驱使以达到某种结论，无论结论是否准确或预先确定——作为人类现象已经被广泛研究。然而，尚不清楚基础大语言模型是否会模仿这些动机性变化。通过复制4项先前的政治动机性推理研究，我们发现基础大语言模型的行为与预期的人类行为不一致。此外，不同模型的基础大语言模型行为在某些方面存在相似性，如较小的标准差和不准确的论点强度评估。我们强调这些发现对于使用大语言模型自动化如调查数据收集和论点评估等任务的研究人员的重要性。

Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging

Authors: Alphaeus Dmonte, Vidhi Gupta, Daniel J Perry, Mark Arehart

First: 2026-01-22T17:28:24+00:00 · Latest: 2026-01-22T17:28:24+00:00

Abstract

Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets, confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.

中文标题/摘要

标题：通过语言特定模型合并提高训练效率并降低维护成本

针对特定任务的多语言大型语言模型（LLM）微调涉及使用包含所需所有语言示例的多语言数据集对模型进行训练。更新一个或多个支持的语言或添加对新语言的支持需要重新训练模型，这在计算上效率低下，并且会形成严重的维护瓶颈。最近关于合并多语言多任务模型的研究显示出了改进质量的前景，但其计算和维护效率尚未得到研究。在本工作中，我们首次从效率角度对这种合并策略进行了集中分析，评估了其在三个独立任务上的表现。我们展示了在保持质量一致性的前提下，这种合并方法将初始训练时间减少了高达50%。我们还展示了在模型维护过程中，更新个别语言并重新合并可以将训练成本降低超过60%，与重新训练整个多语言模型相比。我们在公共数据集和专有行业数据集上进行了验证，证明该方法不仅适用于学术研究中已经研究过的场景，也适用于工业应用案例。

Summary / 总结

This work addresses the inefficiencies in updating and maintaining task-specific multilingual large language models by proposing a model merging strategy. The study evaluates this approach across three tasks and finds that it reduces initial training time by up to 50% while maintaining quality. Additionally, updating individual languages and re-merging the model reduces training costs by over 60% compared to retraining the full multilingual model.

研究旨在通过合并语言特定模型来提高训练效率并降低维护成本。研究评估了该合并策略在三个任务中的表现，并发现它可以使初始训练时间减少高达50%，同时保持质量不变。此外，更新个别语言并重新合并可以将训练成本降低超过60%，与重新训练整个多语言模型相比，在公共和专有数据集上均适用。
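
The abstract does not say which merging operator is used, so the following is only a minimal sketch under an assumed scheme: each language-specific model contributes a "task vector" (its weights minus the shared base weights), and the merged model is the base plus the average of those vectors. The parameter names, languages, and values are hypothetical. The point of the sketch is the maintenance workflow: updating one language only requires retraining that language's model and re-running the cheap merge.

```python
from typing import Dict, List, Optional

Weights = Dict[str, List[float]]  # parameter name -> flat list of values


def task_vector(base: Weights, tuned: Weights) -> Weights:
    """Difference between a language-specific model and the shared base model."""
    return {name: [t - b for t, b in zip(tuned[name], base[name])] for name in base}


def merge(base: Weights, language_models: Dict[str, Weights],
          scale: Optional[float] = None) -> Weights:
    """Add the scaled task vectors of all language models to the base weights.

    By default the vectors are averaged (scale = 1 / number of languages).
    This averaging rule is an assumption made for illustration.
    """
    if scale is None:
        scale = 1.0 / len(language_models)
    merged = {name: list(values) for name, values in base.items()}
    for tuned in language_models.values():
        delta = task_vector(base, tuned)
        for name in merged:
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta[name])]
    return merged


if __name__ == "__main__":
    base = {"w": [0.0, 0.0]}
    models = {"en": {"w": [1.0, 0.0]}, "de": {"w": [0.0, 2.0]}}
    print(merge(base, models))
    # Maintenance step: only the German model is retrained, then everything is re-merged.
    models["de"] = {"w": [0.0, 4.0]}
    print(merge(base, models))
```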

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Authors: Tingyu Song, Yanzhao Zhang, Mingxin Li, Zhuoning Guo, Dingkun Long, Pengjun Xie, Siyue Zhang, Yilun Zhao, Shu Wu

First: 2026-01-22T17:26:52+00:00 · Latest: 2026-01-22T17:26:52+00:00

Comments: Under review

Abstract

Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.

中文标题/摘要

标题：重新思考组合图像检索评估：来自图像编辑的细粒度基准

组合图像检索（CIR）是多模态理解中的一个关键且复杂的任务。当前的CIR基准通常包含有限的查询类别，无法捕捉到现实场景中的多样化需求。为了弥合这一评估差距，我们利用图像编辑实现对修改类型和内容的精确控制，从而构建了一个广泛的查询合成管道。利用此管道，我们构建了EDIR，这是一个新颖的细粒度CIR基准。EDIR包含5,000个高质量的查询，分布在五个主要类别和十五个子类别中。我们对13种多模态嵌入模型的全面评估揭示了一个显著的能力差距；即使是最先进的模型（如RzenEmbed和GME）也难以在所有子类别中表现一致，突显了我们基准的严格性。通过对比分析，我们进一步揭示了现有基准的内在局限性，如模态偏差和类别覆盖不足。此外，一个领域内训练实验证明了我们基准的可行性。该实验通过区分可以通过目标数据解决的类别和暴露当前模型架构固有限制的类别，阐明了任务挑战。

Summary / 总结

The paper addresses the limitations of current Composed Image Retrieval (CIR) benchmarks by introducing EDIR, a fine-grained benchmark created through image editing. The method synthesizes queries across five main categories and fifteen subcategories and uses them to evaluate 13 multimodal embedding models. The evaluation reveals significant capability gaps: even state-of-the-art models such as RzenEmbed and GME fail to perform consistently across all subcategories. The findings highlight the need for more diverse and precise benchmarks to accurately assess CIR models.

论文通过引入基于图像编辑的细粒度基准EDIR，解决了当前CIR基准的局限性。该方法通过合成跨多种类别的查询来评估13种多模态嵌入模型，揭示了显著的能力差距，特别是在处理各种子类别时。研究结果强调了需要更全面的基准，并表明现有模型在某些类别上的表现不佳，这归因于当前模型架构的固有限制。

Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add

Authors: Zhengchi Ma, Anru R. Zhang

First: 2026-01-22T17:15:26+00:00 · Latest: 2026-01-22T17:15:26+00:00

Abstract

Imbalanced classification, where one class is observed far less frequently than the other, often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic examples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples and evaluated under the balanced population risk. Our theory shows that synthetic data is not always beneficial. In a "local symmetry" regime, imbalance is not the dominant source of error near the balanced optimum, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help (a "local asymmetry" regime), the optimal synthetic size depends on generator accuracy and on whether the generator's residual mismatch is directionally aligned with the intrinsic majority-minority shift. This structure can make the best synthetic size deviate from naive full balancing, sometimes by a small refinement and sometimes substantially when generator bias is systematic. Practically, we recommend Validation-Tuned Synthetic Size (VTSS): select the synthetic size by minimizing balanced validation loss over a range centered near the fully balanced baseline, while allowing meaningful departures when the data indicate them. Simulations and a real sepsis prediction study support the theory and illustrate when synthetic augmentation helps, when it cannot, and how to tune its quantity effectively.

中文标题/摘要

标题：不平衡学习中的合成增强：何时有益，何时有害，以及应添加多少

不平衡分类中，一个类别远比另一个类别出现的频率低，这通常会导致标准训练程序优先处理多数类，而对稀有的但重要的情况表现不佳。经典的广泛使用的解决方法是通过合成样本来增加少数类的数量，但两个基本问题仍然没有得到解决：合成增强何时真正有益，以及应生成多少合成样本？我们为不平衡学习中的合成增强开发了一个统一的统计框架，研究在平衡人口风险下使用不平衡数据和合成少数类样本进行训练的模型。我们的理论表明，合成数据并不总是有益的。在“局部对称”状态下，不平衡不是接近平衡最优解时的主要误差来源，因此增加合成样本无法提高学习速率，甚至可能通过放大生成器不匹配而降低性能。当增强可以提供帮助（“局部不对称”状态）时，最佳合成样本数量取决于生成器的准确性以及生成器的残差不匹配是否与固有的多数类-少数类转移方向一致。这种结构可以使最佳合成样本数量偏离简单的完全平衡，有时仅需细微调整，有时则因生成器偏差系统性而显著不同。实践中，我们推荐验证调优合成样本量（VTSS）：通过在接近完全平衡基线的范围内最小化平衡验证损失来选择合成样本量，同时允许在数据表明时有意义的偏离。模拟和实际的脓毒症预测研究支持该理论，并说明了合成增强何时有效，何时无效，以及如何有效调整其数量。

Summary / 总结

The paper addresses the issue of synthetic augmentation in imbalanced learning, where the minority class is underrepresented. It develops a unified statistical framework to determine when synthetic augmentation helps and when it hurts, and how many synthetic samples should be generated. The study finds that synthetic data can degrade performance in the 'local symmetry' regime but can improve it in the 'local asymmetry' regime, where the optimal synthetic size depends on the generator's accuracy and the alignment of its residual mismatch with the intrinsic data shift. A practical recommendation, Validation-Tuned Synthetic Size (VTSS), is provided to effectively tune the synthetic size based on balanced validation loss.

论文探讨了在不平衡学习中合成增强的问题，即通过增加合成少数类样本来改善模型性能。研究建立了一个统一的统计框架来确定何时合成增强有助于提高性能以及应生成多少合成样本。研究发现，在“局部对称”状态下，合成数据可能会降低性能，但在“局部不对称”状态下，合成数据可以改善性能，其中最佳合成样本数量取决于生成器的准确性及其失配的方向。作者建议使用验证调优合成大小（VTSS）方法，根据平衡验证损失选择合成样本数量，并在数据表明需要时允许有意义的偏离。
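
The VTSS recommendation can be read as a small grid search. The sketch below is one possible reading under stated assumptions: the candidate sizes are multiples of the fully balancing size (majority count minus minority count), and `validation_loss` is a hypothetical user-supplied callable that trains a model with the given number of synthetic minority samples and returns its balanced validation loss. The multiplier grid and function names are illustrative and do not come from the paper.

```python
from typing import Callable, Dict, Sequence, Tuple


def balanced_loss(loss_minority: float, loss_majority: float) -> float:
    """Balanced risk: both classes weighted equally, regardless of their frequency."""
    return 0.5 * (loss_minority + loss_majority)


def select_synthetic_size(
    n_majority: int,
    n_minority: int,
    validation_loss: Callable[[int], float],
    multipliers: Sequence[float] = (0.5, 0.75, 1.0, 1.25, 1.5),
) -> Tuple[int, float]:
    """Pick the synthetic sample count with the lowest balanced validation loss.

    The grid is centered on the fully balancing size but allows departures
    from it in both directions.
    """
    full_balance = max(n_majority - n_minority, 0)
    candidates = sorted({int(round(m * full_balance)) for m in multipliers})
    losses: Dict[int, float] = {size: validation_loss(size) for size in candidates}
    best = min(losses, key=losses.get)
    return best, losses[best]


if __name__ == "__main__":
    # Toy stand-in for "train with s synthetic samples, return balanced validation loss".
    toy = lambda s: balanced_loss((s - 700) ** 2 / 1e4, 1.0)
    print(select_synthetic_size(n_majority=1000, n_minority=100, validation_loss=toy))
```

In practice each call to `validation_loss` is a full training run, so a coarse grid of this kind keeps the tuning cost to a handful of fits.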

AudioMotionBench: Evaluating Auditory Motion Perception in Audio LLMs

Authors: Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang

First: 2025-11-17T11:45:41+00:00 · Latest: 2026-01-22T17:11:50+00:00

Abstract

Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current LALMs. To investigate this issue, we introduce AudioMotionBench, the first benchmark explicitly designed to evaluate auditory motion understanding. AudioMotionBench is a controlled question-answering benchmark that evaluates whether LALMs can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.

中文标题/摘要

标题：AudioMotionBench：评估音频LLMs的听觉运动感知

大型音频-语言模型（LALMs）在语音识别、音频描述和听觉问答方面最近取得了令人印象深刻的进展。然而，这些模型是否能够感知空间动态，特别是声源的运动，仍然不清楚。在本文中，我们揭示了当前LALMs在运动感知方面存在系统性的缺陷。为了研究这一问题，我们引入了AudioMotionBench，这是第一个明确设计用于评估听觉运动理解的基准。AudioMotionBench引入了一个受控的问答基准，旨在评估音频-语言模型（LALMs）是否能够从双耳音频中推断出移动声源的方向和轨迹。全面的定量和定性分析表明，当前的模型在可靠地识别运动线索或区分方向模式方面存在困难。平均准确率低于50%，突显了听觉空间推理的基本局限性。我们的研究突显了人类和模型在听觉空间推理方面的根本差距，为增强未来音频-语言模型的空间认知提供了诊断工具和新的见解。

Summary / 总结

The research aims to evaluate the ability of Large Audio-Language Models (LALMs) to perceive spatial dynamics, particularly the motion of sound sources. To address this, the study introduces AudioMotionBench, a benchmark for evaluating auditory motion understanding. The results show that current models struggle to recognize motion cues and distinguish directional patterns, with average accuracy below 50%, indicating a significant limitation in auditory spatial reasoning.

研究评估了大型音频语言模型（LALMs）在感知听觉运动方面的能力，发现当前模型存在系统性的缺陷。作者引入了AudioMotionBench，这是一个评估听觉运动理解的基准，结果显示模型难以识别运动线索并区分方向模式，平均准确率低于50%。这揭示了这些模型在听觉空间推理方面的根本局限性。

Enhanced Climbing Image Nudged Elastic Band method with Hessian Eigenmode Alignment

Authors: Rohit Goswami, Miha Gunde, Hannes Jónsson

First: 2026-01-19T00:21:52+00:00 · Latest: 2026-01-22T17:11:23+00:00

Comments: 25 pages, 11 figures

Abstract

Accurate determination of transition states is central to an understanding of reaction kinetics. Double-endpoint methods where both initial and final states are specified, such as the climbing image nudged elastic band (CI-NEB), identify the minimum energy path between the two and thereby the saddle point on the energy surface that is relevant for the given transition, thus providing an estimate of the transition state within the harmonic approximation of transition state theory. Such calculations can, however, incur high computational costs and may stagnate on exceptionally flat or rough energy surfaces. Conversely, methods that only require specification of an initial set of atomic coordinates, such as the minimum mode following (MMF) method, offer efficiency but can converge on saddle points that are not relevant for the transition of interest. Here, we present an adaptive hybrid algorithm that integrates the CI-NEB with the MMF method so as to get faster convergence to the relevant saddle point. The method is benchmarked for the Baker-Chan (BC) saddle point test set using the PET-MAD machine-learned potential as well as 59 transitions of a heptamer island on Pt(111) from the OptBench benchmark set. A Bayesian analysis of the performance shows a median reduction in energy and force calculations of 46% [95% CrI: -55%, -37%] relative to CI-NEB for the BC set, while a 28% reduction is found for the transitions of the heptamer island. These results establish this hybrid method as a highly effective tool for high-throughput automated chemical discovery of atomic rearrangements.

中文标题/摘要

标题：增强攀爬图像拉伸带方法与Hessian特征模式对齐

准确确定过渡态是理解反应动力学的关键。双端点方法，如攀爬图像拉伸带（CI-NEB）方法，通过指定初始和最终状态来识别两者之间的最低能量路径，从而确定与给定过渡相关的鞍点，提供过渡态的谐振子近似估计。然而，此类计算可能会产生高昂的计算成本，并可能在异常平坦或粗糙的能量表面上停滞不前。相比之下，仅需指定一组原子坐标的方法，如最小模式跟随（MMF）方法，虽然效率更高，但可能会收敛到与所需过渡无关的鞍点。在此，我们提出了一种自适应混合算法，将CI-NEB方法与MMF方法结合，以更快地收敛到相关鞍点。该方法使用PET-MAD机器学习势能对Baker-Chan（BC）鞍点测试集进行了基准测试，并对Pt(111)上七聚岛的59个过渡进行了基准测试。贝叶斯分析表明，相对于BC集的CI-NEB，能量和力的计算中位数减少了46% [95% CrI: -55%，-37%]，而七聚岛的过渡中减少了28%。这些结果确立了该混合方法作为高效工具，用于高通量自动化原子重排的化学发现。

Summary / 总结

The research aims to improve the accuracy and efficiency of determining transition states in chemical reactions. The authors developed an adaptive hybrid algorithm combining the climbing image nudged elastic band (CI-NEB) method with the minimum mode following (MMF) method. This hybrid approach was benchmarked using the Baker-Chan (BC) saddle point test set and 59 transitions of a heptamer island on Pt(111). The results showed a median reduction of 46% in energy and force calculations for the BC set and a 28% reduction for the heptamer island transitions, demonstrating the hybrid method's effectiveness for high-throughput chemical discovery.

研究旨在提高化学反应中过渡态识别的准确性和效率，这对于理解反应动力学至关重要。该方法将爬坡图像拉伸能带（CI-NEB）与最小模式跟随（MMF）方法相结合，以更快地收敛到相关鞍点。关键实验结果表明，对于Baker-Chan（BC）集，能量和力的计算减少了46%，而对于Pt(111)上的七聚岛过渡，减少了28%，这表明该混合方法在高通量化学发现中的有效性。
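
Both ingredients of the hybrid share one operation: the force component along a chosen direction is inverted so that the system climbs along that direction while relaxing in all others. In CI-NEB the direction is the path tangent at the climbing image; in MMF it is the lowest-curvature Hessian eigenmode. The sketch below shows only this textbook projection and a simple alignment measure between the two directions. How the published algorithm decides when to switch between the two methods is not described in the abstract, so the `alignment` function is merely a diagnostic suggested by the title, not the authors' criterion.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> List[float]:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def invert_along(force: Vector, direction: Vector) -> List[float]:
    """Return force - 2 (force . u) u, where u is the unit vector along `direction`."""
    u = normalize(direction)
    projection = dot(force, u)
    return [f - 2.0 * projection * ui for f, ui in zip(force, u)]


def climbing_image_force(force: Vector, tangent: Vector) -> List[float]:
    """CI-NEB: invert the true force along the path tangent at the climbing image."""
    return invert_along(force, tangent)


def min_mode_force(force: Vector, min_mode: Vector) -> List[float]:
    """MMF: invert the true force along the lowest-curvature eigenmode."""
    return invert_along(force, min_mode)


def alignment(tangent: Vector, min_mode: Vector) -> float:
    """Absolute cosine between tangent and eigenmode (the sign of a mode is arbitrary)."""
    return abs(dot(normalize(tangent), normalize(min_mode)))


if __name__ == "__main__":
    print(climbing_image_force([1.0, 2.0], [1.0, 0.0]))
    print(alignment([1.0, 0.0], [-2.0, 0.0]))
```

When the alignment is close to 1, the two effective forces coincide, which is the situation in which handing over from the band to a single-image mode-following search is least likely to drift toward an unrelated saddle point.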

GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

Authors: Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi

First: 2025-05-24T15:57:07+00:00 · Latest: 2026-01-22T17:10:05+00:00

Comments: Accepted by NeurIPS 2025

Abstract

Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies, which is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that enables invertibility via alternating updates, resolving log-likelihood computation barriers. Furthermore, we also use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.

中文标题/摘要

标题：GenPO：生成扩散模型与在线强化学习的结合

强化学习（RL）的最新进展展示了基于生成扩散的策略的强大探索能力和多模态性。虽然在离线RL和离策RL设置中取得了显著进展，但将扩散策略整合到像PPO这样的在线框架中仍然鲜有探索。鉴于大规模并行GPU加速模拟器（如IsaacLab）的广泛应用，这些模拟器优化了在线RL算法，使复杂机器人任务的快速训练成为可能，这一差距尤为重要。一个关键挑战在于在扩散策略下计算状态-动作对数似然，对于高斯策略来说是直接的，但对于基于流的模型来说是不可行的，因为不可逆的正向-反向过程和离散化误差（例如Euler-Maruyama近似）导致了不可解性。为了解决这一问题，我们提出了GenPO，一种利用精确扩散反演构建可逆动作映射的生成策略优化框架。GenPO引入了一种新颖的双虚拟动作机制，通过交替更新实现可逆性，解决了对数似然计算障碍。此外，我们还使用动作对数似然进行无偏熵和KL散度估计，使KL自适应学习率和熵正则化能够在在线更新中实现。在八个IsaacLab基准测试上的广泛实验，包括腿足运动（Ant、Humanoid、Anymal-D、Unitree H1、Go2）、灵巧操作（Shadow Hand）、空中控制（Quadcopter）和机器人臂任务（Franka），证明了GenPO优于现有RL基线。值得注意的是，GenPO是第一个成功将扩散策略整合到在线RL中的方法，开启了其在大规模并行化训练和实际机器人部署中的潜力。

Summary / 总结

The paper introduces GenPO, a generative policy optimization framework that integrates diffusion policies into on-policy reinforcement learning (RL) frameworks like PPO. It addresses the challenge of computing state-action log-likelihoods for flow-based models by proposing a novel doubled dummy action mechanism. Extensive experiments on various IsaacLab benchmarks show that GenPO outperforms existing RL baselines and is the first method to successfully integrate diffusion policies into on-policy RL, enabling rapid training and real-world robotic deployment.

GenPO 是一种生成性策略优化框架，将生成扩散模型集成到在线强化学习（RL）中，以解决流基模型的对数似然计算难题。它引入了一种双重虚拟动作机制以实现可逆性，并使用动作对数似然进行无偏熵和KL散度估计，允许KL自适应学习率和熵正则化。在八个IsaacLab基准测试上的实验表明，GenPO 在性能上优于现有RL基线，是首个成功将扩散策略集成到在线RL中的方法，能够实现大规模并行化训练和实际机器人部署。
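
The abstract says that invertibility is obtained through alternating updates on a doubled action. The exact GenPO update rule is not given here, so the following is only an analogy under assumed details: an additive coupling scheme in which two halves of a state are updated in turn, each using only the other half. Such a map can be inverted exactly by undoing the updates in reverse order, and each update has unit Jacobian determinant, so the log-likelihood of an output equals the base log-density of its preimage. The functions `f` and `h` are arbitrary stand-ins for learned networks.

```python
import math
from typing import Tuple


def f(v: float) -> float:
    """Stand-in for a learned update applied to the first half."""
    return math.tanh(v)


def h(v: float) -> float:
    """Stand-in for a learned update applied to the second half."""
    return 0.5 * math.sin(v)


def forward(x: float, y: float, steps: int = 4) -> Tuple[float, float]:
    """Alternate updates: x uses only y, then y uses only the new x."""
    for _ in range(steps):
        x = x + f(y)
        y = y + h(x)
    return x, y


def inverse(x: float, y: float, steps: int = 4) -> Tuple[float, float]:
    """Undo the alternating updates in reverse order."""
    for _ in range(steps):
        y = y - h(x)
        x = x - f(y)
    return x, y


def log_prob(x: float, y: float, steps: int = 4) -> float:
    """Log-density of (x, y) when the input is standard normal noise.

    Every update is a shear with unit Jacobian determinant, so no
    log-determinant correction is needed.
    """
    x0, y0 = inverse(x, y, steps)
    return -0.5 * (x0 * x0 + y0 * y0) - math.log(2.0 * math.pi)


if __name__ == "__main__":
    out = forward(0.3, -1.2)
    print(out, inverse(*out), log_prob(*out))
```

An exact log-likelihood of this kind is what an on-policy method such as PPO needs for its probability ratio, which is why the invertibility of the action mapping matters.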