arXiv Paper Digest / arXiv 论文速递

Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting

Authors: Ananta R. Bhattarai, Helge Rhodin

First: 2025-12-19T18:59:56+00:00 · Latest: 2025-12-19T18:59:56+00:00

Abstract

Monocular depth estimation remains challenging as recent foundation models, such as Depth Anything V2 (DA-V2), struggle with real-world images that are far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing DA-V2 with the powerful priors of large-scale 2D diffusion models. Our method performs label-free refinement directly on the input image by re-lighting predicted depth maps and augmenting the input. This re-synthesis method replaces classical photometric reconstruction by leveraging shape from shading (SfS) cues in a new, generative context with Score Distillation Sampling (SDS). To prevent optimization collapse, our framework employs a targeted optimization strategy: rather than optimizing depth directly or fine-tuning the full model, we freeze the encoder and only update intermediate embeddings while also fine-tuning the decoder. Across diverse benchmarks, Re-Depth Anything yields substantial gains in depth accuracy and realism over the DA-V2, showcasing new avenues for self-supervision by augmenting geometric reasoning.

中文标题/摘要

标题：任何内容的重新深度化：通过自我监督重新照明的测试时深度细化

单目深度估计仍然具有挑战性，因为最近的基础模型，如深度一切V2（DA-V2），在与训练分布相差甚远的现实世界图像上表现不佳。我们提出了重新深度一切（Re-Depth Anything），这是一种测试时的自我监督框架，通过将DA-V2与大规模2D扩散模型的强大先验知识融合，来弥合这一领域差距。我们的方法通过重新照明预测的深度图并在输入图像上直接进行标签自由细化。这种重新合成方法通过利用形状从阴影（SfS）线索，在新的生成性上下文中利用分数蒸馏采样（SDS）来替代经典的光度重建。为了防止优化崩溃，我们的框架采用了一种有针对性的优化策略：我们冻结编码器，只更新中间嵌入，并微调解码器。在多种基准测试中，重新深度一切在深度准确性和现实性方面显著优于DA-V2，展示了通过增强几何推理来实现自我监督的新途径。
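
As a rough illustration of the shape-from-shading cue the abstract refers to, the sketch below re-lights a depth map with a simple Lambertian model: surface normals come from finite differences of the depth, and each pixel's brightness is the clamped dot product of the normal with a light direction. This is a textbook shading model chosen for illustration, not the paper's renderer; the function name `lambertian_relight`, the orthographic height-field assumption, and the directional light are all assumptions, and the SDS loss itself is not shown.

```python
import math

def lambertian_relight(depth, light):
    """Shade a depth map (list of rows) under a directional light using n . l.

    Hypothetical helper: treats depth as a height field z(x, y) under an
    orthographic camera, which is an assumption made for this sketch.
    """
    h, w = len(depth), len(depth[0])
    norm_l = math.sqrt(sum(c * c for c in light))
    lx, ly, lz = (c / norm_l for c in light)
    shading = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Central differences in the interior, one-sided at the borders.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            # Unnormalized normal of the surface z = depth(x, y).
            nx, ny, nz = -dzdx, -dzdy, 1.0
            n_len = math.sqrt(nx * nx + ny * ny + nz * nz)
            # Lambertian term, clamped so back-facing pixels are black.
            shading[y][x] = max(0.0, (nx * lx + ny * ly + nz * lz) / n_len)
    return shading

flat = [[2.0] * 3 for _ in range(3)]                   # constant depth
ramp = [[float(x) for x in range(3)] for _ in range(3)]  # depth grows with x
flat_shading = lambertian_relight(flat, (0.0, 0.0, 1.0))
ramp_shading = lambertian_relight(ramp, (0.0, 0.0, 1.0))
```

Because the shading depends only on depth gradients, an error in predicted geometry shows up as an implausible re-lit image, which is what makes a generative prior usable as a supervision signal in this kind of setup.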

Dexterous World Models

Authors: Byungjun Kim, Taeksoo Kim, Junyoung Lee, Hanbyul Joo

First: 2025-12-19T18:59:51+00:00 · Latest: 2025-12-19T18:59:51+00:00

Comments: Project Page: snuvclab.github.io/dwm

Abstract

Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static and are limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues to model action-conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset. Synthetic egocentric interactions provide fully aligned supervision for joint locomotion and manipulation learning, while fixed-camera real-world videos contribute diverse and realistic object dynamics. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects, while maintaining camera and scene consistency. This framework represents a first step toward video diffusion-based interactive digital twins and enables embodied simulation from egocentric actions.

中文标题/摘要

标题：灵巧的世界模型

近期在三维重建方面的进展使得从日常环境中创建逼真的数字孪生变得容易。然而，当前的数字孪生仍然主要保持静态，仅限于导航和视图合成，缺乏具身互动性。为弥合这一差距，我们引入了灵巧的世界模型（DWM），这是一种基于场景-动作条件的视频扩散框架，用于建模灵巧的人类动作如何引起静态3D场景中的动态变化。给定一个静态3D场景渲染和第一人称手部运动序列，DWM生成时间上连贯的视频，描绘可能的人-场景互动。我们的方法通过（1）遵循指定相机轨迹的静态场景渲染来确保空间一致性，以及（2）包含几何和运动线索的第一人称手部网格渲染来直接建模动作条件下的动态变化来条件化视频生成。为了训练DWM，我们构建了一个混合交互视频数据集。合成的第一人称交互提供了关节运动和操作学习的完全对齐监督，而固定相机的现实世界视频则提供了多样且真实的物体动力学。实验表明，DWM能够实现真实且物理上合理的互动，如抓取、开启和移动物体，同时保持相机和场景的一致性。该框架代表了基于视频扩散的交互数字孪生的第一步，并能够从第一人称动作中实现具身模拟。

Summary / 总结

Dexterous World Model (DWM) is a video diffusion framework that models dynamic changes in static 3D scenes based on human actions. Given a static 3D scene and an egocentric hand motion sequence, DWM generates coherent videos of plausible human-scene interactions. The model conditions video generation on static scene renderings and egocentric hand mesh renderings, ensuring spatial consistency and action-conditioned dynamics. Experiments show that DWM enables realistic interactions like grasping and moving objects while maintaining scene and camera consistency.

Dexterous World Model (DWM) 是一个场景-动作条件化的视频扩散框架，用于建模人类动作在静态 3D 场景中引起的动态变化。给定一个静态 3D 场景和一个第一人称手部运动序列，DWM 生成时间上连贯的视频，展示合理的人-场景交互。该模型通过静态场景渲染和第一人称手部网格渲染来条件化视频生成，确保空间一致性并建模动作条件下的动态变化。实验表明，DWM 能够实现真实的、物理上合理的交互，如抓取、开启和移动物体，同时保持摄像机和场景的一致性。

When Reasoning Meets Its Laws

Authors: Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, Huan Zhang

First: 2025-12-19T18:59:11+00:00 · Latest: 2025-12-19T18:59:11+00:00

Abstract

Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose compute law with the hypothesis that the reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since the question complexity is difficult to quantify in practice, we examine these hypotheses by two properties of the laws, monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: https://lore-project.github.io/

中文标题/摘要

标题：当推理遇到其法则

尽管大型推理模型（LRMs）表现出色，但其推理行为往往违背直觉，导致推理能力不足。为理论化期望的推理行为，本文提出了推理法则（LoRe），这是一种统一框架，用于描述LRMs中的内在推理模式。我们首先提出了计算法则，假设推理计算应与问题复杂性成线性关系。除了计算，我们还通过补充准确性法则扩展了LoRe。由于实际中问题复杂性难以量化，我们通过法则的两个可衡量属性单调性和组合性来检验这些假设。因此，我们引入了LoRe-Bench基准，系统地测量大型推理模型的这两个可衡量属性。评估结果显示，大多数推理模型表现出合理的单调性但缺乏组合性。为此，我们开发了一种有效的微调方法，以确保计算法则的组合性。广泛的实证研究表明，更好地遵守计算法则在多个基准上持续提高了推理性能，并揭示了属性和法则之间的协同效应。项目页面：https://lore-project.github.io/

Summary / 总结

This paper aims to address the counterintuitive reasoning behaviors of Large Reasoning Models (LRMs) by introducing the Laws of Reasoning (LoRe), a framework that characterizes intrinsic reasoning patterns. The authors propose a compute law suggesting reasoning compute should scale linearly with question complexity and extend it with an accuracy law. They evaluate these laws using LoRe-Bench, a benchmark that measures monotonicity and compositionality. The evaluation shows that most LRMs exhibit monotonicity but lack compositionality. The authors then develop a finetuning approach to enforce compute-law compositionality, demonstrating improved reasoning performance across multiple benchmarks and uncovering synergistic effects between properties and laws.

本文通过引入推理定律（LoRe）框架来解决大型推理模型（LRMs）的反直觉推理行为问题，该框架描述了内在的推理模式。作者提出了一个计算定律，假设推理计算应与问题复杂性成线性关系，并补充了一个准确度定律。他们使用LoRe-Bench基准来衡量单调性和组合性这两个可操作的属性。评估结果显示，大多数LRMs表现出合理的单调性但缺乏组合性。为了解决这个问题，作者开发了一种微调方法，以强化计算定律的组合性，从而在多个基准上提高了推理性能，并揭示了属性和定律之间的协同效应。
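
The two properties can be made concrete with a small sketch. It assumes compute is measured as the number of reasoning tokens per question, reads monotonicity as "compute does not drop when complexity rises", and reads compositionality as "compute for two independent questions asked together is close to the sum of their separate compute" (which is what a linear compute law would imply). These readings, the function names, and the toy numbers are assumptions; LoRe-Bench's actual metrics may differ.

```python
def monotonicity(compute_by_level):
    """Fraction of level pairs (i < j) where compute does not decrease.

    compute_by_level: token counts for variants of a question, ordered from
    the least to the most complex variant (hypothetical input format).
    """
    pairs = [
        (a, b)
        for i, a in enumerate(compute_by_level)
        for b in compute_by_level[i + 1:]
    ]
    return sum(b >= a for a, b in pairs) / len(pairs)

def compositionality_gap(single_a, single_b, composed):
    """Mean relative deviation of composed compute from the additive prediction."""
    gaps = [
        abs(c - (a + b)) / (a + b)
        for a, b, c in zip(single_a, single_b, composed)
    ]
    return sum(gaps) / len(gaps)

# Toy token counts, invented for illustration only.
mono_score = monotonicity([100, 150, 140, 300])
comp_gap = compositionality_gap([100, 200], [50, 100], [150, 240])
```

A monotonicity score near 1 and a compositionality gap near 0 would both be consistent with the linear compute law; the abstract reports that models tend to satisfy the first but not the second.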

Diffusion Forcing for Multi-Agent Interaction Sequence Modeling

Authors: Vongani H. Maluleke, Kie Horiuchi, Lea Wilken, Evonne Ng, Jitendra Malik, Angjoo Kanazawa

First: 2025-12-19T18:59:02+00:00 · Latest: 2025-12-19T18:59:02+00:00

Abstract

Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing motion generation methods are largely task-specific and do not generalize to flexible multi-agent generation. We introduce MAGNet (Multi-Agent Diffusion Forcing Transformer), a unified autoregressive diffusion framework for multi-agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic prediction, partner inpainting, and full multi-agent motion generation within a single model, and can autoregressively generate ultra-long sequences spanning hundreds of v. Building on Diffusion Forcing, we introduce key modifications that explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g., dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people, enabled by a scalable architecture that is agnostic to the number of agents. We refer readers to the supplemental video, where the temporal dynamics and spatial coordination of generated interactions are best appreciated. Project page: https://von31.github.io/MAGNet/

中文标题/摘要

标题：多智能体交互序列建模的扩散驱动

理解与生成多人群体间的互动是一个基本挑战，对机器人技术和社交计算有着广泛的影响。尽管人类在群体中自然地协调互动，但由于长时间跨度、强烈的智能体间依赖性和变化的群体规模，建模此类互动仍然困难重重。现有的运动生成方法大多针对特定任务，无法泛化到灵活的多智能体生成。我们提出了MAGNet（多智能体扩散驱动变换器），这是一种统一的自回归扩散框架，通过灵活的条件和采样支持广泛的交互任务。MAGNet在单一模型中执行二元预测、伙伴填充和完整的多智能体运动生成，并能够自回归生成超长序列，跨越数百个v。基于扩散驱动，我们引入了关键修改，明确在自回归去噪过程中建模智能体间的耦合，从而在智能体之间实现一致的协调。因此，MAGNet捕捉到了紧密同步的活动（如舞蹈、拳击）和松散结构的社会互动。我们的方法在二元基准测试中与专门方法表现相当，并自然地扩展到涉及三人或更多互动个体的多人（polyadic）场景，这得益于一种可扩展的架构，该架构不依赖于智能体的数量。我们建议读者参阅补充视频，其中最能体现生成互动的时间动态和空间协调性。项目页面：https://von31.github.io/MAGNet/

Summary / 总结

The research aims to model and generate multi-agent interactions, addressing challenges such as long temporal horizons, strong inter-agent dependencies, and variable group sizes. It introduces MAGNet, a unified autoregressive diffusion framework that supports a range of interaction tasks through flexible conditioning and sampling. Within a single model, MAGNet performs dyadic prediction, partner inpainting, and full multi-agent motion generation, generates ultra-long sequences, and captures both tightly synchronized and loosely structured interactions. It performs on par with specialized methods on dyadic benchmarks and extends naturally to polyadic scenarios with three or more people.

研究旨在建模和生成多智能体交互，解决长时间跨度和智能体间依赖性等挑战。引入了MAGNet，这是一种统一的自回归扩散框架，通过灵活的条件和采样支持各种交互任务。主要发现包括MAGNet能够生成连贯的多智能体运动序列，捕捉同步活动和社会互动，并且其架构可扩展以处理多个智能体而不影响基准测试中的性能。

Humanlike AI Design Increases Anthropomorphism but Yields Divergent Outcomes on Engagement and Trust Globally

Authors: Robin Schimmelpfennig, Mark Díaz, Vinodkumar Prabhakaran, Aida Davani

First: 2025-12-19T18:57:53+00:00 · Latest: 2025-12-19T18:57:53+00:00

Abstract

Over a billion users across the globe interact with AI systems engineered with increasing sophistication to mimic human traits. This shift has triggered urgent debate regarding Anthropomorphism, the attribution of human characteristics to synthetic agents, and its potential to induce misplaced trust or emotional dependency. However, the causal link between more humanlike AI design and subsequent effects on engagement and trust has not been tested in realistic human-AI interactions with a global user pool. Prevailing safety frameworks continue to rely on theoretical assumptions derived from Western populations, overlooking the global diversity of AI users. Here, we address these gaps through two large-scale cross-national experiments (N=3,500) across 10 diverse nations, involving real-time and open-ended interactions with an AI system. We find that when evaluating an AI's human-likeness, users focus less on the kind of theoretical aspects often cited in policy (e.g., sentience or consciousness), but rather applied, interactional cues like conversation flow or understanding the user's perspective. We also experimentally demonstrate that humanlike design levers can causally increase anthropomorphism among users; however, we do not find that humanlike design universally increases behavioral measures for user engagement and trust, as previous theoretical work suggests. Instead, part of the connection between human-likeness and behavioral outcomes is fractured by culture: specific design choices that foster self-reported trust in AI-systems in some populations (e.g., Brazil) may trigger the opposite result in others (e.g., Japan). Our findings challenge prevailing narratives of inherent risk in humanlike AI design. Instead, we identify a nuanced, culturally mediated landscape of human-AI interaction, which demands that we move beyond a one-size-fits-all approach in AI governance.

中文标题/摘要

标题：类人类AI设计增加拟人性但对全球用户参与度和信任度产生分歧结果

全球超过十亿用户与日益精巧地模仿人类特质的AI系统互动。这一转变引发了关于拟人性——将人类特征赋予合成代理——及其可能引发的不适当信任或情感依赖的紧迫辩论。然而，尚未在包含全球用户群体的现实人类-AI互动中测试更类人类的AI设计与其后续影响之间的因果关系。现有的安全框架继续依赖于源自西方人群的理论假设，忽视了全球AI用户的多样性。在此，我们通过两项大规模跨国实验（N=3,500）跨越10个不同国家，涉及与AI系统的实时和开放式互动，来弥补这些差距。我们发现，当评估AI的类人类程度时，用户关注的较少是政策中经常提及的理论方面（如感知或意识），而是互动性提示，如对话流畅度或理解用户视角。我们还实验证明，类人类设计可以因果性地增加用户的拟人性；然而，我们没有发现类人类设计会普遍增加用户参与度和信任度的行为指标，这与先前的理论工作所预测的相反。相反，类人类程度与行为结果之间的联系因文化而异：在某些群体中（如巴西）促进对AI系统的自报信任的具体设计选择，在其他群体中（如日本）可能会引发相反的结果。我们的研究挑战了关于类人类AI设计固有风险的现有叙述。相反，我们识别出一个复杂、文化中介的人机互动景观，这要求我们在AI治理中超越一刀切的方法。

Summary / 总结

This study investigates the impact of humanlike AI design on anthropomorphism, engagement, and trust across diverse global populations. Through two large-scale cross-national experiments involving 3,500 participants from 10 countries, the research finds that users judging an AI's human-likeness focus more on applied interactional cues, such as conversation flow, than on theoretical aspects like sentience. While humanlike design causally increases anthropomorphism, it does not universally increase behavioral measures of engagement and trust, and culture shapes the outcomes: design choices that foster self-reported trust in some populations may have the opposite effect in others. The study challenges the notion of inherent risks in humanlike AI design and highlights the need for culturally sensitive approaches in AI governance.

该研究探讨了类人AI设计对全球不同人群的拟人化、参与度和信任度的影响。通过涉及3,500名来自10个国家的参与者的大规模跨国实验，研究发现，用户在评估AI类人程度时更关注实用的交互线索而非理论方面。虽然类人设计会增加拟人化，但它并不会在所有情况下都提升参与度和信任度，文化差异会影响这些结果。该研究挑战了类人AI固有风险的普遍假设，并强调了在AI治理中需要采取文化敏感的方法。

RadarGen: Automotive Radar Point Cloud Generation from Cameras

Authors: Tomer Borreda, Fangqiang Ding, Sanja Fidler, Shengyu Huang, Or Litany

First: 2025-12-19T18:57:33+00:00 · Latest: 2025-12-19T18:57:33+00:00

Comments: Project page: https://radargen.github.io/

Abstract

We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-latent diffusion to the radar domain by representing radar measurements in bird's-eye-view form that encodes spatial structure together with radar cross section (RCS) and Doppler attributes. A lightweight recovery step reconstructs point clouds from the generated maps. To better align generation with the visual scene, RadarGen incorporates BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, which guide the stochastic generation process toward physically plausible radar patterns. Conditioning on images makes the approach broadly compatible, in principle, with existing visual datasets and simulation frameworks, offering a scalable direction for multimodal generative simulation. Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, marking a step toward unified generative simulation across sensing modalities.

中文标题/摘要

标题：RadarGen：从多视角摄像头图像生成汽车雷达点云

我们提出了RadarGen，一种从多视角摄像头图像合成现实汽车雷达点云的扩散模型。RadarGen 通过将雷达测量以鸟瞰图形式表示，编码空间结构以及雷达截面（RCS）和多普勒属性，将高效的图像-潜空间扩散模型适应到雷达领域。一个轻量级的恢复步骤从生成的地图中重建点云。为了更好地与视觉场景对齐，RadarGen 结合了从预训练基础模型中提取的BEV对齐的深度、语义和运动线索，这些线索指导随机生成过程向物理上合理的雷达模式发展。基于图像的条件使该方法原则上与现有的视觉数据集和模拟框架兼容，为多模态生成模拟提供了可扩展的方向。在大规模驾驶数据上的评估表明，RadarGen 捕捉了特征雷达测量分布，并减少了与在真实数据上训练的感知模型之间的差距，标志着向跨传感模态的统一生成模拟迈进了一步。

Summary / 总结

RadarGen is a diffusion model that generates realistic automotive radar point clouds from multi-view camera imagery. It represents radar measurements as bird's-eye-view maps that encode spatial structure together with radar cross section (RCS) and Doppler attributes, recovers point clouds from the generated maps with a lightweight step, and uses BEV-aligned depth, semantic, and motion cues from pretrained foundation models to guide generation. Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data.

RadarGen 是一种从多视角相机图像生成真实汽车雷达点云的扩散模型。它使用鸟瞰图表示来编码空间结构和雷达属性，并通过轻量级恢复步骤重建点云。RadarGen 结合深度、语义和运动等视觉线索来引导生成过程，使其能够与现有的视觉数据集和仿真框架兼容。评估结果显示，RadarGen 能够捕捉到典型的雷达测量分布，并减少与基于真实数据训练的感知模型之间的差距，展示了其在跨传感模态的生成仿真中的潜力。
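
To make the bird's-eye-view representation concrete, the sketch below rasterizes radar points into a sparse BEV grid that stores RCS and Doppler per occupied cell, then recovers one point per cell at the cell center. It only illustrates the general idea of a map representation plus a lightweight recovery step; the grid layout, the function names `rasterize_bev` and `recover_points`, and the "last point wins" rule are assumptions, not details from the paper.

```python
def rasterize_bev(points, extent, cell):
    """Map radar points to a sparse BEV grid.

    points: iterable of (x, y, rcs, doppler) tuples (hypothetical format).
    extent: (x_min, x_max, y_min, y_max) of the grid; cell: cell size.
    Returns a dict mapping (row, col) -> (rcs, doppler) for occupied cells.
    """
    x_min, x_max, y_min, y_max = extent
    grid = {}
    for x, y, rcs, doppler in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # drop points outside the grid
        col = int((x - x_min) // cell)
        row = int((y - y_min) // cell)
        grid[(row, col)] = (rcs, doppler)  # last point wins in a shared cell
    return grid

def recover_points(grid, extent, cell):
    """Recover one point per occupied cell, placed at the cell center."""
    x_min, _, y_min, _ = extent
    return [
        (x_min + (col + 0.5) * cell, y_min + (row + 0.5) * cell, rcs, doppler)
        for (row, col), (rcs, doppler) in sorted(grid.items())
    ]

extent = (0.0, 20.0, -10.0, 10.0)
radar_points = [(1.2, -3.7, 5.0, 0.5), (40.0, 0.0, 1.0, 0.0)]
bev = rasterize_bev(radar_points, extent, cell=1.0)
recovered = recover_points(bev, extent, cell=1.0)
```

Storing attributes in image-like maps is what lets an image-latent diffusion model operate on radar data; the price is quantization to the cell size, visible here in the recovered coordinates.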

SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars

Authors: Xiaosheng Zhao, Yang Huang, Guirong Xue, Xiao Kong, Jifeng Liu, Xiaoyu Tang, Timothy C. Beers, Yuan-Sen Ting, A-Li Luo

First: 2025-07-02T17:49:52+00:00 · Latest: 2025-12-19T18:39:57+00:00

Comments: 29 pages, 8 figures, 6 tables. Accepted for publication in ApJ. Comments welcome

Abstract

In recent years, large language models (LLMs) have transformed natural language understanding through vast datasets and large-scale parameterization. Inspired by this success, we present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. Stellar spectra, akin to structured language, encode rich physical and chemical information about stars. By training foundation models on large-scale spectral datasets, our goal is to learn robust and informative embeddings that support diverse downstream applications. As a proof of concept, SpecCLIP involves pre-training on two spectral types--LAMOST low-resolution and Gaia XP--followed by contrastive alignment using the CLIP (Contrastive Language-Image Pre-training) framework, adapted to associate spectra from different instruments. This alignment is complemented by auxiliary decoders that preserve spectrum-specific information and enable translation (prediction) between spectral types, with the former achieved by maximizing mutual information between embeddings and input spectra. The result is a cross-spectrum framework enabling intrinsic calibration and flexible applications across instruments. We demonstrate that fine-tuning these models on moderate-sized labeled datasets improves adaptability to tasks such as stellar-parameter estimation and chemical-abundance determination. SpecCLIP also enhances the accuracy and precision of parameter estimates benchmarked against external survey data. Additionally, its similarity search and cross-spectrum prediction capabilities offer potential for anomaly detection. Our results suggest that contrastively trained foundation models enriched with spectrum-aware decoders can advance precision stellar spectroscopy. Our code SpecCLIP is publicly available at https://github.com/Xiaosheng-Zhao/SpecCLIP

中文标题/摘要

标题：SpecCLIP：为恒星光谱测量对齐和翻译

近年来，大规模语言模型（LLMs）通过大规模数据集和大规模参数化，彻底改变了自然语言理解。受此成功的启发，我们提出了SpecCLIP，一种基础模型框架，将LLM启发的方法扩展到恒星光谱分析。恒星光谱类似于结构化语言，编码了丰富的物理和化学信息。通过在大规模光谱数据集上训练基础模型，我们的目标是学习稳健且信息丰富的嵌入，以支持各种下游应用。作为概念验证，SpecCLIP 包括在两种光谱类型——LAMOST 低分辨率和Gaia XP 上进行预训练，然后使用适应不同仪器关联光谱的 CLIP（对比语言-图像预训练）框架进行对比对齐。这种对齐通过最大化嵌入和输入光谱之间的互信息来保持光谱特定的信息，并通过辅助解码器实现不同光谱类型之间的翻译（预测）。结果是跨光谱框架，能够进行内在校准并在不同仪器之间灵活应用。我们证明，通过在中等大小的标记数据集上微调这些模型，可以提高恒星参数估计和化学丰度确定等任务的适应性。SpecCLIP 还通过与外部调查数据进行参数估计的准确性及精确度基准测试，提高了参数估计的准确性及精确度。此外，其相似性搜索和跨光谱预测能力为异常检测提供了潜在可能性。我们的结果表明，通过光谱感知解码器增强的对比训练基础模型可以推进精确恒星光谱学。我们的代码 SpecCLIP 已在 https://github.com/Xiaosheng-Zhao/SpecCLIP 公开。

Weighted Stochastic Differential Equation to Implement Wasserstein-Fisher-Rao Gradient Flow

Authors: Herlock Rahimi

First: 2025-12-19T18:31:27+00:00 · Latest: 2025-12-19T18:31:27+00:00

Comments: 26 pages, 1 figure

Abstract

Score-based diffusion models currently constitute the state of the art in continuous generative modeling. These methods are typically formulated via overdamped or underdamped Ornstein--Uhlenbeck-type stochastic differential equations, in which sampling is driven by a combination of deterministic drift and Brownian diffusion, resulting in continuous particle trajectories in the ambient space. While such dynamics enjoy exponential convergence guarantees for strongly log-concave target distributions, it is well known that their mixing rates deteriorate exponentially in the presence of nonconvex or multimodal landscapes, such as double-well potentials. Since many practical generative modeling tasks involve highly non-log-concave target distributions, considerable recent effort has been devoted to developing sampling schemes that improve exploration beyond classical diffusion dynamics. A promising line of work leverages tools from information geometry to augment diffusion-based samplers with controlled mass reweighting mechanisms. This perspective leads naturally to Wasserstein--Fisher--Rao (WFR) geometries, which couple transport in the sample space with vertical (reaction) dynamics on the space of probability measures. In this work, we formulate such reweighting mechanisms through the introduction of explicit correction terms and show how they can be implemented via weighted stochastic differential equations using the Feynman--Kac representation. Our study provides a preliminary but rigorous investigation of WFR-based sampling dynamics, and aims to clarify their geometric and operator-theoretic structure as a foundation for future theoretical and algorithmic developments.

中文标题/摘要

标题：加权随机微分方程实现 Wasserstein-Fisher-Rao 梯度流

基于分数的扩散模型目前构成了连续生成建模的最新水平。这些方法通常通过过阻尼或欠阻尼的 Ornstein--Uhlenbeck 类型随机微分方程进行公式化，其中采样由确定性漂移和布朗扩散的组合驱动，从而在环境空间中产生连续的粒子轨迹。虽然此类动力学对于强对数凹目标分布享有指数收敛保证，但众所周知，在非凸或多重模态景观（如双井势）存在时，它们的混合率会呈指数级恶化。由于许多实际的生成建模任务涉及高度非对数凹的目标分布，因此最近投入了大量努力来开发改进探索的采样方案，超越了经典的扩散动力学。一种有希望的研究方向利用信息几何工具来增强基于扩散的采样器，通过受控的质量重加权机制。这种视角自然地引出了 Wasserstein--Fisher--Rao (WFR) 几何，它将样本空间中的传输与概率测度空间上的垂直（反应）动力学耦合在一起。在这项工作中，我们通过引入显式的修正项来形式化这种重加权机制，并展示了如何通过加权随机微分方程使用费曼--卡茨表示进行实现。我们的研究提供了 WFR 基准采样动力学的初步但严谨的调查，并旨在澄清其几何和算子理论结构，作为未来理论和算法发展的基础。

Summary / 总结

This paper addresses the limitations of classical score-based diffusion dynamics on non-log-concave target distributions, where mixing rates deteriorate in multimodal landscapes such as double-well potentials. It uses Wasserstein--Fisher--Rao (WFR) geometries to add controlled mass reweighting to diffusion-based samplers, formulates the reweighting through explicit correction terms, and shows how it can be implemented via weighted stochastic differential equations using the Feynman--Kac representation. The study is a preliminary but rigorous investigation that aims to clarify the geometric and operator-theoretic structure of WFR-based sampling dynamics; the abstract does not report experimental results.

本文针对现有基于分数的扩散模型在处理非对数凹目标分布时的局限性，提出使用加权随机微分方程实现Wasserstein-Fisher-Rao梯度流。该方法通过引入显式的修正项来形式化质量重加权机制，并利用费曼-卡茨表示加以实现。该研究是对基于WFR的采样动力学的初步但严谨的探讨，旨在澄清其几何和算子理论结构；摘要中未报告实验结果。
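
The general idea of a weighted SDE can be sketched as follows: particles follow ordinary overdamped Langevin dynamics, while each particle also carries a weight that accumulates the exponential of a rate integrated along its path, as in the Feynman--Kac representation. The sketch is generic and is not the paper's scheme: the double-well potential U(x) = (x^2 - 1)^2, the Euler--Maruyama discretization, the function names, and the placeholder `rate` function are all assumptions, and the paper's specific correction terms are not reproduced.

```python
import math
import random

def grad_double_well(x):
    # Gradient of the illustrative potential U(x) = (x^2 - 1)^2.
    return 4.0 * x * (x * x - 1.0)

def weighted_langevin(x0, rate, dt, n_steps, seed=0):
    """Euler-Maruyama Langevin particles with Feynman-Kac style weights.

    rate: hypothetical reaction term c(x); each log-weight accumulates
    c(X_s) * dt along the path, so the weight is exp(integral of c).
    Returns final positions and normalized weights.
    """
    rng = random.Random(seed)
    xs = list(x0)
    logw = [0.0] * len(xs)
    for _ in range(n_steps):
        for i, x in enumerate(xs):
            logw[i] += rate(x) * dt  # vertical (reweighting) part
            noise = math.sqrt(2.0 * dt) * rng.gauss(0.0, 1.0)
            xs[i] = x - grad_double_well(x) * dt + noise  # transport part
    top = max(logw)  # subtract the max for numerical stability
    w = [math.exp(l - top) for l in logw]
    total = sum(w)
    return xs, [wi / total for wi in w]

# With a zero rate the weights stay uniform and the sampler is plain Langevin.
xs_plain, w_plain = weighted_langevin([0.0] * 50, lambda x: 0.0, 1e-3, 200)
# A non-trivial (made-up) rate shifts mass toward particles near x = 0.
xs_rw, w_rw = weighted_langevin([0.0] * 50, lambda x: -x * x, 1e-3, 200)
```

Expectations under the reweighted measure are then estimated as weighted averages over particles; in practice weighted schemes like this usually need resampling to keep the weights from degenerating, which the sketch omits.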

Visually Prompted Benchmarks Are Surprisingly Fragile

Authors: Haiwen Feng, Long Lian, Lisa Dunlap, Jiahao Shu, XuDong Wang, Renhao Wang, Trevor Darrell, Alane Suhr, Angjoo Kanazawa

First: 2025-12-19T18:26:58+00:00 · Latest: 2025-12-19T18:26:58+00:00

Abstract

A key challenge in evaluating VLMs is testing models' ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where questions about visual content are paired with coordinates to which the question refers, with the coordinates explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly fragile to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change rankings among models on a leaderboard. By evaluating nine commonly-used open- and closed-source VLMs on two visually prompted tasks, we demonstrate how details in benchmark setup, including visual marker design and dataset size, have a significant influence on model performance and leaderboard rankings. These effects can even be exploited to lift weaker models above stronger ones; for instance, slightly increasing the size of the visual marker results in open-source InternVL3-8B ranking alongside or better than much larger proprietary models like Gemini 2.5 Pro. We further show that low-level inference choices that are often ignored in benchmarking, such as JPEG compression levels in API calls, can also cause model lineup changes. These details have substantially larger impacts on visually prompted benchmarks than on conventional semantic VLM evaluations. To mitigate this instability, we curate existing datasets to create VPBench, a larger visually prompted benchmark with 16 visual marker variants. VPBench and additional analysis tools are released at https://lisadunlap.github.io/vpbench/.

中文标题/摘要

标题：视觉提示基准测试出人意料地脆弱

在评估VLMs时的一个关键挑战是测试模型独立分析视觉内容的能力，而不依赖于其文本先验。最近的基准测试，如BLINK，通过视觉提示来测试视觉感知，其中关于视觉内容的问题与问题所指的坐标配对，并且在图像本身中明确地标记这些坐标。尽管这些基准测试是VLM评估的重要组成部分，但我们发现现有模型对视觉提示中的看似无关细节极其脆弱：简单地将视觉标记从红色改为蓝色可以完全改变排行榜上模型的排名。通过在两个视觉提示任务上评估九个常用开源和闭源VLMs，我们展示了基准设置中的细节，包括视觉标记设计和数据集规模，对模型性能和排行榜排名有显著影响。这些效果甚至可以被利用来提升较弱模型的排名；例如，略微增加视觉标记的大小会使开源InternVL3-8B在排行榜上与更大的专有模型Gemini 2.5 Pro并列或优于后者。我们还表明，通常在基准测试中被忽略的低级推理选择，如API调用中的JPEG压缩级别，也可能导致模型排列的变化。这些细节对视觉提示基准测试的影响远大于对传统语义VLM评估的影响。为了缓解这种不稳定性，我们整理现有数据集创建了VPBench，这是一个包含16种视觉标记变体的更大规模的视觉提示基准测试。VPBench和额外的分析工具可在https://lisadunlap.github.io/vpbench/发布。

Adaptive Focus Memory for Language Models

Authors: Christopher Cruz

First: 2025-11-16T17:52:32+00:00 · Latest: 2025-12-19T18:24:09+00:00

Abstract

Large language models (LLMs) are increasingly deployed in multi-turn dialogue settings, yet their behavior remains bottlenecked by naive history management strategies. Replaying the full conversation at every turn is simple but costly, while recency-based truncation or static summarization often causes early, high-impact user constraints to drift out of effective context. As a result, models may retain text without reliably applying it when it matters. We present Adaptive Focus Memory (AFM), a lightweight context management system that dynamically assigns each past message one of three fidelity levels: Full, Compressed, or Placeholder, based on semantic relevance, temporal decay, and importance classification. AFM packs messages chronologically under a fixed token budget, preserving critical constraints at high fidelity while allowing low-importance context to degrade gracefully. We evaluate AFM on two multi-turn dialogue benchmarks designed to stress long-horizon constraint preservation: a safety-critical travel scenario involving a user with a severe peanut allergy, and a policy-critical tax compliance scenario involving an illegal evasion request. Under strict grading that requires both explicit constraint recall and appropriately conditioned generation, AFM succeeds in 83.3 percent of allergy runs where all baseline strategies fail, and preserves correct refusal behavior on the tax benchmark. These results demonstrate that effective dialogue memory requires more than retaining prior text. Selectively allocating fidelity across past messages enables reliable constraint preservation under bounded context growth, without modifying model weights or introducing external retrieval infrastructure. We release an open-source implementation of AFM compatible with OpenAI-style chat APIs to support reproducible research and practical deployment.

中文标题/摘要

标题：语言模型的自适应焦点记忆

大型语言模型（LLMs）越来越多地在多轮对话环境中部署，但其行为仍受限于简单的历史管理策略。每次轮次重新播放整个对话虽然简单但代价高昂，而基于近期性的截断或静态总结往往会导致早期、高影响用户约束过早地脱离有效语境。因此，模型可能会保留文本但无法可靠地在关键时刻应用这些文本。我们提出了自适应焦点记忆（AFM），这是一种轻量级的上下文管理系统，能够根据语义相关性、时间衰减和重要性分类动态地将每个过去的对话消息分配为三种保真度级别之一：全保真、压缩或占位符。AFM 在固定令牌预算下按时间顺序打包消息，保持关键约束的高保真度，同时允许低重要性上下文平滑降级。我们在两个旨在测试长期约束保留能力的多轮对话基准测试中评估了AFM：一个涉及严重花生过敏用户的安全关键旅行场景，另一个涉及非法逃税请求的政策关键税务合规场景。在严格的评分标准下，要求同时明确回忆约束并适当条件生成，AFM 在过敏场景中的成功率为83.3%，而所有基线策略均失败；在税务基准测试中，AFM 保持了正确的拒绝行为。这些结果表明，有效的对话记忆不仅仅是保留先前的文本。在有限的上下文增长范围内，有选择地分配保真度到过去的对话消息能够实现可靠的约束保留，无需修改模型权重或引入外部检索基础设施。我们发布了与OpenAI风格聊天API兼容的开源AFM实现，以支持可重复研究和实际部署。

Summary / 总结

The research aims to improve the behavior of large language models in multi-turn dialogue by addressing the limitations of existing history management strategies. Adaptive Focus Memory (AFM) dynamically assigns each past message one of three fidelity levels (Full, Compressed, or Placeholder) based on semantic relevance, temporal decay, and importance classification, then packs messages chronologically under a fixed token budget. Under strict grading, AFM succeeds in 83.3 percent of runs in a safety-critical peanut-allergy travel scenario where all baseline strategies fail, and preserves correct refusal behavior in a policy-critical tax compliance scenario.

论文提出了Adaptive Focus Memory (AFM)，这是一种用于大型语言模型在多轮对话设置中的上下文管理系统的方案。AFM 根据语义相关性、时间衰减和重要性分类，动态地将每个过去的消息分配为三种保真度级别之一。在安全性和政策性关键场景中，AFM 在83.3%的过敏案例中成功保留了关键约束，并在税务合规场景中保持了正确的拒绝行为，优于基线策略。
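
The fidelity assignment described above can be sketched as a small greedy packer. Every message starts as a placeholder; messages are then upgraded to Full, or failing that Compressed, in order of priority until the token budget is used up, and the result keeps chronological order. The scoring rule (relevance times an exponential decay, with critical messages first), the whitespace token count, the truncation-based `compress`, and all names are assumptions for illustration and are not taken from the released AFM implementation.

```python
from dataclasses import dataclass

@dataclass
class Message:
    text: str
    relevance: float  # hypothetical semantic relevance to the current turn, in [0, 1]
    critical: bool    # hypothetical output of an importance classifier

def count_tokens(text):
    # Crude whitespace count; a real system would use the model tokenizer.
    return len(text.split())

def compress(text, keep=8):
    # Stand-in for a summarizer: keep only the first few words.
    words = text.split()
    return " ".join(words[:keep]) + (" ..." if len(words) > keep else "")

def pack(messages, budget, half_life=4.0):
    """Assign full / compressed / placeholder levels under a token budget."""
    n = len(messages)
    scored = []
    for i, m in enumerate(messages):
        age = n - 1 - i  # 0 for the most recent message
        decay = 0.5 ** (age / half_life)
        score = float("inf") if m.critical else m.relevance * decay
        scored.append((score, i))
    levels = ["placeholder"] * n
    rendered = ["[omitted]"] * n
    used = sum(count_tokens(r) for r in rendered)
    # Upgrade the highest-priority messages first.
    for _, i in sorted(scored, key=lambda s: (-s[0], s[1])):
        options = (("full", messages[i].text),
                   ("compressed", compress(messages[i].text)))
        for level, text in options:
            extra = count_tokens(text) - count_tokens(rendered[i])
            if used + extra <= budget:
                levels[i], rendered[i] = level, text
                used += extra
                break
    return list(zip(levels, rendered)), used

history = [
    Message("I have a severe peanut allergy so never suggest anything with peanuts", 0.2, True),
    Message("What is the weather like in Bangkok in early March this year", 0.1, False),
    Message("Can you suggest some street food to try", 0.9, False),
]
packed, used = pack(history, budget=22)
```

In this toy run the allergy constraint is kept in full despite its age and low relevance score, while the weather question degrades to a placeholder, which mirrors the behavior the abstract describes for early, high-impact constraints.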

Deep Gaussian Process Proximal Policy Optimization

Authors: Matthijs van der Lende, Juan Cardenas-Cartagena

First: 2025-11-22T23:13:04+00:00 · Latest: 2025-12-19T18:23:00+00:00

Comments: Withdrawn by the authors as the manuscript is not yet complete; no updated version is available at this time

Abstract

Uncertainty estimation for Reinforcement Learning (RL) is a critical component in control tasks where agents must balance safe exploration and efficient learning. While deep neural networks have enabled breakthroughs in RL, they often lack calibrated uncertainty estimates. We introduce Deep Gaussian Process Proximal Policy Optimization (GPPO), a scalable, model-free actor-critic algorithm that leverages Deep Gaussian Processes (DGPs) to approximate both the policy and value function. GPPO maintains competitive performance with respect to Proximal Policy Optimization on standard high-dimensional continuous control benchmarks while providing well-calibrated uncertainty estimates that can inform safer and more effective exploration.

中文标题/摘要

标题：深度高斯过程近端策略优化

强化学习（RL）中的不确定性估计是控制任务中的关键组成部分，其中智能体必须在安全探索和高效学习之间取得平衡。尽管深度神经网络在RL中取得了突破，但它们通常缺乏校准的不确定性估计。我们引入了深度高斯过程近端策略优化（GPPO），这是一种可扩展的、无模型的演员-评论家算法，利用深度高斯过程（DGPs）来近似策略和价值函数。GPPO在标准高维连续控制基准测试中保持了与近端策略优化相当的性能，同时提供了校准良好的不确定性估计，可以指导更安全和更有效的探索。

Summary / 总结

The research aims to improve uncertainty estimation in Reinforcement Learning for control tasks. The authors propose Deep Gaussian Process Proximal Policy Optimization (GPPO), a model-free actor-critic algorithm that uses Deep Gaussian Processes to approximate both the policy and the value function. On standard high-dimensional continuous control benchmarks, GPPO remains competitive with Proximal Policy Optimization while providing well-calibrated uncertainty estimates that can inform safer and more effective exploration.

研究旨在提高控制任务中强化学习中的不确定性估计。作者提出了Deep Gaussian Process Proximal Policy Optimization (GPPO)，使用Deep Gaussian Processes来建模策略和价值函数，提供准确的不确定性估计。该方法在标准基准上保持与Proximal Policy Optimization相当的性能，通过更好的不确定性估计实现更安全和更有效的探索。

Same Content, Different Representations: A Controlled Study for Table QA

Authors: Yue Zhang, Seiji Maekawa, Nikita Bhutani

First: 2025-09-26T22:33:19+00:00 · Latest: 2025-12-19T18:19:10+00:00

Abstract

Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce RePairTQA, a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality. Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries. Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.

中文标题/摘要

标题：相同内容，不同表示：表问答的受控研究

在实际应用中，表问答（Table QA）必须同时处理结构化数据库和包含文本字段的半结构化表格。然而，现有的基准测试与固定的数据格式绑定，并未系统地研究表示形式本身如何影响模型性能。我们首次进行了一项受控研究，通过保持内容不变而改变结构来隔离表表示的作用。利用一个语言化（verbalization）管道，我们生成了成对的结构化和半结构化表格，从而能够对不同建模范式进行直接比较。为了支持详细的分析，我们引入了RePairTQA，这是一个诊断基准，按表格大小、连接需求、查询复杂性和模式质量进行划分。我们的实验揭示了一致的权衡：基于SQL的方法在结构化输入上具有高准确性，但在半结构化数据上表现下降；LLM表现出灵活性但精度较低；混合方法则取得了平衡，在模式（schema）存在噪声时尤为明显。这些影响在更大的表格和更复杂的查询中更加明显。最终，没有一种方法在所有条件下都表现出色，我们强调表示形式在塑造表问答性能中的核心作用。我们的发现为模型选择和设计提供了可操作的见解，为适应各种实际数据格式的更稳健的混合方法铺平了道路。

Summary / 总结

The study aims to understand how different table representations affect model performance in Table QA. It uses a verbalization pipeline to create paired structured and semi-structured tables, and introduces RePairTQA, a benchmark that varies table size, join requirements, query complexity, and schema quality. Experiments show that SQL-based methods perform well on structured data but struggle with semi-structured data, while LLMs are more flexible but less precise. Hybrid approaches balance performance across different conditions, especially under noisy schemas. The findings suggest that representation is crucial for Table QA and highlight the need for robust hybrid methods.

研究旨在通过保持内容不变而改变表格结构，来理解不同表格表示形式如何影响表问答（Table QA）中的模型性能。使用一个语言化（verbalization）管道生成结构化和半结构化的表格配对，以直接比较不同建模范式的表现。实验表明，基于SQL的方法在结构化数据上表现良好，但在半结构化数据上表现较差，而大语言模型则更具灵活性但精确度较低。混合方法在这些权衡中取得平衡，在模式（schema）存在噪声时尤为明显。这些影响在更大的表格和更复杂的查询中更为明显，表明表示形式在表问答性能中的核心作用。
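
The verbalization idea described in the abstract above (same content, different representation) can be pictured with a small sketch. This is only an illustration under stated assumptions: the function name `verbalize_row`, the sentence template, and the `description` field are hypothetical and are not taken from the paper's pipeline.

```python
def verbalize_row(row, columns_to_verbalize):
    """Fold the chosen columns of one row into a single free-text field.

    Hypothetical helper: the template "The <column> is <value>." is an
    assumption made for illustration, not the paper's actual pipeline.
    """
    # Columns that stay as ordinary structured fields.
    kept = {k: v for k, v in row.items() if k not in columns_to_verbalize}
    # Each verbalized column becomes one short sentence.
    sentences = [
        f"The {col.replace('_', ' ')} is {row[col]}."
        for col in columns_to_verbalize
    ]
    kept["description"] = " ".join(sentences)
    return kept


structured = [
    {"id": 1, "name": "Ada", "team": "Compilers", "start_year": 2019},
    {"id": 2, "name": "Lin", "team": "Databases", "start_year": 2021},
]

# Paired semi-structured table: same facts, but two columns now live in text.
semi_structured = [verbalize_row(r, ["team", "start_year"]) for r in structured]
```

A SQL query such as `WHERE team = 'Compilers'` works on `structured` but has no column to filter on in `semi_structured`, which is the kind of contrast the controlled study isolates.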

Data for Mathematical Copilots: Better Ways of Presenting Proofs for Machine Learning

Authors: Simon Frieder, Jonas Bayer, Sam Looi, Jacob Loader, Julius Berner, Katherine M. Collins, András Juhász, Fabian Ruehle, Sean Welleck, Gabriel Poesia, Ryan-Rhys Griffiths, Adrian Weller, Anirudh Goyal, Cameron Freer, Thomas Lukasiewicz, Timothy Gowers

First: 2024-12-19T18:55:17+00:00 · Latest: 2025-12-19T18:17:28+00:00

Comments: 59 pages

Abstract

The datasets and benchmarks commonly used to train and evaluate the mathematical capabilities of AI-based mathematical copilots (primarily large language models) exhibit several shortcomings and misdirections. These range from a restricted scope of mathematical complexity to limited fidelity in capturing aspects beyond the final, written proof (e.g. motivating the proof, or representing the thought processes leading to a proof). These issues are compounded by a dynamic reminiscent of Goodhart's law: as benchmark performance becomes the primary target for model development, the benchmarks themselves become less reliable indicators of genuine mathematical capability. We systematically explore these limitations and contend that enhancing the capabilities of large language models, or any forthcoming advancements in AI-based mathematical assistants (copilots or ``thought partners''), necessitates a course correction both in the design of mathematical datasets and the evaluation criteria of the models' mathematical ability. In particular, it is necessary for benchmarks to move beyond the existing result-based datasets that map theorem statements directly to proofs, and instead focus on datasets that translate the richer facets of mathematical research practice into data that LLMs can learn from. This includes benchmarks that supervise the proving process and the proof discovery process itself, and we advocate for mathematical dataset developers to consider the concept of "motivated proof", introduced by G. Pólya in 1949, which can serve as a blueprint for datasets that offer a better proof learning signal, alleviating some of the mentioned limitations.

中文标题/摘要

标题：数学副驾的数据：为机器学习更好地呈现证明的方式

用于训练和评估基于AI的数学副驾（主要是大型语言模型）的数学能力的数据集和基准存在诸多不足和误导。这些问题包括数学复杂度范围有限，以及在捕捉证明之外的方面（如证明动机或证明思路）的精度不足。随着基准性能成为模型开发的主要目标，基准本身变得不再可靠地反映真正的数学能力。我们系统地探讨了这些局限性，并认为提高大型语言模型的能力，或任何未来基于AI的数学助手（副驾或“思想伙伴”）的能力，需要在数学数据集的设计和模型数学能力的评估标准上进行调整。特别是，基准需要超越现有的结果导向的数据集，这些数据集直接将定理陈述映射到证明，而是转向能够将数学研究实践的更丰富方面转化为LLM可以学习的数据的基准。这包括监督证明过程和证明发现过程本身的基准，我们建议数学数据集开发者考虑G. 波利亚1949年提出的“动机证明”概念，这可以作为提供更好证明学习信号的数据集蓝图，缓解上述提到的一些局限性。

Summary / 总结

This paper examines the limitations of the datasets and benchmarks used to train and evaluate AI-based mathematical copilots, including a restricted scope of mathematical complexity and poor coverage of the thought processes behind a proof. The authors argue that a dynamic reminiscent of Goodhart's law makes current benchmarks less reliable indicators of genuine mathematical capability, and they call for benchmarks that supervise the proving and proof-discovery processes rather than mapping theorem statements directly to proofs. They advocate G. Pólya's 1949 concept of the "motivated proof" as a blueprint for datasets that offer a better proof learning signal.

论文指出现有用于训练AI数学协作者的数据集存在局限性，如数学复杂度受限和仅关注最终证明而忽视思考过程。作者建议转向能够捕捉完整证明过程的基准，并提倡使用‘动机证明’来更好地训练这些模型。关键发现表明，基准应超越简单的定理-证明映射，包括更多反映数学研究实际过程的综合数据。

Towards Human-Guided, Data-Centric LLM Co-Pilots

Authors: Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar

First: 2025-01-17T17:51:22+00:00 · Latest: 2025-12-19T18:08:16+00:00

Comments: Saveliev, Liu & Seedat contributed equally

Abstract

Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.

中文标题/摘要

标题：迈向由人指导、以数据为中心的LLM副驾（Co-Pilot）

机器学习（ML）有潜力革新各个领域，但其采用往往受阻于领域专家的需求与将这些需求转化为稳健有效的ML工具之间的脱节。尽管基于LLM的副驾（co-pilot）近期取得了进展，旨在让非技术背景的领域专家也能用上ML，但这些系统仍然主要关注以模型为中心的方面，而忽视了关键的以数据为中心的挑战。在复杂的真实世界环境中，这一局限会带来问题，因为原始数据通常包含复杂的问题，如缺失值、标签噪声和需要定制处理的领域特定细微差别。为了解决这一问题，我们引入了CliMB-DC，这是一种由人指导、以数据为中心的LLM副驾框架，将先进的以数据为中心的工具与LLM驱动的推理相结合，以实现稳健、上下文感知的数据处理。CliMB-DC的核心是一种新颖的多智能体推理系统，它将负责动态规划和适应的战略协调者与负责精确执行的专门工作智能体结合起来。随后，通过人在回路（human-in-the-loop）的方式系统地引入领域专业知识，以指导推理过程。为了指导开发，我们对副驾必须解决的关键数据中心挑战进行了形式化分类。随后，为了应对分类的各个维度，我们将最先进的以数据为中心的工具集成到一个可扩展的开源架构中，便于研究社区添加新工具。通过使用真实世界的医疗保健数据集，我们实证地证明了CliMB-DC能够将未整理的数据集转换为ML就绪格式，在处理数据中心挑战方面显著优于现有的副驾基线。CliMB-DC有望使来自不同领域（医疗保健、金融、社会科学等）的领域专家能够积极参与利用ML推动实际影响。

Summary / 总结

This paper addresses the gap in existing machine learning co-pilots by introducing CliMB-DC, a human-guided, data-centric framework. It combines LLM-driven reasoning with advanced data-centric tools to handle complex data issues like missing values and label noise. Empirical results show that CliMB-DC effectively transforms uncurated healthcare datasets into ML-ready formats, outperforming existing co-pilots in managing data-centric challenges. This framework aims to enable domain experts across various fields to leverage ML more effectively.

论文解决了现有模型中心的LLM协作者在处理数据中心挑战方面的局限性，以及领域专家需求之间的差距。它引入了CliMB-DC，这是一种结合先进数据中心工具和LLM驱动推理的人类引导框架。CliMB-DC在将未整理的医疗保健数据集转换为机器学习可处理格式方面表现出色，优于现有协作者基线。该框架使领域专家能够有效地参与推动实际的ML影响。

AnyTask: an Automated Task and Data Generation Framework for Advancing Sim-to-Real Policy Learning

Authors: Ran Gong, Xiaohan Zhang, Jinghuan Shang, Maria Vittoria Minniti, Jigarkumar Patel, Valerio Pepe, Riedana Yan, Ahmet Gundogdu, Ivan Kapelyukh, Ali Abbas, Xiaoqiang Yan, Harsh Patel, Laura Herlant, Karl Schmeckpeper

First: 2025-12-19T17:55:48+00:00 · Latest: 2025-12-19T17:55:48+00:00

Comments: 28 pages, 25 figures. The first four authors contributed equally

Abstract

Generalist robot learning remains constrained by data: large-scale, diverse, and high-quality interaction data are expensive to collect in the real world. While simulation has become a promising way for scaling up data collection, the related tasks, including simulation task design, task-aware scene generation, expert demonstration synthesis, and sim-to-real transfer, still demand substantial human effort. We present AnyTask, an automated framework that pairs massively parallel GPU simulation with foundation models to design diverse manipulation tasks and synthesize robot data. We introduce three AnyTask agents for generating expert demonstrations aiming to solve as many tasks as possible: 1) ViPR, a novel task and motion planning agent with VLM-in-the-loop Parallel Refinement; 2) ViPR-Eureka, a reinforcement learning agent with generated dense rewards and LLM-guided contact sampling; 3) ViPR-RL, a hybrid planning and learning approach that jointly produces high-quality demonstrations with only sparse rewards. We train behavior cloning policies on generated data, validate them in simulation, and deploy them directly on real robot hardware. The policies generalize to novel object poses, achieving 44% average success across a suite of real-world pick-and-place, drawer opening, contact-rich pushing, and long-horizon manipulation tasks. Our project website is at https://anytask.rai-inst.com .

中文标题/摘要

标题：AnyTask：一种自动化的任务和数据生成框架，用于推进模拟到现实的策略学习

通用型机器人学习仍然受到数据的限制：在现实世界中收集大规模、多样性和高质量的交互数据成本高昂。虽然模拟已成为扩展数据收集的有前途的方法，但相关的任务，包括模拟任务设计、任务感知场景生成、专家演示合成以及模拟到现实的转移，仍然需要大量的人力投入。我们提出了AnyTask，这是一种将大规模并行GPU模拟与基础模型相结合的自动化框架，用于设计多样化的操作任务并合成机器人数据。我们介绍了三个AnyTask代理，用于生成尽可能多任务的专家演示：1) ViPR，一种具有VLM在环并行精化的新型任务和运动规划代理；2) ViPR-Eureka，一种基于生成密集奖励和LLM引导接触采样的强化学习代理；3) ViPR-RL，一种结合规划和学习的混合方法，仅使用稀疏奖励即可生成高质量的演示。我们在生成的数据上训练行为克隆策略，在模拟中验证它们，并直接部署在真实机器人硬件上。这些策略泛化到新的物体姿态，在一系列真实世界的拾取放置、抽屉打开、接触丰富的推拉和长时操作任务中平均成功率达到了44%。我们的项目网站是https://anytask.rai-inst.com。

Summary / 总结

AnyTask is an automated framework that uses GPU simulation and foundation models to generate diverse manipulation tasks and robot data. It includes three agents: ViPR for task and motion planning, ViPR-Eureka for reinforcement learning with generated rewards, and ViPR-RL for hybrid planning and learning. Policies trained on generated data achieve 44% average success across various real-world manipulation tasks when deployed on real robots.

AnyTask 是一个自动化框架，利用 GPU 模拟和基础模型生成多样化的操作任务和机器人数据。它包含三个代理：ViPR、ViPR-Eureka 和 ViPR-RL，用于生成解决各种任务的专家演示。该框架在生成的数据上训练行为克隆策略，在模拟中验证，并直接部署到真实机器人硬件上。策略在各种实际操作任务中实现了 44% 的平均成功率。

InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

Authors: Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang, Tejas Gokhale, Hamid Palangi, Hadi Jamali-Rad

First: 2025-12-19T17:52:43+00:00 · Latest: 2025-12-19T17:52:43+00:00

Abstract

Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: lack of fine-grained spatial supervision in training data and inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. Proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming the fine-tuning-based methods. Codebase is available at GitHub.

中文标题/摘要

标题：InfSplign: 文本到图像扩散模型推理时的空间对齐

文本到图像（T2I）扩散模型能够生成高质量的图像，但往往无法捕捉到文本提示中指定的空间关系。这一限制可以追溯到两个因素：训练数据中缺乏精细的空间监督以及文本嵌入无法编码空间语义。我们提出了一种无需训练的推理时方法InfSplign，通过在每个去噪步骤中使用复合损失调整噪声来改善空间对齐。所提出的损失利用从主干解码器提取的不同级别的交叉注意力图来强制执行准确的对象放置和采样期间的对象平衡。该方法轻量级、即插即用，并且与任何扩散主干兼容。我们在VISOR和T2I-CompBench上的全面评估表明，InfSplign建立了新的最先进的水平（据我们所知），在最强的现有推理时基线方法上实现了显著的性能提升，并且甚至优于基于微调的方法。代码库可在GitHub上获得。

Summary / 总结

InfSplign is a training-free method that enhances spatial alignment in text-to-image diffusion models by adjusting noise during inference. It uses a compound loss based on cross-attention maps to ensure accurate object placement and balanced object presence. Experiments on VISOR and T2I-CompBench demonstrate that InfSplign outperforms existing inference-time baselines and even surpasses fine-tuning methods, setting a new state-of-the-art. The method is lightweight and can be easily integrated into any diffusion model. Code is available on GitHub.

InfSplign 是一种在推理时调整噪声以增强文本到图像扩散模型中空间对齐的方法，通过在每个去噪步骤中应用复合损失。它利用交叉注意力图来确保准确的对象放置和对象存在的平衡。在 VISOR 和 T2I-CompBench 上的实验表明，InfSplign 超过了现有的推理时基线方法，甚至超过了基于微调的方法，建立了新的最佳性能。该方法轻量级且可以轻松与任何扩散模型主干集成。

ShareChat: A Dataset of Chatbot Conversations in the Wild

Authors: Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, Thai Le

First: 2025-12-19T17:47:53+00:00 · Latest: 2025-12-19T17:47:53+00:00

Abstract

While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.

中文标题/摘要

标题：ShareChat：真实场景中的聊天机器人对话数据集

虽然大型语言模型（LLMs）已经演变成具有独特界面设计和功能的独立平台，但现有的公共数据集将模型视为通用文本生成器，剥离了积极塑造用户交互的界面上下文。为了解决这一限制，我们提出了ShareChat，这是一个跨平台的大规模语料库，包含142,808场对话和超过660,000个回合，这些数据是从五个主要平台（ChatGPT、Claude、Gemini、Perplexity和Grok）的公开共享链接中收集的。ShareChat的独特之处在于保留了标准日志中经常丢失的原生平台功能，包括推理痕迹、源链接和代码产物；数据涵盖101种语言，时间跨度为2023年4月至2025年10月。此外，ShareChat提供了比先前数据集长得多的上下文窗口和更大的交互深度。我们通过三种代表性的分析展示了数据集的多方面用途：（1）分析对话完整性以衡量用户意图满足度；（2）评估内容生成中的来源引用行为；（3）进行时间分析以追踪使用模式的变化。这项工作为社区提供了一个重要而及时的资源，用于理解真实场景中的用户-LLM聊天机器人交互。

Summary / 总结

The work addresses a limitation of existing public datasets, which treat large language models as generic text generators and strip away the interface context that shapes user interaction. ShareChat is a large-scale, cross-platform corpus of 142,808 conversations and over 660,000 turns collected from publicly shared URLs on five platforms (ChatGPT, Claude, Gemini, Perplexity, and Grok). It preserves native platform affordances such as reasoning traces, source links, and code artifacts, and spans 101 languages. Three representative analyses, covering conversation completeness, source citation behavior, and temporal usage patterns, demonstrate the dataset's utility for understanding user-LLM chatbot interactions.

研究旨在解决现有公共数据集将大型语言模型视为通用文本生成器的问题，这种做法忽略了影响用户交互的界面上下文。ShareChat 是一个大规模的跨平台数据集，包含来自五个主要平台的 142,808 次对话和超过 660,000 个回合，保留了原生平台功能，如推理痕迹、源链接和代码产物，并涵盖 101 种语言。论文通过三项代表性分析（对话完整性、内容生成中的来源引用行为、随时间变化的使用模式）展示了该数据集的用途。

ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges

Authors: Roshan Kenia, Xiaoman Zhang, Pranav Rajpurkar

First: 2025-12-19T17:44:40+00:00 · Latest: 2025-12-19T17:44:40+00:00

Comments: https://github.com/rajpurkarlab/ReX-MLE

Abstract

Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, R&D-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.

中文标题/摘要

标题：ReX-MLE：医疗成像挑战的自主代理基准

基于大型语言模型（LLMs）的自主编码代理现在可以解决许多通用软件和机器学习任务，但在解决复杂、领域特定的科学问题方面仍然无效。医疗成像是一个特别具有挑战性的领域，需要长时间的训练周期、高维数据处理以及专门的预处理和验证管道，而现有的代理基准未能充分衡量这些能力。为了解决这一差距，我们引入了ReX-MLE，这是一个包含20个挑战的基准，这些挑战源自涵盖多种成像模态和任务类型的高影响力医疗成像竞赛。与之前的ML代理基准不同，ReX-MLE评估了完整的端到端工作流程，要求代理在现实的计算和时间约束下独立管理数据预处理、模型训练和提交。我们用不同的LLM后端（GPT-5、Gemini、Claude）评估了最先进的代理（AIDE、ML-Master、R&D-Agent），观察到性能差距巨大：大多数提交的排名在人类专家的0百分位。失败的原因在于领域知识和工程限制。ReX-MLE揭示了这些瓶颈，并为开发领域意识自主AI系统提供了基础。

Summary / 总结

ReX-MLE is a benchmark for autonomous coding agents in medical imaging that addresses a gap in existing benchmarks by evaluating full end-to-end workflows. It consists of 20 challenges derived from high-impact medical imaging competitions, requiring agents to handle data preprocessing, model training, and submission under realistic compute and time constraints. State-of-the-art agents (AIDE, ML-Master, R&D-Agent) with different LLM backends perform poorly: most submissions rank in the 0th percentile compared to human experts, with failures stemming from domain-knowledge and engineering limitations.

ReX-MLE 的动机是评估自主编码代理在复杂、特定领域的科学问题上的表现，特别是医疗成像领域。基准包括来自高影响力医疗成像竞赛的 20 项挑战，测试完整的端到端工作流程。关键发现表明，最先进的代理表现不佳：与人类专家相比，大多数提交排在第 0 百分位，主要原因是领域知识和工程方面的限制。这突显了开发领域感知的自主人工智能系统的需求。
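
To make the "0th percentile" result concrete, here is a minimal sketch of one common way to place a score on a human leaderboard. The abstract does not state the benchmark's exact percentile definition, so the strict-comparison rule, the function name `percentile_rank`, and the example scores below are all assumptions made for illustration.

```python
def percentile_rank(agent_score, human_scores, higher_is_better=True):
    """Percentage of human submissions that the agent strictly beats.

    Assumed definition for illustration; ReX-MLE may compute it differently.
    """
    if not human_scores:
        raise ValueError("human_scores must not be empty")
    if higher_is_better:
        beaten = sum(1 for s in human_scores if agent_score > s)
    else:
        beaten = sum(1 for s in human_scores if agent_score < s)
    return 100.0 * beaten / len(human_scores)


# Made-up leaderboard of human scores (e.g. Dice, higher is better).
leaderboard = [0.55, 0.60, 0.72, 0.81]

weak_agent = percentile_rank(0.40, leaderboard)   # beats nobody
mid_agent = percentile_rank(0.70, leaderboard)    # beats two of four
```

Under this definition, an agent lands in the 0th percentile whenever every human submission scores at least as well as it does.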

Step-GUI Technical Report

Authors: Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Liu, Xin Zhou, Yifan Sui, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zihan Yan, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang

First: 2025-12-17T13:26:30+00:00 · Latest: 2025-12-19T17:36:21+00:00

Comments: 41 pages, 26 figures

Abstract

Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.

中文标题/摘要

标题：Step-GUI技术报告

多模态大型语言模型的最新进展为GUI自动化带来了前所未有的机会。然而，一个基本挑战仍然存在：如何高效地获取高质量的训练数据并保持注释可靠性？我们引入了一种由校准步骤奖励系统驱动的自我演进训练管道，该系统通过轨迹级校准将模型生成的轨迹转化为可靠的训练信号，实现了超过90%的注释准确率，同时成本降低了10-100倍。利用这一管道，我们引入了Step-GUI这一系列模型（4B/8B），在保持稳健通用能力的同时，实现了最先进的GUI性能（8B：80.2% AndroidWorld，48.5% OSWorld，62.6% ScreenShot-Pro）。随着GUI代理能力的提升，实际部署需求标准化接口以跨异构设备保护用户隐私。为此，我们提出了GUI-MCP，这是第一个用于GUI自动化的模型上下文协议，具有分层架构，结合了低级原子操作和高级任务委托给本地专家模型，实现高隐私执行，敏感数据保留在设备上。最后，为了评估代理是否能够处理真实的日常使用，我们引入了AndroidDaily，这是一个基于真实移动使用模式的基准，包含3146个静态动作和235个端到端任务，覆盖高频日常场景（8B：静态89.91%，端到端52.50%）。我们的工作推进了实用GUI代理的发展，并展示了在日常数字交互中实际部署的强大潜力。

Summary / 总结

This paper addresses the challenge of efficiently acquiring high-quality training data for GUI automation with a self-evolving training pipeline powered by the Calibrated Step Reward System. The pipeline converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving over 90% annotation accuracy at 10-100x lower cost. The Step-GUI models (4B/8B) trained with this pipeline reach state-of-the-art GUI performance (8B: 80.2% on AndroidWorld, 48.5% on OSWorld, 62.6% on ScreenShot-Pro) while maintaining robust general capabilities. The paper also introduces GUI-MCP, a Model Context Protocol for GUI automation that keeps sensitive data on-device, and AndroidDaily, a benchmark of 3146 static actions and 235 end-to-end tasks grounded in real-world mobile usage patterns.

研究旨在解决高效获取高质量GUI自动化训练数据的同时保持注释可靠性的挑战。引入了一种自演化训练管道，使用校准的步骤奖励系统，实现了超过90%的注释准确率，且成本降低了10到100倍。该管道用于开发Step-GUI模型系列，这些模型在各种基准测试中超越了现有GUI模型，同时保持了强大的通用能力。此外，研究还提出了GUI-MCP模型上下文协议，通过结合低级原子操作和高级任务委托到本地专家模型，增强了隐私保护，使敏感数据保留在设备上。研究还引入了AndroidDaily基准，基于实际移动使用模式，评估GUI代理在日常生活中的实用性。

RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation

Authors: Dongyub Jude Lee, Zhenyi Ye, Pengcheng He

First: 2025-07-29T20:35:35+00:00 · Latest: 2025-12-19T17:35:31+00:00

Abstract

Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have shown strong gains but typically rely on large, carefully curated preference triplets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), which replaces static triplets with on-policy, actor-conditioned refinements produced by a frozen teacher. At each step, the actor samples candidate translations, the teacher performs a minimal local edit of each draft, and the actor is reinforced to close the gap using a composite reward that combines scaled negative edit distance for lexical and structural fidelity with COMET for semantic adequacy. This formulation yields a stable, model-aware learning signal without requiring explicit preference datasets. Experiments on FLORES-200 (English to German, Spanish, Chinese, Korean, and Japanese) show that RLfR consistently outperforms strong MT-SFT, DPO, and fixed-reference RL baselines, improving semantic quality and entity preservation, and also achieves superior performance under LLM-based judge evaluations.

中文标题/摘要

标题：基于教师模型精炼的强化学习：面向机器翻译的渐进式模仿学习

机器翻译（MT）的偏好学习方法，如直接偏好优化（DPO），已经显示出显著的改进，但通常依赖于大量精心策划的偏好三元组，并且往往难以在调优领域之外进行泛化。我们提出了基于教师模型精炼的强化学习（RLfR），用冻结的教师模型针对演员（actor）输出生成的同策略（on-policy）精炼结果取代静态三元组。在每一步中，演员采样候选译文，教师对每个草稿进行最小的局部编辑，演员则通过一个复合奖励得到强化以缩小差距；该奖励将衡量词汇和结构忠实度的缩放负编辑距离与衡量语义充分性的COMET相结合。这种形式化提供了一个稳定、模型感知的学习信号，而无需显式的偏好数据集。在FLORES-200（英语到德语、西班牙语、汉语、韩语和日语）上的实验表明，RLfR 一致地优于强大的MT-SFT、DPO和固定参考RL基线，提高了语义质量和实体保留，并且在基于LLM的评判中也实现了更好的性能。

Summary / 总结

The paper addresses the limitations of preference-learning methods in machine translation by proposing RLfR, which uses on-policy, actor-conditioned refinements produced by a frozen teacher instead of static triplets. The actor samples candidate translations, the teacher edits them minimally, and the actor is reinforced using a composite reward that combines lexical and structural fidelity with semantic adequacy. Experiments on FLORES-200 show that RLfR outperforms strong baselines in terms of semantic quality and entity preservation, and performs well under LLM-based evaluations.

该论文针对机器翻译中偏好学习方法的局限性，如需要大量精心策划的偏好数据集和较差的泛化能力。它提出了RLfR，用冻结的教师模型对演员输出所做的同策略（on-policy）精炼来取代静态偏好三元组，从而得到稳定的训练信号。演员生成候选译文，教师进行最小编辑，演员根据结合了缩放负编辑距离和COMET评分的复合奖励进行强化。实验结果表明，RLfR在语义质量和实体保留方面优于强基线，并且在基于LLM的评估中表现良好。
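
The composite reward described in the RLfR abstract above (a scaled negative edit distance plus a semantic adequacy score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token-level distance, the length normalization, the weights `alpha` and `beta`, and the `semantic_score` argument (a stand-in for a COMET score computed elsewhere) are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token lists (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def composite_reward(draft, refinement, semantic_score, alpha=1.0, beta=1.0):
    """Reward the actor for drafts that sit close to the teacher's refinement.

    semantic_score is a placeholder for a COMET-style adequacy score; the
    scaling by the longer sequence length is an assumed normalization.
    """
    draft_toks, ref_toks = draft.split(), refinement.split()
    scale = max(len(draft_toks), len(ref_toks), 1)
    edit_term = -edit_distance(draft_toks, ref_toks) / scale
    return alpha * edit_term + beta * semantic_score


# One token out of three was edited by the (hypothetical) teacher.
reward = composite_reward("der Hund schläft", "der Hund schlief", semantic_score=0.9)
```

When the teacher leaves the draft untouched, the edit term is zero and the reward reduces to the semantic score alone, so the actor is pushed toward drafts that need no repair.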

Exploiting ID-Text Complementarity via Ensembling for Sequential Recommendation

Authors: Liam Collins, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Donald Loveland, Leonardo Neves, Neil Shah

First: 2025-12-19T17:24:12+00:00 · Latest: 2025-12-19T17:24:12+00:00

Abstract

Modern Sequential Recommendation (SR) models commonly utilize modality features to represent items, motivated in large part by recent advancements in language and vision modeling. To do so, several works completely replace ID embeddings with modality embeddings, claiming that modality embeddings render ID embeddings unnecessary because they can match or even exceed ID embedding performance. On the other hand, many works jointly utilize ID and modality features, but posit that complex fusion strategies, such as multi-stage training and/or intricate alignment architectures, are necessary for this joint utilization. However, underlying both these lines of work is a lack of understanding of the complementarity of ID and modality features. In this work, we address this gap by studying the complementarity of ID- and text-based SR models. We show that these models do learn complementary signals, meaning that either should provide performance gain when used properly alongside the other. Motivated by this, we propose a new SR method that preserves ID-text complementarity through independent model training, then harnesses it through a simple ensembling strategy. Despite this method's simplicity, we show it outperforms several competitive SR baselines, implying that both ID and text features are necessary to achieve state-of-the-art SR performance but complex fusion architectures are not.

中文标题/摘要

标题：通过集成利用ID-文本互补性进行序列推荐

现代序列推荐(SR)模型通常利用模态特征来表示项目，这在很大程度上受到语言和视觉建模最近进展的推动。为此，一些工作完全用模态嵌入替换ID嵌入，声称模态嵌入使ID嵌入变得多余，因为它们可以匹配甚至超过ID嵌入的性能。另一方面，许多工作联合使用ID和模态特征，但认为复杂的融合策略，如多阶段训练和/或复杂的对齐架构，是这种联合使用的必要条件。然而，这两条研究路线都缺乏对ID和模态特征互补性的理解。在本文中，我们通过研究基于ID和基于文本的SR模型的互补性来填补这一空白。我们表明，这些模型确实学习了互补信号，这意味着在适当使用另一方时，任何一方都应提供性能提升。受此启发，我们提出了一种新的SR方法，该方法通过独立模型训练保留ID-文本互补性，然后通过简单的集成策略利用它。尽管该方法很简单，但我们证明它优于几种竞争的SR基线，表明要实现最先进的SR性能，ID和文本特征都是必要的，但复杂的融合架构不是。

Summary / 总结

This work addresses the gap in understanding the complementarity between ID and text features in sequential recommendation models. It proposes a method that preserves this complementarity through independent training and simple ensembling, outperforming several competitive baselines. The study demonstrates that both ID and text features are necessary for achieving state-of-the-art performance, but complex fusion architectures are not required.

该研究解决了ID和文本特征在序列推荐模型中互补性的理解不足问题。它提出了一种通过独立模型训练和简单集成来保留这种互补性的方法，并且在多个竞争性基线模型中表现出色。研究显示，为了实现序列推荐的最先进性能，ID和文本特征都是必要的，而不需要复杂的融合架构。
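
The "train independently, then ensemble" recipe in the abstract above can be sketched as a simple score-level combination. The abstract does not specify the ensembling rule, so the z-score normalization, the `weight` parameter, and the function names here are assumptions chosen only to show the shape of the idea.

```python
def zscore(scores):
    """Standardize a dict of item -> score so two models are comparable."""
    values = list(scores.values())
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all scores are equal
    return {item: (v - mean) / std for item, v in scores.items()}


def ensemble_rank(id_scores, text_scores, weight=0.5):
    """Rank candidate items by a weighted sum of standardized scores.

    id_scores / text_scores stand in for the next-item scores produced by
    two separately trained models (hypothetical inputs).
    """
    id_z, text_z = zscore(id_scores), zscore(text_scores)
    combined = {
        item: weight * id_z[item] + (1 - weight) * text_z[item]
        for item in id_scores
    }
    return sorted(combined, key=combined.get, reverse=True)


id_scores = {"A": 2.0, "B": 1.0, "C": 0.0}    # ID-based model prefers A
text_scores = {"A": 0.0, "B": 1.0, "C": 5.0}  # text-based model prefers C
ranking = ensemble_rank(id_scores, text_scores)
```

Because the two models are only combined at scoring time, neither needs a fusion architecture or multi-stage training, which is the simplicity the abstract emphasizes.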

Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

Authors: Yue Li, Qi Ma, Runyi Yang, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Theo Gevers, Luc Van Gool, Danda Pani Paudel, Martin R. Oswald

First: 2025-12-19T17:22:35+00:00 · Latest: 2025-12-19T17:22:35+00:00

Abstract

While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians' centers, colors, estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point clouds baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.

中文标题/摘要

标题：Chorus：面向整体式3D高斯场景编码的多教师预训练

虽然3DGS已成为一种高保真场景表示，但直接从其基元中编码丰富的通用特征仍然未被充分探索。为填补这一空白，我们提出了Chorus，这是一种多教师预训练框架，通过从2D基础模型中蒸馏互补信号，学习一个整体式的前馈3D高斯泼溅（3DGS）场景编码器。Chorus使用共享的3D编码器和教师特定的投影器，从语言对齐、通用和对象感知的教师中学习，促成一个共享的嵌入空间，捕捉从高层语义到精细结构的信号。我们在一系列任务上评估Chorus：开放词汇语义和实例分割、线性探针和解码器探针，以及数据高效监督。除了3DGS，我们还在几个仅支持点云的基准上测试Chorus，方法是预训练一个仅以高斯的中心、颜色和估计法线为输入的变体。有趣的是，这个编码器表现出强大的迁移能力，在训练场景少39.9倍的情况下仍优于点云基线。最后，我们提出了一种渲染并蒸馏（render-and-distill）的适配方法，便于域外微调。我们的代码和模型将在发表时发布。

Summary / 总结

Chorus is a multi-teacher pretraining framework that addresses the under-explored problem of learning rich, general-purpose features directly from 3D Gaussian Splatting (3DGS) primitives. It employs a shared 3D encoder and teacher-specific projectors to distill complementary signals from language-aligned, generalist, and object-aware 2D foundation models. Chorus is evaluated on open-vocabulary semantic and instance segmentation, linear and decoder probing, and data-efficient supervision. A variant pretrained on only Gaussian centers, colors, and estimated normals transfers well to point-cloud benchmarks, outperforming the point-cloud baseline while using 39.9 times fewer training scenes.

Chorus 是一个多教师预训练框架，旨在直接从 3D 高斯泼溅（3DGS）基元中提取丰富的通用特征。它使用共享的 3D 编码器和特定于教师的投影器，从语言对齐、通用和对象感知的教师中学习，以捕捉从高层语义到精细结构的信号。Chorus 在语义和实例分割、线性和解码器探针等任务上进行了评估，显示出强大的迁移性能；其仅使用高斯中心、颜色和估计法线的变体在训练场景少 39.9 倍的情况下优于点云基线。

LLM-based Behaviour Driven Development for Hardware Design

Authors: Rolf Drechsler, Qian Liu

First: 2025-12-19T17:19:08+00:00 · Latest: 2025-12-19T17:19:08+00:00

Comments: 7 pages, keynote given at 2nd International Symposium on Artificial Intelligence and Internet of Things (AIIoT-25), December 22-24th, 2025

Abs · PDF · Code1 · Code2

Abstract

Test and verification are essential activities in hardware and system design, but their complexity grows significantly with increasing system sizes. While Behavior Driven Development (BDD) has proven effective in software engineering, it is not yet well established in hardware design, and its practical use remains limited. One contributing factor is the manual effort required to derive precise behavioral scenarios from textual specifications. Recent advances in Large Language Models (LLMs) offer new opportunities to automate this step. In this paper, we investigate the use of LLM-based techniques to support BDD in the context of hardware design.

中文标题/摘要

标题：基于LLM的行为驱动开发在硬件设计中的应用

测试和验证是硬件和系统设计中的关键活动，但随着系统规模的增大，其复杂性显著增加。尽管行为驱动开发（BDD）在软件工程中已被证明是有效的，但在硬件设计中尚未得到广泛应用，其实际应用也受到限制。其中一个原因是需要手动从文本规范中推导出精确的行为场景。近期大型语言模型（LLMs）的进步为自动化这一过程提供了新的机会。在本文中，我们探讨了使用基于LLM的技术来支持硬件设计中的BDD。

Summary / 总结

The research addresses the growing complexity of test and verification in hardware design. Behavior Driven Development (BDD) is well established in software engineering but not in hardware design, partly because deriving precise behavioral scenarios from textual specifications takes manual effort. The paper investigates how Large Language Model (LLM)-based techniques can automate this step and thereby support BDD for hardware design.

该研究针对硬件设计中测试和验证日益增长的复杂性。行为驱动开发（BDD）在软件工程中已相当成熟，但在硬件设计中尚未普及，原因之一是需要手动从文本规范中推导出精确的行为场景。论文探讨了如何利用基于大型语言模型（LLM）的技术自动化这一步骤，从而支持硬件设计中的BDD。

Domain-Aware Quantum Circuit for QML

Authors: Gurinder Singh, Thaddeus Pellegrini, Kenneth M. Merz,

First: 2025-12-19T17:02:58+00:00 · Latest: 2025-12-19T17:02:58+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Designing parameterized quantum circuits (PQCs) that are expressive, trainable, and robust to hardware noise is a central challenge for quantum machine learning (QML) on noisy intermediate-scale quantum (NISQ) devices. We present a Domain-Aware Quantum Circuit (DAQC) that leverages image priors to guide locality-preserving encoding and entanglement via non-overlapping DCT-style zigzag windows. The design employs interleaved encode-entangle-train cycles, where entanglement is applied among qubits hosting neighboring pixels, aligned to device connectivity. This staged, locality-preserving information flow expands the effective receptive field without deep global mixing, enabling efficient use of limited depth and qubits. The design concentrates representational capacity on short-range correlations, reduces long-range two-qubit operations, and encourages stable optimization, thereby mitigating depth-induced and globally entangled barren-plateau effects. We evaluate DAQC on MNIST, FashionMNIST, and PneumoniaMNIST datasets. On quantum hardware, DAQC achieves performance competitive with strong classical baselines (e.g., ResNet-18/50, DenseNet-121, EfficientNet-B0) and substantially outperforms Quantum Circuit Search (QCS) baselines. To the best of our knowledge, DAQC, which uses a quantum feature extractor with only a linear classical readout (no deep classical backbone), currently achieves the best reported performance on real quantum hardware for QML-based image classification tasks. Code and pretrained models are available at: https://github.com/gurinder-hub/DAQC.

中文标题/摘要

标题：面向域的量子电路用于量子机器学习

设计具有表达力、可训练且对硬件噪声具有鲁棒性的参数化量子电路(PQCs)，是在含噪声中等规模量子(NISQ)设备上进行量子机器学习(QML)的核心挑战。我们提出了一种面向域的量子电路(DAQC)，该电路利用图像先验知识来指导保局部性编码和通过非重叠的DCT风格Z字形窗口实现纠缠。该设计采用交错的编码-纠缠-训练循环，其中纠缠在邻近像素所处的量子比特之间进行，与设备连接性对齐。这种分阶段的、保局部性的信息流扩展了有效感受野，而无需进行深度全局混合，从而能够高效地利用有限的深度和量子比特。该设计将表示能力集中在短程相关性上，减少了长程两量子比特操作，促进了稳定的优化，从而缓解了深度诱导和全局纠缠导致的贫瘠高原效应。我们在MNIST、FashionMNIST和PneumoniaMNIST数据集上评估了DAQC。在量子硬件上，DAQC在性能上与强大的经典基线(如ResNet-18/50、DenseNet-121、EfficientNet-B0)相当，并且显著优于量子电路搜索(QCS)基线。据我们所知，DAQC仅使用一个带有线性经典读出的量子特征提取器(没有深度经典主干)，目前在基于QML的图像分类任务中实现了在真实量子硬件上报告的最佳性能。代码和预训练模型可在 https://github.com/gurinder-hub/DAQC 获取。
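
The "non-overlapping DCT-style zigzag windows" in the DAQC abstract refer to the zigzag scan familiar from JPEG coefficient ordering. The sketch below shows only that standard scan applied inside non-overlapping windows. How DAQC actually maps pixels to qubits is not stated in the abstract, so the function names, the window size, and the assumption that the image dimensions are divisible by the window size are all illustrative.

```python
def zigzag_order(n):
    """Return (row, col) pairs of an n x n block in JPEG-style zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col indexes one anti-diagonal
        lo, hi = max(0, s - n + 1), min(s, n - 1)
        rows = range(lo, hi + 1)
        # Even diagonals run bottom-left to top-right, odd ones the reverse.
        if s % 2 == 0:
            rows = reversed(rows)
        order.extend((r, s - r) for r in rows)
    return order

def window_pixel_order(height, width, w):
    """Split a grid into non-overlapping w x w windows, zigzag inside each.

    Assumes height and width are multiples of w (hypothetical constraint).
    """
    zz = zigzag_order(w)
    return [[(r0 + r, c0 + c) for r, c in zz]
            for r0 in range(0, height, w)
            for c0 in range(0, width, w)]

windows = window_pixel_order(4, 4, 2)
print(zigzag_order(3))
print(windows[0])
```

Consecutive entries in a zigzag order are always adjacent pixels, which is why such an ordering keeps neighboring pixels on neighboring positions in a one-dimensional layout.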

Reinforced Generation of Combinatorial Structures: Hardness of Approximation

Authors: Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta

First: 2025-09-22T17:30:33+00:00 · Latest: 2025-12-19T16:58:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Can AI-based methods help us make advances in complexity theory? We provide evidence towards answering this in the affirmative, using AlphaEvolve (an LLM code mutation agent) to obtain new results in three settings: a) We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as $163$ vertices, and our upper bounds are obtained via analytical arguments. b) We obtain new inapproximability results for MAX-4-CUT and MAX-3-CUT, proving that it is NP-hard to approximate them within factors of $0.987$ and $0.9649$ respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of $0.9883$, and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of $0.9853$, but falls short of the SOTA of $16/17$ that relies on a custom PCP (rather than a reduction from "standard" Håstad-style PCPs). c) Inapproximability for the metric Traveling Salesman Problem (TSP): We show that it is NP-hard to approximate the minimum cost tour within a factor of $111/110$ using AlphaEvolve to discover a new gadget, thus improving the SOTA of $117/116$. Along the way, we provide new modular soundness and completeness arguments that can be of independent interest. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (sometimes requiring time exponential in the size of the construction). We used AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by $10,000\times$ for our gadgets). Our results suggest that gadget-based proofs would benefit from a pass through AI-based tools to obtain stronger results.

Summary / 总结

This paper explores whether AI methods can advance complexity theory, using AlphaEvolve to obtain new results on certification for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs, on MAX-4-CUT and MAX-3-CUT, and on metric TSP. AlphaEvolve helped construct nearly extremal Ramanujan graphs on up to 163 vertices and discover new gadget reductions. The MAX-4-CUT hardness factor of 0.987 improves on the previous best of 0.9883. The MAX-3-CUT factor of 0.9649 is the best gadget-based result but falls short of the 16/17 obtained with a custom PCP. For metric TSP, the hardness factor improves from 117/116 to 111/110. The study also provides new modular soundness and completeness arguments. A key challenge was the cost of verifying AlphaEvolve's constructions; the verification procedure was itself evolved with AlphaEvolve to run faster, sometimes by a factor of 10,000.

该论文探讨了AI方法是否能推进复杂性理论，使用AlphaEvolve在随机3-正则和4-正则图上的MAX-CUT与MAX-独立集认证、MAX-4-CUT、MAX-3-CUT以及度量TSP上取得了新结果。AlphaEvolve帮助构造了顶点数多达163的近极值拉马努金图，并发现了新的gadget归约。MAX-4-CUT的不可近似因子0.987优于此前最好的0.9883；MAX-3-CUT的因子0.9649是基于gadget的最好结果，但仍不及依赖定制PCP得到的16/17。度量TSP的不可近似因子由117/116改进为111/110。研究还提供了新的模块化可靠性和完备性论证。一个主要挑战是验证AlphaEvolve构造的代价很高；作者用AlphaEvolve自身演化验证程序，使其加速（有时可达10,000倍）。
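
The hardness factors quoted in this entry are easier to compare once written as exact fractions. The short check below uses only the numbers from the abstract; the variable names are illustrative. For a maximization problem a smaller hardness factor is the stronger statement, while for a minimization problem such as metric TSP a larger factor is stronger.

```python
from fractions import Fraction

# Maximization problems: a smaller hardness factor is a stronger result.
max4cut_new, max4cut_prev = Fraction("0.987"), Fraction("0.9883")
max3cut_new = Fraction("0.9649")
max3cut_gadget_prev = Fraction("0.9853")
max3cut_custom_pcp = Fraction(16, 17)

# Minimization problem (metric TSP): a larger hardness factor is stronger.
tsp_new, tsp_prev = Fraction(111, 110), Fraction(117, 116)

print(round(float(max3cut_custom_pcp), 4))            # 16/17 as a decimal
print(round(float(tsp_new), 5), round(float(tsp_prev), 5))
print(max4cut_new < max4cut_prev)                      # MAX-4-CUT improves
print(max3cut_custom_pcp < max3cut_new < max3cut_gadget_prev)
print(tsp_new > tsp_prev)                              # TSP improves
```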

On the dynamic evolution of CLIP texture-shape bias and its relationship to human alignment and model robustness

Authors: Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Alexandra Gómez-Villa, Jorge Vila-Tomás, Valero Laparra, Jesus Malo

First: 2025-08-13T13:47:34+00:00 · Latest: 2025-12-19T16:47:41+00:00

Abs · PDF · Code1 · Code2

Abstract

Contrastive language-image models such as CLIP have demonstrated remarkable generalization capabilities. However, how their internal visual representations evolve during training and how this evolution relates to human perception remains poorly understood. Most existing analyses characterize fully trained models, leaving the dynamics of representational biases and perceptual alignment largely unexplored. In this work, we present an epoch-by-epoch analysis of CLIP models throughout training, focusing on the evolution of texture-shape bias, alignment with human perceptual judgements, and sensitivity to image noise. Using multiple perceptual benchmarks spanning low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noise robustness, we identify a consistent, training-stage-dependent representational transition. Early training stages exhibit strong texture bias, elevated alignment with low-level human perceptual measures, and increased sensitivity to Gaussian noise perturbations. As training progresses, this texture bias gradually diminishes in favor of more shape-based representations, coinciding with improved robustness to noise and a decline in low-level perceptual alignment. Importantly, these dynamics are consistently observed across multiple CLIP model scales, indicating that the phenomenon is not specific to a particular architecture size. Our findings provide an empirical characterization of how perceptual alignment, feature bias, and robustness co-evolve during multimodal model training. This work reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human visual processing.

中文标题/摘要

标题：CLIP 纹理-形状偏见的动态演变及其与人类对齐和模型鲁棒性的关系

对比语言-图像模型如CLIP展示了卓越的泛化能力。然而，它们在训练过程中内部视觉表示如何演变以及这种演变与人类感知之间的关系仍知之甚少。现有大多数分析仅针对完全训练好的模型，而表征偏见和感知对齐的动态变化则鲜有探索。在本研究中，我们对CLIP模型在整个训练过程中进行了逐epoch分析，重点关注纹理-形状偏见的演变、与人类感知判断的对齐以及对图像噪声的敏感性。通过涵盖低级图像质量评估、中级感知相似性、显著性对应和噪声鲁棒性的多个感知基准，我们发现了一种一致的、与训练阶段相关的表征转变。早期训练阶段表现出强烈的纹理偏见、与低级人类感知度量更高的对齐程度以及对高斯噪声扰动更高的敏感性。随着训练的进行，这种纹理偏见逐渐减少，更倾向于基于形状的表示，同时噪声鲁棒性提高，低级感知对齐下降。重要的是，这些动态在多个CLIP模型规模中一致出现，表明该现象并非特定于某种架构规模。我们的研究结果提供了关于感知对齐、特征偏见和鲁棒性在多模态模型训练过程中如何共同演变的实证描述。这项工作揭示了早期低级感知对齐与后期鲁棒性之间的系统性权衡，为视觉-语言模型的表示动力学及其与人类视觉处理的关系提供了新的见解。

Summary / 总结

This study analyzes the evolution of CLIP models during training, focusing on the development of texture-shape bias, alignment with human perception, and robustness to image noise. It reveals that early training stages show strong texture bias and high alignment with low-level perceptual measures, but as training progresses, the models become more shape-based and robust to noise, with a decline in low-level perceptual alignment. This transition is consistent across different model scales, indicating a general trade-off between early perceptual alignment and later robustness.

本研究分析了CLIP模型在训练过程中纹理-形状偏见的演变及其与人类感知和模型鲁棒性的关系。通过对CLIP模型逐epoch的分析，研究发现了一种一致的过渡，即从强烈的纹理偏见和低级感知对齐到更多基于形状的表示以及更好的鲁棒性。早期训练阶段对高斯噪声表现出高敏感性，而后期阶段则表现出更好的鲁棒性和较低级感知对齐的减少。
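
Texture-shape bias is commonly measured on cue-conflict images, where each image has one label for its shape and a different label for its texture. The abstract does not state which metric the authors use, so the sketch below shows only the widely used definition (the share of shape-or-texture-consistent decisions that follow shape), with made-up labels and predictions.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape-or-texture-consistent decisions that follow shape."""
    shape_hits = sum(p == s for p, s in zip(predictions, shape_labels))
    texture_hits = sum(p == t for p, t in zip(predictions, texture_labels))
    decided = shape_hits + texture_hits
    # Undefined when no prediction matches either cue.
    return shape_hits / decided if decided else float("nan")

# Hypothetical predictions on four cue-conflict images at two checkpoints.
shape = ["cat", "dog", "car", "bird"]
texture = ["dog", "car", "bird", "cat"]
early = ["dog", "car", "car", "cat"]   # mostly follows texture
late = ["cat", "dog", "car", "boat"]   # mostly follows shape

print(shape_bias(early, shape, texture))
print(shape_bias(late, shape, texture))
```

Tracking this value once per epoch would give the kind of curve the abstract describes, falling texture bias and rising shape bias as training progresses.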

DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports

Authors: Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak Lee

First: 2025-12-19T16:46:20+00:00 · Latest: 2025-12-19T16:46:20+00:00

Comments: Work in progress

Abs · PDF · Code1 · Code2

Abstract

As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.

中文标题/摘要

标题：DEER：全面可靠的深度研究专家报告基准

随着大型语言模型（LLMs）的发展，深度研究系统可以通过多步推理和基于证据的综合生成专家级报告，但评估此类报告仍然具有挑战性。现有基准往往缺乏系统性的专家报告标准，依赖LLM评判员的评估可能无法捕捉到需要专家判断的问题，而来源验证通常仅覆盖明确引用的陈述的有限子集，而不是报告整体的事实可靠性。我们引入了DEER，一个用于评估专家级深度研究报告的基准。DEER 包含50项报告写作任务，涵盖13个领域，并提供了一个基于专家的评估分类体系（7个维度，25个子维度），具体化为130个细粒度的评分项目。DEER 进一步提供了针对特定任务的专家指导，以帮助LLM评判员更一致地评估专家级报告质量。除了基于评分的评估，我们还提出了一种文档级事实核查架构，提取并验证报告中所有声明，包括引用和未引用的声明，并量化外部证据质量。DEER 与人类专家判断密切相关，并提供了系统的强项和弱点的可解释诊断。

Summary / 总结

DEER is a benchmark for evaluating expert-level deep research reports, addressing the limitations of existing benchmarks by providing a comprehensive evaluation taxonomy with 7 dimensions and 130 rubric items. It includes task-specific expert guidance and a document-level fact-checking architecture to verify all claims, both cited and uncited, across the entire report. DEER correlates well with human expert judgments and offers interpretable diagnostics of system strengths and weaknesses.

DEER 是一个用于评估专家级深度研究报告的基准，解决了现有基准的局限性。它包括13个领域的50个报告写作任务，一个基于专家的评估分类法，包含130个细目项，并提供了针对每个任务的专家指导。DEER 还提出了一种文档级事实核查架构，用于验证报告中所有声明，包括已引用和未引用的声明，并量化外部证据的质量。该基准与人类专家判断紧密相关，并提供了系统强项和弱点的可解释诊断。

MedNeXt-v2: Scaling 3D ConvNeXts for Large-Scale Supervised Representation Learning in Medical Image Segmentation

Authors: Saikat Roy, Yannick Kirchhoff, Constantin Ulrich, Maximillian Rokuss, Tassilo Wald, Fabian Isensee, Klaus Maier-Hein

First: 2025-12-19T16:45:23+00:00 · Latest: 2025-12-19T16:45:23+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large-scale supervised pretraining is rapidly reshaping 3D medical image segmentation. However, existing efforts focus primarily on increasing dataset size and overlook the question of whether the backbone network is an effective representation learner at scale. In this work, we address this gap by revisiting ConvNeXt-based architectures for volumetric segmentation and introducing MedNeXt-v2, a compound-scaled 3D ConvNeXt that leverages improved micro-architecture and data scaling to deliver state-of-the-art performance. First, we show that routinely used backbones in large-scale pretraining pipelines are often suboptimal. Subsequently, we use comprehensive backbone benchmarking prior to scaling and demonstrate that stronger from-scratch performance reliably predicts stronger downstream performance after pretraining. Guided by these findings, we incorporate a 3D Global Response Normalization module and use depth, width, and context scaling to improve our architecture for effective representation learning. We pretrain MedNeXt-v2 on 18k CT volumes and demonstrate state-of-the-art performance when fine-tuning across six challenging CT and MR benchmarks (144 structures), showing consistent gains over seven publicly released pretrained models. Beyond these improvements, our benchmarking of these models also reveals that stronger backbones yield better results on similar data, that representation scaling disproportionately benefits pathological segmentation, and that modality-specific pretraining offers negligible benefit once full finetuning is applied. In conclusion, our results establish MedNeXt-v2 as a strong backbone for large-scale supervised representation learning in 3D Medical Image Segmentation. Our code and pretrained models are made available with the official nnUNet repository at: https://www.github.com/MIC-DKFZ/nnUNet

中文标题/摘要

标题：MedNeXt-v2: 为医学图像分割中的大规模监督表示学习扩展3D ConvNeXts

大规模监督预训练正在迅速改变3D医学图像分割。然而，现有努力主要集中在增加数据集规模上，而忽视了骨干网络在大规模下是否是一个有效的表示学习者的问题。在本文中，我们通过重新审视基于ConvNeXt的体素分割架构并引入MedNeXt-v2，一种复合扩展的3D ConvNeXt，利用改进的微架构和数据扩展来实现最先进的性能来填补这一空白。首先，我们展示了在大规模预训练管道中常规使用的骨干网络往往是次优的。随后，我们在扩展之前进行了全面的骨干网络基准测试，并证明了从头开始的更强性能可靠地预测了预训练后的更强下游性能。根据这些发现，我们引入了3D全局响应归一化模块，并通过深度、宽度和上下文扩展来改进我们的架构，以实现有效的表示学习。我们在18000个CT体积上预训练了MedNeXt-v2，并在六个具有挑战性的CT和MR基准（144种结构）上进行微调，展示了相对于七个公开发布的预训练模型的一致改进。除了改进之外，我们对这些模型的基准测试还揭示了更强的骨干网络在相似数据上表现更好，表示扩展对病理分割的收益不成比例，以及模态特定预训练在完全微调后几乎没有益处。总之，我们的结果确立了MedNeXt-v2作为3D医学图像分割中大规模监督表示学习的强骨干。我们的代码和预训练模型已与官方nnUNet仓库一起提供：https://www.github.com/MIC-DKFZ/nnUNet

Summary / 总结

This study addresses the gap in large-scale supervised pretraining for 3D medical image segmentation by revisiting ConvNeXt-based architectures and introducing MedNeXt-v2. The research demonstrates that commonly used backbones are often suboptimal and that stronger from-scratch performance predicts better downstream results after pretraining. MedNeXt-v2 incorporates a 3D Global Response Normalization module and uses depth, width, and context scaling to enhance representation learning. Pretrained on 18k CT volumes and fine-tuned across six challenging CT and MR benchmarks (144 structures), MedNeXt-v2 shows state-of-the-art performance, with consistent gains over seven publicly released pretrained models.

该研究通过重新审视基于ConvNeXt的架构并引入MedNeXt-v2，填补了大规模监督预训练在3D医学图像分割中的空白。作者表明，常用的骨干网络往往是次优的，而从头训练时性能更强的骨干网络在预训练后的下游任务中表现也更好。MedNeXt-v2引入了3D全局响应归一化模块，并使用深度、宽度和上下文缩放来改进表示学习。通过在18k个CT体数据上预训练MedNeXt-v2并在六个CT和MR基准测试中进行微调，该模型实现了最先进的性能，一致地优于七个公开发布的预训练模型。研究还发现，表示扩展对病理分割的收益尤为显著，而特定模态的预训练在全微调后几乎没有优势。
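
The abstract mentions a 3D Global Response Normalization (GRN) module without giving its form. The sketch below takes the GRN formulation known from ConvNeXt V2 and extends the spatial reduction to three axes. The `(C, D, H, W)` layout, the function name, and the parameter shapes are assumptions, and the MedNeXt-v2 implementation may differ.

```python
import numpy as np

def grn_3d(x, gamma, beta, eps=1e-6):
    """Global Response Normalization for a (C, D, H, W) feature volume."""
    # Per-channel global L2 norm over the three spatial axes.
    gx = np.sqrt(np.sum(x ** 2, axis=(1, 2, 3), keepdims=True))
    # Divisive normalization by the mean norm across channels.
    nx = gx / (gx.mean(axis=0, keepdims=True) + eps)
    # Learnable scale and shift plus a residual connection.
    return gamma * (x * nx) + beta + x

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4, 4))      # hypothetical 8-channel volume
gamma = np.ones((8, 1, 1, 1))          # hypothetical parameter values
beta = np.zeros((8, 1, 1, 1))
y = grn_3d(x, gamma, beta)
print(y.shape)
```

With `gamma` and `beta` set to zero the function returns its input unchanged, so the module starts as an identity mapping when initialized that way.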

Easy Adaptation: An Efficient Task-Specific Knowledge Injection Method for Large Models in Resource-Constrained Environments

Authors: Dong Chen, Zhengqing Hu, Shixing Zhao, Yibo Guo

First: 2025-12-19T16:43:07+00:00 · Latest: 2025-12-19T16:43:07+00:00

Abs · PDF · Code1 · Code2

Abstract

While the enormous parameter scale endows Large Models (LMs) with unparalleled performance, it also limits their adaptability across specific tasks. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a critical approach for effectively adapting LMs to a diverse range of downstream tasks. However, existing PEFT methods face two primary challenges: (1) High resource cost. Although PEFT methods significantly reduce resource demands compared to full fine-tuning, they still require substantial time and memory, making them impractical in resource-constrained environments. (2) Parameter dependency. PEFT methods heavily rely on updating a subset of parameters associated with LMs to incorporate task-specific knowledge. Yet, due to increasing competition in the LMs landscape, many companies have adopted closed-source policies for their leading models, offering access only via Application Programming Interfaces (APIs). Moreover, the expense is often prohibitive and difficult to sustain, as the fine-tuning process of LMs is extremely slow. Even if small models perform far worse than LMs in general, they can achieve superior results on particular distributions while requiring only minimal resources. Motivated by this insight, we propose Easy Adaptation (EA), which designs Specific Small Models (SSMs) to complement the underfitted data distribution for LMs. Extensive experiments show that EA matches the performance of PEFT on diverse tasks without accessing LM parameters, and requires only minimal resources.

中文标题/摘要

标题：易于适应：一种针对资源受限环境的大模型任务特定知识注入方法

尽管庞大的参数规模赋予了大模型（LMs）无与伦比的性能，但也限制了它们在特定任务上的适应性。参数高效微调（PEFT）已成为有效适应LMs到各种下游任务的关键方法。然而，现有的PEFT方法面临两个主要挑战：（1）高资源成本。尽管PEFT方法相比全微调显著降低了资源需求，但仍需要大量时间和内存，使其在资源受限环境中不切实际。（2）参数依赖性。PEFT方法高度依赖于更新与LMs相关的参数子集以融入任务特定知识。然而，由于LMs领域的竞争加剧，许多公司已采用闭源政策，仅通过应用程序编程接口（APIs）提供其领先模型的访问权限。这往往成本高昂且难以持续，因为LMs的微调过程极其缓慢。即使小型模型在一般情况下远不如LMs表现，但在特定分布上可以取得更优结果，且仅需少量资源。受此启发，我们提出了易于适应（EA），设计特定小型模型（SSMs）以补充LMs的欠拟合数据分布。广泛的实验表明，EA在不访问LM参数的情况下，能够匹配PEFT在各种任务上的性能，并且仅需少量资源。

Summary / 总结

The research aims to address the limitations of Parameter-Efficient Fine-Tuning (PEFT) methods in resource-constrained environments, such as high resource cost and parameter dependency. The proposed method, Easy Adaptation (EA), designs Specific Small Models (SSMs) to complement the underfitted data distribution of Large Models (LMs). Experiments demonstrate that EA matches PEFT performance on various tasks without accessing LM parameters and requires minimal resources.

研究旨在通过提出Easy Adaptation (EA)方法解决Parameter-Efficient Fine-Tuning (PEFT)方法在资源受限环境下的局限性，该方法设计Specific Small Models (SSMs)来补充大型模型（LMs）的欠拟合数据分布。该方法避免访问LM参数并显著减少资源需求。实验表明，EA在各种任务上的性能与PEFT相当，同时只需要少量资源。

ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics

Authors: Li S. Yifei, Allen Chang, Chaitanya Malaviya, Mark Yatskar

First: 2025-08-30T13:37:28+00:00 · Latest: 2025-12-19T16:38:02+00:00

Comments: 12 pages main, 40 pages total, 15 figures

Abs · PDF · Code1 · Code2

Abstract

Evaluating long-form responses to research queries heavily relies on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet, research expertise is abundant: survey articles consolidate knowledge spread across the literature. We introduce ResearchQA, a resource for evaluating LLM systems by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Queries and rubrics are jointly derived from survey sections, where rubric items list query-specific answer evaluation criteria, i.e., citing papers, making explanations, and describing limitations. 31 Ph.D. annotators in 8 fields judge that 90% of queries reflect Ph.D. information needs and 87% of rubric items warrant emphasis of a sentence or longer. We leverage ResearchQA to evaluate 18 systems in 7.6K head-to-heads. No parametric or retrieval-augmented system we evaluate exceeds 70% on covering rubric items, and the highest-ranking system shows 75% coverage. Error analysis reveals that the highest-ranking system fully addresses less than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.

中文标题/摘要

标题：ResearchQA：利用从综述中挖掘的问题和评分标准，在75个研究领域大规模评估学术问答

对研究问题的长篇回答进行评估高度依赖于专家注释员，因而评估往往局限于人工智能等研究者便于请同事帮忙的领域。然而，研究专业知识其实十分丰富：综述文章汇总了分散在文献中的知识。我们引入了ResearchQA，这是一种用于评估LLM系统的资源，它将75个研究领域的综述文章提炼为21000个查询和160000个评分标准项。查询和评分标准共同源自综述章节，其中评分标准项列出针对具体查询的答案评估标准，即引用论文、给出解释和描述局限性。8个领域中的31名博士注释员判断90%的查询反映了博士的信息需求，87%的评分标准项值得用一个句子或更长的篇幅加以强调。我们利用ResearchQA在7600次一对一比较中评估了18个系统。我们评估的任何参数化或检索增强系统在覆盖评分标准项方面均未超过70%，排名最高的系统覆盖率为75%。错误分析表明，排名最高的系统完全解决了不到11%的引用评分标准项、48%的局限性项和49%的比较项。我们发布了我们的数据，以促进更全面的多领域评估。

Summary / 总结

ResearchQA evaluates LLM systems by converting survey articles from 75 research fields into 21K queries and 160K rubric items, with 31 Ph.D. annotators validating the quality. The evaluation covers 18 systems in 7.6K head-to-head comparisons: no parametric or retrieval-augmented system exceeds 70% coverage of rubric items, and the highest-ranking system reaches 75%. Error analysis indicates that the highest-ranking system fully addresses less than 11% of citation items, 48% of limitation items, and 49% of comparison items.

ResearchQA将75个研究领域的综述文章转化为21K个查询和160K个评价项，并由31位博士注释员验证质量。对18个系统的评估显示，没有参数化或检索增强系统的评价项覆盖率超过70%，排名最高的系统达到75%。错误分析表明，排名最高的系统完全回答了不到11%的引用评价项、48%的局限性评价项和49%的比较评价项。