arXiv 论文速递

2026-01-10 03:24
Snapshot: 20260110_0324
Pixel-Perfect Visual Geometry Estimation
Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang
First: 2026-01-08T18:59:49+00:00 · Latest: 2026-01-08T18:59:49+00:00
Comments: Code: https://github.com/gangweix/pixel-perfect-depth
Abstract
Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
中文标题/摘要
标题:像素完美视觉几何估计
从图像中恢复干净且准确的几何结构对于机器人技术和增强现实至关重要。然而,现有的几何基础模型仍然严重受到飞点(flying pixels)和细节损失的影响。在本文中,我们提出了像素完美视觉几何模型,通过在像素空间中利用生成建模来预测高质量、无飞点的点云。我们首先介绍了像素完美深度(PPD),这是一种基于像素空间扩散Transformer(DiT)的单目深度基础模型。为了解决像素空间扩散带来的高计算复杂性,我们提出了两个关键设计:1)语义提示DiT,该设计结合了视觉基础模型的语义表示来提示扩散过程,保留全局语义同时增强细粒度视觉细节;2)级联DiT架构,逐步增加图像token的数量,提高效率和准确性。为了将PPD进一步扩展到视频(PPVD),我们引入了一种新的语义一致DiT,该设计从多视图几何基础模型中提取时间一致的语义。然后在DiT中进行参考引导的token传播,以最小的计算和内存开销保持时间连贯性。我们的模型在所有生成式单目和视频深度估计模型中表现最佳,并且产生的点云比其他所有模型都更干净。
Summary / 总结
This paper addresses the challenge of recovering clean and accurate geometry from images, which is essential for robotics and augmented reality. It introduces pixel-perfect visual geometry models that use generative modeling in the pixel space. The models, Pixel-Perfect Depth (PPD) and its video extension PPVD, build on pixel-space diffusion transformers (DiT) and add semantic prompting and a cascade architecture to preserve fine-grained details while reducing computational cost. The authors report the best performance among generative monocular and video depth estimation models and significantly cleaner point clouds than all other models.
本文旨在解决从图像中恢复干净准确的几何形状以应用于机器人和增强现实的问题。提出了像素完美的视觉几何模型,特别是Pixel-Perfect Depth (PPD)及其视频扩展PPVD,使用像素空间扩散Transformer(DiT)来预测无飞点的高质量点云。关键创新包括Semantics-Prompted DiT和Cascade DiT架构以提高效率和准确性,以及Semantics-Consistent DiT用于视频。这些模型在生成式单目和视频深度估计模型中表现最佳,生成的点云更为干净。
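The Cascade DiT idea in the abstract (fewer image tokens in early stages, more in later ones) can be illustrated with simple token-count arithmetic. This is a minimal sketch, not the authors' implementation: it assumes a ViT-style patchify step, and the input resolution and the two patch sizes (16 and 8) are hypothetical values chosen only for illustration.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Token count for a ViT-style patchify with square, non-overlapping patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Hypothetical cascade: coarse tokens in early blocks, fine tokens in later blocks.
HEIGHT, WIDTH = 512, 512        # assumed input resolution
CASCADE_PATCH_SIZES = [16, 8]   # assumed patch sizes, coarse to fine

tokens_per_stage = [num_tokens(HEIGHT, WIDTH, p) for p in CASCADE_PATCH_SIZES]

# Self-attention cost grows roughly with the square of the token count,
# so the cost of each stage is expressed relative to the finest stage.
relative_attention_cost = [(t / tokens_per_stage[-1]) ** 2 for t in tokens_per_stage]
```

Under these assumed numbers, halving the patch size quadruples the token count, which is why running only the later blocks at full token resolution can save a large share of the attention cost.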
Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration
Authors: Xingyi He, Adhitya Polavaram, Yunhao Cao, Om Deshmukh, Tianrui Wang, Xiaowei Zhou, Kuan Fang
First: 2026-01-08T18:59:30+00:00 · Latest: 2026-01-08T18:59:30+00:00
Comments: Project Page: https://cordex-manipulation.github.io/
Abstract
Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.
中文标题/摘要
标题:生成、转移、适应:从单个人类演示学习功能性灵巧抓取
功能性灵巧抓取对于使机器人手能够使用工具和进行复杂操作至关重要,但进展受限于两个持续存在的瓶颈:大规模数据集的稀缺性和学习模型中缺乏集成的语义和几何推理。在本文中,我们提出了CorDex框架,该框架能够从仅一个单个人类演示生成的合成数据中稳健地学习新物体的功能性灵巧抓取。我们方法的核心是一个基于对应关系的数据引擎,该引擎在仿真中生成多样且高质量的训练数据。基于人类演示,我们的数据引擎生成同一类别的多种物体实例,通过对应关系估计将专家抓取转移到生成的物体上,并通过优化进行抓取适应。基于生成的数据,我们引入了一个多模态预测网络,该网络整合了视觉和几何信息。通过设计局部-全局融合模块和重要性感知采样机制,我们实现了功能灵巧抓取的稳健且计算高效的预测。通过在各种物体类别上的广泛实验,我们证明了CorDex能够很好地泛化到未见过的物体实例,并显著优于最先进的基线。
Summary / 总结
The research aims to address the challenges of learning functional dexterous grasping from a single human demonstration, focusing on the scarcity of large-scale datasets and the lack of integrated semantic and geometric reasoning. The method involves generating diverse synthetic training data through a correspondence-based engine, transferring expert grasps to new objects, and adapting them through optimization. The key experimental findings show that CorDex generalizes well to unseen object instances and outperforms existing state-of-the-art methods across various object categories.
该研究提出了一种名为CorDex的框架,通过单一人类演示生成多样化的训练数据来解决功能灵巧抓取的学习挑战。该方法使用基于对应的数据引擎生成高质量的合成数据,通过对应估计转移专家抓取,并通过优化进行适应。多模态预测网络整合视觉和几何信息,提高了抓取预测的鲁棒性和效率。实验表明,CorDex在未见过的对象上表现出良好的泛化能力,并优于现有方法。
Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation
Authors: Emerson P. Grabke, Babak Taati, Masoom A. Haider
First: 2025-06-11T23:12:48+00:00 · Latest: 2026-01-08T18:59:27+00:00
Comments: Accepted for publication in IEEE Transactions on Biomedical Engineering, 2025. This is the accepted author version. The final published version is available at https://doi.org/10.1109/TBME.2025.3648426
Abstract
Objective: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM strategies typically rely on short-prompt text encoders, nonmedical LDMs, or large data volumes. These strategies can limit performance and scientific accessibility. We propose a novel LDM conditioning approach to address these limitations. Methods: We propose Class-Conditioned Efficient Large Language model Adapter (CCELLA), a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. We also propose a data-efficient LDM pipeline centered around CCELLA and a proposed joint loss function. We first evaluate our method on 3D prostate MRI against state-of-the-art. We then augment a downstream classifier model training dataset with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.070. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method during training improves classifier accuracy from 69% to 74% and outperforms classifiers trained on images generated by prior state-of-the-art. Classifier training solely on our method's synthetic images achieved comparable performance to real image training. Conclusion: We show that our method improved both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The proposed CCELLA-centric pipeline enables radiology report and class-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility.
中文标题/摘要
标题:利用临床文本和类别条件生成3D前列腺MRI
目标:潜在扩散模型(LDM)可以缓解医学成像领域机器学习开发中的数据稀缺挑战。然而,医学LDM策略通常依赖于简短提示文本编码器、非医学LDM或大量数据。这些策略可能会限制性能和科学可访问性。我们提出了一种新的LDM条件化方法来解决这些限制。方法:我们提出了类别条件高效大型语言模型适配器(CCELLA),这是一种新颖的双头条件化方法,同时用自由文本临床报告和放射学分类条件化LDM U-Net。我们还提出了一种以CCELLA为中心的数据高效LDM管道和一个提出的联合损失函数。我们首先在3D前列腺MRI上评估了我们的方法,与最先进的方法进行了比较。然后,我们使用我们方法生成的合成图像增强了下游分类器模型训练数据集。结果:我们的方法在大小受限的3D前列腺MRI数据集上实现了0.025的3D FID分数,显著优于最近的基础模型,该模型的FID为0.070。当训练前列腺癌预测分类器时,在训练过程中添加由我们方法生成的合成图像,分类器的准确率从69%提高到74%,并优于使用先前最先进的方法生成的图像训练的分类器。仅使用我们方法生成的合成图像进行分类器训练,其性能与使用真实图像训练的分类器相当。结论:我们展示了我们的方法在使用有限数据和最少的人工注释的情况下,提高了合成图像质量和下游分类器性能。意义:提出的CCELLA为中心的管道能够在有限的数据量和人工数据注释的情况下,使放射学报告和类别条件化的LDM训练用于高质量的医学图像合成,从而提高LDM性能和科学可访问性。
Summary / 总结
The research aims to address data scarcity challenges in medical imaging by proposing a novel latent diffusion model (LDM) conditioning approach called CCELLA. This method combines free-text clinical reports and radiology classification to condition the LDM U-Net, resulting in a data-efficient pipeline. The method achieves a 3D FID score of 0.025, outperforming a recent foundation model with a score of 0.070. Additionally, synthetic images generated by this method improve the accuracy of a downstream classifier for prostate cancer prediction from 69% to 74%. The study demonstrates that the proposed method can enhance both synthetic image quality and downstream classifier performance with limited data and minimal human annotation.
研究旨在通过利用潜扩散模型(LDM)和提出一种名为CCELLA的新颖条件化方法来解决医学成像中的数据稀缺问题。CCELLA同时使用自由文本临床报告和放射学分类对LDM U-Net进行条件化,并开发了一个数据高效的工作流程。该方法在3D FID得分上达到0.025,显著优于最近的基础模型。此外,由该方法生成的合成图像将前列腺癌预测下游分类器的准确性从69%提高到74%。该方法在有限数据和少量人工注释的情况下展示了改进的性能和科学可访问性。
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
First: 2026-01-08T18:59:24+00:00 · Latest: 2026-01-08T18:59:24+00:00
Comments: NVIDIA-Tech Report
Abstract
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
中文标题/摘要
标题:GDPO:组奖励-解耦归一化策略优化方法在多奖励RL优化中的应用
随着语言模型能力的不断增强,用户期望它们不仅能提供准确的响应,还能表现出与各种场景中不同人类偏好相一致的行为。为了实现这一目标,强化学习(RL)管道开始采用多个奖励,每个奖励捕捉一种独特的偏好,以引导模型向这些期望行为靠拢。然而,最近的工作在多奖励设置下默认使用组相对策略优化(GRPO)而没有对其适用性进行检查。本文表明,直接将GRPO应用于归一化不同的回放奖励组合会导致这些奖励的优势值坍缩为相同的值,降低训练信号的分辨率,导致次优收敛,在某些情况下甚至导致训练早期失败。我们随后引入了组奖励-解耦归一化策略优化(GDPO),这是一种新的策略优化方法,通过解耦个体奖励的归一化,更忠实地保留它们的相对差异,从而实现更准确的多奖励优化,并且训练稳定性显著提高。我们在工具调用、数学推理和编程推理三个任务上将GDPO与GRPO进行了比较,评估了正确性指标(准确率、错误率)和约束遵守指标(格式、长度)。在所有设置中,GDPO始终优于GRPO,证明了其在多奖励强化学习优化中的有效性和普适性。
Summary / 总结
This paper examines the use of Group Relative Policy Optimization (GRPO) in multi-reward reinforcement learning (RL) settings, where directly applying GRPO normalization can cause distinct rollout reward combinations to collapse into identical advantage values, reducing the resolution of the training signal and leading to suboptimal convergence. The authors propose Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the normalization of individual rewards, preserving their relative differences and improving training stability. GDPO outperforms GRPO across three tasks (tool calling, math reasoning, and coding reasoning) on both correctness and constraint adherence metrics.
该论文探讨了在多奖励强化学习(RL)设置中直接使用组相对策略优化(GRPO)的问题:其归一化会使不同的rollout奖励组合坍缩为相同的优势值,降低训练信号的分辨率并导致次优收敛。为此,作者提出了组奖励解耦归一化策略优化(GDPO)方法,该方法通过解耦个体奖励的归一化,保留它们的相对差异,从而提高训练稳定性。GDPO在工具调用、数学推理和编码推理三个任务中均优于GRPO,展示了在正确性和约束遵守度指标上的更好表现。
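The collapse described in the GDPO abstract can be reproduced on a toy group of two rollouts. The sketch below is a simplified illustration, not the paper's implementation: it assumes that the GRPO baseline normalizes the sum of the rewards within a group, and that the decoupled variant normalizes each reward separately before summing. The function names are hypothetical, and any further steps the actual method may apply are not shown.

```python
from statistics import fmean, pstdev


def normalize(values, eps=1e-8):
    """Group-wise standardization: subtract the mean, divide by the std."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def summed_then_normalized(group):
    """Assumed GRPO-style baseline: sum the rewards per rollout, then normalize."""
    totals = [sum(rewards) for rewards in group]
    return normalize(totals)


def normalized_then_summed(group):
    """Assumed decoupled variant: normalize each reward column, then sum."""
    columns = list(zip(*group))
    normalized_columns = [normalize(col) for col in columns]
    return [sum(vals) for vals in zip(*normalized_columns)]


# Two groups of two rollouts, each rollout scored by two binary rewards.
group_a = [(0.0, 0.0), (1.0, 0.0)]  # second rollout satisfies one reward
group_b = [(0.0, 0.0), (1.0, 1.0)]  # second rollout satisfies both rewards

collapsed_a = summed_then_normalized(group_a)
collapsed_b = summed_then_normalized(group_b)
decoupled_a = normalized_then_summed(group_a)
decoupled_b = normalized_then_summed(group_b)
```

In this toy case the summed-then-normalized advantages are numerically the same for both groups, even though the second rollout in `group_b` satisfies more rewards, whereas the decoupled version keeps the two cases apart.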
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
Authors: Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang
First: 2026-01-08T18:59:22+00:00 · Latest: 2026-01-08T18:59:22+00:00
Abstract
The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.
中文标题/摘要
标题:RoboVIP:基于视觉身份提示的多视角视频生成增强机器人操作
操作数据的多样性、数量和质量对于训练有效的机器人策略至关重要。然而,由于硬件和物理设置的限制,收集大规模的现实世界操作数据在不同环境中难以扩展。近期的工作使用文本提示条件下的图像扩散模型来通过改变视觉观察中的背景和桌面物体来增强操作数据。然而,这些方法往往忽视了由最先进的策略模型所需的多视角和时间上一致的观察需求。此外,仅靠文本提示无法可靠地指定场景设置。为了给扩散模型提供明确的视觉指导,我们引入了视觉身份提示,通过提供示例图像作为条件输入来引导生成所需的场景设置。为此,我们还构建了一个可扩展的流水线,从大型机器人数据集中策划视觉身份池。使用我们增强的操作数据来训练下游的视觉-语言-动作和视知觉运动策略模型,在仿真和真实机器人环境中均能获得一致的性能提升。
Summary / 总结
The paper addresses the challenge of collecting diverse and high-quality manipulation data for training robot policies. It introduces RoboVIP, a method that uses visual identity prompting to generate multi-view and temporally coherent video data. By conditioning image diffusion models with exemplar images, the method ensures that the generated data matches the desired scene setup. Experimental results show consistent performance improvements in both simulation and real-robot settings when using this augmented data to train vision-language-action and visuomotor policy models.
该论文旨在解决收集用于机器人训练的多样且高质量操作数据的挑战。它引入了RoboVIP方法,通过视觉身份提示生成多视角和时间连贯的视频数据。通过提供示例图像作为条件输入,该方法增强了由文本提示指定的场景设置。作者构建了一个可扩展的管道,从大型机器人数据集中筛选视觉身份池。实验表明,使用这种增强的数据可以提高在仿真和真实机器人环境中的视觉-语言-动作和视知觉运动策略模型的性能。
Robust Reasoning as a Symmetry-Protected Topological Phase
Authors: Ilmo Sung
First: 2026-01-08T18:58:34+00:00 · Latest: 2026-01-08T18:58:34+00:00
Abstract
Large language models suffer from "hallucinations," logical inconsistencies induced by semantic noise. We propose that current architectures operate in a "Metric Phase," where causal order is vulnerable to spontaneous symmetry breaking. Here, we identify robust inference as an effective Symmetry-Protected Topological phase, where logical operations are formally isomorphic to non-Abelian anyon braiding, replacing fragile geometric interpolation with robust topological invariants. Empirically, we demonstrate a sharp topological phase transition: while Transformers and RNNs exhibit gapless decay, our Holonomic Network reveals a macroscopic "mass gap," maintaining invariant fidelity below a critical noise threshold. Furthermore, in a variable-binding task on $S_{10}$ ($3.6 \times 10^6$ states) representing symbolic manipulation, we demonstrate holonomic generalization: the topological model maintains perfect fidelity extrapolating $100\times$ beyond training ($L=50 \to 5000$), consistent with a theoretically indefinite causal horizon, whereas Transformers lose logical coherence. Ablation studies indicate this protection emerges strictly from non-Abelian gauge symmetry. This provides strong evidence for a new universality class for logical reasoning, linking causal stability to the topology of the semantic manifold.
中文标题/摘要
标题:稳健推理作为一种对称保护拓扑相
大型语言模型存在“幻觉”问题,即由语义噪声引起的逻辑不一致。我们提出当前架构处于“度量相”中,在这种相中因果顺序容易受到自发对称性破缺的影响。在这里,我们将稳健推理识别为一种有效的对称保护拓扑相,在这种相中逻辑操作形式上同构于非阿贝尔任意子编织,以稳健的拓扑不变量取代脆弱的几何插值。实证上,我们展示了明显的拓扑相变:Transformer和RNN表现出无能隙衰减,而我们的和乐网络(Holonomic Network)呈现出宏观的“质量隙”,在临界噪声阈值以下保持不变的保真度。此外,在$S_{10}$($3.6 \times 10^6$个状态)上表示符号操作的变量绑定任务中,我们展示了和乐泛化:拓扑模型在超出训练长度$100\times$的外推($L=50 \to 5000$)中保持完美的保真度,这与理论上无限的因果视界一致,而Transformer则失去逻辑连贯性。消融研究表明,这种保护严格源自非阿贝尔规范对称性。这为逻辑推理存在一个新的普适类提供了有力证据,将因果稳定性与语义流形的拓扑联系起来。
Summary / 总结
The research aims to address the issue of logical inconsistencies in large language models, known as hallucinations, by proposing a new architecture that operates in a Symmetry-Protected Topological phase. The method involves using a Holonomic Network, which is designed to maintain logical operations through robust topological invariants rather than fragile geometric interpolation. Key experimental findings include a sharp phase transition where the Holonomic Network shows a macroscopic 'mass gap' and maintains fidelity below a critical noise threshold, while Transformers and RNNs exhibit gapless decay. Additionally, the Holonomic Network demonstrates holonomic generalization, maintaining perfect fidelity in a variable-binding task on $S_{10}$, extrapolating 100 times beyond training, whereas Transformers lose logical coherence.
研究旨在通过提出一种新的架构来解决大型语言模型中逻辑不一致的问题,即幻觉现象,该架构在对称保护拓扑相中运行。方法是使用Holonomic Network,该网络通过保持拓扑不变量而非脆弱的几何插值来维持逻辑操作。关键实验发现包括一个明显的相变,Holonomic Network在临界噪声阈值以下表现出宏观的“质量间隙”并保持保真度,而Transformer和RNN则表现出无隙衰减。此外,Holonomic Network在$S_{10}$上的变量绑定任务中展示了拓扑泛化能力,能够完美地将训练范围外推100倍,而Transformer则失去逻辑一致性。
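Two figures in the abstract can be checked directly: the symmetric group $S_{10}$ has $10!$ elements (the quoted $3.6 \times 10^6$ states), and going from $L=50$ to $L=5000$ is a $100\times$ extrapolation. The sketch below also shows why $S_{10}$ is non-Abelian by composing two permutations in both orders. The state-tracking loop is only an assumed, simplified stand-in for the variable-binding task, which the abstract does not specify, and it says nothing about the Holonomic Network itself.

```python
import math
import random

N = 10
assert math.factorial(N) == 3_628_800  # |S_10|, about 3.6e6 states
assert 5000 // 50 == 100               # extrapolation factor quoted in the abstract


def compose(p, q):
    """Return the permutation 'apply q first, then p' (tuples of images of 0..n-1)."""
    return tuple(p[q[i]] for i in range(len(p)))


identity = tuple(range(N))
swap01 = (1, 0) + tuple(range(2, N))     # transposition (0 1)
swap12 = (0, 2, 1) + tuple(range(3, N))  # transposition (1 2)

# Non-Abelian: the order of composition matters.
order_matters = compose(swap01, swap12) != compose(swap12, swap01)


def track_state(sequence):
    """Fold a sequence of permutations into a final state, starting from identity."""
    state = identity
    for perm in sequence:
        state = compose(perm, state)
    return state


# A length-5000 sequence of generators, tracked exactly by composition.
rng = random.Random(0)
long_sequence = [rng.choice([swap01, swap12]) for _ in range(5000)]
final_state = track_state(long_sequence)
```

Exact composition never drifts regardless of sequence length, which is the kind of length-independent behavior the abstract contrasts with the decay observed in Transformers and RNNs.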
Measuring and Fostering Peace through Machine Learning and Artificial Intelligence
Authors: P. Gilda, P. Dungarwal, A. Thongkham, E. T. Ajayi, S. Choudhary, T. M. Terol, C. Lam, J. P. Araujo, M. McFadyen-Mungalln, L. S. Liebovitch, P. T. Coleman, H. West, K. Sieck, S. Carter
First: 2026-01-08T18:57:01+00:00 · Latest: 2026-01-08T18:57:01+00:00
Comments: 6 pages, 4 figures
Abstract
We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people aged 20-40 view most of their news each day through short videos on social media. Content creators of these videos are biased towards creating emotionally activating videos that provoke anger to drive engagement and clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.
中文标题/摘要
标题:通过机器学习和人工智能衡量与促进和平
我们使用机器学习和人工智能:1) 从新闻和社交媒体中衡量各国的和平水平;2) 开发在线工具促进和平,帮助用户更好地理解自己的媒体饮食。对于新闻媒体,我们使用神经网络从在线新闻来源的文本嵌入中衡量和平水平。该模型在训练于一个新闻媒体数据集后,也对分析另一个新闻数据集时表现出高准确性。对于社交媒体,如YouTube,我们开发了其他模型来衡量与和平相关的社会维度,使用了词级(GoEmotions)和上下文级(大型语言模型)方法。为了促进和平,我们注意到20-40岁人群中,71%的人每天主要通过社交媒体上的短视频获取新闻。这些视频内容创作者倾向于制作能够激发情绪的视频,让你生气以增加点击率。我们开发并测试了一个名为MirrorMirror的Chrome扩展程序,为YouTube观众提供他们所观看媒体的实时反馈,关于其和平程度。我们的长期目标是让MirrorMirror成为一个开源工具,供内容创作者、记者、研究人员、平台和个人用户更好地理解其媒体创作和消费的语气及其对观众的影响。超越简单的参与度指标,我们希望鼓励更加尊重、细致和信息丰富的交流。
Summary / 总结
This research aims to measure and foster peace through machine learning and AI by analyzing news and social media content. For news, neural networks were used to assess peace levels from text embeddings, showing high accuracy across different datasets. For social media, models were developed to measure social dimensions related to peace using word and context levels. A Chrome extension called MirrorMirror was created to provide real-time feedback on the peacefulness of media content, aiming to promote more respectful and informative communication among users.
本研究旨在利用机器学习和AI来衡量和促进和平。它开发了模型来评估新闻和社交媒体内容中的和平水平,并创建了一个名为MirrorMirror的Chrome扩展程序,以实时反馈用户在YouTube上观看的内容的和平程度。关键发现包括模型在不同新闻来源中衡量和平的高准确性,以及情绪化内容在吸引观众方面的重要作用,而MirrorMirror工具旨在通过促进更和平和尊重的交流来减轻这一影响。
Learning Latent Action World Models In The Wild
Authors: Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat
First: 2026-01-08T18:55:39+00:00 · Latest: 2026-01-08T18:55:39+00:00
Comments: 37 pages, 25 figures
Abstract
Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates the learning of latent action models, which can learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. For example, we find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with similar performance to action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.
中文标题/摘要
标题:学习自然环境中的潜在动作世界模型
能够在现实世界中进行推理和规划的智能体需要预测其行为后果的能力。尽管世界模型具备这种能力,但它们通常需要行为标签,而这些标签在大规模应用中往往难以获取。这促使我们学习潜在动作模型,可以从视频中学习动作空间。我们的工作解决了在自然环境视频中学习潜在动作世界模型的问题,扩展了现有工作在简单机器人模拟、视频游戏或操作数据方面的研究范围。虽然这使我们能够捕捉到更丰富的动作,但也带来了视频多样性带来的挑战,如环境噪声或视频间缺乏共同的实体。为应对部分挑战,我们讨论了动作应遵循的属性以及相关架构选择和评估。我们发现,连续但受限的潜在动作能够捕捉自然环境视频中动作的复杂性,而常见的向量量化则无法做到这一点。例如,我们发现来自智能体(如人类进入房间)的环境变化可以在视频间转移,这突显了学习特定于自然环境视频的动作的能力。在视频间缺乏共同实体的情况下,我们主要能够学习在空间上局部化的潜在动作,相对于摄像机而言。尽管如此,我们能够训练一个控制器,将已知动作映射到潜在动作,使我们能够使用潜在动作作为通用接口,并使用世界模型解决规划任务,其性能与基于动作的基线相当。我们的分析和实验为将潜在动作模型扩展到现实世界迈出了一步。
Summary / 总结
This work addresses the challenge of learning latent action models from in-the-wild videos, which are more complex and diverse than those in robotics simulations or video games. The authors propose a method to capture richer actions while handling environmental noise and lack of a common embodiment across videos. Key findings include the ability of continuous, constrained latent actions to capture the complexity of in-the-wild actions and the capability to transfer changes in the environment across videos. The model can also map known actions to latent ones, enabling planning tasks with similar performance to action-conditioned baselines.
该研究旨在从真实世界视频中学习潜在动作模型,这些视频比机器人仿真、视频游戏或操作数据更复杂多样。作者提出了一种方法来捕捉更丰富的动作,同时处理环境噪声和视频间缺乏共同主体的问题。关键发现包括连续但受限的潜在动作能够捕捉真实世界视频中动作的复杂性,并且能够跨视频转移环境变化。该模型还可以将已知动作映射到潜在动作,从而在执行规划任务时与基于动作的基线具有相似的性能。
Non-Linear Scoring Model for Translation Quality Evaluation
Authors: Serge Gladkoff, Lifeng Han, Katerina Gasova
First: 2025-11-17T15:09:22+00:00 · Latest: 2026-01-08T18:51:57+00:00
Comments: ongoing work, 32 pages
Abstract
Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
中文标题/摘要
标题:翻译质量评估的非线性评分模型
基于多维质量指标(MQM)的分析性翻译质量评估(TQE)传统上使用线性误差-惩罚比例,该比例基于1000-2000词的参考样本进行校准。然而,线性外推会导致对不同大小样本的判断偏差,对短样本过度惩罚,对长样本则惩罚不足,导致与专家直觉的不一致。 本文基于多范围框架,提出了一种校准的非线性评分模型,更好地反映了不同长度样本中人类内容消费者对翻译质量的感知。来自三个大型企业环境的实证数据显示,可接受的错误数量随样本大小呈对数增长,而非线性增长。 心理物理和认知证据,包括韦伯-费希纳定律和认知负荷理论,支持这一观点,解释了为什么额外错误的感知影响随规模增长而减弱,而认知负担则随规模增长。我们提出一个两参数模型 E(x) = a * ln(1 + b * x),a, b > 0, 该模型以参考容忍度为锚点,并通过一个一维根寻找步骤校准两个容忍度点。该模型在相对误差不超过±20%的区间内保持线性近似,并且可以与现有的评估工作流程无缝集成,只需添加一个动态容忍度函数。 该方法提高了人类和AI生成翻译的解释性、公平性和评分者一致性。通过实现一个感知上有效的评分范式,它推动了翻译质量评估向更准确和可扩展的评估迈进。该模型还为与人类判断一致的基于AI的文档级评估提供了更强的基础。讨论了CAT/LQA系统实施考虑和对人类和AI生成文本评估的影响。
Summary / 总结
This paper addresses the limitations of traditional linear scoring models in Translation Quality Evaluation (TQE) by proposing a non-linear scoring model based on the Multi-Range framework. Empirical data from three enterprise environments show that acceptable error counts grow logarithmically with sample size. The proposed model, E(x) = a * ln(1 + b * x), improves interpretability and fairness, and integrates into existing workflows with a dynamic tolerance function. It enhances the accuracy and scalability of translation quality assessment and provides a better basis for AI-based evaluations aligned with human judgment.
本文针对传统线性评分模型在翻译质量评估(TQE)中的局限性,提出了一个非线性评分模型。该模型基于实证数据和认知理论,表明可接受的错误数量随样本大小呈对数增长。提出的两参数模型E(x) = a * ln(1 + b * x)提高了人类和AI生成文本的解释性、公平性和一致性,更好地符合专家直觉和认知负荷理论。
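The calibration step mentioned in the abstract (fitting $E(x) = a \ln(1 + b x)$ to two tolerance points with a one-dimensional root-finding step) can be sketched as follows. This is one plausible reading, not the authors' procedure: it eliminates $a$ by taking the ratio of the two equations and solves for $b$ by bisection. The function names, the bracket limits, and the example tolerance points are all hypothetical.

```python
import math


def tolerance(x: float, a: float, b: float) -> float:
    """E(x) = a * ln(1 + b * x): acceptable error count for a sample of x words."""
    return a * math.log1p(b * x)


def calibrate(x1: float, e1: float, x2: float, e2: float, iters: int = 200):
    """Fit (a, b) so that E(x1) = e1 and E(x2) = e2, using bisection on b."""
    if not (0 < x1 < x2 and 0 < e1 < e2 and e2 / e1 < x2 / x1):
        raise ValueError("need x1 < x2 and sublinear growth: 1 < e2/e1 < x2/x1")
    target = e2 / e1

    def ratio(b: float) -> float:
        # ln(1 + b*x2) / ln(1 + b*x1) decreases from x2/x1 (b -> 0) toward 1 (b -> inf).
        return math.log1p(b * x2) / math.log1p(b * x1)

    lo, hi = 1e-12, 1.0          # assumes the root lies above 1e-12
    while ratio(hi) > target:    # widen the bracket until it contains the root
        hi *= 10.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ratio(mid) > target:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    a = e1 / math.log1p(b * x1)  # back-substitute to recover a
    return a, b


# Hypothetical tolerance points generated from known parameters, then recovered.
a_true, b_true = 5.0, 0.002
e_1000 = tolerance(1000, a_true, b_true)
e_10000 = tolerance(10000, a_true, b_true)
a_fit, b_fit = calibrate(1000, e_1000, 10000, e_10000)
```

Once `a` and `b` are fitted, `tolerance(x, a, b)` plays the role of the dynamic tolerance function that the abstract says is the only addition needed in an existing evaluation workflow.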
MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents
Authors: Tamil Sudaravan Mohan Doss, Michael Xu, Sudha Rao, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel
First: 2026-01-08T18:39:52+00:00 · Latest: 2026-01-08T18:39:52+00:00
Abstract
We present MineNPC-Task, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world Minecraft. Rather than relying on synthetic prompts, tasks are elicited from formative and summative co-play with expert players, normalized into parametric templates with explicit preconditions and dependency structure, and paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan/act/memory events (including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts) and reports outcomes relative to the total number of attempted subtasks, derived from in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory/tool handling, referencing, and navigation, alongside recoveries supported by mixed-initiative clarifications and lightweight memory. Participants rated interaction quality and interface usability positively, while highlighting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and harness to support transparent, reproducible evaluation of future memory-aware embodied agents.
中文标题/摘要
标题:MineNPC-Task:面向记忆感知Minecraft代理的任务套件
我们提出了MineNPC-Task,一种用户编写的基准测试和评估框架,用于测试开放世界Minecraft中的记忆感知、混合主动性LLM代理。该框架不依赖于合成提示,而是从与专家玩家的形成性及总结性共玩中获取任务,将这些任务规范化为具有显式先决条件和依赖结构的参数化模板,并配以基于有界知识策略的机器可验证验证器,该策略禁止使用世界外的捷径。该框架捕捉计划/行动/记忆事件,包括计划预览、针对性澄清、记忆读写、先决条件检查以及修复尝试,并根据游戏世界内的证据,相对于已尝试的子任务总数报告结果。作为初步的快照,我们使用GPT-4o实例化了该框架,并在8名经验丰富的玩家中评估了216个子任务。我们观察到代码执行、库存/工具处理、引用和导航中反复出现的故障模式,以及通过混合主动性澄清和轻量级记忆支持的恢复。参与者对交互质量和界面易用性给予了积极评价,同时指出需要更强的跨任务记忆持久性。我们发布了完整的任务套件、验证器、日志和框架,以支持未来记忆感知具身代理的透明、可重复评估。
Summary / 总结
The research introduces MineNPC-Task, a benchmark for testing memory-aware LLM agents in Minecraft. It involves tasks elicited from expert players, normalized into templates, and paired with validators. The study evaluates 216 subtasks across 8 players using GPT-4o, identifying recurring issues in code execution, inventory handling, referencing, and navigation. Participants positively rated interaction quality and usability but noted the need for better memory persistence. The task suite, validators, logs, and harness are released for future evaluations.
研究引入了MineNPC-Task,用于评估记忆感知的LLM代理在Minecraft中的表现。任务从专家玩家中获取,规范化为具有明确前提条件的模板,并使用机器可验证的验证器进行评估。研究测试了8名玩家的216个子任务,发现了代码执行、库存处理、引用和导航等方面的重复问题,通过混合主动澄清和轻量级记忆支持恢复。参与者对交互质量和界面易用性给予了积极评价,但也指出需要更强的记忆持久性。
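The task structure described in the MineNPC-Task abstract (templates with preconditions and dependencies, machine-checkable validators over in-world evidence, outcomes reported relative to attempted subtasks) can be sketched as below. This is an illustrative sketch only, not the released harness: the `Subtask` class, its fields, the `evaluate` function, the rule that skipped subtasks do not count as attempted, and the example world state are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

WorldState = Dict[str, int]  # hypothetical in-world evidence, e.g. inventory counts


@dataclass
class Subtask:
    name: str
    validator: Callable[[WorldState], bool]  # checks in-world evidence only
    preconditions: List[Callable[[WorldState], bool]] = field(default_factory=list)
    depends_on: List[str] = field(default_factory=list)


def evaluate(subtasks: List[Subtask], state: WorldState) -> float:
    """Return passed / attempted, skipping subtasks whose dependencies or preconditions fail."""
    passed, attempted = set(), 0
    for task in subtasks:  # assumed to be listed in dependency order
        if not all(dep in passed for dep in task.depends_on):
            continue
        if not all(check(state) for check in task.preconditions):
            continue
        attempted += 1
        if task.validator(state):
            passed.add(task.name)
    return len(passed) / attempted if attempted else 0.0


# Hypothetical end-of-episode world state and a three-step task chain.
state = {"oak_log": 4, "crafting_table": 1, "wooden_pickaxe": 0}
tasks = [
    Subtask("gather_logs", validator=lambda s: s["oak_log"] >= 3),
    Subtask(
        "craft_table",
        validator=lambda s: s["crafting_table"] >= 1,
        depends_on=["gather_logs"],
    ),
    Subtask(
        "craft_pickaxe",
        validator=lambda s: s["wooden_pickaxe"] >= 1,
        preconditions=[lambda s: s["crafting_table"] >= 1],
        depends_on=["craft_table"],
    ),
]
success_rate = evaluate(tasks, state)
```

Keeping validators as pure functions of the world state mirrors the bounded-knowledge idea in the abstract: a subtask counts as passed only when in-world evidence supports it.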
Internal Representations as Indicators of Hallucinations in Agent Tool Selection
Authors: Kait Healy, Bharathi Srinivasan, Visakh Madathil, Jing Wu
First: 2026-01-08T18:38:45+00:00 · Latest: 2026-01-08T18:38:45+00:00
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in tool calling and tool usage, but suffer from hallucinations where they choose incorrect tools, provide malformed parameters, and exhibit 'tool bypass' behavior by performing simulations and generating outputs instead of invoking specialized tools or external systems. This undermines the reliability of LLM-based agents in production systems, as it leads to inconsistent results and bypasses security and audit controls. Such hallucinations in agent tool selection require early detection and error handling. Unlike existing hallucination detection methods that require multiple forward passes or external validation, we present a computationally efficient framework that detects tool-calling hallucinations in real time by leveraging LLMs' internal representations during the same forward pass used for generation. We evaluate this approach on reasoning tasks across multiple domains, demonstrating strong detection performance (up to 86.4% accuracy) while maintaining real-time inference capabilities with minimal computational overhead. The approach particularly excels at detecting parameter-level hallucinations and inappropriate tool selections, which is critical for reliable agent deployment.
中文标题/摘要
标题:代理工具选择中的内部表示作为幻觉指标
大型语言模型(LLMs)在工具调用和使用方面表现出色,但在选择错误工具、提供不正确的参数和通过模拟生成输出而不是调用专门工具或外部系统方面存在幻觉问题。这削弱了基于LLM的代理在生产系统中的可靠性,导致结果不一致,并绕过了安全和审计控制。代理工具选择中的这种幻觉需要早期检测和错误处理。不同于现有的需要多次前向传递或外部验证的幻觉检测方法,我们提出了一种计算效率高的框架,通过利用LLM在生成过程中同一前向传递期间的内部表示来实时检测调用工具的幻觉。我们在多个领域的推理任务上评估了这种方法,展示了强大的检测性能(最高可达86.4%的准确率),同时保持了实时推理能力,计算开销最小,特别擅长检测参数级幻觉和不适当的工具选择,这对于可靠的代理部署至关重要。
Summary / 总结
The research addresses hallucinations that occur when Large Language Models (LLMs) select tools, which can lead to unreliable results and bypass security controls. The study introduces a computationally efficient framework that detects these hallucinations in real time by analyzing the LLM's internal representations during the same forward pass used for generation. The method achieves up to 86.4% detection accuracy and is particularly strong at detecting parameter-level hallucinations and inappropriate tool selections, while maintaining real-time inference capabilities with minimal computational overhead.
该研究针对大型语言模型(LLMs)在工具选择过程中出现的幻觉问题,这些问题可能导致结果不可靠并绕过安全控制。研究提出了一种计算高效的框架,该框架通过分析LLMs在生成过程中同一前向传递期间的内部表示来实时检测这些幻觉。该方法在检测参数级幻觉和不适当工具选择方面达到了高达86.4%的准确性,同时保持了实时推理能力,并具有最小的计算开销。
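For the tool-selection hallucination entry above, the general idea of scoring a tool call from an internal representation can be sketched as a linear probe over one hidden-state vector. This is only an illustration under stated assumptions: the abstract does not say which layer, features, or classifier the authors use, so `probe_score`, `flag_tool_call`, the weights, and the 0.5 threshold are all hypothetical.

```python
import math

def probe_score(hidden_state, weights, bias):
    """Logistic probe: map one hidden-state vector to a hallucination probability."""
    logit = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def flag_tool_call(hidden_state, weights, bias, threshold=0.5):
    """Return True when the probe considers the tool call likely hallucinated."""
    return probe_score(hidden_state, weights, bias) >= threshold

# Toy 3-dimensional "hidden state" and hand-picked probe parameters (illustrative only).
weights = [2.0, -1.0, 0.5]
bias = -0.25
print(flag_tool_call([1.0, 0.0, 0.0], weights, bias))  # logit = 1.75
print(flag_tool_call([0.0, 1.0, 0.0], weights, bias))  # logit = -1.25
```

Because such a probe reuses activations that generation already computes, its extra cost is a single dot product per call, which is consistent with the low-overhead claim in the abstract.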
Belief Is All You Need: Modeling Narrative Archetypes in Conspiratorial Discourse
Authors: Soorya Ram Shimgekar, Abhay Goyal, Roy Ka-Wei Lee, Koustuv Saha, Pi Zonooz, Navin Kumar
First: 2025-12-10T21:51:16+00:00 · Latest: 2026-01-08T18:34:35+00:00
Abstract
Conspiratorial discourse is increasingly embedded within digital communication ecosystems, yet its structure and spread remain difficult to study. This work analyzes conspiratorial narratives in Singapore-based Telegram groups, showing that such content is woven into everyday discussions rather than confined to isolated echo chambers. We propose a two-stage computational framework. First, we fine-tune RoBERTa-large to classify messages as conspiratorial or not, achieving an F1-score of 0.866 on 2,000 expert-labeled messages. Second, we build a signed belief graph in which nodes represent messages and edge signs reflect alignment in belief labels, weighted by textual similarity. We introduce a Signed Belief Graph Neural Network (SiBeGNN) that uses a Sign Disentanglement Loss to learn embeddings that separate ideological alignment from stylistic features. Using hierarchical clustering on these embeddings, we identify seven narrative archetypes across 553,648 messages: legal topics, medical concerns, media discussions, finance, contradictions in authority, group moderation, and general chat. SiBeGNN yields stronger clustering quality (cDBI = 8.38) than baseline methods (13.60 to 67.27), supported by 88 percent inter-rater agreement in expert evaluations. Our analysis shows that conspiratorial messages appear not only in clusters focused on skepticism or distrust, but also within routine discussions of finance, law, and everyday matters. These findings challenge common assumptions about online radicalization by demonstrating that conspiratorial discourse operates within ordinary social interaction. The proposed framework advances computational methods for belief-driven discourse analysis and offers applications for stance detection, political communication studies, and content moderation policy.
中文标题/摘要
标题:信念即所必需:在阴谋论话语中建模叙事原型
阴谋论话语越来越多地嵌入数字通信生态系统中,但其结构和传播仍然难以研究。本研究分析了基于新加坡Telegram群组中的阴谋论叙述,表明此类内容融入了日常讨论,而非局限于孤立的回声室中。我们提出了一种两阶段的计算框架。首先,我们对RoBERTa-large进行微调,以分类信息为阴谋论或非阴谋论,使用2,000条专家标注信息,F1分数达到0.866。其次,我们构建了一个带符号的信念图,节点代表信息,边的符号反映信念标签的一致性,权重由文本相似度决定。我们引入了一种带符号信念图神经网络(SiBeGNN),使用符号解纠缠损失来学习将意识形态一致性与风格特征分离的嵌入。通过这些嵌入进行层次聚类,我们识别出553,648条信息中的七个叙述原型:法律主题、医疗关切、媒体讨论、金融、权威矛盾、群体管理以及一般聊天。SiBeGNN的聚类质量(cDBI = 8.38)优于基线方法(13.60到67.27),并得到88%的专家评价的一致性支持。我们的分析表明,阴谋论信息不仅出现在关注怀疑或不信任的聚类中,还出现在金融、法律和日常事务的常规讨论中。这些发现挑战了关于在线激进化的一些常见假设,表明阴谋论话语在普通社会互动中运作。所提出的方法推进了信念驱动话语分析的计算方法,并为立场检测、政治传播研究和内容审核政策提供了应用。
Summary / 总结
This study examines the structure and spread of conspiratorial discourse in Singapore-based Telegram groups, proposing a two-stage computational framework. The first stage fine-tunes RoBERTa-large to classify messages, achieving an F1-score of 0.866. The second stage builds a signed belief graph and uses SiBeGNN to identify seven narrative archetypes. SiBeGNN outperforms baseline methods with a cDBI score of 8.38, and the analysis reveals that conspiratorial messages are embedded in everyday discussions, challenging the notion of isolated echo chambers.
该研究分析了新加坡Telegram群组中的阴谋论叙事,提出了一种两阶段计算框架。第一阶段使用RoBERTa-large对消息进行分类,F1得分为0.866。第二阶段构建了带符号的信念图,并使用SiBeGNN识别出七个叙事原型,显示出比基线方法更强的聚类质量。研究发现,阴谋论信息嵌入在日常讨论中,挑战了关于在线极端化的常见假设。
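The signed belief graph described in the entry above (nodes are messages, edge signs reflect agreement of belief labels, weights come from textual similarity) can be sketched as follows. The bag-of-words cosine similarity and the `min_similarity` cutoff are assumptions made for illustration; the abstract does not specify the similarity measure or any edge threshold.

```python
import math
from collections import Counter
from itertools import combinations

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_signed_edges(messages, labels, min_similarity=0.1):
    """Return (i, j, sign, weight) edges; sign is +1 when belief labels agree, else -1."""
    edges = []
    for i, j in combinations(range(len(messages)), 2):
        weight = cosine_similarity(messages[i], messages[j])
        if weight >= min_similarity:
            sign = 1 if labels[i] == labels[j] else -1
            edges.append((i, j, sign, weight))
    return edges

# Toy messages with hypothetical belief labels (1 = conspiratorial, 0 = not).
messages = ["the report is fake", "the report is accurate", "lunch at noon"]
labels = [1, 0, 0]
print(build_signed_edges(messages, labels))
```

In this toy input only the first two messages share vocabulary, so they form a single negative edge: textually similar but opposed in belief label.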
From Policy to Logic for Efficient and Interpretable Coverage Assessment
Authors: Rhitabrat Pokharel, Hamid Reza Hassanzadeh, Ameeta Agrawal
Venue: AAAI 2026
First: 2026-01-03T19:24:51+00:00 · Latest: 2026-01-08T18:28:40+00:00
Comments: Accepted at AIMedHealth @ AAAI 2026
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in interpreting lengthy, complex legal and policy language. However, their reliability can be undermined by hallucinations and inconsistencies, particularly when analyzing subjective and nuanced documents. These challenges are especially critical in medical coverage policy review, where human experts must be able to rely on accurate information. In this paper, we present an approach designed to support human reviewers by making policy interpretation more efficient and interpretable. We introduce a methodology that pairs a coverage-aware retriever with symbolic rule-based reasoning to surface relevant policy language, organize it into explicit facts and rules, and generate auditable rationales. This hybrid system minimizes the number of LLM inferences required, which reduces overall model cost. Notably, our approach achieves a 44% reduction in inference cost alongside a 4.5% improvement in F1 score, demonstrating both efficiency and effectiveness.
中文标题/摘要
标题:从政策到逻辑:高效可解释的覆盖评估
大型语言模型(LLMs)在解释长篇复杂的法律和政策语言方面表现出强大的能力。然而,它们的可靠性可能会因幻觉和不一致而受到损害,特别是在分析主观和细腻的文件时。这些挑战在医疗覆盖政策审查中尤为关键,因为人类专家必须依赖准确的信息。在本文中,我们提出了一种支持人类审查员的方法,以使政策解释更加高效和可解释。我们介绍了一种方法,该方法将覆盖感知检索器与符号规则推理相结合,以突出显示相关的政策语言,将其组织成明确的事实和规则,并生成可审计的理由。这种混合系统减少了所需的LLM推理次数,从而降低了整体模型成本。值得注意的是,我们的方法在推理成本上减少了44%,F1分数提高了4.5%,既提高了效率又提高了效果。
Summary / 总结
This paper addresses the challenges of interpreting complex medical coverage policies using Large Language Models (LLMs), which can suffer from hallucinations and inconsistencies. The authors propose a hybrid system combining a coverage-aware retriever and symbolic rule-based reasoning to make policy interpretation more efficient and interpretable. This approach reduces inference cost by 44% while improving the F1 score by 4.5%, showing both efficiency and effectiveness.
本文旨在解决使用大型语言模型(LLMs)解释复杂的医疗覆盖政策时所面临的幻觉和不一致性问题。为支持人类审查员,作者提出了一种结合覆盖感知检索器和符号规则推理的混合系统。该方法减少了所需的LLM推理次数,将推理成本降低了44%,同时将F1分数提高了4.5%。
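The "explicit facts and rules" plus "auditable rationales" idea in the coverage-assessment entry above can be illustrated with a tiny rule evaluator. This is a sketch, not the authors' system: the fact names, the rule format, and `evaluate_rules` are hypothetical, and the retrieval step that would produce the facts is omitted.

```python
def evaluate_rules(facts, rules):
    """Fire every rule whose conditions are all present; return conclusions with rationales."""
    results = []
    for rule in rules:
        if rule["if"] <= facts:  # subset test: all conditions are satisfied
            rationale = f"{rule['then']} because " + " and ".join(sorted(rule["if"]))
            results.append((rule["then"], rationale))
    return results

# Hypothetical facts extracted from a policy and a patient record.
facts = {"diagnosis_confirmed", "prior_therapy_failed"}
rules = [
    {"if": {"diagnosis_confirmed", "prior_therapy_failed"}, "then": "covered"},
    {"if": {"experimental_treatment"}, "then": "not_covered"},
]
for conclusion, rationale in evaluate_rules(facts, rules):
    print(conclusion, "|", rationale)
```

Once facts and rules are explicit, each decision is a deterministic set comparison, so no further LLM call is needed at this step and the rationale can be audited line by line.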
Stock Market Price Prediction using Neural Prophet with Deep Neural Network
Authors: Navin Chhibber, Suneel Khemka, Navneet Kumar Tyagi, Rohit Tewari, Bireswar Banerjee, Piyush Ranjan
First: 2026-01-08T18:24:22+00:00 · Latest: 2026-01-08T18:24:22+00:00
Abstract
Stock market price prediction is a significant interdisciplinary research domain that lies at the intersection of finance, statistics, and economics. Accurately predicting stock prices has always been a focal point for researchers. However, existing statistical approaches for time-series prediction often fail to effectively forecast the probability range of future stock prices. Hence, to solve this problem, the Neural Prophet with a Deep Neural Network (NP-DNN) is proposed to predict stock market prices. The preprocessing technique used in this research is Z-score normalization, which normalizes stock price data by removing scale differences, making patterns easier to detect. Missing value imputation fills gaps in historical data, enhancing the model's use of complete information for more accurate predictions. The Multi-Layer Perceptron (MLP) learns complex nonlinear relationships among stock market prices and extracts hidden patterns from the input data, thereby creating meaningful feature representations for better prediction accuracy. The proposed NP-DNN model achieved an accuracy of 99.21% compared with other approaches using the Fused Large Language Model. Keywords: deep neural network, forecasting stock prices, multi-layer perceptron, neural prophet, stock market price prediction.
中文标题/摘要
标题:使用深度神经网络的神经先知进行股票市场价格预测
股票市场价格预测是金融、统计和经济学交叉领域的显著研究领域。准确预测股票价格一直是各种研究人员的关注点。然而,现有的时间序列预测统计方法往往无法有效预测未来股票价格的概率范围。因此,为了解决这个问题,提出了使用深度神经网络的神经先知(NP-DNN)来预测股票市场价格。本研究中使用的预处理技术是Z分数标准化,通过消除规模差异来标准化股票价格数据,使模式更容易被检测到。缺失值填充填补了历史数据中的空白,增强了模型使用完整信息进行更准确预测的能力。多层感知机(MLP)学习股票市场价格之间的复杂非线性关系,从输入数据中提取隐藏模式,从而创建更有意义的特征表示,以提高预测准确性。所提出的NP-DNN模型的准确率为99.21%,与其他使用融合大型语言模型的方法相比。关键词:深度神经网络,预测股票价格,多层感知机,神经先知,股票市场价格预测。
Summary / 总结
The research aims to improve the accuracy of stock market price prediction by using the Neural Prophet with a Deep Neural Network (NP-DNN). The method involves preprocessing techniques such as Z-score normalization and missing value imputation, followed by the use of a Multi-Layer Perceptron (MLP) to learn complex nonlinear relationships. The proposed NP-DNN model achieved an accuracy of 99.21%, outperforming other approaches in forecasting stock prices.
研究旨在通过使用深度神经网络(NP-DNN)来提高股票市场价格预测的准确性。方法包括使用Z-score归一化预处理数据并填补缺失值,以确保使用完整的信息。多层感知机(MLP)被用来学习复杂的非线性关系并从输入数据中提取隐藏模式。提出的NP-DNN模型的准确率为99.21%,优于其他方法。
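The two preprocessing steps named in the stock-prediction entry above, missing value imputation and Z-score normalization, can be sketched as below. The abstract does not state which imputation method is used, so forward fill is an assumption here, and the function names are hypothetical.

```python
import statistics

def impute_missing(values):
    """Fill None entries with the most recent observed value (forward fill).

    Leading gaps are filled with the first observed value.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    filled, last = [], observed[0]
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def z_score(values):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

prices = [None, 10.0, None, 14.0, 12.0]  # toy closing prices with gaps
print(z_score(impute_missing(prices)))
```

Imputation runs first so that the mean and standard deviation used for normalization are computed over a complete series.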
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald
First: 2026-01-08T18:23:03+00:00 · Latest: 2026-01-08T18:23:03+00:00
Abstract
Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
中文标题/摘要
标题:视觉语言模型中提示诱发幻觉的机制
大型视觉语言模型(VLMs)虽然功能强大,但常常倾向于根据文本提示而非视觉证据进行幻觉。我们在一个受控的对象计数设置中研究了这种失败模式,其中提示会夸大图像中的对象数量(例如,要求模型描述四朵睡莲,而实际上只有三朵)。在对象数量较少时,模型通常会纠正这种高估,但随着对象数量的增加,它们越来越倾向于遵循提示,无视差异。通过对三种VLMs的机制分析,我们发现一小组注意力头的消除可以显著减少提示诱发幻觉(PIH),至少降低40%且无需额外训练。在不同模型中,PIH头以特定方式介导提示复制。我们描述了这些差异,并表明PIH消除增加了对视觉证据的纠正。我们的研究提供了关于提示诱发幻觉内部机制的见解,揭示了这些行为在不同模型中的特定差异实现方式。
Summary / 总结
The study investigates how large vision-language models (VLMs) hallucinate based on textual prompts rather than visual evidence. By manipulating object counts in images, the researchers found that models tend to correct prompt-induced overestimations at low object counts but increasingly conform to the prompt as the number of objects increases. Ablating specific attention heads reduced prompt-induced hallucinations by at least 40% across models, and these heads mediate prompt copying in model-specific ways, with ablation increasing correction towards visual evidence. This work provides insights into the internal mechanisms driving prompt-induced hallucinations in VLMs, highlighting model-specific differences in behavior implementation.
该研究探讨了大型视觉-语言模型(VLMs)如何基于文本提示而非视觉证据产生幻觉。通过在受控的物体计数设置中进行研究,研究人员发现,在低物体数量时,模型倾向于纠正过度估计,但随着物体数量的增加,它们越来越倾向于遵循提示。通过对三个VLMs的分析,研究团队发现特定的注意力头,当移除这些头时,可以显著减少提示诱导的幻觉(PIH)至少40%,而无需进一步训练。研究结果揭示了这些行为在不同模型中的具体实现差异,并表明减少PIH可以提高与视觉证据的一致性。
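Attention-head ablation, as used in the entry above, can be illustrated on the standard view of multi-head attention in which the layer output is the sum of per-head contributions. The sketch assumes zero ablation (dropping a head's contribution entirely); the abstract does not say whether the authors use zero, mean, or another ablation scheme, and `combine_heads` is a hypothetical name.

```python
def combine_heads(head_outputs, ablated=frozenset()):
    """Sum per-head contributions, skipping (zero-ablating) the heads listed in `ablated`."""
    dim = len(head_outputs[0])
    total = [0.0] * dim
    for index, contribution in enumerate(head_outputs):
        if index in ablated:
            continue  # this head contributes nothing to the layer output
        for k, value in enumerate(contribution):
            total[k] += value
    return total

# Three toy heads writing into a 2-dimensional output.
heads = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
print(combine_heads(heads))                # all heads active
print(combine_heads(heads, ablated={2}))   # head 2 removed
```

Comparing model behavior with and without a chosen set of heads, with no retraining in between, is what allows a behavior such as prompt copying to be attributed to those heads.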
An interpretable data-driven approach to optimizing clinical fall risk assessment
Authors: Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Holley Farley, Kimia Ghobadi
First: 2026-01-08T18:17:31+00:00 · Latest: 2026-01-08T18:17:31+00:00
Comments: arXiv admin note: substantial text overlap with arXiv:2510.20714
Abstract
In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models to reweight the JHFRAT scoring weights, while preserving its additive structure and clinical thresholds. Recalibration refers to adjusting item weights so that the resulting score orders encounters more consistently with the study's risk labels, without changing the tool's form factor or deployment workflow. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). This performance improvement translates to protecting an additional 35 high-risk patients per week across the Johns Hopkins Health System. The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost, AUC-ROC=0.94) improves upon the performance metrics of the knowledge-based constrained logistic regression, the CSO demonstrates more robustness to variations in risk labeling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.
中文标题/摘要
标题:一种可解释的数据驱动方法以优化临床跌倒风险评估
在本研究中,我们旨在通过数据驱动建模方法更好地使约翰霍普金斯跌倒风险评估工具(JHFRAT)的跌倒风险预测与额外的临床有意义的指标相一致。我们对2022年3月至2023年10月期间约翰霍普金斯健康系统三家医院的54,209例住院病例进行了回顾性队列分析。共有20,208例住院病例被纳入高跌倒风险事件,13,941例被纳入低跌倒风险事件。为了整合临床知识并保持可解释性,我们使用约束评分优化(CSO)模型重新加权JHFRAT评分权重,同时保持其加性结构和临床阈值。校准是指调整项目权重,使所得评分能够更一致地按研究的风险标签对事件进行排序,而不改变工具的形式因素或部署工作流程。该模型在预测性能上显著优于当前的JHFRAT(CSO AUC-ROC=0.91,JHFRAT AUC-ROC=0.86)。这种性能改进相当于每周在约翰霍普金斯健康系统中额外保护35名高风险患者。约束评分优化模型在有和没有EHR变量的情况下表现相似。尽管基准黑盒模型(XGBoost)在知识驱动的约束逻辑回归的基础上提高了性能指标(AUC-ROC=0.94),但CSO在风险标签变化方面表现出更强的稳健性。这种基于证据的方法为医疗系统提供了一个坚实的基础,以系统地增强住院跌倒预防协议和患者安全,利用数据驱动优化技术,从而在医疗保健环境中改善风险评估和资源分配。
Summary / 总结
This study aims to improve the predictive performance of the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) by incorporating clinical knowledge through constrained score optimization (CSO) models. A retrospective cohort analysis of 54,209 inpatient admissions showed that the CSO models significantly improved predictive performance (AUC-ROC=0.91) compared to the original JHFRAT (AUC-ROC=0.86), protecting an additional 35 high-risk patients per week. The CSO models maintained interpretability and robustness, even without electronic health record variables, and provided a robust foundation for enhancing inpatient fall prevention protocols in healthcare settings.
本研究旨在通过约束分数优化(CSO)模型将约翰霍普金斯跌倒风险评估工具(JHFRAT)与临床有意义的指标相结合,以提高其预测性能。对三家约翰霍普金斯医院54,209名住院患者的回顾性队列分析显示,CSO模型在预测性能(AUC-ROC=0.91)上显著优于原始JHFRAT(AUC-ROC=0.86),每周额外保护了35名高风险患者。CSO模型保持了可解释性和鲁棒性,即使不使用电子健康记录变量,也是一项有价值的工具,用于增强住院患者的跌倒预防协议和患者安全,从而改善医疗保健环境中的风险评估和资源配置。
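The fall-risk entry above relies on two simple building blocks: an additive score (a weighted sum of item values) and AUC-ROC as the measure of how well that score orders encounters. A minimal sketch follows; the item names and weights are hypothetical placeholders rather than actual JHFRAT items, and the constrained optimization that chooses the weights is not shown.

```python
def additive_score(item_values, item_weights):
    """Weighted sum of item values; reweighting changes only `item_weights`."""
    return sum(item_weights[name] * value for name, value in item_values.items())

def auc_roc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

weights = {"item_a": 3, "item_b": 5}  # hypothetical item weights
encounters = [{"item_a": 2, "item_b": 1}, {"item_a": 0, "item_b": 1}, {"item_a": 1, "item_b": 0}]
labels = [1, 0, 0]                    # 1 = high fall risk in this toy data
scores = [additive_score(e, weights) for e in encounters]
print(scores, auc_roc(scores, labels))
```

Because the score stays a plain weighted sum, a reweighted tool can be used in the same workflow as the original, which is the interpretability argument made in the abstract.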
LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation
Authors: Samy Haffoudhi, Fabian M. Suchanek, Nils Holzenberger
First: 2026-01-08T18:15:34+00:00 · Latest: 2026-01-08T18:15:34+00:00
Abstract
Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.
中文标题/摘要
标题:LELA:基于LLM的零样本领域自适应实体链接方法
实体链接(将文本中含糊的提及映射到知识库中的实体)是知识图谱构建、问答和信息提取等任务中的一个基础步骤。我们的方法LELA是一种模块化的粗细结合方法,利用了大型语言模型(LLM)的能力,并且可以在不同的目标领域、知识库和LLM上工作,无需任何微调阶段。我们在各种实体链接设置下的实验表明,LELA在与微调方法的竞争中表现出色,并且显著优于未微调的方法。
Summary / 总结
LELA is a modular coarse-to-fine entity linking approach that uses large language models (LLMs) for mapping text mentions to knowledge base entities. It does not require fine-tuning and can adapt to different domains and knowledge bases. Experiments show that LELA performs competitively with fine-tuned methods and outperforms non-fine-tuned approaches in various settings.
LELA 是一种模块化的实体链接方法,利用大型语言模型(LLM)将文本中的提及映射到知识库实体。它不需要微调,可以适应不同的领域和知识库。实验表明,LELA 在各种设置中与微调方法竞争,并且优于非微调方法。
Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable
Authors: Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam
First: 2026-01-08T18:13:46+00:00 · Latest: 2026-01-08T18:13:46+00:00
Abstract
When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography. Why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The decision happens in under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while keeping 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines.
中文标题/摘要
标题:降低AI研究成本:任务感知压缩如何使大型语言模型代理负担得起
当研究人员使用大型语言模型进行自主任务,如文献审查或生成假设时,计算费用会迅速增加。使用一个700亿参数模型的一次研究会话可能需要大约127美元的云费用,使这些工具对许多学术实验室来说遥不可及。我们开发了AgentCompress来直接解决这个问题。核心思想源于我们在工作中的一个简单观察:撰写新的假设比重新格式化参考文献需要模型更多的能力。为什么这两个任务都应该以全精度运行?我们的系统使用一个小的神经网络,根据每个新任务的开头词语来判断任务的难度,然后将其路由到一个适当压缩的模型变体。这个决定在不到一毫秒内完成。在四个科学领域的500个研究工作流中进行测试,我们计算成本降低了68.3%,同时保持了96.2%的原始成功率。对于那些关注预算的实验室来说,这可能意味着能够在进行实验和坐观旁待之间做出选择
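The routing idea in the AgentCompress entry above (estimate task difficulty from the opening words, then pick a compressed model variant) can be sketched as follows. Everything concrete here is an assumption: the keyword heuristic is only a stand-in for the small neural network, and the variant names and difficulty ceilings are hypothetical, since the abstract does not list them.

```python
def estimate_difficulty(task_text, prefix_words=8):
    """Hypothetical stand-in for the small neural network: score the opening words in [0, 1]."""
    hard_cues = {"hypothesis", "novel", "design", "prove", "derive"}
    opening = task_text.lower().split()[:prefix_words]
    if not opening:
        return 0.0
    hits = sum(1 for word in opening if word.strip(".,:;") in hard_cues)
    return min(1.0, hits / 2)

def route(task_text, variants=(("4-bit", 0.3), ("8-bit", 0.7), ("full-precision", 1.0))):
    """Pick the most compressed variant whose difficulty ceiling covers the task."""
    difficulty = estimate_difficulty(task_text)
    for name, ceiling in variants:
        if difficulty <= ceiling:
            return name
    return variants[-1][0]

print(route("Reformat this bibliography in APA style"))
print(route("Write a novel hypothesis about protein folding"))
```

Looking only at a short prefix keeps the routing decision cheap relative to running the task itself, in line with the sub-millisecond decision time reported in the abstract.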
SimuAgent: An LLM-Based Simulink Modeling Assistant Enhanced with Reinforcement Learning
Authors: Yanchang Liang, Xiaowei Zhao
First: 2026-01-08T18:10:35+00:00 · Latest: 2026-01-08T18:10:35+00:00
Abstract
Large language models (LLMs) have revolutionized text-based code automation, but their potential in graph-oriented engineering workflows remains under-explored. We introduce SimuAgent, an LLM-powered modeling and simulation agent tailored for Simulink. SimuAgent replaces verbose XML with a concise, dictionary-style Python representation, dramatically cutting token counts, improving interpretability, and enabling fast, in-process simulation. A lightweight plan-execute architecture, trained in two stages, equips the agent with both low-level tool skills and high-level design reasoning. To tackle sparse rewards in long-horizon tasks, we propose Reflection-GRPO (ReGRPO), which augments Group Relative Policy Optimization (GRPO) with self-reflection traces that supply rich intermediate feedback, accelerating convergence and boosting robustness. Experiments on SimuBench, our newly released benchmark comprising 5300 multi-domain modeling tasks, show that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o when evaluated with few-shot prompting on the same benchmark. Ablations confirm that the two-stage curriculum and abstract-reconstruct data augmentation further enhance generalization. SimuAgent trains and runs entirely on-premise with modest hardware, delivering a privacy-preserving, cost-effective solution for industrial model-driven engineering. SimuAgent bridges the gap between LLMs and graphical modeling environments, offering a practical solution for AI-assisted engineering design in industrial settings.
中文标题/摘要
标题:SimuAgent:一种增强强化学习的基于LLM的Simulink建模助手
大型语言模型(LLMs)已经彻底改变了基于文本的代码自动化,但在图形导向的工程工作流中的潜力尚未得到充分探索。我们介绍了SimuAgent,这是一种专为Simulink设计的LLM驱动的建模和仿真代理。SimuAgent用简洁的字典风格的Python表示法取代了冗长的XML,大幅减少了标记数量,提高了可解释性,并使仿真变得快速且在进程内进行。一种轻量级的计划-执行架构,经过两阶段训练,使代理具备了低级工具技能和高级设计推理能力。为应对长时任务中的稀疏奖励,我们提出了Reflection-GRPO(ReGRPO),它通过自我反思轨迹增强了Group Relative Policy Optimization(GRPO),提供了丰富的中间反馈,加速了收敛并提高了鲁棒性。在我们新发布的包含5300个多领域建模任务的SimuBench基准测试上进行的实验表明,使用SimuAgent微调的Qwen2.5-7B模型比标准的RL基线收敛更快,建模精度更高,甚至在使用少量示例提示在相同基准测试上评估时,超过了GPT-4o。消融实验表明,两阶段课程和抽象重建数据增强进一步增强了泛化能力。SimuAgent完全在本地进行训练和运行,硬件要求较低,提供了一种保护隐私、成本效益高的工业模型驱动工程解决方案。SimuAgent在LLMs和图形建模环境之间架起了一座桥梁,为工业环境中的AI辅助工程设计提供了一个实用的解决方案。
Summary / 总结
SimuAgent is an LLM-powered agent designed for Simulink modeling, using a lightweight plan-execute architecture and a two-stage training process to build both low-level tool skills and high-level design reasoning. It employs Reflection-GRPO to address sparse rewards in long-horizon tasks, improving convergence and robustness. Experiments on SimuBench demonstrate that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o evaluated with few-shot prompting. SimuAgent is cost-effective and privacy-preserving, suitable for industrial model-driven engineering.
SimuAgent 是一个基于 LLM 的 Simulink 模型设计助手,采用轻量级计划执行架构和 Reflection-GRPO,提升其性能。它将冗长的 XML 转换为简洁的 Python 表示,提高可解释性和仿真速度。实验表明,SimuAgent 在 SimuBench 上的表现优于标准的 RL 基线和 GPT-4o,在建模准确性和收敛速度方面更胜一筹,同时保持了工业应用中的隐私保护和成本效益。
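The SimuAgent entry above builds on Group Relative Policy Optimization (GRPO), whose core step is normalizing each rollout's reward against the other rollouts sampled for the same prompt. A minimal sketch of that normalization follows; the use of the population standard deviation and the `eps` term are assumptions about implementation detail, and the self-reflection traces that ReGRPO adds are not modelled here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print(group_relative_advantages([0, 0, 0, 1]))  # one rollout succeeded
print(group_relative_advantages([0, 0, 0, 0]))  # no rollout succeeded
```

When every rollout in a group receives the same reward, all advantages are zero and the group gives no learning signal. That is one way to see why sparse rewards in long-horizon tasks are a problem and why richer intermediate feedback could help.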
Observations and Remedies for Large Language Model Bias in Self-Consuming Performative Loop
Authors: Yaxuan Wang, Zhongteng Cai, Yujia Bao, Xueru Zhang, Yang Liu
First: 2026-01-08T18:08:15+00:00 · Latest: 2026-01-08T18:08:15+00:00
Abstract
The rapid advancement of large language models (LLMs) has led to growing interest in using synthetic data to train future models. However, this creates a self-consuming retraining loop, where models are trained on their own outputs, which may cause performance drops and induce emerging biases. In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of Self-Consuming Performative Loop (SCPL) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops: the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.
中文标题/摘要
标题:大型语言模型偏见的观察与补救措施在自我消耗执行循环中的影响
大型语言模型(LLMs)的迅速发展引发了对使用合成数据进行未来模型训练的兴趣。然而,这导致了一个自我消耗的重新训练循环,模型在训练过程中使用自己的输出,可能导致性能下降并引发新的偏见。在实际应用中,之前部署的LLMs可能会影响它们生成的数据,形成一个由用户反馈驱动的动态系统。例如,如果模型持续未能满足某一用户群体的需求,那么来自该特定用户群体的数据收集量将会减少。在本研究中,我们引入了自我消耗执行循环(SCPL)的概念,并探讨合成数据在这些动态迭代训练过程中如何塑造偏见的作用。这种受控的反馈机制是由于难以获取动态生产系统中的真实用户偏好数据,使我们能够以一种原则性的方式隔离和分析反馈驱动的偏见演变。我们关注两种类型的循环,包括典型的重新训练设置和增量微调设置,后者尚未得到充分探索。通过三个实际任务的实验,我们发现执行循环增加了偏好偏见并减少了差异偏见。我们设计了一种基于奖励的拒绝采样策略来减轻偏见,朝着更可信赖的自我改进系统迈进。
Summary / 总结
This study addresses the issue of bias in large language models (LLMs) that arise from self-consuming performative loops, where models are trained on their own outputs. The research introduces the concept of Self-Consuming Performative Loop (SCPL) and investigates how synthetic data influences bias during iterative training processes. Experiments on three real-world tasks show that the performative loop increases preference bias and decreases disparate bias. The study proposes a reward-based rejection sampling strategy to mitigate these biases, aiming to enhance the trustworthiness of self-improving systems.
研究探讨了大型语言模型(LLM)中的自我消费表现性循环(SCPL),其中模型通过训练自己的输出导致性能下降和偏见产生。研究引入了SCPL,并考察了合成数据在动态迭代训练过程中如何塑造偏见。实验结果显示,表现性循环增加了偏好偏见并减少了差异偏见。研究提出了一种基于奖励的拒绝采样策略来缓解这些偏见,旨在提高自我改进系统的可信度。
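Reward-based rejection sampling, the mitigation named in the entry above, filters synthetic samples by a reward score before they are reused for training. The sketch below shows only that filtering step; the reward function, threshold, and `max_keep` cap are hypothetical, and the abstract does not describe how the authors define the reward.

```python
def rejection_sample(candidates, reward_fn, threshold, max_keep=None):
    """Keep candidates whose reward meets the threshold, preserving their order."""
    kept = [c for c in candidates if reward_fn(c) >= threshold]
    return kept if max_keep is None else kept[:max_keep]

def toy_reward(text):
    """Hypothetical reward: longer answers score higher, capped at 1.0."""
    return min(1.0, len(text.split()) / 5)

candidates = ["ok", "a short reply", "a longer and more helpful reply here"]
print(rejection_sample(candidates, toy_reward, threshold=0.6))
```

In a self-consuming loop the filter would sit between generation and the next training round, so that only samples passing the reward check feed back into the model.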
Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
Authors: Ekaterina Fadeeva, Aleksandr Rubashevskii, Dzianis Piatrashyn, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, Maxim Panov
First: 2025-05-27T11:56:59+00:00 · Latest: 2026-01-08T18:06:58+00:00
Abstract
Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise from inaccuracies in the model's internal knowledge and the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually correct statements as hallucinations if they are not explicitly supported by the retrieval. In this paper, we introduce FRANQ, a new method for hallucination detection in RAG outputs. FRANQ applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. To evaluate FRANQ and competing UQ methods, we construct a new long-form question answering dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging cases. Extensive experiments across multiple datasets, tasks, and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing approaches.
中文标题/摘要
标题:面向事实核查的检索增强生成输出忠实度感知不确定性量化
增强检索的大语言模型(LLMs),即检索增强生成(RAG)方法,在开放域问答任务中取得了优异表现。然而,RAG 仍然容易产生幻觉:由于模型内部知识和检索到的上下文的不准确性,可能会产生事实错误的输出。现有减轻幻觉的方法往往将事实性与检索证据的忠实度混为一谈,错误地将与检索证据不完全一致但事实上正确的陈述标记为幻觉。在本文中,我们提出了 FRANQ,一种新的 RAG 输出幻觉检测方法。FRANQ 应用了不同的不确定性量化(UQ)技术,根据陈述是否忠实于检索到的上下文来估计事实性。为了评估 FRANQ 和竞争的 UQ 方法,我们构建了一个新的长形式问答数据集,该数据集同时标注了事实性和忠实度,并结合了自动标注和手动验证具有挑战性的案例。在多个数据集、任务和大语言模型上的广泛实验表明,FRANQ 在检测 RAG 生成响应中的事实错误方面比现有方法更准确。
Summary / 总结
This paper addresses the issue of hallucinations in Retrieval-Augmented Generation (RAG) outputs by introducing FRANQ, a method that uses distinct uncertainty quantification techniques to estimate factuality while considering faithfulness to the retrieved context. The authors evaluate FRANQ and competing UQ methods on a new dataset annotated for both factuality and faithfulness, demonstrating that FRANQ provides more accurate detection of factual errors in RAG-generated responses than existing approaches.
本文通过引入FRANQ方法,该方法使用不同的不确定性量化技术来估计事实性,同时考虑检索上下文的忠实性,来解决检索增强生成(RAG)输出中的幻觉问题。作者在标注了事实性和忠实性的新数据集上评估了FRANQ和竞争方法,结果显示FRANQ在检测RAG生成响应中的事实错误方面比现有方法更准确。
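One way to read "estimating factuality conditioned on faithfulness" in the FRANQ entry above is through the law of total probability: combine a factuality estimate for the faithful case and one for the unfaithful case, weighted by the probability that the statement is faithful. Whether FRANQ combines its estimates in exactly this form is an assumption, and the function and argument names are hypothetical.

```python
def factuality_estimate(p_faithful, p_true_if_faithful, p_true_if_unfaithful):
    """P(true) = P(faithful) * P(true | faithful) + P(unfaithful) * P(true | unfaithful)."""
    for p in (p_faithful, p_true_if_faithful, p_true_if_unfaithful):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_faithful * p_true_if_faithful + (1.0 - p_faithful) * p_true_if_unfaithful

# A statement that is probably grounded in the retrieved context.
print(factuality_estimate(0.8, 0.9, 0.5))
```

The second term matters for the failure mode the abstract describes: a statement that is not supported by the retrieval can still be true, so it should not be scored as a hallucination by default.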
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Authors: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
First: 2026-01-08T18:00:59+00:00 · Latest: 2026-01-08T18:00:59+00:00
Comments: Project page: https://ivul-kaust.github.io/projects/videoauto-r1/
Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
中文标题/摘要
标题:VideoAuto-R1:通过一次思考,两次回答进行视频自动推理
链式思考(CoT)推理已成为多模态大型语言模型在视频理解任务中的一种强大工具。然而,其必要性及其与直接回答相比的优势尚未得到充分探索。在本文中,我们首先证明,对于通过强化学习训练的视频模型,直接回答往往能够匹配甚至超越CoT的表现,尽管CoT以更高的计算成本生成逐步分析。受此启发,我们提出了一种VideoAuto-R1视频理解框架,采用一种必要时才推理的策略。在训练过程中,我们的方法遵循一次思考,两次回答的模式:模型首先生成一个初始答案,然后进行推理,最后输出一个经过审查的答案。两个答案都通过可验证的奖励进行监督。在推理过程中,模型使用初始答案的置信度分数来决定是否继续进行推理。在视频问答和定位基准测试中,VideoAuto-R1在显著提高效率的同时达到了最先进的准确率,平均响应长度减少了约3.3倍,例如,从149个词减少到仅44个词。此外,我们观察到,在感知导向的任务中,推理模式的激活率较低,而在推理密集型任务中,激活率较高。这表明显式的基于语言的推理通常是有益的,但并非总是必要的。
Summary / 总结
This paper explores the necessity of chain-of-thought (CoT) reasoning in video understanding tasks and introduces VideoAuto-R1, a framework that reasons only when necessary. During training, the model generates an initial answer, performs reasoning, and outputs a reviewed answer, with both answers supervised by verifiable rewards. During inference, the model decides whether to reason based on the confidence of the initial answer. VideoAuto-R1 achieves state-of-the-art accuracy while reducing average response length by about 3.3x, and it activates reasoning more often on reasoning-intensive tasks than on perception-oriented ones.
论文探讨了链式思考(CoT)推理在视频理解任务中的必要性,并提出了VideoAuto-R1框架,该框架仅在必要时进行推理。在训练过程中,模型首先生成初始答案,然后进行推理并输出审查后的答案,通过可验证的奖励进行监督。在推理过程中,它根据初始答案的置信度决定是否进行推理。VideoAuto-R1实现了最先进的准确率,并提高了效率,将响应长度减少了3.3倍。研究表明,推理是有益的,但在感知导向的任务中并不总是必要的。
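The inference-time gating described for VideoAuto-R1 can be illustrated with a minimal sketch. The abstract only says that the confidence score of the initial answer decides whether reasoning runs; the function names, the threshold value of 0.9, and the toy models below are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Tuple


def answer_with_optional_reasoning(
    question: str,
    direct_answer: Callable[[str], Tuple[str, float]],
    reason_and_review: Callable[[str, str], str],
    threshold: float = 0.9,  # assumed value; the abstract gives no threshold
) -> Tuple[str, bool]:
    """Return (final_answer, used_reasoning)."""
    initial, confidence = direct_answer(question)
    if confidence >= threshold:
        # Confident enough: skip the reasoning pass entirely.
        return initial, False
    # Otherwise reason, then emit a reviewed answer.
    return reason_and_review(question, initial), True


# Toy stand-ins for the model (hypothetical).
def toy_direct(question: str) -> Tuple[str, float]:
    if "color" in question:
        return "red", 0.97  # perception-style question, high confidence
    return "2", 0.40        # counting question, low confidence


def toy_review(question: str, initial: str) -> str:
    return "3"  # pretend the reasoning pass corrects the count


print(answer_with_optional_reasoning("What color is the car?", toy_direct, toy_review))
print(answer_with_optional_reasoning("How many bounces?", toy_direct, toy_review))
```

In this toy run the perception-style question is answered directly, while the low-confidence counting question triggers the reasoning pass.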
FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts
Authors: Yiji Zhao, Zihao Zhong, Ao Wang, Haomin Wen, Ming Jin, Yuxuan Liang, Huaiyu Wan, Hao Wu
Venue: KDD 2026
First: 2026-01-08T18:00:58+00:00 · Latest: 2026-01-08T18:00:58+00:00
Comments: Accepted to KDD 2026
Abstract
Spatial-Temporal Graph (STG) forecasting on large-scale networks has garnered significant attention. However, existing models predominantly focus on short-horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long-horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity-aware Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting, which unlocks one-week-ahead (672 steps at a 15-minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self-attention modules when applied to large-scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real-world datasets demonstrate that FaST not only delivers superior long-horizon predictive accuracy but also achieves remarkable computational efficiency compared to state-of-the-art baselines. Our source code is available at: https://github.com/yijizhao/FaST.
中文标题/摘要
标题:FaST:基于专家混合的异质性感知大规模时空图长时预测框架
大规模网络上的时空图(STG)预测引起了广泛关注。然而,现有模型主要关注短时预测,并在扩展到长时预测和大规模图时遭受严重的计算成本和内存消耗问题。为应对上述挑战,我们提出了一种基于异质性感知专家混合(MoEs)的FaST框架,该框架适用于长时和大规模STG预测,能够实现一周(672步,每15分钟一个时间粒度)的预测,涉及数千个节点。FaST的核心创新包括:首先,提出了一种自适应图代理注意力机制,以缓解在大规模图上应用传统图卷积和自注意力模块时固有的计算负担;其次,提出了一种新的并行MoE模块,用门控线性单元(GLUs)替换传统的前馈网络,从而实现高效且可扩展的并行结构。在真实世界数据集上的广泛实验表明,FaST不仅在长时预测准确性上表现出色,而且在计算效率上也显著优于最先进的基线方法。我们的源代码可在:https://github.com/yijizhao/FaST/ 获取。
Summary / 总结
FaST is designed to address the challenges of long-horizon forecasting on large-scale spatial-temporal graphs by proposing an adaptive graph agent attention mechanism and a parallel MoE module with Gated Linear Units. This framework significantly improves computational efficiency and predictive accuracy, achieving one-week-ahead forecasts with thousands of nodes. Experiments show FaST outperforms existing methods in both accuracy and efficiency.
FaST 是一种用于大型时空图长时预测的框架,通过引入适应性图代理注意力机制和带有门控线性单元的并行 Mixture-of-Experts 模块来解决现有模型的计算挑战。FaST 实现了一周(672 步,每 15 分钟一步)的预测,并且在准确性和计算效率上都优于最先进的方法。
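Two details in the FaST entry lend themselves to a short worked example: the 672-step horizon, and the Gated Linear Unit that replaces the feed-forward network. The sketch below uses the standard GLU form, (xW_v) * sigmoid(xW_g), with biases omitted. How FaST routes inputs to experts and which GLU variant it uses are not stated in the abstract, so the weight layout here is an assumption.

```python
import math
from typing import List

# One week ahead at 15-minute granularity.
STEPS_PER_HOUR = 60 // 15
HORIZON_STEPS = 7 * 24 * STEPS_PER_HOUR
print(HORIZON_STEPS)  # 672


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def glu(x: List[float], w_value: List[List[float]], w_gate: List[List[float]]) -> List[float]:
    """Standard GLU: elementwise (x @ W_v) * sigmoid(x @ W_g).

    Each weight matrix is given as a list of columns, one column per
    output unit (an illustrative layout, not taken from the paper).
    """
    def project(cols: List[List[float]]) -> List[float]:
        return [sum(xi * wi for xi, wi in zip(x, col)) for col in cols]

    values = project(w_value)
    gates = [sigmoid(g) for g in project(w_gate)]
    return [v * g for v, g in zip(values, gates)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With zero gate weights every gate is sigmoid(0) = 0.5.
print(glu([1.0, 2.0], identity, zeros))
```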
CoV: Chain-of-View Prompting for Spatial Reasoning
Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang
First: 2026-01-08T17:59:42+00:00 · Latest: 2026-01-08T17:59:42+00:00
Abstract
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
中文标题/摘要
标题:CoV:空间推理的链式视角提示
在3D环境中的具身问答(EQA)通常需要收集分布在多个视角且部分被遮挡的上下文。然而,大多数最新的视觉-语言模型(VLMs)仅限于固定且有限的输入视角集,这限制了它们在推理时获取与问题相关上下文的能力,并阻碍了复杂的空间推理。我们提出了一种名为Chain-of-View(CoV)的提示方法,这是一种无需训练、在测试时进行推理的框架,通过从粗到细的探索过程将VLM转变为主动的视角推理者。CoV首先使用视图选择代理筛选冗余帧并识别与问题对齐的锚视图,然后通过交替进行迭代推理和离散相机动作进行细粒度视图调整,从底层3D场景表示中获取新观察,直到收集到足够上下文或达到步骤预算。我们在OpenEQA上对CoV进行了评估,跨四个主流VLMs获得了平均+11.56%的LLM-Match改进,最大增益为Qwen3-VL-Flash上的+13.62%。CoV还表现出测试时的扩展性:增加最小动作预算可额外获得平均+2.51%的改进,峰值为Gemini-2.5-Flash上的+3.73%。在ScanQA和SQA3D上,CoV表现出强大的性能(例如,ScanQA上的116 CIDEr / 31.9 EM@1和SQA3D上的51.1 EM@1)。总体而言,这些结果表明,与问题对齐的视图选择结合开放视图搜索是提高3D EQA中空间推理能力的有效、模型无关的策略,无需额外训练。
Summary / 总结
The paper proposes Chain-of-View (CoV) prompting to enhance spatial reasoning in embodied question answering (EQA) by enabling a vision-language model to explore multiple viewpoints dynamically. CoV uses a View Selection agent to filter redundant frames and identify relevant anchor views, followed by fine-grained view adjustments through iterative reasoning and camera actions. Experiments on OpenEQA show an average improvement of 11.56% in LLM-Match, with up to 13.62% on Qwen3-VL-Flash. CoV also scales positively with more actions, achieving up to 3.73% improvement on Gemini-2.5-Flash. Performance on ScanQA and SQA3D is also strong, indicating the effectiveness of CoV for spatial reasoning in 3D EQA without additional training.
论文针对3D环境中的具身问答(EQA)问题,其中背景信息分布在多个视角中。提出了Chain-of-View (CoV) 提示,这是一种测试时的推理框架,通过粗到细的过程增强VLMs,使其能够主动探索并收集相关背景信息。CoV在LLM-Match上平均提高了11.56%,最大增益为13.62%(Qwen3-VL-Flash)。此外,它还展示了测试时的扩展性,最小动作预算增加时额外提高了2.51%的性能,并在ScanQA和SQA3D上表现出色。
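The coarse-to-fine loop in the CoV entry (select anchor views, then alternate reasoning with camera actions until enough context is gathered or the step budget runs out) can be sketched as follows. The callables, their signatures, and the toy scene in which a view is just a yaw angle are hypothetical stand-ins. The actual agents, action space, and stopping rule are not specified in the abstract.

```python
from typing import Callable, List, Optional, Tuple

View = int  # toy view: a camera yaw angle in degrees (assumption)


def chain_of_view(
    question: str,
    views: List[View],
    select_anchors: Callable[[str, List[View]], List[View]],
    propose: Callable[[str, List[View], bool], Tuple[Optional[str], Optional[str]]],
    render: Callable[[View, str], View],
    max_steps: int = 5,
) -> Tuple[str, int]:
    """Return (answer, number_of_camera_actions_taken)."""
    # Coarse stage: keep only question-aligned anchor views.
    context = select_anchors(question, views)
    # Fine stage: interleave reasoning with discrete camera actions.
    for step in range(max_steps):
        action, answer = propose(question, context, False)
        if answer is not None:
            return answer, step
        context.append(render(context[-1], action))
    # Budget exhausted: force an answer from the gathered context.
    _, answer = propose(question, context, True)
    return answer, max_steps


# Toy scene: the object of interest is only visible at yaw 90.
def toy_select(question: str, views: List[View]) -> List[View]:
    return [views[1]]  # pretend view 1 is the best anchor


def toy_propose(question, context, must_answer):
    if context[-1] == 90:
        return None, "mug"
    if must_answer:
        return None, "unknown"
    return "turn_right", None


def toy_render(view: View, action: str) -> View:
    return view + 15 if action == "turn_right" else view - 15


print(chain_of_view("What is on the desk?", [0, 45, 180], toy_select, toy_propose, toy_render))
```

Raising `max_steps` gives the loop more chances to reach an informative view, which loosely mirrors the budget-based test-time scaling reported in the abstract.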
Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems
Authors: Jihao Zhao, Ding Chen, Zhaoxin Fan, Kerun Xu, Mengting Hu, Bo Tang, Feiyu Xiong, Zhiyu Li
First: 2026-01-08T17:59:11+00:00 · Latest: 2026-01-08T17:59:11+00:00
Abstract
Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
中文标题/摘要
标题:Inside Out:演化中的用户中心核心记忆树以支持长期个性化对话系统
现有的长期个性化对话系统难以调和无界交互流与有限上下文约束之间的关系,常常受到记忆噪声累积、推理退化和人设不一致的困扰。为了解决这些挑战,本文提出Inside Out框架,利用全局维护的PersonaTree作为长期用户画像的载体。通过初始模式约束主干并更新分支和叶子,PersonaTree实现了可控增长,同时实现了记忆压缩并保持一致性。此外,我们通过基于过程的奖励进行强化学习训练了一个轻量级的MemListener,以生成结构化、可执行和可解释的{ADD, UPDATE, DELETE, NO_OP}操作,从而支持个性化树的动态演化。在响应生成过程中,PersonaTree直接被利用以在延迟敏感场景中增强输出;当用户需要更多细节时,在PersonaTree的约束下触发代理模式以按需引入细节。实验表明,PersonaTree在抑制上下文噪声和保持人设一致性方面优于全文拼接和各种个性化记忆系统。值得注意的是,小型MemListener模型在记忆操作决策性能上与强大的推理模型DeepSeek-R1-0528和Gemini-3-Pro相当,甚至超越它们。
Summary / 总结
This paper addresses the challenges of long-term personalized dialogue systems by proposing Inside Out, a framework that uses a PersonaTree to maintain user profiling. The PersonaTree allows for controlled growth by updating branches and leaves while constraining the trunk with an initial schema, which helps in memory compression and consistency. A lightweight MemListener trained via reinforcement learning generates structured operations to support dynamic evolution of the personalized tree. Experiments demonstrate that PersonaTree outperforms other methods in reducing contextual noise and maintaining persona consistency, with the MemListener achieving performance comparable to powerful reasoning models.
本文提出了一种名为Inside Out的框架,通过使用PersonaTree来维护用户画像,解决了长期个性化对话系统中的挑战。PersonaTree允许有控制的增长和记忆压缩,同时保持一致性。通过强化学习训练的轻量级MemListener生成结构化的操作来更新PersonaTree。实验表明,PersonaTree在抑制上下文噪声和保持人物一致性方面优于其他方法,且MemListener的性能与强大的推理模型相当。
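The {ADD, UPDATE, DELETE, NO_OP} operations in the Inside Out entry can be made concrete with a small sketch of a schema-constrained tree. The nested-dict layout, the trunk names, and the error handling are assumptions for illustration. The abstract does not describe the real PersonaTree schema or the format MemListener emits.

```python
from typing import Dict, Optional

# Hypothetical fixed trunk; the paper's initial schema is not given.
TRUNK_SCHEMA = ("preferences", "background")

Tree = Dict[str, Dict[str, str]]


def new_tree() -> Tree:
    return {trunk: {} for trunk in TRUNK_SCHEMA}


def apply_op(tree: Tree, op: str, trunk: Optional[str] = None,
             key: Optional[str] = None, value: Optional[str] = None) -> Tree:
    """Apply one memory operation in place and return the tree."""
    if op == "NO_OP":
        return tree
    if trunk not in tree:
        # The trunk is fixed by the schema; only branches and leaves grow.
        raise KeyError(f"trunk {trunk!r} is not in the schema")
    branch = tree[trunk]
    if op == "ADD":
        if key in branch:
            raise KeyError(f"{key!r} already exists; use UPDATE")
        branch[key] = value
    elif op == "UPDATE":
        if key not in branch:
            raise KeyError(f"{key!r} does not exist; use ADD")
        branch[key] = value
    elif op == "DELETE":
        branch.pop(key, None)
    else:
        raise ValueError(f"unknown operation {op!r}")
    return tree


tree = new_tree()
apply_op(tree, "ADD", "preferences", "drink", "coffee")
apply_op(tree, "UPDATE", "preferences", "drink", "tea")
apply_op(tree, "NO_OP")
apply_op(tree, "ADD", "background", "city", "Oslo")
apply_op(tree, "DELETE", "background", "city")
print(tree)
```

Rejecting operations on unknown trunks is one simple way to keep growth controllable while still letting leaves change freely.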
Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference
Authors: Rasmus Blanck, Bill Noble, Stergios Chatzikyriakidis
First: 2026-01-08T17:58:52+00:00 · Latest: 2026-01-08T17:58:52+00:00
Abstract
Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.
中文标题/摘要
标题:逆向工程自然语言推理:自然语言推理元推理性质的研究
自然语言推理(NLI)一直是评估语言模型自然语言理解能力的重要任务,但该任务的逻辑性质尚未得到充分理解,经常被误表征。理解NLI所捕捉的推理概念是解释模型在该任务上的表现的关键。在本文中,我们提出了NLI标签集的三种可能解读,并对它们所蕴含的元推理性质进行了全面分析。以SNLI数据集为例,我们利用(1)具有共享前提的NLI项目和(2)由LLM生成的项目来评估在SNLI上训练的模型的元推理一致性,并推导出数据集中编码的逻辑关系的解读。
Summary / 总结
This paper aims to clarify the logical properties of the Natural Language Inference (NLI) task, which is crucial for interpreting model performance. The authors formulate three possible interpretations of the NLI label set and conduct a detailed analysis using the SNLI dataset. They evaluate models trained on SNLI for meta-inferential consistency by examining items with shared premises and those generated by language models, revealing insights into the logical relations encoded by the dataset.
本文旨在通过提出NLI标签集的三种可能解读并分析其元推理属性,来理解NLI任务的逻辑特性。作者使用SNLI数据集,重点关注具有共享前提的项目和由LLM生成的项目,以评估模型的元推理一致性。主要发现包括对SNLI数据集中编码的逻辑关系的见解,有助于更好地理解NLI任务上的模型性能。
RelayLLM: Efficient Reasoning via Collaborative Decoding
Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang
First: 2026-01-08T17:56:16+00:00 · Latest: 2026-01-08T17:56:16+00:00
Abstract
Using Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, consisting of warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
中文标题/摘要
标题:RelayLLM:通过协作解码实现高效推理
大型语言模型(LLMs)在进行复杂推理时往往受到高计算成本和延迟的限制,而资源高效的小型语言模型(SLMs)通常缺乏必要的推理能力。现有的协作方法,如级联或路由,以粗粒度的方式运行,将整个查询卸载到LLMs上,当SLM能够处理大多数推理步骤时,这会导致显著的计算浪费。为了解决这个问题,我们提出了一种名为RelayLLM的新框架,通过基于token的协作解码实现高效推理。与路由器不同,RelayLLM赋予SLM作为主动控制器的能力,动态地仅在关键token上调用LLM,通过特殊命令有效地“传递”生成过程。我们引入了一种两阶段训练框架,包括预热和组相对策略优化(GRPO),以教导模型平衡独立性和战略性求助。在六个基准测试中的实验结果表明,RelayLLM实现了49.52%的平均准确率,有效地弥合了两种模型之间的性能差距。值得注意的是,这仅通过调用LLM处理生成的token的1.07%,相比性能匹配的随机路由器,实现了98.2%的成本降低。
Summary / 总结
RelayLLM is a framework that enables efficient reasoning through token-level collaborative decoding between Small Language Models (SLMs) and Large Language Models (LLMs). It allows the SLM to dynamically invoke the LLM only for critical tokens, reducing computational waste. The framework uses a two-stage training process (warm-up followed by GRPO) to balance independence and strategic help-seeking. Experiments on six benchmarks show that RelayLLM achieves 49.52% average accuracy while invoking the LLM for only 1.07% of generated tokens, reducing costs by 98.2% compared to performance-matched random routers.
RelayLLM 是一种通过标记级协作解码实现高效推理的框架,解决了大型语言模型(LLMs)的计算和延迟问题,同时利用小型语言模型(SLMs)的推理能力。它使 SLM 能够动态地仅对关键标记调用 LLM,显著减少了计算浪费。实验结果表明,RelayLLM 的平均准确率为 49.52%,仅需 1.07% 的标记调用 LLM 即可实现,相比随机路由器的成本降低了 98.2%。
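Token-level relaying, as described in the RelayLLM entry, can be sketched with two next-token functions and a special command token. The token name `<call_llm>`, the one-token-per-call behaviour, and the scripted toy models are assumptions. The abstract says only that the SLM invokes the LLM for critical tokens via a special command.

```python
from typing import Callable, List, Tuple

CALL = "<call_llm>"  # hypothetical command token emitted by the SLM
EOS = "<eos>"

NextToken = Callable[[str, List[str]], str]


def relay_decode(slm_next: NextToken, llm_next: NextToken,
                 prompt: str, max_tokens: int = 32) -> Tuple[List[str], int]:
    """Decode with the SLM in control; return (tokens, llm_token_count)."""
    out: List[str] = []
    llm_tokens = 0
    while len(out) < max_tokens:
        token = slm_next(prompt, out)
        if token == CALL:
            # The SLM asks for help: the LLM supplies this one token.
            token = llm_next(prompt, out)
            llm_tokens += 1
        if token == EOS:
            break
        out.append(token)
    return out, llm_tokens


# Toy models: the SLM follows a script and defers on the final result.
SLM_SCRIPT = ["2", "+", "2", "=", CALL, EOS]


def toy_slm(prompt: str, out: List[str]) -> str:
    return SLM_SCRIPT[len(out)]


def toy_llm(prompt: str, out: List[str]) -> str:
    return "4"


tokens, llm_count = relay_decode(toy_slm, toy_llm, "Compute 2 + 2.")
print(tokens, llm_count, f"{llm_count / len(tokens):.0%} of tokens from the LLM")
```

Keeping the SLM as the controller means the LLM cost scales with the number of command tokens rather than with the full response length.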
MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging
Authors: Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li
First: 2025-09-23T06:23:56+00:00 · Latest: 2026-01-08T17:56:05+00:00
Comments: The project is available at https://charlescsyyy.github.io/MVT
Abstract
Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
中文标题/摘要
标题:MVT:基于掩码的视觉-语言模型在分类学对齐的土地覆盖标签化中的应用
遥感中的土地覆盖理解越来越需要跨数据集泛化但保持空间精确性和可解释性的类无差别系统。我们研究了在领域转移下的几何优先发现与解释设置,其中候选区域以类无差别方式划定,监督通过匿名标识符避免使用类名词汇。除了开放集识别和开放世界学习,我们专注于将类无差别掩码证据与分类学导向的场景解释相结合,而不是未知拒绝或持续类扩展。我们提出了MVT,一个三阶段框架,(i) 使用SAM2进行领域适应以提取边界忠实的区域掩码,(ii) 通过双步骤LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成,(iii) 使用LLM作为裁判评分,通过分层专家评分校准输出评估。在跨数据集分割迁移(在OpenEarthMap上训练,在LoveDA上评估)中,领域适应的SAM2提高了掩码质量;同时,双步骤多模态LLM微调产生了更准确的分类学对齐标签和更具有信息性的掩码导向场景描述。
Summary / 总结
The research aims to develop a class-agnostic system for land-cover tagging that generalizes across datasets while maintaining spatial precision and interpretability. The method involves a three-stage framework: (i) extracting boundary-faithful region masks using SAM2 with domain adaptation, (ii) performing mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluating outputs with LLM-as-judge scoring calibrated by expert ratings. The study shows that domain-adapted SAM2 improves mask quality, and dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative scene descriptions on cross-dataset segmentation transfer.
研究旨在开发适用于遥感的土地覆盖理解系统,注重空间精度和可解释性。方法包括三个阶段:(i) 使用SAM2进行领域适应以提取边界忠实的区域掩码,(ii) 通过双重步骤的LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成,(iii) 使用LLM作为评判者进行输出评估,并通过分层专家评分进行校准。关键发现包括领域适应后的SAM2提高了掩码质量,而双重步骤的LLM微调则产生了更准确的分类对齐标签和更具信息量的掩码导向场景描述。
Improving and Evaluating Open Deep Research Agents
Authors: Doaa Allabadi, Kyle Bradbury, Jordan M. Malof
First: 2025-08-13T19:32:01+00:00 · Latest: 2026-01-08T17:54:58+00:00
Comments: 8 pages, 2 figures, 2 tables
Abstract
We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user, and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we only found one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally tractable DRA benchmark for academic labs. We benchmark ODR and two proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contributed to the success of ODR+.
中文标题/摘要
标题:改进和评估开放深度研究代理
我们在这里关注深度研究代理(DRAs),这是一种可以从用户那里接收自然语言提示,并自主搜索和利用互联网内容来回应提示的系统。最近的DRAs在公共基准测试中展示了令人印象深刻的性能,然而,最近的研究主要涉及专有的闭源系统。在本研究进行时,我们仅发现一个开源的DRA,称为开放深度研究(ODR)。在本工作中,我们将具有挑战性的最近的BrowseComp基准测试改编为比较ODR与现有专有系统的基准测试。我们提出了BrowseComp-Small(BC-Small),这是一个更易于计算的DRAs基准测试,适用于学术实验室。我们在BC-Small上对ODR和两个专有系统进行了基准测试:一个来自Anthropic的系统和一个来自Google的系统。我们发现,这三个系统在包含60个问题的测试集上的准确率均为0%。我们提出了对ODR的三个战略改进,从而形成了ODR+模型,该模型在BC-Small基准测试中实现了专有和开源系统中的最佳10%成功率。我们报告了消融研究,表明我们的三个改进都对ODR+的成功做出了贡献。
Summary / 总结
This research aims to improve and evaluate open deep research agents (DRAs) by adapting the BrowseComp benchmark to compare ODR with proprietary systems. The study introduces three strategic improvements to ODR, resulting in the ODR+ model, which achieves a 10% success rate on BC-Small, outperforming both closed-source and open-source systems. Ablation studies show that all three improvements contributed to the success of ODR+.
研究集中于能够处理自然语言提示并自主搜索和利用互联网内容的深度研究代理(DRAs)。研究将BrowseComp基准适应以评估ODR,一个开源DRA,与现有系统进行对比。在基准测试中,三个系统的准确率均为0%。对ODR进行了三项战略改进,形成了ODR+模型,在BC-Small基准测试中达到了10%的成功率,成为开源和封闭源系统中的新最佳水平。
DocDancer: Towards Agentic Document-Grounded Information Seeking
Authors: Qintong Zhang, Xinjie Lv, Jialong Wu, Baixuan Li, Zhengwei Tao, Guochen Yan, Huanyao Zhang, Bin Wang, Jiahao Xu, Haitao Mi, Wentao Zhang
First: 2026-01-08T17:54:32+00:00 · Latest: 2026-01-08T17:54:32+00:00
Abstract
Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Models trained on the synthesized data prove effective on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights for agentic tool design and synthetic data.
中文标题/摘要
标题:DocDancer: 向基于文档的主动信息寻求迈进
文档问题回答(DocQA)专注于基于给定文档回答问题,但现有的DocQA代理缺乏有效的工具利用,主要依赖于封闭源模型。在本工作中,我们介绍了DocDancer,一个端到端训练的开源Doc代理。我们将DocQA形式化为一个信息寻求问题,并提出了一种工具驱动的代理框架,明确地建模了文档探索和理解。为了使此类代理能够端到端训练,我们引入了一种探索然后合成的数据合成管道,以解决DocQA高质量训练数据稀缺的问题。在合成数据上进行训练,两个长上下文文档理解基准MMLongBench-Doc和DocBench上的训练模型展示了其有效性。进一步的分析为代理工具设计和合成数据提供了有价值的见解。
Summary / 总结
DocDancer is an end-to-end trained open-source document-grounded question answering agent that addresses the limitations of existing agents, which use tools poorly and rely largely on closed-source models. Its tool-driven framework explicitly models document exploration and comprehension, and an Exploration-then-Synthesis data synthesis pipeline is introduced to overcome the scarcity of high-quality training data. Models trained on the synthesized data prove effective on the MMLongBench-Doc and DocBench benchmarks, and further analysis offers insights for agentic tool design and synthetic data generation.
DocDancer 是一个端到端训练的开源文档导向问答代理。它通过明确建模文档探索和理解来克服现有代理的局限性。该代理使用探索然后合成的数据合成管道进行训练,以应对高质量训练数据稀缺的问题。训练后的模型在两个长文档理解基准测试中表现出色,表明其有效性。这项工作为代理工具设计和合成数据生成提供了见解。