Pixel-Perfect Visual Geometry Estimation
Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang
First: 2026-01-08T18:59:49+00:00 · Latest: 2026-01-08T18:59:49+00:00
Comments: Code: https://github.com/gangweix/pixel-perfect-depth
Abstract
Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
中文标题/摘要
标题:像素完美视觉几何估计
从图像中恢复干净且准确的几何结构对于机器人技术和增强现实至关重要。然而,现有的几何基础模型仍然严重受到漂像素和细节损失的影响。在本文中,我们提出了像素完美视觉几何模型,通过在像素空间中利用生成建模来预测无漂像素的高质量点云。我们首先介绍了像素完美深度(PPD),这是一种基于像素空间扩散变换器(DiT)的单目深度基础模型。为了解决像素空间扩散带来的高计算复杂性,我们提出了两种关键设计:1)语义提示DiT,该设计结合了视觉基础模型的语义表示来提示扩散过程,保留全局语义同时增强细粒度视觉细节;2)级联DiT架构,逐步增加图像标记的数量,提高效率和准确性。为了将PPD扩展到视频(PPVD),我们引入了一种新的语义一致DiT,该设计从多视图几何基础模型中提取时空一致的语义。然后在DiT中进行参考引导的标记传播,以最小的计算和内存开销保持时间连贯性。我们的模型在所有生成单目和视频深度估计模型中表现最佳,并且产生的点云比其他所有模型都更干净。
Summary / 总结
This paper addresses recovering clean and accurate geometry from images for robotics and augmented reality. It introduces pixel-perfect visual geometry models, specifically Pixel-Perfect Depth (PPD) and its video extension PPVD, which use pixel-space diffusion transformers to predict high-quality point clouds without flying pixels. Key designs include Semantics-Prompted DiT for preserving global semantics and enhancing fine details, and Cascade DiT for improving efficiency and accuracy. For video, a Semantics-Consistent DiT with reference-guided token propagation maintains temporal coherence. The models achieve the best performance among generative monocular and video depth estimation models and produce cleaner point clouds than other models.
本文旨在解决从图像中恢复干净准确几何结构的挑战,这对机器人技术和增强现实至关重要。文中提出了像素完美的视觉几何模型,特别是Pixel-Perfect Depth (PPD)及其视频扩展PPVD,能够预测无飞像素的高质量点云。PPD 使用像素空间扩散变换器 (DiT) 并结合语义提示来保留全局语义并增强细粒度视觉细节。Cascade DiT 架构提高了效率和准确性。对于视频,引入了语义一致的 DiT 来保持时间一致性。这些模型在单目和视频深度估计中表现出色,生成的点云更为干净。
Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration
Authors: Xingyi He, Adhitya Polavaram, Yunhao Cao, Om Deshmukh, Tianrui Wang, Xiaowei Zhou, Kuan Fang
First: 2026-01-08T18:59:30+00:00 · Latest: 2026-01-08T18:59:30+00:00
Comments: Project Page: https://cordex-manipulation.github.io/
Abstract
Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.
中文标题/摘要
标题:生成、转移、适应:从单个人类演示学习功能性灵巧抓取
功能性灵巧抓取对于使机器人手能够使用工具和进行复杂操作至关重要,但进展受限于两个持续存在的瓶颈:大规模数据集的稀缺性和学习模型中缺乏集成的语义和几何推理。在本工作中,我们提出了CorDex框架,该框架能够从单一个人类演示生成的合成数据中稳健地学习新物体的功能性灵巧抓取。我们方法的核心是一个基于对应关系的数据引擎,该引擎在仿真中生成多样且高质量的训练数据。基于人类演示,数据引擎生成同一类别的多种物体实例,通过对应关系估计将专家抓取转移到生成的物体上,并通过优化进行抓取适应。基于生成的数据,我们引入了一个多模态预测网络,结合了视觉和几何信息。通过设计局部-全局融合模块和重要性感知采样机制,我们实现了功能性灵巧抓取的稳健且计算高效的预测。通过在各种物体类别上的广泛实验,我们证明了CorDex能够很好地泛化到未见过的物体实例,并显著优于最先进的基线。
Summary / 总结
The research addresses learning functional dexterous grasping from limited data by proposing CorDex, a framework that learns from synthetic data generated from a single human demonstration. Its correspondence-based data engine generates diverse object instances of the same category, transfers the expert grasp to them through correspondence estimation, and adapts the grasp through optimization. A multimodal prediction network then integrates visual and geometric information to predict grasps. Experiments show that CorDex significantly outperforms state-of-the-art baselines across various object categories and generalizes well to unseen instances.
该研究通过提出CorDex框架解决了学习功能性灵巧抓取的挑战,该框架从单个人类演示中生成多样化的训练数据。方法使用基于对应关系的数据引擎生成高质量的合成数据,通过对应关系估计将专家抓取转移,并通过优化进行适应。多模态预测网络整合视觉和几何信息以预测功能性抓取。实验表明,CorDex在未见过的对象上表现出良好的泛化能力并优于现有方法。
Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation
Authors: Emerson P. Grabke, Babak Taati, Masoom A. Haider
First: 2025-06-11T23:12:48+00:00 · Latest: 2026-01-08T18:59:27+00:00
Comments: Accepted for publication in IEEE Transactions on Biomedical Engineering, 2025. This is the accepted author version. The final published version is available at https://doi.org/10.1109/TBME.2025.3648426
Abstract
Objective: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM strategies typically rely on short-prompt text encoders, nonmedical LDMs, or large data volumes. These strategies can limit performance and scientific accessibility. We propose a novel LDM conditioning approach to address these limitations. Methods: We propose Class-Conditioned Efficient Large Language model Adapter (CCELLA), a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. We also propose a data-efficient LDM pipeline centered around CCELLA and a proposed joint loss function. We first evaluate our method on 3D prostate MRI against state-of-the-art. We then augment a downstream classifier model training dataset with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.070. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method during training improves classifier accuracy from 69% to 74% and outperforms classifiers trained on images generated by prior state-of-the-art. Classifier training solely on our method's synthetic images achieved comparable performance to real image training. Conclusion: We show that our method improved both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The proposed CCELLA-centric pipeline enables radiology report and class-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility.
中文标题/摘要
标题:利用临床文本和类别调节生成3D前列腺MRI
目标:潜在扩散模型(LDM)可以缓解医学成像领域机器学习开发中的数据稀缺挑战。然而,医学LDM策略通常依赖于简短提示文本编码器、非医学LDM或大量数据。这些策略可能限制性能和科学可访问性。我们提出了一种新的LDM调节方法来解决这些限制。方法:我们提出了类别调节高效大型语言模型适配器(CCELLA),这是一种新颖的双头调节方法,同时用自由文本临床报告和放射学分类调节LDM U-Net。我们还提出了一种以CCELLA为中心的数据高效LDM管道和一个提出的联合损失函数。我们首先在3D前列腺MRI上评估了我们的方法,与最先进的方法进行了比较。然后,我们使用我们方法生成的合成图像增强了下游分类器模型训练数据集。结果:我们的方法在大小受限的3D前列腺MRI数据集上实现了0.025的3D FID分数,显著优于最近的基础模型,其FID为0.070。在训练前列腺癌预测分类器时,使用我们方法生成的合成图像进行训练,分类器的准确性从69%提高到74%,并优于使用先前最先进的方法生成的图像进行训练的分类器。仅使用我们方法生成的合成图像进行分类器训练,其性能与使用真实图像训练相当。结论:我们展示了我们的方法在使用有限数据和最少的人工注释的情况下,提高了合成图像质量和下游分类器性能。意义:提出的CCELLA为中心的管道能够在有限的数据量和人工数据注释的情况下,实现放射学报告和类别调节LDM训练,以生成高质量的医学图像,从而提高LDM性能和科学可访问性。
Summary / 总结
The research aims to address the data scarcity challenge in medical imaging by leveraging latent diffusion models (LDM) and proposing a novel conditioning approach called CCELLA. CCELLA conditions the LDM U-Net with both free-text clinical reports and radiology classification, and a data-efficient pipeline is developed. The method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, outperforming a recent foundation model. Additionally, synthetic images generated by the method improve the accuracy of a downstream classifier for prostate cancer prediction from 69% to 74%. The approach demonstrates improved performance with limited data and minimal human annotation.
研究旨在通过提出一种名为CCELLA的新颖潜扩散模型(LDM)条件化方法来解决医学成像中的数据稀缺问题。CCELLA同时使用自由文本临床报告和放射学分类对LDM U-Net进行条件化,并开发了一个数据高效的管道。该方法在3D FID得分上达到了0.025,显著优于最近的基础模型。此外,使用该方法生成的合成图像可以将前列腺癌预测下游分类器的准确性从69%提高到74%。该方法在有限数据和少量人工注释的情况下展示了改进的合成图像质量和分类器性能。
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
First: 2026-01-08T18:59:24+00:00 · Latest: 2026-01-08T18:59:24+00:00
Comments: NVIDIA-Tech Report
Abstract
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
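The collapse described in this abstract can be reproduced on a toy group of two rollouts with two rewards each. The sketch below is only an illustration under stated assumptions: the "coupled" variant sums the rewards of each rollout and normalizes the sums across the group, while the "decoupled" variant normalizes each reward across the group and then sums the normalized values. The function names, the population standard deviation, and the `eps` term are hypothetical choices, and the method in the paper may include further steps that the abstract does not describe.

```python
import math

def zscore(values, eps=1e-8):
    """Normalize a group of scalars to zero mean and unit (population) std."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]

def coupled_advantages(rewards_per_rollout):
    """Sum the rewards of each rollout, then normalize the sums across the group."""
    totals = [sum(r) for r in rewards_per_rollout]
    return zscore(totals)

def decoupled_advantages(rewards_per_rollout):
    """Normalize each reward across the group, then sum per rollout."""
    num_rewards = len(rewards_per_rollout[0])
    per_reward = [
        zscore([r[k] for r in rewards_per_rollout]) for k in range(num_rewards)
    ]
    return [sum(col[i] for col in per_reward) for i in range(len(rewards_per_rollout))]

# Two hypothetical groups: in group_b the second rollout also earns the second reward.
group_a = [(0.0, 0.0), (1.0, 0.0)]  # reward totals 0 and 1
group_b = [(0.0, 0.0), (1.0, 1.0)]  # reward totals 0 and 2

coupled_a = coupled_advantages(group_a)
coupled_b = coupled_advantages(group_b)
decoupled_a = decoupled_advantages(group_a)
decoupled_b = decoupled_advantages(group_b)

print("coupled:  ", coupled_a, coupled_b)
print("decoupled:", decoupled_a, decoupled_b)
```

In this toy case the coupled variant assigns both groups nearly the same advantages (about -1 and +1), so the extra reward earned in `group_b` is invisible to the training signal, whereas the decoupled variant yields about ±1 for `group_a` and ±2 for `group_b`.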
中文标题/摘要
标题:GDPO:组奖励-解耦归一化策略优化方法在多奖励RL优化中的应用
随着语言模型能力的不断增强,用户期望它们不仅能提供准确的响应,还能表现出与各种场景中不同人类偏好的一致行为。为了实现这一目标,强化学习(RL)管道开始采用多个奖励,每个奖励捕捉一种独特的偏好,以引导模型向这些期望的行为发展。然而,最近的工作在多奖励设置下默认使用组相对策略优化(GRPO)而没有对其适用性进行检查。本文展示了直接将GRPO应用于归一化不同的回放奖励组合会导致这些组合的优势值坍缩为相同的值,降低了训练信号的分辨率,导致次优收敛,在某些情况下甚至导致训练早期失败。我们随后引入了组奖励-解耦归一化策略优化(GDPO),这是一种新的策略优化方法,通过解耦个体奖励的归一化来解决这些问题,更忠实地保留它们的相对差异,从而实现更准确的多奖励优化,并且训练稳定性显著提高。我们通过工具调用、数学推理和编程推理三个任务将GDPO与GRPO进行了比较,评估了正确性指标(准确率、错误率)和约束遵守指标(格式、长度)。在所有设置中,GDPO始终优于GRPO,证明了其在多奖励强化学习优化中的有效性和普适性。
Summary / 总结
This paper shows that applying Group Relative Policy Optimization (GRPO) directly in multi-reward reinforcement learning can cause distinct rollout reward combinations to collapse into identical advantage values, which reduces the resolution of the training signal and leads to suboptimal convergence or early training failure. To resolve this, the authors propose Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the normalization of individual rewards, preserving their relative differences and improving training stability. GDPO outperforms GRPO across three tasks (tool calling, math reasoning, and coding reasoning) on both correctness and constraint adherence metrics.
本文探讨了在强化学习中使用多个奖励来引导语言模型实现期望行为的挑战。它指出了Group Relative Policy Optimization (GRPO)方法的问题,该方法可能导致不同的奖励值变得相同,从而导致训练效果不佳。为了解决这一问题,作者提出了Group reward-Decoupled Normalization Policy Optimization (GDPO)方法,该方法通过分离个体奖励的归一化,保留它们的相对差异,从而提高训练稳定性。GDPO在工具调用、数学推理和编程推理三个任务中,在正确性和约束遵守度指标上均优于GRPO。
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
Authors: Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang
First: 2026-01-08T18:59:22+00:00 · Latest: 2026-01-08T18:59:22+00:00
Abstract
The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.
中文标题/摘要
标题:RoboVIP:视觉身份提示增强的多视角视频生成与机器人操作
操作数据的多样性和数量对于训练有效的机器人策略至关重要。然而,由于硬件和物理设置的限制,收集大规模的现实世界操作数据在不同环境中难以扩展。近期的工作使用文本提示条件下的图像扩散模型来通过改变视觉观察中的背景和桌面物体来扩充操作数据。然而,这些方法往往忽视了由最先进的策略模型所需的多视角和时间上一致的观察需求。此外,仅靠文本提示无法可靠地指定场景设置。为了给扩散模型提供明确的视觉指导,我们引入了视觉身份提示,通过提供示例图像作为条件输入来引导生成所需的场景设置。为此,我们还构建了一个可扩展的流水线从大规模机器人数据集中整理视觉身份池。使用我们扩充的操作数据来训练下游的视觉-语言-动作和视知觉运动策略模型,在仿真和真实机器人环境中均能获得一致的性能提升。
Summary / 总结
The research aims to enhance the diversity, quantity, and quality of manipulation data for training robot policies. It introduces RoboVIP, a method that uses visual identity prompting to generate multi-view and temporally coherent observations. This approach improves the performance of downstream vision-language-action and visuomotor policy models in both simulation and real-robot settings, addressing the limitations of previous text-prompt-based methods that often lack explicit visual guidance and temporal coherence.
论文旨在通过增强操作数据的多样性、数量和质量来提高机器人策略的训练效果。它引入了RoboVIP方法,利用视觉身份提示和多视角视频生成来扩充操作数据。该方法改进了之前的基于文本提示的图像扩散模型,通过提供显式的视觉指导确保多视角和时间上的一致性。实验结果表明,使用扩充后的数据训练视觉-语言-动作和视知觉运动策略模型时,在仿真和真实机器人环境中均能获得一致的性能提升。
Robust Reasoning as a Symmetry-Protected Topological Phase
Authors: Ilmo Sung
First: 2026-01-08T18:58:34+00:00 · Latest: 2026-01-08T18:58:34+00:00
Abstract
Large language models suffer from "hallucinations," logical inconsistencies induced by semantic noise. We propose that current architectures operate in a "Metric Phase," where causal order is vulnerable to spontaneous symmetry breaking. Here, we identify robust inference as an effective Symmetry-Protected Topological phase, where logical operations are formally isomorphic to non-Abelian anyon braiding, replacing fragile geometric interpolation with robust topological invariants. Empirically, we demonstrate a sharp topological phase transition: while Transformers and RNNs exhibit gapless decay, our Holonomic Network reveals a macroscopic "mass gap," maintaining invariant fidelity below a critical noise threshold. Furthermore, in a variable-binding task on $S_{10}$ ($3.6 \times 10^6$ states) representing symbolic manipulation, we demonstrate holonomic generalization: the topological model maintains perfect fidelity extrapolating $100\times$ beyond training ($L=50 \to 5000$), consistent with a theoretically indefinite causal horizon, whereas Transformers lose logical coherence. Ablation studies indicate this protection emerges strictly from non-Abelian gauge symmetry. This provides strong evidence for a new universality class for logical reasoning, linking causal stability to the topology of the semantic manifold.
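The figures quoted in this abstract for $S_{10}$ can be checked with a few lines of arithmetic. The sketch below only illustrates the symmetric group itself (its size, and the fact that composing permutations does not commute); it is not the Holonomic Network or the paper's task setup, and the helper `compose` and the two example permutations are hypothetical names chosen for illustration.

```python
import math

# The symmetric group S_10 has 10! elements, i.e. about 3.6e6 states.
num_states = math.factorial(10)

# Extrapolating from training length 50 to test length 5000 is a 100x ratio.
extrapolation_factor = 5000 // 50

def compose(p, q):
    """Return the permutation 'apply q first, then p', with permutations as tuples."""
    return tuple(p[q[i]] for i in range(len(q)))

identity = tuple(range(10))
swap_01 = (1, 0) + identity[2:]      # exchanges positions 0 and 1
swap_12 = (0, 2, 1) + identity[3:]   # exchanges positions 1 and 2

# S_10 is non-Abelian: the order of composition changes the result.
pq = compose(swap_01, swap_12)
qp = compose(swap_12, swap_01)

print(num_states, extrapolation_factor, pq != qp)
```

Because the group is non-Abelian, tracking the running product of a long sequence of such permutations requires remembering their order, which is one way to read the "variable-binding" framing above.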
中文标题/摘要
标题:稳健推理作为一种对称保护拓扑相
大型语言模型遭受“幻觉”——由语义噪声引起的逻辑不一致。我们提出当前架构处于“度量相”中,在这种相中因果顺序容易自发对称破缺。在此,我们将稳健推理识别为一种有效的对称保护拓扑相,在这种相中逻辑操作形式上等同于非阿贝尔任意子编织,用稳健的拓扑不变量取代脆弱的几何插值。实证上,我们展示了明显的拓扑相变:虽然变换器和RNN表现出无隙衰减,我们的本征网络揭示了宏观的“质量隙”,在临界噪声阈值以下保持不变的保真度。此外,在$S_{10}$(3.6×$10^6$状态)表示符号操作的变量绑定任务中,我们展示了本征泛化:拓扑模型在训练($L=50$)基础上外推100倍($5000$),保持完美保真度,这与理论上无限的因果视界一致,而变换器则失去逻辑连贯性。消融研究表明,这种保护严格源自非阿贝尔规范对称性。这为逻辑推理提供了一个新的普遍类,将因果稳定性与语义流形的拓扑学联系起来。
Summary / 总结
The research aims to address the issue of logical inconsistencies in large language models, known as hallucinations, by proposing a new architecture that operates in a Symmetry-Protected Topological phase. The method involves using a Holonomic Network, which is designed to maintain logical operations through robust topological invariants rather than geometric interpolation. Key findings include a sharp phase transition where the Holonomic Network shows a macroscopic 'mass gap,' maintaining fidelity below a critical noise threshold, while traditional models like Transformers and RNNs decay. Additionally, the Holonomic Network demonstrates holonomic generalization, maintaining perfect fidelity in a symbolic manipulation task with $S_{10}$, extrapolating 100 times beyond training, whereas Transformers lose logical coherence.
研究旨在解决大型语言模型中存在的逻辑不一致问题,即‘幻觉’。作者提出了一种新的架构——全同网络,使其在‘对称保护拓扑相’中运行,其中逻辑操作具有鲁棒性和对噪声的不变性。实验表明,与传统的模型如变换器和RNN相比,全同网络在噪声下仍能保持高保真度,并且可以超越训练数据进行泛化,显示出理论上的无限因果范围。
Measuring and Fostering Peace through Machine Learning and Artificial Intelligence
Authors: P. Gilda, P. Dungarwal, A. Thongkham, E. T. Ajayi, S. Choudhary, T. M. Terol, C. Lam, J. P. Araujo, M. McFadyen-Mungalln, L. S. Liebovitch, P. T. Coleman, H. West, K. Sieck, S. Carter
First: 2026-01-08T18:57:01+00:00 · Latest: 2026-01-08T18:57:01+00:00
Comments: 6 pages, 4 figures
Abstract
We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old daily view most of their news through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making viewers angry in order to engage them and increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long-term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.
中文标题/摘要
标题:通过机器学习和人工智能衡量与促进和平
我们使用机器学习和人工智能:1) 从新闻和社交媒体中衡量各国的和平水平;2) 开发在线工具以促进和平,帮助用户更好地理解自己的媒体消费。对于新闻媒体,我们使用神经网络从在线新闻来源的文本嵌入中衡量和平水平。该模型在训练于一个新闻媒体数据集后,也对分析另一个新闻数据集时表现出高准确性。对于社交媒体,如YouTube,我们开发了其他模型来衡量与和平相关的社会维度,使用了词级(GoEmotions)和上下文级(大型语言模型)方法。为了促进和平,我们注意到20-40岁人群中,71%的人每天主要通过社交媒体上的短视频获取新闻。这些视频内容创作者倾向于制作能够激发情绪、让你生气的视频以增加点击率。我们开发并测试了一个名为MirrorMirror的Chrome扩展程序,为YouTube观众提供他们正在观看的媒体的实时反馈,关于其和平程度。我们的长期目标是让MirrorMirror成为一个开源工具,供内容创作者、记者、研究人员、平台和个人用户更好地理解其媒体创作和消费的语气及其对观众的影响。我们希望超越简单的参与度指标,鼓励更加尊重、细致和信息丰富的交流。
Summary / 总结
This study uses machine learning and artificial intelligence to measure peace levels in countries from news and social media, and develops an online tool called MirrorMirror to promote peace by providing real-time feedback on the peacefulness of media content. The research finds that 71% of people aged 20-40 primarily consume news through short videos on social media, which often prioritize emotional engagement over peaceful content. The tool, MirrorMirror, aims to help users understand and improve the tone of their media consumption and creation, fostering more respectful and informative communication.
该研究利用机器学习和AI来从新闻和社交媒体中测量国家的和平水平,并开发工具来促进和平,通过分析媒体内容。对于新闻,使用了神经网络从文本嵌入中测量和平,显示了跨数据集的高准确性。对于社交媒体,开发了使用单词和上下文水平的模型来测量社会维度。一个名为MirrorMirror的Chrome扩展程序提供了实时反馈,显示用户正在观看的媒体内容的和平程度,旨在鼓励更加尊重和有信息量的沟通。关键发现包括模型的高准确性以及短视频在新闻消费中的重要作用,这些内容往往优先考虑情感参与而非促进和平的内容。
Learning Latent Action World Models In The Wild
Authors: Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat
First: 2026-01-08T18:55:39+00:00 · Latest: 2026-01-08T18:55:39+00:00
Comments: 37 pages, 25 figures
Abstract
Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates the learning of latent action models, which can learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of the challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. For example, we find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with similar performance as action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.
中文标题/摘要
标题:学习自然环境中的潜在动作世界模型
能够在现实世界中进行推理和规划的智能体需要预测其行为后果的能力。尽管世界模型具备这种能力,但它们通常需要行为标签,而这些标签在大规模应用中往往难以获取。这促使我们学习潜在动作模型,可以从视频中学习动作空间。我们的工作解决了在自然环境视频中学习潜在动作世界模型的问题,扩展了现有工作集中在简单机器人模拟、视频游戏或操作数据上的范围。虽然这使我们能够捕捉到更丰富的动作,但也带来了视频多样性带来的挑战,如环境噪声或视频间缺乏共同的实体。为应对部分挑战,我们讨论了动作应遵循的属性以及相关架构选择和评估。我们发现,连续但受限的潜在动作能够捕捉自然环境视频中动作的复杂性,而常见的向量量化无法做到这一点。例如,我们发现来自智能体(如人类进入房间)的环境变化可以在视频间转移,这突显了学习特定于自然环境视频的动作的能力。在视频间缺乏共同实体的情况下,我们主要能够学习在空间上局部化的潜在动作,相对于摄像机而言。尽管如此,我们能够训练一个控制器,将已知动作映射到潜在动作,使我们能够使用潜在动作作为通用接口,并使用世界模型解决规划任务,其性能与基于动作的基线相当。我们的分析和实验为将潜在动作模型扩展到现实世界提供了一步进展。
Summary / 总结
This work addresses the challenge of learning latent action models from in-the-wild videos, which require predicting the consequences of actions without explicit action labels. The method involves capturing richer actions from diverse videos while handling challenges like environmental noise and varying embodiments. Key findings include the ability to capture complex actions and transfer changes in the environment across videos, though actions are localized relative to the camera. A controller was trained to map known actions to latent ones, enabling the use of latent actions for planning tasks with comparable performance to action-conditioned baselines.
该研究旨在开发能够预测真实世界视频中动作后果的潜在动作世界模型,无需明确的动作标签。作者通过提出一个连续但受限的潜在动作模型来应对从多样化的在野视频中学习的挑战。关键发现包括能够捕捉复杂动作并跨视频转移环境变化,尽管动作相对于摄像头位置局部化。尽管如此,该模型仍能执行与动作条件基线相当性能的规划任务。
Non-Linear Scoring Model for Translation Quality Evaluation
Authors: Serge Gladkoff, Lifeng Han, Katerina Gasova
First: 2025-11-17T15:09:22+00:00 · Latest: 2026-01-08T18:51:57+00:00
Comments: ongoing work, 32 pages
Abstract
Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition.
Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size.
Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model
E(x) = a * ln(1 + b * x), a, b > 0,
anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added.
The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
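The calibration step mentioned in this abstract (fitting the two-parameter model E(x) = a * ln(1 + b * x) from two tolerance points with a one-dimensional root-finding step) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' procedure: it eliminates `a` by taking the ratio of the two tolerances, solves for `b` by bisection on a logarithmic scale, and then recovers `a`. The function names, the search bounds, and the example tolerance points are all hypothetical.

```python
import math

def tolerance(x, a, b):
    """Acceptable error count for a sample of x words under E(x) = a * ln(1 + b * x)."""
    return a * math.log1p(b * x)

def calibrate(x1, e1, x2, e2, iterations=200):
    """Fit (a, b) so that E(x1) = e1 and E(x2) = e2, assuming x1 < x2 and e1 < e2."""
    if not (0 < x1 < x2 and 0 < e1 < e2):
        raise ValueError("expected 0 < x1 < x2 and 0 < e1 < e2")
    target = e1 / e2

    def gap(b):
        # The ratio E(x1) / E(x2) does not depend on a, which leaves one unknown.
        return math.log1p(b * x1) / math.log1p(b * x2) - target

    lo, hi = 1e-12, 1e6  # assumed search bounds for b
    if gap(lo) >= 0 or gap(hi) <= 0:
        raise ValueError("points are not consistent with logarithmic growth in range")
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)  # bisect on a logarithmic scale
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    b = math.sqrt(lo * hi)
    a = e1 / math.log1p(b * x1)
    return a, b

# Illustrative tolerance points generated from known parameters (not data from the paper).
a_true, b_true = 10.0, 0.002
e_at_1000 = tolerance(1000, a_true, b_true)
e_at_5000 = tolerance(5000, a_true, b_true)

a_fit, b_fit = calibrate(1000, e_at_1000, 5000, e_at_5000)
print(a_fit, b_fit)
```

The ratio ln(1 + b * x1) / ln(1 + b * x2) increases monotonically with b from x1/x2 toward 1, so a solution exists only when the two tolerance points grow more slowly than linearly, which matches the premise of the model.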
中文标题/摘要
标题:翻译质量评估的非线性评分模型
基于多维质量度量(MQM)的分析性翻译质量评估(TQE)传统上使用线性误差到惩罚比例,该比例针对1000-2000词的参考样本进行校准。然而,线性外推会偏倚不同大小样本的判断,对短样本过度惩罚,对长样本则惩罚不足,导致与专家直觉不一致。
本文基于多范围框架,提出了一种校准的非线性评分模型,更好地反映了不同长度样本中人类内容消费者对翻译质量的感知。来自三个大型企业环境的实证数据表明,可接受的错误数量随样本大小呈对数增长,而非线性增长。
心理物理和认知证据,包括韦伯-费希纳定律和认知负荷理论,支持这一观点,解释了为什么额外错误的感知影响随规模增长而减弱,而认知负担则增加。我们提出一个两参数模型
E(x) = a * ln(1 + b * x),a, b > 0,
该模型以参考容忍度为锚点,并通过一个一维根查找步骤校准两个容忍度点。该模型在相对误差不超过±20%的区间内保持线性近似,并且只需添加动态容忍度函数即可与现有的评估工作流程集成。
该方法提高了对人类和AI生成翻译的解释性、公平性和评分者一致性。通过操作化一个感知上有效的评分范式,它推动了翻译质量评估向更准确和可扩展的评估迈进。该模型还为与人类判断一致的基于AI的文档级评估提供了更强的基础。讨论了CAT/LQA系统实施考虑和对人类和AI生成文本评估的影响。
Summary / 总结
This paper addresses the limitations of linear scoring models in Translation Quality Evaluation (TQE) by proposing a non-linear scoring model based on the Multi-Range framework. Empirical data from three large-scale enterprise environments indicate that acceptable error counts grow logarithmically with sample size, not linearly. The proposed model, E(x) = a * ln(1 + b * x), improves interpretability, fairness, and inter-rater reliability, and aligns better with human judgment. It also enhances the scalability of TQE for both human and AI-generated translations.
本文针对传统线性评分模型在翻译质量评估(TQE)中的局限性,提出了一个非线性评分模型。该模型基于Multi-Range框架,使用两个参数的函数E(x) = a * ln(1 + b * x)更好地反映不同样本大小下的人类感知翻译质量。实证数据和心理理论支持可接受错误数量随样本大小呈对数增长。该模型提高了可解释性、公平性和评分者间的一致性,并以最小更改集成到现有的评估工作流中。
MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents
Authors: Tamil Sudaravan Mohan Doss, Michael Xu, Sudha Rao, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel
First: 2026-01-08T18:39:52+00:00 · Latest: 2026-01-08T18:39:52+00:00
Abstract
We present MineNPC-Task, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world Minecraft. Rather than relying on synthetic prompts, tasks are elicited from formative and summative co-play with expert players, normalized into parametric templates with explicit preconditions and dependency structure, and paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan/act/memory events (including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts) and reports outcomes relative to the total number of attempted subtasks, derived from in-world evidence.
As an initial snapshot, we instantiate the framework with GPT-4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory/tool handling, referencing, and navigation, alongside recoveries supported by mixed-initiative clarifications and lightweight memory. Participants rated interaction quality and interface usability positively, while highlighting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and harness to support transparent, reproducible evaluation of future memory-aware embodied agents.
中文标题/摘要
标题:MineNPC-Task:面向记忆意识Minecraft代理的任务套件
我们提出了\textsc{MineNPC-Task},一种用户编写的基准测试和评估框架,用于测试开放世界\emph{Minecraft}中的记忆意识、混合主动性LLM代理。该框架不依赖于合成提示,而是从与专家玩家的形成性和总结性共玩中引出任务,将其规范化为具有显式先决条件和依赖结构的参数化模板,并配以在有限知识政策下的机器可验证验证器,该政策禁止世界外的捷径。该框架捕捉计划/行动/记忆事件,包括计划预览、目标澄清、记忆读写、先决条件检查和修复尝试,并根据尝试的子任务总数报告结果,这些结果源自于世内的证据。
作为初步快照,我们使用GPT-4o实例化了该框架,并在\textbf{8}名经验丰富的玩家中评估了\textbf{216}个子任务。我们观察到代码执行、库存/工具处理、引用和导航中的反复出现的故障模式,以及由混合主动性澄清和轻量级记忆支持的恢复。参与者对交互质量和界面易用性给予了积极评价,同时指出了需要更强的记忆持久性以跨越任务。我们发布了完整的任务套件、验证器、日志和框架,以支持未来记忆意识实体代理的透明、可重复评估。
Summary / 总结
The research introduces MineNPC-Task, a benchmark and evaluation harness for memory-aware, mixed-initiative LLM agents in Minecraft. Tasks are elicited from co-play with expert players and structured into parametric templates with explicit preconditions and dependencies, paired with machine-checkable validators. The harness captures detailed events such as plans, clarifications, and memory reads and writes. An initial evaluation with GPT-4o on 216 subtasks across 8 players revealed recurring breakdowns in code execution, inventory/tool handling, referencing, and navigation, while participants rated interaction quality and usability positively. The study highlights the need for stronger memory persistence across tasks and releases the task suite, validators, logs, and harness for future research.
研究介绍了MineNPC-Task,一个用于测试记忆感知LLM代理在Minecraft中的基准。任务源自专家共玩,并被结构化为具有明确先决条件的参数化模板。评估框架捕获详细的记忆事件并报告结果。初步评估使用GPT-4o与8名玩家合作,揭示了代码执行、库存处理和导航等方面的重复问题,交互质量得到了积极反馈,但需要更强的记忆持久性。任务套件及相关材料已公开发布,以支持未来研究的透明和可重复性评价。
Internal Representations as Indicators of Hallucinations in Agent Tool Selection
Authors: Kait Healy, Bharathi Srinivasan, Visakh Madathil, Jing Wu
First: 2026-01-08T18:38:45+00:00 · Latest: 2026-01-08T18:38:45+00:00
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in tool calling and tool usage, but suffer from hallucinations where they choose incorrect tools, provide malformed parameters, and exhibit 'tool bypass' behavior by performing simulations and generating outputs instead of invoking specialized tools or external systems. This undermines the reliability of LLM-based agents in production systems, as it leads to inconsistent results and bypasses security and audit controls. Such hallucinations in agent tool selection require early detection and error handling. Unlike existing hallucination detection methods that require multiple forward passes or external validation, we present a computationally efficient framework that detects tool-calling hallucinations in real time by leveraging LLMs' internal representations during the same forward pass used for generation. We evaluate this approach on reasoning tasks across multiple domains, demonstrating strong detection performance (up to 86.4% accuracy) while maintaining real-time inference capabilities with minimal computational overhead. The approach particularly excels at detecting parameter-level hallucinations and inappropriate tool selections, which are critical for reliable agent deployment.
中文标题/摘要
标题:代理工具选择中的内部表示作为幻觉指标
大型语言模型(LLMs)在工具调用和使用方面展现了显著的能力,但在选择错误工具、提供不正确的参数和通过模拟和生成输出而不是调用专门工具或外部系统来绕过工具使用方面存在幻觉问题。这削弱了基于LLM的代理在生产系统中的可靠性,导致结果不一致,并绕过了安全和审计控制。代理工具选择中的这种幻觉需要早期检测和错误处理。不同于现有的需要多次前向传递或外部验证的幻觉检测方法,我们提出了一种计算效率高的框架,通过利用LLM在生成过程中同一前向传递期间的内部表示来实时检测调用工具的幻觉。我们在多个领域的推理任务上评估了这种方法,展示了强大的检测性能(最高可达86.4%的准确率),同时保持了实时推理能力,计算开销最小,特别擅长检测参数级幻觉和不适当工具选择,这对于可靠的代理部署至关重要。
Summary / 总结
The paper addresses the issue of hallucinations in Large Language Models (LLMs) when selecting tools, which can lead to unreliable results and bypass security controls. It introduces a computationally efficient framework that detects these hallucinations in real-time by analyzing the LLM's internal representations during the same forward pass used for generation. The method achieves up to 86.4% accuracy in detecting parameter-level hallucinations and inappropriate tool selections, while maintaining real-time inference capabilities with minimal computational overhead.
研究旨在解决大型语言模型(LLMs)在工具选择中出现幻觉的问题,这可能导致结果不可靠并绕过安全控制。该研究提出了一种计算效率高的框架,通过分析LLMs在生成过程中同一前向传递期间的内部表示来实现实时幻觉检测。该方法在参数级幻觉和不适当工具选择的检测方面达到了86.4%的准确率,同时保持了实时推理能力,并且计算开销很小。
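The abstract above describes detecting tool-calling hallucinations from internal representations, but it does not specify the detector. The following is a minimal sketch under the assumption that a simple logistic-regression probe is trained on hidden-state vectors. The synthetic 16-dimensional "hidden states", the function names `train_probe` and `hallucination_score`, and all hyperparameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    # Clip logits so np.exp never overflows.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_probe(features, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        grad = sigmoid(features @ w + b) - labels
        w -= lr * (features.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def hallucination_score(hidden_state, w, b):
    """Probability-like score that a tool call is hallucinated."""
    return sigmoid(hidden_state @ w + b)

# Synthetic stand-in for hidden states (assumption: real ones would be
# read from the model during the same forward pass used for generation).
rng = np.random.default_rng(0)
correct_calls = rng.normal(-1.0, 1.0, size=(200, 16))
hallucinated_calls = rng.normal(1.0, 1.0, size=(200, 16))
X = np.vstack([correct_calls, hallucinated_calls])
y = np.concatenate([np.zeros(200), np.ones(200)])

w, b = train_probe(X, y)
accuracy = ((hallucination_score(X, w, b) > 0.5) == y).mean()
```

Because the probe is a single dot product per generated call, scoring adds very little work on top of generation, which is the property the abstract emphasizes.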
Belief Is All You Need: Modeling Narrative Archetypes in Conspiratorial Discourse
Authors: Soorya Ram Shimgekar, Abhay Goyal, Roy Ka-Wei Lee, Koustuv Saha, Pi Zonooz, Navin Kumar
First: 2025-12-10T21:51:16+00:00 · Latest: 2026-01-08T18:34:35+00:00
Abstract
Conspiratorial discourse is increasingly embedded within digital communication ecosystems, yet its structure and spread remain difficult to study. This work analyzes conspiratorial narratives in Singapore-based Telegram groups, showing that such content is woven into everyday discussions rather than confined to isolated echo chambers. We propose a two-stage computational framework. First, we fine-tune RoBERTa-large to classify messages as conspiratorial or not, achieving an F1-score of 0.866 on 2,000 expert-labeled messages. Second, we build a signed belief graph in which nodes represent messages and edge signs reflect alignment in belief labels, weighted by textual similarity. We introduce a Signed Belief Graph Neural Network (SiBeGNN) that uses a Sign Disentanglement Loss to learn embeddings that separate ideological alignment from stylistic features.
Using hierarchical clustering on these embeddings, we identify seven narrative archetypes across 553,648 messages: legal topics, medical concerns, media discussions, finance, contradictions in authority, group moderation, and general chat. SiBeGNN yields stronger clustering quality (cDBI = 8.38) than baseline methods (13.60 to 67.27), supported by 88 percent inter-rater agreement in expert evaluations. Our analysis shows that conspiratorial messages appear not only in clusters focused on skepticism or distrust, but also within routine discussions of finance, law, and everyday matters. These findings challenge common assumptions about online radicalization by demonstrating that conspiratorial discourse operates within ordinary social interaction. The proposed framework advances computational methods for belief-driven discourse analysis and offers applications for stance detection, political communication studies, and content moderation policy.
中文标题/摘要
标题:信念即足矣:建模阴谋论话语中的叙事原型
阴谋论话语越来越多地嵌入数字通信生态系统中,但其结构和传播仍然难以研究。本研究分析了新加坡Telegram群组中的阴谋论叙述,表明此类内容融入了日常讨论,而非局限于孤立的回声室中。我们提出了一种两阶段的计算框架。首先,我们对RoBERTa-large进行微调,以分类信息为阴谋论或非阴谋论,对2,000条专家标注的信息达到0.866的F1分数。其次,我们构建了一个带符号的信念图,在该图中,节点代表信息,边的符号反映信念标签的一致性,并根据文本相似度加权。我们引入了一种带符号信念图神经网络(SiBeGNN),使用符号解纠缠损失来学习将意识形态一致性与风格特征分离的嵌入。通过这些嵌入进行层次聚类,我们识别出553,648条信息中的七个叙事原型:法律主题、医疗关切、媒体讨论、金融、权威矛盾、群体管理以及一般聊天。SiBeGNN的聚类质量(cDBI = 8.38)优于基线方法(13.60到67.27),并得到88%的专家评价的一致性支持。我们的分析表明,阴谋论信息不仅出现在关注怀疑或不信任的聚类中,还出现在金融、法律和日常事务的常规讨论中。这些发现挑战了关于在线激进化的一些常见假设,表明阴谋论话语在普通社会互动中运作。所提出的方法推进了信念驱动话语分析的计算方法,并为立场检测、政治传播研究和内容审核政策提供了应用。
Summary / 总结
This study examines conspiratorial narratives in Singapore-based Telegram groups, showing that such content is integrated into everyday discussions. A two-stage computational framework is proposed: first, RoBERTa-large is fine-tuned for classifying messages, achieving an F1-score of 0.866. Second, a Signed Belief Graph Neural Network (SiBeGNN) is developed to identify seven narrative archetypes, including legal topics, medical concerns, and finance, from 553,648 messages. SiBeGNN outperforms baseline methods in clustering quality and demonstrates that conspiratorial discourse occurs in various social contexts, challenging the notion of isolated echo chambers.
该研究分析了新加坡Telegram群组中的阴谋论内容,发现此类内容被整合到日常讨论中。提出了一种两阶段计算框架,首先使用RoBERTa-large分类消息,F1分为0.866,然后使用SiBeGNN构建带符号的信任图来识别七个叙事原型。该框架的聚类质量优于基线方法,并揭示了阴谋论信息出现在各种情境中,挑战了孤立回音室的观念。
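The signed belief graph described in the abstract above (nodes are messages, edge signs reflect agreement in belief labels, weights come from textual similarity) can be illustrated with a small sketch. The abstract does not name the similarity measure, so Jaccard token overlap is used here purely as an assumption. The function names, the `min_sim` cutoff, and the toy messages are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Token-overlap similarity between two messages (assumed measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def build_signed_graph(messages, labels, min_sim=0.2):
    """Return {(i, j): signed_weight} for sufficiently similar message pairs."""
    edges = {}
    for i, j in combinations(range(len(messages)), 2):
        sim = jaccard(messages[i], messages[j])
        if sim < min_sim:
            continue  # skip pairs that share too little text
        sign = 1 if labels[i] == labels[j] else -1
        edges[(i, j)] = sign * sim
    return edges

# Toy data: label 1 = conspiratorial, 0 = not (invented for illustration).
messages = [
    "the vaccine data is hidden",
    "the vaccine data is public",
    "lunch at the market today",
]
labels = [1, 0, 0]
edges = build_signed_graph(messages, labels)
```

In this toy case only the first two messages are connected, and the edge is negative because their belief labels disagree even though the wording is similar.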
From Policy to Logic for Efficient and Interpretable Coverage Assessment
Authors: Rhitabrat Pokharel, Hamid Reza Hassanzadeh, Ameeta Agrawal
Venue: AAAI 2026
First: 2026-01-03T19:24:51+00:00 · Latest: 2026-01-08T18:28:40+00:00
Comments: Accepted at AIMedHealth @ AAAI 2026
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in interpreting lengthy, complex legal and policy language. However, their reliability can be undermined by hallucinations and inconsistencies, particularly when analyzing subjective and nuanced documents. These challenges are especially critical in medical coverage policy review, where human experts must be able to rely on accurate information. In this paper, we present an approach designed to support human reviewers by making policy interpretation more efficient and interpretable. We introduce a methodology that pairs a coverage-aware retriever with symbolic rule-based reasoning to surface relevant policy language, organize it into explicit facts and rules, and generate auditable rationales. This hybrid system minimizes the number of LLM inferences required which reduces overall model cost. Notably, our approach achieves a 44% reduction in inference cost alongside a 4.5% improvement in F1 score, demonstrating both efficiency and effectiveness.
中文标题/摘要
标题:从政策到逻辑:高效可解释的覆盖评估
大型语言模型(LLMs)在解释长篇复杂的法律和政策语言方面表现出强大的能力。然而,它们的可靠性可能会因幻觉和不一致而受到损害,特别是在分析主观和细腻的文件时。这些挑战在医疗覆盖政策审查中尤为关键,因为人类专家必须依赖准确的信息。在本文中,我们提出了一种方法,旨在通过使政策解释更高效和可解释来支持人类审查员。我们介绍了一种方法,该方法将覆盖感知检索器与符号规则推理相结合,以突出显示相关政策语言,将其组织成明确的事实和规则,并生成可审计的理由。这种混合系统减少了所需的LLM推理次数,从而降低了整体模型成本。值得注意的是,我们的方法在推理成本上减少了44%,F1分数提高了4.5%,既提高了效率又提高了效果。
Summary / 总结
This paper addresses the challenges of interpreting complex medical coverage policies using Large Language Models (LLMs), which can suffer from hallucinations and inconsistencies. To enhance efficiency and interpretability, the authors propose a hybrid system combining a coverage-aware retriever and symbolic rule-based reasoning. This approach reduces the number of LLM inferences by 44%, lowering overall model costs, while also improving the F1 score by 4.5%.
本文针对使用大型语言模型(LLMs)解释复杂的医疗覆盖政策时可能出现的幻觉和不一致性问题进行了研究。为了支持人类审查员,作者提出了一种结合覆盖感知检索器和符号规则推理的混合系统。该方法通过减少44%的LLM推理次数,降低了整体模型成本,同时将F1分数提高了4.5%。
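To make the idea of organizing policy language into explicit facts and rules with auditable rationales more concrete, here is a minimal sketch. It assumes a rule is just a named list of required conditions and that facts are boolean flags. The rule name, condition names, and the `assess_coverage` function are hypothetical and do not come from the paper.

```python
def assess_coverage(rule, facts):
    """Check one rule against extracted facts; return (decision, rationale)."""
    unmet = [cond for cond in rule["requires"] if not facts.get(cond, False)]
    if unmet:
        return False, f"{rule['name']}: not met, missing {', '.join(unmet)}"
    return True, f"{rule['name']}: met, all of {', '.join(rule['requires'])} hold"

# Hypothetical rule and facts, invented for illustration only.
rule = {
    "name": "example_procedure_coverage",
    "requires": ["diagnosis_confirmed", "prior_therapy_failed"],
}
facts = {"diagnosis_confirmed": True, "prior_therapy_failed": False}

covered, rationale = assess_coverage(rule, facts)
```

Once facts are extracted, the rule check itself needs no further LLM call, which is one plausible way a hybrid system could reduce inference cost.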
Stock Market Price Prediction using Neural Prophet with Deep Neural Network
Authors: Navin Chhibber, Suneel Khemka, Navneet Kumar Tyagi, Rohit Tewari, Bireswar Banerjee, Piyush Ranjan
First: 2026-01-08T18:24:22+00:00 · Latest: 2026-01-08T18:24:22+00:00
Abstract
Stock market price prediction is a significant interdisciplinary research domain that lies at the intersection of finance, statistics, and economics. Accurately predicting stock prices has always been a focal point for researchers. However, existing statistical approaches for time-series prediction often fail to effectively forecast the probability range of future stock prices. Hence, to solve this problem, the Neural Prophet with a Deep Neural Network (NP-DNN) is proposed to predict stock market prices. The preprocessing technique used in this research is Z-score normalization, which normalizes stock price data by removing scale differences, making patterns easier to detect. Missing value imputation fills gaps in historical data, enhancing the model's use of complete information for more accurate predictions. The Multi-Layer Perceptron (MLP) learns complex nonlinear relationships among stock market prices and extracts hidden patterns from the input data, thereby creating meaningful feature representations for better prediction accuracy. The proposed NP-DNN model achieved an accuracy of 99.21% compared with other approaches using the Fused Large Language Model. Keywords: deep neural network, forecasting stock prices, multi-layer perceptron, neural prophet, stock market price prediction.
中文标题/摘要
标题:使用深度神经网络的神经先知进行股票市场价格预测
股票市场价格预测是金融、统计和经济学交叉领域的显著研究领域。准确预测股票价格一直是各种研究人员的关注点。然而,现有的时间序列预测统计方法往往无法有效预测未来股票价格的概率范围。因此,为了解决这个问题,提出了使用深度神经网络的神经先知(NP-DNN)来预测股票市场价格。本研究中使用的预处理技术是Z分数标准化,通过消除数据的尺度差异,使模式更容易被检测。缺失值填充填补了历史数据中的空白,增强了模型使用完整信息进行更准确预测的能力。多层感知机(MLP)学习股票市场价格之间的复杂非线性关系,从输入数据中提取隐藏模式,从而创建更有意义的特征表示,以提高预测准确性。提出的NP-DNN模型的准确率为99.21%,与其他使用融合大型语言模型的方法相比。关键词:深度神经网络,预测股票价格,多层感知机,神经先知,股票市场价格预测。
Summary / 总结
The research aims to improve the accuracy of stock market price prediction by proposing a Neural Prophet with a Deep Neural Network (NP-DNN) model. The method involves preprocessing techniques such as Z-score normalization and missing value imputation, followed by training a Multi-Layer Perceptron (MLP) to learn complex relationships in the data. The NP-DNN model achieved an accuracy of 99.21%, outperforming other approaches in forecasting stock prices.
研究旨在通过提出一种神经先知与深度神经网络(NP-DNN)模型来提高股票市场价格预测的准确性。方法包括使用Z-分数标准化进行数据预处理和缺失值填充以处理不完整的历史数据。多层感知器(MLP)用于学习数据中的复杂非线性关系。所提出的模型在预测股票价格方面的准确率为99.21%,优于其他方法。
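The two preprocessing steps named in the abstract above, missing value imputation and Z-score normalization, can be sketched as follows. The abstract does not say which imputation rule is used, so forward-filling is an assumption here. The function names and the toy price series are hypothetical.

```python
import numpy as np

def forward_fill(prices):
    """Replace NaN gaps with the previous observed value (assumed rule)."""
    filled = np.array(prices, dtype=float)
    for i in range(len(filled)):
        if np.isnan(filled[i]):
            if i > 0:
                filled[i] = filled[i - 1]
            else:
                # Leading gap: fall back to the first observed value.
                filled[i] = filled[~np.isnan(filled)][0]
    return filled

def z_score(values):
    """Rescale to zero mean and unit standard deviation."""
    mean, std = values.mean(), values.std()
    if std == 0:
        return np.zeros_like(values)
    return (values - mean) / std

# Toy closing prices with one missing day (invented for illustration).
prices = [100.0, float("nan"), 104.0, 102.0]
filled = forward_fill(prices)
normalized = z_score(filled)
```

Imputation runs first so that the mean and standard deviation are computed over a complete series.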
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald
First: 2026-01-08T18:23:03+00:00 · Latest: 2026-01-08T18:23:03+00:00
Abstract
Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
中文标题/摘要
标题:视觉语言模型中提示诱发幻觉的机制
大型视觉语言模型(VLMs)虽然功能强大,但常常倾向于根据文本提示而非视觉证据进行幻觉。我们在一个受控的物体计数设置中研究了这种失败模式,其中提示会夸大图像中的物体数量(例如,要求模型描述四朵水仙花,而实际上只有三朵)。在物体数量较少时,模型通常会纠正这种高估,但随着物体数量的增加,它们越来越倾向于遵循提示,无视差异。通过对三种VLMs的机制分析,我们发现一小组注意力头的消除可以显著减少提示诱发幻觉(PIH),至少降低40%且无需额外训练。在不同模型中,PIH头以特定方式介导提示复制。我们描述了这些差异,并表明PIH消除增加了对视觉证据的纠正。我们的研究结果提供了关于提示诱发幻觉内部机制的见解,揭示了这些行为在不同模型中的特定实现差异。
Summary / 总结
The study investigates how large vision-language models hallucinate by prioritizing textual prompts over visual evidence, particularly in an object-counting task where prompts overstate the number of objects. As the number of objects increases, models increasingly conform to the prompt. By analyzing three VLMs, the researchers identify specific attention heads that, when removed, significantly reduce prompt-induced hallucinations by at least 40% without additional training. The findings highlight model-specific differences in how these behaviors are implemented and suggest that ablation of these heads improves alignment with visual evidence.
研究探讨了视觉语言模型(VLMs)如何基于文本提示而非视觉证据产生幻觉。通过改变图像中的物体数量,研究人员发现,随着物体数量的增加,模型越来越倾向于遵循提示。通过消除一小组注意力头,可以显著减少提示诱导的幻觉至少40%,且无需额外训练。研究结果表明,这些注意力头对于提示复制至关重要,消除它们可以提高与视觉证据的一致性。
An interpretable data-driven approach to optimizing clinical fall risk assessment
Authors: Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Holley Farley, Kimia Ghobadi
First: 2026-01-08T18:17:31+00:00 · Latest: 2026-01-08T18:17:31+00:00
Comments: arXiv admin note: substantial text overlap with arXiv:2510.20714
Abstract
In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models to reweight the JHFRAT scoring weights, while preserving its additive structure and clinical thresholds. Recalibration refers to adjusting item weights so that the resulting score can order encounters more consistently by the study's risk labels, without changing the tool's form factor or deployment workflow. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). This performance improvement translates to protecting an additional 35 high-risk patients per week across the Johns Hopkins Health System. The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost) improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labeling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.
中文标题/摘要
标题:一种可解释的数据驱动方法以优化临床跌倒风险评估
在本研究中,我们旨在通过数据驱动建模方法更好地使约翰霍普金斯跌倒风险评估工具(JHFRAT)的跌倒风险预测与额外的临床有意义的指标相一致。我们对2022年3月至2023年10月期间约翰霍普金斯健康系统三家医院的54,209例住院病例进行了回顾性队列分析。共有20,208例住院病例被纳入高跌倒风险事件,13,941例被纳入低跌倒风险事件。为了整合临床知识并保持可解释性,我们使用了约束评分优化(CSO)模型重新加权JHFRAT评分权重,同时保持其加性结构和临床阈值。校准是指调整项目权重,使所得评分能够更一致地按研究的风险标签对事件进行排序,而不改变工具的形式因素或部署工作流程。该模型在预测性能上显著优于当前的JHFRAT(CSO AUC-ROC=0.91,JHFRAT AUC-ROC=0.86)。这种性能改进相当于每周为约翰霍普金斯健康系统保护额外的35名高风险患者。约束评分优化模型在有和没有EHR变量的情况下表现相似。尽管基准黑盒模型(XGBoost)在知识驱动的约束逻辑回归的基础上提高了性能指标(AUC-ROC=0.94),但CSO在风险标签变化方面表现出了更高的鲁棒性。这种基于证据的方法为医疗机构系统地增强住院跌倒预防协议和患者安全提供了坚实的基础,利用数据驱动优化技术,有助于改善风险评估和资源分配。
Summary / 总结
This study aims to improve the predictive performance of the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) by incorporating clinical knowledge and maintaining interpretability through constrained score optimization (CSO) models. A retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins hospitals showed that the CSO model significantly improved predictive performance (AUC-ROC=0.91) compared to the current JHFRAT (AUC-ROC=0.86), protecting an additional 35 high-risk patients per week. The CSO models performed similarly with and without electronic health record (EHR) variables, demonstrating robustness to variations in risk labeling.
本研究旨在通过约束分数优化(CSO)方法将临床有意义的指标纳入约翰霍普金斯跌倒风险评估工具(JHFRAT),以提高其预测性能。使用54,209名住院患者的资料,CSO模型重新加权了JHFRAT的评分权重,使其预测性能显著提高(AUC-ROC=0.91),优于原始JHFRAT(AUC-ROC=0.86)。这一改进每周可额外保护35名高风险患者。CSO模型即使不使用电子健康记录变量也表现出稳健性能,并且在风险标签变化时比基准黑盒模型(XGBoost)更具鲁棒性。
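The abstract above describes reweighting an additive score so that it orders encounters more consistently by risk label, measured with AUC-ROC. The sketch below shows only that mechanism on toy data. The item names, both weight sets, and the four encounters are invented placeholders, not JHFRAT items or study data, and no optimization is performed here.

```python
def additive_score(items, weights):
    """Sum of item values times item weights (the additive structure)."""
    return sum(weights[name] * value for name, value in items.items())

def auc_roc(scores, labels):
    """AUC as the fraction of (high-risk, low-risk) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical encounters: (item values, risk label), invented for illustration.
encounters = [
    ({"item_a": 1, "item_b": 1}, 1),
    ({"item_a": 1, "item_b": 0}, 0),
    ({"item_a": 0, "item_b": 1}, 1),
    ({"item_a": 0, "item_b": 0}, 0),
]
labels = [label for _, label in encounters]

original_weights = {"item_a": 2, "item_b": 1}    # placeholder weights
reweighted = {"item_a": 1, "item_b": 2}          # placeholder weights

auc_original = auc_roc([additive_score(x, original_weights) for x, _ in encounters], labels)
auc_reweighted = auc_roc([additive_score(x, reweighted) for x, _ in encounters], labels)
```

The score stays a weighted sum in both cases; only the weights change, which is why the tool's form and workflow can remain the same.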
LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation
Authors: Samy Haffoudhi, Fabian M. Suchanek, Nils Holzenberger
First: 2026-01-08T18:15:34+00:00 · Latest: 2026-01-08T18:15:34+00:00
Abstract
Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.
中文标题/摘要
标题:LELA:基于LLM的零样本领域自适应实体链接方法
实体链接(将文本中含糊的提及映射到知识库中的实体)是知识图谱构建、问答和信息提取等任务中的一个基础步骤。我们的方法LELA是一种模块化的粗细结合方法,利用了大型语言模型(LLM)的能力,并且可以在不同的目标领域、知识库和LLM上工作,无需任何微调阶段。我们在各种实体链接设置下的实验表明,LELA在与微调方法的竞争中表现出色,并且显著优于未微调的方法。
Summary / 总结
LELA is a modular entity linking approach that uses large language models and does not require fine-tuning, making it adaptable to different domains and knowledge bases. Experiments show that LELA performs competitively with fine-tuned approaches and outperforms non-fine-tuned methods across various settings.
LELA 是一种模块化的实体链接方法,利用大型语言模型且不需要微调,使其能够适应不同的领域和知识库。实验表明,LELA 在各种设置中与微调方法竞争,并且优于非微调方法。
Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable
Authors: Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam
First: 2026-01-08T18:13:46+00:00 · Latest: 2026-01-08T18:13:46+00:00
Abstract
When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography. Why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The decision happens in under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while keeping 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines.
中文标题/摘要
标题:降低AI研究成本:任务感知压缩如何使大型语言模型代理负担得起
当研究人员使用大型语言模型进行自主任务,如文献审查或生成假设时,计算费用会迅速增加。使用一个700亿参数模型的一次研究会话可能需要大约127美元的云费用,使这些工具对许多学术实验室来说遥不可及。我们开发了AgentCompress来直接解决这个问题。核心思想源于我们在工作中的一个简单观察:撰写新的假设比重新格式化参考文献需要模型更多的能力。为什么这两个任务都应该以全精度运行?我们的系统使用一个小神经网络根据每个任务的开头词语来评估任务的难度,然后将其路由到一个适当压缩的模型变体。这个决定在不到一毫秒内完成。在四个科学领域的500个研究工作流中进行测试,我们将计算成本降低了68.3%,同时保持了96.2%的原始成功率。对于那些关注预算的实验室来说,这可能意味着能够在进行实验和坐观台之间做出选择
Summary / 总结
The research addresses the high computational costs associated with using large language models for autonomous tasks, which can reach around $127 per session with a 70-billion parameter model. To mitigate this, the authors developed AgentCompress, which uses a small neural network to assess the difficulty of incoming tasks based on their opening words and routes them to appropriately compressed model variants. This approach reduced compute costs by 68.3% while maintaining 96.2% of the original success rate across 500 research workflows in four scientific fields, making these tools more accessible to academic labs with limited budgets.
研究针对使用大型语言模型进行自主任务时高昂的计算成本问题,每次会话费用可能超过127美元。为解决这一问题,作者开发了AgentCompress,该系统通过一个小神经网络根据任务的初始内容评估其难度,并将其路由到适当压缩的模型中。这种方法在四个科学领域的500个工作流程中将计算成本降低了68.3%,同时保持了原成功率的96.2%,使得这些工具对于预算有限的学术实验室更具可访问性。
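The routing idea in the abstract above (estimate task difficulty from the opening words, then pick a compressed model variant) can be sketched as below. The paper uses a small neural network for the difficulty estimate. This sketch swaps in a keyword heuristic purely for illustration. The cue words, thresholds, variant names, and function names are all hypothetical.

```python
# Hypothetical cue words standing in for a learned difficulty predictor.
HARD_CUES = {"hypothesize", "propose", "derive", "design"}
EASY_CUES = {"reformat", "sort", "list", "convert"}

def estimate_difficulty(task, n_words=8):
    """Score difficulty in [0, 1] using only the first n_words of the task."""
    score = 0.5
    for word in task.lower().split()[:n_words]:
        word = word.strip(".,:;!?")
        if word in HARD_CUES:
            score += 0.3
        elif word in EASY_CUES:
            score -= 0.3
    return min(1.0, max(0.0, score))

def route(task):
    """Map estimated difficulty to a (hypothetical) model variant name."""
    difficulty = estimate_difficulty(task)
    if difficulty >= 0.7:
        return "full-precision"
    if difficulty >= 0.4:
        return "moderately-compressed"
    return "heavily-compressed"

easy_choice = route("Reformat the bibliography in APA style")
hard_choice = route("Propose a novel hypothesis about protein folding")
```

Looking at only the opening words keeps the routing decision cheap relative to running the task itself.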
SimuAgent: An LLM-Based Simulink Modeling Assistant Enhanced with Reinforcement Learning
Authors: Yanchang Liang, Xiaowei Zhao
First: 2026-01-08T18:10:35+00:00 · Latest: 2026-01-08T18:10:35+00:00
Abstract
Large language models (LLMs) have revolutionized text-based code automation, but their potential in graph-oriented engineering workflows remains under-explored. We introduce SimuAgent, an LLM-powered modeling and simulation agent tailored for Simulink. SimuAgent replaces verbose XML with a concise, dictionary-style Python representation, dramatically cutting token counts, improving interpretability, and enabling fast, in-process simulation. A lightweight plan-execute architecture, trained in two stages, equips the agent with both low-level tool skills and high-level design reasoning. To tackle sparse rewards in long-horizon tasks, we propose Reflection-GRPO (ReGRPO), which augments Group Relative Policy Optimization (GRPO) with self-reflection traces that supply rich intermediate feedback, accelerating convergence and boosting robustness. Experiments on SimuBench, our newly released benchmark comprising 5300 multi-domain modeling tasks, show that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o when evaluated with few-shot prompting on the same benchmark. Ablations confirm that the two-stage curriculum and abstract-reconstruct data augmentation further enhance generalization. SimuAgent trains and runs entirely on-premise with modest hardware, delivering a privacy-preserving, cost-effective solution for industrial model-driven engineering. SimuAgent bridges the gap between LLMs and graphical modeling environments, offering a practical solution for AI-assisted engineering design in industrial settings.
中文标题/摘要
标题:SimuAgent:基于LLM的Simulink建模助手,增强以强化学习
大型语言模型(LLMs)已经革新了基于文本的代码自动化,但在图形导向的工程工作流中的潜力尚未得到充分探索。我们介绍了SimuAgent,这是一种专为Simulink设计的LLM驱动的建模和仿真代理。SimuAgent用简洁的字典风格Python表示法取代了冗长的XML,大幅减少了标记数量,提高了可解释性,并允许快速、在线仿真。一种轻量级的计划-执行架构,经过两阶段训练,使代理具备了低级工具技能和高级设计推理能力。为应对长期任务中的稀疏奖励,我们提出了反思-GRPO(ReGRPO),它通过自我反思轨迹补充了组相对策略优化(GRPO),提供了丰富的中间反馈,加速了收敛并提高了鲁棒性。在我们新发布的包含5300个多领域建模任务的SimuBench基准测试上进行的实验表明,使用SimuAgent微调的Qwen2.5-7B模型比标准的强化学习基线收敛更快,建模精度更高,甚至在使用少量示例提示在相同基准测试上评估时,超过了GPT-4o。消融实验表明,两阶段课程和抽象重建数据增强进一步提高了泛化能力。SimuAgent完全在本地进行训练和运行,硬件要求较低,提供了一种保护隐私、成本效益高的工业模型驱动工程解决方案。SimuAgent在LLMs和图形建模环境之间架起了一座桥梁,为工业环境中的AI辅助工程设计提供了一个实用的解决方案。
Observations and Remedies for Large Language Model Bias in Self-Consuming Performative Loop
Authors: Yaxuan Wang, Zhongteng Cai, Yujia Bao, Xueru Zhang, Yang Liu
First: 2026-01-08T18:08:15+00:00 · Latest: 2026-01-08T18:08:15+00:00
Abstract
The rapid advancement of large language models (LLMs) has led to growing interest in using synthetic data to train future models. However, this creates a self-consuming retraining loop, where models are trained on their own outputs, which may cause performance drops and induce emerging biases. In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of Self-Consuming Performative Loop (SCPL) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops: the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.
中文标题/摘要
标题:大型语言模型偏见的观察与补救措施在自我消耗执行循环中的应用
大型语言模型(LLMs)的迅速发展引发了对使用合成数据进行未来模型训练的兴趣。然而,这导致了一个自我消耗的重新训练循环,模型在训练过程中使用自己的输出,可能导致性能下降并引发新的偏见。在实际应用中,之前部署的LLMs可能会影响它们生成的数据,形成一个由用户反馈驱动的动态系统。例如,如果模型持续未能满足某一用户群体的需求,那么来自该特定用户群体的数据收集量将会减少。在本研究中,我们提出了“自我消耗执行循环”(SCPL)的概念,并探讨了合成数据在这些动态迭代训练过程中如何塑造偏见的作用。这种受控的反馈机制是由于难以获取动态生产系统中的真实用户偏好数据,使我们能够以一种原则性的方式隔离和分析反馈驱动的偏见演变。我们关注两种类型的循环,包括典型的重新训练设置和增量微调设置,后者尚未得到充分探索。通过三个实际任务的实验,我们发现执行循环增加了偏好偏见并减少了差异偏见。我们设计了一种基于奖励的拒绝采样策略来减轻偏见,朝着更值得信赖的自我改进系统迈进。
Summary / 总结
This study investigates the self-consuming performative loop (SCPL) in large language models (LLMs), where models are trained on their own outputs, leading to performance drops and emerging biases. The research introduces a controlled setting to analyze feedback-driven bias evolution, focusing on retraining and incremental fine-tuning loops. Experiments on three real-world tasks show that the performative loop increases preference bias and decreases disparate bias. The study proposes a reward-based rejection sampling strategy to mitigate these biases, aiming for more trustworthy self-improving systems.
本研究探讨了大型语言模型(LLMs)在自消耗执行循环中出现的偏差问题,即模型在其自身输出上进行训练。研究引入了自消耗执行循环(SCPL)的概念,并探讨合成数据如何影响迭代训练过程中的偏差。实验证实在三个真实任务上的结果显示,执行循环增加了偏好偏差并减少了差异偏差。研究提出了一种基于奖励的拒绝采样策略来缓解这些偏差,旨在提高自我改进系统的可信度。
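The abstract above mentions a reward-based rejection sampling strategy without giving its details. As a generic illustration of the technique only, the sketch below filters synthetic candidates by a reward threshold before they would be reused for training. The reward values, threshold, and function name are hypothetical, and this is not claimed to be the authors' exact procedure.

```python
def rejection_sample(candidates, reward_fn, threshold):
    """Keep only candidates whose reward meets the threshold."""
    return [c for c in candidates if reward_fn(c) >= threshold]

# Toy candidates paired with precomputed rewards (invented for illustration;
# in practice a reward model would score each generated sample).
candidates = [("response A", 0.9), ("response B", 0.2), ("response C", 0.6)]

def stored_reward(candidate):
    return candidate[1]

accepted = rejection_sample(candidates, stored_reward, threshold=0.5)
accepted_texts = [text for text, _ in accepted]
```

Only the accepted samples would be fed back into the next training round, so low-reward outputs do not compound across iterations.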
Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
Authors: Ekaterina Fadeeva, Aleksandr Rubashevskii, Dzianis Piatrashyn, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, Maxim Panov
First: 2025-05-27T11:56:59+00:00 · Latest: 2026-01-08T18:06:58+00:00
Abstract
Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise from inaccuracies in the model's internal knowledge and the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually correct statements as hallucinations if they are not explicitly supported by the retrieval. In this paper, we introduce FRANQ, a new method for hallucination detection in RAG outputs. FRANQ applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. To evaluate FRANQ and competing UQ methods, we construct a new long-form question answering dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging cases. Extensive experiments across multiple datasets, tasks, and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing approaches.
Summary / 总结
The paper addresses the issue of hallucinations in Retrieval-Augmented Generation (RAG) outputs by introducing FRANQ, a method that uses distinct uncertainty quantification techniques to estimate factuality while considering faithfulness to the retrieved context. The authors evaluate FRANQ and other UQ methods on a new dataset annotated for both factuality and faithfulness, demonstrating that FRANQ provides more accurate detection of factual errors in RAG-generated responses than existing approaches.
论文通过引入FRANQ方法,使用不同的不确定性量化技术来估计事实性,同时考虑检索上下文的忠实性,来解决RAG输出中的幻觉问题。作者在新构建的标注了事实性和忠实性的长形式问答数据集上评估了FRANQ和其他不确定性量化方法,结果显示FRANQ在检测RAG生成响应中的事实错误方面比现有方法更准确。
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Authors: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
First: 2026-01-08T18:00:59+00:00 · Latest: 2026-01-08T18:00:59+00:00
Comments: Project page: https://ivul-kaust.github.io/projects/videoauto-r1/
Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
中文标题/摘要
标题:VideoAuto-R1:通过一次思考,两次回答进行视频自动推理
链式思考(CoT)推理已成为多模态大型语言模型在视频理解任务中的一种强大工具。然而,其必要性及其与直接回答相比的优势尚未得到充分探索。在本文中,我们首先证明,对于通过强化学习训练的视频模型,直接回答往往能够匹配甚至超越CoT的性能,尽管CoT以更高的计算成本生成逐步分析。受此启发,我们提出了一种VideoAuto-R1视频理解框架,采用一种必要时才推理的策略。在训练过程中,我们的方法遵循一次思考,两次回答的模式:模型首先生成一个初始答案,然后进行推理,最后输出一个审查后的答案。两个答案都通过可验证的奖励进行监督。在推理过程中,模型使用初始答案的置信度分数来决定是否进行推理。在视频问答和定位基准测试中,VideoAuto-R1实现了最先进的准确率,显著提高了效率,平均响应长度减少了约3.3倍,例如,从149个词减少到仅44个词。此外,我们观察到,在感知导向的任务中,推理模式的激活率较低,而在推理密集型任务中,激活率较高。这表明显式的基于语言的推理通常是有益的,但并非总是必要的。
Summary / 总结
The paper explores the necessity of chain-of-thought (CoT) reasoning in video understanding tasks and introduces VideoAuto-R1, a framework that reasons only when necessary. During training, VideoAuto-R1 follows a Thinking Once, Answering Twice paradigm, generating an initial answer, performing reasoning, and then outputting a reviewed answer. This approach achieves state-of-the-art accuracy while significantly improving efficiency, reducing response length by 3.3x. The framework shows that reasoning is generally beneficial but not always required, especially on perception-oriented tasks.
论文探讨了链式思考(CoT)推理在视频理解任务中的必要性,并提出了VideoAuto-R1框架,该框架仅在必要时进行推理。在训练过程中,模型生成初始答案,进行推理,并输出审查后的答案,两者都由可验证的奖励监督。在推理过程中,模型根据初始答案的置信度决定是否进行推理。VideoAuto-R1实现了最先进的准确率,效率显著提高,响应长度减少了3.3倍。该框架表明,推理通常是有益的,但在某些任务中并非总是必要的。
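The inference-time behavior described in the abstract above (use the confidence of the initial answer to decide whether to reason) can be sketched as a simple gate. The callables, the 0.8 threshold, and the stub model functions are hypothetical stand-ins. The abstract does not state how confidence is computed or which threshold is used.

```python
def answer_with_optional_reasoning(question, direct_answer_fn, reasoning_fn, threshold=0.8):
    """Return (answer, used_reasoning); reason only when confidence is low."""
    answer, confidence = direct_answer_fn(question)
    if confidence >= threshold:
        return answer, False
    return reasoning_fn(question, answer), True

# Stub model calls, invented for illustration.
def stub_direct(question):
    # Pretend perception-style questions are answered confidently.
    if question.startswith("What color"):
        return "red", 0.95
    return "unsure", 0.40

def stub_reasoning(question, initial_answer):
    return f"reviewed({initial_answer})"

easy = answer_with_optional_reasoning("What color is the car?", stub_direct, stub_reasoning)
hard = answer_with_optional_reasoning("Why did the car stop?", stub_direct, stub_reasoning)
```

Skipping the reasoning step on confident answers is what shortens the average response, since the long step-by-step text is only generated when needed.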
FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts
Authors: Yiji Zhao, Zihao Zhong, Ao Wang, Haomin Wen, Ming Jin, Yuxuan Liang, Huaiyu Wan, Hao Wu
Venue: KDD 2026
First: 2026-01-08T18:00:58+00:00 · Latest: 2026-01-08T18:00:58+00:00
Comments: Accepted to KDD 2026
Abstract
Spatial-Temporal Graph (STG) forecasting on large-scale networks has garnered significant attention. However, existing models predominantly focus on short-horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long-horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity-aware Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting, which unlocks one-week-ahead (672 steps at a 15-minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self-attention modules when applied to large-scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real-world datasets demonstrate that FaST not only delivers superior long-horizon predictive accuracy but also achieves remarkable computational efficiency compared to state-of-the-art baselines. Our source code is available at: https://github.com/yijizhao/FaST.
中文标题/摘要
标题:FaST:基于专家混合的异质性感知大规模时空图长时预测框架
大规模网络上的时空图(STG)预测引起了广泛关注。然而,现有模型主要关注短期预测,并在扩展到长期预测和大规模图时遭受严重的计算成本和内存消耗问题。为应对上述挑战,我们提出了FaST,一种基于异质性感知专家混合(MoEs)的框架,用于长时和大规模STG预测,能够实现对数千节点的一周前(以15分钟粒度计算,共672步)预测。FaST的核心创新包括:首先,提出了一种自适应图代理注意力机制,以缓解在大规模图上应用传统图卷积和自我注意力模块时固有的计算负担;其次,提出了一种新的并行MoE模块,用门控线性单元(GLUs)替换传统的前馈网络,实现高效且可扩展的并行结构。在真实世界数据集上的广泛实验表明,FaST不仅在长期预测准确性上表现出色,而且在计算效率上也显著优于最先进的基线方法。我们的源代码可在:https://github.com/yijizhao/FaST/ 获取。
Summary / 总结
FaST is a framework designed for efficient long-horizon forecasting on large-scale spatial-temporal graphs, addressing the computational challenges of existing models. It introduces an adaptive graph agent attention mechanism and a parallel Mixture-of-Experts module with Gated Linear Units to reduce computational costs. Experiments show that FaST outperforms state-of-the-art methods in both accuracy and efficiency for one-week-ahead predictions on large graphs.
FaST 是一种用于大型时空图长时预测的框架,解决了现有模型的计算挑战。它采用了适应性图代理注意力机制和带有门控线性单元的并行 MoE 模块来提高效率和准确性。FaST 在长时预测准确性和计算效率方面均优于现有方法。
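Two ideas from the FaST abstract can be made concrete with a small sketch: the one-week horizon arithmetic, and a mixture of experts whose experts are Gated Linear Units. The code below uses a generic GLU and a dense softmax router as stand-ins; the abstract does not specify FaST's actual expert or routing design, so the function names, weight shapes, and dense mixing are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]

# One week ahead at a 15-minute granularity: 7 days * 24 h * 4 steps per hour.
STEPS_PER_WEEK = 7 * 24 * 60 // 15


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def glu_expert(x: Vector, w_value: Matrix, w_gate: Matrix) -> Vector:
    """Generic GLU: (W_value x) multiplied elementwise by sigmoid(W_gate x)."""
    value = matvec(w_value, x)
    gate = [sigmoid(g) for g in matvec(w_gate, x)]
    return [v * g for v, g in zip(value, gate)]


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_glu(x: Vector, experts: List[Tuple[Matrix, Matrix]], w_router: Matrix) -> Vector:
    """Dense mixture: every expert runs, outputs are mixed by router weights.

    Assumed routing scheme for illustration only.
    """
    weights = softmax(matvec(w_router, x))
    outputs = [glu_expert(x, wv, wg) for wv, wg in experts]
    dim = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(dim)]
```

Because each expert depends only on the shared input `x`, the per-expert computations are independent of one another, which is the property that makes a parallel layout possible.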
CoV: Chain-of-View Prompting for Spatial Reasoning
Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang
First: 2026-01-08T17:59:42+00:00 · Latest: 2026-01-08T17:59:42+00:00
Abstract
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached.
We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
中文标题/摘要
标题:CoV:空间推理的链式视角提示
在3D环境中的嵌入式问题回答(EQA)通常需要收集分布在多个视角且部分被遮挡的上下文。然而,大多数最新的视觉-语言模型(VLMs)仅限于固定且有限的输入视角集,这限制了它们在推理时获取与问题相关上下文的能力,并阻碍了复杂的空间推理。我们提出了一种名为Chain-of-View(CoV)的提示方法,这是一种无需训练、在测试时进行推理的框架,通过从粗到细的探索过程将VLM转变为积极的视角推理者。CoV首先使用视图选择代理筛选冗余帧并识别与问题对齐的锚视图,然后通过交替进行迭代推理和离散相机动作进行精细的视图调整,从底层3D场景表示中获取新观察,直到收集到足够上下文或达到步骤预算。
我们在OpenEQA上对CoV进行了评估,跨四个主流VLMs获得了平均+11.56%的LLM-Match改进,最大增益为Qwen3-VL-Flash上的+13.62%。CoV还表现出测试时的扩展性:增加最小动作预算可额外获得平均+2.51%的改进,峰值为Gemini-2.5-Flash上的+3.73%。在ScanQA和SQA3D上,CoV表现出强大的性能(例如,ScanQA上的116 CIDEr / 31.9 EM@1和SQA3D上的51.1 EM@1)。总体而言,这些结果表明,与问题对齐的视图选择结合开放视图搜索是提高3D EQA中空间推理能力的有效、模型无关的策略,无需额外训练。
Summary / 总结
The research aims to improve embodied question answering (EQA) in 3D environments by addressing the fixed, finite set of input views that constrains vision-language models (VLMs). The proposed Chain-of-View (CoV) prompting method is training-free: it lets a VLM actively explore the scene and gather question-relevant context through a coarse-to-fine process. On OpenEQA across four VLMs, CoV improves LLM-Match by +11.56% on average, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV also exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement.
论文提出了一种Chain-of-View (CoV) 提示方法,通过动态探索多个视角来增强3D环境中的体感问答(EQA)中的空间推理能力。CoV 使用视图选择代理来筛选和选择相关视图,并通过迭代推理和相机动作进行精细的视图调整。该方法在OpenEQA上提高了四个VLMs的LLM-Match平均11.56%,最高增益为13.62%的Qwen3-VL-Flash。此外,它还展示了测试时的扩展性,当增加最小动作预算时,额外提高了2.51%的改进,峰值为3.73%的Gemini-2.5-Flash。CoV 在ScanQA 和 SQA3D 上表现出色,实现了高CIDEr和EM@1得分。
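The coarse-to-fine exploration described in the CoV abstract can be sketched as a short control loop. The sketch is illustrative only: `select_anchors`, `propose_action`, and `render_view` are hypothetical callbacks standing in for the View Selection agent, the VLM's reasoning step, and the 3D scene renderer, and only a maximum step budget is modeled (the minimum action budget mentioned in the abstract is omitted).

```python
from typing import Callable, List, Optional, Sequence, Tuple


def chain_of_view(
    frames: Sequence[str],
    select_anchors: Callable[[Sequence[str]], List[str]],
    propose_action: Callable[[List[str]], Optional[str]],
    render_view: Callable[[str, str], str],
    max_steps: int = 5,  # hypothetical step budget
) -> Tuple[List[str], int]:
    """Return (collected views, number of camera actions taken)."""
    # Coarse stage: drop redundant frames, keep question-aligned anchors.
    context = list(select_anchors(frames))
    if not context:
        raise ValueError("view selection returned no anchor views")

    # Fine stage: interleave reasoning with discrete camera actions.
    current = context[-1]
    steps = 0
    while steps < max_steps:
        action = propose_action(context)
        if action is None:  # the reasoner judges the context sufficient
            break
        current = render_view(current, action)  # new observation from the scene
        context.append(current)
        steps += 1
    return context, steps
```

The loop ends under either of the two conditions named in the abstract: the reasoner signals that enough context has been gathered (modeled here as returning `None`), or the step budget is exhausted.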
Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems
Authors: Jihao Zhao, Ding Chen, Zhaoxin Fan, Kerun Xu, Mengting Hu, Bo Tang, Feiyu Xiong, Zhiyu Li
First: 2026-01-08T17:59:11+00:00 · Latest: 2026-01-08T17:59:11+00:00
Abstract
Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
中文标题/摘要
标题:Inside Out:演化中的用户中心核心记忆树以支持长期个性化对话系统
现有的长期个性化对话系统难以调和无界交互流与有限上下文约束之间的关系,常常受到记忆噪声累积、推理退化和人设不一致的困扰。为了解决这些挑战,本文提出Inside Out框架,利用全局维护的PersonaTree作为长期用户画像的载体。通过初始模式约束主干并更新分支和叶子,PersonaTree实现了可控增长,同时实现了记忆压缩并保持一致性。此外,我们通过基于过程的奖励进行强化学习训练了一个轻量级的MemListener,以生成结构化、可执行和可解释的{ADD, UPDATE, DELETE, NO_OP}操作,从而支持个性化树的动态演化。在响应生成过程中,PersonaTree直接被利用以在延迟敏感场景中增强输出;当用户需要更多细节时,在PersonaTree的约束下触发代理模式以按需引入细节。实验表明,PersonaTree在抑制上下文噪声和保持人设一致性方面优于全文拼接和各种个性化记忆系统。值得注意的是,小型MemListener模型在记忆操作决策性能上与强大的推理模型DeepSeek-R1-0528和Gemini-3-Pro相当,甚至超越它们。
Summary / 总结
This paper addresses the challenges of long-term personalized dialogue systems by proposing Inside Out, a framework that uses a PersonaTree to maintain user profiles. The PersonaTree allows for controlled growth and memory compression while preserving consistency. A lightweight MemListener trained via reinforcement learning generates structured operations to update the PersonaTree. Experiments show that PersonaTree outperforms other methods in reducing contextual noise and maintaining persona consistency, with the MemListener achieving performance comparable to powerful reasoning models.
本文提出了一种Inside Out框架,通过使用PersonaTree来维护用户画像,解决长期个性化对话系统中的挑战。PersonaTree设计用于实现可控增长和内存压缩,同时保持一致性。通过强化学习训练的轻量级MemListener生成结构化的操作来更新PersonaTree。实验表明,PersonaTree在抑制上下文噪声和保持人物一致性方面优于其他方法,MemListener的表现甚至超过了强大的推理模型。
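To make the {ADD, UPDATE, DELETE, NO_OP} operation set concrete, the sketch below applies such operations to a two-level tree whose top-level branches are fixed by an initial schema. The data layout (a dict of branches, each holding leaf key/value pairs), the operation tuple format, and the error handling are assumptions for illustration; the abstract does not describe PersonaTree's actual structure.

```python
from typing import Dict, List, Optional, Tuple

Tree = Dict[str, Dict[str, str]]
# (operation, branch, leaf, value); value is unused for DELETE and NO_OP.
Operation = Tuple[str, str, str, Optional[str]]


def apply_operations(tree: Tree, ops: List[Operation]) -> Tree:
    """Return an updated copy of the tree; the input tree is not mutated."""
    updated = {branch: dict(leaves) for branch, leaves in tree.items()}
    for op, branch, leaf, value in ops:
        if op == "NO_OP":
            continue
        if branch not in updated:
            # The trunk is constrained by the schema: no new branches.
            raise KeyError(f"branch {branch!r} is not in the schema")
        leaves = updated[branch]
        if op in ("ADD", "UPDATE") and value is None:
            raise ValueError(f"{op} requires a value")
        if op == "ADD":
            leaves.setdefault(leaf, value)  # keep an existing leaf untouched
        elif op == "UPDATE":
            if leaf not in leaves:
                raise KeyError(f"leaf {leaf!r} does not exist")
            leaves[leaf] = value
        elif op == "DELETE":
            leaves.pop(leaf, None)
        else:
            raise ValueError(f"unknown operation {op!r}")
    return updated
```

Restricting writes to existing branches is one simple way to get the controllable growth the abstract describes: the number of branches stays fixed while leaves are added, revised, or removed.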
Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference
Authors: Rasmus Blanck, Bill Noble, Stergios Chatzikyriakidis
First: 2026-01-08T17:58:52+00:00 · Latest: 2026-01-08T17:58:52+00:00
Abstract
Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.
中文标题/摘要
标题:逆向工程自然语言推理:自然语言推理元推理性质的研究
自然语言推理(NLI)一直是评估自然语言理解语言模型的重要任务,但该任务的逻辑性质尚未得到充分理解,经常被误表征。理解NLI所捕捉的推理概念对于解释模型在该任务上的表现至关重要。在本文中,我们提出了NLI标签集的三种可能解读,并对它们所蕴含的元推理性质进行了全面分析。以SNLI数据集为例,我们利用(1)具有共享前提的NLI项目和(2)由LLM生成的项目来评估在SNLI上训练的模型的元推理一致性,并推导出数据集中编码的逻辑关系的哪种解读。
Summary / 总结
This paper aims to clarify the logical properties of the Natural Language Inference (NLI) task, which are key to interpreting model performance. The authors formulate three possible readings of the NLI label set and analyze the meta-inferential properties each reading entails. Focusing on SNLI, they use items with shared premises and items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency, and derive insights into which reading of the logical relations the dataset encodes.
本文旨在澄清自然语言推理(NLI)任务的逻辑属性,这对于解释模型性能至关重要。作者提出了NLI标签集的三种可能解读,并对其实现的元推理属性进行了详细分析。他们使用SNLI数据集中具有相同前提的项目和由LLM生成的项目来评估模型的元推理一致性,揭示了数据集中编码的逻辑关系的见解。
RelayLLM: Efficient Reasoning via Collaborative Decoding
Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang
First: 2026-01-08T17:56:16+00:00 · Latest: 2026-01-08T17:56:16+00:00
Abstract
Using Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, comprising warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
中文标题/摘要
标题:RelayLLM:通过协作解码实现高效推理
大型语言模型(LLMs)在复杂推理方面往往受到高计算成本和延迟的限制,而资源高效的小型语言模型(SLMs)通常缺乏必要的推理能力。现有的协作方法,如级联或路由,以粗粒度的方式运行,将整个查询卸载到LLMs上,当SLM能够处理大多数推理步骤时,这会导致显著的计算浪费。为了解决这个问题,我们提出了一种名为RelayLLM的新框架,通过token级的协作解码实现高效推理。与路由器不同,RelayLLM赋予SLM作为主动控制器的能力,动态地仅在关键token上调用LLM,通过特殊命令有效地“传递”生成过程。我们引入了一种两阶段训练框架,包括预热和组相对策略优化(GRPO),以教导模型平衡独立性和战略性求助。在六个基准测试中的实验结果表明,RelayLLM实现了49.52%的平均准确率,有效地弥合了两种模型之间的性能差距。值得注意的是,这仅通过调用LLM处理生成的token的1.07%,实现了与性能匹配的随机路由器相比高达98.2%的成本降低。
Summary / 总结
RelayLLM is a framework that enables efficient reasoning through token-level collaborative decoding between Small Language Models (SLMs) and Large Language Models (LLMs). Unlike existing coarse-grained collaborative methods, RelayLLM lets the SLM dynamically invoke the LLM only for critical tokens, reducing computational waste. The framework includes a two-stage training process that teaches the model to balance independence with strategic help-seeking. Across six benchmarks, RelayLLM achieves an average accuracy of 49.52% while invoking the LLM for only 1.07% of generated tokens, a 98.2% cost reduction compared to performance-matched random routers.
RelayLLM 是一种通过小语言模型(SLM)和大语言模型(LLM)在标记级别上协作解码来实现高效推理的框架。与现有的粗粒度协作方法不同,RelayLLM 允许 SLM 动态地仅通过特殊命令调用 LLM 关键标记,减少计算浪费。该框架包括一个两阶段的训练过程,以平衡独立性和战略性求助。实验结果显示,RelayLLM 在六个基准测试中实现了 49.52% 的准确率,仅调用 LLM 处理 1.07% 的标记,相比随机路由器的成本降低了 98.2%。
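The token-level relay described above can be sketched as a decoding loop in which the small model drives generation and hands over control only when it emits a special command. This is an illustrative sketch, not the authors' method: the command string `<call>`, the end-of-sequence marker, the `small_next` and `large_next` callbacks, and the choice to let the large model produce exactly one token per call are all assumptions.

```python
from typing import Callable, List, Tuple

CALL_TOKEN = "<call>"  # hypothetical special command emitted by the small model
EOS_TOKEN = "<eos>"    # hypothetical end-of-sequence marker

NextToken = Callable[[List[str]], str]


def relay_decode(
    prompt: List[str],
    small_next: NextToken,
    large_next: NextToken,
    max_tokens: int = 32,
) -> Tuple[List[str], int]:
    """Return (generated tokens, number of large-model invocations)."""
    tokens = list(prompt)
    generated: List[str] = []
    large_calls = 0
    while len(generated) < max_tokens:
        token = small_next(tokens)
        if token == CALL_TOKEN:
            # Relay: the large model supplies this one token (assumption),
            # then control returns to the small model.
            token = large_next(tokens)
            large_calls += 1
        if token == EOS_TOKEN:
            break
        tokens.append(token)
        generated.append(token)
    return generated, large_calls
```

The ratio `large_calls / len(generated)` corresponds to the share of tokens produced by the large model, the quantity the abstract reports as 1.07%.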
MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging
Authors: Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li
First: 2025-09-23T06:23:56+00:00 · Latest: 2026-01-08T17:56:05+00:00
Comments: The project is available at https://charlescsyyy.github.io/MVT
Abstract
Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
中文标题/摘要
标题:MVT:基于掩码的视觉-语言模型在分类学对齐的土地覆盖标签化中的应用
遥感中的土地覆盖理解越来越需要跨数据集泛化但保持空间精确性和可解释性的类无差别系统。我们研究了在领域转移下的几何优先发现与解释设置,其中候选区域以类无差别方式划定,监督避免使用类名的明码标识。除了开放集识别和开放世界学习,我们专注于将类无差别掩码证据与分类学导向的场景解释相结合,而不是未知拒绝或持续类扩展。我们提出了MVT,一个三阶段框架,(i) 使用SAM2进行领域适应以提取边界忠实的区域掩码,(ii) 通过双步骤LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成,(iii) 使用LLM作为裁判评分,通过分层专家评分校准输出评价。在跨数据集分割迁移(在OpenEarthMap上训练,在LoveDA上评估)中,领域适应的SAM2提高了掩码质量;同时,双步骤MLLM微调产生了更准确的分类学对齐标签和更具有信息量的掩码导向场景描述。
Summary / 总结
The research aims to develop class-agnostic systems for land-cover understanding in remote sensing that can generalize across datasets while maintaining spatial precision and interpretability. The method involves a three-stage framework: (i) extracting boundary-faithful region masks using SAM2 with domain adaptation, (ii) performing mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluating outputs with LLM-as-judge scoring calibrated by stratified expert ratings. The key experimental findings show that domain-adapted SAM2 improves mask quality, and dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
研究旨在开发能够在遥感中进行土地覆盖理解的类-无感知系统,使其能够在不同数据集之间泛化,同时保持空间精度和可解释性。方法包括三个阶段:(i) 使用SAM2进行域适应以提取边界忠实的区域掩码,(ii) 通过多模态LLM的双步骤LoRA微调进行掩码导向的语义标签和场景描述生成,(iii) 使用LLM作为评判者进行输出评估,并通过分层专家评分进行校准。关键发现表明,域适应的SAM2提高了掩码质量,而双步骤MLLM微调则产生了更准确的分类学对齐标签和更丰富的掩码导向场景描述。
Improving and Evaluating Open Deep Research Agents
Authors: Doaa Allabadi, Kyle Bradbury, Jordan M. Malof
First: 2025-08-13T19:32:01+00:00 · Latest: 2026-01-08T17:54:58+00:00
Comments: 8 pages, 2 figures, 2 tables
Abstract
We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user, and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we only found one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally tractable DRA benchmark for academic labs. We benchmark ODR and two proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contributed to the success of ODR+.
中文标题/摘要
标题:改进和评估开放深度研究代理
我们在这里关注深度研究代理(DRAs),这是一种可以从用户那里接收自然语言提示,并自主搜索和利用互联网内容来回应提示的系统。最近的DRAs在公共基准测试中展示了令人印象深刻的性能,然而,最近的研究主要涉及专有的闭源系统。在本研究进行时,我们仅发现一个开源的DRA,称为开放深度研究(ODR)。在本工作中,我们将具有挑战性的最近的BrowseComp基准测试改编为比较ODR与现有专有系统的基准测试。我们提出了BrowseComp-Small(BC-Small),作为更易于学术实验室处理的DRAs基准测试,它由BrowseComp的一部分组成。我们在BC-Small上对ODR和两个其他专有系统进行了基准测试:来自Anthropic的一个系统和来自Google的一个系统。我们发现,这三个系统在包含60个问题的测试集上均未达到1%的准确率。我们引入了三个战略改进,结果产生了ODR+模型,该模型在BC-Small上实现了专有和开源系统中的最佳10%成功率。我们报告了消融研究,表明我们的三个改进都对ODR+的成功做出了贡献。
Summary / 总结
This study focuses on Deep Research Agents (DRAs), which autonomously search for and use internet content to address user prompts. The authors adapt the BrowseComp benchmark into BrowseComp-Small (BC-Small), a more computationally tractable subset, and use it to evaluate ODR, an open-source DRA, alongside two proprietary systems from Anthropic and Google. All three systems achieve 0% accuracy on the 60-question test set. The authors then introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems.
该研究关注能够根据用户提示自主搜索和利用互联网内容的Deep Research Agents (DRAs)。作者将BrowseComp基准应用于评估开源DRA ODR及其与现有封闭源代码系统的性能。基准测试后发现,所有系统表现不佳。随后,他们对ODR进行了三项战略改进,形成了ODR+模型,该模型在基准测试中的成功率为10%,在开源和封闭源代码系统中均最高。
DocDancer: Towards Agentic Document-Grounded Information Seeking
Authors: Qintong Zhang, Xinjie Lv, Jialong Wu, Baixuan Li, Zhengwei Tao, Guochen Yan, Huanyao Zhang, Bin Wang, Jiahao Xu, Haitao Mi, Wentao Zhang
First: 2026-01-08T17:54:32+00:00 · Latest: 2026-01-08T17:54:32+00:00
Abstract
Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Models trained on the synthesized data demonstrate their effectiveness on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights for agentic tool design and synthetic data.
中文标题/摘要
标题:DocDancer: 向基于文档的主动信息寻求迈进
文档问题回答(DocQA)专注于基于给定文档回答问题,但现有的DocQA代理缺乏有效的工具利用,主要依赖于封闭源模型。在本工作中,我们介绍了DocDancer,一个端到端训练的开源Doc代理。我们将DocQA形式化为一个信息寻求问题,并提出了一种工具驱动的代理框架,明确地建模了文档探索和理解。为了使此类代理能够端到端训练,我们引入了一种探索-合成数据合成管道,以解决DocQA高质量训练数据的稀缺性问题。在合成数据上进行训练,两个长上下文文档理解基准MMLongBench-Doc和DocBench上的训练模型显示了其有效性。进一步的分析为代理工具设计和合成数据提供了宝贵的见解。
Summary / 总结
The research aims to improve Document Question Answering (DocQA) by developing an open-source agent that uses tools effectively, addressing the weak tool utilization and reliance on closed-source models of existing DocQA agents. DocDancer is an end-to-end trained agent that formulates DocQA as an information-seeking problem and uses a tool-driven framework for document exploration and comprehension. An Exploration-then-Synthesis pipeline synthesizes the training data, and the resulting models demonstrate their effectiveness on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. The analysis provides insights for agentic tool design and synthetic data creation.
DocDancer 是一个端到端训练的开源 DocQA 代理,通过引入工具利用和开源模型来解决现有代理的局限性。它将 DocQA 形式化为信息寻求问题,并使用探索然后合成的数据合成管道来训练代理。在 MMLongBench-Doc 和 DocBench 长文理解基准上的训练模型展示了有效性。进一步的分析提供了关于代理工具设计和合成数据创建的见解。