arXiv 论文速递

2026-01-12 03:23
Snapshot: 20260112_0323
Pixel-Perfect Visual Geometry Estimation
Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang
First: 2026-01-08T18:59:49+00:00 · Latest: 2026-01-08T18:59:49+00:00
Comments: Code: https://github.com/gangweix/pixel-perfect-depth
Abstract
Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.
中文标题/摘要
标题:像素完美视觉几何估计
从图像中恢复干净且准确的几何结构对于机器人技术和增强现实至关重要。然而,现有的几何基础模型仍然严重受到漂像素和细节损失的影响。在本文中,我们提出了像素完美视觉几何模型,通过在像素空间中利用生成建模来预测高质量、无漂像素的点云。我们首先介绍了像素完美深度(PPD),这是一种基于像素空间扩散变换器(DiT)的单目深度基础模型。为了解决像素空间扩散带来的高计算复杂性,我们提出了两个关键设计:1)语义提示DiT,将视觉基础模型中的语义表示融入扩散过程,保留全局语义同时增强细粒度视觉细节;2)级联DiT架构,逐步增加图像标记的数量,提高效率和准确性。为了将PPD进一步扩展到视频(PPVD),我们引入了一种新的语义一致DiT,从多视图几何基础模型中提取时空一致的语义。然后在DiT中进行参考引导的标记传播,以最小的计算和内存开销保持时间连贯性。我们的模型在所有生成单目和视频深度估计模型中表现最佳,并且产生的点云比其他所有模型都更干净。
Summary / 总结
This paper addresses the challenge of recovering clean and accurate geometry from images, which is crucial for robotics and augmented reality. It introduces pixel-perfect visual geometry models that use generative modeling in the pixel space to avoid flying pixels and loss of fine detail. Pixel-Perfect Depth (PPD) is built on pixel-space diffusion transformers (DiT) and combines semantic prompting from vision foundation models with a cascade architecture that progressively increases the number of image tokens; its video extension, PPVD, adds a Semantics-Consistent DiT and reference-guided token propagation for temporal coherence. The models achieve the best performance among generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other compared models.
本文解决了从图像中恢复干净准确几何结构的挑战,这对机器人技术和增强现实至关重要。该文提出了基于像素空间生成建模的像素完美视觉几何模型,包括像素完美深度(PPD)及其视频扩展PPVD。这些模型利用像素空间扩散变换器(DiT),并结合语义提示和级联架构,以增强细粒度细节和计算效率。实验结果表明,这些模型在单目和视频深度估计中优于现有方法,生成的点云更为干净。
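To make the cost argument behind the cascade design concrete, the sketch below counts image tokens at two patch sizes and compares the number of pairwise attention interactions. The 512x512 resolution and the patch sizes 16 and 8 are illustrative assumptions, not values taken from the paper, and `num_tokens` is a hypothetical helper.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for an image (hypothetical helper)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Assumed values for illustration only: a 512x512 image, coarse vs. fine patches.
coarse = num_tokens(512, 512, 16)   # fewer tokens in an early, coarse stage
fine = num_tokens(512, 512, 8)      # more tokens in a later, fine stage

# Full self-attention compares every token with every other token,
# so its cost grows with the square of the token count.
cost_ratio = (fine ** 2) // (coarse ** 2)

print(coarse, fine, cost_ratio)
```

Halving the patch size quadruples the token count and multiplies the pairwise attention cost by sixteen, which is one plausible reading of why running only the later stages at the full token count can save computation.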
Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration
Authors: Xingyi He, Adhitya Polavaram, Yunhao Cao, Om Deshmukh, Tianrui Wang, Xiaowei Zhou, Kuan Fang
First: 2026-01-08T18:59:30+00:00 · Latest: 2026-01-08T18:59:30+00:00
Comments: Project Page: https://cordex-manipulation.github.io/
Abstract
Functional grasping with dexterous robotic hands is a key capability for enabling tool use and complex manipulation, yet progress has been constrained by two persistent bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. In this work, we present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data generated from just a single human demonstration. At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation. Based on the human demonstration, our data engine generates diverse object instances of the same category, transfers the expert grasp to the generated objects through correspondence estimation, and adapts the grasp through optimization. Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information. By devising a local-global fusion module and an importance-aware sampling mechanism, we enable robust and computationally efficient prediction of functional dexterous grasps. Through extensive experiments across various object categories, we demonstrate that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.
中文标题/摘要
标题:生成、转移、适应:从单个人类演示学习功能性灵巧抓取
功能性灵巧抓取对于使机器人手能够使用工具和进行复杂操作至关重要,但进展受限于两个持续存在的瓶颈:大规模数据集的稀缺性和学习模型中缺乏集成的语义和几何推理。在本文中,我们提出了CorDex框架,该框架能够从单一个人演示生成的合成数据中稳健地学习新物体的功能灵巧抓取。我们方法的核心是一个基于对应关系的数据引擎,该引擎在仿真中生成多样且高质量的训练数据。基于人类演示,数据引擎生成同一类别的多种物体实例,通过对应关系估计将专家抓取转移到生成的物体上,并通过优化进行抓取适应。基于生成的数据,我们引入了一种多模态预测网络,结合了视觉和几何信息。通过设计局部-全局融合模块和重要性感知采样机制,我们实现了功能灵巧抓取的稳健且计算高效的预测。通过在各种物体类别上的广泛实验,我们证明了CorDex能够很好地泛化到未见过的物体实例,并显著优于最先进的基线。
Summary / 总结
The research addresses the challenge of learning functional dexterous grasping from a single human demonstration, targeting two bottlenecks: the scarcity of large-scale datasets and the absence of integrated semantic and geometric reasoning in learned models. The CorDex framework's data engine generates diverse object instances in simulation, transfers the expert grasp to them through correspondence estimation, and adapts the grasp through optimization. A multimodal prediction network integrates visual and geometric information, achieving robust and efficient grasp prediction. Experiments show that CorDex generalizes well to unseen object instances and significantly outperforms state-of-the-art baselines.
研究旨在通过单个人类示范和合成数据生成来解决学习灵巧功能性抓取的挑战。方法包括使用对应关系数据引擎生成模拟中的多样化训练数据,将专家抓取转移到新物体并进行优化。多模态预测网络结合视觉和几何信息,实现稳健且高效的抓取预测。实验表明,CorDex 在未见过的物体上表现良好并优于现有方法。
Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation
Authors: Emerson P. Grabke, Babak Taati, Masoom A. Haider
First: 2025-06-11T23:12:48+00:00 · Latest: 2026-01-08T18:59:27+00:00
Comments: Accepted for publication in IEEE Transactions on Biomedical Engineering, 2025. This is the accepted author version. The final published version is available at https://doi.org/10.1109/TBME.2025.3648426
Abstract
Objective: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM strategies typically rely on short-prompt text encoders, nonmedical LDMs, or large data volumes. These strategies can limit performance and scientific accessibility. We propose a novel LDM conditioning approach to address these limitations. Methods: We propose Class-Conditioned Efficient Large Language model Adapter (CCELLA), a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. We also propose a data-efficient LDM pipeline centered around CCELLA and a proposed joint loss function. We first evaluate our method on 3D prostate MRI against state-of-the-art. We then augment a downstream classifier model training dataset with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.070. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method during training improves classifier accuracy from 69% to 74% and outperforms classifiers trained on images generated by prior state-of-the-art. Classifier training solely on our method's synthetic images achieved comparable performance to real image training. Conclusion: We show that our method improved both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The proposed CCELLA-centric pipeline enables radiology report and class-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility.
中文标题/摘要
标题:利用临床文本和类别条件化生成3D前列腺MRI
目标:潜在扩散模型(LDM)可以缓解医学成像领域机器学习开发中的数据稀缺挑战。然而,医学LDM策略通常依赖于简短提示文本编码器、非医学LDM或大量数据。这些策略可能会限制性能和科学可访问性。我们提出了一种新的LDM条件化方法来解决这些限制。方法:我们提出了类别条件化高效大型语言模型适配器(CCELLA),这是一种新颖的双头条件化方法,同时用自由文本临床报告和放射学分类条件化LDM U-Net。我们还提出了一种以CCELLA为中心的数据高效LDM管道和一个提出的联合损失函数。我们首先在3D前列腺MRI上评估了我们的方法,与最先进的方法进行了比较。然后,我们使用我们方法生成的合成图像来增强下游分类器模型训练数据集。结果:我们的方法在大小受限的3D前列腺MRI数据集上实现了0.025的3D FID分数,显著优于最近的基础模型,该模型的FID为0.070。当训练前列腺癌预测分类器时,在训练过程中添加由我们方法生成的合成图像,分类器的准确性从69%提高到74%,并优于使用先前最先进的方法生成的图像训练的分类器。仅使用我们方法生成的合成图像进行分类器训练,其性能与使用真实图像训练的分类器相当。结论:我们展示了我们的方法在使用有限数据和最少的人工注释的情况下,提高了合成图像质量和下游分类器性能。意义:提出的CCELLA为中心的管道使在有限数据量和人工数据注释的情况下,能够利用放射学报告和类别条件化LDM进行高质量医学图像合成,从而提高LDM性能和科学可访问性。
Summary / 总结
The research aims to improve the performance and scientific accessibility of latent diffusion models (LDM) for medical imaging, particularly in addressing data scarcity. The authors propose CCELLA, a novel dual-head conditioning approach that conditions the LDM U-Net with free-text clinical reports and radiology classification. This method significantly outperforms existing approaches, achieving a 3D FID score of 0.025 and improving classifier accuracy for prostate cancer prediction from 69% to 74%. The method also showed comparable performance to real image training when used solely for classifier training, demonstrating its effectiveness with limited data and minimal human annotation.
研究旨在通过利用潜在扩散模型(LDM)和提出一种名为CCELLA的新颖条件化方法来解决医学成像中的数据稀缺问题。该方法结合了自由文本临床报告和放射学分类来条件化LDM U-Net,并开发了一个数据高效的管道。结果表明,所提出的方法在3D FID得分上达到0.025,显著优于之前的模型。此外,由该方法生成的合成图像将前列腺癌预测下游分类器的准确性从69%提高到74%。该方法在有限数据和少量人工注释的情况下展示了改进的性能和可访问性。
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Authors: Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov
First: 2026-01-08T18:59:24+00:00 · Latest: 2026-01-08T18:59:24+00:00
Comments: NVIDIA-Tech Report
Abstract
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in the multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
中文标题/摘要
标题:GDPO:组奖励-解耦归一化策略优化多奖励RL优化
随着语言模型能力的不断增强,用户期望它们不仅能提供准确的响应,还能表现出与各种场景中多样的人类偏好相一致的行为。为了实现这一目标,强化学习(RL)管道已经开始采用多个奖励,每个奖励捕捉一种独特的偏好,以引导模型向这些期望的行为发展。然而,最近的工作在多奖励设置中默认使用组相对策略优化(GRPO)而没有对其适用性进行检查。在本文中,我们证明直接将GRPO应用于归一化不同的回放奖励组合会导致它们的优劣值坍缩为相同的值,降低了训练信号的分辨率,导致次优收敛,在某些情况下甚至导致训练早期失败。然后,我们引入了组奖励-解耦归一化策略优化(GDPO),这是一种新的策略优化方法,通过解耦个体奖励的归一化来解决这些问题,更忠实地保留它们的相对差异,从而实现更准确的多奖励优化,并且训练稳定性显著提高。我们通过三个任务(工具调用、数学推理和编程推理)将GDPO与GRPO进行了比较,评估了正确性指标(准确率、错误率)和约束遵守指标(格式、长度)。在所有设置中,GDPO始终优于GRPO,证明了其在多奖励强化学习优化中的有效性和普适性。
Summary / 总结
This paper addresses the issue of using Group Relative Policy Optimization (GRPO) in multi-reward reinforcement learning, showing that it can cause distinct rewards to collapse into identical values, leading to suboptimal training. To resolve this, the authors propose Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples the normalization of individual rewards, preserving their relative differences and improving training stability. GDPO outperforms GRPO across three tasks: tool calling, math reasoning, and coding reasoning, in terms of both correctness and constraint adherence metrics.
本文解决了在多奖励强化学习中使用组相对策略优化(GRPO)的问题,这会导致不同的奖励值坍缩为相同,从而导致训练效果不佳。为此,作者提出了组奖励解耦归一化策略优化(GDPO),该方法通过解耦各个奖励的归一化,保留它们的相对差异,从而提高训练稳定性。GDPO在工具调用、数学推理和编码推理三个任务中,在正确性和约束遵守度指标上均优于GRPO,证明了其有效性和普适性。
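The collapse described in the abstract can be illustrated with a toy group of rollouts. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the "GRPO-style" variant sums the rewards per rollout and then z-scores the totals across the group, while the "decoupled" variant z-scores each reward separately and then sums. How GDPO actually combines the normalized rewards is an assumption here, and the reward values are invented for the example.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Standardize a list of numbers across the group."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / (s + eps) for v in values]


def summed_then_normalized(rewards):
    """GRPO-style: add up each rollout's rewards, then normalize the totals."""
    return zscore([sum(r) for r in rewards])


def decoupled_advantages(rewards):
    """Decoupled sketch (assumed form): normalize each reward, then add."""
    per_reward = [zscore(column) for column in zip(*rewards)]
    return [sum(vals) for vals in zip(*per_reward)]


# Toy group of 4 rollouts, each scored by two binary rewards
# (say, correctness and format). Values are made up for illustration.
group = [(1, 0), (0, 1), (1, 1), (0, 1)]

coupled = summed_then_normalized(group)
decoupled = decoupled_advantages(group)

print("summed first:", [round(a, 3) for a in coupled])
print("decoupled:   ", [round(a, 3) for a in decoupled])
```

Rollouts `(1, 0)` and `(0, 1)` have the same total, so the summed variant gives them the same advantage. In the decoupled variant they differ because the two rewards have different group statistics.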
RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation
Authors: Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang
First: 2026-01-08T18:59:22+00:00 · Latest: 2026-01-08T18:59:22+00:00
Abstract
The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.
中文标题/摘要
标题:RoboVIP:基于视觉身份提示的多视角视频生成增强机器人操作
操作数据的多样性和数量对于训练有效的机器人策略至关重要。然而,由于硬件和物理设置的限制,收集大规模的现实世界操作数据在不同环境中的扩展仍然具有挑战性。近期的工作使用文本提示条件下的图像扩散模型来通过改变视觉观察中的背景和桌面物体来增强操作数据。然而,这些方法往往忽视了由最先进的策略模型所需的多视角和时间上一致的观察需求。此外,仅靠文本提示无法可靠地指定场景设置。为了给扩散模型提供明确的视觉指导,我们引入了视觉身份提示,通过提供示例图像作为条件输入来引导生成所需的场景设置。为此,我们还构建了一个可扩展的流水线,从大型机器人数据集中策划视觉身份池。使用我们增强的操作数据来训练下游的视觉-语言-动作和视知觉运动策略模型,在仿真和真实机器人环境中均能获得一致的性能提升。
Summary / 总结
The research aims to enhance the diversity, quantity, and quality of manipulation data for training robot policies. To address the challenge of collecting large-scale real-world data, the study introduces RoboVIP, which uses visual identity prompting to generate multi-view and temporally coherent observations. This method improves the performance of vision-language-action and visuomotor policy models in both simulation and real-robot settings, demonstrating consistent gains in manipulation tasks.
研究旨在通过增强操作数据的多样性、数量和质量来提高机器人策略的训练效果。为了解决大规模收集真实世界数据的难题,该研究引入了RoboVIP,利用视觉身份提示生成多视角和时间上连贯的观察数据。这种方法在仿真和真实机器人环境中提高了视觉语言动作和视知觉运动策略模型的表现,展示了在操作任务中的持续改进。
Robust Reasoning as a Symmetry-Protected Topological Phase
Authors: Ilmo Sung
First: 2026-01-08T18:58:34+00:00 · Latest: 2026-01-08T18:58:34+00:00
Abstract
Large language models suffer from "hallucinations": logical inconsistencies induced by semantic noise. We propose that current architectures operate in a "Metric Phase," where causal order is vulnerable to spontaneous symmetry breaking. Here, we identify robust inference as an effective Symmetry-Protected Topological phase, where logical operations are formally isomorphic to non-Abelian anyon braiding, replacing fragile geometric interpolation with robust topological invariants. Empirically, we demonstrate a sharp topological phase transition: while Transformers and RNNs exhibit gapless decay, our Holonomic Network reveals a macroscopic "mass gap," maintaining invariant fidelity below a critical noise threshold. Furthermore, in a variable-binding task on $S_{10}$ ($3.6 \times 10^6$ states) representing symbolic manipulation, we demonstrate holonomic generalization: the topological model maintains perfect fidelity extrapolating $100\times$ beyond training ($L=50 \to 5000$), consistent with a theoretically indefinite causal horizon, whereas Transformers lose logical coherence. Ablation studies indicate this protection emerges strictly from non-Abelian gauge symmetry. This provides strong evidence for a new universality class for logical reasoning, linking causal stability to the topology of the semantic manifold.
中文标题/摘要
标题:稳健推理作为一种对称保护拓扑相
大型语言模型遭受“幻觉”——由语义噪声引起的逻辑不一致。我们提出当前架构处于“度量相”中,在这种相中因果顺序容易自发对称破缺。在此,我们将稳健推理识别为一种有效的对称保护拓扑相,在这种相中逻辑操作形式上等同于非阿贝尔任意子编织,用稳健的拓扑不变量取代脆弱的几何插值。实验上,我们展示了明显的拓扑相变:虽然变换器和RNN表现出无隙衰减,我们的本征网络揭示了宏观的“质量隙”,在临界噪声阈值以下保持不变的保真度。此外,在$S_{10}$(3.6×$10^6$状态)表示符号操作的变量绑定任务中,我们展示了本征泛化:拓扑模型在训练($L=50$)基础上外推100倍($L=5000$)仍保持完美保真度,这与理论上无限的因果视界一致,而变换器则失去逻辑连贯性。消融研究表明,这种保护严格源自非阿贝尔规范对称性。这为逻辑推理提供了一个新的普遍类,将因果稳定性与语义流形的拓扑学联系起来。
Summary / 总结
The research addresses logical inconsistencies in large language models, termed 'hallucinations,' by identifying robust inference as a Symmetry-Protected Topological phase in which logical operations are formally isomorphic to non-Abelian anyon braiding, so that fragile geometric interpolation is replaced with robust topological invariants. Key experimental findings include a sharp topological phase transition: the proposed Holonomic Network maintains invariant fidelity below a critical noise threshold, while Transformers and RNNs exhibit gapless decay. In a variable-binding task on $S_{10}$, the Holonomic Network also maintains perfect fidelity when extrapolating 100 times beyond the training length ($L=50 \to 5000$), whereas Transformers lose logical coherence.
研究旨在通过提出一种新的架构解决大型语言模型中的逻辑不一致问题,即所谓的‘幻觉’,该架构基于对称保护拓扑相。方法是使用一个拓扑模型,其形式上等同于非阿贝尔任意子编织,以替代几何插值,实现稳健的拓扑不变性。关键实验发现包括拓扑相突变,其中拓扑网络在临界噪声阈值以下保持不变的保真度,而变换器和RNN则不然。此外,拓扑网络在$S_{10}$的变量绑定任务中表现出拓扑泛化能力,能够完美地将训练范围外推100倍,而变换器则失去逻辑一致性。
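Two facts about $S_{10}$ used in the abstract can be checked directly: the group has $10! \approx 3.6 \times 10^6$ elements, and it is non-Abelian, so the order in which permutations are composed matters. The sketch below only illustrates these group-theoretic facts; it does not implement the paper's Holonomic Network, and `compose` and `swap` are hypothetical helper names.

```python
import math


def compose(p, q):
    """Return the permutation 'apply q first, then p' (tuples of indices)."""
    return tuple(p[i] for i in q)


def swap(n, i, j):
    """Transposition of positions i and j in the symmetric group S_n."""
    perm = list(range(n))
    perm[i], perm[j] = perm[j], perm[i]
    return tuple(perm)


n = 10
order = math.factorial(n)          # number of elements of S_10

a = swap(n, 0, 1)
b = swap(n, 1, 2)
ab = compose(a, b)
ba = compose(b, a)

print(order)                       # 3628800
print(ab[:3], ba[:3])              # the two orders of composition differ
```

Because `compose(a, b)` and `compose(b, a)` differ, a model tracking a sequence of such operations must preserve their order over the whole sequence, which is the kind of long-horizon state tracking the variable-binding task probes.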
Measuring and Fostering Peace through Machine Learning and Artificial Intelligence
Authors: P. Gilda, P. Dungarwal, A. Thongkham, E. T. Ajayi, S. Choudhary, T. M. Terol, C. Lam, J. P. Araujo, M. McFadyen-Mungalln, L. S. Liebovitch, P. T. Coleman, H. West, K. Sieck, S. Carter
First: 2026-01-08T18:57:01+00:00 · Latest: 2026-01-08T18:57:01+00:00
Comments: 6 pages, 4 figures
Abstract
We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop online tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of online news sources. The model, trained on one news media dataset, also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word-level (GoEmotions) and context-level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old daily view most of their news through short videos on social media. Content creators of these videos are biased toward emotionally activating content that makes viewers angry in order to drive engagement and clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long-term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.
中文标题/摘要
标题:通过机器学习和人工智能衡量与促进和平
我们使用机器学习和人工智能:1) 从新闻和社交媒体中衡量各国的和平水平;2) 开发在线工具以促进和平,帮助用户更好地理解自己的媒体消费。对于新闻媒体,我们使用神经网络从在线新闻来源的文本嵌入中衡量和平水平。该模型在训练于一个新闻媒体数据集后,也对分析另一个新闻数据集时表现出高准确性。对于社交媒体,如YouTube,我们开发了其他模型来衡量与和平相关的社会维度,使用了词级(GoEmotions)和上下文级(大型语言模型)方法。为了促进和平,我们注意到20-40岁人群中71%的人每天主要通过社交媒体上的短视频获取新闻。这些视频内容创作者倾向于制作能够激发情绪、让你生气的视频以增加点击率。我们开发并测试了一个名为MirrorMirror的Chrome扩展程序,为YouTube观众提供他们正在观看的媒体的实时反馈,关于其和平程度。我们的长期目标是让MirrorMirror成为一个开源工具,供内容创作者、记者、研究人员、平台和个人用户更好地理解其媒体创作和消费的语气及其对观众的影响。超越简单的参与度指标,我们希望鼓励更加尊重、细致和信息丰富的交流。
Summary / 总结
This research aims to measure and foster peace using machine learning and artificial intelligence. The study used neural networks to assess peace levels from text embeddings of online news, and the model trained on one news dataset remained highly accurate on a different one; separate word-level (GoEmotions) and context-level (LLM) models cover social media such as YouTube. Noting that 71% of people aged 20-40 view most of their news daily through short social media videos, the authors built and tested MirrorMirror, a Chrome extension that gives YouTube viewers real-time feedback on the peacefulness of what they are watching. The longer-term goal is to encourage more respectful, nuanced, and informative communication by helping users understand their media diet.
该研究利用机器学习和人工智能来衡量国家从新闻和社交媒体中和平水平,并开发了一个名为MirrorMirror的在线工具,通过实时反馈媒体内容的和平程度来促进和平。研究发现,20-40岁的人中有71%主要通过社交媒体上的短视频获取新闻,这些视频通常带有强烈的情感色彩以增加参与度。该工具MirrorMirror旨在帮助用户更好地理解和促进更和平的媒体消费和创作。
Learning Latent Action World Models In The Wild
Authors: Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat
First: 2026-01-08T18:55:39+00:00 · Latest: 2026-01-08T18:55:39+00:00
Comments: 37 pages, 25 figures
Abstract
Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates the learning of latent action models, which can learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. For example, we find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with performance similar to that of action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.
中文标题/摘要
标题:学习自然环境中的潜在动作世界模型
能够在现实世界中进行推理和规划的智能体需要预测其行为后果的能力。尽管世界模型具备这种能力,但它们通常需要行为标签,而这些标签在大规模应用中往往难以获取。这促使我们学习潜在动作模型,可以从视频中学习动作空间。我们的工作解决了在自然环境中学习潜在动作世界模型的问题,扩展了现有工作集中在简单机器人模拟、视频游戏或操作数据上的范围。虽然这使我们能够捕捉到更丰富的动作,但也带来了视频多样性带来的挑战,如环境噪声或视频间缺乏共同的实体。为应对部分挑战,我们讨论了动作应遵循的属性以及相关架构选择和评估。我们发现,连续但受限的潜在动作能够捕捉自然环境中视频的动作复杂性,而常见的向量量化则无法做到这一点。例如,我们发现来自代理(如人类进入房间)的环境变化可以在视频间转移。这突显了学习特定于自然环境视频的动作能力。在视频间缺乏共同实体的情况下,我们主要能够学习在空间上局部化的潜在动作,相对于摄像机而言。尽管如此,我们能够训练一个控制器,将已知动作映射到潜在动作,使我们能够使用潜在动作作为通用接口,并使用世界模型解决规划任务,其性能与基于动作的基线相当。我们的分析和实验为将潜在动作模型扩展到现实世界提供了一步进展。
Summary / 总结
This research aims to develop world models that can predict the consequences of actions without requiring explicit action labels, which are often difficult to obtain at scale. The authors propose learning latent action models from in-the-wild videos, addressing challenges such as environmental noise and lack of a common embodiment. They find that continuous but constrained latent actions can capture the complexity of actions from diverse videos, and that these actions can be localized in space relative to the camera. Despite the absence of a common embodiment, they successfully train a controller to map known actions to latent ones, enabling the use of latent actions as a universal interface for solving planning tasks with similar performance to action-conditioned baselines.
该研究旨在开发无需明确动作标签即可预测动作结果的世界模型,这些标签在大规模获取时往往难以获得。作者解决了从多样化的在野视频中学习潜在动作模型的挑战,这些模型可以捕捉到比简单模拟或游戏更丰富的动作。关键发现包括连续但受限的潜在动作能够捕捉到来自真实世界视频的动作复杂性,并开发了一个控制器,将已知动作映射到潜在动作,使潜在动作能够用于规划任务,其性能与基于动作的基线相当。
Non-Linear Scoring Model for Translation Quality Evaluation
Authors: Serge Gladkoff, Lifeng Han, Katerina Gasova
First: 2025-11-17T15:09:22+00:00 · Latest: 2026-01-08T18:51:57+00:00
Comments: ongoing work, 32 pages
Abstract
Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
中文标题/摘要
标题:翻译质量评估的非线性评分模型
基于多维质量指标(MQM)的分析性翻译质量评估(TQE)传统上使用线性误差到惩罚比例,该比例针对1000-2000词的参考样本进行校准。然而,线性外推会偏向不同大小样本的判断,对短样本过度惩罚,对长样本则惩罚不足,导致与专家直觉不一致。 本文基于多范围框架,提出了一种校准的非线性评分模型,更好地反映了不同长度样本中人类内容消费者对翻译质量的感知。来自三个大型企业环境的实证数据显示,可接受的错误数量随样本大小呈对数增长,而非线性增长。 心理物理和认知证据,包括韦伯-费希纳定律和认知负荷理论,支持这一观点,解释了为什么额外错误的感知影响随规模增长而减弱,而认知负担则随规模增长。我们提出一个两参数模型 E(x) = a * ln(1 + b * x),a, b > 0, 该模型以参考容忍度为锚点,并通过一个一维根寻找步骤校准两个容忍度点。该模型在相对误差不超过±20%的区间内使线性近似保持有效,并且只需添加动态容忍度函数即可与现有的评估工作流程集成。 该方法提高了人类和AI生成翻译的解释性、公平性和评分者间的一致性。通过操作化一个感知上有效的评分范式,它推动了翻译质量评估向更准确和可扩展的评估迈进。该模型还为与人类判断一致的基于AI的文档级评估提供了更强的基础。讨论了CAT/LQA系统实施考虑和对人类和AI生成文本评估的影响。
Summary / 总结
This paper addresses the limitations of traditional linear scoring models in Translation Quality Evaluation (TQE) by proposing a non-linear scoring model based on the Multi-Range framework. Empirical data from three large-scale enterprise environments indicate that acceptable error counts grow logarithmically with sample size, not linearly. The proposed model, E(x) = a * ln(1 + b * x), is calibrated using a one-dimensional root-finding step and provides a more accurate and fair evaluation of translation quality across different sample sizes, enhancing both human and AI-generated translation assessments.
本文针对传统线性评分模型在翻译质量评估(TQE)中的局限性,提出了基于Multi-Range框架的非线性评分模型。来自三个大型企业环境的实证数据显示,可接受的错误数量随着样本大小的增加而呈对数增长。该模型E(x) = a * ln(1 + b * x)通过一维根寻找步骤进行校准,提供了更准确和公平的翻译质量评估,提高了可解释性和评判者间的一致性。
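The calibration step described in the abstract can be sketched as follows: given two tolerance points $(x_1, E_1)$ and $(x_2, E_2)$, solve $E_2 \ln(1 + b x_1) = E_1 \ln(1 + b x_2)$ for $b$ with a one-dimensional root finder, then set $a = E_1 / \ln(1 + b x_1)$. This is a minimal sketch assuming plain bisection; the paper's exact procedure may differ, and the tolerance points below (5 errors at 1000 words, 12 errors at 4000 words) are invented for illustration.

```python
import math


def calibrate(x1, e1, x2, e2, lo=1e-9, hi=1.0, iters=200):
    """Fit E(x) = a * ln(1 + b * x) through two tolerance points by bisection on b."""

    def g(b):
        # Zero exactly when both points lie on the same curve.
        return e2 * math.log1p(b * x1) - e1 * math.log1p(b * x2)

    if g(lo) * g(hi) > 0:
        raise ValueError("no sign change: points are not consistent with the model")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    b = 0.5 * (lo + hi)
    a = e1 / math.log1p(b * x1)
    return a, b


def tolerance(x, a, b):
    """Acceptable error count for a sample of x words under the fitted model."""
    return a * math.log1p(b * x)


# Hypothetical tolerance points, for illustration only.
a, b = calibrate(1000, 5.0, 4000, 12.0)

# A linear rule anchored at the 1000-word point, for comparison.
linear_at_4000 = 5.0 * 4000 / 1000

print(round(tolerance(1000, a, b), 6), round(tolerance(4000, a, b), 6), linear_at_4000)
```

With these made-up points the fitted curve allows 12 errors at 4000 words, whereas linear extrapolation from the 1000-word anchor would allow 20. A solution with $b > 0$ exists only when $x_1 / x_2 < E_1 / E_2 < 1$ for $x_1 < x_2$, which is why the function checks for a sign change first.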
MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents
Authors: Tamil Sudaravan Mohan Doss, Michael Xu, Sudha Rao, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel
First: 2026-01-08T18:39:52+00:00 · Latest: 2026-01-08T18:39:52+00:00
Abstract
We present MineNPC-Task, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world Minecraft. Rather than relying on synthetic prompts, tasks are elicited from formative and summative co-play with expert players, normalized into parametric templates with explicit preconditions and dependency structure, and paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan/act/memory events (including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts) and reports outcomes relative to the total number of attempted subtasks, derived from in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory/tool handling, referencing, and navigation, alongside recoveries supported by mixed-initiative clarifications and lightweight memory. Participants rated interaction quality and interface usability positively, while highlighting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and harness to support transparent, reproducible evaluation of future memory-aware embodied agents.
中文标题/摘要
标题:MineNPC-Task:面向记忆意识Minecraft代理的任务套件
我们提出了MineNPC-Task,一种用户编写的基准测试和评估框架,用于测试开放世界Minecraft中的记忆意识、混合主动性LLM代理。该框架不依赖于合成提示,而是通过与专家玩家的形成性及总结性共玩来引发任务,将这些任务规范化为具有显式先决条件和依赖结构的参数化模板,并配以在有限知识政策下的机器可验证验证器,该政策禁止使用世界外的捷径。该框架捕捉计划/行动/记忆事件,包括计划预览、目标澄清、记忆读写、先决条件检查和修复尝试,并根据尝试的子任务总数报告结果,这些结果源自于世内的证据。作为初步快照,我们使用GPT-4o实例化了该框架,并在8名经验丰富的玩家中评估了216个子任务。我们观察到代码执行、库存/工具处理、引用和导航中的反复出现的故障模式,以及通过混合主动性澄清和轻量级记忆支持的恢复。参与者对交互质量和界面易用性给予了积极评价,同时指出了需要更强的记忆持久性以跨越任务。我们发布了完整的任务套件、验证器、日志和框架,以支持未来记忆意识实体代理的透明、可重复评估。
Summary / 总结
The research introduces MineNPC-Task, a benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in Minecraft. Tasks are derived from co-play with expert players, structured into parametric templates with explicit preconditions and dependencies, and paired with machine-checkable validators. The harness logs plan, act, and memory events and reports outcomes relative to the number of attempted subtasks. An initial evaluation with GPT-4o on 216 subtasks across 8 experienced players revealed recurring breakdowns in code execution, inventory/tool handling, referencing, and navigation; participants rated interaction quality and usability positively but noted the need for stronger memory persistence across tasks. The task suite, validators, logs, and harness are publicly released.
研究引入了MineNPC-Task,用于测试记忆感知的混合主动性LLM代理在Minecraft中的表现。任务源自专家协作,规范化为模板,并配以验证器。研究使用GPT-4o评估了8名经验丰富的玩家完成的216个子任务,发现代码执行、库存处理、引用和导航等方面的问题。参与者对交互质量和界面易用性给予了积极评价,但也指出需要更强的记忆持久性。完整任务套件、验证器、日志和框架已发布,以支持透明和可重复的评估。
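The idea of a subtask template with explicit preconditions and a machine-checkable validator, scored against attempted subtasks, can be sketched as below. All names here (`Subtask`, `evaluate`, the inventory keys) are hypothetical and are not taken from the released harness. Treating a subtask whose preconditions fail as "not attempted" is also an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

World = Dict[str, int]  # hypothetical in-world inventory snapshot


@dataclass
class Subtask:
    name: str
    preconditions: List[Callable[[World], bool]]  # checked on the state before acting
    validator: Callable[[World], bool]            # checked on in-world evidence afterwards


def evaluate(subtasks: List[Subtask], before: World, after: World) -> Tuple[int, int]:
    """Return (passed, attempted), counting only subtasks whose preconditions held."""
    passed = attempted = 0
    for task in subtasks:
        if not all(check(before) for check in task.preconditions):
            continue  # assumed policy: unmet preconditions mean "not attempted"
        attempted += 1
        passed += int(task.validator(after))
    return passed, attempted


def has(item: str, count: int) -> Callable[[World], bool]:
    """Build a check that the world holds at least `count` of `item`."""
    return lambda world: world.get(item, 0) >= count


subtasks = [
    Subtask("craft_planks", [has("oak_log", 1)], has("oak_planks", 4)),
    Subtask("craft_sticks", [has("oak_log", 1)], has("stick", 4)),
    Subtask("mine_diamond", [has("iron_pickaxe", 1)], has("diamond", 1)),
]

before = {"oak_log": 4}
after = {"oak_planks": 16}

result = evaluate(subtasks, before, after)
print(result)
```

Validators here read only the world state, which mirrors the abstract's requirement that outcomes be derived from in-world evidence rather than from the agent's own claims.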
Internal Representations as Indicators of Hallucinations in Agent Tool Selection
Authors: Kait Healy, Bharathi Srinivasan, Visakh Madathil, Jing Wu
First: 2026-01-08T18:38:45+00:00 · Latest: 2026-01-08T18:38:45+00:00
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in tool calling and tool usage, but suffer from hallucinations where they choose incorrect tools, provide malformed parameters, and exhibit 'tool bypass' behavior by performing simulations and generating outputs instead of invoking specialized tools or external systems. This undermines the reliability of LLM-based agents in production systems, as it leads to inconsistent results and bypasses security and audit controls. Such hallucinations in agent tool selection require early detection and error handling. Unlike existing hallucination detection methods that require multiple forward passes or external validation, we present a computationally efficient framework that detects tool-calling hallucinations in real time by leveraging LLMs' internal representations during the same forward pass used for generation. We evaluate this approach on reasoning tasks across multiple domains, demonstrating strong detection performance (up to 86.4% accuracy) while maintaining real-time inference capabilities with minimal computational overhead. The approach particularly excels at detecting parameter-level hallucinations and inappropriate tool selections, which are critical for reliable agent deployment.
中文标题/摘要
标题:代理工具选择中的内部表示作为幻觉指标
大型语言模型(LLMs)在工具调用和使用方面表现出色,但在选择错误工具、提供不正确的参数和通过模拟生成输出而不是调用专门工具或外部系统方面存在幻觉问题。这削弱了基于LLM的代理在生产系统中的可靠性,导致结果不一致,并绕过了安全和审计控制。代理工具选择中的这种幻觉需要早期检测和错误处理。不同于现有的需要多次前向传递或外部验证的幻觉检测方法,我们提出了一种计算效率高的框架,通过利用LLMs在生成过程中同一前向传递期间的内部表示来实时检测工具调用幻觉。我们在多个领域的推理任务上评估了这种方法,展示了强大的检测性能(最高可达86.4%的准确率),同时保持了实时推理能力,计算开销最小,特别擅长检测参数级幻觉和不适当的工具选择,这对于可靠的代理部署至关重要。
Summary / 总结
The paper addresses hallucinations in Large Language Models (LLMs) during tool selection, which can lead to unreliable results and bypass security controls. It introduces a computationally efficient framework that detects these hallucinations in real time by analyzing the LLMs' internal representations during the same forward pass used for generation. The method reaches up to 86.4% detection accuracy on reasoning tasks across multiple domains, and is particularly strong on parameter-level hallucinations and inappropriate tool selections, while maintaining real-time inference with minimal computational overhead.
研究旨在解决大型语言模型在工具选择过程中出现幻觉的问题,这可能导致代理行为不可靠。研究引入了一种高效框架,利用大型语言模型在生成过程中的内部表示来实时检测幻觉。该方法在多个领域的推理任务上最高可达86.4%的检测准确率,尤其擅长检测参数级幻觉和不适当的工具选择,同时保持了实时推理能力,并且计算开销很小。
Belief Is All You Need: Modeling Narrative Archetypes in Conspiratorial Discourse
Authors: Soorya Ram Shimgekar, Abhay Goyal, Roy Ka-Wei Lee, Koustuv Saha, Pi Zonooz, Navin Kumar
First: 2025-12-10T21:51:16+00:00 · Latest: 2026-01-08T18:34:35+00:00
Abstract
Conspiratorial discourse is increasingly embedded within digital communication ecosystems, yet its structure and spread remain difficult to study. This work analyzes conspiratorial narratives in Singapore-based Telegram groups, showing that such content is woven into everyday discussions rather than confined to isolated echo chambers. We propose a two-stage computational framework. First, we fine-tune RoBERTa-large to classify messages as conspiratorial or not, achieving an F1-score of 0.866 on 2,000 expert-labeled messages. Second, we build a signed belief graph in which nodes represent messages and edge signs reflect alignment in belief labels, weighted by textual similarity. We introduce a Signed Belief Graph Neural Network (SiBeGNN) that uses a Sign Disentanglement Loss to learn embeddings that separate ideological alignment from stylistic features. Using hierarchical clustering on these embeddings, we identify seven narrative archetypes across 553,648 messages: legal topics, medical concerns, media discussions, finance, contradictions in authority, group moderation, and general chat. SiBeGNN yields stronger clustering quality (cDBI = 8.38) than baseline methods (13.60 to 67.27), supported by 88 percent inter-rater agreement in expert evaluations. Our analysis shows that conspiratorial messages appear not only in clusters focused on skepticism or distrust, but also within routine discussions of finance, law, and everyday matters. These findings challenge common assumptions about online radicalization by demonstrating that conspiratorial discourse operates within ordinary social interaction. The proposed framework advances computational methods for belief-driven discourse analysis and offers applications for stance detection, political communication studies, and content moderation policy.
中文标题/摘要
标题:信念即足矣:在阴谋论话语中建模叙事原型
阴谋论话语越来越多地嵌入数字通信生态系统中,但其结构和传播仍然难以研究。本研究分析了基于新加坡Telegram群组中的阴谋论叙述,表明此类内容融入了日常讨论,而非局限于孤立的回声室中。我们提出了一种两阶段的计算框架。首先,我们对RoBERTa-large进行微调,以分类信息为阴谋论或非阴谋论,使用2,000条专家标注信息,F1分数达到0.866。其次,我们构建了一个带符号的信念图,节点代表信息,边的符号反映信念标签的一致性,权重由文本相似度决定。我们引入了一种带符号信念图神经网络(SiBeGNN),使用符号解纠缠损失来学习将意识形态一致性与风格特征分离的嵌入。通过这些嵌入进行层次聚类,我们识别出553,648条信息中的七个叙述原型:法律主题、医疗关切、媒体讨论、金融、权威矛盾、群体管理以及一般聊天。SiBeGNN的聚类质量(cDBI = 8.38)优于基线方法(13.60到67.27),并得到88%的专家评价的一致性支持。我们的分析表明,阴谋论信息不仅出现在关注怀疑或不信任的聚类中,还出现在金融、法律和日常事务的常规讨论中。这些发现挑战了关于在线激进化的一些常见假设,表明阴谋论话语在普通社会互动中运作。所提出的方法推进了信念驱动话语分析的计算方法,并为立场检测、政治传播研究和内容审核政策提供了应用。
Summary / 总结
This study examines conspiratorial narratives in Singapore-based Telegram groups, showing that such content is integrated into everyday discussions. A two-stage computational framework is proposed: RoBERTa-large is fine-tuned for classification (F1 = 0.866), and a Signed Belief Graph Neural Network (SiBeGNN) is used to identify seven narrative archetypes across 553,648 messages. SiBeGNN yields stronger clustering quality than baseline methods (cDBI = 8.38 versus 13.60 to 67.27), supported by 88 percent inter-rater agreement in expert evaluations. Conspiratorial messages appear within routine discussions as well as in clusters focused on distrust, which challenges the notion of isolated echo chambers.
该研究分析了新加坡Telegram群组中阴谋论话语的结构和传播,提出了一种两阶段计算框架。首先,对RoBERTa-large进行微调以分类信息,F1得分为0.866。其次,开发了Signed Belief Graph Neural Network (SiBeGNN) 来识别553,648条消息中的七个叙事原型,显示出比基线方法更强的聚类质量。研究发现,阴谋论信息不仅出现在怀疑或不信任的群组中,还融入了日常讨论,挑战了孤立回音室的常见假设,强调了在信念驱动话语分析中需要先进的计算方法。
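The signed belief graph described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' construction: the abstract says only that nodes are messages, edge signs reflect alignment in belief labels, and edges are weighted by textual similarity. The Jaccard token-overlap similarity, the `min_sim` threshold, and all function names here are hypothetical stand-ins.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for any text similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def build_signed_graph(messages, labels, min_sim=0.2):
    """Return edges (i, j, sign, weight) between sufficiently similar messages.

    sign is +1 when both messages share a belief label and -1 otherwise;
    weight is the textual similarity. min_sim is an assumed sparsity cutoff.
    """
    edges = []
    for i, j in combinations(range(len(messages)), 2):
        weight = jaccard(messages[i], messages[j])
        if weight >= min_sim:
            sign = 1 if labels[i] == labels[j] else -1
            edges.append((i, j, sign, weight))
    return edges


# Hypothetical messages and labels (1 = conspiratorial, 0 = not).
msgs = [
    "the vaccine data is hidden",
    "the vaccine data is public",
    "lunch at the hawker centre today",
]
labels = [1, 0, 0]
edges = build_signed_graph(msgs, labels)
```

In this toy input only the first two messages are similar enough to be connected, and because their labels differ the edge is negative. A graph neural network such as SiBeGNN would then consume edges of this kind; that part is not sketched here.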
From Policy to Logic for Efficient and Interpretable Coverage Assessment
Authors: Rhitabrat Pokharel, Hamid Reza Hassanzadeh, Ameeta Agrawal
Venue: AAAI 2026
First: 2026-01-03T19:24:51+00:00 · Latest: 2026-01-08T18:28:40+00:00
Comments: Accepted at AIMedHealth @ AAAI 2026
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in interpreting lengthy, complex legal and policy language. However, their reliability can be undermined by hallucinations and inconsistencies, particularly when analyzing subjective and nuanced documents. These challenges are especially critical in medical coverage policy review, where human experts must be able to rely on accurate information. In this paper, we present an approach designed to support human reviewers by making policy interpretation more efficient and interpretable. We introduce a methodology that pairs a coverage-aware retriever with symbolic rule-based reasoning to surface relevant policy language, organize it into explicit facts and rules, and generate auditable rationales. This hybrid system minimizes the number of LLM inferences required which reduces overall model cost. Notably, our approach achieves a 44% reduction in inference cost alongside a 4.5% improvement in F1 score, demonstrating both efficiency and effectiveness.
中文标题/摘要
标题:从政策到逻辑:高效可解释的覆盖评估
大型语言模型(LLMs)在解释长篇复杂的法律和政策语言方面表现出强大的能力。然而,它们的可靠性可能会受到幻觉和不一致性的损害,特别是在分析主观和细腻的文件时。这些挑战在医疗覆盖政策审查中尤为关键,因为人类专家必须依赖准确的信息。在本文中,我们提出了一种方法,旨在通过使政策解释更高效和可解释来支持人类审查员。我们介绍了一种方法,该方法将覆盖感知检索器与符号规则推理相结合,以突出显示相关政策语言,将其组织成明确的事实和规则,并生成可审计的理由。这种混合系统减少了所需的LLM推理次数,从而降低了整体模型成本。值得注意的是,我们的方法在推理成本上减少了44%,同时F1分数提高了4.5%,既提高了效率又提高了效果。
Summary / 总结
This paper addresses the challenges of interpreting complex medical coverage policies using Large Language Models (LLMs), which can suffer from hallucinations and inconsistencies. The authors propose a hybrid system that combines a coverage-aware retriever with symbolic rule-based reasoning to make policy interpretation more efficient and interpretable. By minimizing the number of LLM inferences required, the system achieves a 44% reduction in inference cost while also improving the F1 score by 4.5%.
本文解决了使用大型语言模型(LLMs)解释复杂医疗覆盖政策时面临的幻觉和不一致性问题。为提高可靠性和效率,作者提出了一种结合覆盖感知检索器和符号规则推理的混合系统。该方法通过减少所需的LLM推理次数,将推理成本降低了44%,同时F1分数提高了4.5%。
Stock Market Price Prediction using Neural Prophet with Deep Neural Network
Authors: Navin Chhibber, Suneel Khemka, Navneet Kumar Tyagi, Rohit Tewari, Bireswar Banerjee, Piyush Ranjan
First: 2026-01-08T18:24:22+00:00 · Latest: 2026-01-08T18:24:22+00:00
Abstract
Stock market price prediction is a significant interdisciplinary research domain that lies at the intersection of finance, statistics, and economics. Accurately predicting stock prices has always been a focal point for researchers. However, existing statistical approaches for time-series prediction often fail to effectively forecast the probability range of future stock prices. To address this problem, the Neural Prophet with a Deep Neural Network (NP-DNN) is proposed to predict stock market prices. The preprocessing technique used in this research is Z-score normalization, which normalizes stock price data by removing scale differences, making patterns easier to detect. Missing value imputation fills gaps in historical data, enhancing the model's use of complete information for more accurate predictions. The Multi-Layer Perceptron (MLP) learns complex nonlinear relationships among stock market prices and extracts hidden patterns from the input data, thereby creating meaningful feature representations for better prediction accuracy. The proposed NP-DNN model achieved an accuracy of 99.21% when compared with other approaches that use the Fused Large Language Model. Keywords: deep neural network, forecasting stock prices, multi-layer perceptron, neural prophet, stock market price prediction.
中文标题/摘要
标题:使用深度神经网络的神经先知进行股票市场价格预测
股票市场价格预测是处于金融、统计和经济学交叉点的重要跨学科研究领域。准确预测股票价格一直是研究人员的关注点。然而,现有的时间序列预测统计方法往往无法有效预测未来股票价格的概率范围。因此,为了解决这个问题,提出了使用深度神经网络的神经先知(NP-DNN)来预测股票市场价格。本研究中使用的预处理技术是Z分数标准化,通过消除数据规模差异来标准化股票价格数据,使模式更容易被检测。缺失值插补填补了历史数据中的空白,增强了模型使用完整信息进行更准确预测的能力。多层感知机(MLP)学习股票市场价格之间的复杂非线性关系,并从输入数据中提取隐藏模式,从而创建更有意义的特征表示,以提高预测准确性。与使用融合大型语言模型(Fused Large Language Model)的其他方法相比,所提出的NP-DNN模型取得了99.21%的准确率。关键词:深度神经网络,预测股票价格,多层感知机,神经先知,股票市场价格预测。
Summary / 总结
The research aims to improve the accuracy of stock market price prediction by proposing a Neural Prophet with a Deep Neural Network (NP-DNN) model. The method includes Z-score normalization for data preprocessing and missing value imputation to handle incomplete historical data. The Multi-Layer Perceptron (MLP) is used to learn complex nonlinear relationships and extract hidden patterns. The proposed model achieved an accuracy of 99.21%, outperforming other approaches.
研究旨在通过结合神经先知者(Neural Prophet)与深度神经网络(NP-DNN)来提高股票市场价格预测的准确性。方法包括使用Z-分数标准化预处理数据和填补缺失值以确保信息完整。多层感知器(MLP)用于学习复杂的非线性关系并提取隐藏模式。提出的NP-DNN模型在预测股票价格方面的准确率为99.21%,超过了其他方法。
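The two preprocessing steps named in the abstract above, missing value imputation and Z-score normalization, can be illustrated with a short sketch. The abstract does not say which imputation rule is used, so forward fill is an assumption here, and the function names and sample prices are hypothetical.

```python
from statistics import mean, pstdev


def forward_fill(series):
    """Replace None with the most recent observed value.

    Leading gaps take the first observed value. Forward fill is an assumed
    imputation rule; the abstract does not specify one.
    """
    first = next((x for x in series if x is not None), None)
    if first is None:
        raise ValueError("series has no observed values")
    filled, last = [], first
    for x in series:
        if x is not None:
            last = x
        filled.append(last)
    return filled


def z_score(series):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


# Hypothetical closing prices with two gaps.
prices = [100.0, None, 102.0, 104.0, None, 110.0]
filled = forward_fill(prices)
normalized = z_score(filled)
```

In a forecasting setting the mean and standard deviation would normally be computed on the training window only and reused for later data, so that no future information leaks into the inputs.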
Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald
First: 2026-01-08T18:23:03+00:00 · Latest: 2026-01-08T18:23:03+00:00
Abstract
Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
中文标题/摘要
标题:视觉语言模型中提示诱发幻觉的机制
大型视觉语言模型(VLMs)虽然功能强大,但经常通过优先考虑文本提示而不是视觉证据来产生幻觉。我们在一个受控的对象计数设置中研究了这种失败模式,其中提示夸大了图像中的对象数量(例如,要求模型描述四朵睡莲,而实际上只有三朵)。在对象数量较低时,模型通常会纠正这种高估,但随着对象数量的增加,它们越来越倾向于遵循提示,而不管与实际情况的差异。通过对三种VLMs的机制分析,我们确定了一小组注意力头,对其进行消融可以将提示诱发幻觉(PIH)减少至少40%而无需额外训练。在不同模型中,PIH头以模型特定的方式介导提示复制。我们描述了这些差异,并表明PIH消融增加了对视觉证据的纠正。我们的研究结果提供了关于提示诱发幻觉内部机制的见解,揭示了这些行为在不同模型中的特定差异。
Summary / 总结
This study investigates the mechanism of prompt-induced hallucination in vision-language models (VLMs) by examining their object-counting performance. The research finds that as the number of objects in an image increases, VLMs increasingly conform to the prompt's overstatement, leading to hallucinations. By analyzing the attention mechanisms of three VLMs, the study identifies specific attention heads that, when removed, significantly reduce hallucinations by at least 40% without additional training. The findings suggest that these heads are crucial for prompt copying and that their ablation enhances the model's reliance on visual evidence for correction.
研究通过观察视觉-语言模型在物体计数任务中的表现,探讨了提示诱导幻觉的机制。研究发现,随着物体数量的增加,模型更倾向于遵循提示而非视觉证据。通过对三种视觉-语言模型的分析,研究确定了一些特定的注意力头,移除这些头可以显著减少提示诱导幻觉至少40%,无需额外训练。研究结果揭示了不同模型在实现这些行为方面的差异,并表明针对这些头可以提高模型与视觉证据的一致性。
An interpretable data-driven approach to optimizing clinical fall risk assessment
Authors: Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Holley Farley, Kimia Ghobadi
First: 2026-01-08T18:17:31+00:00 · Latest: 2026-01-08T18:17:31+00:00
Comments: arXiv admin note: substantial text overlap with arXiv:2510.20714
Abstract
In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models to reweight the JHFRAT scoring weights, while preserving its additive structure and clinical thresholds. Recalibration refers to adjusting item weights so that the resulting score can order encounters more consistently by the study's risk labels, without changing the tool's form factor or deployment workflow. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). This performance improvement translates to protecting an additional 35 high-risk patients per week across the Johns Hopkins Health System. The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost) improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labeling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.
中文标题/摘要
标题:一种可解释的数据驱动方法以优化临床跌倒风险评估
在本研究中,我们旨在通过数据驱动建模方法更好地使约翰霍普金斯跌倒风险评估工具(JHFRAT)的跌倒风险预测与额外的临床有意义的指标相一致。我们对2022年3月至2023年10月期间约翰霍普金斯健康系统三家医院的54,209例住院病例进行了回顾性队列分析。共有20,208例住院病例被纳入高跌倒风险事件,13,941例被纳入低跌倒风险事件。为了融入临床知识并保持可解释性,我们使用约束评分优化(CSO)模型重新加权JHFRAT评分权重,同时保持其加性结构和临床阈值。校准是指调整项目权重,使所得评分能够更一致地按研究的风险标签对事件进行排序,而不改变工具的形式因素或部署工作流程。该模型在预测性能上显著优于当前的JHFRAT(CSO AUC-ROC=0.91,JHFRAT AUC-ROC=0.86)。这种性能改进相当于每周在约翰霍普金斯健康系统中额外保护35名高风险患者。约束评分优化模型在有和没有EHR变量的情况下表现相似。尽管基准黑盒模型(XGBoost)在知识驱动的约束逻辑回归的基础上提高了性能指标(AUC-ROC=0.94),但CSO在风险标签变化方面表现出更强的稳健性。这种基于证据的方法为医疗机构提供了一个坚实的基础,以系统地增强住院跌倒预防协议和患者安全,利用数据驱动优化技术,从而在医疗保健环境中改善风险评估和资源分配。
Summary / 总结
This study aims to improve the predictive performance of the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) by incorporating clinically meaningful measures through constrained score optimization (CSO) models. A retrospective cohort analysis of 54,209 inpatient admissions showed that the CSO model significantly improved predictive performance (AUC-ROC=0.91) compared to the current JHFRAT (AUC-ROC=0.86), leading to better risk ordering and protection of additional high-risk patients. The CSO models maintained interpretability and robustness, even without using electronic health record (EHR) variables, and provided a robust foundation for enhancing inpatient fall prevention protocols in healthcare settings.
本研究旨在通过引入临床知识并保持可解释性,使用约束评分优化(CSO)模型来改进约翰霍普金斯跌倒风险评估工具(JHFRAT)的预测性能。对54,209名住院患者的回顾性队列分析显示,CSO模型在预测性能(AUC-ROC=0.91)上显著优于当前的JHFRAT(AUC-ROC=0.86),每周额外保护了35名高风险患者。CSO模型在有和没有电子健康记录(EHR)变量的情况下表现相似,显示出对风险标签变异的鲁棒性。
LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation
Authors: Samy Haffoudhi, Fabian M. Suchanek, Nils Holzenberger
First: 2026-01-08T18:15:34+00:00 · Latest: 2026-01-08T18:15:34+00:00
Abstract
Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.
中文标题/摘要
标题:LELA:基于大语言模型的零样本领域自适应实体链接方法
实体链接(将文本中含糊指代与知识库中的实体进行映射)是知识图谱构建、问答和信息提取等任务中的基础步骤。我们的方法LELA是一种模块化的粗细粒度方法,利用大语言模型(LLM)的能力,并且可以在不同的目标领域、知识库和LLM上工作,无需任何微调阶段。我们在各种实体链接设置下的实验表明,LELA在与微调方法的竞争中表现出色,并且显著优于未微调的方法。
Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable
Authors: Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam
First: 2026-01-08T18:13:46+00:00 · Latest: 2026-01-08T18:13:46+00:00
Abstract
When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography. Why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The decision happens in under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while keeping 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines.
中文标题/摘要
标题:降低AI研究成本:任务感知压缩如何使大型语言模型代理负担得起
当研究人员使用大型语言模型进行自主任务,如文献审查或生成假设时,计算费用会迅速增加。使用一个700亿参数模型的一次研究会话可能需要大约127美元的云费用,使这些工具无法为许多学术实验室所用。我们开发了AgentCompress来直接解决这个问题。核心思想源于我们在工作中的一个简单观察:撰写新的假设比重新格式化参考文献需要模型更多的能力。为什么这两个任务都应该以全精度运行?我们的系统使用一个小的神经网络,根据每个新任务的开头词语来判断任务的难度,然后将其路由到一个适当压缩的模型变体。这个决定在不到一毫秒内完成。在四个科学领域的500个研究工作流中进行测试,我们计算成本降低了68.3%,同时保持了96.2%的原始成功率。对于那些关注预算的实验室来说,这可能意味着开展实验与袖手旁观之间的差别。
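The routing idea in the AgentCompress abstract above (score a task's difficulty from its opening words, then pick a compressed model variant) can be sketched as below. Everything concrete here is an assumption for illustration: the abstract describes a small neural network as the difficulty predictor, whereas this sketch substitutes a keyword heuristic, and the cue list, thresholds, and variant names are hypothetical.

```python
# Hypothetical cue words that suggest a demanding task.
HARD_CUES = {"hypothesize", "propose", "design", "derive", "prove"}

# (exclusive upper difficulty bound, variant name); names are illustrative only.
VARIANTS = [(0.33, "int4"), (0.66, "int8"), (1.01, "fp16")]


def difficulty(task: str, n_opening_words: int = 8) -> float:
    """Toy difficulty score in [0, 1] computed from the first few words only.

    Stands in for the small neural predictor described in the abstract.
    """
    opening = task.lower().split()[:n_opening_words]
    if not opening:
        return 0.0
    hits = sum(1 for word in opening if word.strip(".,:;") in HARD_CUES)
    return min(1.0, hits / 2)


def route(task: str) -> str:
    """Pick the most compressed variant whose bound exceeds the score."""
    score = difficulty(task)
    for upper, name in VARIANTS:
        if score < upper:
            return name
    return VARIANTS[-1][1]


easy = route("Reformat this bibliography into APA style")
medium = route("Design a table of results")
hard = route("Propose and derive a novel hypothesis about protein folding")
```

The design point is that the router only inspects a short prefix, so its cost stays negligible next to the model call it is choosing between.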
SimuAgent: An LLM-Based Simulink Modeling Assistant Enhanced with Reinforcement Learning
Authors: Yanchang Liang, Xiaowei Zhao
First: 2026-01-08T18:10:35+00:00 · Latest: 2026-01-08T18:10:35+00:00
Abstract
Large language models (LLMs) have revolutionized text-based code automation, but their potential in graph-oriented engineering workflows remains under-explored. We introduce SimuAgent, an LLM-powered modeling and simulation agent tailored for Simulink. SimuAgent replaces verbose XML with a concise, dictionary-style Python representation, dramatically cutting token counts, improving interpretability, and enabling fast, in-process simulation. A lightweight plan-execute architecture, trained in two stages, equips the agent with both low-level tool skills and high-level design reasoning. To tackle sparse rewards in long-horizon tasks, we propose Reflection-GRPO (ReGRPO), which augments Group Relative Policy Optimization (GRPO) with self-reflection traces that supply rich intermediate feedback, accelerating convergence and boosting robustness. Experiments on SimuBench, our newly released benchmark comprising 5300 multi-domain modeling tasks, show that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o when evaluated with few-shot prompting on the same benchmark. Ablations confirm that the two-stage curriculum and abstract-reconstruct data augmentation further enhance generalization. SimuAgent trains and runs entirely on-premise with modest hardware, delivering a privacy-preserving, cost-effective solution for industrial model-driven engineering. SimuAgent bridges the gap between LLMs and graphical modeling environments, offering a practical solution for AI-assisted engineering design in industrial settings.
中文标题/摘要
标题:SimuAgent:基于LLM的Simulink建模助手,增强以强化学习
大型语言模型(LLMs)已经革新了基于文本的代码自动化,但在图形导向的工程工作流中的潜力尚未充分探索。我们介绍了SimuAgent,这是一种专为Simulink设计的LLM驱动的建模和仿真代理。SimuAgent用简洁的字典风格Python表示法取代了冗长的XML,大幅减少了标记数量,提高了可解释性,并使仿真变得快速且在进程内进行。一种轻量级的计划-执行架构,经过两阶段训练,使代理具备了低级工具技能和高级设计推理能力。为应对长期任务中的稀疏奖励,我们提出了Reflection-GRPO(ReGRPO),它通过自我反思轨迹增强了Group Relative Policy Optimization(GRPO),提供了丰富的中间反馈,加速了收敛并提高了鲁棒性。在我们新发布的包含5300个多领域建模任务的SimuBench基准测试中,经过SimuAgent微调的Qwen2.5-7B模型比标准RL基线收敛更快,建模精度更高,甚至在使用少量示例提示在相同基准测试上评估时超过了GPT-4o。消融实验表明,两阶段课程和抽象重建数据增强进一步提高了泛化能力。SimuAgent完全在本地进行训练和运行,硬件要求较低,提供了一种保护隐私、成本效益高的工业模型驱动工程解决方案。SimuAgent弥合了LLMs与图形建模环境之间的差距,为工业环境中的AI辅助工程设计提供了一种实用的解决方案。
Summary / 总结
SimuAgent is an LLM-based agent designed for Simulink modeling, using a lightweight plan-execute architecture and Reflection-GRPO to enhance its performance. It replaces XML with a concise Python representation, improving interpretability and enabling fast simulation. Experiments on SimuBench show that SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o with few-shot prompting. The two-stage curriculum and abstract-reconstruct data augmentation further enhance its generalization capabilities, making it a privacy-preserving, cost-effective solution for industrial model-driven engineering.
SimuAgent 是一个基于大语言模型的 Simulink 模型设计助手,采用轻量级计划-执行架构和 Reflection-GRPO,提升其性能。它用简洁的 Python 表示法替换 XML,提高可解释性并实现快速仿真。实验表明,SimuAgent 在 SimuBench 上收敛更快,建模精度更高,甚至在少量提示下超过 GPT-4o。两阶段课程和抽象重建数据增强进一步增强了其泛化能力,使其成为工业模型驱动工程的隐私保护、成本效益解决方案。
Observations and Remedies for Large Language Model Bias in Self-Consuming Performative Loop
Authors: Yaxuan Wang, Zhongteng Cai, Yujia Bao, Xueru Zhang, Yang Liu
First: 2026-01-08T18:08:15+00:00 · Latest: 2026-01-08T18:08:15+00:00
Abstract
The rapid advancement of large language models (LLMs) has led to growing interest in using synthetic data to train future models. However, this creates a self-consuming retraining loop, where models are trained on their own outputs and may cause performance drops and induce emerging biases. In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of \textbf{S}elf-\textbf{C}onsuming \textbf{P}erformative \textbf{L}oop (\textbf{SCPL}) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops, including the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.
中文标题/摘要
标题:大型语言模型偏见的自我消耗执行循环中的观察与补救措施
大型语言模型(LLMs)的迅速发展引发了对使用合成数据进行未来模型训练的兴趣。然而,这导致了一个自我消耗的重新训练循环,模型在训练过程中使用自己的输出,可能导致性能下降并引发新的偏见。在实际应用中,之前部署的LLMs可能会影响它们生成的数据,导致由用户反馈驱动的动态系统。例如,如果模型持续未能满足某一用户群体的需求,那么来自该特定用户群体的查询数据将减少。在本研究中,我们提出了“自我消耗执行循环”(SCPL)的概念,并探讨了合成数据在这些动态迭代训练过程中如何塑造偏见的作用。这种受控的设置是由于难以获取动态生产系统的用户偏好数据,使我们能够以一种原则性的方法来隔离和分析反馈驱动的偏见演变。我们关注两种类型的循环,包括典型的重新训练设置和增量微调设置,后者尚未得到充分探索。通过三项实际任务的实验,我们发现执行循环增加了偏好偏见并减少了差异偏见。我们设计了一种基于奖励的拒绝采样策略来减轻偏见,朝着更可信赖的自我改进系统迈进。
Summary / 总结
This study addresses the issue of bias in large language models (LLMs) that arise from self-consuming performative loops, where models are trained on their own outputs. The research introduces the concept of Self-Consuming Performative Loop (SCPL) and investigates how synthetic data influences bias during dynamic iterative training. Experiments on three real-world tasks show that the performative loop increases preference bias and decreases disparate bias. The study proposes a reward-based rejection sampling strategy to mitigate these biases, aiming to enhance the trustworthiness of self-improving systems.
研究探讨了大型语言模型(LLMs)中的自我消费执行循环(SCPL),即模型在其自身输出上进行训练,导致性能下降和新兴偏见。研究引入了SCPL,并考察了合成数据在动态迭代训练过程中如何塑造偏见。实验表明,执行循环增加了偏好偏见并减少了差异偏见。研究提出了一种基于奖励的拒绝采样策略来缓解这些偏见,旨在实现更可信赖的自我改进系统。
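Reward-based rejection sampling, the mitigation named in the abstract above, generally means drawing several candidate outputs and keeping only those that a reward function scores highly enough before they re-enter training. The sketch below shows that generic pattern only; the threshold, candidate count, and the toy generator and reward are assumptions, and the paper's actual reward design is not described in the abstract.

```python
import random


def rejection_sample(generate, reward, n_candidates=8, min_reward=0.5, rng=None):
    """Draw n_candidates samples and keep those with reward >= min_reward.

    generate(rng) returns one candidate; reward(candidate) returns a float.
    Both callables and the threshold are placeholders for illustration.
    """
    rng = rng or random.Random(0)
    candidates = [generate(rng) for _ in range(n_candidates)]
    return [c for c in candidates if reward(c) >= min_reward]


# Toy setup: candidates are numbers in [0, 1) and the reward is the number itself.
kept = rejection_sample(lambda rng: rng.random(), lambda c: c)
```

In a self-consuming loop, only the kept samples would be added to the next round's synthetic training data.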
Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation
Authors: Ekaterina Fadeeva, Aleksandr Rubashevskii, Dzianis Piatrashyn, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, Maxim Panov
First: 2025-05-27T11:56:59+00:00 · Latest: 2026-01-08T18:06:58+00:00
Abstract
Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise from inaccuracies in the model's internal knowledge and the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually correct statements as hallucinations if they are not explicitly supported by the retrieval. In this paper, we introduce FRANQ, a new method for hallucination detection in RAG outputs. FRANQ applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. To evaluate FRANQ and competing UQ methods, we construct a new long-form question answering dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging cases. Extensive experiments across multiple datasets, tasks, and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing approaches.
中文标题/摘要
标题:面向事实核查的检索增强生成输出忠实度感知不确定性量化
增强检索的大语言模型(LLMs),即检索增强生成(RAG)方法,在开放域问答任务中取得了优异表现。然而,RAG 仍然容易产生幻觉:由于模型内部知识和检索上下文的不准确,可能会产生事实错误的输出。现有减轻幻觉的方法往往将事实性与检索证据的忠实度混为一谈,错误地将与检索证据未明确支持的事实正确陈述标记为幻觉。在本文中,我们提出了一种新的方法 FRANQ,用于检测 RAG 输出中的幻觉。FRANQ 应用不同的不确定性量化(UQ)技术,根据陈述是否忠实于检索上下文来估计事实性。为了评估 FRANQ 和竞争的 UQ 方法,我们构建了一个新的长形式问答数据集,该数据集同时标注了事实性和忠实度,并结合了自动标注和手动验证具有挑战性的案例。在多个数据集、任务和大语言模型上的广泛实验表明,FRANQ 在检测 RAG 生成响应中的事实错误方面比现有方法更准确。
Summary / 总结
This paper addresses the issue of hallucinations in Retrieval-Augmented Generation (RAG) outputs by introducing FRANQ, a method that uses distinct uncertainty quantification techniques to estimate factuality while considering faithfulness to the retrieved context. The authors evaluate FRANQ and other UQ methods on a new dataset annotated for both factuality and faithfulness, demonstrating that FRANQ provides more accurate detection of factual errors in RAG-generated responses than existing approaches.
本文通过引入FRANQ方法,量化不确定性并区分事实性和忠实性,来解决RAG输出中的幻觉问题。作者构建了一个新的长形式问答数据集,标注了事实性和忠实性,并证明FRANQ在各种数据集和LLM上比现有方法更准确地检测RAG生成响应中的事实错误。
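One way to read "estimate factuality, conditioning on whether a statement is faithful to the retrieved context" is as a law-of-total-probability mixture over the faithful and unfaithful cases. The sketch below shows that reading; it is an assumed interpretation of the abstract, not a confirmed statement of FRANQ's formula, and the argument names are hypothetical.

```python
def factuality_estimate(p_faithful, p_true_if_faithful, p_true_if_unfaithful):
    """Combine two conditional factuality estimates by total probability.

    p_faithful: probability the statement is supported by the retrieved context.
    p_true_if_faithful / p_true_if_unfaithful: factuality estimates produced by
    (possibly different) uncertainty quantification methods for each case.
    """
    for p in (p_faithful, p_true_if_faithful, p_true_if_unfaithful):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return (
        p_faithful * p_true_if_faithful
        + (1.0 - p_faithful) * p_true_if_unfaithful
    )


# Hypothetical numbers: likely faithful, with a weaker fallback estimate.
score = factuality_estimate(0.8, 0.9, 0.5)
```

Under this reading, a statement that is not supported by the retrieval is not automatically flagged: it is scored by the unfaithful-case estimator instead, which matches the abstract's point that factuality and faithfulness should not be conflated.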
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Authors: Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong
First: 2026-01-08T18:00:59+00:00 · Latest: 2026-01-08T18:00:59+00:00
Comments: Project page: https://ivul-kaust.github.io/projects/videoauto-r1/
Abstract
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.
中文标题/摘要
标题:VideoAuto-R1:通过一次思考,两次回答进行视频自动推理
链式思考(CoT)推理已成为多模态大型语言模型在视频理解任务中的一种强大工具。然而,其必要性及其与直接回答相比的优势尚未得到充分探索。在本文中,我们首先证明,对于通过强化学习训练的视频模型,直接回答往往能够匹配甚至超越CoT的表现,尽管CoT以更高的计算成本生成逐步分析。受此启发,我们提出了一种VideoAuto-R1视频理解框架,采用一种必要时才推理的策略。在训练过程中,我们的方法遵循一次思考,两次回答的模式:模型首先生成一个初始答案,然后进行推理,最后输出一个审查后的答案。两个答案都通过可验证的奖励进行监督。在推理过程中,模型使用初始答案的置信度分数来决定是否进行推理。在视频问答和定位基准测试中,VideoAuto-R1实现了最先进的准确率,显著提高了效率,平均响应长度减少了约3.3倍,例如,从149个词减少到仅44个词。此外,我们观察到,在感知导向的任务中,思考模式的激活率较低,而在推理密集型任务中,激活率较高。这表明显式的基于语言的推理通常是有益的,但并非总是必要的。
Summary / 总结
The paper examines whether chain-of-thought (CoT) reasoning is necessary in video understanding tasks and introduces VideoAuto-R1, a framework that reasons only when necessary. During training, VideoAuto-R1 generates an initial answer, performs reasoning, and then outputs a reviewed answer, with both the initial and the reviewed answer supervised by verifiable rewards. During inference, it decides whether to reason based on the confidence of the initial answer. VideoAuto-R1 achieves state-of-the-art accuracy while reducing average response length by about 3.3x (for example, from 149 to 44 tokens), and shows that reasoning is generally beneficial but not always required.
论文探讨了链式思考(CoT)推理在视频理解任务中的必要性,并提出了VideoAuto-R1框架,该框架仅在必要时进行推理。在训练过程中,模型生成初始答案,进行推理,并输出审查后的答案,监督由可验证奖励完成。在推理过程中,模型根据初始答案的置信度决定是否进行推理。VideoAuto-R1实现了最先进的准确率,并提高了效率,响应长度减少了3.3倍。该框架表明,推理通常是有益的,但在感知导向的任务中通常不是必需的。
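The inference-time behavior described above (use the confidence of the initial answer to decide whether to proceed with reasoning) amounts to a simple gate. The sketch below shows that control flow with stub models; the threshold value, function names, and stubs are assumptions, and how the confidence score is actually computed is not specified in the abstract.

```python
from typing import Callable, Tuple


def answer_with_optional_reasoning(
    question: str,
    direct_answer: Callable[[str], Tuple[str, float]],
    reason_then_answer: Callable[[str, str], str],
    threshold: float = 0.9,
) -> Tuple[str, bool]:
    """Return (answer, used_reasoning).

    direct_answer yields an initial answer and its confidence; reasoning is
    invoked only when that confidence falls below the assumed threshold.
    """
    initial, confidence = direct_answer(question)
    if confidence >= threshold:
        return initial, False
    return reason_then_answer(question, initial), True


# Stub models for illustration only.
def confident_model(question: str) -> Tuple[str, float]:
    return "B", 0.97


def unsure_model(question: str) -> Tuple[str, float]:
    return "A", 0.40


def reasoner(question: str, initial: str) -> str:
    return "C"  # a reviewed answer produced after reasoning


fast = answer_with_optional_reasoning("What colour is the car?", confident_model, reasoner)
slow = answer_with_optional_reasoning("Why did the car stop?", unsure_model, reasoner)
```

The shorter average response length reported in the abstract follows from this gate: confident answers skip the reasoning tokens entirely.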
FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts
Authors: Yiji Zhao, Zihao Zhong, Ao Wang, Haomin Wen, Ming Jin, Yuxuan Liang, Huaiyu Wan, Hao Wu
Venue: KDD 2026
First: 2026-01-08T18:00:58+00:00 · Latest: 2026-01-08T18:00:58+00:00
Comments: Accepted to KDD 2026
Abstract
Spatial-Temporal Graph (STG) forecasting on large-scale networks has garnered significant attention. However, existing models predominantly focus on short-horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long-horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity-aware Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting, which unlocks one-week-ahead (672 steps at a 15-minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self-attention modules when applied to large-scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real-world datasets demonstrate that FaST not only delivers superior long-horizon predictive accuracy but also achieves remarkable computational efficiency compared to state-of-the-art baselines. Our source code is available at: https://github.com/yijizhao/FaST.
中文标题/摘要
标题:FaST:基于专家混合的大型时空图长时预测高效框架
大型网络上的时空图(STG)预测引起了广泛关注。然而,现有模型主要关注短期预测,并在扩展到长期预测和大型图时遭受严重的计算成本和内存消耗问题。为应对上述挑战,我们提出了FaST,一种基于异质性感知专家混合(MoEs)的高效且有效的框架,用于长时和大规模STG预测,该框架能够在数千个节点的情况下实现提前一周(以15分钟粒度计算的672步)的预测。FaST基于两项关键创新。首先,提出了一种自适应图代理注意力机制,以缓解在大型图上应用常规图卷积和自注意力模块时固有的计算负担。其次,我们提出了一种新的并行MoE模块,用门控线性单元(GLUs)取代传统的前馈网络,从而实现高效且可扩展的并行结构。在真实世界数据集上的广泛实验表明,FaST不仅在长期预测准确性上表现出色,而且在计算效率上也显著优于最先进的基线。我们的源代码可在 https://github.com/yijizhao/FaST 获取。
Summary / 总结
FaST is designed to address the challenges of long-horizon forecasting on large-scale spatial-temporal graphs by proposing an adaptive graph agent attention mechanism and a parallel Mixture-of-Experts module with Gated Linear Units. This framework achieves one-week-ahead predictions with thousands of nodes while maintaining superior predictive accuracy and computational efficiency compared to existing methods.
FaST 是一种针对大规模时空图进行长期预测的有效框架,通过引入自适应图代理注意力机制和带有门控线性单元的并行混合专家模块来解决计算和内存挑战。实验表明,FaST 在提前一周的预测中不仅在预测准确性上优于现有方法,还在计算效率上表现出色。
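The horizon quoted for FaST (672 steps for one week at 15-minute granularity) follows from simple arithmetic, checked here in a short sketch. The helper name `horizon_steps` is hypothetical and used only for illustration.

```python
MINUTES_PER_STEP = 15  # sampling granularity stated in the abstract


def horizon_steps(days: int, minutes_per_step: int = MINUTES_PER_STEP) -> int:
    """Number of forecasting steps needed to cover `days` full days."""
    minutes = days * 24 * 60
    return minutes // minutes_per_step


# One week ahead: 7 days * 24 hours * 4 steps per hour.
ONE_WEEK_STEPS = horizon_steps(7)
```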
CoV: Chain-of-View Prompting for Spatial Reasoning
Authors: Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang
First: 2026-01-08T17:59:42+00:00 · Latest: 2026-01-08T17:59:42+00:00
Abstract
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.
中文标题/摘要
标题:CoV:空间推理的链式视角提示
在3D环境中的具身问答(EQA)通常需要收集分布在多个视角且部分被遮挡的上下文。然而,大多数最近的视觉-语言模型(VLMs)仅限于固定且有限的输入视角集,这限制了它们在推理时获取与问题相关上下文的能力,并阻碍了复杂的空间推理。我们提出了一种名为Chain-of-View(CoV)的提示方法,这是一种无需训练、在测试时进行推理的框架,通过从粗到细的探索过程将VLM转换为主动的视角推理者。CoV首先使用视图选择代理筛选冗余帧并识别与问题对齐的锚视图,然后通过交替进行迭代推理和离散相机动作进行细粒度视图调整,从底层3D场景表示中获取新观察,直到收集到足够上下文或达到步骤预算。我们在OpenEQA上对CoV进行了评估,跨四个主流VLMs获得了平均+11.56%的LLM-Match改进,最大增益为Qwen3-VL-Flash上的+13.62%。CoV还表现出测试时的扩展性:增加最小动作预算可额外获得平均+2.51%的改进,峰值为Gemini-2.5-Flash上的+3.73%。在ScanQA和SQA3D上,CoV表现出强大的性能(例如,ScanQA上的116 CIDEr / 31.9 EM@1和SQA3D上的51.1 EM@1)。总体而言,这些结果表明,与问题对齐的视图选择结合开放视图搜索是提高3D EQA中空间推理能力的有效、模型无关的策略,无需额外训练。
Summary / 总结
The research aims to enhance embodied question answering (EQA) in 3D environments by addressing the limitations of fixed input views in vision-language models (VLMs). The proposed Chain-of-View (CoV) prompting method enables VLMs to actively explore and gather context from multiple viewpoints through a coarse-to-fine process. Evaluation on OpenEQA shows an average improvement of +11.56% in LLM-Match, with significant gains on specific models. CoV also demonstrates test-time scalability, with performance improvements observed as the action budget increases.
研究旨在通过增强视觉语言模型(VLMs)从多个视角收集上下文的能力,解决三维环境中的空间推理问题。提出的Chain-of-View(CoV)提示方法采用由粗到细的探索过程,包括视图选择和精细的视图调整。CoV在四个主流VLMs上将LLM-Match平均提高了11.56%,最高增益为Qwen3-VL-Flash上的13.62%。它还表现出测试时的扩展性:随着最小动作预算的增加,平均额外提高2.51%,峰值为Gemini-2.5-Flash上的3.73%。CoV在ScanQA和SQA3D上表现出色,证明了其在无需额外训练的情况下提高3D EQA中的空间推理的有效性。
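The coarse-to-fine loop described for CoV (select anchor views, then alternate reasoning with camera actions until the context suffices or a step budget runs out) can be sketched as follows. The function signatures, the string representation of views, and the stub agents are assumptions made for illustration; the abstract does not specify an interface.

```python
from typing import Callable, List, Tuple


def chain_of_view(
    frames: List[str],
    select_anchor_views: Callable[[List[str]], List[str]],
    reason_step: Callable[[List[str]], Tuple[bool, str]],
    apply_camera_action: Callable[[str], str],
    step_budget: int,
) -> Tuple[List[str], int]:
    """Return (collected_views, camera_steps_taken)."""
    # Coarse stage: drop redundant frames, keep question-aligned anchors.
    views = list(select_anchor_views(frames))
    steps = 0
    # Fine stage: interleave reasoning with discrete camera actions.
    while steps < step_budget:
        sufficient, action = reason_step(views)
        if sufficient:
            break
        views.append(apply_camera_action(action))
        steps += 1
    return views, steps


# Stub agents (illustrative only).
def stub_select(frames: List[str]) -> List[str]:
    return list(dict.fromkeys(frames))  # de-duplicate, keep order


def stub_reason(views: List[str]) -> Tuple[bool, str]:
    return len(views) >= 4, "turn_left"


def stub_camera(action: str) -> str:
    return f"view_after_{action}"
```

The loop stops early when the reasoning step reports enough context, and otherwise stops when the budget is exhausted, mirroring the two termination conditions named in the abstract.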
Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems
Authors: Jihao Zhao, Ding Chen, Zhaoxin Fan, Kerun Xu, Mengting Hu, Bo Tang, Feiyu Xiong, Zhiyu Li
First: 2026-01-08T17:59:11+00:00 · Latest: 2026-01-08T17:59:11+00:00
Abstract
Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.
中文标题/摘要
标题:Inside Out:演化中的用户中心核心记忆树以支持长期个性化对话系统
现有的长期个性化对话系统难以调和无界交互流与有限上下文约束之间的关系,常常受到记忆噪声累积、推理退化和人设不一致的困扰。为了解决这些挑战,本文提出Inside Out框架,利用全局维护的PersonaTree作为长期用户画像的载体。通过初始模式约束主干并更新分支和叶子,PersonaTree实现了可控增长,同时实现了记忆压缩并保持一致性。此外,我们通过基于过程的奖励进行强化学习训练了一个轻量级的MemListener,以生成结构化、可执行且可解释的{ADD, UPDATE, DELETE, NO_OP}操作,从而支持个性化树的动态演化。在响应生成过程中,PersonaTree直接被利用以在延迟敏感场景中增强输出;当用户需要更多细节时,在PersonaTree的约束下触发代理模式以按需引入细节。实验表明,PersonaTree在抑制上下文噪声和保持人设一致性方面优于全文拼接和各种个性化记忆系统。值得注意的是,小型MemListener模型在记忆操作决策性能上与强大的推理模型DeepSeek-R1-0528和Gemini-3-Pro相当,甚至超越它们。
Summary / 总结
This paper addresses the challenges of long-term personalized dialogue systems by proposing Inside Out, a framework that uses a PersonaTree to maintain user profiles. The PersonaTree is constrained by an initial schema and updated dynamically, allowing for memory compression and consistency. A lightweight MemListener trained with reinforcement learning generates structured operations to evolve the PersonaTree. Experiments demonstrate that PersonaTree outperforms other methods in reducing contextual noise and maintaining persona consistency, with the MemListener achieving performance comparable to powerful reasoning models.
本文提出Inside Out框架,使用PersonaTree来维护用户画像,解决长期个性化对话系统中的挑战。PersonaTree通过轻量级的MemListener进行更新,MemListener通过强化学习生成结构化的操作来支持动态演化。实验表明,PersonaTree在抑制上下文噪声和保持人物一致性方面优于其他方法,且MemListener的表现与强大的推理模型相当。
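To make the {ADD, UPDATE, DELETE, NO_OP} operations on a PersonaTree concrete, here is a minimal sketch that applies such operations to a nested dictionary. The operation format (`op`, `path`, `value`), the example schema, and the rule that top-level keys form the fixed trunk are assumptions for illustration only; the abstract does not describe the actual data structure.

```python
from typing import Any, Dict, List

Tree = Dict[str, Any]


def apply_operation(tree: Tree, op: Dict[str, Any]) -> None:
    """Apply one memory operation in place. Paths need at least two keys."""
    kind = op["op"]
    if kind == "NO_OP":
        return
    if kind not in ("ADD", "UPDATE", "DELETE"):
        raise ValueError(f"unknown operation: {kind}")
    trunk, *branches, leaf = op["path"]
    if trunk not in tree:
        # Assumption: the trunk is fixed by the initial schema.
        raise KeyError(f"'{trunk}' is not part of the schema")
    node = tree[trunk]
    for key in branches:
        if kind == "DELETE" and key not in node:
            return  # nothing to delete
        node = node.setdefault(key, {})
    if kind == "DELETE":
        node.pop(leaf, None)
    else:
        # ADD and UPDATE both write the leaf value in this sketch.
        node[leaf] = op["value"]


def apply_all(tree: Tree, ops: List[Dict[str, Any]]) -> Tree:
    for op in ops:
        apply_operation(tree, op)
    return tree
```

Only branches and leaves change; an operation that targets an unknown top-level key is rejected, which is one simple way to keep growth controllable.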
Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference
Authors: Rasmus Blanck, Bill Noble, Stergios Chatzikyriakidis
First: 2026-01-08T17:58:52+00:00 · Latest: 2026-01-08T17:58:52+00:00
Abstract
Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.
中文标题/摘要
标题:逆向工程自然语言推理:关于自然语言推理元推理属性的研究
自然语言推理(NLI)一直是评估语言模型自然语言理解能力的重要任务,但该任务的逻辑属性尚未得到充分理解,且常被错误描述。理解NLI所捕捉的推理概念对于解释模型在该任务上的表现至关重要。在本文中,我们提出了NLI标签集的三种可能解读,并对它们所蕴含的元推理属性进行了全面分析。以SNLI数据集为例,我们利用(1)具有相同前提的NLI项目和(2)由LLM生成的项目来评估在SNLI上训练的模型的元推理一致性,并推断数据集所编码的是逻辑关系的哪一种解读。
Summary / 总结
This paper aims to understand the logical properties of the Natural Language Inference (NLI) task, which is crucial for interpreting model performance. The authors formulate three possible readings of the NLI label set and conduct a comprehensive analysis of the meta-inferential properties. They use SNLI dataset items with shared premises and items generated by LLMs to evaluate models for meta-inferential consistency, revealing insights into the logical relations encoded by the dataset.
本文旨在通过提出NLI标签集的三种可能解读并分析其元推理属性来澄清自然语言推理(NLI)的逻辑特性。作者使用SNLI数据集,重点关注具有共享前提条件的项目和由LLM生成的项目,以评估模型的元推理一致性。主要发现包括对SNLI数据集中编码的逻辑关系的见解,有助于更好地理解NLI任务上的模型性能。
RelayLLM: Efficient Reasoning via Collaborative Decoding
Authors: Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang
First: 2026-01-08T17:56:16+00:00 · Latest: 2026-01-08T17:56:16+00:00
Abstract
The use of Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively "relaying" the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO), to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.
中文标题/摘要
标题:RelayLLM:通过协作解码实现高效推理
大型语言模型(LLMs)在进行复杂推理时往往受到高计算成本和延迟的限制,而资源高效的小型语言模型(SLMs)通常缺乏必要的推理能力。现有的协作方法,如级联或路由,以粗粒度的方式工作,将整个查询卸载到LLMs上,当SLM能够处理大多数推理步骤时,这会导致显著的计算浪费。为了解决这个问题,我们提出了一种名为RelayLLM的新框架,用于通过标记级协作解码实现高效推理。与路由器不同,RelayLLM赋予SLM作为主动控制器的能力,动态地仅在关键标记上调用LLM,通过特殊命令有效地“传递”生成过程。我们引入了一种两阶段训练框架,包括预热和组相对策略优化(GRPO),以教导模型平衡独立性和战略性求助。在六个基准测试中的实验结果表明,RelayLLM实现了49.52%的平均准确率,有效地弥合了两种模型之间的性能差距。值得注意的是,这仅通过调用LLM生成标记的1.07%实现,与性能匹配的随机路由器相比,成本降低了98.2%。
Summary / 总结
RelayLLM is a framework for efficient reasoning via token-level collaborative decoding. It addresses the computational cost and latency of large language models (LLMs) and the limited reasoning capacity of small language models (SLMs) by letting the SLM dynamically invoke the LLM only for critical tokens, reducing computational waste. The framework uses a two-stage training process (warm-up and GRPO) and achieves an average accuracy of 49.52% across six benchmarks while invoking the LLM for only 1.07% of generated tokens, a 98.2% cost reduction compared to performance-matched random routers.
RelayLLM 是一种框架,通过 SLM 和 LLM 在 token 级别的协作解码实现高效的推理。它允许 SLM 动态地仅在关键 token 上调用 LLM,减少计算浪费。该框架包括两个阶段的训练过程,以平衡独立性和战略性求助。实验结果显示,RelayLLM 在六个基准上的平均准确率为 49.52%,仅调用 LLM 处理 1.07% 的 token,相比性能相当的随机路由器实现了 98.2% 的成本降低。
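The token-level relay described for RelayLLM can be sketched as a decoding loop in which the small model normally produces tokens itself and hands over to the large model when it emits a special command. The command string `<call_llm>`, the end token, the `relay_len` parameter, and the scripted stub models are all hypothetical; the abstract does not give the actual command format or how many tokens the LLM contributes per call.

```python
from typing import Callable, List, Tuple

CALL_TOKEN = "<call_llm>"  # hypothetical special command
END_TOKEN = "<eos>"        # hypothetical end-of-sequence marker

NextTokenFn = Callable[[List[str]], str]


def relay_decode(
    slm_next: NextTokenFn,
    llm_next: NextTokenFn,
    max_tokens: int,
    relay_len: int = 1,
) -> Tuple[List[str], float]:
    """Return (output_tokens, fraction_of_tokens_generated_by_llm)."""
    assert relay_len >= 1, "each relay must produce at least one token"
    output: List[str] = []
    llm_tokens = 0
    while len(output) < max_tokens:
        token = slm_next(output)
        if token == END_TOKEN:
            break
        if token == CALL_TOKEN:
            # The SLM asks for help: the LLM writes the next relay_len tokens.
            for _ in range(relay_len):
                if len(output) >= max_tokens:
                    break
                output.append(llm_next(output))
                llm_tokens += 1
            continue
        output.append(token)
    ratio = llm_tokens / len(output) if output else 0.0
    return output, ratio


def make_scripted_slm(script: List[str]) -> NextTokenFn:
    """Stub SLM that replays a fixed script, then ends (illustrative only)."""
    it = iter(script)
    return lambda prefix: next(it, END_TOKEN)
```

The returned ratio plays the role of the "share of generated tokens that came from the LLM" quantity that the abstract reports as 1.07%.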
MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging
Authors: Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li
First: 2025-09-23T06:23:56+00:00 · Latest: 2026-01-08T17:56:05+00:00
Comments: The project is available at https://charlescsyyy.github.io/MVT
Abstract
Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
中文标题/摘要
标题:MVT:基于掩码的视觉-语言模型在分类学对齐的土地覆盖标记中的应用
遥感中的土地覆盖理解越来越需要跨数据集泛化但同时保持空间精确性和可解释性的类无差别系统。我们研究了在领域转移下的几何优先发现与解释设置,其中候选区域以类无差别方式划定,监督通过匿名标识符避免使用词汇化的类别名称。作为开放集识别和开放世界学习的补充,我们专注于将类无差别掩码证据与分类学导向的场景解释相结合,而不是未知拒绝或持续类扩展。我们提出了MVT,一个三阶段框架,(i) 使用SAM2进行领域适应以提取边界忠实的区域掩码,(ii) 通过双步骤LoRA微调多模态LLM进行掩码导向的语义标记和场景描述生成,(iii) 使用LLM作为评判者评分进行评估,评分通过分层专家评分校准。在跨数据集分割迁移(在OpenEarthMap上训练,在LoveDA上评估)中,领域适应的SAM2提高了掩码质量;同时,双步骤多模态LLM微调产生了更准确的分类学对齐标签和更具有信息性的掩码导向场景描述。
Summary / 总结
The research aims to develop a class-agnostic system for land-cover tagging that generalizes across datasets while maintaining spatial precision and interpretability. The method involves a three-stage framework: (i) extracting boundary-faithful region masks using SAM2 with domain adaptation, (ii) performing mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluating outputs with LLM-as-judge scoring calibrated by expert ratings. The study shows that domain-adapted SAM2 improves mask quality, and dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative scene descriptions on cross-dataset segmentation transfer.
研究旨在开发适用于遥感的土地覆盖理解系统,注重空间精度和可解释性。方法包括三个阶段:(i) 使用SAM2进行领域适应以提取边界忠实的区域掩码,(ii) 通过双重步骤的LoRA微调多模态LLM进行掩码导向的语义标签和场景描述生成,(iii) 使用LLM作为裁判评分并根据分层专家评级进行评估。关键发现包括领域适应的SAM2提高了掩码质量,而双重步骤的LLM微调则产生了更准确的分类对齐标签和更具信息量的掩码导向场景描述。
Improving and Evaluating Open Deep Research Agents
Authors: Doaa Allabadi, Kyle Bradbury, Jordan M. Malof
First: 2025-08-13T19:32:01+00:00 · Latest: 2026-01-08T17:54:58+00:00
Comments: 8 pages, 2 figures, 2 tables
Abstract
We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user, and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks; however, recent research largely involves proprietary closed-source systems. At the time of this work, we only found one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally-tractable DRA benchmark for academic labs. We benchmark ODR and two proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contributed to the success of ODR+.
中文标题/摘要
标题:改进和评估开放深度研究代理
我们在这里关注深度研究代理(DRAs),这是一种可以从用户那里接收自然语言提示,并自主搜索和利用互联网内容来回应提示的系统。最近的DRAs在公共基准测试中展示了令人印象深刻的性能,然而,最近的研究主要涉及专有的闭源系统。在本研究进行时,我们仅发现一个开源的DRA,称为Open Deep Research(ODR)。在本工作中,我们改编了具有挑战性的最新BrowseComp基准测试,用于比较ODR与现有专有系统。我们提出了BrowseComp-Small(BC-Small),它由BrowseComp的一个子集构成,是一个计算上更易处理的DRA基准测试,适用于学术实验室。我们在BC-Small上对ODR和两个专有系统进行了基准测试:一个来自Anthropic的系统和一个来自Google的系统。我们发现,这三个系统在包含60个问题的测试集上的准确率均为0%。我们对ODR进行了三项战略改进,产生了ODR+模型,该模型在BC-Small基准测试中实现了专有和开源系统中最佳的10%成功率。我们报告了消融研究,表明我们的三项改进都对ODR+的成功做出了贡献。
Summary / 总结
This work focuses on Deep Research Agents (DRAs) that can process natural language prompts and autonomously search for and utilize internet content. The authors adapt the BrowseComp benchmark to evaluate ODR, an open-source DRA, against proprietary systems. They introduce ODR+ with three strategic improvements, achieving a 10% success rate on BC-Small, surpassing both open-source and closed-source systems.
本研究旨在通过将BrowseComp基准适应为BC-Small,来提升和评估开源的Deep Research Agents(DRAs),并与现有系统进行比较。作者引入BC-Small供学术实验室使用,并对ODR、Anthropic和Google系统进行了基准测试。所有系统在测试集上的准确率为0%。对ODR进行三项策略性改进后,形成了ODR+模型,在BC-Small上实现了10%的成功率,超过了所有闭源和开源系统。
DocDancer: Towards Agentic Document-Grounded Information Seeking
Authors: Qintong Zhang, Xinjie Lv, Jialong Wu, Baixuan Li, Zhengwei Tao, Guochen Yan, Huanyao Zhang, Bin Wang, Jiahao Xu, Haitao Mi, Wentao Zhang
First: 2026-01-08T17:54:32+00:00 · Latest: 2026-01-08T17:54:32+00:00
Abstract
Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Models trained on the synthesized data show their effectiveness on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench. Further analysis provides valuable insights for agentic tool design and synthetic data.
中文标题/摘要
标题:DocDancer: 向基于文档的主动信息寻求迈进
文档问题回答(DocQA)专注于基于给定文档回答问题,但现有的DocQA代理缺乏有效的工具利用,主要依赖于封闭源模型。在本工作中,我们引入了DocDancer,这是一种端到端训练的开源Doc代理。我们将DocQA形式化为一个信息寻求问题,并提出了一种工具驱动的代理框架,明确地建模了文档探索和理解。为了使此类代理能够端到端训练,我们引入了一种探索然后合成的数据合成管道,以解决DocQA高质量训练数据稀缺的问题。在合成数据上训练得到的模型在两个长上下文文档理解基准MMLongBench-Doc和DocBench上展示了其有效性。进一步的分析为代理工具设计和合成数据提供了宝贵的见解。
Summary / 总结
DocDancer is an end-to-end trained open-source DocQA agent that addresses the limitations of existing agents by incorporating tool utilization and open-source models. It formulates DocQA as an information-seeking problem and uses an Exploration-then-Synthesis pipeline to train models on synthesized data, demonstrating effectiveness on MMLongBench-Doc and DocBench benchmarks. The analysis provides insights for agentic tool design and synthetic data creation.
DocDancer 是一个端到端训练的开源文档导向问答代理,通过整合工具利用和显式文档探索来解决现有封闭源模型的局限性。该代理框架模型化了文档理解和探索,并引入了探索-合成数据合成管道以克服高质量训练数据稀缺的问题。在 MMLongBench-Doc 和 DocBench 基准上的训练模型展示了有效性,为代理工具设计和合成数据生成提供了有价值的见解。