arXiv Paper Digest

AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs

Authors: Chengming Cui, Tianxin Wei, Ziyi Chen, Ruizhong Qiu, Zhichen Zeng, Zhining Liu, Xuying Ning, Duo Zhou, Jingrui He

First: 2026-01-09T18:58:22+00:00 · Latest: 2026-01-09T18:58:22+00:00

Abstract

Large language models (LLMs) exhibit complementary strengths arising from differences in pretraining data, model architectures, and decoding behaviors. Inference-time ensembling provides a practical way to combine these capabilities without retraining. However, existing ensemble approaches suffer from fundamental limitations. Most rely on fixed fusion granularity, which lacks the flexibility required for mid-generation adaptation and fails to adapt to different generation characteristics across tasks. To address these challenges, we propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. Rather than committing to a fixed granularity, AdaFuse adjusts fusion behavior on the fly based on the decoding context, with words serving as basic building blocks for alignment. To be specific, we introduce an uncertainty-based criterion to decide whether to apply ensembling at each decoding step. Under confident decoding states, the model continues generation directly. In less certain states, AdaFuse invokes a diversity-aware scaling strategy to explore alternative candidate continuations and inform ensemble decisions. This design establishes a synergistic interaction between adaptive ensembling and test-time scaling, where ensemble decisions guide targeted exploration, and the resulting diversity in turn strengthens ensemble quality. Experiments on open-domain question answering, arithmetic reasoning, and machine translation demonstrate that AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%. The code is available at https://github.com/CCM0111/AdaFuse.
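The uncertainty gate described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the margin-based confidence test, the threshold `tau`, and the simple averaging rule are assumptions made for illustration, and each model's candidates are represented as plain word-to-probability dictionaries.

```python
from typing import Dict, List


def top_margin(probs: Dict[str, float]) -> float:
    """Gap between the two most likely candidates (hypothetical confidence measure)."""
    ranked = sorted(probs.values(), reverse=True)
    return ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)


def fuse(dists: List[Dict[str, float]]) -> Dict[str, float]:
    """Average candidate probabilities across models; missing words count as 0."""
    words = set().union(*dists)
    return {w: sum(d.get(w, 0.0) for d in dists) / len(dists) for w in words}


def next_word(primary: Dict[str, float],
              others: List[Dict[str, float]],
              tau: float = 0.3) -> str:
    """Keep the primary model's choice when it is confident, otherwise ensemble."""
    if top_margin(primary) >= tau:
        return max(primary, key=primary.get)
    fused = fuse([primary] + others)
    return max(fused, key=fused.get)
```

With a confident primary distribution the other models are never consulted; with a narrow margin the fused distribution decides the word.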

中文标题/摘要

标题：AdaFuse：适应性集成解码与测试时缩放的LLM解码框架

大型语言模型（LLMs）由于预训练数据、模型架构和解码行为的不同而表现出互补的优势。推理时的集成提供了一种实用的方法来结合这些能力，而无需重新训练。然而，现有的集成方法存在根本性的局限性。大多数方法依赖于固定的融合粒度，缺乏在生成过程中进行中期调整的灵活性，也无法适应不同任务的生成特性。为了解决这些挑战，我们提出了AdaFuse，这是一种适应性集成解码框架，在生成过程中动态选择语义上合适的融合单元。AdaFuse 不是固定粒度地进行融合，而是根据解码上下文实时调整融合行为，以单词作为基本对齐单元。具体而言，我们引入了一种基于不确定性的标准来决定在每个解码步骤是否应用集成。在自信的解码状态下，模型直接继续生成。在不太确定的状态下，AdaFuse 调用一种多样性感知的缩放策略来探索替代候选续写，并指导集成决策。这种设计建立了适应性集成与测试时缩放之间的协同作用，其中集成决策引导有针对性的探索，而产生的多样性反过来又增强了集成质量。在开放域问答、算术推理和机器翻译上的实验表明，AdaFuse 一致地优于强大的集成基线，平均相对改进率为 6.88%。代码可在 https://github.com/CCM0111/AdaFuse 获取。

Summary / 总结

AdaFuse is an adaptive ensemble decoding framework designed to dynamically select fusion units during generation, addressing the limitations of fixed-granularity ensembles. It uses an uncertainty-based criterion to decide when to apply ensembling, and employs a diversity-aware scaling strategy to explore alternative continuations in uncertain states. Experiments show that AdaFuse outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%.

AdaFuse 是一种自适应组合解码框架，旨在生成过程中动态选择组合单元，解决固定粒度组合的局限性。它使用不确定性标准来决定在每个步骤是否应用组合，并在不确定状态下采用多样性导向的缩放策略探索替代延续，以指导组合决策。实验表明，AdaFuse 在开放领域问答、算术推理和机器翻译中均优于强组合基线，平均相对改进率为 6.88%。

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

Authors: Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li

First: 2026-01-09T18:57:53+00:00 · Latest: 2026-01-09T18:57:53+00:00

Abstract

Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.
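As a rough illustration of how a rubric-based reward might sit alongside a binary outcome reward, consider the sketch below. The abstract does not specify how the two signals are combined, so the additive mix, the weight `alpha`, and the `RubricCheck` fields are all hypothetical; the evidence-chain connectivity check is omitted entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricCheck:
    entity_identified: bool   # the hidden entity was named explicitly
    citation_supported: bool  # a cited source backs the claim


def rubric_reward(checks: List[RubricCheck]) -> float:
    """Fraction of single-hop rubrics that are both identified and cited."""
    if not checks:
        return 0.0
    satisfied = sum(c.entity_identified and c.citation_supported for c in checks)
    return satisfied / len(checks)


def combined_reward(answer_correct: bool,
                    checks: List[RubricCheck],
                    alpha: float = 0.5) -> float:
    """Hypothetical mix: binary outcome reward plus a weighted rubric term."""
    return float(answer_correct) + alpha * rubric_reward(checks)
```

Under this toy scheme, a correct answer reached through a shortcut (no rubrics satisfied) earns less than a correct answer backed by a fully cited chain.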

中文标题/摘要

标题：证据链式强化：基于引文意识评分标准奖励的深度搜索代理强化学习

强化学习（RL）已成为提升基于LLM的深度搜索代理的关键技术。然而，现有方法主要依赖于二元结果奖励，无法捕捉代理推理过程的全面性和事实性，常常导致捷径利用和幻觉等不良行为。为解决这些局限性，我们提出了一种细粒度的奖励框架——引文意识评分标准奖励（CaRR），强调推理的全面性、事实基础和证据链的连贯性。CaRR将复杂问题分解为可验证的一跳评分标准，并要求代理通过明确识别隐藏实体、用正确的引文支持它们以及构建链接到预测答案的完整证据链来满足这些评分标准。我们进一步引入了引文意识组相对策略优化（C-GRPO），将CaRR与结果奖励结合用于训练稳健的深度搜索代理。实验表明，C-GRPO在多个深度搜索基准测试中始终优于标准的结果导向型RL基线。我们的分析还验证了C-GRPO有效地抑制了捷径利用，促进了全面、基于证据的推理，并在开放性深度研究任务中表现出强大的泛化能力。我们的代码和数据可在https://github.com/THUDM/CaRR获取。

Summary / 总结

The paper addresses a limitation of existing reinforcement learning methods for deep search agents: they rely mainly on binary outcome rewards, which fail to capture the comprehensiveness and factuality of the reasoning process and can encourage shortcut exploitation and hallucinations. The authors propose Citation-aware Rubric Rewards (CaRR), which emphasize reasoning comprehensiveness, factual grounding, and evidence connectivity, and Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR with outcome rewards. Experiments show that C-GRPO outperforms standard outcome-based RL baselines across multiple deep search benchmarks, discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and generalizes well to open-ended deep research tasks.

本文针对现有基于强化学习的深度搜索代理方法存在的问题，这些方法通常依赖于二元结果奖励，无法全面捕捉推理过程。作者提出了基于引用的评分奖励（CaRR），以强调推理的全面性、事实基础和证据连贯性。他们还引入了基于引用的组相对策略优化（C-GRPO），将CaRR与结果奖励结合以训练稳健的代理。实验表明，C-GRPO在多个深度搜索基准测试中优于标准基线，并有效避免了捷径利用，促进了全面的推理和对开放性任务的强大泛化能力。

From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems

Authors: Parisa Rabbani, Nimet Beyza Bozdag, Dilek Hakkani-Tür

First: 2025-11-14T00:55:28+00:00 · Latest: 2026-01-09T18:47:24+00:00

Comments: 11 pages, 3 figures

Abstract

LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM's conviction is changed when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model's performance on direct factual queries with its assessment of a speaker's correctness when the same information is presented within a minimal dialogue, effectively shifting the query from "Is this statement correct?" to "Is this speaker correct?". Furthermore, we apply pressure in the form of a simple rebuttal ("The previous answer is incorrect.") to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under social framing tasks, others like Llama-8B-Instruct become overly-critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment, underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.
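The two measurements in this setup, accuracy under each framing and stability under a rebuttal, reduce to simple counts. The sketch below uses hypothetical helper names and boolean verdicts; it only shows the bookkeeping and is not the paper's evaluation code.

```python
from typing import List, Tuple


def accuracy(preds: List[bool], gold: List[bool]) -> float:
    """Share of verdicts that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def flip_rate(before_after: List[Tuple[bool, bool]]) -> float:
    """Share of items where the verdict changed after the rebuttal."""
    return sum(b != a for b, a in before_after) / len(before_after)


def framing_shift(direct: List[bool], dialogue: List[bool], gold: List[bool]) -> float:
    """Accuracy change when the same items are posed as a dialogue judgment."""
    return accuracy(dialogue, gold) - accuracy(direct, gold)
```

A negative `framing_shift` means the model judged the speaker less accurately than it judged the bare statement.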

中文标题/摘要

标题：从事实到判断：探究任务框架对LLM在对话系统中信念的影响

LLMs在各种任务中越来越被用作裁判，包括涉及日常社交互动的任务。然而，尚不清楚这些LLM裁判是否能够可靠地评估需要社交或对话判断的任务。我们研究了当任务从直接事实查询重新表述为对话判断任务时，LLM的信念如何变化。我们的评估框架将模型在直接事实查询上的表现与其在最小对话中评估说话者正确性时的表现进行对比，实际上将查询从“这个陈述是否正确？”转变为“这个说话者是否正确？”。此外，我们还对两种情况施加了简单的反驳（“之前的答案是错误的”）的压力。这种扰动使我们能够测量模型在对话压力下坚持其立场的程度。我们的研究发现，虽然一些模型如GPT-4o-mini在社交框架任务中表现出逢迎的倾向，但其他模型如Llama-8B-Instruct则变得过于苛刻。我们观察到所有模型的平均表现变化为9.24%，表明即使是最小的对话背景也能显著改变模型的判断，突显了对话框架在LLM评估中的关键作用。提出的框架提供了一种可重复的方法来诊断模型的信念，并有助于开发更可信的对话系统。

Summary / 总结

This study examines how task framing affects the conviction of LLMs in dialogue systems, particularly when the task shifts from a direct factual query to a conversational judgment task. The evaluation framework compares the model's performance on factual queries with its assessment of a speaker's correctness in a minimal dialogue. The research finds that models like GPT-4o-mini tend to become sycophantic, while others like Llama-8B-Instruct become overly critical under social framing. An average performance change of 9.24% across all models indicates that minimal dialogue context significantly alters model judgment, highlighting the importance of conversational framing in LLM-based evaluations.

研究探讨了任务框架如何影响对话系统中LLM的判断信心，特别是在社会判断任务中的表现。研究对比了LLM对直接事实查询的响应与其在最小对话中的判断，其中相同信息以对话形式呈现。研究发现，一些模型如GPT-4o-mini在社会框架下变得更为奉承，而其他模型如Llama-8B-Instruct则变得过于苛刻。模型性能平均变化9.24%表明，最小对话背景显著改变了模型的判断，突出了对话框架在LLM评估中的重要性。

Co-Training Vision Language Models for Remote Sensing Multi-task Learning

Authors: Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, Xue Yang

First: 2025-11-26T10:55:07+00:00 · Latest: 2026-01-09T18:43:00+00:00

Comments: 14 pages, 6 figures

Abstract

With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. First, we create a data curation engine covering data acquisition, offline processing and integration, and online loading and weighting. This data engine effectively addresses the complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. These strategies are flexible and effectively mitigate the computational burden. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All training and evaluation tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.

中文标题/摘要

标题：遥感多任务学习中联合训练视觉语言模型

随着Transformer在遥感(RS)单一任务中取得卓越表现，我们正接近通过多任务学习(MTL)实现跨多个任务的统一模型。与单一任务方法相比，MTL方法提供了更好的泛化能力、更强的可扩展性和更高的实际应用价值。最近，视觉语言模型(VLMs)在RS图像理解、语义分割和超高清(UHR)图像推理方面取得了令人鼓舞的结果。此外，统一的文本界面展示了MTL的巨大潜力。因此，在这项工作中，我们提出了RSCoVLM，这是一种简单而灵活的VLM基线模型，用于RS MTL。首先，我们创建了数据编排引擎，包括数据获取、离线处理和集成，以及在线加载和加权。该数据引擎有效地解决了复杂RS数据环境问题，并生成了灵活的视觉-语言对话。此外，我们提出了一种统一的动态分辨率策略，以应对RS图像中固有的不同图像尺度。对于UHR图像，我们引入了缩放链机制及其相应的数据集LRS-VQA-Zoom。这些策略灵活且有效地减轻了计算负担。此外，我们显著增强了模型的物体检测能力，并提出了一种新的评估协议，以确保VLMs和传统检测模型之间的公平比较。广泛的实验表明，RSCoVLM在多种任务中均取得了最先进的性能，超越了现有的RS VLMs，甚至与专门的专家模型相媲美。所有训练和评估工具、模型权重和数据集均已完全开源，以支持可再现性。我们期望这一基线将促进通用RS模型的进一步发展。

Summary / 总结

This paper presents RSCoVLM, a vision language model for remote sensing multi-task learning, addressing the need for unified models that excel across multiple tasks. The authors propose a data curation engine and a unified dynamic-resolution strategy to handle diverse image scales. Experiments show that RSCoVLM outperforms existing remote sensing vision language models and even matches specialized expert models, achieving state-of-the-art performance across various tasks.

本文提出了RSCoVLM，一种用于遥感多任务学习的视觉语言模型，旨在实现跨多个任务的统一模型。作者提出了一种数据编排引擎和统一的动态分辨率策略来处理多样化的图像尺度。实验表明，RSCoVLM 在各种任务上的性能超越了现有遥感视觉语言模型，并且甚至与专业专家模型相当，达到了最先进的技术水平。

Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks

Authors: Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah

First: 2026-01-09T18:41:57+00:00 · Latest: 2026-01-09T18:41:57+00:00

Comments: 15 pages, 8 figures

Abstract

Recent advancements in Large Language Model (LLM) agents have enabled complex multi-turn agentic tasks requiring extensive tool calling, where conversations can span dozens of API calls with increasingly large context windows. However, although major LLM providers offer prompt caching to reduce cost and latency, its benefits for agentic workloads remain underexplored in the research literature. To our knowledge, no prior work quantifies these cost savings or compares caching strategies for multi-turn agentic tasks. We present a comprehensive evaluation of prompt caching across three major LLM providers (OpenAI, Anthropic, and Google) and compare three caching strategies, including full context caching, system prompt only caching, and caching that excludes dynamic tool results. We evaluate on DeepResearchBench, a multi-turn agentic benchmark where agents autonomously execute real-world web search tool calls to answer complex research questions, measuring both API cost and time to first token (TTFT) across over 500 agent sessions with 10,000-token system prompts. Our results demonstrate that prompt caching reduces API costs by 45-80% and improves time to first token by 13-31% across providers. We find that strategic prompt cache block control, such as placing dynamic content at the end of the system prompt, avoiding dynamic traditional function calling, and excluding dynamic tool results, provides more consistent benefits than naive full-context caching, which can paradoxically increase latency. Our analysis reveals nuanced variations in caching behavior across providers, and we provide practical guidance for implementing prompt caching in production agentic systems.
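Why the position of dynamic content matters can be seen in a toy model of prefix caching. The sketch assumes a cache that only matches an exact leading run of tokens and bills cached tokens at a discount; the price and discount values are made up, and real providers add their own rules (minimum lengths, block boundaries, expiry) that are not modeled here.

```python
from typing import List


def shared_prefix_len(prev: List[str], curr: List[str]) -> int:
    """Number of leading tokens that two prompts have in common."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


def request_cost(prev: List[str], curr: List[str],
                 price: float = 1.0, cached_factor: float = 0.1) -> float:
    """Cost of `curr` when the prefix shared with `prev` is billed at a discount."""
    hit = shared_prefix_len(prev, curr)
    return price * (cached_factor * hit + (len(curr) - hit))


static = ["sys"] * 5  # stand-in for stable system prompt tokens
dynamic_first = (["t=1"] + static, ["t=2"] + static)  # timestamp at the start
dynamic_last = (static + ["t=1"], static + ["t=2"])   # timestamp at the end
```

Putting the changing token first leaves no shared prefix, so the whole prompt is billed at full price; putting it last keeps the five static tokens cacheable.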

中文标题/摘要

标题：不要破坏缓存：对长期代理任务提示缓存的评估

大型语言模型（LLM）代理的最新进展使执行复杂多轮代理任务成为可能，这些任务需要广泛的工具调用，对话可以跨越数十次API调用，且上下文窗口越来越大。然而，尽管主要的LLM提供商提供了提示缓存以降低费用和延迟，但其对代理工作负载的好处在研究文献中仍被忽视。据我们所知，没有先前的工作量化这些费用节省或比较多轮代理任务的缓存策略。我们对三家主要的LLM提供商（OpenAI、Anthropic和Google）进行了全面的提示缓存评估，并比较了三种缓存策略，包括完整上下文缓存、仅系统提示缓存以及排除动态工具结果的缓存。我们在DeepResearchBench上进行了评估，这是一个多轮代理基准测试，其中代理自主执行实际的网络搜索工具调用来回答复杂的研究问题，我们测量了超过500个代理会话中的API费用和第一个标记的时间（TTFT），这些会话使用了10,000个标记的系统提示。我们的结果显示，提示缓存可以降低45-80%的API费用，并提高13-31%的时间到第一个标记。我们发现，战略性地控制提示缓存块，如将动态内容放在系统提示的末尾、避免动态的传统函数调用以及排除动态工具结果，比简单的全上下文缓存提供了更一致的好处，后者可能会反常地增加延迟。我们的分析揭示了提供商之间缓存行为的细微差异，并提供了在生产代理系统中实施提示缓存的实用指导。

Summary / 总结

This study evaluates prompt caching for long-horizon agentic tasks across three major LLM providers (OpenAI, Anthropic, and Google) using DeepResearchBench. The research finds that prompt caching reduces API costs by 45-80% and improves time to first token by 13-31%. Strategic caching, such as excluding dynamic tool results, provides more consistent benefits compared to full-context caching, which can paradoxically increase latency.

研究评估了三大LLM提供商（OpenAI、Anthropic和Google）上的提示缓存技术，使用DeepResearchBench多轮次代理基准进行比较，包括全上下文缓存、仅系统提示缓存和排除动态工具结果的缓存。研究发现，提示缓存可以将API成本降低45-80%，并将首次令牌时间缩短13-31%。通过将动态内容置于系统提示的末尾等策略性缓存方法比全上下文缓存更有效，后者可能会意外增加延迟。

The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning

Authors: Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang

First: 2026-01-09T18:39:01+00:00 · Latest: 2026-01-09T18:39:01+00:00

Comments: Preprint

Abstract

Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning by imitating humans or non-Long-CoT LLMs. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in a unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals that these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides the synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.

中文标题/摘要

标题：思维的分子结构：长链推理拓扑映射

大型语言模型（LLMs）往往难以从人类或非长链推理（Long CoT）LLMs模仿中学习有效的长链推理。为了理解这一现象，我们提出，有效的可学习的长链推理轨迹具有在统一视图中形成的稳定分子状结构，这些结构由三种交互类型组成：深度推理（共价型）、自我反思（氢键型）和自我探索（范德华力型）。对精简轨迹的分析表明，这些结构源自长链推理微调，而非关键词模仿。我们引入了有效语义异构体，并表明仅促进快速熵收敛的键支持稳定的长链推理学习，而结构竞争会损害训练。基于这些发现，我们提出了Mole-Syn方法，这是一种分布转移图方法，用于引导有效长链推理结构的合成，从而在基准测试中提升性能和强化学习稳定性。

Summary / 总结

The research aims to understand why large language models struggle with learning effective long chain-of-thought reasoning. It proposes that stable molecular-like structures, formed by three interaction types (Deep-Reasoning, Self-Reflection, and Self-Exploration), are key to effective Long CoT learning. The study finds that these structures emerge from Long CoT fine-tuning rather than keyword imitation. Mole-Syn, a method for synthesizing effective Long CoT structures, is introduced, showing improved performance and RL stability across benchmarks.

研究探讨了大型语言模型为何难以学习有效的长链推理。提出稳定的分子结构由三种类型的相互作用（深度推理、自我反思和自我探索）形成。研究显示这些结构源自长链推理微调而非关键词模仿。介绍了一种名为Mole-Syn的方法，该方法指导有效结构的合成，提升了多个基准测试中的性能和强化学习稳定性。

ACDZero: MCTS Agent for Mastering Automated Cyber Defense

Authors: Yu Li, Sizhe Tang, Rongqian Chen, Fei Xu Yu, Guangyu Jiang, Mahdi Imani, Nathaniel D. Bastian, Tian Lan

First: 2026-01-05T15:18:54+00:00 · Latest: 2026-01-09T18:28:29+00:00

Abstract

Automated cyber defense (ACD) seeks to protect computer networks with minimal or no human intervention, reacting to intrusions by taking corrective actions such as isolating hosts, resetting services, deploying decoys, or updating access controls. However, existing approaches for ACD, such as deep reinforcement learning (RL), often face difficult exploration in complex networks with large decision/state spaces and thus require an expensive amount of samples. Inspired by the need to learn sample-efficient defense policies, we frame ACD in CAGE Challenge 4 (CAGE-4 / CC4) as a context-based partially observable Markov decision problem and propose a planning-centric defense policy based on Monte Carlo Tree Search (MCTS). It explicitly models the exploration-exploitation tradeoff in ACD and uses statistical sampling to guide exploration and decision making. We make novel use of graph neural networks (GNNs) to embed observations from the network as attributed graphs, to enable permutation-invariant reasoning over hosts and their relationships. To make our solution practical in complex search spaces, we guide MCTS with learned graph embeddings and priors over graph-edit actions, combining model-free generalization and policy distillation with look-ahead planning. We evaluate the resulting agent on CC4 scenarios involving diverse network structures and adversary behaviors, and show that our search-guided, graph-embedding-based planning improves defense reward and robustness relative to state-of-the-art RL baselines.
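One common way to let learned priors guide tree search is an AlphaZero-style PUCT selection rule, sketched below. The abstract does not state which selection formula ACDZero uses, so this rule, the constant `c_puct`, and the action names are assumptions chosen for illustration; the prior would come from a learned policy such as a GNN.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Edge:
    prior: float            # prior probability from a learned policy
    visits: int = 0         # how often this action was tried
    value_sum: float = 0.0  # sum of returns observed after this action

    @property
    def q(self) -> float:
        """Mean return so far; 0 for an untried action."""
        return self.value_sum / self.visits if self.visits else 0.0


def select_action(edges: Dict[str, Edge], c_puct: float = 1.5) -> str:
    """Pick the action maximizing mean value plus a prior-weighted exploration bonus."""
    total = sum(e.visits for e in edges.values())

    def score(e: Edge) -> float:
        return e.q + c_puct * e.prior * math.sqrt(total + 1) / (1 + e.visits)

    return max(edges, key=lambda a: score(edges[a]))
```

Early on the prior dominates; once an action has been tried often with poor returns, the bonus shrinks and the search shifts to alternatives.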

中文标题/摘要

标题：ACDZero：掌握自动化网络防御的MCTS代理

自动化网络防御（ACD）旨在通过最少或无需人类干预来保护计算机网络，并通过采取纠正措施（如隔离主机、重置服务、部署诱饵或更新访问控制）来应对入侵。然而，现有的ACD方法，如深度强化学习（RL），在复杂网络的大决策/状态空间中往往面临探索难题，因此需要大量的样本。受学习高效防御策略的启发，我们将ACD在CAGE挑战4（CAGE-4 / CC4）中框架化为基于上下文的部分可观测马尔可夫决策问题，并提出基于蒙特卡洛树搜索（MCTS）的规划中心防御策略。它明确地在ACD中建模了探索与利用之间的权衡，并使用统计抽样来指导探索和决策。我们创新地使用图神经网络（GNN）将网络观察嵌入为带属性的图中，以实现对主机及其关系的不变推理。为了使我们的解决方案在复杂的搜索空间中实用，我们用学习到的图嵌入和图编辑动作的先验来引导MCTS，结合无模型泛化、策略蒸馏和前瞻规划。我们在涉及多种网络结构和对手行为的CAGE-4场景中评估了该代理，并展示了与最先进的RL基线相比，我们的搜索引导、基于图嵌入的规划提高了防御奖励和鲁棒性。

Summary / 总结

The research aims to develop an efficient automated cyber defense system that can operate with minimal human intervention. The method involves using Monte Carlo Tree Search (MCTS) with graph neural networks (GNNs) to model the exploration-exploitation tradeoff in complex network environments. Key findings show that the proposed agent outperforms existing reinforcement learning baselines in terms of defense reward and robustness across various network structures and adversary behaviors.

研究旨在开发一个高效的自动化网络防御系统，无需大量人工干预即可保护计算机网络。方法是将网络防御问题建模为上下文相关的部分可观测马尔可夫决策过程，并使用蒙特卡洛树搜索（MCTS）结合图神经网络（GNN）来嵌入网络观察。关键发现表明，提出的ACDZero代理在各种网络结构和攻击者行为下，在防御奖励和鲁棒性方面优于最先进的强化学习基线。

Open-Vocabulary 3D Instruction Ambiguity Detection

Authors: Jiayu Ding, Haoran Tang, Ge Li

First: 2026-01-09T18:17:11+00:00 · Latest: 2026-01-09T18:17:11+00:00

Abstract

In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like "Pass me the vial" in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, a large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset are available at https://jiayuding031020.github.io/ambi3d/.

中文标题/摘要

标题：开放词汇3D指令歧义检测

在安全关键领域，语言歧义可能导致严重后果；例如，在手术环境中，一个模糊的命令“递给我那个药瓶”可能导致灾难性错误。然而，大多数具身AI研究忽视了这一点，假设指令是清晰的，而专注于执行而不是确认。为了解决这一关键的安全缺口，我们首次定义了开放词汇3D指令歧义检测这一基本的新任务，其中模型必须确定在一个给定的3D场景中，一个命令是否具有单一且明确的意义。为了支持这一研究，我们构建了Ambi3D，这是该任务的大规模基准数据集，包含超过700个多样化的3D场景和大约22000条指令。我们的分析揭示了一个令人惊讶的局限性：最先进的3D大型语言模型（LLMs）难以可靠地判断一个指令是否具有歧义性。为了解决这一挑战，我们提出了AmbiVer，这是一种两阶段框架，它从多个视角收集明确的视觉证据，并使用这些证据来指导视觉-语言模型（VLM）判断指令的歧义性。广泛的实验表明了我们任务的挑战性和AmbiVer的有效性，为更安全和更可信赖的具身AI铺平了道路。代码和数据集可在https://jiayuding031020.github.io/ambi3d/获取。

Summary / 总结

The research aims to address the critical safety gap in embodied AI by detecting linguistic ambiguity in 3D instructions, which can lead to severe errors. The method involves creating Ambi3D, a benchmark with over 700 3D scenes and 22,000 instructions, and proposing AmbiVer, a two-stage framework that uses visual evidence to determine instruction ambiguity. Experiments show that state-of-the-art 3D LLMs struggle with this task, while AmbiVer effectively improves the reliability of ambiguity detection. This work paves the way for safer and more trustworthy embodied AI systems.

该研究关注语言歧义在体态AI中的关键安全问题，特别是在安全关键领域。它引入了开放词汇3D指令歧义检测任务，要求模型在3D场景中判断一个命令是否有单一且明确的含义。研究构建了Ambi3D基准，包含超过700个场景和22,000个指令，并发现最先进的3D大语言模型在这一任务上表现不佳。为改进这一问题，作者提出了AmbiVer框架，该框架收集视觉证据并指导视觉-语言模型判断指令的歧义性，实验表明该方法能更可靠地检测歧义。

Monadic Context Engineering

Authors: Yifan Zhang, Yang Yuan, Mengdi Wang, Andrew Chi-Chih Yao

First: 2025-12-27T01:52:06+00:00 · Latest: 2026-01-09T17:48:20+00:00

Comments: The authors have decided to withdraw this manuscript, as the ideas presented in the paper are not yet sufficiently mature and require further development and refinement

Abstract

The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming.
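The short-circuiting error handling that the abstract attributes to monadic composition can be shown with a small result type. This is a generic sketch of the pattern in Python, not code from the paper; `Result`, `bind`, and the two example steps are names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")


@dataclass(frozen=True)
class Result(Generic[T]):
    value: Optional[T] = None
    error: Optional[str] = None

    def bind(self, step: "Callable[[T], Result[U]]") -> "Result[U]":
        """Run the next step only if no error has occurred so far."""
        if self.error is not None:
            return Result(error=self.error)
        return step(self.value)


def parse_int(text: str) -> Result[int]:
    """Example step: fails on non-numeric input instead of raising."""
    if text.strip().lstrip("-").isdigit():
        return Result(value=int(text))
    return Result(error=f"not an integer: {text!r}")


def reciprocal(n: int) -> Result[float]:
    """Example step: fails on zero instead of raising."""
    if n == 0:
        return Result(error="division by zero")
    return Result(value=1 / n)
```

A pipeline such as `Result(value="4").bind(parse_int).bind(reciprocal)` needs no explicit error checks between steps: the first failure is carried through to the end and later steps are skipped.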

中文标题/摘要

标题：单子上下文工程

大型语言模型（LLMs）的普及催化了自主代理向能够进行复杂推理和工具使用转变。然而，当前代理架构通常使用命令式的、随意的模式构建，导致系统脆弱，难以管理状态、处理错误和并发。本文介绍了单子上下文工程（MCE），这是一种新的架构范式，利用函子、应用函子和单子的代数结构为代理设计提供形式基础。MCE 将代理工作流视为计算上下文，在这种上下文中，如状态传播、短路错误处理和异步执行等横切关注点由抽象的代数性质内在管理。我们展示了如何使用单子实现稳健的顺序组合，如何使用应用函子提供并行执行的原理结构，以及关键地，如何使用单子变换器系统地组合这些能力。这种分层方法使开发人员能够从简单的、可独立验证的组件构建出复杂的、健壮且高效的AI代理。我们进一步扩展了这一框架，描述了利用MCE进行生成性编排的元代理，通过元编程动态创建和管理子代理工作流。

Summary / 总结

The paper introduces Monadic Context Engineering (MCE), a new architectural paradigm for designing autonomous agents using algebraic structures like Functors, Applicative Functors, and Monads to manage state, error handling, and concurrency. MCE enables robust sequential and parallel execution and systematic composition through Monad Transformers, allowing for the construction of complex, resilient AI agents from simple components. The framework is further extended to Meta-Agents, which use MCE for generative orchestration of sub-agent workflows. However, the authors have decided to withdraw the manuscript as the ideas are not yet mature enough for publication.

本文介绍了Monadic Context Engineering (MCE)，这是一种新的架构范式，利用代数结构来改进代理设计。MCE通过将代理工作流视为计算上下文来解决状态管理、错误处理和并发问题。论文展示了如何使用Monads实现稳健的顺序组合，使用Applicatives提供并行执行的原理结构，以及使用Monad Transformers系统地组合这些能力。这种分层方法使开发人员能够从简单的、独立验证的组件构建出复杂的、健壮且高效的AI代理。

QueryGym: Step-by-Step Interaction with Relational Databases

Authors: Haritha Ananthakrishnan, Harsha Kokel, Kelsey Sikes, Debarun Bhattacharjya, Michael Katz, Shirin Sohrabi, Kavitha Srinivas

First: 2025-09-25T22:48:49+00:00 · Latest: 2026-01-09T17:48:08+00:00

Abstract

We introduce QueryGym, an interactive environment for building, testing, and evaluating LLM-based query planning agents. Existing frameworks often tie agents to specific query language dialects or obscure their reasoning; QueryGym instead requires agents to construct explicit sequences of relational algebra operations, ensuring engine-agnostic evaluation and transparent step-by-step planning. The environment is implemented as a Gymnasium interface that supplies observations -- including schema details, intermediate results, and execution feedback -- and receives actions that represent database exploration (e.g., previewing tables, sampling column values, retrieving unique values) as well as relational algebra operations (e.g., filter, project, join). We detail the motivation and the design of the environment. In the demo, we showcase the utility of the environment by contrasting it with contemporary LLMs that query databases. QueryGym serves as a practical testbed for research in error remediation, transparency, and reinforcement learning for query generation. For the associated demo, see https://ibm.biz/QueryGym.
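To make the idea of an explicit sequence of relational algebra operations concrete, here is a toy plan over in-memory tables. The function names, tables, and data are hypothetical and this is not the QueryGym API, which the abstract describes as a Gymnasium interface; the sketch only shows how each step yields an inspectable intermediate result.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def filter_rows(rows: List[Row], pred: Callable[[Row], bool]) -> List[Row]:
    """Selection: keep rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]


def project(rows: List[Row], cols: List[str]) -> List[Row]:
    """Projection: keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]


def join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Equi-join on a shared column (nested loop, fine for tiny tables)."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Toy tables (made-up data)
users = [{"uid": 1, "name": "Ana"}, {"uid": 2, "name": "Bo"}]
orders = [{"uid": 1, "total": 30}, {"uid": 2, "total": 5}, {"uid": 1, "total": 12}]

# An explicit three-step plan; every intermediate result can be inspected.
big_orders = filter_rows(orders, lambda r: r["total"] > 10)
with_names = join(big_orders, users, "uid")
result = project(with_names, ["name", "total"])
```

Because the plan is a list of algebra steps rather than a SQL string, it does not depend on any particular engine's dialect.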

中文标题/摘要

标题：QueryGym：逐步与关系数据库交互

我们介绍了QueryGym，一个用于构建、测试和评估基于LLM的查询规划代理的交互式环境。现有框架往往将代理绑定到特定的查询语言方言或使其推理过程晦涩难懂；QueryGym 要求代理构建明确的关系代数操作序列，确保跨引擎评估，并实现透明的逐步规划。该环境作为Gymnasium接口实现，提供观察结果——包括模式细节、中间结果和执行反馈——并接收代表数据库探索（例如，预览表、采样列值、检索唯一值）以及关系代数操作（例如，过滤、投影、连接）的动作。我们详细介绍了环境的动机和设计。在演示中，我们通过与当前查询数据库的LLM进行对比，展示了该环境的实用性。QueryGym 作为研究查询生成中的错误修正、透明性和强化学习的实用试验平台。有关相关演示，请参见 https://ibm.biz/QueryGym。

Summary / 总结

QueryGym is an interactive environment for building, testing, and evaluating LLM-based query planning agents. It requires agents to construct explicit sequences of relational algebra operations, ensuring engine-agnostic evaluation and transparent step-by-step planning. The environment is implemented as a Gymnasium interface that supplies observations (schema details, intermediate results, execution feedback) and accepts database exploration and relational algebra actions. A demo contrasts the environment with contemporary LLMs that query databases, and the authors position QueryGym as a practical testbed for research in error remediation, transparency, and reinforcement learning for query generation.

QueryGym 是一个交互式环境，用于构建、测试和评估基于 LLM 的查询规划代理。它要求代理构建明确的 Relational Algebra 操作序列，确保跨引擎的评估和透明的逐步规划。关键发现包括 QueryGym 在展示显式 Relational Algebra 操作在查询数据库方面的优势，以及作为错误修正、透明性和查询生成的强化学习研究的实用试验平台的潜力。

DeePM: Regime-Robust Deep Learning for Systematic Macro Portfolio Management

Authors: Kieran Wood, Stephen J. Roberts, Stefan Zohren

First: 2026-01-09T17:47:32+00:00 · Latest: 2026-01-09T17:47:32+00:00

Abstract

We propose DeePM (Deep Portfolio Manager), a structured deep-learning macro portfolio manager trained end-to-end to maximize a robust, risk-adjusted utility. DeePM addresses three fundamental challenges in financial learning: (1) it resolves the asynchronous "ragged filtration" problem via a Directed Delay (Causal Sieve) mechanism that prioritizes causal impulse-response learning over information freshness; (2) it combats low signal-to-noise ratios via a Macroeconomic Graph Prior, regularizing cross-asset dependence according to economic first principles; and (3) it optimizes a distributionally robust objective where a smooth worst-window penalty serves as a differentiable proxy for Entropic Value-at-Risk (EVaR) - a window-robust utility encouraging strong performance in the most adverse historical subperiods. In large-scale backtests from 2010-2025 on 50 diversified futures with highly realistic transaction costs, DeePM attains net risk-adjusted returns that are roughly twice those of classical trend-following strategies and passive benchmarks, solely using daily closing prices. Furthermore, DeePM improves upon the state-of-the-art Momentum Transformer architecture by roughly fifty percent. The model demonstrates structural resilience across the 2010s "CTA (Commodity Trading Advisor) Winter" and the post-2020 volatility regime shift, maintaining consistent performance through the pandemic, inflation shocks, and the subsequent higher-for-longer environment. Ablation studies confirm that strictly lagged cross-sectional attention, graph prior, principled treatment of transaction costs, and robust minimax optimization are the primary drivers of this generalization capability.
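A smooth worst-window penalty can be approximated with a log-sum-exp soft minimum over rolling-window mean returns. The abstract does not give the exact form used in DeePM, so the formula below, the temperature `beta`, and the function names are assumptions; the sketch only shows why such a quantity is smooth yet tracks the worst window.

```python
import math
from typing import List


def window_means(returns: List[float], window: int) -> List[float]:
    """Mean return of every contiguous window of the given length."""
    return [sum(returns[i:i + window]) / window
            for i in range(len(returns) - window + 1)]


def soft_worst_window(returns: List[float], window: int, beta: float = 50.0) -> float:
    """Soft minimum of window means: -(1/beta) * log(mean(exp(-beta * m)))."""
    means = window_means(returns, window)
    shifted = [-beta * m for m in means]
    top = max(shifted)  # subtract the maximum for numerical stability
    lse = top + math.log(sum(math.exp(s - top) for s in shifted) / len(shifted))
    return -lse / beta
```

The value always lies between the worst window mean and the average window mean, and it approaches the worst window as `beta` grows, which is what makes it usable as a differentiable stand-in for a hard minimum.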

中文标题/摘要

标题：DeePM：系统宏观投资组合管理的稳健深度学习

我们提出了DeePM（深度投资组合经理），这是一种端到端训练的结构化深度学习宏观投资组合经理，旨在最大化稳健的风险调整效用。DeePM解决了金融学习中的三个基本挑战：（1）通过定向延迟（因果筛）机制解决异步“不齐整过滤”问题，优先考虑因果冲击响应学习而非信息新鲜度；（2）通过宏观经济图先验对抗低信噪比，根据经济基本原则正则化跨资产依赖关系；（3）优化一个分布稳健的目标，其中平滑的最坏情况窗口惩罚作为熵风险价值（EVaR）的可微代理，这是一种窗口稳健的效用，鼓励在最不利的历史子时期表现出色。在2010-2025年、涵盖50种多样化期货并计入高度真实交易成本的大规模回测中，DeePM仅使用每日收盘价，就实现了大约是经典趋势跟随策略和被动基准两倍的净风险调整回报。此外，DeePM相比最先进的Momentum Transformer架构提升了约50%。该模型在2010年代“商品交易顾问（CTA）寒冬”和2020年后波动率制度转变中展示了结构上的韧性，在疫情、通胀冲击和随后的利率长期维持高位的环境中保持了稳定的表现。消融研究证实，严格滞后的横截面注意力、图先验、合理的交易成本处理和稳健的最小最大优化是这种泛化能力的主要驱动因素。

Summary / 总结

DeePM is a deep-learning macro portfolio manager that addresses three challenges of financial learning by using a Directed Delay mechanism, a Macroeconomic Graph Prior, and a distributionally robust objective. In large-scale backtests from 2010-2025 on 50 diversified futures, DeePM achieves roughly twice the net risk-adjusted returns of classical trend-following strategies and passive benchmarks, and outperforms the state-of-the-art Momentum Transformer by about fifty percent. DeePM maintains consistent performance across various economic regimes, including the CTA Winter and post-2020 volatility shifts.

DeePM 是一种深度学习宏观投资组合管理器，通过定向延迟机制、宏观经济图先验和分布鲁棒目标来解决金融学习中的挑战。在2010-2025年的回测中，DeePM 的净风险调整回报率大约是传统趋势跟随策略和被动基准的两倍，并且比 Momentum Transformer 提升了约百分之五十。该模型在包括“CTA寒冬”和2020年后波动性变化在内的各种经济环境中保持了稳定的性能。
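
The "smooth worst-window penalty" described in the DeePM abstract can be pictured as a soft maximum over per-window losses. The sketch below is only an illustration under stated assumptions: the log-sum-exp form, the function name `smooth_worst_window`, and the temperature `beta` are hypothetical and are not taken from the paper, whose actual objective may differ.

```python
import math


def smooth_worst_window(window_losses, beta=10.0):
    """Soft maximum of per-window losses via a normalized log-sum-exp.

    Assumption: this form and the `beta` temperature are illustrative only,
    not the objective defined in the DeePM paper.
    """
    m = max(window_losses)  # subtract the max for numerical stability
    total = sum(math.exp(beta * (x - m)) for x in window_losses)
    return m + math.log(total / len(window_losses)) / beta


# Hypothetical losses measured on three historical windows.
losses = [0.1, 0.2, 0.9]
sharp = smooth_worst_window(losses, beta=1000.0)  # approaches the worst window
flat = smooth_worst_window(losses, beta=1e-6)     # approaches the average
```

A large `beta` makes the penalty track the single worst window, while a small `beta` relaxes it toward the mean loss; unlike a hard maximum, every window contributes a gradient.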

Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge

Authors: Yi Sui, Chaozhuo Li, Chen Zhang, Dawei Song, Qiuchi Li

First: 2025-06-06T17:00:23+00:00 · Latest: 2026-01-09T17:45:13+00:00

Abstract

Retrieval-augmented generation (RAG) aims to mitigate the hallucination of Large Language Models (LLMs) by retrieving and incorporating relevant external knowledge into the generation process. However, the external knowledge may contain noise and conflict with the parametric knowledge of LLMs, leading to degraded performance. Current LLMs lack inherent mechanisms for resolving such conflicts. To fill this gap, we propose a Dual-Stream Knowledge-Augmented Framework for Shared-Private Semantic Synergy (DSSP-RAG). Central to it is the refinement of the traditional self-attention into a mixed-attention that distinguishes shared and private semantics for a controlled knowledge integration. An unsupervised hallucination detection method that captures the LLMs' intrinsic cognitive uncertainty ensures that external knowledge is introduced only when necessary. To reduce noise in external knowledge, an Energy Quotient (EQ), defined by attention difference matrices between task-aligned and task-misaligned layers, is proposed. Extensive experiments show that DSSP-RAG achieves a superior performance over strong baselines.

中文标题/摘要

标题：外部和参数知识融合：通过双流知识共享-私有语义协同作用减轻LLM幻觉

检索增强生成（RAG）旨在通过检索和整合相关外部知识来减轻大型语言模型（LLMs）的幻觉。然而，外部知识可能包含噪声，并与LLMs的参数知识产生冲突，导致性能下降。当前的LLMs缺乏解决此类冲突的内在机制。为填补这一空白，我们提出了一种双流知识增强框架以实现共享-私有语义协同作用（DSSP-RAG）。其核心在于将传统的自注意力机制改进为混合注意力机制，以区分共享和私有语义，从而控制知识的整合。一种无监督的幻觉检测方法捕捉LLMs的内在认知不确定性，确保只有在必要时才引入外部知识。为了减少外部知识中的噪声，我们提出了能量商（Energy Quotient，EQ），它由任务对齐层和任务错配层之间的注意力差异矩阵定义。广泛的实验表明，DSSP-RAG在性能上优于强基线。

Summary / 总结

The paper addresses hallucination in LLMs by proposing DSSP-RAG, a dual-stream framework that integrates external and parametric knowledge through shared-private semantic synergy. The method refines self-attention into a mixed-attention that distinguishes shared and private semantics for controlled knowledge integration. An unsupervised hallucination detection method ensures that external knowledge is introduced only when necessary, and an Energy Quotient (EQ) reduces noise in that knowledge. Experimental results demonstrate that DSSP-RAG outperforms strong baselines.

研究旨在通过整合外部知识来解决LLM的幻觉问题，同时缓解与参数知识的冲突。提出的DSSP-RAG框架采用双流方法，将自注意力改进为混合注意力机制，以区分共享和私有语义。无监督的幻觉检测方法确保只在必要时引入外部知识，能量商（EQ）则用于减少外部知识中的噪声。实验表明，DSSP-RAG优于强基线。

Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection

Authors: Yong Xie, Karan Aggarwal, Aitzaz Ahmad, Stephen Lau

Venue: KDD 2024

First: 2024-10-16T06:31:59+00:00 · Latest: 2026-01-09T17:41:42+00:00

Comments: 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (ACM KDD 2024). Accepted by Workshop on Evaluation and Trustworthiness of Generative AI Models

Abstract

We present a novel approach to automatically generate non-trivial task-specific synthetic datasets for hallucination detection. Our approach features a two-step generation-selection pipeline, using hallucination pattern guidance and a language style alignment during generation. Hallucination pattern guidance leverages the most important task-specific hallucination patterns while language style alignment aligns the style of the synthetic dataset with benchmark text. To obtain robust supervised detectors from synthetic datasets, we also adopt a data mixture strategy to improve performance robustness and generalization. Our results on three datasets show that our generated hallucination text is more closely aligned with non-hallucinated text versus baselines, to train hallucination detectors with better generalization. Our hallucination detectors trained on synthetic datasets outperform in-context-learning (ICL)-based detectors by a large margin of 32%. Our extensive experiments confirm the benefits of our approach with cross-task and cross-generator generalization. Our data-mixture-based training further improves the generalization and robustness of hallucination detection.

中文标题/摘要

标题：受控自动生成任务特定合成数据集以检测幻觉

我们提出了一种新的方法，用于自动生成非平凡的任务特定合成数据集以检测幻觉。该方法采用两步生成-选择流水线，利用幻觉模式指导和生成期间的语言风格对齐。幻觉模式指导利用了最重要的任务特定幻觉模式，而语言风格对齐则使合成数据集的风格与基准文本对齐。为了从合成数据集中获得稳健的监督检测器，我们还采用了数据混合策略以提高性能的稳健性和泛化能力。我们在三个数据集上的结果表明，我们生成的幻觉文本与非幻觉文本的对齐程度更高，从而训练出泛化能力更强的幻觉检测器。我们基于合成数据集训练的幻觉检测器在上下文学习（ICL）基础上的检测器上取得了显著的32%的性能优势。我们广泛的实验验证了该方法在跨任务和跨生成器泛化方面的优势。基于数据混合的训练进一步提高了幻觉检测的泛化能力和稳健性。

Towards AI-Native Software Engineering (SE 3.0): A Vision and a Challenge Roadmap

Authors: Ahmed E. Hassan, Gustavo A. Oliva, Dayi Lin, Boyuan Chen, Zhen Ming Jiang

First: 2024-10-08T15:04:07+00:00 · Latest: 2026-01-09T17:38:57+00:00

Abstract

The rise of AI-assisted software engineering (SE 2.0), powered by Foundation Models (FMs) and FM-powered coding assistants, has shown promise in improving developer productivity. However, it has also exposed inherent limitations, such as cognitive overload on developers and inefficiencies. We propose a shift towards Software Engineering 3.0 (SE 3.0), an AI-native approach characterized by intent-centric, conversation-oriented development between human developers and AI teammates. SE 3.0 envisions AI systems evolving beyond task-driven copilots into intelligent collaborators, capable of deeply understanding and reasoning about software engineering principles and intents. We outline the key components of the SE 3.0 technology stack, which includes Teammate.next for adaptive and personalized AI partnership, IDE.next for intent-centric conversation-oriented development, Compiler.next for multi-objective code synthesis, and Runtime.next for SLA-aware execution with edge-computing support. Our vision addresses the inefficiencies and cognitive strain of SE 2.0 by fostering a symbiotic relationship between human developers and AI, maximizing their complementary strengths. We also present a roadmap of challenges that must be overcome to realize our vision of SE 3.0. This paper lays the foundation for future discussions on the role of AI in the next era of software engineering.

中文标题/摘要

标题：迈向AI原生软件工程（SE 3.0）：愿景与挑战路线图

AI辅助软件工程（SE 2.0）的兴起，得益于基础模型（FMs）和FMs驱动的编码助手，显示出提高开发人员生产力的潜力。然而，这也暴露了固有的局限性，如开发人员的认知负担和效率低下。我们提出转向软件工程3.0（SE 3.0），这是一种以意图为中心、以对话为导向的人工智能原生方法，其中人类开发人员与AI队友进行意图驱动的对话。SE 3.0设想AI系统从任务驱动的副驾驶演进为智能合作者，能够深入理解和推理软件工程原理和意图。我们概述了SE 3.0技术栈的关键组成部分，包括Teammate.next以实现适应性和个性化的AI伙伴关系，IDE.next以意图为中心的对话式开发，Compiler.next以多目标代码合成，以及Runtime.next以SLA感知执行并支持边缘计算。我们的愿景通过促进人类开发人员与AI之间的共生关系，克服SE 2.0的效率低下和认知压力，最大化双方的优势互补。我们还提出了实现SE 3.0愿景所必须克服的挑战路线图。本文为未来关于AI在软件工程下一时代角色的讨论奠定了基础。

Summary / 总结

This paper motivates the need for an AI-native approach in software engineering (SE 3.0) to address the limitations of AI-assisted software engineering (SE 2.0). The authors propose SE 3.0, which involves human developers and AI teammates collaborating through intent-centric, conversation-oriented development. Key components include Teammate.next, IDE.next, Compiler.next, and Runtime.next, designed to enhance developer productivity and reduce cognitive strain. The paper outlines a roadmap for overcoming challenges to realize SE 3.0.

本文旨在解决当前AI辅助软件工程（SE 2.0）的局限性，推动向AI本位的软件工程（SE 3.0）转变。它提出了一种以意图为中心、对话导向的开发模式，涉及人类开发人员和AI队友的合作。关键组件包括Teammate.next、IDE.next、Compiler.next和Runtime.next，旨在促进人类和AI之间的共生关系，提高生产力并减轻认知负担。作者还概述了实现这一愿景所面临的挑战。

Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference

Authors: Rasmus Blanck, Bill Noble, Stergios Chatzikyriakidis

First: 2026-01-08T17:58:52+00:00 · Latest: 2026-01-09T17:37:13+00:00

Abstract

Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.

中文标题/摘要

标题：逆向工程自然语言推理：自然语言推理元推理性质的研究

自然语言推理（NLI）一直是评估自然语言处理语言模型的重要任务，但该任务的逻辑性质尚未得到充分理解，经常被误表征。理解NLI所捕捉的推理概念是解释模型在该任务上的表现的关键。在本文中，我们提出了NLI标签集的三种可能解读，并对它们所蕴含的元推理性质进行了全面分析。以SNLI数据集为例，我们利用（1）具有相同前提的NLI项目和（2）由LLM生成的项目来评估在SNLI上训练的模型的元推理一致性，并推导出数据集中编码的逻辑关系的解读。

Summary / 总结

This paper aims to clarify the logical properties of the Natural Language Inference (NLI) task by formulating three possible readings of the NLI label set and analyzing the meta-inferential properties each entails. Focusing on the SNLI dataset, the authors use items with shared premises and items generated by large language models to evaluate SNLI-trained models for meta-inferential consistency. The analysis yields insights into which reading of the logical relations the dataset encodes, which in turn helps interpret model performance on NLI.

本文旨在通过提出NLI标签集的三种可能解释来澄清NLI任务的逻辑属性。作者使用SNLI数据集进行了全面分析，重点关注具有共享前提的项目和由语言模型生成的项目。研究揭示了训练在SNLI上的模型的元推理一致性，表明数据集中编码了哪种逻辑关系的解释。

VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Authors: Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

First: 2026-01-09T17:34:59+00:00 · Latest: 2026-01-09T17:34:59+00:00

Abstract

Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.

中文标题/摘要

标题：VideoAR：通过下一帧和尺度预测的自回归视频生成

近期视频生成的进展主要由扩散模型和流匹配模型主导，这些模型虽然生成高质量的结果，但计算密集且难以扩展。本文中，我们引入了VideoAR，这是第一个用于视频生成的大规模视觉自回归(VAR)框架，结合了多尺度下一帧预测与自回归建模。VideoAR通过将帧内VAR建模与因果下一帧预测相结合，解耦空间和时间依赖性，并通过3D多尺度分词器高效编码时空动态。为了提高长期一致性，我们提出了多尺度时间RoPE、跨帧误差校正和随机帧掩码，这些方法共同减轻了误差传播并稳定了时间连贯性。我们的多阶段预训练管道逐步在不断增加的分辨率和持续时间上对空间和时间学习进行对齐。实验结果显示，VideoAR在自回归模型中达到了新的最佳结果，将UCF-101上的FVD从99.5降低到88.6（越低越好），同时将推理步骤减少超过10倍，并达到VBench得分81.74，与大一个数量级的扩散模型相当。这些结果表明，VideoAR缩小了自回归和扩散范式的性能差距，为未来的视频生成研究提供了一个可扩展、高效且时间连贯的基础。

Summary / 总结

VideoAR is a Visual Autoregressive (VAR) video generation framework that combines multi-scale next-frame prediction with autoregressive modeling to improve computational efficiency and scalability. It introduces Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask to enhance long-term consistency. Among autoregressive models, VideoAR sets a new state of the art on UCF-101, improving FVD from 99.5 to 88.6 while reducing inference steps by over 10x, and it reaches a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger.

VideoAR 是一种结合多尺度下一帧预测和自回归建模的视频生成框架，以提高计算效率和可扩展性。它引入了多尺度时间 RoPE、跨帧误差校正和随机帧掩码等技术来增强长期一致性。实验结果显示，VideoAR 在 UCF-101 上优于之前的自回归模型，实现了更低的 FVD 分数和与更大规模的扩散模型相当的 VBench 分数，同时将推理步骤减少了超过 10 倍。

Distilling Feedback into Memory-as-a-Tool

Authors: Víctor Gallego

First: 2026-01-09T17:26:52+00:00 · Latest: 2026-01-09T17:26:52+00:00

Comments: Code: https://github.com/vicgalle/feedback-memory-as-a-tool Data: https://huggingface.co/datasets/vicgalle/rubric-feedback-bench

Abstract

We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls. We evaluate this method on the Rubric Feedback Bench, a novel dataset for rubric-based learning. Experiments demonstrate that our augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference cost.

中文标题/摘要

标题：将反馈提炼为记忆工具

我们提出了一种框架，通过文件为基础的记忆系统和代理控制的工具调用，将瞬时批评转化为可检索的指南，从而摊薄推理时的推理成本。我们在Rubric Feedback Bench数据集上评估了该方法，这是一个基于评分标准学习的新数据集。实验表明，我们的增强型大语言模型能够快速匹配测试时的精炼管道性能，同时大幅降低推理成本。

Summary / 总结

The research aims to reduce the cost of inference-time reasoning by converting transient critiques into retrievable guidelines using a file-based memory system and agent-controlled tool calls. The method was evaluated on the Rubric Feedback Bench dataset, and the experiments showed that the augmented LLMs could match the performance of test-time refinement pipelines while significantly reducing inference costs.

研究旨在通过使用基于文件的内存系统和代理控制的工具调用，将临时批评转化为可检索的指南，从而降低推理时的计算成本。该方法在Rubric Feedback Bench数据集上进行了评估，结果显示增强的LLM能够在显著降低推理成本的同时，达到测试时精炼管道的性能。
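
The idea of turning transient critiques into retrievable guidelines through a file-based memory and tool calls can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class `GuidelineMemory`, the tool names `write_guideline` and `read_guidelines`, and the JSON file layout are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class GuidelineMemory:
    """File-backed store of guidelines distilled from critiques.

    Assumption: a hypothetical design for illustration only.
    """

    def __init__(self, path):
        self.path = Path(path)
        if not self.path.exists():
            self.path.write_text("[]", encoding="utf-8")

    def write_guideline(self, topic, guideline):
        """Tool call: persist one guideline under a topic."""
        entries = json.loads(self.path.read_text(encoding="utf-8"))
        entries.append({"topic": topic, "guideline": guideline})
        self.path.write_text(json.dumps(entries, indent=2), encoding="utf-8")

    def read_guidelines(self, topic):
        """Tool call: retrieve every guideline stored for a topic."""
        entries = json.loads(self.path.read_text(encoding="utf-8"))
        return [e["guideline"] for e in entries if e["topic"] == topic]


with tempfile.TemporaryDirectory() as tmp:
    memory = GuidelineMemory(Path(tmp) / "memory.json")
    # A critique received once is distilled into a reusable guideline.
    memory.write_guideline("summaries", "State the main result in the first sentence.")
    # A later task retrieves it instead of paying for a fresh critique round.
    retrieved = memory.read_guidelines("summaries")
```

The cost saving comes from the second step: a lookup replaces another round of critique and refinement at inference time.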

Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

Authors: Parsa Mirtaheri, Ezra Edelman, Samy Jelassi, Eran Malach, Enric Boix-Adsera

Venue: NeurIPS 2025

First: 2025-05-27T23:23:34+00:00 · Latest: 2026-01-09T17:24:17+00:00

Comments: Published at NeurIPS 2025

Abstract

Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems in challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of language models, including models trained from scratch for graph connectivity with different chain of thought strategies as well as large reasoning models.

中文标题/摘要

标题：让我思考！长链推理比短链推理具有指数级优势

推理时计算已成为提高大型语言模型推理能力的一个有前景的扩展轴。然而，尽管取得了令人印象深刻的性能，人们对推理时计算的最佳分配仍然知之甚少。一个核心问题是应优先考虑顺序扩展（例如，更长的思维链）还是并行扩展（例如，对多条短思维链进行多数投票）。在本研究中，我们证明存在某些推理场景，其中顺序扩展相对于并行扩展具有指数级优势，以此阐明测试时扩展的格局。这些场景基于具有挑战性的图分布上的图连通性问题。我们通过一系列语言模型的全面实验验证了我们的理论发现，包括采用不同思维链策略、针对图连通性从零开始训练的模型，以及大型推理模型。

Summary / 总结

This study investigates the optimal allocation of inference-time computation for improving large language model reasoning. It explores the trade-off between sequential scaling (longer chains of thought) and parallel scaling (multiple short chains of thought). The research demonstrates that in certain graph connectivity problems, sequential scaling provides an exponential advantage over parallel scaling. Experiments across various language models support these findings.

研究探讨了如何在推理时高效分配计算资源以提高大型语言模型的推理能力。它研究了是优先采用顺序扩展（较长的推理链）还是并行扩展（多个较短的推理链）更为有效。研究发现，在某些图连通性问题上，顺序扩展提供了并行扩展的指数级优势。实验涵盖了各种语言模型，包括为图连通性训练的模型和大型推理模型，均支持这些发现。
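
Parallel scaling, as named in the abstract, aggregates several short chains of thought by majority vote. The snippet below is a minimal sketch of that aggregation step only; the sampled answers are made up for illustration, and the paper's exponential-separation result is theoretical and is not reproduced here.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer among independently sampled chains."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]


# Hypothetical final answers from five short chains of thought on a
# graph-connectivity query ("are nodes s and t connected?").
short_chain_answers = ["yes", "no", "yes", "yes", "no"]
voted = majority_vote(short_chain_answers)
```

Sequential scaling would instead spend the same budget on one longer chain; the paper argues that for certain graph distributions no practical number of votes can make up for that missing depth.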

WaveRNet: Wavelet-Guided Frequency Learning for Multi-Source Domain-Generalized Retinal Vessel Segmentation

Authors: Chanchan Wang, Yuanfang Wang, Qing Xu, Guanxin Chen

First: 2026-01-09T16:58:29+00:00 · Latest: 2026-01-09T16:58:29+00:00

Abstract

Domain-generalized retinal vessel segmentation is critical for automated ophthalmic diagnosis, yet faces significant challenges from domain shift induced by non-uniform illumination and varying contrast, compounded by the difficulty of preserving fine vessel structures. While the Segment Anything Model (SAM) exhibits remarkable zero-shot capabilities, existing SAM-based methods rely on simple adapter fine-tuning while overlooking frequency-domain information that encodes domain-invariant features, resulting in degraded generalization under illumination and contrast variations. Furthermore, SAM's direct upsampling inevitably loses fine vessel details. To address these limitations, we propose WaveRNet, a wavelet-guided frequency learning framework for robust multi-source domain-generalized retinal vessel segmentation. Specifically, we devise a Spectral-guided Domain Modulator (SDM) that integrates wavelet decomposition with learnable domain tokens, enabling the separation of illumination-robust low-frequency structures from high-frequency vessel boundaries while facilitating domain-specific feature generation. Furthermore, we introduce a Frequency-Adaptive Domain Fusion (FADF) module that performs intelligent test-time domain selection through wavelet-based frequency similarity and soft-weighted fusion. Finally, we present a Hierarchical Mask-Prompt Refiner (HMPR) that overcomes SAM's upsampling limitation through coarse-to-fine refinement with long-range dependency modeling. Extensive experiments under the Leave-One-Domain-Out protocol on four public retinal datasets demonstrate that WaveRNet achieves state-of-the-art generalization performance. The source code is available at https://github.com/Chanchan-Wang/WaveRNet.

中文标题/摘要

标题：WaveRNet：小波引导的频域学习在多源泛化视网膜血管分割中的应用

领域泛化的视网膜血管分割对于自动化眼科诊断至关重要，但面临由非均匀照明和对比度变化引起的领域偏移带来的重大挑战，而保留细小血管结构的困难又加剧了这一问题。尽管Segment Anything Model（SAM）展现出卓越的零样本能力，但现有基于SAM的方法依赖于简单的适配器微调，忽略了编码领域不变特征的频域信息，导致在照明和对比度变化下泛化能力下降。此外，SAM的直接上采样不可避免地会丢失细小血管细节。为解决这些局限性，我们提出WaveRNet，一种小波引导的频率学习框架，用于稳健的多源领域泛化视网膜血管分割。具体而言，我们设计了一种频谱引导的领域调制器（SDM），将小波分解与可学习的领域标记相结合，实现照明鲁棒的低频结构与高频血管边界的分离，同时促进领域特定特征的生成。此外，我们引入了频率自适应领域融合（FADF）模块，通过基于小波的频率相似性和软加权融合进行智能的测试时领域选择。最后，我们提出了一种分层掩码提示精炼器（HMPR），通过结合长程依赖建模的由粗到细精炼，克服SAM的上采样限制。在四个公开视网膜数据集上采用留一领域（Leave-One-Domain-Out）协议进行的大量实验表明，WaveRNet 达到了最先进的泛化性能。源代码可在 https://github.com/Chanchan-Wang/WaveRNet 获取。

Summary / 总结

WaveRNet is a wavelet-guided frequency learning framework designed for robust multi-source domain-generalized retinal vessel segmentation. It introduces a Spectral-guided Domain Modulator (SDM) for separating illumination-robust low-frequency structures from high-frequency vessel boundaries, and a Frequency-Adaptive Domain Fusion (FADF) module for test-time domain selection through wavelet-based frequency similarity and soft-weighted fusion. Additionally, it includes a Hierarchical Mask-Prompt Refiner (HMPR) that overcomes SAM's upsampling limitation through coarse-to-fine refinement. Experiments under the Leave-One-Domain-Out protocol on four public retinal datasets show that WaveRNet achieves state-of-the-art generalization performance.

WaveRNet 是一种小波引导的频率学习框架，旨在实现稳健的多源领域泛化视网膜血管分割。它引入了频谱引导的域调制器（SDM）来分离照明鲁棒的低频结构和高频血管边界，并包含一个频率自适应域融合（FADF）模块，通过基于小波的频率相似性和软加权融合进行智能域选择。此外，它还包含一个分层掩码提示精炼器（HMPR），通过由粗到细的精炼来改进分割结果。实验表明，在四个公开视网膜数据集的留一领域协议下，WaveRNet 达到了最先进的泛化性能。
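
The wavelet decomposition that WaveRNet builds on splits an image into one low-frequency band and several high-frequency detail bands. The sketch below shows a one-level 2D Haar transform in plain Python; the choice of the Haar wavelet, the function name `haar_dwt2`, and the toy image are assumptions made for illustration, since the abstract does not state which wavelet the paper uses.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform of an image with even height and width.

    Returns (approx, row_detail, col_detail, diag_detail): the low-frequency
    approximation and three high-frequency bands, each half the input size.
    Assumption: Haar is used here only as the simplest wavelet.
    """
    approx, row_detail, col_detail, diag_detail = [], [], [], []
    for r in range(0, len(image), 2):
        bands = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            bands[0].append((a + b + d + e) / 2)  # scaled local average
            bands[1].append((a + b - d - e) / 2)  # change between the two rows
            bands[2].append((a - b + d - e) / 2)  # change between the two columns
            bands[3].append((a - b - d + e) / 2)  # diagonal change
        approx.append(bands[0])
        row_detail.append(bands[1])
        col_detail.append(bands[2])
        diag_detail.append(bands[3])
    return approx, row_detail, col_detail, diag_detail


# Toy 4x4 image with a bright vertical "vessel" in the third column.
vessel = [[0, 0, 9, 0] for _ in range(4)]
# The same image under a uniform brightness offset.
brighter = [[pixel + 4 for pixel in row] for row in vessel]

bands_vessel = haar_dwt2(vessel)
bands_brighter = haar_dwt2(brighter)
```

In this toy case the uniform brightness offset changes only the approximation band, while the detail bands that carry the vessel edge are identical for both images.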

Collective Communication for 100k+ GPUs

Authors: Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Regina Ren, Deep Shah, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Yulun Wang, Bruce Wu, Xinfeng Xie, Jingyi Yang, Mingran Yang, Kenny Yu, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Prashanth Kannan, Cristian Lumezanu, Rui Miao, Zhe Qu, Venkat Ramesh, Maxim Samoylov, Jan Seidel, Srikanth Sundaresan, Feng Tian, Qiye Tan, Shuqiang Zhang, Yimeng Zhao, Shengbao Zheng, Art Zhu, Hongyi Zeng

First: 2025-10-23T03:32:04+00:00 · Latest: 2026-01-09T16:53:26+00:00

Abstract

The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face significant throughput and latency limitations at this scale, hindering both the development and deployment of state-of-the-art models. This paper presents the NCCLX collective communication framework, developed at Meta, engineered to optimize performance across the full LLM lifecycle, from the synchronous demands of large-scale training to the low-latency requirements of inference. The framework is designed to support complex workloads on clusters exceeding 100,000 GPUs, ensuring reliable, high-throughput, and low-latency data exchange. Empirical evaluation on the Llama4 model demonstrates substantial improvements in communication efficiency. This research contributes a robust solution for enabling the next generation of LLMs to operate at unprecedented scales.

中文标题/摘要

标题：10万+ GPU的集体通信

大型语言模型（LLMs）规模的增加需要高效的集体通信框架，尤其是在训练工作负载扩展到数十万GPU时。传统通信方法在这一规模下面临显著的吞吐量和延迟限制，阻碍了先进模型的开发和部署。本文介绍了Meta开发的NCCLX集体通信框架，旨在优化从大规模训练的同步需求到推理所需的低延迟要求的整个LLM生命周期的性能。该框架设计用于支持超过10万GPU的集群上的复杂工作负载，确保可靠、高吞吐量和低延迟的数据交换。对Llama4模型的实证评估显示了通信效率的显著提升。这项研究为使下一代LLM能够在前所未有的规模上运行提供了稳健的解决方案。

Summary / 总结

The paper introduces NCCLX, a collective communication framework designed to enhance the performance of large language model training and inference across clusters with over 100,000 GPUs. NCCLX addresses the throughput and latency challenges faced by traditional communication methods at such a scale. Experimental results on the Llama4 model show significant improvements in communication efficiency, supporting the development and deployment of advanced LLMs.

该论文介绍了NCCLX，这是一种针对超过100,000个GPU训练大型语言模型（LLMs）的集体通信框架。传统通信方法在如此大规模下面临吞吐量和延迟的限制，限制了模型的开发和部署。NCCLX优化了从大规模训练到推理的通信性能，确保了可靠、高吞吐量和低延迟的数据交换。实验结果表明，NCCLX在Llama4模型上的通信效率有了显著提升。

Context-Aware Decoding for Faithful Vision-Language Generation

Authors: Mehrdad Fazli, Bowen Wei, Ziwei Zhu

First: 2026-01-09T16:50:57+00:00 · Latest: 2026-01-09T16:50:57+00:00

Abstract

Hallucinations, generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.

中文标题/摘要

标题：基于上下文的解码以实现忠实的跨模态语言生成

幻觉，即生成与视觉输入不一致的响应，仍然是大型跨模态语言模型（LVLMs）的一个关键限制，尤其是在开放性任务如图像字幕和视觉推理中。在本文中，我们探究了导致幻觉的逐层生成动态，并提出了一种无需训练的缓解策略。利用Logit Lens，我们检查了LVLMs在解码器各层如何构建下一个词的概率分布，发现了一种显著的可信度深度差距：真实的词更早地将概率质量集中在它们的最终候选词上，而幻觉的词则不然。基于这一发现，我们引入了上下文嵌入注入（CEI）方法，这是一种轻量级的方法，利用最后一个输入词的隐藏状态——上下文嵌入——作为接地信号，以在整个解码过程中保持视觉保真度并抑制幻觉。在CHAIR、AMBER和MMHal-Bench基准测试（最大词长512）上，CEI在三种LVLMs中均优于最先进的基线方法，其动态变体的幻觉率最低。通过将新颖的机制见解与可扩展的干预措施相结合，本文推进了LVLMs中幻觉的缓解。

Summary / 总结

This work addresses the issue of hallucinations in large vision-language models (LVLMs) by analyzing the layer-wise generation dynamics and proposing a training-free mitigation strategy called Context Embedding Injection (CEI). CEI uses the hidden state of the last input token as a grounding signal to maintain visual fidelity and reduce hallucinations. The method outperforms state-of-the-art baselines across three LVLMs on the CHAIR, AMBER, and MMHal-Bench benchmarks, with the dynamic variant achieving the lowest hallucination rates.

该研究通过分析层间生成动态并提出Context Embedding Injection (CEI) 作为无需训练的缓解策略，来解决大型视觉语言模型（LVLM）中的幻觉问题。CEI 利用最后一个输入词的上下文嵌入作为接地信号，以保持解码过程中的视觉一致性。实验结果表明，CEI 在 CHAIR、AMBER 和 MMHal-Bench 基准测试上优于现有基线，并且能够降低不同模型中的幻觉率。
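
The Logit Lens analysis mentioned in the abstract reads out a next-token distribution from each decoder layer by projecting its hidden state through the unembedding matrix. The sketch below shows that projection on toy numbers. Everything specific here is an assumption: the helper names, the omission of the final layer norm that a real model would apply before unembedding, and the definition of `commitment_depth`, which is an illustrative stand-in and not necessarily the paper's measure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states, unembed):
    """Project each layer's hidden state through the unembedding matrix.

    hidden_states: one vector per decoder layer, for a single position.
    unembed: vocab_size x hidden_dim matrix given as a list of rows.
    Returns one next-token distribution per layer.
    """
    dists = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        dists.append(softmax(logits))
    return dists


def commitment_depth(dists):
    """Earliest layer from which the final layer's top token stays top-1.

    Assumption: an illustrative definition, not taken from the paper.
    """
    def top(dist):
        return max(range(len(dist)), key=dist.__getitem__)

    final_token = top(dists[-1])
    depth = len(dists) - 1
    for layer in range(len(dists) - 1, -1, -1):
        if top(dists[layer]) != final_token:
            break
        depth = layer
    return depth


# Toy model: 3-token vocabulary, identity unembedding, 4 decoder layers.
unembed = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
hidden_states = [[1, 0, 0], [0.5, 0.6, 0], [0, 2, 0], [0, 3, 0]]
dists = logit_lens(hidden_states, unembed)
depth = commitment_depth(dists)
```

Under this toy definition, a token that the model commits to early gets a small depth, which is the kind of signal the abstract's commitment-depth gap compares between truthful and hallucinated tokens.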

Can We Predict Before Executing Machine Learning Agents?

Authors: Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, Ningyu Zhang

First: 2026-01-09T16:44:17+00:00 · Latest: 2026-01-09T16:44:17+00:00

Comments: Work in progress

Abstract

Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate-Execute-Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%. Our code and dataset will be publicly available soon at https://github.com/zjunlp/predict-before-execute.

中文标题/摘要

标题：我们能在执行机器学习代理之前进行预测吗？

自主机器学习代理已革新了科学研究，但仍然受限于生成-执行-反馈的范式。先前的方法因假设评估严格依赖昂贵的物理执行而遭受严重的执行瓶颈。为绕过这些物理限制，我们将执行先验内化，用即时的预测推理替代昂贵的运行时检查，借鉴了世界模型的灵感。在本文中，我们形式化了数据为中心的解决方案偏好任务，并构建了一个包含18,438对比较的全面语料库。我们证明，当用验证数据分析报告进行预热时，LLM表现出显著的预测能力，准确率达到61.5%，并且具有稳健的信心校准。最后，我们在FOREAGENT代理中实例化了这一框架，该代理采用预测-验证循环，实现了6倍的收敛加速，并且优于基于执行的基线+6%。我们的代码和数据集将在不久的将来在https://github.com/zjunlp/predict-before-execute公开。

Summary / 总结

This work addresses the execution bottleneck of current machine learning agents, whose hypothesis evaluation depends on expensive physical execution, by predicting outcomes before running them. The authors formalize the task of Data-centric Solution Preference and build a corpus of 18,438 pairwise comparisons. They show that Large Language Models (LLMs) primed with a Verified Data Analysis Report reach 61.5% accuracy with robust confidence calibration. FOREAGENT, an agent built on a Predict-then-Verify loop, converges 6x faster while surpassing execution-based baselines by 6%.

该研究针对当前机器学习代理的执行瓶颈，提出预测-验证循环，在物理执行之前预测结果。作者形式化了以数据为中心的解决方案偏好任务，并构建了一个包含18,438对比较的数据集。他们表明，以经过验证的数据分析报告作为提示的大语言模型可以达到61.5%的预测准确率，并具有稳健的置信度校准。基于该循环的FOREAGENT代理实现了6倍的收敛加速，并比基于执行的基线高出6%。
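
A Predict-then-Verify loop, as described for FOREAGENT, can be pictured as ranking candidate solutions with a cheap predictor and executing only the most promising ones. The sketch below is a simplification under stated assumptions: the paper frames prediction as pairwise preference, whereas this example uses a scalar score, and the names `predict_then_verify`, `predict_score`, `execute`, and `verify_top_k` are hypothetical.

```python
def predict_then_verify(candidates, predict_score, execute, verify_top_k=1):
    """Rank candidates with a cheap predictor, then execute only the top few.

    Assumption: `predict_score`, `execute`, and `verify_top_k` are hypothetical
    stand-ins; FOREAGENT's actual loop may differ.
    """
    ranked = sorted(candidates, key=predict_score, reverse=True)
    best, best_result, executions = None, float("-inf"), 0
    for candidate in ranked[:verify_top_k]:
        result = execute(candidate)  # the expensive step being rationed
        executions += 1
        if result > best_result:
            best, best_result = candidate, result
    return best, best_result, executions


# Made-up predicted and true validation scores for three candidate solutions.
predicted = {"baseline": 0.2, "feature_eng": 0.9, "deep_model": 0.6}
true_scores = {"baseline": 0.70, "feature_eng": 0.78, "deep_model": 0.74}

best, score, runs = predict_then_verify(
    list(true_scores), predicted.get, true_scores.get, verify_top_k=2
)
```

Here only two of the three candidates are executed; the quality of the outcome then depends on how well the predictor's ranking matches the true ranking.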

Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens

Authors: Yohann Perron, Vladyslav Sydorov, Christophe Pottier, Loic Landrieu

First: 2026-01-09T16:41:08+00:00 · Latest: 2026-01-09T16:41:08+00:00

Comments: 13 pages + 3 pages of supplementary material

Abstract

Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/.

中文标题/摘要

标题：将视觉变换器适应到超高清语义分割中的接力标记

当前对超高清图像进行分割的方法要么滑动窗口，从而丢弃全局上下文，要么下采样并丢失细节。我们提出了一种简单而有效的方法，将显式的多尺度推理引入视觉变换器，同时保留局部细节和全局意识。具体来说，我们并行处理每张图像的局部尺度（高分辨率，小裁剪）和全局尺度（低分辨率，大裁剪），并通过一组可学习的接力标记在两个分支之间聚合和传播特征。该设计可以直接插入标准的变换器骨干网络（例如ViT和Swin），并添加不到2%的参数。在三个超高清分割基准数据集Archaeoscape、URUR和Gleason以及传统的Cityscapes数据集上的广泛实验表明，该方法具有一致的改进，相对mIoU提高高达15%。代码和预训练模型可在https://archaeoscape.ai/work/relay-tokens/ 获取。

Summary / 总结

The paper addresses the challenge of ultra-high resolution semantic segmentation by proposing a method that combines local and global reasoning using relay tokens. This approach processes images at both high and low resolutions and aggregates features between the two scales, enhancing both local details and global context. Experiments on several benchmarks show significant improvements, with up to 15% relative mIoU gain compared to existing methods.

论文提出了一种结合局部和全局推理的方法，使用relay tokens处理超高清图像的语义分割问题。该方法在高分辨率和低分辨率下同时处理图像，并在两个尺度之间聚合特征，增强局部细节和全局上下文。实验结果显示，在多个基准数据集上取得了显著改进，相对mIoU提升高达15%。

Cedalion Tutorial: A Python-based framework for comprehensive analysis of multimodal fNIRS & DOT from the lab to the everyday world

Authors: E. Middell, L. Carlton, S. Moradi, T. Codina, T. Fischer, J. Cutler, S. Kelley, J. Behrendt, T. Dissanayake, N. Harmening, M. A. Yücel, D. A. Boas, A. von Lühmann

First: 2026-01-09T16:37:48+00:00 · Latest: 2026-01-09T16:37:48+00:00

Comments: 33 pages main manuscript, 180 pages Supplementary Tutorial Notebooks, 12 figures, 6 tables, under review in SPIE Neurophotonics

Abstract

Functional near-infrared spectroscopy (fNIRS) and diffuse optical tomography (DOT) are rapidly evolving toward wearable, multimodal, and data-driven, AI-supported neuroimaging in the everyday world. However, current analytical tools are fragmented across platforms, limiting reproducibility, interoperability, and integration with modern machine learning (ML) workflows. Cedalion is a Python-based open-source framework designed to unify advanced model-based and data-driven analysis of multimodal fNIRS and DOT data within a reproducible, extensible, and community-driven environment. Cedalion integrates forward modelling, photogrammetric optode co-registration, signal processing, GLM Analysis, DOT image reconstruction, and ML-based data-driven methods within a single standardized architecture based on the Python ecosystem. It adheres to SNIRF and BIDS standards, supports cloud-executable Jupyter notebooks, and provides containerized workflows for scalable, fully reproducible analysis pipelines that can be provided alongside original research publications. Cedalion connects established optical-neuroimaging pipelines with ML frameworks such as scikit-learn and PyTorch, enabling seamless multimodal fusion with EEG, MEG, and physiological data. It implements validated algorithms for signal-quality assessment, motion correction, GLM modelling, and DOT reconstruction, complemented by modules for simulation, data augmentation, and multimodal physiology analysis. Automated documentation links each method to its source publication, and continuous-integration testing ensures robustness. This tutorial paper provides seven fully executable notebooks that demonstrate core features. Cedalion offers an open, transparent, and community extensible foundation that supports reproducible, scalable, cloud- and ML-ready fNIRS/DOT workflows for laboratory-based and real-world neuroimaging.

中文标题/摘要

标题：Cedalion教程：一种基于Python的框架，用于从实验室到日常世界的多模态fNIRS & DOT全面分析

功能性近红外光谱成像（fNIRS）和弥散光学断层成像（DOT）正迅速向可穿戴、多模态和数据驱动的AI支持神经成像发展，适用于日常世界。然而，当前的分析工具分散在不同的平台上，限制了可重复性、互操作性和与现代机器学习（ML）工作流的集成。Cedalion是一种基于Python的开源框架，旨在在一个可重复、可扩展和社区驱动的环境中统一先进的基于模型和数据驱动的多模态fNIRS和DOT数据的分析。Cedalion将前向建模、摄影测量光极配准、信号处理、GLM分析、DOT图像重建和基于机器学习的数据驱动方法整合在一个基于Python生态系统的标准化架构中。它遵循SNIRF和BIDS标准，支持云可执行的Jupyter笔记本，并提供容器化的工作流，以实现可扩展、完全可重复的分析管道，这些管道可以与原始研究出版物一起提供。Cedalion将已建立的光学神经成像管道与scikit-learn和PyTorch等机器学习框架连接起来，从而实现与EEG、MEG和生理数据的无缝多模态融合。它实现了经过验证的信号质量评估、运动校正、GLM建模和DOT重建算法，并配有模拟、数据增强和多模态生理分析模块。自动化文档将每种方法链接到其原始出版物，并通过持续集成测试确保其稳健性。这篇教程论文提供了七个完全可执行的笔记本，演示了核心功能。Cedalion提供了一个开放、透明和社区可扩展的基础，支持实验室和现实世界神经成像的可重复、可扩展、云和机器学习就绪的fNIRS/DOT工作流。

Agentic LLMs as Powerful Deanonymizers: Re-identification of Participants in the Anthropic Interviewer Dataset

Authors: Tianshi Li

First: 2026-01-09T16:32:33+00:00 · Latest: 2026-01-09T16:32:33+00:00

Comments: 4 pages

Abs · PDF · Code1 · Code2

Abstract

On December 4, 2025, Anthropic released Anthropic Interviewer, an AI tool for running qualitative interviews at scale, along with a public dataset of 1,250 interviews with professionals, including 125 scientists, about their use of AI for research. Focusing on the scientist subset, I show that widely available LLMs with web search and agentic capabilities can link six out of twenty-four interviews to specific scientific works, recovering associated authors and, in some cases, uniquely identifying the interviewees. My contribution is to show that modern LLM-based agents make such re-identification attacks easy and low-effort: off-the-shelf tools can, with a few natural-language prompts, search the web, cross-reference details, and propose likely matches, effectively lowering the technical barrier. Existing safeguards can be bypassed by breaking down the re-identification into benign tasks. I outline the attack at a high level, discuss implications for releasing rich qualitative data in the age of LLM agents, and propose mitigation recommendations and open problems. I have notified Anthropic of my findings.

中文标题/摘要

标题：代理型大语言模型作为强大的匿名破解者：Anthropic采访者数据集中参与者再识别

2025年12月4日，Anthropic发布了Anthropic Interviewer，这是一种用于大规模进行定性访谈的AI工具，以及包含1250次访谈的数据集，访谈对象包括125名科学家，他们讨论了自己在研究中使用AI的情况。聚焦于科学家子集，我展示了广泛可用的具有网络搜索和代理能力的大语言模型可以将24次访谈中的六次与特定的科学作品联系起来，恢复相关作者，并在某些情况下唯一地识别访谈对象。我的贡献在于展示现代基于大语言模型的代理使得此类再识别攻击变得容易且低耗：现成的工具可以通过几个自然语言提示，在网络上搜索、交叉引用细节并提出可能的匹配，从而降低技术门槛。现有的防护措施可以通过将再识别分解为无害的任务而被绕过。我从宏观上概述了攻击过程，讨论了在大语言模型代理时代发布丰富定性数据的含义，并提出了缓解建议和开放问题。我已经通知Anthropic我的发现。

Summary / 总结

The research aims to demonstrate the re-identification capabilities of modern LLMs with agentic and web-search abilities, using the Anthropic Interviewer dataset. The study shows that off-the-shelf LLMs can link six out of twenty-four interviews to specific scientific works and, in some cases, uniquely identify the interviewees. This highlights the ease with which re-identification attacks can be carried out, even with existing safeguards, and suggests that breaking down the task into benign steps can bypass these protections. The findings underscore the need for stronger data protection measures in the era of advanced LLMs.

研究旨在使用Anthropic Interviewer数据集，展示现代具有代理和网络搜索能力的LLM的重新识别能力。研究表明，现成的LLM可以将二十四份访谈中的六份与特定的科学作品联系起来，并在某些情况下唯一地识别出访谈对象。这表明即使有现有的保护措施，重新识别攻击也容易实施，而且将任务分解为良性步骤可以绕过这些保护。研究结果强调了在先进LLM时代需要更强的数据保护措施。

From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level

Authors: Jia Li, Yuxin Su, Michael R. Lyu

First: 2026-01-07T09:22:28+00:00 · Latest: 2026-01-09T16:30:25+00:00

Abs · PDF · Code1 · Code2

Abstract

As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fluctuate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization while preserving authentic logical depth, we implement an execution-driven mutation framework that utilizes the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system using dynamic program slicing, quantifying reasoning via three orthogonal metrics: $ESV$ (reading load), $MCL$ (simulation depth), and $DFI$ (integration width). Comprehensive evaluations of frontier models (e.g., Claude-4.5-Sonnet, DeepSeek-v3.1-Terminus) reveal a prevalent aggregation deficit, where integration width serves as the primary cognitive bottleneck. Our findings provide granular white-box insights for optimizing the next generation of agentic software engineering.

中文标题/摘要

标题：从实验室到实际应用：在仓库级别基准测试代理代码推理

随着大型语言模型（LLMs）演变为自主代理，评估仓库级别的推理能力，即在庞大、真实且相互依赖的文件系统中保持逻辑一致性的能力，变得至关重要。当前的基准测试通常在孤立的代码片段与黑盒评估之间摇摆。我们提出了RepoReason，这是一种以溯因断言验证为中心的白盒诊断基准测试。为了消除记忆现象同时保持真实的逻辑深度，我们实现了一个基于执行的变异框架，利用环境作为语义预言机来重新生成真实状态。此外，我们使用动态程序切片建立了一个细粒度的诊断系统，通过三个正交的度量标准量化推理：$ESV$（阅读负载）、$MCL$（模拟深度）和$DFI$（集成宽度）。对前沿模型（例如Claude-4.5-Sonnet、DeepSeek-v3.1-Terminus）的全面评估揭示了普遍存在的聚合缺陷，其中集成宽度是主要的认知瓶颈。我们的研究结果为优化下一代代理软件工程提供了细粒度的白盒洞察。

Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

Authors: Phuong-Hang Le, Valentin Pelloin, Arnault Chatelain, Maryem Bouziane, Mohammed Ghennai, Qianwen Guan, Kirill Milintsevich, Salima Mdhaffar, Aidan Mannion, Nils Defauw, Shuyue Gu, Alexandre Audibert, Marco Dinarelli, Yannick Estève, Lorraine Goeuriot, Steffen Lalande, Nicolas Hervé, Maximin Coavoux, François Portet, Étienne Ollion, Marie Candito, Maxime Peyrard, Solange Rossato, Benjamin Lecouteux, Aurélie Nardy, Gilles Sérasset, Vincent Segonne, Solène Evain, Diandra Fabre, Didier Schwab

First: 2026-01-09T16:28:25+00:00 · Latest: 2026-01-09T16:28:25+00:00

Abs · PDF · Code1 · Code2

Abstract

We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.

中文标题/摘要

标题：Pantagruel：统一的自监督编码器用于法语文本和语音

我们发布了Pantagruel模型，这是一个新的自监督编码器模型系列，适用于法语文本和语音。Pantagruel 不是预测特定模态的目标，如文本标记或语音单元，而是在特征空间中学习上下文化的目标表示，使得特定模态的编码器能够更有效地捕捉语言和声学规律。分别在大规模法语语料库上预训练了独立模型，包括Wikipedia、OSCAR和CroissantLLM用于文本，以及MultilingualLibriSpeech、LeBenchmark和INA-100k用于语音。INA-100k是一个新引入的100,000小时的法语音频语料库，来源于法国国家视听研究院（INA）的档案，该机构是法国广播电视节目的国家存档机构，提供了高度多样化的音频数据。我们在涵盖两个模态的广泛下游任务中评估了Pantagruel，包括标准法语基准测试中的FLUE或LeBenchmark等任务。在这些任务中，Pantagruel模型在与强大的法语基线模型CamemBERT、FlauBERT和LeBenchmark2.0相比时，显示出相当或更优的性能，同时保持了可以无缝处理语音或文本输入的共享架构。这些结果证实了特征空间自监督目标在法语表示学习中的有效性，并突显了Pantagruel作为多模态语音-文本理解的稳健基础。

Summary / 总结

Pantagruel models are self-supervised encoders for French text and speech, learning contextualized target representations in the feature space. They are pre-trained on diverse corpora including Wikipedia, OSCAR, CroissantLLM for text, and MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. Evaluation across various downstream tasks shows Pantagruel models perform competitively or better than strong French baselines like CamemBERT and FlauBERT, demonstrating the effectiveness of feature-space self-supervised objectives for French representation learning.

Pantagruel 是一种用于法语文本和语音的自监督编码模型，它在特征空间中学习上下文化的目标表示。它基于大规模的法语文本和语音语料库进行预训练。在各种下游任务中，Pantagruel 模型的表现与强大的法语基线相当或更优，同时保持一个可以处理语音或文本输入的共享架构。这证实了特征空间自监督目标在法语表示学习中的有效性。

Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency

Authors: Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang

First: 2026-01-09T16:23:21+00:00 · Latest: 2026-01-09T16:23:21+00:00

Comments: Work in progress

Abs · PDF · Code1 · Code2 · Code3

Abstract

As Large Language Models (LLMs) are increasingly deployed in real-world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point-wise confidence measures such as Self-Consistency, which can mask brittle beliefs. We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor-Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the efficiency of NCB, we introduce a new cognitive stress-testing protocol that probes output stability under contextual interference. Experiments across multiple LLMs show that the performance of high-NCB data is relatively more resistant to interference. Finally, we present Structure-Aware Training (SAT), which optimizes context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%. Code will be available at https://github.com/zjunlp/belief.

中文标题/摘要

标题：自信的幻象？通过邻域一致性诊断LLM的可信度

随着大型语言模型（LLMs）在现实世界中的应用越来越广泛，仅仅正确性是不够的。可靠的部署需要在情境扰动下保持真实的信念。现有的评估主要依赖于点式的置信度，如自我一致性，这可能会掩盖脆弱的信念。我们表明，即使是在完美自我一致性下回答的事实，也可能会在轻微的情境干扰下迅速崩溃。为了解决这一差距，我们提出了邻域一致性信念（NCB），这是一种结构化的信念稳健性度量，通过考察概念邻域内的响应一致性来进行评估。为了验证NCB的有效性，我们引入了一种新的认知压力测试协议，该协议在情境干扰下探测输出的稳定性。跨多个LLM的实验表明，高NCB数据在干扰下的表现相对更具有抵抗力。最后，我们提出了结构感知训练（SAT），它优化了情境不变的信念结构，并将长尾知识的脆弱性降低了约30%。代码将在https://github.com/zjunlp/belief上提供。

Summary / 总结

This study addresses the need for Large Language Models (LLMs) to maintain truthful beliefs under contextual perturbations, beyond mere correctness. It introduces Neighbor-Consistency Belief (NCB) as a measure of belief robustness, evaluating response coherence across a conceptual neighborhood. Experiments across multiple LLMs demonstrate that high-NCB data is more resistant to interference, and the proposed Structure-Aware Training (SAT) reduces long-tail knowledge brittleness by about 30%.

研究关注大型语言模型（LLMs）在单纯正确性之外、于上下文扰动下保持真实信念的需求。提出了邻域一致性信念（NCB），通过评估概念邻域内的响应一致性来衡量信念的稳健性。实验结果显示，高NCB数据对干扰更具抵抗力。此外，提出了结构感知训练（SAT）来优化上下文不变的信念结构，将长尾知识的脆弱性降低约30%。
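
The contrast between point-wise self-consistency and a neighborhood-level measure can be illustrated with a small sketch. The abstract does not give the actual NCB formula, so `neighborhood_consistency` below is a hypothetical stand-in (a plain average of agreement rates over the target question and its neighbors), and the sampled answers are made-up illustrative data, not results from the paper.

```python
from collections import Counter


def self_consistency(answers):
    """Fraction of sampled answers that match the most common answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def neighborhood_consistency(target_answers, neighbor_answers):
    """Hypothetical neighborhood score (an assumption, not the paper's NCB):
    mean agreement rate over the target question and its neighbor questions."""
    groups = [target_answers, *neighbor_answers]
    return sum(self_consistency(g) for g in groups) / len(groups)


# Made-up samples: the target fact is answered identically every time,
# while answers to two related (neighbor) questions are less stable.
target = ["Paris", "Paris", "paris", "Paris"]
neighbors = [
    ["France", "France", "Belgium", "France"],
    ["Seine", "Loire", "Seine", "Rhone"],
]

point_score = self_consistency(target)
neighborhood_score = neighborhood_consistency(target, neighbors)
print(point_score, neighborhood_score)
```

In this toy case the target question looks perfectly consistent on its own, while the neighborhood average is lower, which is the kind of gap the abstract says point-wise confidence can hide.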

Can AI mediation improve democratic deliberation?

Authors: Michael Henry Tessler, Georgina Evans, Michiel A. Bakker, Iason Gabriel, Sophie Bridgers, Rishub Jain, Raphael Koster, Verena Rieser, Anca Dragan, Matthew Botvinick, Christopher Summerfield

Venue: Knight Institute for the First Amendment at Columbia University Symposium on "AI and Democratic Freedoms", April 10-11, 2025

First: 2026-01-09T16:22:26+00:00 · Latest: 2026-01-09T16:22:26+00:00

Abs · PDF · Code1 · Code2

Abstract

The strength of democracy lies in the free and equal exchange of diverse viewpoints. Living up to this ideal at scale faces inherent tensions: broad participation, meaningful deliberation, and political equality often trade off with one another (Fishkin, 2011). We ask whether and how artificial intelligence (AI) could help navigate this "trilemma" by engaging with a recent example of a large language model (LLM)-based system designed to help people with diverse viewpoints find common ground (Tessler, Bakker, et al., 2024). Here, we explore the implications of the introduction of LLMs into deliberation augmentation tools, examining their potential to enhance participation through scalability, improve political equality via fair mediation, and foster meaningful deliberation by, for example, surfacing trustworthy information. We also point to key challenges that remain. Ultimately, a range of empirical, technical, and theoretical advancements are needed to fully realize the promise of AI-mediated deliberation for enhancing citizen engagement and strengthening democratic deliberation.

中文标题/摘要

标题：AI 调解能否改善民主讨论？

民主的力量在于自由和平等的多元观点交流。在大规模实现这一理想时，存在固有的紧张关系：广泛的参与、有意义的讨论和政治平等往往彼此权衡（Fishkin, 2011）。我们探讨 AI 是否以及如何通过与一个大型语言模型（LLM）为基础的系统互动来帮助解决这一“三难困境”，该系统旨在帮助持有不同观点的人找到共同点（Tessler, Bakker, 等, 2024）。在这里，我们探讨将 LLM 引入讨论增强工具的含义，考察其通过可扩展性增强参与、通过公平调解提高政治平等以及通过例如展示可信信息促进有意义讨论的潜力。我们还指出了存在的关键挑战。最终，需要一系列实证、技术和理论上的进步，以充分利用 AI 调解讨论的潜力，增强公民参与并加强民主讨论。

Summary / 总结

The paper explores whether artificial intelligence, particularly large language models (LLMs), can help improve democratic deliberation by enhancing participation, political equality, and meaningful deliberation. The study examines an LLM-based system designed to facilitate common ground among diverse viewpoints and finds potential for AI to scale participation, mediate fairly, and surface trustworthy information. However, key challenges remain, necessitating further empirical, technical, and theoretical advancements to fully realize AI's promise in democratic deliberation.

论文探讨了人工智能，尤其是大型语言模型（LLM），是否能通过增强参与度、政治平等和有意义的讨论来改善民主协商。研究分析了一种旨在促进不同观点之间达成共识的LLM系统，并发现AI有可能扩大参与规模、公平调解并提供可信信息。然而，仍存在一些关键挑战，需要进一步的实证、技术和理论进步，以充分实现AI在民主协商中的潜力。

HAPS: Hierarchical LLM Routing with Joint Architecture and Parameter Search

Authors: Zihang Tian, Rui Li, Jingsen Zhang, Xiaohe Bo, Wei Huo, Xu Chen

First: 2026-01-09T16:22:25+00:00 · Latest: 2026-01-09T16:22:25+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language model (LLM) routing aims to exploit the specialized strengths of different LLMs for diverse tasks. However, existing approaches typically focus on selecting LLM architectures while overlooking parameter settings, which are critical for task performance. In this paper, we introduce HAPS, a hierarchical LLM routing framework that jointly searches over model architectures and parameters. Specifically, we use a high-level router to select among candidate LLM architectures, and then search for the optimal parameters for the selected architectures based on a low-level router. We design a parameter generation network to share parameters between the two routers to mutually enhance their capabilities. In the training process, we design a reward-augmented objective to effectively optimize our framework. Experiments on two commonly used benchmarks show that HAPS consistently outperforms strong routing baselines. We have released our code at https://github.com/zihangtian/HAPS.

中文标题/摘要

标题：HAPS：联合架构和参数搜索的分层大语言模型路由

大语言模型（LLM）路由旨在利用不同LLM的专业优势以应对多样化的任务。然而，现有方法通常侧重于选择LLM架构，而忽视了参数设置的重要性，后者对任务性能至关重要。在本文中，我们引入了HAPS，这是一种联合搜索模型架构和参数的分层LLM路由框架。具体而言，我们使用一个高层路由器在候选LLM架构之间进行选择，然后基于一个低层路由器搜索所选架构的最佳参数。我们设计了一个参数生成网络，以在两个路由器之间共享参数，从而增强它们的能力。在训练过程中，我们设计了一个奖励增强的目标来有效优化我们的框架。在两个常用基准上的实验表明，HAPS 一致地优于强大的路由基线。我们已在 https://github.com/zihangtian/HAPS/ 发布了我们的代码。

Summary / 总结

The research aims to improve large language model (LLM) routing by jointly optimizing both architecture and parameter settings. The proposed HAPS framework uses a hierarchical approach with a high-level router selecting architectures and a low-level router optimizing parameters. A parameter generation network is employed to enhance the routers' performance. Experiments on benchmark datasets demonstrate that HAPS outperforms existing routing methods.

研究旨在通过同时优化模型架构和参数设置来改进大型语言模型（LLM）路由。提出的HAPS框架采用分层方法，高层路由器选择架构，低层路由器优化参数。使用参数生成网络来增强路由器的能力。在基准数据集上的实验表明，HAPS优于现有路由方法。
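
The two-level decision described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the HAPS implementation: the routers here are fixed linear scorers over a hypothetical query feature vector, the model and setting names are placeholders, and the parameter generation network and reward-augmented training are not modelled. What "parameters" means concretely is not specified in the abstract, so candidate settings are treated as opaque labels.

```python
def dot(weights, features):
    """Linear score of a candidate for the given query features."""
    return sum(w * x for w, x in zip(weights, features))


def hierarchical_route(features, arch_weights, param_weights):
    """Pick an architecture first, then a parameter setting for it.

    arch_weights:  {architecture_name: weight_vector}            (hypothetical)
    param_weights: {architecture_name: {setting_name: weights}}  (hypothetical)
    """
    # High-level router: choose the highest-scoring candidate architecture.
    arch = max(arch_weights, key=lambda a: dot(arch_weights[a], features))
    # Low-level router: choose among settings of the selected architecture only.
    candidates = param_weights[arch]
    setting = max(candidates, key=lambda s: dot(candidates[s], features))
    return arch, setting


# Placeholder candidates and weights, chosen only to make the example run.
arch_weights = {"model_a": [1.0, 0.0], "model_b": [0.0, 1.0]}
param_weights = {
    "model_a": {"setting_1": [0.5, 0.1], "setting_2": [0.2, 0.9]},
    "model_b": {"setting_1": [0.3, 0.3], "setting_2": [0.1, 0.8]},
}

choice = hierarchical_route([0.2, 0.9], arch_weights, param_weights)
print(choice)
```

Because the low-level search is restricted to the architecture already selected, the number of scored candidates grows with the sum, not the product, of architectures and settings per architecture.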