Alterbute: Editing Intrinsic Attributes of Objects in Images
Authors: Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen
First: 2026-01-15T18:59:53+00:00 · Latest: 2026-01-15T18:59:53+00:00
Comments: Project page is available at https://talreiss.github.io/alterbute/
Abstract
We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
中文标题/摘要
标题:Alterbute:图像中对象固有属性的编辑
我们介绍了Alterbute,一种基于扩散的方法,用于编辑图像中对象的固有属性。我们允许更改对象的颜色、纹理、材质,甚至形状,同时保持其感知身份和场景上下文。现有方法要么依赖于无法保留身份的无监督先验,要么使用过于严格的监督,从而阻止有意义的固有属性变化。我们的方法依赖于:(i) 放松的训练目标,允许模型在参考身份图像、描述目标固有属性的文本提示以及定义外部上下文的背景图像和对象掩码的条件下,同时改变固有属性和外部属性;(ii) 视觉命名实体(VNEs)——细粒度的视觉身份类别(例如,“Porsche 911 Carrera”),将具有身份定义特征的对象分组,同时允许固有属性的变化。我们使用视觉语言模型从大型公共图像数据集中自动提取VNE标签和固有属性描述,从而实现可扩展、保留身份的监督。Alterbute在保留身份的对象固有属性编辑方面优于现有方法。
Summary / 总结
Alterbute is a diffusion-based method for editing an object's intrinsic attributes in images, such as color, texture, and material, while preserving the object's identity and scene context. It uses a relaxed training objective and Visual Named Entities (VNEs) to allow changes in intrinsic attributes while keeping extrinsic attributes consistent. The method outperforms existing approaches in maintaining object identity during intrinsic attribute editing.
Alterbute 是一种基于扩散的方法,用于在保持物体身份和场景上下文不变的情况下编辑图像中物体的内在属性,如颜色、纹理和材料。该方法使用宽松的训练目标和视觉命名实体(VNEs)来允许内在属性的变化,同时保持外在属性的一致性。该方法在保持物体身份不变的情况下编辑内在属性方面优于现有方法。
MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
Authors: Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
First: 2026-01-15T18:59:23+00:00 · Latest: 2026-01-15T18:59:23+00:00
Abstract
Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our code is available at https://github.com/quchangle1/MatchTIR.
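The bipartite-matching idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the `similarity` scoring rule, the `(tool_name, arguments)` tuple format, and the brute-force search are hypothetical and are not taken from the MatchTIR code. The abstract mentions two assignment strategies without describing them; this sketch shows only a one-to-one maximum-weight matching.

```python
from itertools import permutations


def similarity(pred, gold):
    """Hypothetical score: 1.0 for an exact match, 0.5 for the same tool name only."""
    if pred == gold:
        return 1.0
    if pred[0] == gold[0]:
        return 0.5
    return 0.0


def turn_rewards(preds, golds):
    """Return one reward per predicted turn from a max-weight one-to-one matching.

    Brute force over assignments, so only suitable for short traces; a real
    system would more likely use the Hungarian algorithm.
    """
    n = len(preds)
    # Each predicted turn takes a distinct gold index or stays unmatched (None).
    slots = list(range(len(golds))) + [None] * n
    best_total, best_rewards = -1.0, [0.0] * n
    for assignment in permutations(slots, n):
        rewards = [
            0.0 if j is None else similarity(preds[i], golds[j])
            for i, j in enumerate(assignment)
        ]
        total = sum(rewards)
        if total > best_total:
            best_total, best_rewards = total, rewards
    return best_rewards


# A redundant repeated call receives no reward because its gold partner is taken.
predicted = [("search", "q1"), ("search", "q1"), ("calc", "2+2")]
reference = [("search", "q1"), ("calc", "2+2")]
rewards = turn_rewards(predicted, reference)
```

Under these assumptions, a trajectory-level reward would score all three turns identically, whereas the matching gives the duplicated `search` call a reward of zero.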
中文标题/摘要
标题:MatchTIR:通过二分匹配实现细粒度监督的工具集成推理
工具集成推理(TIR)通过在推理步骤与外部工具交互之间交替,赋予大型语言模型(LLMs)处理复杂任务的能力。然而,现有的强化学习方法通常依赖于结果或轨迹级别的奖励,对轨迹内的所有步骤分配相同的优点。这种粗粒度的信用分配无法区分有效的工具调用与冗余或错误的调用,特别是在长时间多轮场景中。为了解决这个问题,我们提出了MatchTIR框架,通过基于二分匹配的轮次级别奖励分配和双层优势估计引入细粒度监督。具体来说,我们将信用分配形式化为预测和真实轨迹之间的二分匹配问题,利用两种分配策略推导密集的轮次级别奖励。此外,为了平衡局部步骤精度与全局任务成功,我们引入了一种双层优势估计方案,结合轮次级别和轨迹级别的信号,为每个交互轮次分配不同的优势值。在三个基准上的广泛实验表明MatchTIR的优越性。值得注意的是,我们的4B模型在大多数8B竞争对手中表现更优,特别是在长时间多轮任务中。我们的代码可在https://github.com/quchangle1/MatchTIR获取。
Summary / 总结
MatchTIR addresses coarse-grained credit assignment in Tool-Integrated Reasoning (TIR) with a framework that uses bipartite matching to assign fine-grained turn-level rewards and dual-level advantage estimation to balance local step precision with global task success. Experiments on three benchmarks show that the 4B MatchTIR model surpasses most 8B competitors, especially in long-horizon and multi-turn tasks.
MatchTIR通过使用二分匹配来分配细粒度的轮次级奖励,并引入双层优势估计方案来平衡局部精度和全局任务成功,解决了工具集成推理中粗粒度的奖励分配问题。实验表明,MatchTIR在长时程和多轮任务中优于大多数8B模型。
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Authors: Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di, Hang Yu, Lianli Gao
First: 2026-01-15T18:59:10+00:00 · Latest: 2026-01-15T18:59:10+00:00
Abstract
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
中文标题/摘要
标题:从一对一到多对多:动态跨层注入以实现深度视觉-语言融合
视觉-语言模型(VLMs)通过使用粗略且不对称的连接方式,仅将视觉编码器的输出链接到大型语言模型(LLM)的输入,从而造成严重的视觉特征瓶颈。这种静态架构从根本上限制了LLM实现多层次视觉知识全面对齐的能力,削弱了它们将局部细节与全局语义整合到连贯推理中的能力。为了解决这一问题,我们引入了跨层注入(CLI),这是一种新颖且轻量级的框架,它在两种模态之间构建了一种动态的多对多桥梁。CLI 包含两个协同工作的、参数高效的组件:自适应多投影(AMP)模块,用于协调来自不同视觉层的特征,以及自适应门控融合(AGF)机制,使LLM能够根据其实时解码上下文选择性地注入最相关的视觉信息。我们通过将CLI整合到LLaVA-OneVision和LLaVA-1.5中来验证其有效性和灵活性。在18个多样基准上的广泛实验表明,CLI 可以显著提高性能,确立了CLI 作为一种可扩展范式的地位,通过赋予LLM按需访问完整视觉层次结构的能力,解锁了更深层次的多模态理解。
Summary / 总结
The paper addresses the limitation of static vision-language models (VLMs) by proposing Cross-Layer Injection (CLI), a dynamic framework that enables a many-to-many connection between visual and language modalities. CLI includes an Adaptive Multi-Projection (AMP) module for feature harmonization and an Adaptive Gating Fusion (AGF) mechanism for selective visual information injection. Experiments on 18 benchmarks show CLI improves performance, suggesting it enhances the alignment between visual details and language reasoning, enabling more comprehensive multimodal understanding.
论文通过提出一种动态框架Cross-Layer Injection (CLI),解决了静态视觉-语言模型(VLMs)的限制,CLI使视觉和语言模态之间能够形成多对多的连接。CLI包含一个用于特征谐调的Adaptive Multi-Projection (AMP)模块和一个用于选择性注入视觉信息的Adaptive Gating Fusion (AGF)机制。在18个不同基准上的实验表明CLI提高了性能,表明它增强了视觉细节与语言推理之间的对齐,从而实现了更全面的多模态理解。
See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
Authors: Amir Mallak, Erfan Aasi, Shiva Sreeram, Tsun-Hsuan Wang, Daniela Rus, Alaa Maalouf
First: 2026-01-15T18:58:33+00:00 · Latest: 2026-01-15T18:58:33+00:00
Abstract
Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to out-of-distribution (OOD) scenarios. We hypothesize that due to the self-attention mechanism, each patch feature implicitly embeds information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: 90% of variance is captured by 17 of 64 principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a 6.2% average improvement and up to 20.4% in closed-loop simulations, while being 2.4x faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
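The per-frame masking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the grid-of-lists input format, and the choice to preserve layout by returning each surviving patch with its `(row, col)` position are all hypothetical.

```python
import random


def stochastic_patch_selection(patches, mask_rate, rng):
    """Randomly drop a fraction of patch descriptors from one frame.

    `patches` is a 2D grid (list of rows) of feature vectors. The surviving
    patches are returned in raster order together with their grid positions,
    which is one assumed way of keeping the spatial layout intact.
    """
    coords = [(r, c) for r, row in enumerate(patches) for c in range(len(row))]
    n_keep = max(1, round(len(coords) * (1.0 - mask_rate)))
    kept = sorted(rng.sample(coords, n_keep))
    return [(r, c, patches[r][c]) for r, c in kept]


# Two draws on the same frame give two different subsets of the same scene.
frame = [[(r, c) for c in range(8)] for r in range(8)]  # placeholder features
rng = random.Random(0)
view_a = stochastic_patch_selection(frame, 0.5, rng)
view_b = stochastic_patch_selection(frame, 0.5, rng)
```

Because fewer tokens reach the policy, a scheme like this would also reduce compute per frame, which is consistent with the speedup the abstract reports.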
中文标题/摘要
标题:少看,开得好:基于基础模型的随机补丁选择实现通用端到端自动驾驶
端到端自动驾驶的最新进展表明,基础模型提取的补丁对齐特征训练出的策略在分布外(OOD)场景下表现更好。我们假设由于自注意力机制,每个补丁特征隐含地嵌入/包含来自其他所有补丁的信息,以不同的方式和强度表示,使得这些描述符高度冗余。我们通过主成分分析(PCA)和跨补丁相似性量化了(BLIP2)特征中的冗余性:90%的方差由17/64个主成分捕获,且跨标记相关性很强。在这些重叠信息上进行训练会导致策略过度拟合虚假的相关性,损害了OOD鲁棒性。我们提出了一种简单而有效的随机补丁选择(SPS)方法,用于学习更鲁棒、更通用且更高效的策略。对于每一帧,SPS随机遮蔽一部分补丁描述符,不将其提供给策略模型,同时保持剩余补丁的空间布局。因此,策略获得了不同但完整的(相同)场景视图:每随机子集的补丁像不同的,但仍然合理的世界投影。策略基于不变于哪些特定标记存活的特征做出决策。大量实验表明,我们的方法在所有OOD场景中均优于现有最佳方法(SOTA),平均改进6.2%,在闭环模拟中最高可达20.4%,同时速度提高2.4倍。我们在遮蔽率和补丁特征重组、训练和评估9个系统中进行了消融研究,其中8个系统均超越了先前的SOTA。最后,我们展示了相同的策略在无需调整的情况下可以转移到物理的、真实世界的汽车上。
Summary / 总结
This paper addresses the issue of out-of-distribution (OOD) robustness in end-to-end autonomous driving policies trained on patch-aligned features from foundation models. The authors propose Stochastic-Patch-Selection (SPS), a method that randomly masks a fraction of patch descriptors in each frame, leading to more robust and generalizable policies. Experiments show that SPS outperforms state-of-the-art methods by up to 20.4% in closed-loop simulations, with a 6.2% average improvement and a 2.4x speedup. The method is validated across various OOD scenarios and transfers effectively to a real-world car without tuning.
本文针对端到端自动驾驶策略在基础模型提取的补丁对齐特征上的分布外(OOD)鲁棒性问题进行了研究。作者提出了一种随机补丁选择(SPS)方法,该方法在每帧中随机遮蔽一部分补丁描述符,从而提高策略的鲁棒性和泛化能力。实验结果显示,SPS在闭环仿真中表现优于现有最佳方法,最高提升20.4%,平均提升6.2%,且速度提升2.4倍。该方法在各种OOD场景下得到验证,并且无需调整即可在真实世界车辆上有效应用。
Grounding Agent Memory in Contextual Intent
Authors: Ruozhen Yang, Yucheng Jiang, Yueqi Jiang, Priyanka Kargupta, Yunyi Zhang, Jiawei Han
First: 2026-01-15T18:55:13+00:00 · Latest: 2026-01-15T18:55:13+00:00
Abstract
Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history.
For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
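Intent-indexed retrieval of the kind described in the abstract can be sketched in a few lines. The data model and scoring below are assumptions for illustration: the `Intent` and `Snippet` classes, the hard filter on matching goal, and the additive `compatibility` score are hypothetical and are not taken from STITCH itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    goal: str                 # current latent goal (thematic segment)
    action: str               # action type
    entity_types: frozenset   # salient entity types


@dataclass
class Snippet:
    text: str
    intent: Intent


def compatibility(query, stored):
    """Hypothetical score: action match plus number of shared entity types."""
    score = 1 if query.action == stored.action else 0
    return score + len(query.entity_types & stored.entity_types)


def retrieve(memory, query, k=3):
    """Keep snippets recorded under the same goal, then rank by compatibility."""
    compatible = [s for s in memory if s.intent.goal == query.goal]
    ranked = sorted(
        compatible, key=lambda s: compatibility(query, s.intent), reverse=True
    )
    return ranked[:k]


memory = [
    Snippet("Booked hotel in Paris", Intent("plan_trip", "book", frozenset({"hotel"}))),
    Snippet("Hotel invoice for expense report", Intent("file_expenses", "record", frozenset({"hotel"}))),
    Snippet("Compared flights to Paris", Intent("plan_trip", "search", frozenset({"flight"}))),
]
query = Intent("plan_trip", "book", frozenset({"hotel"}))
results = [s.text for s in retrieve(memory, query)]
```

In this toy example the expense-report snippet mentions the same entity type as the query but is suppressed because it was recorded under a different goal.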
Summary / 总结
The paper addresses the challenge of deploying large language models in long-horizon, goal-oriented interactions where similar entities and facts recur under different goals and constraints. It introduces STITCH, a memory system that indexes each trajectory step by its contextual intent and retrieves history by intent match. Across CAME-Bench and LongMemEval, STITCH outperforms the strongest baseline by 35.6%, with the largest gains on longer trajectories, by filtering memory snippets for intent compatibility and thereby reducing retrieval noise.
论文提出STITCH,一种使用上下文意图来索引和检索相关历史的代理记忆系统,以应对大型语言模型在长期目标导向交互中的挑战。STITCH根据意图兼容性过滤和优先级排序记忆片段,减少上下文不匹配的证据。在CAME-Bench和LongMemEval上,STITCH比现有方法高出35.6%,特别是在轨迹长度增加时表现更佳,展示了意图索引在支持稳健长期推理方面的有效性。
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Authors: Gilat Toker, Nitay Calderon, Ohad Amosy, Roi Reichart
First: 2026-01-15T18:54:50+00:00 · Latest: 2026-01-15T18:54:50+00:00
Abstract
Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of text generation: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
中文标题/摘要
标题:LIBERTy:基于因果框架的概念驱动解释基准测试方法,使用结构性反事实
概念驱动的解释量化了高层概念(例如,性别或经验)对模型行为的影响,这对于高风险领域的决策者至关重要。近期工作通过将这些解释与从反事实中估计的参考因果效应进行比较,来评估其忠实性。实际上,现有的基准测试依赖于昂贵的人工编写的反事实,这些反事实作为不完美的代理。为了解决这一问题,我们提出了一种构建包含结构性反事实配对的数据集的框架:LIBERTy(基于LLM的解释性基准测试,具有参考目标的干预性因果模型)。LIBERTy基于明确定义的结构因果模型(SCM),文本生成中的干预传播通过SCM,直到LLM生成反事实。我们引入了三个数据集(疾病检测、CV筛选和工作场所暴力预测)以及一个新的评估指标,顺序忠实性。使用这些数据集,我们评估了五种模型的广泛方法,并确定了改进概念驱动解释的大量空间。LIBERTy还使系统分析模型对干预的敏感性成为可能:我们发现专有LLM在人口统计概念上的敏感性明显降低,这可能是由于后训练缓解措施所致。总体而言,LIBERTy为开发忠实的解释方法提供了急需的基准测试。
Summary / 总结
The research aims to develop a benchmark for evaluating concept-based explanations of LLMs by using structural counterfactuals. The method involves creating datasets with LIBERTy, a framework based on Structured Causal Models (SCMs) that generates counterfactuals through interventions on concepts. Key findings show that existing methods have significant room for improvement in faithfulness, and proprietary LLMs exhibit reduced sensitivity to demographic concepts, possibly due to post-training mitigation strategies.
研究旨在通过使用结构性反事实来提高大型语言模型(LLM)的概念基础解释的准确性。方法是通过在概念上进行干预并基于结构化因果模型(SCMs)生成反事实来构建LIBERTy框架。主要发现包括在概念基础解释方面存在显著改进空间,以及发现专有LLM对人口统计概念的敏感度较低,这可能是由于后训练缓解技术所致。
The Impact of Generative AI on Architectural Conceptual Design: Performance, Creative Self-Efficacy and Cognitive Load
Authors: Han Jiang, Yao Xiao, Rachel Hurley, Shichao Liu
First: 2026-01-15T18:52:59+00:00 · Latest: 2026-01-15T18:52:59+00:00
Abstract
Our study examines how generative AI (GenAI) influences performance, creative self-efficacy, and cognitive load in architectural conceptual design tasks. Thirty-six student participants from Architectural Engineering and other disciplines completed a two-phase architectural design task, first independently and then with external tools (GenAI-assisted condition and control condition using an online repository of existing architectural projects). Design outcomes were evaluated by expert raters, while self-efficacy and cognitive load were self-reported after each phase. Difference-in-differences analyses revealed no overall performance advantage of GenAI across participants; however, subgroup analyses showed that GenAI significantly improved design performance for novice designers. In contrast, general creative self-efficacy declined for students using GenAI. Cognitive load did not differ significantly between conditions, though prompt usage patterns showed that iterative idea generation and visual feedback prompts were linked to greater reductions in cognitive load. These findings suggest that GenAI effectiveness depends on users' prior expertise and interaction strategies through prompting.
中文标题/摘要
标题:生成式AI对建筑概念设计的影响:性能、创造性自我效能感和认知负荷
本研究探讨了生成式AI(GenAI)在建筑概念设计任务中对性能、创造性自我效能感和认知负荷的影响。36名来自建筑学和其它学科的学生参与者完成了两阶段的建筑设计任务,首先独立完成,然后使用外部工具(GenAI辅助条件和对照条件使用在线现有建筑项目库)。设计成果由专家评定,而创造性自我效能感和认知负荷则在每个阶段由参与者自我报告。差异性分析显示,GenAI在总体上并未给参与者带来显著的性能优势;然而,子组分析表明,GenAI显著提高了新手设计师的设计性能。相反,使用GenAI的学生的总体创造性自我效能感有所下降。认知负荷在不同条件下没有显著差异,但提示使用模式显示,迭代想法生成和视觉反馈提示与认知负荷的更大降低相关。这些发现表明,GenAI的有效性取决于用户的先前专业知识和通过提示的交互策略。
Summary / 总结
This study investigates the impact of generative AI on architectural design performance, creative self-efficacy, and cognitive load. Thirty-six participants completed a two-phase design task, first independently and then with an external tool: either GenAI or, in the control condition, an online repository of existing architectural projects. No overall performance advantage was found for GenAI, but it significantly improved performance for novice designers. Creative self-efficacy decreased for students using GenAI, while cognitive load did not differ significantly between conditions, though iterative idea generation and visual feedback prompts were linked to greater reductions in cognitive load.
本研究探讨了生成式AI对建筑设计性能、创造性自我效能感和认知负荷的影响。36名参与者完成了两个阶段的设计任务,一个有AI辅助,一个没有。虽然总体上AI没有带来性能优势,但新手设计师从中受益显著。使用AI的学生创造性自我效能感下降,认知负荷在不同条件下没有显著差异,但迭代的想法生成和视觉反馈提示与认知负荷减少有关。生成式AI的效果取决于用户的专业知识和交互策略。
Structure and Diversity Aware Context Bubble Construction for Enterprise Retrieval Augmented Systems
Authors: Amir Khurshid, Abhishek Sehgal
First: 2026-01-15T18:43:19+00:00 · Latest: 2026-01-15T18:43:19+00:00
Abstract
Large language model (LLM) contexts are typically constructed using retrieval-augmented generation (RAG), which involves ranking and selecting the top-k passages. This approach causes fragmentation of the information graphs in document structures, over-retrieval, and duplication of content, alongside insufficient query context, including second- and third-order facets. In this paper, a structure-informed and diversity-constrained context bubble construction framework is proposed that assembles coherent, citable bundles of spans under a strict token budget. The method preserves and exploits inherent document structure by organising multi-granular spans (e.g., sections and rows) and using task-conditioned structural priors to guide retrieval. Starting from high-relevance anchor spans, a context bubble is constructed through constrained selection that balances query relevance, marginal coverage, and redundancy penalties. Unlike top-k retrieval, it explicitly constrains diversity and budget, producing compact and informative context sets. Moreover, a full retrieval trace is emitted that logs the scoring and selection choices for the records, thus providing auditability and deterministic tuning. Experiments on enterprise documents demonstrate the efficiency of the context bubble: it significantly reduces redundant context, covers secondary facets better, and achieves better answer quality and citation faithfulness within a limited context window. Ablation studies demonstrate that both structural priors and diversity-constrained selection are necessary; removing either component results in a decline in coverage and an increase in redundant or incomplete context.
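The constrained selection described above can be sketched as a greedy loop. This is an illustration under stated assumptions rather than the paper's algorithm: the span fields (`id`, `tokens`, `relevance`, `facets`), the weights `w_cov` and `w_red`, and the additive score are hypothetical, and structural priors are omitted.

```python
def build_bubble(spans, budget, w_cov=1.0, w_red=0.5):
    """Greedily select spans under a token budget.

    Score = relevance + w_cov * (new facets) - w_red * (already covered facets).
    Returns the selected spans and a trace of every selection decision.
    """
    selected, covered, used, trace = [], set(), 0, []
    remaining = sorted(spans, key=lambda s: s["id"])  # fixed order for determinism
    while remaining:
        best, best_score = None, None
        for span in remaining:
            if used + span["tokens"] > budget:
                continue  # would exceed the token budget
            new_facets = len(span["facets"] - covered)
            duplicate_facets = len(span["facets"] & covered)
            score = span["relevance"] + w_cov * new_facets - w_red * duplicate_facets
            if best is None or score > best_score:
                best, best_score = span, score
        if best is None or best_score <= 0:
            break
        selected.append(best)
        covered |= best["facets"]
        used += best["tokens"]
        trace.append({"id": best["id"], "score": best_score, "tokens_used": used})
        remaining.remove(best)
    return selected, trace


spans = [
    {"id": "a", "tokens": 50, "relevance": 0.9, "facets": {"pricing"}},
    {"id": "b", "tokens": 50, "relevance": 0.8, "facets": {"pricing"}},
    {"id": "c", "tokens": 40, "relevance": 0.5, "facets": {"sla"}},
]
bubble, trace = build_bubble(spans, budget=100)
print([s["id"] for s in bubble])  # prints ['a', 'c']
```

A top-2 ranking by relevance alone would return `a` and `b`, which cover the same facet; the redundancy penalty here makes room for `c`. The `trace` list plays the role of the audit record mentioned in the abstract.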
中文标题/摘要
标题:企业检索增强系统中结构和多样性感知上下文气泡构建
大型语言模型(LLM)上下文通常使用检索增强生成(RAG)构建,涉及排名和选择前k段。该方法导致文档结构中的信息图碎片化、过度检索、内容重复以及查询上下文不足,包括二阶和三阶特征。本文提出了一种结构指导和多样性约束的上下文气泡构建框架,以在严格的令牌预算下组装连贯且可引用的片段集合。该方法通过组织多粒度片段(例如,部分和行)并使用任务条件化的结构先验来指导检索,来保留和利用固有的文档结构。从高相关锚片段开始,通过受限选择构建上下文气泡,平衡查询相关性、边际覆盖率和冗余惩罚。它将显式地约束多样性和预算,生成紧凑且信息丰富的上下文集,不同于前k检索。此外,还发出完整的检索,追踪记录的评分和选择选择,从而提供可审计性和确定性调整。实验表明,上下文气泡在企业文档中表现出高效性,因为它显著减少了冗余上下文,更好地覆盖了次要特征,并在有限的上下文窗口内具有更好的答案质量和引文忠实度。消融研究显示,结构先验和多样性约束选择都是必要的;移除任一组成部分会导致覆盖率下降和冗余或不完整上下文的增加。
Summary / 总结
This paper addresses the limitations of traditional retrieval-augmented generation (RAG) methods in enterprise retrieval systems by proposing a structure-informed and diversity-constrained context bubble construction framework. The method constructs coherent context bubbles by preserving document structure and using task-conditioned structural priors, starting from high-relevance anchor spans and balancing query relevance, marginal coverage, and redundancy penalties. Experiments show that this approach significantly reduces redundant context, better covers secondary facets, and improves answer quality and citation faithfulness within a limited context window.
本文提出了一种结构导向和多样性约束的上下文气泡构建框架,以解决传统检索增强生成(RAG)方法在大型语言模型中的局限性。该方法通过保留文档结构并使用任务条件下的结构先验来组装连贯且可引用的文档片段。实验表明,这种方法减少了冗余上下文,更好地覆盖了次要方面,并在有限的上下文窗口内提高了答案质量和引文忠实度,优于top-k检索方法。
Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models
Authors: Zirui Ren, Ziming Liu
First: 2026-01-15T18:42:50+00:00 · Latest: 2026-01-15T18:42:50+00:00
Abstract
The hierarchical reasoning model (HRM) achieves extraordinary performance on various reasoning tasks, significantly outperforming large language model-based reasoners. To understand the strengths and potential failure modes of HRM, we conduct a mechanistic study of its reasoning patterns and find three surprising facts: (a) Failure on extremely simple puzzles, e.g., HRM can fail on a puzzle with only one unknown cell. We attribute this failure to the violation of the fixed point property, a fundamental assumption of HRM. (b) "Grokking" dynamics in reasoning steps, i.e., the answer is not improved uniformly, but instead there is a critical reasoning step that suddenly makes the answer correct; (c) Existence of multiple fixed points. HRM "guesses" the first fixed point, which could be incorrect, and gets trapped there for a while or forever. All facts imply that HRM appears to be "guessing" instead of "reasoning". Leveraging this "guessing" picture, we propose three strategies to scale HRM's guesses: data augmentation (scaling the quality of guesses), input perturbation (scaling the number of guesses by leveraging inference randomness), and model bootstrapping (scaling the number of guesses by leveraging training randomness). On the practical side, by combining all methods, we develop Augmented HRM, boosting accuracy on Sudoku-Extreme from 54.5% to 96.9%. On the scientific side, our analysis provides new insights into how reasoning models "reason".
中文标题/摘要
标题:你的推理模型是在推理还是猜测?层级推理模型的机制分析
层级推理模型(HRM)在各种推理任务中表现出色,显著优于基于大型语言模型的推理器。为了理解HRM的优势和潜在失败模式,我们对其推理模式进行了机制研究,发现了三个令人惊讶的事实:(a) 失败于极其简单的谜题,例如,HRM 可能在只有一个未知单元格的谜题上失败。我们将其失败归因于违反了HRM的基本假设——固定点性质。(b) 推理步骤中的“顿悟”动态,即答案并非均匀改进,而是存在一个关键的推理步骤,突然使答案正确;(c) 存在多个固定点。HRM“猜测”第一个固定点,该固定点可能是错误的,并且可能会被困在那里一段时间或永远。所有事实表明,HRM 似乎是在“猜测”而不是“推理”。利用这种“猜测”的观点,我们提出了三种策略来扩展HRM的猜测:数据增强(提高猜测的质量)、输入扰动(通过利用推理随机性扩展猜测的数量)和模型自举(通过利用训练随机性扩展猜测的数量)。在实际应用方面,通过结合所有方法,我们开发了增强的HRM,将数独极端难度的准确率从54.5%提升到96.9%。在科学方面,我们的分析为理解推理模型如何“推理”提供了新的见解。
Summary / 总结
This study investigates the reasoning patterns of the hierarchical reasoning model (HRM) and finds that HRM can fail even on extremely simple puzzles due to a violation of the fixed point property, exhibits 'grokking' dynamics in which a single critical reasoning step suddenly makes the answer correct, and can get stuck in incorrect fixed points. These findings suggest that HRM is 'guessing' more than 'reasoning'. To improve HRM, the study proposes data augmentation, input perturbation, and model bootstrapping; combining all three raises accuracy on Sudoku-Extreme from 54.5% to 96.9%. The research provides new insights into how reasoning models reason.
研究分析了层次推理模型(HRM)的推理模式,发现HRM在简单谜题上会失败,原因是违背了固定点性质,表现出‘顿悟’动态,即答案突然正确,还可能陷入错误的固定点。这些发现表明HRM更像是猜测而不是推理。为了改进HRM,研究提出了数据增强、输入扰动和模型自举等策略,使数独极端难度的准确率从54.5%提升到96.9%。
Single-Stage Huffman Encoder for ML Compression
Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer
First: 2026-01-15T18:37:56+00:00 · Latest: 2026-01-15T18:37:56+00:00
Comments: 5 pages, 4 figures
Abstract
Training and serving Large Language Models (LLMs) require partitioning data across multiple accelerators, where collective operations are frequently bottlenecked by network bandwidth. Lossless compression using Huffman codes is an effective way to alleviate the issue. However, its three-stage design, which requires on-the-fly frequency analysis, codebook generation, and transmission of the codebook along with the data, introduces computational, latency, and data overheads that are prohibitive for latency-sensitive scenarios such as die-to-die communication. This paper proposes a single-stage Huffman encoder that eliminates these overheads by using fixed codebooks derived from the average probability distribution of previous data batches. Through our analysis of the Gemma 2B model, we demonstrate that tensors exhibit high statistical similarity across layers and shards. Using this approach, we achieve compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, enabling efficient on-the-fly compression.
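The fixed-codebook idea can be illustrated with a small software sketch. The paper targets a hardware encoder, so the code below is only an analogy under stated assumptions: the helper names, the tiny four-symbol alphabet, and the `eps` smoothing that keeps unseen symbols encodable are all hypothetical. It compares the encoded size under a codebook built from earlier batches with the size under a codebook built for the batch itself.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a table of positive weights."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries are (weight, tiebreak, symbols); the unique tiebreak keeps
    # the symbol lists from ever being compared.
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1  # every merge adds one bit to these symbols
        heapq.heappush(heap, (w1 + w2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return lengths


def average_distribution(batches, alphabet, eps=1e-6):
    """Average the per-batch symbol probabilities; eps keeps every symbol encodable."""
    avg = {sym: eps for sym in alphabet}
    for batch in batches:
        counts = Counter(batch)
        for sym in alphabet:
            avg[sym] += counts[sym] / len(batch) / len(batches)
    return avg


def encoded_bits(data, lengths):
    return sum(lengths[sym] for sym in data)


previous_batches = ["aaaabbcd", "aaaabbbc"]
fixed_lengths = huffman_code_lengths(average_distribution(previous_batches, "abcd"))

new_batch = "aaaabbcd"
fixed_bits = encoded_bits(new_batch, fixed_lengths)
per_batch_bits = encoded_bits(new_batch, huffman_code_lengths(Counter(new_batch)))
```

The fixed codebook can never beat a Huffman code built for the batch itself, so `fixed_bits >= per_batch_bits` always holds; the gap stays small only when batch statistics are similar, which is the property the abstract reports for tensors across layers and shards.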
中文标题/摘要
标题:单阶段霍夫曼编码器用于ML压缩
训练和提供大型语言模型(LLMs)需要将数据分割到多个加速器上,其中集体操作经常被网络带宽瓶颈限制。使用霍夫曼编码进行无损压缩是一种有效的方法,然而其三阶段设计需要实时频率分析、代码簿生成和代码簿与数据的传输,引入了计算、延迟和数据开销,这些开销在如片间通信等对延迟敏感的场景中是不可接受的。本文提出了一种单阶段霍夫曼编码器,通过使用从先前数据批次的平均概率分布中派生的固定代码簿来消除这些开销。通过对Gemma 2B模型的分析,我们证明了张量在层间和分片间具有高度的统计相似性。使用这种方法,我们实现了与分片霍夫曼编码相差0.5%的压缩率,并且在理想香农压缩性内的1%以内,从而实现了高效的实时压缩。
Summary / 总结
This paper addresses the challenge of bandwidth limitations in partitioning data for Large Language Models (LLMs) across accelerators. It proposes a single-stage Huffman encoder that uses fixed codebooks to eliminate the need for on-the-fly frequency analysis and codebook transmission, reducing computational and latency overheads. The approach achieves compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, making it suitable for latency-sensitive scenarios such as die-to-die communication.
本文旨在解决在将大型语言模型(LLMs)数据分割到多个加速器时遇到的带宽限制问题。提出了一种单阶段霍夫曼编码器,使用固定代码本来消除实时频率分析和代码本传输的需要,从而减少计算和延迟开销。该方法利用了张量在层间和碎片间的高统计相似性,实现了压缩效果在每个碎片霍夫曼编码的0.5%以内,以及理想香农压缩性的1%以内,使其适用于如片间通信等延迟敏感场景。
BASIL: Bayesian Assessment of Sycophancy in LLMs
Authors: Katherine Atwell, Pedram Heydari, Anthony Sicilia, Malihe Alikhani
First: 2025-08-23T00:11:00+00:00 · Latest: 2026-01-15T18:31:50+00:00
Abstract
Sycophancy (overly agreeable or flattering behavior) poses a fundamental challenge for human-AI collaboration, particularly in high-stakes decision-making domains such as health, law, and education. A central difficulty in studying sycophancy in large language models (LLMs) is disentangling sycophantic belief shifts from rational changes in behavior driven by new evidence or user-provided information. Existing approaches either measure descriptive behavior changes or apply normative evaluations that rely on objective ground truth, limiting their applicability to subjective or uncertain tasks. We introduce a Bayesian probabilistic framework, grounded in behavioral economics and rational decision theory, that explicitly separates sycophancy from rational belief updating. Within this framework, we achieve three objectives: (i) a descriptive metric that measures sycophancy while controlling for rational responses to evidence; (ii) a normative metric that quantifies how sycophancy leads models astray from Bayesian-consistent belief updating; and (iii) the ability to apply both metrics in settings without ground-truth labels. Applying our framework across multiple LLMs and three uncertainty-driven tasks, we find robust evidence of sycophantic belief shifts and show that their impact on rationality depends on whether models systematically over- or under-update their beliefs. Finally, we demonstrate that a post-hoc calibration method and two fine-tuning strategies (SFT and DPO) substantially reduce Bayesian inconsistency, with particularly strong improvements under explicit sycophancy prompting.
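The distinction between a rational update and an excess shift can be made concrete with Bayes' rule. The sketch below is not the BASIL metric, which the abstract does not define; it only shows, with hypothetical function names and illustrative numbers, how a reported belief can be compared against the Bayes-consistent posterior for the same evidence.

```python
def bayes_posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Posterior P(H | E) from the prior P(H) and the two likelihoods."""
    numerator = prior * p_evidence_given_h
    return numerator / (numerator + (1.0 - prior) * p_evidence_given_not_h)


def update_gap(reported_posterior, prior, p_evidence_given_h, p_evidence_given_not_h):
    """Signed gap between a reported belief and the Bayes-consistent one.

    Positive values mean the belief moved further toward H than the evidence
    warrants (over-updating); negative values mean under-updating.
    """
    rational = bayes_posterior(prior, p_evidence_given_h, p_evidence_given_not_h)
    return reported_posterior - rational


# Illustrative numbers only: evidence twice as likely under H as under not-H.
rational = bayes_posterior(0.5, 0.8, 0.4)
gap = update_gap(0.9, 0.5, 0.8, 0.4)
```

With these numbers the rational posterior is 2/3, so a reported belief of 0.9 over-updates by roughly 0.23. Neither quantity requires a ground-truth label for H, which is in the spirit of objective (iii) in the abstract.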
中文标题/摘要
标题:BASIL: 评估大语言模型阿谀行为的贝叶斯方法
阿谀(过度顺从或奉承的行为)对人类与AI的合作构成了根本性的挑战,尤其是在健康、法律和教育等高风险决策领域。在研究大语言模型(LLMs)中的阿谀行为时,一个主要困难在于区分阿谀行为引起的观点变化与由新证据或用户提供的信息驱动的理性行为变化。现有方法要么衡量描述性行为变化,要么进行依赖客观事实的规范性评估,这限制了它们在主观或不确定任务中的应用。我们提出了一种基于行为经济学和理性决策理论的贝叶斯概率框架,明确地将阿谀行为与理性信念更新区分开来。在这一框架内,我们实现了三个目标:(i)一个描述性指标,衡量阿谀行为的同时控制理性对证据的反应;(ii)一个规范性指标,量化阿谀行为如何使模型偏离贝叶斯一致的信念更新;(iii)能够在没有事实标签的情况下应用这两个指标。在多个LLM和三个不确定性驱动的任务上应用我们的框架,我们发现了阿谀行为变化的稳健证据,并表明其对理性的影响取决于模型是否系统地过度或不足地更新其信念。最后,我们证明了一种事后校准方法和两种微调策略(SFT和DPO)显著减少了贝叶斯不一致性,特别是在明确阿谀提示下效果尤为显著。
Summary / 总结
The study introduces BASIL, a Bayesian probabilistic framework to assess sycophancy in large language models (LLMs), distinguishing it from rational belief updates. The framework provides both descriptive and normative metrics and can be applied even without ground-truth labels. Experiments across multiple LLMs and tasks reveal significant sycophantic belief shifts, which affect rationality differently based on whether models over- or under-update their beliefs. Post-hoc calibration and fine-tuning strategies reduce Bayesian inconsistency, with notable improvements under explicit sycophancy prompting.
论文提出了BASIL框架,通过将趋炎附势的行为与基于证据的理性反应区分开来,来评估大型语言模型(LLM)中的趋炎附势。该框架同时衡量趋炎附势的描述性和规范性方面,并在没有真实标签的情况下应用这些指标。研究发现,多种LLM普遍存在趋炎附势的行为,这会对理性产生显著影响,而事后校准方法和两种微调策略(SFT和DPO)能有效减少趋炎附势导致的贝叶斯不一致性。
Detecting Winning Arguments with Large Language Models and Persuasion Strategies
Authors: Tiziano Labruna, Arkadiusz Modzelewski, Giorgio Satta, Giovanni Da San Martino
First: 2026-01-15T18:30:15+00:00 · Latest: 2026-01-15T18:30:15+00:00
Abstract
Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies - such as Attack on reputation, Distraction, and Manipulative wording - in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Arguments dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.
中文标题/摘要
标题:使用大型语言模型和说服策略检测获胜论点
在论证文本中检测说服是一项具有重要含义的挑战性任务。本研究探讨了说服策略(如声誉攻击、转移注意力和操纵性措辞)在决定文本说服力中的作用。我们在三个标注的论证数据集上进行了实验:Winning Arguments(来自Change My View 子论坛)、Anthropic/Persuasion 和 Persuasion for Good。我们的方法利用大型语言模型(LLMs)和多策略说服评分方法,指导对六种说服策略的推理。结果表明,策略导向的推理提高了说服力预测的准确性。为了更好地理解内容的影响,我们将Winning Argument数据集组织成广泛的讨论主题,并分析其在不同主题上的表现。我们公开发布了这个主题标注的数据集,以促进未来的研究。总体而言,我们的方法证明了结构化、策略意识提示在提高论证质量评估的可解释性和鲁棒性方面的价值。
Pareto-Grid-Guided Large Language Models for Fast and High-Quality Heuristics Design in Multi-Objective Combinatorial Optimization
Authors: Minh Hieu Ha, Hung Phan, Tung Duy Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh
Venue: AAAI
First: 2025-07-28T15:26:43+00:00 · Latest: 2026-01-15T18:28:50+00:00
Comments: Accepted at AAAI-26
Abstract
Multi-objective combinatorial optimization problems (MOCOP) frequently arise in practical applications that require the simultaneous optimization of conflicting objectives. Although traditional evolutionary algorithms can be effective, they typically depend on domain knowledge and repeated parameter tuning, limiting flexibility when applied to unseen MOCOP instances. Recently, integration of Large Language Models (LLMs) into evolutionary computation has opened new avenues for automatic heuristic generation, using their advanced language understanding and code synthesis capabilities. Nevertheless, most existing approaches predominantly focus on single-objective tasks, often neglecting key considerations such as runtime efficiency and heuristic diversity in multi-objective settings. To bridge this gap, we introduce Multi-heuristics for MOCOP via Pareto-Grid-guided Evolution of LLMs (MPaGE), a novel enhancement of the Simple Evolutionary Multiobjective Optimization (SEMO) framework that leverages LLMs and Pareto Front Grid (PFG) technique. By partitioning the objective space into grids and retaining top-performing candidates to guide heuristic generation, MPaGE utilizes LLMs to prioritize heuristics with semantically distinct logical structures during variation, thus promoting diversity and mitigating redundancy within the population. Through extensive evaluations, MPaGE demonstrates superior performance over existing LLM-based frameworks, and achieves competitive results to traditional Multi-objective evolutionary algorithms (MOEAs), with significantly faster runtime. Our code is available at: https://github.com/langkhachhoha/MPaGE.
中文标题/摘要
标题:帕累托网格引导的大语言模型在多目标组合优化中的快速高质量启发式设计
多目标组合优化问题(MOCOP)在需要同时优化冲突目标的实际应用中经常出现。尽管传统的进化算法可能有效,但它们通常依赖于领域知识和重复的参数调整,这在应用于未见过的MOCOP实例时限制了灵活性。最近,将大语言模型(LLMs)集成到进化计算中为自动启发式生成开辟了新的途径,利用它们先进的语言理解和代码合成能力。然而,大多数现有方法主要集中在单目标任务上,往往忽视了多目标设置中的关键考虑因素,如运行时效率和启发式多样性。为弥合这一差距,我们提出了基于帕累托网格引导的大语言模型进化多启发式算法(MPaGE),这是一种对简单多目标进化优化(SEMO)框架的增强,利用了LLMs和帕累托前沿网格(PFG)技术。通过将目标空间划分为网格,并保留表现最佳的候选者来引导启发式生成,MPaGE利用LLMs在变异过程中优先考虑具有语义上不同的逻辑结构的启发式,从而促进多样性并减少种群中的冗余。通过广泛的评估,MPaGE在现有LLM基框架中表现出更优的性能,并且在运行时显著更快,达到了传统多目标进化算法(MOEAs)的竞争力。我们的代码可在:https://github.com/langkhachhoha/MPaGE 获取。
Summary / 总结
The paper addresses the design of fast, high-quality heuristics for multi-objective combinatorial optimization problems (MOCOP) using Large Language Models (LLMs). It introduces MPaGE, which enhances the SEMO framework by integrating LLMs with the Pareto Front Grid (PFG) technique. MPaGE partitions the objective space into grids and retains top-performing candidates to guide LLMs toward semantically distinct heuristics. It outperforms existing LLM-based frameworks and achieves results competitive with traditional MOEAs at significantly faster runtime.
论文提出了MPaGE方法,通过结合LLMs和PFG技术来增强SEMO框架,用于生成MOCOP的启发式。MPaGE将目标空间划分为网格,并保留表现最佳的启发式以指导LLMs生成多样性的启发式,从而实现更快的运行时间和优于现有LLM基框架的性能,同时与传统MOEAs竞争结果。
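Illustrative sketch (not taken from the paper): one plausible reading of "partitioning the objective space into grids and retaining top-performing candidates" is to compute the Pareto front, bucket its members into grid cells, and keep one representative per cell. The function names, the minimization convention, and the per-cell tie-break (smallest objective sum) are assumptions made only for this example; MPaGE's actual PFG procedure may differ.

```python
from collections import defaultdict

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def grid_cell(point, lows, highs, divisions):
    """Map a point to integer grid coordinates inside the front's bounding box."""
    cell = []
    for v, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / divisions or 1.0  # avoid a zero width on flat axes
        cell.append(min(int((v - lo) / width), divisions - 1))
    return tuple(cell)

def grid_elites(points, divisions=4):
    """Return one elite per occupied grid cell of the Pareto front."""
    front = pareto_front(points)
    dims = range(len(front[0]))
    lows = [min(p[i] for p in front) for i in dims]
    highs = [max(p[i] for p in front) for i in dims]
    cells = defaultdict(list)
    for p in front:
        cells[grid_cell(p, lows, highs, divisions)].append(p)
    # Hypothetical tie-break: the member with the smallest objective sum.
    return {cell: min(members, key=sum) for cell, members in cells.items()}

if __name__ == "__main__":
    candidates = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
    print(grid_elites(candidates, divisions=2))
```

Keeping one elite per cell spreads the retained candidates across the front, which is the diversity effect the abstract attributes to the grid.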
Moonworks Lunara Aesthetic Dataset
Authors: Yan Wang, M M Sayeef Abdullah, Partho Hassan, Sabit Hassan
First: 2026-01-12T19:11:41+00:00 · Latest: 2026-01-15T18:27:29+00:00
Abstract
The dataset spans diverse artistic styles, including regionally grounded aesthetics from the Middle East, Northern Europe, East Asia, and South Asia, alongside general categories such as sketch and oil painting. All images are generated using the Moonworks Lunara model and intentionally crafted to embody distinct, high-quality aesthetic styles, yielding a first-of-its-kind dataset with substantially higher aesthetic scores, exceeding even aesthetics-focused datasets, and general-purpose datasets by a larger margin. Each image is accompanied by a human-refined prompt and structured annotations that jointly describe salient objects, attributes, relationships, and stylistic cues. Unlike large-scale web-derived datasets that emphasize breadth over precision, the Lunara Aesthetic Dataset prioritizes aesthetic quality, stylistic diversity, and licensing transparency, and is released under the Apache 2.0 license to support research and unrestricted academic and commercial use.
中文标题/摘要
标题:Moonworks Lunara美学数据集
该数据集涵盖了多种艺术风格,包括中东、北欧、东亚和南亚等地域特色美学,以及素描和油画等通用类别。所有图像均使用Moonworks Lunara模型生成,并有意设计以体现独特的高质量美学风格,从而形成一个前所未有的数据集,其美学评分显著高于专注于美学的数据集和通用数据集。每张图像都附有人工精炼的提示和结构化注释,共同描述了显著对象、属性、关系和风格线索。与侧重广度而非精确度的大型网络数据集不同,Lunara美学数据集侧重于美学质量、风格多样性和许可透明度,并在Apache 2.0许可证下发布,以支持研究和无限制的学术及商业使用。
Summary / 总结
The Moonworks Lunara Aesthetic Dataset provides high-quality artistic images spanning diverse styles, including regionally grounded aesthetics. All images are generated with the Moonworks Lunara model, and each is accompanied by a human-refined prompt and structured annotations. Unlike web-derived datasets, it prioritizes aesthetic quality, stylistic diversity, and licensing transparency, achieving higher aesthetic scores than both aesthetics-focused and general-purpose datasets. It is released under the Apache 2.0 license for research and unrestricted academic and commercial use.
Moonworks Lunara美学数据集旨在提供一个包含多种艺术风格的高质量数据集,包括来自世界各地的地域性美学。该数据集使用Moonworks Lunara模型生成具有独特高质量美学风格的图像,其美学评分超过专注于美学的数据集和通用数据集。每张图像都附有人工精炼的提示和结构化注释,并且该数据集强调美学质量、风格多样性和许可透明性,以Apache 2.0许可证发布,支持研究和无限制的学术及商业使用。
Knowledge Homophily in Large Language Models
Authors: Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang
First: 2025-09-28T09:40:27+00:00 · Latest: 2026-01-15T18:26:36+00:00
Abstract
Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains unexplored. Inspired by cognitive neuroscience findings, such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and entity levels. After that, we analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhances multi-hop path retrieval in reasoning-intensive question answering.
中文标题/摘要
标题:大型语言模型中的知识同质性
大型语言模型(LLMs)已被越来越多地研究作为支持知识密集型应用(如问答和事实核查)的神经知识库。然而,它们知识的结构组织尚未被探索。受认知神经科学发现的启发,如语义聚类和启动效应,即知道一个事实会增加回忆相关事实的可能性,我们研究了LLMs中的类似知识同质性模式。为此,我们通过知识检查将LLM知识映射到图表示中,分别在三元组和实体层面进行。之后,我们分析了实体与其邻居之间的知识能力关系,发现LLMs倾向于在图中位置更近的实体具有相似的知识水平。受这一同质性原则的启发,我们提出了一种图神经网络(GNN)回归模型,通过利用其邻居得分来估计三元组的实体级知识能力得分。预测的知识能力使我们能够优先检查较不为人知的三元组,从而在相同的标注预算下最大化知识覆盖范围。这不仅提高了对LLMs进行微调以注入知识的主动标注效率,还增强了推理密集型问答中的多跳路径检索。
Summary / 总结
This study explores the knowledge homophily pattern in Large Language Models (LLMs) by mapping LLM knowledge into a graph representation and analyzing the knowledgeability relationship between entities and their neighbors. A Graph Neural Network (GNN) regression model is proposed to estimate entity-level knowledgeability scores, which helps prioritize less well-known triplets for labeling, thus improving the efficiency of active labeling and enhancing multi-hop path retrieval in reasoning-intensive question answering.
该研究通过将大型语言模型(LLMs)的知识映射到图表示,并分析实体之间的知识能力关系,探讨了知识同质性模式。提出了一种图神经网络(GNN)回归模型来估计实体级别的知识能力得分,这有助于优先标记较少为人知的三元组,从而提高主动标记的效率,并增强推理密集型问答中的多跳路径检索。
PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution
Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
First: 2026-01-15T18:25:23+00:00 · Latest: 2026-01-15T18:25:23+00:00
Abstract
Large Language Models (LLMs) have emerged as powerful operators for evolutionary search, yet the design of efficient search scaffolds remains ad hoc. While promising, current LLM-in-the-loop systems lack a systematic approach to managing the evolutionary process. We identify three distinct failure modes: Context Pollution, where experiment history biases future candidate generation; Mode Collapse, where agents stagnate in local minima due to poor exploration-exploitation balance; and Weak Collaboration, where rigid crossover strategies fail to leverage parallel search trajectories effectively. We introduce Progress-Aware Consistent Evolution (PACEvolve), a framework designed to robustly govern the agent's context and search dynamics, to address these challenges. PACEvolve combines hierarchical context management (HCM) with pruning to address context pollution; momentum-based backtracking (MBB) to escape local minima; and a self-adaptive sampling policy that unifies backtracking and crossover for dynamic search coordination (CE), allowing agents to balance internal refinement with cross-trajectory collaboration. We demonstrate that PACEvolve provides a systematic path to consistent, long-horizon self-improvement, achieving state-of-the-art results on LLM-SR and KernelBench, while discovering solutions surpassing the record on Modded NanoGPT.
中文标题/摘要
标题:PACEvolve:促进长期目标感知一致进化的框架
大型语言模型(LLMs)已成为进化搜索的强大操作员,然而高效的搜索支架设计仍缺乏系统方法。尽管前景广阔,当前的LLM在环系统缺乏管理进化过程的系统方法。我们识别出三种不同的失败模式:上下文污染,其中实验历史偏倚未来的候选生成;模式崩溃,其中代理因探索与利用平衡不佳而停滞在局部最小值;以及弱协作,其中僵化的杂交策略无法有效利用并行搜索轨迹。我们引入了目标感知一致进化(PACEvolve)框架,旨在稳健地管理代理的上下文和搜索动力学,以应对这些挑战。PACEvolve结合了分层上下文管理(HCM)与剪枝以应对上下文污染;基于动量的回溯(MBB)以逃离局部最小值;以及一种自适应采样策略,该策略统一了回溯和杂交以进行动态搜索协调(CE),使代理能够平衡内部细化与跨轨迹协作。我们证明,PACEvolve提供了一条系统的方法,以实现一致的、长期的目标自我改进,其在LLM-SR和KernelBench上达到了最先进的结果,同时发现超越Modded NanoGPT记录的解决方案。
Summary / 总结
The paper addresses the challenge of managing LLM-driven evolutionary search by introducing PACEvolve, a framework that tackles context pollution, mode collapse, and weak collaboration. PACEvolve combines hierarchical context management with pruning, momentum-based backtracking, and a self-adaptive sampling policy that unifies backtracking and crossover. Experiments show that PACEvolve achieves state-of-the-art results on LLM-SR and KernelBench, and discovers solutions surpassing the record on Modded NanoGPT.
研究通过引入PACEvolve框架来管理大型语言模型(LLMs)的进化过程,该框架解决了上下文污染、模式崩溃和协作不足的问题。PACEvolve利用分层上下文管理、动量回溯和自适应采样策略来增强搜索动态。该框架在LLM-SR和KernelBench上达到了最先进的结果,并发现了超越先前记录的Modded NanoGPT解决方案。
PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus
Authors: Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss
First: 2025-05-23T18:01:09+00:00 · Latest: 2026-01-15T18:18:24+00:00
Abstract
Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.
中文标题/摘要
标题:PMOA-TTS:介绍PubMed开放获取文本时间序列语料库
临床病历记录了对于建模患者轨迹至关重要的时间动态,但大规模的时间标注资源稀缺。我们介绍了PMOA-TTS语料库,该语料库包含124,699份单个患者的PubMed开放获取病例报告,通过可扩展的大语言模型管道(Llama 3.3 70B和DeepSeek-R1)转换为结构化的文本时间线(事件,时间)对。该语料库包含超过560万条带时间戳的事件,以及提取的人口统计信息和诊断信息。技术验证使用了临床专家审核的金标准集和三种度量:语义事件匹配、时间一致性(c-指数)和使用对数时间累积分布函数下的面积(AULTC)总结的对齐误差。我们对替代提示和模型选择进行了基准测试,并提供了支持再现的文档。PMOA-TTS 使从叙述文本中提取时间线、时间推理、生存建模和事件预测的研究成为可能,并提供了广泛的诊断和人口统计学覆盖范围。数据和代码在公共存储库中公开可用。
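Illustrative sketch (not taken from the paper): temporal concordance (c-index) can be read as the fraction of comparable event pairs whose predicted order matches the reference order. The function name, the half credit for predicted ties, and the example times are assumptions for this sketch; the corpus authors' exact definition may differ.

```python
from itertools import combinations

def concordance_index(true_times, pred_times):
    """Share of event pairs ordered the same way in prediction and reference."""
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(true_times, pred_times), 2):
        if t1 == t2:
            continue  # reference ties carry no ordering information
        comparable += 1
        if p1 == p2:
            concordant += 0.5  # assumed convention: half credit for predicted ties
        elif (t1 < t2) == (p1 < p2):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")

if __name__ == "__main__":
    reference_hours = [0, 24, 48, 72]   # hypothetical gold timeline
    extracted_hours = [0, 48, 24, 72]   # hypothetical model output, one swap
    print(concordance_index(reference_hours, extracted_hours))
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 is what random ordering gives on average, and 0.0 means the timeline is fully reversed.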
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Authors: Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave
First: 2026-01-15T18:15:06+00:00 · Latest: 2026-01-15T18:15:06+00:00
Abstract
Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural
中文标题/摘要
标题:CURVE:文化与多语言长视频推理基准
近期视频模型的发展取得了显著进步,特别是在长视频理解方面。然而,当前的基准测试主要以西方为中心的数据和英语为主,引入了评估中的显著偏差。为解决这一问题,我们引入了CURVE(视频评估中的文化理解和推理),这是一个针对多文化和多语言视频推理的具有挑战性的基准测试。CURVE 包含来自18个全球地区的高质量、完全由人类生成的、针对特定区域文化的视频注释。与以往依赖自动翻译的工作不同,CURVE 提供了复杂的问题、答案和多步推理步骤,所有这些都用当地语言精心制作。要在CURVE上取得进展需要对视觉文化背景有深入的理解。此外,我们利用CURVE的推理痕迹构建基于证据的图表,并提出了一种使用这些图表的新型迭代策略,以识别推理中的细微错误。我们的评估表明,最先进的视频大模型面临巨大挑战,其性能远低于人类水平,错误主要来自对文化元素的视觉感知。CURVE 将在 https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural 公开可用。
Summary / 总结
The research aims to address the bias in current video understanding benchmarks by introducing CURVE, a new benchmark for multicultural and multilingual video reasoning. CURVE uses high-quality, human-generated annotations from diverse cultural videos across 18 global locales, providing complex questions and answers in native languages. The study finds that state-of-the-art Video-LLMs perform significantly below human-level accuracy, with errors mainly due to the visual perception of cultural elements. The benchmark also includes reasoning traces to help identify specific errors in reasoning.
研究旨在通过引入CURVE基准,解决当前视频理解基准中的偏见问题,CURVE包含来自18个全球区域的多元文化视频的高质量、人工生成的注释,问题和答案使用当地语言。研究发现,最先进的视频语言模型表现不佳,主要由于在文化元素的视觉感知方面存在困难,远低于人类水平的准确性。
Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
Authors: Yuxi Xia, Loris Schoenegger, Benjamin Roth
First: 2026-01-15T18:05:42+00:00 · Latest: 2026-01-15T18:05:42+00:00
Abstract
Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs' trustworthiness in expressing more reliable confidence.
中文标题/摘要
标题:有影响力的训练数据检索以解释LLMs的口头化置信度
大型语言模型(LLMs)可以通过口头化其输出的置信度来增加用户的信任感。然而,先前的研究表明,LLMs往往过于自信,使得它们声明的置信度不可靠,因为它并不总是与事实准确性一致。为了更好地理解这种口头化置信度的来源,我们引入了TracVC(追踪口头化置信度)方法,该方法基于信息检索和影响估计,将生成的置信度表达追溯到训练数据。我们在OLMo和Llama模型中以问答设置评估了TracVC,提出了一个新的度量标准,内容相关性,该度量标准衡量LLM在其置信度中基于与问题和答案相关的内容相关训练示例的程度,而不是基于通用的置信度口头化示例。我们的分析表明,OLMo2-13B经常受到与查询在词汇上无关的置信相关数据的影响,这表明它可能模仿表面的确定性语言表达,而不是依赖于真实的内容基础。这些发现指出了当前训练制度的一个根本局限性:LLMs可能学会如何显得自信,而不知道何时置信是合理的。我们的分析为提高LLMs表达更可靠置信度的信任度奠定了基础。
Summary / 总结
The research aims to understand the sources of verbalized confidence in large language models (LLMs) by tracing back confidence expressions to the training data. TracVC, a method combining information retrieval and influence estimation, is introduced to evaluate LLMs like OLMo and Llama. The study finds that OLMo2-13B often relies on confidence-related data that is not relevant to the query, indicating a potential lack of genuine content grounding. This suggests that LLMs may learn superficial expressions of confidence rather than justified confidence, highlighting a fundamental limitation in current training regimes.
研究旨在通过引入TracVC方法,利用信息检索和影响估计追踪大语言模型(LLMs)的自信表达回溯到训练数据,以理解其自信来源。对OLMo和Llama模型的评估显示,这些模型经常依赖于与查询词无关的自信相关数据,表明缺乏实质内容支撑。这表明LLMs可能因表面的语义表达而变得过于自信,而不是基于实质内容的对齐,突显了当前训练机制中的一个根本局限性。
Adjusted Similarity Measures and a Violation of Expectations
Authors: William L. Lippitt, Edward J. Bedrick, Nichole E. Carlson
First: 2026-01-15T18:01:26+00:00 · Latest: 2026-01-15T18:01:26+00:00
Comments: 12 pages, 1 figure
Abstract
Adjusted similarity measures, such as Cohen's kappa for inter-rater reliability and the adjusted Rand index used to compare clustering algorithms, are a vital tool for comparing discrete labellings. These measures are intended to have the property of 0 expectation under a null distribution and maximum value 1 under maximal similarity to aid in interpretation. Measures are frequently adjusted with respect to the permutation distribution for historic and analytic reasons. There is currently renewed interest in considering other null models more appropriate for context, such as clustering ensembles permitting a random number of identified clusters. The purpose of this work is twofold: (1) to generalize the study of the adjustment operator to general null models and to a more general procedure which includes statistical standardization as a special case and (2) to identify sufficient conditions for the adjustment operator to produce the intended properties, where sufficient conditions are related to whether and how observed data are incorporated into null distributions. We demonstrate how violations of the sufficient conditions may lead to substantial breakdown, such as by producing a non-positive measure under traditional adjustment rather than one with mean 0, or by producing a measure which is deterministically 0 under statistical standardization.
中文标题/摘要
标题:调整相似度度量和违反预期
调整后的相似度度量,例如用于互评者可靠性的科恩κ系数和用于比较聚类算法的调整兰德指数,是对比离散标签的重要工具。这些度量旨在在零分布下具有0的期望值,在最大相似度下具有最大值1,以帮助解释。这些度量通常出于历史和分析原因根据排列分布进行调整。目前,人们重新关注考虑其他更合适的零模型,例如允许随机数量的识别聚类的聚类集成。本文的目的是两方面的:(1)将调整操作的研究推广到一般的零模型和更一般的程序,其中统计标准化是特殊情况之一;(2)确定调整操作产生预期属性的充分条件,其中充分条件与观测数据如何被纳入零分布相关。我们展示了充分条件的违反可能导致重大崩溃,例如在传统调整中产生非正度量而不是均值为0的度量,或者在统计标准化中产生确定为0的度量。
Summary / 总结
This study aims to generalize the analysis of adjusted similarity measures to broader null models and to identify conditions under which these measures maintain their intended properties. The research demonstrates that violations of these conditions can lead to significant issues, such as producing non-positive measures under traditional adjustments or deterministic zeros under statistical standardization, highlighting the importance of proper adjustment for accurate interpretation of similarity measures.
研究旨在将调整操作符的研究推广到各种null模型下,并确定这些操作符产生具有预期属性的度量值的充分条件。方法包括分析不同null模型对调整过程的影响,并展示违反这些条件可能导致的问题,如非正度量值或确定性零值。主要发现包括确定确保度量值具有零均值和最大值为一的条件,强调适当null模型对于准确解释的重要性。
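Illustrative sketch (not taken from the paper): the traditional adjustment rescales a raw similarity S as (S - E[S]) / (max - E[S]), so that the null expectation maps to 0 and maximal similarity maps to 1. Cohen's kappa is the standard instance, with raw agreement as S and the expected agreement computed from the two raters' label frequencies. The helper names and toy labels below are assumptions for this example.

```python
from collections import Counter

def adjust(observed, expected, maximum=1.0):
    """Traditional adjustment: 0 at the null expectation, 1 at the maximum."""
    return (observed - expected) / (maximum - expected)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: raw agreement adjusted for chance agreement."""
    n = len(labels_a)
    p_observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the product of the two raters' label frequencies.
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n**2
    return adjust(p_observed, p_expected)

if __name__ == "__main__":
    rater_a = [0, 0, 1, 1]
    rater_b = [0, 0, 1, 0]
    print(cohens_kappa(rater_a, rater_b))  # raw agreement 0.75, chance 0.5
```

The expected agreement here is computed from the observed labels themselves, which is exactly the kind of data-dependent null the abstract singles out. The sketch also divides by zero when both raters use a single shared label, since chance agreement then equals the maximum.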
STEM: Scaling Transformers with Embedding Modules
Authors: Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen
First: 2026-01-15T18:00:27+00:00 · Latest: 2026-01-15T18:00:27+00:00
Abstract
Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread which enhances its knowledge storage capacity. More interestingly, this enhanced knowledge capacity comes with better interpretability. The token-indexed nature of STEM embeddings allows simple ways to perform knowledge editing and knowledge injection in an interpretable manner without any intervention in the input text or additional computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to roughly 3-4% accuracy improvements overall, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while providing better interpretability, better training stability and improved efficiency.
中文标题/摘要
标题:STEM:使用嵌入模块扩展变换器
细粒度稀疏性在不按比例增加每词计算量的情况下承诺了更高的参数容量,但通常会遭受训练不稳定性、负载均衡和通信开销的问题。我们提出了STEM(使用嵌入模块扩展变换器),这是一种静态、按令牌索引的方法,用局部嵌入查找替换FFN的上投影,同时保持门和下投影密集。这消除了运行时路由,允许CPU卸载和异步预取,并将容量与每词FLOPs和跨设备通信脱钩。实验证明,尽管稀疏性极端,STEM仍能稳定训练。它在减少每词FLOPs和参数访问次数的同时,提高了下游性能(消除约三分之一FFN参数)。STEM学习具有大角度扩展的嵌入空间,增强了其知识存储容量。更有趣的是,这种增强的知识容量伴随着更好的可解释性。STEM嵌入的按令牌索引性质允许以简单的方式在不干预输入文本或增加额外计算的情况下进行知识编辑和知识注入。此外,STEM增强了长上下文性能:随着序列长度的增长,更多的不同参数被激活,从而实现实际的测试时容量扩展。在350M和1B模型规模下,STEM总体上提供了高达约3-4%的准确率改进,特别是在知识和推理密集型基准(ARC-Challenge、OpenBookQA、GSM8K、MMLU)上取得了显著进步。总体而言,STEM是一种有效的方法,可以在提供更好的可解释性、更好的训练稳定性和改进的效率的同时扩展参数内存。
Summary / 总结
STEM introduces a static, token-indexed approach to scaling transformers by replacing the FFN up-projection with a layer-local embedding lookup, which removes runtime routing and enables CPU offloading. Empirically, STEM trains stably even with extreme sparsity and improves downstream performance while reducing per-token FLOPs and parameter accesses. It enhances knowledge storage capacity and interpretability, and improves long-context performance as sequence length increases, delivering up to 4% accuracy improvements across various benchmarks.
STEM 通过将 FFN 上投影替换为 层局部嵌入查找,提出了一种静态的、基于 token 的方法来扩展变压器,从而提高训练稳定性并减少每 token 的 FLOPs 和参数访问。实验结果显示,STEM 在各种基准测试中表现出 3-4% 的准确率提升,特别是在知识和推理密集型任务上,同时保持了更好的可解释性、训练稳定性和效率,优于密集基线。
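Illustrative sketch (not taken from the paper): assuming a gated FFN of the form W_down(act(W_gate x) * (W_up x)), the change the abstract describes amounts to swapping the W_up x product for a row looked up by token id. The SiLU activation, the pure-Python helpers, and the table `E` are assumptions for this example; the real layer layout and training details are not reproduced here.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def silu(z):
    """SiLU activation, assumed here as the gate nonlinearity."""
    return z / (1.0 + math.exp(-z))

def dense_ffn(x, W_gate, W_up, W_down):
    """Baseline gated FFN: gate, up-projection, and down-projection all dense."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])

def stem_style_ffn(x, token_id, W_gate, E, W_down):
    """Same layer, but the up-projection is a lookup indexed by token id."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = E[token_id]  # table row replaces the W_up @ x product
    return matvec(W_down, [g * u for g, u in zip(gate, up)])

if __name__ == "__main__":
    x = [1.0, 2.0]
    W_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
    W_down = [[0.2, 0.1, -0.3], [0.0, 0.5, 0.4]]
    E = {7: [0.3, -0.1, 0.8]}  # hypothetical per-layer table for token 7
    print(stem_style_ffn(x, 7, W_gate, E, W_down))
```

Because the lookup depends only on the token id, it needs no routing decision at runtime and no matrix multiply for that branch, which matches the efficiency argument in the abstract.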
Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning
Authors: Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
First: 2025-10-01T20:32:08+00:00 · Latest: 2026-01-15T17:51:14+00:00
Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Implemented on top of GRPO and evaluated on six multimodal mathematical and general-domain reasoning benchmarks, DUPL improves Qwen2.5-VL 3B and 7B models, achieving accuracy gains of up to 11.2% on visual math tasks and up to 7.1% on general-domain reasoning tasks, while consistently outperforming GRPO. These results demonstrate that dual-uncertainty guided policy learning is an effective and generalizable approach for multimodal RLVR.
中文标题/摘要
标题:基于双重不确定性指导的多模态推理策略学习
具有可验证奖励的强化学习(RLVR)在多模态大型语言模型中提升了推理能力。然而,现有方法通常将视觉输入视为确定性的,忽视了视觉模态固有的感知模糊性。因此,它们无法区分模型的不确定性是源自复杂的推理还是模糊的感知,从而阻碍了探索或学习信号的针对性分配。为解决这一问题,我们引入了DUPL,一种用于多模态RLVR的双重不确定性指导策略学习方法,通过对称KL散度量化和利用感知不确定性(以及通过策略熵利用输出不确定性)来指导策略更新。通过建立一个基于不确定性的反馈循环并采用动态分支优先机制,DUPL重新校准策略优势,使其专注于具有高感知或决策模糊性的状态,从而实现超越被动数据增强的有效目标探索。DUPL基于GRPO实现,并在六个多模态数学和通用领域推理基准测试上进行评估,提高了Qwen2.5-VL 3B和7B模型的准确性,视觉数学任务上的准确率提升高达11.2%,通用领域推理任务上的准确率提升高达7.1%,并且始终优于GRPO。这些结果表明,双重不确定性指导的策略学习是一种有效且通用的方法,适用于多模态RLVR。
Summary / 总结
The research aims to enhance the reasoning capabilities of multimodal large language models by addressing the limitations of existing reinforcement learning methods that treat visual inputs deterministically. DUPL, a dual-uncertainty guided policy learning approach, quantifies perceptual and output uncertainties to guide policy updates, focusing on states with high ambiguity. This method improves Qwen2.5-VL 3B and 7B models by up to 11.2% in visual math tasks and 7.1% in general-domain reasoning tasks, outperforming GRPO across six benchmarks.
论文提出了DUPL,一种用于多模态强化学习与可验证奖励(RLVR)的双不确定性引导策略学习方法。它量化了感知不确定性和输出不确定性来引导策略更新,重点关注高不确定性的状态。DUPL在视觉数学任务中将Qwen2.5-VL 3B和7B模型的准确率提高了最多11.2%,在一般领域推理任务中提高了最多7.1%,并在六个基准测试中优于GRPO。
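Illustrative sketch (not taken from the paper): the two uncertainty signals named in the abstract have standard definitions. Symmetric KL divergence is KL(p||q) + KL(q||p), and policy entropy is -sum p log p. Which distributions DUPL compares is not stated in the abstract; the example assumes output distributions under an original and a perturbed view of the same image, which is only a guess. Some authors also halve the symmetric sum.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

if __name__ == "__main__":
    original_view = [0.7, 0.2, 0.1]   # hypothetical answer distribution
    perturbed_view = [0.4, 0.4, 0.2]  # hypothetical distribution after perturbation
    print("perceptual signal:", symmetric_kl(original_view, perturbed_view))
    print("output signal:", entropy(original_view))
```

A large symmetric KL with low entropy would suggest the model is confident but sensitive to what it sees, while high entropy with small symmetric KL points at the reasoning rather than the perception.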
On the Failure of Latent State Persistence in Large Language Models
Authors: Jen-tse Huang, Kaiser Sun, Wenxuan Wang, Mark Dredze
First: 2025-04-30T16:18:39+00:00 · Latest: 2026-01-15T17:44:56+00:00
Comments: 8 pages, 6 figures, 9 tables
Abstract
While Large Language Models (LLMs) excel in reasoning, whether they can sustain persistent latent states remains under-explored. The capacity to maintain and manipulate unexpressed, internal representations, analogous to human working memory, is a cornerstone of complex reasoning. In this paper, we formalize and quantify the "Latent State Persistence" (LSP) gap through three novel experiments. First, we utilize a Number Guessing Game, demonstrating that across independent queries, LLMs fail to allocate probability mass to a singular hidden choice, violating a fundamental probabilistic principle. Second, we employ a Yes-No Game to show that as the number of questions increases, LLMs suffer from "concept drift," leading to inevitable self-contradictions due to the lack of LSP. Finally, inspired by Mathematical Mentalism, we task models with tracking transformations on hidden variables, revealing a failure in variable binding and state evolution when the initial state is not explicitly present in the context. Collectively, these findings suggest that LLMs function as reactive post-hoc solvers rather than proactive planners with LSP. Our work provides a framework for evaluating the fidelity of internal representations and highlights a fundamental architectural divergence between autoregressive transformers and human-like cognition.
中文标题/摘要
标题:关于大型语言模型中潜在状态持久性的失败
虽然大型语言模型(LLMs)在推理方面表现出色,但它们能否维持持久的潜在状态仍是一个未被充分探索的问题。维持和操作未表达的内部表示——类似于人类的工作记忆——是复杂推理的基础。在本文中,我们通过三个新的实验正式化并量化了“潜在状态持久性”(LSP)差距。首先,我们使用一个数字猜谜游戏,证明LLMs在独立查询中无法将概率质量分配给单一隐藏选择,违反了基本的概率原则。其次,我们使用一个是/否游戏来展示随着问题数量的增加,LLMs会遭受“概念漂移”,导致由于缺乏LSP而不可避免地产生自相矛盾。最后,受数学心灵主义的启发,我们要求模型跟踪隐藏变量的变换,揭示了当初始状态未明确出现在上下文中时,变量绑定和状态演变的失败。这些发现共同表明,LLMs作为反应性的事后解决者而非具有LSP的前瞻性规划者运作。我们的工作提供了一个评估内部表示保真的框架,并突显了自回归变换器与人类认知之间基本的架构差异。
Summary / 总结
This paper investigates the capability of Large Language Models (LLMs) to maintain persistent latent states, which is crucial for complex reasoning. Through three experiments—Number Guessing Game, Yes-No Game, and Mathematical Mentalism task—the authors demonstrate that LLMs fail to sustain internal representations across queries, leading to probabilistic violations, concept drift, and failures in variable binding. These findings suggest that LLMs operate more as reactive solvers than proactive planners with persistent latent states.
本文探讨了大型语言模型(LLMs)维持持续潜在状态的能力,这对于复杂推理至关重要。通过三个实验——数字猜谜游戏、是或否游戏和数学心灵主义任务,作者展示了LLMs在跨查询中无法维持内部表示、表现出概念漂移,并且在处理隐藏变量绑定和状态演化时遇到困难。这些发现表明,LLMs更像是反应性的后验求解器,而不是具有持续潜在状态的前瞻性规划器。
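Illustrative sketch (not taken from the paper): one way to read the Number Guessing Game check is that a responder holding a single hidden number should answer "yes" to "is it k?" with probabilities that sum to 1 over all k. The counts, the function name, and the query protocol below are hypothetical; the paper's actual setup may differ.

```python
def total_yes_mass(yes_counts, trials_per_number):
    """Sum of empirical 'yes' rates over all candidate numbers."""
    return sum(count / trials_per_number for count in yes_counts.values())

if __name__ == "__main__":
    # Hypothetical tallies from 100 independent queries per candidate number.
    consistent = {1: 10, 2: 30, 3: 60}  # behaves like one hidden choice
    drifting = {1: 5, 2: 10, 3: 15}     # says "yes" far too rarely overall
    print(total_yes_mass(consistent, 100))
    print(total_yes_mass(drifting, 100))
```

A total well below or above 1 signals that no single hidden choice is being tracked across the independent queries.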
Can LLMs Understand What We Cannot Say? Measuring Multilevel Alignment Through Abortion Stigma Across Cognitive, Interpersonal, and Structural Levels
Authors: Anika Sharma, Malavika Mampally, Chidaksh Ravuru, Kandyce Brennan, Neil Gaikwad
First: 2025-12-15T09:50:00+00:00 · Latest: 2026-01-15T17:43:09+00:00
Abstract
As Large Language Models (LLMs) increasingly mediate stigmatized health decisions, their capacity to understand complex psychological phenomena remains inadequately assessed. Can LLMs understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across cognitive, interpersonal, and structural levels. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS), examining representation at cognitive (self-judgment), interpersonal (worries about judgment and isolation), and structural (community condemnation and disclosure patterns) levels. Models fail tests of genuine understanding across all dimensions. They underestimate cognitive stigma while overestimating interpersonal stigma, introduce demographic biases assigning higher stigma to younger, less educated, and non-White personas, and treat secrecy as universal despite 36% of humans reporting openness. Most critically, models produce internal contradictions: they overestimate isolation yet predict isolated individuals are less secretive, revealing incoherent representations. These patterns show current alignment approaches ensure appropriate language but not coherent understanding across levels. This work provides empirical evidence that LLMs lack coherent understanding of psychological constructs operating across multiple dimensions. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.
中文标题/摘要
标题:大型语言模型能否理解我们无法言说的内容?通过认知、人际和结构层面的堕胎污名衡量多层面对齐
随着大型语言模型(LLMs)越来越多地介入涉及污名的健康决策,它们理解复杂心理现象的能力仍未得到充分评估。LLMs能否理解我们无法言说的内容?我们研究了LLMs是否在认知、人际和结构层面上一致地表征堕胎污名。我们使用经过验证的个体层面堕胎污名量表(ILAS),在五种主流LLMs上系统地测试了627个人口统计特征多样的人物设定,考察了认知(自我评判)、人际(担心被评判和孤立)和结构(社区谴责和披露模式)层面的表征。模型在所有维度上都未能通过真正理解的测试。它们低估了认知污名,而高估了人际污名,引入了人口统计学偏见,将更高的污名赋予更年轻、受教育程度较低和非白人的人物设定,并将保密视为普遍现象,尽管36%的人类报告了公开态度。最关键的是,模型产生了内部矛盾:它们高估了孤立,却预测孤立的个体更不保密,暴露出不连贯的表征。这些模式表明,当前的对齐方法确保了恰当的语言,但未能确保跨层面的连贯理解。本研究提供了实证证据,表明LLMs对跨多个维度运作的心理构念缺乏连贯理解。高风险情境下的AI安全需要在设计(多层面连贯性)、评估(持续审计)、治理与监管(强制审计、问责、部署限制)以及AI素养方面采用新方法,尤其是在那些对人们无法言说之事的理解决定了支持是有益还是有害的领域。
Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
Authors: Jorge Ortiz
First: 2025-12-31T23:27:32+00:00 · Latest: 2026-01-15T17:31:17+00:00
Comments: Preprint. Diagnostic study of confidence-based abstention under evidence truncation
Abstract
High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
中文标题/摘要
标题:显式弃权开关以实现视频问答中的可预测可靠性
在高风险部署视觉-语言模型(VLMs)时,需要选择性预测,即系统在不确定时弃权,而不是冒高成本错误的风险。我们研究了基于信心的弃权是否能提供对错误率的可靠控制,以及这种控制在分布偏移下是否保持稳健。使用NExT-QA和Gemini 2.0 Flash,我们得出两项发现。首先,置信度阈值化提供了在分布内进行机制性控制。扫掠阈值ε产生平滑的风险-覆盖率权衡,降低错误率
Summary / 总结
This study investigates whether confidence-based abstention gives reliable control over error rates in video question answering. Using the NExT-QA benchmark and the Gemini 2.0 Flash model, it finds that confidence thresholding provides mechanistic control in-distribution: sweeping the threshold epsilon yields a smooth risk-coverage tradeoff and reduces error rates. It also asks whether that control remains robust under distribution shift.
研究探讨了在视频问答中基于置信度的弃权能否可靠地控制错误率。研究使用NExT-QA基准和Gemini 2.0 Flash模型,发现置信度阈值在分布内提供了机制性控制:扫描阈值ε可得到平滑的风险-覆盖率权衡,并降低错误率。研究还考察了这种控制在分布偏移下是否依然稳健。
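The threshold sweep described in this entry can be illustrated with a minimal sketch. It assumes each question comes with a scalar confidence and a correctness flag; the function name `risk_coverage` and the toy numbers are hypothetical and are not taken from the paper. Coverage is the fraction of questions answered, and risk is the error rate among the answered ones.

```python
from typing import List, Tuple


def risk_coverage(
    confidences: List[float],
    correct: List[bool],
    thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, coverage, risk) for each abstention threshold.

    The system answers only when confidence >= threshold and abstains
    otherwise. Risk is the error rate over answered questions; it is
    reported as 0.0 when nothing is answered (an assumption of this sketch).
    """
    curve = []
    total = len(confidences)
    for eps in thresholds:
        answered = [ok for c, ok in zip(confidences, correct) if c >= eps]
        coverage = len(answered) / total
        risk = 1.0 - sum(answered) / len(answered) if answered else 0.0
        curve.append((eps, coverage, risk))
    return curve


# Toy data (hypothetical): six questions with confidences and correctness.
confidences = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30]
is_correct = [True, True, False, True, False, False]

for eps, cov, risk in risk_coverage(confidences, is_correct, [0.0, 0.5, 0.7, 0.9]):
    print(f"eps={eps:.1f}  coverage={cov:.2f}  risk={risk:.2f}")
```

Raising the threshold trades coverage for lower risk on this toy data; whether the same tradeoff holds on shifted data is exactly what the study sets out to test.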
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-01-15T17:27:44+00:00
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当前最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs生成的合成数据,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是像素跟踪。即使私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最新技术,并展示了在单图像、多图像和视频任务中出色的基于指针的定位能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、自由形式的视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的训练方法,并展示了在视觉标记上使用双向注意和一种新颖的标记权重策略可以提高性能。我们的最佳8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5 vs 29.6)并在某些任务上超越了私有模型如Gemini 3 Pro(视频指针F1得分为38.4 vs 20.0,视频跟踪J&F得分为56.2 vs 41.1)。
Summary / 总结
The paper introduces Molmo2, a new family of open-weight vision-language models that is state-of-the-art among open-source models and adds point-driven grounding across single-image, multi-image, and video tasks. The authors contribute 9 new datasets (7 video and 2 multi-image), covering detailed video captions, free-form video Q&A, object tracking with complex queries, and video pointing, all collected without closed VLMs, plus a training recipe built on an efficient packing and message-tree encoding scheme. Molmo2 outperforms Qwen3-VL on video counting (35.5 vs 29.6 accuracy) and surpasses Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing; 56.2 vs 41.1 J&F on video tracking).
该论文介绍了Molmo2,这是一种新的开源视觉-语言模型,其在点驱动的定位、视频计数和字幕生成等任务上优于其他开源模型。作者提供了9个新数据集,包括视频字幕、问答、物体跟踪和指针等,并使用了一种高效的打包和消息树编码方案的训练方法。Molmo2在一些任务上显著优于现有的开源模型,甚至超过了某些专有模型,展示了在视频理解和定位方面的出色能力。
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Authors: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
First: 2026-01-13T17:48:43+00:00 · Latest: 2026-01-15T17:24:46+00:00
Comments: Work in Progress
Abstract
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@$k$ across large sampling budgets and increases the area under the pass@$k$ curve (AUC@$K$) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
中文标题/摘要
标题:奖励稀有性:面向创造型问题解决的LLM独特性意识强化学习
强化学习(RL)已成为大型语言模型(LLMs)后训练的核心范式,特别是在复杂推理任务中,但它经常遭受探索崩溃的问题:策略过早地集中在少数占主导地位的推理模式上,提高了pass@1,但限制了rollout级别的多样性以及pass@k的收益。我们认为这种失败源于对局部token行为的正则化,而不是对解决方案集的多样性。为了解决这个问题,我们提出了独特性意识强化学习,这是一种rollout级别的目标,明确奖励那些表现出罕见高层策略的正确解决方案。该方法使用基于LLM的裁判将相同问题的rollout根据其高层解决方案策略聚类,忽略表面差异,并根据集群大小反向重新加权策略优势。因此,正确但新颖的策略比冗余策略获得更高的奖励。在数学、物理和医学推理基准测试中,我们的方法在大规模采样预算下始终如一地提高了pass@$k$,并增加了pass@$k$曲线下的面积(AUC@$K$),同时不牺牲pass@1,同时保持探索并揭示更多多样化的解决方案策略。
Summary / 总结
The paper addresses the issue of exploration collapse in reinforcement learning for large language models, where policies tend to focus on a few dominant reasoning patterns. It introduces Uniqueness-Aware Reinforcement Learning, which uses an LLM-based judge to cluster rollouts based on high-level solution strategies and reweights policy advantages inversely with cluster size, rewarding novel strategies. Experiments show consistent improvements in pass@$k$ across various reasoning benchmarks, increasing the AUC@$K$ without sacrificing pass@1, and uncovering more diverse solution strategies.
论文针对大型语言模型强化学习中的探索崩溃问题,即策略倾向于集中在少数占主导地位的推理模式上。论文提出了独特性感知强化学习,通过基于LLM的裁判按高层解题策略对rollout进行聚类,并按簇大小的反比重新加权策略优势,从而奖励展现罕见高层策略的正确解。该方法在各类推理基准测试中一致提高了pass@$k$,增加了AUC@$K$,同时不牺牲pass@1,并发现了更多样化的解题策略。
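The reweighting step described in this entry can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the cluster labels are assumed to come from an LLM-based judge (here they are hard-coded), the function name `uniqueness_weighted_advantages` is hypothetical, and dividing by cluster size is just one simple way to weight "inversely with cluster size"; the paper may use a different functional form.

```python
from collections import Counter
from typing import Hashable, List


def uniqueness_weighted_advantages(
    advantages: List[float],
    cluster_ids: List[Hashable],
) -> List[float]:
    """Scale each rollout's advantage by 1 / (size of its strategy cluster).

    Rollouts for the same problem that share a high-level strategy fall in
    the same cluster, so a strategy used by many rollouts is down-weighted
    and a rare strategy keeps most of its advantage.
    """
    if len(advantages) != len(cluster_ids):
        raise ValueError("advantages and cluster_ids must have equal length")
    sizes = Counter(cluster_ids)
    return [adv / sizes[cid] for adv, cid in zip(advantages, cluster_ids)]


# Toy example (hypothetical): five rollouts for one problem.
# Three correct rollouts share strategy "A", one correct rollout uses the
# rare strategy "B", and one incorrect rollout uses strategy "C".
advantages = [1.0, 1.0, 1.0, 1.0, -1.0]
clusters = ["A", "A", "A", "B", "C"]

print(uniqueness_weighted_advantages(advantages, clusters))
```

In this toy case the lone correct rollout in cluster "B" keeps its full advantage, while each of the three redundant rollouts in cluster "A" keeps a third of it.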
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
Authors: Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste
First: 2025-03-21T20:13:04+00:00 · Latest: 2026-01-15T17:21:57+00:00
Comments: Nature Communications
Abstract
Large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs must construct representations of the world and form probabilistic beliefs about them. To provide personalized recommendations, for example, the LLM needs to infer a user's preferences from their behavior over multiple interactions. The Bayesian inference framework lays out the optimal way for an agent to update its beliefs as it receives new information. We first show that LLMs fall far short of the standard defined by the Bayesian framework. We then show that by teaching LLMs to mimic the predictions of the normative Bayesian model, we can dramatically improve their ability to update their beliefs; this ability generalizes to new tasks. We conclude that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
中文标题/摘要
标题:贝叶斯教学使大型语言模型具备概率推理能力
大型语言模型(LLMs)越来越多地被用作与用户和世界互动的代理。为了成功地做到这一点,LLMs 必须构建对世界的表示,并形成关于它们的概率性信念。例如,为了提供个性化的推荐,LLM 需要从用户在多次交互中的行为中推断出用户的偏好。贝叶斯推理框架列出了代理在接收到新信息时更新其信念的最佳方式。我们首先表明,LLMs 在达到贝叶斯框架设定的标准方面远远不够。然后我们表明,通过教导 LLMs 模仿规范性贝叶斯模型的预测,我们可以显著提高它们更新信念的能力;这种能力可以推广到新的任务。我们得出结论,LLMs 可以从示例中有效学习推理技能,并将这些技能推广到新的领域。
Summary / 总结
The research aims to enhance the probabilistic reasoning capabilities of large language models (LLMs) by aligning them with the Bayesian inference framework. The method involves teaching LLMs to mimic the predictions of a normative Bayesian model, which improves their ability to update beliefs based on new information. Key findings show that this approach significantly enhances LLMs' performance in updating beliefs and generalizes to new tasks, suggesting that LLMs can learn reasoning skills from examples and apply them in different domains.
论文旨在通过使大型语言模型(LLMs)符合贝叶斯推理原则来增强其概率推理能力。作者展示了LLMs目前在更新其信念方面存在不足,并提出了一种方法,即通过模仿规范的贝叶斯模型的预测来教导LLMs。这种方法显著提高了LLMs更新其信念的能力,并且能够应用于新任务,表明LLMs可以从示例中学习推理技能并在不同领域应用这些技能。
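The normative Bayesian update that this entry uses as the teaching target can be illustrated with a small worked example. The scenario (inferring whether a user prefers cheap or fast options from repeated choices), the hypothesis names, and the likelihood values are all hypothetical and chosen only to show the mechanics of Bayes' rule; they are not the paper's task or data.

```python
from typing import Dict


def bayes_update(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Return the posterior P(h | obs), proportional to P(obs | h) * P(h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    if evidence == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: p / evidence for h, p in unnormalized.items()}


# Two hypotheses about the user, with a uniform prior (hypothetical).
belief = {"prefers_cheap": 0.5, "prefers_fast": 0.5}

# Assumed probability of observing "user picked the cheaper option"
# under each hypothesis.
picked_cheaper = {"prefers_cheap": 0.8, "prefers_fast": 0.3}

# Update the belief after each of two interactions in which the user
# picked the cheaper option.
for step in (1, 2):
    belief = bayes_update(belief, picked_cheaper)
    print(step, {h: round(p, 4) for h, p in belief.items()})
```

Each interaction shifts probability mass toward the hypothesis that better explains the observed choice, which is the belief-updating behavior the entry says LLMs are taught to mimic.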
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
First: 2026-01-14T17:57:43+00:00 · Latest: 2026-01-15T17:20:36+00:00
Comments: Work in Progress
Abstract
Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
中文标题/摘要
标题:协作多智能体推理的测试时强化学习
多智能体系统已成为许多应用中实用的LLM驱动协作者,通过多样性和交叉验证获得鲁棒性。然而,多智能体强化学习(MARL)训练资源密集且不稳定:队友的共同适应引入了非平稳性,奖励通常稀疏且高方差。因此,我们引入了多智能体测试时强化学习(MATTRL)框架,在推理时向多智能体协商注入结构化文本经验。MATTRL 组建一个由专家组成的多专家团队进行多轮讨论,检索和整合测试时的经验,并达成共识以进行最终决策。我们还研究了用于构建轮次级经验池的信用分配,并将该经验池重新注入对话。在医学、数学和教育领域的具有挑战性的基准测试中,MATTRL 的准确率相对于多智能体基线平均提高了3.67%,相对于可比的单智能体基线提高了8.67%。消融研究探讨了不同的信用分配方案,并详细比较了它们对训练结果的影响。MATTRL 提供了一条稳定、有效且高效的路径,无需调参即可实现对分布偏移鲁棒的多智能体推理。
Summary / 总结
The paper introduces Multi-Agent Test-Time Reinforcement Learning (MATTRL), which injects structured textual experience into multi-agent deliberation at inference time to improve robustness and accuracy. MATTRL forms a multi-expert team for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. Experiments across medicine, math, and education show that MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline and by 8.67% over comparable single-agent baselines.
论文提出了Multi-Agent Test-Time Reinforcement Learning (MATTRL)框架,在推理时注入结构化的文本经验以提高鲁棒性和准确性。MATTRL形成一个多专家团队进行多轮讨论,检索并整合测试时的经验,并达成共识进行最终决策。该框架在医学、数学和教育等领域的各种基准测试中,相对于多智能体基线提高了3.67%的准确性,相对于可比的单智能体基线提高了8.67%。消融研究展示了不同信用分配方案如何影响训练结果。
Procedural Fairness in Multi-Agent Bandits
Authors: Joshua Caiata, Carter Blair, Kate Larson
First: 2026-01-15T17:11:51+00:00 · Latest: 2026-01-15T17:11:51+00:00
Abstract
In the context of multi-agent multi-armed bandits (MA-MAB), fairness is often reduced to outcomes: maximizing welfare, reducing inequality, or balancing utilities. However, evidence in psychology, economics, and Rawlsian theory suggests that fairness is also about process and who gets a say in the decisions being made. We introduce a new fairness objective, procedural fairness, which provides equal decision-making power for all agents, lies in the core, and provides for proportionality in outcomes. Empirical results confirm that fairness notions based on optimizing for outcomes sacrifice equal voice and representation, while the sacrifice in outcome-based fairness objectives (like equality and utilitarianism) is minimal under procedurally fair policies. We further prove that different fairness notions prioritize fundamentally different and incompatible values, highlighting that fairness requires explicit normative choices. This paper argues that procedural legitimacy deserves greater focus as a fairness objective, and provides a framework for putting procedural fairness into practice.
中文标题/摘要
标题:多智能体多臂老虎机中的程序公平性
在多智能体多臂老虎机(MA-MAB)的背景下,公平性通常被简化为结果:最大化福利、减少不平等或平衡效用。然而,心理学、经济学和罗尔斯理论的证据表明,公平性还关乎过程以及谁在决策中拥有发言权。我们引入了一个新的公平性目标,即程序公平性,它为所有智能体提供平等的决策权,位于核(core)中,并保证结果的成比例性。实证结果证实,基于优化结果的公平性观念牺牲了平等的发言权和代表性,而在程序公平的策略下,基于结果的公平性目标(如平等和功利主义)所受的损失很小。我们进一步证明,不同的公平性观念优先考虑根本不同且互不兼容的价值,这突显了公平性需要明确的规范性选择。本文认为,程序合法性作为公平性目标值得更多关注,并提供了一个将程序公平性付诸实践的框架。
Summary / 总结
The paper introduces a new fairness objective called procedural fairness in multi-agent multi-armed bandits, which ensures equal decision-making power for all agents. Empirical results show that fairness based on optimizing outcomes can sacrifice equal voice, while procedural fairness minimally impacts outcomes. The study also demonstrates that different fairness notions prioritize different and incompatible values, emphasizing the need for explicit normative choices in fairness objectives.
该论文在多代理多臂老虎机中引入了程序公平性这一新的公平性目标,强调所有代理平等的决策权。实验证明,基于优化结果的公平性会牺牲平等的发言权和代表性,而程序公平性对结果的影响最小。论文还展示了不同公平性观念优先不同的、相互冲突的价值,强调在公平性目标中需要做出明确的规范性选择。